├── .DS_Store ├── README.md ├── data ├── .DS_Store ├── category.json ├── labeldata │ ├── extract.py │ ├── labeldata.json │ └── patch │ │ └── example.json ├── papers │ ├── labels │ │ ├── DBMS_testing.md │ │ ├── IR_code_model.md │ │ ├── PL_design_for_LLMs.md │ │ ├── agent_design.md │ │ ├── benchmark.md │ │ ├── binary_code_model.md │ │ ├── bug_detection.md │ │ ├── bug_reproduction.md │ │ ├── call_graph_analysis.md │ │ ├── code_completion.md │ │ ├── code_generation.md │ │ ├── code_model.md │ │ ├── code_model_robustness.md │ │ ├── code_model_security.md │ │ ├── code_model_training.md │ │ ├── code_review.md │ │ ├── code_search.md │ │ ├── code_similarity_analysis.md │ │ ├── code_summarization.md │ │ ├── commit_message_generation.md │ │ ├── compiler_testing.md │ │ ├── data-flow_analysis.md │ │ ├── debugging.md │ │ ├── differential_testing.md │ │ ├── documentation_generation.md │ │ ├── empirical_study.md │ │ ├── equivalence_checking.md │ │ ├── fuzzing.md │ │ ├── general_coding_task.md │ │ ├── general_testing.md │ │ ├── hallucination_in_reasoning.md │ │ ├── library_testing.md │ │ ├── mutation_testing.md │ │ ├── planning.md │ │ ├── pointer_analysis.md │ │ ├── program_decompilation.md │ │ ├── program_optimization.md │ │ ├── program_repair.md │ │ ├── program_synthesis.md │ │ ├── program_testing.md │ │ ├── program_transformation.md │ │ ├── program_verification.md │ │ ├── prompt_strategy.md │ │ ├── protocol_fuzzing.md │ │ ├── reason_with_code.md │ │ ├── retrieval-augmented_generation.md │ │ ├── sampling_and_ranking.md │ │ ├── software_composition_analysis.md │ │ ├── software_configuration.md │ │ ├── software_maintenance_and_deployment.md │ │ ├── source_code_model.md │ │ ├── specification_inference.md │ │ ├── static_analysis.md │ │ ├── survey.md │ │ ├── symbolic_execution.md │ │ ├── syntactic_analysis.md │ │ ├── system_log_analysis.md │ │ ├── type_inference.md │ │ ├── unit_testing.md │ │ └── vulnerability_exploitation.md │ └── venues │ │ ├── AAAI2024 │ │ ├── README.md │ │ ├── paper_1.md │ │ └── paper_2.md │ │ ├── ACL2023 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_10.md │ │ ├── paper_11.md │ │ ├── paper_12.md │ │ ├── paper_13.md │ │ ├── paper_2.md │ │ ├── paper_3.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ ├── paper_6.md │ │ ├── paper_7.md │ │ ├── paper_8.md │ │ └── paper_9.md │ │ ├── ACL2024 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_10.md │ │ ├── paper_11.md │ │ ├── paper_12.md │ │ ├── paper_13.md │ │ ├── paper_14.md │ │ ├── paper_15.md │ │ ├── paper_16.md │ │ ├── paper_17.md │ │ ├── paper_18.md │ │ ├── paper_19.md │ │ ├── paper_2.md │ │ ├── paper_20.md │ │ ├── paper_21.md │ │ ├── paper_22.md │ │ ├── paper_3.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ ├── paper_6.md │ │ ├── paper_7.md │ │ ├── paper_8.md │ │ └── paper_9.md │ │ ├── ACL2025 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── ACMSurvey2023 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── APSEC2023 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── ASE2023 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_10.md │ │ ├── paper_11.md │ │ ├── paper_12.md │ │ ├── paper_13.md │ │ ├── paper_14.md │ │ ├── paper_15.md │ │ ├── paper_16.md │ │ ├── paper_17.md │ │ ├── paper_18.md │ │ ├── paper_2.md │ │ ├── paper_3.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ ├── paper_6.md │ │ ├── paper_7.md │ │ ├── paper_8.md │ │ └── paper_9.md │ │ ├── ASE2024 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_10.md │ │ ├── paper_11.md │ │ ├── paper_12.md │ │ ├── paper_13.md │ │ ├── paper_14.md │ │ ├── paper_15.md │ │ ├── paper_16.md │ │ ├── paper_17.md │ │ ├── 
paper_18.md │ │ ├── paper_19.md │ │ ├── paper_2.md │ │ ├── paper_20.md │ │ ├── paper_21.md │ │ ├── paper_22.md │ │ ├── paper_23.md │ │ ├── paper_24.md │ │ ├── paper_25.md │ │ ├── paper_26.md │ │ ├── paper_27.md │ │ ├── paper_28.md │ │ ├── paper_29.md │ │ ├── paper_3.md │ │ ├── paper_30.md │ │ ├── paper_31.md │ │ ├── paper_32.md │ │ ├── paper_33.md │ │ ├── paper_34.md │ │ ├── paper_35.md │ │ ├── paper_36.md │ │ ├── paper_37.md │ │ ├── paper_38.md │ │ ├── paper_39.md │ │ ├── paper_4.md │ │ ├── paper_40.md │ │ ├── paper_41.md │ │ ├── paper_42.md │ │ ├── paper_43.md │ │ ├── paper_44.md │ │ ├── paper_45.md │ │ ├── paper_5.md │ │ ├── paper_6.md │ │ ├── paper_7.md │ │ ├── paper_8.md │ │ └── paper_9.md │ │ ├── ASIACCS2024 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── ASPLOS2024 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── Apple2024 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── CAV2024 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── CC2025 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── CCS2023 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_2.md │ │ └── paper_3.md │ │ ├── CCS2024 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_2.md │ │ ├── paper_3.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ ├── paper_6.md │ │ ├── paper_7.md │ │ ├── paper_8.md │ │ └── paper_9.md │ │ ├── CGO2022 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── EMNLP2020 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── EMNLP2023 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_10.md │ │ ├── paper_11.md │ │ ├── paper_12.md │ │ ├── paper_13.md │ │ ├── paper_14.md │ │ ├── paper_15.md │ │ ├── paper_16.md │ │ ├── paper_17.md │ │ ├── paper_18.md │ │ ├── paper_2.md │ │ ├── paper_3.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ ├── paper_6.md │ │ ├── paper_7.md │ │ ├── paper_8.md │ │ └── paper_9.md │ │ ├── EMNLP2024 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_10.md │ │ ├── paper_11.md │ │ ├── paper_12.md │ │ ├── paper_13.md │ │ ├── paper_14.md │ │ ├── paper_15.md │ │ ├── paper_16.md │ │ ├── paper_17.md │ │ ├── paper_18.md │ │ ├── paper_19.md │ │ ├── paper_2.md │ │ ├── paper_20.md │ │ ├── paper_21.md │ │ ├── paper_22.md │ │ ├── paper_23.md │ │ ├── paper_24.md │ │ ├── paper_25.md │ │ ├── paper_26.md │ │ ├── paper_27.md │ │ ├── paper_28.md │ │ ├── paper_29.md │ │ ├── paper_3.md │ │ ├── paper_30.md │ │ ├── paper_31.md │ │ ├── paper_32.md │ │ ├── paper_33.md │ │ ├── paper_34.md │ │ ├── paper_35.md │ │ ├── paper_36.md │ │ ├── paper_37.md │ │ ├── paper_38.md │ │ ├── paper_39.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ ├── paper_6.md │ │ ├── paper_7.md │ │ ├── paper_8.md │ │ └── paper_9.md │ │ ├── FASE2024 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── FMCAD2024 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── FSE2023 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_10.md │ │ ├── paper_11.md │ │ ├── paper_12.md │ │ ├── paper_13.md │ │ ├── paper_2.md │ │ ├── paper_3.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ ├── paper_6.md │ │ ├── paper_7.md │ │ ├── paper_8.md │ │ └── paper_9.md │ │ ├── FSE2024 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_10.md │ │ ├── paper_11.md │ │ ├── paper_12.md │ │ ├── paper_13.md │ │ ├── paper_14.md │ │ ├── paper_15.md │ │ ├── paper_16.md │ │ ├── paper_17.md │ │ ├── paper_18.md │ │ ├── paper_19.md │ │ ├── paper_2.md │ │ ├── paper_20.md │ │ ├── paper_21.md │ │ ├── paper_22.md │ │ ├── paper_23.md │ │ ├── paper_24.md │ │ ├── paper_25.md │ │ ├── paper_26.md │ │ ├── paper_27.md │ │ ├── paper_28.md │ │ ├── paper_29.md │ │ ├── paper_3.md │ │ ├── paper_30.md │ │ ├── paper_31.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ ├── 
paper_6.md │ │ ├── paper_7.md │ │ ├── paper_8.md │ │ └── paper_9.md │ │ ├── FSE2025 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── Forge2024 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── Galois2024 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── Google2023 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── Google2024 │ │ ├── README.md │ │ ├── paper_1.md │ │ └── paper_2.md │ │ ├── ICLR2021 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── ICLR2023 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── ICLR2024 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_2.md │ │ ├── paper_3.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ ├── paper_6.md │ │ ├── paper_7.md │ │ └── paper_8.md │ │ ├── ICLR2025 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_2.md │ │ ├── paper_3.md │ │ └── paper_4.md │ │ ├── ICML2021 │ │ ├── README.md │ │ ├── paper_1.md │ │ └── paper_2.md │ │ ├── ICML2023 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_2.md │ │ └── paper_3.md │ │ ├── ICML2024 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_2.md │ │ └── paper_3.md │ │ ├── ICML2025 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_2.md │ │ └── paper_3.md │ │ ├── ICSE2023 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_10.md │ │ ├── paper_11.md │ │ ├── paper_2.md │ │ ├── paper_3.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ ├── paper_6.md │ │ ├── paper_7.md │ │ ├── paper_8.md │ │ └── paper_9.md │ │ ├── ICSE2024 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_10.md │ │ ├── paper_11.md │ │ ├── paper_12.md │ │ ├── paper_13.md │ │ ├── paper_14.md │ │ ├── paper_15.md │ │ ├── paper_16.md │ │ ├── paper_17.md │ │ ├── paper_18.md │ │ ├── paper_19.md │ │ ├── paper_2.md │ │ ├── paper_20.md │ │ ├── paper_21.md │ │ ├── paper_22.md │ │ ├── paper_23.md │ │ ├── paper_24.md │ │ ├── paper_25.md │ │ ├── paper_26.md │ │ ├── paper_27.md │ │ ├── paper_28.md │ │ ├── paper_29.md │ │ ├── paper_3.md │ │ ├── paper_30.md │ │ ├── paper_31.md │ │ ├── paper_32.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ ├── paper_6.md │ │ ├── paper_7.md │ │ ├── paper_8.md │ │ └── paper_9.md │ │ ├── ICSE2025 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_2.md │ │ ├── paper_3.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ └── paper_6.md │ │ ├── ISSTA2022 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── ISSTA2023 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_2.md │ │ ├── paper_3.md │ │ ├── paper_4.md │ │ └── paper_5.md │ │ ├── ISSTA2024 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_10.md │ │ ├── paper_11.md │ │ ├── paper_12.md │ │ ├── paper_13.md │ │ ├── paper_14.md │ │ ├── paper_15.md │ │ ├── paper_16.md │ │ ├── paper_17.md │ │ ├── paper_18.md │ │ ├── paper_19.md │ │ ├── paper_2.md │ │ ├── paper_20.md │ │ ├── paper_21.md │ │ ├── paper_22.md │ │ ├── paper_23.md │ │ ├── paper_24.md │ │ ├── paper_25.md │ │ ├── paper_26.md │ │ ├── paper_27.md │ │ ├── paper_28.md │ │ ├── paper_3.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ ├── paper_6.md │ │ ├── paper_7.md │ │ ├── paper_8.md │ │ └── paper_9.md │ │ ├── KDD2025 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── LLM4Code2025 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── LangSec2025 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── Meta2024 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── Microsoft2023 │ │ ├── README.md │ │ ├── paper_1.md │ │ └── paper_2.md │ │ ├── Microsoft2024 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── NAACL2024 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_2.md │ │ ├── paper_3.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ └── paper_6.md │ │ ├── NDSS2024 │ │ ├── README.md │ │ ├── paper_1.md │ │ └── paper_2.md │ │ ├── NDSS2025 │ │ 
├── README.md │ │ ├── paper_1.md │ │ ├── paper_2.md │ │ ├── paper_3.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ ├── paper_6.md │ │ ├── paper_7.md │ │ └── paper_8.md │ │ ├── NVDIA2024 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── NeurIPS2018 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── NeurIPS2022 │ │ ├── README.md │ │ ├── paper_1.md │ │ └── paper_2.md │ │ ├── NeurIPS2023 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_2.md │ │ ├── paper_3.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ └── paper_6.md │ │ ├── NeurIPS2024 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_2.md │ │ ├── paper_3.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ ├── paper_6.md │ │ ├── paper_7.md │ │ └── paper_8.md │ │ ├── OOPLSA2023 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── OOPSLA2024 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_2.md │ │ ├── paper_3.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ ├── paper_6.md │ │ ├── paper_7.md │ │ ├── paper_8.md │ │ └── paper_9.md │ │ ├── OOPSLA2025 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_2.md │ │ └── paper_3.md │ │ ├── OpenAI2024 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── PLDI2023 │ │ ├── README.md │ │ ├── paper_1.md │ │ └── paper_2.md │ │ ├── PLDI2025 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── POPL2025 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── ProtectAI2024 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── RAID2023 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── S&P2023 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── S&P2024 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_2.md │ │ ├── paper_3.md │ │ └── paper_4.md │ │ ├── SOAP2024 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── SOSP2024 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── TKDD2024 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── TMLR2024 │ │ ├── README.md │ │ └── paper_1.md │ │ ├── TOSEM2023 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_2.md │ │ └── paper_3.md │ │ ├── TOSEM2024 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_10.md │ │ ├── paper_2.md │ │ ├── paper_3.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ ├── paper_6.md │ │ ├── paper_7.md │ │ ├── paper_8.md │ │ └── paper_9.md │ │ ├── TSE2023 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_2.md │ │ ├── paper_3.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ └── paper_6.md │ │ ├── TSE2024 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_10.md │ │ ├── paper_11.md │ │ ├── paper_12.md │ │ ├── paper_13.md │ │ ├── paper_14.md │ │ ├── paper_2.md │ │ ├── paper_3.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ ├── paper_6.md │ │ ├── paper_7.md │ │ ├── paper_8.md │ │ └── paper_9.md │ │ ├── USENIXSec2023 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_2.md │ │ └── paper_3.md │ │ ├── USENIXSec2024 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_2.md │ │ ├── paper_3.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ └── paper_6.md │ │ ├── arXiv2023 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_10.md │ │ ├── paper_11.md │ │ ├── paper_12.md │ │ ├── paper_13.md │ │ ├── paper_14.md │ │ ├── paper_15.md │ │ ├── paper_16.md │ │ ├── paper_2.md │ │ ├── paper_3.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ ├── paper_6.md │ │ ├── paper_7.md │ │ ├── paper_8.md │ │ └── paper_9.md │ │ ├── arXiv2024 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_10.md │ │ ├── paper_11.md │ │ ├── paper_12.md │ │ ├── paper_13.md │ │ ├── paper_14.md │ │ ├── paper_15.md │ │ ├── paper_16.md │ │ ├── paper_17.md │ │ ├── paper_18.md │ │ ├── paper_19.md │ │ ├── paper_2.md │ │ ├── paper_20.md │ │ ├── paper_21.md │ │ ├── paper_22.md │ │ ├── paper_23.md │ │ ├── paper_24.md │ │ ├── paper_25.md │ │ 
├── paper_26.md │ │ ├── paper_27.md │ │ ├── paper_28.md │ │ ├── paper_29.md │ │ ├── paper_3.md │ │ ├── paper_30.md │ │ ├── paper_31.md │ │ ├── paper_32.md │ │ ├── paper_33.md │ │ ├── paper_34.md │ │ ├── paper_35.md │ │ ├── paper_36.md │ │ ├── paper_37.md │ │ ├── paper_38.md │ │ ├── paper_39.md │ │ ├── paper_4.md │ │ ├── paper_40.md │ │ ├── paper_41.md │ │ ├── paper_5.md │ │ ├── paper_6.md │ │ ├── paper_7.md │ │ ├── paper_8.md │ │ └── paper_9.md │ │ └── arXiv2025 │ │ ├── README.md │ │ ├── paper_1.md │ │ ├── paper_10.md │ │ ├── paper_11.md │ │ ├── paper_12.md │ │ ├── paper_13.md │ │ ├── paper_14.md │ │ ├── paper_15.md │ │ ├── paper_16.md │ │ ├── paper_17.md │ │ ├── paper_18.md │ │ ├── paper_19.md │ │ ├── paper_2.md │ │ ├── paper_20.md │ │ ├── paper_3.md │ │ ├── paper_4.md │ │ ├── paper_5.md │ │ ├── paper_6.md │ │ ├── paper_7.md │ │ ├── paper_8.md │ │ └── paper_9.md ├── rawdata │ ├── 2023 │ │ ├── ACL2023.html │ │ ├── ASE2023.bib │ │ ├── CCS2023.bib │ │ ├── EMNLP2023.html │ │ ├── FSE2023.bib │ │ ├── ICSE2023.bib │ │ ├── ISSTA2023.bib │ │ ├── NDSS2023.html │ │ ├── OOPSLA2023.bib │ │ ├── PLDI2023.bib │ │ ├── S&P2023.bib │ │ ├── TOSEM2023.bib │ │ ├── TSE2023.bib │ │ └── USENIXSec2023.bib │ ├── 2024 │ │ ├── ACL2024.html │ │ ├── ASE2024.bib │ │ ├── CCS2024.bib │ │ ├── EMNLP-findings2024.html │ │ ├── EMNLP-main2024.html │ │ ├── FSE2024.bib │ │ ├── ICSE2024.bib │ │ ├── ISSTA2024.bib │ │ ├── NAACL2024.html │ │ ├── NDSS2024.html │ │ ├── OOPSLA2024.bib │ │ ├── PLDI2024.bib │ │ ├── S&P2024.bib │ │ ├── TOSEM2024.bib │ │ └── TSE2024.bib │ ├── 2025 │ │ └── NDSS2025.html │ └── .DS_Store └── template.txt └── src ├── .DS_Store ├── patch.py └── process.py
/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PurCL/CodeLLMPaper/2df01657bb46cba896ada39d3bf08814ae6cc751/.DS_Store
--------------------------------------------------------------------------------
/data/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PurCL/CodeLLMPaper/2df01657bb46cba896ada39d3bf08814ae6cc751/data/.DS_Store
--------------------------------------------------------------------------------
/data/labeldata/extract.py:
--------------------------------------------------------------------------------
1 | # read the content from static_analysis.md line by line
2 | with open('static_analysis.md') as f:
3 |     content = f.readlines()
4 | 
5 | cnt = 0
6 | new_lines = set()
7 | for line in content:
8 |     if line == '\n':
9 |         continue
10 |     cnt += 1
11 | 
12 |     # each non-empty line has the form:
13 |     # [5] LLM Meets Bounded Model Checking: Neuro-symbolic Loop Invariant Inference (ASE2024)
14 |     # extract the title and the venue; use the last '(' / ')' so that
15 |     # parentheses inside a title do not break the split
16 |     line = line.strip()
17 |     idx = line.find(']')
18 |     title = line[idx + 2:line.rfind('(') - 1]
19 |     venue = line[line.rfind('(') + 1:line.rfind(')')]
20 |     print(f"Title: {title}, Venue: {venue}")
21 |     new_lines.add(title)
22 | print(cnt)
23 | 
24 | print(len(new_lines))
25 | # # dump the de-duplicated titles back to static_analysis.md
26 | # with open('static_analysis.md', 'w') as f:
27 | #     f.writelines("\n".join(new_lines))
28 | 
--------------------------------------------------------------------------------
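The script above slices titles and venues out of one label file by hand. For reference only, the following is a minimal, hypothetical sketch (not part of this repository) of how the bullet entries in the `data/papers/labels/*.md` files shown later in this dump could instead be parsed with a regular expression. The entry format is taken from those files; the regex, the `parse_label_file` helper, and the example path are illustrative assumptions.

```python
import re
from pathlib import Path

# One bullet entry in a label file looks like:
# - [Title](../venues/VENUE/paper_N.md), ([VENUE](../venues/VENUE/README.md))
ENTRY_RE = re.compile(
    r"^- \[(?P<title>.+?)\]\((?P<path>\.\./venues/[^)]+)\), "
    r"\(\[(?P<venue>[^\]]+)\]"
)

def parse_label_file(path):
    """Yield (title, venue, relative paper path) for each entry in a label file."""
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        match = ENTRY_RE.match(line.strip())
        if match:
            yield match.group("title"), match.group("venue"), match.group("path")

if __name__ == "__main__":
    # Example: list every paper filed under DBMS testing.
    for title, venue, rel_path in parse_label_file("data/papers/labels/DBMS_testing.md"):
        print(f"{venue}: {title} -> {rel_path}")
```

Run against `DBMS_testing.md`, this would print one `venue: title -> path` line per catalogued paper; label files whose entries deviate from the format above would need a slightly different pattern.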
--------------------------------------------------------------------------------
/data/papers/labels/DBMS_testing.md:
--------------------------------------------------------------------------------
1 | # DBMS Testing
2 | 
3 | - [Sedar: Obtaining High-Quality Seeds for DBMS Fuzzing via Cross-DBMS SQL Transfer](../venues/ICSE2024/paper_16.md), ([ICSE2024](../venues/ICSE2024/README.md))
4 | 
5 | - **Abstract**: Effective DBMS fuzzing relies on high-quality initial seeds, which serve as the starting point for mutation. These initial seeds should incorporate various DBMS features to explore the state space thoroughly. While built-in test cases are typically used as initial seeds, many DBMSs lack comprehensive test cases, making it difficult to apply state-of-the-art fuzzing techniques directly.To address this, we propose Sedar which produces initial seeds for a target DBMS by transferring test cases from...
6 | - **Labels**: [program testing](program_testing.md), [fuzzing](fuzzing.md), [DBMS testing](DBMS_testing.md)
7 | 
--------------------------------------------------------------------------------
/data/papers/labels/equivalence_checking.md:
--------------------------------------------------------------------------------
1 | # Equivalence Checking
2 | 
3 | - [Evaluating the effectiveness of deep learning models for foundational program analysis tasks](../venues/OOPSLA2024/paper_9.md), ([OOPSLA2024](../venues/OOPSLA2024/README.md))
4 | 
5 | - **Abstract**: While deep neural networks provide state-of-the-art solutions to a wide range of programming language tasks, their effectiveness in dealing with foundational program analysis tasks remains under explored. In this paper, we present an empirical study that evaluates four prominent models of code (i.e., CuBERT, CodeBERT, GGNN, and Graph Sandwiches) in two such foundational tasks: (1) alias prediction, in which models predict whether two pointers must alias, may alias or must not alias; and (2) equi...
6 | - **Labels**: [static analysis](static_analysis.md), [pointer analysis](pointer_analysis.md), [equivalence checking](equivalence_checking.md), [code model](code_model.md), [code model training](code_model_training.md), [source code model](source_code_model.md)
7 | 
--------------------------------------------------------------------------------
/data/papers/labels/general_testing.md:
--------------------------------------------------------------------------------
1 | # General Testing
2 | 
3 | - [You Name It, I Run It: An LLM Agent to Execute Tests of Arbitrary Projects](../venues/arXiv2024/paper_24.md), ([arXiv2024](../venues/arXiv2024/README.md))
4 | 
5 | - **Abstract**: The ability to execute the test suite of a project is essential in many scenarios, e.g., to assess code quality and code coverage, to validate code changes made by developers or automated tools, and to ensure compatibility with dependencies. Despite its importance, executing the test suite of a project can be challenging in practice because different projects use different programming languages, software ecosystems, build systems, testing frameworks, and other tools. These challenges make it dif...
6 | - **Labels**: [program testing](program_testing.md), [general testing](general_testing.md), [agent design](agent_design.md), [planning](planning.md)
7 | 
--------------------------------------------------------------------------------
/data/papers/labels/library_testing.md:
--------------------------------------------------------------------------------
1 | # Library Testing
2 | 
3 | - [Large Language Models Are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models](../venues/ISSTA2023/paper_2.md), ([ISSTA2023](../venues/ISSTA2023/README.md))
4 | 
5 | - **Abstract**: Deep Learning (DL) systems have received exponential growth in popularity and have become ubiquitous in our everyday life.
Such systems are built on top of popular DL libraries, e.g., TensorFlow and PyTorch which provide APIs as building blocks for DL systems. Detecting bugs in these DL libraries is critical for almost all downstream DL systems in ensuring effectiveness/safety for end users. Meanwhile, traditional fuzzing techniques can be hardly effective for such a challenging domain since the... 6 | - **Labels**: [program testing](program_testing.md), [fuzzing](fuzzing.md), [library testing](library_testing.md) 7 | -------------------------------------------------------------------------------- /data/papers/labels/mutation_testing.md: -------------------------------------------------------------------------------- 1 | # Mutation Testing 2 | 3 | - [LLMorpheus: Mutation Testing using Large Language Models](../venues/arXiv2024/paper_31.md), ([arXiv2024](../venues/arXiv2024/README.md)) 4 | 5 | - **Abstract**: In mutation testing, the quality of a test suite is evaluated by introducing faults into a program and determining whether the program's tests detect them. Most existing approaches for mutation testing involve the application of a fixed set of mutation operators, e.g., replacing a "+" with a "-" or removing a function's body. However, certain types of real-world bugs cannot easily be simulated by such approaches, limiting their effectiveness. This paper presents a technique where a Large Languag... 6 | - **Labels**: [program testing](program_testing.md), [mutation testing](mutation_testing.md) 7 | 8 | 9 | - [Large Language Models for Equivalent Mutant Detection: How Far Are We?](../venues/ISSTA2024/paper_23.md), ([ISSTA2024](../venues/ISSTA2024/README.md)) 10 | 11 | - **Abstract**: Mutation testing is vital for ensuring software quality. However, the presence of equivalent mutants is known to introduce redundant cost and bias issues, hindering the effectiveness of mutation testing in practical use. Although numerous equivalent mutant detection (EMD) techniques have been proposed, they exhibit limitations due to the scarcity of training data and challenges in generalizing to unseen mutants. Recently, large language models (LLMs) have been extensively adopted in various code... 12 | - **Labels**: [program testing](program_testing.md), [mutation testing](mutation_testing.md), [empirical study](empirical_study.md) 13 | -------------------------------------------------------------------------------- /data/papers/labels/protocol_fuzzing.md: -------------------------------------------------------------------------------- 1 | # Protocol Fuzzing 2 | 3 | - [Large Language Model guided Protocol Fuzzing](../venues/NDSS2024/paper_2.md), ([NDSS2024](../venues/NDSS2024/README.md)) 4 | 5 | - **Abstract**: How to find security flaws in a protocol implementation without a machine-readable specification of the protocol? Facing the internet, protocol implementations are particularly security-critical software systems where inputs must adhere to a specific structure and order that is often informally specified in hundreds of pages in natural language (RFC). Without some machine-readable version of that protocol, it is difficult to automatically generate valid test inputs for its implementation that fo... 
6 | - **Labels**: [program testing](program_testing.md), [fuzzing](fuzzing.md), [protocol fuzzing](protocol_fuzzing.md) 7 | -------------------------------------------------------------------------------- /data/papers/labels/software_configuration.md: -------------------------------------------------------------------------------- 1 | # Software Configuration 2 | 3 | - [Face It Yourselves: An LLM-Based Two-Stage Strategy to Localize Configuration Errors via Logs](../venues/ISSTA2024/paper_1.md), ([ISSTA2024](../venues/ISSTA2024/README.md)) 4 | 5 | - **Abstract**: Configurable software systems are prone to configuration errors, resulting in significant losses to companies. However, diagnosing these errors is challenging due to the vast and complex configuration space. These errors pose significant challenges for both experienced maintainers and new end-users, particularly those without access to the source code of the software systems. Given that logs are easily accessible to most end-users, we conduct a preliminary study to outline the challenges and opp... 6 | - **Labels**: [software maintenance and deployment](software_maintenance_and_deployment.md), [software configuration](software_configuration.md) 7 | -------------------------------------------------------------------------------- /data/papers/labels/symbolic_execution.md: -------------------------------------------------------------------------------- 1 | # Symbolic Execution 2 | 3 | - [Large Language Model powered Symbolic Execution](../venues/arXiv2025/paper_18.md), ([arXiv2025](../venues/arXiv2025/README.md)) 4 | 5 | - **Abstract**: Large Language Models (LLMs) have emerged as a promising alternative to traditional static program analysis methods, such as symbolic execution, offering the ability to reason over code directly without relying on theorem provers or SMT solvers. However, LLMs are also inherently probabilistic by nature, and therefore face significant challenges in relation to the accuracy and scale of the analysis in real-world application. Such issues often necessitate the use of larger LLMs with higher token l... 6 | - **Labels**: [static analysis](static_analysis.md), [symbolic execution](symbolic_execution.md) 7 | -------------------------------------------------------------------------------- /data/papers/labels/syntactic_analysis.md: -------------------------------------------------------------------------------- 1 | # Syntactic Analysis 2 | 3 | - [Which Syntactic Capabilities Are Statistically Learned by Masked Language Models for Code?](../venues/ICSE2024/paper_20.md), ([ICSE2024](../venues/ICSE2024/README.md)) 4 | 5 | - **Abstract**: This paper discusses the limitations of evaluating Masked Language Models (MLMs) in code completion tasks. We highlight that relying on accuracy-based measurements may lead to an overestimation of models' capabilities by neglecting the syntax rules of programming languages. To address these issues, we introduce a technique called SyntaxEval in which Syntactic Capabilities are used to enhance the evaluation of MLMs. SyntaxEval automates the process of masking elements in the model input based on ... 
6 | - **Labels**: [static analysis](static_analysis.md), [syntactic analysis](syntactic_analysis.md), [empirical study](empirical_study.md) 7 | -------------------------------------------------------------------------------- /data/papers/venues/AAAI2024/paper_2.md: -------------------------------------------------------------------------------- 1 | # Relational Programming with Foundational Models 2 | 3 | **Authors**: Ziyang Li and Jiani Huang and Jason Liu and Felix Zhu and Eric Zhao and William Dodds and Neelay Velingker and Rajeev Alur and Mayur Naik 4 | 5 | **Abstract**: 6 | 7 | Foundation models have vast potential to enable diverse AI applications. The powerful yet incomplete nature of these models has spurred a wide range of mechanisms to augment them with capabilities such as in-context learning, information retrieval, and code interpreting. We propose VIEIRA, a declarative framework that unifies these mechanisms in a general solution for programming with foundation models. VIEIRA follows a probabilistic relational paradigm and treats foundation models as stateless functions with relational inputs and outputs. It supports neuro-symbolic applications by enabling the seamless combination of such models with logic programs, as well as complex, multi-modal applications by streamlining the composition of diverse sub-models. We implement VIEIRA by extending the SCALLOP compiler with a foreign interface that supports foundation models as plugins. We implement plugins for 12 foundation models including GPT, CLIP, and SAM. We evaluate VIEIRA on 9 challenging tasks that span language, vision, and structured and vector databases. Our evaluation shows that programs in VIEIRA are concise, can incorporate modern foundation models, and have comparable or better accuracy than competitive baselines. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1609/aaai.v38i9.28934) 10 | 11 | **Labels**: [PL design for LLMs](../../labels/PL_design_for_LLMs.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ACL2023/paper_1.md: -------------------------------------------------------------------------------- 1 | # Self-Edit: Fault-Aware Code Editor for Code Generation 2 | 3 | **Authors**: Zhang, Kechi and Li, Zhuo and Li, Jia and Li, Ge and Jin, Zhi 4 | 5 | **Abstract**: 6 | 7 | Large language models (LLMs) have demonstrated an impressive ability to generate codes on competitive programming tasks. However, with limited sample numbers, LLMs still suffer from poor accuracy. Inspired by the process of human programming, we propose a generate-and-edit approach named Self-Edit that utilizes execution results of the generated code from LLMs to improve the code quality on the competitive programming task. We execute the generated code on the example test case provided in the question and wrap execution results into a supplementary comment. Utilizing this comment as guidance, our fault-aware code editor is employed to correct errors in the generated code. We perform extensive evaluations across two competitive programming datasets with nine different LLMs. Compared to directly generating from LLMs, our approach can improve the average of pass@1 by 89% on APPS-dev, 31% on APPS-test, and 48% on HumanEval over nine popular code generation LLMs with parameter sizes ranging from 110M to 175B. Compared to other post-processing methods, our method demonstrates superior accuracy and efficiency. 
8 | 9 | **Link**: [Read Paper](https://doi.org/10.18653/v1/2023.acl-long.45) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program repair](../../labels/program_repair.md), [benchmark](../../labels/benchmark.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ACL2023/paper_10.md: -------------------------------------------------------------------------------- 1 | # Python Code Generation by Asking Clarification Questions 2 | 3 | **Authors**: Li, Haau-Sing (Xiaocheng) and Mesgar, Mohsen and Martins, André and Gurevych, Iryna 4 | 5 | **Abstract**: 6 | 7 | Code generation from text requires understanding the user’s intent from a natural languagedescription and generating an executable code snippet that satisfies this intent. While recent pretrained language models demonstrate remarkable performance for this task, these models fail when the given natural language description is under-specified. In this work, we introduce a novel and more realistic setup for this task. We hypothesize that the under-specification of a natural language description can be resolved by asking clarification questions. Therefore, we collect and introduce a new dataset named CodeClarQA containing pairs of natural language descriptions and code with created synthetic clarification questions and answers. The empirical results of our evaluation of pretrained language model performance on code generation show that clarifications result in more precisely generated code, as shown by the substantial improvement of model performance in all evaluation metrics. Alongside this, our task and dataset introduce new challenges to the community, including when and what clarification questions should be asked. Our code and dataset are available on GitHub. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.18653/v1/2023.acl-long.799) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ACL2023/paper_3.md: -------------------------------------------------------------------------------- 1 | # Fact-Checking Complex Claims with Program-Guided Reasoning 2 | 3 | **Authors**: Pan, Liangming and Wu, Xiaobao and Lu, Xinyuan and Luu, Anh Tuan and Wang, William Yang and Kan, Min-Yen and Nakov, Preslav 4 | 5 | **Abstract**: 6 | 7 | Fact-checking real-world claims often requires collecting multiple pieces of evidence and applying complex multi-step reasoning. In this paper, we present Program-Guided Fact-Checking (ProgramFC), a novel fact-checking model that decomposes complex claims into simpler sub-tasks that can be solved using a shared library of specialized functions. We first leverage the in-context learning ability of large language models to generate reasoning programs to guide the verification process. Afterward, we execute the program by delegating each sub-task to the corresponding sub-task handler. This process makes our model both explanatory and data-efficient, providing clear explanations of its reasoning process and requiring minimal training data. We evaluate ProgramFC on two challenging fact-checking datasets and show that it outperforms seven fact-checking baselines across different settings of evidence availability, with explicit output programs that benefit human debugging. Our codes and data are publicly available at https://github.com/mbzuai-nlp/ProgramFC. 
8 | 9 | **Link**: [Read Paper](https://doi.org/10.18653/v1/2023.acl-long.386) 10 | 11 | **Labels**: [agent design](../../labels/agent_design.md), [prompt strategy](../../labels/prompt_strategy.md), [reason with code](../../labels/reason_with_code.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ACL2024/paper_12.md: -------------------------------------------------------------------------------- 1 | # MPCoder: Multi-user Personalized Code Generator with Explicit and Implicit Style Representation Learning 2 | 3 | **Authors**: Dai, Zhenlong and Yao, Chang and Han, WenKang and Yuanying, Yuanying and Gao, Zhipeng and Chen, Jingyuan 4 | 5 | **Abstract**: 6 | 7 | Large Language Models (LLMs) have demonstrated great potential for assisting developers in their daily development. However, most research focuses on generating correct code, how to use LLMs to generate personalized code has seldom been investigated. To bridge this gap, we proposed MPCoder (Multi-user Personalized Code Generator) to generate personalized code for multiple users. To better learn coding style features, we utilize explicit coding style residual learning to capture the syntax code style standards and implicit style learning to capture the semantic code style conventions. We train a multi-user style adapter to better differentiate the implicit feature representations of different users through contrastive learning, ultimately enabling personalized code generation for multiple users. We further propose a novel evaluation metric for estimating similarities between codes of different coding styles. The experimental results show the effectiveness of our approach for this novel task. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.18653/v1/2024.acl-long.207) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md), [code model](../../labels/code_model.md), [code model training](../../labels/code_model_training.md), [source code model](../../labels/source_code_model.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ACL2024/paper_14.md: -------------------------------------------------------------------------------- 1 | # DolphCoder: Echo-Locating Code Large Language Models with Diverse and Multi-Objective Instruction Tuning 2 | 3 | **Authors**: Wang, Yejie and He, Keqing and Dong, Guanting and Wang, Pei and Zeng, Weihao and Diao, Muxi and Xu, Weiran and Wang, Jingang and Zhang, Mengdi and Cai, Xunliang 4 | 5 | **Abstract**: 6 | 7 | Code Large Language Models (Code LLMs) have demonstrated outstanding performance in code-related tasks. Various instruction finetuning approaches have been proposed to boost the code generation performance of pre-trained Code LLMs. In this paper, we introduce a diverse instruction model DolphCoder with self-evaluating for code generation. It learns diverse instruction targets and combines a code evaluation objective to enhance its code generation ability. Our model achieves superior performance on the HumanEval and MBPP benchmarks, demonstrating new insights for future code instruction tuning work. Our key findings are: (1) Augmenting more diverse responses with more distinct reasoning paths increases the code capability of LLMs. (2) Improving one’s ability to evaluate the correctness of code also enhances their ability to create it. 
8 | 9 | **Link**: [Read Paper](https://doi.org/10.18653/v1/2024.acl-long.259) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md), [code model](../../labels/code_model.md), [code model training](../../labels/code_model_training.md), [source code model](../../labels/source_code_model.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ACL2024/paper_15.md: -------------------------------------------------------------------------------- 1 | # Who Wrote this Code? Watermarking for Code Generation 2 | 3 | **Authors**: Lee, Taehyun and Hong, Seokhee and Ahn, Jaewoo and Hong, Ilgee and Lee, Hwaran and Yun, Sangdoo and Shin, Jamin and Kim, Gunhee 4 | 5 | **Abstract**: 6 | 7 | Since the remarkable generation performance of large language models raised ethical and legal concerns, approaches to detect machine-generated text by embedding watermarks are being developed.However, we discover that the existing works fail to function appropriately in code generation tasks due to the task’s nature of having low entropy.Extending a logit-modifying watermark method, we propose Selective WatErmarking via Entropy Thresholding (SWEET), which enhances detection ability and mitigates code quality degeneration by removing low-entropy segments at generating and detecting watermarks.Our experiments show that SWEET significantly improves code quality preservation while outperforming all baselines, including post-hoc detection methods, in detecting machine-generated code text.Our code is available inhttps://github.com/hongcheki/sweet-watermark. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.18653/v1/2024.acl-long.268) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md), [code model](../../labels/code_model.md), [code model security](../../labels/code_model_security.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ACL2024/paper_21.md: -------------------------------------------------------------------------------- 1 | # Lightweight reranking for language model generations 2 | 3 | **Authors**: Jain, Siddhartha and Ma, Xiaofei and Deoras, Anoop and Xiang, Bing 4 | 5 | **Abstract**: 6 | 7 | Large Language Models (LLMs) can exhibit considerable variation in the quality of their sampled outputs. Reranking and selecting the best generation from the sampled set is a popular way of obtaining strong gains in generation quality. In this paper, we present a novel approach for reranking LLM generations. Unlike other techniques that might involve additional inferences or training a specialized reranker, our approach relies on easy to compute pairwise statistics between the generations that have minimal compute overhead. We show that our approach can be formalized as an extension of self-consistency and analyze its performance in that framework, theoretically as well as via simulations. We show strong improvements for selecting the best k generations for code generation tasks as well as robust improvements for the best generation for the tasks of autoformalization, summarization, and translation. While our approach only assumes black-box access to LLMs, we show that additional access to token probabilities can improve performance even further. 
8 | 9 | **Link**: [Read Paper](https://doi.org/10.18653/v1/2024.acl-long.376) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ACL2024/paper_7.md: -------------------------------------------------------------------------------- 1 | # On Improving Repository-Level Code QA for Large Language Models 2 | 3 | **Authors**: Strich, Jan and Schneider, Florian and Nikishina, Irina and Biemann, Chris 4 | 5 | **Abstract**: 6 | 7 | Large Language Models (LLMs) such as ChatGPT, GitHub Copilot, Llama, or Mistral assist programmers as copilots and knowledge sources to make the coding process faster and more efficient. This paper aims to improve the copilot performance by implementing different self-alignment processes and retrieval-augmented generation (RAG) pipelines, as well as their combination. To test the effectiveness of all approaches, we create a dataset and apply a model-based evaluation, using LLM as a judge. It is designed to check the model’s abilities to understand the source code semantics, the dependency between files, and the overall meta-information about the repository. We also compare our approach with other existing solutions, e.g. ChatGPT-3.5, and evaluate on the existing benchmarks. Code and dataset are available online (https://anonymous.4open.science/r/ma_llm-382D). 8 | 9 | **Link**: [Read Paper](https://doi.org/10.18653/v1/2024.acl-srw.28) 10 | 11 | **Labels**: [general coding task](../../labels/general_coding_task.md), [benchmark](../../labels/benchmark.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ACL2025/README.md: -------------------------------------------------------------------------------- 1 | # ACL2025 2 | 3 | Number of papers: 1 4 | 5 | ## [LocAgent: Graph-Guided LLM Agents for Code Localization](paper_1.md) 6 | - **Authors**: Zhaoling Chen, Xiangru Tang, Gangda Deng, Fang Wu, Jialong Wu, Zhiwei Jiang, Viktor Prasanna, Arman Cohan, Xingyao Wang 7 | - **Abstract**: Code localization--identifying precisely where in a codebase changes need to be made--is a fundamental yet challenging task in software maintenance. Existing approaches struggle to efficiently navigate complex codebases when identifying relevant code sections. The challenge lies in bridging natural language problem descriptions with the appropriate code elements, often requiring reasoning across hierarchical structures and multiple dependencies. We introduce LocAgent, a framework that addresses ... 
8 | - **Link**: [Read Paper](https://arxiv.org/abs/2503.09089) 9 | - **Labels**: [agent design](../../labels/agent_design.md), [prompt strategy](../../labels/prompt_strategy.md), [retrieval-augmented generation](../../labels/retrieval-augmented_generation.md), [planning](../../labels/planning.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/ACMSurvey2023/README.md: -------------------------------------------------------------------------------- 1 | # ACMSurvey2023 2 | 3 | Number of papers: 1 4 | 5 | ## [Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing](paper_1.md) 6 | - **Authors**: Liu, Pengfei and Yuan, Weizhe and Fu, Jinlan and Jiang, Zhengbao and Hayashi, Hiroaki and Neubig, Graham 7 | - **Abstract**: This article surveys and organizes research works in a new paradigm in natural language processing, which we dub “prompt-based learning.” Unlike traditional supervised learning, which trains a model to take in an input x and predict an output y as P(y|x), prompt-based learning is based on language models that model the probability of text directly. To use these models to perform prediction tasks, the original input x is modified using a template into a textual string prompt x′ that has some unfi... 8 | - **Link**: [Read Paper](https://arxiv.org/pdf/2107.13586) 9 | - **Labels**: [survey](../../labels/survey.md), [agent design](../../labels/agent_design.md), [prompt strategy](../../labels/prompt_strategy.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/APSEC2023/README.md: -------------------------------------------------------------------------------- 1 | # APSEC2023 2 | 3 | Number of papers: 1 4 | 5 | ## [Refactoring programs using large language models with few-shot examples](paper_1.md) 6 | - **Authors**: Shirafuji, Atsushi and Oda, Yusuke and Suzuki, Jun and Morishita, Makoto and Watanobe, Yutaka 7 | - **Abstract**: A less complex and more straightforward program is a crucial factor that enhances its maintainability and makes writing secure and bug-free programs easier. However, due to its heavy workload and the risks of breaking the working programs, programmers are reluctant to do code refactoring, and thus, it also causes the loss of potential learning experiences. To mitigate this, we demonstrate the application of using a large language model (LLM), GPT-3.5, to suggest less complex versions of the user... 8 | - **Link**: [Read Paper](https://arxiv.org/pdf/2311.11690.pdf) 9 | - **Labels**: [code generation](../../labels/code_generation.md), [program transformation](../../labels/program_transformation.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/ASE2023/paper_3.md: -------------------------------------------------------------------------------- 1 | # SMT Solver Validation Empowered by Large Pre-Trained Language Models 2 | 3 | **Authors**: Sun, Maolin and Yang, Yibiao and Wang, Yang and Wen, Ming and Jia, Haoxiang and Zhou, Yuming 4 | 5 | **Abstract**: 6 | 7 | SMT solvers are utilized to check the satisfiability of logic formulas and have been applied in various crucial domains, including software verification, test case generation, and program synthesis. However, bugs hidden in SMT solvers can lead to severe consequences, causing erroneous results in these domains. 
Therefore, ensuring the reliability and robustness of SMT solvers is of critical importance. Despite several testing approaches proposed for SMT solvers, generating effective test formulas to comprehensively test SMT solvers remains a challenge. To address this challenge, in this study, we propose to port large language models (LLMs) to generate SMT formulas for fuzzing solvers. Specifically, the study presents a novel retrain-finetune pipeline to unleash the potential of language models to generate effective SMT formulas and improve their generation performance through data augmentation. We implemented our approach as a practical fuzzing tool, named LasT,and then extensively tested the state-of-the-art SMT solvers, namely Z3, cvc5, and Bitwuzla. To date, Last has successfully uncovered 65 genuine bugs for the solvers, of which 45 have been fixed by the developers. 8 | 9 | **Link**: [Read Paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10298442) 10 | 11 | **Labels**: [program testing](../../labels/program_testing.md), [fuzzing](../../labels/fuzzing.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ASE2024/paper_23.md: -------------------------------------------------------------------------------- 1 | # LLM-Generated Invariants for Bounded Model Checking Without Loop Unrolling 2 | 3 | **Authors**: Pirzada, Muhammad A. A. and Reger, Giles and Bhayat, Ahmed and Cordeiro, Lucas C. 4 | 5 | **Abstract**: 6 | 7 | We investigate a modification of the classical Bounded Model Checking (BMC) procedure that does not handle loops through unrolling but via modifications to the control flow graph (CFG). A portion of the CFG representing a loop is replaced by a node asserting invariants of the loop. We generate these invariants using Large Language Models (LLMs) and use a first-order theorem prover to ensure the correctness of the generated statements. We thus transform programs to loop-free variants in a sound manner. Our experimental results show that the resulting tool, ESBMC ibmc, is competitive with state-of-the-art formal verifiers for programs with unbounded loops, significantly improving the number of programs verified by the industrial-strength software verifier ESBMC and verifying programs that state-of-the-art software verifiers such as SeaHorn and VeriAbs could not. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3691620.3695512) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [program verification](../../labels/program_verification.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ASE2024/paper_25.md: -------------------------------------------------------------------------------- 1 | # Test-Driven Development and LLM-based Code Generation 2 | 3 | **Authors**: Mathews, Noble Saji and Nagappan, Meiyappan 4 | 5 | **Abstract**: 6 | 7 | Recent Large Language Models (LLMs) have demonstrated significant capabilities in generating code snippets directly from problem statements. This increasingly automated process mirrors traditional human-led software development, where code is often written in response to a requirement. Historically, Test-Driven Development (TDD) has proven its merit, requiring developers to write tests before the functional code, ensuring alignment with the initial problem statements. 
Applying TDD principles to LLM-based code generation offers one distinct benefit: it enables developers to verify the correctness of generated code against predefined tests. This paper investigates if and how TDD can be incorporated into AI-assisted code-generation processes. We experimentally evaluate our hypothesis that providing LLMs like GPT-4 and Llama 3 with tests in addition to the problem statements enhances code generation outcomes. We experimented with established function-level code generation benchmarks such as MBPP and HumanEval. Our results consistently demonstrate that including test cases leads to higher success in solving programming challenges. We assert that TDD is a promising paradigm for helping ensure that the code generated by LLMs effectively captures the requirements. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3691620.3695527) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ASE2024/paper_29.md: -------------------------------------------------------------------------------- 1 | # Understanding Developer-Analyzer Interactions in Code Reviews 2 | 3 | **Authors**: Schaef, Martin and Cirisci, Berk and Luo, Linghui and Mansur, Muhammad Numair and Tripp, Omer and Sanchez, Daniel and Zhou, Qiang and Zafar, Muhammad Bilal 4 | 5 | **Abstract**: 6 | 7 | Static code analyzers are now a common part of the codereview process. These automated tools integrate into the code review process by commenting on code changes and suggesting improvements, in the same way as human reviewers. The comments made by static analyzers often trigger a conversation between developers to align on if and how the issue should be fixed. Because developers rarely give feedback directly to the tool, understanding the sentiment and intent in the conversation triggered by the tool comments can be used to measure the usefulness of the static analyzer.In this paper, we report on an experiment where we use large language models to automatically label and categorize the sentiment and intent of such conversations triggered by static analyzer comments. Our experiment demonstrates that LLMs not only classify and interpret complex developer-analyzer conversations, but can be more accurate than human experts. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3691620.3695257) 10 | 11 | **Labels**: [software maintenance and deployment](../../labels/software_maintenance_and_deployment.md), [code review](../../labels/code_review.md), [empirical study](../../labels/empirical_study.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ASE2024/paper_34.md: -------------------------------------------------------------------------------- 1 | # PACGBI: A Pipeline for Automated Code Generation from Backlog Items 2 | 3 | **Authors**: Sarschar, Mahja and Zhang, Gefei and Nowak, Annika 4 | 5 | **Abstract**: 6 | 7 | While there exist several tools to leverage Large Language Models (LLMs) for code generation, their capabilities are limited to the source code editor and are disconnected from the overall software development process. These tools typically generate standalone code snippets that still require manual integration into the codebase. 
There is still a lack of integrated solutions that seamlessly automate the entire development cycle, from backlog items to code generation and merge requests. We present the Pipeline for Automated Code Generation from Backlog Items (PACGBI), an LLM-assisted pipeline integrated into GitLab CI. PACGBI reads backlog items in the code repository, automatically generates the corresponding code, and creates merge requests for the generated changes. Our case study demonstrates the potential of PACGBI in automating agile software development processes, allowing parallelization of development and reduction of development costs. PACGBI can be utilized by software developers and enables nontechnical stakeholders and designers by providing a holistic solution for using LLMs in software development. A screencast of this tool is available at https://youtu.be/TI53m-fIoyc, its source code at https://github.com/Masa-99/pacgbi. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3691620.3695346) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ASE2024/paper_36.md: -------------------------------------------------------------------------------- 1 | # CoqPilot, a plugin for LLM-based generation of proofs 2 | 3 | **Authors**: Kozyrev, Andrei and Solovev, Gleb and Khramov, Nikita and Podkopaev, Anton 4 | 5 | **Abstract**: 6 | 7 | We present CoqPilot, a VS Code extension designed to help automate writing of Coq proofs. The plugin collects the parts of proofs marked with the admit tactic in a Coq file, i.e., proof holes, and combines LLMs along with non-machine-learning methods to generate proof candidates for the holes. Then, CoqPilot checks if each proof candidate solves the given subgoal and, if successful, replaces the hole with it. The focus of CoqPilot is twofold. Firstly, we want to allow users to seamlessly combine multiple Coq generation approaches and provide a zero-setup experience for our tool. Secondly, we want to deliver a platform for LLM-based experiments on Coq proof generation. We developed a benchmarking system for Coq generation methods, available in the plugin, and conducted an experiment using it, showcasing the framework's possibilities. Demo of CoqPilot is available at: https://youtu.be/oB1Lx-So9Lo. Code at: https://github.com/JetBrains-Research/coqpilot 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3691620.3695357) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md), [static analysis](../../labels/static_analysis.md), [program verification](../../labels/program_verification.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ASE2024/paper_38.md: -------------------------------------------------------------------------------- 1 | # Automated Validation of COBOL to Java Transformation 2 | 3 | **Authors**: Kumar, Atul and Saha, Diptikalyan and Yasue, Toshiaki and Ono, Kohichi and Krishnan, Saravanan and Hans, Sandeep and Satoh, Fumiko and Mitchell, Gerald and Kumar, Sachin 4 | 5 | **Abstract**: 6 | 7 | Recent advances in Large Language Model (LLM) based Generative AI techniques have made it feasible to translate enterpriselevel code from legacy languages such as COBOL to modern languages such as Java or Python. 
While the results of LLM-based automatic transformation are encouraging, the resulting code cannot be trusted to correctly translate the original code. We propose a framework and a tool to help validate the equivalence of COBOL and translated Java. The results can also help repair the code if there are some issues and provide feedback to the AI model to improve. We have developed a symbolic-execution-based test generation to automatically generate unit tests for the source COBOL programs which also mocks the external resource calls. We generate equivalent JUnit test cases with equivalent mocking as COBOL and run them to check semantic equivalence between original and translated programs. Demo Video: https://youtu.be/aqF_agNP-lU 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3691620.3695365) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program transformation](../../labels/program_transformation.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ASE2024/paper_39.md: -------------------------------------------------------------------------------- 1 | # Can Large Language Models Comprehend Code Stylometry? 2 | 3 | **Authors**: Dipongkor, Atish 4 | 5 | **Abstract**: 6 | 7 | Code Authorship Attribution (CAA) has several applications such as copyright disputes, plagiarism detection and criminal prosecution. Existing studies mainly focused on CAA by proposing machine learning (ML) and Deep Learning (DL) based techniques. The main limitations of ML-based techniques are (a) manual feature engineering is required to train these models and (b) they are vulnerable to adversarial attack. In this study, we initially fine-tune five Large Language Models (LLMs) for CAA and evaluate their performance. Our results show that LLMs are robust and less vulnerable compared to existing techniques in CAA task. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3691620.3695370) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [software composition analysis](../../labels/software_composition_analysis.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ASE2024/paper_40.md: -------------------------------------------------------------------------------- 1 | # Bridging Gaps in LLM Code Translation: Reducing Errors with Call Graphs and Bridged Debuggers 2 | 3 | **Authors**: Luo, Yang and Yu, Richard and Zhang, Fajun and Liang, Ling and Xiong, Yongqiang 4 | 5 | **Abstract**: 6 | 7 | When using large language models (LLMs) for code translation of complex software, numerous compilation and runtime errors can occur due to insufficient context awareness. To address this issue, this paper presents a code translation method based on call graphs and bridged debuggers: TransGraph. TransGraph first obtains the call graph of the entire code project using the Language Server Protocol, which provides a detailed description of the function call relationships in the program. Through this structured view of the code, LLMs can more effectively handle large-scale and complex codebases, significantly reducing compilation errors. Furthermore, TransGraph, combined with bridged debuggers and dynamic test case generation, significantly reduces runtime errors, overcoming the limitations of insufficient test case coverage in traditional methods. 
In experiments on six datasets including CodeNet and Avatar, TransGraph outperformed existing code translation methods and LLMs in terms of translation accuracy, with improvements of up to 10.2\%. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3691620.3695322) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program transformation](../../labels/program_transformation.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ASE2024/paper_41.md: -------------------------------------------------------------------------------- 1 | # Cross-lingual Code Clone Detection: When LLMs Fail Short Against Embedding-based Classifier 2 | 3 | **Authors**: Moumoula, Micheline Benedicte and Kabore, Abdoul Kader and Klein, Jacques and Bissyande, Tegawende F. 4 | 5 | **Abstract**: 6 | 7 | Cross-lingual code clone detection has gained attention in software development due to the use of multiple programming languages. Recent advances in machine learning, particularly Large Language Models (LLMs), have motivated a reexamination of this problem.This paper evaluates the performance of four LLMs and eight prompts for detecting cross-lingual code clones, as well as a pretrained embedding model for classifying clone pairs. Both approaches are tested on the XLCoST and CodeNet datasets.Our findings show that while LLMs achieve high F1 scores (up to 0.98) on straightforward programming examples, they struggle with complex cases and cross-lingual understanding. In contrast, embedding models, which map code fragments from different languages into a common representation space, allow for the training of a basic classifier that outperforms LLMs by approximately 2 and 24 percentage points on the XLCoST and CodeNet datasets, respectively. This suggests that embedding models provide more robust representations, enabling state-of-the-art performance in cross-lingual code clone detection. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3691620.3695335) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [code similarity analysis](../../labels/code_similarity_analysis.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ASIACCS2024/README.md: -------------------------------------------------------------------------------- 1 | # ASIACCS2024 2 | 3 | Number of papers: 1 4 | 5 | ## [An investigation into misuse of java security apis by large language models](paper_1.md) 6 | - **Authors**: Mousavi, Zahra and Islam, Chadni and Moore, Kristen and Abuadbba, Alsharif and Babar, M Ali 7 | - **Abstract**: The increasing trend of using Large Language Models (LLMs) for code generation raises the question of their capability to generate trustworthy code. While many researchers are exploring the utility of code generation for uncovering software vulnerabilities, one crucial but often overlooked aspect is the security Application Programming Interfaces (APIs). APIs play an integral role in upholding software security, yet effectively integrating security APIs presents substantial challenges. This lead... 
8 | - **Link**: [Read Paper](https://arxiv.org/html/2404.03823v1) 9 | - **Labels**: [code model](../../labels/code_model.md), [code model security](../../labels/code_model_security.md), [empirical study](../../labels/empirical_study.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/ASPLOS2024/README.md: -------------------------------------------------------------------------------- 1 | # ASPLOS2024 2 | 3 | Number of papers: 1 4 | 5 | ## [The Mutators Reloaded: Fuzzing Compilers with Large Language Model Generated Mutation Operators](paper_1.md) 6 | - **Authors**: Ou, Xianfei and Li, Cong and Jiang, Yanyan and Xu, Chang 7 | - **Abstract**: Crafting high-quality mutators–the core of mutation-based fuzzing that shapes the search space–is challenging. It requires human expertise and creativity, and their implementation demands knowledge of compiler internals. This paper presents MetaMut framework for developing new, useful mutators for compiler fuzzing. It integrates our compilerdomain knowledge into prompts and processes that can best harness the capabilities of a large language model. With MetaMut, we have successfully created 118 ... 8 | - **Link**: [Read Paper](https://connglli.github.io/pdfs/metamut_asplos24.pdf) 9 | - **Labels**: [program testing](../../labels/program_testing.md), [fuzzing](../../labels/fuzzing.md), [compiler testing](../../labels/compiler_testing.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/ASPLOS2024/paper_1.md: -------------------------------------------------------------------------------- 1 | # The Mutators Reloaded: Fuzzing Compilers with Large Language Model Generated Mutation Operators 2 | 3 | **Authors**: Ou, Xianfei and Li, Cong and Jiang, Yanyan and Xu, Chang 4 | 5 | **Abstract**: 6 | 7 | Crafting high-quality mutators–the core of mutation-based fuzzing that shapes the search space–is challenging. It requires human expertise and creativity, and their implementation demands knowledge of compiler internals. This paper presents MetaMut framework for developing new, useful mutators for compiler fuzzing. It integrates our compilerdomain knowledge into prompts and processes that can best harness the capabilities of a large language model. With MetaMut, we have successfully created 118 semantic-aware mutators at approximately $0.5 each, with only moderate human effort. With these mutators, our fuzzer uncovered 131 bugs in GCC and Clang, 129 of which were confirmed or fixed. The success of MetaMut suggests that the integration of AI into software and system engineering tasks traditionally thought to require expert human intervention could be a promising research direction. 
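To make the mutator-generation recipe sketched above concrete, here is an editor-added illustration of the general idea only — it is not MetaMut's actual prompts, API, or code, and every name below is invented. An LLM call (stubbed with a canned reply) proposes a textual mutation operator, which is then applied to a seed C program.

```python
import random
import re

def llm_propose_mutator(prompt: str) -> dict:
    """Placeholder for the LLM call: a real pipeline would send a prompt with
    compiler-domain instructions and examples, then parse the reply into a
    mutator description. Here a canned example is returned instead."""
    return {"name": "SwapRelationalOperator", "pattern": r"<", "replacement": ">"}

def apply_mutator(mutator: dict, program: str) -> str:
    """Apply a purely textual mutator at one random match in the seed program
    (semantics-aware mutators would rewrite the AST instead)."""
    matches = [m.start() for m in re.finditer(mutator["pattern"], program)]
    if not matches:
        return program
    pos = random.choice(matches)
    return program[:pos] + mutator["replacement"] + program[pos + 1:]

seed = "int main() { int a = 1; if (a < 2) { return a; } return 0; }"
mutator = llm_propose_mutator("Propose a new mutation operator for C programs.")
print(apply_mutator(mutator, seed))
```

A fuzzer would then compile each mutant with the compiler under test and flag crashes or miscompilations.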
8 | 9 | **Link**: [Read Paper](https://connglli.github.io/pdfs/metamut_asplos24.pdf) 10 | 11 | **Labels**: [program testing](../../labels/program_testing.md), [fuzzing](../../labels/fuzzing.md), [compiler testing](../../labels/compiler_testing.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/Apple2024/README.md: -------------------------------------------------------------------------------- 1 | # Apple2024 2 | 3 | Number of papers: 1 4 | 5 | ## [Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models](paper_1.md) 6 | - **Authors**: Mirzadeh, Iman and Alizadeh, Keivan and Shahrokhi, Hooman and Tuzel, Oncel and Bengio, Samy and Farajtabar, Mehrdad 7 | - **Abstract**: Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. ... 8 | - **Link**: [Read Paper](https://arxiv.org/pdf/2410.05229) 9 | - **Labels**: [hallucination in reasoning](../../labels/hallucination_in_reasoning.md), [empirical study](../../labels/empirical_study.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/CAV2024/README.md: -------------------------------------------------------------------------------- 1 | # CAV2024 2 | 3 | Number of papers: 1 4 | 5 | ## [Enchanting program specification synthesis by large language models using static analysis and program verification](paper_1.md) 6 | - **Authors**: Wen, Cheng and Cao, Jialun and Su, Jie and Xu, Zhiwu and Qin, Shengchao and He, Mengda and Li, Haokun and Cheung, Shing-Chi and Tian, Cong 7 | - **Abstract**: Formal verification provides a rigorous and systematic approach to ensure the correctness and reliability of software systems. Yet, constructing specifications for the full proof relies on domain expertise and non-trivial manpower. In view of such needs, an automated approach for specification synthesis is desired. While existing automated approaches are limited in their versatility, i.e., they either focus only on synthesizing loop invariants for numerical programs, or are tailored for specific... 8 | - **Link**: [Read Paper](https://arxiv.org/pdf/2404.00762.pdf) 9 | - **Labels**: [static analysis](../../labels/static_analysis.md), [program verification](../../labels/program_verification.md), [specification inference](../../labels/specification_inference.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/CC2025/README.md: -------------------------------------------------------------------------------- 1 | # CC2025 2 | 3 | Number of papers: 1 4 | 5 | ## [LLM Compiler: Foundation Language Models for Compiler Optimization](paper_1.md) 6 | - **Authors**: Cummins, Chris and Seeker, Volker and Grubisic, Dejan and Roziere, Baptiste and Gehring, Jonas and Synnaeve, Gabriel and Leather, Hugh 7 | - **Abstract**: Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of software engineering and coding tasks. 
However, their application in the domain of code and compiler optimization remains underexplored. Training LLMs is resource-intensive, requiring substantial GPU hours and extensive data collection, which can be prohibitive. To address this gap, we introduce LLM Compiler, a suite of robust, openly available, pre-trained models specifically designed for compiler tasks. ... 8 | - **Link**: [Read Paper](https://dl.acm.org/doi/pdf/10.1145/3708493.3712691) 9 | - **Labels**: [static analysis](../../labels/static_analysis.md), [program optimization](../../labels/program_optimization.md), [code model](../../labels/code_model.md), [code model training](../../labels/code_model_training.md), [IR code model](../../labels/IR_code_model.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/CCS2024/paper_5.md: -------------------------------------------------------------------------------- 1 | # An Exploration of Large Language Models in Malicious Source Code Detection 2 | 3 | **Authors**: Xue, Di and Zhao, Gang and Fan, Zhongqi and Li, Wei and Xu, Yahong and Liu, Zhen and Liu, Yin and Yuan, Zhongliang 4 | 5 | **Abstract**: 6 | 7 | Embedding malicious code within the software supply chain has become a significant concern in the information technology field. Current methods for detecting malicious code, based on signatures, behavior analysis, and traditional machine learning models, lack result interpretability. This study proposes a novel malicious code detection framework, Mal-LLM, which leverages the cost advantages of traditional machine learning models and the interpretability of LLMs. Initially, traditional machine learning models filter vast amounts of malicious source code in the software supply chain. Subsequently, LLMs analyze and interpret the filtered malicious source code using a customized prompt template incorporating role-playing and chain-of-thought techniques. The feasibility of the Mal-LLM framework is validated through extensive experimental analyses, examining the ambiguity and redundancy of the LLM in the framework, the significance of ''experience'' and ''malicious'' prompts, and exploring methods to reduce the cost of using LLMs from an enterprise perspective. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3658644.3691374) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/CCS2024/paper_6.md: -------------------------------------------------------------------------------- 1 | # Repairing Bugs with the Introduction of New Variables: A Multi-Agent Large Language Model 2 | 3 | **Authors**: Zhang, Elisa and Sun, Shiyu and Xing, Yunlong and Sun, Kun 4 | 5 | **Abstract**: 6 | 7 | Trained on billions of tokens, large language models (LLMs) have a broad range of empirical knowledge which enables them to generate software patches with complex repair patterns. We leverage the powerful code-fixing capabilities of LLMs and propose VarPatch, a multi-agent conversational automated program repair (APR) technique that iteratively queries the LLM to generate software patches by providing various prompts and context information. VarPatch focuses on the variable addition repair pattern, as previous APR tools struggle to introduce and use new variables to fix buggy code. 
Additionally, we summarize commonly used APIs and identify four repair patterns involving new variable addition. Our evaluation on the Defects4J 1.2 dataset shows that VarPatch can repair 69\% more bugs than baseline tools and over 8 times more bugs than GPT-4. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3658644.3691412) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program repair](../../labels/program_repair.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/CCS2024/paper_7.md: -------------------------------------------------------------------------------- 1 | # TAPChecker: Model Checking in Trigger-Action Rules Generation Using Large Language Models 2 | 3 | **Authors**: Bui, Huan and Lienerth, Harper and Fu, Chenglong and Sridhar, Meera 4 | 5 | **Abstract**: 6 | 7 | The integration of large language models (LLMs) in smart home systems holds significant promise for automating the generation of Trigger-Action Programming (TAP) rules, potentially streamlining smart home user experiences and enhancing convenience. However, LLMs lack of holistic view of smart home IoT deployments and may introduce TAP rules that result in hazards. This paper explores the application of LLM for generating TAP rules and applying formal verification to validate and ensure the safety of TAP rules generated by LLMs. By systematically analyzing and verifying these rules, we aim to identify and mitigate potential security vulnerabilities. Furthermore, we propose a feedback mechanism to refine the LLM's output, enhancing its reliability and safety in generating automation rules. Through this approach, we seek to bridge the gap between the efficiency of LLMs and the stringent security requirements of smart IoT systems, fostering a safer automation environment. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3658644.3691416) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [specification inference](../../labels/specification_inference.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/CCS2024/paper_8.md: -------------------------------------------------------------------------------- 1 | # Enhance Hardware Domain Specific Large Language Model with Reinforcement Learning for Resilience 2 | 3 | **Authors**: Fu, Weimin and Zhao, Yifang and Jin, Yier and Guo, Xiaolong 4 | 5 | **Abstract**: 6 | 7 | To enhance the performance of large language models (LLMs) on hardware design tasks, we focus on training with reinforcement learning(RL) to improve LLMs' syntax synthesis and functional verification performance. We observed significant gains in power, performance, and area (PPA) metrics by applying RL. Specifically, DeepSeek Code saw a 23.6\% performance increase, while the RTLCoder improved by 7.86\%. Our findings demonstrate the effectiveness of RL in refining LLMs for more accurate hardware generation, considering power and area consumption. This approach offers a promising direction for generating hardware resilient to side-channel attacks in computer systems. 
8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3658644.3691384) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/CGO2022/README.md: -------------------------------------------------------------------------------- 1 | # CGO2022 2 | 3 | Number of papers: 1 4 | 5 | ## [CompilerGym: robust, performant compiler optimization environments for AI research](paper_1.md) 6 | - **Authors**: Cummins, Chris and Wasti, Bram and Guo, Jiadong and Cui, Brandon and Ansel, Jason and Gomez, Sahir and Jain, Somya and Liu, Jia and Teytaud, Olivier and Steiner, Benoit and Tian, Yuandong and Leather, Hugh 7 | - **Abstract**: Interest in applying Artificial Intelligence (AI) techniques to compiler optimizations is increasing rapidly, but compiler research has a high entry barrier. Unlike in other domains, compiler and AI researchers do not have access to the datasets and frameworks that enable fast iteration and development of ideas, and getting started requires a significant engineering investment. What is needed is an easy, reusable experimental infrastructure for real world compiler optimization tasks that can ser... 8 | - **Link**: [Read Paper](https://doi.org/10.1109/CGO53902.2022.9741258) 9 | - **Labels**: [static analysis](../../labels/static_analysis.md), [program optimization](../../labels/program_optimization.md), [benchmark](../../labels/benchmark.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2020/README.md: -------------------------------------------------------------------------------- 1 | # EMNLP2020 2 | 3 | Number of papers: 1 4 | 5 | ## [Codebert: A pre-trained model for programming and natural languages](paper_1.md) 6 | - **Authors**: Feng, Zhangyin and Guo, Daya and Tang, Duyu and Duan, Nan and Feng, Xiaocheng and Gong, Ming and Shou, Linjun and Qin, Bing and Liu, Ting and Jiang, Daxin and others 7 | - **Abstract**: We present CodeBERT, a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc. We develop CodeBERT with Transformer-based neural architecture, and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled ... 8 | - **Link**: [Read Paper](https://arxiv.org/pdf/2208.09727) 9 | - **Labels**: [general coding task](../../labels/general_coding_task.md), [code model](../../labels/code_model.md), [code model training](../../labels/code_model_training.md), [source code model](../../labels/source_code_model.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2023/paper_12.md: -------------------------------------------------------------------------------- 1 | # MiniChain: A Small Library for Coding with Large Language Models 2 | 3 | **Authors**: Rush, Alexander 4 | 5 | **Abstract**: 6 | 7 | Programming augmented by large language models (LLMs) opens up many new application areas, but also requires care. LLMs are accurate enough, on average, to replace core functionality, yet make basic mistakes that demonstrate a lack of robustness.
An ecosystem of prompting tools, from intelligent agents to new programming languages, have emerged with different solutions for patching LLMs with other tools. In this work, we introduce MiniChain, an opinionated tool for LLM augmented programming, with the design goals of ease-of-use of prototyping, transparency through automatic visualization, and a minimalistic approach to advanced features. The MiniChain library provides core primitives for coding LLM calls, separating out prompt templates, and capturing program structure. The library includes demo implementations of the main applications papers in the area, including chat-bots, code generation, retrieval-based question answering, and complex information extraction. The library is open-source and available at https://github.com/srush/MiniChain, with code demos available at https://srush-minichain.hf.space/, and video demo at https://www.youtube.com/watch?v=VszZ1VnO7sk. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.18653/v1/2023.emnlp-demo.27) 10 | 11 | **Labels**: [general coding task](../../labels/general_coding_task.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2023/paper_13.md: -------------------------------------------------------------------------------- 1 | # Ranking llm-generated loop invariants for program verification 2 | 3 | **Authors**: Chakraborty, Saikat and Lahiri, Shuvendu K and Fakhoury, Sarah and Musuvathi, Madanlal and Lal, Akash and Rastogi, Aseem and Senthilnathan, Aditya and Sharma, Rahul and Swamy, Nikhil 4 | 5 | **Abstract**: 6 | 7 | Synthesizing inductive loop invariants is fundamental to automating program verification. In this work, we observe that Large Language Models (such as gpt-3.5 or gpt-4) are capable of synthesizing loop invariants for a class of programs in a 0-shot setting, yet require several samples to generate the correct invariants. This can lead to a large number of calls to a program verifier to establish an invariant. To address this issue, we propose a {\it re-ranking} approach for the generated results of LLMs. We have designed a ranker that can distinguish between correct inductive invariants and incorrect attempts based on the problem definition. The ranker is optimized as a contrastive ranker. Experimental results demonstrate that this re-ranking mechanism significantly improves the ranking of correct invariants among the generated candidates, leading to a notable reduction in the number of calls to a verifier. 
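As an editor-added illustration of the re-ranking idea (not the paper's ranker or code; the scorer and verifier below are toy stand-ins), candidate invariants are ordered by a score before the expensive verifier is called, stopping at the first candidate that checks.

```python
from typing import Callable, List, Optional

def rerank_then_verify(candidates: List[str],
                       score: Callable[[str], float],
                       verify: Callable[[str], bool]) -> Optional[str]:
    """Try the highest-scored invariants first so that fewer verifier calls
    are wasted on obviously poor candidates."""
    for inv in sorted(candidates, key=score, reverse=True):
        if verify(inv):              # expensive: one call to the verifier
            return inv
    return None

# Toy stand-ins: a real system would plug in a trained (e.g. contrastive)
# ranker and an actual program verifier.
candidates = ["x == 42", "x >= 0", "0 <= x && x <= n"]
score = lambda inv: inv.count("<=")             # hypothetical heuristic score
verify = lambda inv: inv == "0 <= x && x <= n"  # hypothetical verifier
print(rerank_then_verify(candidates, score, verify))
```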
8 | 9 | **Link**: [Read Paper](https://aclanthology.org/2023.findings-emnlp.614.pdf) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [program verification](../../labels/program_verification.md), [agent design](../../labels/agent_design.md), [prompt strategy](../../labels/prompt_strategy.md), [sampling and ranking](../../labels/sampling_and_ranking.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2023/paper_18.md: -------------------------------------------------------------------------------- 1 | # On Sample-Efficient Code Generation 2 | 3 | **Authors**: Han, Hojae and Kim, Yu Jin and Kim, Byoungjip and Lee, Youngwon and Lee, Kyungjae and Lee, Kyungmin and Lee, Moontae and Bae, Kyunghoon and Hwang, Seung-won 4 | 5 | **Abstract**: 6 | 7 | Large language models often struggle to predict runtime behavior in code generation tasks, leading to a reliance on rejection sampling (best-of-n) to generate multiple code snippets then select the best. Our distinction is reducing sampling costs, without compromising generation quality. We introduce EFFICODE, a novel framework that prioritizes sampling on test problems that models can solve. We show how EFFICODE estimates solvability to optimize computational costs during multiple sampling. Based on empirical evidence, EFFICODE consistently demonstrates reduced sampling budgets while maintaining comparable code generation performance, especially when problems are challenging. In addition, utilizing EFFICODE to rank sampled code snippets also shows its effectiveness in answer code selection for reducing temporal costs, by not requiring any execution or test case generation. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.18653/v1/2023.emnlp-industry.73) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2023/paper_4.md: -------------------------------------------------------------------------------- 1 | # Symbolic Planning and Code Generation for Grounded Dialogue 2 | 3 | **Authors**: Chiu, Justin and Zhao, Wenting and Chen, Derek and Vaduguru, Saujas and Rush, Alexander and Fried, Daniel 4 | 5 | **Abstract**: 6 | 7 | Large language models (LLMs) excel at processing and generating text and code. However, LLMs have had limited applicability in grounded task-oriented dialogue as they are difficult to steer toward task objectives and fail to handle novel grounding. We present a modular and interpretable grounded dialogue system that addresses these shortcomings by composing LLMs with a symbolic planner and grounded code execution. Our system, consists of a reader and planner: the reader leverages an LLM to convert partner utterances into executable code, calling functions that perform grounding. The translated code’s output is stored to track dialogue state, while a symbolic planner determines the next appropriate response. We evaluate our system’s performance on the demanding OneCommon dialogue task, involving collaborative reference resolution on abstract images of scattered dots. Our system substantially outperforms the previous state-of-the-art, including improving task success in human evaluations from 56% to 69% in the most challenging setting. 
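The following editor-added sketch illustrates the reader/planner split described above; all names are invented and this is not the paper's code. The reader turns a partner utterance into executable code that updates a grounded dialogue state, and a symbolic planner chooses the next response from that state.

```python
# Grounded dialogue state: which dots are still candidate referents.
state = {"candidates": {"dot1", "dot2", "dot3"}}

def select_dark_dots():
    """A grounding function that generated code may call."""
    state["candidates"] &= {"dot1", "dot3"}

def reader(utterance: str) -> str:
    """In the real system an LLM translates the utterance into code;
    here one translation is hard-coded for illustration."""
    return "select_dark_dots()" if "dark" in utterance else "pass"

def planner() -> str:
    """Symbolic rule: ask follow-up questions until one candidate remains."""
    if len(state["candidates"]) == 1:
        return f"Let's pick {next(iter(state['candidates']))}."
    return "Which of the remaining dots do you mean?"

exec(reader("I see two dark dots near the top"))   # update the dialogue state
print(planner())
```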
8 | 9 | **Link**: [Read Paper](https://doi.org/10.18653/v1/2023.emnlp-main.460) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md), [agent design](../../labels/agent_design.md), [planning](../../labels/planning.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2023/paper_6.md: -------------------------------------------------------------------------------- 1 | # Let’s Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs 2 | 3 | **Authors**: Aggarwal, Pranjal and Madaan, Aman and Yang, Yiming and Mausam 4 | 5 | **Abstract**: 6 | 7 | A popular approach for improving the correctness of output from large language models (LLMs) is Self-Consistency - poll the LLM multiple times and output the most frequent solution. Existing Self-Consistency techniques always generate a constant number of samples per question, where a better approach will be to non-uniformly distribute the available budget based on the amount of agreement in the samples generated so far. In response, we introduce Adaptive-Consistency, a cost-efficient, model-agnostic technique that dynamically adjusts the number of samples per question using a lightweight stopping criterion. Our experiments over 17 reasoning and code generation datasets and three LLMs demonstrate that Adaptive-Consistency reduces sample budget by up to 7.9 times with an average accuracy drop of less than 0.1% 8 | 9 | **Link**: [Read Paper](https://doi.org/10.18653/v1/2023.emnlp-main.761) 10 | 11 | **Labels**: [agent design](../../labels/agent_design.md), [prompt strategy](../../labels/prompt_strategy.md), [sampling and ranking](../../labels/sampling_and_ranking.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2023/paper_9.md: -------------------------------------------------------------------------------- 1 | # API-Assisted Code Generation for Question Answering on Varied Table Structures 2 | 3 | **Authors**: Cao, Yihan and Chen, Shuyi and Liu, Ryan and Wang, Zhiruo and Fried, Daniel 4 | 5 | **Abstract**: 6 | 7 | A persistent challenge to table question answering (TableQA) by generating executable programs has been adapting to varied table structures, typically requiring domain-specific logical forms. In response, this paper introduces a unified TableQA framework that: (1) provides a unified representation for structured tables as multi-index Pandas data frames, (2) uses Python as a powerful querying language, and (3) uses few-shot prompting to translate NL questions into Python programs, which are executable on Pandas data frames. Furthermore, to answer complex relational questions with extended program functionality and external knowledge, our framework allows customized APIs that Python programs can call. We experiment with four TableQA datasets that involve tables of different structures — relational, multi-table, and hierarchical matrix shapes — and achieve prominent improvements over past state-of-the-art systems. In ablation studies, we (1) show benefits from our multi-index representation and APIs over baselines that use only an LLM, and (2) demonstrate that our approach is modular and can incorporate additional APIs. 
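To illustrate the unified representation described above, here is an editor-added sketch (the data and column names are made up; this is not the paper's code): a hierarchical table becomes a multi-index Pandas frame, and the kind of short Python program a few-shot-prompted LLM would be asked to emit is then executed against it.

```python
import pandas as pd

# A small hierarchical table represented as a multi-index data frame.
data = {
    ("2023", "revenue"): [10, 20], ("2023", "cost"): [4, 9],
    ("2024", "revenue"): [12, 25], ("2024", "cost"): [5, 11],
}
df = pd.DataFrame(data, index=["north", "south"])
df.columns = pd.MultiIndex.from_tuples(df.columns, names=["year", "metric"])

# Question: "Which region had the higher profit in 2024?"
# The two lines below are the kind of program the LLM would generate.
profit_2024 = df[("2024", "revenue")] - df[("2024", "cost")]
print(profit_2024.idxmax())   # -> 'south'
```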
8 | 9 | **Link**: [Read Paper](https://doi.org/10.18653/v1/2023.emnlp-main.897) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2024/paper_1.md: -------------------------------------------------------------------------------- 1 | # TransLLaMa: LLM-based Simultaneous Translation System 2 | 3 | **Authors**: Koshkin, Roman and Sudoh, Katsuhito and Nakamura, Satoshi 4 | 5 | **Abstract**: 6 | 7 | Decoder-only large language models (LLMs) have recently demonstrated impressive capabilities in text generation and reasoning. Nonetheless, they have limited applications in simultaneous machine translation (SiMT), currently dominated by encoder-decoder transformers. This study demonstrates that, after fine-tuning on a small dataset comprising causally aligned source and target sentence pairs, a pre-trained open-source LLM can control input segmentation directly by generating a special “wait” token. This obviates the need for a separate policy and enables the LLM to perform English-German and English-Russian SiMT tasks with BLEU scores that are comparable to those of specific state-of-the-art baselines. We also evaluated closed-source models such as GPT-4, which displayed encouraging results in performing the SiMT task without prior training (zero-shot), indicating a promising avenue for enhancing future SiMT systems. 8 | 9 | **Link**: [Read Paper](https://aclanthology.org/2024.findings-emnlp.27) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program transformation](../../labels/program_transformation.md), [empirical study](../../labels/empirical_study.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2024/paper_12.md: -------------------------------------------------------------------------------- 1 | # Rethinking Code Refinement: Learning to Judge Code Efficiency 2 | 3 | **Authors**: Seo, Minju and Baek, Jinheon and Hwang, Sung Ju 4 | 5 | **Abstract**: 6 | 7 | Large Language Models (LLMs) have demonstrated impressive capabilities in understanding and generating codes. Due to these capabilities, many recent methods are proposed to automatically refine the codes with LLMs. However, we should rethink that the refined codes (from LLMs and even humans) are not always more efficient than their original versions. On the other hand, running two different versions of codes and comparing them every time is not ideal and time-consuming. Therefore, in this work, we propose a novel method based on the code language model that is trained to judge the efficiency between two different codes (generated across humans and machines) by either classifying the superior one or predicting the relative improvement. We validate our method on multiple programming languages with multiple refinement steps, demonstrating that the proposed method can effectively distinguish between more and less efficient versions of code. 
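The sketch below is an editor-added illustration of the judge-instead-of-execute interface described above (not the paper's model or code): a pairwise judge receives two versions of a snippet and predicts the more efficient one without running either; the crude heuristic merely stands in for a trained code language model.

```python
def judge_efficiency(code_a: str, code_b: str) -> str:
    """Return 'a' or 'b' without executing either snippet. A real judge would
    be a fine-tuned code LM emitting a class label or a predicted speedup."""
    def loop_depth(src: str) -> int:
        # Crude proxy for asymptotic cost: indentation of the deepest loop.
        return max((line.count("    ") for line in src.splitlines()
                    if line.lstrip().startswith("for ")), default=0)
    return "a" if loop_depth(code_a) <= loop_depth(code_b) else "b"

original = "for i in range(n):\n    for j in range(n):\n        t += a[i] * b[j]"
refined = "t = sum(a) * sum(b)"
print(judge_efficiency(original, refined))   # -> 'b'
```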
8 | 9 | **Link**: [Read Paper](https://aclanthology.org/2024.findings-emnlp.645) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program transformation](../../labels/program_transformation.md), [code model](../../labels/code_model.md), [code model training](../../labels/code_model_training.md), [source code model](../../labels/source_code_model.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2024/paper_13.md: -------------------------------------------------------------------------------- 1 | # Revisiting the Impact of Pursuing Modularity for Code Generation 2 | 3 | **Authors**: Kang, Deokyeong and Seo, KiJung and Kim, Taeuk 4 | 5 | **Abstract**: 6 | 7 | Modular programming, which aims to construct the final program by integrating smaller, independent building blocks, has been regarded as a desirable practice in software development. However, with the rise of recent code generation agents built upon large language models (LLMs), a question emerges: is this traditional practice equally effective for these new tools? In this work, we assess the impact of modularity in code generation by introducing a novel metric for its quantitative measurement. Surprisingly, unlike conventional wisdom on the topic, we find that modularity is not a core factor for improving the performance of code generation models. We also explore potential explanations for why LLMs do not exhibit a preference for modular code compared to non-modular code. 8 | 9 | **Link**: [Read Paper](https://aclanthology.org/2024.findings-emnlp.676) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md), [empirical study](../../labels/empirical_study.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2024/paper_15.md: -------------------------------------------------------------------------------- 1 | # On Leakage of Code Generation Evaluation Datasets 2 | 3 | **Authors**: Matton, Alexandre and Sherborne, Tom and Aumiller, Dennis and Tommasone, Elena and Alizadeh, Milad and He, Jingyi and Ma, Raymond and Voisin, Maxime and Gilsenan-McMahon, Ellen and Gallé, Matthias 4 | 5 | **Abstract**: 6 | 7 | In this paper, we consider contamination by code generation test sets, in particular in their use in modern large language models.We discuss three possible sources of such contamination and show findings supporting each of them: (i) direct data leakage, (ii) indirect data leakage through the use of synthetic data and (iii) overfitting to evaluation sets during model selection.To address this, we release Less Basic Python Problems (LBPP): an uncontaminated new benchmark of 161 prompts with their associated Python solutions. 
LBPP is released at https://huggingface.co/datasets/CohereForAI/lbpp 8 | 9 | **Link**: [Read Paper](https://aclanthology.org/2024.findings-emnlp.772) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md), [benchmark](../../labels/benchmark.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2024/paper_16.md: -------------------------------------------------------------------------------- 1 | # Impeding LLM-assisted Cheating in Introductory Programming Assignments via Adversarial Perturbation 2 | 3 | **Authors**: Salim, Saiful Islam and Yang, Rubin Yuchan and Cooper, Alexander and Ray, Suryashree and Debray, Saumya and Rahaman, Sazzadur 4 | 5 | **Abstract**: 6 | 7 | While Large language model (LLM)-based programming assistants such as CoPilot and ChatGPT can help improve the productivity of professional software developers, they can also facilitate cheating in introductory computer programming courses. Assuming instructors have limited control over the industrial-strength models, this paper investigates the baseline performance of 5 widely used LLMs on a collection of introductory programming problems, examines adversarial perturbations to degrade their performance, and describes the results of a user study aimed at measuring the efficacy of such perturbations in hindering actual code generation for introductory programming assignments. The user study suggests that i) perturbations combinedly reduced the average correctness score by 77%, ii) the drop in correctness caused by these perturbations was affected based on their detectability. 8 | 9 | **Link**: [Read Paper](https://aclanthology.org/2024.emnlp-main.27) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md), [empirical study](../../labels/empirical_study.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2024/paper_18.md: -------------------------------------------------------------------------------- 1 | # LLM4Decompile: Decompiling Binary Code with Large Language Models 2 | 3 | **Authors**: Tan, Hanzhuo and Luo, Qi and Li, Jing and Zhang, Yuqun 4 | 5 | **Abstract**: 6 | 7 | Decompilation aims to convert binary code to high-level source code, but traditional tools like Ghidra often produce results that are difficult to read and execute. Motivated by the advancements in Large Language Models (LLMs), we propose LLM4Decompile, the first and largest open-source LLM series (1.3B to 33B) trained to decompile binary code. We optimize the LLM training process and introduce the LLM4Decompile-End models to decompile binary directly. The resulting models significantly outperform GPT-4o and Ghidra on the HumanEval and ExeBench benchmarks by over 100% in terms of re-executability rate. Additionally, we improve the standard refinement approach to fine-tune the LLM4Decompile-Ref models, enabling them to effectively refine the decompiled code from Ghidra and achieve a further 16.2% improvement over the LLM4Decompile-End. LLM4Decompile demonstrates the potential of LLMs to revolutionize binary code decompilation, delivering remarkable improvements in readability and executability while complementing conventional tools for optimal results. 
8 | 9 | **Link**: [Read Paper](https://aclanthology.org/2024.emnlp-main.203) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [program decompilation](../../labels/program_decompilation.md), [code model](../../labels/code_model.md), [code model training](../../labels/code_model_training.md), [binary code model](../../labels/binary_code_model.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2024/paper_19.md: -------------------------------------------------------------------------------- 1 | # PTD-SQL: Partitioning and Targeted Drilling with LLMs in Text-to-SQL 2 | 3 | **Authors**: Luo, Ruilin and Wang, Liyuan and Lin, Binghuai and Lin, Zicheng and Yang, Yujiu 4 | 5 | **Abstract**: 6 | 7 | Large Language Models (LLMs) have emerged as powerful tools for Text-to-SQL tasks, exhibiting remarkable reasoning capabilities. Different from tasks such as math word problem and commonsense reasoning, SQL solutions have a relatively fixed pattern. This facilitates the investigation of whether LLMs can benefit from categorical thinking, mirroring how humans acquire knowledge through inductive reasoning based on comparable examples. In this study, we propose that employing query group partitioning allows LLMs to focus on learning the thought processes specific to a single problem type, consequently enhancing their reasoning abilities across diverse difficulty levels and problem categories. Our experiments reveal that multiple advanced LLMs, when equipped with PTD-SQL, can either surpass or match previous state-of-the-art (SOTA) methods on the Spider and BIRD datasets. Intriguingly, models with varying initial performances have exhibited significant improvements mainly at the boundary of their capabilities after targeted drilling, suggesting a parallel with human progress. Code is available at https://github.com/lrlbbzl/PTD-SQL. 8 | 9 | **Link**: [Read Paper](https://aclanthology.org/2024.emnlp-main.221) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md), [empirical study](../../labels/empirical_study.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2024/paper_2.md: -------------------------------------------------------------------------------- 1 | # Introducing Compiler Semantics into Large Language Models as Programming Language Translators: A Case Study of C to x86 Assembly 2 | 3 | **Authors**: Zhang, Shuoming and Zhao, Jiacheng and Xia, Chunwei and Wang, Zheng and Chen, Yunji and Cui, Huimin 4 | 5 | **Abstract**: 6 | 7 | Compilers are complex software containing millions of lines of code, taking years to develop. This paper investigates to what extent Large Language Models (LLMs) can replace hand-crafted compilers in translating high-level programming languages to machine instructions, using C to x86 assembly as a case study. We identify two challenges of using LLMs for code translation and introduce two novel data pre-processing techniques to address the challenges: numerical value conversion and training data resampling. While only using a 13B model, our approach achieves a behavioral accuracy of over 91%, outperforming the much larger GPT-4 Turbo model by over 50%. Our results are encouraging, showing that LLMs have the potential to transform how compilation tools are constructed. 
8 | 9 | **Link**: [Read Paper](https://aclanthology.org/2024.findings-emnlp.55) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program transformation](../../labels/program_transformation.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2024/paper_21.md: -------------------------------------------------------------------------------- 1 | # Socratic Human Feedback (SoHF): Expert Steering Strategies for LLM Code Generation 2 | 3 | **Authors**: Chidambaram, Subramanian and Li, Li Erran and Bai, Min and Li, Xiaopeng and Lin, Kaixiang and Zhou, Xiong and Williams, Alex C. 4 | 5 | **Abstract**: 6 | 7 | Large Language Models (LLMs) are increasingly used for generating code solutions, empowered by features like self-debugging and self-reflection. However, LLMs often struggle with complex programming problems without human guidance. This paper investigates the strategies employed by expert programmers to steer code-generating LLMs toward successful outcomes. Through a study involving experts using natural language to guide GPT-4, Gemini Ultra, and, Claude 3.5 Sonnet on highly difficult programming challenges, we frame our analysis using the “Socratic Feedback” paradigm for understanding effective steering strategies. By analyzing 30 conversational transcripts across all three models, we map observed feedback strategies to five stages of Socratic Questioning: Definition, Elenhus, Maieutic, Dialectic, and Counter-factual reasoning. We find evidence that by employing a combination of different Socratic feedback strategies across multiple turns, programmers successfully guided the models to solve 74% of the problems that the models initially failed to solve on their own. 8 | 9 | **Link**: [Read Paper](https://aclanthology.org/2024.findings-emnlp.908) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md), [empirical study](../../labels/empirical_study.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2024/paper_22.md: -------------------------------------------------------------------------------- 1 | # PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs 2 | 3 | **Authors**: Yadav, Ankit and Beniwal, Himanshu and Singh, Mayank 4 | 5 | **Abstract**: 6 | 7 | Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate these LLMs capabilities. We conducted a large-scale human evaluation of *HumanEval* and *MBPP*, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. Our findings unveil a critical bias towards a limited set of programming concepts, neglecting most of the other concepts entirely. Furthermore, we uncover a worrying prevalence of easy tasks that can inflate model performance estimations. To address these limitations, we propose a novel benchmark, *PythonSaga*, featuring 185 hand-crafted prompts in a balanced representation of 38 programming concepts across diverse difficulty levels. The robustness of our benchmark is demonstrated by the poor performance of existing Code-LLMs. The code and data set are openly available to the NLP community at this [URL](https://github.com/PythonSaga/PythonSaga). 
8 | 9 | **Link**: [Read Paper](https://aclanthology.org/2024.findings-emnlp.996) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md), [benchmark](../../labels/benchmark.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2024/paper_31.md: -------------------------------------------------------------------------------- 1 | # ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness? 2 | 3 | **Authors**: Waghjale, Siddhant and Veerendranath, Vishruth and Wang, Zhiruo and Fried, Daniel 4 | 5 | **Abstract**: 6 | 7 | Although large language models (LLMs) have been largely successful in generating functionally correct programs, conditioning models to produce efficient solutions while ensuring correctness remains a challenge. Further, unreliability in benchmarking code efficiency is a hurdle across varying hardware specifications for popular interpreted languages such as Python. In this paper, we present ECCO, a reproducible benchmark for evaluating program efficiency via two paradigms: natural language (NL) based code generation and history-based code editing. On ECCO, we adapt and thoroughly investigate the three most promising existing LLM-based approaches: in-context learning, iterative refinement with execution or NL feedback, and fine-tuning conditioned on execution and editing history. While most methods degrade functional correctness and moderately increase program efficiency, we find that adding execution information often helps maintain functional correctness, and NL feedback enhances more on efficiency. We release our benchmark to support future work on LLM-based generation of efficient code. 8 | 9 | **Link**: [Read Paper](https://aclanthology.org/2024.emnlp-main.859) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2024/paper_34.md: -------------------------------------------------------------------------------- 1 | # CoCoST: Automatic Complex Code Generation with Online Searching and Correctness Testing 2 | 3 | **Authors**: He, Xinyi and Zou, Jiaru and Lin, Yun and Zhou, Mengyu and Han, Shi and Yuan, Zejian and Zhang, Dongmei 4 | 5 | **Abstract**: 6 | 7 | Large Language Models have revolutionized code generation ability by converting natural language descriptions into executable code. However, generating complex code within real-world scenarios remains challenging due to intricate structures, subtle bugs, understanding of advanced data types, and lack of supplementary contents. To address these challenges, we introduce the CoCoST framework, which enhances complex code generation by online searching for more information with planned queries and correctness testing for code refinement. Moreover, CoCoST serializes the complex inputs and outputs to improve comprehension and generates test cases to ensure the adaptability for real-world applications. CoCoST is validated through rigorous experiments on the DS-1000 and ClassEval datasets. Experimental results show that CoCoST substantially improves the quality of complex code generation, highlighting its potential to enhance the practicality of LLMs in generating complex code. 
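As an editor-added sketch of the generate-test-search-refine loop described above (the function names and stubs are invented and are not CoCoST's API): when generated code fails its tests, a planned search query adds context and the code is regenerated.

```python
def generate_with_search(question, generate, run_tests, search, max_rounds=3):
    """Regenerate code with extra retrieved context until the tests pass or
    the round budget is exhausted."""
    context = ""
    code = generate(question, context)
    for _ in range(max_rounds):
        ok, feedback = run_tests(code)
        if ok:
            return code
        context += "\n" + search(question + " " + feedback)  # planned query
        code = generate(question, context)
    return code

# Toy stubs so the sketch runs end to end; a real pipeline would call an LLM,
# a sandboxed test runner, and an online search API instead.
generate = lambda q, ctx: "a + b" if ctx else "a - b"
run_tests = lambda code: (eval(code, {"a": 1, "b": 2}) == 3, "expected 1 + 2 == 3")
search = lambda q: "use the + operator to add numbers"
print(generate_with_search("add two numbers", generate, run_tests, search))
```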
8 | 9 | **Link**: [Read Paper](https://aclanthology.org/2024.emnlp-main.1082) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md), [benchmark](../../labels/benchmark.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2024/paper_35.md: -------------------------------------------------------------------------------- 1 | # CodeJudge: Evaluating Code Generation with Large Language Models 2 | 3 | **Authors**: Tong, Weixi and Zhang, Tianyi 4 | 5 | **Abstract**: 6 | 7 | Large Language Models (LLMs) have shown promising performance in code generation. However, how to reliably evaluate code generated by LLMs remains an unresolved problem. This paper presents CodeJudge, a code evaluation framework that leverages LLMs to evaluate the semantic correctness of generated code without the need for test cases. We investigate different ways to guide the LLM in performing “slow thinking” to arrive at an in-depth and reliable evaluation. We experimented with four LLMs as evaluators on four code generation datasets and five programming languages. The results show that CodeJudge significantly outperformed existing methods in most settings. Furthermore, compared with a SOTA GPT-3.5-based code evaluation method, CodeJudge achieved better results even when using a much smaller model, Llama-3-8B-Instruct. Our code and datasets are available on GitHub https://github.com/VichyTong/CodeJudge. 8 | 9 | **Link**: [Read Paper](https://aclanthology.org/2024.emnlp-main.1118) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2024/paper_38.md: -------------------------------------------------------------------------------- 1 | # Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code 2 | 3 | **Authors**: Chae, Hyungjoo and Kwon, Taeyoon and Moon, Seungjun and Song, Yongho and Kang, Dongjin and Ong, Kai Tzu-iunn and Kwak, Beong-woo and Bae, Seonghyeon and Hwang, Seung-won and Yeo, Jinyoung 4 | 5 | **Abstract**: 6 | 7 | This paper presents Coffee-Gym, a comprehensive RL environment for training models that provide feedback on code editing. Coffee-Gym includes two major components: (1) Coffee, a dataset containing humans’ code edit traces for coding questions and human-written feedback for editing erroneous code; (2) CoffeeEval, a reward function that faithfully reflects the helpfulness of feedback by assessing the performance of the revised code in unit tests. With them, Coffee-Gym addresses the unavailability of high-quality datasets for training feedback models with RL, and provides more accurate rewards than the SOTA reward model (i.e., GPT-4). By applying Coffee-Gym, we elicit feedback models that outperform baselines in enhancing open-source code LLMs’ code editing, making them comparable with closed-source LLMs. We make the dataset and the model checkpoint publicly available in https://huggingface.co/spaces/Coffee-Gym/Project-Coffee-Gym. 
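Editor-added sketch of a unit-test-based reward in the spirit of the environment described above (this is not CoffeeEval itself; the exact reward is defined in the paper): the revised code is scored by the fraction of test cases it passes.

```python
from typing import Callable, List, Tuple

def unit_test_reward(revised_fn: Callable, tests: List[Tuple[tuple, object]]) -> float:
    """Fraction of test cases the revised code passes; runtime errors count
    as failures."""
    passed = 0
    for args, expected in tests:
        try:
            passed += revised_fn(*args) == expected
        except Exception:
            pass
    return passed / len(tests)

# Hypothetical example: scoring a revision of an absolute-value function.
tests = [((3,), 3), ((-4,), 4), ((0,), 0)]
revised = lambda x: x if x >= 0 else -x
print(unit_test_reward(revised, tests))   # -> 1.0
```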
8 | 9 | **Link**: [Read Paper](https://aclanthology.org/2024.emnlp-main.1254) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program repair](../../labels/program_repair.md), [benchmark](../../labels/benchmark.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2024/paper_4.md: -------------------------------------------------------------------------------- 1 | # Sanitizing Large Language Models in Bug Detection with Data-Flow 2 | 3 | **Authors**: Wang, Chengpeng and Zhang, Wuqi and Su, Zian and Xu, Xiangzhe and Zhang, Xiangyu 4 | 5 | **Abstract**: 6 | 7 | Large language models (LLMs) show potential in code reasoning tasks, facilitating the customization of detecting bugs in software development. However, the hallucination effect can significantly compromise the reliability of bug reports. This work formulates a new schema of bug detection and presents a novel sanitization technique that detects false positives for hallucination mitigation. Our key idea is to enforce LLMs to emit data-flow paths in few-shot chain-of-thought prompting and validate them via the program-property decomposition. Specifically, we dissect data-flow paths into basic properties upon concise code snippets and leverage parsing-based analysis and LLMs for validation. Our approach averagely achieves 91.03% precision and 74.00% recall upon synthetic benchmarks and boosts the precision by 21.99% with the sanitization. The evaluation upon real-world Android malware applications also demonstrates the superiority over an industrial analyzer, surpassing the precision and recall by 15.36% and 3.61%, respectively. 8 | 9 | **Link**: [Read Paper](https://aclanthology.org/2024.findings-emnlp.217) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md), [data-flow analysis](../../labels/data-flow_analysis.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/EMNLP2024/paper_5.md: -------------------------------------------------------------------------------- 1 | # Stanceformer: Target-Aware Transformer for Stance Detection 2 | 3 | **Authors**: Garg, Krishna and Caragea, Cornelia 4 | 5 | **Abstract**: 6 | 7 | The task of Stance Detection involves discerning the stance expressed in a text towards a specific subject or target. Prior works have relied on existing transformer models that lack the capability to prioritize targets effectively. Consequently, these models yield similar performance regardless of whether we utilize or disregard target information, undermining the task’s significance. To address this challenge, we introduce Stanceformer, a target-aware transformer model that incorporates enhanced attention towards the targets during both training and inference. Specifically, we design a Target Awareness matrix that increases the self-attention scores assigned to the targets. We demonstrate the efficacy of the Stanceformer with various BERT-based models, including state-of-the-art models and Large Language Models (LLMs), and evaluate its performance across three stance detection datasets, alongside a zero-shot dataset. Our approach Stanceformer not only provides superior performance but also generalizes even to other domains, such as Aspect-based Sentiment Analysis. We make the code publicly available. 
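Editor-added sketch of the target-aware attention idea described above (the additive-bias form and the constant are illustrative assumptions, not the paper's exact Target Awareness matrix): raw self-attention scores get a boost at key positions belonging to the target span before the softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def target_aware_attention(scores: np.ndarray, target_mask: np.ndarray,
                           bias: float = 2.0) -> np.ndarray:
    """Add `bias` to every query's raw score for key positions where
    target_mask == 1, then normalise as usual."""
    return softmax(scores + bias * target_mask[None, :], axis=-1)

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 5))          # raw query-key scores for 5 tokens
target_mask = np.array([0, 0, 1, 1, 0])   # tokens 2 and 3 form the target span
print(target_aware_attention(scores, target_mask).round(2))
```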
8 | 9 | **Link**: [Read Paper](https://aclanthology.org/2024.findings-emnlp.286) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md), [empirical study](../../labels/empirical_study.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/FASE2024/README.md: -------------------------------------------------------------------------------- 1 | # FASE2024 2 | 3 | Number of papers: 1 4 | 5 | ## [Can ChatGPT support software verification?](paper_1.md) 6 | - **Authors**: Janßen, Christian and Richter, Cedric and Wehrheim, Heike 7 | - **Abstract**: Large language models have become increasingly effective in software engineering tasks such as code generation, debugging and repair. Language models like ChatGPT can not only generate code, but also explain its inner workings and in particular its correctness. This raises the question whether we can utilize ChatGPT to support formal software verification.... 8 | - **Link**: [Read Paper](https://arxiv.org/abs/2311.02433) 9 | - **Labels**: [static analysis](../../labels/static_analysis.md), [program verification](../../labels/program_verification.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/FASE2024/paper_1.md: -------------------------------------------------------------------------------- 1 | # Can ChatGPT support software verification? 2 | 3 | **Authors**: Janßen, Christian and Richter, Cedric and Wehrheim, Heike 4 | 5 | **Abstract**: 6 | 7 | Large language models have become increasingly effective in software engineering tasks such as code generation, debugging and repair. Language models like ChatGPT can not only generate code, but also explain its inner workings and in particular its correctness. This raises the question whether we can utilize ChatGPT to support formal software verification. 8 | 9 | **Link**: [Read Paper](https://arxiv.org/abs/2311.02433) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [program verification](../../labels/program_verification.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/FMCAD2024/README.md: -------------------------------------------------------------------------------- 1 | # FMCAD2024 2 | 3 | Number of papers: 1 4 | 5 | ## [Leveraging LLMs for Program Verification](paper_1.md) 6 | - **Authors**: Adharsh Kamath, Aditya Senthilnathan, Saikat Chakraborty, Pantazis Deligiannis, Shuvendu Lahiri, Akash Lal, Aseem Rastogi, Subhajit Roy, Rahul Sharma 7 | - **Abstract**: Loop invariants are fundamental to reasoning about programs with loops. They establish properties about a given loop’s behavior. When they additionally are inductive, they become useful for the task of formal verification that seeks to establish strong mathematical guarantees about program’s runtime behavior. The inductiveness ensures that the invariants can be checked locally without consulting the entire program, thus are indispensable artifacts in a formal proof of correctness. Finding induct...
8 | - **Link**: [Read Paper](https://www.microsoft.com/en-us/research/publication/finding-inductive-loop-invariants-using-large-language-models/) 9 | - **Labels**: [static analysis](../../labels/static_analysis.md), [program verification](../../labels/program_verification.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/FSE2023/paper_6.md: -------------------------------------------------------------------------------- 1 | # Assisting Static Analysis with Large Language Models: A ChatGPT Experiment 2 | 3 | **Authors**: Li, Haonan and Hao, Yu and Zhai, Yizhuo and Qian, Zhiyun 4 | 5 | **Abstract**: 6 | 7 | Recent advances of Large Language Models (LLMs), e.g., ChatGPT, exhibited strong capabilities of comprehending and responding to questions across a variety of domains. Surprisingly, ChatGPT even possesses a strong understanding of program code. In this paper, we investigate where and how LLMs can assist static analysis by asking appropriate questions. In particular, we target a specific bug-finding tool, which produces many false positives from the static analysis. In our evaluation, we find that these false positives can be effectively pruned by asking carefully constructed questions about function-level behaviors or function summaries. Specifically, with a pilot study of 20 false positives, we can successfully prune 8 out of 20 based on GPT-3.5, whereas GPT-4 had a near-perfect result of 16 out of 20, where the four failed ones are not currently considered/supported by our questions, e.g., involving concurrency. Additionally, it also identified one false negative case (a missed bug). We find LLMs a promising tool that can enable a more effective and efficient program analysis. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3611643.3613078) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/FSE2023/paper_7.md: -------------------------------------------------------------------------------- 1 | # LLM-Based Code Generation Method for Golang Compiler Testing 2 | 3 | **Authors**: Gu, Qiuhan 4 | 5 | **Abstract**: 6 | 7 | Modern optimizing compilers are among the most complex software systems humans build. One way to identify subtle compiler bugs is fuzzing. Both the quantity and the quality of testcases are crucial to the performance of fuzzing. Traditional testcase-generation methods, such as Csmith and YARPGen, have been proven successful at discovering compiler bugs. However, such generated testcases have limited coverage and quantity. In this paper, we present a code generation method for compiler testing based on LLM to maximize the quality and quantity of the generated code. In particular, to avoid undefined behavior and syntax errors in generated testcases, we design a filter strategy to clean the source code, preparing a high-quality dataset for the model training. Besides, we present a seed schedule strategy to improve code generation. We apply the method to test the Golang compiler and the result shows that our pipeline outperforms previous methods both qualitatively and quantitatively. It produces testcases with an average coverage of 3.38\%, in contrast to the testcases generated by GoFuzz, which have an average coverage of 0.44\%. 
Moreover, among all the generated testcases, only 2.79\% exhibited syntax errors, and none displayed undefined behavior. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3611643.3617850) 10 | 11 | **Labels**: [program testing](../../labels/program_testing.md), [fuzzing](../../labels/fuzzing.md), [compiler testing](../../labels/compiler_testing.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/FSE2024/paper_11.md: -------------------------------------------------------------------------------- 1 | # Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code? 2 | 3 | **Authors**: Kou, Bonan and Chen, Shengmai and Wang, Zhijie and Ma, Lei and Zhang, Tianyi 4 | 5 | **Abstract**: 6 | 7 | Large Language Models (LLMs) have recently been widely used for code generation. Due to the complexity and opacity of LLMs, little is known about how these models generate code. We made the first attempt to bridge this knowledge gap by investigating whether LLMs attend to the same parts of a task description as human programmers during code generation. An analysis of six LLMs, including GPT-4, on two popular code generation benchmarks revealed a consistent misalignment between LLMs' and programmers' attention. We manually analyzed 211 incorrect code snippets and found five attention patterns that can be used to explain many code generation errors. Finally, a user study showed that model attention computed by a perturbation-based method is often favored by human programmers. Our findings highlight the need for human-aligned LLMs for better interpretability and programmer trust. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3660807) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md), [empirical study](../../labels/empirical_study.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/FSE2024/paper_27.md: -------------------------------------------------------------------------------- 1 | # When fuzzing meets llms: Challenges and opportunities 2 | 3 | **Authors**: Jiang, Yu and Liang, Jie and Ma, Fuchen and Chen, Yuanliang and Zhou, Chijin and Shen, Yuheng and Wu, Zhiyong and Fu, Jingzhou and Wang, Mingzhe and Li, Shanshan and others 4 | 5 | **Abstract**: 6 | 7 | Fuzzing, a widely-used technique for bug detection, has seen advancements through Large Language Models (LLMs). Despite their potential, LLMs face specific challenges in fuzzing. In this paper, we identified five major challenges of LLM-assisted fuzzing. To support our findings, we revisited the most recent papers from top-tier conferences, confirming that these challenges are widespread. As a remedy, we propose some actionable recommendations to help improve applying LLM in Fuzzing and conduct preliminary evaluations on DBMS fuzzing. The results demonstrate that our recommendations effectively address the identified challenges. 
8 | 9 | **Link**: [Read Paper](https://dl.acm.org/doi/pdf/10.1145/3663529.3663784) 10 | 11 | **Labels**: [program testing](../../labels/program_testing.md), [fuzzing](../../labels/fuzzing.md), [survey](../../labels/survey.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/FSE2024/paper_4.md: -------------------------------------------------------------------------------- 1 | # SimLLM: Calculating Semantic Similarity in Code Summaries using a Large Language Model-Based Approach 2 | 3 | **Authors**: Jin, Xin and Lin, Zhiqiang 4 | 5 | **Abstract**: 6 | 7 | Code summaries are pivotal in software engineering, serving to improve code readability, maintainability, and collaboration. While recent advancements in Large Language Models (LLMs) have opened new avenues for automatic code summarization, existing metrics for evaluating summary quality, such as BLEU and BERTScore, have notable limitations. Specifically, these existing metrics either fail to capture the nuances of semantic meaning in summaries or are further limited in understanding domain-specific terminologies and expressions prevalent in code summaries. In this paper, we present SimLLM, a novel LLM-based approach designed to more precisely evaluate the semantic similarity of code summaries. Built upon an autoregressive LLM using a specialized pretraining task on permutated inputs and a pooling-based pairwise similarity measure, SimLLM overcomes the shortcomings of existing metrics. Our empirical evaluations demonstrate that SimLLM not only outperforms existing metrics but also shows a significantly high correlation with human ratings. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3660769) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [code summarization](../../labels/code_summarization.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/FSE2024/paper_5.md: -------------------------------------------------------------------------------- 1 | # A Quantitative and Qualitative Evaluation of LLM-Based Explainable Fault Localization 2 | 3 | **Authors**: Kang, Sungmin and An, Gabin and Yoo, Shin 4 | 5 | **Abstract**: 6 | 7 | Fault Localization (FL), in which a developer seeks to identify which part of the code is malfunctioning and needs to be fixed, is a recurring challenge in debugging. To reduce developer burden, many automated FL techniques have been proposed. However, prior work has noted that existing techniques fail to provide rationales for the suggested locations, hindering developer adoption of these techniques. With this in mind, we propose AutoFL, a Large Language Model (LLM)-based FL technique that generates an explanation of the bug along with a suggested fault location. AutoFL prompts an LLM to use function calls to navigate a repository, so that it can effectively localize faults over a large software repository and overcome the limit of the LLM context length. Extensive experiments on 798 real-world bugs in Java and Python reveal AutoFL improves method-level acc@1 by up to 233.3\% over baselines. Furthermore, developers were interviewed on their impression of AutoFL-generated explanations, showing that developers generally liked the natural language explanations of AutoFL, and that they preferred reading a few, high-quality explanations instead of many. 
8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3660771) 10 | 11 | **Labels**: [program testing](../../labels/program_testing.md), [debugging](../../labels/debugging.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/FSE2025/README.md: -------------------------------------------------------------------------------- 1 | # FSE2025 2 | 3 | Number of papers: 1 4 | 5 | ## [AlphaTrans: A Neuro-Symbolic Compositional Approach for Repository-Level Code Translation and Validation](paper_1.md) 6 | - **Authors**: Ali Reza Ibrahimzada, Kaiyao Ke, Mrigank Pawagi, Muhammad Salman Abid, Rangeet Pan, Saurabh Sinha, Reyhaneh Jabbarvand 7 | - **Abstract**: Code translation transforms programs from one programming language (PL) to another. Several rule-based transpilers have been designed to automate code translation between different pairs of PLs. However, the rules can become obsolete as the PLs evolve and cannot generalize to other PLs. Recent studies have explored the automation of code translation using Large Language Models (LLMs). One key observation is that such techniques may work well for crafted benchmarks but fail to generalize to the s... 8 | - **Link**: [Read Paper](https://arxiv.org/pdf/2410.24117) 9 | - **Labels**: [code generation](../../labels/code_generation.md), [program transformation](../../labels/program_transformation.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/Forge2024/README.md: -------------------------------------------------------------------------------- 1 | # Forge2024 2 | 3 | Number of papers: 1 4 | 5 | ## [The Emergence of Large Language Models in Static Analysis: A First Look through Micro-Benchmarks](paper_1.md) 6 | - **Authors**: Venkatesh, Ashwin Prasad Shivarpatna and Sabu, Samkutty and Mir, Amir M and Reis, Sofia and Bodden, Eric 7 | - **Abstract**: Binary code similarity detection(BCSD), as a fundamental technique in software security, has various applications, including malware family detection, known vulnerability detection and code plagiarism detection. Recent deep learning-based BCSD approaches have demonstrated promising performance. However, they face two significant challenges that limit detection performance. First, most approaches that use sequence networks (like RNN and Transformer) utilize coarse-grained tokenization methods, wh... 8 | - **Link**: [Read Paper](https://doi.org/10.1145/3650105.3652288) 9 | - **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md), [empirical study](../../labels/empirical_study.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/Galois2024/README.md: -------------------------------------------------------------------------------- 1 | # Galois2024 2 | 3 | Number of papers: 1 4 | 5 | ## [Function Argument Nullability Using an LLM](paper_1.md) 6 | - **Authors**: Galois 7 | - **Abstract**: We think that Rust is a great language, and maybe you agree! Unfortunately, even if you do, there’s a good chance whatever application you’re working on is written in some older language such as C. To help with this, Galois has been developing c2rust, an automated transpiler (source-to-source translator) from C code into Rust code. c2rust can take almost any C and turn it into C-like Rust code, the first step in creating a new Rust application. And we’re building more features to turn C into saf... 
8 | - **Link**: [Read Paper](https://galois.com/blog/2024/11/function-argument-nullability-using-an-llm/) 9 | - **Labels**: [static analysis](../../labels/static_analysis.md), [pointer analysis](../../labels/pointer_analysis.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/Galois2024/paper_1.md: -------------------------------------------------------------------------------- 1 | # Function Argument Nullability Using an LLM 2 | 3 | **Authors**: Galois 4 | 5 | **Abstract**: 6 | 7 | We think that Rust is a great language, and maybe you agree! Unfortunately, even if you do, there’s a good chance whatever application you’re working on is written in some older language such as C. To help with this, Galois has been developing c2rust, an automated transpiler (source-to-source translator) from C code into Rust code. c2rust can take almost any C and turn it into C-like Rust code, the first step in creating a new Rust application. And we’re building more features to turn C into safe, idiomatic Rust code. Recently, we have been experimenting with LLMs to help with transpilation from C to Rust. This blog describes one such experiment, where we built an analysis for determining nullability of function arguments in C. This is a necessary stage in the c2rust translation pipeline, and we already have an existing interprocedural static analysis tool that performs this task. We built a companion LLM-based tool using GPT-4o, and compared the performance between our static and LLM-based analysis. 8 | 9 | **Link**: [Read Paper](https://galois.com/blog/2024/11/function-argument-nullability-using-an-llm/) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [pointer analysis](../../labels/pointer_analysis.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/Google2023/README.md: -------------------------------------------------------------------------------- 1 | # Google2023 2 | 3 | Number of papers: 1 4 | 5 | ## [Chain of code: Reasoning with a language model-augmented code emulator](paper_1.md) 6 | - **Authors**: Li, Chengshu and Liang, Jacky and Zeng, Andy and Chen, Xinyun and Hausman, Karol and Sadigh, Dorsa and Levine, Sergey and Fei-Fei, Li and Xia, Fei and Ichter, Brian 7 | - **Abstract**: Code provides a general syntactic structure to build complex programs and perform precise computations when paired with a code interpreter -- we hypothesize that language models (LMs) can leverage code-writing to improve Chain of Thought reasoning not only for logic and arithmetic tasks, but also for linguistic ones (and in particular, those that are a mix of both). For example, consider prompting an LM to write code that counts the number of times it detects sarcasm in an essay: the LM may stru... 
8 | - **Link**: [Read Paper](https://arxiv.org/pdf/2312.04474.pdf) 9 | - **Labels**: [agent design](../../labels/agent_design.md), [prompt strategy](../../labels/prompt_strategy.md), [reason with code](../../labels/reason_with_code.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/Google2024/README.md: -------------------------------------------------------------------------------- 1 | # Google2024 2 | 3 | Number of papers: 2 4 | 5 | ## [Evaluating Offensive Security Capabilities of Large Language Models](paper_2.md) 6 | - **Authors**: Google 7 | - **Abstract**: At Project Zero, we constantly seek to expand the scope and effectiveness of our vulnerability research. Though much of our work still relies on traditional methods like manual source code audits and reverse engineering, we're always looking for new approaches.... 8 | - **Link**: [Read Paper](https://googleprojectzero.blogspot.com/2024/06/project-naptime.html) 9 | - **Labels**: [program testing](../../labels/program_testing.md), [vulnerability exploitation](../../labels/vulnerability_exploitation.md) 10 | 11 | 12 | ## [From Naptime to Big Sleep: Using Large Language Models To Catch Vulnerabilities In Real-World Code](paper_1.md) 13 | - **Authors**: Google 14 | - **Abstract**: In our previous post, Project Naptime: Evaluating Offensive Security Capabilities of Large Language Models, we introduced our framework for large-language-model-assisted vulnerability research and demonstrated its potential by improving the state-of-the-art performance on Meta's CyberSecEval2 benchmarks. Since then, Naptime has evolved into Big Sleep, a collaboration between Google Project Zero and Google DeepMind.... 15 | - **Link**: [Read Paper](https://googleprojectzero.blogspot.com/2024/10/from-naptime-to-big-sleep.html) 16 | - **Labels**: [program testing](../../labels/program_testing.md), [vulnerability exploitation](../../labels/vulnerability_exploitation.md) 17 | -------------------------------------------------------------------------------- /data/papers/venues/Google2024/paper_1.md: -------------------------------------------------------------------------------- 1 | # From Naptime to Big Sleep: Using Large Language Models To Catch Vulnerabilities In Real-World Code 2 | 3 | **Authors**: Google 4 | 5 | **Abstract**: 6 | 7 | In our previous post, Project Naptime: Evaluating Offensive Security Capabilities of Large Language Models, we introduced our framework for large-language-model-assisted vulnerability research and demonstrated its potential by improving the state-of-the-art performance on Meta's CyberSecEval2 benchmarks. Since then, Naptime has evolved into Big Sleep, a collaboration between Google Project Zero and Google DeepMind. 8 | 9 | **Link**: [Read Paper](https://googleprojectzero.blogspot.com/2024/10/from-naptime-to-big-sleep.html) 10 | 11 | **Labels**: [program testing](../../labels/program_testing.md), [vulnerability exploitation](../../labels/vulnerability_exploitation.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/Google2024/paper_2.md: -------------------------------------------------------------------------------- 1 | # Evaluating Offensive Security Capabilities of Large Language Models 2 | 3 | **Authors**: Google 4 | 5 | **Abstract**: 6 | 7 | At Project Zero, we constantly seek to expand the scope and effectiveness of our vulnerability research. 
Though much of our work still relies on traditional methods like manual source code audits and reverse engineering, we're always looking for new approaches. 8 | 9 | **Link**: [Read Paper](https://googleprojectzero.blogspot.com/2024/06/project-naptime.html) 10 | 11 | **Labels**: [program testing](../../labels/program_testing.md), [vulnerability exploitation](../../labels/vulnerability_exploitation.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ICLR2021/README.md: -------------------------------------------------------------------------------- 1 | # ICLR2021 2 | 3 | Number of papers: 1 4 | 5 | ## [Graphcodebert: Pre-training code representations with data flow](paper_1.md) 6 | - **Authors**: Guo, Daya and Ren, Shuo and Lu, Shuai and Feng, Zhangyin and Tang, Duyu and Liu, Shujie and Zhou, Long and Duan, Nan and Svyatkovskiy, Alexey and Fu, Shengyu and others 7 | - **Abstract**: Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming language that considers the inh... 8 | - **Link**: [Read Paper](https://arxiv.org/pdf/2302.05319) 9 | - **Labels**: [general coding task](../../labels/general_coding_task.md), [code model](../../labels/code_model.md), [code model training](../../labels/code_model_training.md), [source code model](../../labels/source_code_model.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/ICLR2023/README.md: -------------------------------------------------------------------------------- 1 | # ICLR2023 2 | 3 | Number of papers: 1 4 | 5 | ## [Is Self-Repair a Silver Bullet for Code Generation?](paper_1.md) 6 | - **Authors**: Olausson, Theo X and Inala, Jeevana Priya and Wang, Chenglong and Gao, Jianfeng and Solar-Lezama, Armando 7 | - **Abstract**: Large language models have shown remarkable aptitude in code generation, but still struggle to perform complex tasks. Self-repair---in which the model debugs and repairs its own code---has recently become a popular way to boost performance in these settings. However, despite its increasing popularity, existing studies of self-repair have been limited in scope; in many settings, its efficacy thus remains poorly understood. In this paper, we analyze Code Llama, GPT-3.5 and GPT-4's ability to perfo... 8 | - **Link**: [Read Paper](https://openreview.net/forum?id=y0GJXRungR) 9 | - **Labels**: [code generation](../../labels/code_generation.md), [program repair](../../labels/program_repair.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/ICLR2023/paper_1.md: -------------------------------------------------------------------------------- 1 | # Is Self-Repair a Silver Bullet for Code Generation? 2 | 3 | **Authors**: Olausson, Theo X and Inala, Jeevana Priya and Wang, Chenglong and Gao, Jianfeng and Solar-Lezama, Armando 4 | 5 | **Abstract**: 6 | 7 | Large language models have shown remarkable aptitude in code generation, but still struggle to perform complex tasks. 
Self-repair---in which the model debugs and repairs its own code---has recently become a popular way to boost performance in these settings. However, despite its increasing popularity, existing studies of self-repair have been limited in scope; in many settings, its efficacy thus remains poorly understood. In this paper, we analyze Code Llama, GPT-3.5 and GPT-4's ability to perform self-repair on problems taken from HumanEval and APPS. We find that when the cost of carrying out repair is taken into account, performance gains are often modest, vary a lot between subsets of the data, and are sometimes not present at all. We hypothesize that this is because self-repair is bottlenecked by the model's ability to provide feedback on its own code; using a stronger model to artificially boost the quality of the feedback, we observe substantially larger performance gains. Similarly, a small-scale study in which we provide GPT-4 with feedback from human participants suggests that even for the strongest models, self-repair still lags far behind what can be achieved with human-level debugging. 8 | 9 | **Link**: [Read Paper](https://openreview.net/forum?id=y0GJXRungR) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program repair](../../labels/program_repair.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ICLR2024/paper_6.md: -------------------------------------------------------------------------------- 1 | # Lemur: Integrating large language models in automated program verification 2 | 3 | **Authors**: Wu, Haoze and Barrett, Clark and Narodytska, Nina 4 | 5 | **Abstract**: 6 | 7 | The demonstrated code-understanding capability of LLMs raises the question of whether they can be used for automated program verification, a task that typically demands high-level abstract reasoning about program properties that is challenging for verification tools. We propose a general methodology to combine the power of LLMs and automated reasoners for automated program verification. We formally describe this methodology as a set of derivation rules and prove its soundness. We instantiate the calculus as a sound automated verification procedure, which led to practical improvements on a set of synthetic and competition benchmarks. 8 | 9 | **Link**: [Read Paper](https://openreview.net/forum?id=Q3YaCghZNt) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [program verification](../../labels/program_verification.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ICLR2025/paper_1.md: -------------------------------------------------------------------------------- 1 | # Type-Aware Constraining for Code LLMs 2 | 3 | **Authors**: Niels Mündler, Jingxuan He, Hao Wang, Koushik Sen, Dawn Song, Martin Vechev 4 | 5 | **Abstract**: 6 | 7 | Large Language Models (LLMs) have achieved notable success in code generation. However, they still frequently produce invalid code, as they do not precisely model formal aspects of programming languages. Constrained decoding is a promising approach to alleviate this issue and has been successfully applied to domain-specific languages and syntactic features, but is not able to enforce more semantic features, such as well-typedness. To address this issue, we introduce type-aware constrained decoding. 
We develop a novel prefix automata formalism and introduce a sound approach to guarantee existence of a type-safe completion of a partial program based on type inference and a search over inhabitable types. We implement type-aware constraining first for a foundational simply-typed language, then extend it to TypeScript. In our evaluation across state-of-the-art open-weight LLMs of up to 34B parameters and various model families, type-aware constraining reduces compilation errors by on average 70.9% and increases functional correctness by 16.2% in code synthesis, translation, and repair tasks. 8 | 9 | **Link**: [Read Paper](https://openreview.net/forum?id=DNAapYMXkc) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [code completion](../../labels/code_completion.md), [code model](../../labels/code_model.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ICLR2025/paper_2.md: -------------------------------------------------------------------------------- 1 | # AbsInt-AI: Language Models for Abstract Interpretation 2 | 3 | **Authors**: Michael Wang, Kexin Pei, Armando Solar-Lezama 4 | 5 | **Abstract**: 6 | 7 | Static program analysis is a popular technique in software engineering. Traditional static analysis algorithms treat programs as sets of logical statements with well-defined semantics. These traditional analyzers can provide guarantees of their performance, such as guaranteeing that they will never miss a bug. However, they leave out lots of very rich information such as variable and field names. Language models for code on the other hand, take full advantage of information such as variable names, but it is extremely difficult to provide guarantees of their output. In this work, we present ABSINT-AI, a language model augmented static analyzer based on abstract interpretation that combines the best of both worlds. Using a language model in ABSINT-AI achieves up to a 70% decrease in false positives for bug detection while providing guarantees of never missing a bug. 8 | 9 | **Link**: [Read Paper](https://openreview.net/forum?id=3RP6YmKo59) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ICML2021/paper_2.md: -------------------------------------------------------------------------------- 1 | # Programl: A graph-based program representation for data flow analysis and compiler optimizations 2 | 3 | **Authors**: Cummins, Chris and Fisches, Zacharias V and Ben-Nun, Tal and Hoefler, Torsten and O’Boyle, Michael FP and Leather, Hugh 4 | 5 | **Abstract**: 6 | 7 | Machine learning (ML) is increasingly seen as a viable approach for building compiler optimization heuristics, but many ML methods cannot replicate even the simplest of the data flow analyses that are critical to making good optimization decisions. We posit that if ML cannot do that, then it is insufficiently able to reason about programs. We formulate data flow analyses as supervised learning tasks and introduce a large open dataset of programs and their corresponding labels from several analyses. We use this dataset to benchmark ML methods and show that they struggle on these fundamental program reasoning tasks. We propose ProGraML-Program Graphs for Machine Learning-a language-independent, portable representation of program semantics. 
ProGraML overcomes the limitations of prior works and yields improved performance on downstream optimization tasks. 8 | 9 | **Link**: [Read Paper](https://proceedings.mlr.press/v139/cummins21a/cummins21a.pdf) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [data-flow analysis](../../labels/data-flow_analysis.md), [program optimization](../../labels/program_optimization.md), [code model](../../labels/code_model.md), [code model training](../../labels/code_model_training.md), [IR code model](../../labels/IR_code_model.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ICML2023/paper_1.md: -------------------------------------------------------------------------------- 1 | # LongCoder: {A} Long-Range Pre-trained Language Model for Code Completion 2 | 3 | **Authors**: Daya Guo and Canwen Xu and Nan Duan and Jian Yin and Julian J. McAuley 4 | 5 | **Abstract**: 6 | 7 | In this paper, we introduce a new task for code completion that focuses on handling long code input and propose a sparse Transformer model, called LongCoder, to address this task. LongCoder employs a sliding window mechanism for self-attention and introduces two types of globally accessible tokens-bridge tokens and memory tokens-to improve performance and efficiency. Bridge tokens are inserted throughout the input sequence to aggregate local information and facilitate global interaction, while memory tokens are included to highlight important statements that may be invoked later and need to be memorized, such as package imports and definitions of classes, functions, or structures. We conduct experiments on a newly constructed dataset that contains longer code context and the publicly available CodeXGLUE benchmark. Experimental results demonstrate that LongCoder achieves superior performance on code completion tasks compared to previous models while maintaining comparable efficiency in terms of computational resources during inference. 8 | 9 | **Link**: [Read Paper](https://proceedings.mlr.press/v202/guo23j.html) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [code completion](../../labels/code_completion.md), [code model](../../labels/code_model.md), [code model training](../../labels/code_model_training.md), [source code model](../../labels/source_code_model.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ICML2023/paper_3.md: -------------------------------------------------------------------------------- 1 | # Can large language models reason about program invariants? 2 | 3 | **Authors**: Pei, Kexin and Bieber, David and Shi, Kensen and Sutton, Charles and Yin, Pengcheng 4 | 5 | **Abstract**: 6 | 7 | Identifying invariants is an important program analysis task with applications towards program understanding, bug finding, vulnerability analysis, and formal verification. Existing tools for identifying program invariants rely on dynamic analysis, requiring traces collected from multiple executions in order to produce reliable invariants. We study the application of large language models to invariant prediction, finding that models trained on source code and fine-tuned for invariant generation can perform invariant prediction as static rather than dynamic analysis. 
Using a scratchpad approach where invariants are predicted sequentially through a program gives the best performance, finding invariants statically of quality comparable to those obtained by a dynamic analysis tool with access to five program traces. 8 | 9 | **Link**: [Read Paper](https://openreview.net/pdf?id=mXv2aVqUGG) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [program verification](../../labels/program_verification.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ICML2024/paper_1.md: -------------------------------------------------------------------------------- 1 | # Instruction tuning for secure code generation 2 | 3 | **Authors**: He, Jingxuan and Vero, Mark and Krasnopolska, Gabriela and Vechev, Martin 4 | 5 | **Abstract**: 6 | 7 | Modern language models (LMs) have gained widespread acceptance in everyday and professional contexts, particularly in programming. An essential procedure enabling this adoption is instruction tuning, which substantially enhances LMs' practical utility by training them to follow user instructions and human preferences. However, existing instruction tuning schemes overlook a crucial aspect: the security of generated code. As a result, even the state-of-the-art instruction-tuned LMs frequently produce unsafe code, posing significant security risks. In this work, we introduce SafeCoder to address this gap. SafeCoder performs security-centric fine-tuning using a diverse and high-quality dataset that we collected using an automated pipeline. We integrate the security fine-tuning with standard instruction tuning, to facilitate a joint optimization of both security and utility. Despite its simplicity, we show that SafeCoder is effective across a variety of popular LMs and datasets. It is able to drastically improve security (by about 30%), while preserving utility. 8 | 9 | **Link**: [Read Paper](https://arxiv.org/pdf/2405.00218) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [code model](../../labels/code_model.md), [code model security](../../labels/code_model_security.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ICML2025/paper_1.md: -------------------------------------------------------------------------------- 1 | # PROSEC: Fortifying Code LLMs with Proactive Security Alignment 2 | 3 | **Authors**: Xiangzhe Xu, Zian Su, Jinyao Guo, Kaiyuan Zhang, Zhenting Wang, Xiangyu Zhang 4 | 5 | **Abstract**: 6 | 7 | Modern language models (LMs) have gained widespread acceptance in everyday and professional contexts, particularly in programming. An essential procedure enabling this adoption is instruction tuning, which substantially enhances LMs' practical utility by training them to follow user instructions and human preferences. However, existing instruction tuning schemes overlook a crucial aspect: the security of generated code. As a result, even the state-of-the-art instruction-tuned LMs frequently produce unsafe code, posing significant security risks. In this work, we introduce SafeCoder to address this gap. SafeCoder performs security-centric fine-tuning using a diverse and high-quality dataset that we collected using an automated pipeline. We integrate the security fine-tuning with standard instruction tuning, to facilitate a joint optimization of both security and utility. Despite its simplicity, we show that SafeCoder is effective across a variety of popular LMs and datasets. 
It is able to drastically improve security (by about 30%), while preserving utility. 8 | 9 | **Link**: [Read Paper](https://arxiv.org/abs/2411.12882) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [code model](../../labels/code_model.md), [code model security](../../labels/code_model_security.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ICSE2023/paper_1.md: -------------------------------------------------------------------------------- 1 | # CodaMosa: Escaping Coverage Plateaus in Test Generation with Pre-Trained Large Language Models 2 | 3 | **Authors**: Lemieux, Caroline and Inala, Jeevana Priya and Lahiri, Shuvendu K. and Sen, Siddhartha 4 | 5 | **Abstract**: 6 | 7 | Search-based software testing (SBST) generates high-coverage test cases for programs under test with a combination of test case generation and mutation. SBST's performance relies on there being a reasonable probability of generating test cases that exercise the core logic of the program under test. Given such test cases, SBST can then explore the space around them to exercise various parts of the program. This paper explores whether Large Language Models (LLMs) of code, such as OpenAI's Codex, can be used to help SBST's exploration. Our proposed algorithm, CodaMosa, conducts SBST until its coverage improvements stall, then asks Codex to provide example test cases for under-covered functions. These examples help SBST redirect its search to more useful areas of the search space. On an evaluation over 486 benchmarks, CodaMosa achieves statistically significantly higher coverage on many more benchmarks (173 and 279) than it reduces coverage on (10 and 4), compared to SBST and LLM-only baselines. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1109/ICSE48619.2023.00085) 10 | 11 | **Labels**: [program testing](../../labels/program_testing.md), [fuzzing](../../labels/fuzzing.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ICSE2023/paper_8.md: -------------------------------------------------------------------------------- 1 | # On the Applicability of Language Models to Block-Based Programs 2 | 3 | **Authors**: Griebl, Elisabeth and Fein, Benedikt and Obermüller, Florian and Fraser, Gordon and Just, René 4 | 5 | **Abstract**: 6 | 7 | Block-based programming languages like SCRATCH are increasingly popular for programming education and end-user programming. Recent program analyses build on the insight that source code can be modelled using techniques from natural language processing. Many of the regularities of source code that support this approach are due to the syntactic overhead imposed by textual programming languages. This syntactic overhead, however, is precisely what block-based languages remove in order to simplify programming. Consequently, it is unclear how well this modelling approach performs on block-based programming languages. In this paper, we investigate the applicability of language models for the popular block-based programming language SCRATCH. We model SCRATCH programs using n-gram models, the most essential type of language model, and transformers, a popular deep learning model. Evaluation on the example tasks of code completion and bug finding confirm that blocks inhibit predictability, but the use of language models is nevertheless feasible. Our findings serve as foundation for improving tooling and analyses for block-based languages.
8 | 9 | **Link**: [Read Paper](https://doi.org/10.1109/ICSE48619.2023.00199) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [code completion](../../labels/code_completion.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ICSE2024/paper_20.md: -------------------------------------------------------------------------------- 1 | # Which Syntactic Capabilities Are Statistically Learned by Masked Language Models for Code? 2 | 3 | **Authors**: Velasco, Alejandro and Palacio, David N and Rodriguez-Cardenas, Daniel and Poshyvanyk, Denys 4 | 5 | **Abstract**: 6 | 7 | This paper discusses the limitations of evaluating Masked Language Models (MLMs) in code completion tasks. We highlight that relying on accuracy-based measurements may lead to an overestimation of models' capabilities by neglecting the syntax rules of programming languages. To address these issues, we introduce a technique called SyntaxEval in which Syntactic Capabilities are used to enhance the evaluation of MLMs. SyntaxEval automates the process of masking elements in the model input based on their Abstract Syntax Trees (ASTs). We conducted a case study on two popular MLMs using data from GitHub repositories. Our results showed negative causal effects between the node types and MLMs' accuracy. We conclude that MLMs under study fail to predict some syntactic capabilities. 8 | 9 | **Link**: [Read Paper](https://arxiv.org/pdf/2401.01512) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [syntactic analysis](../../labels/syntactic_analysis.md), [empirical study](../../labels/empirical_study.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ISSTA2022/README.md: -------------------------------------------------------------------------------- 1 | # ISSTA2022 2 | 3 | Number of papers: 1 4 | 5 | ## [Jtrans: Jump-aware transformer for binary code similarity detection](paper_1.md) 6 | - **Authors**: Wang, Hao and Qu, Wenjie and Katz, Gilad and Zhu, Wenyu and Gao, Zeyu and Qiu, Han and Zhuge, Jianwei and Zhang, Chao 7 | - **Abstract**: Binary code similarity detection (BCSD) has important applications in various fields such as vulnerabilities detection, software component analysis, and reverse engineering. Recent studies have shown that deep neural networks (DNNs) can comprehend instructions or control-flow graphs (CFG) of binary code and support BCSD. In this study, we propose a novel Transformer-based approach, namely jTrans, to learn representations of binary code. It is the first solution that embeds control flow informati... 
8 | - **Link**: [Read Paper](https://dl.acm.org/doi/pdf/10.1145/3533767.3534367) 9 | - **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md), [code model](../../labels/code_model.md), [code model training](../../labels/code_model_training.md), [binary code model](../../labels/binary_code_model.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/ISSTA2024/paper_25.md: -------------------------------------------------------------------------------- 1 | # HECS: A Hypergraph Learning-Based System for Detecting Extract Class Refactoring Opportunities 2 | 3 | **Authors**: Wang, Luqiao and Wang, Qiangqiang and Wang, Jiaqi and Zhao, Yutong and Wei, Minjie and Quan, Zhou and Cui, Di and Li, Qingshan 4 | 5 | **Abstract**: 6 | 7 | HECS is an advanced tool designed for Extract Class refactoring by leveraging hypergraph learning to model complex dependencies within large classes. Unlike traditional tools that rely on direct one-to-one dependency graphs, HECS uses intra-class dependency hypergraphs to capture one-to-many relationships. This allows HECS to provide more accurate and relevant refactoring suggestions. The tool constructs hypergraphs for each target class, attributes nodes using a pre-trained code model, and trains an enhanced hypergraph neural network. Coupled with a large language model, HECS delivers practical refactoring suggestions. In evaluations on large-scale and real-world datasets, HECS achieved a 38.5\% increase in precision, 9.7\% in recall, and 44.4\% in f1-measure compared to JDeodorant, SSECS, and LLMRefactor. These improvements make HECS a valuable tool for developers, offering practical insights and enhancing existing refactoring techniques. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3650212.3685307) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program transformation](../../labels/program_transformation.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ISSTA2024/paper_26.md: -------------------------------------------------------------------------------- 1 | # Collaboration to Repository-Level Vulnerability Detection 2 | 3 | **Authors**: Wen, Xin-Cheng 4 | 5 | **Abstract**: 6 | 7 | Large Language Model (LLM)-based methods have proven to be effective for many software engineering domains, with a potential for substantial productivity effective for software vulnerability detection. However, due to the limitation of the length of input contexts of LLM, the existing LLM-based methods mainly focus on detecting function-level and leveraging the in-file context information for vulnerability detection (i.e., intra-procedural vulnerabilities), ignoring the more complex inter-procedural vulnerability detection scenarios in practice. For instance, in real-world scenarios, developers routinely engage with program analysis to detect vulnerabilities that span multiple cross-file information within repositories. Since complex processes tend to have redundancy dependencies from spanning multiple files in the repository level and invoking multiple static analysis tools, the ideal goal of vulnerability detection is to extract the vulnerability-related information from the repository and provide potential possible explanations for vulnerability triggers. 
However, such a goal is hard to achieve, and thus in this work, we design three works through multi-agent collaboration to approach the goal of repository-level vulnerability detection. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3650212.3685562) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [bug detection](../../labels/bug_detection.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/ISSTA2024/paper_28.md: -------------------------------------------------------------------------------- 1 | # Domain Adaptation for Code Model-Based Unit Test Case Generation 2 | 3 | **Authors**: Shin, Jiho and Hashtroudi, Sepehr and Hemmati, Hadi and Wang, Song 4 | 5 | **Abstract**: 6 | 7 | Recently, deep learning-based test case generation approaches have been proposed to automate the generation of unit test cases. In this study, we leverage Transformer-based code models to generate unit tests with the help of Domain Adaptation (DA) at a project level. Specifically, we use CodeT5, a relatively small language model trained on source code data, and fine-tune it on the test generation task. Then, we apply domain adaptation to each target project data to learn project-specific knowledge (project-level DA). We use the Methods2test dataset to fine-tune CodeT5 for the test generation task and the Defects4j dataset for project-level domain adaptation and evaluation. We compare our approach with (a) CodeT5 fine-tuned on the test generation without DA, (b) the A3Test tool, and (c) GPT-4 on five projects from the Defects4j dataset. The results show that tests generated using DA can increase the line coverage by 18.62\%, 19.88\%, and 18.02\% and mutation score by 16.45\%, 16.01\% 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3650212.3680354) 10 | 11 | **Labels**: [program testing](../../labels/program_testing.md), [unit testing](../../labels/unit_testing.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/KDD2025/README.md: -------------------------------------------------------------------------------- 1 | # KDD2025 2 | 3 | Number of papers: 1 4 | 5 | ## [CompilerDream: Learning a Compiler World Model for General Code Optimization](paper_1.md) 6 | - **Authors**: Chaoyi Deng, Jialong Wu, Ningya Feng, Jianmin Wang, Mingsheng Long 7 | - **Abstract**: Effective code optimization in compilers is crucial for computer and software engineering. The success of these optimizations primarily depends on the selection and ordering of the optimization passes applied to the code. While most compilers rely on a fixed sequence of optimization passes, current methods to find the optimal sequence either employ impractically slow search algorithms or learning methods that struggle to generalize to code unseen during training. We introduce CompilerDream, a mo... 
8 | - **Link**: [Read Paper](https://arxiv.org/abs/2404.16077) 9 | - **Labels**: [static analysis](../../labels/static_analysis.md), [program optimization](../../labels/program_optimization.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/LLM4Code2025/README.md: -------------------------------------------------------------------------------- 1 | # LLM4Code2025 2 | 3 | Number of papers: 1 4 | 5 | ## [METAMON: Finding Inconsistencies between Program Documentation and Behavior using Metamorphic LLM Queries](paper_1.md) 6 | - **Authors**: Hyeonseok Lee, Gabin An, Shin Yoo 7 | - **Abstract**: Code documentation can, if written precisely, help developers better understand the code they accompany. However, unlike code, code documentation cannot be automatically verified via execution, potentially leading to inconsistencies between documentation and the actual behavior. While such inconsistencies can harmful for developer’s understanding of the code, checking and finding them remains a costly task due to the involvement of human engineers. This paper proposes METAMON, which uses an exis... 8 | - **Link**: [Read Paper](https://coinse.github.io/publications/pdfs/Lee2025aa.pdf) 9 | - **Labels**: [program testing](../../labels/program_testing.md), [differential testing](../../labels/differential_testing.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/LLM4Code2025/paper_1.md: -------------------------------------------------------------------------------- 1 | # METAMON: Finding Inconsistencies between Program Documentation and Behavior using Metamorphic LLM Queries 2 | 3 | **Authors**: Hyeonseok Lee, Gabin An, Shin Yoo 4 | 5 | **Abstract**: 6 | 7 | Code documentation can, if written precisely, help developers better understand the code they accompany. However, unlike code, code documentation cannot be automatically verified via execution, potentially leading to inconsistencies between documentation and the actual behavior. While such inconsistencies can harmful for developer’s understanding of the code, checking and finding them remains a costly task due to the involvement of human engineers. This paper proposes METAMON, which uses an existing search-based test generation technique to capture the current program behavior in the form of test cases, and subsequently uses LLM-based code reasoning to identify the generated regression test oracles that are not consistent with the program specifications in the documentation. METAMON is supported in this task by metamorphic testing and self-consistency. An empirical evaluation against 9,482 pairs of code documentation and code snippets, generated using five opensource projects from Defects4J v2.0.1, shows that METAMON can classify the code-and-documentation inconsistencies with the precision of 0.72 and the recall of 0.48. 
8 | 9 | **Link**: [Read Paper](https://coinse.github.io/publications/pdfs/Lee2025aa.pdf) 10 | 11 | **Labels**: [program testing](../../labels/program_testing.md), [differential testing](../../labels/differential_testing.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/LangSec2025/README.md: -------------------------------------------------------------------------------- 1 | # LangSec2025 2 | 3 | Number of papers: 1 4 | 5 | ## [Large Language Models for Validating Network Protocol Parsers](paper_1.md) 6 | - **Authors**: Mingwei Zheng, Danning Xie, Xiangyu Zhang 7 | - **Abstract**: Network protocol parsers are essential for enabling correct and secure communication between devices. Bugs in these parsers can introduce critical vulnerabilities, including memory corruption, information leakage, and denial-of-service attacks. An intuitive way to assess parser correctness is to compare the implementation with its official protocol standard. However, this comparison is challenging because protocol standards are typically written in natural language, whereas implementations are i... 8 | - **Link**: [Read Paper](https://arxiv.org/abs/2504.13515) 9 | - **Labels**: [static analysis](../../labels/static_analysis.md), [specification inference](../../labels/specification_inference.md), [bug detection](../../labels/bug_detection.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/Meta2024/README.md: -------------------------------------------------------------------------------- 1 | # Meta2024 2 | 3 | Number of papers: 1 4 | 5 | ## [Don't Transform the Code, Code the Transforms: Towards Precise Code Rewriting using LLMs](paper_1.md) 6 | - **Authors**: Cummins, Chris and Seeker, Volker and Armengol-Estapé, Jordi and Markosyan, Aram H and Synnaeve, Gabriel and Leather, Hugh 7 | - **Abstract**: Tools for rewriting, refactoring and optimizing code should be fast and correct. Large language models (LLMs), by their nature, possess neither of these qualities. Yet, there remains tremendous opportunity in using LLMs to improve code. We explore the use of LLMs not to transform code, but to code transforms. We propose a chain-of-thought approach to synthesizing code transformations from a small number of input/output code examples that incorporates execution and feedback. Unlike the direct rew... 8 | - **Link**: [Read Paper](https://arxiv.org/pdf/2410.08806) 9 | - **Labels**: [agent design](../../labels/agent_design.md), [prompt strategy](../../labels/prompt_strategy.md), [reason with code](../../labels/reason_with_code.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/Meta2024/paper_1.md: -------------------------------------------------------------------------------- 1 | # Don't Transform the Code, Code the Transforms: Towards Precise Code Rewriting using LLMs 2 | 3 | **Authors**: Cummins, Chris and Seeker, Volker and Armengol-Estapé, Jordi and Markosyan, Aram H and Synnaeve, Gabriel and Leather, Hugh 4 | 5 | **Abstract**: 6 | 7 | Tools for rewriting, refactoring and optimizing code should be fast and correct. Large language models (LLMs), by their nature, possess neither of these qualities. Yet, there remains tremendous opportunity in using LLMs to improve code. We explore the use of LLMs not to transform code, but to code transforms.
We propose a chain-of-thought approach to synthesizing code transformations from a small number of input/output code examples that incorporates execution and feedback. Unlike the direct rewrite approach, LLM-generated transformations are easy to inspect, debug, and validate. The logic of the rewrite is explicitly coded and easy to adapt. The compute required to run code transformations is minute compared to that of LLM rewriting. We test our approach on 16 Python code transformations and find that LLM-generated transforms are perfectly precise for 7 of them and less imprecise than direct LLM rewriting on the others. We hope to encourage further research into improving the precision of LLM code rewriting. 8 | 9 | **Link**: [Read Paper](https://arxiv.org/pdf/2410.08806) 10 | 11 | **Labels**: [agent design](../../labels/agent_design.md), [prompt strategy](../../labels/prompt_strategy.md), [reason with code](../../labels/reason_with_code.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/Microsoft2023/paper_1.md: -------------------------------------------------------------------------------- 1 | # Natural Language Commanding via Program Synthesis 2 | 3 | **Authors**: Gandhi, Apurva and Nguyen, Thong Q and Jiao, Huitian and Steen, Robert and Bhatawdekar, Ameya 4 | 5 | **Abstract**: 6 | 7 | We present Semantic Interpreter, a natural language-friendly AI system for productivity software such as Microsoft Office that leverages large language models (LLMs) to execute user intent across application features. While LLMs are excellent at understanding user intent expressed as natural language, they are not sufficient for fulfilling application-specific user intent that requires more than text-to-text transformations. We therefore introduce the Office Domain Specific Language (ODSL), a concise, high-level language specialized for performing actions in and interacting with entities in Office applications. Semantic Interpreter leverages an Analysis-Retrieval prompt construction method with LLMs for program synthesis, translating natural language user utterances to ODSL programs that can be transpiled to application APIs and then executed. We focus our discussion primarily on a research exploration for Microsoft PowerPoint. 8 | 9 | **Link**: [Read Paper](https://arxiv.org/pdf/2306.03460.pdf) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/Microsoft2024/README.md: -------------------------------------------------------------------------------- 1 | # Microsoft2024 2 | 3 | Number of papers: 1 4 | 5 | ## [Steering Large Language Models between Code Execution and Textual Reasoning](paper_1.md) 6 | - **Authors**: Chen, Yongchao and Jhamtani, Harsh and Sharma, Srinagesh and Fan, Chuchu and Wang, Chi 7 | - **Abstract**: While a lot of recent research focuses on enhancing the textual reasoning capabilities of Large Language Models (LLMs) by optimizing the multi-agent framework or reasoning chains, several benchmark tasks can be solved with 100% success through direct coding, which is more scalable and avoids the computational overhead associated with textual iterating and searching. Textual reasoning has inherent limitations in solving tasks with challenges in math, logics, optimization, and searching, which is ...
8 | - **Link**: [Read Paper](https://arxiv.org/pdf/2410.03524) 9 | - **Labels**: [agent design](../../labels/agent_design.md), [prompt strategy](../../labels/prompt_strategy.md), [reason with code](../../labels/reason_with_code.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/NAACL2024/paper_2.md: -------------------------------------------------------------------------------- 1 | # Program-Aided Reasoners (Better) Know What They Know 2 | 3 | **Authors**: Kabra, Anubha and Rangreji, Sanketh and Mathur, Yash and Madaan, Aman and Liu, Emmy and Neubig, Graham 4 | 5 | **Abstract**: 6 | 7 | Prior work shows that program-aided reasoning, in which large language models (LLMs) are combined with programs written in programming languages such as Python, can significantly improve accuracy on various reasoning tasks. However, while accuracy is essential, it is also important for such reasoners to “know what they know”, which can be quantified through the calibration of the model. In this paper, we compare the calibration of Program Aided Language Models (PAL) and text-based Chain-of-thought (COT) prompting techniques over 5 datasets and 2 model types - LLaMA models and OpenAI models. Our results indicate that PAL leads to improved calibration in 75% of the instances. Our analysis uncovers that prompting styles that produce lesser diversity in generations also have more calibrated results, and thus we also experiment with inducing lower generation diversity using temperature scaling and find that for certain temperatures, PAL is not only more accurate but is also more calibrated than COT. Overall, we demonstrate that, in the majority of cases, program-aided reasoners better know what they know than text-based counterparts. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.18653/v1/2024.naacl-long.125) 10 | 11 | **Labels**: [agent design](../../labels/agent_design.md), [prompt strategy](../../labels/prompt_strategy.md), [reason with code](../../labels/reason_with_code.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/NVDIA2024/README.md: -------------------------------------------------------------------------------- 1 | # NVDIA2024 2 | 3 | Number of papers: 1 4 | 5 | ## [Effective Large Language Model Debugging with Best-first Tree Search](paper_1.md) 6 | - **Authors**: Song, Jialin and Raiman, Jonathan and Catanzaro, Bryan 7 | - **Abstract**: Large Language Models (LLMs) show promise in code generation tasks. However, their code-writing abilities are often limited in scope: while they can successfully implement simple functions, they struggle with more complex tasks. A fundamental difference with how an LLM writes code, compared to a human programmer, is that it cannot consistently spot and fix bugs. Debugging is a crucial skill for programmers and it enables iterative code refinement towards a correct implementation. In this work, w... 
8 | - **Link**: [Read Paper](https://arxiv.org/pdf/2407.19055) 9 | - **Labels**: [code generation](../../labels/code_generation.md), [debugging](../../labels/debugging.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/NVDIA2024/paper_1.md: -------------------------------------------------------------------------------- 1 | # Effective Large Language Model Debugging with Best-first Tree Search 2 | 3 | **Authors**: Song, Jialin and Raiman, Jonathan and Catanzaro, Bryan 4 | 5 | **Abstract**: 6 | 7 | Large Language Models (LLMs) show promise in code generation tasks. However, their code-writing abilities are often limited in scope: while they can successfully implement simple functions, they struggle with more complex tasks. A fundamental difference with how an LLM writes code, compared to a human programmer, is that it cannot consistently spot and fix bugs. Debugging is a crucial skill for programmers and it enables iterative code refinement towards a correct implementation. In this work, we propose a novel algorithm to enable LLMs to debug their code via self-reflection and search where a model attempts to identify its previous mistakes. Our key contributions are 1) a best-first tree search algorithm with self-reflections (BESTER) that achieves state-of-the-art Pass@1 in three code generation benchmarks. BESTER maintains its superiority when we measure pass rates taking into account additional inference costs incurred by tree search. 2) A novel interpretability study on what self-reflections attend to in buggy programs and how they impact bug fixes, which provides a deeper understanding of the debugging process. 3) An extensive study on when self-reflections are effective in finding bugs. 8 | 9 | **Link**: [Read Paper](https://arxiv.org/pdf/2407.19055) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [debugging](../../labels/debugging.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/NeurIPS2018/README.md: -------------------------------------------------------------------------------- 1 | # NeurIPS2018 2 | 3 | Number of papers: 1 4 | 5 | ## [Neural code comprehension: A learnable representation of code semantics](paper_1.md) 6 | - **Authors**: Ben-Nun, Tal and Jakobovits, Alice Shoshana and Hoefler, Torsten 7 | - **Abstract**: With the recent success of embeddings in natural language processing, research has been conducted into applying similar methods to code analysis. Most works attempt to process the code directly or use a syntactic tree representation, treating it like sentences written in a natural language. However, none of the existing methods are sufficient to comprehend program semantics robustly, due to structural features such as function calls, branching, and interchangeable order of statements. In this pa... 
8 | - **Link**: [Read Paper](https://dl.acm.org/doi/pdf/10.5555/3327144.3327276) 9 | - **Labels**: [general coding task](../../labels/general_coding_task.md), [code model](../../labels/code_model.md), [code model training](../../labels/code_model_training.md), [source code model](../../labels/source_code_model.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/NeurIPS2022/paper_2.md: -------------------------------------------------------------------------------- 1 | # Self-consistency improves chain of thought reasoning in language models 2 | 3 | **Authors**: Wang, Xuezhi and Wei, Jason and Schuurmans, Dale and Le, Quoc and Chi, Ed and Narang, Sharan and Chowdhery, Aakanksha and Zhou, Denny 4 | 5 | **Abstract**: 6 | 7 | Chain-of-thought prompting combined with pre-trained large language models has achieved encouraging results on complex reasoning tasks. In this paper, we propose a new decoding strategy, self-consistency, to replace the naive greedy decoding used in chain-of-thought prompting. It first samples a diverse set of reasoning paths instead of only taking the greedy one, and then selects the most consistent answer by marginalizing out the sampled reasoning paths. Self-consistency leverages the intuition that a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer. Our extensive empirical evaluation shows that self-consistency boosts the performance of chain-of-thought prompting with a striking margin on a range of popular arithmetic and commonsense reasoning benchmarks, including GSM8K (+17.9%), SVAMP (+11.0%), AQuA (+12.2%), StrategyQA (+6.4%) and ARC-challenge (+3.9%). 8 | 9 | **Link**: [Read Paper](https://arxiv.org/abs/2203.11171) 10 | 11 | **Labels**: [hallucination in reasoning](../../labels/hallucination_in_reasoning.md), [agent design](../../labels/agent_design.md), [prompt strategy](../../labels/prompt_strategy.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/NeurIPS2024/paper_3.md: -------------------------------------------------------------------------------- 1 | # Verified multi-step synthesis using large language models and monte carlo tree search 2 | 3 | **Authors**: Brandfonbrener, David and Raja, Sibi and Prasad, Tarun and Loughridge, Chloe and Yang, Jianang and Henniger, Simon and Byrd, William E and Zinkov, Robert and Amin, Nada 4 | 5 | **Abstract**: 6 | 7 | We present an approach using Monte Carlo Tree Search (MCTS) to guide Large Language Models (LLMs) to generate verified programs in Dafny, Lean and Coq. Our method, which we call VMCTS, leverages the verifier inside the search algorithm by checking partial programs at each step. In combination with the LLM prior, the verifier feedback raises the synthesis capabilities of open source models. On a set of five verified programming problems, we find that in four problems where the base model cannot solve the question even when re-sampling solutions for one hour, VMCTS can solve the problems within 6 minutes. The base model with VMCTS is even competitive with ChatGPT4 augmented with plugins and multiple re-tries on these problems. Our code and benchmarks are available at https://github.com/namin/llm-verified-with-monte-carlo-tree-search. 
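To make the verifier-in-the-loop idea concrete, the sketch below runs a simplified best-first variant of the search: candidate continuations come from an LLM, and any partial program the verifier rejects is pruned immediately. `llm_propose_next_lines`, `verifier_accepts_prefix`, the completion test, and the scoring heuristic are hypothetical stand-ins (the paper uses full MCTS with Dafny, Lean, or Coq as the checker), so read this as a sketch of the control flow rather than the authors' algorithm.

```python
import heapq
import itertools


def llm_propose_next_lines(prefix: str, k: int = 3) -> list[str]:
    """Hypothetical stand-in for sampling k candidate next lines from an LLM."""
    return [f"// candidate line {i}" for i in range(k)]


def verifier_accepts_prefix(prefix: str) -> bool:
    """Hypothetical stand-in for running Dafny/Lean/Coq on a *partial* program."""
    return True


def is_complete(prefix: str) -> bool:
    return len(prefix.splitlines()) >= 5  # toy completion criterion


def score(prefix: str) -> float:
    # Prefer longer (more complete) prefixes; heapq pops the smallest score first.
    return -float(len(prefix.splitlines()))


def search(spec: str, budget: int = 50):
    tie = itertools.count()  # tie-breaker so the heap never compares program strings
    frontier = [(score(spec), next(tie), spec)]
    for _ in range(budget):
        if not frontier:
            return None
        _, _, prefix = heapq.heappop(frontier)
        if is_complete(prefix):
            return prefix
        for line in llm_propose_next_lines(prefix):
            child = prefix + "\n" + line
            if verifier_accepts_prefix(child):  # prune verifier-rejected prefixes early
                heapq.heappush(frontier, (score(child), next(tie), child))
    return None


if __name__ == "__main__":
    print(search("// spec: return the maximum of two integers"))
```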
8 | 9 | **Link**: [Read Paper](https://openreview.net/pdf?id=HmB9uZTzaD) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/NeurIPS2024/paper_5.md: -------------------------------------------------------------------------------- 1 | # Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search 2 | 3 | **Authors**: Nicola Dainese and Matteo Merler and Minttu Alakuijala and Pekka Marttinen 4 | 5 | **Abstract**: 6 | 7 | In this work we consider Code World Models, world models generated by a Large Language Model (LLM) in the form of Python code for model-based Reinforcement Learning (RL). Calling code instead of LLMs for planning has the advantages of being precise, reliable, interpretable, and extremely efficient. However, writing appropriate Code World Models requires the ability to understand complex instructions, to generate exact code with non-trivial logic and to self-debug a long program with feedback from unit tests and environment trajectories. To address these challenges, we propose Generate, Improve and Fix with Monte Carlo Tree Search (GIF-MCTS), a new code generation strategy for LLMs. To test our approach, we introduce the Code World Models Benchmark (CWMB), a suite of program synthesis and planning tasks comprised of 18 diverse RL environments paired with corresponding textual descriptions and curated trajectories. GIF-MCTS surpasses all baselines on the CWMB and two other benchmarks, and we show that the Code World Models synthesized with it can be successfully used for planning, resulting in model-based RL agents with greatly improved sample efficiency and inference speed. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.48550/arXiv.2405.15383) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/OOPLSA2023/README.md: -------------------------------------------------------------------------------- 1 | # OOPLSA2023 2 | 3 | Number of papers: 1 4 | 5 | ## [Grounded Copilot: How Programmers Interact with Code-Generating Models](paper_1.md) 6 | - **Authors**: Barke, Shraddha and James, Michael B. and Polikarpova, Nadia 7 | - **Abstract**: Powered by recent advances in code-generating models, AI assistants like Github Copilot promise to change the face of programming forever. But what is this new face of programming? We present the first grounded theory analysis of how programmers interact with Copilot, based on observing 20 participants—with a range of prior experience using the assistant—as they solve diverse programming tasks across four languages. Our main finding is that interactions with programming assistants are bimodal: i... 
8 | - **Link**: [Read Paper](https://doi.org/10.1145/3586030) 9 | - **Labels**: [code generation](../../labels/code_generation.md), [code completion](../../labels/code_completion.md), [empirical study](../../labels/empirical_study.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/OOPLSA2023/paper_1.md: -------------------------------------------------------------------------------- 1 | # Grounded Copilot: How Programmers Interact with Code-Generating Models 2 | 3 | **Authors**: Barke, Shraddha and James, Michael B. and Polikarpova, Nadia 4 | 5 | **Abstract**: 6 | 7 | Powered by recent advances in code-generating models, AI assistants like Github Copilot promise to change the face of programming forever. But what is this new face of programming? We present the first grounded theory analysis of how programmers interact with Copilot, based on observing 20 participants—with a range of prior experience using the assistant—as they solve diverse programming tasks across four languages. Our main finding is that interactions with programming assistants are bimodal: in acceleration mode, the programmer knows what to do next and uses Copilot to get there faster; in exploration mode, the programmer is unsure how to proceed and uses Copilot to explore their options. Based on our theory, we provide recommendations for improving the usability of future AI programming assistants. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3586030) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [code completion](../../labels/code_completion.md), [empirical study](../../labels/empirical_study.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/OOPSLA2024/paper_5.md: -------------------------------------------------------------------------------- 1 | # Enhancing Static Analysis for Practical Bug Detection: An LLM-Integrated Approach 2 | 3 | **Authors**: Li, Haonan and Hao, Yu and Zhai, Yizhuo and Qian, Zhiyun 4 | 5 | **Abstract**: 6 | 7 | While static analysis is instrumental in uncovering software bugs, its precision in analyzing large and intricate codebases remains challenging. The emerging prowess of Large Language Models (LLMs) offers a promising avenue to address these complexities. In this paper, we present LLift, a pioneering framework that synergizes static analysis and LLMs, with a spotlight on identifying use-before-initialization (UBI) bugs within the Linux kernel. Drawing from our insights into variable usage conventions in Linux, we enhance path analysis using post-constraint guidance. This approach, combined with our methodically crafted procedures, empowers LLift to adeptly handle the challenges of bug-specific modeling, extensive codebases, and the unpredictable nature of LLMs. Our real-world evaluations identified four previously undiscovered UBI bugs in the mainstream Linux kernel, which the Linux community has acknowledged. This study reaffirms the potential of marrying static analysis with LLMs, setting a compelling direction for future research in this area. 
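The abstract describes handing intermediate static-analysis facts, such as the post-constraint on an initializer's return value, to an LLM that then judges whether a use-before-initialization warning is real. A minimal sketch of that hand-off follows; the `PotentialUBI` fields, the prompt format, and the stubbed `query_llm` are illustrative assumptions, not LLift's actual interface.

```python
import json
from dataclasses import dataclass


@dataclass
class PotentialUBI:
    """A warning produced by an upstream static analyzer (illustrative fields)."""
    function: str
    variable: str
    initializer_call: str
    post_constraint: str  # e.g. "ret == 0 implies the variable was written"
    code_snippet: str


def query_llm(prompt: str) -> str:
    """Placeholder for any LLM client; stubbed so the sketch runs end to end."""
    return json.dumps({"may_be_uninitialized": False,
                       "reason": "callee writes the variable on the checked path"})


def triage(warning: PotentialUBI) -> dict:
    # Encode the analyzer's findings, including the post-constraint that guides
    # path reasoning, and ask the model for a structured verdict.
    prompt = (
        f"Function: {warning.function}\n"
        f"Variable: {warning.variable}\n"
        f"Initializer: {warning.initializer_call}\n"
        f"Post-constraint: {warning.post_constraint}\n"
        f"Code:\n{warning.code_snippet}\n"
        'Respond with JSON: {"may_be_uninitialized": bool, "reason": str}'
    )
    return json.loads(query_llm(prompt))


if __name__ == "__main__":
    w = PotentialUBI(
        "demo_read", "len", "ret = get_len(dev, &len);",
        "ret == 0 implies len is initialized",
        "int ret = get_len(dev, &len);\nif (ret) return ret;\nuse(len);",
    )
    print(triage(w))
```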
8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3649828) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/OOPSLA2025/paper_2.md: -------------------------------------------------------------------------------- 1 | # Artemis: Toward Accurate Detection of Server-Side Request Forgeries through LLM-Assisted Inter-procedural Path-Sensitive Taint Analysis 2 | 3 | **Authors**: Yuchen Ji, Ting Dai, Zhichao Zhou, Yutian Tang, Jingzhu He 4 | 5 | **Abstract**: 6 | 7 | Server-side request forgery (SSRF) vulnerabilities are inevitable in PHP web applications. Existing static tools in detecting vulnerabilities in PHP web applications neither contain SSRF-related features to enhance detection accuracy nor consider PHP’s dynamic type features. In this paper, we present Artemis, a static taint analysis tool for detecting SSRF vulnerabilities in PHP web applications. First, Artemis extracts both PHP built-in and third-party functions as candidate source and sink functions. Second, Artemis constructs both explicit and implicit call graphs to infer functions’ relationships. Third, Artemis performs taint analysis based on a set of rules that prevent over-tainting and pauses when SSRF exploitation is impossible. Fourth, Artemis analyzes the compatibility of path conditions to prune false positives. We have implemented a prototype of Artemis and evaluated it on 250 PHP web applications. Artemis reports 207 true vulnerable paths (106 true SSRFs) with 15 false positives. Of the 106 detected SSRFs, 35 are newly found and reported to developers, with 24 confirmed and assigned CVE IDs. 8 | 9 | **Link**: [Read Paper](https://dl.acm.org/doi/10.1145/3720488) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/OOPSLA2025/paper_3.md: -------------------------------------------------------------------------------- 1 | # Laurel: Unblocking Automated Verification with Large Language Models 2 | 3 | **Authors**: Eric Mugnier, Emmanuel Anaya Gonzalez, Nadia Polikarpova, Ranjit Jhala, Zhou Yuanyuan 4 | 5 | **Abstract**: 6 | 7 | Program verifiers such as Dafny automate proofs by outsourcing them to an SMT solver. This automation is not perfect, however, and the solver often requires hints in the form of assertions, creating a burden for the proof engineer. In this paper, we propose Laurel, a tool that alleviates this burden by automatically generating assertions using large language models (LLMs). To improve the success rate of LLMs in this task, we design two domain-specific prompting techniques. First, we help the LLM determine the location of the missing assertion by analyzing the verifier’s error message and inserting an assertion placeholder at that location. Second, we provide the LLM with example assertions from the same codebase, which we select based on a new proof similarity metric. We evaluate our techniques on our new benchmark, a dataset of complex lemmas we extracted from three real-world Dafny codebases. Our evaluation shows that Laurel is able to generate over 56.6% of the required assertions given only a few attempts, making LLMs an affordable tool for unblocking program verifiers without human intervention.
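The two prompting ideas named in the abstract, inserting an assertion placeholder at the location indicated by the verifier's error message and selecting few-shot example assertions by proof similarity, can be sketched roughly as follows. The error-message regex, the difflib-based similarity score, and all helper names are simplifications assumed for illustration rather than the tool's actual implementation.

```python
import difflib
import re


def placeholder_from_error(dafny_source: str, error_msg: str) -> str:
    """Insert an assertion placeholder at the line the verifier complains about.
    Assumes an error location of the form 'file.dfy(line,col)' (illustrative)."""
    m = re.search(r"\((\d+),\d+\)", error_msg)
    line = int(m.group(1)) if m else 1
    lines = dafny_source.splitlines()
    lines.insert(line - 1, "  assert true;  // TODO: replace with the missing assertion")
    return "\n".join(lines)


def pick_examples(failing_lemma: str, assertion_bank: list, k: int = 2) -> list:
    """Few-shot selection; difflib's ratio stands in for the proof-similarity metric."""
    return sorted(
        assertion_bank,
        key=lambda ex: difflib.SequenceMatcher(None, failing_lemma, ex[0]).ratio(),
        reverse=True,
    )[:k]


def build_prompt(failing_lemma: str, error_msg: str, examples: list) -> str:
    shots = "\n\n".join(f"Context:\n{ctx}\nAssertion:\n{assertion}"
                        for ctx, assertion in examples)
    return (
        f"{shots}\n\nThe following Dafny lemma fails to verify:\n"
        f"{placeholder_from_error(failing_lemma, error_msg)}\n"
        "Replace the TODO placeholder with the assertion that unblocks the proof."
    )


if __name__ == "__main__":
    lemma = ("lemma SumAppend(a: seq<int>, b: seq<int>)\n"
             "  ensures Sum(a + b) == Sum(a) + Sum(b)\n{\n}")
    err = "sum.dfy(3,0): Error: a postcondition could not be proved on this return path"
    bank = [("lemma AppendLen(a: seq<int>, b: seq<int>)", "assert |a + b| == |a| + |b|;")]
    print(build_prompt(lemma, err, pick_examples(lemma, bank)))
```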
8 | 9 | **Link**: [Read Paper](https://dl.acm.org/doi/10.1145/3720499) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [program verification](../../labels/program_verification.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/OpenAI2024/README.md: -------------------------------------------------------------------------------- 1 | # OpenAI2024 2 | 3 | Number of papers: 1 4 | 5 | ## [OpenAI’s Approach to External Red Teaming for AI Models and Systems](paper_1.md) 6 | - **Authors**: Lama Ahmad, Sandhini Agarwal, Michael Lampe, Pamela Mishkin 7 | - **Abstract**: Red teaming has emerged as a critical practice in assessing the possible risks of AI models and systems. It aids in the discovery of novel risks, stress testing possible gaps in existing mitigations, enriching existing quantitative safety metrics, facilitating the creation of new safety measurements, and enhancing public trust and the legitimacy of AI risk assessments. This white paper describes OpenAI’s work to date in external red teaming and draws some more general conclusions from this work.... 8 | - **Link**: [Read Paper](https://cdn.openai.com/papers/openais-approach-to-external-red-teaming.pdf) 9 | - **Labels**: [code model](../../labels/code_model.md), [code model security](../../labels/code_model_security.md), [benchmark](../../labels/benchmark.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/PLDI2023/paper_2.md: -------------------------------------------------------------------------------- 1 | # Scallop: A Language for Neurosymbolic Programming 2 | 3 | **Authors**: Li, Ziyang and Huang, Jiani and Naik, Mayur 4 | 5 | **Abstract**: 6 | 7 | We present Scallop, a language which combines the benefits of deep learning and logical reasoning. Scallop enables users to write a wide range of neurosymbolic applications and train them in a data- and compute-efficient manner. It achieves these goals through three key features: 1) a flexible symbolic representation that is based on the relational data model; 2) a declarative logic programming language that is based on Datalog and supports recursion, aggregation, and negation; and 3) a framework for automatic and efficient differentiable reasoning that is based on the theory of provenance semirings. We evaluate Scallop on a suite of eight neurosymbolic applications from the literature. Our evaluation demonstrates that Scallop is capable of expressing algorithmic reasoning in diverse and challenging AI tasks, provides a succinct interface for machine learning programmers to integrate logical domain knowledge, and yields solutions that are comparable or superior to state-of-the-art models in terms of accuracy. Furthermore, Scallop's solutions outperform these models in aspects such as runtime and data efficiency, interpretability, and generalizability. 
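Scallop programs are written in the language's own Datalog dialect rather than in Python, so the snippet below only illustrates the underlying idea of declarative recursive rules evaluated under a provenance semiring: a plain-Python fixpoint for a probabilistic reachability rule in which conjunction multiplies provenance and disjunction keeps the maximum. This is a conceptual sketch, not Scallop syntax or its API.

```python
# Facts: probabilistic edges, with the probability carried as provenance.
edges = {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.2}


def reachability(edge_facts: dict) -> dict:
    """Naive fixpoint for:  path(x, y) :- edge(x, y).
                            path(x, z) :- path(x, y), edge(y, z).
    Provenance is combined with (max, *), a simple 'most likely derivation' semiring."""
    path = dict(edge_facts)  # base case
    changed = True
    while changed:
        changed = False
        for (x, y), p_xy in list(path.items()):
            for (y2, z), p_yz in edge_facts.items():
                if y == y2:
                    candidate = p_xy * p_yz              # conjunction -> product
                    if candidate > path.get((x, z), 0.0):  # disjunction -> max
                        path[(x, z)] = candidate
                        changed = True
    return path


if __name__ == "__main__":
    for (x, y), p in sorted(reachability(edges).items()):
        print(f"path({x}, {y}) with provenance {p:.2f}")
```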
8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3591280) 10 | 11 | **Labels**: [PL design for LLMs](../../labels/PL_design_for_LLMs.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/PLDI2025/README.md: -------------------------------------------------------------------------------- 1 | # PLDI2025 2 | 3 | Number of papers: 1 4 | 5 | ## [DR.FIX: Automatically Fixing Data Races at Industry Scale](paper_1.md) 6 | - **Authors**: Farnaz Behrang, Zhizhou Zhang, Georgian-Vlad Saioc, Peng Liu, Milind Chabbi 7 | - **Abstract**: Data races are a prevalent class of concurrency bugs in shared-memory parallel programs, posing significant challenges to software reliability and reproducibility. While there is an extensive body of research on detecting data races and a wealth of practical detection tools across various programming languages, considerably less effort has been directed toward automatically fixing data races at an industrial scale. In large codebases, data races are continuously introduced and exhibit myriad pat... 8 | - **Link**: [Read Paper](https://arxiv.org/abs/2504.15637) 9 | - **Labels**: [code generation](../../labels/code_generation.md), [program repair](../../labels/program_repair.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/POPL2025/README.md: -------------------------------------------------------------------------------- 1 | # POPL2025 2 | 3 | Number of papers: 1 4 | 5 | ## [Automated Program Refinement: Guide and Verify Code Large Language Model with Refinement Calculus](paper_1.md) 6 | - **Authors**: Cai, Yufan and Hou, Zhe and Luan, Xiaokun and Baena, David Miguel Sanan and Lin, Yun and Sun, Jun and Dong, Jin Song 7 | - **Abstract**: Recently, the rise of code-centric large language models (LLMs) appears to have reshaped the software engineering world with low-barrier tools like Copilot that can generate code easily. However, there is no correctness guarantee for the code generated by LLMs, which suffer from the hallucination problem, and their output is fraught with risks. Besides, the end-to-end process from specification to code through LLMs is a non-transparent and uncontrolled black box. This opacity makes it difficult ... 8 | - **Link**: [Read Paper](https://arxiv.org/html/2406.18616v1) 9 | - **Labels**: [code generation](../../labels/code_generation.md), [program transformation](../../labels/program_transformation.md), [static analysis](../../labels/static_analysis.md), [program verification](../../labels/program_verification.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/ProtectAI2024/README.md: -------------------------------------------------------------------------------- 1 | # ProtectAI2024 2 | 3 | Number of papers: 1 4 | 5 | ## [Vulnhuntr: Autonomous AI Finds First 0-Day Vulnerabilities in Wild](paper_1.md) 6 | - **Authors**: Dan McInerney and Marcello Salvati 7 | - **Abstract**: Today, we introduce [Vulnhuntr](https://github.com/protectai/vulnhuntr), a Python static code analyzer that leverages the power of large language models (LLMs) to find and explain complex, multistep vulnerabilities. Thanks to the capabilities of models like Claude 3.5, AI has now uncovered more than a dozen remotely exploitable 0-day vulnerabilities targeting open-source projects in the AI ecosystem with over 10,000 GitHub stars in just a few hours of running it. These discoveries include full-b... 
8 | - **Link**: [Read Paper](https://protectai.com/threat-research/vulnhuntr-first-0-day-vulnerabilities) 9 | - **Labels**: [program testing](../../labels/program_testing.md), [vulnerability exploitation](../../labels/vulnerability_exploitation.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/ProtectAI2024/paper_1.md: -------------------------------------------------------------------------------- 1 | # Vulnhuntr: Autonomous AI Finds First 0-Day Vulnerabilities in Wild 2 | 3 | **Authors**: Dan McInerney and Marcello Salvati 4 | 5 | **Abstract**: 6 | 7 | Today, we introduce [Vulnhuntr](https://github.com/protectai/vulnhuntr), a Python static code analyzer that leverages the power of large language models (LLMs) to find and explain complex, multistep vulnerabilities. Thanks to the capabilities of models like Claude 3.5, AI has now uncovered more than a dozen remotely exploitable 0-day vulnerabilities targeting open-source projects in the AI ecosystem with over 10,000 GitHub stars in just a few hours of running it. These discoveries include full-blown Remote Code Execution. If you’d like to get paid for using Vulnhuntr then head on over to https://huntr.com which is an AI bug bounty program helping secure the exploding open source AI ecosystem. 8 | 9 | **Link**: [Read Paper](https://protectai.com/threat-research/vulnhuntr-first-0-day-vulnerabilities) 10 | 11 | **Labels**: [program testing](../../labels/program_testing.md), [vulnerability exploitation](../../labels/vulnerability_exploitation.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/RAID2023/README.md: -------------------------------------------------------------------------------- 1 | # RAID2023 2 | 3 | Number of papers: 1 4 | 5 | ## [DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection](paper_1.md) 6 | - **Authors**: Yizheng Chen and Zhoujie Ding and Lamya Alowain and Xinyun Chen and David A. Wagner 7 | - **Abstract**: We propose and release a new vulnerable source code dataset. We curate the dataset by crawling security issue websites, extracting vulnerability-fixing commits and source codes from the corresponding projects. Our new dataset contains 18,945 vulnerable functions spanning 150 CWEs and 330,492 non-vulnerable functions extracted from 7,514 commits. Our dataset covers 295 more projects than all previous datasets combined. Combining our new dataset with previous datasets, we present an analysis of the... 8 | - **Link**: [Read Paper](https://doi.org/10.1145/3607199.3607242) 9 | - **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md), [benchmark](../../labels/benchmark.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/S&P2023/README.md: -------------------------------------------------------------------------------- 1 | # S&P2023 2 | 3 | Number of papers: 1 4 | 5 | ## [Examining Zero-Shot Vulnerability Repair with Large Language Models](paper_1.md) 6 | - **Authors**: Pearce, Hammond and Tan, Benjamin and Ahmad, Baleegh and Karri, Ramesh and Dolan-Gavitt, Brendan 7 | - **Abstract**: Human developers can produce code with cybersecurity bugs. Can emerging ‘smart’ code completion tools help repair those bugs?
In this work, we examine the use of large language models (LLMs) for code (such as OpenAI’s Codex and AI21’s Jurassic J-1) for zero-shot vulnerability repair. We investigate challenges in the design of prompts that coax LLMs into generating repaired versions of insecure code. This is difficult due to the numerous ways to phrase key information— both semantically and synta... 8 | - **Link**: [Read Paper](https://doi.ieeecomputersociety.org/10.1109/SP46215.2023.10179420) 9 | - **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md), [empirical study](../../labels/empirical_study.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/S&P2023/paper_1.md: -------------------------------------------------------------------------------- 1 | # Examining Zero-Shot Vulnerability Repair with Large Language Models 2 | 3 | **Authors**: Pearce, Hammond and Tan, Benjamin and Ahmad, Baleegh and Karri, Ramesh and Dolan-Gavitt, Brendan 4 | 5 | **Abstract**: 6 | 7 | Human developers can produce code with cybersecurity bugs. Can emerging ‘smart’ code completion tools help repair those bugs? In this work, we examine the use of large language models (LLMs) for code (such as OpenAI’s Codex and AI21’s Jurassic J-1) for zero-shot vulnerability repair. We investigate challenges in the design of prompts that coax LLMs into generating repaired versions of insecure code. This is difficult due to the numerous ways to phrase key information— both semantically and syntactically—with natural languages. We perform a large scale study of five commercially available, black-box, "off-the-shelf" LLMs, as well as an open-source model and our own locally-trained model, on a mix of synthetic, hand-crafted, and real-world security bug scenarios. Our experiments demonstrate that while the approach has promise (the LLMs could collectively repair 100% of our synthetically generated and hand-crafted scenarios), a qualitative evaluation of the model’s performance over a corpus of historical real-world examples highlights challenges in generating functionally correct code. 8 | 9 | **Link**: [Read Paper](https://doi.ieeecomputersociety.org/10.1109/SP46215.2023.10179420) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md), [empirical study](../../labels/empirical_study.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/SOAP2024/README.md: -------------------------------------------------------------------------------- 1 | # SOAP2024 2 | 3 | Number of papers: 1 4 | 5 | ## [Interleaving Static Analysis and LLM Prompting](paper_1.md) 6 | - **Authors**: Chapman, Patrick J and Rubio-Gonz{\'a}lez, Cindy and Thakur, Aditya V 7 | - **Abstract**: This paper presents a new approach for using Large Language Models (LLMs) to improve static program analysis. Specifically, during program analysis, we interleave calls to the static analyzer and queries to the LLM: the prompt used to query the LLM is constructed using intermediate results from the static analysis, and the result from the LLM query is used for subsequent analysis of the program. We apply this novel approach to the problem of error-specification inference of functions in systems ... 
8 | - **Link**: [Read Paper](https://web.cs.ucdavis.edu/~rubio/includes/soap24.pdf) 9 | - **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/SOAP2024/paper_1.md: -------------------------------------------------------------------------------- 1 | # Interleaving Static Analysis and LLM Prompting 2 | 3 | **Authors**: Chapman, Patrick J and Rubio-Gonz{\'a}lez, Cindy and Thakur, Aditya V 4 | 5 | **Abstract**: 6 | 7 | This paper presents a new approach for using Large Language Models (LLMs) to improve static program analysis. Specifically, during program analysis, we interleave calls to the static analyzer and queries to the LLM: the prompt used to query the LLM is constructed using intermediate results from the static analysis, and the result from the LLM query is used for subsequent analysis of the program. We apply this novel approach to the problem of error-specification inference of functions in systems code written in C; i.e., inferring the set of values returned by each function upon error, which can aid in program understanding as well as in finding error-handling bugs. We evaluate our approach on real-world C programs, such as MbedTLS and zlib, by incorporating LLMs into EESI, a state-of-the-art static analysis for error-specification inference. Compared to EESI, our approach achieves higher recall across all benchmarks (from average of 52.55% to 77.83%) and higher F1-score (from average of 0.612 to 0.804) while maintaining precision (from average of 86.67% to 85.12%). 8 | 9 | **Link**: [Read Paper](https://web.cs.ucdavis.edu/~rubio/includes/soap24.pdf) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/SOSP2024/README.md: -------------------------------------------------------------------------------- 1 | # SOSP2024 2 | 3 | Number of papers: 1 4 | 5 | ## [If At First You Don’t Succeed, Try, Try, Again...? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems](paper_1.md) 6 | - **Authors**: Bogdan A. Stoica, Utsav Sethi, Yiming Su, Cyrus Zhou, Shan Lu, Jonathan Mace, Madanlal Musuvathi, Suman Nath 7 | - **Abstract**: Retry—the re-execution of a task on failure—is a common mechanism to enable resilient software systems. Yet, despite its commonality and long history, retry remains difficult to implement and test. Guided by our study of real-world retry issues, we propose a novel suite of static and dynamic techniques to detect retry problems in software. We find that the ad-hoc nature of retry implementation poses challenges for traditional program analysis but can be well suited for large language models; and... 8 | - **Link**: [Read Paper](https://dl.acm.org/doi/pdf/10.1145/3694715.3695971) 9 | - **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/SOSP2024/paper_1.md: -------------------------------------------------------------------------------- 1 | # If At First You Don’t Succeed, Try, Try, Again...? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems 2 | 3 | **Authors**: Bogdan A. 
Stoica, Utsav Sethi, Yiming Su, Cyrus Zhou, Shan Lu, Jonathan Mace, Madanlal Musuvathi, Suman Nath 4 | 5 | **Abstract**: 6 | 7 | Retry—the re-execution of a task on failure—is a common mechanism to enable resilient software systems. Yet, despite its commonality and long history, retry remains difficult to implement and test. Guided by our study of real-world retry issues, we propose a novel suite of static and dynamic techniques to detect retry problems in software. We find that the ad-hoc nature of retry implementation poses challenges for traditional program analysis but can be well suited for large language models; and that carefully repurposing existing unit tests can, along with fault injection, expose various types of retry problems. 8 | 9 | **Link**: [Read Paper](https://dl.acm.org/doi/pdf/10.1145/3694715.3695971) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/TKDD2024/README.md: -------------------------------------------------------------------------------- 1 | # TKDD2024 2 | 3 | Number of papers: 1 4 | 5 | ## [Automatically Inspecting Thousands of Static Bug Warnings with Large Language Model: How Far Are We?](paper_1.md) 6 | - **Authors**: Cheng Wen, Yuandao Cai, Bin Zhang, Jie Su, Zhiwu Xu, Dugang Liu, Shengchao Qin, Zhong Ming, Tian Cong 7 | - **Abstract**: Static analysis tools for capturing bugs and vulnerabilities in software programs are widely employed in practice, as they have the unique advantages of high coverage and independence from the execution environment. However, existing tools for analyzing large codebases often produce a great deal of false warnings over genuine bug reports. As a result, developers are required to manually inspect and confirm each warning, a challenging, time-consuming, and automation-essential task. 8 | This article ... 9 | - **Link**: [Read Paper](https://dl.acm.org/doi/pdf/10.1145/3653718) 10 | - **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md) 11 | -------------------------------------------------------------------------------- /data/papers/venues/TMLR2024/README.md: -------------------------------------------------------------------------------- 1 | # TMLR2024 2 | 3 | Number of papers: 1 4 | 5 | ## [Unifying the perspectives of nlp and software engineering: A survey on language models for code](paper_1.md) 6 | - **Authors**: Zhang, Ziyin and Chen, Chaoyu and Liu, Bingchang and Liao, Cong and Gong, Zi and Yu, Hang and Li, Jianguo and Wang, Rui 7 | - **Abstract**: In this work we systematically review the recent advancements in software engineering with language models, covering 70+ models, 40+ evaluation tasks, 180+ datasets, and 900 related works. Unlike previous works, we integrate software engineering (SE) with natural language processing (NLP) by discussing the perspectives of both sides: SE applies language models for development automation, while NLP adopts SE tasks for language model evaluation. We break down code processing models into general la... 
8 | - **Link**: [Read Paper](https://arxiv.org/pdf/2311.07989) 9 | - **Labels**: [general coding task](../../labels/general_coding_task.md), [survey](../../labels/survey.md) 10 | -------------------------------------------------------------------------------- /data/papers/venues/TOSEM2024/paper_10.md: -------------------------------------------------------------------------------- 1 | # Survey of Code Search Based on Deep Learning 2 | 3 | **Authors**: Xie, Yutao and Lin, Jiayi and Dong, Hande and Zhang, Lei and Wu, Zhonghai 4 | 5 | **Abstract**: 6 | 7 | Code writing is repetitive and predictable, inspiring us to develop various code intelligence techniques. This survey focuses on code search, that is, to retrieve code that matches a given natural language query by effectively capturing the semantic similarity between the query and code. Deep learning, being able to extract complex semantics information, has achieved great success in this field. Recently, various deep learning methods, such as graph neural networks and pretraining models, have been applied to code search with significant progress. Deep learning is now the leading paradigm for code search. In this survey, we provide a comprehensive overview of deep learning-based code search. We review the existing deep learning-based code search framework that maps query/code to vectors and measures their similarity. Furthermore, we propose a new taxonomy to illustrate the state-of-the-art deep learning-based code search in a three-step process: query semantics modeling, code semantics modeling, and matching modeling, which involves the deep learning model training. Finally, we suggest potential avenues for future research in this promising field. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1145/3628161) 10 | 11 | **Labels**: [survey](../../labels/survey.md), [static analysis](../../labels/static_analysis.md), [code search](../../labels/code_search.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/TOSEM2024/paper_7.md: -------------------------------------------------------------------------------- 1 | # Harnessing the power of llm to support binary taint analysis 2 | 3 | **Authors**: Liu, Puzhuo and Sun, Chengnian and Zheng, Yaowen and Feng, Xuan and Qin, Chuan and Wang, Yuncheng and Li, Zhi and Sun, Limin 4 | 5 | **Abstract**: 6 | 7 | This paper proposes LATTE, the first static binary taint analysis that is powered by a large language model (LLM). LATTE is superior to the state of the art (e.g., Emtaint, Arbiter, Karonte) in three aspects. First, LATTE is fully automated while prior static binary taint analyzers need rely on human expertise to manually customize taint propagation rules and vulnerability inspection rules. Second, LATTE is significantly effective in vulnerability detection, demonstrated by our comprehensive evaluations. For example, LATTE has found 37 new bugs in real-world firmware which the baselines failed to find, and 7 of them have been assigned CVE numbers. Lastly, LATTE incurs remarkably low engineering cost, making it a cost-efficient and scalable solution for security researchers and practitioners. We strongly believe that LATTE opens up a new direction to harness the recent advance in LLMs to improve vulnerability analysis for binary programs. 
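The key move in this abstract is replacing hand-written taint-propagation and vulnerability-inspection rules with LLM queries over decompiled call sequences. The sketch below shows that division of labor on a toy trace; the `Call` record, the prompt text, and the canned `query_llm` responses are invented for illustration and do not reflect LATTE's actual prompts or pipeline.

```python
import json
from dataclasses import dataclass


@dataclass
class Call:
    callee: str
    args: list  # decompiled argument expressions


def query_llm(prompt: str) -> str:
    """Placeholder for any LLM client; canned answers so the sketch runs."""
    if "recv(" in prompt:
        return json.dumps({"tainted_outputs": ["buf"]})
    if "system(" in prompt:
        return json.dumps({"vulnerable": True, "type": "command injection"})
    return json.dumps({"tainted_outputs": [], "vulnerable": False})


def propagate(trace: list[Call]) -> list[dict]:
    """Walk a call trace; let the model decide how taint flows through each call
    and whether a tainted sink is dangerous, instead of hand-written rules."""
    tainted: set[str] = set()
    findings = []
    for call in trace:
        sig = f"{call.callee}({', '.join(map(str, call.args))})"
        step = json.loads(query_llm(
            f"Given tainted values {sorted(tainted)}, what does {sig} taint or expose?"))
        tainted.update(step.get("tainted_outputs", []))
        if step.get("vulnerable") and any(a in tainted for a in call.args):
            findings.append({"sink": sig, "type": step.get("type", "unknown")})
    return findings


if __name__ == "__main__":
    trace = [Call("recv", ["sock", "buf", "256", "0"]), Call("system", ["buf"])]
    print(propagate(trace))  # reports the tainted system() sink
```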
8 | 9 | **Link**: [Read Paper](https://arxiv.org/abs/2310.08275) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/TSE2024/paper_5.md: -------------------------------------------------------------------------------- 1 | # ChatGPT vs SBST: A Comparative Assessment of Unit Test Suite Generation 2 | 3 | **Authors**: Tang, Yutian and Liu, Zhijie and Zhou, Zhichao and Luo, Xiapu 4 | 5 | **Abstract**: 6 | 7 | Recent advancements in large language models (LLMs) have demonstrated exceptional success in a wide range of general domain tasks, such as question answering and following instructions. Moreover, LLMs have shown potential in various software engineering applications. In this study, we present a systematic comparison of test suites generated by the ChatGPT LLM and the state-of-the-art SBST tool EvoSuite. Our comparison is based on several critical factors, including correctness, readability, code coverage, and bug detection capability. By highlighting the strengths and weaknesses of LLMs (specifically ChatGPT) in generating unit test cases compared to EvoSuite, this work provides valuable insights into the performance of LLMs in solving software engineering problems. Overall, our findings underscore the potential of LLMs in software engineering and pave the way for further research in this area. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.1109/TSE.2024.3382365) 10 | 11 | **Labels**: [program testing](../../labels/program_testing.md), [unit testing](../../labels/unit_testing.md), [empirical study](../../labels/empirical_study.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/USENIXSec2023/paper_1.md: -------------------------------------------------------------------------------- 1 | # Lost at C: a user study on the security implications of large language model code assistants 2 | 3 | **Authors**: Sandoval, Gustavo and Pearce, Hammond and Nys, Teo and Karri, Ramesh and Garg, Siddharth and Dolan-Gavitt, Brendan 4 | 5 | **Abstract**: 6 | 7 | Large Language Models (LLMs) such as OpenAI Codex are increasingly being used as AI-based coding assistants. Understanding the impact of these tools on developers' code is paramount, especially as recent work showed that LLMs may suggest cybersecurity vulnerabilities. We conduct a security-driven user study (N=58) to assess code written by student programmers when assisted by LLMs. Given the potential severity of low-level bugs as well as their relative frequency in real-world projects, we tasked participants with implementing a singly-linked 'shopping list' structure in C. Our results indicate that the security impact in this setting (low-level C with pointer and array manipulations) is small: AI-assisted users produce critical security bugs at a rate no greater than 10% more than the control, indicating the use of LLMs does not introduce new security risks.
8 | 9 | **Link**: [Read Paper](https://arxiv.org/pdf/2208.09727) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md), [empirical study](../../labels/empirical_study.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/USENIXSec2023/paper_2.md: -------------------------------------------------------------------------------- 1 | # Continuous learning for android malware detection 2 | 3 | **Authors**: Chen, Yizheng and Ding, Zhoujie and Wagner, David 4 | 5 | **Abstract**: 6 | 7 | Machine learning methods can detect Android malware with very high accuracy. However, these classifiers have an Achilles heel, concept drift: they rapidly become out of date and ineffective, due to the evolution of malware apps and benign apps. Our research finds that, after training an Android malware classifier on one year's worth of data, the F1 score quickly dropped from 0.99 to 0.76 after 6 months of deployment on new test samples. 8 | 9 | **Link**: [Read Paper](https://surrealyz.github.io/files/pubs/sec23winter-active-learning-prepub.pdf) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md), [empirical study](../../labels/empirical_study.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/USENIXSec2024/paper_2.md: -------------------------------------------------------------------------------- 1 | # Leveraging Semantic Relations in Code and Data to Enhance Taint Analysis of Embedded Systems 2 | 3 | **Authors**: Zhao, Jiaxu and Li, Yuekang and Zou, Yanyan and Liang, Zhaohui and Xiao, Yang and Li, Yeting and Peng, Bingwei and Zhong, Nanyu and Wang, Xinyi and Wang, Wei and others 4 | 5 | **Abstract**: 6 | 7 | IoT devices have significantly impacted our daily lives, and detecting vulnerabilities in embedded systems early on is critical for ensuring their security. Among the existing vulnerability detection techniques for embedded systems, static taint analysis has been proven effective in detecting severe vulnerabilities, such as command injection vulnerabilities, which can cause remote code execution. Nevertheless, static taint analysis is faced with the problem of identifying sources comprehensively and accurately. 8 | 9 | **Link**: [Read Paper](https://www.usenix.org/system/files/usenixsecurity24-zhao.pdf) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md), [code model](../../labels/code_model.md), [code model training](../../labels/code_model_training.md), [source code model](../../labels/source_code_model.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/USENIXSec2024/paper_5.md: -------------------------------------------------------------------------------- 1 | # Hermes: Unlocking Security Analysis of Cellular Network Protocols by Synthesizing Finite State Machines from Natural Language Specifications 2 | 3 | **Authors**: Abdullah Al Ishtiaq, Sarkar Snigdha Sarathi Das, Syed Md Mukit Rashid, Ali Ranjbar, Kai Tu, Tianwei Wu, Zhezheng Song, Weixuan Wang, Mujtahid Akon, Rui Zhang, Syed Rafiul Hussain 4 | 5 | **Abstract**: 6 | 7 | In this paper, we present Hermes, an end-to-end framework to automatically generate formal representations from natural language cellular specifications. 
We first develop a neural constituency parser, NEUTREX, to process transition-relevant texts and extract transition components (i.e., states, conditions, and actions). We also design a domain-specific language to translate these transition components to logical formulas by leveraging dependency parse trees. Finally, we compile these logical formulas to generate transitions and create the formal model as finite state machines. To demonstrate the effectiveness of Hermes, we evaluate it on 4G NAS, 5G NAS, and 5G RRC specifications and obtain an overall accuracy of 81-87%, which is a substantial improvement over the state-of-the-art. Our security analysis of the extracted models uncovers 3 new vulnerabilities and identifies 19 previous attacks in 4G and 5G specifications, and 7 deviations in commercial 4G basebands. 8 | 9 | **Link**: [Read Paper](https://arxiv.org/abs/2310.04381) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md), [specification inference](../../labels/specification_inference.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/arXiv2023/paper_14.md: -------------------------------------------------------------------------------- 1 | # Cumulative reasoning with large language models 2 | 3 | **Authors**: Zhang, Yifan and Yang, Jingqin and Yuan, Yang and Yao, Andrew Chi-Chih 4 | 5 | **Abstract**: 6 | 7 | While language models are powerful and versatile, they often fail to address highly complex problems. This is because solving complex problems requires deliberate thinking, which has been only minimally guided during training. In this paper, we propose a new method called Cumulative Reasoning (CR), which employs language models in a cumulative and iterative manner to emulate human thought processes. By decomposing tasks into smaller components, CR streamlines the problem-solving process, rendering it both more manageable and effective. For logical inference tasks, CR consistently outperforms existing methods with an improvement up to 9.3%, and achieves the astonishing accuracy of 98.04% on the curated FOLIO wiki dataset. In the context of the Game of 24, CR achieves an accuracy of 94%, which signifies a substantial enhancement of 20% over the previous state-of-the-art method. 8 | 9 | **Link**: [Read Paper](https://arxiv.org/pdf/2308.04371.pdf) 10 | 11 | **Labels**: [hallucination in reasoning](../../labels/hallucination_in_reasoning.md), [agent design](../../labels/agent_design.md), [prompt strategy](../../labels/prompt_strategy.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/arXiv2023/paper_2.md: -------------------------------------------------------------------------------- 1 | # Lmpa: Improving decompilation by synergy of large language model and program analysis 2 | 3 | **Authors**: Xu, Xiangzhe and Zhang, Zhuo and Feng, Shiwei and Ye, Yapeng and Su, Zian and Jiang, Nan and Cheng, Siyuan and Tan, Lin and Zhang, Xiangyu 4 | 5 | **Abstract**: 6 | 7 | Decompilation aims to recover the source code form of a binary executable. It has many applications in security and software engineering such as malware analysis, vulnerability detection and code reuse. A prominent challenge in decompilation is to recover variable names. We propose a novel method that leverages the synergy of large language model (LLM) and program analysis.
Language models encode rich multi-modal knowledge, but its limited input size prevents providing sufficient global context for name recovery. We propose to divide the task to many LLM queries and use program analysis to correlate and propagate the query results, which in turn improves the performance of LLM by providing additional contextual information. Our results show that 75% of the recovered names are considered good by users and our technique outperforms the state-of-the-art technique by 16.5% and 20.23% in precision and recall, respectively. 8 | 9 | **Link**: [Read Paper](https://arxiv.org/pdf/2306.02546v1) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [program decompilation](../../labels/program_decompilation.md), [code model](../../labels/code_model.md), [code model training](../../labels/code_model_training.md), [binary code model](../../labels/binary_code_model.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/arXiv2023/paper_5.md: -------------------------------------------------------------------------------- 1 | # How Far Have We Gone in Vulnerability Detection Using Large Language Models 2 | 3 | **Authors**: Zeyu Gao and Hao Wang and Yuchen Zhou and Wenyu Zhu and Chao Zhang 4 | 5 | **Abstract**: 6 | 7 | As software becomes increasingly complex and prone to vulnerabilities, automated vulnerability detection is critically important, yet challenging. Given the significant successes of large language models (LLMs) in various tasks, there is growing anticipation of their efficacy in vulnerability detection. However, a quantitative understanding of their potential in vulnerability detection is still missing. To bridge this gap, we introduce a comprehensive vulnerability benchmark VulBench. This benchmark aggregates high-quality data from a wide range of CTF (Capture-the-Flag) challenges and real-world applications, with annotations for each vulnerable function detailing the vulnerability type and its root cause. Through our experiments encompassing 16 LLMs and 6 state-of-the-art (SOTA) deep learning-based models and static analyzers, we find that several LLMs outperform traditional deep learning approaches in vulnerability detection, revealing an untapped potential in LLMs. This work contributes to the understanding and utilization of LLMs for enhanced software security. 8 | 9 | **Link**: [Read Paper](https://doi.org/10.48550/arXiv.2311.12420) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md), [benchmark](../../labels/benchmark.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/arXiv2024/paper_1.md: -------------------------------------------------------------------------------- 1 | # Security of Language Models for Code: A Systematic Literature Review 2 | 3 | **Authors**: Chen, Yuchen and Sun, Weisong and Fang, Chunrong and Chen, Zhenpeng and Ge, Yifei and Han, Tingxu and Zhang, Quanjun and Liu, Yang and Chen, Zhenyu and Xu, Baowen 4 | 5 | **Abstract**: 6 | 7 | Language models for code (CodeLMs) have emerged as powerful tools for code-related tasks, outperforming traditional methods and standard machine learning approaches. However, these models are susceptible to security vulnerabilities, drawing increasing research attention from domains such as software engineering, artificial intelligence, and cybersecurity. 
Despite the growing body of research focused on the security of CodeLMs, a comprehensive survey in this area remains absent. To address this gap, we systematically review 67 relevant papers, organizing them based on attack and defense strategies. Furthermore, we provide an overview of commonly used language models, datasets, and evaluation metrics, and highlight open-source tools and promising directions for future research in securing CodeLMs. 8 | 9 | **Link**: [Read Paper](https://arxiv.org/pdf/2410.15631) 10 | 11 | **Labels**: [code model](../../labels/code_model.md), [code model security](../../labels/code_model_security.md), [survey](../../labels/survey.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/arXiv2024/paper_12.md: -------------------------------------------------------------------------------- 1 | # Automatic Programming: Large Language Models and Beyond 2 | 3 | **Authors**: Lyu, Michael R and Ray, Baishakhi and Roychoudhury, Abhik and Tan, Shin Hwei and Thongtanunam, Patanamon 4 | 5 | **Abstract**: 6 | 7 | Automatic programming has seen increasing popularity due to the emergence of tools like GitHub Copilot which rely on Large Language Models (LLMs). At the same time, automatically generated code faces challenges during deployment due to concerns around quality and trust. In this article, we study automated coding in a general sense and study the concerns around code quality, security and related issues of programmer responsibility. These are key issues for organizations while deciding on the usage of automatically generated code. We discuss how advances in software engineering such as program repair and analysis can enable automatic programming. We conclude with a forward looking view, focusing on the programming environment of the near future, where programmers may need to switch to different roles to fully utilize the power of automatic programming. Automated repair of automatically generated programs from LLMs, can help produce higher assurance code from LLMs, along with evidence of assurance. 8 | 9 | **Link**: [Read Paper](https://arxiv.org/pdf/2405.02213) 10 | 11 | **Labels**: [general coding task](../../labels/general_coding_task.md), [empirical study](../../labels/empirical_study.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/arXiv2024/paper_21.md: -------------------------------------------------------------------------------- 1 | # LLM-Assisted Static Analysis for Detecting Security Vulnerabilities 2 | 3 | **Authors**: Li, Ziyang and Dutta, Saikat and Naik, Mayur 4 | 5 | **Abstract**: 6 | 7 | Software is prone to security vulnerabilities. Program analysis tools to detect them have limited effectiveness in practice due to their reliance on human labeled specifications. Large language models (or LLMs) have shown impressive code generation capabilities but they cannot do complex reasoning over code to detect such vulnerabilities especially since this task requires whole-repository analysis. We propose IRIS, a neuro-symbolic approach that systematically combines LLMs with static analysis to perform whole-repository reasoning for security vulnerability detection. Specifically, IRIS leverages LLMs to infer taint specifications and perform contextual analysis, alleviating needs for human specifications and inspection. 
For evaluation, we curate a new dataset, CWE-Bench-Java, comprising 120 manually validated security vulnerabilities in real-world Java projects. A state-of-the-art static analysis tool CodeQL detects only 27 of these vulnerabilities whereas IRIS with GPT-4 detects 55 (+28) and improves upon CodeQL's average false discovery rate by 5% points. Furthermore, IRIS identifies 6 previously unknown vulnerabilities which cannot be found by existing tools. 8 | 9 | **Link**: [Read Paper](https://arxiv.org/abs/2405.17238) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/arXiv2024/paper_22.md: -------------------------------------------------------------------------------- 1 | # Specification-Driven Code Translation Powered by Large Language Models: How Far Are We? 2 | 3 | **Authors**: Soumit Kanti Saha, Fazle Rabbi, Song Wang, Jinqiu Yang 4 | 5 | **Abstract**: 6 | 7 | Large Language Models (LLMs) are increasingly being applied across various domains, including code-related tasks such as code translation. Previous studies have explored using LLMs for translating code between different programming languages. Since LLMs are more effective with natural language, using natural language as an intermediate representation in code translation tasks presents a promising approach. In this work, we investigate using NL-specification as an intermediate representation for code translation. We evaluate our method using three datasets, five popular programming languages, and 29 language pair permutations. Our results show that using NL-specification alone does not lead to performance improvements. However, when combined with source code, it provides a slight improvement over the baseline in certain language pairs. Besides analyzing the performance of code translation, we also investigate the quality of the translated code and provide insights into the issues present in the translated code. 8 | 9 | **Link**: [Read Paper](https://arxiv.org/pdf/2412.04590) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program transformation](../../labels/program_transformation.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/arXiv2024/paper_3.md: -------------------------------------------------------------------------------- 1 | # SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI 2 | 3 | **Authors**: Yu Yang, Yuzhou Nie, Zhun Wang, Yuheng Tang, Wenbo Guo, Bo Li, Dawn Song 4 | 5 | **Abstract**: 6 | 7 | Language models for code (CodeLMs) have emerged as powerful tools for code-related tasks, outperforming traditional methods and standard machine learning approaches. However, these models are susceptible to security vulnerabilities, drawing increasing research attention from domains such as software engineering, artificial intelligence, and cybersecurity. Despite the growing body of research focused on the security of CodeLMs, a comprehensive survey in this area remains absent. To address this gap, we systematically review 67 relevant papers, organizing them based on attack and defense strategies. Furthermore, we provide an overview of commonly used language models, datasets, and evaluation metrics, and highlight open-source tools and promising directions for future research in securing CodeLMs. 
8 | 9 | **Link**: [Read Paper](https://arxiv.org/abs/2410.11096) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program synthesis](../../labels/program_synthesis.md), [code model](../../labels/code_model.md), [code model security](../../labels/code_model_security.md), [benchmark](../../labels/benchmark.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/arXiv2024/paper_30.md: -------------------------------------------------------------------------------- 1 | # Llm4fuzz: Guided fuzzing of smart contracts with large language models 2 | 3 | **Authors**: Shou, Chaofan and Liu, Jing and Lu, Doudou and Sen, Koushik 4 | 5 | **Abstract**: 6 | 7 | As blockchain platforms grow exponentially, millions of lines of smart contract code are being deployed to manage extensive digital assets. However, vulnerabilities in this mission-critical code have led to significant exploitations and asset losses. Thorough automated security analysis of smart contracts is thus imperative. This paper introduces LLM4Fuzz to optimize automated smart contract security analysis by leveraging large language models (LLMs) to intelligently guide and prioritize fuzzing campaigns. While traditional fuzzing suffers from low efficiency in exploring the vast state space, LLM4Fuzz employs LLMs to direct fuzzers towards high-value code regions and input sequences more likely to trigger vulnerabilities. Additionally, LLM4Fuzz can leverage LLMs to guide fuzzers based on user-defined invariants, reducing blind exploration overhead. Evaluations of LLM4Fuzz on real-world DeFi projects show substantial gains in efficiency, coverage, and vulnerability detection compared to baseline fuzzing. LLM4Fuzz also uncovered five critical vulnerabilities that can lead to a loss of more than $247k. 8 | 9 | **Link**: [Read Paper](https://arxiv.org/pdf/2401.11108.pdf) 10 | 11 | **Labels**: [program testing](../../labels/program_testing.md), [fuzzing](../../labels/fuzzing.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/arXiv2024/paper_31.md: -------------------------------------------------------------------------------- 1 | # LLMorpheus: Mutation Testing using Large Language Models 2 | 3 | **Authors**: Tip, Frank and Bell, Jonathan and Schäfer, Max 4 | 5 | **Abstract**: 6 | 7 | In mutation testing, the quality of a test suite is evaluated by introducing faults into a program and determining whether the program's tests detect them. Most existing approaches for mutation testing involve the application of a fixed set of mutation operators, e.g., replacing a "+" with a "-" or removing a function's body. However, certain types of real-world bugs cannot easily be simulated by such approaches, limiting their effectiveness. This paper presents a technique where a Large Language Model (LLM) is prompted to suggest mutations by asking it what placeholders that have been inserted in source code could be replaced with. The technique is implemented in LLMorpheus, a mutation testing tool for JavaScript, and evaluated on 13 subject packages, considering several variations on the prompting strategy, and using several LLMs. We find LLMorpheus to be capable of producing mutants that resemble existing bugs that cannot be produced by StrykerJS, a state-of-the-art mutation testing tool. Moreover, we report on the running time, cost, and number of mutants produced by LLMorpheus, demonstrating its practicality.
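The placeholder-based prompting described in this abstract can be pictured with a short, purely illustrative sketch; it is not the LLMorpheus implementation (which targets JavaScript), and the `query_llm` callable and prompt wording are assumptions made for this example: mask an expression with a placeholder token, ask an LLM for candidate replacements, and splice each candidate back in as a mutant.

```python
# Illustrative sketch only -- not the LLMorpheus implementation.
# Shape of placeholder-based mutant generation: mask an expression,
# ask an LLM for candidate replacements, splice each candidate back in.

from typing import Callable, List

PROMPT_TEMPLATE = """The code below contains the token <PLACEHOLDER>.
Suggest {n} different expressions that could replace <PLACEHOLDER>,
one per line, with no explanation.

{code}
"""


def generate_mutants(original: str, target: str,
                     query_llm: Callable[[str], str], n: int = 3) -> List[str]:
    """Return up to n mutants of `original` in which the expression `target`
    is replaced by an LLM-suggested alternative."""
    masked = original.replace(target, "<PLACEHOLDER>", 1)
    prompt = PROMPT_TEMPLATE.format(n=n, code=masked)
    suggestions = [s.strip() for s in query_llm(prompt).splitlines() if s.strip()]
    mutants = []
    for suggestion in suggestions[:n]:
        if suggestion != target:  # skip the identity "mutation"
            mutants.append(masked.replace("<PLACEHOLDER>", suggestion, 1))
    return mutants


if __name__ == "__main__":
    code = "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n"
    # Stand-in for a real LLM call; returns three candidate expressions.
    fake_llm = lambda _prompt: "min(lo, min(x, hi))\nmax(hi, min(x, lo))\nx"
    for mutant in generate_mutants(code, "max(lo, min(x, hi))", fake_llm):
        print(mutant)
```

In a real mutation-testing loop, each mutant would then be run against the project's test suite; mutants that no test kills point to gaps in the suite.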
8 | 9 | **Link**: [Read Paper](https://arxiv.org/pdf/2404.09952) 10 | 11 | **Labels**: [program testing](../../labels/program_testing.md), [mutation testing](../../labels/mutation_testing.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/arXiv2024/paper_32.md: -------------------------------------------------------------------------------- 1 | # Teams of LLM Agents can Exploit Zero-Day Vulnerabilities 2 | 3 | **Authors**: Fang, Richard and Bindu, Rohan and Gupta, Akul and Zhan, Qiusi and Kang, Daniel 4 | 5 | **Abstract**: 6 | 7 | LLM agents have become increasingly sophisticated, especially in the realm of cybersecurity. Researchers have shown that LLM agents can exploit real-world vulnerabilities when given a description of the vulnerability and toy capture-the-flag problems. However, these agents still perform poorly on real-world vulnerabilities that are unknown to the agent ahead of time (zero-day vulnerabilities). In this work, we show that teams of LLM agents can exploit real-world, zero-day vulnerabilities. Prior agents struggle with exploring many different vulnerabilities and long-range planning when used alone. To resolve this, we introduce HPTSA, a system of agents with a planning agent that can launch subagents. The planning agent explores the system and determines which subagents to call, resolving long-term planning issues when trying different vulnerabilities. We construct a benchmark of 15 real-world vulnerabilities and show that our team of agents improve over prior work by up to 4.5. 8 | 9 | **Link**: [Read Paper](https://arxiv.org/abs/2406.01637) 10 | 11 | **Labels**: [program testing](../../labels/program_testing.md), [vulnerability exploitation](../../labels/vulnerability_exploitation.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/arXiv2024/paper_34.md: -------------------------------------------------------------------------------- 1 | # Large language model-based agents for software engineering: A survey 2 | 3 | **Authors**: Liu, Junwei and Wang, Kaixin and Chen, Yixuan and Peng, Xin and Chen, Zhenpeng and Zhang, Lingming and Lou, Yiling 4 | 5 | **Abstract**: 6 | 7 | The recent advance in Large Language Models (LLMs) has shaped a new paradigm of AI agents, i.e., LLM-based agents. Compared to standalone LLMs, LLM-based agents substantially extend the versatility and expertise of LLMs by enhancing LLMs with the capabilities of perceiving and utilizing external resources and tools. To date, LLM-based agents have been applied and shown remarkable effectiveness in Software Engineering (SE). The synergy between multiple agents and human interaction brings further promise in tackling complex real-world SE problems. In this work, we present a comprehensive and systematic survey on LLM-based agents for SE. We collect 106 papers and categorize them from two perspectives, i.e., the SE and agent perspectives. In addition, we discuss open challenges and future directions in this critical domain. The repository of this survey is at https://github.com/FudanSELab/Agent4SE-Paper-List. 
8 | 9 | **Link**: [Read Paper](https://arxiv.org/pdf/2409.02977) 10 | 11 | **Labels**: [survey](../../labels/survey.md), [agent design](../../labels/agent_design.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/arXiv2024/paper_40.md: -------------------------------------------------------------------------------- 1 | # Large Language Models Based Fuzzing Techniques: A Survey 2 | 3 | **Authors**: Misu, Md Rakib Hossain and Lopes, Cristina V. and Ma, Iris and Noble, James 4 | 5 | **Abstract**: 6 | 7 | In the modern era where software plays a pivotal role, software security and vulnerability analysis have become essential for software development. Fuzz testing, as an efficient software testing method, is widely used in various domains. Moreover, the rapid development of Large Language Models (LLMs) has facilitated their application in the field of software testing, demonstrating remarkable performance. Considering that existing fuzz testing techniques are not entirely automated and software vulnerabilities continue to evolve, there is a growing trend towards employing fuzz tests generated by large language models. This survey provides a systematic overview of the approaches that fuse LLMs and fuzz tests for software testing. In this paper, a statistical analysis and discussion of the literature in three areas, including LLMs, fuzz testing, and fuzz tests generated by LLMs, are conducted by summarising the state-of-the-art methods up until 2024. Our survey also investigates the potential for widespread deployment and application of fuzz testing techniques generated by LLMs in the future. 8 | 9 | **Link**: [Read Paper](https://arxiv.org/pdf/2402.00350) 10 | 11 | **Labels**: [program testing](../../labels/program_testing.md), [fuzzing](../../labels/fuzzing.md), [survey](../../labels/survey.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/arXiv2024/paper_9.md: -------------------------------------------------------------------------------- 1 | # CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks 2 | 3 | **Authors**: Xie, Yiqing and Xie, Alex and Sheth, Divyanshu and Liu, Pengfei and Fried, Daniel and Rose, Carolyn 4 | 5 | **Abstract**: 6 | 7 | To facilitate evaluation of code generation systems across diverse scenarios, we present CodeBenchGen, a framework to create scalable execution-based benchmarks that only requires light guidance from humans. Specifically, we leverage a large language model (LLM) to convert an arbitrary piece of code into an evaluation example, including test cases for execution-based evaluation. We illustrate the usefulness of our framework by creating a dataset, Exec-CSN, which includes 1,931 examples involving 293 libraries revised from code in 367 GitHub repositories taken from the CodeSearchNet dataset. To demonstrate the complexity and solvability of examples in Exec-CSN, we present a human study demonstrating that 81.3% of the examples can be solved by humans and 61% are rated as "requires effort to solve". We conduct code generation experiments on open-source and proprietary models and analyze the performance of both humans and models. We will release the code of both the framework and the dataset upon acceptance.
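As a rough, hypothetical illustration of what an execution-based evaluation example can look like (this is not the CodeBenchGen pipeline; the `EvalExample` fields and the harness below are assumptions), a benchmark entry pairs a task prompt with executable test cases, and a model's generated solution counts as solved if the tests run cleanly:

```python
# Illustrative sketch only -- not the CodeBenchGen pipeline.
# Shape of an execution-based evaluation example and a harness that judges a
# model-generated solution by running the example's tests.

import os
import subprocess
import sys
import tempfile
import textwrap
from dataclasses import dataclass


@dataclass
class EvalExample:
    task_prompt: str   # natural-language task shown to the model
    test_code: str     # executable assertions that define correctness


def run_example(example: EvalExample, generated_code: str, timeout: int = 10) -> bool:
    """Run the generated code together with the example's tests in a fresh
    subprocess; the example counts as solved iff the process exits cleanly."""
    program = generated_code + "\n\n" + example.test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)


if __name__ == "__main__":
    example = EvalExample(
        task_prompt="Write a function flatten(xs) that flattens one level of nesting.",
        test_code=textwrap.dedent("""
            assert flatten([[1, 2], [3]]) == [1, 2, 3]
            assert flatten([]) == []
        """),
    )
    candidate = "def flatten(xs):\n    return [y for x in xs for y in x]\n"
    print("solved" if run_example(example, candidate) else "not solved")
```

A framework like the one described in the abstract would additionally use an LLM to derive the task prompt and the test cases from an arbitrary piece of source code; here both are hard-coded for brevity.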
8 | 9 | **Link**: [Read Paper](https://arxiv.org/pdf/2404.00566) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [benchmark](../../labels/benchmark.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/arXiv2025/paper_14.md: -------------------------------------------------------------------------------- 1 | # Fixing 7,400 Bugs for 1$: Cheap Crash-Site Program Repair 2 | 3 | **Authors**: Han Zheng, Ilia Shumailov, Tianqi Fan, Aiden Hall, Mathias Payer 4 | 5 | **Abstract**: 6 | 7 | The rapid advancement of bug-finding techniques has led to the discovery of more vulnerabilities than developers can reasonably fix, creating an urgent need for effective Automated Program Repair (APR) methods. However, the complexity of modern bugs often makes precise root cause analysis difficult and unreliable. To address this challenge, we propose crash-site repair to simplify the repair task while still mitigating the risk of exploitation. In addition, we introduce a template-guided patch generation approach that significantly reduces the token cost of Large Language Models (LLMs) while maintaining both efficiency and effectiveness. We implement our prototype system, WILLIAMT, and evaluate it against state-of-the-art APR tools. Our results show that, when combined with the top-performing agent CodeRover-S, WILLIAMT reduces token cost by 45.9% and increases the bug-fixing rate to 73.5% (+29.6%) on ARVO, a ground-truth open source software vulnerabilities benchmark. Furthermore, we demonstrate that WILLIAMT can function effectively even without access to frontier LLMs: even a local model running on a Mac M4 Mini achieves a reasonable repair rate. These findings highlight the broad applicability and scalability of WILLIAMT. 8 | 9 | **Link**: [Read Paper](https://arxiv.org/abs/2505.13103) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [program repair](../../labels/program_repair.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/arXiv2025/paper_16.md: -------------------------------------------------------------------------------- 1 | # AI Software Engineer: Programming with Trust 2 | 3 | **Authors**: Abhik Roychoudhury, Corina Pasareanu, Michael Pradel, Baishakhi Ray 4 | 5 | **Abstract**: 6 | 7 | Large Language Models (LLMs) have shown surprising proficiency in generating code snippets, promising to automate large parts of software engineering via artificial intelligence (AI). We argue that successfully deploying AI software engineers requires a level of trust equal to or even greater than the trust established by human-driven software engineering practices. The recent trend toward LLM agents offers a path toward integrating the power of LLMs to create new code with the power of analysis tools to increase trust in the code. This opinion piece comments on whether LLM agents could dominate software engineering workflows in the future and whether the focus of programming will shift from programming at scale to programming with trust. 
8 | 9 | **Link**: [Read Paper](https://arxiv.org/pdf/2502.13767) 10 | 11 | **Labels**: [code generation](../../labels/code_generation.md), [survey](../../labels/survey.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/arXiv2025/paper_2.md: -------------------------------------------------------------------------------- 1 | # The Hitchhiker's Guide to Program Analysis, Part II: Deep Thoughts by LLMs 2 | 3 | **Authors**: Haonan Li, Hang Zhang, Kexin Pei, Zhiyun Qian 4 | 5 | **Abstract**: 6 | 7 | Static analysis is a cornerstone for software vulnerability detection, yet it often struggles with the classic precision-scalability trade-off. In practice, such tools often produce high false positive rates, particularly in large codebases like the Linux kernel. This imprecision can arise from simplified vulnerability modeling and over-approximation of path and data constraints. While large language models (LLMs) show promise in code understanding, their naive application to program analysis yields unreliable results due to inherent reasoning limitations. We introduce BugLens, a post-refinement framework that significantly improves static analysis precision. BugLens guides an LLM to follow traditional analysis steps by assessing buggy code patterns for security impact and validating the constraints associated with static warnings. Evaluated on real-world Linux kernel bugs, BugLens raises precision from 0.10 (raw) and 0.50 (semi-automated refinement) to 0.72, substantially reducing false positives and revealing four previously unreported vulnerabilities. Our results suggest that a structured LLM-based workflow can meaningfully enhance the effectiveness of static analysis tools. 8 | 9 | **Link**: [Read Paper](https://arxiv.org/abs/2504.11711) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [bug detection](../../labels/bug_detection.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/arXiv2025/paper_4.md: -------------------------------------------------------------------------------- 1 | # Language Models for Code Optimization: Survey, Challenges and Future Directions 2 | 3 | **Authors**: Jingzhi Gong, Vardan Voskanyan, Paul Brookes, Fan Wu, Wei Jie, Jie Xu, Rafail Giavrimis, Mike Basios, Leslie Kanthan, Zheng Wang 4 | 5 | **Abstract**: 6 | 7 | Language models (LMs) built upon deep neural networks (DNNs) have recently demonstrated breakthrough effectiveness in software engineering tasks such as code generation, completion, and repair. This has paved the way for the emergence of LM-based code optimization techniques, which are crucial for enhancing the performance of existing programs, such as accelerating program execution time. However, a comprehensive survey dedicated to this specific application has been lacking. To fill this gap, we present a systematic literature review of over 50 primary studies, identifying emerging trends and addressing 11 specialized questions. Our findings reveal five critical open challenges, such as balancing model complexity with practical usability, cross-language/performance generalizability, and building trust in AI-driven solutions. Furthermore, we provide eight future research directions to facilitate more efficient, robust, and reliable LM-based code optimization. Thereby, this study aims to provide actionable insights and foundational references for both researchers and practitioners in this rapidly evolving field. 
8 | 9 | **Link**: [Read Paper](https://arxiv.org/abs/2501.01277) 10 | 11 | **Labels**: [static analysis](../../labels/static_analysis.md), [program optimization](../../labels/program_optimization.md), [survey](../../labels/survey.md) 12 | -------------------------------------------------------------------------------- /data/papers/venues/arXiv2025/paper_7.md: -------------------------------------------------------------------------------- 1 | # Is Your Benchmark (Still) Useful? Dynamic Benchmarking for Code Language Models 2 | 3 | **Authors**: Batu Guan, Xiao Wu, Yuanyuan Yuan, Shaohua Li 4 | 5 | **Abstract**: 6 | 7 | In this paper, we tackle a critical challenge in model evaluation: how to keep code benchmarks useful when models might have already seen them during training. We introduce a novel solution, dynamic benchmarking framework, to address this challenge. Given a code understanding or reasoning benchmark, our framework dynamically transforms each input, i.e., programs, with various semantic-preserving mutations to build a syntactically new while semantically identical benchmark. We evaluated ten popular language models on our dynamic benchmarks. Our evaluation reveals several interesting or surprising findings: (1) all models perform significantly worse than before, (2) the ranking between some models shifts dramatically, and (3) our dynamic benchmarks can resist against the data contamination problem. 8 | 9 | **Link**: [Read Paper](https://arxiv.org/abs/2503.06643) 10 | 11 | **Labels**: [benchmark](../../labels/benchmark.md) 12 | -------------------------------------------------------------------------------- /data/rawdata/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PurCL/CodeLLMPaper/2df01657bb46cba896ada39d3bf08814ae6cc751/data/rawdata/.DS_Store -------------------------------------------------------------------------------- /src/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PurCL/CodeLLMPaper/2df01657bb46cba896ada39d3bf08814ae6cc751/src/.DS_Store --------------------------------------------------------------------------------
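Relatedly, the dynamic benchmarking entry above (arXiv2025/paper_7) rests on semantic-preserving program mutations. The snippet below is a minimal, illustrative sketch of one such mutation, consistent identifier renaming with Python's `ast` module (requires Python 3.9+ for `ast.unparse`); it is not the paper's framework. The transformed program reads differently but behaves identically, since only locally bound names are renamed.

```python
# Illustrative sketch only -- not the dynamic benchmarking framework above.
# One example of a semantic-preserving mutation: consistently rename locally
# bound identifiers so the program text changes while its behavior does not.

import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Rename function arguments and locally assigned variables to fresh names;
    names that are only read (e.g., builtins like len) are left untouched."""

    def __init__(self):
        self.mapping = {}

    def _fresh(self, name: str) -> str:
        return self.mapping.setdefault(name, f"var_{len(self.mapping)}")

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self._fresh(node.arg)
        return node

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if isinstance(node.ctx, ast.Store) or node.id in self.mapping:
            node.id = self._fresh(node.id)
        return node


source = """
def average(numbers):
    total = 0
    for value in numbers:
        total += value
    return total / len(numbers)
"""

tree = ast.parse(source)
mutated = ast.unparse(RenameIdentifiers().visit(tree))
print(mutated)  # syntactically different, behaviorally identical
```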