├── README.md
├── benchmark
│   ├── 0_test
│   │   ├── corpus.json
│   │   ├── corpus.pt
│   │   └── datas.json
│   ├── 1_environment
│   │   ├── corpus.json
│   │   └── datas.json
│   ├── 2_mining
│   │   ├── corpus.json
│   │   └── datas.json
│   ├── 3_transport
│   │   ├── corpus.json
│   │   └── datas.json
│   ├── 4_aerospace
│   │   ├── corpus.json
│   │   └── datas.json
│   ├── 5_telecom
│   │   ├── corpus.json
│   │   └── datas.json
│   ├── 6_architecture
│   │   ├── corpus.json
│   │   └── datas.json
│   ├── 7_water
│   │   ├── corpus.json
│   │   └── datas.json
│   └── 8_farming
│       ├── corpus.json
│       └── datas.json
├── eval_0_prepare_embeddings.py
├── eval_5_ours.py
├── eval_results
│   └── 0_test
│       ├── generated_by_5_ours_5_2_1_.jsonl
│       └── generated_by_5_ours_5_2_1_.jsonl_score.jsonl
├── eval_score_calculate.py
├── eval_score_show.py
├── ours_framework.py
├── requirements.txt
└── utils
    ├── bench_fig.png
    ├── embedder.py
    ├── head_fig.PNG
    ├── method_fig.png
    ├── openai_api.py
    ├── prompts.py
    └── qwen_api.py

/README.md:
--------------------------------------------------------------------------------
1 | # DeepSolution
2 | DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking
3 | https://arxiv.org/abs/2502.20730
4 | https://huggingface.co/papers/2502.20730
5 | https://huggingface.co/datasets/lzq2021/SolutionBench
6 | 
7 | 
8 | ## 0. Introduction
9 | Designing solutions for complex engineering challenges is crucial in human production activities.
10 | However, previous research in the retrieval-augmented generation (RAG) field has not sufficiently addressed tasks related to the design of complex engineering solutions.
11 | To fill this gap, we introduce a new benchmark, SolutionBench, to evaluate a system’s ability to generate complete and feasible solutions for engineering problems with multiple complex constraints.
12 | To further advance the design of complex engineering solutions, we propose a novel system, SolutionRAG, that leverages tree-based exploration and a bi-point thinking mechanism to generate reliable solutions.
13 | 
14 | 
15 | ![](utils/head_fig.PNG)
16 | 
17 | 
18 | ## 1. SolutionBench
19 | The SolutionBench is in ```benchmark``` or at https://huggingface.co/datasets/lzq2021/SolutionBench. The construction process is illustrated in the figure below: we first collect engineering technical reports about complex solution design from authoritative journals across various engineering fields. Then, based on manually formatted extraction templates, we use powerful LLMs to extract the useful content. Finally, after manual checking and redundancy removal, the extracted content is integrated into a complete benchmark.
20 | ![](utils/bench_fig.png)
21 | 
22 | ## 2. SolutionRAG
23 | Since the improvement process from a suboptimal solution to a reliable one is flexible and lacks a fixed reasoning pattern, SolutionRAG performs tree-based exploration to find the most effective improvement process for each input requirement.
24 | Moreover, because the requirements contain multiple real-world constraints, the system cannot directly guarantee that the generated solutions satisfy all of them. Therefore, SolutionRAG employs a bi-point thinking approach, alternating between solution design and review, gradually enhancing the solution’s completeness and reliability.
25 | Finally, to balance inference performance and efficiency, SolutionRAG uses node evaluation to prune the tree, ensuring that inference follows the most promising solutions and the most helpful review comments.
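The design–review loop described above can be sketched in a few lines of Python. This is a minimal illustration only, not the repository's actual implementation (see `ours_framework.py`): the functions `propose_solution`, `review`, and `score` are hypothetical stand-ins for the system's LLM calls.

```python
# Minimal sketch of tree-based exploration with bi-point thinking and pruning.
# All LLM calls are stubbed; names below are hypothetical, not the repo's API.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                      # "solution" or "comment"
    text: str
    score: float = 0.0
    children: list = field(default_factory=list)

def propose_solution(requirement, comment=None):
    # Stand-in for an LLM call that drafts or refines a solution.
    base = f"solution for: {requirement}"
    return base if comment is None else f"{base} (refined per: {comment})"

def review(solution):
    # Stand-in for an LLM call that critiques a candidate solution.
    return f"comment on: {solution}"

def score(text):
    # Stand-in for node evaluation; the real system scores nodes with the LLM.
    return float(len(text))

def solution_rag(requirement, depth=2, branch=2, keep=1):
    """Alternate solution design and review, pruning to the best `keep` nodes."""
    root = Node("solution", propose_solution(requirement))
    frontier = [root]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            # Bi-point thinking: each kept solution node gets a review comment,
            # and each comment spawns refined candidate solutions.
            comment = Node("comment", review(node.text))
            node.children.append(comment)
            for _ in range(branch):
                child = Node("solution", propose_solution(requirement, comment.text))
                child.score = score(child.text)
                comment.children.append(child)
                next_frontier.append(child)
        # Node evaluation: prune the tree to the most promising solutions.
        next_frontier.sort(key=lambda n: n.score, reverse=True)
        frontier = next_frontier[:keep]
    return max(frontier, key=lambda n: n.score).text
```

Each tree level thus alternates between solution nodes and comment nodes, and pruning at every level keeps inference cost linear in the depth rather than exponential.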
26 | ![](utils/method_fig.png)
27 | 
28 | ### 2.1 Environment Installation
29 | ```bash
30 | # python 3.10.16
31 | pip install -r requirements.txt
32 | pip install vllm==0.6.6.post1
33 | ```
34 | 
35 | ### 2.2 Base Model Deployment
36 | ```bash
37 | # deploy the base model as an API for convenient experiments; the server address becomes qwen_url (e.g. http://localhost:1225/v1)
38 | model=Qwen2.5-7B-Instruct
39 | CUDA_VISIBLE_DEVICES=0,1,2,3 OUTLINES_CACHE_DIR=tmp_257b nohup python -m vllm.entrypoints.openai.api_server --model ${model} --served-model-name Qwen --tensor-parallel-size 4 --port 1225 --gpu_memory_utilization 0.45 --disable-custom-all-reduce > vllm_257b.log 2>&1 &
40 | 
41 | ```
42 | 
43 | ### 2.3 Embeddings Preparation
44 | ```bash
45 | # generate embeddings for each knowledge-base corpus
46 | python eval_0_prepare_embeddings.py
47 | ```
48 | 
49 | ### 2.4 Running SolutionRAG
50 | ```bash
51 | # generate solutions; outputs will be written to eval_results
52 | scenario="0_test" # or 1_environment, 2_mining, 3_transport, 4_aerospace, 5_telecom, 6_architecture, 7_water, 8_farming
53 | python eval_5_ours.py --qwen_url ${qwen_url} --scenario ${scenario}
54 | ```
55 | 
56 | ### 2.5 Score Calculation and Display
57 | ```bash
58 | # evaluate and display scores for the generated solutions using GPT-4o
59 | python eval_score_calculate.py
60 | python eval_score_show.py
61 | ```
62 | 
--------------------------------------------------------------------------------
/benchmark/0_test/corpus.json:
--------------------------------------------------------------------------------
1 | [
2 | {
3 | "id": "5798c164f31411ef837800163e064343",
4 | "content": "the original nitrogen oxide (NOx) emissions are 350 mg/m³ and need to be reduced to 45 mg/m³ (standard). Under these conditions, the modification will face the challenge of high nitrogen oxide emissions from the boiler body, which may result in the boiler's NOx emissions failing to meet the emission standards."
5 | }, 6 | { 7 | "id": "5798c222f31411ef837800163e064343", 8 | "content": "sulfur dioxide (SO2) emissions fluctuate greatly. Although the desulfurization capacity in the furnace meets the ultra-low emission requirements, the feed rate lags behind changes in coal feed rate. Under these conditions, the modification will face the challenge of exceeding SO2 emissions limits, which may result in failing to meet environmental emission standards." 9 | }, 10 | { 11 | "id": "5798c27cf31411ef837800163e064343", 12 | "content": "dust emissions exceed the standard and need to be below 8 mg/m³ (standard). Under these conditions, the modification will face the challenge of insufficient efficiency of the existing dust collector, which may result in dust emissions failing to meet the standards." 13 | }, 14 | { 15 | "id": "5798c2b8f31411ef837800163e064343", 16 | "content": "Boiler Low-NOx Combustion Retrofit Technology This technology is applicable for reducing boiler NOx emissions. Through measures such as water-cooled wind chamber, secondary air system modification, and water-cooled wall replacement, NOx emissions are reduced to 140 mg/m³ (standard) and further improved to 45 mg/m³ through SNCR." 17 | }, 18 | { 19 | "id": "5798c2f4f31411ef837800163e064343", 20 | "content": "Limestone automatic feeding system improvement This technology is applicable for improving desulfurization efficiency within the furnace. By improving the limestone feeding system and its control logic, it enhances feeding timeliness and stabilizes SO2 emissions." 21 | }, 22 | { 23 | "id": "5798c326f31411ef837800163e064343", 24 | "content": "New ultra-low emission filter bag dust removal technology This technology is suitable for high-efficiency dust filtration. By using ultrafine fiber filter bags, it ensures that dust emissions are below 8 mg/m³ (standard)." 
25 | }, 26 | { 27 | "id": "5798c34ef31411ef837800163e064343", 28 | "content": "Turkey's increasingly stringent environmental standards require a reduction in SO2 emissions, and the sulfur content in coal entering the furnace is increasing. Under these conditions, it is necessary to modify the existing desulfurization system to meet lower SO2 emission standards. The challenges include a lack of modification interfaces, complex equipment and pipeline arrangements, and limited construction space, which may lead to difficulties in project implementation and increased costs." 29 | }, 30 | { 31 | "id": "5798c380f31411ef837800163e064343", 32 | "content": "Absorption tower and its auxiliary system renovation technology This technology is applicable for retrofitting based on existing absorption towers. By raising the slurry pool of the absorption tower and increasing the gypsum retention time, it improves the conversion rate from CaSO3 to CaSO4, achieving a desulfurization efficiency of 97%~98%." 33 | }, 34 | { 35 | "id": "5798c3b2f31411ef837800163e064343", 36 | "content": "Vacuum belt dewatering machine technology Add a new vacuum belt dehydrator and auxiliary equipment with a single processing capacity of 53 t/h, to improve the processing efficiency of the gypsum dehydration system." 37 | }, 38 | { 39 | "id": "5798c3e4f31411ef837800163e064343", 40 | "content": "The conditions are an increase in the average oxygen concentration and pressure at the furnace inlet. Under these conditions, combustion will face challenges such as a significant increase in the volume fractions of CO2 and H2O in the flue gas at the tail and a reduction in the radiant heat transfer surface space in the furnace layout, which may result in difficulties in furnace temperature control and an unreasonable distribution of heat absorption." 
41 | }, 42 | { 43 | "id": "5798c40cf31411ef837800163e064343", 44 | "content": "External heat exchanger technology This technology is applicable when the oxygen concentration and pressure at the furnace inlet increase, leading to tight arrangement of the heating surface, allowing for increased heat exchange capacity by setting it outside the furnace." 45 | }, 46 | { 47 | "id": "5798c448f31411ef837800163e064343", 48 | "content": "Screen superheater technology Adding a screen superheater inside the furnace is suitable for alleviating the issue of excessive heat absorption in the furnace, ensuring the stability of the flue gas temperature at the furnace outlet." 49 | }, 50 | { 51 | "id": "5798c470f31411ef837800163e064343", 52 | "content": "the existing wastewater treatment plant serves a population that exceeds the designed scale and the wastewater treatment capacity far exceeds the designed capacity, with the current scale being 100,000 tons/day, which needs to be increased to 200,000 tons/day. Under these conditions, upgrading and expanding the project will face the challenge of the effluent water quality not meeting the \"Guangdong Province Water Pollutant Discharge Standards\" and national standards, which may lead to failure in meeting environmental protection standards, requiring optimization of facilities to enhance processing capacity." 53 | }, 54 | { 55 | "id": "5798c4a2f31411ef837800163e064343", 56 | "content": "A/A/O micro-aeration oxidation ditch - MBBR process This technology is applicable to the biochemical treatment process in sewage plants, aiming to increase the biomass and biodiversity in the reactor to achieve higher treatment efficiency and better effluent water quality." 
57 | }, 58 | { 59 | "id": "5798c4d4f31411ef837800163e064343", 60 | "content": "Fiber rotary disc filter tank This technology is applicable for enhancing the removal of suspended impurities in sewage treatment, operating under filtration, backwashing, and sludge discharge conditions to improve the quality of the effluent." 61 | }, 62 | { 63 | "id": "5798c506f31411ef837800163e064343", 64 | "content": "MBR membrane bioreactor technology This technology is suitable for efficient filtration in wastewater treatment, using membrane separation instead of traditional sludge settling, maintaining a high concentration of biomass, improving effluent quality, and requiring a small footprint." 65 | }, 66 | { 67 | "id": "5798c52ef31411ef837800163e064343", 68 | "content": "Biologically Aerated Filter (BAF) Suitable for efficiently handling suspended solids and nitrogen and phosphorus removal in wastewater, featuring characteristics such as aeration and high filtration speed, occupying a small footprint and being low in cost." 69 | }, 70 | { 71 | "id": "5798c560f31411ef837800163e064343", 72 | "content": "The conditions involve the presence of single or combined pollution of polycyclic aromatic hydrocarbons (PAHs), petroleum hydrocarbons (C10 ~ C40), and heavy metals (arsenic, cobalt) in the soil within the site, with pollutant concentrations exceeding the acceptable risk levels for humans and having exceeded the site boundary limits. Under these conditions, the repair will face the challenges of a large polluted area, complex pollution situations, composite pollution in some regions, and a tight subsequent development schedule, which may result in poor repair outcomes or an inability to meet development requirements." 
73 | }, 74 | { 75 | "id": "5798c592f31411ef837800163e064343", 76 | "content": "Ex-situ chemical oxidation technology This technology is suitable for pollutants such as volatile/semi-volatile organic compounds and petroleum hydrocarbons, capable of achieving oxidative removal of pollutants with low remediation costs and short cycles, but it is less effective for high-concentration organic pollution remediation." 77 | }, 78 | { 79 | "id": "5798c5c4f31411ef837800163e064343", 80 | "content": "Ex situ thermal desorption technology This technology is suitable for organic pollutants and volatile heavy metals, capable of efficiently removing pollutants and allowing for rapid remediation, but it is costly, energy-intensive, and requires high operational standards." 81 | }, 82 | { 83 | "id": "5798c5f6f31411ef837800163e064343", 84 | "content": "Soil chemical leaching technology This technology is suitable for heavy metal pollution, capable of quickly removing heavy metals while also having a supplementary removal effect on other organic pollutants." 85 | }, 86 | { 87 | "id": "5798c61ef31411ef837800163e064343", 88 | "content": "The conditions indicate that the contaminated site is located in North China and is polluted by petroleum hydrocarbons (C10 - C40), with the maximum exceeding rate reaching 39.2 times. The pollution depth is concentrated within the upper 2.0 meters, and the overall exceeding rate reaches 11.27%. Under these conditions, the repair will face challenges such as a wide range of exceedance and serious pollution, which may lead to substandard repair results and the spreading of soil contamination." 89 | }, 90 | { 91 | "id": "5798c65af31411ef837800163e064343", 92 | "content": "In-situ ex situ chemical oxidation remediation technology This technology is suitable for petroleum hydrocarbon contaminated soil. 
By injecting chemical oxidants into the soil for oxidation reactions, it can degrade or transform pollutants, and it has the advantages of shorter remediation time and lower remediation costs." 93 | }, 94 | { 95 | "id": "5798c682f31411ef837800163e064343", 96 | "content": "the elevation of the slopes at the aggregate yard of the hydropower station is below 1,275.00 m, the slopes are steep, and no access roads were reserved during the construction process, especially as the rocky slopes lack the necessary conditions for vegetation restoration. Under these conditions, vegetation restoration will face challenges due to the steep slope leading to difficulties in construction and maintenance, and the lack of necessary soil and moisture conditions may result in poor greening effects across the entire slope and high construction costs." 97 | }, 98 | { 99 | "id": "5798c6b4f31411ef837800163e064343", 100 | "content": "Vegetated concrete technology This technology is suitable for slope surface remediation, forming a substrate layer through anchoring with mesh and spraying a mixture of cement, soil, and vegetation concrete, achieving stable vegetation restoration on the slope." 101 | }, 102 | { 103 | "id": "5798c6e6f31411ef837800163e064343", 104 | "content": "Planting technology Suitable for steep slopes, it achieves vegetation restoration by setting up suspended reinforced concrete planting troughs, filling with topsoil, and combining climbing plants with shrubs for greening." 105 | }, 106 | { 107 | "id": "5798c718f31411ef837800163e064343", 108 | "content": "Ground platform technology By utilizing the stable treated waste backfill to form a platform, reduce the height of the exposed slopes, and improve the operability and safety of ecological restoration." 109 | }, 110 | { 111 | "id": "5798c74af31411ef837800163e064343", 112 | "content": "The conditions are severe heavy metal pollution, with major pollutants including cadmium, lead, arsenic, zinc, manganese, and antimony. 
Under these conditions, the repair will face environmental pollution such as wastewater discharge, dust, and noise, as well as impacts on surrounding residents and the environment, which may lead to substandard repairs and secondary pollution." 113 | }, 114 | { 115 | "id": "5798c772f31411ef837800163e064343", 116 | "content": "the construction area is close to the Xiang River and residential areas, with many sensitive points in the surroundings. Under these conditions, the repair will face the risk of pollutant dispersion, which may have an impact on surface water, the atmosphere, and residential environments." 117 | }, 118 | { 119 | "id": "5798c7a4f31411ef837800163e064343", 120 | "content": "Integrated treatment facility This technology is suitable for dealing with rainwater and groundwater seepage in construction areas, ensuring that wastewater discharge meets the strictest standards of the \"Comprehensive Wastewater Discharge Standards\" (GB 8978—1996)." 121 | }, 122 | { 123 | "id": "5798c7d6f31411ef837800163e064343", 124 | "content": "Semi-closed operations and dust prevention net This technology is suitable for reducing construction dust by covering exposed soil and achieving dust control through semi-enclosed operations." 125 | }, 126 | { 127 | "id": "5798c7fef31411ef837800163e064343", 128 | "content": "Surrounding environment monitoring This technology is suitable for monitoring surrounding surface water, groundwater, and ambient air, ensuring that the construction process does not impact the Xiang River and residential areas." 129 | }, 130 | { 131 | "id": "5798c830f31411ef837800163e064343", 132 | "content": "The initial pH value of the wastewater is 6.5. Under these conditions, wastewater treatment will face the challenge of increased sludge, which may result in suboptimal removal of oil content and suspended solids." 
133 | }, 134 | { 135 | "id": "5798c862f31411ef837800163e064343", 136 | "content": "Water quality modification technology This technology is suitable for adjusting the pH value of wastewater and improving water quality by increasing liquid alkali to raise the pH value from 6.5 to above 7, thereby enhancing the effectiveness of wastewater treatment." 137 | }, 138 | { 139 | "id": "5798c894f31411ef837800163e064343", 140 | "content": "Water purification technology This technology is applicable for reducing the oil content and suspended solids content in wastewater, achieving optimized purification results through the combined use of coagulants and flocculants, with coagulant concentration at 100 mg/L and flocculant at 40 mg/L." 141 | }, 142 | { 143 | "id": "5798c8bcf31411ef837800163e064343", 144 | "content": "Sterilization and corrosion inhibiting technology This technology is suitable for inhibiting microbial growth and slowing metal corrosion, achieving effective corrosion control by using biocidal corrosion inhibitors No. 3 and No. 4 at a concentration of 40 mg/L." 145 | }, 146 | { 147 | "id": "5798c8eef31411ef837800163e064343", 148 | "content": "the total salt content and sulfate concentration in the discharge water of the cooling cycle system exceed standards, and the power plant is located in the water-scarce central region of the North China Plain, requiring a reduction in concentration ratios to meet local environmental protection standards. Under these conditions, the sewage treatment will face the risk of exceeding emission standards and high water consumption, which may lead to serious environmental accountability and resource waste." 
149 | }, 150 | { 151 | "id": "5798c920f31411ef837800163e064343", 152 | "content": "Nanofiltration technology This technology is suitable for reducing the concentration of salts and sulfate in circulating wastewater, enabling compliance with emission standards and allowing the concentrated water to be reused as desulfurization process water." 153 | }, 154 | { 155 | "id": "5798c952f31411ef837800163e064343", 156 | "content": "Reverse osmosis technology This technology is suitable for the desalination of circulating water bypass, providing high-quality freshwater for boiler make-up and circulating water, but the high chloride ion concentration in the brine affects the desulfurization efficiency." 157 | }, 158 | { 159 | "id": "5798c984f31411ef837800163e064343", 160 | "content": "The conditions for the project include the mine and the coal washing plant, and the project was completed in October 2018, capable of joint operation. Under these conditions, conducting an environmental acceptance survey will face challenges related to the complexity of coordinating the ecological, groundwater, atmospheric, water environment, and noise environment interactions between the mine and the coal washing plant, as well as the implementation of verification measures, which may result in an incomplete reflection of the effectiveness of environmental protection measures." 161 | }, 162 | { 163 | "id": "5798c9b6f31411ef837800163e064343", 164 | "content": "the project will conduct joint trial operation from November 2018 to November 2019. Under these conditions, the acceptance investigation will face the challenge of needing to verify the actual implementation and effectiveness of various environmental protection measures during the trial operation, which may result in issues that were not identified during the trial operation being exposed during formal operation." 
165 | }, 166 | { 167 | "id": "5798c9e8f31411ef837800163e064343", 168 | "content": "The conditions include a reduction in the mining area and changes in the products extracted and their transportation methods. Under these conditions, the acceptance inspection will face the challenge of updating environmental protection goals and technical solutions, which may result in the original environmental impact assessment design being unsuitable or requiring adjustments." 169 | }, 170 | { 171 | "id": "5798ca10f31411ef837800163e064343", 172 | "content": "Investigation Techniques for Ecological and Environmental Protection Measures This technology is suitable for verifying the ecological environment impact and the implementation of protective measures, including surface vegetation, soil and water loss, and mine water treatment, and can achieve detailed assessments of ecological restoration and protection effects during the construction and operation periods." 173 | }, 174 | { 175 | "id": "5798ca4cf31411ef837800163e064343", 176 | "content": "Environmental monitoring and evaluation technology This technology is applicable for monitoring atmospheric, water, noise environment, and solid waste emissions within the project area, implemented in accordance with the technical specifications (HJ 672—2013), and assesses the impact of pollutant emissions on the environment through improvement measures and remedial plans." 177 | }, 178 | { 179 | "id": "5798ca88f31411ef837800163e064343", 180 | "content": "The black and odorous river pollution interception project needs to cross the old city upstream of the river, where the pollution sources are complex. Several sections of the upstream river are very close to residential areas, and there is basically no pollution interception along the route. There are 52 outfalls on both sides of Mayuan River, including 23 combined outfalls and 29 stormwater outfalls, and the supporting sewage collection pipeline network is not complete. 
Under these conditions, the pollution interception project will face challenges such as complex pollutant source emissions and actual construction obstacles, which may lead to direct sewage discharge and difficulties in improving water quality." 181 | }, 182 | { 183 | "id": "5798cab0f31411ef837800163e064343", 184 | "content": "The engineering geological conditions are poor, with pipeline foundations mostly located on a layer of sediment. This layer has unfavorable geological conditions, with weak soil quality and high compressibility, requiring temporary support and foundation treatment. Under these conditions, the laying of pipelines will face challenges with deep foundation excavation and support, which may lead to delays in construction progress and increased costs." 185 | }, 186 | { 187 | "id": "5798cae2f31411ef837800163e064343", 188 | "content": "The cross-section of the river channel is complex, with rectangular cross-sections having retaining walls on both sides, while other forms may include trapezoidal and zigzag cross-sections, encroached upon or sloping in natural conditions, set against backgrounds of municipal roads or greenery. Under these conditions, laying the pipelines will face coordination of construction processes with different cross-sectional forms, which may lead to increased construction complexity and reduced operational efficiency." 189 | }, 190 | { 191 | "id": "5798cb0af31411ef837800163e064343", 192 | "content": "Pipeline installation technology in rivers. This technology is applicable to rectangular and trapezoidal cross-sections, allowing the reduction of cofferdam and drainage costs by installing pipelines in the river channel, and by using local support piers to reduce foundation treatment costs." 193 | }, 194 | { 195 | "id": "5798cb3cf31411ef837800163e064343", 196 | "content": "Technical requirements for pipeline laying under the greenbelt outside the retaining wall or on top of the embankment road. 
Applicable to situations where the cross-section forms a retaining wall behind a municipal road or greenbelt, providing a scheme for joint construction with hydraulic engineering, reducing construction procedures and costs." 197 | }, 198 | { 199 | "id": "5798cb64f31411ef837800163e064343", 200 | "content": "Pipeline laying technology on riverbank slopes or under roads. When the slope is a natural gradient and is followed by greenery or municipal roads, pipeline protection measures should be implemented by using plain concrete encasement and slope protection to ensure the safety of pipeline operation." 201 | }, 202 | { 203 | "id": "5798cb96f31411ef837800163e064343", 204 | "content": "Nuclear security incidents are characterized by their suddenness, wide impact range, large number of people involved, and long duration of influence. Under these conditions, radiation monitoring will face issues of slow response times, untimely data collection, and insufficient accuracy, which may lead to the consequences of public property loss and increased social impact." 205 | }, 206 | { 207 | "id": "5798cbc8f31411ef837800163e064343", 208 | "content": "Regional radiation monitoring system This technology is applicable to radiation monitoring in nuclear security events, including monitoring of gamma dose rates, airborne radioactivity, and neutron dose equivalent rates, capable of providing real-time radiation information and issuing abnormal alerts." 209 | }, 210 | { 211 | "id": "5798cbfaf31411ef837800163e064343", 212 | "content": "Information structure of environmental monitoring stations This technology is suitable for unattended radiation monitoring scenarios, capable of real-time collection of equipment status and radiation data, ensuring the stability and data security of environmental monitoring stations." 
213 | }, 214 | { 215 | "id": "5798cc2cf31411ef837800163e064343", 216 | "content": "On-site radiation processing unit This technology is suitable for on-site radiation data processing, providing real-time display of radiation data and alarm information through connection with detectors and RS485 communication, while also storing historical data." 217 | }, 218 | { 219 | "id": "5798cc5ef31411ef837800163e064343", 220 | "content": "The conditions are characterized by significant topographical destruction and the formation of steep slopes caused by the extensive mining of mineral resources in the Huashan mining area. The strata are mainly composed of the Silurian Middle System Fentou Formation and the Devonian Middle-Upper System Yuntaiguan Formation, with slope gradients typically ranging from 25 to 68 degrees. Under these conditions, the governance will face challenges such as complex terrain, unstable high slopes, and ecological environment destruction, which may lead to further environmental deterioration and potential geological disaster risks." 221 | }, 222 | { 223 | "id": "5798cc86f31411ef837800163e064343", 224 | "content": "The conditions include a significant amount of abandoned land and waste of land resources locally, as well as the accumulation of waste materials and construction debris generated during the mining process. Under these conditions, the governance will face challenges such as slag treatment, land reclamation, and resource reuse, which may lead to low land utilization and economic losses." 225 | }, 226 | { 227 | "id": "5798ccb8f31411ef837800163e064343", 228 | "content": "there is a lack of comprehensive hydrogeological facilities around the mining area, and serious surface soil erosion is occurring. Under these conditions, governance needs to consider interception drainage and soil and water conservation measures to prevent further soil erosion and environmental pollution." 
229 | }, 230 | { 231 | "id": "5798cceaf31411ef837800163e064343", 232 | "content": "Corner cutting load reduction technology This technology is suitable for reducing slope load, improving slope stability through excavation operations, and creating conditions for vegetation restoration, with a total excavation volume of 35,151.6 m³." 233 | }, 234 | { 235 | "id": "5798cd1cf31411ef837800163e064343", 236 | "content": "Backfill foot technology Suitable for the stabilization of high steep slope edges, using gravel soil fill to increase sliding resistance and improve slope stability, with a total backfill volume of 33,272.60 m³." 237 | }, 238 | { 239 | "id": "5798cd44f31411ef837800163e064343", 240 | "content": "Buttress wall technology Suitable for preventing the sliding and collapse of backfill slopes, using a 177 m long reinforced concrete retaining wall to enhance structural strength and stability." 241 | }, 242 | { 243 | "id": "5798cd76f31411ef837800163e064343", 244 | "content": "Cut-off drainage channel technology This technology is suitable for soil and water loss prevention and control in steep slope areas, constructing a drainage system along the slope perimeter and at the foot of the slope with a length of 1,156.3 meters." 245 | }, 246 | { 247 | "id": "5798cda8f31411ef837800163e064343", 248 | "content": "Vegetation bag foot wall technology For slope foot protection, gabion nets and ecological bags are used to mitigate erosion, with a total length of 259.7 m." 249 | }, 250 | { 251 | "id": "5798cddaf31411ef837800163e064343", 252 | "content": "Soil greening technology Suitable for slope ecological restoration, using the method of importing soil for planting vegetation such as Robinia pseudoacacia and Caragana korshinskii to restore woodlands and reduce soil erosion." 
253 | }, 254 | { 255 | "id": "5798ce0cf31411ef837800163e064343", 256 | "content": "The conditions are to determine the permanganate index using the acidic method, with a heating time of 30 minutes, a temperature of 100°C, a potassium permanganate concentration of 0.01 mol/L, and a sulfuric acid dosage of 0.2 mL. Under these conditions, sample titration will face hazards during the heating process, as the strong oxidizing properties of potassium permanganate and the corrosiveness of sulfuric acid may lead to safety accidents and unstable measurement results due to improper operation." 257 | }, 258 | { 259 | "id": "5798ce3ef31411ef837800163e064343", 260 | "content": "Ultraviolet-Visible Spectrophotometry This technology is suitable for the determination of permanganate index and can achieve high-sensitivity measurement of the index by determining the absorbance of residual potassium permanganate at a wavelength of 525 nm. It has good reproducibility, high sensitivity, requires fewer types of reagents, and is easy to operate." 261 | }, 262 | { 263 | "id": "5798ce7af31411ef837800163e064343", 264 | "content": "the Heiquan Reservoir is located on the Baoku River upstream of the Beichuan River. It is a large-scale water conservancy hub project with multiple purposes, and its environment is complex, involving sediment pollution and water quality relevance. Under these conditions, evaluating the pollution characteristics and ecological risks of sediment will face challenges such as difficulties in data collection, uneven spatial distribution, a wide variety of pollutants, and their interactions, which may result in inaccurate and incomplete monitoring results." 265 | }, 266 | { 267 | "id": "5798ceacf31411ef837800163e064343", 268 | "content": "EFDC Environmental Fluid Dynamics Model This technology is applicable for the dynamic simulation of reservoir sediment pollution. 
By constructing a three-dimensional hydrodynamic-water quality-sediment dynamic coupling model, it can fully reveal the migration and diffusion patterns of pollutants in the reservoir." 269 | }, 270 | { 271 | "id": "5798cedef31411ef837800163e064343", 272 | "content": "Geo-accumulation index method and potential ecological risk index method This technology is applicable to the ecological risk assessment of heavy metals in the sediments of Black Spring Reservoir, capable of evaluating both localized and comprehensive environmental ecological risks. By reflecting the impact of human activities and the toxicity levels of heavy metals, it provides a basis for risk control." 273 | }, 274 | { 275 | "id": "5798cf06f31411ef837800163e064343", 276 | "content": "SPSS Data Correlation Analysis This technology is applicable for the correlation analysis between sediment pollutants and water quality, revealing the contribution rate of sediment as an internal source of pollution to water quality through accurate statistical methods." 277 | }, 278 | { 279 | "id": "5798cf38f31411ef837800163e064343", 280 | "content": "the SO2 emission concentration in the cement kiln flue gas is relatively high. When the SO2 emission concentration is around 400 mg/Nm3 (general areas) or 200 mg/Nm3 (key areas), different raw material mills are used. Under these conditions, not using appropriate desulfurization technology may lead to excessive SO2 emissions, especially when a ball mill is used in the raw material mill, where only about 50% of the exhaust gases enter the mill, resulting in low SO2 removal efficiency." 281 | }, 282 | { 283 | "id": "5798cf6af31411ef837800163e064343", 284 | "content": "the raw material mill has not employed a suitable SO2 reduction device, which may result in SO2 emission concentrations exceeding 500 mg/Nm3 (in ordinary areas). 
Under these conditions, high SO2 emissions may lead to non-compliance with emission standards, especially when the raw material mill deviates from the optimal configuration, making it impossible to effectively eliminate SO2." 285 | }, 286 | { 287 | "id": "5798cf9cf31411ef837800163e064343", 288 | "content": "The condition involves the presence of volatile sulfur compounds in the exhaust, which may escape during the preheating stage and fail to react adequately with the highly reactive CaO. Under these conditions, it may result in SO2 not being effectively removed during the preheating process, leading to increased emission concentrations that cannot meet the emission standards." 289 | }, 290 | { 291 | "id": "5798cfcef31411ef837800163e064343", 292 | "content": "Limestone-Gypsum Wet Desulfurization Technology This technology is suitable for treating waste gas with SO2 concentrations of up to 1200 mg/Nm3, using limestone as a desulfurizing agent, achieving a desulfurization efficiency of over 95%, and producing high-purity gypsum by-products that can be used as a water-retaining agent in cement." 293 | }, 294 | { 295 | "id": "5798d000f31411ef837800163e064343", 296 | "content": "The conditions involve Cd-contaminated soils in three regions of Shaoguan and Yunfu in Guangdong Province, with significant differences in soil physicochemical properties. The pH values range from 4.73 to 8.01, and Cd content ranges from 1.61 mg·kg-1 to 5.60 mg·kg-1. Under these conditions, soil remediation will face the high bioavailability of Cd in acidic soils and the risk of crop absorption of Cd. In alkaline soils, while there is a risk of Cd, the organic carbon content is relatively low, leading to issues of poor crop growth and agricultural product safety." 
297 | }, 298 | { 299 | "id": "5798d032f31411ef837800163e064343", 300 | "content": "Biochar-based soil amendment technology This technology is applicable to areas with varying levels of Cd contamination and soil pH, achieving a reduction in acid-soluble Cd content in soil and Cd uptake by crops through the combined use of biochar, lime, organic fertilizer, and other auxiliary materials." 301 | } 302 | ] -------------------------------------------------------------------------------- /benchmark/0_test/corpus.pt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/icip-cas/DeepSolution/a953dc46c122238316af229e6d98c9c30b8df559/benchmark/0_test/corpus.pt -------------------------------------------------------------------------------- /benchmark/0_test/datas.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "id": 0, 4 | "title": "Design of an ultra-low emission retrofit plan for 3 × 150 t/h coal-fired boilers", 5 | "requirement": "Design of an ultra-low emission retrofit plan for 3 × 150 t/h coal-fired boilers Under the strict requirements for nitrogen oxides, sulfur dioxide, and dust emissions, complete the ultra-low emission transformation task for three 150 t/h coal-fired boilers to meet environmental emission standards, i.e., NOx emissions not exceeding 45 mg/m³ (standard), SO2 emissions not exceeding 28 mg/m³ (standard), and dust emissions not exceeding 8 mg/m³ (standard).", 6 | "solution": "This paper proposes an ultra-low emission retrofit plan for 3 × 150 t/h coal-fired boilers. Specifically, firstly, considering the challenge of high original nitrogen oxide emissions, low-NOx combustion retrofit technology is applied to the boiler, reducing the original NOx emissions to 140 mg/m³ (standard) through water-cooled air chambers, secondary air systems, and water-cooled wall replacement, and achieving a compliant emission of 45 mg/m³ after SNCR. 
Secondly, addressing the issue of significant fluctuations in SO2 emissions, the limestone automatic feeding system is improved to enhance feeding timeliness and control accuracy, enabling SO2 emissions of no more than 28 mg/m³ with in-furnace desulfurization alone. Lastly, to tackle the problem of excessive dust emissions, new ultra-low emission filter bag dust removal technology is employed, achieving compliant dust emissions with concentrations below 8 mg/m³ (standard) by replacing the filter bags.", 7 | "analysis_ids": [ 8 | "5798c164f31411ef837800163e064343", 9 | "5798c222f31411ef837800163e064343", 10 | "5798c27cf31411ef837800163e064343" 11 | ], 12 | "technology_ids": [ 13 | "5798c2b8f31411ef837800163e064343", 14 | "5798c2f4f31411ef837800163e064343", 15 | "5798c326f31411ef837800163e064343" 16 | ], 17 | "explanation": [ 18 | { 19 | "idx": "explanation0", 20 | "content": "For analysis0, considering the challenge of high original nitrogen oxide emissions, low-NOx combustion retrofit technology was used for the boiler. Significant NOx emission reductions were achieved through modifications such as the water-cooled wind chamber and the circulating material system." 21 | }, 22 | { 23 | "idx": "explanation1", 24 | "content": "For analysis1, considering the challenges of SO2 emission fluctuations and exceeding standards, the limestone automatic feeding system improvement technology is used to achieve stable control of SO2 emissions by optimizing the feeding system and control logic." 25 | }, 26 | { 27 | "idx": "explanation2", 28 | "content": "For analysis2, considering the challenge of excessive dust emissions, adopt the new ultra-low emission filter bag dust removal technology. By replacing the filter bags, improve dust removal efficiency and achieve compliant dust emissions." 
29 | } 30 | ] 31 | }, 32 | { 33 | "id": 1, 34 | "title": "Design of capacity expansion and renovation scheme for the wet flue gas desulfurization system of a 3 × 210 MW coal-fired power plant", 35 | "requirement": "Design of capacity expansion and renovation scheme for the wet flue gas desulfurization system of a 3 × 210 MW coal-fired power plant Under Turkey's strict environmental standards and the conditions of burning high-sulfur coal, completing the capacity expansion and renovation of the 3 × 210 MW desulfurization unit requires achieving a desulfurization efficiency of 97.5% and completing the task on the existing facilities.", 36 | "solution": "This article proposes a solution to improve desulfurization efficiency by modifying the absorption tower and its auxiliary systems on the basis of existing facilities. Specifically, first, considering the need to increase the desulfurization efficiency to 97.5%, absorption tower modification technology is used to achieve a high conversion rate from CaSO3 to CaSO4 by increasing the absorption tower slurry pool volume and optimizing the spraying system. Second, a vacuum belt dehydrator is added to improve the efficiency of the gypsum dehydration system, further ensuring that the modified system meets stricter environmental standards.", 37 | "analysis_ids": [ 38 | "5798c34ef31411ef837800163e064343" 39 | ], 40 | "technology_ids": [ 41 | "5798c380f31411ef837800163e064343", 42 | "5798c3b2f31411ef837800163e064343" 43 | ], 44 | "explanation": [ 45 | { 46 | "idx": "explanation0", 47 | "content": "For analysis0, considering the environmental protection standards and the increase in sulfur content of coal, technology0 was used to meet the requirements of improving desulfurization efficiency by raising the absorption tower slurry pool and optimizing the stirring and spraying systems." 
48 | } 49 | ] 50 | } 51 | ] -------------------------------------------------------------------------------- /eval_0_prepare_embeddings.py: -------------------------------------------------------------------------------- 1 | import json 2 | import tqdm 3 | import torch 4 | from utils.embedder import Embedder 5 | 6 | 7 | scenarios = [ 8 | "0_test", 9 | # "1_environment", 10 | # "2_mining" 11 | # "3_transport", 12 | # "4_aerospace", 13 | # "5_telecom", 14 | # "6_architecture", 15 | # "7_water", 16 | # "8_farming", 17 | ] 18 | embedder = Embedder(device='cpu') 19 | 20 | for scenario in scenarios: 21 | corpus = json.load(open(f"./benchmark/{scenario}/corpus.json")) 22 | embeddings = [] 23 | for data in tqdm.tqdm(corpus, desc=f"Embedding {scenario} corpus"): 24 | content = data["content"] 25 | embeddings.append(embedder.get_embedding(content, max_length=4096)) 26 | embeddings = torch.cat(embeddings) 27 | print("embeddings.shape", embeddings.shape) 28 | 29 | torch.save(embeddings, f"./benchmark/{scenario}/corpus.pt") 30 | print(f"Embeddings for {scenario} saved") 31 | 32 | print("All done") -------------------------------------------------------------------------------- /eval_5_ours.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import tqdm 4 | import time 5 | import torch 6 | import argparse 7 | from utils.qwen_api import QwenAPI 8 | from utils.embedder import Embedder 9 | from ours_framework import GameTreeRAG 10 | from utils.openai_api import OpenaiAPI 11 | 12 | 13 | if __name__ == '__main__': 14 | parser = argparse.ArgumentParser() 15 | parser.add_argument('--scenario', type=str, default="0_test") 16 | parser.add_argument('--worker_id', type=str, default="") 17 | parser.add_argument('--embedder_device', type=str, default="cpu") 18 | 19 | parser.add_argument('--model_name', type=str, default="qwen") 20 | parser.add_argument('--qwen_url', type=str, default="10.32.10.224") 21 | 
parser.add_argument('--qwen_url2', type=str, default="lzq") 22 | parser.add_argument('--qwen_url3', type=str, default="lzq") 23 | parser.add_argument('--qwen_url4', type=str, default="lzq") 24 | 25 | parser.add_argument('--max_depth', type=int, default=5) 26 | parser.add_argument('--layer_top_k', type=int, default=1) 27 | parser.add_argument('--children_num', type=int, default=2) 28 | parser.add_argument('--retrieval_top_k', type=int, default=10) 29 | 30 | parser.add_argument('--doubt_max_new_tokens', type=int, default=2048) 31 | parser.add_argument('--solution_max_new_tokens', type=int, default=2048) 32 | parser.add_argument('--if_sum_reference', type=str, default="sum", choices=["sum", "notsum", "sumroot"]) 33 | parser.add_argument('--if_only_reference', type=str, default="semionly", choices=["only", "notonly", "semionly"]) 34 | parser.add_argument('--if_rerank', type=str, default="llmrerank", choices=["notrerank", "llmrerank"]) 35 | parser.add_argument('--tag', type=str, default="") 36 | parser.add_argument('--if_no_review', action='store_true') 37 | parser.add_argument('--if_no_explore', action='store_true') 38 | args = parser.parse_args() 39 | for arg in vars(args): 40 | print(f"{arg}: {getattr(args, arg)}") 41 | 42 | assert args.tag == "" or args.tag.startswith("_"), f"tag should be empty or start with _, i got {args.tag}" 43 | 44 | embedder = Embedder(device=args.embedder_device) 45 | if args.model_name == "qwen": 46 | llm = QwenAPI( 47 | url=f"http://{args.qwen_url}:1225/v1/chat/completions", 48 | url2=None if args.qwen_url2 is None else f"http://{args.qwen_url2}:1225/v1/chat/completions", 49 | url3=None if args.qwen_url3 is None else f"http://{args.qwen_url3}:1225/v1/chat/completions", 50 | url4=None if args.qwen_url4 is None else f"http://{args.qwen_url4}:1225/v1/chat/completions", 51 | ) 52 | os.system(f"curl {args.qwen_url}:1225/v1/models --connect-timeout 2") 53 | else: 54 | raise NotImplementedError(f"model_name {args.model_name} not implemented") 55 
| 56 | datas = json.load(open(f'./benchmark/{args.scenario}/datas.json')) 57 | if args.worker_id != "": 58 | datas = datas[int(args.worker_id)::8] 59 | print("len(datas)", len(datas), 'worker_id', args.worker_id) 60 | corpus = json.load(open(f'./benchmark/{args.scenario}/corpus.json')) 61 | corpus_embeddings = torch.load(f'./benchmark/{args.scenario}/corpus.pt', map_location=embedder.model.device) 62 | print("corpus_embeddings.shape", corpus_embeddings.shape) 63 | 64 | framework = GameTreeRAG( 65 | embedder=embedder, 66 | llm=llm, 67 | knowledge_lib=corpus, 68 | knowledge_lib_embeddings=corpus_embeddings, 69 | max_depth=args.max_depth, 70 | children_num=args.children_num, 71 | layer_top_k=args.layer_top_k, 72 | retrieval_top_k=args.retrieval_top_k, 73 | doubt_max_new_tokens=args.doubt_max_new_tokens, 74 | solution_max_new_tokens=args.solution_max_new_tokens, 75 | if_only_reference=args.if_only_reference, 76 | if_sum_reference=args.if_sum_reference, 77 | if_rerank=args.if_rerank, 78 | if_no_review=args.if_no_review, 79 | if_no_explore=args.if_no_explore, 80 | ) 81 | 82 | suffix = f"{args.max_depth}_{args.children_num}_{args.layer_top_k}{args.tag}" 83 | fw_dir = f"./eval_results/{args.scenario}" 84 | if not os.path.exists(fw_dir): 85 | os.makedirs(fw_dir) 86 | fw_path = f'{fw_dir}/generated_by_5_ours_{suffix}_{args.worker_id}.jsonl' 87 | fw = open(fw_path, 'a') 88 | exiting_data_ids = [data['id'] for data in [json.loads(line) for line in open(fw_path)]] 89 | for data in tqdm.tqdm(datas, desc=f"evaluating {suffix}, worker_id {args.worker_id}"): 90 | if data['id'] in exiting_data_ids: 91 | print(f"\n\nskip {data['id']}, already_done") 92 | continue 93 | 94 | print(f"\n\nwill deal with data {data['id']}") 95 | 96 | query = data['requirement'] 97 | 98 | output_text, all_nodes_record = framework.get_final_solution(query) 99 | data['output_text'] = output_text 100 | data['all_nodes_record'] = all_nodes_record 101 | fw.write(json.dumps(data, ensure_ascii=False) + '\n') 102 
| fw.flush() 103 | 104 | fw.close() 105 | print(f"evaluating 5_ours on {args.scenario} done, worker_id {args.worker_id}, results saved in {fw_path}") -------------------------------------------------------------------------------- /eval_score_calculate.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import tqdm 4 | import argparse 5 | from utils.qwen_api import QwenAPI 6 | from utils.openai_api import OpenaiAPI 7 | 8 | 9 | prompt = \ 10 | """<> 11 | {task} 12 | 13 | <> 14 | {solution} 15 | 16 | <> 17 | ## Analysis knowledge: 18 | {analysis_knowledge} 19 | ## Technology knowledge: 20 | {technology_knowledge} 21 | ## Golden explanation: 22 | {golden_explanation} 23 | ## Golden solution: 24 | {golden_solution} 25 | 26 | <> 27 | The above <> is a complex requirement in an actual engineering scenario. The above <> is a solution generated by a certain model. You are required to evaluate this solution based on the <> annotated by a human expert. The <> consists of the following components: 28 | (a) Analysis knowledge: A deep analysis of the various restrictive factors present in this complex requirement. 29 | (b) Technology knowledge: A detailed explanation of the various technologies that must be used to solve this complex requirement. 30 | (c) Golden explanation: An explanation of how to use these technologies to overcome various challenges. 31 | (d) Golden solution: The standard solution provided by human experts. 32 | Your evaluation of the <> must fully consider the <>. The specific evaluation requirements are as follows. 33 | 34 | <> 35 | 1. You need to evaluate the solution above from two dimensions. The range for each of the two scores is an integer between 0 and 100, where the minimum score is 0 and the maximum score is 100. 36 | 2. 
The scoring details for the two dimensions are as follows: 37 | (2.1) Analysis Score: Refer to the aforementioned Analysis knowledge, Golden explanation, and Golden solution to assess whether the <> has thoroughly considered the various restrictive factors in the <>. Pay special attention to listing each restrictive factor in the Analysis knowledge one by one and evaluating whether the model output has considered these factors. If considered, you need to specify which part of the <> addresses the restrictive factor, and whether this part is sufficiently correct and specific. 38 | (2.1.1) If no factors are considered, score 0. 39 | (2.1.2) If factors are considered but the analysis is not entirely correct, score 11-30 depending on the degree of correctness. 40 | (2.1.3) If factors are considered and the analysis is correct but not specific, score 31-60 depending on the level of specificity. 41 | (2.1.4) If factors are considered, the analysis is correct, and it is specific, score 61-90 based on its similarity to the standard Analysis knowledge. 42 | (2.1.5) If it is fully consistent with the standard Analysis knowledge, score 100. 43 | (2.2) Technology Score: Refer to the aforementioned Technology knowledge, Golden explanation, and Golden solution to evaluate whether the <> has employed appropriate technologies to address the challenges in the <>. Pay special attention to listing each technology in the Technology knowledge one by one and evaluating whether the model output has used these technologies. If used, you need to specify which part of the <> utilizes the technology, and whether this part is sufficiently correct and specific. 44 | (2.2.1) If no technologies are used, score 0. 45 | (2.2.2) If technologies are used but not entirely correctly, score 11-30 depending on the degree of correctness. 46 | (2.2.3) If technologies are used and correctly applied but not specific, score 31-60 depending on the level of specificity. 
47 | (2.2.4) If technologies are used correctly and specifically, score 61-90 based on their similarity to the standard Technology knowledge. 48 | (2.2.5) If it is fully consistent with the standard Technology knowledge, score 100. 49 | 3. During the evaluation process, you must first evaluate the solution based on the two dimensions mentioned above and display your reasoning process. After completing your reasoning, you must output the identifier ##Scores## followed by your evaluation results in the form of a dictionary, i.e.: ##Scores## {{"Analysis Score": int, "Technology Score": int}} 50 | 4. Note that a longer solution is not necessarily better. The sole basis of your evaluation process is the aforementioned Judgement reference, and your evaluation results must fully take this reference into account. 51 | 5. Below is an example you can refer to when completing your evaluation: 52 | ## An example of evaluation: 53 | 1. For Analysis Score Evaluation. The analysis evaluates whether the solution adequately considers restrictive factors in the task. Let's analyze each factor from the Analysis knowledge and compare it with the model's solution: 54 | For (analysis_0) Uneven soil particle distribution and high groundwater levels: The solution mentions groundwater analysis (in Geotechnical Investigation and Analysis) and dewatering systems (in Water Management). It also discusses preventing water ingress but does not explicitly address piping issues or the specific use of interlocking casing piles to stabilize the foundation pit. While the factors are considered, the analysis lacks detail and specificity regarding solutions such as technology for groundwater cut-off and particle stabilization. Conclusion: Analysis is present and mostly correct but lacks specificity in addressing key risks like piping and groundwater intrusion. Score: 60/100. 
55 | For (analysis_1) Deep excavation with moderately weathered limestone: The solution discusses slope stability analysis and phased excavation to control depth, as well as bracing and anchoring systems for stabilization. However, it does not explicitly address challenges such as settlement risks or hard rock excavation in limestone. The application of pre-applied axial force in steel supports is not mentioned, which is critical for ensuring stability. Conclusion: The analysis is partially correct but lacks depth and specificity regarding excavation in moderately weathered limestone and associated risks. Score: 50/100. 56 | For (analysis_2) Sensitive surrounding environment: The solution acknowledges sensitivity in the surrounding environment and mentions real-time monitoring and mitigation measures (e.g., structural impact assessments, emergency response plans). However, it does not provide detailed measures for preventing settlement or damage to underground pipelines or nearby structures, such as dynamic monitoring and quantified safety parameters. Conclusion: The analysis is correct but not specific enough regarding risks to surrounding buildings and underground pipelines. Score: 60/100. 57 | Overall Analysis Score: While the solution considers the main restrictive factors, it does not address all of them comprehensively or with sufficient specificity. Final Analysis Score: 57/100. 58 | 2. For Technology Score Evaluation. The evaluation checks whether appropriate technologies were employed. Let's analyze each technology from the Technology knowledge: 59 | For (technology_0) Casing Interlocking Pile Technology: The solution does not explicitly mention casing interlocking pile technology, which is critical for ensuring foundation pit enclosure and managing groundwater effectively. Instead, it suggests general retaining structures like sheet piles and diaphragm walls, which are less specific for high groundwater conditions. 
Conclusion: This technology is not applied. Score: 0/100. 60 | For (technology_1) Layered Excavation and Steel Support System: The solution mentions phased excavation and the use of steel supports and bracing but does not include details about pre-applied axial forces, which are vital for controlling deformation and ensuring stability. The specificity of this technology's application is lacking. Conclusion: Partially applied. Score: 50/100. 61 | For (technology_2) Precipitation and Grouting Measures: The solution discusses dewatering systems and drainage channels but omits the use of grouting measures, which are essential for controlling groundwater levels and preventing leakage. Grouting was explicitly required in the reference but is absent here. Conclusion: Partially applied. Score: 40/100. 62 | For (technology_3) Dynamic Monitoring and Protection Measures: The solution includes real-time monitoring with sensors and references emergency response measures. However, it does not quantify safety metrics, such as settlement rates, or specify dynamic adjustment strategies. These omissions limit the specificity of this technology's application. Conclusion: Partially applied. Score: 50/100. 63 | Overall Technology Score: The solution employs some relevant technologies but omits critical ones (e.g., casing interlocking piles and grouting measures) and lacks specificity in others. Final Technology Score: 35/100. 
64 | Final Scores 65 | ##Scores## {{"Analysis Score": 57, "Technology Score": 35}} 66 | 67 | <>""" 68 | 69 | 70 | def score_helper(data, id_2_knowledge, model_generated_solution, llm): 71 | analysis_knowledge_list = [ 72 | id_2_knowledge[analysis_id] for analysis_id in data['analysis_ids'] 73 | ] 74 | technology_knowledge_list = [ 75 | id_2_knowledge[technology_id] for technology_id in data['technology_ids'] 76 | ] 77 | 78 | task = f"{data['requirement']}" 79 | golden_solution = data['solution'] 80 | 81 | analysis_knowledge = "" 82 | for knowledge in analysis_knowledge_list: 83 | strip_str = f"{data['id']}_" 84 | knowledge_id = knowledge['id'].replace(strip_str, "") 85 | analysis_knowledge += f"{knowledge_id}: {knowledge['content']}\n" 86 | analysis_knowledge = analysis_knowledge.strip() 87 | 88 | technology_knowledge = "" 89 | for knowledge in technology_knowledge_list: 90 | strip_str = f"{data['id']}_" 91 | knowledge_id = knowledge['id'].replace(strip_str, "") 92 | technology_knowledge += f"{knowledge_id}: {knowledge['content']}\n" 93 | technology_knowledge = technology_knowledge.strip() 94 | 95 | explanation = "" 96 | for ex in data['explanation']: 97 | explanation += f"{ex['idx']}: {ex['content']}\n" 98 | explanation = explanation.strip() 99 | 100 | input_text = prompt.format( 101 | task=task, 102 | solution=model_generated_solution, 103 | analysis_knowledge=analysis_knowledge, 104 | technology_knowledge=technology_knowledge, 105 | golden_explanation=explanation, 106 | golden_solution=golden_solution 107 | ) 108 | 109 | score, output_text = None, ""  # output_text initialized so the except branch below cannot raise NameError 110 | for _ in range(8): 111 | try: 112 | output_text = llm.get_response(input_text) 113 | if "##Scores##" in output_text: 114 | raw_score = output_text.split("##Scores##")[-1].strip().strip("#").strip("*").strip(":").strip("```json").strip() 115 | score = json.loads(raw_score) 116 | break 117 | elif "## Scores" in output_text: 118 | raw_score = output_text.split("## 
Scores")[-1].strip().strip("#").strip("*").strip(":").strip("```json").strip() 119 | score = json.loads(raw_score) 120 | break 121 | else: 122 | raise Exception("no score") 123 | except Exception as e: 124 | print("output_text", output_text) 125 | print("error", e) 126 | print("retry") 127 | 128 | return score, output_text 129 | 130 | 131 | def calculate_score(datas, corpus, generated_result_path, print_info): 132 | 133 | llm = OpenaiAPI() 134 | model_suffix = "" 135 | 136 | # qwen_url = "10.32.10.224" 137 | # llm = QwenAPI( 138 | # url=f"http://{qwen_url}:1225/v1/chat/completions", 139 | # ) 140 | # model_suffix = "_qwen" 141 | # print(f"=====warning: using QwenAPI api=====") 142 | 143 | id_2_knowledge = {corpu['id']: corpu for corpu in corpus} 144 | 145 | print(f"doing {generated_result_path}") 146 | generated_results = [json.loads(line) for line in open(generated_result_path)] 147 | id_2_result = {result['id']: result for result in generated_results} 148 | assert len(datas) == len(generated_results), f"len(datas) != len(generated_results): {len(datas)} != {len(generated_results)}" 149 | 150 | fw_path = f"{generated_result_path}_score{model_suffix}.jsonl" 151 | fw = open(fw_path, "a") 152 | exiting_data_ids = [data['id'] for data in [json.loads(line) for line in open(fw_path)]] 153 | for i in tqdm.tqdm(range(len(datas)), desc=f"{print_info} {model_suffix}"): 154 | print(f"\ndoing {i}/{len(datas)}") 155 | data = datas[i] 156 | generated_result = id_2_result[data['id']] 157 | 158 | if data['id'] in exiting_data_ids: 159 | print(f"skip {data['id']}, already_done") 160 | continue 161 | 162 | model_generated_solution = generated_result['output_text'].replace("#", "").replace("<", "").replace(">", "").strip() 163 | score, judgement = score_helper(data, id_2_knowledge, model_generated_solution, llm) 164 | 165 | generated_result["judgement"] = judgement 166 | generated_result["score"] = score 167 | 168 | fw.write(json.dumps(generated_result, ensure_ascii=False) + "\n") 
169 | fw.flush() 170 | fw.close() 171 | 172 | print(f"done {generated_result_path}") 173 | return 1 174 | 175 | 176 | if __name__ == '__main__': 177 | 178 | for scenario in [ 179 | "0_test", 180 | # "1_environment", 181 | # "2_mining", 182 | # '3_transport', 183 | # '4_aerospace', 184 | # '5_telecom', 185 | # '6_architecture', 186 | # '7_water', 187 | # '8_farming', 188 | ]: 189 | 190 | datas = json.load(open(f"./benchmark/{scenario}/datas.json")) 191 | corpus = json.load(open(f"./benchmark/{scenario}/corpus.json")) 192 | 193 | generated_result_path = f"./eval_results/{scenario}/generated_by_5_ours_5_2_1_.jsonl" 194 | calculate_score(datas, corpus, generated_result_path, print_info=f"{scenario}") -------------------------------------------------------------------------------- /eval_score_show.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import random 4 | import argparse 5 | 6 | 7 | def show_score(datas, score_file_path): 8 | score_datas = [json.loads(line) for line in open(score_file_path)] 9 | id_2_score_data = {score_data['id']: score_data for score_data in score_datas} 10 | if len(datas) != len(score_datas): 11 | print("==================WARNING==================") 12 | print(f"len(datas) != len(score_datas): {len(datas)} != {len(score_datas)}") 13 | print("===========================================") 14 | 15 | ok_num = 0 16 | error_num = 0 17 | ok_score = {"Analysis Score": 0, "Technology Score": 0} 18 | for data in datas: 19 | 20 | score_data = id_2_score_data[data['id']] 21 | 22 | if score_data['score'] is None: 23 | error_num += 1 24 | print(f"None score, score_data['id']: {score_data['id']}") 25 | continue 26 | ok_num += 1 27 | ok_score["Analysis Score"] += score_data['score']['Analysis Score'] 28 | ok_score["Technology Score"] += score_data['score']['Technology Score'] 29 | 30 | print(f"Ok num: {ok_num}, Error num: {error_num}") 31 | print(f"Average score: Analysis Score: 
{ok_score['Analysis Score']/ok_num:.1f}, Technology Score: {ok_score['Technology Score']/ok_num:.1f}") 32 | print() 33 | 34 | 35 | if __name__ == '__main__': 36 | parser = argparse.ArgumentParser() 37 | parser.add_argument('--language', type=str, default='en') 38 | args = parser.parse_args() 39 | 40 | for scenario in [ 41 | '0_test', 42 | # '1_environment', 43 | # '2_mining', 44 | # '3_transport', 45 | # '4_aerospace', 46 | # '5_telecom', 47 | # '6_architecture', 48 | # '7_water', 49 | # '8_farming', 50 | ]: 51 | datas = json.load(open(f"benchmark/{scenario}/datas.json")) 52 | print("\n\n\n") 53 | 54 | print(f"{scenario}") 55 | 56 | score_file_path = f"./eval_results/{scenario}/generated_by_5_ours_5_2_1_.jsonl_score.jsonl" 57 | 58 | show_score(datas, score_file_path) 59 | 60 | print("Done.") -------------------------------------------------------------------------------- /ours_framework.py: -------------------------------------------------------------------------------- 1 | import copy 2 | import json 3 | import torch 4 | import pickle 5 | import random 6 | random.seed(1225) 7 | from utils.qwen_api import QwenAPI 8 | from utils.embedder import Embedder 9 | from utils.prompts import ( 10 | prompt_for_get_doubt, 11 | prompt_for_get_doubt_proposal, 12 | prompt_for_get_solution, 13 | prompt_for_get_solution_proposal, 14 | prompt_for_get_solution_for_root, 15 | prompt_for_get_solution_proposal_for_root, 16 | ) 17 | 18 | 19 | class InNode: # initial node 20 | def __init__(self, question, my_id): 21 | self.question = question 22 | self.flag = "In" 23 | self.my_id = my_id 24 | self.children = [] 25 | self.retrieval = [] 26 | 27 | 28 | class SoNode: # solution node 29 | def __init__(self, father, my_id): 30 | self.flag = "So" 31 | self.my_id = my_id 32 | self.proposal = "" 33 | self.proposal_input = "" 34 | self.solution = "" 35 | self.solution_input = "" 36 | self.solution_summary = "" 37 | self.retrieval = [] 38 | self.retrieval_new = [] 39 | self.father = father 40 | 
self.father_id = father.my_id 41 | self.score_for_father = None 42 | self.children = [] 43 | self.scores_from_children = [] 44 | self.if_final_used = False 45 | 46 | 47 | class ReNode: # reflection node 48 | def __init__(self, father, my_id): 49 | self.flag = "Re" 50 | self.my_id = my_id 51 | self.proposal = "" 52 | self.proposal_input = "" 53 | self.reflection = "" 54 | self.reflection_input = "" 55 | self.reflection_summary = "" 56 | self.retrieval = [] 57 | self.retrieval_new = [] 58 | self.father = father 59 | self.father_id = father.my_id 60 | self.score_for_father = None 61 | self.children = [] 62 | self.scores_from_children = [] 63 | 64 | 65 | class GameTreeRAG: 66 | def __init__(self, 67 | knowledge_lib, 68 | knowledge_lib_embeddings, 69 | llm, 70 | embedder, 71 | max_depth, 72 | children_num, 73 | layer_top_k, 74 | retrieval_top_k, 75 | doubt_max_new_tokens, 76 | solution_max_new_tokens, 77 | if_sum_reference, 78 | if_only_reference, 79 | if_rerank, 80 | if_no_review=False, 81 | if_no_explore=False 82 | ): 83 | self.node_id = 0 84 | self.root = None 85 | self.all_nodes_record = [] 86 | self.all_nodes_record_dict = {} 87 | 88 | self.knowledge_lib = knowledge_lib 89 | self.knowledge_id_2_knowledge = {data['id']: data for data in knowledge_lib} 90 | self.knowledge_lib_embeddings = knowledge_lib_embeddings 91 | 92 | self.max_depth = max_depth 93 | self.children_num = children_num 94 | self.layer_top_k = layer_top_k 95 | self.retrieval_top_k = retrieval_top_k 96 | 97 | self.llm = llm 98 | self.embedder = embedder 99 | 100 | self.doubt_max_new_tokens = doubt_max_new_tokens 101 | self.solution_max_new_tokens = solution_max_new_tokens 102 | 103 | self.if_only_reference = if_only_reference 104 | self.if_sum_reference = if_sum_reference 105 | self.if_rerank = if_rerank 106 | 107 | self.if_no_review = if_no_review 108 | self.if_no_explore = if_no_explore 109 | if if_no_review or if_no_explore: 110 | print("#################### ablation setting ####################") 
111 |             print(f"if_no_review: {if_no_review}, if_no_explore: {if_no_explore}")
112 |             print("##########################################################")
113 |             assert not (if_no_review and if_no_explore), "if_no_review and if_no_explore cannot be True at the same time"
114 |         if self.if_no_explore:
115 |             self.layer_top_k = 1
116 |             self.children_num = 1
117 | 
118 |     def get_final_solution(self, question, oracle_knowledge_ids=None):
119 |         self.node_id = 0
120 |         self.root = None
121 |         self.all_nodes_record = []
122 |         self.all_nodes_record_dict = {}
123 | 
124 |         self.root = InNode(question, my_id=self.node_id)
125 |         self.all_nodes_record.append(self.root)
126 |         self.node_id += 1
127 | 
128 |         good_solution_node = self.get_solution_tree()
129 | 
130 |         final_solution = good_solution_node.solution
131 | 
132 |         for node in self.all_nodes_record:
133 |             if node.flag == "In":
134 |                 self.all_nodes_record_dict[str(node.my_id)] = {
135 |                     "question": node.question,
136 |                     "flag": node.flag,
137 |                     "my_id": node.my_id,
138 |                     "retrieval": node.retrieval
139 |                 }
140 |             elif node.flag == "So":
141 |                 self.all_nodes_record_dict[str(node.my_id)] = {
142 |                     "proposal": node.proposal,
143 |                     "proposal_input": node.proposal_input,
144 |                     "solution": node.solution,
145 |                     "solution_input": node.solution_input,
146 |                     "solution_summary": node.solution_summary,
147 |                     "retrieval": node.retrieval,
148 |                     "retrieval_new": node.retrieval_new,
149 |                     "flag": node.flag,
150 |                     "my_id": node.my_id,
151 |                     "father_id": node.father_id,
152 |                     "score_for_father": node.score_for_father,
153 |                     "scores_from_children": node.scores_from_children,
154 |                     "if_final_used": node.if_final_used
155 |                 }
156 |             elif node.flag == "Re":
157 |                 self.all_nodes_record_dict[str(node.my_id)] = {
158 |                     "proposal": node.proposal,
159 |                     "proposal_input": node.proposal_input,
160 |                     "reflection": node.reflection,
161 |                     "reflection_input": node.reflection_input,
162 |                     "reflection_summary": node.reflection_summary,
163 |                     "retrieval": node.retrieval,
164 |                     "retrieval_new": node.retrieval_new,
165 |                     "flag": node.flag,
166 |                     "my_id": node.my_id,
167 |                     "father_id": node.father_id,
168 |                     "score_for_father": node.score_for_father,
169 |                     "scores_from_children": node.scores_from_children,
170 |                 }
171 |             else:
172 |                 raise ValueError("flag is not correct")
173 | 
174 |         return final_solution, self.all_nodes_record_dict
175 | 
176 |     def get_solution_tree(self):
177 | 
178 |         # 1. from root get four solution children
179 |         print_info = "current_depth: ROOT"
180 |         print(f"\n\n======================================== {print_info}")
181 |         self.root.retrieval = self.do_retrieval(query=self.root.question)
182 |         self.root.children = self.get_children(self.root, children_num=self.layer_top_k*self.children_num, print_info=print_info)
183 | 
184 |         # 2. iter tree
185 |         waiting_score_nodes = self.root.children  # num is 4
186 |         for current_depth in range(self.max_depth):
187 | 
188 |             new_children = []  # will be 4*2
189 |             for n, node in enumerate(waiting_score_nodes):
190 |                 print_info = f"current_depth: {current_depth+1}/{self.max_depth}, node: {n+1}/{len(waiting_score_nodes)}, node.flag: {node.flag}"
191 |                 print(f"\n\n======================================== {print_info}")
192 |                 new_children += self.get_children(node, children_num=self.children_num, print_info=print_info)
193 | 
194 |             for node in new_children:
195 |                 node.father.children.append(node)
196 |                 node.father.scores_from_children.append(node.score_for_father)
197 |             waiting_score_nodes_scores = [sum(node.scores_from_children) for node in waiting_score_nodes]
198 |             top_k_indices = torch.argsort(torch.tensor(waiting_score_nodes_scores))[-self.layer_top_k:]
199 |             scored_nodes = [waiting_score_nodes[i] for i in top_k_indices]  # num is 2
200 | 
201 |             waiting_score_nodes = []
202 |             for node in scored_nodes:
203 |                 waiting_score_nodes += node.children
204 | 
205 |         assert all([node.flag == "Re" for node in waiting_score_nodes]), "not all are comment nodes"
206 |         good_solution_node = waiting_score_nodes[0].father
207 |         good_solution_node.if_final_used = True
208 |         return good_solution_node
209 | 
210 |     def get_children(self, node, children_num=2, print_info=""):
211 |         children = []
212 |         if node.flag == "In":  # will get solution nodes
213 |             inputs = {"question": self.root.question}
214 |             for _ in range(children_num):
215 |                 children.append(SoNode(father=node, my_id=self.node_id))
216 |                 self.node_id += 1
217 |         elif node.flag == "So":  # will get reflection nodes
218 |             inputs = {"question": self.root.question, "solution": node.solution_summary}
219 |             for _ in range(children_num):
220 |                 children.append(ReNode(father=node, my_id=self.node_id))
221 |                 self.node_id += 1
222 |         elif node.flag == "Re":  # will get solution nodes
223 |             inputs = {"question": self.root.question, "solution": node.father.solution_summary, "reflection": node.reflection_summary}
224 |             for _ in range(children_num):
225 |                 children.append(SoNode(father=node, my_id=self.node_id))
226 |                 self.node_id += 1
227 |         else:
228 |             raise ValueError("flag is not correct")
229 |         self.all_nodes_record += children
230 | 
231 |         for c, child in enumerate(children):
232 |             print(f"\n=========do_proposal=========== {print_info}, child: {c+1}/{len(children)}, child.flag: {child.flag}")
233 |             child.proposal_input, child.proposal = self.do_proposal(inputs, node.flag)
234 | 
235 |             print(f"\n=========do_retrieval=========== {print_info}, child: {c+1}/{len(children)}, child.flag: {child.flag}")
236 |             child.retrieval_new = self.do_retrieval(child.proposal)
237 |             child.retrieval = copy.deepcopy(child.retrieval_new)
238 |             if self.if_sum_reference == "sum":
239 |                 new_retrieval_ids = [data['id'] for data in child.retrieval]
240 |                 for data in node.retrieval:
241 |                     if data['id'] not in new_retrieval_ids:
242 |                         child.retrieval.append(data)
243 |                         new_retrieval_ids.append(data['id'])
244 |                 print(f"== sum reference, len(child.retrieval_new): {len(child.retrieval_new)}, len(child.retrieval): {len(child.retrieval)}")
245 |             elif self.if_sum_reference == "sumroot":
246 |                 new_retrieval_ids = [data['id'] for data in child.retrieval]
247 |                 for data in self.root.retrieval:
248 |                     if data['id'] not in new_retrieval_ids:
249 |                         child.retrieval.append(data)
250 |                         new_retrieval_ids.append(data['id'])
251 |                 print(f"== sum root reference, len(child.retrieval_new): {len(child.retrieval_new)}, len(child.retrieval): {len(child.retrieval)}")
252 | 
253 |             print(f"\n=========do_generation=========== {print_info}, child: {c+1}/{len(children)}, child.flag: {child.flag}")
254 |             if node.flag == "Re" or node.flag == "In":
255 |                 child.score_for_father, child.solution_input, child.solution, child.solution_summary = self.do_generation(inputs, child.retrieval, node.flag)
256 |             elif node.flag == "So":
257 |                 child.score_for_father, child.reflection_input, child.reflection, child.reflection_summary = self.do_generation(inputs, child.retrieval, node.flag)
258 |             else:
259 |                 raise ValueError("flag is not correct")
260 | 
261 |         return children
262 | 
263 |     def do_proposal(self, inputs, father_flag):
264 |         if father_flag == "In":
265 |             print("## will get proposal based on father (In node)")
266 |             input_text = prompt_for_get_solution_proposal_for_root.format(question=inputs['question'])
267 |             proposal = self.llm.get_response(input_text, temperature=1.2, max_new_tokens=self.solution_max_new_tokens//2)
268 |         elif father_flag == "So":
269 |             if self.if_no_review:  # ablation for no review, reflection is also solution
270 |                 print("## will get proposal based on father (So node), but reflection is also solution")
271 |                 input_text = prompt_for_get_solution_proposal.format(question=inputs['question'], solution=inputs['solution'], reflection=inputs['solution'])
272 |                 proposal = self.llm.get_response(input_text, temperature=1.2, max_new_tokens=self.solution_max_new_tokens//2)
273 |                 return input_text, proposal
274 |             print("## will get proposal based on father (So node)")
275 |             input_text = prompt_for_get_doubt_proposal.format(question=inputs['question'], solution=inputs['solution'])
276 |             proposal = self.llm.get_response(input_text, temperature=1.2, max_new_tokens=self.doubt_max_new_tokens//2)
277 |         elif father_flag == "Re":
278 |             print("## will get proposal based on father (Re node)")
279 |             input_text = prompt_for_get_solution_proposal.format(question=inputs['question'], solution=inputs['solution'], reflection=inputs['reflection'])
280 |             proposal = self.llm.get_response(input_text, temperature=1.2, max_new_tokens=self.solution_max_new_tokens//2)
281 |         else:
282 |             raise ValueError("father_flag is not correct")
283 |         return input_text, proposal
284 | 
285 |     def do_retrieval(self, query):
286 |         query_embedding = self.embedder.get_embedding(query)
287 |         scores = torch.nn.functional.cosine_similarity(query_embedding, self.knowledge_lib_embeddings, dim=1)
288 |         top_k_datas = [self.knowledge_lib[i] for i in scores.argsort(descending=True)[:self.retrieval_top_k]]
289 |         print(f"## retrieval knowledge, get {len(top_k_datas)} datas")
290 |         return top_k_datas
291 | 
292 |     def do_generation(self, inputs, docs, father_flag):
293 | 
294 |         reference = ""
295 |         for d, doc in enumerate(docs):
296 |             reference += f"{d+1}. {doc['content']}\n"
297 |         if self.if_only_reference == "semionly":
298 |             if father_flag == "Re":
299 |                 print("## will add old solution and reflection to docs, because of semi-only reference")
300 |                 reference += f"{len(docs)+1}. {inputs['solution']}\n"
301 |                 reference += f"{len(docs)+2}. {inputs['reflection']}\n"
302 |         reference = reference.strip()
303 | 
304 |         if self.if_rerank == "llmrerank":
305 |             print("## will do llm rerank")
306 |             input_text_for_rerank = f"[Instruction]:\nIn order to solve the following question, we retrieved a set of references. Some of these references are highly useful for addressing the problem, while others are less useful. Your task is to categorize these references into two groups: one group containing highly useful references and the other group containing less useful references.\n\n[Question]:\n{inputs['question']}\n\n[References]:\n{reference}\n\n[Output]:"
307 |             rerank_output = self.llm.get_response(input_text_for_rerank, max_new_tokens=self.solution_max_new_tokens//2)
308 |             reference = f"{reference}\n\n{rerank_output}"
309 | 
310 |         if father_flag == "In":  # input: question, reference; output: solution
311 |             print("## will get solution based on father (In node)")
312 |             input_text = prompt_for_get_solution_for_root.format(question=inputs['question'], reference=reference)
313 |             output_text = self.llm.get_response(input_text, max_new_tokens=self.solution_max_new_tokens)
314 |             score = None
315 |         elif father_flag == "So":  # input: question, solution, reference; output: reflection
316 |             if self.if_no_review:  # ablation for no review, reflection is also solution
317 |                 print("## will get reflection based on father (So node), but reflection is also solution")
318 |                 if self.if_only_reference == "only" or self.if_only_reference == "semionly":
319 |                     print("## will get solution based on father (Re node) by only reference or semi-only reference")
320 |                     input_text = prompt_for_get_solution_for_root.format(question=inputs['question'], reference=reference)
321 |                 else:
322 |                     print("## will get solution based on father (Re node) by not only reference, but also old solution and reflection")
323 |                     input_text = prompt_for_get_solution.format(question=inputs['question'], solution=inputs['solution'], reflection=inputs['solution'], reference=reference)
324 |                 output_text = self.llm.get_response(input_text, max_new_tokens=self.solution_max_new_tokens)
325 | 
326 |                 print("## will get score for father (So node), but reflection is also solution")
327 |                 input_text_for_score = f"The following is an old solution for the question, along with a doubt raised about the solution, and the new solution generated based on the doubt. You need to evaluate the effectiveness of the doubt, determining whether it effectively helped improve and refine the original solution.\n\n[Question]:\n{inputs['question']}\n\n[Old Solution]:\n{inputs['solution']}\n\n[Doubt]:\n{output_text}\n\n[New Solution]:\n{output_text}\n\n[Judgment]:\nThis doubt is effective."
328 |                 score = self.llm.get_prompt_prob(input_text_for_score, if_print=False)
329 |                 return score, input_text, output_text, output_text
330 | 
331 |             print("## will get reflection based on father (So node)")
332 |             input_text = prompt_for_get_doubt.format(question=inputs['question'], solution=inputs['solution'], reference=reference)
333 |             output_text = self.llm.get_response(input_text, max_new_tokens=self.doubt_max_new_tokens)
334 | 
335 |             print("## will get score for father (So node)")
336 |             input_text_for_score = f"The following is a candidate solution for the question, along with a doubt raised about the solution. You need to evaluate the solution based on the doubt and assign it a score, with higher scores indicating better solutions.\n\n[Question]:\n{inputs['question']}\n\n[Candidate Solution]:\n{inputs['solution']}\n\n[Doubt]:\n{output_text}\n\n[Judgment]:\nThis solution is good."
337 |             score = self.llm.get_prompt_prob(input_text_for_score, if_print=False)
338 |         elif father_flag == "Re":  # input: question, solution, reflection, reference; output: solution
339 |             if self.if_only_reference == "only" or self.if_only_reference == "semionly":
340 |                 print("## will get solution based on father (Re node) by only reference or semi-only reference")
341 |                 input_text = prompt_for_get_solution_for_root.format(question=inputs['question'], reference=reference)
342 |             else:
343 |                 print("## will get solution based on father (Re node) by not only reference, but also old solution and reflection")
344 |                 input_text = prompt_for_get_solution.format(question=inputs['question'], solution=inputs['solution'], reflection=inputs['reflection'], reference=reference)
345 |             output_text = self.llm.get_response(input_text, max_new_tokens=self.solution_max_new_tokens)
346 | 
347 |             print("## will get score for father (Re node)")
348 |             input_text_for_score = f"The following is an old solution for the question, along with a doubt raised about the solution, and the new solution generated based on the doubt. You need to evaluate the effectiveness of the doubt, determining whether it effectively helped improve and refine the original solution.\n\n[Question]:\n{inputs['question']}\n\n[Old Solution]:\n{inputs['solution']}\n\n[Doubt]:\n{inputs['reflection']}\n\n[New Solution]:\n{output_text}\n\n[Judgment]:\nThis doubt is effective."
349 |             score = self.llm.get_prompt_prob(input_text_for_score, if_print=False)
350 |         else:
351 |             raise ValueError("father_flag is not correct")
352 | 
353 |         print("## will get summary for output_text")
354 |         output_text_summary_input = f"{output_text}\n\nGenerate a summary for the above text, the language should be English, and the summary should be concise and accurate."
355 |         output_text_summary = self.llm.get_response(output_text_summary_input, max_new_tokens=self.solution_max_new_tokens//4)
356 |         output_text_summary = output_text_summary.replace("\n", " ")
357 | 
358 |         return score, input_text, output_text, output_text_summary
359 | 
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | torch==2.5.1+cu121
2 | tqdm==4.67.1
3 | transformers==4.47.0
--------------------------------------------------------------------------------
/utils/bench_fig.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/icip-cas/DeepSolution/a953dc46c122238316af229e6d98c9c30b8df559/utils/bench_fig.png
--------------------------------------------------------------------------------
/utils/embedder.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import torch.nn.functional as F
3 | from transformers import AutoTokenizer, AutoModel
4 | 
5 | 
6 | class Embedder:
7 |     def __init__(self, device='cpu'):
8 |         model_path = "/root/hf_models/NV-Embed-v2"
9 |         print(f"Loading embedder model from {model_path}")
10 |         if device == 'cpu':
11 |             self.model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(device)
12 |         else:
13 |             self.model = AutoModel.from_pretrained(model_path, trust_remote_code=True, torch_dtype="float16").to(device)
14 |         self.model.eval()
15 | 
16 |     @torch.no_grad()
17 |     def get_embedding(self, text, max_length=4096):
18 |         query_embeddings = self.model.encode([text], max_length=max_length)
19 |         query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
20 | 
21 |         return query_embeddings
22 | 
--------------------------------------------------------------------------------
/utils/head_fig.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/icip-cas/DeepSolution/a953dc46c122238316af229e6d98c9c30b8df559/utils/head_fig.PNG
--------------------------------------------------------------------------------
/utils/method_fig.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/icip-cas/DeepSolution/a953dc46c122238316af229e6d98c9c30b8df559/utils/method_fig.png
--------------------------------------------------------------------------------
/utils/openai_api.py:
--------------------------------------------------------------------------------
1 | import os
2 | import json
3 | import time
4 | import requests
5 | from transformers import AutoTokenizer
6 | 
7 | 
8 | class OpenaiAPI():
9 |     def __init__(self, model_name="gpt-4o-2024-11-20"):
10 |         print("model_name", model_name)
11 | 
12 |         # print("loading tokenizer")
13 |         if os.path.exists("/data4/lizhuoqun2021/hf_models/llama-2-7b"):
14 |             self.tokenizer = AutoTokenizer.from_pretrained("/data4/lizhuoqun2021/hf_models/llama-2-7b")
15 |         elif os.path.exists("/mnt/data/lizhuoqun/hf_models/gpt2"):
16 |             self.tokenizer = AutoTokenizer.from_pretrained("/mnt/data/lizhuoqun/hf_models/gpt2")
17 |         else:
18 |             raise Exception("No model path found")
19 | 
20 |         self.model_name = model_name
21 | 
22 |     def get_response(self, prompt):
23 | 
24 |         token_nums = len(self.tokenizer(prompt)['input_ids'])
25 |         print(f"input_text token_nums: {token_nums}")
26 | 
27 |         url = "http://47.88.8.18:8088/api/ask"
28 |         headers = {
29 |             "Content-Type": "application/json",
30 |             "Authorization": "Bearer eyJ0eXAiOiJqd3QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VybmFtZSI6IjIzNzgzNiIsInBhc3N3b3JkIjoiMjM3ODM2MTIzIiwiZXhwIjoyMDMxMzc2MjA0fQ.Lz6IKLMUTWWT5isamrYTmbAcGNFpAqt87YFF2bynP3w"
31 |         }
32 |         raw_info = {
33 |             "model": self.model_name,
34 |             "messages": [{"role": "user", "content": prompt}],
35 |             "temperature": 1,
36 |         }
37 |         for r in range(125):
38 |             callback = requests.post(url, data=json.dumps(raw_info), headers=headers, timeout=512)
39 |             result = callback.json()
40 |             print("callback", callback)
41 |             if "200" in str(callback):
42 |                 break
43 |             else:
44 |                 print(f"result: {result}")
45 |                 sleep_time = 10 * ((r+1) ** 2)
46 |                 print(f"sleeping {sleep_time}s after trying {r+1} times")
47 |                 time.sleep(sleep_time)
48 | 
49 |         return result['data']['response']['choices'][0]['message']['content']
--------------------------------------------------------------------------------
/utils/prompts.py:
--------------------------------------------------------------------------------
1 | prompt_for_get_solution_proposal_for_root = """[Instruction]:
2 | In order to solve the following question, I need to search for relevant knowledge in external knowledge bases. Please generate a proposal based on the question and tell me what areas of knowledge I should search for.
3 | 
4 | [Question]:
5 | {question}
6 | 
7 | [Proposal]:
8 | 
9 | """
10 | 
11 | 
12 | 
13 | 
14 | 
15 | 
16 | prompt_for_get_doubt_proposal = """[Instruction]:
17 | For the following question, there is currently a candidate solution. To evaluate this candidate solution, I need to search for relevant knowledge in external knowledge bases. Please generate a proposal based on the question and the candidate solution, and tell me what areas of knowledge I should search for.
18 | 
19 | [Question]:
20 | {question}
21 | 
22 | [Candidate Solution]:
23 | {solution}
24 | 
25 | [Proposal]:"""
26 | 
27 | 
28 | 
29 | 
30 | 
31 | 
32 | 
33 | prompt_for_get_solution_proposal = """[Instruction]:
34 | For the following question, there is a candidate solution as well as an expert's evaluation of this candidate solution. In order to redesign a better solution, I need to search for relevant knowledge in external knowledge bases. Please generate a proposal based on the question and the candidate solution, and tell me what areas of knowledge I should search for.
35 | 
36 | [Question]:
37 | {question}
38 | 
39 | [Candidate Solution]:
40 | {solution}
41 | 
42 | [Expert Evaluation]:
43 | {reflection}
44 | 
45 | [Proposal]:"""
46 | 
47 | 
48 | 
49 | 
50 | 
51 | 
52 | 
53 | 
54 | 
55 | 
56 | 
57 | prompt_for_get_solution_for_root = """[Instruction]:
58 | Based on the reference knowledge, design a good solution for the question. Be sure to make full use of reference knowledge to analyze the challenges contained within the question and provide a comprehensive solution.
59 | 
60 | [Question]:
61 | {question}
62 | 
63 | [Reference]:
64 | {reference}
65 | 
66 | [Solution]:"""
67 | 
68 | 
69 | 
70 | 
71 | 
72 | 
73 | 
74 | prompt_for_get_doubt = """[Instruction]:
75 | For the following question, a candidate solution has already been provided. You need to critique the candidate solution based on the reference knowledge. Be sure to make full use of the reference knowledge to identify the shortcomings of the old solution in terms of its analysis of the challenges in the question and its technical implementation.
76 | 
77 | [Question]:
78 | {question}
79 | 
80 | [Candidate Solution]:
81 | {solution}
82 | 
83 | [Reference]:
84 | {reference}
85 | 
86 | [Critique]:"""
87 | 
88 | 
89 | 
90 | 
91 | 
92 | 
93 | 
94 | 
95 | 
96 | 
97 | prompt_for_get_solution = """[Instruction]:
98 | For the following question, an old solution has already been provided and its shortcomings have been pointed out by human experts. You need to redesign a better solution based on the reference knowledge and the guidance from human experts. Be sure to make full use of the reference knowledge to analyze the challenges contained within the question and provide a comprehensive solution.
99 | 
100 | [Question]:
101 | {question}
102 | 
103 | [Old Solution]:
104 | {solution}
105 | 
106 | [Expert Guidance]:
107 | {reflection}
108 | 
109 | [Reference]:
110 | {reference}
111 | 
112 | [Solution]:"""
113 | 
114 | 
115 | 
116 | 
117 | 
118 | 
--------------------------------------------------------------------------------
/utils/qwen_api.py:
--------------------------------------------------------------------------------
1 | import time
2 | import json
3 | import requests
4 | import os
5 | import random
6 | from transformers import AutoTokenizer
7 | 
8 | 
9 | class QwenAPI():
10 |     def __init__(self, url, url2="", url3="", url4=""):
11 |         self.url = url
12 |         self.url2 = url2
13 |         self.url3 = url3
14 |         self.url4 = url4
15 |         self.url_list = [self.url, self.url2, self.url3, self.url4]
16 |         self.url_list = [u for u in self.url_list if '.' in u]
17 |         print("url_list", self.url_list)
18 |         if len(self.url_list) > 1:  # shuffle when multiple serving urls are given
19 |             random_seed = len(str(os.urandom(12))) * len(str(os.urandom(25)))
20 |             print("random_seed", random_seed)
21 |             random.seed(random_seed)
22 |             for i in range(len(self.url_list)):
23 |                 random_index = random.randint(0, len(self.url_list)-1)
24 |                 print("random_index", random_index)
25 |                 self.url_list[i], self.url_list[random_index] = self.url_list[random_index], self.url_list[i]
26 |             print("url_list after random", self.url_list)
27 | 
28 |         print("loading tokenizer")
29 |         if os.path.exists("/data4/lizhuoqun2021/hf_models/llama-2-7b"):
30 |             self.tokenizer = AutoTokenizer.from_pretrained("/data4/lizhuoqun2021/hf_models/llama-2-7b")
31 |         elif os.path.exists("/mnt/data/lizhuoqun/hf_models/gpt2"):
32 |             self.tokenizer = AutoTokenizer.from_pretrained("/mnt/data/lizhuoqun/hf_models/gpt2")
33 |         else:
34 |             raise Exception("No model path found")
35 |         print("loading tokenizer done")
36 | 
37 |     def get_response(self, input_text, stop_str_list=[], temperature=0.7, max_new_tokens=4096, truncation_threshold=128000, truncation_side="right", return_logprobs=False, if_print=True):
38 |         current_time = time.time()
39 | 
40 |         input_text_token_num = len(self.tokenizer(input_text)['input_ids'])
41 |         print(f"input_text_token_num: {input_text_token_num}")
42 |         if input_text_token_num > truncation_threshold:
43 |             print(f"we reduce the input_text_token_num to truncation_threshold {truncation_threshold}", "truncation_side:", truncation_side)
44 |             if truncation_side == "right":
45 |                 input_text = input_text[:int(len(input_text)*(truncation_threshold/input_text_token_num))]
46 |             elif truncation_side == "left":
47 |                 input_text = input_text[-int(len(input_text)*(truncation_threshold/input_text_token_num)):]
48 |             else:
49 |                 raise Exception("truncation_side not valid")
50 |         url = self.url_select()
51 |         headers = {
52 |             # "Content-Type": "application/json",
53 |             "Authorization": "EMPTY"
54 |         }
55 |         raw_info = {
56 |             "model": "Qwen",
57 |             "messages": [{"role": "user", "content": input_text}],
58 |             "max_tokens": max_new_tokens,
59 |             "stop": stop_str_list,
60 |             "temperature": temperature,
61 |             "logprobs": True,
62 |             "echo": True,
63 |         }
64 | 
65 |         data = json.dumps(raw_info)
66 |         # print(data)
67 | 
68 |         for _ in range(2):
69 |             try:
70 |                 callback = requests.post(url, headers=headers, data=data, timeout=10000)
71 |                 break
72 |             except Exception as e:
73 |                 print("e", e)
74 |                 print("retry")
75 |                 url = self.url_select()
76 | 
77 |         if if_print:
78 |             print("callback.status_code", callback.status_code)
79 |         if callback.status_code != 200:
80 |             print("callback.text", callback.text)
81 |             raise Exception("callback.status_code != 200")
82 |         # print("callback.json()", callback.json())
83 | 
84 |         if if_print:
85 |             print(f"prompt_tokens: {callback.json()['usage']['prompt_tokens']}, total_tokens: {callback.json()['usage']['total_tokens']}, completion_tokens: {callback.json()['usage']['completion_tokens']}")
86 | 
87 |         result = callback.json()
88 |         # print(result)
89 |         # print(result.keys())
90 |         response = result['choices'][0]['message']['content']
91 |         # print(response)
92 |         # input()
93 | 
94 |         if if_print:
95 |             print("used time in this qwenapi get_response:", (time.time()-current_time)/60, "min")
96 | 
97 |         if return_logprobs is False:
98 |             return response
99 |         else:
100 |             raw_probs = result['choices'][0]['logprobs']['content']
101 |             avg_logprob = sum([raw_prob['logprob'] for raw_prob in raw_probs])/len(raw_probs)
102 |             return response, avg_logprob
103 | 
104 |     def get_prompt_prob(self, input_text, truncation_threshold=128000, truncation_side="left", if_print=True):
105 |         current_time = time.time()
106 | 
107 |         input_text_token_num = len(self.tokenizer(input_text)['input_ids'])
108 | 
109 |         if if_print:
110 |             print(f"input_text_token_num: {input_text_token_num}")
111 | 
112 |         if input_text_token_num > truncation_threshold:
113 |             print(f"we reduce the input_text_token_num to truncation_threshold {truncation_threshold}", "truncation_side:", truncation_side)
114 |             if truncation_side == "right":
115 |                 input_text = input_text[:int(len(input_text)*(truncation_threshold/input_text_token_num))]
116 |             elif truncation_side == "left":
117 |                 input_text = input_text[-int(len(input_text)*(truncation_threshold/input_text_token_num)):]
118 |             else:
119 |                 raise Exception("truncation_side not valid")
120 |         url = self.url_select()
121 |         headers = {
122 |             # "Content-Type": "application/json",
123 |             "Authorization": "EMPTY"
124 |         }
125 |         raw_info = {
126 |             "model": "Qwen",
127 |             "messages": [{"role": "user", "content": input_text}],
128 |             "max_tokens": 1,
129 |             "temperature": 0.0,
130 |             "logprobs": True,
131 |             "echo": True,
132 |         }
133 | 
134 |         data = json.dumps(raw_info)
135 |         # print(data)
136 | 
137 |         for _ in range(2):
138 |             try:
139 |                 callback = requests.post(url, headers=headers, data=data, timeout=(10000, 10000))
140 |                 break
141 |             except Exception as e:
142 |                 print("e", e)
143 |                 print("retry")
144 |                 url = self.url_select()
145 | 
146 |         if if_print:
147 |             print("callback.status_code", callback.status_code)
148 |         if callback.status_code != 200:
149 |             print("callback.text", callback.text)
150 |             raise Exception("callback.status_code != 200")
151 |         # print("callback.json()", callback.json())
152 | 
153 |         if if_print:
154 |             print(f"prompt_tokens: {callback.json()['usage']['prompt_tokens']}, total_tokens: {callback.json()['usage']['total_tokens']}, completion_tokens: {callback.json()['usage']['completion_tokens']}")
155 | 
156 |         result = callback.json()
157 |         # print(result)
158 |         # print(result.keys())
159 |         # print(response)
160 |         # input()
161 | 
162 |         raw_probs = result['prompt_logprobs'][-12:-5]
163 |         avg_logprob = sum([list(raw_prob.values())[0]['logprob'] for raw_prob in raw_probs])/len(raw_probs)
164 | 
165 |         if if_print:
166 |             print("used time in this qwenapi get_prompt_prob:", (time.time()-current_time)/60, "min")
167 | 
168 |         return avg_logprob
169 | 
170 |     def url_select(self):
171 |         for url in self.url_list:
172 |             curl_command = f'curl {url} --connect-timeout 1 -s'
173 |             stream = os.popen(curl_command)
174 |             output = stream.read()
175 |             try:
176 |                 response_data = json.loads(output)
177 |                 break
178 |             except json.JSONDecodeError as e:
179 |                 print(f"!!!!!!!!!!!!!!!! {url} is not available !!!!!!!!!!!!!!!!, change to next url")
180 | 
181 |         return url
--------------------------------------------------------------------------------
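For readers skimming the dump, the core pruning step of SolutionRAG (in `ours_framework.py`, `GameTreeRAG.get_solution_tree`) can be illustrated in isolation: each candidate node is ranked by the sum of the scores its children assigned to it (`scores_from_children`), and only the `layer_top_k` best nodes are expanded further. The sketch below is illustrative only — the `Node` class and the scores are hypothetical stand-ins for the framework's `SoNode`/`ReNode` objects, and a plain `sorted` call replaces the `torch.argsort` used in the repo:

```python
# Minimal sketch of the layer-pruning logic: rank candidate nodes by the
# summed scores their children assigned to them, keep only the top-k.
# `Node` is a hypothetical stand-in for SoNode/ReNode in ours_framework.py.

class Node:
    def __init__(self, name, scores_from_children):
        self.name = name
        self.scores_from_children = scores_from_children

def prune_layer(nodes, layer_top_k):
    """Keep the layer_top_k nodes with the highest summed child scores."""
    ranked = sorted(nodes, key=lambda n: sum(n.scores_from_children), reverse=True)
    return ranked[:layer_top_k]

# Scores play the role of the log-probabilities returned by get_prompt_prob
# (higher, i.e. closer to zero, is better).
nodes = [
    Node("a", [-0.2, -0.9]),   # sum = -1.1
    Node("b", [-0.1, -0.3]),   # sum = -0.4  (best)
    Node("c", [-0.5, -0.4]),   # sum = -0.9
]
kept = prune_layer(nodes, layer_top_k=2)
print([n.name for n in kept])  # -> ['b', 'c']
```

In the full framework the surviving nodes' children (alternating solution and review nodes) become the next layer's candidates, which is how tree-based exploration and bi-point thinking interleave.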