├── assets ├── intro.png ├── logo.png ├── result1.png ├── result2.png └── result3.png ├── eval ├── infer_img.png ├── images │ ├── 26514 │ │ └── mri_group1 │ │ │ ├── 0.jpeg │ │ │ └── 1.jpeg │ ├── 98014 │ │ └── mri_group1 │ │ │ ├── 0.jpeg │ │ │ ├── 1.jpeg │ │ │ ├── 2.jpeg │ │ │ └── 3.jpeg │ ├── 156702 │ │ └── x-ray_group1 │ │ │ ├── 0.png │ │ │ └── 1.png │ ├── 176949 │ │ └── ct_group1 │ │ │ ├── 0.jpeg │ │ │ └── 1.jpeg │ ├── 180144 │ │ └── ct_group1 │ │ │ ├── 0.png │ │ │ ├── 1.jpeg │ │ │ ├── 2.jpeg │ │ │ └── 3.jpeg │ ├── 185438 │ │ └── mri_group1 │ │ │ ├── 0.jpeg │ │ │ ├── 1.jpeg │ │ │ ├── 2.jpeg │ │ │ └── 3.jpeg │ ├── 187899 │ │ └── x-ray_group1 │ │ │ └── 0.jpeg │ ├── 197271 │ │ └── ct_group1 │ │ │ ├── 0.jpeg │ │ │ ├── 1.jpeg │ │ │ ├── 2.jpeg │ │ │ ├── 3.jpeg │ │ │ └── 4.jpeg │ ├── 197850 │ │ ├── mri_group2 │ │ │ └── 0.jpeg │ │ └── x-ray_group1 │ │ │ └── 0.png │ └── 201392 │ │ └── ultrasound_group1 │ │ ├── 0.jpeg │ │ ├── 1.jpeg │ │ ├── 2.jpeg │ │ ├── 3.jpeg │ │ └── 4.jpeg ├── eval.py └── eval_data.json ├── src ├── demo_data │ ├── images │ │ ├── 158377 │ │ │ └── x-ray_group1 │ │ │ │ └── 0.jpeg │ │ ├── 159949 │ │ │ └── ct_group1 │ │ │ │ ├── 0.jpeg │ │ │ │ └── 1.jpeg │ │ ├── 160688 │ │ │ └── mammography_group1 │ │ │ │ └── 0.png │ │ ├── 169992 │ │ │ └── ct_group1 │ │ │ │ └── 0.png │ │ ├── 177194 │ │ │ └── ct_group1 │ │ │ │ ├── 0.png │ │ │ │ └── 1.png │ │ ├── 178062 │ │ │ └── ct_group1 │ │ │ │ └── 0.jpeg │ │ ├── 181112 │ │ │ └── ct_group1 │ │ │ │ ├── 0.jpeg │ │ │ │ └── 1.jpeg │ │ ├── 189353 │ │ │ └── x-ray_group1 │ │ │ │ └── 0.jpeg │ │ ├── 189817 │ │ │ └── ultrasound_group1 │ │ │ │ ├── 0.png │ │ │ │ └── 1.jpeg │ │ └── 190893 │ │ │ └── ct_group1 │ │ │ └── 0.jpeg │ └── demo.jsonl ├── qwenvl_forward.py ├── prompt.py ├── step.py ├── internvl_forward.py ├── model.py ├── run.py ├── utils.py └── mics.py ├── requirements.txt ├── train ├── zero_stage1_config.json ├── internvl2_5_8b_dynamic_res_2nd_finetune_lora.sh └── internvl2_5_2b_dynamic_res_2nd_finetune_lora.sh ├── LICENSE ├── .gitignore ├── infer.py └── README.md /assets/intro.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/assets/intro.png -------------------------------------------------------------------------------- /assets/logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/assets/logo.png -------------------------------------------------------------------------------- /assets/result1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/assets/result1.png -------------------------------------------------------------------------------- /assets/result2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/assets/result2.png -------------------------------------------------------------------------------- /assets/result3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/assets/result3.png -------------------------------------------------------------------------------- /eval/infer_img.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/infer_img.png 
-------------------------------------------------------------------------------- /eval/images/180144/ct_group1/0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/180144/ct_group1/0.png -------------------------------------------------------------------------------- /eval/images/176949/ct_group1/0.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/176949/ct_group1/0.jpeg -------------------------------------------------------------------------------- /eval/images/176949/ct_group1/1.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/176949/ct_group1/1.jpeg -------------------------------------------------------------------------------- /eval/images/180144/ct_group1/1.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/180144/ct_group1/1.jpeg -------------------------------------------------------------------------------- /eval/images/180144/ct_group1/2.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/180144/ct_group1/2.jpeg -------------------------------------------------------------------------------- /eval/images/180144/ct_group1/3.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/180144/ct_group1/3.jpeg -------------------------------------------------------------------------------- /eval/images/185438/mri_group1/0.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/185438/mri_group1/0.jpeg -------------------------------------------------------------------------------- /eval/images/185438/mri_group1/1.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/185438/mri_group1/1.jpeg -------------------------------------------------------------------------------- /eval/images/185438/mri_group1/2.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/185438/mri_group1/2.jpeg -------------------------------------------------------------------------------- /eval/images/185438/mri_group1/3.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/185438/mri_group1/3.jpeg -------------------------------------------------------------------------------- /eval/images/197271/ct_group1/0.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/197271/ct_group1/0.jpeg -------------------------------------------------------------------------------- /eval/images/197271/ct_group1/1.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/197271/ct_group1/1.jpeg 
-------------------------------------------------------------------------------- /eval/images/197271/ct_group1/2.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/197271/ct_group1/2.jpeg -------------------------------------------------------------------------------- /eval/images/197271/ct_group1/3.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/197271/ct_group1/3.jpeg -------------------------------------------------------------------------------- /eval/images/197271/ct_group1/4.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/197271/ct_group1/4.jpeg -------------------------------------------------------------------------------- /eval/images/197850/mri_group2/0.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/197850/mri_group2/0.jpeg -------------------------------------------------------------------------------- /eval/images/26514/mri_group1/0.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/26514/mri_group1/0.jpeg -------------------------------------------------------------------------------- /eval/images/26514/mri_group1/1.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/26514/mri_group1/1.jpeg -------------------------------------------------------------------------------- /eval/images/98014/mri_group1/0.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/98014/mri_group1/0.jpeg -------------------------------------------------------------------------------- /eval/images/98014/mri_group1/1.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/98014/mri_group1/1.jpeg -------------------------------------------------------------------------------- /eval/images/98014/mri_group1/2.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/98014/mri_group1/2.jpeg -------------------------------------------------------------------------------- /eval/images/98014/mri_group1/3.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/98014/mri_group1/3.jpeg -------------------------------------------------------------------------------- /eval/images/156702/x-ray_group1/0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/156702/x-ray_group1/0.png -------------------------------------------------------------------------------- /eval/images/156702/x-ray_group1/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/156702/x-ray_group1/1.png 
-------------------------------------------------------------------------------- /eval/images/187899/x-ray_group1/0.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/187899/x-ray_group1/0.jpeg -------------------------------------------------------------------------------- /eval/images/197850/x-ray_group1/0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/197850/x-ray_group1/0.png -------------------------------------------------------------------------------- /eval/images/201392/ultrasound_group1/0.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/201392/ultrasound_group1/0.jpeg -------------------------------------------------------------------------------- /eval/images/201392/ultrasound_group1/1.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/201392/ultrasound_group1/1.jpeg -------------------------------------------------------------------------------- /eval/images/201392/ultrasound_group1/2.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/201392/ultrasound_group1/2.jpeg -------------------------------------------------------------------------------- /eval/images/201392/ultrasound_group1/3.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/201392/ultrasound_group1/3.jpeg -------------------------------------------------------------------------------- /eval/images/201392/ultrasound_group1/4.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/eval/images/201392/ultrasound_group1/4.jpeg -------------------------------------------------------------------------------- /src/demo_data/images/159949/ct_group1/0.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/src/demo_data/images/159949/ct_group1/0.jpeg -------------------------------------------------------------------------------- /src/demo_data/images/159949/ct_group1/1.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/src/demo_data/images/159949/ct_group1/1.jpeg -------------------------------------------------------------------------------- /src/demo_data/images/169992/ct_group1/0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/src/demo_data/images/169992/ct_group1/0.png -------------------------------------------------------------------------------- /src/demo_data/images/177194/ct_group1/0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/src/demo_data/images/177194/ct_group1/0.png -------------------------------------------------------------------------------- /src/demo_data/images/177194/ct_group1/1.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/src/demo_data/images/177194/ct_group1/1.png -------------------------------------------------------------------------------- /src/demo_data/images/178062/ct_group1/0.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/src/demo_data/images/178062/ct_group1/0.jpeg -------------------------------------------------------------------------------- /src/demo_data/images/181112/ct_group1/0.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/src/demo_data/images/181112/ct_group1/0.jpeg -------------------------------------------------------------------------------- /src/demo_data/images/181112/ct_group1/1.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/src/demo_data/images/181112/ct_group1/1.jpeg -------------------------------------------------------------------------------- /src/demo_data/images/190893/ct_group1/0.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/src/demo_data/images/190893/ct_group1/0.jpeg -------------------------------------------------------------------------------- /src/demo_data/images/158377/x-ray_group1/0.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/src/demo_data/images/158377/x-ray_group1/0.jpeg -------------------------------------------------------------------------------- /src/demo_data/images/189353/x-ray_group1/0.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/src/demo_data/images/189353/x-ray_group1/0.jpeg -------------------------------------------------------------------------------- /src/demo_data/images/189817/ultrasound_group1/0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/src/demo_data/images/189817/ultrasound_group1/0.png -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | openai==1.66.3 2 | Pillow==10.4.0 3 | qwen_vl_utils==0.0.10 4 | torch==2.6.0 5 | torchvision==0.21.0 6 | tqdm==4.67.1 7 | transformers==4.52.4 8 | -------------------------------------------------------------------------------- /src/demo_data/images/160688/mammography_group1/0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/src/demo_data/images/160688/mammography_group1/0.png -------------------------------------------------------------------------------- /src/demo_data/images/189817/ultrasound_group1/1.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manglu097/Chiron-o1/HEAD/src/demo_data/images/189817/ultrasound_group1/1.jpeg -------------------------------------------------------------------------------- /train/zero_stage1_config.json: 
-------------------------------------------------------------------------------- 1 | { 2 | "zero_optimization": { 3 | "stage": 1, 4 | "allgather_partitions": true, 5 | "allgather_bucket_size": 1e9, 6 | "overlap_comm": true, 7 | "reduce_scatter": true, 8 | "reduce_bucket_size": 1e9, 9 | "contiguous_gradients": true 10 | }, 11 | "fp16": { 12 | "enabled": "auto", 13 | "auto_cast": true, 14 | "loss_scale": 0, 15 | "initial_scale_power": 32, 16 | "loss_scale_window": 1000, 17 | "hysteresis": 2, 18 | "min_loss_scale": 1 19 | }, 20 | "bf16": { 21 | "enabled": "auto" 22 | }, 23 | "optimizer": { 24 | "type": "AdamW", 25 | "params": { 26 | "lr": "auto", 27 | "betas": [ 28 | 0.9, 29 | 0.999 30 | ], 31 | "eps": 1e-8, 32 | "weight_decay": "auto" 33 | } 34 | }, 35 | "gradient_accumulation_steps": "auto", 36 | "gradient_clipping": "auto", 37 | "steps_per_print": 2000, 38 | "train_batch_size": "auto", 39 | "train_micro_batch_size_per_gpu": "auto", 40 | "wall_clock_breakdown": true 41 | } 42 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 manglu097 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /src/qwenvl_forward.py: -------------------------------------------------------------------------------- 1 | from qwen_vl_utils import process_vision_info 2 | 3 | def qwenvl_forward(model, processor, prompt, img_paths, temperature=0.9): 4 | messages = [ 5 | { 6 | 'role': "system", 7 | "content": 'You are a highly professional and experienced clinician. You are familiar with the latest medical guidelines, diagnostic standards and treatment plans, and can reasonably answer users\' questions.' 
8 | }, 9 | { 10 | "role": "user", 11 | "content": [], 12 | }, 13 | ] 14 | 15 | for img_path in img_paths: 16 | messages[1]["content"].append({ 17 | "type": "image", 18 | "image": img_path, 19 | }) 20 | 21 | messages[1]["content"].append({"type": "text", "text": prompt}) 22 | 23 | texts = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) 24 | image_inputs, video_inputs = process_vision_info(messages) 25 | inputs = processor( 26 | text=[texts], 27 | images=image_inputs, 28 | videos=video_inputs, 29 | padding=True, 30 | return_tensors="pt", 31 | ).to(model.device) 32 | 33 | generated_ids = model.generate(**inputs, max_new_tokens=2048, do_sample=True, repetition_penalty=1.0, temperature=temperature)  # do_sample=True is required for temperature to take effect 34 | generated_ids_trimmed = [ 35 | out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) 36 | ] 37 | output_texts = processor.batch_decode( 38 | generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False 39 | )[0] 40 | 41 | return output_texts 42 | -------------------------------------------------------------------------------- /train/internvl2_5_8b_dynamic_res_2nd_finetune_lora.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | 3 | GPUS=${GPUS:-2} 4 | BATCH_SIZE=${BATCH_SIZE:-16} 5 | PER_DEVICE_BATCH_SIZE=${PER_DEVICE_BATCH_SIZE:-4} 6 | GRADIENT_ACC=$((BATCH_SIZE / PER_DEVICE_BATCH_SIZE / GPUS)) 7 | 8 | 9 | export PYTHONPATH="${PYTHONPATH}:$(pwd)" 10 | export MASTER_PORT=34229 11 | export TF_CPP_MIN_LOG_LEVEL=3 12 | export LAUNCHER=pytorch 13 | 14 | OUTPUT_DIR='' # to save ckpt 15 | 16 | if [ ! -d "$OUTPUT_DIR" ]; then 17 | mkdir -p "$OUTPUT_DIR" 18 | fi 19 | # Path of base model: fill in --model_name_or_path below (a comment inside the torchrun invocation would break the line continuation) 20 | torchrun \ 21 | --nnodes=1 \ 22 | --node_rank=0 \ 23 | --master_addr=127.0.0.1 \ 24 | --nproc_per_node=${GPUS} \ 25 | --master_port=${MASTER_PORT} \ 26 | internvl/train/internvl_chat_finetune.py \ 27 | --model_name_or_path "" \ 28 | --conv_style "internvl2_5" \ 29 | --use_fast_tokenizer False \ 30 | --output_dir ${OUTPUT_DIR} \ 31 | --meta_path "./shell/data/chiron-o1-8b.json" \ 32 | --overwrite_output_dir True \ 33 | --force_image_size 448 \ 34 | --max_dynamic_patch 6 \ 35 | --down_sample_ratio 0.5 \ 36 | --drop_path_rate 0.0 \ 37 | --freeze_llm True \ 38 | --freeze_mlp True \ 39 | --freeze_backbone True \ 40 | --use_llm_lora 16 \ 41 | --vision_select_layer -1 \ 42 | --dataloader_num_workers 4 \ 43 | --bf16 True \ 44 | --num_train_epochs 1 \ 45 | --per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \ 46 | --gradient_accumulation_steps ${GRADIENT_ACC} \ 47 | --evaluation_strategy "no" \ 48 | --save_strategy "steps" \ 49 | --save_steps 2000 \ 50 | --save_total_limit 1 \ 51 | --learning_rate 4e-5 \ 52 | --weight_decay 0.05 \ 53 | --warmup_ratio 0.03 \ 54 | --lr_scheduler_type "cosine" \ 55 | --logging_steps 1 \ 56 | --max_seq_length 8192 \ 57 | --do_train True \ 58 | --grad_checkpoint True \ 59 | --group_by_length True \ 60 | --dynamic_image_size True \ 61 | --use_thumbnail True \ 62 | --ps_version 'v2' \ 63 | --deepspeed "zero_stage1_config.json" \ 64 | --report_to "tensorboard" \ 65 | 2>&1 | tee -a "${OUTPUT_DIR}/training_log.txt" 66 | -------------------------------------------------------------------------------- /train/internvl2_5_2b_dynamic_res_2nd_finetune_lora.sh: -------------------------------------------------------------------------------- 1 | set -x 2 | 3 | GPUS=${GPUS:-8} 4 | BATCH_SIZE=${BATCH_SIZE:-64} 5 | PER_DEVICE_BATCH_SIZE=${PER_DEVICE_BATCH_SIZE:-4} 6 | 
GRADIENT_ACC=$((BATCH_SIZE / PER_DEVICE_BATCH_SIZE / GPUS)) 7 | 8 | 9 | export PYTHONPATH="${PYTHONPATH}:$(pwd)" 10 | export MASTER_PORT=34229 11 | export TF_CPP_MIN_LOG_LEVEL=3 12 | export LAUNCHER=pytorch 13 | 14 | OUTPUT_DIR='' # to save ckpt 15 | 16 | if [ ! -d "$OUTPUT_DIR" ]; then 17 | mkdir -p "$OUTPUT_DIR" 18 | fi 19 | # Path of base model: fill in --model_name_or_path below (a comment inside the torchrun invocation would break the line continuation) 20 | torchrun \ 21 | --nnodes=1 \ 22 | --node_rank=0 \ 23 | --master_addr=127.0.0.1 \ 24 | --nproc_per_node=${GPUS} \ 25 | --master_port=${MASTER_PORT} \ 26 | internvl/train/internvl_chat_finetune.py \ 27 | --model_name_or_path "" \ 28 | --conv_style "internvl2_5" \ 29 | --use_fast_tokenizer False \ 30 | --output_dir ${OUTPUT_DIR} \ 31 | --meta_path "./shell/data/chiron-o1-2b.json" \ 32 | --overwrite_output_dir True \ 33 | --force_image_size 448 \ 34 | --max_dynamic_patch 6 \ 35 | --down_sample_ratio 0.5 \ 36 | --drop_path_rate 0.0 \ 37 | --freeze_llm True \ 38 | --freeze_mlp True \ 39 | --freeze_backbone True \ 40 | --use_llm_lora 16 \ 41 | --vision_select_layer -1 \ 42 | --dataloader_num_workers 4 \ 43 | --bf16 True \ 44 | --num_train_epochs 1 \ 45 | --per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \ 46 | --gradient_accumulation_steps ${GRADIENT_ACC} \ 47 | --evaluation_strategy "no" \ 48 | --save_strategy "steps" \ 49 | --save_steps 2000 \ 50 | --save_total_limit 1 \ 51 | --learning_rate 4e-5 \ 52 | --weight_decay 0.01 \ 53 | --warmup_ratio 0.03 \ 54 | --lr_scheduler_type "cosine" \ 55 | --logging_steps 1 \ 56 | --max_seq_length 8192 \ 57 | --do_train True \ 58 | --grad_checkpoint True \ 59 | --group_by_length True \ 60 | --dynamic_image_size True \ 61 | --use_thumbnail True \ 62 | --ps_version 'v2' \ 63 | --deepspeed "zero_stage1_config.json" \ 64 | --report_to "tensorboard" \ 65 | 2>&1 | tee -a "${OUTPUT_DIR}/training_log.txt" 66 | -------------------------------------------------------------------------------- /src/prompt.py: -------------------------------------------------------------------------------- 1 | JUDGE_PROMPT = """ 2 | Evaluate whether the model's answer aligns with the correct answer semantically. Output 'Yes' only if the model's answer matches the correct result, and 'No' if it does not match or if the correctness is unclear. Provide only 'Yes' or 'No' as the output, without any explanation. 3 | 4 | Question: {question} 5 | Model's answer: {model_answer} 6 | Correct answer: {gt_answer}""" 7 | 8 | 9 | 10 | 11 | REASONING_PROMPT = """ 12 | Given a specific question about the images and the patient's case information (such as age, gender, chief complaint, and some relevant image analysis), your goal is to generate a detailed, step-by-step thought process that leads to the correct answer. 13 | 1. Your thought process must rely solely on the provided information. Do not fabricate details or introduce information not present in the inputs. 14 | 2. Approach the task as if the answer is unknown, avoiding any shortcuts or assumptions that the gold standard answer is already understood. 15 | 3. If the thought process involves observations related to images, present those observations as if they were directly derived from the images themselves, without referencing image analysis. 16 | 4. Adapt your thought process to the complexity of each case, using fewer reasoning steps for simpler problems and more thorough analysis for complex ones, mirroring the flexible and analytical mindset of a skilled clinician. 17 | 18 | Format your response with the following format: 19 | ### Step 1: 20 | ### Step 2: 21 | ... 
22 | ### The final answer is: 23 | 24 | Case Information: 25 | {case_info} 26 | Question: {question} 27 | Correct Answer: {gt_answer} 28 | Reasoning prefix: {reasoning_prefix} 29 | 30 | Continue your response by building directly on the existing thought process, ensuring that no changes are made to the content of the prior reasoning, and provide a clear and coherent completion of your answer. Just output the completed part. 31 | """ 32 | 33 | 34 | EVALUATE_PROMPT = """ 35 | Using the provided medical images and partial thought process, deduce the correct answer to the question through rigorous reasoning. Ensure the response is concise, accurate, and conforms to medical terminology standards. Provide only the final answer. 36 | 37 | Format your response with the following format: 38 | ### The final answer is: 39 | 40 | Question: {question} 41 | Reasoning prefix: {reasoning_prefix} 42 | """ -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled Python files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # Distribution / packaging 7 | .Python 8 | build/ 9 | develop-eggs/ 10 | dist/ 11 | downloads/ 12 | eggs/ 13 | .eggs/ 14 | lib/ 15 | lib64/ 16 | parts/ 17 | sdist/ 18 | var/ 19 | *.egg-info/ 20 | .installed.cfg 21 | *.egg 22 | MANIFEST 23 | 24 | # PyInstaller 25 | # Usually these files are written by a python script from a template 26 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 27 | *.manifest 28 | *.spec 29 | 30 | # Installer logs 31 | pip-log.txt 32 | pip-delete-this-directory.txt 33 | 34 | # Unit test / coverage reports 35 | htmlcov/ 36 | .tox/ 37 | .nox/ 38 | .coverage 39 | .coverage.* 40 | .cache 41 | nosetests.xml 42 | coverage.xml 43 | *.cover 44 | .hypothesis/ 45 | .pytest_cache/ 46 | 47 | # Translations 48 | *.mo 49 | *.pot 50 | 51 | # Django stuff: 52 | *.log 53 | local_settings.py 54 | db.sqlite3 55 | db.sqlite3-journal 56 | 57 | # Flask stuff: 58 | instance/ 59 | .webassets-cache 60 | 61 | # Scrapy stuff: 62 | .scrapy 63 | 64 | # Sphinx documentation 65 | docs/_build/ 66 | 67 | # PyBuilder 68 | target/ 69 | 70 | # Jupyter Notebook 71 | .ipynb_checkpoints 72 | 73 | # IPython 74 | profile_default/ 75 | ipython_config.py 76 | 77 | # pyenv 78 | .python-version 79 | 80 | # pipenv 81 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 82 | # However, in case of collaboration, it may be omitted. 83 | Pipfile.lock 84 | 85 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 86 | __pypackages__/ 87 | 88 | # Celery stuff 89 | celerybeat-schedule 90 | celerybeat.pid 91 | 92 | # SageMath parsed files 93 | *.sage.py 94 | 95 | # Environments 96 | .env 97 | .venv 98 | env/ 99 | venv/ 100 | ENV/ 101 | env.bak/ 102 | venv.bak/ 103 | 104 | # Spyder project settings 105 | .spyderproject 106 | .spyproject 107 | 108 | # Rope project settings 109 | .ropeproject 110 | 111 | # mkdocs documentation 112 | /site 113 | 114 | # mypy 115 | .mypy_cache/ 116 | .dmypy.json 117 | dmypy.json 118 | 119 | # PyCharm 120 | .idea/ 121 | 122 | # VS Code 123 | .vscode/ 124 | *.code-workspace 125 | 126 | # Sublime Text 127 | *.sublime-project 128 | *.sublime-workspace 129 | 130 | # Operating System Files 131 | .DS_Store 132 | Thumbs.db 133 | 134 | -------------------------------------------------------------------------------- /src/step.py: -------------------------------------------------------------------------------- 1 | class Step: 2 | def __init__(self, step_text="", prefix_steps="", parent=None, score=0.0, generated_by=None): 3 | """ 4 | Initializes a step node in the reasoning tree. 5 | 6 | Args: 7 | step_text: Text of the current reasoning step (e.g., "### Step 1: ..."). Typically starts with '###'. 8 | prefix_steps: Concatenated text of all preceding steps (parent's text). 9 | parent: Parent step node. 10 | score: Score of this step (based on student model evaluation). 11 | generated_by: Teacher model that generated this step. 12 | """ 13 | self.step_text = step_text 14 | self.parent = parent 15 | self.children = [] 16 | self.score = score # Score from student evaluation 17 | self.prefix_steps = prefix_steps # All steps before this node 18 | self.generated_by = generated_by # Tracks which teacher generated this step 19 | self.depth = parent.depth + 1 if parent else 0 20 | 21 | if parent: 22 | separator = "\n" if self.prefix_steps else "" 23 | self.text = self.prefix_steps + separator + self.step_text 24 | else: # Root node 25 | self.text = prefix_steps 26 | 27 | def is_terminal(self): 28 | """ 29 | Checks if the current step is a terminal step (i.e., has no children). 30 | """ 31 | return len(self.children) == 0 32 | 33 | def add_child_step(self, step_text, score=0.0, generated_by=None): 34 | """ 35 | Adds a child step to the current step. 36 | The child step's prefix_steps is the current step's full text. 37 | """ 38 | if step_text is None: 39 | print("Warning: Attempted to add a child step with None step_text.") 40 | return None 41 | child_step = Step(step_text=step_text, prefix_steps=self.text, parent=self, score=score, generated_by=generated_by) 42 | self.children.append(child_step) 43 | return child_step 44 | 45 | def get_full_reasoning(self): 46 | """ 47 | Returns the full reasoning chain text for this step. 48 | """ 49 | return self.text 50 | 51 | def get_step_path(self): 52 | """ 53 | Returns the list of steps from the root to the current step. 
54 | """ 55 | step = self 56 | path = [] 57 | while step: 58 | path.append(step) 59 | step = step.parent 60 | return path[::-1] # Return from root step -------------------------------------------------------------------------------- /src/internvl_forward.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torchvision.transforms as T 3 | from PIL import Image 4 | from torchvision.transforms.functional import InterpolationMode 5 | 6 | 7 | IMAGENET_MEAN = (0.485, 0.456, 0.406) 8 | IMAGENET_STD = (0.229, 0.224, 0.225) 9 | 10 | def build_transform(input_size): 11 | MEAN, STD = IMAGENET_MEAN, IMAGENET_STD 12 | transform = T.Compose([ 13 | T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img), 14 | T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC), 15 | T.ToTensor(), 16 | T.Normalize(mean=MEAN, std=STD) 17 | ]) 18 | return transform 19 | 20 | def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size): 21 | best_ratio_diff = float('inf') 22 | best_ratio = (1, 1) 23 | area = width * height 24 | for ratio in target_ratios: 25 | target_aspect_ratio = ratio[0] / ratio[1] 26 | ratio_diff = abs(aspect_ratio - target_aspect_ratio) 27 | if ratio_diff < best_ratio_diff: 28 | best_ratio_diff = ratio_diff 29 | best_ratio = ratio 30 | elif ratio_diff == best_ratio_diff: 31 | if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]: 32 | best_ratio = ratio 33 | return best_ratio 34 | 35 | def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False): 36 | orig_width, orig_height = image.size 37 | aspect_ratio = orig_width / orig_height 38 | 39 | # calculate the existing image aspect ratio 40 | target_ratios = set( 41 | (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if 42 | i * j <= max_num and i * j >= min_num) 43 | target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1]) 44 | 45 | # find the closest aspect ratio to the target 46 | target_aspect_ratio = find_closest_aspect_ratio( 47 | aspect_ratio, target_ratios, orig_width, orig_height, image_size) 48 | 49 | # calculate the target width and height 50 | target_width = image_size * target_aspect_ratio[0] 51 | target_height = image_size * target_aspect_ratio[1] 52 | blocks = target_aspect_ratio[0] * target_aspect_ratio[1] 53 | 54 | # resize the image 55 | resized_img = image.resize((target_width, target_height)) 56 | processed_images = [] 57 | for i in range(blocks): 58 | box = ( 59 | (i % (target_width // image_size)) * image_size, 60 | (i // (target_width // image_size)) * image_size, 61 | ((i % (target_width // image_size)) + 1) * image_size, 62 | ((i // (target_width // image_size)) + 1) * image_size 63 | ) 64 | # split the image 65 | split_img = resized_img.crop(box) 66 | processed_images.append(split_img) 67 | assert len(processed_images) == blocks 68 | if use_thumbnail and len(processed_images) != 1: 69 | thumbnail_img = image.resize((image_size, image_size)) 70 | processed_images.append(thumbnail_img) 71 | return processed_images 72 | 73 | def load_image(image_file, input_size=448, max_num=12): 74 | image = Image.open(image_file).convert('RGB') 75 | transform = build_transform(input_size=input_size) 76 | images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num) 77 | pixel_values = [transform(image) for image in images] 78 | pixel_values = torch.stack(pixel_values) 79 | return pixel_values 80 | 81 | 82 | 
83 | def internvl_forward(model, processor, prompt, img_paths, temperature=0.9): 84 | 85 | all_pixel_values = [] 86 | num_patches_list = [] 87 | 88 | for img_path in img_paths: 89 | pixel_values = load_image(img_path, max_num=12).to(torch.bfloat16).cuda() 90 | all_pixel_values.append(pixel_values) 91 | num_patches_list.append(pixel_values.size(0)) 92 | 93 | if len(img_paths) == 1: 94 | pixel_values = all_pixel_values[0] 95 | generation_config = dict(max_new_tokens=2048, do_sample=True, temperature=temperature) 96 | response = model.chat(processor, pixel_values, prompt, generation_config) 97 | else: 98 | if len(all_pixel_values) > 0: 99 | pixel_values = torch.cat(all_pixel_values, dim=0) 100 | else: 101 | return "Error: No images provided" 102 | 103 | generation_config = dict(max_new_tokens=2048, do_sample=True, temperature=temperature) 104 | response = model.chat(processor, pixel_values, prompt, generation_config, 105 | num_patches_list=num_patches_list) 106 | 107 | return response 108 | -------------------------------------------------------------------------------- /src/model.py: -------------------------------------------------------------------------------- 1 | import math 2 | import torch 3 | from transformers import ( 4 | AutoProcessor, 5 | AutoTokenizer, 6 | Qwen2_5_VLForConditionalGeneration, 7 | Qwen2VLForConditionalGeneration, 8 | AutoModel, 9 | AutoConfig 10 | ) 11 | 12 | def split_model(model_path): 13 | device_map = {} 14 | world_size = torch.cuda.device_count() 15 | config = AutoConfig.from_pretrained(model_path, trust_remote_code=True) 16 | num_layers = config.llm_config.num_hidden_layers 17 | # Since the first GPU will be used for ViT, treat it as half a GPU. 18 | num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5)) 19 | num_layers_per_gpu = [num_layers_per_gpu] * world_size 20 | num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5) 21 | layer_cnt = 0 22 | for i, num_layer in enumerate(num_layers_per_gpu): 23 | for j in range(num_layer): 24 | device_map[f'language_model.model.layers.{layer_cnt}'] = i 25 | layer_cnt += 1 26 | device_map['vision_model'] = 0 27 | device_map['mlp1'] = 0 28 | device_map['language_model.model.tok_embeddings'] = 0 29 | device_map['language_model.model.embed_tokens'] = 0 30 | device_map['language_model.output'] = 0 31 | device_map['language_model.model.norm'] = 0 32 | device_map['language_model.model.rotary_emb'] = 0 33 | device_map['language_model.lm_head'] = 0 34 | device_map[f'language_model.model.layers.{num_layers - 1}'] = 0 35 | 36 | return device_map 37 | 38 | def init_model(args): 39 | """ 40 | Initializes and loads models specified in command-line arguments. 41 | 42 | Args: 43 | args: Parsed command-line arguments containing model paths. 44 | 45 | Returns: 46 | A dictionary where keys are model names (e.g., 'qwen25_vl_7b') 47 | and values are dictionaries containing the loaded 'model' and 'processor'. 48 | Returns an empty dictionary if no models are specified or loaded. 
49 | """ 50 | model_set = {} 51 | 52 | # --- Load Models Based on Provided Paths --- 53 | 54 | # Qwen/Qwen2.5-VL-72B-Instruct (Teacher) 55 | min_pixels = 256 * 28 * 28 56 | max_pixels = 1280 * 28 * 28 57 | 58 | # Qwen/Qwen2.5-VL-7B-Instruct (Intern) 59 | if args.qwen25_vl_7b_model_path: 60 | model_name = 'qwen25_vl_7b' 61 | print(f'Initializing {model_name}...') 62 | try: 63 | model = Qwen2_5_VLForConditionalGeneration.from_pretrained( 64 | args.qwen25_vl_7b_model_path, torch_dtype=torch.bfloat16, device_map="auto", attn_implementation="flash_attention_2", 65 | trust_remote_code=True 66 | ) 67 | processor = AutoProcessor.from_pretrained(args.qwen25_vl_7b_model_path, min_pixels=min_pixels, max_pixels=max_pixels, trust_remote_code=True) 68 | model_set[model_name] = {'model': model, 'processor': processor} 69 | print(f'{model_name} loaded successfully.') 70 | except Exception as e: 71 | print(f"Error loading {model_name}: {e}") 72 | 73 | 74 | # Qwen/Qwen2-VL-7B-Instruct (Intern) 75 | if args.qwen2_vl_7b_model_path: 76 | model_name = 'qwen2_vl_7b' 77 | print(f'Initializing {model_name}...') 78 | try: 79 | model = Qwen2VLForConditionalGeneration.from_pretrained( 80 | args.qwen2_vl_7b_model_path, torch_dtype=torch.bfloat16, device_map="auto", attn_implementation="flash_attention_2", 81 | trust_remote_code=True 82 | ) 83 | processor = AutoProcessor.from_pretrained(args.qwen2_vl_7b_model_path, min_pixels=min_pixels, max_pixels=max_pixels, trust_remote_code=True) 84 | model_set[model_name] = {'model': model, 'processor': processor} 85 | print(f'{model_name} loaded successfully.') 86 | except Exception as e: 87 | print(f"Error loading {model_name}: {e}") 88 | 89 | 90 | 91 | # OpenGVLab/InternVL3-8B (Intern) - NEW 92 | if args.internvl3_8b_model_path: 93 | model_name = 'internvl3_8b' 94 | print(f'Initializing {model_name}...') 95 | try: 96 | device_map2 = split_model(args.internvl3_8b_model_path) 97 | model = AutoModel.from_pretrained( 98 | args.internvl3_8b_model_path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, 99 | use_flash_attn=True, trust_remote_code=True, device_map=device_map2).eval() 100 | processor = AutoTokenizer.from_pretrained(args.internvl3_8b_model_path, trust_remote_code=True) 101 | processor.pad_token_id = processor.eos_token_id 102 | 103 | model_set[model_name] = {'model': model, 'processor': processor} 104 | print(f'{model_name} loaded successfully.') 105 | except Exception as e: 106 | print(f"Error loading {model_name}: {e}") 107 | 108 | if not model_set: 109 | print("Warning: No models were loaded. 
Check provided model paths in arguments.") 110 | 111 | return model_set -------------------------------------------------------------------------------- /infer.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torchvision.transforms as T 3 | from PIL import Image 4 | from torchvision.transforms.functional import InterpolationMode 5 | from transformers import AutoModel, AutoTokenizer 6 | 7 | IMAGENET_MEAN = (0.485, 0.456, 0.406) 8 | IMAGENET_STD = (0.229, 0.224, 0.225) 9 | 10 | def build_transform(input_size): 11 | MEAN, STD = IMAGENET_MEAN, IMAGENET_STD 12 | transform = T.Compose([ 13 | T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img), 14 | T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC), 15 | T.ToTensor(), 16 | T.Normalize(mean=MEAN, std=STD) 17 | ]) 18 | return transform 19 | 20 | def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size): 21 | best_ratio_diff = float('inf') 22 | best_ratio = (1, 1) 23 | area = width * height 24 | for ratio in target_ratios: 25 | target_aspect_ratio = ratio[0] / ratio[1] 26 | ratio_diff = abs(aspect_ratio - target_aspect_ratio) 27 | if ratio_diff < best_ratio_diff: 28 | best_ratio_diff = ratio_diff 29 | best_ratio = ratio 30 | elif ratio_diff == best_ratio_diff: 31 | if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]: 32 | best_ratio = ratio 33 | return best_ratio 34 | 35 | def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False): 36 | orig_width, orig_height = image.size 37 | aspect_ratio = orig_width / orig_height 38 | 39 | # calculate the existing image aspect ratio 40 | target_ratios = set( 41 | (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if 42 | i * j <= max_num and i * j >= min_num) 43 | target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1]) 44 | 45 | # find the closest aspect ratio to the target 46 | target_aspect_ratio = find_closest_aspect_ratio( 47 | aspect_ratio, target_ratios, orig_width, orig_height, image_size) 48 | 49 | # calculate the target width and height 50 | target_width = image_size * target_aspect_ratio[0] 51 | target_height = image_size * target_aspect_ratio[1] 52 | blocks = target_aspect_ratio[0] * target_aspect_ratio[1] 53 | 54 | # resize the image 55 | resized_img = image.resize((target_width, target_height)) 56 | processed_images = [] 57 | for i in range(blocks): 58 | box = ( 59 | (i % (target_width // image_size)) * image_size, 60 | (i // (target_width // image_size)) * image_size, 61 | ((i % (target_width // image_size)) + 1) * image_size, 62 | ((i // (target_width // image_size)) + 1) * image_size 63 | ) 64 | # split the image 65 | split_img = resized_img.crop(box) 66 | processed_images.append(split_img) 67 | assert len(processed_images) == blocks 68 | if use_thumbnail and len(processed_images) != 1: 69 | thumbnail_img = image.resize((image_size, image_size)) 70 | processed_images.append(thumbnail_img) 71 | return processed_images 72 | 73 | def load_image(image_file, input_size=448, max_num=12): 74 | image = Image.open(image_file).convert('RGB') 75 | transform = build_transform(input_size=input_size) 76 | images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num) 77 | pixel_values = [transform(image) for image in images] 78 | pixel_values = torch.stack(pixel_values) 79 | return pixel_values 80 | 81 | 82 | path = 'manglu3935/Chiron-o1-2B' 83 | model 
= AutoModel.from_pretrained( 84 | path, 85 | torch_dtype=torch.bfloat16, 86 | load_in_8bit=False, 87 | low_cpu_mem_usage=True, 88 | use_flash_attn=True, 89 | trust_remote_code=True).eval().cuda() 90 | tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False) 91 | 92 | ########################################################## 93 | # single image 94 | # Fill in your image 95 | pixel_values = load_image('./eval/infer_img.png', max_num=12).to(torch.bfloat16).cuda() 96 | generation_config = dict(max_new_tokens=4096, do_sample=True) 97 | 98 | question = 'Based on the imaging findings of abnormal flattening of the left parieto-occipital bones in this 20-year-old male patient, what is the most likely diagnosis for this incidental finding? (Supplementary information is as follows:\nGender: Male, Age: 20 years, Chief complaint: Incidental findings)\nPlease base your response on the keyframes of the relevant medical imaging modality (Modality 1: <image>) and reason step-by-step to answer the above question.' 99 | response = model.chat(tokenizer, pixel_values, question, generation_config) 100 | print(f'User: {question}\nAssistant: {response}') 101 | 102 | ########################################################## 103 | # Multiple images 104 | 105 | # img_paths = ['xxx.png', 'xxx.png'] 106 | # question = 'xxxxxx' # Make sure the number of <image> placeholders in question is consistent with the number of pictures 107 | # all_pixel_values = [] 108 | # num_patches_list = [] 109 | 110 | # for img_path in img_paths: 111 | # pixel_values = load_image(img_path, max_num=12).to(torch.bfloat16).cuda() 112 | # all_pixel_values.append(pixel_values) 113 | # num_patches_list.append(pixel_values.size(0)) 114 | 115 | # if len(all_pixel_values) > 0: 116 | # pixel_values = torch.cat(all_pixel_values, dim=0) 117 | # else: 118 | # print("Error: No images provided") 119 | 120 | # generation_config = dict(max_new_tokens=2048, do_sample=True) 121 | # response = model.chat(tokenizer, pixel_values, question, generation_config, 122 | # num_patches_list=num_patches_list) 123 | # print(f'User: {question}\nAssistant: {response}') 124 | -------------------------------------------------------------------------------- /src/run.py: -------------------------------------------------------------------------------- 1 | import json 2 | from tqdm import tqdm 3 | import os 4 | import argparse 5 | from utils import read_jsonl, get_chunk 6 | from model import init_model 7 | from mics import MentorInternSearch 8 | import traceback 9 | 10 | def mics_start(args): 11 | data_path = args.data_path 12 | data = None 13 | if data_path.endswith('.jsonl'): 14 | try: 15 | data = read_jsonl(data_path) 16 | except Exception as e: 17 | print(f"Error loading jsonl data from {data_path}: {e}") 18 | return 19 | elif data_path.endswith('.json'): 20 | try: 21 | with open(data_path, 'r') as f: 22 | data = json.load(f) 23 | except Exception as e: 24 | print(f"Error loading json data from {data_path}: {e}") 25 | return 26 | else: 27 | print(f"Error: Unsupported data file format for {data_path}. 
Please use .jsonl or .json") 28 | return 29 | 30 | if data is None: 31 | print("Error: Data loading failed.") 32 | return 33 | 34 | output_path = args.output_path 35 | failed_search_path = args.output_path.replace('.jsonl', '_failed.jsonl') 36 | 37 | try: 38 | output_dir = os.path.dirname(output_path) 39 | if output_dir: 40 | os.makedirs(output_dir, exist_ok=True) 41 | except OSError as e: 42 | print(f"Error creating output directory {output_dir}: {e}") 43 | return 44 | 45 | try: 46 | with open(output_path, "w") as search_file, open(failed_search_path, "w") as failed_search_file: 47 | if args.num_chunks > 1: 48 | try: 49 | data = get_chunk(data, args.num_chunks, args.chunk_idx) 50 | except IndexError: 51 | print(f"Error: Chunk index {args.chunk_idx} is out of range for {args.num_chunks} chunks.") 52 | failed_search_file.write(json.dumps({"error": f"Invalid chunk index {args.chunk_idx} for {args.num_chunks} chunks."}) + "\n") 53 | return 54 | 55 | # Initialize models 56 | model_set = None 57 | try: 58 | model_set = init_model(args) 59 | except Exception as e: 60 | print(f"Error initializing models: {e}") 61 | failed_search_file.write(json.dumps({"error": "Model initialization failed", "details": str(e)}) + "\n") 62 | return 63 | 64 | # Instantiate the search class 65 | search_process = None 66 | search_process = MentorInternSearch(args) 67 | 68 | 69 | print(f"Processing {len(data)} data items...") 70 | for d in tqdm(data): 71 | data_id = d.get('rid', 'N/A') 72 | try: 73 | search_process.search(d, model_set, search_file, failed_search_file) 74 | except Exception as e: 75 | error_message = traceback.format_exc() 76 | print(f"\n!!! Unexpected error during search for data ID {data_id}: {error_message} !!!") 77 | error_log = { 78 | "rid": data_id, 79 | "error": "Unhandled exception during search", 80 | "details": str(e), 81 | } 82 | failed_search_file.write(json.dumps(error_log) + "\n") 83 | failed_search_file.flush() 84 | 85 | except IOError as e: 86 | print(f"Error opening or writing to output files ({output_path}, {failed_search_path}): {e}") 87 | except Exception as e: 88 | print(f"An unexpected error occurred outside the main loop: {e}") 89 | 90 | if __name__ == "__main__": 91 | parser = argparse.ArgumentParser(description="Run Mentor-Intern Reasoning Search") 92 | parser.add_argument("--data_path", type=str, default='./rp_stage3_data/d2d_qa.jsonl', required=True, help="Path to input data (.jsonl or .json)") 93 | parser.add_argument("--image_dir_path", type=str, default='./rp_stage3_data/images', required=True, help="Path to the directory containing images") 94 | parser.add_argument("--output_path", type=str, default='./test/tsis_results.jsonl', required=True, help="Path to save successful results (.jsonl)") 95 | parser.add_argument("--num_chunks", type=int, default=1, help="Number of chunks to split data into") 96 | parser.add_argument("--chunk_idx", type=int, default=0, help="Index of the chunk to process (0-based)") 97 | parser.add_argument("--qwen25_vl_7b_model_path", type=str, default='Qwen/Qwen2.5-VL-7B-Instruct', required=True, help="Path to Qwen25-VL 7B model") 98 | parser.add_argument("--qwen2_vl_7b_model_path", type=str, default='Qwen/Qwen2-VL-7B-Instruct', required=True, help="Path to Qwen2-VL 7B model") 99 | parser.add_argument("--internvl3_8b_model_path", type=str, default='OpenGVLab/InternVL3-8B', required=True, help="Path to InternVL3 8B model") 100 | parser.add_argument("--openai_api_key", type=str, default="sk-xxx" ,required=True) 101 | 
parser.add_argument("--openai_base_url", type=str, default='' ,required=True) 102 | parser.add_argument("--qwen_api_key", type=str, default="sk-xxx" ,required=True) 103 | parser.add_argument("--qwen_base_url", type=str, default='' ,required=True) 104 | parser.add_argument("--gemini_api_key", type=str, default="sk-xxx" ,required=True) 105 | parser.add_argument("--gemini_base_url", type=str, default='' ,required=True) 106 | parser.add_argument("--ds_api_key", type=str, default="sk-xxx" ,required=True) 107 | parser.add_argument("--ds_base_url", type=str, default='' ,required=True) 108 | parser.add_argument("--mentor_models", nargs='+', type=str, default=['chatgpt-4o-latest', 'google/gemini-2.5-pro-preview-03-25', 'qwen2.5-vl-72b-instruct'], required=True, help="List of mentor model names (must match keys in model_dict or be 'gpt-...' style)") 109 | parser.add_argument("--intern_models", nargs='+', type=str, default=['qwen25_vl_7b', 'qwen2_vl_7b', 'internvl3_8b'], required=True, help="List of intern model names (must match keys in model_dict)") 110 | parser.add_argument("--evaluator_model", type=str, default='deepseek-chat', help="Evaluator model name (must be GPT for default JUDGE_PROMPT)") 111 | parser.add_argument("--max_depth", type=int, default=3, help="Maximum number of reasoning steps to generate.") 112 | parser.add_argument("--temperature1", type=float, default=1.2, help="First temperature for intern model evaluation and mentor generation.") 113 | parser.add_argument("--temperature2", type=float, default=0.2, help="Second temperature for intern model evaluation.") 114 | 115 | args = parser.parse_args() 116 | 117 | errors = [] 118 | model_name_to_arg_map = { 119 | 'qwen25_vl_7b': 'qwen25_vl_7b_model_path', 120 | 'internvl3_8b': 'internvl3_8b_model_path', 121 | 'qwen2_vl_7b': 'qwen2_vl_7b_model_path' 122 | } 123 | 124 | # Validate Mentors 125 | if len(args.mentor_models) < 2: 126 | errors.append("At least two mentor models must be specified via --mentor_models.") 127 | 128 | # Validate Interns 129 | if len(args.intern_models) < 1: 130 | errors.append("At least one intern model must be specified via --intern_models.") 131 | else: 132 | for model_name in args.intern_models: 133 | arg_name = model_name_to_arg_map.get(model_name) 134 | if not arg_name: 135 | errors.append(f"Intern model '{model_name}' is not recognized or mapped to a path argument.") 136 | elif getattr(args, arg_name, None) is None: 137 | errors.append(f"Path for intern model '{model_name}' (expected argument --{arg_name}) was not provided.") 138 | 139 | # Report errors if any 140 | if errors: 141 | print("\nArgument Validation Errors:") 142 | for error in errors: 143 | print(f"- {error}") 144 | parser.print_help() 145 | exit(1) 146 | 147 | mics_start(args) 148 | 149 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
1 | <div align="center">
2 | 
3 | # Chiron-o1: Igniting Multimodal Large Language Models towards Generalizable Medical Reasoning via Mentor-Intern Collaborative Search
4 | 
5 | If you find this project useful, please give us a star 🌟.
6 | 
7 | 
8 | 
9 | 📃 Paper | [🤗 Chiron-o1-2B](https://huggingface.co/manglu3935/Chiron-o1-2B) | [🤗 Chiron-o1-8B](https://huggingface.co/manglu3935/Chiron-o1-8B) | 📧 Email
10 | </div>
11 | 12 | 13 | ## ⚡Introduction 14 | 15 | <img src="assets/intro.png" alt="MICS overview"/> 16 | 17 | We propose **Mentor-Intern Collaborative Search (MICS)**, a new multi-model collaborative search strategy designed to generate effective step-by-step CoT data. The core idea of MICS is to leverage multiple knowledgeable mentor models to collaboratively search for reasoning paths, while evaluating the searched paths based on feedback from intern models (see the conceptual sketch at the end of this README). 18 | 19 | ## 📚 Data Construction with MICS 20 | 21 | 22 | We provide the code for generating step-by-step CoT reasoning based on MICS. You can try out MICS reasoning with the data in `./src/demo_data`. 23 | 24 | Our MICS code now supports reasoning path search using [GPT-4o](https://openai.com/index/hello-gpt-4o/), [Gemini 2.5 Pro Preview](https://deepmind.google/models/gemini/pro/), and [Qwen2.5-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct) as mentor models, and [Qwen2.5-VL-7B](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), [Qwen2-VL-7B](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), and [InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B) as intern models. (🌟 You can flexibly adjust this as needed. For details, see [this](https://github.com/manglu097/Chiron-o1/blob/main/src/model.py).) 25 | 26 | **Due to [Radiopaedia](https://radiopaedia.org/)'s licensing, we do not directly provide the original dataset; access to the full dataset is restricted for data-licensing reasons. For access, please contact Radiopaedia to obtain a non-commercial license, and then reach out to the article authors.** 27 | ```bash 28 | # At least one GPU with more than 50G of memory is required; otherwise, you will need multiple GPUs, e.g. CUDA_VISIBLE_DEVICES='0,1'. 29 | # Intern models: --qwen25_vl_7b_model_path / --qwen2_vl_7b_model_path / --internvl3_8b_model_path 30 | # Mentor models: the --openai_* / --qwen_* / --gemini_* keys and URLs; judge model (default): --ds_* 31 | # At least two mentor models and at least one intern model are required. 32 | # For models that are not needed, you can leave the corresponding parameter values blank (like ''). 33 | 34 | python ./src/run.py \ 35 | --data_path './src/demo_data/demo.jsonl' \ 36 | --image_dir_path './src/demo_data/images' \ 37 | --output_path './src/demo_data/test/result.jsonl' \ 38 | --qwen25_vl_7b_model_path 'Qwen/Qwen2.5-VL-7B-Instruct' \ 39 | --qwen2_vl_7b_model_path 'Qwen/Qwen2-VL-7B-Instruct' \ 40 | --internvl3_8b_model_path 'OpenGVLab/InternVL3-8B' \ 41 | --openai_api_key 'sk-xxx' \ 42 | --openai_base_url 'https://xxx' \ 43 | --qwen_api_key 'sk-xxx' \ 44 | --qwen_base_url 'https://xxx' \ 45 | --gemini_api_key 'sk-xxx' \ 46 | --gemini_base_url 'https://xxx' \ 47 | --ds_api_key 'sk-xxx' \ 48 | --ds_base_url 'https://api.deepseek.com' \ 49 | --mentor_models chatgpt-4o-latest google/gemini-2.5-pro-preview-03-25 qwen2.5-vl-72b-instruct \ 50 | --intern_models qwen25_vl_7b qwen2_vl_7b internvl3_8b 51 | 52 | ``` 53 | 54 | ## 🏋️ Training 55 | We use [InternVL](https://github.com/OpenGVLab/InternVL) to fine-tune the Chiron-o1 models. We provide the training instructions and configs [here](https://github.com/manglu097/Chiron-o1/tree/main/train). 56 | 57 | Specifically, for guidance on how to prepare the base model and custom data, please refer to the [official documentation](https://internvl.readthedocs.io/en/latest/internvl3.0/finetune.html). 
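For reference, the training scripts point `--meta_path` at a meta file (e.g. `./shell/data/chiron-o1-8b.json`) that tells InternVL where the fine-tuning annotations and images live. Below is a minimal sketch of such a file, assuming the standard InternVL meta format; the dataset name, paths, and `length` are placeholders for your own prepared data:

```json
{
  "chiron_o1_cot_sft": {
    "root": "playground/data/images/",
    "annotation": "playground/data/chiron_o1_train.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 10000
  }
}
```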
58 | 
59 | ```bash
60 | git clone https://github.com/OpenGVLab/InternVL.git
61 | cd InternVL/internvl_chat
62 | # The custom data has been prepared
63 | # Chiron-o1-2B (fine-tuning the LoRA requires 2x 32G/40G GPUs)
64 | GPUS=8 PER_DEVICE_BATCH_SIZE=4 sh shell/internvl2.5/2nd_finetune/internvl2_5_2b_dynamic_res_2nd_finetune_lora.sh
65 | # Chiron-o1-8B (fine-tuning the LoRA requires 2 A100 80G GPUs)
66 | GPUS=8 PER_DEVICE_BATCH_SIZE=2 sh shell/internvl2.5/2nd_finetune/internvl2_5_8b_dynamic_res_2nd_finetune_lora.sh
67 | ```
68 | 
69 | ## 🤔 Model
70 | | Model | Base Model | Link | GPU Memory |
71 | | ----------------- | ------------------------------------------------------------ | ---------------------------------------------------------- | ------------ |
72 | | Chiron-o1-2B | [InternVL3-2B](https://huggingface.co/OpenGVLab/InternVL3-2B) | [Link](https://huggingface.co/manglu3935/Chiron-o1-2B) | 8GB |
73 | | Chiron-o1-8B | [InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B) | [Link](https://huggingface.co/manglu3935/Chiron-o1-8B) | 19GB |
74 | 
75 | 
76 | We provide an example of pure-text reasoning using [transformers](https://huggingface.co/docs/transformers/index). For multimodal tasks, you can refer to the information [here](https://github.com/manglu097/Chiron-o1/blob/main/infer.py).
77 | 
78 | ```python
79 | from transformers import AutoModel, AutoTokenizer
80 | import torch
81 | 
82 | path = 'manglu3935/Chiron-o1-8B'
83 | model = AutoModel.from_pretrained(
84 |     path,
85 |     torch_dtype=torch.bfloat16,
86 |     load_in_8bit=False,
87 |     low_cpu_mem_usage=True,
88 |     use_flash_attn=True,
89 |     trust_remote_code=True,
90 |     device_map="auto").eval()
91 | tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
92 | 
93 | # pure text inference
94 | question = "Which of the following imaging findings is most consistent with a pure arterial malformation (PAM)?\nA) A vascular network connecting arteries and veins with early venous drainage \nB) A dilated, tortuous arterial loop without venous communication \nC) A focal saccular outpouching of a cerebral artery with surrounding edema \nD) A venous varix with adjacent arterial feeders\nLet's reason step-by-step to answer the above question."
95 | generation_config = dict(max_new_tokens=1024, do_sample=True)
96 | response = model.chat(tokenizer, None, question, generation_config)
97 | print(f'User: {question}\nAssistant: {response}')
98 | ```
99 | 
100 | 
101 | ## 🧑‍⚖️ Evaluation
102 | 
103 | You can use the script we provide in `./eval/eval.py` to evaluate medical VQA benchmarks (it may require slight modifications). Below are the specific instructions.
104 | 
105 | ```bash
106 | python ./eval/eval.py \
107 |     --vqa_json_path './eval/eval_data.json' \
108 |     --image_dir './eval' \
109 |     --model_path 'manglu3935/Chiron-o1-8B' \
110 |     --output_path './eval/output.json' \
111 |     --api_key 'sk-xxx'  # DeepSeek is used to evaluate the correctness of the results
112 | ```
113 | 
114 | ## 🏁 Results
115 | 
116 | We perform a series of comprehensive experiments using several high-performance baseline models, including [HuatuoGPT-Vision](https://github.com/FreedomIntelligence/HuatuoGPT-Vision), [Med-R1](https://github.com/Yuxiang-Lai117/Med-R1), and [MedVLM-R1](https://huggingface.co/JZPeterPan/MedVLM-R1). The results demonstrate the performance of the Chiron-o1 models compared to other cutting-edge models across various widely used benchmarks, as illustrated in the table below.
117 | 
118 | 
119 | image 120 |
121 | 122 | Compared to other multimodal medical reasoning models, Chiron-o1 demonstrates the ability to generate deep and reasonable reasoning paths, leading to correct answers. 123 | 124 |
125 | image 126 |
127 | 128 | 129 |
130 | image 131 |
132 | 133 | ## 🙏🏼 Acknowledgement 134 | 135 | We gratefully acknowledge the inspiring work of [Transformers](https://github.com/huggingface/transformers), [InternVL](https://github.com/OpenGVLab/InternVL) and [Mulberry](https://github.com/HJYao00/Mulberry), which have provided essential foundations and inspiration for this project. We also thank the developers of these outstanding tools for their contributions to open-source innovation. 136 | 137 | ## 📖 Citation 138 | 139 | ``` 140 | @article{sun2025enhancing, 141 | title={Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs}, 142 | author={Sun, Haoran and Jiang, Yankai and Lou, Wenjie and Zhang, Yujie and Li, Wenjie and Wang, Lilong and Liu, Mianxin and Liu, Lei and Wang, Xiaosong}, 143 | journal={arXiv preprint arXiv:2506.16962}, 144 | year={2025} 145 | } 146 | ``` 147 | -------------------------------------------------------------------------------- /eval/eval.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torchvision.transforms as T 3 | from PIL import Image 4 | from torchvision.transforms.functional import InterpolationMode 5 | from transformers import AutoModel, AutoTokenizer 6 | import os 7 | import json 8 | from bert_score import score 9 | from openai import OpenAI 10 | from tqdm import tqdm 11 | import argparse 12 | 13 | IMAGENET_MEAN = (0.485, 0.456, 0.406) 14 | IMAGENET_STD = (0.229, 0.224, 0.225) 15 | 16 | def build_transform(input_size): 17 | MEAN, STD = IMAGENET_MEAN, IMAGENET_STD 18 | transform = T.Compose([ 19 | T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img), 20 | T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC), 21 | T.ToTensor(), 22 | T.Normalize(mean=MEAN, std=STD) 23 | ]) 24 | return transform 25 | 26 | def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size): 27 | best_ratio_diff = float('inf') 28 | best_ratio = (1, 1) 29 | area = width * height 30 | for ratio in target_ratios: 31 | target_aspect_ratio = ratio[0] / ratio[1] 32 | ratio_diff = abs(aspect_ratio - target_aspect_ratio) 33 | if ratio_diff < best_ratio_diff: 34 | best_ratio_diff = ratio_diff 35 | best_ratio = ratio 36 | elif ratio_diff == best_ratio_diff: 37 | if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]: 38 | best_ratio = ratio 39 | return best_ratio 40 | 41 | def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False): 42 | orig_width, orig_height = image.size 43 | aspect_ratio = orig_width / orig_height 44 | target_ratios = set( 45 | (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if 46 | i * j <= max_num and i * j >= min_num) 47 | target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1]) 48 | target_aspect_ratio = find_closest_aspect_ratio( 49 | aspect_ratio, target_ratios, orig_width, orig_height, image_size) 50 | target_width = image_size * target_aspect_ratio[0] 51 | target_height = image_size * target_aspect_ratio[1] 52 | blocks = target_aspect_ratio[0] * target_aspect_ratio[1] 53 | resized_img = image.resize((target_width, target_height)) 54 | processed_images = [] 55 | for i in range(blocks): 56 | box = ( 57 | (i % (target_width // image_size)) * image_size, 58 | (i // (target_width // image_size)) * image_size, 59 | ((i % (target_width // image_size)) + 1) * image_size, 60 | ((i // (target_width // image_size)) + 1) * image_size 61 | ) 62 | split_img = resized_img.crop(box) 63 | 
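# Each box above selects one image_size x image_size tile from the resized image, row by row.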
processed_images.append(split_img) 64 | assert len(processed_images) == blocks 65 | if use_thumbnail and len(processed_images) != 1: 66 | thumbnail_img = image.resize((image_size, image_size)) 67 | processed_images.append(thumbnail_img) 68 | return processed_images 69 | 70 | def load_image(image_file, input_size=448, max_num=12): 71 | image = Image.open(image_file).convert('RGB') 72 | transform = build_transform(input_size=input_size) 73 | images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num) 74 | pixel_values = [transform(image) for image in images] 75 | pixel_values = torch.stack(pixel_values) 76 | return pixel_values 77 | 78 | def ds_forward(client, pred_answer, ground_truth, temperature=0.9): 79 | message = [ 80 | {"role": "user", "content": f"Evaluate whether the model's answer is semantically similar to the correct answer. Output 'Yes' if the model's answer conveys a similar meaning to the correct result, even if the wording differs, and 'No' if it does not. Provide only 'Yes' or 'No' as the output, without any explanation.\nModel's answer: {pred_answer}\nCorrect answer: {ground_truth}"} 81 | ] 82 | completion = client.chat.completions.create( 83 | model='deepseek-chat', 84 | messages=message, 85 | temperature=temperature 86 | ) 87 | return completion.choices[0].message.content.strip() 88 | 89 | def get_correctness(judge_output): 90 | if 'yes' in judge_output.lower(): 91 | return 1 92 | else: 93 | return -1 94 | 95 | def evaluate_model(vqa_json_path, image_dir, model_path, output_path, api_key): 96 | client = OpenAI(base_url='https://api.deepseek.com', api_key=api_key) 97 | model = AutoModel.from_pretrained( 98 | model_path, 99 | torch_dtype=torch.bfloat16, 100 | low_cpu_mem_usage=True, 101 | use_flash_attn=True, 102 | trust_remote_code=True).eval().cuda() 103 | tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False) 104 | generation_config = dict(max_new_tokens=1024, do_sample=False) 105 | 106 | with open(vqa_json_path, "r", encoding="utf-8") as f: 107 | data = json.load(f) 108 | # Initialize metrics 109 | closed_correct = 0 110 | closed_total = 0 111 | open_bertscore_sum = 0.0 112 | open_count = 0 113 | results_output = [] 114 | 115 | # Create or clear the output file 116 | if not os.path.exists(output_path): 117 | with open(output_path, "w", encoding="utf-8") as f: 118 | f.write("[]") 119 | for idx, entry in enumerate(tqdm(data, desc="Processing Entries", unit="entry")): 120 | id = entry["id"] 121 | image_names = entry["img_name"] 122 | question = entry["question"] 123 | ground_truth = entry["answer"] 124 | answer_type = entry["answer_type"] 125 | if isinstance(image_names, str): 126 | image_names = [image_names] 127 | pixel_values_list = [] 128 | num_patches_list = [] 129 | 130 | for image_name in image_names: 131 | image_path = os.path.join(image_dir, image_name) 132 | pixel_values_single = load_image(image_path, max_num=12).to(torch.bfloat16).cuda() 133 | pixel_values_list.append(pixel_values_single) 134 | num_patches_list.append(pixel_values_single.size(0)) 135 | pixel_values = torch.cat(pixel_values_list, dim=0) 136 | # Construct prompt 137 | prompt = f"{question}" 138 | # Call the model for inference 139 | response = model.chat( 140 | tokenizer, 141 | pixel_values, 142 | prompt, 143 | generation_config, 144 | num_patches_list=num_patches_list 145 | ) 146 | # Get predicted answer 147 | pred_answer = response.strip() 148 | if "### The final answer is:" not in pred_answer: 149 | continue 150 | 
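# Split the response at the marker: the text before "### The final answer is:" is the
# reasoning trace, and the text after it is the model's final answer.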
reason_answer = pred_answer.split("### The final answer is:")[0].strip() 151 | pred_answer = pred_answer.split("### The final answer is:")[1].strip() 152 | # Print question and model's output 153 | print(f"Question: {question}") 154 | print(f"Model Answer: {pred_answer}") 155 | print(f"Ground Truth: {ground_truth}") 156 | print("========================================") 157 | # Calculate accuracy 158 | closed_total += 1 159 | judge_output = ds_forward(client, pred_answer, ground_truth, temperature=0.9) 160 | if judge_output: 161 | is_correct = get_correctness(judge_output) 162 | if is_correct == 1: 163 | closed_correct += 1 164 | else: 165 | print(f"Failed to get judgment from evaluator") 166 | # Calculate BERTScore 167 | P, R, F1 = score( 168 | cands=[pred_answer], 169 | refs=[ground_truth], 170 | model_type="roberta-large", 171 | num_layers=17, 172 | idf=False, 173 | batch_size=4, 174 | verbose=True, 175 | lang="en" 176 | ) 177 | f1_score = F1.item() 178 | open_bertscore_sum += f1_score 179 | open_count += 1 180 | # Construct a single result entry 181 | result_entry = { 182 | "index": id, 183 | "image_name": image_names, 184 | "question": question, 185 | "reason_answer": reason_answer, 186 | "model_answer": pred_answer, 187 | "ground_truth": ground_truth, 188 | "answer_type": answer_type, 189 | "result": [closed_correct, closed_total, open_bertscore_sum, open_count] 190 | } 191 | # Append the result to the file 192 | with open(output_path, "r+", encoding="utf-8") as f: 193 | results = json.load(f) 194 | results.append(result_entry) 195 | f.seek(0) 196 | json.dump(results, f, ensure_ascii=False, indent=4) 197 | # Calculate final metrics 198 | closed_acc = closed_correct / closed_total if closed_total > 0 else 0 199 | open_bertscore_avg = open_bertscore_sum / open_count if open_count > 0 else 0 200 | print("====== Evaluation Results ======") 201 | print(f"CLOSED Questions Acc: {closed_acc:.4f}") 202 | print(f"OPEN Questions BERTScore-F1: {open_bertscore_avg:.4f}") 203 | # Append final metrics to the file 204 | final_metrics = { 205 | "CLOSED Questions Acc": closed_acc, 206 | "OPEN Questions BERTScore-F1": open_bertscore_avg 207 | } 208 | with open(output_path, "r+", encoding="utf-8") as f: 209 | results = json.load(f) 210 | results.append(final_metrics) 211 | f.seek(0) 212 | json.dump(results, f, ensure_ascii=False, indent=4) 213 | 214 | def main(): 215 | """ 216 | Main function to evaluate multiple models. 
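    Each entry of the VQA JSON is expected to provide the keys consumed by
    evaluate_model: 'id', 'img_name' (a path or list of paths relative to
    --image_dir), 'question', 'answer', and 'answer_type' (e.g. 'OPEN' or 'CLOSED').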
217 | """ 218 | parser = argparse.ArgumentParser(description="Evaluate models on VQA tasks.") 219 | parser.add_argument("--vqa_json_path", type=str, required=True, help="Path to the VQA JSON file.") 220 | parser.add_argument("--image_dir", type=str, required=True, help="Directory containing the images.") 221 | parser.add_argument("--model_path", type=str, required=True, help="Path to the model to evaluate.") 222 | parser.add_argument("--output_path", type=str, required=True, help="Path to save the evaluation results.") 223 | parser.add_argument("--api_key", type=str, required=True, help="API key for the OpenAI client.") 224 | args = parser.parse_args() 225 | 226 | vqa_json_path = args.vqa_json_path 227 | image_dir = args.image_dir 228 | model_path = args.model_path 229 | output_path = args.output_path 230 | api_key = args.api_key 231 | 232 | print(f"Evaluating model: {model_path}") 233 | evaluate_model(vqa_json_path, image_dir, model_path, output_path, api_key) 234 | print(f"Results saved to: {output_path}") 235 | 236 | if __name__ == "__main__": 237 | main() -------------------------------------------------------------------------------- /src/demo_data/demo.jsonl: -------------------------------------------------------------------------------- 1 | {"rid": "169992", "messages": [{"role": "user", "content": "**Question:** Given this 45-year-old male patient presenting with interscapular chest pain and an incidental finding of an anomalous left coronary artery coursing between the right ventricular outflow tract and ascending aorta on non-gated CT, what is the most likely explanation for his symptoms and this anatomical variant?"}, {"role": "assistant", "content": "**Correct Answer:** Anomalous interarterial course of the left coronary artery (ALCA) with potential myocardial ischemia."}], "images": ["169992/ct_group1/0.png"], "title": "Anomalous interarterial course of the left coronary artery", "certainty": "Diagnosis almost certain", "presentation": "Interscapular chest pain. Aortic dissection?", "age_label": "45 years", "gender_label": "Male", "caption": ["According to the \n, no aortic dissection.\nIncidental anomalous course of the left coronary artery, which passes between the right ventricular outflow tract and the ascending aorta."]} 2 | {"rid": "159949", "messages": [{"role": "user", "content": "**Question:** Given this 40-year-old female patient presenting with chest pain and palpitations, along with imaging findings of an anomalous left main coronary artery arising from the right aortic sinus with an interarterial course, what is the most concerning potential complication we should be vigilant for?"}, {"role": "assistant", "content": "**Correct Answer:** Sudden cardiac death (due to malignant anomalous left coronary artery course between the aorta and pulmonary artery)."}], "images": ["159949/ct_group1/0.jpeg", "159949/ct_group1/1.jpeg"], "title": "\"Malignant\" anomalous interarterial course of the left coronary artery", "certainty": "Diagnosis certain", "presentation": "Chest pain and palpitations.", "age_label": "40 years", "gender_label": "Female", "caption": ["According to the \n\n, anomalous left main coronary artery origin arising from the right aortic sinus with an interarterial course between the aorta and the pulmonary artery. The left main trifurcates into the left anterior descending, a ramus intermedius branch and the circumflex coronary artery which are normal in course and caliber.  Normal right coronary artery origin, course and caliber. 
Right dominant system."]} 3 | {"rid": "189353", "messages": [{"role": "user", "content": "**Question:** Based on the chest x-ray findings in this 70-year-old female with a pre-operative evaluation, what is the most likely explanation for the pacemaker lead abnormality observed at the left subclavian vein entry site?"}, {"role": "assistant", "content": "**Correct Answer:** Pacemaker lead fracture."}], "images": ["189353/x-ray_group1/0.jpeg"], "title": "Pacemaker lead fracture", "certainty": "Diagnosis almost certain", "presentation": "Pre-operative chest x-ray.", "age_label": "70 years", "gender_label": "Female", "caption": ["According to the \n, right ventricular pacemaker with lead fracture at the expected site of the left subclavian vein entry site. Enlarged left and right atrium, status post mitral valve replacement. No edema or pneumonia."]} 4 | {"rid": "190893", "messages": [{"role": "user", "content": "**Question:** Based on the imaging findings and clinical presentation of this 60-year-old male with chest pain, what is the most likely cause of his symptoms given the moderate stenosis observed in the distal LAD segment?"}, {"role": "assistant", "content": "**Correct Answer:** Myocardial bridging of the LAD."}], "images": ["190893/ct_group1/0.jpeg"], "title": "Myocardial bridging of the LAD", "certainty": "Diagnosis certain", "presentation": "Chest pain.", "age_label": "60 years", "gender_label": "Male", "caption": ["CORONARY ANATOMY:\n The coronary arteries arise in normal position. There is co-coronary artery dominance.\n Left main:  There is mild calcified plaque with minimal (<25%) stenosis. Left main trifurcates into LAD, Ramus and LCx.\n Left anterior descending: There is moderate eccentric calcified plaque in mid LAD immediately distal to the origin of first diagonal with minimal (<25% stenosis). According to the \n, there is bridging of distal LAD immediately distal to the origin of second diagonal (~ 3 cm long and ~ 0.3 cm deep) with a moderate stenosis (50-69%). Both first and second diagonal branches are small caliber vessels.\n Left circumflex: There is scattered calcified plaque in LCx proper and obtuse marginal branches with minimal (less than 25%) stenosis. First obtuse marginal branch has a high origin and is intermediate in caliber. Second obtuse marginal is a large-caliber vessel. LCx gives off the posterior lateral branch and a small left-PDA making codominant system.\n Ramus intermedius branch:  Intermediate caliber vessel with scattered calcified plaque with minimal (< 25%) stenosis.\n Right coronary artery:  There is a focal mixed plaque in proximal to mid RCA with minimal (< 25%) stenosis. Gives off a large patent PDA.\n \nCARDIAC MORPHOLOGY:\n LV: Concentric left ventricular hypertrophy\nRV: Normal.\nLA: Four pulmonary veins are seen draining into the left atrium. There is no left atrial appendage thrombus.\nRA: Normal.\nValvular calcification: There is mild mitral annular calcification. 
There is mild aortic valvular calcification.\nPericardium: Pericardial contour is preserved without effusion, thickening or calcifications.\n\n "]} 5 | {"rid": "158377", "messages": [{"role": "user", "content": "**Question:** Based on the chest x-ray findings in this 75-year-old female with a history of mitral valve replacement for mitral regurgitation, what is the most likely explanation for the fourth \"bump\" observed along the left cardiomediastinal border?"}, {"role": "assistant", "content": "**Correct Answer:** Enlargement of the left atrial appendage due to mitral valve disease."}], "images": ["158377/x-ray_group1/0.jpeg"], "title": "Mitral valve disease findings on plain chest x-ray", "certainty": "Diagnosis certain", "presentation": "Status post mitral valve replacement for known mitral regurgitation.", "age_label": "75 years", "gender_label": "Female", "caption": ["According to the \n, status post mitral valve replacement with enlarged left atrial appendage and splaying of the carina due to enlargement of the left atrium. Lungs clear, no pleural effusion."]} 6 | {"rid": "178062", "messages": [{"role": "user", "content": "**Question:** Given this 35-year-old female patient's persistent headache 10 days post-MVA with a normal CT head showing no intracranial bleed but mild sulcal widening and empty sella, what would you consider as the most likely explanation for her symptoms?"}, {"role": "assistant", "content": "**Correct Answer:** Post-traumatic headache"}], "images": ["178062/ct_group1/0.jpeg"], "title": "Normal CT head", "certainty": "Diagnosis certain", "presentation": "Headache 10 days after a MVA - bleed?", "age_label": "35 years", "gender_label": "Female", "caption": ["According to the \n, no intracranial bleed.
Frontal and parietal sulci mildly widened for patient's age.
Left ventricle slightly wider than right ventricle - physiological asymmetry.
The pituitary gland is flattened against the sellar floor - empty sella.
The cerebellar tonsils crowd the foramen magnum but do not extend below it.
Technically \"straightened\" cervical lordosis due to positioning."]} 7 | {"rid": "160688", "messages": [{"role": "user", "content": "**Question:** Based on the imaging findings of a hyperdense mass with irregular, spiculated margins and internal calcifications in the right breast, along with satellite lesions, what is the most likely diagnosis?"}, {"role": "assistant", "content": "**Correct Answer:** Invasive ductal carcinoma (confirmed by biopsy)."}], "images": ["160688/mammography_group1/0.png"], "title": "Breast mass (BI-RADS 5)", "certainty": "Diagnosis almost certain", "presentation": "The right breast palpable mass on physical exam.", "age_label": "55 years", "gender_label": "Female", "caption": ["According to the \n, a hyperdense mass with irregular and spiculated margins and internal foci of calcification is noted in the lower outer quadrant of the right breast, which is highly suspicious. (BI-RADS 5)\nIn addition, a few focal areas with fine pleomorphic calcification foci are scattered in the right breast, consistent with satellite lesions."]} 8 | {"rid": "177194", "messages": [{"role": "user", "content": "**Question:** Based on the CT cisternography findings showing contrast leak through the left cribriform plate to the left-sided anterior ethmoidal air cells in this 50-year-old female with a history of head trauma and suspected CSF rhinorrhoea, what is the most likely diagnosis?"}, {"role": "assistant", "content": "**Correct Answer:** Cerebrospinal fluid (CSF) leak through the left cribriform plate."}], "images": ["177194/ct_group1/0.png", "177194/ct_group1/1.png"], "title": "CSF rhinorrhoea - CT cisternography", "certainty": "Diagnosis certain", "presentation": "History of prior head trauma, now abnormal watery fluid leaking from her nose. Her physician suspects CSF rhinorrhoea.", "age_label": "50 years", "gender_label": "Female", "caption": ["According to the \n\n, cT cisternography in coronal and axial planes shows a contrast leak through the left cribriform plate to the left-sided anterior ethmoidal air cells. ", "Contrast leak and accumulation within the left-sided anterior ethmoid air cells."]} 9 | {"rid": "189817", "messages": [{"role": "user", "content": "**Question:** Based on the antenatal ultrasound findings of marked tricuspid regurgitation with a regurgitant jet velocity exceeding 200 cm/sec, non-visualization of bilateral kidneys and renal arteries, and the absence of the ductus venosus, what is the most likely underlying condition contributing to the oligohydramnios in this 25-year-old pregnant patient?"}, {"role": "assistant", "content": "**Correct Answer:** Bilateral renal agenesis with associated tricuspid regurgitation (Potter sequence)."}], "images": ["189817/ultrasound_group1/0.png", "189817/ultrasound_group1/1.jpeg"], "title": "Tricuspid regurgitation with bilateral renal agenesis (antenatal ultrasound)", "certainty": "Diagnosis almost certain", "presentation": "The patient presents with decreased liquor.", "age_label": "25 years", "gender_label": "Female", "caption": ["According to the \n, marked regurgitation is seen along the tricuspid valve, with increased right ventricular wall thickness. On pulse wave Doppler, the regurgitant jet velocity exceeds 200 cm/sec. According to the \n, additionally, there is non-visualization of the bilateral kidneys and bilateral renal arteries, with the 'lying down' adrenal sign on both sides. 
The ductus venosus is also absent"]} 10 | {"rid": "181112", "messages": [{"role": "user", "content": "**Question:** Given this 80-year-old male patient's incidental finding of a right atrial intracardiac mass with fat density on CT, along with concurrent acute PE, what is the most likely diagnosis for the cardiac mass?"}, {"role": "assistant", "content": "**Correct Answer:** Cardiac lipoma (a benign primary cardiac neoplasm)"}], "images": ["181112/ct_group1/0.jpeg", "181112/ct_group1/1.jpeg"], "title": "Cardiac lipoma", "certainty": "Diagnosis almost certain", "presentation": "Incidental finding in patient presenting with acute PE.", "age_label": "80 years", "gender_label": "Male", "caption": ["According to the \n\n, there is a right atrial intracardiac mass with the same density as subcutaneous fat layer, measuring 5.4 x 2.9 cm, suggesting a lipoma.\nEvidence of hypoattenuating filling defect noted in the anterior segment of left lower pulmonary arteries, with contrast passage around it, suggesting acute PE.\nMild septal thickening and atelectasis bands noted bilaterally. Few bilateral small pneumatocoele are noted, the largest in the superior segment of right lower lobe measures about 7 x 7 mm. There is bleb noted in the right upper lobe measures about 1.9 x 0.7 cm.\nThe thoracic aorta is also unremarkable."]} -------------------------------------------------------------------------------- /src/utils.py: -------------------------------------------------------------------------------- 1 | import json 2 | import base64 3 | import os 4 | import math 5 | from PIL import Image 6 | import io 7 | import math 8 | import random 9 | import re 10 | 11 | def resize_image_if_needed(image_path, max_dimension=512): 12 | try: 13 | with Image.open(image_path) as image: 14 | width, height = image.size 15 | 16 | if max(width, height) > max_dimension: 17 | scale_factor = max_dimension / max(width, height) 18 | new_width = int(round(width * scale_factor)) 19 | new_height = int(round(height * scale_factor)) 20 | image = image.resize((new_width, new_height), Image.Resampling.LANCZOS) 21 | 22 | return image.copy() 23 | except Exception as e: 24 | print(f"Error processing image: {e}") 25 | raise 26 | 27 | def encode_to_base64(buffered): 28 | return base64.b64encode(buffered.getvalue()).decode('utf-8') 29 | 30 | def encode_image(image_path): 31 | try: 32 | with Image.open(image_path) as image: 33 | resized_image = resize_image_if_needed(image_path) 34 | original_format = image.format or 'PNG' 35 | 36 | buffered = io.BytesIO() 37 | resized_image.save(buffered, format=original_format) 38 | 39 | return encode_to_base64(buffered) 40 | except Exception as e: 41 | print(f"Error encoding image: {e}") 42 | raise 43 | 44 | 45 | def split_list(lst, n): 46 | chunk_size = math.ceil(len(lst) / n) 47 | return [lst[i:i+chunk_size] for i in range(0, len(lst), chunk_size)] 48 | 49 | def get_chunk(lst, n, k): 50 | chunks = split_list(lst, n) 51 | return chunks[k] 52 | 53 | def read_jsonl(file): 54 | with open(file, 'r') as f: 55 | data = [json.loads(line) for line in f] 56 | return data 57 | 58 | 59 | def locate_img(d, args): 60 | valid_paths = [] 61 | extensions = [] 62 | for img in d['images']: 63 | if os.path.exists(os.path.join(args.image_dir_path, img)): 64 | full_path = os.path.join(args.image_dir_path, img) 65 | valid_paths.append(full_path) 66 | extensions.append(os.path.splitext(img)[1][1:].lower()) 67 | elif os.path.exists(img): 68 | valid_paths.append(img) 69 | extensions.append(os.path.splitext(img)[1][1:].lower()) 70 
| if not valid_paths:
71 |         raise ValueError(f"No valid image paths found in: {d['images']}")
72 |     return valid_paths, extensions
73 | 
74 | 
75 | 
76 | def gpt_forward(client, prompt, base64_images=None, temperature=0.9, model_name="chatgpt-4o-latest", extensions=None):
77 |     content = [{
78 |         "type": "text",
79 |         "text": prompt
80 |     }]
81 |     if base64_images is not None and extensions is not None:
82 |         for base64_image, extension in zip(base64_images, extensions):
83 |             content.append({
84 |                 "type": "image_url",
85 |                 "image_url": {
86 |                     "url": f"data:image/{extension};base64,{base64_image}"
87 |                 }
88 |             })
89 |     message = [
90 |         {"role": "system", "content": """You are a highly professional clinician with expertise across multiple clinical fields,
91 | well-versed in the latest medical guidelines, diagnostic standards, and treatment plans, capable of providing accurate and reasoned responses to user inquiries."""},
92 |         {"role": "user", "content": content}
93 |     ]
94 |     completion = client.chat.completions.create(
95 |         model=model_name,
96 |         messages=message,
97 |         temperature=temperature
98 |     )
99 | 
100 |     return completion.choices[0].message.content
101 | 
102 | 
103 | def qwenplus_forward(client, prompt, base64_images=None, temperature=0.9, model_name="qwen-vl-plus", extensions=None):
104 |     content = [{
105 |         "type": "text",
106 |         "text": prompt
107 |     }]
108 |     if base64_images is not None and extensions is not None:
109 |         for base64_image, extension in zip(base64_images, extensions):
110 |             content.append({
111 |                 "type": "image_url",
112 |                 "image_url": {
113 |                     "url": f"data:image/{extension};base64,{base64_image}"
114 |                 }
115 |             })
116 |     message = [
117 |         {"role": "system", "content": """You are a highly professional clinician with expertise across multiple clinical fields,
118 | well-versed in the latest medical guidelines, diagnostic standards, and treatment plans, capable of providing accurate and reasoned responses to user inquiries."""},
119 |         {"role": "user", "content": content}
120 |     ]
121 |     completion = client.chat.completions.create(
122 |         model=model_name,
123 |         messages=message,
124 |         temperature=temperature
125 |     )
126 | 
127 |     return completion.choices[0].message.content
128 | 
129 | 
130 | def ds_forward(client, prompt, temperature=0.9, model_name="chatgpt-4o-latest"):
131 |     message = [
132 |         {"role": "system", "content": """You are a highly professional clinician with expertise across multiple clinical fields,
133 | well-versed in the latest medical guidelines, diagnostic standards, and treatment plans, capable of providing accurate and reasoned responses to user inquiries."""},
134 |         {"role": "user", "content": prompt}
135 |     ]
136 |     completion = client.chat.completions.create(
137 |         model=model_name,
138 |         messages=message,
139 |         temperature=temperature
140 |     )
141 | 
142 |     return completion.choices[0].message.content.strip()
143 | 
144 | 
145 | def get_correctness(judge_output):
146 |     if 'yes' in judge_output.lower():
147 |         return 1
148 |     else:
149 |         return -1
150 | 
151 | 
152 | def select_best_mentor(mentors_scores, current_depth):
153 |     """
154 |     Select the best mentor based on their past scores.
155 |     If at the first step, randomly select a mentor.
156 |     If beyond the first step, calculate the competitiveness score for each mentor
157 |     as one plus the sum of their past scores and select the one with the highest score.
158 |     If scores are tied, randomly select among them.
159 | 160 | Args: 161 | mentors_scores (dict): A dictionary where keys are mentor names and values are lists of scores. 162 | current_depth (int): The current depth of exploration. 163 | 164 | Returns: 165 | str: The name of the selected mentor. 166 | """ 167 | if current_depth == 0: 168 | # Randomly select a mentor if at the first step 169 | return random.choice(list(mentors_scores.keys())) 170 | 171 | # Calculate competitiveness score for each mentor 172 | competitiveness_scores = {} 173 | for mentor, scores in mentors_scores.items(): 174 | competitiveness_scores[mentor] = 1 175 | for score in scores: 176 | competitiveness_scores[mentor] += score 177 | 178 | # Find the maximum competitiveness score 179 | max_score = max(competitiveness_scores.values()) 180 | best_mentors = [mentor for mentor, score in competitiveness_scores.items() if score == max_score] 181 | 182 | # Randomly select among the best mentors if there's a tie 183 | return random.choice(best_mentors) 184 | 185 | 186 | def process_case_info(data): 187 | """ 188 | Process the case information and format it into a structured string. 189 | 190 | Args: 191 | data (dict): A dictionary containing case information. 192 | 193 | Returns: 194 | str: A formatted string with presentation, age, gender, and caption details. 195 | """ 196 | 197 | presentation = data.get("presentation", "N/A") 198 | age = data.get("age_label", "N/A") 199 | gender = data.get("gender_label", "N/A") 200 | captions = data.get("caption", []) 201 | 202 | 203 | formatted_string = f"Chief complaint: {presentation}\n" 204 | formatted_string += f"Age: {age}\n" 205 | formatted_string += f"Gender: {gender}\n" 206 | 207 | if captions: 208 | formatted_string += "Image analysis:\n" 209 | for idx, caption in enumerate(captions, start=1): 210 | formatted_string += f"Modality {idx}: {caption}\n" 211 | else: 212 | formatted_string += "Captions: None\n" 213 | 214 | return formatted_string 215 | 216 | def select_next_step(steps, previous_mentors=None): 217 | """ 218 | Select the next reasoning step. If multiple steps have the same score, 219 | prioritize steps generated by mentors that have not been selected before. 220 | 221 | Args: 222 | steps (list): List of steps, each with score and generated_by attributes 223 | previous_mentors (list, optional): List of previously selected mentor models, default is None 224 | 225 | Returns: 226 | Step: The selected next step 227 | """ 228 | if not steps: 229 | return None 230 | 231 | max_score = max(s.score for s in steps) 232 | top_steps = [s for s in steps if s.score == max_score] 233 | 234 | # If there's only one highest scoring step, return it directly 235 | if len(top_steps) == 1: 236 | return top_steps[0] 237 | 238 | # When scores are tied, find steps generated by mentors not previously selected 239 | new_mentors_steps = [s for s in top_steps if s.generated_by not in previous_mentors] 240 | 241 | if new_mentors_steps: 242 | return random.choice(new_mentors_steps) 243 | 244 | return random.choice(top_steps) 245 | 246 | 247 | def replace_image_references(text): 248 | """ 249 | Replace all occurrences of 'According to the \n\n...' pattern with 250 | 'According to the image 1, image 2,...' where numbers are continuously incremented 251 | throughout the whole text. 252 | 253 | Args: 254 | text (str): The input string containing the phrases. 255 | 256 | Returns: 257 | str: The modified string with the phrases replaced. 
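        Example (assuming '<image>' is the placeholder token used in the captions):
        "According to the <image>\n<image>\n, ..." becomes "According to the image 1, image 2, ...".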
258 | """ 259 | image_counter = 0 260 | 261 | def replace_with_numbered_images(match): 262 | nonlocal image_counter 263 | image_count = match.group(0).count('') 264 | 265 | if image_count == 0: 266 | return match.group(0) 267 | 268 | image_numbers = range(image_counter + 1, image_counter + image_count + 1) 269 | image_counter += image_count 270 | 271 | numbered_images = ", ".join([f"image {i}" for i in image_numbers]) 272 | 273 | if match.group(0).rstrip().endswith(","): 274 | return f"According to the {numbered_images}, " 275 | else: 276 | return f"According to the {numbered_images} " 277 | 278 | pattern = r"According to the (?:\n)+,?" 279 | replaced_text = re.sub(pattern, replace_with_numbered_images, text) 280 | return replaced_text 281 | 282 | def extract_first_step(text): 283 | """ 284 | Extract the content of the first step from the input text. 285 | 286 | Args: 287 | text (str): The input string containing multiple steps. 288 | 289 | Returns: 290 | str: The content of the first step, or an empty string if not found. 291 | """ 292 | 293 | parts = text.split("###") 294 | 295 | for part in parts: 296 | if "Step" in part: 297 | return part.strip() 298 | 299 | return "" 300 | 301 | def extract_first_two_steps(text): 302 | """ 303 | Extract the content of the first two steps from the input text. 304 | 305 | Args: 306 | text (str): The input string containing multiple steps. 307 | 308 | Returns: 309 | str: The content of the first two steps, or an empty string if not found. 310 | """ 311 | parts = text.split("###") 312 | 313 | steps = [] 314 | for part in parts: 315 | if "Step" in part and len(steps) < 2: 316 | steps.append(part.strip()) 317 | 318 | if len(steps) == 0: 319 | return "" 320 | elif len(steps) == 1: 321 | return steps[0] 322 | else: 323 | return steps[0] + "\n\n### " + steps[1] 324 | 325 | def remove_phrases(text): 326 | """ 327 | Remove all occurrences of 'According to the \n,' and 328 | 'According to the \n\n,' from the input text. 329 | 330 | Args: 331 | text (str): The input string containing the phrases. 332 | 333 | Returns: 334 | str: The modified string with the phrases removed. 335 | """ 336 | cleaned_text = re.sub(r"According to the (\n)+,", "", text) 337 | return cleaned_text 338 | -------------------------------------------------------------------------------- /src/mics.py: -------------------------------------------------------------------------------- 1 | from openai import OpenAI 2 | import json 3 | from utils import ( 4 | encode_image, locate_img, get_correctness, 5 | gpt_forward, replace_image_references, qwenplus_forward, ds_forward, extract_first_two_steps, 6 | select_best_mentor, process_case_info, select_next_step, remove_phrases, extract_first_step 7 | ) 8 | from qwenvl_forward import qwenvl_forward 9 | from internvl_forward import internvl_forward 10 | from prompt import EVALUATE_PROMPT, JUDGE_PROMPT, REASONING_PROMPT 11 | from step import Step 12 | import time 13 | 14 | class MentorInternSearch: 15 | def __init__(self, args): 16 | """ 17 | Initialize the Mentor-Intern Search process. 18 | Args: 19 | args: Command line arguments containing model lists, paths, max_depth, etc. 
20 | """ 21 | 22 | self.root = Step(prefix_steps="") 23 | self.args = args 24 | self.mentor_models = args.mentor_models 25 | self.intern_models = args.intern_models 26 | self.evaluator_model = args.evaluator_model 27 | self.temperature1 = args.temperature1 28 | self.temperature2 = args.temperature2 29 | self.client1 = OpenAI(base_url=args.openai_base_url, api_key=args.openai_api_key) 30 | self.client2 = OpenAI(base_url=args.qwen_base_url, api_key=args.qwen_api_key) 31 | self.client3 = OpenAI(base_url=args.ds_base_url, api_key=args.ds_api_key) 32 | self.client4 = OpenAI(base_url=args.gemini_base_url, api_key=args.gemini_api_key) 33 | self.model_set = None 34 | 35 | 36 | def _call_model_forward(self, model_name, prompt, temperature, base64_images=None, img_paths=None, extensions=None): 37 | 38 | try: 39 | if 'gpt' in model_name: 40 | print("Calling gpt model!") 41 | return gpt_forward(self.client1, prompt, base64_images, temperature=temperature, model_name=model_name, extensions=extensions) 42 | elif 'gemini' in model_name: 43 | print("Calling gemini model!") 44 | return gpt_forward(self.client4, prompt, base64_images, temperature=temperature, model_name=model_name, extensions=extensions) 45 | elif '72' in model_name: 46 | print("Calling qwen model!") 47 | return qwenplus_forward(self.client2, prompt, base64_images, temperature=temperature, model_name=model_name, extensions=extensions) 48 | elif 'qwen' in model_name: 49 | response = qwenvl_forward( 50 | self.model_set[model_name]['model'], 51 | self.model_set[model_name]['processor'], 52 | prompt, 53 | img_paths, 54 | temperature=temperature 55 | ) 56 | return response 57 | elif 'internvl' in model_name: 58 | response = internvl_forward( 59 | self.model_set[model_name]['model'], 60 | self.model_set[model_name]['processor'], 61 | prompt, 62 | img_paths, 63 | temperature=temperature 64 | ) 65 | return response 66 | else: 67 | print(f"Warning: Model type for '{model_name}' not recognized for forwarding. Returning None.") 68 | return None 69 | except Exception as e: 70 | print(f"Error calling model {model_name} (Temp: {temperature}): {e}") 71 | return None 72 | 73 | 74 | def _generate_next_step_with_mentor(self, mentor_model_name, step, question, base64_images, img_paths, gt_answer, case_info, extensions): 75 | """ 76 | Generates the next single reasoning step using a mentor model. 77 | Returns the full reasoning chain, including prefix, current step, and future steps. 
78 | """ 79 | next_step_number = step.depth + 1 80 | reasoning_prefix = step.get_full_reasoning() 81 | 82 | if 'gpt' in mentor_model_name or 'gemini' in mentor_model_name or '72' in mentor_model_name: 83 | case_info = replace_image_references(case_info) 84 | prompt = REASONING_PROMPT.format( 85 | reasoning_prefix=reasoning_prefix, 86 | question=question, 87 | gt_answer=gt_answer, 88 | case_info=case_info 89 | ) 90 | else: 91 | prompt = REASONING_PROMPT.format( 92 | reasoning_prefix=reasoning_prefix, 93 | question=question, 94 | gt_answer=gt_answer, 95 | case_info=case_info 96 | ) 97 | 98 | print("Starting to call mentor model!") 99 | time1 = time.time() 100 | response = self._call_model_forward( 101 | model_name=mentor_model_name, 102 | prompt=prompt, 103 | temperature=self.temperature1, 104 | base64_images=base64_images, 105 | img_paths=img_paths, 106 | extensions=extensions 107 | ) 108 | print(f"Mentor model call completed, time taken: {time.time() - time1} seconds") 109 | 110 | if response: 111 | return reasoning_prefix, response 112 | else: 113 | print(f" Failed to get response from {mentor_model_name} at generating step {next_step_number}.") 114 | return None 115 | 116 | def _evaluate_step_with_interns(self, step_to_evaluate, question, gt_answer, img_paths, mentors_scores): 117 | """Evaluates a generated step using intern models, each with two temperatures.""" 118 | 119 | print(f"Evaluating step (Depth {step_to_evaluate.depth}, Mentor: {step_to_evaluate.generated_by}) with interns...") 120 | reasoning_prefix = step_to_evaluate.get_full_reasoning() 121 | correct_count = 0 122 | num_intern_models = len(self.intern_models) 123 | total_evaluations = num_intern_models * 2 124 | 125 | if num_intern_models == 0: 126 | print("Warning: No intern models specified for evaluation. 
Returning score 0.") 127 | return 0.0 128 | 129 | for intern_model_name in self.intern_models: 130 | 131 | if 'internvl' in intern_model_name: 132 | image_prefix = "" 133 | for i, img_path in enumerate(img_paths, 1): 134 | image_prefix += f"Image {i}: \n" 135 | 136 | question = image_prefix + question 137 | 138 | intern_prompt = EVALUATE_PROMPT.format( 139 | question=question, 140 | reasoning_prefix=reasoning_prefix 141 | ) 142 | else : 143 | intern_prompt = EVALUATE_PROMPT.format( 144 | question=question, 145 | reasoning_prefix=reasoning_prefix 146 | ) 147 | 148 | for temp in [self.temperature1, self.temperature2]: 149 | print(f" Running intern: {intern_model_name} (Temp: {temp})") 150 | time1 = time.time() 151 | 152 | intern_response = self._call_model_forward( 153 | model_name=intern_model_name, 154 | prompt=intern_prompt, 155 | temperature=temp, 156 | img_paths=img_paths 157 | ) 158 | print(f"Intern model call completed, time taken: {time.time() - time1} seconds") 159 | 160 | if not intern_response: 161 | print(f"Failed to get response from intern {intern_model_name} (Temp: {temp})") 162 | continue 163 | 164 | if '### The final answer is:' not in intern_response: 165 | print(f"Warning: Intern {intern_model_name} (Temp: {temp}) response did not contain '### The final answer is:'.") 166 | continue 167 | 168 | model_answer = intern_response.split('### The final answer is:')[-1].strip() 169 | if not model_answer: 170 | print(f"Warning: Intern {intern_model_name} (Temp: {temp}) provided empty final answer.") 171 | continue 172 | 173 | judge_prompt = JUDGE_PROMPT.format( 174 | question=question, 175 | model_answer=model_answer, 176 | gt_answer=gt_answer 177 | ) 178 | 179 | time2 = time.time() 180 | judge_output = ds_forward(self.client3, judge_prompt, temperature=0.9, model_name=self.evaluator_model) 181 | print(f"Evaluator model call completed, time taken: {time.time() - time2} seconds") 182 | if judge_output: 183 | is_correct = get_correctness(judge_output) 184 | if is_correct == 1: 185 | correct_count += 1 186 | else: 187 | print(f" Failed to get judgment from evaluator {self.evaluator_model}") 188 | 189 | # Calculate score based on total evaluations (2 * number of intern models) 190 | score = correct_count / total_evaluations if total_evaluations > 0 else 0.0 191 | print(f" Step Score (Correct/Total Evaluations): {correct_count}/{total_evaluations} = {score:.2f}") 192 | 193 | 194 | return score 195 | 196 | def search(self, data, model_set, search_file, failed_search_file): 197 | """Performs the mentor-intern iterative search.""" 198 | self.model_set = model_set 199 | 200 | # 1. Initialize from data 201 | question = data["messages"][0]['content'] 202 | question = question.replace('**Question:**', ' ').strip() 203 | 204 | case_info = process_case_info(data) 205 | 206 | gt_answer = data["messages"][1]['content'] 207 | gt_answer = gt_answer.replace('**Correct Answer:**', ' ').strip() 208 | try: 209 | img_paths, extensions = locate_img(data, self.args) 210 | base64_images = [encode_image(img_path) for img_path in img_paths] 211 | except ValueError as e: 212 | print(f"Error encoding image for data item: {e}. Skipping item.") 213 | failed_data = data.copy() 214 | failed_data['error'] = str(e) 215 | failed_search_file.write(json.dumps(failed_data) + "\n") 216 | failed_search_file.flush() 217 | return 218 | except FileNotFoundError as e: 219 | print(f"Error finding image for data item: {e}. 
Skipping item.") 220 | failed_data = data.copy() 221 | failed_data['error'] = str(e) 222 | failed_search_file.write(json.dumps(failed_data) + "\n") 223 | failed_search_file.flush() 224 | return 225 | 226 | print(f"\n--- Starting Search for Question ID: {data.get('rid', 'N/A')} ---") 227 | 228 | initial_text = "Let's think about how to solve this problem clearly and reasonably step by step." 229 | self.root.text = initial_text 230 | current_step = self.root 231 | 232 | mentors_scores = {mentor: [] for mentor in self.mentor_models} 233 | reasoning_chains = {} 234 | previous_mentors = [] 235 | 236 | # 2. Loop through depth 237 | for depth in range(self.args.max_depth): 238 | print(f"\n-- Depth {depth} (Generating Step {depth + 1}) --") 239 | 240 | generated_children_for_step = [] 241 | full_score_mentors = [] 242 | all_zero_score = True # Set to False to cancel the early stop mechanism of all zero scores to save API consumption 243 | 244 | for mentor_model in self.mentor_models: 245 | print(f"Generating step with mentor: {mentor_model}") 246 | if current_step.parent is None: 247 | prefix_reasoning, suffix_reasoning = self._generate_next_step_with_mentor( 248 | mentor_model, current_step, question, base64_images, img_paths, gt_answer, case_info, extensions 249 | ) 250 | complete_reasoning = prefix_reasoning + "\n" + suffix_reasoning 251 | reasoning_chains[mentor_model] = complete_reasoning 252 | 253 | new_step = extract_first_two_steps(suffix_reasoning) 254 | new_step = "### " + new_step 255 | else: 256 | if current_step.generated_by == mentor_model: 257 | try: 258 | new_step = reasoning_chains[mentor_model].split("###")[depth + 2].strip() 259 | new_step = "### " + new_step 260 | except IndexError: 261 | new_step = "" 262 | 263 | else: 264 | prefix_reasoning, suffix_reasoning = self._generate_next_step_with_mentor( 265 | mentor_model, current_step, question, base64_images, img_paths, gt_answer, case_info, extensions 266 | ) 267 | 268 | complete_reasoning = prefix_reasoning + "\n" + suffix_reasoning 269 | reasoning_chains[mentor_model] = complete_reasoning 270 | 271 | new_step = extract_first_step(suffix_reasoning) 272 | new_step = "### " + new_step 273 | 274 | temp_step = Step(step_text=new_step, prefix_steps=current_step.text, parent=current_step, generated_by=mentor_model) 275 | score = self._evaluate_step_with_interns( 276 | temp_step, question, gt_answer, img_paths, mentors_scores 277 | ) 278 | mentors_scores[mentor_model].append(score) 279 | 280 | if score > 0: 281 | all_zero_score = False 282 | actual_child = current_step.add_child_step(step_text=new_step, score=score, generated_by=mentor_model) 283 | if actual_child: 284 | generated_children_for_step.append(actual_child) 285 | if score == 1.0: 286 | full_score_mentors.append(mentor_model) 287 | 288 | # If any mentor's step is full score, select best and use its reasoning chain 289 | if full_score_mentors: 290 | if len(full_score_mentors) == 1: 291 | best_mentor = full_score_mentors[0] 292 | else: 293 | best_mentor = select_best_mentor(mentors_scores, depth) 294 | print(f"Full score achieved by {len(full_score_mentors)} mentors. 
Selected mentor: {best_mentor}") 295 | full_reasoning = reasoning_chains[best_mentor] 296 | 297 | result_data = { 298 | 'rid': data.get('rid', 'N/A'), 299 | 'images': data['images'], 300 | 'question': question, 301 | 'gt_answer': gt_answer, 302 | 'reasoning': full_reasoning, 303 | 'scores': mentors_scores, 304 | 'final_depth': str(depth + 1), 305 | 'generated_by': previous_mentors + [best_mentor], 306 | 'search_id': '1' 307 | } 308 | 309 | search_file.write(json.dumps(result_data) + "\n") 310 | search_file.flush() 311 | print(f"Saved successful result with full score for {data.get('rid', 'N/A')}") 312 | return 313 | 314 | # If all scores are zero, stop and record as failure 315 | if all_zero_score: 316 | print("All mentor scores are zero. Stopping search and recording as failure.") 317 | failed_data = data.copy() 318 | failed_data['error'] = "All mentor scores are zero." 319 | failed_search_file.write(json.dumps(failed_data) + "\n") 320 | failed_search_file.flush() 321 | return 322 | 323 | # Select the best step to proceed 324 | if generated_children_for_step: 325 | current_step = select_next_step(generated_children_for_step, previous_mentors) 326 | previous_mentors.append(current_step.generated_by) 327 | print(f" Step (Depth {depth}) best child score: {current_step.score:.2f} (Mentor: {current_step.generated_by})") 328 | else: 329 | print("No valid steps generated. Stopping search.") 330 | break 331 | 332 | print("########################################################") 333 | 334 | # 3. Search finished (max depth reached) 335 | print(f"\n--- Search Completed (Max Depth Reached) ---") 336 | 337 | # 4. Save results - use the last selected mentor's reasoning 338 | if current_step.score >= 0: 339 | final_mentor = current_step.generated_by 340 | else : 341 | final_mentor = "No valid mentor!" 342 | 343 | if final_mentor in reasoning_chains: 344 | final_reasoning = reasoning_chains[final_mentor] 345 | result_data = { 346 | 'rid': data.get('rid', 'N/A'), 347 | 'images': data['images'], 348 | 'question': question, 349 | 'gt_answer': gt_answer, 350 | 'reasoning': final_reasoning, 351 | 'scores': mentors_scores, 352 | 'final_depth': str(depth + 1), 353 | 'generated_by': previous_mentors, # {current_step.get_step_path()[i].generated_by for i in range(len(current_step.get_step_path()))} 354 | 'search_id': '0' 355 | } 356 | 357 | search_file.write(json.dumps(result_data) + "\n") 358 | search_file.flush() 359 | print(f"Saved result for {data.get('rid', 'N/A')}") 360 | else: 361 | failed_data = data.copy() 362 | failed_data['error'] = "Search failed to find a valid reasoning path." 363 | failed_search_file.write(json.dumps(failed_data) + "\n") 364 | failed_search_file.flush() 365 | print(f"Saved failed search log for {data.get('rid', 'N/A')}") 366 | 367 | print(f"--- Search Finished for Question ID: {data.get('rid', 'N/A')} ---") -------------------------------------------------------------------------------- /eval/eval_data.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "id": 1, 4 | "img_name": [ 5 | "images/197271/ct_group1/0.jpeg", 6 | "images/197271/ct_group1/1.jpeg", 7 | "images/197271/ct_group1/2.jpeg", 8 | "images/197271/ct_group1/3.jpeg", 9 | "images/197271/ct_group1/4.jpeg" 10 | ], 11 | "question": "Given the subtle imaging findings of diffuse gastric thickening and enhancement, along with gastrohepatic and perigastric adenopathy, what is the most likely diagnosis in this 45-year-old female presenting with abdominal discomfort? 
(Supplementary information is as follows:\nGender: Female, Age: 45 years, Chief complaint: Abdominal discomfort.)\nPlease base your response on the keyframes of the relevant medical imaging modality (Modality 1: \n\n\n\n) and reason step-by-step to answer the above question.", 12 | "reasoning": "Let's think about how to solve this problem clearly and reasonably step by step.\n### Step 1:\nStart by evaluating the patient's presenting complaint: a 45-year-old female with abdominal discomfort. This is a nonspecific symptom and could be attributed to a broad range of gastrointestinal, hepatobiliary, or systemic conditions. In this case, we need to rely on the imaging findings to narrow the differential diagnosis.\n\n### Step 2:\nReview the primary gastric imaging findings. There is a description of subtle, diffuse thickening and enhancement of the stomach. While such changes can be seen in benign conditions such as gastritis, the presence of thickening that is diffuse and enhancing raises suspicion for an infiltrative process. Notably, infiltrative gastric malignancies, such as linitis plastica (a subtype of diffuse gastric adenocarcinoma), often present with subtle or nonspecific wall thickening and enhancement, making them difficult to detect in early stages. This level of subtlety combined with the enhancement pattern increases concern for malignancy.\n\n### Step 3:\nAssess the lymph node findings. There is gastrohepatic adenopathy and lymphadenopathy along the greater omentum, extending from the gastric antrum and pylorus. These lymph nodes are not described as necrotic, but their distribution suggests spread along the regional drainage pathways of the stomach. Additionally, there are borderline left periaortic lymph nodes, which could represent more distant nodal involvement. Collectively, the regional and potentially distal lymphadenopathy significantly supports the likelihood of a malignant process with lymphatic spread.\n\n### Step 4:\nRule out other major differential diagnoses. While diseases such as lymphoma of the stomach, infectious causes like tuberculosis, or inflammatory conditions such as autoimmune gastritis could potentially result in gastric thickening and lymphadenopathy, the pattern described here—diffuse gastric wall thickening with associated lymphadenopathy in the gastrohepatic and perigastric regions—is most consistent with a primary gastric carcinoma. Infectious or inflammatory causes often have other systemic signs or a different pattern of involvement. Hepatic steatosis is noted but does not significantly influence the differential diagnosis in this context, nor does it explain the lymphadenopathy.\n\n### Step 5:\nCombine all findings. The constellation of subtle but diffuse gastric wall thickening, regional and perigastric nodal enlargement, and absence of liver lesions or peritoneal disease suggests a primary infiltrative gastric malignancy with lymph node involvement. 
The most common and likely diagnosis that fits these findings is infiltrative gastric adenocarcinoma, such as the diffuse type or linitis plastica, with regional nodal metastases.", 13 | "answer": "Infiltrative gastric adenocarcinoma with lymph node metastasis.", 14 | "answer_type": "OPEN" 15 | }, 16 | { 17 | "id": 2, 18 | "img_name": [ 19 | "images/185438/mri_group1/0.jpeg", 20 | "images/185438/mri_group1/1.jpeg", 21 | "images/185438/mri_group1/2.jpeg", 22 | "images/185438/mri_group1/3.jpeg" 23 | ], 24 | "question": "Based on the MRI findings of a well-defined submucosal mass with intermediate T1 and low T2 signal intensity, along with moderate heterogeneous enhancement, what is the most likely diagnosis for this 50-year-old female presenting with vaginal bleeding? (Supplementary information is as follows:\nGender: Female, Age: 50 years, Chief complaint: Vaginal bleeding.)\nPlease base your response on the keyframes of the relevant medical imaging modality (Modality 1: \n\n\n) and reason step-by-step to answer the above question.", 25 | "reasoning": "Let's think about how to solve this problem clearly and reasonably step by step.\n### Step 1:\nBegin by analyzing the patient's demographic and chief complaint. The patient is a 50-year-old woman presenting with vaginal bleeding. In women at or near the age of menopause, vaginal bleeding often prompts evaluation for both benign and malignant causes, including endometrial atrophy, hyperplasia, polyps, fibroids (leiomyomas), or endometrial carcinoma.\n\n### Step 2:\nNext, consider the MRI findings, which describe a well-defined submucosal mass distorting the endometrial cavity. The location of the mass is significant: submucosal masses are most often associated with abnormal uterine bleeding because of their direct impact on the endometrial lining. Well-defined margins also suggest a benign etiology, as malignant tumors often present with irregular or infiltrative borders.\n\n### Step 3:\nEvaluate the signal characteristics of the mass on MRI. The mass shows intermediate signal intensity on T1-weighted images and low signal intensity on T2-weighted images. These imaging features are classic for uterine leiomyomas (fibroids), which typically appear hypointense on T2 imaging due to their dense, fibrous composition. Intermediate T1 signal is also characteristic of fibroids, unless there is degeneration, which may alter the signal.\n\n### Step 4:\nConsider the enhancement pattern. The lesion shows moderate heterogeneous enhancement on postcontrast sequences. Uterine fibroids tend to enhance heterogeneously following contrast administration due to variable vascularity and areas of degeneration within the lesion. This pattern further supports a diagnosis of a benign leiomyoma over more concerning pathologies such as endometrial carcinoma, which often infiltrate the endometrium diffusely rather than presenting as a discrete submucosal mass.\n\n### Step 5:\nReview of the associated findings—specifically, a unilocular 5 cm right ovarian cyst and an umbilical hernia—are noted, but do not directly contribute to the cause of vaginal bleeding. The ovarian cyst is likely incidental, especially in the absence of features suggesting complexity or solid components. Similarly, the umbilical hernia containing omentum is a separate abdominal wall issue unrelated to the uterine pathology.\n\n### Step 6:\nConclude by correlating the clinical presentation and imaging findings. 
The combination of a well-defined submucosal mass distorting the endometrial cavity, its intermediate T1 and low T2 signal, and moderate heterogeneous enhancement forms a classic radiologic picture of a submucosal uterine leiomyoma. This benign tumor frequently causes abnormal uterine bleeding in premenopausal and perimenopausal women.", 26 | "answer": "Submucosal uterine leiomyoma.", 27 | "answer_type": "OPEN" 28 | }, 29 | { 30 | "id": 3, 31 | "img_name": [ 32 | "images/98014/mri_group1/0.jpeg", 33 | "images/98014/mri_group1/1.jpeg", 34 | "images/98014/mri_group1/2.jpeg", 35 | "images/98014/mri_group1/3.jpeg" 36 | ], 37 | "question": "Teacher, based on the imaging findings of T1 hypo- and T2 hyperintense signals at the D10/11 level with irregular endplates and disc destruction, along with the patient's history of esophageal carcinoma and recent surgery, what is the most likely diagnosis for his severe lower dorsal pain, fever, and weight loss? (Supplementary information is as follows:\nGender: Male, Age: 55 years, Chief complaint: History of esophageal carcinoma treated with surgical resection. Severe lower dorsal pain with weight loss and fever two months after surgery.)\nPlease base your response on the keyframes of the relevant medical imaging modality (Modality 1: \n\n\n) and reason step-by-step to answer the above question.", 38 | "reasoning": "Let's think about how to solve this problem clearly and reasonably step by step.\n### Step 1: \nThe patient presents with severe lower dorsal pain, fever, and weight loss two months after undergoing surgical resection for esophageal carcinoma. These symptoms are concerning for an infectious or neoplastic process, given the systemic nature of fever and weight loss, as well as the localized pain. The history of recent surgery raises the possibility of postoperative complications, including infection or metastatic disease, given the underlying malignancy.\n\n### Step 2: \nThe imaging findings describe abnormal bone marrow signal at the D10/11 vertebral level, with T1 hypointense and T2 hyperintense signals, irregular opposing endplates, and destruction of the intervening disc with a central T2 hyperintense focus. These findings are highly suggestive of an inflammatory or infectious process involving the vertebral body and disc space. The absence of a central nuclear cleft further supports disc space involvement, which is atypical for degenerative changes and more consistent with an infectious or inflammatory etiology.\n\n### Step 3: \nThe presence of disc dehydration at adjacent levels (D7/8 and D8/9) may indicate reactive changes or early involvement, but the primary pathology appears localized to D10/11. The gastric pull-up surgery and moderate right-sided pleural effusion could be related to the patient's recent esophageal resection, but the pleural effusion could also represent a reactive process secondary to the spinal pathology, such as paravertebral abscess or inflammation.\n\n### Step 4: \nGiven the patient's history of esophageal carcinoma, metastatic disease to the spine must be considered. However, the imaging findings of disc space destruction and T2 hyperintensity are more characteristic of an infectious process like spondylodiscitis rather than metastatic disease, which typically spares the disc space. The fever further supports an infectious etiology.\n\n### Step 5: \nThe combination of vertebral body and disc space involvement with systemic symptoms (fever, weight loss) in a postoperative patient is highly suggestive of spondylodiscitis. 
This could arise from hematogenous spread of infection secondary to the surgical procedure or from a compromised immune state postoperatively. The absence of a more focal mass or osteolytic lesions makes metastatic disease less likely.", 39 | "answer": "Spondylodiscitis.", 40 | "answer_type": "OPEN" 41 | }, 42 | { 43 | "id": 4, 44 | "img_name": [ 45 | "images/180144/ct_group1/0.png", 46 | "images/180144/ct_group1/1.jpeg", 47 | "images/180144/ct_group1/2.jpeg", 48 | "images/180144/ct_group1/3.jpeg" 49 | ], 50 | "question": "Doctor, I've been told I have a jugulotympanic paraganglioma causing my tinnitus. Is it true that this tumor is pressing on my carotid artery and could cause a stroke? (Supplementary information is as follows:\nGender: Female, Age: 55 years, Chief complaint: Left side pulsatile tinnitus.)\nPlease base your response on the keyframes of the relevant medical imaging modality (Modality 1: \n\n\n) and reason step-by-step to answer the above question.", 51 | "reasoning": "Let's think about how to solve this problem clearly and reasonably step by step.\n### Step 1:\n\nBegin by analyzing the patient's chief complaint—pulsatile tinnitus localized to the left side in a 55-year-old female. Pulsatile tinnitus often suggests a vascular origin due to turbulent blood flow near the auditory system. This symptom is consistent with a vascular lesion near or involving the temporal bone and middle ear structures.\n\n### Step 2:\n\nNext, review the imaging findings. The lesion is described as an intensely enhancing, well-defined soft tissue mass centered on the left jugular fossa, with widening and moth-eaten erosions of the jugular foramen. This is classic for a paraganglioma (glomus jugulare tumor), which is a highly vascular tumor known to arise in this location and demonstrate bone erosion due to its expansile and locally invasive nature.\n\n### Step 3:\n\nFurther imaging indicates that the lesion extends into the middle ear cavity (specifically the hypotympanum) and is abutting the cochlear promontory. This finding supports the diagnosis of a jugulotympanic paraganglioma—a type of glomus tumor that involves both the jugular bulb and middle ear. Its proximity to the promontory explains the patient's pulsatile tinnitus, as vibrations from the adjacent vascular tumor can directly transmit to the cochlear structures.\n\n### Step 4:\n\nCritically, the tumor is also said to be eroding the posterior wall of the left carotid canal and is slightly abutting the left internal carotid artery. This is an important detail because the internal carotid artery (ICA) is a major blood vessel supplying the brain. Although there is no indication of ICA invasion or thrombosis, abutment and erosion of the bony canal increase the risk of potential vascular complications. These could include ICA compression or involvement, which might, in rare but serious cases, contribute to cerebrovascular events such as a stroke.\n\n### Step 5:\n\nAlthough the internal carotid artery is only slightly abutted and not yet invaded or thrombosed, the risk of such complications, including stroke, remains real because of the tumor’s anatomical proximity and its invasive nature. 
The tumor's inseparability from the jugular vein, despite the absence of thrombosis, further demonstrates its locally aggressive behavior.", 52 | "answer": "Yes.", 53 | "answer_type": "OPEN" 54 | }, 55 | { 56 | "id": 5, 57 | "img_name": [ 58 | "images/201392/ultrasound_group1/0.jpeg", 59 | "images/201392/ultrasound_group1/1.jpeg", 60 | "images/201392/ultrasound_group1/2.jpeg", 61 | "images/201392/ultrasound_group1/3.jpeg", 62 | "images/201392/ultrasound_group1/4.jpeg" 63 | ], 64 | "question": "Teacher, in this patient with chronic hepatic disease presenting with an umbilical bulge, the imaging describes a \"Christmas tree\" appearance due to the umbilical hernia. Could you explain what specific imaging findings contribute to this distinctive pattern? (Supplementary information is as follows:\nGender: Male, Age: 55 years, Chief complaint: Chronic hepatic disease. Presented with an umbilical region bulge.)\nPlease base your response on the keyframes of the relevant medical imaging modality (Modality 1: \n\n\n\n) and reason step-by-step to answer the above question.", 65 | "reasoning": "Let's think about how to solve this problem clearly and reasonably step by step.\n### Step 1:\nBegin by focusing on the patient's clinical background. The patient is a 55-year-old male with chronic hepatic disease who presents with a bulge in the umbilical region. Patients with chronic liver disease often develop ascites due to portal hypertension, which predisposes them to abdominal wall hernias, particularly through the umbilicus, a site of anatomical weakness. The presence of ascitic fluid increases intra-abdominal pressure, further promoting herniation.\n\n### Step 2:\nNext, assess the imaging findings to identify the components of the umbilical hernia. It is noted that in images 1 and 2, the hernia contains abdominal fat and ascitic fluid. These materials have distinct imaging characteristics: fat typically appears hyperechoic (bright on ultrasound), while fluid is hypoechoic (dark). When fat and fluid herniate together, they may organize into a pattern of alternating bright and dark areas.\n\n### Step 3:\nEvaluate the description of the hernia's neck and compressive features. Image 3 shows the neck of the umbilical hernia with compression, which suggests that the herniated contents are being squeezed through a relatively narrow opening. This mechanical configuration can accentuate the layering of different tissue densities within the hernia sac, further enhancing the visibility of alternating echogenic patterns.\n\n### Step 4:\nIntegrate this visual pattern with the metaphor used in the question. The description of a \"Christmas tree\" appearance refers to the specific visual arrangement seen on imaging: alternating layers or bands of hyperechoic fat and hypoechoic ascitic fluid, branching or stacking in a way that resembles the shape of a Christmas tree. This pattern is not a structural abnormality per se but results from the way these components are distributed within the hernia sac.\n\n### Step 5:\nLastly, consider the relevance of associated findings in images 4 and 5, such as background hepatic parenchymal disease, gallstones, and right pleural effusion. While these findings further substantiate the diagnosis of chronic liver disease and its complications, they are not directly responsible for the \"Christmas tree\" appearance. 
Thus, they are supportive but not explanatory of the imaging pattern in question.", 66 | "answer": "Herniation of abdominal fat and ascitic fluid through the umbilical defect, creating alternating hyperechoic (fat) and hypoechoic (fluid) bands resembling a Christmas tree.", 67 | "answer_type": "OPEN" 68 | }, 69 | { 70 | "id": 6, 71 | "img_name": [ 72 | "images/156702/x-ray_group1/0.png", 73 | "images/156702/x-ray_group1/1.png" 74 | ], 75 | "question": "Given this 2-year-old male patient's greenstick fractures of the middle third of the radial and ulnar shafts with volar angulation, particularly noting the radial fracture's location proximal to the pronator teres insertion, how would you approach immobilization to ensure proper healing? (Supplementary information is as follows:\nGender: Male, Age: 2 years, Chief complaint: Fall onto the outstretched hand after jumping off a sofa. Mild forearm deformity upon physical examination.)\nPlease base your response on the keyframes of the relevant medical imaging modality (Modality 1: \n) and reason step-by-step to answer the above question.", 76 | "reasoning": "Let's think about how to solve this problem clearly and reasonably step by step.\n### Step 1: \nConsider the nature of greenstick fractures in children. These fractures are incomplete, where one side of the bone is broken while the other side is bent. In this case, the fractures are located at the middle third of both the radius and ulna with volar angulation. The radial fracture is specifically noted to be proximal to the pronator teres insertion.\n\n### Step 2:\nAnalyze the role of the pronator teres muscle. The pronator teres acts to pronate the forearm (rotate it so that the palm faces downward). Given that the radial fracture is proximal to the pronator teres insertion, there is a risk that the muscle's pull could exacerbate the deformity by further pulling the fractured segment into pronation and potentially increasing the volar angulation.\n\n### Step 3:\nDetermine the appropriate immobilization technique. To counteract the deforming force of the pronator teres and prevent further pronation and angulation, the forearm should be immobilized in a position that opposes the muscle's action. Supination (rotation of the forearm so that the palm faces upward) is the opposite of pronation and would help maintain proper alignment of the fracture.\n\n### Step 4:\nConsider the age of the patient. A 2-year-old child requires a method of immobilization that is both effective and comfortable. A well-molded cast applied in supination would provide the necessary support and immobilization while allowing for some growth and minimizing discomfort.", 77 | "answer": "Immobilization in supination to counteract the deforming force of the pronator teres.", 78 | "answer_type": "OPEN" 79 | }, 80 | { 81 | "id": 7, 82 | "img_name": [ 83 | "images/176949/ct_group1/0.jpeg", 84 | "images/176949/ct_group1/1.jpeg" 85 | ], 86 | "question": "Teacher, based on the imaging findings of a midline abdominal mass with peripheral enhancement, mottled air density, and surrounding thickened small bowel loops with fistulization, in a patient with recent hysterectomy, what is the most likely diagnosis considering the surgical history and imaging characteristics? (Supplementary information is as follows:\nGender: Female, Age: 40 years, Chief complaint: Abdominal pain. 
Recent hysterectomy.)\nPlease base your response on the keyframes of the relevant medical imaging modality (Modality 1: \n) and reason step-by-step to answer the above question.", 87 | "reasoning": "Let's think about how to solve this problem clearly and reasonably step by step.\n### Step 1:\nStart by analyzing the patient's clinical context — a 40-year-old female presenting with abdominal pain and a recent history of hysterectomy. Postoperative complications should be strongly considered, especially given the proximity of the imaging findings to recent surgical intervention.\n\n### Step 2:\nEvaluate the CT imaging described: presence of a **midline abdominal mass with peripheral enhancement and a mottled appearance due to air density**. This indicates a mass with central heterogeneity and gas within the lesion, which can be associated with infection, necrosis, or communication with bowel/gas-forming organisms.\n\n### Step 3:\nIncorporate the additional findings concerning the surrounding structures: **Thickened small bowel loops partially adherent to the mass**, implying an inflammatory reaction involving the bowel wall. The presence of **areas of fistulization between the bowel lumen and the mass** directly explains the mottled air density within the mass (gas tracking from the bowel into the mass) and reinforces the inflammatory/communicating nature of the process.\n\n### Step 4:\nNext, consider specific differential diagnoses that could present with a centrally gas-filled mass showing peripheral enhancement postoperatively. Although abscess, necrotic tumor, and infected seroma might be considered, the location of the mass in the midline, recent surgical history (hysterectomy), and the characteristic appearance of **mottled gas with peripheral enhancement** strongly suggest a retained surgical foreign body — specifically, **gossypiboma** (retained surgical sponge). The peripheral enhancement implies a fibrous capsule or inflammatory wall forming around the foreign body.\n\n### Step 5:\nThe presence of **fistulization to the bowel loops** further supports the diagnosis. Gossypibomas can incite vigorous inflammatory responses that cause adherence to adjacent viscera and, over time, lead to erosion or fistulization into bowel loops. This accounts for both the bowel wall thickening and the entrapped gas within the mass.\n\n### Step 6:\nHepatic steatosis is an incidental finding and is not contributing to the midline mass or the acute presentation. Therefore, primary diagnostic attention remains on the features compatible with inflammatory-septic complications arising after recent surgery.", 88 | "answer": "Gossypiboma with enteric fistulization.", 89 | "answer_type": "OPEN" 90 | }, 91 | { 92 | "id": 8, 93 | "img_name": "images/187899/x-ray_group1/0.jpeg", 94 | "question": "Based on the imaging findings of a horizontal radiolucent line at the base of the proximal 5th metatarsal and the patient's presentation with lateral foot pain and swelling, what is the most likely diagnosis? (Supplementary information is as follows:\nGender: Male, Age: 40 years, Chief complaint: Lateral left foot swelling and pain.)\nPlease base your response on the keyframes of the relevant medical imaging modality (Modality 1: ) and reason step-by-step to answer the above question.", 95 | "reasoning": "Let's think about how to solve this problem clearly and reasonably step by step.\n### Step 1:\nEvaluate the clinical presentation. The patient is a 40-year-old male presenting with **lateral left foot pain and swelling**. 
These symptoms suggest a localized musculoskeletal injury, likely related to the lateral structures of the foot, particularly the proximal fifth metatarsal, given anatomical correlation with the reported painful area.\n\n### Step 2:\nReview and interpret the imaging findings. On the radiograph, there is a distinct **horizontal radiolucent line visible at the base of the proximal aspect of the 5th metatarsal**, which does not extend into the articular surface and is situated at the **tuberosity of the bone**.\n\n### Step 3:\nDifferentiate among common fracture types of the 5th metatarsal:\n- **Avulsion fracture (Pseudo-Jones fracture):** Characterized by a horizontal, extra-articular fracture at the **tuberosity (styloid process)** of the 5th metatarsal. Typically results from an inversion injury pulling on the peroneus brevis tendon.\n- **Jones fracture:** A transverse fracture located approximately 1.5-2 cm distal to the base, in the **metaphyseal-diaphyseal junction**, with a higher risk of delayed healing.\n- **Stress fracture:** Usually located in the diaphysis and characterized by a more oblique fracture line, commonly in athletes with gradual onset of pain.\n\n### Step 4:\nCorrelate clinical and radiologic findings:\n- The fracture line is **horizontal and located at the tuberosity** (styloid) of the 5th metatarsal.\n- Given the **acute presentation** with swelling and pain, and radiographic evidence, this supports an **avulsion-type injury mechanism**.\n- The radiologic pattern and location make a **Jones fracture or a stress fracture less likely**.", 96 | "answer": "**Pseudo-Jones fracture (avulsion fracture of the 5th metatarsal styloid)**.", 97 | "answer_type": "OPEN" 98 | }, 99 | { 100 | "id": 9, 101 | "img_name": [ 102 | "images/197850/x-ray_group1/0.png", 103 | "images/197850/mri_group2/0.jpeg" 104 | ], 105 | "question": "Teacher, based on the imaging findings of a pedunculated bony protrusion with continuity of the cortex and medullary cavity, along with a cartilage cap measuring 4 mm on MRI, what is the most likely diagnosis for this 15-year-old female presenting with pain and a palpable mass on the distal femur? (Supplementary information is as follows:\nGender: Female, Age: 15 years, Chief complaint: Pain and palpable protruding mass under the skin on the anterior surface of the lower one-third of the left thigh.)\nPlease base your response on the keyframes of the relevant medical imaging modality (Modality 1: ,\nModality 2: ) and reason step-by-step to answer the above question.", 106 | "reasoning": "Let's think about how to solve this problem clearly and reasonably step by step.\n### Step 1: \nReview the clinical presentation and imaging findings. The patient is a 15-year-old female with pain and a palpable mass on the distal femur. The x-ray shows a pedunculated bony protrusion at the metaphysis of the distal femur, with continuity of the cortex and medullary cavity with the adjacent healthy bone. There is no calcification or periosteal reaction noted.\n\n### Step 2:\nAnalyze the MRI findings. The MRI confirms the presence of the bony protrusion seen on the x-ray and provides additional details. It reveals a cartilage cap measuring approximately 4 mm, which appears as high signal intensity on STIR and PDFS sequences. The lesion is solitary, and there are no signs suggestive of malignant transformation.\n\n### Step 3:\nConsider the differential diagnosis based on the age of the patient and the imaging characteristics. 
In a young individual, benign bone tumors are more common than malignant ones. The features of the lesion, such as its pedunculated nature, cortical and medullary continuity, and the presence of a cartilage cap, are highly suggestive of an osteochondroma.\n\n### Step 4:\nEvaluate the consistency of the findings with the typical presentation of an osteochondroma. Osteochondromas are the most common benign bone tumors, often occurring in the metaphyseal region of long bones, particularly around the knee. They typically present as a slow-growing, painless mass but can cause symptoms if they impinge on surrounding structures or nerves. The size and characteristics of the cartilage cap (less than 2 cm) in this case are consistent with a benign process.\n\n### Step 5:\nRule out other potential diagnoses. Other considerations such as enchondroma or chondrosarcoma are less likely given the absence of calcification, the continuity of the cortex and medullary cavity, and the lack of soft tissue mass or signs of malignant transformation on MRI.", 107 | "answer": "Osteochondroma.", 108 | "answer_type": "OPEN" 109 | }, 110 | { 111 | "id": 10, 112 | "img_name": [ 113 | "images/26514/mri_group1/0.jpeg", 114 | "images/26514/mri_group1/1.jpeg" 115 | ], 116 | "question": "Given the 7-year-old female patient presenting with acute onset right leg weakness and dysphasia, along with imaging findings of marked prominence of cortical veins on the left on SWI without T2 or post-contrast T1 asymmetry, subtle ipsilateral arteriospasm, and no diffusion restriction, what is the most likely diagnosis considering the transient nature of symptoms and resolution within 24 hours? (Supplementary information is as follows:\nGender: Female, Age: 7 years, Chief complaint: Acute onset right leg weakness and dysphasia.)\nPlease base your response on the keyframes of the relevant medical imaging modality (Modality 1: \n) and reason step-by-step to answer the above question.", 117 | "reasoning": "Let's think about how to solve this problem clearly and reasonably step by step.\n### Step 1: \nConsider the patient's clinical presentation. The chief complaint includes acute onset right leg weakness and dysphasia in a 7-year-old female. These symptoms suggest a neurological event affecting the left hemisphere of the brain, as the right side of the body is controlled by the left hemisphere. The transient nature of the symptoms, resolving within 24 hours, points towards a condition that causes temporary neurological deficits rather than permanent damage.\n\n### Step 2:\nAnalyze the imaging findings. The SWI image shows marked prominence of cortical veins on the left side without any T2 or post-contrast T1 asymmetry. This suggests there is no evidence of hemorrhage, infarction, or mass effect. The subtle reduction of arterial supply on the same side indicates possible arteriospasm, which can be associated with conditions that cause transient vasoconstriction. The absence of diffusion restriction rules out an acute ischemic stroke.\n\n### Step 3:\nIntegrate the clinical and imaging findings. Given the acute onset of focal neurological deficits, the imaging findings of prominent cortical veins and subtle arteriospasm without signs of infarction or hemorrhage, and the transient nature of the symptoms, we need to consider conditions that can cause transient vasoconstriction and neurological symptoms. Hemiplegic migraine is a known condition that fits this profile, especially in children. 
It is characterized by recurrent episodes of unilateral weakness, often accompanied by other neurological symptoms like dysphasia, and typically resolves within 24 hours.\n\n### Step 4:\nRule out other potential diagnoses. While other conditions such as transient ischemic attack (TIA) or reversible cerebral vasoconstriction syndrome (RCVS) could present with similar symptoms, the lack of diffusion restriction and the presence of prominent cortical veins without other signs of vascular pathology make these less likely. Additionally, the age of the patient and the specific pattern of symptoms are more consistent with hemiplegic migraine.", 118 | "answer": "Hemiplegic migraine.", 119 | "answer_type": "OPEN" 120 | } 121 | ] --------------------------------------------------------------------------------
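Note on consuming `eval/eval_data.json`: the entries above show its full schema (`id`, `img_name`, `question`, `reasoning`, `answer`, `answer_type`), and one quirk worth handling is that `img_name` is a list of image paths for multi-image cases but a bare string for single-image cases (compare id 8 with the other entries). The sketch below is a minimal, illustrative loader for this file; the `load_eval_data` helper and the assumption that image paths resolve relative to the `eval/` directory are ours and are not taken from the repository's `eval.py`.

```python
# Illustrative loader for eval/eval_data.json (a sketch, not the repo's own eval code).
# Assumes it is run from the repository root.
import json
from pathlib import Path

EVAL_DIR = Path("eval")  # assumption: images/... paths in the JSON are relative to eval/

def load_eval_data(path: Path = EVAL_DIR / "eval_data.json") -> list[dict]:
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    for entry in entries:
        # img_name is usually a list, but a bare string for single-image
        # entries (e.g. id 8), so normalize it to a list first.
        imgs = entry["img_name"]
        if isinstance(imgs, str):
            imgs = [imgs]
        # e.g. "images/187899/x-ray_group1/0.jpeg" -> eval/images/187899/x-ray_group1/0.jpeg
        entry["img_paths"] = [EVAL_DIR / p for p in imgs]
    return entries

if __name__ == "__main__":
    data = load_eval_data()
    print(f"{len(data)} eval entries")
    for e in data[:3]:
        print(e["id"], e["answer_type"], [str(p) for p in e["img_paths"]])
```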