├── examples ├── fedmkt │ ├── __init__.py │ ├── test_fedmkt_llmsuit.yaml │ └── fedmkt_config.yaml ├── pellm │ ├── __init__.py │ ├── test_pellm_llmsuite.yaml │ ├── bloom_lora_config.yaml │ └── test_bloom_lora.py └── offsite_tuning │ ├── __init__.py │ ├── test_offsite_tuning_llmsuite.yaml │ ├── offsite_tuning_config.yaml │ └── offsite_tuning.py ├── python ├── fate_llm │ ├── __init__.py │ ├── algo │ │ ├── __init__.py │ │ ├── fedavg │ │ │ ├── __init__.py │ │ │ └── fedavg.py │ │ ├── fedcot │ │ │ ├── __init__.py │ │ │ ├── encoder_decoder │ │ │ │ ├── __init__.py │ │ │ │ ├── init │ │ │ │ │ ├── __init__.py │ │ │ │ │ └── default_init.py │ │ │ │ └── slm_encoder_decoder.py │ │ │ └── slm_encoder_decoder_trainer.py │ │ ├── fedkseed │ │ │ ├── __init__.py │ │ │ ├── args.py │ │ │ ├── pytorch_utils.py │ │ │ └── zo_utils.py │ │ ├── inferdpt │ │ │ ├── __init__.py │ │ │ ├── init │ │ │ │ ├── _init.py │ │ │ │ └── default_init.py │ │ │ └── _encode_decode.py │ │ ├── ppc-gpt │ │ │ └── __init__.py │ │ ├── offsite_tuning │ │ │ └── __init__.py │ │ ├── fdkt │ │ │ ├── cluster │ │ │ │ ├── __init__.py │ │ │ │ ├── cluster_method.py │ │ │ │ └── cluster.py │ │ │ ├── utils │ │ │ │ ├── __init__.py │ │ │ │ ├── invalid_data_filter.py │ │ │ │ ├── dp_loss.py │ │ │ │ └── text_generate.py │ │ │ ├── __init__.py │ │ │ └── inference_inst.py │ │ ├── fedmkt │ │ │ ├── utils │ │ │ │ ├── __init__.py │ │ │ │ ├── tokenizer_tool.py │ │ │ │ ├── vars_define.py │ │ │ │ ├── dataset_sync_util.py │ │ │ │ └── generate_logit_utils.py │ │ │ ├── token_alignment │ │ │ │ ├── __init__.py │ │ │ │ ├── spectal_token_mapping.py │ │ │ │ └── vocab_mapping.py │ │ │ ├── __init__.py │ │ │ └── fedmkt_data_collator.py │ │ └── dp │ │ │ ├── opacus_compatibility │ │ │ ├── grad_sample │ │ │ │ ├── __init__.py │ │ │ │ └── embedding.py │ │ │ ├── optimizers │ │ │ │ ├── __init__.py │ │ │ │ └── optimizer.py │ │ │ ├── __init__.py │ │ │ └── transformers_compate.py │ │ │ └── __init__.py │ ├── data │ │ ├── __init__.py │ │ ├── data_collator │ │ │ ├── __init__.py │ │ │ ├── fedcot_collator.py │ │ │ └── cust_data_collator.py │ │ └── tokenizers │ │ │ ├── __init__.py │ │ │ └── cust_tokenizer.py │ ├── evaluate │ │ ├── __init__.py │ │ ├── scripts │ │ │ ├── __init__.py │ │ │ ├── data_cli.py │ │ │ ├── fate_llm_cli.py │ │ │ ├── config_cli.py │ │ │ ├── _options.py │ │ │ └── eval_cli.py │ │ ├── tasks │ │ │ ├── dolly_15k │ │ │ │ ├── __init__.py │ │ │ │ ├── default_dolly_15k.yaml │ │ │ │ └── dolly_utils.py │ │ │ ├── advertise_gen │ │ │ │ ├── __init__.py │ │ │ │ ├── default_advertise_gen.yaml │ │ │ │ └── advertise_utils.py │ │ │ └── __init__.py │ │ └── utils │ │ │ ├── __init__.py │ │ │ ├── _io.py │ │ │ ├── data_tools.py │ │ │ ├── model_tools.py │ │ │ └── config.py │ ├── inference │ │ ├── __init__.py │ │ ├── inference_base.py │ │ ├── api.py │ │ ├── hf_qw.py │ │ └── vllm.py │ ├── runner │ │ ├── __init__.py │ │ └── fedkseed_runner.py │ ├── trainer │ │ └── __init__.py │ ├── model_zoo │ │ ├── pellm │ │ │ ├── __init__.py │ │ │ ├── opt.py │ │ │ ├── qwen.py │ │ │ ├── bloom.py │ │ │ ├── bart.py │ │ │ ├── bert.py │ │ │ ├── roberta.py │ │ │ ├── deberta.py │ │ │ ├── chatglm.py │ │ │ ├── distilbert.py │ │ │ ├── albert.py │ │ │ ├── gpt2.py │ │ │ ├── llama.py │ │ │ └── parameter_efficient_llm.py │ │ ├── offsite_tuning │ │ │ └── __init__.py │ │ ├── __init__.py │ │ ├── embedding_transformer │ │ │ ├── __init__.py │ │ │ └── st_model.py │ │ └── hf_model.py │ └── dataset │ │ ├── data_config │ │ ├── __init__.py │ │ ├── default_ag_news.yaml │ │ └── default_yelp_review.yaml │ │ ├── __init__.py │ │ ├── fedcot_dataset.py 
│ │ ├── seq_cls_dataset.py │ │ └── input_output_dataset.py ├── MANIFEST.in ├── requirements.txt └── setup.py ├── doc ├── images │ ├── ot1.png │ ├── ot2.png │ ├── fate-llm-plan.png │ ├── fate-llm-show.png │ └── fate-llm-chatglm-6b.png ├── tutorial │ ├── fedcot │ │ └── README.md │ ├── fedmkt │ │ └── README.md │ ├── fdkt │ │ └── README.md │ ├── fedkseed │ │ └── README.md │ ├── offsite_tuning │ │ └── README.md │ └── pellm │ │ └── builtin_pellm_models.md ├── standalone_deploy.md └── fate_llm_evaluate.md ├── RELEASE.md └── README.md /examples/fedmkt/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /examples/pellm/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /python/fate_llm/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /python/fate_llm/algo/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /python/fate_llm/data/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /examples/offsite_tuning/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /python/fate_llm/evaluate/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /python/fate_llm/inference/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /python/fate_llm/runner/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /python/fate_llm/trainer/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fedavg/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fedcot/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fedkseed/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /python/fate_llm/algo/inferdpt/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /python/fate_llm/algo/ppc-gpt/__init__.py: -------------------------------------------------------------------------------- 1 | 
-------------------------------------------------------------------------------- /python/fate_llm/model_zoo/pellm/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /python/fate_llm/algo/offsite_tuning/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /python/fate_llm/evaluate/scripts/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /python/fate_llm/evaluate/tasks/dolly_15k/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /python/fate_llm/model_zoo/offsite_tuning/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fedcot/encoder_decoder/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /python/fate_llm/evaluate/tasks/advertise_gen/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fedcot/encoder_decoder/init/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /python/fate_llm/evaluate/utils/__init__.py: -------------------------------------------------------------------------------- 1 | from ._parser import LlmJob, LlmPair, LlmSuite -------------------------------------------------------------------------------- /doc/images/ot1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/FederatedAI/FATE-LLM/HEAD/doc/images/ot1.png -------------------------------------------------------------------------------- /doc/images/ot2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/FederatedAI/FATE-LLM/HEAD/doc/images/ot2.png -------------------------------------------------------------------------------- /doc/images/fate-llm-plan.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/FederatedAI/FATE-LLM/HEAD/doc/images/fate-llm-plan.png -------------------------------------------------------------------------------- /doc/images/fate-llm-show.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/FederatedAI/FATE-LLM/HEAD/doc/images/fate-llm-show.png -------------------------------------------------------------------------------- /python/MANIFEST.in: -------------------------------------------------------------------------------- 1 | include fate_llm/dataset/data_config/*yaml 2 | include python/fate_llm/evaluate/tasks/*/*yaml -------------------------------------------------------------------------------- /doc/images/fate-llm-chatglm-6b.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/FederatedAI/FATE-LLM/HEAD/doc/images/fate-llm-chatglm-6b.png -------------------------------------------------------------------------------- /python/requirements.txt: -------------------------------------------------------------------------------- 1 | accelerate==0.27.2 2 | deepspeed==0.13.3 3 | peft==0.8.2 4 | sentencepiece==0.2.0 5 | lm_eval==0.4.2 6 | rouge-score==0.1.2 7 | datasets==2.18.0 8 | editdistance 9 | torch==2.3.1 10 | transformers==4.37.2 11 | opacus==1.4.1 12 | fastchat 13 | Jinja2 14 | sentence-transformers 15 | openai 16 | -------------------------------------------------------------------------------- /python/fate_llm/dataset/data_config/__init__.py: -------------------------------------------------------------------------------- 1 | import os 2 | # absolute path to current directory 3 | parent_dir = os.path.dirname(os.path.realpath(__file__)) 4 | 5 | DATA_CONFIG_TEMPLATE = {"ag_news": os.path.join(parent_dir, "default_ag_news.yaml"), 6 | "yelp_review": os.path.join(parent_dir, "default_yelp_review.yaml"),} -------------------------------------------------------------------------------- /examples/fedmkt/test_fedmkt_llmsuit.yaml: -------------------------------------------------------------------------------- 1 | data: 2 | - file: 3 | table_name: arc_challenge 4 | namespace: experiment 5 | role: guest_0 6 | - file: 7 | table_name: arc_challenge 8 | namespace: experiment 9 | role: host_0 10 | bloom_lora_vs_zero_shot: 11 | gpt2_fedmkt: 12 | pretrained: "gpt2" 13 | script: "./fedmkt.py" 14 | conf: "./fedmkt_config.yaml" -------------------------------------------------------------------------------- /examples/offsite_tuning/test_offsite_tuning_llmsuite.yaml: -------------------------------------------------------------------------------- 1 | data: 2 | - file: 3 | table_name: sciq 4 | namespace: experiment 5 | role: guest_0 6 | - file: 7 | table_name: sciq 8 | namespace: experiment 9 | role: host_0 10 | bloom_lora_vs_zero_shot: 11 | gpt2_ot: 12 | pretrained: "gpt2" 13 | script: "./offsite_tuning.py" 14 | conf: "./offsite_tuning_config.yaml" -------------------------------------------------------------------------------- /python/fate_llm/evaluate/tasks/dolly_15k/default_dolly_15k.yaml: -------------------------------------------------------------------------------- 1 | dataset_kwargs: 2 | data_files: databricks-dolly-15k.jsonl 3 | dataset_path: json 4 | doc_to_target: '{{response}}' 5 | doc_to_text: !function 'dolly_utils.doc_to_text' 6 | metric_list: 7 | - aggregation: mean 8 | higher_is_better: true 9 | metric: !function 'dolly_utils.rouge_l' 10 | output_type: generate_until 11 | task: dolly-15k 12 | validation_split: train 13 | -------------------------------------------------------------------------------- /doc/tutorial/fedcot/README.md: -------------------------------------------------------------------------------- 1 | # FATE-LLM: FedCoT 2 | FedCoT (Federated Chain-of-Thought Distillation for Large Language Models) is a novel framework for privacy-preserving federated distillation of large language models. This implementation is integrated into the FATE-LLM framework. 
3 | 4 | For more details, please refer to the paper: ["FedCoT: Federated Chain-of-Thought Distillation for Large Language Models"](https://arxiv.org/pdf/2406.12403) 5 | -------------------------------------------------------------------------------- /python/fate_llm/evaluate/tasks/advertise_gen/default_advertise_gen.yaml: -------------------------------------------------------------------------------- 1 | dataset_kwargs: 2 | data_files: 3 | train: train.json 4 | validation: dev.json 5 | dataset_path: json 6 | doc_to_target: '{{summary}}' 7 | doc_to_text: '{{content}}' 8 | metric_list: 9 | - aggregation: mean 10 | higher_is_better: true 11 | metric: !function 'advertise_utils.rouge_l' 12 | output_type: generate_until 13 | task: advertise-gen 14 | validation_split: validation 15 | -------------------------------------------------------------------------------- /python/fate_llm/evaluate/tasks/advertise_gen/advertise_utils.py: -------------------------------------------------------------------------------- 1 | # adapted from https://github.com/huggingface/datasets/blob/main/metrics/rouge/rouge.py 2 | 3 | 4 | from rouge_score import rouge_scorer 5 | # from multiprocessing import Pool 6 | 7 | 8 | def rouge_l(predictions, references, use_stemmer=False): 9 | scorer = rouge_scorer.RougeScorer(rouge_types=['rougeL'], use_stemmer=use_stemmer) 10 | scores = [] 11 | for ref, pred in zip(references, predictions): 12 | score = scorer.score(ref, pred) 13 | scores.append(score) 14 | 15 | rouge_l_score = scores[0]['rougeL'].fmeasure 16 | return rouge_l_score 17 | -------------------------------------------------------------------------------- /doc/tutorial/fedmkt/README.md: -------------------------------------------------------------------------------- 1 | # FATE-LLM: FedMKT 2 | 3 | The algorithm is based on the paper ["FedMKT: Federated Mutual Knowledge Transfer for Large and Small Language Models"](https://arxiv.org/pdf/2406.02224). We integrate its code into the FATE-LLM framework. 4 | 5 | ## Citation 6 | If you publish work that uses FedMKT, please cite FedMKT as follows: 7 | ``` 8 | @article{fan2024fedmkt, 9 | title={FedMKT: Federated Mutual Knowledge Transfer for Large and Small Language Models}, 10 | author={Fan, Tao and Ma, Guoqiang and Kang, Yan and Gu, Hanlin and Fan, Lixin and Yang, Qiang}, 11 | journal={arXiv preprint arXiv:2406.02224}, 12 | year={2024} 13 | } 14 | ``` 15 | -------------------------------------------------------------------------------- /python/fate_llm/dataset/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | -------------------------------------------------------------------------------- /python/fate_llm/model_zoo/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved.
3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fdkt/cluster/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fdkt/utils/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fedmkt/utils/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | -------------------------------------------------------------------------------- /python/fate_llm/data/data_collator/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 
3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | -------------------------------------------------------------------------------- /python/fate_llm/data/tokenizers/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fedmkt/token_alignment/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | -------------------------------------------------------------------------------- /python/fate_llm/model_zoo/embedding_transformer/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | # 16 | -------------------------------------------------------------------------------- /python/fate_llm/algo/dp/opacus_compatibility/grad_sample/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | -------------------------------------------------------------------------------- /python/fate_llm/algo/dp/opacus_compatibility/optimizers/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | -------------------------------------------------------------------------------- /examples/pellm/test_pellm_llmsuite.yaml: -------------------------------------------------------------------------------- 1 | data: 2 | - file: examples/data/AdvertiseGen/train.json 3 | table_name: ad 4 | namespace: experiment 5 | role: guest_0 6 | - file: examples/data/AdvertiseGen/train.json 7 | table_name: ad 8 | namespace: experiment 9 | role: host_0 10 | bloom_lora_vs_zero_shot: 11 | bloom_lora: 12 | pretrained: "bloom-560m" 13 | script: "./test_bloom_lora.py" 14 | conf: "./bloom_lora_config.yaml" 15 | peft_path_format: "{{fate_base}}/fate_flow/model/{{job_id}}/guest/{{party_id}}/{{model_task_name}}/0/output/output_model/model_directory" 16 | tasks: 17 | - "advertise-gen" 18 | bloom_zero_shot: 19 | pretrained: "bloom-560m" 20 | tasks: 21 | - "advertise-gen" -------------------------------------------------------------------------------- /doc/tutorial/fdkt/README.md: -------------------------------------------------------------------------------- 1 | # FATE-LLM: FDKT 2 | The algorithm is based on paper [Federated Domain-Specific Knowledge Transfer on Large Language Models Using Synthetic Data](https://arxiv.org/pdf/2405.14212), 3 | a novel framework that enables domain-specific knowledge transfer from LLMs to SLMs while preserving SLM data privacy. 
4 | 5 | ## Citation 6 | If you publish work that uses FDKT, please cite FDKT as follows: 7 | ``` 8 | @article{li2024federated, 9 | title={Federated Domain-Specific Knowledge Transfer on Large Language Models Using Synthetic Data}, 10 | author={Li, Haoran and Zhao, Xinyuan and Guo, Dadi and Gu, Hanlin and Zeng, Ziqian and Han, Yuxing and Song, Yangqiu and Fan, Lixin and Yang, Qiang}, 11 | journal={arXiv preprint arXiv:2405.14212}, 12 | year={2024} 13 | } 14 | ``` 15 | -------------------------------------------------------------------------------- /python/fate_llm/algo/dp/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | from .opacus_compatibility.transformers_compate import get_model_class 17 | from .dp_trainer import DPTrainer, DPTrainingArguments 18 | -------------------------------------------------------------------------------- /python/fate_llm/data/data_collator/fedcot_collator.py: -------------------------------------------------------------------------------- 1 | from transformers import DataCollatorForSeq2Seq 2 | from transformers import AutoTokenizer 3 | import pandas as pd 4 | 5 | class PrefixDataCollator(DataCollatorForSeq2Seq): 6 | def __call__(self, features, return_tensors=None): 7 | features_df = pd.DataFrame(features) 8 | cot = super().__call__(list(features_df['predict']), return_tensors) 9 | label = super().__call__(list(features_df['rationale']), return_tensors) 10 | 11 | return { 12 | 'predict': cot, 13 | 'rationale': label 14 | } 15 | 16 | 17 | def get_prefix_data_collator(tokenizer_name_or_path): 18 | tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path) 19 | data_collator = PrefixDataCollator(tokenizer) 20 | return data_collator 21 | -------------------------------------------------------------------------------- /doc/tutorial/fedkseed/README.md: -------------------------------------------------------------------------------- 1 | ## FedKSeed 2 | 3 | The algorithm is based on the paper [Federated Full-Parameter Tuning of Billion-Sized Language Models 4 | with Communication Cost under 18 Kilobytes](https://arxiv.org/pdf/2312.06353.pdf), and the code is adapted 5 | from https://github.com/alibaba/FederatedScope/tree/FedKSeed. 6 | We refactored the code to make it more compatible with the transformers/PyTorch stack 7 | and integrated it into the FATE-LLM framework. 8 | 9 | The main components are (a conceptual sketch of the seed-based update follows this list): 10 | 1. A KSeedZerothOrderOptimizer class that optimizes a model along random directions regenerated from seeds. 11 | 2. A KSeedZOExtendedTrainer, a subclass of the transformers Trainer, that trains large language models with KSeedZerothOrderOptimizer. 12 | 3. Trainers for federated learning with large language models.
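To make the seed-based idea concrete, here is a minimal, self-contained PyTorch sketch of one zeroth-order (SPSA-style) update driven by a random seed. It is a conceptual illustration only: the names `_perturb` and `zo_step` and their signatures are assumptions made for this sketch, not the KSeedZerothOrderOptimizer API, and the full FedKSeed method additionally restricts sampling to K candidate seeds (see `k` in KSeedTrainingArguments) and aggregates the scalar gradients per seed.

```python
import torch


def _perturb(params, seed, coeff):
    # Regenerate the Gaussian direction z from `seed` and add coeff * z to every parameter.
    gen = torch.Generator().manual_seed(seed)
    for p in params:
        z = torch.randn(p.shape, generator=gen).to(dtype=p.dtype, device=p.device)
        p.add_(coeff * z)


@torch.no_grad()
def zo_step(params, loss_fn, seed, eps=5e-4, lr=1e-5):
    """One update: probe the loss at theta +/- eps*z, estimate the directional
    derivative along z, then descend along z by replaying the same seed.
    The eps default mirrors KSeedTrainingArguments.eps; lr is illustrative."""
    params = list(params)
    _perturb(params, seed, +eps)             # theta + eps * z
    loss_plus = float(loss_fn())
    _perturb(params, seed, -2.0 * eps)       # theta - eps * z
    loss_minus = float(loss_fn())
    _perturb(params, seed, +eps)             # back to theta
    grad_proj = (loss_plus - loss_minus) / (2.0 * eps)
    _perturb(params, seed, -lr * grad_proj)  # theta <- theta - lr * grad_proj * z
    return seed, grad_proj
```

In a federated round of this style, a client only needs to report its sequence of `(seed, grad_proj)` pairs; any other party can replay the identical parameter updates by regenerating the directions from the seeds, which is what keeps the per-round communication cost in the kilobyte range.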
-------------------------------------------------------------------------------- /python/fate_llm/algo/fdkt/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | from .fdkt_data_aug import ( 17 | FDKTSLM, 18 | FDKTLLM, 19 | FDKTTrainingArguments 20 | ) 21 | 22 | __all__ = [ 23 | "FDKTSLM", 24 | "FDKTLLM", 25 | "FDKTTrainingArguments" 26 | ] 27 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fedmkt/utils/tokenizer_tool.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | from transformers import AutoConfig 17 | 18 | 19 | def get_vocab_size(tokenizer_name_or_path): 20 | if tokenizer_name_or_path is not None: 21 | return AutoConfig.from_pretrained(tokenizer_name_or_path).vocab_size # read the vocabulary size from the model config 22 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fedmkt/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | 17 | from fate_llm.algo.fedmkt.fedmkt import ( 18 | FedMKTTrainingArguments, 19 | FedMKTSLM, 20 | FedMKTLLM 21 | ) 22 | 23 | __all__ = [ 24 | "FedMKTSLM", 25 | "FedMKTLLM", 26 | "FedMKTTrainingArguments" 27 | ] 28 | -------------------------------------------------------------------------------- /python/fate_llm/algo/inferdpt/init/_init.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved.
3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | 17 | from fate.arch import Context 18 | from typing import Union 19 | 20 | 21 | class InferInit(object): 22 | 23 | def __init__(self, ctx: Context): 24 | self.ctx = ctx 25 | 26 | def get_inst(self): 27 | pass 28 | 29 | -------------------------------------------------------------------------------- /python/fate_llm/inference/inference_base.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | 17 | from typing import List 18 | 19 | 20 | class Inference(object): 21 | 22 | def __init__(self): 23 | pass 24 | 25 | def inference(self, docs: List[str], inference_kwargs: dict = {}) -> List[str]: 26 | raise NotImplementedError() -------------------------------------------------------------------------------- /python/fate_llm/algo/fedmkt/token_alignment/spectal_token_mapping.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | import transformers 17 | 18 | 19 | TOKENIZER_TO_SPECIAL_TOKEN = { 20 | transformers.LlamaTokenizer: '▁', 21 | transformers.LlamaTokenizerFast: '▁', 22 | transformers.GPTNeoXTokenizerFast: 'Ġ', 23 | transformers.GPT2TokenizerFast: 'Ġ', 24 | transformers.GPT2Tokenizer: 'Ġ', 25 | transformers.BloomTokenizerFast: 'Ġ', 26 | } 27 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fedmkt/utils/vars_define.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 
3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | PER_STEP_LOGITS = "per_step_logits" 17 | PER_STEP_INDICES = "per_step_indices" 18 | METRIC = "metric" 19 | 20 | ALIGNED_OTHER_LOGITS = "aligned_other_logits" 21 | ALIGNED_OTHER_INDICES = "aligned_other_indices" 22 | ALIGNED_OTHER_METRIC = "aligned_other_metrice" 23 | 24 | SELF_TARGET_DIST = "llm_target_distribution" 25 | OTHER_TARGET_DIST = "slm_target_distribution" 26 | 27 | INPUT_KEYS = {"input_ids", "attention_mask", "labels"} 28 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fedcot/slm_encoder_decoder_trainer.py: -------------------------------------------------------------------------------- 1 | from fate_llm.trainer.seq2seq_trainer import Seq2SeqTrainer 2 | from transformers import DataCollatorForSeq2Seq 3 | from transformers import AutoTokenizer 4 | import pandas as pd 5 | 6 | 7 | class EDPrefixDataCollator(DataCollatorForSeq2Seq): 8 | def __call__(self, features, return_tensors=None): 9 | features_df = pd.DataFrame(features) 10 | a = super().__call__(list(features_df['encoder']), return_tensors) 11 | b = super().__call__(list(features_df['decoder']), return_tensors) 12 | 13 | return { 14 | 'encoder': a, 15 | 'decoder': b 16 | } 17 | 18 | 19 | class EncoderDecoderPrefixTrainer(Seq2SeqTrainer): 20 | 21 | def __init__(self, alpha=0.5, *args, **kwargs): 22 | super().__init__(*args, **kwargs) 23 | self.alpha = alpha 24 | 25 | def compute_loss(self, model, inputs, return_outputs=False): 26 | out_a = model(**inputs['encoder']) 27 | out_b = model(**inputs['decoder']) 28 | loss = self.alpha * out_a.loss + (1. - self.alpha) * out_b.loss 29 | return (loss, {'out_a': out_a, 'out_b': out_b}) if return_outputs else loss 30 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fedkseed/args.py: -------------------------------------------------------------------------------- 1 | from dataclasses import dataclass, field 2 | 3 | 4 | @dataclass 5 | class KSeedTrainingArguments: 6 | """ 7 | TrainingArguments is the subset of the arguments we use in our example scripts, they are the arguments that 8 | 9 | Parameters: 10 | optim: optional, default is KSeedZO 11 | The optimizer to use. 12 | eps: optional, default is 0.0005 13 | Epsilon value for KSeedZerothOrderOptimizer. 14 | grad_clip: optional, default is -100.0 15 | Gradient clip value for KSeedZerothOrderOptimizer. 16 | """ 17 | 18 | zo_optim: bool = field( 19 | default=True, 20 | metadata={"help": "Whether to use KSeedZerothOrderOptimizer. This suppress `optim` argument when True."}, 21 | ) 22 | k: int = field( 23 | default=4096, 24 | metadata={"help": "The number of seed candidates to use. 
This suppress `seed_candidates` argument when > 1."}, 25 | ) 26 | eps: float = field(default=0.0005, metadata={"help": "Epsilon value for KSeedZerothOrderOptimizer."}) 27 | grad_clip: float = field(default=-100.0, metadata={"help": "Gradient clip value for KSeedZerothOrderOptimizer."}) 28 | -------------------------------------------------------------------------------- /python/fate_llm/algo/dp/opacus_compatibility/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | from .grad_sample.embedding import compute_embedding_grad_sample 17 | from .optimizers.optimizer import add_noise_wrapper 18 | 19 | 20 | def add_layer_compatibility(opacus): 21 | replace_method = [] 22 | for k, v in opacus.GradSampleModule.GRAD_SAMPLERS.items(): 23 | if v.__name__ == "compute_embedding_grad_sample": 24 | replace_method.append(k) 25 | 26 | for k in replace_method: 27 | opacus.GradSampleModule.GRAD_SAMPLERS[k] = compute_embedding_grad_sample 28 | 29 | 30 | def add_optimizer_compatibility(optimizer): 31 | add_noise_wrapper(optimizer) 32 | -------------------------------------------------------------------------------- /python/fate_llm/algo/inferdpt/_encode_decode.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | 17 | from fate.arch import Context 18 | from typing import List, Dict 19 | import logging 20 | 21 | 22 | logger = logging.getLogger(__name__) 23 | 24 | 25 | class EncoderDecoder(object): 26 | 27 | def __init__(self, ctx: Context) -> None: 28 | self.ctx = ctx 29 | 30 | def encode(self, docs: List[Dict[str, str]], format_template: str): 31 | pass 32 | 33 | def decode(self, docs: List[Dict[str, str]], format_template: str ): 34 | pass 35 | 36 | def inference(self, docs: List[Dict[str, str]], inference_kwargs: dict = {}, format_template: str = None): 37 | pass -------------------------------------------------------------------------------- /python/fate_llm/algo/fdkt/utils/invalid_data_filter.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 
3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | INVALID_CHARACTERS = "".join([' ', '-', '.', '_', '~', '/', '\\', '*', '|', '#']) 17 | LEAST_WORDS = 10 18 | 19 | 20 | def filter_invalid_data(data_dict): 21 | sample_num = len(data_dict["inputs"]) 22 | new_data_dict = dict( 23 | inputs=list(), 24 | labels=list() 25 | ) 26 | for idx in range(sample_num): 27 | text = data_dict["inputs"][idx].strip(INVALID_CHARACTERS) 28 | if len(text.split()) < LEAST_WORDS: 29 | continue 30 | 31 | new_data_dict["inputs"].append(text) 32 | new_data_dict["labels"].append(data_dict["labels"][idx]) 33 | 34 | return new_data_dict 35 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fdkt/inference_inst.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | def api_init(api_url: str, model_name: str, api_key: str = 'EMPTY', api_timeout=3600): 17 | from fate_llm.inference.api import APICompletionInference 18 | return APICompletionInference( 19 | api_url=api_url, 20 | model_name=model_name, 21 | api_key=api_key, 22 | api_timeout=api_timeout 23 | ) 24 | 25 | 26 | def vllm_init(model_path: str, num_gpu=1, dtype='float16', gpu_memory_utilization=0.9): 27 | from fate_llm.inference.vllm import VLLMInference 28 | return VLLMInference( 29 | model_path=model_path, 30 | num_gpu=num_gpu, 31 | dtype=dtype, 32 | gpu_memory_utilization=gpu_memory_utilization 33 | ) 34 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fdkt/cluster/cluster_method.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | # 16 | from sklearn.cluster import KMeans 17 | 18 | 19 | class KMeansRunner(object): 20 | def __init__(self, n_clusters, **other_cluster_args): 21 | self.n_clusters = n_clusters 22 | self.other_cluster_args = other_cluster_args 23 | 24 | def fit(self, x): 25 | model = KMeans(n_clusters=self.n_clusters, **self.other_cluster_args) 26 | model.fit(x) 27 | 28 | return model.labels_ 29 | 30 | 31 | def get_cluster_runner(method, n_clusters, **other_cluster_args): 32 | if method.lower() == "kmeans": 33 | return KMeansRunner(n_clusters, **other_cluster_args) 34 | else: 35 | raise ValueError(f"cluster method={method} is not implemented") 36 | -------------------------------------------------------------------------------- /examples/pellm/bloom_lora_config.yaml: -------------------------------------------------------------------------------- 1 | data: 2 | guest: 3 | namespace: experiment 4 | name: ad 5 | host: 6 | namespace: experiment 7 | name: ad 8 | epoch: 1 9 | batch_size: 4 10 | lr: 5e-4 11 | pretrained_model_path: bloom-560m 12 | peft_config: 13 | alpha_pattern: {} 14 | auto_mapping: null 15 | base_model_name_or_path: null 16 | bias: none 17 | fan_in_fan_out: false 18 | inference_mode: false 19 | init_lora_weights: true 20 | layers_pattern: null 21 | layers_to_transform: null 22 | loftq_config: { } 23 | lora_alpha: 32 24 | lora_dropout: 0.1 25 | megatron_config: null 26 | megatron_core: megatron.core 27 | modules_to_save: null 28 | peft_type: LORA 29 | r: 8 30 | rank_pattern: { } 31 | revision: null 32 | target_modules: 33 | - query_key_value 34 | task_type: CAUSAL_LM 35 | use_rslora: false 36 | ds_config: 37 | fp16: 38 | enabled: true 39 | gradient_accumulation_steps: 1 40 | optimizer: 41 | params: 42 | adam_w_mode: false 43 | lr: 5e-4 44 | torch_adam: true 45 | type: Adam 46 | train_micro_batch_size_per_gpu: 4 47 | zero_optimization: 48 | allgather_bucket_size: 100000000.0 49 | allgather_partitions: true 50 | contiguous_gradients: true 51 | offload_optimizer: 52 | device: cpu 53 | offload_param: 54 | device: cpu 55 | overlap_comm: true 56 | reduce_bucket_size: 100000000.0 57 | reduce_scatter: true 58 | stage: 2 59 | -------------------------------------------------------------------------------- /python/fate_llm/model_zoo/hf_model.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | # 16 | import torch 17 | from transformers import AutoModelForCausalLM 18 | 19 | 20 | class HFAutoModelForCausalLM: 21 | 22 | def __init__(self, pretrained_model_name_or_path, *model_args, **kwargs) -> None: 23 | self.pretrained_model_name_or_path = pretrained_model_name_or_path 24 | self.model_args = model_args 25 | self.kwargs = kwargs 26 | if "torch_dtype" in self.kwargs and self.kwargs["torch_dtype"] != "auto": 27 | dtype = self.kwargs.pop("torch_dtype") 28 | self.kwargs["torch_dtype"] = getattr(torch, dtype) 29 | 30 | def load(self): 31 | model = AutoModelForCausalLM.from_pretrained( 32 | self.pretrained_model_name_or_path, *self.model_args, **self.kwargs 33 | ) 34 | return model 35 | -------------------------------------------------------------------------------- /python/fate_llm/model_zoo/pellm/opt.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | from transformers import OPTConfig 17 | from transformers import OPTForCausalLM 18 | from fate_llm.model_zoo.pellm.parameter_efficient_llm import PELLM 19 | 20 | 21 | class OPT(PELLM): 22 | 23 | config_class = OPTConfig 24 | model_loader = OPTForCausalLM 25 | 26 | def __init__(self, config: dict = None, 27 | pretrained_path: str = None, 28 | peft_type: str = None, 29 | peft_config: dict = None, 30 | **kwargs 31 | ) -> None: 32 | 33 | if config is None and pretrained_path is None: 34 | config = OPTConfig().to_dict() # use default model setting 35 | super().__init__(config=config, pretrained_path=pretrained_path, 36 | peft_type=peft_type, peft_config=peft_config, **kwargs) 37 | -------------------------------------------------------------------------------- /python/fate_llm/model_zoo/pellm/qwen.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | # 16 | from transformers import Qwen2Config 17 | from transformers import Qwen2ForCausalLM 18 | from fate_llm.model_zoo.pellm.parameter_efficient_llm import PELLM 19 | 20 | 21 | class Qwen(PELLM): 22 | 23 | config_class = Qwen2Config 24 | model_loader = Qwen2ForCausalLM 25 | 26 | def __init__(self, config: dict = None, 27 | pretrained_path: str = None, 28 | peft_type: str = None, 29 | peft_config: dict = None, 30 | **kwargs 31 | ) -> None: 32 | 33 | if config is None and pretrained_path is None: 34 | config = Qwen2Config().to_dict() # use default model setting 35 | super().__init__(config=config, pretrained_path=pretrained_path, 36 | peft_type=peft_type, peft_config=peft_config, **kwargs) 37 | -------------------------------------------------------------------------------- /python/fate_llm/model_zoo/pellm/bloom.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | from transformers import BloomConfig 17 | from transformers import BloomForCausalLM 18 | from fate_llm.model_zoo.pellm.parameter_efficient_llm import PELLM 19 | 20 | 21 | class Bloom(PELLM): 22 | 23 | config_class = BloomConfig 24 | model_loader = BloomForCausalLM 25 | 26 | def __init__(self, config: dict = None, 27 | pretrained_path: str = None, 28 | peft_type: str = None, 29 | peft_config: dict = None, 30 | **kwargs 31 | ) -> None: 32 | 33 | if config is None and pretrained_path is None: 34 | config = BloomConfig().to_dict() # use default model setting 35 | super().__init__(config=config, pretrained_path=pretrained_path, 36 | peft_type=peft_type, peft_config=peft_config, **kwargs) 37 | -------------------------------------------------------------------------------- /python/fate_llm/inference/api.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | # 16 | 17 | from fate_llm.inference.inference_base import Inference 18 | from transformers import AutoModelForCausalLM, AutoTokenizer 19 | from transformers import GenerationConfig 20 | from typing import List 21 | 22 | 23 | class APICompletionInference(Inference): 24 | 25 | def __init__(self, api_url: str, model_name: str, api_key: str = 'EMPTY', api_timeout=3600): 26 | from openai import OpenAI 27 | self.model_name = model_name 28 | self.client = OpenAI( 29 | api_key=api_key, 30 | base_url=api_url, 31 | timeout=api_timeout 32 | ) 33 | 34 | def inference(self, docs: List[str], inference_kwargs: dict = {}) -> List[str]: 35 | completion = self.client.completions.create(model=self.model_name, prompt=docs, **inference_kwargs) 36 | rs_doc = [completion.choices[i].text for i in range(len(completion.choices))] 37 | return rs_doc -------------------------------------------------------------------------------- /python/fate_llm/algo/fdkt/cluster/cluster.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | from typing import List 17 | from .cluster_method import get_cluster_runner 18 | 19 | 20 | class SentenceCluster(object): 21 | def __init__(self, model, cluster_method="kmeans", n_clusters=8, **other_cluster_args): 22 | self.model = model 23 | self.cluster_method = cluster_method 24 | self.n_clusters = n_clusters 25 | self.other_cluster_args = other_cluster_args 26 | 27 | def get_embeddings(self, sentences: List[str]): 28 | return self.model.encode(sentences) 29 | 30 | def cluster(self, sentences): 31 | embeddings = self.get_embeddings(sentences) 32 | 33 | cluster_runner = get_cluster_runner(method=self.cluster_method, 34 | n_clusters=self.n_clusters, 35 | **self.other_cluster_args) 36 | 37 | cluster_rets = cluster_runner.fit(embeddings) 38 | 39 | return cluster_rets 40 | -------------------------------------------------------------------------------- /python/fate_llm/inference/hf_qw.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | # 16 | 17 | from fate_llm.inference.inference_base import Inference 18 | from transformers import AutoModelForCausalLM, AutoTokenizer 19 | from typing import List 20 | import tqdm 21 | 22 | 23 | class QwenHFCompletionInference(Inference): 24 | 25 | def __init__(self, model, tokenizer): 26 | self.model = model 27 | self.tokenizer = tokenizer 28 | 29 | def inference(self, docs: List[str], inference_kwargs: dict = {}) -> List[str]: 30 | self.model = self.model.eval() 31 | rs_list = [] 32 | for d in tqdm.tqdm(docs): 33 | inputs = self.tokenizer(d, return_tensors='pt') 34 | inputs = inputs.to(self.model.device) 35 | inputs.update(inference_kwargs) 36 | pred = self.model.generate(**inputs) 37 | response = self.tokenizer.decode(pred.cpu()[0][len(inputs['input_ids'][0]):], skip_special_tokens=True) 38 | rs_list.append(response) 39 | self.model = self.model.train() 40 | return rs_list 41 | 42 | -------------------------------------------------------------------------------- /examples/offsite_tuning/offsite_tuning_config.yaml: -------------------------------------------------------------------------------- 1 | # params.yaml 2 | 3 | paths: 4 | pretrained_model_path: 'gpt2' 5 | 6 | pipeline: 7 | guest: '9999' 8 | arbiter: '9999' 9 | namespace: 'experiment' 10 | name: 'sciq' 11 | engine_run: 12 | cores: 1 13 | 14 | training: 15 | batch_size: 1 16 | learning_rate: 5e-5 17 | num_train_epochs: 1 18 | logging_steps: 10 19 | deepspeed: 20 | train_micro_batch_size_per_gpu: 1 21 | optimizer: 22 | type: "Adam" 23 | params: 24 | lr: 5e-5 25 | torch_adam: true 26 | adam_w_mode: false 27 | fp16: 28 | enabled: true 29 | gradient_accumulation_steps: 1 30 | zero_optimization: 31 | stage: 2 32 | allgather_partitions: true 33 | allgather_bucket_size: 1e8 34 | overlap_comm: true 35 | reduce_scatter: true 36 | reduce_bucket_size: 1e8 37 | contiguous_gradients: true 38 | offload_optimizer: 39 | device: "cpu" 40 | offload_param: 41 | device: "cpu" 42 | 43 | models: 44 | client: 45 | module_name: 'offsite_tuning.gpt2' 46 | item_name: 'GPT2LMHeadSubModel' 47 | emulator_layer_num: 11 48 | adapter_top_layer_num: 2 49 | adapter_bottom_layer_num: 2 50 | 51 | server: 52 | module_name: 'offsite_tuning.gpt2' 53 | item_name: 'GPT2LMHeadMainModel' 54 | emulator_layer_num: 11 55 | adapter_top_layer_num: 2 56 | adapter_bottom_layer_num: 2 57 | 58 | dataset: 59 | module_name: 'qa_dataset' 60 | item_name: 'QaDataset' 61 | tokenizer_name_or_path: 'gpt2' 62 | select_num: 100 63 | 64 | data_collator: 65 | module_name: 'data_collator.cust_data_collator' 66 | item_name: 'get_seq2seq_data_collator' 67 | tokenizer_name_or_path: 'gpt2' 68 | -------------------------------------------------------------------------------- /python/fate_llm/evaluate/scripts/data_cli.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2024 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | # 16 | 17 | import os 18 | import copy 19 | import click 20 | import yaml 21 | import warnings 22 | 23 | from typing import Union 24 | from ._options import LlmSharedOptions 25 | from ..utils.llm_evaluator import download_task 26 | from ..utils._io import echo 27 | 28 | @click.command('download_data') 29 | @click.option('-t', '--tasks', required=False, type=str, multiple=True, default=None, 30 | help='tasks whose data will be downloaded') 31 | # @click.argument('other_args', nargs=-1) 32 | @LlmSharedOptions.get_shared_options(hidden=True) 33 | @click.pass_context 34 | def download_data(ctx, tasks, **kwargs): 35 | """ 36 | Evaluate a pretrained model with specified parameters. 37 | """ 38 | ctx.obj.update(**kwargs) 39 | ctx.obj.post_process() 40 | 41 | if tasks is None or len(tasks) == 0: 42 | tasks = None 43 | echo.echo(f"No task is given, will download data for all built-in tasks.", fg='red') 44 | else: 45 | echo.echo(f"given tasks: {tasks}", fg='red') 46 | download_task(tasks) 47 | -------------------------------------------------------------------------------- /python/fate_llm/dataset/data_config/default_ag_news.yaml: -------------------------------------------------------------------------------- 1 | dataset_kwargs: 2 | data_files: ag_news_review/AGnews/train.json 3 | dataset_path: json 4 | doc_to_target: '{{label}}' 5 | metric_list: 6 | - aggregation: mean 7 | higher_is_better: true 8 | metric: accuracy 9 | output_type: generate_until 10 | task: ag-news 11 | validation_split: train 12 | label_key: label 13 | text_key: text 14 | sub_domain: AGnews 15 | few_shot_num_per_label: 2 16 | tokenize_format: "Product type: {{sub_domain}} | Text Category: {{label}}" 17 | few_shot_format: "- : {{label}}.\n- : {{text}}\n\n" 18 | augment_format: "The news' topics belong to the following 4 categories: 0.world 1.sports 2.business 3.science and technology. Please generate news according to the following format, bearing in mind that the generated results should not resemble the examples, but should align with the specified category: \n" 19 | text_with_label_format: "******\n {{i}}.\nNews: {{text}}\nCategory: {{label}}.\n" 20 | filter_format: "I will give you some news samples with their categories, The news' topics belong to the following 4 categories: 0.world 1.sports 2.business 3.science and technology. the samples are delimited by '******':\n {text_with_label} Please filter out texts that are ambiguous, do not belong to news or do not meet the categories, and leave news texts that meet the categories.\n You should also filter out news text that are too similar to other samples and keep the most representative ones. Your answer should begin with 'The eligible samples:\n\n' and the indexes of the texts you choose, use spaces to separate the indexes and do not provide duplicate indices or indices that exceed the maximum index of samples." 21 | label_list: 22 | - 'world' 23 | - 'sports' 24 | - 'business' 25 | - 'science and technology' -------------------------------------------------------------------------------- /python/fate_llm/algo/dp/opacus_compatibility/transformers_compate.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 
6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | import torch 17 | import transformers 18 | from fate_llm.model_zoo.pellm.parameter_efficient_llm import PELLM 19 | from transformers.modeling_utils import unwrap_model 20 | 21 | 22 | def get_model_class(model): 23 | if isinstance(model, PELLM): 24 | model = model._pe_lm 25 | 26 | model = unwrap_model(model) 27 | 28 | return model.__class__ 29 | 30 | 31 | def prepare_position_ids(model, input_ids): 32 | if get_model_class(model) == transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel: 33 | return _get_position_ids_for_gpt2(input_ids) 34 | else: 35 | raise ValueError(f"Can not prepare position_ids for model_type={model.__class__}") 36 | 37 | 38 | def _get_position_ids_for_gpt2(input_ids): 39 | past_length = 0 40 | position_ids = torch.arange(past_length, input_ids.shape[-1] + past_length, dtype=torch.long, 41 | device=input_ids.device) 42 | position_ids = position_ids.unsqueeze(0) 43 | position_ids = position_ids.repeat(input_ids.shape[0], 1) 44 | 45 | return position_ids 46 | -------------------------------------------------------------------------------- /python/fate_llm/evaluate/tasks/dolly_15k/dolly_utils.py: -------------------------------------------------------------------------------- 1 | # adopted from https://github.com/huggingface/datasets/blob/main/metrics/rouge/rouge.py 2 | 3 | 4 | from rouge_score import rouge_scorer 5 | 6 | 7 | def rouge_l(predictions, references, use_stemmer=False): 8 | scorer = rouge_scorer.RougeScorer(rouge_types=['rougeL'], use_stemmer=use_stemmer) 9 | scores = [] 10 | for ref, pred in zip(references, predictions): 11 | score = scorer.score(ref, pred) 12 | scores.append(score) 13 | 14 | rouge_l_score = scores[0]['rougeL'].fmeasure 15 | return rouge_l_score 16 | 17 | def doc_to_text(doc): 18 | if doc["context"]: 19 | return f"context: {doc['context']}\ninstruction: {doc['instruction']}\nresponse:" 20 | else: 21 | return f"instruction: {doc['instruction']}\nresponse:" 22 | 23 | """ 24 | def train_load_evalaute_lm(): 25 | pipeline.fit(train_data) 26 | lm = OTModelLoader().load(path, **args) 27 | from fate_llm.evaluator import evaluator 28 | # general case 29 | evaluator.evaluate(lm, task="dolly_15k", **args) 30 | 31 | # user modified conf 32 | config = evaluator.get_task_template(task="dolly_15k") # return dict copy of yaml file 33 | config['dataset_kwargs'] = {"dataset_kwargs": 34 | {"data_files": 35 | {"test": './dolly_15k_test.csv', 36 | "dev": './dolly_15k_dev.csv'}}} 37 | # may provide arbitrary export path, must be of dir, create temp dir under the given path: {$export_path}/temp_dir 38 | new_task_dir = evaluator.export_config(config, task="dolly_15k", export_path=None) 39 | result = evaluator.evalute(lm, task="dolly_15k", include_path=new_task_dir, **args) 40 | print(result) # dict 41 | evaluator.delete_config(new_task_dir) 42 | """ -------------------------------------------------------------------------------- /python/fate_llm/algo/dp/opacus_compatibility/optimizers/optimizer.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) Meta 
Platforms, Inc. and affiliates. 3 | # Copyright 2019 The FATE Authors. All Rights Reserved. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | # 17 | import types 18 | from opacus.optimizers.optimizer import ( 19 | _check_processed_flag, 20 | _generate_noise, 21 | _mark_as_processed 22 | ) 23 | 24 | 25 | # modified from https://github.com/pytorch/opacus/blob/main/opacus/optimizers/optimizer.py#L424 26 | # avoid dtype error when summed_grad's dtype isn't torch.float32 27 | def add_noise(self): 28 | """ 29 | Adds noise to clipped gradients. Stores clipped and noised result in ``p.grad`` 30 | """ 31 | 32 | for p in self.params: 33 | _check_processed_flag(p.summed_grad) 34 | 35 | noise = _generate_noise( 36 | std=self.noise_multiplier * self.max_grad_norm, 37 | reference=p.summed_grad, 38 | generator=self.generator, 39 | secure_mode=self.secure_mode, 40 | ) 41 | noise = noise.to(p.summed_grad.dtype) 42 | p.grad = (p.summed_grad + noise).view_as(p) 43 | 44 | _mark_as_processed(p.summed_grad) 45 | 46 | 47 | def add_noise_wrapper(optimizer): 48 | optimizer.add_noise = types.MethodType(add_noise, optimizer) 49 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fedcot/encoder_decoder/init/default_init.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | # 16 | 17 | from fate_llm.algo.inferdpt.init._init import InferInit 18 | from fate_llm.inference.api import APICompletionInference 19 | from fate_llm.algo.fedcot.encoder_decoder.slm_encoder_decoder import SLMEncoderDecoderClient, SLMEncoderDecoderServer 20 | 21 | 22 | class FedCoTEDAPIClientInit(InferInit): 23 | 24 | api_url = '' 25 | api_model_name = '' 26 | api_key = 'EMPTY' 27 | 28 | def __init__(self, ctx): 29 | super().__init__(ctx) 30 | self.ctx = ctx 31 | 32 | def get_inst(self): 33 | inference = APICompletionInference(api_url=self.api_url, model_name=self.api_model_name, api_key=self.api_key) 34 | client = SLMEncoderDecoderClient(self.ctx, inference) 35 | return client 36 | 37 | 38 | class FedCoTEDAPIServerInit(InferInit): 39 | 40 | api_url = '' 41 | api_model_name = '' 42 | api_key = 'EMPTY' 43 | 44 | def __init__(self, ctx): 45 | super().__init__(ctx) 46 | self.ctx = ctx 47 | 48 | def get_inst(self): 49 | inference = APICompletionInference(api_url=self.api_url, model_name=self.api_model_name, api_key=self.api_key) 50 | return SLMEncoderDecoderServer(self.ctx, inference) 51 | -------------------------------------------------------------------------------- /python/fate_llm/inference/vllm.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | 17 | from fate_llm.inference.inference_base import Inference 18 | from transformers import AutoModelForCausalLM, AutoTokenizer 19 | from transformers import GenerationConfig 20 | import logging 21 | from typing import List 22 | 23 | 24 | logger = logging.getLogger(__name__) 25 | 26 | 27 | class VLLMInference(Inference): 28 | 29 | def __init__(self, model_path, num_gpu=1, dtype='float16', gpu_memory_utilization=0.9): 30 | from vllm import LLM 31 | self.llm = LLM(model=model_path, trust_remote_code=True, dtype=dtype, tensor_parallel_size=num_gpu, gpu_memory_utilization=gpu_memory_utilization) 32 | logger.info('vllm model init done, model path is {}'.format(model_path)) 33 | 34 | def inference(self, docs: List[str], inference_kwargs: dict = {}) -> List[str]: 35 | 36 | from vllm import SamplingParams 37 | param = SamplingParams(**inference_kwargs) 38 | outputs = self.llm.generate( 39 | prompts=docs, 40 | sampling_params=param) 41 | 42 | rs = [] 43 | for output in outputs: 44 | prompt = output.prompt 45 | generated_text = output.outputs[0].text 46 | rs.append(generated_text) 47 | 48 | return rs 49 | 50 | 51 | -------------------------------------------------------------------------------- /python/fate_llm/model_zoo/pellm/bart.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 
6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | from transformers import BartConfig, AutoConfig 17 | from transformers import BartForSequenceClassification 18 | from fate_llm.model_zoo.pellm.parameter_efficient_llm import PELLM 19 | 20 | 21 | class Bart(PELLM): 22 | config_class = BartConfig 23 | model_loader = BartForSequenceClassification 24 | 25 | def __init__(self, config: dict = None, 26 | pretrained_path: str = None, 27 | peft_type: str = None, 28 | peft_config: dict = None, 29 | **kwargs) -> None: 30 | 31 | if pretrained_path is not None: 32 | self.check_config(pretrain_path=pretrained_path) 33 | if config is None and pretrained_path is None: 34 | config = BartConfig().to_dict() 35 | super().__init__( 36 | config=config, 37 | pretrained_path=pretrained_path, 38 | peft_type=peft_type, 39 | peft_config=peft_config, 40 | **kwargs) 41 | 42 | def check_config(self, pretrain_path): 43 | config = AutoConfig.from_pretrained(pretrain_path) 44 | assert isinstance( 45 | config, BartConfig), 'The config of pretrained model must be BartConfig, but got {}'.format( 46 | type(config)) 47 | -------------------------------------------------------------------------------- /python/fate_llm/model_zoo/pellm/bert.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | # 16 | from transformers import BertConfig, AutoConfig 17 | from transformers import BertForSequenceClassification 18 | from fate_llm.model_zoo.pellm.parameter_efficient_llm import PELLM 19 | 20 | 21 | class Bert(PELLM): 22 | config_class = BertConfig 23 | model_loader = BertForSequenceClassification 24 | 25 | def __init__(self, config: dict = None, 26 | pretrained_path: str = None, 27 | peft_type: str = None, 28 | peft_config: dict = None, 29 | **kwargs) -> None: 30 | 31 | if pretrained_path is not None: 32 | self.check_config(pretrain_path=pretrained_path) 33 | if config is None and pretrained_path is None: 34 | config = BertConfig().to_dict() 35 | super().__init__( 36 | config=config, 37 | pretrained_path=pretrained_path, 38 | peft_type=peft_type, 39 | peft_config=peft_config, 40 | **kwargs) 41 | 42 | def check_config(self, pretrain_path): 43 | config = AutoConfig.from_pretrained(pretrain_path) 44 | assert isinstance( 45 | config, BertConfig), 'The config of pretrained model must be BertConfig, but got {}'.format( 46 | type(config)) 47 | -------------------------------------------------------------------------------- /python/fate_llm/evaluate/scripts/fate_llm_cli.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2024 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | 17 | 18 | import click 19 | import yaml 20 | 21 | from typing import Union 22 | from .eval_cli import run_evaluate 23 | from .config_cli import eval_config_group 24 | from .data_cli import download_data 25 | from ._options import LlmSharedOptions 26 | 27 | 28 | commands = { 29 | "evaluate": run_evaluate, 30 | "config": eval_config_group, 31 | "download": download_data 32 | } 33 | 34 | 35 | class FATELlmCLI(click.MultiCommand): 36 | 37 | def list_commands(self, ctx): 38 | return list(commands) 39 | 40 | def get_command(self, ctx, name): 41 | if name not in commands and name in commands_alias: 42 | name = commands_alias[name] 43 | if name not in commands: 44 | ctx.fail("No such command '{}'.".format(name)) 45 | return commands[name] 46 | 47 | @click.command(cls=FATELlmCLI, help="A collection of tools to run FATE Llm Evaluation.", 48 | context_settings=dict(help_option_names=["-h", "--help"])) 49 | @LlmSharedOptions.get_shared_options() 50 | @click.pass_context 51 | def fate_llm_cli(ctx, **kwargs): 52 | ctx.ensure_object(LlmSharedOptions) 53 | ctx.obj.update(**kwargs) 54 | 55 | 56 | if __name__ == '__main__': 57 | fate_llm_cli(obj=LlmSharedOptions()) -------------------------------------------------------------------------------- /python/fate_llm/evaluate/utils/_io.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 
6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | import click 17 | import loguru 18 | 19 | 20 | # noinspection PyPep8Naming 21 | class echo(object): 22 | _file = None 23 | 24 | @classmethod 25 | def set_file(cls, file): 26 | cls._file = file 27 | 28 | @classmethod 29 | def echo(cls, message, **kwargs): 30 | click.secho(message, **kwargs) 31 | click.secho(message, file=cls._file, **kwargs) 32 | 33 | @classmethod 34 | def sep_line(cls): 35 | click.secho("-------------------------------------------------") 36 | 37 | @classmethod 38 | def file(cls, message, **kwargs): 39 | click.secho(message, file=cls._file, **kwargs) 40 | 41 | @classmethod 42 | def stdout(cls, message, **kwargs): 43 | click.secho(message, **kwargs) 44 | 45 | @classmethod 46 | def stdout_newline(cls): 47 | click.secho("") 48 | 49 | @classmethod 50 | def welcome(cls): 51 | 52 | cls.echo("Welcome to FATE Llm Evaluator") 53 | 54 | @classmethod 55 | def flush(cls): 56 | import sys 57 | sys.stdout.flush() 58 | 59 | 60 | def set_logger(name): 61 | loguru.logger.remove() 62 | loguru.logger.add(name, level='ERROR', delay=True) 63 | return loguru.logger 64 | 65 | 66 | LOGGER = loguru.logger 67 | -------------------------------------------------------------------------------- /python/fate_llm/model_zoo/pellm/roberta.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | # 16 | from transformers import RobertaConfig, AutoConfig 17 | from transformers import RobertaForSequenceClassification 18 | from fate_llm.model_zoo.pellm.parameter_efficient_llm import PELLM 19 | 20 | 21 | class Roberta(PELLM): 22 | config_class = RobertaConfig 23 | model_loader = RobertaForSequenceClassification 24 | 25 | def __init__(self, config: dict = None, 26 | pretrained_path: str = None, 27 | peft_type: str = None, 28 | peft_config: dict = None, 29 | **kwargs) -> None: 30 | 31 | if pretrained_path is not None: 32 | self.check_config(pretrain_path=pretrained_path) 33 | if config is None and pretrained_path is None: 34 | config = RobertaConfig().to_dict() 35 | super().__init__( 36 | config=config, 37 | pretrained_path=pretrained_path, 38 | peft_type=peft_type, 39 | peft_config=peft_config, 40 | **kwargs) 41 | 42 | def check_config(self, pretrain_path): 43 | config = AutoConfig.from_pretrained(pretrain_path) 44 | assert isinstance( 45 | config, RobertaConfig), 'The config of pretrained model must be RobertaConfig, but got {}'.format( 46 | type(config)) 47 | -------------------------------------------------------------------------------- /python/fate_llm/evaluate/scripts/config_cli.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2024 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | 17 | 18 | import click 19 | import yaml 20 | from pathlib import Path 21 | from ..utils.config import create_eval_config, default_eval_config 22 | from ._options import LlmSharedOptions 23 | from ..utils._io import echo 24 | 25 | @click.group("eval_config", help="fate_llm evaluate config") 26 | def eval_config_group(): 27 | """ 28 | eval_config fate_llm 29 | """ 30 | pass 31 | 32 | 33 | @eval_config_group.command(name="new") 34 | def _new(): 35 | """ 36 | create new fate_llm eval config from template 37 | """ 38 | create_eval_config(Path("llm_eval_config.yaml")) 39 | click.echo(f"create eval_config file: llm_eval_config.yaml") 40 | 41 | 42 | @eval_config_group.command(name="edit") 43 | @LlmSharedOptions.get_shared_options(hidden=True) 44 | @click.pass_context 45 | def _edit(ctx, **kwargs): 46 | """ 47 | edit fate_llm eval_config file 48 | """ 49 | ctx.obj.update(**kwargs) 50 | eval_config = ctx.obj.get("eval_config") 51 | print(f"eval_config: {eval_config}") 52 | click.edit(filename=eval_config) 53 | 54 | 55 | @eval_config_group.command(name="show") 56 | def _show(): 57 | """ 58 | show fate_test default eval_config path 59 | """ 60 | click.echo(f"default eval_config path is {default_eval_config()}") 61 | -------------------------------------------------------------------------------- /python/fate_llm/model_zoo/pellm/deberta.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 
3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | from transformers import DebertaConfig, AutoConfig 17 | from transformers import DebertaForSequenceClassification 18 | from fate_llm.model_zoo.pellm.parameter_efficient_llm import PELLM 19 | 20 | 21 | class Deberta(PELLM): 22 | 23 | config_class = DebertaConfig 24 | model_loader = DebertaForSequenceClassification 25 | 26 | def __init__(self, config: dict = None, 27 | pretrained_path: str = None, 28 | peft_type: str = None, 29 | peft_config: dict = None, 30 | **kwargs) -> None: 31 | 32 | if pretrained_path is not None: 33 | self.check_config(pretrain_path=pretrained_path) 34 | if config is None and pretrained_path is None: 35 | config = DebertaConfig().to_dict() 36 | super().__init__( 37 | config=config, 38 | pretrained_path=pretrained_path, 39 | peft_type=peft_type, 40 | peft_config=peft_config, 41 | **kwargs) 42 | 43 | def check_config(self, pretrain_path): 44 | config = AutoConfig.from_pretrained(pretrain_path) 45 | assert isinstance( 46 | config, DebertaConfig), 'The config of pretrained model must be DebertaConfig, but got {}'.format( 47 | type(config)) 48 | -------------------------------------------------------------------------------- /python/fate_llm/model_zoo/pellm/chatglm.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | # 16 | from fate_llm.model_zoo.pellm.parameter_efficient_llm import PELLM 17 | from transformers import AutoConfig 18 | 19 | 20 | class ChatGLM(PELLM): 21 | def __init__(self, 22 | pretrained_path: str = None, 23 | peft_type: str = None, 24 | peft_config: dict = None, 25 | pre_seq_len: int = None, 26 | prefix_projection: bool = False, 27 | **kwargs) -> None: 28 | 29 | self.pre_seq_len = pre_seq_len 30 | self.prefix_projection = prefix_projection 31 | 32 | super().__init__(pretrained_path=pretrained_path, 33 | peft_type=peft_type, 34 | peft_config=peft_config, 35 | **kwargs 36 | ) 37 | 38 | def init_config(self): 39 | self.config = AutoConfig.from_pretrained( 40 | self.config_path, trust_remote_code=True) 41 | self.config.pre_seq_len = self.pre_seq_len 42 | self.config.prefix_projection = self.prefix_projection 43 | 44 | def add_peft(self): 45 | if self.pre_seq_len: 46 | self._pe_lm.half() 47 | self._pe_lm.transformer.prefix_encoder.float() 48 | else: 49 | super(ChatGLM, self).add_peft() 50 | -------------------------------------------------------------------------------- /python/fate_llm/model_zoo/pellm/distilbert.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | from transformers import DistilBertConfig, AutoConfig 17 | from transformers import DistilBertForSequenceClassification 18 | from fate_llm.model_zoo.pellm.parameter_efficient_llm import PELLM 19 | 20 | 21 | class DistilBert(PELLM): 22 | config_class = DistilBertConfig 23 | model_loader = DistilBertForSequenceClassification 24 | 25 | def __init__(self, config: dict = None, 26 | pretrained_path: str = None, 27 | peft_type: str = None, 28 | peft_config: dict = None, 29 | **kwargs) -> None: 30 | 31 | if pretrained_path is not None: 32 | self.check_config(pretrain_path=pretrained_path) 33 | if config is None and pretrained_path is None: 34 | config = DistilBertConfig().to_dict() 35 | super().__init__( 36 | config=config, 37 | pretrained_path=pretrained_path, 38 | peft_type=peft_type, 39 | peft_config=peft_config, 40 | **kwargs) 41 | 42 | def check_config(self, pretrain_path): 43 | config = AutoConfig.from_pretrained(pretrain_path) 44 | assert isinstance( 45 | config, DistilBertConfig), 'The config of pretrained model must be DistilBertConfig, but got {}'.format( 46 | type(config)) 47 | -------------------------------------------------------------------------------- /python/fate_llm/model_zoo/pellm/albert.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 
6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | from transformers import AlbertConfig, AutoConfig 17 | from transformers import AlbertForSequenceClassification 18 | from fate_llm.model_zoo.pellm.parameter_efficient_llm import PELLM 19 | 20 | 21 | class Albert(PELLM): 22 | 23 | config_class = AlbertConfig 24 | model_loader = AlbertForSequenceClassification 25 | 26 | def __init__(self, config: dict = None, 27 | pretrained_path: str = None, 28 | peft_type: str = None, 29 | peft_config: dict = None, 30 | **kwargs 31 | ) -> None: 32 | 33 | if pretrained_path is not None: 34 | self.check_config(pretain_path=pretrained_path) 35 | if config is None and pretrained_path is None: 36 | config = AlbertConfig().to_dict() # use default model setting 37 | super().__init__( 38 | config=config, 39 | pretrained_path=pretrained_path, 40 | peft_type=peft_type, 41 | peft_config=peft_config, 42 | **kwargs) 43 | 44 | def check_config(self, pretain_path): 45 | config = AutoConfig.from_pretrained(pretain_path) 46 | assert isinstance( 47 | config, AlbertConfig), 'The config of pretrained model must be AlbertConfig, but got {}'.format( 48 | type(config)) 49 | -------------------------------------------------------------------------------- /doc/tutorial/offsite_tuning/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Offsite-Tuning 3 | 4 | ## Standard Offsite-tuning 5 | 6 | Offsite-Tuning is designed for the efficient adaptation of large foundational models for specific downstream tasks. 7 | Through Offsite-Tuning, the model owner can enhance the capabilities of large models using data providers without having to disclose the full model weights and directly access the data providers' sensitive information. Specifically, the LLM owner sends a lightweight "Adapter" and a lossy compressed "Emulator" to the data owner. Using these smaller components, the data owner can then fine-tune the model solely on their private data. The Adapter, once fine-tuned, is returned to the model owner and integrated back into the large model to enhance its performance on the specific dataset. 8 | 9 | In FATE-LLM 1.3, we provide these built-in models: 10 | 11 | - GPT2 series models (e.g., GPT2, GPT2-XL, etc.) 12 | - Bloom series models (such as Bloom7B) 13 | - Llama-1 series models (e.g., Llama7B) 14 | 15 | FATE-LLM v1.3 builds on v1.2 and offers the ability to easily configure multi-machine and multi-card acceleration. It also has specialized optimizations for the network transmission of adapters and emulators. 16 | 17 | 18 | [Read the full paper](https://arxiv.org/abs/2302.04870) 19 | 20 |
21 | 22 |
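To make the exchange concrete, the sketch below walks through the adapter/emulator split and the final plug-back step. This is a toy illustration only, not the FATE-LLM implementation (FATE provides ready-made sub-models such as `GPT2LMHeadSubModel` and `GPT2LMHeadMainModel`, configured via `emulator_layer_num` and the adapter layer counts as in the example config earlier in this repository); the block type, layer counts and the uniform-subset compression used here are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn


def split_for_offsite_tuning(full_blocks: nn.ModuleList,
                             adapter_bottom_layer_num: int,
                             adapter_top_layer_num: int,
                             emulator_layer_num: int):
    """Return (bottom adapter, frozen emulator, top adapter) for the data owner."""
    # The adapters are copies of the outermost blocks; they stay trainable.
    bottom = copy.deepcopy(full_blocks[:adapter_bottom_layer_num])
    top = copy.deepcopy(full_blocks[len(full_blocks) - adapter_top_layer_num:])
    # The emulator is a lossy stand-in for the middle of the model:
    # here simply a frozen, uniformly sampled subset of the middle blocks.
    middle = full_blocks[adapter_bottom_layer_num:len(full_blocks) - adapter_top_layer_num]
    step = max(1, len(middle) // emulator_layer_num)
    emulator = copy.deepcopy(middle[::step])[:emulator_layer_num]
    for p in emulator.parameters():
        p.requires_grad = False
    return bottom, emulator, top


# Model owner: split a 24-block toy model (nn.Linear stands in for a transformer block).
full = nn.ModuleList([nn.Linear(16, 16) for _ in range(24)])
bottom, emulator, top = split_for_offsite_tuning(full, 2, 2, 11)

# Data owner: fine-tune only the adapters against the frozen emulator on private data.
student = nn.Sequential(*bottom, *emulator, *top)
_ = student(torch.randn(4, 16))  # forward pass; a real run would also backprop into the adapters

# Model owner: plug the returned adapter weights back into the full model.
for i, blk in enumerate(bottom):
    full[i].load_state_dict(blk.state_dict())
for i, blk in enumerate(top):
    full[len(full) - len(top) + i].load_state_dict(blk.state_dict())
```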
23 | 24 | ## Offsite-tuning with Federated Learning 25 | 26 | In addition to supporting standard two-party (model owner and data provider) offsite-tuning, FATE also supports offsite-tuning with multiple data providers simultaneously. Adapters can be fine-tuned locally and then aggregated with those from other data providers. Ultimately, large models can be enhanced through the secure aggregation of adapters from multiple parties. This approach can be used to address issues related to the uneven distribution of computational power and data. 27 | As shown in the diagram below: 28 | 29 | 30 |
31 | 32 |
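Schematically, the aggregation step can be pictured as a sample-count-weighted average of the adapter weights returned by each data provider, after which the merged adapter is plugged back into the large model exactly as in the two-party case. The snippet below is only a minimal sketch of that data flow; in FATE the combination happens under its secure aggregation protocol rather than with plaintext weights, and the function and key names here are illustrative.

```python
from typing import Dict, List

import torch


def aggregate_adapters(adapter_states: List[Dict[str, torch.Tensor]],
                       sample_counts: List[int]) -> Dict[str, torch.Tensor]:
    """Weighted-average the adapter state dicts returned by the data providers."""
    total = float(sum(sample_counts))
    merged = {}
    for key in adapter_states[0]:
        merged[key] = sum(
            state[key] * (n / total)
            for state, n in zip(adapter_states, sample_counts)
        )
    return merged


# e.g. three data providers, each returning its locally fine-tuned adapter weights
states = [{"adapter.weight": torch.randn(4, 4)} for _ in range(3)]
merged = aggregate_adapters(states, sample_counts=[1000, 500, 2500])
```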
-------------------------------------------------------------------------------- /python/fate_llm/evaluate/utils/data_tools.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2024 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | 17 | 18 | def download_data(data_dir, data_url, is_tar=True): 19 | import os 20 | import requests 21 | import tarfile 22 | import io 23 | 24 | # Create data directory 25 | if not os.path.exists(data_dir): 26 | os.makedirs(data_dir) 27 | 28 | # Download data 29 | try: 30 | response = requests.get(data_url) 31 | if response.status_code == 200: 32 | if is_tar: 33 | # extract tar file and write to data_dir 34 | with tarfile.open(fileobj=io.BytesIO(response.content), mode='r:gz') as tar: 35 | for member in tar.getmembers(): 36 | # check if member is a file 37 | if member.isreg(): 38 | member.name = os.path.join(data_dir, os.path.basename(member.name)) 39 | tar.extract(member) 40 | else: 41 | # write to data_dir 42 | with open(os.path.join(data_dir, os.path.basename(data_url)), 'wb') as f: 43 | f.write(response.content) 44 | return True 45 | else: 46 | print(f"Error downloading file: {response.status_code}") 47 | return False 48 | 49 | except Exception as e: 50 | print(f"Error downloading file: {e}") 51 | return False 52 | -------------------------------------------------------------------------------- /python/fate_llm/model_zoo/pellm/gpt2.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | # 16 | from transformers import GPT2Config, AutoConfig 17 | from transformers import GPT2ForSequenceClassification, AutoModelForCausalLM 18 | from fate_llm.model_zoo.pellm.parameter_efficient_llm import PELLM 19 | 20 | 21 | class GPT2(PELLM): 22 | config_class = GPT2Config 23 | model_loader = GPT2ForSequenceClassification 24 | 25 | def __init__(self, 26 | config: dict = None, 27 | pretrained_path: str = None, 28 | peft_type: str = None, 29 | peft_config: dict = None, 30 | **kwargs) -> None: 31 | 32 | if pretrained_path is not None: 33 | self.check_config(pretrain_path=pretrained_path) 34 | if config is None and pretrained_path is None: 35 | config = GPT2Config().to_dict() 36 | super().__init__( 37 | config=config, 38 | pretrained_path=pretrained_path, 39 | peft_type=peft_type, 40 | peft_config=peft_config, 41 | **kwargs) 42 | 43 | def check_config(self, pretrain_path): 44 | config = AutoConfig.from_pretrained(pretrain_path) 45 | assert isinstance( 46 | config, GPT2Config), 'The config of pretrained model must be GPT2Config, but got {}'.format( 47 | type(config)) 48 | 49 | 50 | class GPT2CLM(GPT2): 51 | model_loader = AutoModelForCausalLM 52 | -------------------------------------------------------------------------------- /python/fate_llm/data/tokenizers/cust_tokenizer.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | #
16 | from transformers import AutoTokenizer
17 | 
18 | 
19 | def get_tokenizer(
20 |         tokenizer_name_or_path,
21 |         trust_remote_code=False,
22 |         padding_side=None,
23 |         pad_token=None,
24 |         bos_token=None,
25 |         eos_token=None,
26 |         pad_token_id=None,
27 |         bos_token_id=None,
28 |         eos_token_id=None,
29 |         add_eos_token=True,
30 | ):
31 |     tokenizer = AutoTokenizer.from_pretrained(
32 |         tokenizer_name_or_path,
33 |         trust_remote_code=trust_remote_code,
34 |         add_eos_token=add_eos_token
35 |     )
36 |     if padding_side is not None:
37 |         tokenizer.padding_side = padding_side
38 |     if pad_token is not None:
39 |         tokenizer.add_special_tokens({'pad_token': pad_token})
40 |     if bos_token is not None:
41 |         tokenizer.add_special_tokens({'bos_token': bos_token})
42 |     if eos_token is not None:
43 |         tokenizer.add_special_tokens({"eos_token": eos_token})
44 |     if pad_token_id is not None:
45 |         tokenizer.pad_token_id = pad_token_id
46 |     if bos_token_id is not None:
47 |         tokenizer.bos_token_id = bos_token_id
48 |     if eos_token_id is not None:
49 |         tokenizer.eos_token_id = eos_token_id
50 | 
51 |     if "llama" in tokenizer_name_or_path.lower() or "gpt2" in tokenizer_name_or_path.lower():
52 |         tokenizer.pad_token = tokenizer.eos_token
53 | 
54 |     return tokenizer
55 | 
-------------------------------------------------------------------------------- /doc/tutorial/pellm/builtin_pellm_models.md: --------------------------------------------------------------------------------
1 | ## Builtin PELLM Models
2 | FATE-LLM provides several built-in PELLM models so that users can efficiently fine-tune their language models with minimal setup.
3 | To get started, please read the [ChatGLM3-6B Training Guide](./ChatGLM3-6B_ds.ipynb).
4 | After working through that tutorial, any other model listed in the table below can be used by simply changing the `module_name`, `class_name`, and `dataset` settings accordingly.
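For a rough picture of what switching the model class means in code (the path and LoRA hyper-parameters below are placeholders, and the `peft_type`/`peft_config` usage follows the constructor signatures shown earlier in this repository rather than an official recipe), a built-in wrapper from the table below can be constructed directly:

```python
from peft import LoraConfig, TaskType
from fate_llm.model_zoo.pellm.bloom import Bloom

# Placeholder LoRA settings; tune these for your task.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=['query_key_value'],
)

# Swapping to another row of the table (e.g. pellm.llama / LLaMa) only changes the import.
model = Bloom(
    pretrained_path='/path/to/bloom-7b1',   # placeholder path
    peft_type='LoraConfig',
    peft_config=lora_config.to_dict(),
)
```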
5 | 6 | 7 | 8 | | Model | ModuleName | ClassName | DataSetName | 9 | | -------------- | ----------------- | --------------| --------------- | 10 | | Qwen2 | pellm.qwen | Qwen | prompt_dataset | 11 | | Bloom-7B1 | pellm.bloom | Bloom | prompt_dataset | 12 | | OPT-6.7B | pellm.opt | OPT | prompt_dataset | 13 | | LLaMA-2-7B | pellm.llama | LLaMa | prompt_dataset | 14 | | LLaMA-7B | pellm.llama | LLaMa | prompt_dataset | 15 | | ChatGLM3-6B | pellm.chatglm | ChatGLM | prompt_dataset | 16 | | GPT-2 | pellm.gpt2 | GPT2CLM | prompt_dataset | 17 | | GPT-2 | pellm.gpt2 | GPT2 | seq_cls_dataset | 18 | | ALBERT | pellm.albert | Albert | seq_cls_dataset | 19 | | BART | pellm.bart | Bart | seq_cls_dataset | 20 | | BERT | pellm.bert | Bert | seq_cls_dataset | 21 | | DeBERTa | pellm.deberta | Deberta | seq_cls_dataset | 22 | | DistilBERT | pellm.distilbert | DistilBert | seq_cls_dataset | 23 | | RoBERTa | pellm.roberta | Roberta | seq_cls_dataset | 24 | -------------------------------------------------------------------------------- /python/fate_llm/dataset/data_config/default_yelp_review.yaml: -------------------------------------------------------------------------------- 1 | dataset_kwargs: 2 | data_files: yelp_review/Health/train.json 3 | dataset_path: json 4 | doc_to_target: '{{label}}' 5 | metric_list: 6 | - aggregation: mean 7 | higher_is_better: true 8 | metric: accuracy 9 | output_type: generate_until 10 | task: yelp-review 11 | label_key: stars 12 | text_key: text 13 | validation_split: train 14 | sub_domain: Health 15 | few_shot_num_per_label: 2 16 | tokenize_format: "Product type: {{sub_domain}} | Review Score: {{label}}" 17 | text_with_label_format: "******\n {{i}}.\nReview: {{text}}\nRating stars: {{label}}.\n" 18 | few_shot_format: "******\n- : {{label}} stars.\n- : {{text}}\n\n" 19 | augment_format: "The reviews are rated from 1 to 5 stars, with 1 being the worst, 3 being neutral and 5 being the best. Please generate more similar samples for each rating star about the Health domain as shown in the following format, bearing in mind that the generated results should not copy or resemble the examples, and should align with the {{sub_domain}} domain and the rating stars.\nThe examples are delimited by '******'." 20 | filter_format: "I will give you some customer review text samples with their rating stars, these samples are indexed starting from 0, the samples are delimited by '******':\n {{text_with_label}}. These reviews gradually shift from negative to positive from 1 star to 5 stars. 1 star represents the worst, 2 stars are better than 1 star, but still indicate a negative review. 3 stars represent a neutral review. 4 stars indicate a positive review, but less positive than 5 stars. 5 stars represent perfection.\n Please filter out text that does not belong to customer reviews or does not meet the rating stars, and leave review texts that meet the labels.\n You should also filter out text that are too similar to other samples and keep the most representative ones. Your answer should begin with 'The eligible samples:\n\n' and the indexes of the texts you choose, use spaces to separate the indexes and do not provide duplicate indices or indices that exceed the maximum index of samples." 
21 | label_list: 22 | - 1 23 | - 2 24 | - 3 25 | - 4 26 | - 5 -------------------------------------------------------------------------------- /python/fate_llm/algo/inferdpt/init/default_init.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | from fate_llm.algo.inferdpt.init._init import InferInit 17 | from fate_llm.inference.api import APICompletionInference 18 | from fate_llm.algo.inferdpt import inferdpt 19 | from fate_llm.algo.inferdpt.utils import InferDPTKit 20 | from fate_llm.algo.inferdpt.inferdpt import InferDPTClient, InferDPTServer 21 | 22 | 23 | class InferDPTAPIClientInit(InferInit): 24 | 25 | api_url = '' 26 | api_model_name = '' 27 | api_key = 'EMPTY' 28 | inferdpt_kit_path = '' 29 | eps = 3.0 30 | 31 | def __init__(self, ctx): 32 | super().__init__(ctx) 33 | self.ctx = ctx 34 | 35 | def get_inst(self)-> InferDPTClient: 36 | inference = APICompletionInference(api_url=self.api_url, model_name=self.api_model_name, api_key=self.api_key) 37 | kit = InferDPTKit.load_from_path(self.inferdpt_kit_path) 38 | inferdpt_client = inferdpt.InferDPTClient(self.ctx, kit, inference, epsilon=self.eps) 39 | return inferdpt_client 40 | 41 | 42 | class InferDPTAPIServerInit(InferInit): 43 | 44 | api_url = '' 45 | api_model_name = '' 46 | api_key = 'EMPTY' 47 | 48 | def __init__(self, ctx): 49 | super().__init__(ctx) 50 | self.ctx = ctx 51 | 52 | def get_inst(self)-> InferDPTServer: 53 | inference = APICompletionInference(api_url=self.api_url, model_name=self.api_model_name, api_key=self.api_key) 54 | inferdpt_server = inferdpt.InferDPTServer(self.ctx,inference_inst=inference) 55 | return inferdpt_server 56 | -------------------------------------------------------------------------------- /python/fate_llm/evaluate/utils/model_tools.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2024 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | #
16 | 
17 | import os
18 | from transformers import AutoModel, AutoTokenizer
19 | from lm_eval.models.huggingface import HFLM
20 | 
21 | 
22 | def load_model_from_path(model_path, peft_path=None, peft_config=None, model_args=None):
23 |     model_args = model_args or {}
24 |     if peft_path is None:
25 |         if os.path.isfile(model_path):
26 |             return HFLM(pretrained=model_path, **model_args)
27 |         else:
28 |             raise ValueError(f"given model path is not valid, please check: {model_path}")
29 |     else:
30 |         import torch
31 |         from peft import PeftModel, PeftConfig, LoraConfig, TaskType, get_peft_model
32 |         tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
33 |         model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
34 |         model.half()
35 |         model.eval()
36 |         peft_config = peft_config or {}
37 |         peft_config = LoraConfig(**peft_config)
38 |         model = get_peft_model(model, peft_config)
39 |         model.load_state_dict(torch.load(peft_path), strict=False)
40 |         model.model.half()
41 |         return HFLM(pretrained=model, tokenizer=tokenizer, **model_args)
42 | 
43 | 
44 | def load_model(model_path, peft_path=None, model_args=None):
45 |     model_args = model_args or {}
46 |     return HFLM(pretrained=model_path, peft_path=peft_path, **model_args)
47 | 
48 | 
49 | def load_by_loader(loader_name=None, loader_conf_path=None, peft_path=None):
50 |     # @todo: find loader fn & return loaded model
51 |     pass
-------------------------------------------------------------------------------- /python/fate_llm/algo/fedkseed/pytorch_utils.py: --------------------------------------------------------------------------------
1 | from typing import List
2 | 
3 | from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS
4 | from transformers.trainer_pt_utils import get_parameter_names
5 | 
6 | 
7 | def get_decay_parameter_names(model) -> List[str]:
8 |     """
9 |     Get all parameter names that weight decay will be applied to
10 | 
11 |     Note that some models implement their own layernorm instead of calling nn.LayerNorm, weight decay could still
12 |     apply to those modules since this function only filter out instance of nn.LayerNorm
13 | 
14 |     NOTE: This function is copied from transformers
15 |     # Copyright 2020-present the HuggingFace Inc. team.
16 |     #
17 |     # Licensed under the Apache License, Version 2.0 (the "License");
18 |     # you may not use this file except in compliance with the License.
19 |     # You may obtain a copy of the License at
20 |     #
21 |     # http://www.apache.org/licenses/LICENSE-2.0
22 |     #
23 |     # Unless required by applicable law or agreed to in writing, software
24 |     # distributed under the License is distributed on an "AS IS" BASIS,
25 |     # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
26 |     # See the License for the specific language governing permissions and
27 |     # limitations under the License.
28 |     """
29 |     decay_parameters = get_parameter_names(model, ALL_LAYERNORM_LAYERS)
30 |     decay_parameters = [name for name in decay_parameters if "bias" not in name]
31 |     return decay_parameters
32 | 
33 | 
34 | def get_optimizer_parameters_grouped_with_decay(model, weight_decay: float) -> List[dict]:
35 |     """
36 |     Get the parameters grouped by whether they should have weight decay applied
37 |     """
38 |     decay_parameters = get_decay_parameter_names(model)
39 |     params_no_decay = []
40 |     params_decay = []
41 |     for n, p in model.named_parameters():
42 |         if p.requires_grad:
43 |             if n in decay_parameters:
44 |                 params_decay.append(p)
45 |             else:
46 |                 params_no_decay.append(p)
47 |     grouped_parameters_with_decay = [
48 |         {"params": params_no_decay, "weight_decay": 0.0},
49 |         {"params": params_decay, "weight_decay": weight_decay},
50 |     ]
51 |     return grouped_parameters_with_decay
52 | 
--------------------------------------------------------------------------------
/python/fate_llm/data/data_collator/cust_data_collator.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2019 The FATE Authors. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 | from transformers.data import data_collator
17 | from ..tokenizers.cust_tokenizer import get_tokenizer
18 | 
19 | 
20 | def get_data_collator(data_collator_name,
21 |                       tokenizer_name_or_path=None,
22 |                       pad_token=None,
23 |                       bos_token=None,
24 |                       eos_token=None,
25 |                       pad_token_id=None,
26 |                       bos_token_id=None,
27 |                       eos_token_id=None,
28 |                       trust_remote_code=False, **kwargs):
29 |     if not hasattr(data_collator, data_collator_name):
30 |         support_collator_list = list(filter(lambda module_name: "collator" in module_name.lower(), dir(data_collator)))
31 |         raise ValueError(f"data_collator name={data_collator_name} is not in the supported list={support_collator_list}")
32 | 
33 |     tokenizer = get_tokenizer(tokenizer_name_or_path=tokenizer_name_or_path,
34 |                               pad_token=pad_token,
35 |                               bos_token=bos_token,
36 |                               eos_token=eos_token,
37 |                               pad_token_id=pad_token_id,
38 |                               bos_token_id=bos_token_id,
39 |                               eos_token_id=eos_token_id,
40 |                               trust_remote_code=trust_remote_code)
41 | 
42 |     return getattr(data_collator, data_collator_name)(tokenizer, **kwargs)
43 | 
44 | 
45 | def get_seq2seq_data_collator(tokenizer_name_or_path, **kwargs):
46 |     return get_data_collator("DataCollatorForSeq2Seq", tokenizer_name_or_path=tokenizer_name_or_path, **kwargs)
47 | 
--------------------------------------------------------------------------------
/python/fate_llm/dataset/fedcot_dataset.py:
--------------------------------------------------------------------------------
1 | from fate_llm.dataset.input_output_dataset import InputOutputDataset
2 | from transformers.trainer_pt_utils import LabelSmoother
3 | from typing import List, Dict, Union, Literal
4 | import logging
5 | from jinja2 import Template
6 | from transformers import AutoTokenizer
7 | 
8 | 
9 | logger = logging.getLogger(__name__)
10 | 
11
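# A minimal sketch of how the Jinja2 templates consumed by PrefixDataset below are used
# (the field names in the sample record are illustrative, not fixed by the class):
#
#     from jinja2 import Template
#     record = {"question": "1+1=?", "answer": "2", "rationale": "Add the two numbers."}
#     Template("Q: {{question}}\nA:").render(record)                        # -> "Q: 1+1=?\nA:"
#     Template("{{rationale}} So the answer is {{answer}}.").render(record)
#
# PrefixDataset renders every record twice, once with the predict templates and once with the
# rationale templates, and get_tokenized_item then tokenizes both rendered views.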
| 12 | class PrefixDataset(InputOutputDataset): 13 | 14 | def __init__(self, 15 | tokenizer_path, 16 | predict_input_template: str, 17 | predict_output_template: str, 18 | rationale_input_template: str, 19 | rationale_output_template: str, 20 | max_input_length: int = 256, 21 | max_target_length: int = 256, 22 | load_from: Literal['jsonl', 'hf_load_from_disk', 'hf_load_dataset'] = 'hf_load_from_disk', 23 | split_key: str = None 24 | ): 25 | 26 | super().__init__(tokenizer_path, predict_input_template, predict_output_template, max_input_length, max_target_length, load_from, split_key) 27 | self.r_input_template = Template(rationale_input_template) 28 | self.r_output_template = Template(rationale_output_template) 29 | 30 | def load_rationale(self, result_list, key='rationale'): 31 | for d, r in zip(self.dataset, result_list): 32 | d[key] = r 33 | 34 | def get_str_item(self, i) -> dict: 35 | 36 | data_item = self.dataset[i] 37 | p_in = self.input_template.render(data_item) 38 | p_out = self.output_template.render(data_item) 39 | r_in = self.r_input_template.render(data_item) 40 | r_out = self.r_output_template.render(data_item) 41 | ret_dict = { 42 | 'predict':{ 43 | 'input': p_in, 44 | 'output': p_out 45 | }, 46 | 'rationale':{ 47 | 'input': r_in, 48 | 'output': r_out 49 | } 50 | } 51 | return ret_dict 52 | 53 | def get_tokenized_item(self, i) -> dict: 54 | 55 | str_item = self.get_str_item(i) 56 | ret_dict = { 57 | 'predict': self._process_item(str_item['predict']), 58 | 'rationale': self._process_item(str_item['rationale']) 59 | } 60 | 61 | return ret_dict 62 | -------------------------------------------------------------------------------- /python/fate_llm/model_zoo/pellm/llama.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | # 16 | from fate_llm.model_zoo.pellm.parameter_efficient_llm import PELLM 17 | from transformers import AutoConfig 18 | from transformers import LlamaConfig 19 | from transformers import LlamaForCausalLM 20 | 21 | 22 | class LLaMa(PELLM): 23 | config_class = LlamaConfig 24 | 25 | def __init__(self, 26 | pretrained_path: str = None, 27 | peft_type: str = None, 28 | peft_config: dict = None, 29 | **kwargs) -> None: 30 | 31 | super().__init__(pretrained_path=pretrained_path, 32 | peft_type=peft_type, 33 | peft_config=peft_config, 34 | **kwargs) 35 | 36 | def init_base_lm(self, **kwargs): 37 | if self.config is not None: 38 | self._pe_lm = LlamaForCausalLM.from_pretrained(self.config_path, 39 | config=self.config, 40 | torch_dtype=self.torch_dtype, 41 | **kwargs) 42 | elif self.config_path is not None: 43 | self._pe_lm = LlamaForCausalLM.from_pretrained(self.config_path, torch_dtype=self.torch_dtype, **kwargs) 44 | else: 45 | raise ValueError( 46 | 'config_path to pretrained model folder cannot be None') 47 | 48 | def check_config(self, pretrain_path): 49 | config = AutoConfig.from_pretrained(pretrain_path) 50 | assert isinstance( 51 | config, LlamaConfig), 'The config of pretrained model must be LlamaConfig, but got {}'.format( 52 | type(config)) 53 | -------------------------------------------------------------------------------- /python/fate_llm/algo/dp/opacus_compatibility/grad_sample/embedding.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) Meta Platforms, Inc. and affiliates. 3 | # Copyright 2019 The FATE Authors. All Rights Reserved. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | # 17 | import torch 18 | import torch.nn as nn 19 | from typing import Dict 20 | 21 | 22 | # the function is modified from https://github.com/pytorch/opacus/blob/main/opacus/grad_sample/embedding.py#L25, 23 | # avoid dtype error when backprops's dtype isn't torch.float32 24 | def compute_embedding_grad_sample( 25 | layer: nn.Embedding, activations: torch.Tensor, backprops: torch.Tensor 26 | ) -> Dict[nn.Parameter, torch.Tensor]: 27 | """ 28 | Computes per sample gradients for ``nn.Embedding`` layer. 
29 | 30 | Args: 31 | layer: Layer 32 | activations: Activations 33 | backprops: Backpropagations 34 | """ 35 | activations = activations[0] 36 | ret = {} 37 | if layer.weight.requires_grad: 38 | saved = torch.backends.cudnn.deterministic 39 | torch.backends.cudnn.deterministic = True 40 | 41 | batch_size = activations.shape[0] 42 | if batch_size == 0: 43 | ret[layer.weight] = torch.zeros_like(layer.weight).unsqueeze(0) 44 | return ret 45 | 46 | index = ( 47 | activations.unsqueeze(-1) 48 | .expand(*activations.shape, layer.embedding_dim) 49 | .reshape(batch_size, -1, layer.embedding_dim) 50 | ) 51 | grad_sample = torch.zeros( 52 | batch_size, *layer.weight.shape, device=layer.weight.device, dtype=backprops.dtype 53 | ) 54 | grad_sample.scatter_add_( 55 | 1, index, backprops.reshape(batch_size, -1, layer.embedding_dim) 56 | ) 57 | torch.backends.cudnn.deterministic = saved 58 | ret[layer.weight] = grad_sample 59 | 60 | return ret 61 | -------------------------------------------------------------------------------- /python/setup.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # Copyright 2024 The FATE Authors. All Rights Reserved. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 
16 | # 17 | 18 | from setuptools import find_packages, setup 19 | 20 | # Define the packages and modules 21 | packages = find_packages(".") 22 | package_data = {"": ["*"]} 23 | 24 | # Define dependencies 25 | install_requires = [ 26 | "accelerate==0.27.2", 27 | "deepspeed==0.13.3", 28 | "peft==0.8.2", 29 | "sentencepiece==0.2.0", 30 | "lm_eval==0.4.2", 31 | "rouge-score==0.1.2", 32 | "datasets==2.18.0", 33 | "editdistance", 34 | "torch==2.3.1", 35 | "transformers==4.37.2", 36 | "opacus==1.4.1", 37 | "fastchat", 38 | "Jinja2", 39 | "sentence-transformers", 40 | "openai" 41 | ] 42 | 43 | # Define the entry points for command-line tools 44 | entry_points = { 45 | "console_scripts": [ 46 | "fate_llm = fate_llm.evaluate.scripts.fate_llm_cli:fate_llm_cli" 47 | ] 48 | } 49 | 50 | extras_require = { 51 | "fate": ["pyfate==2.2.0"], 52 | "fate_flow": ["fate_flow==2.2.0"], 53 | "fate_client": ["fate_client==2.2.0"] 54 | } 55 | 56 | # Configure and call the setup function 57 | setup_kwargs = { 58 | "name": "fate_llm", 59 | "version": "2.2.0", 60 | "description": "Federated Learning for Large Language Models", 61 | "long_description": "Federated Learning for Large Language Models (FATE-LLM) provides a framework to train and evaluate large language models in a federated manner.", 62 | "long_description_content_type": "text/markdown", 63 | "author": "FederatedAI", 64 | "author_email": "contact@FedAI.org", 65 | "url": "https://fate.fedai.org/", 66 | "packages": packages, 67 | "install_requires": install_requires, 68 | "entry_points": entry_points, 69 | "extras_require": extras_require, 70 | "python_requires": ">=3.8", 71 | "include_package_data": True 72 | } 73 | 74 | setup(**setup_kwargs) 75 | -------------------------------------------------------------------------------- /python/fate_llm/evaluate/scripts/_options.py: -------------------------------------------------------------------------------- 1 | import time 2 | 3 | import click 4 | 5 | from ..utils.config import parse_config, default_eval_config 6 | from ..utils.config import _set_namespace 7 | 8 | 9 | def parse_custom_type(value): 10 | parts = value.split('=') 11 | if len(parts) == 2 and parts[1].isdigit(): 12 | return parts[0], int(parts[1]) 13 | elif len(parts) == 2 and isinstance(parts[1], str): 14 | return parts[0], parts[1] 15 | else: 16 | raise click.BadParameter('Invalid input format. 
Use "str=int" or "str=str".') 17 | 18 | 19 | class LlmSharedOptions(object): 20 | _options = { 21 | "eval_config": (('-c', '--eval_config'), 22 | dict(type=click.Path(exists=True), help=f"Manual specify config file", default=None), 23 | default_eval_config().__str__()), 24 | "yes": (('-y', '--yes',), dict(type=bool, is_flag=True, help="Skip double check", default=None), 25 | False), 26 | "namespace": (('-n', '--namespace'), 27 | dict(type=str, help=f"Manual specify fate llm namespace", default=None), 28 | time.strftime('%Y%m%d%H%M%S')) 29 | } 30 | 31 | def __init__(self): 32 | self._options_kwargs = {} 33 | 34 | def __getitem__(self, item): 35 | return self._options_kwargs[item] 36 | 37 | def get(self, k, default=None): 38 | v = self._options_kwargs.get(k, default) 39 | if v is None and k in self._options: 40 | v = self._options[k][2] 41 | return v 42 | 43 | def update(self, **kwargs): 44 | for k, v in kwargs.items(): 45 | if v is not None: 46 | self._options_kwargs[k] = v 47 | 48 | def post_process(self): 49 | # add defaults here 50 | for k, v in self._options.items(): 51 | if self._options_kwargs.get(k, None) is None: 52 | self._options_kwargs[k] = v[2] 53 | 54 | # update config 55 | config = parse_config(self._options_kwargs['eval_config']) 56 | self._options_kwargs['eval_config'] = config 57 | 58 | _set_namespace(self._options_kwargs['namespace']) 59 | 60 | @classmethod 61 | def get_shared_options(cls, hidden=False): 62 | def shared_options(f): 63 | for name, option in cls._options.items(): 64 | f = click.option(*option[0], **dict(option[1], hidden=hidden))(f) 65 | return f 66 | 67 | return shared_options 68 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fedkseed/zo_utils.py: -------------------------------------------------------------------------------- 1 | from typing import List 2 | 3 | import torch 4 | 5 | 6 | def probability_from_amps(amps: List[List[float]], clip): 7 | """ 8 | Get the probability distribution from the amplitude history 9 | 10 | formula: amp_i = clamp(amp_i, -clip, clip).abs().mean() 11 | amp_i = (amp_i - min(amp)) / (max(amp) - min(amp)) 12 | prob_i = softmax(amp)_i 13 | 14 | :param amps: list of amplitude history 15 | :param clip: the clipping value 16 | :return: 17 | """ 18 | amps = [torch.Tensor(amp) for amp in amps] 19 | amp = torch.stack([amp.clamp_(-clip, clip).abs_().mean() for amp in amps]) 20 | return (amp - amp.min()).div_(amp.max() - amp.min() + 1e-10).softmax(0) 21 | 22 | 23 | def directional_derivative_step( 24 | param_groups: List[dict], 25 | directional_derivative_seed: int, 26 | directional_derivative_value: torch.FloatTensor, 27 | lr: float = None, 28 | weight_decay: float = None, 29 | ) -> torch.FloatTensor: 30 | """ 31 | perform a step update for the parameters of the model 32 | along the random direction z with the learning rate lr and the step size grad_projected_value 33 | 34 | Input: 35 | - param_groups (List[dict]): list of parameter groups 36 | - directional_derivative_seed (int): seed for the random direction 37 | - directional_derivative_value (torch.FloatTensor): the step size 38 | - lr (float, optional): learning rate 39 | - weight_decay (float, optional): weight decay 40 | """ 41 | 42 | torch.manual_seed(directional_derivative_seed) 43 | for param_group in param_groups: 44 | weight_decay = param_group["weight_decay"] if weight_decay is None else weight_decay 45 | lr = param_group["lr"] if lr is None else lr 46 | for param in param_group["params"]: 47 | z = 
torch.normal(mean=0, std=1, size=param.data.size(), device=param.data.device, dtype=param.data.dtype) 48 | if weight_decay is not None: 49 | param.data = param.data - lr * (directional_derivative_value * z + weight_decay * param.data) 50 | 51 | else: 52 | param.data = param.data - lr * (directional_derivative_value * z) 53 | 54 | return directional_derivative_value 55 | 56 | 57 | def build_seed_candidates(k, low=0, high=2**32): 58 | """ 59 | Build seed candidates for the random walk optimizer 60 | """ 61 | return torch.randint(low, high, size=(k,), dtype=torch.long) 62 | 63 | 64 | def get_even_seed_probabilities(k): 65 | """ 66 | Get the even seed probabilities, i.e., 1/k for each seed 67 | """ 68 | return torch.ones(k) / k 69 | -------------------------------------------------------------------------------- /examples/fedmkt/fedmkt_config.yaml: -------------------------------------------------------------------------------- 1 | # fedmkt_config.yaml 2 | 3 | # Configuration for Lora 4 | lora_config: 5 | llm: 6 | r: 8 7 | lora_alpha: 16 8 | lora_dropout: 0.05 9 | target_modules: 10 | - q_proj 11 | - k_proj 12 | - v_proj 13 | - o_proj 14 | slm: 15 | - # Configuration for the first SLM model 16 | r: 8 17 | lora_alpha: 32 18 | lora_dropout: 0.1 19 | target_modules: 20 | - q_proj 21 | - v_proj 22 | - # Configuration for the second SLM model 23 | r: 8 24 | lora_alpha: 32 25 | lora_dropout: 0.1 26 | target_modules: 27 | - c_attn 28 | 29 | # Training configuration 30 | training: 31 | llm: 32 | global_epochs: 5 33 | per_device_train_batch_size: 1 34 | gradient_accumulation_steps: 4 35 | learning_rate: 3e-5 36 | output_dir: "./" 37 | dataloader_num_workers: 4 38 | remove_unused_columns: false 39 | warmup_ratio: 0.008 40 | lr_scheduler_type: "cosine" 41 | optim: "adamw_torch" 42 | adam_beta1: 0.9 43 | adam_beta2: 0.95 44 | weight_decay: 0.1 45 | max_grad_norm: 1.0 46 | use_cpu: false 47 | slm: 48 | global_epochs: 5 49 | per_device_train_batch_size: 1 50 | gradient_accumulation_steps: 4 51 | learning_rate: 3e-5 # Adjust learning rate for SLM models 52 | output_dir: "./" 53 | dataloader_num_workers: 4 54 | remove_unused_columns: false 55 | warmup_ratio: 0.008 56 | lr_scheduler_type: "cosine" 57 | optim: "adamw_torch" 58 | adam_beta1: 0.9 59 | adam_beta2: 0.95 60 | weight_decay: 0.1 61 | max_grad_norm: 1.0 62 | use_cpu: false 63 | 64 | # Paths configuration 65 | paths: 66 | process_data_output_dir: "" 67 | llm_pretrained_path: "Llama-2-7b-hf" 68 | slm_pretrained_paths: 69 | - "opt-1.3b" 70 | - "gpt2" 71 | vocab_mapping_directory: "" 72 | slm_to_llm_vocab_mapping_paths: 73 | - "opt_to_llama.json" 74 | - "gpt2_to_llama.json" 75 | - "llama_small_to_llama.json" 76 | llm_to_slm_vocab_mapping_paths: 77 | - "llama_to_opt.json" 78 | - "llama_to_gpt2.json" 79 | - "llama_to_llama_small" 80 | 81 | # Models configuration 82 | models: 83 | slm_models: 84 | - ["pellm.opt", "OPT"] 85 | - ["pellm.gpt2", "GPT2CLM"] 86 | 87 | # Data configuration 88 | data: 89 | guest: 90 | namespace: "experiment" 91 | name: "arc_challenge" 92 | host: 93 | namespace: "experiment" 94 | name: "arc_challenge" 95 | 96 | # Example: Additional custom configuration 97 | custom_config: 98 | some_param: "value" 99 | another_param: 123 100 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fdkt/utils/dp_loss.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 | import torch
17 | import torch.nn as nn
18 | import torch.nn.functional as F
19 | from transformers.models.auto.modeling_auto import MODEL_FOR_CAUSAL_LM_MAPPING_NAMES
20 | 
21 | 
22 | NUMERICAL_STABILITY_CONSTANT = 1e-13
23 | 
24 | 
25 | class SequenceCrossEntropyLoss(nn.Module):
26 |     def __init__(self, model_type, label_smoothing=-1, reduce=None):
27 |         super().__init__()
28 |         self.model_type = model_type
29 |         self.label_smoothing = label_smoothing
30 |         self.reduce = reduce
31 | 
32 |     def forward(self, logits, targets, mask):
33 |         return sequence_cross_entropy_with_logits(logits, targets, mask, self.label_smoothing, self.reduce, self.model_type)
34 | 
35 | 
36 | def sequence_cross_entropy_with_logits(logits, targets, mask, label_smoothing, reduce, model_type):
37 |     if model_type in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES.values():
38 |         logits = logits[:, :-1].contiguous()
39 |         targets = targets[:, 1:]
40 |         mask = torch.ones_like(targets).float()
41 | 
42 |     logits_flat = logits.view(-1, logits.size(-1))
43 |     log_probs_flat = F.log_softmax(logits_flat, dim=-1)
44 |     targets_flat = targets.reshape(-1, 1).long()
45 | 
46 |     if label_smoothing > 0.0:
47 |         num_classes = logits.size(-1)
48 |         smoothing_value = label_smoothing / float(num_classes)
49 |         one_hot_targets = torch.zeros_like(log_probs_flat).scatter_(-1, targets_flat, 1.0 - label_smoothing)
50 |         smoothed_targets = one_hot_targets + smoothing_value
51 |         negative_log_likelihood_flat = -log_probs_flat * smoothed_targets
52 |         negative_log_likelihood_flat = negative_log_likelihood_flat.sum(-1, keepdim=True)
53 |     else:
54 |         negative_log_likelihood_flat = - torch.gather(log_probs_flat, dim=1, index=targets_flat)
55 | 
56 |     negative_log_likelihood = negative_log_likelihood_flat.view(-1, logits.shape[1])
57 | 
58 |     loss = negative_log_likelihood * mask
59 | 
60 |     if reduce:
61 |         loss = loss.sum(1) / (mask.sum(1) + NUMERICAL_STABILITY_CONSTANT)
62 | 
63 |         if reduce == "batch":
64 |             loss = loss.mean()
65 | 
66 |     return loss
67 | 
--------------------------------------------------------------------------------
/doc/standalone_deploy.md:
--------------------------------------------------------------------------------
1 | # FATE-LLM Single-Node Deployment Guide
2 | 
3 | ## 1. Introduction
4 | 
5 | **Server Configuration:**
6 | 
7 | - **Quantity:** 1
8 | - **Configuration:** 8 cores / 16GB memory / 500GB hard disk / GPU Machine
9 | - **Operating System:** CentOS Linux release 7
10 | - **User:** app, **Owner:** apps
11 | 
12 | The single-node version provides two deployment methods, which can be selected based on your needs:
13 | - Install FATE-LLM from PyPI With FATE
14 | - Install FATE-LLM from PyPI with FATE, FATE-Flow, FATE-Client
15 | 
16 | ## 2. Install FATE-LLM from PyPI With FATE
17 | In this way, users can run tasks with Launcher, a convenient way for fast experimentation.
18 | 
19 | ### 2.1 Installing Python Environment
20 | - Prepare and install a [conda](https://docs.conda.io/projects/miniconda/en/latest/) environment.
21 | - Create a virtual environment:
22 | 
23 | ```shell
24 | # FATE-LLM requires Python >= 3.10
25 | conda create -n fate_env python=3.10
26 | conda activate fate_env
27 | ```
28 | 
29 | ### 2.2 Installing FATE-LLM
30 | This section introduces how to install FATE-LLM from PyPI with FATE. Execute the following command to install FATE-LLM.
31 | 
32 | ```shell
33 | pip install fate_llm[fate]==2.2.0
34 | ```
35 | 
36 | ### 2.3 Usage
37 | After installing successfully, please refer to the [tutorials](../README.md#quick-start) to run tasks; all tasks described in the tutorials that run with Launcher are supported.
38 | 
39 | 
40 | ## 3. Install FATE-LLM from PyPI with FATE, FATE-Flow, FATE-Client
41 | In this way, users can run tasks with Pipeline or Launcher.
42 | 
43 | ### 3.1 Installing Python Environment
44 | Please refer to section 2.1.
45 | 
46 | ### 3.2 Installing FATE-LLM with FATE, FATE-Flow, FATE-Client
47 | 
48 | ```shell
49 | pip install fate_llm[fate,fate_flow,fate_client]==2.2.0
50 | ```
51 | 
52 | ### 3.3 Service Initialization
53 | 
54 | ```shell
55 | mkdir fate_workspace
56 | fate_flow init --ip 127.0.0.1 --port 9380 --home $(pwd)/fate_workspace
57 | pipeline init --ip 127.0.0.1 --port 9380
58 | ```
59 | - `ip`: The IP address where the service runs.
60 | - `port`: The HTTP port the service runs on.
61 | - `home`: The data storage directory, including data, models, logs, job configurations, and SQLite databases.
62 | 
63 | ### 3.4 Start FATE-Flow Service
64 | 
65 | ```shell
66 | fate_flow start
67 | fate_flow status # make sure fate_flow service is started
68 | ```
69 | 
70 | FATE-Flow also provides stop and restart commands; use them only if you need to stop or restart the fate_flow service.
71 | ```shell
72 | # Warning: the normal installation process does not need the stop/restart commands.
73 | fate_flow stop
74 | fate_flow restart
75 | ```
76 | 
77 | ### 3.5 Usage
78 | Please refer to the [tutorials](../README.md#quick-start) for more usage guides; all tasks described in the tutorials that run with Pipeline or Launcher are supported.
79 | 
--------------------------------------------------------------------------------
/python/fate_llm/algo/fedmkt/token_alignment/vocab_mapping.py:
--------------------------------------------------------------------------------
1 | #
2 | # NOTE: The find_best_mapping function is copied from FuseAI/FuseLLM
3 | # Copyright FuseAI/FuseLLM
4 | #
5 | #
6 | # Copyright 2019 The FATE Authors. All Rights Reserved.
7 | #
8 | # Licensed under the Apache License, Version 2.0 (the "License");
9 | # you may not use this file except in compliance with the License.
10 | # You may obtain a copy of the License at
11 | #
12 | # http://www.apache.org/licenses/LICENSE-2.0
13 | #
14 | # Unless required by applicable law or agreed to in writing, software
15 | # distributed under the License is distributed on an "AS IS" BASIS,
16 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
17 | # See the License for the specific language governing permissions and
18 | # limitations under the License.
19 | #
20 | import json
21 | import editdistance
22 | import tqdm
23 | import multiprocessing
24 | import logging
25 | 
26 | from fate_llm.data.tokenizers.cust_tokenizer import get_tokenizer
27 | from fate_llm.algo.fedmkt.token_alignment.spectal_token_mapping import TOKENIZER_TO_SPECIAL_TOKEN
28 | 
29 | logger = logging.getLogger(__name__)
30 | 
31 | 
32 | def find_best_mapping(x, base_tokens, blending_model_special_token, base_model_special_token, best_one=True):
33 |     """code refer to https://github.com/fanqiwan/FuseAI/blob/main/FuseLLM/src/utils/vocab_mapping.py#L82"""
34 |     tmp_x = x.replace(blending_model_special_token, base_model_special_token)
35 |     if tmp_x in base_tokens:
36 |         return tmp_x, tmp_x
37 |     else:
38 |         if best_one:
39 |             return tmp_x, min([(y, editdistance.eval(tmp_x, y)) for y in base_tokens], key=lambda d: d[1])[0]
40 |         else:
41 |             token_and_distance = [(y, editdistance.eval(tmp_x, y)) for y in base_tokens]
42 |             min_distance = min(item[1] for item in token_and_distance)
43 |             shortest_distance_tokens = [item[0] for item in token_and_distance if item[1] == min_distance]
44 |             return tmp_x, shortest_distance_tokens
45 | 
46 | 
47 | def get_vocab_mappings(model_name_or_path, candidate_model_name_or_path, vocab_mapping_save_path, num_processors=8):
48 |     ori_tokenizer = get_tokenizer(model_name_or_path)
49 |     candidate_tokenizer = get_tokenizer(candidate_model_name_or_path)
50 | 
51 |     ori_special_tok = TOKENIZER_TO_SPECIAL_TOKEN[ori_tokenizer.__class__]
52 |     candidate_special_tok = TOKENIZER_TO_SPECIAL_TOKEN[candidate_tokenizer.__class__]
53 | 
54 |     candidate_tokens = list(candidate_tokenizer.get_vocab().keys())
55 | 
56 |     with multiprocessing.Pool(num_processors) as process_pool:
57 |         func_args = [(tok, candidate_tokens, ori_special_tok, candidate_special_tok) for tok in ori_tokenizer.get_vocab()]
58 | 
59 |         vocab_mappings = dict(tqdm.tqdm(process_pool.starmap(find_best_mapping, func_args),
60 |                                         total=len(ori_tokenizer.get_vocab())))
61 | 
62 |     with open(vocab_mapping_save_path, "w") as fout:
63 |         json.dump(vocab_mappings, fout)
64 | 
65 |     return vocab_mappings
66 | 
--------------------------------------------------------------------------------
/RELEASE.md:
--------------------------------------------------------------------------------
1 | ## Release 2.2.0
2 | ### Major Features and Improvements
3 | * Integrate the FedCoT (Federated Chain-of-Thought) algorithm, a novel framework that enhances local small language models (SLMs) using differential-privacy-protected Chains of Thought (CoT) generated by remote LLMs:
4 |   * Implement InferDPT for privacy-preserving CoT generation.
5 |   * Support an encoder-decoder mechanism for privacy-preserving CoT generation.
6 |   * Add prefix trainers for step-by-step distillation and text encoder-decoder training.
7 | * Integrate the FDKT algorithm, a framework that enables domain-specific knowledge transfer from LLMs to SLMs while preserving SLM data privacy
8 | * Deployment Optimization: support installation of FATE-LLM via PyPI
9 | 
10 | 
11 | ## Release 2.1.0
12 | ### Major Features and Improvements
13 | * New FedMKT Federated Tuning Algorithms: Federated Mutual Knowledge Transfer for Large and Small Language Models
14 |   * Support three distinct scenarios: Heterogeneous, Homogeneous and One-to-One
15 |   * Support LLM to SLM one-way knowledge transfer
16 | * Introduce the InferDPT algorithm, which leverages differential privacy (DP) to facilitate privacy-preserving inference for large language models.
17 | * Introduce FATE-LLM Evaluate: evaluate FATE-LLM models in a few lines with the Python SDK or simple CLI commands (`fate_llm evaluate`), built-in cases included
18 | 
19 | 
20 | ## Release 2.0.0
21 | ### Major Features and Improvements
22 | * Adapt to fate-v2.0 framework:
23 |   * Migrate parameter-efficient fine-tuning training methods and models.
24 |   * Migrate Standard Offsite-Tuning and Extended Offsite-Tuning (Federated Offsite-Tuning+)
25 |   * New trainer, dataset, data_processing function design
26 | * New FedKSeed Federated Tuning Algorithm: train large language models in a federated learning setting with extremely low communication cost
27 | 
28 | ## Release 1.3.0
29 | ### Major Features and Improvements
30 | * FTL-LLM (Federated Learning + Transfer Learning + LLM)
31 |   * Standard Offsite-Tuning and Extended Offsite-Tuning (Federated Offsite-Tuning+) now supported
32 |   * Framework available for Emulator and Adapter development
33 |   * New Offsite-Tuning Trainer introduced
34 |   * Includes built-in models such as GPT-2 family, Llama7b, and Bloom family
35 | * FedIPR
36 |   * Introduced WatermarkDataset as the foundational dataset class for backdoor-based watermarks
37 |   * Added SignConv and SignLayerNorm blocks for feature-based watermark models
38 |   * New FedIPR Trainer available
39 |   * Built-in models with feature-based watermarks include Alexnet, Resnet18, DistilBert, and GPT2
40 | * More models support parameter-efficient fine-tuning: ChatGLM2-6B and Bloom-7B1
41 | 
42 | 
43 | ## Release 1.2.0
44 | ### Major Features and Improvements
45 | * Support Federated Training of LLaMA-7B with parameter-efficient fine-tuning.
46 | 
47 | 
48 | ## Release 1.1.0
49 | ### Major Features and Improvements
50 | * Support Federated Training of ChatGLM-6B with parameter-efficient fine-tuning adapters such as LoRA and P-Tuning V2.
51 | * Integration of `peft`, which supports many parameter-efficient adapters.
52 | 
--------------------------------------------------------------------------------
/python/fate_llm/algo/fedmkt/utils/dataset_sync_util.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2019 The FATE Authors. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | # 16 | import logging 17 | import datasets 18 | import torch 19 | import torch.distributed as dist 20 | from fate_llm.algo.fedmkt.utils.vars_define import ( 21 | METRIC, 22 | PER_STEP_LOGITS, 23 | PER_STEP_INDICES, 24 | ) 25 | 26 | logger = logging.getLogger(__name__) 27 | 28 | 29 | def sync_dataset(dataset, local_rank, world_size, device): 30 | integer_keys_2d = ["input_ids", "attention_mask", "labels"] 31 | integer_keys_3d = [PER_STEP_INDICES] 32 | float_keys_3d = [PER_STEP_LOGITS] 33 | float_keys_1d = [METRIC] 34 | 35 | if local_rank == 0: 36 | for key in integer_keys_2d + integer_keys_3d + float_keys_3d + float_keys_1d: 37 | if key in integer_keys_2d or key in integer_keys_3d: 38 | dtype = torch.int32 39 | else: 40 | dtype = torch.float64 41 | 42 | values = dataset[key] 43 | v_tensor = torch.tensor(values, dtype=dtype).cuda(device) 44 | shape_tensor = torch.tensor(v_tensor.shape, dtype=torch.int32).cuda(device) 45 | shape_tensors = [shape_tensor for _ in range(world_size)] 46 | dist.scatter(shape_tensor, shape_tensors, async_op=False) 47 | 48 | v_tensors = [v_tensor for _ in range(world_size)] 49 | dist.scatter(v_tensor, v_tensors, async_op=False) 50 | 51 | return dataset 52 | 53 | else: 54 | data_dict = dict() 55 | for key in integer_keys_2d + integer_keys_3d + float_keys_3d + float_keys_1d: 56 | if key in integer_keys_2d or key in integer_keys_3d: 57 | dtype = torch.int32 58 | else: 59 | dtype = torch.float64 60 | 61 | if key in integer_keys_2d: 62 | shape_tensor = torch.tensor([0, 0], dtype=torch.int32).cuda(device) 63 | elif key in float_keys_3d or key in integer_keys_3d: 64 | shape_tensor = torch.tensor([0, 0, 0], dtype=torch.int32).cuda(device) 65 | else: 66 | shape_tensor = torch.tensor([0], dtype=torch.int32).cuda(device) 67 | 68 | dist.scatter(shape_tensor, src=0, async_op=False) 69 | v_tensor = torch.zeros(shape_tensor.tolist(), dtype=dtype).cuda(device) 70 | dist.scatter(v_tensor, src=0, async_op=False) 71 | data_dict[key] = v_tensor.tolist() 72 | 73 | return datasets.Dataset.from_dict(data_dict) 74 | -------------------------------------------------------------------------------- /python/fate_llm/model_zoo/embedding_transformer/st_model.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | # 16 | from sentence_transformers import SentenceTransformer 17 | from typing import Any, Optional, Dict, Union 18 | 19 | 20 | class SentenceTransformerModel(object): 21 | def __init__( 22 | self, 23 | model_name_or_path: Optional[str] = None, 24 | device: Optional[str] = None, 25 | prompts: Optional[Dict[str, str]] = None, 26 | default_prompt_name: Optional[str] = None, 27 | cache_folder: Optional[str] = None, 28 | trust_remote_code: bool = False, 29 | revision: Optional[str] = None, 30 | local_files_only: bool = False, 31 | token: Optional[Union[bool, str]] = None, 32 | use_auth_token: Optional[Union[bool, str]] = None, 33 | truncate_dim: Optional[int] = None, 34 | model_kwargs: Optional[Dict[str, Any]] = None, 35 | tokenizer_kwargs: Optional[Dict[str, Any]] = None, 36 | config_kwargs: Optional[Dict[str, Any]] = None, 37 | ) -> None: 38 | self.model_name_or_path = model_name_or_path 39 | self.device = device 40 | self.prompts = prompts 41 | self.default_prompt_name = default_prompt_name 42 | self.cache_folder = cache_folder 43 | self.trust_remote_code = trust_remote_code 44 | self.revision = revision 45 | self.local_files_only = local_files_only 46 | self.token = token 47 | self.use_auth_token = use_auth_token 48 | self.truncate_dim = truncate_dim 49 | self.model_kwargs = model_kwargs 50 | self.tokenizer_kwargs = tokenizer_kwargs 51 | self.config_kwargs = config_kwargs 52 | 53 | def load(self): 54 | model = SentenceTransformer( 55 | model_name_or_path=self.model_name_or_path, 56 | device=self.device, 57 | prompts=self.prompts, 58 | default_prompt_name=self.default_prompt_name, 59 | cache_folder=self.cache_folder, 60 | trust_remote_code=self.trust_remote_code, 61 | revision=self.revision, 62 | local_files_only=self.local_files_only, 63 | token=self.token, 64 | use_auth_token=self.use_auth_token, 65 | truncate_dim=self.truncate_dim, 66 | model_kwargs=self.model_kwargs, 67 | tokenizer_kwargs=self.tokenizer_kwargs, 68 | config_kwargs=self.config_kwargs 69 | ) 70 | 71 | return model 72 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # FATE-LLM 2 | FATE-LLM is a framework to support federated learning for large language models(LLMs) and small language models(SLMs). 3 |
4 | 5 |
6 | 7 | ## Design Principle 8 | - Federated learning for large language models(LLMs) and small language models(SLMs). 9 | - Promote training efficiency of federated LLMs using Parameter-Efficient methods. 10 | - Protect the IP of LLMs using FedIPR. 11 | - Protect data privacy during training and inference through privacy preserving mechanisms. 12 |
13 | 14 |
15 | 16 | ### Standalone deployment 17 | * To deploy FATE-LLM v2.2.0 or higher version, three ways are provided, please refer [deploy tutorial](./doc/standalone_deploy.md) for more details: 18 | * deploy with FATE only from pypi then using Launcher to run tasks 19 | * deploy with FATE、FATE-Flow、FATE-Client from pypi, user can run tasks with Pipeline 20 | * To deploy lower versions: please refer to [FATE-Standalone deployment](https://github.com/FederatedAI/FATE#standalone-deployment). 21 | * To deploy FATE-LLM v2.0.* - FATE-LLM v2.1.*, deploy FATE-Standalone with version >= 2.1, then make a new directory `{fate_install}/fate_llm` and clone the code into it, install the python requirements, and add `{fate_install}/fate_llm/python` to `PYTHONPATH` 22 | * To deploy FATE-LLM v1.x, deploy FATE-Standalone with 1.11.3 <= version < 2.0, then copy directory `python/fate_llm` to `{fate_install}/fate/python/fate_llm` 23 | 24 | ### Cluster deployment 25 | Use [FATE-LLM deployment packages](https://github.com/FederatedAI/FATE/wiki/Download#llm%E9%83%A8%E7%BD%B2%E5%8C%85) to deploy, refer to [FATE-Cluster deployment](https://github.com/FederatedAI/FATE#cluster-deployment) for more deployment details. 26 | 27 | ## Quick Start 28 | 29 | - [Federated ChatGLM3-6B Training](doc/tutorial/pellm/ChatGLM3-6B_ds.ipynb) 30 | - [Builtin Models In PELLM](doc/tutorial/pellm/builtin_pellm_models.md) 31 | - [Offsite Tuning: Transfer Learning without Full Model](./doc/tutorial/offsite_tuning/Offsite_tuning_tutorial.ipynb) 32 | - [FedKSeed: Federated Full-Parameter Tuning of Billion-Sized Language Models 33 | with Communication Cost under 18 Kilobytes](./doc/tutorial/fedkseed/) 34 | - [InferDPT: Privacy-preserving Inference for Black-box Large Language Models](./doc/tutorial/inferdpt/inferdpt_tutorial.ipynb) 35 | - [FedMKT: Federated Mutual Knowledge Transfer for Large and Small 36 | Language Models](./doc/tutorial/fedmkt/) 37 | - [FedCoT: Federated Chain-of-Thought Distillation for Large Language Models](./doc/tutorial/fedcot) 38 | - [FDKT: Federated Domain-Specific Knowledge Transfer on Large Language Models Using Synthetic Data](./doc/tutorial/fdkt) 39 | 40 | ## FATE-LLM Evaluate 41 | 42 | - [Python SDK & CLI Usage Guide](./doc/fate_llm_evaluate.md) 43 | 44 | ## Citation 45 | 46 | If you publish work that uses FATE-LLM, please cite FATE-LLM as follows: 47 | ``` 48 | @article{fan2023fate, 49 | title={Fate-llm: A industrial grade federated learning framework for large language models}, 50 | author={Fan, Tao and Kang, Yan and Ma, Guoqiang and Chen, Weijing and Wei, Wenbin and Fan, Lixin and Yang, Qiang}, 51 | journal={Symposium on Advances and Open Problems in Large Language Models (LLM@IJCAI'23)}, 52 | year={2023} 53 | } 54 | ``` 55 | -------------------------------------------------------------------------------- /python/fate_llm/evaluate/tasks/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2024 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | 17 | import yaml 18 | import os 19 | 20 | 21 | def local_fn_constructor(loader, node): 22 | return node 23 | 24 | 25 | def local_fn_representer(dumper, data): 26 | return data 27 | 28 | 29 | def dump_yaml(dict, path): 30 | yaml.add_representer(yaml.ScalarNode, local_fn_representer) 31 | with open(path, 'w') as f: 32 | yaml.dump(dict, f) 33 | 34 | class Task: 35 | _task_name = "" 36 | _task_dir = "" 37 | _task_conf_file = "" 38 | _task_source_url = "" 39 | script_dir = os.path.dirname(__file__) 40 | 41 | @property 42 | def task_name(self): 43 | return self._task_name 44 | 45 | @property 46 | def task_template(self): 47 | yaml.add_constructor("!function", local_fn_constructor) 48 | with open(os.path.abspath(os.path.join(self.script_dir, self._task_dir, self._task_conf_file)), "rb") as f: 49 | task_template = yaml.full_load(f) 50 | return task_template 51 | 52 | @property 53 | def task_scr_dir(self): 54 | return os.path.abspath(os.path.join(self.script_dir, self._task_dir)) 55 | 56 | @property 57 | def task_conf_path(self): 58 | return os.path.abspath(os.path.join(self.script_dir, self._task_dir, self._task_conf_file)) 59 | 60 | @property 61 | def task_source_url(self): 62 | return self._task_source_url 63 | 64 | def download_from_source(self): 65 | raise NotImplementedError(f"Should not be called here.") 66 | 67 | 68 | class Dolly(Task): 69 | _task_name = "dolly-15k" 70 | _task_dir = "dolly_15k" 71 | _task_conf_file = "default_dolly_15k.yaml" 72 | 73 | def download_from_source(self): 74 | try: 75 | from datasets import load_dataset 76 | data = load_dataset("databricks/databricks-dolly-15k", split="train") 77 | filename = os.path.join(self.task_scr_dir, "databricks-dolly-15k.jsonl") 78 | data.to_json(filename) 79 | return True 80 | except Exception as e: 81 | print(f"Failed to download data from source: {e}") 82 | return False 83 | 84 | 85 | class AdvertiseGen(Task): 86 | _task_name = "advertise-gen" 87 | _task_dir = "advertise_gen" 88 | _task_conf_file = "default_advertise_gen.yaml" 89 | _task_source_url = ["https://cloud.tsinghua.edu.cn/seafhttp/files/3781289a-5a60-44b1-b5f1-a04364e3eb9d/AdvertiseGen.tar.gz", 90 | "https://docs.google.com/uc?export=download&id=13_vf0xRTQsyneRKdD1bZIr93vBGOczrk"] 91 | 92 | def download_from_source(self): 93 | from ..utils.data_tools import download_data 94 | result = download_data(self.task_scr_dir, self.task_source_url[0]) 95 | if not result: 96 | print(f"retry with address: {self.task_source_url[1]}") 97 | return download_data(self.task_scr_dir, self.task_source_url[1]) 98 | return result 99 | 100 | 101 | build_in_tasks = {"dolly-15k": Dolly(), 102 | "advertise-gen": AdvertiseGen()} 103 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fedcot/encoder_decoder/slm_encoder_decoder.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 
6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | import copy 17 | from jinja2 import Template 18 | from tqdm import tqdm 19 | from fate.arch import Context 20 | from typing import List, Dict, Union 21 | from fate.ml.nn.dataset.base import Dataset 22 | from fate_llm.algo.inferdpt.utils import InferDPTKit 23 | from openai import OpenAI 24 | import logging 25 | from fate_llm.inference.inference_base import Inference 26 | from fate_llm.algo.inferdpt.inferdpt import InferDPTClient, InferDPTServer 27 | from fate_llm.dataset.hf_dataset import HuggingfaceDataset 28 | 29 | 30 | logger = logging.getLogger(__name__) 31 | 32 | 33 | class SLMEncoderDecoderClient(InferDPTClient): 34 | 35 | def __init__(self, ctx: Context, local_inference_inst: Inference) -> None: 36 | self.ctx = ctx 37 | self.comm_idx = 0 38 | self.local_inference_inst = local_inference_inst 39 | self.local_inference_kwargs = {} 40 | 41 | def encode(self, docs: List[Dict[str, str]], format_template: str = None, verbose=False, perturb_doc_key: str ='perturbed_doc') -> List[Dict[str, str]]: 42 | 43 | template = Template(format_template) 44 | copy_docs = copy.deepcopy(docs) 45 | doc_to_infer = [] 46 | for doc in tqdm(copy_docs): 47 | rendered_doc = template.render(**doc) 48 | doc_to_infer.append(rendered_doc) 49 | # perturb using local model inference 50 | self.doc_to_infer = doc_to_infer 51 | infer_result = self.local_inference_inst.inference(doc_to_infer, self.local_inference_kwargs) 52 | for doc, pr in zip(copy_docs, infer_result): 53 | doc[perturb_doc_key] = pr 54 | self.doc_with_p = copy_docs 55 | return copy_docs 56 | 57 | def decode(self, p_docs: List[Dict[str, str]], instruction_template: str = None, decode_template: str = None, verbose=False, 58 | perturbed_response_key: str = 'perturbed_response', result_key: str = 'result', 59 | remote_inference_kwargs: dict = {}, local_inference_kwargs: dict = {}): 60 | return super().decode(p_docs, instruction_template, decode_template, verbose, perturbed_response_key, result_key, remote_inference_kwargs, local_inference_kwargs) 61 | 62 | def inference(self, docs: Union[List[Dict[str, str]], HuggingfaceDataset], 63 | encode_template: str, 64 | instruction_template: str, 65 | decode_template: str, 66 | verbose: bool = False, 67 | remote_inference_kwargs: dict = {}, 68 | local_inference_kwargs: dict = {}, 69 | perturb_doc_key: str = 'perturbed_doc', 70 | perturbed_response_key: str = 'perturbed_response', 71 | result_key: str = 'result', 72 | ) -> List[Dict[str, str]]: 73 | self.local_inference_kwargs = local_inference_kwargs 74 | return super().inference(docs, encode_template, instruction_template, decode_template, verbose, remote_inference_kwargs, \ 75 | local_inference_kwargs, perturb_doc_key, perturbed_response_key, result_key) 76 | 77 | 78 | class SLMEncoderDecoderServer(InferDPTServer): 79 | pass 80 | -------------------------------------------------------------------------------- /python/fate_llm/evaluate/utils/config.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2024 The FATE Authors. All Rights Reserved. 
3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | 17 | import os 18 | import click 19 | import yaml 20 | import typing 21 | from pathlib import Path 22 | from ._io import set_logger, echo 23 | 24 | 25 | DEFAULT_FATE_LLM_BASE_PATH = os.path.abspath(os.path.dirname(os.path.dirname(__file__))) 26 | FATE_LLM_BASE_PATH = os.getenv("FATE_LLM_BASE_PATH") or DEFAULT_FATE_LLM_BASE_PATH 27 | 28 | # DEFAULT_TASK_PATH = os.path.abspath(os.path.join(os.path.dirname(__file__), "../tasks")) 29 | DEFAULT_FATE_LLM_TASK_PATH = os.path.abspath(os.path.join(FATE_LLM_BASE_PATH, "tasks")) 30 | FATE_LLM_TASK_PATH = os.getenv("FATE_LLM_TASK_PATH") or DEFAULT_FATE_LLM_TASK_PATH 31 | 32 | _default_eval_config = Path(FATE_LLM_BASE_PATH).resolve() / 'llm_eval_config.yaml' 33 | 34 | template = """# args for evaluate 35 | batch_size: 10 36 | model_args: 37 | device: cuda 38 | dtype: auto 39 | trust_remote_code: true 40 | num_fewshot: 0 41 | """ 42 | 43 | 44 | def create_eval_config(path: Path, override=False): 45 | if path.exists() and not override: 46 | raise FileExistsError(f"{path} exists") 47 | 48 | with path.open("w") as f: 49 | f.write(template) 50 | 51 | 52 | def default_eval_config(): 53 | if not _default_eval_config.exists(): 54 | create_eval_config(_default_eval_config) 55 | return _default_eval_config 56 | 57 | 58 | class Config(object): 59 | def __init__(self, config): 60 | self.update_conf(**config) 61 | 62 | def update_conf(self, **kwargs): 63 | for k, v in kwargs.items(): 64 | setattr(self, k, v) 65 | 66 | @staticmethod 67 | def load(path: typing.Union[str, Path], **kwargs): 68 | if isinstance(path, str): 69 | path = Path(path) 70 | config = {} 71 | if path is not None: 72 | with path.open("r") as f: 73 | config.update(yaml.safe_load(f)) 74 | 75 | config.update(kwargs) 76 | return Config(config) 77 | 78 | @staticmethod 79 | def load_from_file(path: typing.Union[str, Path]): 80 | """ 81 | Loads conf content from yaml file. 
Used to read in parameter configuration 82 | Parameters 83 | ---------- 84 | path: str, path to conf file, should be absolute path 85 | 86 | Returns 87 | ------- 88 | dict, parameter configuration in dictionary format 89 | 90 | """ 91 | if isinstance(path, str): 92 | path = Path(path) 93 | config = {} 94 | if path is not None: 95 | file_type = path.suffix 96 | with path.open("r") as f: 97 | if file_type == ".yaml": 98 | config.update(yaml.safe_load(f)) 99 | else: 100 | raise ValueError(f"Cannot load conf from file type {file_type}") 101 | return config 102 | 103 | 104 | def parse_config(config): 105 | try: 106 | config_inst = Config.load(config) 107 | except Exception as e: 108 | raise RuntimeError(f"error parse config from {config}") from e 109 | return config_inst 110 | 111 | 112 | def _set_namespace(namespace): 113 | Path(f"logs/{namespace}").mkdir(exist_ok=True, parents=True) 114 | set_logger(f"logs/{namespace}/exception.log") 115 | echo.set_file(click.open_file(f'logs/{namespace}/stdout', "a")) 116 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fedavg/fedavg.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | import torch 16 | from fate.ml.nn.homo.fedavg import FedAVGServer, FedAVGArguments, FedArguments 17 | from fate.arch import Context 18 | from fate_llm.trainer.seq2seq_trainer import HomoSeq2SeqTrainerClient, Seq2SeqTrainingArguments 19 | from fate.ml.aggregator import AggregatorClientWrapper 20 | import logging 21 | from typing import List, Optional, Tuple, Callable, Dict 22 | from fate.arch import Context 23 | from torch.optim import Optimizer 24 | from torch.utils.data import Dataset 25 | from torch.optim.lr_scheduler import _LRScheduler 26 | from transformers.trainer_callback import TrainerCallback 27 | from torch import nn 28 | from torch.utils.data import DataLoader 29 | from transformers import TrainerState, TrainerControl, PreTrainedTokenizer, EvalPrediction 30 | 31 | 32 | 33 | logger = logging.getLogger(__name__) 34 | 35 | 36 | Seq2SeqFedAVGServer = FedAVGServer 37 | 38 | 39 | class Seq2SeqFedAVGClient(HomoSeq2SeqTrainerClient): 40 | 41 | def __init__( 42 | self, 43 | ctx: Context, 44 | model: nn.Module, 45 | training_args: Seq2SeqTrainingArguments, 46 | fed_args: FedArguments, 47 | train_set: Dataset, 48 | val_set: Dataset = None, 49 | optimizer: torch.optim.Optimizer = None, 50 | scheduler: Optional[torch.optim.lr_scheduler._LRScheduler] = None, 51 | data_collator: Callable = None, 52 | tokenizer: Optional[PreTrainedTokenizer] = None, 53 | callbacks: Optional[List[TrainerCallback]] = [], 54 | compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None, 55 | local_mode: bool = False, 56 | save_trainable_weights_only: bool = False, 57 | preprocess_logits_for_metrics: Optional[Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = None, 58 | ): 59 | # in case you forget to set evaluation_strategy 60 | if val_set is not None and training_args.evaluation_strategy == "no": 61 | training_args.evaluation_strategy = "epoch" 62 | 63 | HomoSeq2SeqTrainerClient.__init__( 64 | self, 65 | ctx, 66 | model, 67 | training_args, 68 | fed_args, 69 | train_set, 70 | val_set, 71 | optimizer, 72 | data_collator, 73 | scheduler, 74 | tokenizer, 75 | callbacks, 76 | compute_metrics, 77 | local_mode, 78 | save_trainable_weights_only, 79 | preprocess_logits_for_metrics 80 | ) 81 | 82 | 83 | def init_aggregator(self, ctx: Context, fed_args: FedArguments): 84 | aggregate_type = "weighted_mean" 85 | aggregator_name = "fedavg" 86 | aggregator = fed_args.aggregator 87 | return AggregatorClientWrapper( 88 | ctx, aggregate_type, aggregator_name, aggregator, sample_num=len(self.train_dataset), args=self._args 89 | ) 90 | 91 | def on_federation( 92 | self, 93 | ctx: Context, 94 | aggregator: AggregatorClientWrapper, 95 | fed_args: FedArguments, 96 | args: Seq2SeqTrainingArguments, 97 | model: Optional[nn.Module] = None, 98 | optimizer: Optional[Optimizer] = None, 99 | scheduler: Optional[_LRScheduler] = None, 100 | dataloader: Optional[Tuple[DataLoader]] = None, 101 | control: Optional[TrainerControl] = None, 102 | state: Optional[TrainerState] = None, 103 | **kwargs, 104 | ): 105 | aggregator.model_aggregation(ctx, model) 106 | 107 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fdkt/utils/text_generate.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 
6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | from tqdm import tqdm 17 | from typing import Any, Dict, List 18 | 19 | 20 | def slm_text_generate( 21 | inference_inst, 22 | model, 23 | tokenizer, 24 | prompt_dict, 25 | seq_num_for_single_category, 26 | batch_size, 27 | use_cpu, 28 | generation_config 29 | ): 30 | generated_ret = dict( 31 | inputs=list(), 32 | labels=list(), 33 | ) 34 | if inference_inst is not None: 35 | for label, prompt in prompt_dict.items(): 36 | generated_sequences = inference_inst.inference([prompt] * seq_num_for_single_category, generation_config) 37 | for g in generated_sequences: 38 | generated_ret["inputs"].append(g) 39 | generated_ret["labels"].append(label) 40 | else: 41 | model.eval() 42 | for label, prompt_ids in prompt_dict.items(): 43 | prompt_length = len(prompt_ids) 44 | batch_num = (seq_num_for_single_category + batch_size - 1) // batch_size 45 | for batch_idx in tqdm(range(batch_num)): 46 | if batch_idx + 1 == batch_num: 47 | cur_batch_size = seq_num_for_single_category - batch_idx * batch_size 48 | else: 49 | cur_batch_size = batch_size 50 | input_ids = prompt_ids.repeat(cur_batch_size, 1) 51 | 52 | if not use_cpu: 53 | input_ids = input_ids.to(model.device) 54 | 55 | output_sequences = model.generate( 56 | input_ids=input_ids, 57 | **generation_config 58 | ) 59 | output_sequences = output_sequences[:, prompt_length:] 60 | 61 | generated_sequences = tokenizer.batch_decode(output_sequences, skip_special_tokens=True) 62 | 63 | for g in generated_sequences: 64 | generated_ret["inputs"].append(g) 65 | generated_ret["labels"].append(label) 66 | 67 | return generated_ret 68 | 69 | 70 | def general_text_generate( 71 | inference_inst, 72 | model, 73 | tokenizer, 74 | generation_config: Dict[Any, Any], 75 | prompts: List[str], 76 | batch_size, 77 | use_cpu: bool, 78 | prompt_max_length 79 | ): 80 | if inference_inst is not None: 81 | if prompt_max_length is not None: 82 | prompts = [prompt[:prompt_max_length] for prompt in prompts] 83 | generate_texts = inference_inst.inference(prompts, generation_config) 84 | else: 85 | model.eval() 86 | generate_texts = [] 87 | batch_num = (len(prompts) + batch_size - 1) // batch_size 88 | for batch_idx in range(batch_num): 89 | batch_data = prompts[batch_idx * batch_size: (batch_idx + 1) * batch_size] 90 | 91 | inputs = tokenizer(batch_data, return_tensors="pt", padding="longest", truncation=True, 92 | max_length=prompt_max_length) 93 | input_ids = inputs["input_ids"] 94 | attention_mask = inputs["attention_mask"] 95 | 96 | if not use_cpu: 97 | input_ids = input_ids.to(model.device) 98 | attention_mask = attention_mask.to(model.device) 99 | 100 | output = model.generate( 101 | input_ids=input_ids, 102 | attention_mask=attention_mask, 103 | **generation_config 104 | ) 105 | 106 | batch_responses = tokenizer.batch_decode(output[:, input_ids.shape[1]:], skip_special_tokens=True) 107 | 108 | generate_texts.extend(batch_responses) 109 | 110 | return generate_texts 111 | -------------------------------------------------------------------------------- /examples/pellm/test_bloom_lora.py: 
-------------------------------------------------------------------------------- 1 | import time 2 | from fate_client.pipeline.components.fate.reader import Reader 3 | from fate_client.pipeline import FateFlowPipeline 4 | from fate_client.pipeline.components.fate.homo_nn import HomoNN, get_config_of_seq2seq_runner 5 | from fate_client.pipeline.components.fate.nn.algo_params import Seq2SeqTrainingArguments, FedAVGArguments 6 | from fate_client.pipeline.components.fate.nn.loader import LLMModelLoader, LLMDatasetLoader, LLMDataFuncLoader 7 | from peft import LoraConfig, TaskType 8 | from fate_client.pipeline.utils import test_utils 9 | import argparse 10 | import yaml 11 | from typing import Union, Dict 12 | 13 | 14 | def main(config="../../config.yaml", param: Union[Dict, str] = None, namespace=""): 15 | if isinstance(config, str): 16 | config = test_utils.load_job_config(config) 17 | if isinstance(param, str): 18 | param = yaml.safe_load(param) 19 | parties = config.parties 20 | guest = parties.guest[0] 21 | host = parties.host[0] 22 | arbiter = parties.arbiter[0] 23 | pipeline = FateFlowPipeline().set_parties(guest=guest, host=host, arbiter=arbiter) 24 | 25 | reader_0 = Reader("reader_0", runtime_parties=dict(guest=guest, host=host)) 26 | reader_0.guest.task_parameters( 27 | namespace=param["data"]["guest"]["namespace"], 28 | name=param["data"]["guest"]["name"] 29 | ) 30 | reader_0.hosts[0].task_parameters( 31 | namespace=param["data"]["host"]["namespace"], 32 | name=param["data"]["host"]["name"] 33 | ) 34 | 35 | lora_config = LoraConfig(**param["peft_config"]) 36 | lora_config.target_modules = list(lora_config.target_modules) 37 | 38 | pretrained_model_path = param["pretrained_model_path"] 39 | model = LLMModelLoader( 40 | "pellm.bloom", 41 | "Bloom", 42 | pretrained_path=pretrained_model_path, 43 | peft_type="LoraConfig", 44 | peft_config=lora_config.to_dict(), 45 | trust_remote_code=True 46 | ) 47 | 48 | tokenizer_params = dict( 49 | tokenizer_name_or_path=pretrained_model_path, 50 | trust_remote_code=True, 51 | ) 52 | 53 | dataset = LLMDatasetLoader( 54 | "prompt_dataset", 55 | "PromptDataset", 56 | **tokenizer_params, 57 | ) 58 | 59 | data_collator = LLMDataFuncLoader( 60 | "data_collator.cust_data_collator", 61 | "get_seq2seq_data_collator", 62 | **tokenizer_params, 63 | ) 64 | 65 | conf = get_config_of_seq2seq_runner( 66 | algo='fedavg', 67 | model=model, 68 | dataset=dataset, 69 | data_collator=data_collator, 70 | training_args=Seq2SeqTrainingArguments( 71 | num_train_epochs=param["epoch"], 72 | per_device_train_batch_size=param["batch_size"], 73 | remove_unused_columns=False, 74 | predict_with_generate=False, 75 | deepspeed=param["ds_config"], 76 | learning_rate=param["lr"], 77 | use_cpu=False, # this must be set as we will gpu 78 | fp16=True, 79 | ), 80 | fed_args=FedAVGArguments(), 81 | task_type='causal_lm', 82 | save_trainable_weights_only=True # only save trainable weights 83 | ) 84 | 85 | homo_nn_0 = HomoNN( 86 | 'nn_0', 87 | runner_conf=conf, 88 | train_data=reader_0.outputs["output_data"], 89 | runner_module="homo_seq2seq_runner", 90 | runner_class="Seq2SeqRunner", 91 | ) 92 | 93 | homo_nn_0.guest.conf.set("launcher_name", "deepspeed") # tell schedule engine to run task with deepspeed 94 | homo_nn_0.hosts[0].conf.set("launcher_name", "deepspeed") # tell schedule engine to run task with deepspeed 95 | 96 | pipeline.add_tasks([reader_0, homo_nn_0]) 97 | pipeline.conf.set("task", dict(engine_run={"cores": 1})) # the number of gpus of each party 98 | 99 | pipeline.compile() 
100 | pipeline.fit() 101 | 102 | return pretrained_model_path 103 | 104 | 105 | if __name__ == "__main__": 106 | parser = argparse.ArgumentParser("LLMSUITE PIPELINE JOB") 107 | parser.add_argument("-c", "--config", type=str, 108 | help="config file", default="../../config.yaml") 109 | parser.add_argument("-p", "--param", type=str, 110 | help="config file for params", default="./bloom_lora_config.yaml") 111 | args = parser.parse_args() 112 | main(args.config, args.param) 113 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fedmkt/utils/generate_logit_utils.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | import torch 17 | import torch.nn.functional as F 18 | import gc 19 | from fate_llm.algo.fedmkt.utils.vars_define import ( 20 | PER_STEP_LOGITS, 21 | PER_STEP_INDICES, 22 | METRIC 23 | ) 24 | 25 | 26 | class Metric(object): 27 | @classmethod 28 | def cal_metric(cls, logits, input_ids, attention_mask, labels, training_args): 29 | if training_args.metric_type == "ce": 30 | return cls.cal_ce(logits, input_ids, attention_mask, labels, training_args) 31 | else: 32 | raise NotImplementedError(f"metric={training_args.metric_type} is not implemented yet") 33 | 34 | @classmethod 35 | def cal_ce(cls, logits, input_ids, attention_mask, labels, training_args): 36 | metric = F.cross_entropy(logits[..., :-1, :].contiguous().view(-1, logits.size(-1)), 37 | labels[..., 1:].contiguous().view(-1), reduction="none").view(logits.size(0), -1) 38 | 39 | metric = (metric * attention_mask[..., 1:]).sum(dim=-1) / attention_mask[..., 1:].sum(dim=-1) 40 | 41 | return metric 42 | 43 | 44 | class LogitsSelection(object): 45 | @classmethod 46 | def select_logits(cls, logits, training_args): 47 | if training_args.top_k_strategy == "highest": 48 | return cls.select_highest(logits, training_args.top_k_logits_keep) 49 | else: 50 | raise NotImplementedError(f"logits selection strategy={training_args.top_k_strategy} is not implemented") 51 | 52 | @classmethod 53 | def select_highest(cls, logits, top_k_logits_keep): 54 | top_k_logits, top_k_indices = torch.topk(logits.cuda(), k=top_k_logits_keep) 55 | logits.cpu() 56 | 57 | return top_k_logits, top_k_indices 58 | 59 | 60 | def generate_pub_data_logits(inputs, model, training_args, data_collator): 61 | input_keys = ["attention_mask", "input_ids", "labels"] 62 | inputs_per_batched = [dict() for _ in range(len(inputs[input_keys[1]]))] 63 | for key in input_keys: 64 | if key not in inputs: 65 | continue 66 | 67 | for idx, _in in enumerate(inputs[key]): 68 | inputs_per_batched[idx][key] = _in 69 | 70 | if "attention_mask" not in inputs: 71 | for idx in range(len(inputs_per_batched)): 72 | inputs_per_batched[idx]["attention_mask"] = [1] * len(inputs_per_batched[idx]["input_ids"]) 73 | 74 | inputs_per_batched =
data_collator(inputs_per_batched) 75 | 76 | input_ids = inputs_per_batched["input_ids"] 77 | attention_mask = inputs_per_batched["attention_mask"] 78 | labels = inputs_per_batched["labels"] 79 | 80 | device = next(model.parameters()).device 81 | if device.type == "cuda": 82 | input_ids = input_ids.cuda(device) 83 | attention_mask = attention_mask.cuda(device) 84 | labels = labels.cuda(device) 85 | 86 | model.eval() 87 | 88 | with torch.no_grad(): 89 | logits = model(input_ids=input_ids, attention_mask=attention_mask).logits 90 | 91 | metric = Metric.cal_metric(logits, input_ids, attention_mask, labels, training_args) 92 | 93 | input_ids.cpu() 94 | del input_ids 95 | attention_mask.cpu() 96 | del attention_mask 97 | labels.cpu() 98 | del labels 99 | logits.cpu() 100 | metric.cpu() 101 | 102 | if training_args.top_k_logits_keep is None: 103 | raise ValueError("Please specify top_k_logits_keep; saving full logits may exceed available memory") 104 | 105 | selected_logits, selected_indices = LogitsSelection.select_logits(logits=logits, training_args=training_args) 106 | selected_logits.cpu() 107 | selected_indices.cpu() 108 | 109 | inputs[PER_STEP_LOGITS] = selected_logits 110 | inputs[PER_STEP_INDICES] = selected_indices 111 | inputs[METRIC] = metric 112 | 113 | del logits 114 | 115 | gc.collect() 116 | 117 | model.train() 118 | 119 | return inputs 120 | -------------------------------------------------------------------------------- /python/fate_llm/dataset/seq_cls_dataset.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License.
15 | # 16 | from fate.ml.nn.dataset.base import Dataset 17 | import pandas as pd 18 | import torch as t 19 | from transformers import AutoTokenizer 20 | import os 21 | import numpy as np 22 | 23 | # avoid tokenizer parallelism 24 | os.environ["TOKENIZERS_PARALLELISM"] = "false" 25 | 26 | 27 | class SeqCLSDataset(Dataset): 28 | """ 29 | A Dataset for some basic NLP Tasks, this dataset will automatically transform raw text into word indices 30 | using AutoTokenizer from transformers library, 31 | 32 | Parameters 33 | ---------- 34 | truncation bool, truncate word sequence to 'text_max_length' 35 | text_max_length int, max length of word sequences 36 | tokenizer_name_or_path str, name of bert tokenizer(see transformers official for details) or path to local 37 | transformer tokenizer folder 38 | return_label bool, return label or not, this option is for host dataset, when running hetero-NN 39 | padding bool, whether to pad the word sequence to 'text_max_length' 40 | padding_side str, 'left' or 'right', where to pad the word sequence 41 | pad_token str, pad token, use this str as pad token, if None, use tokenizer.pad_token 42 | return_input_ids bool, whether to return input_ids or not, if False, return word_idx['input_ids'] 43 | """ 44 | 45 | def __init__( 46 | self, 47 | truncation=True, 48 | text_max_length=128, 49 | tokenizer_name_or_path="bert-base-uncased", 50 | return_label=True, 51 | padding=True, 52 | padding_side="right", 53 | pad_token=None, 54 | return_input_ids=True): 55 | 56 | super(SeqCLSDataset, self).__init__() 57 | self.text = None 58 | self.word_idx = None 59 | self.label = None 60 | self.tokenizer = None 61 | self.sample_ids = None 62 | self.padding = padding 63 | self.truncation = truncation 64 | self.max_length = text_max_length 65 | self.with_label = return_label 66 | self.tokenizer_name_or_path = tokenizer_name_or_path 67 | self.tokenizer = AutoTokenizer.from_pretrained( 68 | self.tokenizer_name_or_path) 69 | self.tokenizer.padding_side = padding_side 70 | self.return_input_ids = return_input_ids 71 | if pad_token is not None: 72 | self.tokenizer.add_special_tokens({'pad_token': pad_token}) 73 | 74 | def load(self, file_path): 75 | 76 | tokenizer = self.tokenizer 77 | self.text = pd.read_csv(file_path) 78 | text_list = list(self.text.text) 79 | 80 | self.word_idx = tokenizer( 81 | text_list, 82 | padding=self.padding, 83 | return_tensors='pt', 84 | truncation=self.truncation, 85 | max_length=self.max_length) 86 | 87 | if self.return_input_ids: 88 | self.word_idx = self.word_idx['input_ids'] 89 | 90 | if self.with_label: 91 | self.label = t.Tensor(self.text.label).detach().numpy() 92 | self.label = self.label.reshape((len(self.text), -1)) 93 | 94 | if 'id' in self.text: 95 | self.sample_ids = self.text['id'].values.tolist() 96 | 97 | def get_classes(self): 98 | return np.unique(self.label).tolist() 99 | 100 | def get_vocab_size(self): 101 | return self.tokenizer.vocab_size 102 | 103 | def get_sample_ids(self): 104 | return self.sample_ids 105 | 106 | def __getitem__(self, item): 107 | 108 | if self.return_input_ids: 109 | ret = self.word_idx[item] 110 | else: 111 | ret = {k: v[item] for k, v in self.word_idx.items()} 112 | 113 | if self.with_label: 114 | return ret, self.label[item] 115 | 116 | return ret 117 | 118 | def __len__(self): 119 | return len(self.text) 120 | 121 | def __repr__(self): 122 | return self.tokenizer.__repr__() -------------------------------------------------------------------------------- /python/fate_llm/dataset/input_output_dataset.py: 
-------------------------------------------------------------------------------- 1 | from fate.ml.nn.dataset.base import Dataset 2 | from transformers.trainer_pt_utils import LabelSmoother 3 | from typing import List, Dict, Union, Literal 4 | import logging 5 | from jinja2 import Template 6 | from transformers import AutoTokenizer 7 | 8 | 9 | logger = logging.getLogger(__name__) 10 | 11 | 12 | class InputOutputDataset(Dataset): 13 | 14 | def __init__(self, 15 | tokenizer_path, 16 | input_template: str, 17 | output_template: str, 18 | max_input_length: int = 256, 19 | max_target_length: int = 256, 20 | load_from: Literal['jsonl', 'hf_load_from_disk', 'hf_load_dataset'] = 'hf_load_from_disk', 21 | split_key: str = None 22 | ): 23 | 24 | super().__init__() 25 | self.tokenizer = None 26 | self.tokenizer_path = tokenizer_path 27 | self.tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_path, trust_remote_code=True) 28 | self.max_input_length = max_input_length 29 | self.max_target_length = max_target_length 30 | self.dataset = None 31 | self.load_from = load_from 32 | self.input_template = Template(input_template) 33 | self.output_template = Template(output_template) 34 | self.split_key = split_key 35 | self.max_seq_length = max_input_length + max_target_length + 1 36 | 37 | def load(self, path): 38 | if self.load_from == 'hf_load_from_disk': 39 | import datasets 40 | self.dataset = datasets.load_from_disk(path) 41 | if self.split_key is not None: 42 | self.dataset = self.dataset[self.split_key] 43 | self.dataset = [i for i in self.dataset] 44 | elif self.load_from == 'jsonl': 45 | import json 46 | with open(path, 'r') as f: 47 | json_lines = f.read().split('\n') 48 | self.dataset = [] 49 | for i in json_lines: 50 | try: 51 | self.dataset.append(json.loads(i)) 52 | except: 53 | print('skip line') 54 | elif self.load_from == 'hf_load_dataset': 55 | from datasets import load_dataset 56 | self.dataset = load_dataset(path) 57 | if self.split_key is not None: 58 | self.dataset = self.dataset[self.split_key] 59 | self.dataset = [i for i in self.dataset] 60 | else: 61 | raise ValueError('unknown load format') 62 | 63 | if not isinstance(self.dataset, list) or not isinstance(self.dataset[0], dict): 64 | logger.warn('loaded dataset is expected to be a list of dict') 65 | 66 | def get_raw_dataset(self): 67 | return self.dataset 68 | 69 | def __len__(self): 70 | return len(self.dataset) 71 | 72 | def get_str_item(self, i) -> dict: 73 | 74 | data_item = self.dataset[i] 75 | in_ = self.input_template.render(**data_item) 76 | out_ = self.output_template.render(**data_item) 77 | return { 78 | 'input': in_, 79 | 'output': out_ 80 | } 81 | 82 | def _process_item(self, data_item): 83 | 84 | a_ids = self.tokenizer.encode(text=data_item['input'], add_special_tokens=True, truncation=True, 85 | max_length=self.max_input_length) 86 | b_ids = self.tokenizer.encode(text=data_item['output'], add_special_tokens=False, truncation=True, 87 | max_length=self.max_target_length) 88 | 89 | context_length = len(a_ids) 90 | input_ids = a_ids + b_ids + [self.tokenizer.eos_token_id] 91 | labels = [self.tokenizer.pad_token_id] * context_length + b_ids + [self.tokenizer.eos_token_id] 92 | 93 | pad_len = self.max_seq_length - len(input_ids) 94 | input_ids = input_ids + [self.tokenizer.pad_token_id] * pad_len 95 | labels = labels + [self.tokenizer.pad_token_id] * pad_len 96 | labels = [(l if l != self.tokenizer.pad_token_id else -100) for l in labels] 97 | 98 | assert len(input_ids) == len(labels), f"length mismatch: 
{len(input_ids)} vs {len(labels)}" 99 | 100 | return { 101 | "input_ids": input_ids, 102 | "labels": labels 103 | } 104 | 105 | def get_tokenized_item(self, i) -> dict: 106 | 107 | str_item = self.get_str_item(i) 108 | ret_dict = self._process_item(str_item) 109 | return ret_dict 110 | 111 | def __getitem__(self, i) -> dict: 112 | item = self.get_tokenized_item(i) 113 | return item 114 | -------------------------------------------------------------------------------- /python/fate_llm/evaluate/scripts/eval_cli.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2024 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | 17 | import os 18 | import copy 19 | import click 20 | import yaml 21 | import warnings 22 | 23 | from typing import Union 24 | from ._options import LlmSharedOptions 25 | from ..utils.config import default_eval_config 26 | from ..utils.llm_evaluator import evaluate, init_tasks, aggregate_table 27 | from ..utils.model_tools import load_by_loader 28 | from ..utils._io import echo 29 | from ..utils._parser import LlmSuite 30 | 31 | @click.command('evaluate') 32 | @click.option('-i', '--include', required=True, type=click.Path(exists=True), 33 | help='Path to model and metrics conf') 34 | @click.option('-c', '--eval-config', type=click.Path(exists=True), help='Path to FATE Llm evaluation config. ' 35 | 'If not provided, use default config.') 36 | @click.option('-o', '--result-output', type=click.Path(), 37 | help='Path to save evaluation results.') 38 | # @click.argument('other_args', nargs=-1) 39 | @LlmSharedOptions.get_shared_options(hidden=True) 40 | @click.pass_context 41 | def run_evaluate(ctx, include, eval_config, result_output, **kwargs): 42 | """ 43 | Evaluate a pretrained model with specified parameters. 44 | """ 45 | ctx.obj.update(**kwargs) 46 | ctx.obj.post_process() 47 | # namespace = ctx.obj["namespace"] 48 | yes = ctx.obj["yes"] 49 | 50 | echo.echo(f"include: {include}", fg='red') 51 | try: 52 | # include = os.path.abspath(include) 53 | suite = LlmSuite.load(include) 54 | except Exception as e: 55 | raise ValueError(f"Invalid include path: {include}, please check. 
{e}") 56 | 57 | if not eval_config: 58 | eval_config = default_eval_config() 59 | 60 | if not os.path.exists(eval_config): 61 | eval_config = None 62 | 63 | if not yes and not click.confirm("running?"): 64 | return 65 | # init tasks 66 | init_tasks() 67 | # run_suite_eval(suite, eval_config_dict, result_output) 68 | run_suite_eval(suite, eval_config, result_output) 69 | 70 | def run_job_eval(job, eval_conf): 71 | job_eval_conf = {} 72 | if isinstance(eval_conf, dict): 73 | job_eval_conf.update(eval_conf) 74 | elif eval_conf is not None and os.path.exists(eval_conf): 75 | with open(eval_conf, 'r') as f: 76 | job_eval_conf.update(yaml.safe_load(f)) 77 | 78 | # echo.echo(f"Evaluating job: {job.job_name} with tasks: {job.tasks}") 79 | if job.eval_conf_path: 80 | # job-level eval conf takes priority 81 | with open(job.eval_conf_path, 'r') as f: 82 | job_eval_conf.update(yaml.safe_load(f)) 83 | # get loader 84 | if job.loader: 85 | if job.peft_path: 86 | model = load_by_loader(loader_name=job.loader, 87 | loader_conf_path=loader_conf_path, 88 | peft_path=job.peft_path) 89 | else: 90 | model = load_by_loader(loader_name=job.loader, 91 | loader_conf_path=loader_conf_path) 92 | result = evaluate(model=model, tasks=job.tasks, include_path=job.include_path, **job_eval_conf) 93 | else: 94 | # feed in pretrained & peft path 95 | job_eval_conf["model_args"]["pretrained"] = job.pretrained_model_path 96 | if job.peft_path: 97 | job_eval_conf["model_args"]["peft"] = job.peft_path 98 | result = evaluate(tasks=job.tasks, include_path=job.include_path, **job_eval_conf) 99 | return result 100 | 101 | 102 | def run_suite_eval(suite, eval_conf, output_path=None): 103 | suite_results = dict() 104 | for pair in suite.pairs: 105 | job_results = dict() 106 | for job in pair.jobs: 107 | if not job.evaluate_only: 108 | # give warning that job will be skipped 109 | warnings.warn(f"Job {job.job_name} will be skipped since no pretrained model is provided") 110 | continue 111 | echo.echo(f"Evaluating job: {job.job_name} with tasks: {job.tasks}") 112 | result = run_job_eval(job, eval_conf) 113 | job_results[job.job_name] = result 114 | suite_results[pair.pair_name] = job_results 115 | suite_writers = aggregate_table(suite_results) 116 | for pair_name, pair_writer in suite_writers.items(): 117 | echo.sep_line() 118 | echo.echo(f"Pair: {pair_name}") 119 | echo.sep_line() 120 | echo.echo(pair_writer.dumps()) 121 | echo.stdout_newline() 122 | 123 | if output_path: 124 | with open(output_path, 'w') as f: 125 | for pair_name, pair_writer in suite_writers.items(): 126 | pair_writer.dumps(f) 127 | -------------------------------------------------------------------------------- /examples/offsite_tuning/offsite_tuning.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import yaml 3 | from fate_client.pipeline.components.fate.reader import Reader 4 | from fate_client.pipeline import FateFlowPipeline 5 | from fate_client.pipeline.components.fate.homo_nn import HomoNN, get_conf_of_ot_runner 6 | from fate_client.pipeline.components.fate.nn.algo_params import Seq2SeqTrainingArguments, FedAVGArguments 7 | from fate_client.pipeline.components.fate.nn.loader import LLMModelLoader, LLMDatasetLoader, LLMDataFuncLoader 8 | from fate_client.pipeline.components.fate.nn.torch.base import Sequential 9 | from fate_client.pipeline.components.fate.nn.torch import nn 10 | 11 | def load_params(file_path): 12 | """Load and parse the YAML params file.""" 13 | with open(file_path, 'r') as f: 14 | 
params = yaml.safe_load(f) 15 | return params 16 | 17 | def setup_pipeline(params): 18 | """Set up the pipeline using the provided parameters.""" 19 | guest = params['pipeline']['guest'] 20 | arbiter = params['pipeline']['arbiter'] 21 | pretrained_model_path = params['paths']['pretrained_model_path'] 22 | 23 | pipeline = FateFlowPipeline().set_parties(guest=guest, arbiter=arbiter) 24 | 25 | reader = Reader("reader_0", runtime_parties=dict(guest=guest)) 26 | reader.guest.task_parameters( 27 | namespace=params['pipeline']['namespace'], 28 | name=params['pipeline']['name'] 29 | ) 30 | 31 | client_model = LLMModelLoader( 32 | module_name=params['models']['client']['module_name'], 33 | item_name=params['models']['client']['item_name'], 34 | model_name_or_path=pretrained_model_path, 35 | emulator_layer_num=params['models']['client']['emulator_layer_num'], 36 | adapter_top_layer_num=params['models']['client']['adapter_top_layer_num'], 37 | adapter_bottom_layer_num=params['models']['client']['adapter_bottom_layer_num'] 38 | ) 39 | 40 | server_model = LLMModelLoader( 41 | module_name=params['models']['server']['module_name'], 42 | item_name=params['models']['server']['item_name'], 43 | model_name_or_path=pretrained_model_path, 44 | emulator_layer_num=params['models']['server']['emulator_layer_num'], 45 | adapter_top_layer_num=params['models']['server']['adapter_top_layer_num'], 46 | adapter_bottom_layer_num=params['models']['server']['adapter_bottom_layer_num'] 47 | ) 48 | 49 | dataset = LLMDatasetLoader( 50 | module_name=params['dataset']['module_name'], 51 | item_name=params['dataset']['item_name'], 52 | tokenizer_name_or_path=params['dataset']['tokenizer_name_or_path'], 53 | select_num=params['dataset']['select_num'] 54 | ) 55 | 56 | data_collator = LLMDataFuncLoader( 57 | module_name=params['data_collator']['module_name'], 58 | item_name=params['data_collator']['item_name'], 59 | tokenizer_name_or_path=params['data_collator']['tokenizer_name_or_path'] 60 | ) 61 | 62 | train_args = Seq2SeqTrainingArguments( 63 | per_device_train_batch_size=params['training']['batch_size'], 64 | learning_rate=params['training']['learning_rate'], 65 | disable_tqdm=False, 66 | num_train_epochs=params['training']['num_train_epochs'], 67 | logging_steps=params['training']['logging_steps'], 68 | logging_strategy='steps', 69 | dataloader_num_workers=4, 70 | use_cpu=False, 71 | deepspeed=params['training']['deepspeed'], # Add DeepSpeed config here 72 | remove_unused_columns=False, 73 | fp16=True 74 | ) 75 | 76 | client_conf = get_conf_of_ot_runner( 77 | model=client_model, 78 | dataset=dataset, 79 | data_collator=data_collator, 80 | training_args=train_args, 81 | fed_args=FedAVGArguments(), 82 | aggregate_model=False, 83 | ) 84 | 85 | server_conf = get_conf_of_ot_runner( 86 | model=server_model, 87 | dataset=dataset, 88 | data_collator=data_collator, 89 | training_args=train_args, 90 | fed_args=FedAVGArguments(), 91 | aggregate_model=False 92 | ) 93 | 94 | homo_nn = HomoNN( 95 | 'nn_0', 96 | train_data=reader.outputs["output_data"], 97 | runner_module="offsite_tuning_runner", 98 | runner_class="OTRunner" 99 | ) 100 | 101 | homo_nn.guest.task_parameters(runner_conf=client_conf) 102 | homo_nn.arbiter.task_parameters(runner_conf=server_conf) 103 | 104 | # If using Eggroll, you can add this line to submit your job 105 | homo_nn.guest.conf.set("launcher_name", "deepspeed") 106 | 107 | pipeline.add_tasks([reader, homo_nn]) 108 | pipeline.conf.set("task", dict(engine_run=params['pipeline']['engine_run'])) 109 | 
pipeline.compile() 110 | pipeline.fit() 111 | 112 | def main(config_file, param_file): 113 | params = load_params(param_file) 114 | setup_pipeline(params) 115 | 116 | if __name__ == "__main__": 117 | parser = argparse.ArgumentParser("LLMSUITE Offsite-tuning JOB") 118 | parser.add_argument("-c", "--config", type=str, 119 | help="Path to config file", default="./config.yaml") 120 | parser.add_argument("-p", "--param", type=str, 121 | help="Path to parameter file", default="./test_offsite_tuning_llmsuite.yaml") 122 | args = parser.parse_args() 123 | main(args.config, args.param) 124 | -------------------------------------------------------------------------------- /python/fate_llm/model_zoo/pellm/parameter_efficient_llm.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | import peft 17 | import torch 18 | from collections.abc import Mapping 19 | from peft import PeftModel, TaskType 20 | from transformers import AutoConfig 21 | from transformers import AutoModel 22 | from transformers.configuration_utils import PretrainedConfig 23 | import logging 24 | 25 | 26 | logger = logging.getLogger(__name__) 27 | 28 | 29 | AVAILABLE_PEFT_CONFIG = list( 30 | filter( 31 | lambda peft_type: peft_type.endswith("Config"), dir(peft) 32 | ) 33 | ) 34 | 35 | 36 | class PELLM(torch.nn.Module): 37 | 38 | config_class: PretrainedConfig = None 39 | model_loader = None 40 | 41 | def __init__(self, 42 | config: dict = None, 43 | pretrained_path: str = None, 44 | peft_type: str = None, 45 | peft_config=None, 46 | torch_dtype: str = None, 47 | trust_remote_code: bool = False, 48 | **kwargs 49 | ) -> None: 50 | 51 | super().__init__() 52 | self._pe_lm: PeftModel = None 53 | self.config = config 54 | self.config_path = pretrained_path 55 | self.peft_type = peft_type 56 | self.peft_config = peft_config 57 | self.torch_dtype = None if not torch_dtype else getattr(torch, torch_dtype) 58 | self.trust_remote_code = trust_remote_code 59 | 60 | assert self.config_path is not None or self.config is not None, \ 61 | "At least one of config_path and config must be set." 
62 | self._init_pelm(**kwargs) 63 | 64 | def _init_pelm(self, **kwargs): 65 | self.init_lm_with_peft(**kwargs) 66 | self.model_summary() 67 | 68 | def init_lm_with_peft(self, **kwargs): 69 | self.init_config(**kwargs) 70 | self.init_base_lm() 71 | self.add_peft() 72 | 73 | def init_config(self, **kwargs): 74 | if self.config_path is not None: 75 | self.config = AutoConfig.from_pretrained(self.config_path, trust_remote_code=self.trust_remote_code) 76 | elif self.config is not None and self.config_class is not None: 77 | self.config = self.config_class().from_dict(self.config) 78 | else: 79 | raise ValueError( 80 | 'config_path to pretrained model folder and model config dict cannot be None at the same time, ' 81 | 'you need to specify one of them') 82 | 83 | if kwargs: 84 | self.config.update(kwargs) 85 | 86 | def init_base_lm(self, **kwargs): 87 | model_loader = self.model_loader if self.model_loader is not None else AutoModel 88 | if self.config is not None: 89 | self._pe_lm = model_loader.from_pretrained( 90 | self.config_path, config=self.config, 91 | torch_dtype=self.torch_dtype, **kwargs, 92 | trust_remote_code=self.trust_remote_code 93 | ) 94 | elif self.config_path is not None: 95 | self._pe_lm = model_loader.from_pretrained( 96 | self.config_path, torch_dtype=self.torch_dtype, 97 | trust_remote_code=self.trust_remote_code, **kwargs) 98 | else: 99 | raise ValueError( 100 | 'config_path to pretrained model folder cannot be None') 101 | 102 | def add_peft(self): 103 | assert self.peft_type in AVAILABLE_PEFT_CONFIG, 'peft name {} not in available config {}'.format( 104 | self.peft_type, AVAILABLE_PEFT_CONFIG) 105 | 106 | if self.peft_config is None: 107 | peft_config = getattr(peft, self.peft_type)() 108 | elif isinstance(self.peft_config, dict): 109 | peft_config = getattr(peft, self.peft_type)(**self.peft_config) 110 | else: 111 | raise ValueError(f"Can not parse peft_config of {type(self.peft_config)}") 112 | 113 | self._pe_lm = peft.get_peft_model(self._pe_lm, peft_config) 114 | self.peft_config = peft_config 115 | 116 | def model_summary(self): 117 | if hasattr(self._pe_lm, "print_trainable_parameters"): 118 | summary = self._pe_lm.print_trainable_parameters() 119 | logger.debug(f'PELLM model summary: \n{summary}') 120 | 121 | def forward(self, *args, **kwargs): 122 | forward_ret = self._pe_lm.forward(*args, **kwargs) 123 | 124 | if self.peft_config is None or self.peft_config.task_type != TaskType.SEQ_CLS: 125 | return forward_ret 126 | else: 127 | return forward_ret.logits 128 | 129 | def save_trainable(self, output_path): 130 | self._pe_lm.save_pretrained(output_path) 131 | 132 | 133 | class AutoPELLM(PELLM): 134 | 135 | def __init__(self, **kwargs) -> None: 136 | super().__init__(**kwargs) 137 | -------------------------------------------------------------------------------- /python/fate_llm/runner/fedkseed_runner.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2019 The FATE Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | import logging 17 | from typing import Dict 18 | from typing import Literal 19 | from typing import Optional 20 | 21 | import transformers 22 | from fate.components.components.nn.nn_runner import ( 23 | NNRunner, 24 | dir_warning, 25 | loader_load_from_conf, 26 | ) 27 | from fate.components.components.nn.runner.homo_default_runner import DefaultRunner 28 | 29 | from fate_llm.algo.fedkseed.fedkseed import Trainer, FedKSeedTrainingArguments, ClientTrainer 30 | from fate_llm.algo.fedkseed.zo_utils import build_seed_candidates 31 | from fate_llm.trainer.seq2seq_trainer import Seq2SeqTrainingArguments 32 | 33 | logger = logging.getLogger(__name__) 34 | 35 | SUPPORTED_ALGO = ["fedkseed"] 36 | 37 | 38 | class FedKSeedRunner(DefaultRunner): 39 | def __init__( 40 | self, 41 | algo: str = "fedkseed", 42 | model_conf: Optional[Dict] = None, 43 | dataset_conf: Optional[Dict] = None, 44 | optimizer_conf: Optional[Dict] = None, 45 | training_args_conf: Optional[Dict] = None, 46 | fed_args_conf: Optional[Dict] = None, 47 | data_collator_conf: Optional[Dict] = None, 48 | tokenizer_conf: Optional[Dict] = None, 49 | task_type: Literal["causal_lm", "others"] = "causal_lm", 50 | local_mode: bool = False, 51 | save_trainable_weights_only: bool = False, 52 | ) -> None: 53 | super(NNRunner, self).__init__() 54 | self.algo = algo 55 | self.model_conf = model_conf 56 | self.dataset_conf = dataset_conf 57 | self.optimizer_conf = optimizer_conf 58 | self.training_args_conf = training_args_conf 59 | self.fed_args_conf = fed_args_conf 60 | self.data_collator_conf = data_collator_conf 61 | self.local_mode = local_mode 62 | self.tokenizer_conf = tokenizer_conf 63 | self.task_type = task_type 64 | self.save_trainable_weights_only = save_trainable_weights_only 65 | 66 | # check param 67 | if self.algo not in SUPPORTED_ALGO: 68 | raise ValueError(f"algo should be one of {SUPPORTED_ALGO}") 69 | if self.task_type not in ["causal_lm", "others"]: 70 | raise ValueError("task_type should be one of ['causal_lm', 'others']") 71 | assert isinstance(self.local_mode, bool), "local_mode should be bool" 72 | 73 | # setup var 74 | self.trainer = None 75 | self.training_args = None 76 | 77 | def client_setup(self, train_set=None, validate_set=None, output_dir=None, saved_model=None, stage="train"): 78 | if self.algo != "fedkseed": 79 | raise ValueError(f"algo {self.algo} not supported") 80 | 81 | ctx = self.get_context() 82 | 83 | model = maybe_loader_load_from_conf(self.model_conf) 84 | if model is None: 85 | raise ValueError(f"model is None, cannot load model from conf {self.model_conf}") 86 | 87 | if output_dir is None: 88 | output_dir = "./" 89 | 90 | tokenizer = transformers.AutoTokenizer.from_pretrained(**self.data_collator_conf["kwargs"]["tokenizer_params"]) 91 | 92 | data_collator = transformers.DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False) 93 | dir_warning(self.training_args_conf) 94 | 95 | training_args = Seq2SeqTrainingArguments(**self.training_args_conf) 96 | self.training_args = training_args 97 | training_args.output_dir = output_dir 98 | fedkseed_args = FedKSeedTrainingArguments(**self.fed_args_conf) 99 | logger.debug(f"training_args: {training_args}") 100 | logger.debug(f"fedkseed_args: {fedkseed_args}") 101 | trainer = ClientTrainer( 102 | ctx=ctx, 103 | model=model, 104 | training_args=training_args, 105 | fedkseed_args=fedkseed_args, 106 | data_collator=data_collator, 107 |
tokenizer=tokenizer, 108 | train_dataset=train_set, 109 | eval_dataset=validate_set, 110 | ) 111 | return trainer 112 | 113 | def server_setup(self, stage="train"): 114 | 115 | if self.algo != "fedkseed": 116 | raise ValueError(f"algo {self.algo} not supported") 117 | ctx = self.get_context() 118 | 119 | fedkseed_args = FedKSeedTrainingArguments(**self.fed_args_conf) 120 | training_args = Seq2SeqTrainingArguments(**self.training_args_conf) 121 | 122 | seed_candidates = build_seed_candidates(fedkseed_args.k, low=0, high=2 ** 32) 123 | trainer = Trainer(ctx=ctx, seed_candidates=seed_candidates, args=training_args, fedkseed_args=fedkseed_args) 124 | return trainer 125 | 126 | 127 | def maybe_loader_load_from_conf(conf): 128 | from fate_llm.model_zoo.hf_model import HFAutoModelForCausalLM 129 | 130 | model = loader_load_from_conf(conf) 131 | if isinstance(model, HFAutoModelForCausalLM): 132 | model = model.load() 133 | return model 134 | -------------------------------------------------------------------------------- /python/fate_llm/algo/fedmkt/fedmkt_data_collator.py: -------------------------------------------------------------------------------- 1 | # 2 | # NOTE: The implementations of DataCollatorForFedMKT is modified from FuseAI/FuseLLM 3 | # Copyright FuseAI/FuseLLM 4 | # 5 | # 6 | # Copyright 2019 The FATE Authors. All Rights Reserved. 7 | # 8 | # Licensed under the Apache License, Version 2.0 (the "License"); 9 | # you may not use this file except in compliance with the License. 10 | # You may obtain a copy of the License at 11 | # 12 | # http://www.apache.org/licenses/LICENSE-2.0 13 | # 14 | # Unless required by applicable law or agreed to in writing, software 15 | # distributed under the License is distributed on an "AS IS" BASIS, 16 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | # See the License for the specific language governing permissions and 18 | # limitations under the License. 
19 | # 20 | import torch 21 | from torch.nn.functional import softmax 22 | from transformers import DataCollatorForSeq2Seq 23 | from transformers.tokenization_utils_base import PreTrainedTokenizerBase 24 | from transformers.utils import PaddingStrategy 25 | from typing import Optional, Any, Union 26 | import logging 27 | from fate_llm.algo.fedmkt.utils.vars_define import ( 28 | ALIGNED_OTHER_LOGITS, 29 | ALIGNED_OTHER_INDICES, 30 | PER_STEP_LOGITS, 31 | PER_STEP_INDICES, 32 | SELF_TARGET_DIST, 33 | OTHER_TARGET_DIST 34 | ) 35 | 36 | 37 | logger = logging.getLogger(__name__) 38 | 39 | 40 | class DataCollatorForFedMKT(DataCollatorForSeq2Seq): 41 | """modified from https://github.com/fanqiwan/FuseAI/blob/main/FuseLLM/src/utils/data_collator.py#L135""" 42 | tokenizer: PreTrainedTokenizerBase 43 | model: Optional[Any] = None 44 | padding: Union[bool, str, PaddingStrategy] = True 45 | max_length: Optional[int] = None 46 | pad_to_multiple_of: Optional[int] = None 47 | label_pad_token_id: int = -100 48 | return_tensors: str = "pt" 49 | blending_num: int = 1 50 | distill_temperature: float = 1.0 51 | vocab_size: int = None 52 | dtype: torch.dtype = torch.bfloat16 53 | 54 | def __init__(self, *args, **kwargs): 55 | blending_num = kwargs.pop("blending_num", 4) 56 | vocab_size = kwargs.pop("vocab_size", None) 57 | dtype = kwargs.pop("dtype", torch.bfloat16) 58 | distill_temperature = kwargs.pop("distill_temperature", 1.0) 59 | super(DataCollatorForFedMKT, self).__init__(*args, **kwargs) 60 | self.blending_num = blending_num 61 | self.vocab_size = vocab_size if vocab_size is not None else len(self.tokenizer.get_vocab()) 62 | self.pad_id = self.tokenizer.pad_token_id 63 | self.dtype = dtype 64 | self.distill_temperature = distill_temperature 65 | 66 | def __call__(self, features, return_tensors=None): 67 | extra_features = dict() 68 | feature_keys = list(features[0].keys()) 69 | for f_key in feature_keys: 70 | if f_key not in ["input_ids", "attention_mask", "labels"]: 71 | extra_features[f_key] = [] 72 | for feature in features: 73 | extra_features[f_key].append(feature.pop(f_key)) 74 | 75 | features = super().__call__(features=features, return_tensors=return_tensors) 76 | 77 | features.update(extra_features) 78 | 79 | batch_size = features["input_ids"].size(0) 80 | base_target_dist = torch.zeros(batch_size, self.max_length, self.vocab_size).to(self.dtype) 81 | aligned_target_dists = [torch.zeros(batch_size, self.max_length, self.vocab_size).to(self.dtype) 82 | for _ in range(self.blending_num)] 83 | 84 | for i in range(batch_size): 85 | base_seq_len = len(features[PER_STEP_LOGITS][i]) 86 | for j in range(self.max_length): 87 | if j < base_seq_len: 88 | base_logits = torch.tensor(features[PER_STEP_LOGITS][i][j], dtype=self.dtype) 89 | base_prob = softmax(base_logits / self.distill_temperature, -1) 90 | base_indices = torch.tensor(features[PER_STEP_INDICES][i][j]) 91 | base_target_dist[i][j] = base_target_dist[i][j].scatter_(-1, base_indices, base_prob) 92 | 93 | for k in range(self.blending_num): 94 | per_step_aligned_indices_key = f"{ALIGNED_OTHER_INDICES}_{k}" 95 | per_step_aligned_logits_key = f"{ALIGNED_OTHER_LOGITS}_{k}" 96 | if len(features[per_step_aligned_indices_key][i][j]) > 0: 97 | aligned_logits = torch.tensor(features[per_step_aligned_logits_key][i][j], dtype=self.dtype) 98 | aligned_prob = softmax(aligned_logits / self.distill_temperature, -1) 99 | aligned_indices = torch.tensor(features[per_step_aligned_indices_key][i][j]) 100 | aligned_target_dists[k][i][j] =
aligned_target_dists[k][i][j].scatter_(-1, aligned_indices, aligned_prob) 101 | else: 102 | aligned_target_dists[k][i][j] = base_target_dist[i][j] 103 | 104 | else: # padding position 105 | base_target_dist[i][j][self.pad_id] = 1.0 106 | for k in range(self.blending_num): 107 | aligned_target_dists[k][i][j][self.pad_id] = 1.0 108 | 109 | features.pop(PER_STEP_LOGITS) 110 | features.pop(PER_STEP_INDICES) 111 | for i in range(self.blending_num): 112 | features.pop(f"{ALIGNED_OTHER_LOGITS}_{i}") 113 | features.pop(f"{ALIGNED_OTHER_INDICES}_{i}") 114 | features[f"{OTHER_TARGET_DIST}_{i}"] = aligned_target_dists[i] 115 | 116 | features[SELF_TARGET_DIST] = base_target_dist 117 | 118 | return features 119 | -------------------------------------------------------------------------------- /doc/fate_llm_evaluate.md: -------------------------------------------------------------------------------- 1 | ## FATE-LLM Python SDK 2 | 3 | FATE-LLM Python SDK provides a simple API for evaluating large language models. 4 | Built on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/), our evaluation tool may be used on pre-trained models from Huggingface, locally built models, as well as FATE-LLM models. 5 | [Built-in datasets](#built-in-tasks) currently include Dolly-15k and Advertise Generation. 6 | Below shows how to evaluate a given LLM model in a few lines. For quick single-model evaluation, the steps below should suffice; if comparative evaluation among multiple models is desired, the CLI is recommended. 7 | 8 | ```python 9 | from lm_eval.models.huggingface import HFLM 10 | from fate_llm.evaluate.utils import llm_evaluator 11 | 12 | # download data for built-in tasks if running fate-llm evaluation for the first time 13 | # alternatively, use CLI `fate_llm data download` to download data 14 | llm_evaluator.download_task("dolly-15k") 15 | # set paths of built-in tasks 16 | llm_evaluator.init_tasks() 17 | # load model 18 | bloom_lm = HFLM(pretrained='bloom-560m') 19 | # if loading local model, specify peft storage location 20 | # bloom_lm = HFLM(pretrained='bloom-560m', peft_path_format="path/to/peft") 21 | # run evaluation 22 | llm_evaluator.evaluate(model=bloom_lm, tasks="dolly-15k", show_result=True) 23 | ``` 24 | 25 | When the network allows, or if data are already cached, tasks from lm-evaluation-harness may be provided for evaluation in a similar style. 26 | 27 | ```python 28 | from lm_eval.models.huggingface import HFLM 29 | from fate_llm.evaluate.utils import llm_evaluator 30 | # load model 31 | bloom_lm = HFLM(pretrained='bloom-560m') 32 | # if loading local model, specify peft storage location 33 | # bloom_lm = HFLM(pretrained='bloom-560m', peft_path_format="path/to/peft") 34 | # run evaluation 35 | llm_evaluator.evaluate(model=bloom_lm, tasks="ceval", show_result=True) 36 | ``` 37 | 38 | ## FATE-LLM Command Line Interface 39 | 40 | FATE-LLM provides built-in tasks for comparing evaluation results of different LLM models. 41 | Alternatively, users may provide arbitrary tasks for evaluation. 42 | 43 | ### install 44 | 45 | ```bash 46 | cd {path_to_fate_llm}/python 47 | pip install -e . 48 | ``` 49 | 50 | ### command options 51 | 52 | ```bash 53 | fate_llm --help 54 | ``` 55 | 56 | #### evaluate: 57 | 58 | 59 | 1. in: 60 | 61 | ```bash 62 | fate_llm evaluate -i <path1> 63 | ``` 64 | 65 | will run llm testsuites in 66 | *path1* 67 | 68 | 2. eval-config: 69 | 70 | ```bash 71 | fate_llm evaluate -i <path1> -c <path2> 72 | ``` 73 | 74 | 75 | will run llm testsuites in *path1* with evaluation configuration set to *path2* 76 | 77 | 3.
result-output: 78 | 79 | ```bash 80 | fate_llm evaluate -i <path1> -o <path2> 81 | ``` 82 | 83 | will run llm testsuites in *path1* with evaluation result output stored in *path2* 84 | 85 | ### config 86 | 87 | ```bash 88 | fate_llm config --help 89 | ``` 90 | 91 | 1. new: 92 | ```bash 93 | fate_llm config new 94 | ``` 95 | 96 | will create a new evaluation configuration file in the current directory 97 | 98 | 2. show: 99 | 100 | ```bash 101 | fate_llm config show 102 | ``` 103 | 104 | will show the current evaluation configuration 105 | 106 | 3. edit: 107 | 108 | ```bash 109 | fate_llm config edit 110 | ``` 111 | 112 | will edit the evaluation configuration 113 | 114 | ### data 115 | 116 | ```bash 117 | fate_llm data --help 118 | ``` 119 | 1. download: 120 | 121 | ```bash 122 | fate_llm data download -t <task1> -t <task2> ... 123 | ``` 124 | 125 | will download the corresponding data for the given tasks 126 | 127 | 128 | ### FATE-LLM Eval job configuration 129 | 130 | Configuration of jobs should be specified in a yaml file. 131 | 132 | A FATE-LLM testsuite includes the following elements: 133 | 134 | - job group: each group includes an arbitrary number of jobs with paths 135 | to the corresponding script and configuration 136 | 137 | - job: name of the evaluation job to be run, must be unique within each group 138 | 139 | - pretrained: path to the pretrained model, should be either a model name from Huggingface or a path relative to the 140 | testsuite 141 | - peft: path to the peft file, should be relative to the testsuite, 142 | optional 143 | - tasks: list of tasks to be evaluated, optional for jobs skipping evaluation 144 | - include_path: should be specified if tasks are user-defined 145 | - eval_conf: path to the evaluation configuration file, should be 146 | relative to the testsuite; if not provided, will use default conf 147 | 148 | ```yaml 149 | bloom_lora: 150 | pretrained: "bloom-560m" 151 | peft_path_format: "{{fate_base}}/fate_flow/model/{{job_id}}/guest/{{party_id}}/{{model_task_name}}/0/output/output_model/model_directory" 152 | tasks: 153 | - "dolly-15k" 154 | 155 | ``` 156 | 157 | - llm suite 158 | 159 | ```yaml 160 | bloom_suite: 161 | bloom_zero_shot: 162 | pretrained: "bloom-560m" 163 | tasks: 164 | - "dolly-15k" 165 | ``` 166 | 167 | ## Built-in Tasks 168 | 169 | Currently, we include the following tasks in FATE-LLM Evaluate: 170 | 171 | | Task Name | Alias | Task Type | Metric | source | 172 | |:---------:|:-------------:|:----------:|:-------:|:-------------------------------------------------------------------------:| 173 | | Dolly-15k | dolly-15k | generation | rouge-L | [link](https://huggingface.co/datasets/databricks/databricks-dolly-15k) | 174 | | ADGEN | advertise-gen | generation | rouge-L | [link](https://github.com/THUDM/ChatGLM-6B/blob/main/ptuning/README_en.md#instructions) | 175 | 176 | Use the corresponding alias to reference tasks in the system. 177 | --------------------------------------------------------------------------------
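For comparative evaluation across models, a suite group can hold several jobs that are evaluated together and reported side by side. The sketch below simply combines the two yaml snippets from the doc above into one group; the group and job names, the peft path, and the task list are illustrative placeholders rather than shipped defaults:

```yaml
bloom_suite:
  bloom_zero_shot:
    pretrained: "bloom-560m"
    tasks:
      - "dolly-15k"
  bloom_lora:
    pretrained: "bloom-560m"
    peft_path_format: "path/to/peft"
    tasks:
      - "dolly-15k"
```

Running `fate_llm evaluate -i <path to this suite>` would then evaluate both jobs and report their results in one aggregated table.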