├── doc
│   ├── FedML_ResNet18.md
│   ├── .gitignore
│   ├── FedML_YOLOv3.md
│   └── FedML.md
├── README.md
├── .gitignore
└── LICENSE

/doc/FedML_ResNet18.md:
--------------------------------------------------------------------------------
 1 | # FedML_ResNet18
 2 | 
 3 | ## 1. Introduction to FedML
 4 | 
 5 | - For an introduction to FedML, see [FedML.md](FedML.md).
 6 | 
 7 | ## 2. Docker environment setup
 8 | 
 9 | - For the Docker environment setup, see [FedML.md](FedML.md).
10 | 
11 | ## 3. Recommended production configuration
12 | 
13 | ResNet18+cifar10 has far lower hardware requirements than YOLOv3+COCO.
14 | 
15 | 
16 | ## 4. Object recognition task (FedML_ResNet18)
17 | 
18 | This task can be used to verify that the environment is set up correctly. For how ResNet18+cifar10 was adapted for federated learning, see Section 4 of the FedML framework guide ([FedML.md](FedML.md)).
19 | 
20 | How to run:
21 | 
22 | ```
23 | cd fedml_experiments/distributed/fedResNet18/
24 | sh run_fedavg_distributed_pytorch.sh 5 4 20
25 | ```
26 | Training result: accuracy 0.768.
27 | 
--------------------------------------------------------------------------------
/doc/.gitignore:
--------------------------------------------------------------------------------
 1 | ### macOS ###
 2 | *.DS_Store
 3 | 
 4 | # Windows:
 5 | Thumbs.db
 6 | ehthumbs.db
 7 | Desktop.ini
 8 | 
 9 | # Python:
10 | *.py[cod]
11 | *.so
12 | *.egg
13 | *.egg-info
14 | dist
15 | build
16 | 
17 | # for emacs
18 | *~
19 | [#]*[#]
20 | 
21 | # for Eclipse
22 | *.project
23 | 
24 | # for Logs and databases
25 | *.log
26 | 
27 | # remove SVN
28 | .svn
29 | 
30 | # for Xcode
31 | .*.swp
32 | .clang_complete
33 | *.xcodeproj/project.xcworkspace/
34 | *.xcodeproj/xcuserdata/
35 | 
36 | # for IDEA
37 | */build/*
38 | .idea/*
39 | *.iml
40 | /out/*
41 | 
42 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | ## Introduction to Federated Learning
 2 | Federated learning is a special form of distributed learning, also called collaborative learning or alliance learning.
It was first proposed by Google in 2016; it builds models on data while meeting the requirements of user privacy, data security, and government regulation, thereby solving the data-silo problem.
 3 | 
 4 | As a distributed machine-learning paradigm, federated learning effectively solves the data-silo problem: participants build models jointly without sharing data, unlocking the value of that data.
 5 | 
 6 | ## Mainstream open-source federated-learning frameworks
 7 | - FATE: FATE (Federated AI Technology Enabler) is an open-source project initiated by WeBank's AI department, providing a reliable secure-computation framework for the federated-learning ecosystem. FATE builds its underlying secure-computation protocols on secure multi-party computation (MPC) and homomorphic encryption (HE), and on top of them supports secure computation for many kinds of machine learning, including logistic regression, tree-based algorithms, deep learning, and transfer learning.
 8 | - PaddleFL: PaddleFL is an open-source federated-learning framework based on PaddlePaddle. PaddlePaddle, built on Baidu's years of deep-learning research and business applications, is China's first self-developed, fully featured, open-source industrial-grade deep-learning platform, integrating core training and inference frameworks, a basic model library, end-to-end development kits, and a rich set of tooling components.
 9 | - Fedlearner: ByteDance's federated-learning platform Fedlearner has already been deployed in multiple real-world scenarios across e-commerce, finance, education, and other industries.
10 | - FedML: the FedML open-source federated-learning framework was released jointly by the University of Southern California (USC) together with MIT, Stanford, MSU, UW-Madison, UIUC, and companies including Tencent and WeBank.
11 | - TFF: TensorFlow Federated (TFF) is an open-source framework for machine learning and other computations on decentralized data.
12 | 
13 | 
14 | ## Practice & tutorials
15 | 
16 | - The FedML federated-learning framework — [see](doc/FedML.md)
17 | - Object detection with FedML (YOLOv3+COCO) — [see](doc/FedML_YOLOv3.md)
18 | - Object recognition with FedML (ResNet18+cifar) — [see](doc/FedML_ResNet18.md)
19 | - Federated-learning demo — [a front end built with Vue](https://github.com/zhangqixun/fldemo)
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
 1 | # Logs
 2 | logs
 3 | *.log
 4 | npm-debug.log*
 5 | yarn-debug.log*
 6 | yarn-error.log*
 7 | lerna-debug.log*
 8 | 
 9 | #webstorm
10 | .idea/
11 | 
12 | # Diagnostic reports (https://nodejs.org/api/report.html)
13 | report.[0-9]*.[0-9]*.[0-9]*.[0-9]*.json
14 | 
15 | # Runtime data
16 | pids
17 | *.pid
18 | *.seed
19 | *.pid.lock
20 | 
21 | # Directory for instrumented libs generated by jscoverage/JSCover
22 | lib-cov
23 | 
24 | # Coverage directory used by tools like istanbul
25 | coverage
26 | *.lcov
27 | 
28 | # nyc test coverage
29 | .nyc_output
30 | 
31 | # Grunt intermediate storage (https://gruntjs.com/creating-plugins#storing-task-files)
32 | .grunt
33 | 
34 | # Bower dependency directory (https://bower.io/)
35 | bower_components
36 | 
37 | # node-waf
configuration
38 | .lock-wscript
39 | 
40 | # Compiled binary addons (https://nodejs.org/api/addons.html)
41 | build/Release
42 | 
43 | # Dependency directories
44 | node_modules/
45 | jspm_packages/
46 | 
47 | # TypeScript v1 declaration files
48 | typings/
49 | 
50 | # TypeScript cache
51 | *.tsbuildinfo
52 | 
53 | # Optional npm cache directory
54 | .npm
55 | 
56 | # Optional eslint cache
57 | .eslintcache
58 | 
59 | # Microbundle cache
60 | .rpt2_cache/
61 | .rts2_cache_cjs/
62 | .rts2_cache_es/
63 | .rts2_cache_umd/
64 | 
65 | # Optional REPL history
66 | .node_repl_history
67 | 
68 | # Output of 'npm pack'
69 | *.tgz
70 | 
71 | # Yarn Integrity file
72 | .yarn-integrity
73 | 
74 | # dotenv environment variables file
75 | .env
76 | .env.test
77 | 
78 | # parcel-bundler cache (https://parceljs.org/)
79 | .cache
80 | 
81 | # Next.js build output
82 | .next
83 | 
84 | # Nuxt.js build / generate output
85 | .nuxt
86 | dist
87 | 
88 | # Gatsby files
89 | .cache/
90 | # Comment in the public line in if your project uses Gatsby and *not* Next.js
91 | # https://nextjs.org/blog/next-9-1#public-directory-support
92 | # public
93 | 
94 | # vuepress build output
95 | .vuepress/dist
96 | 
97 | # Serverless directories
98 | .serverless/
99 | 
100 | # FuseBox cache
101 | .fusebox/
102 | 
103 | # DynamoDB Local files
104 | .dynamodb/
105 | 
106 | # TernJS port file
107 | .tern-port
--------------------------------------------------------------------------------
/doc/FedML_YOLOv3.md:
--------------------------------------------------------------------------------
 1 | # FedML_YOLOv3
 2 | 
 3 | ## 1. Introduction to FedML
 4 | 
 5 | - For an introduction to FedML, see [FedML.md](FedML.md).
 6 | 
 7 | ## 2. Docker environment setup
 8 | 
 9 | - For the Docker environment setup, see [FedML.md](FedML.md).
10 | 
11 | ## 3. Recommended production configuration
12 | 
13 | Training YOLOv3+COCO places high demands on the GPU. Recommended GPU: V100 or 2080Ti.
14 | 
15 | 
16 | ## 4.
Object detection task (YOLOv3+COCO)
17 | 
18 | ### 4.1 Install dependencies
19 | 
20 | pip install -r requirements.txt
21 | 
22 | ### 4.2 Prepare the YOLOv3 data and code
23 | All YOLOv3-related material is kept under the /FedML/fedml_api/model/YOLOv3 folder.
24 | #### 4.2.1 Dataset
25 | YOLOv3 uses the COCO dataset. To simulate a real production environment, the data was partitioned manually for the experiments; the re-partitioned dataset lives in the coco folder. This experiment generated 10 sub-datasets of equal size, named 1...10, each corresponding to the client with the same index. Every sub-dataset is split into a train set (1641 samples) and a test set (802 samples).
26 | 
27 | The original YOLOv3 project iterates over the data by reading a file list; the experiment changes this to take a folder path and read every file under that folder directly. The initializer of the ListDataset class in /utils/datasets.py was modified accordingly.
28 | 
29 | Because the train and test sets are currently small and the number of training rounds is low, not all object classes are covered and training quality is poor. Some bugs surfaced after the federated-learning conversion, presumably because the original project's code did not anticipate these conditions.
30 | 
31 | #### 4.2.2 Model class
32 | Corresponds to models.py of the original YOLOv3 project plus the utility classes that training and testing depend on. Essentially identical to the original project.
33 | 
34 | #### 4.2.3 The YOLOv3Trainer class
35 | get_model_params and set_model_params directly call the corresponding methods of the model (Darknet) class.
36 | 
37 | The train function is adapted from train.py of the original project, with the argument definitions moved into main_fedavg.py. Because the model parameters are overwritten whenever the global parameters arrive from the server, the code that restores model parameters from a checkpoint was removed. Because a single round usually trains for only a few epochs, the checkpoint-saving code was removed, as was the code that evaluates the model during training.
38 | 
39 | The test function is adapted from test.py, with the argument definitions moved into main_fedavg.py.
40 | 
41 | #### 4.2.4 The YOLOv3ResultAggregator class
42 | This class is still to be completed.
43 | 
44 | ### 4.3 Define the federated-learning entry point
45 | As the cluster-startup command shows, every participant (server or client) runs python3 ./main_fedavg.py, so main_fedavg.py defines the entry point of the federated-learning program. This file also lives in the /FedML/fedml_experiments/distributed/fedYOLOv3 folder.
46 | 
47 | The following is a brief walkthrough of the main function; converting other machine-learning tasks to federated learning follows essentially the same steps.
48 | ```
49 | logging.getLogger().setLevel(logging.INFO)
50 | ```
51 | Configure the logger so output is shown on the console.
52 | ```
53 | comm, process_id, worker_number = FedML_init()
54 | ```
55 | Initialize the process's MPI parameters. comm is used for subsequent MPI communication, process_id identifies the process, and worker_number is in effect the total number of clients.
56 | ```
57 | # parse python script input parameters
58 | parser = argparse.ArgumentParser()
59 | args = add_args(parser)
60 | logging.info(args)
61 | ```
62 | Parse the arguments. Three federated-learning arguments are specified when running the run_fedavg_distributed_pytorch.sh script; all other federated-learning and model-training parameters are added in the add_args function and stored in args, which is then passed on to the relevant functions and class constructors.
63 | 
64 | As add_args shows, the parameters fall into two classes. The first class concerns the federated-learning process itself; it is task-independent and defines the aggregation and distribution procedure. The second class concerns model training and testing and is task-specific. For YOLOv3, these correspond to the parameters in opt defined in train.py and test.py of the single-machine YOLOv3 code. Note that the model-definition parameters (/YOLOv3/config/yolov3.cfg) need not be included.
65 | ```
66 | str_process_name = "FedAvg (distributed):" + str(process_id)
67 | setproctitle.setproctitle(str_process_name)
68 | ```
69 | Set the process name.
70 | ```
71 | logging.basicConfig...
72 | hostname = socket.gethostname()
73 | logging.info...
74 | ```
75 | Print process information.
76 | ```
77 | if process_id == 0:
78 | wandb.init...
79 | ```
80 | If this process is the server, register the run on the wandb platform.
81 | ```
82 | random.seed(0)
83 | np.random.seed(0)
84 | torch.manual_seed(0)
85 | torch.cuda.manual_seed_all(0)
86 | ```
87 | Fix all random seeds so that every run initializes identically and results are reproducible.
88 | ```
89 | logging.info...
90 | device = init_training_device...
91 | ```
92 | Select the training device: CPU if the machine has no GPU, otherwise a GPU chosen by an assignment scheme. init_training_device can be redefined to suit the production environment.
93 | ```
94 | train_path = os.path.join(os.path.join(args.data_path, str(process_id)), "train")
95 | test_path = os.path.join(os.path.join(args.data_path, str(process_id)), "test")
96 | ```
97 | Build each client's dataset path from args.data_path and the process's process_id. Section 4.2.1 describes the file layout of the generated dataset. Since the server's process_id is 0 and the clients' process_ids are 1, ..., 10, a client's dataset path is simply the coco folder path joined with its process_id. The dataset definition may need to be adapted to the production environment.
98 | ```
99 | model = create_model(args)
100 | ```
101 | Initialize the model. create_model constructs and returns a Darknet object.
102 | ```
103 | model_trainer = YOLOv3Trainer(model)
104 | result_aggregator = YOLOv3ResultAggregator()
105 | ```
106 | Initialize model_trainer and result_aggregator. Both are model-specific, so they must be redefined for a different model. See 4.2.3 and 4.2.4.
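The two model-specific components initialized at the end of the walkthrough can be sketched minimally as follows. This is a standalone illustration, not the repository's code: the real trainer inherits from fedml_core.trainer.model_trainer.ModelTrainer and wraps a Darknet model, the class names ending in "Sketch" are ours, and the aggregator implements the weighted average that Section 4.2.4 of FedML.md specifies (recall that 4.2.4 above notes the actual YOLOv3ResultAggregator is still to be completed).

```python
class MyModelTrainerSketch:
    """Shape of a [MyModel]Trainer (illustrative; the real class inherits
    from fedml_core.trainer.model_trainer.ModelTrainer)."""

    def __init__(self, model):
        # the model object passed in from main_fedavg.py;
        # here a plain dict stands in for Darknet
        self.model = model

    def get_model_params(self):
        # must return the model parameters as a dict
        return self.model["params"]

    def set_model_params(self, model_parameters):
        # overwrite local parameters with the globals sent by the server
        self.model["params"] = model_parameters

    def get_local_sample_number(self, train_path):
        # number of local training samples; used as the update weight
        return 1641  # the per-client train-set size from 4.2.1

    def train(self, device, args, train_path):
        # the local training loop would go here (epochs/batch_size from args)
        pass

    def test(self, device, args, test_path):
        # must return a JSON-serializable dict of metric -> value
        return {"precision": 0.9, "recall": 0.76}


class MyModelResultAggregatorSketch:
    """Weighted average over per-client test results, in the result_dict
    format described in FedML.md Section 4.2.4."""

    def aggregate_result(self, result_dict):
        # result_dict[client_idx] = {"result": {...}, "sample_number": int}
        total = sum(r["sample_number"] for r in result_dict.values())
        aggregated = {}
        for r in result_dict.values():
            weight = r["sample_number"] / total
            for metric, value in r["result"].items():
                aggregated[metric] = aggregated.get(metric, 0.0) + weight * value
        return aggregated
```

For example, a precision of 0.9 reported by a client holding 100 samples and 0.8 from a client holding 300 samples aggregate to 0.9·0.25 + 0.8·0.75 = 0.825.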
107 | 
108 | ### 4.4 Start the cluster
109 | The cluster-startup script is ./fedml_experiments/distributed/fedYOLOv3/run_fedavg_distributed_pytorch.sh. For a detailed explanation, see Section 4.4 of the FedML framework guide ([FedML.md](FedML.md)).
110 | 
111 | An example cluster-startup command:
112 | ```
113 | sh run_fedavg_distributed_pytorch.sh 10 5 50
114 | ```
115 | This means there are 10 clients in total, 5 clients participate in each round, and training runs for 50 rounds. Because of how the dataset was generated, at most 10 trainers are supported.
116 | 
--------------------------------------------------------------------------------
/doc/FedML.md:
--------------------------------------------------------------------------------
 1 | ## 1. FedML
 2 | 
 3 | https://github.com/FedML-AI/FedML
 4 | 
 5 | ## 2. Modified FedML
 6 | 
 7 | https://github.com/yukizhao1998/FineFL
 8 | 
 9 | ## 3.
Configuring deployment across Docker containers
10 | 
11 | #### Step 1: Pull the ubuntu 16.04 or CUDA image
12 | docker pull ubuntu:16.04
13 | 
14 | If training needs a GPU, pull the CUDA image of the matching version and install nvidia-docker instead.
15 | 
16 | #### Step 2: Use a Dockerfile to install the required packages and build a new image
17 | 
18 | ##### a. Create the Dockerfile directory and the Dockerfile:
19 | mkdir Dockerfile
20 | cd Dockerfile
21 | vim Dockerfile
22 | 
23 | The Dockerfile contents are as follows (if you use a CUDA image, change the base image in the first line):
24 | ```
25 | FROM ubuntu:16.04
26 | RUN apt-get update && apt-get install -y --no-install-recommends \
27 | bzip2 \
28 | g++ \
29 | git \
30 | graphviz \
31 | libgl1-mesa-glx \
32 | libhdf5-dev \
33 | openmpi-bin \
34 | wget && \
35 | rm -rf /var/lib/apt/lists/*
36 | RUN sed -i 's/archive.ubuntu.com/mirrors.ustc.edu.cn/g' /etc/apt/sources.list
37 | RUN apt-get update
38 | ADD ./Anaconda3-2019.07-Linux-x86_64.sh ./anaconda.sh
39 | ENV LANG=C.UTF-8 LC_ALL=C.UTF-8
40 | ENV PATH /opt/conda/bin:$PATH
41 | RUN /bin/bash ./anaconda.sh -b -p /opt/conda && rm ./anaconda.sh && ln -s /opt/conda/etc/profile.d/conda.sh /etc/profile.d/conda.sh && echo ".
/opt/conda/etc/profile.d/conda.sh" >> ~/.bashrc && echo "conda activate base" >> ~/.bashrc && find /opt/conda/ -follow -type f -name '*.a' -delete && find /opt/conda/ -follow -type f -name '*.js.map' -delete && /opt/conda/bin/conda clean -afy
42 | CMD [ "/bin/bash" ]
43 | RUN conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/ && conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/ \
44 | && conda config --set show_channel_urls yes \
45 | && conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/ \
46 | && conda install pytorch torchvision cudatoolkit=10.2 \
47 | && conda install -c anaconda mpi4py \
48 | && pip install --upgrade wandb \
49 | && conda install scikit-learn \
50 | && conda install numpy \
51 | && conda install h5py \
52 | && conda install setproctitle \
53 | && conda install networkx \
54 | && pip install paho-mqtt
55 | RUN apt-get update
56 | RUN echo "Y" | apt-get install vim
57 | RUN echo "Y" | apt-get install ssh
58 | ```
59 | 
60 | ##### b. Download the Anaconda installer into the current directory:
61 | wget https://repo.anaconda.com/archive/Anaconda3-2019.07-Linux-x86_64.sh
62 | 
63 | ##### c. Build the image
64 | sudo docker build -t fedml .
65 | 
66 | #### Step 3: Create and run the containers
67 | ##### 1) Create the containers
68 | Using the fedml image built above, create one HeadNode container and several ComputeNode containers. Docker's volume mapping is used here to map the local project directory into a given directory inside each container, sharing the code and data. The following example creates one HeadNode and two ComputeNodes.
69 | ```
70 | sudo docker run --gpus all --shm-size 8G -it -p 2001:80 -v [local project path]:[project path in container] --name HeadNode -h HeadNode -d fedml /bin/bash
71 | sudo docker run --gpus all --shm-size 8G -it -p 2002:80 -v [local project path]:[project path in container] --name ComputeNode1 -h ComputeNode1 -d fedml /bin/bash
72 | sudo docker run --gpus all --shm-size 8G -it -p 2003:80 -v [local project path]:[project path in container] --name ComputeNode2 -h ComputeNode2 -d fedml /bin/bash
73 | ```
74 | ##### 2) Run the containers
75 | ```
76 | sudo docker exec -it HeadNode /bin/bash
77 | sudo docker exec -it ComputeNode1 /bin/bash
78 | sudo docker exec -it ComputeNode2 /bin/bash
79 | ```
80 | 
81 | #### Step 4: Set up passwordless ssh between the containers
82 | 
83 | ##### 1) Change the password and permissions
84 | In every container:
85 | Run passwd root to change the root password.
86 | In /etc/ssh/sshd_config, set PermitRootLogin yes, then restart the ssh service with /etc/init.d/ssh restart.
87 | In /etc/hosts, find and note the container's IP address.
88 | 
89 | ##### 2) Generate keys and enable passwordless login
90 | In every container:
91 | 
92 | Append the other containers' IP addresses and host names to /etc/hosts, e.g.:
93 | 172.17.0.2 HeadNode
94 | 172.17.0.3 ComputeNode1
95 | 172.17.0.4 ComputeNode2
96 | 
97 | Generate a key pair:
98 | cd ~/.ssh (create it if it does not exist)
99 | ssh-keygen -t rsa (keep pressing Enter)
100 | chmod 644 ~/.ssh/id_rsa.pub (use the save path chosen in the previous command)
101 | cat ./id_rsa.pub >> ./authorized_keys
102 | 
103 | Enable passwordless login from this container to every other one (NodeName is each other container's name; run the command once per container):
104 | ssh-copy-id -i ~/.ssh/id_rsa.pub [NodeName]
105 | 
106 | If ssh [NodeName] logs straight into the other container, the setup succeeded.
107 | 
108 | 
109 | ## 4. General method for converting a machine-learning task to federated learning
110 | 
111 | ### 4.1 Prepare the environment
112 | Install the training dependencies on the server and on every client.
113 | ### 4.2 Prepare the model's data and code
114 | All model-related material is kept under the /FedML/fedml_api/model/[model_name] folder.
115 | #### 4.2.1 Dataset
116 | In the current framework every client runs the same code, so all clients' training and test data must be stored under the same path, or under paths resolvable in the same way. Concretely, main_fedavg.py at the FedML-Experiments layer defines the train- and test-set paths, and training and testing then use the data under those paths in the same manner.
117 | 
118 | 
119 | In an experimental setting, splitting one dataset can simulate the data generated by clients in production.
120 | 
121 | #### 4.2.2 Model class
122 | 
123 | Corresponds to the model definition in the single-machine project; usually no changes are needed.
124 | 
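The client-data simulation mentioned in 4.2.1 — carving one dataset into equally sized per-client shards, each with its own train/test split — can be sketched as follows. This is an illustration only: the function and parameter names are ours, not the project's, and the test ratio is chosen to roughly match the 1641/802 split reported for the YOLOv3 experiment.

```python
import random

def partition_dataset(samples, num_clients, test_ratio=0.33, seed=0):
    """Split a list of sample identifiers into per-client train/test shards.

    Clients are numbered 1..num_clients so that each shard can be written to
    <data_path>/<client_id>/{train,test}, matching the path layout that
    main_fedavg.py builds from args.data_path and process_id.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = list(samples)
    rng.shuffle(shuffled)
    per_client = len(shuffled) // num_clients  # equally sized shards
    shards = {}
    for client_id in range(1, num_clients + 1):
        shard = shuffled[(client_id - 1) * per_client : client_id * per_client]
        n_test = int(len(shard) * test_ratio)
        shards[client_id] = {"test": shard[:n_test], "train": shard[n_test:]}
    return shards
```

Writing each shard to <data_path>/<client_id>/train and <data_path>/<client_id>/test reproduces the directory layout that the entry program expects; the server (process_id 0) owns no shard.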
125 | #### 4.2.3 The [MyModel]Trainer class
126 | 
127 | Inherits from the abstract class fedml_core.trainer.model_trainer.ModelTrainer and wraps four interfaces: get_model_params (get model parameters), set_model_params (set model parameters), train, and test.
128 | 
129 | [MyModel]Trainer is instantiated in main_fedavg.py; the model object passed in at construction is stored in a member variable named model, and the trainer can then use the model's instance methods through self.model. Getting parameters, setting parameters, training, and testing all operate on self.model.
130 | 
131 | get_model_params(self): returns self.model's parameters (a dict).
132 | 
133 | set_model_params(self, model_parameters): takes a dict of model parameters and updates self.model with it. No return value.
134 | 
135 | get_local_sample_number(self, train_path): returns the number of training samples, used as the weight for model updates.
136 | 
137 | train(self, device, args, train_path): called when the client trains locally; its job is to train self.model. The training parameters, such as epoch and batch_size, are stored in args; the arguments passed in are all defined in main_fedavg.py.
138 | 
139 | test(self, device, args, test_path): called when the client evaluates the global model on its local test set. The test parameters are stored in args; the arguments passed in are all defined in main_fedavg.py. The return value is a dict mapping metric names to results, e.g. {"precision": 0.9, "recall": 0.76}. Note that the return value must be JSON-serializable (json.dumps), otherwise it cannot be transmitted.
140 | 
141 | #### 4.2.4 The [MyModel]ResultAggregator class
142 | 
143 | After the server aggregates the models sent back by the clients, if the model is to be evaluated this round, the server broadcasts it to all clients; each client tests it locally and sends the result back. Once all clients' results have arrived, the server calls [MyModel]ResultAggregator's aggregate_result method to combine them, typically into a weighted average.
144 | 
145 | The result_dict parameter received by aggregate_result is a dict keyed by the client index client_idx, starting at 0. result_dict[client_idx] is itself a dict with keys "result" and "sample_number". result_dict[client_idx]["result"] is a dict in the same format as the return value of [MyModel]Trainer's test. result_dict[client_idx]["sample_number"] is an integer, the size of that client's dataset, used to weight the averaged results.
146 | 
147 | ### 4.3 Define the federated-learning entry point
148 | 
149 | main_fedavg.py defines the entry point of the federated-learning program. When converting a machine-learning task to federated learning, the parts to focus on are model selection and parameter definition. This file also lives in the /FedML/fedml_experiments/distributed/fed[MyModel] folder.
150 | 
151 | /FedML/fedml_experiments/distributed/fedMyModel/main_fedavg.py is a baseline version of main_fedavg.py. #TODO comments mark the places that must be modified when federating a new machine-learning task, along with the meaning of each code segment. Concrete examples and more detailed explanations are available for reference. Federating a machine-learning task basically follows the same steps every time; only the model-specific parts need to change.
152 | 
153 | ### 4.4 Start the cluster
154 | 
155 | The cluster-startup scripts live in the /FedML/fedml_experiments/distributed/fed[MyModel] folder.
156 | 
run_fedavg_distributed_pytorch.sh defines how the cluster is started; training begins by executing this script. The processes communicate over MPI; the mpirun command in the script defines the number of processes (-np $PROCESS_NUM), the hosts taking part (-hostfile ./mpi_host_file), and the program each host executes (python3 ./main_fedavg.py).
158 | 
159 | mpi_host_file lists all hosts and is passed to mpirun. Every participating host name must appear in it, one per line. The first line must be the server's (aggregator's) host name; every following line is a client's (trainer's) host name, and their count should normally equal CLIENT_NUM. When the MPI cluster starts, each host starts one process in turn; every process is assigned a unique process_id, consecutive integers starting at 0, in the same order as the hosts appear in mpi_host_file.
160 | 
161 | For example, to start a federated-learning cluster with 1 server (host name HeadNode) and 2 clients (host names ComputeNode1 and ComputeNode2), with 1 client per round and 50 rounds of training, mpi_host_file contains:
162 | 
163 | HeadNode
164 | ComputeNode1
165 | ComputeNode2
166 | 
167 | 
168 | and the cluster is started with:
169 | 
170 | sh run_fedavg_distributed_pytorch.sh 2 1 50
171 | 
172 | After the cluster starts, the server process started on HeadNode has process_id 0, the process started on ComputeNode1 has process_id 1, and the process started on ComputeNode2 has process_id 2.
173 | 
174 | 
175 | ## 5. The main FedML-API interfaces
176 | The model-agnostic interfaces of FedML-API live under ./fedml_api/distributed/fedavg. In general, if the federated-training flow is not modified, these interfaces need no changes. Even so, knowing what they do is very useful — and necessary — for getting familiar with the framework and for debugging, so this section introduces them briefly.
177 | 
178 | ### 5.1 FedAvgAPI.py
179 | Defines the initialization functions for the server and the clients.
180 | #### 5.1.1 FedML_init
181 | Fetches the MPI parameters.
182 | #### 5.1.2 FedML_FedAvg_distributed
183 | Receives the model and parameters defined in the layer above (FedML-Experiments) and uses them to initialize the server and clients. Based on process_id the function decides the role of the current process, initializing a server (via init_server) or a client (via init_client).
184 | #### 5.1.3 init_server
185 | Initializes the server; called when process_id == 0. The server consists of an aggregator and a manager: the aggregator (FedAVGAggregator) defines model aggregation, and the manager (FedAvgServerManager) defines task distribution, communication, and so on.
186 | #### 5.1.4 init_client
187 | Initializes a client; called when process_id != 0. A client consists of a trainer and a manager: the trainer (FedAVGTrainer) defines model training, and the manager (FedAvgClientManager) defines task reception, communication, and so on.
188 | 
189 | ### 5.2 FedAvgServerManager.py
190 | The FedAVGServerManager class, inheriting from ServerManager in fedml_core.distributed.server.server_manager. Defines the server's behavior.
191 | #### 5.2.1 aggregator
192 | An instance of the FedAVGAggregator class (see 5.3).
193 | #### 5.2.2 run
194 | Inherited from ServerManager: first calls register_message_receive_handlers to register all handler functions, then starts the mqtt service.
195 | #### 5.2.3
send_init_message
196 | Called in FedAvgAPI.init_server: first selects the clients participating in the first round of training, then broadcasts the model's initial parameters to all clients.
197 | #### 5.2.4 register_message_receive_handlers
198 | Calls register_message_receive_handler (inherited from ServerManager) to define each handler in turn (the handle_message_xxx functions). Essentially, every message type is stored together with its handler in a dict; when a message of some type arrives, the handler is looked up and invoked. The relevant handlers are defined below.
199 | #### 5.2.5 handle_message_receive_model_from_client
200 | Called when the server receives a client's updated model parameters together with the amount of data that took part in training. This effectively defines the training flow and usually needs no modification.
201 | #### 5.2.6 send_message_init_config
202 | Packs and sends the initialization message for a client; executed before all training (the first round) begins.
203 | #### 5.2.7 send_message_sync_model_to_client
204 | Packs and sends the parameters updated by aggregation; called after each round's training and aggregation finish, to send the updated parameters to all clients.
205 | 
206 | ### 5.3 FedAVGAggregator.py
207 | Defines the FedAVGAggregator class, i.e. the aggregator in federated learning.
208 | #### 5.3.1 self.trainer
209 | An instance of the custom [MyModel]Trainer class. It is mainly used for operations such as globally updating the model parameters; it does not actually train.
210 | #### 5.3.2 aggregate
211 | Called once the server has received the updated model parameters from every client participating in the current round; computes their weighted average to obtain the global model parameters.
212 | 
213 | ### 5.4 FedAvgClientManager.py
214 | Defines the FedAVGClientManager class, inheriting from ClientManager in fedml_core.distributed.client.client_manager. If the (model-agnostic) fedavg flow is not modified, this file usually needs no changes.
215 | #### 5.4.1 self.trainer
216 | An instance of the FedAVGTrainer class (see 5.5).
217 | #### 5.4.2 run
218 | Inherited from ClientManager: first calls register_message_receive_handlers to register all handler functions, then starts the mqtt service.
219 | #### 5.4.3 send_model_to_server
220 | Packs the locally updated model parameters and the number of samples used in training and sends them to the server; executed at the end of the __train function.
221 | #### 5.4.4 handle_message_init
222 | Called at service start, when the initialization message arrives from the server. It updates the trainer's model parameters, initializes round_idx = 0, and calls __train for one round of training.
223 | #### 5.4.5 handle_message_receive_model_from_server
224 | Called when a model-update message arrives from the server. It updates the trainer's model parameters, increments round_idx, and calls __train for one round of training.
225 | #### 5.4.6 handle_message_test_model_on_client
226 | Called when a model-test request arrives from the server. It evaluates the global model on the local test data.
227 | #### 5.4.7 __train
228 | Calls self.trainer's train function to train the model, then calls send_model_to_server to send this round's parameter update and the number of training samples to the server.
229 | 
230 | ### 5.5 FedAVGTrainer.py
231 | #### 5.5.1 self.trainer
An instance of the custom Trainer class (see 4.2.3). It is first created in main_fedavg.py (as model_trainer) and passed as an argument to FedAvgAPI.FedML_FedAvg_distributed, where it is used to initialize FedAVGTrainer.
233 | #### 5.5.2 update_model
234 | Calls self.trainer.set_model_params to update the local model parameters with the global ones.
235 | #### 5.5.3 train
236 | Calls self.trainer.train for local training, then self.trainer.get_model_params to fetch the local model parameters, and returns them together with the number of training samples.
237 | #### 5.5.4 test
238 | Calls self.trainer.test for local testing.
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | Apache License
 2 | Version 2.0, January 2004
 3 | http://www.apache.org/licenses/
 4 | 
 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
 6 | 
 7 | 1. Definitions.
 8 | 
 9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 | 
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 | 
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 | 
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 | 
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 
61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. 
In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. 
We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | --------------------------------------------------------------------------------