├── HDFS_TUTORIAL.md
├── LICENSE
├── README.md
├── aws_k8s.md
├── cluster_config.md
├── doc
│   ├── ElasticCTR.png
│   └── buy_bcc.png
├── elastic-ctr-cli
│   ├── data.config
│   ├── elastic-control.sh
│   ├── fileserver.yaml.template
│   ├── fleet-ctr.yaml.template
│   ├── listen.py
│   ├── pdserving.yaml
│   ├── rolebind.yaml
│   ├── service.py
│   ├── service_auto_port.py
│   └── slot.conf
├── elasticctr_arch.md
├── fleet-ctr
│   ├── criteo_dataset.py
│   ├── criteo_dataset.py~
│   ├── criteo_reader.py
│   ├── dataset_generator.py
│   ├── infer.py
│   ├── ip_config.sh
│   ├── k8s_tools.py
│   ├── mlflow_run.sh
│   ├── model_with_sparse_feature.py
│   ├── nets.py
│   ├── paddle_k8s
│   ├── process_rawmodel.py
│   ├── pserver0.sh
│   ├── pserver1.sh
│   ├── run_distributed.sh
│   ├── train_with_mlflow.py
│   ├── trainer0.sh
│   └── trainer1.sh
├── huawei_k8s.md
└── save_program
    ├── dumper.py
    ├── listening.py
    ├── nets.py
    ├── parameter_to_seqfile
    ├── replace_params.py
    └── save_program.py
-------------------------------------------------------------------------------- /HDFS_TUTORIAL.md: --------------------------------------------------------------------------------
# How to Set Up an HDFS Cluster

## Overview

This article is a demo-only HDFS cluster setup tutorial, meant for walking through the end-to-end ElasticCTR pipeline. It shows how to set up HDFS on a Baidu Cloud node and store the Criteo dataset on HDFS in the layout that ElasticCTR expects.

## Buying a BCC instance

Setting up an HDFS cluster is fairly involved. First, purchase a BCC instance.

![buy_bcc.png](doc/buy_bcc.png)
Inside the BCC instance, purchase a reasonably large CDS cloud disk.

## Installing and starting Hadoop

After logging in to the BCC instance, first use the fdisk tool to confirm that the CDS disk has been partitioned and mounted.

Choose hadoop-2.8.5.tar.gz. After downloading, unpack it and move the hadoop-2.8.5 directory to /usr/local. Under /usr/local/hadoop-2.8.5/etc/hadoop/, edit the core-site.xml file so that it reads:
```
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://${LOCAL_IP}:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/data/hadoop</value>
    </property>
</configuration>
```

Here `$LOCAL_IP` should be the intranet IP, i.e. the address beginning with `192.168` in `ifconfig` output, which is also reachable from inside K8S.

In the `slaves` file, enter `root@127.0.0.1`.

Next configure passwordless access: run `ssh-keygen`, press Enter through every prompt, then use `ssh-copy-id` to set up passwordless access to each of `127.0.0.1`, `localhost`, and `0.0.0.0`.

Set `$HADOOP_HOME` to `/usr/local/hadoop-2.8.5` (the install root, not its `etc/hadoop` subdirectory).

Then add `$HADOOP_HOME/bin` to `$PATH`. Once the `hadoop` command can be executed, run `hadoop namenode -format`.

Finally, run `start-all.sh` from the `/usr/local/hadoop-2.8.5/sbin` directory.

After these steps the HDFS service is up. Next create the directory for streaming training data, `/train_data/`, with the command `hdfs dfs -mkdir hdfs://$IP:9000/train_data/`.

## Copying the Criteo dataset to HDFS
Next download the dataset from `https://paddle-serving.bj.bcebos.com/criteo_ctr_example/criteo_demo.tar.gz`, unpack it, and inside the criteo_demo directory run
`hdfs dfs -put * hdfs://$IP:9000/train_data/20200401`
where `$IP` is the HDFS address configured earlier.
This places five hours' worth of training data under the 20200401 directory inside train_data; 20200401 can be changed to any date.
In the main README tutorial, the `data.config` file is where this HDFS information is configured, and the date range set there refers to these directories.
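Condensed, the Hadoop steps above are the following shell session. This is a minimal sketch, assuming hadoop-2.8.5 has already been unpacked to /usr/local and `$IP` holds the intranet address used in core-site.xml:

```bash
export HADOOP_HOME=/usr/local/hadoop-2.8.5
export PATH=$HADOOP_HOME/bin:$PATH

hadoop namenode -format                      # one-time namenode format
/usr/local/hadoop-2.8.5/sbin/start-all.sh    # bring up HDFS

# create the streaming-training directory and upload the demo Criteo data
hdfs dfs -mkdir -p hdfs://$IP:9000/train_data/
wget https://paddle-serving.bj.bcebos.com/criteo_ctr_example/criteo_demo.tar.gz
tar zxf criteo_demo.tar.gz
cd criteo_demo && hdfs dfs -put * hdfs://$IP:9000/train_data/20200401

# sanity check: the hourly parts should show up here
hdfs dfs -ls hdfs://$IP:9000/train_data/20200401
```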
-------------------------------------------------------------------------------- /LICENSE: --------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright [yyyy] [name of copyright owner]
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
-------------------------------------------------------------------------------- /README.md: --------------------------------------------------------------------------------
# ElasticCTR

ElasticCTR is a one-click deployment solution for distributed CTR-estimation training and the Serving pipeline. Users only need to configure the data source and sample format to run the full set of training and inference tasks.

* [1. Overview](#head1)
* [2. Setting up the cluster](#head2)
* [3. One-click deployment tutorial](#head3)
* [4. Tracking training progress](#head4)
* [5. Inference service](#head5)

## 1. Overview

This project provides an end-to-end solution for CTR training and secondary development. Its main features are:

1. Fast deployment

The current ElasticCTR solution deploys on Baidu Cloud Kubernetes clusters, and users can easily extend it to run on any other vanilla Kubernetes environment.

2. High performance

ElasticCTR uses PaddlePaddle's fully asynchronous distributed training mode; while preserving model quality, its near-linear scalability saves substantial training resources. On the serving side, ElasticCTR uses the high-throughput, low-latency sparse-parameter estimation engine from Paddle Serving, whose throughput under high concurrency is more than 10x that of common open-source components.

3. Customizability

Through a single unified configuration file, users can change the training mode and basic settings, including online/offline training, visualized training metrics, and HDFS storage configuration. Beyond editing that file, ElasticCTR is built on a fully open-source stack, which makes fast secondary development and adaptation easy: the underlying Kubernetes and Volcano provide flexible scheduling policies for the jobs above them; with PaddlePaddle's flexible model definition, the Fleet distributed training engine, and the Paddle Serving remote inference service, users can iterate quickly on the training model, the parallel training mode, and the remote inference service; and with MLFlow's training-visualization capability, users can quickly add whatever monitoring metrics the system needs.

For the overall structure of this solution, see [ElasticCTR architecture](elasticctr_arch.md).

![ElasticCTR.png](doc/ElasticCTR.png)
## 2. Setting up the cluster

Before running this solution you need an existing k8s cluster with the volcano components installed. Deploying a k8s environment is fairly involved and is not covered here. Baidu Cloud's CCE container engine is ready to use once requested; for creating a k8s cluster on Baidu Cloud, see [the Baidu Cloud k8s creation and usage guide](cluster_config.md). Elastic CTR can also be deployed on other clouds; see [Creating a k8s cluster on Huawei Cloud](huawei_k8s.md) and [Creating a k8s cluster on AWS](aws_k8s.md).

Once the K8S cluster is ready, configure HDFS as the source of the dataset: [HDFS setup tutorial](HDFS_TUTORIAL.md).

## 3. One-click deployment tutorial

You can complete the deployment with the provided elastic-control.sh script. Before running it, make sure your machine has python3 and that mlflow has been installed via pip:
```bash
python3 -m pip install mlflow -i https://pypi.tuna.tsinghua.edu.cn/simple
```
The script is used as follows:
```bash
bash elastic-control.sh [COMMAND] [OPTIONS]
```
The available commands (COMMAND) are:
- **-c|--config_client** retrieve the client binaries used to send inference requests and receive prediction results
- **-r|--config_resource** define the training configuration
- **-a|--apply** apply the configuration and start training
- **-l|--log** print the training status; make sure you have already started training

When defining the training configuration, add options (OPTIONS) to specify the resources to configure:
- **-u|--cpu** CPU cores per training node
- **-m|--mem** memory per node
- **-t|--trainer** number of trainer nodes
- **-p|--pserver** number of parameter-server nodes
- **-b|--cube** number of cube shards
- **-hd|--hdfs_address** HDFS address where the data files are stored

Note: your data files should follow this example format:
```
$show $click $feasign0:$slot0 $feasign1:$slot1 $feasign2:$slot2......
```
For example:
```
1 0 17241709254077376921:0 132683728328325035:1 9179429492816205016:2 12045056225382541705:3
```

- **-f|--datafile** data path file; it must specify the HDFS address and the start date (the end date is optional)
- **-s|--slot_conf** feature slot configuration file; note that the file suffix must be '.txt'

Below is the `data.config` file; `START_DATE_HR` and `END_DATE_HR` correspond to the HDFS paths we set up in the previous step.
```
export HDFS_ADDRESS="hdfs://${IP}:9000" # HDFS address
export HDFS_UGI="root,i"                # HDFS user name and password
export START_DATE_HR=20200401/00        # training set start time: Apr 1, 2020, hour 00
export END_DATE_HR=20200401/03          # training set end time: Apr 1, 2020, hour 03
export DATASET_PATH="/train_data"       # training set prefix on HDFS
export SPARSE_DIM="1000001"             # sparse parameter dimension; can be left as is
```

Example usage of the script:
```
bash elastic-control.sh -r -u 4 -m 20 -t 2 -p 2 -b 5 -s slot.conf -f data.config
bash elastic-control.sh -a
bash elastic-control.sh -l
bash elastic-control.sh -c
```

## 4. Tracking training progress
We provide two ways for users to observe training progress:

1. Command line

At any time during training, the following command prints the status logs of Trainer0 and the file server to standard output:
```bash
bash elastic-control.sh -l
```

## 5. Inference service
Check the file server log with:
```bash
bash elastic-control.sh -l
```
Once a model has been produced, you can run inference by entering:
```bash
bash elastic-control.sh -c
```
Then follow the prompts printed on screen to run inference; results are printed to standard output.
![infer_help.png](https://github.com/suoych/WebChat/raw/master/infer_help.png)
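Section 4 names two ways to observe progress but only shows the log command; the second is an MLflow dashboard, for which the repository ships `listen.py` and `service.py` in elastic-ctr-cli (elastic-control.sh only invokes them in commented-out lines of its apply step). A hedged sketch of running them by hand:

```bash
cd elastic-ctr-cli
# listen.py copies mlruns/ out of the fleet-ctr-demo-trainer-0 pod every 30
# seconds and rewrites the workspace paths inside each meta.yaml
python3 listen.py &
# service.py waits for ./mlruns to appear, then launches `mlflow server` on
# port 8111 (service_auto_port.py is a variant that probes ports from 8100 up)
python3 service.py
# then open http://<this-host>:8111 in a browser
```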
-------------------------------------------------------------------------------- /aws_k8s.md: --------------------------------------------------------------------------------
# Setting up a k8s cluster on AWS

This document describes how to set up a k8s cluster on AWS.

* [1. Process overview](#head1)
* [2. Buying a jump box](#head2)
* [3. Deploying the cluster](#head3)

## 1. Process overview

Setting up a k8s cluster on AWS has two main steps:

1. Buy a jump box

First buy an EC2 instance to act as a jump box for controlling the k8s cluster; it does not need a high-end configuration.

2. Deploy the cluster

Use the jump box bought in the previous step to create the cluster; the cluster configuration can be adjusted as you like.

Each step is detailed below.

## 2. Buying a jump box

You can buy whatever instance you want as a jump box from the EC2 console.

The concrete steps are:
1. Open the Amazon EC2 console and, from the console dashboard, click the Launch Instance button.
![run_instance.png](https://github.com/suoych/WebChat/raw/master/run_instance.png)
2. Choose a suitable AMI; Amazon Linux 2 AMI is recommended.
![choose_AMI.png](https://github.com/suoych/WebChat/raw/master/choose_AMI.png)
3. Choose an instance type; the default t2.micro is recommended. Then click Review and Launch.
![choose_instance_type.png](https://github.com/suoych/WebChat/raw/master/choose_instance_type.png)
4. On the review page, click Edit security groups in the security group section of the instance launch review, then in the security group configuration click Select an existing security group, pick the group named default, and click Review and Launch again.
![review_instance.png](https://github.com/suoych/WebChat/raw/master/review_instance.png)
![select_security_group.png](https://github.com/suoych/WebChat/raw/master/select_security_group.png)
5. Click Launch on the review page. In the key pair dialog that pops up, choose Create a new key pair, give it a name, and download the key pair. Be sure to keep the key pair file safe, because it cannot be downloaded again. After this, click Launch Instances to complete the jump box purchase.
![create_key.png](https://github.com/suoych/WebChat/raw/master/create_key.png)


Note: after downloading the key pair file, change its permissions to 400.

## 3. Deploying the cluster

Once the instance bought in the previous step is running, it shows a public IP and DNS name. Connect to the instance to deploy, using the key pair file (suffix .pem) downloaded earlier:

```bash
ssh -i ec2key.pem ec2-user@12.23.34.123
```
or
```bash
ssh -i ec2key.pem ec2-user@ec2-12-23-34-123.us-west-2.compute.amazonaws.com
```

After connecting to the jump box, install the control tooling:
1. Install pip
```bash
sudo yum -y install python-pip
```
2. Install or upgrade the AWS CLI
```bash
sudo pip install --upgrade awscli
```
3. Install eksctl
```bash
curl --silent \
--location "https://github.com/weaveworks/eksctl/releases/download/latest_release/eksctl_$(uname -s)_amd64.tar.gz" \
| tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin
```
4. Install kubectl
```bash
curl -o kubectl https://amazon-eks.s3-us-west-2.amazonaws.com/1.11.5/2018-12-06/bin/linux/amd64/kubectl
chmod +x ./kubectl
mkdir -p $HOME/bin && cp ./kubectl $HOME/bin/kubectl && export PATH=$HOME/bin:$PATH
```
5. Install aws-iam-authenticator
```bash
curl -o aws-iam-authenticator https://amazon-eks.s3-us-west-2.amazonaws.com/1.11.5/2018-12-06/bin/linux/amd64/aws-iam-authenticator
chmod +x aws-iam-authenticator
cp ./aws-iam-authenticator $HOME/bin/aws-iam-authenticator && export PATH=$HOME/bin:$PATH
```
6. Install ksonnet
```bash
export KS_VER=0.13.1
export KS_PKG=ks_${KS_VER}_linux_amd64
wget -O /tmp/${KS_PKG}.tar.gz https://github.com/ksonnet/ksonnet/releases/download/v${KS_VER}/${KS_PKG}.tar.gz
mkdir -p ${HOME}/bin
tar -xvf /tmp/$KS_PKG.tar.gz -C ${HOME}/bin
sudo mv ${HOME}/bin/$KS_PKG/ks /usr/local/bin
```

With these components installed, create and deploy the cluster:
```bash
eksctl create cluster paddle-cluster \
--version 1.13 \
--nodes 2 \
--node-type=m5.2xlarge \
--timeout=40m \
--ssh-access \
--ssh-public-key ec2.key \
--region us-west-2 \
--auto-kubeconfig
```
where:

**--version** is the k8s version; AWS currently supports 1.12, 1.13 and 1.14

**--nodes** is the number of nodes

**--node-type** is the node instance type; pick whichever instance plan you like

**--ssh-public-key** can be the key name defined when buying the jump box

**--region** is the region for the nodes

Deploying the cluster takes a while; be patient. Once it succeeds, test the cluster as follows:

1. Show node information:
```bash
kubectl get nodes -o wide
```
2. Verify the cluster is active:
```bash
aws eks --region <region> describe-cluster --name <cluster-name> --query cluster.status
```
You should see output like:
```
"ACTIVE"
```
3. If you have multiple cluster setups on the same jump box, verify the kubectl context:
```bash
kubectl config get-contexts
```
If the context is not set as expected, fix it with:
```bash
aws eks --region <region> update-kubeconfig --name <cluster-name>
```
Those are all the steps for setting up a k8s cluster on AWS. Users can then set up HDFS on AWS themselves and deploy elastic ctr2.0 on the jump box.
-------------------------------------------------------------------------------- /cluster_config.md: --------------------------------------------------------------------------------
### 1 Create the k8s cluster

Follow the [Baidu Cloud CCE container engine documentation: creating a cluster](https://cloud.baidu.com/doc/CCE/s/zjxpoqohb) to create a cluster on Baidu Cloud. The node configuration must satisfy:

- CPU cores \> 4

Example container engine request:

![image](https://github.com/PaddlePaddle/Serving/raw/master/doc/elastic_ctr/ctr_node.png)

After creation, see the [Baidu Cloud CCE container engine documentation: viewing clusters](https://cloud.baidu.com/doc/CCE/GettingStarted.html#.E6.9F.A5.E7.9C.8B.E9.9B.86.E7.BE.A4) to view the information of the cluster you just requested.

### 2 Operating the cluster

The cluster can be operated through the Baidu Cloud web console or with the kubectl tool; kubectl is recommended.

Mac and linux users can install kubectl as follows:

1. Download the latest kubectl
```bash
curl -LO "https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/darwin/amd64/kubectl"
```
2. Make kubectl executable
```bash
chmod +x ./kubectl
```
3. Move the downloaded executable onto your PATH
```bash
sudo mv ./kubectl /usr/local/bin/kubectl
```
4. Check that the installation succeeded
```bash
kubectl version
```

\* Note: the steps in this guide assume a linux environment.

- After installing kubectl, configure it by downloading the cluster credentials. The credentials can be viewed in the Baidu Cloud console, as shown below
![config](https://github.com/suoych/WebChat/raw/master/d9d953129f4a27d8ec728d0a8.png)
Download the cluster config file from the cluster page and place it in kubectl's default config path (check that the \~/.kube directory exists; create it if not):

```bash
$ mv kubectl.conf ~/.kube/config
```

- Once configured, you can access the Kubernetes cluster from your local machine with kubectl (note: make sure no network proxy is configured on your machine):

```bash
$ kubectl get node
```


### 3 Granting access permissions

Distributed jobs require pods to have permission to call each other's APIs; set this up as follows:

```bash
$ kubectl create rolebinding default-view --clusterrole=view --serviceaccount=default:default --namespace=default
```

Note: the `default` passed to --namespace is the name chosen when the cluster was created.

### 4 Installing Volcano

We use volcano as the batch job management tool for the training phase. For details on volcano, see the Documentation on the [official site](https://volcano.sh/).

Install volcano into the k8s cluster with:

```bash
$ kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
```

![image](https://github.com/PaddlePaddle/Serving/raw/master/doc/elastic_ctr/ctr_volcano_install.png)


-------------------------------------------------------------------------------- /doc/ElasticCTR.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PaddlePaddle/ElasticCTR/c493c6b250eef001028b5a7264b61f679c79fa6d/doc/ElasticCTR.png -------------------------------------------------------------------------------- /doc/buy_bcc.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PaddlePaddle/ElasticCTR/c493c6b250eef001028b5a7264b61f679c79fa6d/doc/buy_bcc.png -------------------------------------------------------------------------------- /elastic-ctr-cli/data.config: --------------------------------------------------------------------------------
1 | export HDFS_ADDRESS="hdfs://${IP}:9000"
2 | export HDFS_UGI="root,i"
3 | export START_DATE_HR=20200401/00
4 | export END_DATE_HR=20200401/03
5 | export 
DATASET_PATH="/train_data" 6 | export SPARSE_DIM="1000001" 7 | -------------------------------------------------------------------------------- /elastic-ctr-cli/elastic-control.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | ############################################################################### 16 | # Function definitions 17 | ############################################################################### 18 | help() 19 | { 20 | echo "Usage: sh elastic-control.sh [COMMAND] [OPTIONS]" 21 | echo "elastic-control is command line interface with ELASTIC CTR" 22 | echo "" 23 | echo "Commands:" 24 | echo "-r|--config_resource Configure training resource requirments. See bellow" 25 | echo "-a|--apply Apply configurations to start training process" 26 | echo "-l|--log Log the status of training, please make sure you have started the training process" 27 | echo "-c|--config_client Retrieve client binaries to send infer requests and receive results" 28 | echo "" 29 | echo "Options(Used only for --config_resource):" 30 | echo "-u|--cpu CPU cores for each training node (Unused for now)" 31 | echo "-m|--mem Memory for each training node (Unused for now)" 32 | echo "-t|--trainer Number of trainer nodes" 33 | echo "-p|--pserver Number of parameter-server nodes" 34 | echo "-b|--cube Number of cube shards" 35 | echo "-f|--datafile Data file path (Only HDFS supported) (Unused for now)" 36 | echo "-s|--slot_conf Slot config file" 37 | echo "" 38 | echo "Example:" 39 | echo "sh elastic-control.sh -r -u 4 -m 20 -t 2 -p 2 -b 5 -s slot.conf -f data.config" 40 | echo "sh elastic-control.sh -a" 41 | echo "sh elastic-control.sh -c" 42 | echo "" 43 | echo "Notes:" 44 | echo "Slot Config File: Specify which feature ids are used in training. One number per line." 45 | } 46 | 47 | die() 48 | { 49 | echo "[FAILED] ${1}" 50 | exit 1 51 | } 52 | 53 | ok() 54 | { 55 | echo "[OK] ${1}" 56 | } 57 | 58 | check_tools() 59 | { 60 | if [ $# -lt 1 ]; then 61 | echo "Usage: check_tools COMMAND [COMMAND...]" 62 | return 63 | fi 64 | while [ $# -ge 1 ]; do 65 | type $1 &>/dev/null || die "$1 is needed but not found. Aborting..." 66 | shift 67 | done 68 | return 0 69 | } 70 | 71 | function check_files() 72 | { 73 | if [ $# -lt 1 ]; then 74 | echo "Usage: check_files COMMAND [COMMAND...]" 75 | return 76 | fi 77 | while [ $# -ge 1 ]; do 78 | [ -f "$1" ] || die "$1 does not exist" 79 | shift 80 | done 81 | return 0 82 | } 83 | 84 | function start_fileserver() 85 | { 86 | unset http_proxy 87 | unset https_proxy 88 | kubectl get pod | grep file-server >/dev/null 2>&1 89 | if [ $? -ne 0 ]; then 90 | kubectl apply -f fileserver.yaml 91 | else 92 | echo "delete duplicate file server..." 
93 | kubectl delete -f fileserver.yaml 94 | kubectl apply -f fileserver.yaml 95 | fi 96 | } 97 | 98 | function install_volcano() { 99 | unset http_proxy 100 | unset https_proxy 101 | kubectl get crds | grep jobs.batch.volcano.sh >/dev/null 2>&1 102 | if [ $? -ne 0 ]; then 103 | echo "volcano not found, now install" 104 | kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml 105 | fi 106 | 107 | } 108 | 109 | 110 | function config_client() 111 | { 112 | check_tools wget kubectl 113 | wget --no-check-certificate https://paddle-serving.bj.bcebos.com/data/ctr_prediction/elastic_ctr_client_million.tar.gz 114 | tar zxvf elastic_ctr_client_million.tar.gz 115 | rm elastic_ctr_client_million.tar.gz 116 | 117 | for number in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do 118 | SERVING_IP=`kubectl get services | grep 'paddleserving' | awk '{print $4}'` 119 | echo "searching Paddle Serving external IP, wait a moment." 120 | if [ "${SERVING_IP}" == "" ]; then 121 | sleep 10 122 | else 123 | break 124 | fi 125 | done 126 | 127 | SERVING_IP=`kubectl get services | grep 'paddleserving' | awk '{print $4}'` 128 | SERVING_PORT=`kubectl get services | grep 'paddleserving' | awk '{print $5}' | awk -F':' '{print $1}'` 129 | SERVING_ADDR="$SERVING_IP:$SERVING_PORT" 130 | sed -e "s#<$ SERVING_LIST $>#$SERVING_ADDR#g" client/template/predictors.prototxt.template > client/conf/predictors.prototxt 131 | FILESERVER_IP=`kubectl get services | grep 'file-server' | awk '{print $4}'` 132 | FILESERVER_PORT=`kubectl get services | grep 'file-server' | awk '{print $5}' | awk -F':' '{print $1}'` 133 | wget http://$FILESERVER_IP:$FILESERVER_PORT/slot.conf -O client/conf/slot.conf 134 | cp api/lib/*.so client/bin/ 135 | echo "Done." 136 | echo "============================================" 137 | echo "" 138 | echo "Try ELASTIC CTR:" 139 | echo "1. cd client" 140 | echo "2. (python) python bin/elastic_ctr.py $SERVING_IP $SERVING_PORT conf/slot.conf data/ctr_prediction/data.txt" 141 | echo "3. 
(C++ native) bin/elastic_ctr_demo --test_file data/ctr_prediction/data.txt" 142 | return 0 143 | } 144 | 145 | function generate_cube_yaml() 146 | { 147 | if [ $# != 1 ]; then 148 | echo "Invalid argument to function generate_cube_yaml" 149 | return -1 150 | fi 151 | if [ "$1" -lt 1 ]; then 152 | echo "CUBE_SHARD_NUM must not be less than 1" 153 | return -1 154 | fi 155 | CNT=$(($1-1)) 156 | CUBE_SHARD_NUM=$1 157 | for i in `seq 0 $CNT`; do 158 | echo "---" 159 | echo "apiVersion: v1" 160 | echo "kind: Pod" 161 | echo "metadata:" 162 | echo " name: cube-$i" 163 | echo " labels:" 164 | echo " app: cube-$i" 165 | echo "spec:" 166 | echo " containers:" 167 | echo " - name: cube-$i" 168 | echo " image: hub.baidubce.com/ctr/cube:v1" 169 | echo " workingDir: /cube" 170 | echo " command: ['/bin/bash']" 171 | echo " args: ['start.sh']" 172 | echo " env:" 173 | echo " - name: CUBE_SHARD_NUM" 174 | echo " value: \"$CUBE_SHARD_NUM\"" 175 | echo " ports:" 176 | echo " - containerPort: 8001" 177 | echo " name: cube-agent" 178 | echo " - containerPort: 8027" 179 | echo " name: cube-server" 180 | echo "---" 181 | echo "kind: Service" 182 | echo "apiVersion: v1" 183 | echo "metadata:" 184 | echo " name: cube-$i" 185 | echo "spec:" 186 | echo " ports:" 187 | echo " - name: agent" 188 | echo " port: 8001" 189 | echo " protocol: TCP" 190 | echo " - name: server" 191 | echo " port: 8027" 192 | echo " protocol: TCP" 193 | echo " selector:" 194 | echo " app: cube-$i" 195 | done > cube.yaml 196 | { 197 | echo "apiVersion: v1" 198 | echo "kind: Pod" 199 | echo "metadata:" 200 | echo " name: cube-transfer" 201 | echo " labels:" 202 | echo " app: cube-transfer" 203 | echo "spec:" 204 | echo " containers:" 205 | echo " - name: cube-transfer" 206 | echo " image: hub.baidubce.com/ctr/cube-transfer:v2" 207 | echo " workingDir: /" 208 | echo " env:" 209 | echo " - name: POD_IP" 210 | echo " valueFrom:" 211 | echo " fieldRef:" 212 | echo " apiVersion: v1" 213 | echo " fieldPath: status.podIP" 214 | echo " - name: CUBE_SHARD_NUM" 215 | echo " value: \"$CUBE_SHARD_NUM\"" 216 | echo " command: ['bash']" 217 | echo " args: ['nonstop.sh']" 218 | echo " ports:" 219 | echo " - containerPort: 8099" 220 | echo " name: cube-transfer" 221 | echo " - containerPort: 8098" 222 | echo " name: cube-http" 223 | } > transfer.yaml 224 | echo "cube.yaml written to ./cube.yaml" 225 | echo "transfer.yaml written to ./transfer.yaml" 226 | return 0 227 | } 228 | 229 | function generate_fileserver_yaml() 230 | { 231 | check_tools sed 232 | check_files fileserver.yaml.template 233 | if [ $# -ne 3 ]; then 234 | echo "Invalid argument to function generate_fileserver_yaml" 235 | return -1 236 | else 237 | hdfs_address=$1 238 | hdfs_ugi=$2 239 | dataset_path=$3 240 | sed -e "s#<$ HDFS_ADDRESS $>#$hdfs_address#g" \ 241 | -e "s#<$ HDFS_UGI $>#$hdfs_ugi#g" \ 242 | -e "s#<$ DATASET_PATH $>#$dataset_path#g" \ 243 | fileserver.yaml.template > fileserver.yaml 244 | echo "File server yaml written to fileserver.yaml" 245 | fi 246 | return 0 247 | } 248 | 249 | function generate_yaml() 250 | { 251 | check_tools sed 252 | check_files fleet-ctr.yaml.template 253 | if [ $# -ne 11 ]; then 254 | echo "Invalid argument to function generate_yaml" 255 | return -1 256 | else 257 | pserver_num=$1 258 | total_trainer_num=$2 259 | slave_trainer_num=$((total_trainer_num)) 260 | let total_pod_num=${total_trainer_num}+${pserver_num} 261 | cpu_num=$3 262 | mem=$4 263 | data_path=$5 264 | hdfs_address=$6 265 | hdfs_ugi=$7 266 | start_date_hr=$8 267 | end_date_hr=$9 268 | 
sparse_dim=${10}
269 |         dataset_path=${11}
270 |
271 |         sed -e "s#<$ PSERVER_NUM $>#$pserver_num#g" \
272 |             -e "s#<$ TRAINER_NUM $>#$total_trainer_num#g" \
273 |             -e "s#<$ SLAVE_TRAINER_NUM $>#$slave_trainer_num#g" \
274 |             -e "s#<$ CPU_NUM $>#$cpu_num#g" \
275 |             -e "s#<$ MEMORY $>#$mem#g" \
276 |             -e "s#<$ DATASET_PATH $>#$dataset_path#g" \
277 |             -e "s#<$ SPARSE_DIM $>#$sparse_dim#g" \
278 |             -e "s#<$ HDFS_ADDRESS $>#$hdfs_address#g" \
279 |             -e "s#<$ HDFS_UGI $>#$hdfs_ugi#g" \
280 |             -e "s#<$ START_DATE_HR $>#$start_date_hr#g" \
281 |             -e "s#<$ END_DATE_HR $>#$end_date_hr#g" \
282 |             -e "s#<$ TOTAL_POD_NUM $>#$total_pod_num#g" \
283 |             fleet-ctr.yaml.template > fleet-ctr.yaml
284 |         echo "Main yaml written to fleet-ctr.yaml"
285 |     fi
286 |     return 0
287 | }
288 |
289 | function upload_slot_conf()
290 | {
291 |     check_tools kubectl curl
292 |     if [ $# -ne 1 ]; then
293 |         die "upload_slot_conf: Slot conf file not specified"
294 |     fi
295 |     check_files $1
296 |     echo "start file-server pod"
297 |     start_fileserver
298 |     for number in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do
299 |         FILESERVER_IP=`kubectl get services | grep 'file-server' | awk '{print $4}'`
300 |         echo "searching file-server external IP, wait a moment."
301 |         if [ "${FILESERVER_IP}" == "" ]; then
302 |             sleep 10
303 |         else
304 |             break
305 |         fi
306 |     done
307 |     if [ "${FILESERVER_IP}" == "" ]; then
308 |         echo "error in K8S cluster, cannot continue. Aborted"
309 |         return 1
310 |     fi
311 |
312 |     FILESERVER_IP=`kubectl get services | grep 'file-server' | awk '{print $4}'`
313 |
314 |     FILESERVER_PORT=`kubectl get services | grep 'file-server' | awk '{print $5}' | awk -F':' '{print $2}' | awk -F',' '{print $2}'`
315 |     if [ "${1##*.}"x != "txt"x ];
316 |     then
317 |         echo "slot file suffix must be '.txt'"
318 |     fi
319 |     echo "curl --upload-file $1 $FILESERVER_IP:$FILESERVER_PORT"
320 |     curl --upload-file $1 $FILESERVER_IP:$FILESERVER_PORT
321 |     if [ $? 
== 0 ]; then 322 | echo "File $1 uploaded to $FILESERVER_IP:$FILESERVER_PORT/slot.conf" 323 | fi 324 | return 0 325 | } 326 | 327 | function config_resource() 328 | { 329 | echo "CPU=$CPU MEM=$MEM CUBE=$CUBE TRAINER=$TRAINER PSERVER=$PSERVER "\ 330 | "CUBE=$CUBE DATA_PATH=$DATA_PATH SLOT_CONF=$SLOT_CONF VERBOSE=$VERBOSE "\ 331 | "HDFS_ADDRESS=$HDFS_ADDRESS HDFS_UGI=$HDFS_UGI START_DATE_HR=$START_DATE_HR END_DATE_HR=$END_DATE_HR "\ 332 | "SPARSE_DIM=$SPARSE_DIM DATASET_PATH=$DATASET_PATH " 333 | generate_cube_yaml $CUBE || die "config_resource: generate_cube_yaml failed" 334 | generate_fileserver_yaml $HDFS_ADDRESS $HDFS_UGI $DATASET_PATH || die "config_resource: generate_fileserver_yaml failed" 335 | generate_yaml $PSERVER $TRAINER $CPU $MEM $DATA_PATH $HDFS_ADDRESS $HDFS_UGI $START_DATE_HR $END_DATE_HR $SPARSE_DIM $DATASET_PATH || die "config_resource: generate_yaml failed" 336 | upload_slot_conf $SLOT_CONF || die "config_resource: upload_slot_conf failed" 337 | return 0 338 | } 339 | 340 | function log() 341 | { 342 | echo "Trainer 0 Log:" 343 | kubectl logs fleet-ctr-demo-trainer-0 | grep __main__ > train.log 344 | if [ -f train.log ]; then 345 | tail -n 20 train.log 346 | else 347 | echo "Trainer Log Has not been generated" 348 | fi 349 | echo "" 350 | echo "File Server Log:" 351 | file_server_pod=$(kubectl get po | grep file-server | awk {'print $1'}) 352 | kubectl logs ${file_server_pod} | grep __main__ > file-server.log 353 | if [ -f file-server.log ]; then 354 | tail -n 20 file-server.log 355 | else 356 | echo "File Server Log Has not been generated" 357 | fi 358 | echo "" 359 | echo "Cube Transfer Log:" 360 | kubectl logs cube-transfer | grep "all reload ok" > cube-transfer.log 361 | if [ -f cube-transfer.log ]; then 362 | tail -n 20 cube-transfer.log 363 | else 364 | echo "Cube Transfer Log Has not been generated" 365 | fi 366 | echo "" 367 | #echo "Padddle Serving Log:" 368 | #serving_pod=$(kubectl get po | grep paddleserving | awk {'print $1'}) 369 | #kubectl logs ${serving_pod} | grep __INFO__ > paddleserving.log 370 | #if [ -f paddleserving.log ]; then 371 | # tail -n 20 paddleserving.log 372 | #else 373 | # echo "PaddleServing Log Has not been generated" 374 | #fi 375 | } 376 | 377 | datafile_config() 378 | { 379 | source $DATA_CONF_PATH 380 | } 381 | 382 | function apply() 383 | { 384 | echo "Waiting for pod..." 385 | check_tools kubectl 386 | install_volcano 387 | kubectl get pod | grep cube | awk {'print $1'} | xargs kubectl delete pod >/dev/null 2>&1 388 | kubectl get pod | grep paddleserving | awk {'print $1'} | xargs kubectl delete pod >/dev/null 2>&1 389 | kubectl apply -f cube.yaml 390 | kubectl apply -f transfer.yaml 391 | kubectl apply -f pdserving.yaml 392 | 393 | kubectl get jobs.batch.volcano.sh | grep fleet-ctr-demo 394 | if [ $? == 0 ]; then 395 | kubectl delete jobs.batch.volcano.sh fleet-ctr-demo 396 | fi 397 | kubectl apply -f fleet-ctr.yaml 398 | # python3 listen.py & 399 | # echo "waiting for mlflow..." 
400 | # python3 service.py 401 | return 402 | } 403 | 404 | ############################################################################### 405 | # Main logic begin 406 | ############################################################################### 407 | 408 | CMD="" 409 | CPU=2 410 | MEM=4 411 | CUBE=2 412 | TRAINER=2 413 | PSERVER=2 414 | DATA_PATH="/app" 415 | SLOT_CONF="./slot.conf" 416 | VERBOSE=0 417 | DATA_CONF_PATH="./data.config" 418 | source $DATA_CONF_PATH 419 | 420 | # Parse arguments 421 | TEMP=`getopt -n elastic-control -o crahu:m:t:p:b:f:s:v:l --longoption config_client,config_resource,apply,help,cpu:,mem:,trainer:,pserver:,cube:,datafile:,slot_conf:,verbose,log -- "$@"` 422 | 423 | # Die if they fat finger arguments, this program will be run as root 424 | [ $? = 0 ] || die "Error parsing arguments. Try $0 --help" 425 | 426 | eval set -- "$TEMP" 427 | while true; do 428 | case $1 in 429 | -c|--config_client) 430 | CMD="config_client"; shift; continue 431 | ;; 432 | -r|--config_resource) 433 | CMD="config_resource"; shift; continue 434 | ;; 435 | -a|--apply) 436 | CMD="apply"; shift; continue 437 | ;; 438 | -h|--help) 439 | help 440 | exit 0 441 | ;; 442 | -l|--log) 443 | log; shift; 444 | exit 0 445 | ;; 446 | -u|--cpu) 447 | CPU="$2"; shift; shift; continue 448 | ;; 449 | -m|--mem) 450 | MEM="$2"; shift; shift; continue 451 | ;; 452 | -t|--trainer) 453 | TRAINER="$2"; shift; shift; continue 454 | ;; 455 | -p|--pserver) 456 | PSERVER="$2"; shift; shift; continue 457 | ;; 458 | -b|--cube) 459 | CUBE="$2"; shift; shift; continue 460 | ;; 461 | -f|--datafile) 462 | DATA_CONF_PATH="$2"; datafile_config ; shift; shift; continue 463 | ;; 464 | -s|--slot_conf) 465 | SLOT_CONF="$2"; shift; shift; continue 466 | ;; 467 | -v|--verbose) 468 | VERBOSE=1; shift; continue 469 | ;; 470 | --) 471 | # no more arguments to parse 472 | break 473 | ;; 474 | *) 475 | printf "Unknown option %s\n" "$1" 476 | exit 1 477 | ;; 478 | esac 479 | done 480 | 481 | if [ $CMD = "config_resource" ]; then 482 | 483 | if ! grep '^[[:digit:]]*$' <<< "$CPU" >> /dev/null || [ $CPU -lt 1 ] || [ $CPU -gt 4 ]; then 484 | die "Invalid CPU Num, should be greater than 0 and less than 5." 485 | fi 486 | 487 | if ! grep '^[[:digit:]]*$' <<< "$MEM" >> /dev/null || [ $MEM -lt 1 ] || [ $MEM -gt 4 ]; then 488 | die "Invalid MEM Num, should be greater than 0 and less than 5." 489 | fi 490 | 491 | if ! grep '^[[:digit:]]*$' <<< "$PSERVER" >> /dev/null || [ $PSERVER -lt 1 ] || [ $PSERVER -gt 9 ]; then 492 | die "Invalid PSERVER Num, should be greater than 0 and less than 10." 493 | fi 494 | 495 | if ! grep '^[[:digit:]]*$' <<< "$TRAINER" >> /dev/null || [ $TRAINER -lt 1 ] || [ $TRAINER -gt 9 ]; then 496 | die "Invalid TRAINER Num, should be greater than 0 and less than 10." 497 | fi 498 | 499 | if ! grep '^[[:digit:]]*$' <<< "$CUBE" >> /dev/null || [ $CUBE -lt 1 ] || [ $CUBE -gt 9 ]; then 500 | die "Invalid CUBE Num, should be greater than 0 and less than 10." 
501 | fi 502 | fi 503 | 504 | 505 | case $CMD in 506 | config_resource) 507 | config_resource 508 | ;; 509 | config_client) 510 | config_client 511 | ;; 512 | apply) 513 | apply 514 | ;; 515 | status) 516 | status 517 | ;; 518 | *) 519 | help 520 | ;; 521 | esac 522 | -------------------------------------------------------------------------------- /elastic-ctr-cli/fileserver.yaml.template: -------------------------------------------------------------------------------- 1 | apiVersion: apps/v1beta1 2 | kind: Deployment 3 | metadata: 4 | name: file-server 5 | labels: 6 | app: file-server 7 | spec: 8 | replicas: 1 9 | template: 10 | metadata: 11 | name: file-server 12 | labels: 13 | app: file-server 14 | spec: 15 | containers: 16 | - name: file-server 17 | image: hub.baidubce.com/ctr/file-server:latest 18 | imagePullPolicy: Always 19 | ports: 20 | - containerPort: 8080 21 | command: ['bash'] 22 | args: ['run.sh'] 23 | env: 24 | - name: NAMESPACE 25 | valueFrom: 26 | fieldRef: 27 | apiVersion: v1 28 | fieldPath: metadata.namespace 29 | - name: POD_IP 30 | valueFrom: 31 | fieldRef: 32 | apiVersion: v1 33 | fieldPath: status.podIP 34 | - name: POD_NAME 35 | valueFrom: 36 | fieldRef: 37 | apiVersion: v1 38 | fieldPath: metadata.name 39 | - name: PADDLE_CURRENT_IP 40 | valueFrom: 41 | fieldRef: 42 | apiVersion: v1 43 | fieldPath: status.podIP 44 | - name: JAVA_HOME 45 | value: /usr/local/jdk1.8.0_231 46 | - name: HADOOP_HOME 47 | value: /usr/local/hadoop-2.8.5 48 | - name: HADOOP_HOME 49 | value: /usr/local/hadoop-2.8.5 50 | - name: DATASET_PATH 51 | value: "<$ DATASET_PATH $>" 52 | - name: HDFS_ADDRESS 53 | value: "<$ HDFS_ADDRESS $>" 54 | - name: HDFS_UGI 55 | value: "<$ HDFS_UGI $>" 56 | - name: PATH 57 | value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/jdk1.8.0_231/bin:/usr/local/hadoop-2.8.5/bin:/Python-3.7.0:/node-v12.13.1-linux-x64/bin 58 | 59 | --- 60 | kind: Service 61 | apiVersion: v1 62 | metadata: 63 | name: file-server 64 | spec: 65 | type: LoadBalancer 66 | ports: 67 | - name: file-server 68 | port: 8080 69 | targetPort: 8080 70 | - name: upload 71 | port: 9000 72 | targetPort: 9000 73 | selector: 74 | app: file-server 75 | -------------------------------------------------------------------------------- /elastic-ctr-cli/fleet-ctr.yaml.template: -------------------------------------------------------------------------------- 1 | apiVersion: batch.volcano.sh/v1alpha1 2 | kind: Job 3 | metadata: 4 | name: fleet-ctr-demo 5 | spec: 6 | minAvailable: <$ TOTAL_POD_NUM $> 7 | schedulerName: volcano 8 | policies: 9 | - event: PodEvicted 10 | action: RestartJob 11 | - event: PodFailed 12 | action: RestartJob 13 | tasks: 14 | - replicas: <$ PSERVER_NUM $> 15 | name: pserver 16 | template: 17 | metadata: 18 | labels: 19 | paddle-job-pserver: fluid-ctr 20 | spec: 21 | imagePullSecrets: 22 | - name: default-secret 23 | containers: 24 | - image: hub.baidubce.com/ctr/fleet-ctr:latest 25 | command: 26 | - paddle_k8s 27 | - start_fluid 28 | imagePullPolicy: Always 29 | name: preserver 30 | resources: 31 | limits: 32 | cpu: 10 33 | memory: 30Gi 34 | ephemeral-storage: 10Gi 35 | requests: 36 | cpu: 1 37 | memory: 100M 38 | ephemeral-storage: 1Gi 39 | env: 40 | - name: GLOG_v 41 | value: "0" 42 | - name: GLOG_logtostderr 43 | value: "1" 44 | - name: TOPOLOGY 45 | value: "" 46 | - name: TRAINER_PACKAGE 47 | value: /workspace 48 | - name: PADDLE_INIT_NICS 49 | value: eth2 50 | - name: NAMESPACE 51 | valueFrom: 52 | fieldRef: 53 | apiVersion: v1 54 | fieldPath: 
metadata.namespace 55 | - name: POD_IP 56 | valueFrom: 57 | fieldRef: 58 | apiVersion: v1 59 | fieldPath: status.podIP 60 | - name: POD_NAME 61 | valueFrom: 62 | fieldRef: 63 | apiVersion: v1 64 | fieldPath: metadata.name 65 | - name: PADDLE_CURRENT_IP 66 | valueFrom: 67 | fieldRef: 68 | apiVersion: v1 69 | fieldPath: status.podIP 70 | - name: PADDLE_JOB_NAME 71 | value: fluid-ctr 72 | - name: PADDLE_IS_LOCAL 73 | value: "0" 74 | - name: PADDLE_TRAINERS_NUM 75 | value: "<$ TRAINER_NUM $>" 76 | - name: PADDLE_PSERVERS_NUM 77 | value: "<$ PSERVER_NUM $>" 78 | - name: FLAGS_rpc_deadline 79 | value: "36000000" 80 | - name: ENTRY 81 | value: cd workspace && python3 train_with_mlflow.py slot.conf 82 | - name: PADDLE_PORT 83 | value: "30240" 84 | - name: HDFS_ADDRESS 85 | value: "<$ HDFS_ADDRESS $>" 86 | - name: HDFS_UGI 87 | value: "<$ HDFS_UGI $>" 88 | - name: START_DATE_HR 89 | value: <$ START_DATE_HR $> 90 | - name: END_DATE_HR 91 | value: <$ END_DATE_HR $> 92 | - name: DATASET_PATH 93 | value: "<$ DATASET_PATH $>" 94 | - name: SPARSE_DIM 95 | value: "<$ SPARSE_DIM $>" 96 | - name: PADDLE_TRAINING_ROLE 97 | value: PSERVER 98 | - name: TRAINING_ROLE 99 | value: PSERVER 100 | restartPolicy: OnFailure 101 | 102 | - replicas: <$ TRAINER_NUM $> 103 | policies: 104 | - event: TaskCompleted 105 | action: CompleteJob 106 | name: trainer 107 | template: 108 | metadata: 109 | labels: 110 | paddle-job: fluid-ctr 111 | app: mlflow 112 | spec: 113 | imagePullSecrets: 114 | - name: default-secret 115 | containers: 116 | - image: hub.baidubce.com/ctr/fleet-ctr:latest 117 | command: 118 | - paddle_k8s 119 | - start_fluid 120 | imagePullPolicy: Always 121 | name: trainer 122 | resources: 123 | limits: 124 | cpu: 10 125 | memory: 30Gi 126 | ephemeral-storage: 10Gi 127 | requests: 128 | cpu: 1 129 | memory: 100M 130 | ephemeral-storage: 10Gi 131 | env: 132 | - name: GLOG_v 133 | value: "0" 134 | - name: GLOG_logtostderr 135 | value: "1" 136 | - name: TRAINER_PACKAGE 137 | value: /workspace 138 | - name: PADDLE_INIT_NICS 139 | value: eth2 140 | - name: CPU_NUM 141 | value: "2" 142 | - name: NAMESPACE 143 | valueFrom: 144 | fieldRef: 145 | apiVersion: v1 146 | fieldPath: metadata.namespace 147 | - name: POD_IP 148 | valueFrom: 149 | fieldRef: 150 | apiVersion: v1 151 | fieldPath: status.podIP 152 | - name: POD_NAME 153 | valueFrom: 154 | fieldRef: 155 | apiVersion: v1 156 | fieldPath: metadata.name 157 | - name: PADDLE_CURRENT_IP 158 | valueFrom: 159 | fieldRef: 160 | apiVersion: v1 161 | fieldPath: status.podIP 162 | - name: PADDLE_JOB_NAME 163 | value: fluid-ctr 164 | - name: PADDLE_IS_LOCAL 165 | value: "0" 166 | - name: FLAGS_rpc_deadline 167 | value: "36000000" 168 | - name: PADDLE_PSERVERS_NUM 169 | value: "2" 170 | - name: PADDLE_TRAINERS_NUM 171 | value: "2" 172 | - name: PADDLE_PORT 173 | value: "30240" 174 | - name: PADDLE_TRAINING_ROLE 175 | value: TRAINER 176 | - name: TRAINING_ROLE 177 | value: TRAINER 178 | - name: HDFS_ADDRESS 179 | value: "<$ HDFS_ADDRESS $>" 180 | - name: HDFS_UGI 181 | value: "<$ HDFS_UGI $>" 182 | - name: START_DATE_HR 183 | value: <$ START_DATE_HR $> 184 | - name: END_DATE_HR 185 | value: <$ END_DATE_HR $> 186 | - name: DATASET_PATH 187 | value: "<$ DATASET_PATH $>" 188 | - name: SPARSE_DIM 189 | value: "<$ SPARSE_DIM $>" 190 | - name: JAVA_HOME 191 | value: /usr/local/jdk1.8.0_231 192 | - name: HADOOP_HOME 193 | value: /usr/local/hadoop-2.8.5 194 | - name: PATH 195 | value: 
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/jdk1.8.0_231/bin:/usr/local/hadoop-2.8.5/bin:/Python-3.7.0 196 | - name: ENTRY 197 | value: cd workspace && python3 train_with_mlflow.py slot.conf 198 | restartPolicy: OnFailure 199 | -------------------------------------------------------------------------------- /elastic-ctr-cli/listen.py: -------------------------------------------------------------------------------- 1 | import time 2 | import os 3 | import socket 4 | 5 | 6 | def rewrite_yaml(path): 7 | for root , dirs, files in os.walk(path): 8 | for name in files: 9 | if name == "meta.yaml": 10 | if len(root.split("/mlruns")) != 2: 11 | print("Error: the parent directory of your current directory should not contain a path named mlruns") 12 | exit(0) 13 | cmd = "sed -i \"s#/workspace#" + root.split("/mlruns")[0] + "#g\" " + os.path.join(root, name) 14 | os.system(cmd) 15 | 16 | time.sleep(5) 17 | os.system("rm -rf ./mlruns >/dev/null 2>&1") 18 | while True: 19 | r = os.popen("kubectl get pod | grep fleet-ctr-demo-trainer-0 | awk {'print $3'}") 20 | info = r.readlines() 21 | if info == []: 22 | exit(0) 23 | for line in info: 24 | line = line.strip() 25 | if line == "Completed" or line == "Terminating": 26 | exit(0) 27 | os.system("kubectl cp fleet-ctr-demo-trainer-0:workspace/mlruns ./mlruns_temp >/dev/null 2>&1") 28 | if os.path.exists("./mlruns_temp"): 29 | os.system("rm -rf ./mlruns >/dev/null 2>&1") 30 | os.system("mv ./mlruns_temp ./mlruns >/dev/null 2>&1") 31 | rewrite_yaml(os.getcwd()+"/mlruns") 32 | time.sleep(30) 33 | -------------------------------------------------------------------------------- /elastic-ctr-cli/pdserving.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: apps/v1beta1 2 | kind: Deployment 3 | metadata: 4 | name: paddleserving 5 | labels: 6 | app: paddleserving 7 | spec: 8 | replicas: 1 9 | template: 10 | metadata: 11 | name: paddleserving 12 | labels: 13 | app: paddleserving 14 | spec: 15 | containers: 16 | - name: paddleserving 17 | image: hub.baidubce.com/ctr/paddleserving:latest 18 | imagePullPolicy: Always 19 | workingDir: /serving 20 | command: ['/bin/bash'] 21 | args: ['run.sh'] 22 | ports: 23 | - containerPort: 8010 24 | name: serving 25 | --- 26 | apiVersion: v1 27 | kind: Service 28 | metadata: 29 | name: paddleserving 30 | spec: 31 | type: LoadBalancer 32 | ports: 33 | - name: paddleserving 34 | port: 8010 35 | targetPort: 8010 36 | selector: 37 | app: paddleserving 38 | 39 | -------------------------------------------------------------------------------- /elastic-ctr-cli/rolebind.yaml: -------------------------------------------------------------------------------- 1 | # replace suo with other namespace 2 | kind: ClusterRole 3 | apiVersion: rbac.authorization.k8s.io/v1 4 | metadata: 5 | name: suo 6 | namespace: suo 7 | rules: 8 | - apiGroups: [""] 9 | resources: ["pods"] 10 | verbs: ["get", "list", "watch"] 11 | 12 | --- 13 | kind: ClusterRoleBinding 14 | apiVersion: rbac.authorization.k8s.io/v1 15 | metadata: 16 | name: suo 17 | namespace: suo 18 | subjects: 19 | - kind: ServiceAccount 20 | name: default 21 | namespace: suo 22 | roleRef: 23 | kind: ClusterRole 24 | name: suo 25 | apiGroup: rbac.authorization.k8s.io 26 | -------------------------------------------------------------------------------- /elastic-ctr-cli/service.py: -------------------------------------------------------------------------------- 1 | import time 2 | import os 3 | import socket 4 | 5 | 6 | def 
net_is_used(port, ip='0.0.0.0'):
7 |     s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
8 |     try:
9 |         s.connect((ip, port))
10 |         s.shutdown(2)
11 |         print('Error: %s:%d is used' % (ip, port))
12 |         return True
13 |     except:
14 |         #print('%s:%d is unused' % (ip, port))
15 |         return False
16 |
17 | os.system("ps -ef | grep ${USER} | grep mlflow | awk {'print $2'} | xargs kill -9 >/dev/null 2>&1")
18 | os.system("ps -ef | grep ${USER} | grep gunicorn | awk {'print $2'} | xargs kill -9 >/dev/null 2>&1")
19 |
20 | while True:
21 |     if os.path.exists("./mlruns") and not net_is_used(8111):
22 |         os.system("mlflow server --default-artifact-root ./mlruns/0 --host 0.0.0.0 --port 8111 >/dev/null 2>&1 &")
23 |         time.sleep(3)
24 |         print("mlflow ready!")
25 |         exit(0)
26 |     time.sleep(30)
27 |
-------------------------------------------------------------------------------- /elastic-ctr-cli/service_auto_port.py: --------------------------------------------------------------------------------
1 | import time
2 | import os
3 | import socket
4 |
5 |
6 | def net_is_used(port, ip='0.0.0.0'):
7 |     s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
8 |     try:
9 |         s.connect((ip, port))
10 |         s.shutdown(2)
11 |         print('Error: %s:%d is used' % (ip, port))
12 |         return True
13 |     except:
14 |         #print('%s:%d is unused' % (ip, port))
15 |         return False
16 |
17 | os.system("ps -ef | grep ${USER} | grep mlflow | awk {'print $2'} | xargs kill -9 >/dev/null 2>&1")
18 | os.system("ps -ef | grep ${USER} | grep gunicorn | awk {'print $2'} | xargs kill -9 >/dev/null 2>&1")
19 |
20 | current_port = 8100
21 | while True:
22 |     if os.path.exists("./mlruns"):
23 |         if not net_is_used(current_port):
24 |             os.system("mlflow server --default-artifact-root ./mlruns/0 --host 0.0.0.0 --port " + str(current_port) + " >/dev/null 2>&1 &")
25 |             time.sleep(3)
26 |             print("mlflow ready, started at port" + str(current_port) + "!")
27 |             exit(0)
28 |         else:
29 |             current_port = current_port + 1
30 |             continue
31 |     else:
32 |         time.sleep(30)
33 |
-------------------------------------------------------------------------------- /elastic-ctr-cli/slot.conf: --------------------------------------------------------------------------------
1 | 0
2 | 1
3 | 2
4 | 3
5 | 4
6 | 5
7 | 6
8 | 7
9 | 8
10 | 9
11 | 10
12 | 11
13 | 12
14 | 13
15 | 14
16 | 15
17 | 16
18 | 17
19 | 18
20 | 19
21 | 20
22 | 21
23 | 22
24 | 23
25 | 24
26 | 25
27 | 26
28 |
-------------------------------------------------------------------------------- /elasticctr_arch.md: --------------------------------------------------------------------------------
# ElasticCTR overall architecture

![elastic.png](https://github.com/suoych/WebChat/raw/master/overview_v2.png)
The modules are:
- Client: client for the CTR estimation task; before training, users can upload custom configuration files, and at inference time it issues prediction requests.
- file server: receives user-uploaded configuration files and stores models for Paddle Serving and Cube to use.
- trainer/pserver: training uses PaddlePaddle's parameter server mode, with corresponding trainer and pserver roles. Distributed training uses volcano as its batch job management tool.
- MLFlow: visualization module for training jobs; users can inspect training progress directly.
- HDFS: stores user data; the model files produced by training are also stored on HDFS.
- cube-transfer: watches for model files produced by the upstream training job; when a new model appears it pulls it locally and calls cube-builder to build the cube dictionary files, then tells the cube-agent nodes to pull the latest dictionary files, keeping versions consistent across the cube-servers.
- cube-builder: converts the model files produced by the training job into dictionary files that cube-server can load. The dictionary files have a dedicated data structure, heavily optimized for size and for in-memory access.
- Cube-Server: service nodes providing sharded KV read/write capability.
- Cube-agent: deployed on the same machines as cube-server; receives dictionary-update commands issued by cube-transfer, pulls the data locally, and tells cube-server to update.
- Paddle Serving: loads the CTR model's ProgramDesc and dense parameters and provides the inference service.

Chained together, these components cover the whole pipeline from training to inference deployment. The one-click deployment script elastic-control.sh provided by this project deploys all of the components above. Users can take this deployment as a reference for applying PaddlePaddle-based distributed training and Serving in their own environments.
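One way to cross-check this architecture on a live cluster: after `bash elastic-control.sh -a`, each component above should surface as a pod or service. A hedged sketch, using the pod and service names created by elastic-control.sh and the yaml files in this repository:

```bash
# cube shards, cube-transfer, serving, the file server and the volcano training job
kubectl get pods | grep -E 'cube-|paddleserving|file-server|fleet-ctr-demo'
# the LoadBalancer services that expose Paddle Serving and the file server
kubectl get services | grep -E 'paddleserving|file-server|cube-'
```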
-------------------------------------------------------------------------------- /fleet-ctr/criteo_dataset.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | import paddle.fluid.incubate.data_generator as dg 16 | import sys 17 | import os 18 | 19 | class CriteoDataset(dg.MultiSlotDataGenerator): 20 | 21 | def set_config(self, feature_names): 22 | self.feature_names = feature_names 23 | self.feature_dict = {} 24 | for idx, name in enumerate(self.feature_names): 25 | self.feature_dict[name] = idx 26 | 27 | def generate_sample(self, line): 28 | """ 29 | Read the data line by line and process it as a dictionary 30 | """ 31 | hash_dim_ = int(os.environ['SPARSE_DIM']) 32 | def reader(): 33 | group = line.strip().split() 34 | label = int(group[1]) 35 | click = group[0] 36 | features = [] 37 | for i in range(len(self.feature_names)): 38 | features.append([]) 39 | for fea_pair in group[2:]: 40 | feasign, slot = fea_pair.split(":") 41 | if slot not in self.feature_dict: 42 | continue 43 | features[self.feature_dict[slot]].append(int(feasign) % hash_dim_) 44 | for i in range(len(features)): 45 | if features[i] == []: 46 | features[i].append(0) 47 | features.append([label]) 48 | yield zip(self.feature_names + ["label"], features) 49 | return reader 50 | 51 | 52 | d = CriteoDataset() 53 | feature_names = [] 54 | with open(sys.argv[1]) as fin: 55 | for line in fin: 56 | feature_names.append(line.strip()) 57 | d.set_config(feature_names) 58 | d.run_from_stdin() 59 | -------------------------------------------------------------------------------- /fleet-ctr/criteo_dataset.py~: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | 15 | import paddle.fluid.incubate.data_generator as dg 16 | import sys 17 | import os 18 | 19 | class CriteoDataset(dg.MultiSlotDataGenerator): 20 | 21 | def set_config(self, feature_names): 22 | self.feature_names = feature_names 23 | self.feature_dict = {} 24 | for idx, name in enumerate(self.feature_names): 25 | self.feature_dict[name] = idx 26 | 27 | def generate_sample(self, line): 28 | """ 29 | Read the data line by line and process it as a dictionary 30 | """ 31 | hash_dim_ = 400000001 32 | def reader(): 33 | group = line.strip().split() 34 | label = int(group[1]) 35 | click = group[0] 36 | features = [] 37 | for i in range(len(self.feature_names)): 38 | features.append([]) 39 | for fea_pair in group[2:]: 40 | feasign, slot = fea_pair.split(":") 41 | if slot not in self.feature_dict: 42 | continue 43 | features[self.feature_dict[slot]].append(int(feasign) % hash_dim_) 44 | for i in range(len(features)): 45 | if features[i] == []: 46 | features[i].append(0) 47 | features.append([label]) 48 | yield zip(self.feature_names + ["label"], features) 49 | return reader 50 | 51 | 52 | d = CriteoDataset() 53 | feature_names = [] 54 | with open(sys.argv[1]) as fin: 55 | for line in fin: 56 | feature_names.append(line.strip()) 57 | d.set_config(feature_names) 58 | d.run_from_stdin() 59 | -------------------------------------------------------------------------------- /fleet-ctr/criteo_reader.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | import os
16 | # There are 13 integer features and 26 categorical features
17 | 
18 | class Dataset:
19 |     def __init__(self):
20 |         pass
21 | 
22 | 
23 | class CriteoDataset(Dataset):
24 |     def __init__(self, feature_names):
25 |         self.feature_names = feature_names
26 |         self.feature_dict = {}
27 |         for idx, name in enumerate(self.feature_names):
28 |             self.feature_dict[name] = idx
29 | 
30 |     def _reader_creator(self, file_list, is_train, trainer_num, trainer_id):
31 |         hash_dim_ = int(os.environ['SPARSE_DIM'])
32 |         def reader():
33 |             for file in file_list:
34 |                 with open(file, 'r') as f:
35 |                     line_idx = 0
36 |                     for line in f:
37 |                         line_idx += 1
38 |                         feature = line.rstrip('\n').split()
39 |                         features = []
40 |                         for i in range(len(self.feature_names)):
41 |                             features.append([])
42 |                         label = int(feature[1])  # cast so the int64 label variable can be fed
43 |                         for fea_pair in feature[2:]:
44 |                             #print(fea_pair)
45 |                             tmp_list = fea_pair.split(":")
46 |                             feasign, slot = tmp_list[0], tmp_list[1]
47 |                             if slot not in self.feature_dict:
48 |                                 continue
49 |                             features[self.feature_dict[slot]].append(int(feasign) % hash_dim_)
50 |                         for i in range(len(features)):
51 |                             if features[i] == []:
52 |                                 features[i].append(0)
53 |                         features.append([label])
54 |                         yield features
55 | 
56 |         return reader
57 | 
58 |     def train(self, file_list, trainer_num, trainer_id):
59 |         return self._reader_creator(file_list, True, trainer_num, trainer_id)
60 | 
61 |     def test(self, file_list):
62 |         return self._reader_creator(file_list, False, 1, 0)
63 | 
64 |     def infer(self, file_list):
65 |         return self._reader_creator(file_list, False, 1, 0)
66 | 
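A minimal sketch of driving this reader locally (the slot file, data path, and SPARSE_DIM value are placeholders; infer.py later in this listing uses the same pattern):

```python
import os
os.environ['SPARSE_DIM'] = '400000001'   # the reader reads the hash space from the environment

import paddle
from criteo_reader import CriteoDataset

feature_names = [l.strip() for l in open('slot.conf')]
dataset = CriteoDataset(feature_names)
# test()/infer() ignore trainer sharding; train() additionally takes trainer_num and trainer_id
reader = paddle.batch(dataset.test(['../data/infer_data/part-0']), batch_size=1000)
for batch in reader():
    pass  # feed each batch through fluid.DataFeeder, as infer.py does
```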
--------------------------------------------------------------------------------
/fleet-ctr/dataset_generator.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | #     http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | import sys
15 | import paddle.fluid.incubate.data_generator as dg
16 | import os
17 | hash_dim_ = int(os.environ['SPARSE_DIM'])
18 | categorical_range_ = range(14, 40)
19 | 
20 | class DacDataset(dg.MultiSlotDataGenerator):
21 |     """
22 |     DacDataset: inherits MultiSlotDataGenerator and implements data reading
23 |     Help document: http://wiki.baidu.com/pages/viewpage.action?pageId=728820675
24 |     """
25 | 
26 |     def set_config(self, feature_names):
27 |         self.feature_names = feature_names
28 |         assert len(feature_names) < len(categorical_range_)
29 | 
30 |     def generate_sample(self, line):
31 |         """
32 |         Read the data line by line and process it as a dictionary
33 |         """
34 | 
35 |         def reader():
36 |             """
37 |             This function needs to be implemented by the user, based on data format
38 |             """
39 |             features = line.rstrip('\n').split('\t')
40 |             sparse_feature = []
41 |             for idx in categorical_range_[:len(self.feature_names)]:
42 |                 sparse_feature.append(
43 |                     [hash(str(idx) + features[idx]) % hash_dim_])
44 |             label = [int(features[0])]
45 |             feature_name = []
46 |             for i, idx in enumerate(categorical_range_[:len(self.feature_names)]):
47 |                 feature_name.append(self.feature_names[i])
48 |             feature_name.append("label")
49 | 
50 |             yield zip(feature_name, sparse_feature + [label])
51 | 
52 |         return reader
53 | 
54 | 
55 | d = DacDataset()
56 | feature_names = []
57 | with open(sys.argv[1]) as fin:
58 |     for line in fin:
59 |         feature_names.append(line.strip())
60 | d.set_config(feature_names)
61 | d.run_from_stdin()
62 | 
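To make the column layout concrete: this generator assumes the Criteo DAC TSV layout — label in column 0, 13 integer features, then 26 categorical features in columns 14-39 — and salts each hash with the column index. A toy illustration (the feature values are made up; note also that Python 3 randomizes the built-in hash() of strings per process unless PYTHONHASHSEED is pinned, so identical lines only hash identically within one run):

```python
line = "1\t" + "\t".join(["0"] * 13) + "\t" + "\t".join(["68fd1e64"] * 26)
features = line.rstrip('\n').split('\t')
label = [int(features[0])]                         # -> [1]
idx = 14                                           # first categorical column
sign = hash(str(idx) + features[idx]) % 400000001  # slot-salted hash into SPARSE_DIM
```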
--------------------------------------------------------------------------------
/fleet-ctr/infer.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | #     http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | 
15 | import math
16 | import sys
17 | import os
18 | import paddle.fluid as fluid
19 | import numpy as np
20 | from criteo_reader import CriteoDataset  # criteo_reader.py defines CriteoDataset in this directory
21 | import paddle
22 | from nets import ctr_dnn_model
23 | 
24 | feature_names = []
25 | with open(sys.argv[1]) as fin:
26 |     for line in fin:
27 |         feature_names.append(line.strip())
28 | 
29 | print(feature_names)
30 | 
31 | 
32 | sparse_feature_dim = 400000001
33 | embedding_size = 9
34 | 
35 | sparse_input_ids = [fluid.layers.data(name=name, shape=[1], lod_level=1, dtype='int64')
36 |                     for name in feature_names]
37 | label = fluid.layers.data(name='label', shape=[1], dtype='int64')
38 | _words = sparse_input_ids + [label]
39 | 
40 | exe = fluid.Executor(fluid.CPUPlace())
41 | exe.run(fluid.default_startup_program())
42 | input_folder = "../data/infer_data"
43 | files = os.listdir(input_folder)
44 | infer_filelist = ["{}/{}".format(input_folder, f) for f in files]
45 | print(infer_filelist)
46 | 
47 | criteo_dataset = CriteoDataset(feature_names)
48 | 
49 | startup_program = fluid.framework.default_startup_program()
50 | test_program = fluid.framework.default_main_program()
51 | test_reader = paddle.batch(criteo_dataset.test(infer_filelist), 1000)
52 | _, auc_var, _ = ctr_dnn_model(embedding_size, sparse_input_ids, label, sparse_feature_dim)
53 | [inference_program, feed_target_names, fetch_targets] = fluid.io.load_inference_model(dirname="./saved_models/", executor=exe)
54 | with open('infer_programdesc', 'w+') as f:
55 |     f.write(inference_program.to_string(True))
56 | def set_zero(var_name):
57 |     param = fluid.global_scope().var(var_name).get_tensor()
58 |     param_array = np.zeros(param._get_dims()).astype("int64")
59 |     param.set(param_array, fluid.CPUPlace())
60 | 
61 | auc_states_names = ['_generated_var_2', '_generated_var_3']
62 | for name in auc_states_names:
63 |     set_zero(name)
64 | inputs = _words
65 | feeder = fluid.DataFeeder(feed_list=inputs, place=fluid.CPUPlace())
66 | for batch_id, data in enumerate(test_reader()):
67 |     auc_val = exe.run(inference_program,
68 |                       feed=feeder.feed(data),
69 |                       fetch_list=fetch_targets)
70 |     print(auc_val)
71 | 
72 | 
--------------------------------------------------------------------------------
/fleet-ctr/ip_config.sh:
--------------------------------------------------------------------------------
1 | #! /bin/bash
2 | export SERVER0=192.168.1.2
3 | export SERVER1=192.168.1.5
--------------------------------------------------------------------------------
/fleet-ctr/k8s_tools.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | #     http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | 15 | #!/bin/env python 16 | import os 17 | import sys 18 | import time 19 | import socket 20 | from kubernetes import client, config 21 | NAMESPACE = os.getenv("NAMESPACE") 22 | if os.getenv("KUBERNETES_SERVICE_HOST", None): 23 | config.load_incluster_config() 24 | else: 25 | config.load_kube_config() 26 | v1 = client.CoreV1Api() 27 | 28 | 29 | def get_pod_status(item): 30 | phase = item.status.phase 31 | 32 | # check terminate time although phase is Running. 33 | if item.metadata.deletion_timestamp != None: 34 | return "Terminating" 35 | 36 | return phase 37 | 38 | 39 | def containers_all_ready(label_selector): 40 | def container_statuses_ready(item): 41 | container_statuses = item.status.container_statuses 42 | 43 | for status in container_statuses: 44 | if not status.ready: 45 | return False 46 | return True 47 | 48 | api_response = v1.list_namespaced_pod( 49 | namespace=NAMESPACE, pretty=True, label_selector=label_selector) 50 | 51 | for item in api_response.items: 52 | if not container_statuses_ready(item): 53 | return False 54 | 55 | return True 56 | 57 | 58 | def fetch_pods_info(label_selector, phase=None): 59 | api_response = v1.list_namespaced_pod( 60 | namespace=NAMESPACE, pretty=True, label_selector=label_selector) 61 | pod_list = [] 62 | for item in api_response.items: 63 | if phase is not None and get_pod_status(item) != phase: 64 | continue 65 | 66 | pod_list.append((item.status.phase, item.status.pod_ip, item.metadata.name)) 67 | return pod_list 68 | 69 | 70 | def wait_pods_running(label_selector, desired): 71 | print "label selector: %s, desired: %s" % (label_selector, desired) 72 | while True: 73 | count = count_pods_by_phase(label_selector, 'Running') 74 | # NOTE: pods may be scaled. 75 | if count >= int(desired): 76 | break 77 | print 'current cnt: %d sleep for 5 seconds...' % count 78 | time.sleep(5) 79 | 80 | 81 | def wait_containers_ready(label_selector): 82 | print "label selector: %s, wait all containers ready" % (label_selector) 83 | while True: 84 | if containers_all_ready(label_selector): 85 | break 86 | print 'not all containers ready, sleep for 5 seconds...' 
87 | time.sleep(5) 88 | 89 | 90 | def count_pods_by_phase(label_selector, phase): 91 | pod_list = fetch_pods_info(label_selector, phase) 92 | return len(pod_list) 93 | 94 | 95 | def fetch_ips_list(label_selector, phase=None): 96 | pod_list = fetch_pods_info(label_selector, phase) 97 | ips = [item[1] for item in pod_list] 98 | ips.sort() 99 | return ips 100 | 101 | def fetch_name_list(label_selector, phase=None): 102 | pod_list = fetch_pods_info(label_selector, phase) 103 | names = [item[2] for item in pod_list] 104 | names.sort() 105 | return names 106 | 107 | 108 | def fetch_ips_string(label_selector, phase=None): 109 | ips = fetch_ips_list(label_selector, phase) 110 | return ",".join(ips) 111 | 112 | 113 | def fetch_endpoints_string(label_selector, port, phase=None, sameport=True): 114 | ips = fetch_ips_list(label_selector, phase) 115 | if sameport: 116 | ips = ["{0}:{1}".format(ip, port) for ip in ips] 117 | else: 118 | srcips = ips 119 | ips = [] 120 | port = int(port) 121 | for ip in srcips: 122 | ips.append("{0}:{1}".format(ip, port)) 123 | port = port + 1 124 | return ",".join(ips) 125 | 126 | 127 | def fetch_pod_id(label_selector, phase=None, byname=True): 128 | if byname: 129 | names = fetch_name_list(label_selector, phase=phase) 130 | 131 | local_name = os.getenv('POD_NAME') 132 | for i in xrange(len(names)): 133 | if names[i] == local_name: 134 | return i 135 | 136 | return None 137 | else: 138 | ips = fetch_ips_list(label_selector, phase=phase) 139 | 140 | local_ip = socket.gethostbyname(socket.gethostname()) 141 | for i in xrange(len(ips)): 142 | if ips[i] == local_ip: 143 | return i 144 | 145 | # in minikube there can be one node only 146 | local_ip = os.getenv("POD_IP") 147 | for i in xrange(len(ips)): 148 | if ips[i] == local_ip: 149 | return i 150 | 151 | return None 152 | 153 | 154 | def fetch_ips(label_selector): 155 | return fetch_ips_string(label_selector, phase="Running") 156 | 157 | 158 | def fetch_endpoints(label_selector, port): 159 | return fetch_endpoints_string(label_selector, port=port, phase="Running", sameport=False) 160 | 161 | 162 | def fetch_id(label_selector): 163 | return fetch_pod_id(label_selector, phase="Running") 164 | 165 | 166 | if __name__ == "__main__": 167 | command = sys.argv[1] 168 | if command == "fetch_ips": 169 | print fetch_ips(sys.argv[2]) 170 | if command == "fetch_ips_string": 171 | print fetch_ips_string(sys.argv[2], sys.argv[3]) 172 | elif command == "fetch_endpoints": 173 | print fetch_endpoints(sys.argv[2], sys.argv[3]) 174 | elif command == "fetch_id": 175 | print fetch_id(sys.argv[2]) 176 | elif command == "count_pods_by_phase": 177 | print count_pods_by_phase(sys.argv[2], sys.argv[3]) 178 | elif command == "wait_pods_running": 179 | wait_pods_running(sys.argv[2], sys.argv[3]) 180 | elif command == "wait_containers_ready": 181 | wait_containers_ready(sys.argv[2]) 182 | -------------------------------------------------------------------------------- /fleet-ctr/mlflow_run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | if [[ $PADDLE_TRAINER_ID -ne 0 ]] ; then 3 | echo "PADDLE_TRAINER_ID unset or Run MLFlow on Trainer 0." 
4 | exit 0 5 | fi 6 | 7 | while true ; do 8 | echo "Still wait for CTR Training Setup" 9 | sleep 10 10 | if [ -d "./mlruns/0" ] ;then 11 | mlflow server --default-artifact-root ./mlruns/0 --host 0.0.0.0 --port 8111 12 | fi 13 | done 14 | exit 0 15 | -------------------------------------------------------------------------------- /fleet-ctr/model_with_sparse_feature.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | import math 17 | import sys 18 | import os 19 | import paddle.fluid as fluid 20 | import subprocess as sp 21 | from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet 22 | from paddle.fluid.incubate.fleet.base import role_maker 23 | from paddle.fluid.transpiler.distribute_transpiler import DistributeTranspilerConfig 24 | 25 | file_server_addr = os.environ["FILE_SERVER_SERVICE_HOST"] + ":"+ os.environ["FILE_SERVER_SERVICE_PORT"] 26 | os.system('wget '+ file_server_addr + '/slot.conf') 27 | 28 | feature_names = [] 29 | with open(sys.argv[1]) as fin: 30 | for line in fin: 31 | feature_names.append(line.strip()) 32 | 33 | print(feature_names) 34 | 35 | sparse_input_ids = [ 36 | fluid.layers.data(name=name, shape=[1], lod_level=1, dtype='int64') 37 | for name in feature_names] 38 | 39 | label = fluid.layers.data( 40 | name='label', shape=[1], dtype='int64') 41 | 42 | sparse_feature_dim = 100000001 43 | embedding_size = 9 44 | 45 | 46 | def embedding_layer(input): 47 | emb = fluid.layers.embedding( 48 | input=input, 49 | is_sparse=True, 50 | is_distributed=False, 51 | size=[sparse_feature_dim, embedding_size], 52 | param_attr=fluid.ParamAttr(name="SparseFeatFactors", 53 | initializer=fluid.initializer.Uniform())) 54 | emb_sum = fluid.layers.sequence_pool(input=emb, pool_type='sum') 55 | return emb_sum 56 | 57 | emb_sums = list(map(embedding_layer, sparse_input_ids)) 58 | concated = fluid.layers.concat(emb_sums, axis=1) 59 | fc1 = fluid.layers.fc(input=concated, size=400, act='relu', 60 | param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal( 61 | scale=1 / math.sqrt(concated.shape[1])))) 62 | fc2 = fluid.layers.fc(input=fc1, size=400, act='relu', 63 | param_attr=fluid.ParamAttr( 64 | initializer=fluid.initializer.Normal( 65 | scale=1 / math.sqrt(fc1.shape[1])))) 66 | fc3 = fluid.layers.fc(input=fc2, size=400, act='relu', 67 | param_attr=fluid.ParamAttr( 68 | initializer=fluid.initializer.Normal( 69 | scale=1 / math.sqrt(fc2.shape[1])))) 70 | 71 | predict = fluid.layers.fc(input=fc3, size=2, act='softmax', 72 | param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal( 73 | scale=1 / math.sqrt(fc3.shape[1])))) 74 | 75 | cost = fluid.layers.cross_entropy(input=predict, label=label) 76 | avg_cost = fluid.layers.reduce_sum(cost) 77 | accuracy = fluid.layers.accuracy(input=predict, label=label) 78 | auc_var, batch_auc_var, 
auc_states = \ 79 | fluid.layers.auc(input=predict, label=label, num_thresholds=2 ** 12, slide_steps=20) 80 | 81 | dataset = fluid.DatasetFactory().create_dataset() 82 | dataset.set_use_var(sparse_input_ids + [label]) 83 | pipe_command = "python criteo_dataset.py {}".format(sys.argv[1]) 84 | dataset.set_pipe_command(pipe_command) 85 | dataset.set_batch_size(32) 86 | dataset.set_thread(10) 87 | dataset.set_hdfs_config("hdfs://192.168.48.87:9000", "root,") 88 | optimizer = fluid.optimizer.SGD(0.0001) 89 | #optimizer.minimize(avg_cost) 90 | exe = fluid.Executor(fluid.CPUPlace()) 91 | 92 | input_folder = "hdfs:" 93 | output = sp.check_output("hdfs dfs -ls /train_data | awk '{if(NR>1) print $8}'", shell=True) 94 | train_filelist = ["{}{}".format(input_folder, f) for f in output.decode('ascii').strip().split('\n')] 95 | role = role_maker.PaddleCloudRoleMaker() 96 | fleet.init(role) 97 | 98 | 99 | config = DistributeTranspilerConfig() 100 | config.sync_mode = False 101 | 102 | optimizer = fleet.distributed_optimizer(optimizer, config) 103 | optimizer.minimize(avg_cost) 104 | 105 | 106 | if fleet.is_server(): 107 | fleet.init_server() 108 | fleet.run_server() 109 | elif fleet.is_worker(): 110 | place = fluid.CPUPlace() 111 | exe = fluid.Executor(place) 112 | fleet.init_worker() 113 | exe.run(fluid.default_startup_program()) 114 | print("startup program done.") 115 | fleet_filelist = fleet.split_files(train_filelist) 116 | dataset.set_filelist(fleet_filelist) 117 | exe.train_from_dataset( 118 | program=fluid.default_main_program(), 119 | dataset=dataset, 120 | fetch_list=[auc_var], 121 | fetch_info=["auc"], 122 | debug=True) 123 | print("end .... ") 124 | # save model here 125 | 126 | -------------------------------------------------------------------------------- /fleet-ctr/nets.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | 3 | import argparse 4 | import logging 5 | import os 6 | import time 7 | import math 8 | import numpy as np 9 | import six 10 | import paddle 11 | import paddle.fluid as fluid 12 | import paddle.fluid.proto.framework_pb2 as framework_pb2 13 | import paddle.fluid.core as core 14 | 15 | from multiprocessing import cpu_count 16 | 17 | def ctr_dnn_model(embedding_size, sparse_input_ids, label, 18 | sparse_feature_dim): 19 | def embedding_layer(input): 20 | emb = fluid.layers.embedding( 21 | input=input, 22 | is_sparse=True, 23 | is_distributed=False, 24 | size=[sparse_feature_dim, embedding_size], 25 | param_attr=fluid.ParamAttr(name="SparseFeatFactors", 26 | initializer=fluid.initializer.Uniform())) 27 | emb_sum = fluid.layers.sequence_pool(input=emb, pool_type='sum') 28 | return emb_sum 29 | 30 | emb_sums = list(map(embedding_layer, sparse_input_ids)) 31 | concated = fluid.layers.concat(emb_sums, axis=1) 32 | fc1 = fluid.layers.fc(input=concated, size=400, act='relu', 33 | param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal( 34 | scale=1 / math.sqrt(concated.shape[1])))) 35 | fc2 = fluid.layers.fc(input=fc1, size=400, act='relu', 36 | param_attr=fluid.ParamAttr( 37 | initializer=fluid.initializer.Normal( 38 | scale=1 / math.sqrt(fc1.shape[1])))) 39 | fc3 = fluid.layers.fc(input=fc2, size=400, act='relu', 40 | param_attr=fluid.ParamAttr( 41 | initializer=fluid.initializer.Normal( 42 | scale=1 / math.sqrt(fc2.shape[1])))) 43 | 44 | predict = fluid.layers.fc(input=fc3, size=2, act='softmax', 45 | param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal( 46 | scale=1 / 
math.sqrt(fc3.shape[1]))))
47 | 
48 |     cost = fluid.layers.cross_entropy(input=predict, label=label)
49 |     avg_cost = fluid.layers.reduce_sum(cost)
50 |     accuracy = fluid.layers.accuracy(input=predict, label=label)
51 |     auc_var, batch_auc_var, auc_states = \
52 |         fluid.layers.auc(input=predict, label=label, num_thresholds=2 ** 12, slide_steps=20)
53 |     return avg_cost, auc_var, batch_auc_var
54 | 
55 | 
--------------------------------------------------------------------------------
/fleet-ctr/paddle_k8s:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | set -x
3 | start_pserver() {
4 |   stdbuf -oL paddle pserver \
5 |     --use_gpu=0 \
6 |     --port=$PADDLE_INIT_PORT \
7 |     --ports_num=$PADDLE_INIT_PORTS_NUM \
8 |     --ports_num_for_sparse=$PADDLE_INIT_PORTS_NUM_FOR_SPARSE \
9 |     --nics=$PADDLE_INIT_NICS \
10 |     --comment=paddle_process_k8s \
11 |     --num_gradient_servers=$PADDLE_INIT_NUM_GRADIENT_SERVERS
12 | }
13 | 
14 | start_new_pserver() {
15 |   master_label="paddle-job-master=${PADDLE_JOB_NAME}"
16 | 
17 |   stdbuf -oL python /root/k8s_tools.py wait_pods_running ${master_label} 1
18 |   export MASTER_IP=$(python /root/k8s_tools.py fetch_ips ${master_label})
19 |   stdbuf -oL /usr/bin/pserver \
20 |     -port=$PADDLE_INIT_PORT \
21 |     -num-pservers=$PSERVERS \
22 |     -log-level=debug \
23 |     -etcd-endpoint=http://$MASTER_IP:2379
24 | }
25 | 
26 | start_master() {
27 |   stdbuf -oL /usr/bin/master \
28 |     -port=8080 \
29 |     -chunk-per-task=1\
30 |     -task-timout-dur=16s\
31 |     -endpoints=http://127.0.0.1:2379
32 | }
33 | 
34 | check_failed_cnt() {
35 |   max_failed=$1
36 |   failed_count=$(python /root/k8s_tools.py count_pods_by_phase paddle-job=${PADDLE_JOB_NAME} Failed)
37 |   if [ $failed_count -gt $max_failed ]; then
38 |     stdbuf -oL echo "Failed trainer count beyond the threshold: "$max_failed
39 |     echo "Failed trainer count beyond the threshold: " $max_failed > /dev/termination-log
40 |     exit 0
41 |   fi
42 | }
43 | 
44 | check_trainer_ret() {
45 |   ret=$1
46 |   stdbuf -oL echo "job returned $ret...setting pod return message..."
47 |   stdbuf -oL echo "==============================="
48 | 
49 |   if [ $ret -eq 136 ] ; then
50 |     echo "Error Arithmetic Operation(Floating Point Exception)" > /dev/termination-log
51 |   elif [ $ret -eq 139 ] ; then
52 |     echo "Segmentation Fault" > /dev/termination-log
53 |   elif [ $ret -eq 1 ] ; then
54 |     echo "General Error" > /dev/termination-log
55 |   elif [ $ret -eq 134 ] ; then
56 |     echo "Program Abort" > /dev/termination-log
57 |   fi
58 |   stdbuf -oL echo "termination log written..."
59 | exit $ret 60 | } 61 | 62 | start_fluid_process() { 63 | pserver_label="paddle-job-pserver=${PADDLE_JOB_NAME}" 64 | trainer_label="paddle-job=${PADDLE_JOB_NAME}" 65 | hostname=${HOSTNAME} 66 | task_index="" 67 | 68 | if [ "${PADDLE_TRAINING_ROLE}" == "TRAINER" ] || [ "${PADDLE_TRAINING_ROLE}" == "PSERVER" ]; then 69 | stdbuf -oL python /root/k8s_tools.py wait_pods_running ${pserver_label} ${PADDLE_PSERVERS_NUM} 70 | fi 71 | 72 | if [ "${PADDLE_TRAINING_ROLE}" == "TRAINER" ] || [ "${PADDLE_TRAINING_ROLE}" == "WORKER" ]; then 73 | stdbuf -oL python /root/k8s_tools.py wait_pods_running ${trainer_label} ${PADDLE_TRAINERS_NUM} 74 | fi 75 | 76 | export PADDLE_PSERVERS=$(python /root/k8s_tools.py fetch_endpoints ${pserver_label} ${PADDLE_PORT}) 77 | export PADDLE_TRAINER_IPS=$(python /root/k8s_tools.py fetch_ips ${trainer_label}) 78 | 79 | if [ "${PADDLE_TRAINING_ROLE}" == "TRAINER" ] || [ "${PADDLE_TRAINING_ROLE}" == "WORKER" ]; then 80 | check_failed_cnt 1 81 | task_index=$(python /root/k8s_tools.py fetch_id ${trainer_label}) 82 | else 83 | task_index=$(python /root/k8s_tools.py fetch_id ${pserver_label}) 84 | fi 85 | 86 | export PADDLE_TRAINER_ID=${task_index} 87 | export PADDLE_PSERVER_ID=${task_index} 88 | 89 | stdbuf -oL sh -c "${ENTRY}" 90 | check_trainer_ret $? 91 | } 92 | 93 | # start_tf_benchmark_process is only used for benchmarking 94 | start_tf_benchmark_process() { 95 | # re-use the paddle job labels 96 | pserver_label="paddle-job-pserver=${PADDLE_JOB_NAME}" 97 | trainer_label="paddle-job=${PADDLE_JOB_NAME}" 98 | task_index="" 99 | 100 | export PADDLE_INIT_PSERVERS=$(python /root/k8s_tools.py fetch_ips ${pserver_label} ${PADDLE_INIT_PORT}) 101 | export PADDLE_WORKERS=$(python /root/k8s_tools.py fetch_ips ${trainer_label}) 102 | export PADDLE_TRAINER_IPS=$(python /root/k8s_tools.py fetch_ips ${trainer_label}) 103 | export TF_WORKER_EPS=$(python /root/k8s_tools.py fetch_ips ${trainer_label} ${TF_WORKER_PORT}) 104 | 105 | if [ "${TRAINING_ROLE}" == "TRAINER" ]; then 106 | check_failed_cnt 1 107 | task_index=$(python /root/k8s_tools.py fetch_id ${trainer_label}) 108 | export TF_ROLE=worker 109 | else 110 | task_index=$(python /root/k8s_tools.py fetch_id ${pserver_label}) 111 | export TF_ROLE=ps 112 | fi 113 | 114 | export PADDLE_INIT_TRAINER_ID=${task_index} 115 | export PADDLE_TRAINER_ID=${task_index} 116 | 117 | stdbuf -oL sh -c "${ENTRY}" 118 | check_trainer_ret $? 119 | } 120 | 121 | start_new_trainer() { 122 | # FIXME(Yancey1989): use command-line interface to configure the max failed count 123 | check_failed_cnt ${TRAINERS} 124 | 125 | master_label="paddle-job-master=${PADDLE_JOB_NAME}" 126 | pserver_label="paddle-job-pserver=${PADDLE_JOB_NAME}" 127 | 128 | stdbuf -oL python /root/k8s_tools.py wait_pods_running ${pserver_label} ${PSERVERS} 129 | sleep 5 130 | stdbuf -oL python /root/k8s_tools.py wait_pods_running ${master_label} 1 131 | export MASTER_IP=$(python /root/k8s_tools.py fetch_ips ${master_label}) 132 | export ETCD_IP="$MASTER_IP" 133 | 134 | # NOTE: $TRAINER_PACKAGE may be large, do not copy 135 | export PYTHONPATH=$TRAINER_PACKAGE:$PYTHONPATH 136 | cd $TRAINER_PACKAGE 137 | 138 | stdbuf -oL echo "Starting training job: " $TRAINER_PACKAGE, "num_gradient_servers:" \ 139 | $PADDLE_INIT_NUM_GRADIENT_SERVERS, "version: " $1 140 | 141 | stdbuf -oL sh -c "${ENTRY}" 142 | check_trainer_ret $? 143 | } 144 | 145 | start_trainer() { 146 | # paddle v1 and V2 distributed training does not allow any trainer failed. 
147 |   check_failed_cnt 0
148 | 
149 |   pserver_label="paddle-job-pserver=${PADDLE_JOB_NAME}"
150 |   trainer_label="paddle-job=${PADDLE_JOB_NAME}"
151 | 
152 |   stdbuf -oL python /root/k8s_tools.py wait_pods_running ${pserver_label} ${PSERVERS}
153 |   stdbuf -oL python /root/k8s_tools.py wait_pods_running ${trainer_label} ${TRAINERS}
154 | 
155 |   export PADDLE_INIT_PSERVERS=$(python /root/k8s_tools.py fetch_ips ${pserver_label})
156 |   export PADDLE_INIT_TRAINER_ID=$(python /root/k8s_tools.py fetch_id ${trainer_label})
157 |   stdbuf -oL echo $PADDLE_INIT_TRAINER_ID > /trainer_id
158 |   # FIXME: /trainer_count = PADDLE_INIT_NUM_GRADIENT_SERVERS
159 |   stdbuf -oL echo $PADDLE_INIT_NUM_GRADIENT_SERVERS > /trainer_count
160 | 
161 |   # NOTE: $TRAINER_PACKAGE may be large, do not copy
162 |   export PYTHONPATH=$TRAINER_PACKAGE:$PYTHONPATH
163 |   cd $TRAINER_PACKAGE
164 | 
165 |   stdbuf -oL echo "Starting training job: " $TRAINER_PACKAGE, "num_gradient_servers:" \
166 |     $PADDLE_INIT_NUM_GRADIENT_SERVERS, "trainer_id: " $PADDLE_INIT_TRAINER_ID
167 | 
168 |   export version="v2"
169 |   if [[ -z $1 ]]; then
170 |     echo "didn't specify a version, use the default: " $version
171 |   else
172 |     export version="$1"
173 |     echo "user specified version: " $version
174 |   fi
175 | 
176 |   # FIXME: If we use the new PServer by Golang, add Kubernetes healthz
177 |   # to wait PServer process get ready. Now only sleep 20 seconds.
178 |   sleep 20
179 | 
180 |   case "$version" in
181 |     "v1")
182 |       FILE_COUNT=$(wc -l $TRAIN_LIST | awk '{print $1}')
183 |       if [ $FILE_COUNT -le $PADDLE_INIT_NUM_GRADIENT_SERVERS ]; then
184 |         echo "file count less than trainers"
185 |         check_trainer_ret 0
186 |       fi
187 |       let lines_per_node="$FILE_COUNT / ($PADDLE_INIT_NUM_GRADIENT_SERVERS + 1)"
188 |       echo "splitting file to" $lines_per_node
189 |       cp $TRAIN_LIST /
190 |       cd /
191 |       split -l $lines_per_node -d -a 3 $TRAIN_LIST train.list
192 |       CURRENT_LIST=$(printf "train.list%03d" $PADDLE_INIT_TRAINER_ID)
193 |       # always use /train.list for paddle v1 for each node.
194 |       echo "File for current node ${CURRENT_LIST}"
195 |       sleep 10
196 |       cp $CURRENT_LIST train.list
197 | 
198 |       cd $TRAINER_PACKAGE
199 | 
200 |       stdbuf -oL paddle train \
201 |         --port=$PADDLE_INIT_PORT \
202 |         --nics=$PADDLE_INIT_NICS \
203 |         --ports_num=$PADDLE_INIT_PORTS_NUM \
204 |         --ports_num_for_sparse=$PADDLE_INIT_PORTS_NUM_FOR_SPARSE \
205 |         --num_passes=$PADDLE_INIT_NUM_PASSES \
206 |         --trainer_count=$PADDLE_INIT_TRAINER_COUNT \
207 |         --saving_period=1 \
208 |         --log_period=20 \
209 |         --local=0 \
210 |         --rdma_tcp=tcp \
211 |         --config=$TOPOLOGY \
212 |         --use_gpu=$PADDLE_INIT_USE_GPU \
213 |         --trainer_id=$PADDLE_INIT_TRAINER_ID \
214 |         --save_dir=$OUTPUT \
215 |         --pservers=$PADDLE_INIT_PSERVERS \
216 |         --num_gradient_servers=$PADDLE_INIT_NUM_GRADIENT_SERVERS
217 |       # paddle v1 API does not allow any trainer failed.
218 |       check_trainer_ret $?
219 |       ;;
220 |     "v2")
221 |       stdbuf -oL sh -c "${ENTRY}"
222 |       # paddle v2 API does not allow any trainer failed.
223 |       check_trainer_ret $?
224 |       ;;
225 |     *)
226 |       ;;
227 |   esac
228 | }
229 | 
230 | usage() {
231 |     echo "usage: paddle_k8s [command]:"
232 |     echo "    start_trainer [v1|v2]    Start a trainer process with v1 or v2 API"
233 |     echo "    start_pserver            Start a pserver process"
234 |     echo "    start_new_pserver        Start a new pserver process"
235 |     echo "    start_new_trainer        Start a new trainer process"
236 | }
237 | 
238 | case "$1" in
239 |   start_pserver)
240 |     start_pserver
241 |     ;;
242 |   start_trainer)
243 |     start_trainer $2
244 |     ;;
245 |   start_new_trainer)
246 |     start_new_trainer
247 |     ;;
248 |   start_new_pserver)
249 |     start_new_pserver
250 |     ;;
251 |   start_master)
252 |     start_master
253 |     ;;
254 |   start_fluid)
255 |     start_fluid_process
256 |     ;;
257 |   --help)
258 |     usage
259 |     ;;
260 |   *)
261 |     usage
262 |     ;;
263 | esac
264 | 
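The entry points above are glued together by environment variables that the k8s yaml templates inject into each pod. A hypothetical invocation of the `start_fluid` path (all values are placeholders; in the real deployment they come from fleet-ctr.yaml.template, not from a shell by hand):

```bash
# read by start_fluid_process and by k8s_tools.py
NAMESPACE=default POD_NAME=fleet-ctr-trainer-0 \
PADDLE_JOB_NAME=fleet-ctr PADDLE_TRAINING_ROLE=TRAINER \
PADDLE_PSERVERS_NUM=2 PADDLE_TRAINERS_NUM=2 PADDLE_PORT=30239 \
ENTRY="python3 train_with_mlflow.py slot.conf" \
./paddle_k8s start_fluid
```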
--------------------------------------------------------------------------------
/fleet-ctr/process_rawmodel.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | #     http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | 
15 | import os
16 | import sys
17 | 
18 | os.system('python3 /save_program/save_program.py slot.conf . ' + sys.argv[2])
19 | os.system('python3 /save_program/dumper.py --model_path ' + sys.argv[1] + ' --output_data_path ctr_cube/')
20 | os.system('python3 /save_program/replace_params.py --model_dir ' + sys.argv[1] + ' --inference_only_model_dir inference_only')
21 | os.system('tar czf ctr_model.tar.gz inference_only/')
22 | 
--------------------------------------------------------------------------------
/fleet-ctr/pserver0.sh:
--------------------------------------------------------------------------------
1 | GLOG_v=0 \
2 | GLOG_logtostderr=1 \
3 | TOPOLOGY= \
4 | TRAINER_PACKAGE=/share \
5 | PADDLE_INIT_NICS=eth2 \
6 | POD_IP=${SERVER0} \
7 | POD_NAME=SERVER0 \
8 | PADDLE_CURRENT_IP=${SERVER0} \
9 | PADDLE_JOB_NAME=fleet-ctr \
10 | PADDLE_IS_LOCAL=0 \
11 | PADDLE_TRAINERS_NUM=2 \
12 | PADDLE_PSERVERS_NUM=2 \
13 | FLAGS_rpc_deadline=36000000 \
14 | PADDLE_PORT=30239 \
15 | TRAINING_ROLE=PSERVER \
16 | PADDLE_PSERVERS_IP_PORT_LIST=${SERVER0}:30239,${SERVER1}:30241 \
17 | python3 model_with_sparse_feature.py slot.conf
18 | 
--------------------------------------------------------------------------------
/fleet-ctr/pserver1.sh:
--------------------------------------------------------------------------------
1 | GLOG_v=0 \
2 | GLOG_logtostderr=1 \
3 | TOPOLOGY= \
4 | TRAINER_PACKAGE=/share \
5 | PADDLE_INIT_NICS=eth2 \
6 | POD_IP=${SERVER1} \
7 | POD_NAME=SERVER1 \
8 | PADDLE_CURRENT_IP=${SERVER1} \
9 | PADDLE_JOB_NAME=fleet-ctr \
10 | PADDLE_IS_LOCAL=0 \
11 | PADDLE_TRAINERS_NUM=2 \
12 | PADDLE_PSERVERS_NUM=2 \
13 | FLAGS_rpc_deadline=36000000 \
14 | PADDLE_PORT=30241 \
15 | TRAINING_ROLE=PSERVER \
16 | PADDLE_PSERVERS_IP_PORT_LIST=${SERVER0}:30239,${SERVER1}:30241 \
17 | python3 model_with_sparse_feature.py slot.conf
18 | 
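Taken together with ip_config.sh earlier in this listing, these four scripts sketch a bare-metal two-machine deployment. A hypothetical launch sequence, assuming the repo is present on both hosts (the log file names are placeholders):

```bash
source ip_config.sh                      # exports SERVER0 and SERVER1
# on the ${SERVER0} host:
bash pserver0.sh > pserver0.log 2>&1 &
bash trainer0.sh > trainer0.log 2>&1 &
# on the ${SERVER1} host:
bash pserver1.sh > pserver1.log 2>&1 &
bash trainer1.sh > trainer1.log 2>&1 &
```

run_distributed.sh, next in this listing, is the single-machine analogue of the same topology.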
-------------------------------------------------------------------------------- /fleet-ctr/run_distributed.sh: -------------------------------------------------------------------------------- 1 | unset http_proxy 2 | unset https_proxy 3 | 4 | PADDLE_PSERVERS_IP_PORT_LIST=127.0.0.1:6017,127.0.0.1:6018 \ 5 | PADDLE_TRAINERS_NUM=2 \ 6 | TRAINING_ROLE=PSERVER \ 7 | POD_IP=127.0.0.1 \ 8 | PADDLE_PORT=6017 \ 9 | python model_with_sparse_feature.py slot.conf 2> server0.elog > server0.stdlog & 10 | 11 | sleep 3 12 | 13 | PADDLE_PSERVERS_IP_PORT_LIST=127.0.0.1:6017,127.0.0.1:6018 \ 14 | PADDLE_TRAINERS_NUM=2 \ 15 | TRAINING_ROLE=PSERVER \ 16 | POD_IP=127.0.0.1 \ 17 | PADDLE_PORT=6018 \ 18 | python model_with_sparse_feature.py slot.conf 2> server1.elog > server1.stdlog & 19 | 20 | sleep 3 21 | 22 | PADDLE_PSERVERS_IP_PORT_LIST=127.0.0.1:6017,127.0.0.1:6018 \ 23 | PADDLE_TRAINERS_NUM=2 \ 24 | TRAINING_ROLE=TRAINER \ 25 | PADDLE_TRAINER_ID=0 \ 26 | python train_with_mlflow.py slot.conf 2> worker0.elog > worker0.stdlog & 27 | 28 | sleep 3 29 | 30 | PADDLE_PSERVERS_IP_PORT_LIST=127.0.0.1:6017,127.0.0.1:6018 \ 31 | PADDLE_TRAINERS_NUM=2 \ 32 | TRAINING_ROLE=TRAINER \ 33 | PADDLE_TRAINER_ID=1 \ 34 | python model_with_sparse_feature.py slot.conf 2> worker1.elog > worker1.stdlog & 35 | -------------------------------------------------------------------------------- /fleet-ctr/train_with_mlflow.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | 
16 | import math
17 | import sys
18 | import os
19 | import paddle.fluid as fluid
20 | import mlflow
21 | from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
22 | from paddle.fluid.incubate.fleet.base import role_maker
23 | from paddle.fluid.transpiler.distribute_transpiler import DistributeTranspilerConfig
24 | from paddle.fluid.contrib.utils.hdfs_utils import HDFSClient
25 | from nets import ctr_dnn_model
26 | import subprocess as sp
27 | import time
28 | import psutil
29 | import matplotlib; matplotlib.use('Agg')  # non-interactive backend, since trainer pods have no display
30 | import matplotlib.pyplot as plt
31 | from collections import OrderedDict
32 | from datetime import datetime, timedelta
33 | import json
34 | import logging
35 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
36 | logger = logging.getLogger(__name__)
37 | 
38 | try:
39 |     file_server_addr = os.environ["FILE_SERVER_SERVICE_HOST"] + ":" + os.environ["FILE_SERVER_SERVICE_PORT"]
40 |     slot_res = os.system('wget ' + file_server_addr + '/slot.conf')
41 |     if slot_res != 0:
42 |         raise ValueError('No Slot Conf Found.')
43 |     feature_names = []
44 |     with open(sys.argv[1]) as fin:
45 |         for line in fin:
46 |             feature_names.append(line.strip())
47 |     logger.info("Slot:" + str(feature_names))
48 | except ValueError as e:
49 |     print(e)
50 |     sys.exit(1)
51 | 
52 | sparse_input_ids = [
53 |     fluid.layers.data(name=name, shape=[1], lod_level=1, dtype='int64')
54 |     for name in feature_names]
55 | 
56 | label = fluid.layers.data(
57 |     name='label', shape=[1], dtype='int64')
58 | 
59 | sparse_feature_dim = int(os.environ['SPARSE_DIM'])
60 | embedding_size = 9
61 | epochs = 1
62 | dataset_prefix = os.environ['DATASET_PATH']
63 | 
64 | avg_cost, auc_var, _ = ctr_dnn_model(embedding_size, sparse_input_ids, label, sparse_feature_dim)
65 | 
66 | start_date_hr = datetime.strptime(os.environ['START_DATE_HR'], '%Y%m%d/%H')
67 | end_date_hr = datetime.strptime(os.environ['END_DATE_HR'], '%Y%m%d/%H')
68 | current_date_hr = start_date_hr
69 | hdfs_address = os.environ['HDFS_ADDRESS']
70 | hdfs_ugi = os.environ['HDFS_UGI']
71 | start_run_flag = True
72 | # role = role_maker.UserDefinedRoleMaker(  # dead code: immediately overwritten below,
73 | #     current_id=int(os.getenv("CURRENT_ID")),  # and int(None) crashes when CURRENT_ID
74 | #     role=role_maker.Role.WORKER  # or ENDPOINTS are not set in the pod environment
75 | #     if os.getenv("TRAINING_ROLE") == "TRAINER" else role_maker.Role.SERVER,
76 | #     worker_num=int(os.getenv("PADDLE_TRAINERS_NUM")),
77 | #     server_endpoints=os.getenv("ENDPOINTS").split(","))
78 | 
79 | role = role_maker.PaddleCloudRoleMaker()
80 | fleet.init(role)
81 | 
82 | config = DistributeTranspilerConfig()
83 | config.sync_mode = False
84 | optimizer = fluid.optimizer.SGD(0.0001)
85 | optimizer = fleet.distributed_optimizer(optimizer, config)
86 | optimizer.minimize(avg_cost)
87 | DATE_TIME_STRING_FORMAT = '%Y%m%d/%H'
88 | if fleet.is_server():
89 |     fleet.init_server()
90 |     fleet.run_server()
91 | elif fleet.is_worker():
92 |     place = fluid.CPUPlace()
93 |     exe = fluid.Executor(place)
94 |     fleet.init_worker()
95 |     exe.run(fluid.default_startup_program())
96 |     logger.info("startup program done.")
97 |     if fleet.is_first_worker():
98 |         plt.figure()
99 |         y_auc = []
100 |         y_cpu = []
101 |         y_memory = []
102 |         y_network_sent = []
103 |         y_network_recv = []
104 |         x_list = []
105 |         x = 0
106 |         last_net_sent = psutil.net_io_counters().bytes_sent
107 |         last_net_recv = psutil.net_io_counters().bytes_recv
108 | 
109 |     while True:
110 | 
111 |         #hadoop fs -D hadoop.job.ugi=root, -D fs.default.name=hdfs://192.168.48.87:9000 -ls /
112 |         current_date_hr_exist = os.system('hadoop fs -D hadoop.job.ugi=' + 
hdfs_ugi + ' -D fs.defaultFS=' + hdfs_address + 113 | ' -ls ' + os.path.join(dataset_prefix, current_date_hr.strftime(DATE_TIME_STRING_FORMAT), 'donefile') +' >/dev/null 2>&1') 114 | 115 | if current_date_hr_exist: 116 | pass 117 | else: 118 | """ 119 | output = sp.check_output("hdfs dfs -ls hdfs://192.168.48.189:9000/train_data | awk '{if(NR>1) print $8}'", 120 | shell=True) 121 | print(output) 122 | train_filelist = ["{}".format(f) for f in output.decode('ascii').strip().split('\n')] 123 | print(train_filelist) 124 | for i in range(len(train_filelist)): 125 | train_filelist[i] = train_filelist[i].split("/")[-1] 126 | train_filelist.sort(reverse=True) 127 | print(train_filelist) 128 | """ 129 | 130 | dataset = fluid.DatasetFactory().create_dataset() 131 | dataset.set_use_var(sparse_input_ids + [label]) 132 | pipe_command = "python criteo_dataset.py {}".format(sys.argv[1]) 133 | dataset.set_pipe_command(pipe_command) 134 | dataset.set_batch_size(32) 135 | dataset.set_hdfs_config(hdfs_address, hdfs_ugi) 136 | dataset.set_thread(10) 137 | exe = fluid.Executor(fluid.CPUPlace()) 138 | input_folder = "hdfs:" 139 | output = sp.check_output( "hadoop fs -D hadoop.job.ugi=" + hdfs_ugi 140 | + " -D fs.defaultFS=" + hdfs_address +" -ls " + os.path.join(dataset_prefix, str(current_date_hr.strftime(DATE_TIME_STRING_FORMAT))) + "/ | awk '{if(NR>1) print $8}'", shell=True) 141 | train_filelist = ["{}{}".format(input_folder, f) for f in output.decode('ascii').strip().split('\n')] 142 | train_filelist.remove('hdfs:' + os.path.join(dataset_prefix, str(current_date_hr.strftime(DATE_TIME_STRING_FORMAT)), 'donefile')) 143 | #pending 144 | step = 0 145 | 146 | config = DistributeTranspilerConfig() 147 | config.sync_mode = False 148 | 149 | # if is worker 150 | fleet_filelist = fleet.split_files(train_filelist) 151 | print(fleet_filelist) 152 | logger.info( "Training: " + current_date_hr.strftime(DATE_TIME_STRING_FORMAT)) 153 | dataset.set_filelist(fleet_filelist) 154 | if fleet.is_first_worker(): 155 | # experiment_id = mlflow.create_experiment("fleet-ctr") 156 | if start_run_flag: 157 | mlflow.start_run() 158 | mlflow.log_param("sparse_feature_dim", sparse_feature_dim) 159 | mlflow.log_param("embedding_size", embedding_size) 160 | mlflow.log_param("selected_slots", feature_names) 161 | mlflow.log_param("epochs_num", epochs) 162 | mlflow.log_param("physical_cpu_counts", psutil.cpu_count(logical=False)) 163 | mlflow.log_param("logical_cpu_counts", psutil.cpu_count()) 164 | mlflow.log_param("total_memory/GB", 165 | round(psutil.virtual_memory().total / (1024.0 * 1024.0 * 1024.0), 3)) 166 | start_run_flag = False 167 | 168 | 169 | class FH(fluid.executor.FetchHandler): 170 | def handler(self, fetch_target_vars): 171 | auc = fetch_target_vars[0] 172 | print("test metric auc: ", fetch_target_vars) 173 | global last_net_sent 174 | global last_net_recv 175 | global y_auc 176 | global y_cpu 177 | global y_memory 178 | global y_network_sent 179 | global y_network_recv 180 | global x 181 | mlflow.log_metric("network_bytes_sent_speed", 182 | psutil.net_io_counters().bytes_sent - last_net_sent) 183 | mlflow.log_metric("network_bytes_recv_speed", 184 | psutil.net_io_counters().bytes_recv - last_net_recv) 185 | y_network_sent.append((psutil.net_io_counters().bytes_sent - last_net_sent)/10) 186 | y_network_recv.append((psutil.net_io_counters().bytes_recv - last_net_recv)/10) 187 | last_net_sent = psutil.net_io_counters().bytes_sent 188 | last_net_recv = psutil.net_io_counters().bytes_recv 189 | 
mlflow.log_metric("cpu_usage_total", round((psutil.cpu_percent(interval=0) / 100), 3)) 190 | y_cpu.append(round((psutil.cpu_percent(interval=0) / 100), 3)) 191 | mlflow.log_metric("free_memory/GB", 192 | round(psutil.virtual_memory().free / (1024.0 * 1024.0 * 1024.0), 3)) 193 | mlflow.log_metric("memory_usage", 194 | round((psutil.virtual_memory().total - psutil.virtual_memory().free) 195 | / float(psutil.virtual_memory().total), 3)) 196 | y_memory.append(round((psutil.virtual_memory().total - psutil.virtual_memory().free) 197 | / float(psutil.virtual_memory().total), 3)) 198 | if auc == None: 199 | mlflow.log_metric("auc", 0.5) 200 | auc = [0.5] 201 | else: 202 | mlflow.log_metric("auc", auc[0]) 203 | y_auc.append(auc) 204 | x_list.append(x) 205 | 206 | if x >= 120: # in the future 120 will be replaced by 24*60*60 which means one day length 207 | y_auc.pop(0) 208 | y_cpu.pop(0) 209 | y_memory.pop(0) 210 | y_network_recv.pop(0) 211 | y_network_sent.pop(0) 212 | x_list.pop(0) 213 | x += 10 214 | if x % 60 == 0 and x != 0: 215 | plt.subplot(221) 216 | plt.plot(x_list, y_auc) 217 | plt.title('auc') 218 | plt.grid(True) 219 | 220 | plt.subplot(222) 221 | plt.plot(x_list, y_cpu) 222 | plt.title('cpu_usage') 223 | plt.grid(True) 224 | 225 | plt.subplot(223) 226 | plt.plot(x_list, y_memory) 227 | plt.title('memory_usage') 228 | plt.grid(True) 229 | 230 | plt.subplot(224) 231 | plt.plot(x_list, y_network_sent, label="network_sent_speed") 232 | plt.plot(x_list, y_network_recv, label="network_recv_speed") 233 | plt.title('network_speed') 234 | plt.grid(True) 235 | 236 | plt.subplots_adjust(top=0.9, bottom=0.2, hspace=0.4, wspace=0.35) 237 | plt.legend(bbox_to_anchor=(0, -0.6), loc='lower left', borderaxespad=0.) 238 | temp_file_name = "dashboard_" + time.strftime('%Y-%m-%d_%H:%M:%S', 239 | time.localtime(time.time())) + ".png" 240 | plt.savefig(temp_file_name, dpi=250) 241 | sys.stdout.flush() 242 | plt.clf() 243 | os.system("rm -f "+str(mlflow.get_artifact_uri().split(":")[1]) + "/*.png") 244 | mlflow.log_artifact(local_path=temp_file_name) 245 | sys.stdout.flush() 246 | os.system("rm -f ./*.png") 247 | sys.stdout.flush() 248 | logger.info(str(mlflow.get_artifact_uri().split(":")[1])) 249 | sys.stdout.flush() 250 | 251 | os.system("hadoop fs -D hadoop.job.ugi=" + hdfs_ugi + " -D fs.defaultFS=" + hdfs_address 252 | + " -rm " + os.path.join(dataset_prefix, str(current_date_hr.strftime(DATE_TIME_STRING_FORMAT))+ "_model/donefile") + " >/dev/null 2>&1") 253 | for i in range(epochs): 254 | exe.train_from_dataset( 255 | program=fluid.default_main_program(), 256 | dataset=dataset, 257 | fetch_handler=FH([auc_var.name], 10, True), 258 | # fetch_list=[auc_var], 259 | # fetch_info=["auc"], 260 | debug=False) 261 | path = "./saved_models/" + current_date_hr.strftime(DATE_TIME_STRING_FORMAT) + "_model/" 262 | logger.info("save inference program: " + path) 263 | if len(y_auc) <=1: 264 | logger.info("Current AUC: " + str(y_auc[-1])) 265 | else: 266 | logger.info("Dataset is too small, cannot get AUC.") 267 | fetch_list = fleet.save_inference_model(exe, 268 | path, 269 | [x.name for x in sparse_input_ids] + [label.name], 270 | [auc_var]) 271 | os.system("hadoop fs -D hadoop.job.ugi=" + hdfs_ugi + " -D fs.defaultFS=" + hdfs_address 272 | + " -put -f "+ path +" " + os.path.join(dataset_prefix, current_date_hr.strftime(DATE_TIME_STRING_FORMAT).split("/")[0]) + " >/dev/null 2>&1") 273 | os.system('touch donefile') 274 | os.system("hadoop fs -D hadoop.job.ugi=" + hdfs_ugi + " -D fs.defaultFS=" + hdfs_address 275 | + " 
--------------------------------------------------------------------------------
/fleet-ctr/trainer0.sh:
--------------------------------------------------------------------------------
1 | GLOG_v=0 \
2 | GLOG_logtostderr=1 \
3 | TOPOLOGY= \
4 | TRAINER_PACKAGE=/share \
5 | PADDLE_INIT_NICS=eth2 \
6 | POD_IP=127.0.0.1 \
7 | POD_NAME=TRAINER0 \
8 | PADDLE_CURRENT_IP=127.0.0.1 \
9 | PADDLE_JOB_NAME=fleet-ctr \
10 | PADDLE_IS_LOCAL=0 \
11 | PADDLE_TRAINERS_NUM=2 \
12 | PADDLE_PSERVERS_NUM=2 \
13 | FLAGS_rpc_deadline=36000000 \
14 | TRAINING_ROLE=TRAINER \
15 | PADDLE_TRAINER_ID=0 \
16 | PADDLE_PSERVERS_IP_PORT_LIST=${SERVER0}:30239,${SERVER1}:30241 \
17 | python3 train_with_mlflow.py slot.conf
18 | 
--------------------------------------------------------------------------------
/fleet-ctr/trainer1.sh:
--------------------------------------------------------------------------------
1 | GLOG_v=0 \
2 | GLOG_logtostderr=1 \
3 | TOPOLOGY= \
4 | TRAINER_PACKAGE=/share \
5 | PADDLE_INIT_NICS=eth2 \
6 | POD_IP=127.0.0.1 \
7 | POD_NAME=TRAINER1 \
8 | PADDLE_CURRENT_IP=127.0.0.1 \
9 | PADDLE_JOB_NAME=fleet-ctr \
10 | PADDLE_IS_LOCAL=0 \
11 | PADDLE_TRAINERS_NUM=2 \
12 | PADDLE_PSERVERS_NUM=2 \
13 | FLAGS_rpc_deadline=36000000 \
14 | TRAINING_ROLE=TRAINER \
15 | PADDLE_TRAINER_ID=1 \
16 | PADDLE_PSERVERS_IP_PORT_LIST=${SERVER0}:30239,${SERVER1}:30241 \
17 | python3 model_with_sparse_feature.py slot.conf
18 | 
--------------------------------------------------------------------------------
/huawei_k8s.md:
--------------------------------------------------------------------------------
1 | # Setting up a k8s Cluster on Huawei Cloud
2 | 
3 | This document describes how to set up a k8s cluster on Huawei Cloud so that ElasticCTR can be deployed on it.
4 | 
5 | * [1. Workflow Overview](#head1)
6 | * [2. Purchase a Cluster](#head2)
7 | * [3. Define the Load Balancer](#head3)
8 | * [4. Upload Images](#head4)
9 | 
10 | ## 1. Workflow Overview
11 | 
12 | Setting up a k8s cluster on Huawei Cloud consists of three main steps:
13 | 
14 | 1. Purchase a cluster
15 | 
16 | First, log in to the Huawei Cloud CCE console and purchase a cluster with a suitable configuration.
17 | 
18 | 2. Define the load balancer
19 | 
20 | Because Huawei Cloud places constraints on how load balancers may be used, the Service definitions in the deployment yaml files need to be adapted to them.
21 | 
22 | 3. Upload images
23 | 
24 | Huawei Cloud provides an image repository service. We can upload the required images to the repository, which makes subsequent image pulls faster.
25 | 
26 | Each step is described in detail below.
27 | 
28 | ## 2. Purchase a Cluster
29 | 
30 | Log in to the Huawei Cloud CCE console and purchase a k8s cluster. The concrete steps are:
31 | 
32 | 1. Open the Huawei Cloud CCE console. From the console dashboard, click "Buy Kubernetes Cluster", and set the cluster information and overall configuration under the service selection tab of the "Buy Hybrid Cluster" page.
33 | ![huawei_cce_choose_service.png](https://github.com/suoych/WebChat/raw/master/huawei_cce_choose_service.png)
34 | 
35 | 2. Once the cluster information is set, click "Next: Create Node" and configure the nodes under the "Create Node" tab.
36 | ![huawei_cce_create_node.png](https://github.com/suoych/WebChat/raw/master/huawei_cce_create_node.png)
37 | ![huawei_cce_create_node_configure.png](https://github.com/suoych/WebChat/raw/master/huawei_cce_create_node_configure.png)
38 | ![huawei_cce_create_node_configure_1.png](https://github.com/suoych/WebChat/raw/master/huawei_cce_create_node_configure_1.png)
39 | 
40 | 3. After the nodes are configured, select volcano among the advanced add-ons under the "Install Add-on" tab, then confirm the configuration and complete the payment.
41 | ![huawei_cce_plugin.png](https://github.com/suoych/WebChat/raw/master/huawei_cce_plugin.png)
42 | 
43 | ## 3. Define the Load Balancer
44 | 
45 | Since Huawei Cloud places restrictions on load balancers, it is recommended to consult [Huawei Cloud load balancing](https://support.huaweicloud.com/usermanual-cce/cce_01_0014.html) and modify the metadata of the Service definitions in fileserver.yaml.template and pdserving.yaml, as shown in the figure below (a sketch of such a metadata block is given at the end of this document).
46 | ![huawei_cce_load_balancer.png](https://github.com/suoych/WebChat/raw/master/huawei_cce_load_balancer.png)
47 | 
48 | 
49 | ## 4. Upload Images
50 | 
51 | Note: this step is optional; if your budget is sufficient you can skip it.
52 | Click the image repository, log in to the repository as prompted, and copy the given commands to a node terminal for execution, as shown below:
53 | ![huawei_cce_image_repo.png](https://github.com/suoych/WebChat/raw/master/huawei_cce_image_repo.png)
54 | 
55 | Then, following the prompts, upload the images required by elastic ctr to the repository
56 | ![huawei_cce_image_upload.png](https://github.com/suoych/WebChat/raw/master/huawei_cce_image_upload.png)
57 | 
58 | After the operation completes, the result looks like this:
59 | ![huawei_cce_image_upload_success.png](https://github.com/suoych/WebChat/raw/master/huawei_cce_image_upload_success.png)
60 | 
61 | At this point the k8s setup on Huawei Cloud is complete; users can set up HDFS and deploy elastic ctr on their own.
62 | 
63 | 
64 | 
65 | 
66 | 
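For reference for section 3: on Huawei Cloud CCE, a LoadBalancer Service is bound to an ELB instance through annotations in the Service `metadata`. The sketch below is an assumption based on the Huawei Cloud documentation linked above — the annotation keys `kubernetes.io/elb.class` and `kubernetes.io/elb.id` come from those docs, and the name, port, and id values are placeholders to replace with your own; verify the exact annotation set against the linked user manual, since it differs between shared and dedicated load balancers.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: fileserver                       # matches the Service in fileserver.yaml.template
  annotations:
    kubernetes.io/elb.class: union       # shared load balancer (per Huawei CCE docs)
    kubernetes.io/elb.id: 3c7caa5a-xxxx  # placeholder: the ID of your ELB instance
spec:
  type: LoadBalancer
  ports:
  - port: 8080                           # placeholder port
    targetPort: 8080
  selector:
    app: fileserver
```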
--------------------------------------------------------------------------------
/save_program/dumper.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | #     http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | 
16 | import argparse
17 | import logging
18 | import struct
19 | import time
20 | import datetime
21 | import json
22 | from collections import OrderedDict
23 | import numpy as np
24 | import os
25 | import paddle
26 | import paddle.fluid as fluid
27 | import math
28 | 
29 | 
30 | logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s')
31 | logger = logging.getLogger("fluid")
32 | logger.setLevel(logging.INFO)
33 | 
34 | NOW_TIMESTAMP = time.time()
35 | NOW_DATETIME = datetime.datetime.now().strftime("%Y%m%d")
36 | 
37 | sparse_feature_dim = 100000001
38 | embedding_size = 9
39 | 
40 | 
41 | def parse_args():
42 |     parser = argparse.ArgumentParser(description="PaddlePaddle DeepFM example")
43 |     parser.add_argument(
44 |         '--model_path',
45 |         type=str,
46 |         required=True,
47 |         help="The path of model parameters file")
48 |     parser.add_argument(
49 |         '--output_data_path',
50 |         type=str,
51 |         default="/save_program/ctr_cube/",
52 |         help="The path of dump output")
53 | 
54 |     return parser.parse_args()
55 | 
56 | 
57 | def write_donefile(base_datafile, base_donefile):
58 |     dict = OrderedDict()
59 |     if not os.access(os.path.dirname(base_donefile), os.F_OK):
60 |         os.makedirs(os.path.dirname(base_donefile))
61 |     dict["id"] = str(int(NOW_TIMESTAMP))
62 |     dict["key"] = dict["id"]
63 |     dict["input"] = os.path.dirname(base_datafile)
64 |     with open(base_donefile, "a") as f:
65 |         jsonstr = json.dumps(dict) + '\n'
66 |         f.write(jsonstr)
67 | 
68 | 
69 | def dump():
70 |     args = parse_args()
71 |     feature_names = []
72 | 
73 |     output_data_path = os.path.abspath(args.output_data_path)
74 |     base_datadir = output_data_path + "/" + NOW_DATETIME + "/base"
75 |     try:
76 |         os.makedirs(base_datadir)
77 |     except:
78 |         print('Dir already exists, skipping.')
79 |     base_datafile = output_data_path + "/" + NOW_DATETIME + "/base/feature"
80 |     write_base_datafile = "/output/ctr_cube/" + NOW_DATETIME + "/base/feature"
81 |     base_donefile = output_data_path + "/" + "donefile/" + "base.txt"
82 |     os.system('/save_program/parameter_to_seqfile ' + args.model_path + '/SparseFeatFactors ' + base_datafile)
83 |     write_donefile(write_base_datafile + "0", base_donefile)
84 | 
85 | 
86 | if __name__ == '__main__':
87 |     dump()
88 | 
89 | 
--------------------------------------------------------------------------------
/save_program/listening.py:
--------------------------------------------------------------------------------
1 | import time
2 | import os
3 | import json
4 | 
5 | previous_last_line = ""
6 | while True:
7 |     try:
8 |         with open("/share/saved_models/model_donefile", "r") as f:
9 |             lines = f.readlines()
10 |             current_line = lines[-1]
11 |             path = json.loads(current_line)['path']
12 | 
13 |             if current_line != previous_last_line:
14 |                 os.system("python3 save_program.py slot.conf .")
15 |                 os.system("python dumper_multiprocessing.py")
16 |                 os.system("python replace_params.py " +
17 |                           "--model_dir " + path +
18 |                           " --inference_only_model_dir " + path + "../inference_only")
19 |             else:
20 |                 pass
21 |             previous_last_line = current_line
22 |     except Exception as err:
23 |         pass
24 |     time.sleep(10)  # poll the donefile periodically instead of spinning
25 | 
--------------------------------------------------------------------------------
/save_program/nets.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | from __future__ import print_function 16 | 17 | import argparse 18 | import logging 19 | import os 20 | import time 21 | import math 22 | import numpy as np 23 | import six 24 | import paddle 25 | import paddle.fluid as fluid 26 | import paddle.fluid.proto.framework_pb2 as framework_pb2 27 | import paddle.fluid.core as core 28 | 29 | from multiprocessing import cpu_count 30 | 31 | def ctr_dnn_model(embedding_size, sparse_input_ids, label, 32 | sparse_feature_dim): 33 | def embedding_layer(input): 34 | emb = fluid.layers.embedding( 35 | input=input, 36 | is_sparse=True, 37 | is_distributed=False, 38 | size=[sparse_feature_dim, embedding_size], 39 | param_attr=fluid.ParamAttr(name="SparseFeatFactors", 40 | initializer=fluid.initializer.Uniform())) 41 | emb_sum = fluid.layers.sequence_pool(input=emb, pool_type='sum') 42 | return emb_sum 43 | 44 | emb_sums = list(map(embedding_layer, sparse_input_ids)) 45 | concated = fluid.layers.concat(emb_sums, axis=1) 46 | fc1 = fluid.layers.fc(input=concated, size=400, act='relu', 47 | param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal( 48 | scale=1 / math.sqrt(concated.shape[1])))) 49 | fc2 = fluid.layers.fc(input=fc1, size=400, act='relu', 50 | param_attr=fluid.ParamAttr( 51 | initializer=fluid.initializer.Normal( 52 | scale=1 / math.sqrt(fc1.shape[1])))) 53 | fc3 = fluid.layers.fc(input=fc2, size=400, act='relu', 54 | param_attr=fluid.ParamAttr( 55 | initializer=fluid.initializer.Normal( 56 | scale=1 / math.sqrt(fc2.shape[1])))) 57 | 58 | predict = fluid.layers.fc(input=fc3, size=2, act='softmax', 59 | param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal( 60 | scale=1 / math.sqrt(fc3.shape[1])))) 61 | 62 | cost = fluid.layers.cross_entropy(input=predict, label=label) 63 | avg_cost = fluid.layers.reduce_sum(cost) 64 | accuracy = fluid.layers.accuracy(input=predict, label=label) 65 | auc_var, batch_auc_var, auc_states = \ 66 | fluid.layers.auc(input=predict, label=label, num_thresholds=2 ** 12, slide_steps=20) 67 | return avg_cost, auc_var, batch_auc_var 68 | -------------------------------------------------------------------------------- /save_program/parameter_to_seqfile: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PaddlePaddle/ElasticCTR/c493c6b250eef001028b5a7264b61f679c79fa6d/save_program/parameter_to_seqfile -------------------------------------------------------------------------------- /save_program/replace_params.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 
5 | # You may obtain a copy of the License at
6 | #
7 | # http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | 
15 | import os
16 | import shutil
17 | import argparse
18 | 
19 | def parse_args():
20 |     parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
21 |     parser.add_argument(
22 |         '--model_dir',
23 |         type=str,
24 |         required=True,
25 |         help='The trained model path (e.g. models/pass-0)')
26 |     parser.add_argument(
27 |         '--inference_only_model_dir',
28 |         type=str,
29 |         required=True,
30 |         help='The inference-only model path (e.g. models/inference_only)')
31 |     return parser.parse_args()
32 | 
33 | def replace_params():
34 |     # Copy the freshly trained dense parameters over the inference-only
35 |     # model, leaving its pruned __model__ program file untouched.
36 |     args = parse_args()
37 |     inference_only_model_dir = args.inference_only_model_dir
38 |     model_dir = args.model_dir
39 | 
40 |     files = os.listdir(inference_only_model_dir)
41 | 
42 |     for f in files:
43 |         if "__model__" in f:
44 |             continue
45 |         dst_file = inference_only_model_dir + "/" + f
46 |         src_file = model_dir + "/" + f
47 |         print("copying %s to %s" % (src_file, dst_file))
48 |         shutil.copyfile(src_file, dst_file)
49 | 
50 | if __name__ == '__main__':
51 |     replace_params()
52 |     print("Done")
53 | 
--------------------------------------------------------------------------------
/save_program/save_program.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License. 
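# --------------------------------------------------------------------------
# Overview (added note): this script rebuilds the CTR-DNN network, runs a
# brief train_from_dataset pass over a single HDFS file to materialize the
# parameters, saves an inference-only program under <argv[2]>/inference_only,
# and finally prunes the lookup_table ops and deletes the on-disk embedding
# ("SparseFeatFactors") so the sparse lookups can be served externally. It
# requires the environment variables SPARSE_DIM, DATASET_PATH, HDFS_ADDRESS
# and HDFS_UGI. A hypothetical invocation (all argument values illustrative):
#
#   python save_program.py slot.conf /share/saved_models 20200401/00
#
# where argv[1] lists the slot names, argv[2] is the output directory and
# argv[3] is the date-hour partition under DATASET_PATH.
# --------------------------------------------------------------------------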
15 | 
16 | import math
17 | import sys
18 | import os
19 | import paddle.fluid as fluid
20 | import google.protobuf.text_format as text_format
21 | import paddle.fluid.proto.framework_pb2 as framework_pb2
22 | import paddle.fluid.core as core
23 | import six
24 | import subprocess as sp
25 | 
26 | inference_path = sys.argv[2] + '/inference_only'
27 | 
28 | # Read the slot (feature) names from the conf file, one name per line.
29 | feature_names = []
30 | with open(sys.argv[1]) as fin:
31 |     for line in fin:
32 |         feature_names.append(line.strip())
33 | 
34 | print(feature_names)
35 | 
36 | sparse_input_ids = [
37 |     fluid.layers.data(name=name, shape=[1], lod_level=1, dtype='int64')
38 |     for name in feature_names]
39 | 
40 | label = fluid.layers.data(
41 |     name='label', shape=[1], dtype='int64')
42 | 
43 | sparse_feature_dim = int(os.environ['SPARSE_DIM'])
44 | dataset_prefix = os.environ['DATASET_PATH']
45 | embedding_size = 9
46 | 
47 | current_date_hr = sys.argv[3]
48 | hdfs_address = os.environ['HDFS_ADDRESS']
49 | hdfs_ugi = os.environ['HDFS_UGI']
50 | 
51 | def embedding_layer(input):
52 |     emb = fluid.layers.embedding(
53 |         input=input,
54 |         is_sparse=True,
55 |         is_distributed=False,
56 |         size=[sparse_feature_dim, embedding_size],
57 |         param_attr=fluid.ParamAttr(name="SparseFeatFactors",
58 |                                    initializer=fluid.initializer.Uniform()))
59 |     emb_sum = fluid.layers.sequence_pool(input=emb, pool_type='sum')
60 |     return emb, emb_sum
61 | 
62 | emb_sums = list(map(embedding_layer, sparse_input_ids))
63 | 
64 | # The raw embedding outputs become the feed targets of the pruned
65 | # inference program; the pooled sums feed the dense tower.
66 | emb_list = [x[0] for x in emb_sums]
67 | sparse_embed_seq = [x[1] for x in emb_sums]
68 | inference_feed_vars = emb_list
69 | 
70 | concated = fluid.layers.concat(sparse_embed_seq, axis=1)
71 | fc1 = fluid.layers.fc(input=concated, size=400, act='relu',
72 |                       param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
73 |                           scale=1 / math.sqrt(concated.shape[1]))))
74 | fc2 = fluid.layers.fc(input=fc1, size=400, act='relu',
75 |                       param_attr=fluid.ParamAttr(
76 |                           initializer=fluid.initializer.Normal(
77 |                               scale=1 / math.sqrt(fc1.shape[1]))))
78 | fc3 = fluid.layers.fc(input=fc2, size=400, act='relu',
79 |                       param_attr=fluid.ParamAttr(
80 |                           initializer=fluid.initializer.Normal(
81 |                               scale=1 / math.sqrt(fc2.shape[1]))))
82 | 
83 | predict = fluid.layers.fc(input=fc3, size=2, act='softmax',
84 |                           param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
85 |                               scale=1 / math.sqrt(fc3.shape[1]))))
86 | 
87 | cost = fluid.layers.cross_entropy(input=predict, label=label)
88 | avg_cost = fluid.layers.reduce_sum(cost)
89 | accuracy = fluid.layers.accuracy(input=predict, label=label)
90 | auc_var, batch_auc_var, auc_states = \
91 |     fluid.layers.auc(input=predict, label=label, num_thresholds=2 ** 12, slide_steps=20)
92 | 
93 | def save_program():
94 |     dataset = fluid.DatasetFactory().create_dataset()
95 |     dataset.set_use_var(sparse_input_ids + [label])
96 |     pipe_command = "python criteo_dataset.py {}".format(sys.argv[1])
97 |     dataset.set_pipe_command(pipe_command)
98 |     dataset.set_batch_size(32)
99 |     dataset.set_thread(10)
100 |     optimizer = fluid.optimizer.SGD(0.0001)
101 |     optimizer.minimize(avg_cost)
102 |     exe = fluid.Executor(fluid.CPUPlace())
103 | 
104 |     # List the training files of the given date-hour partition on HDFS;
105 |     # awk skips the header line of `hadoop fs -ls`.
106 |     input_folder = "hdfs:"
107 |     output = sp.check_output("hadoop fs -D hadoop.job.ugi=" + hdfs_ugi
108 |         + " -D fs.defaultFS=" + hdfs_address + " -ls " + os.path.join(dataset_prefix, current_date_hr) + "/ | awk '{if(NR>1) print $8}'", shell=True)
109 |     train_filelist = ["{}{}".format(input_folder, f) for f in output.decode('ascii').strip().split('\n')]
110 |     train_filelist.remove('hdfs:' + os.path.join(dataset_prefix, current_date_hr, 'donefile'))
111 |     # A single file is enough: the goal is to materialize the parameters,
112 |     # not to train for a full pass.
113 |     train_filelist = [train_filelist[0]]
114 |     print(train_filelist)
115 | 
116 |     exe.run(fluid.default_startup_program())
117 |     print("startup save program done.")
118 |     dataset.set_filelist(train_filelist)
119 |     exe.train_from_dataset(
120 |         program=fluid.default_main_program(),
121 |         dataset=dataset,
122 |         fetch_list=[auc_var],
123 |         fetch_info=["auc"],
124 |         debug=False)
125 |     # save the inference-only model, fed by the raw embedding outputs
126 |     fetch_list = fluid.io.save_inference_model(inference_path, [x.name for x in inference_feed_vars], [predict], exe)
127 | 
128 | 
129 | def prune_program():
130 |     # Strip the lookup_table ops and the sparse embedding variable from the
131 |     # saved program, so the inference program no longer references the
132 |     # embedding table, which is served separately.
133 |     model_dir = inference_path
134 |     model_file = model_dir + "/__model__"
135 |     with open(model_file, "rb") as f:
136 |         protostr = f.read()
137 |     proto = framework_pb2.ProgramDesc.FromString(six.binary_type(protostr))
138 |     block = proto.blocks[0]
139 |     kept_ops = [op for op in block.ops if op.type != "lookup_table"]
140 |     del block.ops[:]
141 |     block.ops.extend(kept_ops)
142 | 
143 |     kept_vars = [var for var in block.vars if var.name != "SparseFeatFactors"]
144 |     del block.vars[:]
145 |     block.vars.extend(kept_vars)
146 | 
147 |     with open(model_file, "wb") as f:
148 |         f.write(proto.SerializePartialToString())
149 |     with open(model_file + ".prototxt.pruned", "w") as f:
150 |         f.write(text_format.MessageToString(proto))
151 | 
152 | 
153 | def remove_embedding_param_file():
154 |     # The pruned program no longer needs the embedding parameter file.
155 |     model_dir = inference_path
156 |     embedding_file = model_dir + "/SparseFeatFactors"
157 |     os.remove(embedding_file)
158 | 
159 | 
160 | if __name__ == '__main__':
161 |     save_program()
162 |     prune_program()
163 |     remove_embedding_param_file()
164 | 
--------------------------------------------------------------------------------