├── HDFS_TUTORIAL.md
├── LICENSE
├── README.md
├── aws_k8s.md
├── cluster_config.md
├── doc
│   ├── ElasticCTR.png
│   └── buy_bcc.png
├── elastic-ctr-cli
│   ├── data.config
│   ├── elastic-control.sh
│   ├── fileserver.yaml.template
│   ├── fleet-ctr.yaml.template
│   ├── listen.py
│   ├── pdserving.yaml
│   ├── rolebind.yaml
│   ├── service.py
│   ├── service_auto_port.py
│   └── slot.conf
├── elasticctr_arch.md
├── fleet-ctr
│   ├── criteo_dataset.py
│   ├── criteo_dataset.py~
│   ├── criteo_reader.py
│   ├── dataset_generator.py
│   ├── infer.py
│   ├── ip_config.sh
│   ├── k8s_tools.py
│   ├── mlflow_run.sh
│   ├── model_with_sparse_feature.py
│   ├── nets.py
│   ├── paddle_k8s
│   ├── process_rawmodel.py
│   ├── pserver0.sh
│   ├── pserver1.sh
│   ├── run_distributed.sh
│   ├── train_with_mlflow.py
│   ├── trainer0.sh
│   └── trainer1.sh
├── huawei_k8s.md
└── save_program
    ├── dumper.py
    ├── listening.py
    ├── nets.py
    ├── parameter_to_seqfile
    ├── replace_params.py
    └── save_program.py
/HDFS_TUTORIAL.md:
--------------------------------------------------------------------------------
1 | # How to Set Up an HDFS Cluster
2 | 
3 | ## Overview
4 | 
5 | This article is a demo-oriented HDFS setup tutorial, just enough to run the full ElasticCTR pipeline end to end. It walks through building an HDFS instance on a Baidu Cloud node and storing the Criteo dataset on it in the layout that ElasticCTR expects.
6 | 
7 | ## Buying a BCC Instance
8 | 
9 | Setting up an HDFS cluster is a somewhat involved process. First, purchase a BCC instance.
10 | 
11 | ![](doc/buy_bcc.png)
12 | 
17 | When configuring the BCC instance, also purchase a reasonably large CDS cloud disk.
18 | 
19 | ## Installing and Starting Hadoop
20 | 
21 | After logging in to the BCC instance, first use the fdisk tool to confirm that the disk has been partitioned and mounted.
22 | 
23 | Download hadoop-2.8.5.tar.gz (available from the Apache Hadoop release archive), extract it, and move the hadoop-2.8.5 directory to /usr/local. Then edit core-site.xml under /usr/local/hadoop-2.8.5/etc/hadoop/ so that it reads:
24 | ```xml
25 | <configuration>
26 |     <property>
27 |         <name>fs.defaultFS</name>
28 |         <value>hdfs://${LOCAL_IP}:9000</value>
29 |     </property>
30 |     <property>
31 |         <name>hadoop.tmp.dir</name>
32 |         <value>/data/hadoop</value>
33 |     </property>
34 | </configuration>
35 | ```
36 |
37 |
38 | For `$LOCAL_IP`, use the internal IP, i.e. the address beginning with `192.168` shown by ifconfig; this address is also reachable from inside the Kubernetes cluster.
39 |
40 | In the `slaves` file, enter `root@127.0.0.1`.
41 |
42 | Next, set up passwordless SSH: run `ssh-keygen` and hit Enter at every prompt, then use `ssh-copy-id` to install the key for each of the addresses `127.0.0.1`, `localhost`, and `0.0.0.0`.
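   | 
   | A minimal sketch of the passwordless-SSH setup just described:
   | ```bash
   | # Generate a key pair, accepting the defaults at every prompt
   | ssh-keygen -t rsa
   | # Install the public key for each address Hadoop connects through
   | for host in 127.0.0.1 localhost 0.0.0.0; do
   |     ssh-copy-id root@${host}
   | done
   | ```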
43 |
44 | Set `$HADOOP_HOME` to `/usr/local/hadoop-2.8.5` (the configuration files then live under `$HADOOP_HOME/etc/hadoop`).
45 | 
46 | Then add `$HADOOP_HOME/bin` to `$PATH`. Once the `hadoop` command can be executed, run `hadoop namenode -format`.
47 | 
48 | Finally, run `start-all.sh` from the `/usr/local/hadoop-2.8.5/sbin` directory.
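   | 
   | Put together, the environment setup and startup look roughly like this (a sketch, assuming the install location above):
   | ```bash
   | export HADOOP_HOME=/usr/local/hadoop-2.8.5
   | export PATH=$HADOOP_HOME/bin:$PATH
   | 
   | # Format the NameNode once, before the first start
   | hadoop namenode -format
   | 
   | # Start the HDFS (and YARN) daemons
   | $HADOOP_HOME/sbin/start-all.sh
   | ```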
49 |
50 |
51 | After these steps the HDFS service is up. Next, create the folder for streaming training data, `/train_data/`, with `hdfs dfs -mkdir hdfs://$IP:9000/train_data/`.
52 |
53 | ## Copying the Criteo Dataset to HDFS
54 | Next, download the dataset from `https://paddle-serving.bj.bcebos.com/criteo_ctr_example/criteo_demo.tar.gz`. After extracting it, run the following inside the criteo_demo directory:
55 | 
56 | `hdfs dfs -put * hdfs://$IP:9000/train_data/20200401`
57 | 
58 | where `$IP` is the HDFS address from the previous step. This stores five hours of training data under the `20200401` directory inside `train_data`; `20200401` can be changed to any date.
59 | In the main tutorial, the HDFS information configured here goes into the `data.config` file, where the date information is referenced.
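   | 
   | An end-to-end sketch of this section, assuming `$IP` holds the NameNode address:
   | ```bash
   | wget https://paddle-serving.bj.bcebos.com/criteo_ctr_example/criteo_demo.tar.gz
   | tar zxvf criteo_demo.tar.gz
   | cd criteo_demo
   | # Create the dated directory, then upload and verify the data
   | hdfs dfs -mkdir -p hdfs://$IP:9000/train_data/20200401
   | hdfs dfs -put * hdfs://$IP:9000/train_data/20200401
   | hdfs dfs -ls hdfs://$IP:9000/train_data/20200401
   | ```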
60 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright [yyyy] [name of copyright owner]
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # ElasticCTR
2 |
3 | ElasticCTR is a one-click deployment solution for distributed CTR-estimation training and the accompanying serving pipeline. Users only need to configure the data source and sample format to run the full set of training and prediction tasks.
4 |
5 | * [1. Overview](#head1)
6 | * [2. Cluster Setup](#head2)
7 | * [3. One-Click Deployment](#head3)
8 | * [4. Tracking Training Progress](#head4)
9 | * [5. Prediction Service](#head5)
10 |
11 | ## 1. Overview
12 | 
13 | This project provides an end-to-end solution for CTR training and secondary development. Its main features are:
14 | 
15 | 1. Fast deployment
16 | 
17 | ElasticCTR currently ships with a deployment targeting Kubernetes clusters on Baidu Cloud, and it is easy to extend to any other vanilla Kubernetes environment.
18 | 
19 | 2. High performance
20 | 
21 | ElasticCTR uses PaddlePaddle's fully asynchronous distributed training, whose near-linear scalability saves substantial training resources without compromising model quality. On the serving side, ElasticCTR uses the high-throughput, low-latency sparse-parameter estimation engine from Paddle Serving, which under high concurrency delivers more than 10x the throughput of common open-source components.
22 | 
23 | 3. Customizability
24 | 
25 | Through a single unified configuration file, users can change the training mode and basic settings, including online vs. offline training, the metrics visualized during training, and the storage layout on HDFS. Beyond that, ElasticCTR is built on a fully open-source stack, which makes quick secondary development straightforward: Kubernetes and Volcano at the bottom enable flexible scheduling policies for the jobs above; PaddlePaddle's flexible model definition, the Fleet distributed training engine, and the Paddle Serving remote inference service let users iterate quickly on the model, the parallel-training mode, and the remote prediction service; and the training-visualization capability of MLFlow makes it easy to add whatever monitoring metrics the system needs.
26 |
27 |
28 |
29 | For the overall architecture of this solution, see [ElasticCTR Architecture](elasticctr_arch.md).
30 |
31 |
32 |
33 |
34 |
35 |
36 |
37 | ## 2. Cluster Setup
38 | 
39 | Before running this solution, you need a working Kubernetes cluster with the Volcano components installed. Deploying a Kubernetes environment from scratch is involved and out of scope here. A Baidu Cloud CCE (Cloud Container Engine) cluster is ready to use once requested; see [Creating and using a Kubernetes cluster on Baidu Cloud](cluster_config.md). ElasticCTR can also be deployed on other clouds; see [Creating a Kubernetes cluster on Huawei Cloud](huawei_k8s.md) and [Creating a Kubernetes cluster on AWS](aws_k8s.md).
40 | 
41 | With the Kubernetes cluster ready, configure HDFS as the data source for the dataset: [HDFS setup tutorial](HDFS_TUTORIAL.md).
42 |
43 |
44 | ## 3. One-Click Deployment
45 | 
46 | You can complete the deployment with the provided elastic-control.sh script. Before running it, make sure your machine has python3 and that mlflow has been installed via pip:
47 | ```bash
48 | python3 -m pip install mlflow -i https://pypi.tuna.tsinghua.edu.cn/simple
49 | ```
50 | The script is invoked as follows:
51 | ```bash
52 | bash elastic-control.sh [COMMAND] [OPTIONS]
53 | ```
54 | The available commands (COMMAND) are:
55 | - **-c|--config_client** Retrieve the client binaries used to send prediction requests and receive results
56 | - **-r|--config_resource** Define the training configuration
57 | - **-a|--apply** Apply the configuration and start training
58 | - **-l|--log** Print the training status; make sure training has been started first
59 | 
60 | When defining the training configuration, add options (OPTIONS) to specify the resources to configure:
61 | - **-u|--cpu** Number of CPU cores per training node
62 | - **-m|--mem** Memory per node
63 | - **-t|--trainer** Number of trainer nodes
64 | - **-p|--pserver** Number of parameter-server nodes
65 | - **-b|--cube** Number of cube shards
66 | - **-hd|--hdfs_address** HDFS address where the data files are stored
67 |
68 | Note: your data files must follow the format below (a sanity-check sketch follows the option list):
69 | ```
70 | $show $click $feasign0:$slot0 $feasign1:$slot1 $feasign2:$slot2......
71 | ```
72 | For example:
73 | ```
74 | 1 0 17241709254077376921:0 132683728328325035:1 9179429492816205016:2 12045056225382541705:3
75 | ```
76 |
77 | - **-f|--datafile** Data-path config file, which must specify the HDFS address and the start date (the end date is optional)
78 | - **-s|--slot_conf** Feature-slot configuration file; note that the file suffix must be '.txt'
79 |
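   | A quick way to sanity-check that a data file matches the format above (a sketch; `data.txt` is an illustrative file name) is to list every slot id that appears in it:
   | ```bash
   | cut -d' ' -f3- data.txt | tr ' ' '\n' | cut -d':' -f2 | sort -un
   | ```
   | 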
80 | Below is the `data.config` file; `START_DATE_HR` and `END_DATE_HR` refer to the dated HDFS paths we created in the previous step.
81 | ```
82 | export HDFS_ADDRESS="hdfs://${IP}:9000" # HDFS address
83 | export HDFS_UGI="root,i" # HDFS user and password
84 | export START_DATE_HR=20200401/00 # Training-set start time: Apr 1, 2020, hour 00
85 | export END_DATE_HR=20200401/03 # Training-set end time: Apr 1, 2020, hour 03
86 | export DATASET_PATH="/train_data" # Training-set prefix on HDFS
87 | export SPARSE_DIM="1000001" # Sparse-parameter dimension; usually left unchanged
88 | ```
89 |
90 | Example invocations:
91 | ```
92 | bash elastic-control.sh -r -u 4 -m 20 -t 2 -p 2 -b 5 -s slot.conf -f data.config
93 | bash elastic-control.sh -a
94 | bash elastic-control.sh -l
95 | bash elastic-control.sh -c
96 | ```
97 |
98 | ## 4. Tracking Training Progress
99 | We provide two ways for users to observe training progress:
100 | 
101 | 1. Command line
102 | 
103 | At any point during training, the following command prints the status logs of Trainer0 and the file server to standard output:
104 | ```bash
105 | bash elastic-control.sh -l
106 | ```
107 |
108 | ## 5. Prediction Service
109 | Users can view the file-server log with:
110 | ```bash
111 | bash elastic-control.sh -l
112 | ```
113 | Once a model has been produced, run predictions by entering:
114 | ```bash
115 | bash elastic-control.sh -c
116 | ```
117 | and then follow the on-screen prompts; the prediction results are printed to standard output.
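   | 
   | As a concrete sketch of this last step: after `bash elastic-control.sh -c` has downloaded the client package, the printed instructions amount to roughly the following (the script prints the actual serving IP and port):
   | ```bash
   | cd client
   | python bin/elastic_ctr.py $SERVING_IP $SERVING_PORT conf/slot.conf data/ctr_prediction/data.txt
   | ```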
119 |
120 |
121 |
122 |
--------------------------------------------------------------------------------
/aws_k8s.md:
--------------------------------------------------------------------------------
1 | # Setting Up a Kubernetes Cluster on AWS
2 | 
3 | This document describes how to set up a Kubernetes (k8s) cluster on AWS.
4 | 
5 | * [1. Process Overview](#head1)
6 | * [2. Buying a Jump Host](#head2)
7 | * [3. Deploying the Cluster](#head3)
8 | 
9 | ## 1. Process Overview
10 | 
11 | Setting up a Kubernetes cluster on AWS involves two main steps:
12 | 
13 | 1. Buy a jump host
14 | 
15 | First, purchase an EC2 instance to act as a jump host for controlling the cluster; it does not need a powerful configuration.
16 | 
17 | 2. Deploy the cluster
18 | 
19 | Create the cluster from the jump host purchased in the previous step; the cluster configuration can be adjusted to taste.
20 | 
21 | Each step is described in detail below.
22 |
23 | ## 2. Buying a Jump Host
24 | 
25 | You can purchase whatever instance you like as the jump host from the EC2 console.
26 | 
27 | The steps are as follows:
28 | 1. Open the Amazon EC2 console and, from the dashboard, click Launch Instance.
29 | 
30 | 2. Choose a suitable AMI; the Amazon Linux 2 AMI is recommended.
31 | 
32 | 3. Choose an instance type; the default t2.micro is sufficient. Then click Review and Launch.
33 | 
34 | 4. On the Review Instance Launch page, click Edit security groups in the Security Groups section, choose Select an existing security group, pick the group named default, and click Review and Launch again.
35 | 
36 | 
37 | 5. On the review page click Launch. In the key-pair dialog, choose Create a new key pair, name it, and download the key pair. Keep the key-pair file safe: it cannot be downloaded again if lost. Finally, click Launch Instances to finish purchasing the jump host.
38 | 
39 | 
40 | 
41 | Note: after downloading the key-pair file, change its permissions to 400.
42 |
43 | ## 3. Deploying the Cluster
44 | 
45 | Once the instance purchased above is running, its public IP and DNS name are displayed. Connect to it with the key-pair file (suffix .pem) downloaded earlier:
46 |
47 | ```bash
48 | ssh -i ec2key.pem ec2-user@12.23.34.123
49 | ```
50 | or
51 | ```bash
52 | ssh -i ec2key.pem ec2-user@ec2-12-23-34-123.us-west-2.compute.amazonaws.com
53 | ```
54 |
55 | After connecting to the jump host, install the following tooling:
56 | 1. Install pip
57 | ```bash
58 | sudo yum -y install python-pip
59 | ```
60 | 2. Install or upgrade the AWS CLI
61 | ```bash
62 | sudo pip install --upgrade awscli
63 | ```
64 | 3. Install eksctl
65 | ```bash
66 | curl --silent \
67 | --location "https://github.com/weaveworks/eksctl/releases/download/latest_release/eksctl_$(uname -s)_amd64.tar.gz" \
68 | | tar xz -C /tmp
69 | sudo mv /tmp/eksctl /usr/local/bin
70 | ```
71 | 4. Install kubectl
72 | ```bash
73 | curl -o kubectl https://amazon-eks.s3-us-west-2.amazonaws.com/1.11.5/2018-12-06/bin/linux/amd64/kubectl
74 | chmod +x ./kubectl
75 | mkdir -p $HOME/bin && cp ./kubectl $HOME/bin/kubectl && export PATH=$HOME/bin:$PATH
76 | ```
77 | 5. Install aws-iam-authenticator
78 | ```bash
79 | curl -o aws-iam-authenticator https://amazon-eks.s3-us-west-2.amazonaws.com/1.11.5/2018-12-06/bin/linux/amd64/aws-iam-authenticator
80 | chmod +x aws-iam-authenticator
81 | cp ./aws-iam-authenticator $HOME/bin/aws-iam-authenticator && export PATH=$HOME/bin:$PATH
82 | ```
83 | 6. Install ksonnet
84 | ```bash
85 | export KS_VER=0.13.1
86 | export KS_PKG=ks_${KS_VER}_linux_amd64
87 | wget -O /tmp/${KS_PKG}.tar.gz https://github.com/ksonnet/ksonnet/releases/download/v${KS_VER}/${KS_PKG}.tar.gz
88 | mkdir -p ${HOME}/bin
89 | tar -xvf /tmp/$KS_PKG.tar.gz -C ${HOME}/bin
90 | sudo mv ${HOME}/bin/$KS_PKG/ks /usr/local/bin
91 | ```
92 |
93 | With these components installed, you can create and deploy the cluster:
94 | ```bash
95 | eksctl create cluster paddle-cluster \
96 | --version 1.13 \
97 | --nodes 2 \
98 | --node-type=m5.2xlarge \
99 | --timeout=40m \
100 | --ssh-access \
101 | --ssh-public-key ec2.key \
102 | --region us-west-2 \
103 | --auto-kubeconfig
104 | ```
105 | where:
106 | 
107 | **--version** is the Kubernetes version; AWS currently supports 1.12, 1.13, and 1.14
108 | 
109 | **--nodes** is the number of nodes
110 | 
111 | **--node-type** is the node instance type; pick whichever instance plan suits you
112 | 
113 | **--ssh-public-key** can be the key name defined when purchasing the jump host
114 | 
115 | **--region** is the region in which the nodes run
116 |
117 | Deploying the cluster takes a while; please be patient. Once it succeeds, test the cluster as follows:
118 | 
119 | 1. View node information:
120 | ```bash
121 | kubectl get nodes -o wide
122 | ```
123 | 2. Verify that the cluster is active (fill in your region and cluster name):
124 | ```bash
125 | aws eks --region <region> describe-cluster --name <cluster-name> --query cluster.status
126 | ```
127 | You should see the following output:
128 | ```
129 | "ACTIVE"
130 | ```
131 | 3. If you have multiple clusters configured on the same jump host, verify the kubectl context:
132 | ```bash
133 | kubectl config get-contexts
134 | ```
135 | If the context is not set as expected, fix it with:
136 | ```bash
137 | aws eks --region <region> update-kubeconfig --name <cluster-name>
138 | ```
139 | Those are all the steps for setting up a Kubernetes cluster on AWS. Next, you can set up HDFS on AWS yourself and deploy ElasticCTR from the jump host.
140 |
--------------------------------------------------------------------------------
/cluster_config.md:
--------------------------------------------------------------------------------
1 | ### 1 Creating a Kubernetes Cluster
2 | 
3 | Following
4 | [the Baidu Cloud CCE documentation on creating a cluster](https://cloud.baidu.com/doc/CCE/s/zjxpoqohb), create a cluster on Baidu Cloud whose nodes satisfy the following requirement:
5 | 
6 | - CPU cores \> 4
7 | 
8 | Example of requesting a container-engine cluster:
9 | 
10 | 
11 | 
12 | Once the cluster is created, see [the Baidu Cloud CCE documentation on viewing clusters](https://cloud.baidu.com/doc/CCE/GettingStarted.html#.E6.9F.A5.E7.9C.8B.E9.9B.86.E7.BE.A4) to inspect the cluster you just requested.
13 |
14 | ### 2 Working with the Cluster
15 | 
16 | The cluster can be managed either from the Baidu Cloud web console or with the kubectl tool; kubectl is recommended.
17 | 
18 | macOS and Linux users can install kubectl as follows:
19 | 
20 | 1. Download the latest kubectl release (the URL below is the macOS build; on Linux, replace `darwin` with `linux` in the path)
21 | ```bash
22 | curl -LO "https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/darwin/amd64/kubectl"
23 | ```
24 | 2. Make the kubectl binary executable
25 | ```bash
26 | chmod +x ./kubectl
27 | ```
28 | 3. Move the downloaded executable onto your PATH
29 | ```bash
30 | sudo mv ./kubectl /usr/local/bin/kubectl
31 | ```
32 | 4. Check that the installation succeeded
33 | ```bash
34 | kubectl version
35 | ```
36 |
37 | \* Note: the steps in this guide assume a Linux environment.
38 |
39 | - After installing kubectl, configure it by downloading the cluster credentials, which can be viewed in the Baidu Cloud console as shown below:
40 | 
41 | Download the cluster configuration file from the cluster page and place it in kubectl's default config path (check that the \~/.kube directory exists, and create it if not):
42 |
43 | ```bash
44 | $ mv kubectl.conf ~/.kube/config
45 | ```
46 |
47 | - Once configured, you can access the Kubernetes cluster from your local machine with kubectl (note: make sure no network proxy is configured on your machine):
48 |
49 | ```bash
50 | $ kubectl get node
51 | ```
52 |
53 |
54 | ### 3 Setting Up Access Permissions
55 | 
56 | Distributed jobs need pods to access the Kubernetes API of one another; grant this as follows:
57 |
58 | ```bash
59 | $ kubectl create rolebinding default-view --clusterrole=view --serviceaccount=default:default --namespace=default
60 | ```
61 |
62 | Note: the `default` passed to --namespace is the namespace name chosen when the cluster was created.
63 |
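   | If you work in a namespace other than `default`, the repository also ships `elastic-ctr-cli/rolebind.yaml` as a template. A sketch of adapting it (the template uses the placeholder namespace `suo`; `my-namespace` below is illustrative):
   | 
   | ```bash
   | sed -i 's/suo/my-namespace/g' elastic-ctr-cli/rolebind.yaml
   | kubectl apply -f elastic-ctr-cli/rolebind.yaml
   | ```
   | 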
64 | ### 4 Installing Volcano
65 | 
66 | We use Volcano as the batch-job manager for the training phase. For details about Volcano, see the Documentation on its [official website](https://volcano.sh/).
67 | 
68 | Run the following command to install Volcano into the cluster:
69 |
70 | ```bash
71 | $ kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
72 | ```
73 |
74 | 
75 |
76 |
77 |
--------------------------------------------------------------------------------
/doc/ElasticCTR.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PaddlePaddle/ElasticCTR/c493c6b250eef001028b5a7264b61f679c79fa6d/doc/ElasticCTR.png
--------------------------------------------------------------------------------
/doc/buy_bcc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PaddlePaddle/ElasticCTR/c493c6b250eef001028b5a7264b61f679c79fa6d/doc/buy_bcc.png
--------------------------------------------------------------------------------
/elastic-ctr-cli/data.config:
--------------------------------------------------------------------------------
1 | export HDFS_ADDRESS="hdfs://${IP}:9000"
2 | export HDFS_UGI="root,i"
3 | export START_DATE_HR=20200401/00
4 | export END_DATE_HR=20200401/03
5 | export DATASET_PATH="/train_data"
6 | export SPARSE_DIM="1000001"
7 |
--------------------------------------------------------------------------------
/elastic-ctr-cli/elastic-control.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | ###############################################################################
16 | # Function definitions
17 | ###############################################################################
18 | help()
19 | {
20 | echo "Usage: sh elastic-control.sh [COMMAND] [OPTIONS]"
21 | echo "elastic-control is command line interface with ELASTIC CTR"
22 | echo ""
23 | echo "Commands:"
24 |     echo "-r|--config_resource Configure training resource requirements. See below"
25 | echo "-a|--apply Apply configurations to start training process"
26 | echo "-l|--log Log the status of training, please make sure you have started the training process"
27 | echo "-c|--config_client Retrieve client binaries to send infer requests and receive results"
28 | echo ""
29 | echo "Options(Used only for --config_resource):"
30 | echo "-u|--cpu CPU cores for each training node (Unused for now)"
31 | echo "-m|--mem Memory for each training node (Unused for now)"
32 | echo "-t|--trainer Number of trainer nodes"
33 | echo "-p|--pserver Number of parameter-server nodes"
34 | echo "-b|--cube Number of cube shards"
35 | echo "-f|--datafile Data file path (Only HDFS supported) (Unused for now)"
36 | echo "-s|--slot_conf Slot config file"
37 | echo ""
38 | echo "Example:"
39 | echo "sh elastic-control.sh -r -u 4 -m 20 -t 2 -p 2 -b 5 -s slot.conf -f data.config"
40 | echo "sh elastic-control.sh -a"
41 | echo "sh elastic-control.sh -c"
42 | echo ""
43 | echo "Notes:"
44 | echo "Slot Config File: Specify which feature ids are used in training. One number per line."
45 | }
46 |
47 | die()
48 | {
49 | echo "[FAILED] ${1}"
50 | exit 1
51 | }
52 |
53 | ok()
54 | {
55 | echo "[OK] ${1}"
56 | }
57 |
58 | check_tools()
59 | {
60 | if [ $# -lt 1 ]; then
61 | echo "Usage: check_tools COMMAND [COMMAND...]"
62 | return
63 | fi
64 | while [ $# -ge 1 ]; do
65 | type $1 &>/dev/null || die "$1 is needed but not found. Aborting..."
66 | shift
67 | done
68 | return 0
69 | }
70 |
71 | function check_files()
72 | {
73 | if [ $# -lt 1 ]; then
74 | echo "Usage: check_files COMMAND [COMMAND...]"
75 | return
76 | fi
77 | while [ $# -ge 1 ]; do
78 | [ -f "$1" ] || die "$1 does not exist"
79 | shift
80 | done
81 | return 0
82 | }
83 |
84 | function start_fileserver()
85 | {
86 | unset http_proxy
87 | unset https_proxy
88 | kubectl get pod | grep file-server >/dev/null 2>&1
89 | if [ $? -ne 0 ]; then
90 | kubectl apply -f fileserver.yaml
91 | else
92 | echo "delete duplicate file server..."
93 | kubectl delete -f fileserver.yaml
94 | kubectl apply -f fileserver.yaml
95 | fi
96 | }
97 |
98 | function install_volcano() {
99 | unset http_proxy
100 | unset https_proxy
101 | kubectl get crds | grep jobs.batch.volcano.sh >/dev/null 2>&1
102 | if [ $? -ne 0 ]; then
103 | echo "volcano not found, now install"
104 | kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
105 | fi
106 |
107 | }
108 |
109 |
110 | function config_client()
111 | {
112 | check_tools wget kubectl
113 | wget --no-check-certificate https://paddle-serving.bj.bcebos.com/data/ctr_prediction/elastic_ctr_client_million.tar.gz
114 | tar zxvf elastic_ctr_client_million.tar.gz
115 | rm elastic_ctr_client_million.tar.gz
116 |
117 | for number in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do
118 | SERVING_IP=`kubectl get services | grep 'paddleserving' | awk '{print $4}'`
119 | echo "searching Paddle Serving external IP, wait a moment."
120 | if [ "${SERVING_IP}" == "" ]; then
121 | sleep 10
122 | else
123 | break
124 | fi
125 | done
126 |
127 | SERVING_IP=`kubectl get services | grep 'paddleserving' | awk '{print $4}'`
128 | SERVING_PORT=`kubectl get services | grep 'paddleserving' | awk '{print $5}' | awk -F':' '{print $1}'`
129 | SERVING_ADDR="$SERVING_IP:$SERVING_PORT"
130 | sed -e "s#<$ SERVING_LIST $>#$SERVING_ADDR#g" client/template/predictors.prototxt.template > client/conf/predictors.prototxt
131 | FILESERVER_IP=`kubectl get services | grep 'file-server' | awk '{print $4}'`
132 | FILESERVER_PORT=`kubectl get services | grep 'file-server' | awk '{print $5}' | awk -F':' '{print $1}'`
133 | wget http://$FILESERVER_IP:$FILESERVER_PORT/slot.conf -O client/conf/slot.conf
134 | cp api/lib/*.so client/bin/
135 | echo "Done."
136 | echo "============================================"
137 | echo ""
138 | echo "Try ELASTIC CTR:"
139 | echo "1. cd client"
140 | echo "2. (python) python bin/elastic_ctr.py $SERVING_IP $SERVING_PORT conf/slot.conf data/ctr_prediction/data.txt"
141 | echo "3. (C++ native) bin/elastic_ctr_demo --test_file data/ctr_prediction/data.txt"
142 | return 0
143 | }
144 |
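   | # Writes one Pod + Service per cube shard to ./cube.yaml, and the
   | # cube-transfer Pod to ./transfer.yaml.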
145 | function generate_cube_yaml()
146 | {
147 | if [ $# != 1 ]; then
148 | echo "Invalid argument to function generate_cube_yaml"
149 |         return 1
150 | fi
151 | if [ "$1" -lt 1 ]; then
152 | echo "CUBE_SHARD_NUM must not be less than 1"
153 |         return 1
154 | fi
155 | CNT=$(($1-1))
156 | CUBE_SHARD_NUM=$1
157 | for i in `seq 0 $CNT`; do
158 | echo "---"
159 | echo "apiVersion: v1"
160 | echo "kind: Pod"
161 | echo "metadata:"
162 | echo " name: cube-$i"
163 | echo " labels:"
164 | echo " app: cube-$i"
165 | echo "spec:"
166 | echo " containers:"
167 | echo " - name: cube-$i"
168 | echo " image: hub.baidubce.com/ctr/cube:v1"
169 | echo " workingDir: /cube"
170 | echo " command: ['/bin/bash']"
171 | echo " args: ['start.sh']"
172 | echo " env:"
173 | echo " - name: CUBE_SHARD_NUM"
174 | echo " value: \"$CUBE_SHARD_NUM\""
175 | echo " ports:"
176 | echo " - containerPort: 8001"
177 | echo " name: cube-agent"
178 | echo " - containerPort: 8027"
179 | echo " name: cube-server"
180 | echo "---"
181 | echo "kind: Service"
182 | echo "apiVersion: v1"
183 | echo "metadata:"
184 | echo " name: cube-$i"
185 | echo "spec:"
186 | echo " ports:"
187 | echo " - name: agent"
188 | echo " port: 8001"
189 | echo " protocol: TCP"
190 | echo " - name: server"
191 | echo " port: 8027"
192 | echo " protocol: TCP"
193 | echo " selector:"
194 | echo " app: cube-$i"
195 | done > cube.yaml
196 | {
197 | echo "apiVersion: v1"
198 | echo "kind: Pod"
199 | echo "metadata:"
200 | echo " name: cube-transfer"
201 | echo " labels:"
202 | echo " app: cube-transfer"
203 | echo "spec:"
204 | echo " containers:"
205 | echo " - name: cube-transfer"
206 | echo " image: hub.baidubce.com/ctr/cube-transfer:v2"
207 | echo " workingDir: /"
208 | echo " env:"
209 | echo " - name: POD_IP"
210 | echo " valueFrom:"
211 | echo " fieldRef:"
212 | echo " apiVersion: v1"
213 | echo " fieldPath: status.podIP"
214 | echo " - name: CUBE_SHARD_NUM"
215 | echo " value: \"$CUBE_SHARD_NUM\""
216 | echo " command: ['bash']"
217 | echo " args: ['nonstop.sh']"
218 | echo " ports:"
219 | echo " - containerPort: 8099"
220 | echo " name: cube-transfer"
221 | echo " - containerPort: 8098"
222 | echo " name: cube-http"
223 | } > transfer.yaml
224 | echo "cube.yaml written to ./cube.yaml"
225 | echo "transfer.yaml written to ./transfer.yaml"
226 | return 0
227 | }
228 |
229 | function generate_fileserver_yaml()
230 | {
231 | check_tools sed
232 | check_files fileserver.yaml.template
233 | if [ $# -ne 3 ]; then
234 | echo "Invalid argument to function generate_fileserver_yaml"
235 |         return 1
236 | else
237 | hdfs_address=$1
238 | hdfs_ugi=$2
239 | dataset_path=$3
240 | sed -e "s#<$ HDFS_ADDRESS $>#$hdfs_address#g" \
241 | -e "s#<$ HDFS_UGI $>#$hdfs_ugi#g" \
242 | -e "s#<$ DATASET_PATH $>#$dataset_path#g" \
243 | fileserver.yaml.template > fileserver.yaml
244 | echo "File server yaml written to fileserver.yaml"
245 | fi
246 | return 0
247 | }
248 |
249 | function generate_yaml()
250 | {
251 | check_tools sed
252 | check_files fleet-ctr.yaml.template
253 | if [ $# -ne 11 ]; then
254 | echo "Invalid argument to function generate_yaml"
255 |         return 1
256 | else
257 | pserver_num=$1
258 | total_trainer_num=$2
259 | slave_trainer_num=$((total_trainer_num))
260 | let total_pod_num=${total_trainer_num}+${pserver_num}
261 | cpu_num=$3
262 | mem=$4
263 | data_path=$5
264 | hdfs_address=$6
265 | hdfs_ugi=$7
266 | start_date_hr=$8
267 | end_date_hr=$9
268 | sparse_dim=${10}
269 | dataset_path=${11}
270 |
271 | sed -e "s#<$ PSERVER_NUM $>#$pserver_num#g" \
272 | -e "s#<$ TRAINER_NUM $>#$total_trainer_num#g" \
273 | -e "s#<$ SLAVE_TRAINER_NUM $>#$slave_trainer_num#g" \
274 | -e "s#<$ CPU_NUM $>#$cpu_num#g" \
275 | -e "s#<$ MEMORY $>#$mem#g" \
276 | -e "s#<$ DATASET_PATH $>#$dataset_path#g" \
277 | -e "s#<$ SPARSE_DIM $>#$sparse_dim#g" \
278 | -e "s#<$ HDFS_ADDRESS $>#$hdfs_address#g" \
279 | -e "s#<$ HDFS_UGI $>#$hdfs_ugi#g" \
280 | -e "s#<$ START_DATE_HR $>#$start_date_hr#g" \
281 | -e "s#<$ END_DATE_HR $>#$end_date_hr#g" \
282 | -e "s#<$ TOTAL_POD_NUM $>#$total_pod_num#g" \
283 | fleet-ctr.yaml.template > fleet-ctr.yaml
284 | echo "Main yaml written to fleet-ctr.yaml"
285 | fi
286 | return 0
287 | }
288 |
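   | # Uploads the slot config to the in-cluster file server, where it is served
   | # as slot.conf to the trainers and the prediction client.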
289 | function upload_slot_conf()
290 | {
291 | check_tools kubectl curl
292 | if [ $# -ne 1 ]; then
293 | die "upload_slot_conf: Slot conf file not specified"
294 | fi
295 | check_files $1
296 | echo "start file-server pod"
297 | start_fileserver
298 | for number in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do
299 | FILESERVER_IP=`kubectl get services | grep 'file-server' | awk '{print $4}'`
300 | echo "searching file-server external IP, wait a moment."
301 | if [ "${FILESERVER_IP}" == "" ]; then
302 | sleep 10
303 | else
304 | break
305 | fi
306 | done
307 | if [ "${FILESERVER_IP}" == "" ]; then
308 | echo "error in K8S cluster, cannot continue. Aborted"
309 | return 1
310 | fi
311 |
312 | FILESERVER_IP=`kubectl get services | grep 'file-server' | awk '{print $4}'`
313 |
314 | FILESERVER_PORT=`kubectl get services | grep 'file-server' | awk '{print $5}' | awk -F':' '{print $2}' | awk -F',' '{print $2}'`
315 |     if [ "${1##*.}" != "txt" ];
316 |     then
317 |         echo "slot file suffix must be '.txt'"
318 |     fi
319 | echo "curl --upload-file $1 $FILESERVER_IP:$FILESERVER_PORT"
320 | curl --upload-file $1 $FILESERVER_IP:$FILESERVER_PORT
321 | if [ $? == 0 ]; then
322 | echo "File $1 uploaded to $FILESERVER_IP:$FILESERVER_PORT/slot.conf"
323 | fi
324 | return 0
325 | }
326 |
327 | function config_resource()
328 | {
329 |     echo "CPU=$CPU MEM=$MEM CUBE=$CUBE TRAINER=$TRAINER PSERVER=$PSERVER "\
330 |     "DATA_PATH=$DATA_PATH SLOT_CONF=$SLOT_CONF VERBOSE=$VERBOSE "\
331 |     "HDFS_ADDRESS=$HDFS_ADDRESS HDFS_UGI=$HDFS_UGI START_DATE_HR=$START_DATE_HR END_DATE_HR=$END_DATE_HR "\
332 |     "SPARSE_DIM=$SPARSE_DIM DATASET_PATH=$DATASET_PATH "
333 | generate_cube_yaml $CUBE || die "config_resource: generate_cube_yaml failed"
334 | generate_fileserver_yaml $HDFS_ADDRESS $HDFS_UGI $DATASET_PATH || die "config_resource: generate_fileserver_yaml failed"
335 | generate_yaml $PSERVER $TRAINER $CPU $MEM $DATA_PATH $HDFS_ADDRESS $HDFS_UGI $START_DATE_HR $END_DATE_HR $SPARSE_DIM $DATASET_PATH || die "config_resource: generate_yaml failed"
336 | upload_slot_conf $SLOT_CONF || die "config_resource: upload_slot_conf failed"
337 | return 0
338 | }
339 |
340 | function log()
341 | {
342 | echo "Trainer 0 Log:"
343 | kubectl logs fleet-ctr-demo-trainer-0 | grep __main__ > train.log
344 | if [ -f train.log ]; then
345 | tail -n 20 train.log
346 | else
347 | echo "Trainer Log Has not been generated"
348 | fi
349 | echo ""
350 | echo "File Server Log:"
351 | file_server_pod=$(kubectl get po | grep file-server | awk {'print $1'})
352 | kubectl logs ${file_server_pod} | grep __main__ > file-server.log
353 | if [ -f file-server.log ]; then
354 | tail -n 20 file-server.log
355 | else
356 | echo "File Server Log Has not been generated"
357 | fi
358 | echo ""
359 | echo "Cube Transfer Log:"
360 | kubectl logs cube-transfer | grep "all reload ok" > cube-transfer.log
361 | if [ -f cube-transfer.log ]; then
362 | tail -n 20 cube-transfer.log
363 | else
364 | echo "Cube Transfer Log Has not been generated"
365 | fi
366 | echo ""
367 | #echo "Padddle Serving Log:"
368 | #serving_pod=$(kubectl get po | grep paddleserving | awk {'print $1'})
369 | #kubectl logs ${serving_pod} | grep __INFO__ > paddleserving.log
370 | #if [ -f paddleserving.log ]; then
371 | # tail -n 20 paddleserving.log
372 | #else
373 | # echo "PaddleServing Log Has not been generated"
374 | #fi
375 | }
376 |
377 | datafile_config()
378 | {
379 | source $DATA_CONF_PATH
380 | }
381 |
382 | function apply()
383 | {
384 | echo "Waiting for pod..."
385 | check_tools kubectl
386 | install_volcano
387 | kubectl get pod | grep cube | awk {'print $1'} | xargs kubectl delete pod >/dev/null 2>&1
388 | kubectl get pod | grep paddleserving | awk {'print $1'} | xargs kubectl delete pod >/dev/null 2>&1
389 | kubectl apply -f cube.yaml
390 | kubectl apply -f transfer.yaml
391 | kubectl apply -f pdserving.yaml
392 |
393 | kubectl get jobs.batch.volcano.sh | grep fleet-ctr-demo
394 | if [ $? == 0 ]; then
395 | kubectl delete jobs.batch.volcano.sh fleet-ctr-demo
396 | fi
397 | kubectl apply -f fleet-ctr.yaml
398 | # python3 listen.py &
399 | # echo "waiting for mlflow..."
400 | # python3 service.py
401 | return
402 | }
403 |
404 | ###############################################################################
405 | # Main logic begin
406 | ###############################################################################
407 |
408 | CMD=""
409 | CPU=2
410 | MEM=4
411 | CUBE=2
412 | TRAINER=2
413 | PSERVER=2
414 | DATA_PATH="/app"
415 | SLOT_CONF="./slot.conf"
416 | VERBOSE=0
417 | DATA_CONF_PATH="./data.config"
418 | source $DATA_CONF_PATH
419 |
420 | # Parse arguments
421 | TEMP=`getopt -n elastic-control -o crahu:m:t:p:b:f:s:vl --longoptions config_client,config_resource,apply,help,cpu:,mem:,trainer:,pserver:,cube:,datafile:,slot_conf:,verbose,log -- "$@"`
422 |
423 | # Die if they fat finger arguments, this program will be run as root
424 | [ $? = 0 ] || die "Error parsing arguments. Try $0 --help"
425 |
426 | eval set -- "$TEMP"
427 | while true; do
428 | case $1 in
429 | -c|--config_client)
430 | CMD="config_client"; shift; continue
431 | ;;
432 | -r|--config_resource)
433 | CMD="config_resource"; shift; continue
434 | ;;
435 | -a|--apply)
436 | CMD="apply"; shift; continue
437 | ;;
438 | -h|--help)
439 | help
440 | exit 0
441 | ;;
442 | -l|--log)
443 | log; shift;
444 | exit 0
445 | ;;
446 | -u|--cpu)
447 | CPU="$2"; shift; shift; continue
448 | ;;
449 | -m|--mem)
450 | MEM="$2"; shift; shift; continue
451 | ;;
452 | -t|--trainer)
453 | TRAINER="$2"; shift; shift; continue
454 | ;;
455 | -p|--pserver)
456 | PSERVER="$2"; shift; shift; continue
457 | ;;
458 | -b|--cube)
459 | CUBE="$2"; shift; shift; continue
460 | ;;
461 | -f|--datafile)
462 | DATA_CONF_PATH="$2"; datafile_config ; shift; shift; continue
463 | ;;
464 | -s|--slot_conf)
465 | SLOT_CONF="$2"; shift; shift; continue
466 | ;;
467 | -v|--verbose)
468 | VERBOSE=1; shift; continue
469 | ;;
470 | --)
471 | # no more arguments to parse
472 | break
473 | ;;
474 | *)
475 | printf "Unknown option %s\n" "$1"
476 | exit 1
477 | ;;
478 | esac
479 | done
480 |
481 | if [ "$CMD" = "config_resource" ]; then
482 |
483 | if ! grep '^[[:digit:]]*$' <<< "$CPU" >> /dev/null || [ $CPU -lt 1 ] || [ $CPU -gt 4 ]; then
484 | die "Invalid CPU Num, should be greater than 0 and less than 5."
485 | fi
486 |
487 | if ! grep '^[[:digit:]]*$' <<< "$MEM" >> /dev/null || [ $MEM -lt 1 ] || [ $MEM -gt 4 ]; then
488 | die "Invalid MEM Num, should be greater than 0 and less than 5."
489 | fi
490 |
491 | if ! grep '^[[:digit:]]*$' <<< "$PSERVER" >> /dev/null || [ $PSERVER -lt 1 ] || [ $PSERVER -gt 9 ]; then
492 | die "Invalid PSERVER Num, should be greater than 0 and less than 10."
493 | fi
494 |
495 | if ! grep '^[[:digit:]]*$' <<< "$TRAINER" >> /dev/null || [ $TRAINER -lt 1 ] || [ $TRAINER -gt 9 ]; then
496 | die "Invalid TRAINER Num, should be greater than 0 and less than 10."
497 | fi
498 |
499 | if ! grep '^[[:digit:]]*$' <<< "$CUBE" >> /dev/null || [ $CUBE -lt 1 ] || [ $CUBE -gt 9 ]; then
500 | die "Invalid CUBE Num, should be greater than 0 and less than 10."
501 | fi
502 | fi
503 |
504 |
505 | case $CMD in
506 | config_resource)
507 | config_resource
508 | ;;
509 | config_client)
510 | config_client
511 | ;;
512 | apply)
513 | apply
514 | ;;
518 | *)
519 | help
520 | ;;
521 | esac
522 |
--------------------------------------------------------------------------------
/elastic-ctr-cli/fileserver.yaml.template:
--------------------------------------------------------------------------------
1 | apiVersion: apps/v1beta1
2 | kind: Deployment
3 | metadata:
4 | name: file-server
5 | labels:
6 | app: file-server
7 | spec:
8 | replicas: 1
9 | template:
10 | metadata:
11 | name: file-server
12 | labels:
13 | app: file-server
14 | spec:
15 | containers:
16 | - name: file-server
17 | image: hub.baidubce.com/ctr/file-server:latest
18 | imagePullPolicy: Always
19 | ports:
20 | - containerPort: 8080
21 | command: ['bash']
22 | args: ['run.sh']
23 | env:
24 | - name: NAMESPACE
25 | valueFrom:
26 | fieldRef:
27 | apiVersion: v1
28 | fieldPath: metadata.namespace
29 | - name: POD_IP
30 | valueFrom:
31 | fieldRef:
32 | apiVersion: v1
33 | fieldPath: status.podIP
34 | - name: POD_NAME
35 | valueFrom:
36 | fieldRef:
37 | apiVersion: v1
38 | fieldPath: metadata.name
39 | - name: PADDLE_CURRENT_IP
40 | valueFrom:
41 | fieldRef:
42 | apiVersion: v1
43 | fieldPath: status.podIP
44 | - name: JAVA_HOME
45 | value: /usr/local/jdk1.8.0_231
46 | - name: HADOOP_HOME
47 | value: /usr/local/hadoop-2.8.5
50 | - name: DATASET_PATH
51 | value: "<$ DATASET_PATH $>"
52 | - name: HDFS_ADDRESS
53 | value: "<$ HDFS_ADDRESS $>"
54 | - name: HDFS_UGI
55 | value: "<$ HDFS_UGI $>"
56 | - name: PATH
57 | value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/jdk1.8.0_231/bin:/usr/local/hadoop-2.8.5/bin:/Python-3.7.0:/node-v12.13.1-linux-x64/bin
58 |
59 | ---
60 | kind: Service
61 | apiVersion: v1
62 | metadata:
63 | name: file-server
64 | spec:
65 | type: LoadBalancer
66 | ports:
67 | - name: file-server
68 | port: 8080
69 | targetPort: 8080
70 | - name: upload
71 | port: 9000
72 | targetPort: 9000
73 | selector:
74 | app: file-server
75 |
--------------------------------------------------------------------------------
/elastic-ctr-cli/fleet-ctr.yaml.template:
--------------------------------------------------------------------------------
1 | apiVersion: batch.volcano.sh/v1alpha1
2 | kind: Job
3 | metadata:
4 | name: fleet-ctr-demo
5 | spec:
6 | minAvailable: <$ TOTAL_POD_NUM $>
7 | schedulerName: volcano
8 | policies:
9 | - event: PodEvicted
10 | action: RestartJob
11 | - event: PodFailed
12 | action: RestartJob
13 | tasks:
14 | - replicas: <$ PSERVER_NUM $>
15 | name: pserver
16 | template:
17 | metadata:
18 | labels:
19 | paddle-job-pserver: fluid-ctr
20 | spec:
21 | imagePullSecrets:
22 | - name: default-secret
23 | containers:
24 | - image: hub.baidubce.com/ctr/fleet-ctr:latest
25 | command:
26 | - paddle_k8s
27 | - start_fluid
28 | imagePullPolicy: Always
29 | name: preserver
30 | resources:
31 | limits:
32 | cpu: 10
33 | memory: 30Gi
34 | ephemeral-storage: 10Gi
35 | requests:
36 | cpu: 1
37 | memory: 100M
38 | ephemeral-storage: 1Gi
39 | env:
40 | - name: GLOG_v
41 | value: "0"
42 | - name: GLOG_logtostderr
43 | value: "1"
44 | - name: TOPOLOGY
45 | value: ""
46 | - name: TRAINER_PACKAGE
47 | value: /workspace
48 | - name: PADDLE_INIT_NICS
49 | value: eth2
50 | - name: NAMESPACE
51 | valueFrom:
52 | fieldRef:
53 | apiVersion: v1
54 | fieldPath: metadata.namespace
55 | - name: POD_IP
56 | valueFrom:
57 | fieldRef:
58 | apiVersion: v1
59 | fieldPath: status.podIP
60 | - name: POD_NAME
61 | valueFrom:
62 | fieldRef:
63 | apiVersion: v1
64 | fieldPath: metadata.name
65 | - name: PADDLE_CURRENT_IP
66 | valueFrom:
67 | fieldRef:
68 | apiVersion: v1
69 | fieldPath: status.podIP
70 | - name: PADDLE_JOB_NAME
71 | value: fluid-ctr
72 | - name: PADDLE_IS_LOCAL
73 | value: "0"
74 | - name: PADDLE_TRAINERS_NUM
75 | value: "<$ TRAINER_NUM $>"
76 | - name: PADDLE_PSERVERS_NUM
77 | value: "<$ PSERVER_NUM $>"
78 | - name: FLAGS_rpc_deadline
79 | value: "36000000"
80 | - name: ENTRY
81 | value: cd workspace && python3 train_with_mlflow.py slot.conf
82 | - name: PADDLE_PORT
83 | value: "30240"
84 | - name: HDFS_ADDRESS
85 | value: "<$ HDFS_ADDRESS $>"
86 | - name: HDFS_UGI
87 | value: "<$ HDFS_UGI $>"
88 | - name: START_DATE_HR
89 | value: <$ START_DATE_HR $>
90 | - name: END_DATE_HR
91 | value: <$ END_DATE_HR $>
92 | - name: DATASET_PATH
93 | value: "<$ DATASET_PATH $>"
94 | - name: SPARSE_DIM
95 | value: "<$ SPARSE_DIM $>"
96 | - name: PADDLE_TRAINING_ROLE
97 | value: PSERVER
98 | - name: TRAINING_ROLE
99 | value: PSERVER
100 | restartPolicy: OnFailure
101 |
102 | - replicas: <$ TRAINER_NUM $>
103 | policies:
104 | - event: TaskCompleted
105 | action: CompleteJob
106 | name: trainer
107 | template:
108 | metadata:
109 | labels:
110 | paddle-job: fluid-ctr
111 | app: mlflow
112 | spec:
113 | imagePullSecrets:
114 | - name: default-secret
115 | containers:
116 | - image: hub.baidubce.com/ctr/fleet-ctr:latest
117 | command:
118 | - paddle_k8s
119 | - start_fluid
120 | imagePullPolicy: Always
121 | name: trainer
122 | resources:
123 | limits:
124 | cpu: 10
125 | memory: 30Gi
126 | ephemeral-storage: 10Gi
127 | requests:
128 | cpu: 1
129 | memory: 100M
130 | ephemeral-storage: 10Gi
131 | env:
132 | - name: GLOG_v
133 | value: "0"
134 | - name: GLOG_logtostderr
135 | value: "1"
136 | - name: TRAINER_PACKAGE
137 | value: /workspace
138 | - name: PADDLE_INIT_NICS
139 | value: eth2
140 | - name: CPU_NUM
141 | value: "2"
142 | - name: NAMESPACE
143 | valueFrom:
144 | fieldRef:
145 | apiVersion: v1
146 | fieldPath: metadata.namespace
147 | - name: POD_IP
148 | valueFrom:
149 | fieldRef:
150 | apiVersion: v1
151 | fieldPath: status.podIP
152 | - name: POD_NAME
153 | valueFrom:
154 | fieldRef:
155 | apiVersion: v1
156 | fieldPath: metadata.name
157 | - name: PADDLE_CURRENT_IP
158 | valueFrom:
159 | fieldRef:
160 | apiVersion: v1
161 | fieldPath: status.podIP
162 | - name: PADDLE_JOB_NAME
163 | value: fluid-ctr
164 | - name: PADDLE_IS_LOCAL
165 | value: "0"
166 | - name: FLAGS_rpc_deadline
167 | value: "36000000"
168 |           - name: PADDLE_PSERVERS_NUM
169 |             value: "<$ PSERVER_NUM $>"
170 |           - name: PADDLE_TRAINERS_NUM
171 |             value: "<$ TRAINER_NUM $>"
172 | - name: PADDLE_PORT
173 | value: "30240"
174 | - name: PADDLE_TRAINING_ROLE
175 | value: TRAINER
176 | - name: TRAINING_ROLE
177 | value: TRAINER
178 | - name: HDFS_ADDRESS
179 | value: "<$ HDFS_ADDRESS $>"
180 | - name: HDFS_UGI
181 | value: "<$ HDFS_UGI $>"
182 | - name: START_DATE_HR
183 | value: <$ START_DATE_HR $>
184 | - name: END_DATE_HR
185 | value: <$ END_DATE_HR $>
186 | - name: DATASET_PATH
187 | value: "<$ DATASET_PATH $>"
188 | - name: SPARSE_DIM
189 | value: "<$ SPARSE_DIM $>"
190 | - name: JAVA_HOME
191 | value: /usr/local/jdk1.8.0_231
192 | - name: HADOOP_HOME
193 | value: /usr/local/hadoop-2.8.5
194 | - name: PATH
195 | value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/jdk1.8.0_231/bin:/usr/local/hadoop-2.8.5/bin:/Python-3.7.0
196 | - name: ENTRY
197 | value: cd workspace && python3 train_with_mlflow.py slot.conf
198 | restartPolicy: OnFailure
199 |
--------------------------------------------------------------------------------
/elastic-ctr-cli/listen.py:
--------------------------------------------------------------------------------
1 | import time
2 | import os
3 | import socket
4 |
5 |
6 | def rewrite_yaml(path):
7 | for root , dirs, files in os.walk(path):
8 | for name in files:
9 | if name == "meta.yaml":
10 | if len(root.split("/mlruns")) != 2:
11 |                     print("Error: the path to the current directory must not itself contain a directory named mlruns")
12 | exit(0)
13 | cmd = "sed -i \"s#/workspace#" + root.split("/mlruns")[0] + "#g\" " + os.path.join(root, name)
14 | os.system(cmd)
15 |
16 | time.sleep(5)
17 | os.system("rm -rf ./mlruns >/dev/null 2>&1")
18 | while True:
19 | r = os.popen("kubectl get pod | grep fleet-ctr-demo-trainer-0 | awk {'print $3'}")
20 | info = r.readlines()
21 | if info == []:
22 | exit(0)
23 | for line in info:
24 | line = line.strip()
25 | if line == "Completed" or line == "Terminating":
26 | exit(0)
27 | os.system("kubectl cp fleet-ctr-demo-trainer-0:workspace/mlruns ./mlruns_temp >/dev/null 2>&1")
28 | if os.path.exists("./mlruns_temp"):
29 | os.system("rm -rf ./mlruns >/dev/null 2>&1")
30 | os.system("mv ./mlruns_temp ./mlruns >/dev/null 2>&1")
31 | rewrite_yaml(os.getcwd()+"/mlruns")
32 | time.sleep(30)
33 |
--------------------------------------------------------------------------------
/elastic-ctr-cli/pdserving.yaml:
--------------------------------------------------------------------------------
1 | apiVersion: apps/v1beta1
2 | kind: Deployment
3 | metadata:
4 | name: paddleserving
5 | labels:
6 | app: paddleserving
7 | spec:
8 | replicas: 1
9 | template:
10 | metadata:
11 | name: paddleserving
12 | labels:
13 | app: paddleserving
14 | spec:
15 | containers:
16 | - name: paddleserving
17 | image: hub.baidubce.com/ctr/paddleserving:latest
18 | imagePullPolicy: Always
19 | workingDir: /serving
20 | command: ['/bin/bash']
21 | args: ['run.sh']
22 | ports:
23 | - containerPort: 8010
24 | name: serving
25 | ---
26 | apiVersion: v1
27 | kind: Service
28 | metadata:
29 | name: paddleserving
30 | spec:
31 | type: LoadBalancer
32 | ports:
33 | - name: paddleserving
34 | port: 8010
35 | targetPort: 8010
36 | selector:
37 | app: paddleserving
38 |
39 |
--------------------------------------------------------------------------------
/elastic-ctr-cli/rolebind.yaml:
--------------------------------------------------------------------------------
1 | # replace suo with other namespace
2 | kind: ClusterRole
3 | apiVersion: rbac.authorization.k8s.io/v1
4 | metadata:
5 | name: suo
6 | namespace: suo
7 | rules:
8 | - apiGroups: [""]
9 | resources: ["pods"]
10 | verbs: ["get", "list", "watch"]
11 |
12 | ---
13 | kind: ClusterRoleBinding
14 | apiVersion: rbac.authorization.k8s.io/v1
15 | metadata:
16 | name: suo
17 | namespace: suo
18 | subjects:
19 | - kind: ServiceAccount
20 | name: default
21 | namespace: suo
22 | roleRef:
23 | kind: ClusterRole
24 | name: suo
25 | apiGroup: rbac.authorization.k8s.io
26 |
--------------------------------------------------------------------------------
/elastic-ctr-cli/service.py:
--------------------------------------------------------------------------------
1 | import time
2 | import os
3 | import socket
4 |
5 |
6 | def net_is_used(port, ip='0.0.0.0'):
7 | s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
8 | try:
9 | s.connect((ip, port))
10 | s.shutdown(2)
11 | print('Error: %s:%d is used' % (ip, port))
12 | return True
13 |     except Exception:
14 | #print('%s:%d is unused' % (ip, port))
15 | return False
16 |
17 | os.system("ps -ef | grep ${USER} | grep mlflow | awk {'print $2'} | xargs kill -9 >/dev/null 2>&1")
18 | os.system("ps -ef | grep ${USER} | grep gunicorn | awk {'print $2'} | xargs kill -9 >/dev/null 2>&1")
19 |
20 | while True:
21 | if os.path.exists("./mlruns") and not net_is_used(8111):
22 | os.system("mlflow server --default-artifact-root ./mlruns/0 --host 0.0.0.0 --port 8111 >/dev/null 2>&1 &")
23 | time.sleep(3)
24 | print("mlflow ready!")
25 | exit(0)
26 | time.sleep(30)
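   | 
   | # Usage sketch: run this alongside listen.py after starting a training job;
   | # once ./mlruns has been pulled locally, an MLflow UI is served on port 8111.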
27 |
--------------------------------------------------------------------------------
/elastic-ctr-cli/service_auto_port.py:
--------------------------------------------------------------------------------
1 | import time
2 | import os
3 | import socket
4 |
5 |
6 | def net_is_used(port, ip='0.0.0.0'):
7 | s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
8 | try:
9 | s.connect((ip, port))
10 | s.shutdown(2)
11 | print('Error: %s:%d is used' % (ip, port))
12 | return True
13 |     except Exception:
14 | #print('%s:%d is unused' % (ip, port))
15 | return False
16 |
17 | os.system("ps -ef | grep ${USER} | grep mlflow | awk {'print $2'} | xargs kill -9 >/dev/null 2>&1")
18 | os.system("ps -ef | grep ${USER} | grep gunicorn | awk {'print $2'} | xargs kill -9 >/dev/null 2>&1")
19 |
20 | current_port = 8100
21 | while True:
22 | if os.path.exists("./mlruns"):
23 | if not net_is_used(current_port):
24 | os.system("mlflow server --default-artifact-root ./mlruns/0 --host 0.0.0.0 --port " + str(current_port) + " >/dev/null 2>&1 &")
25 | time.sleep(3)
26 |             print("mlflow ready, started at port " + str(current_port) + "!")
27 | exit(0)
28 | else:
29 | current_port = current_port + 1
30 | continue
31 | else:
32 | time.sleep(30)
33 |
--------------------------------------------------------------------------------
/elastic-ctr-cli/slot.conf:
--------------------------------------------------------------------------------
1 | 0
2 | 1
3 | 2
4 | 3
5 | 4
6 | 5
7 | 6
8 | 7
9 | 8
10 | 9
11 | 10
12 | 11
13 | 12
14 | 13
15 | 14
16 | 15
17 | 16
18 | 17
19 | 18
20 | 19
21 | 20
22 | 21
23 | 22
24 | 23
25 | 24
26 | 25
27 | 26
28 |
--------------------------------------------------------------------------------
/elasticctr_arch.md:
--------------------------------------------------------------------------------
1 | 
2 | # ElasticCTR Overall Architecture
3 | 
4 | ![](doc/ElasticCTR.png)
5 | The modules are as follows:
6 | - Client: the client of the CTR-estimation task; before training, users can upload custom configuration files, and at prediction time it issues prediction requests.
7 | - file server: receives user-uploaded configuration files and stores models for Paddle Serving and Cube to consume.
8 | - trainer/pserver: the training phase uses PaddlePaddle's parameter-server mode, with trainer and pserver roles; distributed training uses Volcano for batch-job management.
9 | - MLFlow: the visualization module for training jobs, letting users inspect training progress directly.
10 | - HDFS: stores the user's data; the model files produced by training are also stored on HDFS.
11 | - cube-transfer: watches for model files produced by the upstream training job, pulls any new model locally, invokes cube-builder to build the cube dictionary files, tells the cube-agent nodes to fetch the latest dictionaries, and keeps versions consistent across the cube-servers.
12 | - cube-builder: converts the model files produced by training jobs into dictionary files loadable by cube-server; the dictionary format is highly optimized for size and in-memory access.
13 | - Cube-Server: a service node providing sharded key-value reads and writes.
14 | - Cube-agent: deployed on the same machine as cube-server; receives dictionary-update commands from cube-transfer, pulls the data locally, and notifies cube-server to update.
15 | - Paddle Serving: loads the CTR model's ProgramDesc and dense parameters and serves predictions.
16 | 
17 | The components above chain together the entire flow from training to serving deployment. The one-click script elastic-control.sh in this project deploys all of them. Users can take this deployment as a reference for applying PaddlePaddle distributed training and Serving in their own environments.
18 |
--------------------------------------------------------------------------------
/fleet-ctr/criteo_dataset.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 | import paddle.fluid.incubate.data_generator as dg
16 | import sys
17 | import os
18 |
19 | class CriteoDataset(dg.MultiSlotDataGenerator):
20 |
21 | def set_config(self, feature_names):
22 | self.feature_names = feature_names
23 | self.feature_dict = {}
24 | for idx, name in enumerate(self.feature_names):
25 | self.feature_dict[name] = idx
26 |
27 | def generate_sample(self, line):
28 | """
29 | Read the data line by line and process it as a dictionary
30 | """
31 | hash_dim_ = int(os.environ['SPARSE_DIM'])
32 | def reader():
33 | group = line.strip().split()
34 | label = int(group[1])
35 | click = group[0]
36 | features = []
37 | for i in range(len(self.feature_names)):
38 | features.append([])
39 | for fea_pair in group[2:]:
40 | feasign, slot = fea_pair.split(":")
41 | if slot not in self.feature_dict:
42 | continue
43 | features[self.feature_dict[slot]].append(int(feasign) % hash_dim_)
44 | for i in range(len(features)):
45 | if features[i] == []:
46 | features[i].append(0)
47 | features.append([label])
48 | yield zip(self.feature_names + ["label"], features)
49 | return reader
50 |
51 |
52 | d = CriteoDataset()
53 | feature_names = []
54 | with open(sys.argv[1]) as fin:
55 | for line in fin:
56 | feature_names.append(line.strip())
57 | d.set_config(feature_names)
58 | d.run_from_stdin()
59 |
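60 | # Illustrative usage: this generator runs as a dataset pipe command
61 | # ("python criteo_dataset.py slot.conf", cf. model_with_sparse_feature.py)
62 | # and reads samples from stdin. Each input line is whitespace-separated:
63 | #   <click> <label> <feasign>:<slot> <feasign>:<slot> ...
64 | # e.g. (made-up feature signs): 1 1 63412:0 87326:5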
--------------------------------------------------------------------------------
/fleet-ctr/criteo_dataset.py~:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 | import paddle.fluid.incubate.data_generator as dg
16 | import sys
17 | import os
18 |
19 | class CriteoDataset(dg.MultiSlotDataGenerator):
20 |
21 | def set_config(self, feature_names):
22 | self.feature_names = feature_names
23 | self.feature_dict = {}
24 | for idx, name in enumerate(self.feature_names):
25 | self.feature_dict[name] = idx
26 |
27 | def generate_sample(self, line):
28 | """
29 | Read the data line by line and process it as a dictionary
30 | """
31 | hash_dim_ = 400000001
32 | def reader():
33 | group = line.strip().split()
34 | label = int(group[1])
35 | click = group[0]
36 | features = []
37 | for i in range(len(self.feature_names)):
38 | features.append([])
39 | for fea_pair in group[2:]:
40 | feasign, slot = fea_pair.split(":")
41 | if slot not in self.feature_dict:
42 | continue
43 | features[self.feature_dict[slot]].append(int(feasign) % hash_dim_)
44 | for i in range(len(features)):
45 | if features[i] == []:
46 | features[i].append(0)
47 | features.append([label])
48 | yield zip(self.feature_names + ["label"], features)
49 | return reader
50 |
51 |
52 | d = CriteoDataset()
53 | feature_names = []
54 | with open(sys.argv[1]) as fin:
55 | for line in fin:
56 | feature_names.append(line.strip())
57 | d.set_config(feature_names)
58 | d.run_from_stdin()
59 |
--------------------------------------------------------------------------------
/fleet-ctr/criteo_reader.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | import os
16 | # There are 13 integer features and 26 categorical features
17 |
18 | class Dataset:
19 | def __init__(self):
20 | pass
21 |
22 |
23 | class CriteoDataset(Dataset):
24 | def __init__(self, feature_names):
25 | self.feature_names = feature_names
26 | self.feature_dict = {}
27 | for idx, name in enumerate(self.feature_names):
28 | self.feature_dict[name] = idx
29 |
30 | def _reader_creator(self, file_list, is_train, trainer_num, trainer_id):
31 | hash_dim_ = int(os.environ['SPARSE_DIM'])
32 | def reader():
33 | for file in file_list:
34 | with open(file, 'r') as f:
35 | line_idx = 0
36 | for line in f:
37 | line_idx += 1
38 | feature = line.rstrip('\n').split()
39 | features = []
40 | for i in range(len(self.feature_names)):
41 | features.append([])
42 |                             label = int(feature[1])
43 | for fea_pair in feature[2:]:
44 | #print(fea_pair)
45 | tmp_list = fea_pair.split(":")
46 | feasign, slot = tmp_list[0], tmp_list[1]
47 | if slot not in self.feature_dict:
48 | continue
49 | features[self.feature_dict[slot]].append(int(feasign) % hash_dim_)
50 | for i in range(len(features)):
51 | if features[i] == []:
52 | features[i].append(0)
53 | features.append([label])
54 | yield features
55 |
56 | return reader
57 |
58 | def train(self, file_list, trainer_num, trainer_id):
59 | return self._reader_creator(file_list, True, trainer_num, trainer_id)
60 |
61 | def test(self, file_list):
62 | return self._reader_creator(file_list, False, 1, 0)
63 |
64 | def infer(self, file_list):
65 | return self._reader_creator(file_list, False, 1, 0)
66 |
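67 | # Illustrative usage (cf. infer.py): build a batched reader over local files.
68 | #   dataset = CriteoDataset(feature_names)
69 | #   test_reader = paddle.batch(dataset.test(file_list), 1000)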
--------------------------------------------------------------------------------
/fleet-ctr/dataset_generator.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | import sys
15 | import paddle.fluid.incubate.data_generator as dg
16 | import os
17 | hash_dim_ = int(os.environ['SPARSE_DIM'])
18 | categorical_range_ = range(14, 40)
19 |
20 | class DacDataset(dg.MultiSlotDataGenerator):
21 | """
22 |     DacDataset: inherits MultiSlotDataGenerator and implements data reading.
23 | Help document: http://wiki.baidu.com/pages/viewpage.action?pageId=728820675
24 | """
25 |
26 |     def set_config(self, feature_names):
27 | self.feature_names = feature_names
28 | assert len(feature_names) < len(categorical_range_)
29 |
30 | def generate_sample(self, line):
31 | """
32 | Read the data line by line and process it as a dictionary
33 | """
34 |
35 | def reader():
36 | """
37 | This function needs to be implemented by the user, based on data format
38 | """
39 | features = line.rstrip('\n').split('\t')
40 | sparse_feature = []
41 | for idx in categorical_range_[:len(self.feature_names)]:
42 | sparse_feature.append(
43 | [hash(str(idx) + features[idx]) % hash_dim_])
44 | label = [int(features[0])]
45 | feature_name = []
46 | for i, idx in enumerate(categorical_range_[:len(self.feature_names)]):
47 | feature_name.append(self.feature_names[i])
48 | feature_name.append("label")
49 |
50 | yield zip(feature_name, sparse_feature + [label])
51 |
52 | return reader
53 |
54 |
55 | d = DacDataset()
56 | feature_names = []
57 | with open(sys.argv[1]) as fin:
58 | for line in fin:
59 | feature_names.append(line.strip())
60 | d.set_config(feature_names)
61 | d.run_from_stdin()
62 |
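63 | # Illustrative input: raw Criteo lines, i.e. tab-separated columns with the
64 | # label in column 0, 13 integer features, and the 26 categorical features in
65 | # the columns covered by categorical_range_ (14-39).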
--------------------------------------------------------------------------------
/fleet-ctr/infer.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 | import math
16 | import sys
17 | import os
18 | import paddle.fluid as fluid
19 | import numpy as np
20 | from criteo_reader import CriteoDataset
21 | import paddle
22 | from nets import ctr_dnn_model
23 |
24 | feature_names = []
25 | with open(sys.argv[1]) as fin:
26 | for line in fin:
27 | feature_names.append(line.strip())
28 |
29 | print(feature_names)
30 |
31 |
32 | sparse_feature_dim = 400000001
33 | embedding_size = 9
34 |
35 | sparse_input_ids = [fluid.layers.data(name= name, shape=[1], lod_level=1, dtype='int64')
36 | for name in feature_names]
37 | label = fluid.layers.data(name='label', shape=[1], dtype='int64')
38 | _words = sparse_input_ids + [label]
39 |
40 | exe = fluid.Executor(fluid.CPUPlace())
41 | exe.run(fluid.default_startup_program())
42 | input_folder = "../data/infer_data"
43 | files = os.listdir(input_folder)
44 | infer_filelist = ["{}/{}".format(input_folder, f) for f in files]
45 | print(infer_filelist)
46 |
47 | criteo_dataset = CriteoDataset(feature_names)
48 |
49 | startup_program = fluid.framework.default_main_program()
50 | test_program = fluid.framework.default_main_program()
51 | test_reader = paddle.batch(criteo_dataset.test(infer_filelist), 1000)
52 | _, auc_var, _ = ctr_dnn_model(embedding_size, sparse_input_ids, label, sparse_feature_dim)
53 | [inference_program, feed_target_names, fetch_targets] = fluid.io.load_inference_model(dirname="./saved_models/",executor=exe)
54 | with open('infer_programdesc', 'w+') as f:
55 | f.write(inference_program.to_string(True))
56 | def set_zero(var_name):
57 | param = fluid.global_scope().var(var_name).get_tensor()
58 | param_array = np.zeros(param._get_dims()).astype("int64")
59 | param.set(param_array, fluid.CPUPlace())
60 |
61 | auc_states_names = ['_generated_var_2', '_generated_var_3']
62 | for name in auc_states_names:
63 | set_zero(name)
64 | inputs = _words
65 | feeder = fluid.DataFeeder(feed_list = inputs, place = fluid.CPUPlace())
66 | for batch_id, data in enumerate(test_reader()):
67 | auc_val = exe.run(inference_program,
68 | feed=feeder.feed(data),
69 | fetch_list=fetch_targets)
70 | print(auc_val)
71 |
72 |
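73 | # Illustrative invocation (assumes ../data/infer_data and ./saved_models/ exist;
74 | # slot.conf lists one slot name per line, and SPARSE_DIM is required by
75 | # criteo_reader.py):
76 | #   SPARSE_DIM=400000001 python infer.py slot.conf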
--------------------------------------------------------------------------------
/fleet-ctr/ip_config.sh:
--------------------------------------------------------------------------------
1 | #! /bin/bash
2 | export SERVER0=192.168.1.2
3 | export SERVER1=192.168.1.5
4 |
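5 | # Usage example: source this file (`source ip_config.sh`) before running
6 | # pserver0.sh/pserver1.sh and trainer0.sh/trainer1.sh, which read ${SERVER0}
7 | # and ${SERVER1} to build PADDLE_PSERVERS_IP_PORT_LIST.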
--------------------------------------------------------------------------------
/fleet-ctr/k8s_tools.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 | #!/bin/env python
16 | import os
17 | import sys
18 | import time
19 | import socket
20 | from kubernetes import client, config
21 | NAMESPACE = os.getenv("NAMESPACE")
22 | if os.getenv("KUBERNETES_SERVICE_HOST", None):
23 | config.load_incluster_config()
24 | else:
25 | config.load_kube_config()
26 | v1 = client.CoreV1Api()
27 |
28 |
29 | def get_pod_status(item):
30 | phase = item.status.phase
31 |
32 | # check terminate time although phase is Running.
33 | if item.metadata.deletion_timestamp != None:
34 | return "Terminating"
35 |
36 | return phase
37 |
38 |
39 | def containers_all_ready(label_selector):
40 | def container_statuses_ready(item):
41 | container_statuses = item.status.container_statuses
42 |
43 | for status in container_statuses:
44 | if not status.ready:
45 | return False
46 | return True
47 |
48 | api_response = v1.list_namespaced_pod(
49 | namespace=NAMESPACE, pretty=True, label_selector=label_selector)
50 |
51 | for item in api_response.items:
52 | if not container_statuses_ready(item):
53 | return False
54 |
55 | return True
56 |
57 |
58 | def fetch_pods_info(label_selector, phase=None):
59 | api_response = v1.list_namespaced_pod(
60 | namespace=NAMESPACE, pretty=True, label_selector=label_selector)
61 | pod_list = []
62 | for item in api_response.items:
63 | if phase is not None and get_pod_status(item) != phase:
64 | continue
65 |
66 | pod_list.append((item.status.phase, item.status.pod_ip, item.metadata.name))
67 | return pod_list
68 |
69 |
70 | def wait_pods_running(label_selector, desired):
71 | print "label selector: %s, desired: %s" % (label_selector, desired)
72 | while True:
73 | count = count_pods_by_phase(label_selector, 'Running')
74 | # NOTE: pods may be scaled.
75 | if count >= int(desired):
76 | break
77 | print 'current cnt: %d sleep for 5 seconds...' % count
78 | time.sleep(5)
79 |
80 |
81 | def wait_containers_ready(label_selector):
82 | print "label selector: %s, wait all containers ready" % (label_selector)
83 | while True:
84 | if containers_all_ready(label_selector):
85 | break
86 | print 'not all containers ready, sleep for 5 seconds...'
87 | time.sleep(5)
88 |
89 |
90 | def count_pods_by_phase(label_selector, phase):
91 | pod_list = fetch_pods_info(label_selector, phase)
92 | return len(pod_list)
93 |
94 |
95 | def fetch_ips_list(label_selector, phase=None):
96 | pod_list = fetch_pods_info(label_selector, phase)
97 | ips = [item[1] for item in pod_list]
98 | ips.sort()
99 | return ips
100 |
101 | def fetch_name_list(label_selector, phase=None):
102 | pod_list = fetch_pods_info(label_selector, phase)
103 | names = [item[2] for item in pod_list]
104 | names.sort()
105 | return names
106 |
107 |
108 | def fetch_ips_string(label_selector, phase=None):
109 | ips = fetch_ips_list(label_selector, phase)
110 | return ",".join(ips)
111 |
112 |
113 | def fetch_endpoints_string(label_selector, port, phase=None, sameport=True):
114 | ips = fetch_ips_list(label_selector, phase)
115 | if sameport:
116 | ips = ["{0}:{1}".format(ip, port) for ip in ips]
117 | else:
118 | srcips = ips
119 | ips = []
120 | port = int(port)
121 | for ip in srcips:
122 | ips.append("{0}:{1}".format(ip, port))
123 | port = port + 1
124 | return ",".join(ips)
125 |
126 |
127 | def fetch_pod_id(label_selector, phase=None, byname=True):
128 | if byname:
129 | names = fetch_name_list(label_selector, phase=phase)
130 |
131 | local_name = os.getenv('POD_NAME')
132 | for i in xrange(len(names)):
133 | if names[i] == local_name:
134 | return i
135 |
136 | return None
137 | else:
138 | ips = fetch_ips_list(label_selector, phase=phase)
139 |
140 | local_ip = socket.gethostbyname(socket.gethostname())
141 | for i in xrange(len(ips)):
142 | if ips[i] == local_ip:
143 | return i
144 |
145 | # in minikube there can be one node only
146 | local_ip = os.getenv("POD_IP")
147 | for i in xrange(len(ips)):
148 | if ips[i] == local_ip:
149 | return i
150 |
151 | return None
152 |
153 |
154 | def fetch_ips(label_selector):
155 | return fetch_ips_string(label_selector, phase="Running")
156 |
157 |
158 | def fetch_endpoints(label_selector, port):
159 | return fetch_endpoints_string(label_selector, port=port, phase="Running", sameport=False)
160 |
161 |
162 | def fetch_id(label_selector):
163 | return fetch_pod_id(label_selector, phase="Running")
164 |
165 |
166 | if __name__ == "__main__":
167 | command = sys.argv[1]
168 | if command == "fetch_ips":
169 | print fetch_ips(sys.argv[2])
170 | if command == "fetch_ips_string":
171 | print fetch_ips_string(sys.argv[2], sys.argv[3])
172 | elif command == "fetch_endpoints":
173 | print fetch_endpoints(sys.argv[2], sys.argv[3])
174 | elif command == "fetch_id":
175 | print fetch_id(sys.argv[2])
176 | elif command == "count_pods_by_phase":
177 | print count_pods_by_phase(sys.argv[2], sys.argv[3])
178 | elif command == "wait_pods_running":
179 | wait_pods_running(sys.argv[2], sys.argv[3])
180 | elif command == "wait_containers_ready":
181 | wait_containers_ready(sys.argv[2])
182 |
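183 | # Example invocations, as issued by paddle_k8s:
184 | #   python k8s_tools.py wait_pods_running paddle-job-pserver=${PADDLE_JOB_NAME} ${PADDLE_PSERVERS_NUM}
185 | #   python k8s_tools.py fetch_endpoints paddle-job-pserver=${PADDLE_JOB_NAME} ${PADDLE_PORT}
186 | #   python k8s_tools.py fetch_id paddle-job=${PADDLE_JOB_NAME}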
--------------------------------------------------------------------------------
/fleet-ctr/mlflow_run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | if [[ $PADDLE_TRAINER_ID -ne 0 ]] ; then
3 |   echo "PADDLE_TRAINER_ID is not 0 (or unset); MLflow runs on trainer 0 only."
4 | exit 0
5 | fi
6 |
7 | while true ; do
8 |     echo "Still waiting for CTR training setup"
9 | sleep 10
10 | if [ -d "./mlruns/0" ] ;then
11 | mlflow server --default-artifact-root ./mlruns/0 --host 0.0.0.0 --port 8111
12 | fi
13 | done
14 | exit 0
15 |
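16 | # Note: ./mlruns/0 is created by mlflow.start_run() in train_with_mlflow.py on
17 | # the first worker; this loop waits for it and then serves the MLflow UI.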
--------------------------------------------------------------------------------
/fleet-ctr/model_with_sparse_feature.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 |
16 | import math
17 | import sys
18 | import os
19 | import paddle.fluid as fluid
20 | import subprocess as sp
21 | from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
22 | from paddle.fluid.incubate.fleet.base import role_maker
23 | from paddle.fluid.transpiler.distribute_transpiler import DistributeTranspilerConfig
24 |
25 | file_server_addr = os.environ["FILE_SERVER_SERVICE_HOST"] + ":"+ os.environ["FILE_SERVER_SERVICE_PORT"]
26 | os.system('wget '+ file_server_addr + '/slot.conf')
27 |
28 | feature_names = []
29 | with open(sys.argv[1]) as fin:
30 | for line in fin:
31 | feature_names.append(line.strip())
32 |
33 | print(feature_names)
34 |
35 | sparse_input_ids = [
36 | fluid.layers.data(name=name, shape=[1], lod_level=1, dtype='int64')
37 | for name in feature_names]
38 |
39 | label = fluid.layers.data(
40 | name='label', shape=[1], dtype='int64')
41 |
42 | sparse_feature_dim = 100000001
43 | embedding_size = 9
44 |
45 |
46 | def embedding_layer(input):
47 | emb = fluid.layers.embedding(
48 | input=input,
49 | is_sparse=True,
50 | is_distributed=False,
51 | size=[sparse_feature_dim, embedding_size],
52 | param_attr=fluid.ParamAttr(name="SparseFeatFactors",
53 | initializer=fluid.initializer.Uniform()))
54 | emb_sum = fluid.layers.sequence_pool(input=emb, pool_type='sum')
55 | return emb_sum
56 |
57 | emb_sums = list(map(embedding_layer, sparse_input_ids))
58 | concated = fluid.layers.concat(emb_sums, axis=1)
59 | fc1 = fluid.layers.fc(input=concated, size=400, act='relu',
60 | param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
61 | scale=1 / math.sqrt(concated.shape[1]))))
62 | fc2 = fluid.layers.fc(input=fc1, size=400, act='relu',
63 | param_attr=fluid.ParamAttr(
64 | initializer=fluid.initializer.Normal(
65 | scale=1 / math.sqrt(fc1.shape[1]))))
66 | fc3 = fluid.layers.fc(input=fc2, size=400, act='relu',
67 | param_attr=fluid.ParamAttr(
68 | initializer=fluid.initializer.Normal(
69 | scale=1 / math.sqrt(fc2.shape[1]))))
70 |
71 | predict = fluid.layers.fc(input=fc3, size=2, act='softmax',
72 | param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
73 | scale=1 / math.sqrt(fc3.shape[1]))))
74 |
75 | cost = fluid.layers.cross_entropy(input=predict, label=label)
76 | avg_cost = fluid.layers.reduce_sum(cost)
77 | accuracy = fluid.layers.accuracy(input=predict, label=label)
78 | auc_var, batch_auc_var, auc_states = \
79 | fluid.layers.auc(input=predict, label=label, num_thresholds=2 ** 12, slide_steps=20)
80 |
81 | dataset = fluid.DatasetFactory().create_dataset()
82 | dataset.set_use_var(sparse_input_ids + [label])
83 | pipe_command = "python criteo_dataset.py {}".format(sys.argv[1])
84 | dataset.set_pipe_command(pipe_command)
85 | dataset.set_batch_size(32)
86 | dataset.set_thread(10)
87 | dataset.set_hdfs_config("hdfs://192.168.48.87:9000", "root,")
88 | optimizer = fluid.optimizer.SGD(0.0001)
89 | #optimizer.minimize(avg_cost)
90 | exe = fluid.Executor(fluid.CPUPlace())
91 |
92 | input_folder = "hdfs:"
93 | output = sp.check_output("hdfs dfs -ls /train_data | awk '{if(NR>1) print $8}'", shell=True)
94 | train_filelist = ["{}{}".format(input_folder, f) for f in output.decode('ascii').strip().split('\n')]
95 | role = role_maker.PaddleCloudRoleMaker()
96 | fleet.init(role)
97 |
98 |
99 | config = DistributeTranspilerConfig()
100 | config.sync_mode = False
101 |
102 | optimizer = fleet.distributed_optimizer(optimizer, config)
103 | optimizer.minimize(avg_cost)
104 |
105 |
106 | if fleet.is_server():
107 | fleet.init_server()
108 | fleet.run_server()
109 | elif fleet.is_worker():
110 | place = fluid.CPUPlace()
111 | exe = fluid.Executor(place)
112 | fleet.init_worker()
113 | exe.run(fluid.default_startup_program())
114 | print("startup program done.")
115 | fleet_filelist = fleet.split_files(train_filelist)
116 | dataset.set_filelist(fleet_filelist)
117 | exe.train_from_dataset(
118 | program=fluid.default_main_program(),
119 | dataset=dataset,
120 | fetch_list=[auc_var],
121 | fetch_info=["auc"],
122 | debug=True)
123 | print("end .... ")
124 | # save model here
125 |
126 |
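127 | # A minimal sketch (an assumption, not part of the original script) for the
128 | # "# save model here" step above, mirroring what train_with_mlflow.py does:
129 | #   fleet.save_inference_model(exe, "./saved_models/",
130 | #       [x.name for x in sparse_input_ids] + [label.name], [auc_var])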
--------------------------------------------------------------------------------
/fleet-ctr/nets.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import argparse
4 | import logging
5 | import os
6 | import time
7 | import math
8 | import numpy as np
9 | import six
10 | import paddle
11 | import paddle.fluid as fluid
12 | import paddle.fluid.proto.framework_pb2 as framework_pb2
13 | import paddle.fluid.core as core
14 |
15 | from multiprocessing import cpu_count
16 |
17 | def ctr_dnn_model(embedding_size, sparse_input_ids, label,
18 | sparse_feature_dim):
19 | def embedding_layer(input):
20 | emb = fluid.layers.embedding(
21 | input=input,
22 | is_sparse=True,
23 | is_distributed=False,
24 | size=[sparse_feature_dim, embedding_size],
25 | param_attr=fluid.ParamAttr(name="SparseFeatFactors",
26 | initializer=fluid.initializer.Uniform()))
27 | emb_sum = fluid.layers.sequence_pool(input=emb, pool_type='sum')
28 | return emb_sum
29 |
30 | emb_sums = list(map(embedding_layer, sparse_input_ids))
31 | concated = fluid.layers.concat(emb_sums, axis=1)
32 | fc1 = fluid.layers.fc(input=concated, size=400, act='relu',
33 | param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
34 | scale=1 / math.sqrt(concated.shape[1]))))
35 | fc2 = fluid.layers.fc(input=fc1, size=400, act='relu',
36 | param_attr=fluid.ParamAttr(
37 | initializer=fluid.initializer.Normal(
38 | scale=1 / math.sqrt(fc1.shape[1]))))
39 | fc3 = fluid.layers.fc(input=fc2, size=400, act='relu',
40 | param_attr=fluid.ParamAttr(
41 | initializer=fluid.initializer.Normal(
42 | scale=1 / math.sqrt(fc2.shape[1]))))
43 |
44 | predict = fluid.layers.fc(input=fc3, size=2, act='softmax',
45 | param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
46 | scale=1 / math.sqrt(fc3.shape[1]))))
47 |
48 | cost = fluid.layers.cross_entropy(input=predict, label=label)
49 | avg_cost = fluid.layers.reduce_sum(cost)
50 | accuracy = fluid.layers.accuracy(input=predict, label=label)
51 | auc_var, batch_auc_var, auc_states = \
52 | fluid.layers.auc(input=predict, label=label, num_thresholds=2 ** 12, slide_steps=20)
53 | return avg_cost, auc_var, batch_auc_var
54 |
55 |
--------------------------------------------------------------------------------
/fleet-ctr/paddle_k8s:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | set -x
3 | start_pserver() {
4 | stdbuf -oL paddle pserver \
5 | --use_gpu=0 \
6 | --port=$PADDLE_INIT_PORT \
7 | --ports_num=$PADDLE_INIT_PORTS_NUM \
8 | --ports_num_for_sparse=$PADDLE_INIT_PORTS_NUM_FOR_SPARSE \
9 | --nics=$PADDLE_INIT_NICS \
10 | --comment=paddle_process_k8s \
11 | --num_gradient_servers=$PADDLE_INIT_NUM_GRADIENT_SERVERS
12 | }
13 |
14 | start_new_pserver() {
15 | master_label="paddle-job-master=${PADDLE_JOB_NAME}"
16 |
17 | stdbuf -oL python /root/k8s_tools.py wait_pods_running ${master_label} 1
18 | export MASTER_IP=$(python /root/k8s_tools.py fetch_ips ${master_label})
19 | stdbuf -oL /usr/bin/pserver \
20 | -port=$PADDLE_INIT_PORT \
21 | -num-pservers=$PSERVERS \
22 | -log-level=debug \
23 | -etcd-endpoint=http://$MASTER_IP:2379
24 | }
25 |
26 | start_master() {
27 | stdbuf -oL /usr/bin/master \
28 | -port=8080 \
29 | -chunk-per-task=1\
30 | -task-timout-dur=16s\
31 | -endpoints=http://127.0.0.1:2379
32 | }
33 |
34 | check_failed_cnt() {
35 | max_failed=$1
36 | failed_count=$(python /root/k8s_tools.py count_pods_by_phase paddle-job=${PADDLE_JOB_NAME} Failed)
37 | if [ $failed_count -gt $max_failed ]; then
38 |     stdbuf -oL echo "Failed trainer count beyond the threshold: "$max_failed
39 | echo "Failed trainer count beyond the threshold: " $max_failed > /dev/termination-log
40 | exit 0
41 | fi
42 | }
43 |
44 | check_trainer_ret() {
45 | ret=$1
46 | stdbuf -oL echo "job returned $ret...setting pod return message..."
47 | stdbuf -oL echo "==============================="
48 |
49 | if [ $ret -eq 136 ] ; then
50 | echo "Error Arithmetic Operation(Floating Point Exception)" > /dev/termination-log
51 | elif [ $ret -eq 139 ] ; then
52 | echo "Segmentation Fault" > /dev/termination-log
53 | elif [ $ret -eq 1 ] ; then
54 | echo "General Error" > /dev/termination-log
55 | elif [ $ret -eq 134 ] ; then
56 | echo "Program Abort" > /dev/termination-log
57 | fi
58 |   stdbuf -oL echo "termination log written..."
59 | exit $ret
60 | }
61 |
62 | start_fluid_process() {
63 | pserver_label="paddle-job-pserver=${PADDLE_JOB_NAME}"
64 | trainer_label="paddle-job=${PADDLE_JOB_NAME}"
65 | hostname=${HOSTNAME}
66 | task_index=""
67 |
68 | if [ "${PADDLE_TRAINING_ROLE}" == "TRAINER" ] || [ "${PADDLE_TRAINING_ROLE}" == "PSERVER" ]; then
69 | stdbuf -oL python /root/k8s_tools.py wait_pods_running ${pserver_label} ${PADDLE_PSERVERS_NUM}
70 | fi
71 |
72 | if [ "${PADDLE_TRAINING_ROLE}" == "TRAINER" ] || [ "${PADDLE_TRAINING_ROLE}" == "WORKER" ]; then
73 | stdbuf -oL python /root/k8s_tools.py wait_pods_running ${trainer_label} ${PADDLE_TRAINERS_NUM}
74 | fi
75 |
76 | export PADDLE_PSERVERS=$(python /root/k8s_tools.py fetch_endpoints ${pserver_label} ${PADDLE_PORT})
77 | export PADDLE_TRAINER_IPS=$(python /root/k8s_tools.py fetch_ips ${trainer_label})
78 |
79 | if [ "${PADDLE_TRAINING_ROLE}" == "TRAINER" ] || [ "${PADDLE_TRAINING_ROLE}" == "WORKER" ]; then
80 | check_failed_cnt 1
81 | task_index=$(python /root/k8s_tools.py fetch_id ${trainer_label})
82 | else
83 | task_index=$(python /root/k8s_tools.py fetch_id ${pserver_label})
84 | fi
85 |
86 | export PADDLE_TRAINER_ID=${task_index}
87 | export PADDLE_PSERVER_ID=${task_index}
88 |
89 | stdbuf -oL sh -c "${ENTRY}"
90 | check_trainer_ret $?
91 | }
92 |
93 | # start_tf_benchmark_process is only used for benchmarking
94 | start_tf_benchmark_process() {
95 | # re-use the paddle job labels
96 | pserver_label="paddle-job-pserver=${PADDLE_JOB_NAME}"
97 | trainer_label="paddle-job=${PADDLE_JOB_NAME}"
98 | task_index=""
99 |
100 | export PADDLE_INIT_PSERVERS=$(python /root/k8s_tools.py fetch_ips ${pserver_label} ${PADDLE_INIT_PORT})
101 | export PADDLE_WORKERS=$(python /root/k8s_tools.py fetch_ips ${trainer_label})
102 | export PADDLE_TRAINER_IPS=$(python /root/k8s_tools.py fetch_ips ${trainer_label})
103 | export TF_WORKER_EPS=$(python /root/k8s_tools.py fetch_ips ${trainer_label} ${TF_WORKER_PORT})
104 |
105 | if [ "${TRAINING_ROLE}" == "TRAINER" ]; then
106 | check_failed_cnt 1
107 | task_index=$(python /root/k8s_tools.py fetch_id ${trainer_label})
108 | export TF_ROLE=worker
109 | else
110 | task_index=$(python /root/k8s_tools.py fetch_id ${pserver_label})
111 | export TF_ROLE=ps
112 | fi
113 |
114 | export PADDLE_INIT_TRAINER_ID=${task_index}
115 | export PADDLE_TRAINER_ID=${task_index}
116 |
117 | stdbuf -oL sh -c "${ENTRY}"
118 | check_trainer_ret $?
119 | }
120 |
121 | start_new_trainer() {
122 | # FIXME(Yancey1989): use command-line interface to configure the max failed count
123 | check_failed_cnt ${TRAINERS}
124 |
125 | master_label="paddle-job-master=${PADDLE_JOB_NAME}"
126 | pserver_label="paddle-job-pserver=${PADDLE_JOB_NAME}"
127 |
128 | stdbuf -oL python /root/k8s_tools.py wait_pods_running ${pserver_label} ${PSERVERS}
129 | sleep 5
130 | stdbuf -oL python /root/k8s_tools.py wait_pods_running ${master_label} 1
131 | export MASTER_IP=$(python /root/k8s_tools.py fetch_ips ${master_label})
132 | export ETCD_IP="$MASTER_IP"
133 |
134 | # NOTE: $TRAINER_PACKAGE may be large, do not copy
135 | export PYTHONPATH=$TRAINER_PACKAGE:$PYTHONPATH
136 | cd $TRAINER_PACKAGE
137 |
138 | stdbuf -oL echo "Starting training job: " $TRAINER_PACKAGE, "num_gradient_servers:" \
139 | $PADDLE_INIT_NUM_GRADIENT_SERVERS, "version: " $1
140 |
141 | stdbuf -oL sh -c "${ENTRY}"
142 | check_trainer_ret $?
143 | }
144 |
145 | start_trainer() {
146 |   # paddle v1 and v2 distributed training do not allow any trainer to fail.
147 | check_failed_cnt 0
148 |
149 | pserver_label="paddle-job-pserver=${PADDLE_JOB_NAME}"
150 | trainer_label="paddle-job=${PADDLE_JOB_NAME}"
151 |
152 | stdbuf -oL python /root/k8s_tools.py wait_pods_running ${pserver_label} ${PSERVERS}
153 | stdbuf -oL python /root/k8s_tools.py wait_pods_running ${trainer_label} ${TRAINERS}
154 |
155 | export PADDLE_INIT_PSERVERS=$(python /root/k8s_tools.py fetch_ips ${pserver_label})
156 | export PADDLE_INIT_TRAINER_ID=$(python /root/k8s_tools.py fetch_id ${trainer_label})
157 | stdbuf -oL echo $PADDLE_INIT_TRAINER_ID > /trainer_id
158 | # FIXME: /trainer_count = PADDLE_INIT_NUM_GRADIENT_SERVERS
159 | stdbuf -oL echo $PADDLE_INIT_NUM_GRADIENT_SERVERS > /trainer_count
160 |
161 | # NOTE: $TRAINER_PACKAGE may be large, do not copy
162 | export PYTHONPATH=$TRAINER_PACKAGE:$PYTHONPATH
163 | cd $TRAINER_PACKAGE
164 |
165 | stdbuf -oL echo "Starting training job: " $TRAINER_PACKAGE, "num_gradient_servers:" \
166 | $PADDLE_INIT_NUM_GRADIENT_SERVERS, "trainer_id: " $PADDLE_INIT_TRAINER_ID, \
167 |
168 | export version="v2"
169 |   if [[ -z $1 ]]; then
170 |     echo "no version specified, using the default: $version"
171 |   else
172 |     export version="$1"
173 |     echo "user specified version: $version"
174 |   fi
175 |
176 |   # FIXME: If we use the new Golang PServer, add a Kubernetes healthz probe
177 |   # to wait for the PServer process to get ready. For now, just sleep 20 seconds.
178 | sleep 20
179 |
180 | case "$version" in
181 | "v1")
182 | FILE_COUNT=$(wc -l $TRAIN_LIST | awk '{print $1}')
183 | if [ $FILE_COUNT -le $PADDLE_INIT_NUM_GRADIENT_SERVERS ]; then
184 | echo "file count less than trainers"
185 | check_trainer_ret 0
186 | fi
187 | let lines_per_node="$FILE_COUNT / ($PADDLE_INIT_NUM_GRADIENT_SERVERS + 1)"
188 |     echo "splitting file, lines per node:" $lines_per_node
189 | cp $TRAIN_LIST /
190 | cd /
191 | split -l $lines_per_node -d -a 3 $TRAIN_LIST train.list
192 | CURRENT_LIST=$(printf "train.list%03d" $PADDLE_INIT_TRAINER_ID)
193 | # always use /train.list for paddle v1 for each node.
194 | echo "File for current node ${CURRENT_LIST}"
195 | sleep 10
196 | cp $CURRENT_LIST train.list
197 |
198 | cd $TRAINER_PACKAGE
199 |
200 | stdbuf -oL paddle train \
201 | --port=$PADDLE_INIT_PORT \
202 | --nics=$PADDLE_INIT_NICS \
203 | --ports_num=$PADDLE_INIT_PORTS_NUM \
204 | --ports_num_for_sparse=$PADDLE_INIT_PORTS_NUM_FOR_SPARSE \
205 | --num_passes=$PADDLE_INIT_NUM_PASSES \
206 | --trainer_count=$PADDLE_INIT_TRAINER_COUNT \
207 | --saving_period=1 \
208 | --log_period=20 \
209 | --local=0 \
210 | --rdma_tcp=tcp \
211 | --config=$TOPOLOGY \
212 | --use_gpu=$PADDLE_INIT_USE_GPU \
213 | --trainer_id=$PADDLE_INIT_TRAINER_ID \
214 | --save_dir=$OUTPUT \
215 | --pservers=$PADDLE_INIT_PSERVERS \
216 | --num_gradient_servers=$PADDLE_INIT_NUM_GRADIENT_SERVERS
217 |       # paddle v1 API does not allow any trainer to fail.
218 | check_trainer_ret $?
219 | ;;
220 | "v2")
221 | stdbuf -oL sh -c "${ENTRY}"
222 |       # paddle v2 API does not allow any trainer to fail.
223 | check_trainer_ret $?
224 | ;;
225 | *)
226 | ;;
227 | esac
228 | }
229 |
230 | usage() {
231 |   echo "usage: paddle_k8s <command>:"
232 | echo " start_trainer [v1|v2] Start a trainer process with v1 or v2 API"
233 | echo " start_pserver Start a pserver process"
234 | echo " start_new_pserver Start a new pserver process"
235 |   echo "    start_new_trainer  Start a new trainer process"
236 | }
237 |
238 | case "$1" in
239 | start_pserver)
240 | start_pserver
241 | ;;
242 | start_trainer)
243 | start_trainer $2
244 | ;;
245 | start_new_trainer)
246 | start_new_trainer
247 | ;;
248 | start_new_pserver)
249 | start_new_pserver
250 | ;;
251 | start_master)
252 | start_master
253 | ;;
254 | start_fluid)
255 | start_fluid_process
256 | ;;
257 | --help)
258 | usage
259 | ;;
260 | *)
261 | usage
262 | ;;
263 | esac
264 |
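265 | # A sketch of a typical invocation in this project (the concrete values come
266 | # from the pod spec): paddle_k8s start_fluid, with ENTRY set to the training
267 | # command and PADDLE_TRAINING_ROLE/PADDLE_JOB_NAME/PADDLE_PORT etc. exported.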
--------------------------------------------------------------------------------
/fleet-ctr/process_rawmodel.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 | import os
16 | import sys
17 |
18 | os.system('python3 /save_program/save_program.py slot.conf . ' + sys.argv[2])
19 | os.system('python3 /save_program/dumper.py --model_path ' + sys.argv[1] + ' --output_data_path ctr_cube/')
20 | os.system('python3 /save_program/replace_params.py --model_dir ' + sys.argv[1] +' --inference_only_model_dir inference_only')
21 | os.system('tar czf ctr_model.tar.gz inference_only/')
22 |
--------------------------------------------------------------------------------
/fleet-ctr/pserver0.sh:
--------------------------------------------------------------------------------
1 | GLOG_v=0 \
2 | GLOG_logtostderr=1 \
3 | TOPOLOGY= \
4 | TRAINER_PACKAGE=/share \
5 | PADDLE_INIT_NICS=eth2 \
6 | POD_IP=${SERVER0} \
7 | POD_NAME=SERVER0 \
8 | PADDLE_CURRENT_IP=${SERVER0} \
9 | PADDLE_JOB_NAME=fleet-ctr \
10 | PADDLE_IS_LOCAL=0 \
11 | PADDLE_TRAINERS_NUM=2 \
12 | PADDLE_PSERVERS_NUM=2 \
13 | FLAGS_rpc_deadline=36000000 \
14 | PADDLE_PORT=30239 \
15 | TRAINING_ROLE=PSERVER \
16 | PADDLE_PSERVERS_IP_PORT_LIST=${SERVER0}:30239,${SERVER1}:30241 \
17 | python3 model_with_sparse_feature.py slot.conf
18 |
--------------------------------------------------------------------------------
/fleet-ctr/pserver1.sh:
--------------------------------------------------------------------------------
1 | GLOG_v=0 \
2 | GLOG_logtostderr=1 \
3 | TOPOLOGY= \
4 | TRAINER_PACKAGE=/share \
5 | PADDLE_INIT_NICS=eth2 \
6 | POD_IP=${SERVER1} \
7 | POD_NAME=SERVER1 \
8 | PADDLE_CURRENT_IP=${SERVER1} \
9 | PADDLE_JOB_NAME=fleet-ctr \
10 | PADDLE_IS_LOCAL=0 \
11 | PADDLE_TRAINERS_NUM=2 \
12 | PADDLE_PSERVERS_NUM=2 \
13 | FLAGS_rpc_deadline=36000000 \
14 | PADDLE_PORT=30241 \
15 | TRAINING_ROLE=PSERVER \
16 | PADDLE_PSERVERS_IP_PORT_LIST=${SERVER0}:30239,${SERVER1}:30241 \
17 | python3 model_with_sparse_feature.py slot.conf
18 |
--------------------------------------------------------------------------------
/fleet-ctr/run_distributed.sh:
--------------------------------------------------------------------------------
1 | unset http_proxy
2 | unset https_proxy
3 |
4 | PADDLE_PSERVERS_IP_PORT_LIST=127.0.0.1:6017,127.0.0.1:6018 \
5 | PADDLE_TRAINERS_NUM=2 \
6 | TRAINING_ROLE=PSERVER \
7 | POD_IP=127.0.0.1 \
8 | PADDLE_PORT=6017 \
9 | python model_with_sparse_feature.py slot.conf 2> server0.elog > server0.stdlog &
10 |
11 | sleep 3
12 |
13 | PADDLE_PSERVERS_IP_PORT_LIST=127.0.0.1:6017,127.0.0.1:6018 \
14 | PADDLE_TRAINERS_NUM=2 \
15 | TRAINING_ROLE=PSERVER \
16 | POD_IP=127.0.0.1 \
17 | PADDLE_PORT=6018 \
18 | python model_with_sparse_feature.py slot.conf 2> server1.elog > server1.stdlog &
19 |
20 | sleep 3
21 |
22 | PADDLE_PSERVERS_IP_PORT_LIST=127.0.0.1:6017,127.0.0.1:6018 \
23 | PADDLE_TRAINERS_NUM=2 \
24 | TRAINING_ROLE=TRAINER \
25 | PADDLE_TRAINER_ID=0 \
26 | python train_with_mlflow.py slot.conf 2> worker0.elog > worker0.stdlog &
27 |
28 | sleep 3
29 |
30 | PADDLE_PSERVERS_IP_PORT_LIST=127.0.0.1:6017,127.0.0.1:6018 \
31 | PADDLE_TRAINERS_NUM=2 \
32 | TRAINING_ROLE=TRAINER \
33 | PADDLE_TRAINER_ID=1 \
34 | python model_with_sparse_feature.py slot.conf 2> worker1.elog > worker1.stdlog &
35 |
--------------------------------------------------------------------------------
/fleet-ctr/train_with_mlflow.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 |
16 | import math
17 | import sys
18 | import os
19 | import paddle.fluid as fluid
20 | import mlflow
21 | from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
22 | from paddle.fluid.incubate.fleet.base import role_maker
23 | from paddle.fluid.transpiler.distribute_transpiler import DistributeTranspilerConfig
24 | from paddle.fluid.contrib.utils.hdfs_utils import HDFSClient
25 | from nets import ctr_dnn_model
26 | import subprocess as sp
27 | import time
28 | import psutil
29 | import matplotlib
30 | import matplotlib.pyplot as plt
31 | from collections import OrderedDict
32 | from datetime import datetime, timedelta
33 | import json
34 | import logging
35 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
36 | logger = logging.getLogger(__name__)
37 |
38 | try:
39 | file_server_addr = os.environ["FILE_SERVER_SERVICE_HOST"] + ":"+ os.environ["FILE_SERVER_SERVICE_PORT"]
40 | slot_res = os.system('wget '+ file_server_addr + '/slot.conf')
41 | if slot_res !=0:
42 | raise ValueError('No Slot Conf Found.')
43 | feature_names = []
44 | with open(sys.argv[1]) as fin:
45 | for line in fin:
46 | feature_names.append(line.strip())
47 | logger.info("Slot:" + str(feature_names))
48 | except ValueError as e:
49 | print(e)
50 | sys.exit(1)
51 |
52 | sparse_input_ids = [
53 | fluid.layers.data(name=name, shape=[1], lod_level=1, dtype='int64')
54 | for name in feature_names]
55 |
56 | label = fluid.layers.data(
57 | name='label', shape=[1], dtype='int64')
58 |
59 | sparse_feature_dim = int(os.environ['SPARSE_DIM'])
60 | embedding_size = 9
61 | epochs = 1
62 | dataset_prefix = os.environ['DATASET_PATH']
63 |
64 | avg_cost, auc_var, _ = ctr_dnn_model(embedding_size, sparse_input_ids, label, sparse_feature_dim)
65 |
66 | start_date_hr = datetime.strptime(os.environ['START_DATE_HR'], '%Y%m%d/%H')
67 | end_date_hr = datetime.strptime(os.environ['END_DATE_HR'], '%Y%m%d/%H')
68 | current_date_hr = start_date_hr
69 | hdfs_address = os.environ['HDFS_ADDRESS']
70 | hdfs_ugi = os.environ['HDFS_UGI']
71 | start_run_flag = True
72 | # NOTE: this UserDefinedRoleMaker is superseded by the PaddleCloudRoleMaker
73 | # below, and it crashes when CURRENT_ID or ENDPOINTS is unset, so it stays
74 | # commented out:
75 | # role = role_maker.UserDefinedRoleMaker(
76 | #     current_id=int(os.getenv("CURRENT_ID")), worker_num=int(os.getenv("PADDLE_TRAINERS_NUM")),
77 | #     role=role_maker.Role.WORKER if os.getenv("TRAINING_ROLE") == "TRAINER" else role_maker.Role.SERVER, server_endpoints=os.getenv("ENDPOINTS").split(","))
78 |
79 | role = role_maker.PaddleCloudRoleMaker()
80 | fleet.init(role)
81 |
82 | config = DistributeTranspilerConfig()
83 | config.sync_mode = False
84 | optimizer = fluid.optimizer.SGD(0.0001)
85 | optimizer = fleet.distributed_optimizer(optimizer, config)
86 | optimizer.minimize(avg_cost)
87 | DATE_TIME_STRING_FORMAT = '%Y%m%d/%H'
88 | if fleet.is_server():
89 | fleet.init_server()
90 | fleet.run_server()
91 | elif fleet.is_worker():
92 | place = fluid.CPUPlace()
93 | exe = fluid.Executor(place)
94 | fleet.init_worker()
95 | exe.run(fluid.default_startup_program())
96 | logger.info("startup program done.")
97 | if fleet.is_first_worker():
98 | plt.figure()
99 | y_auc = []
100 | y_cpu = []
101 | y_memory = []
102 | y_network_sent = []
103 | y_network_recv = []
104 | x_list = []
105 | x = 0
106 | last_net_sent = psutil.net_io_counters().bytes_sent
107 | last_net_recv = psutil.net_io_counters().bytes_recv
108 |
109 | while True:
110 |
111 | #hadoop fs -D hadoop.job.ugi=root, -D fs.default.name=hdfs://192.168.48.87:9000 -ls /
112 | current_date_hr_exist = os.system('hadoop fs -D hadoop.job.ugi=' + hdfs_ugi + ' -D fs.defaultFS=' + hdfs_address +
113 | ' -ls ' + os.path.join(dataset_prefix, current_date_hr.strftime(DATE_TIME_STRING_FORMAT), 'donefile') +' >/dev/null 2>&1')
114 |
115 |         if current_date_hr_exist:  # non-zero exit code: this hour's donefile is not on HDFS yet
116 | pass
117 | else:
118 | """
119 | output = sp.check_output("hdfs dfs -ls hdfs://192.168.48.189:9000/train_data | awk '{if(NR>1) print $8}'",
120 | shell=True)
121 | print(output)
122 | train_filelist = ["{}".format(f) for f in output.decode('ascii').strip().split('\n')]
123 | print(train_filelist)
124 | for i in range(len(train_filelist)):
125 | train_filelist[i] = train_filelist[i].split("/")[-1]
126 | train_filelist.sort(reverse=True)
127 | print(train_filelist)
128 | """
129 |
130 | dataset = fluid.DatasetFactory().create_dataset()
131 | dataset.set_use_var(sparse_input_ids + [label])
132 | pipe_command = "python criteo_dataset.py {}".format(sys.argv[1])
133 | dataset.set_pipe_command(pipe_command)
134 | dataset.set_batch_size(32)
135 | dataset.set_hdfs_config(hdfs_address, hdfs_ugi)
136 | dataset.set_thread(10)
137 | exe = fluid.Executor(fluid.CPUPlace())
138 | input_folder = "hdfs:"
139 | output = sp.check_output( "hadoop fs -D hadoop.job.ugi=" + hdfs_ugi
140 | + " -D fs.defaultFS=" + hdfs_address +" -ls " + os.path.join(dataset_prefix, str(current_date_hr.strftime(DATE_TIME_STRING_FORMAT))) + "/ | awk '{if(NR>1) print $8}'", shell=True)
141 | train_filelist = ["{}{}".format(input_folder, f) for f in output.decode('ascii').strip().split('\n')]
142 | train_filelist.remove('hdfs:' + os.path.join(dataset_prefix, str(current_date_hr.strftime(DATE_TIME_STRING_FORMAT)), 'donefile'))
143 | #pending
144 | step = 0
145 |
146 | config = DistributeTranspilerConfig()
147 | config.sync_mode = False
148 |
149 | # if is worker
150 | fleet_filelist = fleet.split_files(train_filelist)
151 | print(fleet_filelist)
152 | logger.info( "Training: " + current_date_hr.strftime(DATE_TIME_STRING_FORMAT))
153 | dataset.set_filelist(fleet_filelist)
154 | if fleet.is_first_worker():
155 | # experiment_id = mlflow.create_experiment("fleet-ctr")
156 | if start_run_flag:
157 | mlflow.start_run()
158 | mlflow.log_param("sparse_feature_dim", sparse_feature_dim)
159 | mlflow.log_param("embedding_size", embedding_size)
160 | mlflow.log_param("selected_slots", feature_names)
161 | mlflow.log_param("epochs_num", epochs)
162 | mlflow.log_param("physical_cpu_counts", psutil.cpu_count(logical=False))
163 | mlflow.log_param("logical_cpu_counts", psutil.cpu_count())
164 | mlflow.log_param("total_memory/GB",
165 | round(psutil.virtual_memory().total / (1024.0 * 1024.0 * 1024.0), 3))
166 | start_run_flag = False
167 |
168 |
169 | class FH(fluid.executor.FetchHandler):
170 | def handler(self, fetch_target_vars):
171 | auc = fetch_target_vars[0]
172 | print("test metric auc: ", fetch_target_vars)
173 | global last_net_sent
174 | global last_net_recv
175 | global y_auc
176 | global y_cpu
177 | global y_memory
178 | global y_network_sent
179 | global y_network_recv
180 | global x
181 | mlflow.log_metric("network_bytes_sent_speed",
182 | psutil.net_io_counters().bytes_sent - last_net_sent)
183 | mlflow.log_metric("network_bytes_recv_speed",
184 | psutil.net_io_counters().bytes_recv - last_net_recv)
185 | y_network_sent.append((psutil.net_io_counters().bytes_sent - last_net_sent)/10)
186 | y_network_recv.append((psutil.net_io_counters().bytes_recv - last_net_recv)/10)
187 | last_net_sent = psutil.net_io_counters().bytes_sent
188 | last_net_recv = psutil.net_io_counters().bytes_recv
189 | mlflow.log_metric("cpu_usage_total", round((psutil.cpu_percent(interval=0) / 100), 3))
190 | y_cpu.append(round((psutil.cpu_percent(interval=0) / 100), 3))
191 | mlflow.log_metric("free_memory/GB",
192 | round(psutil.virtual_memory().free / (1024.0 * 1024.0 * 1024.0), 3))
193 | mlflow.log_metric("memory_usage",
194 | round((psutil.virtual_memory().total - psutil.virtual_memory().free)
195 | / float(psutil.virtual_memory().total), 3))
196 | y_memory.append(round((psutil.virtual_memory().total - psutil.virtual_memory().free)
197 | / float(psutil.virtual_memory().total), 3))
198 |                     if auc is None:
199 | mlflow.log_metric("auc", 0.5)
200 | auc = [0.5]
201 | else:
202 | mlflow.log_metric("auc", auc[0])
203 | y_auc.append(auc)
204 | x_list.append(x)
205 |
206 |                     if x >= 120:  # in the future, 120 will be replaced by 24*60*60 (one day's length)
207 | y_auc.pop(0)
208 | y_cpu.pop(0)
209 | y_memory.pop(0)
210 | y_network_recv.pop(0)
211 | y_network_sent.pop(0)
212 | x_list.pop(0)
213 | x += 10
214 | if x % 60 == 0 and x != 0:
215 | plt.subplot(221)
216 | plt.plot(x_list, y_auc)
217 | plt.title('auc')
218 | plt.grid(True)
219 |
220 | plt.subplot(222)
221 | plt.plot(x_list, y_cpu)
222 | plt.title('cpu_usage')
223 | plt.grid(True)
224 |
225 | plt.subplot(223)
226 | plt.plot(x_list, y_memory)
227 | plt.title('memory_usage')
228 | plt.grid(True)
229 |
230 | plt.subplot(224)
231 | plt.plot(x_list, y_network_sent, label="network_sent_speed")
232 | plt.plot(x_list, y_network_recv, label="network_recv_speed")
233 | plt.title('network_speed')
234 | plt.grid(True)
235 |
236 | plt.subplots_adjust(top=0.9, bottom=0.2, hspace=0.4, wspace=0.35)
237 | plt.legend(bbox_to_anchor=(0, -0.6), loc='lower left', borderaxespad=0.)
238 | temp_file_name = "dashboard_" + time.strftime('%Y-%m-%d_%H:%M:%S',
239 | time.localtime(time.time())) + ".png"
240 | plt.savefig(temp_file_name, dpi=250)
241 | sys.stdout.flush()
242 | plt.clf()
243 | os.system("rm -f "+str(mlflow.get_artifact_uri().split(":")[1]) + "/*.png")
244 | mlflow.log_artifact(local_path=temp_file_name)
245 | sys.stdout.flush()
246 | os.system("rm -f ./*.png")
247 | sys.stdout.flush()
248 | logger.info(str(mlflow.get_artifact_uri().split(":")[1]))
249 | sys.stdout.flush()
250 |
251 | os.system("hadoop fs -D hadoop.job.ugi=" + hdfs_ugi + " -D fs.defaultFS=" + hdfs_address
252 | + " -rm " + os.path.join(dataset_prefix, str(current_date_hr.strftime(DATE_TIME_STRING_FORMAT))+ "_model/donefile") + " >/dev/null 2>&1")
253 | for i in range(epochs):
254 | exe.train_from_dataset(
255 | program=fluid.default_main_program(),
256 | dataset=dataset,
257 | fetch_handler=FH([auc_var.name], 10, True),
258 | # fetch_list=[auc_var],
259 | # fetch_info=["auc"],
260 | debug=False)
261 | path = "./saved_models/" + current_date_hr.strftime(DATE_TIME_STRING_FORMAT) + "_model/"
262 | logger.info("save inference program: " + path)
263 |             if len(y_auc) >= 1:
264 | logger.info("Current AUC: " + str(y_auc[-1]))
265 | else:
266 | logger.info("Dataset is too small, cannot get AUC.")
267 | fetch_list = fleet.save_inference_model(exe,
268 | path,
269 | [x.name for x in sparse_input_ids] + [label.name],
270 | [auc_var])
271 | os.system("hadoop fs -D hadoop.job.ugi=" + hdfs_ugi + " -D fs.defaultFS=" + hdfs_address
272 | + " -put -f "+ path +" " + os.path.join(dataset_prefix, current_date_hr.strftime(DATE_TIME_STRING_FORMAT).split("/")[0]) + " >/dev/null 2>&1")
273 | os.system('touch donefile')
274 | os.system("hadoop fs -D hadoop.job.ugi=" + hdfs_ugi + " -D fs.defaultFS=" + hdfs_address
275 | + " -put -f donefile" + " " + os.path.join(dataset_prefix, current_date_hr.strftime(DATE_TIME_STRING_FORMAT) + "_model/") +" >/dev/null 2>&1")
276 | logger.info("push raw model to HDFS: " + current_date_hr.strftime(DATE_TIME_STRING_FORMAT))
277 | os.system('python process_rawmodel.py ' + './saved_models/' + current_date_hr.strftime(DATE_TIME_STRING_FORMAT) + "_model " + current_date_hr.strftime(DATE_TIME_STRING_FORMAT) + " >/dev/null 2>&1")
278 | os.system("hadoop fs -D hadoop.job.ugi=" + hdfs_ugi + " -D fs.defaultFS=" + hdfs_address
279 | + " -put -f ctr_model.tar.gz " + " /output/")
280 | os.system("hadoop fs -D hadoop.job.ugi=" + hdfs_ugi + " -D fs.defaultFS=" + hdfs_address
281 | + " -put -f ctr_cube " + " /output/")
282 | logger.info("push converted model to HDFS: " + current_date_hr.strftime(DATE_TIME_STRING_FORMAT))
283 | os.system("rm -rf " + path)
284 | logger.info(current_date_hr.strftime(DATE_TIME_STRING_FORMAT) + ' Training Done.')
285 | else:
286 | for i in range(epochs):
287 | exe.train_from_dataset(
288 | program=fleet.main_program,
289 | dataset=dataset,
290 | # fetch_list=[auc_var],
291 | # fetch_info=["auc"],
292 | debug=False)
293 | while True:
294 | rawmodel_donefile_exist = os.system("hadoop fs -D hadoop.job.ugi=" + hdfs_ugi + " -D fs.defaultFS=" + hdfs_address
295 | + " -ls "+ os.path.join(dataset_prefix, current_date_hr.strftime(DATE_TIME_STRING_FORMAT) + "_model/donefile") +" >/dev/null 2>&1")
296 |             if not rawmodel_donefile_exist:  # exit code 0: the donefile has appeared on HDFS
297 | break
298 | if end_date_hr == current_date_hr:
299 | mlflow.end_run()
300 | break
301 | current_date_hr = current_date_hr + timedelta(hours=1)
302 |
303 |
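304 | # Environment variables consumed above (provided by the deployment, e.g. the
305 | # fleet-ctr manifests or trainer*.sh): FILE_SERVER_SERVICE_HOST/_PORT,
306 | # SPARSE_DIM, DATASET_PATH, START_DATE_HR and END_DATE_HR (format %Y%m%d/%H),
307 | # HDFS_ADDRESS, HDFS_UGI, plus the PADDLE_* variables read by PaddleCloudRoleMaker.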
--------------------------------------------------------------------------------
/fleet-ctr/trainer0.sh:
--------------------------------------------------------------------------------
1 | GLOG_v=0 \
2 | GLOG_logtostderr=1 \
3 | TOPOLOGY= \
4 | TRAINER_PACKAGE=/share \
5 | PADDLE_INIT_NICS=eth2 \
6 | POD_IP=127.0.0.1 \
7 | POD_NAME=TRAINER0 \
8 | PADDLE_CURRENT_IP=127.0.0.1 \
9 | PADDLE_JOB_NAME=fleet-ctr \
10 | PADDLE_IS_LOCAL=0 \
11 | PADDLE_TRAINERS_NUM=2 \
12 | PADDLE_PSERVERS_NUM=2 \
13 | FLAGS_rpc_deadline=36000000 \
14 | TRAINING_ROLE=TRAINER \
15 | PADDLE_TRAINER_ID=0 \
16 | PADDLE_PSERVERS_IP_PORT_LIST=${SERVER0}:30239,${SERVER1}:30241 \
17 | python3 train_with_mlflow.py slot.conf
18 |
--------------------------------------------------------------------------------
/fleet-ctr/trainer1.sh:
--------------------------------------------------------------------------------
1 | GLOG_v=0 \
2 | GLOG_logtostderr=1 \
3 | TOPOLOGY= \
4 | TRAINER_PACKAGE=/share \
5 | PADDLE_INIT_NICS=eth2 \
6 | POD_IP=127.0.0.1 \
7 | POD_NAME=TRAINER1 \
8 | PADDLE_CURRENT_IP=127.0.0.1 \
9 | PADDLE_JOB_NAME=fleet-ctr \
10 | PADDLE_IS_LOCAL=0 \
11 | PADDLE_TRAINERS_NUM=2 \
12 | PADDLE_PSERVERS_NUM=2 \
13 | FLAGS_rpc_deadline=36000000 \
14 | TRAINING_ROLE=TRAINER \
15 | PADDLE_TRAINER_ID=1 \
16 | PADDLE_PSERVERS_IP_PORT_LIST=${SERVER0}:30239,${SERVER1}:30241 \
17 | python3 model_with_sparse_feature.py slot.conf
18 |
--------------------------------------------------------------------------------
/huawei_k8s.md:
--------------------------------------------------------------------------------
1 | # Setting Up a k8s Cluster on Huawei Cloud
2 |
3 | This document describes how to set up a k8s cluster on Huawei Cloud in order to deploy ElasticCTR.
4 |
5 | * [1. Process Overview](#head1)
6 | * [2. Purchase a Cluster](#head2)
7 | * [3. Define Load Balancing](#head3)
8 | * [4. Upload Images](#head4)
9 |
10 | ## 1. Process Overview
11 |
12 | Setting up a k8s cluster on Huawei Cloud involves three main steps:
13 |
14 | 1. Purchase a cluster
15 |
16 | First, log in to the Huawei Cloud CCE console and purchase a cluster with a suitable configuration.
17 |
18 | 2. Define load balancing
19 |
20 | Create the cluster with the jump server purchased in the previous step; the cluster configuration can be adjusted as needed.
21 |
22 | 3. Upload images
23 |
24 | Huawei Cloud provides an image registry service; uploading the required images to the registry makes subsequent image pulls faster.
25 |
26 | Each step is described in detail below.
27 |
28 | ## 2. Purchase a Cluster
29 |
30 | Log in to the Huawei Cloud CCE console to purchase the k8s cluster. The steps are as follows:
31 |
32 | 1. Open the Huawei Cloud CCE console. From the console dashboard, click "Buy Kubernetes Cluster", and under the service selection step of the hybrid-cluster purchase page, set the cluster information and overall configuration.
33 | 
34 |
35 | 2. Once the cluster information is set, click "Next: Create Node" and configure the nodes under the node-creation step.
36 | 
37 | 
38 | 
39 |
40 | 3. After the nodes are configured, select volcano under the advanced add-ons of the add-on installation step, then confirm the configuration and complete the payment.
41 | 
42 |
43 | ## 3. Define Load Balancing
44 |
45 | Since Huawei Cloud places restrictions on load balancers, it is recommended to consult [Huawei Cloud load balancing](https://support.huaweicloud.com/usermanual-cce/cce_01_0014.html) and modify the Service metadata defined in fileserver.yaml.template and pdserving.yaml, as shown in the figure below.
46 | 
47 |
48 |
49 | ## 4. Upload Images
50 |
51 | Note: this step is optional; if your budget is ample, you can skip it.
52 | Click the image registry, log in to the registry as prompted, and run the given commands on a node terminal, as shown below:
53 | 
54 |
55 | Then, following the prompts, upload the images required by elastic ctr to the registry.
56 | 
57 |
58 | After completing these operations, the result looks as follows:
59 | 
60 |
61 | At this point, the Huawei Cloud k8s setup process is complete. Users can set up HDFS and deploy elastic ctr on their own.
62 |
63 |
64 |
65 |
66 |
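67 | As a rough sketch of step 4, the upload follows the standard docker workflow (the region, organization, and image name below are placeholders; use the exact commands shown in your SWR console):
68 |
69 | ```
70 | docker login -u <user> -p <login-key> swr.<region>.myhuaweicloud.com
71 | docker tag fleet-ctr:latest swr.<region>.myhuaweicloud.com/<org>/fleet-ctr:latest
72 | docker push swr.<region>.myhuaweicloud.com/<org>/fleet-ctr:latest
73 | ```
74 |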
--------------------------------------------------------------------------------
/save_program/dumper.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 |
16 | import argparse
17 | import logging
18 | import struct
19 | import time
20 | import datetime
21 | import json
22 | from collections import OrderedDict
23 | import numpy as np
24 | import os
25 | import paddle
26 | import paddle.fluid as fluid
27 | import math
28 |
29 |
30 | logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s')
31 | logger = logging.getLogger("fluid")
32 | logger.setLevel(logging.INFO)
33 |
34 | NOW_TIMESTAMP = time.time()
35 | NOW_DATETIME = datetime.datetime.now().strftime("%Y%m%d")
36 |
37 | sparse_feature_dim = 100000001
38 | embedding_size = 9
39 |
40 |
41 | def parse_args():
42 | parser = argparse.ArgumentParser(description="PaddlePaddle DeepFM example")
43 | parser.add_argument(
44 | '--model_path',
45 | type=str,
46 | required=True,
47 | help="The path of model parameters file")
48 | parser.add_argument(
49 | '--output_data_path',
50 | type=str,
51 | default="/save_program/ctr_cube/",
52 | help="The path of dump output")
53 |
54 | return parser.parse_args()
55 |
56 |
57 | def write_donefile(base_datafile, base_donefile):
58 |     donefile_record = OrderedDict()
59 |     if not os.access(os.path.dirname(base_donefile), os.F_OK):
60 |         os.makedirs(os.path.dirname(base_donefile))
61 |     donefile_record["id"] = str(int(NOW_TIMESTAMP))
62 |     donefile_record["key"] = donefile_record["id"]
63 |     donefile_record["input"] = os.path.dirname(base_datafile)
64 |     with open(base_donefile, "a") as f:
65 |         jsonstr = json.dumps(donefile_record) + '\n'
66 |         f.write(jsonstr)
67 |
68 |
69 | def dump():
70 | args = parse_args()
71 | feature_names = []
72 |
73 | output_data_path = os.path.abspath(args.output_data_path)
74 | base_datadir = output_data_path + "/" + NOW_DATETIME + "/base"
75 | try:
76 | os.makedirs(base_datadir)
77 |     except OSError:
78 |         print('Dir already exists, skipping.')
79 | base_datafile = output_data_path + "/" + NOW_DATETIME + "/base/feature"
80 | write_base_datafile = "/output/ctr_cube/" + NOW_DATETIME + "/base/feature"
81 | base_donefile = output_data_path + "/" + "donefile/" + "base.txt"
82 | os.system('/save_program/parameter_to_seqfile ' + args.model_path +'/SparseFeatFactors ' + base_datafile)
83 | write_donefile(write_base_datafile + "0", base_donefile)
84 |
85 |
86 | if __name__ == '__main__':
87 | dump()
88 |
89 |
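For reference, each run appends one JSON line to base.txt; given the defaults above, the record looks roughly like this (timestamp and date are illustrative):

```
{"id": "1585713600", "key": "1585713600", "input": "/output/ctr_cube/20200401/base"}
```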
--------------------------------------------------------------------------------
/save_program/listening.py:
--------------------------------------------------------------------------------
1 | import time
2 | import os
3 | import json
4 |
5 | previous_last_line = ""
6 | while True:
7 |     try:
8 |         with open("/share/saved_models/model_donefile", "r") as f:
9 |             lines = f.readlines()
10 |             current_line = lines[-1]
11 |             path = json.loads(current_line)['path']
12 |
13 |         if current_line != previous_last_line:  # a new model was announced in the donefile
14 |             os.system("python3 save_program.py slot.conf .")  # note: save_program.py also reads a date-hour argv[3] (see that script)
15 |             os.system("python dumper_multiprocessing.py")  # dump the sparse embeddings for cube
16 |             os.system("python replace_params.py " +
17 |                 "--model_dir " + path +
18 |                 " --inference_only_model_dir " + os.path.join(path, "..", "inference_only"))  # inference_only sits next to the model dir
19 |         else:
20 |             pass
21 |         previous_last_line = current_line
22 |     except Exception:
23 |         pass
24 |     time.sleep(10)  # poll the donefile instead of busy-looping
25 |
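listening.py assumes that each line of /share/saved_models/model_donefile is a JSON object carrying at least a path field pointing at the freshly saved model, e.g. (illustrative only):

```
{"path": "/share/saved_models/pass-0"}
```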
--------------------------------------------------------------------------------
/save_program/nets.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 | from __future__ import print_function
16 |
17 | import argparse
18 | import logging
19 | import os
20 | import time
21 | import math
22 | import numpy as np
23 | import six
24 | import paddle
25 | import paddle.fluid as fluid
26 | import paddle.fluid.proto.framework_pb2 as framework_pb2
27 | import paddle.fluid.core as core
28 |
29 | from multiprocessing import cpu_count
30 |
31 | def ctr_dnn_model(embedding_size, sparse_input_ids, label,
32 | sparse_feature_dim):
33 | def embedding_layer(input):
34 | emb = fluid.layers.embedding(
35 | input=input,
36 | is_sparse=True,
37 | is_distributed=False,
38 | size=[sparse_feature_dim, embedding_size],
39 | param_attr=fluid.ParamAttr(name="SparseFeatFactors",
40 | initializer=fluid.initializer.Uniform()))
41 | emb_sum = fluid.layers.sequence_pool(input=emb, pool_type='sum')
42 | return emb_sum
43 |
44 | emb_sums = list(map(embedding_layer, sparse_input_ids))
45 | concated = fluid.layers.concat(emb_sums, axis=1)
46 | fc1 = fluid.layers.fc(input=concated, size=400, act='relu',
47 | param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
48 | scale=1 / math.sqrt(concated.shape[1]))))
49 | fc2 = fluid.layers.fc(input=fc1, size=400, act='relu',
50 | param_attr=fluid.ParamAttr(
51 | initializer=fluid.initializer.Normal(
52 | scale=1 / math.sqrt(fc1.shape[1]))))
53 | fc3 = fluid.layers.fc(input=fc2, size=400, act='relu',
54 | param_attr=fluid.ParamAttr(
55 | initializer=fluid.initializer.Normal(
56 | scale=1 / math.sqrt(fc2.shape[1]))))
57 |
58 | predict = fluid.layers.fc(input=fc3, size=2, act='softmax',
59 | param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
60 | scale=1 / math.sqrt(fc3.shape[1]))))
61 |
62 | cost = fluid.layers.cross_entropy(input=predict, label=label)
63 | avg_cost = fluid.layers.reduce_sum(cost)  # note: a sum over the batch, despite the avg_ name
64 | accuracy = fluid.layers.accuracy(input=predict, label=label)  # computed but not returned
65 | auc_var, batch_auc_var, auc_states = \
66 | fluid.layers.auc(input=predict, label=label, num_thresholds=2 ** 12, slide_steps=20)
67 | return avg_cost, auc_var, batch_auc_var
68 |
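For context, a minimal sketch of how ctr_dnn_model is wired up; the slot names and dimensions here are illustrative, the real ones come from slot.conf and the environment, as in save_program.py:

```
import paddle.fluid as fluid
from nets import ctr_dnn_model

# illustrative values; the real ones come from slot.conf / SPARSE_DIM
feature_names = ['slot_%d' % i for i in range(26)]
sparse_ids = [fluid.layers.data(name=n, shape=[1], lod_level=1, dtype='int64')
              for n in feature_names]
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
avg_cost, auc_var, batch_auc_var = ctr_dnn_model(
    embedding_size=9, sparse_input_ids=sparse_ids, label=label,
    sparse_feature_dim=100000001)
```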
--------------------------------------------------------------------------------
/save_program/parameter_to_seqfile:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PaddlePaddle/ElasticCTR/c493c6b250eef001028b5a7264b61f679c79fa6d/save_program/parameter_to_seqfile
--------------------------------------------------------------------------------
/save_program/replace_params.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 | import os
16 | import shutil
17 | import argparse
18 |
19 | def parse_args():
20 | parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
21 | parser.add_argument(
22 | '--model_dir',
23 | type=str,
24 | required=True,
25 | help='The trained model path (eg. models/pass-0)')
26 | parser.add_argument(
27 | '--inference_only_model_dir',
28 | type=str,
29 | required=True,
30 | help='The inference only model (eg. models/inference_only)')
31 | return parser.parse_args()
32 |
33 | def replace_params():
34 | args = parse_args()
35 | inference_only_model_dir = args.inference_only_model_dir
36 | model_dir = args.model_dir
37 |
38 |     files = os.listdir(inference_only_model_dir)
39 |
40 |     for f in files:
41 |         if "__model__" in f:  # keep the pruned program file itself
42 |             continue
43 |         dst_file = os.path.join(inference_only_model_dir, f)
44 |         src_file = os.path.join(model_dir, f)
45 |         print("copying %s to %s" % (src_file, dst_file))
46 |         shutil.copyfile(src_file, dst_file)
47 |
48 | if __name__ == '__main__':
49 | replace_params()
50 | print("Done")
51 |
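Invocation, per the argparse help above:

```
python replace_params.py --model_dir models/pass-0 --inference_only_model_dir models/inference_only
```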
--------------------------------------------------------------------------------
/save_program/save_program.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | # Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 |
16 | import math
17 | import sys
18 | import os
19 | import paddle.fluid as fluid
20 | import google.protobuf.text_format as text_format
21 | import paddle.fluid.proto.framework_pb2 as framework_pb2
22 | import paddle.fluid.core as core
23 | import six
24 | import subprocess as sp
25 |
26 | inference_path = sys.argv[2] + '/inference_only'  # argv[2]: output base directory
27 | feature_names = []
28 | with open(sys.argv[1]) as fin:  # argv[1]: slot.conf, one sparse slot name per line
29 | for line in fin:
30 | feature_names.append(line.strip())
31 |
32 | print(feature_names)
33 |
34 | sparse_input_ids = [
35 | fluid.layers.data(name=name, shape=[1], lod_level=1, dtype='int64')
36 | for name in feature_names]
37 |
38 | label = fluid.layers.data(
39 | name='label', shape=[1], dtype='int64')
40 |
41 | sparse_feature_dim = int(os.environ['SPARSE_DIM'])
42 | dataset_prefix = os.environ['DATASET_PATH']
43 | embedding_size = 9
44 |
45 | current_date_hr = sys.argv[3]  # argv[3]: the date(-hour) partition name under DATASET_PATH
46 | hdfs_address = os.environ['HDFS_ADDRESS']
47 | hdfs_ugi = os.environ['HDFS_UGI']
48 |
49 | def embedding_layer(input):
50 | emb = fluid.layers.embedding(
51 | input=input,
52 | is_sparse=True,
53 | is_distributed=False,
54 | size=[sparse_feature_dim, embedding_size],
55 | param_attr=fluid.ParamAttr(name="SparseFeatFactors",
56 | initializer=fluid.initializer.Uniform()))
57 | emb_sum = fluid.layers.sequence_pool(input=emb, pool_type='sum')
58 | return emb, emb_sum
59 |
60 | emb_sums = list(map(embedding_layer, sparse_input_ids))
61 |
62 | emb_list = [x[0] for x in emb_sums]
63 | sparse_embed_seq = [x[1] for x in emb_sums]
64 | inference_feed_vars = emb_list
65 |
66 | concated = fluid.layers.concat(sparse_embed_seq, axis=1)
67 | fc1 = fluid.layers.fc(input=concated, size=400, act='relu',
68 | param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
69 | scale=1 / math.sqrt(concated.shape[1]))))
70 | fc2 = fluid.layers.fc(input=fc1, size=400, act='relu',
71 | param_attr=fluid.ParamAttr(
72 | initializer=fluid.initializer.Normal(
73 | scale=1 / math.sqrt(fc1.shape[1]))))
74 | fc3 = fluid.layers.fc(input=fc2, size=400, act='relu',
75 | param_attr=fluid.ParamAttr(
76 | initializer=fluid.initializer.Normal(
77 | scale=1 / math.sqrt(fc2.shape[1]))))
78 |
79 | predict = fluid.layers.fc(input=fc3, size=2, act='softmax',
80 | param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
81 | scale=1 / math.sqrt(fc3.shape[1]))))
82 |
83 | cost = fluid.layers.cross_entropy(input=predict, label=label)
84 | avg_cost = fluid.layers.reduce_sum(cost)
85 | accuracy = fluid.layers.accuracy(input=predict, label=label)
86 | auc_var, batch_auc_var, auc_states = \
87 | fluid.layers.auc(input=predict, label=label, num_thresholds=2 ** 12, slide_steps=20)
88 |
89 | def save_program():
90 | dataset = fluid.DatasetFactory().create_dataset()
91 | dataset.set_use_var(sparse_input_ids + [label])
92 |     pipe_command = "python criteo_dataset.py {}".format(sys.argv[1])  # each dataset thread pipes raw samples through this reader
93 | dataset.set_pipe_command(pipe_command)
94 | dataset.set_batch_size(32)
95 | dataset.set_thread(10)
96 | optimizer = fluid.optimizer.SGD(0.0001)
97 | optimizer.minimize(avg_cost)
98 | exe = fluid.Executor(fluid.CPUPlace())
99 |
100 |     input_folder = "hdfs:"  # filelist entries are prefixed with "hdfs:" so they resolve on HDFS
101 |     output = sp.check_output("hadoop fs -D hadoop.job.ugi=" + hdfs_ugi
102 |         + " -D fs.defaultFS=" + hdfs_address + " -ls " + os.path.join(dataset_prefix, current_date_hr) + "/ | awk '{if(NR>1) print $8}'", shell=True)  # list the partition's files; awk keeps the path column
103 |     train_filelist = ["{}{}".format(input_folder, f) for f in output.decode('ascii').strip().split('\n')]
104 |     train_filelist.remove('hdfs:' + os.path.join(dataset_prefix, current_date_hr, 'donefile'))  # the donefile marker is not training data
105 |     train_filelist = [train_filelist[0]]  # a single file suffices: this run only instantiates parameters before saving
106 |     print(train_filelist)
107 |
108 | exe.run(fluid.default_startup_program())
109 | print("startup save program done.")
110 | dataset.set_filelist(train_filelist)
111 | exe.train_from_dataset(
112 | program=fluid.default_main_program(),
113 | dataset=dataset,
114 | fetch_list=[auc_var],
115 | fetch_info=["auc"],
116 | debug=False,)
117 | #print_period=10000)
118 | # save model here
119 | fetch_list = fluid.io.save_inference_model(inference_path, [x.name for x in inference_feed_vars], [predict], exe)
120 |
121 |
122 | def prune_program():
123 |     model_dir = inference_path
124 |     model_file = model_dir + "/__model__"
125 |     with open(model_file, "rb") as f:
126 |         protostr = f.read()
127 |
128 |     proto = framework_pb2.ProgramDesc.FromString(six.binary_type(protostr))
129 |     block = proto.blocks[0]
130 |     kept_ops = [op for op in block.ops if op.type != "lookup_table"]  # embedding lookups are served from cube at inference time
131 |     del block.ops[:]
132 |     block.ops.extend(kept_ops)
133 |
134 |     kept_vars = [var for var in block.vars if var.name != "SparseFeatFactors"]  # drop the sparse table variable as well
135 |     del block.vars[:]
136 |     block.vars.extend(kept_vars)
137 |
138 |     with open(model_file, "wb") as f:
139 |         f.write(proto.SerializePartialToString())
140 |
141 |     with open(model_file + ".prototxt.pruned", "w") as f:
142 |         f.write(text_format.MessageToString(proto))
143 |
144 |
145 |
146 | def remove_embedding_param_file():
147 | model_dir = inference_path
148 | embedding_file = model_dir + "/SparseFeatFactors"
149 |     os.remove(embedding_file)  # the inference-only model ships without the sparse embedding file
150 |
151 |
152 | if __name__ == '__main__':
153 | save_program()
154 | prune_program()
155 | remove_embedding_param_file()
156 |
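To run this script standalone, the environment variables and positional arguments it reads must be set; a sketch with placeholder values (the UGI string is whatever your cluster expects for hadoop.job.ugi):

```
export SPARSE_DIM=100000001
export DATASET_PATH=/train_data
export HDFS_ADDRESS=hdfs://192.168.0.2:9000
export HDFS_UGI=root,root
python save_program.py slot.conf /share/saved_models 20200401
```

argv[1] is the slot list, argv[2] the output base directory (inference_only is created beneath it), and argv[3] the dataset partition to read.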
--------------------------------------------------------------------------------