├── docs
    ├── practice
    │   ├── RESNET.md
    │   └── BERT.md
    ├── setup
    │   ├── images
    │   │   ├── CPFS_LIST.png
    │   │   ├── kubeconfig.png
    │   │   ├── CREATE_CPFS.png
    │   │   ├── cloud_shell.png
    │   │   ├── nas_created.png
    │   │   ├── CPFS_TEMPLATE.png
    │   │   ├── create_cluster.png
    │   │   ├── master_ssh_ip.png
    │   │   ├── nas_add_mount.png
    │   │   ├── nas_create_fs.png
    │   │   ├── nas_create_pv.png
    │   │   ├── nas_create_pvc.png
    │   │   ├── nas_get_mount.png
    │   │   ├── create_cluster_gpu.png
    │   │   ├── create_cluster_finish.png
    │   │   └── create_cluster_instance_type.png
    │   ├── SETUP_LOCAL.md
    │   ├── SETUP_PUBLIC_STORAGE.md
    │   ├── README.md
    │   ├── SETUP_USER_STORAGE.md
    │   ├── OPERATE_NOTEBOOK.md
    │   ├── SETUP_NAS.md
    │   ├── CREATE_CLUSTER.md
    │   ├── SETUP_CPFS.md
    │   ├── INSTALL_ARENA.md
    │   └── SETUP_NOTEBOOK.md
    ├── guide
    │   ├── images
    │   │   ├── run_notebook.jpg
    │   │   ├── access_notebook.png
    │   │   ├── ai-starter-demo.jpg
    │   │   ├── notebook-home.jpg
    │   │   ├── pipelines
    │   │   │   ├── input.jpg
    │   │   │   ├── metrics.jpg
    │   │   │   ├── comparsion.jpg
    │   │   │   ├── experiment.jpg
    │   │   │   └── pipeline-dag.jpg
    │   │   └── access_notebook_password.png
    │   ├── USE_NOTEBOOK.md
    │   ├── ACCESS_NOTEBOOK.md
    │   └── 1_AUTHOR_PIPELINES.md
    ├── SUMMARY.md
    └── README.md
├── .DS_Store
├── demo
    ├── 1-1-tensorboard.jpg
    ├── 2-1-tensorboard.jpg
    ├── 3-1-tensorboard.jpg
    ├── 2-distributed-mnist.ipynb
    ├── 3-submit-mpi.ipynb
    ├── 1-start-with-mnist.ipynb
    └── Bert-pretraining.ipynb
├── gitbook.md
├── README.md
└── scripts
    ├── delete_notebook.sh
    ├── install_arena.sh
    ├── upgrade_notebook.sh
    ├── print_notebook.sh
    ├── access_notebook.sh
    └── install_notebook.sh

--------------------------------------------------------------------------------
/docs/practice/RESNET.md:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/docs/practice/BERT.md:
--------------------------------------------------------------------------------
### BERT Model
--------------------------------------------------------------------------------
/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/.DS_Store
--------------------------------------------------------------------------------
/demo/1-1-tensorboard.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/demo/1-1-tensorboard.jpg
--------------------------------------------------------------------------------
/demo/2-1-tensorboard.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/demo/2-1-tensorboard.jpg
--------------------------------------------------------------------------------
/demo/3-1-tensorboard.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/demo/3-1-tensorboard.jpg
--------------------------------------------------------------------------------
/docs/setup/images/CPFS_LIST.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/setup/images/CPFS_LIST.png
--------------------------------------------------------------------------------
/docs/setup/images/kubeconfig.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/setup/images/kubeconfig.png
--------------------------------------------------------------------------------
/docs/guide/images/run_notebook.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/guide/images/run_notebook.jpg
--------------------------------------------------------------------------------
/docs/setup/images/CREATE_CPFS.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/setup/images/CREATE_CPFS.png
--------------------------------------------------------------------------------
/docs/setup/images/cloud_shell.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/setup/images/cloud_shell.png
--------------------------------------------------------------------------------
/docs/setup/images/nas_created.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/setup/images/nas_created.png
--------------------------------------------------------------------------------
/docs/guide/images/access_notebook.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/guide/images/access_notebook.png
--------------------------------------------------------------------------------
/docs/guide/images/ai-starter-demo.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/guide/images/ai-starter-demo.jpg
--------------------------------------------------------------------------------
/docs/guide/images/notebook-home.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/guide/images/notebook-home.jpg
--------------------------------------------------------------------------------
/docs/guide/images/pipelines/input.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/guide/images/pipelines/input.jpg
--------------------------------------------------------------------------------
/docs/setup/images/CPFS_TEMPLATE.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/setup/images/CPFS_TEMPLATE.png
--------------------------------------------------------------------------------
/docs/setup/images/create_cluster.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/setup/images/create_cluster.png
--------------------------------------------------------------------------------
/docs/setup/images/master_ssh_ip.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/setup/images/master_ssh_ip.png
--------------------------------------------------------------------------------
/docs/setup/images/nas_add_mount.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/setup/images/nas_add_mount.png
--------------------------------------------------------------------------------
/docs/setup/images/nas_create_fs.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/setup/images/nas_create_fs.png
--------------------------------------------------------------------------------
/docs/setup/images/nas_create_pv.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/setup/images/nas_create_pv.png
--------------------------------------------------------------------------------
/docs/setup/images/nas_create_pvc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/setup/images/nas_create_pvc.png
--------------------------------------------------------------------------------
/docs/setup/images/nas_get_mount.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/setup/images/nas_get_mount.png
--------------------------------------------------------------------------------
/docs/guide/images/pipelines/metrics.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/guide/images/pipelines/metrics.jpg
--------------------------------------------------------------------------------
/docs/guide/images/pipelines/comparsion.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/guide/images/pipelines/comparsion.jpg
--------------------------------------------------------------------------------
/docs/guide/images/pipelines/experiment.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/guide/images/pipelines/experiment.jpg
--------------------------------------------------------------------------------
/docs/setup/images/create_cluster_gpu.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/setup/images/create_cluster_gpu.png
--------------------------------------------------------------------------------
/docs/guide/images/pipelines/pipeline-dag.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/guide/images/pipelines/pipeline-dag.jpg
--------------------------------------------------------------------------------
/docs/setup/images/create_cluster_finish.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/setup/images/create_cluster_finish.png
--------------------------------------------------------------------------------
/docs/guide/images/access_notebook_password.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/guide/images/access_notebook_password.png
--------------------------------------------------------------------------------
/docs/setup/images/create_cluster_instance_type.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/ai-starter/HEAD/docs/setup/images/create_cluster_instance_type.png
--------------------------------------------------------------------------------
/gitbook.md:
--------------------------------------------------------------------------------

### Run GitBook

```
git clone git@github.com:AliyunContainerService/ai-starter.git
cd ai-starter/docs/
gitbook serve
```
--------------------------------------------------------------------------------
/docs/setup/SETUP_LOCAL.md:
--------------------------------------------------------------------------------
## Configure the Local Environment

If you need to manage or access the Kubernetes cluster from your local machine, complete the following configuration:
* Install kubectl. See [Install kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/).
* Download the cluster credentials: get the kubeconfig from the cluster details page in the console and save it to the `$HOME/.kube/config` file.


#### Download the credentials
Go to the Container Service console and open the cluster details page. Click Copy.
![image.png](images/kubeconfig.png)

Paste the copied content into the `$HOME/.kube/config` file.
--------------------------------------------------------------------------------
/docs/setup/SETUP_PUBLIC_STORAGE.md:
--------------------------------------------------------------------------------
## Configure a Volume for Shared Data
Running machine-learning training involves training code, data, and training output. These need to be managed on shared storage and shared with the model-training program.
Kubernetes provides PersistentVolumes (PV) and PersistentVolumeClaims (PVC) as the objects describing shared storage. On top of these, Alibaba Cloud provides FlexVolume drivers for CPFS, NAS, and OSS, which make it easy to attach these Alibaba Cloud storage services to a Kubernetes cluster.

We recommend that the cluster administrator create a shared storage pool. The shared storage is mounted at the notebook's `/root/public` directory, and data scientists can use this directory to share data, code, and results with each other.

Several types of shared storage are supported:
* [Configure NAS storage](./SETUP_NAS.md)
* [Configure CPFS storage](./SETUP_CPFS.md)
* Configure a custom storage type (TODO)
--------------------------------------------------------------------------------
/docs/SUMMARY.md:
--------------------------------------------------------------------------------
* [Introduction](README.md)
* [Environment Setup](setup/README.md)
    * [Create a cluster](setup/CREATE_CLUSTER.md)
    * [Configure NAS as shared storage](setup/SETUP_NAS.md)
    * [Install the Arena machine-learning infrastructure](setup/SETUP_NOTEBOOK.md)
* [Getting Started](guide/README.md)
    * [How to access the notebook](guide/ACCESS_NOTEBOOK.md)
    * [How to use the notebook](guide/USE_NOTEBOOK.md)
    * [How to start with AI in the cloud](guide/PREPARE_GIT_REPO.md)
* [Model Practice](practice/README.md)
    * [ResNet model](practice/RESNET.md)
    * [BERT model](practice/BERT.md)

--------------------------------------------------------------------------------
/docs/setup/README.md:
--------------------------------------------------------------------------------
### Environment Setup
To build a GPU-based Kubernetes machine-learning environment, we need to:
* Create an Alibaba Cloud Container Service for Kubernetes cluster and manage the GPU nodes through Kubernetes. [How to create a cluster](CREATE_CLUSTER.md)
* Configure NAS shared storage. [How to configure NAS](SETUP_NAS.md)
* Deploy the notebook environment. [How to install the notebook](SETUP_NOTEBOOK.md)
* Configure the local environment. [How to configure the local environment](SETUP_LOCAL.md)

> Arena is a machine-learning infrastructure tool that lets data scientists easily run and monitor machine-learning training jobs and conveniently check the results. It currently supports single-node and distributed deep-learning training. Under the hood it builds on Kubernetes, Helm, and Kubeflow — while data scientists may know very little about Kubernetes.

At the same time, users need GPU resource and node management. Arena also provides a `top` command for checking the available GPU resources in the Kubernetes cluster.

In short, Arena's goal is to make data scientists feel as if they are working on a single machine while actually enjoying the power of a GPU cluster.
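For example, `arena top node` lists per-node GPU allocation (a sketch; the output below is illustrative, not captured from a real cluster):

```
# check available GPU resources per node
arena top node

NAME                      IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)
cn-beijing.192.168.0.10   192.168.0.10   <none>  1           0
```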
--------------------------------------------------------------------------------
/docs/setup/SETUP_USER_STORAGE.md:
--------------------------------------------------------------------------------
## Configure Notebook Storage
To preserve each data scientist's work in the notebook, we recommend allocating a volume per user and mounting it at the notebook's `/root` directory, so that their work (code and data) is preserved and not lost when the container is deleted.

For team development we also recommend allocating a shared storage pool so that data and code can be shared within the team. When deploying the notebook, if you declare a public storage configuration, the shared storage is mounted at the notebook's `/root/public` directory, through which data scientists can share data, code, and models with each other.

In Kubernetes, storage objects are described with volumes (PV) and claims (PVC). As the cluster administrator, when allocating environments you need to create a dedicated claim for each data scientist — for example one for user A and one for user B. The claims' backends may mount the same NAS/CPFS, but they must point to different subdirectories so that the working environments stay isolated, as sketched after the steps below.

#### Shared storage types

You can create the claims through the following guides, choosing a suitable storage type:
* [Configure NAS storage](./SETUP_NAS.md)
* [Configure CPFS storage](./SETUP_CPFS.md)

#### Steps
1. Create a volume for shared data; we recommend naming it `public-data`. (This step is optional.)
2. When setting up each data scientist's notebook environment, create a claim of their own for their working data. Different data scientists need distinct names; in this example we name it `training-data`.
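For illustration, a sketch of a per-user PV on the same NAS instance, isolated by subdirectory (it assumes the NAS FlexVolume driver described in [Configure NAS storage](./SETUP_NAS.md); the name and server address are placeholders):

```yaml
# user A's volume: same NAS server as user B, but a dedicated subdirectory
apiVersion: v1
kind: PersistentVolume
metadata:
  name: training-data-user-a                      # hypothetical name
  labels:
    alicloud-pvname: training-data-user-a
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  flexVolume:
    driver: "alicloud/nas"
    options:
      server: "xxx.cn-beijing.nas.aliyuncs.com"   # placeholder mount target
      path: "/user-a"                             # user B would get "/user-b"
      vers: "4.0"
```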
--------------------------------------------------------------------------------
/docs/setup/OPERATE_NOTEBOOK.md:
--------------------------------------------------------------------------------
# Operate the Data Scientist Workspace (Notebook)

### Prerequisites
* The notebook has been installed following [Deploy the data scientist workspace](../setup/SETUP_NOTEBOOK.md)


#### Upgrade the notebook version
If you want to update the notebook in use, run a command like the following. Note: the exact parameters depend on the notebook you installed earlier.

```
# curl -s https://raw.githubusercontent.com/AliyunContainerService/ai-starter/master/scripts/upgrade_notebook.sh | \
bash -s -- \
--notebook-name susan \
--image registry.cn-hangzhou.aliyuncs.com/acs/arena-notebook:0.2.0-20190617081324-7af0024-cpu
```

#### Delete the notebook


```
# curl -s https://raw.githubusercontent.com/AliyunContainerService/ai-starter/master/scripts/delete_notebook.sh | \
bash -s -- \
--notebook-name susan \
--namespace kubeflow
```
--------------------------------------------------------------------------------
/docs/guide/USE_NOTEBOOK.md:
--------------------------------------------------------------------------------
# How to Use the Notebook
### Prerequisites
* The cluster administrator has finished configuring the Kubernetes cluster and GPU nodes: [Environment setup](../setup/README.md)
* The cluster administrator has set up the workspace (notebook) for data scientists: [How to deploy the notebook](../setup/SETUP_NOTEBOOK.md)
* You have deployed and can access the Jupyter notebook: [How to access the notebook](./ACCESS_NOTEBOOK.md)

### Notebook basics

1\. On your first visit, the browser shows the Jupyter notebook home page, where you can see the `ai-starter` directory.

![](./images/notebook-home.jpg)


2\. Click `ai-starter` -> `demo` to see a series of notebooks; you can click through and study them in order.


![](./images/ai-starter-demo.jpg)

3\. In the notebook UI, select an existing code cell and click the `Run` button above to execute it.

![](./images/run_notebook.jpg)

We can start from the first example, MNIST. Click into the example in the notebook.

[Example 1: Start with MNIST](../../demo/1-start-with-mnist.ipynb)

--------------------------------------------------------------------------------
/docs/README.md:
--------------------------------------------------------------------------------
### The Future Is Here: Democratizing AI with Cloud-Native Technology
Gartner's survey of global CIOs shows that AI will be a disruptive force for organizational change in 2019, and AI technology itself is advancing by leaps and bounds.

For AI, compute power is justice and cost is capability. Cloud-native technology, represented by Docker and Kubernetes, gives AI a new way of working: it not only improves resource utilization and sharing, but also grants machine-learning applications portability, elastic scaling, and seamless integration with cloud infrastructure.

1. Portability lets users reproduce impressive AI results and adapt them for their own use.
2. GPUs are powerful but not cheap; elastic scaling lets users pay only for the computation they actually run.
3. Seamless integration with distributed storage such as CPFS, NAS, and OSS, and with RDMA networking, keeps I/O from holding the compute power back.

We are committed to providing:
* A cloud machine-learning environment that integrates cloud infrastructure capabilities through Kubernetes, offering a high-performance, low-cost distributed machine-learning solution.
* An efficient development environment where data scientists can develop, train, and deploy models for prediction without switching workspaces, and make their work persistent and reusable through workflows.
* The combination of Alibaba Cloud's accumulated algorithm expertise and Docker's innate portability, helping you try state-of-the-art AI results from academia as early as possible and put them to your own use, enjoying the technical dividend of compute power.

This repository provides a series of examples that help you do data processing, model training (single-node and distributed), and model prediction with notebooks on Alibaba Cloud Container Service for Kubernetes:

- Learn how to use a notebook for single-node and distributed model training
- Reproduce the high-performance training process and results of BERT and ResNet-50 simply and conveniently
- Chain data processing, model development, model training, and model prediction into one flow

### Examples

- [Learn the basics of model training in the cloud with a notebook]()
- [Hands-on high-performance BERT training]()
- [Hands-on high-performance ResNet training]()
--------------------------------------------------------------------------------
/docs/setup/SETUP_NAS.md:
--------------------------------------------------------------------------------
### NAS
Alibaba Cloud Network Attached Storage (NAS) is a file storage service for compute nodes such as Alibaba Cloud ECS instances, E-HPC, and Container Service. NAS offers seamless integration, shared access, and security controls, which makes it well suited to applications deployed across multiple ECS, E-HPC, or Container Service instances that need to access the same data source.

#### Create a NAS instance and configure a mount target

1\. Go to the Alibaba Cloud NAS console ([https://nas.console.aliyun.com/#/ofs/list](https://nas.console.aliyun.com/#/ofs/list)) and select the region that matches your Kubernetes cluster.

2\. Create a file system, choosing the same region and zone as the Kubernetes cluster.
![image.png](images/nas_create_fs.png)


3\. Create a mount target, again choosing the same VPC and VSwitch as the cluster.
![image.png](images/nas_add_mount.png)

4\. After creation, you can find the NAS instance's mount address on the console details page.
![image.png](images/nas_get_mount.png)
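If you want to double-check the mount target before wiring it into Kubernetes, you can mount it from any ECS instance in the same VPC (a sketch; replace `NFS_SERVER_IP` with your mount address):

```
# temporarily mount the NAS file system and list its contents
mkdir -p /mnt/nas-test
mount -t nfs -o vers=4.0 NFS_SERVER_IP:/ /mnt/nas-test
ls /mnt/nas-test
umount /mnt/nas-test
```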

#### Configure the volume (PV) and claim (PVC) in Kubernetes

1\. Return to the Container Service console ([https://cs.console.aliyun.com/](https://cs.console.aliyun.com/)). We create a volume in the container cluster, with NAS as its source.
18 | 19 | 2\. 在容器控制台中选择 集群 -> 存储卷 -> 创建。 填入上一步中创建的NAS实例的挂载点地址,并设置这个存储声明的子目录和权限。
![image.png](images/nas_create_pv.png)

3\. With the volume created, continue to create the claim (PVC): set its name and select the volume you just created.
![image.png](images/nas_create_pvc.png)

4\. After creation, the new claim instance appears in the claim list.
![image.png](images/nas_created.png)
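You can also confirm from the command line that the volume and claim are bound (a sketch; the names depend on what you entered in the console):

```
# the claim should show STATUS "Bound" once it is matched to the volume
kubectl get pv
kubectl get pvc
```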

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
### Introduction

A deep-learning solution built on Alibaba Cloud's strong compute power and the open-source Kubeflow community, providing you with a low-barrier, open, end-to-end deep-learning service platform. It helps data scientists and algorithm engineers quickly start using Alibaba Cloud resources (including ECS, GPU instances, the distributed storage services NAS and CPFS, Object Storage Service OSS, Elastic MapReduce, Server Load Balancer, and more) for data preparation, model development, model training, evaluation, and prediction, and makes it easy to turn deep-learning capability into service APIs, accelerating integration with business applications.

Specifically, this deep-learning solution has the following characteristics:

- Simple: lowers the barrier to building and managing a deep-learning platform.
- Efficient: improves utilization of heterogeneous compute resources such as CPUs and GPUs, with a unified user experience.
- Open: supports mainstream deep-learning frameworks such as TensorFlow, Keras, and MXNet; users may also bring customized environments, and every tool in the solution is open source.
- Full lifecycle: best practices for building end-to-end deep-learning task flows on Alibaba Cloud's service portfolio.
- Service-oriented: deep-learning capability can be exposed as services and integrated easily with cloud applications.
- Multi-user: supports collaboration within a data-science team.


## Getting Started

* Work for cluster administrators
  1. Environment setup
     * [Create a cluster](docs/setup/CREATE_CLUSTER.md)
     * [Install the machine-learning infrastructure](docs/setup/INSTALL_ARENA.md)
  2. Deploy the notebook
     * [Configure shared storage](docs/setup/SETUP_USER_STORAGE.md)
     * [Deploy the data scientist workspace (notebook)](docs/setup/SETUP_NOTEBOOK.md)
     * [Operate the data scientist workspace (notebook)](docs/setup/OPERATE_NOTEBOOK.md)
  3. Deploy Kubeflow Pipelines
     * [Install Kubeflow Pipelines](https://github.com/AliyunContainerService/kubeflow-aliyun/blob/master/README.md)


* Work for data scientists
  1. Getting started
     * [Access the notebook](docs/guide/ACCESS_NOTEBOOK.md)
     * [Use the notebook](docs/guide/USE_NOTEBOOK.md)
  2. Model practice
     * [Single-node MNIST](demo/1-start-with-mnist.ipynb)
     * [Distributed MNIST](demo/2-distributed-mnist.ipynb)
     * [Distributed training with MPI](demo/3-submit-mpi.ipynb)
  3. End-to-end machine-learning workflow
     * [Author the MNIST pipeline](docs/guide/1_AUTHOR_PIPELINES.md)
--------------------------------------------------------------------------------
/docs/setup/CREATE_CLUSTER.md:
--------------------------------------------------------------------------------
### Kubernetes GPU Environment
Container Service for Kubernetes clusters can manage GPU nodes and schedule GPUs. GPU management works along two dimensions:
* nvidia-docker is installed on the GPU nodes. Unlike the ordinary Docker runtime, nvidia-docker2 lets containers access the ECS instance's NVIDIA driver through directory mounts and lets specified GPU devices be mounted into containers.
* The GPU device plugin is deployed on the GPU nodes and reports each node's GPU information. We can then declare a `nvidia.com/gpu` request in a Kubernetes pod spec to state whether, and how many, GPUs a container needs. At scheduling time the container is placed on a GPU node that satisfies the request, and that number of GPUs is mapped into the container.

For how to create a Kubernetes cluster with GPUs, see [https://help.aliyun.com/document_detail/86490.html](https://help.aliyun.com/document_detail/86490.html?spm=a2c4g.11186623.6.582.4fe64330JXBSuT).
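As an illustration, a minimal sketch of a pod spec that requests one GPU (the pod name and image are placeholders, not from this repository):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example                              # hypothetical name
spec:
  containers:
  - name: cuda-container
    image: tensorflow/tensorflow:1.11.0-gpu      # any GPU-enabled image works
    resources:
      limits:
        nvidia.com/gpu: 1                        # GPUs for the device plugin to attach
```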

### Cluster creation steps
1\. Go to the Container Service console ([https://cs.console.aliyun.com](https://cs.console.aliyun.com/#/k8s/cluster/list)) and choose Create Cluster.
![PNG](images/create_cluster.png)


2\. For the worker instance type, choose the GN family (GN5, GN6, GN5i, and so on). For detailed specifications see [ECS instance families](https://help.aliyun.com/document_detail/25378.html#gn5).
![image.png](images/create_cluster_instance_type.png)


3\. Other settings can be configured as needed. For more detail on Kubernetes cluster configuration, see [https://help.aliyun.com/document_detail/86488.html](https://help.aliyun.com/document_detail/86488.html)

Points to pay attention to here:
a. If you purchased ECS instances in advance, create the Kubernetes cluster in the same VPC and VSwitch as those instances.
b. The pod and service CIDR blocks chosen at cluster creation must not conflict with the ECS network. For network planning see [https://help.aliyun.com/document_detail/86500.html](https://help.aliyun.com/document_detail/86500.html)
c. If you purchased ECS instances in advance, make sure their system disks are at least 100 GB.


4\. After completing the configuration, click Create and wait about 10-20 minutes; the cluster is then ready.
![image.png](images/create_cluster_finish.png)


5\. In the cluster's node list you can see the nodes, including those with GN instance types; these nodes will be used to schedule and run GPU application containers.
![image.png](images/create_cluster_gpu.png)
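You can also verify from the command line that the device plugin registered the GPUs (a sketch; the node name is a placeholder):

```
# each GN node should report allocatable nvidia.com/gpu resources
kubectl get nodes
kubectl describe node <your-gn-node> | grep nvidia.com/gpu
```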

--------------------------------------------------------------------------------
/docs/setup/SETUP_CPFS.md:
--------------------------------------------------------------------------------
### Configure CPFS Shared Storage
CPFS (Cloud Paralleled File System) is a parallel file system. CPFS data is stored across multiple data nodes in a cluster and can be accessed by multiple clients simultaneously, providing high-IOPS, high-throughput, low-latency storage for large high-performance computing clusters.

As high-performance parallel computing is commercialized at scale, traditional parallel file systems face many challenges, such as rapidly growing storage, high cost, complex operations and maintenance, and stability and performance that do not scale linearly with cluster size. CPFS emerged to address these.


### Create an Alibaba Cloud CPFS instance
1\. Go to the Alibaba Cloud [NAS console](https://nas.console.aliyun.com/#/cpfs/list), CPFS management page.

2\. Create a file system, choosing the same region and zone as the Kubernetes cluster.
![image.png](images/CREATE_CPFS.png)


3\. After creation, the CPFS instance appears in the console list, where you can find its mount target and file-system ID.
![image.png](images/CPFS_LIST.png)

#### Configure the volume (PV) and claim (PVC) in Kubernetes
1\. Return to the [Container Service console](https://cs.console.aliyun.com/) and open the [Create from Template page](https://cs.console.aliyun.com/#/k8s/deploy/yaml?kind=pvc).
![image.png](images/CPFS_TEMPLATE.png)

In the template, modify:
1. Replace `<cpfs-mount-target>` and `<cpfs-filesystem-id>` with the values of the CPFS instance created above.
2. Set `<subdirectory>` to the subdirectory you need.
3. Set `<volume-name>` to a storage name of your choice, for easier management.

The template content is as follows:
```
apiVersion: v1
kind: PersistentVolume
metadata:
  name: <volume-name>
  labels:
    alicloud-pvname: <volume-name>
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteMany
  flexVolume:
    driver: "alicloud/cpfs"
    options:
      server: "<cpfs-mount-target>"
      fileSystem: "<cpfs-filesystem-id>"
      subPath: "<subdirectory>"

---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: <volume-name>
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 500Gi
  selector:
    matchLabels:
      alicloud-pvname: <volume-name>
```
--------------------------------------------------------------------------------
/scripts/delete_notebook.sh:
--------------------------------------------------------------------------------
#!/usr/bin/env bash
set -e

# Print log
function log() {
    echo $(date +"[%Y%m%d %H:%M:%S]: ") $1
}

# Print usage
function usage() {
    echo "usage: delete_notebook.sh -n <namespace> --notebook-name <notebook name>"
}

# Delete the notebook
function delete_notebook() {
    if [[ -z $NAMESPACE ]];then
        NAMESPACE="default"
    fi

    if [[ -z $NOTEBOOK_NAME ]];then
        usage
        exit 1
    fi
    NOTEBOOK_NAME=$NOTEBOOK_NAME-notebook

    # if the notebook exists
    local exist=$(check_resource_exist sts $NOTEBOOK_NAME $NAMESPACE)
    if [[ "$exist" == "0" ]]; then
        set -x
        kubectl delete -n $NAMESPACE sts $NOTEBOOK_NAME
    else
        log "notebook $NOTEBOOK_NAME in namespace $NAMESPACE is not found. Please check"
    fi
}

function check_resource_exist() {
    resource_type=$1
    resource_name=$2
    namespace=${3:-"default"}
    kubectl get -n $namespace $resource_type $resource_name &> /dev/null
    echo $?
}


function main() {
    while [ $# -gt 0 ];do
        case $1 in
        -n|--namespace)
            NAMESPACE=$2
            shift
            ;;
        --notebook-name)
            NOTEBOOK_NAME=$2
            shift
            ;;
        -h|--help)
            usage
            exit 0
            ;;
        *)
            echo "unknown option [$1]"
            exit 1
            ;;
        esac
        shift
    done
    delete_notebook
}

main "$@"
--------------------------------------------------------------------------------
/docs/setup/INSTALL_ARENA.md:
--------------------------------------------------------------------------------
## Install the Machine-Learning Infrastructure
Kubernetes supports defining different workload types through CRDs for managing the lifecycle of distributed applications — in deep learning, for example, the MPI and ParameterServer patterns.
This deep-learning solution supports the MPI and TensorFlow Parameter Server patterns. To run them on Kubernetes, we need to deploy foundational services such as tf-job, mpi-operator, and tf-job-dashboard in the cluster. To simplify installation, we provide an install script.
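For illustration, a minimal sketch of the kind of custom resource these operators manage — a TFJob in the v1alpha2 API that the tf-job operator below serves (the job name, image, and command are hypothetical):

```yaml
apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  name: mnist-example                            # hypothetical job name
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2                                # two training workers
      template:
        spec:
          containers:
          - name: tensorflow                     # container name expected by the operator
            image: tensorflow/tensorflow:1.11.0-gpu   # placeholder image
            command: ["python", "/app/main.py"]       # placeholder entrypoint
            resources:
              limits:
                nvidia.com/gpu: 1
```

In practice you rarely write these manifests by hand; tools such as the Arena CLI generate them for you.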

### Execution environment
Because the install script interacts with the Kubernetes cluster, the environment needs kubectl installed and a Kubernetes user credential with permission to create applications. This stage is usually performed by the cluster administrator.

##### Log in to the master
You can choose to run the install command on a master node. Find the master node's SSH address in the console:
![image.png](images/master_ssh_ip.png)

Log in to the master node over SSH.

##### Cloud Shell
If the master's SSH port is not open, you can also run the install command from Cloud Shell; see the [documentation](https://help.aliyun.com/document_detail/100650.html).
![image.png](images/cloud_shell.png)

##### Run from your laptop
You can also run it locally; this requires installing kubectl and downloading the cluster credentials. See [Install kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/).
Get the kubeconfig from the cluster details page in the console.
![image.png](images/kubeconfig.png)

### Install the infrastructure
The deploy script is open source on GitHub. Besides the default configuration, it supports extra parameters for customizing the deployment.

##### Install Arena's dependency components
```
curl -s https://raw.githubusercontent.com/AliyunContainerService/ai-starter/master/scripts/install_arena.sh | \
bash -s -- \
--prometheus
```

During the installation above, the deployed dependency components can be customized with the following parameter:
```
--prometheus    whether to deploy Prometheus, plus the GPU-monitoring collector and Grafana
```

##### Check the installation result
```
# check arena dependencies
# kubectl -n arena-system get po
NAME                                      READY   STATUS    RESTARTS   AGE
mpi-operator-5f89ddc9bf-5mw4c             1/1     Running   0          1m
tf-job-dashboard-7dc786b7fb-t57wx         1/1     Running   0          1m
tf-job-operator-v1alpha2-98bfbfc4-9d66t   1/1     Running   0          1m
```
--------------------------------------------------------------------------------
/scripts/install_arena.sh:
--------------------------------------------------------------------------------
#!/usr/bin/env bash
set -e

function install_arena() {
    check_resource_exist "pod" "arena-installer" "kube-system"
    if [[ "$UPGRADE" != "true" && "$?" == "0" ]]; then
        echo "Arena has been installed."
        exit 0
    fi

    set -e

    HOST_NETWORK=${HOST_NETWORK:-"true"}
    PROMETHEUS=${PROMETHEUS:-"true"}

    cat <<EOF | tee $LOG_PRINT | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: arena-installer
  namespace: kube-system
spec:
  restartPolicy: Never
  serviceAccountName: admin
  hostNetwork: true
  containers:
  - name: arena
    image: registry.cn-beijing.aliyuncs.com/acs/arena:0.2.0-f6b6188
    env:
    - name: useHostNetwork
      value: "$HOST_NETWORK"
    - name: usePrometheus
      value: "$PROMETHEUS"
    - name: platform
      value: ack
EOF
}

function check_resource_exist() {
    resource_type=$1
    resource_name=$2
    namespace=${3:-"default"}
    set +e
    kubectl get -n $namespace $resource_type $resource_name &> /dev/null
    return $?
}

function parse_args() {
    LOG_PRINT=${LOG_PRINT:-"/dev/null"}
}

function install() {
    parse_args
    install_arena
    echo "Install successful"
}

function main() {
    while [ $# -gt 0 ];do
        case $1 in
        -p|--prometheus)
            PROMETHEUS="true"
            ;;
        --upgrade)
            UPGRADE="true"
            ;;
        --debug)
            set -x
            ;;
        *)
            echo "unknown option [$1]"
            exit 1
            ;;
        esac
        shift
    done
    install
}

main "$@"
--------------------------------------------------------------------------------
/scripts/upgrade_notebook.sh:
--------------------------------------------------------------------------------
#!/usr/bin/env bash
set -e

# Print log
function log() {
    echo $(date +"[%Y%m%d %H:%M:%S]: ") $1
}

# Print usage
function usage() {
    echo "usage: upgrade_notebook.sh -n <namespace> --image <notebook image> --notebook-name <notebook name>"
}

# Upgrade notebook's image
function upgrade_notebook() {
    if [[ -z $NAMESPACE ]];then
        NAMESPACE="default"
    fi

    if [[ -z $NOTEBOOK_NAME ]];then
        usage
        exit 1
    fi
    NOTEBOOK_NAME=$NOTEBOOK_NAME-notebook

    if [[ -z $NOTEBOOK_IMAGE ]];then
        usage
        exit 1
    fi


    # if the notebook exists
    local exist=$(check_resource_exist sts $NOTEBOOK_NAME $NAMESPACE)
    if [[ "$exist" == "0" ]]; then
        set -x
        kubectl patch statefulset -n $NAMESPACE $NOTEBOOK_NAME -p '{"spec":{"updateStrategy":{"type":"RollingUpdate"}}}'
        kubectl set image statefulset -n $NAMESPACE $NOTEBOOK_NAME $NOTEBOOK_NAME=$NOTEBOOK_IMAGE
    else
        log "notebook $NOTEBOOK_NAME in namespace $NAMESPACE is not found. Please check"
    fi
}

function check_resource_exist() {
    resource_type=$1
    resource_name=$2
    namespace=${3:-"default"}
    kubectl get -n $namespace $resource_type $resource_name &> /dev/null
    echo $?
}


function main() {
    while [ $# -gt 0 ];do
        case $1 in
        -n|--namespace)
            NAMESPACE=$2
            shift
            ;;
        --notebook-name)
            NOTEBOOK_NAME=$2
            shift
            ;;
        --image)
            NOTEBOOK_IMAGE=$2
            shift
            ;;
        -h|--help)
            usage
            exit 0
            ;;
        *)
            echo "unknown option [$1]"
            exit 1
            ;;
        esac
        shift
    done
    upgrade_notebook
}

main "$@"
--------------------------------------------------------------------------------
/docs/guide/ACCESS_NOTEBOOK.md:
--------------------------------------------------------------------------------
# How to Access the Notebook

### Prerequisites
* The cluster administrator has finished configuring the Kubernetes cluster and GPU nodes: [Environment setup](../setup/README.md)
* The cluster administrator has set up the workspace (notebook) for data scientists: [How to deploy the notebook](../setup/SETUP_NOTEBOOK.md)
* You have obtained the notebook's pod IP, access domain and address, and login token from the cluster administrator

### Access the notebook

#### Get the notebook's access information from the cluster administrator

After deploying the notebook workspace, the cluster administrator can obtain the notebook's access addresses. The printed output looks like:

```
Notebook pod ip is 172.16.1.103
Notebook access token is <token>
Ingress of notebook ip is 39.104.xx.xx
Ingress of notebook domain is foo.bar.com
```

Meaning:
* The pod IP is the notebook container's access IP.
* The access token is the password for the first login to the notebook; after logging in you can keep using the token, or set a new password.
* The ingress IP and domain are the ingress access entry configured for the notebook.

You can access the notebook in either of two ways:
1. Via sshuttle: the cluster administrator needs to provide a jump host connected to the cluster network. Obtain the jump host's IP/password and the notebook's pod IP from the administrator, then set up a proxy to the notebook with sshuttle.
2. Via ingress: when deploying the notebook, the cluster administrator exposes it through the ingress's public SLB, accessible by domain name only and with HTTPS encryption, keeping the notebook safe on the public internet. Obtain the ingress IP and domain from the administrator.

Choose the access method that fits your situation and needs.

#### Access the notebook via sshuttle
1\. sshuttle is an SSH-based proxy tool: it logs in to the jump host and proxies your requests for a given address range. If you have not installed sshuttle, install it with:

```
sudo pip install sshuttle
```

2\. Because the Kubernetes network lives in a VPC, it cannot be reached directly from the public internet. With sshuttle and a jump host, you can set up a network proxy between your local environment and the Kubernetes environment:

```
sshuttle -r root@<jump-host-ip> <proxied CIDR>
```

* The jump host must be in the same VPC as Kubernetes and in the same security group, so that they can reach each other.
* From the notebook pod IP sent by the administrator you can derive the CIDR to proxy. For example, if the pod IP is `172.16.1.103`, the proxied range can be `172.16.1.0/24`.

Once it runs you will see a connected message:
```
sshuttle -r root@39.104.xx.xx 172.16.1.0/24
client: Connected.
```

3\. Then open `http://<pod-ip>:8888` directly in your browser.

#### Access the notebook via ingress

1\. If you have your own DNS service, add a record resolving the ingress domain to the corresponding ingress IP. If you do not, you can instead edit your local hosts file to map the ingress domain to the ingress IP, then access the notebook via the ingress domain.
```
47.101.xx.xxx foo.bar.com
<ingress-ip> <your-notebook-domain>
```

2\. After setting the hosts file, open `https://<your-notebook-domain>` directly in a browser.


### Log in to the notebook
With the token obtained from the cluster administrator, you can log in to the notebook directly, or reset the password.
![image.png](./images/access_notebook_password.png)

After logging in, you reach the notebook UI.
![image.png](./images/access_notebook.png)

You can start your machine-learning journey!
--------------------------------------------------------------------------------
/scripts/print_notebook.sh:
--------------------------------------------------------------------------------
#!/usr/bin/env bash
set -e

function print_ingress() {
    NOTEBOOK_NAME=${NOTEBOOK_NAME:-"arena-notebook"}
    if [[ -n $NOTEBOOK_WORKSPACE_NAME ]];then
        NOTEBOOK_NAME="$NOTEBOOK_WORKSPACE_NAME-notebook"
    fi
    INGRESS_NAME="arena-notebook-ingress"
    if [[ -n $NOTEBOOK_NAME ]];then
        INGRESS_NAME="$NOTEBOOK_NAME-ingress"
    fi
    INGRESS_NAMESPACE=${INGRESS_NAMESPACE:-"default"}
    pod_ip=$(kubectl get pod $NOTEBOOK_NAME-0 -n $INGRESS_NAMESPACE -ojsonpath='{.status.podIP}')
    echo "Notebook pod ip is $pod_ip"

    token=$(kubectl logs $NOTEBOOK_NAME-0 -n $INGRESS_NAMESPACE | grep NotebookApp | grep 'token=' | awk -F 'token=' '{print $2}')
    if [[ $token != "" ]];then
        echo "Notebook access token is $token"
    fi

    service_type=$(kubectl get service $NOTEBOOK_NAME -n $INGRESS_NAMESPACE -ojsonpath='{.spec.type}')
    if [[ "$service_type" == "NodePort" ]];then
        node_port=$(kubectl get service $NOTEBOOK_NAME -n $INGRESS_NAMESPACE -ojsonpath='{.spec.ports[0].nodePort}')
        node_ip=$(kubectl get no -ojsonpath='{.items[0].status.addresses[0].address}')
        echo "You can access by NodePort: $node_ip:$node_port"
    fi

    # if the notebook ingress exists
    local ingress_exist=$(check_resource_exist ingress $INGRESS_NAME $INGRESS_NAMESPACE)
    if [[ "$ingress_exist" == "0" ]]; then
        ingress_host=$(kubectl get ingress $INGRESS_NAME -n $INGRESS_NAMESPACE -ojsonpath='{.spec.rules[0].host}')
        ingress_ip=$(kubectl get ingress $INGRESS_NAME -n $INGRESS_NAMESPACE -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
        echo "Ingress of notebook ip is $ingress_ip"
        echo "Ingress of notebook domain is $ingress_host"
    fi
}

function check_resource_exist() {
    resource_type=$1
    resource_name=$2
    namespace=${3:-"default"}
    kubectl get -n $namespace $resource_type $resource_name &> /dev/null
    echo $?
}


function main() {
    while [ $# -gt 0 ];do
        case $1 in
        -n|--namespace)
            INGRESS_NAMESPACE=$2
            shift
            ;;
        --notebook-name)
            NOTEBOOK_WORKSPACE_NAME=$2
            shift
            ;;
        -h|--help)
            exit 0
            ;;
        *)
            echo "unknown option [$1]"
            exit 1
            ;;
        esac
        shift
    done
    print_ingress
}

main "$@"
--------------------------------------------------------------------------------
/docs/setup/SETUP_NOTEBOOK.md:
--------------------------------------------------------------------------------
# Deploy the Data Scientist Workspace (Notebook)

### Prerequisites
* Configure the local environment following [Configure the local environment](../setup/SETUP_LOCAL.md).
* Install the Arena infrastructure following [Install the Arena machine-learning infrastructure](../setup/INSTALL_ARENA.md).
* Configure the claim for shared data following [Configure shared storage](../setup/SETUP_PUBLIC_STORAGE.md).
* Configure the claim for notebook data following [Configure notebook storage](../setup/SETUP_USER_STORAGE.md).


#### Deploy the notebook
The notebook installation supports configuring an ingress, with TLS to protect your access.
1. Prepare your service certificate. If you do not have one, you can generate it with the following commands:
```
# foo.bar.com can be replaced with your own domain
# domain="foo.bar.com"
# openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout tls.key -out tls.crt -subj "/CN=$domain/O=$domain"
```

The commands above generate a certificate file tls.crt and a private-key file tls.key.

2. Use the certificate and private key to create a Kubernetes Secret named notebook-secret.
```
# kubectl create secret tls notebook-secret --key tls.key --cert tls.crt
```

3. Deploy the notebook.

When deploying, you can choose how the notebook is exposed:
* Via sshuttle: you need a jump host connected to the cluster network. If the jump host is an ECS on Alibaba Cloud, open the SSH port (usually 22) in its security group; for details see the [documentation](https://www.alibabacloud.com/help/zh/doc-detail/25471.htm). Data scientists proxy their notebook requests through the jump host via sshuttle, keeping the network path between them and the notebook open. No extra deployment parameters are needed.
* Via ingress: expose the notebook publicly through an ingress. When deploying, pass the `--ingress` flag and declare the ingress domain and TLS certificate.

The deploy command is:

```
# foo.bar.com can be replaced with your own domain
# curl -s https://raw.githubusercontent.com/AliyunContainerService/ai-starter/master/scripts/install_notebook.sh | \
bash -s -- \
--notebook-name susan \
--ingress --ingress-domain foo.bar.com --ingress-secret notebook-secret \
--pvc-name training-data
```

The deployment can be customized with the following parameters:

```
--namespace         the namespace to deploy the notebook in
--notebook-name     the identifier of the deployed notebook
--ingress           whether to configure an ingress for the notebook
--ingress-domain    the ingress domain for the notebook; effective only with --ingress
--ingress-secret    the certificate secret used for the ingress HTTPS; effective only with --ingress
--pvc-name          the claim mounted at the notebook's /root directory; default training-data
--public-pvc-name   the claim for shared data, mounted at the notebook's /root/public directory
--notebook-image    the notebook image; default registry.cn-beijing.aliyuncs.com/acs/arena-notebook:cpu
--clean             if specified, removes a notebook previously deployed by this script
```

4. After installation, check the result:

```
# check the notebook installation
# kubectl get po
NAME                              READY   STATUS    RESTARTS   AGE
arena-notebook-5bd4d8c5f7-jc7vf   1/1     Running   0          4d
```

### Get the notebook's access address
Run the following script to get the notebook's IP address and ingress domain:

```
# curl -s https://raw.githubusercontent.com/AliyunContainerService/ai-starter/master/scripts/print_notebook.sh | bash -s -- --notebook-name susan
Notebook pod ip is 172.16.1.103
Notebook access token is <token>
Ingress of notebook ip is 39.104.xx.xx
Ingress of notebook domain is foo.bar.com
```

At this point the cluster administrator has finished configuring the environment and allocated a deep-learning environment to a data scientist.
The administrator hands the password and token to the data scientist, who can then start deep-learning work in the notebook.
--------------------------------------------------------------------------------
/scripts/access_notebook.sh:
--------------------------------------------------------------------------------
#!/usr/bin/env bash
set -e
ACCESS_TYPE="PORT_FORWARD"
LABEL="app=arena-notebook"
DEPLOYMENT_NAME="arena-notebook"

function access_notebook() {
    if [[ "$ACCESS_TYPE" == "SERVICE" ]];then
        expose_service
    else
        port_forward
    fi
}

function port_forward() {
    local NAMESPACE=${NAMESPACE:-default}
    local PORT=${PORT:-8081}
    local PODNAME=${PODNAME:-"arena-notebook-0"}
    if [[ -n $USER_NAME ]];then
        PODNAME="$USER_NAME-notebook-0"
    fi
    # local PODNAME=`kubectl get po -n $NAMESPACE -l $LABEL | grep -v NAME| head -1| awk '{print $1}'`
    echo "Forwarding pod: $NAMESPACE/$PODNAME, port: $PORT"
    echo "Open http://localhost:$PORT in browser"
    kubectl port-forward ${PODNAME} -n ${NAMESPACE} $PORT:8888
}

function expose_service() {
    local NAMESPACE=${NAMESPACE:-default}
    local PODNAME=${PODNAME:-"arena-notebook-0"}
    local SERVICE_TYPE=${SERVICE_TYPE:-LoadBalancer}
    local SERVICE_NAME="arena-notebook"
    local SERVICE_URL=$(get_service_url $SERVICE_NAME $NAMESPACE)
    if [[ "$SERVICE_URL" != "" ]]; then
        echo "Service $SERVICE_NAME already exists."
        echo "If you want to delete the service, please exec \"kubectl delete svc -n $NAMESPACE $SERVICE_NAME\" "
        echo "If you want to get service detail, please exec \"kubectl get svc -n $NAMESPACE $SERVICE_NAME\" "
        echo "You can access the notebook by opening http://$SERVICE_URL in a browser"
        exit 0
    fi

    kubectl expose pod $PODNAME -n $NAMESPACE --type=$SERVICE_TYPE --name=$SERVICE_NAME
    echo "Exposed the notebook with a $SERVICE_TYPE type service"
    echo "If you want to delete the service, please exec \"kubectl delete svc -n $NAMESPACE $SERVICE_NAME\" "
    echo "If you want to get service detail, please exec \"kubectl get svc -n $NAMESPACE $SERVICE_NAME\" "
    if [[ "$SERVICE_TYPE" == "LoadBalancer" ]]; then
        echo "Wait for loadbalancer ready..."
    fi
    SERVICE_URL=$(get_service_url $SERVICE_NAME $NAMESPACE)
    while [[ $SERVICE_URL == "" ]];do
        sleep 3
        SERVICE_URL=$(get_service_url $SERVICE_NAME $NAMESPACE)
    done
    echo "You can access the notebook by opening http://$SERVICE_URL in a browser"
}

function get_service_url() {
    local SERVICE_NAME=$1
    local NAMESPACE=${2:-default}
    set +e
    kubectl get svc -n $NAMESPACE $SERVICE_NAME &> /dev/null
    local exist=$?
    set -e
    if [[ $exist == 1 ]]; then
        echo ""
    else
        SERVICE_TYPE=$(kubectl get svc -n $NAMESPACE $SERVICE_NAME -ojsonpath='{.spec.type}')
        SERVICE_PORT=$(kubectl get svc -n $NAMESPACE $SERVICE_NAME -ojsonpath='{.spec.ports[0].port}')
        SERVICE_IP=""
        if [[ "$SERVICE_TYPE" == "NodePort" ]];then
            SERVICE_IP=$(kubectl get no -ojsonpath='{.items[0].status.addresses[0].address}')
        elif [[ "$SERVICE_TYPE" == "LoadBalancer" ]]; then
            SERVICE_IP=$(kubectl get svc -n $NAMESPACE $SERVICE_NAME -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
        else
            SERVICE_IP=$(kubectl get svc -n $NAMESPACE $SERVICE_NAME -ojsonpath='{.spec.clusterIP}')
        fi

        if [[ $SERVICE_IP == "" ]];then
            echo ""
            exit 0
        fi
        echo "$SERVICE_IP:$SERVICE_PORT"
    fi
}

usage() {
    echo "Usage:"
    echo "    access_notebook.sh [-s] [-p] [-t SERVICE_TYPE]"
    echo "Options:"
    echo "    -p, use port forward to access notebook. [default]"
    echo "    -s, use service to access notebook."
    echo "    -t, the type of service: LoadBalancer/NodePort/ClusterIP"
LoadBalancer/NodePort/ClusterIP" 93 | exit -1 94 | } 95 | 96 | function main() { 97 | while [ $# -gt 0 ];do 98 | case $1 in 99 | -p|--port-forward) 100 | ACCESS_TYPE="PORT_FORWARD" 101 | ;; 102 | -s|--service) 103 | ACCESS_TYPE="SERVICE" 104 | ;; 105 | -t|--service-type) 106 | SERVICE_TYPE=$2 107 | shift 108 | ;; 109 | -u|--user) 110 | USER_NAME=$2 111 | shift 112 | ;; 113 | -h|--help) 114 | usage 115 | exit 0 116 | ;; 117 | *) 118 | echo "unknown option [$key]" 119 | exit 1 120 | ;; 121 | esac 122 | shift 123 | done 124 | access_notebook 125 | } 126 | 127 | main "$@" -------------------------------------------------------------------------------- /scripts/install_notebook.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | set -e 3 | 4 | function install_notebook() { 5 | cat < $LOG_PRINT - 6 | apiVersion: v1 7 | kind: ServiceAccount 8 | metadata: 9 | name: $NOTEBOOK_NAME 10 | namespace: $NAMESPACE 11 | --- 12 | kind: ClusterRole 13 | apiVersion: rbac.authorization.k8s.io/v1 14 | metadata: 15 | name: $NOTEBOOK_NAME 16 | rules: 17 | - apiGroups: 18 | - "" 19 | resources: 20 | - pods 21 | - services 22 | - deployments 23 | - nodes 24 | - nodes/* 25 | - services/proxy 26 | - persistentvolumes 27 | verbs: 28 | - get 29 | - list 30 | --- 31 | apiVersion: rbac.authorization.k8s.io/v1 32 | kind: Role 33 | metadata: 34 | name: $NOTEBOOK_NAME 35 | namespace: $NAMESPACE 36 | rules: 37 | - apiGroups: 38 | - "" 39 | resources: 40 | - configmaps 41 | verbs: 42 | - '*' 43 | - apiGroups: 44 | - "" 45 | resources: 46 | - services/proxy 47 | - persistentvolumeclaims 48 | - events 49 | verbs: 50 | - get 51 | - list 52 | - apiGroups: 53 | - "" 54 | resources: 55 | - pods 56 | - pods/log 57 | - services 58 | verbs: 59 | - '*' 60 | - apiGroups: 61 | - "" 62 | - apps 63 | - extensions 64 | resources: 65 | - deployments 66 | - replicasets 67 | - statefulsets 68 | verbs: 69 | - '*' 70 | - apiGroups: 71 | - kubeflow.org 72 | resources: 73 | - '*' 74 | verbs: 75 | - '*' 76 | - apiGroups: 77 | - batch 78 | resources: 79 | - jobs 80 | verbs: 81 | - '*' 82 | --- 83 | kind: ClusterRoleBinding 84 | apiVersion: rbac.authorization.k8s.io/v1 85 | metadata: 86 | name: $NOTEBOOK_NAME-cluster-role 87 | namespace: $NAMESPACE 88 | roleRef: 89 | apiGroup: rbac.authorization.k8s.io 90 | kind: ClusterRole 91 | name: $NOTEBOOK_NAME 92 | subjects: 93 | - kind: ServiceAccount 94 | name: $NOTEBOOK_NAME 95 | namespace: $NAMESPACE 96 | --- 97 | kind: RoleBinding 98 | apiVersion: rbac.authorization.k8s.io/v1 99 | metadata: 100 | name: $NOTEBOOK_NAME-role 101 | namespace: $NAMESPACE 102 | roleRef: 103 | apiGroup: rbac.authorization.k8s.io 104 | kind: Role 105 | name: $NOTEBOOK_NAME 106 | subjects: 107 | - kind: ServiceAccount 108 | name: $NOTEBOOK_NAME 109 | namespace: $NAMESPACE 110 | EOF 111 | 112 | # Deploy notebook statefulSet with public dataSet 113 | if [[ -n $PUBLIC_PVC_NAME ]];then 114 | cat < $LOG_PRINT - 115 | apiVersion: apps/v1 116 | kind: StatefulSet 117 | metadata: 118 | name: $NOTEBOOK_NAME 119 | namespace: $NAMESPACE 120 | labels: 121 | app: $NOTEBOOK_NAME 122 | arena-notebook: $NOTEBOOK_NAME 123 | spec: 124 | selector: # define how the deployment finds the pods it mangages 125 | matchLabels: 126 | app: $NOTEBOOK_NAME 127 | serviceName: "$NOTEBOOK_NAME" 128 | template: 129 | metadata: 130 | labels: 131 | app: $NOTEBOOK_NAME 132 | spec: 133 | serviceAccountName: $NOTEBOOK_NAME 134 | containers: 135 | - name: $NOTEBOOK_NAME 136 | image: $NOTEBOOK_IMAGE 137 | 
imagePullPolicy: Always 138 | env: 139 | - name: PUBLIC_DATA_NAME 140 | value: $PUBLIC_PVC_NAME 141 | - name: USER_DATA_NAME 142 | value: $PVC_NAME 143 | ports: 144 | - containerPort: 8888 145 | volumeMounts: 146 | - mountPath: "$PVC_MOUNT_PATH" 147 | name: workspace 148 | - mountPath: "$PUBLIC_PVC_MOUNT_PATH" 149 | name: public-workspace 150 | volumes: 151 | - name: workspace 152 | persistentVolumeClaim: 153 | claimName: $PVC_NAME 154 | - name: public-workspace 155 | persistentVolumeClaim: 156 | claimName: $PUBLIC_PVC_NAME 157 | EOF 158 | else 159 | cat < $LOG_PRINT - 160 | apiVersion: apps/v1 161 | kind: StatefulSet 162 | metadata: 163 | name: $NOTEBOOK_NAME 164 | namespace: $NAMESPACE 165 | labels: 166 | app: $NOTEBOOK_NAME 167 | arena-notebook: $NOTEBOOK_NAME 168 | spec: 169 | selector: # define how the deployment finds the pods it mangages 170 | matchLabels: 171 | app: $NOTEBOOK_NAME 172 | serviceName: "$NOTEBOOK_NAME" 173 | template: 174 | metadata: 175 | labels: 176 | app: $NOTEBOOK_NAME 177 | spec: 178 | serviceAccountName: $NOTEBOOK_NAME 179 | containers: 180 | - name: $NOTEBOOK_NAME 181 | image: $NOTEBOOK_IMAGE 182 | imagePullPolicy: Always 183 | env: 184 | - name: PUBLIC_PVC_NAME 185 | value: '' 186 | - name: PVC_NAME 187 | value: $PVC_NAME 188 | ports: 189 | - containerPort: 8888 190 | volumeMounts: 191 | - mountPath: "$PVC_MOUNT_PATH" 192 | name: workspace 193 | volumes: 194 | - name: workspace 195 | persistentVolumeClaim: 196 | claimName: $PVC_NAME 197 | EOF 198 | fi 199 | 200 | # Define the arena notebook service 201 | cat < $LOG_PRINT - 202 | apiVersion: v1 203 | kind: Service 204 | metadata: 205 | name: $NOTEBOOK_NAME 206 | namespace: $NAMESPACE 207 | spec: 208 | ports: 209 | - port: 80 210 | targetPort: 8888 211 | name: notebook 212 | selector: 213 | app: $NOTEBOOK_NAME 214 | type: $SREVICE_TYPE 215 | EOF 216 | } 217 | 218 | function install_ingress() { 219 | if [[ "$INSTALL_INGRESS" != "true" ]]; then 220 | return 221 | fi 222 | 223 | cat < $LOG_PRINT - 224 | apiVersion: extensions/v1beta1 225 | kind: Ingress 226 | metadata: 227 | name: $NOTEBOOK_NAME-ingress 228 | namespace: $NAMESPACE 229 | spec: 230 | tls: 231 | - hosts: 232 | - $INGRESS_HOST 233 | secretName: $INGRESS_SECRET_NAME 234 | rules: 235 | - host: $INGRESS_HOST 236 | http: 237 | paths: 238 | - backend: 239 | serviceName: $NOTEBOOK_NAME 240 | servicePort: 80 241 | EOF 242 | } 243 | 244 | function parse_args() { 245 | NAMESPACE=${NAMESPACE:-"default"} 246 | PVC_NAME=${PVC_NAME:-"training-data"} 247 | PVC_MOUNT_PATH=${PVC_MOUNT_PATH:-"/root"} 248 | PUBLIC_PVC_MOUNT_PATH=${PUBLIC_PVC_MOUNT_PATH:-"/root/public"} 249 | SREVICE_TYPE=${SREVICE_TYPE:-"ClusterIP"} 250 | NOTEBOOK_IMAGE=${NOTEBOOK_IMAGE:-"registry.cn-beijing.aliyuncs.com/acs/arena-notebook:cpu"} 251 | NOTEBOOK_NAME=${NOTEBOOK_NAME:-"arena-notebook"} 252 | if [[ -n $NOTEBOOK_WORKSPACE_NAME ]];then 253 | NOTEBOOK_NAME="$NOTEBOOK_WORKSPACE_NAME-notebook" 254 | fi 255 | LOG_PRINT=${LOG_PRINT:-"/dev/null"} 256 | LOG_PRINT="/dev/stdout" 257 | } 258 | 259 | function check_args() { 260 | # if [[ ! -n $USER_NAME ]]; then 261 | # echo "USER_NAME can't be empty, please add -u params to specify" 262 | # exit 1 263 | # fi 264 | if [[ -n $INSTALL_INGRESS ]]; then 265 | if [[ ! -n $INGRESS_HOST ]]; then 266 | echo "INGRESS_HOST can't be empty, please add --ingress-domain to specify the ingress domain" 267 | exit 1 268 | fi 269 | if [[ ! 
        if [[ ! -n $INGRESS_SECRET_NAME ]]; then
            echo "INGRESS_SECRET_NAME can't be empty, please add --ingress-secret to specify the ingress tls secret"
            exit 1
        fi
        local secret_exist=$(check_resource_exist secret $INGRESS_SECRET_NAME $NAMESPACE)

        if [[ "$CLEAN" != "true" && "$secret_exist" != "0" ]]; then
            echo "Secret \"$INGRESS_SECRET_NAME\" does not exist, please check the secret specified by --ingress-secret" && \
            exit 1
        fi
    fi
    # if the notebook sts exists
    local sts_exist=$(check_resource_exist sts $NOTEBOOK_NAME $NAMESPACE)
    if [[ "$CLEAN" != "true" && "$sts_exist" == "0" ]]; then
        echo "This notebook \"$NOTEBOOK_NAME\" is installed; if you want to reinstall it, please specify --clean to uninstall first"
        echo "If you want to install a notebook for another user, please specify --notebook-name for the user, or -n for a different namespace"
        exit 0
    fi
}

function check_resource_exist() {
    resource_type=$1
    resource_name=$2
    namespace=${3:-"default"}
    kubectl get -n $namespace $resource_type $resource_name &> /dev/null
    echo $?
}

function delete_resource() {
    resource_type=$1
    resource_name=$2
    namespace=${3:-"default"}
    local exist=$(check_resource_exist $resource_type $resource_name $namespace)
    if [[ $exist == "0" ]]; then
        kubectl delete -n $namespace $resource_type $resource_name
    fi
}

function clean_notebook() {
    delete_resource svc $NOTEBOOK_NAME $NAMESPACE
    delete_resource sts $NOTEBOOK_NAME $NAMESPACE
    delete_resource sa $NOTEBOOK_NAME $NAMESPACE
    delete_resource secret $NOTEBOOK_NAME $NAMESPACE
    delete_resource clusterRole $NOTEBOOK_NAME $NAMESPACE
    delete_resource clusterRoleBinding "$NOTEBOOK_NAME-cluster-role" $NAMESPACE
    delete_resource role $NOTEBOOK_NAME $NAMESPACE
    delete_resource roleBinding "$NOTEBOOK_NAME-role" $NAMESPACE
    delete_resource ingress "$NOTEBOOK_NAME-ingress" $NAMESPACE

    echo "Clean notebook finished."
}

function install() {
    parse_args
    check_args
    if [[ "$CLEAN" == "true" ]]; then
        clean_notebook
        return
    fi
    install_notebook
    install_ingress
    echo "Install successful"
}

function main() {
    while [ $# -gt 0 ];do
        case $1 in
        -n|--namespace)
            NAMESPACE=$2
            shift
            ;;
        --ingress)
            INSTALL_INGRESS="true"
            ;;
        --ingress-domain)
            INGRESS_HOST=$2
            shift
            ;;
        --ingress-secret)
            INGRESS_SECRET_NAME=$2
            shift
            ;;
        --notebook-name)
            NOTEBOOK_WORKSPACE_NAME=$2
            shift
            ;;
        -u|--user)
            USER_NAME=$2
            shift
            ;;
        --pvc-name)
            PVC_NAME=$2
            shift
            ;;
        --public-pvc-name)
            PUBLIC_PVC_NAME=$2
            shift
            ;;
        --notebook-image)
            NOTEBOOK_IMAGE=$2
            shift
            ;;
        --clean)
            CLEAN="true"
            ;;
        --debug)
            set -x
            ;;
        *)
            echo "unknown option [$1]"
            exit 1
            ;;
        esac
        shift
    done
    install
}

main "$@"

--------------------------------------------------------------------------------
/docs/guide/1_AUTHOR_PIPELINES.md:
--------------------------------------------------------------------------------
# How to Author Kubeflow Pipelines

## Preparation

A machine-learning workflow is task-driven and also data-driven: it involves importing and preparing data, exporting and evaluating model-training checkpoints, and exporting the final model. Distributed storage is needed as the transfer medium; here we use NAS as the distributed storage.

- Create the distributed storage, NAS in this example. `NFS_SERVER_IP` below must be replaced with the real NAS server address.

1.Create the Alibaba Cloud NAS service; you can follow the [documentation](https://github.com/AliyunContainerService/ai-starter/blob/master/docs/setup/SETUP_NAS.md).

2.Create `/data` on the NFS server:

```
# mkdir -p /nfs
# mount -t nfs -o vers=4.0 NFS_SERVER_IP:/ /nfs
# mkdir -p /nfs/data
# cd /
# umount /nfs
```

3.Create the corresponding PersistentVolume:

```
# cat nfs-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: user-susan
  labels:
    user-susan: pipelines
spec:
  persistentVolumeReclaimPolicy: Retain
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteMany
  nfs:
    server: NFS_SERVER_IP
    path: "/data"

# kubectl create -f nfs-pv.yaml
```

4.Create the PersistentVolumeClaim:

```
# cat nfs-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: user-susan
  annotations:
    description: "this is the mnist demo"
    owner: Tom
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      user-susan: pipelines
# kubectl create -f nfs-pvc.yaml
```


## Author the pipeline

Because the examples that Kubeflow Pipelines ships with all depend on Google's storage services, users in China cannot really try out Pipelines. The Alibaba Cloud Container Service team provides an MNIST model-training example so you can use and learn Kubeflow Pipelines on Alibaba Cloud. It has three steps:
(1) download the data;
(2) train the model with TensorFlow;
(3) export the model.

Each of the later steps depends on the previous one completing.

In Kubeflow Pipelines such a flow can be described in Python code; the complete code is in [standalone_pipeline.py](https://github.com/cheyang/pipelines/blob/update_standalone_sample/samples/arena-samples/standalonejob/standalone_pipeline.py). In this example we use `arena_op`, a wrapper around Kubeflow's default `container_op`. It seamlessly supports the MPI and PS modes of distributed training, offers simple access to heterogeneous devices such as GPUs and RDMA and to distributed storage, and makes it convenient to sync code from a git source — a rather practical tool API, based on the open-source project [Arena](https://github.com/kubeflow/arena).

```python
@dsl.pipeline(
    name='pipeline to run jobs',
    description='shows how to run pipeline jobs.'
)
def sample_pipeline(learning_rate='0.01',
                    dropout='0.9',
                    model_version='1',
                    commit='f097575656f927d86d99dd64931042e1a9003cb2'):
    """A pipeline for end to end machine learning workflow."""
    data=["user-susan:/training"]
    gpus=1

    # 1. prepare data
    prepare_data = arena.standalone_job_op(
        name="prepare-data",
        image="byrnedo/alpine-curl",
        data=data,
        command="mkdir -p /training/dataset/mnist && \
            cd /training/dataset/mnist && \
            curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-images-idx3-ubyte.gz && \
            curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-labels-idx1-ubyte.gz && \
            curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-images-idx3-ubyte.gz && \
            curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-labels-idx1-ubyte.gz")

    # 2. download source code and train the models
    train = arena.standalone_job_op(
        name="train",
        image="tensorflow/tensorflow:1.11.0-gpu-py3",
        sync_source="https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git",
        env=["GIT_SYNC_REV=%s" % (commit)],
        gpus=gpus,
        data=data,
        command='''
            echo %s;python code/tensorflow-sample-code/tfjob/docker/mnist/main.py \
            --max_steps 500 --data_dir /training/dataset/mnist \
            --log_dir /training/output/mnist --learning_rate %s \
            --dropout %s''' % (prepare_data.output, learning_rate, dropout),
        metrics=["Train-accuracy:PERCENTAGE"])
    # 3. export the model
    export_model = arena.standalone_job_op(
        name="export-model",
        image="tensorflow/tensorflow:1.11.0-py3",
        sync_source="https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git",
        env=["GIT_SYNC_REV=%s" % (commit)],
        data=data,
        command="echo %s;python code/tensorflow-sample-code/tfjob/docker/mnist/export_model.py --model_version=%s --checkpoint_path=/training/output/mnist /training/output/models" % (train.output, model_version))
```

Kubeflow Pipelines turns the code above into a directed acyclic graph (DAG), where each node is a component and the edges between components represent their dependencies. The DAG can be seen in the Pipelines UI:

![](images/pipelines/pipeline-dag.jpg)

First, let's look at the data-preparation part in detail. Here we provide the `arena.standalone_job_op` Python API, which needs this step's name (`name`), the container image to use (`image`), and the data to use together with its mount directory inside the container (`data`). `data` is an array, e.g. data=["user-susan:/training"], meaning multiple volumes can be mounted. `user-susan` is the PersistentVolumeClaim created earlier, and `/training` is the mount directory inside the container.

```python
prepare_data = arena.standalone_job_op(
    name="prepare-data",
    image="byrnedo/alpine-curl",
    data=data,
    command="mkdir -p /training/dataset/mnist && \
        cd /training/dataset/mnist && \
        curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-images-idx3-ubyte.gz && \
        curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-labels-idx1-ubyte.gz && \
        curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-images-idx3-ubyte.gz && \
        curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-labels-idx1-ubyte.gz")
```

This step simply uses curl to download the data from the given addresses into the directory `/training/dataset/mnist` on the distributed storage. Note that `/training` is the root directory of the distributed storage, like the familiar root mount point, while `/training/dataset/mnist` is a subdirectory. Later steps can read the data for computation by using the same root mount point.

The second step uses the data downloaded to the distributed storage, fetches the code at a pinned commit id via git, and trains the model.

134 | ```python
135 | prepare_data = arena.standalone_job_op(
136 |     name="prepare-data",
137 |     image="byrnedo/alpine-curl",
138 |     data=data,
139 |     command="mkdir -p /training/dataset/mnist && \
140 |       cd /training/dataset/mnist && \
141 |       curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-images-idx3-ubyte.gz && \
142 |       curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-labels-idx1-ubyte.gz && \
143 |       curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-images-idx3-ubyte.gz && \
144 |       curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-labels-idx1-ubyte.gz")
145 | ```
146 | 
147 | This step simply uses curl to download the data from the given URLs into `/training/dataset/mnist` on the distributed storage. Note that `/training` is the root directory of the distributed storage, much like a familiar root mount point, while `/training/dataset/mnist` is a subdirectory under it. Later steps can mount the same root and read the data for their own computation.
148 | 
149 | The second step uses the data now on the distributed storage, checks out the code via git at a pinned commit id, and trains the model:
150 | 
151 | ```python
152 | train = arena.standalone_job_op(
153 |     name="train",
154 |     image="tensorflow/tensorflow:1.11.0-gpu-py3",
155 |     sync_source="https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git",
156 |     env=["GIT_SYNC_REV=%s" % (commit)],
157 |     gpus=gpus,
158 |     data=data,
159 |     command='''
160 |       echo %s;python code/tensorflow-sample-code/tfjob/docker/mnist/main.py \
161 |       --max_steps 500 --data_dir /training/dataset/mnist \
162 |       --log_dir /training/output/mnist --learning_rate %s \
163 |       --dropout %s''' % (prepare_data.output, learning_rate, dropout),
164 |     metrics=["Train-accuracy:PERCENTAGE"])
165 | ```
166 | 
167 | This step is somewhat more involved than data preparation. Besides the `name`, `image`, `data`, and `command` seen in the first step, the training step also needs:
168 | 
169 | - **How the code is fetched:** for reproducible experiments, tracing exactly which code a run used is essential. Specify the git source with `sync_source` in the API call, and pin the commit id of the training code by setting `GIT_SYNC_REV` in `env`.
170 | 
171 | - **gpus:** defaults to 0, meaning no GPU is used; a positive integer declares how many GPUs the step needs.
172 | 
173 | - **metrics:** again in the interest of reproducible, comparable experiments, you can export a series of metrics and have them displayed and compared intuitively in the Pipelines UI. Usage takes two steps: 1. when calling the API, list the metric names to collect and their display format, `PERCENTAGE` or `RAW`, as an array, e.g. `metrics=["Train-accuracy:PERCENTAGE"]`; 2. since Pipelines collects metrics from stdout by default, the code you actually run must print `{metrics name}={value}` or `{metrics name}:{value}` — see the [sample code](https://github.com/cheyang/tensorflow-sample-code/blob/master/tfjob/docker/mnist/main.py#L183).
174 | 
175 | ![](images/pipelines/metrics.jpg)
176 | 
177 | Worth noting:
178 | 
179 | > This step specifies the same `data` parameter as `prepare_data`, `["user-susan:/training"]`, so the training code can read the data from there, e.g. `--data_dir /training/dataset/mnist`.
180 | 
181 | > Since this step depends on `prepare_data`, referencing `prepare_data.output` in the call declares the dependency between the two steps.
182 | 
183 | 
184 | Finally, `export_model` generates the trained model from the checkpoint produced by `train`:
185 | 
186 | ```python
187 | export_model = arena.standalone_job_op(
188 |     name="export-model",
189 |     image="tensorflow/tensorflow:1.11.0-py3",
190 |     sync_source="https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git",
191 |     env=["GIT_SYNC_REV=%s" % (commit)],
192 |     data=data,
193 |     command="echo %s;python code/tensorflow-sample-code/tfjob/docker/mnist/export_model.py --model_version=%s --checkpoint_path=/training/output/mnist /training/output/models" % (train.output, model_version))
194 | ```
195 | 
196 | `export_model` resembles the second step, `train`, only simpler: it syncs the model-export code from git and runs the export against the checkpoint in the shared directory `/training/output/mnist`.
197 | 
198 | The overall workflow is quite intuitive; now a Python function can tie the whole flow together.
199 | 
200 | 
201 | ```python
202 | @dsl.pipeline(
203 |   name='pipeline to run jobs',
204 |   description='shows how to run pipeline jobs.'
205 | )
206 | def sample_pipeline(learning_rate='0.01',
207 |                     dropout='0.9',
208 |                     model_version='1',
209 |                     commit='f097575656f927d86d99dd64931042e1a9003cb2'):
210 | ```
211 | 
212 | > `@dsl.pipeline` is the decorator that marks a workflow; it defines two attributes, `name` and `description`.
213 | 
214 | > The entry function `sample_pipeline` defines four parameters — `learning_rate`, `dropout`, `model_version`, and `commit` — which are consumed by the `train` and `export_model` stages above. Each value is in fact a [dsl.PipelineParam](https://github.com/kubeflow/pipelines/blob/master/sdk/python/kfp/dsl/_pipeline_param.py); declaring them as `dsl.PipelineParam` lets the native Kubeflow Pipelines UI turn them into an input form, whose keys are the parameter names and whose defaults are the parameter values. Note that the value behind a `dsl.PipelineParam` can only be a string or a number; arrays, maps, and custom types cannot be converted.
215 | 
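If a structured value is genuinely needed, one common workaround is to pass it as a string and decode it inside the step. A sketch under that assumption — the `--class_names` flag is hypothetical and not part of this sample:

```python
# Container-side sketch: decode a list the pipeline passed as a plain string,
# since dsl.PipelineParam only round-trips strings and numbers.
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument('--class_names', default='["cat", "dog"]')  # hypothetical flag
args = parser.parse_args()
labels = json.loads(args.class_names)  # back to a real Python list inside the container
```
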
216 | In practice, all of these parameters can be overridden when the user submits the workflow; here is the corresponding submission UI:
217 | 
218 | ![](images/pipelines/input.jpg)
219 | 
220 | ## Submitting the Pipeline
221 | 
222 | You can submit the Python DSL developed above to the Kubeflow Pipelines service from inside your own Kubernetes cluster; the actual submission code is short:
223 | 
224 | ```python
225 | import kfp
226 | import kfp.compiler as compiler
227 | 
228 | KFP_SERVICE="ml-pipeline.kubeflow.svc.cluster.local:8888"
229 | compiler.Compiler().compile(sample_pipeline, __file__ + '.tar.gz')
230 | client = kfp.Client(host=KFP_SERVICE)
231 | try:
232 |     experiment_id = client.get_experiment(experiment_name=EXPERIMENT_NAME).id
233 | except:
234 |     experiment_id = client.create_experiment(EXPERIMENT_NAME).id
235 | run = client.run_pipeline(experiment_id, RUN_ID, __file__ + '.tar.gz',
236 |                           params={'learning_rate':learning_rate,
237 |                                   'dropout':dropout,
238 |                                   'model_version':model_version,
239 |                                   'commit':commit})
240 | ```
241 | > `compiler.compile` compiles the Python code into a DAG configuration file that the execution engine (Argo) understands
242 | 
243 | > The Kubeflow Pipelines client then creates (or looks up) the experiment and submits the compiled DAG configuration file. (`EXPERIMENT_NAME`, `RUN_ID`, and the four parameter values are defined elsewhere in the sample script — see the CLI sketch below.)
244 | 
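For reference, a sketch of how the sample script wires its command-line flags to the values used above. This is a simplification under assumptions — the real `standalone_pipeline.py` may differ in detail, and `EXPERIMENT_NAME`/`RUN_ID` are just names chosen by the caller:

```python
import argparse

# Hypothetical wiring; flag names follow the submission commands shown below.
parser = argparse.ArgumentParser()
parser.add_argument('--learning_rate', default='0.01')
parser.add_argument('--dropout', default='0.9')
parser.add_argument('--model_version', default='1')
parser.add_argument('--commit', default='f097575656f927d86d99dd64931042e1a9003cb2')
args = parser.parse_args()

learning_rate = args.learning_rate   # forwarded to client.run_pipeline(params=...)
dropout = args.dropout
model_version = args.model_version
commit = args.commit
EXPERIMENT_NAME = 'mnist-pipeline'   # assumed: any experiment name
RUN_ID = 'run-' + model_version      # assumed: any unique run name
```
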
245 | Prepare a Python 3 environment inside the cluster and install the Kubeflow Pipelines SDK:
246 | 
247 | ```
248 | # kubectl create job pipeline-client --namespace kubeflow --image python:3 -- sleep infinity
249 | # kubectl exec -it -n kubeflow $(kubectl get po -l job-name=pipeline-client -n kubeflow | grep -v NAME| awk '{print $1}') -- bash
250 | 
251 | ```
252 | 
253 | Once inside the Python 3 environment, run the following commands to submit two jobs back to back with different parameters:
254 | 
255 | ```
256 | # pip3 install http://kubeflow.oss-cn-beijing.aliyuncs.com/kfp/0.1.14/kfp.tar.gz --upgrade
257 | # pip3 install http://kubeflow.oss-cn-beijing.aliyuncs.com/kfp-arena/kfp-arena-0.4.tar.gz --upgrade
258 | # curl -O https://raw.githubusercontent.com/cheyang/pipelines/update_standalone_sample/samples/arena-samples/standalonejob/standalone_pipeline.py
259 | # python3 standalone_pipeline.py --learning_rate 0.0001 --dropout 0.8 --model_version 2
260 | # python3 standalone_pipeline.py --learning_rate 0.0005 --dropout 0.8 --model_version 3
261 | ```
262 | 
263 | ## Viewing the Results
264 | 
265 | Log in to the Kubeflow Pipelines UI at https://{pipeline-address}/pipeline/#/experiments, for example
266 | 
267 | ```
268 | https://11.124.285.171/pipeline/#/experiments
269 | ```
270 | 
271 | ![](images/pipelines/experiment.jpg)
272 | 
273 | Click the `Compare runs` button to compare the two runs' inputs, elapsed time, accuracy, and other metrics. Making experiments traceable is the first step toward making them reproducible, and the experiment management built into Kubeflow Pipelines is the natural place to start.
274 | 
275 | ![](images/pipelines/comparsion.jpg)
--------------------------------------------------------------------------------
/demo/2-distributed-mnist.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "metadata": {},
6 |    "source": [
7 |     "# Getting Started with Cloud-Native AI - 2. Running Distributed MNIST\n",
8 |     "Next we show how to submit, operate, and manage a distributed training job with arena. With arena, managing a distributed job feels as quick and convenient as managing a single-node application.\n",
9 |     "In this example we will demonstrate how to:\n",
10 |     "\n",
11 |     "* Download and prepare the data\n",
12 |     "* Submit a distributed training job with Arena, then check its status and logs\n",
13 |     "* Inspect the training job through TensorBoard\n",
14 |     "\n",
15 |     "> Prerequisite: complete the [shared storage setup](../docs/setup/SETUP_NAS.md) first; the current ${HOME} is the directory backed by the `training-data` data volume created there."
16 |    ]
17 |   },
18 |   {
19 |    "cell_type": "markdown",
20 |    "metadata": {},
21 |    "source": [
22 |     "1. Download the TensorFlow sample source code into ${HOME}/models"
23 |    ]
24 |   },
25 |   {
26 |    "cell_type": "code",
27 |    "execution_count": 1,
28 |    "metadata": {
29 |     "scrolled": true
30 |    },
31 |    "outputs": [
32 |     {
33 |      "name": "stdout",
34 |      "output_type": "stream",
35 |      "text": [
36 |       "Cloning into '/root/models/tensorflow-sample-code'...\n",
37 |       "remote: Enumerating objects: 242, done.\u001b[K\n",
38 |       "remote: Counting objects: 100% (242/242), done.\u001b[K\n",
39 |       "remote: Compressing objects: 100% (112/112), done.\u001b[K\n",
40 |       "remote: Total 242 (delta 93), reused 242 (delta 93)\u001b[K\n",
41 |       "Receiving objects: 100% (242/242), 11.25 MiB | 0 bytes/s, done.\n",
42 |       "Resolving deltas: 100% (93/93), done.\n",
43 |       "Checking connectivity... done.\n"
44 |      ]
45 |     }
46 |    ],
47 |    "source": [
48 |     "! if [ ! -d \"${HOME}/models/tensorflow-sample-code\" ] ; then \\\n",
49 |     "   git clone \"https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git\" \"${HOME}/models/tensorflow-sample-code\"; \\\n",
50 |     "fi"
51 |    ]
52 |   },
53 |   {
54 |    "cell_type": "markdown",
55 |    "metadata": {},
56 |    "source": [
57 |     "2. Download the mnist data into ${HOME}/dataset/mnist"
58 |    ]
59 |   },
60 |   {
61 |    "cell_type": "code",
62 |    "execution_count": 2,
63 |    "metadata": {
64 |     "scrolled": false
65 |    },
66 |    "outputs": [
67 |     {
68 |      "name": "stdout",
69 |      "output_type": "stream",
70 |      "text": [
71 |       "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
72 |       "                                 Dload  Upload   Total   Spent    Left  Speed\n",
73 |       "100 1610k    0 1610k    0     0  2453k      0 --:--:-- --:--:-- --:--:-- 2450k\n",
74 |       "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
75 |       "                                 Dload  Upload   Total   Spent    Left  Speed\n",
76 |       "100  4542    0  4542    0     0  12110      0 --:--:-- --:--:-- --:--:-- 12144\n",
77 |       "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
78 |       "                                 Dload  Upload   Total   Spent    Left  Speed\n",
79 |       "100 9680k    0 9680k    0     0  12.4M      0 --:--:-- --:--:-- --:--:-- 12.4M\n",
80 |       "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
81 |       "                                 Dload  Upload   Total   Spent    Left  Speed\n",
82 |       "100 28881    0 28881    0     0  72028      0 --:--:-- --:--:-- --:--:-- 72022\n"
83 |      ]
84 |     }
85 |    ],
86 |    "source": [
87 |     "! mkdir -p ${HOME}/dataset/mnist && \\\n",
88 |     "  cd ${HOME}/dataset/mnist && \\\n",
89 |     "  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-images-idx3-ubyte.gz && \\\n",
90 |     "  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-labels-idx1-ubyte.gz && \\\n",
91 |     "  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-images-idx3-ubyte.gz && \\\n",
92 |     "  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-labels-idx1-ubyte.gz"
93 |    ]
94 |   },
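  { "cell_type": "markdown", "metadata": {}, "source": [ "Optional sanity check (a sketch, not part of the original walkthrough): verify the four gzip files are intact before training." ] },
  { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Hypothetical check: gzip -t tests archive integrity without extracting\n", "! cd ${HOME}/dataset/mnist && gzip -t *.gz && ls -lh *.gz" ] },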
mkdir -p ${HOME}/output" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "4.查看目录结构, 其中`dataset`是数据目录,`models`是模型代码目录,`output`是训练结果目录。" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 4, 121 | "metadata": {}, 122 | "outputs": [ 123 | { 124 | "name": "stdout", 125 | "output_type": "stream", 126 | "text": [ 127 | "/root\r\n", 128 | "|-- dataset\r\n", 129 | "| `-- mnist\r\n", 130 | "| |-- t10k-images-idx3-ubyte.gz\r\n", 131 | "| |-- t10k-labels-idx1-ubyte.gz\r\n", 132 | "| |-- train-images-idx3-ubyte.gz\r\n", 133 | "| `-- train-labels-idx1-ubyte.gz\r\n", 134 | "|-- models\r\n", 135 | "| `-- tensorflow-sample-code\r\n", 136 | "| |-- README.md\r\n", 137 | "| |-- data\r\n", 138 | "| |-- mnist-tf\r\n", 139 | "| |-- models\r\n", 140 | "| |-- mpijob\r\n", 141 | "| `-- tfjob\r\n", 142 | "`-- output\r\n", 143 | "\r\n", 144 | "10 directories, 5 files\r\n" 145 | ] 146 | } 147 | ], 148 | "source": [ 149 | "! tree -I ai-starter -L 3 ${HOME}" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "5.检查可用GPU资源" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 5, 162 | "metadata": {}, 163 | "outputs": [ 164 | { 165 | "name": "stdout", 166 | "output_type": "stream", 167 | "text": [ 168 | "NAME IPADDRESS ROLE GPU(Total) GPU(Allocated)\r\n", 169 | "cn-zhangjiakou.i-8vb2knpxzlk449e7lugx 192.168.0.209 1 0\r\n", 170 | "cn-zhangjiakou.i-8vb2knpxzlk449e7lugy 192.168.0.210 1 0\r\n", 171 | "cn-zhangjiakou.i-8vb2knpxzlk449e7lugz 192.168.0.208 1 0\r\n", 172 | "cn-zhangjiakou.i-8vb7yuo831zjzijo9sdw 192.168.0.205 master 0 0\r\n", 173 | "cn-zhangjiakou.i-8vbezxqzueo7662i0dbq 192.168.0.204 master 0 0\r\n", 174 | "cn-zhangjiakou.i-8vbezxqzueo7681j4fav 192.168.0.206 master 0 0\r\n", 175 | "-----------------------------------------------------------------------------------------\r\n", 176 | "Allocated/Total GPUs In Cluster:\r\n", 177 | "0/3 (0%) \r\n" 178 | ] 179 | } 180 | ], 181 | "source": [ 182 | "! arena top node" 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "6.通过Arena提交训练任务, 这里`training-data`在配置[共享存储时](../docs/setup/SETUP_NAS.md)创建. 
\n", 190 | "`--data=training-data:/training`将其映射到训练任务的`/training`目录。而`/training`目录下的子目录`/training/models/tensorflow-sample-code`就是步骤1拷贝源代码的位置,`/training`目录下的子目录`/training/dataset/mnist`就是步骤2下载数据的位置, `/training`目录下的子目录`/training/output`就是步骤3创建的训练结果输出的位置。" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": 6, 196 | "metadata": {}, 197 | "outputs": [ 198 | { 199 | "name": "stdout", 200 | "output_type": "stream", 201 | "text": [ 202 | "env: JOB_NAME=tf-distributed-mnist\n", 203 | "configmap/tf-distributed-mnist-tfjob created\n", 204 | "configmap/tf-distributed-mnist-tfjob labeled\n", 205 | "service/tf-distributed-mnist-tensorboard created\n", 206 | "deployment.extensions/tf-distributed-mnist-tensorboard created\n", 207 | "tfjob.kubeflow.org/tf-distributed-mnist created\n", 208 | "\u001b[36mINFO\u001b[0m[0004] The Job tf-distributed-mnist has been submitted successfully \n", 209 | "\u001b[36mINFO\u001b[0m[0004] You can run `arena get tf-distributed-mnist --type tfjob` to check the job status \n" 210 | ] 211 | } 212 | ], 213 | "source": [ 214 | "# Set the Job Name\n", 215 | "%env JOB_NAME=tf-distributed-mnist\n", 216 | "# Submit a training job \n", 217 | "# using code and data from Data Volume\n", 218 | "!arena submit tf \\\n", 219 | " --name=$JOB_NAME \\\n", 220 | " --ps=1 \\\n", 221 | " --workers=2 \\\n", 222 | " --gpus=1 \\\n", 223 | " --data=training-data:/training \\\n", 224 | " --tensorboard \\\n", 225 | " --psImage=tensorflow/tensorflow:1.5.0 \\\n", 226 | " --image=tensorflow/tensorflow:1.5.0-gpu-py3 \\\n", 227 | " --logdir=/training/output/distributed-mnist \\\n", 228 | " \"python /training/models/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py --max_steps 10000 --data_dir /training/dataset/mnist --log_dir /training/output/distributed-mnist\"" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "> `Arena`命令的`--logdir`指定`tensorboard`从训练任务的指定目录读取event\n", 236 | "> 完整参数可以参考[命令行文档](https://github.com/kubeflow/arena/blob/master/docs/cli/arena_submit_tfjob.md)" 237 | ] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "metadata": {}, 242 | "source": [ 243 | "7.检查模型训练状态,当任务状态从`Pending`转为`Running`后就可以查看日志和GPU使用率了。这里`-e`为了方便检查任务`Pending`的原因。通常看到`[Pulling] pulling image \"tensorflow/tensorflow:1.5.0-gpu-py3\"`代表容器镜像过大,导致任务处于`Pending`。这时可以重复执行下列命令直到任务状态变为`Running`。" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": 7, 249 | "metadata": {}, 250 | "outputs": [ 251 | { 252 | "name": "stdout", 253 | "output_type": "stream", 254 | "text": [ 255 | "STATUS: PENDING\r\n", 256 | "NAMESPACE: default\r\n", 257 | "TRAINING DURATION: 0s\r\n", 258 | "\r\n", 259 | "NAME STATUS TRAINER AGE INSTANCE NODE\r\n", 260 | "tf-distributed-mnist PENDING TFJOB 0s tf-distributed-mnist-ps-0 N/A\r\n", 261 | "tf-distributed-mnist PENDING TFJOB 0s tf-distributed-mnist-worker-0 N/A\r\n", 262 | "tf-distributed-mnist PENDING TFJOB 0s tf-distributed-mnist-worker-1 N/A\r\n", 263 | "\r\n", 264 | "Your tensorboard will be available on:\r\n", 265 | "192.168.0.206:31963 \r\n", 266 | "\r\n", 267 | "Events: \r\n", 268 | "INSTANCE TYPE AGE MESSAGE\r\n", 269 | "-------- ---- --- -------\r\n", 270 | " \r\n", 271 | " \r\n", 272 | " \r\n" 273 | ] 274 | } 275 | ], 276 | "source": [ 277 | "! 
arena get $JOB_NAME -e" 278 | ] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "metadata": {}, 283 | "source": [ 284 | "8.实时检查日志,此时可以通过调整`--tail=`的数值展示输出的行数。默认为显示全部日志。\n", 285 | "如果想要实时查看日志,可以增加`-f`参数。" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 8, 291 | "metadata": {}, 292 | "outputs": [ 293 | { 294 | "name": "stdout", 295 | "output_type": "stream", 296 | "text": [ 297 | "2019-02-26T07:28:59.06252925Z WARNING:tensorflow:From /training/models/tensorflow-sample-code/tfjob/docker/mnist/main.py:40: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.\r\n", 298 | "2019-02-26T07:28:59.062611786Z Instructions for updating:\r\n", 299 | "2019-02-26T07:28:59.062616602Z Please use alternatives such as official/mnist/dataset.py from tensorflow/models.\r\n", 300 | "2019-02-26T07:28:59.102090755Z WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.\r\n", 301 | "2019-02-26T07:28:59.102120053Z Instructions for updating:\r\n", 302 | "2019-02-26T07:28:59.102123749Z Please write your own downloading logic.\r\n", 303 | "2019-02-26T07:28:59.106943556Z WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.\r\n", 304 | "2019-02-26T07:28:59.106955986Z Instructions for updating:\r\n", 305 | "2019-02-26T07:28:59.106959188Z Please use tf.data to implement this functionality.\r\n", 306 | "2019-02-26T07:28:59.705225214Z WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.\r\n", 307 | "2019-02-26T07:28:59.705261581Z Instructions for updating:\r\n", 308 | "2019-02-26T07:28:59.705266844Z Please use tf.data to implement this functionality.\r\n", 309 | "2019-02-26T07:28:59.710114243Z WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:110: dense_to_one_hot (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.\r\n", 310 | "2019-02-26T07:28:59.710131028Z Instructions for updating:\r\n", 311 | "2019-02-26T07:28:59.710134306Z Please use tf.one_hot on tensors.\r\n", 312 | "2019-02-26T07:28:59.775920928Z WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: DataSet.__init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.\r\n", 313 | "2019-02-26T07:28:59.775952959Z Instructions for updating:\r\n", 314 | "2019-02-26T07:28:59.775956287Z Please use alternatives such as official/mnist/dataset.py from tensorflow/models.\r\n", 315 | "2019-02-26T07:29:00.062051404Z WARNING:tensorflow:From /training/models/tensorflow-sample-code/tfjob/docker/mnist/main.py:119: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.\r\n", 316 | "2019-02-26T07:29:00.062085896Z Instructions for updating:\r\n", 317 | 
"2019-02-26T07:29:00.062089719Z \r\n", 318 | "2019-02-26T07:29:00.062092246Z Future major versions of TensorFlow will allow gradients to flow\r\n", 319 | "2019-02-26T07:29:00.062095209Z into the labels input on backprop by default.\r\n", 320 | "2019-02-26T07:29:00.062097781Z \r\n", 321 | "2019-02-26T07:29:00.062100188Z See `tf.nn.softmax_cross_entropy_with_logits_v2`.\r\n", 322 | "2019-02-26T07:29:00.062114298Z \r\n", 323 | "2019-02-26T07:29:00.219646519Z 2019-02-26 07:29:00.219376: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA\r\n", 324 | "2019-02-26T07:29:00.395096237Z 2019-02-26 07:29:00.394842: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\r\n", 325 | "2019-02-26T07:29:00.396207408Z 2019-02-26 07:29:00.396052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties: \r\n", 326 | "2019-02-26T07:29:00.396224182Z name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285\r\n", 327 | "2019-02-26T07:29:00.396227903Z pciBusID: 0000:00:09.0\r\n", 328 | "2019-02-26T07:29:00.396230866Z totalMemory: 15.89GiB freeMemory: 15.60GiB\r\n", 329 | "2019-02-26T07:29:00.396234207Z 2019-02-26 07:29:00.396088: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0\r\n", 330 | "2019-02-26T07:29:00.747280828Z 2019-02-26 07:29:00.747034: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:\r\n", 331 | "2019-02-26T07:29:00.74731476Z 2019-02-26 07:29:00.747091: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0 \r\n", 332 | "2019-02-26T07:29:00.747318377Z 2019-02-26 07:29:00.747101: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N \r\n", 333 | "2019-02-26T07:29:00.747676956Z 2019-02-26 07:29:00.747525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15117 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:09.0, compute capability: 6.0)\r\n", 334 | "2019-02-26T07:29:06.348085472Z 2019-02-26 07:29:06.347875: I tensorflow/stream_executor/dso_loader.cc:151] successfully opened CUDA library libcupti.so.9.0 locally\r\n" 335 | ] 336 | } 337 | ], 338 | "source": [ 339 | "! arena logs --tail=50 $JOB_NAME" 340 | ] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "metadata": {}, 345 | "source": [ 346 | "9.查看实时训练的GPU使用情况" 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": 9, 352 | "metadata": {}, 353 | "outputs": [ 354 | { 355 | "name": "stdout", 356 | "output_type": "stream", 357 | "text": [ 358 | "INSTANCE NAME GPU(Device Index) GPU(Duty Cycle) GPU(Memory MiB) STATUS NODE\r\n", 359 | "tf-distributed-mnist-ps-0 N/A N/A N/A Running 192.168.0.208\r\n", 360 | "tf-distributed-mnist-worker-0 0 9% 551.0MiB / 16276.2MiB Running 192.168.0.210\r\n", 361 | "tf-distributed-mnist-worker-1 0 6% 1092.0MiB / 16276.2MiB Running 192.168.0.208\r\n" 362 | ] 363 | } 364 | ], 365 | "source": [ 366 | "! 
366 |     "! arena top job $JOB_NAME"
367 |    ]
368 |   },
369 |   {
370 |    "cell_type": "markdown",
371 |    "metadata": {},
372 |    "source": [
373 |     "10. View training trends through TensorBoard. Use the address printed by `arena get` below (in this run, `192.168.0.206:30308`). If you cannot reach TensorBoard directly from your laptop, consider running `sshuttle` on it, e.g. `sshuttle -r root@41.82.59.51 192.168.0.0/16`, where `41.82.59.51` is the public IP of a cluster node reachable over ssh."
374 |    ]
375 |   },
376 |   {
377 |    "cell_type": "code",
378 |    "execution_count": 10,
379 |    "metadata": {
380 |     "scrolled": false
381 |    },
382 |    "outputs": [
383 |     {
384 |      "name": "stdout",
385 |      "output_type": "stream",
386 |      "text": [
387 |       "STATUS: RUNNING\r\n",
388 |       "NAMESPACE: default\r\n",
389 |       "TRAINING DURATION: 45s\r\n",
390 |       "\r\n",
391 |       "NAME                  STATUS   TRAINER  AGE  INSTANCE                       NODE\r\n",
392 |       "tf-distributed-mnist  RUNNING  TFJOB    45s  tf-distributed-mnist-ps-0      192.168.0.208\r\n",
393 |       "tf-distributed-mnist  RUNNING  TFJOB    45s  tf-distributed-mnist-worker-0  192.168.0.210\r\n",
394 |       "tf-distributed-mnist  RUNNING  TFJOB    45s  tf-distributed-mnist-worker-1  192.168.0.208\r\n",
395 |       "\r\n",
396 |       "Your tensorboard will be available on:\r\n",
397 |       "192.168.0.206:30308   \r\n"
398 |      ]
399 |     }
400 |    ],
401 |    "source": [
402 |     "! arena get $JOB_NAME"
403 |    ]
404 |   },
405 |   {
406 |    "cell_type": "markdown",
407 |    "metadata": {},
408 |    "source": [
409 |     "![](2-1-tensorboard.jpg)"
410 |    ]
411 |   },
412 |   {
413 |    "cell_type": "markdown",
414 |    "metadata": {},
415 |    "source": [
416 |     "11. Inspect the artifacts produced by training; the results are generated under `output`"
417 |    ]
418 |   },
419 |   {
420 |    "cell_type": "code",
421 |    "execution_count": 11,
422 |    "metadata": {},
423 |    "outputs": [
424 |     {
425 |      "name": "stdout",
426 |      "output_type": "stream",
427 |      "text": [
428 |       "/root\r\n",
429 |       "|-- dataset\r\n",
430 |       "|   `-- mnist\r\n",
431 |       "|       |-- t10k-images-idx3-ubyte.gz\r\n",
432 |       "|       |-- t10k-labels-idx1-ubyte.gz\r\n",
433 |       "|       |-- train-images-idx3-ubyte.gz\r\n",
434 |       "|       `-- train-labels-idx1-ubyte.gz\r\n",
435 |       "|-- models\r\n",
436 |       "|   `-- tensorflow-sample-code\r\n",
437 |       "|       |-- README.md\r\n",
438 |       "|       |-- data\r\n",
439 |       "|       |-- mnist-tf\r\n",
440 |       "|       |-- models\r\n",
441 |       "|       |-- mpijob\r\n",
442 |       "|       `-- tfjob\r\n",
443 |       "`-- output\r\n",
444 |       "    `-- distributed-mnist\r\n",
445 |       "        |-- test\r\n",
446 |       "        `-- train\r\n",
447 |       "\r\n",
448 |       "13 directories, 5 files\r\n"
449 |      ]
450 |     }
451 |    ],
452 |    "source": [
453 |     "! tree -I ai-starter -L 3 ${HOME}"
454 |    ]
455 |   },
456 |   {
457 |    "cell_type": "markdown",
458 |    "metadata": {},
459 |    "source": [
460 |     "12. Delete the finished job"
461 |    ]
462 |   },
463 |   {
464 |    "cell_type": "code",
465 |    "execution_count": 13,
466 |    "metadata": {},
467 |    "outputs": [
468 |     {
469 |      "name": "stdout",
470 |      "output_type": "stream",
471 |      "text": [
472 |       "service \"tf-distributed-mnist-tensorboard\" deleted\n",
473 |       "deployment.extensions \"tf-distributed-mnist-tensorboard\" deleted\n",
474 |       "tfjob.kubeflow.org \"tf-distributed-mnist\" deleted\n",
475 |       "configmap \"tf-distributed-mnist-tfjob\" deleted\n",
476 |       "\u001b[36mINFO\u001b[0m[0004] The Job tf-distributed-mnist has been deleted successfully \n"
477 |      ]
478 |     }
479 |    ],
480 |    "source": [
481 |     "! arena delete $JOB_NAME"
482 |    ]
483 |   },
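  { "cell_type": "markdown", "metadata": {}, "source": [ "Optionally confirm the job is gone (a sketch, not part of the original walkthrough):" ] },
  { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The deleted job should no longer appear in the list\n", "! arena list" ] },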
484 |   {
485 |    "cell_type": "markdown",
486 |    "metadata": {},
487 |    "source": [
488 |     "Congratulations! You have successfully run a training job with `arena` and easily checked TensorBoard along the way.\n",
489 |     "\n",
490 |     "To sum up, this demo walked through:\n",
491 |     "1. how to prepare code and data and place them in a data volume\n",
492 |     "2. how to reference the data volume from a distributed training job and use the code and data inside it\n",
493 |     "3. how to manage your distributed training jobs with arena.\n",
494 |     "\n",
495 |     "The above is an example of doing machine learning in the cloud with `Arena`; you can modify the code at `${HOME}/models/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py` and resubmit, iterating on the model as you develop."
496 |    ]
497 |   }
498 |  ],
499 |  "metadata": {
500 |   "kernelspec": {
501 |    "display_name": "Python 3",
502 |    "language": "python",
503 |    "name": "python3"
504 |   },
505 |   "language_info": {
506 |    "codemirror_mode": {
507 |     "name": "ipython",
508 |     "version": 3
509 |    },
510 |    "file_extension": ".py",
511 |    "mimetype": "text/x-python",
512 |    "name": "python",
513 |    "nbconvert_exporter": "python",
514 |    "pygments_lexer": "ipython3",
515 |    "version": "3.5.2"
516 |   }
517 |  },
518 |  "nbformat": 4,
519 |  "nbformat_minor": 2
520 | }
521 | 
--------------------------------------------------------------------------------
/demo/3-submit-mpi.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "metadata": {},
6 |    "source": [
7 |     "# Getting Started with Cloud-Native AI - 3. Submitting an MPI Job\n",
8 |     "\n",
9 |     "In this example we will demonstrate how to:\n",
10 |     "\n",
11 |     "* Submit a distributed MPI training job with Arena, then check its status and logs\n",
12 |     "* Inspect the training job through TensorBoard\n",
13 |     "\n",
14 |     "> Prerequisite: complete the [shared storage setup](../docs/setup/SETUP_NAS.md) first; the current ${HOME} is the directory backed by the `training-data` data volume created there."
15 |    ]
16 |   },
17 |   {
18 |    "cell_type": "markdown",
19 |    "metadata": {},
20 |    "source": [
21 |     "1. Download the TensorFlow sample source code into ${HOME}/models"
22 |    ]
23 |   },
24 |   {
25 |    "cell_type": "code",
26 |    "execution_count": 1,
27 |    "metadata": {
28 |     "scrolled": true
29 |    },
30 |    "outputs": [
31 |     {
32 |      "name": "stdout",
33 |      "output_type": "stream",
34 |      "text": [
35 |       "Cloning into '/root/models/tensorflow-benchmarks'...\n",
36 |       "remote: Enumerating objects: 3748, done.\u001b[K\n",
37 |       "remote: Counting objects: 100% (3748/3748), done.\u001b[K\n",
38 |       "remote: Compressing objects: 100% (1170/1170), done.\u001b[K\n",
39 |       "remote: Total 3748 (delta 2557), reused 3748 (delta 2557)\u001b[K\n",
40 |       "Receiving objects: 100% (3748/3748), 1.98 MiB | 0 bytes/s, done.\n",
41 |       "Resolving deltas: 100% (2557/2557), done.\n",
42 |       "Checking connectivity... done.\n"
43 |      ]
44 |     }
45 |    ],
46 |    "source": [
47 |     "! if [ ! -d \"${HOME}/models/tensorflow-benchmarks\" ] ; then \\\n",
48 |     "   git clone -b cnn_tf_v1.9_compatible \"https://code.aliyun.com/xiaozhou/benchmark.git\" \"${HOME}/models/tensorflow-benchmarks\"; \\\n",
49 |     "fi"
50 |    ]
51 |   },
52 |   {
53 |    "cell_type": "markdown",
54 |    "metadata": {},
55 |    "source": [
56 |     "2. Create the training output directory ${HOME}/output"
57 |    ]
58 |   },
59 |   {
60 |    "cell_type": "code",
61 |    "execution_count": 2,
62 |    "metadata": {},
63 |    "outputs": [],
64 |    "source": [
65 |     "! mkdir -p ${HOME}/output"
66 |    ]
67 |   },
68 |   {
69 |    "cell_type": "markdown",
70 |    "metadata": {},
71 |    "source": [
72 |     "3. Inspect the directory layout: `dataset` holds the data, `models` holds the model code, and `output` holds the training results."
73 |    ]
74 |   },
75 |   {
76 |    "cell_type": "code",
77 |    "execution_count": 3,
78 |    "metadata": {},
79 |    "outputs": [
80 |     {
81 |      "name": "stdout",
82 |      "output_type": "stream",
83 |      "text": [
84 |       "/root\r\n",
85 |       "|-- dataset\r\n",
86 |       "|-- models\r\n",
87 |       "|   `-- tensorflow-benchmarks\r\n",
88 |       "|       |-- LICENSE\r\n",
89 |       "|       |-- README.md\r\n",
90 |       "|       |-- bower_components\r\n",
91 |       "|       |-- dashboard_app\r\n",
92 |       "|       |-- index.html\r\n",
93 |       "|       |-- js\r\n",
94 |       "|       |-- scripts\r\n",
95 |       "|       |-- soumith_benchmarks.html\r\n",
96 |       "|       `-- tools\r\n",
97 |       "`-- output\r\n",
98 |       "\r\n",
99 |       "9 directories, 4 files\r\n"
100 |      ]
101 |     }
102 |    ],
103 |    "source": [
104 |     "! 
tree -I ai-starter -L 3 ${HOME}"
105 |    ]
106 |   },
107 |   {
108 |    "cell_type": "markdown",
109 |    "metadata": {},
110 |    "source": [
111 |     "4. Check available GPU resources"
112 |    ]
113 |   },
114 |   {
115 |    "cell_type": "code",
116 |    "execution_count": 4,
117 |    "metadata": {},
118 |    "outputs": [
119 |     {
120 |      "name": "stdout",
121 |      "output_type": "stream",
122 |      "text": [
123 |       "NAME                                   IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)\r\n",
124 |       "cn-zhangjiakou.i-8vb2knpxzlk449e7lugx  192.168.0.209          1           0\r\n",
125 |       "cn-zhangjiakou.i-8vb2knpxzlk449e7lugy  192.168.0.210          1           0\r\n",
126 |       "cn-zhangjiakou.i-8vb2knpxzlk449e7lugz  192.168.0.208          1           0\r\n",
127 |       "cn-zhangjiakou.i-8vb7yuo831zjzijo9sdw  192.168.0.205  master  0           0\r\n",
128 |       "cn-zhangjiakou.i-8vbezxqzueo7662i0dbq  192.168.0.204  master  0           0\r\n",
129 |       "cn-zhangjiakou.i-8vbezxqzueo7681j4fav  192.168.0.206  master  0           0\r\n",
130 |       "-----------------------------------------------------------------------------------------\r\n",
131 |       "Allocated/Total GPUs In Cluster:\r\n",
132 |       "0/3 (0%)  \r\n"
133 |      ]
134 |     }
135 |    ],
136 |    "source": [
137 |     "! arena top node"
138 |    ]
139 |   },
140 |   {
141 |    "cell_type": "markdown",
142 |    "metadata": {},
143 |    "source": [
144 |     "5. Submit the training job through Arena. Here `training-data` was created during the [shared storage setup](../docs/setup/SETUP_NAS.md). \n",
145 |     "`--data=training-data:/training` mounts it at the job's `/training` directory. \n",
146 |     "The subdirectory `/training/models/tensorflow-benchmarks` is where step 1 cloned the source code. \n",
147 |     "The subdirectory `/training/output/mpi-benchmarks` is where the training output created in step 2 goes."
148 |    ]
149 |   },
150 |   {
151 |    "cell_type": "code",
152 |    "execution_count": 5,
153 |    "metadata": {},
154 |    "outputs": [
155 |     {
156 |      "name": "stdout",
157 |      "output_type": "stream",
158 |      "text": [
159 |       "env: JOB_NAME=tf-mpi-benchmarks\n",
160 |       "configmap/tf-mpi-benchmarks-mpijob created\n",
161 |       "configmap/tf-mpi-benchmarks-mpijob labeled\n",
162 |       "service/tf-mpi-benchmarks-tensorboard created\n",
163 |       "deployment.extensions/tf-mpi-benchmarks-tensorboard created\n",
164 |       "mpijob.kubeflow.org/tf-mpi-benchmarks created\n",
165 |       "\u001b[36mINFO\u001b[0m[0004] The Job tf-mpi-benchmarks has been submitted successfully \n",
166 |       "\u001b[36mINFO\u001b[0m[0004] You can run `arena get tf-mpi-benchmarks --type mpijob` to check the job status \n"
167 |      ]
168 |     }
169 |    ],
170 |    "source": [
171 |     "# Set the Job Name\n",
172 |     "%env JOB_NAME=tf-mpi-benchmarks\n",
173 |     "%env USER_DATA_NAME=training-data\n",
174 |     "# Submit a training job \n",
175 |     "# using code and data from Data Volume\n",
176 |     "!arena submit mpi \\\n",
177 |     " --name=$JOB_NAME \\\n",
178 |     " --workers=2 \\\n",
179 |     " --gpus=1 \\\n",
180 |     " --data=$USER_DATA_NAME:/training \\\n",
181 |     " --tensorboard \\\n",
182 |     " --image=uber/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \\\n",
183 |     " --logdir=/training/output/mpi-benchmarks \\\n",
184 |     " \"mpirun python /training/models/tensorflow-benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training/output/mpi-benchmarks --summary_verbosity=3 --save_summaries_steps=10\""
185 |    ]
186 |   },
187 |   {
188 |    "cell_type": "markdown",
189 |    "metadata": {},
190 |    "source": [
191 |     "> The `--logdir` flag of the `Arena` command tells `tensorboard` which job directory to read events from\n",
192 |     "> See the [CLI documentation](https://github.com/kubeflow/arena/blob/master/docs/cli/arena_submit_tfjob.md) for the full list of flags"
193 |    ]
194 |   },
pulling image \"uber/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5\"`代表容器镜像过大,导致任务处于`Pending`。这时可以重复执行下列命令直到任务状态变为`Running`。" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 6, 205 | "metadata": {}, 206 | "outputs": [ 207 | { 208 | "name": "stdout", 209 | "output_type": "stream", 210 | "text": [ 211 | "STATUS: RUNNING\r\n", 212 | "NAMESPACE: default\r\n", 213 | "TRAINING DURATION: 57s\r\n", 214 | "\r\n", 215 | "NAME STATUS TRAINER AGE INSTANCE NODE\r\n", 216 | "tf-mpi-benchmarks RUNNING MPIJOB 57s tf-mpi-benchmarks-launcher-ph6fr 192.168.0.208\r\n", 217 | "tf-mpi-benchmarks RUNNING MPIJOB 57s tf-mpi-benchmarks-worker-0 192.168.0.210\r\n", 218 | "tf-mpi-benchmarks RUNNING MPIJOB 57s tf-mpi-benchmarks-worker-1 192.168.0.209\r\n", 219 | "\r\n", 220 | "Your tensorboard will be available on:\r\n", 221 | "192.168.0.206:31129 \r\n", 222 | "\r\n", 223 | "Events: \r\n", 224 | "No events for pending pod\r\n" 225 | ] 226 | } 227 | ], 228 | "source": [ 229 | "! arena get $JOB_NAME -e" 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "metadata": {}, 235 | "source": [ 236 | "![](3-1-tensorboard.jpg)" 237 | ] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "metadata": {}, 242 | "source": [ 243 | "8.实时检查日志,此时可以通过调整`--tail=`的数值展示输出的行数。默认为显示全部日志。\n", 244 | "如果想要实时查看日志,可以增加`-f`参数。" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": 7, 250 | "metadata": {}, 251 | "outputs": [ 252 | { 253 | "name": "stdout", 254 | "output_type": "stream", 255 | "text": [ 256 | "2019-03-02T10:25:06.633100624Z 2019-03-02 10:25:06.632415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N \r\n", 257 | "2019-03-02T10:25:06.633242941Z 2019-03-02 10:25:06.632923: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15111 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:09.0, compute capability: 6.0)\r\n", 258 | "2019-03-02T10:25:06.946553283Z I0302 10:25:06.947077 140068638594816 tf_logging.py:115] Starting standard services.\r\n", 259 | "2019-03-02T10:25:06.94658461Z W0302 10:25:06.947337 140068638594816 tf_logging.py:125] Standard services need a 'logdir' passed to the SessionManager\r\n", 260 | "2019-03-02T10:25:06.946588591Z I0302 10:25:06.947422 140068638594816 tf_logging.py:115] Starting queue runners.\r\n", 261 | "2019-03-02T10:25:08.02147656Z I0302 10:25:08.020608 140189223290624 tf_logging.py:115] Running local_init_op.\r\n", 262 | "2019-03-02T10:25:08.456852287Z I0302 10:25:08.455843 140189223290624 tf_logging.py:115] Done running local_init_op.\r\n", 263 | "2019-03-02T10:25:13.438192987Z I0302 10:25:13.437267 140189223290624 tf_logging.py:115] Starting standard services.\r\n", 264 | "2019-03-02T10:25:13.540538367Z I0302 10:25:13.539715 140189223290624 tf_logging.py:115] Starting queue runners.\r\n", 265 | "2019-03-02T10:25:14.381412593Z I0302 10:25:14.380504 140185215694592 tf_logging.py:159] global_step/sec: 0\r\n", 266 | "2019-03-02T10:25:15.258478623Z Running warm up\r\n", 267 | "2019-03-02T10:25:15.283527742Z Running warm up\r\n", 268 | "2019-03-02T10:25:20.910169525Z \r\n", 269 | "2019-03-02T10:25:20.910219717Z tf-mpi-benchmarks-worker-0:8562:8575 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]\r\n", 270 | "2019-03-02T10:25:20.910241917Z tf-mpi-benchmarks-worker-0:8562:8575 [0] INFO Using internal Network Socket\r\n", 271 | "2019-03-02T10:25:20.910244787Z tf-mpi-benchmarks-worker-0:8562:8575 [0] INFO Using 
NCCL Low-latency algorithm for sizes below 16384\r\n", 272 | "2019-03-02T10:25:20.910247406Z tf-mpi-benchmarks-worker-0:8562:8575 [0] INFO NET : Using interface eth0:172.16.2.154<0>\r\n", 273 | "2019-03-02T10:25:20.910250216Z tf-mpi-benchmarks-worker-0:8562:8575 [0] INFO NET/Socket : 1 interfaces found\r\n", 274 | "2019-03-02T10:25:20.910252844Z NCCL version 2.2.13+cuda9.0\r\n", 275 | "2019-03-02T10:25:20.911021019Z \r\n", 276 | "2019-03-02T10:25:20.911034061Z tf-mpi-benchmarks-worker-1:22278:22294 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]\r\n", 277 | "2019-03-02T10:25:20.911039817Z tf-mpi-benchmarks-worker-1:22278:22294 [0] INFO Using internal Network Socket\r\n", 278 | "2019-03-02T10:25:20.911044469Z tf-mpi-benchmarks-worker-1:22278:22294 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384\r\n", 279 | "2019-03-02T10:25:20.915446261Z Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/2DJDBRMOBPLS3OLU7TWWCCX54S:/var/lib/docker/overlay2/l/PFG7JK77ZESMCADSD7OQUEPNNA:/var/lib/docker/overlay2/l/253K6TLSZMLAFF5XYM4QK33IDQ:/var/lib/docker/overlay2/l/7APFLVLMJRQQ5ONOK4TXUO3OXX:/var/lib/docker/overlay2/l/46WR5AGPYCSW6OVYV6SW3YYJXG:/var/lib/docker/overlay2/l/7A4MC2QCGNADQYXXZPTC6KRQIE:/var/lib/docker/overlay2/l/KKLS6NXJ6AKMWODEOHSSPPUXXU:/var/lib/docker/overlay2/l/WROAZLKZPVWBL4D3SZUU4VUUMV:/var/lib/docker/overlay2/l/JVZBJ4BH6U456'\r\n", 280 | "2019-03-02T10:25:20.915466117Z Unexpected end of /proc/mounts line `R2MOJSMX5EJU5:/var/lib/docker/overlay2/l/MNZI2ZHI7SFFLRKCTI7YME7JA4:/var/lib/docker/overlay2/l/L47TL5KBZKM5ZLI4CHK3565W5B:/var/lib/docker/overlay2/l/MSIKN7UHJEFYRJVSKU3IYRTBJ2:/var/lib/docker/overlay2/l/EL5YF3AEVYU34BCUN5QQZCE4IU:/var/lib/docker/overlay2/l/PPSCQMEKYC3RSLTDPVID2FHVGC:/var/lib/docker/overlay2/l/WPBTAL3YAOWRRQJIY343MCI4PZ:/var/lib/docker/overlay2/l/ZUMMHLCWCI42XIRULUDUYX5UBP:/var/lib/docker/overlay2/l/5U55TSOZJM3OH5TSNLFUXYW4MG:/var/lib/docker/overlay2/l/HTTBG5Z5MV5FWYZC42JWNXQOUT:/var/lib/do'\r\n", 281 | "2019-03-02T10:25:20.915473649Z Unexpected end of /proc/mounts line `cker/overlay2/l/P2637BN543AV7WYBCZPGF56JET:/var/lib/docker/overlay2/l/KYQO63RWGFEYCBVN6E5HAXNZXM:/var/lib/docker/overlay2/l/NM5K355TS52O4IXHNSULI3YCL7:/var/lib/docker/overlay2/l/IAKHDKK26V4N2EKQNU4RLB64ZW:/var/lib/docker/overlay2/l/SUD2DOE35DUKLVST7QEXX6NIT3:/var/lib/docker/overlay2/l/FJMN2F4MCKYAMICYG7R7YPBOIY:/var/lib/docker/overlay2/l/YBYUCPHIBIUU4BE73LPC6J45HT,upperdir=/var/lib/docker/overlay2/cba173854c5c75b89a4f2ff0d9bac1c04b4f8ec9e03f0d50c5be64dbd1be187f/diff,workdir=/var/lib/docker/overlay2/cba1738'\r\n", 282 | "2019-03-02T10:25:20.916475876Z Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/FRZVWQYK7OWI3PBMGI5ZYDP6A2:/var/lib/docker/overlay2/l/DCWCZJJWUNVKEVCIWJVX6GSF3U:/var/lib/docker/overlay2/l/PGPIXXECDFLQVN532W2J7D3BIV:/var/lib/docker/overlay2/l/RTVTVZFIASVYRNA2T3E3Z473PE:/var/lib/docker/overlay2/l/3HGT4MVWKBKIWUZKKNU5URGZTY:/var/lib/docker/overlay2/l/M2LIXNGFQSGPEOO4FIKKWJ6MVF:/var/lib/docker/overlay2/l/S4A32QN43C3CKQ3SXW53UR4X5C:/var/lib/docker/overlay2/l/SPX7HOQ3EHDP3GLDRREKA4N65A:/var/lib/docker/overlay2/l/VP7H2BL4QPVEW'\r\n", 283 | "2019-03-02T10:25:20.916489821Z Unexpected end of /proc/mounts line 
`Z2JWGQMMECYHE:/var/lib/docker/overlay2/l/WA54EUV7CXDYN2LRD4YB5S3PHA:/var/lib/docker/overlay2/l/MLOKMOIIDA6B2VOWEQYWF45VEH:/var/lib/docker/overlay2/l/TLEMXYSZPS5P4XZ2FSNT3VBNHB:/var/lib/docker/overlay2/l/BR6VU5X3QA2YBWWQLQVKBVDSH3:/var/lib/docker/overlay2/l/J7D2QICKY5MN2UND7K7AGGAKTX:/var/lib/docker/overlay2/l/PEADBYPXXLIRID42BRBQCKVPCJ:/var/lib/docker/overlay2/l/P35PZJL5VWQZHAD4HFKL4BSKT2:/var/lib/docker/overlay2/l/RGDWJ7ARAHYJG7FGVFUHDA6RQC:/var/lib/docker/overlay2/l/4KUXB7XIV4QFXCQJLBYZCCNHID:/var/lib/do'\r\n", 284 | "2019-03-02T10:25:20.91650441Z Unexpected end of /proc/mounts line `cker/overlay2/l/GV6UAYEUJM2OWUSUICLUBZJDTR:/var/lib/docker/overlay2/l/6YQ632LUIZX4M2HWIDJIYEBNJ4:/var/lib/docker/overlay2/l/FJM6EITOVS53B6EQJFBNZRRNTL:/var/lib/docker/overlay2/l/NRV5CBUOJZ7UFQCOEIXT4M7MK3:/var/lib/docker/overlay2/l/WCF7HFJVI5SVZGAVNYCMOVFIYZ:/var/lib/docker/overlay2/l/VEVETHF2VF64VQUWQWQ4KV6WQZ:/var/lib/docker/overlay2/l/I5IGFWWPRMZ5FW72DA43T3Z226,upperdir=/var/lib/docker/overlay2/4d0c86aec61501de716eefcf7b0ea09f3828888664f1c9cdc667681f56cc33cf/diff,workdir=/var/lib/docker/overlay2/4d0c86a'\r\n", 285 | "2019-03-02T10:25:20.924272335Z tf-mpi-benchmarks-worker-0:8562:8575 [0] INFO comm 0x7f7fd02f9100 rank 0 nranks 2\r\n", 286 | "2019-03-02T10:25:20.931008568Z tf-mpi-benchmarks-worker-1:22278:22294 [0] INFO comm 0x7f63bc353b60 rank 1 nranks 2\r\n", 287 | "2019-03-02T10:25:20.931022959Z tf-mpi-benchmarks-worker-1:22278:22294 [0] INFO NET : Using interface eth0:172.16.1.232<0>\r\n", 288 | "2019-03-02T10:25:20.931026195Z tf-mpi-benchmarks-worker-1:22278:22294 [0] INFO NET/Socket : 1 interfaces found\r\n", 289 | "2019-03-02T10:25:20.932348798Z tf-mpi-benchmarks-worker-0:8562:8575 [0] INFO Using 256 threads\r\n", 290 | "2019-03-02T10:25:20.932531317Z tf-mpi-benchmarks-worker-0:8562:8575 [0] INFO Min Comp Cap 6\r\n", 291 | "2019-03-02T10:25:20.932538838Z tf-mpi-benchmarks-worker-0:8562:8575 [0] INFO NCCL_SINGLE_RING_THRESHOLD=131072\r\n", 292 | "2019-03-02T10:25:20.936292475Z tf-mpi-benchmarks-worker-0:8562:8575 [0] INFO Ring 00 : 0 1\r\n", 293 | "2019-03-02T10:25:20.94294648Z tf-mpi-benchmarks-worker-0:8562:8575 [0] INFO 1 -> 0 via NET/Socket/0\r\n", 294 | "2019-03-02T10:25:20.94334095Z tf-mpi-benchmarks-worker-1:22278:22294 [0] INFO 0 -> 1 via NET/Socket/0\r\n", 295 | "2019-03-02T10:25:20.951039327Z tf-mpi-benchmarks-worker-0:8562:8575 [0] INFO Launch mode Parallel\r\n", 296 | "2019-03-02T10:25:31.778698369Z Done warm up\r\n", 297 | "2019-03-02T10:25:31.779338161Z Step\tImg/sec\ttotal_loss\r\n", 298 | "2019-03-02T10:25:34.463195993Z Done warm up\r\n", 299 | "2019-03-02T10:25:34.468385192Z Step\tImg/sec\ttotal_loss\r\n", 300 | "2019-03-02T10:25:35.03856711Z 1\timages/sec: 112.4 +/- 0.0 (jitter = 0.0)\t9.102\r\n", 301 | "2019-03-02T10:25:35.038612058Z 1\timages/sec: 19.6 +/- 0.0 (jitter = 0.0)\t8.976\r\n", 302 | "2019-03-02T10:25:40.20952965Z 10\timages/sec: 75.9 +/- 8.7 (jitter = 2.0)\t9.101\r\n", 303 | "2019-03-02T10:25:42.972409109Z 10\timages/sec: 75.3 +/- 8.8 (jitter = 1.8)\t9.106\r\n", 304 | "2019-03-02T10:25:48.874253241Z 20\timages/sec: 74.9 +/- 6.2 (jitter = 2.2)\t8.978\r\n", 305 | "2019-03-02T10:25:51.483494563Z 20\timages/sec: 75.3 +/- 6.2 (jitter = 2.3)\t9.182\r\n" 306 | ] 307 | } 308 | ], 309 | "source": [ 310 | "! 
arena logs --tail=50 $JOB_NAME"
311 |    ]
312 |   },
313 |   {
314 |    "cell_type": "markdown",
315 |    "metadata": {},
316 |    "source": [
317 |     "8. Check the job's live GPU usage"
318 |    ]
319 |   },
320 |   {
321 |    "cell_type": "code",
322 |    "execution_count": 8,
323 |    "metadata": {},
324 |    "outputs": [
325 |     {
326 |      "name": "stdout",
327 |      "output_type": "stream",
328 |      "text": [
329 |       "INSTANCE NAME                     GPU(Device Index)  GPU(Duty Cycle)  GPU(Memory MiB)           STATUS   NODE\r\n",
330 |       "tf-mpi-benchmarks-launcher-s6x5c  N/A                N/A              N/A                       Running  192.168.0.208\r\n",
331 |       "tf-mpi-benchmarks-worker-0        0                  97%              15651.0MiB / 16276.2MiB   Running  192.168.0.210\r\n",
332 |       "tf-mpi-benchmarks-worker-1        0                  0%               15651.0MiB / 16276.2MiB   Running  192.168.0.208\r\n"
333 |      ]
334 |     }
335 |    ],
336 |    "source": [
337 |     "! arena top job $JOB_NAME"
338 |    ]
339 |   },
340 |   {
341 |    "cell_type": "markdown",
342 |    "metadata": {},
343 |    "source": [
344 |     "9. View training trends through TensorBoard.\n",
345 |     "Run `arena get ${JOB_NAME}` to obtain TensorBoard's in-cluster address — in this run, `192.168.0.206:31129`. If you cannot reach TensorBoard directly from your laptop, consider running `sshuttle` on it, e.g. `sshuttle -r root@41.82.59.51 192.168.0.0/16`, where `41.82.59.51` is the public IP of a cluster node reachable over ssh."
346 |    ]
347 |   },
348 |   {
349 |    "cell_type": "code",
350 |    "execution_count": 9,
351 |    "metadata": {
352 |     "scrolled": true
353 |    },
354 |    "outputs": [
355 |     {
356 |      "name": "stdout",
357 |      "output_type": "stream",
358 |      "text": [
359 |       "STATUS: RUNNING\r\n",
360 |       "NAMESPACE: default\r\n",
361 |       "TRAINING DURATION: 1m\r\n",
362 |       "\r\n",
363 |       "NAME               STATUS   TRAINER  AGE  INSTANCE                          NODE\r\n",
364 |       "tf-mpi-benchmarks  RUNNING  MPIJOB   1m   tf-mpi-benchmarks-launcher-ph6fr  192.168.0.208\r\n",
365 |       "tf-mpi-benchmarks  RUNNING  MPIJOB   1m   tf-mpi-benchmarks-worker-0        192.168.0.210\r\n",
366 |       "tf-mpi-benchmarks  RUNNING  MPIJOB   1m   tf-mpi-benchmarks-worker-1        192.168.0.209\r\n",
367 |       "\r\n",
368 |       "Your tensorboard will be available on:\r\n",
369 |       "192.168.0.206:31129   \r\n"
370 |      ]
371 |     }
372 |    ],
373 |    "source": [
374 |     "! arena get $JOB_NAME"
375 |    ]
376 |   },
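  { "cell_type": "markdown", "metadata": {}, "source": [ "Optional sketch (assumes TensorFlow is installed in this notebook environment, which the walkthrough does not guarantee): resolve the newest checkpoint programmatically before inspecting the output directory below." ] },
  { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Hypothetical helper; tf.train.latest_checkpoint reads the 'checkpoint' index file\n", "! python -c \"import tensorflow as tf; print(tf.train.latest_checkpoint('/root/output/mpi-benchmarks'))\"" ] },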
377 |   {
378 |    "cell_type": "markdown",
379 |    "metadata": {},
380 |    "source": [
381 |     "10. Inspect the artifacts produced by training; the results are generated under `output`"
382 |    ]
383 |   },
384 |   {
385 |    "cell_type": "code",
386 |    "execution_count": 10,
387 |    "metadata": {},
388 |    "outputs": [
389 |     {
390 |      "name": "stdout",
391 |      "output_type": "stream",
392 |      "text": [
393 |       "/root\r\n",
394 |       "|-- dataset\r\n",
395 |       "|-- models\r\n",
396 |       "|   `-- tensorflow-benchmarks\r\n",
397 |       "|       |-- LICENSE\r\n",
398 |       "|       |-- README.md\r\n",
399 |       "|       |-- bower_components\r\n",
400 |       "|       |-- dashboard_app\r\n",
401 |       "|       |-- index.html\r\n",
402 |       "|       |-- js\r\n",
403 |       "|       |-- scripts\r\n",
404 |       "|       |-- soumith_benchmarks.html\r\n",
405 |       "|       `-- tools\r\n",
406 |       "`-- output\r\n",
407 |       "    `-- mpi-benchmarks\r\n",
408 |       "        |-- checkpoint\r\n",
409 |       "        |-- events.out.tfevents.1551522695.tf-mpi-benchmarks-worker-0\r\n",
410 |       "        |-- graph.pbtxt\r\n",
411 |       "        |-- model.ckpt-110.data-00000-of-00001\r\n",
412 |       "        |-- model.ckpt-110.index\r\n",
413 |       "        `-- model.ckpt-110.meta\r\n",
414 |       "\r\n",
415 |       "10 directories, 10 files\r\n"
416 |      ]
417 |     }
418 |    ],
419 |    "source": [
420 |     "! tree -I ai-starter -L 3 ${HOME}"
421 |    ]
422 |   },
423 |   {
424 |    "cell_type": "markdown",
425 |    "metadata": {},
426 |    "source": [
427 |     "11. Delete the finished job"
428 |    ]
429 |   },
430 |   {
431 |    "cell_type": "code",
432 |    "execution_count": 11,
433 |    "metadata": {},
434 |    "outputs": [
435 |     {
436 |      "name": "stdout",
437 |      "output_type": "stream",
438 |      "text": [
439 |       "service \"tf-distributed-mnist-tensorboard\" deleted\n",
440 |       "deployment.extensions \"tf-distributed-mnist-tensorboard\" deleted\n",
441 |       "tfjob.kubeflow.org \"tf-distributed-mnist\" deleted\n",
442 |       "configmap \"tf-distributed-mnist-tfjob\" deleted\n",
443 |       "\u001b[36mINFO\u001b[0m[0004] The Job tf-distributed-mnist has been deleted successfully \n"
444 |      ]
445 |     }
446 |    ],
447 |    "source": [
448 |     "! arena delete $JOB_NAME"
449 |    ]
450 |   },
451 |   {
452 |    "cell_type": "markdown",
453 |    "metadata": {},
454 |    "source": [
455 |     "Congratulations! You have successfully run a training job with `arena` and easily checked TensorBoard along the way.\n",
456 |     "\n",
457 |     "To sum up, this demo walked through:\n",
458 |     "1. how to prepare code and place it in a data volume\n",
459 |     "2. how to reference the data volume from a distributed MPI training job and use the code and data inside it\n",
460 |     "3. how to manage your distributed training jobs with arena."
461 |    ]
462 |   }
463 |  ],
464 |  "metadata": {
465 |   "kernelspec": {
466 |    "display_name": "Python 3",
467 |    "language": "python",
468 |    "name": "python3"
469 |   },
470 |   "language_info": {
471 |    "codemirror_mode": {
472 |     "name": "ipython",
473 |     "version": 3
474 |    },
475 |    "file_extension": ".py",
476 |    "mimetype": "text/x-python",
477 |    "name": "python",
478 |    "nbconvert_exporter": "python",
479 |    "pygments_lexer": "ipython3",
480 |    "version": "3.5.2"
481 |   }
482 |  },
483 |  "nbformat": 4,
484 |  "nbformat_minor": 2
485 | }
486 | 
--------------------------------------------------------------------------------
/demo/1-start-with-mnist.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "metadata": {},
6 |    "source": [
7 |     "# Getting Started with Cloud-Native AI - 1. First Steps with MNIST\n",
8 |     "\n",
9 |     "In this example we will demonstrate how to:\n",
10 |     "\n",
11 |     "* Download and prepare the data\n",
12 |     "* Submit a single-node training job with Arena, check its status and logs, and inspect it through TensorBoard\n",
13 |     "* Deploy an online model-serving service from your training results\n",
14 |     "* Call the online service from the notebook to verify model accuracy."
15 |    ]
16 |   },
17 |   {
18 |    "cell_type": "markdown",
19 |    "metadata": {},
20 |    "source": [
21 |     "### 1. Download the TensorFlow sample source code into /root/models"
22 |    ]
23 |   },
24 |   {
25 |    "cell_type": "code",
26 |    "execution_count": 2,
27 |    "metadata": {},
28 |    "outputs": [
29 |     {
30 |      "name": "stdout",
31 |      "output_type": "stream",
32 |      "text": [
33 |       "mkdir: cannot create directory '/root/models': File exists\n",
34 |       "fatal: destination path '/root/models/tensorflow-sample-code' already exists and is not an empty directory.\n"
35 |      ]
36 |     }
37 |    ],
38 |    "source": [
39 |     "! mkdir -p /root/models\n",
40 |     "! git clone https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git /root/models/tensorflow-sample-code"
41 |    ]
42 |   },
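  { "cell_type": "markdown", "metadata": {}, "source": [ "As the recorded output above shows, re-running the clone fails harmlessly once the directory exists. A guarded variant (a sketch, mirroring the pattern used in the other demo notebooks):" ] },
  { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Clone only when the target directory is absent\n", "! if [ ! -d \"/root/models/tensorflow-sample-code\" ] ; then \\\n", "   git clone https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git /root/models/tensorflow-sample-code; \\\n", "fi" ] },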
43 |   {
44 |    "cell_type": "markdown",
45 |    "metadata": {},
46 |    "source": [
47 |     "### 2. Download the mnist data into /root/dataset/mnist"
48 |    ]
49 |   },
50 |   {
51 |    "cell_type": "code",
52 |    "execution_count": 3,
53 |    "metadata": {
54 |     "scrolled": true
55 |    },
56 |    "outputs": [
57 |     {
58 |      "name": "stdout",
59 |      "output_type": "stream",
60 |      "text": [
61 |       "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
62 |       "                                 Dload  Upload   Total   Spent    Left  Speed\n",
63 |       "100 1610k    0 1610k    0     0  5225k      0 --:--:-- --:--:-- --:--:-- 5228k\n",
64 |       "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
65 |       "                                 Dload  Upload   Total   Spent    Left  Speed\n",
66 |       "100  4542    0  4542    0     0  22307      0 --:--:-- --:--:-- --:--:-- 22374\n",
67 |       "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
68 |       "                                 Dload  Upload   Total   Spent    Left  Speed\n",
69 |       "100 9680k    0 9680k    0     0  25.2M      0 --:--:-- --:--:-- --:--:-- 25.2M\n",
70 |       "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
71 |       "                                 Dload  Upload   Total   Spent    Left  Speed\n",
72 |       "100 28881    0 28881    0     0   134k      0 --:--:-- --:--:-- --:--:--  135k\n"
73 |      ]
74 |     }
75 |    ],
76 |    "source": [
77 |     "! mkdir -p /root/output\n",
78 |     "! mkdir -p /root/dataset/mnist && \\\n",
79 |     "  cd /root/dataset/mnist && \\\n",
80 |     "  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-images-idx3-ubyte.gz && \\\n",
81 |     "  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-labels-idx1-ubyte.gz && \\\n",
82 |     "  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-images-idx3-ubyte.gz && \\\n",
83 |     "  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-labels-idx1-ubyte.gz"
84 |    ]
85 |   },
86 |   {
87 |    "cell_type": "markdown",
88 |    "metadata": {},
89 |    "source": [
90 |     "### 3. Inspect the directory layout: `dataset` holds the data, `models` holds the model code, and `output` holds the training results (created by the `mkdir -p /root/output` in step 2)."
91 |    ]
92 |   },
93 |   {
94 |    "cell_type": "code",
95 |    "execution_count": 5,
96 |    "metadata": {},
97 |    "outputs": [
98 |     {
99 |      "name": "stdout",
100 |      "output_type": "stream",
101 |      "text": [
102 |       "/root\n",
103 |       "|-- dataset\n",
104 |       "|   `-- mnist\n",
105 |       "|       |-- t10k-images-idx3-ubyte.gz\n",
106 |       "|       |-- t10k-labels-idx1-ubyte.gz\n",
107 |       "|       |-- train-images-idx3-ubyte.gz\n",
108 |       "|       `-- train-labels-idx1-ubyte.gz\n",
109 |       "|-- models\n",
110 |       "|   `-- tensorflow-sample-code\n",
111 |       "|       |-- README.md\n",
112 |       "|       |-- data\n",
113 |       "|       |-- mnist-tf\n",
114 |       "|       |-- models\n",
115 |       "|       |-- mpijob\n",
116 |       "|       `-- tfjob\n",
117 |       "|-- output\n",
118 |       "|-- public\n",
119 |       "\n",
120 |       "25 directories, 12 files\n"
121 |      ]
122 |     }
123 |    ],
124 |    "source": [
125 |     "! tree -I ai-starter -L 3 /root"
126 |    ]
127 |   },
128 |   {
129 |    "cell_type": "markdown",
130 |    "metadata": {},
131 |    "source": [
132 |     "### 4. Check available GPU resources"
133 |    ]
134 |   },
135 |   {
136 |    "cell_type": "code",
137 |    "execution_count": 6,
138 |    "metadata": {},
139 |    "outputs": [
140 |     {
141 |      "name": "stdout",
142 |      "output_type": "stream",
143 |      "text": [
144 |       "NAME                                IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)\r\n",
145 |       "cn-shanghai.i-uf60zgmfu7nvxxxxxxxx  192.168.0.46           1           0\r\n",
146 |       "cn-shanghai.i-uf6hf2g6lwyvxxxxxxxx  192.168.0.203          0           0\r\n",
147 |       "-----------------------------------------------------------------------------------------\r\n",
148 |       "Allocated/Total GPUs In Cluster:\r\n",
149 |       "0/1 (0%)  \r\n"
150 |      ]
151 |     }
152 |    ],
153 |    "source": [
arena top node" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "### 5.通过Arena提交训练任务\n", 162 | "\n", 163 | "5.1 可以根据您的需要设置JOB_NAME,建议在多人共同使用的时候,设置自己独有的JOB_NAME\n", 164 | "\n", 165 | "5.2 这里 `$USER_DATA_NAME` 代表您的Notebook 使用的共享存储,存储的根目录和Notebook中`/root`对应。 由集群管理员在部署工作环境时指定[部署数据科学家工作环境(Notebook)](../docs/setup/SETUP_NOTEBOOK.md)创建。 \n", 166 | "* `$USER_DATA_NAME` 是存放您私有数据的共享存储,文件内容对应Notebook的 `/root` 目录\n", 167 | "* `$PUBLIC_DATA_NAME` 是存放公共数据的共享存储,文件内容对应Notebook的 `/root/public` 目录。在arena的命令中,如果需要使用公共存储里的文件,可以指定参数 `--data=$PUBLIC_DATA_NAME:/public`,并在训练代码中指定容器使用`/public`目录里的代码或数据\n", 168 | "\n", 169 | "在下面的提交命令中:\n", 170 | "* `--data=$USER_DATA_NAME:/training` 表示将共享存储映射到训练任务的`/training`目录。\n", 171 | "* `/training`目录下的子目录`/training/models/tensorflow-sample-code` 就是步骤1拷贝源代码的位置\n", 172 | "* `/training`目录下的子目录`/training/dataset/mnist`就是步骤2下载数据的位置\n", 173 | "* `/training`目录下的子目录`/training/output`就是步骤3创建的训练结果输出的位置。" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 54, 179 | "metadata": {}, 180 | "outputs": [ 181 | { 182 | "name": "stdout", 183 | "output_type": "stream", 184 | "text": [ 185 | "env: JOB_NAME=tf-mnist\n", 186 | "configmap/tf-mnist-tfjob created\n", 187 | "configmap/tf-mnist-tfjob labeled\n", 188 | "service/tf-mnist-tensorboard created\n", 189 | "deployment.extensions/tf-mnist-tensorboard created\n", 190 | "tfjob.kubeflow.org/tf-mnist created\n", 191 | "\u001b[36mINFO\u001b[0m[0008] The Job tf-mnist has been submitted successfully \n", 192 | "\u001b[36mINFO\u001b[0m[0008] You can run `arena get tf-mnist --type tfjob` to check the job status \n" 193 | ] 194 | } 195 | ], 196 | "source": [ 197 | "# Set the Job Name\n", 198 | "%env JOB_NAME=tf-mnist\n", 199 | "%env USER_DATA_NAME=training-data\n", 200 | "# Submit a training job \n", 201 | "# using code and data from Data Volume\n", 202 | "!arena submit tf \\\n", 203 | " --name=$JOB_NAME \\\n", 204 | " --gpus=1 \\\n", 205 | " --data=$USER_DATA_NAME:/training \\\n", 206 | " --tensorboard \\\n", 207 | " --image=tensorflow/tensorflow:1.11.0-gpu-py3 \\\n", 208 | " --logdir=/training/output/mnist \\\n", 209 | " \"python /training/models/tensorflow-sample-code/tfjob/docker/mnist/main.py --max_steps 5000 --data_dir /training/dataset/mnist --log_dir /training/output/mnist\"" 210 | ] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "metadata": {}, 215 | "source": [ 216 | "> - `Arena`命令的`--logdir`指定`tensorboard`从训练任务的指定目录读取event \n", 217 | "> - 完整参数可以参考[命令行文档](https://github.com/kubeflow/arena/blob/master/docs/cli/arena_submit_tfjob.md)" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": {}, 223 | "source": [ 224 | "### 6.检查模型训练状态\n", 225 | "当任务状态从`Pending`转为`Running`后就可以查看日志和GPU使用率了。这里`-e`为了方便检查任务`Pending`的原因。通常看到`[Pulling] pulling image \"tensorflow/tensorflow:1.11.0-gpu-py3\"`代表容器镜像过大,导致任务处于`Pending`。这时可以重复执行下列命令直到任务状态变为`Running`。" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": 55, 231 | "metadata": {}, 232 | "outputs": [ 233 | { 234 | "name": "stdout", 235 | "output_type": "stream", 236 | "text": [ 237 | "STATUS: RUNNING\n", 238 | "NAMESPACE: default\n", 239 | "TRAINING DURATION: 1s\n", 240 | "\n", 241 | "NAME STATUS TRAINER AGE INSTANCE NODE\n", 242 | "tf-mnist RUNNING TFJOB 1s tf-mnist-chief-0 192.168.0.31\n", 243 | "\n", 244 | "Your tensorboard will be available on:\n", 245 | "http://192.168.0.203:32116 \n", 246 | "\n", 247 | "Events: \n", 248 | "No events for pending pod\n" 249 | ] 250 | } 251 | ], 252 | 
"source": [ 253 | "! arena get $JOB_NAME -e" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "### 7.实时检查日志\n", 261 | "此时可以通过调整`--tail=`的数值展示输出的行数。默认为显示全部日志。\n", 262 | "想要获取实时日志, 可以执行`-f`参数" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 56, 268 | "metadata": {}, 269 | "outputs": [ 270 | { 271 | "name": "stdout", 272 | "output_type": "stream", 273 | "text": [ 274 | "2019-04-03T11:30:20.434833662Z WARNING:tensorflow:From /training/models/tensorflow-sample-code/tfjob/docker/mnist/main.py:41: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.\r\n", 275 | "2019-04-03T11:30:20.434891738Z Instructions for updating:\r\n", 276 | "2019-04-03T11:30:20.434898475Z Please use alternatives such as official/mnist/dataset.py from tensorflow/models.\r\n", 277 | "2019-04-03T11:30:20.473209783Z WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.\r\n", 278 | "2019-04-03T11:30:20.473235115Z Instructions for updating:\r\n", 279 | "2019-04-03T11:30:20.473239615Z Please write your own downloading logic.\r\n", 280 | "2019-04-03T11:30:20.483963934Z WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.\r\n", 281 | "2019-04-03T11:30:20.483984281Z Instructions for updating:\r\n", 282 | "2019-04-03T11:30:20.483988225Z Please use tf.data to implement this functionality.\r\n", 283 | "2019-04-03T11:30:20.778228754Z WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.\r\n", 284 | "2019-04-03T11:30:20.778261288Z Instructions for updating:\r\n", 285 | "2019-04-03T11:30:20.778264876Z Please use tf.data to implement this functionality.\r\n", 286 | "2019-04-03T11:30:20.786090015Z WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:110: dense_to_one_hot (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.\r\n", 287 | "2019-04-03T11:30:20.78610894Z Instructions for updating:\r\n", 288 | "2019-04-03T11:30:20.786113407Z Please use tf.one_hot on tensors.\r\n" 289 | ] 290 | } 291 | ], 292 | "source": [ 293 | "! arena logs --tail=50 $JOB_NAME" 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "### 8. 查看实时训练的GPU使用情况" 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": 6, 306 | "metadata": {}, 307 | "outputs": [ 308 | { 309 | "name": "stdout", 310 | "output_type": "stream", 311 | "text": [ 312 | "INSTANCE NAME GPU(Device Index) GPU(Duty Cycle) GPU(Memory MiB) STATUS NODE\r\n", 313 | "tf-mnist-chief-0 N/A N/A N/A Succeeded \r\n" 314 | ] 315 | } 316 | ], 317 | "source": [ 318 | "! arena top job $JOB_NAME" 319 | ] 320 | }, 321 | { 322 | "cell_type": "markdown", 323 | "metadata": {}, 324 | "source": [ 325 | "### 9. 
查看训练结果\n", 326 | "#### 9.1 通过TensorBoard查看训练趋势\n", 327 | "您可以使用下方 `arena get` 输出中给出的 Tensorboard 地址(本例中为 `http://192.168.0.203:30532`)访问 Tensorboard。如果您通过笔记本电脑无法直接访问 Tensorboard,可以考虑在您的笔记本电脑上使用 `sshuttle`。例如:`sshuttle -r root@41.82.59.51 192.168.0.0/16`。其中`41.82.59.51`为集群内某个节点的外网IP,且该外网IP可以通过ssh访问。" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": 7, 333 | "metadata": {}, 334 | "outputs": [ 335 | { 336 | "name": "stdout", 337 | "output_type": "stream", 338 | "text": [ 339 | "STATUS: SUCCEEDED\n", 340 | "NAMESPACE: default\n", 341 | "TRAINING DURATION: 13m\n", 342 | "\n", 343 | "NAME STATUS TRAINER AGE INSTANCE NODE\n", 344 | "tf-mnist SUCCEEDED TFJOB 20m tf-mnist-chief-0 N/A\n", 345 | "\n", 346 | "Your tensorboard will be available on:\n", 347 | "http://192.168.0.203:30532 \n" 348 | ] 349 | } 350 | ], 351 | "source": [ 352 | "# show job detail\n", 353 | "! arena get $JOB_NAME" 354 | ] 355 | }, 356 | { 357 | "cell_type": "markdown", 358 | "metadata": {}, 359 | "source": [ 360 | "![](1-1-tensorboard.jpg)" 361 | ] 362 | }, 363 | { 364 | "cell_type": "markdown", 365 | "metadata": {}, 366 | "source": [ 367 | "#### 9.2 查看模型训练产生的结果文件,训练结果生成在`/root/output`下。\n", 368 | "\n", 369 | "`/root/output/mnist` 目录中是训练过程中产生的checkpoint文件,代表训练结束时模型的变量状态(下方插入的代码片段可以帮助确认最新checkpoint对应的步数)。" 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": 57, 375 | "metadata": {}, 376 | "outputs": [ 377 | { 378 | "name": "stdout", 379 | "output_type": "stream", 380 | "text": [ 381 | "/root/output\r\n", 382 | "|-- mnist\r\n", 383 | "|   |-- checkpoint\r\n", 384 | "|   |-- model.ckpt-4500.data-00000-of-00001\r\n", 385 | "|   |-- model.ckpt-4500.index\r\n", 386 | "|   |-- model.ckpt-4500.meta\r\n", 387 | "|   |-- model.ckpt-4600.data-00000-of-00001\r\n", 388 | "|   |-- model.ckpt-4600.index\r\n", 389 | "|   |-- model.ckpt-4600.meta\r\n", 390 | "|   |-- model.ckpt-4700.data-00000-of-00001\r\n", 391 | "|   |-- model.ckpt-4700.index\r\n", 392 | "|   |-- model.ckpt-4700.meta\r\n", 393 | "|   |-- model.ckpt-4800.data-00000-of-00001\r\n", 394 | "|   |-- model.ckpt-4800.index\r\n", 395 | "|   |-- model.ckpt-4800.meta\r\n", 396 | "|   |-- model.ckpt-4900.data-00000-of-00001\r\n", 397 | "|   |-- model.ckpt-4900.index\r\n", 398 | "|   |-- model.ckpt-4900.meta\r\n", 399 | "|   |-- test\r\n", 400 | "|   |   |-- events.out.tfevents.1554286572.tf-mnist-chief-0\r\n", 401 | "|   |   `-- events.out.tfevents.1554291023.tf-mnist-chief-0\r\n", 402 | "|   `-- train\r\n", 403 | "|       |-- events.out.tfevents.1554286571.tf-mnist-chief-0\r\n", 404 | "|       `-- events.out.tfevents.1554291021.tf-mnist-chief-0\r\n", 405 | "`-- mnist-model\r\n", 406 | "    `-- 1\r\n", 407 | "        |-- saved_model.pb\r\n", 408 | "        `-- variables\r\n", 409 | "\r\n", 410 | "6 directories, 21 files\r\n" 411 | ] 412 | } 413 | ], 414 | "source": [ 415 | "! tree -I ai-starter -L 3 /root/output" 416 | ] 417 | }, 418 | 
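导出模型(下一步)时需要指定 `--checkpoint_step`,其取值可以从最新的 checkpoint 文件名中得到。下面是一个示意片段(假设 Notebook 环境已安装 TensorFlow 1.x,训练输出位于共享存储对应的 `/root/output/mnist`):

```python
# 示意:查询训练输出目录中最新的 checkpoint 及其对应的步数
# 假设:Notebook 已安装 TensorFlow 1.x;/root/output/mnist 为训练输出目录
import tensorflow as tf

ckpt = tf.train.latest_checkpoint("/root/output/mnist")
print(ckpt)                      # 例如 /root/output/mnist/model.ckpt-4900
print(int(ckpt.split("-")[-1]))  # 即 --checkpoint_step 的取值,例如 4900
```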
{ 419 | "cell_type": "markdown", 420 | "metadata": {}, 421 | "source": [ 422 | "### 10. 将训练过程中产生的checkpoint转换为模型文件\n", 423 | "checkpoint文件不能直接用于部署预测服务,需要进行一次模型导出,将checkpoint文件中的模型状态导出为可以被预测服务识别的模型文件。我们可以通过arena 提交一个转换任务,声明将`output/mnist`目录下的checkpoint文件导出到`output/mnist-model`目录。" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": 3, 429 | "metadata": {}, 430 | "outputs": [ 431 | { 432 | "name": "stdout", 433 | "output_type": "stream", 434 | "text": [ 435 | "configmap/export-model-tfjob created\n", 436 | "configmap/export-model-tfjob labeled\n", 437 | "tfjob.kubeflow.org/export-model created\n", 438 | "\u001b[36mINFO\u001b[0m[0008] The Job export-model has been submitted successfully \n", 439 | "\u001b[36mINFO\u001b[0m[0008] You can run `arena get export-model --type tfjob` to check the job status \n" 440 | ] 441 | } 442 | ], 443 | "source": [ 444 | "! arena submit tf \\\n", 445 | "   --name=export-model \\\n", 446 | "   --workers=1 \\\n", 447 | "   --gpus=1 \\\n", 448 | "   --data=$USER_DATA_NAME:/training \\\n", 449 | "   --image=tensorflow/tensorflow:1.11.0-gpu-py3 \\\n", 450 | "   \"python /training/models/tensorflow-sample-code/tfjob/docker/mnist/export_model.py \\\n", 451 | "   --checkpoint_step=4900 \\\n", 452 | "   --checkpoint_path=/training/output/mnist /training/output/mnist-model/ \"" 453 | ] 454 | }, 455 | { 456 | "cell_type": "markdown", 457 | "metadata": {}, 458 | "source": [ 459 | "#### 10.1 查看模型导出任务的执行状态。如果状态为`FAILED`,可以通过`arena logs export-model`查看日志排查原因,修正后重新提交。" 460 | ] 461 | }, 462 | { 463 | "cell_type": "code", 464 | "execution_count": 60, 465 | "metadata": {}, 466 | "outputs": [ 467 | { 468 | "name": "stdout", 469 | "output_type": "stream", 470 | "text": [ 471 | "STATUS: FAILED\r\n", 472 | "NAMESPACE: default\r\n", 473 | "TRAINING DURATION: 6s\r\n", 474 | "\r\n", 475 | "NAME STATUS TRAINER AGE INSTANCE NODE\r\n", 476 | "export-model FAILED TFJOB 1m export-model-chief-0 N/A\r\n" 477 | ] 478 | } 479 | ], 480 | "source": [ 481 | "! arena get export-model" 482 | ] 483 | }, 484 | { 485 | "cell_type": "markdown", 486 | "metadata": {}, 487 | "source": [ 488 | "#### 10.2 导出任务执行完毕后,可以在`output/mnist-model`目录中看到导出的模型文件。" 489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": 62, 494 | "metadata": {}, 495 | "outputs": [ 496 | { 497 | "name": "stdout", 498 | "output_type": "stream", 499 | "text": [ 500 | "/root/output/mnist-model\r\n", 501 | "`-- 1\r\n", 502 | "    |-- saved_model.pb\r\n", 503 | "    `-- variables\r\n", 504 | "        |-- variables.data-00000-of-00001\r\n", 505 | "        `-- variables.index\r\n", 506 | "\r\n", 507 | "2 directories, 3 files\r\n" 508 | ] 509 | } 510 | ], 511 | "source": [ 512 | "! tree -I ai-starter -L 3 /root/output/mnist-model" 513 | ] 514 | }, 515 | { 516 | "cell_type": "markdown", 517 | "metadata": {}, 518 | "source": [ 519 | "### 11. 部署预测服务\n", 520 | "\n", 521 | "我们得到可以用于部署预测服务的模型文件后,可以通过`arena serve` 提交一个模型预测的在线任务。 \n", 522 | "\n", 523 | "* `--data=$USER_DATA_NAME:/training` 表示将共享存储目录挂载到预测服务的`/training`目录\n", 524 | "* `--modelPath=/training/output/mnist-model` 表示预测服务使用的模型文件目录,就是我们在步骤10中导出的模型文件位置(提交之前,也可以先用下方插入的示意代码在本地校验导出的模型)" 525 | ] 526 | }, 
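在提交 serve 任务之前,可以先在 Notebook 中本地加载导出的 SavedModel,确认 `predict_images` 签名及其输入输出张量符合预期。下面是一个示意片段(假设 Notebook 已安装 TensorFlow 1.x;`tf.contrib.predictor` 在 TensorFlow 2.x 中已被移除):

```python
# 示意:本地加载导出的 SavedModel,检查 predict_images 签名
# 假设:Notebook 已安装 TensorFlow 1.x;模型位于 /root/output/mnist-model/1
from tensorflow.contrib import predictor

predict_fn = predictor.from_saved_model(
    "/root/output/mnist-model/1", signature_def_key="predict_images")
print(predict_fn.feed_tensors)   # 输入张量,例如图片的 placeholder
print(predict_fn.fetch_tensors)  # 输出张量,即各数字的得分
```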
{ 527 | "cell_type": "code", 528 | "execution_count": 63, 529 | "metadata": {}, 530 | "outputs": [ 531 | { 532 | "name": "stdout", 533 | "output_type": "stream", 534 | "text": [ 535 | "configmap/mnist-tf-serving created\n", 536 | "configmap/mnist-tf-serving labeled\n", 537 | "configmap/mnist-tensorflow-serving-cm created\n", 538 | "service/mnist-tensorflow-serving created\n", 539 | "deployment.extensions/mnist-tensorflow-serving created\n" 540 | ] 541 | } 542 | ], 543 | "source": [ 544 | "! arena serve tensorflow \\\n", 545 | "   --servingName=mnist \\\n", 546 | "   --modelName=mnist \\\n", 547 | "   --image=tensorflow/serving:latest \\\n", 548 | "   --data=$USER_DATA_NAME:/training \\\n", 549 | "   --modelPath=/training/output/mnist-model" 550 | ] 551 | }, 552 | { 553 | "cell_type": "markdown", 554 | "metadata": {}, 555 | "source": [ 556 | "#### 11.1 查看预测服务\n", 557 | "\n", 558 | "我们可以查看到预测服务的部署状态,以及入口访问地址。" 559 | ] 560 | }, 561 | { 562 | "cell_type": "code", 563 | "execution_count": 31, 564 | "metadata": {}, 565 | "outputs": [ 566 | { 567 | "name": "stdout", 568 | "output_type": "stream", 569 | "text": [ 570 | "NAME TYPE VERSION STATUS CLUSTER-IP\r\n", 571 | "mnist Tensorflow RUNNING 172.19.82.216\r\n" 572 | ] 573 | } 574 | ], 575 | "source": [ 576 | "! arena serve list" 577 | ] 578 | }, 579 | { 580 | "cell_type": "markdown", 581 | "metadata": {}, 582 | "source": [ 583 | "### 12. 通过预测服务验证模型准确率" 584 | ] 585 | }, 586 | { 587 | "cell_type": "code", 588 | "execution_count": 91, 589 | "metadata": {}, 590 | "outputs": [ 591 | { 592 | "name": "stdout", 593 | "output_type": "stream", 594 | "text": [ 595 | "Requirement already satisfied: requests in /usr/local/lib/python3.5/dist-packages (2.21.0)\n", 596 | "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.5/dist-packages (from requests) (2019.3.9)\n", 597 | "Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.5/dist-packages (from requests) (2.8)\n", 598 | "Requirement already satisfied: urllib3<1.25,>=1.21.1 in /usr/local/lib/python3.5/dist-packages (from requests) (1.24.1)\n", 599 | "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.5/dist-packages (from requests) (3.0.4)\n", 600 | "\u001b[33mYou are using pip version 18.1, however version 19.0.3 is available.\n", 601 | "You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n" 602 | ] 603 | } 604 | ], 605 | "source": [ 606 | "# 安装必要的python 库\n", 607 | "! 
pip install requests" 609 | ] 610 | }, 611 | { 612 | "cell_type": "markdown", 613 | "metadata": {}, 614 | "source": [ 615 | "#### 12.1 定义函数,通过HTTP调用预测服务" 616 | ] 617 | }, 618 | { 619 | "cell_type": "markdown", 620 | "metadata": {}, 621 | "source": [ 622 | "以下代码定义一个python方法`pick_image_and_predict`。这个方法会从mnist的测试数据集中随机选择一张图片,作为请求的参数,通过HTTP方式调用预测服务,得到模型预测的结果。这个方法执行后会同时打印图片的真实值,和通过预测服务推理得到的值。您可以据此判断模型的准确率是否满足要求。" 623 | ] 624 | }, 625 | { 626 | "cell_type": "code", 627 | "execution_count": 92, 628 | "metadata": {}, 629 | "outputs": [ 630 | { 631 | "name": "stdout", 632 | "output_type": "stream", 633 | "text": [ 634 | "Extracting /root/dataset/mnist/train-images-idx3-ubyte.gz\n", 635 | "Extracting /root/dataset/mnist/train-labels-idx1-ubyte.gz\n", 636 | "Extracting /root/dataset/mnist/t10k-images-idx3-ubyte.gz\n", 637 | "Extracting /root/dataset/mnist/t10k-labels-idx1-ubyte.gz\n" 638 | ] 639 | } 640 | ], 641 | "source": [ 642 | "import matplotlib.pyplot as plt\n", 643 | "import numpy as np\n", 644 | "import random\n", 645 | "import requests\n", 646 | "import json\n", 647 | "from tensorflow.examples.tutorials.mnist import input_data\n", 648 | "%matplotlib inline\n", 649 | "data_dir=\"/root/dataset/mnist/\"\n", 650 | "mnist = input_data.read_data_sets(data_dir, one_hot=True)\n", 651 | "test_images = mnist.test.images\n", 652 | "test_labels = mnist.test.labels\n", 653 | "digits = ['0','1','2','3','4','5','6','7','8','9']\n", 654 | "# -*- coding: utf-8 -*-\n", 655 | "\n", 656 | "def show(idx, title):\n", 657 | "  plt.figure()\n", 658 | "  plt.imshow(test_images[idx].reshape(28,28))\n", 659 | "  plt.axis('off')\n", 660 | "  plt.title('\\n\\n{}'.format(title), fontdict={'size': 16})\n", 661 | "\n", 662 | "def predict(url, num):\n", 663 | "  test_cls = np.argmax(test_labels, axis=1)\n", 664 | "  show(num, 'The Picture is {}'.format(test_cls[num]))\n", 665 | "  headers = {\"content-type\": \"application/json\"}\n", 666 | "  # metadata = requests.get(url + '/metadata')  # 可选:查看模型元信息\n", 667 | "  data = json.dumps({\"signature_name\": \"predict_images\", \"dropout/Placeholder\": 1.0,\"inputs\": test_images[num].reshape(1, 784).tolist()})\n", 668 | "  json_response = requests.post(url + ':predict', data=data, headers=headers)  # 使用入参url,而非全局变量\n", 669 | "  scores = json.loads(json_response.text)['outputs']\n", 670 | "  predicted_digits_idx = np.argmax(scores)\n", 671 | "  print('预测识别的数字: {}'.format(digits[predicted_digits_idx]))\n", 672 | "  return scores\n", 673 | "\n", 674 | "def pick_image_and_predict(model_api):\n", 675 | "  random_image = random.randint(0,len(test_images)-1)\n", 676 | "  score = predict(model_api, random_image)" 677 | ] 678 | }, 679 | { 680 | "cell_type": "markdown", 681 | "metadata": {}, 682 | "source": [ 683 | "#### 12.2 调用预测服务\n", 684 | "我们可以得到真实值和预测值。您可以多执行几次或者修改代码增加执行次数,比较预测结果和真实结果,判断模型的预测准确率(下方还插入了一个批量抽样统计准确率的示意片段)。\n", 685 | "\n", 686 | "这里`mnist-tensorflow-serving` 代表预测服务的服务域名,您也可以改为IP地址。预测服务的服务IP(CLUSTER-IP)可以在步骤11.1中通过`arena serve list` 得到。" 687 | ] 688 | }, 689 | { 690 | "cell_type": "code", 691 | "execution_count": null, 692 | "metadata": {}, 693 | "outputs": [], 694 | "source": [ 695 | "model_api='http://mnist-tensorflow-serving:8501/v1/models/mnist'\n", 696 | "pick_image_and_predict(model_api)" 697 | ] 698 | }, 699 | 
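如果想把“多执行几次”自动化,可以在上面定义的函数基础上批量抽样、统计准确率。下面是一个示意片段(假设上文单元格中的 `predict`、`test_images`、`test_labels` 已定义):

```python
# 示意:随机抽取 n 张测试图片调用预测服务,统计预测准确率
# 假设:predict / test_images / test_labels 已由上文单元格定义
import random
import numpy as np

def estimate_accuracy(model_api, n=20):
    test_cls = np.argmax(test_labels, axis=1)
    correct = 0
    for _ in range(n):
        i = random.randint(0, len(test_images) - 1)
        scores = predict(model_api, i)  # 注意:每次调用会绘制一张图片
        if np.argmax(scores) == test_cls[i]:
            correct += 1
    return correct / float(n)

# 用法示例:
# estimate_accuracy('http://mnist-tensorflow-serving:8501/v1/models/mnist', n=20)
```

{ 700 | "cell_type": "markdown", 701 | "metadata": {}, 702 | "source": [ 703 | "### 13. 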
删除已经完成的任务" 704 | ] 705 | }, 706 | { 707 | "cell_type": "code", 708 | "execution_count": 88, 709 | "metadata": {}, 710 | "outputs": [ 711 | { 712 | "name": "stdout", 713 | "output_type": "stream", 714 | "text": [ 715 | "service \"tf-mnist-tensorboard\" deleted\n", 716 | "deployment.extensions \"tf-mnist-tensorboard\" deleted\n", 717 | "tfjob.kubeflow.org \"tf-mnist\" deleted\n", 718 | "configmap \"tf-mnist-tfjob\" deleted\n", 719 | "\u001b[36mINFO\u001b[0m[0006] The Job tf-mnist has been deleted successfully \n", 720 | "tfjob.kubeflow.org \"export-model\" deleted\n", 721 | "configmap \"export-model-tfjob\" deleted\n", 722 | "\u001b[36mINFO\u001b[0m[0006] The Job export-model has been deleted successfully \n", 723 | "configmap \"mnist-tensorflow-serving-cm\" deleted\n", 724 | "service \"mnist-tensorflow-serving\" deleted\n", 725 | "deployment.extensions \"mnist-tensorflow-serving\" deleted\n", 726 | "configmap \"mnist-tf-serving\" deleted\n", 727 | "\u001b[36mINFO\u001b[0m[0002] The Serving job mnist has been deleted successfully \n" 728 | ] 729 | } 730 | ], 731 | "source": [ 732 | "# delete job\n", 733 | "! arena delete $JOB_NAME\n", 734 | "! arena delete export-model\n", 735 | "# delete serving job\n", 736 | "! arena serve delete mnist" 737 | ] 738 | }, 739 | { 740 | "cell_type": "markdown", 741 | "metadata": {}, 742 | "source": [ 743 | "恭喜!您已经使用 `arena` 成功运行了训练作业,而且还能轻松检查 Tensorboard。\n", 744 | "\n", 745 | "总结,希望您通过本次演示了解:\n", 746 | "1. 如何准备代码和数据,并将其放入数据卷中\n", 747 | "2. 如何在训练任务中引用数据卷,并且使用其中的代码和数据\n", 748 | "3. 如何利用arena管理您的训练任务。\n", 749 | "4. 为您的训练结果部署一个模型预测的在线服务。\n", 750 | "5. 在Notebook中调用您的在线服务,验证模型准确率。\n", 751 | "\n", 752 | "以上是使用`Arena`在云上进行模型训练的例子,您可以通过修改代码`/root/models/tensorflow-sample-code/tfjob/docker/mnist/main.py`重新提交,实现迭代的模型开发目的。" 753 | ] 754 | } 755 | ], 756 | "metadata": { 757 | "kernelspec": { 758 | "display_name": "Python 3", 759 | "language": "python", 760 | "name": "python3" 761 | }, 762 | "language_info": { 763 | "codemirror_mode": { 764 | "name": "ipython", 765 | "version": 3 766 | }, 767 | "file_extension": ".py", 768 | "mimetype": "text/x-python", 769 | "name": "python", 770 | "nbconvert_exporter": "python", 771 | "pygments_lexer": "ipython3", 772 | "version": "3.5.2" 773 | } 774 | }, 775 | "nbformat": 4, 776 | "nbformat_minor": 2 777 | } 778 | -------------------------------------------------------------------------------- /demo/Bert-pretraining.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 入门云原生AI - 提交Bert训练任务\n", 8 | "BERT是Google开发的一种nlp领域的预训练语言表示模型,BERT在11项NLP任务中夺得非常好的结果,Google在11月份开源了bert的代码,同时发布了多种语言版本的模型。我们可以通过arena 提交bert模型的训练代码,非常方便地利用这项学术红利。\n", 9 | "\n", 10 | "在这个示例中,我们将演示:\n", 11 | "* 利用Arena提交Bert的pretraining训练任务,并且查看训练任务状态和日志。\n", 12 | "\n", 13 | "> 前提:请先完成文档中的[共享存储配置](../docs/setup/SETUP_NAS.md),当前${HOME}就是其中`training-data`的数据卷对应目录。" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "1.下载Bert样例源代码到${HOME}/models目录" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 2, 26 | "metadata": { 27 | "scrolled": true 28 | }, 29 | "outputs": [ 30 | { 31 | "name": "stdout", 32 | "output_type": "stream", 33 | "text": [ 34 | "Cloning into '/root/models/bert'...\n", 35 | "remote: Enumerating objects: 317, done.\u001b[K\n", 36 | "remote: Total 317 (delta 0), reused 0 (delta 0), pack-reused 317\u001b[K\n", 37 | "Receiving objects: 100% (317/317), 254.03 KiB | 
149.00 KiB/s, done.\n", 38 | "Resolving deltas: 100% (178/178), done.\n", 39 | "Checking connectivity... done.\n" 40 | ] 41 | } 42 | ], 43 | "source": [ 44 | "! git clone \"https://github.com/google-research/bert.git\" \"${HOME}/models/bert\"" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "2.下载pretraining 任务所需要的语料数据" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 5, 57 | "metadata": {}, 58 | "outputs": [ 59 | { 60 | "name": "stdout", 61 | "output_type": "stream", 62 | "text": [ 63 | " % Total % Received % Xferd Average Speed Time Time Time Current\n", 64 | " Dload Upload Total Spent Left Speed\n", 65 | "100 388M 100 388M 0 0 12.6M 0 0:00:30 0:00:30 --:--:-- 12.2M\n", 66 | "Archive: uncased_L-12_H-768_A-12.zip\n", 67 | " creating: uncased_L-12_H-768_A-12/\n", 68 | " inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.meta \n", 69 | " inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001 \n", 70 | " inflating: uncased_L-12_H-768_A-12/vocab.txt \n", 71 | " inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.index \n", 72 | " inflating: uncased_L-12_H-768_A-12/bert_config.json \n" 73 | ] 74 | } 75 | ], 76 | "source": [ 77 | "! mkdir -p ${HOME}/dataset/bert\n", 78 | "! cd ${HOME}/dataset/bert && \\\n", 79 | " curl -O http://kubeflow.oss-cn-beijing.aliyuncs.com/uncased_L-12_H-768_A-12.zip && \\\n", 80 | " unzip uncased_L-12_H-768_A-12.zip" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "3.创建训练结果的输出目录 ${HOME}/output/bert" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 19, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "! mkdir -p ${HOME}/output/bert" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "4.查看目录结构。\n", 104 | "* `dataset/bert` 是数据目录,用于存储训练所需的数据。\n", 105 | "* `models/bert` 是模型代码目录,用于存储模型训练的代码\n", 106 | "* `output/bert` 是训练结果目录,存放训练结果模型和checkpoint。" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 21, 112 | "metadata": {}, 113 | "outputs": [ 114 | { 115 | "name": "stdout", 116 | "output_type": "stream", 117 | "text": [ 118 | "/root\r\n", 119 | "|-- dataset\r\n", 120 | "| `-- bert\r\n", 121 | "|-- models\r\n", 122 | "| |-- bert\r\n", 123 | "| `-- tensorflow-benchmarks\r\n", 124 | "`-- output\r\n", 125 | " `-- bert\r\n", 126 | "\r\n", 127 | "7 directories, 0 files\r\n" 128 | ] 129 | } 130 | ], 131 | "source": [ 132 | "! 
tree -I ai-starter -L 2 ${HOME}" 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "metadata": {}, 138 | "source": [ 139 | "5.检查可用GPU资源,训练开始前,我们要保证有足够的空闲GPU资源" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 22, 145 | "metadata": {}, 146 | "outputs": [ 147 | { 148 | "name": "stdout", 149 | "output_type": "stream", 150 | "text": [ 151 | "NAME IPADDRESS ROLE GPU(Total) GPU(Allocated)\r\n", 152 | "cn-zhangjiakou.i-8vb2knpxzlk449e7lugx 192.168.0.209 1 0\r\n", 153 | "cn-zhangjiakou.i-8vb2knpxzlk449e7lugy 192.168.0.210 1 1\r\n", 154 | "cn-zhangjiakou.i-8vb2knpxzlk449e7lugz 192.168.0.208 1 0\r\n", 155 | "cn-zhangjiakou.i-8vb7yuo831zjzijo9sdw 192.168.0.205 master 0 0\r\n", 156 | "cn-zhangjiakou.i-8vbezxqzueo7662i0dbq 192.168.0.204 master 0 0\r\n", 157 | "cn-zhangjiakou.i-8vbezxqzueo7681j4fav 192.168.0.206 master 0 0\r\n", 158 | "-----------------------------------------------------------------------------------------\r\n", 159 | "Allocated/Total GPUs In Cluster:\r\n", 160 | "1/3 (33%) \r\n" 161 | ] 162 | } 163 | ], 164 | "source": [ 165 | "! arena top node" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "6.通过Arena提交一个bert 创建pretrainingData 的训练任务, 用于创建Bert pretraining所需要的tfrecord文件。\n", 173 | "这里`training-data` 是在配置[共享存储时](../docs/setup/SETUP_NAS.md)创建的NAS存储声明. \n", 174 | "`--data=training-data:/training` 将其映射到训练任务的`/training`目录。\n", 175 | "* `/training`目录下的子目录`/training/models/bert` 是步骤1拷贝源代码的位置\n", 176 | "* `/training`目录下的子目录`/training/dataset/bert` 是步骤2下载数据的位置\n", 177 | "* `/training`目录下的子目录`/training/output` 就是步骤3创建的训练结果输出的位置。" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 9, 183 | "metadata": {}, 184 | "outputs": [ 185 | { 186 | "name": "stdout", 187 | "output_type": "stream", 188 | "text": [ 189 | "env: PRETRAIN_DATA_JOB_NAME=bert-create-pretrain-data\n", 190 | "configmap/bert-create-pretrain-data-tfjob created\n", 191 | "configmap/bert-create-pretrain-data-tfjob labeled\n", 192 | "service/bert-create-pretrain-data-tensorboard created\n", 193 | "deployment.extensions/bert-create-pretrain-data-tensorboard created\n", 194 | "tfjob.kubeflow.org/bert-create-pretrain-data created\n", 195 | "\u001b[36mINFO\u001b[0m[0004] The Job bert-create-pretrain-data has been submitted successfully \n", 196 | "\u001b[36mINFO\u001b[0m[0004] You can run `arena get bert-create-pretrain-data --type tfjob` to check the job status \n" 197 | ] 198 | } 199 | ], 200 | "source": [ 201 | "%env PRETRAIN_DATA_JOB_NAME=bert-create-pretrain-data\n", 202 | "!arena submit tf \\\n", 203 | " --name=$PRETRAIN_DATA_JOB_NAME \\\n", 204 | " --workers=1 \\\n", 205 | " --gpus=1 \\\n", 206 | " --data=training-data:/training \\\n", 207 | " --image=tensorflow/tensorflow:1.11.0-gpu-py3 \\\n", 208 | " \"python3 /training/models/bert/create_pretraining_data.py \\\n", 209 | " --input_file=/training/models/bert/sample_text.txt \\\n", 210 | " --output_file=/training/output/bert/tf_examples.tfrecord \\\n", 211 | " --vocab_file=/training/dataset/bert/uncased_L-12_H-768_A-12/vocab.txt \\\n", 212 | " --do_lower_case=True \\\n", 213 | " --max_seq_length=256 \\\n", 214 | " --max_predictions_per_seq=39 \\\n", 215 | " --masked_lm_prob=0.15 \\\n", 216 | " --random_seed=12345 \\\n", 217 | " --dupe_factor=5; python3 -c \\\"import os;fd=os.open('/training/output/bert/tf_examples.tfrecord',os.O_NONBLOCK);os.fsync(fd)\\\"\"\n" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": {}, 223 | "source": [ 224 | 
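步骤6提交命令末尾附加的 `python3 -c "import os;fd=os.open('/training/output/bert/tf_examples.tfrecord',os.O_NONBLOCK);os.fsync(fd)"`,作用是在任务结束前把刚生成的 tfrecord 强制刷写到共享存储,避免后续训练任务读到不完整的文件。单独拆开来看,这段内联脚本大致等价于下面的示意代码:

```python
# 示意:将写入共享存储的 tfrecord 文件强制刷盘
# 与步骤6提交命令末尾的内联 python3 -c 脚本等价(假设文件已由前一步生成)
import os

fd = os.open("/training/output/bert/tf_examples.tfrecord", os.O_NONBLOCK)
os.fsync(fd)   # 确保数据落盘,对随后挂载同一共享存储的任务可见
os.close(fd)   # 原内联脚本未显式 close,进程退出时会自动释放文件描述符
```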
"7.检查Pretraining data任务的状态,这个步骤不涉及大量计算,任务很快就可以完成。" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": 10, 230 | "metadata": { 231 | "scrolled": true 232 | }, 233 | "outputs": [ 234 | { 235 | "name": "stdout", 236 | "output_type": "stream", 237 | "text": [ 238 | "STATUS: SUCCEEDED\r\n", 239 | "NAMESPACE: default\r\n", 240 | "TRAINING DURATION: 3s\r\n", 241 | "\r\n", 242 | "NAME STATUS TRAINER AGE INSTANCE NODE\r\n", 243 | "bert-create-pretrain-data SUCCEEDED TFJOB 48s bert-create-pretrain-data-chief-0 N/A\r\n", 244 | "\r\n", 245 | "Your tensorboard will be available on:\r\n", 246 | "192.168.0.206:31785 \r\n", 247 | "\r\n", 248 | "Events: \r\n", 249 | "No events for pending pod\r\n" 250 | ] 251 | } 252 | ], 253 | "source": [ 254 | "! arena get $PRETRAIN_DATA_JOB_NAME -e" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "8.查看创建Pretraining data的任务日志\n" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 12, 267 | "metadata": { 268 | "scrolled": true 269 | }, 270 | "outputs": [ 271 | { 272 | "name": "stdout", 273 | "output_type": "stream", 274 | "text": [ 275 | "2019-03-01T03:19:57.848619386Z INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0\r\n", 276 | "2019-03-01T03:19:57.848753719Z INFO:tensorflow:next_sentence_labels: 0\r\n", 277 | "2019-03-01T03:19:57.849205981Z INFO:tensorflow:*** Example ***\r\n", 278 | "2019-03-01T03:19:57.849406004Z INFO:tensorflow:tokens: [CLS] like most of [MASK] fellow gold - seekers [MASK] cass was super ##sti [MASK] . [SEP] basket on phil ##am ##mon ' s head , and tr ##otted [MASK] up a neighbouring street . phil ##am ##mon followed , half contempt ##uous , half wondering at [MASK] this [MASK] might be [MASK] which could feed the self - con ##ce ##it of anything so ab ##ject as his ragged little api ##sh guide ; but the novel roar and w ##hir ##l of the street [MASK] the perpetual [MASK] of busy faces [MASK] the line of cu ##rri ##cles , pal ##an ##quin ##s , [MASK] ass ##es [MASK] camel ##s , elephants , [MASK] met and passed him , and squeezed him up steps and into doorway ##s , as they threaded their way through the great dwight - gate into the ample street beyond , drove everything from his mind but wondering curiosity , [MASK] a vague , helpless dread of [MASK] great living wilderness , more terrible than any [MASK] wilderness of sand which he had left behind . [MASK] he longed [MASK] the rep ##ose , the silence of [MASK] laura [MASK] - for faces which [MASK] him and smiled upon him [MASK] but it was too late to turn back now . 
his [MASK] held on for more than [MASK] mile [MASK] the great main street , crossed in [MASK] centre [MASK] [MASK] city , at right angles , [MASK] one equally magnificent [MASK] at [unused190] end [MASK] [MASK] [SEP]\r\n", 279 | "2019-03-01T03:19:57.849551229Z INFO:tensorflow:input_ids: 101 2066 2087 1997 103 3507 2751 1011 24071 103 16220 2001 3565 16643 103 1012 102 10810 2006 6316 3286 8202 1005 1055 2132 1010 1998 19817 26174 103 2039 1037 9632 2395 1012 6316 3286 8202 2628 1010 2431 17152 8918 1010 2431 6603 2012 103 2023 103 2453 2022 103 2029 2071 5438 1996 2969 1011 9530 3401 4183 1997 2505 2061 11113 20614 2004 2010 14202 2210 17928 4095 5009 1025 2021 1996 3117 11950 1998 1059 11961 2140 1997 1996 2395 103 1996 18870 103 1997 5697 5344 103 1996 2240 1997 12731 18752 18954 1010 14412 2319 12519 2015 1010 103 4632 2229 103 19130 2015 1010 16825 1010 103 2777 1998 2979 2032 1010 1998 7757 2032 2039 4084 1998 2046 7086 2015 1010 2004 2027 26583 2037 2126 2083 1996 2307 14304 1011 4796 2046 1996 20851 2395 3458 1010 5225 2673 2013 2010 2568 2021 6603 10628 1010 103 1037 13727 1010 13346 14436 1997 103 2307 2542 9917 1010 2062 6659 2084 2151 103 9917 1997 5472 2029 2002 2018 2187 2369 1012 103 2002 23349 103 1996 16360 9232 1010 1996 4223 1997 103 6874 103 1011 2005 5344 2029 103 2032 1998 3281 2588 2032 103 2021 2009 2001 2205 2397 2000 2735 2067 2085 1012 2010 103 2218 2006 2005 2062 2084 103 3542 103 1996 2307 2364 2395 1010 4625 1999 103 2803 103 103 2103 1010 2012 2157 12113 1010 103 2028 8053 12047 103 2012 195 2203 103 103 102\r\n", 280 | "2019-03-01T03:19:57.849677647Z INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\r\n", 281 | "2019-03-01T03:19:57.84980452Z INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\r\n", 282 | "2019-03-01T03:19:57.849810975Z INFO:tensorflow:masked_lm_positions: 4 9 14 29 47 49 52 86 89 93 96 106 109 115 117 139 150 157 164 173 183 186 194 196 201 207 219 225 227 235 237 238 245 247 249 251 253 254 0\r\n", 283 | "2019-03-01T03:19:57.849951006Z INFO:tensorflow:masked_lm_ids: 2010 1010 20771 2125 2054 4695 1010 1010 5460 1010 1997 14887 1010 2029 1998 4231 2013 1998 2008 2757 2525 2005 1996 1011 2354 1025 5009 1037 2039 1996 1997 1996 2011 8053 1010 2169 1997 2029 0\r\n", 284 | "2019-03-01T03:19:57.849956674Z INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0\r\n", 285 | "2019-03-01T03:19:57.850092816Z INFO:tensorflow:next_sentence_labels: 1\r\n", 286 | 
"2019-03-01T03:19:57.85060904Z INFO:tensorflow:*** Example ***\r\n", 287 | "2019-03-01T03:19:57.850752801Z INFO:tensorflow:tokens: [CLS] but the novel [MASK] and w [MASK] ##l [MASK] the street , the perpetual [MASK] of busy faces [MASK] the line of cu ##rri ##cles , pal ##an ##quin ##s [MASK] [MASK] ass ##es , camel ##s , elephants , [MASK] met and passed him , and squeezed him up steps and into doorway ##s [MASK] as they [MASK] their way through the great moon - gate into the ample street beyond , drove everything from his mind but wondering curiosity [MASK] and a vague , helpless dread of that great living wilderness , more terrible than any dead wilderness of sand which he had left behind . already he longed for the rep ##ose [MASK] the silence of the laura - - for faces which knew him and smiled upon him ; but it [unused946] [MASK] late to turn back now . his guide held on for blown than a mile up [MASK] [MASK] main street , crossed in the [MASK] of [MASK] [MASK] , at right angles , by one equally magnificent , [MASK] each end of which , miles away , appeared , dim [MASK] distant [MASK] the heads of the living stream [MASK] passengers [MASK] [MASK] yellow sand - hills of the desert [MASK] [SEP] looking [MASK] it more at ##ten ##tively , he saw [MASK] it bore the inscription , \" may to cass [MASK] \" like abel of his fellow gold - [MASK] , cass was super ##sti ##tious dotted [SEP]\r\n", 288 | "2019-03-01T03:19:57.85089608Z INFO:tensorflow:input_ids: 101 2021 1996 3117 103 1998 1059 103 2140 103 1996 2395 1010 1996 18870 103 1997 5697 5344 103 1996 2240 1997 12731 18752 18954 1010 14412 2319 12519 2015 103 103 4632 2229 1010 19130 2015 1010 16825 1010 103 2777 1998 2979 2032 1010 1998 7757 2032 2039 4084 1998 2046 7086 2015 103 2004 2027 103 2037 2126 2083 1996 2307 4231 1011 4796 2046 1996 20851 2395 3458 1010 5225 2673 2013 2010 2568 2021 6603 10628 103 1998 1037 13727 1010 13346 14436 1997 2008 2307 2542 9917 1010 2062 6659 2084 2151 2757 9917 1997 5472 2029 2002 2018 2187 2369 1012 2525 2002 23349 2005 1996 16360 9232 103 1996 4223 1997 1996 6874 1011 1011 2005 5344 2029 2354 2032 1998 3281 2588 2032 1025 2021 2009 951 103 2397 2000 2735 2067 2085 1012 2010 5009 2218 2006 2005 10676 2084 1037 3542 2039 103 103 2364 2395 1010 4625 1999 1996 103 1997 103 103 1010 2012 2157 12113 1010 2011 2028 8053 12047 1010 103 2169 2203 1997 2029 1010 2661 2185 1010 2596 1010 11737 103 6802 103 1996 4641 1997 1996 2542 5460 103 5467 103 103 3756 5472 1011 4564 1997 1996 5532 103 102 2559 103 2009 2062 2012 6528 25499 1010 2002 2387 103 2009 8501 1996 9315 1010 1000 2089 2000 16220 103 1000 2066 16768 1997 2010 3507 2751 1011 103 1010 16220 2001 3565 16643 20771 20384 102 0 0 0 0 0 0 0 0\r\n", 289 | "2019-03-01T03:19:57.85102435Z INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0\r\n", 290 | "2019-03-01T03:19:57.851149611Z INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0\r\n", 291 | "2019-03-01T03:19:57.851160529Z INFO:tensorflow:masked_lm_positions: 4 7 9 15 19 31 32 41 56 59 82 91 116 118 133 135 136 137 149 154 155 162 164 165 176 188 190 197 199 200 208 211 220 230 233 239 246 0 0\r\n", 292 | "2019-03-01T03:19:57.85132258Z INFO:tensorflow:masked_lm_ids: 11950 11961 1997 5460 1010 1010 14887 2029 1010 26583 1010 2307 1010 4223 1025 2009 2001 2205 2062 1996 2307 2803 1996 2103 2012 1998 2058 1997 1010 1996 1025 2012 2008 1012 2087 24071 1012 0 0\r\n", 293 | "2019-03-01T03:19:57.851330847Z INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0\r\n", 294 | "2019-03-01T03:19:57.851464762Z INFO:tensorflow:next_sentence_labels: 1\r\n", 295 | "2019-03-01T03:19:57.851968383Z INFO:tensorflow:*** Example ***\r\n", 296 | "2019-03-01T03:19:57.852112074Z INFO:tensorflow:tokens: [CLS] ? [MASK] and berlin little man jumped [MASK] , put his basket on phil ##am ##mon ' s head , and tr ##otted off up a neighbouring [MASK] . [MASK] ##am ##mon followed , half contempt ##uous , half [MASK] at what this philosophy might [MASK] , which [MASK] [MASK] the [MASK] - con ##ce ##it of anything so ab ##ject [MASK] his ragged little api ##sh guide ; but the novel roar [MASK] w ##hir ##l propose the street , the perpetual stream of busy faces , the line of cu ##rri ##cles , pal ##an ##quin ##s [MASK] laden ass ##es , camel ##s , elephants , which [MASK] and passed him , and bahn him up steps and into doorway ##s [MASK] as they threaded [MASK] [MASK] through the great moon - [MASK] into the ample street beyond , drove everything [MASK] his mind but wondering curiosity [MASK] and a vague , helpless dread of that great [MASK] wilderness , [MASK] terrible than any [MASK] wilderness of sand which he had left behind [MASK] already he longed for the [MASK] ##ose , the [MASK] of [SEP] [MASK] guide held on [MASK] more than a mile up the great main street , crossed in the centre of [MASK] city [MASK] at right angles , by [MASK] equally magnificent , at each end of which , miles fleets , appeared , dim and distant over the heads of the living stream of passengers , the yellow sand - hills of the desert ; [SEP]\r\n", 297 | "2019-03-01T03:19:57.852264474Z INFO:tensorflow:input_ids: 101 1029 103 1998 4068 2210 2158 5598 103 1010 2404 2010 10810 2006 6316 3286 8202 1005 1055 2132 1010 1998 19817 26174 2125 2039 1037 9632 103 1012 103 3286 8202 2628 1010 2431 17152 8918 1010 2431 103 2012 2054 2023 4695 2453 103 1010 2029 103 103 1996 103 1011 9530 3401 4183 1997 2505 2061 11113 20614 103 2010 14202 2210 17928 4095 5009 1025 2021 1996 3117 11950 103 1059 11961 2140 16599 1996 2395 1010 1996 18870 5460 1997 5697 5344 1010 1996 2240 1997 12731 18752 18954 1010 14412 2319 12519 2015 103 14887 4632 2229 1010 19130 2015 1010 16825 1010 2029 103 1998 2979 2032 1010 1998 17392 2032 2039 4084 1998 2046 7086 2015 103 2004 2027 26583 103 103 2083 1996 2307 4231 1011 103 2046 1996 20851 2395 3458 1010 5225 2673 103 2010 2568 2021 6603 10628 103 1998 1037 13727 1010 13346 14436 1997 2008 2307 103 9917 1010 103 6659 2084 2151 103 9917 1997 5472 
2029 2002 2018 2187 2369 103 2525 2002 23349 2005 1996 103 9232 1010 1996 103 1997 102 103 5009 2218 2006 103 2062 2084 1037 3542 2039 1996 2307 2364 2395 1010 4625 1999 1996 2803 1997 103 2103 103 2012 2157 12113 1010 2011 103 8053 12047 1010 2012 2169 2203 1997 2029 1010 2661 25515 1010 2596 1010 11737 1998 6802 2058 1996 4641 1997 1996 2542 5460 1997 5467 1010 1996 3756 5472 1011 4564 1997 1996 5532 1025 102\r\n", 298 | "2019-03-01T03:19:57.852419325Z INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\r\n", 299 | "2019-03-01T03:19:57.85254519Z INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\r\n", 300 | "2019-03-01T03:19:57.852551662Z INFO:tensorflow:masked_lm_positions: 2 4 7 8 28 30 40 46 49 50 52 56 62 74 78 100 111 117 125 127 129 130 136 145 151 161 164 168 174 177 183 187 190 194 210 212 218 229 0\r\n", 301 | "2019-03-01T03:19:57.852693547Z INFO:tensorflow:masked_lm_ids: 1005 1996 5598 2039 2395 6316 6603 2022 2071 5438 2969 4183 2004 1998 1997 1010 2777 7757 1010 2027 2037 2126 4796 2013 1010 2542 2062 2757 2018 1012 16360 4223 2010 2005 1996 1010 2028 2185 0\r\n", 302 | "2019-03-01T03:19:57.852699495Z INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0\r\n", 303 | "2019-03-01T03:19:57.852834135Z INFO:tensorflow:next_sentence_labels: 0\r\n", 304 | "2019-03-01T03:19:57.868337702Z INFO:tensorflow:Wrote 44 total instances\r\n" 305 | ] 306 | } 307 | ], 308 | "source": [ 309 | "! arena logs --tail=30 $PRETRAIN_DATA_JOB_NAME\n" 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "metadata": {}, 315 | "source": [ 316 | "9.创建bert pretrain 的训练任务, " 317 | ] 318 | }, 319 | { 320 | "cell_type": "code", 321 | "execution_count": 15, 322 | "metadata": {}, 323 | "outputs": [ 324 | { 325 | "name": "stdout", 326 | "output_type": "stream", 327 | "text": [ 328 | "env: PRETRAIN_JOB_NAME=bert-pretrain-data\n", 329 | "configmap/bert-pretrain-data-tfjob created\n", 330 | "configmap/bert-pretrain-data-tfjob labeled\n", 331 | "tfjob.kubeflow.org/bert-pretrain-data created\n", 332 | "\u001b[36mINFO\u001b[0m[0003] The Job bert-pretrain-data has been submitted successfully \n", 333 | "\u001b[36mINFO\u001b[0m[0003] You can run `arena get bert-pretrain-data --type tfjob` to check the job status \n" 334 | ] 335 | } 336 | ], 337 | "source": [ 338 | "%env PRETRAIN_JOB_NAME=bert-pretrain-data\n", 339 | "! 
arena submit tf --name=$PRETRAIN_JOB_NAME \\\n", 340 | "             --gpus=1 \\\n", 341 | "             --workers=1 \\\n", 342 | "             --data=training-data:/training \\\n", 343 | "             --image=tensorflow/tensorflow:1.11.0-gpu-py3 \\\n", 344 | "             \"python /training/models/bert/run_pretraining.py \\\n", 345 | "             --input_file=/training/output/bert/tf_examples.tfrecord \\\n", 346 | "             --output_dir=/training/output/bert/pretraining_output \\\n", 347 | "             --do_train=True \\\n", 348 | "             --do_eval=True \\\n", 349 | "             --bert_config_file=/training/dataset/bert/uncased_L-12_H-768_A-12/bert_config.json \\\n", 350 | "             --train_batch_size=16 \\\n", 351 | "             --max_seq_length=256 \\\n", 352 | "             --max_predictions_per_seq=39 \\\n", 353 | "             --num_train_steps=8000 \\\n", 354 | "             --num_warmup_steps=10 \\\n", 355 | "             --learning_rate=2e-5 \\\n", 356 | "             --save_checkpoints_steps=4000\"" 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "10.查看实时训练的GPU使用情况(下方还插入了一个周期性采样GPU使用情况的示意脚本)" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": 16, 369 | "metadata": {}, 370 | "outputs": [ 371 | { 372 | "name": "stdout", 373 | "output_type": "stream", 374 | "text": [ 375 | "INSTANCE NAME GPU(Device Index) GPU(Duty Cycle) GPU(Memory MiB) STATUS NODE\r\n", 376 | "bert-pretrain-data-chief-0 0 100% 15519.0MiB / 16276.2MiB Running 192.168.0.210\r\n" 377 | ] 378 | } 379 | ], 380 | "source": [ 381 | "! arena top job $PRETRAIN_JOB_NAME" 382 | ] 383 | }, 384 | 
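如果希望持续观察训练实例的 GPU 占用,也可以用脚本周期性采样 `arena top job` 的输出。下面是一个示意写法(假设 `arena` 已安装且在 PATH 中):

```python
# 示意:周期性采样 `arena top job` 的输出,观察训练实例的 GPU 占用
# 假设:arena 在 PATH 中
import subprocess
import time

def sample_gpu(job_name, times=3, interval=30):
    for _ in range(times):
        out = subprocess.check_output(["arena", "top", "job", job_name],
                                      universal_newlines=True)
        # 末行形如:bert-pretrain-data-chief-0  0  100%  15519.0MiB / 16276.2MiB  Running ...
        print(out.strip().splitlines()[-1])
        time.sleep(interval)

# 用法示例:sample_gpu("bert-pretrain-data")
```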
{ 385 | "cell_type": "markdown", 386 | "metadata": {}, 387 | "source": [ 388 | "11.查看pretraining 的任务状态和实例情况。本示例中,我们启动了一个训练实例。" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": 17, 394 | "metadata": { 395 | "scrolled": false 396 | }, 397 | "outputs": [ 398 | { 399 | "name": "stdout", 400 | "output_type": "stream", 401 | "text": [ 402 | "STATUS: RUNNING\r\n", 403 | "NAMESPACE: default\r\n", 404 | "TRAINING DURATION: 3m\r\n", 405 | "\r\n", 406 | "NAME STATUS TRAINER AGE INSTANCE NODE\r\n", 407 | "bert-pretrain-data RUNNING TFJOB 3m bert-pretrain-data-chief-0 192.168.0.210\r\n" 408 | ] 409 | } 410 | ], 411 | "source": [ 412 | "! arena get $PRETRAIN_JOB_NAME" 413 | ] 414 | }, 415 | { 416 | "cell_type": "markdown", 417 | "metadata": {}, 418 | "source": [ 419 | "12.查看pretraining 的训练日志。一段时间后,会出现`tensorflow:examples/sec`相关的日志,代表训练已经开始,并给出训练速度。\n", 420 | "由于bert pretraining 时间非常长,如果我们想要实时查看日志,可以增加`-f`参数。" 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": 18, 426 | "metadata": {}, 427 | "outputs": [ 428 | { 429 | "name": "stdout", 430 | "output_type": "stream", 431 | "text": [ 432 | "2019-03-01T03:24:31.335467491Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/query/kernel:0, shape = (768, 768)\n", 433 | "2019-03-01T03:24:31.335472861Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/query/bias:0, shape = (768,)\n", 434 | "2019-03-01T03:24:31.335629243Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/key/kernel:0, shape = (768, 768)\n", 435 | "2019-03-01T03:24:31.335635168Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/key/bias:0, shape = (768,)\n", 436 | "2019-03-01T03:24:31.335790291Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/value/kernel:0, shape = (768, 768)\n", 437 | "2019-03-01T03:24:31.335795756Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/value/bias:0, shape = (768,)\n", 438 | "2019-03-01T03:24:31.335952831Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/output/dense/kernel:0, shape = (768, 768)\n", 439 | "2019-03-01T03:24:31.335958202Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/output/dense/bias:0, shape = (768,)\n", 440 | "2019-03-01T03:24:31.33609783Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/output/LayerNorm/beta:0, shape = (768,)\n", 441 | "2019-03-01T03:24:31.33610331Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/output/LayerNorm/gamma:0, shape = (768,)\n", 442 | "2019-03-01T03:24:31.336208538Z INFO:tensorflow:  name = bert/encoder/layer_11/intermediate/dense/kernel:0, shape = (768, 3072)\n", 443 | "2019-03-01T03:24:31.336345617Z INFO:tensorflow:  name = bert/encoder/layer_11/intermediate/dense/bias:0, shape = (3072,)\n", 444 | "2019-03-01T03:24:31.336351398Z INFO:tensorflow:  name = bert/encoder/layer_11/output/dense/kernel:0, shape = (3072, 768)\n", 445 | "2019-03-01T03:24:31.336498005Z INFO:tensorflow:  name = bert/encoder/layer_11/output/dense/bias:0, shape = (768,)\n", 446 | "2019-03-01T03:24:31.336503414Z INFO:tensorflow:  name = bert/encoder/layer_11/output/LayerNorm/beta:0, shape = (768,)\n", 447 | "2019-03-01T03:24:31.336645136Z INFO:tensorflow:  name = bert/encoder/layer_11/output/LayerNorm/gamma:0, shape = (768,)\n", 448 | "2019-03-01T03:24:31.336650519Z INFO:tensorflow:  name = bert/pooler/dense/kernel:0, shape = (768, 768)\n", 449 | "2019-03-01T03:24:31.33680328Z INFO:tensorflow:  name = bert/pooler/dense/bias:0, shape = (768,)\n", 450 | "2019-03-01T03:24:31.336808531Z INFO:tensorflow:  name = cls/predictions/transform/dense/kernel:0, shape = (768, 768)\n", 451 | "2019-03-01T03:24:31.336957828Z INFO:tensorflow:  name = cls/predictions/transform/dense/bias:0, shape = (768,)\n", 452 | "2019-03-01T03:24:31.336965945Z INFO:tensorflow:  name = cls/predictions/transform/LayerNorm/beta:0, shape = (768,)\n", 453 | "2019-03-01T03:24:31.337111549Z INFO:tensorflow:  name = cls/predictions/transform/LayerNorm/gamma:0, shape = (768,)\n", 454 | "2019-03-01T03:24:31.337331045Z INFO:tensorflow:  name = cls/predictions/output_bias:0, shape = (30522,)\n", 455 | "2019-03-01T03:24:31.337348181Z INFO:tensorflow:  name = cls/seq_relationship/output_weights:0, shape 
= (2, 768)\n", 456 | "2019-03-01T03:24:31.337352586Z INFO:tensorflow: name = cls/seq_relationship/output_bias:0, shape = (2,)\n", 457 | "2019-03-01T03:24:40.280462098Z INFO:tensorflow:Done calling model_fn.\n", 458 | "2019-03-01T03:24:40.282210657Z INFO:tensorflow:Create CheckpointSaverHook.\n", 459 | "2019-03-01T03:24:44.183074716Z INFO:tensorflow:Graph was finalized.\n", 460 | "2019-03-01T03:24:44.183316826Z 2019-03-01 03:24:44.183119: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA\n", 461 | "2019-03-01T03:24:44.351791898Z 2019-03-01 03:24:44.351531: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", 462 | "2019-03-01T03:24:44.352922705Z 2019-03-01 03:24:44.352755: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties: \n", 463 | "2019-03-01T03:24:44.352939062Z name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285\n", 464 | "2019-03-01T03:24:44.352942657Z pciBusID: 0000:00:09.0\n", 465 | "2019-03-01T03:24:44.352945648Z totalMemory: 15.89GiB freeMemory: 15.60GiB\n", 466 | "2019-03-01T03:24:44.352948528Z 2019-03-01 03:24:44.352791: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0\n", 467 | "2019-03-01T03:24:44.679981448Z 2019-03-01 03:24:44.679730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:\n", 468 | "2019-03-01T03:24:44.680018686Z 2019-03-01 03:24:44.679793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0 \n", 469 | "2019-03-01T03:24:44.680022093Z 2019-03-01 03:24:44.679801: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N \n", 470 | "2019-03-01T03:24:44.680381343Z 2019-03-01 03:24:44.680250: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15117 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:09.0, compute capability: 6.0)\n", 471 | "2019-03-01T03:24:49.827980455Z INFO:tensorflow:Running local_init_op.\n", 472 | "2019-03-01T03:24:49.951970475Z INFO:tensorflow:Done running local_init_op.\n", 473 | "2019-03-01T03:24:57.201907836Z INFO:tensorflow:Saving checkpoints for 0 into /training/output/bert/pretraining_output/model.ckpt.\n", 474 | "2019-03-01T03:26:08.446110241Z INFO:tensorflow:global_step/sec: 1.72217\n", 475 | "2019-03-01T03:26:08.44614756Z INFO:tensorflow:examples/sec: 27.5547\n", 476 | "2019-03-01T03:27:01.490555755Z INFO:tensorflow:global_step/sec: 1.8852\n", 477 | "2019-03-01T03:27:01.490728971Z INFO:tensorflow:examples/sec: 30.1632\n", 478 | "2019-03-01T03:27:54.542023408Z INFO:tensorflow:global_step/sec: 1.88496\n", 479 | "2019-03-01T03:27:54.542154644Z INFO:tensorflow:examples/sec: 30.1594\n", 480 | "2019-03-01T03:28:47.586233208Z INFO:tensorflow:global_step/sec: 1.88522\n", 481 | "2019-03-01T03:28:47.586289848Z INFO:tensorflow:examples/sec: 30.1635\n" 482 | ] 483 | } 484 | ], 485 | "source": [ 486 | "! 
arena logs --tail=50 $PRETRAIN_JOB_NAME" 487 | ] 488 | }, 489 | { 490 | "cell_type": "markdown", 491 | "metadata": {}, 492 | "source": [ 493 | "13.训练完成后,我们可以删除已经完成的任务,清理环境。" 494 | ] 495 | }, 496 | { 497 | "cell_type": "code", 498 | "execution_count": 13, 499 | "metadata": {}, 500 | "outputs": [ 501 | { 502 | "name": "stdout", 503 | "output_type": "stream", 504 | "text": [ 505 | "service \"tf-distributed-mnist-tensorboard\" deleted\n", 506 | "deployment.extensions \"tf-distributed-mnist-tensorboard\" deleted\n", 507 | "tfjob.kubeflow.org \"tf-distributed-mnist\" deleted\n", 508 | "configmap \"tf-distributed-mnist-tfjob\" deleted\n", 509 | "\u001b[36mINFO\u001b[0m[0004] The Job tf-distributed-mnist has been deleted successfully \n" 510 | ] 511 | } 512 | ], 513 | "source": [ 514 | "! arena delete $PRETRAIN_JOB_NAME\n", 515 | "! arena delete $PRETRAIN_DATA_JOB_NAME" 516 | ] 517 | }, 518 | { 519 | "cell_type": "markdown", 520 | "metadata": {}, 521 | "source": [ 522 | "恭喜!您已经使用 `arena` 成功运行了训练作业。\n", 523 | "\n", 524 | "总结一下,希望您通过本次演示了解如何提交Bert的pretraining任务,主要包含以下几个步骤:\n", 525 | "1. 准备训练代码和数据,并将其放入数据卷中\n", 526 | "2. 提交pretraining 所需的数据处理任务\n", 527 | "3. 提交pretraining 的训练任务" 528 | ] 529 | } 530 | ], 531 | "metadata": { 532 | "kernelspec": { 533 | "display_name": "Python 3", 534 | "language": "python", 535 | "name": "python3" 536 | }, 537 | "language_info": { 538 | "codemirror_mode": { 539 | "name": "ipython", 540 | "version": 3 541 | }, 542 | "file_extension": ".py", 543 | "mimetype": "text/x-python", 544 | "name": "python", 545 | "nbconvert_exporter": "python", 546 | "pygments_lexer": "ipython3", 547 | "version": "3.5.2" 548 | } 549 | }, 550 | "nbformat": 4, 551 | "nbformat_minor": 2 552 | } 553 | --------------------------------------------------------------------------------