├── README.md
├── ai-learning
    ├── base
    │   ├── base.md
    │   ├── gpu虚拟化.md
    │   ├── 构建大规模k8s集群方案.md
    │   └── 构建高性能gpu网络方案.md
    ├── images
    │   ├── volcano-gang.png
    │   ├── volcano-scheduler-actions-start.png
    │   ├── volcano-scheduler-actions.png
    │   ├── volcano-scheduler-gang01.png
    │   ├── volcano-scheduler-registerfn.png
    │   ├── volcano-scheduler-workflow.png
    │   └── volcano-workflow.png
    └── volcano
    │   ├── 01-volcano基础.md
    │   ├── 02-volcano调度器.md
    │   ├── 03-volcano控制器.md
    │   ├── 04-volcano-admission.md
    │   └── volcano gpu设计思考.md
├── client-go
    ├── informer.md
    ├── study-client.md
    └── workqueue.md
├── controller
    ├── daemonset-controller.md
    ├── deployment-controller.md
    ├── endpoint-controller.md
    ├── gc-controller.md
    ├── images
    │   ├── client-go.png
    │   ├── instantiation-daemonset-controller.png
    │   ├── namespace-controller-delete.png
    │   ├── run-daemonset-controoller.png
    │   └── run-namespace-controller.png
    ├── namespace-controller.md
    ├── replicaset-controller.md
    └── statefulset-controller.md
├── kubelet
    ├── cadvisor理解.md
    ├── images
    │   ├── kubelet_ut.png
    │   └── volumemanager.png
    ├── kubelet_eviction驱逐模块.md
    ├── kubelet_nodestatus模块.md
    ├── kubelet_pleg模块.md
    ├── kubelet_podmanager模块.md
    ├── kubelet_volumemanager模块.md
    ├── kubelet增删改Pod流程.md
    ├── kubelet架构.md
    ├── kubelet状态管理statusManager.md
    ├── pluginmanager.md
    ├── volume_plugin.md
    └── 测试kubelet.md
├── kubernetes-ingress-controller
    ├── images
    │   └── images01.png
    └── ssl-terminationandssl-passthrough.md
├── scheduler
    ├── images
    │   ├── admission-controller-phases.png
    │   ├── aquire-lock.png
    │   ├── ep-leader-algorithm.png
    │   ├── informer.png
    │   ├── schedule-extend.png
    │   ├── scheduler-flow.png
    │   ├── scheduler-leader-elec.png
    │   ├── scheduler-log.png
    │   ├── scheduler-plugins.png
    │   └── scheduler-score.png
    ├── scheduler-leader-election.md
    └── scheduler阅读理解上.md
├── 应用篇
    ├── K8S与AD集成.md
    ├── K8S证书管理.md
    ├── MinIO 对象存储.md
    ├── SSL Offloading
    │   └── Termination vs SSL Passthrough vs SSL Bridging.md
    ├── east-west-traffic-encryption.md
    ├── images
    │   ├── external-dns-A-record.png
    │   ├── secret-cert.png
    │   └── 用户排查图.png
    ├── k8s-ldap-auth.md
    ├── k8s安全之opa.md
    ├── openebs-quota-issue.md
    ├── pvc延迟绑定.md
    ├── share AD base access control for k8s.pdf
    ├── share-volume-zone.pdf
    └── 日志篇.md
└── 开放接口
    ├── CRI.md
    ├── CSI.md
    ├── images
        ├── cni-plugin.png
        ├── cri.png
        ├── describe-pod.png
        ├── node-plugin.png
        └── start-kubelet-command.png
    └── k8s-cni.md


/README.md:
--------------------------------------------------------------------------------
 1 | ## Table of Contents
 2 | 
 3 | Chapter 1. Understanding client-go
 4 | 
 5 |   \- 1.1. [client](/client-go/study-client.md)
 6 | 
 7 |   \- 1.2. [informer](/client-go/informer.md)
 8 | 
 9 |   \- 1.3. [workqueue](/client-go/workqueue.md)
10 | 
11 | Chapter 2. Controller
12 | 
13 |   \- 2.1. [daemonset-controller](/controller/daemonset-controller.md)
14 | 
15 |   \- 2.2. [deployment-controller](/controller/deployment-controller.md)
16 | 
17 |   \- 2.3. [endpoint-controller](/controller/endpoint-controller.md)
18 | 
19 |   \- 2.4. [gc-controller](/controller/gc-controller.md)
20 | 
21 |   \- 2.5. [namespace-controller](/controller/namespace-controller.md)
22 | 
23 |   \- 2.6. [replicaset-controller](/controller/replicaset-controller.md)
24 | 
25 |   \- 2.7. [statefulset-controller](/controller/statefulset-controller.md)
26 | 
27 | Chapter 3. Scheduler
28 | 
29 |   \- 3.1. [scheduler overall](/scheduler/scheduler阅读理解上.md)
30 | 
31 | Chapter 4. Kubelet
32 | 
33 |   \- 4.1. [Kubelet architecture](/kubelet/kubelet架构.md)
34 | 
35 |   \- 4.2. [Kubelet Pod create/update/delete flow](/kubelet/kubelet增删改Pod流程.md)
36 | 
37 |   \- 4.3. [Kubelet PLEG module](/kubelet/kubelet_pleg模块.md)
38 | 
39 |   \- 4.4. [Kubelet nodestatus module (node status updates)](/kubelet/kubelet_nodestatus模块.md)
40 | 
41 |   \- 4.5. [Kubelet volumemanager module](/kubelet/kubelet_volumemanager模块.md)
42 | 
43 |   \- 4.6. [volume_plugin](/kubelet/volume_plugin.md)
44 | 
45 |   \- 4.7. [Kubelet PodManager module](/kubelet/kubelet_podmanager模块.md)
46 | 
47 | TODO: PodManager module
48 | 
49 |   \- 4.8. Needs update: [Kubelet eviction module](/kubelet/kubelet_eviction驱逐模块.md)
50 | 
51 |   \- 4.9. [Kubelet statusManager module (status management)](/kubelet/kubelet状态管理statusManager.md)
52 | 
53 | Planned: probeManager module
54 | 
55 |   \- 4.10. [Plugin registration service PluginManager](/kubelet/pluginmanager.md)
56 | 
57 | Chapter 5. Open interfaces
58 | 
59 |   \- 5.1.  [Container Runtime Interface (CRI)](/开放接口/CRI.md)
60 | 
61 |   \- 5.2.  [Container Storage Interface (CSI)](/开放接口/CSI.md)
62 | 
63 |       \- 5.2.1.  Open interfaces
64 |       \- 5.2.2.  Volume lifecycle
65 |       \- 5.2.3.  Sidecar containers
66 |       \- 5.2.4.  Deploying a CSI driver
67 |       \- 5.2.5.  How to test a driver
68 |       \- 5.2.6.  How to write a local storage driver
69 | 
70 | Chapter 6. Applications
71 | 
72 |   \- 6.1.  [Delayed PVC binding](/应用篇/pvc延迟绑定.md)
73 | 
74 | 
75 | 
76 | 
77 | 
78 | 
79 | 
80 | 


--------------------------------------------------------------------------------
/ai-learning/base/base.md:
--------------------------------------------------------------------------------
 1 | 
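 2 | # Background
 3 | Kubernetes currently has no built-in way for multiple tasks to share one GPU, MLU, or NPU device, and we also need to improve the utilization of heterogeneous AI compute resources.
 4 | 
 5 | The goals are listed below. In short, we need both device sharing and device resource isolation.
 6 | Device sharing:
 7 | * Multiple tasks can share one device, each using only part of its resources
 8 | * Device memory control, allocatable in units such as MB
 9 | * Device specifications can be defined
10 | * Low intrusiveness: resource allocation is controlled without changing the application
11 | Device resource isolation:
12 | For a pod or job with the settings below, the container should see 3 GB of device memory: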
13 | ```yaml
14 |       resources:
15 |         limits:
16 |           nvidia.com/gpu: 1 # request 1 vGPU
17 |           nvidia.com/gpumem: 3000 # each vGPU gets 3000 MB of device memory
18 | ```
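19 | Run `kubectl exec -it xxx -- /bin/sh` to enter the container,
20 | then run `nvidia-smi` to check memory-used.
21 | 
22 | # Configure the NVIDIA container runtime on every GPU node
23 | Install the NVIDIA driver and nvidia-container-toolkit first.
24 | Then configure the runtime. Only the containerd configuration is recorded here.
25 | When Kubernetes runs on containerd, edit the configuration file, usually /etc/containerd/config.toml, to make nvidia-container-runtime the default low-level runtime:
26 | ```toml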
27 | version = 2
28 | [plugins]
29 |   [plugins."io.containerd.grpc.v1.cri"]
30 |     [plugins."io.containerd.grpc.v1.cri".containerd]
31 |       default_runtime_name = "nvidia"
32 | 
33 |       [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
34 |         [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
35 |           privileged_without_host_devices = false
36 |           runtime_engine = ""
37 |           runtime_root = ""
38 |           runtime_type = "io.containerd.runc.v2"
39 |           [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
40 |             BinaryName = "/usr/bin/nvidia-container-runtime"
41 | ```
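42 | Then restart containerd:
43 | ```bash
44 | sudo systemctl daemon-reload && sudo systemctl restart containerd
45 | ```
46 | Label the nodes
47 | Mark your GPU nodes by adding a label. Without this label, the scheduler cannot manage the node.
48 | Note that tools such as HAMi or volcano each expect their own label; check the documentation of the tool you use to manage heterogeneous AI compute devices on Kubernetes.
49 | `kubectl label nodes {nodeid} gpu=on`
50 | Link: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html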
51 | 
52 | #


--------------------------------------------------------------------------------
/ai-learning/base/gpu虚拟化.md:
--------------------------------------------------------------------------------
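 1 | This is a background note on GPU hardware.
 2 | # Background: why virtualize
 3 | The need to partition (virtualize) a GPU comes from the fact that a workload often cannot use all the resources of a whole card. GPUs are expensive, so we want virtualization to raise their utilization.
 4 | 
 5 | # GPU concurrency mechanisms
 6 | * CUDA streams
 7 | * CUDA Multi-Process Service (MPS)
 8 | * Time-slicing
 9 | * Multi-Instance GPU (MIG)
10 | * vGPU virtualization
11 | 
12 | # Virtualization before MIG: vGPU
13 | NVIDIA vGPU:
14 | vGPU is NVIDIA's virtual GPU software for VM workloads. It creates many small virtual GPUs on one physical GPU and offers them to different users.
15 | Virtualizing physical hardware breaks down into two parts: virtualizing storage and virtualizing compute.
16 | Time-sliced vGPU still has tasks share the physical device (its engines): **it controls how much of the physical device each task uses by changing the size of each virtual GPU's time slice**. This works for some scenarios, but because the card's resources are shared, **all tasks take turns, so after partitioning it is hard to guarantee good QoS for compute, bandwidth, and task switching**. For example, when two tasks share the Video Decode engines they must switch back and forth, which costs more than a proportional split (say there are 4 Video Decode engines and each task gets two). Also, not every task can use all the resources of a full GPU, so a single running task wastes some of them.
17 | Virtualization offerings in China, such as Huawei Cloud ModelArts and Alibaba Cloud's GPU services, largely follow the vGPU approach as well.
18 | 
19 | GPU virtualization designs revolve around three questions: is the data secure, is isolation sufficient, and can QoS be guaranteed.
20 | 
21 | # MIG
22 | MIG works by partitioning and recombining: the physical resources on the card are split into slices, and the slices are regrouped so that each resulting sub-GPU gets data protection, fault isolation, and so on.
23 | 
24 | Creating a MIG device takes two steps: create a GI (GPU Instance) on the physical GPU, then create a CI (Compute Instance) on that GI.
25 | 
26 | # With vGPU available, why MIG?
27 | MIG and vGPU both aim to virtualize a physical GPU to meet application needs. vGPU can be combined with MIG, meaning vGPU can be layered on top of MIG. Before MIG existed, vGPU was implemented with time slicing, so a comparison of MIG and vGPU usually means comparing time-sliced vGPU with MIG-backed vGPU.
28 | Test data show that MIG-backed vGPU outperforms time-sliced vGPU in training time, training throughput, inference time, and inference throughput.
29 | 
30 | 
31 | # CUDA streams
32 | Compute Unified Device Architecture (CUDA) is a parallel computing platform and programming model developed by NVIDIA for general-purpose computing on GPUs.
33 | A stream is a sequence of operations that executes on the GPU in issue order. CUDA commands normally run in order in the default stream, and a task starts only after the preceding task has finished.
34 | Asynchronous operations across different streams allow tasks to run in parallel. A task issued in one stream may run before, during, or after a task issued in another stream. This lets the GPU run multiple tasks at the same time in no prescribed order, which improves performance.
35 | 
36 | 
37 | ## Basic commands
38 | Check MIG status: `nvidia-smi -i 7` (shows the state of GPU 7, including whether MIG is disabled)
39 | Enable MIG: `nvidia-smi -i 7 -mig 1`
40 | 
41 | ## Common problems
42 | Enabling MIG may fail with a "pending" message: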
43 | ```
44 | $ sudo nvidia-smi -i 0 -mig 1
45 | Warning: MIG mode is in pending enable state for GPU 00000000:00:03.0:Not Supported
46 | Reboot the system or try nvidia-smi --gpu-reset to make MIG mode effective on GPU 00000000:00:03.0
47 | All done.
48 | ```
49 | Possible causes:
50 | 
51 | The GPU is in use by some program;
52 | a docker container has the GPU mounted;
53 | some service (such as nvsm or dcgm) is holding the GPU.
54 | Solution:
55 | ```bash
56 | # nvidia-smi --gpu-reset
57 | # systemctl stop nvsm
58 | # systemctl stop dcgm
59 | # docker stop [container] # stop the running container
60 | ```


--------------------------------------------------------------------------------
/ai-learning/base/构建大规模k8s集群方案.md:
--------------------------------------------------------------------------------
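  1 | # Overview
  2 | This document covers performance problems in large-scale Kubernetes clusters.
  3 | It looks at the performance and stability of etcd, kube-apiserver, kube-controller-manager, and other components.
  4 | Before reading, estimate the scale you need: number of pods, number of nodes, and so on.
  5 | 
  6 | We ran into the following problems in large clusters:
  7 | * etcd: high read/write latency and frequent denial of service
  8 |     * read/write latency spikes
  9 |     * too many requests, with a DDoS-like effect
 10 |     * unable to write after hitting the storage limit
 11 | * apiserver: very high latency when querying resources
 12 |     * get pods/nodes is very slow
 13 | * controller: very high latency
 14 |     * controllers cannot catch up
 15 | * scheduler: high latency
 16 |     * low scheduler throughput
 17 | * kubelet
 18 |     * slow restart after failure
 19 | 
 20 | Based on these problems, we made the following adjustments:
 21 | # Architecture design and optimization
 22 | Architecture design has to cover network design, storage architecture, and more. The details depend on the workload and the hardware available, but the following areas are a good starting point: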
 23 | ```
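 24 | Node deployment:  multiple nodes, geographic distribution, high-performance hardware
 25 | Storage:          NVMe SSD, tuned file system, compression enabled
 26 | Parameter tuning: snapshot frequency, WAL placement, resource limits
 27 | Network:          dedicated high-speed network, TLS-secured communication
 28 | Monitoring:       real-time monitoring of key metrics, timely tuning
 29 | Security:         TLS, access control, backup strategy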
 30 | ```
 31 | 
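 32 | # kube-proxy
 33 | IPVS mode is required.
 34 | Change the kube-proxy configuration as follows (the first three lines are the values before tuning, the last three the values after):
 35 | ```yaml
 36 | --ipvs-min-sync-period=1s # default: minimum refresh interval of 1s
 37 | --ipvs-sync-period=5s  # default: periodic refresh every 5s
 38 | --cleanup=false
 39 | --ipvs-min-sync-period=0s # refresh immediately when an event occurs
 40 | --ipvs-sync-period=30s # periodic refresh every 30s, which re-syncs the node's iptables rules
 41 | --cleanup=true # clean up iptables and ipvs rules, then exit
 42 | ```
 43 | For high-concurrency data-plane workloads, configure the IPVS scheduler used for services:
 44 | ```yaml
 45 | --ipvs-scheduler=lc # least connection; the default is rr (round-robin)
 46 | ```
 47 | 
 48 | ## conntrack settings
 49 | Both iptables mode and IPVS mode go through conntrack underneath.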
 50 | ```bash
 51 | $ sysctl -w net.netfilter.nf_conntrack_max=2310720
 52 | ```
 53 | kube-proxy resets this value on startup, so tune its flags according to the node's CPU core count:
 54 | ```bash
 55 | --conntrack-max-per-core=144420  # 2310720 / number of CPU cores, e.g. 2310720 / 16 cores = 144420
 56 | --conntrack-min=2310720 # set to the nf_conntrack_max value
 57 | ```
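
The per-core value above is a plain integer division, so it is easy to recompute for nodes with a different core count. The following is a minimal sketch, not part of kube-proxy; the function name `conntrackMaxPerCore` and the list of core counts are assumptions made for illustration.

```go
package main

import "fmt"

// conntrackMaxPerCore derives a value for --conntrack-max-per-core from the
// desired nf_conntrack_max and the node's CPU core count (integer division).
// Hypothetical helper for illustration only.
func conntrackMaxPerCore(targetMax, cores int) int {
	if cores <= 0 {
		return 0
	}
	return targetMax / cores
}

func main() {
	const targetMax = 2310720 // the nf_conntrack_max used in this document
	for _, cores := range []int{8, 16, 32} {
		fmt.Printf("cores=%d --conntrack-max-per-core=%d\n",
			cores, conntrackMaxPerCore(targetMax, cores))
	}
}
```

For a 16-core node this reproduces the 144420 used above.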
 58 | 
 59 | # kubelet
 60 | First set the per-node pod limit. The default is 110; if the cluster has to run more than ten thousand pods, it must be raised.
 61 | ```yaml
 62 | --max-pods=500
 63 | --node-status-update-frequency=3s
 64 | ```
 65 | Tune node lease reporting (update /etc/kubernetes/kubelet-config.yaml):
 66 | ```yaml
 67 |   --node-status-update-frequency=3s # node status reporting period, default 10s; specifies how often kubelet posts its node status to the API server. See https://kubernetes.io/docs/concepts/architecture/nodes/
 68 |   --kube-api-qps=50 # QPS for node lease reporting
 69 |   --kube-api-burst=100 # burst for node lease reporting
 70 |   --event-qps=100 # QPS for pod event reporting
 71 |   --event-burst=100 # burst for pod event reporting
 72 | ```
 73 | The fewer nodes in the cluster, the higher each kubelet's event/API QPS and burst can be set, and the more events and apiserver requests a single machine generates.
 74 | The more nodes in the cluster, the more apiserver and etcd instances are needed, scaled proportionally.
 75 | **Rough calculation: per-kubelet qps = apiserver qps / number of nodes**
 76 | 
 77 | ## kubelet tests
 78 | We test two key metrics: how fast kubelet deploys pods, and how fast kubelet reports events after a scale-down.
 79 | Pod deployment speed:
 80 | pin the deployment to a single node with affinity and scale it from 1 pod to 200 pods.
 81 | ```bash
 82 | $ time kubectl rollout status deployment/demo
 83 | ```
 84 | Event reporting speed after scale-down:
 85 | with the deployment pinned to a single node, scale from 200 pods down to 1. The time to release 199 pods dropped from 32s to 9.05s.
 86 | ```bash
 87 | $  watch time kubectl get rs  
 88 | Every 2.0s: time kubectl get rs                                                                                       Thu 
 89 | 
 90 | NAME                         DESIRED   CURRENT   READY   AGE
 91 | demo-xx   1         1         1       28h
 92 | ```
 93 | 
 94 | # kube-controller-manager
 95 | Configuration tuning:
 96 | ```yaml
 97 |   --node-monitor-period=2s # interval for checking kubelet status
 98 |   --node-monitor-grace-period=20s # interval for checking NotReady nodes
 99 |   --pod-eviction-timeout=30s # interval before rescheduling after pod binding fails
100 |   --concurrent-deployment-syncs=50 # deployment concurrency
101 |   --concurrent-endpoint-syncs=50 # endpoint concurrency
102 |   --concurrent-job-syncs=50 # job concurrency
103 |   --concurrent-namespace-syncs=100 # namespace concurrency
104 |   --concurrent-replicaset-syncs=50 # replicaset concurrency
105 |   --concurrent-service-syncs=100 # service concurrency
106 |   --kube-api-qps=500 # rate limit toward the apiserver: 500 QPS
107 |   --kube-api-burst=100 # burst limit toward the apiserver: 100 concurrent connections
108 | ```
109 | 
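110 | # etcd tuning
111 | Avoid running etcd in containers where possible.
112 | ## Timing parameters
113 | Our cluster was installed with kubespray. heartbeat-interval is the heartbeat period, i.e. how often the leader tells the followers that it is still the leader. In practice it should be set to about the round-trip time (RTT) between members; the simplest way to estimate RTT is ping.
114 | ```yaml
115 | --heartbeat-interval=250
116 | ```
117 | ## Snapshots
118 | Creating a snapshot is expensive, so etcd writes a snapshot file only after a certain number of changes have accumulated. The default is 10000. In very large clusters, if etcd's memory and disk usage are too high, try lowering the snapshot threshold.
119 | ```yaml
120 |    --snapshot-count=10000 # a snapshot is taken once 10000 changes have accumulated
121 | ```
122 | ## Disk
123 | etcd's data directory is split into snapshot and WAL, which are written differently: a snapshot is dumped from memory straight to a file, while the WAL is appended sequentially. Placing snap and WAL on two separate SSDs therefore improves overall I/O efficiency and can raise etcd performance by about 20%.
124 | On Linux, etcd's disk priority can be set with ionice:
125 | ```bash
126 | ionice -c2 -n0 -p `pgrep etcd`
127 | ```
128 | etcd is very sensitive to disk I/O latency. Disk I/O priority can be adjusted with ionice, which offers the following scheduling classes:
129 | * idle: does disk I/O only when no other process needs the disk
130 | * best effort: the lower the number, the higher the priority
131 | * real time: gets access to the disk first, regardless of other processes' I/O
132 | * none
133 | 
134 | ## CPU priority
135 | `renice -n -20 -p $(pgrep etcd)`
136 | The nice value can be set by the user and defaults to 0. root can use the range [-20, 19]; ordinary users can use [0, 19]. The lower the number, the higher the CPU scheduling priority.
137 | 
138 | ## Data size and automatic compaction
139 | etcd has a storage quota (2 GB by default). Once the data exceeds the quota, etcd stops accepting writes. Raise the quota with --quota-backend-bytes. The default is 2 GB and the upper limit is 8 GB; version 3.4 supports 100 GB.
140 | ```yaml
141 |   --quota-backend-bytes=8589934592  # 8 GB backend storage
142 |   --auto-compaction-mode=revision
143 |   --auto-compaction-retention=1000  # compact automatically every 5 minutes, keeping the latest 1000 revisions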
144 | ```
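145 | ## Move K8s events to a separate etcd cluster
146 | The apiserver is an event-driven service, so different types of apiserver data can be stored in different etcd clusters. Inside etcd, each type corresponds to a different data directory (key prefix). Routing different directories to different backend etcd clusters reduces the total data held by any single cluster and improves scalability.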
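147 | A quick estimate: suppose there are 5000 nodes and each kubelet report is 15 KB, plus 10000 pods whose change events are 4 KB each.
148 | Node status received by etcd: 15 KB * (60s / 3s) * 5000 = 1500000 KB ≈ 1465 MB/min
149 | Pod events received by etcd: 10000 * 4 KB * 30% = 12000 KB ≈ 12 MB/min
150 | Together these updates generate roughly 1.4 GB/min of transaction logs (etcd keeps the change history).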
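
The estimate can be re-derived for other cluster sizes. The sketch below is only an illustration under the same assumptions as the text (one 15 KB status report per node every 3 seconds, 30% of pods emitting one 4 KB event per minute); the function names are hypothetical and not part of any Kubernetes or etcd API.

```go
package main

import "fmt"

// nodeStatusKBPerMin estimates the KB per minute written for node status
// reports: each node sends one report of reportKB every intervalSec seconds.
// intervalSec must be greater than zero.
func nodeStatusKBPerMin(nodes, reportKB, intervalSec int) int {
	reportsPerMin := 60 / intervalSec
	return reportKB * reportsPerMin * nodes
}

// podEventKBPerMin estimates the KB per minute written for pod events,
// assuming changePercent of the pods emit one event of eventKB per minute.
func podEventKBPerMin(pods, eventKB, changePercent int) int {
	return pods * eventKB * changePercent / 100
}

func main() {
	nodeKB := nodeStatusKBPerMin(5000, 15, 3)
	podKB := podEventKBPerMin(10000, 4, 30)
	totalMB := float64(nodeKB+podKB) / 1024

	fmt.Printf("node status: %d KB/min (%.0f MB/min)\n", nodeKB, float64(nodeKB)/1024)
	fmt.Printf("pod events:  %d KB/min (%.1f MB/min)\n", podKB, float64(podKB)/1024)
	fmt.Printf("total:       %.0f MB/min (%.2f GB/min)\n", totalMB, totalMB/1024)
}
```

Node status traffic dominates the total, which is why leases and events are good candidates for their own etcd clusters.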
151 | Splitting principle:
152 | 
153 | pod etcd
154 | lease etcd
155 | event etcd
156 | other etcd (node, job, deployment, and so on)
157 | apiserver configuration for the split (servers inside one override are separated by semicolons):
158 | ```yaml
159 |   --etcd-servers="http://etcd1:2379,http://etcd2:2379,http://etcd3:2379" \
160 |   --etcd-servers-overrides="/events#http://etcd4:2379;http://etcd5:2379;http://etcd6:2379" \
161 |   --etcd-servers-overrides="coordination.k8s.io/leases#http://etcd7:2379;http://etcd8:2379;http://etcd9:2379" \
162 |   --etcd-servers-overrides="/pods#http://etcd10:2379;http://etcd11:2379;http://etcd12:2379"
163 | ```
164 | 
165 | ## etcd performance tests
166 | We use kubemark to simulate Kubernetes worker nodes: 1000 mock nodes with 5K pods deployed.
167 | Write test, writing 1.5 GB of data to etcd:
168 | ```bash
169 | # leader
170 | $ benchmark --endpoints="http://10.179.0.13:2379" --target-leader --conns=1 --clients=1 put --key-size=8 --sequential-keys --total=10000 --val-size=256
171 | 
172 | $ benchmark --endpoints="http://10.179.0.13:2379" --target-leader --conns=100 --clients=1000 put --key-size=8 --sequential-keys --total=100000 --val-size=256
173 | 
174 | # all members
175 | $ benchmark --endpoints="http://10.179.0.13:2379,http://10.179.0.2:2379,http://10.179.0.6:2379" --target-leader --conns=1 --clients=1 put --key-size=8 --sequential-keys --total=10000 --val-size=256
176 | 
177 | $ benchmark --endpoints="http://10.179.0.13:2379,http://10.179.0.2:2379,http://10.179.0.6:2379"  --target-leader --conns=100 --clients=1000 put --key-size=8 --sequential-keys --total=100000 --val-size=256
178 | ```
179 | Read test:
180 | ```bash
181 | $ benchmark --endpoints="http://10.179.0.13:2379,http://10.179.0.2:2379,http://10.179.0.6:2379"  --conns=1 --clients=1  range foo --consistency=l --total=10000
182 | 
183 | $ benchmark --endpoints="http://10.179.0.13:2379,http://10.179.0.2:2379,http://10.179.0.6:2379"  --conns=1 --clients=1  range foo --consistency=s --total=10000
184 | 
185 | $ benchmark --endpoints="http://10.179.0.13:2379,http://10.179.0.2:2379,http://10.179.0.6:2379"  --conns=100 --clients=1000  range foo --consistency=l --total=100000
186 | 
187 | $ benchmark --endpoints="http://10.179.0.13:2379,http://10.179.0.2:2379,http://10.179.0.6:2379"  --conns=100 --clients=1000  range foo --consistency=s --total=100000
188 | ```
189 | 
190 | 
191 | # apiserver tuning
192 | 
193 | 
194 | 
195 | 
196 | # Overall test
197 | We used kubemark on 45 machines to mock 5000 nodes in the kubemark namespace.
198 | We used the e2e tool to create 50 namespaces, 1000 services, and 40000 pods.
199 | Of the namespaces, 1/4 are large (1000+ pods), 1/4 are medium (400+ pods), and 1/2 are small (50+ pods). ConfigMaps, ServiceAccounts, Secrets, and so on in each namespace are created randomly at a 4:1 ratio to the number of pods.


--------------------------------------------------------------------------------
/ai-learning/base/构建高性能gpu网络方案.md:
--------------------------------------------------------------------------------
  1 | 
  2 | 
  3 | # concept
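  4 | ### PCIe switch chips:
  5 | In high-performance GPU computing, key components such as CPUs, memory modules, NVMe storage devices, GPUs, and network adapters are connected through the PCIe (Peripheral Component Interconnect Express) bus or dedicated PCIe switch chips. PCIe has gone through five generations, and the current Gen5 provides very efficient interconnect between devices. Each generation has raised data transfer speed, which is why PCIe remains central to building high-performance computing systems and to letting the devices in a modern compute cluster work together.
  6 | 
  7 | ### NVLink definition:
  8 | NVLink is a bus and communication protocol developed by NVIDIA. It uses a point-to-point structure with serial transmission, connects CPUs to GPUs, and can also connect multiple GPUs to each other. Unlike PCI Express, a device can have multiple NVLinks, and devices communicate over a mesh network rather than through a central hub. The protocol was first announced in March 2014 and uses a proprietary high-speed signaling interconnect (NVHS).
  9 | NVLink generations, from NVLink 1.0 to NVLink 4.0:
 10 | NVLink 1.0
 11 | Links: 4 links.
 12 | Total bandwidth: up to 160 GB/s bidirectional.
 13 | Use: mainly to speed up data transfer between GPUs and improve cooperative computing performance.
 14 | NVLink 2.0
 15 | Links: 6 links.
 16 | Total bandwidth: raised to 300 GB/s bidirectional.
 17 | Improvement: higher data transfer rate and better GPU-to-GPU communication efficiency.
 18 | NVLink 3.0
 19 | Links: 12 links.
 20 | Total bandwidth: 600 GB/s bidirectional.
 21 | New features: new techniques and protocols that raise communication bandwidth and efficiency.
 22 | NVLink 4.0
 23 | Links: 18 links.
 24 | Total bandwidth: raised further to 900 GB/s bidirectional.
 25 | Improvement: with more links, NVLink 4.0 better meets the bandwidth demands of high-performance computing and AI applications.
 26 | The key differences between NVLink 1.0, 2.0, 3.0, and 4.0 are **the growing number of links, the total bandwidth supported, and the resulting performance gains**. Each generation improves GPU-to-GPU data transfer to keep up with increasingly complex and demanding workloads.
 27 | 
 28 | ### NVSwitch:
 29 | NVSwitch is a switch chip that NVIDIA developed for high-performance computing and AI applications. Its core role is **high-speed, low-latency communication among multiple GPUs inside one host**.
 30 | 
 31 | 
 32 | ### NVLink switch:
 33 | The NVLink switch is **a standalone switch that NVIDIA builds for high-performance communication between GPUs on different hosts** in a distributed computing environment. Unlike NVSwitch, which is integrated on the GPU module inside a single host, the NVLink switch addresses cross-host connectivity. The two are easy to confuse because early mentions of an "NVLink switch" referred to the switch chip mounted on the GPU module. In 2022 NVIDIA turned this chip technology into a standalone switch product and officially named it the NVLink switch.
 34 | 
 35 | ### HBM (High Bandwidth Memory)
 36 | Traditionally, GPU memory resembled ordinary DDR (double data rate) memory: it was plugged into the motherboard through physical slots and connected to the CPU or GPU over PCIe. This made the PCIe bus a bandwidth bottleneck: Gen4 provides 64 GB/s and Gen5 raises that to 128 GB/s.
 37 | To get past this limit, several GPU vendors, NVIDIA among them, stack multiple DDR chips into a single package known as High Bandwidth Memory (HBM). In the H100 design, for example, the GPU connects directly to its HBM without going through a PCIe switch chip, which greatly increases data transfer speed, in theory by orders of magnitude. The name "High Bandwidth Memory" describes this architecture accurately.
 38 | 
 39 | ### Bandwidth units
 40 | In network communication, data rates are usually given in bits per second (bit/s) and, to distinguish transmit (TX) from receive (RX), as one-way rates. For other hardware such as PCIe, memory, NVLink, and HBM, bandwidth is usually given in bytes per second (Byte/s) or transfers per second (T/s), and the figures generally represent total bidirectional bandwidth covering both the upstream and downstream directions.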
 41 | 
 42 | 
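 43 | # High-performance network architecture design for 8 nodes equipped with NVIDIA A100 GPUs
 44 | Network architecture design principles:
 45 | * High bandwidth, low latency
 46 | * Congestion control and traffic management
 47 | Performance tuning suggestions:
 48 | * Use the GPUs' NVLink and NVSwitch (if supported) to raise intra-node GPU bandwidth
 49 | * Balance communication and compute load to reduce waiting time
 50 | * Use profiling tools (such as NVIDIA Nsight or Perftools) to find and fix bottlenecks
 51 | 
 52 | Required components: two CPUs, storage NICs, PCIe switch chips, NVSwitch chips, GPUs, and dedicated NICs:
 53 | Component overview
 54 | * Two CPUs (per node)
 55 |     * Run general-purpose compute tasks
 56 |     * NUMA (non-uniform memory access) architecture
 57 | * Two storage network adapters (NICs)
 58 |     * Connect to distributed storage and support in-band management
 59 | * Four PCIe Gen4 switch chips
 60 |     * Provide high-speed links between GPUs and other hardware
 61 | * Six NVSwitch chips
 62 |     * High-speed GPU-to-GPU interconnect with very low latency
 63 | * Eight A100 GPUs (per node)
 64 |     * Main compute units for deep learning and HPC tasks
 65 | * Eight GPU-dedicated network adapters
 66 |     * One per GPU, optimizing GPU-to-GPU network communication
 67 | Connections and topology
 68 | 1. CPU-to-GPU connection
 69 | Each CPU connects to PCIe Gen4 switch chips (two of them)
 70 | over high-speed PCIe Gen4 lanes
 71 | Each GPU connects to a switch chip over PCIe Gen4
 72 | NUMA memory is attached to the CPUs, and GPU access to host memory follows the NUMA policy
 73 | 2. Intra-node high-speed interconnect (GPU-to-GPU)
 74 | All GPUs form a fully connected network through the six NVSwitch chips
 75 | NVSwitch provides direct high-speed communication between GPUs (NVLink Gen3 or Gen4)
 76 | Each GPU connects to at least one NVSwitch, so every GPU can communicate with every other GPU at high speed
 77 | 3. GPU access to storage
 78 | The storage NICs connect to the PCIe switch chips and provide a high-speed path to distributed storage
 79 | The storage NICs support in-band management, keeping storage access efficient and manageable
 80 | 4. GPU-dedicated network adapters
 81 | Each GPU has its own dedicated network adapter (such as an InfiniBand or 100GbE NIC)
 82 | connected to a high-speed network switch (for example InfiniBand HDR or an NVIDIA Mellanox switch)
 83 | for remote GPU-to-GPU communication (cross-node communication, distributed training)
 84 | 5. Overall topology
 85 | Inside a node:
 86 | * CPUs connect to the PCIe switch chips
 87 | * GPUs connect to the PCIe switch chips and to NVSwitch
 88 | * GPUs interconnect at high speed through NVSwitch
 89 | Outside a node:
 90 | * The dedicated network adapters of all GPUs connect to the high-speed network switch
 91 | Storage access:
 92 | * The storage NICs connect to the storage network for centralized storage access and management
 93 | Diagram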
 94 | ```
 95 | +------------------------------+
 96 | |            CPU               |
 97 | |  +------------+------------+ |
 98 | |  | NUMA memory| NUMA memory| |
 99 | |  +------------+------------+ |
100 | +--------------|---------------+
101 |                |
102 |         +--------------+
103 |         | PCIe switch  |
104 |         +--------------+
105 |         /     |     \
106 |        /      |      \
107 | +-------+ +-------+ +-------+ +-------+
108 | | GPU1  | | GPU2  | | GPU3  | | GPU4  |
109 | |  +-----+ +-----+ +-----+ +-----+ |
110 | |  |NVSwitch| ... (connects all GPUs) |
111 | +-------+ +-------+ +-------+ +-------+
112 |    |            |             |       |
113 |    |            |             |       |
114 |    |            |             |       |
115 |    |  (high-speed GPU-to-GPU communication via NVSwitch)  |
116 |    +-----------------------------------------+
117 | ```
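118 | External network connections:
119 | - Each GPU connects to a high-speed switch (such as InfiniBand) through its dedicated network adapter
120 | - The storage network adapters connect to the storage devices and the management network
121 | Key points
122 | * GPU to GPU: very fast intra-node GPU communication through NVSwitch
123 | * GPU to CPU: PCIe Gen4 links for high-speed data transfer
124 | * Storage access: the storage NICs support in-band management and access to distributed storage
125 | * Network communication: GPU-dedicated network adapters connect to the high-speed network switch for cross-node communication and distributed training
126 | 
127 | Notes
128 | The exact wiring (such as the NVSwitch topology) can be adjusted to what the hardware supports, for example fully or partially connected
129 | The design emphasizes high-speed intra-node GPU communication (NVSwitch, NVLink) and a high-speed inter-node network (InfiniBand or 100GbE)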
130 | 
131 | 
132 | 


--------------------------------------------------------------------------------
/ai-learning/images/volcano-gang.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/ai-learning/images/volcano-gang.png


--------------------------------------------------------------------------------
/ai-learning/images/volcano-scheduler-actions-start.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/ai-learning/images/volcano-scheduler-actions-start.png


--------------------------------------------------------------------------------
/ai-learning/images/volcano-scheduler-actions.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/ai-learning/images/volcano-scheduler-actions.png


--------------------------------------------------------------------------------
/ai-learning/images/volcano-scheduler-gang01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/ai-learning/images/volcano-scheduler-gang01.png


--------------------------------------------------------------------------------
/ai-learning/images/volcano-scheduler-registerfn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/ai-learning/images/volcano-scheduler-registerfn.png


--------------------------------------------------------------------------------
/ai-learning/images/volcano-scheduler-workflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/ai-learning/images/volcano-scheduler-workflow.png


--------------------------------------------------------------------------------
/ai-learning/images/volcano-workflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/ai-learning/images/volcano-workflow.png


--------------------------------------------------------------------------------
/ai-learning/volcano/01-volcano基础.md:
--------------------------------------------------------------------------------
 1 | # what is volcano
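 2 | See the official website.
 3 | 
 4 | # Why we chose it
 5 | The default Kubernetes scheduler does not support sharing devices such as GPUs and NPUs across multiple tasks. We need Kubernetes scheduling for heterogeneous devices, and we also need to handle batch computing.
 6 | Our current scenarios:
 7 | * All tasks must start together, which requires gang scheduling
 8 | * Sharing heterogeneous GPU resources
 9 | * Support for heterogeneous devices, and so on
10 | 
11 | 
12 | Based on the scenarios above, we chose the volcano scheduler.
13 | Volcano uses the Kubernetes device plugin mechanism to provide the following:
14 | * Collect the GPU count (gpu-number) and GPU memory (gpu-memory) of each node in the cluster, exposed as two resources: `volcano.sh/gpu-memory` and `volcano.sh/gpu-number`
15 | * Monitor GPU health
16 | * Mount GPU resources for workloads in the cluster that request GPUs
17 | 
18 | # volcano device plugin
19 | Once the GPU device plugin is deployed on the Kubernetes cluster, the GPU memory and GPU card count show up when we inspect the nodes with get node.
20 |  1) The GPU device plugin collects and reports GPU resources
21 |  It is based on https://github.com/NVIDIA/k8s-device-plugin and adds soft isolation for GPUs. The device plugin runs as a DaemonSet and mainly:
22 |  * Exposes the number of GPUs on each node
23 |  * Tracks GPU health
24 |  * Runs GPU-enabled containers
25 |  2) The user submits a pod that requests GPUs to the apiserver
26 |  3) Enable the volcano GPU scheduling plugin with `predicate.GPUSharingEnable: true`
27 |  The predicates plugin pre-selects nodes. With GPU sharing enabled, it filters for GPU nodes and picks the ID of a GPU card that can satisfy the pod's resource request. For example, suppose the cluster has three nodes: Node1 and Node2 each have two GPU cards with 11178 MB of memory, and Node3 has no GPU. When a user creates a pod requesting 8000 MB of GPU memory, the scheduler first filters for the nodes with GPU resources, Node1 and Node2, and then picks one GPU on a node that can satisfy the pod's request. In this example, Volcano selects card GPU1 on Node2 for the pod.
28 |  4) Start the container. After the kubelet on the node receives the event binding the pod to the node, it creates the pod and calls the Allocate method implemented by the GPU plugin. Among all pending pods on the node, this method picks the pod whose "volcano.sh/gpu-assigned" is false and whose predicate time is earliest, creates it, and sets its "volcano.sh/gpu-assigned" to true.
29 | The overall workflow is roughly: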
30 | ![](../images/volcano-workflow.png)
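
The card selection in step 3 can be sketched as a first-fit search over the free memory of each card. This is a simplified illustration, not Volcano's implementation: the `gpuCard` and `gpuNode` types, the `pickGPU` function, the 8000 MB request, and the free-memory values (chosen so that only GPU1 on Node2 fits) are all assumptions.

```go
package main

import "fmt"

// gpuCard and gpuNode are hypothetical stand-ins for the scheduler's view of
// a node; they are not Volcano types.
type gpuCard struct {
	ID        int
	FreeMemMB int
}

type gpuNode struct {
	Name string
	GPUs []gpuCard
}

// pickGPU returns the first node and card whose free memory covers requestMB.
// Nodes without GPUs are skipped naturally because they have no cards.
func pickGPU(nodes []gpuNode, requestMB int) (string, int, bool) {
	for _, n := range nodes {
		for _, c := range n.GPUs {
			if c.FreeMemMB >= requestMB {
				return n.Name, c.ID, true
			}
		}
	}
	return "", -1, false
}

func main() {
	// Assumed state: each card has 11178 MB in total, some already in use.
	nodes := []gpuNode{
		{Name: "Node1", GPUs: []gpuCard{{ID: 0, FreeMemMB: 4000}, {ID: 1, FreeMemMB: 6000}}},
		{Name: "Node2", GPUs: []gpuCard{{ID: 0, FreeMemMB: 2000}, {ID: 1, FreeMemMB: 11178}}},
		{Name: "Node3"}, // no GPU
	}
	if name, id, ok := pickGPU(nodes, 8000); ok {
		fmt.Printf("schedule on %s, GPU%d\n", name, id)
	} else {
		fmt.Println("no GPU can satisfy the request")
	}
}
```

A request larger than any single card, such as 80000 MB against 11178 MB cards, finds no match in this sketch.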
31 | 
32 | 


--------------------------------------------------------------------------------
/ai-learning/volcano/02-volcano调度器.md:
--------------------------------------------------------------------------------
  1 | 
  2 | # overview
  3 | The volcano scheduler is made up of actions and plugins. An action is the work performed in each stage of scheduling; a plugin implements the algorithms used inside the actions.
  4 | ![](../images/volcano-scheduler-workflow.png)
  5 | 
  6 | # session
  7 | A session is one scheduling cycle, 1 second by default. Each scheduling cycle creates a new session object.
  8 | ```go
  9 | go wait.Until(pc.runOnce, pc.schedulePeriod, stopCh)
 10 | ```
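 11 | In `runOnce`, the scheduler reads the configuration, initializes the plugins, and calls each plugin's `plugin.OnSessionOpen(ssn)`. In `OnSessionOpen`, a plugin initializes the data it needs and registers callback functions with the session. Commonly registered functions include:
 12 | 
 13 | * The predicates plugin registers only `ssn.AddPredicateFn(pp.Name(), func(task *api.TaskInfo, node *api.NodeInfo) error {`, i.e. a predicate function, with the session
 14 | * The gang plugin registers `ssn.AddJobValidFn(gp.Name(), validJobFn)`, a job-validity function
 15 | Note that a plugin does not have to register every function below; it registers only the ones it needs.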
 16 | ![](../images/volcano-scheduler-registerfn.png)
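
The registration pattern can be sketched in a few lines. The code below is a simplified illustration under stated assumptions, not Volcano's real API: `Session`, `Task`, `Node`, `PredicateFn`, the `gpuFitPlugin` plugin, and this `runOnce` are hypothetical, cut-down stand-ins that only mirror the shape of `OnSessionOpen` and `AddPredicateFn` described above.

```go
package main

import (
	"errors"
	"fmt"
)

// Task and Node are simplified, hypothetical stand-ins for api.TaskInfo and api.NodeInfo.
type Task struct {
	Name     string
	GPUMemMB int
}

type Node struct {
	Name         string
	FreeGPUMemMB int
}

// PredicateFn reports an error when a task cannot run on a node.
type PredicateFn func(task Task, node Node) error

// Session holds the callbacks that plugins register for one scheduling cycle.
type Session struct {
	predicateFns map[string]PredicateFn
}

func (s *Session) AddPredicateFn(name string, fn PredicateFn) {
	s.predicateFns[name] = fn
}

// Predicate passes only if every registered predicate accepts the pair.
func (s *Session) Predicate(task Task, node Node) error {
	for _, fn := range s.predicateFns {
		if err := fn(task, node); err != nil {
			return err
		}
	}
	return nil
}

// Plugin mirrors the three functions every plugin implements.
type Plugin interface {
	Name() string
	OnSessionOpen(ssn *Session)
	OnSessionClose(ssn *Session)
}

// gpuFitPlugin is a made-up plugin that registers a single predicate.
type gpuFitPlugin struct{}

func (gpuFitPlugin) Name() string { return "gpu-fit" }

func (p gpuFitPlugin) OnSessionOpen(ssn *Session) {
	ssn.AddPredicateFn(p.Name(), func(task Task, node Node) error {
		if node.FreeGPUMemMB < task.GPUMemMB {
			return errors.New("not enough GPU memory on " + node.Name)
		}
		return nil
	})
}

func (gpuFitPlugin) OnSessionClose(ssn *Session) {}

// runOnce opens a session, filters nodes with the registered predicates,
// and closes the session again.
func runOnce(plugins []Plugin, task Task, nodes []Node) []string {
	ssn := &Session{predicateFns: map[string]PredicateFn{}}
	for _, p := range plugins {
		p.OnSessionOpen(ssn)
	}
	var fit []string
	for _, n := range nodes {
		if ssn.Predicate(task, n) == nil {
			fit = append(fit, n.Name)
		}
	}
	for _, p := range plugins {
		p.OnSessionClose(ssn)
	}
	return fit
}

func main() {
	nodes := []Node{{Name: "node1", FreeGPUMemMB: 4000}, {Name: "node2", FreeGPUMemMB: 11178}}
	fmt.Println(runOnce([]Plugin{gpuFitPlugin{}}, Task{Name: "train", GPUMemMB: 8000}, nodes))
}
```

The session knows nothing about GPUs; it only runs whatever callbacks the plugins registered, which is what lets each plugin register just the functions it needs.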
 17 | 
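 18 | After initialization succeeds, volcano calls the Execute method of each action in turn.
 19 | 
 20 | # cache
 21 | The cache is worth calling out because of the scheduler's startup order: once the configuration is loaded, the next step is starting the cache with `pc.cache.Run(stopCh)`. Most of this relies on the informer mechanism rather than calling the apiserver (and through it etcd) directly.
 22 | SchedulerCache holds many informers. Each informer registers its event handlers at initialization, and changes to pods, podgroups, and so on are synced into maps such as Jobs, Nodes, Queues, and PriorityClasses. A podgroup is added as a jobInfo and a pod as a taskInfo. As shown below, it keeps syncing nodes: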
 23 | ```go
 24 | // Run  starts the schedulerCache
 25 | func (sc *SchedulerCache) Run(stopCh <-chan struct{}) {
 26 | 	sc.informerFactory.Start(stopCh)
 27 | 	sc.vcInformerFactory.Start(stopCh)
 28 | 	sc.WaitForCacheSync(stopCh)
 29 | 	for i := 0; i < int(sc.nodeWorkers); i++ {
 30 | 		go wait.Until(sc.runNodeWorker, 0, stopCh)
 31 | 	}
 32 | 	...
 33 | }
 34 | 
 35 | func (sc *SchedulerCache) SyncNode(nodeName string) error {
 36 | 	node, err := sc.nodeInformer.Lister().Get(nodeName)
 37 | 	if err != nil {
 38 |     ...
 39 | 	}
 40 | 
 41 | 	if !sc.nodeCanAddCache(node) {
 42 | 		return nil
 43 | 	}
 44 | 	nodeCopy := node.DeepCopy()
 45 | 	csiNode, err := sc.csiNodeInformer.Lister().Get(nodeName)
 46 |   ...
 47 | 	return sc.AddOrUpdateNode(nodeCopy)
 48 | }
 49 | 
 50 | func (sc *SchedulerCache) nodeCanAddCache(node *v1.Node) bool {
 51 | 	if !responsibleForNode(node.Name, sc.schedulerPodName, sc.c) {
 52 | 		return false
 53 | 	}
 54 | 	if len(sc.nodeSelectorLabels) == 0 {
 55 | 		return true
 56 | 	}
 57 | 	for labelName, labelValue := range node.Labels {
 58 | 		key := labelName + ":" + labelValue
 59 | 		if _, ok := sc.nodeSelectorLabels[key]; ok {
 60 | 			return true
 61 | 		}
 62 | 	}
 63 | 	klog.Infof("node %s ignore add/update/delete into schedulerCache", node.Name)
 64 | 	return false
 65 | }
 66 | 
 67 | ```
 68 | # actions
 69 | Below are some of volcano's actions. I do not cover every action, only a few of them.
 70 | ![](../images/volcano-scheduler-actions.png)
 71 | ![](../images/volcano-scheduler-actions-start.png)
 72 | 
 73 | ## enqueue
 74 | 
 75 | 
 76 | ## allocate
 77 | 
 78 | 
 79 | ## preempt
 80 | 
 81 | 
 82 | ## backfill
 83 | 
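 84 | # Plugins
 85 | A plugin mainly implements three functions: Name, OnSessionOpen, and OnSessionClose.
 86 | OnSessionOpen performs work when the session starts and registers functions that handle scheduling details.
 87 | OnSessionClose cleans up resources when the session ends.
 88 | The following sections describe some of volcano's existing plugins.
 89 | First, how to configure the plugins the volcano scheduler uses: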
 90 | ```yaml
 91 | # default configuration for scheduler
 92 | actions: "enqueue, allocate, backfill"
 93 | tiers:
 94 | - plugins:
 95 |   - name: priority
 96 |   - name: gang
 97 |   - name: conformance
 98 | - plugins:
 99 |   - name: overcommit
100 |   - name: drf
101 |   - name: predicates
102 |   - name: proportion
103 |   - name: nodeorder
104 |   - name: binpack
105 | ```
106 | 
107 | Once the configuration above is applied, each scheduling cycle runs in the following order:
108 | 
109 | (OpenSession) --> (enqueue) --> (allocate) --> (backfill) --> (CloseSession)
110 | 
111 | 
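112 | The scheduler first loads the configuration file with loadSchedulerConf, reading the actions and plugins from it. After the first read it keeps watching the file for modify/create events and updates accordingly. It then starts the informers the scheduler needs.
113 | By default it runs one scheduler cycle every second: `runOnce`
114 | `runOnce` mainly opens a session, executes each configured action, and closes the session.
115 | All plugin registration is triggered by `OpenSession`.
116 | ## `gang` plugin
117 | Gang scheduling is all-or-nothing scheduling.
118 | ![](../images/volcano-scheduler-gang01.png)
119 | It first registers functions such as jobValidFns and JobOrderFn through AddJobValidFn, AddJobOrderFn, AddJobReadyFn, AddJobPipelinedFn, and AddJobStarvingFns.
120 | 
121 | The function registered with `AddJobValidFn` mainly checks whether `job.ValidTaskNum()` is less than `job.MinAvailable`. In plain words, it checks whether the number of the job's pods in the `Bound, Binding, Running, Allocated, Succeeded, Pipelined, Pending` states is below the job's `MinAvailable`. Gang scheduling treats pods in these states as having higher priority. The code is well written; some test code would make it even better.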
122 | ```go
123 | func (ji *JobInfo) CheckTaskValid() bool {
124 | 	// if job minAvailable is less than sum of(task minAvailable), skip this check
125 | 	if ji.MinAvailable < ji.TaskMinAvailableTotal {
126 | 		return true
127 | 	}
128 | 
129 | 	actual := map[string]int32{}
130 | 	for status, tasks := range ji.TaskStatusIndex {
131 | 		if AllocatedStatus(status) ||
132 | 			status == Succeeded ||
133 | 			status == Pipelined ||
134 | 			status == Pending {
135 | 			for _, task := range tasks {
136 | 				actual[task.TaskRole]++
137 | 			}
138 | 		}
139 | 	}
140 |   for task, minAvailable := range ji.TaskMinAvailable {
141 | 		if minAvailable == 0 {
142 | 			continue
143 | 		}
144 | 		if act, ok := actual[task]; !ok || act < minAvailable {
145 | 			return false
146 | 		}
147 | 	}
148 | 	return true
149 | }
150 | ```
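Since the note above wishes for some test code, here is a minimal standalone sketch of the same per-role check that can be run and tested without Volcano. The `TaskStatus` values, `countsTowardGang`, and `checkTaskValid` are simplified stand-ins made up for illustration, not Volcano's real types, and the set of counted statuses is reduced to three.

```go
package main

import "fmt"

// TaskStatus is a simplified stand-in for Volcano's task status type.
type TaskStatus string

const (
	Pending   TaskStatus = "Pending"
	Running   TaskStatus = "Running"
	Succeeded TaskStatus = "Succeeded"
	Failed    TaskStatus = "Failed"
)

// countsTowardGang reports whether a status is counted as valid.
// Assumption: only three statuses are modelled here.
func countsTowardGang(s TaskStatus) bool {
	return s == Pending || s == Running || s == Succeeded
}

// checkTaskValid returns false as soon as one role with a non-zero minimum
// has fewer counted tasks than that minimum.
func checkTaskValid(roles map[string][]TaskStatus, minAvailable map[string]int32) bool {
	actual := map[string]int32{}
	for role, statuses := range roles {
		for _, s := range statuses {
			if countsTowardGang(s) {
				actual[role]++
			}
		}
	}
	for role, need := range minAvailable {
		if need == 0 {
			continue
		}
		if actual[role] < need {
			return false
		}
	}
	return true
}

func main() {
	roles := map[string][]TaskStatus{
		"ps":     {Running},
		"worker": {Running, Pending, Failed},
	}
	// Two workers are counted (Running, Pending); the Failed one is not.
	fmt.Println(checkTaskValid(roles, map[string]int32{"ps": 1, "worker": 2}))
	fmt.Println(checkTaskValid(roles, map[string]int32{"ps": 1, "worker": 3}))
}
```

With two counted workers, a minimum of 2 passes and a minimum of 3 fails, which is the all-or-nothing behaviour at the level of a single role.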
151 | Next, we can look at the following plugin.
152 | 
153 | ## `predicate` plugin
154 | 


--------------------------------------------------------------------------------
/ai-learning/volcano/03-volcano控制器.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/ai-learning/volcano/03-volcano控制器.md


--------------------------------------------------------------------------------
/ai-learning/volcano/04-volcano-admission.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/ai-learning/volcano/04-volcano-admission.md


--------------------------------------------------------------------------------
/ai-learning/volcano/volcano gpu设计思考.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/ai-learning/volcano/volcano gpu设计思考.md


--------------------------------------------------------------------------------
/client-go/informer.md:
--------------------------------------------------------------------------------
  1 | ## Overview
  2 | 
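  3 | This article studies the Informer mechanism and the design of each Informer component.
  4 | 
  5 | ## Background
  6 | 
  7 | Why does Kubernetes need the Informer mechanism? Kubernetes components all communicate with the API Server through its REST API. If every component talked to the API Server directly for each read and write against the etcd backend, the API Server and etcd would be under very heavy load. The Informer mechanism keeps communication between components real-time and reliable while reducing the load on the API Server and etcd.
  8 | 
  9 | ## Informer workflow
 10 | 
 11 | We use the CoreV1.Pod resource as the example:
 12 | 1. When the Informer first starts, the Reflector uses `List` to fetch all CoreV1.Pod objects from the API Server and stores them in its `Store` through `syncWith`.
 13 | 2. The `Reflector` then keeps a long-lived connection open to `Watch` the resource change events sent by the API Server.
 14 | 3. When step 2 observes that a CoreV1.Pod object was added, deleted, or modified, the object is put into the `DeltaFIFO`.
 15 | 4. The `DeltaFIFO` is a first-in, first-out queue. Whenever it holds data, the data is popped to the Controller, which stores the object in the `Indexer` and dispatches it to the `SharedInformer`.
 16 | 5. The Controller triggers the `Process` callback.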
 17 | 
 18 | ### Proven wrong
 19 | 
 20 | When I wrote code before, I always assumed that the `SharedInformer` itself watched the API Server. That turns out to be wrong: the `Reflector` is what performs List & Watch.
 21 | 
 22 | 
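 23 | ### Thoughts on ListAndWatch
 24 | 
 25 | Why does Kubernetes use ListAndWatch? Other distributed systems we know often use RPC to trigger behavior.
 26 | 
 27 | Consider the alternatives: the API Server pushes messages to each component in a polling loop, or each component polls the API Server. In both cases **real-time delivery** cannot be guaranteed, the API Server carries a heavy load, and a large number of ports may have to be opened and wasted.
 28 | 
 29 | From the real-time perspective:
 30 | 
 31 | We want every resource creation, change, or deletion to be picked up immediately and put into a message queue. This maps to the `Reflector` component of the Informer, which actively fetches the events and puts them into the `DeltaFIFO` queue for consumption.
 32 | 
 33 | From the load-reduction perspective:
 34 | 
 35 | A cache is needed, which maps to the `Store` component.
 36 | 
 37 | From the design-extensibility perspective:
 38 | 
 39 | Kubernetes is a "resource management system", so the number of objects may grow without bound. We need a component that scales efficiently as the number of object kinds grows without bound and as users instantiate the same kind of object many times. This maps to the `SharedInformer`.
 40 | 
 41 | From the message-reliability perspective:
 42 | 
 43 | Everything so far relies on a long-lived connection for Watch, so what happens when the network fails? This is where List comes in: on first start, or whenever the connection to the API Server is detected as broken, the Informer uses List. List is a short-lived request that fetches the resource state, and Watch is a long-lived connection that keeps receiving and handling resource changes. (Using List & Watch together ensures that no event is missed.)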
 44 | 
 45 | 
 46 | 
 47 | #### How Watch is implemented
 48 | 
 49 | `Watch` receives the resource change events sent by the API Server over a long-lived HTTP connection that uses chunked transfer encoding. The code is in `./staging/src/k8s.io/apiserver/pkg/endpoints/handlers/watch.go`:
 50 | 
 51 | ```go
 52 | 	e := streaming.NewEncoder(framer, s.Encoder)
 53 | 
 54 | 	// ensure the connection times out
 55 | 	timeoutCh, cleanup := s.TimeoutFactory.TimeoutCh()
 56 | 	defer cleanup()
 57 | 
 58 | 	// begin the stream
 59 | 	w.Header().Set("Content-Type", s.MediaType)
 60 | 	w.Header().Set("Transfer-Encoding", "chunked")
 61 | 	w.WriteHeader(http.StatusOK)
 62 | 	flusher.Flush()
 63 | ```
 64 | 
 65 | We can check this with `curl`: the `Transfer-Encoding` field in the response header is set to `chunked`.
 66 | 
 67 | ```bash
 68 | # curl -i http://127.0.0.1:8001/api/v1/watch/namespaces?watch=yes
 69 | HTTP/1.1 200 OK
 70 | Cache-Control: no-cache, private
 71 | Content-Type: application/json
 72 | Date: Sun, 09 Aug 2020 02:44:07 GMT
 73 | Transfer-Encoding: chunked
 74 | 
 75 | {"type":"ADDED","object":{"kind":"Namespace","apiVersion":"v1","metadata":{"name":"...
 76 | ```
 77 | 
 78 | 
 79 | 
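 80 | ## Watching events: Reflector
 81 | 
 82 | As I understand it, the Reflector watches objects of one specified type, which can be a built-in Kubernetes resource or a custom resource defined by a CRD.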
 83 | 
 84 | 
 85 | 
 86 | ### Data structure
 87 | 
 88 | Let's look at the Reflector data structure in `staging/src/k8s.io/client-go/tools/cache/reflector.go`.
 89 | 
 90 | The `listerWatcher` performs the List and Watch operations against the API Server to obtain object changes.
 91 | 
 92 | ```go
 93 | type Reflector struct {
 94 | 	name string
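 95 | 	// The type of object being watched, for example Pod
 96 | 	expectedType reflect.Type
 97 | 	// The store that receives the objects
 98 | 	store Store
 99 | 	// The ListerWatcher is specific to one kind of object, for example Pod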
100 | 	listerWatcher ListerWatcher
101 | 	period       time.Duration
102 | 	resyncPeriod time.Duration
103 | 	ShouldResync func() bool
104 | 	...
105 | }
106 | ```
107 | 
108 | ### Run
109 | 
110 | `Run` loops continuously, storing data into the `DeltaFIFO`.
111 | 
112 | ```go
113 | func (r *Reflector) Run(stopCh <-chan struct{}) {
114 | 	klog.V(3).Infof("Starting reflector %v (%s) from %s", r.expectedType, r.resyncPeriod, r.name)
115 | 	wait.Until(func() {
116 | 		if err := r.ListAndWatch(stopCh); err != nil {
117 | 			utilruntime.HandleError(err)
118 | 		}
119 | 	}, r.period, stopCh)
120 | }
121 | ```
122 | 
123 | In other words, the Reflector keeps running ListAndWatch; `Run` exits only when `stopCh` is closed.
124 | 
125 | 
126 | 
127 | ### ListAndWatch
128 | 
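129 | The following is the key code of ListAndWatch in Kubernetes. ListAndWatch first lists all objects and obtains their resource version, then watches for the changes that happen after that resource version.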
130 | 
131 | ```go
132 | func (r *Reflector) ListAndWatch(stopCh <-chan struct{}) error {	
133 | 	var resourceVersion string
134 | 
135 | 	// Start with the resource version set to "0"
136 | 	options := metav1.ListOptions{ResourceVersion: "0"}
137 | 
138 | 	if err := func() error {
139 | 		initTrace := trace.New("Reflector " + r.name + " ListAndWatch")
140 | 		defer initTrace.LogIfLong(10 * time.Second)
141 | 		var list runtime.Object
142 | 		var err error
143 | 		listCh := make(chan struct{}, 1)
144 | 		panicCh := make(chan interface{}, 1)
145 | 		go func() {
146 | 			defer func() {
147 | 				if r := recover(); r != nil {
148 | 					panicCh <- r
149 | 				}
150 | 			}()
151 | 			// List first
152 | 			list, err = r.listerWatcher.List(options)
153 | 			close(listCh)
154 | 		}()
155 | 		select {
156 | 		case <-stopCh:
157 | 			return nil
158 | 		case r := <-panicCh:
159 | 			panic(r)
160 | 		case <-listCh:
161 | 		}
162 | 
163 | 		initTrace.Step("Objects listed")
164 | 		listMetaInterface, err := meta.ListAccessor(list)
165 | 
166 | 		resourceVersion = listMetaInterface.GetResourceVersion()
167 | 		initTrace.Step("Resource version extracted")
168 | 		// Extract the items of the list object into a slice
169 | 		items, err := meta.ExtractList(list)
170 | 
171 | 		initTrace.Step("Objects extracted")
172 | 		// This step is key: it stores the listed items and the resource version into the store
173 | 		if err := r.syncWith(items, resourceVersion); err != nil {
174 | 			return fmt.Errorf("%s: Unable to sync list result: %v", r.name, err)
175 | 		}
176 | 		initTrace.Step("SyncWith done")
177 | 		// Record the latest resource version
178 | 		r.setLastSyncResourceVersion(resourceVersion)
179 | 		initTrace.Step("Resource version updated")
180 | 		return nil
181 | 	}(); err != nil {
182 | 		return err
183 | 	}
184 | 
185 | 	resyncerrc := make(chan error, 1)
186 | 	cancelCh := make(chan struct{})
187 | 	defer close(cancelCh)
188 | 	go func() {
189 | 		resyncCh, cleanup := r.resyncChan()
190 | 		defer func() {
191 | 			cleanup() // Call the last one written into cleanup
192 | 		}()
193 | 		for {
194 | 			select {
195 | 			case <-resyncCh:
196 | 			case <-stopCh:
197 | 				return
198 | 			case <-cancelCh:
199 | 				return
200 | 			}
201 | 			if r.ShouldResync == nil || r.ShouldResync() {
202 | 				if err := r.store.Resync(); err != nil {
203 | 					resyncerrc <- err
204 | 					return
205 | 				}
206 | 			}
207 | 			cleanup()
208 | 			resyncCh, cleanup = r.resyncChan()
209 | 		}
210 | 	}()
211 | 
212 | 	// Watch
213 | 	for {
214 | 		select {
215 | 		case <-stopCh:
216 | 			return nil
217 | 		default:
218 | 		}
219 | 		// Timeout for this watch request
220 | 		timeoutSeconds := int64(minWatchTimeout.Seconds() * (rand.Float64() + 1.0))
221 | 		options = metav1.ListOptions{
222 | 			ResourceVersion: resourceVersion,
223 | 			TimeoutSeconds: &timeoutSeconds,
224 | 		}
225 | 
226 | 		w, err := r.listerWatcher.Watch(options)
227 | 		if err != nil {
228 | 			switch err {
229 | 			case io.EOF:
230 | 			case io.ErrUnexpectedEOF:
231 | 			default:
232 | 				utilruntime.HandleError(fmt.Errorf("%s: Failed to watch %v: %v", r.name, r.expectedType, err))
233 | 			}
234 | 
235 | 			if urlError, ok := err.(*url.Error); ok {
236 | 				if opError, ok := urlError.Err.(*net.OpError); ok {
237 | 					if errno, ok := opError.Err.(syscall.Errno); ok && errno == syscall.ECONNREFUSED {
238 | 						time.Sleep(time.Second)
239 | 						continue
240 | 					}
241 | 				}
242 | 			}
243 | 			return nil
244 | 		}
245 | 		// watchHandler consumes the watch events and keeps the current resource version up to date
246 | 		if err := r.watchHandler(w, &resourceVersion, resyncerrc, stopCh); err != nil {
247 | 			if err != errorStopRequested {
248 | 				klog.Warningf("%s: watch of %v ended with: %v", r.name, r.expectedType, err)
249 | 			}
250 | 			return nil
251 | 		}
252 | 	}
253 | }
254 | 
255 | ```
256 | 
257 | 
258 | 
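259 | #### Concurrency control in Kubernetes
260 | 
261 | The `syncWith` call in the ListAndWatch code is important. Kubernetes implements concurrency control through `ResourceVersion`: every change to an object gives that object a new `ResourceVersion`. Clients should treat the value as opaque rather than assume it increases by exactly one per change.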
262 | 
263 | 
264 | 
265 | 
266 | 
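267 | ## Two-level cache: DeltaFIFO and Store
268 | 
269 | ### DeltaFIFO
270 | 
271 | We will understand DeltaFIFO through its data structures, starting with Delta.
272 | 
273 | The code is in `staging/src/k8s.io/client-go/tools/cache/delta_fifo.go`.
274 | 
275 | The code below shows clearly that a `Delta` stores a resource object together with the type of operation applied to it, for example an `Added` operation on a Pod. In plain terms, it records each change to a Kubernetes object.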
276 | 
277 | ```go
278 | type Delta struct {
279 | 	Type   DeltaType
280 | 	Object interface{}
281 | }
282 | 
283 | type DeltaType string
284 | 
285 | const (
286 | 	Added   DeltaType = "Added"
287 | 	Updated DeltaType = "Updated"
288 | 	Deleted DeltaType = "Deleted"	
289 | 	Sync DeltaType = "Sync"
290 | )
291 | ```
292 | 
293 | FIFO is easier to understand: it is a first-in, first-out queue. Its implementation is in `staging/src/k8s.io/client-go/tools/cache/fifo.go`:
294 | 
295 | ```go
296 | type Queue interface {
297 | 	Store
298 | 	// Queue extends Store with Pop, which lets objects be popped. Compare this with Indexer, which extends Store with indexes for fast object lookup.
299 | 	Pop(PopProcessFunc) (interface{}, error)
300 | 	AddIfNotPresent(interface{}) error
301 | 	HasSynced() bool
302 | 	Close()
303 | }
304 | ```
305 | 
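306 | Put together, DeltaFIFO is a first-in, first-out queue of changes to Kubernetes objects; for a single resource object it stores the deltas of its different operation types.
307 | 
308 | The `Get` and `GetByKey` methods of DeltaFIFO are simple, so the focus below is on the `queueActionLocked()` function.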
309 | 
310 | 
311 | 
312 | #### queueActionLocked
313 | 
314 | ```go
315 | func (f *DeltaFIFO) queueActionLocked(actionType DeltaType, obj interface{}) error {
316 | 	// Get the key of the object
317 | 	id, err := f.KeyOf(obj)
318 | 	if err != nil {
319 | 		return KeyError{obj, err}
320 | 	}
321 |     
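322 | 	// Append the new actionType for this object to its list of deltas
323 | 	newDeltas := append(f.items[id], Delta{actionType, obj})
324 | 	// Merge duplicates
325 | 	newDeltas = dedupDeltas(newDeltas)
326 | 	// At first I could not see how the length could be <= 0. The comment in the latest Kubernetes code says this should not happen normally, so the check is purely defensive.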
327 | 	if len(newDeltas) > 0 {
328 | 		if _, exists := f.items[id]; !exists {
329 | 			f.queue = append(f.queue, id)
330 | 		}
331 | 		f.items[id] = newDeltas        
332 | 		f.cond.Broadcast()
333 | 	} else {		
334 | 		delete(f.items, id)
335 | 	}
336 | 	return nil
337 | }
338 | ```
339 | 
340 | Now look at the **deduplication** code:
341 | 
342 | ```go
343 | func dedupDeltas(deltas Deltas) Deltas {
344 | 	n := len(deltas)
345 | 	// Fewer than two deltas means there is nothing to merge, so return directly
346 | 	if n < 2 {
347 | 		return deltas
348 | 	}
349 | 	a := &deltas[n-1]
350 | 	b := &deltas[n-2]
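351 | 	// isDup ends up calling isDeletionDup, which checks whether the last two operations on an object are both deletions. If so they are collapsed into one, since the object does not need to be deleted twice.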
352 | 	if out := isDup(a, b); out != nil {
353 | 		d := append(Deltas{}, deltas[:n-2]...)
354 | 		return append(d, *out)
355 | 	}
356 | 	return deltas
357 | }
358 | 
359 | func isDup(a, b *Delta) *Delta {
360 | 	if out := isDeletionDup(a, b); out != nil {
361 | 		return out
362 | 	}
363 | 	// TODO: Detect other duplicate situations? Are there any?
364 | 	return nil
365 | }
366 | ```
367 | 
368 | 
369 | 
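370 | Someone in a chat group once asked why dedupDeltas only compares the last and second-to-last entries of the list. dedupDeltas is called by queueActionLocked, and queueActionLocked is worth discussing because Delete/Update/Add all call it. Each call appends exactly one delta, so only the newest pair can form a new duplicate. Merging applies to the series of operations on one object, while deduplication only applies to deletes.
371 | 
372 | For example, suppose we have [obj1]: [add: delta1, update: delta2, delete: delta3, delete: delta3]. After queueActionLocked it becomes [obj1]: [add: delta1, update: delta2, delete: delta3].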
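The obj1 example can be reproduced with a small standalone sketch. This is a simplification under my own assumptions: objects are plain strings, `dedup` and `enqueue` are hypothetical helpers that mimic `dedupDeltas` and `queueActionLocked`, and the real choice of which deletion to keep is reduced to always keeping the earlier one.

```go
package main

import "fmt"

type DeltaType string

const (
	Added   DeltaType = "Added"
	Updated DeltaType = "Updated"
	Deleted DeltaType = "Deleted"
)

// Delta pairs an operation type with a (simplified) object.
type Delta struct {
	Type   DeltaType
	Object string
}

// dedup collapses the last two deltas into one when both are deletions.
func dedup(deltas []Delta) []Delta {
	n := len(deltas)
	if n < 2 {
		return deltas
	}
	last, prev := deltas[n-1], deltas[n-2]
	if last.Type == Deleted && prev.Type == Deleted {
		// Keep a single deletion (here the earlier one) in place of the pair.
		return append(append([]Delta{}, deltas[:n-2]...), prev)
	}
	return deltas
}

// enqueue appends one delta for a key and deduplicates the result.
func enqueue(items map[string][]Delta, key string, d Delta) {
	items[key] = dedup(append(items[key], d))
}

func main() {
	items := map[string][]Delta{}
	enqueue(items, "obj1", Delta{Added, "delta1"})
	enqueue(items, "obj1", Delta{Updated, "delta2"})
	enqueue(items, "obj1", Delta{Deleted, "delta3"})
	enqueue(items, "obj1", Delta{Deleted, "delta3"})
	// Four operations were enqueued, but only three deltas remain.
	fmt.Println(len(items["obj1"]), items["obj1"])
}
```

Since every enqueue appends one delta and deduplicates immediately, the list never holds two adjacent deletions, which is why checking only the last pair is enough.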
373 | 
374 | 
375 | 
376 | #### Consumer method
377 | 
378 | ```go
379 | func (f *DeltaFIFO) Pop(process PopProcessFunc) (interface{}, error) {
380 | 	f.lock.Lock()
381 | 	defer f.lock.Unlock()
382 | 	for {
383 | 		for len(f.queue) == 0 {
384 | 			// Always check the queue length (the len above) before checking whether the queue is closed
385 | 			if f.IsClosed() {
386 | 				return nil, FIFOClosedError
387 | 			}
388 | 
389 | 			f.cond.Wait()
390 | 		}
391 | 		id := f.queue[0]
392 | 		f.queue = f.queue[1:]
393 | 		if f.initialPopulationCount > 0 {
394 | 			f.initialPopulationCount--
395 | 		}
396 | 		item, ok := f.items[id]
397 | 		if !ok {
398 | 			// Item may have been deleted subsequently.
399 | 			continue
400 | 		}
401 | 		// Take the first entry f.queue[0], remove it from the queue, and hand it to process
402 | 		delete(f.items, id)
403 | 		err := process(item)
404 |         
405 | 		if e, ok := err.(ErrRequeue); ok {
406 | 			// Processing failed, so put the item back in the queue
407 | 			f.addIfNotPresent(id, item)
408 | 			err = e.Err
409 | 		}
410 | 		// Don't need to copyDeltas here, because we're transferring
411 | 		// ownership to the caller.
412 | 		return item, err
413 | 	}
414 | }
415 | ```
416 | 
417 | 
418 | 
419 | 
420 | 
421 | 
422 | 
423 | #### LocalStore
424 | 
425 | This is the cache. LocalStore is accessed through the `List`/`Get` methods of the `Lister`.
426 | 
427 | 
428 | 
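429 | ## SharedInformer: the sharing mechanism
430 | 
431 | As described in the workflow, the `DeltaFIFO` dispatches messages to the `SharedInformer`, so we can use the `Informer` to add custom callbacks: the familiar `OnAdd`, `OnUpdate`, and `OnDelete`.
432 | 
433 | 
434 | 
435 | Every built-in Kubernetes resource implements the Informer mechanism. The following is the Informer for Namespace:
436 | 
437 | The code is in `staging/src/k8s.io/client-go/informers/core/v1/namespace.go`.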
438 | 
439 | ```go
440 | type NamespaceInformer interface {
441 | 	Informer() cache.SharedIndexInformer
442 | 	Lister() v1.NamespaceLister
443 | }
444 | 
445 | ```
446 | 
447 | 
448 | 
449 | ## Indexer
450 | 
451 | The Indexer data structure is shown below. Indexer clearly embeds the Store interface and adds indexing on top of it.
452 | 
453 | ```go
454 | type Indexer interface {
455 | 	Store
456 | 	Index(indexName string, obj interface{}) ([]interface{}, error)
457 | ...
458 | }
459 | 
460 | ```
461 | 
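462 | Recall step 4 of the workflow: the `DeltaFIFO` is a first-in, first-out queue; whenever it holds data, the data is popped to the Controller, which stores the object in the `Indexer`. That step is where the Indexer's data comes from.
463 | 
464 | 
465 | 
466 | These are the key index types of the Indexer: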
467 | 
468 | ```go
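469 | // IndexFunc takes an object and returns the list of index values for it, for example the value of an annotation or label on a ConfigMap
470 | type IndexFunc func(obj interface{}) ([]string, error)
471 | // Indexers maps an indexer name to the function that implements it
472 | type Indexers map[string]IndexFunc
473 | // Indices maps an indexer name to its Index (index values to object keys)
474 | type Indices map[string]Index
475 | // Index is the index cache: a map from an index value to the set of matching object keys
476 | type Index map[string]sets.String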
477 | ```
478 | 
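479 | To summarize:
480 | 
481 | Indexers: indexer name --> index function --> index key values
482 | Indices: indexer name --> multiple index key values --> each index key value maps to a different set of resources
483 | 
484 | For example, a Pod object has the label app=version1. The label is the index key, and the Indexer puts all Pods with the same label into one set. Classifying objects by such keys is the core of what the Indexer does.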
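The label example can be sketched with plain maps. This is only an illustration of the idea, not client-go's implementation: `pod`, `indexFunc`, `index`, `byAppLabel`, `buildIndex`, and `lookup` are hypothetical names, and a map of empty structs stands in for `sets.String`.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a simplified object with a name (used as its key) and labels.
type pod struct {
	Name   string
	Labels map[string]string
}

// indexFunc returns the index values computed for one object.
type indexFunc func(p pod) []string

// index maps an index value to the set of object keys that share it.
type index map[string]map[string]struct{}

// byAppLabel is a hypothetical index function keyed on the "app" label.
func byAppLabel(p pod) []string {
	if v, ok := p.Labels["app"]; ok {
		return []string{v}
	}
	return nil
}

// buildIndex groups object keys by every index value the function returns.
func buildIndex(pods []pod, fn indexFunc) index {
	idx := index{}
	for _, p := range pods {
		for _, v := range fn(p) {
			if idx[v] == nil {
				idx[v] = map[string]struct{}{}
			}
			idx[v][p.Name] = struct{}{}
		}
	}
	return idx
}

// lookup returns the sorted object keys stored under one index value.
func lookup(idx index, value string) []string {
	names := make([]string, 0, len(idx[value]))
	for name := range idx[value] {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	pods := []pod{
		{Name: "pod-a", Labels: map[string]string{"app": "version1"}},
		{Name: "pod-b", Labels: map[string]string{"app": "version1"}},
		{Name: "pod-c", Labels: map[string]string{"app": "version2"}},
	}
	idx := buildIndex(pods, byAppLabel)
	fmt.Println(lookup(idx, "version1"))
}
```

Looking up `version1` only touches one set instead of scanning every Pod, which is the benefit the index gives over iterating the whole store.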
485 | 
486 | 
487 | 


--------------------------------------------------------------------------------
/client-go/study-client.md:
--------------------------------------------------------------------------------
  1 | # Overview
  2 | 
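  3 | This article is a source-code analysis based on the v1.14.10 branch of Kubernetes.
  4 | 
  5 | We can install client-go with `go get k8s.io/client-go@kubernetes-1.14.10`.
  6 | 
  7 | The goal is to understand and use the four client-go clients: why four are needed, which scenario each one fits, how to initialize them, and how to fetch resources with each of them.
  8 | 
  9 | # Clients
 10 | 
 11 | client-go provides four clients, briefly described below:
 12 | 
 13 | | Client          | Source directory      | Description |
 14 | | --------------- | --------------------- | ----------- |
 15 | | RESTClient      | client-go/rest/       | The base client; wraps HTTP requests |
 16 | | ClientSet       | client-go/kubernetes/ | Built on RESTClient with Resource and Version wrapped in, so using ClientSet requires knowing the Resource and Version, for example `AppsV1().Deployments` or `CoreV1().Pods`. Its drawback is that it cannot access custom resources defined by CRDs |
 17 | | DynamicClient   | client-go/dynamic/    | A set of dynamic clients that can perform generic operations on any Kubernetes API object, including custom resources defined by CRDs |
 18 | | DiscoveryClient | client-go/discovery/  | As noted above, ClientSet requires knowing the Resource and Version, but people (me, for example) cannot remember them all. DiscoveryClient discovers the resource groups, resource versions, and resource information that the API Server supports |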
 19 | 
 20 | 
 21 | 
 22 | # Exercises
 23 | 
 24 | ## Using RESTClient to get the Node list
 25 | 
 26 | The source is in `staging/src/k8s.io/client-go/rest/config.go`. When calling `RESTClientFor`, make sure both `GroupVersion` and `NegotiatedSerializer` are initialized:
 27 | 
 28 | ```go
 29 | func RESTClientFor(config *Config) (*RESTClient, error) {
 30 | 	if config.GroupVersion == nil {
 31 | 		return nil, fmt.Errorf("GroupVersion is required when initializing a RESTClient")
 32 | 	}
 33 | 	if config.NegotiatedSerializer == nil {
 34 | 		return nil, fmt.Errorf("NegotiatedSerializer is required when initializing a RESTClient")
 35 | 	}
 36 | ```
 37 | 
 38 | Exercise code:
 39 | 
 40 | ```go
 41 | package main
 42 | 
 43 | import (
 44 | 	"fmt"
 45 | 	corev1 "k8s.io/api/core/v1"
 46 | 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 47 | 	"k8s.io/client-go/rest"
 48 | 	"k8s.io/client-go/tools/clientcmd"
 49 | 	"k8s.io/client-go/kubernetes/scheme"
 50 | )
 51 | 
 52 | func main()  {
 53 | 	config, err := clientcmd.BuildConfigFromFlags("","/root/.kube/config")
 54 | 	if err != nil {
 55 | 		panic(err.Error())
 56 | 	}
 57 | 
 58 | 	config.APIPath = "api"
 59 | 	config.GroupVersion = &corev1.SchemeGroupVersion
 60 | 	config.NegotiatedSerializer = scheme.Codecs
 61 | 
 62 | 	restClient, err := rest.RESTClientFor(config)
 63 | 	if err != nil {
 64 | 		panic(err.Error())
 65 | 	}
 66 | 
 67 | 
 68 | 	result := &corev1.NodeList{}
 69 | 	err = restClient.Get().Namespace("").Resource("nodes").VersionedParams(&metav1.ListOptions{Limit: 100}, scheme.ParameterCodec).Do().Into(result)
 70 | 	if err != nil {
 71 | 		panic(err)
 72 | 	}
 73 | 
 74 | 	for _, d := range result.Items {
 75 | 		fmt.Printf("Node Name %v \n", d.Name)
 76 | 	}
 77 | }
 78 | 
 79 | ```
 80 | 
 81 | 
 82 | 
 83 | 
 84 | 
 85 | ## Using ClientSet to inject a Secret whenever a namespace appears
 86 | 
 87 | ```go
 88 | package main
 89 | 
 90 | import (
 91 | 	"encoding/base64"
 92 | 	"encoding/json"
 93 | 	apiv1 "k8s.io/api/core/v1"
 94 | 	"k8s.io/apimachinery/pkg/api/errors"
 95 | 	"k8s.io/apimachinery/pkg/apis/meta/v1"
 96 | 	"k8s.io/client-go/informers"
 97 | 	"k8s.io/client-go/kubernetes"
 98 | 	"k8s.io/client-go/rest"
 99 | 	"k8s.io/client-go/tools/cache"
100 | 	"log"
101 | 	// apiv1 "k8s.io/api/core/v1"
102 | 	"time"
103 | )
104 | 
105 | const create_secret_name = "xx-secret"
106 | 
107 | func  generateDockerConfig() []byte {
108 | 	type dockerConfig struct {
109 | 		Auths map[string]map[string]string `json:"auths"`
110 | 	}
111 | 	username, password := "xx", "xx"
112 | 	auth := map[string]string{
113 | 		"username": username,
114 | 		"password": password,
115 | 		"auth": base64.StdEncoding.EncodeToString([]byte(username + ":" + password)),
116 | 	}
117 | 	dockerconfig := dockerConfig{
118 | 		Auths: map[string]map[string]string{
119 | 			"docker-registry-url": auth,
120 | 		},
121 | 	}
122 | 	bytes, _ := json.Marshal(dockerconfig)
123 | 	return bytes
124 | }
125 | 
126 | func create_secret(namespace string,clientset *kubernetes.Clientset) error  {
127 | 	secretClient := clientset.CoreV1().Secrets(namespace)
128 | 
129 | 	// check secret exist in the namespace or not
130 | 	_, err := secretClient.Get(create_secret_name, v1.GetOptions{})
131 | 
132 | 	if errors.IsNotFound(err) {
133 | 		// if not exist, then create secret
134 | 		log.Printf("Secret %s in namespace %s not found\n", create_secret_name, namespace)
135 | 		log.Println("Start to create secret..")
136 | 
137 | 		secretObj := &apiv1.Secret{
138 | 
139 | 			TypeMeta: v1.TypeMeta{
140 | 				Kind:       "Secret",
141 | 				APIVersion: "v1",
142 | 			},
143 | 			ObjectMeta: v1.ObjectMeta{
144 | 				Name: create_secret_name,
145 | 			},
146 | 			Data: map[string][]byte{
147 | 				".dockerconfigjson": generateDockerConfig(),
148 | 			},
149 | 			Type: apiv1.SecretTypeDockerConfigJson,
150 | 		}
151 | 
152 | 		_, err := secretClient.Create(secretObj)
153 | 		if err != nil{
154 | 			return err
155 | 		} else {
156 | 			log.Println("create secret success")
157 | 			return nil
158 | 		}
159 | 	} else if statusError, isStatus := err.(*errors.StatusError); isStatus {
160 | 		log.Printf("Error getting Secret %s in namespace %s: %v\n",
161 | 		create_secret_name, namespace, statusError.ErrStatus.Message)
162 | 		return statusError
163 | 	} else if err != nil {
164 | 		return(err)
165 | 	} else {
166 | 		// if exist, then return
167 | 		log.Printf("Found secret in namespace %s\n",  namespace)
168 | 		return nil
169 | 	}		
170 | 
171 | }
172 | 
173 | func main() {
174 | 	// receive env 
175 | 
176 | 	// in cluster get config
177 | 	config, err := rest.InClusterConfig()
178 | 	if err != nil {
179 | 		panic(err.Error())
180 | 	}
181 | 
182 | 	// clientset
183 | 	clientset, err := kubernetes.NewForConfig(config)
184 | 	if err != nil {
185 | 		panic(err)
186 | 	}
187 | 
188 | 	// listen namespace informer for AddFunc
189 | 	stopCh := make(chan struct{})
190 | 	defer close(stopCh)
191 | 
192 | 	shareInformers := informers.NewSharedInformerFactory(clientset, time.Second)
193 | 	informer := shareInformers.Core().V1().Namespaces().Informer()
194 | 
195 | 	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
196 | 		AddFunc: func (obj interface{})  {
197 | 			nObj := obj.(v1.Object)
198 | 			log.Printf("New namespaces add %s", nObj.GetName())
199 | 
200 | 			// create secret in the ns
201 | 			if nObj.GetName() == "kube-node-lease" {
202 | 				log.Println("skip this namespace")
203 | 			} else if nObj.GetName() == "kube-public" {
204 | 				log.Println("skip this namespace")
205 | 			} else if nObj.GetName() == "kube-system" {
206 | 				log.Println("skip this namespace")
207 | 			} else {
208 | 				err := create_secret(nObj.GetName(), clientset)
209 | 				if err != nil {
210 | 					log.Printf("create secret fail, fail at %s", err)
211 | 				}
212 | 			}
213 | 		},
214 | 	})
215 | 	informer.Run(stopCh)
216 | 	
217 | }
218 | 
219 | ```
220 | 
221 | 
222 | 
223 | 
224 | 
225 | ## DynamicClient
226 | 
227 | Let's check whether the `kubectl api-resources` command uses DynamicClient behind the scenes. The code is in `pkg/kubectl/cmd/apiresources/apiresources.go`:
228 | 
229 | ```go
230 | func (o *APIResourceOptions) RunAPIResources(cmd *cobra.Command, f cmdutil.Factory) error {
231 | 	w := printers.GetNewTabWriter(o.Out)
232 | 	defer w.Flush()
233 | 
234 | 	// This shows that the command actually uses the DiscoveryClient
235 | 	discoveryclient, err := f.ToDiscoveryClient()
236 | 	if err != nil {
237 | 		return err
238 | 	}
239 | 
240 | 	if !o.Cached {
241 | 		// Always request fresh data from the server
242 | 		discoveryclient.Invalidate()
243 | 	}
244 | 
245 | 	errs := []error{}
246 |     
247 | 	lists, err := discoveryclient.ServerPreferredResources()	
248 | ...
249 | }
250 | ```
251 | 
252 | 
253 | 
254 | ## Using DiscoveryClient to discover all GVRs in the cluster
255 | 
256 | ```go
257 | package main
258 | 
259 | import (
260 | 	"fmt"
261 | 	"k8s.io/apimachinery/pkg/runtime/schema"
262 | 	"k8s.io/client-go/discovery"
263 | 	"k8s.io/client-go/tools/clientcmd"
264 | )
265 | 
266 | func main()  {
267 | 	config, err := clientcmd.BuildConfigFromFlags("","/root/.kube/config")
268 | 	if err != nil {
269 | 		panic(err.Error())
270 | 	}
271 | 
272 | 	discoveryClient, err := discovery.NewDiscoveryClientForConfig(config)
273 | 	if err != nil {
274 | 		panic(err.Error())
275 | 	}
276 | 
277 | 	_, APIResourceList, err := discoveryClient.ServerGroupsAndResources()
278 | 	if err != nil {
279 | 		panic(err.Error())
280 | 	}
281 | 	for _, list := range APIResourceList {
282 | 		gv, err := schema.ParseGroupVersion(list.GroupVersion)
283 | 		if err != nil {
284 | 			panic(err.Error())
285 | 		}
286 | 		for _, resource := range list.APIResources {
287 | 			fmt.Printf("name: %v, group: %v, version %v\n", resource.Name, gv.Group, gv.Version)
288 | 		}
289 | 	}
290 | }
291 | 
292 | ```
293 | 
294 | 
295 | 
296 | 
297 | 
298 | 
299 | 
300 | 


--------------------------------------------------------------------------------
/client-go/workqueue.md:
--------------------------------------------------------------------------------
  1 | # Overview
  2 | 
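  3 | This article is a source-code analysis based on the v1.14.10 branch of Kubernetes.
  4 | 
  5 | It introduces the WorkQueue component of `client-go` and explains how it works.
  6 | 
  7 | 
  8 | 
  9 | # Concepts
 10 | 
 11 | WorkQueue is a work queue. It is slightly more complex than a FIFO, and its main functions are **marking** and **deduplication**.
 12 | 
 13 | As the figure below shows, the WorkQueue in client-go plays a role similar to a chan: when a resource changes, a callback writes the resource into the queue, and the controller's worker consumers do the actual processing.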
 14 | 
 15 | ![1597560260584](C:\Users\EZLIUJA\AppData\Roaming\Typora\typora-user-images\1597560260584.png)
 16 | 
 17 | 
 18 | 
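 19 | ## Features
 20 | 
 21 | 1. Fairness: first in, first out
 22 | 2. Deduplication: an item added several times before it is processed is processed only once
 23 | 3. Multiple consumers and multiple producers
 24 | 4. Shutdown notification
 25 | 
 26 | # Generic queue
 27 | 
 28 | ## Data structure
 29 | 
 30 | The code is in `staging/src/k8s.io/client-go/util/workqueue/queue.go`.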
 31 | 
 32 | ```go
 33 | type Type struct {
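 34 | 	// queue is the work queue; it is a slice, and its main purpose is to keep the processing order
 35 | 	queue []t
 36 | 	// dirty holds all items that need to be processed; it is a map
 37 | 	dirty set
 38 | 	// processing marks the items currently being processed; it is a map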
 39 | 	processing set
 40 | 
 41 | 	cond *sync.Cond
 42 | 
 43 | 	shuttingDown bool
 44 | 
 45 | 	metrics queueMetrics
 46 | 
 47 | 	unfinishedWorkUpdatePeriod time.Duration
 48 | 	clock                      clock.Clock
 49 | }
 50 | 
 51 | type empty struct{}
 52 | type t interface{}
 53 | type set map[t]empty
 54 | ```
 55 | 
 56 | ## Interface
 57 | 
 58 | The `WorkQueue` interface provides the methods below. In short, it can add items, report its length, get items, and so on.
 59 | 
 60 | ```go
 61 | type Interface interface {
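 62 | 	// Add an item to the queue
 63 | 	Add(item interface{})
 64 | 	// Return the length of the queue
 65 | 	Len() int
 66 | 	// Get the item at the head of the queue
 67 | 	Get() (item interface{}, shutdown bool)
 68 | 	// Mark the item as done processing
 69 | 	Done(item interface{})
 70 | 	// Shut down the queue
 71 | 	ShutDown()
 72 | 	// Report whether the queue is shutting down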
 73 | 	ShuttingDown() bool
 74 | }
 75 | ```
 76 | 
 77 | ### The Get method and summary
 78 | 
 79 | Let's use the `Get` method to see what happens.
 80 | 
 81 | ```go
 82 | func (q *Type) Get() (item interface{}, shutdown bool) {
 83 | 	// The lock ensures only one caller at a time takes an item from the head of the queue
 84 | 	q.cond.L.Lock()
 85 | 	defer q.cond.L.Unlock()
 86 | ...
 87 | 	// Take exactly one item each time
 88 | 	item, q.queue = q.queue[0], q.queue[1:]
 89 | 
 90 | 	q.metrics.get(item)
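 91 | 	// Insert the item into processing. As insert below shows, the same item is only stored once (because it is a map)
 92 | 	// This also shows that queue, processing, and dirty each track the same items in their own structure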
 93 | 	q.processing.insert(item)
 94 | 	// Remove the item from dirty
 95 | 	q.dirty.delete(item)
 96 | 
 97 | 	return item, false
 98 | }
 99 | 
100 | func (s set) insert(item t) {
101 | 	s[item] = empty{}
102 | }
103 | ```
104 | 
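105 | From the `Get` method above we can see that queue, processing, and dirty each track the same items in their own structure.
106 | 
107 | ### The Add method and summary
108 | 
109 | Next, let's use the `Add` method to see what happens.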
110 | 
111 | ```go
112 | func (q *Type) Add(item interface{}) {
113 | 	q.cond.L.Lock()
114 | 	defer q.cond.L.Unlock()
115 | 	if q.shuttingDown {
116 | 		return
117 | 	}
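118 | 	// From Get we already know that queue, processing, and dirty each track the same items in their own structure
119 | 	// Every Add checks whether dirty already contains the item and returns immediately if so; this is the marking and deduplication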
120 | 	if q.dirty.has(item) {
121 | 		return
122 | 	}
123 | 	// If dirty does not contain the item, it is added to dirty, and appended to queue only if it is not currently in processing
124 | 	q.metrics.add(item)
125 | 
126 | 	q.dirty.insert(item)
127 | 	if q.processing.has(item) {
128 | 		return
129 | 	}
130 | 
131 | 	q.queue = append(q.queue, item)
132 | 	q.cond.Signal()
133 | }
134 | ```
135 | 
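136 | In other words, every Add to the WorkQueue goes through this marking and deduplication check, so an item added several times before it is processed is processed only once.
137 | 
138 | ## Initialization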
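To see how queue, dirty, and processing interact, the following is a stripped-down sketch that can be run on its own. It is a simplification under stated assumptions: items are strings, there is no locking, condition variable, or metrics, and the `Done` method reflects my understanding of the real queue (it re-queues an item that was marked dirty again while it was being processed), which the excerpts above do not show.

```go
package main

import "fmt"

type set map[string]struct{}

// queue is a simplified, single-goroutine model of the work queue.
type queue struct {
	items      []string // processing order
	dirty      set      // items that still need processing
	processing set      // items currently being processed
}

func newQueue() *queue { return &queue{dirty: set{}, processing: set{}} }

// Add marks the item dirty and queues it unless it is already waiting
// or currently being processed.
func (q *queue) Add(item string) {
	if _, ok := q.dirty[item]; ok {
		return // already waiting: deduplicated
	}
	q.dirty[item] = struct{}{}
	if _, ok := q.processing[item]; ok {
		return // will be re-queued by Done
	}
	q.items = append(q.items, item)
}

// Get moves the head item from dirty to processing.
func (q *queue) Get() (string, bool) {
	if len(q.items) == 0 {
		return "", false
	}
	item := q.items[0]
	q.items = q.items[1:]
	q.processing[item] = struct{}{}
	delete(q.dirty, item)
	return item, true
}

// Done finishes processing and re-queues the item if it became dirty again.
func (q *queue) Done(item string) {
	delete(q.processing, item)
	if _, ok := q.dirty[item]; ok {
		q.items = append(q.items, item)
	}
}

func main() {
	q := newQueue()
	q.Add("a")
	q.Add("a") // ignored: already dirty
	q.Add("b")
	item, _ := q.Get()
	q.Add("a") // marked dirty, but not queued while "a" is processing
	fmt.Println(item, q.items)
	q.Done("a") // re-queued because it was marked dirty again
	fmt.Println(q.items)
}
```

The processing set is what prevents the same item from being handled by two workers at once: a change that arrives mid-processing is remembered in dirty and only queued again after `Done`.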
139 | 
140 | ```go
141 | 
142 | func newQueue(c clock.Clock, metrics queueMetrics, updatePeriod time.Duration) *Type {
143 | 	t := &Type{
144 | 		clock:                      c,
145 | 		dirty:                      set{},
146 | 		processing:                 set{},
147 | 		cond:                       sync.NewCond(&sync.Mutex{}),
148 | 		metrics:                    metrics,
149 | 		unfinishedWorkUpdatePeriod: updatePeriod,
150 | 	}
151 | 	// Start a goroutine that periodically syncs the metrics while the queue is not shut down
152 | 	go t.updateUnfinishedWorkLoop()
153 | 	return t
154 | }
155 | 
156 | 
157 | func (q *Type) updateUnfinishedWorkLoop() {
158 | 	t := q.clock.NewTicker(q.unfinishedWorkUpdatePeriod)
159 | 	defer t.Stop()
160 | 	for range t.C() {
161 | 		if !func() bool {
162 | 			q.cond.L.Lock()
163 | 			defer q.cond.L.Unlock()
164 | 			if !q.shuttingDown {
165 | 				q.metrics.updateUnfinishedWork()
166 | 				return true
167 | 			}
168 | 			return false
169 | 
170 | 		}() {
171 | 			return
172 | 		}
173 | 	}
174 | }
175 | ```
176 | 
177 | 
178 | 
179 | # Delaying queue
180 | 
181 | ## Data structure
182 | 
183 | The code is in `staging/src/k8s.io/client-go/util/workqueue/delaying_queue.go`.
184 | 
185 | The data structure shows that the delaying queue wraps the generic queue.
186 | 
187 | ```go
188 | type delayingType struct {
189 | 	Interface
190 | 	// clock tracks time for delayed firing
191 | 	clock clock.Clock
192 | 	// A buffered channel that carries the items waiting to be added
193 | 	waitingForAddCh chan *waitFor
194 | ...
195 | }
196 | ```
197 | 
198 | 
199 | 
200 | ## Interface
201 | 
202 | The interface shows that the delaying queue wraps the generic queue and adds the AddAfter method.
203 | 
204 | ```go
205 | type DelayingInterface interface {
206 | 	Interface
207 | 	// Add an item after a delay
208 | 	AddAfter(item interface{}, duration time.Duration)
209 | }
210 | ```
211 | 
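212 | ### The AddAfter method and summary
213 | 
214 | The AddAfter method below uses the duration argument to decide whether to add the item to the queue immediately or to send it to the queue's waitingForAddCh chan. Keep this chan in mind; it is analyzed in the initialization section.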
215 | 
216 | ```go
217 | func (q *delayingType) AddAfter(item interface{}, duration time.Duration) {
218 | 	// don't add if we're already shutting down
219 | 	if q.ShuttingDown() {
220 | 		return
221 | 	}
222 | 
223 | 	q.metrics.retry()
224 | 	q.deprecatedMetrics.retry()
225 | 	
226 | 	// If duration is less than or equal to 0, add the item to the queue immediately
227 | 	if duration <= 0 {
228 | 		q.Add(item)
229 | 		return
230 | 	}
231 | 
232 | 	select {
233 | 	case <-q.stopCh:
234 | 		// unblock if ShutDown() is called
235 | 		// Otherwise send the item to the waitingForAddCh chan. The waitFor structure is interesting: it stores the item together with its readyAt timestamp
236 | 	case q.waitingForAddCh <- &waitFor{data: item, readyAt: q.clock.Now().Add(duration)}:
237 | 	}
238 | }
239 | 
240 | ```
241 | 
242 | 
243 | 
244 | ## Initialization
245 | 
246 | ```go
247 | func newDelayingQueue(clock clock.Clock, name string) DelayingInterface {
248 | 	ret := &delayingType{
249 | 		Interface:         NewNamed(name),
250 | 		clock:             clock,
251 | 		heartbeat:         clock.NewTicker(maxWait),
252 | 		stopCh:            make(chan struct{}),
253 | 		waitingForAddCh:   make(chan *waitFor, 1000),
254 | 		metrics:           newRetryMetrics(name),
255 | 		deprecatedMetrics: newDeprecatedRetryMetrics(name),
256 | 	}
257 | 	// This is the key part: the code above only builds the object, and this goroutine does the actual delayed adding of items to the queue
258 | 	go ret.waitingLoop()
259 | 
260 | 	return ret
261 | }
262 | 
263 | func (q *delayingType) waitingLoop() {
264 | 	defer utilruntime.HandleCrash()
265 | 
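266 | 	// Create an unbuffered channel; nothing is ever sent on it
267 | 	never := make(<-chan time.Time)
268 | 	// waitingForQueue is backed by a slice
269 | 	waitingForQueue := &waitForPriorityQueue{}
270 | 	// This is interesting: heap builds a tree, and the heap is used to implement a priority queue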
271 | 	heap.Init(waitingForQueue)
272 | 
273 | 	waitingEntryByData := map[t]*waitFor{}
274 | 
275 | 	for {
276 | 		if q.Interface.ShuttingDown() {
277 | 			return
278 | 		}
279 | 
280 | 		now := q.clock.Now()
281 | 		
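282 | 		// From AddAfter we know that items to be added later are sent to waitingForAddCh. waitingForQueue is the priority queue that holds the entries received from that chan, and its length tells us whether any delayed items are waiting
283 | 		// Loop while the queue is not empty: if the head entry's readyAt is still in the future, break out; otherwise pop it and add it to the queue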
284 | 		for waitingForQueue.Len() > 0 {
285 | 			entry := waitingForQueue.Peek().(*waitFor)
286 | 			if entry.readyAt.After(now) {
287 | 				break
288 | 			}
289 | 
290 | 			// The delay has elapsed: pop the entry from the priority queue and add it to the queue, in readyAt order
291 | 			entry = heap.Pop(waitingForQueue).(*waitFor)
292 | 			q.Add(entry.data)
293 | 			delete(waitingEntryByData, entry.data)
294 | 		}
295 | 
296 | 		// Set up a wait for the first item's readyAt (if one exists)
297 | 		nextReadyAt := never
298 | 		// Compute how long to wait until the earliest pending item becomes ready.
299 | 		if waitingForQueue.Len() > 0 {
300 | 			entry := waitingForQueue.Peek().(*waitFor)
301 | 			nextReadyAt = q.clock.After(entry.readyAt.Sub(now))
302 | 		}
303 | 
304 | 		select {
305 | 		case <-q.stopCh:
306 | 			return
307 | 
308 | 		case <-q.heartbeat.C():
309 | 			// continue the loop, which will add ready items
310 | 
311 | 		case <-nextReadyAt:
312 | 			// continue the loop, which will add ready items
313 | 
314 | 		// Receive an item that AddAfter sent to waitingForAddCh.
315 | 		case waitEntry := <-q.waitingForAddCh:
316 | 			if waitEntry.readyAt.After(q.clock.Now()) {
317 | 				insert(waitingForQueue, waitingEntryByData, waitEntry)
318 | 			} else {
319 | 				q.Add(waitEntry.data)
320 | 			}
321 | 
322 | 			drained := false
323 | 			for !drained {
324 | 				select {
325 | 				case waitEntry := <-q.waitingForAddCh:
326 | 					if waitEntry.readyAt.After(q.clock.Now()) {
327 | 						insert(waitingForQueue, waitingEntryByData, waitEntry)
328 | 					} else {
329 | 						q.Add(waitEntry.data)
330 | 					}
331 | 				default:
332 | 					drained = true
333 | 				}
334 | 			}
335 | 		}
336 | 	}
337 | }
338 | ```
339 | 
340 | 
341 | 
342 | # Rate-limiting queue
343 | 
344 | Source: `staging/src/k8s.io/client-go/util/workqueue/rate_limiting_queue.go`
345 | 
346 | ## Data structures
347 | 
348 | ```go
349 | type rateLimitingType struct {
350 | 	DelayingInterface
351 | 	rateLimiter RateLimiter
352 | }
353 | 
354 | type RateLimiter interface {	
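355 | 	// When returns how long the item should wait before it is added to the queue.
356 | 	When(item interface{}) time.Duration
357 | 	// Forget clears the requeue count recorded for the item.
358 | 	Forget(item interface{})
359 | 	// NumRequeues returns the requeue count recorded for the item.
360 | 	NumRequeues(item interface{}) int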
360 | 	NumRequeues(item interface{}) int
361 | }
362 | 
363 | ```
364 | 
365 | 
366 | 
367 | ## Interface
368 | 
369 | As the interface shows, the rate-limiting queue wraps the delaying queue and adds three methods: AddRateLimited, Forget, and NumRequeues.
370 | 
371 | ```go
372 | type RateLimitingInterface interface {
373 | 	DelayingInterface	
374 | 	// AddRateLimited adds the item to the work queue once the rate limiter allows it; under the hood it calls the delaying queue's AddAfter.
375 | 	AddRateLimited(item interface{})
376 | 
377 | 	// Forget indicates that an item is finished being retried.  Doesn't matter whether it's for perm failing
378 | 	// or for success, we'll stop the rate limiter from tracking it.  This only clears the `rateLimiter`, you
379 | 	// still have to call `Done` on the queue.
380 | 	Forget(item interface{})
381 | 
382 | 	// NumRequeues returns back how many times the item was requeued
383 | 	NumRequeues(item interface{}) int
384 | }
385 | 
386 | // AddRateLimited calls the delaying queue's AddAfter with the delay chosen by the rate limiter.
387 | func (q *rateLimitingType) AddRateLimited(item interface{}) {
388 | 	q.DelayingInterface.AddAfter(item, q.rateLimiter.When(item))
389 | }
390 | 
391 | // Forget tells the rate limiter to stop tracking the item.
392 | func (q *rateLimitingType) Forget(item interface{}) {
393 | 	q.rateLimiter.Forget(item)
394 | }
395 | ```
396 | 
397 | 
398 | 
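399 | ## Rate-limiting algorithms
400 | 
401 | 
402 | 
403 | ### Token bucket algorithm
404 | 
405 | `BucketRateLimiter`
406 | 
407 | It is implemented with the third-party package `golang.org/x/time/rate`.
408 | 
409 | The default controller rate limiter instantiates a token bucket. Tokens are added to the bucket at a fixed rate, and every item being added must obtain a token; when no token is available the item has to wait, which limits the overall rate.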
410 | 
411 | ```go
412 | func DefaultControllerRateLimiter() RateLimiter {
413 | 	return NewMaxOfRateLimiter(
414 | 		NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
415 | 		// 10 qps, 100 bucket size.  This is only for retry speed and its only the overall factor (not per item)
416 | 		&BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
417 | 	)
418 | }
419 | 
420 | func (r *BucketRateLimiter) When(item interface{}) time.Duration {
421 | 	return r.Limiter.Reserve().Delay()
422 | }
423 | ```
424 | 
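425 | `rate.NewLimiter(rate.Limit(10), 100)`
426 | 
427 | The first argument, 10, is the number of tokens added to the bucket per second.
428 | 
429 | The second argument, 100, is the bucket size, i.e. the maximum number of tokens the bucket can hold.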
430 | 
431 | 
432 | 
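433 | ### Per-item exponential backoff algorithm
434 | 
435 | `ItemExponentialFailureRateLimiter`
436 | 
437 | This algorithm uses the number of times the **same item** has been requeued as the exponent: the delay is `baseDelay * 2^n`, so it grows exponentially with the requeue count, but it never exceeds `maxDelay`.
438 | 
439 | The rate-limiting queue relies on the delaying queue to postpone repeated insertions of the same item, which is how the rate limit is achieved.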
440 | 
441 | ```go
442 | type ItemExponentialFailureRateLimiter struct {
443 | 	// Mutex protecting the failures map
444 | 	failuresLock sync.Mutex
445 | 	// Failure count per item
446 | 	failures     map[interface{}]int
447 | 
448 | 	baseDelay time.Duration
449 | 	maxDelay  time.Duration
450 | }
451 | // Constructor
452 | func NewItemExponentialFailureRateLimiter(baseDelay time.Duration, maxDelay time.Duration) RateLimiter {
453 | 	return &ItemExponentialFailureRateLimiter{
454 | 		failures:  map[interface{}]int{},
455 | 		baseDelay: baseDelay,
456 | 		maxDelay:  maxDelay,
457 | 	}
458 | }
459 | 
460 | // The logic is simple: derive the delay from the failure count and return it, unless it exceeds maxDelay, in which case maxDelay is returned.
461 | func (r *ItemExponentialFailureRateLimiter) When(item interface{}) time.Duration {
462 | 	r.failuresLock.Lock()
463 | 	defer r.failuresLock.Unlock()
464 | 
465 | 	exp := r.failures[item]
466 | 	r.failures[item] = r.failures[item] + 1
467 | 
468 | 	// The backoff is capped such that 'calculated' value never overflows.
469 | 	backoff := float64(r.baseDelay.Nanoseconds()) * math.Pow(2, float64(exp))
470 | 	if backoff > math.MaxInt64 {
471 | 		return r.maxDelay
472 | 	}
473 | 
474 | 	calculated := time.Duration(backoff)
475 | 	if calculated > r.maxDelay {
476 | 		return r.maxDelay
477 | 	}
478 | 
479 | 	return calculated
480 | }
481 | ```
482 | 
484 | 
485 | 
486 | 
487 | # ParallelizeUntil
488 | 
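489 | Source: `staging/src/k8s.io/client-go/util/workqueue/parallelizer.go`
490 | 
491 | ParallelizeUntil processes work concurrently: there are `pieces` tasks in total, and `workers` goroutines pull them from a channel and pass each one to `doWorkPiece`. In other words, it is a multi-consumer pattern.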
492 | 
493 | ```go
494 | func ParallelizeUntil(ctx context.Context, workers, pieces int, doWorkPiece DoWorkPieceFunc) {
495 | 	var stop <-chan struct{}
496 | 	if ctx != nil {
497 | 		stop = ctx.Done()
498 | 	}
499 | 
500 | 	toProcess := make(chan int, pieces)
501 | 	for i := 0; i < pieces; i++ {
502 | 		toProcess <- i
503 | 	}
504 | 	close(toProcess)
505 | 
506 | 	if pieces < workers {
507 | 		workers = pieces
508 | 	}
509 | 
510 | 	wg := sync.WaitGroup{}
511 | 	wg.Add(workers)
512 | 	for i := 0; i < workers; i++ {
513 | 		go func() {
514 | 			defer utilruntime.HandleCrash()
515 | 			defer wg.Done()
516 | 			for piece := range toProcess {
517 | 				select {
518 | 				case <-stop:
519 | 					return
520 | 				default:
521 | 					doWorkPiece(piece)
522 | 				}
523 | 			}
524 | 		}()
525 | 	}
526 | 	wg.Wait()
527 | }
528 | ```
529 | 
530 | 
531 | 
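532 | # Summary
533 | 
534 | The workqueue package shows that much of the Kubernetes code keeps its interfaces small and then uses composition to get what Java would achieve with inheritance. For example, the delaying queue is the basic queue plus one extra interface method for delayed adds.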
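The composition pattern can be shown in a few lines. The sketch below is a toy model, not the client-go types: `Queue`, `DelayingQueue`, `basicQueue`, and `delayingQueue` are hypothetical names, and the delay is only recorded in a map instead of being driven by a timer.

```go
package main

import "fmt"

// Queue is the small base interface.
type Queue interface {
	Add(item string)
	Len() int
}

// DelayingQueue embeds Queue and adds a single method.
type DelayingQueue interface {
	Queue
	AddAfter(item string, delaySeconds int)
}

type basicQueue struct{ items []string }

func (q *basicQueue) Add(item string) { q.items = append(q.items, item) }
func (q *basicQueue) Len() int        { return len(q.items) }

// delayingQueue gets Add and Len from the embedded Queue and only
// implements the extra method itself.
type delayingQueue struct {
	Queue
	pending map[string]int // item -> requested delay (toy model, no real timers)
}

func newDelayingQueue() *delayingQueue {
	return &delayingQueue{Queue: &basicQueue{}, pending: map[string]int{}}
}

func (q *delayingQueue) AddAfter(item string, delaySeconds int) {
	if delaySeconds <= 0 {
		q.Add(item) // promoted from the embedded Queue
		return
	}
	q.pending[item] = delaySeconds
}

// Compile-time check that delayingQueue satisfies DelayingQueue.
var _ DelayingQueue = (*delayingQueue)(nil)

func main() {
	q := newDelayingQueue()
	q.Add("a")
	q.AddAfter("b", 0)
	q.AddAfter("c", 5)
	fmt.Println(q.Len(), len(q.pending))
}
```

Embedding the interface rather than a concrete type keeps the wrapper independent of how the base queue is implemented, which is the same choice `delayingType` makes with its `Interface` field.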
535 | 
536 | 
537 | 
538 | 
539 | 
540 | 
541 | 
542 | # Reference
543 | 
544 | 《Kubernetes 源码剖析》 (Kubernetes Source Code Analysis), Chapter 5


--------------------------------------------------------------------------------
/controller/deployment-controller.md:
--------------------------------------------------------------------------------
  1 | # Overview
  2 | 
  3 | This article is a source-code analysis based on the Kubernetes master branch (commitid: ).
  4 | 
  5 | It introduces the deployment controller and explains how it works.
  6 | 
  7 | Code location: `pkg/controller/deployment`
  8 | 
  9 | 
 10 | 
 11 | # Data structures
 12 | 
 13 | ```go
 14 | type DeploymentController struct {
 15 | 	// replica sets controller
 16 | 	rsControl     controller.RSControlInterface
 17 | 	client        clientset.Interface
 18 | 	// Event recorder
 19 | 	eventRecorder record.EventRecorder
 20 | 
 21 | 	
 22 | 	// used for unit testing
 23 | 	enqueueDeployment func(deployment *apps.Deployment)
 24 | 	
 25 | 	// Lister that lists/gets Deployments from the Deployment informer's store
 26 | 	dLister appslisters.DeploymentLister
 27 | 	// Lister that lists/gets ReplicaSets from the ReplicaSet informer's store
 28 | 	rsLister appslisters.ReplicaSetLister
 29 | 	// Lister that lists/gets Pods from the Pod informer's store
 30 | 	podLister corelisters.PodLister
 31 | 
 32 | 	...
 33 | 	// Work queue
 34 | 	queue workqueue.RateLimitingInterface
 35 | }
 36 | ```
 37 | 
 38 | 
 39 | 
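 40 | # Instantiating the Deployment controller
 41 | 
 42 | `NewDeploymentController` instantiates the controller and registers handlers for add, update, and delete events from the Deployment informer and the ReplicaSet informer, and for delete events from the Pod informer.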
 43 | 
 44 | ```go
 45 | func NewDeploymentController(dInformer appsinformers.DeploymentInformer, rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, client clientset.Interface) (*DeploymentController, error) {
 46 | 	// Event broadcaster
 47 | 	eventBroadcaster := record.NewBroadcaster()
 48 | 	eventBroadcaster.StartStructuredLogging(0)
 49 | 	eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: client.CoreV1().Events("")})
 50 | 
 51 | 	if client != nil && client.CoreV1().RESTClient().GetRateLimiter() != nil {
 52 | 		if err := ratelimiter.RegisterMetricAndTrackRateLimiterUsage("deployment_controller", client.CoreV1().RESTClient().GetRateLimiter()); err != nil {
 53 | 			return nil, err
 54 | 		}
 55 | 	}
 56 | 	// Instantiate the controller
 57 | 	dc := &DeploymentController{
 58 | 		client:        client,
 59 | 		eventRecorder: eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "deployment-controller"}),
 60 | 		queue:         workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "deployment"),
 61 | 	}
 62 | 	dc.rsControl = controller.RealRSControl{
 63 | 		KubeClient: client,
 64 | 		Recorder:   dc.eventRecorder,
 65 | 	}
 66 | 
 67 | 	dInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
 68 | 		AddFunc:    dc.addDeployment,
 69 | 		UpdateFunc: dc.updateDeployment,
 70 | 		DeleteFunc: dc.deleteDeployment,
 71 | 	})
 72 | 	rsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
 73 | 		AddFunc:    dc.addReplicaSet,
 74 | 		UpdateFunc: dc.updateReplicaSet,
 75 | 		DeleteFunc: dc.deleteReplicaSet,
 76 | 	})
 77 | 	podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
 78 | 		DeleteFunc: dc.deletePod,
 79 | 	})
 80 | 
 81 | 	dc.syncHandler = dc.syncDeployment
 82 | 	dc.enqueueDeployment = dc.enqueue
 83 | 
 84 | 	dc.dLister = dInformer.Lister()
 85 | 	dc.rsLister = rsInformer.Lister()
 86 | 	dc.podLister = podInformer.Lister()
 87 | 	dc.dListerSynced = dInformer.Informer().HasSynced
 88 | 	dc.rsListerSynced = rsInformer.Informer().HasSynced
 89 | 	dc.podListerSynced = podInformer.Informer().HasSynced
 90 | 	return dc, nil
 91 | }
 92 | ```
 93 | 
 94 | 
 95 | 
 96 | # Run
 97 | 
 98 | ```go
 99 | func (dc *DeploymentController) Run(workers int, stopCh <-chan struct{}) {	
100 | ...
101 | 
102 | 	for i := 0; i < workers; i++ {
103 | 		// Start each worker in its own goroutine
104 | 		go wait.Until(dc.worker, time.Second, stopCh)
105 | 	}
106 | 	<-stopCh
107 | }
108 | ```
109 | 
110 | ## worker
111 | 
112 | ```go
113 | func (dc *DeploymentController) worker() {
114 | 	// Keep calling processNextWorkItem until it returns false, i.e. until the queue has been shut down
115 | 	for dc.processNextWorkItem() {
116 | 	}
117 | }
118 | 
119 | 
120 | func (dc *DeploymentController) processNextWorkItem() bool {
121 | 	key, quit := dc.queue.Get()
122 | 	if quit {
123 | 		return false
124 | 	}
125 | 	defer dc.queue.Done(key)
126 | 
127 | 	// Call syncHandler for every key taken from the work queue. As in the ReplicaSet controller, the handler is wired up at instantiation time via dc.syncHandler = dc.syncDeployment
128 | 	err := dc.syncHandler(key.(string))
129 | 	dc.handleErr(err, key)
130 | 
131 | 	return true
132 | }
133 | 
134 | ```
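The worker loop can be tried out without a cluster. The sketch below replaces the work queue with a closed channel; `toyQueue`, `processNext`, `runWorker`, and `failOn` are all hypothetical names, and failed keys are merely collected where a real controller would requeue them with rate limiting in `handleErr`.

```go
package main

import (
	"errors"
	"fmt"
)

// toyQueue is a hypothetical stand-in for the work queue: Get blocks until a
// key is available and reports quit=true once the queue is closed and drained.
type toyQueue struct{ ch chan string }

func (q *toyQueue) Get() (string, bool) {
	key, ok := <-q.ch
	return key, !ok
}

// processNext handles one key and returns false when the queue has shut down.
func processNext(q *toyQueue, handle func(key string) error, failed *[]string) bool {
	key, quit := q.Get()
	if quit {
		return false
	}
	if err := handle(key); err != nil {
		// A real controller would requeue the key with rate limiting here.
		*failed = append(*failed, key)
	}
	return true
}

// runWorker drains the queue and returns the keys whose handler failed.
func runWorker(keys []string, handle func(key string) error) []string {
	q := &toyQueue{ch: make(chan string, len(keys))}
	for _, k := range keys {
		q.ch <- k
	}
	close(q.ch) // plays the role of shutting the queue down

	var failed []string
	for processNext(q, handle, &failed) {
	}
	return failed
}

// failOn returns a handler that fails only for the given key.
func failOn(bad string) func(string) error {
	return func(key string) error {
		if key == bad {
			return errors.New("sync failed for " + key)
		}
		return nil
	}
}

func main() {
	keys := []string{"default/web", "default/broken", "kube-system/dns"}
	fmt.Println(runWorker(keys, failOn("default/broken")))
}
```

The boolean returned by `processNext` is what lets the outer `for` loop end cleanly on shutdown, mirroring the empty-bodied loop in `worker`.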
135 | 
136 | 
137 | 
138 | ## syncHandler --> syncDeployment
139 | 
140 | ```go
141 | func (dc *DeploymentController) syncDeployment(key string) error {
142 | 	startTime := time.Now()
143 | ...
144 | 	namespace, name, err := cache.SplitMetaNamespaceKey(key)
145 | 	...
146 | 	deployment, err := dc.dLister.Deployments(namespace).Get(name)
147 | 	...
148 | 	// Deep-copy otherwise we are mutating our cache.
149 | 	// TODO: Deep-copy only when needed.
150 | 	d := deployment.DeepCopy()
151 | 
152 | 	everything := metav1.LabelSelector{}
153 | 	if reflect.DeepEqual(d.Spec.Selector, &everything) {
154 | 		dc.eventRecorder.Eventf(d, v1.EventTypeWarning, "SelectingAll", "This deployment is selecting all pods. A non-empty selector is required.")
155 | 		if d.Status.ObservedGeneration < d.Generation {
156 | 			d.Status.ObservedGeneration = d.Generation
157 | 			dc.client.AppsV1().Deployments(d.Namespace).UpdateStatus(context.TODO(), d, metav1.UpdateOptions{})
158 | 		}
159 | 		return nil
160 | 	}
161 | 	
162 | 	// Get the ReplicaSets whose owner reference points to this Deployment; getReplicaSetsForDeployment is explained in detail further down
163 | 	rsList, err := dc.getReplicaSetsForDeployment(d)
164 | 
165 | 	// Find the Pods that belong to this Deployment
166 | 	podMap, err := dc.getPodMapForDeployment(d, rsList)
167 | 
168 | ...
169 | 	// isScalingEvent returns a bool and an error. true means Deployment.Spec.Replicas no longer matches the replica count of the owned ReplicaSets, so sync is called to carry out the scaling
170 | 	scalingEvent, err := dc.isScalingEvent(d, rsList)
171 | 	if err != nil {
172 | 		return err
173 | 	}
174 | 	if scalingEvent {
175 | 		return dc.sync(d, rsList)
176 | 	}
177 | 
178 | 	// Dispatch on the Deployment's Spec.Strategy to decide what to do next, i.e. whether ReplicaSets have to be created or deleted
179 | 	switch d.Spec.Strategy.Type {
180 | 	case apps.RecreateDeploymentStrategyType:
181 | 		return dc.rolloutRecreate(d, rsList, podMap)
182 | 	case apps.RollingUpdateDeploymentStrategyType:
183 | 		return dc.rolloutRolling(d, rsList)
184 | 	}
185 | 	return fmt.Errorf("unexpected deployment strategy type: %s", d.Spec.Strategy.Type)
186 | }
187 | 
188 | ```
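`syncDeployment` starts by splitting the queue key back into a namespace and a name. The sketch below shows what such a splitter has to do for keys of the form `namespace/name`. It uses only the standard library, and `splitKey` is a hypothetical name rather than the client-go function.

```go
package main

import (
	"fmt"
	"strings"
)

// splitKey splits a "namespace/name" key. A key without a slash is treated
// as the name of a cluster-scoped object.
func splitKey(key string) (namespace, name string, err error) {
	parts := strings.Split(key, "/")
	switch len(parts) {
	case 1:
		// Cluster-scoped objects have no namespace part.
		return "", parts[0], nil
	case 2:
		return parts[0], parts[1], nil
	}
	return "", "", fmt.Errorf("unexpected key format: %q", key)
}

func main() {
	for _, key := range []string{"default/nginx", "node-1", "a/b/c"} {
		ns, name, err := splitKey(key)
		fmt.Printf("key=%q namespace=%q name=%q err=%v\n", key, ns, name, err)
	}
}
```

Putting only the string key on the queue, instead of the object itself, means the handler always reads the latest state from the lister when it finally runs.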
189 | 
190 | 
191 | 
192 | ### getReplicaSetsForDeployment
193 | 
194 | ```go
195 | func (dc *DeploymentController) getReplicaSetsForDeployment(d *apps.Deployment) ([]*apps.ReplicaSet, error) {
196 | 	// List all ReplicaSets to find those we own but that no longer match our
197 | 	// selector. They will be orphaned by ClaimReplicaSets().
198 | 	// List all ReplicaSets in the namespace from the ReplicaSet informer's store
199 | 	rsList, err := dc.rsLister.ReplicaSets(d.Namespace).List(labels.Everything())
200 | 	// Convert the Deployment's Spec.Selector into a label selector
201 | 	deploymentSelector, err := metav1.LabelSelectorAsSelector(d.Spec.Selector)
202 | 
203 | 
204 | 	// Re-fetch the Deployment from the API server so that the deletionTimestamp check runs against fresh data rather than the local cache
205 | 	canAdoptFunc := controller.RecheckDeletionTimestamp(func() (metav1.Object, error) {
206 | 		fresh, err := dc.client.AppsV1().Deployments(d.Namespace).Get(context.TODO(), d.Name, metav1.GetOptions{})
207 | 		
208 | 		if fresh.UID != d.UID {
209 | 			return nil, fmt.Errorf("original Deployment %v/%v is gone: got uid %v, wanted %v", d.Namespace, d.Name, fresh.UID, d.UID)
210 | 		}
211 | 		return fresh, nil
212 | 	})
213 | 	// Call ClaimReplicaSets to return the ReplicaSets whose owner reference points to this Deployment
214 | 	cm := controller.NewReplicaSetControllerRefManager(dc.rsControl, d, deploymentSelector, controllerKind, canAdoptFunc)
215 | 	return cm.ClaimReplicaSets(rsList)
216 | }
217 | ```
218 | 
219 | 
220 | 
221 | 
222 | 
223 | # Summary
224 | 
225 | The Deployment controller compares the desired state described in Spec with the actual Status and makes changes to reconcile the two.
226 | 
227 | 
228 | 
229 | 
230 | 
231 | 
232 | 
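233 | The deployment controller code shows, once again, a typical controller flow: create, delete, and update events for ReplicaSets and Deployments are written to the work queue; workers read from the queue and run processNextWorkItem, which finds the ReplicaSets whose owner reference points to the Deployment, compares them with the desired state, and decides whether ReplicaSets have to be created or deleted.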
234 | 
235 | 


--------------------------------------------------------------------------------
/controller/endpoint-controller.md:
--------------------------------------------------------------------------------
  1 | # Overview
  2 | 
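  3 | This article is a source-code analysis based on the Kubernetes master branch (commitid: ).
  4 | 
  5 | It introduces the endpoint controller and explains how it works.
  6 | 
  7 | Code location: `pkg/controller/endpoint`
  8 | 
  9 | # Concepts
 10 | 
 11 | 
 12 | 
 13 | # Instantiating the endpoint controller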
 14 | 
 15 | 
 16 | 
 17 | 
 18 | 
 19 | ```go
 20 | func (e *EndpointController) syncService(key string) error {
 21 | 	startTime := time.Now()
 22 | 	defer func() {
 23 | 		klog.V(4).Infof("Finished syncing service %q endpoints. (%v)", key, time.Since(startTime))
 24 | 	}()
 25 | 
 26 | 	namespace, name, err := cache.SplitMetaNamespaceKey(key)
 27 | 	if err != nil {
 28 | 		return err
 29 | 	}
 30 | 	service, err := e.serviceLister.Services(namespace).Get(name)
 31 | 	if err != nil {
 32 | 		if !errors.IsNotFound(err) {
 33 | 			return err
 34 | 		}
 35 | 
 36 | 		// Delete the corresponding endpoint, as the service has been deleted.
 37 | 		// TODO: Please note that this will delete an endpoint when a
 38 | 		// service is deleted. However, if we're down at the time when
 39 | 		// the service is deleted, we will miss that deletion, so this
 40 | 		// doesn't completely solve the problem. See #6877.
 41 | 		err = e.client.CoreV1().Endpoints(namespace).Delete(context.TODO(), name, metav1.DeleteOptions{})
 42 | 		if err != nil && !errors.IsNotFound(err) {
 43 | 			return err
 44 | 		}
 45 | 		e.triggerTimeTracker.DeleteService(namespace, name)
 46 | 		return nil
 47 | 	}
 48 | 
 49 | 	if service.Spec.Selector == nil {
 50 | 		// services without a selector receive no endpoints from this controller;
 51 | 		// these services will receive the endpoints that are created out-of-band via the REST API.
 52 | 		return nil
 53 | 	}
 54 | 
 55 | 	klog.V(5).Infof("About to update endpoints for service %q", key)
 56 | 	pods, err := e.podLister.Pods(service.Namespace).List(labels.Set(service.Spec.Selector).AsSelectorPreValidated())
 57 | 	if err != nil {
 58 | 		// Since we're getting stuff from a local cache, it is
 59 | 		// basically impossible to get this error.
 60 | 		return err
 61 | 	}
 62 | 
 63 | 	// If the user specified the older (deprecated) annotation, we have to respect it.
 64 | 	tolerateUnreadyEndpoints := service.Spec.PublishNotReadyAddresses
 65 | 	if v, ok := service.Annotations[TolerateUnreadyEndpointsAnnotation]; ok {
 66 | 		b, err := strconv.ParseBool(v)
 67 | 		if err == nil {
 68 | 			tolerateUnreadyEndpoints = b
 69 | 		} else {
 70 | 			utilruntime.HandleError(fmt.Errorf("Failed to parse annotation %v: %v", TolerateUnreadyEndpointsAnnotation, err))
 71 | 		}
 72 | 	}
 73 | 
 74 | 	// We call ComputeEndpointLastChangeTriggerTime here to make sure that the
 75 | 	// state of the trigger time tracker gets updated even if the sync turns out
 76 | 	// to be no-op and we don't update the endpoints object.
 77 | 	endpointsLastChangeTriggerTime := e.triggerTimeTracker.
 78 | 		ComputeEndpointLastChangeTriggerTime(namespace, service, pods)
 79 | 
 80 | 	subsets := []v1.EndpointSubset{}
 81 | 	var totalReadyEps int
 82 | 	var totalNotReadyEps int
 83 | 
 84 | 	for _, pod := range pods {
 85 | 		if len(pod.Status.PodIP) == 0 {
 86 | 			klog.V(5).Infof("Failed to find an IP for pod %s/%s", pod.Namespace, pod.Name)
 87 | 			continue
 88 | 		}
 89 | 		if !tolerateUnreadyEndpoints && pod.DeletionTimestamp != nil {
 90 | 			klog.V(5).Infof("Pod is being deleted %s/%s", pod.Namespace, pod.Name)
 91 | 			continue
 92 | 		}
 93 | 
 94 | 		ep, err := podToEndpointAddressForService(service, pod)
 95 | 		if err != nil {
 96 | 			// this will happen, if the cluster runs with some nodes configured as dual stack and some as not
 97 | 			// such as the case of an upgrade..
 98 | 			klog.V(2).Infof("failed to find endpoint for service:%v with ClusterIP:%v on pod:%v with error:%v", service.Name, service.Spec.ClusterIP, pod.Name, err)
 99 | 			continue
100 | 		}
101 | 
102 | 		epa := *ep
103 | 		if endpointutil.ShouldSetHostname(pod, service) {
104 | 			epa.Hostname = pod.Spec.Hostname
105 | 		}
106 | 
107 | 		// Allow headless service not to have ports.
108 | 		if len(service.Spec.Ports) == 0 {
109 | 			if service.Spec.ClusterIP == api.ClusterIPNone {
110 | 				subsets, totalReadyEps, totalNotReadyEps = addEndpointSubset(subsets, pod, epa, nil, tolerateUnreadyEndpoints)
111 | 				// No need to repack subsets for headless service without ports.
112 | 			}
113 | 		} else {
114 | 			for i := range service.Spec.Ports {
115 | 				servicePort := &service.Spec.Ports[i]
116 | 				portNum, err := podutil.FindPort(pod, servicePort)
117 | 				if err != nil {
118 | 					klog.V(4).Infof("Failed to find port for service %s/%s: %v", service.Namespace, service.Name, err)
119 | 					continue
120 | 				}
121 | 				epp := endpointPortFromServicePort(servicePort, portNum)
122 | 
123 | 				var readyEps, notReadyEps int
124 | 				subsets, readyEps, notReadyEps = addEndpointSubset(subsets, pod, epa, epp, tolerateUnreadyEndpoints)
125 | 				totalReadyEps = totalReadyEps + readyEps
126 | 				totalNotReadyEps = totalNotReadyEps + notReadyEps
127 | 			}
128 | 		}
129 | 	}
130 | 	subsets = endpoints.RepackSubsets(subsets)
131 | 
132 | 	// See if there's actually an update here.
133 | 	currentEndpoints, err := e.endpointsLister.Endpoints(service.Namespace).Get(service.Name)
134 | 	if err != nil {
135 | 		if errors.IsNotFound(err) {
136 | 			currentEndpoints = &v1.Endpoints{
137 | 				ObjectMeta: metav1.ObjectMeta{
138 | 					Name:   service.Name,
139 | 					Labels: service.Labels,
140 | 				},
141 | 			}
142 | 		} else {
143 | 			return err
144 | 		}
145 | 	}
146 | 
147 | 	createEndpoints := len(currentEndpoints.ResourceVersion) == 0
148 | 
149 | 	if !createEndpoints &&
150 | 		apiequality.Semantic.DeepEqual(currentEndpoints.Subsets, subsets) &&
151 | 		apiequality.Semantic.DeepEqual(currentEndpoints.Labels, service.Labels) {
152 | 		klog.V(5).Infof("endpoints are equal for %s/%s, skipping update", service.Namespace, service.Name)
153 | 		return nil
154 | 	}
155 | 	newEndpoints := currentEndpoints.DeepCopy()
156 | 	newEndpoints.Subsets = subsets
157 | 	newEndpoints.Labels = service.Labels
158 | 	if newEndpoints.Annotations == nil {
159 | 		newEndpoints.Annotations = make(map[string]string)
160 | 	}
161 | 
162 | 	if !endpointsLastChangeTriggerTime.IsZero() {
163 | 		newEndpoints.Annotations[v1.EndpointsLastChangeTriggerTime] =
164 | 			endpointsLastChangeTriggerTime.Format(time.RFC3339Nano)
165 | 	} else { // No new trigger time, clear the annotation.
166 | 		delete(newEndpoints.Annotations, v1.EndpointsLastChangeTriggerTime)
167 | 	}
168 | 
169 | 	if newEndpoints.Labels == nil {
170 | 		newEndpoints.Labels = make(map[string]string)
171 | 	}
172 | 
173 | 	if !helper.IsServiceIPSet(service) {
174 | 		newEndpoints.Labels = utillabels.CloneAndAddLabel(newEndpoints.Labels, v1.IsHeadlessService, "")
175 | 	} else {
176 | 		newEndpoints.Labels = utillabels.CloneAndRemoveLabel(newEndpoints.Labels, v1.IsHeadlessService)
177 | 	}
178 | 
179 | 	klog.V(4).Infof("Update endpoints for %v/%v, ready: %d not ready: %d", service.Namespace, service.Name, totalReadyEps, totalNotReadyEps)
180 | 	if createEndpoints {
181 | 		// No previous endpoints, create them
182 | 		_, err = e.client.CoreV1().Endpoints(service.Namespace).Create(context.TODO(), newEndpoints, metav1.CreateOptions{})
183 | 	} else {
184 | 		// Pre-existing
185 | 		_, err = e.client.CoreV1().Endpoints(service.Namespace).Update(context.TODO(), newEndpoints, metav1.UpdateOptions{})
186 | 	}
187 | 	if err != nil {
188 | 		if createEndpoints && errors.IsForbidden(err) {
189 | 			// A request is forbidden primarily for two reasons:
190 | 			// 1. namespace is terminating, endpoint creation is not allowed by default.
191 | 			// 2. policy is misconfigured, in which case no service would function anywhere.
192 | 			// Given the frequency of 1, we log at a lower level.
193 | 			klog.V(5).Infof("Forbidden from creating endpoints: %v", err)
194 | 
195 | 			// If the namespace is terminating, creates will continue to fail. Simply drop the item.
196 | 			if errors.HasStatusCause(err, v1.NamespaceTerminatingCause) {
197 | 				return nil
198 | 			}
199 | 		}
200 | 
201 | 		if createEndpoints {
202 | 			e.eventRecorder.Eventf(newEndpoints, v1.EventTypeWarning, "FailedToCreateEndpoint", "Failed to create endpoint for service %v/%v: %v", service.Namespace, service.Name, err)
203 | 		} else {
204 | 			e.eventRecorder.Eventf(newEndpoints, v1.EventTypeWarning, "FailedToUpdateEndpoint", "Failed to update endpoint %v/%v: %v", service.Namespace, service.Name, err)
205 | 		}
206 | 
207 | 		return err
208 | 	}
209 | 	return nil
210 | }
211 | 
212 | ```
213 | 
214 | 


--------------------------------------------------------------------------------
/controller/gc-controller.md:
--------------------------------------------------------------------------------
  1 | # Overview
  2 | 
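  3 | Deleting a Deployment usually also deletes the Pods that belong to it.
  4 | How does Kubernetes remove those dependent resources as well?
  5 | That is the job of the GC controller: it looks up the dependents of a deleted object and deletes them too.
  6 | Garbage collection means removing unused objects from the system and releasing the compute resources allocated to them.
  7 | 
  8 | Unlike in an object-oriented language, a Kubernetes object manifest never explicitly declares ownership relations, so how does the system determine them?
  9 | In Kubernetes, every dependent object carries a metadata field, `metadata.ownerReferences`, that expresses the relation.
 10 | 
 11 | Since Kubernetes 1.8, Kubernetes sets the value of ownerReferences on objects created or adopted by certain controllers (for example ReplicaSet, StatefulSet, DaemonSet, Deployment, Job, and CronJob).
 12 | ownerReferences can also be set manually if needed.
 13 | An object can have multiple ownerReferences, for example within a namespace.
 14 | 
 15 | This document describes the workflow of the GC controller.
 16 | 
 17 | 
 18 | 
 19 | The GC controller is one of the controllers run by controller-manager. Its main job is to delete objects that need to be deleted, together with their dependents.
 20 | 
 21 | Why does GC run as a controller? Before reading the code I assumed this should be the kubelet's job, until I read the design proposal: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/api-machinery/garbage-collection.md#overview. A brief translation of its goals:
 22 | 
 23 | 1. Support server-side cascading deletion. Put simply, deletion follows the established parent-child relation (ownerReference): for example, when a Pod's owner has already been deleted, the Pod is considered unmanaged and has to be deleted.
 24 | 2. Centralize the cascading deletion logic instead of spreading it across controllers.
 25 | 3. Allow dependent objects to be orphaned optionally.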
 26 | 
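 27 | # Understanding the code
 28 | 
 29 | ## API
 30 | 
 31 | `ObjectMeta` gains `OwnerReferences`, which lists all objects that this object depends on (its owners, or "parents" for short). If all owners have been deleted, the object is garbage collected.
 32 | 
 33 | `ObjectMeta` also gains a `Finalizers` list, which names the finalizers that must act before the object is deleted. The list must be empty before the object is removed from the cluster for good. Each string in the list identifies the component responsible for removing that entry. If the object's deletionTimestamp is non-nil, entries in the list can only be removed. For security reasons, updating finalizers requires special privileges. To enforce the admission rules, finalizers are exposed as a subresource, and changing them directly while updating the main resource is not allowed.
 34 | 
 35 | One point deserves special attention: the `OwnerReference` struct has no namespace field, so an owner that is namespace-scoped must live in the same namespace as its dependent.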
 36 | 
 37 | ```go
 38 | type ObjectMeta struct {
 39 | 	...
 40 | 	OwnerReferences []OwnerReference
 41 | 	Finalizers []string
 42 | }
 43 | 
 44 | type OwnerReference struct {
 45 | 	// Version of the referent.
 46 | 	APIVersion string
 47 | 	// Kind of the referent.
 48 | 	Kind string
 49 | 	// Name of the referent.
 50 | 	Name string
 51 | 	// UID of the referent.
 52 | 	UID types.UID
 53 | }
 54 | ```
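The rule "an object is collected once all of its owners are gone" can be sketched as a small predicate. This is a toy model: `ownerRef` keeps only the fields needed here, `isGarbage` is a hypothetical name, and the set of live UIDs stands in for the lookups the real controller performs against its dependency graph and the API server.

```go
package main

import "fmt"

// ownerRef is a trimmed-down owner reference for illustration.
type ownerRef struct {
	Kind string
	Name string
	UID  string
}

// isGarbage reports whether every owner in refs is absent from liveUIDs.
// An object without owner references is never considered garbage here.
func isGarbage(refs []ownerRef, liveUIDs map[string]bool) bool {
	if len(refs) == 0 {
		return false
	}
	for _, r := range refs {
		if liveUIDs[r.UID] {
			// At least one owner still exists, so the object is kept.
			return false
		}
	}
	return true
}

func main() {
	live := map[string]bool{"uid-rs-1": true}
	pod := []ownerRef{{Kind: "ReplicaSet", Name: "web-abc", UID: "uid-rs-1"}}
	orphaned := []ownerRef{{Kind: "ReplicaSet", Name: "old-xyz", UID: "uid-rs-0"}}

	fmt.Println(isGarbage(pod, live), isGarbage(orphaned, live), isGarbage(nil, live))
}
```

Matching on UID rather than on name matters because a new object can reuse the name of a deleted owner without being the same owner.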
 55 | 
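 56 | On the API server side, deletion works as follows. If the object's `ObjectMeta.Finalizers` is non-empty, the API server only updates `DeletionTimestamp`. If `ObjectMeta.Finalizers` is empty and `options.GracePeriod` is 0, the object is deleted. If `options.GracePeriod` is non-zero, only `DeletionTimestamp` is updated.
 57 | 
 58 | 
 59 | 
 60 | The other API change is in `DeleteOptions`, which gains `OrphanDependents` so that the user can state whether dependents should be orphaned. It defaults to true, because controllers before release 1.2 expect dependents to be orphaned.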
 61 | 
 62 | ```go
 63 | type DeleteOptions struct {
 64 | 	…
 65 | 	OrphanDependents bool
 66 | }
 67 | ```
 68 | 
 69 | 
 70 | 
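 71 | ## Starting the GC controller
 72 | 
 73 | The startup code below shows that starting the controller does three things:
 74 | 
 75 | 1. Instantiate the garbage collector with `NewGarbageCollector`
 76 | 2. Start the garbage collector
 77 | 3. Run Sync periodically, every 30 seconds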
 78 | 
 79 | ```go
 80 | // Code location: cmd/kube-controller-manager/app/core.go
 81 | func startGarbageCollectorController(ctx ControllerContext) (http.Handler, bool, error) {
 82 | 	// Return immediately if the GC controller is not enabled
 83 | 	if !ctx.ComponentConfig.GarbageCollectorController.EnableGarbageCollector {
 84 | 		return nil, false, nil
 85 | 	}
 86 | 
 87 | 	gcClientset := ctx.ClientBuilder.ClientOrDie("generic-garbage-collector")
 88 | 	discoveryClient := ctx.ClientBuilder.DiscoveryClientOrDie("generic-garbage-collector")
 89 | 
 90 | 	config := ctx.ClientBuilder.ConfigOrDie("generic-garbage-collector")
 91 | 	metadataClient, err := metadata.NewForConfig(config)
 92 | 	...
 93 | 
 94 | 	ignoredResources := make(map[schema.GroupResource]struct{})
 95 | 	for _, r := range ctx.ComponentConfig.GarbageCollectorController.GCIgnoredResources {
 96 | 		ignoredResources[schema.GroupResource{Group: r.Group, Resource: r.Resource}] = struct{}{}
 97 | 	}
 98 | 	// Instantiate the garbage collector
 99 | 	garbageCollector, err := garbagecollector.NewGarbageCollector(
100 | 		gcClientset,
101 | 		metadataClient,
102 | 		ctx.RESTMapper,
103 | 		ignoredResources,
104 | 		ctx.ObjectOrMetadataInformerFactory,
105 | 		ctx.InformersStarted,
106 | 	)
107 | 	// Start the garbage collector.
108 | 	workers := int(ctx.ComponentConfig.GarbageCollectorController.ConcurrentGCSyncs)
109 | 	go garbageCollector.Run(workers, ctx.Stop)
110 | 
111 | 	// Run Sync periodically, every 30 seconds
112 | 	go garbageCollector.Sync(discoveryClient, 30*time.Second, ctx.Stop)
113 | 
114 | 	return garbagecollector.NewDebugHandler(garbageCollector), true, nil
115 | }
116 | ```
117 | 
118 | 
119 | 
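120 | ### Instantiating the garbage collector (NewGarbageCollector)
121 | 
122 | The `GarbageCollector` struct contains two queues:
123 | `attemptToDelete`: when the time is right, the garbage collector attempts to delete the items in this queue.
124 | `attemptToOrphan`: the garbage collector attempts to orphan the dependents of the items in this queue, and then deletes the items.
125 | 
126 | `dependencyGraphBuilder` also holds a less obvious queue, `graphChanges`. Before looking at `graphChanges`, it helps to understand what the `GraphBuilder` struct does.
127 | 
128 | The `GraphBuilder` struct uses informers to watch add, delete, and update events for all resources; as soon as it sees one, it adds the object to the `Dirty Queue`.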
129 | 
130 | ```go
131 | func NewGarbageCollector(...) (*GarbageCollector, error) {
132 | 	...
133 | 	attemptToDelete := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "garbage_collector_attempt_to_delete")
134 | 	attemptToOrphan := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "garbage_collector_attempt_to_orphan")
135 | 	absentOwnerCache := NewReferenceCache(500)
136 | 	gc := &GarbageCollector{
137 | 		metadataClient:   metadataClient,
138 | 		restMapper:       mapper,
139 | 		attemptToDelete:  attemptToDelete,
140 | 		attemptToOrphan:  attemptToOrphan,
141 | 		absentOwnerCache: absentOwnerCache,
142 | 	}
143 | 	gc.dependencyGraphBuilder = &GraphBuilder{
144 | 		eventRecorder:    eventRecorder,
145 | 		metadataClient:   metadataClient,
146 | 		informersStarted: informersStarted,
147 | 		restMapper:       mapper,
148 | 		graphChanges:     workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "garbage_collector_graph_changes"),
149 | 		uidToNode: &concurrentUIDToNode{
150 | 			uidToNode: make(map[types.UID]*node),
151 | 		},
152 | 		attemptToDelete:  attemptToDelete,
153 | 		attemptToOrphan:  attemptToOrphan,
154 | 		absentOwnerCache: absentOwnerCache,
155 | 		sharedInformers:  sharedInformers,
156 | 		ignoredResources: ignoredResources,
157 | 	}
158 | 
159 | 	return gc, nil
160 | }
161 | ```
162 | 
163 | 
164 | 
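165 | ### Starting the garbage collector
166 | 
167 | Start a new goroutine that runs `dependencyGraphBuilder.Run`.
168 | 
169 | Wait until all resource monitors exist and the `HasSynced` function of each monitor's controller returns true.
170 | 
171 | Start new goroutines that run `runAttemptToDeleteWorker` through `wait.Until` with a **one-second** period.
172 | 
173 | Start new goroutines that run `runAttemptToOrphanWorker` through `wait.Until` with a **one-second** period.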
174 | 
175 | ```go
176 | func (gc *GarbageCollector) Run(workers int, stopCh <-chan struct{}) {
177 | 	defer utilruntime.HandleCrash()
178 | 	defer gc.attemptToDelete.ShutDown()
179 | 	defer gc.attemptToOrphan.ShutDown()
180 | 	defer gc.dependencyGraphBuilder.graphChanges.ShutDown()
181 | 
182 | 	// Run dependencyGraphBuilder.Run in a new goroutine
183 | 	go gc.dependencyGraphBuilder.Run(stopCh)
184 | 
185 | 	if !cache.WaitForNamedCacheSync("garbage collector", stopCh, gc.dependencyGraphBuilder.IsSynced) {
186 | 		return
187 | 	}
188 | 
189 | 	klog.Infof("Garbage collector: all resource monitors have synced. Proceeding to collect garbage")
190 | 
191 | 	// gc workers
192 | 	for i := 0; i < workers; i++ {
193 | 		go wait.Until(gc.runAttemptToDeleteWorker, 1*time.Second, stopCh)
194 | 		go wait.Until(gc.runAttemptToOrphanWorker, 1*time.Second, stopCh)
195 | 	}
196 | 
197 | 	<-stopCh
198 | }
199 | 
200 | ```
201 | 
202 | 
203 | 
204 | `runAttemptToDeleteWorker` 
205 | 
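206 | This method calls `attemptToDeleteWorker` in a loop. The workflow is:
207 | 
208 | 1. Get an item from the `attemptToDelete` queue; when processing finishes, mark it as done on the queue.
209 | 
210 | 2. Get the graph node (`*node`) for the item.
211 | 
212 | 3. Call `attemptToDeleteItem` to try to delete the object that the node represents. It fetches the object and its ownerReferences, and returns immediately if the object has none.
213 | 
214 |    As long as an owner named in ownerReferences still exists, the object is not deleted.
215 | 
216 | 4. Inspect the returned error and, where appropriate, requeue the item with rate limiting.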
217 | 
218 | ```go
219 | func (gc *GarbageCollector) runAttemptToDeleteWorker() {
220 | 	for gc.attemptToDeleteWorker() {
221 | 	}
222 | }
223 | 
224 | func (gc *GarbageCollector) attemptToDeleteWorker() bool {
225 | 	item, quit := gc.attemptToDelete.Get()
226 | 	gc.workerLock.RLock()
227 | 	defer gc.workerLock.RUnlock()
228 | 	if quit {
229 | 		return false
230 | 	}
231 | 	defer gc.attemptToDelete.Done(item)
232 | 	n, ok := item.(*node)
233 | 	...
234 | 
235 | 
236 | 	err := gc.attemptToDeleteItem(n)
237 | 	if err == enqueuedVirtualDeleteEventErr {		
238 | 		return true
239 | 	} else if err == namespacedOwnerOfClusterScopedObjectErr {
240 | 		// a cluster-scoped object referring to a namespaced owner is an error that will not resolve on retry, no need to requeue this node
241 | 		return true
242 | 	} else if err != nil {
243 | 		if _, ok := err.(*restMappingError); ok {			
244 | 			klog.V(5).Infof("error syncing item %s: %v", n, err)
245 | 		} else {
246 | 			utilruntime.HandleError(fmt.Errorf("error syncing item %s: %v", n, err))
247 | 		}		
248 | 		gc.attemptToDelete.AddRateLimited(item)
249 | 	} else if !n.isObserved() {
250 | 		klog.V(5).Infof("item %s hasn't been observed via informer yet", n.identity)
251 | 		gc.attemptToDelete.AddRateLimited(item)
252 | 	}
253 | 	return true
254 | }
255 | 
256 | ```
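The get, handle, re-queue shape of `attemptToDeleteWorker` can be sketched without client-go. This is a simplified illustration under stated assumptions: `simpleQueue`, `processNext`, and `drain` are hypothetical names, the queue is single-goroutine, and `Get` reports quit as soon as the queue is empty instead of blocking like the real workqueue.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// simpleQueue is a hypothetical, single-goroutine stand-in for the
// rate-limited workqueue. Unlike the real queue, Get does not block:
// it reports quit=true as soon as the queue is empty.
type simpleQueue struct {
	items []string
}

func (q *simpleQueue) Add(item string) { q.items = append(q.items, item) }

func (q *simpleQueue) Get() (item string, quit bool) {
	if len(q.items) == 0 {
		return "", true
	}
	item, q.items = q.items[0], q.items[1:]
	return item, false
}

// processNext mirrors the shape of attemptToDeleteWorker: take one item,
// try to handle it, and re-queue it if the handler returns an error.
func processNext(q *simpleQueue, handle func(string) error, attempts *[]string) bool {
	item, quit := q.Get()
	if quit {
		return false
	}
	*attempts = append(*attempts, item)
	if err := handle(item); err != nil {
		q.Add(item) // the real code calls AddRateLimited here
	}
	return true
}

// drain processes items until the queue is empty. Items listed in failOnce
// fail on their first attempt and succeed on the retry. It returns the
// order of attempts as a comma-separated string.
func drain(items []string, failOnce map[string]bool) string {
	q := &simpleQueue{}
	for _, it := range items {
		q.Add(it)
	}
	failed := map[string]bool{}
	handle := func(item string) error {
		if failOnce[item] && !failed[item] {
			failed[item] = true
			return errors.New("transient error")
		}
		return nil
	}
	var attempts []string
	for processNext(q, handle, &attempts) {
	}
	return strings.Join(attempts, ",")
}

func main() {
	// "b" fails once, so it is retried after "c".
	fmt.Println(drain([]string{"a", "b", "c"}, map[string]bool{"b": true}))
}
```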
257 | 
258 | 
259 | 
260 | 
261 | 
262 | 
263 | 
264 | ### Running Sync periodically every 30 seconds
265 | 
266 | ```go
267 | 
268 | ```
269 | 
270 | 
271 | 
272 | 
273 | 
274 | 
275 | 
276 | 
277 | 
278 | 
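279 | ## Notes
280 | 
281 | While writing code today I added `obj.SetOwnerReferences(append(obj.GetOwnerReferences(), ownerRef))`. The requirement was to create a Keycloak client and, at the same time, create secrets in several different namespaces, then add an ownerReference to those secrets. No error was returned anywhere, but the events showed the following warning.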
282 | 
283 | ```
284 | ownerRef [xxx/xxx, namespace: xxx, name: client, uid: 866ef543-22fb-4ec5-b253-63809d3abd22] does not exist in namespace "xx"
285 | ```
286 | 
287 | After digging through the code, I found the following:
288 | 
289 | ```
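290 | When a namespace-scoped resource gets an owner reference to a namespace-scoped owner, the owner must be in the same namespace; cross-namespace owner references are not allowed. Alternatively, the owner may be a cluster-scoped resource.
291 | A cluster-scoped resource can only have owner references to cluster-scoped resources.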
292 | ```
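The scoping rule above fits in a few lines. This is a minimal sketch, not the garbage collector's actual check: `ownerRefAllowed` is a hypothetical helper, and an empty namespace string is assumed to stand for a cluster-scoped object.

```go
package main

import "fmt"

// ownerRefAllowed applies the scoping rule quoted above. An empty namespace
// string stands for a cluster-scoped object (an assumption of this sketch).
func ownerRefAllowed(dependentNamespace, ownerNamespace string) bool {
	if ownerNamespace == "" {
		// A cluster-scoped owner is valid for any dependent.
		return true
	}
	// A namespaced owner is only valid for a dependent in the same namespace,
	// which also rules out cluster-scoped dependents (empty namespace).
	return dependentNamespace == ownerNamespace
}

func main() {
	fmt.Println(ownerRefAllowed("team-a", "team-a")) // same namespace
	fmt.Println(ownerRefAllowed("team-a", "team-b")) // cross-namespace
	fmt.Println(ownerRefAllowed("team-a", ""))       // cluster-scoped owner
	fmt.Println(ownerRefAllowed("", "team-a"))       // cluster-scoped dependent, namespaced owner
}
```

The secret-per-namespace case described above corresponds to the second call: the owner lives in a different namespace, so the reference is rejected.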
293 | 
294 | 


--------------------------------------------------------------------------------
/controller/images/client-go.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/controller/images/client-go.png


--------------------------------------------------------------------------------
/controller/images/instantiation-daemonset-controller.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/controller/images/instantiation-daemonset-controller.png


--------------------------------------------------------------------------------
/controller/images/namespace-controller-delete.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/controller/images/namespace-controller-delete.png


--------------------------------------------------------------------------------
/controller/images/run-daemonset-controoller.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/controller/images/run-daemonset-controoller.png


--------------------------------------------------------------------------------
/controller/images/run-namespace-controller.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/controller/images/run-namespace-controller.png


--------------------------------------------------------------------------------
/controller/namespace-controller.md:
--------------------------------------------------------------------------------
  1 | # Overview
  2 | 
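  3 | This article is a source-code analysis based on the Kubernetes master branch at commit 8e8b6a01cf6bf55dea5e2e4f554597a95c82988a.
  4 | 
  5 | It introduces the namespace controller and explains how it works.
  6 | 
  7 | Code location: `pkg/controller/namespace`
  8 | 
  9 | # Concepts
 10 | 
 11 | A namespace is a group that **logically** partitions and isolates resources. Namespaces make something like multi-tenancy possible: isolating resources such as pods and deployments, granting different permissions per namespace, and applying different resource quotas.
 12 | 
 13 | In Kubernetes, namespaces are used to implement multiple virtual clusters. Resource names are unique within a namespace. Not every resource is namespace-scoped, though; PV is one example.
 14 | 
 15 | # Data structure
 16 | 
 17 | A Namespace name must be a DNS-compatible label.
 18 | 
 19 | A Namespace must be created before any resource that lives in it, such as a pod, can be created.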
 20 | 
 21 | 
 22 | 
 23 | ```go
 24 | type Namespace struct {
 25 |   TypeMeta   `json:",inline"`
 26 |   ObjectMeta `json:"metadata,omitempty"`
 27 | 
 28 |   Spec NamespaceSpec `json:"spec,omitempty"`
 29 |   Status NamespaceStatus `json:"status,omitempty"`
 30 | }
 31 | ```
 32 | 
 33 | ## Phases
 34 | 
 35 | A Namespace has two phases. The default is Active; the phase becomes Terminating only when namespace.ObjectMeta.DeletionTimestamp is non-nil.
 36 | 
 37 | Active: a newly created Namespace is in the Active phase by default.
 38 | 
 39 | Terminating: when a DELETE request is issued against the Namespace, `namespace.ObjectMeta.DeletionTimestamp` is set to the current time and `Namespace.Status.Phase` is set to Terminating.
 40 | 
 41 | ```go
 42 | type NamespacePhase string
 43 | const(
 44 |   NamespaceActive NamespacePhase = "Active"
 45 |   NamespaceTerminating NamespacePhase = "Terminating"
 46 | )
 47 | 
 48 | type NamespaceStatus struct { 
 49 |   ...
 50 |   Phase NamespacePhase 
 51 | }
 52 | ```
 53 | 
 54 | # REST API
 55 | 
 56 | API operations on Namespace
 57 | 
 58 | | Action   | HTTP Verb | Path                                           | Description                    |
 59 | | -------- | --------- | ---------------------------------------------- | ------------------------------ |
 60 | | CREATE   | POST      | /api/{version}/namespaces                      | Create a namespace             |
 61 | | LIST     | GET       | /api/{version}/namespaces                      | List all namespaces            |
 62 | | UPDATE   | PUT       | /api/{version}/namespaces/{namespace}          | Update namespace {namespace}   |
 63 | | DELETE   | DELETE    | /api/{version}/namespaces/{namespace}          | Delete namespace {namespace}   |
 64 | | FINALIZE | PUT       | /api/{version}/namespaces/{namespace}/finalize | Finalize namespace {namespace} |
 65 | | WATCH    | GET       | /api/{version}/watch/namespaces                | Watch all namespaces           |
 66 | 
 67 | # Instantiating the Namespace Controller
 68 | 
 69 | The controller registers handlers on `namespaceInformer` for namespace add and update events.
 70 | 
 71 | ```go
 72 | func NewNamespaceController(...) *NamespaceController { 
 73 | 	namespaceInformer.Informer().AddEventHandlerWithResyncPeriod(
 74 | 		cache.ResourceEventHandlerFuncs{
 75 | 			AddFunc: func(obj interface{}) {
 76 | 				namespace := obj.(*v1.Namespace)
 77 | 				namespaceController.enqueueNamespace(namespace)
 78 | 			},
 79 | 			UpdateFunc: func(oldObj, newObj interface{}) {
 80 | 				namespace := newObj.(*v1.Namespace)
 81 | 				namespaceController.enqueueNamespace(namespace)
 82 | 			},
 83 | 		},		
 84 | 	)
 85 | 	return namespaceController
 86 | }
 87 | ```
 88 | 
 89 | 
 90 | 
 91 | # Run
 92 | 
 93 | The Namespace Controller's Run method mainly starts workers that process namespace add and update events.
 94 | 
 95 | ![](./images/run-namespace-controller.png)
 96 | 
 97 | ```go
 98 | func (nm *NamespaceController) Run(workers int, stopCh <-chan struct{}) {
 99 |     ...
100 | 	for i := 0; i < workers; i++ {
101 |         // Start the worker in a goroutine to process namespace add/update events from the workqueue
102 | 		go wait.Until(nm.worker, time.Second, stopCh)
103 | 	}
104 | 	<-stopCh
105 | }
106 | 
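107 | // worker processes the queue of namespace objects.
108 | // Each namespace can be in the queue at most once.
109 | // The system ensures that no two workers can process the same namespace at the same time.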
110 | func (nm *NamespaceController) worker() {
111 | 	workFunc := func() bool {
112 | 		key, quit := nm.queue.Get()
113 | 		if quit {
114 | 			return true
115 | 		}
116 | 		defer nm.queue.Done(key)
117 | 		// Run syncNamespaceFromKey
118 | 		err := nm.syncNamespaceFromKey(key.(string))
119 | 		if err == nil {
120 | 			nm.queue.Forget(key)
121 | 			return false
122 | 		}
123 | 
124 | 		if estimate, ok := err.(*deletion.ResourcesRemainingError); ok {
125 | 			t := estimate.Estimate/2 + 1
126 | 
127 | 			nm.queue.AddAfter(key, time.Duration(t)*time.Second)
128 | 		} else {
129 | 			nm.queue.AddRateLimited(key)
130 | 			utilruntime.HandleError(fmt.Errorf("deletion of namespace %v failed: %v", key, err))
131 | 		}
132 | 		return false
133 | 	}
134 | 
135 | 	for {
136 | 		quit := workFunc()
137 | 
138 | 		if quit {
139 | 			return
140 | 		}
141 | 	}
142 | }
143 | 
144 | 
145 | // syncNamespaceFromKey looks up the namespace with the given key in its store and syncs it
146 | func (nm *NamespaceController) syncNamespaceFromKey(key string) (err error) {
147 | 	namespace, err := nm.lister.Get(key) // handling of err is elided in this excerpt
148 | 	return nm.namespacedResourcesDeleter.Delete(namespace.Name)
149 | }
150 | 
151 | ```
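The retry delay for a `ResourcesRemainingError` is plain integer arithmetic. The sketch below reproduces `t := estimate/2 + 1` from the worker above; `requeueDelaySeconds` is a hypothetical helper name, and the sample estimates are made up for illustration.

```go
package main

import "fmt"

// requeueDelaySeconds mirrors t := estimate/2 + 1 from the worker above.
// Integer division means odd estimates round down before the +1.
func requeueDelaySeconds(estimateSeconds int64) int64 {
	return estimateSeconds/2 + 1
}

func main() {
	for _, e := range []int64{1, 15, 60} {
		fmt.Printf("estimate=%ds -> retry after %ds\n", e, requeueDelaySeconds(e))
	}
}
```

Retrying after roughly half the estimate means the namespace is checked again before the estimated deadline, and the `+1` keeps the delay at least one second.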
152 | 
153 | 
154 | 
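155 | # Deleting a namespace
156 | 
157 | Two fields matter when a namespace is deleted:
158 | 
159 | ObjectMeta.DeletionTimestamp: once a Delete is issued against a namespace, Kubernetes writes the current time into the namespace's ObjectMeta.DeletionTimestamp.
160 | 
161 | Spec.Finalizers: after all resources in the namespace have been deleted, the finalizer is removed from this field.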
162 | 
163 | 
164 | 
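165 | ## The Delete method
166 | 
167 | The controller holds a `namespacedResourcesDeleter` of type `deletion.NamespacedResourcesDeleterInterface`, and `syncNamespaceFromKey` above calls its `Delete` method to delete a namespace. Let's look at that interface next.
168 | 
169 | Code location: `pkg/controller/namespace/deletion/namespaced_resources_deleter.go`
170 | 
171 | The main job of `Delete` is to delete all resources in the given namespace.
172 | 
173 | Before deleting resources:
174 | 
175 | - It checks that DeletionTimestamp is set on the namespace (if it is missing, nothing is done).
176 | 
177 | - It verifies that the namespace is in the "Terminating" phase (and updates the phase if it is not yet marked Terminating).
178 | 
179 | Deleting resources:
180 | 
181 | - It calls deleteAllContent to delete the resources.
182 | 
183 | After deleting resources:
184 | 
185 | - It removes the finalizer token from the given namespace.
186 | 
187 | If any of these steps fails, an error is returned.
188 | 
189 | If resources are still being deleted, `Delete` does not block; it returns ResourcesRemainingError, and the worker re-queues the namespace until the deletion has completed.
190 | 
191 | This is one of the reasons a namespace is often seen stuck in Terminating.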
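The branches of `Delete` can be summarised as a small decision function. This is a simplified sketch that ignores API errors and conflicts: `nsState` and `nextStep` are hypothetical names, and the three fields are a flattened view of what the real method reads from the namespace and from `deleteAllContent`.

```go
package main

import "fmt"

// nsState is a hypothetical, flattened view of what Delete inspects.
type nsState struct {
	deletionTimestampSet bool  // DeletionTimestamp != nil
	finalizers           int   // len(Spec.Finalizers)
	estimateSeconds      int64 // estimate returned by deleteAllContent
}

// nextStep returns which branch of Delete the namespace would take,
// leaving out the error paths.
func nextStep(s nsState) string {
	switch {
	case !s.deletionTimestampSet:
		return "nothing to do"
	case s.finalizers == 0:
		return "already finalized"
	case s.estimateSeconds > 0:
		return "requeue: resources remaining"
	default:
		return "remove finalizer"
	}
}

func main() {
	fmt.Println(nextStep(nsState{deletionTimestampSet: false, finalizers: 1}))
	fmt.Println(nextStep(nsState{deletionTimestampSet: true, finalizers: 1, estimateSeconds: 15}))
	fmt.Println(nextStep(nsState{deletionTimestampSet: true, finalizers: 1, estimateSeconds: 0}))
}
```

A namespace that keeps landing in the "resources remaining" branch is the one that appears stuck in Terminating.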
192 | 
193 | 
194 | 
195 | ![](./images/namespace-controller-delete.png)
196 | 
197 | ```go
198 | type NamespacedResourcesDeleterInterface interface {
199 | 	Delete(nsName string) error
200 | }
201 | 
202 | 
203 | func (d *namespacedResourcesDeleter) Delete(nsName string) error {
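204 | 	// Get the namespace object and make sure it still exists and is not being handled by another worker. If DeletionTimestamp is nil, the namespace is not being deleted and there is nothing to do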
205 | 	namespace, err := d.nsClient.Get(context.TODO(), nsName, metav1.GetOptions{})
206 | 	if err != nil {
207 | 		if errors.IsNotFound(err) {
208 | 			return nil
209 | 		}
210 | 		return err
211 | 	}
212 | 	if namespace.DeletionTimestamp == nil {
213 | 		return nil
214 | 	}
215 | 
216 | 
217 |     // d.updateNamespaceStatusFunc deep-copies the namespace, sets its Status.Phase to NamespaceTerminating, and returns the updated namespace
218 |     // retryOnConflictError applies that function to the current namespace and retries on conflict errors; a NotFound error means the namespace has already been deleted.
219 | 	namespace, err = d.retryOnConflictError(namespace, d.updateNamespaceStatusFunc)
220 | 	if err != nil {
221 | 		if errors.IsNotFound(err) {
222 | 			return nil
223 | 		}
224 | 		return err
225 | 	}
226 | 
227 | 	// the latest view of the namespace asserts that namespace is no longer deleting..
228 | 	if namespace.DeletionTimestamp.IsZero() {
229 | 		return nil
230 | 	}
231 | 
232 |     // finalized reports whether namespace.Spec.Finalizers is empty
233 | 	if finalized(namespace) {
234 | 		return nil
235 | 	}
236 | 	
237 |     // Call deleteAllContent to find the resources that need deleting and delete them
238 | 	estimate, err := d.deleteAllContent(namespace)
239 | 	if err != nil {
240 | 		return err
241 | 	}
242 | 	if estimate > 0 {
243 | 		return &ResourcesRemainingError{estimate}
244 | 	}
245 | 
246 | 	// we have removed content, so mark it finalized by us
247 | 	_, err = d.retryOnConflictError(namespace, d.finalizeNamespace)
248 | 	if err != nil {
249 | 		if errors.IsNotFound(err) {
250 | 			return nil
251 | 		}
252 | 		return err
253 | 	}
254 | 	return nil
255 | }
256 | 
257 | ```
258 | 
259 | 
260 | 
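261 | ## deleteAllContent: discovering the GVRs of the resources to delete
262 | 
263 | `deleteAllContent` first uses the DiscoveryClient to discover all resources that can live in a namespace, together with their group and version, and passes each one to deleteAllContentForGroupVersionResource, which performs the deletion. It returns an estimate, in seconds, of the time remaining before the remaining resources are deleted; an estimate greater than 0 means some resources have not been deleted yet.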
264 | 
265 | ```go
266 | func (d *namespacedResourcesDeleter) deleteAllContent(ns *v1.Namespace) (int64, error) {
267 | 	namespace := ns.Name
268 | 	namespaceDeletedAt := *ns.DeletionTimestamp
269 | 	var errs []error
270 | 	conditionUpdater := namespaceConditionUpdater{}
271 | 	estimate := int64(0)
272 | 
273 |     // d.discoverResourcesFn is the discoverResourcesFn field of namespacedResourcesDeleter; it returns a list of metav1.APIResourceList
274 | 	resources, err := d.discoverResourcesFn()
275 | 	if err != nil {
276 |         // On error, record it and hand it to ProcessDiscoverResourcesErr
277 | 		errs = append(errs, err)
278 | 		conditionUpdater.ProcessDiscoverResourcesErr(err)
279 | 	}
280 | 	
281 |     // discovery is the client-go discovery package, used to find the GVRs that the Kubernetes API server supports
282 |     // FilteredBy keeps only the entries in resources whose verbs include delete
283 |     // discovery.GroupVersionResources then extracts the GVRs of those resources
284 | 	deletableResources := discovery.FilteredBy(discovery.SupportsAllVerbs{Verbs: []string{"delete"}}, resources)
285 | 	groupVersionResources, err := discovery.GroupVersionResources(deletableResources)
286 |     
287 | 	...
288 | 
289 | 	numRemainingTotals := allGVRDeletionMetadata{
290 | 		gvrToNumRemaining:        map[schema.GroupVersionResource]int{},
291 | 		finalizersToNumRemaining: map[string]int{},
292 | 	}
293 |     // Iterate over the GVRs that support delete and pass the namespace, GVR, and deletion timestamp to deleteAllContentForGroupVersionResource
294 | 	for gvr := range groupVersionResources {
295 | 		gvrDeletionMetadata, err := d.deleteAllContentForGroupVersionResource(gvr, namespace, namespaceDeletedAt)
296 | 		...
297 | 		if gvrDeletionMetadata.finalizerEstimateSeconds > estimate {
298 | 			estimate = gvrDeletionMetadata.finalizerEstimateSeconds
299 | 		}
300 | 		if gvrDeletionMetadata.numRemaining > 0 {
301 | 			numRemainingTotals.gvrToNumRemaining[gvr] = gvrDeletionMetadata.numRemaining
302 | 			for finalizer, numRemaining := range gvrDeletionMetadata.finalizersToNumRemaining {
303 | 				if numRemaining == 0 {
304 | 					continue
305 | 				}
306 | 				numRemainingTotals.finalizersToNumRemaining[finalizer] = numRemainingTotals.finalizersToNumRemaining[finalizer] + numRemaining
307 | 			}
308 | 		}
309 | 	}
310 | 	conditionUpdater.ProcessContentTotals(numRemainingTotals)
311 | 
312 | 	if hasChanged := conditionUpdater.Update(ns); hasChanged {
313 | 		if _, err = d.nsClient.UpdateStatus(context.TODO(), ns, metav1.UpdateOptions{}); err != nil {
314 | 			utilruntime.HandleError(fmt.Errorf("couldn't update status condition for namespace %q: %v", namespace, err))
315 | 		}
316 | 	}
317 | 
318 | 	// if len(errs)==0, NewAggregate returns nil.
319 | 	return estimate, utilerrors.NewAggregate(errs)
320 | }
321 | 
322 | ```
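The loop in `deleteAllContent` folds the per-GVR results into two totals: the largest estimate, and the number of remaining objects per finalizer. Below is a minimal sketch of that aggregation. `gvrResult` is a hypothetical, trimmed-down stand-in for `gvrDeletionMetadata`, and the finalizer names in `main` are illustrative only.

```go
package main

import "fmt"

// gvrResult is a hypothetical, trimmed-down version of gvrDeletionMetadata.
type gvrResult struct {
	finalizerEstimateSeconds int64
	numRemaining             int
	finalizersToNumRemaining map[string]int
}

// aggregate keeps the largest estimate across all results and sums the
// remaining object counts per finalizer, skipping results with nothing left.
func aggregate(results []gvrResult) (int64, map[string]int) {
	estimate := int64(0)
	totals := map[string]int{}
	for _, r := range results {
		if r.finalizerEstimateSeconds > estimate {
			estimate = r.finalizerEstimateSeconds
		}
		if r.numRemaining == 0 {
			continue
		}
		for finalizer, n := range r.finalizersToNumRemaining {
			if n > 0 {
				totals[finalizer] += n
			}
		}
	}
	return estimate, totals
}

func main() {
	estimate, totals := aggregate([]gvrResult{
		{finalizerEstimateSeconds: 15, numRemaining: 2, finalizersToNumRemaining: map[string]int{"foregroundDeletion": 2}},
		{finalizerEstimateSeconds: 0, numRemaining: 0},
		{finalizerEstimateSeconds: 5, numRemaining: 2, finalizersToNumRemaining: map[string]int{"foregroundDeletion": 1, "example.com/cleanup": 1}},
	})
	fmt.Println("estimate seconds:", estimate)
	fmt.Println("foregroundDeletion remaining:", totals["foregroundDeletion"])
	fmt.Println("example.com/cleanup remaining:", totals["example.com/cleanup"])
}
```

Taking the maximum rather than the sum reflects that the deletions for different GVRs proceed independently, so the slowest one determines when the namespace can be finalized.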
323 | 
324 | 
325 | 
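326 | ## deleteAllContentForGroupVersionResource
327 | 
328 | deleteAllContentForGroupVersionResource deletes every resource identified by the gvr through a generic client (in the code of this section the calls go through `d.metadataClient`). It returns an estimate of the time remaining before the remaining resources are deleted.
329 | 
330 | DynamicClient is a dynamic client that can perform generic operations on any Kubernetes API object, including custom resources defined by CRDs. The metadata client works across arbitrary resources in the same way, but only deals with object metadata.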
331 | 
332 | ```go
333 | func (d *namespacedResourcesDeleter) deleteAllContentForGroupVersionResource(
334 | 	gvr schema.GroupVersionResource, namespace string,
335 | 	namespaceDeletedAt metav1.Time) (gvrDeletionMetadata, error) {
336 |     // deleteCollection uses the interface from https://godoc.org/k8s.io/client-go/metadata and reports whether this GVR supports DeleteCollection
337 | 	deleteCollectionSupported, err := d.deleteCollection(gvr, namespace)
338 | 	if err != nil {
339 | 		return gvrDeletionMetadata{finalizerEstimateSeconds: estimate}, err
340 | 	}
341 | 
342 |     // If DeleteCollection is not supported, call deleteEachItem to list the items and delete them
343 | 	if !deleteCollectionSupported {
344 | 		err = d.deleteEachItem(gvr, namespace)
345 | 	}
346 | 	
347 | 	unstructuredList, listSupported, err := d.listCollection(gvr, namespace)	
348 |     // Use the list to look up finalizers
349 | 	finalizersToNumRemaining := map[string]int{}
350 | 	for _, item := range unstructuredList.Items {
351 | 		for _, finalizer := range item.GetFinalizers() {
352 | 			finalizersToNumRemaining[finalizer] = finalizersToNumRemaining[finalizer] + 1
353 | 		}
354 | 	}
355 | 
356 |     // Report how many resources still remain undeleted
357 | 	if estimate != int64(0) {
358 | 		return gvrDeletionMetadata{
359 | 			finalizerEstimateSeconds: estimate,
360 | 			numRemaining:             len(unstructuredList.Items),
361 | 			finalizersToNumRemaining: finalizersToNumRemaining,
362 | 		}, nil
363 | 	}
364 | 
365 | 	// if any item has a finalizer, we treat that as a normal condition, and use a default estimation to allow for GC to complete.
366 | 	if len(finalizersToNumRemaining) > 0 {
367 | 		return gvrDeletionMetadata{
368 | 			finalizerEstimateSeconds: finalizerEstimateSeconds,
369 | 			numRemaining:             len(unstructuredList.Items),
370 | 			finalizersToNumRemaining: finalizersToNumRemaining,
371 | 		}, nil
372 | 	}
373 | 
374 | 	return gvrDeletionMetadata{
375 | 		finalizerEstimateSeconds: estimate,
376 | 		numRemaining:             len(unstructuredList.Items),
377 | 	}, fmt.Errorf("unexpected items still remain in namespace: %s for gvr: %v", namespace, gvr)
378 | }
379 | ```
380 | 
381 | 
382 | 
383 | ## deleteEachItem: the actual deletion
384 | 
385 | ```go
386 | func (d *namespacedResourcesDeleter) deleteEachItem(gvr schema.GroupVersionResource, namespace string) error {
387 | 	unstructuredList, listSupported, err := d.listCollection(gvr, namespace)
388 | ...
389 |     // The key step, the actual deletion: call metadataClient to Delete the items one by one
390 | 	for _, item := range unstructuredList.Items {
391 | 		background := metav1.DeletePropagationBackground
392 | 		opts := metav1.DeleteOptions{PropagationPolicy: &background}
393 | 		if err = d.metadataClient.Resource(gvr).Namespace(namespace).Delete(context.TODO(), item.GetName(), opts); err != nil && !errors.IsNotFound(err) && !errors.IsMethodNotSupported(err) {
394 | 			return err
395 | 		}
396 | 	}
397 | 	return nil
398 | }
399 | ```
400 | 
401 | 
402 | 
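403 | # Summary
404 | 
405 | The Namespace Controller keeps watching for added or changed namespace objects. Once it finds a non-nil DeletionTimestamp, it sets the namespace to Terminating, uses the DiscoveryClient to discover all resources in the namespace, deletes them through a generic client (`metadataClient` in the code above), and finally removes the Finalizer from the namespace.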
406 | 
407 | 
408 | 


--------------------------------------------------------------------------------
/controller/replicaset-controller.md:
--------------------------------------------------------------------------------
  1 | # Overview
  2 | 
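  3 | This article is a source-code analysis based on the Kubernetes master branch (commit ID not given).
  4 | 
  5 | It introduces the ReplicaSet controller and explains how it works.
  6 | 
  7 | Code location: `pkg/controller/replicaset`
  8 | 
  9 | 
 10 | 
 11 | # Data structure
 12 | 
 13 | In the definition of ReplicaSetController, the most important fields are rsLister and podLister. rsLister is the store of ReplicaSets and podLister is the store of pods; both are populated by the shared informers passed to NewReplicaSetController.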
 14 | 
 15 | ```go
 16 | type ReplicaSetController struct {
 17 | ...
 18 | 	kubeClient clientset.Interface
 19 | 	podControl controller.PodControlInterface
 20 |     // syncHandler is a custom function field
 21 |     syncHandler func(rsKey string) error
 22 | 	rsLister appslisters.ReplicaSetLister
 23 | 	rsListerSynced cache.InformerSynced
 24 | 
 25 | 	// A store of pods, populated by the shared informer passed to NewReplicaSetController
 26 | 	podLister corelisters.PodLister
 27 | 	podListerSynced cache.InformerSynced    
 28 | ...
 29 | }
 30 | ```
 31 | 
 32 | # Instantiating ReplicaSetController
 33 | 
 34 | ```go
 35 | func NewReplicaSetController(rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, kubeClient clientset.Interface, burstReplicas int) *ReplicaSetController {
 36 |     // Set up Kubernetes event recording and send events through the event broadcaster
 37 | 	eventBroadcaster := record.NewBroadcaster()
 38 | 	eventBroadcaster.StartStructuredLogging(0)
 39 | 	eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: kubeClient.CoreV1().Events("")})
 40 |     // NewBaseController takes the clientset and the event recorder and initializes the ReplicaSetController instance
 41 | 	return NewBaseController(rsInformer, podInformer, kubeClient, burstReplicas,
 42 | 		apps.SchemeGroupVersion.WithKind("ReplicaSet"),
 43 | 		"replicaset_controller",
 44 | 		"replicaset",
 45 | 		controller.RealPodControl{
 46 | 			KubeClient: kubeClient,
 47 | 			Recorder:   eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "replicaset-controller"}),
 48 | 		},
 49 | 	)
 50 | }
 51 | 
 52 | func NewBaseController(rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, kubeClient clientset.Interface, burstReplicas int,
 53 | 	gvk schema.GroupVersionKind, metricOwnerName, queueName string, podControl controller.PodControlInterface) *ReplicaSetController {
 54 | ...
 55 | 
 56 | 	rsc := &ReplicaSetController{
 57 | 		GroupVersionKind: gvk,
 58 | 		kubeClient:       kubeClient,
 59 | 		podControl:       podControl,
 60 | 		burstReplicas:    burstReplicas,
 61 | 		expectations:     controller.NewUIDTrackingControllerExpectations(controller.NewControllerExpectations()),
 62 | 		queue:            workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), queueName),
 63 | 	}
 64 | 
 65 |     // The ReplicaSet controller watches add, update, and delete events on ReplicaSet objects
 66 | 	rsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
 67 | 		AddFunc:    rsc.addRS,
 68 | 		UpdateFunc: rsc.updateRS,
 69 | 		DeleteFunc: rsc.deleteRS,
 70 | 	})
 71 | 	rsc.rsLister = rsInformer.Lister()
 72 | 	rsc.rsListerSynced = rsInformer.Informer().HasSynced
 73 | 
 74 |     // The ReplicaSet controller also watches add, update, and delete events on pod objects
 75 | 	podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
 76 | 		AddFunc: rsc.addPod,
 77 | 		UpdateFunc: rsc.updatePod,
 78 | 		DeleteFunc: rsc.deletePod,
 79 | 	})
 80 | 	rsc.podLister = podInformer.Lister()
 81 | 	rsc.podListerSynced = podInformer.Informer().HasSynced
 82 | 
 83 |     // Important: remember this assignment!
 84 | 	rsc.syncHandler = rsc.syncReplicaSet
 85 | 
 86 | 	return rsc
 87 | }
 88 | ```
 89 | 
 90 | 
 91 | 
 92 | ## Add, update, and delete for ReplicaSets
 93 | 
 94 | ```go
 95 | 
 96 | func (rsc *ReplicaSetController) addRS(obj interface{}) {
 97 | 	rs := obj.(*apps.ReplicaSet)
 98 | 	klog.V(4).Infof("Adding %s %s/%s", rsc.Kind, rs.Namespace, rs.Name)
 99 | 	rsc.enqueueRS(rs)
100 | }
101 | 
102 | func (rsc *ReplicaSetController) updateRS(old, cur interface{}) {
103 | 	oldRS := old.(*apps.ReplicaSet)
104 | 	curRS := cur.(*apps.ReplicaSet)
105 | 
106 |     // If the two ReplicaSets have different UIDs, the old ReplicaSet can no longer be kept; the new ReplicaSet is added to the workqueue below
107 | 	if curRS.UID != oldRS.UID {
108 | 		key, err := controller.KeyFunc(oldRS)
109 | 		if err != nil {
110 | 			utilruntime.HandleError(fmt.Errorf("couldn't get key for object %#v: %v", oldRS, err))
111 | 			return
112 | 		}
113 |         // Call deleteRS so that the deletion of the old ReplicaSet is enqueued in the workqueue
114 | 		rsc.deleteRS(cache.DeletedFinalStateUnknown{
115 | 			Key: key,
116 | 			Obj: oldRS,
117 | 		})
118 | 	}
119 | 
120 | 	if *(oldRS.Spec.Replicas) != *(curRS.Spec.Replicas) {
121 | 		klog.V(4).Infof("%v %v updated. Desired pod count change: %d->%d", rsc.Kind, curRS.Name, *(oldRS.Spec.Replicas), *(curRS.Spec.Replicas))
122 | 	}
123 |     // Add the new ReplicaSet to the workqueue
124 | 	rsc.enqueueRS(curRS)
125 | }
126 | 
127 | func (rsc *ReplicaSetController) enqueueRS(rs *apps.ReplicaSet) {
128 | 	key, err := controller.KeyFunc(rs)
129 | 	if err != nil {
130 | 		utilruntime.HandleError(fmt.Errorf("couldn't get key for object %#v: %v", rs, err))
131 | 		return
132 | 	}
133 | 
134 |     // Add the object's key to the workqueue
135 | 	rsc.queue.Add(key)
136 | }
137 | ```
138 | 
139 | 
140 | 
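141 | As a reminder, the workqueue stores the keys of the objects that need processing (`enqueueRS` above adds the result of `controller.KeyFunc`), not the objects or their handlers.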
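What travels through the workqueue is a plain string key. The sketch below shows the conventional `namespace/name` format that the key function produces and that `cache.SplitMetaNamespaceKey` later takes apart. `keyFor` and `splitKey` are hypothetical stand-ins written with the standard library, not the client-go functions themselves.

```go
package main

import (
	"fmt"
	"strings"
)

// keyFor builds a "namespace/name" key; cluster-scoped objects use just the name.
func keyFor(namespace, name string) string {
	if namespace == "" {
		return name
	}
	return namespace + "/" + name
}

// splitKey reverses keyFor and rejects keys with more than one slash.
func splitKey(key string) (namespace, name string, err error) {
	parts := strings.Split(key, "/")
	switch len(parts) {
	case 1:
		return "", parts[0], nil
	case 2:
		return parts[0], parts[1], nil
	}
	return "", "", fmt.Errorf("unexpected key format: %q", key)
}

func main() {
	key := keyFor("default", "web")
	namespace, name, err := splitKey(key)
	fmt.Println(key, namespace, name, err)
}
```

Because only the key is queued, the sync function always reads the latest object from the lister instead of acting on a stale copy.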
142 | 
143 | ## Add, update, and delete for pods
144 | 
145 | ```go
146 | func (rsc *ReplicaSetController) addPod(obj interface{}) {
147 | 	pod := obj.(*v1.Pod)
148 | 
149 |     // If the pod's DeletionTimestamp is non-nil, the pod has been or is being deleted; deletePod updates expectations and enqueues the ReplicaSet
150 | 	if pod.DeletionTimestamp != nil {		
151 | 		rsc.deletePod(pod)
152 | 		return
153 | 	}
154 | 
155 | 	// If it has a ControllerRef, that's all that matters.
156 | 	if controllerRef := metav1.GetControllerOf(pod); controllerRef != nil {
157 | 		rs := rsc.resolveControllerRef(pod.Namespace, controllerRef)
158 | 		if rs == nil {
159 | 			return
160 | 		}
161 | 		rsKey, err := controller.KeyFunc(rs)
162 | 		if err != nil {
163 | 			return
164 | 		}
165 | 		klog.V(4).Infof("Pod %s created: %#v.", pod.Name, pod)
166 | 		rsc.expectations.CreationObserved(rsKey)
167 | 		rsc.queue.Add(rsKey)
168 | 		return
169 | 	}
170 | 
171 | 	rss := rsc.getPodReplicaSets(pod)
172 | 	if len(rss) == 0 {
173 | 		return
174 | 	}
175 | 	klog.V(4).Infof("Orphan Pod %s created: %#v.", pod.Name, pod)
176 | 	for _, rs := range rss {
177 | 		rsc.enqueueRS(rs)
178 | 	}
179 | }
180 | ```
181 | 
182 | 
183 | 
184 | ## Run
185 | 
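186 | Looking at the controller through the client-go diagram from the official documentation: we have established above that add, update, and delete events on pods and ReplicaSets put keys into the workqueue. Now we step into the Run function to see how it implements ProcessItem.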
187 | 
188 | ![client-go controller diagram](./images/client-go.png)
189 | 
190 | ```go
191 | // Run begins watching and syncing.
192 | func (rsc *ReplicaSetController) Run(workers int, stopCh <-chan struct{}) {
193 | 	defer utilruntime.HandleCrash()
194 | 	defer rsc.queue.ShutDown()
195 | 
196 | 	controllerName := strings.ToLower(rsc.Kind)	
197 | 	if !cache.WaitForNamedCacheSync(rsc.Kind, stopCh, rsc.podListerSynced, rsc.rsListerSynced) {
198 | 		return
199 | 	}
200 |     // Start the workers in goroutines; see the worker() method below
201 | 	for i := 0; i < workers; i++ {       
202 | 		go wait.Until(rsc.worker, time.Second, stopCh)
203 | 	}
204 | 
205 | 	<-stopCh
206 | }
207 | 
208 | func (rsc *ReplicaSetController) worker() {
209 |     // Loop on processNextWorkItem, the most important call, until it returns false
210 | 	for rsc.processNextWorkItem() {
211 | 	}
212 | }
213 | 
214 | func (rsc *ReplicaSetController) processNextWorkItem() bool {
215 | 	key, quit := rsc.queue.Get()
216 | 	if quit {
217 | 		return false
218 | 	}
219 | 	defer rsc.queue.Done(key)
220 | 
221 |     // Call syncHandler, the key method
222 | 	err := rsc.syncHandler(key.(string))
223 | 	if err == nil {
224 | 		rsc.queue.Forget(key)
225 | 		return true
226 | 	}
227 | 
228 | 	utilruntime.HandleError(fmt.Errorf("sync %q failed with %v", key, err))
229 | 	rsc.queue.AddRateLimited(key)
230 | 
231 | 	return true
232 | }
233 | 
234 | ```
235 | 
236 | 
237 | 
238 | ## syncHandler --> syncReplicaSet
239 | 
240 | In the Run path above, each key taken from the workqueue is passed to syncHandler. From the data structure and the instantiation code we already know that rsc.syncHandler = rsc.syncReplicaSet.
241 | 
242 | So let's look at what syncReplicaSet does.
243 | 
244 | ```go
245 | func (rsc *ReplicaSetController) syncReplicaSet(key string) error {
246 | 	startTime := time.Now()
247 | 	namespace, name, err := cache.SplitMetaNamespaceKey(key)
248 | ...
249 | 	rs, err := rsc.rsLister.ReplicaSets(namespace).Get(name)
250 | ...
251 | 
252 | 	rsNeedsSync := rsc.expectations.SatisfiedExpectations(key)
253 | 	selector, err := metav1.LabelSelectorAsSelector(rs.Spec.Selector)
254 | ...
255 | 
256 | 	// list all pods to include the pods that don't match the rs`s selector
257 | 	// anymore but has the stale controller ref.
258 | 	// TODO: Do the List and Filter in a single pass, or use an index.
259 | 	allPods, err := rsc.podLister.Pods(rs.Namespace).List(labels.Everything())
260 | 	if err != nil {
261 | 		return err
262 | 	}
263 | 	// Ignore inactive pods.
264 | 	filteredPods := controller.FilterActivePods(allPods)
265 | 
266 | 	// NOTE: filteredPods are pointing to objects from cache - if you need to
267 | 	// modify them, you need to copy it first.
268 | 	filteredPods, err = rsc.claimPods(rs, selector, filteredPods)
269 | 	if err != nil {
270 | 		return err
271 | 	}
272 | 
273 | 	var manageReplicasErr error
274 |     // Pass all active pods that this ReplicaSet owns (claimed through the selector and owner reference), together with the ReplicaSet itself, to manageReplicas
275 | 	if rsNeedsSync && rs.DeletionTimestamp == nil {
276 | 		manageReplicasErr = rsc.manageReplicas(filteredPods, rs)
277 | 	}
278 | 	rs = rs.DeepCopy()
279 | 	newStatus := calculateStatus(rs, filteredPods, manageReplicasErr)
280 | 	
281 |     // Update the ReplicaSet status based on the result of manageReplicas (manageReplicasErr)
282 | 	updatedRS, err := updateReplicaSetStatus(rsc.kubeClient.AppsV1().ReplicaSets(rs.Namespace), rs, newStatus)
283 | ...
284 | 	// Resync the ReplicaSet after MinReadySeconds as a last line of defense to guard against clock-skew.
285 | 	if manageReplicasErr == nil && updatedRS.Spec.MinReadySeconds > 0 &&
286 | 		updatedRS.Status.ReadyReplicas == *(updatedRS.Spec.Replicas) &&
287 | 		updatedRS.Status.AvailableReplicas != *(updatedRS.Spec.Replicas) {
288 | 		rsc.queue.AddAfter(key, time.Duration(updatedRS.Spec.MinReadySeconds)*time.Second)
289 | 	}
290 | 	return manageReplicasErr
291 | }
292 | ```
293 | 
294 | As the code above shows, syncReplicaSet finds the ReplicaSet and the pods whose owner reference points to it, hands them to manageReplicas, and then updates the ReplicaSet status based on the result.
295 | 
296 | 
297 | 
298 | ## manageReplicas
299 | 
300 | 
301 | 
302 | ```go
303 | func (rsc *ReplicaSetController) manageReplicas(filteredPods []*v1.Pod, rs *apps.ReplicaSet) error {
304 |     // Difference between the number of pods owned by this ReplicaSet and rs.Spec.Replicas
305 | 	diff := len(filteredPods) - int(*(rs.Spec.Replicas))
306 | 	rsKey, err := controller.KeyFunc(rs)
307 | 	...
308 |     // A negative diff means there are too few pods and more must be created
309 | 	if diff < 0 {
310 | 		diff *= -1
311 | 		if diff > rsc.burstReplicas {
312 | 			diff = rsc.burstReplicas
313 | 		}
314 | 		rsc.expectations.ExpectCreations(rsKey, diff)
315 | 		
316 | 		successfulCreations, err := slowStartBatch(diff, controller.SlowStartInitialBatchSize, func() error {
317 |             // Use podControl to create a pod whose owner reference is this ReplicaSet
318 | 			err := rsc.podControl.CreatePodsWithControllerRef(rs.Namespace, &rs.Spec.Template, rs, metav1.NewControllerRef(rs, rsc.GroupVersionKind))
319 | 			if err != nil {
320 | 				if errors.HasStatusCause(err, v1.NamespaceTerminatingCause) {
321 | 					return nil
322 | 				}
323 | 			}
324 | 			return err
325 | 		})
326 | 
327 | 		// Any skipped pods that we never attempted to start shouldn't be expected.
328 | 		// The skipped pods will be retried later. The next controller resync will
329 | 		// retry the slow start process.
330 | 		if skippedPods := diff - successfulCreations; skippedPods > 0 {
331 | 			klog.V(2).Infof("Slow-start failure. Skipping creation of %d pods, decrementing expectations for %v %v/%v", skippedPods, rsc.Kind, rs.Namespace, rs.Name)
332 | 			for i := 0; i < skippedPods; i++ {
333 | 				// Decrement the expected number of creates because the informer won't observe this pod
334 | 				rsc.expectations.CreationObserved(rsKey)
335 | 			}
336 | 		}
337 | 		return err
338 |         // If there are too many pods, delete the surplus
339 | 	} else if diff > 0 {
340 | 		if diff > rsc.burstReplicas {
341 | 			diff = rsc.burstReplicas
342 | 		}
343 | 		klog.V(2).InfoS("Too many replicas", "replicaSet", klog.KObj(rs), "need", *(rs.Spec.Replicas), "deleting", diff)
344 | 
345 | 		relatedPods, err := rsc.getIndirectlyRelatedPods(rs)
346 | 		utilruntime.HandleError(err)
347 | 
348 | 		// Choose which Pods to delete, preferring those in earlier phases of startup.
349 | 		podsToDelete := getPodsToDelete(filteredPods, relatedPods, diff)
350 | 
351 | 		rsc.expectations.ExpectDeletions(rsKey, getPodKeys(podsToDelete))
352 | 
353 | 		errCh := make(chan error, diff)
354 | 		var wg sync.WaitGroup
355 | 		wg.Add(diff)
356 | 		for _, pod := range podsToDelete {
357 | 			go func(targetPod *v1.Pod) {
358 | 				defer wg.Done()
359 | 				if err := rsc.podControl.DeletePod(rs.Namespace, targetPod.Name, rs); err != nil {
360 | 					// Decrement the expected number of deletes because the informer won't observe this deletion
361 | 					podKey := controller.PodKey(targetPod)
362 | 					rsc.expectations.DeletionObserved(rsKey, podKey)
363 | 					if !apierrors.IsNotFound(err) {
364 | 						klog.V(2).Infof("Failed to delete %v, decremented expectations for %v %s/%s", podKey, rsc.Kind, rs.Namespace, rs.Name)
365 | 						errCh <- err
366 | 					}
367 | 				}
368 | 			}(pod)
369 | 		}
370 | 		wg.Wait()
371 | 
372 | 		select {
373 | 		case err := <-errCh:
374 | 			// all errors have been reported before and they're likely to be the same, so we'll only return the first one we hit.
375 | 			if err != nil {
376 | 				return err
377 | 			}
378 | 		default:
379 | 		}
380 | 	}
381 | 
382 | 	return nil
383 | }
384 | 
385 | ```
386 | 
387 | 
388 | 
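389 | # Summary
390 | 
391 | The ReplicaSet controller does a fairly simple job: it counts the pods whose owner reference points to the ReplicaSet, compares that count with `replicaset.Spec.Replicas`, and creates or deletes pods to close the gap.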
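The comparison described in the summary can be sketched in a few lines. This is a minimal illustration, not the controller's code: `planReplicaChange` is a hypothetical helper, and the burst value of 500 is only an example argument standing in for `burstReplicas`.

```go
package main

import "fmt"

// planReplicaChange is a hypothetical helper (not a Kubernetes function).
// It compares the current pod count with the desired replica count and
// returns how many pods to create or delete in one sync, capped at burst.
func planReplicaChange(current, desired, burst int) (create, remove int) {
	diff := current - desired
	switch {
	case diff < 0:
		// Too few pods: create the missing ones, at most burst per sync.
		create = -diff
		if create > burst {
			create = burst
		}
	case diff > 0:
		// Too many pods: delete the surplus, at most burst per sync.
		remove = diff
		if remove > burst {
			remove = burst
		}
	}
	return create, remove
}

func main() {
	fmt.Println(planReplicaChange(2, 5, 500))    // 3 0
	fmt.Println(planReplicaChange(1200, 3, 500)) // 0 500
	fmt.Println(planReplicaChange(3, 3, 500))    // 0 0
}
```

Capping at the burst value means a very large gap is closed over several syncs rather than in a single batch.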


--------------------------------------------------------------------------------
/kubelet/cadvisor理解.md:
--------------------------------------------------------------------------------
 1 | # Overview
 2 | 
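 3 | `cAdvisor` is a container monitoring tool developed by Google. It monitors machines and containers in real time and collects data such as CPU, memory, and filesystem usage. cAdvisor is built into `Kubelet`, so it starts automatically on every worker node. The main code lives in the `./pkg/kubelet/cadvisor` directory.
 4 | 
 5 | # Interface
 6 | 
 7 | `Kubelet` calls the cAdvisor methods listed below. `ContainerInfo`, `SubcontainerInfo`, and `MachineInfo` are provided by cAdvisor itself, so `Kubelet` does not need to reimplement them and at most wraps them thinly, whereas methods such as `WatchEvents` ...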
 8 | 
 9 | ```go
10 | type Interface interface {
11 | 	Start() error    
12 | 	DockerContainer(...)
13 | 	ContainerInfo(..)
14 | 	ContainerInfoV2(...)
15 | 	SubcontainerInfo(...)
16 | 	MachineInfo() (...)
17 | 	VersionInfo() (...)
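18 | 	// Returns information about the filesystem that stores images
19 | 	ImagesFsInfo() (cadvisorapiv2.FsInfo, error)
20 | 	// Returns information about the root filesystem
21 | 	RootFsInfo() (cadvisorapiv2.FsInfo, error)
22 | 	// Returns a channel from which matching events can be read
23 | 	WatchEvents(request *events.Request) (*events.EventChannel, error)
24 | 	// Returns filesystem information for the given directory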
25 | 	GetDirFsInfo(path string) (cadvisorapiv2.FsInfo, error)
26 | }
27 | ```
28 | 
29 | So when and how does `Kubelet` use these interfaces?
30 | 
31 | `Kubelet` reads machine information through them during startup, and its OOM module uses `cAdvisor` to watch for OOM events.
32 | 
33 | 
34 | 
35 | 
36 | 
37 | # OOM
38 | 
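39 | OOM stands for "out of memory". Kubernetes uses Google's cAdvisor to monitor container resource usage; when a container runs out of memory, the event is recorded through the client-go event recorder.
40 | 
41 | Kubernetes currently supports OOM event recording only on Linux; other operating systems are not supported.
42 | 
43 | One current design flaw: once a pod is killed because of OOM, its event is no longer kept, because the killed pod has been replaced by a newly started object. To see such cases, the events for OOM-killed pods need to be collected centrally, for example in a central metrics system.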
44 | 
45 | ```go
46 | func (ow *realWatcher) Start(ref *v1.ObjectReference) error {
47 | 	oomLog, err := oomparser.New()
48 | 	...
49 | 	outStream := make(chan *oomparser.OomInstance, 10)
50 | 	go oomLog.StreamOoms(outStream)
51 | 
52 | 	go func() {
53 | 		defer runtime.HandleCrash()
54 | 
55 | 		for event := range outStream {
56 | 			if event.ContainerName == "/" {
57 | 				klog.V(1).Infof("Got sys oom event: %v", event)
58 | 				eventMsg := "System OOM encountered"
59 | 				if event.ProcessName != "" && event.Pid != 0 {
60 | 					eventMsg = fmt.Sprintf("%s, victim process: %s, pid: %d", eventMsg, event.ProcessName, event.Pid)
61 | 				}
62 |                 // Record the event with the client-go recorder
63 | 				ow.recorder.PastEventf(ref, metav1.Time{Time: event.TimeOfDeath}, v1.EventTypeWarning, systemOOMEvent, eventMsg)
64 | 			}
65 | 		}
66 | 		klog.Errorf("Unexpectedly stopped receiving OOM notifications")
67 | 	}()
68 | 	return nil
69 | }
70 | 
71 | ```
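The watcher above only records system-level OOMs, that is, events whose container name is `/`. A minimal sketch of that filtering, assuming a simplified `oomEvent` type in place of `oomparser.OomInstance` and a hypothetical `systemOOMMessages` helper that returns messages instead of calling an event recorder:

```go
package main

import "fmt"

// oomEvent is a simplified stand-in for oomparser.OomInstance (assumption).
type oomEvent struct {
	ContainerName string
	ProcessName   string
	Pid           int
}

// systemOOMMessages drains the stream until it is closed and returns one
// message per system-level OOM, i.e. per event whose ContainerName is "/".
func systemOOMMessages(stream <-chan oomEvent) []string {
	var msgs []string
	for ev := range stream {
		if ev.ContainerName != "/" {
			continue // container-level OOMs are ignored, as in the watcher above
		}
		msg := "System OOM encountered"
		if ev.ProcessName != "" && ev.Pid != 0 {
			msg = fmt.Sprintf("%s, victim process: %s, pid: %d", msg, ev.ProcessName, ev.Pid)
		}
		msgs = append(msgs, msg)
	}
	return msgs
}

func main() {
	stream := make(chan oomEvent, 10)
	stream <- oomEvent{ContainerName: "/", ProcessName: "java", Pid: 4242}
	stream <- oomEvent{ContainerName: "/kubepods/pod1/c1", ProcessName: "app", Pid: 7}
	close(stream)
	for _, m := range systemOOMMessages(stream) {
		fmt.Println(m)
	}
}
```

Only the first event produces a message here; the second belongs to a container cgroup and is skipped.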
72 | 
73 | 
74 | 
75 | 
76 | 
77 | 
78 | 
79 | ```bash
80 | docker run \
81 | --volume=/:/rootfs:ro \
82 | --volume=/var/run:/var/run:rw \
83 | --volume=/sys/fs/cgroup/cpu,cpuacct:/sys/fs/cgroup/cpuacct,cpu \
84 | --volume=/var/lib/docker/:/var/lib/docker:ro \
85 | --publish=8080:8080 \
86 | --detach=true \
87 | --name=cadvisor \
88 | --privileged=true \
89 | google/cadvisor:latest
90 | ```
91 | 
92 | 
93 | 
94 | 
95 | 
96 | 


--------------------------------------------------------------------------------
/kubelet/images/kubelet_ut.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/kubelet/images/kubelet_ut.png


--------------------------------------------------------------------------------
/kubelet/images/volumemanager.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/kubelet/images/volumemanager.png


--------------------------------------------------------------------------------
/kubelet/kubelet_nodestatus模块.md:
--------------------------------------------------------------------------------
  1 | 
  2 | 
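  3 | # Run
  4 | 
  5 | When `Kubelet` runs, one of its steps is to start a goroutine that **periodically** syncs the node status. This document covers node status syncing and the nodestatus module.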
  6 | 
  7 | ```go
  8 | // Code location: pkg/kubelet/kubelet.go
  9 | func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
 10 | 	...	
 11 |     // Start a goroutine that periodically syncs the node status
 12 | 	go wait.Until(kl.syncNodeStatus, kl.nodeStatusUpdateFrequency, wait.NeverStop)
 13 | 	...
 14 | }
 15 | ```
 16 | 
 17 | 
 18 | 
 19 | `syncNodeStatus` does two main things:
 20 | 
 21 | 1. If the node is not yet registered, register it with the API server
 22 | 2. Update the node status
 23 | 
 24 | ```go
 25 | // Code location: pkg/kubelet/kubelet_node_status.go
 26 | func (kl *Kubelet) syncNodeStatus() {
 27 | 	...
 28 | 	// Register the node with the API server
 29 | 	if kl.registerNode {
 30 | 		kl.registerWithAPIServer()
 31 | 	}
 32 |     // Update the node status
 33 | 	if err := kl.updateNodeStatus(); err != nil {
 34 | 		klog.Errorf("Unable to update node status: %v", err)
 35 | 	}
 36 | }
 37 | ```
 38 | 
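 39 | # Registering the node
 40 | 
 41 | Node registration works as follows:
 42 | 
 43 | 1. Check whether registration has already completed; if so, return immediately
 44 | 2. Enter a loop that builds a v1.Node object and tries to register it with the API server
 45 |    1. Try to create the node with the client; if the create succeeds, return true
 46 |    2. If the create fails with any error other than "already exists", registration has failed; return false
 47 |    3. Otherwise a node with the same name already exists, so fetch it from the API server by nodeName; if the get returns an error or no node, registration has failed
 48 |    4. If the existing node can be fetched, copy it, reconcile it against the newly built node, and patch the API server when anything changed
 49 | 3. If registration succeeds, set registrationCompleted to true; otherwise keep looping over tryRegisterWithAPIServer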
 50 | 
 51 | ```go
 52 | func (kl *Kubelet) registerWithAPIServer() {
 53 | 	if kl.registrationCompleted {
 54 | 		return
 55 | 	}
 56 | 	step := 100 * time.Millisecond
 57 | 
 58 | 	for {
 59 | 		time.Sleep(step)
 60 | 		step = step * 2
 61 | 		if step >= 7*time.Second {
 62 | 			step = 7 * time.Second
 63 | 		}
 64 | 		// initialNode builds the initial v1.Node object: it adds default labels such as the OS, adds taints where needed, and sets conditions such as NodeNetworkUnavailable to true
 65 | 		node, err := kl.initialNode(context.TODO())
 66 | 		...
 67 | 		klog.Infof("Attempting to register node %s", node.Name)
 68 |         // Try to register the node; see tryRegisterWithAPIServer below
 69 | 		registered := kl.tryRegisterWithAPIServer(node)
 70 |         // true means registration succeeded: set registrationCompleted to true; otherwise keep looping
 71 | 		if registered {
 72 | 			kl.registrationCompleted = true
 73 | 			return
 74 | 		}
 75 | 	}
 76 | }
 77 | 
 78 | func (kl *Kubelet) tryRegisterWithAPIServer(node *v1.Node) bool {
 79 |     // Try to create the node with the client; if the create succeeds, return true
 80 | 	_, err := kl.kubeClient.CoreV1().Nodes().Create(node)
 81 | 	if err == nil {
 82 | 		return true
 83 | 	}
 84 | 	// Any error other than "already exists" means registration failed
 85 | 	if !apierrors.IsAlreadyExists(err) {
 86 | 		klog.Errorf("Unable to register node %q with API server: %v", kl.nodeName, err)
 87 | 		return false
 88 | 	}
 89 |     // Reaching this point means the create failed because the node already exists
 90 | 	// Fetch the existing node by nodeName; an error or an empty result means registration failed
 91 | 	existingNode, err := kl.kubeClient.CoreV1().Nodes().Get(string(kl.nodeName), metav1.GetOptions{})
 92 | 	if err != nil {
 93 | 		klog.Errorf("Unable to register node %q with API server: error getting existing node: %v", kl.nodeName, err)
 94 | 		return false
 95 | 	}    
 96 | 	if existingNode == nil {
 97 | 		klog.Errorf("Unable to register node %q with API server: no node instance returned", kl.nodeName)
 98 | 		return false
 99 | 	}
100 | 
101 |     // The existing node was fetched successfully, so copy it
102 | 	originalNode := existingNode.DeepCopy()
103 | 	klog.Infof("Node %s was previously registered", kl.nodeName)
104 | 	
105 |     // Reconcile the annotation, default labels, and extended resources
106 | 	requiresUpdate := kl.reconcileCMADAnnotationWithExistingNode(node, existingNode)
107 | 	requiresUpdate = kl.updateDefaultLabels(node, existingNode) || requiresUpdate
108 | 	requiresUpdate = kl.reconcileExtendedResource(node, existingNode) || requiresUpdate
109 | 	if requiresUpdate {
110 | 		if _, _, err := nodeutil.PatchNodeStatus(kl.kubeClient.CoreV1(), types.NodeName(kl.nodeName), originalNode, existingNode); err != nil {
111 | 			klog.Errorf("Unable to reconcile node %q with API server: error updating node: %v", kl.nodeName, err)
112 | 			return false
113 | 		}
114 | 	}
115 | 
116 | 	return true
117 | }
118 | 
119 | ```
120 | 
121 | 
122 | 
123 | # Updating node status
124 | 
125 | The update starts in `updateNodeStatus`:
126 | 
127 | ```go
128 | func (kl *Kubelet) updateNodeStatus() error {
129 | 	klog.V(5).Infof("Updating node status")
130 |     // Try to update the node status up to nodeStatusUpdateRetry (5) times
131 | 	for i := 0; i < nodeStatusUpdateRetry; i++ {
132 | 		if err := kl.tryUpdateNodeStatus(i); err != nil {
133 | 			if i > 0 && kl.onRepeatedHeartbeatFailure != nil {
134 | 				kl.onRepeatedHeartbeatFailure()
135 | 			}
136 | 			klog.Errorf("Error updating node status, will retry: %v", err)
137 | 		} else {
138 | 			return nil
139 | 		}
140 | 	}
141 | 	return fmt.Errorf("update node status exceeds retry count")
142 | }
143 | 
144 | func (kl *Kubelet) tryUpdateNodeStatus(tryNumber int) error {	
145 | 	opts := metav1.GetOptions{}
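146 |     // On the first attempt, allow the read to be served from the API server's cache
147 | 	if tryNumber == 0 {
148 | 		util.FromApiserverCache(&opts)
149 | 	}     
150 |     // Fetch the node object
151 | 	node, err := kl.heartbeatClient.CoreV1().Nodes().Get(string(kl.nodeName), opts)
152 | 	// Deep-copy the node object so the patch can be computed against the original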
153 | 	originalNode := node.DeepCopy()	
154 | 	podCIDRChanged := false
155 | 	if len(node.Spec.PodCIDRs) != 0 {
156 |    // Join node.Spec.PodCIDRs and pass the result to updatePodCIDR
157 | 		podCIDRs := strings.Join(node.Spec.PodCIDRs, ",")
158 | 		if podCIDRChanged, err = kl.updatePodCIDR(podCIDRs); err != nil {
159 | 			klog.Errorf(err.Error())
160 | 		}
161 | 	}
162 | 	// Set the node status
163 | 	kl.setNodeStatus(node)
164 | 
165 | 	now := kl.clock.Now()
166 | 	if now.Before(kl.lastStatusReportTime.Add(kl.nodeStatusReportFrequency)) {
167 | 		if !podCIDRChanged && !nodeStatusHasChanged(&originalNode.Status, &node.Status) {
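168 | 			// Nothing changed and the report interval has not elapsed: skip the patch, but still mark the volumes as reported in use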
169 | 			kl.volumeManager.MarkVolumesAsReportedInUse(node.Status.VolumesInUse)
170 | 			return nil
171 | 		}
172 | 	}
173 | 
174 | 	// Patch the current status on the API server
175 | 	updatedNode, _, err := nodeutil.PatchNodeStatus(kl.heartbeatClient.CoreV1(), types.NodeName(kl.nodeName), originalNode, node)
176 | 	if err != nil {
177 | 		return err
178 | 	}
179 | 	kl.lastStatusReportTime = now
180 | 	kl.setLastObservedNodeAddresses(updatedNode.Status.Addresses)
181 | 	// If update finishes successfully, mark the volumeInUse as reportedInUse to indicate
182 | 	// those volumes are already updated in the node's status
183 | 	kl.volumeManager.MarkVolumesAsReportedInUse(updatedNode.Status.VolumesInUse)
184 | 	return nil
185 | }
186 | ```
187 | 
188 | ## Setting node status
189 | 
190 | The call chain is as follows:
191 | 
192 | ```bash
193 | updateNodeStatus()
194 | 	tryUpdateNodeStatus()
195 | 		setNodeStatus()
196 | 			setNodeStatusFuncs()
197 | 				defaultNodeStatusFuncs() # assigned when NewMainKubelet builds the kubelet: klet.setNodeStatusFuncs = klet.defaultNodeStatusFuncs()
198 | ```
199 | 
200 | Next, let's look at how `defaultNodeStatusFuncs` sets the node status.
201 | 
202 | `nodestatus.NodeAddress` returns a func that updates the node's addresses.
203 | 
204 | `nodestatus.MachineInfo` returns a func that updates machine information such as `ResourceCPU` and `ResourceMemory`, using data from sources such as `cAdvisor`.
205 | 
206 | `nodestatus.VersionInfo` returns a func that updates version information, including `KernelVersion` and `KubeletVersion`.
207 | 
208 | `nodestatus.DaemonEndpoints` returns a func that updates `node.Status.DaemonEndpoints`.
209 | 
210 | `nodestatus.Images` returns a func that updates `node.Status.Images`, the slice of image names and tags present on the node.
211 | 
212 | ```go
213 | func (kl *Kubelet) defaultNodeStatusFuncs() []func(*v1.Node) error {
214 | 	// if cloud is not nil, we expect the cloud resource sync manager to exist
215 | 	var nodeAddressesFunc func() ([]v1.NodeAddress, error)
216 |     // When a cloud provider such as AWS or Azure is in use, get the node addresses from cloudResourceSyncManager
217 | 	if kl.cloud != nil {
218 | 		nodeAddressesFunc = kl.cloudResourceSyncManager.NodeAddresses
219 | 	}
220 |     
221 | 	var validateHostFunc func() error
222 | 	if kl.appArmorValidator != nil {
223 | 		validateHostFunc = kl.appArmorValidator.ValidateHost
224 | 	}
225 | 	var setters []func(n *v1.Node) error
226 | 	// setters is a slice of funcs
227 | 	setters = append(setters,
228 | 		nodestatus.NodeAddress(kl.nodeIP, kl.nodeIPValidator, kl.hostname, kl.hostnameOverridden, kl.externalCloudProvider, kl.cloud, nodeAddressesFunc),
229 | 		nodestatus.MachineInfo(string(kl.nodeName), kl.maxPods, kl.podsPerCore, kl.GetCachedMachineInfo, kl.containerManager.GetCapacity,
230 | 			kl.containerManager.GetDevicePluginResourceCapacity, kl.containerManager.GetNodeAllocatableReservation, kl.recordEvent),
231 | 		nodestatus.VersionInfo(kl.cadvisor.VersionInfo, kl.containerRuntime.Type, kl.containerRuntime.Version),
232 | 		nodestatus.DaemonEndpoints(kl.daemonEndpoints),
233 | 		nodestatus.Images(kl.nodeStatusMaxImages, kl.imageManager.GetImageList),
234 | 		nodestatus.GoRuntime(),
235 | 	)	
236 |     // Append the func that sets the volume limits
237 | 	setters = append(setters, nodestatus.VolumeLimits(kl.volumePluginMgr.ListVolumePluginWithLimits))
238 | 
239 | 	setters = append(setters,
240 | 		nodestatus.MemoryPressureCondition(kl.clock.Now, kl.evictionManager.IsUnderMemoryPressure, kl.recordNodeStatusEvent),
241 | 		nodestatus.DiskPressureCondition(kl.clock.Now, kl.evictionManager.IsUnderDiskPressure, kl.recordNodeStatusEvent),
242 | 		nodestatus.PIDPressureCondition(kl.clock.Now, kl.evictionManager.IsUnderPIDPressure, kl.recordNodeStatusEvent),
243 | 		nodestatus.ReadyCondition(kl.clock.Now, kl.runtimeState.runtimeErrors, kl.runtimeState.networkErrors, kl.runtimeState.storageErrors, validateHostFunc, kl.containerManager.Status, kl.recordNodeStatusEvent),
244 | 		nodestatus.VolumesInUse(kl.volumeManager.ReconcilerStatesHasBeenSynced, kl.volumeManager.GetVolumesInUse),		
245 | 		kl.recordNodeSchedulableEvent,
246 | 	)
247 | 	return setters
248 | }
249 | ```
250 | 
251 | As a concrete example, here is the code for `nodestatus.Images`. The call above passes `imageListFunc = kl.imageManager.GetImageList`, which reads all images from `imageCache`; the resulting slice is then written to `node.Status.Images`.
252 | 
253 | ```go
254 | func Images(nodeStatusMaxImages int32,
255 | 	imageListFunc func() ([]kubecontainer.Image, error), // typically Kubelet.imageManager.GetImageList
256 | ) Setter {
257 | 	return func(node *v1.Node) error {
258 | 		// Update image list of this node
259 | 		var imagesOnNode []v1.ContainerImage
260 | 		containerImages, err := imageListFunc()
261 | 		if err != nil {
262 | 			node.Status.Images = imagesOnNode
263 | 			return fmt.Errorf("error getting image list: %v", err)
264 | 		}		
265 | 		if int(nodeStatusMaxImages) > -1 &&
266 | 			int(nodeStatusMaxImages) < len(containerImages) {
267 | 			containerImages = containerImages[0:nodeStatusMaxImages]
268 | 		}
269 | 
270 | 		for _, image := range containerImages {
271 | 			names := append(image.RepoDigests, image.RepoTags...)
272 | 			if len(names) > MaxNamesPerImageInNodeStatus {
273 | 				names = names[0:MaxNamesPerImageInNodeStatus]
274 | 			}
275 | 			imagesOnNode = append(imagesOnNode, v1.ContainerImage{
276 | 				Names:     names,
277 | 				SizeBytes: image.Size,
278 | 			})
279 | 		}
280 | 
281 | 		node.Status.Images = imagesOnNode
282 | 		return nil
283 | 	}
284 | }
285 | ```
286 | 
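287 | ### Updating node status
288 | 
289 | The setters above form a slice of funcs, each of which updates some part of node.status. Back in `setNodeStatus`, the kubelet loops over that slice and calls f(node) on each one to update the node status.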
290 | 
291 | ```go
292 | func (kl *Kubelet) setNodeStatus(node *v1.Node) {
293 | 	for i, f := range kl.setNodeStatusFuncs {
294 | 		klog.V(5).Infof("Setting node status at position %v", i)
295 | 		if err := f(node); err != nil {
296 | 			klog.Errorf("Failed to set some node status fields: %s", err)
297 | 		}
298 | 	}
299 | }
300 | ```
301 | 
302 | # Summary
303 | 
304 | The nodestatus module is responsible for registering the node and keeping the node.status information up to date.


--------------------------------------------------------------------------------
/kubelet/kubelet_pleg模块.md:
--------------------------------------------------------------------------------
  1 | # Overview
  2 | 
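  3 | PLEG stands for pod lifecycle event generator and is one of the most important kubelet modules. It calls the CRI interface to get the state of pods and containers, compares it with the state from the previous pass, and generates events that tell the kubelet whether containers need to be started, removed, and so on.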
  4 | 
  5 | 
  6 | 
  7 | # Instantiating GenericPLEG
  8 | 
  9 | PLEG is instantiated in the kubelet's `NewMainKubelet`, which does two things:
 10 | 
 11 | 1. Instantiate PLEG
 12 | 2. Register PLEG's `Healthy` method as a health check
 13 | 
 14 | ```go
 15 | func NewMainKubelet(kubeCfg *kubeletconfiginternal.KubeletConfiguration,
 16 | 	...) (*Kubelet, error) {
 17 |     ...
 18 |     // Instantiate PLEG
 19 | 	klet.pleg = pleg.NewGenericPLEG(klet.containerRuntime, plegChannelCapacity, plegRelistPeriod, klet.podCache, clock.RealClock{})
 20 |     // Register PLEG's Healthy method as a runtime health check
 21 |     klet.runtimeState.addHealthCheck("PLEG", klet.pleg.Healthy)
 22 | 	...	    
 23 | 	return klet, nil
 24 | }
 25 | ```
 26 | 
 27 | # Running Start
 28 | 
 29 | The kubelet's main `Run` method calls the `Start` method of the `PodLifecycleEventGenerator` interface.
 30 | 
 31 | ```go
 32 | // Code location: pkg/kubelet/kubelet.go
 33 | func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
 34 | 	...	
 35 |     // pleg.Start() runs relist periodically to produce events
 36 | 	kl.pleg.Start()
 37 |     // syncLoop is the consumer of the events produced by relist
 38 | 	kl.syncLoop(updates, kl)
 39 | }
 40 | ```
 41 | 
 42 | PLEG's `Start` method launches a new goroutine that runs `g.relist` once per second (the GenericPLEG instantiation above passes plegRelistPeriod, which is time.Second * 1).
 43 | 
 44 | ```go
 45 | // Code location: pkg/kubelet/pleg/generic.go
 46 | func (g *GenericPLEG) Start() {
 47 | 	go wait.Until(g.relist, g.relistPeriod, wait.NeverStop)
 48 | }
 49 | ```
 50 | 
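 51 | # The producer: relist
 52 | 
 53 | relist is the event producer. It works as follows:
 54 | 
 55 | 1. Call the container runtime's GetPods to list all pods
 56 | 2. Record the time of this relist. This matters a lot: the kubelet's PLEG health check reads this timestamp, and if it has never been set or is older than `relistThreshold` (3 minutes), PLEG is considered no longer active
 57 | 3. Update the metrics for the number of running pods and containers
 58 | 4. Compare the old and current pods and generate events
 59 | 5. Collect all containers from the old and current pod, and let computeEvents work out what changed for each one
 60 | 6. If the podCache is enabled, update the podCache
 61 | 7. Send the events to the channel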
 62 | 
 63 | ```go
 64 | func (g *GenericPLEG) relist() {	
 65 | 	timestamp := g.clock.Now()
 66 | 	...	
 67 |     // List all pods through the container runtime
 68 | 	podList, err := g.runtime.GetPods(true)
 69 | 	...
 70 | 	// Record the time of this relist
 71 | 	g.updateRelistTime(timestamp)
 72 | 
 73 | 	pods := kubecontainer.Pods(podList)
 74 |     // Update the metrics for running pods and containers	
 75 | 	updateRunningPodAndContainerMetrics(pods)
 76 |     // podRecords is a map that holds both the old and the current pod for each pod ID
 77 | 	g.podRecords.setCurrent(pods)
 78 | 	
 79 |     // Compare the old and current pods and generate events
 80 | 	eventsByPodID := map[types.UID][]*PodLifecycleEvent{}
 81 | 	for pid := range g.podRecords {
 82 | 		oldPod := g.podRecords.getOld(pid)
 83 | 		pod := g.podRecords.getCurrent(pid)
 84 | 		
 85 |         // Collect all containers from the old and current pod and let computeEvents work out what changed; computeEvents is covered below, after which we return to the main path
 86 | 		allContainers := getContainersFromPods(oldPod, pod)
 87 | 		for _, container := range allContainers {
 88 | 			events := computeEvents(oldPod, pod, &container.ID)
 89 | 			for _, e := range events {
 90 | 				updateEvents(eventsByPodID, e)
 91 | 			}
 92 | 		}
 93 | 	}
 94 | 	var needsReinspection map[types.UID]*kubecontainer.Pod
 95 | 	if g.cacheEnabled() {
 96 | 		needsReinspection = make(map[types.UID]*kubecontainer.Pod)
 97 | 	}
 98 | 	
 99 | 	for pid, events := range eventsByPodID {
100 | 		pod := g.podRecords.getCurrent(pid)
101 |         // If the podCache is enabled, update it
102 | 		if g.cacheEnabled() {
103 | 			if err := g.updateCache(pod, pid); err != nil {
104 | 				...
105 | 				needsReinspection[pid] = pod
106 | 				continue
107 | 			} else {
108 | 				delete(g.podsToReinspect, pid)
109 | 			}
110 | 		}		
111 |         // Update podRecords
112 | 		g.podRecords.update(pid)
113 | 		for i := range events {
114 | 		
115 | 			if events[i].Type == ContainerChanged {
116 | 				continue
117 | 			}
118 | 			select {
119 |               // Send the event to the channel
120 | 			case g.eventChannel <- events[i]:
121 | 			default:
122 | 				metrics.PLEGDiscardEvents.WithLabelValues().Inc()
123 | 				klog.Error("event channel is full, discard this relist() cycle event")
124 | 			}
125 | 		}
126 | 	}
127 | 
128 | 	if g.cacheEnabled() {
129 | 		if len(g.podsToReinspect) > 0 {
130 | 			// Update the cache for pods that still need reinspection
131 | 			for pid, pod := range g.podsToReinspect {
132 | 				if err := g.updateCache(pod, pid); err != nil {
133 | 					...
134 | 					needsReinspection[pid] = pod
135 | 				}
136 | 			}
137 | 		}
138 | 
139 | 		// Update the cache timestamp
140 | 		g.cache.UpdateTime(timestamp)
141 | 	}
142 | 	g.podsToReinspect = needsReinspection
143 | }
144 | ```
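relist never blocks on a full event channel: it uses `select` with a `default` branch and discards the event instead. A minimal sketch of that pattern, using a hypothetical `trySend` helper and plain strings in place of `*PodLifecycleEvent`:

```go
package main

import "fmt"

// trySend attempts a non-blocking send. It returns false when the channel
// buffer is full, in which case the event is dropped rather than queued.
func trySend(ch chan<- string, ev string) bool {
	select {
	case ch <- ev:
		return true
	default:
		return false
	}
}

func main() {
	// A buffer of 2 keeps the example small; the kubelet passes plegChannelCapacity.
	events := make(chan string, 2)
	discarded := 0
	for _, ev := range []string{"ContainerStarted", "ContainerDied", "ContainerRemoved"} {
		if !trySend(events, ev) {
			discarded++
		}
	}
	fmt.Println(len(events), discarded) // 2 1
}
```

The trade-off is that a slow consumer loses events instead of stalling the relist loop.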
145 | 
146 | 
147 | 
148 | ## Computing events: computeEvents
149 | 
150 | ```go
151 | func computeEvents(oldPod, newPod *kubecontainer.Pod, cid *kubecontainer.ContainerID) []*PodLifecycleEvent {
152 | 	var pid types.UID
153 | 	if oldPod != nil {
154 | 		pid = oldPod.ID
155 | 	} else if newPod != nil {
156 | 		pid = newPod.ID
157 | 	}
158 | 	oldState := getContainerState(oldPod, cid)
159 | 	newState := getContainerState(newPod, cid)
160 |     // Get the container's state in the old pod and in the current pod, then let generateEvents work out which events the change implies
161 | 	return generateEvents(pid, cid.ID, oldState, newState)
162 | }
163 | ```
164 | 
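165 | generateEvents compares the container's old state with its current state. If they match, it returns nothing; otherwise the outcome depends on the current state:
166 | 
167 | 1. "running": return an event of type ContainerStarted
168 | 2. "exited": the container has died; return an event of type ContainerDied
169 | 3. "unknown": return an event of type ContainerChanged
170 | 4. "non-existent": the container is gone. If the old state was "exited", its death was already reported, so return only ContainerRemoved; otherwise return both ContainerDied and ContainerRemoved
171 | 5. Any other state is unrecognized, and the function panics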
172 | 
173 | ```go
174 | func generateEvents(podID types.UID, cid string, oldState, newState plegContainerState) []*PodLifecycleEvent {
175 |     // If the old and current states are the same, nothing changed, so return
176 | 	if newState == oldState {
177 | 		return nil
178 | 	}
179 | 
180 | 	switch newState {
181 | 	case plegContainerRunning:
182 | 		return []*PodLifecycleEvent{{ID: podID, Type: ContainerStarted, Data: cid}}
183 | 	case plegContainerExited:
184 | 		return []*PodLifecycleEvent{{ID: podID, Type: ContainerDied, Data: cid}}
185 | 	case plegContainerUnknown:
186 | 		return []*PodLifecycleEvent{{ID: podID, Type: ContainerChanged, Data: cid}}
187 | 	case plegContainerNonExistent:
188 | 		switch oldState {
189 | 		case plegContainerExited:			
190 | 			return []*PodLifecycleEvent{{ID: podID, Type: ContainerRemoved, Data: cid}}
191 | 		default:
192 | 			return []*PodLifecycleEvent{{ID: podID, Type: ContainerDied, Data: cid}, {ID: podID, Type: ContainerRemoved, Data: cid}}
193 | 		}
194 | 	default:
195 | 		panic(fmt.Sprintf("unrecognized container state: a%v", newState))
196 | 	}
197 | }
198 | ```
199 | 
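200 | # The consumer: syncLoop
201 | 
202 | `syncLoop` consumes the events that PLEG produces. It works as follows:
203 | 
204 | 1. Call PLEG's `Watch` method to get the `eventChannel`
205 | 2. Loop over `syncLoopIteration`; whenever an event can be read from the `eventChannel`, call `HandlePodSyncs` to sync the pod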
206 | 
207 | ```go
208 | func (kl *Kubelet) syncLoop(updates <-chan kubetypes.PodUpdate, handler SyncHandler) {
209 | 	...
210 |     // PLEG's Watch method returns the eventChannel, which carries the events produced by relist above
211 | 	plegCh := kl.pleg.Watch()
212 |     // See syncLoopIteration below
213 |     if !kl.syncLoopIteration(updates, handler, syncTicker.C, housekeepingTicker.C, plegCh) {
214 | 			break
215 | 		}
216 |     ...
217 | }
218 | ```
219 | 
220 | 
221 | 
222 | ```go
223 | func (kl *Kubelet) syncLoopIteration(configCh <-chan kubetypes.PodUpdate, handler SyncHandler,
224 | 	syncCh <-chan time.Time, housekeepingCh <-chan time.Time, plegCh <-chan *pleg.PodLifecycleEvent) bool {
225 | 	select {
226 | 	...
227 |     // This case runs whenever an event can be read from the PLEG channel
228 | 	case e := <-plegCh:
229 | 		if isSyncPodWorthy(e) {
230 | 			// PLEG event for a pod; sync it.
231 |             // Call HandlePodSyncs on the handler to sync the pod, for example to start or remove containers
232 | 			if pod, ok := kl.podManager.GetPodByUID(e.ID); ok {
233 | 				klog.V(2).Infof("SyncLoop (PLEG): %q, event: %#v", format.Pod(pod), e)
234 | 				handler.HandlePodSyncs([]*v1.Pod{pod})
235 | 			} ...
236 | 		}
237 | 		// For a container-died event, call cleanUpContainersInPod to clean up the pod's containers
238 | 		if e.Type == pleg.ContainerDied {
239 | 			if containerID, ok := e.Data.(string); ok {
240 | 				kl.cleanUpContainersInPod(e.ID, containerID)
241 | 			}
242 | 		}
243 | 	...
244 | }
245 | ```
246 | 
247 | 
248 | 
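249 | # The Healthy health check
250 | 
251 | As shown in the relist section above, every relist calls `g.updateRelistTime(timestamp)` to record the relist time. When the kubelet checks PLEG's health, it reads that timestamp; if it has never been set, or if it is older than `relistThreshold` (3 minutes), PLEG is considered unhealthy, that is, no longer active.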
252 | 
253 | ```go
254 | func (g *GenericPLEG) Healthy() (bool, error) {
255 |     // Get the time of PLEG's last relist
256 | 	relistTime := g.getRelistTime()
257 |     // A zero time means relist has never run
258 | 	if relistTime.IsZero() {
259 | 		return false, fmt.Errorf("pleg has yet to be successful")
260 | 	}
261 | 	elapsed := g.clock.Since(relistTime)
262 |     // If the last relist is older than relistThreshold, PLEG is considered unhealthy
263 | 	if elapsed > relistThreshold {
264 | 		return false, fmt.Errorf("pleg was last seen active %v ago; threshold is %v", elapsed, relistThreshold)
265 | 	}
266 | 	return true, nil
267 | }
268 | 
269 | ```
270 | 
271 | 
272 | 
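273 | # Summary
274 | 
275 | Once per second, PLEG lists all pods through the container runtime and compares container states, writing events such as ContainerStarted or ContainerRemoved to a channel. The kubelet then consumes those events in `syncLoop` to move from the actual state to the desired state.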
276 | 
277 | 
278 | 
279 | 
280 | 
281 | 
282 | 
283 | 
284 | 
285 | 
286 | 
287 | 
288 | 
289 | 
290 | 
291 | 
292 | 
293 | 
294 | 
295 | 
296 | 
297 | 
298 | 
299 | 
300 | 
301 | 
302 | 
303 | 
304 | 
305 | 
306 | 
307 | 


--------------------------------------------------------------------------------
/kubelet/kubelet架构.md:
--------------------------------------------------------------------------------
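  1 | # Overview
  2 | 
  3 | The earlier notes on the DaemonSet controller and StatefulSet controller only covered objects that exist as data in etcd. Actually starting a Pod on a node cannot happen without the Kubelet, which can be seen as the final executor.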
  4 | 
  5 | 
  6 | 
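  7 | Let's first look at how a Pod gets created:
  8 | 
  9 | 1. A user creates a ReplicaSet with `kubectl`
 10 | 2. The ReplicaSet controller creates a set of Pod objects from the supplied spec
 11 | 3. The Scheduler observes the new Pods and adds them to its scheduling queue
 12 | 4. The Kubelet observes the Pod change, but skips the Pod because its NodeName is empty
 13 | 5. The Scheduler runs its filtering and scoring algorithms to pick the best Node for the Pod, then updates the Pod by setting `.spec.nodeName`
 14 | 6. The Kubelet observes the Pod change again; this time the Pod's NodeName is the Kubelet's own Node, so the Pod enters the Kubelet's sync loop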
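Steps 4 and 6 come down to one rule: a Kubelet only acts on Pods whose node name matches its own. The sketch below shows that rule as a client-side filter, which is an assumption made for illustration (as far as I know, the real kubelet narrows its API server watch with a field selector on `spec.nodeName` instead). The `pod` type and `podsForNode` helper are hypothetical.

```go
package main

import "fmt"

// pod is a simplified stand-in for v1.Pod (assumption).
type pod struct {
	Name     string
	NodeName string // empty until the scheduler binds the pod
}

// podsForNode keeps only the pods bound to nodeName. Unscheduled pods
// (empty NodeName) and pods bound to other nodes are skipped.
func podsForNode(pods []pod, nodeName string) []pod {
	var mine []pod
	for _, p := range pods {
		if p.NodeName == nodeName {
			mine = append(mine, p)
		}
	}
	return mine
}

func main() {
	pods := []pod{
		{Name: "web-1", NodeName: ""},       // not scheduled yet: skipped
		{Name: "web-2", NodeName: "node-a"}, // ours
		{Name: "web-3", NodeName: "node-b"}, // another node's pod
	}
	for _, p := range podsForNode(pods, "node-a") {
		fmt.Println(p.Name)
	}
}
```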
 15 | 
 16 | 
 17 | 
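 18 | # Questions
 19 | 
 20 | Starting from Pod startup, what exactly does the Kubelet do?
 21 | 
 22 | Why can the containers in one Pod share networking and storage?
 23 | 
 24 | Starting from Pod deletion, what exactly does the Kubelet do?
 25 | 
 26 | How does the Kubelet report the Node's own resources and other data?
 27 | 
 28 | How does the Kubelet perform garbage collection?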
 29 | 
 30 | 
 31 | 
 32 | 
 33 | 
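 34 | # The design as I imagine it
 35 | 
 36 | The Kubelet has to manage the whole Pod lifecycle, which means dealing with serving, networking, storage, GC, logs, and metrics.
 37 | 
 38 | 1. Provide an HTTP server that exposes its own data, such as Pod information and its own health check
 39 | 
 40 | 2. Storage
 41 | 
 42 |    1. A component that manages Volumes: tracks Volume state on the Node and reports it
 43 |    2. A component that manages Images: image size, GC, and so on
 44 |    3. A component that manages Containers: GC and so on
 45 | 
 46 | 3. Networking
 47 | 
 48 |    1. Report the Node's addresses
 49 |    2. Manage Pod addresses
 50 |    3. A component that manages iptables/IPVS
 51 | 
 52 | 4. Monitoring
 53 | 
 54 |    1. Monitor and report Pod status
 55 |    2. Monitor Node status, including CPU/Memory/Process
 56 | 
 57 | 5. Logs
 58 | 
 59 |    1. Pod/Container logs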
 60 |    
 61 |    
 62 |    
 63 |    
 64 | 
 65 | # Directory structure
 66 | 
 67 | ```bash
 68 | tree kubernetes/pkg/kubelet/
 69 | ```
 70 | 
 71 | 
 72 | 
 73 | 
 74 | 
 75 | | Folder            | Description                                     |
 76 | | ----------------- | ----------------------------------------------- |
 77 | | apis              |                                                 |
 78 | | cadvisor          |                                                 |
 79 | | certificate       |                                                 |
 80 | | checkpointmanager |                                                 |
 81 | | client            |                                                 |
 82 | | cloudresource     |                                                 |
 83 | | cm                |                                                 |
 84 | | config            |                                                 |
 85 | | configmap         |                                                 |
 86 | | container         |                                                 |
 87 | | cri               |                                                 |
 88 | | custommetrics     |                                                 |
 89 | | dockershim        | dockershim, the CRI implementation for Docker   |
 90 | | envvars           |                                                 |
 91 | | events            |                                                 |
 92 | | eviction          | Evicts Pods from the node                       |
110 | 
111 | 
112 | 
113 | ## <TODO: add an architecture diagram>
114 | 
115 | 
116 | 
117 | 
118 | 
119 | # Run
120 | 
121 | Open question: find out later where the data on the `updates` channel used to create Pods comes from.
122 | 
123 | ```go
124 | // Code location: `pkg/kubelet/kubelet.go`
125 | func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
126 | 	// Start the cloudResourceSyncManager
127 | 	if kl.cloudResourceSyncManager != nil {
128 | 		go kl.cloudResourceSyncManager.Run(wait.NeverStop)
129 | 	}
130 | 	// Initialize the internal modules that do not depend on the container runtime
131 | 	if err := kl.initializeModules(); err != nil {
132 | 	..
133 | 	}
134 | 
135 | 	// Start the volume manager
136 | 	go kl.volumeManager.Run(kl.sourcesReady, wait.NeverStop)
137 | 
138 | 	if kl.kubeClient != nil {		
139 |         // Start syncNodeStatus to sync the node status periodically
140 | 		go wait.Until(kl.syncNodeStatus, kl.nodeStatusUpdateFrequency, wait.NeverStop)
141 | 		go kl.fastStatusUpdateOnce()
142 | 
143 | 		// Start syncing the node lease
144 | 		go kl.nodeLeaseController.Run(wait.NeverStop)
145 | 	}
146 |     // Start the periodic container runtime check (updateRuntimeUp)
147 | 	go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)
148 | 
149 | 	// Set up the iptables rules
150 | 	if kl.makeIPTablesUtilChains {
151 | 		kl.initNetworkUtil()
152 | 	}
153 | 
154 |     // Start the podKiller
155 | 	go wait.Until(kl.podKiller.PerformPodKillingWork, 1*time.Second, wait.NeverStop)
156 | 
157 | 	
158 |     // Start the statusManager and the probeManager
159 | 	kl.statusManager.Start()
160 | 	kl.probeManager.Start()
161 | 
162 | 	// Start syncing RuntimeClasses if enabled.
163 | 	if kl.runtimeClassManager != nil {
164 | 		kl.runtimeClassManager.Start(wait.NeverStop)
165 | 	}
166 | 	
167 |     // PLEG is the Kubelet's most central module; start it
168 | 	kl.pleg.Start()
169 | 	kl.syncLoop(updates, kl)
170 | }
171 | ```
172 | 
173 | 
174 | 
175 | # Modules
176 | 
177 | PLEG (Pod Lifecycle Event Generator), the core module: calls the container runtime → gets container info → compares it with the pod cache → sends events to eventChannel
178 | 
179 | syncLoop (consumer) → consumes eventChannel → syncs Pods
180 | 
181 | evictionManager → evicts Pods to keep the cluster stable (under disk/memory pressure), in QoS-class order
182 | 
183 | probeManager → monitors container health; if the livenessProbe fails, the kubelet kills the container
184 | 
185 | statusManager → maintains Pod status and updates it to the API server (status info comes from sources such as probeManager)
186 | 
187 | imageManager → pulls images and performs the other image operations needed to run Pods
188 | 
189 | volumeManager → mounts/unmounts volumes and related operations
190 | 
191 | containerManager → manages cgroup configuration
192 | 
193 | runtimeManager → integrates with the underlying runtime implementation, such as docker/rkt
194 | 
195 | podManager → provides access to Pod information
196 | 
197 | imageGC → reclaims images
198 | 
199 | containerGC → cleans up containers
200 | 
201 | containerRefManager → reports container info such as failures; maps containerID to v1.ObjectReference
202 | 
203 | cAdvisor module → monitors container info
204 | 
205 | OOMWatcher → watches cAdvisor and generates Events
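
The PLEG → eventChannel → syncLoop pipeline listed above can be sketched in a few lines of Go. This is only a simplified illustration, not the kubelet's implementation: `lifecycleEvent`, `diffStates`, and the plain string states are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// lifecycleEvent is a hypothetical, simplified stand-in for a PLEG event.
type lifecycleEvent struct {
	ContainerID string
	Type        string
}

// diffStates compares the cached container states with a fresh relist result
// and returns one event per container whose state changed.
func diffStates(cached, current map[string]string) []lifecycleEvent {
	ids := make([]string, 0, len(current))
	for id := range current {
		ids = append(ids, id)
	}
	sort.Strings(ids) // keep the event order deterministic

	var events []lifecycleEvent
	for _, id := range ids {
		if cached[id] == current[id] {
			continue // nothing changed for this container
		}
		switch current[id] {
		case "running":
			events = append(events, lifecycleEvent{ContainerID: id, Type: "ContainerStarted"})
		case "exited":
			events = append(events, lifecycleEvent{ContainerID: id, Type: "ContainerDied"})
		}
	}
	return events
}

func main() {
	cached := map[string]string{"c1": "running", "c2": "running"}
	current := map[string]string{"c1": "running", "c2": "exited", "c3": "running"}

	// Producer side (PLEG): push the computed events into a channel.
	eventChannel := make(chan lifecycleEvent, 10)
	for _, e := range diffStates(cached, current) {
		eventChannel <- e
	}
	close(eventChannel)

	// Consumer side (syncLoop): drain the channel and react to each event.
	for e := range eventChannel {
		fmt.Printf("sync pod for container %s: %s\n", e.ContainerID, e.Type)
	}
}
```

The producer and consumer only share the channel, which is why the kubelet can relist the runtime on its own schedule while syncLoop reacts to events as they arrive.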
206 | 
207 | 
208 | 
209 | 
210 | 
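211 | # Q&A
212 | 
213 | Starting from Pod startup, what exactly does the kubelet do?
214 | 
215 | Why can the containers in a Pod share networking and storage with each other?
216 | 
217 | Answer: the containers in a Pod can share networking and storage because they run in container mode, as described in the notes on the open interface, CRI.
218 | 
219 | Starting from Pod deletion, what exactly does the kubelet do?
220 | 
221 | How does the kubelet report the Node's own resources and other data?
222 | 
223 | How does the kubelet perform garbage collection?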


--------------------------------------------------------------------------------
/kubelet/kubelet状态管理statusManager.md:
--------------------------------------------------------------------------------
 1 | # Overview
 2 | 
 3 | This document covers the kubelet's statusManager module.
 4 | 
 5 | # Instantiation
 6 | 
 7 | ```go
 8 | // Code location: pkg/kubelet/kubelet.go
 9 | func NewMainKubelet(...) (*Kubelet, error) {
10 | 	...
11 | 	// Instantiate the statusManager
12 | 	klet.statusManager = status.NewManager(klet.kubeClient, klet.podManager, klet)
13 | }
14 | 
15 | // Code location: pkg/kubelet/status/status_manager.go
16 | func NewManager(kubeClient clientset.Interface, podManager kubepod.Manager, podDeletionSafety PodDeletionSafetyProvider) Manager {
17 | 	return &manager{
18 | 		kubeClient:        kubeClient,
19 | 		podManager:        podManager,
20 | 		podStatuses:       make(map[types.UID]versionedPodStatus),
21 | 		podStatusChannel:  make(chan podStatusSyncRequest, 1000), // Buffer up to 1000 statuses
22 | 		apiStatusVersions: make(map[kubetypes.MirrorPodUID]uint64),
23 | 		podDeletionSafety: podDeletionSafety,
24 | 	}
25 | }
26 | ```
27 | 
28 | # Startup
29 | 
30 | ```go
31 | func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
32 | 	...
33 | 	// Start the statusManager
34 | 	kl.statusManager.Start()
35 | 	...
36 | }
37 | ```
38 | 
39 | 
40 | 
41 | 
42 | 
43 | ```go
44 | func (m *manager) Start() {
45 | 	...
46 | 	// Use a ticker for periodic syncs
47 | 	syncTicker := time.Tick(syncPeriod)
48 | 	// Start a goroutine that runs forever
49 | 	go wait.Forever(func() {
50 | 		select {
51 | 		// If a message can be received from podStatusChannel, call syncPod to sync it
52 | 		case syncRequest := <-m.podStatusChannel:
53 | 			m.syncPod(syncRequest.podUID, syncRequest.status)
54 | 		// Periodic batch sync
55 | 		case <-syncTicker:
56 | 			m.syncBatch()
57 | 		}
58 | 	}, 0)
59 | }
60 | ```
61 | 
62 | 
63 | 
64 | 
65 | 
66 | 
67 | 
68 | 
69 | 
70 | 
71 | 
72 | 
73 | 
74 | 
75 | 
76 | 


--------------------------------------------------------------------------------
/kubelet/pluginmanager.md:
--------------------------------------------------------------------------------
  1 | # Overview
  2 | 
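  3 | PluginManager is the kubelet's plugin registration service; its code lives in the `pkg/kubelet/pluginmanager` directory. That directory contains pluginwatcher, which the kubelet uses to register different types of node-level plugins, such as device plugins or CSI plugins. It discovers plugins by watching for inotify events under the directory returned by kubelet.getPluginsDir(). We refer to this directory as PluginsDir.
  4 | 
  5 | Plugins must implement the gRPC registration service specified in `pkg/kubelet/apis/pluginregistration/v*/api.proto`.
  6 | 
  7 | # Plugin discovery
  8 | 
  9 | Code location: `pkg/kubelet/pluginmanager/pluginwatcher/plugin_watcher.go`
 10 | 
 11 | When a plugin places a socket in that directory, or if the socket is already there when the kubelet starts, the pluginwatcher service discovers the plugin in PluginsDir. Likewise, if the socket file is removed from the directory, pluginwatcher removes the plugin from the kubelet.
 12 | 
 13 | Take a CSI driver as an example. The kubelet talks to an external CSI driver over a Unix socket, and on startup the CSI driver creates a socket on every Node at `/var/lib/kubelet/plugins/[CSIDriverName]/csi.sock` (CSIDriverName is the plugin's name). PluginManager watches this directory and registers the plugin as soon as a new file is created.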
 14 | 
 15 | ```go
 16 | func (w *Watcher) Start(stopCh <-chan struct{}) error {
 17 | 	w.stopped = make(chan struct{})
 18 | 	// init creates the directory with permission 0755
 19 | 	if err := w.init(); err != nil {
 20 | 		return err
 21 | 	}
 22 | 	// Use fsnotify to keep watching the directory
 23 | 	fsWatcher, err := fsnotify.NewWatcher()
 24 | 	...
 25 | 	w.fsWatcher = fsWatcher
 26 | 
 27 | 	// Traverse the plugin directory and add fsWatcher watches
 28 | 	if err := w.traversePluginDir(w.path); err != nil {
 29 | 		klog.Errorf("failed to traverse plugin socket path %q, err: %v", w.path, err)
 30 | 	}
 31 | 
 32 | 	go func(fsWatcher *fsnotify.Watcher) {
 33 | 		defer close(w.stopped)
 34 | 		for {
 35 | 			select {
 36 | 			// A file created under the directory triggers handleCreateEvent; a removal triggers handleDeleteEvent
 37 | 			case event := <-fsWatcher.Events:				
 38 | 				if event.Op&fsnotify.Create == fsnotify.Create {
 39 | 					err := w.handleCreateEvent(event)
 40 | 					...
 41 | 				} else if event.Op&fsnotify.Remove == fsnotify.Remove {
 42 | 					w.handleDeleteEvent(event)
 43 | 				}
 44 | 				continue
 45 | 			case err := <-fsWatcher.Errors:
 46 | 				if err != nil {
 47 | 					..
 48 | 			case <-stopCh:
 49 | 					...
 50 | 			}
 51 | 		}
 52 | 	}(fsWatcher)
 53 | 	return nil
 54 | }
 55 | ```
 56 | 
 57 | `handleCreateEvent` handles plugin files newly added to the directory.
 58 | 
 59 | The socket file name must not start with `.`, because such files are ignored.
 60 | 
 61 | ```go
 62 | func (w *Watcher) handleCreateEvent(event fsnotify.Event) error {
 63 | 	fi, err := os.Stat(event.Name)
 64 | 	..
 65 | 	// Ignore any file whose name starts with "."
 66 | 	if strings.HasPrefix(fi.Name(), ".") {
 67 | 		klog.V(5).Infof("Ignoring file (starts with '.'): %s", fi.Name())
 68 | 		return nil
 69 | 	}
 70 | 
 71 | 	if !fi.IsDir() {
 72 | 		isSocket, err := util.IsUnixDomainSocket(util.NormalizePath(event.Name))
 73 | 		...
 74 | 		if !isSocket {
 75 | 			klog.V(5).Infof("Ignoring non socket file %s", fi.Name())
 76 | 			return nil
 77 | 		}
 78 | 		// Hand the socket over to plugin registration
 79 | 		return w.handlePluginRegistration(event.Name)
 80 | 	}
 81 | 
 82 | 	return w.traversePluginDir(event.Name)
 83 | }
 84 | 
 85 | // Register the plugin
 86 | func (w *Watcher) handlePluginRegistration(socketPath string) error {
 87 | 	if runtime.GOOS == "windows" {
 88 | 		socketPath = util.NormalizePath(socketPath)
 89 | 	}
 90 | 	// Record the socket in the desired state of world and surface any error to the caller
 91 | 	return w.desiredStateOfWorld.AddOrUpdatePlugin(socketPath)
 92 | }
 93 | ```
 94 | 
 95 | 
 96 | 
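 97 | # Plugin registration
 98 | 
 99 | For any discovered plugin, the kubelet issues a `Registration.GetInfo` gRPC call to get the plugin type, name, endpoint, and supported service API versions.
100 | 
101 | If any of the following registration steps fails, registration starts from scratch on retry:
102 | 
103 | 1. `Registration.GetInfo` is called against the socket.
104 | 
105 | 2. Validate is called against the internal plugin type handler.
106 | 
107 | 3. Register is called against the internal plugin type handler.
108 | 
109 | 4. NotifyRegistrationStatus is called against the socket to indicate the registration result.
110 | 
111 | During the plugin initialization phase, the kubelet issues plugin-specific calls (e.g. DevicePlugin::GetDevicePluginOptions).
112 | 
113 | Once the kubelet determines that it is ready to use your plugin, it issues a `Registration.NotifyRegistrationStatus` gRPC call.
114 | 
115 | If the plugin removes its socket from PluginsDir, this is interpreted as plugin deregistration. If any of the following deregistration steps fails, deregistration starts from scratch on retry:
116 | 
117 | 1. `Registration.GetInfo` is called against the socket.
118 | 
119 | 2. DeRegisterPlugin is called against the internal plugin type handler.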
120 | 
121 | 
122 | 
123 | 
124 | 
125 | ## Unit test for plugin registration
126 | 
127 | The unit test for plugin registration is located at `pkg/kubelet/pluginmanager/pluginwatcher/plugin_watcher_test.go`
128 | 
129 | ```go
130 | func TestPluginRegistration(t *testing.T) {
131 | 	// Clean up the test plugins at the end
132 | 	defer cleanup(t)
133 | 
134 | 	// Create a new DesiredStateOfWorld cache
135 | 	dsw := cache.NewDesiredStateOfWorld()
136 | 	// Start the watcher, which records discovered plugin sockets in the desiredStateOfWorld
137 | 	newWatcher(t, dsw, wait.NeverStop)
138 | 
139 | 	for i := 0; i < 10; i++ {
140 | 		// socketDir is set in the unit test's init function: it is the plugin_test directory, so socketPath is ./plugin_test/plugin-[0-9].sock
141 | 		socketPath := fmt.Sprintf("%s/plugin-%d.sock", socketDir, i)
142 | 		// The plugin name is example-plugin-[0-9]
143 | 		pluginName := fmt.Sprintf("example-plugin-%d", i)
144 | 		// Instantiate an ExamplePlugin
145 | 		p := NewTestExamplePlugin(pluginName, registerapi.DevicePlugin, socketPath, supportedVersions...)
146 | 		require.NoError(t, p.Serve("v1beta1", "v1beta2"))
147 | 		// Get the plugin info
148 | 		pluginInfo := GetPluginInfo(p)
149 | 		// Wait until the plugin's socket shows up in the desired state of world
150 | 		waitForRegistration(t, pluginInfo.SocketPath, dsw)
151 | 
152 | 		// Check the plugin's desired state
153 | 		dswPlugins := dsw.GetPluginsToRegister()
154 | 		if len(dswPlugins) != 1 {
155 | 			t.Fatalf("TestPluginRegistration: desired state of world length should be 1 but it's %d", len(dswPlugins))
156 | 		}
157 | 
158 | 		// Stop the plugin and verify that the desired state of world cache is cleared
159 | 		require.NoError(t, p.Stop())
160 | 		waitForUnregistration(t, pluginInfo.SocketPath, dsw)
161 | 		dswPlugins = dsw.GetPluginsToRegister()
162 | 		if len(dswPlugins) != 0 {
163 | 			t.Fatalf("TestPluginRegistration: desired state of world length should be 0 but it's %d", len(dswPlugins))
164 | 		}
165 | 	}
166 | }
167 | ```
168 | 
169 | 
170 | 
171 | 
172 | 
173 | 
174 | 
175 | 


--------------------------------------------------------------------------------
/kubelet/volume_plugin.md:
--------------------------------------------------------------------------------
  1 | # Overview
  2 | 
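  3 | A few days ago I finished reading the `pkg/kubelet/volumemanager` module, which covers how the kubelet itself mounts/unmounts and attaches/detaches volumes and maintains caches of the desired and actual volume state. That module repeatedly calls the plugins in `pkg/volume` to perform mount/unmount and similar operations.
  4 | 
  5 | This document focuses on `pkg/volume`.
  6 | 
  7 | Commit ID: 370e2f4b298e7276f560c131e24d8f91b88ae89f
  8 | 
  9 | # Questions to start with
 10 | 
 11 | 1. What interfaces does volume define?
 12 | 2. When are volume plugins registered, and how are they managed and extended?
 13 | 
 14 | 
 15 | 
 16 | # Interfaces
 17 | 
 18 | Kubernetes volume plugins extend the Kubernetes volume interface to support block and/or file storage systems.
 19 | 
 20 | The `VolumePlugin` interface defines the most basic methods a volume plugin must implement, shown below. Reading it alongside the volume plugin unit tests makes these required methods easier to understand.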
 21 | 
 22 | ```go
 23 | // Code location: pkg/volume/plugins.go
 24 | type VolumePlugin interface {
 25 | 	Init(host VolumeHost) error
 26 | 	GetPluginName() string
 27 | 	GetVolumeName(spec *Spec) (string, error)
 28 | 	CanSupport(spec *Spec) bool
 29 | 	...
 30 | }
 31 | ```
 32 | 
 33 | 
 34 | 
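 35 | ## Unit tests
 36 | 
 37 | The volume plugin unit tests make it easy to see the most basic methods a volume plugin has to implement:
 38 | 
 39 | 1. Define the plugin name
 40 | 2. Implement `Init()`, `GetPluginName()`, `NewMounter()`, `GetVolumeName()` and the other methods
 41 | 3. Test that `VolumePluginMgr` tracks a registered plugin
 42 |    1. Call `VolumePluginMgr.InitPlugins` to initialize the plugins; this records each plugin under its name, together with the volume host, in the `VolumePluginMgr` object
 43 |    2. Call `VolumePluginMgr.FindPluginByName` to check that the plugin named testPluginName can be found
 44 |    3. Call `VolumePluginMgr.FindPluginBySpec(nil)` to test `FindPluginBySpec` itself: when the volume spec is nil it should return an error
 45 |    4. Then instantiate `volumeSpec := &Spec{}` and call `VolumePluginMgr.FindPluginBySpec(volumeSpec)`; this time no error should be returned.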
 46 | 
 47 | ```go
 48 | // Code location: pkg/volume/plugins_test.go
 49 | const testPluginName = "kubernetes.io/testPlugin"
 50 | type testPlugins struct {
 51 | }
 52 | // Init initializes the volume plugin; some plugins, such as NFS, store the host here
 53 | func (plugin *testPlugins) Init(host VolumeHost) error {
 54 | 	return nil
 55 | }
 56 | // Returns the plugin name. By convention, volume plugin names are namespaced, e.g. "example.com/volume", with a / in the middle
 57 | func (plugin *testPlugins) GetPluginName() string {
 58 | 	return testPluginName
 59 | }
 60 | // Returns the volume name or ID that uniquely identifies the device/directory; e.g. the NFS plugin returns nfsserver/nfspath
 61 | func (plugin *testPlugins) GetVolumeName(spec *Spec) (string, error) {
 62 | 	return "", nil
 63 | }
 64 | ...
 65 | // Instantiates a volume.Mounter
 66 | func (plugin *testPlugins) NewMounter(spec *Spec, podRef *v1.Pod, opts VolumeOptions) (Mounter, error) {
 67 | 	return nil, nil
 68 | }
 69 | 
 70 | func newTestPlugin() []VolumePlugin {
 71 | 	return []VolumePlugin{&testPlugins{}}
 72 | }
 73 | 
 74 | func TestVolumePluginMgrFunc(t *testing.T) {
 75 | 	vpm := VolumePluginMgr{}
 76 | 	var prober DynamicPluginProber = nil // TODO (#51147) inject mock
 77 | 	vpm.InitPlugins(newTestPlugin(), prober, nil)
 78 | 
 79 | 	plug, err := vpm.FindPluginByName(testPluginName)
 80 | 	if err != nil {
 81 | 		t.Errorf("Can't find the plugin by name")
 82 | 	}
 83 | 	if plug.GetPluginName() != testPluginName {
 84 | 		t.Errorf("Wrong name: %s", plug.GetPluginName())
 85 | 	}
 86 | 
 87 | 	plug, err = vpm.FindPluginBySpec(nil)
 88 | 	if err == nil {
 89 | 		t.Errorf("Should return error if volume spec is nil")
 90 | 	}
 91 | 
 92 | 	volumeSpec := &Spec{}
 93 | 	plug, err = vpm.FindPluginBySpec(volumeSpec)
 94 | 	if err != nil {
 95 | 		t.Errorf("Should return test plugin if volume spec is not nil")
 96 | 	}
 97 | }
 98 | ```
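
To make the three behaviours exercised by the test concrete (lookup by name, error on a nil spec, match on a non-nil spec), here is a self-contained miniature of a plugin manager. It is a sketch under stated assumptions, not `pkg/volume`: `spec`, `plugin`, `pluginMgr`, and their methods are hypothetical, the name check is reduced to "contains a /", and the test plugin's `CanSupport` is assumed to return true.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// spec and plugin are hypothetical, trimmed-down stand-ins for Spec and VolumePlugin.
type spec struct{}

type plugin interface {
	GetPluginName() string
	CanSupport(s *spec) bool
}

type testPlugin struct{}

func (testPlugin) GetPluginName() string { return "kubernetes.io/testPlugin" }
func (testPlugin) CanSupport(*spec) bool { return true } // assumption for this sketch

type pluginMgr struct {
	plugins map[string]plugin
}

// initPlugins validates names, rejects duplicates, and stores each plugin by name.
func (m *pluginMgr) initPlugins(list []plugin) error {
	if m.plugins == nil {
		m.plugins = map[string]plugin{}
	}
	var problems []string
	for _, p := range list {
		name := p.GetPluginName()
		if !strings.Contains(name, "/") { // simplified naming check
			problems = append(problems, fmt.Sprintf("invalid plugin name %q", name))
			continue
		}
		if _, found := m.plugins[name]; found {
			problems = append(problems, fmt.Sprintf("plugin %q registered more than once", name))
			continue
		}
		m.plugins[name] = p
	}
	if len(problems) > 0 {
		return errors.New(strings.Join(problems, "; "))
	}
	return nil
}

func (m *pluginMgr) findPluginByName(name string) (plugin, error) {
	p, ok := m.plugins[name]
	if !ok {
		return nil, fmt.Errorf("no volume plugin matched name %q", name)
	}
	return p, nil
}

// findPluginBySpec returns the single plugin that supports the spec.
func (m *pluginMgr) findPluginBySpec(s *spec) (plugin, error) {
	if s == nil {
		return nil, errors.New("volume spec is nil")
	}
	var matches []plugin
	for _, p := range m.plugins {
		if p.CanSupport(s) {
			matches = append(matches, p)
		}
	}
	switch len(matches) {
	case 0:
		return nil, errors.New("no volume plugin matched")
	case 1:
		return matches[0], nil
	default:
		return nil, errors.New("multiple volume plugins matched")
	}
}

func main() {
	mgr := &pluginMgr{}
	fmt.Println("init error:", mgr.initPlugins([]plugin{testPlugin{}}))

	_, err := mgr.findPluginBySpec(nil)
	fmt.Println("nil spec error:", err)

	p, err := mgr.findPluginBySpec(&spec{})
	fmt.Println("empty spec:", p.GetPluginName(), err)
}
```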
 99 | 
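100 | Volumes include ephemeral volumes and persistent volumes. Ephemeral volumes, such as `emptyDir`, are created and destroyed along with the Pod's lifecycle. Persistent volumes, such as AWS EBS or NFS, are independent of the Pod's lifecycle.
101 | 
102 | The persistent volume interface extends `VolumePlugin` with `GetAccessModes`: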
103 | 
104 | ```go
105 | type PersistentVolumePlugin interface {
106 | 	VolumePlugin
107 | 	GetAccessModes() []v1.PersistentVolumeAccessMode
108 | }
109 | ```
110 | 
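111 | See `pkg/volume/plugins.go` for the other interfaces.
112 | 
113 | 
114 | 
115 | # Initializing volume plugins
116 | 
117 | Let's start with the code under the `kubelet` cmd directory. Following the call chain (shown below), the kubelet's plugins are returned by `ProbeVolumePlugins`: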
118 | 
119 | ```bash
120 | main()
121 | 	app.NewKubeletCommand()		
122 | 		UnsecuredDependencies() # builds the default KubeletDeps
123 | 			plugins, err := ProbeVolumePlugins(featureGate)
124 | ```
125 | 
126 | Next, let's look at `ProbeVolumePlugins`:
127 | 
128 | ```go
129 | func ProbeVolumePlugins(featureGate featuregate.FeatureGate) ([]volume.VolumePlugin, error) {
130 | 	allPlugins := []volume.VolumePlugin{}
131 | 
132 | 	var err error
133 | 	// Legacy volumes are those provided by cloud vendors, e.g. awsebs, azuredisk
134 | 	allPlugins, err = appendLegacyProviderVolumes(allPlugins, featureGate)
135 | 	// Then append the common volume plugins (nfs, iscsi, etc.) to allPlugins
136 | 	allPlugins = append(allPlugins, emptydir.ProbeVolumePlugins()...)
137 | 	allPlugins = append(allPlugins, git_repo.ProbeVolumePlugins()...)
138 | 	allPlugins = append(allPlugins, hostpath.ProbeVolumePlugins(volume.VolumeConfig{})...)
139 | 	allPlugins = append(allPlugins, nfs.ProbeVolumePlugins(volume.VolumeConfig{})...)
140 | 	allPlugins = append(allPlugins, secret.ProbeVolumePlugins()...)
141 | 	allPlugins = append(allPlugins, iscsi.ProbeVolumePlugins()...)
142 | 	allPlugins = append(allPlugins, glusterfs.ProbeVolumePlugins()...)
143 | 	allPlugins = append(allPlugins, rbd.ProbeVolumePlugins()...)
144 | 	allPlugins = append(allPlugins, quobyte.ProbeVolumePlugins()...)
145 | 	allPlugins = append(allPlugins, cephfs.ProbeVolumePlugins()...)
146 | 	allPlugins = append(allPlugins, downwardapi.ProbeVolumePlugins()...)
147 | 	allPlugins = append(allPlugins, fc.ProbeVolumePlugins()...)
148 | 	allPlugins = append(allPlugins, flocker.ProbeVolumePlugins()...)
149 | 	allPlugins = append(allPlugins, configmap.ProbeVolumePlugins()...)
150 | 	allPlugins = append(allPlugins, projected.ProbeVolumePlugins()...)
151 | 	allPlugins = append(allPlugins, portworx.ProbeVolumePlugins()...)
152 | 	allPlugins = append(allPlugins, scaleio.ProbeVolumePlugins()...)
153 | 	allPlugins = append(allPlugins, local.ProbeVolumePlugins()...)
154 | 	allPlugins = append(allPlugins, storageos.ProbeVolumePlugins()...)
155 | 	allPlugins = append(allPlugins, csi.ProbeVolumePlugins()...)
156 | 	return allPlugins, nil
157 | }
158 | ```
159 | 
160 | In other words, the plugins under `pkg/volume/` are all added to `allPlugins`.
161 | 
162 | 
163 | 
164 | Similarly, `VolumePluginMgr` is instantiated when the kubelet is instantiated.
165 | 
166 | ```go
167 | // Code location: pkg/kubelet/kubelet.go
168 | func NewMainKubelet(kubeCfg *kubeletconfiginternal.KubeletConfiguration,...) (*Kubelet, error) {	
169 |     ...
170 | 	klet.volumePluginMgr, err =
171 | 		NewInitializedVolumePluginMgr(klet, secretManager, configMapManager, tokenManager, kubeDeps.VolumePlugins, kubeDeps.DynamicPluginProber)
172 |     ...
173 | }
174 | ```
175 | 
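176 | Through the call chain `NewInitializedVolumePluginMgr()` --> `kvh.volumePluginMgr.InitPlugins()`, all volume plugins are initialized.
177 | 
178 | Here is the workflow of the `InitPlugins` method:
179 | 
180 | 1. Iterate over all volume plugins
181 | 2. Validate the plugin name with `validation.IsQualifiedName`; by convention the pattern is example.com/somename, with a / in it
182 | 3. Call each plugin's own Init method
183 | 4. Load the plugin into VolumePluginMgr; at this point all plugins are registered in the VolumePluginMgr object
184 | 
185 | ```go
186 | // Code location: pkg/volume/plugins.go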
187 | func (pm *VolumePluginMgr) InitPlugins(plugins []VolumePlugin, prober DynamicPluginProber, host VolumeHost) error {	
188 |     
189 | 	pm.Host = host
190 | 	if prober == nil {
191 | 		pm.prober = &dummyPluginProber{}
192 | 	} else {
193 | 		pm.prober = prober
194 | 	}
195 | 	if err := pm.prober.Init(); err != nil {
196 | 		// Prober init failure should not affect the initialization of other plugins.
197 | 		klog.Errorf("Error initializing dynamic plugin prober: %s", err)
198 | 		pm.prober = &dummyPluginProber{}
199 | 	}
200 | 
201 | 	if pm.plugins == nil {
202 | 		pm.plugins = map[string]VolumePlugin{}
203 | 	}
204 | 	if pm.probedPlugins == nil {
205 | 		pm.probedPlugins = map[string]VolumePlugin{}
206 | 	}
207 | 
208 | 	allErrs := []error{}
209 | 	// Iterate over all volume plugins (see ProbeVolumePlugins above for the list)
210 | 	for _, plugin := range plugins {
211 | 		name := plugin.GetPluginName()
212 | 		// Validate the plugin name; by convention the pattern is example.com/somename
213 | 		if errs := validation.IsQualifiedName(name); len(errs) != 0 {
214 | 			allErrs = append(allErrs, fmt.Errorf("volume plugin has invalid name: %q: %s", name, strings.Join(errs, ";")))
215 | 			continue
216 | 		}
217 | 
218 | 		if _, found := pm.plugins[name]; found {
219 | 			allErrs = append(allErrs, fmt.Errorf("volume plugin %q was registered more than once", name))
220 | 			continue
221 | 		}
222 | 		// Call the plugin's own Init method
223 | 		err := plugin.Init(host)
224 | 		// Load the plugin into VolumePluginMgr
225 | 		pm.plugins[name] = plugin
226 | 	}
227 | 	return utilerrors.NewAggregate(allErrs)
228 | }
229 | ```
230 | 
231 | 
232 | 
233 | 
234 | 
235 | 


--------------------------------------------------------------------------------
/kubelet/测试kubelet.md:
--------------------------------------------------------------------------------
 1 | # Overview
 2 | 
 3 | This document covers testing the kubelet, including unit tests and functional tests.
 4 | 
 5 | # Unit tests
 6 | 
 7 | Test only `pkg/kubelet`:
 8 | 
 9 | ```bash
10 | make test WHAT=./pkg/kubelet
11 | # go test also works
12 | go test ./pkg/kubelet
13 | ```
14 | 
15 | ![](./images/kubelet_ut.png)
16 | 
17 | 
18 | 
19 | Test the kubelet and all of its subpackages:
20 | 
21 | ```bash
22 | make test WHAT=./pkg/kubelet/...
23 | # go test also works
24 | go test ./pkg/kubelet/...
25 | ```
26 | 
27 | 
28 | 
29 | Unit test coverage:
30 | 
31 | ```bash
32 | make test WHAT=./pkg/kubelet KUBE_COVER=y
33 | ```
34 | 
35 | 
36 | 
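37 | Run a single unit test function; `go test ./pkg/kubelet -v -run TestSyncPodsDeletesWhenSourcesAreReady` also works. Note that `-run` takes a regular expression, so tests whose names merely contain the pattern also run, as the output below shows.
38 | 
39 | ```bash
40 | make test WHAT=./pkg/kubelet GOFLAGS="-v" KUBE_TEST_ARGS='-run TestSyncPodsDeletesWhenSourcesAreReady'
41 | # Output: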
42 | +++ [0218 04:26:52] Running tests without code coverage
43 | === RUN   TestSyncPodsDeletesWhenSourcesAreReadyPerQOS
44 | W0218 04:27:01.705184   25186 mutation_detector.go:53] Mutation detector is enabled, this will result in memory leakage.
45 | W0218 04:27:03.706031   25186 kubelet_getters.go:311] Path "/tmp/kubelet_test.502019488/pods/12345678/volumes" does not exist
46 | W0218 04:27:03.706145   25186 kubelet_getters.go:311] Path "/tmp/kubelet_test.502019488/pods/12345678/volumes" does not exist
47 | W0218 04:27:03.706195   25186 kubelet_getters.go:311] Path "/tmp/kubelet_test.502019488/pods/12345678/volumes" does not exist
48 | --- PASS: TestSyncPodsDeletesWhenSourcesAreReadyPerQOS (4.10s)
49 | === RUN   TestSyncPodsDeletesWhenSourcesAreReady
50 | W0218 04:27:05.807885   25186 mutation_detector.go:53] Mutation detector is enabled, this will result in memory leakage.
51 | --- PASS: TestSyncPodsDeletesWhenSourcesAreReady (0.00s)
52 | PASS
53 | ok  	k8s.io/kubernetes/pkg/kubelet	4.173
54 | ```
55 | 
56 | Run the unit tests for 6 iterations with 3 parallel workers (18 runs in total; repeating a test can sometimes surface problems):
57 | 
58 | ```bash
59 | make test WHAT=./pkg/kubelet  PARALLEL=3 ITERATION=6
60 | ```
61 | 
62 | 
63 | 
64 | # Integration tests
65 | 
66 | 
67 | 
68 | 
69 | 
70 | # End-to-end tests


--------------------------------------------------------------------------------
/kubernetes-ingress-controller/images/images01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/kubernetes-ingress-controller/images/images01.png


--------------------------------------------------------------------------------
/kubernetes-ingress-controller/ssl-terminationandssl-passthrough.md:
--------------------------------------------------------------------------------
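  1 | # Why this document
  2 | A colleague asked whether a single ingress controller can support both SSL termination and SSL passthrough at the same time.
  3 | This document answers that question.
  4 | 
  5 | # Models
  6 | Traditional secure communication model (traffic is encrypted only up to the load balancer / reverse proxy entry point):
  7 | client --- (via https) ---> nginx ---- (via http) ----> upstream backend services
  8 | A more secure communication model:
  9 | client --- (via https) ---> nginx ---- (via https) ----> upstream backend services
 10 | 
 11 | # Example
 12 | ![](./images/images01.png)
 13 | Notes:
 14 | svc1: reproduces the traditional model. The client and the ingress controller (nginx) communicate over HTTPS, but the ingress controller (nginx) talks to svc1 over plaintext HTTP.
 15 | 
 16 | svc2: the SSL termination model with an encrypted backend. HTTPS communication between the client and svc2 is split into two segments: after the client establishes an HTTPS connection with nginx, nginx decrypts the client's request, then sends an HTTPS request to svc2 and re-encrypts the request data. HTTPS communication in which the client-side SSL session ends at the reverse proxy or load balancer is called "ssl-termination".
 17 | 
 18 | svc3: the SSL passthrough model. nginx does not decrypt the client's HTTPS request; it forwards it directly to the backend svc3 service. The client-side SSL session ends not at nginx but in the Pod behind svc3. This is called "ssl-passthrough". It is especially suitable when the backend service validates client certificates, and it also relieves nginx of the encryption and decryption load.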
 19 | 
 20 | 
 21 | # Environment setup
 22 | Install the ingress-nginx controller:
 23 | ```bash
 24 | helm install ingress-nginx -n kube-system https://github.com/kubernetes/ingress-nginx/releases/download/helm-chart-4.11.3/ingress-nginx-4.11.3.tgz
 25 | ```
 26 | 
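 27 | ## ssl-termination, with plaintext (http) between nginx and the backend service
 28 | There are two main ways to configure encrypted web traffic: SSL termination and SSL passthrough.
 29 | 
 30 | With SSL termination, the client's SSL request is decrypted at the load balancer / reverse proxy. Decryption adds CPU load on the load balancer but simplifies SSL certificate management. Whether traffic between the load balancer and the backend is encrypted has to be configured separately in nginx.
 31 | 
 32 | With SSL passthrough, the client's SSL connection is forwarded as-is to the backend. Unlike SSL termination, the request stays encrypted all the way, and the decryption load is spread across the backend servers. Certificate management is somewhat more complex, because each server has to manage its own certificate. In addition, the proxy may be unable to add or modify HTTP headers, so the client IP address, port, and other information carried in X-Forwarded-* headers may be lost.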
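
To make SSL termination concrete, the following sketch builds the "client → HTTPS → proxy → HTTP → backend" path with Go's standard library test servers. It is an illustration only and has nothing to do with how ingress-nginx is implemented; `fetchThroughTerminatingProxy` is a hypothetical helper name, and the servers listen on loopback ports picked by `httptest`.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
)

// fetchThroughTerminatingProxy starts a plain-HTTP backend and a TLS front end
// that terminates TLS and forwards requests in clear text. It returns the body
// the client received and whether the backend saw a TLS connection.
func fetchThroughTerminatingProxy() (body string, backendSawTLS bool, err error) {
	sawTLS := make(chan bool, 1)
	backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		sawTLS <- r.TLS != nil // r.TLS is nil for plain HTTP requests
		fmt.Fprint(w, "hello from backend")
	}))
	defer backend.Close()

	target, err := url.Parse(backend.URL)
	if err != nil {
		return "", false, err
	}

	// The proxy speaks HTTPS to the client and HTTP to the backend.
	proxy := httptest.NewTLSServer(httputil.NewSingleHostReverseProxy(target))
	defer proxy.Close()

	// proxy.Client() trusts the test server's self-signed certificate.
	resp, err := proxy.Client().Get(proxy.URL)
	if err != nil {
		return "", false, err
	}
	defer resp.Body.Close()

	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", false, err
	}
	return string(data), <-sawTLS, nil
}

func main() {
	body, sawTLS, err := fetchThroughTerminatingProxy()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("client got %q; backend saw TLS: %v\n", body, sawTLS)
}
```

With passthrough, by contrast, the proxy would forward the raw TCP stream without decrypting it, so it could not act as an HTTP reverse proxy at all; that is why headers such as X-Forwarded-* cannot be added in that mode.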
 33 | 
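 34 | Let's start with the less "secure" traditional model: nginx exposes HTTPS, but nginx talks to the backend service (svc1) over HTTP.
 35 | 
 36 | First create the private key and public certificate, and store them in a Secret named ingress-controller-demo-tls-secret: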
 37 | 
 38 | ```bash
 39 | openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout ic1.key -out ic1.crt -subj "/CN=*.example.com/O=example.com"
 40 | kubectl create secret tls ingress-controller-demo-tls-secret --key  ic1.key --cert ic1.crt
 41 | ```
 42 | 
 43 | TLS is enabled in the Ingress below, and the secret it uses is the one created above: ingress-controller-demo-tls-secret.
 44 | 
 45 | 
 46 | ```yaml
 47 | ---
 48 | apiVersion: v1
 49 | kind: Service
 50 | metadata:
 51 |   name: ic1-svc1
 52 |   labels:
 53 |     app: svc1
 54 | spec:
 55 |   type: ClusterIP
 56 |   ports:
 57 |     - port: 80
 58 |       targetPort: http
 59 |       protocol: TCP
 60 |       name: http
 61 |   selector:
 62 |     app: svc1
 63 |     release: ic1
 64 | ---
 65 | apiVersion: apps/v1
 66 | kind: Deployment
 67 | metadata:
 68 |   name: ic1-svc1
 69 |   labels:
 70 |     app: svc1
 71 | spec:
 72 |   replicas: 1
 73 |   selector:
 74 |     matchLabels:
 75 |       app: svc1
 76 |       release: ic1
 77 |   template:
 78 |     metadata:
 79 |       labels:
 80 |         app: svc1
 81 |         release: ic1
 82 |     spec:
 83 |       containers:
 84 |         - name: svc1
 85 |           image: "nginx:1.14.2"
 86 |           imagePullPolicy: Always
 87 |           ports:
 88 |             - name: http
 89 |               containerPort: 8080
 90 |               protocol: TCP
 91 |           resources:
 92 |             {}
 93 | ---
 94 | apiVersion: networking.k8s.io/v1 
 95 | kind: Ingress
 96 | metadata:
 97 |   name: ic1-svc1
 98 |   labels:
 99 |     app: svc1
100 |   annotations:
101 |     kubernetes.io/ingress.class: nginx
102 | spec:
103 |   tls:
104 |     - hosts:
105 |         - svc1.example.com
106 |       secretName: ingress-controller-demo-tls-secret
107 |   rules:
108 |     - host: svc1.example.com
109 |       http:
110 |         paths:
111 |         - backend:
112 |             service:
113 |               name: ic1-svc1
114 |               port:
115 |                 number: 80
116 |           path: /
117 |           pathType: ImplementationSpecific
118 | ```
119 | 
120 | If we now look at the nginx conf, we see the following configuration:
121 | ```
122 | # kubectl exec nginx-ingress-controller-xx -n kube-system -- cat /etc/nginx/nginx.conf
123 | 
124 |         # map port 442 to 443 for header X-Forwarded-Port
125 |         map $pass_server_port $pass_port {
126 |                 442              443;
127 |                 default          $pass_server_port;
128 |         }
129 | 
130 |         upstream default-svc1-http {
131 |                 least_conn;
132 | 
133 |                 keepalive 32;
134 | 
135 |                 server 192.168.28.13:8080 max_fails=0 fail_timeout=0;
136 | 
137 |         }
138 | 
139 | ## start server svc1.example.com
140 |         server {
141 |                 server_name svc1.example.com ;
142 | 
143 |                 listen 80;
144 | 
145 |                 listen [::]:80;
146 | 
147 |                 set $proxy_upstream_name "-";
148 | 
149 |                 listen 442 proxy_protocol   ssl http2;
150 | 
151 |                 listen [::]:442 proxy_protocol  ssl http2;
152 | 
153 |                 # PEM sha: 248951b75535e0824c1a7f74dc382be3447057b7
154 |                 ssl_certificate                         /ingress-controller/ssl/default-ingress-controller-demo-tls-secret.pem;
155 |                 ssl_certificate_key                     /ingress-controller/ssl/default-ingress-controller-demo-tls-secret.pem;
156 | 
157 |                 ssl_trusted_certificate                 /ingress-controller/ssl/default-ingress-controller-demo-tls-secret-full-chain.pem;
158 |                 ssl_stapling                            on;
159 |                 ssl_stapling_verify                     on;
160 | 
161 |                 location / {
162 |                         ... ...
163 |                         proxy_pass http://default-svc1-http;
164 | 
165 |                         proxy_redirect                          off;
166 | 
167 |                 }
168 |            ... ...
169 |         }
170 |         ## end server svc1.example.com
171 | ```
172 | 
173 | You can see that port 443 of the ingress controller, which NodePort 30001 maps to, now carries the ssl flag under the server name svc1.example.com, and that ssl_certificate and ssl_certificate_key point to the ingress-controller-demo-tls-secret we created earlier.
174 | 
175 | Then access svc1 with curl:
176 | ```bash
177 | curl -k https://svc1.example.com:30001
178 | nginx
179 | ```
180 | 
181 | If you now access svc1 over plain HTTP, you get the following error:
182 | ```bash
183 | # curl http://svc1.example.com:30092
184 | <html>
185 | <head><title>400 The plain HTTP request was sent to HTTPS port</title></head>
186 | <body bgcolor="white">
187 | <center><h1>400 Bad Request</h1></center>
188 | <center>The plain HTTP request was sent to HTTPS port</center>
189 | ```
190 | 
191 | 
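192 | ## svc2: SSL termination, with encrypted (https) traffic between nginx and the backend service
193 | In the SSL termination scenario, whether traffic between the load balancer and the backend is encrypted has to be configured separately in nginx. svc1 is unencrypted, so nginx -> backend service traffic is a security risk. We want that hop encrypted with HTTPS as well, hence the svc2 example.
194 | 
195 | ```yaml
196 | nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
197 | ```
198 | 
199 | The Ingress is the same as above, plus the `nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"` annotation. Older ingress-nginx releases used `nginx.ingress.kubernetes.io/secure-backends: "true"` for this, which has since been removed.
200 | 
201 | 
202 | This annotation makes nginx access the backend service svc2 over HTTPS. After installing svc2, the nginx controller generates the following configuration for svc2: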
203 | 
204 | ```
205 | ## start server svc2.example.com
206 |         server {
207 |                 server_name svc2.example.com;
208 | 
209 |                 listen 80;
210 | 
211 |                 listen [::]:80;
212 | 
213 |                 set $proxy_upstream_name "-";
214 | 
215 |                 listen 442 proxy_protocol   ssl http2;
216 | 
217 |                 listen [::]:442 proxy_protocol  ssl http2;
218 | 
219 |                 # PEM sha: 248951b75535e0824c1a7f74dc382be3447057b7
220 |                 ssl_certificate                         /ingress-controller/ssl/default-ingress-controller-demo-tls-secret.pem;
221 |                 ssl_certificate_key                     /ingress-controller/ssl/default-ingress-controller-demo-tls-secret.pem;
222 | 
223 |                 ssl_trusted_certificate                 /ingress-controller/ssl/default-ingress-controller-demo-tls-secret-full-chain.pem;
224 |                 ssl_stapling                            on;
225 |                 ssl_stapling_verify                     on;
226 | 
227 |                 location / {
228 |                      ... ...
229 |                         proxy_pass https://default-svc2-https;
230 | 
231 |                         proxy_redirect                          off;
232 | 
233 |                 }
234 | 
235 |         }
236 |         ## end server svc2.example.com
237 | 
238 |         upstream default-svc2-https {
239 |                 least_conn;
240 | 
241 |                 keepalive 32;
242 | 
243 |                 server 192.168.28.14:8080 max_fails=0 fail_timeout=0;
244 | 
245 |         }
246 | 
247 | ```
248 | ## ssl-passthrough (svc3: SSL passthrough, termination at the Pod)
249 | Some services authenticate and authorize clients by validating client certificates; svc3 is such a service, performing mutual HTTPS verification of the client certificate. SSL termination cannot satisfy this requirement, so we need SSL passthrough.
250 | 
251 | Enabling SSL passthrough in the ingress-nginx controller requires changes in both the ingress controller and the Ingress.
252 | 
253 | First add a new command-line flag to nginx-ingress-controller, `--enable-ssl-passthrough`, and re-apply it:
254 | ```yaml
255 | spec:
256 |       containers:
257 |         - name: nginx-ingress-controller
258 |           args:
259 |             - /nginx-ingress-controller
260 |             - --enable-ssl-passthrough
261 | ```
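262 | Then add a new annotation to the Ingress: `nginx.ingress.kubernetes.io/ssl-passthrough: "true"`
263 | 
264 | When testing against the program without a client certificate, the following error is reported:
265 | TLS handshake error from 192.168.31.10:38634: tls: client didn't provide a certificate
266 | 
267 | With client.crt supplied, svc3 passes verification and returns the correct response.
268 | 
269 | 
270 | 
271 | Now look at the ingress-nginx controller's access log under ssl-passthrough: when the curl request is sent, the controller logs nothing, because SSL is not terminated at nginx. This confirms that nginx forwards the client's SSL session to the Pod, i.e. it passes it through.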


--------------------------------------------------------------------------------
/scheduler/images/admission-controller-phases.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/scheduler/images/admission-controller-phases.png


--------------------------------------------------------------------------------
/scheduler/images/aquire-lock.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/scheduler/images/aquire-lock.png


--------------------------------------------------------------------------------
/scheduler/images/ep-leader-algorithm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/scheduler/images/ep-leader-algorithm.png


--------------------------------------------------------------------------------
/scheduler/images/informer.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/scheduler/images/informer.png


--------------------------------------------------------------------------------
/scheduler/images/schedule-extend.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/scheduler/images/schedule-extend.png


--------------------------------------------------------------------------------
/scheduler/images/scheduler-flow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/scheduler/images/scheduler-flow.png


--------------------------------------------------------------------------------
/scheduler/images/scheduler-leader-elec.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/scheduler/images/scheduler-leader-elec.png


--------------------------------------------------------------------------------
/scheduler/images/scheduler-log.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/scheduler/images/scheduler-log.png


--------------------------------------------------------------------------------
/scheduler/images/scheduler-plugins.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/scheduler/images/scheduler-plugins.png


--------------------------------------------------------------------------------
/scheduler/images/scheduler-score.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/scheduler/images/scheduler-score.png


--------------------------------------------------------------------------------
/scheduler/scheduler-leader-election.md:
--------------------------------------------------------------------------------
  1 | # overview
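  2 | This document focuses on leader election in the kube-scheduler component; the scheduler's other logic is not covered in depth.
  3 | The scheduler does run leader election, and it implements it with a Kubernetes API object (a Lease) rather than by talking to etcd directly.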
  4 | 
  5 | 
  6 | # background & introduction
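  7 | In recent years, as demand for reliable systems and infrastructure has grown, the term "high availability" has become increasingly popular. In distributed systems, high availability usually means maximizing uptime and making the system fault tolerant. A common practice is to use redundancy to minimize single points of failure. Preparing your systems and services for redundancy can be as simple as **deploying more replicas** behind a load balancer. Although such a setup works for many applications, some use cases require **coordination** between replicas for the system to work correctly.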
  8 | 
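  9 | A good example is running the Kubernetes scheduler or controller-manager as multiple instances. To prevent unexpected behavior, the leader election process must ensure that exactly one leader is elected among the replicas and that only the leader actively coordinates the cluster. The other instances stay inactive but ready to take over if the leader fails.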
 10 | 
 11 | 
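 12 | # Exploration & questions
 13 | I previously used kubespray to provision an HA cluster with 2 master nodes and 3 worker nodes.
 14 | I had always assumed that the schedulers on the two master nodes ran independently, so I looked at the scheduler pod logs, which looked roughly like this: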
 15 | ```
 16 | [instance.go:274] Using reconciler: lease
 17 | [lease.go:235] Resetting endpoints for master service "kubernetes" to 10.xx.xx.xx
 18 | ```
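 19 | A colleague then asked some questions about leases and scheduler leader election.
 20 | I ran the following command to look at the endpoints and, surprisingly, found no endpoint for the scheduler, even though the logs showed leader election taking place: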
 21 | ```
 22 | kubectl get ep -n kube-system
 23 | ```
 24 | My colleague suggested looking at the Lease instead, and it turned out to differ somewhat from the logic I had read before:
 25 | `kubectl describe lease kube-scheduler -n kube-system`
 26 | ```yaml
 27 | apiVersion: coordination.k8s.io/v1
 28 | kind: Lease
 29 | metadata:
 30 |   creationTimestamp: "2022-11-30T15:37:15Z"
 31 |   labels:
 32 |     k8s.io/component: kube-scheduler
 33 |     kubernetes.io/hostname: kind-control-plane
 34 |   name: kube-scheduler
 35 |   namespace: kube-system
 36 |   resourceVersion: "18171"
 37 |   uid: d6c68901-4ec5-4385-b1ef-2d783738da6c
 38 | spec:
 39 |   acquireTime: "2022-11-30T18:04:27.912073Z"
 40 |   holderIdentity: node2-xxx-xxx
 41 |   leaseDurationSeconds: 3600
 42 |   leaseTransitions: 1
 43 |   renewTime: "2022-11-30T18:14:27.912073Z"
 44 | ```
 45 | 
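 46 | So let's continue with our questions in mind. They are:
 47 | 1. In an HA cluster, under what circumstances and how does the scheduler on the other master become leader?
 48 | 2. Why use a Lease instead of an Endpoint?
 49 | 3. Why is leader election enabled by default even when there is only one master node?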
 50 | 
 51 | 
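 52 | # Checking the configuration
 53 | First, let's look at the YAML manifest that starts the scheduler. The schedulers on both master nodes start with leader election enabled by default: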
 54 | ```
 55 | --leader-elect=true
 56 | ```
 57 | Next, looking at the Kubernetes source, the startup parameters related to leader election are the following:
 58 | 
 59 | ```go
 60 | type LeaderElectionConfiguration struct {
 61 | 
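 62 | 	// Whether leader election is enabled; enabled by default
 63 | 	LeaderElect bool
 64 | 	// How long the lock stays valid without renewal, similar to a session timeout
 65 | 	LeaseDuration metav1.Duration
 66 | 	// How long the acting leader keeps retrying to renew before giving up leadership; must be less than LeaseDuration
 67 | 	RenewDeadline metav1.Duration
 68 | 	// leader-elect-retry-period: how long clients wait between attempts to acquire or renew the lock
 69 | 	RetryPeriod metav1.Duration
 70 | 	// Which kind of object stores the election record, e.g. lease, ep, configmap
 71 | 	ResourceLock string
 72 | 	// Name of the lease/endpoint object, e.g. kube-scheduler
 73 | 	ResourceName string
 74 | 	// Namespace that holds the lease/ep/configmap
 75 | 	ResourceNamespace string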
 75 | 	ResourceNamespace string
 76 | }
 77 | ```
 78 | 
 79 | Source location: https://github.com/kubernetes/kubernetes/blob/release-1.25/cmd/kube-scheduler/app/options/options.go#L91
 80 | As you can see, in Kubernetes 1.25 the scheduler uses a Lease as the lock by default:
 81 | ```go
 82 | func NewOptions() *Options {
 83 | 	o := &Options{
 84 | 	
 85 | 		LeaderElection: &componentbaseconfig.LeaderElectionConfiguration{
 86 | 			LeaderElect:       true,
 87 | 			LeaseDuration:     metav1.Duration{Duration: 15 * time.Second},
 88 | 			RenewDeadline:     metav1.Duration{Duration: 10 * time.Second},
 89 | 			RetryPeriod:       metav1.Duration{Duration: 2 * time.Second},
 90 | 			ResourceLock:      "leases",
 91 | 			ResourceName:      "kube-scheduler",
 92 | 			ResourceNamespace: "kube-system",
 93 | 		},
 94 | 	}
 95 | ```
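 96 | The code also supports overriding this: the `leader-elect-resource-lock` startup flag selects a different lock type (for example an endpoints-based one). The code is not shown here; it lives at https://github.com/kubernetes/kubernetes/blob/release-1.25/cmd/kube-scheduler/app/options/options.go#L166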
 97 | 
 98 | 
 99 | # How it works
100 | ## Lock structure
101 | ![](./images/aquire-lock.png)
102 | 
103 | ```go
104 | type LeaderElectionRecord struct {
105 | 	// HolderIdentity is the ID that owns the lease. If empty, no one owns this lease and
106 | 	// all callers may acquire. Versions of this library prior to Kubernetes 1.14 will not
107 | 	// attempt to acquire leases with empty identities and will wait for the full lease
108 | 	// interval to expire before attempting to reacquire. This value is set to empty when
109 | 	// a client voluntarily steps down.
110 | 	HolderIdentity       string      `json:"holderIdentity"`
111 | 	LeaseDurationSeconds int         `json:"leaseDurationSeconds"`
112 | 	// Timestamp of when the current leader first acquired the lease
113 | 	AcquireTime          metav1.Time `json:"acquireTime"`
114 | 	RenewTime            metav1.Time `json:"renewTime"`
115 | 	// Number of times leadership has changed hands
116 | 	LeaderTransitions    int         `json:"leaderTransitions"`
117 | }
118 | ```
119 | 
120 | ## Starting the election
121 | Several resource lock types exist (lease, endpoint, configmap, and so on), and each must implement the following interface. Source: https://github.com/kubernetes/client-go/blob/release-1.25/tools/leaderelection/resourcelock/interface.go
122 | ```go
123 | type Interface interface {
124 | 	// Get returns the LeaderElectionRecord
125 | 	Get(ctx context.Context) (*LeaderElectionRecord, []byte, error)
126 | 
127 | 	// Create attempts to create a LeaderElectionRecord
128 | 	Create(ctx context.Context, ler LeaderElectionRecord) error
129 | 
130 | 	// Update will update and existing LeaderElectionRecord
131 | 	Update(ctx context.Context, ler LeaderElectionRecord) error
132 | 
133 | 	// RecordEvent is used to record events
134 | 	RecordEvent(string)
135 | 
136 | 	// Identity will return the locks Identity
137 | 	Identity() string
138 | 
139 | 	// Describe is used to convert details on current resource lock
140 | 	// into a string
141 | 	Describe() string
142 | }
143 | ```
144 | 
145 | Taking the `LeaseResourceLock` implementation as an example:
146 | ```go
147 | func (ll *LeaseLock) Create(ctx context.Context, ler LeaderElectionRecord) error {
148 | 	var err error
149 | 	ll.lease, err = ll.Client.Leases(ll.LeaseMeta.Namespace).Create(ctx, &coordinationv1.Lease{
150 | 		ObjectMeta: metav1.ObjectMeta{
151 | 			Name:      ll.LeaseMeta.Name,
152 | 			Namespace: ll.LeaseMeta.Namespace,
153 | 		},
154 | 		Spec: LeaderElectionRecordToLeaseSpec(&ler),
155 | 	}, metav1.CreateOptions{})
156 | 	return err
157 | }
158 | func LeaderElectionRecordToLeaseSpec(ler *LeaderElectionRecord) coordinationv1.LeaseSpec {
159 | 	leaseDurationSeconds := int32(ler.LeaseDurationSeconds)
160 | 	leaseTransitions := int32(ler.LeaderTransitions)
161 | 	return coordinationv1.LeaseSpec{
162 | 		HolderIdentity:       &ler.HolderIdentity,
163 | 		LeaseDurationSeconds: &leaseDurationSeconds,
164 | 		AcquireTime:          &metav1.MicroTime{ler.AcquireTime.Time},
165 | 		RenewTime:            &metav1.MicroTime{ler.RenewTime.Time},
166 | 		LeaseTransitions:     &leaseTransitions,
167 | 	}
168 | }
169 | 
170 | // Update will update an existing Lease spec.
171 | func (ll *LeaseLock) Update(ctx context.Context, ler LeaderElectionRecord) error {
172 | 	if ll.lease == nil {
173 | 		return errors.New("lease not initialized, call get or create first")
174 | 	}
175 | 	ll.lease.Spec = LeaderElectionRecordToLeaseSpec(&ler)
176 | 
177 | 	lease, err := ll.Client.Leases(ll.LeaseMeta.Namespace).Update(ctx, ll.lease, metav1.UpdateOptions{})
178 | 	if err != nil {
179 | 		return err
180 | 	}
181 | 
182 | 	ll.lease = lease
183 | 	return nil
184 | }
185 | 
186 | ```
187 | 
188 | 
189 | 
190 | ### scheduler Run 
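191 | Location: https://github.com/kubernetes/kubernetes/blob/release-1.25/cmd/kube-scheduler/app/server.go#L146
192 | The scheduler first builds the `cc` config object from its startup flags. If leader election is enabled, it constructs a leaderElector and calls `leaderElector.Run(ctx)` to enter the election. In other words, **the scheduler starts a leader election as soon as it starts up**.
193 | `leaderElector.Run` drives the leader election loop. It first tries to acquire the lock (via le.acquire). On success, it runs the OnStartedLeading callback configured earlier and then renews the lease periodically. If acquire returns false, which happens only when the context is cancelled, it just runs the OnStoppedLeading callback and returns.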
194 | 
195 | ```go
196 | func Run(ctx context.Context, cc *schedulerserverconfig.CompletedConfig, sched *scheduler.Scheduler) error {
197 | 	// remove some useless logic here...
198 | 
199 | 	// If leader election is enabled, runCommand via LeaderElector until done and exit.
200 | 	if cc.LeaderElection != nil {
201 | 		cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
202 | 			OnStartedLeading: func(ctx context.Context) {
203 | 				close(waitingForLeader)
204 | 				sched.Run(ctx)
205 | 			},
206 | 			OnStoppedLeading: func() {
207 | 				select {
208 | 				case <-ctx.Done():
209 | 					// We were asked to terminate. Exit 0.
210 | 					klog.InfoS("Requested to terminate, exiting")
211 | 					os.Exit(0)
212 | 				default:
213 | 					// We lost the lock.
214 | 					klog.ErrorS(nil, "Leaderelection lost")
215 | 					klog.FlushAndExit(klog.ExitFlushTimeout, 1)
216 | 				}
217 | 			},
218 | 		}
219 | 		leaderElector, err := leaderelection.NewLeaderElector(*cc.LeaderElection)
220 | 		if err != nil {
221 | 			return fmt.Errorf("couldn't create leader elector: %v", err)
222 | 		}
223 | 
224 | 		leaderElector.Run(ctx)
225 | 
226 | 		return fmt.Errorf("lost lease")
227 | 	}
228 | 
229 | 	// Leader election is disabled, so runCommand inline until done.
230 | 	close(waitingForLeader)
231 | 	sched.Run(ctx)
232 | 	return fmt.Errorf("finished without leader elect")
233 | }
234 | 
235 | ```
236 | Now let's see what happens inside `leaderElector.Run(ctx)`.
237 | The most important part of both acquire and renew is the call to tryAcquireOrRenew, which contains the core logic of the locking mechanism.
238 | 
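239 | The overall flow is roughly:
240 | * tryAcquireOrRenew tries to get the lease
241 |     * If the Get fails with a not-found error, it tries to create the lease; a successful create returns true right away
242 |     * If the Get fails with any other error, it returns false and skips the rest
243 | * If the lease was fetched successfully, check its Identity & Time
244 |     * Update the locally cached record and the observed timestamp, which is used to judge whether the lease has expired
245 |         * If another leader's lease has not expired yet, we cannot take it over for now, and the function returns false
246 |     * If we are already the leader (a renewal), AcquireTime and LeaderTransitions are carried over unchanged
247 |     * If the lease has expired and leadership is changing hands, transitions is incremented by 1 to record the change
248 |     * Try to update the lease record
249 |         * Update failed: return false
250 |         * Update succeeded: return true
251 | * A return value of true means this instance has taken the lock, holds the lease, and is the leader.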
252 | 
253 | 
254 | ![](./images/scheduler-leader-elec.png)
255 | 
256 | ```go
257 | // Run starts the leader election loop. Run will not return
258 | // before leader election loop is stopped by ctx or it has
259 | // stopped holding the leader lease
260 | func (le *LeaderElector) Run(ctx context.Context) {
261 | 	defer runtime.HandleCrash()
262 | 	defer func() {
263 | 		le.config.Callbacks.OnStoppedLeading()
264 | 	}()
265 | 
266 | 	if !le.acquire(ctx) {
267 | 		return // ctx signalled done
268 | 	}
269 | 	ctx, cancel := context.WithCancel(ctx)
270 | 	defer cancel()
271 | 	go le.config.Callbacks.OnStartedLeading(ctx)
272 | 	le.renew(ctx)
273 | }
274 | 
275 | // acquire loops calling tryAcquireOrRenew and returns true immediately when tryAcquireOrRenew succeeds.
276 | // Returns false if ctx signals done.
277 | func (le *LeaderElector) acquire(ctx context.Context) bool {
278 | 	ctx, cancel := context.WithCancel(ctx)
279 | 	defer cancel()
280 | 	succeeded := false
281 | 	desc := le.config.Lock.Describe()
282 | 	klog.Infof("attempting to acquire leader lease %v...", desc)
283 | 	wait.JitterUntil(func() {
284 | 		succeeded = le.tryAcquireOrRenew(ctx)
285 | 		le.maybeReportTransition()
286 | 		if !succeeded {
287 | 			klog.V(4).Infof("failed to acquire lease %v", desc)
288 | 			return
289 | 		}
290 | 		le.config.Lock.RecordEvent("became leader")
291 | 		le.metrics.leaderOn(le.config.Name)
292 | 		klog.Infof("successfully acquired lease %v", desc)
293 | 		cancel()
294 | 	}, le.config.RetryPeriod, JitterFactor, true, ctx.Done())
295 | 	return succeeded
296 | }
297 | 
298 | // tryAcquireOrRenew tries to acquire a leader lease if it is not already acquired,
299 | // else it tries to renew the lease if it has already been acquired. Returns true
300 | // on success else returns false.
301 | func (le *LeaderElector) tryAcquireOrRenew(ctx context.Context) bool {
302 | 	now := metav1.Now()
303 | 	leaderElectionRecord := rl.LeaderElectionRecord{
304 | 		HolderIdentity:       le.config.Lock.Identity(),
305 | 		LeaseDurationSeconds: int(le.config.LeaseDuration / time.Second),
306 | 		RenewTime:            now,
307 | 		AcquireTime:          now,
308 | 	}
309 | 
310 | 	// 1. obtain or create the ElectionRecord
311 | 	// This calls the lock's Get method (the Lease implementation, in our case)
312 | 	oldLeaderElectionRecord, oldLeaderElectionRawRecord, err := le.config.Lock.Get(ctx)
313 | 	if err != nil {
314 | 		// Any error other than not-found: return false
315 | 		if !errors.IsNotFound(err) {
316 | 			klog.Errorf("error retrieving resource lock %v: %v", le.config.Lock.Describe(), err)
317 | 			return false
318 | 		}
319 | 		// Reaching here means the lease returned 404, so try to create it
320 | 		if err = le.config.Lock.Create(ctx, leaderElectionRecord); err != nil {
321 | 			klog.Errorf("error initially creating leader election record: %v", err)
322 | 			return false
323 | 		}
324 | 
325 | 		le.setObservedRecord(&leaderElectionRecord)
326 | 
327 | 		return true
328 | 	}
329 | 	// The lease record was fetched successfully
330 | 	// 2. Record obtained, check the Identity & Time
331 | 	// Update the locally cached record and the observed timestamp, used to judge whether the lease has expired
332 | 	if !bytes.Equal(le.observedRawRecord, oldLeaderElectionRawRecord) {
333 | 		le.setObservedRecord(oldLeaderElectionRecord)
334 | 
335 | 		le.observedRawRecord = oldLeaderElectionRawRecord
336 | 	}
337 | 
338 | 	// The leader's lease has not expired yet, so we cannot take it over for now; return false
339 | 	if len(oldLeaderElectionRecord.HolderIdentity) > 0 &&
340 | 		le.observedTime.Add(le.config.LeaseDuration).After(now.Time) &&
341 | 		!le.IsLeader() {
342 | 		klog.V(4).Infof("lock is held by %v and has not yet expired", oldLeaderElectionRecord.HolderIdentity)
343 | 		return false
344 | 	}
345 | 
346 | 	// 3. We're going to try to update. The leaderElectionRecord is set to it's default
347 | 	// here. Let's correct it before updating.
348 | 	// If we are still the leader (a renewal), AcquireTime and LeaderTransitions stay unchanged
349 | 	if le.IsLeader() {
350 | 		leaderElectionRecord.AcquireTime = oldLeaderElectionRecord.AcquireTime
351 | 		leaderElectionRecord.LeaderTransitions = oldLeaderElectionRecord.LeaderTransitions
352 | 	} else {
353 | 		// The lease expired and leadership changes hands: transitions+1 records the change
354 | 		leaderElectionRecord.LeaderTransitions = oldLeaderElectionRecord.LeaderTransitions + 1
355 | 	}
356 | 
357 | 	// update the lock itself
358 | 	// Try to update the lease record
359 | 	if err = le.config.Lock.Update(ctx, leaderElectionRecord); err != nil {
360 | 		klog.Errorf("Failed to update lock: %v", err)
361 | 		return false
362 | 	}
363 | 	// Update succeeded: return true
364 | 	le.setObservedRecord(&leaderElectionRecord)
365 | 	return true
366 | }
367 | 
368 | ```
369 | 
370 | # Answering our questions
371 | Q: How is the initial leader determined?
372 | 
373 | A: Nothing needs to decide who the leader is. Every scheduler, on startup, creates the lease if it cannot find one, and thereby becomes the leader.
374 | 
375 | Q: Under what circumstances and how does the scheduler on the other master become leader?
376 | 
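377 | A: This is the case where the lease already exists and is held by someone else. The scheduler waits and watches the timestamp; once the lease expires it tries to take over leadership, and it becomes leader if its update succeeds.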
378 | 
379 | 
380 | Q: Why use a Lease instead of an Endpoint?
381 | 
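382 | A: The code comments address this: https://github.com/kubernetes/kubernetes/blob/release-1.26/staging/src/k8s.io/client-go/tools/leaderelection/resourcelock/interface.go#L37
383 | When using EndpointsLeasesResourceLock, you need to make sure API Priority & Fairness is configured with a non-default flow schema that catches the necessary operations on
384 | the leader-election-related endpoints objects.
385 | The ConfigMap- and Endpoints-based implementations are not guaranteed to behave well under high load, so Lease remains the default.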
386 | 
387 | 
388 | Q: Why is leader election enabled by default even when there is only one master node?
389 | 
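390 | A: Even leaving multiple masters aside, scenarios like a Deployment rolling upgrade still have to be supported.
391 | When the scheduler is deployed as a Deployment and gets updated, two pods exist at the same time during the upgrade, which is why leader election is on by default.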
392 | 
393 | # summary
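394 | The leader election process in Kubernetes is simple. It starts with the creation of a lock object, and the leader periodically updates the current timestamp on it as a way of telling the other replicas that it is still in charge. The lock object can be a Lease, ConfigMap, or Endpoint, and it also holds the identity of the current leader. If the leader fails to update the timestamp within the given interval, it is assumed to have crashed; at that point the inactive replicas race to take over leadership by updating the lock with their own identity. The pod that successfully acquires the lock becomes the new leader.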
395 | 
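396 | A side note:
397 | The Lease lock used by Kubernetes leader election is clearly an optimistic lock: it relies on the atomicity of Kubernetes API operations to ensure that no two replicas can hold the lease at the same time (which could otherwise lead to race conditions and other unexpected behavior). Whenever the Lease is updated (renewed or acquired), Kubernetes also updates its resourceVersion field. When another process tries to update the Lease concurrently, Kubernetes checks whether the resourceVersion of the submitted object matches that of the current object; if it does not, the update fails, which prevents concurrency problems.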
398 | 


--------------------------------------------------------------------------------
/scheduler/scheduler阅读理解上.md:
--------------------------------------------------------------------------------
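  1 | # Overview
  2 | 
  3 | This source-code walkthrough is based on the Kubernetes master branch at commit 8e8b6a01cf6bf55dea5e2e4f554597a95c82988a.
  4 | 
  5 | Kubernetes is a resource management platform, and one of its key components is the scheduler. The core goal of scheduling is to make the best possible use of limited resources. In Kubernetes, scheduling is implemented by the kube-scheduler component, whose main job is to find a suitable node for every newly created or not-yet-scheduled Pod in the cluster.
  6 | 
  7 | # Questions to keep in mind
  8 | 
  9 | Question 1:
 10 | 
 11 | How does the scheduler balance accuracy and efficiency?
 12 | 
 13 | In business systems, accuracy is usually achieved through serial processing: each Pod is scheduled one at a time, and the next one starts only after the current one finishes. However, a Pod has to be started on a node and have resources such as an IP and disks bound to it, so purely serial scheduling can hardly be efficient enough to meet business needs. If scheduling is parallel instead, how is the accuracy of each decision guaranteed? For example, if scheduling decisions hand the same Node resources to two different Pods at once, one of the Pods may fail to be created at the binding stage because of insufficient resources.
 14 | 
 15 | Question 2:
 16 | 
 17 | When a Kubernetes cluster has a very large number of nodes, say 5000, how does the scheduler pick the best node for a Pod out of those 5000? If every Pod has to iterate over all 5000 nodes, how does the scheduler keep the time per decision down?
 18 | 
 19 | Question 3:
 20 | 
 21 | The Kubernetes scheduler should provide a set of general-purpose scheduling algorithms, but every company has its own scheduling needs; a cloud vendor, for example, may want to oversell resources. How should an enterprise or individual user extend the scheduler to implement custom scheduling requirements?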
 22 | 
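 23 | # Flow overview
 24 | 
 25 | The overall scheduling flow of the Kubernetes scheduler:
 26 | 
 27 | 1. A user creates a new Pod in the cluster.
 28 | 2. The Pod is added to the scheduler's queue, and the scheduler takes Pods from this scheduling queue.
 29 | 3. The scheduler first runs a round of **filtering**, selecting the list of Nodes that satisfy the Pod's resource requirements.
 30 | 4. The Nodes that passed filtering are then **scored** according to criteria such as affinity and resource balance, yielding each Node together with its score.
 31 | 5. The highest-scoring Node is selected and **binding** (including volume binding) is performed **asynchronously**; if binding fails, the Node and the reserved resources are released.
 32 | 6. If step 3 fails, that is, no Node fits the Pod, the scheduler enters preemption and tries to evict lower-priority Pods to make room on a Node.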
 33 | 
 34 | ![](./images/scheduler-flow.png)
 35 | 
 36 | 
 37 | 
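 38 | In current Kubernetes versions, the scheduler framework is closely involved in every stage of the scheduling process: it turns each stage into a set of plugins, and each set can be enabled or disabled.
 39 | 
 40 | In Kubernetes v1.18 and later, most scheduling plugins are enabled by default, and users can disable a given set of plugins through configuration. Each plugin has a name and a corresponding weight. Each scheduling stage exposes a **plugin-based** interface, and these plugins implement the vast majority of scheduling functionality.
 41 | 
 42 | The scheduling framework receives each plugin's result through the plugin mechanism and, based on it, either continues to the next step or aborts scheduling. This mechanism lets us handle errors and communicate with plugins.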
 43 | 
 44 | ![](./images/scheduler-plugins.png)
 45 | 
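 46 | As the figure below shows, the scheduler framework implements two phases: the scheduling cycle and the binding cycle. At each green arrow, the scheduler framework provides an extension interface that users can use to extend scheduling. Next, we follow a Pod through the scheduling flow and look in detail at what the scheduler does at each stage.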
 47 | 
 48 | ![](./images/schedule-extend.png)
 49 | 
 50 | ## Step1: Adding a Pod
 51 | 
 52 | When a new Pod is added to the cluster, the `Informer`'s `EventHandler` fires and enqueues the Pod into the scheduler's activeQ scheduling queue.
 53 | 
 54 | ```go
 55 | func addAllEventHandlers(...) {
 56 | 	podInformer.Informer().AddEventHandler(
 57 | 		cache.FilteringResourceEventHandler{
 58 | 			Handler: cache.ResourceEventHandlerFuncs{
 59 | 				AddFunc:    sched.addPodToSchedulingQueue,
 60 | 				UpdateFunc: sched.updatePodInSchedulingQueue,
 61 | 				DeleteFunc: sched.deletePodFromSchedulingQueue,
 62 | 			},
 63 | 		},
 64 | 	)
 65 | ...
 66 | }
 67 | func (sched *Scheduler) addPodToSchedulingQueue(obj interface{}) {
 68 | 	pod := obj.(*v1.Pod)	
 69 | 	if err := sched.SchedulingQueue.Add(pod); err != nil {...}
 70 | }
 71 | 
 72 | func (p *PriorityQueue) Add(pod *v1.Pod) error {	
 73 | 	pInfo := p.newQueuedPodInfo(pod)
 74 |     // When a Pod is created, the event handler calls Add, which in turn calls activeQ.Add to put the Pod into activeQ
 75 | 	if err := p.activeQ.Add(pInfo); err != nil {		
 76 | 		return err
 77 | 	}
 78 | 	...
 79 | }
 80 | ```
 81 | 
 82 | 
 83 | 
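 84 | ## Step2: Queueing
 85 | 
 86 | The scheduler repeatedly takes the Pod at the head of the scheduling queue via `podInfo := sched.NextPod()`; that is, it fetches one Pod at a time. The call blocks: when the queue is empty, `sched.NextPod()` waits.
 87 | 
 88 | Before reading the source, I imagined the scheduling queue holding Pods was implemented with a channel, giving first-in-first-out behavior for free. After reading the code, I realized a channel cannot reorder Pods that are already queued. The Kubernetes scheduling queue places high-priority Pods at the head and low-priority Pods at the tail; ordering within the queue exists to meet the business requirement that high-priority workloads are scheduled first.
 89 | 
 90 | The Pod at the front of the queue has the **highest priority**. The queue orders Pods with the Less method (see the code below), which compares Pods by priority, so each time a Pod is taken out for a scheduling cycle it is the highest-priority one (when priorities are equal, QueuedPodInfo.Timestamp decides).
 91 | 
 92 | This is implemented by the "Less(Pod1, Pod2)" method provided by the scheduler framework's queue sort plugin.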
 93 | 
 94 | ```go
 95 | func (pl *PrioritySort) Less(pInfo1, pInfo2 *framework.QueuedPodInfo) bool {
 96 | 	p1 := pod.GetPodPriority(pInfo1.Pod)
 97 | 	p2 := pod.GetPodPriority(pInfo2.Pod)
 98 | 	return (p1 > p2) || (p1 == p2 && pInfo1.Timestamp.Before(pInfo2.Timestamp))
 99 | }
100 | ```
101 | 
102 | 
103 | 
104 | At this point the scheduler is still working on a single Pod. Let's go on to see how it improves scheduling efficiency.
105 | 
106 | 
107 | 
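108 | ## Step3: Filtering
109 | 
110 | During filtering, the scheduler calls `numFeasibleNodesToFind` to work out how many Nodes to look for. Recall question 2: with 5000 nodes, does the scheduler examine all of them?
111 | 
112 | The answer lies in a variable named `percentageOfNodesToScore`, which accepts an integer between 0 and 100 and whose default is derived from the size of the cluster. It sets the threshold for how many nodes in the cluster are considered during scheduling. When tuning, pick a sensible value for your node count. If it is not specified, Kubernetes computes a percentage with a formula: about 50% for a cluster of around 100 nodes, and about 10% for a cluster of around 5000 nodes.
113 | 
114 | In other words, **Kubernetes scheduling may produce either a globally optimal or a locally optimal result, depending on the number of Nodes in the cluster and the value of `percentageOfNodesToScore`**.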
115 | 
116 | ```go
117 | func (g *genericScheduler) findNodesThatPassFilters(...) {
118 | 	allNodes, err := g.nodeInfoSnapshot.NodeInfos().List()	
119 | 	numNodesToFind := g.numFeasibleNodesToFind(int32(len(allNodes)))
120 | 	feasibleNodes := make([]*v1.Node, numNodesToFind)	    
121 | 	checkNode := func(i int) { 
122 |         // Start checking nodes where the previous scheduling cycle left off
123 | 		nodeInfo := allNodes[(g.nextStartNodeIndex+i)%len(allNodes)]        
124 | 		fits, status, err := PodPassesFiltersOnNode(ctx, prof.PreemptHandle(), state, pod, nodeInfo)
125 | ...
126 | 	}	
127 | 
128 | 	// Check the Nodes in parallel
129 | 	parallelize.Until(ctx, len(allNodes), checkNode)
130 | 	processedNodes := int(feasibleNodesLen) + len(statuses)
131 | 	g.nextStartNodeIndex = (g.nextStartNodeIndex + processedNodes) % len(allNodes)
132 | 
133 |     // Stop searching for more nodes once the configured number of feasible nodes has been reached
134 | 	feasibleNodes = feasibleNodes[:feasibleNodesLen]
135 | 	if err := errCh.ReceiveError(); err != nil {
136 | 		statusCode = framework.Error
137 | 		return nil, err
138 | 	}
139 | 	return feasibleNodes, nil
140 | }
141 | ```
142 | 
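143 | `g.nextStartNodeIndex` in the code deserves attention. For example, if the first Pod was filtered against 1000 nodes (Node[0] to Node[999]), the next Pod starts from Node[1000]. This is for fairness: it ensures that, across Pods, all nodes get the same chance of being examined. If no suitable node is found within the sampled share of nodes, the scheduler moves on to preemption.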
144 | 
145 | 
146 | 
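147 | Now let's look at a summary of the filter plugins the scheduler framework uses in the filtering stage:
148 | 
149 | | Filter plugin      | Description |
150 | | ------------------ | ---- |
151 | | nodeunschedulable  |  Checks whether the Node is schedulable    |
152 | | noderesources      |  Checks whether the Node's resources satisfy the Pod's requests    |
153 | | nodename           |  Checks whether Pod.Spec.NodeName matches the current Node  |
154 | | nodeports          |  Checks whether the ports the Pod wants to expose conflict with ports in use on the Node    |
155 | | nodeaffinity       |  Checks node affinity    |
156 | | tainttoleration    |  Checks taints and tolerations, e.g. whether the Pod tolerates the taints set on the Node    |
157 | | nodevolumelimits   |  Checks whether the Node has reached its volume limit    |
158 | | volumebinding      |  Checks the binding state of volumes  |
159 | | volumezone         |  Checks the Zone the volumes are in   |
160 | | interpodaffinity   |  Checks inter-Pod affinity, e.g. whether the Pod being scheduled satisfies its affinity rules with respect to Pods already on the node    |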
161 | 
162 | 
163 | 
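164 | After filtering, the Pod being scheduled has a list of Nodes that passed the filters and meet its requirements.
165 | 
166 | If the list is empty, no Node fits the Pod; scoring is skipped, scheduling is considered failed, and the flow goes to Step6.
167 | 
168 | If the list has exactly one entry, only one Node fits the Pod; scoring is skipped and that Node goes straight to binding in Step5.
169 | 
170 | If the list has two or more entries, the flow continues with scoring below.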
171 | 
172 | 
173 | 
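174 | ## Step4: Scoring
175 | 
176 | Scoring again uses a set of scheduler framework plugins: the score plugins.
177 | 
178 | Score plugins rank all the Nodes that passed filtering; the scheduler calls every score plugin for every node. The scheduler framework assigns each score plugin a weight, as shown in the code below.
179 | 
180 | Once scoring is done, the scheduler multiplies the score a node received from each score plugin by that plugin's weight and then sums all the weighted scores to obtain the node's final score.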
181 | 
182 | ```go
183 | // Code location: `pkg/scheduler/algorithmprovider/registry.go`
184 | 	Score: &schedulerapi.PluginSet{
185 | 			Enabled: []schedulerapi.Plugin{
186 |                 // Prefer nodes with balanced resource usage
187 | 				{Name: noderesources.BalancedAllocationName, Weight: 1},
188 |                 // Score based on whether the node has already pulled the images the Pod needs
189 | 				{Name: imagelocality.Name, Weight: 1},
190 |                 // Score based on inter-Pod affinity and anti-affinity
191 | 				{Name: interpodaffinity.Name, Weight: 1},	
192 |                 // Score based on node affinity
193 | 				{Name: nodeaffinity.Name, Weight: 1},
194 | 				{Name: nodepreferavoidpods.Name, Weight: 10000},
195 |                 // Score based on how well taints and tolerations match
196 | 				{Name: tainttoleration.Name, Weight: 1},
197 |                 ...
198 | 			},
199 | 		},
200 | ```
201 | 
202 | 
203 | 
204 | When scoring is complete, a NodeScoreList is returned, and the scheduler calls `selectHost` to pick the Node from this result that the Pod will be bound to.
205 | 
206 | ![](./images/scheduler-score.png)
207 | 
208 | 
209 | 
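210 | This completes the scheduling cycle, and the scheduler enters the next one: the binding cycle.
211 | 
212 | ## Step5: Binding
213 | 
214 | The scheduler now enters the binding cycle. **It runs bind asynchronously in a goroutine to bind the Pod to the Node**, so it can start scheduling the next Pod without waiting for bind to finish. This answers question 1: the Kubernetes scheduler makes scheduling decisions serially to keep them accurate, and performs binding asynchronously to improve efficiency.
215 | 
216 | If binding fails, the scheduler automatically rolls back.
217 | 
218 | The binding cycle covers volume binding for the Pod as well as binding the Pod to the Node, and it can be broken down into three stages: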
219 | 
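220 | ### PreBind
221 | 
222 | PreBind plugins perform any work required before a Pod is bound. For example, a PreBind plugin may provision a volume and mount it on the target node before the Pod is allowed to run there.
223 | 
224 | If any PreBind plugin returns an error, the Pod is rejected and returned to the scheduling queue to wait for the next scheduling attempt.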
225 | 
226 | 
227 | 
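228 | ### Bind
229 | 
230 | Bind plugins bind a Pod to a node. They are not called until all PreBind plugins have completed. Bind plugins are called in the configured order, and each may choose whether to handle the given Pod. If a bind plugin chooses to handle the Pod, **the remaining bind plugins are skipped**.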
231 | 
232 | ```go
233 | func (b DefaultBinder) Bind(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) *framework.Status {
234 | 	binding := &v1.Binding{
235 | 		ObjectMeta: metav1.ObjectMeta{Namespace: p.Namespace, Name: p.Name, UID: p.UID},
236 | 		Target:     v1.ObjectReference{Kind: "Node", Name: nodeName},
237 | 	}
238 | 	err := b.handle.ClientSet().CoreV1().Pods(binding.Namespace).Bind(ctx, binding, metav1.CreateOptions{})
239 | 	return nil
240 | }
241 | ```
242 | 
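243 | As the code shows, `Bind` uses the ClientSet to call the API server, which records the target Node on the Pod (its `spec.nodeName`). If the API server fails at this moment, binding fails, and a failed binding triggers the `un-reserve` plugins, which release the resources previously reserved for the Pod.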
244 | 
245 | 
246 | 
247 | ### PostBind
248 | 
249 | PostBind is an informational extension point. PostBind plugins are called after a Pod has been bound successfully; this marks the end of the binding cycle and can be used to clean up associated resources.
250 | 
251 | 
252 | 
253 | This completes the binding cycle.
254 | 
255 | 
256 | 
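257 | ## Step6: Preemption
258 | 
259 | As mentioned above, when a Pod fails filtering it enters preemption and attempts to take over resources held by other Pods. This comes from a business requirement: we want higher-priority workloads to be scheduled successfully. Preemption is itself implemented as a set of scheduler framework plugins: `PostFilter`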
260 | 
261 | ```go
262 | 		PostFilter: &schedulerapi.PluginSet{
263 | 			Enabled: []schedulerapi.Plugin{
264 | 				{Name: defaultpreemption.Name},
265 | 			},
266 | 		},
267 | ```
268 | 
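269 | The preemption algorithm:
270 | 
271 | 1. Check whether the current Pod has a higher priority than other Pods on the nodes, to decide whether it is eligible to preempt
272 | 
273 | 2. From the nodes that failed filtering, try to find a list of candidate nodes where scheduling could become possible
274 | 
275 | 3. From the candidate list, try to find the nodes on which preemption would succeed
276 | 
277 | 4. From the nodes produced by step 3, choose one node as the final target, that is, the node to be preempted
278 | 
279 | 5. Get the list of all NominatedPods on the preempted node
280 | 
281 | 6. Evict the Pods that must be deleted because of the preemption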
282 | 
283 |    
284 | 
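285 | # Appendix: extending the scheduler
286 | 
287 | There are three ways to extend the Kubernetes scheduler:
288 | 
289 | 1. Multiple schedulers: you implement your own scheduler, which has a relatively high development cost; it can run in the cluster alongside the default scheduler
290 | 2. Extender: only offers the Filter, Prioritize, Preempt, and Bind extension points; it is called over the network via HTTP/HTTPS, which adds some latency, and because the extender runs as a separate process it cannot use the scheduler cache
291 | 3. Scheduler framework
292 | 
293 | As the whole flow above shows, the scheduler framework runs through every stage of scheduling, and much of its design is worth learning from. The plugin-based framework layers filtering, scoring, and preemption into plugins at different levels that are compiled into the scheduler; we can extend a single plugin without affecting communication with the plugins upstream and downstream of it.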
294 | 


--------------------------------------------------------------------------------
/应用篇/K8S与AD集成.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/应用篇/K8S与AD集成.md


--------------------------------------------------------------------------------
/应用篇/MinIO 对象存储.md:
--------------------------------------------------------------------------------
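 1 | # Features
 2 | 
 3 | Data protection: distributed MinIO uses erasure coding to protect against multiple node failures and bit rot. ***Distributed MinIO requires at least 4 drives***, and using distributed MinIO automatically enables erasure coding.
 4 | High availability: a standalone MinIO server is a single point of failure. In contrast, with a distributed MinIO of N drives, your data is safe as long as N/2 drives are online. However, you need at least N/2+1 drives online to create new objects.
 5 | For example, in a 16-node MinIO cluster with 16 drives per node, the cluster remains readable even if 8 servers go down, but you need 9 servers to write data.
 6 | [Tip] As long as you respect the limits of distributed MinIO, you can combine different numbers of nodes and drives per node. For example, you can use 2 nodes with 4 drives each, or 4 nodes with 2 drives each, and so on.
 7 | Consistency: in both distributed and standalone mode, all read and write operations in MinIO strictly follow the ***read-after-write*** consistency model.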
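
The N/2 and N/2+1 rule stated above is easy to check in code. This sketch only encodes the rule as this document describes it; actual quorum sizes in a given MinIO release depend on its erasure-coding and parity settings, and the helper names are hypothetical.

```go
package main

import "fmt"

// readQuorum and writeQuorum follow the N/2 and N/2+1 rule described in the text.
func readQuorum(total int) int  { return total / 2 }
func writeQuorum(total int) int { return total/2 + 1 }

// canRead and canWrite report whether enough members are online.
func canRead(total, online int) bool  { return online >= readQuorum(total) }
func canWrite(total, online int) bool { return online >= writeQuorum(total) }

func main() {
	const servers = 16
	for _, online := range []int{16, 9, 8, 7} {
		fmt.Printf("%2d of %d online: read=%v write=%v\n",
			online, servers, canRead(servers, online), canWrite(servers, online))
	}
}
```

With 8 of 16 servers online the cluster is still readable but no longer writable, which matches the example in the text.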
 8 | 
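 9 | The advantages of MinIO are:
10 | - Simple deployment: a single binary (minio) is all you need, and it supports a variety of platforms
11 | - Massive storage: can be expanded by zone, with a maximum single-object size of 5TB
12 | - Low redundancy with high tolerance to disk failure: the standard, and highest, data redundancy factor is 2 (storing a 1M object actually occupies 2M of disk space), yet data can still be read when any n/2 disks are lost (n being the number of disks in an erasure set). Recovery from such loss works per object rather than per storage volume
13 | - Excellent read and write performance
14 | # MinIO basic concepts
15 | S3: Simple Storage Service, a concept Amazon introduced in 2006, which is when object storage was born. S3 offers a simple web service interface for storing and retrieving any amount of data, at any time, from anywhere on the web.
16 | Object: the basic unit stored in MinIO, such as a file or a byte stream; anything, really.
17 | Bucket: the logical space that holds Objects. Data in different Buckets is isolated from each other.
18 | Drive: a disk configured when deploying MinIO. All object data in MinIO is stored on Drives.
19 | Set: a group of Drives. A distributed deployment is automatically divided into one or more Sets according to cluster size, and the Drives in each Set are spread across different locations.
20 | 
21 | - An object is stored on one Set
22 | - A cluster is divided into multiple Sets
23 | - The number of Drives in a Set is fixed; by default the system computes it automatically from the cluster size
24 | - The Drives in a Set are spread across different nodes as far as possible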
25 | 


--------------------------------------------------------------------------------
/应用篇/SSL Offloading/Termination vs SSL Passthrough vs SSL Bridging.md:
--------------------------------------------------------------------------------
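 1 | # Concepts
 2 | SSL Bridging: the load balancer/proxy decrypts incoming HTTPS traffic, then re-encrypts it before forwarding it to the backend servers.
 3 | SSL Offloading (also called SSL Termination): the load balancer/proxy decrypts incoming HTTPS traffic and sends it unencrypted to the backend servers.
 4 | SSL Passthrough: the load balancer/proxy does not decrypt incoming HTTPS traffic and forwards it untouched to the backend servers.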
 5 | 
 6 | 
 7 | # SSL Termination (offloading)
 8 | 
 9 | <img width="1061" alt="image" src="https://github.com/user-attachments/assets/8ed29af7-febf-4701-832d-040e0a4e5a47" />
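10 | SSL Offloading (also called SSL Termination) lets users establish a secure connection to the load balancer using the SSL certificate on the load balancer's frontend.
11 | The load balancer decrypts the incoming HTTPS traffic, so layer 7 operations can be applied to the traffic at this stage.
12 | Unlike SSL Bridging, the traffic is not re-encrypted on its way from the load balancer to the backend servers.
13 | Offloaded traffic is tagged with a new header named X-Forwarded-Proto, which tells the backend server that the client used HTTPS to contact the load balancer.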
14 | 
15 | # SSL Bridging
16 | <img width="1043" alt="image" src="https://github.com/user-attachments/assets/2c8749a5-d185-4efc-9916-815305f3a73b" />
17 | 
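18 | With SSL Bridging, users establish a secure, encrypted connection to the load balancer using the SSL certificate on the load balancer's frontend.
19 | The load balancer decrypts the incoming HTTPS traffic, which allows it to perform layer 7 operations on the traffic it receives.
20 | Afterwards, ***the load balancer's backend side establishes a new encrypted connection*** to re-encrypt the traffic between the load balancer and the backend server, this time using the backend server's certificate.
21 | This approach is recommended, for example, in microservice architectures where additional functionality such as cross-cutting concerns has to be handled.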
22 | 
23 | # SSL Passthrough
24 | <img width="1000" alt="image" src="https://github.com/user-attachments/assets/f6d22906-ae29-41ea-a1f7-bc8d716b7624" />
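25 | Passthrough is the most straightforward way to handle encrypted traffic on a load balancer.
26 | As the name suggests, this method routes traffic through the load balancer without decrypting it on the load balancer itself.
27 | Although this option can significantly reduce overhead, it has limitations, because no layer 7 operations can be performed.
28 | As a result, features such as cookie-based sticky sessions cannot be implemented with this method.
29 | In addition, when the application does not share sessions between servers, users may lose their session when they are redirected to a different server in the group.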
30 | 
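31 | SSL passthrough keeps the connection secure for the entire path: decryption happens only at the destination, which minimizes the risk of man-in-the-middle attacks on traffic between the load balancer and the servers.
32 | Also, because the load balancer does not decrypt traffic between client and server, its own overhead is relatively low, which lets it direct traffic more efficiently.
33 | However, SSL passthrough requires more central processing unit (CPU) cycles on the backend servers, which must handle decryption themselves, leading to higher operating costs.
34 | It also lacks the ability to inspect requests or act on network traffic, which rules out access rules, redirects, and cookie-based sticky sessions.
35 | Because of these limitations, SSL passthrough is best suited to smaller deployments. For larger, more demanding use cases, alternative approaches may need to be considered.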
36 | 


--------------------------------------------------------------------------------
/应用篇/east-west-traffic-encryption.md:
--------------------------------------------------------------------------------
1 | # overview
2 | 


--------------------------------------------------------------------------------
/应用篇/images/external-dns-A-record.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/应用篇/images/external-dns-A-record.png


--------------------------------------------------------------------------------
/应用篇/images/secret-cert.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/应用篇/images/secret-cert.png


--------------------------------------------------------------------------------
/应用篇/images/用户排查图.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/应用篇/images/用户排查图.png


--------------------------------------------------------------------------------
/应用篇/k8s-ldap-auth.md:
--------------------------------------------------------------------------------
 1 | # background
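 2 | https://github.com/HopopOps/k8s-ldap-auth does not support adding a root CA.
 3 | Our internal LDAP server requires a root CA, otherwise the connection fails with an error, so the code has been patched.
 4 | This document explains how k8s-ldap-auth works.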
 5 | 
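 6 | # How it works
 7 | Kubernetes supports several authentication methods out of the box, such as X.509 client certificates, static HTTP bearer tokens, and OpenID Connect.
 8 | LDAP authentication means users can authenticate to the Kubernetes cluster with their existing credentials from a Lightweight Directory Access Protocol (LDAP) directory: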
 9 | 
10 | <img width="704" alt="image" src="https://github.com/user-attachments/assets/1892814f-a99d-4071-a158-de8a2677d59c" />
11 | 
12 | Every request to the Kubernetes API passes through three stages in the API server:
13 |  authentication, authorisation, and admission control
14 | 
15 |  <img width="803" alt="image" src="https://github.com/user-attachments/assets/1ec26fbe-3f68-4ed9-9e0f-2ec3ead20f93" />
16 | <img width="732" alt="image" src="https://github.com/user-attachments/assets/11d86926-b1b3-4e7a-9a7c-ede57efa0cf0" />
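17 | Each authentication plugin implements one specific authentication method.
18 | An incoming request is presented to each authentication plugin in turn.
19 | If any plugin successfully verifies the credentials in the request, authentication is complete and the request proceeds to the authorization stage (the remaining authentication plugins are short-circuited).
20 | If no plugin can authenticate the request, it is rejected with a 401 Unauthorized HTTP status code.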
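
The short-circuit behaviour described above can be sketched in a few lines of Go. This is only an illustration, not the API server's real code: the `Authenticator` type, the `authenticate` helper, and the two sample plugins are hypothetical names introduced here.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Authenticator is a hypothetical stand-in for one authentication plugin.
// It reports the user name and whether the credential was accepted.
type Authenticator func(token string) (user string, ok bool)

var errUnauthorized = errors.New("401 Unauthorized")

// authenticate tries each plugin in order and stops at the first success.
// If no plugin accepts the credential, the request is rejected with 401.
func authenticate(plugins []Authenticator, token string) (string, int, error) {
	for _, p := range plugins {
		if user, ok := p(token); ok {
			return user, http.StatusOK, nil // remaining plugins are skipped
		}
	}
	return "", http.StatusUnauthorized, errUnauthorized
}

func main() {
	// Two illustrative plugins with made-up tokens.
	static := func(t string) (string, bool) { return "admin", t == "static-token" }
	webhook := func(t string) (string, bool) { return "jane", t == "ldap-token" }

	user, code, err := authenticate([]Authenticator{static, webhook}, "ldap-token")
	fmt.Println(user, code, err)
}
```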
21 | 
22 | # Setting up the webhook token plugin
23 | Because this document targets our company, setting up the LDAP server is skipped. The diagram below shows how our webhook plugin works.
24 | <img width="859" alt="image" src="https://github.com/user-attachments/assets/5494643d-3126-458a-b71c-91eeb8df27c9" />
25 | 
26 | The Webhook Token plugin requires requests to carry an HTTP bearer token as the authentication credential.
27 | 
28 | The HTTP bearer token is included in the request's Authorization header as `Authorization: Bearer <TOKEN>`.
29 | When the Webhook Token authentication plugin receives a request, it extracts the HTTP bearer token and submits it to an external webhook token authentication service for verification.
30 | 
31 | The Webhook Token authentication plugin invokes the webhook token authentication service with an HTTP POST request carrying a TokenReview object in JSON format that includes the token to verify.
32 | 
33 | The TokenReview object has the following format:
34 | ```
35 | {
36 |   "apiVersion": "authentication.k8s.io/v1beta1",
37 |   "kind": "TokenReview",
38 |   "spec": {
39 |     "token": "<TOKEN>"
40 |   }
41 | }
42 | ```
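
To make the exchange concrete, here is a minimal Go sketch of a webhook token authentication service, assuming only the standard library. It decodes the TokenReview shown above and answers with `status.authenticated` and `status.user`, which are the fields the API server reads from the response. The `lookupToken` function, the sample token, and the `/authenticate` path are hypothetical; in k8s-ldap-auth the check is backed by LDAP, and this sketch does not reproduce that project's code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// userInfo mirrors the user fields of a TokenReview status.
type userInfo struct {
	Username string   `json:"username"`
	Groups   []string `json:"groups,omitempty"`
}

// tokenReview is a local, trimmed-down copy of the TokenReview object.
type tokenReview struct {
	APIVersion string `json:"apiVersion"`
	Kind       string `json:"kind"`
	Spec       struct {
		Token string `json:"token,omitempty"`
	} `json:"spec"`
	Status struct {
		Authenticated bool      `json:"authenticated"`
		User          *userInfo `json:"user,omitempty"`
	} `json:"status"`
}

// lookupToken is a hypothetical stand-in for the LDAP-backed token check.
func lookupToken(token string) (*userInfo, bool) {
	if token == "valid-token" {
		return &userInfo{Username: "jane", Groups: []string{"dev"}}, true
	}
	return nil, false
}

// reviewHandler verifies the token and fills in the status section.
func reviewHandler(w http.ResponseWriter, r *http.Request) {
	var tr tokenReview
	if err := json.NewDecoder(r.Body).Decode(&tr); err != nil {
		http.Error(w, "bad TokenReview", http.StatusBadRequest)
		return
	}
	user, ok := lookupToken(tr.Spec.Token)
	tr.Spec.Token = "" // do not echo the credential back
	tr.Status.Authenticated = ok
	tr.Status.User = user
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(tr)
}

// review posts a TokenReview for the given token to the handler in-process.
func review(token string) tokenReview {
	body := fmt.Sprintf(`{"apiVersion":"authentication.k8s.io/v1beta1","kind":"TokenReview","spec":{"token":%q}}`, token)
	rec := httptest.NewRecorder()
	reviewHandler(rec, httptest.NewRequest(http.MethodPost, "/authenticate", strings.NewReader(body)))
	var out tokenReview
	json.NewDecoder(rec.Body).Decode(&out)
	return out
}

func main() {
	fmt.Println(review("valid-token").Status.Authenticated)
	fmt.Println(review("wrong").Status.Authenticated)
}
```

In a real deployment the handler would be served over HTTPS and referenced from the kubeconfig-style file passed to kube-apiserver via `--authentication-token-webhook-config-file`.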
43 | 
44 | # solution
45 | 
46 | <img width="897" alt="image" src="https://github.com/user-attachments/assets/e1163a56-182a-4f3e-baca-f357fad76728" />
47 | 


--------------------------------------------------------------------------------
/应用篇/k8s安全之opa.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/应用篇/k8s安全之opa.md


--------------------------------------------------------------------------------
/应用篇/openebs-quota-issue.md:
--------------------------------------------------------------------------------
 1 | # Issue
 2 | When using OpenEBS 4.1.0 to provide the Local PV hostpath feature, we found that actual storage usage was not being limited.
 3 | 
 4 | # Prerequisites
 5 | 1. xfsprogs is installed on the OS
 6 |    ```bash
 7 |    rpm -qa | grep xfsprog
 8 |    ```
 9 |    The output should show the installed package.
10 | 2. The OpenEBS base path /var/openebs/local should be mounted with the pquota option
11 | 
12 | ```bash
13 | df -Th /var/openebs/local
14 | mount | grep "openebs"
15 | # example output: /dev/xxx on /var/openebs/local type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
16 | ```
17 | If the mount options include noquota, unmount the filesystem and mount it again.
18 | The pquota option cannot be applied with the remount mount option.
19 | ```bash
20 | umount /dev/xxx/openebs
21 | mount -o rw,pquota /dev/xxxx/mount
22 | # the mount output should now show prjquota
23 | cat /etc/fstab | grep openebs
24 | # this entry should include pquota too, like: /dev/xxx/openebs /var/openebs/local xfs defaults,pquota 0 0
25 | ```
26 | 
27 | 3. OpenEBS Helm chart values
28 | Under localpv-provisioner, set hostpathClass.xfsQuota.enabled to true. This carries the XfsQuota setting into the default StorageClass through the cas.openebs.io/config annotation.
29 | 
30 | softLimitGrace and hardLimitGrace are used together with the PV storage request capacity to determine the soft and hard quota limits.
31 | ```yaml
32 | localpv-provisioner:
33 |   hostpathClass:
34 |     xfsQuota:
35 |       enabled: true
36 |       softLimitGrace: "0%"
37 |       hardLimitGrace: "0%"
38 | ```
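
As a rough illustration of how the grace values combine with the PV request, the Go sketch below assumes the limit is computed as `request * (1 + grace/100)`; this formula and the `quotaLimit` helper are assumptions made for this example and should be checked against the OpenEBS documentation for the version in use. Under that assumption, `"0%"` pins the limit to the requested capacity, which is what the values above configure.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// quotaLimit returns requestBytes * (1 + grace/100) for a grace such as "20%".
// The formula is an assumption used for illustration only.
func quotaLimit(requestBytes int64, grace string) (int64, error) {
	pct, err := strconv.ParseFloat(strings.TrimSuffix(strings.TrimSpace(grace), "%"), 64)
	if err != nil {
		return 0, fmt.Errorf("invalid grace %q: %w", grace, err)
	}
	if pct < 0 {
		return 0, fmt.Errorf("grace %q must not be negative", grace)
	}
	return requestBytes + int64(float64(requestBytes)*pct/100), nil
}

func main() {
	const gib = int64(1) << 30
	soft, _ := quotaLimit(5*gib, "0%")  // limit equals the 5 GiB request
	hard, _ := quotaLimit(5*gib, "20%") // limit is the request plus 20%
	fmt.Println(soft, hard)
}
```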
39 | 
40 | 4. Verify by creating a StorageClass
41 | ```bash
42 | kubectl apply -f -<<EOF
43 | 
44 | EOF
45 | 
46 | ```
47 | 


--------------------------------------------------------------------------------
/应用篇/share AD base access control for k8s.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/应用篇/share AD base access control for k8s.pdf


--------------------------------------------------------------------------------
/应用篇/share-volume-zone.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/应用篇/share-volume-zone.pdf


--------------------------------------------------------------------------------
/应用篇/日志篇.md:
--------------------------------------------------------------------------------
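 1 | # Logging
 2 | 
 3 | # Why logs matter
 4 | We use logs to maintain the health of applications and of the cluster, to troubleshoot application problems, and to improve user experience and performance.
 5 | 
 6 | # Log categories
 7 | ## Pod and container logs
 8 | Container logging mainly captures stdout and stderr; the logs can be viewed simply with the `kubectl logs` command.
 9 | 
10 | Different container runtimes handle the container stdout and stderr streams in different ways.
11 | 
12 | ## System component logs
13 | ### kubelet log
14 | For the kubelet, we can usually use `journalctl` to read its logs from the systemd journal.
15 | ### scheduler, controller manager, apiserver
16 | These three usually run as static pods.
17 | 
18 | ## Events
19 | Events are used often when troubleshooting pod startup or application deployment, for example when a user's pod is blocked by Gatekeeper.
20 | ## audit log
21 | The audit log records the requests handled by kube-apiserver.
22 | # Log format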
23 | 
24 | # k8s logging tools
25 | 
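26 | ## ELK + Kafka
27 | We usually add Kafka to an ELK stack, because Kafka helps absorb traffic peaks. ELK can use Redis as the message queue, but message queuing is not Redis's strength, and a Redis cluster is no match for a dedicated messaging system such as Kafka.
28 | ***Without a message queue as a buffer, ELK risks losing data, so that setup is only suitable for environments with small data volumes.***
29 | 
30 | A company can use Kafka to collect logs from all kinds of services and expose them to various consumers through a unified interface. A later section will use Kafka to describe an enterprise-grade logging system (20 TB of data per day). Typical Kafka use cases:
31 | * Messaging: decoupling producers from consumers, buffering messages, and so on.
32 | * User activity tracking: Kafka is often used to record the activity of web or app users, such as page views, searches, and clicks. Servers publish these events to Kafka topics, and subscribers consume the topics for real-time monitoring and analysis, or load them into Hadoop or a data warehouse for offline analysis and mining.
33 | * Operational metrics: Kafka is also often used to record operational monitoring data, collecting data from distributed applications and producing centralized feedback such as alerts and reports.
34 | * Stream processing: for example Spark Streaming and Storm
35 | * Event sourcing
36 | 
37 | In addition, although the Logstash log collector is powerful, it depends on Java, and at high data volumes the Logstash process consumes too many system resources, which seriously affects the performance of business systems. Filebeat is a good replacement: it is written in Go with no dependencies, its configuration file is simple and clearly formatted, and it is far lighter than Logstash, so it uses very few system resources and is well suited to running on production machines. This is why Filebeat is recommended, and why it is the first choice for the agent in the ELK Stack.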
38 | 


--------------------------------------------------------------------------------
/开放接口/CSI.md:
--------------------------------------------------------------------------------
  1 | # Overview
  2 | 
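  3 | The Container Storage Interface (CSI) is independent of Kubernetes; it is a standalone project, and support for it was introduced in Kubernetes v1.9. CSI gives Kubernetes a plugin system: cloud-vendor storage such as AWS EBS and Azure File is integrated with Kubernetes as CSI plugins.
  4 | 
  5 | CSI was developed as a standard for exposing arbitrary block and file storage systems to containerized workloads on container orchestration systems such as Kubernetes. With the adoption of CSI, the Kubernetes volume layer becomes truly extensible. Using CSI, third-party storage providers can write and deploy plugins that expose new storage systems in Kubernetes without touching the core Kubernetes code. This gives Kubernetes users more storage options and makes the system more secure and reliable.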
  6 | 
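  7 | # Interfaces
  8 | 
  9 | Repository: https://github.com/container-storage-interface/spec.git
 10 | 
 11 | Source location: https://github.com/container-storage-interface/spec/blob/master/csi.proto
 12 | 
 13 | Like CRI, CSI defines gRPC interfaces; a vendor's storage plugin must implement the following interfaces.
 14 | 
 15 | ## Identity interface
 16 | 
 17 | This interface is mainly used to get plugin information and query plugin status.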
 18 | 
 19 | ```protobuf
 20 | service Identity {
 21 |   // Returns plugin information
 22 |   rpc GetPluginInfo(GetPluginInfoRequest)
 23 |     returns (GetPluginInfoResponse) {}
 24 |   // Returns the capabilities the plugin provides
 25 |   rpc GetPluginCapabilities(GetPluginCapabilitiesRequest)
 26 |     returns (GetPluginCapabilitiesResponse) {}
 27 |   // Health probe
 28 |   rpc Probe (ProbeRequest)
 29 |     returns (ProbeResponse) {}
 30 | }
 31 | ```
 32 | 
 33 | 
 34 | 
 35 | ## Controller interface
 36 | 
 37 | Operates on volumes from the storage backend side, including creating and deleting volumes.
 38 | 
 39 | ```protobuf
 40 | service Controller {
 41 |   // Create a volume
 42 |   rpc CreateVolume (CreateVolumeRequest)
 43 |     returns (CreateVolumeResponse) {}
 44 |   // Delete a volume
 45 |   rpc DeleteVolume (DeleteVolumeRequest)
 46 |     returns (DeleteVolumeResponse) {}
 47 |   // Attach a volume
 48 |   rpc ControllerPublishVolume (ControllerPublishVolumeRequest)
 49 |     returns (ControllerPublishVolumeResponse) {}
 50 |   // Detach a volume
 51 |   rpc ControllerUnpublishVolume (ControllerUnpublishVolumeRequest)
 52 |     returns (ControllerUnpublishVolumeResponse) {}
 53 |  ...
 54 | }
 55 | ```
 56 | 
 57 | ## Node interface
 58 | 
 59 | Operates on volumes on the node.
 60 | 
 61 | ```protobuf
 62 | service Node {
 63 |   // Format the volume and mount it to a temporary (staging) directory
 64 |   rpc NodeStageVolume (NodeStageVolumeRequest)
 65 |     returns (NodeStageVolumeResponse) {}
 66 |   // Unmount the volume from the temporary directory
 67 |   rpc NodeUnstageVolume (NodeUnstageVolumeRequest)
 68 |     returns (NodeUnstageVolumeResponse) {}
 69 |   // Mount the volume from the temporary directory to the target directory
 70 |   rpc NodePublishVolume (NodePublishVolumeRequest)
 71 |     returns (NodePublishVolumeResponse) {}
 72 |   // Unmount the volume from the target directory
 73 |   rpc NodeUnpublishVolume (NodeUnpublishVolumeRequest)
 74 |     returns (NodeUnpublishVolumeResponse) {}
 75 |   ...
 76 | }
 77 | 
 78 | ```
 79 | 
 80 | 
 81 | 
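 82 | # Background
 83 | 
 84 | Many parts of this article involve the volume operations Attach, Detach, Mount, and Unmount. In general a volume goes through all four operations, Attach --> Mount --> Unmount --> Detach; only EmptyDir and HostPath volumes need no Attach and Detach.
 85 | 
 86 | ## Volume lifecycle
 87 | 
 88 | When a user creates a volume, CreateVolume is called first to create it, then ControllerPublishVolume attaches the volume to the host, then NodeStageVolume formats it, and finally NodePublishVolume mounts it at the target directory.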
 89 | 
 90 | CreateVolume-->ControllerPublishVolume-->NodeStageVolume-->NodePublishVolume
 91 | 
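 92 | # Sidecar containers
 93 | 
 94 | The Kubernetes CSI sidecar containers are a set of standard containers that aim to simplify the development and deployment of CSI drivers on Kubernetes.
 95 | 
 96 | These containers contain common logic that watches the Kubernetes API, triggers the appropriate operations against the "CSI volume driver" container, and updates the Kubernetes API as appropriate.
 97 | 
 98 | The containers are intended to be bundled with third-party CSI driver containers and deployed together as pods.
 99 | 
100 | In general, a CSI driver should be deployed on Kubernetes together with the following sidecar (helper) containers:
101 | 
102 | ## external-attacher
103 | 
104 | Watches the VolumeAttachment objects created by controller-manager and attaches volumes to, or detaches them from, nodes (that is, it calls ControllerPublishVolume / ControllerUnpublishVolume).
105 | 
106 | (ControllerPublish lets a volume be attached to a node without running any code on that node)
107 | 
108 | (Detach is the reverse operation: it detaches the volume from a node)
109 | 
110 | Repository: https://github.com/kubernetes-csi/external-attacher.git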
111 | 
112 | ```go
113 | func NewCSIAttachController(client kubernetes.Interface, attacherName string, handler Handler, volumeAttachmentInformer storageinformers.VolumeAttachmentInformer, pvInformer coreinformers.PersistentVolumeInformer, vaRateLimiter, paRateLimiter workqueue.RateLimiter, shouldReconcileVolumeAttachment bool, reconcileSync time.Duration) *CSIAttachController {
114 | 	...
115 | 	ctrl := &CSIAttachController{
116 | 		client:                          client,
117 | 		attacherName:                    attacherName,
118 | 		handler:                         handler,
119 | 		eventRecorder:                   eventRecorder,
120 | 		vaQueue:                         workqueue.NewNamedRateLimitingQueue(vaRateLimiter, "csi-attacher-va"),
121 | 		pvQueue:                         workqueue.NewNamedRateLimitingQueue(paRateLimiter, "csi-attacher-pv"),
122 | 		shouldReconcileVolumeAttachment: shouldReconcileVolumeAttachment,
123 | 		reconcileSync:                   reconcileSync,
124 | 	}
125 | 	// Watch VolumeAttachment add/update/delete events and enqueue them
126 | 	volumeAttachmentInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
127 | 		AddFunc:    ctrl.vaAdded,
128 | 		UpdateFunc: ctrl.vaUpdated,
129 | 		DeleteFunc: ctrl.vaDeleted,
130 | 	})
131 | 	...
132 | 	pvInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
133 | 		AddFunc:    ctrl.pvAdded,
134 | 		UpdateFunc: ctrl.pvUpdated,
135 | 		//DeleteFunc: ctrl.pvDeleted, TODO: do we need this?
136 | 	})
137 | ...
138 | 	return ctrl
139 | }
140 | 
141 | ```
142 | 
143 | The attacher then calls ControllerPublishVolume:
144 | 
145 | ```go
146 | func (a *attacher) Attach(ctx context.Context, volumeID string, readOnly bool, nodeID string, caps *csi.VolumeCapability, context, secrets map[string]string) (metadata map[string]string, detached bool, err error) {
147 | 	client := csi.NewControllerClient(a.conn)
148 | 	req := csi.ControllerPublishVolumeRequest{
149 | 		VolumeId:         volumeID,
150 | 		NodeId:           nodeID,
151 | 		VolumeCapability: caps,
152 | 		Readonly:         readOnly,
153 | 		VolumeContext:    context,
154 | 		Secrets:          secrets,
155 | 	}    
156 | 	rsp, err := client.ControllerPublishVolume(ctx, &req)
157 | 	...
158 | 	return rsp.PublishContext, false, nil
159 | }
160 | 
161 | ```
162 | 
163 | 
164 | 
165 | ## external-provisioner
166 | 
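167 | Watches Kubernetes PersistentVolumeClaim objects and triggers CreateVolume and DeleteVolume operations against the CSI endpoint.
168 | 
169 | For example, when a `PersistentVolumeClaim` object is created, if the PVC references a `StorageClass` whose provisioner field matches the name returned by the CSI endpoint's GetPluginInfo call, volume provisioning is triggered by the creation of the new PersistentVolumeClaim object.
170 | 
171 | After the new volume is provisioned successfully, the sidecar container creates a Kubernetes PersistentVolume object to represent the volume.
172 | 
173 | Deleting a `PersistentVolumeClaim` object bound to a volume of this driver, with the Delete reclaim policy, causes the sidecar container to trigger a DeleteVolume operation against the CSI endpoint to delete the volume. After the volume is deleted successfully, the sidecar container also deletes the PersistentVolume object that represents it.
174 | 
175 | Repository: https://github.com/kubernetes-csi/external-provisioner.git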
176 | 
177 | 
178 | 
179 | 
180 | 
181 | ## node-driver-registrar
182 | 
183 | Registers the CSI driver with the kubelet using the kubelet plugin registration mechanism.
184 | 
185 | 
186 | 
187 | ## cluster-driver-registrar
188 | 
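189 | Registers the CSI driver with the Kubernetes cluster by creating a CSIDriver object, which lets the driver customize how Kubernetes interacts with it.
190 | 
191 | ## external-snapshotter
192 | 
193 | Watches Kubernetes VolumeSnapshot CRD objects and triggers CreateSnapshot and DeleteSnapshot operations against the CSI endpoint.
194 | 
195 | ## livenessprobe
196 | 
197 | May be included in the CSI plugin pod to enable the Kubernetes liveness probe mechanism.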
198 | 
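199 | # Deploying a CSI driver
200 | 
201 | A CSI driver is usually deployed in Kubernetes as two components: a controller plugin and a per-node plugin.
202 | 
203 | ## Controller plugin
204 | 
205 | The controller component can be deployed as a Deployment or StatefulSet on any node in the cluster. It consists of the CSI driver that implements the CSI Controller service plus one or more sidecar containers. These controller sidecar containers typically interact with Kubernetes objects and call the driver's CSI Controller service.
206 | 
207 | It usually does not need direct access to the host and can perform all operations through the Kubernetes API and external control-plane services. Multiple replicas of the controller component can be deployed for HA, but leader election is recommended so that only one controller is active at a time.
208 | 
209 | The controller sidecars include external-provisioner, external-attacher, external-snapshotter, and external-resizer. Whether a given sidecar is included in a deployment may be optional.
210 | 
211 | ## Node plugin
212 | 
213 | The node plugin must be deployed to every node as a DaemonSet.
214 | 
215 | As the figure below shows, the kubelet runs on every node and is responsible for making the CSI Node service calls. These calls mount and unmount storage volumes from the storage system so that Pods can use them. **The kubelet calls the CSI driver through a UNIX socket shared on the host via a HostPath volume.** There is also a second UNIX socket, which node-driver-registrar uses to register the CSI driver with the kubelet.
216 | 
217 | Readers are encouraged to read kubelet/pluginmanager.md to understand how the kubelet discovers and registers CSI plugins.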
219 | ![](./images/node-plugin.png)
220 | 
221 | # How to test a driver
222 | 
223 | ## Unit tests
224 | 
225 | The CSI sanity package can be used to test a CSI driver.
226 | 
227 | https://github.com/kubernetes-csi/csi-test/tree/master/pkg/sanity
228 | 
229 | 
230 | 
231 | 
232 | 
233 | ## Functional tests
234 | 
235 | Functional testing here means testing the functionality end to end.
236 | 


--------------------------------------------------------------------------------
/开放接口/images/cni-plugin.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/开放接口/images/cni-plugin.png


--------------------------------------------------------------------------------
/开放接口/images/cri.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/开放接口/images/cri.png


--------------------------------------------------------------------------------
/开放接口/images/describe-pod.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/开放接口/images/describe-pod.png


--------------------------------------------------------------------------------
/开放接口/images/node-plugin.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/开放接口/images/node-plugin.png


--------------------------------------------------------------------------------
/开放接口/images/start-kubelet-command.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JaneLiuL/kubernetes-book/6ce28f8bc31b8ac89543964b03cf703cf268a515/开放接口/images/start-kubelet-command.png


--------------------------------------------------------------------------------
/开放接口/k8s-cni.md:
--------------------------------------------------------------------------------
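  1 | # k8s networking
  2 | * Every pod has its own IP address
  3 | * Any pod can communicate with any other pod
  4 | * Backend pods can be reached through a Service
  5 | * Ingress can reach Services
  6 | 
  7 | CNI requirements:
  8 | Support the loopback CNI plugin
  9 | Support hostPort
 10 | Support traffic shaping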
 11 | 
 12 | https://github.com/containerd/containerd/blob/main/script/setup/install-cni is worth a look
 13 | 
 14 | 
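 15 | # CNI plugin categories
 16 | * IPAM: IP address allocation
 17 | * Main plugins: network interface setup
 18 |     bridge: creates a bridge and attaches the host-side and container-side interfaces to it
 19 |     ipvlan: adds an ipvlan interface to the container
 20 |     loopback: sets up the loopback interface
 21 | * Meta: additional features
 22 |     portmap: sets up mappings between host ports and container ports
 23 |     bandwidth: rate-limits traffic using Linux Traffic Control
 24 |     firewall: sets up firewall rules for the container through iptables or firewalld
 25 | 
 26 | Summary:
 27 | A CNI plugin mainly implements two basic operations: configuring the network and cleaning it up.
 28 | IPAM plugins assign IP addresses to containers; they include host-local and dhcp.
 29 | 
 30 | How CNI plugins run
 31 | At startup, the container runtime reads JSON configuration files from the CNI configuration directory.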
 32 | ```bash
 33 | cat /etc/cni/net.d/* # CNI configuration files read by the runtime
 34 | ls -al /opt/cni/bin # CNI plugin binaries
 35 | ```
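 36 | The files have the suffix .conf, .conflist, or .json. If the directory contains several files, the first one in name order is normally chosen as the default network configuration, and the CNI plugin names and parameters it specifies are loaded.
 37 | Note the file ending in .conflist: CNI allows several plugins to be used together, passing the result of each plugin as a parameter to the next, so plugins can be chained to do different jobs.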
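
The selection rule described above (sort by name, take the first file with a recognized suffix) can be sketched in Go. The `pickDefaultConfig` helper and the sample file names are hypothetical and only illustrate the rule; a real runtime also parses and validates the file before using it.

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
)

// pickDefaultConfig returns the first file, in name order, whose extension
// is one of the CNI configuration suffixes (.conf, .conflist, .json).
func pickDefaultConfig(names []string) (string, bool) {
	sorted := append([]string(nil), names...) // copy so the input is not reordered
	sort.Strings(sorted)
	for _, n := range sorted {
		switch filepath.Ext(n) {
		case ".conf", ".conflist", ".json":
			return n, true
		}
	}
	return "", false
}

func main() {
	// Illustrative directory listing of /etc/cni/net.d.
	files := []string{"99-loopback.conf", "10-calico.conflist", "README.md"}
	name, ok := pickDefaultConfig(files)
	fmt.Println(name, ok)
}
```

This is why CNI configuration files are conventionally given numeric prefixes such as `10-`: the prefix controls which configuration becomes the default.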
 38 | 
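 39 | All other containers in a Kubernetes Pod share the network of the Pod's pause container. The creation process is:
 40 | 1. kubelet creates the pause container, which creates the network namespace
 41 | 2. The CNI driver is invoked
 42 | 3. The CNI driver calls plugins according to the configuration
 43 | 4. The CNI plugin configures the network for the pause container
 44 | 5. The other containers in the pod use the pause container's network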
 45 | 
 46 | # Three common implementation modes of CNI plugins
 47 | ![](./images/cni-plugin.png)
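 48 | * Overlay mode: containers use an IP range independent of the hosts'. For cross-host communication, tunnels are created between hosts and every packet from the container network is encapsulated in a packet between hosts on the underlying physical network. The advantage is that it does not depend on the underlying network;
 49 | * Routing mode: hosts and containers are also in different subnets. The main difference from Overlay mode is that cross-host communication is achieved through routing, with no tunnel encapsulation between hosts. Routing, however, partly depends on the underlying network; for example, it requires Layer 2 reachability in the underlying network;
 50 | * Underlay mode: containers and hosts sit on the same network layer and are peers. Connectivity between containers relies mainly on the underlying network, so this mode depends heavily on the capabilities of the underlying network.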
 51 | 
 52 | 
 53 | 
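 54 | # CNI deployment
 55 | Deploying a CNI plugin usually starts a DaemonSet that copies the binaries in the image to /opt/cni/bin on the host; that completes the deployment.
 56 | Writing a CNI plugin therefore really means providing binaries with the required functionality for the kubelet to call.
 57 | Multiple CNI plugins can be configured; they are simply executed in order.
 58 | Flow chart: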
 59 | 
 60 | user create pod---> apiserver
 61 |                       |
 62 |                       | watches pod creation
 63 |                       |
 64 |                     kubelet
 65 |                |               |
 66 |                |               |          
 67 |        read config            execute bin------config pod network-----------pod
 68 |     /etc/cni/net.d/xxx.conf     /opt/cni/bin/xxxnet
 69 | 
 70 | 
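 71 | # iptables
 72 | Besides setting up and cleaning up Pod network interfaces, a network plugin also needs to support iptables. If kube-proxy runs in iptables mode, the network plugin must make sure container traffic can be forwarded through iptables.
 73 | 
 74 | # calico
 75 | Besides providing network connectivity between hosts and pods, Calico also covers network security and network policy.
 76 | Within the same subnet, communication is Layer 3: Calico uses the BGP routing protocol to route packets between hosts, which also means packets do not need to be wrapped in an extra encapsulation layer as they move between hosts.
 77 | Across subnets, communication uses IPinIP with the virtual interface tunl0: one IP packet is encapsulated in another, where the outer IP header's source address is the tunnel entry device's IP and the destination address is the tunnel exit device's IP.
 78 | Network policy is one of Calico's most popular features; it uses ACLs and kube-proxy to create iptables filter rules, isolating container networks from each other.
 79 | Calico has two modes:
 80 | Tunnel mode, with packet encapsulation and decapsulation
 81 | Dynamic routing mode
 82 | Calico run flow:
 83 | After the plugin is deployed it starts a DaemonSet, which mounts the configuration directory (/etc/cni/net.d) and the binary directory (/opt/cni/bin) into its Pod and then copies the configuration files and binaries from the image into those directories.
 84 | The DaemonSet runs on all nodes, so node additions and removals are handled automatically.
 85 | ## Calico data flow demo
 86 | Different pods on the same node:
 87 | First start a container that includes plenty of tools, for example centos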
 88 | ```bash
 89 | kubectl run --image=centos centos
 90 | ```
 91 | Then enter the Pod
 92 | ```bash
 93 | $ k exec -it centos-5fdd4bb694-7cgc8 bash
 94 | ```
 95 | Check the Pod's IP and routing information
 96 | ```bash
 97 | $ ip a
 98 | 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
 99 |     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
100 |     inet 127.0.0.1/8 scope host lo
101 |        valid_lft forever preferred_lft forever
102 | 3: eth0@if48: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
103 |     link/ether 16:4c:ec:e4:3a:d6 brd ff:ff:ff:ff:ff:ff link-netnsid 0
104 |     inet 192.168.119.78/32 brd 192.168.119.78 scope global eth0
105 |        valid_lft forever preferred_lft forever
106 | 
107 | $ ip r
108 | default via 169.254.1.1 dev eth0
109 | 169.254.1.1 dev eth0 scope link
110 | ```
111 | There is a default route: all traffic is sent through eth0 to the IP 169.254.1.1.
112 | 
113 | Then use arping to see which device owns the IP 169.254.1.1
114 | ```bash
115 | $ arping 169.254.1.1
116 | ARPING 169.254.1.1 from 192.168.119.78 eth0
117 | Unicast reply from 169.254.1.1 [EE:EE:EE:EE:EE:EE]  0.579ms
118 | Unicast reply from 169.254.1.1 [EE:EE:EE:EE:EE:EE]  0.536ms
119 | ```
120 | The MAC address for this IP is EE:EE:EE:EE:EE:EE.
121 | 
122 | Then exit the container and check on the host whether any device has a MAC address made up entirely of e
123 | ```bash
124 | $ ip a
125 | 45: calie3f1daf7d15@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
126 |     link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 11
127 |     inet6 fe80::ecee:eeff:feee:eeee/64 scope link
128 |        valid_lft forever preferred_lft forever
129 | ```
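130 | Indeed there is: every device whose name starts with cali has this MAC address. These are the veth pairs created by Calico, with one end inside the Pod and the other on the host.
131 | 
132 | Now look at the routing information on the host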
133 | ```bash
134 | [root@agent-1 ~]# ip r
135 | default via 192.168.10.1 dev eth0 
136 | 169.254.169.254 via 192.168.10.3 dev eth0 proto static 
137 | blackhole 172.25.0.128/26 proto 80 
138 | 172.25.0.131 dev cali1cc9705ed50 scope link 
139 | 172.25.0.132 dev cali9d0f51d41f9 scope link 
140 | 172.25.0.133 dev cali4dcdb8d77a9 scope link 
141 | 172.25.0.134 dev caliac318995356 scope link 
142 | ```
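143 | Traffic for several IPs is routed to cali* devices. That traffic goes through the veth pair into another container, which is why pinging another Pod from inside a Pod works.
144 | 
145 | The traffic makes a round trip through the host and then enters the other Pod.
146 | 
147 | Pods on different nodes
148 | The nodes and their IP information are as follows:
149 | ipamblock:
150 | 10-233-90-0-24
151 | node: node1
152 | cidr: 10.233.90.0/24
153 | 
154 | ipamblock:
155 | 10-233-96-0-24
156 | node: node2
157 | cidr: 10.233.96.0/24
158 | After Calico is deployed it starts a daemon called bird, whose job is to synchronize routing information between nodes.
159 | The Pod network is private and unrelated to the host network, and other nodes cannot see the Pod CIDR of the current node, so bird is needed to synchronize it.
160 | 
161 | What gets synchronized is which IP range is bound to which host IP. In the example below, traffic to 10.233.96.0/24 is forwarded to 192.168.34.11, and traffic to 10.233.90.0/24 is forwarded to 192.168.34.10.
162 | 
163 | So node1 gets a route pointing to node2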
164 | ```bash
165 | 10.233.96.0/24 via 192.168.34.11 dev tunl0 proto bird onlink
166 | ```
167 | Likewise node2 has a route to node1
168 | ```bash
169 | 10.233.90.0/24 via 192.168.34.10 dev tunl0 proto bird onlink
170 | ```
171 | The two nodes can now communicate with each other.
172 | Details about bird can be found in the calico DaemonSet.
173 | 
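174 | # Background: Linux namespaces
175 | A namespace is the Linux kernel's way of isolating kernel resources. With namespaces, one group of processes sees only the resources associated with it, another group sees only its own, and the two groups are unaware of each other. This is implemented by assigning the resources of one or more processes to the same namespace.
176 | Linux namespaces wrap and isolate global system resources so that processes in different namespaces each have their own independent view of them. Changing a system resource in one namespace affects only the processes in that namespace and has no effect on processes in other namespaces.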
177 | 
178 | 
179 | 


--------------------------------------------------------------------------------