├── kubelet
│   ├── build
│   │   ├── nsenter
│   │   ├── entrypoint.sh
│   │   ├── Dockerfile
│   │   └── build.sh
│   ├── orphaned.md
│   ├── deploy
│   │   └── deploy.yaml
│   ├── README.md
│   ├── subpath-oss-error-delete.md
│   ├── kubelet.sh
│   └── subpath-error-reading.md
└── README.md

/kubelet/build/nsenter:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AliyunContainerService/kubernetes-issues-solution/HEAD/kubelet/build/nsenter
--------------------------------------------------------------------------------
/kubelet/build/entrypoint.sh:
--------------------------------------------------------------------------------
#!/bin/sh

# Refresh the recovery script on the host filesystem (mounted at /host).
rm -rf /host/etc/kubernetes/acs-kubelet-recover/kubelet.sh
cp /acs/kubelet.sh /host/etc/kubernetes/acs-kubelet-recover/kubelet.sh

# Run the script in the host mount namespace so mount/umount act on host mounts.
/acs/nsenter --mount=/proc/1/ns/mnt sh /etc/kubernetes/acs-kubelet-recover/kubelet.sh
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
Provides scripts and resolutions to fix your Kubernetes cluster issues.


## Kubelet Issue List

[Kubelet Issues](kubelet/README.md)


## API-Server Issue List


## Scheduler Issue List


## Controller Manager Issue List
--------------------------------------------------------------------------------
/kubelet/build/Dockerfile:
--------------------------------------------------------------------------------
FROM registry.aliyuncs.com/acs/alpine:3.3
RUN apk --update add curl fuse libxml2 openssl libstdc++ libgcc && rm -rf /var/cache/apk/*

RUN mkdir -p /acs
COPY nsenter /acs/nsenter
COPY kubelet.sh /acs/kubelet.sh
COPY entrypoint.sh /acs/entrypoint.sh

RUN chmod 755 /acs/*

ENTRYPOINT ["/acs/entrypoint.sh"]
--------------------------------------------------------------------------------
/kubelet/build/build.sh:
--------------------------------------------------------------------------------
#!/bin/sh

cd ${GOPATH}/src/github.com/AliyunContainerService/kubernetes-issues-solution/kubelet/build
GIT_SHA=`git rev-parse --short HEAD || echo "HEAD"`

# copy the latest kubelet.sh next to the Dockerfile so it can be COPYed into the image
rm -rf ./kubelet.sh
cp ../kubelet.sh ./

version="v1.12"
version=$version-$GIT_SHA-aliyun

docker build -t=registry.cn-hangzhou.aliyuncs.com/plugins/acs-cluster-recover:$version .
docker push registry.cn-hangzhou.aliyuncs.com/plugins/acs-cluster-recover:$version
--------------------------------------------------------------------------------
/kubelet/orphaned.md:
--------------------------------------------------------------------------------
An orphaned pod is a pod that has terminated and should be cleaned up, but whose volume paths are still present on disk.

## Description

Dec 25 16:44:48 iZ2ze65lci9pegg2wr99g9Z kubelet: E1225 16:44:48.581657 21207 kubelet_volumes.go:140] Orphaned pod "06fa705f-0821-11e9-8cd4-00163e1071ed"
found, but volume paths are still present on disk : There were a total of 2 errors similar to this. Turn up verbosity to see them.


## Reproduce


## How to Fix
Newer Kubernetes releases fix some of the issues behind this error. For nodes that still hit it, run kubelet.sh on the affected node (or deploy deploy/deploy.yaml); its fix_orphanedPod function unmounts any leftover mounts and removes the empty volume directories.
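For a one-off manual cleanup you can mirror what the fix_orphanedPod function in kubelet.sh automates. This is only a simplified sketch (the kubernetes.io~csi volume type, which nests an extra mount/ directory, is handled more carefully by the script), and the pod UID below is a placeholder taken from the log line above:

```
PODID=06fa705f-0821-11e9-8cd4-00163e1071ed

# unmount any leftover subpath mounts of this pod
mount | grep /var/lib/kubelet/pods/$PODID/volume-subpaths/ | awk '{print $3}' | xargs -r umount

# unmount leftover volume mounts, then remove the now-empty directories
for d in /var/lib/kubelet/pods/$PODID/volumes/*/*; do
    findmnt "$d" > /dev/null && umount "$d"
done
find /var/lib/kubelet/pods/$PODID/volumes -type d -empty -delete
```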
--------------------------------------------------------------------------------
/kubelet/deploy/deploy.yaml:
--------------------------------------------------------------------------------
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: acs-kubelet-recover
  labels:
    k8s-volume: acs-kubelet-recover
spec:
  selector:
    matchLabels:
      name: acs-kubelet-recover
  template:
    metadata:
      labels:
        name: acs-kubelet-recover
    spec:
      hostPID: true
      hostNetwork: true
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      containers:
      - name: acs-kubelet-recover
        image: registry.cn-hangzhou.aliyuncs.com/acs/ack-cluster-helper:v1.12-9b339d6-aliyun
        imagePullPolicy: Always
        securityContext:
          privileged: true
        env:
        - name: LONGRUNNING
          value: "True"
        resources:
          limits:
            memory: 100Mi
        volumeMounts:
        - name: recover
          mountPath: /host/etc/kubernetes/acs-cluster-recover
      volumes:
      - name: recover
        hostPath:
          path: /etc/kubernetes/acs-cluster-recover
  updateStrategy:
    type: RollingUpdate
--------------------------------------------------------------------------------
/kubelet/README.md:
--------------------------------------------------------------------------------
Sometimes a kubelet issue happens that cannot self-heal, and the issue has to be resolved by hand. The scripts in this directory may help you resolve such issues automatically.

Kubelet logs are read from /var/log/messages.

## Issue List
### 1. Orphaned pod issue

[Orphaned-Pod](./orphaned.md)

Example log:
* 21207 kubelet_volumes.go:140] Orphaned pod "06fa705f-0821-11e9-8cd4-00163e1071ed" found,
but volume paths are still present on disk : There were a total of 2 errors similar to this. Turn up verbosity to see them.

### 2. Subpath umount issue

[Subpath-Error-Reading](./subpath-error-reading.md)

The NAS/OSS mountpoint is unmounted while the pod is running, so the pod cannot be deleted normally.

Example logs:
Operation for "\"flexvolume-alicloud/nas/pv-nas-v4\" (*)" failed.* Error: "error cleaning subPath mounts for volume \"pvc-nas\" (*)
error reading /var/lib/kubelet/pods/*/volume-subpaths/pv-nas-v4/nginx:
lstat /var/lib/kubelet/pods/*/volume-subpaths/pv-nas-v4/nginx/0: stale NFS file handle"

or OSS:
* Operation for "\"flexvolume-alicloud/oss/oss1\"*failed. *Error: "error cleaning subPath mounts for volume \"oss1\" *:
error reading /var/lib/kubelet/pods/*/volume-subpaths/oss1/nginx-flexvolume-oss:
lstat /var/lib/kubelet/pods/*/volume-subpaths/oss1/nginx-flexvolume-oss/0: transport endpoint is not connected"


### 3. OSS Subpath umount issue

[Subpath-Oss-Error-Delete](./subpath-oss-error-delete.md)

An OSS volume is used with subPath, and the subPath directory is removed while the pod is running.

Example log:
* Operation for "\"flexvolume-alicloud/oss/oss1\"* failed.* Error: "error cleaning subPath mounts for volume \"oss1\" *
error deleting /var/lib/kubelet/pods/*/volume-subpaths/oss1/nginx-flexvolume-oss:
remove /var/lib/kubelet/pods/*/volume-subpaths/oss1/nginx-flexvolume-oss: directory not empty"

## How to Use

Different issues may have different resolutions; refer to each issue's readme.
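As a quick triage step you can grep a node's kubelet log for the error signatures that kubelet.sh watches for; a minimal sketch, assuming the log location above:

    # grep -e "Orphaned pod" -e "error cleaning subPath mounts for volume" /var/log/messages | tail -n 20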
--------------------------------------------------------------------------------
/kubelet/subpath-oss-error-delete.md:
--------------------------------------------------------------------------------

## Issue Description:

If an OSS volume is mounted with the subPath option, the umount during pod deletion can sometimes fail.

```
Feb 28 11:00:47 iZ2ze1fa4tkhgqper1l406Z kubelet: E0228 11:00:47.816230 8651 nestedpendingoperations.go:267] Operation for "\"flexvolume-alicloud/oss/oss1\"
(\"2c4fc18b-3b04-11e9-b1a1-00163e03e854\")" failed. No retries permitted until 2019-02-28 11:00:55.816187031 +0800 CST m=+137560.296226841 (durationBeforeRetry 8s).
Error: "error cleaning subPath mounts for volume \"oss1\" (UniqueName: \"flexvolume-alicloud/oss/oss1\") pod \"2c4fc18b-3b04-11e9-b1a1-00163e03e854\"
(UID: \"2c4fc18b-3b04-11e9-b1a1-00163e03e854\") : error deleting /var/lib/kubelet/pods/2c4fc18b-3b04-11e9-b1a1-00163e03e854/volume-subpaths/oss1/nginx-flexvolume-oss:
remove /var/lib/kubelet/pods/2c4fc18b-3b04-11e9-b1a1-00163e03e854/volume-subpaths/oss1/nginx-flexvolume-oss: directory not empty"
```

## Reason

If the subPath directory is removed while the pod is running, the corresponding subpath mountpoint becomes unusable, and the cleanup code below gets an error when it reads the mountpoint.

```
## related code in pkg/util/mount/mount_linux.go
subPaths, err := ioutil.ReadDir(fullContainerDirPath)
if err != nil {
	return fmt.Errorf("error reading %s: %s", fullContainerDirPath, err)
}
```

This issue is not yet fixed in kubelet; a PR should be submitted for it.


## How to Reproduce

OSS example:

```
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx-oss-deploy
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx-flexvolume-oss
        image: nginx
        volumeMounts:
        - name: "oss1"
          mountPath: "/data"
          subPath: hello
      volumes:
      - name: "oss1"
        flexVolume:
          driver: "alicloud/oss"
          options:
            bucket: "aliyun-docker"
            url: "oss-cn-hangzhou.aliyuncs.com"
            otherOpts: "-o max_stat_cache_size=0 -o allow_other"
            akId: "**"
            akSecret: "**"
```

### 1. Create the pod

    # kubectl create -f osss.yaml

    # kubectl get pod
    NAME                                READY   STATUS    RESTARTS   AGE
    nginx-oss-deploy-6bfd859cc4-7sb75   1/1     Running   0          19m

### 2. Log in to the node where the pod is located

    # kubectl describe pod nginx-oss-deploy-6bfd859cc4-7sb75 | grep Node
    Node: cn-beijing.i-2ze1fa4tkhgqperal406/172.16.1.1

    # ssh 172.16.1.1
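The pod directories used in step 3 are keyed by the pod UID. If you want to look the UID up directly instead of reading it off the mount output, a jsonpath query works (a convenience sketch; the pod name is the one created in step 1):

    # kubectl get pod nginx-oss-deploy-6bfd859cc4-7sb75 -o jsonpath='{.metadata.uid}'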
### 3. Reproduce

On the node where the pod is located:

    # mount | grep oss
    ossfs on /var/lib/kubelet/pods/44f0528b-3b06-11e9-b1a1-00163e03e854/volumes/alicloud~oss/oss1 type fuse.ossfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
    ossfs on /var/lib/kubelet/pods/44f0528b-3b06-11e9-b1a1-00163e03e854/volume-subpaths/oss1/nginx-flexvolume-oss/0 type fuse.ossfs (rw,relatime,user_id=0,group_id=0,allow_other)

    ## remove the OSS subPath directory while the pod is running;
    # rm -rf /var/lib/kubelet/pods/44f0528b-3b06-11e9-b1a1-00163e03e854/volumes/alicloud~oss/oss1/hello

    ## delete the running pod; the deletion hangs;
    # kubectl delete pod nginx-oss-deploy-6bfd859cc4-7sb75
    pod "nginx-oss-deploy-6bfd859cc4-7sb75" deleted

    ## check the logs on the node where the pod is located
    # tailf /var/log/messages | grep "directory not empty"
    Feb 28 11:31:48 iZ2ze1fa4tkhgqper1l406Z kubelet: E0228 11:31:48.070490 8651 nestedpendingoperations.go:267] Operation for "\"flexvolume-alicloud/oss/oss1\"
    (\"44f0528b-3b06-11e9-b1a1-00163e03e854\")" failed. No retries permitted until 2019-02-28 11:32:20.070437563 +0800 CST m=+139444.550477359 (durationBeforeRetry 32s).
    Error: "error cleaning subPath mounts for volume \"oss1\" (UniqueName: \"flexvolume-alicloud/oss/oss1\") pod \"44f0528b-3b06-11e9-b1a1-00163e03e854\"
    (UID: \"44f0528b-3b06-11e9-b1a1-00163e03e854\") : error deleting /var/lib/kubelet/pods/44f0528b-3b06-11e9-b1a1-00163e03e854/volume-subpaths/oss1/nginx-flexvolume-oss:
    remove /var/lib/kubelet/pods/44f0528b-3b06-11e9-b1a1-00163e03e854/volume-subpaths/oss1/nginx-flexvolume-oss: directory not empty"


## How to Fix

Run the script on the affected node:

    # sh kubelet.sh

Or deploy the DaemonSet, which runs the script continuously and watches for the issue:

    # kubectl create -f kubelet/deploy/deploy.yaml

Warning: using subPath on OSS volumes is not recommended.

--------------------------------------------------------------------------------
/kubelet/kubelet.sh:
--------------------------------------------------------------------------------
#!/bin/sh

date_echo() {
    echo `date "+%H:%M:%S-%Y-%m-%d"` $1
}

date_echo "Starting to fix the possible issue..."

## umount the subpath if its mountpoint is corrupted;
## both OSS and NAS volumes may hit this issue;
fix_Subpath_ErrorReading(){
    lineStr=$1
    tmpStr=`echo $lineStr | awk -F"lstat" '{print $2}'`
    if [ "$tmpStr" != "" ]; then
        mntPoint=`echo $tmpStr | awk -F":" '{print $1}'`
        mntPoint=`echo $mntPoint | xargs`
        if [ "$mntPoint" != "" ]; then
            num=`mount | grep $mntPoint | wc -l`
            if [ "$num" != "0" ]; then
                umount $mntPoint
                date_echo "Fix subpath Error Reading Issue:: Umount $mntPoint ...."
                idleTimes=0
            fi
        fi
    fi
}

## OSS issue, hit when the subPath directory is removed;
## umount the subpath if its mountpoint is corrupted
## Reproduce:
# 1. create a pod that uses subPath;
# 2. log in to the pod's host and remove the subPath directory from the root mountpoint;
# 3. kubectl delete pod **
# 4. check /var/log/messages
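## Example of the log line this function parses (paths shortened; see
## subpath-oss-error-delete.md for a full sample). The mountpoint to umount is the
## text after "error deleting " and before ": remove":
##   ... error deleting /var/lib/kubelet/pods/<uid>/volume-subpaths/oss1/nginx-flexvolume-oss:
##   remove /var/lib/kubelet/pods/<uid>/volume-subpaths/oss1/nginx-flexvolume-oss: directory not empty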
fix_Oss_Subpath_NotEmpty(){
    lineStr=$1
    tmpStr=`echo $lineStr | awk -F"error deleting " '{print $2}'`
    if [ "$tmpStr" != "" ]; then
        mntPoint=`echo $tmpStr | awk -F": remove" '{print $1}'`
        mntPoint=`echo $mntPoint | xargs`
        if [ "$mntPoint" != "" ]; then
            num=`mount | grep $mntPoint | wc -l`
            if [ "$num" != "0" ]; then
                mntPoint=`mount | grep $mntPoint | awk '{print $3}'`
                umount $mntPoint
                date_echo "Fix Subpath Not empty Issue:: Umount $mntPoint ...."
                idleTimes=0
            fi
        fi
    fi
}

# fix an orphaned pod: umount its leftover mountpoints;
fix_orphanedPod(){
    # $item is the current log line from the main loop below
    secondPart=`echo $item | awk -F"Orphaned pod" '{print $2}'`
    podid=`echo $secondPart | awk -F"\"" '{print $2}'`

    # do not process if the volume directory does not exist.
    if [ ! -d /var/lib/kubelet/pods/$podid/volumes/ ]; then
        return
    fi
    # umount subpath mounts if they exist
    if [ -d /var/lib/kubelet/pods/$podid/volume-subpaths/ ]; then
        mountpath=`mount | grep /var/lib/kubelet/pods/$podid/volume-subpaths/ | awk '{print $3}'`
        for mntPath in $mountpath;
        do
            date_echo "Fix subpath Issue:: umount subpath $mntPath"
            umount $mntPath
            idleTimes=0
        done
    fi

    volumeTypes=`ls /var/lib/kubelet/pods/$podid/volumes/`
    for volumeType in $volumeTypes;
    do
        subVolumes=`ls -A /var/lib/kubelet/pods/$podid/volumes/$volumeType`
        if [ "$subVolumes" != "" ]; then
            date_echo "/var/lib/kubelet/pods/$podid/volumes/$volumeType contains volumes: $subVolumes"
            for subVolume in $subVolumes;
            do
                if [ "$volumeType" = "kubernetes.io~csi" ]; then
                    # check whether the subvolume path is mounted
                    findmnt /var/lib/kubelet/pods/$podid/volumes/$volumeType/$subVolume/mount
                    if [ "$?" != "0" ]; then
                        date_echo "/var/lib/kubelet/pods/$podid/volumes/$volumeType/$subVolume/mount is not mounted, just need to remove"
                        content=`ls -A /var/lib/kubelet/pods/$podid/volumes/$volumeType/$subVolume/mount`
                        # if the path is empty, just remove the directory.
                        if [ "$content" = "" ]; then
                            rmdir /var/lib/kubelet/pods/$podid/volumes/$volumeType/$subVolume/mount
                            rm -f /var/lib/kubelet/pods/$podid/volumes/$volumeType/$subVolume/vol_data.json
                            rmdir /var/lib/kubelet/pods/$podid/volumes/$volumeType/$subVolume
                        # if the path is not empty, do nothing.
                        else
                            date_echo "/var/lib/kubelet/pods/$podid/volumes/$volumeType/$subVolume/mount is not mounted, but not empty"
                            idleTimes=0
                        fi
                    # it is mounted, umount it first.
                    else
                        date_echo "Fix Orphaned Issue:: /var/lib/kubelet/pods/$podid/volumes/$volumeType/$subVolume/mount is mounted, umount it"
                        umount /var/lib/kubelet/pods/$podid/volumes/$volumeType/$subVolume/mount
                    fi
                else
                    # check whether the subvolume path is mounted
                    findmnt /var/lib/kubelet/pods/$podid/volumes/$volumeType/$subVolume
                    if [ "$?" != "0" ]; then
                        date_echo "/var/lib/kubelet/pods/$podid/volumes/$volumeType/$subVolume is not mounted, just need to remove"
                        content=`ls -A /var/lib/kubelet/pods/$podid/volumes/$volumeType/$subVolume`
                        # if the path is empty, just remove the directory.
                        if [ "$content" = "" ]; then
                            rmdir /var/lib/kubelet/pods/$podid/volumes/$volumeType/$subVolume
                        # if the path is not empty, do nothing.
                        else
                            date_echo "/var/lib/kubelet/pods/$podid/volumes/$volumeType/$subVolume is not mounted, but not empty"
                            idleTimes=0
                        fi
                    # it is mounted, umount it first.
                    else
                        date_echo "Fix Orphaned Issue:: /var/lib/kubelet/pods/$podid/volumes/$volumeType/$subVolume is mounted, umount it"
                        umount /var/lib/kubelet/pods/$podid/volumes/$volumeType/$subVolume
                    fi
                fi
            done
        fi
    done
}

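## Main loop: every 5 seconds, scan the tail of /var/log/messages and dispatch each
## line to one of the fix_* functions above based on its error signature. Every
## successful fix resets idleTimes; after 10 idle rounds the script exits, unless
## LONGRUNNING=True is set (as in deploy/deploy.yaml) to keep it running as a daemon.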
idleTimes=0
IFS=$'\r\n'
while :
do
    for item in `tail /var/log/messages`;
    do
        ## orphaned pod processing
        if [[ $item == *"Orphaned pod"* ]] && [[ $item == *"but volume paths are still present on disk"* ]]; then
            fix_orphanedPod $item
        ## subpath cannot be umounted ("error reading") processing
        elif [[ $item == *"error cleaning subPath mounts for volume"* ]] && [[ $item == *"error reading"* ]]; then
            fix_Subpath_ErrorReading $item
        ## oss subpath removed ("directory not empty") issue.
        elif [[ $item == *"error cleaning subPath mounts for volume"* ]] && [[ $item == *"error deleting"* ]] && [[ $item == *"directory not empty"* ]]; then
            fix_Oss_Subpath_NotEmpty $item
        fi
    done

    idleTimes=`expr $idleTimes + 1`
    if [ "$idleTimes" = "10" ] && [ "$LONGRUNNING" != "True" ]; then
        break
    fi
    sleep 5
done

date_echo "Finish Process......"
--------------------------------------------------------------------------------
/kubelet/subpath-error-reading.md:
--------------------------------------------------------------------------------

## Issue Description:

If a volume is mounted with the subPath option, the umount during pod deletion can sometimes fail.

```
Feb 20 14:45:34 iZwz99gunotzijxig3j052Z kubelet: E0220 14:45:34.717930 4175 nestedpendingoperations.go:267]
Operation for "\"flexvolume-alicloud/nas/pv-nas-v4\" (\"cb7ceb74-34d8-11e9-b51c-00163e0cd246\")" failed.
No retries permitted until 2019-02-20 14:47:36.717900777 +0800 CST m=+6057600.314812412 (durationBeforeRetry 2m2s).
Error: "error cleaning subPath mounts for volume \"pvc-nas\" (UniqueName: \"flexvolume-alicloud/nas/pv-nas-v4\") pod
\"cb7ceb74-34d8-11e9-b51c-00163e0cd246\" (UID: \"cb7ceb74-34d8-11e9-b51c-00163e0cd246\") :
error reading /var/lib/kubelet/pods/cb7ceb74-34d8-11e9-b51c-00163e0cd246/volume-subpaths/pv-nas-v4/nginx:
lstat /var/lib/kubelet/pods/cb7ceb74-34d8-11e9-b51c-00163e0cd246/volume-subpaths/pv-nas-v4/nginx/0: stale NFS file handle"
```

Or, using OSS:

```
Mar 1 10:29:19 iZ2ze1fa4tkhgqper1l406Z kubelet: E0301 10:29:19.869173 8651 nestedpendingoperations.go:267] Operation for "\"flexvolume-alicloud/oss/oss1\"
(\"aa401a77-3bc9-11e9-b1a1-00163e03e854\")" failed. No retries permitted until 2019-03-01 10:29:51.869139106 +0800 CST m=+222096.349178930 (durationBeforeRetry 32s).
Error: "error cleaning subPath mounts for volume \"oss1\" (UniqueName: \"flexvolume-alicloud/oss/oss1\") pod \"aa401a77-3bc9-11e9-b1a1-00163e03e854\"
(UID: \"aa401a77-3bc9-11e9-b1a1-00163e03e854\") : error reading /var/lib/kubelet/pods/aa401a77-3bc9-11e9-b1a1-00163e03e854/volume-subpaths/oss1/nginx-flexvolume-oss:
lstat /var/lib/kubelet/pods/aa401a77-3bc9-11e9-b1a1-00163e03e854/volume-subpaths/oss1/nginx-flexvolume-oss/0: transport endpoint is not connected"
```

## Reason

If the subPath directory is removed (NAS) or the backing mount breaks (OSS) while the pod is running, the subpath mountpoint becomes stale, and the cleanup code below gets an error when it reads the mountpoint.

```
## related code in pkg/util/mount/mount_linux.go
subPaths, err := ioutil.ReadDir(fullContainerDirPath)
if err != nil {
	return fmt.Errorf("error reading %s: %s", fullContainerDirPath, err)
}
```

This issue is fixed in the 1.11.7 and 1.12 releases, but only for NAS.

PR Details: [https://github.com/kubernetes/kubernetes/pull/71804](https://github.com/kubernetes/kubernetes/pull/71804)
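Until the fix is available, the manual workaround mirrors what fix_Subpath_ErrorReading in kubelet.sh automates: find the stale subpath mountpoint named after "lstat" in the error and unmount it. A minimal sketch, with the pod UID and volume name taken from the first log excerpt above:

```
# list subpath mounts of the affected volume on the node
mount | grep volume-subpaths/pv-nas-v4

# umount the stale subpath mountpoint reported in the lstat error
umount /var/lib/kubelet/pods/cb7ceb74-34d8-11e9-b51c-00163e0cd246/volume-subpaths/pv-nas-v4/nginx/0
```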
## How to Reproduce - NAS

NFS example:

```
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-nas
  labels:
    alicloud-pvname: pv-nas
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  flexVolume:
    driver: "alicloud/nas"
    options:
      server: "**-**.cn-shenzhen.nas.aliyuncs.com"
      path: "/"
      vers: "4.0"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-nas
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      alicloud-pvname: pv-nas
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nas-static
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
        volumeMounts:
          - name: pvc-nas
            mountPath: "/data"
            subPath: "hello"
      volumes:
        - name: pvc-nas
          persistentVolumeClaim:
            claimName: pvc-nas
```

### 1. Create the pv, pvc and pod

    # kubectl create -f nas.yaml

    # kubectl get pod | grep nas
    nas-static-fdc9c8d65-bn4z7   1/1   Running   0   24s

    # kubectl get pvc | grep pvc-nas
    pvc-nas   Bound   pv-nas   5Gi   RWX   58s

    # kubectl get pvc | grep pv-nas
    pvc-nas   Bound   pv-nas   5Gi   RWX   1m

### 2. Log in to the node where the pod is located

    # kubectl describe pod nas-static-fdc9c8d65-bn4z7 | grep Node
    Node: cn-shenzhen.i-wz99gunotzijxig3j052/192.168.0.1

    # ssh 192.168.0.1
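For orientation, the subpath mounts you will see in step 3 follow a fixed layout under the pod directory on the node (inferred from the paths in this repo's logs, so treat the exact layout as an assumption):

    # /var/lib/kubelet/pods/<pod-uid>/volume-subpaths/<volume-name>/<container-name>/<volumeMount-index>
    # e.g. /var/lib/kubelet/pods/009381d9-3504-11e9-b51c-00163e0cd246/volume-subpaths/pv-nas/nginx/0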
### 3. Reproduce

On the node where the pod is located:

    # mount | grep nfs | grep -v container
    **-**.cn-shenzhen.nas.aliyuncs.com:/ on /var/lib/kubelet/pods/009381d9-3504-11e9-b51c-00163e0cd246/volumes/alicloud~nas/pv-nas type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.1,local_lock=none,addr=192.168.0.1)
    **-**.cn-shenzhen.nas.aliyuncs.com:/hello on /var/lib/kubelet/pods/009381d9-3504-11e9-b51c-00163e0cd246/volume-subpaths/pv-nas/nginx/0 type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.1,local_lock=none,addr=192.168.0.1)

    ## remove the subPath directory while the pod is running;
    # rm -rf /var/lib/kubelet/pods/009381d9-3504-11e9-b51c-00163e0cd246/volumes/alicloud~nas/pv-nas/hello

    ## delete the running pod; the deletion hangs;
    # kubectl delete pod nas-static-fdc9c8d65-bn4z7
    pod "nas-static-fdc9c8d65-bn4z7" deleted

    ## check the logs on the node where the pod is located
    # tailf /var/log/messages | grep "stale NFS file handle"
    Feb 20 19:46:15 iZwz99gunotzijxig3j052Z kubelet: E0220 19:46:15.539730 4175 nestedpendingoperations.go:267] Operation for "\"flexvolume-alicloud/nas/pv-nas\" (\"009381d9-3504-11e9-b51c-00163e0cd246\")" failed.
    No retries permitted until 2019-02-20 19:46:23.53968005 +0800 CST m=+6075527.136591731 (durationBeforeRetry 8s).
    Error: "error cleaning subPath mounts for volume \"pvc-nas\" (UniqueName: \"flexvolume-alicloud/nas/pv-nas\") pod \"009381d9-3504-11e9-b51c-00163e0cd246\" (UID: \"009381d9-3504-11e9-b51c-00163e0cd246\") :
    error reading /var/lib/kubelet/pods/009381d9-3504-11e9-b51c-00163e0cd246/volume-subpaths/pv-nas/nginx:
    lstat /var/lib/kubelet/pods/009381d9-3504-11e9-b51c-00163e0cd246/volume-subpaths/pv-nas/nginx/0: stale NFS file handle"


## How to Reproduce - OSS

OSS example:

```
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx-oss-deploy
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx-flexvolume-oss
        image: nginx
        volumeMounts:
        - name: "oss1"
          mountPath: "/data"
          subPath: hello
      volumes:
      - name: "oss1"
        flexVolume:
          driver: "alicloud/oss"
          options:
            bucket: "aliyun-docker"
            url: "oss-cn-hangzhou.aliyuncs.com"
            otherOpts: "-o max_stat_cache_size=0 -o allow_other"
            akId: "**"
            akSecret: "**"
```

### 1. Create the pod

    # kubectl create -f osss.yaml

    # kubectl get pod
    NAME                                READY   STATUS    RESTARTS   AGE
    nginx-oss-deploy-6bfd859cc4-7sb75   1/1     Running   0          19m

### 2. Log in to the node where the pod is located

    # kubectl describe pod nginx-oss-deploy-6bfd859cc4-7sb75 | grep Node
    Node: cn-beijing.i-2ze1fa4tkhgqperal406/172.16.1.1

    # ssh 172.16.1.1
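Unlike the NAS case, the OSS mount is a FUSE filesystem served by a user-space ossfs process; once that process dies, any access to the mountpoint fails with "transport endpoint is not connected", which is exactly the lstat error kubelet reports. Step 3 reproduces this by killing ossfs; a quick pre-check (sketch) to see the ossfs processes backing the mounts:

    # ps -ef | grep [o]ssfs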
### 3. Reproduce

On the node where the pod is located:

    # mount | grep oss
    ossfs on /var/lib/kubelet/pods/44f0528b-3b06-11e9-b1a1-00163e03e854/volumes/alicloud~oss/oss1 type fuse.ossfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
    ossfs on /var/lib/kubelet/pods/44f0528b-3b06-11e9-b1a1-00163e03e854/volume-subpaths/oss1/nginx-flexvolume-oss/0 type fuse.ossfs (rw,relatime,user_id=0,group_id=0,allow_other)

    ## kill the ossfs process while the pod is running;
    # ps -ef | grep ossfs
    # kill **

    ## delete the running pod; the deletion hangs;
    # kubectl delete pod nginx-oss-deploy-6bfd859cc4-7sb75
    pod "nginx-oss-deploy-6bfd859cc4-7sb75" deleted

    ## check the logs on the node where the pod is located
    # tailf /var/log/messages | grep "transport endpoint is not connected"
    Mar 1 10:29:19 iZ2ze1fa4tkhgqper1l406Z kubelet: E0301 10:29:19.869173 8651 nestedpendingoperations.go:267] Operation for "\"flexvolume-alicloud/oss/oss1\"
    (\"aa401a77-3bc9-11e9-b1a1-00163e03e854\")" failed. No retries permitted until 2019-03-01 10:29:51.869139106 +0800 CST m=+222096.349178930 (durationBeforeRetry 32s).
    Error: "error cleaning subPath mounts for volume \"oss1\" (UniqueName: \"flexvolume-alicloud/oss/oss1\") pod \"aa401a77-3bc9-11e9-b1a1-00163e03e854\"
    (UID: \"aa401a77-3bc9-11e9-b1a1-00163e03e854\") : error reading /var/lib/kubelet/pods/aa401a77-3bc9-11e9-b1a1-00163e03e854/volume-subpaths/oss1/nginx-flexvolume-oss:
    lstat /var/lib/kubelet/pods/aa401a77-3bc9-11e9-b1a1-00163e03e854/volume-subpaths/oss1/nginx-flexvolume-oss/0: transport endpoint is not connected"


## How to Fix

Run the script on the affected node:

    # sh kubelet.sh

Or deploy the DaemonSet, which runs the script continuously and watches for the issue:

    # kubectl create -f deploy/deploy.yaml
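Once the DaemonSet is created you can verify that it is running and see what it has fixed, since kubelet.sh prints a date_echo line for every unmount it performs. A hypothetical check, using the name and labels from deploy/deploy.yaml:

    # kubectl get ds acs-kubelet-recover
    # kubectl logs -l name=acs-kubelet-recover --tail=20 | grep "Issue::"

--------------------------------------------------------------------------------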