├── LICENSE
├── README.md
├── eBook
    ├── 01.01.md
    ├── 01.02.md
    ├── 01.03.md
    ├── 02.01.md
    ├── 03.01.md
    ├── 03.02.md
    ├── 03.03.md
    ├── 03.04.md
    ├── 04.01.md
    ├── 04.02.01.md
    ├── 04.02.02.md
    ├── 04.02.03.md
    ├── 04.02.md
    ├── 04.03.md
    ├── 04.04.md
    ├── 04.05.md
    ├── 04.06.md
    ├── 04.07.md
    ├── 04.08.md
    ├── 04.09.md
    ├── 04.10.md
    ├── 05.01.md
    ├── 05.02.md
    ├── directory.md
    └── preface.md
└── images
    ├── K8S-overlay.png
    ├── K8s-service-process.png
    ├── Load-Balancing-Algorithms.gif
    ├── TCP-IP-topology.png
    ├── browser-cap-to-curl.png
    ├── docker-app-replication.png
    ├── docker-node-ctrs.png
    ├── docker-two-node-host-gw.png
    ├── docker-two-node.png
    ├── flannel-host-gw.png
    ├── flannel-vxlan-node.png
    ├── hairpin-icon.png
    ├── host-gw-fault.png
    ├── iptables.png
    ├── lvs-dnat-without-snat.png
    ├── meme-k8s-network.jpg
    ├── nginx-hostnetwork.png
    ├── nodeport-cluster.png
    ├── nodeport-local.png
    ├── pod-delete-process.png
    ├── pod-to-annother-node.gif
    ├── pod-to-service.gif
    ├── router-ap.png
    ├── router-lan-laptop.png
    └── vxlan-mtu.png


/LICENSE:
--------------------------------------------------------------------------------
  1 |                     GNU GENERAL PUBLIC LICENSE
  2 |                        Version 2, June 1991
  3 | 
  4 |  Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
  5 |  51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
  6 |  Everyone is permitted to copy and distribute verbatim copies
  7 |  of this license document, but changing it is not allowed.
  8 | 
  9 |                             Preamble
 10 | 
 11 |   The licenses for most software are designed to take away your
 12 | freedom to share and change it.  By contrast, the GNU General Public
 13 | License is intended to guarantee your freedom to share and change free
 14 | software--to make sure the software is free for all its users.  This
 15 | General Public License applies to most of the Free Software
 16 | Foundation's software and to any other program whose authors commit to
 17 | using it.  (Some other Free Software Foundation software is covered by
 18 | the GNU Lesser General Public License instead.)  You can apply it to
 19 | your programs, too.
 20 | 
 21 |   When we speak of free software, we are referring to freedom, not
 22 | price.  Our General Public Licenses are designed to make sure that you
 23 | have the freedom to distribute copies of free software (and charge for
 24 | this service if you wish), that you receive source code or can get it
 25 | if you want it, that you can change the software or use pieces of it
 26 | in new free programs; and that you know you can do these things.
 27 | 
 28 |   To protect your rights, we need to make restrictions that forbid
 29 | anyone to deny you these rights or to ask you to surrender the rights.
 30 | These restrictions translate to certain responsibilities for you if you
 31 | distribute copies of the software, or if you modify it.
 32 | 
 33 |   For example, if you distribute copies of such a program, whether
 34 | gratis or for a fee, you must give the recipients all the rights that
 35 | you have.  You must make sure that they, too, receive or can get the
 36 | source code.  And you must show them these terms so they know their
 37 | rights.
 38 | 
 39 |   We protect your rights with two steps: (1) copyright the software, and
 40 | (2) offer you this license which gives you legal permission to copy,
 41 | distribute and/or modify the software.
 42 | 
 43 |   Also, for each author's protection and ours, we want to make certain
 44 | that everyone understands that there is no warranty for this free
 45 | software.  If the software is modified by someone else and passed on, we
 46 | want its recipients to know that what they have is not the original, so
 47 | that any problems introduced by others will not reflect on the original
 48 | authors' reputations.
 49 | 
 50 |   Finally, any free program is threatened constantly by software
 51 | patents.  We wish to avoid the danger that redistributors of a free
 52 | program will individually obtain patent licenses, in effect making the
 53 | program proprietary.  To prevent this, we have made it clear that any
 54 | patent must be licensed for everyone's free use or not licensed at all.
 55 | 
 56 |   The precise terms and conditions for copying, distribution and
 57 | modification follow.
 58 | 
 59 |                     GNU GENERAL PUBLIC LICENSE
 60 |    TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
 61 | 
 62 |   0. This License applies to any program or other work which contains
 63 | a notice placed by the copyright holder saying it may be distributed
 64 | under the terms of this General Public License.  The "Program", below,
 65 | refers to any such program or work, and a "work based on the Program"
 66 | means either the Program or any derivative work under copyright law:
 67 | that is to say, a work containing the Program or a portion of it,
 68 | either verbatim or with modifications and/or translated into another
 69 | language.  (Hereinafter, translation is included without limitation in
 70 | the term "modification".)  Each licensee is addressed as "you".
 71 | 
 72 | Activities other than copying, distribution and modification are not
 73 | covered by this License; they are outside its scope.  The act of
 74 | running the Program is not restricted, and the output from the Program
 75 | is covered only if its contents constitute a work based on the
 76 | Program (independent of having been made by running the Program).
 77 | Whether that is true depends on what the Program does.
 78 | 
 79 |   1. You may copy and distribute verbatim copies of the Program's
 80 | source code as you receive it, in any medium, provided that you
 81 | conspicuously and appropriately publish on each copy an appropriate
 82 | copyright notice and disclaimer of warranty; keep intact all the
 83 | notices that refer to this License and to the absence of any warranty;
 84 | and give any other recipients of the Program a copy of this License
 85 | along with the Program.
 86 | 
 87 | You may charge a fee for the physical act of transferring a copy, and
 88 | you may at your option offer warranty protection in exchange for a fee.
 89 | 
 90 |   2. You may modify your copy or copies of the Program or any portion
 91 | of it, thus forming a work based on the Program, and copy and
 92 | distribute such modifications or work under the terms of Section 1
 93 | above, provided that you also meet all of these conditions:
 94 | 
 95 |     a) You must cause the modified files to carry prominent notices
 96 |     stating that you changed the files and the date of any change.
 97 | 
 98 |     b) You must cause any work that you distribute or publish, that in
 99 |     whole or in part contains or is derived from the Program or any
100 |     part thereof, to be licensed as a whole at no charge to all third
101 |     parties under the terms of this License.
102 | 
103 |     c) If the modified program normally reads commands interactively
104 |     when run, you must cause it, when started running for such
105 |     interactive use in the most ordinary way, to print or display an
106 |     announcement including an appropriate copyright notice and a
107 |     notice that there is no warranty (or else, saying that you provide
108 |     a warranty) and that users may redistribute the program under
109 |     these conditions, and telling the user how to view a copy of this
110 |     License.  (Exception: if the Program itself is interactive but
111 |     does not normally print such an announcement, your work based on
112 |     the Program is not required to print an announcement.)
113 | 
114 | These requirements apply to the modified work as a whole.  If
115 | identifiable sections of that work are not derived from the Program,
116 | and can be reasonably considered independent and separate works in
117 | themselves, then this License, and its terms, do not apply to those
118 | sections when you distribute them as separate works.  But when you
119 | distribute the same sections as part of a whole which is a work based
120 | on the Program, the distribution of the whole must be on the terms of
121 | this License, whose permissions for other licensees extend to the
122 | entire whole, and thus to each and every part regardless of who wrote it.
123 | 
124 | Thus, it is not the intent of this section to claim rights or contest
125 | your rights to work written entirely by you; rather, the intent is to
126 | exercise the right to control the distribution of derivative or
127 | collective works based on the Program.
128 | 
129 | In addition, mere aggregation of another work not based on the Program
130 | with the Program (or with a work based on the Program) on a volume of
131 | a storage or distribution medium does not bring the other work under
132 | the scope of this License.
133 | 
134 |   3. You may copy and distribute the Program (or a work based on it,
135 | under Section 2) in object code or executable form under the terms of
136 | Sections 1 and 2 above provided that you also do one of the following:
137 | 
138 |     a) Accompany it with the complete corresponding machine-readable
139 |     source code, which must be distributed under the terms of Sections
140 |     1 and 2 above on a medium customarily used for software interchange; or,
141 | 
142 |     b) Accompany it with a written offer, valid for at least three
143 |     years, to give any third party, for a charge no more than your
144 |     cost of physically performing source distribution, a complete
145 |     machine-readable copy of the corresponding source code, to be
146 |     distributed under the terms of Sections 1 and 2 above on a medium
147 |     customarily used for software interchange; or,
148 | 
149 |     c) Accompany it with the information you received as to the offer
150 |     to distribute corresponding source code.  (This alternative is
151 |     allowed only for noncommercial distribution and only if you
152 |     received the program in object code or executable form with such
153 |     an offer, in accord with Subsection b above.)
154 | 
155 | The source code for a work means the preferred form of the work for
156 | making modifications to it.  For an executable work, complete source
157 | code means all the source code for all modules it contains, plus any
158 | associated interface definition files, plus the scripts used to
159 | control compilation and installation of the executable.  However, as a
160 | special exception, the source code distributed need not include
161 | anything that is normally distributed (in either source or binary
162 | form) with the major components (compiler, kernel, and so on) of the
163 | operating system on which the executable runs, unless that component
164 | itself accompanies the executable.
165 | 
166 | If distribution of executable or object code is made by offering
167 | access to copy from a designated place, then offering equivalent
168 | access to copy the source code from the same place counts as
169 | distribution of the source code, even though third parties are not
170 | compelled to copy the source along with the object code.
171 | 
172 |   4. You may not copy, modify, sublicense, or distribute the Program
173 | except as expressly provided under this License.  Any attempt
174 | otherwise to copy, modify, sublicense or distribute the Program is
175 | void, and will automatically terminate your rights under this License.
176 | However, parties who have received copies, or rights, from you under
177 | this License will not have their licenses terminated so long as such
178 | parties remain in full compliance.
179 | 
180 |   5. You are not required to accept this License, since you have not
181 | signed it.  However, nothing else grants you permission to modify or
182 | distribute the Program or its derivative works.  These actions are
183 | prohibited by law if you do not accept this License.  Therefore, by
184 | modifying or distributing the Program (or any work based on the
185 | Program), you indicate your acceptance of this License to do so, and
186 | all its terms and conditions for copying, distributing or modifying
187 | the Program or works based on it.
188 | 
189 |   6. Each time you redistribute the Program (or any work based on the
190 | Program), the recipient automatically receives a license from the
191 | original licensor to copy, distribute or modify the Program subject to
192 | these terms and conditions.  You may not impose any further
193 | restrictions on the recipients' exercise of the rights granted herein.
194 | You are not responsible for enforcing compliance by third parties to
195 | this License.
196 | 
197 |   7. If, as a consequence of a court judgment or allegation of patent
198 | infringement or for any other reason (not limited to patent issues),
199 | conditions are imposed on you (whether by court order, agreement or
200 | otherwise) that contradict the conditions of this License, they do not
201 | excuse you from the conditions of this License.  If you cannot
202 | distribute so as to satisfy simultaneously your obligations under this
203 | License and any other pertinent obligations, then as a consequence you
204 | may not distribute the Program at all.  For example, if a patent
205 | license would not permit royalty-free redistribution of the Program by
206 | all those who receive copies directly or indirectly through you, then
207 | the only way you could satisfy both it and this License would be to
208 | refrain entirely from distribution of the Program.
209 | 
210 | If any portion of this section is held invalid or unenforceable under
211 | any particular circumstance, the balance of the section is intended to
212 | apply and the section as a whole is intended to apply in other
213 | circumstances.
214 | 
215 | It is not the purpose of this section to induce you to infringe any
216 | patents or other property right claims or to contest validity of any
217 | such claims; this section has the sole purpose of protecting the
218 | integrity of the free software distribution system, which is
219 | implemented by public license practices.  Many people have made
220 | generous contributions to the wide range of software distributed
221 | through that system in reliance on consistent application of that
222 | system; it is up to the author/donor to decide if he or she is willing
223 | to distribute software through any other system and a licensee cannot
224 | impose that choice.
225 | 
226 | This section is intended to make thoroughly clear what is believed to
227 | be a consequence of the rest of this License.
228 | 
229 |   8. If the distribution and/or use of the Program is restricted in
230 | certain countries either by patents or by copyrighted interfaces, the
231 | original copyright holder who places the Program under this License
232 | may add an explicit geographical distribution limitation excluding
233 | those countries, so that distribution is permitted only in or among
234 | countries not thus excluded.  In such case, this License incorporates
235 | the limitation as if written in the body of this License.
236 | 
237 |   9. The Free Software Foundation may publish revised and/or new versions
238 | of the General Public License from time to time.  Such new versions will
239 | be similar in spirit to the present version, but may differ in detail to
240 | address new problems or concerns.
241 | 
242 | Each version is given a distinguishing version number.  If the Program
243 | specifies a version number of this License which applies to it and "any
244 | later version", you have the option of following the terms and conditions
245 | either of that version or of any later version published by the Free
246 | Software Foundation.  If the Program does not specify a version number of
247 | this License, you may choose any version ever published by the Free Software
248 | Foundation.
249 | 
250 |   10. If you wish to incorporate parts of the Program into other free
251 | programs whose distribution conditions are different, write to the author
252 | to ask for permission.  For software which is copyrighted by the Free
253 | Software Foundation, write to the Free Software Foundation; we sometimes
254 | make exceptions for this.  Our decision will be guided by the two goals
255 | of preserving the free status of all derivatives of our free software and
256 | of promoting the sharing and reuse of software generally.
257 | 
258 |                             NO WARRANTY
259 | 
260 |   11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
261 | FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW.  EXCEPT WHEN
262 | OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
263 | PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
264 | OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
265 | MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.  THE ENTIRE RISK AS
266 | TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU.  SHOULD THE
267 | PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
268 | REPAIR OR CORRECTION.
269 | 
270 |   12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
271 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
272 | REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
273 | INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
274 | OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
275 | TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
276 | YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
277 | PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
278 | POSSIBILITY OF SUCH DAMAGES.
279 | 
280 |                      END OF TERMS AND CONDITIONS
281 | 
282 |             How to Apply These Terms to Your New Programs
283 | 
284 |   If you develop a new program, and you want it to be of the greatest
285 | possible use to the public, the best way to achieve this is to make it
286 | free software which everyone can redistribute and change under these terms.
287 | 
288 |   To do so, attach the following notices to the program.  It is safest
289 | to attach them to the start of each source file to most effectively
290 | convey the exclusion of warranty; and each file should have at least
291 | the "copyright" line and a pointer to where the full notice is found.
292 | 
293 |     <one line to give the program's name and a brief idea of what it does.>
294 |     Copyright (C) <year>  <name of author>
295 | 
296 |     This program is free software; you can redistribute it and/or modify
297 |     it under the terms of the GNU General Public License as published by
298 |     the Free Software Foundation; either version 2 of the License, or
299 |     (at your option) any later version.
300 | 
301 |     This program is distributed in the hope that it will be useful,
302 |     but WITHOUT ANY WARRANTY; without even the implied warranty of
303 |     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
304 |     GNU General Public License for more details.
305 | 
306 |     You should have received a copy of the GNU General Public License along
307 |     with this program; if not, write to the Free Software Foundation, Inc.,
308 |     51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
309 | 
310 | Also add information on how to contact you by electronic and paper mail.
311 | 
312 | If the program is interactive, make it output a short notice like this
313 | when it starts in an interactive mode:
314 | 
315 |     Gnomovision version 69, Copyright (C) year name of author
316 |     Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
317 |     This is free software, and you are welcome to redistribute it
318 |     under certain conditions; type `show c' for details.
319 | 
320 | The hypothetical commands `show w' and `show c' should show the appropriate
321 | parts of the General Public License.  Of course, the commands you use may
322 | be called something other than `show w' and `show c'; they could even be
323 | mouse-clicks or menu items--whatever suits your program.
324 | 
325 | You should also get your employer (if you work as a programmer) or your
326 | school, if any, to sign a "copyright disclaimer" for the program, if
327 | necessary.  Here is a sample; alter the names:
328 | 
329 |   Yoyodyne, Inc., hereby disclaims all copyright interest in the program
330 |   `Gnomovision' (which makes passes at compilers) written by James Hacker.
331 | 
332 |   <signature of Ty Coon>, 1 April 1989
333 |   Ty Coon, President of Vice
334 | 
335 | This General Public License does not permit incorporating your program into
336 | proprietary programs.  If your program is a subroutine library, you may
337 | consider it more useful to permit linking proprietary applications with the
338 | library.  If this is what you want to do, use the GNU Lesser General
339 | Public License instead of this License.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
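 1 | # An Illustrated Introduction to Container Networking
 2 | 
 3 | ## Introduction
 4 | 
 5 | In recent years, as Docker and K8S became popular, many ops engineers and developers have picked up K8S by following tutorials found online. Most of those tutorials go through roughly these steps:
 6 | 
 7 | 1. Set up K8S
 8 | 2. Deploy a Deployment and a few simple applications
 9 | 3. GitLab
10 | 4. Jenkins
11 | 5. Jenkins CI/CD onto K8S
12 | 
13 | But when container networking problems and failures come up, people who learned this way have few means of troubleshooting and locating the fault, and can hardly take on the job of firefighting an urgent incident.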
14 | 
15 | ![meme-k8s-network](./images/meme-k8s-network.jpg)
16 | 
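17 | If any of the following describes you, this book is meant for you:
18 | 
19 | - (Like many ops engineers and developers) you do not understand iptables, so every incident is like the blind men feeling the elephant
20 | - (Like many ops engineers and developers) you do not understand Docker networking or the K8S CNI network
21 | - You do repetitive work every day; when a production problem comes up you lack the patience to dig in, so you try whatever a few articles found online suggest and stop as soon as something happens to work. Afterwards you neither review how the incident was handled nor fill in your weak spots, so your skills stay where they are
22 | - Your technical foundation is weak, you never catch up, tinker, or study outside working hours, and you have no wish to read tutorials that are too complicated
23 | 
24 | In-depth material does exist, but tutorials made up mostly of text and code are too abstruse. This book draws on my own path from complete beginner to where I am now, on the many failures I have handled, and on what I learned helping people online solve their problems. It explains the topic simply, with pictures alongside the text.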
25 | 
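26 | ## Support
27 | 
28 | [Donate](https://github.com/zhangguanzhang/simple-container-network-book/wiki/%E6%89%93%E8%B5%8F)
29 | 
30 | ## Start Reading
31 | 
32 | Readers should ideally have some hands-on experience with Docker and K8S.
33 | 
34 | - You can read this book in the following ways:
35 |   - [Online on GitHub](./eBook/preface.md)
36 | 
37 | Anyone who wants to read will have no trouble finding the [table of contents](eBook/directory.md) :)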
38 | 


--------------------------------------------------------------------------------
/eBook/01.01.md:
--------------------------------------------------------------------------------
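  1 | # 1.1 OSI and Networking
  2 | 
  3 | ## The OSI Network Model
  4 | 
  5 | The OSI network model is an abstract conceptual model for network interconnection proposed by the International Organization for Standardization (ISO). It is not an actual implementation; its purpose is to help people understand how networks operate from a high-level view rather than in detail.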
  6 | 
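  7 | <table>
  8 |     <tr>
  9 |         <td>OSI seven-layer model</td>
 10 |         <td>TCP/IP five-layer model</td>
 11 |         <td>Purpose</td>
 12 |         <td>Example protocols</td>
 13 |     </tr>
 14 |     <tr>
 15 |         <td>Application layer</td>
 16 |         <td rowspan="3">Application layer</td>
 17 |         <td>Provides application functions through interaction with the user</td>
 18 |         <td rowspan="3">HTTP, FTP, SMTP, DNS, SSH, Telnet, MQTT...</td>
 19 |     </tr>
 20 |     <tr>
 21 |         <td>Presentation layer</td>
 22 |         <td>Encodes and decodes data</td>
 23 |     </tr>
 24 |     <tr>
 25 |         <td>Session layer</td>
 26 |         <td>Establishes, maintains, and manages sessions</td>
 27 |     </tr>
 28 |     <tr>
 29 |         <td>Transport layer</td>
 30 |         <td>Transport layer</td>
 31 |         <td>Provides end-to-end communication</td>
 32 |         <td>TCP, UDP, SCTP</td>
 33 |     </tr>
 34 |     <tr>
 35 |         <td>Network layer</td>
 36 |         <td>Network layer</td>
 37 |         <td>Routing and addressing</td>
 38 |         <td>IP, ICMP, OSPF, BGP</td>
 39 |     </tr>
 40 |     <tr>
 41 |         <td>Data link layer</td>
 42 |         <td>Data link layer</td>
 43 |         <td>Transfers data frames</td>
 44 |         <td>FDDI, Ethernet, PPP</td>
 45 |     </tr>
 46 |     <tr>
 47 |         <td>Physical layer</td>
 48 |         <td>Physical layer</td>
 49 |         <td>Converts bits into electrical, radio, or optical signals for transmission</td>
 50 |         <td>IEEE 802.3 (Ethernet PHY), IEEE 802.11 (Wi-Fi PHY)</td>
 51 |     </tr>
 52 | </table>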
 53 | 
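 54 | ## The TCP/IP Network Model
 55 | 
 56 | The five-layer (or four-layer) TCP/IP model is a layered model that developers can learn from. Each layer's encapsulation adds the following information: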
 57 | 
 58 | ```
 59 | ╭──────────────┬───────────────────────────────────────────────────────────────────╮
 60 | │ Application  │ Application data                                                  │
 61 | ├──────────────┼───────────────────────────────────────────────────────────────────┤
 62 | │ Transport    │ TCP/UDP header (src/dst ports) + upper-layer payload              │
 63 | ├──────────────┼───────────────────────────────────────────────────────────────────┤
 64 | │ Network      │ IP header (src/dst IP addresses) + upper-layer payload            │
 65 | ├──────────────┼───────────────────────────────────────────────────────────────────┤
 66 | │ Data link    │ Ethernet header (src/dst MAC) + upper-layer payload + FCS trailer │
 67 | ├──────────────┼───────────────────────────────────────────────────────────────────┤
 68 | │ Physical     │ Upper-layer bits converted to electrical or optical signals       │
 69 | ╰──────────────┴───────────────────────────────────────────────────────────────────╯
 70 | 
 71 | ```
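The per-layer additions above can be turned into simple arithmetic. The sketch below is only an illustration under stated assumptions: UDP over IPv4 with no IP options, Ethernet II framing with no VLAN tag, and a hypothetical 31-byte application payload. The function name `encap_sizes` is made up for this example.

```shell
#!/usr/bin/env bash
# Bytes added by each layer. Assumptions: UDP over IPv4 without options,
# Ethernet II framing without a VLAN tag.
UDP_HDR=8      # source port, destination port, length, checksum
IP_HDR=20      # minimum IPv4 header
ETH_HDR=14     # destination MAC, source MAC, EtherType
ETH_FCS=4      # frame check sequence at the tail

# Print the size of the unit each layer hands down for a given payload size.
# Padding of frames shorter than the 64-byte Ethernet minimum is not modeled.
encap_sizes() {
    local payload=$1
    local segment=$((payload + UDP_HDR))
    local packet=$((segment + IP_HDR))
    local frame=$((packet + ETH_HDR + ETH_FCS))
    echo "application=$payload transport=$segment network=$packet link=$frame"
}

encap_sizes 31
```

With a 31-byte payload this prints `application=31 transport=39 network=59 link=77`: every layer wraps what it receives and only adds its own header (and, for Ethernet, a trailer).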
 72 | 
 73 | For example, suppose this machine sends a DNS query to `223.5.5.5` to resolve the IP of `www.baidu.com`. The encapsulation goes roughly as follows:
 74 | 
 75 | ```
 76 | # Per the machine's routing table, the packet leaves through NIC eth0 (192.168.2.112); the randomly assigned client port is 51234
 77 | ╭──────────────┬─────────────────────────────────────────────────────────────────────────╮
 78 | │ Application  │ Data (what is the IP address of www.baidu.com?)                         │
 79 | ├──────────────┼─────────────────────────────────────────────────────────────────────────┤
 80 | │ Transport    │ UDP header(51234|53)|data                                               │
 81 | ├──────────────┼─────────────────────────────────────────────────────────────────────────┤
 82 | │ Network      │ IP header(192.168.2.112|223.5.5.5)|UDP header|data                      │
 83 | ├──────────────┼─────────────────────────────────────────────────────────────────────────┤
 84 | │ Data link    │ Ethernet header(dst MAC|src MAC)|IP header|UDP header|data|trailer(CRC) │
 85 | ├──────────────┼─────────────────────────────────────────────────────────────────────────┤
 86 | │ Physical     │ Upper-layer bits converted to electrical or optical signals             │
 87 | ╰──────────────┴─────────────────────────────────────────────────────────────────────────╯
 88 | # The network layer sees that 223.5.5.5 is not on its own layer-2 network, so the routing table sends the packet to the gateway
 89 | # Building the layer-2 Ethernet frame requires a destination MAC address, so an ARP request is sent first to obtain the gateway's MAC
 90 | # The destination MAC address in the data link layer is then set to the gateway's MAC
 91 | ```
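The decision described in the comments above (is the destination on my own subnet, or must the frame go to the gateway?) can be sketched in a few lines of bash. This is a simplification, not how the kernel does it: a real host performs a longest-prefix match over its whole routing table. The gateway address `192.168.2.1`, the `/24` prefix, and the function names are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<< "$1"
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Succeed (exit 0) when dst lies inside src's subnet of the given prefix length.
same_subnet() {
    local src dst prefix=$3 mask
    src=$(ip_to_int "$1")
    dst=$(ip_to_int "$2")
    mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    (( (src & mask) == (dst & mask) ))
}

# Print the IP whose MAC goes into the Ethernet header (what ARP must resolve).
l2_target() {
    local src=$1 prefix=$2 dst=$3 gateway=$4
    if same_subnet "$src" "$dst" "$prefix"; then
        echo "$dst"        # same layer-2 network: ARP for the destination itself
    else
        echo "$gateway"    # different network: ARP for the gateway
    fi
}

l2_target 192.168.2.112 24 223.5.5.5 192.168.2.1      # prints 192.168.2.1
l2_target 192.168.2.112 24 192.168.2.50 192.168.2.1   # prints 192.168.2.50
```

In both cases the IP header still carries the real destination IP; only the destination MAC in the Ethernet header changes.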
 92 | 
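 93 | The figure below shows traffic flowing through the four-layer TCP/IP model together with network devices. It is simplified: the many network devices and multiple layers of NAT found in practice are left out for now.
 94 | 
 95 | ![TCP-IP-topology](../images/TCP-IP-topology.png)
 96 | 
 97 | During development in an internal environment, or during network troubleshooting, a URL may fail to open in the browser (the error message usually points to the problem, but many people neither read it nor think about it). Taking an intranet as an example, the common causes are:
 98 | - The port on the machine is not up: the process is not running, is not listening on that port, or is bound to the wrong IP
 99 | - The firewall on the machine does not allow the port
100 | - Packets are not routed to the machine
101 | - An IP conflict or wrong ARP entries
102 | - The machine is powered off
103 | 
104 | As a developer you often cannot even ssh into the machine (and you do not actually need to). The simplest approach is to bisect: first find out whether the network layer is reachable at all, ignoring the ports of the TCP/UDP transport layer above it. For that we can use ICMP, a network-layer protocol, which means pinging the machine's IP. Linux firewalls generally allow ICMP, and by default the Linux kernel parameters allow replies to ICMP echo requests, so ping first. Whether it succeeds or fails, the result tells you where to look next: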
105 | 
106 | ```
107 | ╭──────────────┬───────────────────────────────────────────────────────────────────╮
108 | │ Network      │ IP header(192.168.2.112|223.5.5.5)|ICMP data                      │
109 | ├──────────────┼───────────────────────────────────────────────────────────────────┤
110 | │ Data link    │ Ethernet header(dst MAC|src MAC)|IP header|ICMP data|trailer(CRC) │
111 | ├──────────────┼───────────────────────────────────────────────────────────────────┤
112 | │ Physical     │ Upper-layer bits converted to electrical or optical signals       │
113 | ╰──────────────┴───────────────────────────────────────────────────────────────────╯
114 | ```
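The bisecting approach can be sketched as a small script: ping first, then try a TCP connection, and map the two results onto the list of common causes given earlier. The function names, the example host `192.168.2.50`, and port `8080` are hypothetical, and the third outcome (both probes succeed) is an addition for completeness. The TCP probe relies on bash's `/dev/tcp` feature and on `timeout` from coreutils.

```shell
#!/usr/bin/env bash
# Map the two probe results (0 = success) to the next place to look.
next_step() {
    local ping_rc=$1 port_rc=$2
    if [ "$ping_rc" -ne 0 ]; then
        echo "network layer: check routing, ARP or IP conflicts, and whether the machine is powered on"
    elif [ "$port_rc" -ne 0 ]; then
        echo "above the network layer: check the listening process, its bind IP and port, and the firewall"
    else
        echo "host and port reachable: look at the application itself"
    fi
}

# Probe a host: ICMP first, then a TCP connect through bash's /dev/tcp.
probe() {
    local host=$1 port=$2 ping_rc=0 port_rc=0
    ping -c 2 -W 2 "$host" >/dev/null 2>&1 || ping_rc=$?
    timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null || port_rc=$?
    next_step "$ping_rc" "$port_rc"
}

# Hypothetical usage:
#   probe 192.168.2.50 8080
```

A failed ping is not proof on its own, since some hosts do drop ICMP, but on an intranet it is usually a good first split.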
115 | 
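116 | Thinking in layers helps a great deal when troubleshooting and understanding networks. You also need a rough idea of what ARP, MAC, IP routing, subnets, private and public networks, and NAT do; without it, many of the later networking concepts will not make sense. If these are unfamiliar, please read the tutorials recommended below.
117 | For the detailed workings of the K8S CNI plugins covered later (flannel, calico), you also need to understand tunneling and routing protocols such as VXLAN, IPIP, and BGP.
118 | 
119 | ## Some Tutorials
120 | 
121 | ### Illustrated tutorials
122 | 
123 | - The article [如果让你来设计网络](https://mp.weixin.qq.com/s/jiPMUk6zUdOY6eKxAjNDbQ) ("If You Were Asked to Design the Network") from the WeChat official account `无聊的闪客`, which goes from connecting two machines to connecting many and derives where the layer-2 and layer-3 protocols come from
124 | 
125 | ### Animated video tutorials
126 | 
127 | - [计算机网络-如何寻找目标计算机？](https://www.bilibili.com/video/BV1ho4y1j7qd/) ("Computer Networks: How Is the Target Computer Found?"), an animated video based on the article above
128 | - [IP 地址不够用怎么办](https://www.bilibili.com/video/BV1Hu411W7T7) ("What to Do When IP Addresses Run Out"), which explains why IP addresses are divided into classes A, B, and C, and how NAT and NAPT translation work
129 | 
130 | For layer 2, it is enough to remember the most common protocol, Ethernet, and that its header contains the source and destination MAC addresses. Take the DNS request to 223.5.5.5 described above, after it has been encapsulated and sent out:
131 | 1. The switch receives the frame because it sits between the connected machines. It parses the source and destination MAC addresses in the frame and forwards the frame out of the matching port. The switch only parses the layer-2 part, which is why it is called a layer-2 device. Thanks to this forwarding behavior, many devices that each have a single network port can be connected together through a switch.
132 | 2. This makes the later host-gw and calico BGP modes easy to understand: once routes have been added on each NODE, a cross-node Pod packet carries the MAC addresses of the source and destination NODEs in its Ethernet header and the Pod IPs in its network-layer header, and it is forwarded across the shared layer-2 network.
133 | 
134 | ### Further reading
135 | 
136 | - [内网穿透和NAT打洞是什么？](https://www.bilibili.com/video/BV19W4y1X7mV) ("What Are Intranet Penetration and NAT Hole Punching?")
137 | 
138 | ## Links
139 | 
140 | - [Preface](preface.md)
141 | - Next: [The Layers Below the Application Layer](01.02.md)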
142 | 


--------------------------------------------------------------------------------
/eBook/01.02.md:
--------------------------------------------------------------------------------
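  1 | # The Layers Below the Application Layer
  2 | 
  3 | This part introduces some network troubleshooting commands. Some of these Linux commands are not available on Windows; there you can consider installing Git Bash.
  4 | 
  5 | ## Routers
  6 | 
  7 | Home router = a switch (the LAN ports) + an AP (Wi-Fi) + a layer-3 router + some applications (DHCP, DNS, UPnP ...)
  8 | 
  9 | You can:
 10 | - Connect a computer, a server, and another computer to the LAN ports (the LAN side is a switch), configure IP/MASK on each without a gateway, and use them as a local network.
 11 | ![router-lan-laptop](../images/router-lan-laptop.png)
 12 | - Reuse a retired router (one without a one-click AP mode): turn off its DHCP, set its LAN management IP to an address that does not conflict with the main router, and connect one of its LAN ports to a LAN port on the main router at home. It then works as an AP (bridging over Wi-Fi instead would most likely add a lot of latency) or adds LAN ports to the main router (some main routers have only one LAN port).
 13 | ![router-ap](../images/router-ap.png)
 14 | 
 15 | The AP and the LAN ports are all bridged onto br-lan, so together they behave like a single network cable. From the Linux side it looks like this: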
 16 | 
 17 | ```
 18 | $ brctl show
 19 | bridge name	bridge id		STP enabled	interfaces
 20 | br-lan		7fff.dcd87c280xxa	no		eth0
 21 | 							phy1-ap0
 22 | 							eth1
 23 | 							eth2
 24 | 							eth3
 25 | 							phy0-ap0
 26 | ```
 27 | 
 28 | Address information:
 29 | 
 30 | ```bash
 31 | $ ip a s 
 32 | 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
 33 |     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
 34 |     inet 127.0.0.1/8 scope host lo
 35 |        valid_lft forever preferred_lft forever
 36 |     inet6 ::1/128 scope host proto kernel_lo 
 37 |        valid_lft forever preferred_lft forever
 38 | 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br-lan state UP group default qlen 1000
 39 |     link/ether dc:d8:7c:28:05:8a brd ff:ff:ff:ff:ff:ff
 40 | 3: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master br-lan state DOWN group default qlen 1000
 41 |     link/ether dc:d8:7d:28:05:8a brd ff:ff:ff:ff:ff:ff
 42 | 4: eth2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master br-lan state DOWN group default qlen 1000
 43 |     link/ether dc:d8:7e:28:05:8a brd ff:ff:ff:ff:ff:ff
 44 | 5: eth3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master br-lan state DOWN group default qlen 1000
 45 |     link/ether dc:d8:7e:28:05:8a brd ff:ff:ff:ff:ff:ff
 46 | 6: eth4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
 47 |     link/ether dc:d8:7c:28:05:8b brd ff:ff:ff:ff:ff:ff
 48 |     inet 192.168.1.2/24 brd 192.168.1.255 scope global eth4
 49 |        valid_lft forever preferred_lft forever
 50 | 7: br-lan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
 51 |     link/ether dc:d8:7c:28:05:8a brd ff:ff:ff:ff:ff:ff
 52 |     inet 192.168.0.1/24 brd 192.168.0.255 scope global br-lan
 53 |        valid_lft forever preferred_lft forever
 54 | 8: phy1-ap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-lan state UP group default qlen 1000
 55 |     link/ether dc:d8:7c:28:05:8d brd ff:ff:ff:ff:ff:ff
 56 |     inet6 fe80::ded8:7cff:fe28:58d/64 scope link proto kernel_ll 
 57 |        valid_lft forever preferred_lft forever
 58 | 9: phy0-ap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-lan state UP group default qlen 1000
 59 |     link/ether dc:d8:7c:28:05:8c brd ff:ff:ff:ff:ff:ff
 60 |     inet6 fe80::ded8:7cff:fe28:58c/64 scope link proto kernel_ll 
 61 |        valid_lft forever preferred_lft forever
 62 | $ ip route get 1
 63 | 1.0.0.0 via 192.168.1.1 dev eth3 src 192.168.1.2 uid 0 
 64 |     cache
 65 | ```
 66 | 
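 67 | `192.168.1.1` is the LAN-side IP address of the optical modem (ONT), and eth4 is the WAN port; switching the modem to bridge mode is not discussed here. **If you are in a customer's network center, do not plug in network cables carelessly.**
 68 | 
 69 | ## Protocols and Troubleshooting Commands
 70 | 
 71 | ### Layer 2: MAC
 72 | 
 73 | When one machine on the LAN cannot get online while the others can, the cause is most likely an IP conflict. You can check the ARP table (which stores IP-MAC-NIC mappings) on the gateway or on another machine in the LAN: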
 74 | 
 75 | ```bash
 76 | $ arp -n
 77 | IP address       HW type     Flags       HW address            Mask     Device
 78 | 192.168.0.6      0x1         0x2         70:9c:xx:6f:xx:xx     *        br-lan
 79 | 192.168.1.1      0x1         0x2         34:d8:xx:xx:06:xx     *        eth3
 80 | 192.168.0.172    0x1         0x2         34:ea:34:xx:50:xx     *        br-lan
 81 | 192.168.0.2      0x1         0x2         06:22:d2:xx:ca:xx     *        br-lan
 82 | 192.168.0.141    0x1         0x0         70:9c:d1:xx:df:xx     *        br-lan
 83 | # the ip command can show the same table
 84 | $ ip neighbour show
 85 | ...
 86 | ```
 87 | 
 88 | Delete a specific entry:
 89 | 
 90 | ```bash
 91 | arp --delete <IP_ADDRESS>
 92 | ip neighbour delete <ip_address> dev <interface_name>
 93 | ```
 94 | 
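To scan a long table quickly, the check can be scripted. The sketch below assumes the `/proc/net/arp` column layout shown in the `arp -n` output above, where a Flags value of `0x0` marks an entry that was never resolved to a MAC; the function name `incomplete_arp` is only illustrative.

```bash
# incomplete_arp: read an ARP table in /proc/net/arp layout on stdin and
# print "IP device" for every entry whose Flags column is 0x0 (unresolved)
incomplete_arp() {
    awk 'NR > 1 && $3 == "0x0" { print $1, $NF }'
}

# usage on a live system (assumes /proc/net/arp is readable):
#   incomplete_arp < /proc/net/arp
```

An address that keeps showing up here, or whose MAC keeps changing between runs, is a good candidate for a closer look.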
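 95 | ### Layer 3: IP
 96 | 
 97 | In the four-layer TCP/IP model, troubleshooting usually starts from the lowest layer and moves up. For example, when a home network cannot reach the Internet:
 98 | - ping the gateway IP to confirm that ARP works and that there is no loop, broadcast storm, or high latency
 99 | - ping a public IP such as `223.5.5.5` (since 2023, `114.114.114.114` no longer answers ping in many places)
100 | - ping a public domain name to check DNS
101 | - `curl -v` a public http(s) URL
102 | 
103 | Upper layers are carried by lower layers, so you can also go the other way. For the cross-node K8S failures discussed later, work from the application layer downwards:
104 | - curl the metrics endpoint `:9153/metrics` of a coredns Pod that is not on the local node
105 | - ping a `POD_IP` that is not on the local node
106 | 
107 | In production, the most common check is a ping (ICMP, a layer 3 protocol) to see whether an IP is reachable; Linux firewalls and kernel parameters usually allow incoming ping. Also note that Wi-Fi in some public places enables AP isolation, so hosts on the LAN cannot reach each other, only the gateway.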
108 | 
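The bottom-up checklist above can be wrapped in a small script. This is a minimal sketch, not a definitive tool: the helper names `run_step` and `check_uplink` are made up for illustration, the probe targets are the ones mentioned in the list, and it assumes `ping`, `getent` and `curl` are installed.

```bash
# run_step NAME CMD...: run CMD quietly, report the result, return its status
run_step() {
    name=$1; shift
    if "$@" >/dev/null 2>&1; then
        echo "ok   $name"
    else
        echo "FAIL $name"
        return 1
    fi
}

# check_uplink GATEWAY: walk up the stack and stop at the first failing layer
check_uplink() {
    gw=$1
    run_step "gateway $gw" ping -c 2 -W 2 "$gw"       || return 1
    run_step "public IP"   ping -c 2 -W 2 223.5.5.5   || return 1
    run_step "DNS"         getent hosts www.baidu.com || return 1
    run_step "HTTP(S)"     curl -fsS -m 5 -o /dev/null https://www.baidu.com
}
```

Calling `check_uplink 192.168.0.1` prints one line per layer, and the first `FAIL` line tells you where to start digging.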
109 | #### IP
110 | 
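111 | In environments such as a rescue-mode Linux, when you need to copy files from outside and do not want to work out how to configure the NIC properly, you can set an address temporarily with the ip command: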
112 | 
113 | ```bash
114 | # the ip command is the modern replacement and very powerful; the examples below all use it
115 | $ ip addr add 192.168.2.113/24 dev eth0
116 | # ip addr del 192.168.2.113/24 dev eth0
117 | ```
118 | 
119 | Other machines on the same layer 2 segment can now reach this host, and you can scp files from them. Other subcommands of addr:
120 | 
121 | ```bash
122 | # flush all IP addresses on the given NIC
123 | $ ip addr flush dev eth0
124 | ```
125 | 
126 | Show NIC IP addresses:
127 | 
128 | ```bash
129 | # ip a s 
130 | $ ip addr show
131 | $ ip addr show dev eth0
132 | # colored output; options can be abbreviated too
133 | $ ip -color a s 
134 | # JSON output; some older versions of ip do not support -j/-json
135 | $ ip -json a s 
136 | ```
137 | 
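138 | The most common options of the ip command:
139 | - `-h`: print statistics in human-readable form with suffixes, e.g. converting units to M, G and so on
140 | - `-s`: print more information; giving the option twice or more increases the amount. The extra information is usually statistics or time values
141 | - `-d`: print detailed information
142 | - `-4`: filter by IPv4 as the network-layer protocol
143 | - `-6`: filter by IPv6 as the network-layer protocol
144 | 
145 | `ip subCmd1 subCmd2 help` prints the help text.
146 | 
147 | Also note that a NIC is not limited to a single IP. Multiple IPs on one NIC are perfectly normal and were common long before containers. The IP on a NIC does not have to be a private address from the class A/B/C ranges either; some IDCs configure public IPs directly on the NIC.
148 | 
149 | #### Routing
150 | 
151 | Beyond addresses, also check the routing table on the machine: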
152 | 
153 | ```bash
154 | # many ip subcommands can be abbreviated
155 | $ ip route show
156 | $ ip r s 
157 | ```
158 | 
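159 | To confirm which NIC a destination IP leaves through and which source IP is used, run `ip route get`; it does the IP/MASK matching for you. For example, to check the default route: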
160 | 
161 | ```bash
162 | # equivalent to ip route get 1.0.0.0
163 | $ ip route get 1
164 | # shows the next-hop IP; the packet leaves via eth0 with source IP 192.168.2.112
165 | 1.0.0.0 via 192.168.2.1 dev eth0 src 192.168.2.112 uid 0 
166 |     cache 
167 | ```
168 | 
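When this check goes into a script, the two interesting fields can be pulled out of the output. A small sketch, assuming the `ip route get` output format shown above; `route_dev_src` is a hypothetical helper name.

```bash
# route_dev_src: read `ip route get` output on stdin and print "DEV SRC"
route_dev_src() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "dev") dev = $(i + 1)
            if ($i == "src") src = $(i + 1)
        }
    } END { print dev, src }'
}

# usage: ip route get 1 | route_dev_src
```

Scanning for the `dev` and `src` keywords instead of fixed column positions keeps the helper working for on-link routes, which have no `via` field.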
169 | When there is no route to the IP you specify, an error is reported:
170 | 
171 | ```bash
172 | $ ip r g 1
173 | RTNETLINK answers: Network unreachable
174 | ```
175 | 
176 | This is usually caused by a missing default route, which you can add (or delete):
177 | 
178 | ```bash
179 | $ ip route add default via 192.168.2.1 dev eth0
180 | $ ip route del default via 192.168.2.1 dev eth0
181 | ```
182 | 
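183 | Also note that multiple NICs usually go together with multiple routes. For example, a server with eight NICs may use six cables to form three bonds for:
184 | 
185 | - Storage network: workloads use ceph rbd; 10G optical ports, dedicated network and switches
186 | - Business network: business traffic and the default route; 10G optical ports, dedicated switches
187 | - Management network: ssh and management; 1G copper ports, dedicated switches
188 | 
189 | Routing tables other than main are not discussed here; see `cat /etc/iproute2/rt_tables` and `ip rule` for details.
190 | 
191 | Other `ip route` examples: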
192 | 
193 | ```bash
194 | # send traffic whose destination IP falls in <net/mask> to <ip>
195 | $ ip route add <net/mask> via <ip>
196 | ```
197 | 
198 | #### NICs
199 | 
200 | The addr subcommand works on NIC addresses, while the link subcommand works on the link layer and NIC attributes:
201 | 
202 | ```bash
203 | $ ip link show
204 | ```
205 | 
206 | ##### Adding interfaces
207 | 
208 | Add interfaces of various types:
209 | 
210 | ```bash
211 | # add a dummy interface named kube-svc0
212 | $ ip link add kube-svc0 type dummy
213 | # add a bridge
214 | $ ip link add xxx type bridge
215 | $ ip link add xxx type veth ...
216 | $ ip link add xxx type vxlan ...
217 | $ ip link add xxx type vlan ...
218 | $ ip link add xxx type gre ...
219 | ...
220 | ```
221 | 
222 | To view attributes, for example those of flannel.1:
223 | 
224 | ```bash
225 | $ ip -d link show flannel.1
226 | 150: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default 
227 |     link/ether 3a:b4:bd:ad:8b:8e brd ff:ff:ff:ff:ff:ff promiscuity 0 
228 |     vxlan id 1 local 10.xxx.xx.xxx dev ens192 srcport 0 0 dstport 8472 nolearning ageing 300 noudpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
229 | # shows that the vxlan interface has id 1, which NIC it sends through, and that the vxlan VTEP uses port 8472
230 | ```
231 | 
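The `mtu 1450` on flannel.1 follows from the VXLAN encapsulation overhead on a 1500-byte underlay. A quick sketch of the arithmetic, assuming an IPv4 underlay without extra options; `vxlan_inner_mtu` is an illustrative name.

```bash
# vxlan_inner_mtu UNDERLAY_MTU: MTU left for the VXLAN interface over an IPv4 underlay
vxlan_inner_mtu() {
    underlay=$1
    # outer IPv4 header 20 + UDP 8 + VXLAN 8 + inner Ethernet header 14 = 50 bytes
    echo $((underlay - 20 - 8 - 8 - 14))
}

# usage: vxlan_inner_mtu 1500
```

If the underlay NIC has a smaller MTU than expected (some cloud networks do), the overlay MTU has to shrink with it, otherwise large packets are dropped or fragmented.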
232 | Bring a NIC up/down and set attributes:
233 | 
234 | ```bash
235 | # link has both show and set, so s cannot be used as an abbreviation
236 | $ ip link set xxx down/up
237 | # set promiscuous mode on eth0
238 | $ ip link set eth0 promisc on/off
239 | # set the transmit queue length of the NIC
240 | $ ip link set eth0 txqueuelen 1200
241 | # set the NIC's maximum transmission unit (MTU)
242 | $ ip link set eth0 mtu 1450
243 | # change the NIC's MAC address
244 | $ ip link set eth1 address 00:0c:29:f3:33:77
245 | # attach eth0 to the bridge br0
246 | $ ip link set eth0 master br0
247 | ...
248 | ```
249 | 
250 | View packet statistics for a NIC:
251 | 
252 | ```bash
253 | $ ip -h -s link show ens192
254 | 2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
255 |     link/ether 00:50:56:99:xx:xx brd ff:ff:ff:ff:ff:ff
256 |     RX: bytes  packets  errors  dropped overrun mcast   
257 |     55.2G      108M     0       1.71M   0       2.37M   
258 |     TX: bytes  packets  errors  dropped carrier collsns 
259 |     3.63G      7.17M    0       0       0       0 
260 | 
261 | $ ip -s link show flannel.1
262 | 150: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default 
263 |     link/ether 3a:b4:bd:ad:8b:8e brd ff:ff:ff:ff:ff:ff
264 |     RX: bytes  packets  errors  dropped overrun mcast   
265 |     0          0        0       0       0       0       
266 |     TX: bytes  packets  errors  dropped carrier collsns 
267 |     xxx        xxxxx    0       8       0       0
268 | ```
269 | 
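270 | In the example above, the RX counters on flannel.1 stayed at zero. The root cause turned out to be that the customer's VPC network did not allow UDP port 8472, which broke cross-node communication in K8S.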
271 | 
272 | 
273 | ### Ports
274 | 
275 | Leaving aside NAT rules in the nat table, use ss to check listening ports. Many newer distributions no longer ship netstat, and ss is more powerful:
276 | 
277 | ```bash
278 | # many ss options are the same as netstat's
279 | # -n do not resolve names; show numeric IPs and ports
280 | # -l, --listening show listening sockets
281 | # -p show process information (pid and process name)
282 | # -t is tcp, -u is udp
283 | $ ss -nlpt
284 | 
285 | # ss also supports filter expressions
286 | $ ss -nlpt 'sport = 80'
287 | # show summary statistics
288 | $ ss -s
289 | ```
290 | 
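291 | Besides the port, pay attention to the IP address being listened on. The usual cases are listed below; a program can of course also mix them:
292 | 
293 | - `127.0.0.1`: not reachable from outside; serves the local machine only
294 | - `the IP of a single NIC`: only traffic addressed to that IP is accepted
295 | - `0.0.0.0`: listens on the IPs of all NICs
296 | 
297 | Changing kernel parameters to bind a non-local IP is not discussed here. One more thing to note is how a golang listener shows up in netstat: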
298 | 
299 | ```golang
300 | package main
301 | import (
302 | 	"io"
303 | 	"log"
304 | 	"net"
305 | )
306 | func main() {
307 | 	l, err := net.Listen("tcp", ":2000")
308 | 	if err != nil {
309 | 		log.Fatal(err)
310 | 	}
311 | 	defer l.Close()
312 | 	for {
313 | 		// Wait for a connection.
314 | 		conn, err := l.Accept()
315 | 		if err != nil {
316 | 			log.Fatal(err)
317 | 		}
318 | 		// Handle the connection in a new goroutine.
319 | 		// The loop then returns to accepting, so that
320 | 		// multiple connections may be served concurrently.
321 | 		go func(c net.Conn) {
322 | 			// Echo all incoming data.
323 | 			io.Copy(c, c)
324 | 			// Shut down the connection.
325 | 			c.Close()
326 | 		}(conn)
327 | 	}
328 | }
329 | ```
330 | 
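331 | After running the code above, netstat shows the socket as bound to IPv6 only, but it actually accepts both IPv4 and IPv6: with no address given, golang opens a dual-stack IPv6 socket, which netstat lists as `tcp6`.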
332 | 
333 | ```bash
334 | $ netstat -nlpt | grep :2000
335 | tcp6       0      0 :::2000                 :::*                    LISTEN      98822/main
336 | ```
337 | 
338 | #### tcp
339 | 
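340 | Ping is not always the right check, especially when ping is blocked. To test whether a remote port is reachable, the most common tool is telnet: once the TCP handshake completes, it keeps an ssh-like virtual terminal open: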
341 | 
342 | ```bash
343 | # press `ctrl + ]` to leave the interactive session
344 | $ telnet 223.5.5.5 53
345 | Trying 223.5.5.5...
346 | Connected to 223.5.5.5.
347 | Escape character is '^]'. <--- this line means the TCP three-way handshake succeeded
348 | ```
349 | 
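350 | On some systems, however, telnet does not print the `^]` line above and appears to wait for input as long as the connection is not refused. In that case try [tcping](http://linuxco.de/tcping/tcping.html), which opens a TCP connection and reports the handshake time. To probe from code, use the socket library of your language.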
351 | 
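When neither telnet nor tcping is installed, bash itself can open a TCP connection. A minimal sketch, assuming `bash` and the coreutils `timeout` command are available; `tcp_probe` is an illustrative name, not an existing tool.

```bash
# tcp_probe HOST PORT [TIMEOUT]: succeed only if a TCP connection can be opened
tcp_probe() {
    host=$1; port=$2; secs=${3:-3}
    # bash treats /dev/tcp/HOST/PORT in a redirection as "open a TCP connection"
    timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# usage: tcp_probe 223.5.5.5 53 && echo reachable || echo unreachable
```

The exit status is the whole result: 0 means the handshake completed, anything else means refused, filtered or timed out.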
352 | #### udp
353 | 
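354 | Because udp is connectionless, testing a udp port requires logging in to both machines. The usual goal is to find out whether a device in between is blocking the traffic, and nc can do that: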
355 | 
356 | ```bash
357 | # machine 1 starts the server
358 | $ nc -l -u 8472
359 | # machine 2 acts as the client and sends to the server
360 | $ nc -u <dest_public_ip> 8472
361 | ```
362 | 
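363 | After you type a string and press Enter in the nc session on machine 2, it shows up in the nc window of the server. If you suspect udp packet loss or low bandwidth, stress-test between the two machines with `iperf3`.
364 | 
365 | ## Links
366 | 
367 | - [OSI network model](01.01.md)
368 | - Next: [Common application-layer troubleshooting commands](01.03.md)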
369 | 


--------------------------------------------------------------------------------
/eBook/01.03.md:
--------------------------------------------------------------------------------
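  1 | # Common application-layer troubleshooting commands
  2 | 
  3 | This part introduces troubleshooting commands for the application layer. Where a Linux command is missing on Windows, consider installing git bash there.
  4 | 
  5 | ## Application-layer examples
  6 | 
  7 | Moving further up the stack, TCP and UDP exist to carry data, that is, bytes. How those bytes are parsed is up to the application-layer program itself (its protocol). Think of a telephone: it only carries sound and does not care which language you speak. The job of TCP/UDP is just to deliver the bytes.
  8 | 
  9 | ### dns
 10 | 
 11 | DNS configuration on Linux lives in `/etc/resolv.conf`, most notably the `nameserver` entries. All configurable items are listed in the [man page](https://man7.org/linux/man-pages/man5/resolv.conf.5.html); the common ones are:
 12 | 
 13 | - nameserver: do not configure more than 3, see the article [CentOS中resolv.conf的配置实验](https://linuxgeeks.github.io/2016/04/11/110119-CentOS%E4%B8%ADresolv.conf%E7%9A%84%E9%85%8D%E7%BD%AE%E5%AE%9E%E9%AA%8C/)
 14 | - search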
 15 | - ndots (set through `options ndots:n`)
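 16 | - some other common options
 17 | 
 18 | The common DNS lookup commands are explained below:
 19 | 
 20 | #### nslookup
 21 | 
 22 | nslookup has an interactive and a non-interactive mode. Running it without any options or arguments enters interactive mode, which is slightly more capable: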
 23 | 
 24 | ```
 25 | $ nslookup www.baidu.com
 26 | Server:		127.0.0.53
 27 | Address:	127.0.0.53#53
 28 | 
 29 | Non-authoritative answer:
 30 | www.baidu.com	canonical name = www.a.shifen.com.
 31 | Name:	www.a.shifen.com
 32 | Address: 183.2.172.42
 33 | Name:	www.a.shifen.com
 34 | Address: 183.2.172.185
 35 | ```
 36 | 
 37 | The DNS server IP here is `127.0.0.53` because some Linux distributions run `systemd-resolved` as a local DNS server; the upstream DNS is configured and inspected through its command-line tool:
 38 | 
 39 | ```
 40 | $ cat /etc/resolv.conf
 41 | nameserver 127.0.0.53
 42 | options edns0 trust-ad
 43 | # inspect with the resolvectl command
 44 | $ resolvectl -h
 45 | $ resolvectl  --no-pager status
 46 | $ resolvectl  --no-pager statistics
 47 | ```
 48 | 
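 49 | Some people like to check in an initContainer that the service names they depend on resolve. Early official K8S documentation recommended the nslookup from `busybox:1.28`, and many people later found that resolution stopped working after upgrading busybox: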
 50 | 
 51 | ```bash
 52 | # busybox:1.28
 53 | $ nslookup kubernetes.default
 54 | Server:    10.96.0.10
 55 | Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
 56 | 
 57 | Name:      kubernetes.default
 58 | Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
 59 | 
 60 | # busybox:1.31.1
 61 | $ nslookup kubernetes.default
 62 | Server:         10.96.0.10
 63 | Address:        10.96.0.10:53
 64 | 
 65 | ** server can't find kubernetes.default: NXDOMAIN
 66 | 
 67 | *** Can't find kubernetes.default: No answer
 68 | 
 69 | command terminated with exit code 1
 70 | ```
 71 | 
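 72 | For this problem see issue [docker-library/busybox#48](https://github.com/docker-library/busybox/issues/48). nslookup is therefore generally not recommended for DNS troubleshooting. If all you need is a simple lookup, find an image with glibc and use the following trick: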
 73 | 
 74 | ```
 75 | # https://github.com/bitnami/containers/blob/main/bitnami/git/2/debian-12/prebuildfs/opt/bitnami/scripts/libnet.sh#L26
 76 | $ getent ahostsv4 kubernetes.default | awk '/STREAM/ {print $1;exit; }'
 77 | 10.96.0.1
 78 | $ getent ahostsv4 kube-dns.kube-system | awk '/STREAM/ {print $1;exit; }'
 79 | 10.96.0.10
 80 | ```
 81 | 
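For the initContainer pre-check mentioned above, the `getent` lookup can be wrapped in a retry loop. This is only a sketch: `retry` is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary assumptions.

```bash
# retry TRIES DELAY CMD...: run CMD until it succeeds or TRIES attempts are used up
retry() {
    tries=$1; delay=$2; shift 2
    n=1
    while ! "$@" >/dev/null 2>&1; do
        if [ "$n" -ge "$tries" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# usage: wait up to about a minute for the service name to resolve
#   retry 30 2 getent hosts kubernetes.default
```

Because `getent` goes through glibc, it honours the `search` and `options` lines in the Pod's resolv.conf the same way most applications do.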
 82 | #### dig
 83 | 
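 84 | For DNS troubleshooting, prefer the more powerful `dig @server name type`. Many programs resolve names through the library provided by glibc; to rule out such interfering factors, dig lets you control the resolution logic entirely from the command line: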
 85 | 
 86 | ```
 87 | # resolve with the default DNS server
 88 | $ dig www.baidu.com
 89 | 
 90 | ; <<>> DiG 9.16.48-Ubuntu <<>> www.baidu.com
 91 | ;; global options: +cmd
 92 | ;; Got answer:
 93 | ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 21504
 94 | ;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
 95 | 
 96 | ;; OPT PSEUDOSECTION:
 97 | ; EDNS: version: 0, flags:; udp: 65494
 98 | ;; QUESTION SECTION:
 99 | ;www.baidu.com.			IN	A
100 | 
101 | ;; ANSWER SECTION:
102 | www.baidu.com.		60	IN	CNAME	www.a.shifen.com.
103 | www.a.shifen.com.	59	IN	A	183.2.172.42
104 | www.a.shifen.com.	59	IN	A	183.2.172.185
105 | 
106 | ;; Query time: 0 msec
107 | ;; SERVER: 127.0.0.53#53(127.0.0.53)
108 | ;; WHEN: Mon Jul 15 14:53:38 CST 2024
109 | ;; MSG SIZE  rcvd: 101
110 | 
111 | # short output; the default query type is the A record
112 | $ dig +short baidu.com
113 | 39.156.66.10
114 | 110.242.68.66
115 | 
116 | # DNS lookup over TCP
117 | $ dig +short +tcp baidu.com
118 | # query a specific DNS server
119 | $ dig @223.5.5.5 +short baidu.com
120 | 
121 | # query on a specific port
122 | $ dig @223.5.5.5 -p 53 baidu.com
123 | 
124 | # inside a Pod container, apply the search list from resolv.conf; dig defaults to +nosearch
125 | $ dig +short kubernetes.default +search
126 | ```
127 | 
128 | For example, if the host has dig, test whether coredns can resolve:
129 | 
130 | ```bash
131 | $ dig @10.96.0.10 +short kubernetes.default.svc.cluster.local
132 | # or bypass the svc and query a coredns POD_IP directly; if that still fails, the problem is in the CNI or cross-node networking
133 | $ dig @<POD_IP> +short kubernetes.default.svc.cluster.local
134 | ```
135 | 
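136 | Only these basic uses of dig are covered here. Also, do not run your own DNS server on a public-cloud ECS instance: that requires the relevant registration and qualifications. Self-hosting inside your own home is fine, but many people run one on ECS for their home network and phones, which does not comply with the regulations.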
137 | 
138 | ### ssh
139 | 
140 | How to troubleshoot a hanging ssh:
141 | 
142 | ```bash
143 | # ssh has a -v option that prints details; the more v's, the more verbose
144 | $ ssh -vvvv xxx@xxxx
145 | ```
146 | 
147 | Apart from PAM modules you may have added, the cause may be the sshd configuration. You can trace the sshd process with strace:
148 | 
149 | ```bash
150 | # -r: print the elapsed time between system call starts (relative timestamp at the start of each line)
151 | # -T: print the time spent in each syscall (the rightmost field, e.g. <0.000364>)
152 | # -f: follow and trace child processes too
153 | # -p: attach to the given PID and trace it
154 | $ strace -r -T -f -p [sshd进程号]
155 | ```
156 | 
157 | ### http
158 | 
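159 | As mentioned earlier, TCP just carries a stream of bytes. Suppose we design our own application protocol:
160 | 
161 | - the first byte (8 bits) defines how the following bytes are decoded, e.g. 0 for a raw stream such as a file, 1 for text, 2 for a picture...
162 | 
163 | Then we write a chat program on top of it. At the bottom it switches on the first byte: for 1 it decodes the remaining bytes as characters and shows them in the conversation, for 2 it shows a picture.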
164 | 
165 | ```
166 |  type := data[0]        // the first byte selects the decoder
167 |  rec_data = data[1:]    // everything after the type byte is the payload
168 |  case uint8(type):
169 |   0:
170 |     file.WriteByte(rec_data)
171 |   1:
172 |     message.WriteConsole(rec_data)
173 |   2:
174 |     pic.Display(rec_data)
175 | ```
176 | 
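177 | With that you have defined an application-layer protocol, and you will soon find it hard to extend. The http protocol is broadly similar, just far less crude. Its message format has three parts: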
178 | 
179 | ```bash
180 | POST /submit-form HTTP/1.1 # request line: the method (GET, POST, PUT, DELETE...), the URI path, the HTTP version
181 | Host: www.example.com # headers; backends usually treat header names case-insensitively
182 | User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
183 | Content-Type: application/x-www-form-urlencoded
184 | Content-Length: 27
185 | 
186 | username=user&password=pass # body; its format matches the Content-Type header, e.g. text/html, image/jpeg...
187 | ```
188 | 
189 | telnet simply sends strings once the TCP connection is established, so it can also simulate a simple GET request:
190 | 
191 | ```bash
192 | $ telnet www.baidu.com 80
193 | Trying 183.2.172.185...
194 | Connected to www.a.shifen.com.
195 | Escape character is '^]'.<---- type the line below and press Enter twice; telnet sends each line on \r\n, and the empty line ends the request
196 | GET / HTTP/1.1
197 | 
198 | .... <-- the http response from Baidu
199 | ```
200 | 
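The same raw request can be generated instead of typed, which makes the three-part format explicit. A sketch only: `http_get_request` is an illustrative name, and the `Host` and `Connection: close` headers are additions of this example rather than part of the telnet session above.

```bash
# http_get_request HOST [PATH]: print a minimal HTTP/1.1 GET request
http_get_request() {
    host=$1; path=${2:-/}
    # every line ends with \r\n; the final empty line terminates the headers
    printf 'GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n' "$path" "$host"
}

# usage (nc variants differ in when they close the connection):
#   http_get_request www.baidu.com / | nc www.baidu.com 80
```

HTTP/1.1 requires the `Host` header, and many servers answer 400 without it, so the sketch always sends one.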
201 | #### curl
202 | 
203 | For more complex http troubleshooting, curl is the usual tool for sending or simulating http requests. curl supports other protocols besides http, which are not discussed here:
204 | 
205 | ```
206 | $ curl -V
207 | curl 7.68.0 (x86_64-pc-linux-gnu) libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3
208 | Release-Date: 2020-01-08
209 | Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtmp rtsp scp sftp smb smbs smtp smtps telnet tftp
210 | # ↑ supported protocols
211 | Features: AsynchDNS brotli GSS-API HTTP2 HTTPS-proxy IDN IPv6 Kerberos Largefile libz NTLM NTLM_WB PSL SPNEGO SSL TLS-SRP UnixSockets
212 | ```
213 | 
214 | Some common basic http usage:
215 | 
216 | Show only the response headers, i.e. send a HEAD request:
217 | 
218 | ```bash
219 | $ curl -I http://httpbin.org
220 | ```
221 | 
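222 | The `-v` option prints details, including the IP the domain resolved to, the https certificate, and the request headers that were sent. The endpoint `http://httpbin.org/get` echoes the client's headers back: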
223 | 
224 | ```bash
225 | $ curl -v http://httpbin.org/get
226 | *   Trying 35.173.225.247:80...
227 | * TCP_NODELAY set
228 | * Connected to httpbin.org (35.173.225.247) port 80 (#0)
229 | > GET /get HTTP/1.1
230 | > Host: httpbin.org
231 | > User-Agent: curl/7.68.0
232 | > Accept: */*
233 | > 
234 | * Mark bundle as not supporting multiuse
235 | < HTTP/1.1 200 OK
236 | < Date: Mon, 15 Jul 2024 08:05:33 GMT
237 | < Content-Type: application/json
238 | < Content-Length: 254
239 | < Connection: keep-alive
240 | < Server: gunicorn/19.9.0
241 | < Access-Control-Allow-Origin: *
242 | < Access-Control-Allow-Credentials: true
243 | < 
244 | {
245 |   "args": {}, 
246 |   "headers": {
247 |     "Accept": "*/*", 
248 |     "Host": "httpbin.org", 
249 |     "User-Agent": "curl/7.68.0", 
250 |     "X-Amzn-Trace-Id": "Root=1-6694d84d-29cfb5b87bcaeb267f3b918f"
251 |   }, 
252 |   "origin": "178.xx.xx.xx", 
253 |   "url": "http://httpbin.org/get"
254 | }
255 | ```
256 | 
257 | `-H 'key: value'` sets a header, for example:
258 | 
259 | ```
260 | $ curl  http://httpbin.org/get -H 'test: 11111111111111'
261 | {
262 |   "args": {}, 
263 |   "headers": {
264 |     "Accept": "*/*", 
265 |     "Host": "httpbin.org", 
266 |     "Test": "11111111111111", 
267 |     "User-Agent": "curl/7.68.0", 
268 |     "X-Amzn-Trace-Id": "Root=1-6694d921-0449f11d77d437e71b7113fc"
269 |   }, 
270 |   "origin": "178.236.xx.xx", 
271 |   "url": "http://httpbin.org/get"
272 | }
273 | ```
274 | 
275 | `-X method` sets the request method, for example to send a POST request:
276 | 
277 | ```bash
278 | $ curl -XPOST  http://httpbin.org/post?a=123 --data-raw '{"key":"value"}'
279 | {
280 |   "args": {
281 |     "a": "123"
282 |   }, 
283 |   "data": "", 
284 |   "files": {}, 
285 |   "form": {
286 |     "{\"key\":\"value\"}": ""
287 |   }, 
288 |   "headers": {
289 |     "Accept": "*/*", 
290 |     "Content-Length": "15", 
291 |     "Content-Type": "application/x-www-form-urlencoded", 
292 |     "Host": "httpbin.org", 
293 |     "User-Agent": "curl/7.68.0", 
294 |     "X-Amzn-Trace-Id": "Root=1-6694d982-180db3df277a1ce879676d16"
295 |   }, 
296 |   "json": null, 
297 |   "origin": "178.236.xx.xx", 
298 |   "url": "http://httpbin.org/post?a=123"
299 | }
300 | ```
301 | 
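302 | Some other common options:
303 | 
304 | - `-o file`: save the body to a file; used when downloading files
305 | - `-k, --insecure`: allow insecure connections, typically https with a certificate not signed by a trusted CA, e.g. the kubernetes API: `curl -k https://10.96.0.1`
306 | - `-L, --location`: follow redirects
307 | 
308 | For requests triggered by the browser, you can copy them as a curl command from the Network tab of the F12 developer tools, as shown below: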
309 | 
310 | ![browser-cap-to-curl](../images/browser-cap-to-curl.png)
311 | 
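312 | You can then change some parameters and send the request with curl, for example to debug backend logic. Note that some backends have replay protection: once an http request has been sent, sending it again has no effect.
313 | 
314 | The `--resolve` option lets you pin the IP of a domain without editing hosts: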
315 | 
316 | ```
317 | # curl --resolve <host:port:address> <URL>
318 | $ curl --resolve example.com:80:127.0.0.1 http://example.com
319 | ```
320 | 
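321 | One thing to watch out for: backends sometimes branch on `User-Agent`, so a browser gets an html page while curl or an http library gets JSON API output. The UA field is not the only thing they may check.
322 | 
323 | Some websites also use browser features for human verification to keep crawlers out, which curl cannot pass.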
324 | 
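325 | #### A troubleshooting case
326 | 
327 | A troubleshooting case: a member of a chat group deployed jenkins in K8S with the Tsinghua https mirror as the plugin source, and installing plugins failed with: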
328 | 
329 | ```
330 | unable to find valid certification path to requested target
331 | ```
332 | 
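333 | This kind of error is usually certificate-related, typically a certificate not signed by a trusted CA. I was busy at the time and could not look into it, but the chat and follow-up conversations revealed some odd facts about his environment:
334 | - the source was the Tsinghua https mirror, which should not present an untrusted https certificate
335 | - there was no proxy or transparent proxy on the machine, the gateway, or the host
336 | - the jenkins image was the official one
337 | - he later reported that the pod logs showed similar errors for other domains too, e.g. baidu.com
338 | - the K8S hosts were virtual machines created by importing a downloaded qcow2 file; a K8S cluster built later on systems installed from an ISO worked fine
339 | 
340 | This was strange enough that I asked him to reproduce the environment and let me log in remotely. Inside the ingress nginx container: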
341 | 
342 | ```bash
343 | $ curl -v https://www.baidu.com
344 | *   Trying 173.255.194.134:443...
345 | * Connected to www.baidu.com (173.255.194.134) port 443 (#0)
346 | * ALPN, offering h2,http/1.1
347 | * TLSv1.3 (OUT), TLS handshake, Client hello (1):
348 | *   CAfile: /etc/ssl/certs/ca-certificates.crt
349 | *   CApath: /etc/ssl/certs
350 | * TLSv1.3 (IN), TLS handshake, Server hello (2):
351 | * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
352 | * TLSv1.3 (IN), TLS handshake, Certificate (11):
353 | * TLSv1.2 (OUT), TLS alert, certificate expired (557)
354 | * SSL certificate problem: certificate has expired
355 | * Closing connection 0
356 | * curl: (60) SSL certificate problem: certificate has expired
357 | More details here: https://curl.se/docs/sslcerts.html
358 | 
359 | curl failed to verify the legitimacy of the server and therefore could not establish a secure connection to it. To learn more about this situation and how to fix it, please visit the web page mentioned above.
360 | ```
361 | 
362 | The curl output reports an expired https certificate. The `curl -v` in this container was not detailed enough, so I ran `curl -v` on the K8S node, where it worked normally:
363 | 
364 | ```bash
365 | $ curl -v https://www.baidu.com
366 | * About to connect() to www.baidu.com port 443 (#0)
367 | *  Trying 180.101.50.242...
368 | * Connected to www.baidu.com (180.101.50.242) port 443 (#0)
369 | ...
370 | ```
371 | 
372 | Looking closely, though, the resolved IP differs from the one inside the container. Something was off, so on the K8S node I tried:
373 | 
374 | ```bash
375 | $ curl -v --resolve www.baidu.com:443:173.255.194.134 https://www.baidu.com
376 | * Added www.baidu.com:443:173.255.194.134 to DNS cache
377 | * About to connect() to www.baidu.com port 443 (#0)
378 | *  Trying 173.255.194.134...
379 | * Connected to www.baidu.com (173.255.194.134) port 443 (#0)
380 | ....
381 | * Server certificate:
382 | *       subject: CN=*.mytrafficmanagement.com
383 | *       start date: Aug 10 23:39:33 2023 GMT
384 | *       expire date: Nov 08 23:39:32 2023 GMT
385 | ...
386 | ```
387 | 
388 | So it was a DNS resolution problem, which turned out to be caused by search:
389 | 
390 | ```bash
391 | $ cat /etc/resolv.conf
392 | nameserver 192.168.122.1
393 | nameserver 119.29.29.29
394 | search com <-------- here
395 | $ dig @192.168.122.1 +short www.baidu.com
396 | www.a.shifen.com.
397 | 180.101.50.242
398 | 180.101.50.188
399 | $ dig @119.29.29.29 +short www.baidu.com.com <---- with the search domain appended
400 | 45.79.19.196
401 | 173.255.194.134 <---- this is the IP the container resolved
402 | ....
403 | ```
404 | 
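405 | After he removed the search line on the K8S nodes and recreated the containers, everything worked. When a pod starts with the default DNSPolicy, the host's search and options are appended to the resolv.conf of the pod's containers.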
406 | 
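Why the container asked for `www.baidu.com.com` becomes clearer when the lookup order is spelled out. The sketch below mimics the glibc rule: a name with fewer dots than `ndots` is tried with each search domain first and as-is last. The function name `search_candidates` is made up, and the `ndots:5` used in the usage line is the value Kubernetes writes for ClusterFirst Pods, stated here as background rather than taken from the case.

```bash
# search_candidates NAME NDOTS DOMAIN...: print the names a glibc-style
# resolver would try, in order
search_candidates() {
    name=$1; ndots=$2; shift 2
    case $name in
        *.) echo "$name"; return 0 ;;   # trailing dot: already absolute
    esac
    dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
    dots=$((dots + 0))                  # normalise wc output to a plain number
    if [ "$dots" -ge "$ndots" ]; then echo "$name."; fi
    for d in "$@"; do echo "$name.$d."; done
    if [ "$dots" -lt "$ndots" ]; then echo "$name."; fi
}

# usage: search_candidates www.baidu.com 5 default.svc.cluster.local svc.cluster.local cluster.local com
```

With `com` in the search list, `www.baidu.com.com.` is tried before the real name, and since that name resolves, the lookup stops there.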
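407 | #### Load balancing
408 | 
409 | Load balancing is an umbrella term, and it can be applied at many levels.
410 | 
411 | ##### Reverse proxy
412 | 
413 | A real http service does not run on a single machine but usually on several, with a high-performance reverse proxy in front: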
414 | 
415 | ```
416 |                                             +-------+
417 |                                             |       |
418 |                                -+---------->+  rs   |
419 |                               /             +-------+
420 |                              /    
421 |                             /     
422 |                            /                +-------+
423 | +--------+                /                 |       |
424 | | client +----------> proxy/VIP/LB -------->+  rs   |
425 | +--------+                \                 +-------+
426 |                            \    
427 |                             \     
428 |                              \              +-------+
429 |                               ------------->+       |
430 |                                             |  rs   |
431 |                                             +-------+
432 | 
433 | ```
434 | 
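435 | nginx is a common choice because it is an open-source, high-performance layer 7 HTTP reverse proxy. One of its strengths is that it is written in C and has been optimized by many top experts in the open-source community, so it performs very well when proxying requests.
436 | 
437 | Compared with handling requests directly in a web framework, nginx usually delivers better performance, especially for high concurrency, request handling and load balancing. Many architectures therefore put nginx in front as the reverse proxy to improve the performance and stability of the whole system.
438 | 
439 | By analogy, you can trim the Linux kernel, customize the hardware, and sell the result as a dedicated hardware load balancer, such as F5. Such a product certainly has a user-friendly web UI for configuration, i.e. a management function, and management traffic is kept separate from business traffic (e.g. two network ports or bonds). Multiple NICs with multiple routes are in fact very common.
440 | 
441 | For load-balancing algorithms, here is a diagram borrowed from someone else:
442 | 
443 | ![LoadbalancingAlgorithms](../images/Load-Balancing-Algorithms.gif)
444 | 
445 | Implementations in software and hardware include:
446 | 
447 | - cloud SLB and similar services
448 | - hardware LB
449 | - software load balancers: nginx, haproxy, keepalived + haproxy, Envoy....
450 | - kernel features such as lvs
451 | - OSPF on network devices combined with lvs on servers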
452 | ...
453 | 
454 | For example, layer 4 and the application layer used together:
455 | 
456 | ```
457 | L1/L2 are layer 7 proxies such as nginx or Envoy
458 |                            http proxy            +-------+
459 |                                                  |       |
460 |                            +---------+---------->+  rs   |
461 |                            | +-----+ |           +-------+
462 |                            | |     | |
463 |                            | |  L1 | |
464 |                            | +-----+ |           +-------+
465 | +--------+   VIP or SLB    |         |           |       |
466 | | client +---------------->+         +---------->+  rs   |
467 | +--------+                 |         |           +-------+
468 |                            | +-----+ |
469 |                            | |     | |
470 |                            | | L2  | |           +-------+
471 |                            | +-----+ +---------->+       |
472 |                            +---------+           |  rs   |
473 |                                                  +-------+
474 | ```
475 | 
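476 | ##### Other examples
477 | 
478 | Beyond deploying an application on several machines, load balancing and similar ideas show up in many places:
479 | - server network ports, e.g. two cables bonded and connected to two switches (which are stacked)
480 | - separate management and storage networks to keep them away from business traffic, which again means multiple NICs and multiple routes
481 | - dual power supplies in servers
482 | - DNS load balancing, e.g. Google's 8.8.8.8 uses anycast
483 | - database load balancing: primary/replica, sharding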
484 | ...
485 | 
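486 | ### Packet capture
487 | 
488 | Packet capture tools give a detailed, in-depth view of network communication. They help identify and solve all kinds of network and communication problems, and also help in understanding and improving network performance and security.
489 | 
490 | #### tcpdump
491 | 
492 | Command format: `tcpdump -nn -i <any|interface> [-w file.pcap] filter`:
493 | - `-n` disables reverse DNS resolution; `-nn` also shows ports as numbers
494 | - `-i` selects the interface; any means all interfaces
495 | - `-w` writes to a file; the `.pcap` file can be opened in wireshark
496 | 
497 | Common output options:
498 | - `-e` prints the link-layer header, e.g. source and destination MAC addresses
499 | - `-v`, `-vv`, `-vvv` give increasingly verbose output
500 | - `tcpdump -D` lists the interfaces available for capture, with their numeric index and name; either can be used after `-i`
501 | - `tcpdump -r xx.pcap` reads a capture file
502 | 
503 | Expression primitives can be combined with the logical operators `and, &&, or, ||, not, !`. Parentheses `()` change precedence, but the shell interprets them, so escape them with backslashes as `\(\)` or, where needed, wrap the expression in quotes.
504 | 
505 | Some examples: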
506 | 
507 | ```
508 | # capture icmp ping packets
509 | $ tcpdump -nn -i any icmp
510 | 
511 | # restrict to a host
512 | $ tcpdump -nn -i any icmp and host 223.5.5.5
513 | 
514 | # with -i any, -e does not show the destination MAC address, so name the interface
515 | $ tcpdump -nn -e -i ens160 icmp and host 223.5.5.5
516 | 
517 | # capture packets for 172.27.1.4:9153 inside the flannel.1 vxlan network
518 | $ tcpdump -nn -i flannel.1 host 172.27.1.4 and port 9153 -vv
519 | 
520 | # capture and save
521 | $ tcpdump -nn -i any icmp and host 223.5.5.5 -w icmp.pcap
522 | # read the capture file
523 | $ tcpdump -nn -r icmp.pcap
524 | ```
525 | 
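526 | Note that starting with tcpdump 4.99 the interface name is shown for each packet during capture by default. Versions from package managers tend to be older, so you can [build from source](https://www.tcpdump.org/#latest-releases) or use a newer version from a container image.
527 | 
528 | #### ptcpdump
529 | 
530 | [mozillazg/ptcpdump](https://github.com/mozillazg/ptcpdump) is a tcpdump-like packet capture tool built with eBPF. Besides being compatible with tcpdump's common command-line options and filter syntax, it offers the following extra features:
531 | - records and shows the process, container and Pod that sent the traffic
532 | - can capture for a specific process, container or Pod
533 | - when a saved pcapng file is opened in Wireshark, each packet shows its process, container and Pod information
534 | 
535 | The drawback is that it requires `kernel >= 5.2`.
536 | 
537 | ## Links
538 | 
539 | - [Layers below the application layer](01.02.md)
540 | - Next: [iptables basics](02.01.md)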
541 | 


--------------------------------------------------------------------------------
/eBook/02.01.md:
--------------------------------------------------------------------------------
  1 | # 2.1 iptables
  2 | 
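  3 | ## iptables and containers
  4 | 
  5 | Docker, K8S and CNI plugins (flannel, calico and so on) all use iptables:
  6 | - docker's -p
  7 | - containers on a docker bridge network accessing external networks
  8 | - K8S Service
  9 | - pods accessing other node IPs or external networks
 10 | - pod hostPort
 11 | - NetworkPolicy
 12 | 
 13 | iptables is a very broad topic that could fill a book of its own, so only the basics are covered here.
 14 | 
 15 | Data centers have hardware firewalls, and Linux has its own firewall: the netfilter framework, literally "network filter". netfilter lives in kernel space, and iptables is the command-line tool used to add, delete, modify and list rules on the hook points that netfilter provides.
 16 | 
 17 | ## How to learn iptables
 18 | 
 19 | For an iptables tutorial, first read through [朱双印 iptables 详解系列](https://www.zsythink.net/archives/1199) (Zhu Shuangyin's iptables series), then continue with the material below.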
 20 | 
 21 | ![iptables](../images/iptables.png)
 22 | 
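 23 | Many iptables articles draw a similar diagram. Style and layout vary, but the core is always the five chains (the hook points netfilter provides) and the five tables (which hold the rules of the chains, i.e. what to do with matching packets):
 24 | 
 25 | - raw, mangle, nat, filter, security
 26 | - PREROUTING, FORWARD, INPUT, OUTPUT, POSTROUTING
 27 | 
 28 | People often fail to remember the five chains. Rote memorization does not work; the simplest way is to derive them from how packets move:
 29 | - packets entering the protocol stack and packets sent out by it: INPUT and OUTPUT
 30 | - Linux has a routing table, which implies a point before routing and a point after routing: PREROUTING and POSTROUTING
 31 | - a kernel parameter enables forwarding; once enabled, packets whose destination IP is not the local machine can be forwarded: FORWARD
 32 | 
 33 | The five tables:
 34 | - filter: filters packets, e.g. DROP, REJECT, ACCEPT. It is the default table, so rule operations without `-t` behave like `iptables -t filter`
 35 | - nat: address translation: DNAT (destination address translation, done in PREROUTING), SNAT (source address translation, done in POSTROUTING), and `MASQUERADE`, which does SNAT with the NIC's IP looked up dynamically
 36 | - raw: the table with the highest priority. iptables is stateful; raw is typically used with `-j NOTRACK` to turn off connection tracking for matching packets to improve performance, and to set CT (conntrack table) behavior (`-j CT --helper ftp`)
 37 | - mangle: takes packets apart, modifies them and reassembles them, e.g. TCP MSS (for MTU issues), TTL, TOS; it also sets MARK for policy routing and is used for tproxy transparent proxying
 38 | - security: a table added later, used for SELinux-related security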
 39 | 
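 40 | ## Common iptables concepts and usage
 41 | 
 42 | This section exists because many of my colleagues read the tutorials without following along hands-on; it went in one ear and out the other, and they still had no idea where to start when handling related failures.
 43 | 
 44 | ### Common iptables knowledge
 45 | 
 46 | #### Some basics
 47 | 
 48 | - Stopping firewalld, ufw and the like does not mean iptables rules stop taking effect. firewalld and ufw are just higher-level, friendlier and richer management tools that spare you the low-level details. Their default policy is to reject, so many people turn them off right away. One caveat: `systemctl stop firewalld` flushes the iptables rules, so stopping firewalld while Docker is running makes later `docker run -p` fail with errors such as "No chain".
 49 | - iptables takes the file lock `/run/xtables.lock` when adding, deleting, modifying or listing rules; use the `-w/--wait` option to avoid lock contention
 50 | 
 51 | #### Common iptables commands
 52 | 
 53 | Colleagues and WeChat contacts who have read about iptables still do not know where to start when a problem is caused by iptables rules. This subsection introduces common commands that help you locate it.
 54 | 
 55 | Many tutorials teach you to list iptables rules with: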
 56 | 
 57 | ```
 58 | iptables -nvL
 59 | iptables -nvL INPUT
 60 | ```
 61 | 
 62 | That output is hard for iptables beginners to read. Use `iptables -S` instead:
 63 | 
 64 | ```
 65 | $ iptables -S
 66 | -P INPUT ACCEPT
 67 | -P FORWARD DROP #<- some applications set the default policy to DROP on startup, so manually forwarded traffic that matches no rule is dropped at the end
 68 | -P OUTPUT ACCEPT
 69 | -N BASE-RULE
 70 | -A INPUT -p icmp -j ACCEPT
 71 | -A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
 72 | -A INPUT -j BASE-RULE
 73 | -A FORWARD -j ACCEPT
 74 | -A BASE-RULE -i eth0 -m set ! --match-set whiteiplist src -m set --match-set whiteportlist dst -j DROP
 75 | -A BASE-RULE -j RETURN
 76 | ...
 77 | ```
 78 | 
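 79 | This command prints the rules in iptables command-line style. Suppose port 80 on the machine above cannot be reached from outside although ping works. You may suspect the match-set rule, but say you cannot read what it does. There are two approaches.
 80 | 
 81 | Have an outside machine, `192.168.1.2`, access port 80 repeatedly, then:
 82 | 
 83 | 1. Counters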
 84 | ```
 85 | # zero the counters
 86 | iptables --wait --zero INPUT
 87 | iptables --wait --zero BASE-RULE
 88 | # watch the counters grow to see which rule in which chain is matched
 89 | iptables --wait -nvL INPUT
 90 | iptables --wait -nvL BASE-RULE
 91 | ```
 92 | 
 93 | 2. Bisection: insert ACCEPT rules, for example
 94 | 
 95 | ```
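 96 | # allow only the specific source IP and destination port, especially in production
 97 | # insert adds the rule at the top of the chain
 98 | iptables --wait --insert INPUT --source 192.168.1.2/32 -p tcp --dport 80 -j ACCEPT
 99 | # once access works from outside, delete the rule; iptables --wait --delete INPUT 1 also deletes it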
100 | iptables --wait --delete INPUT --source 192.168.1.2/32 -p tcp --dport 80 -j ACCEPT
101 | 
102 | iptables --wait --insert INPUT 2 --source 192.168.1.2/32 -p tcp --dport 80 -j ACCEPT
103 | iptables --wait --delete INPUT 2
104 | 
105 | iptables --wait --insert BASE-RULE --source 192.168.1.2/32 -p tcp --dport 80 -j ACCEPT
106 | iptables --wait --delete BASE-RULE 1
107 | ...
108 | ```
109 | 
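110 | Continuing in the same way, you narrow it down to the unreadable rule in the `BASE-RULE` chain. In production, especially once K8S is deployed, the counters of `iptables -nvL` are hard to watch because of heavy traffic, so bisection is the more common way to find the rough range of rules.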
111 | 
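On a quiet machine the counter approach can be scripted so that only rules with hits are shown. A minimal sketch, assuming the default column layout of `iptables -nvxL CHAIN` (a chain header line, a column header line, then one rule per line starting with the packet count); `hit_rules` is a hypothetical helper name.

```bash
# hit_rules: read `iptables -nvxL CHAIN` output on stdin and print only
# the rules whose packet counter (first column) is non-zero
hit_rules() {
    awk 'NR > 2 && NF > 0 && $1 != "0"'
}

# usage:
#   iptables --wait --zero BASE-RULE
#   sleep 5
#   iptables --wait -nvxL BASE-RULE | hit_rules
```

The `-x` option prints exact counters instead of rounded values with K/M suffixes, which makes small increases easier to spot.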
112 | An iptables rule is simply `match conditions + action`. Some example rules:
113 | 
114 | ```
115 | # the examples use -I, the short form of --insert, and omit --wait; when practicing, use a machine you can reach via a tty so you do not lock yourself out, and remember to delete each rule afterwards
116 | 
117 | # allow icmp so that the machine can be pinged from outside
118 | iptables -I INPUT --protocol icmp -j ACCEPT
119 | 
120 | # accept traffic that comes in via eth0 with a source IP in 192.168.1.0/24
121 | iptables -I INPUT --in-interface eth0 --source 192.168.1.0/24 -j ACCEPT
122 | 
123 | # DNS hijacking, commonly used on routers
124 | # e.g. the router runs a DNS server on port 5300 whose upstream is DNS over HTTPS
125 | iptables -t nat -I PREROUTING -p udp -m udp --dport 53 -j REDIRECT --to-ports 5300
126 | iptables -t nat -I PREROUTING -p tcp -m tcp --dport 53 -j REDIRECT --to-ports 5300
127 | 
128 | # hijack all DNS requests to 1.1.1.1:53 and redirect them to 223.5.5.5:53
129 | iptables -t nat -A PREROUTING -d 1.1.1.1 -p udp --dport 53 -j DNAT --to-destination 223.5.5.5:53
130 | iptables -t nat -A PREROUTING -d 1.1.1.1 -p tcp --dport 53 -j DNAT --to-destination 223.5.5.5:53
131 | 
132 | # drop outgoing tcp packets whose source port is 8082
133 | iptables -I OUTPUT -p tcp --sport 8082 -j DROP
134 | 
135 | # the following rules work together
136 | # accept packets of already established connections
137 | -A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
138 | # jump from the INPUT chain into the BASE-RULE chain
139 | -A INPUT -j BASE-RULE
140 | # packets that come in via eth0 with a source IP not in the whiteiplist ipset are dropped
141 | -A BASE-RULE -i eth0 -m set ! --match-set whiteiplist src -j DROP
142 | # packets that pass continue to the next rule, reach RETURN, and go back to the INPUT chain
143 | -A BASE-RULE -j RETURN
144 | # finally the default policy of the INPUT chain applies, i.e. -P INPUT ACCEPT
145 | 
146 | ```
147 | 
148 | The first step is learning to read iptables rules; once that is familiar, writing your own follows naturally.
149 | 
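150 | #### Persisting iptables rules
151 | 
152 | iptables rules are lost on reboot. Rules that are added by hand rather than by a program therefore need something like `iptables.service` to restore them, for example: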
153 | 
154 | ```
155 | yum install ipset ipset-service -y
156 | sed -ri '/^IPSET_SAVE_ON_STOP=/s#no#yes#' /etc/sysconfig/ipset-config
157 | systemctl enable --now ipset
158 | vi /etc/sysconfig/iptables
159 | systemctl enable iptables
160 | 
161 | # apt-based distributions
162 | apt update && apt install -y ipset-persistent iptables-persistent
163 | vi /etc/iptables/rules.v4
164 | systemctl enable netfilter-persistent.service
165 | ```
166 | 
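167 | ## The two iptables modes
168 | 
169 | iptables has two modes:
170 | 
171 | 1. `iptables-legacy`:
172 |     - The traditional iptables implementation, extended through xtables.
173 |     - Typically found on older Linux kernels.
174 | 2. `iptables-nft`:
175 |     - The iptables command implemented on top of nftables.
176 |     - nftables is part of the Netfilter project and aims to simplify and unify the configuration of Linux firewalling and traffic control.
177 |     - It offers a command-line interface similar to traditional iptables but uses nftables underneath, which brings better performance and flexibility.
178 | 
179 | `iptables -V` shows the mode: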
180 | 
181 | ```
182 | iptables v1.8.9 (nf_tables)
183 | iptables v1.8.9 (legacy)
184 | ```
185 | 
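186 | On older Linux distributions the output has no trailing mode; those are always legacy mode. If you are unaware that rules exist in both modes on a machine, newly added rules can behave unexpectedly, because `nf_tables` takes priority over `legacy`. Running `iptables -S` may then show the following: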
187 | 
188 | ```bash
189 | $ iptables -S
190 | # Warning: iptables-legacy tables present, use iptables-legacy to see them
191 | 
192 | # The rules of each mode can be listed separately
193 | iptables-legacy -S
194 | iptables-nft -S
195 | ```
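
The rule for reading the version string can be written down as a small check. This is a minimal sketch: `iptables_mode` is a hypothetical helper name, and it relies only on what is stated above (a `(nf_tables)` suffix means nft mode; `(legacy)` or no suffix means legacy).

```bash
#!/usr/bin/env bash
# iptables_mode VERSION_STRING
# Classify the backend from one line of `iptables -V` output.
# Hypothetical helper for illustration only.
iptables_mode() {
    case "$1" in
        *"(nf_tables)"*) echo nf_tables ;;
        *)               echo legacy ;;   # "(legacy)" or no suffix on old releases
    esac
}

iptables_mode "iptables v1.8.9 (nf_tables)"
iptables_mode "iptables v1.8.9 (legacy)"
iptables_mode "iptables v1.4.21"
# On a real host: iptables_mode "$(iptables -V)"
```

A script that adds rules could call such a check first and refuse to continue when the mode is not the one it expects.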
196 | 
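197 | ## Using iptables inside containers
198 | 
199 | When iptables runs inside a container, its mode must match the host's iptables mode. One way to ensure that is to build the wrappers from the repository below into the image, as [flannel](https://github.com/flannel-io/flannel/blob/8f480ccfb1319ef4fef9bc3a7abdc18286c6e3f1/images/Dockerfile#L20-L39) does:
200 | 
201 | - https://github.com/kubernetes-sigs/iptables-wrappers
202 | 
203 | The main branch of this repository is written in golang; the earlier v2 version is a shell script, which can be used like this: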
204 | 
205 | ```Dockerfile
206 | FROM alpine:3.18
207 | RUN set -eux; \
208 | #    sed -i 's/dl-cdn.alpinelinux.org/mirrors.aliyun.com/g' /etc/apk/repositories; \
209 |     apk add -u iptables  ipset  bash \
210 |         --no-cache; \
211 |     apk add --no-cache --virtual .curl \
212 |         curl; \
213 |     # Avoid network problems caused by the container and the host using different modes (iptables-nft vs. iptables-legacy)
214 |     curl -L https://raw.githubusercontent.com/kubernetes-sigs/iptables-wrappers/v2/iptables-wrapper-installer.sh > /iptables-wrapper-installer.sh; \
215 |     bash /iptables-wrapper-installer.sh --no-sanity-check; \
216 |     apk del .curl; \
217 |     rm -rf /var/cache/apk/* /tmp/*
218 | ...
219 | ```
220 | 
221 | Also, a container that only manipulates iptables rules does not need to be privileged; uid 0 plus the relevant capabilities is enough, for example:
222 | 
223 | ```yaml
224 | services:
225 |   ipset:
226 |     image: 'reg.xxx.lan:5000/xxx/ipset:v1.1'
227 |     container_name: ipset
228 |     network_mode: host
229 |     hostname: ipset
230 |     restart: always # unless-stopped
231 |     working_dir: '/data/kube/'
232 |     cap_drop: ["ALL"]
233 |     cap_add: ["NET_ADMIN", "NET_RAW", "FOWNER"]
234 |     volumes:
235 |         # Mount the `/run/` directory, or just the `/run/xtables.lock` file
236 |       - /run/xtables.lock:/run/xtables.lock:rw
237 |       - '/data/kube/rule/:/data/kube/'
238 | ```
239 | 
240 | ## Links
241 | 
242 | - [Common application-layer troubleshooting commands](01.03.md)
243 | - Next: [Docker networking: bridge network](03.01.md)
244 | 


--------------------------------------------------------------------------------
/eBook/03.01.md:
--------------------------------------------------------------------------------
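  1 | # The default bridge network
  2 | 
  3 | ## Bridge network
  4 | 
  5 | When no network is specified, the default network mode is bridge. Each container has its own protocol stack, isolated in a network namespace, and is connected to the docker0 bridge on the host through a veth pair.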
  6 | 
  7 | ```
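  8 | # The lo interface inside the containers is not drawn; it does not affect understanding
  9 | # The host needs `modprobe br_netfilter` and then ip_forward enabled
 10 | # Warnings in the last few lines of `docker info` mean these are not enabled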
 11 | +-----------------------------------+--------------------------+--------------------------+
 12 | |                  Host             |        Container 1       |       Container 2        |
 13 | |                                   |                          |                          |
 14 | |  +----------------------------+   | +----------------------+ | +----------------------+ |
 15 | |  |  Network Protocol Stack    |   | |Network Protocol Stack| | |Network Protocol Stack| |
 16 | |  +----------------------------+   | +----------------------+ | +----------------------+ |
 17 | |       ↑           ↑               |             ↑            |             ↑            |
 18 | |.......|...........|...............|.............|............|.............|............|
 19 | |       ↓           ↓               |             ↓            |             ↓            |
 20 | |   +------+   +----------+         |         +-------+        |         +-------+        |
 21 | |   |.2.112|   |172.17.0.1|         |         |  .0.2 |        |         |  .0.3 |        |
 22 | |   +------+   +----------+         |         +-------+        |         +-------+        |
 23 | |   | eth0 |   | docker0  |         |         | eth0  |        |         | eth0  |        |
 24 | |   +------+   +----------+         |         +-------+        |         +-------+        |
 25 | |       ↑        ↑      ↑           |             ↑            |             ↑            |
 26 | |       |        |      |           |             |            |             |            |
 27 | |       |        ↓      ↓           |             |            |             |            |
 28 | |       |   +-------+ +-------+     |             |            |             |            |
 29 | |       |   |  veth | |  veth |<------------------+            |             |            |
 30 | |       |   +-------+ +-------+     |                          |             |            |
 31 | |       |       ↑                   |                          |             |            |
 32 | |       |       +----------------------------------------------|-------------+            |
 33 | |       |                           |                          |                          |
 34 | |       |                           |                          |                          |
 35 | +-------|---------------------------+--------------------------+--------------------------+
 36 |         ↓
 37 | Physical Network  (192.168.2.0/24)
 38 | # Do not give docker0 the same subnet as the internal network, or they will conflict
 39 | ```
 40 | 
 41 | ## Traffic from a container to the outside
 42 | 
 43 | For example, start capturing with the newer tcpdump mentioned earlier, then run the container below:
 44 | 
 45 | ```bash
 46 | $ tcpdump -nn -i any icmp and host 223.5.5.5
 47 | 
 48 | # In another terminal, send a single ping
 49 | $ docker run --rm -d --name t1 nginx:alpine ping -c1 223.5.5.5
 50 | ```
 51 | 
 52 | The capture output looks like this:
 53 | 
 54 | ```
 55 | # Outgoing packets
 56 | 11:12:44.213854 veth91af98c P   ifindex 45 02:42:0a:b9:00:02 ethertype IPv4 (0x0800), length 104: 172.17.0.2 > 223.5.5.5: ICMP echo request, id 2, seq 0, length 64
 57 | 11:12:44.213854 docker0 In  ifindex 3 02:42:0a:b9:00:02 ethertype IPv4 (0x0800), length 104: 172.17.0.2 > 223.5.5.5: ICMP echo request, id 2, seq 0, length 64
 58 | 11:12:44.213895 ens160 Out ifindex 2 00:50:56:ad:6f:fa ethertype IPv4 (0x0800), length 104: 192.168.2.112 > 223.5.5.5: ICMP echo request, id 2, seq 0, length 64
 59 | # Returning packets
 60 | 11:12:44.233494 ens160 In  ifindex 2 78:2c:29:c4:00:01 ethertype IPv4 (0x0800), length 104: 223.5.5.5 > 192.168.2.112: ICMP echo reply, id 2, seq 0, length 64
 61 | 11:12:44.233522 docker0 Out ifindex 3 02:42:50:71:f0:99 ethertype IPv4 (0x0800), length 104: 223.5.5.5 > 172.17.0.2: ICMP echo reply, id 2, seq 0, length 64
 62 | 11:12:44.233527 veth91af98c Out ifindex 45 02:42:50:71:f0:99 ethertype IPv4 (0x0800), length 104: 223.5.5.5 > 172.17.0.2: ICMP echo reply, id 2, seq 0, length 64
 63 | ```
 64 | 
 65 | The traffic follows the arrows in the diagram:
 66 | 
 67 | ```
 68 | 
 69 | $ ip route show
 70 | default via 192.168.2.1 dev eth0 proto static 
 71 | 172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 
 72 | 192.168.2.0/24 dev eth0 proto kernel scope link src 192.168.2.112
 73 | $ ip route get 223.5.5.5
 74 | 223.5.5.5 via 192.168.2.1 dev eth0 src 192.168.2.112 uid 0 
 75 |     cache
 76 | +-----------------------------------+--------------------------+
 77 | |               Host                |        Container 1       |
 78 | | after routing: FORWARD, out eth0  |                          |
 79 | |  +----------------------------+   | +----------------------+ |
 80 | |  |  Network Protocol Stack    |   | |Network Protocol Stack| |
 81 | |  +----------------------------+   | +----------------------+ |
 82 | |       |           ↑               |             |            |
 83 | |.......|...........|...............|.............|............|
 84 | |       |           |               |             |            |
 85 | |   POST_ROUTING    |               |             |            |
 86 | |    SNAT 2.112     |               |             |            |
 87 | |       |           |               |             |            |
 88 | |       ↓           |               |             ↓            |
 89 | |   +------+   +----------+         |         +-------+        |
 90 | |   |.2.112|   |172.17.0.1|         |         |  .0.2 |        |
 91 | |   +------+   +----------+         |         +-------+        |
 92 | |   | eth0 |   | docker0  |         |         | eth0  |        |
 93 | |   +------+   +----------+         |         +-------+        |
 94 | |       ↓               ↑           |             |            |
 95 | |       |               |           |             |            |
 96 | |       |             +-------+     |             |            |
 97 | |       |             |  veth |<------------------+            |
 98 | |       |             +-------+     |                          |
 99 | |       |                           |                          |
100 | +-------↓---------------------------+--------------------------+
101 | $ iptables -w -t nat -S POSTROUTING
102 | -P POSTROUTING ACCEPT
103 | -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
104 | # The kernel parameters below are required so that iptables processes bridged traffic
105 | net.bridge.bridge-nf-call-ip6tables = 1
106 | net.bridge.bridge-nf-call-iptables = 1
107 | net.bridge.bridge-nf-call-arptables = 1
108 | ```
109 | 
110 | iptables keeps connection state when doing NAT; conntrack can display it:
111 | 
112 | ```
113 | $ conntrack -L -p icmp
114 | icmp     1 29 src=172.17.0.2 dst=223.5.5.5 type=8 code=0 id=1 src=223.5.5.5 dst=192.168.2.112 type=0 code=0 id=1 mark=0 use=1
115 | ```
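
Each conntrack entry carries two tuples: the original direction first, then the expected reply. Without NAT the reply tuple is an exact mirror of the original one, so comparing the two reveals what was rewritten. The sketch below illustrates that reading; `nat_summary` is a hypothetical helper, not a conntrack feature.

```bash
#!/usr/bin/env bash
# nat_summary: read `conntrack -L` lines on stdin and report address rewrites.
# Hypothetical helper; it only compares the original and reply tuples.
nat_summary() {
    awk '{
        n = 0
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^src=/)      { n++; src[n] = substr($i, 5) }
            else if ($i ~ /^dst=/) { dst[n] = substr($i, 5) }
        }
        if (n < 2) next
        # Reply goes to an address other than the original source: source NAT.
        if (dst[2] != src[1]) printf "SNAT %s -> %s\n", src[1], dst[2]
        # Reply comes from an address other than the original destination: destination NAT.
        if (src[2] != dst[1]) printf "DNAT %s -> %s\n", dst[1], src[2]
    }'
}

# The ICMP entry shown above:
echo 'icmp     1 29 src=172.17.0.2 dst=223.5.5.5 type=8 code=0 id=1 src=223.5.5.5 dst=192.168.2.112 type=0 code=0 id=1 mark=0 use=1' | nat_summary
# prints: SNAT 172.17.0.2 -> 192.168.2.112
# On a real host: conntrack -L 2>/dev/null | nat_summary
```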
116 | 
117 | ## Accessing a container from outside
118 | 
119 | ```
120 | $ docker run --rm -d --name t2 -p 80:80 nginx:alpine
121 | # On the external machine 192.168.2.5
122 | $ curl 192.168.2.112
123 | ```
124 | 
125 | The traffic follows the arrows in the diagram:
126 | 
127 | ```
128 | $ iptables -t nat -S
129 | ...
130 | -A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
131 | -A DOCKER ! -i docker0 -p tcp -m tcp --dport 80 -j DNAT --to-destination 172.17.0.3:80
132 | +-----------------------------------+--------------------------+
133 | |               Host                |        Container 1       |
134 | |  after routing: FORWARD, docker0  |                          |
135 | |  +----------------------------+   | +----------------------+ |
136 | |  |  Network Protocol Stack    |   | |Network Protocol Stack| |
137 | |  +----------------------------+   | +----------------------+ |
138 | |       ↑                ↓          |             ↑            |
139 | |.......|................|..........|.............|............|
140 | |       |                |          |             |            |
141 | |   PRE_ROUTING          |          |             |            |
142 | |  DNAT 172.17.0.3:80    |          |             |            |
143 | |       |                |          |             |            |
144 | |       ↑                ↓          |             ↑            |
145 | |   +------+       +----------+     |         +-------+        |
146 | |   |.2.112|       |172.17.0.1|     |         |  .0.3 |        |
147 | |   +------+       +----------+     |         +-------+        |
148 | |   | eth0 |       | docker0  |     |         | eth0  |        |
149 | |   +------+       +----------+     |         +-------+        |
150 | |       ↑                ↓          |             ↑            |
151 | |       |                |          |             |            |
152 | |       |                |          |             |            |
153 | |       |            +-------+      |             |            |
154 | |       |            |  veth |>-------------------+            |
155 | |       |            +-------+      |                          |
156 | |       |                           |                          |
157 | |       |                           |                          |
158 | +-------↑---------------------------+--------------------------+
159 | ```
160 | 
161 | The corresponding conntrack entry:
162 | 
163 | ```
164 | conntrack -L |& grep -P =80
165 | tcp      6 117 TIME_WAIT src=192.168.2.5 dst=192.168.2.112 sport=51894 dport=80 src=172.17.0.3 dst=192.168.2.5 sport=80 dport=51894 [ASSURED] mark=0 use=1
166 | ```
167 | 
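168 | > One pitfall with external access that is easy to miss: some applications listen on 127.0.0.1 by default. With `-p` port mapping the packets are delivered into the container's protocol stack, but because nothing is bound to `0.0.0.0:x` the service cannot be reached from outside.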
169 | 
170 | ## The docker-proxy process
171 | 
172 | Look closely and you will notice docker-proxy processes:
173 | 
174 | ```bash
175 | $ ss -nlpt  sport 80  | awk '{print $4,$5,$6}' | column -t
176 | Local       Address:Port  Peer
177 | 0.0.0.0:80  0.0.0.0:*     users:(("docker-proxy",pid=3709090,fd=4))
178 | [::]:80     [::]:*        users:(("docker-proxy",pid=3709098,fd=4))
179 | ```
180 | 
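181 | Traffic on the local loopback interface (localhost) does not pass through the iptables PREROUTING chain. Without docker-proxy, those requests would therefore not be redirected to the container.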
182 | 
183 | ```bash
184 | $ curl -I localhost:80
185 | HTTP/1.1 200 OK
186 | ...
187 | ```
188 | 
189 | Now suspend the process and curl again to check the HTTP status code:
190 | 
191 | ```
192 | $ kill -STOP 3709090
193 | $ curl -I localhost:80
194 | ^C
195 | # From an external machine, port 80 on the container host is still reachable,
196 | # because traffic from outside goes through PREROUTING and gets DNATed
197 | $ curl -I 192.168.2.112:80
198 | HTTP/1.1 200 OK
199 | ...
200 | ```
201 | 
202 | Resume the process and access works again:
203 | 
204 | ```
205 | $ kill -CONT 3709090
206 | $ curl -I localhost:80
207 | HTTP/1.1 200 OK
208 | ...
209 | ```
210 | 
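211 | Another reason docker-proxy exists: iptables and the kernel can handle DNAT and SNAT for IPv4, but IPv6 support was incomplete, especially in early Docker versions (newer Linux distributions ship the ip6tables command). docker-proxy can handle IPv6 requests in user space.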
212 | 
213 | ## Links
214 | 
215 | - [iptables basics](02.01.md)
216 | - Next: [Docker networking: none and host networks](03.02.md)
217 | 


--------------------------------------------------------------------------------
/eBook/03.02.md:
--------------------------------------------------------------------------------
 1 | # none and host networks
 2 | 
 3 | ## none network
 4 | 
 5 | Each container has its own protocol stack, isolated in a network namespace, but no network interface (eth0) and no veth attached to the docker0 bridge:
 6 | 
 7 | ```
 8 | +-----------------------------------+--------------------------+
 9 | |                  Host             |        Container 1       |
10 | |                                   |                          |
11 | |  +----------------------------+   | +----------------------+ |
12 | |  |  Network Protocol Stack    |   | |Network Protocol Stack| |
13 | |  +----------------------------+   | +----------------------+ |
14 | |       ↑           ↑               |             ↑            |
15 | |.......|...........|...............|.............|............|
16 | |       ↓           ↓               |             ↓            |
17 | |   +------+   +----------+         |         +-------+        |
18 | |   |.2.112|   |172.17.0.1|         |         |  127  |        |
19 | |   +------+   +----------+         |         +-------+        |
20 | |   | eth0 |   | docker0  |         |         |  lo   |        |
21 | |   +------+   +----------+         |         +-------+        |
22 | |       ↑                           |                          |
23 | |       |                           |                          |
24 | |       |                           |                          |
25 | +-------|---------------------------+--------------------------+
26 |         ↓
27 | Physical Network  (192.168.2.0/24)
28 | ```
29 | 
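30 | Uses:
31 | 1. Letting the user configure network interfaces, routes and so on by hand
32 | 2. Running processes that do not depend on the network, such as builds or stress tests
33 | 
34 | ## host network and its caveats
35 | 
36 | The host network, like `hostNetwork` in K8S, simply means the container does not get its own network namespace and shares the host's network instead. Its other namespaces, such as mount and pid, remain isolated.
37 | 
38 | Typical scenarios:
39 | - Applications that use many ports, or that open additional listening ports on demand, such as ftp; the host network is the simple, blunt solution
40 | - Some applications, such as kafka, run into trouble in bridge mode because the address they see themselves listening on (172.17.0.2:9092) does not match the application-layer address that clients connect to (for example 192.168.2.112:9092). There are two ways to solve this:
41 |     1. Configure the application to advertise the address clients should use, e.g. kafka's `KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://192.168.2.112:9092`
42 |     2. Use the host network as the blunt fix; for instance, where bridge mode is slow, the host network has no such problem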
43 | 
44 | One thing to watch out for is a few files in Docker's init layer:
45 | - /etc/hostname
46 | - /etc/hosts
47 | - /etc/resolv.conf
48 | 
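49 | With default parameters and no extra options, when a host-network container starts, each of these three files is generated as a separate copy of the host's content: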
50 | 
51 | ```bash
52 | $ docker inspect t2 | grep -Pi '(hosts|hostname|resolv).*path'
53 |         "ResolvConfPath": "/var/lib/docker/containers/30d40445175c3c9e83ba7d7b387eb2e9be7fb3f8a3b37822807c6ff818335536/resolv.conf",
54 |         "HostnamePath": "/var/lib/docker/containers/30d40445175c3c9e83ba7d7b387eb2e9be7fb3f8a3b37822807c6ff818335536/hostname",
55 |         "HostsPath": "/var/lib/docker/containers/30d40445175c3c9e83ba7d7b387eb2e9be7fb3f8a3b37822807c6ff818335536/hosts",
56 | ```
57 | 
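58 | Options such as `--add-host` make them differ from the host's. Because the init layer files are separate copies, two things follow:
59 | - If these files are modified inside the container, they revert to the host's configuration after the container restarts
60 | - If the host changes its own files, the container does not pick up the change until it is restarted
61 | 
62 | Java applications need care on the host network: some Java libraries resolve their own hostname at startup and fail to start if it cannot be resolved, zookeeper for example. I have also seen a springboot project take 15 minutes to start when `/etc/hosts` had no entry for the hostname; after adding the entry it started in 3 minutes.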
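
A quick pre-flight check for the hostname problem described above could look like the sketch below. `has_host_entry` is a hypothetical helper name and the demo hosts file is made up; it only checks whether a name appears as a hostname or alias in a hosts-format file.

```bash
#!/usr/bin/env bash
# has_host_entry HOSTS_FILE NAME
# Succeeds if NAME appears as a hostname or alias on a non-comment line.
# Hypothetical helper for illustration only.
has_host_entry() {
    awk -v name="$2" '
        /^[[:space:]]*#/ { next }
        { sub(/#.*/, ""); for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit !found }
    ' "$1"
}

# Demo on a made-up hosts file
hosts=$(mktemp)
printf '127.0.0.1 localhost\n192.168.2.112 node1 # app host\n' > "$hosts"
has_host_entry "$hosts" node1 && echo "node1: has an entry"
has_host_entry "$hosts" node2 || echo "node2: no entry, add one before starting the JVM"
rm -f "$hosts"
# On a real host: has_host_entry /etc/hosts "$(hostname)"
```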
63 | 
64 | ## Links
65 | 
66 | - [Docker networking: bridge network](03.01.md)
67 | - Next: [Docker networking: container network](03.03.md)
68 | 


--------------------------------------------------------------------------------
/eBook/03.03.md:
--------------------------------------------------------------------------------
  1 | # container network
  2 | 
  3 | ## Introduction
  4 | 
  5 | ```
  6 | # The lo interface inside the containers is not drawn; it does not affect understanding
  7 | +-----------------------------------+---------------+-----------------+
  8 | |                  Host             |  Container 1  |  Container 2    |
  9 | |                                   |               |                 |
 10 | |  +----------------------------+   | +-----------------------------+ |
 11 | |  |  Network Protocol Stack    |   | |   Network Protocol Stack    | |
 12 | |  +----------------------------+   | +-----------------------------+ |
 13 | |       ↑           ↑               |               ↑                 |
 14 | |.......|...........|...............|...............|.................|
 15 | |       ↓           ↓               |               ↓                 |
 16 | |   +------+   +----------+         |            +-------+            |
 17 | |   |.2.112|   |172.17.0.1|         |            |  .0.3 |            |
 18 | |   +------+   +----------+         |            +-------+            |
 19 | |   | eth0 |   | docker0  |         |            | eth0  |            |
 20 | |   +------+   +----------+         |            +-------+            |           
 21 | |       ↑            ↑              |                ↑                |
 22 | |       |            |              |                |                |
 23 | |       |            ↓              |                |                |
 24 | |       |        +-------+          |                |                |
 25 | |       |        |  veth |          |                |                |
 26 | |       |        +-------+          |                |                |
 27 | |       |            ↑              |                |                |
 28 | |       |            +-------------------------------+                |
 29 | |       |                           |                                 |
 30 | |       |                           |                                 |
 31 | +-------|---------------------------+---------------------------------+
 32 |         ↓
 33 | Physical Network  (192.168.2.0/24)
 34 | ```
 35 | 
 36 | A container started this way shares one network namespace with the specified container, while its other namespaces remain isolated and private, for example:
 37 | 
 38 | ```
 39 | $ docker ps -a
 40 | CONTAINER ID   IMAGE                           COMMAND                  CREATED        STATUS       PORTS                               NAMES
 41 | 30d40445175c   nginx:alpine                    "/docker-entrypoint.…"   4 hours ago    Up 4 hours   0.0.0.0:80->80/tcp, :::80->80/tcp   t2
 42 | 
 43 | $ docker run --rm -ti --net container:30d40445175c zhangguanzhang/netshoot
 44 | 30d40445175c> curl -I localhost
 45 | HTTP/1.1 200 OK
 46 | ...
 47 | ```
 48 | 
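 49 | Use cases:
 50 | 
 51 | - A container has no network troubleshooting tools, for example newer coredns images; joining its network this way makes troubleshooting possible
 52 | - The K8S sandbox pause container: the containers that follow all use its network namespace, which is what forms the POD concept
 53 | 
 54 | nsenter literally means "namespace enter": it enters the specified namespaces: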
 55 | 
 56 | ```bash
 57 | $ nsenter --help
 58 | Usage:
 59 |  nsenter [options] <program> [<argument>...]
 60 | 
 61 | Run a program with namespaces of other processes.
 62 | 
 63 | Options:
 64 |  -t, --target <pid>     target process to get namespaces from
 65 |  -m, --mount[=<file>]   enter mount namespace
 66 |  -u, --uts[=<file>]     enter UTS namespace (hostname etc)
 67 |  -i, --ipc[=<file>]     enter System V IPC namespace
 68 |  -n, --net[=<file>]     enter network namespace
 69 |  -p, --pid[=<file>]     enter pid namespace
 70 |  -U, --user[=<file>]    enter user namespace
 71 |  -S, --setuid <uid>     set uid in entered namespace
 72 |  -G, --setgid <gid>     set gid in entered namespace
 73 |      --preserve-credentials do not touch uids or gids
 74 |  -r, --root[=<dir>]     set the root directory
 75 |  -w, --wd[=<dir>]       set the working directory
 76 |  -F, --no-fork          do not fork before exec'ing <program>
 77 |  -Z, --follow-context   set SELinux context according to --target PID
 78 | 
 79 |  -h, --help     display this help and exit
 80 |  -V, --version  output version information and exit
 81 | 
 82 | For more details see nsenter(1).
 83 | ```
 84 | 
 85 | The difference between the container network and nsenter:
 86 | 
 87 | ```bash
 88 | # Host DNS
 89 | $ cat /etc/resolv.conf
 90 | nameserver 10.236.158.114
 91 | nameserver 10.236.158.106
 92 | 
 93 | $ docker inspect a7abc0e4af98 | grep -m1 -i pid
 94 |             "Pid": 3376,
 95 | 
 96 | $ nsenter --net --target 3376
 97 | $ netstat -nlptu
 98 | Active Internet connections (only servers)
 99 | Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
100 | tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      3376/nginx: master  
101 | tcp6       0      0 :::80                   :::*                    LISTEN      3376/nginx: master  
102 | $ cat /etc/resolv.conf
103 | nameserver 10.236.158.114
104 | nameserver 10.236.158.106
105 | # The actual /etc/resolv.conf inside the container
106 | $ docker exec a7abc0e4af98 cat /etc/resolv.conf
107 | nameserver 10.96.0.10
108 | search default.svc.cluster.local svc.cluster.local cluster.local
109 | options ndots:5
110 | ```
111 | 
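112 | The most common nsenter option is `--net`, but with it alone, files such as `/etc/resolv.conf` and `hosts` are still the host's, and adding `--mount` enters the container's rootfs, where the troubleshooting tools are missing. To simply look at IPs, ports and network information, run `nsenter --net --target <pid>` and use the host's commands; for anything that depends on those two files, it is better to start a tool container with `--net container:xxx`.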
113 | 
114 | ```bash
115 | $ docker run --rm --net container:a7abc0e4af98 alpine:latest cat /etc/resolv.conf
116 | nameserver 10.96.0.10
117 | search default.svc.cluster.local svc.cluster.local cluster.local
118 | options ndots:5
119 | ```
120 | 
121 | ## Links
122 | 
123 | - [Docker networking: none and host networks](03.02.md)
124 | - Next: [Docker cross-node container communication](03.04.md)
125 | 


--------------------------------------------------------------------------------
/eBook/03.04.md:
--------------------------------------------------------------------------------
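  1 | # Cross-node container communication
  2 | 
  3 | ## Introduction
  4 | 
  5 | Container networking on a single docker host works well, but running every service on one machine in production is a single point of failure, so services are usually deployed across several machines:
  6 | 
  7 | ![docker-app-replication](../images/docker-app-replication.png)
  8 | 
  9 | So the problem of cross-node container communication has to be solved. Suppose there is no K8S: how can containers communicate across nodes?
 10 | 
 11 | ## Connecting containers across nodes
 12 | 
 13 | ![docker-two-node](../images/docker-two-node.png)
 14 | 
 15 | A prerequisite: container IPs all come from docker0's CIDR, so the machines involved must be configured with different docker0 subnets. Only then is it worth considering how to connect them.
 16 | 
 17 | ### Forwarding through the host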
 18 | 
 19 | ```bash
 20 | # On 192.168.2.112
 21 | $ ping 172.27.0.2
 22 | PING 172.27.0.2 (172.27.0.2) 56(84) bytes of data.
 23 | ^C
 24 | --- 172.27.0.2 ping statistics ---
 25 | 2 packets transmitted, 0 received, 100% packet loss, time 1005ms
 26 | 
 27 | # On 192.168.2.111
 28 | $ ping 172.17.0.2
 29 | PING 172.17.0.2 (172.17.0.2) 56(84) bytes of data.
 30 | ^C
 31 | --- 172.17.0.2 ping statistics ---
 32 | 2 packets transmitted, 0 received, 100% packet loss, time 2024ms
 33 | ```
 34 | 
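 35 | This does not work directly: the packets follow the default route to the gateway, match the gateway's own default route, and are eventually dropped by some device further along.
 36 | 
 37 | From basic networking we know that as long as packets do not leave through the gateway, that is, they stay within the layer-2 network, their IPs are not NATed. So we can simply add a route that sends `172.27.0.0/16` to the machine `192.168.2.111`:
 38 | 
 39 | ```bash
 40 | # On 192.168.2.112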
 41 | $ ip route add 172.27.0.0/16 via 192.168.2.111
 42 | $ ping -c1 172.27.0.2
 43 | PING 172.27.0.2 (172.27.0.2) 56(84) bytes of data.
 44 | 64 bytes from 172.27.0.1: icmp_seq=1 ttl=64 time=0.274 ms
 45 | 
 46 | --- 172.27.0.2 ping statistics ---
 47 | 1 packets transmitted, 1 received, 0% packet loss, time 0ms
 48 | rtt min/avg/max/mdev = 0.274/0.274/0.274/0.000 ms
 49 | ```
 50 | 
 51 | That works. Next, ping from inside a container on `192.168.2.112`:
 52 | 
 53 | ```bash
 54 | # On 192.168.2.112
 55 | $ docker exec -ti ctr2 ping 172.27.0.2
 56 | PING 172.27.0.2 (172.27.0.2): 56 data bytes
 57 | 64 bytes from 172.27.0.2: seq=0 ttl=62 time=0.341 ms
 58 | 64 bytes from 172.27.0.2: seq=1 ttl=62 time=0.394 ms
 59 | 64 bytes from 172.27.0.2: seq=2 ttl=62 time=0.360 ms
 60 | 64 bytes from 172.27.0.2: seq=3 ttl=62 time=0.494 ms
 61 | ```
 62 | 
 63 | There is a catch, though. Capture packets on `192.168.2.111` while the ping is running:
 64 | 
 65 | ```bash
 66 | # Scroll right to see the IPs
 67 | $ tcpdump -nn -e -i eth0 icmp 
 68 | tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
 69 | listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
 70 | 7e:bf:9c:27:f5:9e > 7e:10:5f:2f:f1:47, ethertype IPv4 (0x0800), length 98: 192.168.2.112 > 172.27.0.2: ICMP echo request, id 68, seq 42, length 64
 71 | 7e:10:5f:2f:f1:47 > 7e:bf:9c:27:f5:9e, ethertype IPv4 (0x0800), length 98: 172.27.0.2 > 192.168.2.112: ICMP echo reply, id 68, seq 42, length 64
 72 | ```
 73 | 
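 74 | The source IP is the host's rather than the container's, because docker configures SNAT on every machine. In practice we want containers on one node to reach container IPs on another node without NAT, so we add an iptables rule that bypasses docker's NAT rule. On `192.168.2.112`, for example:
 75 | 
 76 | - source IP in 172.17.0.0/16 + destination IP in 172.27.0.0/16: skip the remaining rules and fall through to the `ACCEPT` policy of `POSTROUTING`
 77 | 
 78 | Translated into commands on `192.168.2.112`:
 79 | 
 80 | ```
 81 | # On 192.168.2.112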
 82 | $ iptables -t nat -S POSTROUTING
 83 | -P POSTROUTING ACCEPT
 84 | -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
 85 | 
 86 | # Add a rule that skips NAT
 87 | $ iptables -w -t nat -I POSTROUTING -s 172.17.0.0/16 -d 172.27.0.0/16 -j RETURN
 88 | ```
 89 | 
 90 | After adding it, pinging the target container IP again on `192.168.2.112` fails:
 91 | 
 92 | ```bash
 93 | # Check that the container can still reach the public internet; that SNAT is unaffected
 94 | $ docker exec -ti ctr2 ping -c 2 223.5.5.5
 95 | PING 223.5.5.5 (223.5.5.5): 56 data bytes
 96 | 64 bytes from 223.5.5.5: seq=0 ttl=112 time=18.470 ms
 97 | 64 bytes from 223.5.5.5: seq=1 ttl=112 time=18.881 ms
 98 | 
 99 | --- 223.5.5.5 ping statistics ---
100 | 2 packets transmitted, 2 packets received, 0% packet loss
101 | round-trip min/avg/max = 18.470/18.675/18.881 ms
102 | 
103 | $ docker exec -ti ctr2 ping 172.27.0.2
104 | PING 172.27.0.2 (172.27.0.2): 56 data bytes
105 | 
106 | ```
107 | 
108 | Capturing on `2.111` then shows:
109 | 
110 | ```bash
111 | $ tcpdump -nn  -i eth0 icmp 
112 | tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
113 | listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
114 | 10:51:13.421797 IP 172.17.0.2 > 172.27.0.2: ICMP echo request, id 74, seq 72, length 64
115 | 10:51:13.421947 IP 172.27.0.2 > 172.17.0.2: ICMP echo reply, id 74, seq 72, length 64
116 | ```
117 | 
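118 | The source IP is now correct: there is no SNAT any more and the container does reply. But because no route has been added on `2.111`, the reply from `172.27.0.2` to `172.17.0.2` follows the default route to the gateway.
119 | > Please do not verify this section on VMs backed by openstack virtualization: openstack components validate the MAC and IP addresses of packets leaving a VM's NIC. Here the MAC address belongs to the NIC but the IP address does not, so the packets fail validation and are dropped.
120 | 
121 | As an exercise, combine the earlier iptables tutorial with what this section covers and configure both ends so that the containers can reach each other.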
122 | 
123 | ![docker-two-node-host-gw](../images/docker-two-node-host-gw.png)
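
One possible way to approach the exercise is to mirror on each node the two steps used above: a route to the peer's container subnet and a `RETURN` rule that skips SNAT for that subnet. The sketch below is a dry run that only prints the commands; `print_peer_setup` is a hypothetical helper name, and depending on the Docker version the `FORWARD` chain may need additional ACCEPT rules that are not covered here.

```bash
#!/usr/bin/env bash
# print_peer_setup LOCAL_CIDR PEER_CIDR PEER_HOST
# Dry run: only prints the commands, nothing is applied.
# Hypothetical helper for illustration only.
print_peer_setup() {
    local local_cidr=$1 peer_cidr=$2 peer_host=$3
    # Send the peer's container subnet to the peer host instead of the default gateway.
    echo "ip route add ${peer_cidr} via ${peer_host}"
    # Leave POSTROUTING before docker's MASQUERADE rule for container-to-container traffic.
    echo "iptables -w -t nat -I POSTROUTING -s ${local_cidr} -d ${peer_cidr} -j RETURN"
}

echo "# on 192.168.2.112"
print_peer_setup 172.17.0.0/16 172.27.0.0/16 192.168.2.111
echo "# on 192.168.2.111"
print_peer_setup 172.27.0.0/16 172.17.0.0/16 192.168.2.112
```

Printing rather than executing keeps the sketch safe to run anywhere; review the output before applying it on the two hosts.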
124 | 
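125 | ### Conclusion
126 | 
127 | Connecting nodes with routes like this relies on host forwarding, which is exactly flannel's `host-gw` mode (look at its iptables rules yourself: it actually uses a single /16 CIDR block, which saves many iptables rules). It does not modify the packets, so it is more efficient than tunnel encapsulations such as vxlan or IPIP. The drawback is that it only works within one layer-2 network:
128 | 
129 | ![host-gw-fault](../images/host-gw-fault.png)
130 | 
131 | In the rough topology shown, suppose two or more machines from the `100.64.100.0/24` network and the `10.13.178.0/24` network form the K8S nodes. Adding routes on the Linux machines alone is then not enough, and in real life you usually have neither the permission nor the means to configure the network devices.
132 | 
133 | Other ways to connect docker containers across nodes, such as etcd/consul + calico/flannel, macvlan or ipvlan, are not covered here.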
134 | 
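135 | ## Load balancing
136 | 
137 | Cross-node connectivity is solved, but in practice an application never runs as a single replica container. The situation looks like this:
138 | 
139 | ![docker-node-ctrs](../images/docker-node-ctrs.png)
140 | 
141 | Say ctr1 wants to reach ctr2, but ctr2 has two replicas. The most common kind of load balancing is layer 4: it ignores the application-layer data and only has to make sure the TCP/UDP data gets delivered. So we want:
142 | - access to `ctr2-LB-IP:port` to be balanced across `ctr2-1:port` and `ctr2-2:port`
143 | 
144 | There are two ways to implement it:
145 | - iptables: match destination IP `ctr2-LB-IP` + destination port, then round-robin DNAT to `ctr2-1:port` and `ctr2-2:port`
146 | - lvs implementing the load balancer
147 | 
148 | This is the idea behind the K8S service `ClusterIP`: kube-apiserver allocates the LB-IP (service IP); kube-proxy watches for service creation and adds iptables or lvs rules on its own node; and when PODs are rebuilt or rescheduled, kube-proxy updates the real server information behind the load balancer.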
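
The iptables variant can be sketched as a rule generator. Note that this sketch swaps strict round-robin for random selection with the `statistic` match (`--mode random --probability`), which spreads new connections evenly on average. Everything here is an assumption for illustration: `print_lb_rules` is a hypothetical helper, `10.96.0.100` is a made-up LB IP, and the rules are only printed. They target `PREROUTING`, so they would cover traffic arriving from containers or other hosts; locally generated traffic would need matching `OUTPUT` rules.

```bash
#!/usr/bin/env bash
# print_lb_rules VIP PORT BACKEND...
# Dry run: prints one DNAT rule per backend. Rule i (0-based) of n matches with
# probability 1/(n-i); the last rule has no statistic match and takes the rest,
# so every backend ends up with an equal share of new connections.
print_lb_rules() {
    local vip=$1 port=$2; shift 2
    local n=$# i=0 backend prob
    for backend in "$@"; do
        if (( i < n - 1 )); then
            prob=$(LC_ALL=C awk -v d="$(( n - i ))" 'BEGIN { printf "%.10f", 1 / d }')
            echo "iptables -t nat -A PREROUTING -d ${vip} -p tcp --dport ${port} -m statistic --mode random --probability ${prob} -j DNAT --to-destination ${backend}"
        else
            echo "iptables -t nat -A PREROUTING -d ${vip} -p tcp --dport ${port} -j DNAT --to-destination ${backend}"
        fi
        i=$(( i + 1 ))
    done
}

# Two replicas of ctr2, using the container IPs from the earlier examples.
print_lb_rules 10.96.0.100 80 172.17.0.2:80 172.27.0.2:80
```

Because NAT decisions are recorded by conntrack, the choice is made once per connection, not per packet, so an established TCP connection stays on the same backend.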
149 | 
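150 | ## DNS
151 | 
152 | Suppose load balancing is implemented with iptables. The `LB-IP` is certainly allocated dynamically. ctr1 could use service registration and discovery to cope with ctr2's dynamic `LB-IP`, but that couples load balancing to the application: if an application needs an nginx proxy in front, also deployed with several replicas, nobody wants to hack a service discovery SDK into nginx. So one service calls another by domain name (much like an RDS instance on a public cloud, where the implementation behind it does not matter), which in K8S is the service name.
153 | 
154 | DNS itself needs multiple replicas too, so the DNS server configured inside containers is the LB-IP of the DNS service. In every K8S POD that looks like: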
155 | 
156 | ```bash
157 | $ grep nameserver /etc/resolv.conf
158 | nameserver 10.96.0.10
159 | ```
160 | 
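161 | ## Summary
162 | 
163 | This chapter is a rough overview: going from a single node to multiple nodes raises many problems that must be solved, and many later K8S components and design ideas follow the same lines.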
164 | 
165 | ## Links
166 | 
167 | - [Docker networking: container network](03.03.md)
168 | - Next: [K8S overlay concepts](04.01.md)
169 | 


--------------------------------------------------------------------------------
/eBook/04.01.md:
--------------------------------------------------------------------------------
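 1 | # Basic concepts
 2 | 
 3 | This chapter introduces some basic concepts without going into detail.
 4 | 
 5 | ## K8S pod network
 6 | 
 7 | The K8S network is shown below. The POD network is an overlay network, meaning a network built on top of an existing one. As in the diagram, only K8S nodes can reach the `10.244.0.0/16` network; any other machine follows its own default route, and the traffic is routed away without ever entering the POD network.
 8 | 
 9 | ![K8S-overlay](../images/K8S-overlay.png)
10 | 
11 | `cni0` comes from the K8S CNI plugin interface specification. docker is not the only container runtime; there are many other implementations, so the veths of pod containers are required to attach to cbr0 rather than being tied to docker0. That bridge is `cni0`.
12 | 
13 | A K8S POD is `>=1` containers: the pause sandbox container is created first, and the other containers are then attached with `--net container:<pause_id>`.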
14 | 
15 | ## Links
16 | 
17 | - [Docker cross-node container communication](03.04.md)
18 | - Next: [How Service works](04.02.md)
19 | 


--------------------------------------------------------------------------------
/eBook/04.02.01.md:
--------------------------------------------------------------------------------
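 1 | # iptables Service mode
 2 | 
 3 | Understanding the earlier diagram is really enough. In iptables mode the rules fall into these groups:
 4 | 
 5 | - `KUBE-SERVICES` (nat.PREROUTING/nat.OUTPUT): sits at the very start of the PREROUTING and OUTPUT chains. Its rules are of two kinds:
 6 |     - `-d SVCIP ... --dport Port -j KUBE-SVC-xxx` dispatches `SVC_IP:PORT` to the corresponding `KUBE-SVC-xxx` chain
 7 |     - `-m addrtype --dst-type LOCAL` dispatches packets addressed to local interfaces to the KUBE-NODEPORTS chain
 8 | - `KUBE-NODEPORTS`: matches the `NodePort` port by dst-port
 9 |     - packets enter the corresponding `KUBE-SVC-xxx` chain (externalTrafficPolicy=Cluster)
10 |     - packets enter the corresponding `KUBE-XLB-xxx` chain (externalTrafficPolicy=Local)
11 | - `KUBE-SVC-xxx`: corresponds to a service; packets enter a `KUBE-SEP-xxx` chain at random
12 | - `KUBE-XLB-xxx`: corresponds to a service; packets may enter a `KUBE-SEP-xxx` chain or be dropped
13 | - `KUBE-SEP-xxx`: corresponds to an IP address in the endpoints; packets are DNATed to the Pod IP
14 | - `KUBE-FIREWALL` (filter.INPUT/filter.OUTPUT): drops packets marked `0x8000`, mainly used with `externalTrafficPolicy=Local`
15 | - `KUBE-MARK-MASQ`: marks packets with `0x4000` (SNAT required)
16 | - `KUBE-MARK-DROP`: marks packets with `0x8000` (to be dropped)
17 | - `KUBE-POSTROUTING` (nat.POSTROUTING): MASQUERADEs packets marked `0x4000`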
18 | 
19 | service 一般不会出问题，所以会查 iptables 规则和看就行了，上面的 SERVICE 和 MARK 、KUBE-POSTROUTING 几个实际在 ipvs 里也使用了。
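
Reading the rules mostly means following one Service from its `KUBE-SVC-xxx` chain to its `KUBE-SEP-xxx` chains and the DNAT targets. The sketch below does that on saved rule text, so it can be tried without a cluster. The function name `trace_svc`, the chain names and the addresses in the sample are invented for illustration; real rules carry extra matches and comments.

```bash
#!/bin/bash
# Follow one Service through saved nat rules: KUBE-SVC-* -> KUBE-SEP-* -> DNAT target.
# $1: file with `iptables-save -t nat` style lines, $2: name of the KUBE-SVC chain.
trace_svc() {
    local rules=$1 svc_chain=$2 sep
    # every "-A <svc_chain> ... -j KUBE-SEP-xxx" rule names one endpoint chain
    for sep in $(awk -v c="$svc_chain" '$1=="-A" && $2==c {for(i=3;i<NF;i++) if($i=="-j" && $(i+1) ~ /^KUBE-SEP-/) print $(i+1)}' "$rules"); do
        # the endpoint chain holds the DNAT rule with the Pod address
        awk -v c="$sep" '$1=="-A" && $2==c {for(i=3;i<NF;i++) if($i=="--to-destination") print c, "->", $(i+1)}' "$rules"
    done
}

# Made-up sample rules, simplified from what kube-proxy would write.
sample=$(mktemp)
cat > "$sample" << 'EOF'
-A KUBE-SERVICES -d 10.96.0.20/32 -p tcp -m tcp --dport 80 -j KUBE-SVC-DEMO
-A KUBE-SVC-DEMO -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-AAA
-A KUBE-SVC-DEMO -j KUBE-SEP-BBB
-A KUBE-SEP-AAA -s 10.244.1.5/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-AAA -p tcp -m tcp -j DNAT --to-destination 10.244.1.5:80
-A KUBE-SEP-BBB -s 10.244.2.7/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-BBB -p tcp -m tcp -j DNAT --to-destination 10.244.2.7:80
EOF

trace_svc "$sample" KUBE-SVC-DEMO
rm -f "$sample"
```

On a real node the input file would come from `iptables-save -t nat > rules.txt`, and the chain name from the `KUBE-SERVICES` rule that matches the Service IP and port.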
20 | 
21 | ## Links
22 | 
23 | - [K8S overlay concepts](04.01.md)
24 | - Next: [IPVS Service mode](04.02.02.md)
25 | 


--------------------------------------------------------------------------------
/eBook/04.02.02.md:
--------------------------------------------------------------------------------
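  1 | # IPVS Service mode
  2 | 
  3 | In iptables mode the number of rules grows in proportion to the number of pods, so rule matching gets more and more expensive. K8S therefore introduced an IPVS mode, which reached GA in v1.11.
  4 | 
  5 | ## Building an IPVS SVC by hand
  6 | 
  7 | Most write-ups about IPVS just create a few entries with ipvsadm on a machine that already runs K8S and never cover the core iptables rules that go with it, so here we implement an IPVS svc on clean machines.
  8 | 
  9 | 
 10 | ### Preparing the environment
 11 | 
 12 | To manage IPVS rules we need `ipvsadm`. I use two clean CentOS 7.8 machines as the environment.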
 13 | 
 14 | | IP    |
 15 | | :----- |
 16 | | 192.168.2.111 |
 17 | | 192.168.2.222 |
 18 | 
 19 | First install the basics:
 20 | 
 21 | ```bash
 22 | yum install -y ipvsadm curl wget tcpdump ipset conntrack-tools
 23 | 
 24 | # enable forwarding
 25 | 
 26 | sysctl -w net.ipv4.ip_forward=1
 27 | 
 28 | # confirm the iptables rules are empty
 29 | $ iptables -S
 30 | -P INPUT ACCEPT
 31 | -P FORWARD ACCEPT
 32 | -P OUTPUT ACCEPT
 33 | $ iptables -t nat -S
 34 | -P PREROUTING ACCEPT
 35 | -P INPUT ACCEPT
 36 | -P OUTPUT ACCEPT
 37 | -P POSTROUTING ACCEPT
 38 | ```
 39 | 
 40 | ### Some thoughts
 41 | 
 42 | Because the SVC port and the POD port differ, kube-proxy uses `nat` mode. For now we plan to add an SVC like this:
 43 | 
 44 | ```
 45 | IP:                169.254.11.2
 46 | Port:              https  80/TCP
 47 | TargetPort:        8080/TCP
 48 | Endpoints:         192.168.2.111:8080,192.168.2.222:8080
 49 | Session Affinity:  None
 50 | ```
 51 | 
 52 | For the web server I use [ran](https://github.com/m3ng9i/ran), a simple web server shipped as a single Go binary:
 53 | 
 54 | ```bash
 55 | wget https://github.com/m3ng9i/ran/releases/download/v0.1.6/ran_linux_amd64.zip
 56 | unzip -x ran_linux_amd64.zip
 57 | mkdir www
 58 | 
 59 | # create a different index file on each machine
 60 | echo 192.168.2.111 > www/test   # run this one on 192.168.2.111
 61 | echo 192.168.2.222 > www/test   # run this one on 192.168.2.222
 62 | 
 63 | ./ran_linux_amd64 -port 8080 -listdir www
 64 | ```
 65 | 
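 66 | Once the web server is running on both machines, open another terminal on `192.168.2.111` for the steps that follow.
 67 | 
 68 | ### lvs nat
 69 | 
 70 | kube-proxy does not have a dedicated `NAT GW` machine the way a classic lvs nat setup does; put differently, every node is its own `NAT GW`. Now add the SVC `169.254.11.2:80` with ipvsadm: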
 71 | 
 72 | ```bash
 73 | ipvsadm --add-service --tcp-service 169.254.11.2:80 --scheduler rr
 74 | ```
 75 | 
 76 | Add the local web server as the first real server. The command below adds it as a nat-type (masquerading) real server:
 77 | 
 78 | ```bash
 79 | ipvsadm --add-server --tcp-service 169.254.11.2:80 \
 80 |   --real-server 192.168.2.111:8080 --masquerading --weight 1
 81 | ```
 82 | 
 83 | Check the current table:
 84 | 
 85 | ```bash
 86 | $ ipvsadm -ln
 87 | IP Virtual Server version 1.2.1 (size=4096)
 88 | Prot LocalAddress:Port Scheduler Flags
 89 |   -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
 90 | TCP  169.254.11.2:80 rr
 91 |   -> 192.168.2.111:8080           Masq    1      0          0 
 92 | ```
 93 | 
 94 | Since the node is its own `NAT GW`, the VIP is configured on the node itself:
 95 | 
 96 | ```bash
 97 | ip addr add 169.254.11.2/32 dev eth0
 98 | ```
 99 | 
100 | Test access:
101 | 
102 | ```bash
103 | $ curl 169.254.11.2/www/test
104 | 192.168.2.111
105 | ```
106 | 
107 | Add port 8080 of the other node:
108 | 
109 | ```bash
110 | $ ipvsadm --add-server --tcp-service 169.254.11.2:80 \
111 |   --real-server 192.168.2.222:8080 --masquerading --weight 1
112 | 
113 | $ ipvsadm -ln
114 | IP Virtual Server version 1.2.1 (size=4096)
115 | Prot LocalAddress:Port Scheduler Flags
116 |   -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
117 | TCP  169.254.11.2:80 rr
118 |   -> 192.168.2.111:8080           Masq    1      0          0         
119 |   -> 192.168.2.222:8080           Masq    1      0          0
120 | ```
121 | 
122 | Test access again:
123 | 
124 | ```bash
125 | $ curl 169.254.11.2/www/test
126 | 
127 | ```
128 | 
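129 | curl alternates between hanging and returning `192.168.2.111`; it never returns `192.168.2.222`. The IPVS connection table shows that it hangs only when the request is scheduled to the non-local real server: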
130 | 
131 | ```bash
132 | $ ipvsadm -lnc
133 | IPVS connection entries
134 | pro expire state       source             virtual            destination
135 | TCP 00:48  SYN_RECV    169.254.11.2:50698 169.254.11.2:80    192.168.2.222:8080
136 | ```
137 | 
138 | Capture packets on `192.168.2.222`:
139 | 
140 | ```bash
141 | $ tcpdump -nn -i eth0 port 8080
142 | tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
143 | listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
144 | 07:38:26.360716 IP 169.254.11.2.50710 > 192.168.2.222.8080: Flags [S], seq 768065283, win 43690, options [mss 65495,sackOK,TS val 12276183 ecr 0,nop,wscale 7], length 0
145 | 07:38:26.360762 IP 192.168.2.222.8080 > 169.254.11.2.50710: Flags [S.], seq 2142784980, ack 768065284, win 28960, options [mss 1460,sackOK,TS val 676518144 ecr 12276183,nop,wscale 7], length 0
146 | 07:38:27.362848 IP 169.254.11.2.50710 > 192.168.2.222.8080: Flags [S], seq 768065283, win 43690, options [mss 65495,sackOK,TS val 12277186 ecr 0,nop,wscale 7], length 0
147 | 07:38:27.362884 IP 192.168.2.222.8080 > 169.254.11.2.50710: Flags [S.], seq 2142784980, ack 768065284, win 28960, options [mss 1460,sackOK,TS val 676519146 ecr 12276183,nop,wscale 7], length 0
148 | 07:38:28.562629 IP 192.168.2.222.8080 > 169.254.11.2.50710: Flags [S.], seq 2142784980, ack 768065284, win 28960, options [mss 1460,sackOK,TS val 676520346 ecr 12276183,nop,wscale 7], length 0
149 | 07:38:29.368811 IP 169.254.11.2.50710 > 192.168.2.222.8080: Flags [S], seq 768065283, win 43690, options [mss 65495,sackOK,TS val 12279192 ecr 0,nop,wscale 7], length 0
150 | 07:38:29.368853 IP 192.168.2.222.8080 > 169.254.11.2.50710: Flags [S.], seq 2142784980, ack 768065284, win 28960, options [mss 1460,sackOK,TS val 676521152 ecr 12276183,nop,wscale 7], length 0
151 | 07:38:31.562633 IP 192.168.2.222.8080 > 169.254.11.2.50710: Flags [S.], seq 2142784980, ack 768065284, win 28960, options [mss 1460,sackOK,TS val 676523346 ecr 12276183,nop,wscale 7], length 0
152 | 07:38:33.376829 IP 169.254.11.2.50710 > 192.168.2.222.8080: Flags [S], seq 768065283, win 43690, options [mss 65495,sackOK,TS val 12283200 ecr 0,nop,wscale 7], length 0
153 | 07:38:33.376869 IP 192.168.2.222.8080 > 169.254.11.2.50710: Flags [S.], seq 2142784980, ack 768065284, win 28960, options [mss 1460,sackOK,TS val 676525160 ecr 12276183,nop,wscale 7], length 0
154 | 07:38:37.562632 IP 192.168.2.222.8080 > 169.254.11.2.50710: Flags [S.], seq 2142784980, ack 768065284, win 28960, options [mss 1460,sackOK,TS val 676529346 ecr 12276183,nop,wscale 7], length 0
155 | ```
156 | 
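157 | Judging by the `Flags`, these are TCP retransmissions, and the `SRC IP` is the VIP. When `192.168.2.222.8080` replies to `169.254.11.2.50710`, the packet is sent to the default gateway. A capture on the gateway confirms it: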
158 | 
159 | ```bash
160 | $ tcpdump -nn -i eth0 host 169.254.11.2
161 | tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
162 | listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
163 | 19:39:47.487362 IP 192.168.2.222.8080 > 169.254.11.2.50714: Flags [S.], seq 4149799699, ack 251479303, win 28960, options [mss 1460,sackOK,TS val 676599263 ecr 12357303,nop,wscale 7], length 0
164 | 19:39:47.487405 IP 192.168.2.222.8080 > 169.254.11.2.50714: Flags [S.], seq 4149799699, ack 251479303, win 28960, options [mss 1460,sackOK,TS val 676599263 ecr 12357303,nop,wscale 7], length 0
165 | 19:39:48.487838 IP 192.168.2.222.8080 > 169.254.11.2.50714: Flags [S.], seq 4149799699, ack 251479303, win 28960, options [mss 1460,sackOK,TS val 676600264 ecr 12357303,nop,wscale 7], length 0
166 | 19:39:48.487868 IP 192.168.2.222.8080 > 169.254.11.2.50714: Flags [S.], seq 4149799699, ack 251479303, win 28960, options [mss 1460,sackOK,TS val 676600264 ecr 12357303,nop,wscale 7], length 0
167 | 19:39:49.569667 IP 192.168.2.222.8080 > 169.254.11.2.50714: Flags [S.], seq 4149799699, ack 251479303, win 28960, options [mss 1460,sackOK,TS val 676601346 ecr 12357303,nop,wscale 7], length 0
168 | 19:39:49.569699 IP 192.168.2.222.8080 > 169.254.11.2.50714: Flags [S.], seq 4149799699, ack 251479303, win 28960, options [mss 1460,sackOK,TS val 676601346 ecr 12357303,nop,wscale 7], length 0
169 | ```
170 | 
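171 | That is the situation in this figure:
172 | 
173 | ![lvs-dnat-without-snat.png](../images/lvs-dnat-without-snat.png)
174 | 
175 | To avoid this, packets that are DNATed to a non-local real server must carry the source IP 192.168.2.111, which means doing SNAT.
176 | 
177 | #### lvs and netfilter
178 | 
179 | Before looking at how lvs is implemented we need to know netfilter. Every packet on Linux passes through it, and iptables is one of the user-space tools for driving it. The kernel processes inbound and outbound packets in 5 stages, and netfilter offers a hook point at each stage where registered hook functions can filter and modify packets. In the figure below, local process is the upper protocol stack.
180 | 
181 | Below is the model of IPVS inside netfilter. IPVS is also built on the netfilter framework, and in this model it works on the `INPUT` chain, handling requests through the registered `ip_vs_in` hook function. Because the VIP is configured on the machine itself (in a conventional lvs nat the VIP lives on the NAT GW; here that is ourselves), our curl enters the `INPUT` chain, `ip_vs_in` matches, and the packet jumps straight to the `POSTROUTING` chain, skipping the iptables rules in between.
182 | 
183 | ![lvs-netfilter](https://raw.githubusercontent.com/zhangguanzhang/Image-Hosting/master/picgo/lvs-netfilter.png)
184 | 
185 | So the request flow we want is: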
186 | 
187 | ```
188 | # CIP: client IP    # RIP: real server IP
189 | 
190 |     CLIENT
191 |        | CIP:CPORT -> VIP:VPORT
192 |        |        ||
193 |        |        \/
194 |        | CIP:CPORT -> VIP:VPORT
195 |     LVS DNAT
196 |        | CIP:CPORT -> RIP:RPORT # after DNAT, SNAT is needed as well
197 |        |        ||
198 |        |        \/
199 |        | CIP:CPORT -> RIP:RPORT
200 |        +
201 |     REAL SERVER
203 | 
204 | lvs does the DNAT but not the SNAT, so we use iptables for the SNAT:
205 | 
206 | ```
207 | $ iptables -t nat -A POSTROUTING -m ipvs --vaddr 169.254.11.2 --vport 80 -j MASQUERADE
208 | ```
209 | 
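210 | Access still fails, and the capture shows the rule has no effect. NAT depends on `conntrack`, and IPVS does not record conntrack entries by default, so IPVS conntrack must be enabled before MASQUERADE can take effect.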
211 | 
212 | ```bash
213 | # let netfilter's conntrack state tracking apply to the IPVS module as well
214 | $ echo 1 >  /proc/sys/net/ipv4/vs/conntrack
215 | $  curl 169.254.11.2/www/test
216 | 192.168.2.111
217 | $ curl 169.254.11.2/www/test
218 | 192.168.2.222
219 | $ curl 169.254.11.2/www/test
220 | 192.168.2.111
221 | $ curl 169.254.11.2/www/test
222 | 192.168.2.222
223 | $ curl 169.254.11.2/www/test
224 | 192.168.2.111
225 | $ curl 169.254.11.2/www/test
226 | 192.168.2.222
227 | ```
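
To confirm that the `rr` scheduler really alternates between the two real servers, the responses can be tallied instead of read by eye. A minimal sketch follows; the helper name `tally` is made up here, and the sample input merely stands in for six curl responses like the ones above.

```bash
#!/bin/bash
# Count how many responses came from each backend (one response body per line on stdin).
tally() {
    sort | uniq -c | awk '{print $2, $1}'
}

# Stand-in for six curl responses; on the lab machine the input would come from:
#   for i in $(seq 6); do curl -s 169.254.11.2/www/test; done | tally
printf '%s\n' 192.168.2.111 192.168.2.222 192.168.2.111 \
              192.168.2.222 192.168.2.111 192.168.2.222 | tally
```

With two real servers of equal weight, both counts should come out equal or differ by one.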
228 | 
229 | A single SVC now works, but on reflection there is still a problem: every additional SVC needs another iptables MASQUERADE rule, which brings back the iptables matching cost. We can cut the number of rules by combining iptables marks with ipset.
230 | 
231 | ### Using ipset and the iptables mark
232 | 
233 | ![iptables_netfilter](https://raw.githubusercontent.com/zhangguanzhang/Image-Hosting/master/picgo/iptables_netfilter.png)
234 | 
235 | The five chains and four tables of iptables are shown above. First delete the rule added earlier:
236 | 
237 | ```bash
238 | $ iptables -t nat -D POSTROUTING -m ipvs --vaddr 169.254.11.2 --vport 80 -j MASQUERADE
239 | ```
240 | 
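241 | At home I run a software router, and when I looked at its iptables rules I found them well designed. In particular it reserves many chains just so that users can insert rules at the right place. Take these `INPUT` rules: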
242 | 
243 | ```bash
244 | -A INPUT -i eth0 -m comment --comment "!fw3" -j zone_lan_input
245 | ...
246 | -A zone_lan_input -m comment --comment "!fw3: Custom lan input rule chain" -j input_lan_rule
247 | -A zone_lan_input -m conntrack --ctstate DNAT -m comment --comment "!fw3: Accept port redirections" -j ACCEPT
248 | -A zone_lan_input -m comment --comment "!fw3" -j zone_lan_src_ACCEPT
249 | ```
250 | 
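251 | `zone_lan_src_ACCEPT` comes last and `zone_lan_input` comes first, so a user only has to insert rules into the `input_lan_rule` chain. Designing with multiple chains also makes life easier for others.
252 | Let's work the rule design out backwards. The final step is certainly `MASQUERADE`, so a `MASQUERADE` rule has to be created in the `POSTROUTING` chain of the nat table.
253 | 
254 | Before adding it, consider this: after lvs does the DNAT the packet goes to the `POSTROUTING` chain, and later there will be several SVCs. At that point the packet's `SRC IP` is the `VIP` of the respective SVC, as in the capture above: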
255 | 
256 | ```
257 | # Stages of a packet when no masq is applied and the request is scheduled to the
258 | # non-local real server (the failing case captured on the target machine above)
259 | 
260 | SRC:169.254.11.2:xxxx
261 | DST:169.254.11.2:80
262 |       ||
263 |       || source IP without masq snat in POSTROUTING
264 |       \/
265 | SRC:169.254.11.2:xxxx
266 | DST:192.168.2.222:8080
267 | 
268 | # second svc
269 | SRC:169.254.11.33:xxxx
270 | DST:169.254.11.33:80
271 |       ||
272 |       || source IP without masq snat in POSTROUTING
273 |       \/
274 | SRC:169.254.11.33:xxxx
275 | DST:192.168.2.222:8090
277 | 
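278 | The packet gets DNATed, and the source IP is not always the VIP: later this may be deployed alongside docker, where containers on the default bridge network also access the `SVC`, and then the `SRC IP` is not the `VIP` on the interface. So we match in the PREROUTING stage instead: if the dest IP and dest port belong to an svc, the packet is selected for masq snat.
279 | 
280 | At that point an ipset holding every `SVC_IP:SVC_PORT` can do the matching and set a mark, and the `POSTROUTING` chain then does `MASQUERADE` based on the mark.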
281 | 
282 | ```bash
283 | # PREROUTING stage
284 | 
285 | # provide an entry chain instead of adding rules directly to PREROUTING
286 | iptables -t nat -N ZGZ-SERVICES
287 | iptables -t nat -A PREROUTING -m comment --comment "zgz service portals" -j ZGZ-SERVICES
288 | 
289 | # the PREROUTING sub-chain matches against the ipset and jumps to the chain that sets the mark
290 | iptables -t nat -N ZGZ-MARK-MASQ
291 | # create the ipset that stores every `SVC_IP:SVC_PORT`
292 | ipset create ZGZ-CLUSTER-IP hash:ip,port -exist
293 | 
294 | # the chain dedicated to marking
295 | iptables -t nat -A ZGZ-MARK-MASQ -j MARK --set-xmark 0x2000/0x2000
296 | 
297 | # only packets matching svc ip:svc port jump to the marking chain
298 | iptables -t nat -A ZGZ-SERVICES -m comment --comment "zgz service cluster ip + port for masquerade purpose" -m set --match-set ZGZ-CLUSTER-IP dst,dst -j ZGZ-MARK-MASQ
299 | 
300 | 
301 | # POSTROUTING stage
302 | 
303 | # provide an entry chain instead of adding rules directly to POSTROUTING
304 | iptables -t nat -N ZGZ-SERVICES-POSTROUTING
305 | iptables -t nat -A POSTROUTING -m comment --comment "zgz postrouting rules" -j ZGZ-SERVICES-POSTROUTING
306 | # in the POSTROUTING stage, packets carrying the mark get snat
307 | iptables -t nat -A ZGZ-SERVICES-POSTROUTING -m comment --comment "zgz service traffic requiring SNAT" -m mark --mark 0x2000/0x2000 -j MASQUERADE
308 | ```
309 | 
310 | Then add the `SVC_IP:SVC_PORT` to our ipset:
311 | 
312 | ```bash
313 | ipset add ZGZ-CLUSTER-IP 169.254.11.2,tcp:80 -exist
314 | ```
315 | 
316 | The `ip,port` type of the ipset we created, combined with the `dst,dst` after `--match-set` in iptables, matches `DEST IP` and `DEST PORT` together. Some more examples:
317 | 
318 | | ipset type        | iptables match-set | Packet fields |
319 | | :---------------- | :----------------- | :------------ |
320 | | hash:net,port,net | src,dst,dst        | src IP CIDR address, dst port, dst IP CIDR address |
321 | | hash:net,port,net | dst,src,src        | dst IP CIDR address, src port, src IP CIDR address |
322 | | hash:ip,port,ip   | src,dst,dst        | src IP address, dst port, dst IP address |
323 | | hash:ip,port,ip   | dst,src,src        | dst IP address, src port, src IP address |
324 | | hash:mac          | src                | src mac address |
325 | | hash:mac          | dst                | dst mac address |
326 | | hash:ip,mac       | src,src            | src IP address, src mac address |
327 | | hash:ip,mac       | dst,dst            | dst IP address, dst mac address |
328 | | hash:ip,mac       | dst,src            | dst IP address, src mac address |
329 | 
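330 | Access still fails. The responses returned by round-robin curl show that the failures all happen when the request is scheduled to the non-local machine. This curl is locally generated traffic and only passes the `OUTPUT` chain, so when it reaches `POSTROUTING` it carries no mark and is not masqueraded. Logging confirms that it does go through the `OUTPUT` chain: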
331 | 
332 | ```bash
333 | $ echo 'kern.warning /var/log/iptables.log' >> /etc/rsyslog.conf
334 | $ systemctl restart rsyslog
335 | $ iptables -t nat -I OUTPUT -m set --match-set ZGZ-CLUSTER-IP dst,dst  -j LOG --log-prefix '**log-test**'
336 | $ curl 169.254.11.2/www/test
337 | ^C
338 | $ cat /var/log/iptables.log
339 | Sep 27 23:17:51 centos7 kernel: **log-test**IN= OUT=lo SRC=169.254.11.2 DST=169.254.11.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=44864 DF PROTO=TCP SPT=50794 DPT=80 WINDOW=43690 RES=0x00 SYN URGP=0 
340 | Sep 27 23:17:52 centos7 kernel: **log-test**IN= OUT=lo SRC=169.254.11.2 DST=169.254.11.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=2010 DF PROTO=TCP SPT=50796 DPT=80 WINDOW=43690 RES=0x00 SYN URGP=0 
341 | ```
342 | 
343 | The following rule is needed so that this traffic also goes through the svc check and gets marked:
344 | 
345 | ```bash
346 | iptables -t nat -A OUTPUT -m comment --comment "zgz service portals" -j ZGZ-SERVICES
347 | # ipvs apparently can do DNAT on the OUTPUT chain as well
348 | # In 2010 ipvs removed its NF_INET_POSTROUTING hook and added a new hook at NF_INET_LOCAL_OUT to support the IPVS LocalNode feature
349 | # https://github.com/torvalds/linux/commit/cf356d69db0afef692cd640917bc70f708c27f14
350 | # https://github.com/torvalds/linux/commit/cb59155f21d4c0507d2034c2953f6a3f7806913d
351 | ```
352 | 
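353 | ### Automating with keepalived
354 | 
355 | So far everything has been manual and there is no health check. keepalived can automate this.
356 | 
357 | #### Installing keepalived 2
358 | 
359 | The `keepalived` in the stock CentOS 7 repositories is very old, so install a newer build: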
360 | 
361 | ```bash
362 | yum install -y http://www.nosuchhost.net/~cheese/fedora/packages/epel-7/x86_64/cheese-release-7-1.noarch.rpm
363 | yum install -y keepalived
364 | # back up the shipped configuration file
365 | cp /etc/keepalived/keepalived.conf{,.bak}
366 | ```
367 | 
368 | #### Configuring keepalived
369 | 
370 | keepalived needs some configuration. Before changing anything, look at the defaults:
371 | 
372 | ```bash
373 | $ systemctl cat keepalived
374 | # /usr/lib/systemd/system/keepalived.service
375 | [Unit]
376 | Description=LVS and VRRP High Availability Monitor
377 | After=syslog.target network-online.target
378 | 
379 | [Service]
380 | Type=forking
381 | KillMode=process
382 | EnvironmentFile=-/etc/sysconfig/keepalived
383 | ExecStart=/usr/sbin/keepalived $KEEPALIVED_OPTIONS
384 | ExecReload=/bin/kill -HUP $MAINPID
385 | 
386 | [Install]
387 | WantedBy=multi-user.target
388 | $ cat /etc/sysconfig/keepalived
389 | # Options for keepalived. See `keepalived --help' output and keepalived(8) and
390 | # keepalived.conf(5) man pages for a list of all options. Here are the most
391 | # common ones :
392 | #
393 | # --vrrp               -P    Only run with VRRP subsystem.
394 | # --check              -C    Only run with Health-checker subsystem.
395 | # --dont-release-vrrp  -V    Dont remove VRRP VIPs & VROUTEs on daemon stop.
396 | # --dont-release-ipvs  -I    Dont remove IPVS topology on daemon stop.
397 | # --dump-conf          -d    Dump the configuration data.
398 | # --log-detail         -D    Detailed log messages.
399 | # --log-facility       -S    0-7 Set local syslog facility (default=LOG_DAEMON)
400 | #
401 | 
402 | KEEPALIVED_OPTIONS="-D"
403 | ```
404 | 
405 | Change `/etc/sysconfig/keepalived` to:
406 | 
407 | ```conf
408 | KEEPALIVED_OPTIONS="-D --log-console --log-detail --use-file=/etc/keepalived/keepalived.conf"
409 | ```
410 | 
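411 | We include sub-configuration files from the main configuration file. keepalived reloads when it receives `kill -HUP`, so adding an SVC later only takes dropping in a sub-configuration file and sending the signal.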
412 | 
413 | ```bash
414 | cat > /etc/keepalived/keepalived.conf << EOF
415 | ! Configuration File for keepalived
416 | 
417 | global_defs {
418 | 
419 | }
420 | # note: no keepalived configuration file may have the x (execute) permission
421 | include /etc/keepalived/conf.d/*.conf
422 | EOF
423 | mkdir -p /etc/keepalived/conf.d/
424 | ```
425 | 
426 | Next comes a script with two jobs: add the entries from a sub-configuration file to the ipset, and re-initialise everything each time keepalived starts or restarts. First the systemd part:
427 | 
428 | ```bash
429 | mkdir -p /usr/lib/systemd/system/keepalived.service.d
430 | cat > /usr/lib/systemd/system/keepalived.service.d/10.keepalived.conf << EOF
431 | [Service]
432 | ExecStartPre=/etc/keepalived/ipvs.sh
433 | EOF
434 | ```
435 | 
436 | Then write the script `/etc/keepalived/ipvs.sh`:
437 | 
438 | ```bash
439 | #!/bin/bash
440 | 
441 | set -e
442 | 
443 | dummy_if=svc
444 | CONF_DIR=/etc/keepalived/conf.d/
445 | 
446 | 
447 | function ipset_init(){
448 |     ipset create ZGZ-CLUSTER-IP hash:ip,port -exist
449 |     ipset flush ZGZ-CLUSTER-IP
450 |     local f ip port protocol
451 |     for f in $(find  ${CONF_DIR} -maxdepth 1 -type f -name '*.conf');do
452 |         awk '{if($1=="virtual_server"){printf $2" "$3" ";flag=1;};if(flag==1 && $1=="protocol"){print $2;flag=0}}' "$f" | while read ip port protocol;do
453 |             # insert the SVC IP and port into the ipset
454 |             ipset add ZGZ-CLUSTER-IP ${ip},${protocol,,}:${port} -exist
455 |             # add the SVC IP to the dummy interface
456 |             if ! ip r g ${ip} | grep -qw lo;then
457 |                 ip addr add ${ip}/32 dev ${dummy_if}
458 |             fi
459 |         done
460 |     done
461 | }
462 | 
463 | function create_Chain_in_nat(){
464 |     # delete use -X
465 |     local Chain option
466 |     option="-t nat --wait"
467 | 
468 |     for Chain in $@;do
469 |     if ! iptables $option -S | grep -Eq -- "-N\s+${Chain}$";then
470 |         iptables $option -N ${Chain}
471 |     fi
472 |     done
473 | }
474 | 
475 | function create_Rule_in_nat(){
476 |     local cmd='iptables -t nat --wait '
477 |     if ! ${cmd}  --check "$@" 2>/dev/null;then
478 |         ${cmd} -A "$@"
479 |     fi
480 | }
481 | 
482 | function iptables_init(){
483 |     create_Chain_in_nat ZGZ-SERVICES  ZGZ-SERVICES-POSTROUTING ZGZ-SERVICES-MARK-MASQ
484 | 
485 |     create_Rule_in_nat ZGZ-SERVICES-MARK-MASQ -j MARK --set-xmark 0x2000/0x2000
486 | 
487 |     create_Rule_in_nat ZGZ-SERVICES -m comment --comment "zgz service cluster ip + port for masquerade purpose" -m set --match-set ZGZ-CLUSTER-IP dst,dst -j ZGZ-SERVICES-MARK-MASQ
488 | 
489 |     create_Rule_in_nat PREROUTING -m comment --comment "zgz service portals" -j ZGZ-SERVICES
490 |     create_Rule_in_nat OUTPUT -m comment --comment "zgz service portals" -j ZGZ-SERVICES
491 | 
492 |     create_Rule_in_nat ZGZ-SERVICES-POSTROUTING -m comment --comment "zgz service traffic requiring SNAT" -m mark --mark 0x2000/0x2000 -j MASQUERADE
493 |     create_Rule_in_nat POSTROUTING -m comment --comment "zgz postrouting rules" -j ZGZ-SERVICES-POSTROUTING
494 | 
495 | }
496 | 
497 | function ipvs_svc_run(){
498 |   ip addr flush dev ${dummy_if}
499 |   ipset_init
500 |   iptables_init
501 |   echo 1 > /proc/sys/net/ipv4/vs/conntrack
502 | }
503 | # with no arguments this is the keepalived start-up path; it also accepts individual config file names as arguments
504 | function main(){
505 |   if [ ! -d /proc/sys/net/ipv4/conf/${dummy_if} ];then
506 |     ip link add ${dummy_if} type dummy
507 |   fi
508 | 
509 |   if [ "$#" -eq 0 ];then
510 |     ipvs_svc_run
511 |     return
512 |   fi
513 | 
514 |   local file fullFile ip port protocol
515 |   for file in $@;do
516 |     fullFile=${CONF_DIR}/$file
517 |       awk '{if($1=="virtual_server"){printf $2" "$3" ";flag=1;};if(flag==1 && $1=="protocol"){print $2;flag=0}}' "$fullFile" | while read ip port protocol;do
518 |           # insert the SVC IP and port into the ipset
519 |           ipset add ZGZ-CLUSTER-IP ${ip},${protocol,,}:${port} -exist
520 |           # add the SVC IP to the dummy interface
521 |           if ! ip r g ${ip} | grep -qw lo;then
522 |               ip addr add ${ip}/32 dev ${dummy_if}
523 |           fi
524 |       done
525 |   done
526 |   # reload keepalived
527 |   pkill --signal HUP keepalived
528 | }
529 | 
530 | main $@
531 | 
532 | ```
533 | 
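534 | That is the script: it reads the keepalived lvs files, adds each `VIP:PORT` to the ipset and puts the VIP on a `dummy` interface. Earlier the VIP was on eth0, but a restart of the business NIC could affect it. A dummy interface is like loopback: it is always up unless you bring it down yourself, so an SVC address configured on it is not affected by the state of physical interfaces. Remove the VIP from eth0 with `ip addr del 169.254.11.2/32 dev eth0`, then convert the earlier setup into a keepalived configuration file and test it: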
535 | 
536 | ```bash
537 | chmod a+x /etc/keepalived/ipvs.sh
538 | cat > /etc/keepalived/conf.d/test.conf << EOF
539 | 
540 | virtual_server 169.254.11.2 80 {
541 |     delay_loop 3
542 |     lb_algo rr
543 |     lb_kind NAT
544 |     protocol TCP
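545 |     alpha # off by default, in which case every rs is brought up when the daemon starts; with alpha on, every RS starts in the down state (health check failed) until its check passes, which helps avoid false positives at startup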
546 | 
547 |     real_server  192.168.2.111 8080 {
548 |         weight 1
549 |         HTTP_GET  {
550 |             url {
551 |               path /404
552 |               status_code 404
553 |             }
554 |             connect_port    8080
555 |             connect_timeout 2
556 |             retry 2
557 |             delay_before_retry 2
558 |         }
559 |     }
560 | 
561 |     real_server  192.168.2.222 8080 {
562 |         weight 1
563 |         HTTP_GET  {
564 |             url {
565 |               path /404
566 |               status_code 404
567 |             }
568 |             connect_port    8080
569 |             connect_timeout 2
570 |             retry 2
571 |             delay_before_retry 2
572 |         }
573 |     }
574 | }
575 | EOF
576 | ```
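
The parsing step of `ipvs.sh` can be tried on its own before touching ipset. The sketch below mirrors the awk logic of the script (take the IP and port from each `virtual_server` line and the first `protocol` line after it) and prints ipset members in the `ip,proto:port` form. The function name `svc_members` is made up for this example, and the trimmed-down config is a stand-in for files under `/etc/keepalived/conf.d/`.

```bash
#!/bin/bash
# Turn keepalived virtual_server blocks into ipset members ("ip,proto:port").
# Reads a keepalived config on stdin; only the first `protocol` line after
# each virtual_server line is used, like the awk one-liner in ipvs.sh.
svc_members() {
    awk '
        $1 == "virtual_server" { ip = $2; port = $3; want = 1 }
        want && $1 == "protocol" { printf "%s,%s:%s\n", ip, tolower($2), port; want = 0 }
    '
}

svc_members << 'EOF'
virtual_server 169.254.11.2 80 {
    lb_algo rr
    lb_kind NAT
    protocol TCP
    real_server 192.168.2.111 8080 {
        weight 1
    }
}
virtual_server 169.254.11.33 80 {
    lb_kind NAT
    protocol TCP
}
EOF
```

Each printed line is in the form that `ipset add ZGZ-CLUSTER-IP <member> -exist` expects. Lowercasing inside awk avoids relying on the bash 4 `${var,,}` expansion.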
577 | 
578 | Test it:
579 | 
580 | ```
581 | # first clear what was added by hand earlier
582 | ipvsadm --clear
583 | 
584 | systemctl daemon-reload
585 | systemctl restart keepalived
586 | 
587 | $ curl 169.254.11.2/www/test
588 | 192.168.2.222
589 | $ curl 169.254.11.2/www/test
590 | 192.168.2.111
591 | $ curl 169.254.11.2/www/test
592 | 192.168.2.222
593 | $ curl 169.254.11.2/www/test
594 | 192.168.2.111
595 | $ ip a s svc
596 | 4: svc: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
597 |     link/ether e6:a3:29:07:fa:57 brd ff:ff:ff:ff:ff:ff
598 |     inet 169.254.11.2/32 scope global svc
599 |        valid_lft forever preferred_lft forever
600 | ```
601 | 
602 | After stopping one of the web servers, the health check we configured removes that rs within a few seconds:
603 | 
604 | ```bash
605 | $ curl 169.254.11.2/www/test
606 | curl: (7) Failed connect to 169.254.11.2:80; Connection refused
607 | $ curl 169.254.11.2/www/test
608 | 192.168.2.111
609 | $ curl 169.254.11.2/www/test
610 | 192.168.2.111
611 | $ curl 169.254.11.2/www/test
612 | 192.168.2.111
613 | $ curl 169.254.11.2/www/test
614 | 192.168.2.111
615 | ```
616 | 
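617 | #### Related system configuration
618 | 
619 | After a later reboot the service was unreachable because the kernel modules had not been loaded. Use `systemd-modules-load` to load them at boot: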
620 | 
621 | ```bash
622 | cat  > /etc/modules-load.d/ipvs.conf << EOF
623 | ip_vs
624 | ip_vs_rr
625 | ip_vs_wrr
626 | ip_vs_sh
627 | 
628 | EOF
629 | 
630 | cat > /etc/sysctl.d/90.ipvs.conf << EOF 
631 | # https://github.com/moby/moby/issues/31208 
632 | # ipvsadm -l --timeout
633 | # works around the long-lived connection timeout problem in ipvs mode; any value below 900 will do
634 | net.ipv4.tcp_keepalive_time=600
635 | net.ipv4.tcp_keepalive_intvl=30
636 | net.ipv4.vs.conntrack=1
637 | # https://github.com/kubernetes/kubernetes/issues/70747 https://github.com/kubernetes/kubernetes/pull/71114
638 | net.ipv4.vs.conn_reuse_mode=0
639 | EOF
640 | ```
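
The "below 900" rule compares `tcp_keepalive_time` with the tcp timeout reported by `ipvsadm -l --timeout`. A small sketch of that comparison follows; the function name `keepalive_ok` is made up, and the sample line uses the 900/120/300 values that the comment in the sysctl file refers to.

```bash
#!/bin/bash
# Check that TCP keepalive probes start before the IPVS tcp timeout expires.
# $1: the output line of `ipvsadm -l --timeout`, $2: net.ipv4.tcp_keepalive_time
keepalive_ok() {
    local ipvs_tcp
    # the line ends with "<tcp> <tcpfin> <udp>", so tcp is the third field from the end
    ipvs_tcp=$(awk '{print $(NF-2)}' <<< "$1")
    [ "$2" -lt "$ipvs_tcp" ]
}

# On a real host the inputs would be:
#   keepalive_ok "$(ipvsadm -l --timeout)" "$(sysctl -n net.ipv4.tcp_keepalive_time)"
if keepalive_ok "Timeout (tcp tcpfin udp): 900 120 300" 600; then
    echo "ok: keepalive fires before the IPVS entry expires"
else
    echo "warning: idle connections may be dropped by IPVS first"
fi
```

With the Linux default `tcp_keepalive_time` of 7200 the check fails, which is the situation the sysctl file above is meant to avoid.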
641 | 
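642 | ### Some notes
643 | 
644 | Someone reported in https://github.com/kubernetes/kubernetes/issues/72236 that under ipvs, svcIP plus a port of the host (for example 22) is also reachable, which is unsafe. A [PR merged](https://github.com/kubernetes/kubernetes/pull/108460) on 2022/09/02 then added a `KUBE-IPVS-FILTER` chain so that svcIP:non-svcPort can no longer be reached, and the svcIP no longer answers ping either.
645 | 
646 | 
647 | ### Running it in docker
648 | 
649 | The `docker-compose` file is below; mount the script into the container yourself: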
650 | 
651 | ```yaml
652 | version: '3.5'
653 | services:
654 |   keepalived: 
655 |     image: 'registry.aliyuncs.com/zhangguanzhang/keepalived:v2.2.0'
656 |     hostname: 'keepalived-ipvs'
657 |     restart: unless-stopped
658 |     container_name: "keepalived-ipvs"
659 |     labels: 
660 |       - app=keepalived
661 |     network_mode: host
662 |     privileged: true
663 |     cap_drop:
664 |       - ALL
665 |     cap_add:
666 |       - NET_BIND_SERVICE
667 |     volumes:
668 |       - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
669 |       - /lib/modules:/lib/modules
670 |       - /run/xtables.lock:/run/xtables.lock
671 |       - ./conf.d/:/etc/keepalived/conf.d/
672 |       - ./keepalived.conf:/etc/keepalived/keepalived.conf
673 |       - ./always-initsh.d:/always-initsh.d
674 |       - ./tools:/etc/tools/
675 |     command: 
676 |       - --dont-fork
677 |       - --log-console
678 |       - --log-detail
679 |       - --use-file=/etc/keepalived/keepalived.conf
680 |     logging:
681 |       driver: json-file
682 |       options:
683 |         max-file: '3'
684 |         max-size: 20m
685 | ```
686 | 
687 | ## References
688 | 
689 | - [Interaction between LVS and netfilter](http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.filter_rules.html)
690 | - [lvs nat](http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.LVS-NAT.html#lvs_nat_intro)
691 | - [lvs-principle-and-source-analysis](https://github.com/liexusong/linux-source-code-analyze/blob/master/lvs-principle-and-source-analysis-part2.md)
692 | - [iptables series by 朱双印 (zsythink)](https://www.zsythink.net/archives/1199)
693 | 
694 | ## Links
695 | 
696 | - [iptables Service mode](04.02.01.md)
697 | - Next: [externalIPs](04.02.03.md)
698 | 


--------------------------------------------------------------------------------
/eBook/04.02.03.md:
--------------------------------------------------------------------------------
  1 | # externalIPs
  2 | 
  3 | ## iptables mode
  4 | 
  5 | Cluster information:
  6 | 
  7 | ```
  8 | NAME            STATUS   ROLES         AGE   VERSION    INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                  KERNEL-VERSION                    CONTAINER-RUNTIME
  9 | 10.xxx.xx.211   Ready    master,node   14d   v1.20.11   10.xxx.xx.211   <none>        Kylin Linux Advanced Server V10 (Sword)   4.19.90-24.4.v2101.ky10.aarch64   docker://20.10.22
 10 | 10.xxx.xx.213   Ready    master,node   14d   v1.20.11   10.xxx.xx.213   <none>        Kylin Linux Advanced Server V10 (Sword)   4.19.90-24.4.v2101.ky10.aarch64   docker://20.10.22
 11 | 10.xxx.xx.214   Ready    master,node   14d   v1.20.11   10.xxx.xx.214   <none>        Kylin Linux Advanced Server V10 (Sword)   4.19.90-24.4.v2101.ky10.aarch64   docker://20.10.22
 12 | ```
 13 | 
 14 | Deploy the following service:
 15 | 
 16 | ```
 17 | apiVersion: v1
 18 | kind: Pod
 19 | metadata:
 20 |   name: nginx
 21 |   labels:
 22 |     app: zgz
 23 | spec:
 24 |   containers:
 25 |   - name: nginx
 26 |     image: nginx:stable
 27 |     ports:
 28 |       - containerPort: 80
 29 |         name: http-web-svc
 30 | ---
 31 | apiVersion: v1
 32 | kind: Service
 33 | metadata:
 34 |   name: nginx-service
 35 | spec:
 36 |   selector:
 37 |     app: zgz
 38 |   ports:
 39 |   - name: name-of-service-port
 40 |     protocol: TCP
 41 |     port: 80
 42 |     targetPort: http-web-svc
 43 | ```
 44 | 
 45 | After deploying, pick a node and dump its iptables nat table, then set externalIPs to the IP of the first node, dump again and compare:
 46 | 
 47 | ```
 48 | $ iptables -t nat -S >  before-nat-pod.txt
 49 | $ kubectl patch svc nginx-service -p '{"spec":{"externalIPs":["10.xxx.xx.211"]}}'
 50 | $ iptables -t nat -S >  after-nat-pod.txt
 51 | # compare
 52 | $ diff <(sort before-nat-pod.txt) <(sort after-nat-pod.txt)
 53 | 149a150,152
 54 | > -A KUBE-SERVICES -d 10.xxx.xx.211/32 -p tcp -m comment --comment "default/nginx-service:name-of-service-port external IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
 55 | > -A KUBE-SERVICES -d 10.xxx.xx.211/32 -p tcp -m comment --comment "default/nginx-service:name-of-service-port external IP" -m tcp --dport 80 -m addrtype --dst-type LOCAL -j KUBE-SVC-IQGXNJVVP26VHMIN
 56 | > -A KUBE-SERVICES -d 10.xxx.xx.211/32 -p tcp -m comment --comment "default/nginx-service:name-of-service-port external IP" -m tcp --dport 80 -m physdev ! --physdev-is-in -m addrtype ! --src-type LOCAL -j KUBE-SVC-IQGXNJVVP26VHMIN
 57 | # without sort the diff is cluttered with duplicates caused by reordering, while sorting scrambles the order of the rules above, so just grep for them
 58 | $ grep -w 'name-of-service-port external IP' after-nat-pod.txt
 59 | -A KUBE-SERVICES -d 10.xxx.xx.211/32 -p tcp -m comment --comment "default/nginx-service:name-of-service-port external IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
 60 | -A KUBE-SERVICES -d 10.xxx.xx.211/32 -p tcp -m comment --comment "default/nginx-service:name-of-service-port external IP" -m tcp --dport 80 -m physdev ! --physdev-is-in -m addrtype ! --src-type LOCAL -j KUBE-SVC-IQGXNJVVP26VHMIN
 61 | -A KUBE-SERVICES -d 10.xxx.xx.211/32 -p tcp -m comment --comment "default/nginx-service:name-of-service-port external IP" -m tcp --dport 80 -m addrtype --dst-type LOCAL -j KUBE-SVC-IQGXNJVVP26VHMIN
 62 | 
 63 | # KUBE-SVC-IQGXNJVVP26VHMIN is the iptables load-balancing entry chain of the svc
 64 | ```
 65 | 
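 66 | The diff shows three new iptables rules. All three start by matching `destination IP and port 10.xxx.xx.211:80`:
 67 | - The first jumps to the chain that marks packets for masquerading
 68 | - `-m physdev ! --physdev-is-in` means the packet did not come in through a bridge port, that is, it came in from an external physical NIC; `-m addrtype ! --src-type LOCAL` means the source address is not an IP of this machine; the rule ends by sending the packet to the svc chain, which DNATs it to a podIP
 69 | - The third ends with `-m addrtype --dst-type LOCAL`, meaning the destination IP is on this machine. It covers the case where externalIPs is set to a NIC IP of the machine and intercepts that traffic for the svc.
 70 | 
 71 | Since all of them match ip:80, iptables mode works purely at layer 4, and setting externalIPs to the IP of a cluster machine causes no real trouble. The trouble is with ipvs mode, covered next.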
 72 | 
 73 | ## ipvs mode
 74 | 
 75 | Clean up the svc from above:
 76 | ```
 77 | kubectl delete pod nginx
 78 | kubectl delete svc nginx-service
 79 | ```
 80 | 
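 81 | ### Non-k8s IP
 82 | 
 83 | After switching kube-proxy to ipvs mode, pick an unused IP that does not belong to any k8s node but is in the same layer-2 LAN as the nodes, use it as `externalIPs`, and repeat the deployment steps above.
 84 | 
 85 | Under ipvs, kube-proxy configures `svcIP/32` on the `dummy` interface `kube-ipvs0`, and it puts `externalIPs` on `kube-ipvs0` as well. Now an experiment: assuming externalIPs is set to `10.xxx.xx.250`, run the following on any k8s machine: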
 86 | 
 87 | ```
 88 | $ docker run --rm -d --net host --name test nginx:alpine
 89 | 
 90 | $ curl -I localhost
 91 | HTTP/1.1 200 OK
 92 | $ curl -I 10.xxx.xx.250
 93 | HTTP/1.1 200 OK
 94 | ```
 95 | 
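 96 | This works too. kube-proxy has configured `externalIPs/32` on the `kube-ipvs0` dummy interface, so the IP is local to the machine with a 32-bit mask, and because the nginx process binds `0.0.0.0`, a request to `externalIPs:80` also reaches nginx.
 97 | 
 98 | ### A k8s machine IP
 99 | 
100 | You might think: if nginx listens only on the node's own NIC IP, `externalIPs:80` becomes unreachable. That is true, but `externalIPs:<any other port>` is still directed to the local machine, and since nothing listens there the request eventually times out. Now suppose you set `externalIPs` to the IP of the first master node:
101 | 
102 | - On the first master, routes to its own IP all go to kube-ipvs0 (the /32 mask is the longest match), so the node loses contact with the outside.
103 | - On the other masters, traffic to master1:6443 and to etcd on 2379 is directed to the local machine, so data gets mixed up. The same holds for every other port: if something listens locally the data is mixed up, and if nothing listens the request times out.
104 | - kubelet and kubectl on nodes other than the first master time out whenever their traffic ends up load-balanced to the first master.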
105 | 
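106 | If this has already happened:
107 | 
108 | - Log in to every machine through a tty or the virtualization platform's VNC console, stop kube-proxy, then run `ip addr del <externalIPs>/32 dev kube-ipvs0`
109 | - If kubectl works afterwards, edit or delete the Service
110 | - If it does not, delete the Service with etcdctl
111 | - Finally, start the stopped processes again
112 | 
113 | ### How to use externalIPs in ipvs mode
114 | 
115 | Use an external IP that is not in use, then add a route on the network devices that steers this IP to the k8s machines. Alternatively, have another process or tool announce the `externalIPs` address via ARP so that external machines accessing it reach a k8s machine.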
116 | 
117 | ## References
118 | 
119 | - https://ipset.netfilter.org/iptables-extensions.man.html
120 | 
121 | ## Links
122 | 
123 | - [IPVS Service mode](04.02.02.md)
124 | - Next: [NodePort](04.03.md)
125 | 


--------------------------------------------------------------------------------
/eBook/04.02.md:
--------------------------------------------------------------------------------
 1 | # Service 
 2 | 
 3 | ## How it works
 4 | 
 5 | A K8S Service DNATs traffic sent from a node to `ServiceIP:ServicePort` into the address of a real application replica, `POD_IP:POD_PORT`, as shown below:
 6 | 
 7 | ![pod-to-service](../images/pod-to-service.gif)
 8 | 
 9 | The processes involved in a K8S Service and the overall workflow are as follows:
10 | 
11 | ![K8s-service-process.png](../images/K8s-service-process.png)
12 | 
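13 | - kubelet: creates, deletes, and updates Pods, and probes them (http, tcp, exec). If a probe fails, the Pod is considered unhealthy and this is reported to kube-apiserver
14 | - kube-proxy: when it sees a Pod become ready, get deleted, or turn unhealthy, it updates the iptables/ipvs rules on its own node
15 | 
16 | The command below shows the kube-proxy mode of the current node. The iptables and ipvs modes each have their own pros, cons, and bugs, so choose according to your needs.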
17 | 
18 | ```bash
19 | $ curl localhost:10249/proxyMode
20 | iptables
21 | ```
22 | 
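23 | Although a Service is layer 4 (TCP/UDP/SCTP) load balancing done in kernel space, it still needs an IP address. Because traffic is intercepted at layer 4, this IP is never actually routed, which is why a Service IP is also called a virtual IP (VIP). Some official tutorials use `10.96.0.0/12` as the default CIDR. If you change it, mind the subnet arithmetic: `10.95.0.0/12`, for example, actually falls inside `10.80.0.0/12`. On startup kube-apiserver creates a Service whose IP is the first host address of the Service CIDR: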
24 | 
25 | ```bash
26 | $ kubectl describe svc kubernetes
27 | Name:              kubernetes
28 | Namespace:         default
29 | Labels:            component=apiserver
30 |                    provider=kubernetes
31 | Annotations:       <none>
32 | Selector:          <none>
33 | Type:              ClusterIP
34 | IP Family Policy:  SingleStack
35 | IP Families:       IPv4
36 | IP:                10.96.0.1
37 | IPs:               10.96.0.1
38 | Port:              https  443/TCP
39 | TargetPort:        6443/TCP
40 | Endpoints:         192.168.2.111:6443,192.168.2.112:6443,192.168.2.113:6443
41 | Session Affinity:  None
42 | Events:            <none>
43 | ```
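
The subnet arithmetic mentioned above can be double-checked with a short Python sketch. It uses only the standard `ipaddress` module; the helper names `containing_network` and `first_host` are made up for this illustration and are not part of any Kubernetes tooling.

```python
import ipaddress


def containing_network(cidr: str):
    """Return the network that an address/prefix pair really belongs to."""
    # strict=False masks off the host bits instead of raising ValueError
    return ipaddress.ip_network(cidr, strict=False)


def first_host(cidr: str):
    """Return the first host address of the CIDR.

    Per the text above, this is the IP given to the `kubernetes` Service.
    """
    return next(iter(containing_network(cidr).hosts()))


print(containing_network("10.95.0.0/12"))  # 10.80.0.0/12
print(first_host("10.96.0.0/12"))          # 10.96.0.1
```

Running a planned Service CIDR through such a check before deployment makes it obvious when the value you typed is not the network boundary you intended.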
44 | 
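45 | The backend of a Service is its Endpoints, and the Endpoint IPs are usually Pod IPs or, as with `kubernetes`, machine IPs. If you want to map an external IP to an in-cluster Service domain name, create a Service without a `selector` and an Endpoints object with the same name. Also, traffic does not have to go through a Service: many services have a registration and discovery mechanism where `Pod_IP:Port` is registered and other applications connect to it directly.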
46 | 
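47 | ## Some pitfalls
48 | 
49 | - A Pod may fail to reach its own Service (especially on K8S instances run by someone else or by a cloud provider). Fix it by putting cni0 into promiscuous mode or by configuring `hairpinMode` on kubelet
50 | - Load balancing through a Service also depends on the application layer. gRPC, for example (based on HTTP/2, with a single TCP handshake), keeps a long-lived connection, so requests are not spread evenly across the Pods behind the Service
51 | - Install conntrack-tools on every node so that the `conntrack` binary is available; kube-proxy uses it for some operations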
52 | 
53 | ## Links
54 | 
55 | - [K8S overlay concepts](04.01.md)
56 | - Next: [iptables Service mode](04.02.01.md)
57 | 


--------------------------------------------------------------------------------
/eBook/04.03.md:
--------------------------------------------------------------------------------
 1 | # NodePort
 2 | 
 3 | NodePort is similar to `docker -p`, but newer kube-proxy versions (1.21.14, >=1.22.10, >=1.23.7, regardless of mode) dropped the userspace port listener that kube-proxy used to open for each node port. kube-proxy [>=1.26 adds the `--iptables-localhost-nodeports=true` switch](https://github.com/kubernetes/kubernetes/commit/bef207003148dfe061672269003d4727afb5170c), which enables `net.ipv4.conf.all.route_localnet=1` to work together with the iptables DNAT rules
 4 | - https://github.com/kubernetes/kubernetes/issues/106713
 5 | - https://github.com/kubernetes/kubernetes/issues/100643
 6 | - Removal of the kube-proxy userspace port listener: https://github.com/kubernetes/kubernetes/pull/108496
 7 | - https://github.com/kubernetes/kubernetes/issues/103860
 8 | - https://github.com/kubernetes/kubernetes/pull/112133/files
 9 | 
10 | ## SNAT and externalTrafficPolicy
11 | 
12 | By default a NodePort Service has `externalTrafficPolicy` set to Cluster, and traffic flows as follows:
13 | 
14 | ![nodeport-cluster.png](../images/nodeport-cluster.png)
15 | 
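16 | If the node has no Pod, the packet is DNATed and sent to another node. It is SNATed at the same time, so the Pod replies to this node, which then returns the response to the client.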
17 | 
18 | ## externalTrafficPolicy=Local
19 | 
20 | In Cluster mode, if the local node has no Pod, the packet is forwarded to the node that does and is SNATed on the way, which loses the source IP:
21 | 
22 | ![tutor-service-nodePort](https://kubernetes.io/zh-cn/docs/images/tutor-service-nodePort-fig01.svg)
23 | 
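24 | Without SNAT there would be a problem:
25 | 1. An external client accesses `Node2:30001`, so the source IP of the packet is the external `Client IP`
26 | 2. Node2 forwards the packet. Because the source IP of the packet that finally arrives is still `Client IP`, the receiving side replies to `Client IP` along its own route. The client is then confused: it contacted `Node2:30001`, so why is something other than Node2 replying? That reply is an invalid TCP packet and may be dropped by a network device or by the kernel network stack.
27 | 3. SNAT avoids this: from the client's point of view, both the request and the response always involve `Node2:30001`
28 | 
29 | Local mode solves the loss of the source IP caused by SNAT. It works as shown below: traffic to a node without a Pod fails, and a node with Pods sends traffic only to its local Pods. Set `externalTrafficPolicy=Local` and rely on the LB's health checks to remove nodes that have no Pod: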
30 | 
31 | ![nodeport-local.png](../images/nodeport-local.png)
32 | 
33 | ## hostPort
34 | 
35 | This is similar to NodePort. The rules in the nat table:
36 | 
37 | ```bash
38 | $ iptables -w -t nat -S | grep HOSTPORT
39 | -A PREROUTING -m addrtype --dst-type LOCAL -j CNI-HOSTPORT-DNAT
40 | ...
41 | ```
42 | 
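43 | The target behind these rules, however, is the Pod IP on the local machine directly; after all, the attribute belongs to the Pod, see `kubectl explain pod.spec.containers.ports.hostPort`.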
44 | 
45 | ## Links
46 | 
47 | - [externalIPs](04.02.03.md)
48 | - Next: [Probes and some pitfalls](04.04.md)
49 | 


--------------------------------------------------------------------------------
/eBook/04.04.md:
--------------------------------------------------------------------------------
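 1 | # Probes and some pitfalls
 2 | 
 3 | Probes are executed by the kubelet on a node against the Pods on that same node. There are startup (Startup, >=1.17), liveness (Liveness), and readiness (Readiness) probes. When a probe misbehaves, you can also troubleshoot by manually sending the corresponding request to the Pod from its node.
 4 | 
 5 | ## Some pitfalls
 6 | 
 7 | An http probe requires a status code in the range [200<= code < 400](https://github.com/kubernetes/kubernetes/blob/49ff25507455cae99cb1b681a278db617ac979c1/pkg/probe/http/http.go#L111). An application without a dedicated, authentication-free `/health` endpoint that answers 401 will fail the probe. The kubelet's http user-agent is `kube-probe/v{version}`.
 8 | 
 9 | ### http probes
10 | 
11 | Always configure a readiness probe. After the kubelet's readiness probe succeeds, it updates the Pod to `status.conditions[][@type=Ready].status=true`, and only after kube-proxy has watched that event is the POD_IP added to the endpoints behind the Service (otherwise rolling updates produce 503s). Most people configure an http probe, and on some older versions you may hit an error like this: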
12 | 
13 | ```
14 | Readiness probe failed: Get "http://xxxx/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
15 | ```
16 | 
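17 | In most cases this happens because the kubelet opens a new connection for each http probe instead of reusing connections (upgrading to >=1.27 is recommended). For other causes and details see: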
18 | 
19 | - https://github.com/kubernetes/kubernetes/issues/89898
20 | - https://github.com/kubernetes/kubernetes/issues/89898#issuecomment-1383207322
21 | 
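22 | ### Graceful shutdown
23 | 
24 | How a Pod is destroyed:
25 | 
26 | 1. When the kubelet receives the instruction to delete a Pod, it first runs the PreStop Hook if the Pod has one; otherwise it sends SIGTERM straight to process 1 of the Pod
27 | 2. kube-proxy receives the Pod deletion at the same time and removes the Pod IP from the Service backends. This runs in parallel with step 1. As the number of services in a cluster grows, kube-proxy needs 3 to 8 seconds to finish removing the Pod IP. In other words, after at most 8 seconds new connections to the Pod's Service are no longer sent to the Pod being deleted. Until then new connections may still arrive, and established connections are not affected.
28 | 3. The kubelet waits until the termination grace period (`terminationGracePeriodSeconds`, 30 seconds by default) has elapsed. If the Pod still has not exited, it sends SIGKILL to process 1 of the Pod to kill it.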
29 | 
30 | ![pod-delete-process](../images/pod-delete-process.png)
31 | 
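32 | If the Pod exits first and kube-proxy removes the Pod IP later, requests fail, so a PreStop Hook is needed to avoid this:
33 | 1. Let PreStop sleep 10s. Before the sleep ends, kube-proxy has already removed the Pod IP from the Service backends, so new connections are no longer balanced to this Pod
34 | 2. The PID 1 process in the container then handles SIGTERM. No new connections arrive, so it only has to finish the existing ones before exiting gracefully
35 | 3. For workloads with many replicas, use an absolute number rather than a percentage. With 40 replicas, for example, 25% takes 10 offline per rolling step, and the load on the remaining 30 spikes
36 | 4. If you use istio, the [documentation](https://istio.io/latest/docs/reference/config/istio.mesh.v1alpha1/) gives `terminationDrainDuration` a default of 5s, so after 5s the Envoy in the Pod exits and the Pod can no longer reach any IP. This hurts especially when the graceful shutdown has to talk to middleware, so configure istio with `terminationDrainDuration > terminationGracePeriodSeconds`, as follows: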
37 | 
38 | ```yaml
39 | ...
40 |   template:
41 |     metadata:
42 |       annotations:
43 |         proxy.istio.io/config: |
44 |           terminationDrainDuration: 40s
45 | # For example, with terminationGracePeriodSeconds at 30s, a value two or three seconds longer is enough. Besides the annotation, this can also be set through meshConfig
46 | ```
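
The replica arithmetic from item 3 above can be sketched in a few lines of Python. This is a rough illustration, not Kubernetes code: it assumes the percentage refers to the Deployment rolling-update fields `maxUnavailable` (rounded down) and `maxSurge` (rounded up), and the function name `rolling_budget` is hypothetical.

```python
from typing import Tuple


def rolling_budget(replicas: int, percent: int) -> Tuple[int, int]:
    """Return (max_unavailable, max_surge) as absolute Pod counts.

    Assumption: maxUnavailable rounds down and maxSurge rounds up
    when they are given as percentages.
    """
    max_unavailable = replicas * percent // 100
    max_surge = (replicas * percent + 99) // 100  # integer ceiling
    return max_unavailable, max_surge


unavailable, surge = rolling_budget(40, 25)
print(f"up to {unavailable} Pods offline at once, {40 - unavailable} keep serving")
```

With 40 replicas and 25% the budget is 10 Pods, which matches the example in the list; writing an explicit small number instead keeps the step size independent of the replica count.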
47 | 
48 | ## Links
49 | 
50 | - [NodePort](04.03.md)
51 | - Next: [CNI and cni-plugins](04.05.md)
52 | 


--------------------------------------------------------------------------------
/eBook/04.05.md:
--------------------------------------------------------------------------------
 1 | # CNI and cni-plugins
 2 | 
 3 | ## CNI
 4 | 
 5 | CNI is a specification for implementing the Pod overlay network. Tools that implement it include flannel, calico, Cilium, and [more](https://kubernetes.io/zh-cn/docs/concepts/cluster-administration/addons/#networking-and-network-policy)
 6 | 
 7 | Cross-node Pod traffic over the overlay looks like this:
 8 | 
 9 | ![pod-to-annother-node](../images/pod-to-annother-node.gif)
10 | 
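11 | When a cluster is built on openstack with flannel or calico in IPIP mode, cross-node traffic can still fail even after tcp/udp has been allowed by default. The traffic is classified separately as the IPIP protocol rather than as ordinary IP traffic, so allow it by protocol number 94, and if that does not work try 4 (IPv4 encapsulation):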
12 | 
13 | ![penstack-dashboard-secruity-addrule](https://cdn.jsdelivr.net/gh/zhangguanzhang/Image-Hosting/picgo/openstack-dashboard-secruity-addrule.png)
14 | 
15 | [Protocol number reference](https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml)
16 | 
17 | ## cni-plugins
18 | 
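19 | [cni-plugins](https://github.com/containernetworking/plugins) are responsible for creating and configuring the network pieces (interfaces, IPs, routes, and so on) when a container starts. Do not follow old flannel deployment tutorials: in early setups the kubelet had no CNI configured, all Pod containers were attached to docker0, and the docker0 subnet had to be configured on every node. There is no need to study cni-plugins in depth as a beginner; that only becomes necessary once you develop container networking yourself.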
20 | 
21 | ## Links
22 | 
23 | - [Probes and some pitfalls](04.04.md)
24 | - Next: [CNI flannel](04.06.md)
25 | 


--------------------------------------------------------------------------------
/eBook/04.06.md:
--------------------------------------------------------------------------------
  1 | # flannel
  2 | 
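  3 | flannel is the simplest CNI implementation (its drawback is the lack of NetworkPolicy support). It mostly relies on capabilities and kernel modules that ship with Linux, and it runs as a single process. Learn it first to understand the basics; other CNIs you study later will look familiar in many places.
  4 | 
  5 | ## Backend storage
  6 | 
  7 | flannel supports etcd v2 and the kubernetes API. Unless etcd is very old, it stores data through the v3 API by default, and the etcd project advises against mixing v2 and v3 data, so K8S deployments use the latter (option `--kube-subnet-mgr`) as the backend store.
  8 | 
  9 | ## Backend types
 10 | 
 11 | - VXLAN: a tunnel protocol that forwards layer 2 frames over UDP
 12 | - host-gw: adds forwarding routes on the k8s nodes; usable only when all nodes share the same layer 2 network.
 13 | - IPIP: another native Linux tunnel. It is leaner than VXLAN but supports only IPv4 unicast traffic
 14 | - WireGuard: a protocol built into newer kernels (>=5.6) as a module, used here to build the tunnel network
 15 | - IPSec: a tunnel protocol that encrypts and authenticates IP traffic
 16 | - Cloud provider backends: AWS, AliVPC (AliCloud VPC Backend for Flannel), GCE
 17 | - UDP: a custom tunnel protocol that forwards layer 3 packets over UDP in user space; not recommended.
 18 | 
 19 | The first three are the most common. To check the current flannel mode, run this on each node: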
 20 | 
 21 | ```
 22 | $ cat /run/flannel/subnet.env 
 23 | FLANNEL_NETWORK=10.244.0.0/16 # the Pod overlay network
 24 | FLANNEL_SUBNET=10.244.0.1/24 # CIDR allocated to this node; the /24 comes from the default node-mask=24 set by kube-controller-manager
 25 | FLANNEL_MTU=1450 # interface MTU
 26 | FLANNEL_IPMASQ=true # whether SNAT is enabled for Pod traffic
 27 | ```
 28 | 
 29 | The configuration comes from a ConfigMap:
 30 | 
 31 | ```bash
 32 | $ kubectl -n kube-system get cm kube-flannel-cfg -o yaml
 33 | apiVersion: v1
 34 | data:
 35 |   cni-conf.json: |
 36 |     {
 37 |       "name": "cbr0",
 38 |       "cniVersion": "0.3.1",
 39 |       "plugins": [
 40 |         {
 41 |           "type": "flannel",
 42 |           "delegate": {
 43 |             "hairpinMode": true,
 44 |             "isDefaultGateway": true
 45 |           }
 46 |         },
 47 |         {
 48 |           "type": "portmap",
 49 |           "capabilities": {
 50 |             "portMappings": true
 51 |           }
 52 |         }
 53 |       ]
 54 |     }
 55 |   net-conf.json: |
 56 |     {
 57 |       "Network": "10.244.0.0/16",
 58 |       "Backend": {
 59 |         "Type": "vxlan", #<----- the mode, set globally
 60 |         "Port": 8475
 61 |       }
 62 |     }
 63 | kind: ConfigMap
 64 | ...
 65 | ```
 66 | 
 67 | ### host-gw mode and SNAT rules
 68 | 
 69 | ![flannel-host-gw](../images/flannel-host-gw.png)
 70 | 
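 71 | host-gw was introduced in the chapter on cross-node docker networking. In this mode the flanneld process adds local routes that forward to the other nodes, and it also adds iptables rules like the following on every node. Taking the first node as an example: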
 72 | 
 73 | ```
 74 | $ iptables -t nat -S | grep FLANNEL
 75 | -N FLANNEL-POSTRTG
 76 | -A POSTROUTING -m comment --comment "flanneld masq" -j FLANNEL-POSTRTG
 77 | # kube-proxy's masq mark; returning here avoids a second SNAT
 78 | -A FLANNEL-POSTRTG -m mark --mark 0x4000/0x4000 -m comment --comment "flanneld masq" -j RETURN
 79 | # No SNAT when local Pods access Pod IPs on other nodes
 80 | -A FLANNEL-POSTRTG -s 10.244.0.0/24 -d 10.244.0.0/16 -m comment --comment "flanneld masq" -j RETURN
 81 | # No SNAT when Pods on other nodes access local Pods
 82 | -A FLANNEL-POSTRTG -s 10.244.0.0/16 -d 10.244.0.0/24 -m comment --comment "flanneld masq" -j RETURN
 83 | # No SNAT when another host manually routes this node's Pod CIDR via the node IP
 84 | -A FLANNEL-POSTRTG ! -s 10.244.0.0/16 -d 10.244.0.0/24 -m comment --comment "flanneld masq" -j RETURN
 85 | # SNAT when a Pod IP accesses the outside (non-multicast), e.g. other node IPs and public IPs, just like a bridged docker container needs SNAT to reach the outside
 86 | -A FLANNEL-POSTRTG -s 10.244.0.0/16 ! -d 224.0.0.0/4 -m comment --comment "flanneld masq" -j MASQUERADE
 87 | # SNAT when another host manually routes the cluster Pod CIDR to this node, since forwarding is enabled here
 88 | -A FLANNEL-POSTRTG ! -s 10.244.0.0/16 -d 10.244.0.0/16 -m comment --comment "flanneld masq" -j MASQUERADE
 89 | ```
 90 | 
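 91 | Some problems encountered in practice:
 92 | 
 93 | - [Fixing cross-node Pod connectivity with flannel host-gw on Alibaba Cloud](https://zhangguanzhang.github.io/2020/06/23/host-gw-in-aliyun/)
 94 | - On some VMs backed by openstack, host-gw fails because openstack components validate the MAC and IP addresses of packets leaving the VM's NIC. The packet's MAC address belongs to the NIC but its IP address does not, so it fails validation and is dropped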
 95 | 
 96 | 
 97 | ### flannel vxlan mode
 98 | 
 99 | #### How vxlan works
100 | 
101 | Suppose there are two nodes, Node1 and Node2, and PodA on Node1 wants to talk to PodB on Node2. The communication works as shown below:
102 | 
103 | ![flannel-vxlan-node](../images/flannel-vxlan-node.png)
104 | 
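105 | The whole process:
106 | 
107 | - Sender: PodA runs `ping 10.244.1.21`. The ICMP packet crosses the cni0 bridge and reaches the host routing table, which sends it to the flannel.1 interface. That interface is a VXLAN device, so the kernel uses flannel.1's configuration to do the VXLAN encapsulation: the original L2 frame is wrapped in a VXLAN UDP packet, which is in turn wrapped in ordinary layer 3 and layer 2 headers and sent out through eth0.
108 | - Receiver: Node2 receives the UDP packet on port 8472 and it is handled by the kernel, which looks up flannel.1's VXLAN information and decapsulates it. Based on the destination IP of the original packet, the kernel routes it through cni0 to PodB.
109 | 
110 | > Many articles online claim that the flanneld process does the vxlan encapsulation and decapsulation. That is wrong; the kernel does it. You can verify this by running `kill -STOP <flanneld_pid>` on every node and then testing cross-node networking.
111 | 
112 | The extra encapsulation costs 50 bytes, so [the flanneld source subtracts 50 from the MTU](https://github.com/flannel-io/flannel/blob/bb077808679ff71527d5a9a0ff7ffbf7d108dad4/pkg/backend/vxlan/vxlan_network.go#L44). It is like Russian nesting dolls: to fit the smaller doll inside the larger one, the outside of the smaller doll must be <= the inside of the larger one.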
113 | ![vxlan-mtu](../images/vxlan-mtu.png)
114 | 
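
Where the 50 bytes come from can be shown with a little arithmetic. The sketch below assumes an IPv4 underlay without IP options and a standard 1500-byte MTU on the host NIC; the constant and function names are invented for this illustration.

```python
# Header sizes in bytes, assuming an IPv4 underlay without IP options
OUTER_IPV4 = 20      # outer IP header added by the host
OUTER_UDP = 8        # UDP header carrying the VXLAN payload
VXLAN_HEADER = 8     # VXLAN header (flags + VNI)
INNER_ETHERNET = 14  # Ethernet header of the encapsulated L2 frame


def vxlan_mtu(underlay_mtu: int) -> int:
    """MTU left for Pod traffic once the VXLAN overhead is subtracted."""
    overhead = OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET
    return underlay_mtu - overhead


def ipip_mtu(underlay_mtu: int) -> int:
    """MTU left for Pod traffic with IPIP, which adds only one IP header."""
    return underlay_mtu - OUTER_IPV4


print(vxlan_mtu(1500))  # 1450
print(ipip_mtu(1500))   # 1480
```

If the underlay MTU is smaller than 1500 (common on some clouds), the same subtraction applies to that smaller value.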
115 | flannel.1 is the vxlan VTEP (virtual tunnel endpoint). The flanneld process writes its own MAC address into an annotation on the K8S node:
116 | 
117 | ```bash
118 | $ kubectl get node -o yaml | grep flannel.alpha
119 |       flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"02:2f:70:44:d7:3f"}'
120 |       flannel.alpha.coreos.com/backend-type: vxlan
121 |       flannel.alpha.coreos.com/kube-subnet-manager: "true"
122 |       flannel.alpha.coreos.com/public-ip: 192.168.50.2
123 |       flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"26:4a:e8:aa:01:d8"}'
124 |       flannel.alpha.coreos.com/backend-type: vxlan
125 |       flannel.alpha.coreos.com/kube-subnet-manager: "true"
126 |       flannel.alpha.coreos.com/public-ip: 192.168.50.3
127 |       flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"46:92:51:8e:a2:c9"}'
128 |       flannel.alpha.coreos.com/backend-type: vxlan
129 |       flannel.alpha.coreos.com/kube-subnet-manager: "true"
130 |       flannel.alpha.coreos.com/public-ip: 192.168.50.4
131 | $ ip link show flannel.1
132 | 50: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default 
133 |     link/ether 02:2f:70:44:d7:3f brd ff:ff:ff:ff:ff:ff
134 | ```
135 | 
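136 | The flanneld process then fetches the flannel.1 MAC address and `public-ip` of the other nodes from K8S and:
137 | - maintains its own L2 neighbor entries, so the kernel can use these MAC addresses when building the vxlan packet
138 | - maintains the fdb forwarding table, which tells the kernel which host IP the encapsulated vxlan packet for another flannel node should be sent to. This does not depend on the real host MAC addresses, so it works across layer 3.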
139 | 
140 | ```bash
141 | $ ip neigh show dev flannel.1
142 | 10.244.2.0 lladdr 46:92:51:8e:a2:c9 PERMANENT
143 | 10.244.1.0 lladdr 26:4a:e8:aa:01:d8 PERMANENT
144 | # Use the absolute path: if cni-plugins is in PATH, `bridge` may resolve to the cni-plugins bridge binary instead
145 | $ /sbin/bridge fdb | grep flannel
146 | 26:4a:e8:aa:01:d8 dev flannel.1 dst 192.168.50.3 self permanent
147 | 46:92:51:8e:a2:c9 dev flannel.1 dst 192.168.50.4 self permanent
148 | $ cat /run/flannel/subnet.env 
149 | FLANNEL_NETWORK=10.244.0.0/16
150 | FLANNEL_SUBNET=10.244.0.1/24
151 | FLANNEL_MTU=1450
152 | FLANNEL_IPMASQ=true
153 | ```
154 | 
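155 | #### Some tips
156 | 
157 | When deploying flannel across VPCs in the cloud, set `-public-ip` on flanneld (`--flannel-external-ip` on k3s), or change the annotations with `kubectl edit node`.
158 | 
159 | UDP may not be allowed through, for example on some cloud virtualization platforms. Before deploying, test sending and receiving UDP on 8472 with nc: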
160 | 
161 | ```bash
162 | # server
163 | nc -l -u 8472
164 | 
165 | # client
166 | nc -u <dest_public_ip> 8472
167 | ```
168 | 
169 | Or look at the flannel.1 statistics. If `RX` is 0, nothing sent by other nodes is being received, which usually means UDP is blocked:
170 | 
171 | ```bash
172 | $ ip -s a s flannel.1
173 | ....
174 | RX: bytes  packets  errors  dropped overrun mcast   
175 |     0          0        0       0       0       0 
176 | ```
177 | 
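178 | If cross-host ping works but application-layer protocols do not (for example curl to the URL of a Pod on another node), the virtualization network layer is usually interfering, as in these two cases:
179 | 
180 | - Sangfor hyper-converged and aCloud virtualization platforms use `8472/udp` themselves; see the article [Troubleshooting a port 8472 conflict in a K8S VXLAN overlay network](https://cloud.tencent.com/developer/article/1746944). The flannel port has to be changed.
181 | - For flannel on a port other than 8472, see the article [https://zhangguanzhang.github.io/2022/07/28/redhat84-vxlan-esxi/](https://zhangguanzhang.github.io/2022/07/28/redhat84-vxlan-esxi/)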
182 | 
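183 | ### IPIP mode
184 | 
185 | Similar to vxlan. With `Type: ipip` the interface is named `flannel.ipip`, and the inner packet is carried directly in IP instead of vxlan. The IP header is 20 bytes, so the interface MTU is `1500-20=1480`: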
186 | 
187 | ```bash
188 | $ ip r s 
189 | default via 192.168.50.1 dev eth0 proto static 
190 | 192.168.50.0/24 dev eth0 proto kernel scope link src 192.168.50.2
191 | 10.244.1.0/24 via 192.168.50.3 dev flannel.ipip onlink 
192 | 10.244.2.0/24 via 192.168.50.4 dev flannel.ipip onlink 
193 | 10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1 
194 | $ cat /run/flannel/subnet.env 
195 | FLANNEL_NETWORK=10.244.0.0/16
196 | FLANNEL_SUBNET=10.244.0.1/24
197 | FLANNEL_MTU=1480
198 | FLANNEL_IPMASQ=true
199 | $ ip -d a s flannel.ipip
200 | 102: flannel.ipip@NONE: <NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default 
201 |     link/ipip 192.168.50.2 brd 0.0.0.0 promiscuity 0 minmtu 0 maxmtu 0 
202 |     ipip any remote any local 192.168.50.2 ttl inherit nopmtudisc numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
203 |     inet 10.244.0.0/32 scope global flannel.ipip
204 |        valid_lft forever preferred_lft forever
205 |     inet6 fe80::5efe:a0d:5bd4/64 scope link 
206 |        valid_lft forever preferred_lft forever
207 | ```
208 | 
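209 | Capturing packets: because of the nesting, you cannot simply filter on a udp port as with vxlan. It is an IP packet inside an IP packet, which makes capturing more awkward. Filter by comparing the source and destination IP of the second (inner) IP header as hex. For example, when pinging the Pod IP `10.187.0.14` on another node from this host, [converting it to hex online](https://www.browserling.com/tools/ip-to-hex) gives `0x0abb000e`: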
210 | 
211 | ```bash
212 | $ tcpdump -nn -i ens160 'ip[32:4] = 0x0abb000e   or ip[36:4] = 0x0abb000e' -w ipip.pcap
213 | tcpdump: listening on ens160, link-type EN10MB (Ethernet), capture size 262144 bytes
214 | ^C13 packets captured
215 | 26 packets received by filter
216 | 0 packets dropped by kernel
217 | ```
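
The hex value and the byte offsets in the filter above can also be derived locally instead of with an online converter. The following is a small sketch under the assumption that the outer IPv4 header is 20 bytes (no IP options); the helper names are hypothetical.

```python
import ipaddress

OUTER_IPV4_HEADER = 20                     # bytes, assuming no IP options
INNER_SRC_OFFSET = OUTER_IPV4_HEADER + 12  # source address is at byte 12 of an IPv4 header
INNER_DST_OFFSET = OUTER_IPV4_HEADER + 16  # destination address is at byte 16


def ip_to_hex(ip: str) -> str:
    """Render an IPv4 address as the 32-bit hex literal tcpdump expects."""
    return f"0x{int(ipaddress.IPv4Address(ip)):08x}"


def ipip_inner_filter(ip: str) -> str:
    """Build a tcpdump filter matching the inner source or destination IP."""
    value = ip_to_hex(ip)
    return (f"ip[{INNER_SRC_OFFSET}:4] = {value} "
            f"or ip[{INNER_DST_OFFSET}:4] = {value}")


print(ip_to_hex("10.187.0.14"))
print(ipip_inner_filter("10.187.0.14"))
```

The resulting string can be pasted between the quotes of the `tcpdump` command shown above.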
218 | 
219 | The host IPs are masked. On the right you can see the extra nested layer of IP:
220 | 
221 | ```bash
222 | $ tcpdump -nn -r ipip.pcap 
223 | reading from file ipip.pcap, link-type EN10MB (Ethernet)
224 | 16:31:16.668984 IP xx.xx.xx.212 > xx.xx.xx.213: IP 10.187.2.0.62435 > 10.187.0.14.53: Flags [P.], seq 3040765315:3040765386, ack 1650546132, win 501, options [nop,nop,TS val 950814887 ecr 2423449147], length 71 43526+ [1au] AAAA? etcd3.default.default.svc.cluster2.local. (69) (ipip-proto-4)
225 | 16:31:16.669352 IP xx.xx.xx.213 > xx.xx.xx.212: IP 10.187.0.14.53 > 10.187.2.0.62435: Flags [P.], seq 1:168, ack 71, win 502, options [nop,nop,TS val 2423457143 ecr 950814887], length 167 43526 NXDomain*- 0/1/1 (165) (ipip-proto-4)
226 | 16:31:16.669391 IP xx.xx.xx.212 > xx.xx.xx.213: IP 10.187.2.0.62435 > 10.187.0.14.53: Flags [.], ack 168, win 501, options [nop,nop,TS val 950814888 ecr 2423457143], length 0 (ipip-proto-4)
227 | 16:31:16.669738 IP xx.xx.xx.212 > xx.xx.xx.213: IP 10.187.2.0.62435 > 10.187.0.14.53: Flags [P.], seq 71:134, ack 168, win 501, options [nop,nop,TS val 950814888 ecr 2423457143], length 63 49916+ [1au] AAAA? etcd3.default.svc.cluster2.local. (61) (ipip-proto-4)
228 | 16:31:16.670012 IP xx.xx.xx.213 > xx.xx.xx.212: IP 10.187.0.14.53 > 10.187.2.0.62435: Flags [P.], seq 168:327, ack 134, win 502, options [nop,nop,TS val 2423457144 ecr 950814888], length 159 49916*- 0/1/1 (157) (ipip-proto-4)
229 | 16:31:16.670031 IP xx.xx.xx.212 > xx.xx.xx.213: IP 10.187.2.0.62435 > 10.187.0.14.53: Flags [.], ack 327, win 501, options [nop,nop,TS val 950814888 ecr 2423457144], length 0 (ipip-proto-4)
230 | 16:31:16.672805 IP xx.xx.xx.212 > xx.xx.xx.213: IP 10.187.2.0.62435 > 10.187.0.14.53: Flags [P.], seq 134:205, ack 327, win 501, options [nop,nop,TS val 950814891 ecr 2423457144], length 71 63476+ [1au] A? etcd2.default.default.svc.cluster2.local. (69) (ipip-proto-4)
231 | 16:31:16.673063 IP xx.xx.xx.213 > xx.xx.xx.212: IP 10.187.0.14.53 > 10.187.2.0.62435: Flags [P.], seq 327:494, ack 205, win 502, options [nop,nop,TS val 2423457147 ecr 950814891], length 167 63476 NXDomain*- 0/1/1 (165) (ipip-proto-4)
232 | 16:31:16.673084 IP xx.xx.xx.212 > xx.xx.xx.213: IP 10.187.2.0.62435 > 10.187.0.14.53: Flags [.], ack 494, win 501, options [nop,nop,TS val 950814892 ecr 2423457147], length 0 (ipip-proto-4)
233 | 16:31:21.156396 IP xx.xx.xx.212 > xx.xx.xx.213: IP 10.187.2.0 > 10.187.0.14: ICMP echo request, id 10, seq 1, length 64 (ipip-proto-4)
234 | 16:31:21.156626 IP xx.xx.xx.213 > xx.xx.xx.212: IP 10.187.0.14 > 10.187.2.0: ICMP echo reply, id 10, seq 1, length 64 (ipip-proto-4)
235 | 16:31:22.178953 IP xx.xx.xx.212 > xx.xx.xx.213: IP 10.187.2.0 > 10.187.0.14: ICMP echo request, id 10, seq 2, length 64 (ipip-proto-4)
236 | 16:31:22.179206 IP xx.xx.xx.213 > xx.xx.xx.212: IP 10.187.0.14 > 10.187.2.0: ICMP echo reply, id 10, seq 2, length 64 (ipip-proto-4)
237 | ```
238 | 
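239 | ## Some other information
240 | 
241 | flannel supports `"Directrouting": true`, which makes nodes in the same layer 2 network use host-gw and nodes across layer 3 use vxlan/IPIP.
242 | The flanneld process exists to maintain the local configuration as nodes are added, removed, and changed. It takes no part in encapsulation or decapsulation; you could even skip the CNI and write a script that loops over the relevant commands to configure everything yourself.
243 | 
244 | ### flannel and cni-plugins
245 | 
246 | When kube-controller-manager is given the `allocate-node-cidrs` and `cluster-cidr` parameters, it assigns a Pod IP range to every node: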
247 | 
248 | ```
249 | --node-cidr-mask-size-ipv4=24
250 | ```
251 | 
252 | The default IPv4 mask is 24 bits, which means at most 253 Pods per node (the .0, .1, and .255 addresses are not used).
253 | 
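
The per-node capacity above follows directly from the mask size. Here is a minimal sketch with Python's `ipaddress` module, taking the three unused addresses from the text as given; the function names and the example CIDRs (borrowed from the flannel examples earlier) are illustrative only.

```python
import ipaddress

UNUSED_PER_SUBNET = 3  # the .0, .1 and .255 addresses, as stated in the text


def max_pods_per_node(node_cidr: str) -> int:
    """Pod IPs available in one node's subnet."""
    return ipaddress.ip_network(node_cidr).num_addresses - UNUSED_PER_SUBNET


def node_subnet_count(cluster_cidr: str, node_mask: int) -> int:
    """How many node subnets of size /node_mask fit into the cluster CIDR."""
    cluster = ipaddress.ip_network(cluster_cidr)
    return 2 ** (node_mask - cluster.prefixlen)


print(max_pods_per_node("10.244.0.0/24"))       # 253
print(node_subnet_count("10.244.0.0/16", 24))   # 256
```

The second number is a useful sanity check when sizing `cluster-cidr`: a /16 with /24 node subnets leaves room for 256 nodes.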
254 | 
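255 | When flanneld starts, `RegisterNetwork` (which calls `kubeSubnetManager.AcquireLease`) reads the current node's `Spec.PodCIDR` and writes some required information into the node's `annotation`.
256 | 
257 | The subnet information is then written to `/run/flannel/subnet.env` (main.WriteSubnetFile), where the flannel CNI reads it to allocate Pod IPs.
258 | 
259 | 
260 | The first time the bridge binary runs (when a non-hostNetwork Pod is scheduled to the node), it creates a Linux bridge device on the node, named cni0 by default. This bridge is a virtual switch, and each new Pod's interface is connected to it through a veth device.
261 | 
262 | Every time bridge is invoked, it creates a veth for the Pod, attaches it to cni0, and calls `host-local` to allocate an IP from the node's subnet.
263 | 
264 | Files and directories related to the flannel CNI:
265 | 
266 | - `/etc/cni/net.d/10-flannel.conf`: the flannel CNI configuration file.
267 | - `/var/lib/cni/flannel`: holds the configuration flannel uses each time it calls bridge. The file name is the `io.kubernetes.sandbox.id` (visible via `docker inspect [container id]`).
268 | - `/var/lib/cni/networks/cbr0`: holds the IPs allocated by the `host-local` CNI. Each file is named after the allocated container IP (host part incrementing) and contains the `io.kubernetes.sandbox.id`.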
269 | 
270 | ## Image credits
271 | 
272 | - https://juejin.cn/post/6994825163757846565
273 | 
274 | ## Links
275 | 
276 | - [CNI and cni-plugins](04.05.md)
277 | - Next: [CNI calico](04.07.md)
278 | 


--------------------------------------------------------------------------------
/eBook/04.07.md:
--------------------------------------------------------------------------------
 1 | # calico
 2 | 
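 3 | Do not install it the old way; use the newer CRD-based installation, see the official documentation [kubernetes/quickstart](https://docs.tigera.io/calico/latest/getting-started/kubernetes/quickstart)
 4 | 
 5 | TODO: to be filled in later
 6 | 
 7 | ## How it works
 8 | 
 9 | ## BGP
10 | 
11 | ## Connecting the office network to the cluster network with route reflection
12 | 
13 | ## Links
14 | 
15 | - [CNI flannel](04.06.md)
16 | - Next: [DNS](04.08.md)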
17 | 


--------------------------------------------------------------------------------
/eBook/04.08.md:
--------------------------------------------------------------------------------
  1 | # DNS
  2 | 
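  3 | With a cross-node network (CNI) and load balancing across application replicas (Service) in place, one application should not call another by IP (Service IP). Just as phones and computers reach websites by domain name, callers should not care when the IP behind a name changes.
  4 | 
  5 | ## How DNS works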
  6 | 
  7 | ```              
  8 |                             deploy A           
  9 |                        ServiceName: test-web   
 10 |                             ┌─────┐            
 11 |                             │┌───┐│            
 12 |                             │└───┘│            
 13 |          ┌────────────────► │┌───┐│            
 14 |          │                  │└───┘│            
 15 |          │                  │┌───┐│            
 16 |          │                  │└───┘│            
 17 |      ┌───┴────┐             └─────┘            
 18 |      │ client │                                
 19 |      └────────┘             ┌─────┐            
 20 |  nameserver 10.96.0.10      │┌───┐│            
 21 |          │                  │└───┘│            
 22 |          │                  │┌───┐│            
 23 |          └────────────────► │└───┘│            
 24 |         dns test-web ip     │┌───┐│            
 25 |                             │└───┘│            
 26 |                             └─────┘            
 27 |                             coredns            
 28 |                                                
 29 |                       ServiceIP:10.96.0.10            
 30 | ```
 31 | 
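 32 | - The Service IP of coredns is fixed when K8S is deployed, for example `10.96.0.10`
 33 | - The kubelet is configured with `clusterDNS: [10.96.0.10]`. Under the default Pod DNS policy, when the kubelet asks the container runtime (docker, containerd) to create a container, it passes `10.96.0.10` as the dns server
 34 | - After the Pod starts, accessing `test-web` triggers a DNS query to the `nameserver 10.96.0.10` listed in `/etc/resolv.conf`
 35 | - coredns uses its RBAC role to fetch and cache all Service names and Service IPs from the K8S API, and answers the DNS queries that reach it.
 36 | - Once the Pod has the Service IP, it sends its request there. From the client process's point of view this is the same as before containers; the diagram above applies equally to DNS resolution in an ordinary non-container environment.
 37 | - Thanks to the cluster's overlay network and the DNAT on each machine, DNS queries can be sent from any node in the cluster, not only from inside Pods. You can run dig on a K8S node with the dns server specified to troubleshoot or resolve cluster Service names.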
 38 | 
 39 | ## DNS options
 40 | 
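 41 | K8S isolates workloads by namespace, and a few resolver options keep name resolution in containers working across that isolation. The actual content of `/etc/resolv.conf` in a container: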
 42 | 
 43 | ```
 44 | nameserver 10.96.0.10
 45 | search default.svc.cluster.local. svc.cluster.local. cluster.local.
 46 | options ndots:5
 47 | ```
 48 | 
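 49 | ### search and ndots
 50 | 
 51 | `search` specifies the search domains. When a name is not an FQDN (fully qualified domain name), i.e. it has no trailing `.`, the resolver appends each suffix in the search list to the incomplete name in turn and queries the result, until one matches.
 52 | `ndots` (number of dots) is a threshold: the search list is tried first only when the name contains fewer dots (.) than `ndots`.
 53 | 
 54 | Take the built-in `kubernetes` Service as a test. Suppose a pod in the default namespace wants its Service IP: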
 55 | 
 56 | ```bash
 57 | $ kubectl -n default exec xxxx -- cat /etc/resolv.conf
 58 | nameserver 10.96.0.10
 59 | search default.svc.cluster.local. svc.cluster.local. cluster.local.
 60 | options ndots:5
 61 | $ kubectl -n default exec xxxx -- curl -k https://kubernetes
 62 | {
 63 |   "kind": "Status",
 64 |   "apiVersion": "v1",
 65 |   "metadata": {},
 66 |   "status": "Failure",
 67 |   "message": "Unauthorized",
 68 |   "reason": "Unauthorized",
 69 |   "code": 401
 70 | }
 71 | $ kubectl exec -ti nginx -- curl -k https://kubernetes.
 72 | curl: (6) Could not resolve host: kubernetes.
 73 | command terminated with exit code 6
 74 | $ kubectl exec -ti nginx -- curl -k https://kubernetes.default.svc.cluster.local.
 75 | {
 76 |   "kind": "Status",
 77 |   "apiVersion": "v1",
 78 |   "metadata": {},
 79 |   "status": "Failure",
 80 |   "message": "Unauthorized",
 81 |   "reason": "Unauthorized",
 82 |   "code": 401
 83 | }
 84 | # You can watch a capture in another window by attaching to the container's network namespace
 85 | $ docker run --rm -ti --net container:23cc818c1800 zhangguanzhang/netshoot tcpdump -nn -i eth0 port 53
 86 | listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
 87 | 03:54:07.139368 IP 10.244.0.8.41893 > 10.96.0.10.53: 40318+ [1au] A? kubernetes.default.svc.cluster.local. (66)
 88 | 03:54:07.139433 IP 10.244.0.8.41893 > 10.96.0.10.53: 13722+ [1au] AAAA? kubernetes.default.svc.cluster.local. (66)
 89 | 03:54:07.140954 IP 10.96.0.10.53 > 10.244.0.8.41893: 13722*- 0/1/1 (162)
 90 | 03:54:07.141108 IP 10.96.0.10.53 > 10.244.0.8.41893: 40318*- 1/0/1 A 10.96.0.1 (119)
 91 | ```
 92 | 
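 93 | Unless the program is written in some special way, it uses the libc library for low-level functions such as DNS resolution. The queries above do not show the search list in action, so run `curl https://kubernetes1` inside the pod: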
 94 | 
 95 | ```tcpdump
 96 | # Keep the earlier capture running
 97 | listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
 98 | 03:57:10.348955 IP 10.244.0.8.46982 > 10.96.0.10.53: 8279+ [1au] A? kubernetes1.default.svc.cluster.local. (67)
 99 | 03:57:10.349013 IP 10.244.0.8.46982 > 10.96.0.10.53: 60481+ [1au] AAAA? kubernetes1.default.svc.cluster.local. (67)
100 | 03:57:10.350196 IP 10.96.0.10.53 > 10.244.0.8.46982: 60481 NXDomain*- 0/1/1 (163)
101 | 03:57:10.350282 IP 10.96.0.10.53 > 10.244.0.8.46982: 8279 NXDomain*- 0/1/1 (163)
102 | 03:57:10.350377 IP 10.244.0.8.46982 > 10.96.0.10.53: 40735+ [1au] A? kubernetes1.svc.cluster.local. (59)
103 | 03:57:10.350410 IP 10.244.0.8.46982 > 10.96.0.10.53: 14551+ [1au] AAAA? kubernetes1.svc.cluster.local. (59)
104 | 03:57:10.350859 IP 10.96.0.10.53 > 10.244.0.8.46982: 40735 NXDomain*- 0/1/1 (155)
105 | 03:57:10.350920 IP 10.96.0.10.53 > 10.244.0.8.46982: 14551 NXDomain*- 0/1/1 (155)
106 | 03:57:10.350987 IP 10.244.0.8.46982 > 10.96.0.10.53: 37615+ [1au] A? kubernetes1.cluster.local. (55)
107 | 03:57:10.351051 IP 10.244.0.8.46982 > 10.96.0.10.53: 9477+ [1au] AAAA? kubernetes1.cluster.local. (55)
108 | 03:57:10.351437 IP 10.96.0.10.53 > 10.244.0.8.46982: 37615 NXDomain*- 0/1/1 (151)
109 | 03:57:10.351504 IP 10.96.0.10.53 > 10.244.0.8.46982: 9477 NXDomain*- 0/1/1 (151)
110 | 03:57:10.351628 IP 10.244.0.8.46982 > 10.96.0.10.53: 47051+ [1au] A? kubernetes1. (40)
111 | 03:57:10.351666 IP 10.244.0.8.46982 > 10.96.0.10.53: 47948+ [1au] AAAA? kubernetes1. (40)
112 | 03:57:10.523232 IP 10.96.0.10.53 > 10.244.0.8.46982: 47948 NXDomain 0/1/1 (115)
113 | 03:57:10.524150 IP 10.96.0.10.53 > 10.244.0.8.46982: 47051 NXDomain 0/1/1 (115)
114 | ```
115 | 
116 | This shows that with `ndots=5`, when the name does not end in `.` and the kubelet `clusterDomain` is `cluster.local.`, the resolver tries the following in order:
117 | - `Domain.NS.svc.cluster.local.`
118 | - `Domain.svc.cluster.local.`
119 | - `Domain.cluster.local.`
120 | - `Domain`
121 | 
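
The lookup order listed above can be modelled in a few lines of Python. This is a simplified, glibc-like model written for illustration, not the resolver's real code: it ignores timeouts, retries, and differences between libc implementations, and the function name `query_order` is hypothetical.

```python
from typing import List

# Search list as found in the Pod's /etc/resolv.conf above
SEARCH = ["default.svc.cluster.local.", "svc.cluster.local.", "cluster.local."]


def query_order(name: str, search: List[str], ndots: int) -> List[str]:
    """Return the names a resolver would try, in order (simplified model)."""
    if name.endswith("."):
        # Already fully qualified: the search list is skipped entirely
        return [name]
    absolute = name + "."
    with_search = [f"{name}.{suffix.rstrip('.')}." for suffix in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then fall back to the search list
        return [absolute] + with_search
    # Too few dots: walk the search list first, the bare name comes last
    return with_search + [absolute]


for candidate in query_order("kubernetes1", SEARCH, ndots=5):
    print(candidate)
```

For `kubernetes1` this yields the same four names, in the same order, as the A/AAAA queries in the tcpdump capture above, which is why a public name with fewer than five dots costs several wasted lookups inside a Pod.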
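122 | So after the cluster is deployed, or when you want to check cross-node connectivity with the built-in Pods, the K8S overlay network lets you probe or resolve names with dig directly on the host: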
123 | 
124 | ```bash
125 | # An FQDN is used because the host's /etc/resolv.conf has no search entries, and dig defaults to +nosearch
126 | $ dig @10.96.0.10 kubernetes.default.svc.cluster.local +short
127 | 10.96.0.1
128 | # Bypass the svc, e.g. to query a coredns Pod on this host and one on another host
129 | $ dig @<coredns_PodIP> kubernetes.default.svc.cluster.local +short
130 | 10.96.0.1
131 | ```
132 | 
133 | ## How a pod's /etc/resolv.conf is generated
134 | 
135 | ```bash
136 | $ kubectl explain pod.spec.dnsPolicy
137 | ```
138 | 
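139 | - By default a pod uses `ClusterFirst`, which takes the `clusterDNS` configured on the kubelet, i.e. the coredns svc IP
140 | - A `hostNetwork` pod without an explicit `dnsPolicy` uses the kubelet's `resolvConf` setting, i.e. the content of the host's `/etc/resolv.conf`. `Default` does the same
141 | - `ClusterFirstWithHostNet` is what the name says: for a pod with `hostNetwork: true` it sets the coredns svc IP, as an ingress controller would want
142 | - `None` means no DNS settings are generated for the pod; combine it with `dnsConfig` to define the DNS parameters yourself
143 | 
144 | The coredns deploy itself sets `dnsPolicy: Default`, so it uses the content of the host's `/etc/resolv.conf`. Combine that with its configuration file: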
145 | 
146 | ```bash
147 | $ kubectl -n kube-system describe cm coredns 
148 | 
149 | Corefile:
150 | ----
151 | // . matches everything
152 | .:53 {
153 |     errors
154 |     health {
155 |         lameduck 5s
156 |     }
157 |     ready
158 |     // requests with the cluster.local. suffix are resolved by the kubernetes plugin
159 |     kubernetes cluster.local. in-addr.arpa ip6.arpa {
160 |         pods insecure
161 |         fallthrough in-addr.arpa ip6.arpa
162 |         ttl 30
163 |     }
164 |     prometheus :9153
165 |     // anything the kubernetes plugin does not match, e.g. a public domain queried from a pod, is forwarded to the nameserver in /etc/resolv.conf
166 |     forward . /etc/resolv.conf {
167 |         max_concurrent 1000
168 |     }
169 |     cache 30
170 |     loop
171 |     reload
172 |     loadbalance
173 | }
174 | $ dig @10.96.0.2 baidu.com +short
175 | 110.242.68.66
176 | 39.156.66.10
177 | ```
178 | 
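179 | So coredns can resolve both in-cluster names and public domains. To extend or tune its configuration, see the [coredns plugin documentation](https://coredns.io/plugins/). For example, use the hosts plugin to define a custom domain: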
180 | 
181 | ```
182 | ...
183 |     kubernetes cluster.local. in-addr.arpa ip6.arpa {
184 |         pods insecure
185 |         fallthrough in-addr.arpa ip6.arpa
186 |         ttl 30
187 |     }
188 |     hosts {
189 |       127.0.0.1 test.com
190 |       reload 5s
191 |       fallthrough // required: unmatched queries continue to the next plugin
192 |     }
193 |     prometheus :9153
194 | 
195 |     forward . /etc/resolv.conf {
196 |         max_concurrent 1000
197 |     }
198 | ```
199 | 
200 | Or configure a separate zone:
201 | 
202 | ```
203 | .:53 {
204 | ...
205 | }
206 | 
207 | k8s.itlocal:53 {
208 | ...
209 | }
210 | ```
211 | 
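212 | ## Number of nameservers
213 | 
214 | Linux's libc is stuck (see [this bug from 2005](https://bugzilla.redhat.com/show_bug.cgi?id=168253)) with limits of only 3 DNS nameserver records and 6 DNS search records. Kubernetes needs to consume 1 nameserver record and 3 search records. This means that if the local installation already uses 3 nameservers or more than 3 search entries, some of those settings will be lost. As a partial workaround, the node can run dnsmasq, which provides more nameserver entries but no more search entries. You can also use the kubelet `--resolv-conf` flag.
215 | 
216 | ## Other service types and their resolution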
217 | 
218 | https://kubernetes.io/zh-cn/docs/concepts/services-networking/dns-pod-service/
219 | 
220 | ## coredns metrics
221 | 
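222 | coredns exposes a `:9153/metrics` endpoint. On a freshly deployed cluster with no applications yet, this HTTP URL is handy for testing cross-node communication.
223 | 
224 | ## DNS 5s timeout
225 | 
226 | In real-world operation, containers often report intermittent failures to resolve names: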
227 | 
228 | ```bash
229 | dial tcp: lookup xxxx: i/o timeout
230 | ```
231 | 
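232 | ### Cause of the 5s timeout
233 | 
234 | The root cause is a bug in the Linux kernel conntrack module. Martynas Pumputis, an engineer at Weave Works, analyzed the problem in detail:
235 | 
236 | The DNS client (glibc or musl libc) requests the A and AAAA records concurrently. To talk to the DNS server it first calls connect (creating an fd) and then sends the query packets through that fd. Because UDP is a stateless protocol, connect sends no packet, so no conntrack entry is created at that point.
237 | The concurrent A and AAAA queries use the same fd by default, so the packets they send share the same source port (they go out through the same socket). When the two packets are sent concurrently, neither has been inserted into the conntrack table yet, so netfilter creates a separate conntrack entry for each. Inside the cluster, queries to kube-dns or coredns go to the CLUSTER-IP, and the packets are eventually DNATed to the POD IP of one endpoint.
238 | When both packets happen to be DNATed to the same POD IP, their five-tuples become identical, and the later packet is dropped at final insertion.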
239 | 
240 | ```
241 | IPv6 (AAAA) query: udp srcIP:destIP srcPort destPort
242 | IPv4 (A) query:    udp srcIP:destIP srcPort destPort
243 | ```
244 | 
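245 | This happens easily when the dns pod has only one replica, because packets are always DNATed to the same POD IP. The symptom is a DNS query timeout. The client's default policy is to wait 5s and retry automatically, so if the retry succeeds, what we observe is a 5s delay on the DNS query. ipvs also uses conntrack, so running kube-proxy in ipvs mode does not avoid the problem.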
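
The collision described in this section can be illustrated with a toy model. It only compares tuples and does not reproduce the kernel's conntrack logic; the addresses are taken from the capture earlier in this chapter, and the helper names `dnat` and `collides` are hypothetical.

```python
import random


def dnat(packet, endpoints, rng):
    """Rewrite the destination IP to a randomly chosen endpoint, as a Service DNAT would."""
    proto, src_ip, src_port, _dst_ip, dst_port = packet
    return (proto, src_ip, src_port, rng.choice(endpoints), dst_port)


def collides(endpoints, rng):
    """True when the A and AAAA packets end up with the same five-tuple after DNAT."""
    # Both queries leave through the same socket, so the original tuples are identical.
    a_query = ("udp", "10.244.0.8", 46982, "10.96.0.10", 53)
    aaaa_query = ("udp", "10.244.0.8", 46982, "10.96.0.10", 53)
    return dnat(a_query, endpoints, rng) == dnat(aaaa_query, endpoints, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    one_replica = sum(collides(["10.244.1.2"], rng) for _ in range(1000))
    three_replicas = sum(
        collides(["10.244.1.2", "10.244.2.3", "10.244.3.4"], rng) for _ in range(1000)
    )
    print("collisions with 1 replica: ", one_replica)
    print("collisions with 3 replicas:", three_replicas)
```

With a single endpoint every pair of concurrent packets ends up with the same tuple in this model, which is why a one-replica DNS deployment is the easiest place to hit the race.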
246 | 
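247 | ### Workarounds
248 | 
249 | See the [`man 5 resolv.conf` documentation](https://www.man7.org/linux/man-pages/man5/resolv.conf.5.html)
250 | 
251 | #### TCP queries
252 | 
253 | Because the conflict is between UDP five-tuples, DNS queries can be sent over TCP instead by setting `options use-vc` (since glibc 2.14), although it does not always take effect.
254 | 
255 | #### Avoid concurrent A and AAAA queries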
256 | 
257 | - `single-request-reopen` (since glibc 2.9)
258 | - `single-request` (since glibc 2.10)
259 | 
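260 | **Alpine's musl does not support these options**, and they only mitigate the problem. To verify the effect first, you can append to the file as shown below without restarting the containers: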
261 | 
262 | ```bash
263 | while read ctr;do
264 |   ResolvConfPath=`docker inspect $ctr --format "{{ .ResolvConfPath }}" `
265 | 
266 | #  if ! grep -q use-vc $ResolvConfPath;then
267 | #    echo 'options use-vc' >> $ResolvConfPath
268 | #  fi
269 |   if ! grep -q attempts $ResolvConfPath;then
270 |     echo 'options timeout:2 attempts:3 single-request-reopen' >> $ResolvConfPath
271 |   fi  
272 | done < <(docker ps -a --filter label=io.kubernetes.docker.type=container --format '{{.ID}}' )
273 | ```
274 | 
275 | To make the change permanent afterwards, use kubectl patch:
276 | 
277 | ```bash
278 | function t(){
279 |     local ns deploy
280 |     for ns in $@;do
281 |       for deploy in $(kubectl -n $ns   get deploy --no-headers  | awk '{print $1}');do
282 |         kubectl -n $ns patch deploy $deploy  \
283 |             -p '{"spec":{"template":{"spec":{"dnsConfig":{"options":[{"name":"single-request-reopen"},{"name":"timeout","value":"2"},{"name":"attempts","value":"3"}]}}}}}'
284 |       done
285 |     done
286 | }
287 | 
288 | t default # patch the deploys in the default namespace
289 | ```
290 | 
291 | #### nodelocaldns
292 | 
293 | Upstream also recognized how common this problem is and provided a solution that deploys coredns in cache mode as a daemonset:
294 | 
295 | ![nodelocal-HA](https://github.com/kubernetes/enhancements/raw/master/keps/sig-network/1024-nodelocal-cache-dns/nodelocal-HA.png)
296 | 
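297 | The red lines show the default behavior; the other lines show the query path after deployment:
298 | - nodelocaldns is deployed as a hostNetwork DaemonSet, so every node runs a nodelocaldns Pod
299 | - It creates a dummy interface `nodelocaldns` that listens on two IPs, the kube-dns svcIP and `169.254.20.10`, and adds `-j NOTRACK` iptables rules for both IPs in the raw table, so that Pod queries to the DNS svcIP skip DNAT and land on the `nodelocaldns` interface: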
300 | 
301 | ```bash
302 | $ ip -4 a s nodelocaldns
303 | 171: nodelocaldns: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default 
304 |     inet 169.254.20.10/32 scope global nodelocaldns
305 |        valid_lft forever preferred_lft forever
306 |     inet 10.0.0.10/32 scope global nodelocaldns
307 |        valid_lft forever preferred_lft forever
308 | ``` 
309 | 
310 | - The client Pod sends its DNS query to nodelocaldns, which queries the upstream coredns svcIP over TCP (in user space, not via the resolv `options use-vc`), caches the result, and returns it to the client Pod.
311 | 
312 | For deployment, see the official document [Using NodeLocal DNSCache in Kubernetes Clusters](https://kubernetes.io/zh-cn/docs/tasks/administer-cluster/nodelocaldns/). The rollout does not require reinstalling K8S, and in iptables mode kubelet does not need to be changed, reconfigured, or restarted.
313 | In ipvs mode, edit [nodelocaldns.yaml](https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml) and remove the `DNS_SERVER` after `-localip`: the kube-ipvs0 interface already binds the dns server IP in ipvs mode, and nodelocaldns binding it again would conflict. Then change kubelet's `--cluster-dns` to the IP on the nodelocaldns interface, e.g. `169.254.20.10`.
314 | 
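315 | After deployment the problem was not fully solved, only greatly reduced. Moving all workloads to glibc-based container images and enabling the two single-request options reduced it further.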
316 | 
317 | In addition, the upstream manifest creates a svc that selects the cluster DNS Pods without specifying a service IP, and passes the svc name on the command line:
318 | 
319 | ```yaml
320 | apiVersion: v1
321 | kind: Service
322 | metadata:
323 |   name: kube-dns-upstream
324 |   namespace: kube-system
325 | ...
326 | spec:
327 |   ports:
328 |   - name: dns
329 |     port: 53
330 |     protocol: UDP
331 |     targetPort: 53
332 |   - name: dns-tcp
333 |     port: 53
334 |     protocol: TCP
335 |     targetPort: 53
336 |   selector:
337 |     k8s-app: kube-dns
338 | ...
339 |  args: [ ..., "-upstreamsvc", "kube-dns-upstream" ]
340 | 
341 | ```
342 | 
343 | The nodelocaldns startup log then shows the rendered configuration file:
344 | 
345 | ```
346 |     forward . 10.96.189.136 {
347 |             force_tcp
348 |     }
349 | ```
350 | 
351 | Because [enableServiceLinks](https://kubernetes.io/zh-cn/docs/tutorials/services/connect-applications-service/#accessing-the-service) is enabled by default, the pod has the following environment variable:
352 | 
353 | ```
354 | $ docker exec dfa env | grep KUBE_DNS_UPSTREAM_SERVICE_HOST
355 | KUBE_DNS_UPSTREAM_SERVICE_HOST=10.96.189.136
356 | ```
357 | 
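358 | The code shows that it converts each `-` in the argument to `_`, reads the value of the environment variable `KUBE_DNS_UPSTREAM_SERVICE_HOST`, and renders it into the configuration file, which is how it obtains the SVC IP.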
359 | 
360 | ```go
361 | func toSvcEnv(svcName string) string {
362 | 	envName := strings.Replace(svcName, "-", "_", -1)
363 | 	return "$" + strings.ToUpper(envName) + "_SERVICE_HOST"
364 | }
365 | ```
366 | 
367 | ### Customizing nodelocaldns
368 | 
369 | If you need to build [nodelocaldns](https://github.com/kubernetes/dns) yourself:
370 | 
371 | ```bash
372 | make containers CONTAINER_BINARIES=node-cache DOCKER_BUILD_FLAGS=--load
373 | ```
374 | 
375 | ## Links
376 | 
377 | - [CNI calico](04.07.md)
378 | - Next: [K8S high availability and SLB](04.09.md)
379 | 


--------------------------------------------------------------------------------
/eBook/04.09.md:
--------------------------------------------------------------------------------
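  1 | # K8S high availability and SLB
  2 | 
  3 | ## K8S high availability
  4 | 
  5 | etcd needs little explanation: run an odd number of members, and the cluster tolerates the loss of `(n-1)/2` of them. If resources allow, deploy etcd on dedicated machines with SSDs. All K8S components and clients use a kubeconfig, or an in-cluster RBAC serviceaccount via the `kubernetes` service, and ultimately connect to an actual kube-apiserver process: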
  6 | 
  7 | ![](https://d33wubrfki0l68.cloudfront.net/2555d34e3008aab4b049ca5634cfabc2078ccf92/3269a/images/docs/ha.svg)
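
As a quick check of the `(n-1)/2` figure for etcd mentioned above, the following sketch derives the tolerated failures from the majority quorum. The function name `etcd_fault_tolerance` is hypothetical and the code only does the arithmetic; it does not talk to etcd.

```python
def etcd_fault_tolerance(members):
    """Return how many members can fail while a majority quorum survives."""
    if members < 1:
        raise ValueError("an etcd cluster needs at least one member")
    quorum = members // 2 + 1  # strict majority of the members
    return members - quorum


if __name__ == "__main__":
    for size in (1, 3, 4, 5, 7):
        print(f"{size} members -> tolerates {etcd_fault_tolerance(size)} failures")
```

An even member count buys no extra tolerance (4 members tolerate the same single failure as 3), which is the usual argument for odd sizes.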
  8 | 
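  9 | With an etcd cluster in place, kube-apiserver/kube-controller-manager/kube-scheduler are usually deployed in multiple copies, but everything connects to kube-apiserver, so a load balancer is needed in front of the kube-apiserver replicas.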
 10 | 
 11 | 
 12 | ### Inside the K8S cluster
 13 | 
 14 | The cluster ships with a built-in `kubernetes` service that takes the first IP of the Service CIDR:
 15 | 
 16 | ```
 17 | $ kubectl -n default describe svc kubernetes 
 18 | Name:              kubernetes
 19 | Namespace:         default
 20 | Labels:            component=apiserver
 21 |                    provider=kubernetes
 22 | Annotations:       <none>
 23 | Selector:          <none>
 24 | Type:              ClusterIP
 25 | IP Family Policy:  SingleStack
 26 | IP Families:       IPv4
 27 | IP:                10.96.0.1
 28 | IPs:               10.96.0.1
 29 | Port:              https  443/TCP
 30 | TargetPort:        6443/TCP
 31 | Endpoints:         192.168.2.111:6443,192.168.2.112:6443,192.168.2.113:6443
 32 | Session Affinity:  None
 33 | Events:            <none>
 34 | ```
 35 | 
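 36 | Inside the cluster, `https://kubernetes.default` resolves to `10.96.0.1` and is then DNATed to any one of the kube-apiservers in the endpoints. A sharp reader might ask: these look like the NIC IPs each kube-apiserver reports for itself, so what happens behind NAT? kube-apiserver provides the `--advertise-address` option to tell clients which IP to use to reach it.
 37 | 
 38 | ### Using a load balancer
 39 | 
 40 | kube-apiserver is essentially a layer-7 web service, so do not overthink K8S high availability: keeping the apiserver highly available is enough. It can be done at the TCP or web layer, in software or hardware, or with a combination, for example: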
 41 | - Forwarding configured on a hardware F5
 42 | - nginx、caddy、envoy、haproxy
 43 | - A cloud SLB proxy
 44 | - keepalived VRRP
 45 | - keepalived VRRP + haproxy
 46 | - Network devices with ospf + lvs
 47 | - ...
 48 | 
 49 | With a layer-7 load balancer, the health check can use the application-level URL `/healthz`:
 50 | 
 51 | ```
 52 | $ kubectl get --raw / | grep /healthz
 53 |     "/healthz",
 54 |     "/healthz/autoregister-completion",
 55 |     "/healthz/etcd",
 56 |     "/healthz/log",
 57 |     "/healthz/ping",
 58 |     "/healthz/post
 59 |     ...
 60 | ```
 61 | 
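 62 | If kube-apiserver is not configured with `--anonymous-auth=false`, the `/healthz` route returns a 200 status code to unauthenticated requests. Alternatively, upload the certificate to the SLB or similar configuration for the layer-7 check. For example, below is a haproxy TCP proxy with a layer-7 health check: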
 63 | 
 64 | ```
 65 | frontend k8s-api
 66 |   bind 0.0.0.0:8443
 67 |   bind 127.0.0.1:8443
 68 |   mode tcp
 69 |   option tcplog
 70 |   tcp-request inspect-delay 5s
 71 |   default_backend k8s-api
 72 | 
 73 | backend k8s-api
 74 |   mode tcp
 75 |   option tcplog
 76 |   option httpchk GET /healthz
 77 |   http-check expect string ok
 78 |   balance roundrobin
 79 |   default-server inter 10s downinter 5s rise 2 fall 2 slowstart 60s maxconn 250 maxqueue 256 weight 100
 80 |   server  api1  192.168.2.111:6443  check check-ssl verify none
 81 |   server  api2  192.168.2.112:6443  check check-ssl verify none
 82 |   server  api3  192.168.2.113:6443  check check-ssl verify none
 83 | ```
 84 | 
 85 | The kubeconfig file contains the URL of the kubernetes API server. To bypass the load balancer when using kubectl, specify the kube-apiserver URL with `-s, --server=`. You may also run into this kubectl connection error:
 86 | 
 87 | ```
 88 | The connection to the server localhost:8080 was refused - did you specify the right host or port?
 89 | ```
 90 | 
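 91 | This happens when there is no `~/.kube/config` file and the file named by the `KUBECONFIG` variable does not exist or is not readable. kubectl then falls back to `http://localhost:8080`, an anonymous address that needs no authorization, and because current kube-apiservers have the `insecure-port` disabled, the connection is refused.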
 92 | 
 93 | #### local proxy
 94 | 
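 95 | In some private environments the customer cannot provide an SLB or a hardware load balancer, and the environment may be virtual machines on openstack where a VRRP VIP cannot be used (packets are dropped because the IP and MAC do not match). In that case use a local proxy: deploy a software load balancer on every node, for example an nginx on each node listening on `0.0.0.0:8443` as a reverse proxy: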
 96 | 
 97 | ```
 98 |   upstream kube_apiserver {
 99 |     least_conn;
100 |     server 192.168.2.111:6443 max_fails=3 fail_timeout=10s;
101 |     server 192.168.2.112:6443 max_fails=3 fail_timeout=10s;
102 |     server 192.168.2.113:6443 max_fails=3 fail_timeout=10s;
103 |     }
104 | 
105 |   server {
106 |     listen        0.0.0.0:8443;
107 |     proxy_pass    kube_apiserver;
108 |     proxy_connect_timeout 1s;
109 |   }
110 | ```
111 | 
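112 | For example, sealos deploys with kubeadm and adds a yaml to the staticPod directory, using the kernel's lvs for layer-4 load balancing. Each node runs a pod with an lvs watch process that maintains the lvs rules to keep things highly available: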
113 | 
114 | ![ha03](https://raw.githubusercontent.com/zhangguanzhang/Image-Hosting/master/k8s/ha03.jpg)
115 | 
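116 | To reiterate: do not cling to the software and techniques you already know. Weigh the scenario, future upgrades and maintenance, and scale. For instance, if the customer cares a lot about CVE vulnerability scans, a single-binary web server such as caddy makes the proxy easier to update.
117 | 
118 | #### SLB and hairpin
119 | 
120 | In the cloud, for example, an Alibaba Cloud K8S instance creates an SLB for kube-apiserver, and all clients go through it: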
121 | 
122 | ```
123 | # The machines/services behind an SLB reverse proxy are usually called real servers (rs).
124 |                                             +-------+
125 |                                             |       |
126 |                                -+---------->+  rs   |
127 |                               /             +-------+
128 |                              /    
129 |                             /     
130 |                            /                +-------+
131 | +--------+                /                 |       |
132 | | client +----------->   SLB     ---------->+  rs   |
133 | +--------+                \                 +-------+
134 |                            \    
135 |                             \     
136 |                              \              +-------+
137 |                               ------------->+       |
138 |                                             |  rs   |
139 |                                             +-------+
140 | ```
141 | 
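142 | Because you cannot log in to Alibaba Cloud's managed K8S instances, some people only discover a problem once they deploy K8S themselves and create an SLB as the load balancer. The masters also run kubelet, and when the IP in the kubelet's kubeconfig is the SLB, the traffic looks like this: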
143 | 
144 | ```
145 |                    +-------+
146 |                    |       |
147 |       -+-----------+  rs   |
148 |      /             +-------+
149 |     /    
150 |    /     
151 |   /                +-------+
152 |  /                 |       |
153 | SLB  <-------------+  rs   |
154 |  \                 +-------+
155 |   \    
156 |    \     
157 |     \              +-------+
158 |      --------------+       |
159 |                    |  rs   |
160 |                    +-------+
161 | ```
162 | 
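163 | When an rs accesses the SLB like this, you will see occasional timeouts. NAT has a dedicated term for this behavior, hairpinning: the packet goes up one side and comes back down the other, like a hairpin. Many load balancers simply drop such packets.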
164 | 
165 | ![hairpin-icon](../images/hairpin-icon.png)
166 | 
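167 | This is not limited to masters. An ingress controller that accesses its own SLB while being in that SLB's rs list hits the same problem. So does a Pod whose `--external-url` is set to the SLB IP: the CNI SNATs Pod traffic to anything outside the Pod/Service CIDRs, so the source IP seen by the SLB becomes the node IP, and if that node is also an rs of the SLB, the request fails too.
168 | 
169 | A Pod being unable to reach its own Service is related to the same issue, but kubelet provides the `hairpinMode: promiscuous-bridge` option for it.
170 | 
171 | ## Links
172 | 
173 | - [DNS](04.08.md)
174 | - Next: [Ingress Controller](04.10.md)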
175 | 


--------------------------------------------------------------------------------
/eBook/04.10.md:
--------------------------------------------------------------------------------
  1 | # Ingress Controller
  2 | 
  3 | Ingress Controller is a generic term; there are many implementations, for example:
  4 | 
  5 |  * [Ingress NGINX](https://github.com/kubernetes/ingress-nginx): maintained by the Kubernetes project, based on nginx.
  6 |  * [Nginx Ingress](https://docs.nginx.com/nginx-ingress-controller/): the implementation from NGINX itself.
  7 |  * [F5 BIG-IP Controller](https://clouddocs.f5.com/containers/latest/): a controller developed by F5 that lets administrators manage F5 BIG-IP devices from Kubernetes and OpenShift through a CLI or API.
  8 |  * [Ingress Kong](https://konghq.com/blog/kubernetes-ingress-controller-for-kong/): the Kubernetes Ingress Controller maintained by the well-known open-source API Gateway project.
  9 |  * [Traefik](https://github.com/containous/traefik): an open-source HTTP reverse proxy and load balancer written in Golang that also supports Ingress.
 10 |  * [haproxy-ingress](https://haproxy-ingress.github.io/docs/getting-started/): an Ingress Controller built on HAProxy.
 11 |  * [Cilium Gateway](https://haproxy-ingress.github.io/docs/getting-started/): an advanced Ingress controller built on the Cilium gateway and eBPF.
 12 |  * [Envoy Gateway](https://gateway.envoyproxy.io/zh/): an ingress controller based on the Envoy proxy, for routing and managing external traffic.
 13 |  * [Higress](https://higress.io/en/): a cloud-native API gateway built on open-source Istio and Envoy, distilled from years of Envoy Gateway practice inside Alibaba.
 14 | 
 15 | > * These are not the only Ingress Controller implementations; many more can be found online and are not listed here.
 16 | 
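 17 | ## How to expose multiple http services
 18 | 
 19 | In practice most traffic is http, and the K8S overlay network is normally reachable only from K8S nodes. For example, if a PodIP is `10.224.1.198`, a machine that is not a K8S worker routes that IP to its own gateway, which routes it onward, so the PodIP cannot be reached. This is the same as a NAT-network virtual machine in vmware workstation: other machines on your LAN cannot reach the VMs inside your vmware.
 20 | 
 21 | Suppose two Services need to be exposed:
 22 | - `account-user` exposed as the `/userInfo` web route
 23 | - `book-server` exposed as the `/bookList` web route
 24 | 
 25 | ### Reverse proxy
 26 | 
 27 | The first thing that comes to mind is running an nginx on a node to reverse proxy the Services:
 28 | 
 29 | ![nginx-hostnetwork](../images/nginx-hostnetwork.png)
 30 | 
 31 | Write the nginx configuration file with the coredns SVC_IP as the resolver: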
 32 | 
 33 | ```
 34 | http {
 35 |     resolver 10.96.0.10 valid=5s;
 36 | 
 37 |     server {
 38 |         listen 80;
 39 | 
 40 |         location /userInfo {
 41 |             proxy_pass http://account-user.prod.svc.cluster.local;
 42 |             proxy_set_header Host $host;
 43 |             proxy_set_header X-Real-IP $remote_addr;
 44 |             proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
 45 |             proxy_set_header X-Forwarded-Proto $scheme;
 46 |         }
 47 | 
 48 |         location /bookList {
 49 |             proxy_pass http://book-server.prod.svc.cluster.local;
 50 |             proxy_set_header Host $host;
 51 |             proxy_set_header X-Real-IP $remote_addr;
 52 |             proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
 53 |             proxy_set_header X-Forwarded-Proto $scheme;
 54 |         }
 55 |     }
 56 | }
 57 | ```
 58 | 
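 59 | ### Optimization
 60 | 
 61 | Later you realize that every request still goes through the Service. Could `proxy_pass` keep the service name while lua queries `https://kubernetes.default.svc.cluster.local` and maintains the PodIPs in the upstream list?
 62 | 
 63 | ### annotation
 64 | 
 65 | Later still, you find that maintaining the file by hand is tedious and you lose track of which Services are proxied. You remember that k8s has `annotation`, where you can write notes for your own nginx + lua to read, so you might put the configuration on the svc `annotation`, something like this: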
 66 | 
 67 | ```
 68 | apiVersion: v1
 69 | kind: Service
 70 | metadata:
 71 |   annotations:
 72 |     conf: |
 73 |       more_set_headers "Request-Id: $req_id";
 74 |       ...
 75 |   ...
 76 | ```
 77 | 
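 78 | nginx + lua connects to kube-apiserver, fetches the `annotation` of every Service, and generates the configuration file.
 79 | 
 80 | ### CRD and how an Ingress Controller works
 81 | 
 82 | As you get more familiar with the k8s API, writing this on the svc starts to feel clumsy, and it is awkward to manage with kubectl. Could you create your own resource object, a `kind: Proxy` just like `Kind: Pod`? That is in fact where `Kind: ingress` comes from: it abstracts the reverse proxy configuration file into K8S yaml: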
 83 | 
 84 | ```yaml
 85 | # Illustrative example only; it may be out of date
 86 | apiVersion: networking.k8s.io/v1
 87 | kind: Ingress
 88 | metadata:
 89 |   name: my-ingress
 90 |   annotations:
 91 |     nginx.ingress.kubernetes.io/proxy-body-size: "0"
 92 |     nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
 93 |     nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
 94 | spec:
 95 |   ingressClassName: nginx
 96 |   rules:
 97 |   - host: api.domain.com
 98 |     http:
 99 |       paths:
100 |       - backend:
101 |           serviceName: api
102 |           servicePort: 80
103 |   - host: test.domain.com
104 |     http:
105 |       paths:
106 |       - path: /web/*
107 |         backend:
108 |           serviceName: web
109 |           servicePort: 8080
110 |   - host: backoffice.domain.com
111 |     http:
112 |       paths:
113 |       - backend:
114 |           serviceName: backoffice
115 |           servicePort: 8080     
116 | ```
117 | 
118 | For example, for http traffic reaching the Ingress Controller:
119 | - `curl -H 'host: api.domain.com' http://<ingress-controller-SLB_IP>` is reverse proxied to port 80 of the Pods behind the api service
120 | - `curl -H 'host: test.domain.com' http://<ingress-controller-SLB_IP>/web/v1` is reverse proxied to port 8080 of the Pods behind the web service
121 | - `curl -H 'host: backoffice.domain.com' http://<ingress-controller-SLB_IP>` is reverse proxied to port 8080 of the Pods behind the backoffice service
122 | 
123 | So it works much like an nginx reverse proxy. If you exec into ingress nginx, you will find that an Ingress ends up as an nginx configuration file generated inside the container:
124 | 
125 | ```bash
126 | $ kubectl -n ingress-nginx exec nginx-ingress-controller-6cdcfd8ff9-t5sxl -- cat /etc/nginx/nginx.conf
127 | ...
128 | 	## start server nginx.testdomain.com
129 | 	server {
130 | 		server_name nginx.testdomain.com ;
131 | 		
132 | 		listen 80;
133 | 		
134 | 		set $proxy_upstream_name "-";
135 | 		
136 | 		location / {
137 | 			
138 | 			set $namespace      "default";
139 | 			set $ingress_name   "nginx-ingress";
140 | 			set $service_name   "nginx";
141 | 			set $service_port   "80";
142 | 			set $location_path  "/";
143 |             ........
144 | 	## end server nginx.testdomain.com      
145 | ...
146 | ```
147 | 
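148 | Note that although the configuration names the service, traffic is actually proxied directly to the service's endpoints.
149 | 
150 | ### High availability
151 | 
152 | Networking:
153 | 
154 | - Ingress Controller exposed via NodePort behind an external SLB
155 | - hostNetwork behind an external SLB, e.g. when TCP traffic needs to be proxied
156 | - hostPort behind an external SLB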
157 | 
158 | ```
159 |                                             +-------+
160 |                                             |       |
161 |                                -+---------->+       |
162 |                               /             +-------+ node1
163 |                              /    
164 |                             /     
165 |                            /                +-------+
166 | +--------+                /                 |       |
167 | | client +----------->   SLB -------------->+       |
168 | +--------+                \                 +-------+ node2
169 |                            \    
170 |                             \     
171 |                              \              +-------+
172 |                               ------------->+       |
173 |                                             |       |
174 |                                             +-------+ node3
175 | ```
176 | 
177 | 
178 | Deployment options:
179 | - daemonSet + nodeSelector
180 | - deploy with a replicas count + nodeSelector + pod anti-affinity
181 | 
182 | Multiple `Ingress Controller`s (mind the `ingressClassName` in the ingress yaml):
183 | 
184 | ```
185 |                                             +---------------------------+
186 |                                             |                           |
187 |                                -+---------->+ hostNetwork ingress nginx |
188 |                               /             +---------------------------+ node1
189 |                              /    
190 |                             /     
191 |                            /                +---------------------------+
192 | +--------+                /                 |                           |
193 | | client +----------->    F5 -------------->+ hostNetwork ingress nginx |
194 | +--------+                \                 +---------------------------+ node2
195 |                            \    
196 |                             \     
197 |                              \              +---------------------------+
198 |                               ------------->+ hostNetwork ingress nginx |
199 |                                             |                           |
200 |                                             +---------------------------+ node3
201 | 
202 |                                             +---------------------------+
203 |                                             |                           |
204 |                                -+---------->+ hostNetwork Higress       |
205 |                               /             +---------------------------+ node4
206 |                              /    
207 |                             /     
208 |                            /                +---------------------------+
209 | +--------+                /                 |                           |
210 | | client +-----------> SLB(or VIP) -------->+ hostNetwork Higress       |
211 | +--------+                \                 +---------------------------+ node5
212 |                            \    
213 |                             \     
214 |                              \              +---------------------------+
215 |                               ------------->+ hostNetwork Higress       |
216 |                                             |                           |
217 |                                             +---------------------------+ node6
218 | 
219 | ```
220 | 
221 | ## Links
222 | 
223 | - [K8S high availability and SLB](04.09.md)
224 | - Next: [Case: calico-kube-controllers logs report dial tcp 10.96.0.1:443: i/o timeout](05.01.md)
225 | 


--------------------------------------------------------------------------------
/eBook/05.01.md:
--------------------------------------------------------------------------------
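  1 | # After deployment, calico-kube-controllers logs report dial tcp 10.96.0.1:443: i/o timeout
  2 | 
  3 | This case came from a WeChat contact who asked me for help. It is written up so that readers can reinforce their understanding of what each K8S component does from a troubleshooting perspective.
  4 | 
  5 | ## Troubleshooting
  6 | 
  7 | ### Logs
  8 | 
  9 | After deployment, `calico-kube-controllers` was found logging errors: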
 10 | 
 11 | ```bash
 12 | $ kubectl -n calico-system get pod
 13 | NAME                                            READY STATUS   RESTARTS     AGE
 14 | calico-kube-controllers-596576f79-j2pkg         0/1   Running  3 (32s ago)  13m
 15 | calico-node-gqz7n                               1/1   Running  0            34m
 16 | calico-node-tg9gr                               1/1   Running  0            34m
 17 | calico-node-tg9gr                               1/1   Running  0            34m
 18 | calico-typha-b8cccb6bb-nr4m5                    1/1   Running  0            48m
 19 | calico-typha-b8cccb6bb-v9db21                   1/1   Running  0            48m
 20 | csi-node-driver-6wkcb                           2/2   Running  0            13m
 21 | csi-node-driver-gdc67                           2/2   Running  0            13m
 22 | csi-node-driver-kb896                           2/2   Running  0            13m
 23 | $ kubectl logs calico-kube-controllers-596576f7
 24 | 2024-06-12 05:53:08.531 [INFO][1] main.go 107: Loaded configuratiocyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}
 25 | ...
 26 | 2024-06-12 05:53:38.561 [ERROR][1] client.go 295: Error getting cluster information config ClusterInformations="default" error=Get "https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.96.0.1:443: i/o timeout
 27 | 2024-06-12 05:53:38.561 [INFO][1] main.go 138: Failed to initialize datastore error=Get "https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.96.0.1:443: i/o timeout
 28 | ```
 29 | 
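 30 | ### Investigation
 31 | 
 32 | After confirming that the system firewall was off and kube-proxy was deployed, the immediate problem was that the kubernetes Service `https://10.96.0.1:443` was unreachable, so I asked him to curl it directly on every machine to see whether the Service worked: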
 33 | 
 34 | ```bash
 35 | $ curl -vk https://10.96.0.1
 36 | {
 37 |   "kind": "Status",
 38 |   "apiVersion": "v1",
 39 |   "metadata": {},
 40 |   "status": "Failure",
 41 |   "message": "Unauthorized",
 42 |   "reason": "Unauthorized",
 43 |   "code": 401
 44 | }
 45 | ```
 46 | 
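 47 | HTTP content came back, so the Service DNAT on the node is working. The problem is that the Service is unreachable from inside the Pod, i.e. the traffic after DNAT, `PodIP -> kube-apiserver:6443`, does not get through. So bypass the Service on the faulty node and take a look: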
 48 | 
 49 | ```bash
 50 | # 192.168.21.11 is a master node
 51 | $ tcpdump -nn -i any host 192.168.21.11
 52 | # With the capture running, curl from inside the container's network namespace
 53 | # Use crictl inspect or docker inspect xx to get the Pid of the calico-kube-controllers /pause container
 54 | $ nsenter --net --target <pid> curl -k https://192.168.21.11:6443
 55 | 
 56 | # The capture then shows:
 57 | ....
 58 | 16:40:19.165180 calia9589d79281 In IP 192.168.236.10.59894 > 192.168.21.11.6443: Flags ...
 59 | 16:40:19.165190 ens192 Out IP 192.168.236.10.59894 > 192.168.21.11.6443: Flags ...
 60 | 16:40:23.357175 calia9589d79281 In IP 192.168.236.10.59894 > 192.168.21.11.6443: Flags ...
 61 | 16:40:23.357195 ens192 Out IP 192.168.236.10.59894 > 192.168.21.11.6443: Flags ...
 62 | ```
 63 | 
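 64 | The capture shows that the node did not SNAT the packets. They reached the master, whose replies followed the default route to the gateway; the gateway forwarded them and they never came back. So I asked him to check MASQUERADE in the nat table: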
 65 | 
 66 | ```bash
 67 | # He sent a screenshot; I typed out only the key lines and wrapped them for readability
 68 | $ iptables -w -t nat -S | grep MASQ
 69 | ...
 70 | -A cali-POSTROUTING -o tunl0 -m comment --comment "cali:SXwvdsbh4Mw7wOln" \
 71 |    -m addrtype ! --src-type LOCAL --limit-iface-out -m addrtype --src-type LOCAL -j MASQUERADE --random-fully
 72 | -A cali-nat-outgoing -m comment --comment "cali:flqWnvo8yg4ULQLam" -m set --match-set cali40masq-ipam-pools src \
 73 |   -m set ! --match-set cali40all-ipam-pools dst -j MASQUERADE --random-fully
 74 | ```
 75 | 
 76 | The iptables rules show that calico uses ipset matches as the SNAT condition (so that multiple IPPool address pools do not add iptables rules). Look at the two ipsets:
 77 | 
 78 | ```bash
 79 | $ ipset list cali40masq-ipam-pools
 80 | Name: cali40masq-ipam-pools
 81 | Type: hash:net
 82 | Revision: 7
 83 | Header: family inet hashsize 1024 maxelem 1048576 bucketsize 12 initval 0x793b8a70
 84 | size in memory: 504
 85 | References: 1
 86 | Number of entries: 1
 87 | Members:
 88 | 192.168.0.0/16
 89 | 
 90 | $ ipset list cali40all-ipam-pools
 91 | Name: cali40all-ipam-pools
 92 | Type: hash:net
 93 | Revision:7
 94 | Header: family inet hashsize 1024 maxelem 1048576 bucketsize 12 initval 0xaa98b84c
 95 | size in memory: 504
 96 | References: 1
 97 | Number of entries: 1
 98 | Members:
 99 | 192.168.0.0/16
100 | ```
101 | 
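102 | Taken together, the iptables rule and the ipsets mean: SNAT only when the source IP is in `192.168.0.0/16` and the destination IP is not in `192.168.0.0/16`. The destination, the master at `192.168.21.11`, falls inside that range, so no SNAT was done.
103 | 
104 | I asked him to run `curl -I https://223.5.5.5` from the `/pause` network namespace to see whether it worked. It did: that destination is outside the pool, so SNAT was applied and the traffic flowed normally.
105 | 
106 | The root cause was that his calico ippool overlapped the host network: the hosts are in `192.168.21.0/24` and calico's default pool is `192.168.0.0/16`. After he ran `kubectl edit ippool default-ipv4-ippool` to change the cidr and deleted the Pods, everything worked.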
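
The match logic of the `cali-nat-outgoing` rule can be reproduced with the standard `ipaddress` module. This is a sketch of the decision only, with the pool contents passed in as plain lists; the function name `needs_snat` is hypothetical and nothing here reads real ipsets.

```python
import ipaddress


def needs_snat(src, dst, masq_pools, all_pools):
    """SNAT applies when src is in a masquerade pool and dst is outside every pool."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    src_in_pool = any(src_ip in ipaddress.ip_network(pool) for pool in masq_pools)
    dst_in_pool = any(dst_ip in ipaddress.ip_network(pool) for pool in all_pools)
    return src_in_pool and not dst_in_pool


if __name__ == "__main__":
    pools = ["192.168.0.0/16"]  # the overlapping default pool from this case
    # Pod -> master: the destination is inside the pool, so no SNAT.
    print(needs_snat("192.168.236.10", "192.168.21.11", pools, pools))
    # Pod -> public IP: the destination is outside the pool, so SNAT is applied.
    print(needs_snat("192.168.236.10", "223.5.5.5", pools, pools))
```

With a pool that does not overlap the host network (for example a hypothetical `10.244.0.0/16`), the same call for a Pod reaching `192.168.21.11` returns `True`, which is the behavior the cluster needed.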
107 | 
108 | ## Links
109 | 
110 | - Previous: [Ingress Controller](04.10.md)
111 | - Next: [After deploying workloads on a new K8S cluster, a pod cannot connect to the database](05.02.md)


--------------------------------------------------------------------------------
/eBook/05.02.md:
--------------------------------------------------------------------------------
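 1 | # After deploying workloads on a new cluster, a pod cannot connect to the database: dial tcp ip:3306: i/o timeout
 2 | 
 3 | In this case k8s was deployed on machines provided by the customer, and the workload deployment ran into trouble. On-site staff traced it to a service pod that could not reach the database: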
 4 | 
 5 | ```
 6 | dial tcp 10.xxx.xx.xx:3306: i/o timeout
 7 | ```
 8 | 
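 9 | ## Troubleshooting
10 | 
11 | First probe the database port from the node hosting the pod; the node IP might not be on the database allowlist, or the remote database port might not be listening: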
12 | 
13 | ```bash
14 | $ telnet 10.xxx.xx.xx 3306
15 | Trying 10.xxx.xx.xx....
16 | Connected to 10.xxx.xx.xx.
17 | Escape character is '^]'
18 | ....
19 | ```
20 | 
21 | So the IP:Port is reachable from the node. Use nsenter to probe from inside the pod's network namespace:
22 | 
23 | ```bash
24 | # Find the container ID of the pod; if the container is not up, use the ID of its /pause container, then take the Pid from inspect
25 | $ docker inspect xxxxx | grep -m1 Pid
26 |          "Pid": 2018877,
27 | $ nsenter --net -t 2018877 telnet 10.xxx.xx.xx 3306
28 | Trying 10.xxx.xx.xx....
29 | telnet: connect to address 10.xxx.xx.xx: No route to host
30 | ```
31 | 
32 | Strange. Check the IP and routes:
33 | 
34 | ```bash
35 | $ nsenter --net -t 2018877 ip a s 
36 | 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
37 |     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
38 |     inet 127.0.0.1/8 scope host lo
39 |        valid_lft forever preferred_lft forever
40 |     inet6 ::1/128 scope host 
41 |        valid_lft forever preferred_lft forever
42 | 2: eth0@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default 
43 |     link/ether fe:31:23:1f:55:2a brd ff:ff:ff:ff:ff:ff link-netnsid 0
44 |     inet 10.88.0.3/16 brd 10.88.0.255 scope global eth0
45 |        valid_lft forever preferred_lft forever
46 |     inet6 fe80::fc31:23ff:fe1f:552a/64 scope link 
47 |        valid_lft forever preferred_lft forever
48 | ```
49 | 
50 | The IP is odd: our Pod CIDR is not in this range:
51 | 
52 | ```bash
53 | $ ip a s cni0
54 | 7: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
55 |     link/ether 32:70:aa:c6:eb:fc brd ff:ff:ff:ff:ff:ff
56 |     inet 10.187.4.1/24 brd 10.187.0.255 scope global cni0
57 |        valid_lft forever preferred_lft forever
58 |     inet6 fe80::3070:aaff:fec6:ebfc/64 scope link 
59 |        valid_lft forever preferred_lft forever
60 | ```
61 | 
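62 | The Pod IP is set when kubelet starts the Pod and invokes the cni plugin binary with the cni configuration files under `/etc/cni/net.d`. A wrong range usually means another cni-conf file is present, so I had on-site staff check: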
63 | 
64 | ```
65 | $ ls -l /etc/cni/net.d
66 | total 8
67 | -rw-r--r-- 1 root root 292 Dec 17 17:38 10-flannel.conflist
68 | -rw-r--r-- 1 root root 483 Aug 16  2021 87-podman-bridge.conflist
69 | ```
70 | 
71 | The customer's machine was not clean; podman was probably installed:
72 | 
73 | ```bash
74 | $ rpm -qa | grep podman
75 | podman-0.10.1-8.ky10.x86_64
76 | ```
77 | 
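78 | I had the on-site engineer ask the customer whether podman was in use. After confirming it was not, we uninstalled it and removed that file. Once the Pods with the wrong IPs were deleted, everything returned to normal.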
79 | 
80 | ## Links
81 | 
82 | - Previous: [After deployment, calico-kube-controllers logs report dial tcp 10.96.0.1:443: i/o timeout](05.01.md)
83 | 


--------------------------------------------------------------------------------
/eBook/directory.md:
--------------------------------------------------------------------------------
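 1 | # Table of Contents
 2 | - [Preface](preface.md)
 3 | 
 4 | ## Networking fundamentals
 5 | 
 6 | - Chapter 1: Networking basics
 7 |     - 1.1 [Network layering concepts](01.01.md)
 8 |     - 1.2 [Troubleshooting tools below the application layer](01.02.md)
 9 |     - 1.3 [Troubleshooting tools for common application-layer protocols](01.03.md)
10 | 
11 | ## iptables
12 | 
13 | - Chapter 2: iptables
14 |     - 2.1 [iptables concepts and some basics](02.01.md)
15 | 
16 | ## Docker networking
17 | 
18 | - Chapter 3: Docker networking
19 |     - 3.1 [The default bridge network](03.01.md)
20 |     - 3.2 [none and host networks](03.02.md)
21 |     - 3.3 [container network](03.03.md)
22 |     - 3.4 [Cross-node communication between Docker containers](03.04.md)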
23 | 
24 | ## K8S networking
25 | 
26 | - Chapter 4: K8S networking
27 |     - 4.1: [Basic concepts](04.01.md)
28 |     - 4.2: [How Services work](04.02.md)
29 |         - 4.2.1: [iptables mode](04.02.01.md)
30 |         - 4.2.2: [ipvs mode](04.02.02.md)
31 |         - 4.2.3: [externalIPs and pitfalls](04.02.03.md)
32 |     - 4.3: [NodePort, SNAT, and hostPort](04.03.md)
33 |     - 4.4: [Probes and some pitfalls](04.04.md)
34 |     - 4.5: [CNI and cni-plugins](04.05.md)
35 |     - 4.6: [How the flannel CNI works](04.06.md)
36 |     - 4.7: [How the calico CNI works and basic usage](04.07.md)
37 |     - 4.8: [K8S DNS](04.08.md)
38 |     - 4.9: [K8S high availability and SLB](04.09.md)
39 |     - 4.10: [Ingress Controller](04.10.md)
40 | 
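41 | ## Cases
42 | 
43 | Case submissions are welcome. Please avoid screenshots of command output where possible, and if a network topology is involved, include a topology diagram.
44 | 
45 | - Chapter 5: Some network troubleshooting cases
46 |     - 5.1: [After deployment, calico-kube-controllers logs report dial tcp 10.96.0.1:443: i/o timeout](05.01.md)
47 |     - 5.2: [After deploying workloads to a new cluster, pods cannot connect to the database: dial tcp ip:3306: i/o timeout](05.02.md)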
48 | 


--------------------------------------------------------------------------------
/eBook/preface.md:
--------------------------------------------------------------------------------
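 1 | # Preface
 2 | 
 3 | ## About this guide
 4 | 
 5 | This guide is not an in-depth book on container networking. Topics that require reading kernel source, such as how a packet travels from the NIC to your process, are not covered; the goal is to learn the fundamentals of container networking. The material is roughly organized as follows:
 6 | - OSI network model
 7 | - Troubleshooting commands for the common layers
 8 | - iptables
 9 | - Docker networking
10 | - K8S networking and the CNI implementations (flannel, calico)
11 | - Service, DNS, high availability, Ingress Controller
12 | - Some troubleshooting cases
13 | 
14 | My time and energy are limited. If you find omissions or errors in this repository, or have better cases or ways of explaining things, pull requests are welcome.
15 | 
16 | ## Links
17 | 
18 | - [Table of Contents](directory.md)
19 | - Next: [OSI network model](01.01.md)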
20 | 


--------------------------------------------------------------------------------
/images/K8S-overlay.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/K8S-overlay.png


--------------------------------------------------------------------------------
/images/K8s-service-process.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/K8s-service-process.png


--------------------------------------------------------------------------------
/images/Load-Balancing-Algorithms.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/Load-Balancing-Algorithms.gif


--------------------------------------------------------------------------------
/images/TCP-IP-topology.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/TCP-IP-topology.png


--------------------------------------------------------------------------------
/images/browser-cap-to-curl.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/browser-cap-to-curl.png


--------------------------------------------------------------------------------
/images/docker-app-replication.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/docker-app-replication.png


--------------------------------------------------------------------------------
/images/docker-node-ctrs.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/docker-node-ctrs.png


--------------------------------------------------------------------------------
/images/docker-two-node-host-gw.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/docker-two-node-host-gw.png


--------------------------------------------------------------------------------
/images/docker-two-node.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/docker-two-node.png


--------------------------------------------------------------------------------
/images/flannel-host-gw.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/flannel-host-gw.png


--------------------------------------------------------------------------------
/images/flannel-vxlan-node.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/flannel-vxlan-node.png


--------------------------------------------------------------------------------
/images/hairpin-icon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/hairpin-icon.png


--------------------------------------------------------------------------------
/images/host-gw-fault.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/host-gw-fault.png


--------------------------------------------------------------------------------
/images/iptables.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/iptables.png


--------------------------------------------------------------------------------
/images/lvs-dnat-without-snat.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/lvs-dnat-without-snat.png


--------------------------------------------------------------------------------
/images/meme-k8s-network.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/meme-k8s-network.jpg


--------------------------------------------------------------------------------
/images/nginx-hostnetwork.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/nginx-hostnetwork.png


--------------------------------------------------------------------------------
/images/nodeport-cluster.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/nodeport-cluster.png


--------------------------------------------------------------------------------
/images/nodeport-local.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/nodeport-local.png


--------------------------------------------------------------------------------
/images/pod-delete-process.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/pod-delete-process.png


--------------------------------------------------------------------------------
/images/pod-to-annother-node.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/pod-to-annother-node.gif


--------------------------------------------------------------------------------
/images/pod-to-service.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/pod-to-service.gif


--------------------------------------------------------------------------------
/images/router-ap.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/router-ap.png


--------------------------------------------------------------------------------
/images/router-lan-laptop.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/router-lan-laptop.png


--------------------------------------------------------------------------------
/images/vxlan-mtu.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangguanzhang/simple-container-network-book/1f93a4f7730d3b3ec55c8e04f61feba5ed23f792/images/vxlan-mtu.png


--------------------------------------------------------------------------------