├── PPT
    ├── main.pptx
    ├── memory.pptx
    └── migration.pptx
├── README.md
├── assets
    ├── divide_configuration_register.png
    ├── ich9_8259.png
    ├── ich9_ioapic_1.png
    ├── ich9_ioapic_2.png
    └── local_vector_table.png
├── boot-2-ysh.md
├── boot.md
├── extend_kvm_cpu.md
├── interrupt.md
├── interrupt_and_io
    ├── IA32_manual_Ch10.md
    ├── IOAPIC.md
    ├── PIC.md
    └── linux_interrupt.md
├── memory.md
├── pci.md
├── q35.md
├── schedule.md
├── shoot4u.md
└── timer_and_clock
    └── pvtimer.md


/PPT/main.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GiantVM/doc/9bb4f0ea40bb342b1c47b2256510cb394429c127/PPT/main.pptx


--------------------------------------------------------------------------------
/PPT/memory.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GiantVM/doc/9bb4f0ea40bb342b1c47b2256510cb394429c127/PPT/memory.pptx


--------------------------------------------------------------------------------
/PPT/migration.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GiantVM/doc/9bb4f0ea40bb342b1c47b2256510cb394429c127/PPT/migration.pptx


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # DOC
2 | 
3 | Documentation maintained for the project.
4 | 
5 | ## ref
6 | 
7 | * [Chinese Copywriting Guidelines (中文文案排版指北)](https://github.com/mzlogin/chinese-copywriting-guidelines)
8 | 


--------------------------------------------------------------------------------
/assets/divide_configuration_register.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GiantVM/doc/9bb4f0ea40bb342b1c47b2256510cb394429c127/assets/divide_configuration_register.png


--------------------------------------------------------------------------------
/assets/ich9_8259.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GiantVM/doc/9bb4f0ea40bb342b1c47b2256510cb394429c127/assets/ich9_8259.png


--------------------------------------------------------------------------------
/assets/ich9_ioapic_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GiantVM/doc/9bb4f0ea40bb342b1c47b2256510cb394429c127/assets/ich9_ioapic_1.png


--------------------------------------------------------------------------------
/assets/ich9_ioapic_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GiantVM/doc/9bb4f0ea40bb342b1c47b2256510cb394429c127/assets/ich9_ioapic_2.png


--------------------------------------------------------------------------------
/assets/local_vector_table.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GiantVM/doc/9bb4f0ea40bb342b1c47b2256510cb394429c127/assets/local_vector_table.png


--------------------------------------------------------------------------------
/boot-2-ysh.md:
--------------------------------------------------------------------------------
  1 | # GiantVM Setup Walkthrough (ysh)
  2 | 
  3 | Host environment: Windows 11 64-bit, Core i9 12900p
  4 | 
  5 | This walkthrough uses VMware Workstation Pro 16.
  6 | 
  7 | 
  8 | 
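  9 | ## 1. Environment setup
 10 | 
 11 | ### step1 System preparation
 12 | 
 13 | > Press Win+R and open `cmd`.
 14 | >
 15 | > Run `systeminfo`.
 16 | >
 17 | > Check whether Hyper-V is enabled; if it is, disable it.
 18 | >
 19 | > To disable it:
 20 | >
 21 | > - Open "Turn Windows features on or off" and clear "Virtual Machine Platform" and "Windows Hypervisor Platform".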
 22 | >   ![image-20220930105100032](C:\Users\Ysh78\AppData\Roaming\Typora\typora-user-images\image-20220930105100032.png)
 23 | > - If a Hyper-V checkbox is shown, clear it as well.
 24 | > - Reboot for the changes to take effect.
 25 | 
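 26 |    ### step2 VMware settings
 27 | 
 28 | > Start Ubuntu 16.04 in VMware.
 29 | > Configuration:
 30 | > kernel Linux 4.15.0-112, disk allocation > 40G, nested virtualization enabled in the CPU settings.
 31 | 
 32 |    ### step3 Install the required packages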
 33 | > ```shell
 34 | > sudo apt-get install build-essential openssl libncurses5-dev libssl-dev
 35 | > 
 36 | > sudo apt-get install zlibc minizip libidn11-dev libidn11 bison flex
 37 | > ```
 38 | >
 39 | > 
 40 | 
 41 |    ### step4 Get Linux-DSM
 42 | > ```shell
 43 | > git clone https://github.com/GiantVM/Linux-DSM.git
 44 | > ```
 45 | >
 46 | > 
 47 | 
 48 |    ### step5 Enter the source tree
 49 | > ```shell
 50 | > cd Linux-DSM
 51 | > ```
 52 | >
 53 | > 
 54 | 
 55 |    ### step6 Enable DSM support
 56 | > ```shell
 57 | > make menuconfig
 58 | > ```
 59 | >
 60 | >    `Virtualization` --> `KVM distributed software memory support` --> `press 'Y' to include the option`
 61 | >    `Save` --> `Exit`
 62 | 
 63 |    ### step7 Compile the kernel (make)
 64 |    `make -jN`
 65 |    [`N` is the number of parallel build jobs]
 66 |    Wait about three hours (or more).
 67 | 
 68 |    A previous failed attempt:
 69 |    Environment: Windows 11, WSL2, Ubuntu 16.04, Linux kernel version 5.10
 70 |    Output:
 71 |    `makefile:976: recipe for target 'vmlinux' failed`
 72 | 
 73 | 
 74 |    ### step8 Install the kernel
 75 |    > ```shell
 76 |    > sudo make modules_install
 77 |    > sudo make install
 78 |    > ```
 79 | 
 80 |    
 81 | 
 82 | ### step9 Update GRUB
 83 | 
 84 |    > [In my experience, you should open the GRUB configuration file first at this point]
 85 |    > [gedit is convenient for viewing it; vi works too]
 86 | 
 87 |    ```shell
 88 |    sudo gedit /etc/default/grub
 89 |    ```
 90 | 
 91 |    [Then set GRUB_HIDDEN_TIMEOUT to 0; otherwise there is no time to pick a different kernel at the next reboot]
 92 |    Make this edit yourself.
 93 |    [Then comes the key step]
 94 | 
 95 |    ```shell
 96 |    sudo update-grub
 97 |    ```
 98 | 
 99 |    [Then reboot]
100 | 
101 |    ```shell
102 |    reboot
103 |    ```
104 | 
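105 |    [After rebooting, the boot menu appears; select the newly installed kernel]
106 |    [Wait for the system to boot, then run the following in a shell]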
107 | 
108 |    ```shell
109 |    uname -a 
110 |    ```
111 | 
112 |    [The reported kernel version should be 4.9.76+]
113 | 
114 | 
115 | 
116 | ## 2.QEMU
117 | ### step1 Preparation
118 | > sudo apt-get install python pkg-config libglib2.0-dev zlib1g-dev libpixman-1-dev libfdt-dev
119 | > git clone https://github.com/GiantVM/QEMU.git
120 | > cd QEMU
121 | 
122 | ### step2 Configuration
123 | > ```shell
124 | > ./configure --target-list=x86_64-softmmu --enable-kvm
125 | > ```
126 | >
127 | > 
128 | 
129 | ### step3 Compilation
130 | > ```shell
131 | > make -jN
132 | > ```
133 | >
134 | > 
135 | 
136 | ### step4 Create hard disk image
137 | > ```shell
138 | > cd ..
139 | > wget http://ftp.sjtu.edu.cn/ubuntu-cd/16.04.7/ubuntu-16.04.7-server-amd64.iso
140 | > ```
141 | >
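142 | > [If the file is not found, browse to http://ftp.sjtu.edu.cn/ubuntu-cd/16.04.7, locate the Ubuntu 16.x server ISO there, and wget that instead]
143 | >
144 | > [You need to install a stock qemu via apt here; if you skip this and use the qemu-system-x86_64 binary under x86_64-softmmu/ instead, it hangs]
145 | >
146 | > ```shell
147 | > cd QEMU && ./qemu-img create -f qcow2 ubuntu-server.img 10G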
148 | > sudo apt-get install qemu
149 | > qemu-system-x86_64 -m 1024 ubuntu-server.img -cdrom ../ubuntu-16.04.7-server-amd64.iso -enable-kvm
150 | > ```
151 | >
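152 | > [The command above launches the installer. Set a user name and password; when asked whether to install GRUB, choose yes. The other options do not matter]
153 | 
154 | ## 3. Run GiantVM on a single machine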
155 | 
156 | ### First we install vncviewer to monitor the guest.
157 | 
158 | ```shell
159 |    wget https://www.realvnc.com/download/file/viewer.files/VNC-Viewer-6.19.325-Linux-x64.deb
160 |    sudo dpkg -i VNC-Viewer-6.19.325-Linux-x64.deb
161 | ```
162 | 
163 | If the commands below fail with an out-of-memory error, shut down the VMware virtual machine and give it more memory (>8G).
164 | 
165 | >  ### terminal 1 : 
166 | >
167 | >  ```shell
168 | >  cd QEMU/
169 | >  
170 | >  sudo x86_64-softmmu/qemu-system-x86_64 --nographic -hda ubuntu-server.img -cpu host -machine kernel-irqchip=off -smp 4 -m 4096  --enable-kvm -serial mon:stdio -local-cpu 2,start=0,iplist="127.0.0.1 127.0.0.1" -vnc :0
171 | >  ```
172 | >
173 | >  
174 | 
175 | > ### terminal 2:
176 | >
177 | > ```shell
178 | > cd QEMU/
179 | > sudo x86_64-softmmu/qemu-system-x86_64 --nographic -hda ubuntu-server.img -cpu host -machine kernel-irqchip=off -smp 4 -m 2048  --enable-kvm -serial mon:stdio -local-cpu 2,start=2,iplist="127.0.0.1 127.0.0.1"
180 | > ```
181 | 
182 | > ### terminal 3: [start vncviewer]
183 | >
184 | > The number after the colon matches the display number passed to `-vnc` in terminal 1.
185 | >
186 | > ```shell
187 | > vncviewer :0
188 | > ```
189 | >
190 | > If you see `No bootable device` after startup, check whether the final `qemu-system-x86_64` step of section 2 (QEMU) completed properly.
191 | >
192 | > 
193 | 
194 | 


--------------------------------------------------------------------------------
/boot.md:
--------------------------------------------------------------------------------
  1 | # GiantVM Booting (Single machine, TCP)
  2 | ## Change HostOS kernel to Linux-DSM
  3 | ### 1. Preparation
  4 | 
  5 | ```shell
  6 |     sudo apt-get install build-essential openssl libncurses5-dev libssl-dev 
  7 |     sudo apt-get install zlibc minizip libidn11-dev libidn11 bison flex
  8 |     
  9 |     git clone https://github.com/xianliang66/Linux-DSM.git
 10 |     cd Linux-DSM
 11 | 
 12 | ```
 13 | Switch the macro definition from `USE_KRDMA_NETWORK` to `USE_KTCP_NETWORK` in `arch/x86/include/asm/kvm_host.h`.
 14 | ### 2. Enable DSM support in kernel config
 15 | ```shell
 16 |     make menuconfig
 17 | ```
 18 | 
 19 | `Virtualization` --> `KVM distributed software memory support` --> press 'Y' to include the option
 20 | 
 21 | `Save` -->  `Exit`
 22 | 
 23 | ### 3. Compile the kernel
 24 | ```shell
 25 |     make -jN
 26 | ```
 27 | `N` is the number of parallel build jobs to run.
 28 | 
 29 | ### 4. Install the kernel
 30 | ```shell
 31 |     sudo make modules_install
 32 |     sudo make install
 33 | ```
 34 | 
 35 | ### 5. Update grub
 36 | ```shell
 37 |     sudo update-grub
 38 | ```
 39 | 
 40 | ### 6. Reboot
 41 | ```shell
 42 |     reboot
 43 | ```
 44 | Select the newly installed kernel, which should be `4.9.76+`.
 45 | 
 46 | ## QEMU
 47 | ### 1. Preparation
 48 | 
 49 | ```shell
 50 |     sudo apt-get install python pkg-config libglib2.0-dev zlib1g-dev libpixman-1-dev libfdt-dev qemu-system-x86
 51 |     
 52 |     git clone https://github.com/xianliang66/QEMU.git
 53 |     cd QEMU
 54 |     
 55 | ```
 56 | 
 57 | ### 2. Configuration
 58 | ```shell
 59 |     ./configure --target-list=x86_64-softmmu --enable-kvm
 60 | ```
 61 | 
 62 | ### 3. Compilation
 63 | ```shell
 64 |     make -jN
 65 | ```
 66 | 
 67 | ### 4. Create hard disk image
 68 | ```shell
 69 |     cd .. && wget http://ftp.sjtu.edu.cn/ubuntu-cd/16.04.7/ubuntu-16.04.7-server-amd64.iso
 70 |     cd QEMU
 71 |     ./qemu-img create -f qcow2 ubuntu-server.img 10G
 72 |     qemu-system-x86_64 -m 1024 ubuntu-server.img -cdrom ../ubuntu-16.04.7-server-amd64.iso -enable-kvm
 73 | ```
 74 | 
 75 | ## Run GiantVM on a single machine
 76 | We simulate two machines by running two QEMU processes on a single machine.
 77 | 
 78 | First we install vncviewer to monitor the guest.
 79 | ```shell
 80 |    wget https://www.realvnc.com/download/file/viewer.files/VNC-Viewer-6.19.325-Linux-x64.deb
 81 |    sudo dpkg -i VNC-Viewer-6.19.325-Linux-x64.deb
 82 | ```
 83 | 
 84 | Terminal 1:
 85 | ```shell
 86 |     sudo x86_64-softmmu/qemu-system-x86_64 --nographic -hda ubuntu-server.img -cpu host -machine kernel-irqchip=off -smp 2 -m 2048  --enable-kvm -serial mon:stdio -local-cpu 1,start=0,iplist="127.0.0.1 127.0.0.1" -vnc :0
 87 | ```
 88 | Terminal 2:
 89 | ```shell
 90 |     sudo x86_64-softmmu/qemu-system-x86_64 --nographic -hda ubuntu-server.img -cpu host -machine kernel-irqchip=off -smp 2 -m 2048  --enable-kvm -serial mon:stdio -local-cpu 1,start=1,iplist="127.0.0.1 127.0.0.1"
 91 | ```
 92 | Terminal 3:
 93 | ```shell
 94 |     vncviewer :0
 95 | ```
 96 | 
 97 | ### Network Configuration
 98 | By default, QEMU provides emulated user-mode networking for VMs, which is slow compared with physical network devices.
 99 | 
100 | 
101 | To avoid this, configure a network bridge between the host and the VM.
102 | 
103 | #### On Host:
104 | 
105 | - Check network devices
106 |     ```
107 |     ifconfig
108 |     ```
109 | ##### Change all the "eth0" below to the name of the network device of your machine.
110 | 
111 | - Create a bridge
112 |     ```
113 | 	brctl addbr br0
114 | 	```
115 |     
116 | - Clear IP of `eth0`
117 |     ```
118 |     ip addr flush dev eth0
119 |     ```
120 | 
121 | - Add `eth0` to bridge
122 | 	```
123 |     brctl addif br0 eth0
124 |     ```
125 |     
126 | - Create tap interface
127 | 	```
128 |     tunctl -t tap0 -u root
129 |     ```
130 | 
131 | - Add `tap0` to bridge
132 | 	```
133 |     brctl addif br0 tap0
134 |     ```
135 |     
136 | - Make sure everything is up
137 | 	```
138 |     ifconfig eth0 up
139 |     ifconfig tap0 up
140 |     ifconfig br0 up
141 |     ```
142 | 
143 | - Check if properly bridged
144 | 	```
145 |     brctl show
146 |     ```
147 | 
148 | - Assign ip to `br0`
149 | 	```
150 | 	dhclient -v br0
151 |     ```
152 | 
153 | Reference: https://gist.github.com/extremecoders-re/e8fd8a67a515fee0c873dcafc81d811c
154 | 
155 | #### Run GiantVM with bridged network
156 | Terminal 1:
157 | ```shell
158 |     sudo x86_64-softmmu/qemu-system-x86_64 --nographic -hda ubuntu-server.img -cpu host -machine kernel-irqchip=off -smp 2 -m 2048  --enable-kvm -serial mon:stdio -local-cpu 1,start=0,iplist="127.0.0.1 127.0.0.1" -netdev tap,id=mynet0,ifname=tap0,script=no,downscript=no -device e1000,netdev=mynet0,mac=52:55:00:d1:55:01 -vnc :0
159 | ```
160 | Terminal 2:
161 | ```shell
162 |     sudo x86_64-softmmu/qemu-system-x86_64 --nographic -hda ubuntu-server.img -cpu host -machine kernel-irqchip=off -smp 2 -m 2048  --enable-kvm -serial mon:stdio -local-cpu 1,start=1,iplist="127.0.0.1 127.0.0.1"
163 | ```
164 | Terminal 3:
165 | ```shell
166 |     vncviewer :0
167 | ```
168 | 
169 | ## Author
170 | Weiye Chen
171 | 


--------------------------------------------------------------------------------
/extend_kvm_cpu.md:
--------------------------------------------------------------------------------
  1 | ## Raising KVM's limit to support 512 vCPUs
  2 | 
  3 | Both QEMU and KVM need to be modified.
  4 | 
  5 | 1. qemu-system-x86_64: unsupported number of maxcpus
  6 | 
  7 |     include/sysemu/sysemu.h
  8 | 
  9 |     ```c
 10 |     #define MAX_CPUMASK_BITS 288
 11 |     ```
 12 | 
 13 |     =>
 14 | 
 15 |     ```c
 16 |     #define MAX_CPUMASK_BITS 512
 17 |     ```
 18 | 
 19 | 2. qemu-system-x86_64: Number of SMP CPUs requested (500) exceeds max CPUs supported by machine 'pc-i440fx-2.8' (255)
 20 | 
 21 |     ```
 22 |     m->max_cpus = 288;
 23 |     ```
 24 | 
 25 |     =>
 26 | 
 27 |     ```
 28 |     m->max_cpus = 512;
 29 |     ```
 30 | 
 31 | 
 32 | 3. Warning: Number of SMP cpus requested (500) exceeds the recommended cpus supported by KVM (240)
 33 | Number of SMP cpus requested (500) exceeds the maximum cpus supported by KVM (288)
 34 | 
 35 |     The check lives in kvm-all.c:
 36 | 
 37 |     ```c
 38 |     #define KVM_CAP_MAX_VCPUS 66       /* returns max vcpus per vm */
 39 |     static int kvm_max_vcpus(KVMState *s)
 40 |     {
 41 |         int ret = kvm_check_extension(s, KVM_CAP_MAX_VCPUS);
 42 |         return (ret) ? ret : kvm_recommended_vcpus(s);
 43 |     }
 44 | 
 45 |     int kvm_check_extension(KVMState *s, unsigned int extension)
 46 |     {
 47 |         int ret;
 48 | 
 49 |         ret = kvm_ioctl(s, KVM_CHECK_EXTENSION, extension);
 50 |         if (ret < 0) {
 51 |             ret = 0;
 52 |         }
 53 | 
 54 |         return ret;
 55 |     }
 56 | 
 57 |     static int kvm_init(MachineState *ms)
 58 |     {
 59 |         ...
 60 |         soft_vcpus_limit = kvm_recommended_vcpus(s);
 61 |         hard_vcpus_limit = kvm_max_vcpus(s);
 62 | 
 63 |         while (nc->name) {
 64 |             if (nc->num > soft_vcpus_limit) {
 65 |                 fprintf(stderr,
 66 |                         "Warning: Number of %s cpus requested (%d) exceeds "
 67 |                         "the recommended cpus supported by KVM (%d)\n",
 68 |                         nc->name, nc->num, soft_vcpus_limit);
 69 | 
 70 |                 if (nc->num > hard_vcpus_limit) {
 71 |                     fprintf(stderr, "Number of %s cpus requested (%d) exceeds "
 72 |                             "the maximum cpus supported by KVM (%d)\n",
 73 |                             nc->name, nc->num, hard_vcpus_limit);
 74 |                     exit(1);
 75 |                 }
 76 |             }
 77 |             nc++;
 78 |         }
 79 |     }
 80 |     ```
 81 | 
 82 |     The vCPU limit is returned by the KVM interface, so ultimately KVM itself has to be changed.
 83 | 
 84 |     arch/x86/include/asm/kvm_host.h
 85 | 
 86 |     ```c
 87 |     #define KVM_MAX_VCPUS 288
 88 |     #define KVM_SOFT_MAX_VCPUS 240
 89 |     ```
 90 | 
 91 |     =>
 92 | 
 93 |     ```c
 94 |     #define KVM_MAX_VCPUS 512
 95 |     #define KVM_SOFT_MAX_VCPUS 512
 96 |     ```
 97 | 
 98 | 
 99 | 4. qemu-system-x86_64: current -smp configuration requires Extended Interrupt Mode enabled. You can add an IOMMU using: -device intel-iommu,intremap=on,eim=on
100 | 
101 |     Add the option `-device intel-iommu,intremap=on,eim=on` at startup.
102 | 
103 | 5. qemu-system-x86_64: -device intel-iommu,intremap=on,eim=on: Intel Interrupt Remapping cannot work with kernel-irqchip=on, please use 'split|off'.
104 | 
105 |     ```bash
106 |     $ sudo $QEMU_SYSTEM_64 -m 512 -hda $DISK -boot c -vnc :1 -enable-kvm -smp 500 -machine q35,kernel-irqchip=split -device intel-iommu,intremap=on,eim=on
107 |     ```
108 |     + Replace `$QEMU_SYSTEM_64` with the path of the rebuilt qemu-system-x86_64, for example `/home/binss/Desktop/qemu/bin/debug/native/x86_64-softmmu/qemu-system-x86_64`
109 |     + Replace `$DISK` with the target img file
110 |     + See also: <https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg02930.html>
111 | 
112 | 
113 | 6. Raising the guest OS CPU limit
114 | 
115 |     Change `NR_CPUS`; the current value is:
116 | 
117 |     ```bash
118 |     $ grep NR_CPUS /boot/config-`uname -r`
119 |     CONFIG_NR_CPUS=256
120 |     ```
121 | 
122 |     You can edit the config by hand and rebuild the kernel; see:
123 |     - [KernelBuild](https://kernelnewbies.org/KernelBuild)
124 |     - [Kernels/Traditional compilation](https://wiki.archlinux.org/index.php/Kernels/Traditional_compilation)
125 | 
126 |     Alternatively, switch to a newer kernel such as 4.4.0-62.
127 | 
128 | 
129 | 


--------------------------------------------------------------------------------
/interrupt.md:
--------------------------------------------------------------------------------
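   1 | ## Definitions
   2 | 
   3 | ### Traditional (physical machine) interrupts
   4 | 
   5 | A device raises an interrupt, which is sent to the IOAPIC. The IOAPIC looks up the matching entry (RTE) in its redirection table (PRT) to find the destination LAPIC, then formats an interrupt message, sends it to that LAPIC, and sets remote IRR to 1 (level-triggered).
   6 | 
   7 | After receiving the interrupt message, the LAPIC sets the IRR bit for the vector and performs interrupt selection. Once it has picked the highest-priority interrupt, it clears the IRR bit, sets the ISR bit, and hands the interrupt to the CPU. When the CPU finishes handling it, it writes the LAPIC's EOI register, which tells the IOAPIC to clear remote IRR (level-triggered and deasserted).
   8 | 
   9 | ### QEMU+KVM (virtual machine) interrupts
  10 | 
  11 | #### Interrupt exit
  12 | When an interrupt arrives for the virtual machine, the guest is forced to take a VM exit so that the interrupt can be injected before the next VM entry.
  13 | 
  14 | #### Interrupt injection
  15 | An interrupt is injected into the guest by writing it to the VM-entry interruption-information field of the VMCS.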
  16 | 
  17 | #### VMCS
  18 | Section 24.6 (VM-Execution Control Fields) of SDM Volume 3 defines:
  19 | 
  20 | * Pin-Based VM-Execution Controls
  21 |     Control whether external interrupts, NMIs, and virtual NMIs cause a VM exit back to KVM.
  22 | 
  23 |     For example, setting External-interrupt exiting to 1 makes every external interrupt cause a VM exit; otherwise the VM handles the interrupt itself.
  24 | 
  25 | * Secondary Processor-Based VM-Execution Controls - virtual-interrupt delivery
  26 |     When set to 1, pending virtual interrupts are evaluated on VM entry, TPR virtualization, EOI virtualization, self-IPI virtualization, and posted-interrupt processing.
  27 | 
  28 | Setting the virtualize APIC accesses bit in the VMCS Secondary Processor-Based VM-Execution Controls makes the VM take a VM exit whenever it accesses the page that maps the APIC.
  29 | 
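  30 | Once a virtual interrupt has been recognized, it is delivered when all four of the following conditions hold:
  31 | 
  32 | 1. RFLAGS.IF = 1
  33 | 2. There is no blocking by STI
  34 | 3. There is no blocking by MOV SS or POP SS
  35 | 4. The interrupt-window exiting bit in the Primary Processor-Based VM-Execution Controls is 0.
  36 | 
  37 | Delivering a virtual interrupt updates RVI and SVI in the guest interrupt status and raises an interrupt event in VMX non-root operation.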
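
The four delivery conditions listed above can be written as a single predicate. This is only an illustrative sketch: the `vintr_state` struct and the function name are hypothetical, and real hardware evaluates these conditions internally rather than through any such structure.

```c
#include <stdbool.h>
#include <stdint.h>

#define RFLAGS_IF (1ULL << 9) /* interrupt-enable flag: bit 9 of RFLAGS */

/* Hypothetical snapshot of the guest state the four conditions depend on. */
struct vintr_state {
    uint64_t rflags;
    bool blocked_by_sti;           /* interruptibility: blocking by STI */
    bool blocked_by_mov_ss;        /* interruptibility: blocking by MOV SS / POP SS */
    bool interrupt_window_exiting; /* primary processor-based control bit */
};

/* True when a recognized virtual interrupt may be delivered. */
bool virtual_interrupt_deliverable(const struct vintr_state *s)
{
    return (s->rflags & RFLAGS_IF) != 0
        && !s->blocked_by_sti
        && !s->blocked_by_mov_ss
        && !s->interrupt_window_exiting;
}
```

If any one of the four conditions fails, the interrupt stays pending until the blocking state clears.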
  38 | 
  39 | 
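  40 | #### Interrupt controllers
  41 | 
  42 | Both QEMU and KVM emulate the interrupt controllers, for historical reasons. Long before KVM existed, QEMU already provided a full set of emulated devices, including the interrupt controllers. After KVM appeared, a second set of interrupt controllers was implemented inside KVM to improve interrupt performance. The QEMU startup option kernel-irqchip selects whose interrupt controllers (irq chips) are used.
  43 | 
  44 | * on: KVM emulates everything
  45 | * split: QEMU emulates the IOAPIC and PIC, KVM emulates the LAPIC
  46 | * off: QEMU emulates everything
  47 | 
  48 | 
  49 | #### GPIO (General-purpose input/output)
  50 | By definition, a GPIO is a general-purpose pin that can be configured at run time as an input or an output. An input can be read; an output can be read and written.
  51 | 
  52 | QEMU uses GPIOs extensively to represent the pins of devices and interrupt controllers. As in hardware, they are split into IN and OUT, and are described by the following data structure: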
  53 | 
  54 | ```c
  55 | struct NamedGPIOList {
  56 |     char *name;
  57 |     qemu_irq *in;
  58 |     int num_in;
  59 |     int num_out;
  60 |     QLIST_ENTRY(NamedGPIOList) node;
  61 | };
  62 | ```
  63 | 
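  64 | The unit of a pin is qemu_irq. The structure holds the base address of the input qemu_irq array, through which every input qemu_irq can be reached, along with the number of inputs and outputs.
  65 | 
  66 | Each device maintains one or more NamedGPIOList structures, organized as a linked list (the node member links them together) and pointed to by DeviceState.gpios. For example, the 8259 has one NamedGPIOList named NULL with 1 out and 8 in.
  67 | 
  68 | qemu_irq is a pointer to IRQState, defined as follows: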
  69 | 
  70 | ```c
  71 | struct IRQState {
  72 |     Object parent_obj;
  73 | 
  74 |     qemu_irq_handler handler;
  75 |     void *opaque;
  76 |     int n;
  77 | };
  78 | ```
  79 | 
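  80 | Typically, n is the pin number, opaque points to the owning device, and handler is the callback for that pin.
  81 | 
  82 | When a signal is to be sent to one of a device's pins, the handler of the corresponding input qemu_irq is called, meaning that the device has received the signal; the handler then updates some device properties to reflect the state change. To produce output, the handler of one of the output qemu_irq is called, which sends the signal out.
  83 | 
  84 | An output pin is usually initialized as a qemu_irq pointer. When its upstream device (the one that receives this device's output) is initialized, that pointer is set to the upstream device's own input qemu_irq. So when the device drives its output, the handler invoked is the one for the upstream device's input qemu_irq, which models a signal travelling from one device to another.
  85 | 
  86 | According to the qom-tree of q35, the devices with GPIOs are mainly the PIC and the IOAPIC: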
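
The wiring of an output pin to an upstream input can be modelled in a few lines. The types below are simplified stand-ins for `qemu_irq` / `IRQState`, not the real QEMU definitions, and both device structs are hypothetical.

```c
#include <stddef.h>

/* Simplified stand-ins for QEMU's handler type and IRQState. */
typedef void (*irq_handler)(void *opaque, int n, int level);

struct irq_state {
    irq_handler handler; /* callback for this pin */
    void *opaque;        /* owning device */
    int n;               /* pin number */
};
typedef struct irq_state *irq;

/* Hypothetical upstream device (e.g. an interrupt controller) with 8 inputs. */
struct upstream_dev {
    int pin_level[8];
    struct irq_state in[8];
};

/* Input handler: record the new level on pin n of the upstream device. */
void upstream_set_irq(void *opaque, int n, int level)
{
    struct upstream_dev *d = opaque;
    d->pin_level[n] = level;
}

void upstream_init(struct upstream_dev *d)
{
    for (int i = 0; i < 8; i++) {
        d->pin_level[i] = 0;
        d->in[i] = (struct irq_state){ upstream_set_irq, d, i };
    }
}

/* Hypothetical downstream device with a single output pin. */
struct downstream_dev {
    irq out; /* NULL until wired to an upstream input */
};

/* Driving an output calls the handler of whatever input it is wired to. */
void set_irq(irq q, int level)
{
    if (q != NULL) {
        q->handler(q->opaque, q->n, level);
    }
}
```

After `upstream_init(&pic)`, setting `dev.out = &pic.in[3]` and calling `set_irq(dev.out, 1)` changes `pic.pin_level[3]`: the downstream device never touches the upstream state directly, it only invokes the handler it was wired to.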
  87 | 
  88 | ```
  89 | /machine (pc-q35-2.8-machine)
  90 |     /device[9] (hpet)
  91 |       /unnamed-gpio-in[0] (irq)
  92 |       /unnamed-gpio-in[1] (irq)
  93 |     /device[7] (isa-i8259)
  94 |       /unnamed-gpio-in[3] (irq)
  95 |       /unnamed-gpio-in[4] (irq)
  96 |       /unnamed-gpio-in[5] (irq)
  97 |       /unnamed-gpio-in[6] (irq)
  98 |       /unnamed-gpio-in[2] (irq)
  99 |       /unnamed-gpio-in[7] (irq)
 100 |       /unnamed-gpio-in[0] (irq)
 101 |       /unnamed-gpio-in[1] (irq)
 102 |     /device[8] (isa-i8259)
 103 |       /unnamed-gpio-in[3] (irq)
 104 |       /unnamed-gpio-in[4] (irq)
 105 |       /unnamed-gpio-in[5] (irq)
 106 |       /unnamed-gpio-in[6] (irq)
 107 |       /unnamed-gpio-in[2] (irq)
 108 |       /unnamed-gpio-in[7] (irq)
 109 |       /unnamed-gpio-in[0] (irq)
 110 |       /unnamed-gpio-in[1] (irq)
 111 |  /q35 (q35-pcihost)
 112 |     /pcie.0 (PCIE)
 113 |     /ioapic (ioapic)
 114 |       /unnamed-gpio-in[17] (irq)
 115 |       /unnamed-gpio-in[9] (irq)
 116 |       /unnamed-gpio-in[20] (irq)
 117 |       /unnamed-gpio-in[19] (irq)
 118 |       /unnamed-gpio-in[22] (irq)
 119 |       /unnamed-gpio-in[0] (irq)
 120 |       /unnamed-gpio-in[10] (irq)
 121 |       /unnamed-gpio-in[2] (irq)
 122 |       /unnamed-gpio-in[12] (irq)
 123 |       /unnamed-gpio-in[4] (irq)
 124 |       /unnamed-gpio-in[14] (irq)
 125 |       /unnamed-gpio-in[6] (irq)
 126 |       /unnamed-gpio-in[16] (irq)
 127 |       /unnamed-gpio-in[8] (irq)
 128 |       /unnamed-gpio-in[18] (irq)
 129 |       /unnamed-gpio-in[21] (irq)
 130 |       /unnamed-gpio-in[23] (irq)
 131 |       /unnamed-gpio-in[1] (irq)
 132 |       /unnamed-gpio-in[11] (irq)
 133 |       /unnamed-gpio-in[3] (irq)
 134 |       /unnamed-gpio-in[13] (irq)
 135 |       /unnamed-gpio-in[5] (irq)
 136 |       /unnamed-gpio-in[15] (irq)
 137 |       /unnamed-gpio-in[7] (irq)
 138 | ```
 139 | 
 140 | 
 141 | 
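 142 | #### GSI (Global System Interrupt)
 143 | 
 144 | The ACPI (Advanced Configuration and Power Interface) specification defines a unified configuration interface for x86 machines, and interrupts are no exception. Ten years ago, people expected computer architectures to mix PIC and APIC for a long time to come (QEMU's classic q35 machine is one example), so GSI was defined.
 145 | 
 146 | GSI assigns a unique interrupt number to every input pin of every interrupt controller in the system:
 147 | 
 148 | * For the 8259A, GSIs map directly to ISA IRQs. For example, GSI 0 maps to IRQ 0.
 149 | * For the IOAPIC, the BIOS assigns each IOAPIC a GSI base, and a pin maps to base + pin. For example, if IOAPIC0 has GSI base 0 and 24 pins, they correspond to GSIs 0-23; if IOAPIC1 has GSI base 24 and 16 pins, its range is 24-39, and so on.
 150 | 
 151 | 
 152 | QEMU uses GSIState to describe GSIs: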
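
The base + pin rule for IOAPICs works in both directions. A minimal sketch, using a hypothetical `ioapic_desc` struct to stand for the layout the BIOS assigns:

```c
#include <stdint.h>

/* Hypothetical description of one IOAPIC as laid out by the BIOS. */
struct ioapic_desc {
    uint32_t gsi_base;
    uint32_t num_pins;
};

/* GSI of an IOAPIC input pin: base + pin. */
uint32_t ioapic_pin_to_gsi(const struct ioapic_desc *io, uint32_t pin)
{
    return io->gsi_base + pin;
}

/*
 * Reverse lookup: find the IOAPIC that owns a GSI.
 * Returns the IOAPIC index and stores the pin in *pin, or returns -1
 * (leaving *pin untouched) when no IOAPIC covers the GSI.
 */
int gsi_to_ioapic_pin(const struct ioapic_desc *ios, int count,
                      uint32_t gsi, uint32_t *pin)
{
    for (int i = 0; i < count; i++) {
        if (gsi >= ios[i].gsi_base &&
            gsi < ios[i].gsi_base + ios[i].num_pins) {
            *pin = gsi - ios[i].gsi_base;
            return i;
        }
    }
    return -1;
}
```

With the two IOAPICs from the example (`{0, 24}` and `{24, 16}`), pin 15 of IOAPIC1 is GSI 39, and GSI 30 resolves back to IOAPIC1 pin 6.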
 153 | 
 154 | ```c
 155 | typedef struct GSIState {
 156 |     qemu_irq i8259_irq[ISA_NUM_IRQS];
 157 |     qemu_irq ioapic_irq[IOAPIC_NUM_PINS];
 158 | } GSIState;
 159 | ```
 160 | 
 161 | For Q35, the machine initialization function pc_q35_init initializes the GSIs and fills in the GPIOs.
 162 | 
 163 | 
 164 | 
 165 | 
 166 | 
 167 | ## Interrupt emulation
 168 | 
 169 | ### Chips emulated by KVM
 170 | 
 171 | #### PIC
 172 | KVM uses kvm_pic to emulate the 8259A chip. Its pointer is stored in kvm.arch.vpic.
 173 | 
 174 | ```c
 175 | struct kvm_pic {
 176 |     spinlock_t lock;
 177 |     bool wakeup_needed;
 178 |     unsigned pending_acks;
 179 |     struct kvm *kvm;
 180 |     struct kvm_kpic_state pics[2]; /* 0 is master pic, 1 is slave pic */    // holds all registers and state of the interrupt chips
 181 |     int output;     /* intr from master PIC */
 182 |     struct kvm_io_device dev_master;                                        // master PIC device
 183 |     struct kvm_io_device dev_slave;                                         // slave PIC device
 184 |     struct kvm_io_device dev_eclr;                                          // register that controls the interrupt trigger mode
 185 |     void (*ack_notifier)(void *opaque, int irq);
 186 |     unsigned long irq_states[PIC_NUM_PINS];
 187 | };
 188 | 
 189 | struct kvm_kpic_state {
 190 |     u8 last_irr;    /* edge detection */
 191 |     u8 irr;     /* interrupt request register */
 192 |     u8 imr;     /* interrupt mask register */
 193 |     u8 isr;     /* interrupt service register */
 194 |     u8 priority_add;    /* highest irq priority */
 195 |     u8 irq_base;
 196 |     u8 read_reg_select;
 197 |     u8 poll;
 198 |     u8 special_mask;
 199 |     u8 init_state;
 200 |     u8 auto_eoi;
 201 |     u8 rotate_on_auto_eoi;
 202 |     u8 special_fully_nested_mode;
 203 |     u8 init4;       /* true if 4 byte init */
 204 |     u8 elcr;        /* PIIX edge/trigger selection */
 205 |     u8 elcr_mask;
 206 |     u8 isr_ack; /* interrupt ack detection */
 207 |     struct kvm_pic *pics_state;
 208 | };
 209 | ```
 210 | 
 211 | dev_master, dev_slave, and dev_eclr define the operations for each device; kvm_io_bus_register_dev registers them on the PIO bus (KVM_PIO_BUS).
 212 | 
 213 | Reads and writes to these devices end up in the following functions:
 214 | 
 215 | ```c
 216 | static const struct kvm_io_device_ops picdev_master_ops = {
 217 |     .read     = picdev_master_read,
 218 |     .write    = picdev_master_write,
 219 | };
 220 | 
 221 | static const struct kvm_io_device_ops picdev_slave_ops = {
 222 |     .read     = picdev_slave_read,
 223 |     .write    = picdev_slave_write,
 224 | };
 225 | 
 226 | static const struct kvm_io_device_ops picdev_eclr_ops = {
 227 |     .read     = picdev_eclr_read,
 228 |     .write    = picdev_eclr_write,
 229 | };
 230 | ```
 231 | 
 232 | #### IOAPIC
 233 | 
 234 | KVM emulates only one kind of IOAPIC, named kvm_ioapic. Its pointer is stored in kvm.arch.vioapic.
 235 | 
 236 | ```c
 237 | struct kvm_ioapic {
 238 |     u64 base_address;
 239 |     u32 ioregsel;
 240 |     u32 id;
 241 |     u32 irr;                                                        // IRR register
 242 |     u32 pad;
 243 |     union kvm_ioapic_redirect_entry redirtbl[IOAPIC_NUM_PINS];      // PRT; each entry describes one pin
 244 |     unsigned long irq_states[IOAPIC_NUM_PINS];
 245 |     struct kvm_io_device dev;
 246 |     struct kvm *kvm;
 247 |     void (*ack_notifier)(void *opaque, int irq);
 248 |     spinlock_t lock;
 249 |     struct rtc_status rtc_status;
 250 |     struct delayed_work eoi_inject;
 251 |     u32 irq_eoi[IOAPIC_NUM_PINS];
 252 |     u32 irr_delivered;
 253 | };
 254 | 
 255 | // RTE
 256 | union kvm_ioapic_redirect_entry {
 257 |     u64 bits;
 258 |     struct {
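 259 |         u8 vector;                  // interrupt vector (ISRV) number for this interrupt; priority = vector / 16, higher is more urgent
 260 |         u8 delivery_mode:3;         // delivery mode: how the interrupt is sent to the destination LAPIC (Fixed, Lowest Priority, SMI, NMI, INIT, ExtINT)
 261 |         u8 dest_mode:1;             // destination mode: 0 = Physical Mode, 1 = Logical Mode
 262 |         u8 delivery_status:1;       // delivery status: 0 = Idle (no interrupt), 1 = Send Pending (received but not yet sent for some reason)
 263 |         u8 polarity:1;              // pin polarity: active level of the pin, 0 = high, 1 = low
 264 |         u8 remote_irr:1;            // remote IRR (level-triggered only): set to 1 when the LAPIC accepts the interrupt, cleared when the LAPIC writes EOI
 265 |         u8 trig_mode:1;             // trigger mode: 0 = edge, 1 = level
 266 |         u8 mask:1;                  // interrupt mask: 1 masks the interrupt
 267 |         u8 reserve:7;               // unused
 268 |         u8 reserved[4];             // unused
 269 |         u8 dest_id;                 // destination: in Physical Mode the ID of the target LAPIC, in Logical Mode a set of CPUs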
 270 |     } fields;
 271 | };
 272 | ```
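
The bit layout of the redirection entry above can also be decoded by hand from the raw 64-bit value, which avoids relying on compiler bitfield ordering. The following is a sketch only; `rte_fields`, `rte_decode`, and `rte_priority_class` are hypothetical names, and the bit positions follow the field order of the union shown above.

```c
#include <stdint.h>

/* Hypothetical unpacked view of one redirection table entry. */
struct rte_fields {
    uint8_t vector;          /* bits 0-7   */
    uint8_t delivery_mode;   /* bits 8-10  */
    uint8_t dest_mode;       /* bit 11     */
    uint8_t delivery_status; /* bit 12     */
    uint8_t polarity;        /* bit 13     */
    uint8_t remote_irr;      /* bit 14     */
    uint8_t trig_mode;       /* bit 15: 0 = edge, 1 = level */
    uint8_t mask;            /* bit 16     */
    uint8_t dest_id;         /* bits 56-63 */
};

/* Extract every field from the raw 64-bit entry with shifts and masks. */
struct rte_fields rte_decode(uint64_t bits)
{
    struct rte_fields f;
    f.vector          = (uint8_t)(bits & 0xff);
    f.delivery_mode   = (uint8_t)((bits >> 8) & 0x7);
    f.dest_mode       = (uint8_t)((bits >> 11) & 0x1);
    f.delivery_status = (uint8_t)((bits >> 12) & 0x1);
    f.polarity        = (uint8_t)((bits >> 13) & 0x1);
    f.remote_irr      = (uint8_t)((bits >> 14) & 0x1);
    f.trig_mode       = (uint8_t)((bits >> 15) & 0x1);
    f.mask            = (uint8_t)((bits >> 16) & 0x1);
    f.dest_id         = (uint8_t)((bits >> 56) & 0xff);
    return f;
}

/* Priority class of a vector: vector / 16. */
unsigned rte_priority_class(uint8_t vector)
{
    return vector >> 4;
}
```

For instance, an entry with vector 0x31 falls into priority class 3, and a raw value with bit 16 set decodes as masked.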
 273 | 
 274 | On receiving KVM_CREATE_IRQCHIP from QEMU, KVM calls kvm_ioapic_init to initialize the IOAPIC, which in turn calls kvm_iodevice_init to bind the operations:
 275 | 
 276 | ```c
 277 | static const struct kvm_io_device_ops ioapic_mmio_ops = {
 278 |     .read     = ioapic_mmio_read,
 279 |     .write    = ioapic_mmio_write,
 280 | };
 281 | ```
 282 | 
 283 | It then registers dev on the MMIO bus (KVM_MMIO_BUS) through kvm_io_bus_register_dev.
 284 | 
 285 | #### LAPIC
 286 | 
 287 | In KVM, each vCPU has its own LAPIC, named kvm_lapic. Its pointer is stored in vcpu.arch.apic.
 288 | 
 289 | ```c
 290 | struct kvm_lapic {
 291 |     unsigned long base_address;                 // base address (GPA)
 292 |     struct kvm_io_device dev;                   // holds the operations for the LAPIC
 293 |     struct kvm_timer lapic_timer;
 294 |     u32 divide_count;
 295 |     struct kvm_vcpu *vcpu;
 296 |     bool sw_enabled;
 297 |     bool irr_pending;
 298 |     bool lvt0_in_nmi_mode;
 299 |     /* Number of bits set in ISR. */
 300 |     s16 isr_count;
 301 |     /* The highest vector set in ISR; if -1 - invalid, must scan ISR. */
 302 |     int highest_isr_cache;
 303 |     /**
 304 |      * APIC register page.  The layout matches the register layout seen by
 305 |      * the guest 1:1, because it is accessed by the vmx microcode.
 306 |      * Note: Only one register, the TPR, is used by the microcode.
 307 |      */
 308 |     void *regs;                                 // points to a host page holding all virtual registers of the LAPIC, such as IRR, ISR, and LVT
 309 |     gpa_t vapic_addr;
 310 |     struct gfn_to_hva_cache vapic_cache;
 311 |     unsigned long pending_events;
 312 |     unsigned int sipi_vector;
 313 | };
 314 | ```
 315 | 
 316 | It is created and initialized in vmx_create_vcpu => kvm_vcpu_init => kvm_arch_vcpu_init => kvm_create_lapic.
 317 | 
 318 | MMIO reads and writes to the device end up in the following functions:
 319 | 
 320 | ```c
 321 | static const struct kvm_io_device_ops apic_mmio_ops = {
 322 |     .read     = apic_mmio_read,
 323 |     .write    = apic_mmio_write,
 324 | };
 325 | ```
 326 | 
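 327 | When a LAPIC register is read or written, the virtualize APIC accesses bit in the VMCS Secondary Processor-Based VM-Execution Controls, which is set to 1, causes a VM exit back to KVM. KVM subtracts base_address from the register address to get the offset, then operates on the LAPIC struct through kvm_lapic_reg_read / kvm_lapic_reg_write.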
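
The address-to-offset step described above is plain arithmetic. A minimal sketch, assuming the architectural default APIC base of 0xFEE00000 and a 4 KiB register page; the function name and constants are illustrative and not taken from the KVM source.

```c
#include <stdint.h>

#define APIC_DEFAULT_BASE 0xfee00000ULL /* architectural default APIC base */
#define APIC_PAGE_SIZE    0x1000ULL     /* the register page is 4 KiB */
#define APIC_EOI_OFFSET   0xb0          /* offset of the EOI register */

/*
 * Translate a guest physical address into an offset inside the APIC
 * register page. Returns -1 when the address is outside the page.
 */
int64_t lapic_reg_offset(uint64_t base_address, uint64_t gpa)
{
    if (gpa < base_address || gpa - base_address >= APIC_PAGE_SIZE) {
        return -1;
    }
    return (int64_t)(gpa - base_address);
}
```

A guest write to 0xFEE000B0 therefore yields offset 0xB0, the EOI register, which is the value a register-level handler would dispatch on.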
 328 | 
 329 | 
 330 | 
 331 | 
 332 | 
 333 | #### Creation flow
 334 | 
 335 | In kvm_init, if the mode is on or split, KVM is needed to emulate interrupt controllers, so QEMU initializes them. The call flow for on is:
 336 | 
 337 | ```
 338 | kvm_irqchip_create => kvm_arch_irqchip_create => kvm_vm_enable_cap(s, KVM_CAP_SPLIT_IRQCHIP, 0, 24) => kvm_vm_ioctl(s, KVM_ENABLE_CAP, &cap)
 339 |                    => kvm_vm_ioctl(s, KVM_CREATE_IRQCHIP)
 340 | ```
 341 | 
 342 | So the key step here is creating the chips through KVM_CREATE_IRQCHIP.
 343 | 
 344 | In KVM, the call chain is:
 345 | 
 346 | ```
 347 | kvm_arch_vm_ioctl => kvm_create_pic                                           create the PIC
 348 |                   => kvm_ioapic_init                                          create and initialize the IOAPIC
 349 |                   => kvm_setup_default_irq_routing => kvm_set_irq_routing     set up interrupt routing
 350 | ```
 351 | 
 352 | In on mode, KVM uses default_routing directly as the interrupt routing; that is, KVM itself initializes the interrupt routing table kvm->irq_routing.
 353 | 
 354 | 
 355 | ```c
 356 | static const struct kvm_irq_routing_entry default_routing[] = {
 357 |   ROUTING_ENTRY2(0), ROUTING_ENTRY2(1),
 358 |   ROUTING_ENTRY2(2), ROUTING_ENTRY2(3),
 359 |   ROUTING_ENTRY2(4), ROUTING_ENTRY2(5),
 360 |   ROUTING_ENTRY2(6), ROUTING_ENTRY2(7),
 361 |   ROUTING_ENTRY2(8), ROUTING_ENTRY2(9),
 362 |   ROUTING_ENTRY2(10), ROUTING_ENTRY2(11),
 363 |   ROUTING_ENTRY2(12), ROUTING_ENTRY2(13),
 364 |   ROUTING_ENTRY2(14), ROUTING_ENTRY2(15),
 365 |   ROUTING_ENTRY1(16), ROUTING_ENTRY1(17),
 366 |   ROUTING_ENTRY1(18), ROUTING_ENTRY1(19),
 367 |   ROUTING_ENTRY1(20), ROUTING_ENTRY1(21),
 368 |   ROUTING_ENTRY1(22), ROUTING_ENTRY1(23),
 369 | };
 370 | 
 371 | #define ROUTING_ENTRY2(irq) \
 372 |   IOAPIC_ROUTING_ENTRY(irq), PIC_ROUTING_ENTRY(irq)
 373 | 
 374 | #define ROUTING_ENTRY1(irq) IOAPIC_ROUTING_ENTRY(irq)
 375 | 
 376 | #define PIC_ROUTING_ENTRY(irq) \
 377 |   { .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP,  \
 378 |     .u.irqchip = { .irqchip = SELECT_PIC(irq), .pin = (irq) % 8 } }
 379 | 
 380 | #define IOAPIC_ROUTING_ENTRY(irq) \
 381 |   { .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP,  \
 382 |     .u.irqchip = { .irqchip = KVM_IRQCHIP_IOAPIC, .pin = (irq) } }
 383 | ```
 384 | 
 385 | As shown, the first 16 GSIs (0-15) are routed to both the PIC and the IOAPIC, while GSIs 16-23 are routed to the IOAPIC only.
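
To make the macro expansion concrete, here is a small sketch that builds the same table with a loop. The enum and struct names are stand-ins invented for this example, and it assumes that SELECT_PIC picks the master chip for IRQs below 8 and the slave otherwise.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for KVM_IRQCHIP_PIC_MASTER / _SLAVE / _IOAPIC (illustrative only). */
enum chip { CHIP_PIC_MASTER, CHIP_PIC_SLAVE, CHIP_IOAPIC };

struct route { unsigned gsi; enum chip chip; unsigned pin; };

/* Fill out[] with the default routes for GSIs 0..23; returns the entry count. */
static size_t build_default_routes(struct route *out, size_t cap)
{
    size_t n = 0;

    assert(cap >= 40);                      /* 16 * 2 + 8 entries are needed */
    for (unsigned gsi = 0; gsi < 24; gsi++) {
        /* Every GSI gets an IOAPIC route on the pin with the same number. */
        out[n++] = (struct route){ gsi, CHIP_IOAPIC, gsi };
        /* GSIs 0..15 additionally get a PIC route on pin gsi % 8. */
        if (gsi < 16) {
            enum chip pic = gsi < 8 ? CHIP_PIC_MASTER : CHIP_PIC_SLAVE;
            out[n++] = (struct route){ gsi, pic, gsi % 8 };
        }
    }
    return n;
}

/* Count how many routes exist for one GSI. */
static size_t routes_for_gsi(const struct route *r, size_t n, unsigned gsi)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (r[i].gsi == gsi)
            count++;
    return count;
}
```

Under these assumptions the table holds 40 entries: two per GSI for 0-15 and one per GSI for 16-23.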
 386 | 
 387 | ##### kvm_set_irq_routing
 388 | 
 389 | ```
 390 | kvm_set_irq_routing => setup_routing_entry => kvm_set_routing_entry
 391 |                     => rcu_assign_pointer(kvm->irq_routing, new)
 392 | ```
 393 | 
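 394 | It creates a new kvm_irq_routing_table, then iterates over the passed-in entries array, calling setup_routing_entry for each entry to build a kvm_kernel_irq_routing_entry and add it to the new table. Finally it points kvm->irq_routing at the new table.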
 395 | 
 396 | The interrupt routing table kvm->irq_routing has type kvm_irq_routing_table. Its entries are stored in map, where each slot is a list of kvm_kernel_irq_routing_entry:
 397 | 
 398 | ```c
 399 | struct kvm_irq_routing_table {
 400 |     int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];            // first index: interrupt chip; second index: pin; stores the GSI of that pin. Currently deprecated
 401 |     u32 nr_rt_entries;
 402 |     /*
 403 |      * Array indexed by gsi. Each entry contains list of irq chips
 404 |      * the gsi is connected to.
 405 |      */
 406 |     struct hlist_head map[0];                                   // hash table of lists: key is the gsi, value is a list of kvm_kernel_irq_routing_entry
 407 | };
 408 | 
 409 | struct kvm_kernel_irq_routing_entry {
 410 |     u32 gsi;                                                    // GSI of the pin
 411 |     u32 type;
 412 |     int (*set)(struct kvm_kernel_irq_routing_entry *e,          // function that sets the interrupt
 413 |            struct kvm *kvm, int irq_source_id, int level,       // kvm, interrupt source ID, line level (high/low)
 414 |            bool line_status);
 415 |     union {
 416 |         struct {
 417 |             unsigned irqchip;
 418 |             unsigned pin;
 419 |         } irqchip;
 420 |         struct {
 421 |             u32 address_lo;
 422 |             u32 address_hi;
 423 |             u32 data;
 424 |             u32 flags;
 425 |             u32 devid;
 426 |         } msi;
 427 |         struct kvm_s390_adapter_int adapter;
 428 |         struct kvm_hv_sint hv_sint;
 429 |     };
 430 |     struct hlist_node link;
 431 | };
 432 | ```
 433 | 
 434 | Specific interrupt chips (such as the PIC and IOAPIC) define their behavior at interrupt injection time by providing the set function of kvm_kernel_irq_routing_entry.
 435 | 
 436 | ##### setup_routing_entry
 437 | 
 438 | The function setup_routing_entry sets up the interrupt routing for one GSI:
 439 | 
 440 | ```c
 441 | static int setup_routing_entry(struct kvm *kvm,
 442 |                    struct kvm_irq_routing_table *rt,
 443 |                    struct kvm_kernel_irq_routing_entry *e,
 444 |                    const struct kvm_irq_routing_entry *ue)
 445 | {
 446 |     int r = -EINVAL;
 447 |     struct kvm_kernel_irq_routing_entry *ei;
 448 | 
 449 |     /*
 450 |      * Do not allow GSI to be mapped to the same irqchip more than once.
 451 |      * Allow only one to one mapping between GSI and non-irqchip routing.
 452 |      */
 453 |     hlist_for_each_entry(ei, &rt->map[ue->gsi], link)
 454 |         if (ei->type != KVM_IRQ_ROUTING_IRQCHIP ||
 455 |             ue->type != KVM_IRQ_ROUTING_IRQCHIP ||
 456 |             ue->u.irqchip.irqchip == ei->irqchip.irqchip)
 457 |             return r;
 458 | 
 459 |     e->gsi = ue->gsi;
 460 |     e->type = ue->type;
 461 |     r = kvm_set_routing_entry(kvm, e, ue);
 462 |     if (r)
 463 |         goto out;
 464 |     if (e->type == KVM_IRQ_ROUTING_IRQCHIP)
 465 |         rt->chip[e->irqchip.irqchip][e->irqchip.pin] = e->gsi;
 466 | 
 467 |     hlist_add_head(&e->link, &rt->map[e->gsi]);
 468 |     r = 0;
 469 | out:
 470 |     return r;
 471 | }
 472 | ```
 473 | 
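 474 | It walks the list for this GSI in kvm_irq_routing_table->map. If the list already contains a kvm_kernel_irq_routing_entry for the target irqchip, the route already exists and the function returns -EINVAL immediately, because the pins of one interrupt controller must all map to different GSIs. A non-irqchip route (such as MSI) must be the only route for its GSI.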
 475 | 
 476 | Otherwise the entry is set up: after gsi and type are filled in, kvm_set_routing_entry completes the setup.
 477 | 
 478 | ```c
 479 | 
 480 | int kvm_set_routing_entry(struct kvm *kvm,
 481 |               struct kvm_kernel_irq_routing_entry *e,
 482 |               const struct kvm_irq_routing_entry *ue)
 483 | {
 484 |     int r = -EINVAL;
 485 |     int delta;
 486 |     unsigned max_pin;
 487 | 
 488 |     switch (ue->type) {
 489 |     case KVM_IRQ_ROUTING_IRQCHIP:
 490 |         delta = 0;
 491 |         switch (ue->u.irqchip.irqchip) {
 492 |         case KVM_IRQCHIP_PIC_MASTER:
 493 |             e->set = kvm_set_pic_irq;
 494 |             max_pin = PIC_NUM_PINS;
 495 |             break;
 496 |         case KVM_IRQCHIP_PIC_SLAVE:
 497 |             e->set = kvm_set_pic_irq;
 498 |             max_pin = PIC_NUM_PINS;
 499 |             delta = 8;
 500 |             break;
 501 |         case KVM_IRQCHIP_IOAPIC:
 502 |             max_pin = KVM_IOAPIC_NUM_PINS;
 503 |             e->set = kvm_set_ioapic_irq;
 504 |             break;
 505 |         default:
 506 |             goto out;
 507 |         }
 508 |         e->irqchip.irqchip = ue->u.irqchip.irqchip;
 509 |         e->irqchip.pin = ue->u.irqchip.pin + delta;
 510 |         if (e->irqchip.pin >= max_pin)
 511 |             goto out;
 512 |         break;
 513 |     case KVM_IRQ_ROUTING_MSI:
 514 |         e->set = kvm_set_msi;
 515 |         e->msi.address_lo = ue->u.msi.address_lo;
 516 |         e->msi.address_hi = ue->u.msi.address_hi;
 517 |         e->msi.data = ue->u.msi.data;
 518 | 
 519 |         if (kvm_msi_route_invalid(kvm, e))
 520 |             goto out;
 521 |         break;
 522 |     case KVM_IRQ_ROUTING_HV_SINT:
 523 |         e->set = kvm_hv_set_sint;
 524 |         e->hv_sint.vcpu = ue->u.hv_sint.vcpu;
 525 |         e->hv_sint.sint = ue->u.hv_sint.sint;
 526 |         break;
 527 |     default:
 528 |         goto out;
 529 |     }
 530 | 
 531 |     r = 0;
 532 | out:
 533 |     return r;
 534 | }
 535 | ```
 536 | 
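 537 | This is where the set function of the kvm_kernel_irq_routing_entry structure described above is assigned. For ordinary interrupts (KVM_IRQ_ROUTING_IRQCHIP), a different set function is chosen depending on the interrupt controller (irqchip):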
 538 | 
 539 | * For the PIC, the irqchip is KVM_IRQCHIP_PIC_MASTER / KVM_IRQCHIP_PIC_SLAVE, so set is kvm_set_pic_irq
 540 | * For the IOAPIC, the irqchip is KVM_IRQCHIP_IOAPIC, so set is kvm_set_ioapic_irq
 541 | 
 542 | At this point the interrupt routing in KVM is fully initialized.
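
The two checks walked through above (no duplicate route per chip, and handler selection with the +8 pin offset for the slave PIC) can be condensed into a simplified model. All names below are hypothetical, the hlist is replaced by a boolean matrix, and the pin limits (16 for the PIC pair, 24 for the IOAPIC) are assumed values.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the KVM_IRQCHIP_* constants and pin limits (assumed values). */
enum { PIC_MASTER, PIC_SLAVE, IOAPIC, NR_CHIPS };
enum { NR_GSI = 24, PIC_PINS = 16, IOAPIC_PINS = 24 };

typedef int (*set_fn)(int level);

static int set_pic(int level)    { return level; }   /* placeholder handler */
static int set_ioapic(int level) { return level; }   /* placeholder handler */

struct entry { unsigned gsi, chip, pin; set_fn set; };

/* used[gsi][chip] replaces walking the per-GSI list of the real table. */
struct table { bool used[NR_GSI][NR_CHIPS]; };

/* Returns 0 on success, -1 if the route is rejected. */
static int setup_entry(struct table *t, struct entry *e,
                       unsigned gsi, unsigned chip, unsigned pin)
{
    unsigned delta = 0, max_pin;

    /* A GSI may be mapped to a given irqchip only once. */
    if (gsi >= NR_GSI || chip >= NR_CHIPS || t->used[gsi][chip])
        return -1;

    switch (chip) {
    case PIC_MASTER: e->set = set_pic;    max_pin = PIC_PINS;            break;
    case PIC_SLAVE:  e->set = set_pic;    max_pin = PIC_PINS; delta = 8; break;
    default:         e->set = set_ioapic; max_pin = IOAPIC_PINS;         break;
    }
    if (pin + delta >= max_pin)
        return -1;

    e->gsi = gsi;
    e->chip = chip;
    e->pin = pin + delta;       /* slave pins are stored as 8..15 */
    t->used[gsi][chip] = true;
    return 0;
}
```

Routing GSI 10 to slave pin 2 therefore stores pin 10, and a second attempt to route GSI 10 to the slave PIC is rejected, while routing it to the IOAPIC as well is allowed.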
 543 | 
 544 | 
 545 | 
 546 | 
 547 | 
 548 | ### Interrupt injection
 549 | If a device is emulated in QEMU, the interrupts it raises have to be injected.
 550 | 
 551 | In kvm_init, the function in which QEMU initializes the KVM accelerator, we have:
 552 | 
 553 | ```c
 554 |     s->irq_set_ioctl = KVM_IRQ_LINE;
 555 |     if (kvm_check_extension(s, KVM_CAP_IRQ_INJECT_STATUS)) {
 556 |         s->irq_set_ioctl = KVM_IRQ_LINE_STATUS;
 557 |     }
 558 | 
 559 | #define KVM_IRQ_LINE              _IOW(KVMIO,  0x61, struct kvm_irq_level)
 560 | #define KVM_IRQ_LINE_STATUS       _IOWR(KVMIO, 0x67, struct kvm_irq_level)
 561 | ```
 562 | 
 563 | If KVM supports returning the injection result, s->irq_set_ioctl is set to KVM_IRQ_LINE_STATUS; otherwise it is KVM_IRQ_LINE.
 564 | 
 565 | QEMU then injects the interrupt into KVM through an ioctl in kvm_set_irq:
 566 | 
 567 | ```c
 568 | int kvm_set_irq(KVMState *s, int irq, int level)
 569 | {
 570 |     struct kvm_irq_level event;
 571 |     int ret;
 572 | 
 573 |     assert(kvm_async_interrupts_enabled());
 574 | 
 575 |     event.level = level;
 576 |     event.irq = irq;
 577 |     ret = kvm_vm_ioctl(s, s->irq_set_ioctl, &event);
 578 |     if (ret < 0) {
 579 |         perror("kvm_set_irq");
 580 |         abort();
 581 |     }
 582 | 
 583 |     return (s->irq_set_ioctl == KVM_IRQ_LINE) ? 1 : event.status;
 584 | }
 585 | ```
 586 | 
 587 | In KVM, the call chain is kvm_vm_ioctl => kvm_vm_ioctl_irq_line => kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irq_event->irq, irq_event->level, line_status)
 588 | 
 589 | 
 590 | ```c
 591 | /*
 592 |  * Return value:
 593 |  *  < 0   Interrupt was ignored (masked or not delivered for other reasons)
 594 |  *  = 0   Interrupt was coalesced (previous irq is still pending)
 595 |  *  > 0   Number of CPUs interrupt was delivered to
 596 |  */
 597 | int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level,
 598 |         bool line_status)
 599 | {
 600 |     struct kvm_kernel_irq_routing_entry irq_set[KVM_NR_IRQCHIPS];
 601 |     int ret = -1, i, idx;
 602 | 
 603 |     trace_kvm_set_irq(irq, level, irq_source_id);
 604 | 
 605 |     /* Not possible to detect if the guest uses the PIC or the
 606 |      * IOAPIC.  So set the bit in both. The guest will ignore
 607 |      * writes to the unused one.
 608 |      */
 609 |     idx = srcu_read_lock(&kvm->irq_srcu);
 610 |     // look up kvm->irq_routing and copy every kvm_kernel_irq_routing_entry for this interrupt number (GSI) into irq_set
 611 |     i = kvm_irq_map_gsi(kvm, irq_set, irq);
 612 |     srcu_read_unlock(&kvm->irq_srcu, idx);
 613 | 
 614 |     // call the set function of each kvm_kernel_irq_routing_entry to set the interrupt; set is NULL if the chip does not implement it
 615 |     while (i--) {
 616 |         int r;
 617 |         r = irq_set[i].set(&irq_set[i], kvm, irq_source_id, level,
 618 |                    line_status);
 619 |         if (r < 0)
 620 |             continue;
 621 | 
 622 |         ret = r + ((ret < 0) ? 0 : ret);
 623 |     }
 624 | 
 625 |     return ret;
 626 | }
 627 | ```
 628 | 
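 629 | Here irq_source_id is the ID of the interrupt source device, irq is the interrupt number passed in from user space (kvm_irq_map_gsi uses it as the GSI to index the routing table), and level is the line level (high or low).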
 630 | 
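 631 | The matching entries are looked up in the interrupt routing table, and the set function assigned during routing initialization is called. As mentioned earlier: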
 632 | 
 633 | * For the PIC, set is kvm_set_pic_irq
 634 | * For the IOAPIC, set is kvm_set_ioapic_irq
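
The way kvm_set_irq combines the results of the individual set callbacks is easy to misread, so here is a stand-alone sketch of just that accumulation loop. The callback type is reduced to a single `level` argument and the three sample callbacks are hypothetical.

```c
#include <assert.h>

typedef int (*set_fn)(int level);

/* Same accumulation rule as the loop in kvm_set_irq shown above:
 * negative results are skipped, non-negative ones are summed. */
static int deliver_all(const set_fn *set, int n, int level)
{
    int ret = -1;

    while (n--) {
        int r = set[n](level);

        if (r < 0)
            continue;
        ret = r + ((ret < 0) ? 0 : ret);
    }
    return ret;
}

/* Hypothetical callbacks standing in for kvm_set_pic_irq / kvm_set_ioapic_irq. */
static int masked(int level)    { (void)level; return -1; }  /* ignored    */
static int coalesced(int level) { (void)level; return 0; }   /* coalesced  */
static int one_cpu(int level)   { (void)level; return 1; }   /* delivered  */
```

The result stays negative only if every chip ignored the interrupt; otherwise it is the sum of the non-negative results, which matches the "Return value" comment above the function.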
 635 | 
 636 | #### kvm_set_pic_irq
 637 | 
 638 | ```
 639 | kvm_set_pic_irq => pic_irqchip                                              find the corresponding interrupt chip
 640 |                 => kvm_pic_set_irq => pic_set_irq1                          set the pin for the irq and update the IRR (interrupt request register)
 641 |                                    => pic_update_irq => pic_irq_request     send the interrupt request
 642 | ```
 643 | 
 644 | Where:
 645 | 
 646 | ```c
 647 | static void pic_irq_request(struct kvm *kvm, int level)
 648 | {
 649 |     struct kvm_pic *s = pic_irqchip(kvm);
 650 | 
 651 |     if (!s->output)
 652 |         s->wakeup_needed = true;
 653 |     s->output = level;
 654 | }
 655 | ```
 656 | 
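 657 | It sets output of the interrupt chip kvm_pic to the given level. In addition, if output was previously 0, it sets wakeup_needed to true, so pic_unlock calls `kvm_make_request(KVM_REQ_EVENT, found)` to post the request and then uses kvm_vcpu_kick to make the target vCPU exit and handle it.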
 658 | 
 659 | 
 660 | #### kvm_set_ioapic_irq
 661 | 
 662 | kvm_set_ioapic_irq => kvm_ioapic_set_irq => ioapic_set_irq => ioapic_service
 663 | 
 664 | ```
 665 | ioapic_service
 666 | => create and initialize the interrupt message kvm_lapic_irq
 667 | => kvm_irq_delivery_to_apic                 send the interrupt message to the LAPIC
 668 | ```
 669 | 
 670 | kvm_lapic_irq is the interrupt message as formatted by the IOAPIC, defined as follows:
 671 | 
 672 | ```c
 673 | struct kvm_lapic_irq {
 674 |     u32 vector;
 675 |     u16 delivery_mode;
 676 |     u16 dest_mode;
 677 |     bool level;
 678 |     u16 trig_mode;
 679 |     u32 shorthand;
 680 |     u32 dest_id;
 681 |     bool msi_redir_hint;
 682 | };
 683 | ```
 684 | 
 685 | The kvm_lapic_irq message is then passed as an argument to kvm_irq_delivery_to_apic:
 686 | 
 687 | ```c
 688 | int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 689 |         struct kvm_lapic_irq *irq, struct dest_map *dest_map)
 690 | {
 691 |     int i, r = -1;
 692 |     struct kvm_vcpu *vcpu, *lowest = NULL;
 693 |     unsigned long dest_vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
 694 |     unsigned int dest_vcpus = 0;
 695 | 
 696 |     if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
 697 |             kvm_lowest_prio_delivery(irq)) {
 698 |         printk(KERN_INFO "kvm: apic: phys broadcast and lowest prio\n");
 699 |         irq->delivery_mode = APIC_DM_FIXED;
 700 |     }
 701 | 
 702 |     if (kvm_irq_delivery_to_apic_fast(kvm, src, irq, &r, dest_map))
 703 |         return r;
 704 | 
 705 |     memset(dest_vcpu_bitmap, 0, sizeof(dest_vcpu_bitmap));
 706 | 
 707 |     kvm_for_each_vcpu(i, vcpu, kvm) {
 708 |         if (!kvm_apic_present(vcpu))
 709 |             continue;
 710 | 
 711 |         if (!kvm_apic_match_dest(vcpu, src, irq->shorthand,
 712 |                     irq->dest_id, irq->dest_mode))
 713 |             continue;
 714 | 
 715 |         if (!kvm_lowest_prio_delivery(irq)) {
 716 |             if (r < 0)
 717 |                 r = 0;
 718 |             r += kvm_apic_set_irq(vcpu, irq, dest_map);
 719 |         } else if (kvm_lapic_enabled(vcpu)) {
 720 |             if (!kvm_vector_hashing_enabled()) {
 721 |                 if (!lowest)
 722 |                     lowest = vcpu;
 723 |                 else if (kvm_apic_compare_prio(vcpu, lowest) < 0)
 724 |                     lowest = vcpu;
 725 |             } else {
 726 |                 __set_bit(i, dest_vcpu_bitmap);
 727 |                 dest_vcpus++;
 728 |             }
 729 |         }
 730 |     }
 731 | 
 732 |     if (dest_vcpus != 0) {
 733 |         int idx = kvm_vector_to_index(irq->vector, dest_vcpus,
 734 |                     dest_vcpu_bitmap, KVM_MAX_VCPUS);
 735 | 
 736 |         lowest = kvm_get_vcpu(kvm, idx);
 737 |     }
 738 | 
 739 |     if (lowest)
 740 |         r = kvm_apic_set_irq(lowest, irq, dest_map);
 741 | 
 742 |     return r;
 743 | }
 744 | ```
 745 | 
 746 | Besides external interrupts (ioapic => lapic), this function also handles IPIs (lapic => lapic, see apic_send_ipi).
 747 | 
 748 | It first tries to find the target LAPIC in kvm.arch.apic_map, which is defined as follows:
 749 | 
 750 | ```c
 751 | struct kvm_apic_map {
 752 |     struct rcu_head rcu;
 753 |     u8 mode;
 754 |     u32 max_apic_id;
 755 |     union {
 756 |         struct kvm_lapic *xapic_flat_map[8];
 757 |         struct kvm_lapic *xapic_cluster_map[16][4];
 758 |     };
 759 |     struct kvm_lapic *phys_map[];               // maps LAPIC ID to the kvm_lapic pointer
 760 | };
 761 | ```
 762 | 
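 763 | So in kvm_irq_delivery_to_apic_fast => kvm_apic_map_get_dest_lapic, for an interrupt that is neither a broadcast nor lowest-priority, the corresponding kvm_lapic can be fetched directly from phys_map using irq->dest_id, and kvm_apic_set_irq is then called directly to set the interrupt on the target vCPU. Otherwise all vCPUs have to be traversed and matched one by one against irq->dest_id from the RTE, and kvm_apic_set_irq is called on each matching vCPU.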
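
A rough sketch of the physical-mode fast path follows. It is only a model of the idea (index phys_map by destination ID, fall back to the slow path otherwise); the struct layouts, the fixed map size, and the helper name are assumptions, and logical destination modes and lowest-priority arbitration are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_APIC_ID    7u       /* illustrative map size */
#define APIC_BROADCAST 0xffu    /* physical-mode broadcast ID in xAPIC */

struct lapic { uint32_t id; };

struct apic_map {
    uint32_t max_apic_id;
    struct lapic *phys_map[MAX_APIC_ID + 1];    /* LAPIC ID -> LAPIC */
};

/* Hypothetical fast-path lookup: returns the target LAPIC, or NULL when the
 * caller has to fall back to iterating over all vCPUs. */
static struct lapic *lookup_phys_dest(const struct apic_map *map, uint32_t dest_id)
{
    if (dest_id == APIC_BROADCAST || dest_id > map->max_apic_id)
        return NULL;
    return map->phys_map[dest_id];
}
```

A NULL result here stands for "use the slow path", which in the real function corresponds to kvm_irq_delivery_to_apic_fast returning false.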
 764 | 
 765 | 
 766 | kvm_apic_set_irq sets the interrupt on that vCPU's LAPIC:
 767 | 
 768 | ```
 769 | => __apic_accept_irq
 770 |     => act according to delivery_mode, e.g. for APIC_DM_FIXED: kvm_lapic_set_vector + kvm_lapic_set_irr
 771 |     => kvm_make_request(event, vcpu), where event can be KVM_REQ_EVENT / KVM_REQ_SMI / KVM_REQ_NMI
 772 |     => kvm_vcpu_kick(vcpu)              make the target vCPU exit to handle the request
 773 | ```
 774 | 
 775 | kvm_make_request essentially sets the bit for the request in vcpu->requests; the request is handled at the next vcpu_enter_guest.
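
The request mechanism amounts to a per-vCPU bitmask with a "make" side and a "test and clear" side. The sketch below shows that pattern with hypothetical names and request numbers; the kernel uses atomic bit operations, which this single-threaded illustration omits.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative request numbers; the real KVM_REQ_* values live in the kernel headers. */
enum { REQ_EVENT = 0, REQ_NMI = 1 };

struct vcpu { unsigned long requests; };

/* Post a request: set its bit. */
static void make_request(int req, struct vcpu *v)
{
    v->requests |= 1UL << req;
}

/* Consume a request: report whether its bit was set and clear it. */
static bool check_request(int req, struct vcpu *v)
{
    unsigned long bit = 1UL << req;

    if (!(v->requests & bit))
        return false;
    v->requests &= ~bit;
    return true;
}
```

Because checking also clears the bit, a request posted once is handled once on the next entry.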
 776 | 
 777 | 
 778 | 
 779 | ##### kvm_vcpu_kick
 780 | 
 781 | kvm_vcpu_kick => smp_send_reschedule (native_smp_send_reschedule) => apic->send_IPI(cpu, RESCHEDULE_VECTOR) (x2apic_send_IPI)
 782 | 
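 783 | It sends a reschedule IPI to the CPU on which the target vCPU is running. Because the VMCS is configured so that external interrupts cause a VMExit, control returns to KVM, which can then inject the interrupt before the next VMENTRY (vcpu_enter_guest).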
 784 | 
 785 | So kvm_x86_ops->run (vmx_vcpu_run) returns to vcpu_enter_guest and then to vcpu_run, which starts the next loop iteration and calls vcpu_enter_guest again:
 786 | 
 787 | ```
 788 | vcpu_enter_guest => inject_pending_event        check requests before running; if kvm_check_request(KVM_REQ_EVENT, vcpu), inject the interrupt before running the vCPU
 789 |                  => kvm_x86_ops->run            VMLAUNCH/VMRESUME
 790 |                  => vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD)
 791 |                  => vmx_complete_interrupts => __vmx_complete_interrupts    update the vCPU from the interrupt information, requeueing events where needed
 792 | ```
 793 | 
 794 | The specific flow is:
 795 | 
 796 | ```c
 797 |     if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
 798 |         kvm_apic_accept_events(vcpu);
 799 |         if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
 800 |             r = 1;
 801 |             goto out;
 802 |         }
 803 |         // interrupt injection
 804 |         if (inject_pending_event(vcpu, req_int_win) != 0)
 805 |             req_immediate_exit = true;
 806 |         else {
 807 |             /* Enable NMI/IRQ window open exits if needed.
 808 |              *
 809 |              * SMIs have two cases: 1) they can be nested, and
 810 |              * then there is nothing to do here because RSM will
 811 |              * cause a vmexit anyway; 2) or the SMI can be pending
 812 |              * because inject_pending_event has completed the
 813 |              * injection of an IRQ or NMI from the previous vmexit,
 814 |              * and then we request an immediate exit to inject the SMI.
 815 |              */
 816 |             if (vcpu->arch.smi_pending && !is_smm(vcpu))
 817 |                 req_immediate_exit = true;
 818 |             if (vcpu->arch.nmi_pending)
 819 |                 kvm_x86_ops->enable_nmi_window(vcpu);
 820 |             if (kvm_cpu_has_injectable_intr(vcpu) || req_int_win)
 821 |                 kvm_x86_ops->enable_irq_window(vcpu);
 822 |         }
 823 | 
 824 |         if (kvm_lapic_enabled(vcpu)) {
 825 |             update_cr8_intercept(vcpu);
 826 |             kvm_lapic_sync_to_vapic(vcpu);
 827 |         }
 828 |     }
 829 | ```
 830 | 
 831 | KVM_REQ_EVENT is found to be set, so inject_pending_event is called:
 832 | 
 833 | ```
 834 | => if an exception is pending, call kvm_x86_ops->queue_exception (vmx_queue_exception) to requeue it
 835 | => if nmi_injected, call kvm_x86_ops->set_nmi (vmx_inject_nmi)
 836 | => if an interrupt is pending, call kvm_x86_ops->set_irq (vmx_inject_irq)
 837 | => if a non-maskable interrupt is pending, call kvm_x86_ops->set_nmi (vmx_inject_nmi)
 838 | => kvm_cpu_has_injectable_intr                                                          if the vCPU has an injectable interrupt
 839 | => kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu), false)                        store the highest-priority interrupt in vcpu->arch.interrupt
 840 | => kvm_x86_ops->set_irq (vmx_inject_irq)                                                write the interrupt information into the VMCS
 841 |     => vmcs_write32(VM_ENTRY_INSTRUCTION_LEN, vmx->vcpu.arch.event_exit_inst_len)       for software interrupts, the instruction length must be written
 842 |     => vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr)                                     update the interruption-information field
 843 | ```
 844 | 
 845 | ##### kvm_cpu_has_injectable_intr
 846 | Determines whether there is an injectable interrupt.
 847 | 
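 848 | => lapic_in_kernel                      if the LAPIC is not in KVM, it is emulated by QEMU, so vcpu.arch.interrupt has already been set; return interrupt.pending
 849 | => kvm_cpu_has_extint                   if an external (non-APIC) interrupt is pending, return true
 850 | => kvm_vcpu_apicv_active                if virtual interrupt delivery is enabled, APIC interrupts are handled by hardware without software involvement; return false
 851 | => kvm_apic_has_interrupt               if the LAPIC is in KVM, find the highest-priority interrupt vector; if it is greater than the PPR, return true
 852 |     => apic_update_ppr                                                    update the PPR
 853 |     => apic_find_highest_irr => apic_search_irr => find_highest_vector    find the highest-priority vector in the IRR
 854 |     => if that vector is less than or equal to the PPR, return -1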
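
The IRR scan and PPR comparison at the end of this chain can be sketched as below. This is an illustrative model with hypothetical names: the IRR is represented as eight 32-bit words, and the comparison is assumed to be on the priority class (the upper four bits of the vector), which is how the "greater than PPR" test above is understood here.

```c
#include <assert.h>
#include <stdint.h>

/* 256 interrupt vectors as eight 32-bit words (layout chosen for this sketch). */
struct irr { uint32_t word[8]; };

static void irr_set(struct irr *r, unsigned vec)
{
    r->word[vec / 32] |= 1u << (vec % 32);
}

/* Highest set vector, or -1 if nothing is pending. */
static int find_highest_vector(const struct irr *r)
{
    for (int w = 7; w >= 0; w--) {
        if (!r->word[w])
            continue;
        for (int b = 31; b >= 0; b--)
            if (r->word[w] & (1u << b))
                return w * 32 + b;
    }
    return -1;
}

/* Returns the vector to inject, or -1 if none beats the processor priority.
 * Assumption: only the priority class (bits 7:4) takes part in the comparison. */
static int has_interrupt(const struct irr *r, uint8_t ppr)
{
    int vec = find_highest_vector(r);

    if (vec < 0 || (vec & 0xf0) <= (ppr & 0xf0))
        return -1;
    return vec;
}
```

With vectors 0x31 and 0x52 pending, the candidate is 0x52; it is injectable while the PPR class is below 0x50 and held back otherwise.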
 855 | 
 856 | 
 857 | #### Summary
 858 | 
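 859 | Before the vCPU is run again, KVM checks whether there is an interrupt request. If so, it checks the LAPIC's pending interrupts and finds the one with the highest priority; if its vector is greater than the PPR (Processor Priority Register), it needs to be injected.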
 860 | 
 861 | So vcpu->arch.interrupt (kvm_queued_interrupt) is set, with pending set to true:
 862 | 
 863 | ```c
 864 | struct kvm_queued_interrupt {
 865 |     bool pending;
 866 |     bool soft;      // whether this is a software interrupt
 867 |     u8 nr;          // interrupt vector number
 868 | } interrupt;
 869 | ```
 870 | 
 871 | 
 872 | On VMEXIT, if the injection succeeded, vmx_vcpu_run => vmx_complete_interrupts => __vmx_complete_interrupts => kvm_clear_interrupt_queue sets pending to false.
 873 | 
 874 | If the injection failed, __vmx_complete_interrupts requeues the interrupt so that it is injected again.
 875 | 
 876 | 
 877 | 
 878 | 
 879 | 
 880 | 
 881 | 
 882 | 
 883 | 
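 884 | ### Chips emulated by QEMU
 885 | As we know, devices in QEMU are all defined through TypeInfo and then registered as TypeImpl. When a device is created, class_init is called to initialize the class object, instance_init is then called to initialize the instance object, and finally realize completes the construction of the device.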
 886 | 
 887 | #### PIC
 888 | 
 889 | ```c
 890 | static const TypeInfo i8259_info = {
 891 |     .name       = TYPE_I8259,
 892 |     .instance_size = sizeof(PICCommonState),
 893 |     .parent     = TYPE_PIC_COMMON,
 894 |     .class_init = i8259_class_init,
 895 |     .class_size = sizeof(PICClass),
 896 |     .interfaces = (InterfaceInfo[]) {
 897 |         { TYPE_INTERRUPT_STATS_PROVIDER },
 898 |         { }
 899 |     },
 900 | };
 901 | ```
 902 | 
 903 | pc_q35_init contains the following code:
 904 | 
 905 | ```c
 906 | i8259 = i8259_init(isa_bus, pc_allocate_cpu_irq());
 907 | for (i = 0; i < ISA_NUM_IRQS; i++) {
 908 |     gsi_state->i8259_irq[i] = i8259[i];
 909 | }
 910 | ```
 911 | 
 912 | First an interrupt (parent_irq) is allocated for the PIC device by calling pc_allocate_cpu_irq => qemu_allocate_irq(pic_irq_request, NULL, 0), which creates an interrupt with number 0 and handler pic_irq_request. It serves as the upstream interrupt.
 913 | 
 914 | Then the PIC is initialized:
 915 | 
 916 | ```c
 917 | qemu_irq *i8259_init(ISABus *bus, qemu_irq parent_irq)
 918 | {
 919 |     qemu_irq *irq_set;
 920 |     DeviceState *dev;
 921 |     ISADevice *isadev;
 922 |     int i;
 923 | 
 924 |     irq_set = g_new0(qemu_irq, ISA_NUM_IRQS);
 925 | 
 926 |     // create the PIC master device and attach it to isa_bus
 927 |     isadev = i8259_init_chip(TYPE_I8259, bus, true);
 928 |     dev = DEVICE(isadev);
 929 | 
 930 |     qdev_connect_gpio_out(dev, 0, parent_irq);
 931 |     for (i = 0 ; i < 8; i++) {
 932 |         irq_set[i] = qdev_get_gpio_in(dev, i);
 933 |     }
 934 | 
 935 |     isa_pic = dev;
 936 |     // create the PIC slave device and attach it to isa_bus
 937 |     isadev = i8259_init_chip(TYPE_I8259, bus, false);
 938 |     dev = DEVICE(isadev);
 939 | 
 940 |     qdev_connect_gpio_out(dev, 0, irq_set[2]);
 941 |     for (i = 0 ; i < 8; i++) {
 942 |         irq_set[i + 8] = qdev_get_gpio_in(dev, i);
 943 |     }
 944 | 
 945 |     slave_pic = PIC_COMMON(dev);
 946 | 
 947 |     return irq_set;
 948 | }
 949 | ```
 950 | 
 951 | It creates the instance objects of the two 8259 interrupt chips: i8259_init_chip => qdev_init_nofail => ... => pic_realize
 952 | 
 953 | ```c
 954 | static void pic_realize(DeviceState *dev, Error **errp)
 955 | {
 956 |     PICCommonState *s = PIC_COMMON(dev);
 957 |     PICClass *pc = PIC_GET_CLASS(dev);
 958 | 
 959 |     memory_region_init_io(&s->base_io, OBJECT(s), &pic_base_ioport_ops, s,
 960 |                           "pic", 2);
 961 |     memory_region_init_io(&s->elcr_io, OBJECT(s), &pic_elcr_ioport_ops, s,
 962 |                           "elcr", 1);
 963 | 
 964 |     qdev_init_gpio_out(dev, s->int_out, ARRAY_SIZE(s->int_out));
 965 |     qdev_init_gpio_in(dev, pic_set_irq, 8);
 966 | 
 967 |     pc->parent_realize(dev, errp);
 968 | }
 969 | ```
 970 | 
 971 | ##### qdev_init_gpio_out
 972 | 
 973 | ```
 974 | => qdev_init_gpio_out_named(dev, pins, NULL, n)
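 975 |     => qdev_get_named_gpio_list         take gpios from the device instance (DeviceState) and walk the list to find the NamedGPIOList with the given name; if none is found, create one and insert it at the head
 976 |     => if no name was passed in, set name to "unnamed-gpio-out"
 977 |     => object_property_add_link         using the passed-in qemu_irq array and its length, add each qemu_irq *pointer* as a link property of dev named "name[i]"
 978 |     => NamedGPIOList.num_out += n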
 979 | ```
 980 | 
 981 | ##### qdev_init_gpio_in
 982 | 
 983 | ```
 984 | => qdev_init_gpio_in_named(dev, handler, NULL, n)
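 985 |     => qdev_get_named_gpio_list         take gpios from the device instance (DeviceState) and walk the list to find the NamedGPIOList with the given name; if none is found, create one and insert it at the head
 986 |     => NamedGPIOList.in = qemu_extend_irqs(gpio_list->in, gpio_list->num_in, handler, dev, n)  create n more qemu_irqs on top of the existing ones, with the passed-in function as handler
 987 |     => if no name was passed in, set name to "unnamed-gpio-in"
 988 |     => for each of the n pins, add the qemu_irq as a child property of dev named "name[i]"
 989 |     => NamedGPIOList.num_in += n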
 990 | ```
 991 | 
 992 | So each 8259 has 1 out GPIO and 8 in GPIOs, stored in the DeviceState.gpios list under the name NULL.
 993 | 
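 994 | The value of "unnamed-gpio-out[0]" for the out pin is a qemu_irq *pointer* to s->int_out: it holds the address of that member, which itself has not been set yet.
 995 | The value of "unnamed-gpio-in[i]" for an in pin is a qemu_irq whose handler is pic_set_irq and whose opaque is dev.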
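
The difference between in and out pins can be illustrated with a stripped-down model. It assumes, as in QEMU's IRQState, that a qemu_irq carries a handler, an opaque pointer, and a pin number n; everything else (struct pic, pic_set_pin, pic_init) is hypothetical, and the property machinery is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a qemu_irq: handler + opaque + pin number. */
typedef void (*irq_handler)(void *opaque, int n, int level);
struct irq_state { irq_handler handler; void *opaque; int n; };

/* Setting a line just calls the handler; an unconnected line is a no-op. */
static void set_irq(struct irq_state *irq, int level)
{
    if (irq)
        irq->handler(irq->opaque, irq->n, level);
}

struct pic {
    unsigned irr;                   /* one bit per input pin */
    struct irq_state *int_out;      /* out pin: empty until it is connected */
    struct irq_state in[8];         /* in pins: owned by the device */
};

/* Handler shared by all in pins (plays the role of pic_set_irq). */
static void pic_set_pin(void *opaque, int n, int level)
{
    struct pic *s = opaque;

    if (level)
        s->irr |= 1u << n;
    else
        s->irr &= ~(1u << n);
    set_irq(s->int_out, s->irr != 0);   /* forward upstream if connected */
}

static void pic_init(struct pic *s)
{
    s->irr = 0;
    s->int_out = NULL;
    for (int i = 0; i < 8; i++)
        s->in[i] = (struct irq_state){ pic_set_pin, s, i };
}
```

After pic_init, the in pins are fully functional, while int_out is still NULL, which corresponds to the "not set yet" state of the out GPIO described above.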
 996 | 
 997 | 
 998 | #### PIC connection
 999 | 
1000 | After the instance objects of the 8259 chips have been created, i8259_init calls qdev_connect_gpio_out on the master chip to connect it:
1001 | 
1002 | qdev_connect_gpio_out(dev, 0, parent_irq) => qdev_connect_gpio_out_named(dev, NULL, n, pin)  *<- pin is parent_irq*
1003 | 
1004 | ##### qdev_connect_gpio_out_named
1005 | 
1006 | ```
1007 | => object_property_add_child    add the upstream interrupt (parent_irq) as a child property named "non-qdev-gpio[*]" of the "/machine/unattached" container.
1008 | => object_property_set_link     set the link property of dev named "name[i]" to parent_irq. This is the "name[i]" property created earlier in qdev_init_gpio_out, whose value points to s->int_out, so s->int_out is set to parent_irq and the empty out slot is filled.
1009 | ```
1010 | 
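1011 | If the upstream interrupt already has a path, `child->parent != NULL` and object_property_add_child returns. Otherwise it is added as a child of the "/machine/unattached" container. One detail is that the actual property name is not "non-qdev-gpio[*]", because object_property_add_child => object_property_add tries to replace the `*` with an index, starting from 0. Here, for example, parent_irq is assigned the property name "non-qdev-gpio[24]", so its full path is "/machine/unattached/non-qdev-gpio[24]", which is also visible in qom-tree: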
1012 | 
1013 | ```
1014 | /machine (pc-q35-2.8-machine)
1015 |   /unattached (container)
1016 |     /non-qdev-gpio[24] (irq)
1017 | ```
1018 | 
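1019 | The upstream interrupt (such as parent_irq) has to be made a child property of the "/machine/unattached" container because the following object_property_set_link requires the upstream interrupt to have a path of its own: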
1020 | 
1021 | ```c
1022 | void object_property_set_link(Object *obj, Object *value,
1023 |                               const char *name, Error **errp)
1024 | {
1025 |     if (value) {
1026 |         // fetch the path of the upstream interrupt
1027 |         gchar *path = object_get_canonical_path(value);
1028 |         object_property_set_str(obj, path, name, errp);
1029 |         g_free(path);
1030 |     } else {
1031 |         object_property_set_str(obj, "", name, errp);
1032 |     }
1033 | }
1034 | ```
1035 | 
1036 | ##### qdev_get_gpio_in => qdev_get_gpio_in_named
1037 | 
1038 | Fetches the 8 qemu_irqs that qdev_init_gpio_in created and stored in NamedGPIOList.in, and stores them in positions 0-7 of the irq_set array.
1039 | 
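1040 | The slave chip is initialized through a similar process, except that its upstream line is not parent_irq but the master's third qemu_irq, irq_set[2], which models the hardware wiring of the slave's out to the master's irq2 pin. Finally the 8 qemu_irqs created by qdev_init_gpio_in and stored in NamedGPIOList.in are fetched and stored in positions 8-15 of the irq_set array.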
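
The cascade wiring can be modeled in a few lines. This sketch reuses a simplified qemu_irq-like struct (handler, opaque, n) and hypothetical names throughout; it ignores masking, priorities, and edge/level handling of a real 8259 and only shows how raising IRQ 10 propagates through the slave into master pin 2 and then to the upstream line.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*irq_handler)(void *opaque, int n, int level);
struct irq_state { irq_handler handler; void *opaque; int n; };

static void set_irq(struct irq_state *irq, int level)
{
    if (irq)
        irq->handler(irq->opaque, irq->n, level);
}

struct pic { unsigned irr; struct irq_state *int_out; struct irq_state in[8]; };

static void pic_set_pin(void *opaque, int n, int level)
{
    struct pic *s = opaque;

    if (level)
        s->irr |= 1u << n;
    else
        s->irr &= ~(1u << n);
    set_irq(s->int_out, s->irr != 0);
}

static void pic_init(struct pic *s)
{
    s->irr = 0;
    s->int_out = NULL;
    for (int i = 0; i < 8; i++)
        s->in[i] = (struct irq_state){ pic_set_pin, s, i };
}

static int cpu_line;    /* level last seen on the upstream (CPU) line */

static void cpu_irq_request(void *opaque, int n, int level)
{
    (void)opaque;
    (void)n;
    cpu_line = level;
}

/* Same wiring order as i8259_init: master out -> parent, slave out -> irq_set[2]. */
static void wire_cascade(struct pic *master, struct pic *slave,
                         struct irq_state *parent, struct irq_state *irq_set[16])
{
    pic_init(master);
    pic_init(slave);
    master->int_out = parent;
    for (int i = 0; i < 8; i++)
        irq_set[i] = &master->in[i];
    slave->int_out = irq_set[2];
    for (int i = 0; i < 8; i++)
        irq_set[i + 8] = &slave->in[i];
}
```

Raising irq_set[10] sets bit 2 in the slave's IRR, which raises the master's pin 2 and finally the upstream line.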
1041 | 
1042 | After PIC initialization completes, irq_set is returned and stored in gsi_state->i8259_irq.
1043 | 
1044 | 
1045 | 
1046 | #### LAPIC
1047 | 
1048 | pc_new_cpu ==> apic_init(env,env->cpuid_apic_id) ==> qdev_create(NULL, "kvm-apic");
1049 | 
1050 | According to x86_cpu_realizefn => x86_cpu_apic_create => apic_get_class, the LAPIC is not placed in KVM at this point and is emulated by QEMU, so apic_type = "apic":
1051 | 
1052 | ```c
1053 | static const TypeInfo apic_common_type = {
1054 |     .name = TYPE_APIC_COMMON,
1055 |     .parent = TYPE_DEVICE,
1056 |     .instance_size = sizeof(APICCommonState),
1057 |     .instance_init = apic_common_initfn,
1058 |     .class_size = sizeof(APICCommonClass),
1059 |     .class_init = apic_common_class_init,
1060 |     .abstract = true,
1061 | };
1062 | ```
1063 | 
1064 | apic_common_realize => apic_realize
1065 | 
1066 | ```c
1067 | static void apic_realize(DeviceState *dev, Error **errp)
1068 | {
1069 |     APICCommonState *s = APIC(dev);
1070 | 
1071 |     if (s->id >= MAX_APICS) {
1072 |         error_setg(errp, "%s initialization failed. APIC ID %d is invalid",
1073 |                    object_get_typename(OBJECT(dev)), s->id);
1074 |         return;
1075 |     }
1076 | 
1077 |     memory_region_init_io(&s->io_memory, OBJECT(s), &apic_io_ops, s, "apic-msi",
1078 |                           APIC_SPACE_SIZE);
1079 | 
1080 |     s->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, apic_timer, s);
1081 |     local_apics[s->id] = s;
1082 | 
1083 |     msi_nonbroken = true;
1084 | }
1085 | ```
1086 | 
1087 | As shown, it registers a MemoryRegion for MSI. Accesses to this MemoryRegion are handled by the following operations:
1088 | 
1089 | ```c
1090 | static const MemoryRegionOps apic_io_ops = {
1091 |     .old_mmio = {
1092 |         .read = { apic_mem_readb, apic_mem_readw, apic_mem_readl, },
1093 |         .write = { apic_mem_writeb, apic_mem_writew, apic_mem_writel, },
1094 |     },
1095 |     .endianness = DEVICE_NATIVE_ENDIAN,
1096 | };
1097 | ```
1098 | 
1099 | 
1100 | #### IOAPIC
1101 | 
1102 | It is defined as follows:
1103 | 
1104 | ```c
1105 | static const TypeInfo ioapic_info = {
1106 |     .name          = "ioapic",
1107 |     .parent        = TYPE_IOAPIC_COMMON,
1108 |     .instance_size = sizeof(IOAPICCommonState),
1109 |     .class_init    = ioapic_class_init,
1110 | };
1111 | ```
1112 | 
1113 | If PCI is enabled, pc_q35_init calls `ioapic_init_gsi(gsi_state, "q35");` to initialize the IOAPIC:
1114 | 
1115 | ```c
1116 | void ioapic_init_gsi(GSIState *gsi_state, const char *parent_name)
1117 | {
1118 |     DeviceState *dev;
1119 |     SysBusDevice *d;
1120 |     unsigned int i;
1121 | 
1122 |     if (kvm_ioapic_in_kernel()) {
1123 |         dev = qdev_create(NULL, "kvm-ioapic");
1124 |     } else {
1125 |         dev = qdev_create(NULL, "ioapic");
1126 |     }
1127 |     if (parent_name) {
1128 |         object_property_add_child(object_resolve_path(parent_name, NULL),
1129 |                                   "ioapic", OBJECT(dev), NULL);
1130 |     }
1131 |     qdev_init_nofail(dev);
1132 |     d = SYS_BUS_DEVICE(dev);
1133 |     sysbus_mmio_map(d, 0, IO_APIC_DEFAULT_ADDRESS);
1134 | 
1135 |     for (i = 0; i < IOAPIC_NUM_PINS; i++) {
1136 |         gsi_state->ioapic_irq[i] = qdev_get_gpio_in(dev, i);
1137 |     }
1138 | }
1139 | ```
1140 | 
1141 | Here the IOAPIC is not placed in KVM and is emulated by QEMU, so qdev_init_nofail => ... => ioapic_realize
1142 | 
1143 | ```c
1144 | static void ioapic_realize(DeviceState *dev, Error **errp)
1145 | {
1146 |     IOAPICCommonState *s = IOAPIC_COMMON(dev);
1147 | 
1148 |     if (s->version != 0x11 && s->version != 0x20) {
1149 |         error_report("IOAPIC only supports version 0x11 or 0x20 "
1150 |                      "(default: 0x11).");
1151 |         exit(1);
1152 |     }
1153 | 
1154 |     memory_region_init_io(&s->io_memory, OBJECT(s), &ioapic_io_ops, s,
1155 |                           "ioapic", 0x1000);
1156 | 
1157 |     qdev_init_gpio_in(dev, ioapic_set_irq, IOAPIC_NUM_PINS);
1158 | 
1159 |     ioapics[ioapic_no] = s;
1160 |     s->machine_done.notify = ioapic_machine_done_notify;
1161 |     qemu_add_machine_init_done_notifier(&s->machine_done);
1162 | }
1163 | ```
1164 | 
1165 | Again qdev_init_gpio_in creates the in qemu_irqs, 24 in total, with ioapic_set_irq as the handler.
1166 | 
1167 | The IOAPIC is stored in the global array ioapics, whose index is maintained by the global variable ioapic_no.
1168 | 
1169 | So the IOAPIC has 24 in GPIOs. Unlike the PIC, whose created qemu_irqs are returned and stored into gsi_state by pc_q35_init, ioapic_init_gsi receives the gsi_state pointer directly and fills gsi_state->ioapic_irq inside the function.
1170 | 
1171 | #### GSI
1172 | 
1173 | At this point both i8259_irq and ioapic_irq in gsi_state have been filled in. In fact, before initializing the PIC and IOAPIC, pc_q35_init creates the qemu_irqs for the GSIs:
1174 | 
1175 | ```c
1176 |     // if the ioapic is in the kernel (KVM), i.e. KVM is enabled and kernel-irqchip=on is specified
1177 |     if (kvm_ioapic_in_kernel()) {
1178 |         kvm_pc_setup_irq_routing(pcmc->pci_enabled);                        // create the interrupt routing and set it in KVM
1179 |         pcms->gsi = qemu_allocate_irqs(kvm_pc_gsi_handler, gsi_state,
1180 |                                        GSI_NUM_PINS);
1181 |     }
1182 |     // otherwise (with off / split the ioapic is in QEMU)
1183 |     else {
1184 |         pcms->gsi = qemu_allocate_irqs(gsi_handler, gsi_state, GSI_NUM_PINS);
1185 |     }
1186 | ```
1187 | 
1188 | This creates GSI_NUM_PINS (24) qemu_irqs, numbered 0-23, whose opaque points to gsi_state and whose handler is kvm_pc_gsi_handler (when the IOAPIC is emulated by KVM) or gsi_handler (when the IOAPIC is emulated by QEMU). They are saved in PCMachineState.gsi.
1189 | 
1190 | pc_q35_init then creates and initializes ICH9-LPC through pci_create_simple_multifunction, which calls its realize function:
1191 | 
1192 | ```
1193 | ich9_lpc_realize => isa_bus = isa_bus_new(...)                                              create the ISABus
1194 |                  => lpc->isa_bus = isa_bus                                                  store the ISABus as a member of ICH9LPCState
1195 |                  => qdev_init_gpio_out_named(dev, lpc->gsi, ICH9_GPIO_GSI, GSI_NUM_PINS)    create 24 out GPIOs
1196 |                  => isa_bus_irqs(isa_bus, lpc->gsi)                                         set ISABus.irqs to ICH9LPCState.gsi
1197 | ```
1198 | 
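1199 | qdev_init_gpio_out_named creates 24 out GPIOs (qemu_irq), stored in the gpios list of the parent class DeviceState under the name "gsi". Each qemu_irq *pointer* is added as a link property of dev named "name[i]" and points to the address of the corresponding element of the ICH9LPCState.gsi array.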
1200 | 
1201 | The subsequent isa_bus_irqs sets ISABus.irqs to ICH9LPCState.gsi. In other words, ISABus.irqs points to the values that the ICH9-LPC out GPIOs (qemu_irq) point to.
1202 | 
1203 | Next, pc_q35_init connects the out GPIOs just created on ICH9LPCState to pcms->gsi one by one:
1204 | 
1205 | ```c
1206 |     for (i = 0; i < GSI_NUM_PINS; i++) {
1207 |         qdev_connect_gpio_out_named(lpc_dev, ICH9_GPIO_GSI, i, pcms->gsi[i]);
1208 |     }
1209 | ```
1210 | 
1211 | So ISABus.irqs is equivalent to PCMachineState.gsi, i.e. ISABus.irqs[i] == PCMachineState.gsi[i], as illustrated below:
1212 | 
1213 | ```
1214 | PCMachineState.gsi  (gsi_handler)
1215 | | | | | | | ... |
1216 | 0 1 2 3 4 5     23
1217 | | | | | | | ... |  out
1218 | ------------------
1219 | |    ICH9-LPC    |
1220 | ------------------
1221 |    (ISABus.irqs)
1222 | ```
1223 | 
1224 | 
1225 | This can be verified with GDB:
1226 | 
1227 | ```
1228 | (gdb) p *isa_bus->irqs
1229 | $15 = (qemu_irq) 0x55555694f8c0
1230 | (gdb) p *PC_MACHINE(qdev_get_machine())->gsi
1231 | $16 = (qemu_irq) 0x55555694f8c0
1232 | ```
1233 | 
1234 | 
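1235 | #### Interrupt injection
1236 | 
1237 | Unlike the interrupt chip emulated by KVM (IOAPIC), which looks up the target LAPIC and delivers the interrupt by setting its fields directly, QEMU delivers interrupts through GPIOs.
1238 | 
1239 | Generally, the irq member of an interrupt-generating device is set to an entry of PCMachineState.gsi. Take the serial port (isa-serial) as an example: its realize function serial_isa_realizefn calls `isa_init_irq(isadev, &s->irq, isa->isairq)` to set the irq member of the device object (SerialState). isa->isairq comes from isa_serial_irq[isa->index], and index is a static variable in serial_isa_realizefn that is incremented on every call.
1240 | 
1241 | According to the definition of isa_serial_irq there are 4 serial devices whose isairq values are 4, 3, 4, 3. For the serial device with index 0 the isairq is 4, so: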
1242 | 
1243 | ```c
1244 | void isa_init_irq(ISADevice *dev, qemu_irq *p, int isairq)
1245 | {
1246 |     assert(dev->nirqs < ARRAY_SIZE(dev->isairq));
1247 |     dev->isairq[dev->nirqs] = isairq;
1248 |     *p = isa_get_irq(dev, isairq);
1249 |     dev->nirqs++;
1250 | }
1251 | ```
1252 | 
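1253 | isa_get_irq fetches the matching qemu_irq (isabus->irqs[4]) from isabus->irqs and stores it in the serial device's instance object, i.e. SerialState.irq.
1254 | 
1255 | As noted above, ISABus.irqs is equivalent to PCMachineState.gsi, so the serial device's irq actually points at a GSI qemu_irq. In effect every device is mapped onto a GSI. The handler of a GSI qemu_irq is gsi_handler, and n is its index into the GSIState arrays.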
1256 | 
1257 | ```
1258 | PCMachineState.gsi  (gsi_handler)
1259 |           |
1260 |           4
1261 |           | irq
1262 | -----------------------
1263 | |  isa-serial device  |
1264 | -----------------------
1265 | ```
1266 | 
1267 | 
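1268 | Back to interrupt injection. To signal an interrupt, a device calls one of the following two functions to set the line level:
1269 | 
1270 | * qemu_irq_lower => qemu_set_irq(irq, 0)       drive the line low
1271 | * qemu_irq_raise => qemu_set_irq(irq, 1)       drive the line high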
1272 | 
1273 | 
1274 | ##### qemu_set_irq
1275 | 
1276 | => irq->handler(irq->opaque, irq->n, level): it fetches the handler stored in the qemu_irq and calls it.
1277 | 
1278 | Since this qemu_irq belongs to GSIState, the handler that gets called is gsi_handler:
1279 | 
1280 | ```c
1281 | void gsi_handler(void *opaque, int n, int level)
1282 | {
1283 |     GSIState *s = opaque;
1284 | 
1285 |     DPRINTF("pc: %s GSI %d\n", level ? "raising" : "lowering", n);
1286 |     if (n < ISA_NUM_IRQS) {
1287 |         qemu_set_irq(s->i8259_irq[n], level);
1288 |     }
1289 |     qemu_set_irq(s->ioapic_irq[n], level);
1290 | }
1291 | ```
1292 | 
1293 | Based on the index (qemu_irq.n), it picks the matching qemu_irq of each chip and calls its handler.
1294 | 
1295 | #### PIC
1296 | 
1297 | ```
1298 | parent_irq (pic_irq_request)
1299 |     | out
1300 | -----------------
1301 | |  8259 master  |
1302 | -----------------
1303 | | | | | | | | |
1304 | 0 1 | 3 4 5 6 7     (pic_set_irq)
1305 |     |
1306 |     | out
1307 | -----------------
1308 | |  8259 slave   |
1309 | -----------------
1310 | | | | | | | | |
1311 | 0 1 2 3 4 5 6 7     (pic_set_irq)
1312 | ```
1313 | 
1314 | For the PIC the handler is pic_set_irq, which is the PIC's in side. pic_set_irq updates the irr register (a field) of the PIC chip (PICCommonState) and then calls pic_update_irq:
1315 | 
1316 | ```c
1317 | static void pic_update_irq(PICCommonState *s)
1318 | {
1319 |     int irq;
1320 | 
1321 |     irq = pic_get_irq(s);
1322 |     if (irq >= 0) {
1323 |         DPRINTF("pic%d: imr=%x irr=%x padd=%d\n",
1324 |                 s->master ? 0 : 1, s->imr, s->irr, s->priority_add);
1325 |         qemu_irq_raise(s->int_out[0]);
1326 |     } else {
1327 |         qemu_irq_lower(s->int_out[0]);
1328 |     }
1329 | }
1330 | 
1331 | ```
1332 | 
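1333 | pic_get_irq returns the highest-priority interrupt in irr that is not masked by imr. If there is one, the out line (s->int_out[0]) is raised; otherwise it is lowered. This gives:
1334 | 
1335 | ```
1336 | qemu_set_irq => pic_irq_request => if the CPU has a LAPIC, call apic_deliver_pic_intr to hand the interrupt to the LAPIC
1337 |                                 => otherwise call cpu_interrupt / cpu_reset_interrupt depending on the level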
1338 | ```
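
The selection done by pic_get_irq can be sketched as below. It is a simplified model under stated assumptions: fixed priority with IRQ0 highest, and no priority rotation, special mask mode or special fully nested mode, all of which the real function also has to consider. `lowest_set_bit` and `pic_pick_irq` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the lowest set bit of an 8-bit mask, or 8 if the mask is empty.
 * With fixed priorities, a lower IRQ number means a higher priority. */
static int lowest_set_bit(uint8_t mask)
{
    for (int i = 0; i < 8; i++) {
        if (mask & (1u << i)) {
            return i;
        }
    }
    return 8;
}

/* Return the IRQ to signal, or -1 if nothing may be delivered right now.
 * irr: requested IRQs, imr: masked IRQs, isr: IRQs currently in service. */
static int pic_pick_irq(uint8_t irr, uint8_t imr, uint8_t isr)
{
    int pending = lowest_set_bit((uint8_t)(irr & (uint8_t)~imr));
    int in_service = lowest_set_bit(isr);

    if (pending == 8) {
        return -1;                    /* nothing requested and unmasked */
    }
    /* Only an IRQ with higher priority than the one in service wins. */
    return pending < in_service ? pending : -1;
}
```

A non-negative result corresponds to the branch that raises int_out[0]; -1 corresponds to the branch that lowers it.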
1339 | 
1340 | One interesting point: on SMP, which CPU should receive a PIC interrupt? QEMU's implementation is blunt: according to pic_irq_request it always picks the first CPU (first_cpu).
1341 | 
1342 | #### IOAPIC
1343 | 
1344 | ```
1345 | ------------------
1346 | |    IOAPIC      |
1347 | ------------------
1348 | | | | | | | ... |
1349 | 0 1 2 3 4 5     23  in (ioapic_set_irq)
1350 | ```
1351 | 
1352 | For the IOAPIC the handler is ioapic_set_irq, which is the IOAPIC's in side.
1353 | 
1354 | ```c
1355 | static void ioapic_set_irq(void *opaque, int vector, int level)
1356 | {
1357 |     IOAPICCommonState *s = opaque;
1358 | 
1359 |     /* ISA IRQs map to GSI 1-1 except for IRQ0 which maps
1360 |      * to GSI 2.  GSI maps to ioapic 1-1.  This is not
1361 |      * the cleanest way of doing it but it should work. */
1362 | 
1363 |     DPRINTF("%s: %s vec %x\n", __func__, level ? "raise" : "lower", vector);
1364 |     if (vector == 0) {
1365 |         vector = 2;
1366 |     }
1367 |     if (vector >= 0 && vector < IOAPIC_NUM_PINS) {
1368 |         uint32_t mask = 1 << vector;
1369 |         uint64_t entry = s->ioredtbl[vector];
1370 | 
1371 |         if (((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) ==
1372 |             IOAPIC_TRIGGER_LEVEL) {
1373 |             /* level triggered */
1374 |             if (level) {
1375 |                 s->irr |= mask;
1376 |                 if (!(entry & IOAPIC_LVT_REMOTE_IRR)) {
1377 |                     ioapic_service(s);
1378 |                 }
1379 |             } else {
1380 |                 s->irr &= ~mask;
1381 |             }
1382 |         } else {
1383 |             /* According to the 82093AA manual, we must ignore edge requests
1384 |              * if the input pin is masked. */
1385 |             if (level && !(entry & IOAPIC_LVT_MASKED)) {
1386 |                 s->irr |= mask;
1387 |                 ioapic_service(s);
1388 |             }
1389 |         }
1390 |     }
1391 | }
1392 | ```
1393 | 
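1394 | First it looks up the I/O REDIRECTION TABLE entry register for the given pin (the parameter is named vector, but it is used as the pin index into ioredtbl). The entry contains the Interrupt Mask, Trigger Mode, Remote IRR and other bits. A Trigger Mode bit of 1 means level triggered, 0 means edge triggered.
1395 | 
1396 | For a level-triggered interrupt, after setting the bit in irr the Remote IRR bit must be checked. If it is 1, the LAPIC has already received the interrupt from the IOAPIC and is still handling it. If it is 0, the LAPIC has finished handling the interrupt and has sent an EOI message to the IOAPIC, so further interrupts can be accepted. Only when it is 0 is ioapic_service called to send the interrupt message.
1397 | 
1398 | For an edge-triggered interrupt the Interrupt Mask bit must be checked. If it is 1 the interrupt is masked and irr is left untouched; if it is 0 the interrupt may be sent, so the bit in irr is set and ioapic_service is called to send the interrupt message.
1399 | 
1400 | ioapic_service walks all pins of the IOAPIC and sends an interrupt for every pin whose bit in irr is 1:
1401 | 
1402 | If the LAPIC is in KVM (kernel-irqchip=split), the interrupt is passed to KVM through kvm_set_irq; otherwise it is converted into an *MSI*. By definition a device can build an MSI message itself, with the interrupt's target address encoded in it, and send the interrupt directly to the LAPIC, bypassing the IOAPIC.
1403 | 
1404 | Since we are discussing the case where QEMU emulates the LAPIC, ioapic_service first uses the pin number to look up the I/O REDIRECTION TABLE (IOAPICCommonState.ioredtbl) and obtain the entry, then calls ioapic_entry_parse to extract the relevant information (ioapic_entry_info), and finally writes into the IOAPIC AddressSpace with `stl_le_phys(ioapic_as, info.addr, info.data)`.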
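
The conversion from a redirection table entry to an MSI address/data pair can be sketched as follows. This is not QEMU's ioapic_entry_parse. It is a minimal model that assumes the standard 82093AA entry layout and the x86 MSI format, and it ignores ExtINT delivery and interrupt remapping. `MsiMessage` and `rte_to_msi` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct MsiMessage {
    uint32_t addr;
    uint32_t data;
} MsiMessage;

/* Build an MSI message from a 64-bit I/O redirection table entry. */
static MsiMessage rte_to_msi(uint64_t entry)
{
    uint32_t vector    = (uint32_t)(entry & 0xff);         /* bits 0-7   */
    uint32_t deliv     = (uint32_t)((entry >> 8) & 0x7);   /* bits 8-10  */
    uint32_t dest_mode = (uint32_t)((entry >> 11) & 0x1);  /* bit 11     */
    uint32_t trigger   = (uint32_t)((entry >> 15) & 0x1);  /* bit 15     */
    uint32_t dest_id   = (uint32_t)((entry >> 56) & 0xff); /* bits 56-63 */
    MsiMessage msg;

    /* MSI address: 0xFEE00000 | destination ID << 12 | destination mode << 2 */
    msg.addr = 0xfee00000u | (dest_id << 12) | (dest_mode << 2);
    /* MSI data: vector | delivery mode << 8 | trigger mode << 15 */
    msg.data = vector | (deliv << 8) | (trigger << 15);
    return msg;
}
```

For a level-triggered entry with vector 0x61 and destination 0, this yields address 0xFEE00000 and data 0x8061, which is the pair that a store like stl_le_phys would then write.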
1405 | 
1406 | 
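1407 | If IR (interrupt remapping) is enabled, the IOAPIC AddressSpace is a guest address space obtained from `vtd_host_dma_iommu(bus, s, Q35_PSEUDO_DEVFN_IOAPIC)`; otherwise it is address_space_memory. A write to this AddressSpace ends up, much like MMIO, in the apic_io_ops bound to the MemoryRegion. As mentioned earlier, these ops are bound to the LAPIC's apic-msi MemoryRegion in apic_realize.
1408 | 
1409 | So: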
1410 | 
1411 | ```
1412 | stl_le_phys => address_space_stl_le => address_space_stl_internal => memory_region_dispatch_write => access_with_adjusted_size => memory_region_oldmmio_write_accessor => mr->ops->old_mmio.write[ctz32(size)] (apic_mem_writel)
1413 | ```
1414 | 
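1415 | apic_mem_writel obtains the current CPU's LAPIC (APICCommonState) through cpu_get_current_apic and then writes data to the location selected by addr.
1416 | 
1417 | So the IOAPIC has no out line; it delivers interrupts to the LAPIC through MSI.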
1418 | 
1419 | 
1420 | 
1421 | 
1422 | 
1423 | 
1424 | 
1425 | 
1426 | 
1427 | 
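1428 | #### QEMU emulates the PIC and IOAPIC, KVM emulates the LAPIC
1429 | 
1430 | Compared with the delivery flow described above, in split mode the interrupt is handed to KVM as early as ioapic_service. KVM then routes it according to its own kvm->irq_routing.
1431 | 
1432 | 
1433 | #### Interrupt chip initialization
1434 | 
1435 | In kvm_init, QEMU initializes the interrupt chip in KVM: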
1436 | 
1437 | ```
1438 | kvm_irqchip_create => kvm_arch_irqchip_create => kvm_vm_enable_cap(s, KVM_CAP_SPLIT_IRQCHIP, 0, 24) => kvm_vm_ioctl(s, KVM_ENABLE_CAP, &cap)
1439 |                    => kvm_init_irq_routing
1440 | ```
1441 | 
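1442 | For split, only the LAPIC needs to be created in KVM, so kvm_vm_ioctl(s, KVM_CREATE_IRQCHIP) is not needed. QEMU tries to enable the split capability through KVM_ENABLE_CAP and then calls kvm_init_irq_routing to initialize the interrupt routes for all IOAPIC pins.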
1443 | 
1444 | 
1445 | ##### kvm_init_irq_routing
1446 | 
1447 | ```
1448 | => kvm_check_extension(s, KVM_CAP_IRQ_ROUTING)     get the total number of GSIs supported by KVM
1449 | => create used_gsi_bitmap and allocate the irq_routes array
1450 | => kvm_arch_init_irq_routing => kvm_irqchip_add_msi_route => kvm_add_routing_entry  append the entry to the KVMState.entries array
1451 |                                                           => kvm_irqchip_commit_routes => kvm_vm_ioctl(KVM_SET_GSI_ROUTING) push the entries to KVM
1452 | ```
1453 | 
1454 | kvm_irqchip_add_msi_route is called 24 times. Each call pushes KVMState.entries to KVM as a kvm_irq_routing, with nr (the length of the entries array) going from 1 to 24. kvm_irq_routing is defined as follows:
1455 | 
1456 | ```c
1457 | struct kvm_irq_routing {
1458 |   __u32 nr;
1459 |   __u32 flags;
1460 |   struct kvm_irq_routing_entry entries[0];
1461 | };
1462 | 
1463 | struct kvm_irq_routing_entry {
1464 |   __u32 gsi;
1465 |   __u32 type;
1466 |   __u32 flags;
1467 |   __u32 pad;
1468 |   union {
1469 |     struct kvm_irq_routing_irqchip irqchip;
1470 |     struct kvm_irq_routing_msi msi;
1471 |     struct kvm_irq_routing_s390_adapter adapter;
1472 |     struct kvm_irq_routing_hv_sint hv_sint;
1473 |     __u32 pad[8];
1474 |   } u;
1475 | };
1476 | ```
1477 | 
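1478 | Because the devices have not been initialized yet, all fields of the routing entries (kvm_irq_routing_entry) are 0 at this point.
1479 | 
1480 | Later, after the virtual machine has started, the BIOS/OS updates the interrupt routing table. This triggers a VMExit back to QEMU, which performs the update (because the IOAPIC is emulated in QEMU):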
1481 | 
1482 | ```
1483 | address_space_rw => address_space_write => address_space_write_continue => memory_region_dispatch_write => access_with_adjusted_size => memory_region_write_accessor => ioapic_mem_write => ioapic_update_kvm_routes
1484 | => ioapic_entry_parse(s->ioredtbl[i], &info)
1485 | => msg.address = info.addr
1486 | => msg.data = info.data
1487 | => kvm_irqchip_update_msi_route(kvm_state, i, msg, NULL) => kvm_update_routing_entry update the KVMState.entries array with the entry
1488 | => kvm_irqchip_commit_routes => kvm_vm_ioctl(s, KVM_SET_GSI_ROUTING, s->irq_routes)     push the new KVMState.entries array to KVM
1489 | ```
1490 | 
1491 | For example, the entry for GSI 22, which belongs to the e1000, looks like this:
1492 | 
1493 | ```
1494 | (gdb) p kvm_state->irq_routes->entries[22]
1495 | $29 = {
1496 |   gsi = 22,
1497 |   type = 2,
1498 |   flags = 0,
1499 |   pad = 0,
1500 |   u = {
1501 |     irqchip = {
1502 |       irqchip = 4276092928,
1503 |       pin = 0
1504 |     },
1505 |     msi = {
1506 |       address_lo = 4276092928,
1507 |       address_hi = 0,
1508 |       data = 32865,
1509 |       {
1510 |         pad = 0,
1511 |         devid = 0
1512 |       }
1513 |     },
1514 |     adapter = {
1515 |       ind_addr = 4276092928,
1516 |       summary_addr = 32865,
1517 |       ind_offset = 0,
1518 |       summary_offset = 0,
1519 |       adapter_id = 0
1520 |     },
1521 |     hv_sint = {
1522 |       vcpu = 4276092928,
1523 |       sint = 0
1524 |     },
1525 |     pad = {[0] = 4276092928, [1] = 0, [2] = 32865, [3] = 0, [4] = 0, [5] = 0, [6] = 0, [7] = 0}
1526 |   }
1527 | }
1528 | ```
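
The msi fields in this dump can be read by hand. The sketch below decodes them under the standard x86 MSI layout (destination ID in address bits 12-19, destination mode in address bit 2, vector in data bits 0-7, delivery mode in data bits 8-10, trigger mode in data bit 15). `MsiFields` and `msi_decode` are hypothetical helper names, not part of QEMU or KVM.

```c
#include <assert.h>
#include <stdint.h>

typedef struct MsiFields {
    uint8_t dest_id;        /* address bits 12-19 */
    int logical_dest;       /* address bit 2: 0 = physical, 1 = logical */
    uint8_t vector;         /* data bits 0-7 */
    uint8_t delivery_mode;  /* data bits 8-10: 0 = Fixed */
    int level_triggered;    /* data bit 15 */
} MsiFields;

/* Split an MSI address/data pair into its architectural fields. */
static MsiFields msi_decode(uint32_t address_lo, uint32_t data)
{
    MsiFields f;

    f.dest_id = (uint8_t)((address_lo >> 12) & 0xff);
    f.logical_dest = (int)((address_lo >> 2) & 0x1);
    f.vector = (uint8_t)(data & 0xff);
    f.delivery_mode = (uint8_t)((data >> 8) & 0x7);
    f.level_triggered = (int)((data >> 15) & 0x1);
    return f;
}
```

Applied to the values above, address_lo = 4276092928 is 0xFEE00000 (physical destination, APIC ID 0) and data = 32865 is 0x8061, i.e. a level-triggered Fixed interrupt with vector 0x61 (97).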
1529 | 
1530 | 
1531 | #### KVM
1532 | 
1533 | In KVM, the call chain of kvm_vm_ioctl(s, KVM_ENABLE_CAP, &cap) is:
1534 | 
1535 | ```
1536 | kvm_vm_ioctl_enable_cap => kvm_setup_empty_irq_routing => kvm_set_irq_routing(kvm, empty_routing, 0, 0)
1537 |                         => kvm->arch.irqchip_split = true;
1538 | ```
1539 | 
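1540 | When KVM emulates the IOAPIC, kvm_set_irq_routing initializes the routing to default_routing. In split mode, by contrast, the routing has to wait for QEMU to set it, so it is initialized to empty, i.e. empty_routing.
1541 | 
1542 | kvm->arch.irqchip_split is also set to true. From then on irqchip_split, the function KVM uses to test for split mode, checks this variable.
1543 | 
1544 | 
1545 | The call chain of kvm_vm_ioctl(s, KVM_SET_GSI_ROUTING, s->irq_routes) is: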
1546 | 
1547 | ```
1548 | kvm_vm_ioctl => kvm_set_irq_routing => setup_routing_entry => kvm_set_routing_entry
1549 |                                     => rcu_assign_pointer(kvm->irq_routing, new)
1550 | ```
1551 | 
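1552 | It creates a new kvm_irq_routing_table, then iterates over the entries array passed in and calls setup_routing_entry on each entry, building a kvm_kernel_irq_routing_entry and storing it in the new table. Finally kvm->irq_routing is pointed at the new table.
1553 | 
1554 | Since the type of the entries passed in is KVM_IRQ_ROUTING_MSI (2), kvm_set_routing_entry sets the set callback to kvm_set_msi.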
1555 | 
1556 | 
1557 | 
1558 | 
1559 | #### Interrupt injection
1560 | 
1561 | Back to split mode: ioapic_service calls kvm_set_irq => kvm_vm_ioctl(s, s->irq_set_ioctl, &event) to inject the interrupt into KVM. Depending on KVM's capabilities, s->irq_set_ioctl is either KVM_IRQ_LINE or KVM_IRQ_LINE_STATUS; the difference is that the latter returns a status.
1562 | 
1563 | This enters KVM: kvm_vm_ioctl => kvm_vm_ioctl_irq_line
1564 | 
1565 | ```
1566 | => kvm_irq_map_gsi    look up kvm->irq_routing and fetch every kvm_kernel_irq_routing_entry mapped to the GSI
1567 | => kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irq_event->irq, irq_event->level, line_status) => irq_set[i].set (kvm_set_msi)
1568 | ```
1569 | 
1570 | ##### kvm_set_msi
1571 | 
1572 | ```
1573 | kvm_set_msi => kvm_set_msi_irq
1574 |             => kvm_irq_delivery_to_apic
1575 | ```
1576 | 
1577 | It parses the irq message, builds a kvm_lapic_irq, and delivers it to the LAPIC of the target vCPU.
1578 | 
1579 | kvm_irq_delivery_to_apic => kvm_apic_set_irq => __apic_accept_irq sets the interrupt on the target LAPIC:
1580 | 
1581 | ```
1582 | => act according to delivery_mode, e.g. for APIC_DM_FIXED: kvm_lapic_set_vector + kvm_lapic_set_irr
1583 | => kvm_make_request(KVM_REQ_EVENT, vcpu)
1584 | => kvm_vcpu_kick(vcpu)              kick the target vCPU out of the guest so that it handles the request
1585 | ```
1586 | 
1587 | Next, in vcpu_run => vcpu_enter_guest, since the LAPIC is in KVM, kvm_x86_ops->hwapic_irr_update (vmx_hwapic_irr_update) is called first, apparently to update the highest-priority interrupt in irr (this step is unconfirmed).
1588 | 
1589 | KVM then sees the KVM_REQ_EVENT request and calls inject_pending_event to inject the interrupt:
1590 | 
1591 | ```c
1592 | static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
1593 | {
1594 |   ...
1595 |   if (vcpu->arch.interrupt.pending) {
1596 |       kvm_x86_ops->set_irq(vcpu);
1597 |       return 0;
1598 |   }
1599 |   ...
1600 | }
1601 | ```
1602 | 
1603 | Finally vmx_inject_irq writes the interrupt into the VMCS.
1604 | 
1605 | 
1606 | 
1607 | 
1608 | 
1609 | 
1610 | 
1611 | 
1612 | 
1613 | 
1614 | 
1615 | 
1616 | 
1617 | 
1618 | 
1619 | 
1620 | 
1621 | 
1622 | 
1623 | 
1624 | 
1625 | 
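1626 | ### The complete interrupt injection flow
1627 | 
1628 | Take the interrupt raised when the e1000 receives a packet as an example.
1629 | 
1630 | The interrupt needs to be set in the following situations: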
1631 | 
1632 | * MMIO
1633 |     address_space_rw => address_space_read => address_space_read_full => address_space_read_continue => memory_region_dispatch_read => memory_region_dispatch_read1 => access_with_adjusted_size => memory_region_read_accessor => e1000_mmio_read => mac_icr_read / ... => set_interrupt_cause => pci_set_irq
1634 | 
1635 | * QEMU receives a packet for the e1000
1636 | 
1637 |     main_loop => main_loop_wait => qemu_clock_run_all_timers => qemu_clock_run_timers => timerlist_run_timers => ra_timer_handler => ndp_send_ra => ip6_output => if_output => if_start => if_encap => slirp_output => qemu_send_packet => qemu_sendv_packet_async => qemu_net_queue_send_iov => qemu_net_queue_deliver_iov => qemu_deliver_packet_iov => e1000_receive_iov => set_ics => set_interrupt_cause => pci_set_irq
1638 | 
1639 | * Mitigation timer expiry (triggered by the main thread)
1640 | 
1641 |     main_loop => main_loop_wait => qemu_clock_run_all_timers => qemu_clock_run_timers => timerlist_run_timers => e1000_mit_timer => set_interrupt_cause => pci_set_irq
1642 | 
1643 | All of them end up calling pci_set_irq to set the interrupt.
1644 | 
1645 | ```
1646 | pci_set_irq => pci_intx   read PCI_INTERRUPT_PIN from the PCI configuration space
1647 |             => pci_irq_handler => pci_set_irq_state          set the device's irq_state
1648 |                                => pci_update_irq_status      set the PCI_STATUS_INTERRUPT bit in PCI_STATUS of the configuration space
1649 |                                => pci_irq_disabled           return immediately if interrupts are disabled
1650 |                                => pci_change_irq_level       otherwise fire the interrupt
1651 | ```
1652 | 
1653 | ##### pci_change_irq_level
1654 | 
1655 | ```c
1656 | static void pci_change_irq_level(PCIDevice *pci_dev, int irq_num, int change)
1657 | {
1658 |     PCIBus *bus;
1659 |     for (;;) {
1660 |         bus = pci_dev->bus;
1661 |         irq_num = bus->map_irq(pci_dev, irq_num);
1662 |         if (bus->set_irq)
1663 |             break;
1664 |         pci_dev = bus->parent_dev;
1665 |     }
1666 |     bus->irq_count[irq_num] += change;
1667 |     bus->set_irq(bus->irq_opaque, irq_num, bus->irq_count[irq_num] != 0);
1668 | }
1669 | ```
1670 | 
1671 | It gets the bus that the PCI device sits on and calls bus->map_irq to find the device's pirq (Programmable Interrupt Router) number. For the e1000 the bus is pcie.0 and map_irq is ich9_lpc_map_irq:
1672 | 
1673 | ```c
1674 | int ich9_lpc_map_irq(PCIDevice *pci_dev, int intx)
1675 | {
1676 |     BusState *bus = qdev_get_parent_bus(&pci_dev->qdev);
1677 |     PCIBus *pci_bus = PCI_BUS(bus);
1678 |     PCIDevice *lpc_pdev =
1679 |             pci_bus->devices[PCI_DEVFN(ICH9_LPC_DEV, ICH9_LPC_FUNC)];
1680 |     ICH9LPCState *lpc = ICH9_LPC_DEVICE(lpc_pdev);
1681 | 
1682 |     return lpc->irr[PCI_SLOT(pci_dev->devfn)][intx];
1683 | }
1684 | ```
1685 | 
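1686 | It first gets the bus object the device belongs to (pcie.0) via qdev_get_parent_bus, then finds the ICH9 LPC PCI to ISA bridge in the array of devices attached to the bus, and looks up the e1000's pirq number in its irr, which is 6.
1687 | 
1688 | If the current bus level defines a set_irq function, the for loop is left and set_irq is called to send the interrupt; otherwise pci_dev is set to the bus's parent device and the search continues. In other words, starting from the device that raised the interrupt, the code walks up level by level until it finds a bus that can handle the interrupt.
1689 | 
1690 | Here set_irq is ich9_lpc_set_irq. The entry for the current interrupt in the count array is increased by change, which records how many interrupts of this kind are waiting to be handled. Then ich9_lpc_set_irq is called.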
1691 | 
1692 | ```c
1693 | void ich9_lpc_set_irq(void *opaque, int pirq, int level)
1694 | {
1695 |     ICH9LPCState *lpc = opaque;
1696 |     int pic_irq, pic_dis;
1697 | 
1698 |     assert(0 <= pirq);
1699 |     assert(pirq < ICH9_LPC_NB_PIRQS);
1700 | 
1701 |     ich9_lpc_update_apic(lpc, ich9_pirq_to_gsi(pirq));
1702 |     ich9_lpc_pic_irq(lpc, pirq, &pic_irq, &pic_dis);
1703 |     ich9_lpc_update_pic(lpc, pic_irq);
1704 | }
1705 | ```
1706 | 
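1707 | ich9_pirq_to_gsi converts the pirq into a GSI number, which is simply pirq + 16, so 22 for the e1000. ich9_lpc_update_apic is then called; level is 1 if the entry for the current interrupt in the count array is non-zero.
1708 | 
1709 | Using the GSI number, the matching qemu_irq is taken from ICH9LPCState.gsi and qemu_set_irq sets it to level.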
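
The arithmetic of the last two steps, the shared-line counter and the pirq to GSI mapping, can be sketched as follows. The constants and names (`PIC_NUM_PINS`, `NB_PIRQS`, `SharedLines`, `change_irq_level`) are illustrative assumptions: the offset 16 comes from the text above, and 8 PIRQ lines is assumed for the array size.

```c
#include <assert.h>

#define PIC_NUM_PINS 16   /* from the text: GSI = pirq + 16 */
#define NB_PIRQS      8   /* assumed number of PIRQ lines */

/* Map a PIRQ number to its GSI number. */
static int pirq_to_gsi(int pirq)
{
    return pirq + PIC_NUM_PINS;
}

/* Per-line count of devices currently asserting each shared PIRQ. */
typedef struct SharedLines {
    int irq_count[NB_PIRQS];
} SharedLines;

/* Apply a +1 / -1 change from one device and return the resulting level.
 * The line stays high as long as at least one device still asserts it. */
static int change_irq_level(SharedLines *bus, int pirq, int change)
{
    bus->irq_count[pirq] += change;
    return bus->irq_count[pirq] != 0;
}
```

With two devices sharing pirq 6, the line only drops once both have deasserted, and the GSI that is finally driven is pirq_to_gsi(6), i.e. 22.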
1710 | 
1711 | #### Case: e1000, IOAPIC and LAPIC are all emulated by QEMU (off)
1712 | 
1713 | The handler of the gsi qemu_irq is gsi_handler, so:
1714 | 
1715 | qemu_set_irq => irq->handler (gsi_handler) => qemu_set_irq => ioapic_set_irq sets irr in IOAPICCommonState.
1716 | 
1717 | At this point the Remote IRR bit may be 1, in which case ioapic_service is not called after irr is set.
1718 | 
1719 | Only once the LAPIC has finished and sent an EOI, clearing the IOAPIC's Remote IRR bit, will a later ioapic_set_irq call ioapic_service.
1720 | 
1721 | Because irr may have accumulated several interrupts by then, ioapic_service walks all pins of the IOAPIC and, for every pin whose bit in irr is 1, uses stl_le_phys to write the interrupt to its location in the IOAPIC AddressSpace.
1722 | 
1723 | A write to this AddressSpace ends up, much like MMIO, in the apic_io_ops bound to the MemoryRegion. That calls apic_mem_writel, which builds the MSI message and sends it through apic_send_msi.
1724 | 
1725 | ```
1726 | apic_deliver_irq => apic_bus_deliver => apic_set_irq => apic_set_bit => apic_set_bit(s->irr, vector_num)   set the Interrupt Request Register bit for the vector
1727 |                                                                      => apic_set_bit(s->tmr, vector_num)   if level triggered, set the Trigger Mode Register bit
1728 |                                                      => apic_update_irq         notify the CPU
1729 | ```
1730 | 
1731 | ##### apic_update_irq
1732 | 
1733 | ```c
1734 | /* signal the CPU if an irq is pending */
1735 | static void apic_update_irq(APICCommonState *s)
1736 | {
1737 |     CPUState *cpu;
1738 |     DeviceState *dev = (DeviceState *)s;
1739 | 
1740 |     cpu = CPU(s->cpu);
1741 |     if (!qemu_cpu_is_self(cpu)) {
1742 |         cpu_interrupt(cpu, CPU_INTERRUPT_POLL);
1743 |     } else if (apic_irq_pending(s) > 0) {
1744 |         cpu_interrupt(cpu, CPU_INTERRUPT_HARD);
1745 |     } else if (!apic_accept_pic_intr(dev) || !pic_get_output(isa_pic)) {
1746 |         cpu_reset_interrupt(cpu, CPU_INTERRUPT_HARD);
1747 |     }
1748 | }
1749 | ```
1750 | 
1751 | So:
1752 | 
1753 | ```
1754 | cpu_interrupt(cpu, CPU_INTERRUPT_POLL) => cpu_interrupt_handler (kvm_handle_interrupt) => cpu->interrupt_request |= mask
1755 |                                                                                        => qemu_cpu_kick
1756 | ```
1757 | 
1758 | This sets interrupt_request on the target cpu and then kicks it so that it exits to QEMU and returns to kvm_cpu_exec. The exit reason is KVM_EXIT_INTR, which kvm_arch_handle_exit cannot handle, so ret = -1, the loop is left, and control returns to the caller qemu_kvm_cpu_thread_fn. On the next iteration kvm_cpu_exec => kvm_arch_process_async_events sees that CPU_INTERRUPT_POLL is set in interrupt_request and calls apic_poll_irq => apic_update_irq => cpu_interrupt(cpu, CPU_INTERRUPT_HARD). That is, if the LAPIC has an unhandled interrupt (apic_irq_pending), CPU_INTERRUPT_HARD is added to interrupt_request.
1759 | 
1760 | Then, in the following kvm_arch_pre_run, if an interrupt can be injected, the vector is taken from the LAPIC through cpu_get_pic_interrupt => apic_get_interrupt:
1761 | 
1762 | ```
1763 | => apic_irq_pending(s)              take the highest-priority vector from irr
1764 | => apic_reset_bit(s->irr, intno)    clear the vector's bit in irr
1765 | => apic_set_bit(s->isr, intno)      set the vector's bit in isr
1766 | => apic_update_irq(s)               if other interrupts are still pending, set CPU_INTERRUPT_HARD in cpu->interrupt_request again
1767 | ```
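
The irr to isr hand-over in these steps can be modelled with two 256-bit bitmaps. This is only a sketch: `ApicModel` and the helper names are hypothetical, and it leaves out the processor priority check and the spurious-vector case that the real apic_get_interrupt also handles.

```c
#include <assert.h>
#include <stdint.h>

/* 256 vectors, one bit each, stored as eight 32-bit words. */
typedef struct ApicModel {
    uint32_t irr[8];
    uint32_t isr[8];
} ApicModel;

static void set_bit256(uint32_t *tab, int v)
{
    tab[v >> 5] |= 1u << (v & 31);
}

static void clear_bit256(uint32_t *tab, int v)
{
    tab[v >> 5] &= ~(1u << (v & 31));
}

static int test_bit256(const uint32_t *tab, int v)
{
    return (int)((tab[v >> 5] >> (v & 31)) & 1u);
}

/* Highest set vector, or -1 if the bitmap is empty. */
static int highest_bit256(const uint32_t *tab)
{
    for (int v = 255; v >= 0; v--) {
        if (test_bit256(tab, v)) {
            return v;
        }
    }
    return -1;
}

/* Move the highest pending vector from irr to isr and return it. */
static int apic_model_get_interrupt(ApicModel *s)
{
    int v = highest_bit256(s->irr);

    if (v < 0) {
        return -1;            /* nothing pending */
    }
    clear_bit256(s->irr, v);
    set_bit256(s->isr, v);
    return v;
}
```

Calling apic_model_get_interrupt repeatedly drains irr from the largest vector downwards, which is why a second pending interrupt is still visible after the first one has been taken.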
1768 | 
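1769 | Once the vector is known, it is injected into KVM with kvm_vcpu_ioctl(cpu, KVM_INTERRUPT, &intr).
1770 | 
1771 | If earlier interrupts are still unhandled, cpu->interrupt_request still has CPU_INTERRUPT_HARD set at this point. Only one interrupt can be injected at a time, so request_interrupt_window is set to 1, which guarantees that the guest exits back to QEMU as soon as it is able to take the next interrupt.
1772 | 
1773 | The irq injected into KVM here is the **interrupt vector**.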
1774 | 
1775 | ##### KVM
1776 | 
1777 | kvm_arch_vcpu_ioctl => kvm_vcpu_ioctl_interrupt => kvm_queue_interrupt(vcpu, irq->irq, false)   store the interrupt in vcpu->arch.interrupt
1778 |                                                 => kvm_make_request(KVM_REQ_EVENT, vcpu)        raise the request
1779 | 
1780 | Then, when QEMU next enters KVM through `kvm_vcpu_ioctl(cpu, KVM_RUN, 0)`, kvm_arch_vcpu_ioctl_run => vcpu_run => vcpu_enter_guest sees the KVM_REQ_EVENT request and calls inject_pending_event to inject the interrupt:
1784 | 
1785 | ```c
1786 | static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
1787 | {
1788 |   ...
1789 |   if (vcpu->arch.interrupt.pending) {
1790 |       kvm_x86_ops->set_irq(vcpu);
1791 |       return 0;
1792 |   }
1793 |   ...
1794 | }
1795 | ```
1796 | 
1797 | Finally vmx_inject_irq writes the interrupt into the VMCS.
1798 | 
1799 | 
1800 | 
1801 | 
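1802 | #### Case: e1000 and IOAPIC are emulated by QEMU, LAPIC is emulated by KVM (split)
1803 | 
1804 | The path from the e1000 to the IOAPIC is the same as above, up to ioapic_service. There kvm_irqchip_is_split checks for split mode. If it is split, KVM emulates the LAPIC, so the interrupt is set through kvm_set_irq. (In on mode the IOAPIC is also emulated by KVM and this code is never reached, which is why only split is checked here.)
1805 | 
1806 | So kvm_set_irq => kvm_vm_ioctl(s, s->irq_set_ioctl, &event) injects the interrupt into KVM. Depending on KVM's capabilities, s->irq_set_ioctl is either KVM_IRQ_LINE or KVM_IRQ_LINE_STATUS; the difference is that the latter returns a status.
1807 | 
1808 | The irq injected into KVM here is the **GSI** of the interrupting device. The GSI of the e1000 is 22.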
1809 | 
1810 | 
1811 | 
1812 | 
1813 | ##### KVM
1814 | 
1815 | kvm_vm_ioctl => kvm_vm_ioctl_irq_line => kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irq_event->irq, irq_event->level, line_status)
1816 | 
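1817 | The matching entry is found in the table and its set callback is called. When that callback is kvm_set_ioapic_irq the chain is kvm_set_ioapic_irq => kvm_ioapic_set_irq => ioapic_set_irq => ioapic_service => kvm_irq_delivery_to_apic => kvm_apic_set_irq => __apic_accept_irq; with the MSI-type entries that QEMU installs in split mode (see the kvm_set_msi section above), the callback is kvm_set_msi, which reaches kvm_irq_delivery_to_apic directly. Either way __apic_accept_irq sets the interrupt on the target LAPIC: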
1818 | 
1819 | ```
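1820 | => act according to delivery_mode, e.g. for APIC_DM_FIXED: kvm_lapic_set_vector + kvm_lapic_set_irr
1821 | => kvm_make_request(KVM_REQ_EVENT, vcpu)
1822 | => kvm_vcpu_kick(vcpu)              kick the target vCPU out of the guest so that it handles the request
1823 | ```
1824 | 
1825 | Next, in vcpu_run => vcpu_enter_guest, since the LAPIC is in KVM, kvm_x86_ops->hwapic_irr_update (vmx_hwapic_irr_update) is called first, apparently to update the highest-priority interrupt in irr (this step is unconfirmed).
1826 | 
1827 | KVM then sees the KVM_REQ_EVENT request and calls inject_pending_event to inject the interrupt: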
1828 | 
1829 | ```c
1830 | static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
1831 | {
1832 |   ...
1833 |   if (vcpu->arch.interrupt.pending) {
1834 |       kvm_x86_ops->set_irq(vcpu);
1835 |       return 0;
1836 |   }
1837 |   ...
1838 | }
1839 | ```
1840 | 
1841 | Finally vmx_inject_irq writes the interrupt into the VMCS.
1842 | 
1843 | 
1844 | 
1845 | #### Case: e1000 is emulated by QEMU, IOAPIC and LAPIC are emulated by KVM (on)
1846 | 
1847 | Here the handler of the gsi qemu_irq is kvm_pc_gsi_handler, so:
1848 | 
1849 | qemu_set_irq => irq->handler (kvm_pc_gsi_handler) => qemu_set_irq(s->ioapic_irq[n], level) => irq->handler (kvm_ioapic_set_irq) => kvm_set_irq(kvm_state, s->kvm_gsi_base + irq, level) => kvm_vm_ioctl(s, s->irq_set_ioctl, &event) injects the interrupt into KVM through an ioctl.
1850 | 
1851 | The irq injected into KVM here is the **GSI** of the interrupting device. Since s->kvm_gsi_base is 0, the GSI computed for the e1000, s->kvm_gsi_base + irq, is still 22.
1852 | 
1853 | So in both split and on mode, regardless of where the IOAPIC is emulated, the interrupt is ultimately injected through KVM's KVM_IRQ_LINE / KVM_IRQ_LINE_STATUS interface, and the GSI is 22 in both cases.
1854 | 
1855 | 
1856 | 
1857 | ##### KVM
1858 | 
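1859 | The flow inside KVM is the same as in split mode. The difference is that in on mode QEMU has to query the state of the KVM-emulated IOAPIC through an interface, whereas in split mode QEMU emulates the IOAPIC itself and already has that state.
1860 | 
1861 | For example, when hmp queries the IOAPIC in on mode, it has to go through kvm_ioapic_dump_state => kvm_ioapic_get => kvm_vm_ioctl(kvm_state, KVM_GET_IRQCHIP, &chip).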
1862 | 
1863 | 


--------------------------------------------------------------------------------
/interrupt_and_io/IA32_manual_Ch10.md:
--------------------------------------------------------------------------------
  1 | # Intel IA32 Manual Chapter 10 APIC
  2 | ## Overview
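  3 | The Local APIC (APIC for short below) can receive interrupts from the following sources:
  4 | 1. Local interrupts received on the LINT0 and LINT1 pins
  5 | 2. External interrupts received through the IOAPIC
  6 | 3. IPIs sent by other CPUs (or even by itself)
  7 | 4. Interrupts generated by the APIC Timer
  8 | 5. Interrupts caused by APIC internal errors
  9 | 6. Interrupts generated by the Performance Monitoring Counters (c.f. 18.6.3.5.8)
 10 | 7. Interrupts generated by the thermal sensor (c.f. 14.7.2)
 11 | 
 12 | Sources 1, 4, 5, 6 and 7 are collectively called local interrupt sources, and the LVT (Local Vector Table) controls how they are accepted. Writing the Interrupt Command Register (ICR) sends an IPI. The IOAPIC sends interrupts to the APIC as interrupt messages; the message already specifies the Interrupt Vector, so no LVT configuration is needed on the APIC side.
 13 | 
 14 | In the P6 family era (Pentium Pro through Pentium III), the APICs of the CPUs talked over a dedicated APIC bus, which carried both IPI messages and interrupt messages. Starting with the Pentium 4, inter-APIC communication moved to the system bus, and interrupt messages from the IOAPIC also reach the APIC over the system bus.
 15 | > **Note:** Some APIC settings exist only on the Pentium and P6 family; they are marked with :warning: below. Some of them involve the APIC bus specific to those processors. If you are interested, see sections 10.7, 10.10 and 10.13 of the manual.
 16 | 
 17 | The APIC originated from the external Intel 82489DX chip, evolved into the APIC in the Pentium and P6 era, became the xAPIC starting with the Pentium 4, and has more recently been extended into the x2APIC, which offers more features.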
 18 | 
 19 | ## Basic functions
 20 | ### Overview
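 21 | APIC registers are accessed through MMIO. They occupy the 4KB page starting at physical address 0xFEE00000; the exact register map is in Table 10-1 of the manual. Whether a CPU supports the APIC, and whether it supports x2APIC, can both be determined by querying CPUID.
 22 | 
 23 | > **Info:** The physical address of the register page can be changed. This is an ancient feature left over from the P6 era, meant to avoid conflicts with programs that already used this address range; nobody uses it today. Also note that the page must be mapped as strong uncacheable (UC) in the page tables.
 24 | 
 25 | The APIC can be disabled/enabled in hardware; disabling and re-enabling it is equivalent to a power cycle. It can also be [disabled/enabled in software](#spurious-interrupt), which is only a temporary disable: its state is preserved when it is re-enabled. Note that after power-up the APIC is in the software-disabled state by default.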
 26 | 
 27 | ### APIC MSR
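 28 | The only MSR related to the APIC is `IA32_APIC_BASE`:
 29 | - Bit 8 indicates whether this CPU is the BSP (1 means BSP)
 30 | - Bit 11 hardware-enables/disables the APIC (0 means disabled; since the Pentium 4 it can be re-enabled after being disabled, on P6 it cannot). When disabled, it is as if the APIC did not exist and CPUID reports no APIC support; after re-enabling, the APIC returns to its power-on reset state
 31 | - Bits 12-35 hold the APIC Base, i.e. the physical address of the APIC registers (the actual physical address is the APIC Base shifted left by 12 bits, so it is exactly 4K aligned) **[Note: this feature exists only for compatibility]**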
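
The three fields listed above can be pulled out of a raw MSR value as sketched below. `ApicBase` and `decode_apic_base` are hypothetical names; reading the MSR itself (RDMSR, privileged) is outside the scope of the sketch, which only decodes a value that has already been read.

```c
#include <assert.h>
#include <stdint.h>

typedef struct ApicBase {
    int is_bsp;          /* bit 8 */
    int enabled;         /* bit 11: hardware enable */
    uint64_t base_phys;  /* bits 12-35, already shifted into an address */
} ApicBase;

/* Decode a raw IA32_APIC_BASE value into its fields. */
static ApicBase decode_apic_base(uint64_t msr)
{
    ApicBase b;

    b.is_bsp = (int)((msr >> 8) & 1);
    b.enabled = (int)((msr >> 11) & 1);
    b.base_phys = msr & 0xFFFFFF000ull;   /* keep bits 12-35 only */
    return b;
}
```

For example, the value 0xFEE00900 decodes to a hardware-enabled APIC on the BSP with its registers at the default address 0xFEE00000.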
 32 | 
 33 | ### Local APIC ID Register
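 34 | Located at Base + 0x20, it serves as the CPU's ID. Its initial value is determined and assigned by hardware at power-up and is used by the BIOS and/or the OS, so it should not be modified afterwards (and modification is only supported on some CPU models). If the value has been changed, the initial value can still be read from `CPUID.01H.EBX[31:24]`.
 35 | 
 36 | The Pentium and P6 family use only bits 24-27 (4 bits), so at most 15 CPUs are supported. xAPIC mode, starting with the Pentium 4, uses bits 24-31 (8 bits), so at most 255 CPUs are supported. x2APIC uses a 32-bit x2APIC ID stored in an MSR, so in practice there is no limit on the number of CPUs.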
 37 | 
 38 | ### Local APIC Version Register
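 39 | Located at Base + 0x30, read-only, with the following fields:
 40 | - Bits 0-7 are the Version Number
 41 | - Bits 16-23 are Max LVT Entry, whose value is the number of LVT entries minus 1
 42 | - Bit 24 indicates support for [Suppress EOI-broadcast](#spurious-interrupt-vector-register)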
 43 | 
 44 | ## Handling Local Interrupts
 45 | ### Local Vector Table
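 46 | The LVT has (at most) seven entries, each of which configures the delivery of interrupts from one source:
 47 | - LVT CMCI Register, Base + 0x2F0
 48 |     - Delivers the Corrected Machine Check Error Interrupt: once the number of corrected Machine Check Errors exceeds a threshold, a CMCI is raised (present since the Xeon 5500; disabled by default)
 49 | - LVT Timer Register, Base + 0x320
 50 |     - Delivers interrupts generated by the APIC Timer
 51 | - LVT Thermal Monitor Register, Base + 0x330
 52 |     - Delivers interrupts generated by the thermal sensor (present since the Pentium 4)
 53 | - LVT Performance Counter Register, Base + 0x340 (this address is for reference only; the actual location of the register is implementation specific)
 54 |     - Delivers interrupts generated by performance counter overflow (present since the P6 family)
 55 | - LVT LINT0 Register, Base + 0x350
 56 |     - Forwards interrupts from the LINT0 pin
 57 | - LVT LINT1 Register, Base + 0x360
 58 |     - Forwards interrupts from the LINT1 pin
 59 | - LVT Error Register, Base + 0x370
 60 |     - Delivers interrupts generated by APIC internal errors
 61 | 
 62 | The fields of these registers are:
 63 | - Bits 0-7 are the Vector, i.e. the interrupt vector number the CPU receives. Vectors 0-15 are illegal and cause an Illegal Vector error (bit 6 of the ESR, [see below](#error-handling))
 64 | - Bits 8-10 are the Delivery Mode, with the following values:
 65 |     - 000 (Fixed): deliver the interrupt vector given by Vector to the CPU
 66 |     - 010 (SMI): deliver an SMI to the CPU; Vector must be 0 in this mode
 67 |     - 100 (NMI): deliver an NMI to the CPU; Vector is ignored
 68 |     - 111 (ExtINT): make the CPU respond as it would to an external 8259A. This causes an INTA cycle in which the CPU asks the external controller for the Vector.
 69 |         - The APIC supports only one ExtINT source; in the whole system only one LVT entry of one CPU should be configured in ExtINT mode
 70 | - Bit 12 is the Delivery Status (read-only): 0 means idle, 1 means the CPU has not yet accepted the interrupt (no EOI yet)
 71 | - Bit 13 is the Interrupt Input Pin Polarity: 0 means active high, 1 means active low
 72 | - Bit 14 is the Remote IRR Flag (read-only). If the accepted interrupt is fixed mode and level triggered, 1 means the CPU has accepted the interrupt (it has been added to the IRR) but has not yet issued an EOI. After the CPU issues the EOI the bit returns to 0.
 73 | - Bit 15 is the Trigger Mode: 0 means edge triggered, 1 means level triggered (there are many caveats in practice, see section 10.5.1 of the manual)
 74 | - Bit 16 is the Mask: 0 allows the interrupt, 1 inhibits it; the value after reset is 1
 75 | - Bit 17 (or bits 17-18) is the Timer Mode. Only the LVT Timer Register has it; it switches between the three modes of the APIC Timer, described below
 76 | 
 77 | Note that not every LVT register has all of these fields. The details are shown below: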
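
The bit layout listed above can be turned into a small decoder. This is a sketch for illustration: `LvtFields` and `decode_lvt` are hypothetical names, and the decoder extracts every field regardless of whether a particular LVT register actually implements it.

```c
#include <assert.h>
#include <stdint.h>

typedef struct LvtFields {
    uint8_t vector;         /* bits 0-7 */
    uint8_t delivery_mode;  /* bits 8-10: 0 Fixed, 2 SMI, 4 NMI, 7 ExtINT */
    int delivery_status;    /* bit 12, read-only */
    int active_low;         /* bit 13 */
    int remote_irr;         /* bit 14, read-only */
    int level_triggered;    /* bit 15 */
    int masked;             /* bit 16 */
    uint8_t timer_mode;     /* bits 17-18, LVT Timer Register only */
} LvtFields;

/* Split a raw 32-bit LVT register value into its fields. */
static LvtFields decode_lvt(uint32_t reg)
{
    LvtFields f;

    f.vector = (uint8_t)(reg & 0xff);
    f.delivery_mode = (uint8_t)((reg >> 8) & 0x7);
    f.delivery_status = (int)((reg >> 12) & 1);
    f.active_low = (int)((reg >> 13) & 1);
    f.remote_irr = (int)((reg >> 14) & 1);
    f.level_triggered = (int)((reg >> 15) & 1);
    f.masked = (int)((reg >> 16) & 1);
    f.timer_mode = (uint8_t)((reg >> 17) & 0x3);
    return f;
}
```

For instance, 0x00010000 (only the Mask bit set, matching the reset state of the Mask described above) decodes to a masked entry with vector 0, and 0x000200EF decodes to an unmasked periodic timer entry with vector 0xEF.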
 78 | 
 79 | ![local_vector_table](/assets/local_vector_table.png)
 80 | 
 81 | ### Error Handling
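 82 | When an internal error occurs in the APIC, the cause is recorded in the Error Status Register (ESR) at Base + 0x280, with the following contents:
 83 | - :warning: Bit 0: Send Checksum Error, only valid on the Pentium and P6 family, **obsolete, can be ignored**
 84 | - :warning: Bit 1: Receive Checksum Error, only valid on the Pentium and P6 family, **obsolete, can be ignored**
 85 | - :warning: Bit 2: Send Accept Error, only valid on the Pentium and P6 family, **obsolete, can be ignored**
 86 | - :warning: Bit 3: Receive Accept Error, only valid on the Pentium and P6 family, **obsolete, can be ignored**
 87 | - Bit 4: Redirectable IPI, an attempt was made to send a Lowest Priority Mode IPI but the CPU does not support it
 88 |     - Only some models use this bit. Because the ability to send Lowest Priority Mode IPIs is model specific, BIOS and OS are advised not to use the feature.
 89 | - Bit 5: Send Illegal Vector, an attempt was made (through an IPI) to send an Interrupt Vector in the range 0-15
 90 | - Bit 6: Receive Illegal Vector, an Interrupt Vector in the range 0-15 was received, either from the APIC's own LVT or through an Interrupt Message carrying an IPI or an external interrupt
 91 | - Bit 7: Illegal Register Address, an attempt was made to read a register that does not exist
 92 | 
 93 | Before reading the ESR, software must first write 0 to it. The write clears the previous flags, loads the newly detected flags, and re-arms the error detection mechanism.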
 94 | 
 95 | ### APIC Timer
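 96 | The APIC Timer is a 32-bit timer implemented with two 32-bit counter registers:
 97 | - Initial Count Register, Base + 0x380
 98 | - Current Count Register, Base + 0x390 (read-only)
 99 | 
100 | The counter frequency is the APIC Timer's base frequency divided by the divisor selected in the Divide Configuration Register. The Divide Configuration Register is at Base + 0x3E0, and its bits 0, 1 and 3 select the divisor: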
101 | 
102 | ![divider](/assets/divide_configuration_register.png)
103 | 
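104 | > **Info:** The APIC Timer may stop when the CPU sleeps. Check `CPUID.06H.EAX.ARAT[bit 2]` (APIC Timer Always Running bit) to find out whether it keeps running at all times. The base frequency of the APIC Timer is the bus frequency (external clock) or the core crystal clock frequency (if it can be obtained from `CPUID.15H.ECX`).
105 | 
106 | The APIC Timer has three operating modes, selected through bits 17-18 of the LVT Timer Register:
107 | - One shot (00): writing Initial Count starts the timer. Current Count counts down from Initial Count until it reaches zero, triggers one interrupt, and stops
108 | - Periodic (01): writing Initial Count restarts the timer. Current Count repeatedly counts down from Initial Count to 0 and triggers an interrupt each time it reaches 0
109 | - TSC-Deadline Mode (10)
110 |     - `CPUID.01H.ECX.TSC_Deadline[bit 24]` indicates whether TSC-Deadline mode is supported; if not, bit 18 is reserved
111 |     - In this mode writes to Initial Count are ignored and Current Count is always 0. The timer is controlled by the MSR `IA32_TSC_DEADLINE_MSR`: writing a non-zero 64-bit value arms the timer, and an interrupt is triggered when the TSC reaches that value. The interrupt fires only once, after which `IA32_TSC_DEADLINE_MSR` is reset to zero.
112 |     > **Note:** Writing the LVT Timer Register to switch to TSC-Deadline Mode is a memory operation. An `MFENCE` must be placed between it and the following `WRMSR` instruction to prevent reordering
113 | 
114 | Note that in the first two modes writing 0 to Initial Count stops the timer, while in the third mode writing 0 to `IA32_TSC_DEADLINE_MSR` does. Changing the mode also stops the timer. The same effect can of course be achieved by masking the timer interrupt with the Mask bit of the LVT Timer Register.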
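
The divisor encoding can also be written out in code. The sketch below follows the mapping in the Divide Configuration Register figure of the Intel manual (the 3-bit code formed from bits 3, 1 and 0 selects divide-by 2, 4, 8, 16, 32, 64, 128, and 111 selects divide-by 1). `apic_timer_divisor` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Return the APIC Timer divisor selected by a Divide Configuration
 * Register value. Only bits 0, 1 and 3 take part in the encoding. */
static unsigned apic_timer_divisor(uint32_t dcr)
{
    /* Pack bit 3 next to bits 1:0 to form a 3-bit code. */
    unsigned code = (dcr & 0x3u) | ((dcr >> 1) & 0x4u);

    /* Codes 0..6 mean divide by 2, 4, ..., 128; code 7 means divide by 1. */
    return code == 7 ? 1u : (2u << code);
}
```

The counter frequency is then the base frequency divided by the returned value; for example a register value of 0x3 selects divide-by-16.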
115 | 
116 | ## Handling Interrupts
117 | ### Interrupt Acceptance for Fixed Interrupts
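118 | The registers involved in accepting interrupts are mainly the Interrupt Request Register (IRR) and the In-Service Register (ISR). Both are 256 bits wide, with one bit per Interrupt Vector. The IRR is at Base + 0x200 through Base + 0x270 (eight 32-bit registers), and the ISR is at Base + 0x100 through Base + 0x170 (eight 32-bit registers).
119 | 
120 | A Fixed Interrupt received by the APIC first enters the IRR: its bit is set, meaning the interrupt has been received but not yet delivered to the CPU. When the CPU is ready to take a new interrupt, the APIC picks the highest-priority interrupt in the IRR (the one with the largest Vector), clears its bit, sets the corresponding bit in the ISR, and delivers it to the CPU.
121 | 
122 | When the CPU finishes the current interrupt handler, it should write 0 to the EOI Register (at Base + 0xB0; writing a non-zero value appears to cause #GP, though the manual does not say so), which clears the bit of the highest-priority interrupt in the ISR (the one with the largest Vector).
123 | 
124 | While the CPU is handling an interrupt it can be interrupted by one of higher priority. If the Priority Class of a new interrupt in the IRR is higher than the CPU's current Priority Class (see the next section), it can be delivered to the CPU right away and written into the ISR, preempting the running interrupt handler (provided the CPU has not disabled interrupts).
125 | 
126 | > **Info:** When an interrupt arrives whose Priority Class is less than or equal to the CPU's current Priority Class, it can still be placed in the IRR if its slot there is free, even if the same Vector already occupies a slot in the ISR. In other words, the APIC supports two pending interrupts per Vector, whereas the PIC supports only one pending interrupt per Vector (the one in service).
127 | 
128 | There is also a Trigger Mode Register (TMR), again 256 bits wide with one bit per Interrupt Vector. When an interrupt arrives, besides being added to the IRR, its bit in the TMR is set according to whether it is Level Triggered; 1 means Level Triggered. For a Level Triggered interrupt, an EOI Message is broadcast over the bus to all IOAPICs after the CPU writes the EOI Register.
129 | > **Info:** This default broadcast can be suppressed by setting bit 12 of the [Spurious Interrupt Vector Register](#spurious-interrupt-vector-register). In that case software must write the EOI Register of the IOAPIC that sent the interrupt itself to complete the EOI.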
130 | 
131 | ### Interrupt, Task, and Processor Priority
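132 | The manual calls the upper 4 bits of an interrupt vector the Interrupt Priority Class; the lower 4 bits have no special name. Priority between interrupts is determined purely by numeric value: the larger vector has the higher priority.
133 | 
134 | Whether the interrupt the CPU is currently handling can be preempted by a newly arrived one depends on two registers:
135 | - Task Priority Register (TPR), at Base + 0x80. Bits 4-7 are the Task Priority Class, and bits 0-3 are the Task Priority Sub-Class.
136 |     > **Info:** In IA-32e (x86-64) mode there is also the CR8 register, whose bits 0-3 likewise hold the Task Priority Class. TPR[7:4] = CR8[3:0] always holds, but an OS should pick either TPR or CR8 and not mix the two.
137 | - Processor Priority Register (PPR), at Base + 0xA0. Bits 4-7 are the Processor Priority Class, and bits 0-3 are the Processor Priority Sub-Class.
138 | 
139 | Their meaning is as follows:
140 | - The Processor Priority Class is the CPU's current Priority Class. Only a new interrupt whose Interrupt Priority Class is greater than it may be injected into the CPU and moved from the IRR to the ISR. The Processor Priority Sub-Class has no practical effect.
141 | - The PPR is a read-only value determined by the TPR and ISRV, where ISRV is the highest vector in the ISR (ISRV is 0 if the ISR is empty). The formula is:
142 |     - `PPR[7:4] = max(TPR[7:4], ISRV[7:4])`
143 |     - PPR[3:0] has no practical effect and is omitted here
144 | 
145 | In other words, the priority mechanism groups every 16 interrupt vectors into one class, and each class outranks the one below it. The Task Priority states that the CPU wants to (temporarily) block interrupts of that Priority Class and below. In addition, the interrupt currently being handled can only be preempted by an interrupt whose Priority Class is higher than the highest Priority Class in the ISR.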
146 | 
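The PPR formula above can be written as a small C sketch. The function names `ppr_class` and `can_dispatch` are made up for illustration; they model only the class comparison described in this section, not real hardware access.

```c
#include <stdbool.h>
#include <stdint.h>

/* Processor Priority Class: the larger of the TPR class and the class of
 * the highest in-service vector (pass isrv = 0 when the ISR is empty). */
static uint8_t ppr_class(uint8_t tpr, uint8_t isrv)
{
    uint8_t task_class = tpr >> 4;
    uint8_t isr_class = isrv >> 4;
    return task_class > isr_class ? task_class : isr_class;
}

/* A pending fixed interrupt may move from the IRR to the ISR only if its
 * priority class is strictly greater than the processor priority class. */
static bool can_dispatch(uint8_t vector, uint8_t tpr, uint8_t isrv)
{
    return (vector >> 4) > ppr_class(tpr, isrv);
}
```

For example, with TPR = 0x20 and vector 0x51 in service, vector 0x61 can preempt, while 0x5F (same class as 0x51) has to wait.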
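147 | ### Summary: Handling Flow Since Pentium 4 and Xeon
148 | 1. Determine whether this APIC is a target of the interrupt message; continue only if it is
149 | 2. If the interrupt is an NMI, SMI, INIT, ExtINT, or SIPI, hand it directly to the CPU
150 | 3. Otherwise set the appropriate bit in the IRR
151 | 4. For interrupts pending in the IRR, the APIC dispatches them in order according to their priority and the current CPU priority (held in the PPR), giving the CPU one interrupt at a time
152 | 5. When the interrupt handler finishes, it should write the EOI Register so that the APIC removes the interrupt's entry from the ISR and completes handling (NMI, SMI, INIT, ExtINT, and SIPI do not require an EOI write)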
153 | 
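Steps 3 to 5 above can be modeled in software as follows. This is a toy sketch, not an emulator: `struct lapic_model` and its helper functions are hypothetical names, and only fixed interrupts, the IRR, the ISR, and the TPR are modeled.

```c
#include <stdint.h>

/* Toy model of the 256-bit IRR and ISR, one bit per vector. */
struct lapic_model {
    uint32_t irr[8];
    uint32_t isr[8];
    uint8_t tpr;
};

static void set_bit(uint32_t *reg, int v)   { reg[v >> 5] |= 1u << (v & 31); }
static void clear_bit(uint32_t *reg, int v) { reg[v >> 5] &= ~(1u << (v & 31)); }

/* Highest set vector in a 256-bit register, or -1 if it is empty. */
static int highest(const uint32_t *reg)
{
    for (int v = 255; v >= 0; v--)
        if (reg[v >> 5] & (1u << (v & 31)))
            return v;
    return -1;
}

/* Step 3: a fixed interrupt arrives and becomes pending in the IRR. */
static void lapic_accept(struct lapic_model *a, int vector)
{
    set_bit(a->irr, vector);
}

/* Step 4: move the highest pending vector into the ISR if its class is
 * above the processor priority class. Returns the vector, or -1. */
static int lapic_dispatch(struct lapic_model *a)
{
    int pending = highest(a->irr);
    int in_service = highest(a->isr);
    int ppr_class = a->tpr >> 4;

    if (in_service >= 0 && (in_service >> 4) > ppr_class)
        ppr_class = in_service >> 4;
    if (pending < 0 || (pending >> 4) <= ppr_class)
        return -1;
    clear_bit(a->irr, pending);
    set_bit(a->isr, pending);
    return pending;
}

/* Step 5: an EOI write clears the highest in-service vector. */
static void lapic_eoi(struct lapic_model *a)
{
    int in_service = highest(a->isr);
    if (in_service >= 0)
        clear_bit(a->isr, in_service);
}
```

With vectors 0x31 and 0x52 pending, the model dispatches 0x52 first and holds 0x31 back until the EOI for 0x52 has been written.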
154 | ## Spurious Interrupt Vector Register
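155 | A spurious interrupt arises as follows. When the CPU accepts an ExtINT interrupt, obtaining the vector takes two cycles: in the first, the INTR pin receives the signal; in the second (the INTA cycle), the vector is fetched from the external controller. Ordinary interrupts, by contrast, obtain the vector in the same cycle in which INTR is signaled. Because this sequence is not atomic, if the CPU masks the ExtINT interrupt through its LVT entry right at the INTA cycle, the APIC delivers a spurious interrupt instead.
156 | 
157 | The Spurious-Interrupt Vector Register (SVR), at Base + 0xF0, is laid out as follows:
158 | - Bits 0-7 are the Spurious Vector, the interrupt vector the APIC delivers when it generates a spurious interrupt
159 |     - On Pentium and P6 family, bits 0-3 are hardwired to 1, so the low 4 bits of the Spurious Vector are best set to all ones
160 | - Bit 8 software-enables or disables the APIC: 1 enables, 0 disables
161 |     - After power-up or reset, the APIC is software-disabled by default
162 |     - While software-disabled, the state of the IRR and ISR is preserved
163 |     - While software-disabled, the APIC still responds to NMI, SMI, INIT, and SIPI, and IPIs can still be sent through the ICR
164 |     - While software-disabled, the Mask bits of all LVT entries are forced to 1 (masked)
165 | - :warning: Bit 9 is Focus Processor Checking: 1 disables Focus Processor, 0 enables it. Focus Processor is a concept used by Pentium and P6 family when handling Lowest Priority Mode and **is no longer relevant**
166 | - Bit 12 is Suppress EOI Broadcasts; setting it to 1 stops the EOI of a level-triggered interrupt from being broadcast to the IOAPICs as an EOI message
167 |     > **Info:** Not all CPU models support this feature; bit 24 of the APIC Version Register reports whether it is supported.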
168 | 
169 | ## Issuing IPIs
170 | ### Interrupt Command Register (ICR)
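171 | The ICR is a 64-bit register split into ICR_Low and ICR_High, at Base + 0x300 (Low) and Base + 0x310 (High). Writing ICR_Low sends an IPI.
172 | > **Info:** The contents of ICR_Low may be lost after entering a deep-sleep C-state
173 | 
174 | ICR_Low is laid out as follows:
175 | - Bits 0-7 are the Vector, the interrupt vector the target CPU receives. Vectors 0-15 are illegal and raise an Illegal Vector error (ESR bit 6) in the target CPU's APIC
176 | - Bits 8-10 are the Delivery Mode, with the following values:
177 |     - 000 (Fixed): delivers the interrupt vector given by Vector to the target CPU(s)
178 |     - 001 (Lowest Priority): delivers the interrupt vector given by Vector to the lowest-priority CPU among all target CPU(s) selected by Destination ([see below](#lowest-priority-mode))
179 |         > **Info:** The ability to send a Lowest Priority IPI depends on the CPU model and is not always available, so the BIOS and OS should not send Lowest Priority IPIs
180 |     - 010 (SMI): sends an SMI to the target CPU(s); the Vector must be 0 in this mode
181 |     - 100 (NMI): sends an NMI to the target CPU(s); the Vector is ignored
182 |     - 101 (INIT): sends an INIT IPI to the target CPU(s), causing them to perform an INIT (see Table 9-1 of the manual for the CPU state after INIT); the Vector must be 0 in this mode
183 |         > **Info:** A CPU keeps its APIC ID and Arb ID (Pentium and P6 only) across an INIT
184 |     - :warning: 101 (INIT Level De-assert): broadcasts a special IPI to all CPUs that resets the Arb ID (Pentium and P6 only) of every APIC to its initial value (the initial APIC ID). To use this mode, Level must be 0, Trigger Mode must be 1, and Destination Shorthand must be All Including Self.
185 |         - Only valid on Pentium and P6 family; **obsolete and can be ignored**
186 |     - 110 (Start-up): sends a Start-up IPI (SIPI) to the target CPU(s); the target starts executing at 0x000VV000, where 0xVV is the Vector value
187 | - Bit 11 is the Destination Mode: 0 means Physical, 1 means Logical ([see below](#determining-ipi-destination))
188 | - Bit 12 is the Delivery Status (read-only): 0 means idle, 1 means the previous IPI has not finished sending
189 | - :warning: Bit 14 is Level: 0 makes Delivery Mode 101 mean INIT Level De-assert; otherwise it means INIT.
190 |     - Only meaningful on Pentium and P6 family; **on Pentium 4 and later this bit must always be 1**
191 | - :warning: Bit 15 is Trigger Mode, the trigger mode used in INIT Level De-assert mode: 0 means edge triggered, 1 means level triggered.
192 |     - Only meaningful on Pentium and P6 family; **on Pentium 4 and later this bit must always be 0**
193 | - Bits 18-19 are the Destination Shorthand. When a shorthand is given, the Destination field in ICR_High is not needed to select the target CPU, so an IPI can be sent with a single write to ICR_Low. The field takes the following values:
194 |     - 00 (No Shorthand): the target CPU is selected by Destination
195 |     - 01 (Self): the target CPU is the sender itself
196 |     - 10 (All Including Self): broadcasts the IPI to all CPUs; the Destination of the IPI message is set to 0xF (Pentium and P6) or 0xFF (Pentium 4 and later) to mark it as a global broadcast
197 |     - 11 (All Excluding Self): same as above, except that the IPI is not sent to the sender
198 | 
199 | Only bits 56-63 of the ICR (bits 24-31 of ICR_High) are used; they hold the Destination, which selects the target CPU(s) of the IPI. The next section explains how.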
200 | 
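As an illustration of the field layout, the sketch below packs the ICR_Low and ICR_High values for xAPIC mode. The enum and function names are assumptions made for this example. The bit positions follow the Intel SDM figure for the ICR (Level at bit 14, Trigger Mode at bit 15), and Level is always set to 1 as required on Pentium 4 and later.

```c
#include <stdint.h>

enum icr_delivery {
    ICR_FIXED = 0, ICR_LOWEST = 1, ICR_SMI = 2,
    ICR_NMI = 4, ICR_INIT = 5, ICR_STARTUP = 6
};

enum icr_shorthand {
    ICR_NO_SHORTHAND = 0, ICR_SELF = 1,
    ICR_ALL_INCL_SELF = 2, ICR_ALL_EXCL_SELF = 3
};

/* Build the 32-bit value to write to ICR_Low (Base + 0x300). */
static uint32_t icr_low(uint8_t vector, enum icr_delivery mode,
                        int logical, enum icr_shorthand shorthand)
{
    return (uint32_t)vector
         | ((uint32_t)mode << 8)              /* bits 8-10: Delivery Mode */
         | ((uint32_t)(logical != 0) << 11)   /* bit 11: Destination Mode */
         | (1u << 14)                         /* bit 14: Level = assert */
         | ((uint32_t)shorthand << 18);       /* bits 18-19: Shorthand */
}

/* Build the 32-bit value to write to ICR_High (Base + 0x310). */
static uint32_t icr_high_xapic(uint8_t dest)
{
    return (uint32_t)dest << 24;              /* ICR bits 56-63 */
}
```

ICR_High has to be written before ICR_Low, because the write to ICR_Low is what sends the IPI.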
201 | ### Determining IPI Destination
202 | #### Physical Destination Mode
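203 | If Destination Mode is 0, Physical mode is used. In this mode the Destination field holds the APIC ID of the target CPU; 0xF (Pentium and P6) or 0xFF (Pentium 4 and later) means global broadcast.
204 | 
205 | #### Logical Destination Mode
206 | If Destination Mode is 1, Logical mode is used. In this mode the Destination field holds the Message Destination Address (MDA). Once the IPI message is on the bus, each CPU's APIC decides whether to accept it based on its own Logical Destination Register (LDR) and Destination Format Register (DFR).
207 | 
208 | The Logical Destination Register is at Base + 0xD0; bits 24-31 hold the Logical APIC ID.
209 | 
210 | The Destination Format Register is at Base + 0xE0; bits 28-31 hold the Model: 0000 means Cluster Model, 1111 means Flat Model.
211 | > **Note:** The DFR of every (software-)enabled APIC must be set to the same model. The DFR should be set as early as possible during boot, and the APIC should be (software-)enabled only after the DFR model has been set
212 | 
213 | ##### Flat Model
214 | The APIC `AND`s its LDR with the MDA of the IPI message on the bus and accepts the IPI if the result is non-zero.
215 | 
216 | ##### Cluster Model (presumably intended for NUMA; not recommended for casual use)
217 | - :warning: Flat Cluster mode, supported only by Pentium and P6 family
218 |     - The LDR and MDA are each split into two parts. The upper 4 bits hold the Cluster; only when the upper 4 bits of the LDR and MDA are exactly equal does the APIC belong to that Cluster, and then the lower 4 bits are compared. The lower 4 bits select individual APICs: the IPI is accepted if the `AND` of the lower 4 bits of the LDR and MDA is non-zero.
219 |     - A Cluster value of 15 means broadcast to all Clusters, so at most 15 Clusters (0-14) are supported. An MDA of 0xFF gives a global broadcast.
220 | - Hierarchical Cluster mode, supported by all models
221 |     - The system is divided into several Clusters, up to 15. Each Cluster needs a special piece of hardware called the Cluster Manager and can have up to 4 Agents, for a total of at most 60 APIC Agents
222 |     - The manual does not describe the addressing scheme. From the description above it appears to be the same as Flat Cluster with a different hardware structure (the manual also mentions no software way to switch between the two modes, so assuming the same addressing is reasonable)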
223 | 
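The two matching rules can be sketched in C as below. `flat_match` and `cluster_match` are illustrative names; both take the 8-bit Logical APIC ID from the LDR and the 8-bit MDA from the message, and the cluster variant follows the Flat Cluster description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flat Model: accept if the LDR and the MDA share at least one set bit. */
static bool flat_match(uint8_t ldr, uint8_t mda)
{
    return (ldr & mda) != 0;
}

/* Cluster Model: the upper nibble is the cluster number (0xF addresses
 * every cluster) and the lower nibble is a bitmask of APICs in it. */
static bool cluster_match(uint8_t ldr, uint8_t mda)
{
    uint8_t mda_cluster = mda >> 4;
    bool cluster_ok = mda_cluster == 0xF || mda_cluster == (ldr >> 4);
    return cluster_ok && (ldr & mda & 0x0F) != 0;
}
```

In the flat model an MDA of 0x0C reaches the APICs whose LDR has bit 2 or bit 3 set. In the cluster model an MDA of 0x23 reaches the APICs with low-nibble bit 0 or bit 1 in cluster 2.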
224 | #### Lowest Priority Mode
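225 | When an IPI is sent to multiple CPUs through Logical Destination Mode or a Destination Shorthand, Lowest Priority Mode can be used to pick the lowest-priority CPU in the target set as the recipient; only that CPU receives the IPI.
226 | 
227 | On Xeon and later CPUs the selection is made automatically by the chipset, and the CPU may be unable to influence it [needs further investigation #TODO#]. On Pentium 4, the chipset can obtain each CPU's Task Priority through a special bus cycle (presumably TPR[7:0] [uncertain]) and so pick the lowest-priority CPU when arbitration is needed.
228 | 
229 | On Pentium and P6 family, arbitration relies on the Arbitration Priority Register (APR, at Base + 0x90), whose value is: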
230 | ```
231 | if (TPR[7:4] >= IRRV[7:4] && TPR[7:4] > ISRV[7:4]) {
232 |     APR[7:0] = TPR[7:0]
233 | } else {
234 |     APR[7:4] = max(TPR[7:4], ISRV[7:4], IRRV[7:4])
235 |     APR[3:0] = 0
236 | }
237 | ```
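238 | Here ISRV is the highest vector in the ISR and IRRV is the highest vector in the IRR. The CPU with the smallest APR wins and receives the IPI.
239 | 
240 | > **Note:** Besides IPIs, interrupts sent by an IOAPIC and interrupts sent through MSI can also use Lowest Priority Mode. The feature was presumably designed to work with the TPR: whenever the OS switches tasks (processes, threads, etc.), it updates the TPR (Task Priority), so that high-priority tasks are not interrupted and interrupts go preferentially to CPUs running low-priority tasks.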
241 | 
242 | ## Message Signalled Interrupts
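243 | An MSI is sent by performing a PCI write transaction to a special address (which identifies the target CPU). When a PCI device wants to raise an interrupt, it reads the destination address from its Message Address Register (plus the Message Upper Address Register if the address is 64-bit) and the contents of the interrupt message from its Message Data Register. The two registers are described below:
244 | 
245 | Message Address Register:
246 | - Bits 20-31 must be 0xFEE
247 | - Bits 12-19 are the Destination ID, equivalent to bits 56-63 of an IOAPIC RTE and to the Destination in ICR_High
248 | - Bit 3 is the Redirection Hint Indication (RH): 0 corresponds to fixed mode, 1 to lowest priority mode
249 |     - With RH = 1 and DM = 0, the Destination must not be 0xFF; any other value is delivered normally to the physical destination
250 |     - With RH = 1, DM = 1, and the Clustered Addressing Model, the Destination must not be 0xFF, i.e. global broadcast is not allowed
251 | - Bit 2 is the Destination Mode (DM): 0 means Physical Mode, 1 means Logical Mode
252 | 
253 | Message Data Register:
254 | - Bits 0-7 are the Vector, the interrupt vector the target CPU receives. Vectors 0-15 are illegal and raise an Illegal Vector error (ESR bit 6) in the target CPU's APIC
255 | - Bits 8-10 are the Delivery Mode, with the following values:
256 |     - 000 (Fixed): delivers the interrupt vector given by Vector to the target CPU(s)
257 |     - 001 (Lowest Priority): delivers the interrupt vector given by Vector to the lowest-priority CPU among all target CPU(s) selected by Destination
258 |     - 010 (SMI): sends an SMI to the target CPU(s); the Vector must be 0 and the SMI must be edge triggered
259 |     - 100 (NMI): sends an NMI to the target CPU(s) (through the #NMI pin); the Vector is ignored, and an NMI is always edge triggered regardless of the Trigger Mode setting
260 |     - 101 (INIT): sends an INIT IPI to the target CPU(s), causing them to perform an INIT; the Vector is ignored, and an INIT is always edge triggered regardless of the Trigger Mode setting
261 |     - 111 (ExtINT): sends an 8259A-compatible interrupt signal to the target CPU(s), which triggers an INTA cycle in which the CPU(s) fetch the Vector from the external controller; ExtINT must be edge triggered
262 | - Bit 14 is Level: for a level-triggered interrupt, 1 means Assert and 0 means Deassert
263 | - Bit 15 is Trigger Mode: 0 means edge triggered, 1 means level triggered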
264 | 
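The two register layouts above can be composed as follows. This is a minimal sketch with hypothetical helper names (`msi_address`, `msi_data`); it only packs the fields listed in this section and leaves all other bits zero.

```c
#include <stdint.h>

/* Message Address Register: 0xFEE in bits 20-31, Destination ID in
 * bits 12-19, RH in bit 3, DM in bit 2. */
static uint32_t msi_address(uint8_t dest_id, int rh, int dm)
{
    return 0xFEE00000u
         | ((uint32_t)dest_id << 12)
         | ((uint32_t)(rh != 0) << 3)
         | ((uint32_t)(dm != 0) << 2);
}

/* Message Data Register: Vector in bits 0-7, Delivery Mode in bits 8-10,
 * Level in bit 14, Trigger Mode in bit 15. */
static uint32_t msi_data(uint8_t vector, uint8_t delivery_mode,
                         int level_assert, int level_triggered)
{
    return (uint32_t)vector
         | ((uint32_t)(delivery_mode & 0x7) << 8)
         | ((uint32_t)(level_assert != 0) << 14)
         | ((uint32_t)(level_triggered != 0) << 15);
}
```

A fixed, edge-triggered interrupt with vector 0x41 aimed at APIC ID 1 in Physical mode is therefore a write of data 0x41 to address 0xFEE01000.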
265 | ## Extended XAPIC (x2APIC)
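266 | Setting bit 10 of the `IA32_APIC_BASE` MSR to 1 enables x2APIC. After power-up or reset the APIC first comes up in xAPIC mode and can only then enter x2APIC mode. Once in x2APIC mode it cannot switch back to xAPIC mode directly (attempting to do so raises #GP); returning to xAPIC mode requires a reset (hardware disable followed by re-enable).
267 | > **Info:** An INIT in x2APIC mode returns to the initial state of x2APIC mode, not xAPIC mode; likewise an INIT in xAPIC mode returns to the initial state of xAPIC mode. Enabling x2APIC mode while the APIC is hardware-disabled is not allowed (attempting it raises #GP), and an INIT while hardware-disabled leaves the APIC disabled.
268 | 
269 | > Before enabling x2APIC mode, the BIOS should also check for and enable Extended Interrupt Mode in VT-d
270 | 
271 | In x2APIC mode the registers are accessed through MSRs; MSRs 0x800 through 0x8FF are reserved for x2APIC. These MSRs cannot be accessed before entering x2APIC mode (doing so raises #GP). Conversely, once in x2APIC mode, the xAPIC MMIO region behaves as if the APIC were disabled.
272 | 
273 | Compared with xAPIC, x2APIC registers map almost one to one (an MSR is 64 bits wide but corresponds to a 32-bit xAPIC register, with the upper 32 bits reserved), except for the following changes:
274 | - The DFR register is removed
275 | - ICR_Low and ICR_High are merged into a single 64-bit ICR register (MSR 0x830)
276 | - A SELF IPI register is added (MSR 0x83F)
277 | 
278 | ### x2APIC ID
279 | The APIC ID is widened to 32 bits; the extended ID is called the x2APIC ID and fills all 32 bits of the APIC ID Register. The register becomes read-only and is set once by hardware at power-up; its low 8 bits serve as the APIC ID in xAPIC mode.
280 | > **Info:** Real implementations may support fewer than 32 bits of x2APIC ID, in which case the unsupported upper bits always read 0. Writing 0xFFFFFFFF and reading it back reveals the actual range.
281 | 
282 | If x2APIC mode is supported, `CPUID.0BH.EDX` returns the full 32-bit x2APIC ID, which lets the BIOS detect in xAPIC mode that the system's APIC IDs exceed the limit of 256. If they do, the BIOS must either (a) enable x2APIC mode on all CPUs before handing off to the OS, or (b) enable only CPUs whose APIC ID is at most 255 (putting the rest into deep sleep so the OS cannot start them) and stay in xAPIC mode.
283 | 
284 | ### ICR Operation
285 | The low 32 bits of the ICR are the same as the old ICR_Low (except that the Delivery Status bit is removed); the Destination field in the upper 32 bits is widened from 8 to 32 bits.
286 | 
287 | A Destination of 0xFFFFFFFF means broadcast in both Physical and Logical mode. Physical mode semantics are straightforward; the changes to Logical mode are discussed below.
288 | 
289 | The valid contents of the LDR are widened from 8 to 32 bits and become read-only; the value is called the Logical x2APIC ID. x2APIC mode drops the Flat Model, so following the Cluster Model the Logical x2APIC ID is split into a 16-bit Cluster ID (upper half) and a 16-bit Logical ID (lower half). The former is compared against the upper 16 bits of the Destination, and the latter is `AND`ed with the lower 16 bits.
290 | 
291 | In fact, the Logical x2APIC ID is derived from the x2APIC ID at initialization: `Logical x2APIC ID = (x2APIC ID[19:4] << 16) | (1 << x2APIC ID[3:0])`.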
292 | 
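The derivation and the matching rule can be checked with a short C sketch. The function names are illustrative only, and the broadcast case follows the statement above that a Destination of 0xFFFFFFFF always means broadcast.

```c
#include <stdbool.h>
#include <stdint.h>

/* Logical x2APIC ID = (x2APIC ID[19:4] << 16) | (1 << x2APIC ID[3:0]). */
static uint32_t logical_x2apic_id(uint32_t x2apic_id)
{
    uint32_t cluster = (x2apic_id >> 4) & 0xFFFFu;
    return (cluster << 16) | (1u << (x2apic_id & 0xFu));
}

/* Logical-mode match in x2APIC mode: the cluster IDs must be equal and
 * the logical-ID bitmasks must overlap; 0xFFFFFFFF is a broadcast. */
static bool x2apic_logical_match(uint32_t ldr, uint32_t dest)
{
    if (dest == 0xFFFFFFFFu)
        return true;
    return (ldr >> 16) == (dest >> 16) && (ldr & dest & 0xFFFFu) != 0;
}
```

For x2APIC ID 0x25 this gives cluster 2 and logical bit 5, i.e. 0x00020020.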
293 | ### Self IPI Register
294 | This register is write-only (reading it raises #GP). Only bits 0-7 are valid, and they hold the interrupt vector. Writing it is equivalent to sending an edge-triggered, fixed-delivery Self IPI through the ICR.
295 | 


--------------------------------------------------------------------------------
/interrupt_and_io/IOAPIC.md:
--------------------------------------------------------------------------------
  1 | # IOAPIC
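  2 | The full name of the IOAPIC is the 82093AA I/O Advanced Programmable Interrupt Controller; its [datasheet](https://pdos.csail.mit.edu/6.828/2016/readings/ia32/ioapic.pdf) can be found under that part number.
  3 | 
  4 | The IOAPIC is accessed through MMIO, and only two 32-bit registers are directly accessible: I/O Register Select (IOREGSEL, only the low 8 bits are valid) and I/O Window (IOWIN), at physical addresses 0xFEC0xy00 and 0xFEC0xy10 (where x and y are configurable through the APIC Base Address Relocation Register). The former supplies the index: bits 0-7 select the register to access. The latter supplies the data and is used to read or write the selected IOAPIC register.
  5 | 
  6 | > **Info:** The APIC Base Address Relocation Register is a register of the PIIX3 chip, Intel's southbridge from the late 1990s. It was dropped when the southbridge moved to the ICH series, but a new configuration register was added later that places the two registers at 0xFEC0x000 and 0xFEC0x010 (where x is configurable)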
  7 | ## Registers
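  8 | - IOAPIC ID (IOAPICID), at Index 0x0; bits 24-27 hold the IOAPIC ID, which identifies the IOAPIC
  9 | - IOAPIC Version (IOAPICVER), at Index 0x1, read-only
 10 |     - Bits 0-7 hold the APIC Version, which should be 0x11
 11 |     - Bits 16-23 hold the Maximum Redirection Entry, i.e. the number of Redirection Table entries minus 1, which should be 0x17 (23)
 12 | - IOAPIC Arbitration ID (IOAPICARB), at Index 0x2, read-only; bits 24-27 hold the Arb ID, used for APIC Bus arbitration
 13 |     - LAPICs communicate with each other and with IOAPICs over the APIC Bus, and every LAPIC and IOAPIC has an Arb ID used for arbitration. The agent with the highest Arb ID wins and sets its own ID to 0; every other APIC increments its Arb ID by one (except the one whose Arb ID was 15, which takes the winner's previous Arb ID).
 14 |     > **Note:** The APIC Bus is a technology of Pentium and P6 family. Starting with Pentium 4, LAPICs communicate with each other and with IOAPICs over the system bus and do not arbitrate by Arb ID, so the Arb ID is no longer relevant today.
 15 | - Redirection Table (IOREDTBL[0:23]), at Index 0x10-0x3F (64 bits per entry), configures interrupt forwarding; each entry is called an RTE (Redirection Table Entry)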
 16 | 
 17 | ## Interrupt Redirection
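 18 | When an interrupt reaches the IOAPIC, it is forwarded to the CPU(s) according to the Redirection Table. Each entry (RTE) of this table is laid out as follows:
 19 | - Bits 56-63 are the Destination, identifying the target CPU(s): the APIC ID in Physical mode, the MDA in Logical mode
 20 |     - In Physical mode only bits 56-59 are valid; bits 60-63 must be 0
 21 | - Bit 16 is the Mask: 0 allows the interrupt, 1 blocks it; the value after reset is 1
 22 | - Bit 15 is Trigger Mode: 0 means edge triggered, 1 means level triggered
 23 | - Bit 14 is Remote IRR, read-only and meaningful only for level-triggered interrupts. 1 means the destination CPU has accepted the interrupt; when the EOI from the CPU arrives, it returns to 0 to indicate the interrupt is complete
 24 |     > **Note:** While Remote IRR is 1, it keeps the active signal on a level-triggered IRQ line from triggering another interrupt. If an active signal always produced an interrupt, interrupts would fire continuously for as long as the signal stayed active (e.g. high), which is clearly wrong, so the Remote IRR bit blocks the interrupt. It follows that the CPU should first get the IRQ line back to the inactive state and only then issue the EOI; otherwise the interrupt fires again.
 25 | - Bit 13 is the Interrupt Input Pin Polarity: 0 means active high, 1 means active low
 26 | - Bit 12 is the Delivery Status (read-only): 0 means idle, 1 means the CPU has not yet accepted the interrupt (has not yet stored it in its IRR)
 27 |     - If the destination CPU already has two interrupts pending for a vector, the IOAPIC effectively holds a third pending interrupt for that vector
 28 | - Bit 11 is the Destination Mode: 0 means Physical, 1 means Logical
 29 | - Bits 8-10 are the Delivery Mode, with the following values:
 30 |     - 000 (Fixed): delivers the interrupt vector given by Vector to the target CPU(s)
 31 |     - 001 (Lowest Priority): delivers the interrupt vector given by Vector to the lowest-priority CPU among all target CPU(s) selected by Destination
 32 |         - For details on this mode, see Chapter 10 of Volume 3 of the Intel IA32 manual
 33 |     - 010 (SMI): sends an SMI to the target CPU(s); the Vector must be 0 and the SMI must be edge triggered
 34 |     - 100 (NMI): sends an NMI to the target CPU(s) (through the #NMI pin); the Vector is ignored and the NMI must be edge triggered
 35 |     - 101 (INIT): sends an INIT IPI to the target CPU(s), causing them to perform an INIT (see Table 9-1 of Volume 3 of the Intel IA32 manual for the CPU state after INIT); the Vector must be 0 and the interrupt must be edge triggered
 36 |         > **Info:** A CPU keeps its APIC ID and Arb ID (Pentium and P6 only) across an INIT
 37 |     - 111 (ExtINT): sends an 8259A-compatible interrupt signal to the target CPU(s), which triggers an INTA cycle in which the CPU(s) fetch the Vector from the external controller; ExtINT must be edge triggered
 38 | - Bits 0-7 are the Vector, the interrupt vector the target CPU receives; the valid range is 16-254 (0-15 are reserved, 255 is global broadcast)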
 39 | 
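To make the RTE layout concrete, the sketch below packs the writable fields into a 64-bit value and computes the register index of an entry. `struct rte_fields`, `rte_encode`, and `rte_index_low` are names invented for this example. The index calculation assumes the usual layout in which entry n occupies Index 0x10 + 2n (low dword) and 0x11 + 2n (high dword), which matches the 0x10-0x3F range given above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Writable fields of a Redirection Table Entry. */
struct rte_fields {
    uint8_t vector;         /* bits 0-7 */
    uint8_t delivery_mode;  /* bits 8-10 */
    bool logical;           /* bit 11: Destination Mode */
    bool active_low;        /* bit 13: pin polarity */
    bool level;             /* bit 15: Trigger Mode */
    bool masked;            /* bit 16: Mask */
    uint8_t dest;           /* bits 56-63 */
};

static uint64_t rte_encode(const struct rte_fields *f)
{
    uint64_t v = f->vector;
    v |= (uint64_t)(f->delivery_mode & 0x7) << 8;
    v |= (uint64_t)f->logical << 11;
    v |= (uint64_t)f->active_low << 13;
    v |= (uint64_t)f->level << 15;
    v |= (uint64_t)f->masked << 16;
    v |= (uint64_t)f->dest << 56;
    return v;
}

/* Index to write into IOREGSEL for the low dword of entry `pin`;
 * the high dword is at the next index. */
static uint8_t rte_index_low(unsigned pin)
{
    return (uint8_t)(0x10 + 2 * pin);
}
```

Programming an entry then means writing the low and high halves of the encoded value through IOREGSEL and IOWIN, one dword at a time.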
 40 | ### Destination Mode
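 41 | Physical Mode means the Destination holds the APIC ID of the target LAPIC. Logical Mode means the Destination holds a Message Destination Address (MDA), which can refer to a group of LAPICs (i.e. it can be used for multicast). For the MDA, see Chapter 10 of Volume 3 of the Intel IA32 manual.
 42 | 
 43 | If Logical Mode addressing refers to a group of CPUs and the Lowest Priority delivery mode is selected, the interrupt ends up at the lowest-priority CPU in the group. (For how priority is determined, see Chapter 10 of Volume 3 of the Intel IA32 manual; one method is based on the TPR register of the LAPIC.)
 44 | 
 45 | ### Pin
 46 | According to the datasheet, the 24 interrupt input pins of the IOAPIC are typically wired as follows:
 47 | 
 48 | - Pin #1 is connected to the keyboard interrupt (IRQ1)
 49 | - Pin #2 is connected to IRQ0
 50 | - Pins #3-#11, #14, #15 are connected to ISA IRQ[3:7,8#,9:11,14:15] respectively
 51 | - Pin #12 is connected to the mouse interrupt (IRQ12/M)
 52 | - Pins #16-#19 carry PCI IRQ[0:3]
 53 | - Pins #20-#21 carry Motherboard IRQ[0:1]
 54 | - Pin #23 carries the SMI interrupt. If it is masked, the SMI is routed out through the IOAPIC's #SMIOUT pin; otherwise the IOAPIC forwards it according to RTE #23
 55 | 
 56 | This describes the typical wiring in the PIIX3 chipset era; to learn how current chipsets wire these pins, consult the datasheet of a recent chipset.
 57 | 
 58 | Note that if a device (such as the keyboard controller) has its interrupt signal connected to the PIC through an IRQ line, the signal is also connected to an interrupt input pin of the IOAPIC. For example, the keyboard controller is connected to the PIC through IRQ1 and also to the IOAPIC through Pin #1. If the PIC and IOAPIC are both enabled, a single device interrupt may therefore reach the CPU **twice**, so one of the two must be masked and only the other used. Usually the PIC is masked (by writing 0xFF to the PIC's OCW1).
 59 | 
 60 | By convention, the first 16 pins of the first IOAPIC in the system (if there is more than one) correspond to the 16 IRQs of the PIC; that is, Pin #x and IRQ x carry the same interrupt signal (0 <= x < 16). There are two exceptions. First, the INTR output pin of the master PIC is connected to Pin #0 of the IOAPIC (also a convention; the MP Spec does not require the PIC's INTR output to be connected to the IOAPIC), so Pin #0 does not correspond to IRQ0. Second, Pin #2 corresponds to IRQ0: IRQ2 is the slave PIC and needs no IOAPIC pin, and since Pin #0 is not connected to IRQ0, Pin #2 is connected to IRQ0 instead.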
 61 | 
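The conventional IRQ-to-pin correspondence described above can be expressed as a tiny lookup. `isa_irq_to_ioapic_pin` is a hypothetical helper, and real systems should take the mapping from the firmware tables instead of hard-coding it.

```c
/* Map a legacy PIC IRQ number to the IOAPIC pin it is conventionally
 * wired to. Returns -1 if the IRQ has no IOAPIC pin of its own. */
static int isa_irq_to_ioapic_pin(unsigned irq)
{
    if (irq > 15 || irq == 2)   /* IRQ2 is the cascade from the slave PIC */
        return -1;
    if (irq == 0)               /* Pin #0 carries the PIC's INTR output */
        return 2;
    return (int)irq;            /* all other IRQs map one to one */
}
```

So the timer on IRQ0 arrives on Pin #2, and the keyboard on IRQ1 arrives on Pin #1.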
 62 | ## IOAPIC till ICH9
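 63 | As chip integration increased, the IOAPIC was integrated into the southbridge. The southbridge datasheets of each generation show which devices its pins are connected to and what features were added. This section covers the IOAPIC in the ICH9 southbridge of the Q35 chipset emulated by QEMU.
 64 | 
 65 | ### IOxAPIC
 66 | Starting with ICH1 the IOAPIC is integrated into the southbridge, and what the ICH integrates is no longer the original 82093AA IOAPIC but a modified version that can be called the IOxAPIC. The changes made to the IOAPIC over the generations are described below:
 67 | 
 68 | #### ICH1
 69 | ICH1 still reports IOAPIC Version 0x11 but actually made some changes:
 70 | - The APIC Base Address Relocation Register is removed, so the Index Register (IOREGSEL) and Data Register (IOWIN) are fixed at 0xFEC00000 and 0xFEC00010
 71 | - Two 32-bit write-only registers are added, the IRQ Pin Assertion Register and the EOI Register, at 0xFEC00020 and 0xFEC00040
 72 | - Bit 15 of the IOAPIC Version Register reports whether the IRQ Assertion Register is supported (previously a reserved bit that defaulted to 0)
 73 | 
 74 | IRQ Pin Assertion Register: bits 0-4 hold the IRQ Number and the remaining bits are reserved. Writing a value val to this register raises an interrupt on the corresponding IRQ.
 75 | 
 76 | > **Info:** The IRQ Assertion Register was the mechanism used by early MSI implementations. At that time the MSI Address Register was set to 0xFEC00020 and the MSI Data Register to the IRQ Number to perform an MSI. MSI later switched to Upstream Memory Writes, which write directly to the CPU and are handled by the CPU, and the feature was removed from the IOAPIC.
 77 | 
 78 | EOI Register: bits 0-7 hold the interrupt vector and the remaining bits are reserved. Writing a value val to this register clears the Remote IRR bit of the RTEs whose interrupt vector is val.
 79 | 
 80 | #### PCI Age (ICH2 - ICH5)
 81 | Starting with ICH2 the IOAPIC Version changes to 0x20 (the document says 0x02 but it is actually 0x20; a Version of 0x2X indicates PCI 2.2 support), with the following change:
 82 | - An indirectly accessed register is added, the Boot Configuration Register, at index 0x3. It is readable and writable, and only the lowest bit is valid: 0 sends interrupt messages over the APIC bus (the default), 1 sends them over the system bus (FSB)
 83 | 
 84 | ICH4 does not change the IOAPIC Version but extends the RTE: from ICH4 on, bits 48-55 of the RTE hold the EDID (Extended Destination ID). When interrupt messages are sent over the system bus, the EDID forms bits 4-11 of the address.
 85 | > **PS:** The original wording is "They become bits 11:4 of the address", but the Destination a LAPIC accepts is only 8 bits or 32 bits (x2APIC), and the IOAPIC Destination is only 4 bits wide (and thus can be concatenated with the EDID) in Physical Mode. The exact purpose of the EDID and its use in operating systems remain to be investigated.
 86 | 
 87 | ICH5 does not change the IOAPIC Version but removes the Arbitration ID Register and the Boot Configuration Register, marking the complete removal of APIC Bus compatibility (this was 2003, several years after the last P6 family CPU).
 88 | 
 89 | #### PCIe Age (ICH6 - ICH9)
 90 | ICH6 is the first to support PCIe, with the following change:
 91 | - The IRQ Pin Assertion Register is removed
 92 | 
 93 | Starting with ICH8 the address of the IOAPIC registers becomes configurable again: bits 4-7 (APIC Range Select, ASEL) of the OIC (Other Interrupt Control) register among the Chipset Configuration Registers control bits 12-15 of the IOAPIC address. The IOAPIC address range is then 0xFEC0x000-0xFEC0x040 (where x is configurable)
 94 | 
 95 | ### IOxAPIC Interrupt Delivery
 96 | Originally the IOAPIC delivered interrupts by sending messages on the APIC Bus, and this was still the case in ICH1. Starting with ICH2, delivery over the system bus is supported. Delivery over the system bus works the same way as MSI: specific data is written to a specific address, the CPU observes the write, interprets it as an interrupt request, and lets the LAPIC handle it. The address and data formats are in fact the same as for MSI.
 97 | 
 98 | This technique is called Front-Side Interrupt Delivery in ICH2, System Bus Interrupt Delivery in ICH3 and ICH4, and FSB Interrupt Delivery in ICH5-ICH9; all three names refer to the same thing.
 99 | 
100 | It follows that MSI in the modern sense, introduced in ICH6, is essentially the FSB Interrupt Delivery that began in ICH2. The former goes directly from the PCI device to the CPU and the latter is forwarded to the CPU by the IOAPIC, but the two look the same to the CPU.
101 | 
102 | Note that FSB Interrupt Delivery supports only ordinary interrupts, not SMI, NMI, or INIT; the Delivery Mode field must not be set to SMI, NMI, or INIT (ICH2 does not mention this rule; ICH3-ICH9 all state it). The OS can query whether the chipset supports FSB Interrupt Delivery through the ACPI tables (built by the BIOS).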
103 | 
104 | ### Configuration
105 | 
106 | The IRQ pins in ICH9 are configured as follows:
107 | 
108 | ![ich9_ioapic_1](/assets/ich9_ioapic_1.png)
109 | ![ich9_ioapic_2](/assets/ich9_ioapic_2.png)
110 | 


--------------------------------------------------------------------------------
/interrupt_and_io/PIC.md:
--------------------------------------------------------------------------------
  1 | # PIC
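  2 | The PIC is the 8259A interrupt controller. It can work with either the MCS-80/85 (8080/8085) or the 8086; the two use different operating modes, and only 8086 mode, the one modern PCs are compatible with, is considered here. The datasheet is [here](http://heim.ifi.uio.no/~inf3151/doc/8259A.pdf).
  3 | 
  4 | Each 8259A chip has 8 interrupt pins, IRQ0-IRQ7. The 8259A supports cascading with up to one master and eight slaves, but the conventional setup is one master plus one slave. The master's interrupts are then IRQ0-IRQ7 and the slave's are IRQ8-IRQ15; the slave's INT output is wired to the master's IRQ2, so 15 interrupt signals are supported in total.
  5 | > **Info:** When there was originally only one PIC chip, IRQ2 was already in use; when IBM moved to the two-PIC design, the former IRQ2 was rewired to IRQ9
  6 | 
  7 | The two basic registers of the 8259A are the IRR and the ISR, both 8 bits wide, which track the state of the interrupts currently being handled.
  8 | 
  9 | ## Basic Flow
 10 | 1. One of the pins IRQ0-IRQ7 receives an interrupt signal, and the corresponding bit in the IRR is set
 11 | 2. The 8259A signals the CPU through the INT line
 12 | 3. On receiving INT, the CPU issues an INTA signal, which is fed to the INTA input pin of the 8259A
 13 | 4. The bit in the IRR is cleared and the corresponding bit in the ISR is set
 14 | 5. In the next cycle the CPU issues another INTA signal to the INTA input pin of the 8259A, and the 8259A puts the Interrupt Vector on the data bus for the CPU
 15 | 6. Finally, in AEOI mode the bit in the ISR is cleared immediately; otherwise it is cleared only after the CPU has handled the interrupt and issued an EOI to the 8259A
 16 | 
 17 | > **Note:** If the interrupt signal has already gone away by step 4 so that the IRQ number cannot be determined, a Spurious Interrupt is generated with the same Interrupt Vector as IRQ7. Therefore, on receiving IRQ7 (or IRQ15), the handler must first check the PIC's ISR to see whether it is a Spurious Interrupt before handling it.
 18 | >
 19 | > PS. Clearly this is less advanced than the LAPIC, which gives the Spurious Interrupt its own vector.
 20 | 
 21 | ### Cascade Mode
 22 | In cascade mode, when a slave receives an interrupt it raises a signal on its INT output pin, which the master sees as an interrupt on one of its IRQ pins. The master has been configured in advance to know that this pin is a slave (and which one), so after sending INT to the CPU it selects one of the 8 slaves through the 3-bit CAS selector. The selected slave then puts the Interrupt Vector on the data bus during the second INTA cycle, completing the delivery.
 23 | 
 24 | Note that in cascade mode the CPU must issue two EOIs after the interrupt handler finishes, one to the master and one to the slave.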
 25 | 
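 26 | ## Programming Interface
 27 | The PIC is driven through port I/O: the master occupies ports 0x20 and 0x21, and the slave occupies ports 0xA0 and 0xA1. Below, the first port of each pair is called the Command port and the second the Data port (the OSDev naming). The A_0 bit in the [datasheet](http://heim.ifi.uio.no/~inf3151/doc/8259A.pdf) is the last bit of the port number: 0 is Command, 1 is Data.
 28 | 
 29 | ### Initialization
 30 | Four 8-bit words are available during initialization, ICW1-ICW4 (Initialization Command Words). Writing ICW1 starts the initialization sequence and reinitializes the PIC.
 31 | 
 32 | When the A_0 bit is 0 (a write to the Command port) and bit 4 of the written value is 1, the value is treated as ICW1 and the initialization sequence begins. ICW2-ICW4 must follow, and all of them require A_0 to be 1, i.e. they are written to the Data port.
 33 | 
 34 | ICW1 is laid out as follows:
 35 | - Bit 0 is IC4, indicating whether ICW4 is needed; 1 means it is
 36 | - Bit 1 is SNGL: 1 means Single, 0 means Cascade; ICW3 is not needed in Single mode
 37 | - :warning: Bit 2 is ADI, used only in 8080/8085 mode and ignored in 8086 mode
 38 | - Bit 3 is LTIM: 1 means Level Triggered Mode, in which edge-triggered interrupts are ignored
 39 | - Bit 4 must be 1
 40 | - :warning: Bits 5-7 are used only in 8080/8085 mode and ignored in 8086 mode
 41 | 
 42 | ICW1 must be followed immediately by ICW2, laid out as follows:
 43 | - :warning: Bits 0-2 are used only in 8080/8085 mode and ignored in 8086 mode
 44 | - Bits 3-7 hold bits 3-7 of the Interrupt Vector. That is, IRQ0-IRQ7 of a PIC are mapped to Offset+0 through Offset+7, where Offset is the value of ICW2 (the low three bits are always treated as zero and should be written as zero)
 45 | 
 46 | > **Info:** In real mode the conventional setup maps the master's IRQs to 0x08-0x0F and the slave's IRQs to 0x70-0x77, and the BIOS usually sets it up this way. After entering protected mode, 0x08-0x0F collides with the CPU's default exception range, so the IRQs should be remapped, usually to 0x20-0x2F.
 47 | 
 48 | In cascade mode ICW3 must also be written, laid out as follows:
 49 | - In slave mode, bits 0-2 hold the Slave ID (matching the value of the CAS selector); the remaining bits are reserved and should be 0
 50 | - In master mode, ICW3 is an 8-bit bitmap with one bit per slave
 51 | 
 52 | > **Info:** In non-Buffered mode, whether a PIC acts as master or slave is determined by the level on the SP/EN input pin: 1 is master, 0 is slave. In Buffered mode it is determined by the M/S bit of ICW4: 1 means master, 0 means slave.
 53 | 
 54 | If IC4 is 1, ICW4 must also be written; otherwise ICW4 is treated as all zeros. It is laid out as follows:
 55 | - Bit 0 is μPM: 0 means 8080/8085 operating mode, 1 means 8086 operating mode
 56 |     > It follows that ICW4 must be written to use the PIC on a modern machine
 57 | - Bit 1 is AEOI: 1 enables Automatic EOI mode
 58 | - Bit 2 is M/S, which determines master or slave in Buffered mode
 59 | - Bit 3 is BUF: 1 enables Buffered mode
 60 |     - In Buffered mode the SP/EN pin acts as an output that controls the enabling of the buffer. The buffer here is a buffer placed between the PIC and the data bus.
 61 | - Bit 4 is SFNM: 1 enables Special Fully Nested mode
 62 |     - This mode is used in cascaded configurations and should be enabled on the master; a slave's interrupts are then not blocked by the slave's own in-service request. In this mode, when a slave interrupt has been handled, the BIOS or OS must check whether the slave has more than one interrupt awaiting EOI. If so, only the slave needs an EOI and the master does not; otherwise the master must be sent an EOI as well
 63 | - Bits 5-7 are reserved and should all be 0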
 64 | 
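The initialization sequence can be sketched as a list of port writes for the conventional master plus slave setup. To keep the example runnable in user space it only builds the list and does not touch any port; `struct pic_write` and `pic_remap_sequence` are hypothetical names, and a kernel would issue each entry with its own port-output primitive.

```c
#include <stddef.h>
#include <stdint.h>

/* One write to an I/O port. */
struct pic_write {
    uint16_t port;
    uint8_t value;
};

/* Fill `out` with the eight writes that initialize a cascaded master
 * (ports 0x20/0x21) and slave (ports 0xA0/0xA1) in 8086 mode with the
 * given vector offsets. Returns the number of writes. */
static size_t pic_remap_sequence(uint8_t master_offset, uint8_t slave_offset,
                                 struct pic_write out[8])
{
    const struct pic_write seq[8] = {
        { 0x20, 0x11 },                            /* ICW1: bit 4 set, cascade, IC4 = 1 */
        { 0xA0, 0x11 },
        { 0x21, (uint8_t)(master_offset & 0xF8) }, /* ICW2: vector offset */
        { 0xA1, (uint8_t)(slave_offset & 0xF8) },
        { 0x21, 1u << 2 },                         /* ICW3 (master): slave on IRQ2 */
        { 0xA1, 0x02 },                            /* ICW3 (slave): Slave ID 2 */
        { 0x21, 0x01 },                            /* ICW4: 8086 mode */
        { 0xA1, 0x01 },
    };
    for (size_t i = 0; i < 8; i++)
        out[i] = seq[i];
    return 8;
}
```

Calling it with offsets 0x20 and 0x28 yields the usual protected-mode remapping to vectors 0x20-0x2F.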
 65 | ### Operation
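 66 | After initialization, three 8-bit OCWs (Operation Command Words) allow adjustments at run time.
 67 | 
 68 | When the A_0 bit is 1 (the Data port), the written value is OCW1, the Interrupt Mask: each bit corresponds to one IRQ, and 1 masks that IRQ. Reading this port returns the value of the IMR (Interrupt Mask Register).
 69 | 
 70 | When the A_0 bit is 0 (the Command port), bits 3 and 4 of the written value determine its meaning. If bit 4 is 1, it is ICW1; if bit 4 is 0, then bit 3 = 0 means OCW2 and bit 3 = 1 means OCW3.
 71 | 
 72 | #### OCW2
 73 | Before describing OCW2, the following concepts are introduced:
 74 | ##### End of Interrupt
 75 | - EOI Command: there are two kinds, Specific and Non-Specific. A Non-Specific EOI automatically clears the highest-priority bit in the ISR [**the datasheet says the highest bit, which looks like a slip for the highest-priority bit**]; a Specific EOI clears the bit in the ISR that is explicitly named
 76 | - AEOI Mode: if the AEOI bit of ICW4 is 1, Automatic EOI mode is entered, and a Non-Specific EOI is performed automatically as soon as the interrupt has been accepted (at the end of the INTA cycles)
 77 | 
 78 | ##### Interrupt Priority
 79 | - Fully Nested Mode: the default state after PIC initialization. Priority decreases in order from IRQ0 to IRQ7, and a higher-priority interrupt can preempt a lower-priority one
 80 |     > **Info:** Once a higher-priority interrupt is in the ISR, lower-priority interrupts do not enter the IRR, whereas the LAPIC can still store one lower-priority interrupt in its IRR
 81 | - Rotation Mode: priority rotation. One rotation makes a given IRQ the lowest priority and rotates the others accordingly (e.g. if IRQ4 becomes the lowest priority 7, IRQ3 gets priority 6, and so on, until IRQ5 becomes the highest priority 0)
 82 |     - Automatic Rotation: an option attached to an EOI that makes the IRQ whose bit the EOI clears the lowest priority and rotates the remaining IRQs accordingly
 83 |     - Specific Rotation: a manual rotation that names an IRQ to become the lowest priority and rotates the remaining IRQs accordingly (the operation is independent of EOI but can be attached to a Specific EOI)
 84 | 
 85 | ##### Operations of OCW2
 86 | OCW2 is laid out as follows:
 87 | - Bits 0-2 are used together with the SL bit to name an IRQ
 88 | - Bits 3-4 must be 0
 89 | - Bit 5 is EOI: 1 means this is an EOI Command, which clears one bit in the ISR
 90 | - Bit 6 is SL (Specific Level): 1 means the operation applies to a named IRQ (Specific EOI or Specific Rotation)
 91 | - Bit 7 is R (Rotation), which decides whether priority rotation takes place
 92 | 
 93 | The combinations of bits 5-7 of OCW2 are:
 94 | - EOI=1, SL=0, R=0: Non-Specific EOI Command; clears the highest-priority bit in the ISR and leaves priorities unchanged
 95 | - EOI=1, SL=1, R=0: Specific EOI Command; bits 0-2 of OCW2 name the ISR bit to clear
 96 | - EOI=1, SL=0, R=1: Non-Specific EOI with Automatic Rotation; clears the highest-priority bit in the ISR and rotates priorities
 97 | - EOI=0, SL=0, R=1: enables Automatic Rotation in AEOI mode, so that the Non-Specific EOI performed automatically under AEOI also rotates priorities
 98 | - EOI=0, SL=0, R=0: disables Automatic Rotation in AEOI mode
 99 | - EOI=0, SL=1, R=1: Specific Rotation; bits 0-2 of OCW2 name the IRQ that becomes the lowest priority
100 | - EOI=1, SL=1, R=1: Specific EOI with Specific Rotation; bits 0-2 of OCW2 name both the ISR bit to clear and the IRQ that becomes the lowest priority
101 | - EOI=0, SL=1, R=0: no operation; nothing happens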
102 | 
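A minimal sketch of building an OCW2 byte from the three flag bits and the IRQ level follows. The helper name `ocw2` is made up for this example; the resulting byte would be written to the Command port (0x20 for the master, 0xA0 for the slave).

```c
#include <stdint.h>

/* Compose OCW2: R in bit 7, SL in bit 6, EOI in bit 5, bits 3-4 zero,
 * IRQ level in bits 0-2 (only meaningful when SL is set). */
static uint8_t ocw2(int r, int sl, int eoi, unsigned level)
{
    return (uint8_t)(((r != 0) << 7) | ((sl != 0) << 6) |
                     ((eoi != 0) << 5) | (level & 0x7));
}
```

The familiar Non-Specific EOI value 0x20 is simply `ocw2(0, 0, 1, 0)`, and a Specific EOI for IRQ3 is `ocw2(0, 1, 1, 3)`, i.e. 0x63.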
103 | #### OCW3
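104 | OCW3 is laid out as follows:
105 | - Bit 0 is RIS (Read ISR): 1 selects the ISR to be read from the Command port, 0 selects the IRR
106 | - Bit 1 is RR (Read Register): 1 makes the RIS bit take effect, 0 causes the RIS bit to be ignored
107 | - Bit 2 is P: 1 means this is a Poll Command, 0 has no effect
108 | - Bit 3 must be 1 and bit 4 must be 0
109 | - Bit 5 is SMM (Special Mask Mode): 1 enables Special Mask mode, 0 disables it
110 | - Bit 6 is ESMM (Enable Special Mask Mode): 1 makes the SMM bit take effect, 0 causes the SMM bit to be ignored
111 | - Bit 7 is reserved and should be 0
112 | 
113 | OCW3 actually combines three functions (which can be used at the same time). The first is reading the value of the IRR or ISR from the Command port (A_0 = 0); the RR and RIS bits select which one is read, and the default after PIC initialization is the IRR.
114 | 
115 | The second is Poll mode. After a Poll Command has been issued through OCW3, the next read of the Command port (A_0 = 0) acts as an interrupt accept. If an interrupt is pending at that moment, its bit in the ISR is set and the value read has its highest bit set to 1 and the IRQ number in its lowest 3 bits. Otherwise the highest bit of the value read is 0, indicating that no interrupt occurred.
116 | 
117 | The third is Special Mask mode. In this mode, if a bit is masked by the IMR, the following applies:
118 | - Even if that bit has the highest priority in the ISR, it is not cleared by a Non-Specific EOI
119 | - The IRQs of the other, unmasked bits can still be accepted even if their priority is lower than that bit's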
120 | 
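The register-select and poll functions of OCW3 can be sketched as below. The function names are illustrative. The RIS encoding follows the 8259A datasheet, where RR = 1 with RIS = 1 selects the ISR and RR = 1 with RIS = 0 selects the IRR.

```c
#include <stdint.h>

/* OCW3 that selects which register the next Command-port read returns:
 * bit 3 is always 1, RR (bit 1) is set, RIS (bit 0) picks ISR or IRR. */
static uint8_t ocw3_read_select(int read_isr)
{
    return (uint8_t)(0x08 | 0x02 | (read_isr ? 0x01 : 0x00));
}

/* OCW3 that issues a Poll Command (P, bit 2). */
static uint8_t ocw3_poll(void)
{
    return 0x08 | 0x04;
}

/* Decode the byte read after a Poll Command: returns the IRQ number if
 * bit 7 is set, or -1 if no interrupt was pending. */
static int poll_decode(uint8_t value)
{
    return (value & 0x80) ? (int)(value & 0x07) : -1;
}
```

Writing 0x0B to the Command port and then reading it back is the usual way to fetch the ISR, for example when checking for a spurious IRQ7.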
121 | ## PIC in ICH9
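122 | As chip integration increased, the 8259 was integrated into the southbridge. The southbridge datasheets of each generation show which devices its IRQ pins are connected to and what small changes it has undergone. This section covers the changes to and configuration of the 8259 in the ICH9 southbridge of the Q35 chipset emulated by QEMU.
123 | 
124 | ### Changes
125 | In the ISA bus era, bit 3 of ICW1 decided whether interrupts were Level Triggered; 1 meant that all interrupts of the PIC were Level Triggered. In the PCI bus era (starting with the PIIX chipset), two 8-bit ELCR (Edge/Level Control Register) registers (ELCR0, ELCR1) were added at ports 0x4D0 and 0x4D1, with each bit controlling whether one IRQ is Level Triggered.
126 | 
127 | ### Configuration
128 | The slave's ID is set to 010b, and the IRQ pins of the master and slave are set up as follows:
129 | 
130 | ![ich9_8259](/assets/ich9_8259.png)
131 | 
132 | IRQ0, IRQ1, IRQ2, IRQ8, and IRQ13 must be Edge Triggered, i.e. their bits in the ELCR must be 0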
133 | 


--------------------------------------------------------------------------------
/interrupt_and_io/linux_interrupt.md:
--------------------------------------------------------------------------------
 1 | # Linux Interrupt
 2 | ## Facts of Hardware
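 3 | On x86 CPUs, interrupts are dispatched through IDT entries. If the handler is entered through an Interrupt Gate, the IF (Interrupt Flag) in EFLAGS is cleared (as if `CLI` had been executed), disabling interrupts until `IRET` returns from the handler and restores IF. A Trap Gate leaves IF untouched.
 4 | 
 5 | ## Linux
 6 | In Linux, interrupts are re-enabled immediately after entering an interrupt handler, but the interrupt being handled is itself masked on all CPUs so that it cannot re-enter. Interrupt handlers in Linux therefore do not need to be reentrant.
 7 | 
 8 | ### SMP Affinity
 9 | `/proc/interrupts` shows how many times each interrupt has been handled by each CPU. `/proc/irq/IRQ#/smp_affinity` is a bitmap that says which CPUs IRQ number IRQ# should be sent to, with one bit per CPU. If more than one CPU is assigned to an irq, lowest priority mode is used and the hardware picks the lowest-priority CPU in the group as the destination of the interrupt. The priority can normally be changed by setting the TPR register of the LAPIC.
10 | 
11 | SMP affinity can also be queried and set through `/proc/irq/IRQ#/smp_affinity_list`, which is a CPU list of the form `2`, `1-7`, and so on. The SMP affinity is 32, 64, or 128 bits wide and usually defaults to all ones (every CPU can receive the interrupt); the default can be queried through `/proc/irq/default_smp_affinity`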
12 | 
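As a small illustration of the bitmap format, the sketch below turns a hexadecimal mask such as the contents of `smp_affinity` into a list of CPU numbers. `affinity_cpus` is a hypothetical helper, and it assumes a mask of at most 64 bits written as a single hex string without the comma separators that wider masks use.

```c
#include <stdint.h>
#include <stdlib.h>

/* Parse a hex affinity mask and store the selected CPU numbers in
 * `cpus` in ascending order. Returns the number of CPUs selected. */
static int affinity_cpus(const char *hex_mask, int cpus[64])
{
    uint64_t mask = strtoull(hex_mask, NULL, 16);
    int count = 0;

    for (int cpu = 0; cpu < 64; cpu++)
        if (mask & (1ULL << cpu))
            cpus[count++] = cpu;
    return count;
}
```

A mask of `fe` selects CPUs 1 through 7, which corresponds to the list form `1-7`.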
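13 | In practice each IRQ is usually bound to a single core and not to a group of cores, because handling the same interrupt on the same core gives better locality. Lowest priority mode is also not perfect: it only guarantees that arbitration always picks some CPU from the group, not that load is balanced across the group. For example, when all CPUs in the group have the same priority, the same CPU may be picked every time.
14 | 
15 | Unfortunately, Linux does not adjust the CPU's TPR: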
16 | ```c
17 | // arch/x86/kernel/apic/apic.c
18 | // The snippet below shows that Linux only sets the TPR to 0 and never changes it afterwards
19 | 
20 | /*
21 |  * Set Task Priority to 'accept all'. We never change this
22 |  * later on.
23 |  */
24 | value = apic_read(APIC_TASKPRI);
25 | value &= ~APIC_TPRI_MASK;
26 | apic_write(APIC_TASKPRI, value);
27 | ```
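28 | As a result, when the smp affinity of an IRQ covers more than one CPU, often only one CPU ends up handling the interrupt.
29 | 
30 | ---
31 | 
32 | The `irqbalance` tool developed by Red Hat is commonly used to manage smp affinity. It analyzes the system load and updates the smp affinity at regular intervals (every 10 seconds by default). It binds each IRQ to one CPU and changes the binding dynamically according to load so that IRQs are spread evenly across CPUs. It also has a power save mode: when it detects a low interrupt load, it binds all interrupts to a single CPU to reduce the work of the other CPUs and let them enter sleep states.
33 | 
34 | ---
35 | 
36 | In the Linux kernel, setting the smp affinity is implemented in `kernel/irq/manage.c`; the key function is `setup_affinity(irq_desc, cpumask)`. It calls `irq_do_set_affinity`, which in turn calls the `irq_set_affinity` method of `struct irq_chip` to reach the hardware-specific affinity code.
37 | 
38 | In the end, the `smp_affinity` bitmap is passed as the `cpumask` argument to the `cpu_mask_to_apicid_and` method of `struct apic`, which produces the APIC Destination field to be written into the hardware register. Each way of configuring the APIC corresponds to one `struct apic` object and therefore to one implementation of `cpu_mask_to_apicid_and`.
39 | 
40 | ### APIC initialization
41 | 
42 | At boot Linux selects an APIC driver (a `struct apic` object). On 32-bit this is done by `generic_apic_probe` in `arch/x86/kernel/apic/probe_32.c`, which calls the `probe` method of each APIC driver object to ask whether it is usable. On 64-bit it is done by `default_acpi_madt_oem_check` in `arch/x86/kernel/apic/probe_64.c`, which calls the `acpi_madt_oem_check` method of each APIC driver object and selects a driver based on the ACPI tables. The APIC drivers are checked in the build order given in the Makefile.
43 | 
44 | The build order shows that the system prefers x2APIC, and within both x2APIC and xAPIC mode the default is Logical + Lowest Priority, unless the system does not support it and has to fall back to Physical + Fixed.
45 | 
46 | > **Corollary:** In Physical + Fixed mode, the `cpu_mask_to_apicid_and` method picks the first online CPU found in `cpumask` as the APIC Destination. If the APIC is configured in this mode, `smp_affinity` therefore degenerates into always selecting one fixed CPU from the group.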
47 | 


--------------------------------------------------------------------------------
/memory.md:
--------------------------------------------------------------------------------
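  1 | ## Basics
  2 | 
  3 | ### Page Tables
  4 | Page tables translate a VA into a PA. A VA consists of a page number and an in-page offset. During translation, the start address of the page table is read from the page table base register (CR3), and the page number is added to locate the page table entry. The physical address of the page is taken from that entry, and the offset is added to obtain the PA.
  5 | 
  6 | As the addressable range grew (64-bit CPUs support a 48-bit virtual address space and a 52-bit physical address space), page tables needed more and more contiguous memory, and because every process has its own page table, maintaining page tables alone would consume a large amount of memory. Multi-level page tables were introduced to address this, exploiting the locality of programs' memory use.
  7 | 
  8 | Current versions of Linux use four-level page tables:
  9 | 
 10 | Page Map Level 4(PML4) => Page Directory Pointer Table(PDPT) => Page Directory(PD) => Page Table(PT)
 11 | 
 12 | In some places these are called: Page Global Directory(PGD) => Page Upper Directory(PUD) => Page Middle Directory(PMD) => Page Table(PT)
 13 | 
 14 | On x86_64 a regular page is 4KB. Since addresses are 64 bits, a page table entry takes 8 bytes, so one page table holds only 512 entries. Each level of page table index therefore uses 9 bits, and together with the 12-bit in-page offset only bits 0-47 of a 64-bit address are used.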
 15 | 
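The 9 + 9 + 9 + 9 + 12 bit split described above can be shown with a short C sketch. `struct va_parts` and `split_va` are names chosen for this example; the code only slices the address and does not walk any real page table.

```c
#include <stdint.h>

/* The five fields of a 48-bit x86_64 virtual address with 4KB pages. */
struct va_parts {
    unsigned pml4;    /* bits 39-47 */
    unsigned pdpt;    /* bits 30-38 */
    unsigned pd;      /* bits 21-29 */
    unsigned pt;      /* bits 12-20 */
    unsigned offset;  /* bits 0-11 */
};

static struct va_parts split_va(uint64_t va)
{
    struct va_parts p;
    p.pml4 = (unsigned)((va >> 39) & 0x1FF);
    p.pdpt = (unsigned)((va >> 30) & 0x1FF);
    p.pd = (unsigned)((va >> 21) & 0x1FF);
    p.pt = (unsigned)((va >> 12) & 0x1FF);
    p.offset = (unsigned)(va & 0xFFF);
    return p;
}
```

Each 9-bit index selects one of the 512 entries in the table at that level, and the 12-bit offset selects a byte within the 4KB page.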
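 16 | In 64-bit mode EPT uses the same structure as conventional page tables, so without a TLB one GVA to HPA translation takes 4 * 4 page table lookups (assuming the access to every level faults).
 17 | 
 18 | Every lookup is a memory access, and repeatedly accessing memory during the walk clearly hurts performance. The TLB (Translation Lookaside Buffer) was introduced to cache frequently used PTEs, so that on a TLB hit no memory lookup is needed. Because programs use memory with locality, the TLB hit rate is usually high, which improves access speed under multi-level page tables.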
 19 | 
 20 | 
 21 | 
 22 | 
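 23 | ### Memory Virtualization
 24 | QEMU uses the mmap system call to reserve a contiguous range in the process's virtual address space to serve as the guest's physical memory.
 25 | 
 26 | Under this architecture a memory access goes through four layers of mapping:
 27 | 
 28 | GVA - GPA - HVA - HPA
 29 | 
 30 | The GVA - GPA mapping is maintained by the guest OS, and the HVA - HPA mapping by the host OS. A mechanism is therefore needed to maintain the GPA - HVA mapping. The common implementations are SPT (Shadow Page Table) and EPT/NPT: the former maintains shadow page tables in software, and the latter uses a hardware feature to implement two-level mapping.
 31 | 
 32 | 
 33 | ### Shadow Page Tables
 34 | KVM maintains the SPT, a page table from GVA to HPA, which gives a direct mapping that the physical MMU can use for translation. It works as follows:
 35 | 
 36 | KVM marks the guest OS's page tables read-only. When the guest OS modifies them, a page fault occurs and causes a VMEXIT to KVM. KVM checks the access permissions of the page table entry for the GVA and uses the error code to decide:
 37 | 
 38 | 1. If the fault was caused by the guest OS itself, the exception is injected back. The guest OS runs its own page fault handler (allocates a page and fills the page's GPA into the parent page table entry)
 39 | 2. If the fault was caused by the guest OS page table and the SPT being out of sync, the SPT is synchronized: the GVA to GPA to HVA mapping is found from the guest OS page table and the mmap mapping, and the GVA - HPA entry is added to or updated in the SPT
 40 | 
 41 | When the guest OS switches processes, it loads the page table base of the incoming process into the guest's CR3, which causes a VM EXIT back to KVM. KVM finds the corresponding SPT through a hash table and loads it into the machine's CR3.
 42 | 
 43 | Drawbacks: an SPT must be maintained for every process, which costs extra memory. The guest OS page tables and the SPT must be kept in sync. Every guest page fault causes a VMExit, even one caused by the guest's own missing page, which is expensive.
 44 | 
 45 | 
 46 | ### EPT / NPT
 47 | Intel's EPT technology introduces the EPT (Extended Page Table) and the EPTP (EPT base pointer). The EPT holds the mapping from GPA to HPA, and the EPT base pointer points to the EPT. While the guest OS runs, the address of the VM's EPT is loaded into the EPTP and the page table base of the guest's current process is loaded into CR3. Address translation first goes from GVA to GPA through the page table that CR3 points to, and then from GPA to HPA through the EPT that the EPTP points to.
 48 | 
 49 | When an EPT page fault occurs, a VMExit to KVM is needed to update the EPT.
 50 | 
 51 | AMD NPT (Nested Page Table) is AMD's solution. Its principle is similar to EPT, but the description and implementation differ slightly. The guest OS and the host each have their own CR3. During address translation, the page table that gCR3 points to maps GVA to GPA, and the page table that nCR3 points to maps GPA to HPA.
 52 | 
 53 | Advantages: guest page faults are handled inside the guest without a vm exit. Address translation is done almost entirely by hardware (the MMU) walking the page tables.
 54 | 
 55 | Drawbacks: two levels of page table walks, so performance depends on TLB hits.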
 56 | 
 57 | 
 58 | 
 59 | 
 60 | ## Implementation
 61 | 
 62 | ### QEMU
 63 | 
 64 | #### Memory Device Emulation
 65 | 
 66 | ##### PCDIMMDevice
 67 | 
 68 | ```c
 69 | typedef struct PCDIMMDevice {
 70 |     /* private */
 71 |     DeviceState parent_obj;
 72 | 
 73 |     /* public */
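 74 |     uint64_t addr;                  // starting GPA the device is mapped to
 75 |     uint32_t node;                  // NUMA node the device is mapped to
 76 |     int32_t slot;                   // memory slot the device is plugged into; defaults to -1 (auto-assign)
 77 |     HostMemoryBackend *hostmem;     // the corresponding backend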
 78 | } PCDIMMDevice;
 79 | ```
 80 | 
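 81 | A virtual memory module (DIMM) defined through QOM (QEMU Object Model). It can be managed through QMP or the QEMU command line, and adding or removing this object implements memory hotplug in the VM.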
 82 | 
 83 | 
 84 | ##### HostMemoryBackend
 85 | 
 86 | ```c
 87 | struct HostMemoryBackend {
 88 |     /* private */
 89 |     Object parent;
 90 | 
 91 |     /* protected */
 92 |     uint64_t size;                                  // amount of memory provided
 93 |     bool merge, dump;
 94 |     bool prealloc, force_prealloc, is_mapped;
 95 |     DECLARE_BITMAP(host_nodes, MAX_NODES + 1);
 96 |     HostMemPolicy policy;
 97 | 
 98 |     MemoryRegion mr;                                // the MemoryRegion it owns
 99 | };
100 | ```
101 | 
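102 | A region of host memory defined through QOM that backs a virtual memory module. It can be managed through QMP or the QEMU command line.
103 | 
104 | 
105 | #### Memory Initialization
106 | 
107 | With KVM enabled, QEMU initializes memory through the following flow: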
108 | 
109 | 
110 | ```
111 | main => configure_accelerator => kvm_init => kvm_memory_listener_register(s, &s->memory_listener, &address_space_memory, 0) initializes
112 | kvm_state.memory_listener
113 |                                           => kml->listener.region_add = kvm_region_add                      sets the listener's callbacks
114 |                                           => memory_listener_register                                       initializes the listener and binds it to address_space_memory
115 |                                           => memory_listener_register(&kvm_io_listener, &address_space_io)  initializes kvm_io_listener and binds it to address_space_io
116 |      => cpu_exec_init_all => memory_map_init                                        creates the two global MemoryRegions system_memory("system") and system_io("io")
117 |                                  => address_space_init                              initializes the AddressSpaces address_space_memory("memory") and address_space_io("I/O"), with system_memory and system_io as their roots
118 |                                     => memory_region_transaction_commit             commits the changes, which updates the address space
119 | ```
120 | 
121 | Before analyzing further, we first introduce the three structures involved: AddressSpace, MemoryRegion and MemoryRegionSection:
122 | 
123 | #### AddressSpace
124 | 
125 | ```c
126 | struct AddressSpace {
127 |     /* All fields are private. */
128 |     struct rcu_head rcu;
129 |     char *name;
130 |     MemoryRegion *root;
131 |     int ref_count;
132 |     bool malloced;
133 | 
134 |     /* Accessed via RCU.  */
135 |     struct FlatView *current_map;                               // the FlatView currently maintained; used as the old view for comparison in address_space_update_topology
136 | 
137 |     int ioeventfd_nb;
138 |     struct MemoryRegionIoeventfd *ioeventfds;
139 |     struct AddressSpaceDispatch *dispatch;                      // looks up the HVA for a given GPA
140 |     struct AddressSpaceDispatch *next_dispatch;
141 |     MemoryListener dispatch_listener;
142 |     QTAILQ_HEAD(memory_listeners_as, MemoryListener) listeners;
143 |     QTAILQ_ENTRY(AddressSpace) address_spaces_link;
144 | };
145 | ```
146 | 
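147 | As the name suggests, it represents one address space of the VM, such as the memory address space or the I/O address space. Each AddressSpace generally contains a set of MemoryRegions: the root of the AddressSpace points to the root MemoryRegion, which may in turn have several subregions, forming a tree.
148 | 
149 | As mentioned above, the memory initialization flow calls memory_map_init, which initializes address_space_memory and address_space_io, where:
150 | 
151 | * the root of address_space_memory is system_memory
152 | * the root of address_space_io is system_io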
153 | 
154 | 
155 | 
156 | #### MemoryRegion
157 | 
158 | ```c
159 | struct MemoryRegion {
160 |     Object parent_obj;                                                  // inherits from Object
161 | 
162 |     /* All fields are private - violators will be prosecuted */
163 | 
164 |     /* The following fields should fit in a cache line */
165 |     bool romd_mode;
166 |     bool ram;
167 |     bool subpage;
168 |     bool readonly; /* For RAM regions */
169 |     bool rom_device;                                                    // whether this is a ROM device region
170 |     bool flush_coalesced_mmio;
171 |     bool global_locking;
172 |     uint8_t dirty_log_mask;                                             // dirty map types
173 |     RAMBlock *ram_block;                                                // the corresponding RAMBlock
174 |     Object *owner;
175 |     const MemoryRegionIOMMUOps *iommu_ops;
176 | 
177 |     const MemoryRegionOps *ops;
178 |     void *opaque;
179 |     MemoryRegion *container;                                            // the parent MemoryRegion
180 |     Int128 size;                                                        // size of the memory region
181 |     hwaddr addr;                                                        // offset within the parent MemoryRegion (see memory_region_add_subregion_common)
182 |     void (*destructor)(MemoryRegion *mr);
183 |     uint64_t align;
184 |     bool terminates;
185 |     bool ram_device;
186 |     bool enabled;
187 |     bool warning_printed; /* For reservations */
188 |     uint8_t vga_logging_count;
189 |     MemoryRegion *alias;                                                // the real MemoryRegion this alias refers to
190 |     hwaddr alias_offset;                                                // offset of the start address (GPA) within the real MemoryRegion
191 |     int32_t priority;
192 |     QTAILQ_HEAD(subregions, MemoryRegion) subregions;                   // list of subregions
193 |     QTAILQ_ENTRY(MemoryRegion) subregions_link;
194 |     QTAILQ_HEAD(coalesced_ranges, CoalescedMemoryRange) coalesced;
195 |     const char *name;
196 |     unsigned ioeventfd_nb;
197 |     MemoryRegionIoeventfd *ioeventfds;
198 |     QLIST_HEAD(, IOMMUNotifier) iommu_notify;
199 |     IOMMUNotifierFlag iommu_notify_flags;
200 | };
201 | ```
202 | 
203 | A MemoryRegion represents a range of memory in the Guest memory layout and carries logical (Guest-side) meaning.
204 | 
205 | The corresponding MemoryRegions are set up while the VM is initialized:
206 | 
207 | ```
208 | pc_init1 / pc_q35_init => pc_memory_init => memory_region_allocate_system_memory                        initialize a MemoryRegion and allocate memory for it
209 |                                          => memory_region_init_alias => memory_region_init              initialize an alias MemoryRegion
210 |                                          => memory_region_init                                          initialize a MemoryRegion
211 |                                          => memory_region_init_ram => memory_region_init                initialize a MemoryRegion and allocate its RAMBlock
212 | ```
213 | 
214 | 
215 | ##### memory_region_allocate_system_memory
216 | 
217 | For a non-NUMA VM, the memory is allocated directly:
218 | 
219 | ```
220 | => allocate_system_memory_nonnuma => memory_region_init_ram_from_file / memory_region_init_ram          allocate the memory of the RAMBlock backing the MemoryRegion
221 | => vmstate_register_ram                                                                                 set the RAMBlock's idstr from the region's name
222 | ```
223 | 
224 | For NUMA, the HostMemoryBackends must be set up after allocation:
225 | 
226 | ```
227 | => memory_region_init
228 | => memory_region_add_subregion                          iterate over the HostMemoryBackends of all NUMA nodes and add each one whose mr member is not empty as a subregion of the current MemoryRegion, at offsets increasing from 0
229 | => vmstate_register_ram_global => vmstate_register_ram  set the RAMBlock's idstr from the region's name
230 | ```
231 | 
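232 | ##### MemoryRegion types
233 | 
234 | MemoryRegions can be divided into the following three types:
235 | 
236 | * Root MemoryRegion: initialized directly with memory_region_init. It has no memory of its own and is used to manage subregions. Example: system_memory
237 | * Real MemoryRegion: initialized with memory_region_init_ram. It has its own memory (allocated from the QEMU process address space) of size `size`. Examples: ram_memory(pc.ram), option_rom_mr(pc.rom)
238 | * Alias MemoryRegion: initialized with memory_region_init_alias. It has no memory of its own and represents part of a real MemoryRegion (such as pc.ram). Its alias member points to the real MemoryRegion, and alias_offset is the offset within that region. Examples: ram_below_4g, ram_above_4g
239 | 
240 | The MemoryRegion relationships commonly seen in the code are: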
241 | 
242 | ```
243 |                   alias
244 | ram_memory (pc.ram) - ram_below_4g(ram-below-4g)
245 |                     - ram_above_4g(ram-above-4g)
246 | 
247 |              alias
248 | system_io(io) - (pci0-io)
249 |               - (isa_mmio)
250 |               - (isa-io)
251 |               - ...
252 | 
253 |                      sub
254 | system_memory(system) - ram_below_4g(ram-below-4g)
255 |                       - ram_above_4g(ram-above-4g)
256 |                       - pcms->hotplug_memory.mr        hotplugged memory
257 | 
258 |           sub
259 | rom_memory - isa_bios(isa-bios)
260 |            - option_rom_mr(pc.rom)
261 | 
262 | ```
263 | 
264 | At the same time, the AddressSpace is mapped to a FlatView, which yields a number of MemoryRegionSections. kvm_region_add is called to register each MemoryRegionSection with KVM.
265 | 
266 | 
267 | ##### MemoryRegionSection
268 | 
269 | ```c
270 | struct MemoryRegionSection {
271 |     MemoryRegion *mr;                           // the MemoryRegion this section belongs to
272 |     AddressSpace *address_space;                // the AddressSpace this section belongs to
273 |     hwaddr offset_within_region;                // offset of the section's start within the MemoryRegion (used to compute the HVA)
274 |     Int128 size;
275 |     hwaddr offset_within_address_space;         // offset within the AddressSpace; for the system memory AddressSpace this is the starting GPA
276 |     bool readonly;
277 | };
278 | ```
279 | 
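280 | A MemoryRegionSection refers to part of a MemoryRegion ([offset_within_region, offset_within_region + size)) and is the basic unit registered with KVM.
281 | 
282 | When the MemoryRegions in an AddressSpace are mapped onto a linear address space, overlaps may cut an originally whole region into fragments, which is where MemoryRegionSections come from.
283 | 
284 | Looking back at the memory initialization flow, the work is simple: create some AddressSpaces and bind listeners to them; create the corresponding MemoryRegions as the roots of the AddressSpaces; finally commit the changes so that the address-space changes are propagated to KVM. Each point is described below.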
285 | 
286 | 
287 | 
288 | ##### KVMMemoryListener
289 | 
290 | During initialization we registered memory_listener for address_space_memory and kvm_io_listener for address_space_io. The former is of type KVMMemoryListener, the latter of type MemoryListener:
291 | 
292 | ```c
293 | typedef struct KVMMemoryListener {
294 |     MemoryListener listener;
295 |     KVMSlot *slots;
296 |     int as_id;
297 | } KVMMemoryListener;
298 | 
299 | struct MemoryListener {
    |     void (*begin)(MemoryListener *listener);
300 |     void (*commit)(MemoryListener *listener);
301 |     void (*region_add)(MemoryListener *listener, MemoryRegionSection *section);
302 |     void (*region_del)(MemoryListener *listener, MemoryRegionSection *section);
303 |     void (*region_nop)(MemoryListener *listener, MemoryRegionSection *section);
304 |     void (*log_start)(MemoryListener *listener, MemoryRegionSection *section,
305 |                       int old, int new);
306 |     void (*log_stop)(MemoryListener *listener, MemoryRegionSection *section,
307 |                      int old, int new);
308 |     void (*log_sync)(MemoryListener *listener, MemoryRegionSection *section);
309 |     void (*log_global_start)(MemoryListener *listener);
310 |     void (*log_global_stop)(MemoryListener *listener);
311 |     void (*eventfd_add)(MemoryListener *listener, MemoryRegionSection *section,
312 |                         bool match_data, uint64_t data, EventNotifier *e);
313 |     void (*eventfd_del)(MemoryListener *listener, MemoryRegionSection *section,
314 |                         bool match_data, uint64_t data, EventNotifier *e);
315 |     void (*coalesced_mmio_add)(MemoryListener *listener, MemoryRegionSection *section,
316 |                                hwaddr addr, hwaddr len);
317 |     void (*coalesced_mmio_del)(MemoryListener *listener, MemoryRegionSection *section,
318 |                                hwaddr addr, hwaddr len);
319 |     /* Lower = earlier (during add), later (during del) */
320 |     unsigned priority;
321 |     AddressSpace *address_space;
322 |     QTAILQ_ENTRY(MemoryListener) link;
323 |     QTAILQ_ENTRY(MemoryListener) link_as;
324 | };
325 | ```
326 | 
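327 | As shown, the body of KVMMemoryListener is a MemoryListener, and MemoryListener contains many function pointers: the callbacks invoked when its address_space member changes.
328 | 
329 | address_space_io has both kvm_io_listener and dispatch_listener bound to it. An AddressSpace and its listeners are therefore in a one-to-many relationship: when the AddressSpace changes, all listeners bound to it are triggered. How is this implemented?
330 | 
331 | In practice, every operation on an AddressSpace or MemoryRegion starts with memory_region_transaction_begin and ends with memory_region_transaction_commit.
332 | 
333 | These operations include enabling, finalizing, adding or removing an eventfd, adding or removing a subregion, changing attributes (flags), setting the size, enabling dirty logging, and so on. For example: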
334 | 
335 | * memory_region_add_subregion
336 | * memory_region_del_subregion
337 | * memory_region_set_readonly
338 | * memory_region_set_enabled
339 | * memory_region_set_size
340 | * memory_region_set_address
341 | * memory_region_set_alias_offset
342 | * memory_region_readd_subregion
343 | * memory_region_update_container_subregions
344 | * memory_region_set_log
345 | * memory_region_finalize
346 | * ...
347 | 
348 | Operations on the root MemoryRegion of an AddressSpace:
349 | 
350 | * address_space_init
351 | * address_space_destroy
352 | 
353 | ##### memory_region_transaction_begin
354 | 
355 | ```
356 | => qemu_flush_coalesced_mmio_buffer => kvm_flush_coalesced_mmio_buffer
357 | => ++memory_region_transaction_depth
358 | ```
359 | 
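360 | KVM batches certain MMIO accesses: when the guest VM exits to KVM on such an MMIO access, KVM records the operation in a kvm_coalesced_mmio structure and pushes it onto kvm_coalesced_mmio_ring instead of exiting to QEMU. The next time control returns to QEMU, just before the memory space is updated, the kvm_coalesced_mmio entries in kvm_coalesced_mmio_ring are taken out and replayed to keep memory consistent. This is the job of kvm_flush_coalesced_mmio_buffer.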
361 | 
362 | 
363 | ##### memory_region_transaction_commit
364 | 
365 | ```
366 | => --memory_region_transaction_depth
367 | => if memory_region_transaction_depth is 0 and memory_region_update_pending is set
368 |     => MEMORY_LISTENER_CALL_GLOBAL(begin, Forward)        call the begin function of every listener in the global list memory_listeners, front to back
369 |     => for every address space in address_spaces, call address_space_update_topology to update the slot information maintained in QEMU and KVM
370 |     => MEMORY_LISTENER_CALL_GLOBAL(commit, Forward)       call the commit function of every listener in the global list memory_listeners, front to back
371 | ```
372 | 
373 | The listeners' callbacks are invoked to apply the update to the address space.
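The nesting behaviour of begin / commit can be illustrated with a small counter-based sketch. It is not QEMU's code: the names below are hypothetical stand-ins, and the expensive topology update is reduced to a counter so that the "only the outermost commit does the work" rule is visible.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for QEMU's static transaction state. */
static unsigned transaction_depth;
static bool update_pending;
static unsigned topology_updates;   /* how often the expensive update ran */

void transaction_begin(void)
{
    ++transaction_depth;
}

/* Called by any modifying operation between begin and commit. */
void mark_update_pending(void)
{
    update_pending = true;
}

void transaction_commit(void)
{
    --transaction_depth;
    /* Only the outermost commit with pending changes rebuilds the topology. */
    if (transaction_depth == 0 && update_pending) {
        update_pending = false;
        ++topology_updates;   /* stands in for begin / update_topology / commit */
    }
}

unsigned topology_update_count(void)
{
    return topology_updates;
}
```

Nested transactions therefore batch any number of region changes into a single topology update.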
374 | 
375 | ##### address_space_update_topology
376 | 
377 | ```
378 | => address_space_get_flatview                             get the old FlatView (AddressSpace.current_map)
379 | => generate_memory_topology                               generate the new FlatView
380 | => address_space_update_topology_pass                     compare the old and new FlatViews and act on the FlatRanges that differ
381 | ```
382 | 
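383 | Because the MemoryRegions of an AddressSpace form a tree, address_space_update_topology uses the FlatView model to map (flatten) the tree onto a linear address space. It then compares the old and new FlatViews and acts on the FlatRanges that differ, which ultimately updates KVM.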
384 | 
385 | ##### generate_memory_topology
386 | 
387 | ```
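388 | => addrrange_make                   create an address range starting at 0 and ending at 2^64 as the guest's linear address space
389 | => render_memory_region             starting from the root region, recursively map regions onto the linear address space, producing the FlatRanges that make up the FlatView
390 | => flatview_simplify                merge contiguous FlatRanges in the FlatView into one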
391 | ```
392 | 
393 | The root member of an AddressSpace is the root MemoryRegion of that address space. generate_memory_topology flattens its tree so that it can be mapped onto a linear address space, producing the FlatView.
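The merging step can be sketched as follows. This is a simplified illustration rather than QEMU's flatview_simplify: `FlatRangeSketch` is a hypothetical reduction of FlatRange in which an integer id replaces the MemoryRegion pointer and the attribute checks (dirty logging, read-only, and so on) are omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a FlatRange. */
typedef struct {
    int region_id;             /* which region backs this range */
    uint64_t offset_in_region; /* offset of the range inside that region */
    uint64_t start;            /* start address in the flat address space */
    uint64_t size;
} FlatRangeSketch;

/* Two ranges can merge if they are adjacent in the address space and
 * also adjacent inside the same backing region. */
static int can_merge(const FlatRangeSketch *a, const FlatRangeSketch *b)
{
    return a->start + a->size == b->start
        && a->region_id == b->region_id
        && a->offset_in_region + a->size == b->offset_in_region;
}

/* Merge mergeable neighbours in place (input sorted by start).
 * Returns the new number of ranges. */
size_t simplify_ranges(FlatRangeSketch *r, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (out > 0 && can_merge(&r[out - 1], &r[i])) {
            r[out - 1].size += r[i].size;
        } else {
            r[out++] = r[i];
        }
    }
    return out;
}
```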
394 | 
395 | ##### address_space_update_topology_pass
396 | 
397 | Compares the old and new FlatRanges of the AddressSpace. If anything changed, it walks the AddressSpace's listeners front to back or back to front and calls the corresponding callback.
398 | 
399 | ```
400 | => MEMORY_LISTENER_UPDATE_REGION => section_from_flat_range      build a MemoryRegionSection from the FlatRange's extent
401 |                                  => MEMORY_LISTENER_CALL
402 | ```
403 | 
404 | For example, as mentioned earlier, the initialization flow registers kvm_state.memory_listener as a listener of address_space_memory, which adds it to the AddressSpace's listeners. If address_space_memory changes, the corresponding function in memory_listener is called.
405 | 
406 | For instance, if the callback argument passed to MEMORY_LISTENER_UPDATE_REGION is region_add, memory_listener.region_add (kvm_region_add) is called.
407 | 
408 | 
409 | 
410 | ##### kvm_region_add
411 | 
412 | ```
413 | => kvm_set_phys_mem => kvm_lookup_overlapping_slot
414 |                     => compute the starting HVA
415 |                     => kvm_set_user_memory_region => kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem)
416 | ```
417 | 
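418 | kvm_lookup_overlapping_slot checks whether the address range (GPA) of the new region section overlaps an existing KVMSlot (kml->slots). An overlap has to be handled:
419 | 
420 | Assume the old slot can be split into three parts: prefix slot + overlap slot + suffix slot, where overlap is the overlapping area.
421 | 
422 | For a complete overlap, where both a prefix slot and a suffix slot exist, no new slot needs to be registered.
423 | 
424 | For a partial overlap, prefix slot = 0 or suffix slot = 0, and the following steps are performed:
425 | 
426 | 1. Delete the old slot
427 | 2. Register the prefix slot or the suffix slot
428 | 3. Register the overlap slot
429 | 
430 | If there is no overlap, the new slot is simply registered. The slot is then passed through kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem) to update the corresponding kvm_memory_slot in KVM.
431 | 
432 | The slot structures maintained in QEMU must be updated as well. An existing slot is an entry of the kml->slots array, so kvm_set_phys_mem modifies it in place. A slot not yet in kml->slots, such as a prefix, suffix or overlap slot, requires kvm_alloc_slot => kvm_get_free_slot, which finds an empty entry (memory_size = 0) in kml->slots and returns it; that slot is then filled in.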
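The prefix / suffix arithmetic can be sketched as below. This is an illustration only, assuming the two ranges really do overlap; `RangeSketch` and `split_old_slot` are hypothetical names and not part of QEMU.

```c
#include <stdint.h>

typedef struct {
    uint64_t start;   /* GPA of the first byte */
    uint64_t size;    /* length in bytes; 0 means "empty" */
} RangeSketch;

/* Split an old slot around a new section that overlaps it.
 * prefix: part of the old slot below the new section.
 * suffix: part of the old slot above the new section.
 * Either may come back with size 0. Precondition: the ranges overlap. */
void split_old_slot(RangeSketch old, RangeSketch new_sec,
                    RangeSketch *prefix, RangeSketch *suffix)
{
    uint64_t old_end = old.start + old.size;
    uint64_t new_end = new_sec.start + new_sec.size;

    prefix->start = old.start;
    prefix->size = new_sec.start > old.start ? new_sec.start - old.start : 0;

    suffix->start = new_end;
    suffix->size = old_end > new_end ? old_end - new_end : 0;
}
```

A non-empty prefix or suffix is what has to be registered again after the old slot is deleted.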
433 | 
434 | ##### kvm_set_phys_mem => kvm_set_user_memory_region
435 | 
436 | KVM defines kvm_userspace_memory_region as the parameter for updating a memory slot:
437 | 
438 | ```c
439 | struct kvm_userspace_memory_region {
440 |     __u32 slot;                                                             // id of the corresponding kvm_memory_slot
441 |     __u32 flags;
442 |     __u64 guest_phys_addr;                                                  // GPA
443 |     __u64 memory_size; /* bytes */                                          // size
444 |     __u64 userspace_addr; /* start of the userspace allocated memory */     // HVA
445 | };
446 | ```
447 | 
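448 | It is computed and filled in during kvm_set_phys_mem => kvm_set_user_memory_region, as follows:
449 | 
450 | 1. The region's starting HVA (memory_region_get_ram_ptr) + the section's offset within the region (offset_within_region) + the page-alignment correction (delta) gives the section's actual starting HVA, which is stored in userspace_addr
451 | 
452 |     In memory_region_get_ram_ptr, if the current region is an alias of another region, the chain is followed upward until a non-alias (real) region is reached. Summing the alias_offset values along the way gives the offset of the current region within the real region.
453 | 
454 |     Since the real region has a corresponding RAMBlock, qemu_map_ram_ptr is called to add the RAMBlock's host and the total offset, giving the starting HVA of the current region.
455 | 
456 | 2. The section's offset within the AddressSpace (offset_within_address_space) + the page-alignment correction (delta) gives the section's actual GPA, which is stored in start_addr (the value passed to KVM as guest_phys_addr)
457 | 
458 | 3. The section's size (size) - the page-alignment correction (delta) gives the section's actual size, which is stored in memory_size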
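Step 1 above, including the alias chain walk, can be sketched as follows. `RegionSketch` is a hypothetical reduction of MemoryRegion that folds the RAMBlock's host pointer into the region itself; the function names are illustrative and are not QEMU's.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in: only the fields needed for the address computation. */
typedef struct RegionSketch {
    struct RegionSketch *alias;  /* NULL for a real (non-alias) region */
    uint64_t alias_offset;       /* offset into the alias target */
    uint8_t *host;               /* HVA base of the backing RAM, real regions only */
} RegionSketch;

/* Follow the alias chain to the real region, summing the alias offsets,
 * and return the HVA that corresponds to the start of mr. */
uint8_t *region_host_ptr(const RegionSketch *mr)
{
    uint64_t offset = 0;
    while (mr->alias) {
        offset += mr->alias_offset;
        mr = mr->alias;
    }
    return mr->host + offset;
}

/* HVA of a section: region base + offset_within_region + alignment delta. */
uint8_t *section_host_ptr(const RegionSketch *mr,
                          uint64_t offset_within_region, uint64_t delta)
{
    return region_host_ptr(mr) + offset_within_region + delta;
}
```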
459 | 
460 | 
461 | 
462 | 
463 | 
464 | ### RAMBlock
465 | 
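466 | As mentioned, a MemoryRegion represents a range of memory in the guest memory layout and carries logical meaning. Who, then, maintains the physical side, that is, the information about the actual memory backing it?
467 | 
468 | MemoryRegion has a ram_block member, a pointer to a RAMBlock, and the RAMBlock maintains the actual memory information such as the HVA and GPA. For example, in the userspace_addr computation just described, finding the region's starting HVA means locating the corresponding RAMBlock and reading its host member.
469 | 
470 | RAMBlock is defined as follows: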
471 | 
472 | ```c
473 | struct RAMBlock {
474 |     struct rcu_head rcu;                                        // for RCU (Read-Copy-Update) protection
475 |     struct MemoryRegion *mr;                                    // the corresponding MemoryRegion
476 |     uint8_t *host;                                              // the corresponding HVA
477 |     ram_addr_t offset;                                          // offset in the ram_list address space (the sum of the sizes of all preceding blocks)
478 |     ram_addr_t used_length;                                     // length currently in use
479 |     ram_addr_t max_length;                                      // total length
480 |     void (*resized)(const char*, uint64_t length, void *host);  // resize callback
481 |     uint32_t flags;
482 |     /* Protected by iothread lock.  */
483 |     char idstr[256];                                            // id
484 |     /* RCU-enabled, writes protected by the ramlist lock */
485 |     QLIST_ENTRY(RAMBlock) next;                                 // next block in ram_list.blocks
486 |     int fd;                                                     // file descriptor of the mapped file
487 |     size_t page_size;                                           // page size, usually the same as the host's
488 | };
489 | ```
490 | 
491 | As mentioned earlier, a MemoryRegion is initialized by calling one of the memory_region_* functions. Common ones are:
492 | 
493 | * memory_region_init_ram => qemu_ram_alloc
494 |     A RAMBlock created through qemu_ram_alloc has host set to NULL
495 | 
496 | * memory_region_init_ram_from_file => qemu_ram_alloc_from_file
497 |     A RAMBlock created through qemu_ram_alloc_from_file calls file_ram_alloc to allocate memory from the (device) file at the given path. This is usually done to use hugepages, with the `-mem-path` option naming the hugepage device file (such as /dev/hugepages)
498 | 
499 | * memory_region_init_ram_ptr => qemu_ram_alloc_from_ptr
500 |     RAMBlock.host is the pointer passed in, meaning the memory at that address is used as the backing memory
501 | 
502 | * memory_region_init_resizeable_ram => qemu_ram_alloc_resizeable
503 |     RAMBlock.host is NULL but resizeable is true, meaning no memory has been allocated yet but the block can be resized.
504 | 
505 | 
506 | The qemu_ram_alloc_* functions (qemu_ram_alloc / qemu_ram_alloc_from_file / qemu_ram_alloc_from_ptr / qemu_ram_alloc_resizeable) all end up calling qemu_ram_alloc_internal => ram_block_add. If it finds that host is NULL, it calls phys_mem_alloc (qemu_anon_ram_alloc) to allocate memory. Once host points somewhere, the RAMBlock is inserted into ram_list.blocks.
507 | 
508 | 
509 | ##### qemu_anon_ram_alloc
510 | 
511 | => qemu_ram_mmap(-1, size, QEMU_VMALLOC_ALIGN, false) => mmap
512 | 
513 | Allocates size bytes of memory in the QEMU process address space through mmap.
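A minimal sketch of such an anonymous mapping is shown below. It is not QEMU's qemu_ram_mmap: the alignment handling is left out, and `anon_ram_alloc` / `anon_ram_free` are hypothetical names. It assumes a Linux or other POSIX host where MAP_ANONYMOUS is available.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc in strict modes */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Reserve size bytes of zero-filled anonymous memory in this process.
 * Returns NULL on failure. */
void *anon_ram_alloc(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Release a mapping obtained from anon_ram_alloc. */
void anon_ram_free(void *p, size_t size)
{
    if (p) {
        munmap(p, size);
    }
}
```

The returned address plays the role of RAMBlock.host: it is an HVA inside the process that later gets registered with KVM as userspace_addr.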
514 | 
515 | 
516 | 
517 | 
518 | 
519 | ### RAMList
520 | 
521 | ram_list is a global variable that maintains all RAMBlocks as a linked list.
522 | 
523 | ```c
524 | RAMList ram_list = {.blocks = QLIST_HEAD_INITIALIZER(ram_list.blocks) };
525 | 
526 | typedef struct RAMList {
527 |     QemuMutex mutex;
528 |     RAMBlock *mru_block;
529 |     /* RCU-enabled, writes protected by the ramlist lock. */
530 |     QLIST_HEAD(, RAMBlock) blocks;                              // list of RAMBlocks
531 |     DirtyMemoryBlocks *dirty_memory[DIRTY_MEMORY_NUM];          // dirty page information, used for VGA / TCG / Live Migration
532 |     uint32_t version;                                           // incremented on every change
533 | } RAMList;
534 | extern RAMList ram_list;
535 | ```
536 | 
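537 | Notes:
538 | 
539 | * VGA: graphics card emulation uses dirty_memory to track dirty video memory in order to redraw the display
540 | * TCG: the dynamic translator uses dirty_memory to track self-modifying code and retranslates it when the original instructions change
541 | * Live Migration: live migration uses dirty_memory to track dirty pages and retransmits a page after it has been modified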
542 | 
543 | 
544 | 
545 | 
546 | ##### AddressSpaceDispatch
547 | 
548 | From:
549 | 
550 | ```
551 | address_space_init => address_space_init_dispatch => as->dispatch_listener = (MemoryListener) {
552 |                                                                             .begin = mem_begin,
553 |                                                                             .commit = mem_commit,
554 |                                                                             .region_add = mem_add,
555 |                                                                             .region_nop = mem_add,
556 |                                                                             .priority = 0,
557 |                                                                         };
558 |                                                   => memory_listener_register(as->dispatch_listener)
559 | ```
560 | 
561 | Besides kvm_state.memory_listener, address_space_memory also creates and binds a dispatch_listener. This listener exists so that, after a VM exit, the HVA corresponding to a GPA can be found.
562 | 
563 | When memory_region_transaction_commit calls the begin function of each listener, mem_begin is called:
564 | 
565 | ```
566 | => g_new0(AddressSpaceDispatch, 1)                  create an AddressSpaceDispatch and store it as the next_dispatch member of the AddressSpace
567 | ```
568 | 
569 | The AddressSpaceDispatch structure is:
570 | 
571 | ```c
572 | struct AddressSpaceDispatch {
573 |     struct rcu_head rcu;
574 | 
575 |     MemoryRegionSection *mru_section;
576 |     /* This is a multi-level map on the physical address space.
577 |      * The bottom level has pointers to MemoryRegionSections.
578 |      */
579 |     PhysPageEntry phys_map;
580 |     PhysPageMap map;            // GPA -> HVA mapping, implemented as a multi-level page table
581 |     AddressSpace *as;
582 | };
583 | ```
584 | 
585 | The map member is a multi-level (6-level) page table whose last level points to MemoryRegionSections.
586 | 
587 | When address_space_update_topology => address_space_update_topology_pass handles an add, mem_add is called:
588 | 
589 | It then calls register_subpage / register_multipage to register the pages in the page table.
590 | 
591 | ```
592 | => if the subpage of the MemoryRegion that the MemoryRegionSection belongs to does not exist
593 |     => subpage_init                                         create the subpage
594 |     => phys_page_set => phys_map_node_reserve               allocate page directory entries
595 |                      => phys_page_set_level                 fill the page table, from L5 down to L0
596 | => if it exists
597 |     => container_of(existing->mr, subpage_t, iomem)         retrieve it
598 | => subpage_register                                         set up the subpage
599 | ```
600 | 
601 | So after exiting from KVM to QEMU, AddressSpaceDispatch.map can be used to find the corresponding MemoryRegionSection and, from it, the corresponding HVA.
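The per-level indexing of such a radix table can be sketched as below. The constants are assumptions of this sketch (a 64-bit address space, 4 KiB pages and 9 index bits per level), chosen because they yield the 6 levels mentioned above; the `SKETCH_` names and `phys_map_index` are illustrative, not QEMU identifiers.

```c
#include <stdint.h>

/* Assumed parameters: 64-bit address space, 4 KiB pages, 9 bits per level. */
#define SKETCH_ADDR_SPACE_BITS 64
#define SKETCH_PAGE_BITS       12
#define SKETCH_L2_BITS         9
#define SKETCH_L2_SIZE         (1u << SKETCH_L2_BITS)
#define SKETCH_L2_LEVELS \
    (((SKETCH_ADDR_SPACE_BITS - SKETCH_PAGE_BITS - 1) / SKETCH_L2_BITS) + 1)

/* Index into the table at the given level for a guest physical address.
 * Level SKETCH_L2_LEVELS - 1 is the top of the tree, level 0 the bottom. */
unsigned phys_map_index(uint64_t gpa, unsigned level)
{
    uint64_t page_index = gpa >> SKETCH_PAGE_BITS;
    return (unsigned)((page_index >> (level * SKETCH_L2_BITS))
                      & (SKETCH_L2_SIZE - 1));
}
```

A lookup walks from the top level down to level 0, using one index per level, and ends at the entry that names the MemoryRegionSection.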
602 | 
603 | 
604 | 
605 | ## KVM
606 | 
607 | 
608 | ### kvm_vm_ioctl_set_memory_region
609 | 
610 | Adds memory. Called when KVM receives the KVM_SET_USER_MEMORY_REGION ioctl (which replaced KVM_SET_MEMORY_REGION because the latter does not support fine-grained control).
611 | 
612 | The argument passed in is:
613 | 
614 | ```c
615 | struct kvm_userspace_memory_region {
616 |     __u32 slot;                                                             // id of the corresponding kvm_memory_slot
617 |     __u32 flags;
618 |     __u64 guest_phys_addr;                                                  // GPA
619 |     __u64 memory_size; /* bytes */                                          // size
620 |     __u64 userspace_addr; /* start of the userspace allocated memory */     // HVA
621 | };
622 | ```
623 | 
624 | Possible flags:
625 | 
626 | * KVM_MEM_LOG_DIRTY_PAGES declares that writes to the region must be tracked, to be read back through KVM_GET_DIRTY_LOG
627 | * KVM_MEM_READONLY        if readonly is supported (KVM_CAP_READONLY_MEM), a write to the region triggers a VM exit (KVM_EXIT_MMIO)
628 | 
629 | The call chain is kvm_vm_ioctl_set_memory_region => kvm_set_memory_region => __kvm_set_memory_region
630 | 
631 | This function determines the user's operation from npages (the number of pages in the region) and the previous npages:
632 | 
633 | #### KVM_MR_CREATE
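634 | There are pages now but there were none before: a memory region is being added, so a slot is created and initialized.
635 | 
636 | #### KVM_MR_DELETE
637 | There are no pages now but there were before: a memory region is being deleted, so the slot is marked KVM_MEMSLOT_INVALID.
638 | 
639 | #### KVM_MR_FLAGS_ONLY / KVM_MR_MOVE
640 | There are pages now and there were before: a memory region is being modified. If only the flags changed, the operation is KVM_MR_FLAGS_ONLY. Currently the only possible change is KVM_MEM_LOG_DIRTY_PAGES, and the flag decides whether dirty_bitmap is created or freed.
641 | 
642 | If the GPA changed, the operation is KVM_MR_MOVE and the region must be moved. In practice the old slot is simply marked KVM_MEMSLOT_INVALID and a new one is added.
643 | 
644 | A new or modified slot is installed through install_new_memslots.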
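The decision logic described in this section can be sketched as a small classifier. It is a simplified illustration, not the kernel's __kvm_set_memory_region: the enum and struct names are hypothetical, only three slot fields are modeled, and requests this sketch does not model (such as resizing in place) are reported as invalid.

```c
enum mr_change_sketch {
    MR_NONE,        /* nothing changed */
    MR_CREATE,
    MR_DELETE,
    MR_MOVE,
    MR_FLAGS_ONLY,
    MR_INVALID      /* request rejected */
};

struct slot_sketch {
    unsigned long base_gfn;
    unsigned long npages;
    unsigned int flags;
};

/* Classify a request by comparing the old slot with the requested one. */
enum mr_change_sketch classify_change(struct slot_sketch old,
                                      struct slot_sketch new_slot)
{
    if (new_slot.npages) {
        if (!old.npages)
            return MR_CREATE;       /* pages now, none before */
        if (new_slot.npages != old.npages)
            return MR_INVALID;      /* resizing in place is not modeled */
        if (new_slot.base_gfn != old.base_gfn)
            return MR_MOVE;         /* same size, different GPA */
        if (new_slot.flags != old.flags)
            return MR_FLAGS_ONLY;
        return MR_NONE;
    }
    /* no pages now: delete if there was a slot, otherwise reject */
    return old.npages ? MR_DELETE : MR_INVALID;
}
```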
645 | 
646 | #### kvm_memory_slot
647 | 
648 | The slot that __kvm_set_memory_region operates on is the basic unit of memory management in KVM. It is defined as follows:
649 | 
650 | ```c
651 | struct kvm_memory_slot {
652 |     gfn_t base_gfn;                     // starting gfn of the slot
653 |     unsigned long npages;               // number of pages
654 |     unsigned long *dirty_bitmap;        // dirty page bitmap
655 |     struct kvm_arch_memory_slot arch;   // architecture-specific, including rmap and lpage_info
656 |     unsigned long userspace_addr;       // the corresponding starting HVA
657 |     u32 flags;
658 |     short id;
659 | };
660 | 
661 | 
662 | struct kvm_arch_memory_slot {
    |     struct kvm_rmap_head *rmap[KVM_NR_PAGE_SIZES];              // reverse mappings
663 |     struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];   // tracks whether hugepages are disabled for the next-level page table
664 |     unsigned short *gfn_track[KVM_PAGE_TRACK_MAX];
665 | };
666 | ```
667 | 
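668 | Slots are stored in kvm->memslots[as_id]->memslots[id], where as_id is the address space id and id is the slot id. Most architectures have only one address space, so as_id is always 0. Only x86 needs two: as_id = 0 is the normal address space and as_id = 1 is the SRAM space dedicated to SMM. The memory for these structures is already allocated in kvm_create_vm; here it is only initialized.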
669 | 
670 | 
671 | ### Memory management unit (MMU)
672 | 
673 | #### Initialization
674 | 
675 | ```
676 | kvm_init => kvm_arch_init => kvm_mmu_module_init => create mmu_page_header_cache as a cache
677 |                                                  => register_shrinker(&mmu_shrinker)                register the reclaim callback
678 | 
679 | 
680 | kvm_vm_ioctl_create_vcpu =>
681 | kvm_arch_vcpu_create => kvm_x86_ops->vcpu_create (vmx_create_vcpu) => init_rmode_identity_map       set up an identity mapping of 1024 pages for real mode
682 |                                                                    => kvm_vcpu_init => kvm_arch_vcpu_init => kvm_mmu_create
683 | kvm_arch_vcpu_setup => kvm_mmu_setup => init_kvm_mmu => init_kvm_tdp_mmu                            if two dimensional paging (EPT) is supported, initialize it and set the attributes and functions in vcpu->arch.mmu
684 |                                                      => init_kvm_softmmu => kvm_init_shadow_mmu     otherwise initialize the SPT
685 | ```
686 | 
687 | 
688 | ##### kvm_mmu_create
689 | 
690 | Initializes MMU-related information per vcpu. The related definitions in the vcpu include:
691 | 
692 | ```c
693 | struct kvm_vcpu_arch {
694 |     ...
695 |     /*
696 |      * Paging state of the vcpu
697 |      *
698 |      * If the vcpu runs in guest mode with two level paging this still saves
699 |      * the paging mode of the l1 guest. This context is always used to
700 |      * handle faults.
701 |      */
702 |     struct kvm_mmu mmu;
703 | 
704 |     /*
705 |      * Paging state of an L2 guest (used for nested npt)
706 |      *
707 |      * This context will save all necessary information to walk page tables
708 |      * of the an L2 guest. This context is only initialized for page table
709 |      * walking and not for faulting since we never handle l2 page faults on
710 |      * the host.
711 |      */
712 |     struct kvm_mmu nested_mmu;
713 | 
714 |     /*
715 |      * Pointer to the mmu context currently used for
716 |      * gva_to_gpa translations.
717 |      */
718 |     struct kvm_mmu *walk_mmu;
719 | 
720 |     // The following are caches that speed up allocation of frequently used data structures
721 |     // Allocates pte_list_desc, the list entry of the reverse-mapping list parent_ptes; allocated in mmu_set_spte => rmap_add => pte_list_add
722 |     struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
723 |     // Allocates pages used as kvm_mmu_page.spt
724 |     struct kvm_mmu_memory_cache mmu_page_cache;
725 |     // Allocates kvm_mmu_page, used as page-table pages
726 |     struct kvm_mmu_memory_cache mmu_page_header_cache;
727 |     ...
728 | }
729 | ```
730 | 
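731 | The caches speed up allocation of the data structures commonly used by the page tables. mmu_topup_memory_caches is called, for example when the MMU is loaded (kvm_mmu_load) or a page fault occurs (tdp_page_fault), to make sure each cache is sufficiently filled.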
732 | 
733 | ```c
734 | // make sure each cache is sufficiently filled
735 | static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
736 | {
737 |     // a non-zero r means allocation from the slab / __get_free_page failed; return the error directly
738 |     int r;
739 |     // if vcpu->arch.mmu_pte_list_desc_cache is running short, allocate from pte_list_desc_cache
740 |     r = mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
741 |                    pte_list_desc_cache, 8 + PTE_PREFETCH_NUM);
742 |     if (r)
743 |         goto out;
744 |     // if vcpu->arch.mmu_page_cache is running short, allocate directly through __get_free_page
745 |     r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 8);
746 |     if (r)
747 |         goto out;
748 |     // if vcpu->arch.mmu_page_header_cache is running short, allocate from mmu_page_header_cache
749 |     r = mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
750 |                    mmu_page_header_cache, 4);
751 | out:
752 |     return r;
753 | }
754 | ```
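The top-up pattern itself can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel's code: calloc stands in for the slab allocator, CACHE_CAPACITY is an arbitrary value, and the struct and function names are hypothetical.

```c
#include <stdlib.h>

#define CACHE_CAPACITY 40   /* hypothetical capacity for this sketch */

struct object_cache_sketch {
    int nobjs;                       /* number of preallocated objects held */
    void *objects[CACHE_CAPACITY];
};

/* If the cache holds fewer than min objects, refill it until it is full.
 * Returns 0 on success and -1 if an allocation fails. */
int cache_topup(struct object_cache_sketch *c, size_t obj_size, int min)
{
    if (c->nobjs >= min)
        return 0;
    while (c->nobjs < CACHE_CAPACITY) {
        void *obj = calloc(1, obj_size);
        if (!obj)
            return -1;
        c->objects[c->nobjs++] = obj;
    }
    return 0;
}

/* Take one preallocated object; the caller must have topped up first. */
void *cache_alloc(struct object_cache_sketch *c)
{
    return c->nobjs > 0 ? c->objects[--c->nobjs] : NULL;
}
```

Topping up before the fault is handled means the later allocations come straight from the preallocated array and cannot fail partway through.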
755 | 
756 | The two global slab caches pte_list_desc_cache and mmu_page_header_cache are created in kvm_mmu_module_init and serve as the allocation sources for vcpu->arch.mmu_pte_list_desc_cache and vcpu->arch.mmu_page_header_cache.
757 | 
758 | On the host, the allocated slab can be inspected with `cat /proc/slabinfo`:
759 | 
760 | ```
761 | # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
762 | kvm_mmu_page_header    576    576    168   48    2 : tunables    0    0    0 : slabdata     12     12      0
763 | ```
764 | 
765 | 
766 | 
767 | #### Loading the page table
768 | 
769 | kvm_vm_ioctl_create_vcpu only initializes the MMU, for example by setting vcpu->arch.mmu.root_hpa to INVALID_PAGE. The real value is set only just before entering the VM (VMLAUNCH/VMRESUME).
770 | 
771 | ```
772 | vcpu_enter_guest => kvm_mmu_reload => kvm_mmu_load => mmu_topup_memory_caches                       make sure each cache is sufficiently filled
773 |                                                    => mmu_alloc_roots => mmu_alloc_direct_roots     if the root page table does not exist, allocate a kvm_mmu_page
774 |                                                    => vcpu->arch.mmu.set_cr3 (vmx_set_cr3)          for EPT, load the HPA of the page's spt (struct page) into the VMCS
775 |                                                                                                     for SPT, load the HPA of the page's spt (struct page) into cr3
776 |                  => kvm_x86_ops->run (vmx_vcpu_run)
777 |                  => kvm_x86_ops->handle_exit (vmx_handle_exit)
778 | ```
779 | 
780 | 
781 | #### kvm_mmu_page
782 | 
783 | A page-table page. See Documentation/virtual/kvm/mmu.txt for a detailed explanation.
784 | 
785 | ```c
786 | struct kvm_mmu_page {
787 |     struct list_head link;                          // linked into kvm->arch.active_mmu_pages or invalid_list, indicating the page's current state
788 |     struct hlist_node hash_link;                    // linked into vcpu->kvm->arch.mmu_page_hash for fast lookup
789 | 
790 |     /*
791 |      * The following two entries are used to key the shadow page in the
792 |      * hash table.
793 |      */
794 |     gfn_t gfn;                                      // gfn of the start of the address range this page manages
795 |     union kvm_mmu_page_role role;                   // basic information, including hardware features and the level it belongs to
796 | 
797 |     u64 *spt;                                       // points to the page that holds all page-table entries (pte); that page's page->private points back to this kvm_mmu_page
798 |     /* hold the gfn of each spte inside spt */
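799 |     gfn_t *gfns;                                    // gfn of every page-table entry (pte)
800 |     bool unsync;                                    // for last-level page-table pages: whether the page's entries (pte) are out of sync with the guest's (whether the guest has updated the tlb)
801 |     int root_count;          /* Currently serving as active root */ // for top-level page-table pages: number of EPTPs pointing at this page
802 |     unsigned int unsync_children;                   // number of unsync ptes in this page-table page
803 |     struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */ // reverse mapping (rmap): the upper-level page-table entries that point to this page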
804 | 
805 |     /* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen.  */
806 |     unsigned long mmu_valid_gen;                    // generation number; the page is obsolete if it is smaller than kvm->arch.mmu_valid_gen
807 | 
808 |     DECLARE_BITMAP(unsync_child_bitmap, 512);       // bitmap of unsync sptes in this page-table page
809 | 
810 | #ifdef CONFIG_X86_32
811 |     /*
812 |      * Used out of the mmu-lock to avoid reading spte values while an
813 |      * update is in progress; see the comments in __get_spte_lockless().
814 |      */
815 |     int clear_spte_count;                           // on 32-bit, updating an spte is not atomic, so this counter detects whether it is being modified; if it was, the read must be redone
816 | #endif
817 | 
818 |     /* Number of writes since the last time traversal visited this page.  */
819 |     atomic_t write_flooding_count;                  // number of emulated writes since the page was last used; past a threshold the page is unmapped
820 | };
821 | 
822 | union kvm_mmu_page_role {
823 |     unsigned word;
824 |     struct {
825 |         unsigned level:4;           // paging level of this page
826 |         unsigned cr4_pae:1;         // cr4.pae; 1 means 64-bit gptes are used
827 |         unsigned quadrant:2;        // with cr4.pae=0 gptes are 32-bit while sptes are 64-bit, so one guest page table needs several shadow pages; this field says which part of the guest table this page shadows
828 |         unsigned direct:1;
829 |         unsigned access:3;          // access permissions
830 |         unsigned invalid:1;         // invalidated; destroyed as soon as it is unpinned
831 |         unsigned nxe:1;             // efer.nxe
832 |         unsigned cr0_wp:1;          // cr0.wp, write protection
833 |         unsigned smep_andnot_wp:1;  // cr4.smep && !cr0.wp
834 |         unsigned smap_andnot_wp:1;  // cr4.smap && !cr0.wp
835 |         unsigned :8;
836 | 
837 |         /*
838 |          * This is left at the top of the word so that
839 |          * kvm_memslots_for_spte_role can extract it with a
840 |          * simple shift.  While there is room, give it a whole
841 |          * byte so it is also faster to load it from memory.
842 |          */
843 |         unsigned smm:8;             // in system management mode
844 |     };
845 | };
846 | ```
847 | 
848 | 
849 | #### EPT Violation
850 | 
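851 | The first time the guest accesses a page there is no GVA-to-GPA mapping yet, so a page fault is raised inside the guest OS. The guest OS creates the pte, fixes up every level of its page tables, and finally accesses the resulting GPA. Because no GPA-to-HPA mapping exists in the EPT yet, an EPT Violation is triggered and the CPU VMEXITs to KVM. In vmx_handle_exit KVM runs kvm_vmx_exit_handlers[exit_reason]; the exit_reason is EXIT_REASON_EPT_VIOLATION, so handle_ept_violation is called.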
852 | 
853 | ##### handle_ept_violation
854 | 
855 | ```
856 | => vmcs_readl(EXIT_QUALIFICATION)                       read why the EPT exit happened; EXIT_QUALIFICATION supplements the exit reason, see Vol. 3C 27-9 Table 27-7
857 | => vmcs_read64(GUEST_PHYSICAL_ADDRESS)                  read the faulting GPA
858 | => derive error_code from exit_qualification; it may be read fault / write fault / fetch fault / ept page table is not present
859 | => kvm_mmu_page_fault => vcpu->arch.mmu.page_fault (tdp_page_fault)
860 | ```
861 | 
862 | ##### tdp_page_fault
863 | 
864 | ```
865 | => gfn = gpa >> PAGE_SHIFT      shift the GPA right by PAGE_SHIFT bits to get the gfn (guest frame number)
866 | => mapping_level                compute the level the gfn is mapped at; L1 when hugepages are not considered
867 | => try_async_pf                 translate the gfn into a pfn (physical frame number)
868 |         => kvm_vcpu_gfn_to_memslot => __gfn_to_memslot                  find the slot containing the gfn
869 |         => __gfn_to_pfn_memslot                                         find the pfn backing the gfn
870 |                 => __gfn_to_hva_many => __gfn_to_hva_memslot            compute the HVA at which the gfn starts
871 |                 => hva_to_pfn                                           compute the pfn for that HVA and make sure the physical page is resident
872 | 
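873 | => __direct_map                                                         update the EPT, adding the new mapping level by level
874 |     => for_each_shadow_entry                                            starting at level4 (root), fill in the tables level by level; at each level:
875 |         => mmu_set_spte                                                 at level1 the entry is necessarily missing, so write the HPA of the pfn without checking
876 |         => is_shadow_present_pte                                        if the next-level table page does not exist, i.e. the current entry is empty (*sptep = 0)
877 |             => kvm_mmu_get_page                                         allocate a page-table page structure
878 |             => link_shadow_page                                         write the HPA of the new table page into the current entry (sptep)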
879 | ```
880 | 
881 | There are two main steps. The first obtains the physical page backing the GPA, allocating one if there is none. The second updates the EPT.
882 | 
883 | ##### try_async_pf
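884 | 1. Find the memslot containing the gfn.
885 | 2. Compute the HVA at which the gfn starts: the memslot's starting HVA (userspace_addr) + (gfn - the first gfn of the slot (base_gfn)) * the page size (PAGE_SIZE).
886 | 3. Get a physical page for that HVA. There are two paths, hva_to_pfn_fast and hva_to_pfn_slow. hva_to_pfn_fast calls __get_user_pages_fast, which tries to pin the page, i.e. make sure the physical page behind the address is resident. If that fails, it falls back to hva_to_pfn_slow, which first takes the mm->mmap_sem lock and then calls __get_user_pages to pin the page.
887 | 4. On success, call page_to_pfn on the returned struct page to get the pfn.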
888 | 
889 | This function establishes the gfn-to-pfn mapping and pins the page in host memory.
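
As a rough illustration of step 2 above, the sketch below computes the HVA for a gfn. `struct memslot` is a simplified stand-in for KVM's kvm_memory_slot (only base_gfn, npages and userspace_addr are kept), and `gpa_to_gfn` / `gfn_to_hva` are illustrative helpers written for this example, not the kernel functions themselves.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/* Simplified stand-in for struct kvm_memory_slot (assumption: only these fields matter here). */
struct memslot {
    uint64_t base_gfn;        /* first guest frame number covered by the slot */
    uint64_t npages;          /* number of pages in the slot */
    uint64_t userspace_addr;  /* HVA at which the slot starts */
};

/* gfn = gpa >> PAGE_SHIFT */
static uint64_t gpa_to_gfn(uint64_t gpa)
{
    return gpa >> PAGE_SHIFT;
}

/* hva = userspace_addr + (gfn - base_gfn) * PAGE_SIZE */
static uint64_t gfn_to_hva(const struct memslot *slot, uint64_t gfn)
{
    /* The gfn must fall inside the slot, otherwise the arithmetic is meaningless. */
    assert(gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages);
    return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
}
```

With a slot starting at gfn 0x100 and HVA 0x7f0000000000, the GPA 0x123456 has gfn 0x123, which lies 0x23 pages into the slot.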
890 | 
891 | 
892 | ##### __direct_map
893 | 
894 | The iterator kvm_shadow_walk_iterator is used to fill in every EPT table involved in translating the GPA.
895 | 
896 | ```c
897 | struct kvm_shadow_walk_iterator {
898 |     u64 addr;                   // the faulting GPA; the walk fills in every entry involved in translating it
899 |     hpa_t shadow_addr;          // HPA of the current table page; set to vcpu->arch.mmu.root_hpa in shadow_walk_init
900 |     u64 *sptep;                 // points at the current entry; updated in shadow_walk_okay
901 |     int level;                  // current level; set to 4 (x86_64 PT64_ROOT_LEVEL) in shadow_walk_init, decremented by 1 in shadow_walk_next
902 |     unsigned index;             // index within the table of the current level; updated in shadow_walk_okay
903 | };
904 | ```
905 | 
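906 | In each iteration sptep points at the entry for the GPA in the current-level table, and the goal is to write the HPA of the next-level table into that entry (i.e. set *sptep). Since this is a fault, the next-level table page may not exist yet; in that case a table page is allocated and its HPA is written into *sptep.
907 | 
908 | For example, take a GPA such as 0xfffff001. In binary it splits as: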
909 | 
910 | ```
911 | 000000000 000000011 111111111 111111111 000000000001
912 |   PML4      PDPT       PD        PT        Offset
913 | ```
914 | 
915 | Initial state: level = 4, shadow_addr = root_hpa, addr = GPA
916 | 
917 | Steps:
918 | 
919 | 1. index = the bits of addr that belong to the current level, e.g. 0 (000000000) at level = 4 and 3 (000000011) at level = 3.
920 | 2. sptep = va(shadow_addr) + index, which is the HVA of the entry for the GPA in the current table.
921 | 3. If *sptep is empty, allocate a page as the next-level table and set *sptep to the HPA of that page.
922 | 4. shadow_addr = *sptep, descend into the next-level table, and loop.
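
Step 1 can be checked with a few lines of C. This is a minimal sketch that assumes 4K pages and 9 index bits per level (as in x86_64 4-level paging); `walk_index` is a hypothetical helper named for this example.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT      12
#define PT64_LEVEL_BITS 9

/* Index of addr within the table at the given level (4 = PML4 ... 1 = PT). */
static unsigned walk_index(uint64_t addr, int level)
{
    assert(level >= 1 && level <= 4);
    return (unsigned)((addr >> (PAGE_SHIFT + (level - 1) * PT64_LEVEL_BITS))
                      & ((1u << PT64_LEVEL_BITS) - 1));
}
```

For the GPA 0xfffff001 used above this yields 0, 3, 511 and 511 for levels 4 down to 1, matching the binary split shown earlier.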
923 | 
924 | With hugepages enabled each entry covers a larger range, so fewer table levels are needed. By default a huge page is 2M, so level 1 is not needed.
925 | 
926 | 
927 | 
928 | ##### mmu_set_spte
929 | 
930 | ```
931 | => set_spte => mmu_spte_update => mmu_spte_set => __set_spte                                write the HPA of the physical page (pfn) into *sptep, i.e. set a pte in the last-level table
932 | => rmap_add => page_header(__pa(spte))                                                      get the table page that contains sptep
933 |             => kvm_mmu_page_set_gfn                                                         record the gfn in the gfns of that table page
934 |             => gfn_to_rmap => __gfn_to_memslot                                              get the slot containing the gfn
935 |                            => __gfn_to_rmap => gfn_to_index                                 compute the index of the page within the slot from gfn and slot->base_gfn
936 |                                     => slot->arch.rmap[level - PT_PAGE_TABLE_LEVEL][idx]    fetch the corresponding rmap from the slot
937 |             => pte_list_add                                                                 add the address of the current entry (sptep) to the rmap as a reverse mapping
938 | ```
939 | 
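940 | Applies to the level-1 table (PT). It sets the pte (*sptep) in the last-level table and adds the address of that entry (sptep) to slot->arch.rmap[level - PT_PAGE_TABLE_LEVEL][idx] as a reverse mapping, so that the entry (and through it the kvm_mmu_page) can later be found quickly from the gfn.
941 | 
942 | In most cases a gfn is mapped by a single entry, so rmap_head can point at sptep directly. A gfn may, however, be mapped by several entries, and in that case the rmap is maintained as a linked list of small arrays: one list node, a pte_list_desc, holds three sptep pointers. Because pte_list_desc objects are allocated frequently, they too are allocated from a cache (vcpu->arch.mmu_pte_list_desc_cache).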
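
The "single pointer, else list of three-slot arrays" layout can be sketched in userspace C. The code below is loosely modeled on the idea behind pte_list_add and is not the kernel source: it assumes the low bit of the head value tags a descriptor pointer, uses calloc instead of the per-vcpu cache, and never frees anything.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PTE_LIST_EXT 3                  /* sptep slots per descriptor */

struct pte_list_desc {
    uint64_t *sptes[PTE_LIST_EXT];
    struct pte_list_desc *more;         /* next descriptor in the chain */
};

/* val == 0: empty; low bit clear: a single sptep; low bit set: pointer to a descriptor. */
struct rmap_head {
    uintptr_t val;
};

/* Adds spte to the reverse map and returns how many entries it held before. */
static int rmap_add(struct rmap_head *head, uint64_t *spte)
{
    struct pte_list_desc *desc;
    int i, count = 0;

    if (!head->val) {                   /* 0 -> 1 entry: store the pointer itself */
        head->val = (uintptr_t)spte;
        return 0;
    }
    if (!(head->val & 1)) {             /* 1 -> 2 entries: switch to a descriptor */
        desc = calloc(1, sizeof(*desc));
        assert(desc);
        desc->sptes[0] = (uint64_t *)head->val;
        desc->sptes[1] = spte;
        head->val = (uintptr_t)desc | 1;
        return 1;
    }
    desc = (struct pte_list_desc *)(head->val & ~(uintptr_t)1);
    while (desc->sptes[PTE_LIST_EXT - 1] && desc->more) {
        desc = desc->more;              /* skip full descriptors */
        count += PTE_LIST_EXT;
    }
    if (desc->sptes[PTE_LIST_EXT - 1]) {    /* last descriptor is full: chain a new one */
        desc->more = calloc(1, sizeof(*desc));
        assert(desc->more);
        desc = desc->more;
        count += PTE_LIST_EXT;
    }
    for (i = 0; desc->sptes[i]; i++)    /* first free slot */
        count++;
    desc->sptes[i] = spte;
    return count;
}
```

The common single-mapping case costs no allocation at all; only a second mapping of the same gfn pays for a descriptor.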
943 | 
944 | 
945 | ##### kvm_mmu_get_page
946 | 
947 | Gets the kvm_mmu_page for a gfn. It first tries to find the table page by gfn in vcpu->kvm->arch.mmu_page_hash; if the page was allocated before, it is simply returned. Otherwise kvm_mmu_alloc_page allocates one from the caches, and it is added to vcpu->kvm->arch.mmu_page_hash with the gfn as key.
948 | 
949 | kvm_mmu_alloc_page uses mmu_memory_cache_alloc to take the kvm_mmu_page and page objects from vcpu->arch.mmu_page_header_cache and vcpu->arch.mmu_page_cache. mmu_topup_memory_caches keeps these caches stocked; when they run low they are refilled from the global slab caches, as mentioned earlier.
950 | 
951 | 
952 | 
953 | ##### link_shadow_page
954 | 
955 | ```
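956 | => mmu_spte_set => __set_spte                               set the current entry (*sptep) to the HPA of the next-level table page
957 | => mmu_page_add_parent_pte => pte_list_add                  add the address of the current entry (sptep) to the parent_ptes of the next-level table page as a reverse mapping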
958 | ```
959 | 
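960 | Applies to the level 2-4 tables (PML4 - PD). If the walk finds that the next-level table is missing, a table page is allocated and the entry the iterator currently points at (sptep) is set to the HPA of that next-level table page, so the table page can be reached through the entry next time. The address of the current entry (sptep) is also added to the parent_ptes of the next-level table page as a reverse mapping.
961 | 
962 | With these two sets of reverse mappings, once the GPA has been shifted right to obtain the gfn, the rmap yields the entry in L1, and parent_ptes then yields the entries in L2-4 in turn. When the host needs to swap out the page behind some guest GPA, it finds the entries related to that gfn directly through the reverse mappings and modifies them, without walking the EPT again.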
963 | 
964 | 
965 | 
966 | 
967 | 
968 | 
969 | ## Summary
970 | 
971 | ### QEMU
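972 | QEMU creates a series of MemoryRegions representing the ROM, RAM and other regions of the guest. MemoryRegions relate to one another through alias or subregion relationships, which further refines the definition of each region.
973 | 
974 | For each real (non-alias) MemoryRegion, the corresponding RAMBlock is created during memory initialization. The RAMBlock allocates memory from QEMU's process address space with mmap and keeps track of the start HVA/GPA/size of the memory the MemoryRegion manages.
975 | 
976 | An AddressSpace represents the VM's physical address space. When a MemoryRegion in an AddressSpace changes, the listeners are triggered: the MemoryRegion tree of that AddressSpace is flattened into a one-dimensional FlatView, and the FlatRanges are compared to see whether anything changed. If so, callbacks such as region_add are invoked on the changed sections to update the KVMSlots inside QEMU and to fill in a kvm_userspace_memory_region structure, which is passed to ioctl to update the kvm_memory_slot in KVM.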
977 | 
978 | 
979 | ### KVM
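980 | When QEMU creates a vcpu through ioctl, kvm_mmu_create is called to initialize the MMU state and allocate slab caches for the page-table structures.
981 | 
982 | Before KVM enters the guest, vcpu_enter_guest => kvm_mmu_reload loads the address of the root page table into the VMCS so that the guest uses it.
983 | 
984 | When an EPT Violation occurs, the CPU VMEXITs to KVM. If it is a missing page, KVM takes the GPA, computes the gfn from it, finds the memory slot containing the gfn and obtains the HVA. It then finds the pfn for that HVA, making sure the page is resident. With the missing page in place, the EPT must be updated to fill in the missing entries: starting from L4 the tables are completed level by level, and for any table page missing at some level one is allocated from the slab and its HPA is written into the table one level up.
985 | 
986 | Besides linking upper-level tables to lower-level ones, KVM also builds reverse mappings, so the entries related to a gfn can be found directly from the GPA without walking the EPT again.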
987 | 
988 | 


--------------------------------------------------------------------------------
/pci.md:
--------------------------------------------------------------------------------
  1 | 
  2 | 
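  3 | Assuming familiarity with QOM and the q35 architecture, this article analyzes how PCI devices are initialized and attached in QEMU.
  4 | 
  5 | 
  6 | ## Creating and initializing a PCI device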
  7 | 
  8 | ```
  9 | pci_create_simple_multifunction => pci_create_multifunction
 10 |     => qdev_create(&bus->qbus, name) => object_class_by_name
 11 |     => qdev_prop_set_int32(dev, "addr", devfn)
 12 |     => qdev_init_nofail => dc->realize (pci_qdev_realize)
 13 |     => do_pci_register_device => pc->realize (ich9_lpc_realize)
 14 |     => pci_add_option_rom   // add the ROM
 15 | ```
 16 | 
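 17 | The PCI device object is created and its realized property is set to true, which invokes the realize function to initialize it.
 18 | 
 19 | The key point here is the devfn argument passed to pci_create_simple_multifunction. It is the device/function number of the PCI device and is usually computed with the following macro: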
 20 | 
 21 | ```c
 22 | #define PCI_DEVFN(slot, func)   ((((slot) & 0x1f) << 3) | ((func) & 0x07))
 23 | ```
 24 | 
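 25 | The devfn computed by this macro is therefore an 8-bit number: the upper 5 bits are the slot number and the lower 3 bits the function number. There are thus 32 slots, each with 8 functions.
 26 | 
 27 | For example, the ISA bridge "ICH9-LPC" has the devfn `PCI_DEVFN(ICH9_LPC_DEV, ICH9_LPC_FUNC)`, i.e. 31 * 2^3 + 0 = 248.
 28 | 
 29 | During object_class_by_name the class constructor (class_init) is called. According to the TypeInfo of ICH9-LPC: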
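
The arithmetic can be verified with a small sketch. PCI_DEVFN is the macro quoted above; PCI_SLOT and PCI_FUNC are the usual inverse macros, and `make_devfn` is a hypothetical checked wrapper added for this example.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_DEVFN(slot, func)   ((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define PCI_SLOT(devfn)         (((devfn) >> 3) & 0x1f)
#define PCI_FUNC(devfn)         ((devfn) & 0x07)

/* Builds a devfn and rejects out-of-range inputs instead of silently masking them. */
static uint8_t make_devfn(unsigned slot, unsigned func)
{
    assert(slot < 32 && func < 8);
    return (uint8_t)PCI_DEVFN(slot, func);
}
```

Slot 31, function 0 (the ICH9-LPC) gives 248, and slot 31, function 2 (the SATA controller in the `info pci` output further down) gives 250.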
 30 | 
 31 | ```c
 32 | static const TypeInfo ich9_lpc_info = {
 33 |     .name       = TYPE_ICH9_LPC_DEVICE,
 34 |     .parent     = TYPE_PCI_DEVICE,
 35 |     .instance_size = sizeof(struct ICH9LPCState),
 36 |     .instance_init = ich9_lpc_initfn,
 37 |     .class_init  = ich9_lpc_class_init,
 38 |     .interfaces = (InterfaceInfo[]) {
 39 |         { TYPE_HOTPLUG_HANDLER },
 40 |         { TYPE_ACPI_DEVICE_IF },
 41 |         { }
 42 |     }
 43 | };
 44 | ```
 45 | 
 46 | its class constructor is ich9_lpc_class_init, which initializes the properties of this device model:
 47 | 
 48 | ```c
 49 | static void ich9_lpc_class_init(ObjectClass *klass, void *data)
 50 | {
 51 |     ...
 52 |     k->realize = ich9_lpc_realize;
 53 |     k->config_write = ich9_lpc_config_write;
 54 |     dc->desc = "ICH9 LPC bridge";
 55 |     k->vendor_id = PCI_VENDOR_ID_INTEL;
 56 |     k->device_id = PCI_DEVICE_ID_INTEL_ICH9_8;
 57 |     k->revision = ICH9_A2_LPC_REVISION;
 58 |     k->class_id = PCI_CLASS_BRIDGE_ISA;
 59 |     ...
 60 | }
 61 | ```
 62 | 
 63 | ### do_pci_register_device
 64 | Sets up the device instance object.
 65 | 
 66 | ```
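 67 | => check whether bus->devices[devfn] is empty; if not, the position is already taken, so report an error and return
 68 | => if the device is hotplugged and pci_get_function_0 is non-NULL, the slot is already taken, so report an error and return
 69 |     => if the bus has an upstream PCIe port, the device can only be the first function of the first slot, i.e. devfn=0; otherwise it can be the first function of the slot that devfn belongs to
 70 |     => pci_init_bus_master
 71 | => pci_config_alloc(pci_dev)                                                                                   allocate the config space: 256B for PCI devices, 4096B for PCIe
 72 | => pci_config_set_vendor_id / pci_config_set_device_id / pci_config_set_revision / pci_config_set_class        set the device identification from the data supplied at class initialization
 73 | => if the device is a bridge (is_bridge), pci_init_mask_bridge
 74 | => pci_init_multifunction
 75 | => set pci_dev->config_read and pci_dev->config_write: use the functions set in the class constructor if any, otherwise the defaults pci_default_read_config / pci_default_write_config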
 76 | ```
 77 | 
 78 | ## BAR(base address register)
 79 | 
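 80 | According to the PCI specification, a BAR describes how much address space a PCI device needs. A NIC, for example, needs a fairly large range, while some serial devices need only a small one. The BARs occupy the 24 bytes at 0x10-0x27 of the PCI config space. With 32-bit BARs a device can have up to 6; with 64-bit BARs at most 3.
 81 | 
 82 | In each BAR, bit0 specifies the mapping type: 0 for Memory, 1 for I/O. On real hardware bit0 is read-only; it is decided by the device manufacturer and cannot be changed by anyone else.
 83 | 
 84 | For the Memory type, bit1-2 are the Locatable field: 0 means the register is 32 bits wide and can be mapped anywhere in the 32-bit memory space; 2 means 64-bit; 1 is reserved as of revision 3 of the PCI Local Bus Specification. bit3 is Prefetchable: 0 for no, 1 for yes. bit4-end hold the Base Address (16-byte aligned).
 85 | 
 86 | For the I/O type, bit1 is Reserved and bit2-end hold the Base Address (4-byte aligned).
 87 | 
 88 | The length of each BAR is fixed by the hardware. The BIOS/OS reads that length, dynamically assigns an address range of that size, and writes the start address of the range into the BAR as the base address. The range [base, base+len) then serves as the channel between software (the OS) and the PCI device.
 89 | 
 90 | The BARs of the PCI devices can be inspected from the qemu monitor: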
 91 | 
 92 | ```
 93 | (qemu) info pci
 94 |   Bus  0, device   0, function 0:
 95 |     Host bridge: PCI device 8086:29c0
 96 |       id ""
 97 |   Bus  0, device   1, function 0:
 98 |     VGA controller: PCI device 1234:1111
 99 |       BAR0: 32 bit prefetchable memory at 0xfd000000 [0xfdffffff].
100 |       BAR2: 32 bit memory at 0xfebf0000 [0xfebf0fff].
101 |       BAR6: 32 bit memory at 0xffffffffffffffff [0x0000fffe].
102 |       id ""
103 |   Bus  0, device   2, function 0:
104 |     Ethernet controller: PCI device 8086:100e
105 |       IRQ 11.
106 |       BAR0: 32 bit memory at 0xfebc0000 [0xfebdffff].
107 |       BAR1: I/O at 0xc000 [0xc03f].
108 |       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
109 |       id ""
110 |   Bus  0, device  31, function 0:
111 |     ISA bridge: PCI device 8086:2918
112 |       id ""
113 |   Bus  0, device  31, function 2:
114 |     SATA controller: PCI device 8086:2922
115 |       IRQ 10.
116 |       BAR4: I/O at 0xc080 [0xc09f].
117 |       BAR5: 32 bit memory at 0xfebf1000 [0xfebf1fff].
118 |       id ""
119 |   Bus  0, device  31, function 3:
120 |     SMBus: PCI device 8086:2930
121 |       IRQ 10.
122 |       BAR4: I/O at 0x0700 [0x073f].
123 |       id ""
124 | ```
125 | 
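126 | Only the VGA controller, Ethernet controller, SATA controller and SMBus have BARs, while the Host bridge and the PCI/ISA bridge (the ICH9-LPC mentioned earlier) have none. Presumably these bridge devices do not need a mapped address range to communicate with the OS.
127 | 
128 | Inside the guest OS the same BARs can be found with `lspci -x`. For the PCI/ISA bridge, 0x10 - 0x27 is all zeros, while for the Ethernet controller 0x10 - 0x27 reads: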
129 | 
130 | ```
131 | 10: 00 00 bc fe 01 c0 00 00 00 00 00 00 00 00 00 00
132 | 20: 00 00 00 00 00 00 00 00
133 | ```
134 | 
135 | Each BAR takes 4 bytes, so we can see that (**x86 is little-endian**):
136 | 
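137 | * BAR 0 is 0xfebc0000: Memory type, 32-bit, not prefetchable, base = 0xfebc0000
138 | * BAR 1 is 0x0000c001: I/O type (the 32-bit and prefetchable attributes do not apply to I/O BARs), base = 0x0000c000
139 | * BAR 6 is reported as 0xffffffffffffffff by the monitor: it is the ROM BAR (Memory type), which lives at offset 0x30 of the config space rather than in 0x10 - 0x27.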
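
The decoding above can be reproduced with a short sketch that reads the little-endian bytes from the `lspci -x` dump and splits a 32-bit BAR into its fields. `struct bar_info`, `read_le32` and `decode_bar32` are names made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct bar_info {
    bool     is_io;         /* bit0: 1 = I/O, 0 = Memory */
    bool     is_64bit;      /* Memory only: bit1-2 == 2 */
    bool     prefetchable;  /* Memory only: bit3 */
    uint32_t base;          /* base address with the flag bits masked off */
};

/* Assembles a 32-bit value from 4 little-endian config-space bytes. */
static uint32_t read_le32(const uint8_t *p)
{
    assert(p != NULL);
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static struct bar_info decode_bar32(uint32_t bar)
{
    struct bar_info info = { false, false, false, 0 };

    info.is_io = bar & 0x1;
    if (info.is_io) {
        info.base = bar & ~0x3u;            /* bit0-1 are flags for I/O BARs */
    } else {
        info.is_64bit = ((bar >> 1) & 0x3) == 0x2;
        info.prefetchable = (bar >> 3) & 0x1;
        info.base = bar & ~0xfu;            /* bit0-3 are flags for Memory BARs */
    }
    return info;
}
```

Feeding it the bytes `00 00 bc fe` and `01 c0 00 00` from the dump gives a Memory BAR at 0xfebc0000 and an I/O BAR at 0xc000.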
140 | 
141 | The lspci output therefore agrees with what the QEMU monitor reports.
142 | 
143 | 
144 | ### Setting the BAR
145 | 
146 | So when are a device's BARs set in QEMU? A search of the code leads to pci_register_bar:
147 | 
148 | ```c
149 | void pci_register_bar(PCIDevice *pci_dev, int region_num,
150 |                       uint8_t type, MemoryRegion *memory)
151 | {
152 |     PCIIORegion *r;
153 |     uint32_t addr; /* offset in pci config space */
154 |     uint64_t wmask;
155 |     pcibus_t size = memory_region_size(memory);
156 | 
157 |     assert(region_num >= 0);
158 |     assert(region_num < PCI_NUM_REGIONS);
159 |     if (size & (size-1)) {
160 |         fprintf(stderr, "ERROR: PCI region size must be pow2 "
161 |                     "type=0x%x, size=0x%"FMT_PCIBUS"\n", type, size);
162 |         exit(1);
163 |     }
164 | 
165 |     r = &pci_dev->io_regions[region_num];
166 |     r->addr = PCI_BAR_UNMAPPED;
167 |     r->size = size;
168 |     r->type = type;
169 |     r->memory = memory;
170 |     r->address_space = type & PCI_BASE_ADDRESS_SPACE_IO
171 |                         ? pci_dev->bus->address_space_io
172 |                         : pci_dev->bus->address_space_mem;
173 | 
174 |     wmask = ~(size - 1);
175 |     if (region_num == PCI_ROM_SLOT) {
176 |         /* ROM enable bit is writable */
177 |         wmask |= PCI_ROM_ADDRESS_ENABLE;
178 |     }
179 | 
180 |     addr = pci_bar(pci_dev, region_num);
181 |     pci_set_long(pci_dev->config + addr, type);
182 | 
183 |     if (!(r->type & PCI_BASE_ADDRESS_SPACE_IO) &&
184 |         r->type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
185 |         // 64bit base address
186 |         pci_set_quad(pci_dev->wmask + addr, wmask);
187 |         pci_set_quad(pci_dev->cmask + addr, ~0ULL);
188 |     } else {
189 |         // 32bit
190 |         pci_set_long(pci_dev->wmask + addr, wmask & 0xffffffff);
191 |         pci_set_long(pci_dev->cmask + addr, 0xffffffff);
192 |     }
193 | }
194 | ```
195 | 
196 | From the MemoryRegion passed in, it initializes a PCIIORegion structure, which represents the address range behind one BAR:
197 | 
198 | ```c
199 | typedef struct PCIIORegion {
200 |     pcibus_t addr; /* current PCI mapping address. -1 means not mapped */
201 | #define PCI_BAR_UNMAPPED (~(pcibus_t)0)
202 |     pcibus_t size;
203 |     uint8_t type;
204 |     MemoryRegion *memory;
205 |     MemoryRegion *address_space;
206 | } PCIIORegion;
207 | ```
208 | 
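209 | This sets the mapping type (type), the size (size), the address (addr, initially PCI_BAR_UNMAPPED) and so on, in entry region_num of the PCIDevice.io_regions array. In other words, the information about every BAR is stored in the io_regions array.
210 | 
211 | It also writes into the BAR's position in config. At this point only the BAR's type bits (the low flag bits, including bit0) are set.
212 | 
213 | For the e1000 NIC, for example, pci_e1000_realize sets: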
214 | 
215 | ```c
216 | pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &d->mmio);
217 | pci_register_bar(pci_dev, 1, PCI_BASE_ADDRESS_SPACE_IO, &d->io);
218 | ```
219 | 
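220 | d->mmio and d->io are both MemoryRegions, created with memory_region_init_io in e1000_mmio_setup. The former has size PNPMMIO_SIZE (0x20000), the latter IOPORT_SIZE (0x40).
221 | 
222 | 
223 | ### Mapping the MemoryRegion
224 | 
225 | Since the MemoryRegion has been created and stored in io_regions, the corresponding memory region must be registered with the virtual machine at some point. Searching the code for io_regions leads to pci_do_device_reset. When is it called? Tracing shows that once QEMU's main function has registered all devices and finished initialization, just before calling main_loop to enter the main loop, it calls qemu_system_reset. The stack is: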
226 | 
227 | ```
228 | (gdb) bt
229 | #0  pci_do_device_reset (dev=0x7ffff22a0010) at hw/pci/pci.c:278
230 | #1  0x0000555555a1a657 in pcibus_reset (qbus=0x5555569a5640) at hw/pci/pci.c:304
231 | #2  0x00005555559720ef in qbus_reset_one (bus=0x5555569a5640, opaque=0x0) at hw/core/qdev.c:304
232 | #3  0x0000555555976e53 in qbus_walk_children (bus=0x5555569a5640, pre_devfn=0x0, pre_busfn=0x0, post_devfn=0x55555597206c <qdev_reset_one>, post_busfn=0x55555597208f <qbus_reset_one>, opaque=0x0) at hw/core/bus.c:68
233 | #4  0x0000555555972c50 in qdev_walk_children (dev=0x55555694ec00, pre_devfn=0x0, pre_busfn=0x0, post_devfn=0x55555597206c <qdev_reset_one>, post_busfn=0x55555597208f <qbus_reset_one>, opaque=0x0) at hw/core/qdev.c:602
234 | #5  0x0000555555976e17 in qbus_walk_children (bus=0x555556769d20, pre_devfn=0x0, pre_busfn=0x0, post_devfn=0x55555597206c <qdev_reset_one>, post_busfn=0x55555597208f <qbus_reset_one>, opaque=0x0) at hw/core/bus.c:59
235 | #6  0x00005555559721a2 in qbus_reset_all (bus=0x555556769d20) at hw/core/qdev.c:321
236 | #7  0x00005555559721c5 in qbus_reset_all_fn (opaque=0x555556769d20) at hw/core/qdev.c:327
237 | #8  0x00005555558d94ec in qemu_devices_reset () at vl.c:1765
238 | #9  0x000055555582832c in pc_machine_reset () at /home/binss/work/qemu/qemu-2.8.1.1/hw/i386/pc.c:2181
239 | #10 0x00005555558d9589 in qemu_system_reset (report=false) at vl.c:1778
240 | #11 0x00005555558e0eb0 in main (argc=19, argv=0x7fffffffe498, envp=0x7fffffffe538) at vl.c:4656
241 | ```
242 | 
243 | The call chain is main => qemu_system_reset => mc->reset (pc_machine_reset) => qemu_devices_reset. It walks all reset_handlers, takes each QEMUResetEntry and calls its callback re->func. A QEMUResetEntry is added to this queue by qemu_register_reset.
244 | 
245 | `#8` corresponds to the QEMUResetEntry of main-system-bus, the system bus, for which vl.c registers qbus_reset_all_fn as the reset handler with `qemu_register_reset(qbus_reset_all_fn, sysbus_get_default());`. So qbus_reset_all_fn is called here.
246 | 
247 | qbus_reset_all_fn => qbus_reset_all => qbus_walk_children walks every dev on the bus and calls qdev_walk_children on it. For the e1000 the relevant dev is q35-pcihost (see the q35 document if you do not remember how devs and buses are connected on q35). qdev_walk_children in turn calls qbus_walk_children on every child_bus of q35-pcihost, which brings us to pcie.0, the bus the e1000 is attached to. Let us stop here and ignore the qbus_walk_children recursion into pcie.0; what matters is that once the recursion finishes, post_busfn (qbus_reset_one) is called to reset the pcie.0 bus.
248 | 
249 | qbus_reset_one => bc->reset (pcibus_reset) then walks bus->devices and calls pci_do_device_reset on every device.
250 | 
251 | 
252 | As the name suggests, pci_do_device_reset mainly performs reset work:
253 | 
254 | ```
255 | => pci_word_test_and_clear_mask     clear the 2 bytes of PCI_COMMAND in config (0x04-0x05)
256 | => pci_word_test_and_clear_mask     clear the 2 bytes of PCI_STATUS in config (0x06-0x07)
257 | => clear the 1 byte each of PCI_CACHE_LINE_SIZE and PCI_INTERRUPT_LINE in config
258 | => walk io_regions; for each region that exists, set the BAR's position in config, writing only the type and leaving the rest for the BIOS/OS inside the VM to set
259 | => pci_update_mappings
260 | => msi_reset
261 | => msix_reset
262 | ```
263 | 
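264 | The key call here is pci_update_mappings. It checks the addr of each PCIIORegion in the io_regions array; an addr different from PCI_BAR_UNMAPPED means the BAR has been assigned an address, so memory_region_add_subregion_overlap / memory_region_del_subregion are used to add or remove the BAR's MemoryRegion, which (if KVM is in use) is eventually registered with KVM through ioctl: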
265 | 
266 | ```c
267 | static void pci_update_mappings(PCIDevice *d)
268 | {
269 |     PCIIORegion *r;
270 |     int i;
271 |     pcibus_t new_addr;
272 | 
273 |     // the 7th io region (BAR 6) is PCI_ROM_SLOT
274 |     for(i = 0; i < PCI_NUM_REGIONS; i++) {
275 |         r = &d->io_regions[i];
276 | 
277 |         /* this region isn't registered */
278 |         if (!r->size)
279 |             continue;
280 | 
281 |         // read the base address stored in this BAR from config
282 |         new_addr = pci_bar_address(d, i, r->type, r->size);
283 | 
284 |         /* This bar isn't changed */
285 |         if (new_addr == r->addr)
286 |             continue;
287 | 
288 |         /* now do the real mapping */
289 |         if (r->addr != PCI_BAR_UNMAPPED) {
290 |             trace_pci_update_mappings_del(d, pci_bus_num(d->bus),
291 |                                           PCI_SLOT(d->devfn),
292 |                                           PCI_FUNC(d->devfn),
293 |                                           i, r->addr, r->size);
294 |             memory_region_del_subregion(r->address_space, r->memory);
295 |         }
296 |         r->addr = new_addr;
297 |         if (r->addr != PCI_BAR_UNMAPPED) {
298 |             trace_pci_update_mappings_add(d, pci_bus_num(d->bus),
299 |                                           PCI_SLOT(d->devfn),
300 |                                           PCI_FUNC(d->devfn),
301 |                                           i, r->addr, r->size);
302 |             memory_region_add_subregion_overlap(r->address_space,
303 |                                                 r->addr, r->memory, 1);
304 |         }
305 |     }
306 | 
307 |     pci_update_vga(d);
308 | }
309 | ```
310 | 
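311 | Unfortunately, at qemu_system_reset time the BAR base addresses have not been set yet and the BARs are still unmapped, so the MemoryRegions are not registered with KVM.
312 | 
313 | So when do the e1000's two MemoryRegions get registered with KVM? Only after the virtual machine has been entered and starts running: the guest performs device discovery, assigns the e1000 a base address and writes it into the e1000's config space (config). This is analyzed further below.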
314 | 
315 | 
316 | 
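317 | ### Writing and reading the config space
318 | 
319 | The config space is now initialized and waits for the virtual machine to read and write it. As mentioned earlier, a PCI device has config_write and config_read set for exactly this purpose. In principle the BIOS/OS reads config through config_read to obtain the device's config-space information, assigns addresses to the device, and writes them back into the BARs in the config space.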
320 | 
321 | 
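322 | Since ICH9-LPC turned out to have no BARs, the e1000 is used for the rest of the analysis. (While writing this article I kept analyzing ICH9-LPC, only to find at the end that, lacking BARs, it simply hits continue in many places and does not reflect the PCI device registration flow well.)
323 | 
324 | According to e1000_base_info its class constructor is e1000_class_init, which does not set config functions of its own; the realize function pci_e1000_realize, however, sets `pci_dev->config_write = e1000_write_config`. Its config_write is therefore e1000_write_config, and its config_read the default pci_default_read_config.
325 | 
326 | We locate the address of config and set a watchpoint on it to observe when it is read and written.
327 | 
328 | 1. do_pci_register_device allocates the memory and sets config contents, e.g. pci_config_set_vendor_id
329 | 2. pci_e1000_realize continues to set up config; this includes pci_register_bar, which writes the BAR type into config and sets the region's addr to all f's (PCI_BAR_UNMAPPED)
330 | 3. Because there is a ROM (efi-e1000.rom), pci_add_option_rom is called and registers PCI_ROM_SLOT as BAR6
331 | 4. pci_do_device_reset (call chain described above) clears and sets fields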
332 | 5. KVM_EXIT_IO
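333 |     After QEMU => KVM => VM, when the VM executes a port I/O instruction to access config, a VMExit occurs, VM => KVM => QEMU. From exit_reason QEMU learns that the cause is KVM_EXIT_IO, takes the io information out of cpu->kvm_run, and makes the following calls: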
334 | 
335 |     kvm_cpu_exec => kvm_handle_io => address_space_rw => address_space_read => address_space_read_full => address_space_read_continue => memory_region_dispatch_read => memory_region_dispatch_read1 => access_with_adjusted_size => memory_region_read_accessor => mr->ops->read (pci_host_data_read) => pci_data_read => pci_host_config_read_common => pci_default_read_config
336 | 
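337 |     So the e1000's config_read is called, and the contents of the requested config-space location are read and returned to KVM => VM.
338 | 
339 |     Likewise, when the VM writes config a VMExit occurs, VM => KVM => QEMU, with the following call chain: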
340 | 
341 |     kvm_cpu_exec => kvm_handle_io => address_space_rw => address_space_write => address_space_write_continue => memory_region_dispatch_write => access_with_adjusted_size => memory_region_write_accessor => mr->ops->write (pci_host_data_write) => pci_data_write => pci_host_config_write_common => pci_dev->config_write (e1000_write_config) => pci_default_write_config
342 | 
343 |     So the e1000's config_write is called, and the value is written to the requested config-space location.
344 | 
345 |     Back in QEMU, exit_reason says KVM_EXIT_IO, so the io information is taken from cpu->kvm_run, e.g. `{direction = 1, size = 4, port = 3324, count = 1, data_offset = 4096}`. For a write, pci_default_write_config is called in the end.
346 | 
347 | 6. KVM_EXIT_MMIO
348 | 
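349 |     Once config is set up and Linux has finished initializing the device, communication can begin. When the VM accesses the mapped memory region a VMExit occurs, VM => KVM => QEMU. From exit_reason QEMU learns that the cause is KVM_EXIT_MMIO, takes the mmio information out of cpu->kvm_run, and makes the following calls: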
350 | 
351 |     kvm_cpu_exec => address_space_rw => address_space_write => address_space_write_continue => memory_region_dispatch_write => access_with_adjusted_size => memory_region_write_accessor => e1000_mmio_write
352 | 
353 |     For a write, e1000_mmio_write is called in the end. It casts the MemoryRegion's opaque to E1000State and calls macreg_writeops[index]. Take addr=208, val=157, size=4 as an example; the address being accessed is 4273733840 (0xfebc00d0):
354 | 
355 |     `index = (addr & 0x1ffff) >> 2`, so index = 52. macreg_writeops[52] is set_ims, which after setting IMS calls set_ics to raise the interrupt: set_interrupt_cause => pci_set_irq => pci_irq_handler => pci_update_irq_status sets the PCI_STATUS_INTERRUPT bit of PCI_STATUS ([6:7]) in the device's config space (config) to 1, indicating that an INTx# signal was received.
356 | 
357 |     An MMIO read is similar, except that the PCI_STATUS_INTERRUPT bit is cleared to 0 after the read.
358 | 
359 | 
360 | 
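361 | For the e1000, the following writes happened before Linux started (before any boot message was printed):
362 | 
363 | * addr[16, 19], i.e. BAR0, is written with 4294967295 (0xffffffff)
364 | * addr[16, 19], i.e. BAR0, is written with 0 (0x0)
365 | * addr[16, 19], i.e. BAR0, is written with 4273733632 (0xfebc0000)
366 | * addr[20, 23], i.e. BAR1, is written with 4294967295 (0xffffffff)
367 | * addr[20, 23], i.e. BAR1, is written with 1 (0x1)
368 | * addr[20, 23], i.e. BAR1, is written with 49152 (0xc000)
369 | * addr[4, 5], i.e. COMMAND, is written with 259 (0x103, 100000011): respond to Memory Space and I/O Space accesses, and enable the SERR# driver.
370 | 
371 | After Linux started, the following writes happened:
372 | 
373 | * addr[4, 5], i.e. COMMAND, is written with 256 (0x100, 100000000): enable the SERR# driver.
374 | * addr[4, 5], i.e. COMMAND, is written with 259 (0x103, 100000011): respond to Memory Space and I/O Space accesses, and enable the SERR# driver.
375 | 
376 | Linux then probes the lengths again and writes the values back. BAR1 is written with 49153 (0xc001); presumably Linux recognizes it as an I/O port BAR and corrects the lowest bit to 1.
377 | 
378 | addr[4, 5], i.e. COMMAND, is written with 263 (0x107, 100000111): respond to Memory Space and I/O Space accesses, enable the SERR# driver, and become a bus master.
379 | 
380 | From then on, in normal e1000 operation, e1000x_rx_ready is called when receiving packets to check readiness. It also reads config[PCI_COMMAND], and the device only counts as ready when it is a bus master.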
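
The COMMAND values above are easier to read once split into bits. The sketch below uses the standard PCI command register bit values; `cmd_has` is a hypothetical helper for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND_IO      0x001   /* respond to I/O Space accesses */
#define PCI_COMMAND_MEMORY  0x002   /* respond to Memory Space accesses */
#define PCI_COMMAND_MASTER  0x004   /* device may act as bus master */
#define PCI_COMMAND_SERR    0x100   /* SERR# driver enabled */

/* True when every bit of flag is set in cmd. */
static bool cmd_has(uint16_t cmd, uint16_t flag)
{
    assert(flag != 0);
    return (cmd & flag) == flag;
}
```

259 (0x103) has the I/O, Memory and SERR# bits set but not bus master; 263 (0x107) adds the bus master bit, which is the one the receive-readiness check looks at.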
381 | 
382 | 
383 | 
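384 | ### Addendum: locating the config region
385 | 
386 | Device discovery and configuration by the BIOS/OS rely on being able to interact with a device's config space. A mechanism is therefore needed that lets the BIOS/OS locate the config space of a given PCI device, i.e. the config member of the device in QEMU.
387 | 
388 | (Per the PCI specification?) this takes two steps. Taking a write as an example:
389 | 
390 | 1. Target-selection phase
391 |     The address of the device to access is written into the config_reg of PCIHostState through pci_host_config_write
392 | 2. Value phase
393 |     The value to write into the config space goes through pci_host_data_write, which uses the previously written config_reg to locate the target device and calls that device's config write function
394 | 
395 | Details below:
396 | 
397 | q35_host_initfn, which initializes the host bridge, sets up the MemoryRegions used to access device config through PIO: PCIHostState.conf_mem and PCIHostState.data_mem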
398 | 
399 | ```c
400 |     memory_region_init_io(&phb->conf_mem, obj, &pci_host_conf_le_ops, phb,
401 |                           "pci-conf-idx", 4);
402 |     memory_region_init_io(&phb->data_mem, obj, &pci_host_data_le_ops, phb,
403 |                           "pci-conf-data", 4);
404 | ```
405 | 
406 | q35_host_realize then binds them to the corresponding ports:
407 | 
408 | ```c
409 | #define MCH_HOST_BRIDGE_CONFIG_ADDR            0xcf8
410 | #define MCH_HOST_BRIDGE_CONFIG_DATA            0xcfc
411 | 
412 | static void q35_host_realize(DeviceState *dev, Error **errp)
413 | {
414 |     ...
415 |     sysbus_add_io(sbd, MCH_HOST_BRIDGE_CONFIG_ADDR, &pci->conf_mem);
416 |     sysbus_init_ioports(sbd, MCH_HOST_BRIDGE_CONFIG_ADDR, 4);
417 | 
418 |     sysbus_add_io(sbd, MCH_HOST_BRIDGE_CONFIG_DATA, &pci->data_mem);
419 |     sysbus_init_ioports(sbd, MCH_HOST_BRIDGE_CONFIG_DATA, 4);
420 | }
421 | ```
422 | 
423 | Debugging confirms that the pci-conf-idx MemoryRegion has offset 3320 (0xcf8) and size 4, and that pci-conf-data has offset 3324 (0xcfc) and size 4:
424 | 
425 | ```
426 | [struct MemoryRegion *] 0x5555569b70b0 pci-conf-idx size<4> offset<3320>
427 | [struct MemoryRegion *] 0x5555569b71b0 pci-conf-data size<4> offset<3324>
428 | ```
429 | 
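430 | Accessing these regions means accessing the config of a PCI device.
431 | 
432 | 
433 | [kvm_run->io.data_offset, kvm_run->io.data_offset + kvm_run->io.size * kvm_run->io.count) holds the data to be written.
434 | 
435 | #### Target-selection phase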
436 | 
437 | kvm_handle_io => address_space_rw => address_space_write => address_space_write_continue => memory_region_dispatch_write => access_with_adjusted_size => memory_region_write_accessor => mr->ops->write (pci_host_config_write)
438 | 
439 | ##### address_space_rw
440 | Decides between read and write according to direction.
441 | addr is the port.
442 | 
443 | ##### address_space_write
444 | 
445 | => mr = address_space_translate(as, addr, &addr1, &l, true)                 find the MemoryRegion for addr in the AddressSpace and compute addr1 for the first round
446 | => address_space_write_continue(as, addr, attrs, buf, len, addr1, l, mr)
447 | 
448 | The key here is address_space_translate. From the address (port) it finds the pci-conf-idx MemoryRegion in address_space_io, together with addr1 (the offset within it?).
449 | 
450 | ##### address_space_write_continue
451 | memory_access_size computes the width of a single write, then it calls
452 | 
453 | => result |= memory_region_dispatch_write(mr, addr1, val, 4, attrs);  note that addr1 is passed, not addr
454 | => address_space_translate           compute the MemoryRegion and addr1 for the next round
455 | 
456 | The total number of address_space_translate calls (translations) is therefore write length / single write width.
457 | 
458 | ##### memory_region_dispatch_write
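459 | If the MemoryRegion has a write function, the accessor is memory_region_write_accessor, which is passed down as an argument.
460 | 
461 | ##### access_with_adjusted_size
462 | Calls the accessor passed in.
463 | 
464 | The accessor is called once per access_size chunk; for example, with size 4 and access_size 1, `r |= access(mr, addr + i, value, access_size, i * 8, access_mask, attrs)` is called 4 times.
465 | 
466 | If any one of the calls fails, the whole access counts as failed.
467 | 
468 | ##### memory_region_write_accessor
469 | Uses the MemoryRegion's ops (e.g. pci_host_data_le_ops defines write as pci_host_data_write) to operate on the object stored in the MemoryRegion: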
470 | mr->ops->write(mr->opaque, addr, tmp, size)
471 | 
472 | ##### pci_host_config_write
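473 | Casts opaque to PCIHostState and writes the value into its config_reg member.
474 | 
475 | ##### Summary
476 | 
477 | In the target-selection phase the target port is 3320 (0xcf8), so the pci-conf-idx MemoryRegion is found and the value is written into the config_reg member of PCIHostState.
478 | 
479 | For the e1000 the value written here is 2147487760 (0x80001010): enable bit set, bus 0, device 2, function 0, register offset 0x10 (BAR0). It identifies the location in the e1000's config space that is about to be accessed.
480 | 
481 | 
482 | #### Value phase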
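
The value 0x80001010 can be unpacked with the sketch below. It assumes the conventional layout of the 0xcf8 address register (enable bit 31, bus in bits 23:16, device in bits 15:11, function in bits 10:8, register offset in the low byte); the struct and both helpers are illustrative names, not QEMU functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fields of the 32-bit value written to port 0xcf8. */
struct pci_cfg_addr {
    bool    enable;  /* bit 31 */
    uint8_t bus;     /* bits 23:16 */
    uint8_t slot;    /* bits 15:11 */
    uint8_t func;    /* bits 10:8 */
    uint8_t reg;     /* bits 7:2, kept as a dword-aligned byte offset */
};

static struct pci_cfg_addr decode_config_reg(uint32_t v)
{
    struct pci_cfg_addr a;

    a.enable = (v >> 31) & 1;
    a.bus    = (v >> 16) & 0xff;
    a.slot   = (v >> 11) & 0x1f;
    a.func   = (v >> 8) & 0x07;
    a.reg    = v & 0xfc;
    return a;
}

static uint32_t encode_config_reg(unsigned bus, unsigned slot,
                                  unsigned func, unsigned reg)
{
    assert(bus < 256 && slot < 32 && func < 8 && reg < 256 && (reg & 3) == 0);
    return (1u << 31) | (bus << 16) | (slot << 11) | (func << 8) | reg;
}
```

Decoding 0x80001010 gives bus 0, device 2, function 0, offset 0x10, which matches the e1000 at "Bus 0, device 2, function 0" in the `info pci` output and its BAR0 at config offset 0x10.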
483 | 
484 | kvm_cpu_exec => kvm_handle_io => address_space_rw => address_space_write => address_space_write_continue => memory_region_dispatch_write => access_with_adjusted_size => memory_region_write_accessor => mr->ops->write (pci_host_data_write) => pci_data_write => pci_host_config_write_common => pci_dev->config_write (e1000_write_config) => pci_default_write_config
485 | 
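486 | The first part of the call chain is the same as in the target-selection phase, except that this time the MemoryRegion is pci-conf-data, so pci_host_data_write is called.
487 | 
488 | ##### pci_host_data_write
489 | Casts opaque to PCIHostState, takes the PCIBus (s->bus) from it, and converts the address to s->config_reg | (addr & 3).
490 | addr is 0, so the addr argument passed down is simply s->config_reg.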
491 | 
492 | 
493 | ##### pci_data_write
494 | 根据 addr 计算出 devfn 。然后从该 PCIBus (pcie.0) 的设备数组 devices 中找到对应的 PCIDevice 然后调用其 config_read / config_write 函数对地址位置进行操作
495 | 
496 | ##### 小结
497 | 
498 | 在设置值阶段，目标port为3324，于是找到 pci-conf-data 这个 MemoryRegion，然后读设置目标阶段存在 config_reg 的设备地址，根据地址找到设备的配置空间，然后写入值。
499 | 
500 | 
501 | #### pci_default_write_config
502 | 
503 | Every config write eventually ends up in pci_default_write_config. According to:
504 | 
505 | ```c
506 |     if (ranges_overlap(addr, l, PCI_BASE_ADDRESS_0, 24) ||
507 |         ranges_overlap(addr, l, PCI_ROM_ADDRESS, 4) ||
508 |         ranges_overlap(addr, l, PCI_ROM_ADDRESS1, 4) ||
509 |         range_covers_byte(addr, l, PCI_COMMAND))
510 |         pci_update_mappings(d);
511 | ```
512 | 
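513 | If the write touches [PCI_BASE_ADDRESS_0:PCI_BASE_ADDRESS_0+24] (BAR0-BAR5), [PCI_ROM_ADDRESS:PCI_ROM_ADDRESS+4] or [PCI_ROM_ADDRESS1:PCI_ROM_ADDRESS1+4] (BAR6), or modifies PCI_COMMAND, pci_update_mappings is triggered.
514 | 
515 | Debugging shows, however, that PCIIORegion.address in io_regions (the base address) is not updated by the pci_update_mappings call made from the pci_default_write_config that writes the BAR. At that point bit0 and bit1 of PCI_COMMAND are 0, meaning the device does not respond to Memory Space or I/O Space accesses, so the relevant bits of cmd are 0 while pci_bar_address translates the address and it always returns PCI_BAR_UNMAPPED.
516 | 
517 | The update happens only when PCI_COMMAND changes from 256 (0x100) to 259 (0x103), which enables responses to Memory Space and I/O Space accesses. pci_update_mappings is called again and walks the 7 regions, reading each one from config. For example, an earlier pci_default_write_config has already set BAR0's base address in config to 0xfebc0000, so io_regions[0].address is updated to 0xfebc0000. The corresponding MemoryRegion (e1000-mmio) is then added to the parent MemoryRegion (pci) at that offset. This triggers the address space listeners, which finally update the flatview and push it to KVM through ioctl.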
518 | 
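The gating on PCI_COMMAND described above can be sketched like this. It is a heavily simplified stand-in for pci_bar_address, assuming only the standard meaning of command bits 0 and 1 and a 32-bit BAR; the `TOY_`/`toy_` names are hypothetical, and the real function performs further checks that are omitted here.

```c
#include <stdint.h>

#define TOY_COMMAND_IO      0x1u        /* bit 0: respond to I/O Space accesses */
#define TOY_COMMAND_MEMORY  0x2u        /* bit 1: respond to Memory Space accesses */
#define TOY_BAR_UNMAPPED    UINT64_MAX  /* stand-in for PCI_BAR_UNMAPPED */

/* Return the base address a 32-bit BAR decodes at, or TOY_BAR_UNMAPPED
 * when the matching enable bit in the command register is clear. */
static uint64_t toy_bar_address(uint16_t cmd, uint32_t bar_value)
{
    if (bar_value & 0x1) {                  /* I/O BAR: 2 flag bits */
        if (!(cmd & TOY_COMMAND_IO))
            return TOY_BAR_UNMAPPED;
        return bar_value & ~0x3u;
    }
    if (!(cmd & TOY_COMMAND_MEMORY))        /* memory BAR: 4 flag bits */
        return TOY_BAR_UNMAPPED;
    return bar_value & ~0xfu;
}
```

With cmd = 0x100 the BAR value 0xfebc0000 stays unmapped; with cmd = 0x103 the same value resolves to 0xfebc0000, matching the behaviour observed in the debugger.
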
519 | 
520 | 
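521 | #### Summary
522 | A PCI device exchanges information with the BIOS and OS above it through its configuration space. In real hardware, each field of the configuration space corresponds to a register or ROM.
523 | 
524 | At boot the BIOS/OS runs its device discovery logic. When it finds the current device it needs to read the configuration space to get the relevant information, which causes a VMExit to KVM and then to QEMU, where QEMU calls the config_read set at device initialization to read the device information. After reading, the BIOS/OS writes all 1s to each BAR to detect its size, then allocates a base address and writes it in. Finally it modifies the command bits to allow responses to Memory Space and I/O Space accesses.
525 | 
526 | This triggers the mapping update, which installs the corresponding I/O MemoryRegion into KVM. From then on the OS interacts with the device through PIO/MMIO.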
527 | 
528 | 
529 | 
530 | 
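531 | ### Detecting the size of a mapped region
532 | 
533 | When the BIOS/OS allocates an address for a BAR, it needs to know how much space the BAR requires.
534 | 
535 | According to the PCI specification, the BIOS/OS writes all 1s to the BAR register and reads it back to obtain the BAR's size. For example, if the size is 4 KiB, writing 0xffffffff reads back 0xfffff00X. The lowest hex digit (4 bits) is X because it is fixed and does not change when f is written: as mentioned earlier, the PCI specification reserves the lowest 4 bits for META information, so they are hardwired by the device manufacturer and nobody else can modify them: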
536 | 
537 | ```
538 | For Memory BARs
539 | 0     Region Type       0 = Memory
540 | 2-1   Locatable         0 = any 32-bit  /  1 = < 1 MiB  /  2 = any 64-bit
541 | 3     Prefetchable      0 = no  /  1 = yes
542 | 31-4  Base Address      16-byte aligned
543 | 
544 | For I/O BARs (Deprecated)
545 | 0     Region Type       1 = I/O (deprecated)
546 | 1     Reserved
547 | 31-2  Base Address      4-byte aligned
548 | ```
549 | 
550 | After reading 0xfffff00X, the BIOS/OS clears the 4 bits of X to get 0xfffff000, inverts it to get 0x00000fff, and adds 1 to get 0x00001000. The size is therefore 0x1000, i.e. 4 KiB.
551 | 
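The arithmetic above fits in a few lines. The helper below is only an illustrative sketch for 32-bit BARs (the function name is hypothetical); it uses bit 0 of the read-back value to decide whether 2 flag bits (I/O BAR) or 4 flag bits (memory BAR) have to be cleared first.

```c
#include <stdint.h>

/* Size of a 32-bit BAR, computed from the value read back after writing
 * 0xffffffff to it. Returns 0 for a BAR that reads back as all zeros. */
static uint32_t toy_bar_size_from_readback(uint32_t readback)
{
    /* Bit 0 selects I/O (1) or memory (0); the flag bits are not address bits. */
    uint32_t flag_mask = (readback & 0x1) ? 0x3u : 0xfu;
    uint32_t base_bits = readback & ~flag_mask;

    return ~base_bits + 1;  /* invert, then add 1 */
}
```

For instance, a read-back of 0xfffff008 yields 0x1000, 0xfffe0000 yields 0x20000, and the I/O value 0xffffffc1 yields 0x40.
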
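552 | Why writing 0xffffffff reads back 0xfffff00X is guaranteed by the hardware. So how does an emulated device in QEMU reproduce this behaviour?
553 | 
554 | It is done with masks. Two masks affect the value of a BAR: wmask and w1cmask. During device initialization, pci_qdev_realize => do_pci_register_device is called to initialize them: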
555 | 
556 | ```c
557 | pci_config_alloc(pci_dev);
558 | pci_init_wmask(pci_dev);
559 | pci_init_w1cmask(pci_dev);
560 | ```
561 | 
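562 | The first call allocates a config-sized block of memory for each of wmask and w1cmask; for the e1000 PCI device this is 256 bytes. So wmask and w1cmask cover the whole configuration space and mask every write to it.
563 | 
564 | The other two calls initialize some bytes of wmask and w1cmask, setting the corresponding positions to 1 so that writes to those parts of config are not masked out. At this point the positions corresponding to BAR0-BAR6 are still 0.
565 | 
566 | Only in pci_register_bar, the function that registers a BAR, is `uint64_t wmask = ~(size - 1)` set. At the end, if the BAR is 32 bits wide, it executes: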
567 | 
568 | ```
569 | pci_set_long(pci_dev->wmask + addr, wmask & 0xffffffff);
570 | ```
571 | 
572 | For example, e1000's BAR0 has type 0 and size PNPMMIO_SIZE (0x20000), so the local variable wmask is 0xfffffffffffe0000. Since BAR0 is only 32 bits wide, it is ANDed with 0xffffffff to give 0xfffe0000. wmask[16:19] therefore holds 00 00 fe ff, i.e. 0xfffe0000.
573 | 
574 | So when the BIOS/OS writes 0xffffffff to BAR0, the call ends up in pci_default_write_config:
575 | 
576 | ```c
577 | void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int l)
578 | {
579 |     int i, was_irq_disabled = pci_irq_disabled(d);
580 |     uint32_t val = val_in;
581 | 
582 |     for (i = 0; i < l; val >>= 8, ++i) {
583 |         uint8_t wmask = d->wmask[addr + i];
584 |         uint8_t w1cmask = d->w1cmask[addr + i];
585 |         assert(!(wmask & w1cmask));
586 |         d->config[addr + i] = (d->config[addr + i] & ~wmask) | (val & wmask);
587 |         d->config[addr + i] &= ~(val & w1cmask); /* W1C: Write 1 to Clear */
588 |     }
589 |     if (ranges_overlap(addr, l, PCI_BASE_ADDRESS_0, 24) ||
590 |         ranges_overlap(addr, l, PCI_ROM_ADDRESS, 4) ||
591 |         ranges_overlap(addr, l, PCI_ROM_ADDRESS1, 4) ||
592 |         range_covers_byte(addr, l, PCI_COMMAND))
593 |         pci_update_mappings(d);
594 | 
595 |     if (range_covers_byte(addr, l, PCI_COMMAND)) {
596 |         pci_update_irq_disabled(d, was_irq_disabled);
597 |         memory_region_set_enabled(&d->bus_master_enable_region,
598 |                                   pci_get_word(d->config + PCI_COMMAND)
599 |                                     & PCI_COMMAND_MASTER);
600 |     }
601 | 
602 |     msi_write_config(d, addr, val_in, l);
603 |     msix_write_config(d, addr, val_in, l);
604 | }
605 | ```
606 | 
607 | Here addr is 16 and the length l is 4, so the loop runs 4 times and sets one byte per iteration, with val shifted right by 8 bits each time. For each byte it takes the wmask and w1cmask values at that position:
608 | 
609 | 1. i=0, wmask = 0, w1cmask = 0, so config[16] = (0 & 0xff) | (0xffffffff & 0) = 0
610 |                                    config[16] &= ~(0xffffffff & 0) = 0
611 | 2. i=1, wmask = 0, w1cmask = 0, so config[17] = (0 & 0xff) | (0x00ffffff & 0) = 0
612 |                                    config[17] &= ~(0x00ffffff & 0) = 0
613 | 3. i=2, wmask = 0xfe, w1cmask = 0, so config[18] = (0 & 0x01) | (0x0000ffff & 0xfe) = 0xfe
614 |                                       config[18] &= ~(0x0000ffff & 0) = 0xfe
615 | 4. i=3, wmask = 0xff, w1cmask = 0, so config[19] = (0 & 0) | (0x000000ff & 0xff) = 0xff
616 |                                       config[19] &= ~(0x000000ff & 0) = 0xff
617 | 
618 | After masking, the attempted write of 0xffffffff actually stores 0xfffe0000. Because the read function pci_default_read_config simply does a memcpy, the value the BIOS/OS reads back is 0xfffe0000.
619 | 
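The effect of the mask can be reproduced outside QEMU with a toy model. The sketch below copies only the masking loop from the listing above; `ToyPciDev` and the `toy_` helpers are hypothetical stand-ins (for PCIDevice, pci_set_long/pci_get_long and pci_register_bar), and the caller is assumed to zero the structure first.

```c
#include <stdint.h>

#define TOY_CFG_SIZE 256  /* size of a conventional PCI configuration space */

typedef struct {
    uint8_t config[TOY_CFG_SIZE];
    uint8_t wmask[TOY_CFG_SIZE];
    uint8_t w1cmask[TOY_CFG_SIZE];
} ToyPciDev;

/* Little-endian 32-bit store and load. */
static void toy_set_long(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

static uint32_t toy_get_long(const uint8_t *p)
{
    uint32_t v = 0;

    for (int i = 0; i < 4; i++)
        v |= (uint32_t)p[i] << (8 * i);
    return v;
}

/* Register a 32-bit BAR: flag bits go into config, ~(size - 1) into wmask. */
static void toy_register_bar32(ToyPciDev *d, int bar, uint32_t size, uint8_t type)
{
    uint32_t addr = 0x10 + 4 * (uint32_t)bar;

    toy_set_long(d->config + addr, type);
    toy_set_long(d->wmask + addr, ~(size - 1));
}

/* The masking loop only, without the mapping, IRQ and MSI handling. */
static void toy_write_config(ToyPciDev *d, uint32_t addr, uint32_t val, int l)
{
    for (int i = 0; i < l; val >>= 8, ++i) {
        uint8_t wmask = d->wmask[addr + i];
        uint8_t w1cmask = d->w1cmask[addr + i];

        d->config[addr + i] = (uint8_t)((d->config[addr + i] & ~wmask) | (val & wmask));
        d->config[addr + i] &= (uint8_t)~(val & w1cmask);
    }
}
```

Registering a memory BAR0 of size 0x20000 and an I/O BAR1 of size 0x40 on a zeroed `ToyPciDev`, then writing 0xffffffff to offsets 0x10 and 0x14, leaves 0xfffe0000 and 0xffffffc1 in config respectively.
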
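620 | The BIOS/OS then clears the lowest hex digit and inverts the value to get 0x0001ffff, then adds 1 to get 0x00020000, i.e. 0x20000, which matches the size set for BAR0.
621 | 
622 | Likewise for BAR1, whose size is IOPORT_SIZE (0x40): wmask[20:23] is c0 ff ff ff, i.e. 0xffffffc0, so the value actually stored is 0xffffffc1. After reading it, the BIOS/OS clears the low flag bits and inverts the value to get 0x0000003f, then adds 1 to get 0x00000040, i.e. 0x40, which matches the size set for BAR1.
623 | 
624 | Once the size of the BAR is known, the BIOS/OS allocates a base address for it, which must be aligned to the BAR's size.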
625 | 
626 | 
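627 | ### Summary
628 | 
629 | The BIOS/OS discovers the size of a BAR by writing and reading back, then allocates a base address for it. QEMU uses the mask mechanism to limit the maximum value a BAR can hold, emulating what the hardware does.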
630 | 


--------------------------------------------------------------------------------
/q35.md:
--------------------------------------------------------------------------------
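  1 | QEMU supports very few chipset architectures. Before Q35, the i440FX + PIIX architecture, which dates from 1996, was the only one holding the line. Meanwhile Intel kept releasing new chipsets and introduced new things such as PCIe and AHCI, and i440FX could no longer meet the need. So at KVM Forum 2012 Jason Baron presented the slides "A New Chipset For Qemu - Intel's Q35". Q35 is a chipset Intel released in June 2007, and its most attractive feature is PCIe support.
  2 | 
  3 | According to the Intel Q35 documentation, the Q35 topology looks like this: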
  4 | 
  5 | ![](http://illustration-10018028.file.myqcloud.com/20170717233926.png)
  6 | 
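  7 | As shown, the north bridge is the MCH and the south bridge is the ICH9. The CPU connects to the north bridge (MCH) through the front-side bus (FSB). The north bridge provides access to memory, PCIe and so on, and also connects to the south bridge (ICH9). The south bridge provides access to USB / PCIe / SATA and so on.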
  8 | 
  9 | 
 10 | ### Q35 topology
 11 | 
 12 | So is the Q35 topology implemented in QEMU really as shown in the figure above? Querying with `info qtree` in QEMU gives the following simplified structure:
 13 | 
 14 | ```
 15 | (qemu) info qtree
 16 | bus: main-system-bus
 17 |   dev: hpet, id ""
 18 |     gpio-in "" 2
 19 |     gpio-out "" 1
 20 |     gpio-out "sysbus-irq" 32
 21 |   dev: ioapic, id ""
 22 |     gpio-in "" 24
 23 |     version = 32 (0x20)
 24 |   dev: q35-pcihost, id ""
 25 |     bus: pcie.0
 26 |       type PCIE
 27 |       dev: e1000, id ""
 28 |       dev: VGA, id ""
 29 |       dev: ICH9 SMB, id ""
 30 |       dev: ich9-ahci, id ""
 31 |         bus: ide.5
 32 |           type IDE
 33 |         bus: ide.4
 34 |           type IDE
 35 |         bus: ide.3
 36 |           type IDE
 37 |         bus: ide.2
 38 |           type IDE
 39 |           dev: ide-cd, id ""
 40 |         bus: ide.1
 41 |           type IDE
 42 |         bus: ide.0
 43 |           type IDE
 44 |           dev: ide-hd, id ""
 45 |             drive = "ide0-hd0"
 46 |       dev: ICH9-LPC, id ""
 47 |         gpio-out "gsi" 24
 48 |         bus: isa.0
 49 |           type ISA
 50 |           dev: i8257, id ""
 51 |           dev: i8257, id ""
 52 |           dev: port92, id ""
 53 |             gpio-out "a20" 1
 54 |           dev: vmmouse, id ""
 55 |           dev: vmport, id ""
 56 |           dev: i8042, id ""
 57 |             gpio-out "a20" 1
 58 |           dev: isa-parallel, id ""
 59 |           dev: isa-serial, id ""
 60 |           dev: isa-pcspk, id ""
 61 |             iobase = 97 (0x61)
 62 |           dev: isa-pit, id ""
 63 |             gpio-in "" 1
 64 |             gpio-out "" 1
 65 |           dev: mc146818rtc, id ""
 66 |             gpio-out "" 1
 67 |           dev: isa-i8259, id ""
 68 |             gpio-in "" 8
 69 |             gpio-out "" 1
 70 |             master = false
 71 |           dev: isa-i8259, id ""
 72 |             gpio-in "" 8
 73 |             gpio-out "" 1
 74 |             master = true
 75 |       dev: mch, id ""
 76 |   dev: fw_cfg_io, id ""
 77 |   dev: kvmclock, id ""
 78 |   dev: kvmvapic, id ""
 79 | ```
 80 | 
 81 | 
 82 | Note that dev and bus alternate. An even more simplified device diagram:
 83 | 
 84 | ```
 85 |     bus             dev          bus     dev         bus       dev
 86 | main-system-bus - ioapic
 87 |                 - q35-pcihost - pcie.0 - mch
 88 |                                        - ICH9-LPC - isa.0 - isa-i8259
 89 |                                        - ICH9 SMB
 90 |                                        - ich9-ahci
 91 |                                        - VGA
 92 |                                        - e1000
 93 |                                        - ...
 94 | ```
 95 | 
 96 | 
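 97 | Being unfamiliar with the hardware and architecture, I went back and forth over this for a day before it started to make sense. My understanding is as follows:
 98 | 
 99 | main-system-bus is the system bus. ioapic connects directly to the system bus, which matches what we know about the IOAPIC: in the Q35 architecture diagram the CPUs are the Core and Pentium E2000 series, and according to the documentation Intel dropped the APIC bus after Pentium 4 / Xeon and uses the system bus instead.
100 | 
101 | What puzzled me is why mch is attached to pcie.0. By definition MCH is Intel's name for the north bridge. But MCH actually stands for memory controller hub, and here it really does refer to the memory controller hub itself.
102 | 
103 | Since mch is not the north bridge, q35-pcihost must be. For one thing, host bridge is another name for the north bridge; for another, it connects upward to the system bus and downward to the PCIe bus. Now look at its definition: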
104 | 
105 | ```c
106 | typedef struct Q35PCIHost {
107 |     /*< private >*/
108 |     PCIExpressHost parent_obj;
109 |     /*< public >*/
110 | 
111 |     MCHPCIState mch;
112 | } Q35PCIHost;
113 | 
114 | typedef struct MCHPCIState {
115 |     /*< private >*/
116 |     PCIDevice parent_obj;
117 |     /*< public >*/
118 | 
119 |     MemoryRegion *ram_memory;
120 |     MemoryRegion *pci_address_space;
121 |     MemoryRegion *system_memory;
122 |     MemoryRegion *address_space_io;
123 |     PAMMemoryRegion pam_regions[13];
124 |     MemoryRegion smram_region, open_high_smram;
125 |     MemoryRegion smram, low_smram, high_smram;
126 |     MemoryRegion tseg_blackhole, tseg_window;
127 |     Range pci_hole;
128 |     uint64_t below_4g_mem_size;
129 |     uint64_t above_4g_mem_size;
130 |     uint64_t pci_hole64_size;
131 |     uint32_t short_root_bus;
132 | } MCHPCIState;
133 | ```
134 | 
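135 | This further confirms my idea. And since the memory controller is itself numbered as a PCI device, attaching it to pcie.0 is reasonable enough.
136 | 
137 | Likewise ICH9-LPC is the PCI/ISA bridge, which is what we usually call the south bridge. An ISA bus hangs off it to connect legacy ISA devices such as the PIC.
138 | 
139 | 
140 | 
141 | The lshw output from inside the virtual machine helps in understanding this: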
142 | 
143 | ```
144 | binss@ubuntu:~$ lshw -businfo
145 | WARNING: you should run this program as super-user.
146 | Bus info          Device      Class      Description
147 | ====================================================
148 |                               system     Computer
149 |                               bus        Motherboard
150 |                               memory     1999MiB System memory
151 | cpu@0                         processor  Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20G
152 | cpu@1                         processor  Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20G
153 | cpu@2                         processor  Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20G
154 | cpu@3                         processor  Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20G
155 | pci@0000:00:00.0              bridge     82G33/G31/P35/P31 Express DRAM Controll
156 | pci@0000:00:01.0              display    VGA compatible controller
157 | pci@0000:00:02.0  enp0s2      network    82540EM Gigabit Ethernet Controller
158 | pci@0000:00:1f.0              bridge     82801IB (ICH9) LPC Interface Controller
159 | pci@0000:00:1f.2              storage    82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA
160 | pci@0000:00:1f.3              bus        82801I (ICH9 Family) SMBus Controller
161 |                   scsi2       storage
162 | scsi@2:0.0.0      /dev/cdrom  disk       DVD reader
163 | ```
164 | 
165 | 
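166 | ### Device relationships
167 | In my view there are two kinds of relationship. One is the attachment relationship, which defines relationships between devices by setting members of the class objects. The other is the composition tree + association relationship, which defines relationships between QOM objects by setting device properties.
168 | 
169 | #### Attachment relationship
170 | 
171 | Devices are attached to a bus. The call chain for creating a device object is: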
172 | 
173 | ```
174 | qdev_create => qdev_try_create => if the bus argument is NULL, bus = sysbus_get_default() => if the global main_system_bus does not exist, call main_system_bus_create to create and return it
175 |                                => qdev_set_parent_bus => set dev->parent_bus
176 | ```
177 | 
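178 | So which bus a device is attached to is determined by the parent_bus member of its DeviceState.
179 | Here main_system_bus is main-system-bus. Devices created without a bus argument are all attached to it, for example ioapic and q35-pcihost.
180 | 
181 | Likewise, devices such as ICH9-LPC pass a bus argument pointing to pcie.0 to qdev_create, which means they are attached to the pcie.0 PCIBus.
182 | 
183 | In addition, although the parent_bus of its "parent class" DeviceState already says which bus it is attached to, a PCI device (PCIDevice) also maintains its own bus member pointing to the PCIBus. That member is set in do_pci_register_device.
184 | 
185 | The above defines the relationship between a downstream device and its bus. The relationship between an upstream device and its bus is expressed by a bus member of the device. For example:
186 | 
187 | The PCIExpressHost-PCIHostState of q35-pcihost has a bus member of type PCIBus, which is set to pcie.0 in q35_host_realize.
188 | 
189 | The ICH9LPCState of ICH9-LPC has an isa_bus member of type ISABus, which is set to isa.0 in ich9_lpc_realize (the name is assembled in qbus_realize).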
190 | 
191 | 
192 | 
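193 | #### Composition tree relationship
194 | 
195 | Both bus and device are, in essence, QOM objects. These objects are linked together by paths and form a composition tree.
196 | 
197 | A trimmed info qom-tree (with irq and memory-region removed) looks like this: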
198 | 
199 | ```
200 | /machine (pc-q35-2.8-machine)
201 |   /unattached (container)
202 |     /device[2] (host-x86_64-cpu)
203 |       /lapic (kvm-apic)
204 |     /device[6] (ICH9-LPC)
205 |       /isa.0 (ISA)
206 |     /device[15] (i8042)
207 |     /device[19] (i8257)
208 |     /device[3] (host-x86_64-cpu)
209 |       /lapic (kvm-apic)
210 |     /device[7] (isa-i8259)
211 |     /device[20] (i8257)
212 |     /device[24] (ICH9 SMB)
213 |       /i2c (i2c-bus)
214 |     /device[8] (isa-i8259)
215 |     /device[17] (vmmouse)
216 |     /device[21] (ich9-ahci)
217 |       /ide.4 (IDE)
218 |       /ide.5 (IDE)
219 |       /ide.0 (IDE)
220 |       /ide.1 (IDE)
221 |       /ide.2 (IDE)
222 |       /ide.3 (IDE)
223 |     /sysbus (System)
224 |     /device[33] (VGA)
225 |   /q35 (q35-pcihost)
226 |     /pcie.0 (PCIE)
227 |     /ioapic (ioapic)
228 |     /mch (mch)
229 |   /peripheral-anon (container)
230 |   /peripheral (container)
231 | ```
232 | 
233 | This relationship is maintained through child properties.
234 | 
235 | main_system_bus_create creates main-system-bus and at the same time calls:
236 | 
237 | ```c
238 | object_property_add_child(container_get(qdev_get_machine(), "/unattached"), "sysbus", OBJECT(main_system_bus), NULL);
239 | ```
240 | 
241 | qdev_get_machine is defined as follows:
242 | 
243 | ```c
244 | Object *qdev_get_machine(void)
245 | {
246 |     static Object *dev;
247 | 
248 |     if (dev == NULL) {
249 |         dev = container_get(object_get_root(), "/machine");
250 |     }
251 | 
252 |     return dev;
253 | }
254 | 
255 | Object *object_get_root(void)
256 | {
257 |     static Object *root;
258 | 
259 |     if (!root) {
260 |         root = object_new("container");
261 |     }
262 | 
263 |     return root;
264 | }
265 | 
266 | Object *container_get(Object *root, const char *path)
267 | {
268 |     Object *obj, *child;
269 |     gchar **parts;
270 |     int i;
271 | 
272 |     // split the path
273 |     parts = g_strsplit(path, "/", 0);
274 |     assert(parts != NULL && parts[0] != NULL && !parts[0][0]);
275 |     obj = root;
276 | 
277 |     // walk level by level; if a level does not exist, create a container to fill it in
278 |     for (i = 1; parts[i] != NULL; i++, obj = child) {
279 |         child = object_resolve_path_component(obj, parts[i]);
280 |         if (!child) {
281 |             child = object_new("container");
282 |             object_property_add_child(obj, parts[i], child, NULL);
283 |         }
284 |     }
285 | 
286 |     g_strfreev(parts);
287 | 
288 |     return obj;
289 | }
290 | ```
291 | 
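292 | First, qdev_get_machine calls container_get(object_get_root(), "/machine"), and object_get_root returns the root object, creating it first if it does not exist yet. We can see that it is a container. Its definition shows that it inherits from Object without defining any new members, so it is effectively an alias: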
293 | 
294 | ```c
295 | static const TypeInfo container_info = {
296 |     .name          = "container",
297 |     .instance_size = sizeof(Object),
298 |     .parent        = TYPE_OBJECT,
299 | };
300 | ```
301 | 
302 | 
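303 | container_get then starts from its first argument (root) and returns the object whose relative path is its second argument (path). It splits the relative path on "/" and calls object_resolve_path_component for each level: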
304 | 
305 | ```c
306 | Object *object_resolve_path_component(Object *parent, const gchar *part)
307 | {
308 |     ObjectProperty *prop = object_property_find(parent, part, NULL);
309 |     if (prop == NULL) {
310 |         return NULL;
311 |     }
312 | 
313 |     if (prop->resolve) {
314 |         return prop->resolve(parent, prop->opaque, part);
315 |     } else {
316 |         return NULL;
317 |     }
318 | }
319 | ```
320 | 
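321 | In essence it looks up the property named after the path component and returns its value. That value is the next-level object on the path below the current object; object_resolve_path_component is then called again from that object to find the level after it.
322 | 
323 | This iterates until every level split out of the relative path has been visited, and the last-level object, i.e. the target object, is returned.
324 | 
325 | If the object at some level does not exist, a container is created and added as a child property of the current level, and the walk continues with the next level.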
326 | 
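The walk-and-create behaviour can be tried out with a toy model that has no QOM or glib dependency. Everything below is a hypothetical sketch: `ToyObject` keeps a fixed array of named children instead of properties, and `toy_container_get` mirrors only the path logic described above.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_MAX_CHILDREN 8

typedef struct ToyObject {
    char name[32];
    const char *type;
    struct ToyObject *children[TOY_MAX_CHILDREN];
    int nchildren;
} ToyObject;

static ToyObject *toy_new(const char *type, const char *name)
{
    ToyObject *o = calloc(1, sizeof(*o));

    if (!o)
        abort();
    o->type = type;
    strncpy(o->name, name, sizeof(o->name) - 1);  /* calloc keeps it terminated */
    return o;
}

static ToyObject *toy_add_child(ToyObject *parent, ToyObject *child)
{
    if (parent->nchildren >= TOY_MAX_CHILDREN)
        abort();
    parent->children[parent->nchildren++] = child;
    return child;
}

/* Look up one path component among the children of parent. */
static ToyObject *toy_resolve_component(ToyObject *parent, const char *part)
{
    for (int i = 0; i < parent->nchildren; i++)
        if (strcmp(parent->children[i]->name, part) == 0)
            return parent->children[i];
    return NULL;
}

/* Walk a path such as "/machine/unattached" from root, creating a
 * "container" for every missing level, and return the last object. */
static ToyObject *toy_container_get(ToyObject *root, const char *path)
{
    char buf[128];
    ToyObject *obj = root;

    strncpy(buf, path, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    for (char *part = strtok(buf, "/"); part; part = strtok(NULL, "/")) {
        ToyObject *child = toy_resolve_component(obj, part);

        if (!child)
            child = toy_add_child(obj, toy_new("container", part));
        obj = child;
    }
    return obj;
}
```

Calling `toy_container_get(root, "/machine")` twice returns the same object, because the second call finds the container created by the first one instead of creating another.
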
327 | So the path of main-system-bus is /machine/unattached/sysbus, matching the qom-tree. We can find a device object from its path with object_resolve_path:
328 | 
329 | ```
330 | (gdb) p object_resolve_path("/machine/unattached/sysbus", 0)
331 | $92 = (Object *) 0x555556769d20
332 | (gdb) p object_resolve_path("/machine/unattached/sysbus", 0)->class->type->name
333 | $93 = 0x555556683760 "System"
334 | ```
335 | 
336 | Similarly, after:
337 | 
338 | ```c
339 | object_property_add_child(qdev_get_machine(), "q35", OBJECT(q35_host), NULL);
340 | object_property_add_child(object_resolve_path(parent_name, NULL), "ioapic", OBJECT(dev), NULL);
341 | ```
342 | 
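343 | the path of ioapic is /machine/q35/ioapic. Recall that, as mentioned above, ioapic and q35-pcihost both hang off main-system-bus as peers, so the attachment relationship and the composition tree relationship clearly do not agree.
344 | 
345 | 
346 | A bus is added as a child property of the object above it in qbus_realize => object_property_add_child(OBJECT(bus->parent), bus->name, OBJECT(bus), NULL). For example, pcie.0 becomes the pcie.0 property of q35-pcihost, so its path is /machine/q35/pcie.0.
347 | 
348 | A memory region is added as a child property of the object above it in memory_region_init => object_property_add_child.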
349 | 
350 | 
351 | 
352 | 
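353 | #### Association relationship
354 | 
355 | In the composition tree, devices form a device tree through child properties, rooted at machine and branching downward by category. Using a path, we can start from the root and iterate down level by level to find the target device.
356 | 
357 | But the composition tree cannot express every relationship between devices. In a child relationship the parent node controls the lifetime of the child node. What if we do not need such a strong relationship and only want to express that one device is associated with another?
358 | 
359 | For this QOM defines the backlink relationship. Setting a link property on a device expresses that it references another device; once set, a device can reach the other device through its link property, which also models two devices being wired directly together in hardware.
360 | 
361 | The most typical backlink is between a bus and the child devices plugged into it. The child device has a link property pointing to the bus. For example:
362 | 
363 | pc_q35_init => qdev_create(NULL, TYPE_Q35_HOST_DEVICE) => qdev_try_create => object_new calls the instance init function of its parent class, device_initfn: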
364 | 
365 | 
366 | ```c
367 | static void device_initfn(Object *obj)
368 | {
369 |     ...
370 |     object_property_add_link(OBJECT(dev), "parent_bus", TYPE_BUS,
371 |                              (Object **)&dev->parent_bus, NULL, 0,
372 |                              &error_abort);
373 | }
374 | ```
375 | 
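376 | This sets main-system-bus as a link property of q35-pcihost, named `parent_bus`. Note that dev->parent_bus is still empty here, because parent_bus has not been set yet. It is only filled in by:
377 | 
378 | pc_q35_init => qdev_create(NULL, TYPE_Q35_HOST_DEVICE) => qdev_try_create => qdev_set_parent_bus, in `dev->parent_bus = bus;`. So the property does not store the value of the object member but a pointer to the member. Since the parent_bus member is itself a pointer, this is a pointer to a pointer (Object **).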
379 | 
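Why storing the address of the member matters can be shown with a few lines of plain C. The types below are hypothetical miniatures, not QOM structures: the link remembers where the pointer lives, so an assignment made after the link was created is still visible through it.

```c
#include <stddef.h>

typedef struct ToyBus { const char *name; } ToyBus;
typedef struct ToyDevice { ToyBus *parent_bus; } ToyDevice;

/* A link records the location of the pointer, not its current value. */
typedef struct ToyLink {
    const char *name;
    ToyBus **target;
} ToyLink;

static ToyLink toy_add_link(const char *name, ToyBus **target)
{
    ToyLink l = { name, target };
    return l;
}

/* Resolve the link at lookup time by dereferencing the stored location. */
static ToyBus *toy_link_resolve(const ToyLink *l)
{
    return *l->target;
}
```

If the link is created while `parent_bus` is still NULL and the member is assigned afterwards, `toy_link_resolve` returns the bus without the link having to be touched again.
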
380 | Conversely, the bus has link properties pointing to its child devices:
381 | 
382 | In qdev_set_parent_bus, the line right after `dev->parent_bus = bus;` is bus_add_child:
383 | 
384 | ```c
385 | static void bus_add_child(BusState *bus, DeviceState *child)
386 | {
387 |     char name[32];
388 |     BusChild *kid = g_malloc0(sizeof(*kid));
389 | 
390 |     kid->index = bus->max_index++;
391 |     kid->child = child;
392 |     object_ref(OBJECT(kid->child));
393 | 
394 |     QTAILQ_INSERT_HEAD(&bus->children, kid, sibling);
395 | 
396 |     /* This transfers ownership of kid->child to the property.  */
397 |     snprintf(name, sizeof(name), "child[%d]", kid->index);
398 |     object_property_add_link(OBJECT(bus), name,
399 |                              object_get_typename(OBJECT(child)),
400 |                              (Object **)&kid->child,
401 |                              NULL, /* read-only property */
402 |                              0, /* return ownership on prop deletion */
403 |                              NULL);
404 | }
405 | ```
406 | 
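407 | It sets the q35-pcihost device as a link property of main-system-bus, named `child[3]` (note: the first three are kvmvapic, kvmclock and fw_cfg_io), whose value is the address of the corresponding child in bus->children (a pointer to a pointer).
408 | 
409 | So the relationship between bus and device is tied together by two link properties.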
410 | 
411 | 
412 | ```
413 | /machine (pc-q35-2.8-machine)
414 |   /unattached (container)
415 |     /sysbus (System) <------------------
416 |                       -------          |
417 |                             | child[3] | parent_bus
418 |                       <------          |
419 |   /q35 (q35-pcihost) -------------------
420 |     /pcie.0 (PCIE)
421 |     /ioapic (ioapic)
422 |     /mch (mch)
423 | ```
424 | 
425 | If the relationship between devices is a tree, then with backlinks added it becomes a directed graph.
426 | 
427 | We can therefore reach q35-pcihost through the path /machine/unattached/sysbus/child[3]:
428 | 
429 | ```
430 | (gdb) p object_resolve_path("/machine/unattached/sysbus/child[3]", 0)
431 | $95 = (Object *) 0x55555694ec00
432 | (gdb) p object_resolve_path("/machine/unattached/sysbus/child[3]", 0)->class->type->name
433 | $96 = 0x555556690ae0 "q35-pcihost"
434 | ```
435 | 
436 | Paths can also loop back, which allows tricks like this:
437 | 
438 | ```
439 | (gdb) p object_resolve_path("/machine/unattached/sysbus", 0)
440 | $97 = (Object *) 0x555556769d20
441 | (gdb) p object_resolve_path("/machine/unattached/sysbus/child[3]/parent_bus", 0)
442 | $98 = (Object *) 0x555556769d20
443 | (gdb) p object_resolve_path("/machine/unattached/sysbus/child[3]/parent_bus/child[3]/parent_bus", 0)
444 | $99 = (Object *) 0x555556769d20
445 | (gdb) p object_resolve_path("/machine/unattached/sysbus/child[3]/parent_bus/child[3]/parent_bus/child[3]/parent_bus", 0)
446 | $100 = (Object *) 0x555556769d20
447 | ...
448 | ```
449 | 
450 | 
451 | 
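452 | ### Summary
453 | This article gave a rough overview of how the "latest" q35 architecture in QEMU is put together, and also analyzed the relationships between devices.
454 | 
455 | Next we will look at device emulation in QEMU, including the interrupt controllers PIC and IOAPIC, and how interrupts are delivered and routed between devices.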
456 | 
457 | 
458 | 


--------------------------------------------------------------------------------
/schedule.md:
--------------------------------------------------------------------------------
 1 | ## Schedule
 2 | 
 3 | ### 3.17
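 4 | #### Plan
 5 | Discuss the vCPU-related implementation
 6 | 
 7 | #### Summary
 8 | Studied and discussed how to look up VMX instructions in the manual; agreed to consult and study this manual (SDM-VOL-3C) over the long term
 9 | 
10 | Discussed the principle and flow of shoot4u
11 | 
12 | Could not determine which code is vCPU-related; based on the manual, start from the VMCS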
13 | 
14 | 
15 | ### 3.24
16 | 
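17 | #### Plan
18 | Study the principles of the VMCS part of the manual
19 | 
20 | Together with the code, present and discuss the VMCS structure and the purpose and implementation of all related interfaces, and produce a summary document
21 | 
22 | #### Summary
23 | Discussed the layout and fields of the VMCS.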
24 | 
25 | 
26 | ### 3.31
27 | 
28 | #### Plan
29 | Study and annotate the kvm_arch_vcpu_create, kvm_arch_vcpu_setup and kvm_arch_vcpu_postcreate functions.
30 | 
31 | #### Summary
32 | Annotation finished
33 | 
34 | 
35 | ### 4.7 + 4.14
36 | Clock-related topics.
37 | 
38 | @tcbbd tsc
39 | @binss kvmclock
40 | @chrisbyd pit
41 | 
42 | ### 4.21
43 | @tcbbd tsc, lifecycle
44 | @binss live migration
45 | @chrisbyd interrupts, interrupt injection
46 | 


--------------------------------------------------------------------------------
/shoot4u.md:
--------------------------------------------------------------------------------
  1 | 
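  2 | These are study notes on [Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs
  3 | ](http://dl.acm.org/citation.cfm?id=2892245) / [patch](https://github.com/ouyangjn/shoot4u).
  4 | 
  5 | The method improves performance by flushing the TLB directly in the VMM instead of waiting until the corresponding vCPU is scheduled. It also uses individual-address invalidation, so only the specified address needs to be flushed.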
  6 | 
  7 | ## Background
  8 | 
  9 | __invvpid is already defined in KVM:
 10 | 
 11 | ```c
 12 | static inline void __invvpid(int ext, u16 vpid, gva_t gva)
 13 | {
 14 |     struct {
 15 |     u64 vpid : 16;
 16 |     u64 rsvd : 48;
 17 |     u64 gva;
 18 |     } operand = { vpid, 0, gva };
 19 | 
 20 |     asm volatile (__ex(ASM_VMX_INVVPID)
 21 |           /* CF==1 or ZF==1 --> rc = -1 */
 22 |           "; ja 1f ; ud2 ; 1:"
 23 |           : : "a"(&operand), "c"(ext) : "cc", "memory");
 24 | }
 25 | ```
 26 | 
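 27 | It issues the INVVPID instruction, which invalidates TLB entries and paging-structure caches based on the VPID. It takes an invalidation type plus a descriptor holding the VPID and a linear address.
 28 | 
 29 | There are four types:
 30 | 
 31 | * Individual-address invalidation(type=0)
 32 | 
 33 |     Targets entries tagged with the VPID whose address is the specified (passed-in) address
 34 | 
 35 | * Single-context invalidation(type=1)
 36 | 
 37 |     Targets entries tagged with the VPID
 38 | 
 39 | * All-contexts invalidation(type=2)
 40 | 
 41 |     Targets entries for all VPIDs except 0000H (presumably the VMM)
 42 | 
 43 | * Single-context invalidation, retaining global translations(type=3)
 44 | 
 45 |     Targets TLB entries tagged with the VPID, but retains global translations
 46 | 
 47 | 
 48 | 
 49 | VMX_VPID_EXTENT_SINGLE_CONTEXT(1) and VMX_VPID_EXTENT_ALL_CONTEXT(2) were already in use: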
 50 | 
 51 | 
 52 | ```c
 53 | static inline void vpid_sync_vcpu_single(struct vcpu_vmx *vmx)
 54 | {
 55 |         if (vmx->vpid == 0)
 56 |                 return;
 57 | 
 58 |         if (cpu_has_vmx_invvpid_single())
 59 |                 __invvpid(VMX_VPID_EXTENT_SINGLE_CONTEXT, vmx->vpid, 0);
 60 | }
 61 | 
 62 | static inline void vpid_sync_vcpu_global(void)
 63 | {
 64 |         if (cpu_has_vmx_invvpid_global())
 65 |                 __invvpid(VMX_VPID_EXTENT_ALL_CONTEXT, 0, 0);
 66 | }
 67 | ```
 68 | 
 69 | ## Host changes (KVM)
 70 | 
 71 | ### arch/x86/include/asm/vmx.h
 72 | 
 73 | Adds the type-0 operation, i.e. individual-address invalidation
 74 | 
 75 | ```c
 76 | #define VMX_VPID_EXTENT_INDIVIDUAL_ADDR        0
 77 | #define VMX_VPID_EXTENT_INDIVIDUAL_ADDR_BIT      (1ull << 8) /* (40 - 32) */
 78 | ```
 79 | 
 80 | ### arch/x86/kvm/vmx.c
 81 | 
 82 | 
 83 | Three operations are added after tlb_flush
 84 | 
 85 | ```diff
 86 |     .tlb_flush = vmx_flush_tlb,
 87 | 
 88 | +   .tlb_flush_vpid_single_ctx = vmx_flush_tlb_single_ctx,
 89 | +   .tlb_flush_vpid_single_addr = vmx_flush_tlb_single_addr,
 90 | +   .get_vpid = vmx_get_vpid,
 91 | ```
 92 | 
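 93 | In the current version these live in vmx_x86_ops, which holds the operations supported on the vmx platform. This structure is passed as an argument when KVM is initialized (kvm_init) and stored in kvm_x86_ops, which amounts to registering the functions.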
 94 | 
 95 | 
 96 | #### vmx_get_vpid
 97 | 
 98 | Reads the current vpid from the vcpu_vmx structure. vcpu_vmx embeds kvm_vcpu and represents a vcpu on the vmx platform.
 99 | 
100 | 
101 | ```c
102 | static inline int vmx_get_vpid(struct kvm_vcpu *vcpu)
103 | {
104 |         struct vcpu_vmx *vmx = container_of(vcpu, struct vcpu_vmx, vcpu);
105 |         return vmx->vpid;
106 | }
107 | ```
108 | 
109 | #### vmx_flush_tlb_single_ctx
110 | 
111 | The old method (single/all): flushes everything
112 | 
113 | ```c
114 | static void vmx_flush_tlb_single_ctx(struct kvm_vcpu *vcpu)
115 | {
116 |     vpid_sync_context(to_vmx(vcpu));
117 | }
118 | ```
119 | 
120 | #### vmx_flush_tlb_single_addr
121 | 
122 | Tries to flush a single address.
123 | 
124 | vmx_capability holds the information read from the MSR_IA32_VMX_EPT_VPID_CAP MSR, with the vpid part in the high 32 bits, so this actually reads bit 8 of vmx_capability.vpid
125 | 
126 | 
127 | ```c
128 | static inline bool cpu_has_vmx_invvpid_addr(void)
129 | {
130 |     return vmx_capability.vpid & VMX_VPID_EXTENT_INDIVIDUAL_ADDR_BIT;
131 | }
132 | 
133 | 
134 | static inline void vpid_sync_addr(struct vcpu_vmx *vmx, unsigned long addr)
135 | {
136 |     if (vmx->vpid == 0)
137 |         return;
138 | 
139 |     // if the CPU supports the new feature, invalidate only this address
140 |     if (cpu_has_vmx_invvpid_addr())
141 |         __invvpid(VMX_VPID_EXTENT_INDIVIDUAL_ADDR, vmx->vpid, addr);
142 |     else
143 |     // otherwise fall back to the old method, single/all
144 |         vpid_sync_context(vmx);
145 | }
146 | 
147 | static void vmx_flush_tlb_single_addr(struct kvm_vcpu *vcpu, unsigned long addr)
148 | {
149 |     // check for support, then flush
150 |     vpid_sync_addr(to_vmx(vcpu), addr);
151 | }
152 | 
153 | ```
154 | 
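The relation between bit 40 of the MSR and bit 8 of vmx_capability.vpid can be checked with a small sketch. It assumes, as described above, that the low 32 bits of the MSR value go to the ept half and the high 32 bits to the vpid half; the type and function names are hypothetical.

```c
#include <stdint.h>

/* The two halves of the 64-bit capability MSR value (hypothetical type). */
typedef struct {
    uint32_t ept;   /* low 32 bits */
    uint32_t vpid;  /* high 32 bits */
} ToyVmxCap;

static ToyVmxCap toy_split_cap(uint64_t msr)
{
    ToyVmxCap c = { (uint32_t)msr, (uint32_t)(msr >> 32) };
    return c;
}

/* Bit 40 of the MSR becomes bit 40 - 32 = 8 of the vpid half. */
static int toy_has_invvpid_addr(ToyVmxCap c)
{
    return (c.vpid & (1u << (40 - 32))) != 0;
}
```

An MSR value with only bit 40 set reports support, while one with only bit 41 or only bit 8 set does not.
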
155 | 
156 | 
157 | 
158 | ### include/uapi/linux/kvm_para.h
159 | 
160 | ```c
161 | #define KVM_HC_SHOOT4U         12
162 | ```
163 | 
164 | 
165 | Added in kvm_emulate_hypercall:
166 | 
167 | ```diff
168 | +   case KVM_HC_SHOOT4U:
169 | +       kvm_pv_shoot4u_op(vcpu, a0, a1, a2);
170 | +       ret = 0;
171 | +       break;
172 | ```
173 | 
174 | This registers a new hypercall type, handled by kvm_pv_shoot4u_op when invoked. Depending on the configured mode, kvm_pv_shoot4u_op calls the vmx_flush_tlb_single_ctx / vmx_flush_tlb_single_addr described above.
175 | 
176 | 
177 | ### arch/x86/kvm/x86.c
178 | 
179 | ```c
180 | /* lapic timer advance (tscdeadline mode only) in nanoseconds */
181 | #define SHOOT4U_MODE_DEFAULT   0
182 | #define SHOOT4U_MODE_TEST1     1
183 | #define SHOOT4U_MODE_TEST2     2
184 | #define SHOOT4U_MODE_TEST3     3
185 | unsigned int shoot4u_mode = SHOOT4U_MODE_DEFAULT;
186 | module_param(shoot4u_mode, uint, S_IRUGO | S_IWUSR);
187 | ```
188 | 
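189 | Different modes can be enabled:
190 | 
191 | * 0  flush the whole tlb
192 | * 1  flush the tlb for the vpid
193 | * 2  if there is an end address, flush the whole tlb; otherwise try to flush a single address
194 | * 3  if there is an end address, flush the tlb for the vpid; otherwise try to flush a single address
195 | 
196 | Presumably specifying a range is not supported at present, so only a single entry can be flushed?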
197 | 
198 | 
199 | ```c
200 | /*
201 |  * kvm_pv_shoot4u_op:  Handle tlb shoot down hypercall
202 |  *
203 |  * @apicid - apicid of vcpu to be kicked.
204 |  */
205 | 
206 | // the current vcpu, a bitmap of the target vcpus in the VM, and the start and end addresses to invalidate
207 | static void kvm_pv_shoot4u_op(struct kvm_vcpu *vcpu, unsigned long vcpu_bitmap,
208 |         unsigned long start, unsigned long end)
209 | {
210 |     struct kvm_shoot4u_info info;
211 |     struct kvm *kvm = vcpu->kvm;
212 |     struct kvm_vcpu *v;
213 |     int i;
214 | 
215 |     info.flush_start = start;
216 |     info.flush_end = end;
217 | 
218 |     //printk("[shoot4u] inside hypercall handler\n");
219 |     // construct phsical cpu mask from vcpu bitmap
220 |     // for every vcpu in the VM other than the caller, check whether it is in the bitmap; if so, call flush_tlb_func_shoot4u on its physical cpu to flush its tlb
221 |     kvm_for_each_vcpu(i, v, kvm) {
222 |         if (v != vcpu && test_bit(v->vcpu_id, (void*)&vcpu_bitmap)) {
223 |             info.vcpu = v;
224 |             //printk("[shoot4u] before send IPI to vcpu %d at pcpu %d\n", v->vcpu_id, v->cpu);
225 |             // it is fine if a vCPU migrates because migratation triggers tlb_flush automatically
226 |             smp_call_function_single(v->cpu, flush_tlb_func_shoot4u, &info, 1);
227 |         }
228 |     }
229 | }
230 | 
231 | struct kvm_shoot4u_info {
232 |     struct kvm_vcpu *vcpu;
233 |     unsigned long flush_start;
234 |     unsigned long flush_end;
235 | };
236 | 
237 | 
238 | // cross-processor operation
239 | /* shoot4u host IPI handler with invvipd */
240 | static void flush_tlb_func_shoot4u(void *info)
241 | {
242 |     struct kvm_shoot4u_info *f = info;
243 | 
244 |     //printk("[shoot4u] IPI handler at pCPU %d: invalidate vCPU %d\n", smp_processor_id(), f->vcpu->vcpu_id);
245 |     if (shoot4u_mode == SHOOT4U_MODE_DEFAULT) {
246 |         // all (linear + EPT mappings)
247 |         kvm_x86_ops->tlb_flush(f->vcpu);
248 |     } else if (shoot4u_mode == SHOOT4U_MODE_TEST1) {
249 |         // all (linear mappings only)
250 |         kvm_x86_ops->tlb_flush_vpid_single_ctx(f->vcpu);
251 |     } else if (shoot4u_mode == SHOOT4U_MODE_TEST2) {
252 |         // single or all (linear + EPT mappings)
253 |         if (!f->flush_end)
254 |             kvm_x86_ops->tlb_flush_vpid_single_addr(f->vcpu, f->flush_start);
255 |         else {
256 |             kvm_x86_ops->tlb_flush(f->vcpu);
257 |         }
258 |     } else if (shoot4u_mode == SHOOT4U_MODE_TEST3) {
259 |         // seg fault
260 |         // single or all (linear mappings only)
261 |         if (!f->flush_end)
262 |             kvm_x86_ops->tlb_flush_vpid_single_addr(f->vcpu, f->flush_start);
263 |         else {
264 |             kvm_x86_ops->tlb_flush_vpid_single_ctx(f->vcpu);
265 |         }
266 |     } else {
267 |         ...
268 |     }
269 | 
270 |     return;
271 | }
272 | ```
273 | 
274 | 
275 | ## Guest
276 | 
277 | ```diff
278 | +#ifdef CONFIG_SHOOT4U
279 | +        pv_mmu_ops.flush_tlb_others = shoot4u_flush_tlb_others;
280 | +#endif
281 | ```
282 | 
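283 | Requests a flush of other vcpus' tlb through kvm_hypercall
284 | 
285 | More than 64 vcpus are not supported for now, because the bitmap is stored in a long, which has only 64 bits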
286 | 
287 | ```c
288 | void shoot4u_flush_tlb_others(const struct cpumask *cpumask,
289 |                 struct mm_struct *mm, unsigned long start,
290 |                 unsigned long end)
291 | {
292 |     // shoot4u currently uses an 8 bytes bitmap to pass target cores
293 |     // thus it supports up to 64 physical cores
294 |     u64 cpu_bitmap = 0;
295 |     int cpu;
296 | 
297 |     // set the bit of every vCPU to be flushed in the bitmap
298 |     for_each_cpu(cpu, cpumask) {
299 |         if (cpu >= 64) {
300 |             panic("[shoot4u] ERROR: do not support more than 64 cores\n");
301 |         }
302 |         set_bit(cpu, (void *)&cpu_bitmap);
303 |     }
304 | 
305 |     //printk("[shoot4u] before KVM_HC_SHOOT4U hypercall, cpumask: %llx, start %lx, end %lx\n", cpu_bitmap, start, end);
306 |     kvm_hypercall3(KVM_HC_SHOOT4U, cpu_bitmap, start, end);
307 | }
308 | 
309 | ```
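The 64-vCPU limit follows directly from packing the targets into one 64-bit word. A minimal userspace sketch of that packing step is shown below; `shoot4u_pack_targets` and `SHOOT4U_MAX_CPUS` are hypothetical names chosen for illustration, and unlike the guest code above the sketch reports an error instead of calling `panic`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name: number of bits available in one 64-bit bitmap. */
#define SHOOT4U_MAX_CPUS 64

/*
 * Pack a list of target vCPU ids into a 64-bit bitmap.
 * Returns false and leaves *bitmap untouched if any id does not fit.
 */
static bool shoot4u_pack_targets(const int *cpus, size_t n, uint64_t *bitmap)
{
    uint64_t bits = 0;

    assert(bitmap != NULL);
    for (size_t i = 0; i < n; i++) {
        if (cpus[i] < 0 || cpus[i] >= SHOOT4U_MAX_CPUS)
            return false;
        bits |= UINT64_C(1) << cpus[i];
    }
    *bitmap = bits;
    return true;
}
```

Supporting more vCPUs would presumably require passing a pointer to a larger bitmap, or issuing several hypercalls, rather than a single register-sized argument.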
310 | 
311 | 
312 | 


--------------------------------------------------------------------------------
/timer_and_clock/pvtimer.md:
--------------------------------------------------------------------------------
  1 | 
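  2 | This document analyzes the pvtimer implementation in [https://lkml.org/lkml/2017/12/8/116](https://lkml.org/lkml/2017/12/8/116).
  3 | 
  4 | ## Original implementation
  5 | 
  6 | The Linux kernel programs the timer through lapic_next_deadline, i.e. it sets the point in time of the next expiry. The original code is simple: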
  7 | 
  8 | ```c
  9 | static int lapic_next_deadline(unsigned long delta,
 10 |                    struct clock_event_device *evt)
 11 | {
 12 |     u64 tsc;
 13 | 
 14 |     tsc = rdtsc();
 15 |     wrmsrl(MSR_IA32_TSC_DEADLINE, tsc + (((u64) delta) * TSC_DIVISOR));
 16 |     return 0;
 17 | }
 18 | ```
 19 | 
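 20 | It reads the current TSC, adds the time to wait (delta), and writes the result to MSR_IA32_TSC_DEADLINE.
 21 | 
 22 | > TSC-Deadline Mode
 23 | > One of the three operating modes of the APIC timer. Writing a non-zero 64-bit value to the MSR arms the timer, and an interrupt is raised once the TSC reaches that value. The interrupt fires only once; afterwards IA32_TSC_DEADLINE_MSR is reset to zero.
 24 | 
 25 | If Linux runs as a guest on top of KVM, the write triggers a VMExit back to KVM, which executes `kvm_set_lapic_tscdeadline_msr(vcpu, data)`: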
 26 | 
 27 | ```c
 28 | void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data)
 29 | {
 30 |     struct kvm_lapic *apic = vcpu->arch.apic;
 31 | 
 32 |     if (!lapic_in_kernel(vcpu) || apic_lvtt_oneshot(apic) ||
 33 |             apic_lvtt_period(apic))
 34 |         return;
 35 | 
 36 |     hrtimer_cancel(&apic->lapic_timer.timer);
 37 |     apic->lapic_timer.tscdeadline = data;
 38 |     start_apic_timer(apic);
 39 | }
 40 | ```
 41 | 
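 42 | KVM thus cancels whatever is pending on apic->lapic_timer.timer and arms it with the new expiry time. Note that a timer set by a vCPU is inserted into the timer red-black tree of the physical CPU.
 43 | 
 44 | 
 45 | The callback of apic->lapic_timer.timer is set to apic_timer_fn in kvm_create_lapic: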
 46 | 
 47 | 
 48 | ```c
 49 | static enum hrtimer_restart apic_timer_fn(struct hrtimer *data)
 50 | {
 51 |     struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer);
 52 |     struct kvm_lapic *apic = container_of(ktimer, struct kvm_lapic, lapic_timer);
 53 | 
 54 |     apic_timer_expired(apic);
 55 | 
 56 |     if (lapic_is_periodic(apic)) {
 57 |         hrtimer_add_expires_ns(&ktimer->timer, ktimer->period);
 58 |         return HRTIMER_RESTART;
 59 |     } else
 60 |         return HRTIMER_NORESTART;
 61 | }
 62 | ```
 63 | 
 64 | 
 65 | ### Guest
 66 | 
 67 | Each CPU maintains a per-CPU variable pvtimer_shared_buf of type pvtimer_vcpu_event_info. In kvm_guest_cpu_init its address is written to MSR_KVM_PV_TIMER_EN so that KVM can fill it in:
 68 | 
 69 | ```c
 70 | #define MSR_KVM_PV_TIMER_EN     0x4b564d05
 71 | ```
 72 | 
 73 | The MSR holds the address of a pvtimer_vcpu_event_info:
 74 | 
 75 | ```c
 76 | struct pvtimer_vcpu_event_info {
 77 |     __u64 expire_tsc;
 78 |     __u64 next_sync_tsc;
 79 | } __attribute__((__packed__));
 80 | ```
 81 | 
 82 | As mentioned above, the Linux kernel originally writes the next expiry time to MSR_IA32_TSC_DEADLINE in lapic_next_deadline, which causes a VMExit. The core idea of the patch is to write it to pvtimer_vcpu_event_info instead, which avoids the VMExit.
 83 | 
 84 | ```c
 85 | static int lapic_next_deadline(unsigned long delta,
 86 |                    struct clock_event_device *evt)
 87 | {
 88 |     u64 tsc = rdtsc() + (((u64) delta) * TSC_DIVISOR);
 89 | 
 90 |     /* TODO: undisciplined function call */
 91 |     if (kvm_pv_timer_next_event(tsc, evt))
 92 |         return 0;
 93 | 
 94 |     wrmsrl(MSR_IA32_TSC_DEADLINE, tsc);
 95 |     return 0;
 96 | }
 97 | 
 98 | static DEFINE_PER_CPU(int, pvtimer_enabled);
 99 | static DEFINE_PER_CPU(struct pvtimer_vcpu_event_info,
100 |              pvtimer_shared_buf) = {0};
101 | #define PVTIMER_PADDING        25000
102 | 
103 | int kvm_pv_timer_next_event(unsigned long tsc,
104 |         struct clock_event_device *evt)
105 | {
106 |     struct pvtimer_vcpu_event_info *src;
107 |     u64 now;
108 | 
109 |     if (!this_cpu_read(pvtimer_enabled))
110 |         return false;
111 | 
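112 |     /* Publish the new deadline in pvtimer_vcpu_event_info.expire_tsc, then check it.
113 |      * If the deadline is:
114 |      *  1. earlier than the next expiry of pv_sync_timer (pvtimer_vcpu_event_info.next_sync_tsc),
115 |      *      the timer would expire before KVM next polls for expired timers, so it must also be set the
116 |      *      traditional way, via MSR_IA32_TSC_DEADLINE, which makes KVM call kvm_apic_sync_pv_timer at once
117 |      *  2. earlier than now (or less than PVTIMER_PADDING ahead), it has expired or is about to, so the interrupt must fire as soon as possible; again fall back to the MSR write
118 |      *  3. otherwise it has not expired yet and nothing more needs to be done
119 |      */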
120 |     src = this_cpu_ptr(&pvtimer_shared_buf);
121 |     xchg((u64 *)&src->expire_tsc, tsc);
122 | 
123 |     barrier();
124 | 
125 |     if (tsc < src->next_sync_tsc)
126 |         return false;
127 | 
128 |     rdtscll(now);
129 |     if (tsc < now || tsc - now < PVTIMER_PADDING)
130 |         return false;
131 | 
132 |     return true;
133 | }
134 | ```
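The decision made by kvm_pv_timer_next_event can be exercised outside the kernel. The following is a minimal userspace model, not the patch itself: `pvtimer_model` and `pvtimer_model_next_event` are hypothetical names, the `now` parameter stands in for `rdtsc()`, and the atomic `xchg` and `barrier()` are replaced by a plain store because the model is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PVTIMER_PADDING 25000ULL /* same value as in the patch, in TSC ticks */

/* Userspace mirror of the shared buffer described above. */
struct pvtimer_model {
    uint64_t expire_tsc;
    uint64_t next_sync_tsc;
};

/*
 * Returns true if publishing the deadline in shared memory is sufficient,
 * false if the caller still has to write MSR_IA32_TSC_DEADLINE.
 */
static bool pvtimer_model_next_event(struct pvtimer_model *buf,
                                     uint64_t deadline, uint64_t now)
{
    assert(buf != 0);

    /* The deadline is always published, even on the slow path. */
    buf->expire_tsc = deadline;

    /* Case 1: expires before the host polls again. */
    if (deadline < buf->next_sync_tsc)
        return false;

    /* Case 2: already expired, or too close to "now". */
    if (deadline < now || deadline - now < PVTIMER_PADDING)
        return false;

    /* Case 3: the host will pick the deadline up on its next poll. */
    return true;
}
```

Only case 3 avoids the MSR write, so the benefit depends on how many guest deadlines lie beyond the host's next polling point.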
135 | 
136 | 
137 | 
138 | 
139 | ### KVM
140 | 
141 | #### Cache initialization
142 | 
143 | When the guest writes MSR_KVM_PV_TIMER_EN, it VMExits to KVM, which calls kvm_lapic_enable_pv_timer to initialize the vcpu->arch.pv_timer structure.
144 | 
145 | ```c
146 | struct {
147 |      u64 msr_val;
148 |      struct gfn_to_hva_cache data;
149 | } pv_timer;
150 | 
151 | int kvm_lapic_enable_pv_timer(struct kvm_vcpu *vcpu, u64 data)
152 | {
153 |     u64 addr = data & ~KVM_MSR_ENABLED;
154 |     int ret;
155 | 
156 |     if (!lapic_in_kernel(vcpu))
157 |         return 1;
158 | 
159 |     if (!IS_ALIGNED(addr, 4))
160 |         return 1;
161 | 
162 |     // save the MSR value (address of pvtimer_vcpu_event_info plus the enable bit)
163 |     vcpu->arch.pv_timer.msr_val = data;
164 |     if (!pv_timer_enabled(vcpu))
165 |         return 0;
166 | 
167 |     // set up the GPA-to-HVA cache
168 |     ret = kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.pv_timer.data,
169 |                     addr, sizeof(struct pvtimer_vcpu_event_info));
170 | 
171 |     return ret;
172 | }
173 | ```
174 | 
175 | 
176 | 
177 | 
178 | 
179 | #### pvtimer initialization
180 | 
181 | kvm_lapic_init initializes pvtimer through kvm_pv_timer_init:
182 | 
183 | ```c
184 | #define PVTIMER_SYNC_CPU   (NR_CPUS - 1) /* dedicated CPU */
185 | #define PVTIMER_PERIOD_NS  250000L /* pvtimer default period */
186 | 
187 | static long pv_timer_period_ns = PVTIMER_PERIOD_NS;
188 | 
189 | static void kvm_pv_timer_init(void)
190 | {
191 |     ktime_t ktime = ktime_set(0, pv_timer_period_ns);
192 | 
193 |     hrtimer_init(&pv_sync_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
194 |     pv_sync_timer.function = &pv_sync_timer_callback;
195 | 
196 |     /* kthread for pv_timer sync buffer */
197 |     pv_timer_polling_worker = kthread_create(pv_timer_polling, NULL,
198 |                         "pv_timer_polling_worker/%d",
199 |                         PVTIMER_SYNC_CPU);
200 |     if (IS_ERR(pv_timer_polling_worker)) {
201 |         pr_warn_once("kvm: failed to create thread for pv_timer\n");
202 |         pv_timer_polling_worker = NULL;
203 |         hrtimer_cancel(&pv_sync_timer);
204 | 
205 |         return;
206 |     }
207 | 
208 |     kthread_bind(pv_timer_polling_worker, PVTIMER_SYNC_CPU);
209 |     wake_up_process(pv_timer_polling_worker);
210 |     hrtimer_start(&pv_sync_timer, ktime, HRTIMER_MODE_REL);
211 | }
212 | ```
213 | 
214 | This creates the high-resolution timer pv_sync_timer on the monotonic clock; pv_sync_timer_callback is called once pv_timer_period_ns (250000 ns by default) has elapsed:
215 | 
216 | ```c
217 | static enum hrtimer_restart pv_sync_timer_callback(struct hrtimer *timer)
218 | {
219 |     // push the timer's expiry back by pv_timer_period_ns
220 |     hrtimer_forward_now(timer, ns_to_ktime(pv_timer_period_ns));
221 |     // wake up pv_timer_polling_worker
222 |     wake_up_process(pv_timer_polling_worker);
223 | 
224 |     // returning HRTIMER_RESTART re-arms the timer
225 |     return HRTIMER_RESTART;
226 | }
227 | ```
228 | 
229 | kvm_pv_timer_init also creates a kernel thread named pv_timer_polling_worker/x, where x is PVTIMER_SYNC_CPU (NR_CPUS - 1, the last CPU), the CPU the thread is bound to by kthread_bind. The thread runs pv_timer_polling.
230 | 
231 | Together with pv_sync_timer, this wakes pv_timer_polling_worker every pv_timer_period_ns to run pv_timer_polling.
232 | 
233 | 
234 | ```c
235 | static int pv_timer_polling(void *arg)
236 | {
237 |     struct kvm *kvm;
238 |     struct kvm_vcpu *vcpu;
239 |     int i;
240 |     mm_segment_t oldfs = get_fs();
241 | 
242 |     while (1) {
243 |         // mark the thread as sleeping interruptibly
244 |         set_current_state(TASK_INTERRUPTIBLE);
245 | 
246 |         if (kthread_should_stop()) {
247 |             __set_current_state(TASK_RUNNING);
248 |             break;
249 |         }
250 | 
251 |         spin_lock(&kvm_lock);
252 |         // mark the thread runnable again; it runs as soon as the scheduler picks it
253 |         __set_current_state(TASK_RUNNING);
254 |         list_for_each_entry(kvm, &vm_list, vm_list) {
255 |             set_fs(USER_DS);
256 |             use_mm(kvm->mm);
257 |             kvm_for_each_vcpu(i, vcpu, kvm) {
258 |                 kvm_apic_sync_pv_timer(vcpu);
259 |             }
260 |             unuse_mm(kvm->mm);
261 |             set_fs(oldfs);
262 |         }
263 | 
264 |         spin_unlock(&kvm_lock);
265 | 
266 |         // voluntarily yield the CPU to the next task
267 |         schedule();
268 |     }
269 | 
270 |     return 0;
271 | }
272 | ```
273 | 
274 | The function walks every VM and calls kvm_apic_sync_pv_timer on each of its vCPUs. When done, pv_timer_polling yields through schedule and waits to be woken by the next expiry of pv_sync_timer.
275 | 
276 | ```c
277 | void kvm_apic_sync_pv_timer(void *data)
278 | {
279 |     struct kvm_vcpu *vcpu = data;
280 |     struct kvm_lapic *apic = vcpu->arch.apic;
281 |     unsigned long flags, this_tsc_khz = vcpu->arch.virtual_tsc_khz;
282 |     u64 guest_tsc, expire_tsc;
283 |     long rem_tsc;
284 | 
285 |     if (!lapic_in_kernel(vcpu) || !pv_timer_enabled(vcpu))
286 |         return;
287 | 
288 |     local_irq_save(flags);
289 |     // read the guest's current TSC value
290 |     guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
291 |     // compute how many TSC ticks remain until pv_sync_timer expires
292 |     rem_tsc = ktime_to_ns(hrtimer_get_remaining(&pv_sync_timer))
293 |             * this_tsc_khz;
294 |     if (rem_tsc <= 0)
295 |         rem_tsc += pv_timer_period_ns * this_tsc_khz;
296 |     do_div(rem_tsc, 1000000L);
297 | 
298 |     /*
299 |      * make sure guest_tsc and rem_tsc are assigned before to update
300 |      * next_sync_tsc.
301 |      */
302 |     smp_wmb();
303 |     // write the TSC value of the next pv_sync_timer expiry to pvtimer_vcpu_event_info.next_sync_tsc
304 |     kvm_xchg_guest_cached(vcpu->kvm, &vcpu->arch.pv_timer.data,
305 |         offsetof(struct pvtimer_vcpu_event_info, next_sync_tsc),
306 |         guest_tsc + rem_tsc, 8);
307 | 
308 |     /* make sure next_sync_tsc is visible */
309 |     smp_wmb();
310 | 
311 |     // read pvtimer_vcpu_event_info.expire_tsc and reset it to 0
312 |     // expire_tsc holds the next expiry time that the guest set earlier
313 |     expire_tsc = kvm_xchg_guest_cached(vcpu->kvm, &vcpu->arch.pv_timer.data,
314 |             offsetof(struct pvtimer_vcpu_event_info, expire_tsc),
315 |             0UL, 8);
316 | 
317 |     /* make sure expire_tsc is visible */
318 |     smp_wmb();
319 | 
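320 |     // if the guest timer has not expired yet, store expire_tsc in apic->lapic_timer.tscdeadline and arm the timer,
321 |     //   which is what KVM does after the VMExit caused by a guest write to MSR_IA32_TSCDEADLINE
322 |     // if it has already expired, inject the APIC_LVTT interrupt directly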
323 |     if (expire_tsc) {
324 |         if (expire_tsc > guest_tsc)
325 |             /*
326 |              * As we bind this thread to a dedicated CPU through
327 |              * IPI, the timer is registered on that dedicated
328 |              * CPU here.
329 |              */
330 |             kvm_set_lapic_tscdeadline_msr(apic->vcpu, expire_tsc);
331 |         else
332 |             /* deliver immediately if expired */
333 |             kvm_apic_local_deliver(apic, APIC_LVTT);
334 |     }
335 |     local_irq_restore(flags);
336 | }
337 | ```
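The `rem_tsc` computation above converts nanoseconds to TSC ticks: with the TSC frequency given in kHz, `ticks = ns * khz / 1000000`. A small worked sketch follows, assuming a 2.4 GHz guest TSC purely for illustration; `ns_to_tsc` and `pvtimer_next_sync_tsc` are hypothetical helper names, and the sketch applies the period correction before the multiplication, which is arithmetically equivalent to the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Convert nanoseconds to TSC ticks for a TSC running at tsc_khz kHz. */
static uint64_t ns_to_tsc(uint64_t ns, uint64_t tsc_khz)
{
    return ns * tsc_khz / 1000000ULL;
}

/*
 * Model of the next_sync_tsc value: the guest TSC at which the polling
 * timer fires next. If the timer has already fired (remaining_ns <= 0),
 * one full period is added, as in kvm_apic_sync_pv_timer.
 */
static uint64_t pvtimer_next_sync_tsc(uint64_t guest_tsc, int64_t remaining_ns,
                                      int64_t period_ns, uint64_t tsc_khz)
{
    if (remaining_ns <= 0)
        remaining_ns += period_ns;
    assert(remaining_ns >= 0); /* the model assumes at most one missed period */

    return guest_tsc + ns_to_tsc((uint64_t)remaining_ns, tsc_khz);
}
```

With these assumptions, the default period of 250000 ns corresponds to `ns_to_tsc(250000, 2400000)`, i.e. 600000 ticks at 2.4 GHz.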
338 | 
339 | The handling of writes to MSR_IA32_TSCDEADLINE is changed as well, from a direct `kvm_set_lapic_tscdeadline_msr(vcpu, data);` to:
340 | 
341 | ```c
342 | if (pv_timer_enabled(vcpu))
343 |     smp_call_function_single(PVTIMER_SYNC_CPU,
344 |             kvm_apic_sync_pv_timer, vcpu, 0);
345 | else
346 |     kvm_set_lapic_tscdeadline_msr(vcpu, data);
347 | ```
348 | 
349 | Of course, if kvm_apic_sync_pv_timer finds that the timer set by the guest has not expired yet, it still calls kvm_set_lapic_tscdeadline_msr to arm apic->lapic_timer.timer.
350 | 
351 | When apic->lapic_timer.timer expires, the APIC_LVTT interrupt is delivered; if the timer is in TSC-deadline mode, the specification requires the deadline to be reset to 0:
352 | 
353 | ```c
354 | static enum hrtimer_restart apic_timer_fn(struct hrtimer *data) {
355 |     struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer);
356 |     struct kvm_lapic *apic = container_of(ktimer, struct kvm_lapic, lapic_timer);
357 | 
358 |     if (pv_timer_enabled(apic->vcpu)) {
359 |         kvm_apic_local_deliver(apic, APIC_LVTT);
360 |         if (apic_lvtt_tscdeadline(apic))
361 |             apic->lapic_timer.tscdeadline = 0;
362 |     } else
363 |         apic_timer_expired(apic);
364 | }
365 | ```
366 | 
367 | 
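368 | ## Summary
369 | 
370 | The original way of programming the APIC timer: the guest writes the expiry timestamp (tsc-deadline timestamp) to MSR_IA32_TSC_DEADLINE, which triggers a VMExit, and KVM then arms a timer on the physical CPU.
371 | 
372 | pvtimer stores the expiry timestamp that would have gone into MSR_IA32_TSCDEADLINE in pvtimer_vcpu_event_info (its expire_tsc field) instead.
373 | 
374 | KVM creates a thread that periodically checks pvtimer_vcpu_event_info.expire_tsc: if the deadline has already passed it injects the timer interrupt directly, otherwise it calls kvm_set_lapic_tscdeadline_msr. Previously that call was made by the KVM handler after the MSR write caused a VMExit; now it is made by a periodic check whenever a deadline has been set. This reduces the number of VMExits.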
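The host-side half of the protocol can be summarized as a three-way decision per vCPU on every poll. The sketch below is a single-threaded userspace model under stated assumptions: `pvtimer_action` and `pvtimer_host_poll` are hypothetical names, and the plain load and store stand in for the atomic `kvm_xchg_guest_cached` used by the patch.

```c
#include <assert.h>
#include <stdint.h>

/* What the host does with a vCPU's slot on one poll (hypothetical enum). */
enum pvtimer_action {
    PVTIMER_NONE,   /* guest has not published a deadline */
    PVTIMER_ARM,    /* deadline in the future: arm the host timer */
    PVTIMER_INJECT  /* deadline already passed: inject the interrupt now */
};

/*
 * Consume the published deadline (resetting the slot to 0) and decide
 * what to do. *deadline_out is only written for PVTIMER_ARM.
 */
static enum pvtimer_action pvtimer_host_poll(uint64_t *expire_slot,
                                             uint64_t guest_tsc,
                                             uint64_t *deadline_out)
{
    uint64_t expire;

    assert(expire_slot != 0 && deadline_out != 0);

    expire = *expire_slot;
    *expire_slot = 0; /* a zero slot means "nothing pending" */

    if (expire == 0)
        return PVTIMER_NONE;
    if (expire > guest_tsc) {
        *deadline_out = expire;
        return PVTIMER_ARM;
    }
    return PVTIMER_INJECT;
}
```

Resetting the slot on every poll is what lets the host tell a fresh deadline from one it has already handled.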
375 | 
376 | 
377 | 
378 | 


--------------------------------------------------------------------------------