├── .gitignore
├── README.md
├── _posts
    ├── airtest-douyin.md
    ├── bert-runtime.md
    ├── cuda101.md
    ├── gdbpython.md
    ├── hello-world.md
    ├── netease-games.md
    ├── newubuntu.md
    ├── py-patterns.md
    ├── pyflame.md
    ├── python-ml-optimize.md
    ├── pytorch-coredump.md
    └── vsdebugpycpp.md
├── images
    ├── airtest-douyin
    │   ├── assistant.png
    │   ├── ide01.png
    │   ├── ide02.png
    │   ├── ide03.png
    │   ├── nox.png
    │   └── snapshot.png
    ├── bert-runtime
    │   ├── async.png
    │   ├── bert.png
    │   ├── gelu.png
    │   ├── gelujit.png
    │   ├── gpucpu.png
    │   ├── qkv.png
    │   └── std.jpg
    ├── cuda101
    │   ├── autoscale.png
    │   ├── graph.jpg
    │   ├── grid.jpg
    │   ├── matrix.png
    │   ├── mem.png
    │   ├── nvprof.png
    │   ├── shared_matrix.png
    │   └── transistors.png
    ├── gdbpython
    │   └── bt.png
    ├── netease-games
    │   ├── 0.jpeg
    │   ├── 1.jpeg
    │   └── 2.jpeg
    ├── pyflame
    │   ├── Python-Thread-State.png
    │   ├── profile_c.svg
    │   └── profile_py.svg
    ├── qiyu.jpg
    ├── qrcode.bmp
    └── vsdebugpycpp
    │   ├── 5c3d9b205e60273aadf4650714DcRPJX.png
    │   ├── 5c3d9dbcaa49f15c3726191dzKYsU5Qw.png
    │   ├── 5c3d9e54a7f2529830bb770bjqP16HQy.png
    │   ├── 5c3da03a96dee435e6604c4aVmHaq5uC.png
    │   └── 5c3da1cb7f9d2a99198674256wRjQZ2J.png
└── tags
    └── index.md


/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 | gh-md-toc


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # meteorix-blog
 2 | 
 3 | 
 4 | * [Python machine-learning performance optimization (PyCon 2019 talk)](./_posts/python-ml-optimize.md)
 5 | * [BERT Runtime](./_posts/bert-runtime.md)
 6 | * [cuda101](./_posts/cuda101.md)
 7 | * [Why I left NetEase Games after six years](./_posts/netease-games.md)
 8 | * [Pyflame: internals and extensions](./_posts/pyflame.md)
 9 | * [Debugging a PyTorch core dump](./_posts/pytorch-coredump.md)
10 | * [Python design patterns](https://github.com/Meteorix/python-design-patterns)
11 | * [Notes on the CPython source](https://github.com/Meteorix/pysourcenote)
12 |   - [The PyObject object system](https://github.com/Meteorix/pysourcenote/blob/master/object.md)
13 |   - [Garbage collection (GC)](https://github.com/Meteorix/pysourcenote/blob/master/gc.md)
14 |   - [minipython: a minimal implementation](https://github.com/Meteorix/pysourcenote/blob/master/minipython.md)
15 |   - [The Python virtual machine](https://github.com/Meteorix/pysourcenote/blob/master/vm.md)
16 | * [Setting up a fresh Ubuntu environment](./_posts/newubuntu.md)
17 | * [Mixed-mode Python/C++ debugging in VS2017](./_posts/vsdebugpycpp.md)
18 | * [Debugging CPython with gdb](./_posts/gdbpython.md)
19 | * [Scrolling Douyin with Airtest](./_posts/airtest-douyin.md)
20 | 


--------------------------------------------------------------------------------
/_posts/airtest-douyin.md:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: Scrolling Douyin with Airtest
  3 | 
  4 | date: 2019-2-5 22:41:32
  5 | 
  6 | tags: [Python,Airtest]
  7 | 
  8 | ---
  9 | 
 10 | # Scrolling Douyin with Airtest
 11 | 
 12 | Let's do something fun with [Airtest](https://github.com/AirtestProject/Airtest): how about scrolling through Douyin?
 13 | 
 14 | [GitHub repo: airtest-douyin](https://github.com/Meteorix/airtest-douyin)
 15 | 
 16 | ![ide01](/images/airtest-douyin/ide01.png)
 17 | 
 18 | ## Get Started
 19 | 
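 20 | ### Environment setup
 21 | 
 22 | I had no Android phone at hand and did not feel like setting up [ios-tagent](https://github.com/AirtestProject/iOS-Tagent) for iOS, so I took the laziest route:
 23 | 
 24 | *   [Nox emulator](https://www.yeshen.com/) (a real Android device works too)
 25 | *   [AirtestIDE](http://airtest.netease.com/)
 26 | 
 27 | <!--more-->
 28 | 
 29 | With Douyin installed, the Nox emulator feels just as smooth as a phone. The emulator uses about 200 MB of memory and around 12% CPU, which is not bad. Nox also ships with a multi-instance manager, which I will play with later for distributed scrolling.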
 30 | 
 31 | ![nox](/images/airtest-douyin/nox.png)
 32 | 
 33 | 
 34 | ### Recording the first version
 35 | 
 36 | Open AirtestIDE and connect the emulator by following the [docs](http://airtest.netease.com/docs/cn/2_device_connection/3_emulator_connection.html#id2).
 37 | 
 38 | ![ide01](/images/airtest-douyin/ide01.png)
 39 | 
 40 | To launch Douyin from code every time, first look up its package id with the Android assistant in the top-right corner.
 41 | 
 42 | ![assistant](/images/airtest-douyin/assistant.png)
 43 | 
 44 | Then add this code by hand:
 45 | 
 46 | ```python
 47 | APP = "com.ss.android.ugc.aweme"
 48 | 
 49 | stop_app(APP)
 50 | start_app(APP)
 51 | ```
 52 | 
 53 | Next, switch AirtestIDE to Android App recording mode and perform a few actions; the corresponding code is recorded as you go.
 54 | 
 55 | ![ide02](/images/airtest-douyin/ide02.png)
 56 | 
 57 | 
 58 | ### Tweaking the code
 59 | 
 60 | The auto-recorded code is not great, so let's adjust it a little. The recorder produced:
 61 | 
 62 | ```python
 63 | poco(boundsInParent="[0.03194444444444444, 0.02734375]").click()
 64 | ```
 65 | 
 66 | Change it to locate the button by its `text` instead:
 67 | 
 68 | ```python
 69 | poco(text="我").click()
 70 | ```
 71 | 
 72 | For the swipe-up that follows, swipe up by ``60%`` of the screen:
 73 | 
 74 | ```python
 75 | poco("com.ss.android.ugc.aweme:id/ak2").swipe([0, -0.6])
 76 | ```
 77 | 
 78 | Then press `F5` to run it once; everything works.
 79 | 
 80 | 
 81 | ### Keep scrolling
 82 | 
 83 | A small change to the last line keeps the scrolling going (ten swipes here):
 84 | 
 85 | ```python
 86 | for i in range(10):
 87 |     poco("com.ss.android.ugc.aweme:id/ak2").swipe([0, -0.6])
 88 |     sleep(1)
 89 | ```
 90 | 
 91 | ### Be nice and leave a like
 92 | 
 93 | Using the IDE's recorder again, tap the like button; it generates the following code:
 94 | 
 95 | ```python
 96 | poco("com.ss.android.ugc.aweme:id/al8").click()
 97 | ```
 98 | 
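 99 | It turns out Douyin requires login before you can like a video, so log in manually for now and leave a `TODO` in the code. This check sits inside the `for` loop above, which is why it can `break`:
100 | 
101 | ```python
102 | if poco(text="输入手机号码").exists():
103 |     # TODO: automatic login
104 |     print("Please log in manually first~")
105 |     break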
106 | ```
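To make the control flow explicit, here is a minimal sketch of how the recorded pieces might fit together in one loop. It is not the code from the repo: the function name `scroll_and_like` and its parameters are made up for illustration, and `poco` is assumed to behave like the Poco object used in the snippets above (callable with a selector, returning a proxy with `click`, `swipe` and `exists`).

```python
import time

FEED_ID = "com.ss.android.ugc.aweme:id/ak2"  # recorded resource id of the video feed
LIKE_ID = "com.ss.android.ugc.aweme:id/al8"  # recorded resource id of the like button


def scroll_and_like(poco, rounds=10, pause=1.0, sleep=time.sleep):
    """Like and swipe past up to `rounds` videos; stop early at the login screen.

    Hypothetical helper: `poco` is assumed to be a Poco-like object.
    Returns the number of videos that were swiped past.
    """
    swiped = 0
    for _ in range(rounds):
        poco(LIKE_ID).click()
        if poco(text="输入手机号码").exists():
            # TODO: automatic login
            print("Please log in manually first~")
            break
        poco(FEED_ID).swipe([0, -0.6])  # swipe up by 60% of the screen
        sleep(pause)
        swiped += 1
    return swiped
```

Passing `poco` and `sleep` in as arguments keeps the loop easy to try out with a stand-in object before pointing it at a real device.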
107 | 
108 | ![ide03](/images/airtest-douyin/ide03.png)
109 | 
110 | 
111 | Then take a screenshot as a souvenir:
112 | 
113 | ```python
114 | snapshot()
115 | ```
116 | 
117 | Run it again, and it works nicely.
118 | 
119 | ![snapshot](/images/airtest-douyin/snapshot.png)
120 | 
121 | 
122 | > Tip: click the `log` button in the IDE toolbar to see a report of every step.
123 | 
124 | 
125 | 
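126 | ### Committing the code
127 | 
128 | This script uses no image recognition, so a single .py file is enough. We take the code file out of ``douyin.air``, which lets you edit it in your favourite editor and run it directly with Python.
129 | 
130 | The final code is in [code/douyin.py](https://github.com/Meteorix/airtest-douyin/blob/master/code/douyin.py); run it directly with Python: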
131 | 
132 | ```shell
133 | python douyin.py
134 | ```
135 | 
136 | ### To be continued
137 | 
138 | *   Screen recording instead of screenshots
139 | *   Multi-instance & distributed runs
140 | *   Image recognition to like videos of pretty girls
141 | 


--------------------------------------------------------------------------------
/_posts/bert-runtime.md:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: BERT Runtime
  3 | 
  4 | date: 2019-9-7 15:03:33
  5 | 
  6 | tags: [cuda,深度学习]
  7 | 
  8 | ---
  9 | 
 10 | # BERT Runtime
 11 | 
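 12 | I have kept hammering away at [BERT](https://arxiv.org/abs/1810.04805) lately. Most of the models in our project now run on BERT, and it is really paying off.
 13 | 
 14 | I had been using `PyTorch JIT` for acceleration and deployment, and along the way wrote [service-streamer](https://github.com/ShannonAI/service-streamer) as middleware between the web layer and the model.
 15 | Last month NVIDIA open-sourced its `TensorRT`-based [BERT code](https://github.com/NVIDIA/TensorRT/tree/release/5.1/demo/BERT), and the official [blog](https://devblogs.nvidia.com/nlu-with-tensorrt-bert/) claims a single `inference` takes only 2.2 ms, 20x faster than CPU. The right question, though, is: how much faster is it than TF/PyTorch?
 16 | 
 17 | So, starting from [TensorRT](https://developer.nvidia.com/tensorrt), I studied NVIDIA's BERT implementation carefully and benchmarked it against TensorFlow and PyTorch. The conclusion: GPU time is **15%-30%** lower. The gain comes mainly from BERT-specific computation-graph optimizations (NVIDIA wrote 4 CUDA kernels of its own) and from avoiding the overhead of frameworks such as TensorFlow and PyTorch.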
 18 | 
 19 | ## Prerequisite
 20 | 
 21 | Some useful background:
 22 | 
 23 | 1. The BERT [paper](https://arxiv.org/abs/1810.04805), of course, plus the [TensorFlow implementation](https://github.com/google-research/bert) and the [PyTorch implementation](https://github.com/huggingface/pytorch-transformers)
 24 | 1. Harvard's well-known walkthrough, [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)
 25 | 1. GPU and CUDA basics; for a quick primer see my [cuda101](https://github.com/Meteorix/meteorix-blog/blob/master/_posts/cuda101.md)
 26 | 
 27 | ## TensorRT
 28 | 
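 29 | **TensorRT** is NVIDIA's official inference engine, built on top of CUDA. It applies CUDA-level optimizations to models trained in frameworks such as ``TensorFlow/PyTorch`` to reach higher inference performance. It also supports reduced-precision parameters, cross-platform deployment and more; in short, it makes the best use of NVIDIA's own GPUs.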
 30 | 
 31 | <!--more-->
 32 | 
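 33 | I wrestled with building [TensorRT](https://github.com/NVIDIA/TensorRT) for a day or two; overall it went fairly smoothly by following the ``README``:
 34 | 
 35 | 1. Prepare the environment: a standard C++/Python build toolchain plus CUDA. Mine was `Titan XP + cuda-10.0 + cuDNN-7.4`
 36 | 1. Download the TensorRT binary release. TensorRT itself is not open source; it ships as prebuilt libraries. The open-source surrounding code includes:
 37 |     * `include`: header files
 38 |     * `plugin`: CUDA extensions
 39 |     * `parser`: parsers for model files in different formats
 40 | 1. Build the Docker image used for compilation.
 41 | 1. Inside the Docker container, compile the TensorRT libs and the open-source code.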
 42 | 
 43 | ## TensorRT BERT
 44 | 
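 45 | TensorRT's BERT implementation lives in the [demo/BERT](https://github.com/NVIDIA/TensorRT/tree/release/5.1/demo/BERT) directory. It mainly provides:
 46 | 1. Four computation-graph optimizations for BERT, with several fused kernels implemented in CUDA and packaged as TensorRT plugins
 47 | 1. A script that converts TensorFlow model files to TensorRT model files
 48 | 1. C++ and Python APIs, plus complete BERT inference code
 49 | 
 50 | Again following the ``README``, which walks through the full workflow using the `SQuAD(QA)` model as an example:
 51 | 1. Download the TF model files for BERT fine-tuned on SQuAD, or use your own fine-tuned model
 52 | 1. Convert the TF model files to TensorRT model files with the conversion script
 53 | 1. Use another script to turn the model, parameters and input question into tensor-form inputs and outputs
 54 | 1. Build the C++ executable, which tests the accelerated model on those inputs and outputs and saves it as `bert.engine`
 55 | 
 56 | The `bert.engine` file can then be used on its own: load it through the C++ or Python API, or serve it directly with the TensorRT Serving Docker image.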
 57 | 
 58 | ### Python API
 59 | 
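 60 | NVIDIA also provides a Python API for the steps above, which requires building some extra Python bindings. Since I had already built the C++ version, I used the Python API only for inference. As the results below show, the Python API has almost no inference overhead compared with C++.
 61 | 
 62 | The Python API depends on [pycuda](https://developer.nvidia.com/pycuda), a library listed on NVIDIA's developer site that lets Python talk to CUDA directly: allocating device memory, copying tensors between host and device memory, and so on. Loading `bert.engine` and running inference uses the whl package released with TensorRT.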
 63 | 
 64 | 
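 65 | ### Reproducing NVIDIA's performance numbers
 66 | 
 67 | NVIDIA's official numbers were measured at `batchsize=1, seqlen=128`. On our Titan XP, GPU time is about `2.6ms` with both the C++ and Python APIs, which roughly reproduces the official figure.
 68 | 
 69 | ![gpucpu.png](/images/bert-runtime/gpucpu.png)
 70 | 
 71 | Interestingly, a comparison against frameworks such as PyTorch and TensorFlow would show the effect of the BERT optimizations far better than a comparison against CPU; perhaps the point was to diss CPUs and sell more GPUs :P
 72 | 
 73 | So let's run a proper benchmark.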
 74 | 
 75 | ## Benchmark
 76 | 
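 77 | For BERT inference, a large share of time goes to preprocessing: ``tokenize`` the input text into `index` values, apply `padding` and `masking`, and assemble a `tensor`. This benchmark only cares about GPU inference performance, so the timing code covers GPU time only, i.e. from tensor input to tensor output. It excludes pre- and post-processing time but includes the time to copy tensors between CPU and GPU.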
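For readers who have not seen the preprocessing step, here is a minimal sketch of the `padding` and `masking` part only. It is an illustration, not the code used in the benchmark: `pad_and_mask` is a hypothetical helper, the token ids are made up, and a pad id of 0 is an assumption.

```python
def pad_and_mask(token_ids, seq_len, pad_id=0):
    """Pad (or truncate) a list of token ids to seq_len and build the attention mask.

    The mask is 1 for real tokens and 0 for padding positions.
    """
    ids = list(token_ids)[:seq_len]
    mask = [1] * len(ids)
    padding = seq_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


# Hypothetical token ids; real ids come from the tokenizer's vocabulary.
ids, mask = pad_and_mask([101, 2054, 2003, 102], seq_len=8)
print(ids)
print(mask)
```

In the benchmark below the same idea is applied with a sequence length of 328, and everything up to this point is excluded from the measured time.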
 78 | 
 79 | ### Environment
 80 | 
 81 | **GPU**
 82 | * GPU Titan XP
 83 | * Cuda 10.0
 84 | * Cudnn 7.5
 85 | 
 86 | **Python 3.6 packages**
 87 | * Torch==1.2.0
 88 | * TensorFlow==1.14.0
 89 | * tensorrt==5.1.5.0
 90 | 
 91 | **BERT implementations**
 92 | * tensorrt: based on https://github.com/NVIDIA/TensorRT/tree/release/5.1/demo/BERT
 93 | * TensorFlow: based on https://github.com/google-research/bert
 94 | * PyTorch: based on https://github.com/huggingface/pytorch-transformers
 95 | 
 96 | **Model**
 97 | * bert-base, 12 layers, fine-tuned on SQuAD
 98 | * The same model parameters, converted to tensorrt/tf/pytorch model files respectively
 99 | 
100 | ### SQuAD task
101 | 
102 | The test uses the SQuAD (QA) task:
103 | ```
104 | # Input passage and question
105 | Passage: TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference. Today NVIDIA is open sourcing parsers and plugins in TensorRT so that the deep learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps.
106 | 
107 | Question: What is TensorRT?
108 | 
109 | # Output answer
110 | Answer: 'a high performance deep learning inference platform'
111 | ```
112 | 
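113 | Using the QA sample above, the input is padded to `Sequence Length=328`, with `Batch Size` set to `1` and `32`. Each configuration is measured 100 times, and the average time per sentence is reported in `ms`.
114 | 
115 | ### Results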
116 | 
117 | |bs * seqlen|tensorrt c++|tensorrt py|tensorflow|pytorch|pytorch jit|
118 | |-|-|-|-|-|-|
119 | |1 * 328|9.9|9.9|17|16.3|14.8|
120 | |32 * 328|7.3| |11.6|9.9|8.6|
121 | 
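122 | Notes:
123 | 1. I am not very familiar with the TensorFlow API wrappers, so treat that column as indicative only; it looks roughly on par with PyTorch without JIT
124 | 2. I have not yet implemented multi-batch inference with the TensorRT Python API; it should perform about the same as the C++ version
125 | 3. GPU utilization was close to `100%` in all tests, which suggests no blocking code outside the GPU
126 | 
127 | Conclusions:
128 | 1. TensorRT takes 39% (bs=1) to 26% (bs=32) less time than PyTorch
129 | 2. TensorRT takes 33% (bs=1) to 16% (bs=32) less time than PyTorch JIT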
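The percentages can be rechecked from the table with a few lines of Python. This is just arithmetic on the rounded table values, with "faster" read as the fraction of time saved relative to the baseline; `time_saved_percent` is a name chosen here for illustration.

```python
# Average time per sentence in ms, copied from the table above.
timings_ms = {
    "1 * 328": {"tensorrt": 9.9, "pytorch": 16.3, "pytorch jit": 14.8},
    "32 * 328": {"tensorrt": 7.3, "pytorch": 9.9, "pytorch jit": 8.6},
}


def time_saved_percent(baseline_ms, optimized_ms):
    """Share of the baseline time that the optimized version saves, in percent."""
    return 100.0 * (baseline_ms - optimized_ms) / baseline_ms


for shape, row in timings_ms.items():
    for baseline in ("pytorch", "pytorch jit"):
        saved = time_saved_percent(row[baseline], row["tensorrt"])
        print(f"{shape}: TensorRT vs {baseline}: {saved:.1f}% less time")
```

From the rounded table entries the last case comes out slightly above 15%, so the 16% quoted above presumably reflects the unrounded measurements.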
130 | 
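131 | ## Computation-graph and kernel optimizations
132 | 
133 | So what exactly does TensorRT's BERT implementation optimize?
134 | 
135 | ![bert.png](/images/bert-runtime/bert.png)
136 | 
137 | The graph above gives an overview of the BERT `Transformer Encoder`. If you are not yet familiar with the ``Transformer``, go back to Harvard's well-known walkthrough, [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html). There are four graph optimizations in total, three of them inside the `Transformer`:
138 | 1. A kernel implementation of the `gelu` activation
139 | 2. Fusion of `skip` and `layernorm`
140 | 3. Merged multiplication and transpose of the three `Q/K/V` matrices
141 | 
142 | These three optimizations apply in every one of the 12 ``Transformer`` layers, so they pay off well. The fourth is in the bottom `BERT Embedding` layer:
143 | 
144 | 4. Fusion of `embedding` and `layernorm`
145 | 
146 | Let's look at how each of the four is implemented. I also took this as a chance to learn about graph optimization and writing CUDA kernels.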
147 | 
148 | ### Gelu
149 | 
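150 | Following the `gelu` formula, if each step is computed separately, every kernel call reads and writes global memory once.
151 | 
152 | ![gelu.png](/images/bert-runtime/gelu.png)
153 | 
154 | > Because of how GPU hardware works, `global memory` access is very slow relative to computation; see [GPU design and memory hierarchy](https://github.com/Meteorix/meteorix-blog/blob/master/_posts/cuda101.md#gpu%E8%AE%BE%E8%AE%A1) in my previous note.
155 | 
156 | So TensorRT writes a single gelu kernel: one kernel call does the whole job with only one read and one write of device memory.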
157 | 
158 | https://github.com/NVIDIA/TensorRT/blob/release/5.1/demo/BERT/plugins/geluPlugin.cu
159 | 
160 | ```cpp
161 | // constants for approximating the normal cdf
162 | constexpr float A = 0.5;
163 | 
164 | constexpr float B = 0.7978845608028654; // sqrt(2.0/M_PI)
165 | 
166 | constexpr float C = 0.035677408136300125; // 0.044715 * sqrt(2.0/M_PI)
167 | 
168 | template <typename T, unsigned TPB>
169 | __global__ void geluKernel(const T a, const T b, const T c, int n, const T* input, T* output)
170 | {
171 | 
172 |     const int idx = blockIdx.x * TPB + threadIdx.x;
173 | 
174 |     if (idx < n)
175 |     {
176 |         const T in = input[idx];
177 |         const T cdf = a + a * tanh(in * (c * in * in + b));
178 |         output[idx] = in * cdf;
179 |     }
180 | }
181 | ```
182 | 
183 | Compare with PyTorch JIT:
184 | 
185 | ```python
186 | @torch.jit.script
187 | def gelu(x):
188 |     return 0.5 * x * (1 + torch.tanh(0.797884 * (x + 0.044715 * torch.pow(x, 3))))
189 | 
190 | print(gelu.graph)
191 | ```
192 | 
193 | ![gelujit.png](/images/bert-runtime/gelujit.png)
194 | 
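195 | The graph shows that each step is indeed computed separately: apart from built-ins such as `tanh`, everything goes through layer upon layer of function calls.
196 | 
197 | That said, in the latest PyTorch 1.2 code I found that `gelu` also uses a built-in CUDA implementation, so the two are nearly equivalent.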
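As a sanity check on the constants `A`, `B` and `C` in the kernel, the sketch below compares the fused expression with the step-by-step tanh formula in plain Python. It is scalar CPU code for illustration only, not the plugin itself, and it uses the exact value of `sqrt(2/pi)` rather than the truncated `0.797884` from the JIT snippet.

```python
import math

A = 0.5
B = math.sqrt(2.0 / math.pi)  # 0.7978845608...
C = 0.044715 * B              # 0.0356774081...


def gelu_fused(x):
    """Same arithmetic as the kernel body: one expression per element."""
    cdf = A + A * math.tanh(x * (C * x * x + B))
    return x * cdf


def gelu_stepwise(x):
    """The tanh approximation written out step by step."""
    return 0.5 * x * (1.0 + math.tanh(B * (x + 0.044715 * x ** 3)))


for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    print(x, gelu_fused(x), gelu_stepwise(x))
```

The two agree because `x * (C*x*x + B)` is just `B * (x + 0.044715 * x**3)` with the constant folded in ahead of time.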
198 | 
199 | ### Skip and Layer-Normalization
200 | 
201 | A PyTorch implementation of the LayerNorm layer:
202 | ```python
203 | class LayerNorm(nn.Module):
204 |     "Construct a layernorm module (See citation for details)."
205 |     def __init__(self, features, eps=1e-6):
206 |         super(LayerNorm, self).__init__()
207 |         self.a_2 = nn.Parameter(torch.ones(features))
208 |         self.b_2 = nn.Parameter(torch.zeros(features))
209 |         self.eps = eps
210 | 
211 |     def forward(self, x):
212 |         mean = x.mean(-1, keepdim=True)
213 |         std = x.std(-1, keepdim=True)
214 |         return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
215 | ```
216 | 
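217 | Ignoring a few minor parameters, the main work is computing `mean` and `std`, each of which requires one pass over all inputs.
218 | 
219 | Adding the `Skip` layer before `LayerNorm`, that makes three passes over the inputs in total.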
220 | 
221 | ```python
222 | x = LayerNorm(x + Sublayer(x))
223 | ```
224 | 
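225 | Given the GPU hardware and memory characteristics described above, launching three kernels and making three passes is expensive.
226 | So the optimization is:
227 | 1. While computing the `Skip` layer, also compute the means of `x` and `x^2`
228 | 2. When computing `LayerNorm`, derive `mean` and `std` directly from the means of `x` and `x^2`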
229 | 
230 |     ```python
231 |     std = sqrt(mean(x^2) - mean(x)^2)
232 |     ```
233 | 
234 | I did not follow this when first reading the code, so yuxian and I worked through the derivation by hand (runs away
235 | 
236 | ![std.jpg](/images/bert-runtime/std.jpg)
237 | 
238 | 
239 | This fuses three passes into one and saves the time spent reading and writing global memory.
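The identity behind the fusion is easy to check on the CPU. The sketch below is plain Python, not the CUDA plugin: it computes `mean` and `std` once with two passes and once with a single pass that accumulates the sums of `x` and `x^2`. It uses the population standard deviation (dividing by `n`), which is what the formula above yields; note that `x.std` in the PyTorch snippet defaults to the unbiased estimate.

```python
import math


def mean_std_two_pass(xs):
    """Reference: one pass for the mean, a second pass for the variance."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return mean, math.sqrt(var)


def mean_std_one_pass(xs):
    """Single pass: accumulate sum(x) and sum(x^2), then combine."""
    n = len(xs)
    total = 0.0
    total_sq = 0.0
    for x in xs:
        total += x
        total_sq += x * x
    mean = total / n
    var = total_sq / n - mean * mean  # mean(x^2) - mean(x)^2
    return mean, math.sqrt(max(var, 0.0))  # clamp tiny negative rounding errors


values = [1.0, 2.0, 3.0, 4.0]
print(mean_std_two_pass(values))
print(mean_std_one_pass(values))
```

The one-pass form can lose precision when the mean is large relative to the spread, which is a known trade-off of this formula.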
240 | 
241 | CUDA implementation:
242 | 
243 | https://github.com/NVIDIA/TensorRT/blob/release/5.1/demo/BERT/plugins/skipLayerNormPlugin.cu
244 | 
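245 | The `Embedding+LN` fusion works the same way. In principle, whenever a pass over the data precedes an `LN`, the means of `x` and `x^2` can be computed in that pass, saving two passes: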
246 | 
247 | https://github.com/NVIDIA/TensorRT/blob/release/5.1/demo/BERT/plugins/embLayerNormPlugin.cu
248 | 
249 | 
250 | 
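251 | ### QKV optimization
252 | 
253 | With the background above, the two optimizations here are easy to follow; just look at the figure and the code.
254 | 
255 | ![qkv.png](/images/bert-runtime/qkv.png)
256 | 
257 | 1) ``QKV`` used to be three separate matrix multiplications, each followed by a transpose; now it is one multiplication by a matrix three times as large, followed by a transpose and a slice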
258 | 
259 | https://github.com/NVIDIA/TensorRT/blob/release/5.1/demo/BERT/plugins/qkvToContextPlugin.cu
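The merged multiplication can be illustrated with tiny matrices in plain Python. This is only a sketch of the idea, not what the plugin does internally: the helpers `matmul` and `hconcat` and all matrix values are made up, and the transpose step is left out.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def hconcat(*matrices):
    """Concatenate matrices with the same number of rows side by side."""
    return [sum((m[i] for m in matrices), []) for i in range(len(matrices[0]))]


x = [[1, 2], [3, 4]]    # hypothetical input: 2 tokens, hidden size 2
w_q = [[1, 0], [0, 1]]  # hypothetical projection weights
w_k = [[2, 1], [1, 2]]
w_v = [[0, 3], [3, 0]]

# Three separate multiplications.
q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)

# One multiplication by the concatenated weight, then slice the columns.
qkv = matmul(x, hconcat(w_q, w_k, w_v))
q2 = [row[0:2] for row in qkv]
k2 = [row[2:4] for row in qkv]
v2 = [row[4:6] for row in qkv]

print(q == q2, k == k2, v == v2)
```

One large multiplication gives the GPU a single bigger kernel to run in place of three smaller launches over the same input.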
260 | 
261 | 
262 | 2) ``Scale+Softmax``: `exp(x)` is computed in the same pass as the scale, saving one pass
263 | 
264 | https://github.com/NVIDIA/TensorRT/blob/e47febadb256d94f65efe0f1eac54c7caedd65d4/demo/BERT/plugins/pluginUtil.h#L220
265 | 
266 | 
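267 | ### Asynchronous execution
268 | 
269 | The TensorRT blog makes a point of mentioning asynchronous execution. Because CPU and GPU are heterogeneous devices, copying tensors between them and running computation on the GPU both happen asynchronously. Not forcing synchronization increases the `throughput` of the whole pipeline. When profiling, pay special attention to this asynchronous time. The same care is visible in TensorRT's Python code, which handles it meticulously.
270 | 
271 | ![async.png](/images/bert-runtime/async.png)
272 | 
273 | PyTorch is in fact asynchronous as well, so TensorRT has no real advantage here.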
274 | 
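275 | ## How to use it
276 | 
277 | Having analysed TensorRT's BERT optimizations, let's see how to put them to use.
278 | 
279 | An inference speedup of around 30% is quite attractive. Possible ways to use it:
280 | 
281 | 1. Use the Python API to replace the tf/pytorch BERT implementation, leaving the pre- and post-processing code untouched
282 | 1. Use the C++ API, wrap the pre- and post-processing in C++, and ship a compiled binary
283 | 1. Use [tensorrt-inference-server](https://github.com/NVIDIA/tensorrt-inference-server) directly; the server only handles tensors, so pre- and post-processing must be implemented separately
284 | 
285 | 
286 | All three require converting the model files trained (fine-tuned) in tf/pytorch into a TensorRT `.engine` file:
287 | 1. Convert the model parameters; the top layer on BERT differs slightly from task to task
288 | 1. Fix the inputs, outputs, batch_size and other parameters, and generate the tensor files
289 | 1. Generate the `.engine` file from the results of the first two steps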
290 | 
291 | ### So, what's next?
292 | 
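293 | Depending on the project's stage, we will consider each of the three approaches. The priority is to straighten out the ``model iteration -- business development -- deployment`` workflow first.
294 | 
295 | > Once again: BERT is great. It is thanks to BERT that NLP can afford optimizations like these.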
296 | 


--------------------------------------------------------------------------------
/_posts/cuda101.md:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: cuda 101
  3 | 
  4 | date: 2019-8-2 11:58:11
  5 | 
  6 | tags: [cuda,深度学习]
  7 | 
  8 | ---
  9 | 
 10 | # cuda 101
 11 | 
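 12 | I recently started working on deep-learning backends and performance optimization, and once I reached the model side I needed to brush up on GPUs and CUDA. These are my getting-started notes, best read alongside the official docs.
 13 | 
 14 | NVIDIA's official docs and blog are really well written; a day of reading them was a pleasure. Many things I did not understand while just calling libraries (PyTorch/TensorFlow) now make much more sense, and I look forward to writing my own kernels soon.
 15 | 
 16 | ## get started
 17 | 
 18 | Start with this official blog post as a quickstart. It is very well written and quickly covers GPU characteristics, the programming model, device memory, and more.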
 19 | 
 20 | https://devblogs.nvidia.com/even-easier-introduction-cuda/
 21 | 
 22 | 
 23 | ## Thread hierarchy
 24 | 
 25 | ![image](/images/cuda101/grid.jpg)
 26 | 
 27 | <!--more-->
 28 | 
 29 | Each kernel launch on the GPU consists of:
 30 | 
 31 | ```
 32 | 1 * grid --- n * block --- m * thread
 33 | ```
 34 | 
 35 | ```cpp
 36 | // blockdim == m
 37 | // griddim == n
 38 | 
 39 | __global__ void add(int n, float *x, float *y)
 40 | {
 41 |   int index = blockIdx.x * blockDim.x + threadIdx.x;
 42 |   int stride = blockDim.x * gridDim.x;
 43 |   for (int i = index; i < n; i += stride)
 44 |     y[i] = x[i] + y[i];
 45 | }
 46 | ```
 47 | 
 48 | ``blockdim`` and ``griddim`` can be 1-, 2- or 3-dimensional, which is convenient for matrix operations of different ranks.
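To see how the `index` and `stride` arithmetic in the `add` kernel spreads work across threads, here is a small CPU simulation in Python. It mimics only the 1-D indexing of the grid-stride loop, nothing else about CUDA, and `grid_stride_indices` is a name invented for this sketch.

```python
def grid_stride_indices(n, grid_dim, block_dim):
    """Map each (block, thread) pair to the element indices it would process."""
    stride = block_dim * grid_dim
    work = {}
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            index = block_idx * block_dim + thread_idx
            work[(block_idx, thread_idx)] = list(range(index, n, stride))
    return work


work = grid_stride_indices(n=10, grid_dim=2, block_dim=3)
for key, indices in work.items():
    print(key, indices)
```

Every element index is handled by exactly one thread, whatever `n` is, which is why the same kernel works for any launch configuration.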
 49 | 
 50 | ## Memory allocation
 51 | 
 52 | ```cpp
 53 | int N = 1<<20;
 54 | float *x, *y;
 55 | // Allocate Unified Memory – accessible from CPU or GPU
 56 | cudaMallocManaged(&x, N*sizeof(float));
 57 | cudaMallocManaged(&y, N*sizeof(float));
 58 | ...
 59 | // Run kernel on 1M elements on the GPU
 60 | add<<<1, 1>>>(N, x, y);
 61 | ...
 62 | // Free memory
 63 | cudaFree(x);
 64 | cudaFree(y);
 65 | ```
 66 | This allocates unified (managed) memory in a shared virtual address space that both CPU and GPU can access.
 67 | 
 68 | ## nvprof
 69 | 
 70 | ![image](/images/cuda101/nvprof.png)
 71 | 
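 72 | From top to bottom, the output shows:
 73 | 1. Kernel function calls (GPU device)
 74 | 1. CUDA C runtime calls (CPU host)
 75 | 1. Device/host memory migration calls
 76 | 
 77 | This looks very useful; it would be worth properly profiling a PyTorch program with it.
 78 | 
 79 | ## Two device/host memory allocation modes
 80 | 
 81 | On my Titan XP, the nvprof results looked wrong: adding many parallel blocks did not speed up the kernel function. The blog post above helpfully flags this issue and points to the following post:
 82 | 
 83 | https://devblogs.nvidia.com/unified-memory-cuda-beginners/
 84 | 
 85 | It turns out there are two different unified-memory behaviours:
 86 | 
 87 | 1. Old (Kepler): memory is allocated on the GPU first; a CPU access triggers a page fault and the data is then read from the GPU
 88 | 2. New (Pascal): memory is allocated on the CPU first; a GPU access triggers a page fault and the data is then read from the CPU
 89 | 
 90 | 
 91 | Benefits of the new mode:
 92 | 1. It does not need to occupy much physical GPU memory; pages are fetched on demand
 93 | 2. Multiple GPUs can share the virtual memory table (as far as I understand)
 94 | 
 95 | > Note: when profiling, the first run includes memory migration time; prefetching beforehand avoids this.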
 96 | 
 97 | ## bandwidth
 98 | 
 99 | ```
100 | bandwidth = bytes / seconds
101 | ```
102 | 
103 | Peak device-memory bandwidth is on the order of 500 GB/s.
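As a worked example of the formula, the sketch below estimates the effective bandwidth of the `add` kernel from earlier, which reads `x`, reads `y` and writes `y`, i.e. three 4-byte accesses per element. The 50 microsecond kernel time is a made-up number for illustration, not a measurement, and `effective_bandwidth_gb_s` is a hypothetical helper.

```python
def effective_bandwidth_gb_s(n_elements, seconds, bytes_per_element=4, accesses_per_element=3):
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes) for an element-wise kernel."""
    total_bytes = n_elements * bytes_per_element * accesses_per_element
    return total_bytes / seconds / 1e9


n = 1 << 20            # 1M floats, as in the allocation example
kernel_seconds = 50e-6  # hypothetical kernel time
print(round(effective_bandwidth_gb_s(n, kernel_seconds), 2), "GB/s")
```

Comparing such a figure with the card's peak bandwidth tells you how close a memory-bound kernel is to the hardware limit.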
104 | 
105 | 
106 | -------
107 | 
108 | **Now on to the CUDA C Programming Guide proper**
109 | 
110 | https://docs.nvidia.com/cuda/cuda-c-programming-guide/
111 | 
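112 | ## GPU design
113 | 
114 | > Moore's law bounds the number of transistors on a chip
115 | 
116 | *   On a CPU, most transistors go to control logic and cache (L1/L2/L3), with relatively few for ALUs
117 | *   On a GPU, the vast majority go to ALUs, with little control logic and cache
118 | 
119 | This gives the GPU far higher compute throughput (GFLOPS) and memory bandwidth than a CPU. It suits massively data-parallel computation with little control flow.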
120 | 
121 | ![image](/images/cuda101/transistors.png)
122 | 
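123 | ## Automatic scaling across multiprocessors
124 | 
125 | Blocks are scheduled onto the multiprocessors (SMs), automatically filling all available SMs.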
126 | 
127 | ![image](/images/cuda101/autoscale.png)
128 | 
129 | ## Memory hierarchy
130 | 
131 | 1. Per-thread local memory
132 | 1. Per-block shared memory: shared among the threads of a block
133 | 1. Global memory: shared across blocks and across grids
134 | 
135 | ![image](/images/cuda101/mem.png)
136 | 
137 | ## Unified Memory
138 | 
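139 | Programming CPU and GPU together is heterogeneous programming: computation runs asynchronously and memory is allocated on different physical devices.
140 | 
141 | Unified memory presents the two memories as a single managed memory space, which makes programming much easier.
142 | 
143 | 
144 | ## Compiling with nvcc
145 | 
146 | ### Offline compilation
147 | 
148 | 1. Separate device code from host code
149 | 1. Compile device code to assembly (PTX) or binary form
150 | 1. In host code, replace `<<<...>>>` with CUDA C runtime function calls, then let nvcc invoke a host compiler such as gcc/g++
151 | 
152 | ### JIT compilation
153 | 
154 | 1. PTX code can be further compiled to device binary code at runtime; this is JIT compilation
155 | 1. The benefit: as long as PTX compatibility holds, old programs can take advantage of new hardware
156 | 1. The first JIT compilation populates a cache, which is invalidated when the driver is upgraded
157 | 
158 | ### Compatibility
159 | 1. Binary compatibility
160 | 2. PTX compatibility
161 | 3. Application compatibility
162 | 4. C/C++ compatibility: device code supports only a subset of C++
163 | 5. 64-bit compatibility
164 | 
165 | These are mostly compiler options; look them up in the docs when needed.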
166 | 
167 | 
168 | ## CUDA C Runtime
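169 | The ``cudart library`` is available as both a static and a dynamic library.
170 | 
171 | ### Initialization
172 | *   The runtime initializes implicitly the first time CUDA code runs. Remember to exclude this time when profiling.
173 | *   A CUDA context is created for each device and shared by all host threads in the process; PTX code is then JIT-compiled if necessary and loaded into device memory.
174 | *   The host selects the active context, i.e. the GPU to run on, with `cudaSetDevice(1)`
175 | *   A process can destroy the context manually with `cudaDeviceReset()`
176 | 
177 | #### Device memory
178 | 
179 | Device memory can be allocated as ``linear memory`` or as ``cuda arrays``.
180 | 
181 | ``cuda arrays`` seem to be used for ``texture`` fetching (textures as in game engines? to be looked into later).
182 | 
183 | ``linear memory`` lives in a regular 40-bit address space and can be referenced through pointers.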
184 | 
185 | ```cpp
186 | cudaMalloc()
187 | cudaFree()
188 | cudaMemcpy()  // host-to-device, device-to-host, device-to-device
189 | cudaMallocPitch()  // 2d
190 | cudaMalloc3D()  // 3d
191 | ```
192 | 
193 | Note that with 2D/3D allocations the indexing and stride (pitch) rules change slightly, so take care with pointer arithmetic.
194 | 
195 | ```cpp
196 | // Various ways to memcpy to and from global device memory
197 | __constant__ float constData[256];
198 | float data[256];
199 | cudaMemcpyToSymbol(constData, data, sizeof(data));
200 | cudaMemcpyFromSymbol(data, constData, sizeof(data));
201 | 
202 | __device__ float devData;
203 | float value = 3.14f;
204 | cudaMemcpyToSymbol(devData, &value, sizeof(float));
205 | 
206 | __device__ float* devPointer;
207 | float* ptr;
208 | cudaMalloc(&ptr, 256 * sizeof(float));
209 | cudaMemcpyToSymbol(devPointer, &ptr, sizeof(ptr));
210 | ```
211 | 
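212 | #### Shared memory
213 | 
214 | Do not confuse this with ``unified memory``; here it means memory shared by the threads of one block.
215 | 
216 | ``shared memory`` is much faster than ``global memory``, roughly comparable to a CPU's L1 cache?
217 | 
218 | Below, ``shared memory`` is used to implement a faster matrix multiplication. This example is a classic, [worth revisiting often](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory).
219 | 
220 | Without shared memory, for `C = A * B`, A is read from global memory ``B.width`` times and B is read ``A.height`` times.
221 | 
222 | ![image](/images/cuda101/matrix.png)
223 | 
224 | With shared memory, each block loads a tile into shared memory once, so A is read from global memory only ``B.width / block_size`` times and B only ``A.height / block_size`` times.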
225 | 
226 | ![image](/images/cuda101/shared_matrix.png)
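The read counts can be checked by simply counting loads. The sketch below is a Python model of the access pattern, not a CUDA kernel: it assumes one thread per output element, square tiles of `block_size`, and matrix dimensions that are multiples of `block_size`. `count_global_reads` is a name made up for this example.

```python
def count_global_reads(a_height, a_width, b_width, block_size=None):
    """Return (element reads of A, element reads of B) from global memory for C = A * B."""
    if block_size is None:
        # Naive kernel: each thread reads a full row of A and a full column of B.
        threads = a_height * b_width
        return threads * a_width, threads * a_width
    # Tiled kernel: per tile step, each block loads one tile of A and one tile of B.
    blocks = (a_height // block_size) * (b_width // block_size)
    steps = a_width // block_size
    tile_elements = block_size * block_size
    return blocks * steps * tile_elements, blocks * steps * tile_elements


a_height, a_width, b_width = 64, 32, 48
a_elements = a_height * a_width
b_elements = a_width * b_width

naive_a, naive_b = count_global_reads(a_height, a_width, b_width)
tiled_a, tiled_b = count_global_reads(a_height, a_width, b_width, block_size=16)

print("naive: A read", naive_a // a_elements, "times, B read", naive_b // b_elements, "times")
print("tiled: A read", tiled_a // a_elements, "times, B read", tiled_b // b_elements, "times")
```

With these sizes the per-element read counts drop by exactly the tile width, which is the saving the figure above illustrates.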
227 | 
228 | #### page-locked host memory
229 | 
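230 | This is somewhat like mmap-ing host memory into device memory. The benefits:
231 | 
232 | *   Copies between page-locked host memory and device memory can run concurrently with kernel execution (on devices that support it)
233 | *   When mapped into the device address space, no memcpy is needed at all
234 | *   Higher bandwidth
235 | 
236 | Page-locked host memory is a scarce system-wide resource; do not overuse it.
237 | 
238 | There are three flavours, to be studied when actually needed: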
239 | *   portable
240 | *   write-combining
241 | *   mapped
242 | 
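243 | ### Asynchronous execution
244 | 
245 | Host computation, device computation and the various kinds of memcpy are independent tasks that can run asynchronously and concurrently with one another.
246 | 
247 | When profiling, asynchronous kernel launches can be disabled by setting the ``CUDA_LAUNCH_BLOCKING`` environment variable to 1. (Note to self: make good use of Nsight / Visual Profiler.)
248 | 
249 | 
250 | #### Asynchronous models
251 | 
252 | There appear to be two important asynchronous programming models: stream and graph.
253 | 
254 | #### stream
255 | Streams work a bit like a game engine's command queue: open one or more streams, have host code issue commands into them, and synchronize when needed.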
256 | 
257 | ```cpp
258 | // Example: two streams running concurrently
259 | for (int i = 0; i < 2; ++i)
260 |     cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
261 |                     size, cudaMemcpyHostToDevice, stream[i]);
262 | for (int i = 0; i < 2; ++i)
263 |     MyKernel<<<100, 512, 0, stream[i]>>>
264 |           (outputDevPtr + i * size, inputDevPtr + i * size, size);
265 | for (int i = 0; i < 2; ++i)
266 |     cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
267 |                     size, cudaMemcpyDeviceToHost, stream[i]);
268 | ```
269 | 
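270 | Streams also support callbacks: once the preceding asynchronous work completes, a host function is called. A callback must not make CUDA API calls, otherwise it may deadlock.
271 | 
272 | #### graph
273 | A graph defines all operations up front as nodes and their ordering and dependencies as edges; once defined, it is launched for execution.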
274 | 
275 | ![image](/images/cuda101/graph.jpg)
276 | 
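277 | Streams favour flexibility, while graphs suit complex dependency structures and maximum execution efficiency.
278 | 
279 | 
280 | A graph can be defined in two ways:
281 | 1. Capturing (recording) a stream
282 | 2. Defining nodes and edges manually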
283 | 
284 | #### event
285 | 
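286 | Multiple streams can be synchronized with events, much like ```threading.Event```.
287 | 
288 | Events also let you capture multiple streams into a single graph: waiting on an event is roughly like ```thread.join()```.
289 | 
290 | Another use is asynchronous timing: record start and stop events, then call ``cudaEventElapsedTime``.
291 | 
292 | ### Multiple GPUs
293 | 
294 | Switching between GPUs: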
295 | ```cpp
296 | cudaSetDevice(0);
297 | cudaSetDevice(1);
298 | ```
299 | 
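300 | A stream is bound to a device; events can be used to synchronize streams across GPUs.
301 | 
302 | In 64-bit programs, peer-to-peer memory access between GPUs can be enabled through the API (probably useful for multi-GPU training?).
303 | 
304 | ### Virtual memory
305 | 
306 | In 64-bit programs, host and device share a single virtual address space; see the section on device/host memory allocation modes above.
307 | 
308 | ### IPC
309 | 
310 | CUDA also provides IPC: multiple processes can share device memory pointers and events.
311 | 
312 | To be studied later...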
313 | 
314 | 
315 | ### error & callstack
316 | 
317 | cuda-gdb / Nsight
318 | 
319 | ### texture
320 | 
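321 | To be studied later. Probably used by game engines? Or for some clever acceleration tricks?
322 | 
323 | It can also interoperate with OpenGL/DirectX...
324 | 
325 | 
326 | ## Performance guidelines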
327 | 
328 | TODO....
329 | 
330 | 


--------------------------------------------------------------------------------
/_posts/gdbpython.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: Debugging CPython with gdb
 3 | 
 4 | date: 2019-2-13 23:11:32
 5 | 
 6 | tags: Python
 7 | 
 8 | ---
 9 | 
10 | # Debugging CPython with gdb
11 | 
12 | Main references:
13 | 
14 | *	https://www.podoliaka.org/2016/04/10/debugging-cpython-gdb/
15 | *	https://blog.alswl.com/2013/11/python-gdb/
16 | 
17 | 
18 | ## Debugging the C stack and the Python stack together
19 | 
20 | This is especially useful when debugging a hung script, as shown below:
21 | 
22 | ![image](/images/gdbpython/bt.png)
23 | 
24 | ## Installing python-dbg
25 | 
26 | ```
27 | sudo apt-get install gdb python-dbg
28 | ```
29 | 
30 | <!--more-->
31 | 
32 | 
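33 | ``python-dbg`` provides the debug symbols and the `py-bt` command.
34 | 
35 | It automatically installs libpython.py into gdb's auto-load directory, under a subdirectory that mirrors the path of the Python binary: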
36 | 
37 | ```
38 | ➜  ~ which python2.7
39 | /usr/bin/python2.7
40 | ➜  ~ ls /usr/share/gdb/auto-load/usr/bin/          
41 | python2.7-dbg-gdb.py  python2.7-gdb.py
42 | ```
43 | 
44 | ### Installing gdb support for other Python versions
45 | 
46 | I have several Python versions installed on my server, for example a Python 3.7 built from source:
47 | 
48 | ```
49 | ➜  ~ which python3.7
50 | /usr/local/bin/python3.7
51 | ```
52 | 
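53 | You can find ``Tools/gdb/libpython.py`` in the Python source tree:
54 | > https://github.com/python/cpython/tree/3.7/Tools/gdb
55 | 
56 | Copy it into gdb's auto-load directory following the path rule above, so that gdb can find it when debugging a python3.7 process: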
57 | 
58 | ```
59 | ➜  ~ ls /usr/share/gdb/auto-load/usr/local/bin/          
60 | python3.7-dbg-gdb.py  python3.7-gdb.py
61 | ```
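The path rule can be written down as a tiny helper. This is only a sketch of the naming convention visible in the listings above: `gdb_autoload_script` is a hypothetical function, and `/usr/share/gdb/auto-load` is assumed to be the auto-load root, as on the Ubuntu machine shown here. It does no filesystem access, so pass the real path of the interpreter, not a symlink to it.

```python
import posixpath


def gdb_autoload_script(python_binary, autoload_root="/usr/share/gdb/auto-load"):
    """Return where gdb would look for the auto-load script of a given binary.

    Rule: <autoload_root>/<absolute path of the binary>-gdb.py
    """
    binary = posixpath.normpath(python_binary)
    if not posixpath.isabs(binary):
        raise ValueError("expected an absolute path to the python binary")
    return posixpath.join(autoload_root, binary.lstrip("/")) + "-gdb.py"


print(gdb_autoload_script("/usr/local/bin/python3.7"))
print(gdb_autoload_script("/usr/bin/python2.7"))
```

Copying `libpython.py` to the printed location is what makes `py-bt` available when attaching to that interpreter.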
62 | 
63 | ### Attaching to a Python process
64 | 
65 | ```
66 | ps -x | grep python
67 | gdb -p <pid>
68 | ```
69 | 
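70 | ### Common commands
71 | 
72 | ```
73 | bt    # current C call stack
74 | py-bt  # current Python call stack
75 | py-list  # current position in the Python source
76 | info thread   # thread information
77 | thread <id>   # switch to a thread
78 | thread apply all py-list  # show the Python source position of every thread
79 | ctrl-c  # interrupt
80 | ```
81 | 
82 | If py-bt runs into encoding problems with Chinese text, set:
83 | `export LC_CTYPE=C.UTF-8`
84 | 
85 | 
86 | ### Pair it with gdb dashboard for extra convenience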
87 | 
88 | https://github.com/cyrus-and/gdb-dashboard
89 | 


--------------------------------------------------------------------------------
/_posts/hello-world.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: Hello World
 3 | 
 4 | date: 2019-2-1 12:00:00
 5 | 
 6 | tags: 随笔
 7 | 
 8 | ---
 9 | 
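10 | It's 2019, and I am giving blogging another try.
11 | 
12 | I considered a few platforms, but they were either overloaded with ads or too shallow in content, so I came back to `github pages + hexo`. SEO can wait; for now I am writing for myself.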
13 | 


--------------------------------------------------------------------------------
/_posts/netease-games.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: Why I left NetEase Games after six years
 3 | 
 4 | date: 2019-6-7 01:11:12
 5 | 
 6 | tags: [随笔,网易]
 7 | 
 8 | ---
 9 | 
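10 | I spent six years at NetEase Games. If this place is a university, then I suppose I have graduated. Do I get an outstanding-graduate award? :-D
11 | 
12 | > Games change the world
13 | 
14 | When 小s interviewed me and asked what my dream was, I thought for a long time before giving that answer. It may sound naive, but I still believe today that technology can change the world. Launching Airtest at GDC and Google I/O ticked off a small goal. But making games, at least at NetEase, has been drifting further and further from that dream.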
15 | 
16 | ---
17 | 
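18 | In college I took two things seriously: the GRE and Dota. I used to be a fanatical gamer who played to win. For the Dota graduation cup I could round up ten people to train for a month or so, and that fire never went out. I clearly remember how I felt then: making games is so exciting, so why go off to the American countryside? So I turned down USC and joined NetEase.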
19 | 
20 | <!-- more -->
21 | 
22 | ![image](/images/netease-games/0.jpeg)
23 | 
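24 | For my first four years, NetEase was a fairly relaxed environment. Like a university, as long as you were willing to attend class you could always learn a lot. From 翔妹 and 白菜 to 阿进 and ff, I learned what good programmers look like. Superb technical skill and leading a team well are just the basics. Getting the tradeoff between technology and product right, and still making the technology shine in a strongly product-driven environment like 梦幻, is truly impressive. Of course, Felixyan, who first inspired me back in college, is another kind of master altogether: unfathomable.
25 | 
26 | Later, 磊涛 and Simon brought something different from big companies and made me aware of the world beyond technology. There are definitely brilliant people here, and I was a bit like someone auditing classes, always trying hard to pick up their tricks. 小s's strength is management: quiet and unobtrusive day to day, yet decisive when making and executing decisions, and able to let go when it is time to let go while the whole 梦幻 group stays rock solid. At this point I must thank this environment for giving me room to experiment and to grow from a complete rookie. Thank you, 小s.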
27 | 
28 | ---
29 | 
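30 | As for Airtest, it gave me confidence across the board. In May 2017, at my first Google I/O, I wandered over to the Firebase booth for a chat and said we had built an automation framework that was better than what Google had. With the English I had at the time, I talked with Justin for an hour the next day and invited him to Guangzhou. That is how we struck up collaborations with two Google teams. After that I led the team to ship the product, open-source it, practise spoken English, present at GDC in March and at I/O in May. When I stood on stage and met the eyes of a full house, I felt surprisingly confident, and even my accent sounded like a native speaker's. When the GitHub link went up and the whole room raised their phones to take pictures, I knew this was the best in the world. Thinking back to the Unity and Unreal sessions the year before, they were no better than this. I was a rock star of the tech world.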
31 | 
32 | ![image](/images/netease-games/1.jpeg)
33 | 
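34 | For the first four years I was an idealist. Looking back, the turning point was a question 磊涛 asked me in the summer of 2017: you bright young people, are you staying here just for the money? I was stunned. How could anyone ask me that? You just don't understand games. But, as in Inception, he had planted an idea in my head, and it later took root and began to grow.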
35 | 
36 | ---
37 | 
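38 | Long ago I read 韦一笑's essay 《你为什么离开游戏行业》 ("Why did you leave the game industry"). At the time I did not really agree with his loot-box argument, perhaps because I was working on 梦幻. 鹏鹏, 二妹 and 小龙 really are good designers who can give players a different experience. But over the following years bad money drove out good, until recently we finally produced the lowest-rated game in TapTap's history.
39 | 
40 | After all, NetEase is a game company, and the boss is a ~~big kid~~ businessman. There is nothing wrong with pursuing commercial goals; which company doesn't? Last year I visited Blizzard with 萌哥 and fulfilled a childhood dream; back then, games were still the ninth art. Can Chinese game companies become like that in ten years? Do they even want to? Much later, in a chat with 小璐, she put her finger on it: whether our generation does well or badly will not change anything about this industry.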
41 | 
42 | ![image](/images/netease-games/2.jpeg)
43 | 
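44 | I was very lucky to spend six years in my first job out of campus recruiting, with some modest success. Green as I was back then, I went on to build two pretty good teams. I met my wife at the NetEase gym. And I must thank NetEase Games for giving me a relative degree of financial freedom.
45 | 
46 | Now it is time to look ahead and do something that can truly bring change.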
47 | 


--------------------------------------------------------------------------------
/_posts/newubuntu.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: Setting up a fresh Ubuntu environment
 3 | 
 4 | date: 2019-2-13 22:41:32
 5 | 
 6 | tags: Linux
 7 | 
 8 | ---
 9 | 
10 | # Setting up a fresh Ubuntu environment
11 | 
12 | ### apt
13 | 
14 | Switch the apt sources:
15 | *   Mirror list: http://wiki.ubuntu.org.cn/
16 | *   Tsinghua mirror: https://mirror.tuna.tsinghua.edu.cn/help/ubuntu/
17 | 
18 | Install ``zsh/vim/git``:
19 | 
20 | ```shell
21 | sudo apt-get update
22 | sudo apt-get install zsh vim git
23 | ```
24 | 
25 | ### git
26 | 
27 | ```
28 | git config --global core.editor "vim"
29 | git config --global --edit  # set name and email
30 | ```
31 | 
32 | <!--more-->
33 | 
34 | 
35 | ### zsh
36 | 
37 | Set up **oh-my-zsh**: https://github.com/robbyrussell/oh-my-zsh
38 | 
39 | ```shell
40 | sh -c "$(curl -fsSL https://raw.githubusercontent.com/robbyrussell/oh-my-zsh/master/tools/install.sh)"
41 | ```
42 | 
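43 | If changing the default shell fails, set a password first and then change it:
44 | ```shell
45 | passwd  # set a password
46 | chsh -s "$(which zsh)"  # make zsh the default login shell
47 | ```
48 | 
49 | To fix leftover characters in the zsh prompt, add to ``~/.zshrc``: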
50 | 
51 | ```
52 | export LANG=en_US.UTF-8
53 | export LC_ALL=en_US.UTF-8
54 | ```
55 | 
56 | > Reference: https://github.com/sindresorhus/pure/issues/300
57 | 
58 | To fix the flickering tab name, disable auto-title in ``~/.zshrc``:
59 | ```
60 | DISABLE_AUTO_TITLE="true"
61 | ```
62 | 
63 | For autosuggestions, see: https://github.com/zsh-users/zsh-autosuggestions/blob/master/INSTALL.md
64 | 
65 | ### vimrc
66 | 
67 | Set up **vimrc**: https://github.com/amix/vimrc
68 | ```shell
69 | git clone --depth=1 https://github.com/amix/vimrc.git ~/.vim_runtime
70 | sh ~/.vim_runtime/install_basic_vimrc.sh
71 | ```
72 | 
73 | Also add to ``~/.vimrc``:
74 | ```
75 | set nu  " show line numbers (my preference)
76 | set fencs=utf-8,gbk   " when opening a file, try the utf-8 and gbk encodings in turn
77 | ```
78 | 
79 | ### python
80 | 
81 | https://askubuntu.com/questions/865554/how-do-i-install-python-3-6-using-apt-get
82 | 
83 | ```shell
84 | # python3.6 
85 | sudo add-apt-repository ppa:jonathonf/python-3.6
86 | sudo apt-get update
87 | sudo apt-get install python3.6-dev
88 | # install pip (get-pip.py replaces the retired ez_setup.py bootstrap)
89 | curl -fsSL https://bootstrap.pypa.io/pip/3.6/get-pip.py | python3.6
90 | # switch the pip index to the Tsinghua mirror
91 | pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
92 | # install virtualenv
93 | python3.6 -m pip install virtualenv --user
94 | ```
95 | 


--------------------------------------------------------------------------------
/_posts/py-patterns.md:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: Python Design Patterns
  3 | 
  4 | date: 2018-06-20 20:41:32
  5 | 
  6 | tags: Python
  7 | 
  8 | ---
  9 | 
 10 | How to write better code. Liu Xin, 2018.6.20
 11 | 
 12 | > "Design patterns help you learn from others’ successes instead of your own failures"
 13 | 
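 14 | I have been reading about design patterns on and off for a long time, and this post is my summary. Design patterns originated in static languages such as C++ and Java. Many features of Python itself already cover design patterns, to the point that you may use one without realizing it: decorator, metaclass, generator, getattr and so on. Still, when writing a moderately large project I often feel out of my depth. As the quote above says, design patterns let you learn from those who came before and write better code.
 15 | 
 16 | All the code lives at [https://github.com/Meteorix/python-design-patterns](https://github.com/Meteorix/python-design-patterns) and runs under Python 3, so please run it and play with it. The code is for demonstration only: it aims to show the patterns clearly in Python syntax rather than to be complete, so do not use it in production. Issues and PRs are welcome : )
 17 | 
 18 | ## Six principles of design patterns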
 19 | 
 20 | 
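 21 | 1. Single responsibility principle
 22 |     * A class does one thing only; keep things modular
 23 | 
 24 | 1. Liskov substitution principle
 25 |     * Anywhere a parent class is used, it must be possible to substitute one of its subclasses
 26 |     * That is: a subclass may extend the parent's behavior, but must not change the parent's existing behavior
 27 | <!--more-->
 28 | 
 29 | 1. Dependency inversion principle
 30 |     * High-level modules should not depend on low-level modules; both should depend on abstractions
 31 |     * Abstractions should not depend on implementations; implementations should depend on abstractions
 32 |     * Program to an interface, not to an implementation (duck typing)
 33 | 
 34 | 1. Interface segregation principle
 35 |     * The fewer interfaces one class depends on from another class, the better
 36 | 
 37 | 1. Principle of least knowledge
 38 |     * The less one class knows about another, the better
 39 | 
 40 | 1. Open/closed principle
 41 |     * Classes, modules and functions are open for extension and closed for modification
 42 |     * Extend without modifying the existing source code wherever possible
 43 | 
 44 | 
 45 | > These principles are not limited to class design; they also apply to many system designs in engineering.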
 46 | 
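To make the dependency inversion principle a bit more concrete, here is a minimal sketch using duck typing. `OrderService`, `MemoryStorage` and `PrintStorage` are hypothetical names invented for this illustration; the only contract between them is that a storage object has a `save()` method.

```python
class MemoryStorage(object):
    """Low-level detail: keeps saved items in a list."""

    def __init__(self):
        self.items = []

    def save(self, item):
        self.items.append(item)


class PrintStorage(object):
    """Another low-level detail with the same save() interface."""

    def save(self, item):
        print("saved:", item)


class OrderService(object):
    """High-level policy: depends only on 'something with save()'."""

    def __init__(self, storage):
        self.storage = storage

    def place(self, order):
        self.storage.save(order)
        return order


memory = MemoryStorage()
OrderService(memory).place("order-1")
OrderService(PrintStorage()).place("order-2")
```

Neither storage class inherits from a shared base; swapping one for the other requires no change to `OrderService`.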
 47 | 
 48 | ## Common design patterns
 49 | 
 50 | ### Creational patterns
 51 | #### Singleton
 52 | 
 53 | A class has exactly one instance; this hardly needs explaining :)
 54 | 
 55 | ```python
 56 | class SingletonMeta(type):
 57 | 
 58 |     instance = None
 59 | 
 60 |     def __call__(cls, *args, **kwargs):
 61 |         if cls.instance is None:
 62 |             cls.instance = super(SingletonMeta, cls).__call__(*args, **kwargs)
 63 |         return cls.instance
 64 | 
 65 | 
 66 | class CurrentUser(object, metaclass=SingletonMeta):
 67 | 
 68 |     def __init__(self, name=None):
 69 |         super(CurrentUser, self).__init__()
 70 |         self.name = name
 71 | 
 72 |     def __str__(self):
 73 |         return repr(self) + ":" + repr(self.name)
 74 | 
 75 | 
 76 | if __name__ == '__main__':
 77 |     u = CurrentUser("liu")
 78 |     print(u)
 79 |     u2 = CurrentUser()
 80 |     u2.name = "xin"
 81 |     print(u2)
 82 |     print(u)
 83 |     assert u is u2
 84 | ```
 85 | 
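 86 | This example uses a metaclass, although Python offers other ways to implement a singleton. A metaclass is what creates classes and, through `__call__`, controls how they are instantiated, which makes it a natural fit for a singleton class.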
 87 | 
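One of the other ways is a class decorator. The sketch below is only an illustration under my own assumptions: `singleton` and `Config` are hypothetical names and are not part of the repository linked above.

```python
def singleton(cls):
    """Class decorator: create the instance on first call, reuse it afterwards."""
    instances = {}

    def get_instance(*args, **kwargs):
        if cls not in instances:
            instances[cls] = cls(*args, **kwargs)
        return instances[cls]

    return get_instance


@singleton
class Config(object):  # hypothetical example class
    def __init__(self, env="dev"):
        self.env = env


a = Config("prod")
b = Config()  # arguments are ignored after the first call
print(a is b, b.env)
```

The trade-off compared with the metaclass version is that the name `Config` now refers to a function, so `isinstance(obj, Config)` no longer works.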
 88 | 
 89 | #### Factory
 90 | 
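 91 | The factory pattern is used to produce many objects. Here ``__subclasses__`` looks up the subclasses, so new subclasses can be added without changing the factory code.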
 92 | ```python
 93 | class Shape(object):
 94 |     @classmethod
 95 |     def factory(cls, name, *args, **kwargs):
 96 |         types = {c.__name__: c for c in cls.__subclasses__()}  # ignore the performance :P
 97 |         shape_class = types[name]
 98 |         return shape_class(*args, **kwargs)
 99 | 
100 | 
101 | class Circle(Shape):
102 |     pass
103 | 
104 | 
105 | class Square(Shape):
106 |     pass
107 | 
108 | 
109 | if __name__ == '__main__':
110 |     shapes = ["Circle", "Square", "Square", "Circle"]
111 |     for i in shapes:
112 |         s = Shape.factory(i)
113 |         print(s)
114 | ```
115 | 
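One thing to keep in mind is that `__subclasses__()` only returns direct subclasses. If the hierarchy grows deeper, the factory can collect them recursively. This is a sketch, with `all_subclasses` and `Ellipse` as hypothetical additions to the example above:

```python
def all_subclasses(cls):
    """Collect direct and indirect subclasses of cls, depth-first."""
    found = []
    for sub in cls.__subclasses__():
        found.append(sub)
        found.extend(all_subclasses(sub))
    return found


class Shape(object):
    @classmethod
    def factory(cls, name, *args, **kwargs):
        types = {c.__name__: c for c in all_subclasses(cls)}
        return types[name](*args, **kwargs)


class Circle(Shape):
    pass


class Ellipse(Circle):  # hypothetical grandchild of Shape
    pass


shape = Shape.factory("Ellipse")
print(shape)
```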
116 | 
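117 | ### Structural patterns
118 | 
119 | #### MVC
120 | Probably the best-known design pattern: ``model <-> controller <-> view``. Data and view are separated, and the same data can be rendered by several views. The best MVC implementations are probably the various web frameworks and GUI frameworks.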
121 | ```python
122 | class Model(object):
123 |     products = {
124 |         'milk': {'price': 1.50, 'quantity': 10},
125 |         'eggs': {'price': 0.20, 'quantity': 100},
126 |         'cheese': {'price': 2.00, 'quantity': 10}
127 |     }
128 | 
129 |     def get(self, name):
130 |         return self.products.get(name)
131 | 
132 | 
133 | class View(object):
134 |     def show_item_list(self, item_list):
135 |         print('-' * 20)
136 |         for item in item_list:
137 |             print("* Name: %s" % item)
138 |         print('-' * 20)
139 | 
140 |     def show_item_info(self, name, item_info):
141 |         print("Name: %s Price: %s Quantity: %s" % (name, item_info['price'], item_info['quantity']))
142 |         print('-' * 20)
143 | 
144 |     def show_empty(self, name):
145 |         print("Name: %s not found" % name)
146 |         print('-' * 20)
147 | 
148 | 
149 | class Controller(object):
150 |     def __init__(self, model, view):
151 |         self.model = model
152 |         self.view = view
153 | 
154 |     def show_items(self):
155 |         items = self.model.products.keys()
156 |         self.view.show_item_list(items)
157 | 
158 |     def show_item_info(self, item):
159 |         item_info = self.model.get(item)
160 |         if item_info:
161 |             self.view.show_item_info(item, item_info)
162 |         else:
163 |             self.view.show_empty(item)
164 | 
165 | 
166 | if __name__ == '__main__':
167 |     model = Model()
168 |     view = View()
169 |     controller = Controller(model, view)
170 |     controller.show_items()
171 |     controller.show_item_info('cheese')
172 |     controller.show_item_info('apple')
173 | ```
174 | The example above only demonstrates rendering data to a view; MVC also covers modifying the data through the view.
175 | 
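The write direction could look roughly like the sketch below: a user action arrives at the controller, which updates the model and asks the view to re-render. The `buy` and `set_quantity` methods are hypothetical names added for this illustration.

```python
class Model(object):
    def __init__(self):
        self.products = {'milk': {'price': 1.50, 'quantity': 10}}

    def get(self, name):
        return self.products.get(name)

    def set_quantity(self, name, quantity):
        self.products[name]['quantity'] = quantity


class View(object):
    def show_item_info(self, name, item_info):
        print("Name: %s Quantity: %s" % (name, item_info['quantity']))


class Controller(object):
    def __init__(self, model, view):
        self.model = model
        self.view = view

    def buy(self, name, amount):
        """Handle a user action: update the model, then re-render the view."""
        item_info = self.model.get(name)
        if item_info is None or item_info['quantity'] < amount:
            return False
        self.model.set_quantity(name, item_info['quantity'] - amount)
        self.view.show_item_info(name, self.model.get(name))
        return True


model = Model()
controller = Controller(model, View())
controller.buy('milk', 3)
```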
176 | #### Proxy
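177 | Instead of calling a class directly, access it through a proxy. The benefits include swapping the underlying implementation, access control, security checks and so on. The most useful case is the remote proxy, of which jsonrpc is one example.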
178 | ```python
179 | class Implementation(object):
180 |     def add(self, x, y):
181 |         return x + y
182 | 
183 |     def minus(self, x, y):
184 |         return x - y
185 | 
186 | 
187 | class Proxy(object):
188 |     def __init__(self, impl):
189 |         self._impl = impl
190 | 
191 |     def __getattr__(self, name):
192 |         return getattr(self._impl, name)
193 | 
194 | 
195 | if __name__ == '__main__':
196 |     p = Proxy(Implementation())
197 |     print(p.add(1, 2))
198 |     print(p.minus(1, 2))
199 | ```
200 | 
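As a sketch of the access control use case, the proxy can forward only a whitelist of attribute names. `RestrictedProxy` and its `allowed` parameter are hypothetical names for this illustration.

```python
class Implementation(object):
    def add(self, x, y):
        return x + y

    def minus(self, x, y):
        return x - y


class RestrictedProxy(object):
    """Forward only whitelisted attributes to the wrapped object."""

    def __init__(self, impl, allowed):
        self._impl = impl
        self._allowed = frozenset(allowed)

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails, so the proxy's
        # own attributes (_impl, _allowed) never pass through here.
        if name not in self._allowed:
            raise AttributeError("access to %r is not allowed" % name)
        return getattr(self._impl, name)


p = RestrictedProxy(Implementation(), allowed=["add"])
print(p.add(1, 2))
try:
    p.minus(1, 2)
except AttributeError as e:
    print("denied:", e)
```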
201 | #### Decorator
202 | 
203 | Decorators hardly need explaining either. They are built-in Python syntax and can be used for many things. A few simple examples:
204 | 
205 | *   Routing
206 | 
207 | ```python
208 | from flask import Flask
209 | app = Flask(__name__)
210 | 
211 | @app.route('/')
212 | def index():
213 |     return 'Hello, World'
214 | 
215 | @app.route('/home')
216 | def home():
217 |     return 'Welcome Home'
218 | ```
219 | 
220 | *   Access control
221 | 
222 | ```python
223 | from django.contrib.auth.decorators import login_required
224 | 
225 | @login_required
226 | def my_view(request):
227 |     ...
228 | ```
229 | 
230 | *   Logging inputs and outputs
231 | 
232 | ```python
233 | from functools import wraps
234 | 
235 | 
236 | def debug(f):
237 |     @wraps(f)
238 |     def debug_function(*args, **kwargs):
239 |         print('call: ', f.__name__, args, kwargs)
240 |         ret = f(*args, **kwargs)
241 |         print('return: ', ret)
242 |         return ret  # pass the wrapped function's result back to the caller
243 |     return debug_function
244 | 
245 | @debug
246 | def foo(a, b, c=None):
247 |     print(a, b, c)
248 |     return True
249 | 
250 | 
251 | if __name__ == '__main__':
252 |     foo(1, 2, 3)
253 | ```
254 | 
255 | 
256 | 
257 | 
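258 | ### Behavioral patterns
259 | 
260 | #### Template
261 | 
262 | The base class acts as a template that defines the interface, and subclasses implement the behavior. The best example is the family of QWidget classes in Qt.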
263 | 
264 | ```python
265 | class ApplicateFramework(object):
266 |     def __init__(self):
267 |         self.setup()
268 |         self.show()
269 | 
270 |     def setup(self):
271 |         pass
272 | 
273 |     def show(self):
274 |         pass
275 | 
276 |     def close(self):
277 |         pass
278 | 
279 | 
280 | class MyApplication(ApplicateFramework):
281 |     def setup(self):
282 |         print("setup", self)
283 | 
284 |     def show(self):
285 |         print("show", self)
286 | 
287 |     def close(self):
288 |         print("close", self)
289 | 
290 | 
291 | if __name__ == '__main__':
292 |     app = MyApplication()
293 |     app.close()
294 | ```
295 | 
296 | #### State Machine
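297 | A state machine: `current state + action => next state`. This needs little explanation either. The state machine below lets you define your own states, actions and transition rules. Extending it requires no change to the state machine code, which follows the **open/closed principle**.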
298 | 
299 | ```python
300 | class StateMachine(object):
301 | 
302 |     def __init__(self, init_state):
303 |         self.current_state = init_state
304 |         self.current_state.run()
305 | 
306 |     def step(self, action):
307 |         self.current_state = self.current_state.next(action)
308 |         self.current_state.run()
309 | 
310 | 
311 | class State(object):
312 |     def __init__(self, name):
313 |         self.name = name
314 | 
315 |     def __str__(self):
316 |         return "<State '%s'>" % self.name
317 | 
318 |     def next(self, action):
319 |         if (self, action) in mapping:
320 |             next_state = mapping[(self, action)]
321 |         else:
322 |             next_state = self
323 |         print("%s + %s => %s" % (self, action, next_state))
324 |         return next_state
325 | 
326 |     def run(self):
327 |         print(self, "is current state")
328 | 
329 | 
330 | class Action(object):
331 |     def __init__(self, name):
332 |         self.name = name
333 | 
334 |     def __str__(self):
335 |         return "<Action '%s'>" % self.name
336 | 
337 | 
338 | State.Running = State("Running")
339 | State.Stopped = State("Stopped")
340 | State.Paused = State("Paused")
341 | 
342 | Action.start = Action("start")
343 | Action.stop = Action("stop")
344 | Action.pause = Action("pause")
345 | Action.resume = Action("resume")
346 | 
347 | 
348 | mapping = {
349 |     (State.Stopped, Action.start): State.Running,
350 |     (State.Running, Action.stop): State.Stopped,
351 |     (State.Running, Action.pause): State.Paused,
352 |     (State.Paused, Action.resume): State.Running,
353 |     (State.Paused, Action.stop): State.Stopped,
354 | }
355 | 
356 | 
357 | if __name__ == '__main__':
358 |     state_machine = StateMachine(State.Stopped)
359 |     state_machine.step(Action.start)
360 |     state_machine.step(Action.pause)
361 |     state_machine.step(Action.resume)
362 |     state_machine.step(Action.stop)
363 | ```
364 | 
365 | #### Iterator
366 | 
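367 | Iterators also need little explanation in Python; using them feels entirely natural. One major advantage is that the same ``for`` syntax works regardless of the data type. Another is that elements need not be computed up front; each one is computed only when the iteration reaches it. The example below implements an iterator with a `generator` created by the `yield` syntax, and the advantage is plain to see.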
368 | ```python
369 | def fibonacci(count=100):
370 |     a, b = 1, 2
371 |     yield a
372 |     yield b
373 |     while count:
374 |         a, b = b, a + b
375 |         count -= 1
376 |         yield b
377 | 
378 | 
379 | for i in fibonacci():
380 |     print(i)
381 | ```
382 | 
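The lazy evaluation point can be pushed further: a generator may even be infinite, as long as the consumer decides how much to take. A small sketch using `itertools.islice` from the standard library:

```python
from itertools import islice


def fibonacci():
    """Infinite generator; each value is computed only when requested."""
    a, b = 1, 2
    while True:
        yield a
        a, b = b, a + b


first_five = list(islice(fibonacci(), 5))
print(first_five)
```

Only five values are ever computed here, even though the generator itself never terminates.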
383 | #### Command
384 | 
385 | A Command wraps an atomic operation and is implemented outside the class it acts on. In my opinion its biggest use is implementing ``redo/undo``.
386 | ```python
387 | from collections import deque
388 | 
389 | 
390 | class Document(object):
391 |     value = ""
392 |     cmd_stack = deque()
393 | 
394 |     @classmethod
395 |     def execute(cls, cmd):
396 |         cmd.redo()
397 |         cls.cmd_stack.append(cmd)
398 | 
399 |     @classmethod
400 |     def undo(cls):
401 |         cmd = cls.cmd_stack.pop()
402 |         cmd.undo()
403 | 
404 | 
405 | class AddTextCommand(object):
406 |     def __init__(self, text):
407 |         self.text = text
408 | 
409 |     def redo(self):
410 |         Document.value += self.text
411 | 
412 |     def undo(self):
413 |         Document.value = Document.value[:-len(self.text)]
414 | 
415 | 
416 | if __name__ == '__main__':
417 |     cmds = [AddTextCommand("liu"), AddTextCommand("xin"), AddTextCommand("heihei")]
418 |     for cmd in cmds:
419 |         Document.execute(cmd)
420 |         print(Document.value)
421 | 
422 |     for i in range(len(cmds)):
423 |         Document.undo()
424 |         print(Document.value)
425 | ```
426 | 
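The example above only exercises undo. A fuller sketch with a separate redo stack might look like this; it keeps the state on a `Document` instance instead of the class, which is my own variation and not the layout used in the repository.

```python
class Document(object):
    def __init__(self):
        self.value = ""
        self.undo_stack = []
        self.redo_stack = []

    def execute(self, cmd):
        cmd.redo(self)
        self.undo_stack.append(cmd)
        self.redo_stack.clear()  # a fresh action invalidates the redo history

    def undo(self):
        if self.undo_stack:
            cmd = self.undo_stack.pop()
            cmd.undo(self)
            self.redo_stack.append(cmd)

    def redo(self):
        if self.redo_stack:
            cmd = self.redo_stack.pop()
            cmd.redo(self)
            self.undo_stack.append(cmd)


class AddTextCommand(object):
    def __init__(self, text):
        self.text = text

    def redo(self, doc):
        doc.value += self.text

    def undo(self, doc):
        # Slice by explicit length so that an empty text is a no-op.
        doc.value = doc.value[:len(doc.value) - len(self.text)]


doc = Document()
doc.execute(AddTextCommand("liu"))
doc.execute(AddTextCommand("xin"))
doc.undo()
print(doc.value)
doc.redo()
print(doc.value)
```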
427 | #### Chain Of Responsibility
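428 | Handlers are chained to process a request, returning as soon as one of them handles it successfully. The chain is used to build the handler sequence dynamically.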
429 | ```python
430 | class Handler(object):
431 | 
432 |     def __init__(self):
433 |         self.successor = None
434 | 
435 |     def handle(self, data):
436 |         res = self._handle(data)
437 |         if res:
438 |             return res
439 |         if self.successor:
440 |             return self.successor.handle(data)
441 | 
442 |     def _handle(self, data):
443 |         raise NotImplementedError
444 | 
445 |     def link(self, handler):
446 |         self.successor = handler
447 |         return handler
448 | 
449 | 
450 | class DictHandler(Handler):
451 |     def _handle(self, data):
452 |         if isinstance(data, dict):
453 |             print("handled by %s" % self)
454 |             return True
455 | 
456 | 
457 | class ListHandler(Handler):
458 |     def _handle(self, data):
459 |         if isinstance(data, list):
460 |             print("handled by %s" % self)
461 |             return True
462 | 
463 | 
464 | if __name__ == '__main__':
465 |     h = DictHandler()
466 |     h.link(ListHandler()).link(Handler())
467 |     ret = h.handle([1, 2, 3])
468 |     ret = h.handle({1: 2})
469 | ```
470 | 
471 | 
472 | #### Chaining Method
473 | 
474 | Method chaining: just keep adding one dotted call after another. Many query builders work this way, which makes the API nicer to use.
475 | 
476 | ```python
477 | class Player(object):
478 |     def __init__(self, name):
479 |         self.pos = (0, 0)
480 | 
481 |     def move(self, pos):
482 |         self.pos = pos
483 |         print("move to %s, %s" % self.pos)
484 |         return self
485 | 
486 |     def say(self, text):
487 |         print(text)
488 |         return self
489 | 
490 |     def home(self):
491 |         self.pos = (0, 0)
492 |         print("I am home")
493 |         return self
494 | 
495 | 
496 | if __name__ == '__main__':
497 |     p = Player('liuxin')
498 |     p.move((1, 1)).say("haha").move((2, 3)).home().say("go to sleep")
499 | ```
500 | 
501 | #### Visitor
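502 | The purpose of the Visitor pattern is to implement operations in a separate class without changing the original class. The example below uses the Visitor pattern to implement two ways of traversing nodes.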
503 | ```python
504 | class Node(object):
505 |     def __init__(self, name, children=()):
506 |         self.name = name
507 |         self.children = list(children)
508 | 
509 |     def __str__(self):
510 |         return '<Node %s>' % self.name
511 | 
512 | 
513 | class Visitor(object):
514 | 
515 |     @classmethod
516 |     def visit(cls, node):
517 |         yield node
518 |         for child in node.children:
519 |             yield child
520 | 
521 |     @classmethod
522 |     def visit2(cls, node):
523 |         for child in node.children:
524 |             yield child
525 |         yield node
526 | 
527 | 
528 | if __name__ == '__main__':
529 |     root = Node('root', (Node('a'), Node('b')))
530 |     visitor = Visitor()
531 |     for node in visitor.visit(root):
532 |         print(node)
533 |     for node in visitor.visit2(root):
534 |         print(node)
535 | ```
536 | 
537 | #### Observer
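538 | When one object changes state and other objects need to be updated, the observer pattern decouples these objects, in line with the principle of least knowledge.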
539 | ```python
540 | class Observable(object):
541 | 
542 |     def __init__(self):
543 |         self._observers = []
544 | 
545 |     def attach(self, observer):
546 |         if observer not in self._observers:
547 |             self._observers.append(observer)
548 | 
549 |     def detach(self, observer):
550 |         self._observers.remove(observer)
551 | 
552 |     def notify(self):
553 |         for observer in self._observers:
554 |             observer.update(self)
555 | 
556 | 
557 | class Observer(object):
558 |     def update(self, observable):
559 |         print('updating %s by %s' % (self, observable))
560 | 
561 | 
562 | if __name__ == '__main__':
563 |     clock = Observable()
564 |     user1 = Observer()
565 |     user2 = Observer()
566 |     clock.attach(user1)
567 |     clock.attach(user2)
568 |     clock.notify()
569 | ```
570 | 
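571 | The example above shows the simplest implementation. In real programs the observers usually live in different threads, so watch out for thread safety.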
572 | 
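One common way to handle this is to guard the observer list with a lock and to notify from a snapshot. This is a minimal sketch, and `ThreadSafeObservable` and `CountingObserver` are hypothetical names for this illustration.

```python
import threading


class ThreadSafeObservable(object):
    def __init__(self):
        self._observers = []
        self._lock = threading.Lock()

    def attach(self, observer):
        with self._lock:
            if observer not in self._observers:
                self._observers.append(observer)

    def detach(self, observer):
        with self._lock:
            if observer in self._observers:
                self._observers.remove(observer)

    def notify(self):
        # Copy the list under the lock, then call observers outside it, so an
        # observer may attach or detach during update() without deadlocking.
        with self._lock:
            snapshot = list(self._observers)
        for observer in snapshot:
            observer.update(self)


class CountingObserver(object):
    def __init__(self):
        self.count = 0

    def update(self, observable):
        self.count += 1


clock = ThreadSafeObservable()
user1 = CountingObserver()
clock.attach(user1)
clock.notify()
```

The lock only protects the observer list; whatever each observer does inside `update()` still needs its own synchronization.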
573 | ## References
574 | *   [python-patterns](https://github.com/faif/python-patterns)
575 | *   [python-3-patterns-idioms](http://python-3-patterns-idioms-test.readthedocs.io/en/latest/index.html)
576 | *   [Six principles of design patterns (in Chinese)](http://www.uml.org.cn/sjms/201211023.asp)
577 | *   [Design patterns summarized in one sentence (in Chinese)](https://zhuanlan.zhihu.com/p/28737945)
578 | 


--------------------------------------------------------------------------------
/_posts/pyflame.md:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: Pyflame explained and extended
  3 | 
  4 | date: 2019-5-24 20:12:22
  5 | 
  6 | tags: [Python,C++]
  7 | 
  8 | ---
  9 | 
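 10 | [Pyflame](https://github.com/uber/pyflame) is a Python performance profiler open-sourced by Uber.
 11 | 
 12 | > Optimizing without profiling is just fooling around
 13 | 
 14 | The most common complaint about Python is that it is slow. Following the saying above, the first thing we need is a performance profiler.
 15 | 
 16 | # Two ways a profiler can work
 17 | 
 18 | Profiling tools basically follow one of two approaches: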
 19 | 
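 20 | 1. Instrumentation
 21 | 
 22 |     This one is straightforward: record the time at the start and end of every function, then add up each function's execution time. Python's built-in `profile` and `cProfile` modules work this way. Instrumentation has two problems, though: the timing code itself adds a very large `overhead`, and you cannot instrument a program that is already running.
 23 | 
 24 | 2. Sampling
 25 | 
 26 |     Sampling attaches to a running process and probes its function call stack at high frequency. By the law of large numbers, the more often a function shows up in the samples, the longer it runs. pyflame is based on sampling. The statistics are displayed with another great tool, the [flamegraph](http://www.brendangregg.com/flamegraphs.html), which gives us a complete picture of what the CPU is doing.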
 27 | 
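To illustrate the sampling idea only, here is a tiny in-process sketch. It is not how pyflame works internally: pyflame samples from outside the process through ptrace, while this sketch runs a sampler thread inside the same interpreter and relies on the CPython-specific `sys._current_frames()`. All function names here (`sample_stacks`, `profile`, `busy`, `to_folded`) are hypothetical.

```python
import collections
import sys
import threading
import time


def sample_stacks(thread_id, counts, stop_event, interval=0.001):
    """Periodically record one thread's call stack as a tuple of function names."""
    while not stop_event.is_set():
        frame = sys._current_frames().get(thread_id)
        stack = []
        while frame is not None:
            stack.append(frame.f_code.co_name)
            frame = frame.f_back
        if stack:
            counts[tuple(reversed(stack))] += 1  # outermost call first
        time.sleep(interval)


def profile(func):
    """Run func in the current thread while a sampler thread watches it."""
    counts = collections.Counter()
    stop_event = threading.Event()
    sampler = threading.Thread(
        target=sample_stacks,
        args=(threading.get_ident(), counts, stop_event),
    )
    sampler.start()
    try:
        func()
    finally:
        stop_event.set()
        sampler.join()
    return counts


def to_folded(counts):
    """Format samples as 'outer;inner count' lines, the folded-stack text format."""
    return ["%s %d" % (";".join(stack), n) for stack, n in counts.items()]


def busy():
    total = 0
    for i in range(5000000):
        total += i
    return total


counts = profile(busy)
for line in to_folded(counts):
    print(line)
```

Stacks that show up in more samples account for more wall-clock time, which is exactly what the width of a span in a flame graph represents.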
 28 | 
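 29 | # Using pyflame and flame graphs
 30 | 
 31 | Upstream only provides the pyflame source, so you have to compile it yourself for your Python version and then call flamegraph to generate the flame graph SVG. See the [official documentation](https://pyflame.readthedocs.io/en/latest/) for details. To make this easier for my colleagues, I wrote pyflame-server.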
 32 | 
 33 | <!-- more -->
 34 | 
 35 | ## pyflame-server
 36 | 
 37 | [https://github.com/Meteorix/pyflame-server](https://github.com/Meteorix/pyflame-server)
 38 | 
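 39 | Using pyflame-server takes just a few steps:
 40 | 1. Download the prebuilt pyflame binary (Linux x86_64), which supports py2.6/2.7/3.4/3.5/3.6/3.7
 41 | 2. Launch a Python process under pyflame, or attach to a running process, to get profile.txt
 42 | 3. Upload profile.txt and view the flame graph page
 43 | 
 44 | Internal address: [http://172.22.22.230:5000/](http://172.22.22.230:5000/)
 45 | 
 46 | pyflame-server is a simple Flask-based web project; contributions are welcome.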
 47 | 
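 48 | ## How to read the graph
 49 | 
 50 | *   The horizontal axis is the share of samples: the wider a span, the more samples landed in it, i.e. the more time it consumed
 51 | *   The vertical axis is the function call stack, from bottom to top; the bottom layer is the entry point of the executable
 52 | *   Each span shows the file path, function name, line number, sample count and so on; zoom into the SVG to see for yourself.
 53 | 
 54 | Take the following code as an example: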
 55 | 
 56 | ```
 57 |  1 import time
 58 |  2
 59 |  3
 60 |  4 def add(a, b):
 61 |  5    return a + b
 62 |  6
 63 |  7
 64 |  8 def sum():
 65 |  9    count = 1
 66 | 10    for i in range(10000000):
 67 | 11        count = add(count, i)
 68 | 12    print(count)
 69 | 13    time.sleep(1)
 70 | 14
 71 | 15
 72 | 16 sum()
 73 | ```
 74 | 
 75 | ![image](/images/pyflame/profile_py.svg)
 76 | 
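 77 | Reading from the bottom up:
 78 | *   2/3 of the time is spent in the ``sum`` function called on line 16, and 1/3 is ``idle``, i.e. Python released the GIL during ``sleep``.
 79 | *   Within ``sum``, the for loop on line 10 takes a small slice of time, while the ``add`` call on line 11 takes a large share
 80 | *   One level up, line 5 of ``add`` takes only a small slice, which means most of the time goes to the function call itself
 81 | 
 82 | After profiling like this, how to optimize should be obvious. In a real project, a simple profile like this let us improve the performance of an NLP preprocessing task by 70%.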
 83 | 
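For the toy example above, the profile points at function call overhead, so one plausible fix is to inline the addition or to hand the whole loop to a builtin. The sketch below compares the variants with `timeit`; the function names are hypothetical and the timings will differ from machine to machine.

```python
import timeit


def add(a, b):
    return a + b


def sum_with_calls(n):
    count = 1
    for i in range(n):
        count = add(count, i)  # one Python function call per iteration
    return count


def sum_inlined(n):
    count = 1
    for i in range(n):
        count += i  # same arithmetic, no call overhead
    return count


def sum_builtin(n):
    return 1 + sum(range(n))  # the loop runs inside the interpreter's C code


for fn in (sum_with_calls, sum_inlined, sum_builtin):
    seconds = timeit.timeit(lambda: fn(100000), number=5)
    print("%-16s %.4fs" % (fn.__name__, seconds))
```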
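 84 | ## An actively maintained fork
 85 | 
 86 | The engineer at Uber who wrote pyflame has left the company, and nobody has taken over the project yet. So I maintain a fork of my own, with a few small improvements:
 87 | 1. Fixed the py2.7 build script
 88 | 1. Fixed py3.7 compatibility, thanks to a PR
 89 | 1. Fixed anaconda compatibility, thanks to another PR
 90 | 1. Added a Dockerfile with all ABIs enabled, currently supporting py2.6/2.7/3.4-3.7
 91 | 1. Added C/C++ profiling, with the C/C++ stack and the Python stack merged in one view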
 92 | 
 93 | # How magic happens
 94 | 
 95 | The Uber engineering blog has an [in-depth walkthrough](https://eng.uber.com/pyflame/); here are just a few key points.
 96 | 
 97 | ## ptrace
 98 | 
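 99 | Linux provides a powerful system call, `ptrace`, which lets you attach to any process (with sudo you can do whatever you like), inspect its current registers, stack and memory, and even modify another process at will.
100 | 
101 | The well-known `gdb` is also built on ptrace. Setting a breakpoint, roughly speaking, means saving the current execution state, modifying registers to run the breakpoint's debugging routine, and then restoring the saved state to resume execution.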
102 | 
103 | ## PyThreadState
104 | 
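105 | With ptrace in hand, we can use Python's symbols to locate `PyThreadState`, which is where the Python virtual machine stores thread state, including the thread's call stack. Through `PyThreadState` we get the Python call stack inside the VM, walk the stack frames, and recover the Python function names, file names, line numbers and so on. What remains is high-frequency sampling and writing out the statistics.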
106 | 
107 | ![PyThreadState](/images/pyflame/Python-Thread-State.png)
108 | 
109 | To dig deeper into this part, see this introduction to the [Python virtual machine](https://github.com/Meteorix/pysourcenote/blob/master/vm.md).
110 | 
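The frame walk itself can be mimicked from inside Python, which may help build intuition. The sketch below follows the `f_back` links of frame objects, the Python-level view of the same frame chain that pyflame reads out of the target's memory. `current_stack`, `inner` and `outer` are hypothetical helper names, and `sys._getframe` is CPython-specific.

```python
import sys


def current_stack():
    """Return (filename, function name, line number) per frame, innermost first."""
    frame = sys._getframe(1)  # start at the caller, skipping current_stack itself
    entries = []
    while frame is not None:
        code = frame.f_code
        entries.append((code.co_filename, code.co_name, frame.f_lineno))
        frame = frame.f_back  # step to the calling frame
    return entries


def inner():
    return current_stack()


def outer():
    return inner()


for filename, name, lineno in outer():
    print("%s:%s:%d" % (filename, name, lineno))
```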
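111 | # What if we want to profile C/C++ too?
112 | 
113 | Today most deep learning programs are developed in a mix of C++ and Python. Sometimes a single line of Python actually calls dozens of lines of C++ underneath. To find out where the time really goes, we need to profile Python and C++ together. Here is an approach based on pyflame and libunwind.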
114 | 
115 | ## libunwind
116 | 
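117 | `libunwind` is another open-source library (written in C) that likewise uses `ptrace` to provide an interface for attaching to a remote process and unwinding its C stack. So within a single ptrace stop we can unwind both the C stack and the Python stack, then merge the two with a neat trick. With a tweak to the `flamegraph` color scheme, we get a mixed profile of the C/C++ and Python stacks.
118 | 
119 | ![Mixed C/Python profile](/images/pyflame/profile_c.svg)
120 | 
121 | From the flame graph above we can clearly see which underlying C++ functions each Python call actually invokes.
122 | 
123 | Another benefit is that we can see the C++ call stack even while Python is not holding the GIL.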
124 | 
125 | to be continued...
126 | 


--------------------------------------------------------------------------------
/_posts/python-ml-optimize.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: Python Machine Learning Performance Optimization, a PyCon2019 Talk
 3 | 
 4 | date: 2019-9-29 17:38:12
 5 | 
 6 | tags: [Deep Learning,Python]
 7 | 
 8 | ---
 9 | 
10 | 
11 | 
12 | [Python Machine Learning Performance Optimization, a PyCon2019 Talk](
13 | https://mp.weixin.qq.com/s?__biz=MzU3ODg1NDg0OQ==&mid=2247483733&idx=1&sn=50e64f16319411573ee5625a5eb3feb0&chksm=fd6e4bfbca19c2ed17998fb72776028bfd250715897774c825355c8014da079fd10bf7f32df3&token=1449769781&lang=zh_CN#rd)
14 | 


--------------------------------------------------------------------------------
/_posts/pytorch-coredump.md:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: Debugging a PyTorch core dump
  3 | 
  4 | date: 2019-4-30 19:47:32
  5 | 
  6 | tags: [Python,Deep Learning]
  7 | 
  8 | ---
  9 | 
 10 | 
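 11 | 1. Two of our algorithm engineers wrote a Flask web server to run an OCR service. Single requests worked fine, but it crashed under concurrent requests, and the log showed only: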
 12 |     ``Segmentation fault (core dumped)``
 13 | 
 14 |     >   “wtf is core dump?”
 15 | 
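 16 |     Just google it. In short, it is information dumped after a program crashes
 17 | 
 18 | 2. They debugged until 2 a.m. and added a pile of logging, but still could not find the cause; it looked random. The next day they came to me:
 19 | 
 20 |     > What do you do when Python core dumps?
 21 | 
 22 | 3. To debug the core dump, first enable core files in the system settings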
 23 |     ```
 24 |     ulimit -c unlimited
 25 |     ```
 26 | 
 27 | 4. Trigger the core dump again, and a 4G core file appears in the current directory
 28 |     ```
 29 |     -rw------- 1 liuxin root 4.2G 4月  29 14:21 core
 30 |     ```
 31 | 
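 32 | 5. The core file stores everything about the moment of the crash: call stacks, threads, memory and so on. Our Linux C++ experience tells us that gdb is the tool for debugging a core dump file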
 33 | 
 34 |     ```
 35 |     gdb <executable> <coredump file>
 36 |     ```
 37 | 
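 38 | 6. Wait, it was a Python program that crashed, so why is there no traceback? Because the segmentation fault happened inside PyTorch's C++ code, and the process died before Python's exit machinery could print a traceback. The operating system can still produce a core dump file, though, and that is our lifeline.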
 39 | 
 40 | <!-- more -->
 41 | 
 42 | 7. Debug the Python core dump file directly with gdb
 43 | 
 44 |     ```
 45 |     gdb python ./core
 46 |     ```
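 47 |     This prints a pile of meaningless output because your Python has no debug symbols installed. The addresses on the native stack have to be matched against symbol information before they can be resolved to C/C++ code
 48 | 
 49 | 8. Find the symbol package matching your Python version and install it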
 50 |     ```
 51 |     sudo apt install python3.6-dbg
 52 |     ```
 53 | 
 54 | 9. Now the C stack is visible
 55 |     ```
 56 |     (gdb) bt
 57 |     #0  0x00007fffd721ec4b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
 58 |     #1  0x00007fffd734ed06 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
 59 |     #2  0x00007fffd7138a46 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
 60 |     #3  0x00007fffd7138c73 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
 61 |     #4  0x00007fffd7290f40 in cuLaunchKernel () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
 62 |     #5  0x00007fffa72f2dcb in cudart::cudaApiLaunchCommon(void const*, bool) ()
 63 |        from /home/liuxin/venv/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
 64 |     #6  0x00007fffa73105a8 in cudaLaunch () from /home/liuxin/venv/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
 65 |     #7  0x00007fffa73c70f1 in cublasStatus_t cublasGemv<float, float, float, 128, 4, 16, 4, 4>(cublasContext*, cublasOperation_t, int, int, float const*, float const*, int, float const*, int, float const*, float*, int) ()
 66 |        from /home/liuxin/venv/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
 67 |     #8  0x00007fffa73e0bae in cublasSgemmRecursiveEntry(cublasContext*, int, int, int, int, int, float const*, float const*, int, float const*, int, float const*, float*, int) () from /home/liuxin/venv/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
 68 |     #9  0x00007fffa744f8dd in cublasSgemmEx_internal(cublasContext*, cublasOperation_t, cublasOperation_t, int, int, int, float const*, void const*, cudaDataType_t, int, void const*, cudaDataType_t, int, float const*, void*, cudaDataType_t, int, bool, bool) ()
 69 |        from /home/liuxin/venv/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
 70 |     #10 0x00007fffa7459ad2 in cublasSgemmExRecursiveEntry(cublasContext*, int, int, int, int, int, float const*, void const*, cudaDataType_t, int, void const*, cudaDataType_t, int, float const*, void*, cudaDataType_t, int, cublasGemmAlgo_t, int, int) ()
 71 |        from /home/liuxin/venv/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
 72 |       
 73 |     ...more than 200 further lines omitted...
 74 |     ```
 75 | 
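 76 |     From the C stack above we can roughly tell that the crash is in PyTorch's RNN module. But where did our Python code go wrong?
 77 | 
 78 | 10. So where is the Python stack? Luckily, Python is written in C, so from the C stack and Python's symbol information we can reconstruct the Python call stack. Better still, gdb-python already does this for us.
 79 | 
 80 |     Find ``Tools/gdb/libpython.py`` in the source tree of the matching Python version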
 81 |     > https://github.com/python/cpython/tree/3.6/Tools/gdb
 82 |     
 83 |     Download it, then use it in gdb
 84 |     ```
 85 |     (gdb) source libpython.py
 86 |     (gdb) py-bt
 87 |     Traceback (most recent call first):
 88 |       <built-in method _cudnn_rnn of type object at remote 0x7fffd51dd640>
 89 |       File "/home/liuxin/venv/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 288, in forward
 90 |         dropout_ts)
 91 |       File "/home/liuxin/venv/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 324, in forward
 92 |         return func(input, *fargs, **fkwargs)
 93 |       File "/home/liuxin/venv/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 192, in forward
 94 |         output, hidden = func(input, self.all_weights, hx, batch_sizes)
 95 |       File "/home/liuxin/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
 96 |         result = self.forward(*input, **kwargs)
 97 |       File "/home/liuxin/shannon_vision/shannon_vision_models/shannon_vision_nanjing_bank/crnn/master/models/crnn.py", line 69, in forward
 98 |         recurrent, _ = self.rnn(x)
 99 |       File "/home/liuxin/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
100 |         result = self.forward(*input, **kwargs)
101 |       File "/home/liuxin/venv/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
102 |         input = module(input)
103 |       File "/home/liuxin/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
104 |         result = self.forward(*input, **kwargs)
105 |       File "/home/liuxin/shannon_vision/shannon_vision_models/shannon_vision_nanjing_bank/crnn/master/models/crnn.py", line 57, in forward
106 |         out = self.rnn(conv)
107 |       File "/home/liuxin/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
108 |         result = self.forward(*input, **kwargs)
109 |       File "/home/liuxin/shannon_vision/shannon_vision_models/shannon_vision_nanjing_bank/crnn/master/models/recognizer.py", line 65, in recognize
110 |         pred_matrix = self.model(charline_image)
111 |       File "xiang_server.py", line 93, in run_crnn_server
112 |         ocr_pred, matrix = recognizer.recognize(img)
113 |       File "/home/liuxin/venv/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request
114 |      
115 |     ...more than thirty further lines omitted...
116 |     ```
117 | 
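118 | 11. Found it: the culprit is the ``recognize`` function, so let's fix it. From "experience" we guessed that the problem came from calling GPU code from multiple threads, so we tried adding a lock first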
119 | 
120 |     ```
121 |     lock = threading.Lock()
122 |     ...
123 |     with lock:
124 |         ocr_pred, matrix = recognizer.recognize(img)
125 |     ...
126 |     ```
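127 |     Sure enough, with the lock in place the mysterious core dumps disappeared, so our hypothesis was:
128 |     > Calls into GPU code may not be thread-safe; adding a thread lock is safer.
129 |     
130 | 12. So are PyTorch functions thread-safe or not? Our algorithm engineer wrote a piece of code to check: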
131 |     ```
132 |     # -*- coding: utf-8 -*-
133 |     from torch import nn
134 |     import torch
135 |     from threading import Thread
136 |     
137 |     
138 |     class BidirectionalLSTM(nn.Module):
139 |         """Build a simple network"""
140 |         def __init__(self, n_in, n_hidden, n_out):
141 |             super(BidirectionalLSTM, self).__init__()
142 |             self.rnn = nn.LSTM(n_in, n_hidden, bidirectional=True)
143 |             self.embedding = nn.Linear(n_hidden * 2, n_out)
144 |     
145 |         def forward(self, x):
146 |             recurrent, _ = self.rnn(x)
147 |             t, b, h = recurrent.size()
148 |             t_rec = recurrent.view(t * b, h)
149 |             out = self.embedding(t_rec)  # [t * b, n_out]
150 |             out = out.view(t, b, -1)
151 |     
152 |             return out
153 |     
154 |     
155 |     lstm = BidirectionalLSTM(256, 256, 256)
156 |     x = torch.randn(12,100,256)
157 |     device = torch.device('cuda:0')
158 |     lstm = lstm.to(device)
159 |     x = x.to(device)
160 |     
161 |     
162 |     def print_time(threadName):
163 |         count = 0
164 |         while count < 1000:
165 |             lstm(x)
166 |             count += 1
167 |             print(threadName, ":", count)
168 |     
169 |     
170 |     def run_rnn_multi_threaded():
171 |         for i in range(5):
172 |             name = "Thread-%d" % i
173 |             t = Thread(target=print_time, args=(name,), name=name)
174 |             t.start()
175 |     
176 |     
177 |     if __name__ == '__main__':
178 |         run_rnn_multi_threaded()
179 |     ```
180 |     Sure enough, it core dumped shortly after starting
181 |     ```
182 |     ...
183 |     Thread-0 : 79
184 |     Thread-4 : 74
185 |     Thread-3 : 74
186 |     Thread-2 : 74
187 |     Thread-1 : 76
188 |     Thread-0 : 80
189 |     Thread-4 : 75
190 |     Thread-3 : 75
191 |     Thread-2 : 75
192 |     Thread-1 : 77
193 |     Thread-0 : 81
194 |     Segmentation fault (core dumped)
195 |     ```
196 | 
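197 | 13. I searched for ``pytorch thread safety``, and many answers said it is thread-safe by default. So I filed an issue upstream, https://github.com/pytorch/pytorch/issues/19913, and waited for a reply
198 | 
199 | 14. Although two lines of code were enough for a temporary fix, this debugging session taught us about core dumps, gdb, debug symbols and thread safety, plus the ultimate weapon for debugging Python: [gdb-python](https://github.com/Meteorix/meteorix-blog/blob/master/_posts/gdbpython.md)
200 | 
201 |     > On Windows we even have Visual Studio, the most powerful IDE in the universe, which offers the equivalent [mixed Python/C debugging](https://github.com/Meteorix/meteorix-blog/blob/master/_posts/vsdebugpycpp.md)
202 | 
203 | 15. Two days later the maintainers replied on the issue: newer PyTorch releases fixed many thread-safety problems. So I upgraded to PyTorch 1.1 and reran the code above, and indeed the problem was gone. To improve the web server's performance, we plan to upgrade PyTorch so that we can remove that annoying lock.
204 | 
205 | 16. Many mysteries about this problem remain unsolved. To be continued...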
206 | 
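A complementary trick for the missing traceback in step 6 is the standard library's `faulthandler` module, which can dump the Python stack of every thread when the process receives a fatal signal such as SIGSEGV. This is a minimal sketch and was not part of the original debugging session; the log path is a hypothetical choice.

```python
import faulthandler
import os
import tempfile

# Hypothetical location for the crash log; any writable path works.
log_path = os.path.join(tempfile.gettempdir(), "crash_traceback.log")

# The file must stay open for as long as faulthandler is enabled,
# because the handler writes to its file descriptor at crash time.
crash_log = open(log_path, "w")
faulthandler.enable(file=crash_log, all_threads=True)

print("faulthandler enabled:", faulthandler.is_enabled())
```

The same effect is available without code changes by starting the interpreter with `python -X faulthandler` or by setting the `PYTHONFAULTHANDLER` environment variable, in which case the traceback goes to stderr. It only shows the Python frames, so gdb and the core file are still needed for the C/C++ side.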


--------------------------------------------------------------------------------
/_posts/vsdebugpycpp.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: Mixed-mode debugging of py/c++ in vs2017
 3 | 
 4 | date: 2019-2-13 23:51:32
 5 | 
 6 | tags: Python
 7 | 
 8 | ---
 9 | 
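10 | # Mixed-mode debugging of py/c++ in vs2017
11 | 
12 | 
13 | ## What to do when Python hangs?
14 | 
15 | When an ordinary Python program hangs, you can debug it with PyCharm. But sometimes the program mixes Python with C/C++ libraries, for example a PyQt or Boost.Python program, and a hang becomes very hard to track down. I used to bisect by commenting out code, which is extremely inefficient.
16 | 
17 | So I tried a new feature in vs2017, mixed-mode debugging of Python and C++: [Debug Python and C++ together](https://docs.microsoft.com/en-us/visualstudio/python/debugging-mixed-mode-c-cpp-python-in-visual-studio?view=vs-2017)
18 | 
19 | ## vs2017 python
20 | 
21 | VS has long been called the most powerful IDE in the universe, and Python support appears to be new in vs2017. Compared with Python editors such as PyCharm, the biggest advantage of vs2017 is mixed debugging of Python and native C/C++ code, including:
22 | 
23 | *   Python / C++ call stacks shown merged
24 | *   Breakpoints in both Python and C++
25 | *   Stepping between Python and C++ code
26 | *   Watching both Python and C++ objects
27 | 
28 | Because of Python's performance bottlenecks, mixed-language projects often need both sides debugged at once, and this is exactly what we need!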
29 | 
30 | <!--more-->
31 | 
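32 | ## Setting up the environment
33 | 
34 | The steps below use `Sunshine` (a PyQt5 program) as an example
35 | 
36 | 1. Install Python support for vs2017
37 | 2. Import the Sunshine project; this generates Sunshine.sln in the project root, which you can double-click to open from then on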
38 | 
39 |     ```New Project->Python->From Existing Python Code```
40 |     ![image.png](/images/vsdebugpycpp/5c3d9b205e60273aadf4650714DcRPJX.png)
41 | 3. Configure it following the official documentation: [Enable mixed-mode debugging in a Python project](https://docs.microsoft.com/en-us/visualstudio/python/debugging-mixed-mode-c-cpp-python-in-visual-studio?view=vs-2017#enable-mixed-mode-debugging-in-a-python-project)
42 | 4. Install the `pdb` files for Python; for 3.6 the Python installer can install them directly, see [here](https://docs.microsoft.com/en-us/visualstudio/python/debugging-symbols-for-mixed-mode-c-cpp-python?view=vs-2017#download-symbols)
43 | 5. Set up the `pdb` files, mainly those for Qt5. Qt has offered pdb downloads since 5.9; [the ones for 5.11.2 are here](https://download.qt.io/archive/qt/5.11/5.11.2/) (the latest PyQt, 5.11.3, uses Qt 5.11.2; don't ask me how I know)
44 | 6. Select `Sunshine.py` and press **F5** to launch!
45 | 
46 | Startup is rather slow; the output window shows VS loading symbols for each py file... (well, it is an experimental feature after all)
47 | 
48 | ## Let's try it
49 | 
50 | ### Pausing during a hang
51 | 
52 | Suppose Sunshine has hung. Click the pause button in VS and it stops immediately.
53 | 
54 | ![image.png](/images/vsdebugpycpp/5c3d9dbcaa49f15c3726191dzKYsU5Qw.png)
55 | 
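56 | Remember the dread of PyCharm failing to pause! VS pauses at the native level, after all. Best in the universe!
57 | 
58 | ### Powerful mixed debugging
59 | 
60 | Now the powerful **mixed debugging** features are available, including but not limited to:
61 | *   Viewing the Python/C++ callstack together
62 | *   Double-clicking a callstack entry to jump to the source
63 | *   Switching threads
64 | *   Inspecting locals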
65 | 
66 | ![image.png](/images/vsdebugpycpp/5c3d9e54a7f2529830bb770bjqP16HQy.png)
67 | 
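68 | ### Mixed single-stepping
69 | 
70 | Previously, once Python called into Qt there was no way to tell what was happening. Now you can step from Python into C++.
71 | 
72 | ![image.png](/images/vsdebugpycpp/5c3da03a96dee435e6604c4aVmHaq5uC.png)
73 | 
74 | Double-click a C++ frame in the callstack and VS prompts you to open the source. Of course, this requires the Qt C++ source to be downloaded locally.
75 | 
76 | Then you can set breakpoints in C++ and do all the usual operations
77 | 
78 | ## Switching to Attach mode
79 | 
80 | Launching a Python project under the VS debugger is really slow (it feels like a bug), and more often the process hangs while in use, in which case we need to attach to the hung Python process.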
81 | 
82 | Debug->Attach to Process
83 | 
84 | ![image.png](/images/vsdebugpycpp/5c3da1cb7f9d2a99198674256wRjQZ2J.png)
85 | 
86 | Once attached, all the features work the same way.
87 | 
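When several Python processes are running, it can be hard to pick the right one in the Attach to Process dialog. Below is a minimal sketch, assuming you can add a few lines to the program's startup code, that prints the process ID and interpreter path so the process can be matched against the dialog's list. `describe_process` is a hypothetical helper written for illustration; it is not part of Sunshine or Visual Studio.

```python
import os
import sys


def describe_process() -> str:
    """Return a one-line description of the current Python process.

    Hypothetical helper: the PID and interpreter path are the two values
    that help identify a process in a debugger's attach dialog.
    """
    return f"pid={os.getpid()} exe={sys.executable}"


if __name__ == "__main__":
    # Print once at startup, before the GUI event loop begins,
    # so the value is visible even if the program later hangs.
    print(describe_process())
```

Calling this at the top of the entry script means the PID is already in the console output by the time the process hangs.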


--------------------------------------------------------------------------------
/images/airtest-douyin/assistant.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/airtest-douyin/assistant.png


--------------------------------------------------------------------------------
/images/airtest-douyin/ide01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/airtest-douyin/ide01.png


--------------------------------------------------------------------------------
/images/airtest-douyin/ide02.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/airtest-douyin/ide02.png


--------------------------------------------------------------------------------
/images/airtest-douyin/ide03.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/airtest-douyin/ide03.png


--------------------------------------------------------------------------------
/images/airtest-douyin/nox.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/airtest-douyin/nox.png


--------------------------------------------------------------------------------
/images/airtest-douyin/snapshot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/airtest-douyin/snapshot.png


--------------------------------------------------------------------------------
/images/bert-runtime/async.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/bert-runtime/async.png


--------------------------------------------------------------------------------
/images/bert-runtime/bert.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/bert-runtime/bert.png


--------------------------------------------------------------------------------
/images/bert-runtime/gelu.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/bert-runtime/gelu.png


--------------------------------------------------------------------------------
/images/bert-runtime/gelujit.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/bert-runtime/gelujit.png


--------------------------------------------------------------------------------
/images/bert-runtime/gpucpu.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/bert-runtime/gpucpu.png


--------------------------------------------------------------------------------
/images/bert-runtime/qkv.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/bert-runtime/qkv.png


--------------------------------------------------------------------------------
/images/bert-runtime/std.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/bert-runtime/std.jpg


--------------------------------------------------------------------------------
/images/cuda101/autoscale.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/cuda101/autoscale.png


--------------------------------------------------------------------------------
/images/cuda101/graph.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/cuda101/graph.jpg


--------------------------------------------------------------------------------
/images/cuda101/grid.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/cuda101/grid.jpg


--------------------------------------------------------------------------------
/images/cuda101/matrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/cuda101/matrix.png


--------------------------------------------------------------------------------
/images/cuda101/mem.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/cuda101/mem.png


--------------------------------------------------------------------------------
/images/cuda101/nvprof.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/cuda101/nvprof.png


--------------------------------------------------------------------------------
/images/cuda101/shared_matrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/cuda101/shared_matrix.png


--------------------------------------------------------------------------------
/images/cuda101/transistors.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/cuda101/transistors.png


--------------------------------------------------------------------------------
/images/gdbpython/bt.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/gdbpython/bt.png


--------------------------------------------------------------------------------
/images/netease-games/0.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/netease-games/0.jpeg


--------------------------------------------------------------------------------
/images/netease-games/1.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/netease-games/1.jpeg


--------------------------------------------------------------------------------
/images/netease-games/2.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/netease-games/2.jpeg


--------------------------------------------------------------------------------
/images/pyflame/Python-Thread-State.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/pyflame/Python-Thread-State.png


--------------------------------------------------------------------------------
/images/pyflame/profile_py.svg:
--------------------------------------------------------------------------------
  1 | <?xml version="1.0" standalone="no"?>
  2 | <!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
  3 | <svg version="1.1" width="1200" height="390" onload="init(evt)" viewBox="0 0 1200 390" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
  4 | <!-- Flame graph stack visualization. See https://github.com/brendangregg/FlameGraph for latest version, and http://www.brendangregg.com/flamegraphs.html for examples. -->
  5 | <!-- NOTES:  -->
  6 | <defs>
  7 | 	<linearGradient id="background" y1="0" y2="1" x1="0" x2="0" >
  8 | 		<stop stop-color="#eeeeee" offset="5%" />
  9 | 		<stop stop-color="#eeeeb0" offset="95%" />
 10 | 	</linearGradient>
 11 | </defs>
 12 | <style type="text/css">
 13 | 	text { font-family:Verdana; font-size:12px; fill:rgb(0,0,0); }
 14 | 	#search { opacity:0.1; cursor:pointer; }
 15 | 	#search:hover, #search.show { opacity:1; }
 16 | 	#subtitle { text-anchor:middle; font-color:rgb(160,160,160); }
 17 | 	#title { text-anchor:middle; font-size:17px}
 18 | 	#unzoom { cursor:pointer; }
 19 | 	#frames > *:hover { stroke:black; stroke-width:0.5; cursor:pointer; }
 20 | 	.hide { display:none; }
 21 | 	.parent { opacity:0.5; }
 22 | </style>
 23 | <script type="text/ecmascript">
 24 | <![CDATA[
 25 | 	"use strict";
 26 | 	var details, searchbtn, unzoombtn, matchedtxt, svg, searching;
 27 | 	function init(evt) {
 28 | 		details = document.getElementById("details").firstChild;
 29 | 		searchbtn = document.getElementById("search");
 30 | 		unzoombtn = document.getElementById("unzoom");
 31 | 		matchedtxt = document.getElementById("matched");
 32 | 		svg = document.getElementsByTagName("svg")[0];
 33 | 		searching = 0;
 34 | 	}
 35 | 
 36 | 	window.addEventListener("click", function(e) {
 37 | 		var target = find_group(e.target);
 38 | 		if (target) {
 39 | 			if (target.nodeName == "a") {
 40 | 				if (e.ctrlKey === false) return;
 41 | 				e.preventDefault();
 42 | 			}
 43 | 			if (target.classList.contains("parent")) unzoom();
 44 | 			zoom(target);
 45 | 		}
 46 | 		else if (e.target.id == "unzoom") unzoom();
 47 | 		else if (e.target.id == "search") search_prompt();
 48 | 	}, false)
 49 | 
 50 | 	// mouse-over for info
 51 | 	// show
 52 | 	window.addEventListener("mouseover", function(e) {
 53 | 		var target = find_group(e.target);
 54 | 		if (target) details.nodeValue = "Function: " + g_to_text(target);
 55 | 	}, false)
 56 | 
 57 | 	// clear
 58 | 	window.addEventListener("mouseout", function(e) {
 59 | 		var target = find_group(e.target);
 60 | 		if (target) details.nodeValue = ' ';
 61 | 	}, false)
 62 | 
 63 | 	// ctrl-F for search
 64 | 	window.addEventListener("keydown",function (e) {
 65 | 		if (e.keyCode === 114 || (e.ctrlKey && e.keyCode === 70)) {
 66 | 			e.preventDefault();
 67 | 			search_prompt();
 68 | 		}
 69 | 	}, false)
 70 | 
 71 | 	// functions
 72 | 	function find_child(node, selector) {
 73 | 		var children = node.querySelectorAll(selector);
 74 | 		if (children.length) return children[0];
 75 | 		return;
 76 | 	}
 77 | 	function find_group(node) {
 78 | 		var parent = node.parentElement;
 79 | 		if (!parent) return;
 80 | 		if (parent.id == "frames") return node;
 81 | 		return find_group(parent);
 82 | 	}
 83 | 	function orig_save(e, attr, val) {
 84 | 		if (e.attributes["_orig_" + attr] != undefined) return;
 85 | 		if (e.attributes[attr] == undefined) return;
 86 | 		if (val == undefined) val = e.attributes[attr].value;
 87 | 		e.setAttribute("_orig_" + attr, val);
 88 | 	}
 89 | 	function orig_load(e, attr) {
 90 | 		if (e.attributes["_orig_"+attr] == undefined) return;
 91 | 		e.attributes[attr].value = e.attributes["_orig_" + attr].value;
 92 | 		e.removeAttribute("_orig_"+attr);
 93 | 	}
 94 | 	function g_to_text(e) {
 95 | 		var text = find_child(e, "title").firstChild.nodeValue;
 96 | 		return (text)
 97 | 	}
 98 | 	function g_to_func(e) {
 99 | 		var func = g_to_text(e);
100 | 		// if there's any manipulation we want to do to the function
101 | 		// name before it's searched, do it here before returning.
102 | 		return (func);
103 | 	}
104 | 	function update_text(e) {
105 | 		var r = find_child(e, "rect");
106 | 		var t = find_child(e, "text");
107 | 		var w = parseFloat(r.attributes.width.value) -3;
108 | 		var txt = find_child(e, "title").textContent.replace(/\([^(]*\)$/,"");
109 | 		t.attributes.x.value = parseFloat(r.attributes.x.value) + 3;
110 | 
111 | 		// Smaller than this size won't fit anything
112 | 		if (w < 2 * 12 * 0.59) {
113 | 			t.textContent = "";
114 | 			return;
115 | 		}
116 | 
117 | 		t.textContent = txt;
118 | 		// Fit in full text width
119 | 		if (/^ *$/.test(txt) || t.getSubStringLength(0, txt.length) < w)
120 | 			return;
121 | 
122 | 		for (var x = txt.length - 2; x > 0; x--) {
123 | 			if (t.getSubStringLength(0, x + 2) <= w) {
124 | 				t.textContent = txt.substring(0, x) + "..";
125 | 				return;
126 | 			}
127 | 		}
128 | 		t.textContent = "";
129 | 	}
130 | 
131 | 	// zoom
132 | 	function zoom_reset(e) {
133 | 		if (e.attributes != undefined) {
134 | 			orig_load(e, "x");
135 | 			orig_load(e, "width");
136 | 		}
137 | 		if (e.childNodes == undefined) return;
138 | 		for (var i = 0, c = e.childNodes; i < c.length; i++) {
139 | 			zoom_reset(c[i]);
140 | 		}
141 | 	}
142 | 	function zoom_child(e, x, ratio) {
143 | 		if (e.attributes != undefined) {
144 | 			if (e.attributes.x != undefined) {
145 | 				orig_save(e, "x");
146 | 				e.attributes.x.value = (parseFloat(e.attributes.x.value) - x - 10) * ratio + 10;
147 | 				if (e.tagName == "text")
148 | 					e.attributes.x.value = find_child(e.parentNode, "rect[x]").attributes.x.value + 3;
149 | 			}
150 | 			if (e.attributes.width != undefined) {
151 | 				orig_save(e, "width");
152 | 				e.attributes.width.value = parseFloat(e.attributes.width.value) * ratio;
153 | 			}
154 | 		}
155 | 
156 | 		if (e.childNodes == undefined) return;
157 | 		for (var i = 0, c = e.childNodes; i < c.length; i++) {
158 | 			zoom_child(c[i], x - 10, ratio);
159 | 		}
160 | 	}
161 | 	function zoom_parent(e) {
162 | 		if (e.attributes) {
163 | 			if (e.attributes.x != undefined) {
164 | 				orig_save(e, "x");
165 | 				e.attributes.x.value = 10;
166 | 			}
167 | 			if (e.attributes.width != undefined) {
168 | 				orig_save(e, "width");
169 | 				e.attributes.width.value = parseInt(svg.width.baseVal.value) - (10 * 2);
170 | 			}
171 | 		}
172 | 		if (e.childNodes == undefined) return;
173 | 		for (var i = 0, c = e.childNodes; i < c.length; i++) {
174 | 			zoom_parent(c[i]);
175 | 		}
176 | 	}
177 | 	function zoom(node) {
178 | 		var attr = find_child(node, "rect").attributes;
179 | 		var width = parseFloat(attr.width.value);
180 | 		var xmin = parseFloat(attr.x.value);
181 | 		var xmax = parseFloat(xmin + width);
182 | 		var ymin = parseFloat(attr.y.value);
183 | 		var ratio = (svg.width.baseVal.value - 2 * 10) / width;
184 | 
185 | 		// XXX: Workaround for JavaScript float issues (fix me)
186 | 		var fudge = 0.0001;
187 | 
188 | 		unzoombtn.classList.remove("hide");
189 | 
190 | 		var el = document.getElementById("frames").children;
191 | 		for (var i = 0; i < el.length; i++) {
192 | 			var e = el[i];
193 | 			var a = find_child(e, "rect").attributes;
194 | 			var ex = parseFloat(a.x.value);
195 | 			var ew = parseFloat(a.width.value);
196 | 			var upstack;
197 | 			// Is it an ancestor
198 | 			if (0 == 0) {
199 | 				upstack = parseFloat(a.y.value) > ymin;
200 | 			} else {
201 | 				upstack = parseFloat(a.y.value) < ymin;
202 | 			}
203 | 			if (upstack) {
204 | 				// Direct ancestor
205 | 				if (ex <= xmin && (ex+ew+fudge) >= xmax) {
206 | 					e.classList.add("parent");
207 | 					zoom_parent(e);
208 | 					update_text(e);
209 | 				}
210 | 				// not in current path
211 | 				else
212 | 					e.classList.add("hide");
213 | 			}
214 | 			// Children maybe
215 | 			else {
216 | 				// no common path
217 | 				if (ex < xmin || ex + fudge >= xmax) {
218 | 					e.classList.add("hide");
219 | 				}
220 | 				else {
221 | 					zoom_child(e, xmin, ratio);
222 | 					update_text(e);
223 | 				}
224 | 			}
225 | 		}
226 | 	}
227 | 	function unzoom() {
228 | 		unzoombtn.classList.add("hide");
229 | 		var el = document.getElementById("frames").children;
230 | 		for(var i = 0; i < el.length; i++) {
231 | 			el[i].classList.remove("parent");
232 | 			el[i].classList.remove("hide");
233 | 			zoom_reset(el[i]);
234 | 			update_text(el[i]);
235 | 		}
236 | 	}
237 | 
238 | 	// search
239 | 	function reset_search() {
240 | 		var el = document.querySelectorAll("#frames rect");
241 | 		for (var i = 0; i < el.length; i++) {
242 | 			orig_load(el[i], "fill")
243 | 		}
244 | 	}
245 | 	function search_prompt() {
246 | 		if (!searching) {
247 | 			var term = prompt("Enter a search term (regexp " +
248 | 			    "allowed, eg: ^ext4_)", "");
249 | 			if (term != null) {
250 | 				search(term)
251 | 			}
252 | 		} else {
253 | 			reset_search();
254 | 			searching = 0;
255 | 			searchbtn.classList.remove("show");
256 | 			searchbtn.firstChild.nodeValue = "Search"
257 | 			matchedtxt.classList.add("hide");
258 | 			matchedtxt.firstChild.nodeValue = ""
259 | 		}
260 | 	}
261 | 	function search(term) {
262 | 		var re = new RegExp(term);
263 | 		var el = document.getElementById("frames").children;
264 | 		var matches = new Object();
265 | 		var maxwidth = 0;
266 | 		for (var i = 0; i < el.length; i++) {
267 | 			var e = el[i];
268 | 			var func = g_to_func(e);
269 | 			var rect = find_child(e, "rect");
270 | 			if (func == null || rect == null)
271 | 				continue;
272 | 
273 | 			// Save max width. Only works as we have a root frame
274 | 			var w = parseFloat(rect.attributes.width.value);
275 | 			if (w > maxwidth)
276 | 				maxwidth = w;
277 | 
278 | 			if (func.match(re)) {
279 | 				// highlight
280 | 				var x = parseFloat(rect.attributes.x.value);
281 | 				orig_save(rect, "fill");
282 | 				rect.attributes.fill.value = "rgb(230,0,230)";
283 | 
284 | 				// remember matches
285 | 				if (matches[x] == undefined) {
286 | 					matches[x] = w;
287 | 				} else {
288 | 					if (w > matches[x]) {
289 | 						// overwrite with parent
290 | 						matches[x] = w;
291 | 					}
292 | 				}
293 | 				searching = 1;
294 | 			}
295 | 		}
296 | 		if (!searching)
297 | 			return;
298 | 
299 | 		searchbtn.classList.add("show");
300 | 		searchbtn.firstChild.nodeValue = "Reset Search";
301 | 
302 | 		// calculate percent matched, excluding vertical overlap
303 | 		var count = 0;
304 | 		var lastx = -1;
305 | 		var lastw = 0;
306 | 		var keys = Array();
307 | 		for (k in matches) {
308 | 			if (matches.hasOwnProperty(k))
309 | 				keys.push(k);
310 | 		}
311 | 		// sort the matched frames by their x location
312 | 		// ascending, then width descending
313 | 		keys.sort(function(a, b){
314 | 			return a - b;
315 | 		});
316 | 		// Step through frames saving only the biggest bottom-up frames
317 | 		// thanks to the sort order. This relies on the tree property
318 | 		// where children are always smaller than their parents.
319 | 		var fudge = 0.0001;	// JavaScript floating point
320 | 		for (var k in keys) {
321 | 			var x = parseFloat(keys[k]);
322 | 			var w = matches[keys[k]];
323 | 			if (x >= lastx + lastw - fudge) {
324 | 				count += w;
325 | 				lastx = x;
326 | 				lastw = w;
327 | 			}
328 | 		}
329 | 		// display matched percent
330 | 		matchedtxt.classList.remove("hide");
331 | 		var pct = 100 * count / maxwidth;
332 | 		if (pct != 100) pct = pct.toFixed(1)
333 | 		matchedtxt.firstChild.nodeValue = "Matched: " + pct + "%";
334 | 	}
335 | ]]>
336 | </script>
337 | <rect x="0.0" y="0" width="1200.0" height="390.0" fill="url(#background)"  />
338 | <text id="title" x="600.00" y="24" >Flame Graph</text>
339 | <text id="details" x="10.00" y="373" > </text>
340 | <text id="unzoom" x="10.00" y="24" class="hide">Reset Zoom</text>
341 | <text id="search" x="1090.00" y="24" >Search</text>
342 | <text id="matched" x="1090.00" y="373" > </text>
343 | <g id="frames">
344 | <g >
345 | <title>test.py:sum:11 (145 samples, 51.79%)</title><rect x="578.9" y="309" width="611.1" height="15.0" fill="rgb(101,246,101)" rx="2" ry="2" />
346 | <text  x="581.93" y="319.5" >test.py:sum:11</text>
347 | </g>
348 | <g >
349 | <title>/opt/miniconda3/lib/python3.6/sysconfig.py:_init_posix:428 (1 samples, 0.36%)</title><rect x="439.9" y="133" width="4.2" height="15.0" fill="rgb(89,235,89)" rx="2" ry="2" />
350 | <text  x="442.86" y="143.5" ></text>
351 | </g>
352 | <g >
353 | <title>&lt;frozen importlib._bootstrap&gt;:_load_unlocked:665 (1 samples, 0.36%)</title><rect x="439.9" y="85" width="4.2" height="15.0" fill="rgb(185,185,53)" rx="2" ry="2" />
354 | <text  x="442.86" y="95.5" ></text>
355 | </g>
356 | <g >
357 | <title>/opt/miniconda3/lib/python3.6/site.py:getuserbase:248 (1 samples, 0.36%)</title><rect x="439.9" y="181" width="4.2" height="15.0" fill="rgb(60,209,60)" rx="2" ry="2" />
358 | <text  x="442.86" y="191.5" ></text>
359 | </g>
360 | <g >
361 | <title>&lt;frozen importlib._bootstrap_external&gt;:_compile_bytecode:487 (1 samples, 0.36%)</title><rect x="439.9" y="37" width="4.2" height="15.0" fill="rgb(201,201,59)" rx="2" ry="2" />
362 | <text  x="442.86" y="47.5" ></text>
363 | </g>
364 | <g >
365 | <title>/opt/miniconda3/lib/python3.6/site.py:addusersitepackages:282 (1 samples, 0.36%)</title><rect x="439.9" y="213" width="4.2" height="15.0" fill="rgb(74,222,74)" rx="2" ry="2" />
366 | <text  x="442.86" y="223.5" ></text>
367 | </g>
368 | <g >
369 | <title>&lt;frozen importlib._bootstrap_external&gt;:exec_module:674 (1 samples, 0.36%)</title><rect x="439.9" y="69" width="4.2" height="15.0" fill="rgb(193,193,56)" rx="2" ry="2" />
370 | <text  x="442.86" y="79.5" ></text>
371 | </g>
372 | <g >
373 | <title>&lt;frozen importlib._bootstrap&gt;:_find_and_load:971 (2 samples, 0.71%)</title><rect x="435.6" y="325" width="8.5" height="15.0" fill="rgb(186,186,54)" rx="2" ry="2" />
374 | <text  x="438.64" y="335.5" ></text>
375 | </g>
376 | <g >
377 | <title>&lt;frozen importlib._bootstrap&gt;:module_from_spec:577 (1 samples, 0.36%)</title><rect x="435.6" y="181" width="4.3" height="15.0" fill="rgb(204,204,60)" rx="2" ry="2" />
378 | <text  x="438.64" y="191.5" ></text>
379 | </g>
380 | <g >
381 | <title>all (280 samples, 100%)</title><rect x="10.0" y="341" width="1180.0" height="15.0" fill="rgb(223,83,83)" rx="2" ry="2" />
382 | <text  x="13.00" y="351.5" ></text>
383 | </g>
384 | <g >
385 | <title>&lt;frozen importlib._bootstrap&gt;:_load_unlocked:665 (2 samples, 0.71%)</title><rect x="435.6" y="293" width="8.5" height="15.0" fill="rgb(202,202,59)" rx="2" ry="2" />
386 | <text  x="438.64" y="303.5" ></text>
387 | </g>
388 | <g >
389 | <title>&lt;frozen importlib._bootstrap&gt;:_find_and_load_unlocked:955 (1 samples, 0.36%)</title><rect x="435.6" y="213" width="4.3" height="15.0" fill="rgb(182,182,52)" rx="2" ry="2" />
390 | <text  x="438.64" y="223.5" ></text>
391 | </g>
392 | <g >
393 | <title>/opt/miniconda3/lib/python3.6/encodings/__init__.py:&lt;module&gt;:31 (1 samples, 0.36%)</title><rect x="435.6" y="245" width="4.3" height="15.0" fill="rgb(81,229,81)" rx="2" ry="2" />
394 | <text  x="438.64" y="255.5" ></text>
395 | </g>
396 | <g >
397 | <title>/opt/miniconda3/lib/python3.6/sysconfig.py:get_config_vars:557 (1 samples, 0.36%)</title><rect x="439.9" y="149" width="4.2" height="15.0" fill="rgb(99,245,99)" rx="2" ry="2" />
398 | <text  x="442.86" y="159.5" ></text>
399 | </g>
400 | <g >
401 | <title>&lt;frozen importlib._bootstrap&gt;:_find_and_load_unlocked:955 (2 samples, 0.71%)</title><rect x="435.6" y="309" width="8.5" height="15.0" fill="rgb(209,209,62)" rx="2" ry="2" />
402 | <text  x="438.64" y="319.5" ></text>
403 | </g>
404 | <g >
405 | <title>&lt;frozen importlib._bootstrap_external&gt;:exec_module:678 (2 samples, 0.71%)</title><rect x="435.6" y="277" width="8.5" height="15.0" fill="rgb(216,216,64)" rx="2" ry="2" />
406 | <text  x="438.64" y="287.5" ></text>
407 | </g>
408 | <g >
409 | <title>/opt/miniconda3/lib/python3.6/site.py:&lt;module&gt;:541 (1 samples, 0.36%)</title><rect x="439.9" y="245" width="4.2" height="15.0" fill="rgb(54,203,54)" rx="2" ry="2" />
410 | <text  x="442.86" y="255.5" ></text>
411 | </g>
412 | <g >
413 | <title>/opt/miniconda3/lib/python3.6/sysconfig.py:get_config_var:608 (1 samples, 0.36%)</title><rect x="439.9" y="165" width="4.2" height="15.0" fill="rgb(102,248,102)" rx="2" ry="2" />
414 | <text  x="442.86" y="175.5" ></text>
415 | </g>
416 | <g >
417 | <title>test.py:sum:10 (32 samples, 11.43%)</title><rect x="444.1" y="309" width="134.8" height="15.0" fill="rgb(94,240,94)" rx="2" ry="2" />
418 | <text  x="447.07" y="319.5" >test.py:sum:10</text>
419 | </g>
420 | <g >
421 | <title>&lt;frozen importlib._bootstrap_external&gt;:get_code:779 (1 samples, 0.36%)</title><rect x="439.9" y="53" width="4.2" height="15.0" fill="rgb(179,179,51)" rx="2" ry="2" />
422 | <text  x="442.86" y="63.5" ></text>
423 | </g>
424 | <g >
425 | <title>&lt;frozen importlib._bootstrap&gt;:_init_module_attrs:558 (1 samples, 0.36%)</title><rect x="435.6" y="165" width="4.3" height="15.0" fill="rgb(181,181,52)" rx="2" ry="2" />
426 | <text  x="438.64" y="175.5" ></text>
427 | </g>
428 | <g >
429 | <title>/opt/miniconda3/lib/python3.6/site.py:main:522 (1 samples, 0.36%)</title><rect x="439.9" y="229" width="4.2" height="15.0" fill="rgb(59,208,59)" rx="2" ry="2" />
430 | <text  x="442.86" y="239.5" ></text>
431 | </g>
432 | <g >
433 | <title>&lt;frozen importlib._bootstrap&gt;:_find_and_load:971 (1 samples, 0.36%)</title><rect x="435.6" y="229" width="4.3" height="15.0" fill="rgb(208,208,62)" rx="2" ry="2" />
434 | <text  x="438.64" y="239.5" ></text>
435 | </g>
436 | <g >
437 | <title>&lt;frozen importlib._bootstrap&gt;:_find_and_load_unlocked:955 (1 samples, 0.36%)</title><rect x="439.9" y="101" width="4.2" height="15.0" fill="rgb(191,191,55)" rx="2" ry="2" />
438 | <text  x="442.86" y="111.5" ></text>
439 | </g>
440 | <g >
441 | <title>&lt;frozen importlib._bootstrap&gt;:_find_and_load:971 (1 samples, 0.36%)</title><rect x="439.9" y="117" width="4.2" height="15.0" fill="rgb(207,207,61)" rx="2" ry="2" />
442 | <text  x="442.86" y="127.5" ></text>
443 | </g>
444 | <g >
445 | <title>/opt/miniconda3/lib/python3.6/site.py:getusersitepackages:258 (1 samples, 0.36%)</title><rect x="439.9" y="197" width="4.2" height="15.0" fill="rgb(105,250,105)" rx="2" ry="2" />
446 | <text  x="442.86" y="207.5" ></text>
447 | </g>
448 | <g >
449 | <title>&lt;frozen importlib._bootstrap&gt;:_load_unlocked:658 (1 samples, 0.36%)</title><rect x="435.6" y="197" width="4.3" height="15.0" fill="rgb(213,213,64)" rx="2" ry="2" />
450 | <text  x="438.64" y="207.5" ></text>
451 | </g>
452 | <g >
453 | <title>test.py:&lt;module&gt;:16 (177 samples, 63.21%)</title><rect x="444.1" y="325" width="745.9" height="15.0" fill="rgb(98,244,98)" rx="2" ry="2" />
454 | <text  x="447.07" y="335.5" >test.py:&lt;module&gt;:16</text>
455 | </g>
456 | <g >
457 | <title>test.py:add:5 (44 samples, 15.71%)</title><rect x="1004.6" y="293" width="185.4" height="15.0" fill="rgb(88,235,88)" rx="2" ry="2" />
458 | <text  x="1007.57" y="303.5" >test.py:add:5</text>
459 | </g>
460 | <g >
461 | <title>(idle) (101 samples, 36.07%)</title><rect x="10.0" y="325" width="425.6" height="15.0" fill="rgb(248,120,120)" rx="2" ry="2" />
462 | <text  x="13.00" y="335.5" >(idle)</text>
463 | </g>
464 | <g >
465 | <title>&lt;frozen importlib._bootstrap&gt;:_call_with_frames_removed:219 (2 samples, 0.71%)</title><rect x="435.6" y="261" width="8.5" height="15.0" fill="rgb(181,181,52)" rx="2" ry="2" />
466 | <text  x="438.64" y="271.5" ></text>
467 | </g>
468 | </g>
469 | </svg>
470 | 


--------------------------------------------------------------------------------
/images/qiyu.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/qiyu.jpg


--------------------------------------------------------------------------------
/images/qrcode.bmp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/qrcode.bmp


--------------------------------------------------------------------------------
/images/vsdebugpycpp/5c3d9b205e60273aadf4650714DcRPJX.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/vsdebugpycpp/5c3d9b205e60273aadf4650714DcRPJX.png


--------------------------------------------------------------------------------
/images/vsdebugpycpp/5c3d9dbcaa49f15c3726191dzKYsU5Qw.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/vsdebugpycpp/5c3d9dbcaa49f15c3726191dzKYsU5Qw.png


--------------------------------------------------------------------------------
/images/vsdebugpycpp/5c3d9e54a7f2529830bb770bjqP16HQy.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/vsdebugpycpp/5c3d9e54a7f2529830bb770bjqP16HQy.png


--------------------------------------------------------------------------------
/images/vsdebugpycpp/5c3da03a96dee435e6604c4aVmHaq5uC.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/vsdebugpycpp/5c3da03a96dee435e6604c4aVmHaq5uC.png


--------------------------------------------------------------------------------
/images/vsdebugpycpp/5c3da1cb7f9d2a99198674256wRjQZ2J.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Meteorix/meteorix-blog/8a29208db7159cd78b5b1d12b7ca85344bce58f0/images/vsdebugpycpp/5c3da1cb7f9d2a99198674256wRjQZ2J.png


--------------------------------------------------------------------------------
/tags/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: tags
3 | 
4 | date: 2019-04-22 23:51:06
5 | 
6 | type: "tags"
7 | 
8 | ---
9 | 


--------------------------------------------------------------------------------