├── .gitattributes
├── 0x08 关于消除DDR信息壁垒的思考.md
├── README.md
├── 0x02 DRAM颗粒测试算法综述.md
├── 0x00 测试流程及内容.md
├── 0x06 NEON加速之memcpy在ARM平台的优化.md
├── 0x03 测试算法.md
├── 0x01 看过DRAM部分论文.md
├── 0x07 什么?!NEON还要优化?[重点].md
├── 0x04 MemTest86 history.md
└── 0x05 memtester4.3.0 啃源码.md
/.gitattributes:
--------------------------------------------------------------------------------
1 | *.js linguist-language=asm
2 | *.css linguist-language=asm
3 | *.html linguist-language=asm
4 |
--------------------------------------------------------------------------------
/0x08 关于消除DDR信息壁垒的思考.md:
--------------------------------------------------------------------------------
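1 | # Background
2 | As far as we know, every algorithm I have studied only works well if the address-decoding logic is available. That logic is the vendor's internal information and is not given out. Even a giant like Huawei cannot get it, let alone small and mid-sized companies like us.
3 |
4 | Big ATE vendors such as Advantest cannot get it either (PS: in the early years Japanese design houses still handed this logic over, but gradually nobody did). So how do they sell ATE equipment without it? According to their own staff, every year they hire designers away from companies such as Samsung and Micron at high salaries, which gives them the internal information. They then develop a compiler that wraps this information, and only then do customers like us get to use it.
5 |
6 | So the core issue is the mapping between logical and physical addresses.
7 |
8 | # Approach
9 | Let the physical address be P_addr, the logical address be L_addr, and the mapping between them be f(x). What we are after is the expression for f(x):
10 |
11 | Train on a large number of samples. Positive samples are easy to find, but abnormal samples are not, and the two classes are severely imbalanced. What we are learning is a sequence of operations, so a loss function is hard to define. This suggests DQN (deep reinforcement learning): keep trying until we find a reasonable policy (that is, a mapping) that identifies as many failing ICs as possible.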
12 |
13 | # Q learning
14 |
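For reference, the standard tabular Q-learning update is shown below. How it maps onto this problem is only an assumption here: the state $s$ could be a candidate address mapping, the action $a$ a change to that mapping, and the reward $r$ the number of additional failing ICs the resulting test catches. None of this is worked out in the source.

$$
Q(s,a) \leftarrow Q(s,a) + \alpha\left[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\right]
$$

Here $\alpha$ is the learning rate, $\gamma$ the discount factor, and $s'$ the state reached after taking action $a$ in state $s$.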
15 | # DQN
16 |
17 | # Going deeper
18 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # memtester-LPDDR3-
2 | how to measure LPDDR3 cell.
3 |
4 | # memory FT
5 |
6 | ## Contents
7 | **0x00: What am I doing?**
8 | + DRAM chip validation
9 |
10 |
11 | **0x01:pattern**
12 |
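13 | Design the patterns from the chip vendor's point of view.
14 |
15 | + Function: a full-address-space scan to find bad blocks, mainly with memtester;
16 | + Performance: frequency, IO performance, timing, voltage adjustment, etc.;
17 | + Power: IDD2 and IDD6, one standby and one read/write; the other IDD states are transient and the current test architecture cannot measure them;
18 | + Temperature: behaviour at high temperature, mainly burn-in test;
19 | + Refresh time: whether data is lost when the refresh setting is too low; this should belong to the FaultModel, that is, to the test pattern.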
20 |
21 |
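22 | **0x02: Framework**
23 |
24 | The host PC sends commands over a communication interface to multiple test boards. Each test board tests its ICs and returns the results over the same interface; the host processes the data and displays the results.
25 |
26 | + Host UI: runs on a Windows PC and was planned in C# (update: the UI was actually built with Tkinter and PyQt, and both versions were finished; Tk to understand how UI functions work internally, PyQt to focus on the business logic);
27 | + Communication: serial port at 9600 baud through serial-to-Ethernet adapters, so in theory one PC can drive an unlimited number of test boards;
28 | + Test board: an ARM Cortex-A53 processor as the test controller, with an LPC1768 as the auxiliary controller;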
29 |
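30 | **0x03: Algorithms**
31 |
32 | Three layers: the low-level physical-defect layer, the middle functional layer, and the performance layer;
33 | + FaultModel: backed by a body of papers and prior work; we ended up designing the algorithms around a condensed fault model;
34 | + The functional layer checks that every cell works. The usual approach is to run memtester directly, but this seems to overlap with the lower layer and needs confirming;
35 | + The performance layer asks how high a frequency the patterns can run at while we vary voltage, timing and other parameters.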
36 |
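37 | **0x04: Acceleration**
38 | + 1. Start from some examples (the memcpy article on ARM's website) to show how to accelerate, introducing NEON and the GPU;
39 | + 2. A NEON case study: the low-level acceleration of our memtester;
40 | + 3. Study of the NEON manual;
41 | + 4. Using JNI/NDK on Android to expose a C library directly to the application layer, with a CNN as the worked example, comparing against the official libraries;
42 | + 5. Learning CUDA;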
43 |
--------------------------------------------------------------------------------
/0x02 DRAM颗粒测试算法综述.md:
--------------------------------------------------------------------------------
1 |
2 | # Everything below comes from reading papers and English textbooks; shared here.
3 |
4 | 
5 | 
6 | 
7 | 
8 | 
9 | 
10 | 
11 | 
12 | 
13 |
14 | 
15 |
16 | 
17 |
18 | 
19 |
20 | 
21 |
22 |
23 |
24 |
--------------------------------------------------------------------------------
/0x00 测试流程及内容.md:
--------------------------------------------------------------------------------
1 |
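2 | # Production flow
3 |
4 | First, a look at how DRAM is produced, so we know where it comes from.
5 |
6 | Starting from the wafer, CP (chip probing) test is done at the fab. After that the die goes through wire bonding and packaging, and the packaged part is sent for FT (final test). Parts that pass can be sold as product. Many details are left out here; this is only the big picture.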
7 |
8 | 
9 |
10 | # Why test?
11 | 1. Defect detection:
12 | + including **wafer defects** and **packaging defects**;
13 | 2. Guarantee that the product meets customer specifications:
14 | + Voltage guard band
15 | + Temperature guard band
16 | + Timing guard band
17 | + Complex test pattern
18 | 3. Collect test data to improve R&D and process accuracy
19 | + Quality
20 | + Reliability
21 | + Cost
22 | + Efficiency
23 |
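24 | # IC test methodology
25 | The theoretical structure is shown in the figure: the tester supplies power and drives patterns into the DUT, then compares the outputs to produce the test result. This is the single-site block diagram; the multi-site diagram is the cascaded version.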
26 |
27 | 
28 |
29 | # Overview of a typical DRAM FT flow
30 | **1. Burn-in**
31 | + MBT (Monitor Burn-in Test): stress test to screen out chips with early-life defects;
32 | + TBT (Test Burn-in Test): long-duration pattern test;
33 | >Typical for this kind of ATE equipment: Very Low Speed (5-20MHz), High Parallel Test (10-20Kpcs/oven), Low Cost
34 |
35 | **2. Core Test**
36 | + DC Test
37 | + Functional Test
38 | >Low Speed (DDR3 @667MHz), Typical tester Advantest T5588 + 512DUT HiFix
39 | >+ Advantest tester: Advantest T5588;
40 | >+ 512 DUT channels;
41 | >+ HiFix: interface board for final test of packaged ICs
42 |
43 | **3. Speed Test**
44 | + Speed & AC Timing Test
45 | >Full Speed (DDR3 @1600MHz and above), Advantest T5503 + 256DUT HiFix
46 |
47 | **4. Backend inspection**
48 |
49 | Marking => Ball Scan => Visual Inspection => Baking => Vacuum Pack
50 |
51 |
52 | ## 1. Burn-in
53 |
54 | ### MBT(Monitor Burn in test)
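55 | MBT applies stress to the ICs to screen out those that fail early. The stress consists of:
56 | + High temperature: High Temperature Stress (125 °C)
57 | + High voltage: High Voltage Stress
58 | + Pattern stress: Stressful Pattern
59 |
60 | The bathtub curve below shows that the IC failure rate is high both early and late in life, so an early screening step is well worth doing.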
61 | 
62 |
63 |
64 | ### TBT
65 | Long-duration test patterns.
66 | + Multiple temperatures: Multiple temperature tested (e.g. 88 °C, 25 °C, -10 °C)
67 | + Low speed, long duration: Long test time at low speed
68 | + Full-array scan: Patterns cover all cell arrays
69 | + No stress: No Stressful condition
70 | + High parallel test count, low cost
71 |
72 | >**Neither MBT nor TBT tests DC**: Both MBT and TBT do NOT test DC (Ando Oven)
73 |
74 |
75 |
76 | ## 2. Core Test
77 |
78 | DRAM Advantest Test.
79 |
80 | ### DC test
81 |
82 | Test principle diagrams:
83 |
84 | ISVM|VSIM
85 | ---|---
86 | |
87 | **I Source; V Measure** |**V Source; I Measure**
88 |
89 | + Open/Short test
90 |
91 | What is the purpose of the O/S test?
92 | ```
93 | 1、Check connection between pins and test fixture
94 | 2、Check if pin to pin is short in IC package
95 | 3、Check if pin to wafer pad has open in IC package
96 | 4、Check if protection diodes work on die
97 | 5、It is a quick electrical check to determine if it is safe to apply power
98 | ```
99 | >Also called Continuity Test
100 |
101 | Defect types:
102 |
103 | Wafer Problem(Defect of diode) | Assembly Problem(Wire bonding/Solder ball)|Contact Problem(Socket issue)
104 | ---|---|---
105 | | |
106 |
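107 | We will not go into the actual test circuits. Open/short defects like these usually also show up, at least partly, in later tests, and our current A53-based test platform certainly cannot support O/S testing at this scale.
108 |
109 | + Leakage test
110 | The leakage test is used to: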
111 |
112 | + Verify resistance of pin to VDD/VSS is high enough;
113 | + Verify resistance of pin to pins is high enough;
114 | + Identify process problem in CMOS device ;
115 |
116 | Wafer Problem | Assembly Problem|Contact Problem(Socket issue)
117 | ---|---|---
118 |  ||
119 |
120 |
121 |
122 | + IDD test
123 |
124 | The goal is to measure the IDD current in different states and check whether power consumption exceeds the limit.
125 |
126 | ### Functional test
127 |
128 | Verifies that the DRAM works correctly.
129 |
130 | + Different parameter & Pattern for each function
131 | + To check DRAM can operate functionally
132 |
133 | EFT: easy function test
134 | Write 0 or 1 to every cell, then read back and compare, to confirm the IC's **basic function**.
135 |
136 | Typical algorithm: **March C-**
137 | + **Algorithm:** ↑(w0);↑(r0,w1);↑(r1,w0);↓(r0,w1);↓(r1,w0); ↓(r0)
138 | + **Operation Count:** 10*n
139 | + **Scan type:** X-Scan (X inc -> Y inc: hold Y fixed and sweep along X, then increment Y and sweep X again), Y-Scan (Y inc -> X inc)
140 | + **Fault Coverage:** Most of Failure Mode
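
The March C- sequence listed above can be sketched in C against a simulated memory. This is a minimal illustration, not the test program used on the tester: the `sim_mem` model, the single injected stuck-at cell and all function names are hypothetical, and each cell holds one bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 1-bit-per-cell memory model with one optional stuck-at cell. */
typedef struct {
    uint8_t *cells;     /* one byte per cell, holding 0 or 1 */
    size_t   n;         /* number of cells */
    long     stuck_at;  /* index of a stuck cell, or -1 for a fault-free array */
    uint8_t  stuck_val; /* value the stuck cell always returns */
} sim_mem;

static void mem_write(sim_mem *m, size_t a, uint8_t v) { m->cells[a] = v & 1u; }

static uint8_t mem_read(const sim_mem *m, size_t a)
{
    return ((long)a == m->stuck_at) ? m->stuck_val : m->cells[a];
}

/* One march element: at each address, optionally read and check `expect`
 * (skipped when expect < 0), then optionally write `wr` (skipped when wr < 0). */
static int march_element(sim_mem *m, int ascending, int expect, int wr, size_t *ops)
{
    for (size_t i = 0; i < m->n; i++) {
        size_t a = ascending ? i : m->n - 1 - i;
        if (expect >= 0) {
            (*ops)++;
            if (mem_read(m, a) != expect) return -1;
        }
        if (wr >= 0) {
            (*ops)++;
            mem_write(m, a, (uint8_t)wr);
        }
    }
    return 0;
}

/* March C-: returns 0 on pass, -1 on the first mismatch; *ops counts operations. */
int march_c_minus(sim_mem *m, size_t *ops)
{
    assert(m != NULL && m->n > 0);
    *ops = 0;
    if (march_element(m, 1, -1, 0, ops)) return -1;  /* up(w0)      */
    if (march_element(m, 1, 0, 1, ops)) return -1;   /* up(r0,w1)   */
    if (march_element(m, 1, 1, 0, ops)) return -1;   /* up(r1,w0)   */
    if (march_element(m, 0, 0, 1, ops)) return -1;   /* down(r0,w1) */
    if (march_element(m, 0, 1, 0, ops)) return -1;   /* down(r1,w0) */
    if (march_element(m, 0, 0, -1, ops)) return -1;  /* down(r0)    */
    return 0;
}
```

On a fault-free array of n cells the operation counter ends at 10*n, matching the operation count above; with a stuck-at-1 cell injected, the first `r0` that reaches that cell fails.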
141 |
142 | 
143 |
144 | >**See the later chapters for a detailed analysis of the algorithms.**
145 |
146 | ## 3. Speed Test
147 | Timing test @ different speed grade
148 |
149 | ### AC timing test
150 | To verify IC can work as each timing parameter defined in datasheet.
151 | + Rise and fall time
152 | + Setup and hold time
153 | + Delay test
154 | + Others
155 |
156 | 
157 |
158 | ### Speed test
159 | Test DRAM at different speed:
160 | + 1. DDR3-1600(11-11-11) Test
161 | + 2. DDR3-1333(9-9-9) Test
162 | + 3. DDR3-1066(7-7-7) Test
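
As a worked example of what these speed grades imply, assume the three numbers are CL-tRCD-tRP in clock cycles (the usual convention; the source does not spell it out) and that the clock runs at half the data rate. Each latency in nanoseconds is then the cycle count times the clock period:

$$
t_{CK} = \frac{2}{f_{\text{data}}}, \qquad t = N_{\text{cycles}} \cdot t_{CK}
$$

$$
\begin{aligned}
\text{DDR3-1600 (11-11-11)}:&\quad t_{CK} = 1.25\ \text{ns}, & 11 \times 1.25 &= 13.75\ \text{ns} \\
\text{DDR3-1333 (9-9-9)}:&\quad t_{CK} = 1.5\ \text{ns}, & 9 \times 1.5 &= 13.5\ \text{ns} \\
\text{DDR3-1066 (7-7-7)}:&\quad t_{CK} = 1.875\ \text{ns}, & 7 \times 1.875 &= 13.125\ \text{ns}
\end{aligned}
$$

Under these assumptions the absolute latencies stay close to each other across grades; what changes is the clock period the part has to meet.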
163 |
164 | Test for each timing (tRCD, tRRD…) ……
165 | ### test plan in program
166 |
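167 | In the test program the order is: O/S test first, then leakage, then IDD, then EFT (that is, March C-), then the other functional tests, and only after that the speed and AC tests.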
168 |
169 | 
170 |
171 |
172 |
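173 | ## 4. Backend inspection
174 |
175 | After testing, machines inspect for defects such as the position of the marking and deformed balls. Once those pass, the parts are packed and shipped to the customer.
176 |
177 | The visual inspection step uses machine vision, but the scene is so uniform that conventional algorithms are enough. I wonder what deep learning could contribute here.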
178 |
--------------------------------------------------------------------------------
/0x06 NEON加速之memcpy在ARM平台的优化.md:
--------------------------------------------------------------------------------
1 |
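2 | # 0x01 Preface
3 | Systems often need to move large amounts of data, usually by calling the C library's memcpy(). In the spirit of splitting hairs, let's explore how to speed it up. After all, many problems look alike once they are broken down to the lowest level.
4 |
5 | Acceleration depends on many factors, so it has to be discussed in the context of the environment: take the current environment into account and then design a reasonable optimization strategy. That is what makes a strategy robust.
6 |
7 | Here we discuss memcpy optimization from the top down. **If one problem can be optimized this way, other problems are probably similar. Perhaps that is where transfer learning in deep learning comes from. What a wonderful world!**
8 |
9 | # 0x02 Test environment
10 |
11 | We run a series of tests on an ARM Cortex-A8.
12 |
13 | In the tests, the source and destination addresses and the number of bytes to copy are all multiples of the L1 cache line size (64 bytes);
14 |
15 | Alignment would normally need to be considered, but only 16 MB is tested here, so its effect is small and it is ignored;
16 |
17 | Timing is recorded by the processor's internal performance counters;
18 |
19 | In all tests the L1NEON bit is set, which means that a NEON load instruction causes a linefill in the L1 data cache;
20 |
21 | The L1 and L2 caches, for both instructions and data, are enabled, and the MMU and branch prediction are enabled as well.
22 |
23 | Some tests also use the PLD preload instruction. It makes the L2 cache start loading data some time before the data is used. The memory request is issued early, so the CPU does not have to sit waiting for memory to deliver the data.
24 |
25 | # 0x03 Test strategies
26 | ## Strategy 1: Naive word copy
27 | The naive copy uses ordinary assembly instructions to copy one word at a time.
28 |
29 | As the assembly below shows, each iteration advances the addresses by 4 and loops until the copy is done. This is the most basic approach, so its time serves as the baseline.
30 |
31 | Assembly code: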
32 | ``` cpp
33 | WordCopy
34 | LDR r3, [r1], #4
35 | STR r3, [r0], #4
36 | SUBS r2, r2, #4
37 | BGT WordCopy ; loop while bytes remain (BGE would run one extra iteration when r2 reaches 0)
38 | ```
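39 | ## Strategy 2: Load-multiple instructions
40 |
41 | Previously we used only LDR, moving 32 bits (one word) at a time. Since only r0-r3 were used, no stack operations were needed on each call;
42 |
43 | Here we use the LDM and STM instructions (M stands for Multiple, meaning several registers per operation). **Each iteration moves 8 words.** Because extra registers are used, their contents must be preserved, which means a push on entry and a pop on exit.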
44 | ```cpp
45 | LDMCopy
46 | PUSH {r4-r10}
47 | LDMloop
48 | LDMIA r1!, {r3 - r10}
49 | STMIA r0!, {r3 - r10}
50 | SUBS r2, r2, #32
51 | BGT LDMloop ; loop while bytes remain (BGE would run one extra iteration when r2 reaches 0)
52 | POP {r4-r10}
53 | ```
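54 | >Notes:
55 | 1. r0-r3 generally serve as a function's scratch registers: arguments are passed in them in order, any further arguments go on the stack, and **r0 usually also holds the return value**.
56 | 2. `r1!` means data is loaded from the address in r1 into r3 through r10 in sequence, and r1 is written back with the updated address (advanced by 4 for each register loaded).
57 |
58 | ## Strategy 3: NEON copy
59 | A plain NEON copy. For the instructions and how to use them, see my previous blog post.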
60 | ```cpp
61 | NEONCopy
62 | VLDM r1!, {d0-d7}
63 | VSTM r0!, {d0-d7}
64 | SUBS r2, r2, #0x40
65 | BGT NEONCopy ; loop while bytes remain (BGE would run one extra iteration when r2 reaches 0)
66 | ```
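67 | >Each d register is 64 bits, i.e. 2 words, so 8 d registers move 16 words per iteration. Boss, somebody is cheating!
68 |
69 | ## Strategy 4: Naive copy + preload
70 | As the title says.
71 | + `PLD [r1, #0x100]` asks the memory system to preload the cache line at the address 256 bytes ahead of r1;
72 | + r12 is the inner loop counter: 16 iterations of 4 bytes each, 64 bytes per outer iteration;
73 | + after each outer iteration, the byte counter in r2 is decremented by 64 (0x40) and the next round begins, until the copy is done;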
74 |
75 | ```cpp
76 | WordCopyPLD
77 | PLD [r1, #0x100]
78 | MOV r12, #16
79 | WordCopyPLD1
80 | LDR r3, [r1], #4
81 | STR r3, [r0], #4
82 | SUBS r12, r12, #1
83 | BNE WordCopyPLD1
84 | SUBS r2, r2, #0x40
85 | BNE WordCopyPLD
86 | ```
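87 | >Same tip as before: the preload here runs well ahead of the copy (256 bytes ahead for 64 bytes copied per round)!
88 |
89 | ## Strategy 5: Load-multiple + preload
90 | The advantage is... as the title says.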
91 |
92 | ```cpp
93 | LDMCopyPLD
94 | PUSH {r4-r10}
95 | LDMloopPLD
96 | PLD [r1, #0x80]
97 | LDMIA r1!, {r3 - r10}
98 | STMIA r0!, {r3 - r10}
99 | LDMIA r1!, {r3 - r10}
100 | STMIA r0!, {r3 - r10}
101 | SUBS r2, r2, #0x40
102 | BGT LDMloopPLD ; loop while bytes remain (BGE would run one extra iteration when r2 reaches 0)
103 | POP {r4-r10}
104 | ```
105 |
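106 | ## Strategy 6: NEON + PLD
107 | + Preload at an offset of 192 bytes (0xC0) ahead of r1;
108 | + d0-d7 total 8 x 64 bits = 64 bytes
109 |
110 | **The numbers work out neatly here and are self-consistent!**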
111 |
112 | ```cpp
113 | NEONCopyPLD
114 | PLD [r1, #0xC0]
115 | VLDM r1!,{d0-d7}
116 | VSTM r0!,{d0-d7}
117 | SUBS r2,r2,#0x40
118 | BGT NEONCopyPLD ; loop while bytes remain (BGE would run one extra iteration when r2 reaches 0)
119 | ```
120 |
121 | ## Strategy 7: Mixed ARM and NEON memory copy with preload
122 | That is, interleave the different instruction types into a hybrid and see whether it runs any faster.
123 |
124 | ```cpp
125 | ARMNEONPLD
126 | PUSH {r4-r11}
127 | MOV r3, r0
128 | ARMNEON
129 | PLD [r1, #192]
130 | PLD [r1, #256]
131 | VLD1.64 {d0-d3}, [r1@128]!
132 | VLD1.64 {d4-d7}, [r1@128]!
133 | VLD1.64 {d16-d19}, [r1@128]!
134 | LDM r1!, {r4-r11}
135 | SUBS r2, r2, #128
136 | VST1.64 {d0-d3}, [r3@128]!
137 | VST1.64 {d4-d7}, [r3@128]!
138 | VST1.64 {d16-d19}, [r3@128]!
139 | STM r3!, {r4-r11}
140 | BGT ARMNEON
141 | POP {r4-r11}
142 | ```
143 |
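144 | ## Timing results
145 | Test|Time (ms)|Relative speed
146 | ---|---|---
147 | Naive word copy|104.8|100%
148 | Load-multiple|94.5|111%
149 | NEON copy|104.8|100% (which suggests waiting time dominates)
150 | Naive word copy + preload|137.5|76%
151 | Load-multiple + preload|106.6|98%
152 | NEON + preload|70.2|149%
153 | Mixed ARM and NEON|93.5|112%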
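
The relative-speed column appears to be the baseline time divided by the measured time, expressed as a rounded percentage. The helper below reproduces the column from the times in the table; the function name is made up for this sketch.

```c
#include <assert.h>

/* Relative speed versus the baseline, in percent, rounded to the nearest integer.
 * 100 means "same as baseline"; above 100 is faster, below 100 is slower. */
int relative_speed_percent(double baseline_ms, double measured_ms)
{
    assert(baseline_ms > 0.0 && measured_ms > 0.0);
    return (int)(100.0 * baseline_ms / measured_ms + 0.5);
}
```

For example, `relative_speed_percent(104.8, 70.2)` gives 149 for NEON + preload, and `relative_speed_percent(104.8, 137.5)` gives 76 for the naive copy with preload.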
154 |
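155 | Summary: some of the results are surprising.
156 |
157 | Load-multiple improves performance by only 11%, even though far fewer instructions execute and there are fewer branches for the branch predictor to handle. The reasons are:
158 | + the instruction cache hits 100% of the time, so instruction fetch never stalls;
159 | + the branch predictor has nothing difficult to predict here;
160 | + single writes issued back to back are treated as bursts by the memory system anyway, so efficiency does not improve much;
161 |
162 | Surprisingly, the NEON instructions did not improve copy speed, despite these advantages:
163 | + the copy loop uses few registers, so register conflicts are unlikely;
164 | + because few registers are used, it suits small copies especially well, as no stack operations are needed to save and restore context;
165 | + the Cortex-A8 can be configured so that NEON loads allocate into the L2 cache, which keeps a memory copy from replacing existing L1 contents with data that will not be reused; (????)
166 |
167 | Despite all the advantages listed above, in practice the result is unimpressive!
168 | (I have doubts here: on an A53 platform I measured roughly a 3x speedup. Unless that time was really the XOR operation time!?)
169 |
170 |
171 | PLD lets the memory controller fetch data before it is used, so combining it with NEON makes NEON far more effective.
172 | + Also, in burst transfers the first access has a large latency. If we issue several requests to the controller (assuming the controller is advanced enough), it can merge the later requests, which effectively removes their access latency. The first access latency, amortized over all the requests, becomes negligible. Very efficient!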
173 |
174 |
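175 | ## Factors affecting memory copy speed
176 |
177 | ### Factor 1: Amount of data to copy
178 | Some implementations need a certain setup time, but are very fast once they get going.
179 |
180 | For large blocks the setup time is amortized, so it is fine; for small blocks it is not worth it.
181 |
182 | For example: stacking many registers at the start of the function and then using LDM and STM in the main loop to operate on multiple registers;
183 |
184 | ### Factor 2: Alignment
185 |
186 | The ARM architecture copies word-aligned data more efficiently;
187 |
188 | Coarser alignment granularities can help performance further;
189 |
190 | On the Cortex-A8, load-multiple instructions can load 2 registers per cycle from the L1 cache, but only when the address is 64-bit aligned. **(So although NEON can move 128 bits at a time, if your CPU bus width and DRAM width are both 32 bits, the data is ultimately split into 32-bit pieces before being stored. Our program did get faster, but that is because the XOR part of the work was accelerated!)**
191 |
192 |
193 | Cache alignment matters too!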
194 | `Cache behaviour (discussed later) can affect the performance of data accesses depending on its alignment relative to the size of a cache line. For the Cortex-A8, a level 1 cache line is 64 bytes, and a level 2 cache line is 64 bytes.
195 | `
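
A copy routine usually checks alignment before choosing a fast path. The helpers below are a minimal sketch of such a check; the names are hypothetical, and the 64-byte line size is the Cortex-A8 figure quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u /* Cortex-A8 L1/L2 line size quoted above */

/* Returns nonzero when p is aligned to `align` bytes (align must be a power of two). */
int is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Number of leading bytes between p and the next cache line boundary
 * (0 when p already sits on a boundary). */
size_t bytes_to_line_boundary(const void *p)
{
    size_t off = (size_t)((uintptr_t)p & (CACHE_LINE_BYTES - 1));
    return off ? CACHE_LINE_BYTES - off : 0;
}
```

A routine could copy `bytes_to_line_boundary(src)` bytes one at a time and then switch to the wide loop, so the bulk of the copy starts on a cache line boundary.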
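196 | ### Factor 3: Memory characteristics
197 | For the operations discussed here (see the routines above), the performance bottleneck is the memory interface.
198 |
199 | The loops are tiny, so they sit comfortably in the instruction cache, and there is no computation (**note: the part of our March C- code that we optimized does involve arithmetic, so that part is relatively expensive**). The processor's compute logic is therefore under little pressure, and speed depends largely on **memory speed**!
200 |
201 | Certain kinds of memory perform better under particular access patterns. For example, an SDRAM burst transfer needs a long initial latency, but once that is done the following reads and writes complete quickly; (**my MARCH algorithm acceleration takes burst transfer mode into account**)
202 |
203 | In addition, a good memory controller can accept many read/write requests **in parallel** and amortize this initial latency.
204 |
205 | Certain orderings of reads and writes in the code may also improve performance;
206 |
207 | ### Factor 4: Cache usage
208 | Moving a large amount of data will obviously replace everything in the cache;
209 |
210 | This does not affect the copy itself, but it may slow down the code that runs afterwards and so reduce overall performance.
211 |
212 | ### Factor 5: Code dependencies
213 | While a standard memcpy() runs, especially with slow memory, the processor is idle most of the time.
214 |
215 | So we can consider running other code during the memcpy;
216 |
217 | memcpy() is blocking: it returns only when the copy is finished, and the CPU is tied up until then;
218 | We can pipeline it: run memcpy() in the background and track the progress of the copy by polling or interrupts
219 | + Use DMA, which frees the CPU completely; splitting the data into blocks lets us copy one block while processing another. More efficient! (These are all quite conventional ideas.)
220 |
221 | + The Cortex-A8's built-in preload engine
222 | preloads data into the L2 cache;
223 |
224 | The CPU starts the preload, goes off to do other work until it gets the call (an interrupt) saying the load is complete, then operates on the data, and then starts the next round;
225 |
226 | + Use another processor
227 | Another embedded core works on the same principle as DMA;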
228 |
229 |
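230 | # Appendix: PLD basics and practical considerations
231 | When processing image data on Android, a basic operation is moving large amounts of data around in memory, which is slow relative to the CPU. Besides NEON, the PLD preload mentioned above is also very effective; the example below shows how effective.
232 |
233 | When handling camera data, for instance, a memory copy is typically written like this: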
234 |
235 | ```cpp
236 | while (n--) {
237 | *dest++ = *src++;
238 | }
239 | ```
240 |
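241 | On my current platform, copying 1 MB takes about 25 ms, which is slow. **The root cause is that the data is not in the processor's cache, so the CPU spends time waiting for the DRAM to deliver it. In other words, the CPU's bandwidth exceeds the memory's bandwidth.**
242 |
243 | So the **solution** is to bring the data into the cache ahead of time. Cache bandwidth sits between that of DRAM and that of the CPU, so this reduces the waiting time.
244 |
245 |
246 | The improved code is below. It preloads data ahead of use: the target data is already in the cache before the copy reaches it, so the copy hits in the cache and proceeds at once. Much faster!
247 | >A small detail: each iteration copies only 32 bits (4 bytes) but issues a preload for the address 128 bytes ahead. Won't the cache overflow quickly? I did not dig into how the CPU handles that, and it is not the point of this code, so I leave it as an open question for myself.
248 |
249 | This is impressive: the time drops to 8 ms, **roughly a 3x speedup!**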
250 |
251 | ```cpp
252 | while (n--) {
253 | asm ("PLD [%0, #128]"::"r" (src));
254 | *dest++ = *src++;
255 | }
256 | ```
257 |
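258 | Of course, optimization is never finished; you can keep optimizing as the environment changes:
259 | + For example, the preloading here has the overflow issue noted above, so the preloads are not used to the full;
260 | + Also, the loop contains a single copy statement plus the n-- decrement and the loop test, so only a small fraction of each iteration does real work. That is wasteful too.
261 |
262 | Algorithms have the notion that time and space can be traded for each other. Here that means writing more code per iteration to raise the proportion of useful work in each loop.
263 |
264 | As shown below, I add 3 more copy operations per loop. This keeps the cache from overflowing and raises the share of CPU time each loop spends actually moving data; the **"useful work ratio"** goes up!
265 |
266 | This made it roughly another 1 ms faster!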
267 |
268 | ```cpp
269 | n /= 4; //assume it's multiple of 4
270 | while (n--) {
271 | asm ("PLD [%0, #128]"::"r" (src));
272 | *dest++ = *src++;
273 | *dest++ = *src++;
274 | *dest++ = *src++;
275 | *dest++ = *src++;
276 | }
277 | ```
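
The inline `PLD` above is ARM-specific. A more portable sketch of the same unrolled loop, assuming GCC or Clang, can use the `__builtin_prefetch` builtin instead; the function name and the `PREFETCH_AHEAD_BYTES` constant are chosen for this example, and the achievable speedup depends on the platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_AHEAD_BYTES 128u /* same look-ahead distance as the PLD examples above */

/* Copies n 32-bit words, four per iteration, with one prefetch hint per iteration.
 * n must be a multiple of 4, as in the original snippet. */
void copy_words_prefetch(uint32_t *dest, const uint32_t *src, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
#if defined(__GNUC__) || defined(__clang__)
        /* Hint only: read access (0), low temporal locality (0). The address is
         * computed as an integer so no out-of-bounds pointer arithmetic occurs. */
        __builtin_prefetch((const void *)((uintptr_t)(src + i) + PREFETCH_AHEAD_BYTES), 0, 0);
#endif
        dest[i]     = src[i];
        dest[i + 1] = src[i + 1];
        dest[i + 2] = src[i + 2];
        dest[i + 3] = src[i + 3];
    }
}
```

On compilers without the builtin the hint is simply compiled out and the function still copies correctly.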
278 |
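279 | ## Summary of the approach
280 | The world is built from basic elements, operating systems sit on top of basic logic gates, and snowflakes are fractal. So after digging into one case we can always find some general principles. Keep at it!
281 |
282 | As I said: "Optimization can go on for as long as the environment keeps changing!" (A famous quote! Remember it! It will be on the exam!)
283 |
284 | There are four copy operations here, and there is no telling what the compiler will turn them into, so we can write assembly directly. First four LDR/STR instructions in assembly, then LDM/STM, which saves three instructions' worth of execution and waiting time, and then NEON for parallel speedup. It only gets faster! How much faster? Try it yourself!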
285 |
--------------------------------------------------------------------------------
/0x03 测试算法.md:
--------------------------------------------------------------------------------
1 |
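2 | Three layers: **the low-level physical-defect layer, the middle functional layer, and the performance layer;**
3 |
4 | + FaultModel: backed by a body of papers and prior work; we ended up designing the algorithms around a condensed fault model;
5 | + The functional layer checks that every cell works. The usual approach is to run memtester directly, but this seems to overlap with the lower layer and needs confirming;
6 | + The performance layer asks how high a frequency the patterns can run at while we vary voltage, timing and other parameters.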
7 |
8 |
9 |
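10 | # 3.1 Defect testing: FaultModel
11 |
12 | ## 3.1.1 Background
13 | **Testing is based on the FaultModel. Where does this fault model come from?**
14 | A specific physical defect was observed in the past, engineers abstracted the resulting error, and then built a model to test for that kind of defect. That is how a fault model comes about;
15 |
16 | The memory model, from the physical layer up to the behavioral layer, is shown below. Every level can go wrong, so each needs its own model for testing.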
17 |
18 | 
19 |
20 | **The four model levels** are:
21 | + **Behavioral** model (memtester, running the OS and applications)
22 | + **Functional** model(Fault Model)
23 | + **Electrical** model
24 | + **Layout model** (rarely reported)
25 |
26 |
27 |
28 |
29 | ## 3.1.2 Fault Model analysis
30 |
31 | 
32 |
33 | The DRAM architecture is shown above. It has three main parts: the address decoder, the memory cell array, and the read/write logic.
34 |
35 | 
36 |
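37 | The fault models are correspondingly divided into **address decoder** faults, **memory cell** faults, and **dynamic** faults.
38 |
39 | An overview and classification of the three groups follows (collected from papers over the years):
40 | ### (1) Memory cell faults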
41 | **1. Stuck-at fault (SAF):** cell or line s-a-0 or s-a-1 [1].
42 |
43 | **2. Stuck-open fault (SOF):** open cell or broken line .
44 |
45 | **3. Transition fault (TF):** cell fails to transit [1].
46 |
47 | **4. Data retention fault (DRF):** cell fails to retain its logic value after some specified time due to, e.g., leakage, resistor opens, or feedback path opens [2].
48 |
49 | **5. Coupling fault (CF):** Coupling faults are of three types [1].
50 | + **Inversion coupling fault (CFin):** a transition in one cell (aggressor) inverts the content of another cell (victim). [1,3]
51 | + **Idempotent coupling fault (CFid):** a transition in one cell forces a fixed logic value into another cell. [1,3]
52 | + **State coupling fault (CFst):** a cell/line is forced to a fixed state only if the coupling cell/line is in a given state (a.k.a. pattern sensitivity fault (PSF)). [1,3]
53 |
54 | **6. Bridging fault (BF):** short between cells (can be AND type or OR type) [1]
55 |
56 | **7. Neighborhood Pattern Sensitive Fault (NPSF)** [1]
57 |
58 | **8. Active (Dynamic) NPSF** [1]
59 |
60 | **9. Passive NPSF** [1]
61 |
62 | **10. Static NPSF** [1]
63 |
64 | ### (2) Address decoder faults (AFs)
65 | **1. No cell accessed by certain address** [1,3].
66 | **2. Multiple cells accessed by certain address** [1,3].
67 | **3. Certain cell not accessed by any address** [1,3].
68 | **4. Certain cell accessed by multiple addresses** [1].
69 |
70 | ### (3) Dynamic Faults
71 | **1. Recovery faults:** when some part of the memory cannot recover fast enough from a previous state [2].
72 | + **Sense amplifier recovery:** sense amplifier saturation after reading/writing a long string of 0s or 1s.
73 | + **Write recovery:** a write followed by a read or write at a different location resulting in reading or writing at the same location due to slow address decoder.
74 |
75 | **2. Disturb faults:** victim cell forced to 0 or 1 if we read or write aggressor cell (may be the same cell) [2].
76 |
77 | **3. Data Retention faults:** memory loses its content spontaneously, not caused by read or write [2].
78 | + **DRAM refresh fault:** Refresh-line stuck-at fault
79 | + **DRAM leakage fault:**
80 |     + **Sleeping sickness**: loses data in less than the specified hold time (typically hundreds of microseconds to tens of ms); caused by charge leakage or environment sensitivity; usually affects a row or a column.
81 |     + **Static data losses**: defective pull-up device
82 |     inducing excessive leakage currents, which can change the state of a cell. A checkerboard pattern triggers maximum leakage.
83 |
84 |
85 |
86 | ### Condensed Fault Models:
87 | 
88 |
89 | **SAF:Stuck-At Fault**
90 | + The logic value of a cell or a line is always 0 or 1.
91 | >Functional defects covered: cell stuck, driver stuck, read/write line stuck, chip-select line stuck, data line stuck, open circuit in data line.
92 | `《essentials-of-electronic-testing-for-digital-memory-and-mixed-signal-vlsi-circuits》@2002`
93 |
94 | **TF: Transition Fault**
95 | + A cell or a line that fails to undergo a 0->1 or a 1->0 transition.
96 | >Defects covered: a cell **can** be set to 0 but **not** to 1 (or vice versa).
97 | `《essentials-of-electronic-testing-for-digital-memory-and-mixed-signal-vlsi-circuits》@2002`
98 |
99 | **CF: Coupling Fault**
100 | + A write operation to one cell changes the content of a second cell.
101 | >Functional defects covered: short circuit between data lines, crosstalk between data lines.
102 | `《essentials-of-electronic-testing-for-digital-memory-and-mixed-signal-vlsi-circuits》@2002`
103 |
104 |
105 | **NPSF: Neighborhood Pattern Sensitive Fault**
106 | + The content of a cell, or the ability to change its content, is influenced by the contents of some other cells in the memory.
107 | >Functional defects covered: pattern sensitive cell interaction.
108 | `《essentials-of-electronic-testing-for-digital-memory-and-mixed-signal-vlsi-circuits》@2002`
109 |
110 | **AF: Address decoding fault**
111 | + With a certain address, no cell will be accessed.
112 | + A certain cell is never accessed.
113 | + With a certain address, multiple cells are accessed simultaneously.
114 | + A certain cell can be accessed by multiple addresses.
115 | >Functional defects covered: address line stuck, open circuit in address line, shorts between address lines, open circuit in decoder, wrong address access, multiple simultaneous address access.
116 | `《essentials-of-electronic-testing-for-digital-memory-and-mixed-signal-vlsi-circuits》@2002`
117 |
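118 | ### Mapping to physical defects
119 |
120 | aa: a shorted capacitor, which causes an SA0 (stuck-at-0) fault;
121 | bb: a capacitor-to-WL short, which causes an SA1 fault;
122 | cc: two WLs shorted together; the two cells in the same column show an AND bridging fault, i.e. they behave as a logical AND;
123 | dd: neighbouring state coupling fault;
124 | ee: this cell has an AND bridging fault with the entire adjacent column of cells;
125 | ff: two **bit lines** shorted, so every word line has two cells with an AND bridging fault.
126 |
127 | A short between a WL and a BL gives the cell at their intersection an SA1 fault:
128 | + For example, in the figure below, when the cell at the intersection is read, the WL is pulled high and the corresponding BL is pulled high with it, so the cell is locked at 1;
129 | + Writing 1 to the intersection cell shows no problem, but a 0 can never be written, because the cell is always locked at 1;
130 | + Evidently every cell in the same column is locked at 1, and likewise every cell in the same row (doubtful!!?)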
131 | 
132 |
133 |
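134 | At the lowest level there are in fact many kinds of defect, as shown below. test_stuck_address() in memtester can cover them, **but it only tells you which address failed; it cannot tell you which of the defects in the figure caused it!**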
135 | 
136 |
137 | ### Test items at one test house
138 |
139 | 
140 |
141 | ## 3.1.3 Test algorithms
142 |
143 | ### Ad-hoc tests
144 |
145 | Early functional-model (ad-hoc test) algorithms:
146 |
147 | 
148 |
149 | **Scan = Zero-One Test**:O(4n)
150 | Data background:0xffffffff,0x0
151 | In short notation: {⇑(w0); ⇑(r0); ⇑(w1); ⇑(r1)}
152 | Drawback: low fault coverage
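
The Zero-One sequence {⇑(w0); ⇑(r0); ⇑(w1); ⇑(r1)} is short enough to sketch directly. The version below runs on 32-bit words with the two data backgrounds listed above; the function name and return convention are chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Zero-One (Scan) test: write a background to every word, read it all back,
 * then repeat with the inverted background. 4n operations in total.
 * Returns the index of the first failing word, or -1 if every word passes. */
long zero_one_scan(volatile uint32_t *mem, size_t n)
{
    static const uint32_t background[2] = { 0x00000000u, 0xFFFFFFFFu };
    assert(mem != NULL);
    for (int b = 0; b < 2; b++) {
        for (size_t i = 0; i < n; i++)      /* up(w0) or up(w1) */
            mem[i] = background[b];
        for (size_t i = 0; i < n; i++)      /* up(r0) or up(r1) */
            if (mem[i] != background[b])
                return (long)i;
    }
    return -1;
}
```

Because every cell holds the same value as its neighbours during each pass, this catches stuck-at faults but says little about coupling between cells, which is the coverage gap noted above.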
153 |
154 | **Checkerboard**:O(4n)
155 | Data background:0xAAAAAAAA,0x55555555
156 | Short Notation: {⇑(w1_i, w0_{i+1}); ⇑(r1_i, r0_{i+1}); ⇑(w0_i, w1_{i+1}); ⇑(r0_i, r1_{i+1})}
157 | Drawback: low fault coverage (similar to Scan)
158 |
159 | **Galpat, walking 1/0**: O(n^2)
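
A word-level checkerboard can be sketched as follows. Treating adjacent 32-bit words as the alternating cells i and i+1 is an assumption of this example (a true checkerboard follows the physical cell layout, which requires the address scrambling information); the function name is also made up here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Checkerboard test on 32-bit words: even and odd words get opposite backgrounds,
 * the array is read back, then the two backgrounds are swapped and checked again.
 * Returns the index of the first failing word, or -1 if every word passes. */
long checkerboard_test(volatile uint32_t *mem, size_t n)
{
    static const uint32_t bg[2] = { 0xAAAAAAAAu, 0x55555555u };
    assert(mem != NULL);
    for (size_t pass = 0; pass < 2; pass++) {
        for (size_t i = 0; i < n; i++)
            mem[i] = bg[(i + pass) & 1u];
        for (size_t i = 0; i < n; i++)
            if (mem[i] != bg[(i + pass) & 1u])
                return (long)i;
    }
    return -1;
}
```

After a passing run the array holds the second phase of the pattern: even words contain 0x55555555 and odd words 0xAAAAAAAA.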
160 |
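161 | >**Summary:** low coverage and high time cost make these hard to accept in industry, so the March family of algorithms, with higher coverage and lower time complexity, emerged.
162 |
163 | Zero-One and Checkerboard lack fault coverage, and the other conventional algorithms take too long, which is why March tests came about.
164 |
165 | ### March tests:
166 | + designed from fault models
167 | + linear time cost
168 | + good fault coverage
169 | + continuously improved...
170 |
171 | The figure below shows time complexity and test coverage. **The algorithm most used by test houses today is March C-, the best trade-off between time complexity and fault coverage.**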
172 |
173 | 
174 |
175 |
176 |
177 |
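178 | ## 3.1.4 March algorithm overview
179 | Among the algorithms developed against fault models, March algorithms give the best value for their cost, so they are the most commonly used.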
180 |
181 | 
182 |
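183 | Different algorithms detect different types of fault, and it must be understood that faults can never be covered 100%. We therefore combine several test algorithms.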
184 |
185 | ```
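186 | SAF: stuck-at fault
187 | AF: address decoder fault
188 | TF: transition fault
189 | CF: coupling fault
190 | CFin: inversion coupling fault
191 | CFid: idempotent coupling fault
192 | CFdyn: dynamic coupling fault
193 | CFst (SCF): state coupling fault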
194 | ```
195 |
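196 | **Evolution of March algorithms**
197 | >PS: many operations in the steps exist to distinguish between fault types. If we only need to detect faults, not identify them, the number of steps can be reduced further.
198 |
199 | + MATS detects SAFs in the least time. It was proposed by Nair in 1979 and later improved by Abadir into MATS+.
200 | + March A and March B cover some coupling faults: combined (idempotent coupling + transition) faults, inversion coupling + idempotent coupling, and so on.
201 | + March C was proposed in 1982 and **has been in use for more than thirty years since!!!**
202 | + March C- refines March C by removing the read-0 operation in the fourth step. Fault coverage is not reduced: it still detects AF, SAF, TF and CF, **but it has no DRF detection capability.**
203 | + March-TB therefore adds (M2, M4, M6, M10), two delay operations and one write operation M9. M2, M4 and M6 distinguish between fault types that are already detectable, while M9, M10 and the delays detect and diagnose DRFs, which March C- cannot.
204 | + March-TB+ adds three operations to TB, all to distinguish between already-detectable fault types;
205 | + March-TBA adds operations, but only to distinguish between already-detectable fault types; it does not increase fault coverage;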
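
To make the start of this lineage concrete, here is MATS+ in its usual notation {⇕(w0); ⇑(r0,w1); ⇓(r1,w0)}, run against a small simulated memory with an injected address decoder fault. The `af_mem` model, in which one address wrongly selects another address's cell, is hypothetical and exists only for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_CELLS 8

/* Hypothetical memory with an optional address decoder fault:
 * address `alias_from` selects the cell that belongs to `alias_to`. */
typedef struct {
    uint8_t cells[N_CELLS];
    int     alias_from;  /* faulty address, or -1 for a fault-free decoder */
    int     alias_to;    /* cell actually selected by the faulty address */
} af_mem;

static size_t decode(const af_mem *m, size_t a)
{
    return ((int)a == m->alias_from) ? (size_t)m->alias_to : a;
}

static void wr(af_mem *m, size_t a, uint8_t v) { m->cells[decode(m, a)] = v; }
static uint8_t rd(const af_mem *m, size_t a)   { return m->cells[decode(m, a)]; }

/* MATS+: returns the first address whose read mismatches, or -1 on pass. */
int mats_plus(af_mem *m)
{
    assert(m != NULL);
    for (size_t a = 0; a < N_CELLS; a++)        /* any order: w0 */
        wr(m, a, 0);
    for (size_t a = 0; a < N_CELLS; a++) {      /* ascending: r0, w1 */
        if (rd(m, a) != 0) return (int)a;
        wr(m, a, 1);
    }
    for (size_t a = N_CELLS; a-- > 0; ) {       /* descending: r1, w0 */
        if (rd(m, a) != 1) return (int)a;
        wr(m, a, 0);
    }
    return -1;
}
```

With address 5 aliased to cell 3, the ascending element writes 1 through address 3 and then reads that same 1 back through address 5 where a 0 is expected, so the fault is reported at address 5.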
206 |
207 |
208 |
209 |
210 | (..., w0, r0,...) to detect SA1 faults and TFd
211 | (..., w1, r1,...) to detect SA0 faults and TFu
212 |
213 | # 3.2 Functional testing: memtester/MemTest86
214 | This part studies some of memtester's test functions, using the code from the [official site](http://pyropus.ca/software/memtester/); everything we run on ARM was ported from there.
215 | ## 3.2.1 History
216 |
217 | **What is the difference between memtester and MemTest86?**
218 |
219 | **memtester** runs on an OS, so it uses the OS API to allocate RAM and test it.
220 | **Memtest86** runs without an OS, is very small, and accesses RAM directly.
221 |
222 | **Memtest** and **Memtest86** are [open source](https://www.computerhope.com/jargon/o/opensour.htm) software utilities that scan and test computer memory chips ([RAM](https://www.computerhope.com/jargon/r/ram.htm)) for any defects or errors.
223 |
224 | >Both are used at the system level; neither is designed specifically for testing DRAM chips.
225 |
226 |
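227 | ## 3.2.2 Board-level testing
228 |
229 | [This article covers the algorithms](http://www.esacademy.com/en/library/technical-articles-and-documents/miscellaneous/software-based-memory-testing.html)
230 | + Physical or electrical defects cause catastrophic problems in RAM, which any ordinary algorithm will detect;
231 | + The typical problems are board-level: the CPU cannot read the memory, a memory chip is not detected, the wrong type of memory is inserted, and so on
232 | The author's reasoning: if the problem is in the memory chip, any ordinary algorithm will find it, but the three board-level problems above are not detected by ordinary algorithms, so these algorithms were designed for them.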
233 |
234 |
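235 | ## 3.2.3 memtester/MemTest86 test algorithms and their purposes
236 | ### [ memtester4.3.0 ]
237 | First the test items (top-level functional routines; the low-level assembly implementation is analysed later):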
238 |
239 | memtester-4.3.0 | memtester-ARM
240 | ---|---|
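241 | int test_stuck_address(bufa, count);|(√) Write each location with its own address, alternately inverted, then read back and compare; repeated 2 times (the upstream code repeats 16 times): tests the address bus |
242 | int test_random_value(bufa, bufb, count); |(√) Equivalent to test_random_comparison(bufa, bufb, count): data-sensitivity test case |
243 | int test_xor_comparison(bufa, bufb, count);|(-) Like test_random_value plus an XOR operation; one of the user scenarios, for use-case coverage. Data sensitivity / instruction function check; also verifies SAF; |
244 | int test_sub_comparison(bufa, bufb, count); |(-) Like test_random_value plus a subtraction; one of the user scenarios, for use-case coverage. Data sensitivity / instruction function check; also verifies SAF; |
245 | int test_mul_comparison(bufa, bufb, count); | (-) Like test_random_value plus a multiplication; one of the user scenarios, for use-case coverage. Data sensitivity / instruction function check; also verifies SAF;|
246 | int test_div_comparison(bufa, bufb, count); | (-) Like test_random_value plus a division; one of the user scenarios, for use-case coverage. Data sensitivity / instruction function check; also verifies SAF;|
247 | int test_or_comparison(bufa, bufb, count); | (√) Merged into test_random_comparison(); one of the user scenarios, for use-case coverage. Data sensitivity / instruction function check; also verifies SAF; |
248 | int test_and_comparison(bufa, bufb, count); | (√) Merged into test_random_comparison(); one of the user scenarios, for use-case coverage. Data sensitivity / instruction function check; also verifies SAF;|
249 | int test_seqinc_comparison(bufa, bufb, count); | (√) A subset of test_blockseq_comparison; simulates a customer stress-test scenario. |
250 | int test_solidbits_comparison(bufa, bufb, count); |(√) Write all 1s to both buffers, read back and compare, then do the same with all 0s. This is the Zero-One algorithm (Breuer & Friedman, 1976) for detecting SAFs: {w0,r0,w1,r1}, time complexity 4N, also called MSCAN. It verifies that every cell can be read and written, which indirectly tests for stuck-at faults|
251 | int test_checkerboard_comparison(bufa, bufb, count); | (√) Write each of several preset data backgrounds in turn, then read back and compare (note: papers say well-designed data backgrounds can detect state coupling faults; time complexity 4N). This case is designed to check whether adjacent locations affect each other.|
252 | int test_blockseq_comparison(bufa, bufb, count); | (√) Write one block of size count at a time, where the value is a byte replicated to fill 32 bits, then read back and compare; repeated 256 times. Also a stress case, just with more iterations; |
253 | int test_walkbits0_comparison(bufa, bufb, count); | (√) A single 1 bit moves through the 32-bit word, and for each position the whole buffer is filled. It walks from low to high and then from high to low (why? one purpose is detecting NPSF, then CFs, then data-sensitivity faults; note this is the 32-bit version, and an 8-bit version gives finer granularity) |
254 | int test_walkbits1_comparison(bufa, bufb, count); |(√) Same idea as above. Note: in MemTest86 this algorithm is called the moving inversions algorithm |
255 | int test_bitspread_comparison(bufa, bufb, count); |(√) Still moving within 32 bits, but this time it is two 1s rather than a single 0 or 1, with one empty position between them (a data-pattern variant for neighbourhood coupling faults: two 1s separated by 1 position, moved in step) |
256 | int test_bitflip_comparison(bufa, bufb, count); | (√) Again a single 1 bit moves through 32 bits to generate data patterns. For each pattern: {write the pattern and its complement alternately into buffers a and b, check after writing, then repeat the following eight times {use eight DMAs in parallel to move data from buffer a to buffer b, simulating repeated reads and writes of the same location in a short time to see whether data is lost}}. Core idea: repeated reads and writes of the same location within a short time. |
257 | int test_8bit_wide_random(bufa, bufb, count); | (√) Stores through a char pointer, i.e. 8 bits at a time; finer granularity; |
258 | int test_16bit_wide_random(bufa, bufb, count);|(√) Stores through an unsigned short pointer, i.e. 16 bits at a time; a different granularity; |
259 | ×|int test_crosstalk_comparison(bufa, bufb, count): [32 zeros, followed by a single 0 moving through 32 bits] written to memory in this layered pattern; (ascending only, without the reversal that the moving inversions algorithm performs)|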
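
The stuck-address row can be illustrated with a simplified sketch in the spirit of memtester's `test_stuck_address()`. It is not the upstream code: the function name, signature and return convention here are made up for the example, and only the idea (each word stores its own address or the complement, alternating by parity) is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each word is written with its own address, or the bitwise complement of it,
 * alternating by the parity of (round + index). The buffer is then read back.
 * A stuck or shorted address line makes two locations alias, so at least one
 * of them reads back a value that does not match its own address.
 * Returns the index of the first failing word, or -1 on pass. */
long stuck_address_check(volatile uintptr_t *buf, size_t count, unsigned rounds)
{
    assert(buf != NULL);
    for (unsigned j = 0; j < rounds; j++) {
        for (size_t i = 0; i < count; i++) {
            uintptr_t addr = (uintptr_t)&buf[i];
            buf[i] = ((j + i) % 2 == 0) ? addr : ~addr;
        }
        for (size_t i = 0; i < count; i++) {
            uintptr_t addr = (uintptr_t)&buf[i];
            uintptr_t expect = ((j + i) % 2 == 0) ? addr : ~addr;
            if (buf[i] != expect)
                return (long)i;
        }
    }
    return -1;
}
```

Passing `rounds = 2` corresponds to the ARM port described in the table, and `rounds = 16` to the upstream repeat count mentioned there.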
260 |
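261 | **Summary:** memtester takes the user's point of view throughout: it sets up different test scenarios (test cases) from the user's perspective and runs targeted **functional** tests. Note that this is system-level testing, so it is concerned not only with the DRAM chip but also with board-level routing, IO performance, the PCB and other related factors, and with whether your memory still **works correctly** under their influence;
262 |
263 | >Note 2: the checkerboard here looks a bit like the low-level NPSF (neighborhood pattern sensitive fault), but that is not what it is anchored on. The question is rather: after my customer plugs the memory into a computer, if data with this pattern gets written to memory during use, could the data interfere with itself and get lost? That could well mean a blue screen! In other words, we care about the surface-level behaviour, not the underlying defect.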
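
The walking-bit and bit-spread data patterns described in the table are easy to generate. The helpers below are a sketch of the pattern generators only (filling and comparing the buffers is omitted), with names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Step j (0..63) of a walk over a 32-bit word: the single 1 bit moves from
 * bit 0 up to bit 31 for j < 32, then back down to bit 0 for j >= 32. */
uint32_t walking_one(unsigned j)
{
    assert(j < 64);
    unsigned bit = (j < 32) ? j : 63 - j;
    return (uint32_t)1u << bit;
}

/* Same walk with a single 0 bit in a field of 1s. */
uint32_t walking_zero(unsigned j)
{
    return ~walking_one(j);
}

/* Two 1 bits separated by one empty position, shifted left by j (0..29). */
uint32_t bit_spread(unsigned j)
{
    assert(j < 30);
    return ((uint32_t)1u << j) | ((uint32_t)1u << (j + 2));
}
```

A test loop would fill the whole buffer with each pattern in turn and compare the two buffers after every fill.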
264 |
265 |
266 | ### [ MemTest86 ]
267 |
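268 | Test|Description
269 | ---|---
270 | Test 0 [Address test, walking ones, no cache]| Tests all address bits with a walking ones address pattern
271 | Test 1 [Moving Inv, ones&zeros, cached] | moving inversions algorithm with patterns of only ones and zeros.
272 | Test 2 [Address test, own address, no cache] | Writes each address with its own address and checks it; complements Test 0
273 | Test 3 [Moving inv, 8 bit pat, cached]|Focus on 8-bit patterns; finer checking granularity
274 | Test 4 [Moving inv, 32 bit pat, cached]| Detects data-sensitive faults;
275 | Test 5 [Block move, 64 moves, cached]|Checks whether memory fails under stress;
276 | Test 6 [Modulo 20, ones&zeros, cached]| Removes the influence of the cache; still moving inversions
277 | Test 7 [Moving inv, ones&zeros, no cache]| Adds a run with the cache off to check it is still OK; the time cost rises noticeably! (addresses use-case coverage)
278 | Test 8 [Block move, 512 moves, cached]| Since we are applying stress, try different stress levels!
279 | Test 9 [Moving inv, 8 bit pat, no cache]|In theory detects all faults, since the granularity is down to 8 bits, but the time cost is something else!
280 | Test 10 [Modulo 20, 8 bit, cached]|Combining data patterns (previously just patterns of 0s and 1s) with the cache-defeating Modulo-X algorithm is indeed more effective, but the run time is long too
281 | Test 11 [Moving inv, 32 bit pat, no cache]|Uses 32-bit data patterns, so it is the most effective at finding data-sensitive faults, but with no cache the speed is painfully slow.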
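
Several rows above rely on moving inversions. One pass of the basic idea can be sketched as below; this is a simplified illustration, not MemTest86's implementation, and the function name is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One moving-inversions pass with a fixed pattern:
 *   1. fill memory with the pattern;
 *   2. ascending: check the pattern, write its complement;
 *   3. descending: check the complement, write the pattern back.
 * Returns the index of the first failing word, or -1 on pass. */
long moving_inversions(volatile uint32_t *mem, size_t n, uint32_t pattern)
{
    const uint32_t inverse = (uint32_t)~pattern;
    assert(mem != NULL);
    for (size_t i = 0; i < n; i++)
        mem[i] = pattern;
    for (size_t i = 0; i < n; i++) {
        if (mem[i] != pattern) return (long)i;
        mem[i] = inverse;
    }
    for (size_t i = n; i-- > 0; ) {
        if (mem[i] != inverse) return (long)i;
        mem[i] = pattern;
    }
    return -1;
}
```

Running it with all-zeros and all-ones patterns corresponds to the "ones&zeros" variants in the table, and with byte-replicated or shifting 32-bit patterns to the "8 bit pat" and "32 bit pat" variants.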
282 |
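283 | >**Summary**: again the user's point of view: set up different test scenarios (test cases) and run targeted **functional** tests. Note that this is system-level testing, so it is concerned not only with the DRAM chip but with whether your memory still **functions** correctly once board-level routing, IO performance, the PCB and other related factors are all taken into account;
284 | # 3.3 Performance testing: DS5 code
285 |
286 | Not shown here, since it is platform-specific; in the end it is just some microcontroller code.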
287 |
--------------------------------------------------------------------------------
/0x01 看过DRAM部分论文.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
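4 | # Textbook
5 | A.J. van de Goor is the founding-father figure of this field. Below is a textbook of his, published in the 2000s, that covers the basic concepts of SDRAM. **Reading it alongside some papers gives a knowledge map with both breadth and detail.**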
6 |
7 | A.J. van de Goor:《Testing Semiconductor Memories Theory and Practice》
8 |
9 | # Papers
10 |
11 | I selected papers from IEEE, downloaded everything from 1987 to 2011, and read them selectively (with LPDDR3 in mind). There appear to be no new papers after 2011.
12 |
13 | IEEE:[http://ieeexplore.ieee.org/search/searchresult.jsp?searchWithin=%22Authors%22:.QT.A.J.%20van%20de%20Goor.QT.&newsearch=true](http://ieeexplore.ieee.org/search/searchresult.jsp?searchWithin=%22Authors%22:.QT.A.J.%20van%20de%20Goor.QT.&newsearch=true)
14 |
15 | 1987 [Amore Address Mapping with Overlapped Rotating Entries](http://ieeexplore.ieee.org/document/4089851/)
16 |
17 | 1988 [Test pattern generation for API faults in RAM](http://ieeexplore.ieee.org/document/8710/)
18 |
19 | 1988 [CMOS SRAM functional test with quiescent write supply current](http://ieeexplore.ieee.org/document/730724/)
20 |
21 | 1990 [An efficient, reference weight-based garbage collection method for distributed systems](http://ieeexplore.ieee.org/document/77179/)
22 |
23 | 1990 [Developments towards real-time frame-to-frame automatic contour detection on echocardiograms](http://ieeexplore.ieee.org/document/144251/)
24 |
25 | 1990 [Functional memory array testing](http://ieeexplore.ieee.org/document/113652/)
26 |
27 | 1991:[Logic synthesis of 100-percent testable logic networks](http://ieeexplore.ieee.org/document/139937/)
28 |
29 | 1991 [Locating Bridging Faults in Memory Arrays](http://ieeexplore.ieee.org/document/519733/)
30 |
31 | 1992:[Test generation for C-testable one-dimensional CMOS ILA's without observable vertical outputs](http://ieeexplore.ieee.org/document/205969/)
32 |
33 | 1992:[Functional tests for arbitration SRAM-type FIFOs](http://ieeexplore.ieee.org/document/224443/)
34 |
35 | 1992:[Functional testing of modern microprocessors](http://ieeexplore.ieee.org/document/205953/)
36 |
37 | 1993:[Effective march algorithms for testing single-order addressed memories](http://ieeexplore.ieee.org/document/386425/)
38 |
39 | 1993:[Using march tests to test SRAMs](http://ieeexplore.ieee.org/document/199799/)
40 |
41 | 1993 [Automatic verification of march tests [SRAMs]](http://ieeexplore.ieee.org/document/263136/)
42 |
43 | 1993 [Fault models and tests specific for FIFO functionality](http://ieeexplore.ieee.org/document/263146/)
44 |
45 | 1993 [Test pattern generation with restrictors](http://ieeexplore.ieee.org/document/470644/)
46 |
47 | 1994 [The automatic generation of march tests](http://ieeexplore.ieee.org/document/397192/)
48 |
49 | 1994 [BIST for ring-address SRAM-type FIFOs](http://ieeexplore.ieee.org/document/397188/)
50 |
51 | 1994 [Test generation and three-state elements, buses, and bidirectionals](http://ieeexplore.ieee.org/document/292325/)
52 |
53 | 1994 [Parallel pattern fast fault simulation for three-state circuits and bidirectional I/O](http://ieeexplore.ieee.org/document/528005/)
54 |
55 | 1994 [A 16×16-bit static CMOS wave-pipelined multiplier](http://ieeexplore.ieee.org/document/409217/)
56 |
57 | 1994 [Fault models and tests for Ring Address Type FIFOs](http://ieeexplore.ieee.org/document/292297/)
58 |
59 | 1994 [Automating the verification of memory tests](http://ieeexplore.ieee.org/document/292295/)
60 |
61 | 1994:[An effective BIST scheme for ring-address type FIFOs](http://ieeexplore.ieee.org/document/527979/)
62 |
63 | 1994:[Generating march tests automatically](http://ieeexplore.ieee.org/document/528034/)
64 |
65 | 1994:[Real-time frame-to-frame automatic contour detection on echocardiograms](http://ieeexplore.ieee.org/document/470257/)
66 |
67 | 1994 [Functional tests for ring-address SRAM-type FIFOs](http://ieeexplore.ieee.org/document/326797/)
68 |
69 | 1995:[Pseudo-exhaustive word-oriented DRAM testing](http://ieeexplore.ieee.org/document/470409/)
70 |
71 | 1995 [Compact test sets for industrial circuits](http://ieeexplore.ieee.org/document/512661/)
72 |
73 | 1995 [Coping with re-usability using sequential ATPG: a practical case study](http://ieeexplore.ieee.org/document/529840/)
74 |
75 | 1995 [Functional test for shifting-type FIFOs](http://ieeexplore.ieee.org/document/470408/)
76 |
77 | 1995 [Automatic computation of test length for pseudo-random memory tests](http://ieeexplore.ieee.org/document/518082/)
78 |
79 | 1996 [Circuit partitioned automatic test pattern generation constrained by three-state buses and restrictors](http://ieeexplore.ieee.org/document/555132/)
80 |
81 | 1996 [Accelerated compact test set generation for three-state circuits](http://ieeexplore.ieee.org/document/556940/)
82 |
83 | 1996 [RAM diagnostic tests](http://ieeexplore.ieee.org/document/782499/)
84 |
85 | 1996 [Towards a uniform notation for memory tests](http://ieeexplore.ieee.org/document/494335/)
86 |
87 | 1996 [March LR: a test for realistic linked faults](http://ieeexplore.ieee.org/document/510868/)
88 |
89 | 1996 [RAM testing algorithms for detection multiple linked faults](http://ieeexplore.ieee.org/document/494337/)
90 |
91 | 1997 [Sequential test generation with advanced illegal state search](http://ieeexplore.ieee.org/document/639686/)
92 |
93 | 1997 [An analysis of (linked) address decoder faults](http://ieeexplore.ieee.org/document/619389/)
94 |
95 | 1997:[An open notation for memory tests](http://ieeexplore.ieee.org/document/619398/)
96 |
97 | 1997: [Disturb neighborhood pattern sensitive fault](http://ieeexplore.ieee.org/document/599439/)
98 |
99 | 1997 [The implementation of pseudo-random memory tests on commercial memory testers](http://ieeexplore.ieee.org/document/639618/)
100 |
101 | 1997 [March LA: a test for linked memory faults](http://ieeexplore.ieee.org/document/582440/)
102 |
103 | 1998 [Fault models and tests for two-port memories](http://ieeexplore.ieee.org/document/670898/)
104 |
105 | 1998 [Consequences of port restrictions on testing two-port memories](http://ieeexplore.ieee.org/document/743138/)
106 |
107 | 1998 [Semiconductor manufacturing process monitoring using built-in self-test for embedded memories](http://ieeexplore.ieee.org/document/743277/)
108 |
109 | 1998 [Testing Embedded Memories: Is BIST The Ultimate Solution? Answers to the Key Issues](http://ieeexplore.ieee.org/document/741671/)
110 |
111 | 1998 [Complete search in test generation for industrial circuits with improved bus-conflict detection](http://ieeexplore.ieee.org/document/741616/)
112 |
113 | 1998:[Converting March tests for bit-oriented memories into tests for word-oriented memories](http://ieeexplore.ieee.org/document/705945/)
114 |
115 | 1998:[Consequences of port restrictions on testing address decoder faults in two-port memories](http://ieeexplore.ieee.org/document/741636/)
116 |
117 | 1998 [March tests for word-oriented memories](http://ieeexplore.ieee.org/document/655905/)
118 |
119 | 1999 [Defining SRAM resistive defects and their simulation stimuli](http://ieeexplore.ieee.org/document/810726/)
120 |
121 | 1999 [Fault (in)dependent cost estimates and conflict-directed backtracking to guide sequential circuit test generation](http://ieeexplore.ieee.org/document/810749/)
122 |
123 | 1999:[March tests for word-oriented two-port memories](http://ieeexplore.ieee.org/document/810729/)
124 |
125 | 1999 [Testability of the Philips 80C51 micro-controller](http://ieeexplore.ieee.org/document/805813/)
126 |
127 | 1999 [Port interference faults in two-port memories](http://ieeexplore.ieee.org/document/805833/)
128 |
129 | 1999 [Industrial evaluation of stress combinations for march tests applied to SRAMs](http://ieeexplore.ieee.org/document/805831/)
130 |
131 | 1999 [Illegal state space identification for sequential circuit test generation](http://ieeexplore.ieee.org/document/761213/)
132 |
133 | 2000 [Industrial evaluation of DRAM SIMM tests](http://ieeexplore.ieee.org/document/894234/)
134 |
135 | 2000 [An experimental analysis of spot defects in SRAMs: realistic fault models and tests](http://ieeexplore.ieee.org/document/893615/)
136 |
137 | 2000:[Functional memory faults: a formal notation and a taxonomy](http://ieeexplore.ieee.org/document/843856/)
138 |
139 | 2000 [March tests for realistic faults in two-port memories](http://ieeexplore.ieee.org/document/868618/)
140 |
141 | 2000 [Test point insertion for compact test sets](http://ieeexplore.ieee.org/document/894217/)
142 |
143 | 2000:[Impact of memory cell array bridges on the faulty behavior in embedded DRAMs](http://ieeexplore.ieee.org/document/893638/)
144 |
145 | 2001 [Transient faults in DRAMs: concept, analysis and impact on tests](http://ieeexplore.ieee.org/document/945229/)
146 |
147 | 2001 [Tests for resistive and capacitive defects in address decoders](http://ieeexplore.ieee.org/document/990255/)
148 |
149 | 2001 [Realistic fault models and test procedure for multi-port SRAMs](http://ieeexplore.ieee.org/document/945230/)
150 |
151 | 2001:[Static and dynamic behavior of memory cell array opens and shorts in embedded DRAMs](http://ieeexplore.ieee.org/document/915069/)
152 |
153 | 2001 [Simulation based analysis of temperature effect on the faulty behavior of embedded DRAMs](http://ieeexplore.ieee.org/document/966700/)
154 |
155 | 2001 [Detecting unique faults in multi-port SRAMs](http://ieeexplore.ieee.org/document/990256/)
156 |
157 | 2001 [Simulation and development of short transparent tests for RAM](http://ieeexplore.ieee.org/document/990276/)
158 |
159 | 2001 [A memory specific notation for fault modeling](http://ieeexplore.ieee.org/document/990257/)
160 |
161 | 2002 [DRAM specific approximation of the faulty behavior of cell defects](http://ieeexplore.ieee.org/document/1181694/)
162 |
163 | 2002 [Test point insertion that facilitates ATPG in reducing test time and data volume](http://ieeexplore.ieee.org/document/1041754/)
164 |
165 | 2002:[March SS: a test for all static simple RAM faults](http://ieeexplore.ieee.org/document/1029769/)
166 |
167 | 2002:[Minimal test for coupling faults in word-oriented memories](http://ieeexplore.ieee.org/document/998413/)
168 |
169 | 2002:[Thorough testing of any multiport memory with linear tests](http://ieeexplore.ieee.org/document/980260/)
170 |
171 | 2002:[Efficient tests for realistic faults in dual-port SRAMs](http://ieeexplore.ieee.org/document/1004586/)
172 |
173 | 2002 [Modeling techniques and tests for partial faults in memory devices](http://ieeexplore.ieee.org/document/998254/)
174 |
175 | 2002 [Testing static and dynamic faults in random access memories](http://ieeexplore.ieee.org/document/1011170/)
176 |
177 | 2002 [Address and data scrambling: causes and impact on memory tests](http://ieeexplore.ieee.org/document/994601/)
178 |
179 | 2002 [Approximating infinite dynamic behavior for DRAM cell defects](http://ieeexplore.ieee.org/document/1011171/)
180 |
181 | 2003 [Systematic memory test generation for DRAM defects causing two floating nodes](http://ieeexplore.ieee.org/document/1222357/)
182 |
183 | 2003 [Detecting intra-word faults in word-oriented memories](http://ieeexplore.ieee.org/document/1197657/)
184 |
185 | 2003 [Consequences of RAM bitline twisting for test coverage](http://ieeexplore.ieee.org/document/1253788/)
186 |
187 | 2003 [Static and dynamic behavior of memory cell array spot defects in embedded DRAMs](http://ieeexplore.ieee.org/document/1183945/)
188 |
189 | 2003 [TPI for improving PR fault coverage of Boolean and three-state circuits](http://ieeexplore.ieee.org/document/1231661/)
190 |
191 | 2003: [A systematic method for modifying march tests for bit-oriented memories into tests for word-oriented memories](http://ieeexplore.ieee.org/document/1234529/)
192 |
193 | 2003:[March SL: a test for all static linked memory faults](http://ieeexplore.ieee.org/document/1250839/)
194 |
195 | 2003 [A fault primitive based analysis of linked faults in RAMs](http://ieeexplore.ieee.org/document/1222358/)
196 |
197 | 2003 [Analyzing the impact of process variations on DRAM testing using border resistance traces](http://ieeexplore.ieee.org/document/1250777/)
198 |
199 | 2003 [Test generation and optimization for DRAM cell defects using electrical simulation](http://ieeexplore.ieee.org/document/1233823/)
200 |
201 | 2003 [Optimizing stresses for testing DRAM cell defects using electrical simulation](http://ieeexplore.ieee.org/document/1253656/)
202 |
203 | 2003 [Importance of dynamic faults for new SRAM technologies](http://ieeexplore.ieee.org/document/1231665/)
204 |
205 | 2004 [An industrial evaluation of DRAM tests](http://ieeexplore.ieee.org/document/1341382/)
206 |
207 | 2004 [Effects of bit line coupling on the faulty behavior of DRAMs](http://ieeexplore.ieee.org/document/1299234/)
208 |
209 | 2004 [Soft faults and the importance of stresses in memory testing](http://ieeexplore.ieee.org/document/1269037/)
210 |
211 | 2004:[Linked faults in random access memories: concept, fault models, test algorithms, and industrial results](http://ieeexplore.ieee.org/document/1291585/)
212 |
213 | 2004:[The state-of-art and future trends in testing embedded memories](http://ieeexplore.ieee.org/document/1327984/)
214 |
215 | 2004 [Tests for address decoder delay faults in RAMs due to inter-gate opens](http://ieeexplore.ieee.org/document/1347646/)
216 |
217 | 2004 [The effectiveness of the scan test and its new variants](http://ieeexplore.ieee.org/document/1327980/)
218 |
219 | 2004 [Detecting faults in the peripheral circuits and an evaluation of SRAM test](http://ieeexplore.ieee.org/document/1386943/)
220 |
221 | 2005 [Framework for fault analysis and test generation in DRAMs](http://ieeexplore.ieee.org/document/1395723/)
222 |
223 | 2005 [Impact of stresses on the fault coverage of memory tests](http://ieeexplore.ieee.org/document/1498211/)
224 |
225 | 2006:[Memory test experiment: industrial results and data](http://ieeexplore.ieee.org/document/1576336/)
226 |
227 | 2006:[Influence of Bit-Line Coupling and Twisting on the Faulty Behavior of DRAMs](http://ieeexplore.ieee.org/document/4014514/)
228 |
229 | 2006 [Opens and Delay Faults in CMOS RAM Address Decoders](http://ieeexplore.ieee.org/document/1717393/)
230 |
231 | 2006 [Space of DRAM Fault Models and Corresponding Testing](http://ieeexplore.ieee.org/document/1657087/)
232 |
233 | 2007:[An Investigation on Capacitive Coupling in RAM Address Decoders](http://ieeexplore.ieee.org/document/4437418/)
234 |
235 | 2008:[Defect Oriented Testing of the Strap Problem Under Process Variations in DRAMs](http://ieeexplore.ieee.org/document/4700631/)
236 |
237 | 2009 [New Algorithms for Address Decoder Delay Faults and Bit Line Imbalance Faults](http://ieeexplore.ieee.org/document/5359299/)
238 |
239 | 2010 [Low-cost, customized and flexible SRAM MBIST engine](http://ieeexplore.ieee.org/document/5491749/)
240 |
241 | 2010 [Advanced embedded memory testing: Reducing the defect per million level at lower test cost](http://ieeexplore.ieee.org/document/5491828/)
242 |
243 | 2010 [Using a CISC microcontroller to test embedded memories](http://ieeexplore.ieee.org/document/5491773/)
244 |
245 | 2010 [Memory testing with a RISC microcontroller](http://ieeexplore.ieee.org/document/5457210/)
246 |
247 | 2010 [MBIST architecture framework based on orthogonal constructs](http://ieeexplore.ieee.org/document/5724423/)
248 |
249 | 2011 [Optimizing memory BIST Address Generator implementations](http://ieeexplore.ieee.org/document/5941430/)
250 |
251 | 2011 [Generic, orthogonal and low-cost March Element based memory BIST](http://ieeexplore.ieee.org/document/6139148/)
252 |
--------------------------------------------------------------------------------
/0x07 什么?!NEON还要优化?[重点].md:
--------------------------------------------------------------------------------
1 |
2 | # Official introduction:
3 | + [NEON overview](https://developer.arm.com/technologies/neon)
4 | + [NEON Programmer’s Guide Version: 1.0](https://developer.arm.com/products/architecture/a-profile/docs/den0018/a)
5 |
6 | # First impressions
7 | **What NEON is**
8 | Arm NEON technology is an advanced SIMD (single instruction multiple data) architecture extension for the Arm Cortex-A series and Cortex-R52 processors.
9 |
10 | 
11 |
12 |
13 | NEON technology was introduced to the Armv7-A and Armv7-R profiles. It is also now an extension to the Armv8-A and Armv8-R profiles.
14 |
15 | NEON technology is intended to improve the multimedia user experience by accelerating audio and video encoding/decoding, user interface, 2D/3D graphics or gaming. NEON can also accelerate signal processing algorithms and functions to speed up applications such as audio and video processing, voice and facial recognition, computer vision and deep learning.
16 |
17 |
18 | **The official site covers many more details; go read them yourself!**
19 |
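20 | # Continuous optimization
21 | ## Getting acquainted
22 |
23 | [A first look at NEON](https://community.arm.com/cn/b/blog/posts/neon)
24 |
25 | NEON is an acceleration unit built on the SIMD model. It was originally made for multimedia, where the workload is heavy and the arithmetic is very uniform, so the folks designing the CPU presumably thought: can we build a parallel unit just for this kind of work?
26 |
27 | And so this thing came to be. Yep, that is the whole story (I made it up).
28 |
29 | **A detailed tutorial series:**
30 |
31 | [Android developers, start here: ARM NEON programming quick reference](https://community.arm.com/android-community/b/android/posts/arm-neon-programming-quick-reference)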
32 |
33 | [Coding for NEON - Part 1: Load and Stores](https://community.arm.com/processors/b/blog/posts/coding-for-neon---part-1-load-and-stores)
34 |
35 | [Coding for NEON - Part 2: Dealing With Leftovers](https://community.arm.com/processors/b/blog/posts/coding-for-neon---part-2-dealing-with-leftovers)
36 |
37 | [Coding for NEON - Part 3: Matrix Multiplication](https://community.arm.com/processors/b/blog/posts/coding-for-neon---part-3-matrix-multiplication)
38 |
39 | [Coding for NEON - Part 4: Shifting Left and Right](https://community.arm.com/processors/b/blog/posts/coding-for-neon---part-4-shifting-left-and-right)
40 |
41 | [Coding for NEON - Part 5: Rearranging Vectors](https://community.arm.com/processors/b/blog/posts/coding-for-neon---part-5-rearranging-vectors)
42 |
43 |
44 |
45 | ---
46 |
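47 | ## Examples
48 |
49 | [Applying ARM NEON to a parking-space recognition algorithm](https://community.arm.com/cn/b/blog/posts/arm-neon)
50 |
51 | It makes two core points:
52 | + Unroll loops so that the loop-condition check takes up a smaller share of the run time; this optimizes CPU compute time.
53 | + For arithmetic such as multiplication, load the four 32-bit values exposed by the unrolling in the first step into a Q register at once and compute them in a single instruction. In theory that is a 4x speedup!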
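
A minimal C sketch of the first point, with the second point noted in comments. The function name `mul_unrolled4` is hypothetical and not taken from the linked article; the code is plain C, so whether it turns into NEON instructions is up to the compiler.

```c
#include <stddef.h>

/* Hypothetical helper: multiply two float arrays element by element.
 * The main loop is unrolled four ways, so the loop test runs once per
 * four elements. The four independent multiplies correspond to the four
 * 32-bit lanes that one Q register can hold. */
void mul_unrolled4(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = a[i]     * b[i];
        dst[i + 1] = a[i + 1] * b[i + 1];
        dst[i + 2] = a[i + 2] * b[i + 2];
        dst[i + 3] = a[i + 3] * b[i + 3];
    }
    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; ++i)
        dst[i] = a[i] * b[i];
}
```

Calling `mul_unrolled4(dst, a, b, 6)` handles the first four elements in the unrolled body and the last two in the cleanup loop.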
54 |
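55 | >**Summary:** It covers speedups and even touches on related compiler optimizations, but it is still not geeky enough, not thorough enough, not satisfying enough!
56 |
57 |
58 | **Read the next article and you will see what rigorous, and geeky, really means!**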
59 |
60 |
61 | ---
62 |
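63 | [NEON assembly optimization, seen through a complex dot-product algorithm](https://community.arm.com/cn/b/blog/posts/neon-assemble-optimization-2013)
64 |
65 | It goes into instruction reordering, microarchitecture, pipelines and other deep topics. I do not fully understand them; for now it is enough to know the design principles.
66 |
67 | **The point to watch most closely: instruction scheduling in the pipeline**
68 |
69 | + What does instruction scheduling depend on? Mainly, take care not to trigger a **CPU hazard**.
70 | + Hazards are introduced here: [**link**](https://en.wikipedia.org/wiki/Hazard_(computer_architecture)).
71 | #### Hazard
72 | In short: in a CPU microarchitecture, when the next instruction cannot execute in the following clock cycle, the resulting pipeline problem is called a hazard.
73 |
74 | An example 🌰: say I write data into Q0 and that instruction enters the pipeline (assume stages such as fetch, compute and store). The very next instruction reads Q0 and computes with it. Then I may well read Q0's old value! Because of pipelining, the earlier write to Q0 may not have been written back by the time I read it.
75 |
76 | See the figure below (it may also cure a chronically stiff neck):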
77 | 
78 |
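79 | Suppose the first instruction takes many cycles (NEON instructions really do). It is pushed into the pipeline and split into many small steps, the last of which writes the data back to Q0. Suppose the very next instruction first reads Q0 and then computes with it. Looking at the figure, isn't it quite possible that the previous instruction has not yet written Q0 when the next one reads it? **In microarchitecture terms this is called a hazard.**
80 |
81 | >Of course this must never be allowed to happen. I do not know exactly how the hardware resolves it, but I do know the read has to wait until Q0 has been written. So the pipeline is inevitably disrupted and some resources are inevitably sacrificed, which is why this instruction pattern is discouraged. The NEON optimization section below backs this up with a more rigorous example.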
82 |
83 |
84 |
85 | ---
86 | [Ne10 Library Getting Started](https://community.arm.com/android-community/b/android/posts/ne10-library-getting-started)
87 |
88 | It describes the structure of the Ne10 library and shows how to build and use it. As the title says, it is an introductory article.
89 |
90 | [Using Ne10 on Android and iOS](https://community.arm.com/android-community/b/android/posts/using-ne10-on-android-and-ios)
91 |
92 | This one is a worked example showing how to use it on Android and iOS.
93 |
94 |
95 | ## What!? You still want to optimize NEON?
96 | [ARM NEON optimization](https://community.arm.com/android-community/b/android/posts/arm-neon-optimization)
97 |
98 | #### Introduction
99 | As the name says, it is about NEON optimization and covers some practical techniques. Very important!
100 |
101 | #### Skill1:Remove data dependencies
102 |
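103 | On the **ARMv7-A platform** (==**note: other platforms may not behave this way and need their own tuning**==), NEON instructions usually take more cycles than standard ARM instructions.
104 |
105 |
106 | (Original: On ARMv7-A platform, NEON instructions usually take more cycles than ARM instructions. To reduce instruction latency, it’s better to avoid using the destination register of current instruction as the source register of next instruction.)
107 |
108 | So, to reduce instruction latency, **avoid using the destination register of the current instruction as the source register of the next instruction**.
109 |
110 | >A bold guess on my part: why does using the current instruction's destination register as the next instruction's source register add latency? I think it is a pipeline issue. NEON instructions take more cycles than ARM instructions, which means they stay in the pipeline longer. To make sure no hazard occurs (see the wiki explanation above), the pipeline has to be held up or flushed almost as soon as the dependent instruction enters it, which wastes a great deal of resources and efficiency. Just my humble opinion; comments welcome.
111 |
112 | **C implementation:**
113 |
114 | It takes the element-wise differences of two blocks of memory and sums their squares, i.e. a sum-of-squared-errors function (the core of a mean-squared-error calculation). Right after the difference is computed (written to a destination register), it is squared (read as a source register).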
115 |
116 | ```c
117 | float SumSquareError_C(const float* src_a, const float* src_b, int count)
118 | {
119 |     float sse = 0.0f;  /* running sum of squared differences */
120 |     int i;
121 |     for (i = 0; i < count; ++i) {
122 |         float diff = src_a[i] - src_b[i];
123 |         sse += diff * diff;
124 |     }
125 |     return sse;
126 | }
127 | ```
128 |
129 |
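130 | **Assembly implementation, version 1**:
131 |
132 | This is close to the conventional way of writing it. As the comment notes, q0 is the destination register of vsub and immediately becomes a source register of the following vmla, so the pipeline is interrupted here (see the rough explanation in the **Hazard** section above).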
133 | ```
134 | float SumSquareError_NEON1(const float* src_a, const float* src_b, int count)
135 | {
136 | float sse;
137 | asm volatile (
138 | // Clear q8, q9, q10, q11
139 | "veor q8, q8, q8 \n"
140 | "veor q9, q9, q9 \n"
141 | "veor q10, q10, q10 \n"
142 | "veor q11, q11, q11 \n"
143 | "1: \n"
144 |
145 | "vld1.32 {q0, q1}, [%[src_a]]! \n"
146 | "vld1.32 {q2, q3}, [%[src_a]]! \n"
147 | "vld1.32 {q12, q13}, [%[src_b]]! \n"
148 | "vld1.32 {q14, q15}, [%[src_b]]! \n"
149 |
150 | "subs %[count], %[count], #16 \n"
151 |
152 | // q0, q1, q2, q3 are the destination of vsub.
153 | // they are also the source of vmla.
154 |
155 | "vsub.f32 q0, q0, q12 \n"
156 | "vmla.f32 q8, q0, q0 \n"
157 | "vsub.f32 q1, q1, q13 \n"
158 | "vmla.f32 q9, q1, q1 \n"
159 | "vsub.f32 q2, q2, q14 \n"
160 | "vmla.f32 q10, q2, q2 \n"
161 | "vsub.f32 q3, q3, q15 \n"
162 | "vmla.f32 q11, q3, q3 \n"
163 | "bgt 1b \n"
164 | "vadd.f32 q8, q8, q9 \n"
165 | "vadd.f32 q10, q10, q11 \n"
166 | "vadd.f32 q11, q8, q10 \n"
167 | "vpadd.f32 d2, d22, d23 \n"
168 | "vpadd.f32 d0, d2, d2 \n"
169 | "vmov.32 %3, d0[0] \n"
170 | : "+r"(src_a),
171 | "+r"(src_b),
172 | "+r"(count),
173 | "=r"(sse)
174 | :
175 | : "memory", "cc", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11",
176 | "q12", "q13","q14", "q15");
177 | return sse;
178 | }
179 | ```
180 |
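181 | **Assembly implementation, version 2**:
182 |
183 | This ordering is much more sensible. The subtractions and the squarings are no longer back to back; other operations are interleaved between them, so by the time Q0 is read the pipeline has had enough time to write its value.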
184 |
185 | ```
186 | float SumSquareError_NEON2(const float* src_a, const float* src_b, int count)
187 | {
188 | float sse;
189 | asm volatile (
190 | // Clear q8, q9, q10, q11
191 | "veor q8, q8, q8 \n"
192 | "veor q9, q9, q9 \n"
193 | "veor q10, q10, q10 \n"
194 | "veor q11, q11, q11 \n"
195 | "1: \n"
196 |
197 | "vld1.32 {q0, q1}, [%[src_a]]! \n"
198 | "vld1.32 {q2, q3}, [%[src_a]]! \n"
199 |
200 | "vld1.32 {q12, q13}, [%[src_b]]! \n"
201 | "vld1.32 {q14, q15}, [%[src_b]]! \n"
202 |
203 | "subs %[count], %[count], #16 \n"
204 |
205 | "vsub.f32 q0, q0, q12 \n"
206 | "vsub.f32 q1, q1, q13 \n"
207 | "vsub.f32 q2, q2, q14 \n"
208 | "vsub.f32 q3, q3, q15 \n"
209 |
210 | "vmla.f32 q8, q0, q0 \n"
211 | "vmla.f32 q9, q1, q1 \n"
212 | "vmla.f32 q10, q2, q2 \n"
213 | "vmla.f32 q11, q3, q3 \n"
214 | "bgt 1b \n"
215 |
216 | "vadd.f32 q8, q8, q9 \n"
217 | "vadd.f32 q10, q10, q11 \n"
218 | "vadd.f32 q11, q8, q10 \n"
219 | "vpadd.f32 d2, d22, d23 \n"
220 | "vpadd.f32 d0, d2, d2 \n"
221 | "vmov.32 %3, d0[0] \n"
222 | : "+r"(src_a),
223 | "+r"(src_b),
224 | "+r"(count),
225 | "=r"(sse)
226 | :
227 | : "memory", "cc", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11",
228 | "q12", "q13","q14", "q15");
229 | return sse;
230 | }
231 | ```
232 |
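233 | >**Conclusion**: On the ARMv7 platform tested, assembly version 2 is **about 30% faster** than version 1. In addition, for NEON code written with intrinsics, the compiler can do this instruction-order tuning.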
234 | `Note: this test runs on Cortex-A9. The result may be different on other platforms.`
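
The reordering idea in version 2 can also be sketched in portable C. This is only an illustration under assumptions: the name `sum_square_error_4acc` is hypothetical, the four partial sums stand in for q8..q11, and a compiler is free to schedule the statements differently. Because floating-point addition is not associative, the result can differ from the simple C version in the last bits.

```c
#include <stddef.h>

/* Hypothetical portable sketch, not the article's code: four independent
 * partial sums play the role of q8..q11, so consecutive multiply-adds do
 * not wait on each other's results. */
float sum_square_error_4acc(const float *a, const float *b, size_t count)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        /* All differences first (like the grouped vsub)... */
        float d0 = a[i]     - b[i];
        float d1 = a[i + 1] - b[i + 1];
        float d2 = a[i + 2] - b[i + 2];
        float d3 = a[i + 3] - b[i + 3];
        /* ...then all multiply-accumulates (like the grouped vmla). */
        s0 += d0 * d0;
        s1 += d1 * d1;
        s2 += d2 * d2;
        s3 += d3 * d3;
    }
    /* Leftover elements when count is not a multiple of 4. */
    for (; i < count; ++i) {
        float d = a[i] - b[i];
        s0 += d * d;
    }
    /* Final reduction, like the vadd/vpadd tail of the assembly. */
    return (s0 + s1) + (s2 + s3);
}
```

Unlike the assembly versions, which consume 16 floats per iteration, this sketch accepts any `count`.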
235 |
236 |
237 | #### Skill2:Reduce branches
238 |
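239 | The NEON instruction set has no branch instructions; if you really must branch, you have to use ARM instructions.
240 |
241 | Branch prediction is widely used in ARM processors, but a misprediction is especially costly. `(Why? I think it is... how would I know! I do not design ARM cores. What we do know: if you mispredict a stock you will take a loss, possibly a very painful one...)`
242 |
243 | So avoiding branch instructions is a wise choice. In practice you still need to branch sometimes, so what then? Use an equivalent substitution, for example **logical operations in place of branch selection**.
244 |
245 | **Talk less, show me the code.**
246 |
247 | **C version:**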
248 | ```
249 | if( flag )
250 | {
251 | dst[x * 4] = a;
252 | dst[x * 4 + 1] = a;
253 | dst[x * 4 + 2] = a;
254 | dst[x * 4 + 3] = a;
255 | }
256 | else
257 | {
258 | dst[x * 4] = b;
259 | dst[x * 4 + 1] = b;
260 | dst[x * 4 + 2] = b;
261 | dst[x * 4 + 3] = b;
262 | }
263 | ```
264 |
265 | **NEON version:**
266 |
267 | ```
268 | //dst[x * 4] = (a&Eflag) | (b&~Eflag);
269 |
270 | //dst[x * 4 + 1] = (a&Eflag) | (b&~Eflag);
271 |
272 | //dst[x * 4 + 2] = (a&Eflag) | (b&~Eflag);
273 |
274 | //dst[x * 4 + 3] = (a&Eflag) | (b&~Eflag);
275 |
276 | VBSL qFlag, qA, qB
277 | ```
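278 | **VBSL (bitwise select)**:
279 |
280 | + where a bit of the destination is 1, the instruction selects the corresponding bit from the first operand;
281 | + where a bit of the destination is 0, it selects the corresponding bit from the second operand.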
282 |
283 |
284 | The ARM NEON instruction set provides the following instructions to help implement the logical operations above:
285 | ```
286 | VCEQ, VCGE, VCGT, VCLE, VCLT……
287 | VBIT, VBIF, VBSL……
288 | ```
289 |
290 |
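291 | >**Recap**: Reducing branches is not useful only for NEON; it is a small trick that helps all code (all instruction sets). Even in C it is an effective rule of thumb: **use fewer branches**.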
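
The commented expression `(a&Eflag) | (b&~Eflag)` can be tried out in scalar C. A small sketch, assuming 32-bit unsigned elements; `select_u32` and `fill4` are hypothetical names, and `Eflag` is taken to be an all-ones or all-zeros mask derived from `flag`.

```c
#include <stdint.h>

/* Hypothetical helper mirroring (a & Eflag) | (b & ~Eflag): the mask is
 * all ones when flag is nonzero and all zeros otherwise, so the choice
 * between a and b needs no branch. */
uint32_t select_u32(int flag, uint32_t a, uint32_t b)
{
    uint32_t eflag = (uint32_t)0 - (uint32_t)(flag != 0);
    return (a & eflag) | (b & ~eflag);
}

/* Fill dst[x*4 .. x*4+3] with a or b without an if/else. */
void fill4(uint32_t *dst, int x, int flag, uint32_t a, uint32_t b)
{
    uint32_t v = select_u32(flag, a, b);
    dst[x * 4]     = v;
    dst[x * 4 + 1] = v;
    dst[x * 4 + 2] = v;
    dst[x * 4 + 3] = v;
}
```

VBSL does the same selection on every bit of a vector register in one instruction, with the mask typically produced by a compare instruction such as VCEQ or VCGT.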
292 |
293 |
294 |
295 | #### Skill3:Preload data-PLD
296 |
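297 | ARM processors are load/store systems.
298 |
299 | Apart from the load/store instructions, all operations work on registers, so improving load/store efficiency matters a great deal in practice when optimizing an application.
300 |
301 | The preload instruction lets the processor tell the memory system: hey, I will probably come to this address for data soon, so please get it ready, OK?
302 |
303 | If the data has been preloaded into the cache correctly, the cache hit rate rises significantly, and so does performance.
304 |
305 | But preloading is not a cure-all. On newer processors it is very hard to use well, and a badly placed preload can reduce performance.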
306 |
307 |
308 | ```
309 | PLD syntax:
310 |
311 | PLD{cond} [Rn {, #offset}]
312 |
313 | PLD{cond} [Rn, +/-Rm {, shift}]
314 |
315 | PLD{cond} label
316 |
317 | Where:
318 |
319 | Cond - is an optional condition code.
320 |
321 | Rn - is the register on which the memory address is based.
322 |
323 | Offset - is an immediate offset. If offset is omitted, the address is the value in Rn.
324 |
325 | Rm - contains an offset value and must not be PC (or SP, in Thumb state).
326 |
327 | Shift - is an optional shift.
328 |
329 | Label - is a PC-relative expression.
330 | ```
331 |
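332 | **Some characteristics of PLD**:
333 | + it executes independently of load and store instructions;
334 | + the preload runs **in the background** while the processor keeps executing other instructions;
335 | + the offset has to be chosen for the actual case (The offset is specified to real cases).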
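
From C, a preload hint is usually written with the GCC/Clang builtin `__builtin_prefetch` rather than a hand-written PLD. The sketch below is an illustration only: the function name, the 64-element look-ahead and the once-per-16-floats spacing are assumed values, not recommendations from the article, and they would have to be tuned by measurement on the target core.

```c
#include <stddef.h>

/* Assumed tuning values (hypothetical): look-ahead distance in elements,
 * and how often to issue a hint (16 floats = 64 bytes, a common cache
 * line size). */
#define PREFETCH_AHEAD 64
#define PREFETCH_STEP  16

float sum_with_prefetch(const float *src, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
#if defined(__GNUC__)
        /* A hint only: it never changes the result. Arguments are the
         * address, 0 for "read", and 3 for high temporal locality. The
         * bounds check keeps the address expression inside the array. */
        if (i % PREFETCH_STEP == 0 && i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&src[i + PREFETCH_AHEAD], 0, 3);
#endif
        sum += src[i];
    }
    return sum;
}
```

Since the hint does not affect correctness, the honest way to judge it is to time the loop with and without it, in line with the warning above that a badly placed preload can cost performance.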
336 |
337 | #### Skill4:Misc
338 |
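339 | In ARM NEON programming, different instruction sequences can implement the same operation, but fewer instructions do not always mean better performance (in other words, **instruction count and performance are not strictly proportional**).
340 |
341 | So what does the relationship between instruction count and performance depend on?
342 |
343 | On benchmark and profiling results for the specific case. Below are some practical analyses of specific cases.
344 |
345 | ##### Floating-point VMLA/VMLS instruction
346 |
347 | >Note: the figures here hold only for the Cortex-A9; for other platforms the results need to be re-evaluated!
348 |
349 | Usually, VMUL+VADD/VMUL+VSUB can be replaced by VMLA/VMLS, since that uses fewer instructions and is more compact.
350 |
351 | However, compared with floating-point VMUL, floating-point VMLA/VMLS has a longer instruction delay. If no other instructions can be inserted into that delay slot, floating-point VMUL+VADD/VMUL+VSUB will perform better.
352 |
353 | >Question: why would it be better? I think it again comes down to the pipeline.
354 |
355 | A concrete example: the floating-point FIR function in the Ne10 library. The code fragments are as follows: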
356 |
357 |
358 |
359 | **Implementation 1: VMLA**
360 |
361 | Here there is only one `VEXT` instruction between the `VMLA`s, while VMLA has a latency of 9 cycles (`according to the table of NEON floating-point instructions timing`).
362 | ```
363 | Implementation 1: VMLA
364 |
365 | VEXT qTemp1,qInp,qTemp,#1
366 | VMLA qAcc0,qInp,dCoeff_0[0]
367 | VEXT qTemp2,qInp,qTemp,#2
368 | VMLA qAcc0,qTemp1,dCoeff_0[1]
369 | VEXT qTemp3,qInp,qTemp,#3
370 | VMLA qAcc0,qTemp2,dCoeff_1[0]
371 | VMLA qAcc0,qTemp3,dCoeff_1[1]
372 | ```
373 |
374 | **Implementation 2: VMUL+VADD**
375 |
376 | There is still a data dependency on `qAcc0` here (i.e. a pipeline hazard; see the analysis above), but `VADD/VMUL` need only 5 cycles each.
377 |
378 | ```
379 | Implementation 2: VMUL+VADD
380 |
381 | VEXT qTemp1,qInp,qTemp,#1
382 | VMLA qAcc0,qInp,dCoeff_0[0]
383 | VMUL qAcc1,qTemp1,dCoeff_0[1]
384 | VEXT qTemp2,qInp,qTemp,#2
385 | VMUL qAcc2,qTemp2,dCoeff_1[0]
386 | VADD qAcc0, qAcc0, qAcc1
387 | VEXT qTemp3,qInp,qTemp,#3
388 | VMUL qAcc3,qTemp3,dCoeff_1[1]
389 | VADD qAcc0, qAcc0, qAcc2
390 | VADD qAcc0, qAcc0, qAcc3
391 | ```
392 |
393 | >**Recap**: in measurements, the second version performs better.
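
The difference between the two implementations is the shape of the dependency chain, which a scalar C sketch can make visible. The names `fir4_chained` and `fir4_split` are hypothetical, one float stands in for a whole Q register, and a C compiler may well reorder or fuse these operations, so this shows the structure only, not the timing.

```c
/* Chained form: every step reads the acc written by the step before it,
 * like the back-to-back VMLA sequence in implementation 1. */
float fir4_chained(float acc, const float x[4], const float c[4])
{
    acc += x[0] * c[0];
    acc += x[1] * c[1];
    acc += x[2] * c[2];
    acc += x[3] * c[3];
    return acc;
}

/* Split form: one multiply-accumulate, then three products that do not
 * depend on each other (like VMUL into qAcc1..qAcc3), and only the final
 * adds touch acc (like the VADD instructions in implementation 2). */
float fir4_split(float acc, const float x[4], const float c[4])
{
    acc += x[0] * c[0];
    float p1 = x[1] * c[1];
    float p2 = x[2] * c[2];
    float p3 = x[3] * c[3];
    acc += p1;
    acc += p2;
    acc += p3;
    return acc;
}
```

Both functions compute the same four-tap sum; only the order in which results become available differs.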
394 |
395 |
396 | That is only part of the code; for the full code see [the Ne10 source on GitHub](https://github.com/projectNe10/Ne10/commit/97c162781c83584851ea3758203f9d2aa46772d5?diff=split)
397 |
398 |
399 | >The implementation is located at: modules/dsp/NE10_fir.neon.s:line 195
400 |
401 |
402 | Below is the instruction timing table mentioned above:
403 |
404 | 
405 |
406 |
407 | >**Summary**: the optimization techniques are
408 | > + make the most of the latency gaps between instructions;
409 | > + avoid branches;
410 | > + pay attention to cache hits;
411 |
412 |
413 |
414 | ## NEON assembly and intrinsics
415 |
416 | NEON assembly compared with intrinsics:
417 |
418 | 
419 |
420 |
421 | As you can see, the comparison covers three aspects:
422 |
423 | ---
424 |
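425 | **Performance:**
426 | + **assembly**: for an experienced developer, assembly written for a specific platform always gives the best performance;
427 | + **intrinsics**: performance depends heavily on the toolchain used;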
428 |
429 | ---
430 |
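431 | **Portability:**
432 | + **assembly**: different instruction set architectures (**ISA**: Instruction Set Architecture) need different assembly implementations; even within one ISA, the assembly may need retuning for **different microarchitectures** to reach the desired performance;
433 | + **intrinsics**: write once and run on different ISAs; the compiler may even tune for different microarchitectures;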
434 |
435 | ---
436 |
437 | **Maintainability:**
438 | + **assembly**: hard to read and write compared with C;
439 | + **intrinsics**: similar to C, easy to read and write;
440 | ---
441 |
442 |
443 |
444 | >Recap: reality is far more complicated than this, especially when ARMv7-A/v8-A cross-platform issues come up, so next we analyze some examples of that.
445 |
446 |
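447 | ## Writing code
448 |
449 | For NEON beginners, intrinsics are easier than assembly. Experienced developers (like me... just kidding) may be more at home with NEON assembly, since getting used to coding with intrinsics takes time!
450 |
451 | Some problems that can come up in real development are described below: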
452 |
453 | 
454 |
455 | ---
456 |
457 | **Flexibility of instruction**
458 |
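459 | Assembly can be more flexible, **mainly in data load/store**.
460 | >This shortcoming may of course be addressed by future compiler upgrades.
461 |
462 | Sometimes the compiler is able to optimize two intrinsics into a single assembly instruction, for example:
463 |
464 | [Image missing: upload failed (image-bea8b2-1516264588074)]
465 |
466 | So, as the ARMv8 toolchain is upgraded, intrinsics can be expected to gain the same flexibility as assembly;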
467 |
468 | ---
469 |
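470 | **Register allocation**:
471 |
472 | When programming in NEON assembly, registers must be allocated by the user, so you must know exactly which registers are in use;
473 |
474 | One benefit of intrinsics is that you just define variables and need not track registers yourself, because the compiler allocates them automatically. That is an advantage, but sometimes also a drawback;
475 |
476 | **In practice: using too many NEON registers at once in intrinsics code makes gcc's register allocation go wrong. When that happens, a lot of data is spilled to the stack (why? There are only so many NEON registers; ask for more than that and the excess has to live out in memory, which is far away and slow to reach), and this severely degrades performance.**
477 |
478 | So keep this in mind when programming with intrinsics. When performance looks abnormal, e.g. the C version is actually faster than the NEON version, the first thing to do is disassemble and check for a register allocation problem!
479 |
480 | ARMv8-A AArch64 has more NEON registers (32 128-bit NEON registers), so register allocation is less of a problem there!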
481 |
482 | ---
483 |
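484 | **Performance and compiler**
485 |
486 |
487 | On a given platform, the performance of NEON assembly depends only on the code itself and has nothing whatsoever to do with the compiler. The upside is that you can predict and estimate how your hand-tuned code will perform, as you would expect.
488 |
489 | By contrast, the performance of intrinsics depends heavily on the compiler. Different compilers give different performance; older compilers usually give the worst, and with an old compiler you also need to watch the compatibility of your intrinsics!
490 |
491 | When fine-tuning intrinsics you cannot predict or control the performance, but there are occasional pleasant surprises: sometimes intrinsics even outperform assembly. It does happen, but it is rare.
492 |
493 | The compiler affects the NEON optimization process. The figure below shows the usual flow of NEON implementation and optimization:
494 | [Image missing: upload failed (image-d68e68-1516264588074)]
495 |
496 | NEON assembly and intrinsics share the same implementation flow (code, debug, measure performance), but their optimization steps differ: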
497 |
498 | ---
499 |
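500 | **How to fine-tune assembly**:
501 | + change the implementation, e.g. use different instructions or adjust the degree of parallelism;
502 | + reorder instructions to reduce data dependencies (analyzed above: the concern is stalling the pipeline);
503 | + try the skills I mentioned earlier.
504 |
505 |
506 |
507 | When fine-tuning assembly, a more sophisticated, experience-driven approach is to:
508 | + know exactly how many instructions are used;
509 | + use the PMU (`Performance Monitoring Unit`) to get the execution cycles;
510 | + reorder instructions based on the timing of the instructions used, and minimize instruction latency as far as you can (fill the latency gaps with other instructions);
511 |
512 | The drawback is that the optimization targets one particular `micro-architecture`, so portability is poor! It is also very time-consuming for a relatively small gain, i.e. not very cost-effective.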
513 |
514 | ---
515 |
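516 | **Fine-tuning NEON intrinsics is harder!**:
517 |
518 | + Run through the same methods used for assembly optimization and see what happens;
519 | + disassemble and check the data dependencies and register usage;
520 | + check whether performance meets expectations; if not, switch compilers and go again until it does.
521 |
522 | When porting ARMv7-A assembly to be compatible with ARMv7-A/v8-A, the performance of the assembly code can serve as a reference, so it is easy to tell when optimization is done.
523 |
524 | When optimizing ARMv8-A code with intrinsics, however, there is no performance reference point, so it is hard to tell whether the current performance is optimal;
525 |
526 | Based on ARMv7-A experience, you may wonder: does assembly always perform better? I think that as the ARMv8-A environment matures, intrinsics will perform better.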
527 |
528 |
529 |
530 |
531 | #### Cross-platform and portability
532 |
533 |
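534 | Much existing NEON assembly runs only in AArch32 mode on ARMv7-A/ARMv8-A. If you want it to run in ARMv8-A AArch64 mode, you must **rewrite the code**, which takes a lot of work!
535 |
536 | [Image missing: upload failed (image-5500d5-1516264588074)]
537 |
538 | In that situation, code implemented with intrinsics can run directly in ARMv8-A AArch64 mode.
539 |
540 | **Being cross-platform is a clear advantage.** You also maintain only one copy of the code for the different platforms, which greatly reduces maintenance work.
541 |
542 | However, because `ARMv7-A/ARMv8-A` platforms have different hardware resources (the number of Q registers differs), sometimes two sets of code are still needed even with intrinsics.
543 |
544 | Take the FFT implementation in the Ne10 project as an example: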
545 |
546 | ```
547 | // radix 4 butterfly with twiddles
548 |
549 | scratch[0].r = scratch_in[0].r;
550 |
551 | scratch[0].i = scratch_in[0].i;
552 |
553 | scratch[1].r = scratch_in[1].r * scratch_tw[0].r - scratch_in[1].i * scratch_tw[0].i;
554 |
555 | scratch[1].i = scratch_in[1].i * scratch_tw[0].r + scratch_in[1].r * scratch_tw[0].i;
556 |
557 | scratch[2].r = scratch_in[2].r * scratch_tw[1].r - scratch_in[2].i * scratch_tw[1].i;
558 |
559 | scratch[2].i = scratch_in[2].i * scratch_tw[1].r + scratch_in[2].r * scratch_tw[1].i;
560 |
561 | scratch[3].r = scratch_in[3].r * scratch_tw[2].r - scratch_in[3].i * scratch_tw[2].i;
562 |
563 | scratch[3].i = scratch_in[3].i * scratch_tw[2].r + scratch_in[3].r * scratch_tw[2].i;
564 | ```
565 |
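566 | The snippet above shows the basic unit of the FFT, the `radix4 butterfly`. From the code we can infer:
567 | + if 2 `radix4 butterflies` are executed in one loop iteration, 20 `64-bit NEON` registers are needed;
568 | + if 4 `radix4 butterflies` are executed in one loop iteration, 20 `128-bit NEON` registers are needed.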
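
The butterfly snippet can be made self-contained as below. The `cpx_t` type and both function names are hypothetical (Ne10 uses its own types), and only the twiddle-multiply stage shown in the snippet is covered. One plausible way to arrive at the figure of 20, assuming each scalar field gets its own vector register with one butterfly per lane: 8 input values, 6 twiddle values and 6 products. A 64-bit register then holds 2 butterflies and a 128-bit register holds 4.

```c
/* Hypothetical complex type for illustration only. */
typedef struct { float r, i; } cpx_t;

/* Complex multiply, written exactly as in the snippet:
 * (a.r + j*a.i) * (w.r + j*w.i). */
static cpx_t cpx_mul(cpx_t a, cpx_t w)
{
    cpx_t out;
    out.r = a.r * w.r - a.i * w.i;
    out.i = a.i * w.r + a.r * w.i;
    return out;
}

/* Twiddle stage of one radix-4 butterfly: input 0 passes through,
 * inputs 1..3 are multiplied by twiddles 0..2. */
void radix4_apply_twiddles(cpx_t scratch[4], const cpx_t in[4],
                           const cpx_t tw[3])
{
    scratch[0] = in[0];
    for (int k = 1; k < 4; ++k)
        scratch[k] = cpx_mul(in[k], tw[k - 1]);
}
```

In a vectorized version, each `float` here becomes a vector whose lanes belong to different butterflies, which is why the register count stays at 20 while the register width changes.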
569 |
570 |
571 |
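572 | Moreover, the resources of `ARMv7-A/v8-A AArch32 and v8-A AArch64` are:
573 | + `ARMv7-A/v8-A AArch32: 32 64-bit or 16 128-bit NEON registers.`
574 | + `ARMv8-A AArch64: 32 128-bit NEON registers.`
575 |
576 |
577 | With this in mind, the Ne10 FFT implementation ended up keeping an assembly version as well:
578 | + the assembly version is for `ARMv7-A/v8-A AArch32` and executes `2 radix4 butterflies` per loop iteration;
579 | + the intrinsics version is for `ARMv8-A AArch64` and executes `4 radix4 butterflies` per loop iteration;
580 |
581 | This example shows that when maintaining one cross-platform (`ARMv7-A/v8-A`) code base, you need to watch out for some **exceptions**.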
582 |
583 |
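584 | # Summary
585 |
586 | Intrinsics are being optimized better and better, and on ARMv8 they can even outperform assembly; they are also more portable than assembly, so intrinsics are the best choice.
587 |
588 | After all, NEON will keep being updated. When it is, all your low-level assembly has to be updated too, whereas with intrinsics you need not worry, because the compiler does it for us!
589 |
590 | Finally, a few hard-won lessons about intrinsics for those who come after:
591 | + think carefully about how many registers you use;
592 | + pay attention to the compiler!
593 | + always look at the generated assembly!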
594 |
--------------------------------------------------------------------------------
/0x04 MemTest86 history.md:
--------------------------------------------------------------------------------
1 |
2 | ### MemTest86 History - from 1994
3 | MemTest86 was **originally** developed by Chris Brady (BradyTech Inc) with a first release in 1994.
4 |
5 | However, some of the **testing algorithms** used have been under development since 1981 and have been previously implemented on Dec PDP-11, VAX-11/780 and Cray XMP architectures.
6 |
7 | Since then there has been more than a dozen new versions being released. Support for 64bit, new CPU types, **symmetrical multiprocessors** and many other features have been added during this period.
8 |
9 | MemTest86 was released as **free** open source (GPL) software.
10 |
11 | ### MemTest86 and MemTest86+
12 | Different people have developed a whole family of versions, competing with and improving on one another in an upward spiral~
13 |
14 |
15 | ### MemTest86 the new era
16 | The new era (64-bit, DDR4) brings new challenges.
17 | In Feb 2013, PassMark Software took over the maintenance of the MemTest86 project from Chris.
18 |
19 | This was around the time that a lot of technological changes were occurring. The 64bit era was here, DDR4 was coming, UEFI had already arrived and Microsoft's Secure boot technology threatened to prevent MemTest86 from booting on future PC hardware.
20 |
21 | Starting from MemTest86 v5, the code was re-written to support self booting from the newer UEFI platform. UEFI is able to provide additional services that is unavailable in BIOS, such as graphical, mouse and file system support. Support for DDR4 & 64bit were also added and Microsoft agreed to code sign MemTest86 for secure boot.
22 |
23 | The software (Free Edition) still remains free to use without restrictions. The MemTest86 v4 project is still maintained and remains open source, for use on old machines with BIOS. From V5 however the software is being released under a proprietary license. For advanced/enthusiast users or commercial applications, a professional version is available for users that require additional customizability and advanced features that may be more suitable for their testing needs. A comparison of the different versions can be found here. We have also created a support forum where users can discuss issues.
24 |
25 |
26 | [Release history](https://www.memtest86.com/support/ver_history.htm)
27 | ## MemTest86 definition
28 | MemTest86 is the original, free, stand alone memory testing software for x86 computers. MemTest86 boots from a USB flash drive or CD and tests the RAM in your computer for **faults** using a series of **comprehensive algorithms and test patterns**.
29 | >Key points: it is free, and it uses a set of test **patterns** and **comprehensive algorithms** to find memory **faults**.
30 |
31 | Features (the points of interest here):
32 | + 13 different RAM testing algorithms
33 | + DDR4 RAM (and DDR2 & DDR3) support
34 | + The test algorithms have been developed over 20 years; the first release dates back to 1994
35 |
36 | ### MemTest86 test items:
37 |
38 | By running MemTest, you can ensure that your computers RAM is **correctly functioning**.
39 | >It only tests functionality.
40 |
41 | Unlike other memory checking software, MemTest is designed to **find all types of memory errors** including intermittent problems.
42 | >It aims to find all memory faults.
43 |
44 | Therefore, it needs to be run for **several hours** to truly evaluate your RAM. MemTest works with any type of memory.
45 | >The price is a run time of several hours.
46 |
47 | **Test 0 [Address test, walking ones, no cache]**
48 | Tests all address bits in all memory banks by using a walking ones address pattern.
49 | >A walking ones address pattern is used to test the address bits.
50 |
51 | **Test 1 [Address test, own address]**
52 | Each address is written with its own address and then is checked for consistency. In theory previous tests should have caught any memory addressing problems. This test should catch any addressing errors that somehow were not previously detected.
53 | >Supplements the first address test by catching addressing faults it missed.
54 |
55 | **Test 2 [Moving inversions, ones&zeros]**
56 | This test uses the moving inversions algorithm with patterns of all ones and zeros. Cache is enabled even though it interferes to some degree with the test algorithm. With cache enabled this test does not take long and should quickly find all "hard" errors and some more subtle errors. This test is only a quick check.
57 | >A quick check for faults using the moving inversions algorithm with patterns of all ones and zeros.
58 |
59 | **Test 3 [Moving inversions, 8 bit pat]**
60 | This is the same as test one but uses a 8 bit wide pattern of "walking" ones and zeros. This test will better detect subtle errors in "wide" memory chips. A total of 20 **data patterns** are used.
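61 | >Same moving inversions algorithm as the ones-and-zeros test above (Test 2 in this list), but with an 8-bit-wide pattern, so it can catch more subtle memory errors.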
62 |
63 | **Test 4 [Moving inversions, random pattern]**
64 | Test 4 uses the same algorithm as test 1 but the data pattern is a random number and it's complement. This test is particularly effective in finding difficult to detect **data sensitive errors**. A total of 60 patterns are used. The random number sequence is different with each pass so multiple passes increase effectiveness.
65 | >Finds data-sensitive errors (based on random numbers).
66 |
67 | **Test 5 [Block move, 64 moves]**
68 | This test stresses memory by using block move (movsl) instructions and is based on Robert Redelmeier's burnBX test. Memory is initialized with shifting patterns that are inverted every 8 bytes. Then 4mb blocks of memory are moved around using the movsl instruction. After the moves are completed the data patterns are checked. Because the data is checked only after the memory moves are completed it is not possible to know where the error occurred. The addresses reported are only for where the bad pattern was found. Since the moves are constrained to a 8mb segment of memory the failing address will always be less than 8mb away from the reported address. Errors from this test are not used to calculate BadRAM patterns.
69 | >A stress test based on moving blocks of memory.
70 |
71 | **Test 6 [Moving inversions, 32 bit pat]**
72 | This is a variation of the moving inversions algorithm that shifts the data pattern left one bit for each successive address. The starting bit position is shifted left for each pass. To use all possible data patterns 32 passes are required. This test is quite effective at detecting data sensitive errors but the execution time is long.
73 |
74 | **Test 7 [Random number sequence]**
75 | This test writes a series of random numbers into memory. By resetting the seed for the random number the same sequence of number can be created for a reference. The initial pattern is checked and then complemented and checked again on the next pass. However, unlike the moving inversions test writing and checking can only be done in the forward direction.
76 | >Writes a sequence of random values into memory; this can only be done in the forward direction.
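
The reseeding trick in Test 7 can be sketched in C as below. This is a minimal illustration, not MemTest86's code: the generator `lcg_next` (a plain 32-bit LCG) and the function name `random_sequence_test` are assumptions made for the example. Resetting the seed regenerates the same sequence, so no reference copy has to be kept in memory.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical generator for this sketch: a simple 32-bit LCG. */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Returns the index of the first mismatching word, or -1 if every word passes. */
long random_sequence_test(volatile uint32_t *buf, size_t count, uint32_t seed)
{
    uint32_t s;
    size_t i;

    /* Pass 1: write the random sequence in the forward direction. */
    s = seed;
    for (i = 0; i < count; i++)
        buf[i] = lcg_next(&s);

    /* Pass 2: reset the seed to regenerate the reference, check each
     * word, then overwrite it with the complement. */
    s = seed;
    for (i = 0; i < count; i++) {
        uint32_t expect = lcg_next(&s);
        if (buf[i] != expect)
            return (long)i;
        buf[i] = ~expect;
    }

    /* Pass 3: reset the seed again and check the complemented data. */
    s = seed;
    for (i = 0; i < count; i++) {
        uint32_t expect = ~lcg_next(&s);
        if (buf[i] != expect)
            return (long)i;
    }
    return -1;
}
```

Because the sequence can only be regenerated from its start, every pass has to run forward, which is the restriction the description mentions.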
77 |
78 | **Test 8 [Modulo 20, ones&zeros]**
79 | Using the Modulo-X algorithm should uncover errors that are not detected by moving inversions due to cache and buffering interference with the algorithm. As with test one only ones and zeros are used for data patterns.
80 | >Because of the cache, moving inversions may miss some faults, so the Modulo-X algorithm supplements it.
81 |
82 | **Test 9 [Bit fade test, 90 min, 2 patterns]**
83 | The bit fade test initializes all of memory with a pattern and then sleeps for 90 minutes. Then memory is examined to see if any memory bits have changed. All ones and all zero patterns are used. This test takes 3 hours to complete. The Bit Fade test is not included in the normal test sequence and must be run manually via the runtime configuration menu.
84 | >Sleep for 90 minutes, then check whether any data has been lost.
85 |
86 | ---
87 |
88 | ## Error display
89 | Error Display
90 | Memtest has two options for reporting errors. The default is to report individual errors. Memtest is also able to create patterns used by the Linux BadRAM feature. This slick feature allows Linux to avoid bad memory pages. Details about the BadRAM feature can be found at: http://home.zonnet.nl/vanrein/badram
91 |
92 | For individual errors the following information is displayed when a memory error is detected. An error message is only displayed for errors with a different address or failing bit pattern. All displayed values are in hexadecimal.
93 |
94 | Tst: Test Number
95 |
96 | Failing Address: Failing memory address
97 |
98 | Good: Expected data pattern
99 |
100 | Bad: Failing data pattern
101 |
102 | Err-Bits: Exclusive or of good and bad data (this shows the position of the failing bit(s))
103 |
104 | Count: Number of consecutive errors with the same address and failing bits
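
The `Err-Bits` field is just the XOR of the expected and the observed word, so every set bit marks a failing bit position. A small C sketch of that calculation follows; `err_bits` and `report_error` are names made up for this note, and the output layout is only an approximation of the fields listed above.

```c
#include <stdint.h>
#include <stdio.h>

/* XOR of expected and observed data: each set bit is a failing bit position. */
uint32_t err_bits(uint32_t good, uint32_t bad)
{
    return good ^ bad;
}

/* Print one error line using the field names from the list above (all hex). */
void report_error(int tst, uintptr_t addr, uint32_t good, uint32_t bad)
{
    printf("Tst: %d  Failing Address: %08lx  Good: %08lx  Bad: %08lx  Err-Bits: %08lx\n",
           tst, (unsigned long)addr, (unsigned long)good,
           (unsigned long)bad, (unsigned long)err_bits(good, bad));
}
```

For example, an expected `0xFFFFFFFF` read back as `0xFFFFFFEF` gives `Err-Bits` of `0x00000010`, i.e. only bit 4 failed.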
105 |
106 | ## Troubleshooting
107 |
108 | Troubleshooting Memory Errors
109 | Please be aware that not all errors reported by Memtest86 are due to bad memory.
110 | The test implicitly tests the **CPU, L1 and L2 caches as well as the motherboard.**
111 |
112 | >In other words, when this algorithm reports an error, the memory is most likely at fault; less often the cause is the board, the CPU, or the cache.
113 |
114 | It is impossible for the test to determine what causes the failure to occur. However, most failures will be due to a problem with memory. When it is not, the only option is to replace parts until the failure is corrected.
115 |
116 | When an error occurs, locate the problem with the following steps:
117 | There are steps that may be taken to determine the failing module. Here are four techniques that you may wish to use:
118 |
119 | 1) Removing modules
120 |
121 | 2) Rotating modules
122 |
123 | 3) Replacing modules
124 |
125 | 4) Avoiding allocation
126 |
127 | Compatibility problems can also cause errors.
128 |
129 | Memtest86 can not diagnose many types of PC failures. For example a faulty CPU that causes Windows to crash will most likely just cause Memtest86 to crash in the same way.
130 | >It is a tool for users to check whether the memory is usable and whether the system misbehaves.
131 |
132 | ## Execution time
133 |
134 | The time required for a complete pass of Memtest86 will vary greatly depending on CPU speed, memory speed and memory size. Here are the execution times from a Pentium-II-366 with **64mb** of RAM:
135 | Test 0 0:05
136 | Test 1 0:18
137 | Test 2 1:02
138 | Test 3 1:38
139 | Test 4 8:05
140 | Test 5 1:40
141 | Test 6 4:24
142 | Test 7 6:04
143 | **Total (default tests) 23:16**
144 |
145 | Test 8 12:30
146 | Test 9 49:30
147 | Test 10 30:34
148 | Test 11 3:29:40
149 | **Total (all tests) 5:25:30**
150 |
151 | Memtest86 continues executing indefinitely. The pass counter increments each time that all of the selected tests have been run. Generally a single pass is sufficient to catch all but the most obscure errors. However, for complete confidence when intermittent errors are suspected testing for a longer period is advised.
152 | >To catch intermittent problems, running several passes is well worth it.
153 |
154 | ## Memory testing philosophy
155 | Memory Testing Philosophy
156 | There are many good approaches for testing memory. However, many tests simply throw some patterns at memory without much thought or knowledge of the memory architecture or how errors can best be detected. This works fine for hard memory failures but does little to find intermittent errors.
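157 | >Most test algorithms are not tuned to a specific memory architecture, so what they find are hard failures; intermittent problems are generally difficult to catch.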
158 | The BIOS based memory tests are useless for finding intermittent memory errors.
159 | >BIOS-based tests are powerless against intermittent errors.
160 |
161 | Memory chips consist of a large array of tightly packed memory cells, one for each bit of data. The vast majority of the intermittent failures are a result of interaction between these memory cells. Often writing a memory cell can cause one of the adjacent cells to be written with the same data. An effective memory test should attempt to test for this condition. Therefore, an ideal strategy for testing memory would be the following:
162 |
163 | 1) write a cell with a zero
164 | 2) write all of the adjacent cells with a one, one or more times
165 | 3) check that the first cell still has a zero
166 | >This is the neighborhood test scenario: a 1 surrounded by 0s, or a 0 surrounded by 1s.
167 |
168 | It should be obvious that this strategy requires an exact knowledge of how the memory cells are laid out on the chip.
169 | >**Clearly, we need to know how the memory cells are actually arranged in physical (three-dimensional) space.**
170 |
171 | In addition there is a never ending number of possible chip layouts for different chip types and manufacturers making this strategy impractical. However, there are testing algorithms that can approximate this ideal.
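172 | >Different layouts call for different test algorithms, so adapting to each one is impractical. Fortunately, there are algorithms that approximate the ideal.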
173 |
174 | ## Test algorithms for adjacent-cell coupling faults
175 |
176 | Memtest86 uses two algorithms that provide a reasonable approximation of the ideal test strategy **above.**
177 |
178 | ### The moving inversions algorithm
179 |
180 | Addresses the problem of adjacent cells influencing each other.
181 |
182 | The first of these strategies is called moving inversions. The moving inversion test works as follows:
183 |
184 | 1) Fill memory with a pattern
185 | 2) Starting at the lowest address
186 | 2a check that the pattern has not changed
187 | 2b write the patterns complement
188 | 2c increment the address
189 | repeat 2a - 2c
190 | 3) Starting at the highest address
191 | 3a check that the pattern has not changed
192 | 3b write the patterns complement
193 | 3c decrement the address
194 | repeat 3a - 3c
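
The three steps above can be sketched in C as follows. This is a minimal word-wide illustration, not the MemTest86 source; the function name `moving_inversions` and the 32-bit word size are assumptions for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 0 if every check passes, -1 on the first mismatch. */
int moving_inversions(volatile uint32_t *mem, size_t count, uint32_t pattern)
{
    size_t i;

    /* Step 1: fill memory with the pattern. */
    for (i = 0; i < count; i++)
        mem[i] = pattern;

    /* Step 2: from the lowest address upward, check the pattern and
     * write its complement. */
    for (i = 0; i < count; i++) {
        if (mem[i] != pattern)
            return -1;
        mem[i] = ~pattern;
    }

    /* Step 3: from the highest address downward, check the complement
     * and write its complement (the original pattern) back. */
    for (i = count; i-- > 0; ) {
        if (mem[i] != (uint32_t)~pattern)
            return -1;
        mem[i] = pattern;
    }
    return 0;
}
```

A caller would run this once per data pattern, for example with `0x00000000`, `0xFFFFFFFF`, or a walking 8-bit pattern replicated across the word.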
195 |
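196 | >Because we run on a platform rather than on a tester, we cannot write bit by bit in sequence (the platform supports byte operations at the finest), so as a compromise we write combinations of patterns instead, **which loses some of the dynamic behavior**.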
197 |
198 | This algorithm is a good approximation of an ideal memory test but there are some **limitations**. Most high density chips today store data 4 to 16 bits wide. With chips that are more than one bit wide it is impossible to selectively read or write just one bit. This means that we cannot guarantee that all adjacent cells have been tested for interaction. In this case the best we can do is to use some patterns to insure that all adjacent cells have at least been written with all possible one and zero combinations.
199 |
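200 | >Caching and buffering also break up what should be a continuous sequence of reads and writes, which is unacceptable, so the Modulo-X algorithm was introduced. The idea is routine: **fill the cache or buffer with data (these advanced features cannot be turned off) so that a flush is triggered automatically and no data lingers there**.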
201 |
202 | It can also be seen that caching, buffering and out of order execution will interfere with the moving inversions algorithm and make it less effective. It is possible to turn off cache but the memory buffering in new high performance chips can not be disabled. To address this limitation a new algorithm I call Modulo-X was created. This algorithm is not affected by cache or buffering. The algorithm works as follows:
203 |
204 | 1) For starting offsets of 0 - 20 do
205 | 1a write every 20th location with a pattern
206 | 1b write all other locations with the patterns complement
207 | repeat 1b one or more times
208 | 1c check every 20th location for the pattern
209 |
210 | This algorithm accomplishes nearly the same level of adjacency testing as moving inversions but is not affected by caching or buffering.
211 | >This achieves nearly the same adjacency testing as moving inversions, without being affected by caching or buffering.
212 |
213 | Since separate write passes (1a, 1b) and the read pass (1c) are done for all of memory we can be assured that all of the buffers and cache have been flushed between passes. The selection of 20 as the stride size was somewhat arbitrary.
214 |
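215 | >A larger stride may be more effective but lengthens the test, so 20 is a compromise between speed and thoroughness. (**We read and write the whole address space, so a larger value could be used here.**)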
216 |
217 | Larger strides may be more effective but would take longer to execute. The choice of 20 seemed to be a reasonable compromise between speed and thoroughness.
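
A rough C sketch of the Modulo-X steps (1a to 1c) is given below, under a few assumptions of this note: the name `modulo_x`, the 32-bit word size, the `repeats` parameter, and starting offsets taken as 0 to 19 (one per position within the stride). It is not the MemTest86 implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define MOD_STRIDE 20u  /* the "somewhat arbitrary" stride from the text */

/* Returns 0 if every check passes, -1 on the first mismatch. */
int modulo_x(volatile uint32_t *mem, size_t count, uint32_t pattern,
             unsigned repeats)
{
    size_t offset, i;
    unsigned r;

    for (offset = 0; offset < MOD_STRIDE; offset++) {
        /* 1a: write every 20th location with the pattern. */
        for (i = offset; i < count; i += MOD_STRIDE)
            mem[i] = pattern;

        /* 1b: write all other locations with the complement,
         * one or more times. */
        for (r = 0; r < repeats; r++)
            for (i = 0; i < count; i++)
                if (i % MOD_STRIDE != offset)
                    mem[i] = ~pattern;

        /* 1c: check every 20th location for the pattern. */
        for (i = offset; i < count; i += MOD_STRIDE)
            if (mem[i] != pattern)
                return -1;
    }
    return 0;
}
```

Each of 1a, 1b and 1c is a separate sweep over the whole region, which is what lets the caches and buffers drain between writing a location and checking it.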
218 |
219 |
220 | ## Individual test descriptions (static faults?)
221 |
222 | Memtest86 executes a series of numbered test sections to check for errors.
223 |
224 | These test sections consist of a combination of **test algorithm, data pattern and cache setting**.
225 |
226 | The **execution order** for these tests were arranged so that errors will be detected as rapidly as possible.
227 |
228 | Tests 8, 9, 10 and 11 are very long running extended tests and are only executed when **extended testing** is selected.
229 |
230 | The extended tests have a low probability of finding errors that were missed by the default tests.
231 |
232 | A description of each of the test sections follows:
233 |
234 | #### Test 0 [Address test, walking ones, no cache]
235 |
236 | Tests all address bits in all memory banks by using a walking ones address pattern.
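
For reference, a generic walking-ones address check can be sketched in C as below. This is a textbook-style address-bus test written for this note, not MemTest86's Test 0; the function name and the two data values are assumptions. The idea is to touch only the offsets that have a single address bit set and verify that writing one of them never disturbs the others.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 0 if no aliasing is seen, otherwise the power-of-two word
 * offset at which a problem was detected. */
size_t walking_ones_address_test(volatile uint32_t *base, size_t count)
{
    const uint32_t pattern = 0xAAAAAAAAu;
    const uint32_t antipattern = 0x55555555u;
    size_t offset, test;

    /* Write the default pattern at every offset with one address bit set. */
    for (offset = 1; offset < count; offset <<= 1)
        base[offset] = pattern;

    /* Address bits stuck high: a write to offset 0 must not show up at
     * any single-bit offset. */
    base[0] = antipattern;
    for (offset = 1; offset < count; offset <<= 1)
        if (base[offset] != pattern)
            return offset;
    base[0] = pattern;

    /* Address bits stuck low or shorted: a write to one single-bit
     * offset must not disturb offset 0 or any other single-bit offset. */
    for (test = 1; test < count; test <<= 1) {
        base[test] = antipattern;
        if (base[0] != pattern)
            return test;
        for (offset = 1; offset < count; offset <<= 1)
            if (offset != test && base[offset] != pattern)
                return test;
        base[test] = pattern;
    }
    return 0;
}
```

As the test name says, this kind of check is run with the cache disabled, otherwise the accesses may never reach the address lines being tested.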
237 |
238 | #### Test 1 [Moving Inv, ones&zeros, cached]
239 |
240 | This test uses the moving inversions algorithm with patterns of only ones and zeros.
241 |
242 | Cache is enabled even though it interferes to some degree with the test algorithm. With cache enabled this test does not take long and should quickly find all "hard" errors and some more subtle errors. This test is only a quick check.
243 |
244 | #### Test 2 [Address test, own address, no cache]
245 |
246 | Each address is written with its own address and then is checked for consistency.
247 |
248 | In theory previous tests should have caught any memory addressing problems. This test should catch any addressing errors that **somehow** were not previously detected.
249 |
250 | #### Test 3 [Moving inv, 8 bit pat, cached]
251 |
252 | This is the same as test one but uses a 8 bit wide pattern of "walking" ones and zeros. This test will **better detect subtle errors in "wide" memory chips.**
253 |
254 | **A total of 20 data patterns are used.**
255 |
256 | #### Test 4 [Moving inv, 32 bit pat, cached]
257 |
258 | This is a variation of the moving inversions algorithm that shifts the data pattern left one bit for each successive address.
259 |
260 | The starting bit position is shifted left for each pass. To use all possible data patterns **32 passes** are required.
261 |
262 | This test is effective in detecting **data sensitive** errors in "wide" memory chips.
263 |
264 | #### Test 5 [Block move, 64 moves, cached]
265 |
266 | This test stresses memory by using block move (movsl) instructions and is based on Robert Redelmeier's burnBX test.
267 |
268 | Memory is initialized with shifting patterns that are inverted every 8 bytes. Then 4mb blocks of memory are moved around using the movsl instruction.
269 |
270 | After the moves are completed the data patterns are checked. Because the data is checked only after the memory moves are completed it is not possible to know where the error occurred.
271 |
272 | The addresses reported are only for where the bad pattern was found. Since the moves are constrained to a 8mb segment of memory the failing address will always be less than 8mb away from the reported address. Errors from this test are not used to calculate BadRAM patterns.
273 |
274 | #### Test 6 [Modulo 20, ones&zeros, cached]
275 |
276 | Using the Modulo-X algorithm should uncover errors that are not detected by moving inversions due to cache and buffering interference with the algorithm. As with test one only ones and zeros are used for data patterns.
277 | #### Test 7 [Moving inv, ones&zeros, no cache]
278 |
279 | This is the same as test one but without cache. With cache off there will be much less interference with the test algorithm. However, the execution time is much, much longer. This test may find very subtle errors missed by previous tests.
280 |
281 | #### Test 8 [Block move, 512 moves, cached]
282 |
283 | This is the first extended test. This is the same as test #5 except that we do more memory moves before checking memory. Errors from this test are not used to calculate BadRAM patterns.
284 |
285 | #### Test 9 [Moving inv, 8 bit pat, no cache]
286 |
287 | By using an 8 bit pattern with cache off this test should be effective in detecting all types of errors. However, it takes a very long time to execute and there is a low probability that it will detect errors not found by the previous tests.
288 |
289 | #### Test 10 [Modulo 20, 8 bit, cached]
290 |
291 | This is the first test to use the Modulo-X algorithm with a **data pattern** other than ones and zeros. This combination of algorithm and data pattern should be quite effective. However, its very long execution time relegates it to the extended test section.
292 |
293 | #### Test 11 [Moving inv, 32 bit pat, no cache]
294 |
295 | This test should be the most effective in finding errors that are data pattern sensitive. However, without cache its execution time is excessively long.
296 |
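297 | Test|Description
298 | ---|---
299 | Test 0 [Address test, walking ones, no cache]| Tests all address bits with a walking ones address pattern
300 | Test 1 [Moving Inv, ones&zeros, cached] | Moving inversions algorithm with patterns of only ones and zeros
301 | Test 2 [Address test, own address, no cache] | Writes each address with its own address and checks consistency; supplements Test 0
302 | Test 3 [Moving inv, 8 bit pat, cached]| The focus is the 8-bit pattern: a finer-grained check
303 | Test 4 [Moving inv, 32 bit pat, cached]| Detects data-sensitive errors
304 | Test 5 [Block move, 64 moves, cached]| Checks whether memory fails under stress
305 | Test 6 [Modulo 20, ones&zeros, cached]| Removes the influence of the cache; otherwise still in the spirit of moving inversions
306 | Test 7 [Moving inv, ones&zeros, no cache]| Supplementary run with the cache off; the time cost rises markedly (addresses case coverage)
307 | Test 8 [Block move, 512 moves, cached]| Having applied stress, try a different stress level
308 | Test 9 [Moving inv, 8 bit pat, no cache]| In theory detects all faults, since the granularity is down to 8 bits, but the time cost is very high
309 | Test 10 [Modulo 20, 8 bit, cached]| Combines a data pattern (earlier tests used only ones and zeros) with the cache-immune Modulo-X algorithm; more effective, but the execution time is also long
310 | Test 11 [Moving inv, 32 bit pat, no cache]| Uses a 32-bit data pattern, so it is the most effective at finding data-sensitive errors, but without cache it is very slow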
311 |
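312 | >**Summary**: from the user's point of view, different test scenarios (test cases) are set up and targeted **functional** tests are run. Note that this is system-level testing: the concern is not just the memory chips, but whether the memory still **functions** correctly once board-level routing, IO performance, the PCB and other related factors are all taken into account.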
313 |
--------------------------------------------------------------------------------
/0x05 memtester4.3.0 啃源码.md:
--------------------------------------------------------------------------------
1 |
2 |
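3 | ## Preface
4 | To understand how memtester works, each of its functions is described below in the following format:
5 | + method
6 | + principle
7 | + time cost
8 |
9 |
10 | A brief description of each test item follows: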
11 |
12 | memtester-4.3.0 | memtester-ARM
13 | ---|---|
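14 | int test_stuck_address(bufa, count);|(√) Writes each location's own address, alternating with its complement, then reads back and compares; repeated 2 times (upstream repeats 16 times). Tests the address bus. |
15 | int test_random_value(bufa, bufb, count); |(√) Equivalent to test_random_comparison(bufa, bufb, count): a data-sensitive test case. |
16 | int test_xor_comparison(bufa, bufb, count);|(-) Adds an XOR operation compared with test_random_value; a user scenario, for case coverage. Data sensitivity / instruction function check; can also verify SAF (stuck-at faults). |
17 | int test_sub_comparison(bufa, bufb, count); |(-) Adds a subtraction compared with test_random_value; a user scenario, for case coverage. Data sensitivity / instruction function check; can also verify SAF. |
18 | int test_mul_comparison(bufa, bufb, count); | (-) Adds a multiplication compared with test_random_value; a user scenario, for case coverage. Data sensitivity / instruction function check; can also verify SAF. |
19 | int test_div_comparison(bufa, bufb, count); | (-) Adds a division compared with test_random_value; a user scenario, for case coverage. Data sensitivity / instruction function check; can also verify SAF. |
20 | int test_or_comparison(bufa, bufb, count); | (√) Merged into test_random_comparison(); a user scenario, for case coverage. Data sensitivity / instruction function check; can also verify SAF. |
21 | int test_and_comparison(bufa, bufb, count); | (√) Merged into test_random_comparison(); a user scenario, for case coverage. Data sensitivity / instruction function check; can also verify SAF. |
22 | int test_seqinc_comparison(bufa, bufb, count); | (√) A subset of test_blockseq_comparison; simulates a customer stress-test scenario. |
23 | int test_solidbits_comparison(bufa, bufb, count); |(√) Writes all ones to both buffers, reads back and compares, then does the same with all zeros. This is the Zero-One algorithm (Breuer & Friedman, 1976), also known as MSCAN: {w0,r0,w1,r1}, time complexity 4N. It verifies that every cell can be read and written, which indirectly tests for stuck-at faults (SAF). |
24 | int test_checkerboard_comparison(bufa, bufb, count); | (√) Writes several predefined data backgrounds in turn, then reads back and compares. (Note: the literature says a well-designed data background can detect state coupling faults; time complexity 4N.) The case is designed to verify whether adjacent locations affect each other. |
25 | int test_blockseq_comparison(bufa, bufb, count); | (√) Writes one block of size count at a time, each 32-bit word filled with a repeated byte value, then reads back and compares; repeated 256 times. Also a stress case, just with more iterations. |
26 | int test_walkbits0_comparison(bufa, bufb, count); | (√) A single 1 bit moves through the 32-bit word; after each move the whole buffer is filled with the pattern. The bit walks from low to high, then from high to low. (Purpose: first to detect NPSF, then CFs, then data-sensitive faults. Note that this is the 32-bit variant; an 8-bit variant has finer granularity.) |
27 | int test_walkbits1_comparison(bufa, bufb, count); |(√) Same as above. Note: in MemTest86 this algorithm is called the moving inversions algorithm. |
28 | int test_bitspread_comparison(bufa, bufb, count); |(√) Still moving through the 32-bit word, but this time it is not a single 0 or 1 that moves: it is two 1 bits with one empty position between them, moving together (a data-pattern variant for adjacent-coupling faults). |
29 | int test_bitflip_comparison(bufa, bufb, count); | (√) Again a single 1 bit moves through the 32-bit word to generate data patterns. For each pattern: {write the pattern and its complement alternately into buffers a and b, check once after writing, then repeat the following eight times {use eight DMA channels in parallel to move data from buffer a to buffer b, simulating repeated reads and writes of the same location within a short time to see whether data is lost}}. Core idea: repeatedly read and write the same location within a short time. |
30 | int test_8bit_wide_random(bufa, bufb, count); | (√) Stores values through a char pointer, i.e. 8 bits at a time; finer granularity. |
31 | int test_16bit_wide_random(bufa, bufb, count);|(√) Stores values through an unsigned short pointer, i.e. 16 bits at a time; checks a different granularity. |
32 | ×|int test_crosstalk_comparison(bufa, bufb, count): writes the model [32 zeros, followed by a single 0 moving through 32 bits] into memory in a superimposed fashion (ascending only; no reversal as in the moving inversions algorithm).|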
33 |
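
To make the data patterns in the table concrete, here is a small C sketch of how the per-round pattern words could be generated for a 32-bit word. The helper names are invented for this note, and the checkerboard constants are the usual `0x55555555`/`0xAAAAAAAA` pair assumed here rather than quoted from the memtester headers.

```c
#include <stdint.h>

/* Walking one: a single 1 bit at position j (0..31), all other bits 0. */
uint32_t walk_one(unsigned j)
{
    return (uint32_t)1u << j;
}

/* Walking zero: a single 0 bit at position j (0..31), all other bits 1. */
uint32_t walk_zero(unsigned j)
{
    return ~((uint32_t)1u << j);
}

/* Bit spread: two 1 bits at positions j and j+2, i.e. with one empty
 * position between them. Bits shifted beyond bit 31 are dropped. */
uint32_t bit_spread(unsigned j)
{
    uint64_t wide = ((uint64_t)1 << j) | ((uint64_t)1 << (j + 2));
    return (uint32_t)wide;
}

/* Bit position for round r in [0, 64): walk up for 32 rounds, then
 * walk back down. */
unsigned walk_position(unsigned round)
{
    return round < 32 ? round : 63 - round;
}

/* Checkerboard: alternate between the two backgrounds from round to round. */
uint32_t checkerboard(unsigned round)
{
    return (round % 2 == 0) ? 0x55555555u : 0xAAAAAAAAu;
}
```

In each round the test fills both buffers with the generated word (some tests alternate it with its complement from address to address) and then calls the comparison routine.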
34 |
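35 | ## Function details
36 | ### memtester-4.3.0 version
37 | #### Method test_stuck_address
38 | Function: `int test_stuck_address(ulv *bufa, size_t count)`
39 | The basic pattern is shown in the figure below. For j=0, the **address value** of P1 is written to **its own memory location**, then the complemented address of P2 is written to P2's location, and so on alternately;
40 |
41 | the next round, j=1, repeats the same steps with the roles reversed;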
42 | 
43 |
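44 | This continues until 16 rounds have completed. On a mismatch, the failing address is reported and the test returns immediately.
45 |
46 | #### Purpose (principle)
47 | Verifies that no address is inaccessible, i.e. it tests the address lines.
48 |
49 | #### Time cost
50 | **Conditions:**
51 | Full 1 GB space, DDR bandwidth 1600M\*32bit, CPU: ARM A53 (1460~1800)M\*32bit, running on a single core.
52 |
53 | **Time cost:**
54 | ___Sec.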
55 | ```cpp
56 | int test_stuck_address(ulv *bufa, size_t count) {
57 | ulv *p1 = bufa;
58 | unsigned int j;
59 | size_t i;
60 | off_t physaddr;
61 | printf(" ");
62 | fflush(stdout);
63 | for (j = 0; j < 16; j++) {
64 | printf("\b\b\b\b\b\b\b\b\b\b\b");
65 | p1 = (ulv *) bufa;
66 | printf("setting %3u", j);
67 | fflush(stdout);
68 | for (i = 0; i < count; i++) {
69 | *p1 = ((j + i) % 2) == 0 ? (ul) p1 : ~((ul) p1);
70 | *p1++;
71 | }
72 | printf("\b\b\b\b\b\b\b\b\b\b\b");
73 | printf("testing %3u", j);
74 | fflush(stdout);
75 | p1 = (ulv *) bufa;
76 | for (i = 0; i < count; i++, p1++) {
77 | if (*p1 != (((j + i) % 2) == 0 ? (ul) p1 : ~((ul) p1))) {
78 | if (use_phys) {
79 | physaddr = physaddrbase + (i * sizeof(ul));
80 | fprintf(stderr,
81 | "FAILURE: possible bad address line at physical "
82 | "address 0x%08lx.\n",
83 | physaddr);
84 | } else {
85 | fprintf(stderr,
86 | "FAILURE: possible bad address line at offset "
87 | "0x%08lx.\n",
88 | (ul) (i * sizeof(ul)));
89 | }
90 | printf("Skipping to next test...\n");
91 | fflush(stdout);
92 | return -1;
93 | }
94 | }
95 | }
96 | printf("\b\b\b\b\b\b\b\b\b\b\b \b\b\b\b\b\b\b\b\b\b\b");
97 | fflush(stdout);
98 | return 0;
99 | }
100 | ```
101 |
102 |
103 | ### ARM A53 port
104 |
105 | ```cpp
106 | int test_stuck_address(unsigned int *bufa, unsigned int count)
107 | {
108 | unsigned int *p1 = bufa;
109 | unsigned int j;
110 | int i;
111 | for(j = 0; j < 2; j++){
112 | p1 = (unsigned int *)bufa;
113 | for(i = 0; i < count; i++){
114 | *p1 = ((j + i) % 2) == 0 ? (unsigned int)p1 : (~(unsigned int)p1);
115 | p1++;
116 | }
117 | p1 = (unsigned int *)bufa;
118 | for(i = 0; i < count; i++, p1++){
119 | if (*p1 != (((j + i) % 2) == 0 ? (unsigned int) p1 : ~((unsigned int) p1))){
120 | #ifdef PRINTK
121 | printk("[DRAM]test_stuck_address: %x is %x error\n", p1,(unsigned int)*p1);
122 | #endif
123 | if(((j + i) % 2) == 0){
124 | return (((unsigned int) p1)^(*p1));
125 | }
126 | else{
127 | return ((~((unsigned int) p1))^(*p1));
128 | }
129 | }
130 | }
131 | }
132 | return 0;
133 | }
134 | ```
135 | ---
136 |
137 | #### Method test_random_value
138 | Function: `int test_random_value(ulv *bufa, ulv *bufb, size_t count)`
139 |
140 | Two buffers are allocated and the same random values are written to both; after count values have been written, `compare_regions(bufa, bufb, count)` compares them.
141 |
142 | ```cpp
143 | int compare_regions(ulv *bufa, ulv *bufb, size_t count) {
144 | int r = 0;
145 | size_t i;
146 | ulv *p1 = bufa;
147 | ulv *p2 = bufb;
148 | off_t physaddr;
149 | for (i = 0; i < count; i++, p1++, p2++) {
150 | if (*p1 != *p2) {
151 | if (use_phys) {
152 | physaddr = physaddrbase + (i * sizeof(ul));
153 | fprintf(stderr,
154 | "FAILURE: 0x%08lx != 0x%08lx at physical address "
155 | "0x%08lx.\n",
156 | (ul) *p1, (ul) *p2, physaddr);
157 | } else {
158 | fprintf(stderr,
159 | "FAILURE: 0x%08lx != 0x%08lx at offset 0x%08lx.\n",
160 | (ul) *p1, (ul) *p2, (ul) (i * sizeof(ul)));
161 | }
162 | /* printf("Skipping to next test..."); */
163 | r = -1;
164 | }
165 | }
166 | return r;
167 | }
168 | ```
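169 | #### Purpose (principle)
170 | Tests the data bus, and whether some data pattern leaves a cell unable to be read or written; similar to a monkey test in software testing.
171 |
172 | A few points to note:
173 |
174 | + The open-source version has no low-level acceleration and writes memory one word at a time. Its `fflush(stdout)` calls only push the progress indicator to the terminal; they do not force data out to DRAM, so a low-level optimization has to account for the effect of the CPU cache itself;
175 |
176 | + The comparison function above compares one value at a time; without low-level optimization it takes roughly four times as long as the NEON-accelerated version.
177 |
178 |
179 | #### Time cost
180 | **Conditions:**
181 | Full 1 GB space, DDR bandwidth 1600M\*32bit, CPU: ARM A53 (1460~1800)M\*32bit, running on a single core.
182 |
183 | **Time cost:**
184 | ___Sec.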
185 |
186 | ```cpp
187 | int test_random_value(ulv *bufa, ulv *bufb, size_t count) {
188 | ulv *p1 = bufa;
189 | ulv *p2 = bufb;
190 | ul j = 0;
191 | size_t i;
192 | putchar(' ');
193 | fflush(stdout);
194 | for (i = 0; i < count; i++) {
195 | *p1++ = *p2++ = rand_ul();
196 | if (!(i % PROGRESSOFTEN)) {
197 | putchar('\b');
198 | putchar(progress[++j % PROGRESSLEN]);
199 | fflush(stdout);
200 | }
201 | }
202 | printf("\b \b");
203 | fflush(stdout);
204 | return compare_regions(bufa, bufb, count);
205 | }
206 | ```
207 |
208 |
209 | ### ARM A53 port
210 |
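211 | Built on the upstream version with the AND/OR operations merged in, but still writing one word at a time; a check showed that writing zeros to the full space this way takes 4 times as long as writing with NEON.
212 |
213 | On the ARM platform (parameters as described above) it ran in about 24 s.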
214 |
215 | ```cpp
216 | int test_random_comparison(unsigned int *bufa, unsigned int *bufb, unsigned int count)
217 | {
218 | unsigned int *p1 = bufa;
219 | unsigned int *p2 = bufb;
220 | int i;
221 | int q;
222 | for(i = 0; i < count; i++, p1++, p2++){
223 | *p1 = *p2 = rand_ul();
224 | }
225 | p1 = bufa;
226 | p2 = bufb;
227 | for(i = 0; i < count; i++, p1++, p2++){
228 | if(i%2==0){
229 | q=rand_ul();
230 | *p1|=q;
231 | *p2|=q;
232 | }
233 | else{
234 | q=rand_ul();
235 | *p1&=q;
236 | *p2&=q;
237 | }
238 | }
239 | return compare_regions(bufa, bufb, count);
240 | }
241 |
242 | ```
243 | Reading back and comparing the full 1 GB space took about 6 s in practice (A53 platform, single core at 1.46 GHz, 32-bit), a large reduction.
244 | ```cpp
245 | int compare_regions(unsigned int *bufa, unsigned int *bufb, unsigned int count)
246 | {
247 |     unsigned int ret = 0,i,ERRO_ADDR[2];
248 |
249 |     ret = mctl_neon_cmp(bufa,bufb,count<<2,&ERRO_ADDR[0]);  /* count<<2: length in bytes */
250 |
251 |     if(ret)
252 |     {
253 |         for(i=0;i<1000;i++)  /* re-read the two failing locations up to 1000 times */
254 |         {
255 |             if(__ADDR(ERRO_ADDR[0])==__ADDR(ERRO_ADDR[1]))  /* they agree on re-read: the mismatch was a read error */
256 |             {
257 |                 printk("[DRAM]addr0:%x!=addr1:%x___compare erro bit is :%x---read erro\n",ERRO_ADDR[0],ERRO_ADDR[1],ret);
258 |                 break;
259 |             }
260 |         }
261 |         if(i==1000)  /* they never agreed: the stored data differs, i.e. a write error */
262 |         {
263 |             printk("[DRAM]addr0:%x!=addr1:%x___compare erro bit is :%x---write erro\n",ERRO_ADDR[0],ERRO_ADDR[1],ret);
264 |         }
265 |     }
266 |     return ret;
267 | }
268 | ```
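269 | The time saved comes mainly from low-level NEON acceleration; see the comments in the code.
270 |
271 | The call passes (bufa, bufb, count<<2, &ERRO_ADDR[0]), so on entry (the fourth argument, the error-address array, arrives in r3):
272 | + **r0 = bufa**
273 | + **r1 = bufb**
274 | + **r2 = count<<2 (length in bytes)**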
275 |
276 | 
277 |
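278 | The basic idea is a comparison: XOR the two buffers, and any non-zero result signals an error, which is reported by returning the XOR value.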
279 |
280 | ``` cpp
281 | mctl_neon_cmp
282 | PUSH {r3-r12, lr}
283 | ADD r2,r2,r0 ;turn the byte count into the end address of bufa
284 |
285 | neon_cmp
286 | VLDM r0!,{q0-q3} ;load 4*128 bits from bufa into 4 NEON Q registers
287 | VLDM r1!,{q4-q7} ;load 4*128 bits from bufb into 4 NEON Q registers
288 |
289 |
290 | VEOR q8,q0,q4
291 | VEOR q9,q1,q5
292 | VEOR q10,q2,q6
293 | VEOR q11,q3,q7
294 |
295 | VORR q12,q8,q9
296 | VORR q13,q11,q10
297 | VORR q14,q12,q13
298 | VORR d30,d28,d29
299 | VMOV r4, r5, d30
300 | ORR r6,r4,r5
301 |
302 | CMP r6,#0x0
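303 | ;The block above implements the figure shown earlier. Only 2x4 Q registers
304 | ;of data are handled per iteration because there are just 16 Q registers in total.
305 | ;Writing results in place (e.g. VEOR q0,q0,q4) could free registers for a wider loop.
306 |
307 | BNE rw_detect
308 | ;branch if not zero:
309 | ;a non-zero result means a mismatch was detected
310 |
311 |
312 | ;check whether the whole region has been compared
313 | CMP r0,r2
314 | BNE neon_cmp
315 |
316 | ;r0 is the return value, so return 0 when the test passes
317 | MOV r0,#0x0
318 | POP {r3-r12, pc}
319 |
320 | rw_detect
321 | ;error path: locate the first mismatching word
322 | SUB r0,r0,#0x40
323 | SUB r1,r1,#0x40
324 | ;step bufa and bufb back by 0x40 = 64 bytes, i.e. the 16 32-bit words (4 Q registers) just loaded.
325 | ;(each of the 16 XOR result words in q8-q11 is then checked in turn)
326 |
327 | VMOV r4,r5,d16 ;unpack d16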
328 | CMP r4,#0
329 | BNE rw_detect_done
330 | ADD r0,r0,#0x4
331 | ADD r1,r1,#0x4
332 | CMP r5,#0
333 | BNE rw_detect_done
334 | ADD r0,r0,#0x4
335 | ADD r1,r1,#0x4
336 |
337 | VMOV r4,r5,d17;d17
338 | CMP r4,#0
339 | BNE rw_detect_done
340 | ADD r0,r0,#0x4
341 | ADD r1,r1,#0x4
342 | CMP r5,#0
343 | BNE rw_detect_done
344 | ADD r0,r0,#0x4
345 | ADD r1,r1,#0x4
346 |
347 | VMOV r4,r5,d18;d18
348 | CMP r4,#0
349 | BNE rw_detect_done
350 | ADD r0,r0,#0x4
351 | ADD r1,r1,#0x4
352 | CMP r5,#0
353 | BNE rw_detect_done
354 | ADD r0,r0,#0x4
355 | ADD r1,r1,#0x4
356 |
357 | VMOV r4,r5,d19;d19
358 | CMP r4,#0
359 | BNE rw_detect_done
360 | ADD r0,r0,#0x4
361 | ADD r1,r1,#0x4
362 | CMP r5,#0
363 | BNE rw_detect_done
364 | ADD r0,r0,#0x4
365 | ADD r1,r1,#0x4
366 |
367 | VMOV r4,r5,d20;d20
368 | CMP r4,#0
369 | BNE rw_detect_done
370 | ADD r0,r0,#0x4
371 | ADD r1,r1,#0x4
372 | CMP r5,#0
373 | BNE rw_detect_done
374 | ADD r0,r0,#0x4
375 | ADD r1,r1,#0x4
376 |
377 | VMOV r4,r5,d21;d21
378 | CMP r4,#0
379 | BNE rw_detect_done
380 | ADD r0,r0,#0x4
381 | ADD r1,r1,#0x4
382 | CMP r5,#0
383 | BNE rw_detect_done
384 | ADD r0,r0,#0x4
385 | ADD r1,r1,#0x4
386 |
387 | VMOV r4,r5,d22;d22
388 | CMP r4,#0
389 | BNE rw_detect_done
390 | ADD r0,r0,#0x4
391 | ADD r1,r1,#0x4
392 | CMP r5,#0
393 | BNE rw_detect_done
394 | ADD r0,r0,#0x4
395 | ADD r1,r1,#0x4
396 |
397 | VMOV r4,r5,d23;d23
398 | CMP r4,#0
399 | BNE rw_detect_done
400 | ADD r0,r0,#0x4
401 | ADD r1,r1,#0x4
402 | CMP r5,#0
403 | BNE rw_detect_done
404 | ADD r0,r0,#0x4
405 | ADD r1,r1,#0x4
406 |
407 | rw_detect_done
408 | MOV r2,r3
409 | STR r0,[r2];save error address in bufa
410 |
411 | ADD r3,r3,#0x4
412 | MOV r2,r3
413 | STR r1,[r2];save error address in bufb
414 |
415 | MOV r0,r6
416 | POP {r3-r12, pc}
417 | ```
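
For readers who do not read NEON assembly, the routine above can be modeled in portable C roughly as follows. This is only a scalar sketch of the same idea (XOR a 64-byte block, OR-reduce the result, and walk the 16 words only when the reduction is non-zero); the name `block_compare` and the index-based error output are assumptions of this note, whereas the real routine stores the two failing addresses through r3.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_WORDS 16u  /* 4 Q registers x 4 words = 64 bytes per iteration */

/*
 * Returns 0 if the regions match. Otherwise returns the OR of all XOR
 * differences in the first failing block and stores the index of the
 * first differing word in *err_index.
 * count is in 32-bit words and must be a multiple of BLOCK_WORDS.
 */
uint32_t block_compare(const volatile uint32_t *a, const volatile uint32_t *b,
                       size_t count, size_t *err_index)
{
    size_t base, k;

    for (base = 0; base < count; base += BLOCK_WORDS) {
        uint32_t diff[BLOCK_WORDS];
        uint32_t acc = 0;

        for (k = 0; k < BLOCK_WORDS; k++) {
            diff[k] = a[base + k] ^ b[base + k];  /* the VEOR step */
            acc |= diff[k];                       /* the VORR/ORR reduction */
        }

        if (acc != 0) {                           /* the BNE rw_detect path */
            for (k = 0; k < BLOCK_WORDS; k++)
                if (diff[k] != 0)
                    break;
            *err_index = base + k;
            return acc;
        }
    }
    return 0;
}
```

The point of the reduction is that the common case (no error) costs one test per 64 bytes instead of one per word; the word-by-word search only runs once a block is already known to be bad.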
418 |
419 | ---
420 |
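421 | #### Method test_walkbits1_comparison/test_walkbits0_comparison
422 | Function: `int test_walkbits1_comparison(bufa, bufb, count);`
423 |
424 | Taking walk1 as the example: each round fills both buffers with one pattern whose distinguished bit sits at position j (`UL_ONEBITS ^ (ONE << j)` in the code below), then compares them; the bit moves one position left per round and, after reaching the top, walks back down.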
425 |
426 | #### Purpose (principle)
427 | This test is intended to uncover **data or address bus problems** both **internal** to the memory device as well as **external**.
428 |
429 | **It also covers adjacent-coupling faults.**
430 |
431 | #### 时间花销
432 | **条件:**
433 | 全空间1G Byte ,DDR带宽1600M\*32bit,CPU: ARM A53 (1460~1800)M\* 32bit单核跑。
434 |
435 | **时间成本:**
436 | ___Sec.
437 |
438 | ```cpp
439 | int test_walkbits1_comparison(ulv *bufa, ulv *bufb, size_t count) {
440 | ulv *p1 = bufa;
441 | ulv *p2 = bufb;
442 | unsigned int j;
443 | size_t i;
444 |
445 | printf(" ");
446 | fflush(stdout);
447 | for (j = 0; j < UL_LEN * 2; j++) {
448 | printf("\b\b\b\b\b\b\b\b\b\b\b");
449 | p1 = (ulv *) bufa;
450 | p2 = (ulv *) bufb;
451 | printf("setting %3u", j);
452 | fflush(stdout);
453 | for (i = 0; i < count; i++) {
454 | if (j < UL_LEN) { /* Walk it up. */
455 | *p1++ = *p2++ = UL_ONEBITS ^ (ONE << j);
456 | } else { /* Walk it back down. */
457 | *p1++ = *p2++ = UL_ONEBITS ^ (ONE << (UL_LEN * 2 - j - 1));
458 | }
459 | }
460 | printf("\b\b\b\b\b\b\b\b\b\b\b");
461 | printf("testing %3u", j);
462 | fflush(stdout);
463 | if (compare_regions(bufa, bufb, count)) {
464 | return -1;
465 | }
466 | }
467 | printf("\b\b\b\b\b\b\b\b\b\b\b \b\b\b\b\b\b\b\b\b\b\b");
468 | fflush(stdout);
469 | return 0;
470 | }
471 | ```
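To make the pass-by-pass pattern in the listing above easier to see, here is a small sketch that computes the word written on pass `j`. It assumes 32-bit words (as in the ARM port); `walkbits1_pattern` and `WORD_BITS` are hypothetical names used only for illustration.

```cpp
#include <assert.h>
#include <stdint.h>

#define WORD_BITS 32u   /* assumption: 32-bit words, as in the ARM port */

/* Pattern written on pass j (0 <= j < 2 * WORD_BITS) by the walkbits1
 * listing above: all ones with a single zero bit that walks up, then down. */
static uint32_t walkbits1_pattern(unsigned j)
{
    unsigned bit;
    assert(j < 2u * WORD_BITS);
    bit = (j < WORD_BITS) ? j : (2u * WORD_BITS - j - 1u);
    return UINT32_MAX ^ (UINT32_C(1) << bit);
}
```

Pass 0 gives `0xFFFFFFFE`, pass 31 gives `0x7FFFFFFF`, and the second half mirrors the first, so pass 63 is `0xFFFFFFFE` again.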
472 |
473 |
474 | ### ARM A53 port
475 |
476 | ```cpp
477 | int test_walkbits1_comparison(unsigned int *bufa, unsigned int *bufb, unsigned int count)
478 | {
479 | unsigned int *p1 = bufa;
480 | unsigned int *p2 = bufb;
481 | unsigned int j;
482 | int ret = 0;
483 |
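484 |     for(j = 0; j
485 | /* ... (listing truncated) ... */
486 | ```
487 |
488 | **Note 2: Although this somewhat resembles the low-level NPSF (neighborhood pattern sensitive fault), that is not the anchor point here. The question is rather: once my customer has plugged the memory into a computer and data with this kind of pattern gets written to memory during use, could the data interfere with each other and get lost? That could easily end in a blue screen! In other words, what we care about is the surface-level behavior, not the underlying defect.**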
801 |
802 | #### Time spent
803 | **Conditions:**
804 | Full 1 GByte address space, DDR bandwidth 1600M\*32bit, CPU: ARM A53 (1460~1800)M\*32bit, running on a single core.
805 |
806 | **Time cost:**
807 | ___Sec.
808 |
809 | ```cpp
810 | int test_checkerboard_comparison(ulv *bufa, ulv *bufb, size_t count) {
811 | ulv *p1 = bufa;
812 | ulv *p2 = bufb;
813 | unsigned int j;
814 | ul q;
815 | size_t i;
816 |
817 | printf(" ");
818 | fflush(stdout);
819 | for (j = 0; j < 64; j++) {
820 | printf("\b\b\b\b\b\b\b\b\b\b\b");
821 | q = (j % 2) == 0 ? CHECKERBOARD1 : CHECKERBOARD2;
822 | printf("setting %3u", j);
823 | fflush(stdout);
824 | p1 = (ulv *) bufa;
825 | p2 = (ulv *) bufb;
826 | for (i = 0; i < count; i++) {
827 | *p1++ = *p2++ = (i % 2) == 0 ? q : ~q;
828 | }
829 | printf("\b\b\b\b\b\b\b\b\b\b\b");
830 | printf("testing %3u", j);
831 | fflush(stdout);
832 | if (compare_regions(bufa, bufb, count)) {
833 | return -1;
834 | }
835 | }
836 | printf("\b\b\b\b\b\b\b\b\b\b\b \b\b\b\b\b\b\b\b\b\b\b");
837 | fflush(stdout);
838 | return 0;
839 | }
840 | ```
841 |
842 |
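843 | ### ARM A53 port
844 | The checkerboard values chosen here are really arbitrary!!!
845 | A checkerboard test normally writes 0x55 and 0xaa alternately, to detect stuck-bit cases and many adjacent-cell dependency cases.
846 |
847 | The standard algorithm is described as follows: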
848 |
849 | 
850 |
851 | And then there is the port here······
852 |
853 | ```cpp
854 | int test_checkrboard_comparison(unsigned int *bufa, unsigned int *bufb, unsigned int count){
855 | unsigned int *p1 = bufa;
856 | unsigned int *p2 = bufb;
857 | unsigned int j;
858 | unsigned int q;
859 | int ret = 0;
860 | unsigned int CHECKERBOARD1[16]={0x00000000,0x11111111,0x22222222,0x33333333,
861 | 0x44444444,0x55555555,0x66666666,0x77777777,
862 | 0x88888888,0x99999999,0xaaaaaaaa,0xbbbbbbbb,
863 | 0xcccccccc,0xdddddddd,0xeeeeeeee,0xffffffff};
864 | unsigned int CHECKERBOARD2[16]={0xffffffff,0xeeeeeeee,0xdddddddd,0xcccccccc,
865 | 0xbbbbbbbb,0xaaaaaaaa,0x99999999,0x88888888,
866 | 0x77777777,0x66666666,0x55555555,0x44444444,
867 | 0x33333333,0x22222222,0x11111111,0x00000000};
868 | for(j = 0; j < 64; j++){
869 | q = (j % 2) == 0 ? CHECKERBOARD1[(j/2)%16] : CHECKERBOARD2[(j/2)%16];
870 | p1 = (unsigned int *)bufa;
871 | p2 = (unsigned int *)bufb;
872 | mctl_neon_write(p1,p1+(count<<0),q);
873 | mctl_neon_write(p2,p2+(count<<0),q);
874 | ret = compare_regions(bufa, bufb, count);
875 | if(ret){
876 | return ret;
877 | }
878 | }
879 | return 0;
880 | }
881 | ```
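One difference worth checking in the port: the memtester loop alternates `q` and `~q` on consecutive words, while the port passes a single `q` to `mctl_neon_write`. Whether that matters depends on what `mctl_neon_write` does, which is not shown here. Below is a plain-C reference fill that keeps the alternation, meant only as something to compare the NEON path against; `fill_checkerboard` is a hypothetical name.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference fill: even-indexed words get q, odd-indexed words
 * get ~q, mirroring `(i % 2) == 0 ? q : ~q` in the memtester listing. */
static void fill_checkerboard(volatile uint32_t *buf, size_t count, uint32_t q)
{
    size_t i;
    assert(buf != NULL);
    for (i = 0; i < count; i++) {
        buf[i] = (i % 2 == 0) ? q : (uint32_t)~q;
    }
}
```

Calling `fill_checkerboard(bufa, count, 0x55555555)` lays down `0x55555555, 0xaaaaaaaa, 0x55555555, ...`, the classic 0x55/0xaa alternation mentioned earlier.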
882 |
883 |
884 | ---
885 |
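886 | #### Method
887 | Function name: `int test_bitspread_comparison(ulv *bufa, ulv *bufb, size_t count)`
888 |
889 | Compared with walkbits1, only the data pattern changes slightly and everything else stays the same, so this can be regarded as an extension of the walkbits1 test:
890 | + walkbits1: 00000001 -> 00000010
891 | + bitspread: 00000101 -> 00001010
892 | #### Purpose (principle)
893 |
894 | This too is mainly meant to detect coupling faults between neighboring cells.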
895 | 
896 | #### 时间花销
897 | **条件:**
898 | 全空间1G Byte ,DDR带宽1600M\*32bit,CPU: ARM A53 (1460~1800)M\* 32bit单核跑。
899 |
900 | **时间成本:**
901 | ___Sec.
902 |
903 | ```cpp
904 | int test_bitspread_comparison(ulv *bufa, ulv *bufb, size_t count) {
905 | ulv *p1 = bufa;
906 | ulv *p2 = bufb;
907 | unsigned int j;
908 | size_t i;
909 |
910 | printf(" ");
911 | fflush(stdout);
912 | for (j = 0; j < UL_LEN * 2; j++) {
913 | printf("\b\b\b\b\b\b\b\b\b\b\b");
914 | p1 = (ulv *) bufa;
915 | p2 = (ulv *) bufb;
916 | printf("setting %3u", j);
917 | fflush(stdout);
918 | for (i = 0; i < count; i++) {
919 | if (j < UL_LEN) { /* Walk it up. */
920 | *p1++ = *p2++ = (i % 2 == 0)
921 | ? (ONE << j) | (ONE << (j + 2))
922 | : UL_ONEBITS ^ ((ONE << j) | (ONE << (j + 2)));
923 | } else { /* Walk it back down. */
924 | *p1++ = *p2++ = (i % 2 == 0)
925 | ? (ONE << (UL_LEN * 2 - 1 - j)) | (ONE << (UL_LEN * 2 + 1 - j))
926 | : UL_ONEBITS ^ (ONE << (UL_LEN * 2 - 1 - j)
927 | | (ONE << (UL_LEN * 2 + 1 - j)));
928 | }
929 | }
930 | printf("\b\b\b\b\b\b\b\b\b\b\b");
931 | printf("testing %3u", j);
932 | fflush(stdout);
933 | if (compare_regions(bufa, bufb, count)) {
934 | return -1;
935 | }
936 | }
937 | printf("\b\b\b\b\b\b\b\b\b\b\b \b\b\b\b\b\b\b\b\b\b\b");
938 | fflush(stdout);
939 | return 0;
940 | }
941 | ```
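The bitspread pattern per pass can be sketched the same way. One caveat: the listing above shifts by `j + 2`, which reaches or exceeds the word width near the top of the walk, and shifting by the full width or more is undefined behaviour in C. The sketch below treats a bit that leaves the word as zero, which is an assumption about the intended pattern. It also assumes 32-bit words; all names are hypothetical.

```cpp
#include <assert.h>
#include <stdint.h>

#define WORD_BITS 32u   /* assumption: 32-bit words, as in the ARM port */

/* Shift that yields 0 once the bit leaves the word; a plain `1 << n` with
 * n >= 32 is undefined behaviour in C. */
static uint32_t bit_or_zero(unsigned n)
{
    return (n < WORD_BITS) ? (UINT32_C(1) << n) : 0u;
}

/* Even-index pattern for pass j: two set bits, two positions apart.
 * Odd-index words in the listing above get the bitwise complement. */
static uint32_t bitspread_pattern(unsigned j)
{
    unsigned low;
    assert(j < 2u * WORD_BITS);
    low = (j < WORD_BITS) ? j : (2u * WORD_BITS - 1u - j);
    return bit_or_zero(low) | bit_or_zero(low + 2u);
}
```

Pass 0 gives `0x00000005` and pass 1 gives `0x0000000A`, matching the `00000101 -> 00001010` progression described earlier.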
942 |
943 |
944 | ### ARM A53 port
945 |
946 | ```cpp
947 | int test_bitspread_comparison(unsigned int *bufa, unsigned int *bufb, unsigned int count){
948 | #define Q 2
949 | unsigned int *p1 = bufa;
950 | unsigned int *p2 = bufb;
951 | unsigned int j;
952 | int ret = 0;
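953 |     for(j = 0; j
954 | /* ... (listing truncated) ... */
955 | ```
956 |
957 | Note 2: Although the checkerboard test here somewhat resembles the low-level NPSF (neighborhood pattern sensitive fault), that is not the anchor point here. The question is rather: once my customer has plugged the memory into a computer and data with this kind of pattern gets written to memory during use, could the data interfere with each other and get lost? That could easily end in a blue screen! In other words, what we care about is the surface-level behavior, not the underlying defect.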
978 |
--------------------------------------------------------------------------------