forked from InstLatx64/InstLatx64
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathComments.htm
176 lines (173 loc) · 11 KB
/
Comments.htm
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
<html>
<body>
<ul>
<li>These lists were created by <a href=http://www.aida64.com>AIDA64</a> Instruction Latency dump feature. If you do not believe in software measurements, wait for the official Intel/AMD/etc. guide and hope it will be more detailed and accurate than the current one. ;) You can create such dump in AIDA64 by right-clicking on the bottom status bar of AIDA64 main window -> CPU Debug -> Instruction Latency Dump. It fully works on trial version, too. </li>
<li>In this dump latency means the time that it takes for the next dependent same-type instruction to start. Throughput means the time that it takes for the next independent same-type instruction to start:
<pre> L: ADD rax, rax T: ADD rax, rax
ADD rax, rax ADD rbx, rbx
ADD rax, rax ADD rcx, rcx
... ...
</pre>
</li>
<li>These values are measured by long chains of instructions (~6000), so these are the sustained rates, peak values can be higher.</li>
<li>Some instructions do not modify the target register. E.g. <code>CMP, TEST, BT, NOP.
T</code>his way it is not always possible to measure directly the instruction latency. </li>
<li>Some instructions never depend on a previous one: they use different source and destination register sets or have memory operand, so it is not always possible to measure directly the instruction latency, but it possible to measure instruction
pairs. E.g. <code>PUSH + POP, MOV reg, [mem] + MOV [mem], reg</code>. The abbreviation "<code>LS pair</code>" means a load and a store form pair of a moving instruction.</li>
<li>Newer processors can recognize that some instructions with the same operand are independent from previous ones. In this case latency can be lower than 1. Classic example is the XOR instruction: <code>XOR eax, eax</code> always 0 so it never depends
on the result of the previous XOR. <code>XOR r32_1, r32_2 </code> means
<pre> L: XOR rax, rbx T: XOR rax, rbx
XOR rax, rbx XOR rbx, rcx
XOR rax, rbx XOR rcx, rdx
... ...
</pre>
</li>
<li>If TP value is less than 1, it means that more than one same-type instruction can start in the same clock cycle.</li>
<li>In case of memory operand, throughput can be higher than latency, because it uses more memory location than latency measurement.</li>
<li>FSQRT throughput can be higher than FSQRT latency on older processors because FSQRT is measured via
<pre> L: FSQRT T: FSQRT
FSQRT FDECSTP
FSQRT FSQRT
FSQRT FDECSTP
... ...
</pre>
chains and the oldies cannot do FSQRT and FDECSTP parallel.<i><b>Update</b>: FDECSTP changed to FXCH.</i></li>
<li>The (I)DIV latency on modern processors depends on the operand size. Because
(I)DIV always uses rDX:rAX registers for dividend, quotient and remainder, and only for some operand sizes is
possible to dividend = quotient : remainder (e.g.
If AX = 0xFEFF, after an <code>DIV AL</code> AX remains 0xFEFF) , need to refresh rDX/rAX. So "<code>DIV r8 12/ 8b ax upd</code>" means
<pre>
L: DIV al T: DIV bl
MOV ax, const MOV ax, const
DIV al DIV cl
MOV ax, const MOV ax, const
... ...
</pre>
chains. Similarly "<code>DIV r32 2^62/2^31 eax/edx</code>" means
<pre> L: DIV eax T: DIV ebx
MOV eax, const1 MOV eax, const1
MOV edx, const2 MOV edx, const2
DIV eax DIV ecx
MOV eax, const1 MOV eax, const1
MOV edx, const2 MOV edx, const2
... ...
</pre>
</li>
<li>For some x87 instruction combinations (and for some SSE in 32b mode) the 8 registers are not enough to measure the instruction throughput.</li>
<li>It is a measurement, not a constant table, so some values are rounded.</li>
<li><strong>Keep in mind that even though instruction latency and throughput are important, they may not directly reflect CPU performance!</strong> </li>
</ul>
Feature set:
<pre>
<ul>
<li>X86 - default, no CPUID bit</li>
<li>TSC - CPUID.00000001.EDX[4]</li>
<li>X87 - CPUID.00000001.EDX[0]</li>
<li>CMOV - CPUID.00000001.EDX[15]</li>
<li>MMX - CPUID.00000001.EDX[23]</li>
<li>3DNOW - CPUID.80000001.EDX[31]</li>
<li>3DNOWP - CPUID.80000001.EDX[30]</li>
<li>SSE - CPUID.00000001.EDX[25]</li>
<li>SSE2 - CPUID.00000001.EDX[26]</li>
<li>SSE3 - CPUID.00000001.ECX[0]</li>
<li>MMXP - CPUID.80000001.EDX[22]</li>
<li>AMD64 - CPUID.80000001.EDX[29]</li>
<li>SSSE3 - CPUID.00000001.ECX[9]</li>
<li>SSE4A - CPUID.80000001.ECX[6]</li>
<li>ABM - CPUID.80000001.ECX[5]</li>
<li>SSE41 - CPUID.00000001.ECX[19]</li>
<li>SSE42 - CPUID.00000001.ECX[20]</li>
<li>POPCNT - CPUID.00000001.ECX[23]</li>
<li>XOP - CPUID.80000001.ECX[11]</li>
<li>VIAAES - CPUID.C0000001.EDX[7]</li>
<li>LAHF - CPUID.80000001.ECX[0]</li>
<li>CMPX8 - CPUID.00000001.EDX[8]</li>
<li>CMPX16 - CPUID.00000001.ECX[13]</li>
<li>AESNI - CPUID.00000001.ECX[25]</li>
<li>CLMUL - CPUID.00000001.ECX[1]</li>
<li>AVX - CPUID.00000001.ECX[28]</li>
<li>FMA3 - CPUID.00000001.ECX[12]</li>
<li>MOVBE - CPUID.00000001.ECX[22]</li>
<li>FMA4 - CPUID.80000001.ECX[16]</li>
<li>F16C - CPUID.00000001.ECX[29]</li>
<li>HTT - CPUID.00000001.EDX[28]</li>
<li>VIAHASH - CPUID.C0000001.EDX[11]</li>
<li>VIARNG - CPUID.C0000001.EDX[3]</li>
<li>RDRAND - CPUID.00000001.ECX[30]</li>
<li>FSGSBASE - CPUID.00000007.EBX[0]</li>
<li>BMI - CPUID.00000007.EBX[3]</li>
<li>TBM - CPUID.80000001.ECX[21]</li>
<li>CLFLUSH - CPUID.00000001.EDX[19]</li>
<li>X2APIC - CPUID.00000001.ECX[21]</li>
<li>TSCINV - CPUID.80000007.EDX[8]</li>
<li>TOPEX - CPUID.80000001.ECX[22] - Topology extension (Phys_ID - Core_ID - SMT_ID vs. Phys_ID - Node_ID - ComputingUnit_ID - Core_ID)</li>
<li>RDTSCP - CPUID.80000001.EDX[27]</li>
<li>3DNOWPREF - CPUID.80000001.ECX[8]</li>
<li>MISALIGNSSE - CPUID.80000001.ECX[7]</li>
<li>LNOP - - Long NOP (opcode 0Fh 1Fh, no CPUID bit)</li>
<li>AVX2 - CPUID.00000007.EBX[5]</li>
<li>BMI2 - CPUID.00000007.EBX[8]</li>
<li>ERMS - CPUID.00000007.EBX[9]</li>
<li>XRNG2 - CPUID.C0000001.EDX[23] - another hardware random fill instruction</li>
<li>HLE - CPUID.00000007.EBX[4]</li>
<li>RTM - CPUID.00000007.EBX[11]</li>
<li>PSE - CPUID.00000001.EDX[3] - Page Size Extension</li>
<li>4LTOP - - 4 level topology (forced TOPEX, no CPUID bit, some processors use the 4 level topology without the TOPEX bit)</li>
<li>HYP - CPUID.00000001.ECX[31] - Hypervisor mode</li>
<li>RDSEED - CPUID.00000007.EBX[18]</li>
<li>ADX - CPUID.00000007.EBX[19]</li>
<li>SMAP - CPUID.00000007.EBX[20]</li>
<li>FP128 - CPUID.8000001A.EAX[0]</li>
<li>FP256 - CPUID.8000001A.EAX[2]</li>
<li>SHARED_CU - CPUID.8000001E.EBX[15:8] - Shared Copmuting Units - 1</li>
<li>MPX - CPUID.00000007.EBX[14] - Memory Protection Extensions</li>
<li>AVX512F - CPUID.00000007.EBX[16] - AVX-512 Foundation</li>
<li>PT - CPUID.00000007.EBX[25] - Processor Trace</li>
<li>AVX512PF - CPUID.00000007.EBX[26] - AVX-512 Prefetch</li>
<li>AVX512ER - CPUID.00000007.EBX[27] - AVX-512 Exponential and Reciprocal</li>
<li>AVX512CD - CPUID.00000007.EBX[28] - AVX-512 Conflict Detection</li>
<li>SHA - CPUID.00000007.EBX[29] - HW SHA</li>
<li>PWT1 - CPUID.00000007.ECX[0] - PREFETCHWT1</li>
<li>SGX - CPUID.00000007.EBX[2] - Software Guard Extensions</li>
<li>CLFLUSHOPT - CPUID.00000007.EBX[23] - CLFLUSHOPT</li>
<li>PQM - CPUID.00000007.EBX[12] - Platform Quality of Service Enforcement</li>
<li>PQE - CPUID.00000007.EBX[15] - Platform Quality of Service Monitoring</li>
<li>AVX512DQ - CPUID.00000007.EBX[17] - AVX-512 Doubleword and Quadword Instructions</li>
<li>AVX512BW - CPUID.00000007.EBX[30] - AVX-512 Byte and Word Instructions</li>
<li>AVX512VL - CPUID.00000007.EBX[31] - AVX-512 Vector Length Extensions</li>
<li>AVX512IFMA52 - CPUID.00000007.EBX[21] - AVX-512 52-bit Integer Instructions</li>
<li>AVX512VBMI - CPUID.00000007.ECX[1] - AVX-512 Vector Bit Manipulation Instructions</li>
<li><s>PCOMMIT - CPUID.00000007.EBX[22] - PCOMMIT</s> <a href=https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction>deprecated</a> </li>
<li>CLWB - CPUID.00000007.EBX[24] - Cache Line Write Back</li>
<li>CLZERO - CPUID.80000008.EBX[0] - Cache Line Zero Out</li>
<li>UMIP - CPUID.00000007.ECX[2] - User Mode Instruction Prevention</li>
<li>PKU - CPUID.00000007.ECX[3] - Protection Keys for User-Mode Pages</li>
<li>OSPKU - CPUID.00000007.ECX[4] - RDPKRU/WRPKRU instructions</li>
<li>RDPID - CPUID.00000007.ECX[22] - Read Processor ID</li>
<li>SGX_LC - CPUID.00000007.ECX[30] - SGX Launch Configuration</li>
<li>AVX512_4VNNIW - CPUID.00000007.EDX[2] - AVX-512 for Vector Neural Networks Instructions for Words(?)</li>
<li>AVX512_4FMAPS - CPUID.00000007.EDX[3] - AVX-512 half precision support(?)</li>
<li>MONITORX - CPUID.80000001.ECX[29] - MONITORX/MWAITX instructions</li>
<li>SME - CPUID.8000001F.EAX[0] - Secure Memory Encryption</li>
<li>SEV - CPUID.8000001F.EAX[1] - Secure Encrypted Virtualization</li>
</ul>
</pre>
Other similar pages:
<br>
<ul>
<li><a href=http://www.agner.org/optimize>http://www.agner.org/optimize</a></li>
<li><a href=http://gmplib.org/~tege/x86-timing.pdf>http://gmplib.org/~tege/x86-timing.pdf</a></li>
<li><a href=http://mubench.sourceforge.net/index.html>http://mubench.sourceforge.net/index.html</a></li>
</ul>
Official optimization guides:<br />
<ul>
<li><a href=http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf>Intel64 and IA-32 Architectures Optimization Reference Manual</a></li>
<li>AMD K6: <a href=http://www.ii.uib.no/~osvik/amd_opt/21924d.pdf>http://www.ii.uib.no/~osvik/amd_opt/21924d.pdf</a>
<li>AMD K7: <a href=http://www.ii.uib.no/~osvik/amd_opt/22007k.pdf>http://www.ii.uib.no/~osvik/amd_opt/22007k.pdf</a>
<li>AMD K8: <a href=http://support.amd.com/TechDocs/25112.PDF>http://support.amd.com/TechDocs/25112.PDF</a></li>
<li>AMD K10&K12: <a href=http://support.amd.com/TechDocs/40546.pdf>http://support.amd.com/TechDocs/40546.pdf</a></li>
<li>AMD K15: <a href=http://support.amd.com/TechDocs/47414.PDF>http://support.amd.com/TechDocs/47414.PDF</a>
<li>AMD K16: <a href=http://support.amd.com/TechDocs/52128_16h_Software_Opt_Guide.zip>http://support.amd.com/TechDocs/52128_16h_Software_Opt_Guide.zip</a>
</ul>
</body>
</html>