
<html>
<body>
<ul>
<li>These lists were created by the <a href=http://www.aida64.com>AIDA64</a> Instruction Latency Dump feature. If you do not trust software measurements, wait for the official Intel/AMD/etc. guide and hope it will be more detailed and accurate than the current one. ;) You can create such a dump in AIDA64 by right-clicking the status bar at the bottom of the AIDA64 main window -> CPU Debug -> Instruction Latency Dump. It works fully in the trial version, too.</li>
<li>In this dump, latency means the time it takes before the next dependent instruction of the same type can start. Throughput means the time it takes before the next independent instruction of the same type can start:
<pre>  L: ADD rax, rax       T: ADD rax, rax
     ADD rax, rax          ADD rbx, rbx
     ADD rax, rax          ADD rcx, rcx
     ...                   ...
</pre>
</li>
<li>These values are measured with long chains of instructions (~6000), so they are sustained rates; peak rates can be higher.</li>
<li>Some instructions do not modify the target register, e.g. <code>CMP, TEST, BT, NOP</code>. For these it is not always possible to measure the instruction latency directly.</li>
<li>Some instructions never depend on a previous one: they use different source and destination register sets or have a memory operand, so it is not always possible to measure the instruction latency directly, but it is possible to measure instruction
    pairs, e.g. <code>PUSH + POP, MOV reg, [mem] + MOV [mem], reg</code>. The abbreviation "<code>LS pair</code>" means the load form and the store form of a move instruction measured as a pair.</li>
<li>Newer processors can recognize that some instructions with identical operands are independent of previous ones. In this case latency can be lower than 1. The classic example is the XOR instruction: the result of <code>XOR eax, eax</code> is always 0, so it never depends
    on the result of the previous XOR. The entry <code>XOR r32_1, r32_2</code> uses two different registers and means
<pre> L: XOR rax, rbx        T: XOR rax, rbx
    XOR rax, rbx           XOR rbx, rcx
    XOR rax, rbx           XOR rcx, rdx
    ...                    ...
</pre>
</li>
<li>If the TP (throughput) value is less than 1, more than one instruction of the same type can start in the same clock cycle.</li>
<li>With a memory operand, throughput can be higher than latency, because the throughput measurement uses more memory locations than the latency measurement.</li>
<li>FSQRT throughput can be higher than FSQRT latency on older processors because FSQRT is measured via

<pre> L: FSQRT               T: FSQRT 
    FSQRT                  FDECSTP
    FSQRT                  FSQRT
    FSQRT                  FDECSTP
     ...                   ...
</pre>
    chains, and older processors cannot execute FSQRT and FDECSTP in parallel. <i><b>Update</b>: FDECSTP changed to FXCH.</i></li>
<li>The (I)DIV latency on modern processors depends on the operand size. (I)DIV always uses the rDX:rAX registers for the dividend, quotient and remainder, and only for some operand sizes is it
    possible to choose a dividend that the division leaves unchanged (e.g.
    if AX = 0xFEFF, AX is still 0xFEFF after a <code>DIV AL</code>). In all other cases rDX/rAX must be reloaded between divisions. So "<code>DIV r8 12/ 8b ax upd</code>" means
 
 <pre>
 L: DIV al              T: DIV bl
    MOV ax, const          MOV ax, const
    DIV al                 DIV cl
    MOV ax, const          MOV ax, const
    ...                    ...
</pre>
chains. Similarly "<code>DIV r32 2^62/2^31 eax/edx</code>" means
<pre> L: DIV eax            T: DIV ebx
    MOV eax, const1       MOV eax, const1
    MOV edx, const2       MOV edx, const2
    DIV eax               DIV ecx
    MOV eax, const1       MOV eax, const1
    MOV edx, const2       MOV edx, const2
    ...                   ...
</pre>
</li>
<li>For some x87 instruction combinations (and for some SSE instructions in 32-bit mode) the 8 registers are not enough to measure the instruction throughput.</li>
<li>It is a measurement, not a constant table, so some values are rounded.</li>
<li><strong>Keep in mind that even though instruction latency and throughput are important, they may not directly reflect CPU performance!</strong> </li>
</ul>
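<p>The relation between the two values can be sketched in a few lines of Python. This is only a simplified model of the definitions above, not the code AIDA64 uses; the function names and the sample numbers are hypothetical.</p>
<pre>
```python
def chain_cycles(count, cycles_per_instruction):
    """Estimated clock cycles for a chain of `count` same-type instructions.

    Pass the latency value for a dependent chain (L column) or the
    throughput value for an independent chain (T column).
    """
    return count * cycles_per_instruction


def starts_per_cycle(throughput):
    """How many same-type instructions can start per clock cycle."""
    return 1.0 / throughput


# Hypothetical dump values for an ADD-like instruction (assumed, not measured).
latency = 1.0      # cycles until the next dependent instruction can start
throughput = 0.25  # cycles until the next independent instruction can start

dependent = chain_cycles(6000, latency)       # 6000.0 cycles
independent = chain_cycles(6000, throughput)  # 1500.0 cycles
per_cycle = starts_per_cycle(throughput)      # 4.0 instructions per cycle
```
</pre>
<p>With these assumed numbers, a throughput of 0.25 corresponds to four instructions of the same type starting in each clock cycle, which is the "TP less than 1" case described in the list.</p>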
Feature set:
<pre>
<ul>
<li>X86           - default, no CPUID bit</li>
<li>TSC           - CPUID.00000001.EDX[4]</li>
<li>X87           - CPUID.00000001.EDX[0]</li>
<li>CMOV          - CPUID.00000001.EDX[15]</li>
<li>MMX           - CPUID.00000001.EDX[23]</li>
<li>3DNOW         - CPUID.80000001.EDX[31]</li>
<li>3DNOWP        - CPUID.80000001.EDX[30]</li>
<li>SSE           - CPUID.00000001.EDX[25]</li>
<li>SSE2          - CPUID.00000001.EDX[26]</li>
<li>SSE3          - CPUID.00000001.ECX[0]</li>
<li>MMXP          - CPUID.80000001.EDX[22]</li>
<li>AMD64         - CPUID.80000001.EDX[29]</li>
<li>SSSE3         - CPUID.00000001.ECX[9]</li>
<li>SSE4A         - CPUID.80000001.ECX[6]</li>
<li>ABM           - CPUID.80000001.ECX[5]</li>
<li>SSE41         - CPUID.00000001.ECX[19]</li>
<li>SSE42         - CPUID.00000001.ECX[20]</li>
<li>POPCNT        - CPUID.00000001.ECX[23]</li>
<li>XOP           - CPUID.80000001.ECX[11]</li>
<li>VIAAES        - CPUID.C0000001.EDX[7]</li>
<li>LAHF          - CPUID.80000001.ECX[0]</li>
<li>CMPX8         - CPUID.00000001.EDX[8]</li>
<li>CMPX16        - CPUID.00000001.ECX[13]</li>
<li>AESNI         - CPUID.00000001.ECX[25]</li>
<li>CLMUL         - CPUID.00000001.ECX[1]</li>
<li>AVX           - CPUID.00000001.ECX[28]</li>
<li>FMA3          - CPUID.00000001.ECX[12]</li>
<li>MOVBE         - CPUID.00000001.ECX[22]</li>
<li>FMA4          - CPUID.80000001.ECX[16]</li>
<li>F16C          - CPUID.00000001.ECX[29]</li>
<li>HTT           - CPUID.00000001.EDX[28]</li>
<li>VIAHASH       - CPUID.C0000001.EDX[11]</li>
<li>VIARNG        - CPUID.C0000001.EDX[3]</li>
<li>RDRAND        - CPUID.00000001.ECX[30]</li>
<li>FSGSBASE      - CPUID.00000007.EBX[0]</li>
<li>BMI           - CPUID.00000007.EBX[3]</li>
<li>TBM           - CPUID.80000001.ECX[21]</li>
<li>CLFLUSH       - CPUID.00000001.EDX[19]</li>
<li>X2APIC        - CPUID.00000001.ECX[21]</li>
<li>TSCINV        - CPUID.80000007.EDX[8]</li>
<li>TOPEX         - CPUID.80000001.ECX[22]    - Topology extension (Phys_ID - Core_ID - SMT_ID vs. Phys_ID - Node_ID - ComputingUnit_ID - Core_ID)</li>
<li>RDTSCP        - CPUID.80000001.EDX[27]</li>
<li>3DNOWPREF     - CPUID.80000001.ECX[8]</li>
<li>MISALIGNSSE   - CPUID.80000001.ECX[7]</li>
<li>LNOP          -                           - Long NOP (opcode 0Fh 1Fh, no CPUID bit)</li>
<li>AVX2          - CPUID.00000007.EBX[5]</li>
<li>BMI2          - CPUID.00000007.EBX[8]</li>
<li>ERMS          - CPUID.00000007.EBX[9]</li>
<li>XRNG2         - CPUID.C0000001.EDX[23]    - another hardware random fill instruction</li>
<li>HLE           - CPUID.00000007.EBX[4]</li>
<li>RTM           - CPUID.00000007.EBX[11]</li>
<li>PSE           - CPUID.00000001.EDX[3]    - Page Size Extension</li>
<li>4LTOP         -                          - 4 level topology (forced TOPEX, no CPUID bit, some processors use the 4 level topology without the TOPEX bit)</li>
<li>HYP           - CPUID.00000001.ECX[31]   - Hypervisor mode</li>
<li>RDSEED        - CPUID.00000007.EBX[18]</li>
<li>ADX           - CPUID.00000007.EBX[19]</li>
<li>SMAP          - CPUID.00000007.EBX[20]</li>
<li>FP128         - CPUID.8000001A.EAX[0]</li>
<li>FP256         - CPUID.8000001A.EAX[2]</li>
<li>SHARED_CU     - CPUID.8000001E.EBX[15:8]  - Shared Computing Units  - 1</li>
<li>MPX           - CPUID.00000007.EBX[14]    - Memory Protection Extensions</li>
<li>AVX512F       - CPUID.00000007.EBX[16]    - AVX-512 Foundation</li>
<li>PT            - CPUID.00000007.EBX[25]    - Processor Trace</li>
<li>AVX512PF      - CPUID.00000007.EBX[26]    - AVX-512 Prefetch</li>
<li>AVX512ER      - CPUID.00000007.EBX[27]    - AVX-512 Exponential and Reciprocal</li>
<li>AVX512CD      - CPUID.00000007.EBX[28]    - AVX-512 Conflict Detection</li>
<li>SHA           - CPUID.00000007.EBX[29]    - HW SHA</li>
<li>PWT1          - CPUID.00000007.ECX[0]     - PREFETCHWT1</li>
<li>SGX           - CPUID.00000007.EBX[2]     - Software Guard Extensions</li>
<li>CLFLUSHOPT    - CPUID.00000007.EBX[23]    - CLFLUSHOPT</li>
<li>PQM           - CPUID.00000007.EBX[12]    - Platform Quality of Service Monitoring</li>
<li>PQE           - CPUID.00000007.EBX[15]    - Platform Quality of Service Enforcement</li>
<li>AVX512DQ      - CPUID.00000007.EBX[17]    - AVX-512 Doubleword and Quadword Instructions</li>
<li>AVX512BW      - CPUID.00000007.EBX[30]    - AVX-512 Byte and Word Instructions</li>
<li>AVX512VL      - CPUID.00000007.EBX[31]    - AVX-512 Vector Length Extensions</li>
<li>AVX512IFMA52  - CPUID.00000007.EBX[21]    - AVX-512 52-bit Integer Fused Multiply-Add Instructions</li>
<li>AVX512VBMI    - CPUID.00000007.ECX[1]     - AVX-512 Vector Bit Manipulation Instructions</li>
<li><s>PCOMMIT       - CPUID.00000007.EBX[22]    - PCOMMIT</s> <a href=https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction>deprecated</a> </li>
<li>CLWB          - CPUID.00000007.EBX[24]    - Cache Line Write Back</li>
<li>CLZERO        - CPUID.80000008.EBX[0]     - Cache Line Zero Out</li>
<li>UMIP          - CPUID.00000007.ECX[2]     - User Mode Instruction Prevention</li>
<li>PKU           - CPUID.00000007.ECX[3]     - Protection Keys for User-Mode Pages</li>
<li>OSPKU         - CPUID.00000007.ECX[4]     - RDPKRU/WRPKRU instructions</li>
<li>RDPID         - CPUID.00000007.ECX[22]    - Read Processor ID</li>
<li>SGX_LC        - CPUID.00000007.ECX[30]    - SGX Launch Configuration</li>
<li>AVX512_4VNNIW - CPUID.00000007.EDX[2]     - AVX-512 Vector Neural Network Instructions, Word variable precision</li>
<li>AVX512_4FMAPS - CPUID.00000007.EDX[3]     - AVX-512 Fused Multiply Accumulation, Packed Single precision</li>
<li>MONITORX      - CPUID.80000001.ECX[29]    - MONITORX/MWAITX instructions</li>
<li>SME           - CPUID.8000001F.EAX[0]     - Secure Memory Encryption</li>
<li>SEV           - CPUID.8000001F.EAX[1]     - Secure Encrypted Virtualization</li>

</ul>
</pre>
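<p>The notation used in the feature list (leaf, register, bit index) can be decoded mechanically. The following Python sketch shows one possible way to do it; the helper names and the sample register values are hypothetical, and the CPUID output is assumed to have been captured elsewhere (sub-leaf selection is not modeled).</p>
<pre>
```python
import re

# Pattern for entries such as "CPUID.00000007.EBX[5]":
# hexadecimal leaf, register name, single bit index.
FEATURE_RE = re.compile(r"CPUID\.([0-9A-Fa-f]{8})\.(E[ABCD]X)\[(\d+)\]")


def parse_feature(spec):
    """Split a feature entry into (leaf, register, bit)."""
    match = FEATURE_RE.fullmatch(spec)
    if match is None:
        raise ValueError("unsupported feature entry: " + spec)
    leaf, register, bit = match.groups()
    return int(leaf, 16), register, int(bit)


def has_feature(spec, cpuid_values):
    """Test the bit named by `spec` in already-captured CPUID output.

    `cpuid_values` maps (leaf, register) to the 32-bit register value.
    Missing entries are treated as 0 (feature not reported).
    """
    leaf, register, bit = parse_feature(spec)
    value = cpuid_values.get((leaf, register), 0)
    return bool((value >> bit) & 1)


# Hypothetical register values, not taken from a real processor.
sample = {
    (0x00000001, "EDX"): 0x06000000,  # bits 25 and 26 set (SSE, SSE2)
    (0x00000007, "EBX"): 0x00000020,  # bit 5 set (AVX2)
}

sse2 = has_feature("CPUID.00000001.EDX[26]", sample)  # True
avx2 = has_feature("CPUID.00000007.EBX[5]", sample)   # True
bmi2 = has_feature("CPUID.00000007.EBX[8]", sample)   # False
```
</pre>
<p>Entries with a bit range, such as <code>CPUID.8000001E.EBX[15:8]</code>, and entries without a CPUID bit are deliberately rejected by this sketch with a <code>ValueError</code>.</p>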
Other similar pages:
<br>
<ul>
<li><a href=http://www.agner.org/optimize>http://www.agner.org/optimize</a></li>
<li><a href=http://gmplib.org/~tege/x86-timing.pdf>http://gmplib.org/~tege/x86-timing.pdf</a></li>
<li><a href=http://mubench.sourceforge.net/index.html>http://mubench.sourceforge.net/index.html</a></li>
</ul>
Official optimization guides:<br />
<ul>
<li><a href=http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf>Intel64 and IA-32 Architectures Optimization Reference Manual</a></li>
<li>AMD K6: <a href=http://www.ii.uib.no/~osvik/amd_opt/21924d.pdf>http://www.ii.uib.no/~osvik/amd_opt/21924d.pdf</a></li>
<li>AMD K7: <a href=http://www.ii.uib.no/~osvik/amd_opt/22007k.pdf>http://www.ii.uib.no/~osvik/amd_opt/22007k.pdf</a></li>
<li>AMD K8: <a href=http://support.amd.com/TechDocs/25112.PDF>http://support.amd.com/TechDocs/25112.PDF</a></li>
<li>AMD K10&amp;K12: <a href=http://support.amd.com/TechDocs/40546.pdf>http://support.amd.com/TechDocs/40546.pdf</a></li>
<li>AMD K15: <a href=http://support.amd.com/TechDocs/47414.PDF>http://support.amd.com/TechDocs/47414.PDF</a></li>
<li>AMD K16: <a href=http://support.amd.com/TechDocs/52128_16h_Software_Opt_Guide.zip>http://support.amd.com/TechDocs/52128_16h_Software_Opt_Guide.zip</a></li>
</ul>
</body>
</html>