Update jobbole#114 90 分钟学习现代微处理器.md
white-rabit authored Jun 25, 2019
1 parent fe901aa commit 106646a
Showing 1 changed file with 39 additions and 79 deletions.
118 changes: 39 additions & 79 deletions translation/#114 90 分钟学习现代微处理器.md
@@ -10,8 +10,8 @@ Table of Contents
1. [More Than Just Megahertz 不仅仅是兆赫](#morethanjustmegahertz)
2. [Pipelining & Instruction-Level Parallelism 流水线技术与指令级并行性](#pipeliningandinstructionlevelparallelism)
3. [Deeper Pipelines – Superpipelining 更深的管道 —— 超级流水线](#deeperpipelinessuperpipelining)
4. [Multiple Issue – Superscalar 多发技术 —— 超标量体系结构](#multipleissuesuperscalar)
5. [Explicit Parallelism – VLIW 显式并行 —— VLIW](#explicitparallelismvliw)
6. [Instruction Dependencies & Latencies](#instructiondependenciesandlatencies)
7. [Branches & Branch Prediction](#branchesandbranchprediction)
8. [Eliminating Branches with Predication](#eliminatingbrancheswithpredication)
@@ -53,7 +53,7 @@ More Than Just Megahertz

The first issue that must be cleared up is the difference between clock speed and a processor's performance. _They are not the same thing._ Look at the results for processors of a few years ago (the late 1990s)...

第一个必须澄清的问题是时钟速度和处理器性能之间的区别。*它们并不是同一回事*。看看几年前(90年代末)处理器的表现……

| | | SPECIT95 | SPECFP95 |
| :-----: | :----: | :----: | :----: |
@@ -99,7 +99,7 @@ Figure 1 – The instruction flow of a sequential processor.

Modern processors overlap these stages in a _pipeline_, like an assembly line. While one instruction is executing, the next instruction is being decoded, and the one after that is being fetched...

现代处理器将这些阶段在*流水线*上重叠,就像生产线一样。当一条指令正在执行时,下一条指令正在被解码,而后一条指令正在被获取……

![](pipelined2.png)
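
The overlap described above can be put into numbers with a small sketch. It assumes a hypothetical four-stage pipeline (fetch, decode, execute, writeback, abbreviated `F`, `D`, `E`, `W`) in which every stage takes exactly one cycle and nothing ever stalls; the function names are made up for illustration.

```python
def sequential_cycles(n_instructions: int, n_stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Stages overlap: once the pipeline is full, one instruction completes per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def pipeline_diagram(n_instructions: int, stages=("F", "D", "E", "W")) -> list[str]:
    """One text row per instruction; column k shows the stage occupied in cycle k."""
    rows = []
    for i in range(n_instructions):
        # Instruction i enters the pipeline i cycles after the first one.
        rows.append("." * i + "".join(stages) + "." * (n_instructions - 1 - i))
    return rows


if __name__ == "__main__":
    print(sequential_cycles(5, 4), "cycles sequentially")
    print(pipelined_cycles(5, 4), "cycles pipelined")
    print("\n".join(pipeline_diagram(3)))
```

Under these idealised assumptions the latency of a single instruction is unchanged; only the rate at which instructions complete improves.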

@@ -127,7 +127,7 @@ Figure 3 – A pipelined microarchitecture.

Since the result from each instruction is available after the execute stage has completed, the next instruction ought to be able to use that value immediately, rather than waiting for that result to be committed to its destination register in the writeback stage. To allow this, forwarding lines called _bypasses_ are added, going backwards along the pipeline...

因为每个指令的结果在完成执行阶段之后是可用的,所以下一个指令应该能够立即使用该值,而无需等待该结果在回写阶段被提交至目标寄存器。为了实现这一点,被称为*旁路*的转发线被加入架构中,沿着管道向后移动……

![](pipelinedbypasses2.png)

Figure 4 – A pipelined microarchitecture with bypasses.
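
A rough way to see what a bypass buys is to count the bubbles a dependent instruction would need with and without one. The sketch below is a toy model, not a description of any real processor: it assumes a four-stage pipeline and, as a simplification, that operands are needed at the start of the consumer's execute stage. `stall_cycles` and its parameters are hypothetical names.

```python
# Toy 4-stage pipeline; the stage names follow the text, the timing model is an assumption.
STAGES = ("fetch", "decode", "execute", "writeback")


def stall_cycles(distance: int, bypass: bool) -> int:
    """Bubbles a consumer needs when it trails its producer by `distance` instructions.

    Assumes operands are needed at the start of the consumer's execute stage.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    execute = STAGES.index("execute")
    # With a bypass the value is usable as soon as execute finishes;
    # without one it must first be committed by the writeback stage.
    ready_stage = execute if bypass else STAGES.index("writeback")
    first_usable_cycle = ready_stage + 1  # the producer enters fetch in cycle 0
    wanted_cycle = execute + distance     # the consumer's execute cycle if nothing stalls
    return max(0, first_usable_cycle - wanted_cycle)


if __name__ == "__main__":
    print("back-to-back, with bypass:   ", stall_cycles(1, bypass=True))
    print("back-to-back, without bypass:", stall_cycles(1, bypass=False))
```

Pipelines that read registers earlier, for example during decode, would pay more without a bypass than this model suggests.
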
@@ -136,7 +136,7 @@ Figure 4 – A pipelined microarchitecture with bypasses.

Although the pipeline stages look simple, it is important to remember the _execute_ stage in particular is really made up of several different groups of logic (several sets of gates), making up different _functional units_ for each type of operation the processor must be able to perform...

虽然流水线各阶段看起来很简单,但重要的是要理解:*执行*阶段实际上是由不同的逻辑组(几组逻辑门)组成,它们形成了不同的*功能单元*,使得处理器能够执行各种必须的操作……

![](pipelinedfunctionalunits2.png)

@@ -146,7 +146,7 @@ Figure 5 – A pipelined microarchitecture in more detail.

The early RISC processors, such as IBM's 801 research prototype, the MIPS R2000 (based on the Stanford MIPS machine) and the original SPARC (derived from the Berkeley RISC project), all implemented a simple 5-stage pipeline not unlike the one shown above. At the same time, the mainstream 80386, 68030 and VAX CISC processors worked largely sequentially – it's much easier to pipeline a RISC because its _reduced instruction set_ means the instructions are mostly simple register-to-register operations, unlike the complex instruction sets of x86, 68k or VAX. As a result, a pipelined SPARC running at 20 MHz was way faster than a sequential 386 running at 33 MHz. Every processor since then has been pipelined, at least to some extent. A good summary of the original RISC research projects can be found in the [1985 CACM article](http://dl.acm.org/citation.cfm?id=214917) by David Patterson.

早期的 RISC 处理器,如 IBM 的 801 研究原型、MIPS R2000(基于斯坦福 MIPS 机器)和原始的 SPARC (源自伯克利 RISC 项目),实现了简单的,与上图并无不同的 5 级流水线。同时,主流的 80386、68030 和 VAX CISC 处理器基本上是顺序工作的 —— 流水线化 RISC 更容易。因为其*简化的指令集*意味着大多指令都是简单的寄存器到寄存器的操作,而不像 x86、68k 或 VAX,他们拥有复杂的指令集。这使得 20 MHz 的流水线 SPARC 比 33 MHz 的顺序 386 运行速度快得多。从那时起,每个处理器都被流水线化了,至少在某种程度上是如此。在 1985 年由 David Patterson 撰写的 [CACM 文章](http://dl.acm.org/citation.cfm?id=214917) 中可以找到对 RISC 原始研究项目的一个好的总结。


Deeper Pipelines – Superpipelining
@@ -155,7 +155,7 @@ Deeper Pipelines – Superpipelining
----------------------------------
Since the clock speed is limited by (among other things) the length of the longest, slowest stage in the pipeline, the logic gates that make up each stage can be _subdivided_, especially the longer ones, converting the pipeline into a deeper super-pipeline with a larger number of shorter stages. Then the whole processor can be run at a _higher clock speed!_ Of course, each instruction will now take more cycles to complete (latency), but the processor will still be completing 1 instruction per cycle (throughput), and there will be more cycles per second, so the processor will complete more instructions per second (actual performance)...

鉴于时钟速度受到(除了其他原因之外)流水线中最长、最慢的阶段的长度的限制,组成每个阶段的逻辑门可以被*细分*,尤其是那些较长的逻辑门,从而将流水线转换为具有更多更短阶段的深层超级流水线。这样,整个处理器就能够以*更高的时钟速度*运行!当然,每个指令会需要更多的周期来完成(延迟),但是处理器仍然会每周期完成一个指令(吞吐量),并且每秒会有更多的周期,所以处理器每秒将完成更多的指令(实际性能)……

![](superpipelined2.png)
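
The latency versus throughput trade-off can be checked with simple arithmetic. The sketch below measures time in gate delays and assumes the logic divides evenly between stages, with a fixed latch overhead per stage; the figures used (100 gate delays of logic, 4 for the latch) are invented for illustration and are not taken from any real design.

```python
def cycle_time(total_logic_gates: float, stages: int, latch_gates: float) -> float:
    """Gate delays per clock cycle when the logic is split evenly across the stages."""
    return total_logic_gates / stages + latch_gates


def latency(total_logic_gates: float, stages: int, latch_gates: float) -> float:
    """Gate delays for one instruction to travel the whole pipeline."""
    return stages * cycle_time(total_logic_gates, stages, latch_gates)


def clock_speedup(total_logic_gates: float, shallow: int, deep: int, latch_gates: float) -> float:
    """How much faster the deeper pipeline can be clocked than the shallower one."""
    return (cycle_time(total_logic_gates, shallow, latch_gates)
            / cycle_time(total_logic_gates, deep, latch_gates))


if __name__ == "__main__":
    for stages in (5, 10):
        print(stages, "stages:",
              cycle_time(100, stages, 4), "gate delays per cycle,",
              latency(100, stages, 4), "gate delays of latency")
    print("clock speedup:", round(clock_speedup(100, 5, 10, 4), 2))
```

Because the latch overhead is paid once per stage, doubling the depth in this model raises the clock rate by less than a factor of two while per-instruction latency grows.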

@@ -167,103 +167,63 @@ The Alpha architects in particular liked this idea, which is why the early Alpha

Alpha 的架构师特别喜欢这个想法。这就是为什么早期的 Alpha 拥有深层管道,并在他们那个时代有如此高的时钟速度。如今,现代处理器努力将各阶段的门延迟保持在少数,大约 12-25 个门深(并非全部!)再加上 3-5 个闩锁本身。大部分处理器都有相当深的管道…

| Pipeline Depth 流水线深度 | Processors 处理器 |
| :-----: | :----: |
| 6 | UltraSPARC T1 |
| 7 | PowerPC G4e |
| 8 | UltraSPARC T2/T3, Cortex-A9 |
| 10 | Athlon, Scorpion |
| 11 | Krait |
| 12 | Pentium Pro/II/III, Athlon 64/Phenom, Apple A6 |
| 13 | Denver |
| 14 | UltraSPARC III/IV, Core 2, Apple A7/A8 |
| 14/19 | Core i×2/i×3 Sandy/Ivy Bridge, Core i×4/i×5 Haswell/Broadwell |
| 15 | Cortex-A15/A57 |
| 16 | PowerPC G5, Core i×1 Nehalem |
| 18 | Bulldozer/Piledriver, Steamroller |
| 20 | Pentium 4 |
| 31 | Pentium 4E Prescott |

Table 2 – Pipeline depths of common processors.

表2 —— 普通处理器的流水线深度

The x86 processors generally have deeper pipelines than the RISCs (of comparable era) because they need to do extra work to decode the complex x86 instructions (more on this later). UltraSPARC T1/T2/T3 Niagara are a recent exception to the deep-pipeline trend – just 6 for UltraSPARC T1 and 8 for T2/T3 to keep those cores as small as possible (more on this later, too).

x86 处理器通常具有比(同时代)RISC 更深的流水线,因为它们需要做额外的工作来解码复杂的 x86 指令(稍后将详细介绍)。UltraSPARC T1/T2/T3 Niagara 是最近深层管道趋势的一个例外 —— 为了保持这些核尽可能小,UltraSPARC T1 只有 6 层,T2、T3 只有 8 层(稍后将详细介绍)。

Multiple Issue – Superscalar
----------------------------
多发技术 —— 超标量体系结构
----------------------------

Since the execute stage of the pipeline is really a bunch of different _functional units_, each doing its own task, it seems tempting to try to execute multiple instructions _in parallel_, each in its own functional unit. To do this, the fetch and decode/dispatch stages must be enhanced so they can decode multiple instructions in parallel and send them out to the "execution resources"...

流水线的执行阶段实际上是一组不同的*功能单元*各自执行自己的任务,由此一个诱人的设想产生了,即多个命令各自在自己的功能单元中*同时*执行。为此,必须增强提取阶段及解码、分派阶段,以便它们能够并行解码多个指令,并将它们发送到“执行资源”……

![](superscalarmicroarch2.png)

Figure 7 – A superscalar microarchitecture.

图7 —— 超标量微体系结构

Of course, now that there are independent pipelines for each functional unit, they can even have different numbers of stages. This allows the simpler instructions to complete more quickly, reducing _latency_ (which we'll get to soon). Since such processors have many different pipeline depths, it's normal to refer to the depth of a processor's pipeline when executing _integer_ instructions, which is usually the shortest of the possible pipeline paths, with the memory and floating-point pipelines implied as having a few additional stages. Thus, a processor with a "10-stage pipeline" would use 10 stages for executing integer instructions, perhaps 12 or 13 stages for memory instructions, and maybe 14 or 15 stages for floating-point. There are also a bunch of bypasses within and between the various pipelines, but these have been left out of the diagram for simplicity.

当然,既然每个功能单元都有了独立的管道,那么它们甚至可以具有不同的阶段数。这将能使简单的指令更快地完成,从而减少*延迟*(我们将很快讨论这个问题)。由于这些处理器具有许多不同的流水线深度,所以在提到这些处理器的流水线深度时,通常指处理器在执行*整型*指令时的深度,因为整数指令通常是可能的流水线路径中最短的,而内存和浮点流水线则可能有一些附加的阶段。因此,具有“10 级流水线”的处理器将使用 10 个阶段来执行整型指令,可能用 12 或 13 个阶段用于存储器指令,可能有 14 或 15 个阶段用于浮点指令。在各个管道内和管道之间也有一些旁路,但是为了简单起见,在图中省略了这些旁路。


In the above example, the processor could potentially issue 3 different instructions per cycle – for example 1 integer, 1 floating-point and 1 memory instruction. Even more functional units could be added, so that the processor might be able to execute 2 integer instructions per cycle, or 2 floating-point instructions, or whatever the target applications could best use.

在上面的例子中,处理器可能每个周期发出 3 个不同的指令 ——例如 1 个整型、1 个浮点型和 1 个内存指令。甚至还可以添加更多的功能单元,以便处理器能够在一个周期内执行 2 个整型指令,或 2 个浮点指令,或者目标应用程序能够使用的任何最佳指令。
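
How many instructions actually issue together depends on the mix of instruction types and on how many functional units of each type exist. The following sketch packs an in-order instruction stream into issue cycles. It is a deliberately simplified model: data dependencies and latencies are ignored, and `issue_schedule`, the kind labels `"int"`, `"fp"`, `"mem"` and the unit counts are all hypothetical.

```python
from collections import Counter


def issue_schedule(stream, units):
    """Pack an in-order instruction stream into issue cycles.

    `stream` is a list of instruction kinds; `units` maps each kind to the
    number of functional units of that kind. Data dependencies are ignored.
    """
    cycles, current, used = [], [], Counter()
    for kind in stream:
        if units.get(kind, 0) < 1:
            raise ValueError(f"no functional unit for {kind!r}")
        if used[kind] >= units[kind]:
            # In-order issue: the first instruction that finds its unit busy
            # ends the current cycle and starts the next one.
            cycles.append(current)
            current, used = [], Counter()
        current.append(kind)
        used[kind] += 1
    if current:
        cycles.append(current)
    return cycles


if __name__ == "__main__":
    stream = ["int", "fp", "mem", "int", "int", "fp"]
    for units in ({"int": 1, "fp": 1, "mem": 1}, {"int": 2, "fp": 1, "mem": 1}):
        schedule = issue_schedule(stream, units)
        print(units, "->", len(schedule), "cycles:", schedule)
```

With one unit of each kind this stream needs three issue cycles; adding a second integer unit brings it down to two, which is the sense in which extra units are chosen to suit the target applications.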


On a superscalar processor, the instruction flow looks something like...

在超标量处理器上,指令流看起来像…


![](superscalar2.png)

Figure 8 – The instruction flow of a superscalar processor.