Parallelism is a concept by which microprocessor operations are simultaneously carried out by duplicated hardware units . In parallelism, the data/instruction to be executed is distributed among the duplicated hardware units for execution. The duplicated hardware unit in the above definition may be multiprocessors, networked computers, a specialized hardware unit or a combination of all. Parallelism in computing includes:
- instruction-level parallelism,
- data-level parallelism
- thread-level parallelism.
The instruction-level parallelism in advanced microprocessor
Pipelining can be described as the use of many execution stages to carry out a single process execution e.g. the use of many stages in industrial production. The use of a pipeline in Intel microprocessors started with the Intel 8086 microprocessor which has two stages, thus:
- The bus interface unit which carries out the fetching of instructions, data, the calculation of physical address, the queue of fetched instruction, and
- The execution unit which carries out the decoding of fetched instruction, calculation of effective address, and the execution of data.
The pipeline in Intel 8086 is called a primitive pipeline.
- Core i7 pipeline structure
In modern microprocessors, there are many pipeline stages and sub-stages. This can be seen in Intel Core i7 which has to two main pipelines, thus Instruction fetch pipeline (In-order Front End Pipeline) and Execution Pipeline (Execution Engine).
- In-Order Front End Pipeline in Core i7
The In-Order Front End Pipeline of Core i7 is the part of the processor that handles the fetching, decoding, fusing and queuing of instructions from memory (Cache or main memory). It is in-order in the sense that it fetches and decodes instructions sequentially. The front end pipeline is made up of seven stages. These stages are:
- Instruction fetches unit
- Instruction Length Decoder Unit
- Instruction Queue Unit
- Instruction Decoder Unit
- Micro-program Control
- macro-fusion Unit
- Instruction Decoder Queue Unit
Instruction Fetch Unit
The Instruction Fetch Unit of core i7 is responsible for fetching instructions from the memory. It is made up of the L1 instruction cache, ITLB, the branch prediction unit, and the fetching hardware. The IFU and fetch 16 bytes of instructions from aligned memory at a time. One of the most important units in the instruction fetch unit is the ITLB. ITLB translates the virtual address of the instruction to be fetched into the physical address. It is used to improve the speed of address translation of processors. Another important unit in the IFU is the branch prediction unit. It allows the microprocessor to start speculative fetching and processing of guessed the branch instruction direction even before the loop condition is executed to determine the true direction. This is a method used in all modern processors to reduce stalling. It should be noted that each core in core i7 has two threads. For that reason, 16bytes (4 instructions) are fetching for each thread in alternative cycles. The Core i7 is a multiple issue processor and therefore can issue more than one instruction in a cycle.
The performance of the microprocessor is highly improved if the guess is right but it limits the performance if the guess is wrong as all the execution due to the speculation will be cancelled. The accuracy of the branch prediction unit depends on the Return Stack Buffer. The Return Stack Buffer enables the branch prediction unit to make an accurate RET instruction prediction . It has 16 entries. Another important part of the branch predicting unit is the Front-End queue. It acts as the lookup table for BPU during prediction.
The L2 cache is another part in the instruction fetch unit that helps in fetching and storing of data from main memory. In core i7, the size of the L2 cache is 256kB and is 8-way set associative. It is 8-way set associative in the sense that it used set-associative cache and each set contains 8 cache lines or cache blocks.
Instruction Length Decoder Unit (ILD)
The Instruction Length Decoder is the second stage of the Front-End pipeline of core i7. It prepares the fetched instruction for decoding at the decoder unit. It determines the length of each instruction, merges some of them if needed, decodes the prefix modifier associated with instruction and then notes the properties of each instruction. The prefix modifier decoding is done by the pre-decoder. The pre-fetch buffer in ILD stores the instruction while the ILD operates on it. The ILD can accept 16 bytes of instructions in one cycle and can write 6 instructions per cycle into the Instruction Queue (IQ) . The ILD is one of the blocks in core i7 that differentiate its front-end pipeline from the RISC front-end pipeline . It is necessary because CISC processors use a variable instruction format.
Instruction Queue Unit
The Instruction Queue buffers the instructions processed by the ILD. It can buffer up to 18 instructions at a time . The use of IQ helps reduce the stalling of the lower pipeline stages. It can deliver up to 5 instructions in one cycle to the decoder unit. The use of queue in the pipeline causes delay . It should be noted that queue is necessary for a pipelined processor, hiding the latency gap between different stages.
Instruction Decoder Unit
This pipeline stage decodes the instructions pre-processed by the ILD into micro-ops. It contains four decoding units that decode the instructions; thus, three simple decoders and one complex decoder . The simple Decoder can only decode instructions that can translate to not more than four micro-ops while the complex decoder can decode any instruction. For instructions that translate to more than four micro-ops, the complex decoder uses the Microprogram control ROM to interpret the instruction . Each decoder in the instruction decoder unit also contains a Micro-Fusion unit which helps them combine many simple micro-ops from the same instruction into complex micro-ops. It should be noted that the decoder state handles instructions as stream and therefore decodes the instruction in parallel. This implies that the Instruction Decoder unit implements Super-scalar pipeline.
The Micro program is the part of the Decoder unit that helps the decoder to be able to interpret a complex instruction. It contains the Microprogram sequencer and the control ROM. The micro-sequencer uses the instruction from the IDU to generate the address of the location in the ROM that contains the micro-ops for the instruction being decoded. It is needed for complex instructions that the decoder hardware cannot handle .
Instruction Decoder Queue
The instruction decoder queue is the FIFO storage where the micro-ops from the IDU is temporarily stored before dispatch. It has 28 slots . The IDQ is the input to the LSD.
Loop Stream Detector
This is a stage in the front end pipeline of core i7 that detects when a loop instruction is being executed by the CPU. When a loop is detected, this stage will shut down the instruction decoder unit, the instruction fetch unit to minimize power consumption. It will then start streaming the loop micro-op for execution. It has 28 entries. The Loop Stream Detector normally saves decoded loops, which makes it a little similar to Trace Cache of Pentium 4 processors. However, the Loop Stream Detector of Nehalem CPUs is a special cache .
- Execution Pipeline
The execution pipeline of the Intel Core i7 is the part where micro-ops are executed. The time a micro-op spent in the execution pipeline depends on its dependencies. Instructions that are dependent on one another are executed sequentially otherwise, they can be executed out-of other . It is an Out-of-Order pipeline which receives micro-ops from the IDQ and dynamically schedules them for dispatch and execution. The execution stage of the execution pipeline has a superscalar pipeline which allows micro-ops to execute in parallel which no code semantics (syntax) is broken. The Execution Pipeline is divided into four stages, thus :
- Register Alias Table, Rename and Allocation (RAT)
- Reorder Buffer
- Unified Reservation Station
- Execution Unit
- Memory Reorder Buffer
Register Alias Table, Rename and Allocation
The Register Alias Table is the first pipeline stage in the execution pipeline of Intel core i1. It has many functions which make it inevitable. The core function of RAT is register renaming which eliminates false dependency (such as WAW, WAR and RAW) . This is achieved by mapping all logical addresses into physical addresses This method solves the problem of dependency since physical registers are much larger in number compare to the logical registers and this reduces the concurrent register re-use . It also prioritizes the micro-ops. This implies creating another in which the micro-ops will be scheduled and executed. The physical registers are contained in the Reorder Buffer. It should be noted that RAT solves the problem of name dependency (Structure hazard) not data hazard . The output of the RAT is the input to the Unified Reservation Station.
Unified Reservation Station (URS)
The URS is the second stage in the Execution pipeline. It has 36 entries and stores micro-ops until all its operands are available . It receives the micro-ops from the RAT. It schedules and dispatches all the micro-ops with their operand available to the execution stage. The URS also contains a scheduler that schedules the dispatch of the micro-ops. The URS has six output ports. Each port is associated with an execution cluster. The problem of data dependency is solved in URS using bypass bus and pipeline interlock .
The Execution Unit is the third stage in the Execution Engine pipeline. The Intel Core i7 has six parallel execution clusters. This is an example of a superscalar execution unit with sub-pipelined and sub-superscalar units. Each cluster in the execution unit may be seen as a sub-superscalar and can perform multiple executions at the same time in parallel . It can perform integer arithmetic operation, shift operation, SIMD operation and shuffle, floating-point operation, streaming SIMD operation at the same time. The port 1 also has multiple execution units which can perform an integer operation, SIMD multiplication, integer load effective address operation, comparison operation, and floating-point addition. The duplication of execution units enhances the performance of the microprocessor. It is also a way of avoiding the stalling of the pipeline. This means that even if one port is busy with execution, the Reservation station will schedule another unit. Each sub-execution unit hardware is pipelined to enhance performance . The pipelining of the execution unit in i7 made it possible to solve the problem of processing micro-ops with long execution time. This implies that the unit can receive new micro-op every cycle for execution. The execution unit in core i7 is made up of the integer unit, floating-point unit, SSE unit, SIMD unit, Store address unit, load address unit, load/store data unit. The floating-point unit includes floating-point Add and Multiply. The execution unit of the Intel Core i7 is shown in Fig. 1.2. The FPU in core i7 is heavily pipelined to enhance the performance. The SSE and SIMD Unit have MXM and MMX registers respectively which buffers the SSE and SIMD data during operation. Due to the amount of memory needed by SSE and SIMD operations, they are streamed in and out of the processor without storage in the cache or registers.
The Reorder Buffer is the fourth stage of the Execution Pipeline. It has 128 entries . It stores micro-ops that are at various stages of completion. The Reorder Buffer reorders micro-ops in the order in which their respective instructions are fetched  before writing the completed micro-ops to the retirement registers. This helps to eliminate any disarrangement caused by the out-of-order execution. It is also an input to the URS.
1.3. Pipelining hazard
Pipelining is a process whereby the execution of an instruction is broken down into stages to ensure the overlapping of instruction fetching. Pipelining leads to hazards. There are three types of pipeline hazard, thus:
- Data hazard
- Structural hazard
- Control hazard
1.3.1. Data hazard
The data hazard occurs mostly in processors with superscalar execution unit. It is a situation whereby an instruction is not scheduled for execution because the data it needs is not yet available for the instruction that will produce the data is still executing. Data hazard is also called true or data dependency to differentiate it from false or name dependency. The problem of data hazards can be solved using bypass bus and pipeline interlock.
1.3.2. Structural hazard
One important form of structural hazard is the name dependency hazard. This is caused by insufficient register resources to avoid register conflict. The main cause of the name dependency problem is read after write (RAW). This occurs when an instruction reads wrong data from the register because another instruction has written into the register. The problem of dependency can be solved using the register renaming method.
188.8.131.52. Register renaming
The concept of register renaming means simulating the limited architectural registers to unlimited registers and then assigning unique virtual registers to each instruction provided none of the instruction is data-dependent. The use of separate data and instruction cache is also a way of preventing structural hazard due to bus conflict. The register renaming is a necessary mechanism in processors with out-of-order ability. When more than one instruction references a particular register for an operand, either reading or writing, executing those instructions in an out-of-order can lead to three kinds of data hazards:
A read-after- write is the most dangerous type of false dependency. A read from a register location must return the value placed there by the last write-in program order. This is referred to as false dependency as the new data are not needed and requires the instructions to execute in program order.
Successive writes to a particular register or memory location must leave that location containing the result of the second write. This can be resolved by squashing (synonyms: cancelling, annulling, mooting) the first write if necessary. WAW dependencies are also known as output dependencies.
A read from a register or memory location must return the last prior value written to that location, and not one written programmatically after the read. This is the sort of false dependency that can be resolved by renaming. WAR dependencies are also known as anti-dependencies. Instead of delaying the write until all reads are completed, two copies of the location can be maintained, the old value and the new value. Reads that precede, in program order, the write of the new value can be provided with the old value, even while other reads that follow the write are provided with the new value.
The false dependency is broken and additional opportunities for out-of-order execution are created. When all reads needing the old value have been satisfied, it can be discarded. This is the essential concept behind register renaming. Anything that is read and written can be renamed. While the general-purpose and floating-point registers are discussed the most, flag and status registers or even individual status bits are commonly renamed as well.
Register renaming styles
There are two styles of register renaming, distinguished by the circuit which holds data ready for an execution unit. The underlining principles in register renaming schemes are the converting of the architectural registers referenced in the instruction stream into physical registers. The architectural registers are specified by 3 to 5 bits while the tags are usually a 6 to 8-bit number. Because the size of a register file generally grows as the square of the number of ports, the renamed file is usually physically large and consumes significant power.
In the tag-indexed register file scheme, there is one large register file for data values, containing one register for every tag. For example, if the machine has 80 physical registers, then it would use 7-bit tags. 48 of the possible tag values, in this case, are unused. In this style, when an instruction is issued to an execution unit, the tags of the source registers are sent to the physical register file, where the values corresponding to those tags are read and sent to the execution unit.
If instruction 2 does not depend on the data from instruction 1, show how to resolve the problems between the two instructions .
- ADD R0, R2
- ADD R3, R0
For example 1, instruction number 1 and number2 have a false dependency and therefore need register renaming. This can be done by assigning a unique physical register to each architectural register within a given instruction, as shown in Fig 4.1. It should be noted that the mapping of architectural registers to physical registers is fully associative. This means that during renaming, any architectural register can be assigned to any physical register provided there will be no conflict. From Fig. 4.1, the conflict for R0 by instruction 1 and 2 is solved by assigning P0 to R0 in instruction 1 and assigning P4 to R0 in instruction 2. It should be noted that register renaming cannot be used in case of true data dependency.
Example 2 .
- R1 = sqrt(16)
- R2 = R1+2
In this example, although the two instructions are not data-dependent, instruction 1 and 2 are reusing the register R1. For that reason, it is required that a new separate and unique physical registers be assigned to each register in each respective instruction. This is shown in Fig. 4.2.
It should be noted that the assignment is fully associative and therefore each architectural register can be assigned to any physical register. From the diagram, the R1 register in instruction 1 is mapped to P1 while the R1 in instruction 2 is mapped to P3. By so doing, the register conflict due to the reuse of R1 will be solved.
In example three, the two instructions conflict for R1 while they are not data-dependent. This problem can be solved by using register renaming. Referring to Fig4.2, it will be seen that R1 in instruction 1 maps to P1 while R1 in instruction 2 is mapped to P4. This method ensures that no register conflict occurs between instruction 1 and 2. This implies that the two instructions can be executed in parallel.
Example four also shows another example of register renaming. This example can also be explained using Fig. 4.2. In this example, the R2 registers in the individual instructions will be assigned to unique physical registers. In this case, R2 in instruction 1 will be assigned to P2 while R2 in instruction 2 will be assigned to P5. This ensures that both instructions can be executed in parallel.
|α: R3 = #30||R3 → P13||Register R3 is mapped to Physical register P13|
|β: R1 = R3 div 3||R1→P14, R3 →P13||R3 is still mapped to P13. This means that instruction α and β are data-dependent. This implies they cannot execute in parallel.|
|γ: BNZ R1 λ||R1→P14||Since R1 is still mapped to P14, instruction β and ϒ are data-dependent. This means they cannot execute in parallel|
|δ: R2 = R3 -1||R2→P15, R3 →P13||Since R3 is still mapped to P13. Therefore, instruction β and δ are data-dependent. This means they cannot execute in parallel|
|ε: R2 = R2 – 2||R2(dest)→P18, R2(scr) →P15||Since R2 is still mapped to P18 and p15. Therefore, instruction ε and δ are data-dependent. This means they cannot execute in parallel|
|ζ: R2 = R2 – 3||R2(dest)→P17, R2(scr) →P16||Since R2 is still mapped to P17 and p16. Therefore, instruction ζ is named dependent on δ and ε but not data-dependent. This means that after renaming, it can execute in parallel with any instruction before it.|
|η: R2 = R2 – 4||R2 (dest)→P18, R2 (scr) →P17||Since R2 is mapped to P18 and p15. Therefore, instruction η is data-dependent on ε and ζ. This means η cannot execute in parallel with ε and ζ.|
184.108.40.206. Out-of-order execution
The out-of-order execution is a method used in advance microprocessors to solve the problem of the stall execution unit. It involves scheduling and issuing of instructions in out-of-order to the execution unit provided none of the two instruction issued at the same time is data-dependent. It is used in the processor with the superscalar execution unit.
220.127.116.11. Branch prediction
Branch prediction is a method used in advance processors to predict the possible direction a branch will even before the outcome of the decoding is known. It involves fetching an instruction from the predicted direction and executing the instruction without waiting for the decoding to finish. This method can lead to gain in performance if the prediction is right but if it happens to be wrong, all the instruction fetched will be cancelled and everything started afresh in the right direction.
A superscalar architecture is an architecture with multiple parallel pipeline stages, which processes instructions simultaneously within a single instruction thread . A superscalar architecture is a form of parallelism that implements instruction-level parallelism in a processor . It allows higher throughput than would be possible in a non-superscalar architecture at a given clock rate. The superscalar processor executes more than one instruction in a clock cycle by simultaneously dispatching multiple data-independent instructions to redundant functional execution units on the processor. It should be noted that each functional unit is not a separate CPU core but an execution resource within a single core. Some examples of execution resources are:
Arithmetic logic unit
Floating point unit
A processor that contains a single execution resource is classified as a SISD processor (Single Instructions, Single Data), while a multiple execution resource processor is classified as a MIMD processor (Multiple Instructions, Multiple Data) . It is to be noted that all the superscalar processors are also pipelined to enhance performance. Pipelining and superscalar architecture are different performance enhancement techniques. The superscalar architecture has the following identifying characteristics:
Instructions are issued from a sequential instruction stream
The microprocessor hardware dynamically checks for data dependencies between instructions before issuing them for execution. Data-dependent instructions cannot be issued at the same time. Any instruction that depends on the data to be produced by another instruction is not issued for execution until the data becomes available. This implies that the instruction to produce the data must be scheduled first.
The CPU accepts multiple instructions per clock cycle
In the superscalar processor, each instruction processes one data item. Because there are many independent functional units, the processor can issue many instructions to separate functional unit at the same time. Each of these instructions will process different data at the same time, thus multiple instructions can be processing separate data items concurrently .
The superscalar processor design emphasizes improving the instruction dispatching accuracy, thus, allowing it to keep the multiple functional units in use at all times. It becomes very useful when the number of redundant execution units increases. In modern processors, for instance, core i7, there are six ports, each port instructing many execution units as shown in Fig.1.1. If the dispatcher is ineffective at keeping all of these units fed with instructions, the performance of the system will suffer.
A superscalar processor usually sustains an execution rate over one instruction per machine cycle. It should be noted that processing multiple instructions concurrently does not make an architecture superscalar; pipelined, multiprocessor or multi-core architectures processes multiple instructions but they are not superscalar. Superscalar processors differ from multi-core processors in that the redundant functional units are not entire processors. In advanced processors, Simultaneous multithreading (SMT), is a technique for improving the overall efficiency of superscalar CPUs. SMT permits multiple independent threads of execution to better utilize the execution resources . In a superscalar processor, the dispatcher examines all the instruction from the decoder unit and decides which ones can be run in parallel. Instructions that can run in parallel are dispatched o redundant functional units.
Limitations of the pipelined superscalar architecture
There are conditions under which two arithmetic instructions cannot be safely dispatched in parallel for simultaneous execution by the pipelined superscalar ALUs. Such conditions are called hazards, and they can all be placed in one of three categories:
Because pipelining is a form of parallel execution, these three types of hazards can also hinder the pipelined execution, causing bubbles to occur in the pipeline.
Data hazards are a problem for both superscalar and pipelined execution . The data hazard limits the performance of the superscalar architecture by dictating to the scheduler which instructions can be issued together. In the superscalar processor, only instructions that are not data-dependent can be executed in parallel.
Example 1. A data hazard
|Line Code Comments 1 A ← B + C Add the numbers in registers C and B and store the result in A. 2 E ← D + A Add the numbers in registers A and D and store the result in C.|
In example one, instruction 2 depends on the outcome of the t instruction 1, the two instructions cannot be executed simultaneously. Rather, instruction1 must finish first, so that the result will be in A before instruction 2 resumes.
If Example 1 is run on a superscalar processor with two integers ALUs, the two instructions cannot be executed simultaneously by the two ALUs. Rather, the ALU executing the instruction 1 has to finish first, and then the other ALU can execute instruction 2. Similarly, in a pipelined processor, instruction has to wait until the first add completes the writing stage before it can enter the execute phase. In most advanced processors, the concept of data forwarding and pipeline interlock is used to reduce the challenges due to data dependency. With forwarding, the processor takes the result of the instruction 1 from the ALU’s output port and feeds it directly back into the ALU’s input port, bypassing the register-file writes stage. Thus; instruction 2 can start immediately instruction 1 finishes executing.
Register renaming is a trick that helps overcome data hazards on superscalar machines. Since any given machine’s programming model often specifies fewer registers than can be implemented in hardware, a given microprocessor implementation often has more registers than the number specified in the programming model.
Another challenge facing pipelined superscalar is the structural hazard. It involves a situation where enough execution resources but other supporting resources are missing. For example, if there is a register conflict due to lack of enough registers.
Example 4-3. A structural hazard
|Line Code Comments 15 A←C+D Add the numbers in registers A and B and store the result in B. 16 E←G+F Add the numbers in registers C and D and store the result in D.|
In example 2, there is no data hazard because the two instructions do not depend on each other. So it should be possible to execute them in parallel. However, this example presumes that both ALUs share the same group of four registers and they must be able to accommodate two simultaneous writes. Otherwise, the execution of these two instructions in parallel would trigger what’s called a structural hazard, where the processor doesn’t have enough resources to execute both instructions at once.
Methods of reducing the effect of structural hazard on superscalar a superscalar design with multiple ALUs, it would take an enormous number of wires to connect each register directly to each ALU . This problem gets worse as the number of registers and ALUs increases. Hence, in superscalar designs with a large number of registers, a CPU’s registers are grouped into a special unit called a register file. This unit is a memory array, much like the array of cells that makes up a computer’s main memory, and it’s accessed through a special interface that allows the ALU to read from or write to specific registers. This interface consists of a data bus and two types of ports: the read ports and the write ports. To read a value from a single register in the register file, the ALU accesses the register file’s read port and requests that the data from a specific register be placed on the special internal data bus that the register file shares with the ALU. Likewise, writing to the register file is done through the file’s write port. A single read port allows the ALU to access a single register at a time, so for an ALU to read from two registers simultaneously; the register file must have two read ports. Likewise, a write port allows the ALU to write to only one register at a time, so a single ALU needs a single write port to be able to write the results of operation back to a register. Therefore, the register file needs to read ports and one write port for each ALU. So for the two-ALU superscalar design, the register file needs a total of four read ports and two write ports.
It so happens that the amount of die space that the register file takes up increases approximately with the square of the number of ports, so there is a practical limit on the number of ports that a given register file can support. This is one of the reasons why modern CPUs use separate registry files to store integers, floating-point, and vector numbers. Since each type of math instruction (integer, floating-point, and vector) uses a different type of execution unit, attaching multiple integers, floating-point, and vector execution units to a single register file would result in quite a large file. Such a large single register file with a single port may not enhance performance due to structural hazard.
Control hazards, also known as branch hazards, are hazards that arise when the processor arrives at a conditional branch and has to decide which instruction to fetch next. In some primitive processors, the pipeline stalls while the branch condition is evaluated and the branch target is calculated. This stall inserts a few cycles of bubbles into the pipeline, depending on how long it takes the processor to identify and locate the branch target instruction. Modern processors use a technique called branch prediction to get around these branch-related stalls.
Another potential problem associated with branches lies in the fact that once the branch condition is evaluated and the address of the next instruction is loaded into the program counter, it then takes some cycles to fetch the next instruction from storage. This instruction load latency is added to the branch condition evaluation latency discussed earlier in this section. Depending on where the next instruction is located e.g. in a cache, in main memory, or on a hard disk, it can take from a few cycles to thousands of cycles to fetch the instruction. The cycles that the processor spends waiting for the instruction to be available are wasted cycles that stall the processor’s pipeline and limit performance. Computer architects use instruction caching to alleviate the effects of load latency.
The complexity and time cost
The increase in the number of simultaneously issued instructions increases the cost and complexity of dependency checking hardware. The real-time dependency checking worsened the problem as the clock speed of the processor increases. This cost includes additional logic gates required to implement the checks, and time delays through those gates. Research has shown that the gate cost in some cases may be NK gates, and the delay cost is N2log10N, where N is the number of instructions in the processor’s instruction set, and N is the number of simultaneously dispatched instructions .
Although the instruction stream may contain no inter-instruction dependencies, a superscalar CPU must nonetheless check for that possibility, since there is no assurance otherwise and failure to detect a dependency would produce incorrect results. There is a practical limit on how many instructions can be scheduled simultaneously. While process advances will allow ever-greater numbers of functional units, the burden of checking instruction dependencies grows so rapidly that the achievable superscalar dispatch limit is fairly small; likely on the order of five to six simultaneously dispatched instructions. However even given infinitely fast dependency checking logic on an otherwise conventional superscalar CPU, if the instruction stream itself has many dependencies, this would also limit the possible speedup. Thus the degree of intrinsic parallelism in the code stream forms a second limitation.
2. Thread-Level parallelism
Hyper-threading is a parallelism method in which a single microprocessor physical core is emulated to operate as multiple logical cores, all sharing common execution resources. The concept of Hyper-threading was first introduced by Intel in Intel Pentium 4 processors but was abandoned to be reintroduced in Nehalem processors (i7-2600). Core i7-2600 has 4 cores and 8 threads; it operates using thread-level parallelism. This means that the i7 has 4 cores but can operate as eight logical cores. It has an operating frequency of 2.8GHz and also applies turbo boost technology. It is a type of parallelism known as thread-level parallelism. In hyper-threading, the logical cores run a parallel program and therefore separate data and instruction. The implementation of hyper-threading can typically improve the performance of apps by 20 to 30 per cent, boosting some apps by up to 40 per cent . Hyper-threading technology improves CPU efficiency by allowing the processor to execute two instruction streams concurrently on independent data. The main function of hyper-threading is to decrease the number of dependent instructions on the pipeline by taking advantage of the superscalar architecture of the Execution Engine. They appear to the OS as two processors, thus the OS can schedule two processes at once and make the two or more processors share the same resources. If one process fails then the resources can be readily re-allocated.
2.1. Operation of Hyper-Threading
Hyper-threading works by duplicating the architectural state of the physical microprocessor but not duplicating the main execution resources . This allows a hyper-threaded processor to appear as two logical cores to the host operating system and therefore allowing the operating system to schedule two threads or processes simultaneously. The instruction fetch unit clock cycles are shared by the two threads. In the alternate cycle, instructions for the different thread are fetched. In each cycle, four instructions belonging to one of the threads are fetched. A header thread prefix is then attached to each instruction fetched to classify the according to threads. The cores share resources such as the FPU, the arithmetic logic unit (ALU), CPU cache. The L2 cache space is not allocated to any logical CPU. When execution resources cannot be used by one of the cores due to stall, the second core will use the execution resources to execute another scheduled task.
2.2. Importance of Hyper-Threading
The most outstanding impact of hyper-threading is the improvement in the performance of the hyper-threaded processor. Before the introduction of hyper-threading, microprocessors had to execute one instruction stream at a time and all multitasking process relied on the operating system’s ability to allocate CPU time thread . However, the introduction Hyper-threading has enabled microprocessors to execute multiple instruction streams simultaneously and therefore improve the performance. Another benefit of Hyper-threading is the improvement in the utilization of CPU resources. This benefit arises from the fact that most processor’s execution resources are shared and therefore, no resources are left idle. The flip side is that those CPU resources are indeed shared, so when both streams of instructions require the same shared resource, performance can suffer. However, with threshold sharing mechanism, none of the threads will suffer.
Another significant advantage of a thread is in SIMD data processing. Often time, the threads share the processing of a SIMD data and when the processing is complete, the data will be recombined before streaming out.
2.3. Disadvantages of Threading
One of the key disadvantages of threading is the cost of the hardware involved. Since some functional units are duplicated, the extra amount of transistor needed to add the extra-functional unit increases the cost of the process.
Another disadvantage if the conflict in the use of shared resources. Some resources in a threaded processor are shared. Because they are shared, a slow thread using a resource may prevent another thread from using it. This is the reason why the threshold sharing method is used.
3. Cache memory
A cache is a high-speed SRAM that is used to reduce the latency between the CPU and the memory. It receives data from the main memory in blocks to reduce the traffic on the data bus. There are different levels of cache, thus: level one (L1), level two (L2), and level three (L3). The L3 cache is all-inclusive (must contain all the data and instructions in L1 and L2) while L2 must contain all the instruction and data contained in L1. An example of microprocessor with the three levels of cache is core i7 has 256kB level 2 caches per core, 8MB level3 cache, and 64kB level 1 cache per core (32KB data cache and 32KB instruction cache). Each core in Intel i7 has many execution units such as FU, integer ALU. It has an integrated on-chip memory controller and the QuickPath interconnect. The memory controller has three channels and each of them is 8-byte wide and requires DDR3 SDRAM to operate. The total bandwidth of the three memory channels is 31.992GB/s while the bandwidth of the QuickPath interconnects is 21.6GB/s. The high performance of core i7 is because the Memory controller, the QPI and the L3 are on-chip and this reduces latency due to off-chip communication.
Each of these caches contains a memory controller called cache controller which is implemented on the same chip. Cache memory is a very fast memory device that is always used as an acceleration technique to reduce memory access latency.
3.1. Locality of reference
The cache works based on the principle of the locality of reference (program behaviour). There are two principles involved, thus:
- Spatial Locality – this implies that if access to a particular memory location is made now, there is a high probability that memory location near to it will be accessed in the near future. In another way, access to a particular memory location is made now, there is a high probability that other accesses will be made to neighbouring locations within the lifetime of the program. The spatial locality principle arises from the fact that programs are executed sequentially. The spatial locality of a cache depends on the block size of the cache.
- Temporal Locality – This is complementary to the spatial locality. This principle implies that if a memory location is accessed now, there is a high probability that it will access again in the near future. Given a sequence of references to and locations, there is a high probability that references to the same location will be made again. This principle arises from the use of looping in programs. The temporal locality depends on the total capacity of the cache.
Not all the data or instruction needed by the microprocessor is available in the cache. To optimize the usage of the cache, 0.1% of the most accessed memory location will be carefully cached . When a reference made by the microprocessor is found in the cache, it is called a cache hit. If the reference is not found in the cache it is called a cache miss . When there is a cache miss, the reference is satisfied with the main memory. On a cache miss, the cache control mechanism must fetch the missing data from memory and place it in the cache. Usually, the cache fetches a block or line of data from memory. The physical word is the basic unit of access to memory. The processor-cache interface can be characterized by many parameters. Those that directly affect processor performance include:
- Access time for a reference found in the cache (a hit) – property of the cache size and organization.
- Access time for a reference not found in the cache (a miss) – property of the memory organization.
3.2. Cache Effectiveness
The effectiveness of a cache depends on the locality and locality is the property of the program. What gives program power is the loop. The principle of Locality tends to favour a section of memory space. It is all about the reuse of data and instruction. Locality implies that we can predict with reasonable accuracy the data and instruction that will be used next based on the memory access completed. The cache must contain instruction and data that are frequently used or reused. The instruction and data which are always reused are called the working set of the program. This is what is placed in the cache.
3.3. Why cache is faster than main memory in the cache-memory system
The cache is faster than the main memory because it is made of SRAM while main memory is made by DRAM, closer to the processor and it is smaller than the main memory. For that reason, required data and instructions in the main memory are fetched and store in the cache to enhance the hit rate.When the data to be fetched is found in the cache, it is called a hit but it is not found in the cache, it is called a miss. When there is cache miss, the required data will be fetched from the main memory. Therefore cache miss is satisfied by the main memory and cache hit is satisfied from cache. The number of cache hits divided by the number of memory reference is called hit ratio (h) while the number of cache misses divided by the total number of memory reference is called miss ratio (m).
Cache hit ratio (h) + cache miss ratio (m) =1
h + m = 1
Assume memory has an access time of 100ns and 1 cache has an access time of 10ns. If the cache hit ratio is 99%. Determine the effectiveness of the cache memory.
h + m = 1 (fractional access)
hte + mtm = Tacc
0.99×10 + 0.01 x 100 = 9.9ns – average access time
The cache effectiveness is access time without cache divided by the access time with the cache. Therefore
3.4. The trade-off in cache design
It is necessary to apply trade-off because many factors affect the performance and cost-effectiveness of computers. When computers are too expensive, it becomes difficult to find markets for it. It is to be noted that cache design involves a trade-off. These tradeoffs include:
- Hit ratio (h)
- Access time on a hit
- The delay time on misses (for main memory)
- The cost
- The access time of the cache
To increase the hit ratio of a cache, it is necessary to have an increase in size and associativity of a cache. The increase in the cache size and associativity increases the cost and the access time on a hit. However, to reduce the cost and access time, a decrease in size and associativity is needed. There is an optimal point in cache design where the cost and access time is manageable. This is when the cache is set associative and cache size is moderate. The use of direct-mapped cache leads to low hit ratio while the use of fully associative cache leads to high cost. The transfer of data and instruction between memory and cache occurs in blocks. The transfer of blocks is to ensure special locality, temporal locality and to reduce traffic in the data bus.
3.5. Type of Cache Miss
There are three types of cache misses. The type of cache misses depends on the associativity of the cache, the size of the cache and the status of the cache. These cache misses can be divided into three, thus:
- Compulsory miss
- Capacity miss
- Conflict misses.
3.5.1. Capacity miss
This is the cache miss that occurs because the capacity of the cache is not large enough. The larger the cache size, the higher the hit ratio. Miss due to capacity does not depend on the cache organization.
- Compulsory miss
This occurs when the cache is accessed when there is no valid data in the cache. This occurs during startup. The compulsory miss is a function of the cache status. It is independent of the cache size or the cache organization.
- Conflict misses
This is a preventable cache miss. It is the only miss that is preventable in a cache. This occurs due to poor replacement policy and low cache associativity. This occurs when a cache line that is part of the working set is replaced due to inadequate replacement policy. If the replacement policy is adequate, a cache line which is part of the working set being executed will not be replaced.
3.6. Cache controller
The cache controller is a device that controls the communication among the CPU, the cache and the main memory. An example of a cache controller isan Intel 82497 which implements the MESI write-back protocol for the full multiprocessing support. Dual-ported buffers and registers allow the 82497 to concurrently handle the CPU bus, memory bus, and internal cache operation for maximum performance . The cache controller is implemented within SRAM, a cache chip and each separate cache level have its unique controller. To make the cache controller work better, the data path between the CPU bus and the memory bus is separated by the cache chip, allowing the CPU bus to handshake synchronously, asynchronously, or with a strobed protocol, and allowing concurrent CPU bus and memory bus operations.
The cache controller performs its duty by snooping and snarfing at the data bus. These two terms are the principles used by the cache controller to ensure cache consistency. In a nutshell, the cache snooping means a process by which the cache controller monitors the address bus for transaction. Snooping allows the cache controller to determine if the CPU is accessing the memory for a data the cache contains. If the data is in the cache, the cache controller will satisfy the request and then terminate the transaction. On the other hand, if the data is not in the cache, the cache controller will allow the transaction to continue and then take the data from the data bus. This is called snarfing. This is the method cache controller use to update its cache .
3.7. Characteristic of the system with no-cache
A computer with no cache is always slow when accessing memory. This is due to the large difference in speed between the memory and the microprocessor. The time it takes the microprocessor to send a request to the memory and obtain the result is called the memory access time. A microprocessor – memory pair with no cache is shown in Fig. 2.3. The system has the following properties.
3.7.1. A long round trip delay
Because of the length of the data bus between the microprocessor and memory, it takes a long time before the signal travels from one end of the bus to another. This time is called a round trip delay. It is a component of the memory latency. It does not depend on the speed of the memory chip.
3.7.2. Idle CPU resource
Since the microprocessor is much faster than the memory, it has to wait for the memory to send data before it can operate on it. This makes the microprocessor resources to stay idle while waiting for data.
3.7.3. The CPU operates in fits and stops
The microprocessor operates in hits and stops because when it finishes the operation, it will wait for the memory to get another it will work on. This makes the CPU to work on executing and stop fashion.
3.7.4. The CPU is starved of data
Because the CPU needs data to work on all the time, it will be starved if the memory is unable to supply the needed data. It is being starved because it operates at a faster speed compared to the main memory.
3.8. Characteristics of the system with Cache
When a microprocessor – memory pair has a cache, most of the problem of the system without cache will be solved. The use of cache in this system reduces the average memory latency since only missed will go into memory. The use of memory of the cache does not reduce the individual memory latency but reduces the average latency.
The system in Fig. 3.4 has the following properties:
- Memory is only accessed on a cache miss
- No memory accessed can occur if the cache has enough capacity and associativity
- The round trip delay for memory access does not change
The processor is still starved of data but for a lesser degree
3.9. Mapping of main memory to cache memory
In a system with cache, the main memory is mapped into the cache memory. The cache memory sees the main memory as a partitioned memory. The real address of memory is partitioned in the cache into:
- Tag memory
- Data memory
Tag memory carries the address of the memory where the data is from. The data memory holds the byte of data in the cache while the tag holds the real address of the memory. Fig. 3.5 shows how to tag memory is partitioned. Tag memory is partitioned into three, thus:
- Tag bit
- Valid bit
- Dirty bit.
When the valid bit is 1, it means there is a valid data in the cache; otherwise, there is no valid data in the cache. If the dirty bit is 1, it means the data in the cache have been modified but if it is zero, it means the data is not modified. The tag bit of tag memory contains the list of all the tag addresses. Fig 3.6 shows a list of all the tag addresses.
3.10. Types of caches organization
Within the cache, there are three basic types of cache organization:
3.10.1. Fully associative cache mapping
In fully associative cache mapping, any block in main memory can be mapped into any block in the cache . This implies that when a request is made by the microprocessor, all the cache blocks must be checked to determine if in the cache. This is done by comparing the real address with the tag in the cache. If the requested address is found, the corresponding location in the cache is fetched and returned to the processor.
The fully associative cache has the highest hit rate amongst all other types. However, because of cost and power consumption, it is rarely used. In the fully associative cache, the real address of the memory location is divided into two, thus:
- The tag: this specifies the location from where the data block is fetched from memory. It is compared with the real address of the processor during fetching.
- The byte or offset- this specifies a particular location in a cache block where a given data byte is stored. The partitioning of the real address is as shown in Fig.3.7.
Set-Associative Cache Mapping
A set-associative cache mapping is a hybrid between a fully associative cache, and direct-mapped cache . It is a type of cache mapping in which each block in the main memory can only be assigned to a set of blocks in the cache.
A set in a cache may contain N-block of cache memory. A cache with N blocks in a set is called N-way associative cache . For instance, in Intel core i7, the L1- data contains 8 blocks in each set (8-way set associative) and an L1 – instruction cache contains 4 blocks in each set (4-way set associative), the L2 cache contains 8 blocks in each set (8-set associative) and L3 contains 16 blocks in each set (16-way set associative) . The partitioning of the real address in the set-associative cache is shown in Fig. 3.8. The set-associative cache is mostly used because it has good hit ratio and cost-wise very fair.
Direct Mapped Cache
A direct-mapped cache is a mapping technique in which any block in the main memory will be cached only in one block in the cache. The direct-mapped cache is not the best mapping technique since it causes location conflict . Conflict occurs because many blocks in the main memory can be mapped to the same line of the cache and therefore the data in a given location can easily be replaced. However, when the cache capacity is very large, it will be unlikely for two blocks from the main memory to be mapped into the same cache line and therefore direct-mapped will be the best technique. In the direct-mapped cache, the real address of memory is partitioned into three. Fig. 12.8 shows the partitioning of the address.
Although the direct-mapped cache is very cost-effective, its low hit ratio outweighs its cost-benefit. For that reason, the direct-mapped cache is not always used except in a situation where the cost of producing a cache of large sizes is not important.
Memory architectures in modern processors
The Harvard architecture is applied at level 1 cache. Von Neumann architecture is used in main memory and level 2 ache. When microprocessor and memory are connected, it is called a microprocessor – memory pair. This is not useful as there is no input/out. Other applications of the Harvard architecture are:
- Digital signal processors
- Level 1 cache
Programs and constants are stored in a ROM. The reason is to protect it from changes since ROM is only readable. We can read or write to RAM.
 Comer D. Computer Organization 2006.
 Parallel computing – Wikipedia, the free encyclopedia n.d.
 Feature – Intel Core i7 – Nehalem Architecture Dive | bit-tech.net n.d.
 Thomadakis ME, Ph D. The Architecture of the Nehalem Processor and Nehalem-EP SMP Platforms List of Figures Intel “ Nehalem ” is the nickname for the “ Intel Micro-architecture ”, where the latter is a specific implementation 2011.
 Patterson JLH and DA. Computer Architecture : A Quantitative Approach. 5th ed. 2009.
 First Look at Nehalem Microarchitecture. Page 3 – X-bit labs n.d.
 Safi E, Member S, Moshovos A, Member S, Veneris A. On the Latency and Energy of Checkpointed Superscalar Register Alias Tables 2010;18:365–77.
 Out-of-order Predicated Execution with Translation Register Buffer apport 2003.
 Buffer R. Reorder Buffer 2005:5–6.
 Smith JE, Drive J. The Microarchitecture of Superscalar Processors 1995.
 Aagaard M. Simple Pipelines Lec-05 : Speculative Execution and Register Renaming Innovations in Processor Design Multi-Cycle , Parallel Execution Data-Hazard Correctness Register Renaming Hardware Register Renaming : A Trivial Example Register Renaming Rename Algori n.d.
 Superscalar – Wikipedia, the free encyclopedia n.d.
 What is superscalar architecture n.d.
 Peng Z. Lecture 5 : Superscalar Processors Superscalar Architecture Superscalar Architecture ( Cont ’ d ) 2012:1–20.
 Superscalar Definition n.d.
 Computer Organization and Architecture 1987:1–14.
 Wadhavkar S V, Shah TA, Dwiel BH. F AB S CALAR : A UTOMATING S UPERSCALAR C ORE D ESIGN PROCESSOR DESIGN AND VERIFICATION EFFORT INCREASES WITH EACH ADDITIONAL CORE 2012:48–59.
 Juice Up Your C# App with the Power of Hyper-Threading n.d.
 Hyper-threading – Wikipedia, the free encyclopedia n.d.
 Fields T. Cache Mapping n.d.:1–8.
 This is a basic Cache Tutorial n.d.
 Intel 82497 – Wikipedia, the free encyclopedia n.d.
 Processor P, Ram S. 1. Introduction n.d.:1–10.
 Schaum. Theory and Problems of Computer Arcitecture. McGraw-Hill; 2009.
 Set Associative Cache n.d.
 Valid I. 2-way set associative implementation n.d.:1–11.