Read nwang.pdf text version

c 2003 by Nicholas J Wang. All rights reserved.

CHARACTERIZING LOGICAL MASKING OF TRANSIENT FAULTS AT THE MICROARCHITECTURAL AND ARCHITECTURAL LEVELS

BY NICHOLAS J WANG B.S., University of Illinois at Urbana-Champaign, 2001

THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering in the Graduate College of the University of Illinois at Urbana-Champaign, 2003

Urbana, Illinois

ABSTRACT

In this work, the effects of transient faults on high performance microprocessors is explored. To perform a thorough exploration, a highly detailed RTL model of a modern, deeply pipelined, out-of-order microprocessor implementing the Alpha ISA was created. Using this model, the effects of manipulating individual state elements such as pipeline latches and RAM cells can be observed by comparing the results against that of a reference simulation. Using statistical fault injection into our experimental infrastructure, approximately 83% of single bit corruptions into potentially vulnerable processor state result in a complete state convergence within 10 000 cycles. Of the remaining trials, approximately 25% show differences in microarchitectural state, but have not affected software visible state. The failed trials were analyzed to identify areas in the model that are particularly susceptible to transient faults, and then lightweight protection mechanisms were implemented which masked approximately 20-75% of the failures. Building upon the failure modes seen in the microarchitecture, fault injections into software were performed to investigate the level of masking that the software layer provides. Together, the baseline microarchitectural substrate and software mask 9 out of 10 transient faults from affecting correct program execution.

iii

To my parents.

iv

ACKNOWLEDGMENTS

Thanks to the members of the ACS research group for their friendship and support. In particular, thanks to Mike Fertig for his contributions to the Y-Branch work and Todd Rafacz and Justin Quek for their respective contributions to the microarchitectural fault injection work. And, of course, special thanks to my advisor Sanjay Patel who inspired everything.

v

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

CHAPTER 2 EXPERIMENTAL MODEL . . . . . . 2.1 Fault Model . . . . . . . . . . . . . . . . . . . . . . . 2.2 Experimental Infrastructure . . . . . . . . . . . . . . 2.3 Fault Injection Process . . . . . . . . . . . . . . . . . 2.4 Statistical Significance . . . . . . . . . . . . . . . . . CHAPTER 3 MICROARCHITECTURAL 3.1 Injection into Latches . . . . . . . . . . . 3.2 Injection into Latches and RAMs . . . . . 3.3 Convergence Latency . . . . . . . . . . . . 3.4 Summary of Findings . . . . . . . . . . .

. . . . . . . . 4 . . . . . . . . . 4 . . . . . . . . . 4 . . . . . . . . . 7 . . . . . . . . . 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 12 15 16 19 20 20 23 24 24 25 26 27 27 31 31 32 34 36 38 41 43 45 47

LEVEL INJECTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

CHAPTER 4 LIGHTWEIGHT PROTECTION MECHANISMS . 4.1 Failure Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Protection Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.1 Timeout counter . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.2 Register file ECC . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.3 Register file pointer ECC . . . . . . . . . . . . . . . . . . . . 4.2.4 Instruction word parity . . . . . . . . . . . . . . . . . . . . . 4.3 Overheads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 5 ARCHITECTURAL LEVEL 5.1 Experimental Setup . . . . . . . . . . . . 5.2 Register Injections . . . . . . . . . . . . . 5.3 Instruction Word Injections . . . . . . . . 5.4 Conditional Branch Injections . . . . . . . 5.5 Convergence Latency . . . . . . . . . . . . CHAPTER 6 CHAPTER 7 CHAPTER 8 INJECTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

LIMITATIONS OF RESULTS . . . . . . . . . . . . . . . . . . . . . RELATED WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

CHAPTER 1

INTRODUCTION

Among the various issues facing the scaling of implementation technologies into the deep submicron regime, the issue of transient faults remains largely an unknown entity. Transient faults can arise from multiple sources. External sources include high-energy particles that cause voltage pulses in digital circuits, disrupting normal operation. Internal sources include cross talk and power supply noise. While transients have always to some extent plagued semiconductor-based digital systems, the scaling of devices, operating voltages, and design margins for purposes of performance and functionality raises concerns about the susceptibility of deep submicron systems to such transient effects. Historically, transient faults were of concern for those designing high-availability systems or systems used in electronics-hostile environments such as outer space. Because of the confluence of device and voltage scaling, and the increasing complexity of digital systems, the problem of transient faults is forecast to be a problem for all future digital systems. Previous work has examined the rates and modes in which a transient fault in a combinational circuit propagates into a latch element and becomes an error at the circuit level [1]-[5]. At this level, various masking phenomena inhibit the induced voltage pulse from being latched into a state element: electrical masking, logical masking, and latching window masking. Once latched, an error at the circuit level now appears as a fault to the microarchitectural level. The affected bit (or bits) of state constitute a faulty operating state for the microarchitecture, but they themselves can get masked from appearing as a fault to the architectural level (i.e., software visible state): techniques such as speculative execution, redundant control encodings, and idle logic all contribute to the masking effect at the microarchitectural level. Some fraction of

1

microarchitectural faults appear as faults to the architectural level. Architectural faults themselves can be masked: redundancies in code or dead bits, values, and computation attenuate such faults, concealing them. The propagation of faults from the microarchitectural level through the application level is the focus of this work. In the first portion of this study a significant fault injection study is performed on a Verilog model of a high-performance Alpha processor (i.e., one that is nearly latch accurate to a real implementation) in order to understand ways faults propagate into architectural state. In the second portion, the effect at the application level of transients induced at the processor level is examined. This investigation of logical masking of transient faults is relevant for two reasons. First, an understanding of the fault-to-error rates and modes at a particular level (say the circuit level) will help designers at that level construct better techniques to tolerate such faults. Second, an understanding of transmission of faults from one level to another can lead to fault-tolerance solutions that can be optimized across levels of the system hierarchy. For example, if the microarchitectural level can be designed to mask 99% of all faults it sees from the circuit level, the architectural level or application level can be designed to handle the 1% of transient faults that escape into architectural state. In this work, three basic contributions are made: · Microarchitectural Effects of Transient Faults: We study the effects of transient

faults that propagate into a latch and thus become an error at the microarchitectural level. The purpose of this component of our work is to examine the level and types of fault masking and modes of failure that occur when a transient fault manifests as a latched error in the pipeline logic of a modern processor. This study is conducted on a latch-accurate Verilog model of a modern wide-issue Alpha processor that uses speculative execution. This particular contribution is similar to past contributions [6], [7]; here a more intensive fault injection campaign is performed on a substantially more complex and speculative processor. · Lightweight Microarchitectural Protection Mechanisms: Based on the character-

istics of the microarchitectural effects found previously, several low-overhead protection mechanisms are devised and employed to increase the microarchitectural masking level. These mechanisms are implemented on the same Verilog model used to characterize transients and result in a sizable 2

reduction of failures caused by the modeled transients without resorting to the use of wholesale redundancy or an architectural checker [8]. · Architectural Effects of Microarchitectural Errors: The effects of latch-level errors

that have propagated into architectural processor state (i.e., register file and instruction words) are studied. Using simplistic fault models derived from our study of microarchitectural faults, fault masking is observed and characterized, along with failure modes of pipeline errors that propagate to the architectural layer. In addition, the effects of control flow violations are also explored, not necessarily because they accurately represent transient errors, but because their effects are somewhat unexpected and may be of great interest to high-performance architects and compiler writers.

3

CHAPTER 2

EXPERIMENTAL MODEL

In this section, the experimental setup is described. First, the fault model is introduced. Then, the various components of the experimental infrastructure and the experimental methodology are discussed. Finally, the statistical significance of the results will be presented.

2.1

Fault Model

A single bit flip was chosen as the fault model. A single bit flip is the result of an inversion of state of a single bit, i.e., changing a bit from logic 0 to 1 or vice-versa. This simplistic model emulates a transient fault whose effects have propagated and corrupted a single state element. Furthermore, faults that affect multiple state elements might also be represented well with the single bit flip fault model. The case when this occurs is when the multiple corrupted elements can all be traced logically back to a single bit of state from a previous clock cycle. Inverting that single bit of state at that previous time would then be equivalent to inverting the set of elements corrupted by the transient fault at the present. Transient faults which do not fit into either of these two categories might not be represented well by our fault model. Nonetheless, the single bit flip serves as a starting point for investigating transient faults that corrupt some set of state.

2.2

Experimental Infrastructure

To accurately study the effects of our chosen fault model on a microprocessor, it is necessary to perform the studies on a latch-accurate model. In a latch-accurate processor model, all elements of state present in a real implementation are also present in the model and vice-versa. Without such 4

Table 2.1: Processor model details. Stage

Fetch

Features

1024-entry 4-way set-associative BTB w/ bimodal branch predictor Choice branch predictor fed by 2-cycle local and global predictors Disagreement between bimodal and choice results in a one cycle bubble 8-entry return address stack with pointer recovery 8-wide split-line fetch from a 2-way set-associative 8kB L1 cache 32-entry fetch queue 4-wide decode 4-wide rename from 80 physical registers Speculative and architectural rename maps are maintained 32-entry scheduler with speculative wakeup 11 read ports and 7 write ports with 80 65-bit physical registers Holds speculative as well as architectural results 2 simple ALUs 1 complex ALU (2-5 cycles) with a buffer for register file write port conflicts 1 branch ALU 2 address generation units for memory instructions 16-entry load and store queues Dual-ported 2-way set-associative 32kB L1 data cache with 2-cycle latency Dual porting achieved with eight interleaved banks 16 noncoalescing miss handling registers for lockup free accesses Memory dependence prediction 64-entry reorder buffer with 8-wide retire

Decode Rename Issue Reg Read Execute

Memory

Retire

a model, the effects of an injected fault might differ from that of a real implementation and false conclusions could result. For this reason, great care was taken to create a detailed and accurate Verilog model upon which to perform these fault injection studies. The processor model utilizes a 12-stage pipeline and is dynamically scheduled. It includes such features as speculative scheduling, memory dependence prediction, and sophisticated branch prediction. It runs a subset of the Alpha instruction set architecture ­ due to time considerations, floating point, synchronizing memory instructions, and various other miscellaneous instructions were not implemented. The instruction window size is 64 instructions, and the scheduler issues up to 6 instructions each cycle from a window of 32 instructions. A diagram of the processor is shown in Figure 2.1 and more details are listed in Table 2.1. A modified version of SimpleScalar's functional simulator [9] was utilized to interface with the processor model. The functional simulator provided a reference against which to compare the

5

L1 Insn Cache

TLB BrPred0 BTB RAS BOB

Align + Rotate

BrPred1

32 Entry Fetch Queue

4x Decoder

Spec RAT

Spec Free List

Mem Dep Pred 0

Intra Bundle Rename Override

Mem Dep Pred 1

32-Entry Scheduler

80-Entry Register File

Simple

Simple

Complex

Branch

AGEN

AGEN

LDQ

STQ

L1 Data Cache

TLB

64-Entry ReOrder Buffer

Arch RAT

Arch Free List

Figure 2.1: The processor model. processor model's architectural state. Also, since the model's memory hierarchy only extends to the L1 caches, it uses the functional simulator to provide cache line fills for L1 misses. An L1 miss takes a constant eight cycles to service. This was chosen to keep the processor flowing, since an instruction cache miss to memory could render much of the pipeline's state dead. This in turn could result in higher transient fault masking levels. To be conservative, a perfect L2 cache was modeled.

6

Table 2.2: Summary of possible trial outcomes. Trial Result

µArch Match Terminated Fail Silent Gray Area

Description

Complete microarchitectural state match prior to corrupted architectural state. Checked periodically. Pipeline deadlock or an instruction generated an exception. Corrupted architectural state in the absence of exceptions. Any trial that doesn't fit into the above categories. A latent fault is present or the transient has shifted timing such that µArch Match never occurs.

2.3

Fault Injection Process

Now that we have described the model used in our experiments, we will outline our fault injection process. In each trial, the time at which to inject a transient fault is first selected. Then the bit to corrupt is selected uniformly randomly across all of the eligible bits of state in the model, where eligible state is defined by the particular experiment being run. After the fault injection occurs, the simulation is monitored for abnormal behavior. Each trial is monitored for up to 10 000 cycles for four general outcomes: (1) microarchitectural state match, (2) execution terminated, (3) fail silent data corruption, or (4) none of the above. These categories are exclusive of each other and respectively indicate that the injected fault was benign, resulted in early program termination in the absence of data corruption, resulted in data corruption, or had effects which could not be conclusively identified. A summary of the events that cause these outcomes is presented in Table 2.2 and is described in more detail in the following paragraphs. Microarchitectural state match occurs when the microarchitectural state of the processor model (all the state in the machine) is equivalent to that of a non-fault-injected reference simulation. Since checking for microarchitectural state matches is relatively expensive, we only check for it periodically, taking advantage of the fact that once a microarchitectural state match occurs, the remainder of the simulation is guaranteed to have consistent microarchitectural state. We perform the check on an exponential time scale: at 1, 10, 100, 1000, and 10 000 cycles after the injection. As instructions commit, they update architectural state (programmer visible state such as memory, registers, and program counter). If an instruction commits changes to architectural state that do not match those of the reference functional simulator described previously, then the transient 7

fault has corrupted architectural state, and the trial is considered a failure. If a trial results in a microarchitectural state match with no previous architectural state inconsistencies, then we can conclusively declare that the injected transient fault's effects have been masked by the microarchitectural layer. These trials are placed in the µArch Match category. If a trial does not result in failure or conclusive masking within our 10 000 cycle simulation limit, the trial is placed into the Gray Area category. Either the fault is latent within the pipeline, or it was successfully masked, but the timing of the simulation was thrown off such that a microarchitectural state match was never detected. While testing for microarchitectural state matches is straightforward, monitoring the changes to architectural state is more complex. Instructions that update the register file can easily be verified. Each cycle an instruction retires, the architectural register alias table is used to identify the physical registers that correspond to each architectural register. This information is used to build the architectural register file, and the file is then compared against that of the reference model for consistency. Note that since the register files are only compared when an instruction commits, it is possible for a transient to corrupt a register and never be detected as a failed trial. This effect is likely minor, since the windows of opportunity in terms of time and vulnerable state are small. Unfortunately, memory state is more difficult to efficiently verify. This is mainly due to the fact that the architectural memory state is, at any given time, distributed across several different structures. These structures include the store queue, L1 data cache, miss handling registers, a bank conflict buffer, and the memory image itself. Verifying memory state by checking stores as they retire is not sufficient because store instructions move through the previously mentioned structures after they are committed to architectural state, and the structures they move through are vulnerable to the effects of transient faults. Ideally, each cycle, all architecturally visible memory state in the model would be identified and compared against a reference memory model. We felt that this verification methodology was too expensive, so we used an alternate approach. Checking retiring stores for correctness is reasonably fast, so this check is made each cycle. To catch most of the remaining memory state errors, we perform a thorough examination of memory state periodically. There will be cases when memory state corruption is not caught, but we estimate their impact to be very small. Trials that result in register and memory corruptions are placed into the Fail Silent category, 8

along with those that result in TLB misses. In the processor model, the instruction and data TLBs are preloaded with all virtual pages that are ever used in a reference execution. This is done to conservatively identify memory accesses which should not have occurred. A large portion of these memory accesses might result in memory permission errors, which could cause an operating system to terminate the program. These trials belong in the Terminated Execution category, but since we have no good means of identifying which accesses would result in program termination, we conservatively place them in Fail Silent. The last two modes of failure are faults that result in pipeline deadlock or result in retiring an excepting instruction. Pipeline deadlock is determined by a lack of retiring instructions for a predetermined number of clock cycles. In the processor model, the L2 cache is perfect and is a fixed latency of eight cycles away. Combined with our pipeline depth, this means that there should never be a stretch of more than 100 clock cycles without a retired instruction. Therefore, a timeout counter was set to expire, thus indicating a timeout error, when 100 consecutive clock cycles elapse without any committed instructions. Examples of exceptions include memory alignment errors and overflow. When a trial results in one of these two outcomes, it is placed in the Terminated Execution category. These trials would be terminated before committing incorrect state. To summarize the possible experimental outcomes, Fail Silent indicates that the transient fault resulted in silent data corruption. This is the worst kind of failure because data is corrupted without warning. Terminated indicates that an instruction excepted or the pipeline deadlocked. These failures might be acceptable because they leave architectural state free of corruptions. µArch Match indicates that the trial eventually resulted in microarchitectural state completely matching that of a reference simulation. This indicates that the microarchitecture successfully masked the effects of the transient. Finally, Gray Area represents trials that did not corrupt architectural state nor match microarchitectural state within our simulation cycle limit. Since the model only implements a subset of the Alpha instruction set, each trial must occur on a segment of execution that is executable by the model. To do this, a set of 250-300 prevalidated start points from seven SPEC2000 integer benchmarks were used for injections. This also has the added advantage of reducing simulation time, which we will touch upon briefly in the next section. Each start point was created as follows. First, the branch predictors were warmed up on a functional simulator. Then the pipeline model was run for 80 000 clock cycles to warm up its 9

various components. The state of the model was then saved to disk and the model was run for an additional 10 000 clock cycles to verify that no unimplemented instructions are encountered. At this point, the saved state on disk is a prevalidated start point, ready for fault injection in an experiment. In our experiments, we divided our fault injection campaigns into two varieties: those targeting latches and those targeting both latches and pipeline RAM arrays. We chose to do this for three main reasons: First, flip-flops may have different fault rates and fault models from RAM structures due to implementation differences. By distinguishing between these types of state in our experiments, we can derive separate results based on different fault rates and models. The second reason is that data stored in flip-flops might tend to have different characteristics compared to data stored in RAM type structures. For example, one might store data that is more transient in nature or perhaps is less vulnerable to transient faults. Our final reason is that data stored in RAMs may be easier and more efficient to protect using parity or error correcting codes. Examples of structures that we estimate to be implemented as RAMs include the register file, RAT files, register free lists, scheduler and ROB payloads, and various queues throughout the model. Not including cache arrays, prediction tables, some of their associated hardware, and other various state that cannot cause architectural state mismatches when injected with faults, there are about 15 000 bits of storage in latches and 50 000 bits of storage in latches and RAM arrays.

2.4

Statistical Significance

In this study, sampling was used to identify trends in the effects of transient faults, so enough samples must be taken such that the experimental results have statistical significance. Ideally, both the bit and the cycle in which the fault injection occurs would be selected uniformly randomly from the predefined search space. While uniform sampling was implemented for selecting the state element to corrupt, the fault injections were performed on a set of about 250-300 predefined start points for each experiment. This trade-off was made to cut simulation time roughly in half ­ having predefined start points means that a noninjected reference simulation does not have to be run in parallel with each trial. Unfortunately, this speedup comes at the cost of skewing our overall results toward those of the individual start points. However, with a relatively large number of start points,

10

the amount of skewing is minimized. Each experiment's results are the compilation of 25 000 - 30 000 trials. If the faults could be injected at any randomly selected clock cycle, the overall results would have a confidence interval of less than 0.7% at a 95% confidence level. The number of predefined start points used is likely sufficient to maintain similar statistical significance. Note that for many of the experiments, the overall results are analyzed and broken down into categories composed of fewer trials. As a result, these results have somewhat larger confidence intervals than the overall results.

11

CHAPTER 3

MICROARCHITECTURAL LEVEL INJECTIONS

A soft error in a latch can be either masked by the underlying logic in a microarchitectural implementation or propagated to the architectural level in some form. In this section, the amount of masking that occurs on a typical microprocessor substrate and the nature of errors that are presented to the software are investigated.

3.1

Injection into Latches

Now we will explore the effects of transient faults upon latches. That is, we first limit the set of injectable state to flip-flop type structures in the pipeline. Elements of state that we estimate would be implemented as a RAM were not eligible for fault injection. Furthermore, elements of state that cannot affect architectural state are also excluded from fault injection. For example, hardware structures like the branch and memory dependence prediction tables and some of their associated hardware are immune to faults. The worst case scenario is that a fault would cause a degradation in performance. The prediction structures that are ignored in this experiment are: branch and memory prediction tables, return address stack, branch target buffer, various prediction structure control state, and a FIFO branch buffer that is used to hold branch prediction information until the branch commits and updates the branch predictor and branch target buffer. Also, we do not find that fault injection into the data store portions of caches is particularly interesting, since parity and error correcting codes are well understood and do a sufficient job at protecting them. We do, however, inject faults into the various structures that

12

support the L1 instruction and data caches. Results from this experiment are presented in Figure 3.1 for seven benchmarks. The different benchmarks represent different workloads on the processor, which may affect the masking rate of the microarchitecture. The aggregate results are presented in the rightmost bar. Note that the number of start points was not distributed evenly between the benchmarks, so the overall results are not equivalent to the average of the individual benchmark results. Gzip has the highest instructions per cycle (IPC) of the seven and bzip2 has relatively high IPC and branch prediction rates, and the highest data cache hit rate. These factors likely contribute to their higher failure rates, since on average, more meaningful work is in progress resulting in more vulnerable state. Overall, about 90% of faults in pipeline latches are masked successfully in the microarchitecture. Most of the remainder result in corrupted architectural state, while a small portion is unknown. The majority of the trials that resulted in corrupted architectural state were of the Fail Silent category.

100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0%

ag gr eg at e ga p cf bz ip 2 gz ip cc 1 pa rs er vo rte m x

failsilent terminated gray uarch

Figure 3.1: Fault injection into latches results per benchmark. To further analyze the results of the experiment, each bit of state in the processor was categorized by the type of state they typically store. Table 3.1 lists the various categories and provides a brief description for each. The results of the experiment were broken down by type of state element injected and are presented in Figure 3.2. For reference, the relative size (in terms of number of bits of state) of each category's latches is shown in Figure 3.3. Note that the register alias tables and free lists were not available for injection in this experiment, since they are likely to be implemented 13

Table 3.1: Description of different categories of state. Category

addr archfreelist archrat ctrl data insn pc qctrl regfile regptr robptr specfreelist specrat valid

Description

64-bit address field for memory operations. Architectural register free list. Architectural register alias table. Miscellaneous control state such as decoded control words. Instruction input and output operands. Parts of the instruction word passed along with each instruction. 64-bit program counter fields. Control state associated with queues. 65-bit register file entries and scoreboard bits. 7-bit physical register file pointers. 6-bit ROB tags. Speculative register free list. Speculative register alias table. Valid bits throughout the pipeline.

as RAM arrays. Of the various categories, the bits of state categorized as qctrl and valid appear to be especially vulnerable to transient faults. Their impact on the overall fail rate is small, however, since they constitute only a small fraction of the total state of the machine. The fail rate of the data category is the lowest, perhaps due to a combination of a relatively low utilization rate and a high degree of speculation.

100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0%

va lid ag gr eg at e da ta re gf ile trl re gp tr ro bp tr ad dr rl in sn ct pc qc

failsilent terminated gray uarch

Figure 3.2: Fault injection into latches results broken down by type.

14

regptr regfile qctrl

robptr valid addr ctrl

pc

insn

data

Figure 3.3: Relative size of each category for latches.

3.2

Injection into Latches and RAMs

In this section, we will examine the effects of transient faults on both flip-flop and RAM type state elements. As in the previous section, elements of state related to prediction and cache data storage structures were not subject to fault injection. The results of this experiment for each of the benchmarks are presented in Figure 3.4. Results broken down by the type of state injected are presented in Figure 3.5, and the relative size of each of the state categories is displayed in Figure 3.6. Note that the proportion of pc state is quite high. This is mainly due to an inefficient use of storage space in the retirement pipeline stage, which can lead to the appearance of a higher masking rate. This is discussed further in Chapter 6. The failure rate increases across all benchmarks. This can be largely attributed to the addition of the 80 physical registers, which have high rates of failure. As might be expected, the architectural register alias table and register file also exhibit relatively high fail rates, since a large proportion of their contents is architectural state. The speculative RAT file provides source register mappings for new instructions while the speculative free list provides destination register mappings. These structures appear to be especially critical as their associated structures also have relatively high rates of failure.

15

100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0%

ag gr eg at e ga p cf bz ip 2 gz ip cc 1 pa rs er vo rte m x

failsilent terminated gray uarch

Figure 3.4: Results of fault injection into latches+RAMs per benchmark.

100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0%

trl re gf ile re gp tr r sp obp tr ec fre el is sp t ec ra t v ag alid gr eg at e ad dr el is ar t ch ra t rl da ta in sn ct pc fre qc

failsilent terminated gray uarch

Figure 3.5: Results of fault injection into latches+RAMs broken down by type.

3.3

Convergence Latency

Now, we will look at the microarchitectural state convergence latency. That is, we will examine the number of cycles that elapse from the point of fault injection until microarchitectural state convergence. Graphs plotting latency information for injections into latches only and for injections into both latches and RAM structures are presented in Figures 3.7 and 3.8. The results are presented 16

ar

ch

specfreelist robptr regptr regfile

specrat

valid addr archfreelist archrat ctrl

qctrl

data

pc insn

Figure 3.6: Relative size of each category for latches+RAMs. for each type of state element as well as an aggregate total. Each bar is broken down into several categories. The numbered categories indicate the state convergence latency. For example, trials that resulted in a state convergence latency of between 10 and 100 cycles were placed in the 100 category. The other category indicates that a microarchitectural state match never occurred, either due to architectural state corruption or the cycle limit being hit. In general, trials injecting into pipeline latches converge in microarchitectural state more quickly than trials performing injections into all state element types. This is a result of data stored in pipeline latches being more transient in nature. The addr category is different in that it has a sizable number of trials that result in convergence after a large number of cycles in Figure 3.8. These trials are likely the result of fault injections into dead state in the L1 data cache's miss queue or memory conflict buffer. These structures are only written into when an L1 data cache miss or bank conflict occurs, which is relatively rare compared to the frequency of state updates in the processor core. Also, the large number of convergences between 1000 and 10 000 cycles hints that a sizable portion of the trials in the Gray Area of the addr category might have moved to the microarchitectural state match section had the simulation cycle limit been longer.

17

100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0%

va lid ag gr eg at e re gf ile da ta trl re gp tr ro bp tr ad dr rl in sn ct pc qc

other 10000 1000 100 10 1

Figure 3.7: Latency of state match for fault injection into latches.

100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0%

qc trl re gf ile re gp tr ro bp sp tr ec fre el is sp t ec ra t va lid ag gr eg at e t ar ch ra t ad dr rl da ta in sn el is fre ct pc

other 10000 1000 100 10 1

Figure 3.8: Latency of state match for fault injection into latches+RAMs. The data category is unique in that the majority of fault injection trials resulted in immediate microarchitectural state match. This indicates that data fields in our implementation are especially transient in nature and often are not holding valid or useful state. Overall, when a transient fault injection results in a state match, the match usually occurs within about 100 cycles. Only a small fraction remain latent between 100 and 10 000 cycles before converging.

ar

ch

18

3.4

Summary of Findings

In this section, fault injection campaigns were run to estimate the inherent masking levels of a typical modern microprocessor substrate for latches and RAM arrays. The masking levels for latches were higher than that of RAM arrays, indicating that they are generally less utilized. The effects of different workloads upon fault propagation rates were also examined. Although we saw differences between different benchmarks, they were minor relative to the level of masking apparently inherent to the pipeline. Then we categorized all the state of the machine by the type of data they typically store and identified the most vulnerable categories of state. Finally, we examined state convergence latency, or the number of cycles that elapse between a fault injection and a microarchitectural state match. The number of long latency state convergences hints at the percentage of ambiguous Gray Area trials that would have resulted in a state match had our simulation cycle limit been longer.

19

CHAPTER 4

LIGHTWEIGHT PROTECTION MECHANISMS

In this section, failures from the previous section are examined to characterize how they occur. This information, in addition to the results from Chapter 3, will help motivate the implementation of several lightweight state protection mechanisms. The overheads of these mechanisms will be discussed and finally the effect that they have upon microarchitectural masking of transient faults will be analyzed.

4.1

Failure Modes

To begin characterizing the failure modes of transient faults, we will first examine how the 9% and 12% of trials from Figures 3.2 and 3.5 resulted in failure. Figures 4.1 and 4.2 present this information. The different categories were introduced in Section 2.3 and are summarized in Table 4.1. Note that control flow errors are not included in the breakdown. This was done because a clean identification of control flow violations could not be made easily. Instead, those trials with divergent control flow will (almost certainly) eventually result in another form of corrupted architectural state, which is what the trial is classified as. Single bit flips that result in failures are dominated by register file corruption and pipeline deadlocks. Register file corruption is especially prevalent in the experiment where all pipeline state was available for injection. Figures 4.3 and 4.4 present the relative contributions of each type of state element to the total number of failures. From these figures, we see that the increase in register file corruption is largely due to the addition of the register file entries, register alias tables,

20

Table 4.1: Description of failure modes. Failure Mode

store stalled regfile mem itlb except dtlb

Description

A store instruction was corrupt at retire time Pipeline stalled for 100 cycles Register file inconsistent Periodic complete memory check failed Fetch redirected non-speculatively to an invalid page Instruction generated an exception Nonspeculative access to an invalid page

100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0%

va lid ag gr eg at e re gf ile da ta trl ro bp tr re gp tr ad dr rl in sn ct pc qc

store stalled regfile mem itlb except dtlb

Figure 4.1: Breakdown of failure modes for injections into latches. and register free lists. Various register file pointer fields throughout the pipeline also significantly contribute to the register file corruption total. If these fields could be protected from transient faults, a large fraction of the failures could be removed. The other leading source of failures is pipeline deadlock. Many of these can be attributed to corrupt control, queue control, ROB tag, and valid fields. It is likely that simply forcing a pipeline flush would reset these corrupt fields and allow the pipeline to continue executing instructions normally. Note that in the case of injections into only latches, the regfile category's common failure mode is deadlock. This is because for that experiment, the category is composed of mostly scoreboard bits, which track registers with valid data. In our processor implementation, a pipeline

21

100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0%

trl re gf ile re gp tr ro bp sp tr ec fre el is sp t ec ra t va lid ag gr eg at e el is t ar ch ra t ad dr rl da ta in sn ct pc fre qc

store stalled regfile mem itlb except dtlb

Figure 4.2: Breakdown of failure modes for injections into latches+RAMs.

robptr valid addr ctrl

ar

ch

regptr

regfile qctrl

data

pc

insn

Figure 4.3: Relative contributions of each state type to failures for latches. flush does not restore the scoreboard bits, but this feature is easily added. Note that the addr category's failure mode changes significantly from the latch only experiment to the latch+RAM experiment. Many more store failures are present and significant memory check and stall failures are introduced that were not present before. This is a result of address fields in the data cache miss queue, data cache conflict buffer, and store queue being available for injection only in the second experiment. A similar but lesser effect is seen for the data category. 22

specfreelist

specrat

archfreelist valid addr archrat ctrl

robptr regptr

data

insn

pc regfile qctrl

Figure 4.4: Relative contributions of each state type to failures for latches+RAMs.

4.2

Protection Mechanisms

In this section, four different lightweight protection mechanisms that attack stalled pipeline failures and register file corruptions are outlined. Their implementation and overheads in terms of extra state, logic, and potentially even higher fault rates are discussed. Finally, the steps taken to prevent our implemented protection mechanisms from adversely impacting our estimated critical path will be presented. The following mechanisms were chosen to protect the pipeline state that were found to be most vulnerable. Other types of state that might be easily protected include the qctrl category and much of the ctrl category. The queue control elements could be protected with parity (or error correcting codes if the queue potentially holds architectural state, like the store queue). Much of the control state is in the form of instruction bundles that get passed through the pipeline with each instruction. These fields could be protected in very much the same way that instruction words were protected, as we will describe in Section 4.2.4. Furthermore, Ando et al. [10] protected the data and address paths of their processor design with parity without specifically mentioning protection of other hardware structures. The previous results presented in this work indicate that, at least for our processor model, lower hanging fruit exists.

23

4.2.1

Timeout counter

The first protection mechanism is a timeout counter, which targets the stalled pipeline failures described in the previous section. It was implemented as a simple counter which is incremented each cycle that zero instructions retire and cleared otherwise. When the counter reaches a predetermined threshold (our implementation uses 100), it forces a pipeline flush and restarts fetch at the architectural program counter. The overhead of this mechanism is estimated to be minimal in terms of both state storage and combinational logic, requiring on the order of 10 latches in the processor's retirement stage.

4.2.2

Register file ECC

The next protection mechanism involves protecting register file contents with error correcting codes in a similar fashion to [11]. Because each register file entry holds state that is potentially software visible, it is not sufficient to simply detect that an error exists if we wish to mask the transient fault. The hardware must be able to recover the data once it detects a corruption. Thus, a parity scheme would not suffice. Another option was to protect the register file with triple modular redundancy, but this option is likely prohibitively expensive. The error correcting code added an overhead of eight bits per 64-bit register file entry and some additional pipeline latches. To protect the data with error correcting codes (ECC), the ECC bits must first be generated. Generating these bits is a problem because squeezing the associated combinational logic into the last stage of execution or immediately before the register file write are both unsatisfactory solutions since they would likely impact the critical path. The implemented solution performs the ECC generation in parallel with the register file write of the data and writes the ECC bits to a separate RAM structure the cycle afterwards. A valid bit is then required in the ECC RAM to indicate whether or not the ECC bits have been generated and stored for each register file and ECC RAM entry. This differs from the approach taken by Gaisler, who chose to write the data and protection bits into the register file together. The approach taken here is motivated by a deeper pipeline where less computation can fit within a clock cycle. On a register file read, the corresponding ECC RAM entry is read out in parallel with the register file entry, and data corruption detection is employed (if the ECC RAM valid bit is set).

24

To not adversely affect the critical path, the corruption detection should occur in parallel with the use of the data instead of in series with the register file read and the use. If the data was found to be corrupt, a recovery mechanism is initiated. The implemented solution performs the corruption detection and repair between the read and the use with no penalty. The motivation for taking such action was so that a trial in which this type of recovery took place would contribute to the µArch Match category rather than the Gray Area, thus avoiding some ambiguity in the experimental results. Furthermore, in the implementation, the register file entry is not repaired on corruption detection. The model relies on a subsequent register file write to overwrite the corrupted data.

4.2.3

Register file pointer ECC

In Section 4.1, we saw that the architectural and speculative register free lists and register alias tables as well as various register file pointers in the pipeline contributed greatly to the number of register file corruption failures. All five of these types of structures hold the same type of information: a physical register file pointer. In this section, all of these structures are protected by accompanying each register file pointer with ECC. As was the case with protecting the register file, at least the architectural register alias table must be protected with ECC, since it holds architectural state. Because the register file pointers are passed between the structures, protecting them with ECC in some structures and parity in the others would involve constant generation and regeneration of parity and ECC information. This problem was avoided by simply protecting all register file pointers with ECC and passing the pointer along with the ECC information together throughout the pipeline. The processor model has 80 physical registers, so 7 bits were necessary to encode each register pointer. Four bits of ECC information were added to protect each seven bit field. In the implementation, ECC repairs occur on writes to the architectural RAT and free list as well as the speculative free list. If placing the ECC repairs in these locations is not desirable for any reason (perhaps for critical path concerns), placing the repair any place in the pipeline where register file pointers flow should be roughly equivalent. ECC repairs also occur during the last stage of each functional unit (including loads returning from the data cache) for the destination register pointer. These repairs occur in parallel with the functional unit's computation or data cache access to catch transient faults that may have corrupted 25

the destination register pointer since it was last written into the speculative free list. Note that the destination register pointer latches between the last stage of each functional unit and the register file are not protected by ECC. This is because there is no time to perform an ECC repair on the destination register before it writes data into the (potentially wrong) register entry. Similar to what was proposed in the previous section, as each register file pointer is used throughout the pipeline, whether or not it holds corrupted state can be detected in parallel with its use and recovered from (except when the use is for writing into the register file). As before, instead of implementing the recovery in this manner, the repair is allowed to happen instantaneously to avoid placing trials in the ambiguous Gray Area.

4.2.4

Instruction word parity

For our last protection mechanism, the next largest contributor to failed trials was protected: the instruction word. In our model, the instruction word (along with various decoded information) is passed along with each instruction through the pipeline to provide control information in various stages. As portions of the instruction word become dead, they cease to be passed further. To protect instruction words, parity bits for each 32-bit instruction word are generated as they enter the pipeline from the L1 instruction cache. In particular, the parity generation occurs in series with the instruction rotation and alignment process. As instructions flow through the pipeline and lose portions of their instruction words, the parity bit is updated using information from the portions that were lost. When the remainder of the instruction word ceases to be propagated through the pipeline, the parity bit is checked for consistency. This occurs at the last stage of each functional unit (simple and complex ALUs, branch unit, and address generators). In the case of a parity error, a pipeline flush is initiated before the offending instruction has an opportunity to write the register file. In the absence of parity protection in the instruction cache, this scheme leaves the instruction word vulnerable in fetch while the parity bit is being generated. With parity protection in the instruction cache, the parity bit for each fetched instruction could be generated while the fetched cache line is still protected by the cache's parity bit. In the implementation of this protection mechanism, the instruction word is left vulnerable as its parity bit is generated. Furthermore, since the parity generation occurs in series with instruction rotation and alignment, the global cycle time 26

might be impacted. If this is the case, the parity bit generation can occur in decode at the cost of not protecting an extra set of pipeline latches.

4.3

Overheads

In each of the implementations presented previously, the overheads in terms of extra state, logic, and possible impacts to clock rate were discussed. These overheads may also lead to higher power requirements and other potentially adverse effects such as a higher capacitive load on various transistors throughout the processor. With the implementations of the above protection mechanisms, a 6-7% overhead in bit storage was seen. Roughly two-thirds of this state overhead was in the form of RAM type storage (register file, register alias tables, register free lists, and fields in the scheduler and ROB payloads). The baseline configuration used to calculate the state overhead did not include predictor tables and caches. Had we included the predictor tables and caches, the calculated storage overhead would be much lower. Depending on the nature of the various sources of transient faults, the overheads from these mechanisms may result in a higher fault rate, due to a larger amount of vulnerable hardware. For example, a larger number of storage elements might increase the rate of faults caused by cosmic rays [5]. Fortunately, nearly all of the introduced overheads are naturally redundant. For example, if a transient were to affect a parity bit protecting an instruction word, a forced pipeline flush would result, with no ultimate effect on correct program behavior. Nonetheless, in general, it is important to consider the effect of overheads upon the transient fault rate.

4.4

Results

In this section, for brevity, only results from injecting transient faults into latches and RAM structures are presented. Figure 4.5 breaks down the results of this experiment by type of state injected. Note the addition of two new categories: ecc and parity. These two categories respectively represent state used to store ECC and parity information. Compared against Figure 3.5, the number of failed trials drops significantly. The architectural failure rates for the archfreelist, archrat, insn, regfile, specfreelist, and specrat categories all exhibit large decreases from the protection mechanisms. The

27

set of insn bits, however, sees a large number of trials move from uarch to gray. This is a result of the parity protection mechanism initiating a recovery via pipeline flush when the bit corruption would have eventually been masked. The gray category does not cover all the trials since only valid instruction words with incorrect parity information will trigger a pipeline flush.

100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0%

tr re l gf ile re gp tr ro sp bp ec tr fre el i sp st ec ra t ag vali d gr eg at e ad dr el is ar t ch ra t ct rl da ta ec c in sn pa rit y fre qc pc

failsilent terminated gray uarch

Figure 4.5: Results of fault injection into latches+RAMs broken down by type. Somewhat interesting to note is the large gray category of the archrat state elements. In the protection mechanism implementation, a corrupt architectural register alias table is never repaired, only overwritten with new, hopefully noncorrupt data. This only occurs when an instruction that writes its result to the corresponding register file commits. Since our simulation limit is 10 000 cycles, many of the trials in the gray category are likely due to an injection into a register alias table entry whose corresponding architectural register is not written to by a committed instruction within the simulation limit. If the transient fault rate is especially high, then scrubbing the architectural register alias table may be beneficial, to prevent multiple bit corruptions that the ECC coding cannot recover from. Also, the gray categories of the robptr and valid state classifications also increase in size. Figure 4.6 displays the failure modes of the various state classifications, and it shows that stalled failures have been essentially completely removed for the robptr and valid bits. This is evidence that the timeout counter mechanism worked to flush and restart the pipeline, resulting in subse-

ar

ch

28

quent correct execution. Unfortunately, the change in timing due to the pipeline flush makes a state match in our experimentation infrastructure rare.

100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0%

ro bp tr sp ec fre el is t va lid ag gr eg at e da ta trl re gf ile re gp tr ad dr rl in sn ct pc qc

store stalled regfile mem itlb except dtlb

Figure 4.6: Breakdown of failure modes for injections into latches+RAMs. In Figure 4.7, a pie chart depicting the relative contributions of each state type to failures is presented. The failures are now dominated by transient faults affecting the program counter, control, and data fields. Note that failures from the protected elements were not completely eliminated. These failures were the result of transient faults affecting the areas that were left unprotected for ease of implementation. Nonetheless, their contribution to the failure total was greatly reduced. The last piece of data for this section is presented in Figure 4.8, which shows the latency to converge for the various state element categories. The aggregate results show slightly shorter convergence latencies with the protection mechanisms in place. The register file now has a larger number of trials converging between 1000 and 10 000 cycles after the injection. Since the register file, like the architectural RAT, depends on new writes to remove corrupt state, these relatively long latency convergences are also likely due to a long wait for a new write to occur. Again, scrubbing may be beneficial here. We notice a similar, but smaller long latency convergence trend for the architectural register alias table. Worth noting is that comparing Figures 4.5 and 4.8 to their nonprotected counterparts is not a fair comparison. This is due to the 6-7% extra (mostly nonvulnerable) state introduced by the

29

specfreelist valid robptr regptr regfile qctrl

addr

ctrl

pc

data insn

Figure 4.7: Relative contributions of each state type to failures for pipeline+RAMs.

100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0%

trl re gf ile re gp tr ro sp bp ec t fre r el is sp t ec ra t v ag alid gr eg at e ad dr el is ar t ch ra t ct rl da ta in sn pa rit y ec c pc fre qc

other 10000 1000 100 10 1

Figure 4.8: Latency of state match for fault injection into latches+RAMs. various protection mechanisms. Nonetheless, after accounting for an 7% higher transient fault rate, the implemented mechanisms reduce the known failure rate (represented by the fail silent and terminated program categories) by approximately 75%. If we conservatively assume that all the trials from the gray category are failures, then the failure rate was reduced by approximately 20%.

ar

ch

30

CHAPTER 5

ARCHITECTURAL LEVEL INJECTIONS

Soft errors that do not get masked in the microarchitectural level propagate to the architectural level. Soft errors that manifest to this level may still be masked in software. Otherwise, they may further contaminate their environment and cause erroneous program output. In this section, we model errors that have propagated to the architectural level and observe their effects.

5.1

Experimental Setup

To investigate the effects of corrupted architectural state on software, a modified version of SimpleScalar's functional simulator was used [9]. The experiments were performed in the Alpha instruction set architecture and workloads from the SPEC2000 integer benchmark suite were used. In the functional simulator, two separate sets of architectural state were maintained: one was the target of selected fault injections while the other was used as a non-error-injected reference. After a fault injection, the error injected simulation is monitored for exceptions. Also, the two simulations are periodically compared to check for architectural state matches. All of architectural state, which includes the program counter, register file values, and the memory image, is compared to identify a state match. If a state match occurs prior to a system call, the trial is declared a success. State matches are only recognized before a system call is encountered because they represent external communication to the processor. Thus, we treat system calls as barriers to state convergences. If a state match is never reached, the user visible output of the program is compared against that of a reference execution and is subject to an acceptance test. The result of the acceptance test then

31

dictates whether the trial was a gray area success or a failure. Note that an injected fault can affect the number of instructions executed while still eventually resulting in an architectural state match. A straightforward method of executing and comparing the reference and injected simulations in lockstep would not detect these cases. Instead, in addition to the lockstep comparisons, a state check is performed if both simulations reach a system call. If the check succeeds, then an architectural state match has occurred regardless of whether the fault caused a different number of instructions to be executed. Approximately 1000 trials were run for each benchmark in each experiment. Each trial consists of a single fault injection into an instruction selected in a uniformly random manner. This yields a confidence interval of less than 3.1% per benchmark, at a 95% confidence level. The aggregate results for each experiment represent more trials, so their confidence intervals are smaller: about 1% at the same 95% confidence level.

5.2

Register Injections

In this experiment, soft errors were introduced by flipping bits in the register file. A random instruction from the dynamic instruction stream is selected, and if the instruction produces a value to be stored in a register, the value is injected with an error before being written into the register file. Stores, some branches, and other instructions which do not write to the register file were not injected with errors in this experiment. In the Alpha instruction set, direct and indirect jumps record a return address into a register. Injections into these return address values was performed. Other branches do not write to the register file, so they are excluded from this experiment. In the first part of this experiment, a single bit in the result is selected randomly and flipped. This single bit error fault model is reflective of the effects of a single bit error in the register file or a data field in the pipeline or in the cache system. The Alpha instruction set supports 64-bit registers, but much of the code in the benchmarks we selected operates mainly on the lower 32-bits. Thus, we chose to only inject bit flips into the lower half of the register word space.1 After each error injection is made, the injected simulation is monitored for state convergence, correct output, and instruction set architecture specified faults. The results of this experiment can be found in

1 Across all benchmarks, experiments that injected bit errors into all 64-bits displayed slightly lower rates of error manifestation.

32

Figure 5.1. If a software visible exception was observed during the simulation, the trial is put in the Exception category. If no exceptions take place, the trial is placed in the State OK category only if it eventually results in a complete architectural state match. If it does not, and it passes the user visible output acceptance test, it is placed in Output OK, otherwise, it goes into Output Bad. We see that on average about 55% of the injections result in state convergence. This is a result of injecting bit errors into values that are either dead or eventually logically masked off in their uses. Examples of logical masking include logical AND and OR operations. We also note that many of the bit errors were injected into temporary values used for determining conditional branch values. Oftentimes, these values are not sensitive to a single bit error because Alpha conditional branch operations only compare against zero.

100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0%

eo n ga p pe rlb m k x av er ag e bz ip 2 gz ip pa rs er cr af ty gc c ol f tw vo rte

exceptions output bad output ok state ok

Figure 5.1: Single bit error injection into registers. In the second part of this experiment, an instruction is chosen at random as before, but instead of only flipping a single bit of the result, 64 random bits are written into the destination register. This fault model might be representative of a transient fault affecting a bit in a register pointer in the microarchitecture or the address field of a load. The results obtained from this experiment can be found in Figure 5.2. As might be expected, all the benchmarks exhibit convergence rate decreases, resulting in a

33

10% decrease when averaged across all benchmarks. The benchmark bzip2 had a sizable drop in convergence rates. The drop was mostly a result of low convergence rates in error injected load and logic instructions; other types of instructions did not show any significant change in convergence rates.

100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0%

eo n ga p pe rlb m k x av er ag e bz ip 2 gz ip pa rs er cr af ty gc c ol f tw vo rte

exceptions output bad output ok state ok

Figure 5.2: Multiple bit error injection into registers.

5.3

Instruction Word Injections

Another way that soft errors at the architectural level can be modeled is through bit corruptions in instruction words. As was the case in the previous experiments, an instruction is first chosen at random. Immediately after the target instruction word is fetched from memory, a random bit in the word is flipped. The erroneous instruction word is executed as if it were correct, and the benchmark is allowed to proceed. This fault model is representative of a single bit corruption in an insn category as described in Chapter 3. Again, the simulation is monitored for state convergence and whether or not the output of the benchmark passes an acceptance test. The results of this experiment can be found in Figure 5.3. Note that while the previous register file injections were performed only on certain instructions, all instructions were candidates for the instruction word injections of this experiment. The two most notable sets of newly eligible instructions were stores and the remainder of branch instructions. The 34

convergence rates for injections into store instruction words are noticeably lower than that of other types of instructions, leading to a slightly lower overall convergence rate when compared against the results from Figure 5.2. The benchmark twolf is an exception to the lower convergence rate trend, and this is due to two main factors: a lower percentage of stores and higher convergence rates for branch and logic instructions, relative to other benchmarks.

100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0%

x av er ag e pe rlb m k eo n ga p bz ip 2 gz ip pa rs er cr af ty gc c ol f tw vo rte

exceptions output bad output ok state ok

Figure 5.3: Single bit error injection into instruction words. A second type of instruction word injection was modifying instructions into nop's, or no operations. To accomplish this, instead of manipulating a single bit in the instruction word, we simply convert the target dynamic instruction into a nop. While this experiment does not directly reflect any of the fault models derived from our error injections at the microarchitectural level, it gives us an idea of the percentage of dynamic instructions that are ineffectual [12] (considering fall-through branches ineffectual instead of correctly predicted ones). The results are presented in Figure 5.4. On average, about 60% of all instructions perform a function ultimately equivalent to a nop. A large portion of these 60% are instructions that write silent values, produce logically dead values, are fall-through branches, or, as we will see shortly, belong to a group of approximately 40% of taken conditional branches.

35

100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0%

eo n ga p pe rlb m k x av er ag e bz ip 2 gz ip pa rs er cr af ty gc c ol f tw vo rte

exceptions output bad output ok state ok

Figure 5.4: Changing instructions into nops.

5.4

Conditional Branch Injections

Occasionally, due to an error injection, the control flow of a program diverges. For example, this can happen with a corruption of a value that a branch conditions upon or a corruption in the instruction or control word of a branch itself. Interestingly, even when this occurs, the injected program can still manage to converge on architectural state. In Figure 5.5, how often an injection results in state convergence and how often control flow diverges is presented for each of the four fault models discussed previously. The data presented are averaged across all benchmarks. In the figure, 1bitdest and 64bitdest respectively denote the 1-bit and 64-bit result value corruption experiments and 1bitinsn and nop respectively denote the single bit instruction word corruption and conversion to nop experiments. Furthermore, state denotes whether state eventually converged and control denotes whether the control flow of the trial ever diverged. The data indicates that whether or not control flow diverges is somewhat correlated to state match success. This result is hardly surprising, as one might expect that executing different instructions would often lead to different results, and thus, state match failures. However, the number of trials where control flow diverges and an architectural state match is still eventually reached is not insignificant. This data was the inspiration for our next experiment: introducing soft errors in the form of 36

100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% 1bitdest 64bitdest 1bitinsn nop exceptions state bad/control diverged state bad/control ok state ok/control diverged state ok/control ok

Figure 5.5: Control divergence and state convergence rates. flipping a conditional branch direction. A conditional branch is chosen randomly from the dynamic instruction stream, and the condition is evaluated correctly. However, instead of following the correct branch direction, the opposite path is taken. So if the branch's condition dictated that it should be taken, control flow is instead allowed to fall-through, and vice-versa. The results of this experiment are presented in Figure 5.6.

100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0%

eo n ga p pe rlb m k x av er ag e bz ip 2 gz ip pa rs er cr af ty gc c ol f tw vo rte

exceptions output bad output ok state ok

Figure 5.6: Changing the architected direction of conditional branches. 37

One might expect that changing the control flow of a program would have catastrophic effects, but the data gathered here show that this is often not the case. While some benchmarks appear to be very sensitive to control flow changes (bzip and gzip), others are surprisingly resilient (gcc). These differences are mainly due to code structures and transformations employed by programmers and compilers alike, as a result of the requirement for them to produce correct code under all dynamic cases. Some examples of these code structures and transformations include short-circuits, performance optimizations, and ineffectual loop iterations. This phenomenon was investigated in greater depth in previous work [13].

5.5

Convergence Latency

We now examine how long an error injected simulation will remain in incorrect state before it converges to a correct state. A soft error is introduced as described in the previous sections, and architectural state is monitored to observe when convergence occurs. We call the number of instructions that execute before convergence the convergence latency. This data point is generated with respect to the injected execution. In Figure 5.7, the distribution of convergence latencies for the register and instruction word injection experiments is presented. For each benchmark, from left to right, the bars represent the 1-bit result value corruption, 64-bit result value corruption, 1-bit instruction word corruption, and conversion to nop experiments. Since the entire distribution is large and subject to noise at the 100th percentile, only the middle 80% of the distribution is plotted. In other words, each bar represents the convergence latencies from the 10th to 90th percentiles. In addition, the diamonds indicate the median convergence latency. The average distribution denotes the average of the individual percentile figures. We see that when state converges, convergence usually occur rather quickly, regardless of the type of injection. For the register injections, roughly half of the convergence latencies occur before 30 instructions. For instruction word injections, about half occur before 10 instructions. Notable is the benchmark gzip, which has a number of long convergence latencies, but only for the 64-bit result and nop injections. Another interesting data point is the difference in the number of instructions executed when

38

100000000

10000000

1000000

100000

10000

1000

100

10

1 bzip2 crafty eon gap gcc gzip parser perlbmk twolf vortex avg

Figure 5.7: Convergence latency. control flow diverges. We display the middle 80% and medians of this distribution for each of the same four experiments in Figure 5.8. In this graph, the y-axis indicates how many more instructions were executed by the error injected simulation (relative to a non-error-injected reference simulation). We see that across all types of error injections, roughly half of those that converge in state (and exhibit control flow divergence) result in fewer instructions executed, hinting at significant inefficiencies in dynamic instruction streams. Knowing this, if a mechanism could detect the presence of a soft error and verify that it was (or will be) masked by the software, then active recovery through rollback to a safe checkpoint or some other means would be unnecessary and potentially far more expensive than simply allowing the program to continue execution.

39

1000

800

600

400

200

0

-200

-400

-600

-800

-1000

Figure 5.8: Difference in number of instructions executed.

40

CHAPTER 6

LIMITATIONS OF RESULTS

The presented experimental results are based on and built off of the selected fault models, the simulation models used, and the workloads simulated with. As these models are researched and improved upon, the relevance of this work may diminish if the assumptions used in this work no longer hold. For example, much of this work is geared towards characterizing and protecting against the effects of single bit corruptions. If this fault model fails to model physical transients, an underlying assumption of this work is broken. However, the general mechanism of identifying vulnerable portions of a microprocessor and devising low overhead protection mechanisms for those portions will still be valid. While care was taken to create detailed microarchitectural experimental infrastructure, not all of the intricacies of a modern dynamically scheduled processor were accounted for. For example, the scheduler, multiply unit, and store-to-load forwarding logic in the memory unit were implemented behaviorally. This can lead to inaccuracies in determining the state that is required for these hardware structures to operate within the constraints of a processor. Another inaccuracy resulting from using an experimental model is that some of the state in the microarchitectural model is not stored efficiently. The one large known example of this at the time of publishment is in the ROB, which has two 64-bit program counter fields for each entry. These fields are evident in the large pc category in Figure 3.6 and are used to keep track of a program counter value for each instruction as well as a target program counter for branch instructions. These fields are used for verifying control flow and are only read when branch instructions are encountered. Keeping these fields in a smaller, dedicated 16-entry structure reserved only for control flow instructions would be a more efficient use of die space and result in an approximately

41

10% reduction in state (only including injectable state in the latch+RAM experiments). Since all of the superfluous state is generally logically dead, its removal would result in a decrease in the masking rates of the microprocessor. Nonetheless, we believe that our model was created with sufficient detail to provide error manifestation results accurate to an order of magnitude when compared with those of a real implementation. For peak bandwidth concerns, many hardware structures in modern microprocessors are not used to their full potential. This leads to significant levels of transient fault masking, which our experimental results corroborate.

42

CHAPTER 7

RELATED WORK

Czeck and Siewiorek [6] performed a similar analysis through fault injection into selected bits of state in their simulation model. Here, we use a more modern simulation model and do a more thorough classification of the failure modes of various types of state in a microprocessor. Mukherjee et al. [14] introduced a method to compute Architectural Vulnerability Factors for various processor components and IA-64 software through analysis. The general experimental results presented in this work corroborate their analytic findings. Kim and Somani [7] injected faults into picoJava-II, a microprocessor core developed by Sun Microsystems. Their microarchitectural model is more accurate than the one used in this work; however, it is less complex in terms of high-performance microarchitectural features. Also, they only verify the architectural state of the machine. Here, trials that result in a complete microarchitectural state match are identified along with architectural state failures. Ando et al. [10] protected the data and address paths of their SPARC64 design with parity. Gaisler [11] protected the register file in his SPARC V8 implementation using a technique similar to the one used in this work. Furthermore, he protected various flip-flops by using triple modular redundancy and providing three separate clock trees. Franklin [15] noted different modes of failure throughout the pipeline, and proposed mechanisms to guard against them. Here, vulnerable state was identified through fault injection, and protection mechanisms to defend against a majority of transient faults were proposed, implemented, and tested. Magda [16] also explored the effects of transient faults upon a microprocessor model and also implemented various lightweight protection mechanisms. The model used in this work is more sophisticated, and the experiments are performed in a more controlled manner.

43

Other work related to the microarchitectural work presented here include higher overhead mechanisms to protect microprocessors with various forms of redundancy in microarchitecture [8], [17][20]. Here, arguably lower overhead approaches are proposed, albeit with lower fault coverage. Rotenberg [12], Mukherjee et al. [14], and Fahs et al. [21] explored the composition of dynamic instruction streams for dead and silent instructions. This work explores the same subject through fault injection and identifies a larger set of dynamically dead instructions. Namely, a significant portion of control instructions are dead, and thus, instructions that produce values solely for these control instructions are similarly dead.

44

CHAPTER 8

CONCLUSIONS

In this work, an analysis of the effects of transient faults on high performance processors was characterized. To accomplish this, a detailed microarchitectural model was created, and a fault model was selected. The results of the ensuing fault injection experiment were not particularly surprising: the most vulnerable parts of a processor are those that often hold architectural state. Regardless, this information was taken into account when devising lightweight protection mechanisms to cover the majority of the failures. After pipeline stalls, register file corruptions, register file pointer corruptions and instruction word corruptions were addressed, another set of fault injection experiments revealed that the largest contributors to architectural state failures were program counter, data, and miscellaneous control fields. Most of these fields are likely fairly straightforward to protect, but the various state machines and control state that make up a portion of the control category may be difficult to protect with techniques short of TMR. Utilizing some of the fault signatures found in the microarchitectural experiments, fault injection experiments were carried out on the architectural layer. These experiments served to provide a general estimate of the percentage of ultimately dead instructions in the dynamic instruction stream. Here, ultimately dead is defined as an instruction whose result value, when corrupted, does not affect correct program execution. That is, the corruption's effects get completely masked out prior to external communication. Somewhat surprising is the fact that nearly half of all dynamic conditional branches are ultimately dead. By this, we mean that execution down the fall-through and taken paths of a branch are occasionally equivalent. The reason for this is analogous to the reason dead instructions exist: the compiler and programmer must write static code that is correct

45

for all dynamic possibilities. Inefficiencies in the dynamic instruction stream ensue, and, from the results presented in this work, the degree of inefficiencies appears to be fairly high, even for conditional branches. These dynamic inefficiencies result in software masking of errors presented from the microarchitectural substrate. To summarize our experimental findings, we found that at least 80% of injected single event upsets into pipeline latches and RAM arrays in our baseline microarchitecture are masked from software. We also found significant masking levels present in software for various fault models. Together, the two levels of masking hide 9 out of every 10 transient faults that have been latched from affecting correct program execution. With precisely placed low overhead protection mechanisms, the level of masking is even higher. This gives an idea of the underutilization of modern microprocessors and dynamic inefficiencies of software.

46

REFERENCES

[1] H. Cha, E. M. Rudnick, J. H. Patel, R. K. Iyer, and G. S. Choi, "A gate-level simulation environment for alpha-particle-induced transient faults," IEEE Transactions on Computers, vol. 45, no. 11, pp. 1248­1256, November 1996. [2] P. Lid´n, P. Dahlgren, R. Johansson, and J. Karlsson, "On latching probability of particle e induced transients in combinational networks," in Proceedings of the 1994 International Symposium on Fault-Tolerant Computing, 1994, pp. 340­349. [3] M. Baze and S. Buchner, "Attenuation of single event induced pulses in CMOS combinational logic," IEEE Transactions on Nuclear Science, vol. 44, no. 6, pp. 2217­2223, December 1997. [4] S. Buchner, M. Baze, D. Brown, D. McMorrow, and J. Melinger, "Comparison of error rates in combinational and sequential logic," IEEE Transactions on Nuclear Science, vol. 44, no. 6, pp. 2209­2216, December 1997. [5] P. Shivakumar, M. Kistler, S. Keckler, D. Burger, and L. Alvisi, "Modeling the effect of technology trends on the soft error rate of combinational logic," in Proceedings of the 2002 International Conference on Dependable Systems and Networks, 2002, pp. 389­398. [6] E. W. Czeck and D. Siewiorek, "Effects of transient gate-level faults on program behavior," in Proceedings of the 1990 International Symposium on Fault-Tolerant Computing, 1990, pp. 236­243. [7] S. Kim and A. K. Somani, "Soft error sensitivity characterization for microprocessor dependability enhancement strategy," in Proceedings of the International Conference on Dependable Systems and Networks, 2002, pp. 416­425. [8] C. Weaver and T. Austin, "A fault tolerant approach to microprocessor design," in Proceedings of the 29th Annual International Symposium on Computer Architecture, 2002, pp. 87­98. [9] D. Burger, T. Austin, and S. Bennett, "Evaluating future microprocessors: The simplescalar tool set," University of Wisconsin - Madison Technical Report, Tech. Rep. 1308, July 1996. [10] H. Ando et al., "A 1.3 GHz fifth generation SPARC64 microprocessor," in Design Automation Conference, 2003, pp. 702­705. [11] J. Gaisler, "A portable and fault-tolerant microprocessor based on the SPARC V8 architecture," in Proceedings of the International Conference on Dependable Systems and Networks, 2002, pp. 409­415. [12] E. Rotenberg, "Exploiting large ineffectual instruction sequences," North Carolina State University, Tech. rep., November 1999. 47

[13] N. Wang, M. Fertig, and S. Patel, "Y-branches: When you come to a fork in the road, take it," in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2003, pp. 56­66. [14] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin, "A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor," in Proceedings of the 36th International Symposium on Microarchitecture, 2003, pp. 29­40. [15] M. Franklin, "Incorporating fault tolerance in superscalar processors," in Proceedings of High Performance Computing, 1996, pp. 301­306. [16] W. G. Magda, "Evaluating the impact of noise on microprocessor operation using an RTL model," M.S. thesis, University of Illinois at Urbana-Champaign, 2002. [17] M. Gomaa, C. Scarbrough, T. N. Vijaykumar, and I. Pomeranz, "Transient-fault recovery for chip multiprocessors," in The 30th Annual International Symposium on Computer Architecture, 2003, pp. 98­109. [18] S. S. Mukherjee, M. Kontz, and S. K. Reinhardt, "Detailed design and evaluation of redundant multithreading alternatives," in Proceedings of the 29th Annual International Symposium on Computer Architecture, 2002, pp. 99­110. [19] E. Rotenberg, "AR-SMT: A microarchitectural approach to fault tolerance in microprocessors," in Proceedings of Fault-Tolerant Computing Systems, 1999, pp. 84­91. [20] T. N. Vijaykumar, I. Pomeranz, and K. Cheng, "Transient-fault recovery using simultaneous multithreading," in Proceedings of the 29th Annual International Symposium on Computer Architecture, 2002, pp. 87­98. [21] B. Fahs et al., "Performance characterization of a hardware framework for dynamic optimization," in Proceedings of the 34th Annual International Symposium on Microarchitecture, 2001, pp. 16­27.

48

Information

54 pages

Report File (DMCA)

Our content is added by our users. We aim to remove reported files within 1 working day. Please use this link to notify us:

Report this file as copyright or inappropriate

1114868


You might also be interested in

BETA