# Investigating How Software Characteristics Impact the Effectiveness of Automated Software Fault Tolerance

Benjamin James, Michael Wirthlin, and Jeffrey Goeders

Abstract—A number of publications have examined automated fault tolerance techniques for software running on commercial off-the-shelf microcontrollers. Recently, we published an automated compiler-based protection tool called COAST (COmpiler Assisted Software fault Tolerance), a tool that automatically inserts dual- or triple-modular redundancy into software programs.

In this work we use COAST to explore how the effectiveness of automated fault protection varies between different benchmarks, tested on an ARM Cortex-A9 platform. Our hypothesis is that certain benchmark characteristics are more likely than others to influence the effectiveness of automated fault protection.

Through neutron radiation testing at the Los Alamos Neutron Science Center (LANSCE), we show cross-section improvements vary from 1.6x to 54x across eight benchmark variants. We then explore the characteristics of these benchmarks, and investigate how properties of these benchmark may impact the effectiveness of automated fault protection. Finally, we leverage a novel fault injection platform to isolate two of these benchmark characteristics, and validate our hypotheses.

*Index Terms*—Software fault tolerance, single event upset (SEU), silent data corruption (SDC), soft errors.

### I. INTRODUCTION

Radiation hardened processors are typically much more expensive and offer lower performance than commercial offthe-shelf (COTS) equivalents. This provides an incentive for finding software-based techniques that increase the fault tolerance of COTS microprocessors so they can be used in high radiation environments, such as space.

Recent studies have explored different methods for providing programs with fault tolerance through purely software approaches. A common way of providing fault tolerance is through replicating program instructions and/or variables. By inserting one or two replicas of every software operation, faults can be detected and reported or corrected at runtime. There is certainly a performance cost for this type of protection, but this kind of approach has been shown to be successful at reducing the overall error rate, and increasing the mean work to failure (MWTF) [1]–[4].

There are different ways of adding duplicated and triplicated instructions, such as modifying the architecture assembly code

This work was also supported by the Los Alamos Neutron Science Center (LANSCE) which provided beam time under proposal NS-2019-8294.

by hand [1], [2]. As this is not the most ideal solution, other works have studied automated methods of applying mitigation techniques [1], [5]–[10]. However, to the best of our knowledge, none of these works have provided a publicly available, open-source tool that others can use to replicate the work, or to use on new projects.

In July 2018, we released COAST (COmpiler Assisted Software Fault Tolerance), a compiler-based tool that *automatically* applies existing software mitigation techniques to user software. The tool is open-source, and publicly available at https://github.com/byuccl/coast. Since COAST is much more automated and flexible than previous work in this area, it is suitable as a tool to explore the effectiveness of software protection on different processing platforms and benchmarks. In recent work we showed how software protection could be applied to several different architectures (MSP 430, ARM 32-bit and 64-bit, and RISC-V) [11], [12]. Across different target architectures, we saw decreases in cross-section ranging from 4x–106x.

Most of the previous works that explored automated protection provided experimental results on just a couple key benchmarks. The automated nature of our work allows us to explore the effectiveness of software protection on a wide range of benchmarks. In this work we aim to explore and understand what characteristics of particular benchmarks make them more apt for protection through data and instruction replication. Rather than varying the *platform*, we vary the *program*. The results from testing multiple benchmarks will allow us to better understand how the effectiveness of automated protection changes from benchmark to benchmark, and ideally help designers to understand why automated protection may or may not provide substantial reliability improvements.

While we would ideally apply our tool to tens or hundreds of different C programs and build an accurate predictive model, this is not feasible. Limited access to radiation testing facilities combined with the relatively low frequency of errors means we must restrict the number of benchmarks to a small sample set.

The main contributions of this paper are:

• Experimental testing of multiple C programs at the Los Alamos Neutron Science Center (LANSCE). The platform under study, the 32-bit Xilinx Zynq ARM Cortex-A9, had 8 benchmarks tested on it. Across all these benchmarks, we saw reduction in cross-sections from 1.6x–54x.

B. James (b\_james@byu.edu), M. Wirthlin (wirthlin@byu.edu), and J. Goeders (jgoeders@byu.edu) are with the Dept. of Electrical and Computer Eng., Brigham Young University, 450 EB BYU, Provo, UT 84602 USA, and the NSF Center for Space, High-performance, and Resilient Computing (SHREC).

This work was supported by the I/UCRC Program of the National Science Foundation under grant 1738550.

- An analysis of the experimental results, from which we draw insight into benchmark characteristics that may commonly impact software-based fault protection.
- A fault injection framework that is used to isolate and validate two benchmark properties that impact fault protection effectiveness.

The paper is organized as follows: Section II gives more background on related work and the COAST tool. Section III outlines the way in which we tested our benchmarks in a radiation beam and shows the results thereof. Section IV analyzes the radiation test results and benchmark characteristics. Section V details our subsequent fault-injection study, and Section VI provides conclusions.

#### II. BACKGROUND

## A. Related Work

There have been several works which have investigated adding fault mitigation to software programs. Error Detection by Duplicated Instructions (EDDI) [5] first presented techniques for fine-grained duplication of instructions. These techniques duplicate all instructions and variables while maintaining a single control-flow, which requires synchronization of data-flows before any control-flow branching or function calls. This technique is also known as Duplicate With Compare (DWC), and allows for detecting errors at the cost of increased code size and execution time.

Later work introduced SWIFT-R [9], which extended this type of technique to triplication, allowing for not only error detection, but also correction. This is similar to Triple Modular Redundancy (TMR) in hardware, and is often referred to by this name in software as well. Software TMR has an even higher cost in code size and execution time than DWC, but with the added benefit of being able to tolerate errors without a reset or rollback.

There have been other works which explored different replication and synchronization rules, as this is an important factor when evaluating trade-offs between increased run time and fault coverage. Chielle et al.[13] presented a set of rules that can be used to guide decisions about replication and synchronization. The COAST tool implements software protection based on some of these rules.

Although there have been many recent works exploring different variations and optimizations on these basic DWC/TMR techniques [1]–[4], [7], [8], [14]–[19], there is no other current tool that offers the automation and flexibility of COAST. These previous works have used hand-modified assembly code, relied on specific architectures or assembly code formats, or leveraged specific processor features to obtain fault tolerance. In addition, none of these works are available as open-source tools, and very few have been tested in an actual high-radiation environment.

# B. COAST

Our software protection tool, COAST, automatically adds data-flow protection to arbitrary user programs. The default configuration (and the configuration used in our experiments)

| do:                 | do :                    |
|---------------------|-------------------------|
| 1d r0 = i           | 1d r0 = i               |
|                     | $ld r10 = i_copy$       |
|                     | $ld r20 = i\_copy2$     |
| r1 = sub r0, 1      | r1 = sub r0, 1          |
|                     | r11 = sub r10, 1        |
|                     | r21 = sub r20, 1        |
| r2 = cmp r1, 0      | r2 = cmp r1, 0          |
|                     | r12 = cmp r11, 0        |
|                     | r22 = cmp r21, 0        |
|                     | r3 = cmp r2, r12        |
|                     | r4 = select r3, r2, r22 |
| br neq r1 <b>do</b> | br neq <b>r4</b> do     |
|                     |                         |

Fig. 1: Code before and after TMR mitigation, from [11]

(b) TMR code

(a) Original code

is based on the *VAR3* scheme from [6], which is to replicate all compute and memory load/store instructions, and to synchronize as necessary before control flow instructions. However, the COAST tool is very configurable; it supports both DWC and TMR modes, as well as changing some of the replication and synchronization rules. Synchronization consists of (for DWC) a comparison of the two data flows, or (for TMR) a voter which determines the correct value based on the three copies. The replication of existing instructions and insertion of synchronization instructions is fully automated as part of the program compilation.

Figure 1 shows an example of what some assembly code would look like before and after it is run through COAST. The bold text shows the changes made by our compiler pass.

In our past work, we focused on proving the usefulness of COAST [20], or showing its usefulness on multiple target architectures [11]. In this work, we aim to show which *types* of benchmarks can benefit the most from being protected with software techniques.

#### **III. RADIATION TEST**

Radiation testing offers a realistic view into the effectiveness of fault mitigation techniques; however, the high cost and relatively low availability of testing facilities often means that only a few system configurations can be evaluated. In past work we observed that fault mitigation was much more effective on some benchmarks than others.

The goal of this test was to evaluate *several* benchmarks on a single platform, to gain an understanding of what software characteristics are present in benchmarks which benefit more from software-based fault protection. We tested eight different benchmarks executing on three identical Xilinx PYNQ development boards. Testing was performed at the Los Alamos Neutron Science Center (LANSCE) over the span of five days.

#### A. Methodology

1) Device Under Test: The DUT was a ZYNQ XC7Z020 SoC FPGA, which contains an embedded dual-core 32-bit 667MHz 28nm ARM A9 processor. There is a 32 KB instruction cache, 32 KB data cache, and a 512 KB unified cache per



Fig. 2: Neutron beam test setup

core (non-ECC). Only one core of the processor was used in the test, and the FPGA fabric was not utilized or tested. The platform was configured as a bare-metal system, with only essential Board Support Package (BSP) software. The 30L flight path (Ice House) was used, and the three boards were placed 99, 101 and 106cm from the neutron detector. These distances were accounted for when determining the fluence received by each board during the experiments. Each board was placed so the A9's external DRAM chip was outside of the 2" diameter neutron beam. Figure 2 provides a photo of the setup, which includes boards from several other experiments.

During the experiment, each benchmark runs some computation operation, then checks the result against a known golden value. In cases where the program output is a small value, such as crc32, the golden value is the exact program output. For other benchmarks, the golden value is a hash of a larger output (the MxM matrix multiplication benchmark hashes the resultant matrix), or in the case of the *qsort* benchmark, the code ensures the result is sorted. This approach to output validation aims to minimize cases where the golden value, which isn't protected, can be corrupted while the program executes.

The benchmarks repeatedly execute the same computation operation, periodically printing a heartbeat message via UART, which is monitored by a computer that is outside the path of the neutron beam. If the computed value does not match the golden value, the program immediately prints an error message to the monitoring computer. In these cases the controlling system power cycles the board and reprograms the software. Other events can also trigger a reprogramming, including a malformed output, or a heartbeat timeout.

In the experiments described in this paper, we employ the TMR option of COAST, which inserts voter operations at branch points in the program. We also enable the -countErrors option which allows for enhanced voter code that tracks whether any fault was detected and corrected. While this introduces some extra overhead that would not be used on a deployed system, it provides useful data for our experiment. When faults are detected, they are reported to the controlling system, and they also trigger a power cycle.

## B. Benchmarks

We used eight different benchmarks in our test, as outlined below:

- **crc32:** A 32-bit Cyclic Redundancy Check. This computes the hash of a statically defined table of 32-bit values.
- **dijkstra:** From the MiBench test suite, this finds the shortest path between a predefined set of nodes.
- **matrixMultiply:** Matrix multiplication, tested with two sizes: *Fit L1*, where the matrices were sized to all fit in the L1 cache (when triplicated), and *Fit L2*, where they likewise fit in the L2 cache.
- **nanojpeg:** A simple JPEG decoder<sup>1</sup>; input data is a JPEG image converted to a C array.
- **qsort:** Sort an array of floating point numbers. Tested in two configurations: *Library*, where we use the C standard library implementation of qsort, which is notably not protected by our tool; and *Custom*, which uses our own code for the sorting kernel, which allows protection to be enabled on the entire algorithm.
- **sha256:** Computes the SHA-256 hash of a statically defined array.

Each benchmark was compiled and tested using an original unmitigated version, and a TMR'd version produced by COAST.

# C. Radiation Test Results

The results from our experiment are shown in Table I. The first column lists the benchmark and protection configuration. The next column lists the total Fluence received for each benchmark configuration, which was calculated by correlating timestamps for when each benchmark was running with timestamped flux measurement logs from LANSCE. The next set of columns list the different abnormal statuses encountered during repeated benchmark execution. The Faults column lists the number of times the TMR voters in the code detected and corrected a fault. Errors are the number of times the benchmark computed a result which did not match the golden value. A Hang was recorded when the benchmark heartbeat stopped responding for a significant amount of time (about 10x the expected heartbeat interval). An Invalid Status was recorded any time the UART message from the benchmark did not match the expected regular expression format. Any of these unsuccessful runs triggered a reset of the board.

The columns *Code Size* and *Runtime* are for comparing the overhead required for COAST protection against the original version of the benchmark. *Code Size* is the size of the compiled ELF file, measured in KB.

The column *Cross Section* measures the error rate according to Equation (1).

<sup>&</sup>lt;sup>1</sup>based on https://keyj.emphy.de/nanojpeg/

TABLE I: Neutron beam test results

| Configuration<br>(Bench, Options)      | Fluence<br>(n/cm2)                                                     | Faults<br>(TMR<br>Fixed) | Errors<br>(SDC) | Hangs/<br>Invalid<br>Status | Code Size<br>(KB) |                 | Runtime<br>(ms) |                      | Cross<br>Section<br>(cm2)                            |                 | MWTF             |
|----------------------------------------|------------------------------------------------------------------------|--------------------------|-----------------|-----------------------------|-------------------|-----------------|-----------------|----------------------|------------------------------------------------------|-----------------|------------------|
| crc32, Unmit<br>crc32, TMR             | $\begin{array}{c} 2.41\times10^7\\ 2.6\times10^8 \end{array}$          | N/A<br>20                | 5<br>0          | 1/1<br>11/1                 | 159<br>191        | -<br>↑ 1.201x   | 936<br>1162     | -<br>↑ 1.241x        | $2.08 \times 10^{-7}$<br>** $3.84 \times 10^{-9}$    | -<br>↓ 53.99x   | -<br>↑ 43.49x    |
| dijkstra, Unmit<br>dijkstra, TMR       | $1.14 \times 10^9$<br>$6.25 \times 10^9$                               | N/A<br>13                | 0<br>0          | 76/2<br>356/1               | 171<br>191        | -<br>↑ 1.117x   | 478<br>2414     | ↑ <b>5.05</b> x      | $^{**8.81\times10^{-10}}_{**1.6\times10^{-10}}$      | ↓ <b>5.51</b> x | -<br>↑ 1.09x     |
| MxM, Unmit, L2<br>MxM, TMR, L2         | $\begin{array}{c} 1.23\times10^8\\ 4.97\times10^8\end{array}$          | N/A<br>101               | 24<br>7         | 12/0<br>47/0                | 307<br>536        | ↑ <b>1.75</b> x | 212<br>640      | ↑ 3.02x              | $1.95 \times 10^{-7}$<br>$1.41 \times 10^{-8}$       | ↓ 13.85x        | ↑ <b>4.58</b> x  |
| MxM, Unmit, L1<br>MxM, TMR, L1         | $\begin{array}{c} 8.06 \times 10^8 \\ 1.14 \times 10^{10} \end{array}$ | N/A<br>14                | 3<br>1          | 36/0<br>519/3               | 209<br>228        | -<br>↑ 1.09x    | 1528<br>2897    | -<br>↑ 1.9x          | $3.72 \times 10^{-9}$<br>$8.8 \times 10^{-11}$       | ↓ 42.28x        | ↑ 22.3x          |
| nanojpeg, Unmit<br>nanojpeg, TMR       | $5.85 \times 10^9$<br>$7.27 \times 10^9$                               | N/A<br>119               | 47<br>22        | 324/1<br>329/1              | 187<br>241        | -<br>↑ 1.29x    | 324<br>2503     | ↑ 7.73x              | $8.04 \times 10^{-9}$<br>$3.02 \times 10^{-9}$       | ↓ <b>2.66</b> x | ↓ <b>2.91</b> x  |
| qsortLib, Unmit<br>qsortLib, TMR       | $\begin{array}{c} 8.62\times10^9\\ 6.85\times10^9\end{array}$          | N/A<br>13                | 2<br>0          | 464/4<br>333/2              | 290<br>429        | -<br>↑ 1.48x    | 77<br>189       | ↑ <b>2.45</b> x      | $2.32 \times 10^{-10}$<br>**1.46 × 10 <sup>-10</sup> | -<br>↓ 1.59x    | ↓ 1.54x          |
| qsortCustom, Unmit<br>qsortCustom, TMR | $5.25 \times 10^9$<br>$2.29 \times 10^{10}$                            | N/A<br>22                | 10<br>0         | 255/0<br>1119/0             | 290<br>429        | -<br>↑ 1.48x    | 277<br>880      | ↑ <b>3.18</b> x      | $1.9 \times 10^{-9}$<br>**4.37 × 10 <sup>-11</sup>   | -<br>↓ 43.61x   | ↑ <b>13.73</b> x |
| sha256, Unmit<br>sha256, TMR           | $5.21 \times 10^{7}$<br>$2.13 \times 10^{8}$                           | N/A<br>30                | 4<br>2          | 2/241<br>10/0               | 138<br>215        | ↑ <b>1.56</b> x | 14<br>57        | -<br>↑ <b>4.07</b> x | $7.68 \times 10^{-8}$<br>$9.37 \times 10^{-9}$       | -<br>↓ 8.19x    | -<br>↑ 1.95x     |

\*\*No errors observed, so this is calculated given one error (assuming the worst-case, where an error could be observed on the next neutron).

$$Cross Section = \frac{Errors (SDC)}{Fluence}$$
(1)

The results show that the COAST TMR protection reduces cross-section by 1.0x to 54x, indicating that the characteristics of the benchmark significantly influence the effectiveness of the fault mitigation. The cross-section results from Table I are summarized in Figure 3, which shows the cross-section for each of the benchmarks with 95% confidence error bars.

Along with cross-section, we have the indicator *Mean Work To Failure* (MWTF) that puts cross-section in the context of the run-time overhead. In other words, benchmarks which run longer have more time during which they can be upset. The equation for calculating MWTF is given by Equation (2).

$$MWTF = \frac{\text{amount of work completed}}{\text{number of errors encountered}}$$
(2)  
= (raw error rate · AVF · execution time)<sup>-1</sup>

When taking run-time into consideration, it can be seen that while most benchmarks improved in MWTF (1.1x-43x), there were a couple that degraded (*nanojpeg* and *qsortLib*), meaning that the improvement in cross-section is not sufficient to overcome the increased fault rate due to the longer runtime.

### **IV. BENCHMARK ANALYSIS**

When we began this test, we had hopes of using the data to construct a model from which to predict the fault coverage of future programs when protected by COAST. However, it quickly become apparent that there are simply too many factors at play to develop an accurate predictive model, and doing so would require many more benchmarks and hours of radiation testing, which would be infeasible. However, it was still our intention to gather as much insight into the data we were able to collect, to help learn some lessons for future work, and gather insights that may be helpful for future engineers attempting to apply automated software protection.

The approach we took was to analyze the set of benchmarks we tested in radiation, to determine whether we could find benchmark properties that would correlate with the *improvement in reliability* when automated software protection was applied. More specifically, we tried to identify a set of benchmark properties that correlated with the *factor decrease to cross-section* when automated protection was applied to our benchmark set. To do this, we identified a large set of benchmark characteristics, and then determined which subset of these provided the best fit using multiple linear regression.

We recognize that our benchmark set is limited, and with small data sets it may be easy to infer correlation between data when it does not really exist, so we choose to use these identified characteristics to motivate further fault injection testing to validate that these characteristics impact the effectiveness of automated protection. These fault injection results are presented in Section V.

### A. Characteristic Set

The set of characteristics we found to be most impactful were the following:

- 1) **Peak Heap Usage** (in kbytes) *Negatively correlated with effectiveness of fault protection.*
- 2) **Static Memory Size** Size of global variables in memory (.data and .bss section counted, in kbytes) *Positively correlated with effectiveness of fault protection.*
- 3) **Sync Points/s** How many times a synchronization voter was hit per second of program execution.*Positively correlated with effectiveness of fault protection.*
- Fault Tolerance of Unprotected Benchmark This characteristic measures the cross-section (cm<sup>-2</sup>) of the unmitigated program, determined from our experimental data.



Fig. 3: Benchmark Cross-Sections, 95% confidence interval

# Unmitigated cross-section is positively correlated with effectiveness of fault protection.

There were several other benchmark characteristics that we examined that either showed no meaningful correlation for our benchmark set, or were redundant when considered with other properties. These characteristics included maximum resident set size (memory), read/write ratio, error rate from fault injection on the register file, and all combinations of cache access characteristics for each of the L1 and L2 caches. Although these were not influential for the data set we obtained from the radiation testing, it is certainly possible that some of these characteristics could affect the applicability of our fault mitigation techniques if other benchmarks were used, or if a more thorough regression was performed that included a larger data set, or more characteristics.

Furthermore, our set of characteristics is not meant to be an exhaustive list of meaningful benchmark properties. It is very possible that there are other benchmark properties that we failed to identify that may serve as good predictors of the effectiveness automated fault protection. However, we feel there are still meaningful design lessons to be learned from the characteristics we analyzed.

We now discuss each of these characteristics in greater detail:

# B. Peak Heap Usage

Our results indicated that an increase in heap usage negatively correlated with the effectiveness of our automated fault protection. Peak heap usage was obtained using the dynamic analysis tool massif, from the valgrind tool suite. In our benchmark set, only a few programs used the heap, with nanojpeg and gsort-Library being the largest users. Since the Static Memory Size positively affected crosssection performance, we concluded that it was not actually the memory usage itself that was the primary issue, but rather the calls to malloc, and the way that it manages heap memory. It seems that the more often malloc is called, the less effective COAST is at protecting the code. There are a couple reasons we expect this is the case: 1) since malloc is a library function, it cannot be protected by COAST; 2) even when the memory regions are passed back to the protected code, malloc'd regions have special header/footer metadata sections that COAST cannot synchronize. These metadata sections are used by subsequent calls to malloc and free to determine how each block of memory should be managed. If a fault occurs in any of these special regions, it is likely unrecoverable. Based on this, we believe it is best to avoid using dynamic memory allocation when wanting to perform software-based fault mitigation. As a second point of reference, the JPL coding standard strongly discourages dynamic memory allocation.

# C. Static Memory Size

Static memory usage was determined by inspecting the program executables using the readelf utility and observing the sizes of the .data and .bss sections. The positive correlation indicates that we expect fault tolerance effectiveness to increase as the amount of memory set aside for variables increases. In our test platform, the main memory consists of a large DRAM chip, which is outside of the beam path and naturally more resistant to radiation-based upsets than SRAM [21]. However, data in the processor caches is still highly susceptible to faults, and our previous test results indicate that COAST is effective at protecting values that reside in caches [12]. Furthermore, COAST, and other tools like it that apply protection through data replication are inherently designed to protect against data upsets. These approaches do not target upsets that could happen to control-flow elements such as the PC register, return values on the stack, etc. Since these data-replication approaches are designed to target upsets in data memory, it is not surprising that as programs become more data-heavy, and data sets increase in size, these tools are more effective at protecting against upsets.

To be more precise, although this characteristic is called "static memory size", it is actually the total size of the .data and .bss sections of the ELF file. These sections represent the majority of the data variables, besides those that are allocated from the .heap section, which is described in the previous characteristic. In summary, we believe the more data variables there are, the better able COAST is at protecting the program.

Some of this behavior may be due to the cache configuration of the platform under test. In our previous radiation experiments [12], we tested the PYNQ board with and without caches enabled. The error rate with the caches enabled was noticeably higher than the error rate with the caches disabled. This means that the errors are more likely to occur in the caches than in main memory or the processor registers. So COAST is helpful in this case because the cross section is so much higher with the caches enabled that there's plenty of room for improvement. It is possible that a different memory hierarchy configuration would have different behavior under a high radiation environment. When application designers are considering using automated protection to provide fault tolerance, they will need to take these architectural features into account when anticipating whether automated fault protection will be effective.

#### D. Synchronization Points/s

Synchronization points, or voters, are locations in the code where the automated fault protection has inserted operations that inspect the redundant copies of a piece of data and vote on which value should be propagated into the future.

The number of synchronization points encountered during normal execution was determined by modifying our compiler tool to automatically instrument the code such that it would increment a global counter each time a synchronization point was encountered. This number was then divided by the program execution time.

Our results suggest that benchmarks that synchronize more often will see more benefit from our protection techniques. The reason is due to the granularity of replication afforded by COAST. The data flow is replicated at the instruction level, so any data errors could be checked for as often as every 3 cycles. Although this is somewhat extreme, the general rule is, the sooner the error is detected and corrected, the less chance it has of propagating through the system.

One thing to keep in mind is that it is possible to have too many sync points. Although synchronization allows the TMR'd code to detect/correct errors, it does introduce a potential single point of failure. The code that does the voting is vulnerable to upsets, which represents a failure mode that did not exist in the original, unmitigated version of the code. Analyzing these sync points would be very difficult, as it's not as straightforward as simply measuring the memory usage as in other predictors. The sync points can vary distinctly in quantity, type (data store vs branch comparison), and placement. However, it appears that with the benchmarks tested, we did not exceed the ratio of normal code to synchronization code that would cause it to perform worse.

#### E. Fault Tolerance of Unprotected Benchmark

Our model is designed to measure improvement to crosssection; however, it's important to note that if the benchmark was already inherently fault tolerant, there may be fewer opportunities for COAST to improve its cross-section.

We used the radiation test cross-section results from the unprotected benchmarks to determine how naturally susceptible each benchmark was to upsets. Or put another way, the crosssection of the unmitigated benchmark provides indication of how likely an upset will manifest as an error in the program output. The larger the cross-section, the less fault tolerant a benchmark is, and thus, there are more opportunities to improve reliability through automated fault protection. On the other hand, if a benchmark has very low cross-section, it may already naturally mask faults, and the runtime and memory overheads of imposing automated fault protection may not be worth it.

An example of this is seen in the *quicksort* benchmarks: our golden checking code ensures that the values were sorted correctly; however, it does not actually check that no bits were flipped. Thus, many faults could be naturally masked. While this may not be desirable for an actual sorting benchmark, it would likely arise in other benchmarks, such as machine learning algorithms which have been shown to be somewhat fault tolerant [22].

# V. VALIDATING CHARACTERISTICS THROUGH FAULT INJECTION TESTING

In order to validate some of the trends we observed in our radiation testing, we devised a set of experiments to try and isolate a couple particular benchmark characteristics, and then use fault injection testing to determine how the changes impact the effectiveness of our automated protection scheme.

# A. Fault Injection Experiments

We created two different fault injection experiments:

1) Matrix Multiply Size: We modified our matrix multiplication benchmark to vary the sizes of the input matrices. These matrices are stored entirely in static memory, so this was done to validate our observation that our automated protection approach is more effective on benchmarks with larger data sizes. In this experiment we tested four different matrix sizes: 30x30, 75x75, 120x120 and 180x180. In each case we created an unmitigated version of the benchmark, and then a protected version where TMR protection was applied to the code.

In this experiment we are interested in observing whether the decrease in error rate obtained by TMR protection does indeed improve as the matrix size increases.

2) Inherent Benchmark Fault Tolerance: The other experiment we performed is designed to explore our observation that benchmarks which already mask upsets will not see as much improvement with TMR protection.

In this case we modified our *qsortLib* benchmark. In the original version, an unsorted array is input into the function, and the quick sort algorithm sorts the values. The golden checking code at the end of the benchmark checks that the values in the array are indeed in sorted order. It is important to recognize that this approach will inherently mask many upsets. This is the case for a couple reasons. First, if an array entry is sorted into place, and then a bit is flipped, it may often still be in sorted order (especially if a lower order bit was flipped). In addition, if a bit is flipped in an array entry before it is sorted into place, the algorithm may still produce a sorted array, even though the final array may be different than the original data set.

We then took this fault-tolerant version of quick sort, and modified the golden checking code to instead produce a hash of the sorted values. If this hash did not exactly match a golden hash value, an error is reported. This removes the natural fault tolerance of the algorithm, and will instead report an error if any single bit of the array data is modified.

#### B. Fault Injection Framework

To evaluate these benchmark characteristics we performed fault injection using our own custom-designed fault injection platform, PACIFIC (Platform for ACtive Injection of Faults In a Campaign). This framework, which we are publicly releasing as part of our COAST tool (https://github.com/byuccl/coast), approximates radiation testing using randomly injected faults into software while it is executing. Our fault injection tool uses QEMU, a popular machine emulator, to perform fault injections at random locations in memory, and at random points in time during program execution.

While many other fault injection tools exist [23]–[27], our fault injection framework is noteworthy for a few reasons: 1) It leverages custom QEMU plugins, rather than requiring modifications to the QEMU source code like previous tools, 2) it supports fault injection on bare metal programs, 3) fault injections are granular to the processor-cycle level, and 4) it is specifically designed to allow simulating fault injections in the processor cache.

Testing is done in the form of a fault injection "campaign", where the user specifies 1) the executable to be tested, 2) the section to be targeted, and 3) the number of faults to inject. The campaign supervisor will manage then QEMU and GDB instances and inject the specified number of faults, randomly distributed across the bits in the desired section. This is done over multiple runs of the program, where on each execution, the processor is paused and GDB is used to flip a single bit before execution is allowed to continue. If execution of the program does not finish, there is a watchdog which will detect if the program has gone on too long so it can be forcibly ended. The different possible results are: success, error detected, fault corrected, invalid output, and timeout. Figure 4 provides a system diagram of our fault injection framework.

In the experiments for this paper, we specifically chose to target the processor caches, since the bits in the caches represent a significant target for radiation-induced upsets [28], [29], and our previous radiation testing on the same platform [30] indicated that cache upsets were responsible for a large fraction of our errors.

Our framework is able to specifically target caches by using a QEMU plugin that "subscribes" to execution of all data load and store instructions, and will update an internal model of the processor caches. It maintains a model of what addresses in memory are resident in cache at any point of program execution, allowing us to inject faults specifically into these memory addresses.

The QEMU plugin system is also leveraged to enable cycle-accurate injection points. This second QEMU plugin subscribes to instruction execution events, allowing the plugin to monitor each time an instruction is executed. This means we can randomly inject after any number of instructions, and provides much finer control and better distribution than simply sleeping the process for a random amount of time and then pausing execution. This fined-grained approach does add significant runtime overhead, and means that thorough fault injection campaigns can take hours or days to complete.

# C. Fault Injection Results

The results of the fault injection experiments are provided in Table II.

The variations on the matrix multiplication benchmark confirm our hypothesis that COAST will provide more protection against errors as the more the program uses static memory. The error rate decreases for the 30x30, 75x75, 120x120 and 180x180 matrix sizes are 12x, 354x, 1491x and 2940x respectively. These values show COAST is highly effective at protecting against upsets in the cache, and the effectiveness increases with data size. However, it is important to recognize that fault injection does not perfectly reflect real radiation effects, and these results likely overestimate the effectiveness of protection. This is because the fault injection does not capture many upsets that COAST cannot fix, such as upsets in the program counter, control-flow structures, or internal processor state.

The experiment on the quicksort algorithm also confirmed our hypothesis regarding algorithms that are inherently fault tolerant. The TMR protection used by COAST provided a greater benefit on the hash-checking version, which reported an error whenever the simple XOR hash of the data detected a bit mismatch. Whereas the version that only checked that numbers were sorted (which could naturally mask upsets), did not achieve the same improvement. The difference was a 141x decrease in error rate versus 429x.

It is important to recognize that the fault tolerant version still has a lower *raw* error rate, as it is more likely to mask upsets. However, the *improvement* provided by protection is not as significant. This is an important consideration for those



Fig. 4: PACIFIC Fault Injection Framework

TABLE II: Fault Injection Results

| Configuration<br>(Bench, Options)        | # Runs         | Faults<br>(TMR<br>Fixed) | Errors<br>(SDC) | Hangs/<br>Invalid<br>Status | Error<br>Rate   |                | MWTF            |
|------------------------------------------|----------------|--------------------------|-----------------|-----------------------------|-----------------|----------------|-----------------|
| Matrix Multiply                          |                |                          |                 |                             |                 |                |                 |
| 30x30, Unmit<br>30x30, TMR               | 4000<br>35000  | 0<br>2473                | 28<br>20        | 1/0<br>126/1                | 0.70%<br>0.06%  | ↓ 12.25x       | ↑ <b>3.42</b> x |
| 75x75, Unmit<br>75x75, TMR               | 1000<br>58000  | 0<br>14964               | 122<br>20       | 0/0<br>49/3                 | 12.20%<br>0.03% | ↓ 353.8x       | -<br>↑ 84.81x   |
| 120x120, Unmit<br>120x120, TMR           | 1000<br>116000 | 0<br>69276               | 257<br>20       | 1/0<br>74/2                 | 25.70%<br>0.02% | -<br>↓ 1490.6x | -<br>↑ 169.57x  |
| 180x180, Unmit<br>180x180, TMR           | 1000<br>66000  | 0<br>53745               | 490<br>11       | 1/0<br>28/0                 | 49.00%<br>0.02% | ↓ 2940x        | ↑ 910.29x       |
| qsortLib                                 |                |                          |                 |                             |                 |                |                 |
| Check Sorted, Unmit<br>Check Sorted, TMR | 2000<br>217000 | 0<br>16276               | 39<br>30        | 2/0<br>279/1                | 1.95%<br>0.01%  | ↓ 141.1x       | ↑ 60.2x         |
| Check Hash, Unmit<br>Check Hash, TMR     | 1000<br>53000  | 0<br>17110               | 162<br>20       | 1/0<br>106/0                | 16.20%<br>0.04% | ↓ 429.3x       | ↑ 265.58x       |

looking to protect algorithms that may already naturally mask bit upsets, as the lower effectiveness offered by automated fault protection may not be worth the runtime and memory overheads.

# VI. CONCLUSION

In this article we have presented radiation test results of several benchmarks, tested both in their original form, and with automated fault protection applied. The results demonstrate that the effectiveness of automated protection varies greatly from benchmark to benchmark, with cross-section improvements ranging from 1.6x to 54x.

We analyzed several properties of the tested benchmarks to determine where correlations exist between the benchmark properties and the effectiveness of fault protection. While our data set is limited, it appears that some important benchmark characteristics include whether static or dynamic memory is used, the size of the data sets, how often replicated data is synchronized, and the inherent fault tolerance of the original algorithm.

Finally, we isolated and validated two of these properties (data size and inherent fault tolerance) through extensive fault injection, leveraging our custom-designed QEMU-based fault injection framework. In both cases the results validated what was observed in our original radiation testing.

The results of this work demonstrate how variations in algorithms and software workloads have a large impact on the effectiveness of automated fault tolerance. We hope this work will spur further exploration into improving automated techniques for software reliability.

# REFERENCES

 H. Quinn, Z. Baker, T. Fairbanks, J. L. Tripp, and G. Duran, "Software Resilience and the Effectiveness of Software Mitigation in Microcontrollers," in *IEEE Transactions on Nuclear Science*, vol. 62, Dec. 2015, pp. 2532–2538.

- [2] H. Quinn, Z. Baker, T. Fairbanks, J. L. Tripp, and G. Duran, "Robust Duplication With Comparison Methods in Microcontrollers," *IEEE Transactions on Nuclear Science*, vol. 64, no. 1, pp. 338–345, Jan. 2017.
- [3] E. Chielle, F. Rosa, G. S. Rodrigues, *et al.*, "Reliability on ARM Processors Against Soft Errors Through SIHFT Techniques," *IEEE Transactions on Nuclear Science*, vol. 63, no. 4, pp. 2208–2216, 2016.
- [4] D. S. Khudia, G. Wright, S. Mahlke, D. S. Khudia, G. Wright, and S. Mahlke, "Efficient soft error protection for commodity embedded microprocessors using profile information," ACM SIGPLAN Notices, vol. 47, no. 5, pp. 99–108, 2012.
- [5] N. Oh, P. P. Shirvani, and E. J. McCluskey, "Error detection by duplicated instructions in super-scalar processors," *IEEE Transactions on Reliability*, vol. 51, no. 1, pp. 63–75, Mar. 2002.
- [6] E. Chielle, R. S. Barth, Â. C. Lapolli, and F. L. Kastensmidt, "Configurable tool to protect processors against SEE by software-based detection techniques," in *LATW* 2012 - 13th IEEE Latin American Test Workshop, IEEE, Apr. 2012, pp. 1–6.
- [7] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August, "SWIFT: Software implemented fault tolerance," in *International Symposium on Code Generation and Optimization*, vol. 2005, IEEE, 2005, pp. 243–254.
- [8] M. Didehban and A. Shrivastava, "nZDC : A Compiler technique for near Zero Silent data Corruption," in *Design Automation Conference (DAC)*, New York, New York, USA: ACM Press, 2016, pp. 278–283.
- [9] J. Chang, G. Reis, and D. August, "Automatic Instruction-Level Software-Only Recovery," in *International Conference on Dependable Systems and Networks*, IEEE, 2006, pp. 83–92.
- [10] A. Martinez-Alvarez, S. Cuenca-Asensi, F. Restrepo-Calle, F. R. Palomo Pinto, H. Guzman-Miranda, and M. A. Aguirre, "Compiler-directed soft error mitigation for embedded systems," *IEEE Transactions on Dependable and Secure Computing*, vol. 9, no. 2, pp. 159–172, 2012.
- [11] M. Bohman, B. James, M. J. Wirthlin, H. Quinn, and J. Goeders, "Microcontroller compiler-assisted software fault tolerance," *IEEE Transactions on Nuclear Science*, vol. 66, no. 1, pp. 223–232, Jan. 2019.
- [12] B. James, H. Quinn, M. Wirthlin, and J. Goeders, "Applying compiler-automated software fault tolerance to multiple processor platforms," *IEEE Transactions on Nuclear Science*, vol. 67, no. 1, pp. 321–327, Jan. 2020.
- [13] E. Chielle, F. Rosa, G. S. Rodrigues, *et al.*, "Reliability on ARM processors against soft errors by a purely software approach," in *European Conference on Radiation and its Effects on Components and Systems*, vol. 2015-Decem, IEEE, Sep. 2015, pp. 443–447.
- [14] E. Chielle, F. L. Kastensmidt, and S. Cuenca-Asensi, "Overhead reduction in data-flow software-based fault tolerance techniques," in *FPGAs and Parallel Architec*-

tures for Aerospace Applications: Soft Errors and Fault-Tolerant Design, Cham, 2015, pp. 279–291.

- [15] R. Vemu, S. Gurumurthy, and J. A. Abraham, "ACCE: Automatic correction of control-flow errors," in *International Test Conference*, IEEE, 2007, pp. 1–10.
- [16] A. Shrivastava, A. Rhisheekesan, R. Jeyapaul, and C. J. Wu, "Quantitative analysis of control flow checking mechanisms for soft errors," in *Design Automation Conference (DAC)*, Jun. 2014, pp. 65–70.
- [17] C. Fetzer, U. Schiffel, and M. Süßkraut, "AN-encoding compiler: Building safety-critical systems with commodity hardware," in *International Conference on Computer Safety, Reliability, and Security*, Berlin, Heidelberg, 2009, pp. 283–296.
- [18] C. Wang, H. S. Kim, Y. Wu, and V. Ying, "Compilermanaged software-based redundant multi-threading for transient fault detection," in *International Symposium* on Code Generation and Optimization, IEEE, Mar. 2007, pp. 244–256.
- [19] N. Nakka, K. Pattabiraman, and R. Iyer, "Processorlevel selective replication," in *International Conference* on Dependable Systems and Networks, IEEE, Jun. 2007, pp. 544–553.
- [20] M. K. Bohman, "Compiler-assisted software fault tolerance for microcontrollers," M.S. thesis, Brigham Young University, 2018.
- [21] T. Semiconductors. "Soft errors in electronic memory a white paper." (Jan. 5, 2004), [Online]. Available: http: //www.tezzaron.com/media/soft\_errors\_1\_1\_secure.pdf (visited on 12/02/2020).
- [22] F. Libano, B. Wilson, J. Anderson, *et al.*, "Selective hardening for neural networks in fpgas," *IEEE Transactions on Nuclear Science*, vol. 66, no. 1, pp. 216–222, Jan. 2019.
- [23] W. Chao, F. Zhongchuan, C. Hongsong, and C. Gang, "FSFI: A full system simulator-based fault injection tool," in *International Conference on Instrumentation, Measurement, Computer, Communication and Control*, Oct. 2011, pp. 326–329.
- [24] M. Heing-Becker, T. Kamph, and S. Schupp, "Biterror injection for software developers," in *Conference* on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE), Feb. 2014, pp. 434–439.
- [25] E. Carlisle, N. Wulf, J. MacKinnon, and A. George, "DrSEUs: A dynamic robust single-event upset simulator," in 2016 IEEE Aerospace Conference, Mar. 2016, pp. 3038–3048.
- [26] H. Schirmeier, M. Hoffmann, C. Dietrich, M. Lenz, D. Lohmann, and O. Spinczyk, "FAIL\*: An open and versatile fault-injection framework for the assessment of software-implemented hardware fault tolerance," in *European Dependable Computing Conference (EDCC)*, Sep. 2015, pp. 245–255.
- [27] L. Wanner, S. Elmalaki, Liangzhen Lai, P. Gupta, and M. Srivastava, "VarEMU: An emulation testbed for variability-aware software," in *International Conference* on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Sep. 2013, pp. 224–233.

- [28] N. Wulf, G. Cieslewski, A. Gordon-Ross, and A. D. George, "SCIPS: An emulation methodology for fault injection in processor caches," in *IEEE Aerospace Conference*, ISSN: 1095-323X, Mar. 2011, pp. 2341–2349.
- [29] H. Quinn, "Challenges in testing complex systems," *IEEE Transactions on Nuclear Science*, vol. 61, no. 2, pp. 766–786, Apr. 2014.
- [30] B. James, H. Quinn, M. Wirthlin, and J. Goeders, "Applying compiler-automated software fault tolerance to multiple processor platforms," *IEEE Transactions on Nuclear Science*, vol. 67, no. 1, pp. 321–327, Jan. 2020.