Soft error detection through software fault tolerance techniques

The first scheme is a hardware-based solution called Remo that combines the best features of space and time redundancy. The task of fault detection, for example through algorithm-based fault tolerance (ABFT) [32, 81] or through assertions [19, 31], etc. Due to the high vulnerability of SRAM-based FPGAs to single-event upsets (SEUs), effective fault-tolerant soft processor architectures must be considered when FPGAs are used to build embedded systems. To avoid catastrophic events like unrecoverable system failures, fault-tolerant techniques can be applied at software or hardware levels, ex. Software and hardware techniques for SEU detection in IP. In particular, soft errors are a matter of great concern when designing high-availability systems or systems used in electronically hostile environments [14]. The performance overhead is generally high: 38% in DAFT [9] and 19% in SRMT [8].

Software fault-tolerance techniques for transient faults. We propose two variations on this theme, with speci. Then, it reports an overview of the techniques to cope with them, looking in particular at software-implemented fault tolerance (SIFT) techniques. Identification of critical variables using an FPGA-based. A method of soft error detection in microprocessors. Code duplication for byte-error detection by a single processor.

Fault injection campaigns have been performed to evaluate the fault detection capability of the proposed technique in comparison with state-of. Empirical results on parity-based soft error detection. Soft-error reliable architecture for future microprocessors. In this paper, a software-based technique is presented for detecting soft errors that. An efficient control-flow checking technique for the. In this paper we will discuss the techniques of software fault tolerance such as recovery blocks, N-version programming, single-version programming, and multi-version programming. The need to control software faults is one of the growing challenges facing. It has been reported that about 33% to 77% of transient faults are converted to control flow errors. Failures are detected by comparing the results of the different versions. Proceedings 1999 IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (EFT'99). M. Rebaudengo, M. S. Reorda, M. Violante, "An accurate analysis of the effects of soft errors in the instruction and data caches of a pipelined microprocessor," 2003 Design, Automation and Test in Europe Conference and Exhibition, pp. 602-607, 2003. A hardware-software collaborated method for soft-error. Soft errors in logic circuits are sometimes detected and corrected using the techniques of fault-tolerant design. Fault masking is any process that prevents faults in a system.
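The statement that failures are detected by comparing the results of the different versions can be illustrated with a small majority-voting sketch. This is only an illustration under assumptions: the names `vote` and `NoMajority` are hypothetical, the versions are plain Python callables with hashable results, and none of this is taken from the cited papers.

```python
from collections import Counter


class NoMajority(Exception):
    """Raised when the versions disagree and no strict majority exists."""


def vote(versions, *args):
    """Run every version on the same input and compare the results.

    Returns (majority_result, disagreement_detected).
    Results must be hashable so they can be counted.
    """
    results = [version(*args) for version in versions]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        raise NoMajority(f"no majority among results: {results}")
    # A disagreement is flagged even when the majority still wins.
    return value, votes < len(results)


if __name__ == "__main__":
    versions = [
        lambda x: x * x,       # version 1
        lambda x: x ** 2,      # version 2, independently written
        lambda x: x * x + 1,   # version 3, deliberately faulty
    ]
    print(vote(versions, 3))
```

With three versions a single faulty one is outvoted and reported; with only two versions a disagreement can be detected but not resolved, which is why the sketch raises in that case.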

The introduction of software-implemented hardware fault tolerance (SIHFT) [6] techniques for fault detection is applicable to COTS-based devices, providing low-cost solutions for enhancing the reliability of these systems without modifying the hardware. "A hybrid fault-tolerant LEON3 soft core processor implemented in low-end SRAM FPGA," article in IEEE Transactions on Nuclear Science, PP(99). Algorithm-based fault tolerance (ABFT): ABFT refers to a self-contained method for detecting, locating, and correcting faults with a software. Velazco, "Detecting soft errors by a purely software approach." On the other side, relying on software techniques for obtaining. Software fault tolerance is an immature area of research. A flexible fault injection platform for the analysis of. As a result, software fault tolerance is often adopted, since it allows the implementation of dependable systems without incurring the high costs of designing custom hardware or using hardware redundancy. Dynamic techniques achieve fault tolerance by detecting the existence of faults and performing some action to remove the faulty hardware from the system. Fault-tolerant strategies: fault tolerance in a computer system is achieved through redundancy in hardware, software, information, and/or time. These approaches, which are based on fault tolerance techniques, use information and time redundancy to detect errors during program execution. A soft-error mitigated microprocessor with software.
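As a rough illustration of how ABFT can detect and locate a fault, the sketch below uses the classic checksum scheme for matrix multiplication: one operand gets an extra row of column sums, the other an extra column of row sums, and the product then carries checksums that can be verified afterwards. The function names are hypothetical and integer matrices are assumed so that the comparisons are exact; this is not the specific method of any paper quoted above.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def with_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to every row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def check(c_full):
    """Return (bad_rows, bad_cols) of a full-checksum product matrix.

    A single corrupted data element shows up as exactly one bad row and
    one bad column, whose intersection locates the fault.
    """
    n = len(c_full) - 1       # number of data rows
    m = len(c_full[0]) - 1    # number of data columns
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [
        j for j in range(m)
        if sum(c_full[i][j] for i in range(n)) != c_full[n][j]
    ]
    return bad_rows, bad_cols


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c = matmul(with_column_checksum(a), with_row_checksum(b))
    print(check(c))   # no inconsistencies in a fault-free run
    c[1][0] += 1      # simulate a soft error in one result element
    print(check(c))   # the bad row and bad column locate the element
```

Once the faulty element is located, it could in principle be corrected from the row checksum and the remaining elements of that row; the sketch stops at detection and location.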

Software-controlled fault tolerance, Princeton University. Index terms: LEON3 soft-core processor, fault tolerance, spatial redundancy, temporal redundancy, software. Effects of soft errors: Figure 1 illustrates the possible outcomes. Radiation effects, fault-tolerant systems. Software-based fault-tolerant scheduling techniques are discussed in [6, 10] for hard real-time multiprocessor systems. Several software-controllable fault detection techniques are then presented. The paper describes a systematic approach for automatically introducing data and code redundancy into an existing program written using a high-level language. Research in [6] predicts that the soft error rate (SER) per chip of logic circuits will increase. Software-controlled fault tolerance, ACM Transactions on. The second scheme, Remora, combines the best features of hardware and software approaches to fault tolerance. Software fault tolerance techniques are designed to allow a system to tolerate software faults that remain in the system after its development. Fault-tolerance techniques for soft-core processors using. Proceedings of the 9th IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2010. Characterizing the effects of transient faults on a high.
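One common way to introduce data redundancy into a high-level program is to keep two copies of every variable, update both on each write, and compare them on each read. The sketch below assumes that interpretation; the class `DuplicatedVar` and the exception `SoftErrorDetected` are hypothetical names for illustration, and the actual transformation rules of the paper are not reproduced here.

```python
class SoftErrorDetected(Exception):
    """Raised when the two copies of a variable no longer match."""


class DuplicatedVar:
    """Holds two copies of a value and checks them on every read."""

    def __init__(self, value):
        self._copy0 = value
        self._copy1 = value

    def write(self, value):
        # Every write operation is performed on both copies.
        self._copy0 = value
        self._copy1 = value

    def read(self):
        # Every read is preceded by a consistency check.
        if self._copy0 != self._copy1:
            raise SoftErrorDetected(
                f"copies differ: {self._copy0!r} vs {self._copy1!r}"
            )
        return self._copy0


if __name__ == "__main__":
    x = DuplicatedVar(5)
    x.write(x.read() + 1)
    print(x.read())
    x._copy1 ^= 0b100          # simulate a bit flip in one copy
    try:
        x.read()
    except SoftErrorDetected as err:
        print("detected:", err)
```

The cost of this style of hardening is visible even in the sketch: memory use doubles and every access carries an extra write or comparison, which matches the slowdown and code size increase the quoted abstracts mention.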

Xu Jianjun, Tan Qingping, Xiong Yinqiao, Tan Lanfang, Li Jianli, School of Computer Science, National University of. Unlike manufacturing and design faults, soft errors, which are also called transient faults, do not occur consistently and cannot be predicted. Software fault tolerance techniques are employed during the procurement, or development, of the software. The parameters passed to a procedure, as well as the returned values, should be considered as variables. This paper proposes software-controlled fault tolerance, a concept allowing designers and users to tailor their performance and reliability for each situation. In this context, software-based fault tolerance is an attractive solution, since it allows implementing dependable systems without incurring the high costs associated with developing custom hardware-based tolerance techniques not readily available in off-the-shelf products [1]. "System Structure for Software Fault Tolerance," Brian Randell. Abstract: This paper presents and discusses the rationale behind a method for structuring complex computing systems by the use of what we term recovery blocks, conversations, and fault-tolerant interfaces. Survey of soft error mitigation techniques applied to LEON3 soft.
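The recovery block structure named in Randell's abstract is usually described as a primary routine, one or more alternates, and an acceptance test, with the state restored before each alternate is tried. The following is a minimal sketch under that common description; the function `recovery_block` and its parameters are hypothetical and are not drawn from the paper itself.

```python
import copy


def recovery_block(acceptance_test, alternates, state):
    """Try each alternate on a fresh copy of the state.

    The first result that passes the acceptance test is returned.
    Working on a deep copy plays the role of restoring the checkpoint
    before the next alternate runs.
    """
    for alternate in alternates:
        checkpoint = copy.deepcopy(state)
        try:
            result = alternate(checkpoint)
        except Exception:
            continue  # a crash counts as a failed alternate
        if acceptance_test(result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def is_sorted(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))


if __name__ == "__main__":
    def buggy_primary(xs):
        return xs            # forgets to sort

    def simple_alternate(xs):
        return sorted(xs)

    print(recovery_block(is_sorted, [buggy_primary, simple_alternate], [3, 1, 2]))
```

The quality of the scheme depends on the acceptance test. The one used here checks only ordering, so an alternate that returned an empty list would also pass; a stricter test would compare the element multiset as well.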

Single-version software fault tolerance techniques discussed include system structuring and closure, atomic actions, inline fault detection, exception handling, and others. These often include the use of redundant circuitry or computation of data, and typically come at the cost of circuit area, decreased performance, and/or higher power consumption. Depending on its type and location, a transient fault may severely affect execution or even ultimately lead to system failures. Goutam Kumar Saha, "Transient fault tolerance through algorithms," IEEE Potentials, vol. Preliminary experimental results are reported, showing the fault coverage obtained by the method, as well as some figures concerning the slowdown and code size. The aim of this paper is the introduction of a combined use of software and hardware approaches to achieve a complete fault. The transformations aim at making the program able to detect most of the soft errors affecting data and code, independently of the error detection mechanisms. In his recent article [5], John Hennessy, one of the leading computer architects of the last two decades, points.

Remo provides very high fault coverage with minimum overheads in performance, power, and area. By applying the proposed technique, several benchmark applications have been hardened against transient errors. As microprocessors are increasingly used in safety-critical applications, there is a growing demand for effective fault-tolerance techniques that can mitigate the effects of soft errors while reducing intrusiveness and minimizing the impact on performance and power consumption. Solid State Electronics Research Center (CSSER); School of Electrical, Computer, and Energy Engineering (ECEE). This chapter discusses techniques being used for fault tolerance on such systems, including checkpoint/restart techniques (system-level and application-level). This is certainly more true of software systems than of almost any other phenomenon, and not all software changes in the same way, so software fault tolerance methods are designed to overcome execution errors by modifying variable values to create an acceptable program state. Software-implemented fault detection approaches, ACM Ubiquity. Recent research shows that soft errors damage the control flow or data of a program. Soft-error detection using control flow assertions, IEEE. Such redundancy can be implemented in static, dynamic, or hybrid configurations. Fault tolerance for aerospace applications on commercial devices (microprocessors, FPGAs) against soft errors; detection of permanent faults in microprocessors for automotive applications.

A low-power fault-tolerant NoC using error correction and. ReStore architecture: transient errors, or soft errors, are detected through time redundancy in the ReStore architecture. As a result, the employment of fault detection and fault tolerance measures is becoming a mandatory task even for moderately critical applications. Software-based methods use a type of software fault tolerance technique and software redundancy for detecting faults. Effectiveness and limitations of various software techniques for soft error detection. Redundancy can also be implemented in compilers [8, 9]. Introduction: Modern electronic circuits are highly complex systems and, as such, are prone to occasional errors or failures. An effective software-implemented data error detection method in. Accordingly, software-based techniques have recently gained in popularity, and a multitude of approaches that differ in the number and frequency.
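Time redundancy in its simplest software form means executing the same computation more than once and comparing the outcomes, on the assumption that a transient fault is unlikely to corrupt both runs identically. The sketch below shows that generic idea only; it is not the ReStore mechanism, and the names `execute_redundantly` and `UnrecoveredFault`, as well as the retry loop, are assumptions added for illustration.

```python
class UnrecoveredFault(Exception):
    """Raised when repeated executions never agree."""


def execute_redundantly(func, *args, max_attempts=3):
    """Run func twice per attempt and accept the result only if both agree.

    Returns (result, attempt_number). A mismatch is treated as a detected
    transient fault and the pair of executions is simply repeated.
    """
    for attempt in range(1, max_attempts + 1):
        first = func(*args)
        second = func(*args)
        if first == second:
            return first, attempt
    raise UnrecoveredFault(f"no agreement after {max_attempts} attempts")


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_add(a, b):
        # Simulates a transient fault that corrupts only the first call.
        calls["n"] += 1
        return a + b + (1 if calls["n"] == 1 else 0)

    print(execute_redundantly(flaky_add, 2, 3))
```

This only works for computations that are deterministic and free of side effects, and it at least doubles execution time, which is the usual price of time redundancy.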

That is, active techniques use fault detection, fault location, and fault recovery in an attempt to achieve fault tolerance. Software fault tolerance techniques have also been proposed in [102, 103]. Software-only reliability techniques have shown promise in their ability to protect against soft errors without any hardware overhead. However, existing low-level software-only fault tolerance techniques have only addressed the problem of detecting faults, leaving recovery largely unaddressed. Logic soft errors in sub-65nm technologies: design and CAD. Software fault tolerance, Carnegie Mellon University. Fault tolerance can be achieved by the following techniques. Moving from these considerations, the present paper analyses the possibility of reusing software-implemented hardware fault tolerance (SIHFT) techniques, typically exploited in microprocessor-based systems, to design SEU-tolerant architectures. SWIFT, a software-only technique, and CRAFT, a suite of hybrid hardware/software techniques. Architectural and microarchitectural techniques for. Preliminary experimental results are reported, showing the fault coverage obtained by the method, as well as some figures concerning the slowdown and code size increase it causes. The basic idea of control flow checking [79] is to partition the application program in basic.
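Control flow checking is commonly formulated in terms of basic blocks: each block gets an identifier, the legal transitions between blocks are known in advance, and a runtime check at the start of each block verifies that the transition taken is one of the legal ones. The sketch below assumes that formulation, since the sentence above is cut off; the class `FlowChecker`, the block names, and the example graph are all hypothetical.

```python
class ControlFlowError(Exception):
    """Raised when execution enters a block through an illegal transition."""


class FlowChecker:
    """Tracks the current basic block and validates every transition."""

    def __init__(self, successors, entry):
        self.successors = successors  # block id -> set of legal next blocks
        self.entry = entry
        self.current = None

    def enter(self, block):
        if self.current is None:
            legal = block == self.entry
        else:
            legal = block in self.successors.get(self.current, set())
        if not legal:
            raise ControlFlowError(f"illegal jump {self.current} -> {block}")
        self.current = block


if __name__ == "__main__":
    # Hypothetical control flow graph of an if/else followed by a join.
    graph = {"B0": {"B1", "B2"}, "B1": {"B3"}, "B2": {"B3"}, "B3": set()}
    checker = FlowChecker(graph, entry="B0")
    for block in ("B0", "B1", "B3"):
        checker.enter(block)
    print("legal path accepted")

    faulty = FlowChecker(graph, entry="B0")
    faulty.enter("B0")
    try:
        faulty.enter("B3")   # simulates a corrupted branch target
    except ControlFlowError as err:
        print("detected:", err)
```

A check of this kind catches jumps between blocks that are not connected in the graph, but not a wrong choice between two legal successors or an erroneous jump inside a single block.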