Chalmers University of Technology, Göteborg, Sweden
www.desyre.eu

Three years ago, the DeSyRe (on-Demand System Reliability) project began with the goal of enabling extremely reliable medical devices. The consortium includes leading European experts in fault-tolerant and self-repairing designs from both academia and industry. University partners include project leader Chalmers University of Technology (Sweden), University of Bristol (UK), EPFL (Switzerland), FORTH (Greece), and Imperial College London (UK). In addition, the consortium includes industry partners Neurasmus and Recore Systems (The Netherlands) and YOGITECH (Italy).

Fig. 1 – The DeSyRe design for a fault-tolerant System-on-Chip.

The consortium initially promised new design techniques that would counter the increasing fault rates expected at the next technology nodes while, at the same time, reducing the power and performance penalties introduced by fault-tolerance measures.

Systems-on-chip (SoCs) for extremely critical applications would use 28 percent less energy and 48 percent less chip area while offering a 9 times lower hardware failure rate, if designed with the completely novel DeSyRe architecture. This, the researchers say, would drastically reduce hospital costs and the replacement rate of medical devices. (See Fig. 1.)

Three years later, the results are in and the researchers report that the project has proven even more successful than expected. Chips designed based on the new DeSyRe paradigm have been shown to be more reliable and to be less power- and area-hungry than predicted at project onset.

How It Works

To reach these goals, DeSyRe introduced a hybrid approach to reliability that separates the SoC into two areas. One area comprises normal, interchangeable processing cores, which are by nature fault-prone. The second area is extremely resistant to faults and monitors the health of the cores in the first area. It ensures that each core can handle its assigned sub-task correctly and efficiently and, when it diagnoses a malfunction, transfers the task from the affected core to an idle core in the same area.
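The division of labour between the two areas can be illustrated with a minimal Python sketch. It is a software analogy only, not the DeSyRe hardware or runtime system: the names `Core`, `Monitor`, `assign` and `report_fault` are hypothetical, and the monitor is simply assumed to be fault-free.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Core:
    """One interchangeable, fault-prone processing core (hypothetical model)."""
    core_id: int
    healthy: bool = True
    task: Optional[str] = None


class Monitor:
    """Stand-in for the fault-resistant area; assumed never to fail here."""

    def __init__(self, cores: List[Core]) -> None:
        self.cores = list(cores)

    def assign(self, task: str) -> int:
        """Give the task to the first healthy idle core and return its id."""
        for core in self.cores:
            if core.healthy and core.task is None:
                core.task = task
                return core.core_id
        raise RuntimeError("no healthy idle core available")

    def report_fault(self, core_id: int) -> Optional[int]:
        """Retire a faulty core and migrate its task to another idle core."""
        faulty = next(c for c in self.cores if c.core_id == core_id)
        faulty.healthy = False
        task, faulty.task = faulty.task, None
        if task is None:
            return None  # the core was idle, nothing to migrate
        return self.assign(task)


cores = [Core(0), Core(1), Core(2)]
monitor = Monitor(cores)
first = monitor.assign("filter_signal")   # task name is illustrative
moved_to = monitor.report_fault(first)    # diagnosed malfunction on that core
```

In this toy model the task ends up on the next healthy idle core, and the retired core is never chosen again. How faults are actually diagnosed is outside the scope of the sketch.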

“In the DeSyRe project, we have coupled a new dynamically reconfigurable substrate together with runtime-system software support in such a manner that it can adapt on demand to various types and densities of faults, system constraints and application requirements,” says Ioannis Sourdis, Associate Professor in Computer Engineering at Chalmers University of Technology, and project leader of DeSyRe.

A standard Triple-Modular-Redundancy (TMR) system compares the output of three identical modules and then trusts the “majority vote”. Compared with a typical TMR system, a DeSyRe system requires 46 percent less chip area and 28 percent less energy to achieve the same tolerance to transient faults and the same performance. Compared with a time-redundant system, in which the program runs twice and the two outcomes are compared, DeSyRe executes code 14 percent to 32 percent faster.
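For readers unfamiliar with the two baseline schemes, the following Python sketch shows the idea behind each. It models the techniques in software under simplifying assumptions (comparable, hashable outputs and a deterministic function); the function names `tmr_vote` and `run_twice_and_compare` are illustrative and do not come from the project.

```python
from collections import Counter


def tmr_vote(outputs):
    """Return the value that at least two of three module outputs agree on."""
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three modules disagree")
    return value


def run_twice_and_compare(fn, *args):
    """Time redundancy: execute the same computation twice and compare."""
    first = fn(*args)
    second = fn(*args)
    if first != second:
        raise RuntimeError("mismatch between runs: transient fault suspected")
    return first


# One module returns a corrupted value; the other two outvote it.
voted = tmr_vote([7, 7, 9])

# A fault-free computation passes the double-execution check.
checked = run_twice_and_compare(lambda x: x * 2, 21)
```

The sketch also makes the costs visible: TMR needs three copies of the module, which is where the area and energy overhead comes from, while time redundancy needs only one copy but roughly doubles execution time.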

Finally, for permanent faults, DeSyRe was compared with a core-redundant system of the same area, that is, a system in which every component has a back-up spare that takes over in case of a malfunction. Here DeSyRe reduces the failure rate due to permanent faults, measured in FIT (failures per billion device hours), by a factor of 9.
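The FIT unit can be made concrete with a short Python sketch. The failure count and device hours below are made-up illustration values, not project measurements; only the factor of 9 comes from the reported result.

```python
def fit_rate(failures, device_hours):
    """Failures per billion (1e9) device hours."""
    return failures * 1e9 / device_hours


# Hypothetical baseline: 18 failures observed over 2 billion device hours.
baseline_fit = fit_rate(failures=18, device_hours=2e9)

# A factor-of-9 reduction, as reported for DeSyRe, applied to that baseline.
reduced_fit = baseline_fit / 9
```

A lower FIT means fewer expected failures for the same number of devices in the field over the same period.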