To be, or not to be fault tolerant! Or fault intolerant?


Semiconductors is a tough business, and definitely not for the faint-hearted, said Suman Narayan, senior VP for Semiconductors, IoT and Analytics, Cyient. If you are in DFT, you are in the insurance business. He was moderating a panel discussion on ‘fault tolerance vs. fault intolerance’.

Rubin Parekhji, senior technologist, Texas Instruments, said that a system is fault tolerant if there is no error. An app is fault tolerant if there is no intolerant fault. An affordable system should be fault tolerant. Which faults are important? How are hardware and software made fault tolerant? If fault tolerance is not done well, for instance, it will lead to bulky devices. There is a need to optimize and differentiate. There is a need to build fault tolerant systems using fault intolerant building blocks.

Jais Abraham, director of engineering, Qualcomm, said that device complexity has increased 6X since 2010. There is a disproportionate increase in test cost vs. node shrink benefits. Are we good at fault finding? It’s our fault. Be intolerant to faults, but don’t be maniacal. Think of the entire gamut of testing. Think of the system, and not just the chip. Think of the manufacturing quality, and find remedies. Fault tolerance may mean testing enough to meet the quality requirements of customers, who are becoming intolerant. We continue to invest in fault tolerance architectures.

Ruchir Dixit, technical director, Mentor, felt that making a system robust is a choice. The key is the machine that we make, and whether it is robust. Customers expect a robust, quality system. Simpler systems make up a complex system. A successful system deals with malfunctions. There are regenerative components. The ISO 26262 standard drives robustness.

Dr Sandeep Pendharkar, engineering director, Intel, felt that there is an increased usage of semiconductors in applications such as ADAS and medical. Functional safety (FuSa) requires unprecedented quality levels. The quality metric has now moved from DPPM (defective parts per million) to DPPB (defective parts per billion).

Achieving near-zero DPPB on the newest node is nearly impossible. Fault tolerance is the way forward. How should the test flows change to comprehend all this? Should we cap the number of recoverable faults before declaring a chip unusable?

Ram Jonnavithula, VP of Engineering, Tessolve, said that a pacemaker should be fault tolerant, with zero defects. Fault tolerance requires redundancy, and mechanisms to detect and isolate faults. Sometimes, fault tolerance could mean reduced performance, but the system still functions.

Adit D. Singh, professor of Electrical & Computer Engineering, Auburn University, USA, highlighted the threats to electronics reliability. These are:
* Test escapes (DPPM), especially escapes from component testing. Also, timing defects.
* New failures occur during operation. They can also be due to aging.
* Poor system design, for which there is actually no solution. There can be design errors and improper shields.

Test diversity helps costs. Design diversity helps fault tolerance costs. Design triplicated modules independently. Avoid correlated failures.
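The idea of triplicated, independently designed modules can be illustrated with a small software analogy of triple modular redundancy. This is only a sketch, not something shown by the panel: the function `tmr_vote` and the three `square_*` implementations are hypothetical names, and the "fault" is injected by hand. The voter takes the majority output and reports which module disagreed, so that it can be isolated.

```python
from collections import Counter


def tmr_vote(modules, x):
    """Run three redundant modules on x.

    Returns (majority_output, suspect_indices). A module that raises
    an exception is treated as dissenting rather than crashing the system.
    """
    if len(modules) != 3:
        raise ValueError("TMR expects exactly three modules")

    outputs = []
    for module in modules:
        try:
            outputs.append(("ok", module(x)))
        except Exception:
            outputs.append(("failed", None))

    # Count only the outputs of modules that actually produced a value.
    counts = Counter(out for out in outputs if out[0] == "ok")
    if not counts:
        raise RuntimeError("no majority: all modules failed")
    winner, votes = counts.most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: modules disagree")

    # Any module that did not agree with the majority is a suspect.
    suspects = [i for i, out in enumerate(outputs) if out != winner]
    return winner[1], suspects


# Three independently written implementations of the same function
# (hypothetical, for illustration only).
def square_mul(n):
    return n * n


def square_pow(n):
    return n ** 2


def square_buggy(n):
    # Injected fault: wrong answer for one specific input.
    return n * n + (1 if n == 3 else 0)


result, suspects = tmr_vote([square_mul, square_pow, square_buggy], 3)
print(result, suspects)
```

Because the three implementations are written differently, a single design error is unlikely to show up in two of them at once. Three copies of the same buggy module would fail together, and the voter would happily return the wrong majority, which is the correlated failure the advice warns against.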

So, what’s it going to be? Be fault tolerant! Or, fault intolerant?