Books/References
CprE 545  Fault Tolerant Systems

 


 

Reference Books

  1. Fault-Tolerant Computer System Design by D. K. Pradhan, Prentice Hall.

  2. Reliable Computer Systems: Design and Evaluation (second edition) by D. P. Siewiorek and R. S. Swarz, Digital Press.

  3. Design and Analysis of Fault Tolerant Digital Systems by B.W. Johnson, Addison Wesley, 1989.

  4. Fault Tolerance in Distributed Systems by Pankaj Jalote, PTR Prentice Hall, 1994.

 

Fundamental Concepts

A Conceptual Framework for System Fault Tolerance

Definition of terms used in Fault Tolerance

Reliability Concepts - Part 1

Reliability Concepts - Part 2

Notes on Byzantine Generals Problem

A. Avizienis and J. Laprie, ``Dependable Computing: From Concepts to Design Diversity,'' Proc. IEEE, vol.74, no.5, pp.629-638, May 1986.

A.K. Somani and N.H. Vaidya, ``Understanding fault-tolerance and reliability,'' IEEE Computer, vol.30, no.4, pp.45-50, Apr. 1997.

M. Pease, R. Shostak, and L. Lamport, ``Reaching Agreement in the Presence of Faults,'' Journal of the ACM, vol.27, no.2, pp.228-234, Apr. 1980.

L. Lamport, R. Shostak, and M. Pease, ``The Byzantine Generals Problem,'' ACM Trans. Programming Languages and Systems, vol.4, no.3, pp.382-401, July 1982.

 

Software Fault Tolerance

N. Leveson, J. Knight, and T. Shimeall, ``The use of self checks and voting in software error detection: An empirical study,'' IEEE Transactions on Software Engineering, April 1990.

A. Avizienis and J. Kelly, ``Fault Tolerance by Design Diversity: Concepts and Experiments,'' IEEE Computer, August 1984, pp. 67-80.

J.H. Purtilo and P. Jalote, ``An environment for developing fault-tolerant software,'' IEEE Trans. Software Engg., vol.17, no.2, pp.153-159, Feb. 1991.

 

Fault Detection and Location in Multiprocessor Systems

A review of system-level diagnosis

S. Tridandapani, A. K. Somani, and U. Reddy, ``Low Overhead Multiprocessor Allocation Strategies Exploiting System Spare Capacity for Fault Detection and Location,'' in IEEE Transactions on Computers, Vol. 44, No. 7, July 1995, pp. 865-877.

K. Mahesh, G. Manimaran, C. S. R. Murthy, and A. K. Somani, ``Scheduling Algorithms Exploiting Spare Capacity and Tasks' Laxities for Fault Detection and Location in Real-time Multiprocessor Systems,'' Journal of Parallel and Distributed Computing, vol.51, no.2, pp.136-150, June 1998.

Using spare capacity in SMT processors

 

Fault Tolerance in Real-time Systems

Fundamentals of Real-time Systems (to obtain this report, send mail to Dr. Manimaran)

J.W.S. Liu, K.J. Lin, W.K. Shih, A.C. Yu, J.Y.Chung, and W. Zhao, ``Algorithms for scheduling imprecise computations,'' IEEE Computer, vol.24, no.5, pp.58-68, May 1991.

P. Ramanathan, ``Graceful degradation in real-time control applications using (m,k)-firm guarantee,'' in Proc. Fault-Tolerant Computing Symp., pp.132-141, 1997.

S. Ghosh, R. Melhem, and D. Mosse, ``Fault-tolerance through scheduling of aperiodic tasks in hard real-time multiprocessor systems,'' IEEE Trans. Parallel and Distributed Systems, vol.8, no.3, pp.272-284, Mar. 1997.

G. Manimaran and C. Siva Ram Murthy, ``A fault-tolerant dynamic scheduling algorithm for real-time multiprocessor systems and its analysis,'' IEEE Trans. Parallel and Distributed Systems, vol.9, no.11, pp.1137-1152, Nov. 1998.

J.H. Lala and R.E. Harper, ``Architectural principles for safety-critical real-time applications,'' Proc. of IEEE, vol.82, no.1, pp.25-40, Jan. 1994.

A.L. Liestman and R.H. Campbell, ``A fault-tolerant scheduling problem,'' IEEE Trans. Software Engg., vol.12, no. 11, pp. 1089-1095, Nov. 1986.

 

Dependable Communication

Dependable Channels in Multihop Networks - a report

Fault Tolerance in Optical Networks - a report

 

Checkpointing

E. N. Elnozahy, D. B. Johnson, and Y. M. Wang, ``A survey of rollback-recovery protocols in message-passing systems,'' Tech. Rep. No. CMU-CS-96-181, Dept. of Computer Science, Carnegie Mellon University, 1996.

 

Other Relevant Papers

Barbour and Wojcik, ``A General Constructive Approach to Fault Tolerant Design Using Redundancy,'' IEEE Transactions on Computers, Jan. 1989, pp. 15.

S. B. Choi and A. K. Somani, ``Design and Performance Analysis of Load-distributing Fault-tolerant Network,'' in IEEE Transactions on Computers, Vol. 45, No. 5, May 1996, pp. 540-551.

P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson, ``RAID: High-Performance, Reliable Secondary Storage,'' ACM Computing Surveys, Vol. 26, No. 2, June 1994, pp. 145-164.

A paper on DIRSMIN

Figure 1 for the DIRSMIN paper

A paper on embedding binary tree in faulty hypercube

 

Relevant Journals

Proc. of the IEEE

IEEE Computer

IEEE Trans. on Computers

IEEE Trans. on Parallel and Distributed Systems

IEEE/ACM Trans. on Networking

ACM Trans. on Computer Systems

Journal of Parallel and Distributed Computing

 

Relevant Conference Proceedings

Proc. of Fault Tolerant Computing Symposium.

Proc. of Intl. Conf. Parallel and Distributed Systems.

Proc. of Intl. Parallel Processing Symposium.

Other related journals and conference proceedings.