Syllabus

CprE 545 Fault Tolerant Systems

When was the last time you experienced a computer-related failure, and what was the consequence? You may not remember, because the effect was small. But what about a similar failure that affects your bank's computer, so that you cannot withdraw your money? Or worse, you are a patient and your hospital's patient monitoring system fails, or you are flying and you hear the following message: "This is your captain speaking. We have just discovered that our fly-by-wire system is not operational. We are investigating and will report to you in about one hour." Maybe then you care. Maybe you would like to know the causes of such failures and understand how to design these systems better.

Course Outline:

Dependable computer systems are required in applications that involve human life or large economic stakes. In this course we study the theory and practice of designing such systems at both the hardware and software levels. We will cover the following topics.

	Dependability concepts: dependable system, techniques for achieving dependability, dependability measures, fault, error, failure, faults and their manifestation, classification of faults and failures.
	Fault tolerant strategies: Fault detection, masking, containment, location, reconfiguration, and recovery.
	Fault tolerant design techniques: Hardware redundancy, software redundancy, time redundancy, and information redundancy.
	Testing and Design for Testability.
	Self-checking and fail-safe circuits.
	Information redundancy: coding techniques, error detection and correction codes, burst error detection and correction, unidirectional codes.
	Fault tolerance in distributed systems: Byzantine Generals problem, consensus protocols, checkpointing and recovery, stable storage and RAID architectures, and data replication and resiliency.
	Dependability evaluation techniques and tools: Fault trees, Markov chains; HIMAP tool.
	Analysis of fault tolerant hardware and software architectures.
	System-level fault tolerance and low-overhead high-availability techniques.
	Fault tolerance in real-time systems: Time-space tradeoff, fault tolerant scheduling algorithms.
	Fault tolerant interconnection networks: hypercube, star graphs, and fault tolerant ATM switches.
	Dependable communication: Dependable channels, survivable networks, fault-tolerant routing.
	Case studies of fault tolerant multiprocessor and distributed systems.
	Reading of some of the state-of-the-art research material.
	Anything you want to discuss.
	Anything I may find interesting.