Tel-Aviv University - Computer Science Colloquium

Sunday, Dec 25, 2005, 11:15-12:15

Room 309
Schreiber Building


Adnan Agbaria

University of Southern California


Compiler-Driven Distributed Checkpointing



Distributed checkpointing is an important concept in providing fault
tolerance in computer systems. Fault tolerance is especially important for
distributed systems, in which the failure rate is high. In today's
applications, e.g., grid and massively parallel applications, the overhead of
taking a distributed checkpoint using the known approaches can often outweigh
its benefits, due to coordination among the processes and other overhead. In
this talk, I present an innovative approach to distributed checkpointing. In
this approach, the checkpoints are specified in the application code during
compilation, using application-level analysis. During execution, no
coordination is required, and every process takes a local checkpoint as
specified in the code, independently of the other processes. In addition, I
present a performance analysis that uses stochastic models to compare the
checkpoint overhead imposed by this approach with that of other existing
checkpointing protocols.
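
The idea of checkpoints fixed in the code and taken without coordination can
be illustrated with a toy simulation. This is only a sketch under stated
assumptions, not the method presented in the talk: the names
CHECKPOINT_ITERATIONS, Process, and latest_common_checkpoint are hypothetical,
the "compiler" is stood in for by a fixed set of iteration numbers, and the
rule that recovery uses the latest checkpoint label held by every process is
an assumption made for the example.

```python
import copy

# Hypothetical stand-in for compile-time analysis: the iteration numbers
# at which a compiler pass is assumed to have inserted checkpoint calls.
CHECKPOINT_ITERATIONS = {2, 4}


class Process:
    """Toy process that checkpoints locally at the pre-specified points."""

    def __init__(self, rank):
        self.rank = rank
        self.state = {"iteration": 0, "value": rank}
        self.checkpoints = {}  # checkpoint label -> saved copy of the state

    def take_local_checkpoint(self, label):
        # Purely local: no messages are exchanged with other processes.
        self.checkpoints[label] = copy.deepcopy(self.state)

    def run(self, iterations):
        for i in range(1, iterations + 1):
            self.state["value"] += i
            self.state["iteration"] = i
            if i in CHECKPOINT_ITERATIONS:
                self.take_local_checkpoint(i)

    def restore(self, label):
        self.state = copy.deepcopy(self.checkpoints[label])


def latest_common_checkpoint(processes):
    """Assumed recovery rule: the newest label that every process has saved."""
    common = set.intersection(*(set(p.checkpoints) for p in processes))
    return max(common) if common else None


# Three processes that have progressed by different amounts before a failure.
processes = [Process(rank) for rank in range(3)]
for process, completed in zip(processes, (5, 3, 4)):
    process.run(completed)

label = latest_common_checkpoint(processes)
for process in processes:
    process.restore(label)
```

Because every process checkpoints at the same points in the code, the saved
labels line up across processes even though no process waited for or
signalled another while running.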