Dependable Systems & Reliability Engineering

Understanding Trust and Reliability in Modern Systems

Course Objective: Learn how to build and maintain systems that people can trust to work correctly when needed, and understand the engineering principles that make this possible.

What We Will Cover

What makes a system dependable and trustworthy
How people and technology work together in systems
Techniques to prevent and handle system failures
Methods to measure and improve system reliability

What is Dependability?

Dependability: The degree to which a system can be trusted to work correctly and safely when we need it to work.

Think of dependability as the "trustworthiness" of a system. When you press the button for an elevator, you trust that it will come. When you use an ATM, you trust that it will give you the correct amount of money and not lose your account information.

Everyday Examples

Must be highly dependable: Traffic lights, medical devices, aircraft systems
Should be moderately dependable: Your smartphone, car, home internet
Less critical: Gaming consoles, social media apps, entertainment systems

Why Dependability Matters

Safety: Some system failures can cause physical harm
Economics: System downtime costs businesses money
Trust: People need to trust systems to use them effectively
Society: Modern life depends on reliable systems

The Four Pillars of Dependability

Dependability is built on four fundamental properties. Think of them as the legs of a chair: all four must be strong for the chair to be stable.

Availability

Question: "Is the system ready when I need it?"

Example: A website that works 99.9% of the time

Measurement: Percentage of time system is operational

Reliability

Question: "Does the system work correctly over time?"

Example: A calculator that always gives the right answer

Measurement: Probability of failure-free operation over a given period

Safety

Question: "Will the system avoid causing harm?"

Example: An elevator that stops safely if it detects a problem

Measurement: Absence of dangerous failures

Security

Question: "Is the system protected from attacks?"

Example: A banking app that protects your financial data

Measurement: Resistance to unauthorized access

Understanding Availability

Availability tells us what percentage of time a system is working and accessible. It's one of the most commonly used measures of dependability.

\text{Availability} = \frac{\text{Uptime}}{\text{Uptime} + \text{Downtime}} \times 100\%

The "Nines" - Industry Standards

Availability	Downtime per Year	Typical Use
99% (Two Nines)	87.6 hours (3.65 days)	Personal websites
99.9% (Three Nines)	8.76 hours	Business applications
99.99% (Four Nines)	52.6 minutes	Critical business systems
99.999% (Five Nines)	5.26 minutes	Life-critical systems
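
The downtime column follows directly from the availability formula. As a minimal sketch, assuming a 365-day year (8,760 hours) and a hypothetical helper name, the figures in the table can be reproduced like this:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored in this sketch


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime (in hours) implied by an availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * HOURS_PER_YEAR


# Print one line per "nines" level from the table above.
for pct in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Each extra nine divides the allowed downtime by ten, which is why every step up the table is much harder to achieve than the one before it.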

Real-World Impact

If Amazon's website had 99% availability instead of 99.9%, it would lose roughly 78.8 additional hours of sales per year (87.6 hours of downtime instead of 8.76). For a company that makes billions, this represents millions of dollars in lost revenue.

Sociotechnical Systems

Sociotechnical System: A system that includes both people and technology working together to accomplish a goal.

Most real-world systems are not just technology - they include people, processes, and organizations. Understanding this is crucial for building dependable systems.

The Four Layers of a Sociotechnical System

Business Processes
How people work together, company policies, procedures

Application Software
Programs that users directly interact with

Platform & Infrastructure
Operating systems, databases, networks

Hardware
Physical computers, servers, devices

Example: Hospital Patient Record System

Hardware: Computers, servers, network equipment
Platform: Database system, operating system
Application: Patient record software interface
Processes: How doctors and nurses use the system, hospital policies

A failure at any layer, including the people and process layer, can make the entire system unreliable, even if the technology works perfectly.

Redundancy and Diversity

Two of the most important techniques for building dependable systems are redundancy (having backups) and diversity (using different approaches).

Redundancy

Definition: Having multiple copies or alternatives

Example: A car has both mirrors and a backup camera

Benefit: If one fails, others can take over

Cost: More expensive to build and maintain

Diversity

Definition: Using different approaches to solve the same problem

Example: A spacecraft uses both GPS and star navigation

Benefit: Different approaches have different failure modes

Cost: More complex to design and integrate

Types of Redundancy

Active Redundancy (Hot Backup): All components work simultaneously
Example: Airplane with multiple engines running
Passive Redundancy (Cold Backup): Backup activates when main component fails
Example: Emergency generator that starts when power goes out

Why Both Are Important

Redundancy alone: If you have three identical systems and they all have the same design flaw, they might all fail at the same time.

Diversity alone: If you have three different systems but only one of each type, you have no backup if one fails.

Best practice: Combine both - have multiple systems that use different approaches.
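
The difference between the two cases can be put into numbers. The sketch below is a simplified model, not a formula from the course: it assumes that copies fail independently of each other, and it represents a shared design flaw as a separate "common-cause" outage that takes down every copy at once. The function names and the 0.1% common-cause figure are illustrative assumptions.

```python
def parallel_availability(unit_availability: float, copies: int) -> float:
    """Availability of redundant copies, assuming they fail independently.

    The system is down only when every copy is down at the same time.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - unit_availability) ** copies


def with_common_cause(unit_availability: float, copies: int,
                      common_cause_unavailability: float) -> float:
    """Same as above, but a shared flaw disables all copies for some fraction of time."""
    independent_part = parallel_availability(unit_availability, copies)
    return (1.0 - common_cause_unavailability) * independent_part


# Three copies, each available 99% of the time (values chosen for illustration).
print(parallel_availability(0.99, 3))      # independent failures only
print(with_common_cause(0.99, 3, 0.001))   # identical copies sharing a design flaw
```

In this model the shared flaw quickly dominates: adding more identical copies barely helps once the common-cause term is larger than the independent one, which is the argument for combining redundancy with diversity.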

Introduction to Reliability Engineering

Reliability: The probability that a system will perform its intended function correctly during a specified period under stated conditions.

Reliability engineering is the discipline of ensuring that systems work correctly over time. It involves mathematical analysis, testing, and design techniques to minimize failures.

Key Reliability Concepts

Mean Time Between Failures (MTBF): Average time a system works before failing
Mean Time To Repair (MTTR): Average time needed to fix a failure
Failure Rate: How often failures occur in a given time period
Reliability Function: Mathematical prediction of system reliability over time

R(t) = e^{-t / \text{MTBF}}

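Where R(t) is the reliability at time t, that is, the probability that the system has run without failure up to time t. This formula assumes a constant failure rate of 1/MTBF.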

Practical Example

If a server has an MTBF of 1000 hours, what's the probability it will work correctly for 100 hours?

R(100) = e^(-100/1000) = e^(-0.1) ≈ 0.905 or 90.5%
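
The same calculation can be written as a short function. This is a minimal sketch under the constant-failure-rate assumption; the function names are hypothetical. The second function uses the standard steady-state relationship Availability = MTBF / (MTBF + MTTR) to connect these concepts back to availability, and the 1-hour MTTR is an assumed value for illustration only.

```python
import math


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of running t_hours without failure, assuming a constant failure rate."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return math.exp(-t_hours / mtbf_hours)


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is up, given MTBF and MTTR."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


print(f"R(100)  = {reliability(100, 1000):.3f}")    # the worked example above
print(f"R(1000) = {reliability(1000, 1000):.3f}")   # running for one full MTBF
print(f"Availability = {steady_state_availability(1000, 1):.4%}")  # assumed 1-hour MTTR
```

Note that running for a full MTBF does not mean a 50% chance of survival: under this model the probability is e^(-1), or about 37%.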

Fault-Tolerant Architectures

Fault Tolerance: The ability of a system to continue operating correctly even when some of its components fail.

Instead of trying to prevent all failures (which is impossible), fault-tolerant systems are designed to handle failures gracefully when they occur.

Levels of Fault Tolerance

Fail-Stop: System stops safely when it detects an error
Example: Elevator stops between floors when it detects a cable problem
Fail-Soft: System provides reduced functionality during failures
Example: Car's air conditioning stops working, but engine continues
Fail-Operational: System continues normal operation despite failures
Example: Airplane continues flying with one engine failed

Load Balancing

Distribute work across multiple components so if one fails, others can handle the load
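
As a rough illustration, here is a minimal round-robin balancer that skips components marked as failed. The class and method names are hypothetical, and how a component gets marked down (for example by a health check) is assumed to happen elsewhere.

```python
class RoundRobinBalancer:
    """Hand out backends in turn, skipping any that are marked as down."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._next = 0  # index of the next backend to try

    def mark_down(self, backend):
        self._healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self):
        # Try each backend at most once per call, starting from the rotation point.
        for _ in range(len(self._backends)):
            candidate = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


balancer = RoundRobinBalancer(["server-a", "server-b", "server-c"])
balancer.mark_down("server-b")  # simulate a failed component
print([balancer.pick() for _ in range(4)])
```

The remaining components only absorb the extra load if they have spare capacity, so a balancer like this works best when the system is sized to survive the loss of at least one component.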

Circuit Breakers

Automatically stop using failed components to prevent cascading failures
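
A simplified sketch of the idea follows. The class name, the default threshold of 3 failures, and the 30-second reset timeout are all assumptions for illustration; production implementations usually add more states and bookkeeping.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so the breaker can be tested
        self._failures = 0           # consecutive failures seen so far
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures in a row, opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail immediately instead of waiting on a component that is known to be broken, which is what keeps one failure from tying up the rest of the system.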

Timeouts and Retries

Handle temporary failures by waiting and trying again
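
A minimal sketch of a retry helper is shown below. The function name and defaults are hypothetical, and the doubling delay between attempts (exponential backoff) is one common choice rather than something prescribed here. The operation being retried is assumed to enforce its own timeout, so that a hung call turns into an exception.

```python
import time


def retry(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation() up to `attempts` times, waiting longer after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the last error
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Retries only help with temporary failures, and they are only safe when repeating the operation causes no harm, for example reading a record rather than charging a payment twice.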

Isolation (Bulkheads)

Separate system components so failures don't spread
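
One simple way to apply this in software is to cap how many calls may be in flight to any one dependency, so that a slow dependency cannot use up every worker. The sketch below assumes a hypothetical Bulkhead class built on Python's standard threading.BoundedSemaphore; the limit of 2 is an arbitrary example value.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the compartment has no free slots."""


class Bulkhead:
    """Limit concurrent calls into one dependency."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately instead of queueing behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("too many calls in flight")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per dependency keeps their failures separate.
reports_bulkhead = Bulkhead(max_concurrent=2)
print(reports_bulkhead.call(lambda: "report generated"))
```

Giving each dependency its own compartment means that when one of them slows down, only the calls to that dependency are rejected.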

Measuring and Improving Reliability

You cannot improve what you cannot measure. Reliability engineering relies on systematic measurement and analysis to identify problems and track improvements.

What to Measure

Number of failures per day/week/month
Time between failures
System response times
User satisfaction scores
Error rates in transactions
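
Several of these measures can be derived from a simple outage log. The sketch below is a minimal example, assuming each outage is recorded as a (start hour, end hour) pair inside a known observation window; the function name and the sample data are made up for illustration.

```python
def summarize_incidents(observation_hours, incidents):
    """Compute MTBF, MTTR and availability from a list of (start, end) outages in hours."""
    if not incidents:
        raise ValueError("at least one incident is needed to estimate MTBF and MTTR")
    downtime = sum(end - start for start, end in incidents)
    uptime = observation_hours - downtime
    count = len(incidents)
    return {
        "failures": count,
        "mtbf_hours": uptime / count,      # average working time per failure
        "mttr_hours": downtime / count,    # average time to repair
        "availability_percent": 100.0 * uptime / observation_hours,
    }


# Hypothetical log: three outages during a 1,000-hour observation window.
sample_incidents = [(100, 102), (500, 501), (900, 903)]
print(summarize_incidents(1000, sample_incidents))
```

With only a few incidents these estimates are rough, so they are most useful when tracked over many periods rather than read from a single one.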

How to Measure

Automated monitoring systems
Log file analysis
User feedback and surveys
Controlled testing
Field studies and observations

Using Measurements for Improvement

Identify Trends: Is reliability getting better or worse over time?
Find Root Causes: What components or processes cause the most problems?
Prioritize Improvements: Focus resources on areas with biggest impact
Validate Changes: Did our improvements actually help?
Predict Future Performance: When will we reach our reliability goals?

Reliability Growth

Most systems become more reliable over time as bugs are found and fixed. However, adding new features can temporarily decrease reliability. The key is to measure and manage this trade-off carefully.

Key Takeaways

Essential Principles

Dependability is about trust - people must be able to rely on systems
People are part of the system - technology alone is not enough
Failures will happen - design for them, don't just try to prevent them
Redundancy and diversity are your most powerful tools
Measurement is essential - you can't improve what you don't measure

For Managers

Understand that dependability costs money upfront but saves money in the long run. Invest in good processes and measurement systems.

For Users

Understand the trade-offs between cost, performance, and dependability. Know what level of reliability you actually need.

For Engineers

Design systems with failure in mind. Use proven patterns and always test your assumptions.

For Organizations

Create a culture that values reliability. Learn from failures and continuously improve processes.

Remember: Perfect reliability is unattainable, and each step closer to it costs more than the last. The goal is to achieve "good enough" reliability for your specific needs while managing costs and other requirements effectively.