Dependable Systems & Reliability Engineering

Understanding Trust and Reliability in Modern Systems

Course Objective: Learn how to build and maintain systems that people can trust to work correctly when needed, and understand the engineering principles that make this possible.

What We Will Cover

  • What makes a system dependable and trustworthy
  • How people and technology work together in systems
  • Techniques to prevent and handle system failures
  • Methods to measure and improve system reliability

What is Dependability?

Dependability: The degree to which a system can be trusted to work correctly and safely when we need it to work.

Think of dependability as the "trustworthiness" of a system. When you press the button for an elevator, you trust that it will come. When you use an ATM, you trust that it will give you the correct amount of money and not lose your account information.

Everyday Examples

  • Highly Dependable: Traffic lights, medical devices, aircraft systems
  • Moderately Dependable: Your smartphone, car, home internet
  • Less Critical: Gaming consoles, social media apps, entertainment systems

Why Dependability Matters

  • Safety: Some system failures can cause physical harm
  • Economics: System downtime costs businesses money
  • Trust: People need to trust systems to use them effectively
  • Society: Modern life depends on reliable systems

The Four Pillars of Dependability

Dependability is built on four fundamental properties. Think of them as the legs of a chair - you need all four to be strong for the chair to be stable.

Availability

Question: "Is the system ready when I need it?"

Example: A website that works 99.9% of the time

Measurement: Percentage of time system is operational

Reliability

Question: "Does the system work correctly over time?"

Example: A calculator that always gives the right answer

Measurement: Probability of failure-free operation over a given period

Safety

Question: "Will the system avoid causing harm?"

Example: An elevator that stops safely if it detects a problem

Measurement: Absence of dangerous failures

Security

Question: "Is the system protected from attacks?"

Example: A banking app that protects your financial data

Measurement: Resistance to unauthorized access

Understanding Availability

Availability tells us what percentage of time a system is working and accessible. It's one of the most commonly used measures of dependability.

Availability = Uptime / (Uptime + Downtime) × 100%
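The formula can be applied directly to measured hours. The following is a minimal sketch; the function name and the sample month are illustrative assumptions, not part of the course material.

```python
def availability_percent(uptime_hours: float, downtime_hours: float) -> float:
    """Return availability as a percentage of the total observed time."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total * 100.0


# Hypothetical 30-day month: 720 hours observed, 1.5 of them spent down.
print(f"{availability_percent(718.5, 1.5):.3f}%")  # 99.792%
```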

The "Nines" - Industry Standards

Availability           | Downtime per Year       | Typical Use
99% (Two Nines)        | 87.6 hours (3.65 days)  | Personal websites
99.9% (Three Nines)    | 8.76 hours              | Business applications
99.99% (Four Nines)    | 52.6 minutes            | Critical business systems
99.999% (Five Nines)   | 5.26 minutes            | Life-critical systems
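The downtime figures follow from the availability percentage alone. A minimal sketch, assuming a 365-day year (8,760 hours); the function and constant names are illustrative:

```python
HOURS_PER_YEAR = 365 * 24  # 8760; leap years are ignored


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a system may be down at the given availability."""
    return (1 - availability_percent / 100.0) * HOURS_PER_YEAR


for target in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} hours ({hours * 60:.1f} minutes)")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why each step is so much harder to reach than the one before.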

Real-World Impact

If Amazon's website had 99% availability instead of 99.9%, it would be down for about 78.8 additional hours per year (87.6 hours instead of 8.76). For a company that makes billions in online sales, that much lost selling time represents millions of dollars in revenue.

Sociotechnical Systems

Sociotechnical System: A system that includes both people and technology working together to accomplish a goal.

Most real-world systems are not just technology - they include people, processes, and organizations. Understanding this is crucial for building dependable systems.

The Four Layers of a Sociotechnical System

  • Business Processes: How people work together, company policies, procedures
  • Application Software: Programs that users directly interact with
  • Platform & Infrastructure: Operating systems, databases, networks
  • Hardware: Physical computers, servers, devices

Example: Hospital Patient Record System

  • Hardware: Computers, servers, network equipment
  • Platform: Database system, operating system
  • Application: Patient record software interface
  • Processes: How doctors and nurses use the system, hospital policies

A failure at any layer can make the entire system undependable. A flawed process or an operator mistake can undermine the system even when the technology itself works perfectly.
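One way to see why every layer matters is to treat the layers as a chain in which all of them must work. The sketch below assumes the layers fail independently, which is a simplification, and the availability figures are invented for illustration:

```python
from math import prod

# Hypothetical availabilities (as fractions) for each layer.
layers = {
    "hardware": 0.999,
    "platform": 0.999,
    "application": 0.995,
    "processes": 0.98,  # e.g. staff sometimes skip a required step
}


def series_availability(parts):
    """Availability when every part must work (independence assumed)."""
    return prod(parts)


overall = series_availability(layers.values())
print(f"overall availability: {overall:.4f}")  # 0.9732
```

Under these assumptions the whole system is less available than its weakest layer, and here the weakest layer is the human process rather than the technology.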

Redundancy and Diversity

Two of the most important techniques for building dependable systems are redundancy (having backups) and diversity (using different approaches).

Redundancy

Definition: Having multiple copies or alternatives

Example: A car carries a spare tire in case one goes flat

Benefit: If one fails, others can take over

Cost: More expensive to build and maintain

Diversity

Definition: Using different approaches to solve the same problem

Example: A spacecraft uses both GPS and star navigation

Benefit: Different approaches have different failure modes

Cost: More complex to design and integrate

Types of Redundancy

  • Active Redundancy (Hot Backup): All components work simultaneously
    Example: Airplane with multiple engines running
  • Passive Redundancy (Cold Backup): Backup activates when main component fails
    Example: Emergency generator that starts when power goes out
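The difference between the two types can be sketched in a few lines. Components are modelled here as plain Python callables that raise an exception when they fail; that modelling choice and both function names are assumptions made for illustration.

```python
def call_passive(primary, standby, request):
    """Passive redundancy: the standby is used only after the primary fails."""
    try:
        return primary(request)
    except Exception:
        return standby(request)


def call_active(replicas, request):
    """Active redundancy: every replica handles the request.

    Replicas run one after another here for simplicity; a real system
    would typically run them in parallel.
    """
    results = []
    for replica in replicas:
        try:
            results.append(replica(request))
        except Exception:
            continue  # a failed replica is simply ignored
    if not results:
        raise RuntimeError("all replicas failed")
    return results[0]
```

Active redundancy does more work on every request but loses no time switching over; passive redundancy is cheaper to run but depends on detecting the failure and starting the backup.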

Why Both Are Important

Redundancy alone: If you have three identical systems and they all have the same design flaw, they might all fail at the same time.

Diversity alone: If you have three different systems but only one of each type, you have no backup if one fails.

Best practice: Combine both - have multiple systems that use different approaches.
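The argument for combining the two can be checked with simple probability. This is a rough sketch with invented numbers: each unit survives random faults with probability 0.95, and each design has a 2% chance of containing a flaw that disables every unit built from it. Independence between random faults is assumed.

```python
def parallel_reliability(unit_reliability: float, copies: int) -> float:
    """Probability that at least one of `copies` independent units works."""
    return 1 - (1 - unit_reliability) ** copies


UNIT = 0.95  # hypothetical: chance one unit survives random faults
FLAW = 0.02  # hypothetical: chance a design contains a fatal flaw

# Three identical units: one design flaw takes out all three at once.
identical = (1 - FLAW) * parallel_reliability(UNIT, 3)

# Three units of different designs: a flaw affects only its own unit.
diverse = parallel_reliability(UNIT * (1 - FLAW), 3)

print(f"identical copies: {identical:.5f}")
print(f"diverse designs:  {diverse:.5f}")
```

With these numbers the identical copies can never do better than 98%, because the shared flaw caps them, while the diverse set stays above 99.9%.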

Introduction to Reliability Engineering

Reliability: The probability that a system will perform its intended function correctly during a specified period under stated conditions.

Reliability engineering is the discipline of ensuring that systems work correctly over time. It involves mathematical analysis, testing, and design techniques to minimize failures.

Key Reliability Concepts

  • Mean Time Between Failures (MTBF): Average time a system runs between one failure and the next
  • Mean Time To Repair (MTTR): Average time needed to restore service after a failure
  • Failure Rate: How often failures occur in a given time period (for a constant rate, 1 / MTBF)
  • Reliability Function: Mathematical prediction of system reliability over time
R(t) = e^(-t / MTBF)

Where R(t) is the probability that the system runs without failure up to time t. This exponential form assumes a constant failure rate.
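These quantities can be estimated from an incident log. The sketch below uses an invented log and takes MTBF to mean average running time between failures, excluding repair time; some sources define it differently.

```python
# Hypothetical log: (hours of operation before the failure, hours to repair).
incidents = [(410.0, 2.0), (530.0, 1.0), (260.0, 3.0)]

total_uptime = sum(run for run, _ in incidents)
total_repair = sum(fix for _, fix in incidents)

mtbf = total_uptime / len(incidents)  # 400.0 hours
mttr = total_repair / len(incidents)  # 2.0 hours
failure_rate = 1 / mtbf               # 0.0025 failures per operating hour

# Same ratio as uptime / (uptime + downtime), expressed with the averages.
availability = mtbf / (mtbf + mttr)

print(f"MTBF={mtbf} h, MTTR={mttr} h, availability={availability:.4%}")
```

Availability improves either by failing less often (higher MTBF) or by recovering faster (lower MTTR); the second is often the cheaper of the two.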

Practical Example

If a server has an MTBF of 1000 hours, what's the probability it will work correctly for 100 hours?

R(100) = e^(-100/1000) = e^(-0.1) ≈ 0.905 or 90.5%
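The same calculation in code, as a minimal sketch under the constant-failure-rate assumption; the function name is illustrative:

```python
import math


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of running t_hours without failure (exponential model)."""
    return math.exp(-t_hours / mtbf_hours)


print(f"{reliability(100, 1000):.3f}")   # 0.905
print(f"{reliability(1000, 1000):.3f}")  # 0.368
```

The second line is worth noting: under this model a system has only about a 37% chance of running for its full MTBF without a failure, so MTBF should not be read as a guaranteed lifetime.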

Fault-Tolerant Architectures

Fault Tolerance: The ability of a system to continue operating correctly even when some of its components fail.

Instead of trying to prevent all failures (which is impossible), fault-tolerant systems are designed to handle failures gracefully when they occur.

Levels of Fault Tolerance

  • Fail-Stop: System stops safely when it detects an error
    Example: Elevator stops between floors when it detects a cable problem
  • Fail-Soft: System provides reduced functionality during failures
    Example: Car's air conditioning stops working, but engine continues
  • Fail-Operational: System continues normal operation despite failures
    Example: Airplane continues flying with one engine failed
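The first two levels can be sketched in software terms. All names below are hypothetical, and components are modelled as callables that raise an exception on failure.

```python
class SafeStop(Exception):
    """Signals that the system halted deliberately in a safe state."""


def fail_stop(read_sensor):
    """Fail-stop: halt rather than act on a reading that cannot be trusted."""
    try:
        return read_sensor()
    except Exception as exc:
        raise SafeStop("sensor fault detected; stopping") from exc


def fail_soft(core, optional):
    """Fail-soft: keep the core function, drop the optional one if it fails."""
    result = {"core": core(), "optional": None, "degraded": False}
    try:
        result["optional"] = optional()
    except Exception:
        result["degraded"] = True  # reduced functionality, still running
    return result
```

Fail-operational behaviour needs redundancy on top of this: a second component must be ready to carry the full load when the first one fails.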

Load Balancing

Distribute work across multiple components so if one fails, others can handle the load
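A minimal round-robin sketch of this idea follows. The class and method names are hypothetical, and health is set by hand here; a real balancer would run health checks to decide it.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backends in turn, skipping any marked as down."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._ring = cycle(self._backends)

    def mark_down(self, backend):
        self._healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def next_backend(self):
        # Look at each backend at most once per call.
        for _ in range(len(self._backends)):
            candidate = next(self._ring)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backends available")
```

When a backend is marked down, its share of requests moves to the remaining ones, so each of them needs spare capacity for this to work.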

Circuit Breakers

Automatically stop using failed components to prevent cascading failures
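The sketch below shows one simple way a circuit breaker might be written. The thresholds, the class name, and the injectable `clock` parameter are all assumptions for illustration; production libraries offer more states and options.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail immediately instead of waiting on a component that is known to be broken, which gives that component time to recover.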

Timeouts and Retries

Handle temporary failures by waiting and trying again
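A minimal retry sketch with exponential backoff follows. It assumes the wrapped function enforces its own timeout and raises `TimeoutError` or `ConnectionError` on a temporary failure; the function name and default values are illustrative.

```python
import time


def retry(func, attempts=3, base_delay=0.1,
          retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call func, retrying temporary failures with a doubling delay."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
```

Only errors that are likely to be temporary are retried; anything else propagates at once. Retries are also only safe for operations that can be repeated without side effects, such as reads.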

Isolation (Bulkheads)

Separate system components so failures don't spread
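One common way to build a bulkhead in software is to cap how many calls each dependency may have in flight. The sketch below uses a semaphore for that; the class name and the choice to reject rather than queue extra calls are assumptions.

```python
import threading


class Bulkhead:
    """Limits concurrent calls so one slow dependency cannot use every worker."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Reject immediately instead of queueing behind a struggling component.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full; request rejected")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each downstream service its own `Bulkhead` means that if one service hangs, only its slots fill up and calls to the others keep flowing.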

Measuring and Improving Reliability

You cannot improve what you cannot measure. Reliability engineering relies on systematic measurement and analysis to identify problems and track improvements.

What to Measure

  • Number of failures per day/week/month
  • Time between failures
  • System response times
  • User satisfaction scores
  • Error rates in transactions
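Two of these measures, transaction error rate and response time, can be computed from raw records as sketched below. The sample data is invented, and the percentile uses the simple nearest-rank method.

```python
import math

# Hypothetical transaction records: (response time in ms, succeeded?)
transactions = [
    (120, True), (95, True), (310, False), (140, True),
    (2050, True), (88, True), (101, False), (130, True),
]


def error_rate(records):
    """Fraction of transactions that failed."""
    failures = sum(1 for _, ok in records if not ok)
    return failures / len(records)


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


times = [ms for ms, _ in transactions]
print(f"error rate: {error_rate(transactions):.0%}")  # 25%
print(f"median: {percentile(times, 50)} ms")          # 120 ms
print(f"95th percentile: {percentile(times, 95)} ms") # 2050 ms
```

The gap between the median and the 95th percentile shows why averages alone can hide the slow responses that users notice most.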

How to Measure

  • Automated monitoring systems
  • Log file analysis
  • User feedback and surveys
  • Controlled testing
  • Field studies and observations

Using Measurements for Improvement

  • Identify Trends: Is reliability getting better or worse over time?
  • Find Root Causes: What components or processes cause the most problems?
  • Prioritize Improvements: Focus resources on areas with biggest impact
  • Validate Changes: Did our improvements actually help?
  • Predict Future Performance: When will we reach our reliability goals?
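The first two steps can be sketched with very little code. The weekly counts and cause labels below are invented, and comparing the two halves of the period is a deliberately crude trend test.

```python
from collections import Counter
from statistics import mean

# Hypothetical data: failures per week, and the recorded cause of each incident.
weekly_failures = [9, 8, 8, 6, 7, 5, 4, 4]
causes = ["config", "disk", "config", "deploy", "config", "deploy", "network"]

# Identify trends: compare the earlier half of the period with the recent half.
half = len(weekly_failures) // 2
earlier = mean(weekly_failures[:half])
recent = mean(weekly_failures[half:])
trend = "improving" if recent < earlier else "not improving"

# Find root causes: rank causes by how often they appear.
top_causes = Counter(causes).most_common(2)

print(f"{earlier} -> {recent} failures/week ({trend})")
print(f"most frequent causes: {top_causes}")
```

Ranking causes by frequency points at where to prioritize: here, configuration errors account for the largest share of incidents.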

Reliability Growth

Most systems become more reliable over time as bugs are found and fixed. However, adding new features can temporarily decrease reliability. The key is to measure and manage this trade-off carefully.

Key Takeaways

Essential Principles

  • Dependability is about trust - people must be able to rely on systems
  • People are part of the system - technology alone is not enough
  • Failures will happen - design for them, don't just try to prevent them
  • Redundancy and diversity are your most powerful tools
  • Measurement is essential - you can't improve what you don't measure

For Managers

Understand that dependability costs money upfront but saves money in the long run. Invest in good processes and measurement systems.

For Users

Understand the trade-offs between cost, performance, and dependability. Know what level of reliability you actually need.

For Engineers

Design systems with failure in mind. Use proven patterns and always test your assumptions.

For Organizations

Create a culture that values reliability. Learn from failures and continuously improve processes.

Remember: Perfect reliability is unattainable, and each step closer to it costs more than the last. The goal is to achieve "good enough" reliability for your specific needs while managing costs and other requirements effectively.