I've seen three measurements generally used in Disaster Recovery or Business Continuity:
- Recovery Time Objective (RTO)
- Recovery Point Objective (RPO)
- Service Level Agreement (SLA)
I want to briefly discuss RTO and RPO in this post. SLA has a lot more nuance and context to explore.
RTO is how long it should take to restore service after some event that causes downtime. The actual duration is the recovery time, and every business and service should have an objective, a goal, that service is restored within some timeframe.
For example, FooService has an RTO of one hour. If FooService's primary database node goes down, then whatever needs to happen (failover to a secondary, rebuild of the primary, full restore from backup, etc.) should happen within one hour. This of course doesn't come for free: it takes development time and practice, as well as good support and monitoring. If the service is restored within an hour of the event, including the time it takes for monitoring to pick up on the change, then the RTO has been achieved for this incident.
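To make the FooService example concrete, here is a minimal sketch of checking one incident against an RTO. The function names and timestamps are made up for illustration; the one thing it is meant to show is that the clock starts at the event itself, not at the moment someone gets paged.

```python
from datetime import datetime, timedelta


def recovery_time(event_start: datetime, service_restored: datetime) -> timedelta:
    """Elapsed time from the start of the event to restored service.

    Measured from the event itself, so detection time is included.
    """
    return service_restored - event_start


def rto_achieved(event_start: datetime, service_restored: datetime, rto: timedelta) -> bool:
    """True if service came back within the recovery time objective."""
    return recovery_time(event_start, service_restored) <= rto


# Hypothetical incident: the primary fails at 09:00 and service is back at 09:50.
rto = timedelta(hours=1)
event_start = datetime(2024, 1, 1, 9, 0)
service_restored = datetime(2024, 1, 1, 9, 50)

print(rto_achieved(event_start, service_restored, rto))  # True
```

If monitoring took 15 minutes to notice the failure, those 15 minutes are already inside the 50, which is why slow detection eats directly into the time left for the actual recovery work.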
RPO represents the maximum amount of data that could be lost in the worst-case disaster. In my experience and view, this is the metric that catches even some of the best engineering teams and businesses by surprise. It is also the harder metric to state with certainty. I've generally seen this expressed in minutes or hours, as in "up to 15 minutes RPO."
In FooService's case, the database is backed up once a day. FooService likely has an RPO of 24 hours. FooService's database could be lost, and a new database could be created from a backup that is up to 24 hours old. Reality tends to make this more of a guess, unfortunately, unless disaster recovery is practiced, either in dedicated exercises or as part of everyday operations. A backup isn't a backup if it's only in one place or has never been used in a restore. If the backup takes 3 hours to complete, then perhaps 24 hours is too short for a real situation, and 27 or 28 hours is more realistic.
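The arithmetic behind that 27-hour figure can be sketched as follows. This assumes each backup captures the data as of the moment it starts and is not usable until it finishes; the function name is hypothetical, and real backup tooling may behave differently.

```python
from datetime import timedelta


def worst_case_rpo(backup_interval: timedelta, backup_duration: timedelta) -> timedelta:
    """Worst-case data loss window under the stated assumptions.

    If a disaster strikes just before the newest backup finishes, the latest
    usable backup is the previous one, which reflects data from a full
    interval plus one backup duration ago.
    """
    return backup_interval + backup_duration


# FooService: daily backups that take 3 hours to complete.
rpo = worst_case_rpo(timedelta(hours=24), timedelta(hours=3))
print(rpo.total_seconds() / 3600)  # 27.0
```

The extra hour in "27 or 28" isn't modeled here; it would come from things like scheduling drift or the time to copy the backup somewhere safe.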
I've made a little visual to help explain these two terms. Some groups tend to include not just service recovery in RTO but also outreach, return to normal for queues, or other indicators and side effects. These metrics should be measured and improved, and teams should commit time to setting reasonable targets and maintaining recovery plans.