Backups That Actually Restore: A Field Guide for Small SaaS

Almost every team that runs a SaaS has a backup strategy. Almost every team that has a backup strategy is wrong about how well it works, in a direction they will only discover during an incident. The pattern is consistent enough that it deserves a name: backup confidence usually exceeds backup fitness, and the gap is closed only by practice.

The pattern repeats because backups are easy to start and hard to verify. A nightly cron that dumps the database to S3 takes an afternoon to write. Verifying that the dump can be restored to a working system, in a reasonable amount of time, with no missing data, takes much longer and feels less productive. So most teams write the cron and stop. Then, eighteen months later, something goes wrong, the team reaches for the backup, and discovers that the dump has been silently failing for six months because a credential expired, or that the dump succeeds but the restore takes nine hours, or that the dump captures the database but not the application files that contain user-uploaded content.

What to back up

The first question is harder than it sounds: what does "back up" actually cover? The naive answer is "the database," and for many SaaS applications it is wrong. The full state of a typical application is the database plus uploaded files plus configuration plus secrets plus anything in queues at the moment of backup. A backup that covers only the database will produce a restored system that is missing user uploads (broken images), pointing at expired secrets, and missing in-flight work that was queued at the moment of failure.

The pragmatic categorization is four buckets. The database is one bucket. User-uploaded files (typically in object storage like S3, but the backup question still applies) are a second. Application configuration that lives outside the database (environment variables, feature flag state, integration credentials) is a third. Source code is a fourth, and is usually backed up automatically by the version control system, but the build artifacts are not.

The first three buckets need explicit backup strategies. The fourth needs verification that you can rebuild a deployable artifact from source, with the same build inputs, on a clean machine. Teams that have not tried this in a year should expect a real chance that the build fails, because pinned dependencies, base images, and build tools drift or disappear in the meantime.
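One way to keep the four buckets honest is to write them down as data and check that every bucket has a job assigned to it. The following is a minimal sketch; the bucket labels, the BackupJob shape, and the job name are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass

# The four buckets described above. These labels are hypothetical.
REQUIRED_BUCKETS = {"database", "uploads", "config", "build"}


@dataclass(frozen=True)
class BackupJob:
    name: str    # human-readable job name
    bucket: str  # which of the four buckets this job covers


def uncovered_buckets(jobs: list[BackupJob]) -> set[str]:
    """Return the buckets that no backup job claims to cover."""
    covered = {job.bucket for job in jobs}
    return REQUIRED_BUCKETS - covered


# A team that only wrote the nightly database cron.
jobs = [BackupJob("nightly-db-dump", "database")]
print(sorted(uncovered_buckets(jobs)))
```

A check like this only proves that a job exists for each bucket, not that the job works. That is what the restore drill is for.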

The 3-2-1 rule and what it actually means

The classic 3-2-1 backup rule is: three copies of your data, on two different storage media, with one copy off-site. The rule predates the cloud and translates imperfectly. The substantive update is: three copies, on two storage providers, with one copy in a different geographic region.

The reason for the multi-provider requirement is that "off-site" in the cloud era means "outside the failure domain of your primary provider." If your database is on AWS RDS and your backup is on AWS S3 in the same region, you do not have an off-site backup; you have two copies in the same failure domain. AWS region failures are rare but they happen. Account-level compromises (a credential leak that gives an attacker access to delete both your database and your S3 bucket) are also rare but they happen.

The pragmatic compromise for a small team is: primary in your main provider, secondary in your main provider's other region, tertiary in a different provider. The third copy can be small (a weekly snapshot of the most recent backup) and is insurance against catastrophic provider issues. For most teams the third copy is overkill; for teams running customer-critical infrastructure it is appropriate.
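The updated rule is mechanical enough to check in code. This sketch assumes you describe each copy by its provider and a coarse geographic label you choose yourself; the Copy type and the example values are illustrative, not taken from any provider's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    provider: str  # e.g. "aws"; illustrative value
    geo: str       # coarse geographic label you assign, e.g. "us-east"


def satisfies_updated_321(copies: list[Copy]) -> bool:
    """Three copies, two providers, one copy outside the primary's geography.

    The first element of `copies` is treated as the primary.
    """
    if len(copies) < 3:
        return False
    providers = {c.provider for c in copies}
    primary_geo = copies[0].geo
    has_remote_copy = any(c.geo != primary_geo for c in copies[1:])
    return len(providers) >= 2 and has_remote_copy


# The small-team compromise: main provider, its other region, a second provider.
layout = [Copy("aws", "us-east"), Copy("aws", "us-west"), Copy("other", "eu")]
print(satisfies_updated_321(layout))
```

Dropping the third copy, as many small teams do, makes the check fail. That is an honest description of the trade-off rather than a bug in the check.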

Encrypted backups and key management

Backups should be encrypted. The reason is not that you are worried about your backup provider reading your data; it is that backups tend to spread to places you did not anticipate. They get downloaded for analysis, copied to laptops, shared with consultants, mounted on test machines. A backup that requires a key to read is much safer in transit and at rest than one that does not.

The standard pattern is age (a modern, simpler alternative to GPG for file encryption) or sops with a KMS-backed key. The encrypt step is part of the backup pipeline; the decryption key is stored separately from the backups themselves. Crucially, the decryption key must itself be backed up, and stored in a place that does not require a working production system to access. A backup that you cannot decrypt because the key is in a database that is currently down is not a backup.
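As a rough sketch of the encrypt step, the pipeline can shell out to the age CLI. This assumes age is installed and on the PATH, and that you encrypt to a recipient public key and decrypt with an identity file. The function names and file paths are hypothetical.

```python
import subprocess
from pathlib import Path


def age_encrypt_cmd(src: Path, dst: Path, recipient: str) -> list[str]:
    """Build the age command that encrypts src to dst for one recipient public key."""
    return ["age", "-r", recipient, "-o", str(dst), str(src)]


def age_decrypt_cmd(src: Path, dst: Path, identity_file: Path) -> list[str]:
    """Build the age command that decrypts src to dst with a private identity file."""
    return ["age", "-d", "-i", str(identity_file), "-o", str(dst), str(src)]


def run(cmd: list[str]) -> None:
    """Run a command and raise if it exits non-zero, so failures are not silent."""
    subprocess.run(cmd, check=True)


# Example (not executed here; the recipient string is a placeholder):
# run(age_encrypt_cmd(Path("snapshot.db"), Path("snapshot.db.age"), "age1..."))
```

Only the public recipient key needs to live on the backup host. The identity file used for decryption can stay somewhere that does not depend on production being up.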

The restore drill

The piece of advice that matters most, and that is most often skipped, is: practice the restore. Schedule a recurring exercise (quarterly is a reasonable cadence) in which you pretend the production system is gone and restore from backup. Do this on a separate environment with no shortcuts; do not use a snapshot of production, do not skip steps because they would be obvious in a real incident, do not allow yourself to consult the production system while doing it.

The restore drill is where you discover the bugs. The dump is corrupt. The restore takes longer than your RTO. The application starts but is missing user uploads. The application starts but is pointing at expired credentials. The application starts but immediately fails because the database schema is from an older version than the code expects. Every one of these has happened in the wild. Teams that practice find them during a drill; teams that do not find them during a real incident.

The drill should be timed and the time should be reported. Your recovery time objective (RTO) is meaningful only if you have evidence that the restore can complete inside it. A four-hour RTO with an unmeasured restore time is a wish, not a commitment.
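Timing the drill does not need tooling beyond a stopwatch, but a small harness makes the numbers harder to fudge. The sketch below assumes each restore step can be expressed as a Python callable; the step names in the usage example are placeholders.

```python
import time
from typing import Callable


def run_timed_drill(steps: list[tuple[str, Callable[[], None]]],
                    rto_seconds: float) -> dict:
    """Run each restore step in order, time it, and compare the total to the RTO."""
    timings = []
    for name, step in steps:
        start = time.monotonic()
        step()  # an exception here aborts the drill, which is itself a finding
        timings.append((name, time.monotonic() - start))
    total = sum(seconds for _, seconds in timings)
    return {"timings": timings, "total": total, "within_rto": total <= rto_seconds}


# Placeholder steps; a real drill would download, decrypt, restore, and verify.
report = run_timed_drill(
    [("download", lambda: None), ("restore", lambda: None)],
    rto_seconds=4 * 3600,
)
print(report["within_rto"])
```

Reporting per-step timings matters as much as the total, because it shows which step to attack when the total drifts toward the RTO.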

Point-in-time recovery vs full backups

The distinction between point-in-time recovery and full backups is worth understanding clearly. A full backup captures the state of the database at a single moment. Point-in-time recovery (PITR) captures a base state plus the transaction log; you can replay the log to roll the database forward to any moment between the base and the most recent log entry.
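A toy model makes the difference concrete. The sketch below treats the database as a dictionary and the transaction log as a list of timestamped writes. It illustrates the replay idea only and says nothing about how any real database implements its log.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogEntry:
    ts: int               # commit time, in arbitrary increasing units
    key: str
    value: Optional[str]  # None means the key was deleted


def restore_to(base: dict[str, str], log: list[LogEntry],
               target_ts: int) -> dict[str, str]:
    """Replay log entries with ts <= target_ts on top of a copy of the base."""
    state = dict(base)
    for entry in sorted(log, key=lambda e: e.ts):
        if entry.ts > target_ts:
            break
        if entry.value is None:
            state.pop(entry.key, None)
        else:
            state[entry.key] = entry.value
    return state


base = {"plan": "free"}
log = [LogEntry(10, "plan", "pro"), LogEntry(30, "plan", None)]
# Recover the state just before the accidental delete at ts=30.
print(restore_to(base, log, target_ts=29))
```

A full backup can only ever give you `base`. The log is what lets you pick the moment just before the mistake.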

For databases that support it (PostgreSQL, MySQL, most managed database services), PITR with a base every 24 hours and continuous log shipping is the right default. The recovery point objective (RPO) drops from hours to seconds or minutes, depending on how often logs are shipped, and the storage cost is usually modest because, for most workloads, a day of logs is small relative to the base.

For databases that do not support PITR natively (SQLite, for instance), full backups every few hours are the practical alternative. The RPO is whatever the backup interval is. SQLite has the advantage of being fast to back up (copy the file under the right locking, or use its online backup API) and fast to restore (copy the file back), so the operational cost of frequent full backups is low.
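For SQLite, the safest way to get "the right locking" is to let SQLite do the copy. A minimal sketch using the backup method in Python's standard sqlite3 module follows. The file paths are up to the caller, and the integrity check is an extra step added here for illustration.

```python
import sqlite3


def snapshot_sqlite(src_path: str, dst_path: str) -> None:
    """Write a consistent copy of a live SQLite database via the online backup API."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        src.backup(dst)  # copies every page; safe while other connections write
    finally:
        dst.close()
        src.close()


def integrity_ok(path: str) -> bool:
    """Run SQLite's built-in integrity check on a database file."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
    finally:
        conn.close()
```

Running the integrity check on the snapshot before uploading it catches a corrupt copy at backup time rather than at restore time.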

Retention and immutability

The retention policy on backups should account for the threat model. If the threat is hardware failure, last week's backup is as good as last night's. If the threat is data corruption that took several days to notice, you need backups going back at least that far. If the threat is ransomware or malicious deletion (including by a compromised insider), you need backups that the attacker cannot delete, which means immutable storage.
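The reasoning above reduces to simple arithmetic: retention has to outlast the slowest-to-notice threat. The sketch below assumes you can put a rough detection delay on each threat. The threat names, the delays, and the one-day margin are all illustrative.

```python
from datetime import datetime, timedelta


def required_retention(detection_delays: dict[str, timedelta],
                       margin: timedelta = timedelta(days=1)) -> timedelta:
    """Retention must outlast the slowest-to-notice threat, plus a safety margin."""
    return max(detection_delays.values()) + margin


def expired(backup_times: list[datetime], now: datetime,
            retention: timedelta) -> list[datetime]:
    """Return the backups older than the retention window (deletion candidates)."""
    cutoff = now - retention
    return [t for t in backup_times if t < cutoff]


delays = {
    "hardware failure": timedelta(hours=1),
    "silent corruption": timedelta(days=5),
}
print(required_retention(delays).days)
```

With these example delays the minimum retention is six days: five to notice the corruption, plus the margin.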

S3 Object Lock and its equivalents on other providers give you write-once-read-many semantics: once a backup is written with a retention period in compliance mode, it cannot be deleted (even by the account owner) until the retention expires. Governance mode is weaker, because users with a specific permission can bypass it. This is the right configuration for backup buckets in environments where the threat model includes credential compromise. The alternative is to back up to a separate account with separate credentials and never give the production environment write access to that account.
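A hedged sketch of an upload with a retention period, written against the put_object call of a boto3-style S3 client. It assumes the bucket was created with Object Lock enabled and that the client is passed in by the caller. S3 also expects a checksum on such uploads, which recent SDK versions normally add for you. The helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Any


def object_lock_put_args(bucket: str, key: str, body: bytes,
                         retain_days: int, now: datetime) -> dict[str, Any]:
    """Build put_object arguments that lock the object until now + retain_days."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ObjectLockMode": "COMPLIANCE",  # cannot be shortened or removed
        "ObjectLockRetainUntilDate": now + timedelta(days=retain_days),
    }


def upload_locked(s3_client: Any, bucket: str, key: str, body: bytes,
                  retain_days: int) -> None:
    """Upload one backup object with a compliance-mode retention period."""
    now = datetime.now(timezone.utc)
    s3_client.put_object(**object_lock_put_args(bucket, key, body, retain_days, now))


# Example (not executed here), assuming boto3 is installed and configured:
# import boto3
# upload_locked(boto3.client("s3"), "my-backups", "snap.db.age", data, 30)
```

Compliance mode is unforgiving by design, so it is worth trying the retention period on a test bucket before pointing production backups at it.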

What this looks like in practice

The four APIs we run (DocuMint, CronPing, FlagBit, and WebhookVault) use SQLite as their primary store. The backup strategy is simple: a script that uses sqlite3's online backup to write a consistent snapshot every six hours, encrypts it with age, uploads it to S3 with Object Lock for 30 days, and verifies the upload by re-downloading it and comparing hashes. The restore drill runs quarterly on a separate VPS, restores all four databases, brings up the applications, and verifies that a known-good user can log in and perform a representative operation.
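The verify-by-hash step can be sketched as follows. This is an illustration of the idea rather than the actual script; it assumes the uploaded object has already been downloaded back to a local path, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def upload_verified(local: Path, redownloaded: Path) -> bool:
    """True if the re-downloaded copy is byte-identical to what was uploaded."""
    return sha256_file(local) == sha256_file(redownloaded)
```

A matching hash proves the upload round-tripped intact. It does not prove the snapshot restores, which is why the quarterly drill still exists.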

The drill takes about two hours from start to "production-equivalent system running." That is the RTO commitment. The RPO is six hours, which is the backup interval. Both numbers are documented, both are actually tested, and we know which one will be the bottleneck if a real incident comes (the RTO; the restore is fast, but the verification and DNS cutover take the bulk of the time).

The discipline is to do the boring drill on the calendar even when nothing is wrong. The first time something goes wrong is not the time to discover that the drill would have caught it.