We are running FoundationDB with the official Kubernetes operator.
FoundationDB supports logical backups (with backup_agent) and disaster recovery (with dr_agent) by logically copying the database and streaming changes. It also supports binary backups through disk snapshots.
In this blog post, we will describe how to make a backup of FoundationDB via backup_agent. The FoundationDB operator supports making logical backups via backup_agent, but it does not support running DR with dr_agent. We decided to run backup_agent as a separate deployment to allow a symmetric setup with dr_agent.
In order to run the backup_agent pods, we will need a deployment. That deployment will need the following:
- Network-level access to the FoundationDB cluster
- The cluster file to use. This is provided by the config volume that is exposed by the operator. In the setup described below, the cluster file will be available at /mnt/fdb-config-volume/cluster-file.
- If the backup destination is S3 or a similar object store, the secrets needed to connect to it. In the example below, they come from an external secret.
- If the backup destination is not an object store, a volume for the backup agents where the backup will be stored.
There is no limit to the number of backup agents that can connect to a cluster. Each one is responsible for a range of keys, which it either stores locally (in the case of Kubernetes, on the attached volumes) or streams to the object store. If the backup is stored locally and the database is, for example, 4 GiB with 4 backup agents available, each agent will store roughly 1 GiB worth of data. If the backup destination is an object store, the number of backup agents mostly determines the degree of parallelism, both for reading the data and for writing the streamed data to the object store.
This blog post will focus on backups stored in an object store, which I assume most Tigris users will do.
The bucket used for the backup should not have versioning enabled (FoundationDB will do the versioning), and there should not be any automatic deletion/archival policy on the bucket. Only fdbbackup has a way to know what key range is still needed. Deleting old backups should be done with the expire subcommand of fdbbackup.
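As a quick sanity check, the bucket settings can be inspected with the AWS CLI. This is just an illustration using the bucket name from the example below:
# an error about a missing lifecycle configuration is actually the desired outcome here
aws s3api get-bucket-versioning --bucket sample-bucket
aws s3api get-bucket-lifecycle-configuration --bucket sample-bucket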
Below is an example of backup_agent as a Kubernetes deployment with 4 replicas.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fdb-backup-agent
  labels:
    app.kubernetes.io/name: fdb-backup-agent
spec:
  replicas: 4
  selector:
    matchLabels:
      app.kubernetes.io/name: fdb-backup-agent
  template:
    metadata:
      labels:
        app.kubernetes.io/name: fdb-backup-agent
    spec:
      containers:
        - name: fdb-backup-agent
          resources:
            limits:
              cpu: "0.5"
              memory: "4Gi"
            requests:
              cpu: "0.5"
              memory: "2Gi"
          image: foundationdb/foundationdb:7.1.20
          imagePullPolicy: IfNotPresent
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: sample-fdb-backup-user-secrets
                  key: accessKey
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: sample-fdb-backup-user-secrets
                  key: secretKey
          volumeMounts:
            - mountPath: /tmp
              name: tmp
            - mountPath: /mnt/fdb-config-volume
              name: sample-fdb-cluster-config-volume
          command:
            - "/usr/bin/backup_agent"
            - "-C"
            - "/mnt/fdb-config-volume/cluster-file"
            - "--knob_http_request_aws_v4_header=true"
      volumes:
        - name: tmp
          emptyDir: {}
        - name: sample-fdb-cluster-config-volume
          configMap:
            name: sample-fdb-cluster-config
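Assuming the manifest above is saved as fdb-backup-agent.yaml (the filename is arbitrary), it can be applied and the pods checked like this:
kubectl apply -f fdb-backup-agent.yaml
kubectl get pods -l app.kubernetes.io/name=fdb-backup-agent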
A continuous backup pushes mutation logs and snapshots (which are not point-in-time consistent on their own) to the destination. At restore time, the mutation logs are used to "roll forward" the database to a consistent version.
Backup command example:
fdbbackup start -t sample-test -z -s 7200 -C /mnt/fdb-config-volume/cluster-file -d 'blobstore://${AWS_ACCESS_KEY_ID}:${AWS_SECRET_ACCESS_KEY}@s3.us-east-2.amazonaws.com:443/path/to/backup/backup_name?bucket=sample-bucket&sc=1' --knob_http_request_aws_v4_header=true --log
The fdbbackup itself is a client utility (it can be run from the backup agent pods) for managing backup jobs. Once the job-level manipulation is done, the fdbbackup process itself exits; the actual work is done by the backup_agent processes.
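For example, with the deployment above it can be run from one of the backup agent pods (the foundationdb image ships fdbbackup alongside backup_agent); a short sketch:
kubectl exec -it deploy/fdb-backup-agent -- fdbbackup status -C /mnt/fdb-config-volume/cluster-file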
Those are quite a lot of parameters, so let me explain them.
-t TAG_NAME
Specifies the tag that the backup will use. By default this is default. The tag will appear in the tags part of the status json output.
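A hedged way to look at that part of status json, assuming jq is available alongside fdbcli (the exact JSON layout can vary between FoundationDB versions):
fdbcli -C /mnt/fdb-config-volume/cluster-file --exec 'status json' | jq '.cluster.layers.backup.tags'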
-z
Specifies that the backup is continuous. Without this, the backup is just made as a one-off.
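A continuous backup can later be stopped cleanly with the discontinue subcommand; a short sketch, reusing the tag from the start example above:
fdbbackup discontinue -C /mnt/fdb-config-volume/cluster-file -t sample-test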
-s 7200
Specifies the snapshot interval, which determines how often a new snapshot should be made. Please note that 7200 seconds is way too low for production settings, but for the purpose of this blog post, and for showing the status output based on it, it is perfect.
[root@sample-fdb-backup-agent-776ff99c68-5xztw /]# fdbbackup status -C /mnt/fdb-config-volume/cluster-file -t conttest
The backup on tag conttest is restorable but continuing to blobstore://CREDS:HERE@s3.us-east-2.amazonaws.com:443/path/to/backup/backup_name?bucket=sample-bucket&sc=1.
BackupUID: 5579b197e04484d932ee267834086e47
BackupURL: blobstore://CREDS:HERE@s3.us-east-2.amazonaws.com:443/path/to/backup/backup_name?bucket=sample-bucket&sc=1
Snapshot interval is 7200 seconds. Current snapshot progress target is 43.74% (>100% means the snapshot is supposed to be done)
Details:
LogBytes written - 402464503164
RangeBytes written - 172474426856
Last complete log version and timestamp - 21435815219, 2023/02/28.17:32:15+0000
Last complete snapshot version and timestamp - 18299527573, 2023/02/28.16:39:57+0000
Current Snapshot start version and timestamp - 18304439087, 2023/02/28.16:40:02+0000
Expected snapshot end version and timestamp - 25504439087, 2023/02/28.18:40:04+0000
Backup supposed to stop at next snapshot completion - No
This means that the status output above was captured at 17:32, that the consistent time of the previous snapshot was 16:39, and that the expected consistent timestamp of the next snapshot is 18:40 (so there are 2 hours in between).
Because a continuous backup contains both snapshots and mutation logs, any database version covered by the backup can be restored. In the example above, this means that any transaction can be restored from the first complete snapshot (not shown in the status output) up to 2023/02/28.17:32:15+0000. This means that such a continuous backup can protect against accidental data deletion as well.
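The exact restorable window can be inspected with the describe subcommand; a hedged example against the same backup URL (its output is similar to the Backup Description shown later in the restore section):
fdbbackup describe -C /mnt/fdb-config-volume/cluster-file -d 'blobstore://CREDS:HERE@s3.us-east-2.amazonaws.com:443/path/to/backup/backup_name?bucket=sample-bucket&sc=1' --knob_http_request_aws_v4_header=true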
Expiring backups
Because a continuous backup contains every change that has ever happened to the database, the size of the backup grows over time. For example, updating a value to something different but of the same size (for example, updating a timestamp or an integer) does not increase the total database size at the FoundationDB level, but it does permanently increase the size of the continuous backup. While being able to restore any transaction is very powerful, people rarely need to restore data that is, for example, more than 1 month old.
For this reason, the fdbbackup utility that manages the backups has a way to expire old data.
[root@sample-fdb-backup-agent-776ff99c68-5xztw /]# fdbbackup expire -C /mnt/fdb-config-volume/cluster-file -d 'blobstore://CREDS:HERE@s3.us-east-2.amazonaws.com:443/path/to/backup/backup_name?bucket=sample-bucket&sc=1' --expire_before_version 25500608280 --knob_http_request_aws_v4_header=true --log
Final metadata update...
All data before version 25500608280 has been deleted.
Instead of specifying a version, a timestamp can be specified as well:
--expire_before_timestamp '2023/03/01.01:00:00+0000'
In this case, fdbbackup will find a version close to this timestamp and expire the data before that version.
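Putting it together, a timestamp-based expire could look roughly like this (same URL and knob as in the earlier expire example; just a sketch):
fdbbackup expire -C /mnt/fdb-config-volume/cluster-file -d 'blobstore://CREDS:HERE@s3.us-east-2.amazonaws.com:443/path/to/backup/backup_name?bucket=sample-bucket&sc=1' --expire_before_timestamp '2023/03/01.01:00:00+0000' --knob_http_request_aws_v4_header=true --log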
After expiring, the backup size should be reduced. This can be verified with the AWS CLI.
aws s3 ls s3://sample-bucket/path/to/backup/backup_name/ --recursive --human-readable --summarize
... (output truncated)
Total Objects: 5407
Total Size: 446.7 GiB
Restoring backups
Restores can be triggered with the fdbrestore utility. If no target time or version is specified, fdbrestore will restore the latest available version. Example:
[root@sample-fdb-backup-agent-776ff99c68-dcpxr /]# fdbrestore start --dest-cluster-file /mnt/fdb-config-volume/cluster-file -r 'blobstore://CREDS:HERE@s3.us-east-2.amazonaws.com:443/path/to/backup/backup_name?bucket=sample-bucket&sc=1' -t conttest --knob_http_request_aws_v4_header=true --log
No restore target version given, will use maximum restorable version from backup description.
Using target restore version 86927163669
Backup Description
URL: blobstore://CREDS:HERE@s3.us-east-2.amazonaws.com:443/path/to/backup/backup_name?bucket=sample-bucket&sc=1
Restorable: true
Partitioned logs: false
Snapshot: startVersion=69006357978 (maxLogEnd -0.21 days) endVersion=76194359335 (maxLogEnd -0.12 days) totalBytes=83296161774 restorable=false expiredPct=62.14
Snapshot: startVersion=76200702764 (maxLogEnd -0.12 days) endVersion=83403177978 (maxLogEnd -0.04 days) totalBytes=83296403899 restorable=true expiredPct=0.00
SnapshotBytes: 166592565673
ExpiredEndVersion: 73473237361 (maxLogEnd -0.16 days)
UnreliableEndVersion: 73473237361 (maxLogEnd -0.16 days)
MinLogBeginVersion: 73473237361 (maxLogEnd -0.16 days)
ContiguousLogEndVersion: 86927163670 (maxLogEnd -0.00 days)
MaxLogEndVersion: 86927163670 (maxLogEnd -0.00 days)
MinRestorableVersion: 83403177978 (maxLogEnd -0.04 days)
MaxRestorableVersion: 86927163669 (maxLogEnd -0.00 days)
Restoring backup to version: 86927163669
After this, the status and the progress can be checked with the status subcommand.
[root@sample-fdb-backup-agent-776ff99c68-dcpxr /]# fdbrestore status --dest-cluster-file /mnt/fdb-config-volume/cluster-file
Tag: conttest UID: 0698eea17a5d6189cba7add3dc5c2341 State: running Blocks: 14699/327034 BlocksInProgress: 15630 Files: 3098 BytesWritten: 15287375472 CurrentVersion: 76200702764 FirstConsistentVersion: 83403177978 ApplyVersionLag: 0 LastError: ''Task execution stopped due to timeout, abort, or completion by another worker' on 'restore_start'' 140s ago.
URL: blobstore://CREDS:HERE@s3.us-east-2.amazonaws.com:443/path/to/backup/backup_name?bucket=sample-bucket&sc=1 Range: ''-'\xff' AddPrefix: '' RemovePrefix: '' Version: 86927163669
The restore process itself uses a single commit proxy.
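To restore to a specific point in time instead of the latest version, fdbrestore can be given a target timestamp. A hedged sketch only: flag spellings can vary slightly between versions, and --orig_cluster_file points at the cluster the backup was taken from so the timestamp can be mapped to a version:
fdbrestore start --dest-cluster-file /mnt/fdb-config-volume/cluster-file --orig_cluster_file /mnt/fdb-config-volume/cluster-file -r 'blobstore://CREDS:HERE@s3.us-east-2.amazonaws.com:443/path/to/backup/backup_name?bucket=sample-bucket&sc=1' -t conttest --timestamp '2023/02/28.17:00:00+0000' --knob_http_request_aws_v4_header=true --log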
Closing notes
TLS and ports
The backup_agent is capable of using TLS to talk to the object store, which is desirable for most deployments. By default, it needs a client-side certificate (not covered in this blog post). If a self-signed certificate is used, or if no client certificate is used but the HTTPS endpoint of the object store is still used, the --tls_verify_peers Check.Valid=0 option can be used (mostly useful for testing purposes). With the secure endpoint, in some cases I need to specify :443 in the blobstore URL for the correct protocol to be used (there is no protocol specification because of blobstore://).
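For completeness, a hedged sketch of the TLS-related options that backup_agent accepts; the certificate paths here are hypothetical and would need to be mounted into the pod:
/usr/bin/backup_agent \
  -C /mnt/fdb-config-volume/cluster-file \
  --tls_certificate_file /mnt/tls/cert.pem \
  --tls_key_file /mnt/tls/key.pem \
  --tls_ca_file /mnt/tls/ca.pem \
  --tls_verify_peers Check.Valid=0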
V2 or V4 request headers for object stores
This blog post is written in 2023, so use v4 request headers (v2 request headers have been marked as deprecated for a long time). Earlier, FoundationDB supported only v2, but as of today v4 is available across the board. In order to use v4 headers, specify the --knob_http_request_aws_v4_header=true option.
Using different object stores
FoundationDB 7.1 can be backed up and restored with the methods demonstrated in this blog post. There is a change in FoundationDB 7.2: the &region=REGION_HERE parameter becomes mandatory in the blobstore URL. As of 7.1, FoundationDB parses the region from the domain in the URL (for this reason, it is important to use the region-specific URL for S3, not the bucket-specific one). In 7.2, the region becomes a mandatory parameter, which allows more types of blob stores to be used.
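As an illustration, a 7.2-style blobstore URL with the explicit region parameter could look roughly like this (same placeholder credentials and bucket as above):
'blobstore://CREDS:HERE@s3.us-east-2.amazonaws.com:443/path/to/backup/backup_name?bucket=sample-bucket&region=us-east-2&sc=1'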
Accessing S3 via IAM role
We had trouble accessing S3 via an IAM role (and with a service account in Kubernetes): the fdbbackup utility always got 403 Forbidden as a response. In order to use S3, create an IAM user (not an assumed role), which can be restricted to access only specific buckets, and give those credentials to the backup_agent pods. A useful knob for debugging these kinds of issues is --knob_http_verbose_level=4.
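For example, the knob can be appended to the start command from earlier to see the HTTP traffic to the object store (just a debugging sketch):
fdbbackup start -t sample-test -z -s 7200 -C /mnt/fdb-config-volume/cluster-file -d 'blobstore://${AWS_ACCESS_KEY_ID}:${AWS_SECRET_ACCESS_KEY}@s3.us-east-2.amazonaws.com:443/path/to/backup/backup_name?bucket=sample-bucket&sc=1' --knob_http_request_aws_v4_header=true --knob_http_verbose_level=4 --log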
Having strong backup and disaster recovery capabilities in a database is crucial to ensure that your data is safe and secure. FoundationDB gives Tigris a robust foundation, with excellent backup and disaster recovery support.