Fault tolerance and suspend/resume

Feldera supports the following ways to gracefully stop a pipeline and later restart it from the same point:

Pause and resume. A paused pipeline continues to run and occupy memory on its host, but it does not ingest any more input until resumed. A paused pipeline does finish processing input it has already ingested; afterward, it uses only enough CPU to answer HTTP requests.

In the web UI, the ⏸️ button pauses a pipeline and ▶️ resumes it.

Every Feldera pipeline supports pause and resume.
Suspend and resume. When the user suspends a pipeline, it writes a checkpoint to storage and then exits. When the user resumes the pipeline, it uses the checkpoint to resume from the exact point where it left off, without dropping or duplicating input or output.

Suspending a pipeline allows it to resume gracefully on a different host or with a modified configuration. With Feldera deployed in Kubernetes, suspending a pipeline temrinates its current pod while keeping its state in EBS or S3. The pipeline can then resume in a new pod which can have more or fewer resources than the original pod.

The "suspend" API suspends a pipeline, and the "start" API (or ▶️ in the web UI) resumes it.

Suspend and resume is in the Feldera enterprise edition only. For a pipeline to support suspend and resume, a pipeline must use fault-tolerant connectors and have storage configured.
At-least-once fault tolerance. The pipeline regularly (by default, every minute) writes a checkpoint to storage. If the pipeline crashes, then it resumes from the most recent checkpoint, restarting input connectors from the point just before the checkpoint. Any output already produced beyond the checkpoint will be produced again, meaning that, overall, output records are produced at least once.

At-least-once fault tolerance is in the Feldera enterprise edition only, with the same configuration requirements as suspend and resume. Writing checkpoints has some performance cost so, in addition, it must be enabled explicitly on a pipeline.

An at-least-once fault tolerant pipeline can also be suspended and resumed.
Exactly once fault tolerance. This extends at-least-once fault tolerance with input and output journaling so that the pipeline restarts from the exact state prior to any unplanned, abrupt shutdown or crash. Restart of a fault-tolerant pipeline does not drop or duplicate input or output, meaning that each output record is produced exactly once.

Exactly once fault tolerance is in the Feldera enterprise edition only, with the same configuration requirements as suspend and resume. Journaling adds some performance cost to periodic checkpointing so, in addition, it must be enabled explicitly on a pipeline.

A fault-tolerant pipeline can also be suspended and resumed.

⚠️ Feldera resolves secrets once at pipeline startup. Checkpointed state for fault tolerant pipelines and suspended pipelines will contain resolved secrets.

The following sections describes these features in more detail.

Implementation

Feldera implements these features by writing a checkpoint to storage, that is, a consistent snapshot of the Feldera system's state, including computation and the input and output adapters. On resume, Feldera loads its state from the checkpoint, and then restarts each of the connectors at the point where it left off.

In addition, exactly once fault tolerance writes additional data to a separate journal between checkpoints. For each batch of data that Feldera processes through the pipeline, it writes enough information to the journal to obtain another copy of the batch's input data later.

When an exactly once fault-tolerant pipeline restarts, it loads its state from the most recent checkpoint, then it replays any data from input connectors previously processed beyond that checkpoint. If replay produces output that was previously sent to output connectors, it discards that output. After replay completes, the pipeline continues with new input that has not previously been processed.

Fault-tolerant connectors

For Feldera to suspend and resume, or to enable fault-tolerance, all of the pipeline's input connectors must support fault tolerance. Some input adapters do not yet support fault tolerance and therefore may not be part of a pipeline that supports these features. Input connectors individually document their support for fault tolerance.

For a pipeline to fully support either feature, its output connectors must also be fault tolerant. Only the Kafka output connector supports fault tolerance as of now. If a fault-tolerant pipeline includes non-fault-tolerant output connectors, then in the event of a crash and restart, Feldera may send duplicate output to those connectors, but it will not drop output.

Feldera does not yet support fault tolerance or suspend and resume in pipelines that use the SQL NOW function.

The following table documents input connector support for fault tolerance.

Input connector	Suspend and resume	At-least-once FT	Exactly once FT
Datagen	☑	☑	☑
Debezium	☑	☑	☑
Delta Lake	☑	☑	☐
HTTP GET (URL)	☑	☑	☑
HTTP	☑	☑	☑
Iceberg	☐	☐	☐
Kafka	☑	☑	☑
PostgreSQL	☐	☐	☐
Pub/Sub	☐	☐	☐
S3	☑	☑	☑

Enabling suspend and resume and fault tolerance

To enable suspend and resume or fault tolerance in an enterprise Feldera pipeline:

Ensure that all of the pipeline's connectors support fault tolerance, as described in the previous section.
Recent versions of Feldera enable storage in new pipelines by default. If storage is not yet enabled, enable it in one of the following ways:
- In the web UI, click on the gear icon ⚙️. In the dialog box, change storage to {}, e.g.:
  
  Then click on the Apply button. (For more storage options, click on the ⚙️ again, to see that {} has been expanded to show the available settings.)
- Using the fda command line tool:
```
fda set-config <pipeline> storage true
```
To additionally enable fault tolerance:
- In the web UI, click on the gear icon ⚙️. In the dialog box, inside fault_tolerance:
  - Change model to "at_least_once" or "exactly_once".
  - Optionally, change "checkpoint_interval_secs" to an automatic checkpoint interval of your choice, in seconds, or to null to disable automatic checkpoints.
  Then, click on the Apply button. If clicking on Apply does not dismiss the dialog box, then either storage has not been enabled or the running version of Feldera is not the enterprise edition.
- With the fda command line tool, use one of these commands to enable fault tolerance:
```
fda set-config <pipeline> fault_tolerance at_least_once
fda set-config <pipeline> fault_tolerance exactly_once
```
  Optionally, use the following command to change the automatic checkpointing interval to <seconds> (use 0 for <seconds> to disable automatic checkpoints):
```
fda set-config <pipeline> checkpoint_interval <seconds>
```

Writing checkpoints

When fault tolerance is enabled as described, by default Feldera automatically writes a checkpoint every 60 seconds. This interval can be changed to another value, or disabled entirely, as described above.

Feldera also has an HTTP API, /checkpoint, that allows the user to request writing a checkpoint immediately. If automatic checkpoints are disabled, then the user should occasionally invoke this API to ensure that the checkpoint feature is useful.

⚠️ Writing a checkpoint should ordinarily be a fairly fast operation that takes several seconds. However, the Delta Lake input connector can only checkpoint at some input positions. When a checkpoint is requested between those points, Feldera will execute steps that draw input only from those connectors until they advance to a point at which a checkpoint is possible, and then write the checkpoint. In some cases, this can take minutes or longer.

Implementation​

Fault-tolerant connectors​

Enabling suspend and resume and fault tolerance​

Writing checkpoints​

Implementation

Fault-tolerant connectors

Enabling suspend and resume and fault tolerance

Writing checkpoints