# Troubleshooting
This guide covers issues Feldera Enterprise users and operators might run into in production, and steps to remedy them.
## Pipelines with no input produce no outputs
Today, the progress of a pipeline is driven by its input connectors. A pipeline without any input connectors produces no outputs. Consequently, a SQL program that contains no tables and does not call the `NOW()` function will never produce any output.

Unfortunately, this makes it impossible to run simple SQL test code such as:

```sql
CREATE MATERIALIZED VIEW V AS SELECT 1 + 2;
```

The workaround for this limitation is to include at least one table in each SQL program; inserting a row into this table triggers the production of an output. You can use the datagen connector to populate this table:
```sql
CREATE TABLE T(c BOOLEAN) WITH (
  'connectors' = '[{
    "name": "dummy",
    "transport": {
      "name": "datagen",
      "config": {
        "plan": [{
          "limit": 1
        }]
      }
    }
  }]'
);

CREATE MATERIALIZED VIEW V AS SELECT 1 + 2 FROM T;
```
## Diagnosing Performance Issues
When investigating pipeline performance, Feldera support will typically request a support bundle. The bundle can be downloaded from your installation with one of the following methods:

- The `fda support-bundle` command:

  ```bash
  fda support-bundle affected-pipeline-name
  ```

- The `support_bundle` function in the Python SDK.
- The Web Console, which has a button to download the bundle for a pipeline.
- The `support_bundle` endpoint in the REST API (see the example after this list).
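For example, the REST endpoint can be called directly over HTTP. The host, port, and lack of authentication below are assumptions about a local deployment; the endpoint path follows the pattern of the other per-pipeline REST endpoints and should be verified against the REST API reference:

```bash
# Hypothetical host/port; add an Authorization header if your
# deployment requires authentication.
curl -fsS -o support-bundle.zip \
  "http://localhost:8080/v0/pipelines/affected-pipeline-name/support_bundle"
```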
The support bundle has the following content:
- **Pipeline Logs**: warnings and errors from the logs endpoint.
- **Pipeline Configuration**: the pipeline configuration, including the SQL code and connector settings.
- **Pipeline Metrics**: from the pipeline metrics endpoint.
- **Endpoint Stats**: from the stats endpoint.
- **Circuit Profile**: from the circuit profile endpoint.
- **Heap Profile**: from the heap usage endpoint.
## Common Error Messages
### Delta Lake Connection Errors
**Error:** `Table metadata is invalid: Number of checkpoint files '0' is not equal to number of checkpoint metadata parts 'None'`
**Solution:** This usually happens when the Delta table uses features unsupported by delta-rs, such as liquid clustering or deletion vectors. Check the table properties and set the checkpoint policy to "classic":
```sql
ALTER TABLE my_table SET TBLPROPERTIES (
  'checkpointPolicy' = 'classic'
)
```
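To see which features are currently enabled on the table, you can first inspect its properties. This is standard Delta Lake SQL (e.g., runnable from Spark SQL), reusing the hypothetical `my_table` name from above:

```sql
-- Lists all table properties, including checkpointPolicy and any
-- enabled feature flags (e.g., deletion vectors).
SHOW TBLPROPERTIES my_table;
```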
### Out-of-Memory Errors
**Error:** `The pipeline container has restarted. This was likely caused by an Out-Of-Memory (OOM) crash.`
Feldera runs each pipeline in a separate container with configurable memory limits. Here are some knobs to control memory usage:
- Adjust the pipeline's memory reservation and limit:

  ```json
  "resources": {
    "memory_mb_min": 32000,
    "memory_mb_max": 32000
  }
  ```

- Throttle the number of records buffered by a connector using the `max_queued_records` setting:

  ```json
  "max_queued_records": 100000
  ```

- Ensure that storage is enabled (it is on by default):

  ```json
  "storage": {
    "backend": {
      "name": "default"
    },
    "min_storage_bytes": null,
    "compression": "default",
    "cache_mib": null
  }
  ```

- Optimize your SQL queries to avoid expensive cross products. Use functions like `NOW()` sparingly on large relations. See the example after this list.
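As an illustration, consider two hypothetical tables, `orders` and `customers`. The first query is missing a join predicate and therefore computes the full cross product of the two relations; the second restricts the intermediate result with an explicit join condition:

```sql
-- Expensive: no join predicate, so every order is paired
-- with every customer (a cross product).
SELECT o.id, c.name
FROM orders o, customers c
WHERE o.amount > 100;

-- Cheaper: the join condition keeps the intermediate result small.
SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.amount > 100;
```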
### Out-of-Storage Errors
**Error:** The pipeline logs contain messages like:

```
DBSP error: runtime error: One or more worker threads terminated unexpectedly
worker thread 0 panicked
panic message: called `Result::unwrap()` on an `Err` value: StdIo(StorageFull)
```
**Solution:** Increase pipeline storage capacity.

In the Enterprise edition, Feldera runs each pipeline in a separate pod and, by default, attaches PVC volumes for storage. The default volume size is 30 GB; if your pipelines are encountering `StorageFull` errors, you should explicitly request larger volumes for each pipeline:
"resources": {
"storage_mb_max": 128000
}
### Kubernetes Evictions
**Error:** The pipeline becomes `UNAVAILABLE` with no errors in the logs.

**Solution:** Configure resource reservations and limits for the pipeline.

Kubernetes may evict pipeline pods under node resource pressure. To confirm, run:

```bash
kubectl describe pod pipeline-<pipeline-id>-0
```

and look for:

```
Status: Failed
Reason: Evicted
```
You can also view the eviction event in your cluster monitoring stack (e.g. Datadog).
Evictions typically happen only when running Feldera in shared Kubernetes clusters. Which pods get evicted is determined by their Kubernetes Quality-of-Service (QoS) class.
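You can check which QoS class Kubernetes assigned to a pipeline pod; the pod name and namespace below are placeholders for your own installation:

```bash
# Prints one of: Guaranteed, Burstable, BestEffort
kubectl get pod pipeline-<pipeline-id>-0 -n feldera \
  -o jsonpath='{.status.qosClass}'
```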
By default, Feldera pipelines do not reserve any CPU or memory resources, which puts them in the `BestEffort` QoS class, making them the first candidates for eviction. To raise their priority:
- `Burstable` class: reserve a minimum amount of memory and CPU:

  ```json
  "resources": {
    "cpu_cores_min": 16,
    "memory_mb_min": 32000
  }
  ```

- `Guaranteed` class: set minimum and maximum resources to the same value, for both memory and CPU:

  ```json
  "resources": {
    "cpu_cores_min": 16,
    "cpu_cores_max": 16,
    "memory_mb_min": 32000,
    "memory_mb_max": 32000
  }
  ```
### Rust Compilation Errors
**Error:** `No space left on device` during Rust compilation.

**Solution:** Ensure the compiler server has sufficient disk space (20 GiB by default, configured via the `compilerPvcStorageSize` value in the Helm chart).
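For example, assuming the chart was installed as a Helm release named `feldera` in the `feldera` namespace (the release name, namespace, and chart reference are placeholders; substitute your own), the value could be raised with something like:

```bash
# Hypothetical release name, namespace, and chart reference.
helm upgrade feldera feldera/feldera \
  --namespace feldera \
  --reuse-values \
  --set compilerPvcStorageSize=50Gi
```

Note that growing an already-provisioned PVC additionally requires the underlying storage class to allow volume expansion.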
## Uncommon Problems
### Lost or accidentally deleted the Feldera Control-Plane PostgreSQL database
Feldera tracks state about pipelines inside a PostgreSQL database. If the state in this database is ever lost or otherwise cannot be recovered, and the Feldera instance had running pipelines at the time, manual intervention may be necessary to clean up the leftover pods. It is not possible to reinstantiate these leftover (orphaned) pipelines, so the Kubernetes objects backing them should be removed manually.
- Identify any stale pipelines in the Feldera namespace (e.g., by running `kubectl get pods -n $NS`).
- For each pipeline, delete the Kubernetes objects that Feldera created for it: StatefulSet, Service, ConfigMap, Pod, and PVC.
Here is an example script that cleans up a stale pipeline. Note: it is generally not enough to delete just the pod, since the existing StatefulSet will try to re-create it.
```bash
NAMESPACE=feldera-ns
POD=pipeline-019a7c1d-6a0c-7923-afd7-0125fe589356-0
NAME=pipeline-019a7c1d-6a0c-7923-afd7-0125fe589356
PVC=pipeline-019a7c1d-6a0c-7923-afd7-0125fe589356-storage-pipeline-019a7c1d-6a0c-7923-afd7-0125fe589356-0

# Ensure you have the permissions to perform the delete operations
kubectl auth can-i delete sts -n $NAMESPACE
kubectl auth can-i delete service -n $NAMESPACE
kubectl auth can-i delete configmap -n $NAMESPACE
kubectl auth can-i delete pod -n $NAMESPACE
kubectl auth can-i delete pvc -n $NAMESPACE

# Delete the k8s objects manually
kubectl delete sts -n $NAMESPACE $NAME
kubectl delete service -n $NAMESPACE $NAME
kubectl delete configmap -n $NAMESPACE $NAME
kubectl delete pod -n $NAMESPACE $POD
kubectl delete pvc -n $NAMESPACE $PVC
```
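To confirm the cleanup succeeded, you can list any remaining objects for the pipeline, reusing the `$NAMESPACE` and `$NAME` variables from the script above:

```bash
# Should print nothing once all objects have been removed.
kubectl get sts,service,configmap,pod,pvc -n $NAMESPACE | grep $NAME
```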
### Expand existing pipeline storage
This guide applies only to the Feldera Enterprise Edition and is intended for advanced users, as it details out-of-band steps that make certain assumptions about implementation details of Feldera. As a consequence, these steps might be subject to change in the future. They are written with only single-host pipelines in mind.

Choosing a larger storage size without preserving existing state can be done by clearing the storage, followed by increasing the runtime configuration's `resources.storage_mb_max`. However, this field cannot be edited with uncleared storage (i.e., while preserving existing state). Currently, Feldera does not have a feature to expand existing storage. It is, however, possible to do so out-of-band by directly interacting with the underlying Kubernetes PVC. This is only possible if its storage class allows volume expansion.

This out-of-band action will result in a discrepancy between what the runtime configuration's `resources.storage_mb_max` states and the actual size of the storage. Be aware that clearing storage afterward will delete the PVC that was changed out-of-band, thereby undoing the storage expansion.
#### Steps

1. Note down the `<pipeline-id>` of the pipeline (from the Web Console: open the pipeline -> tab: Performance -> Pipeline ID button).

   These environment variables will be used in the following steps:

   ```bash
   # TODO: change to your own situation
   NAMESPACE=feldera
   PIPELINE_PVC=pipeline-<pipeline-id>-storage-pipeline-<pipeline-id>-0
   ```

2. Check the PVC of the pipeline:

   ```bash
   kubectl get pvc -n $NAMESPACE $PIPELINE_PVC
   ```

   ... which for example outputs (`resources.storage_mb_max` is set to 25000 in this example):

   ```
   NAME           STATUS   VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
   pipeline-....  Bound    pvc-...   24Gi       RWO            gp3            <unset>                 21s
   ```

   The storage class is `gp3` in this example.

3. Check the storage class:

   ```bash
   kubectl get sc gp3
   ```

   ... which for example outputs:

   ```
   NAME   PROVISIONER       RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
   gp3    ebs.csi.aws.com   Delete          WaitForFirstConsumer   true                   10d
   ```

   The storage class must support volume expansion (above: `ALLOWVOLUMEEXPANSION`) in order to continue with the next step. If it does not, you can potentially enable it for the storage class via:

   ```bash
   kubectl patch sc gp3 -p '{"allowVolumeExpansion": true}'
   ```

   ... followed by running `kubectl get sc gp3` again and checking the `ALLOWVOLUMEEXPANSION` column.

4. Patch the PVC of the pipeline with a higher storage request (in this example, increasing `25G` to `50G`):

   ```bash
   kubectl patch pvc -n $NAMESPACE $PIPELINE_PVC \
     -p '{"spec":{"resources":{"requests":{"storage":"50G"}}}}'
   ```

   Note: it is not possible to patch `spec.resources.limits.storage`; as a result, it will be lower than `spec.resources.requests.storage` afterward. Irrespective of this, on AWS the storage does get expanded.

5. Wait for the PVC resizing to take effect:

   ```bash
   kubectl get pvc -n $NAMESPACE $PIPELINE_PVC
   ```

   ... which for example eventually outputs (it can take seconds or minutes):

   ```
   NAME           STATUS   VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
   pipeline-....  Bound    pvc-...   47Gi       RWO            gp3            <unset>                 5m44s
   ```

   If it takes longer than expected, you can debug using `kubectl describe pvc -n $NAMESPACE $PIPELINE_PVC` to find out what is going on with the PVC.

   > **Note:** It depends on the storage provisioner backing the storage class whether it accepts the PVC modification, and how long it takes for it to be applied. Additionally, storage provisioners can limit how often you can modify a PVC within a certain time window. Doing multiple expansions of the same PVC in quick succession might stop working at some point, as is the case on AWS. Consult your cloud provider documentation to learn more (e.g., AWS).
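Optionally, you can verify from inside the pipeline pod that the filesystem has grown. The pod name follows the pattern used earlier; the mount path of the storage volume is an implementation detail that may differ per installation, so treat the path argument below as a placeholder:

```bash
# <storage-mount-path> is a placeholder; find the actual volume mount
# path in the pod spec (kubectl get pod ... -o yaml).
kubectl exec -n $NAMESPACE pipeline-<pipeline-id>-0 -- df -h <storage-mount-path>
```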