Troubleshooting

This guide covers common issues encountered with SGStream and their solutions.

Diagnosing Issues

Check Stream Status

# Get detailed status
kubectl get sgstream my-stream -o yaml

# Check conditions
kubectl get sgstream my-stream -o jsonpath='{.status.conditions}' | jq

# Check failure message
kubectl get sgstream my-stream -o jsonpath='{.status.failure}'

Check Pod Status

# Find stream pod
kubectl get pods -l stackgres.io/stream-name=my-stream

# Describe pod for events
kubectl describe pod -l stackgres.io/stream-name=my-stream

# Check logs
kubectl logs -l stackgres.io/stream-name=my-stream --tail=100

Check Events

kubectl get events --field-selector involvedObject.name=my-stream --sort-by='.lastTimestamp'

Common Issues

Stream Fails to Start

Symptom

The stream pod is in CrashLoopBackOff or Error state.

Possible Causes and Solutions

1. Source database not accessible

# Check connectivity from cluster
kubectl run test-connection --rm -it --image=postgres:16 -- \
  psql -h source-cluster -U postgres -c "SELECT 1"

Solution: Verify network policies, service names, and credentials.
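
For example, to confirm the source service exists and to spot network policies that might block the stream pod (the service name source-cluster is taken from the example above):

# Confirm the source service resolves
kubectl get svc source-cluster

# List network policies that could block traffic
kubectl get networkpolicy --all-namespaces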

2. Invalid credentials

# Verify secret exists
kubectl get secret stream-credentials

# Check secret contents
kubectl get secret stream-credentials -o jsonpath='{.data.password}' | base64 -d

Solution: Update the secret with correct credentials.
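
One way to do that is to recreate the secret in place (key names and values here are illustrative; match whatever your stream references):

kubectl create secret generic stream-credentials \
  --from-literal=username=postgres \
  --from-literal=password='<correct-password>' \
  --dry-run=client -o yaml | kubectl apply -f -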

3. Logical replication not enabled

# Check wal_level on source
kubectl exec source-cluster-0 -c postgres-util -- psql -c "SHOW wal_level"

Solution: For external PostgreSQL, set wal_level = logical and restart.
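
On an external server this can be done with psql (external-host is a placeholder; the setting only takes effect after a restart):

psql -h external-host -U postgres -c "ALTER SYSTEM SET wal_level = logical"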

4. Insufficient replication slots

# Check max_replication_slots
kubectl exec source-cluster-0 -c postgres-util -- psql -c "SHOW max_replication_slots"

# Check current slots
kubectl exec source-cluster-0 -c postgres-util -- psql -c "SELECT * FROM pg_replication_slots"

Solution: Increase max_replication_slots in PostgreSQL configuration.
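
For an SGCluster source this is done through its SGPostgresConfig; a minimal sketch (the name and value are illustrative):

apiVersion: stackgres.io/v1
kind: SGPostgresConfig
metadata:
  name: pgconfig-more-slots
spec:
  postgresVersion: "16"
  postgresql.conf:
    max_replication_slots: "20"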


Replication Slot Already Exists

Symptom

Error: replication slot "xxx" already exists

Solution

  1. Check if another stream is using the slot:
kubectl get sgstream --all-namespaces
  2. If the slot is orphaned, drop it manually (see the check after this list):
kubectl exec source-cluster-0 -c postgres-util -- psql -c \
  "SELECT pg_drop_replication_slot('orphaned_slot_name')"
  3. Or specify a unique slot name:
spec:
  source:
    sgCluster:
      debeziumProperties:
        slotName: unique_slot_name
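
Before dropping a slot in step 2, it can help to confirm that nothing is attached to it (replace xxx with the slot name from the error message):

kubectl exec source-cluster-0 -c postgres-util -- psql -c \
  "SELECT slot_name, active, active_pid FROM pg_replication_slots WHERE slot_name = 'xxx'"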

Publication Already Exists

Symptom

Error: publication "xxx" already exists

Solution

  1. Use the existing publication:
spec:
  source:
    sgCluster:
      debeziumProperties:
        publicationName: existing_publication
        publicationAutocreateMode: disabled
  2. Or drop the orphaned publication:
kubectl exec source-cluster-0 -c postgres-util -- psql -c \
  "DROP PUBLICATION orphaned_publication"

High Replication Lag

Symptom

The milliSecondsBehindSource value keeps increasing.

Possible Causes and Solutions

1. Target can’t keep up

Increase batch size and tune connection pool:

spec:
  target:
    sgCluster:
      debeziumProperties:
        batchSize: 1000
        connectionPoolMax_size: 64
        useReductionBuffer: true

2. Network latency

Check the network path between the source and target:

kubectl exec stream-pod -- ping target-cluster
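
If ping is not available in the stream image, a throwaway pod can run the same check (image and target name are illustrative):

kubectl run netcheck --rm -it --image=busybox -- ping -c 3 target-cluster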

3. Insufficient resources

Increase stream pod resources:

spec:
  pods:
    resources:
      requests:
        cpu: 2000m
        memory: 2Gi
      limits:
        cpu: 4000m
        memory: 4Gi

4. Large transactions

For bulk operations, consider:

spec:
  source:
    sgCluster:
      debeziumProperties:
        maxBatchSize: 8192
        maxQueueSize: 32768

WAL Disk Space Issues

Symptom

The source database is running out of disk space due to WAL accumulation.

Causes

  • Stream is paused or slow
  • Replication slot is blocking WAL cleanup

Solutions

  1. Check slot status:
kubectl exec source-cluster-0 -c postgres-util -- psql -c \
  "SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) as lag_bytes
   FROM pg_replication_slots"
  2. If the stream is stuck, consider restarting it:
kubectl delete pod -l stackgres.io/stream-name=my-stream
  3. Enable heartbeats to acknowledge WAL:
spec:
  source:
    sgCluster:
      debeziumProperties:
        heartbeatIntervalMs: 30000
  4. For emergency cleanup (data loss risk):
# Only if stream can be recreated
kubectl exec source-cluster-0 -c postgres-util -- psql -c \
  "SELECT pg_drop_replication_slot('stuck_slot')"

Snapshot Takes Too Long

Symptom

The snapshot phase runs for an extended period.

Solutions

  1. Increase parallelism:
spec:
  source:
    sgCluster:
      debeziumProperties:
        snapshotMaxThreads: 4
        snapshotFetchSize: 20000
  2. Snapshot only the required tables:
spec:
  source:
    sgCluster:
      includes:
        - "public\\.important_table"
      debeziumProperties:
        snapshotIncludeCollectionList:
          - "public\\.important_table"
  3. Use incremental snapshots for large tables:
spec:
  source:
    sgCluster:
      debeziumProperties:
        snapshotMode: no_data  # Skip initial snapshot

Then trigger incremental snapshots via signals.
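
A sketch of such a signal, following the same annotation pattern as the tombstone signal shown under Graceful Shutdown (the annotation name is an assumption; the payload follows Debezium's execute-snapshot format, so verify both against your StackGres version):

kubectl annotate sgstream my-stream \
  debezium-signal.stackgres.io/execute-snapshot='{"type": "incremental", "data-collections": ["public.important_table"]}'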


Data Type Conversion Errors

Symptom

Errors about unsupported or mismatched data types.

Solutions

  1. Enable unknown datatype handling:
spec:
  source:
    sgCluster:
      debeziumProperties:
        includeUnknownDatatypes: true
        binaryHandlingMode: base64
  2. Use custom converters for specific types:
spec:
  source:
    sgCluster:
      debeziumProperties:
        converters:
          geometry:
            type: io.debezium.connector.postgresql.converters.GeometryConverter

CloudEvent Target Connection Refused

Symptom

Events are not being delivered to the CloudEvent endpoint.

Solutions

  1. Verify endpoint URL:
kubectl run curl --rm -it --image=curlimages/curl -- \
  curl -v https://events.example.com/health
  2. Check TLS settings:
spec:
  target:
    cloudEvent:
      http:
        skipHostnameVerification: true  # For self-signed certs
  3. Increase timeouts:
spec:
  target:
    cloudEvent:
      http:
        connectTimeout: "30s"
        readTimeout: "60s"
        retryLimit: 10

Stream Keeps Restarting

Symptom

The stream pod restarts frequently.

Possible Causes

  1. Out of memory
kubectl describe pod -l stackgres.io/stream-name=my-stream | grep -A5 "Last State"

Solution: Increase memory limits.
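
For example (values are illustrative; see the resources block under High Replication Lag for the full shape):

spec:
  pods:
    resources:
      limits:
        memory: 4Gi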

  2. Transient errors

Enable retries:

spec:
  source:
    sgCluster:
      debeziumProperties:
        errorsMaxRetries: 10
        retriableRestartConnectorWaitMs: 30000
  3. PersistentVolume issues

Check PVC status:

kubectl get pvc -l stackgres.io/stream-name=my-stream

Cannot Delete Stream

Symptom

The SGStream resource is stuck in the Terminating state.

Solutions

  1. Check for finalizers:
kubectl get sgstream my-stream -o jsonpath='{.metadata.finalizers}'
  2. Remove finalizers if stuck:
kubectl patch sgstream my-stream -p '{"metadata":{"finalizers":null}}' --type=merge
  3. Clean up orphaned resources:
# Delete replication slot manually
kubectl exec source-cluster-0 -c postgres-util -- psql -c \
  "SELECT pg_drop_replication_slot('my_stream_slot')"

# Delete publication
kubectl exec source-cluster-0 -c postgres-util -- psql -c \
  "DROP PUBLICATION IF EXISTS my_stream_publication"

Graceful Shutdown

To stop a stream gracefully and clean up resources:

  1. Send a tombstone signal:
kubectl annotate sgstream my-stream \
  debezium-signal.stackgres.io/tombstone='{}'
  2. Wait for the stream to complete:
kubectl get sgstream my-stream -w
  3. Delete the stream:
kubectl delete sgstream my-stream

Debug Mode

Enable verbose logging for detailed troubleshooting:

spec:
  pods:
    customContainers:
      - name: stream
        env:
          - name: DEBUG_STREAM
            value: "true"
          - name: QUARKUS_LOG_LEVEL
            value: "DEBUG"

Getting Help

If issues persist:

  1. Collect diagnostic information:
# Stream status
kubectl get sgstream my-stream -o yaml > stream-status.yaml

# Pod logs
kubectl logs -l stackgres.io/stream-name=my-stream --tail=500 > stream-logs.txt

# Events
kubectl get events --field-selector involvedObject.name=my-stream > stream-events.txt

# Source database status
kubectl exec source-cluster-0 -c postgres-util -- psql -c \
  "SELECT * FROM pg_replication_slots" > replication-slots.txt
  2. Check the StackGres documentation
  3. Open an issue on GitHub