DevOps Guide

A practical, opinionated playbook for running orca in production. Every recommendation here comes from real operational experience migrating a ~20-service cluster (keycloak, gitea, litellm, searxng, compliance-agent, certifai-dashboard, and friends) off other orchestrators. Every pitfall listed was learned the hard way.

If you're just getting started, read the Getting Started and Configuration guides first. This page assumes you already have an orca binary and a host to run it on.

1. GitOps with orca-infra

Orca services are defined declaratively as service.toml files. You should treat those files the way you treat any other infrastructure code: keep them in git, review changes, and roll forward from the repo rather than from a shell session.

The orca-infra repo

Create a dedicated repo — we use gitea.meghsakha.com/sharang/orca-infra. Layout:

orca-infra/
├── cluster.toml
├── services/
│   ├── keycloak/
│   │   ├── service.toml
│   │   └── config/
│   │       └── certifai-theme/
│   │           └── ...theme files...
│   ├── gitea/
│   │   └── service.toml
│   ├── litellm/
│   │   ├── service.toml
│   │   └── config/
│   │       └── config.yaml
│   ├── searxng/
│   │   └── service.toml
│   ├── compliance-agent/
│   │   └── service.toml
│   └── certifai-dashboard/
│       └── service.toml
└── README.md

Each service gets its own directory. The config/ subdir, when present, holds files that the service.toml mounts into the container (themes, YAML configs, seed SQL, etc.).

Syncing to the host

The orca server reads services from the services/ directory relative to its current working directory. The simplest workflow:

bash
# On the host, once:
git clone ssh://git@gitea.meghsakha.com:22222/sharang/orca-infra.git ~/orca

# Then always start orca from ~/orca:
cd ~/orca && orca server -d

To roll out a change: commit in orca-infra, git pull on the host, orca deploy <service>.
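
That workflow fits in a small helper you can keep in orca-infra itself (a sketch — `rollout.sh` and the `ORCA_INFRA_DIR` override are illustrative names, not part of orca):

```shell
# rollout.sh — hypothetical convenience wrapper around the pull-then-deploy flow.
# ORCA_INFRA_DIR is an illustrative override; defaults to the ~/orca clone above.
rollout() {
  service="$1"
  cd "${ORCA_INFRA_DIR:-$HOME/orca}" || return 1
  git pull --ff-only          # roll forward from the repo, never from a shell session
  orca deploy "$service"      # redeploy only the changed service
}

# usage: rollout keycloak
```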

A real example: Keycloak

services/keycloak/service.toml defines both the database and the app as separate [[service]] entries in a single file. depends_on controls boot order, secrets are externalized, and the theme directory is mounted read-only.

toml
[[service]]
name = "keycloak-db"
image = "postgres:16-alpine"
runtime = "docker"

[service.env]
POSTGRES_DB = "keycloak"
POSTGRES_USER = "${secrets.KEYCLOAK_DB_USER}"
POSTGRES_PASSWORD = "${secrets.KEYCLOAK_DB_PASSWORD}"

[[service.volumes]]
source = "keycloak-db-data"
target = "/var/lib/postgresql/data"

[[service]]
name = "keycloak"
image = "quay.io/keycloak/keycloak:25.0"
runtime = "docker"
command = ["start", "--optimized", "--http-enabled=true", "--hostname-strict=false"]
domain = "auth.meghsakha.com"
port = 8080
depends_on = ["keycloak-db"]

[service.env]
KC_DB = "postgres"
KC_DB_URL = "jdbc:postgresql://keycloak-db:5432/keycloak"
KC_DB_USERNAME = "${secrets.KEYCLOAK_DB_USER}"
KC_DB_PASSWORD = "${secrets.KEYCLOAK_DB_PASSWORD}"
KC_HOSTNAME = "auth.meghsakha.com"
KC_PROXY_HEADERS = "xforwarded"
KEYCLOAK_ADMIN = "${secrets.KEYCLOAK_ADMIN_USERNAME}"
KEYCLOAK_ADMIN_PASSWORD = "${secrets.KEYCLOAK_BOOTSTRAP_PASSWORD}"

[[service.mounts]]
source = "./services/keycloak/config/certifai-theme"
target = "/opt/keycloak/themes/certifai"
read_only = true

[service.liveness]
path = "/realms/master"
interval_secs = 30
timeout_secs = 10
failure_threshold = 3
initial_delay_secs = 120

A few things to note:

  • ${secrets.KEYCLOAK_DB_PASSWORD} references encrypted secrets — see Secrets management.
  • The mount source is a relative path starting with ./services/.... That relative path is resolved against orca's current working directory, which is why running orca from the right directory matters (see pitfall below).
  • initial_delay_secs = 120 holds the first probe back for a full two minutes — comfortable headroom over Keycloak's ~90s cold boot.

Pitfall: always run orca from the same working directory

orca looks for services/ relative to the current working directory. If you run orca server -d from ~, it will look for ~/services/ and find nothing. Worse, if you run orca deploy keycloak from ~ while the server is running in ~/orca, the deploy command will fail to find the toml.

Rule: always cd ~/orca before running any orca command. Put it in a shell alias or a systemd WorkingDirectory= line. Don't rely on muscle memory.
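
One way to encode that rule (a sketch — the wrapper name and `ORCA_INFRA_DIR` variable are made up for illustration): a shell function that enters the repo in a subshell, so your own working directory is untouched:

```shell
# Drop this in ~/.bashrc. The subshell means your shell stays wherever it was;
# only the orca invocation runs from the repo root.
orcai() {
  (cd "${ORCA_INFRA_DIR:-$HOME/orca}" && orca "$@")
}

# usage: orcai deploy keycloak   # works from any directory
```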

2. cluster.toml

cluster.toml is the single source of truth for cluster-level configuration: cluster identity, default domain, ACME settings, and the backup schedule. It lives at the root of your orca-infra repo next to services/.

Here's what the breakpilot cluster looks like:

toml
[cluster]
name = "breakpilot"
domain = "meghsakha.com"

[acme]
email = "ops@meghsakha.com"
directory = "https://acme-v02.api.letsencrypt.org/directory"

[proxy]
http_port = 80
https_port = 443

[backup]
enabled = true
schedule = "0 0 3 * * *"   # daily at 03:00
retention_days = 14

[[backup.targets]]
type = "local"
path = "/var/lib/orca/backups"

[[backup.targets]]
type = "s3"
endpoint = "https://nbg1.your-objectstorage.com"
bucket = "breakpilot-backups"
region = "nbg1"
access_key = "${secrets.HETZNER_S3_ACCESS_KEY}"
secret_key = "${secrets.HETZNER_S3_SECRET_KEY}"

Pitfall: cluster.toml is only read on startup

Changes to cluster.toml — including backup schedules, ACME email, and proxy ports — are loaded exactly once, when the orca server boots. Editing the file does nothing until you restart:

bash
orca shutdown
cd ~/orca && orca server -d

Don't forget to cd ~/orca before restarting — otherwise orca will come back up pointing at the wrong (empty) services directory.

3. Secrets management

Secrets are stored encrypted at rest with AES-256 and decrypted in-memory when a service starts. You reference them from service.toml env blocks with ${secrets.NAME}.

Setting secrets

bash
cd ~/orca
orca secrets set KEYCLOAK_DB_USER keycloak
orca secrets set KEYCLOAK_DB_PASSWORD 'S0me-l0ng-rand0m-string'
orca secrets set KEYCLOAK_ADMIN_USERNAME admin
orca secrets set KEYCLOAK_BOOTSTRAP_PASSWORD 'an0ther-l0ng-string'
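
For the password-type values, generate rather than invent (a sketch; assumes openssl is on the host, which it almost always is):

```shell
# 32 random bytes, base64-encoded: 44 characters of high-entropy secret.
gen_secret() {
  openssl rand -base64 32
}

# usage:
#   cd ~/orca && orca secrets set KEYCLOAK_DB_PASSWORD "$(gen_secret)"
```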

And in service.toml:

toml
[service.env]
KC_DB_USERNAME = "${secrets.KEYCLOAK_DB_USER}"
KC_DB_PASSWORD = "${secrets.KEYCLOAK_DB_PASSWORD}"
KEYCLOAK_ADMIN = "${secrets.KEYCLOAK_ADMIN_USERNAME}"
KEYCLOAK_ADMIN_PASSWORD = "${secrets.KEYCLOAK_BOOTSTRAP_PASSWORD}"

Listing and rotating

bash
orca secrets list               # names only, never values
orca secrets set KEY new-value  # overwrite
orca secrets rm KEY

After rotating a secret, redeploy the consuming service: orca deploy keycloak.

Pitfall: secrets.json is written relative to cwd

orca secrets set writes the encrypted secrets.json to the current working directory, not to a fixed location under ~/.orca/. This is a known wart and is being fixed in an upcoming release.

Until then: always run orca secrets set from ~/orca. Otherwise, you'll end up with a secrets.json in ~ that the server (running from ~/orca) never reads, and you'll spend an hour wondering why your env vars are blank.

4. CI/CD with webhooks

There are two CI/CD patterns for orca. Pick one per service based on how much rollback discipline you need.

Pattern 1: mutable :latest tag + webhook redeploy

Your CI builds an image, pushes :latest (and optionally :<sha> for debugging), then POSTs to an orca webhook. Orca force-pulls the :latest tag and restarts the service. Best for stateless apps where "forward fix" is the recovery strategy.

Pattern 2: pinned SHA tag + GitOps PR

Your CI builds an image tagged :<sha> only, then opens a PR against orca-infra to bump the toml. A human merges. Orca picks up the change on orca deploy. Best for databases, auth services, anything where you want an auditable history and a one-click revert.

The rest of this section walks through Pattern 1 using certifai-dashboard.

Gitea Actions workflow

.gitea/workflows/ci.yml in the certifai-dashboard repo:

yaml
name: build-and-deploy

on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: docker
    steps:
      - uses: actions/checkout@v4

      - name: Log in to registry
        run: |
          echo "${{ secrets.REGISTRY_PASSWORD }}" | \
            docker login registry.meghsakha.com \
              -u "${{ secrets.REGISTRY_USERNAME }}" --password-stdin

      - name: Build image
        run: |
          docker build \
            -t registry.meghsakha.com/certifai-dashboard:latest \
            -t registry.meghsakha.com/certifai-dashboard:${{ github.sha }} \
            .

      - name: Push image
        run: |
          docker push registry.meghsakha.com/certifai-dashboard:latest
          docker push registry.meghsakha.com/certifai-dashboard:${{ github.sha }}

      - name: Trigger orca webhook
        env:
          SECRET: ${{ secrets.ORCA_WEBHOOK_SECRET }}
        run: |
          BODY='{"repo":"sharang/certifai-dashboard","ref":"refs/heads/main","sha":"${{ github.sha }}"}'
          SIG=$(printf '%s' "$BODY" | openssl dgst -sha256 -hmac "$SECRET" | awk '{print $2}')
          curl -fsSL -X POST https://orca.meghsakha.com/api/v1/webhooks/github \
            -H "Content-Type: application/json" \
            -H "X-Hub-Signature-256: sha256=$SIG" \
            -d "$BODY"

Required Gitea repo secrets:

  • REGISTRY_USERNAME — service account for the private registry
  • REGISTRY_PASSWORD — its password or token
  • ORCA_WEBHOOK_SECRET — the HMAC key you registered with orca (next step)
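
Before wiring the workflow up end-to-end, you can sanity-check the signature step locally (the same openssl pipeline as in the workflow; the body and secret here are throwaway values):

```shell
BODY='{"repo":"sharang/certifai-dashboard","ref":"refs/heads/main","sha":"deadbeef"}'
SECRET='throwaway-test-secret'

# openssl prints "SHA2-256(stdin)= <hex>"; awk grabs the hex digest.
SIG=$(printf '%s' "$BODY" | openssl dgst -sha256 -hmac "$SECRET" | awk '{print $2}')

echo "X-Hub-Signature-256: sha256=$SIG"   # 64 hex characters after the prefix
```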

Registering the webhook with orca

Webhooks are registered via the orca REST API. You'll need your cluster admin token (written to ~/.orca/cluster.token on first boot).

bash
curl -X POST http://127.0.0.1:6880/api/v1/webhooks \
  -H "Authorization: Bearer $(cat ~/.orca/cluster.token)" \
  -H "Content-Type: application/json" \
  -d '{
    "repo": "sharang/certifai-dashboard",
    "service_name": "certifai-dashboard",
    "branch": "main",
    "secret": "the-same-value-as-ORCA_WEBHOOK_SECRET"
  }'

From then on, every push to main will:

  1. Build and push :latest and :<sha> images
  2. Call the webhook
  3. Orca verifies the HMAC, force-pulls :latest, and restarts the service

:latest vs :sha pull behavior

Orca force-pulls :latest tags on every reconcile (a recent addition on main). For :<sha> or any other immutable tag, orca skips the pull if the image already exists locally. This means Pattern 1 ("just bump latest") works as expected, and Pattern 2 ("bump the sha in toml") is efficient — no pointless re-pulls.

Pitfall: webhooks are currently in-memory

Registered webhooks live in the orca server's in-memory state and are lost on restart. After every orca shutdown / orca server -d, you need to re-register them. This is being fixed — webhook persistence is tracked as a known issue. Until then, keep the curl registration commands in a script in your orca-infra repo so you can replay them.
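
A minimal shape for that replay script (hypothetical — `register-webhooks.sh` and `ORCA_TOKEN_FILE` are illustrative names; the endpoint and payload match the registration call above):

```shell
# register-webhooks.sh — replay webhook registrations after a server restart.
# Keep this in orca-infra; the secret values come from wherever you store them.
register() {
  repo="$1"; service="$2"; secret="$3"
  curl -fsS -X POST http://127.0.0.1:6880/api/v1/webhooks \
    -H "Authorization: Bearer $(cat "${ORCA_TOKEN_FILE:-$HOME/.orca/cluster.token}")" \
    -H "Content-Type: application/json" \
    -d "{\"repo\":\"$repo\",\"service_name\":\"$service\",\"branch\":\"main\",\"secret\":\"$secret\"}"
}

# one line per webhook:
# register sharang/certifai-dashboard certifai-dashboard "$CERTIFAI_WEBHOOK_SECRET"
```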

5. Private Docker registry pulls

Orca uses the host's ~/.docker/config.json for registry authentication. Log in once on the host:

bash
docker login registry.meghsakha.com

From then on, any service.toml that references registry.meghsakha.com/foo:tag will be pulled with those credentials.

The chicken-and-egg problem

If your private registry is itself a service managed by orca (and routed through orca's proxy on :443), then pulling any image from that registry requires the orca proxy to be up, which requires the registry image to already be cached locally.

Pitfall: pre-pull the registry image

Before your very first orca server -d on a fresh host, manually pull the registry image:

bash
docker pull registry:2   # or whatever image your registry service uses

Otherwise you get a classic bootstrap deadlock: orca can't start the registry because it can't pull the registry image, because the proxy that routes to the registry isn't up yet.

On subsequent boots this isn't a problem — the image is cached — but it bites you hard on first-time setup and on any host migration.
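
A pre-pull guard you can bake into a host bootstrap script (a sketch; `ensure_image` is a made-up helper). It's idempotent, so it's safe to run on every boot:

```shell
# Pull only if the image is not already in the local cache.
ensure_image() {
  docker image inspect "$1" >/dev/null 2>&1 || docker pull "$1"
}

# usage, before the first `orca server -d` on a fresh host:
#   ensure_image registry:2
```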

6. Backups

Orca has a built-in backup scheduler configured via cluster.toml. It snapshots volumes and, optionally, runs per-service pre_hook commands (e.g. pg_dump) before snapshotting.

cluster.toml backup config

toml
[backup]
enabled = true
schedule = "0 0 3 * * *"   # six-field cron: sec min hour dom mon dow — daily at 03:00:00
retention_days = 14

[[backup.targets]]
type = "local"
path = "/var/lib/orca/backups"

[[backup.targets]]
type = "s3"
endpoint = "https://nbg1.your-objectstorage.com"
bucket = "breakpilot-backups"
region = "nbg1"
access_key = "${secrets.HETZNER_S3_ACCESS_KEY}"
secret_key = "${secrets.HETZNER_S3_SECRET_KEY}"

Hetzner Object Storage works as a drop-in S3 target — just point endpoint at the region-specific hostname and set region to match. Same pattern works for Backblaze B2, MinIO, Wasabi, etc.

Per-service pre_hook for database dumps

For Postgres-backed services, add a pre-hook that writes a dump into a location that gets snapshotted:

toml
[service.backup]
pre_hook = "pg_dump -U postgres -d keycloak -F c -f /var/backups/keycloak.dump"

Make sure /var/backups is a mounted volume so the dump is actually included in the snapshot.
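
Continuing the Keycloak example, that means the DB service block needs a volume entry like this (a sketch — the volume name is illustrative):

```toml
[[service.volumes]]
source = "keycloak-db-backups"   # named volume, illustrative
target = "/var/backups"          # pre_hook writes the dump here; gets snapshotted
```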

Verifying the scheduler is running

bash
grep "Backup scheduler started" ~/.orca/orca.log

If you don't see that line, either [backup] is missing from cluster.toml, enabled = false, or the cron expression failed to parse (remember: six fields, including seconds).
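
A five-second field-count check before restarting (pure shell, nothing orca-specific):

```shell
# Six fields = orca's cron format (sec min hour dom mon dow).
schedule="0 0 3 * * *"
fields=$(echo "$schedule" | awk '{print NF}')
echo "$fields"   # 6 — a five-field expression here means the schedule won't parse
```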

Pitfall: manual orca backup all is broken right now

There's a known issue with orca backup all (the on-demand manual backup command): it nests tokio runtimes and panics. This is being fixed.

Scheduled backups (triggered by the in-process cron) work fine — the bug is only in the CLI path. If you need an ad-hoc backup before the bug is fixed, wait until the next scheduled run or invoke the backup via the REST API.

7. Health checks for slow-starting services

Orca supports liveness probes per service:

toml
[service.liveness]
path = "/healthz"
interval_secs = 30
timeout_secs = 5
failure_threshold = 3
initial_delay_secs = 10

The probe hits http://<container>:<port><path> on the service's primary port. If failure_threshold consecutive probes fail, orca restarts the container.
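
The probe lifecycle, sketched in shell (illustrative semantics only — this is not orca's implementation):

```shell
# Wait initial_delay, then probe every interval; after failure_threshold
# consecutive failures, the container gets restarted.
probe_loop() {
  initial_delay=$1; interval=$2; threshold=$3; check=$4
  sleep "$initial_delay"
  failures=0
  while :; do
    if "$check"; then
      failures=0                     # any success resets the streak
    else
      failures=$((failures + 1))
      if [ "$failures" -ge "$threshold" ]; then
        echo "restart"               # orca would restart the container here
        return 0
      fi
    fi
    sleep "$interval"
  done
}

# usage: probe_loop 10 30 3 my_check_command
```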

Keycloak: the important one

Keycloak is the canonical slow-starting service: it takes ~90 seconds to fully initialize on a cold boot (JPA schema, Infinispan, theme compilation). Without initial_delay_secs, orca will kill it mid-boot every single time and you'll get into a crash loop that looks like a broken image.

toml
[service.liveness]
path = "/realms/master"
interval_secs = 30
timeout_secs = 10
failure_threshold = 3
initial_delay_secs = 120

Two non-obvious things:

  1. initial_delay_secs = 120 — gives Keycloak a full two minutes of uninterrupted boot. Overkill on fast hardware, safe on slow hardware, never too short.
  2. path = "/realms/master" — the obvious choice is /health/ready, but Keycloak exposes that on the management port 9000, not the main HTTP port 8080 that orca probes. /realms/master is a cheap GET on port 8080 that returns 200 once the server is fully up.

Pitfall: no initial_delay_secs = crash loop on slow starters

Any service that takes more than a few seconds to accept connections — Keycloak, Gitea on first migration, anything JVM-based — needs initial_delay_secs. Otherwise orca's first probe fails, consecutive failures reach failure_threshold within a couple of intervals, and the container is killed before it ever finishes booting. You'll waste an hour thinking the image is broken.

8. Reverse proxy and TLS

Orca's built-in proxy listens on 80/443 and routes HTTP(S) traffic to services by the domain field in their service.toml. ACME certs are automatically issued from Let's Encrypt for every service with a domain set, using the email from cluster.toml.

HTTP routing

toml
[[service]]
name = "gitea"
image = "gitea/gitea:1.22"
domain = "gitea.meghsakha.com"
port = 3000

Any request to https://gitea.meghsakha.com gets routed to the gitea container on port 3000. No extra config.

Non-HTTP ports: extra_ports

For services that need a raw TCP port exposed outside the proxy — Gitea SSH, Postgres on a bastion, etc. — use extra_ports:

toml
[[service]]
name = "gitea"
image = "gitea/gitea:1.22"
domain = "gitea.meghsakha.com"
port = 3000
extra_ports = ["22222:22"]

This publishes container port 22 on host port 22222. That's how Gitea SSH (ssh://git@gitea.meghsakha.com:22222) works on this cluster.

Some upstreams are picky about headers and cookies. If you're debugging weird login loops, verify orca's proxy is doing all of these correctly (it does, as of current main — this is informational):

  • Set-Cookie is appended, not inserted. Upstreams like Keycloak often return multiple Set-Cookie headers in one response; a proxy that uses HeaderMap::insert will clobber all but the last one and break sessions. Orca uses append.
  • X-Forwarded-Proto, X-Forwarded-Host, X-Forwarded-For are injected on every forwarded request. Keycloak reads X-Forwarded-Proto to generate correct https:// URLs in its OIDC redirect responses; without it, you get redirects to http:// and browsers refuse the resulting insecure cookies.
  • Upstream redirects are NOT auto-followed. Orca's HTTP client is configured with redirect::Policy::none(). If the proxy follows a 302 from Keycloak, the client browser never sees the redirect, the auth handshake breaks, and you get infinite login loops.

Debugging login loops

If a service that works direct-to-container stops working through the proxy, 95% of the time it's one of the above three issues. Orca has all three handled in current main — if you're running older builds, upgrade first before going deep on a debugging session.
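
A quick way to check the first bullet in practice (a debugging sketch; the helper name is made up). Compare the count direct-to-container against the count through the proxy — if the proxy number is lower, cookies are being clobbered:

```shell
# Count Set-Cookie headers in a response. curl -sI fetches headers only.
count_set_cookie() {
  curl -sI "$1" | grep -ci '^set-cookie:'
}

# usage (URLs illustrative — pick a login endpoint that sets session cookies):
#   count_set_cookie http://127.0.0.1:8080/some/login/path   # direct
#   count_set_cookie https://auth.meghsakha.com/some/login/path   # via proxy
```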

9. Migrating from another orchestrator

This is the playbook we used to move ~20 services off Coolify onto orca. Works for any source orchestrator (Coolify, docker-compose, Portainer, hand-rolled systemd) as long as you can docker inspect the running containers.

Step-by-step

1. Inspect the source container.

bash
docker inspect <container-name> | less

Pay attention to: Env, Mounts, NetworkSettings.Networks, HostConfig.PortBindings, Cmd, Entrypoint.

2. For DB-backed services, stop the app first, then dump the DB.

bash
docker stop certifai-dashboard               # disconnect all clients
docker exec certifai-db \
  pg_dump -U postgres -d certifai -F c -f /tmp/db.dump
docker cp certifai-db:/tmp/db.dump ./certifai.dump

Stopping the app first guarantees a consistent dump — no half-written rows, no surprise migrations mid-dump.

3. Write the new service.toml. Externalize every secret-looking value into ${secrets.X}. Don't just copy Coolify's inlined env — take the migration as an opportunity to clean up.

4. Set the secrets.

bash
cd ~/orca
orca secrets set CERTIFAI_DB_USER postgres
orca secrets set CERTIFAI_DB_PASSWORD 'value-from-docker-inspect'
# ... repeat for every secret

5. Deploy the new orca-managed containers.

bash
cd ~/orca && orca deploy certifai-dashboard

This starts the new DB (empty) and the new app. The app will almost certainly fail its liveness check because the DB is empty — that's fine, we're about to fix it.

6. Restore the DB dump into the new DB container.

bash
docker cp ./certifai.dump orca-certifai-db:/tmp/db.dump

docker exec orca-certifai-db \
  psql -U postgres -d certifai \
  -c 'DROP SCHEMA public CASCADE; CREATE SCHEMA public;'

docker exec orca-certifai-db \
  pg_restore -U postgres -d certifai /tmp/db.dump

The DROP SCHEMA is important — pg_restore doesn't like pre-existing tables from the fresh init.

7. Restart the app.

bash
orca restart certifai-dashboard

It should now pick up the restored data and pass its liveness check.

8. Cut DNS / proxy over. Point the domain at the orca host. If orca is on the same host as the old orchestrator, take the old service down first to free port 80/443.

9. Verify, then tear down the old containers. Give it at least 24 hours in production before docker rming the old ones — you want a recent, known-good snapshot to roll back to if anything surfaces.

10. Common gotchas reference

A skimmable list of every pitfall covered above:

  • cwd matters for services/. Always cd ~/orca before orca server or orca deploy. Orca resolves services/ relative to the current working directory.
  • cwd matters for secrets.json. orca secrets set writes to $PWD/secrets.json. Always run from ~/orca. (Being fixed.)
  • cluster.toml is only read on startup. Restart orca after editing: orca shutdown && cd ~/orca && orca server -d.
  • Webhooks are in-memory. Re-register them after every restart. Keep the curl commands in a script. (Being fixed.)
  • orca backup all has a tokio nesting bug. Use scheduled backups until it's fixed.
  • Pre-pull your registry image before the first boot if your registry is managed by orca. Otherwise bootstrap deadlock.
  • Slow-starting services need initial_delay_secs. Keycloak: 120s. Don't skimp.
  • Keycloak's /health/ready is on port 9000, not 8080. Use /realms/master on 8080 for liveness.
  • Cron schedules in cluster.toml need six fields (with seconds), not five. 0 0 3 * * *, not 0 3 * * *.
  • Stop the app before dumping its DB during a migration. Consistent dumps only come from quiesced databases.
  • DROP SCHEMA public CASCADE before running pg_restore into a fresh DB container.

If you hit something that isn't on this list, file an issue — the list grows with every migration.

Released under the AGPL-3.0 License.