Troubleshooting
Common failure modes you’ll hit when running roksbnkctl against real IBM Cloud accounts, organised as symptom → root cause → fix. The entries here are mined from the issue logs accumulated over Sprints 0-5 plus the failure shapes documented in PRD 05 §“Risks”.
Use the page as a lookup table. If your symptom isn’t here, Chapter 23 — The E2E test plan lists what every phase asserts; reverse-engineering from the assertions can narrow your diagnosis. For deeper-than-here debugging, the per-phase log files under /tmp/roksbnkctl-e2e-backends/ are the first stop.
Install and init
Symptom: roksbnkctl init errors with plaintext secret detected
Root cause: an existing ~/.roksbnkctl/<workspace>/config.yaml has a credential value sitting in a field whose name matches the rejection regex (api_key, password, token, secret_access_key, hmac_secret). The rejection is a deliberate safety net — see Chapter 14 §“What’s safe to commit vs not”.
Fix: move the credential into IBMCLOUD_API_KEY (env var) or the OS keychain (roksbnkctl init writes it via zalando/go-keyring). For a single-user dev box, the supported plaintext-on-disk channel is ibmcloud.api_key_b64 — base64-encoded, which doesn’t trip the regex.
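A quick pre-flight can flag the offending fields before init does, and produce the base64 form for ibmcloud.api_key_b64. This is a minimal sketch, assuming the field names listed above and a simple key: value YAML layout; the regex roksbnkctl actually uses may differ, and find_plaintext_secrets and encode_api_key are hypothetical helper names.

```shell
# Hypothetical pre-flight helpers; not part of roksbnkctl.

# Print (with line numbers) config lines whose key is one of the rejected
# field names and is followed by some value text.
# Returns non-zero when nothing matches (grep's exit status).
find_plaintext_secrets() {
  grep -nE '^[[:space:]]*(api_key|password|token|secret_access_key|hmac_secret)[[:space:]]*:[[:space:]]*[^[:space:]]' "$1"
}

# Base64-encode a key for the ibmcloud.api_key_b64 field.
# printf avoids encoding a trailing newline; tr strips any line wrapping
# that some base64 implementations add.
encode_api_key() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Example:
# find_plaintext_secrets ~/.roksbnkctl/my-workspace/config.yaml
# encode_api_key "$IBMCLOUD_API_KEY"
```

Because the pattern requires the colon right after the field name, a key such as api_key_b64 is not flagged, which matches the behaviour described above.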
Symptom: roksbnkctl init interactive prompts loop forever asking for the API key
Root cause: you’re running under CI / a non-TTY shell and roksbnkctl can’t read stdin. The interactive prompt fallback is the last step in the credential resolver chain and it doesn’t gracefully skip when stdin is closed.
Fix: set IBMCLOUD_API_KEY in the env, or pre-populate the keychain entry. For batch / CI runs, the documented invocation is:
IBMCLOUD_API_KEY=$(cat /path/to/secret) roksbnkctl init -w my-workspace
Pre-setting IBMCLOUD_API_KEY skips the API-key prompt (it’s the first link in the resolver chain). init still prompts for the remaining workspace metadata (region, resource group, cluster name) on TTY-bound stdin — a fully non-interactive bootstrap is on the v1.x roadmap.
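In a CI job it can help to fail fast rather than discover the prompt loop at runtime. The guard below is a sketch under the assumption that the only two safe situations are "key already in the environment" or "stdin is a terminal"; can_run_init is a hypothetical name, and the guard covers only the API-key prompt, not the remaining metadata prompts.

```shell
# Hypothetical CI guard; not part of roksbnkctl.
# Succeeds when an API key is already in the environment or stdin is a
# terminal; otherwise prints a hint and returns 1 so the job fails fast
# instead of looping on the prompt.
can_run_init() {
  if [ -n "${IBMCLOUD_API_KEY:-}" ]; then
    return 0
  fi
  if [ -t 0 ]; then
    return 0
  fi
  echo "IBMCLOUD_API_KEY is unset and stdin is not a TTY" >&2
  return 1
}

# Example:
# can_run_init && roksbnkctl init -w my-workspace
```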
Symptom: doctor reports terraform: not found on a fresh dev box
Root cause: terraform is the only strictly-required host tool for v1.0 (everything else is internalised). Doctor checks PATH; if your shell session hasn’t picked up the install location, the check won’t find the binary.
Fix: install terraform via your package manager (brew install terraform, apt-get install terraform, etc.) and re-source the shell, or set the TERRAFORM_BIN env var pointing at the binary explicitly.
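To see which binary would be picked up, the lookup can be approximated by hand. The sketch below assumes TERRAFORM_BIN takes precedence over PATH; that ordering is an assumption about doctor’s behaviour, and resolve_terraform is a hypothetical helper.

```shell
# Hypothetical resolver; the precedence (TERRAFORM_BIN first, then PATH)
# is an assumption, not confirmed roksbnkctl behaviour.
resolve_terraform() {
  if [ -n "${TERRAFORM_BIN:-}" ]; then
    if [ -x "$TERRAFORM_BIN" ]; then
      printf '%s\n' "$TERRAFORM_BIN"
      return 0
    fi
    echo "TERRAFORM_BIN is set but not executable: $TERRAFORM_BIN" >&2
    return 1
  fi
  # Falls back to a PATH lookup; returns non-zero when terraform is absent.
  command -v terraform
}

# Example:
# resolve_terraform || echo "install terraform or set TERRAFORM_BIN"
```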
roksbnkctl up lifecycle
Symptom: terraform apply errors timeout while waiting for state to become 'normal'
Root cause: IBM Cloud’s control plane occasionally takes 5-15 minutes to propagate cluster state, a known transient. The cluster was created; the API just hasn’t caught up to reporting it as Ready.
Fix: roksbnkctl up retries the apply automatically, up to 3 attempts with a 60-second sleep between them (see applyWithRetry in internal/cli/lifecycle.go). If all three attempts fail, re-run roksbnkctl up manually. Terraform’s state is durable, so the next run skips every resource that’s already provisioned.
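If you are hand-rolling terraform outside roksbnkctl up, the same retry budget can be approximated with a small wrapper. This is a generic sketch, not the applyWithRetry implementation; retry_cmd is a hypothetical helper.

```shell
# Generic fixed-delay retry (hypothetical; not the applyWithRetry code).
# Usage: retry_cmd <max_attempts> <sleep_seconds> <command> [args...]
retry_cmd() {
  _retry_max=$1
  _retry_delay=$2
  shift 2
  _retry_attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$_retry_attempt" -ge "$_retry_max" ]; then
      echo "giving up after $_retry_attempt attempts" >&2
      return 1
    fi
    echo "attempt $_retry_attempt failed; sleeping ${_retry_delay}s" >&2
    sleep "$_retry_delay"
    _retry_attempt=$((_retry_attempt + 1))
  done
}

# Example: 3 attempts, 60 seconds apart, as described above
# retry_cmd 3 60 terraform apply -auto-approve
```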
Symptom: roksbnkctl up returns success but roksbnkctl k get nodes says No resources found
Root cause: the ROKS cluster’s worker nodes take 5-10 minutes to provision after the cluster’s master endpoint returns Ready. Terraform considers the cluster “applied” as soon as the master is up; the workers come up asynchronously.
Fix: wait 5-10 minutes and re-run. If you want a deterministic gate, watch the IBM Cloud console’s cluster page until the worker count matches workers_per_zone × zones, then proceed. There’s no roksbnkctl wait command in v1.0 — that’s a v1.x addition.
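Until a native wait command exists, a polling loop can stand in for watching the console. The sketch below assumes roksbnkctl k forwards kubectl flags such as --no-headers (suggested by the other k invocations in this chapter, but not confirmed); count_ready_nodes and expected_workers are hypothetical helpers.

```shell
# Hypothetical helpers; roksbnkctl has no built-in wait in v1.0.

# Count nodes whose STATUS column is exactly "Ready" in
# `kubectl get nodes --no-headers`-style output read from stdin.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Expected worker count: workers_per_zone x zones.
expected_workers() {
  echo $(( $1 * $2 ))
}

# Example polling loop (2 workers per zone, 3 zones):
# want=$(expected_workers 2 3)
# until [ "$(roksbnkctl k get nodes --no-headers | count_ready_nodes)" -ge "$want" ]; do
#   sleep 30
# done
```

Nodes reported as NotReady or Ready,SchedulingDisabled are deliberately not counted, so the loop only exits once every expected worker is schedulable.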
Symptom: roksbnkctl up post-apply hook fails fetching the admin kubeconfig with a 404
Root cause: the IBM Cloud kubeconfig API (/global/v2/applications/kubeconfig) returns 404 for ~30-60 seconds after the cluster create call returns. The cluster exists but the kubeconfig endpoint hasn’t materialised.
Fix: the binary retries with exponential backoff and usually succeeds within a minute. If it still 404s after the retry budget, run roksbnkctl kubeconfig --download -w <workspace> to retry just the fetch without re-applying.
Symptom: Error: Inappropriate value for attribute "kubeconfig_dir": directory does not exist
Root cause: the upstream HCL’s IBM provider doesn’t MkdirAll for the kubeconfig output directory; it expects the parent dir to exist already. The variable’s default (/work/.bnk/scratch/kubeconfig) is the in-container path; on a direct-on-host run it’s a path that doesn’t exist.
Fix: roksbnkctl writes a workspace-scoped override (kubeconfig_dir = ~/.roksbnkctl/<ws>/state/kubeconfig) and creates the dir at apply time. If you’re hand-rolling terraform without roksbnkctl up, mkdir -p ~/.roksbnkctl/<ws>/state/kubeconfig first.
Symptom: terraform destroy leaves orphan IBM Cloud resources (LBs, security groups, VPEs)
Root cause: ROKS occasionally leaves dangling cluster-owned resources after the cluster itself is destroyed — the destroy returns success but the IBM Cloud account still shows a load balancer or a Virtual Private Endpoint Gateway tagged with the deleted cluster’s ID.
Fix: run roksbnkctl ibmcloud is load-balancers | grep <cluster-name> (and similar for vpc-endpoint-gateways, security-groups) and ibmcloud is load-balancer-delete each orphan by ID. A future roksbnkctl cluster destroy --sweep-orphans will automate this — for now, manual.
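The manual sweep can be made a little less error-prone by extracting candidate IDs first and reviewing them before deleting anything. This sketch assumes the listing is a table with a header row and the resource ID in the first column, which may not hold for every CLI version; orphan_ids is a hypothetical helper.

```shell
# Hypothetical filter; assumes a header row and the resource ID in the
# first column of the listing.
# Reads a listing on stdin and prints the ID of each row containing $1.
orphan_ids() {
  awk -v pat="$1" 'NR > 1 && index($0, pat) > 0 { print $1 }'
}

# Example (review the IDs before deleting anything):
# roksbnkctl ibmcloud is load-balancers | orphan_ids my-cluster
```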
Workspaces
Symptom: roksbnkctl ws delete <name> succeeds but subsequent commands still use the deleted workspace
Root cause: workspace context is set by the --workspace/-w flag (or the persistent value the active shell remembers from the last roksbnkctl ws use); deleting the workspace directory doesn’t reset that context, so subsequent commands try to operate on a non-existent workspace dir.
Fix: switch to another workspace before deleting the current one:
roksbnkctl ws use default
roksbnkctl ws delete my-old-workspace
The parking-lot pattern is the recommended flow: keep a default workspace as the always-safe destination after deletes. Documented in Chapter 6 — Workspaces.
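The two-step flow can be wrapped so the switch is never forgotten. A minimal sketch, assuming only the ws use and ws delete subcommands shown above; ws_safe_delete is a hypothetical helper, and the parking-lot name defaults to default.

```shell
# Hypothetical wrapper around the two commands above.
# Refuses to delete the parking-lot workspace itself, switches to it,
# then deletes the target.
# Usage: ws_safe_delete <workspace> [parking-lot-workspace]
ws_safe_delete() {
  _ws_target=${1:-}
  _ws_parking=${2:-default}
  if [ -z "$_ws_target" ] || [ "$_ws_target" = "$_ws_parking" ]; then
    echo "refusing to delete '$_ws_target'" >&2
    return 1
  fi
  roksbnkctl ws use "$_ws_parking" && roksbnkctl ws delete "$_ws_target"
}

# Example:
# ws_safe_delete my-old-workspace
```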
Symptom: workspace "<name>" has terraform-managed resources; pass --force to delete anyway
Root cause: the workspace’s terraform.tfstate is non-empty — live infrastructure exists. roksbnkctl ws delete refuses to orphan the resources by removing the state file out from under them.
Fix: run roksbnkctl down -w <name> --auto first to destroy the resources, then roksbnkctl ws delete <name> (no --force needed once state is empty). If you genuinely want to abandon the infra and clean up by hand later, roksbnkctl ws delete --force skips the check.
Backends
Symptom: --backend docker errors with Cannot connect to the Docker daemon
Root cause: dockerd isn’t running, or your user isn’t in the docker group, or you’re on a system that needs a separate rootless-docker socket path.
Fix:
- Linux with system docker: sudo systemctl start docker; add yourself to the docker group (sudo usermod -aG docker $USER) and log out + back in.
- Linux with rootless docker: systemctl --user start docker; set DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock.
- macOS / Windows: launch Docker Desktop / Rancher Desktop.
Verify with docker info | head -1 — if that fails, roksbnkctl --backend docker will too.
Symptom: --backend k8s errors with ops pod not found in roksbnkctl-ops namespace
Root cause: you haven’t run roksbnkctl ops install against the target cluster. The k8s backend dispatches into a long-lived ops pod that has to be provisioned first.
Fix:
roksbnkctl ops install
Verify with roksbnkctl k get pod -n roksbnkctl-ops — the pod should be Running. See Chapter 19 for the install model.
Symptom: --backend ssh:<target> errors with tool not found: iperf3 — run with --bootstrap to apt-install
Root cause: the SSH target doesn’t have the tool installed, and roksbnkctl doesn’t auto-install without explicit opt-in (because apt-installing on a production jumphost without consent is rude).
Fix: pass --bootstrap once per fresh target:
roksbnkctl --backend ssh:jumphost --bootstrap test throughput
The bootstrap step runs apt-get install -y <tool> (or the equivalent for ibmcloud — adding the IBM apt repo first). Subsequent calls skip the install check and run normally. See Chapter 17 §“SSH backend” for the bootstrap mechanism.
Symptom: --backend ssh:jumphost errors host key mismatch for jumphost (got SHA256:..., known_hosts has SHA256:...)
Root cause: the jumphost was re-provisioned (terraform destroy + apply) and now has a fresh host key, but ~/.roksbnkctl/known_hosts still has the old fingerprint. TOFU refuses to silently accept the change — that’s the threat model the prompt exists to defend against.
Fix: if you know the re-provision is legitimate, delete the stale entry:
ssh-keygen -R '<jumphost-ip>' -f ~/.roksbnkctl/known_hosts
# Or for the whole roksbnkctl known_hosts:
rm ~/.roksbnkctl/known_hosts
The next roksbnkctl --backend ssh:jumphost call will TOFU-prompt with the new fingerprint. For CI, use the --insecure-host-key flag, which records the key on first contact without prompting.
OpenShift and PodSecurity
Symptom: throughput test pod fails admission: violates PodSecurity "restricted:v1.x": runAsNonRoot != true
Root cause: the throughput suite’s default iperf3 image is networkstatic/iperf3:latest which runs as root. OpenShift’s restricted-v2 SCC rejects root pods.
Fix: set the workspace config to use the bundled image, which is built USER 1000:
# ~/.roksbnkctl/<workspace>/config.yaml
test:
  throughput:
    image: ghcr.io/jgruberf5/roksbnkctl-tools-iperf3:v0.9.0
Chapter 22 §“The bundled image and the runAsNonRoot constraint” is the full backstory. The same chapter’s §“OpenShift SCC failure mode” lists the three error-message variants OpenShift produces.
Symptom: roksbnkctl ops install errors ServiceAccount "roksbnkctl-ops" forbidden: violates PodSecurityPolicy
Root cause: rare — the cluster is running PodSecurityPolicy (the deprecated predecessor to PodSecurity admission) and the ops pod’s ServiceAccount doesn’t have the SCC binding it needs.
Fix: the ops manifest assumes restricted-v2 is acceptable. If your cluster forces privileged, that’s a cluster-policy question outside roksbnkctl’s control — talk to your cluster admin about granting restricted-v2 to the roksbnkctl-ops namespace.
Symptom: ImagePullBackOff on the ops pod or throughput pod
Root cause: most commonly, the cluster can’t reach the image registry. Three sub-causes:
- The cluster’s egress NAT doesn’t route to ghcr.io (the image host for roksbnkctl-tools-*).
- The image tag doesn’t exist for the version you’re running (e.g., you built roksbnkctl from main at a commit between releases, and :dev isn’t published).
- ghcr.io itself is rate-limiting unauthenticated pulls (rare; usually only an issue for shared CI hosts hitting ghcr.io en masse).
Fix:
- Check egress with roksbnkctl k exec <ops-pod> -- curl -sI https://ghcr.io; if that hangs, you have a network path issue, not a roksbnkctl issue.
- Check the tag with docker manifest inspect ghcr.io/jgruberf5/roksbnkctl-tools-iperf3:<version>; if 404, pin to a tagged release version in workspace config rather than running from main head.
- For rate-limit issues, pre-pull images to a local registry mirror and override the workspace test.throughput.image to point there.
DNS
Symptom: roksbnkctl test dns returns NXDOMAIN against an internal GSLB record that you know exists
Root cause: your laptop’s resolver chain doesn’t have a route to the internal GSLB VIP. The default --server system uses your /etc/resolv.conf, which resolves against your office or ISP resolver — neither of which knows about the cluster-private GSLB.
Fix: query the GSLB VIP explicitly, or query from inside the cluster:
# Query the GSLB VIP directly
roksbnkctl test dns --target www.example.com --type A --server 169.45.91.5
# Or run the probe from inside the cluster (the cluster's resolvers reach the GSLB)
roksbnkctl test dns --target www.example.com --type A --backend k8s --server cluster
Chapter 21 §“Server resolution” is the full --server reference.
Symptom: --gslb-compare always reports gslb_divergence: false against a target you expect to diverge
Root cause: the chosen target’s GSLB rule isn’t differentiating your local vantage (laptop) from your k8s vantage (cluster). Two common shapes:
- The name is fronted by an anycast resolver fleet (Cloudflare, Google Public DNS) — same answer everywhere by design.
- Your laptop and your cluster are both in the same geographic region from GSLB’s perspective (both in North America hitting the same datacenter).
Fix: pick a target known to be geo-resolved (www.google.com is the canonical “different IPs from different regions” example), or add an SSH-based vantage (--backend ssh:eu-bastion) to bring in a third region. Chapter 21 §“GSLB cross-vantage compare” covers the multi-vantage workflow.
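To sanity-check a divergence result by hand, compare the answer sets from two vantages directly. This is a sketch of one plausible comparison (order-insensitive set equality); how roksbnkctl computes gslb_divergence internally is not documented here, and answers_diverge is a hypothetical helper.

```shell
# Hypothetical comparison; not the roksbnkctl implementation.
# Each argument is a whitespace-separated list of IPs from one vantage.
# Returns 0 (diverged) when the two answer sets differ, 1 when they match.
answers_diverge() {
  # $1 and $2 are intentionally unquoted so each IP lands on its own line.
  _set_a=$(printf '%s\n' $1 | sort -u)
  _set_b=$(printf '%s\n' $2 | sort -u)
  [ "$_set_a" != "$_set_b" ]
}

# Example (eu-bastion is a placeholder host name):
# local_ips=$(dig +short www.example.com A)
# remote_ips=$(ssh eu-bastion dig +short www.example.com A)
# answers_diverge "$local_ips" "$remote_ips" && echo "vantages diverge"
```

Sorting before comparing means round-robin reordering of the same records does not count as divergence.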
Symptom: roksbnkctl test dns --backend docker errors DNS probe doesn't benefit from docker
Root cause: design choice. Docker containers on the default bridge network egress through the host (NAT behind the host’s IP), so a docker-backend probe has the same network identity as a --backend local probe: no GSLB-relevant vantage difference.
Fix: use --backend local, --backend k8s, or --backend ssh:<target> instead.
Cluster registration
Symptom: roksbnkctl cluster register <name> errors cluster not found
Root cause: the cluster name doesn’t exist in the workspace’s resource group, or the API key doesn’t have visibility into the resource group.
Fix: verify the name with roksbnkctl ibmcloud ks cluster ls --output json | jq '.[].name', and verify the resource-group scope in workspace config matches where the cluster lives. If the cluster is in a different resource group, set ibmcloud.resource_group in the workspace config to that group.
Symptom: register succeeds but roksbnkctl k get nodes immediately errors Unauthorized
Root cause: the kubeconfig was fetched but the auth token has already expired, or the IAM-based token that the kubeconfig embeds doesn’t match the API key that’s currently in env. Common after a 1-hour idle window.
Fix:
roksbnkctl kubeconfig --download --cluster <name>
The token refresh is automatic on every up/apply, but register against a cluster you didn’t just provision sometimes lands you with a stale token in the kubeconfig.
COS supply chain
Symptom: FLO fails to start with failed to pull FAR image: 403 Forbidden
Root cause: the f5-far-auth-key.tgz object in the bucket has stale credentials (the F5-side pull key was rotated, but the bucket still has the old one).
Fix: re-issue the key on the F5 side and upload to COS:
roksbnkctl cos object put bnk-schematics-resources/f5-far-auth-key.tgz \
./new-f5-far-auth-key.tgz \
--instance bnk-orchestration
# Restart FLO so it re-reads
roksbnkctl k delete pod -n f5-bnk -l app=flo
See Chapter 25 §“Worked example” for the full flow.
Symptom: cos object put for a 3 GB file errors midway with RequestTimeout
Root cause: the multipart upload SDK encountered a transient COS HTTP timeout on one of the part uploads. Multipart uploads aren’t currently resumed from the failure point — they restart from zero.
Fix: re-run the cos object put. If it fails reproducibly on the same part, the underlying network is the problem (your egress link is saturated, or COS is having a regional outage — check the IBM Cloud status page).
Symptom: cos bucket delete errors Bucket not empty
Root cause: COS requires buckets to be empty before delete; there’s no --recursive flag on bucket delete today.
Fix: list and delete each object, then delete the bucket:
roksbnkctl cos object list bnk-schematics-resources --instance bnk-orchestration | \
awk 'NR>1 {print $1}' | \
xargs -I{} roksbnkctl cos object delete "bnk-schematics-resources/{}" --instance bnk-orchestration
roksbnkctl cos bucket delete bnk-schematics-resources --instance bnk-orchestration
Don’t forget to abort any pending multipart uploads first — they don’t appear in the standard object list but they do prevent bucket deletion. The workaround for now is ibmcloud cos list-multipart-uploads followed by ibmcloud cos abort-multipart-upload until v1.x lands a native command.
Networking
Symptom: roksbnkctl test connectivity reports Get "https://...": dial tcp: i/o timeout for an internal-only URL
Root cause: connectivity probes run from --backend local by default. From your laptop, internal-only URLs (cluster-private VIPs, internal GSLB names) aren’t reachable.
Fix: route the probe through the cluster’s network via an SSH target inside the cluster’s VPC (--backend ssh:cluster-jumphost). --backend k8s is not an option yet for this suite: in v1.0 the k8s backend covers iperf3 and DNS only, and connectivity stays local until k8s support lands.
Symptom: roksbnkctl test connectivity fails with x509: certificate signed by unknown authority against a self-signed internal endpoint
Root cause: the URL’s TLS cert isn’t in the host’s trust store.
Fix: pass --insecure (session-wide; skips TLS validation for every probe in the run). The flag is deliberately session-wide rather than per-host — see Chapter 20 §“Mixed TLS-trust posture”. For mixed trust posture across multiple internal endpoints, run two separate test connectivity invocations, one per trust group.
CI-specific
Symptom: nightly e2e run fails on phase D with Error: Provider configuration is missing
Root cause: a terraform init cache invalidation under ~/.roksbnkctl/<ws>/state/.terraform/ left a partial provider download. Happens after a CI worker is recycled mid-init.
Fix: rm -rf ~/.roksbnkctl/<ws>/state/.terraform/ then re-run roksbnkctl up. Terraform-init re-downloads the providers cleanly. For CI workers that get recycled often, add a pre-step that purges .terraform/ before each run.
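For recycled CI workers, the purge can live in a small pre-step function. A minimal sketch using the path layout from the fix above; purge_terraform_cache is a hypothetical name, and the empty-argument guard is an added safety check.

```shell
# Hypothetical CI pre-step. Removes only the provider cache directory and
# leaves terraform.tfstate and the rest of the workspace untouched.
# Usage: purge_terraform_cache <workspace>
purge_terraform_cache() {
  _purge_ws=${1:-}
  if [ -z "$_purge_ws" ]; then
    echo "usage: purge_terraform_cache <workspace>" >&2
    return 2
  fi
  rm -rf "$HOME/.roksbnkctl/$_purge_ws/state/.terraform"
}

# Example:
# purge_terraform_cache my-workspace && roksbnkctl up
```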
Symptom: cred audit (phase M) reports IBMCLOUD_API_KEY found in docker inspect output
Root cause: real stop-ship — credentials leaked into a docker container’s runtime env. Check internal/exec/docker.go::buildEnvArgv for any code path that passes the credential by value (-e IBMCLOUD_API_KEY=<value>) rather than by reference (-e IBMCLOUD_API_KEY — let docker pull from the caller’s env).
Fix: file an issue immediately, and do not tag a release until this is green. Phase M is the v1.0 release gate; a leak here means the redactor or the cred-passing logic regressed. See PRD 04 for the threat model.
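The by-reference form mentioned in the root cause can be illustrated in shell. This is a sketch of the idea only, not the buildEnvArgv implementation; env_by_reference_args is a hypothetical helper. It relies on docker’s documented behaviour that -e NAME with no value copies the variable from the caller’s environment.

```shell
# Hypothetical sketch, not the buildEnvArgv implementation.
# Emits "-e" and the variable NAME on separate lines, never NAME=value, so
# the secret value stays out of the command line docker is invoked with.
env_by_reference_args() {
  for _env_name in "$@"; do
    printf '%s\n' -e "$_env_name"
  done
}

# Example:
# docker run --rm $(env_by_reference_args IBMCLOUD_API_KEY) <image> <command>
```

The unquoted command substitution in the example is safe only because variable names contain no whitespace; the value itself never passes through the shell’s argument list.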
Getting more help
When the symptom isn’t on this page:
- Re-run with --verbose (-v); the verbose output usually surfaces the root cause directly.
- Check /tmp/roksbnkctl-e2e-backends/<phase>-<ts>.log for the per-phase trail.
- Cross-reference Chapter 23 — The E2E test plan; the phase-by-phase pass criteria usually narrow down where the breakage lives.
- File an issue on github.com/jgruberf5/roksbnkctl with the verbose output, the roksbnkctl --version stamp, and the per-phase log if there is one.