Preface

Foreword

Standing up F5 BIG-IP Next for Kubernetes (BNK) on IBM Cloud Red Hat OpenShift (ROKS) used to be a multi-step deployment that hit a different surface at every step. A terraform init/plan/apply against an HCL tree somebody handed you. A manual ibmcloud ks cluster config to pull a kubeconfig. A separate IBM Cloud CLI install with its own apt-source dance. A manual oc adm policy add-scc-to-user privileged to let iperf3 actually run. SSH plumbing, env-var plumbing, kubeconfig plumbing — each one a small thing, and together a half-day of yak-shaving before BNK was even on the cluster.

roksbnkctl collapses that into a single static binary plus four interchangeable execution backends (local, docker, k8s, ssh:<target>) plus an opt-in in-cluster ops pod. One command brings a workspace up; one command tears it down; the connectivity, DNS, and throughput tests run from whichever network vantage the question actually requires. The tool exists because the manual path has too many moving parts for somebody who just wants to evaluate BNK or run a customer demo.

This book is the user-facing documentation for roksbnkctl. It ships alongside the v1.0 binary.

Who this book is for

Four audiences:

BNK evaluators kicking the tires on F5 BIG-IP Next for Kubernetes who want a low-friction path to a working trial deployment.
F5 sales engineers (SEs) who need a repeatable demo and proof-of-concept toolchain for customer engagements.
Customer engineers standing up BNK in their own IBM Cloud account, either for evaluation or as the foundation of a production rollout.
Contributors who want to extend roksbnkctl — add a backend, add a test suite, ship a new chapter. See Part IX — Contributing.

How to read this book

The book is organised so it can be read in either of two ways.

Linear: Parts I-VII walk from concepts through your first deployment, day-2 operations, and the built-in test suite. New users should read in order.
Reference: Part VIII (Command reference, Configuration reference, Terraform variable reference, Glossary) is exhaustive and indexed for lookups during day-to-day work. Search at the top of every page also reaches into reference material.

If you have 30 minutes and an IBM Cloud account, skip straight to Chapter 7 — Quick start. It’s the canonical “first cluster up” walkthrough and the rest of the book makes more sense after you’ve seen the happy path end-to-end.

Part IX is for contributors who want to build roksbnkctl from source or extend it.

Prerequisites

This book assumes:

Basic familiarity with IBM Cloud — you have an account, you know what an API key is, and you’ve used the IBM Cloud console at least once.
Basic familiarity with Kubernetes — you know what a pod, service, and namespace are; you’ve run kubectl (or oc) before.
A working terminal on Linux or macOS. Windows is supported for roksbnkctl itself, with documented limitations around interactive SSH (see Chapter 16).

You do not need prior experience with:

Terraform — roksbnkctl embeds a vetted HCL tree and drives terraform for you. You can ignore the underlying HCL until you want to customise it (Chapter 13).
OpenShift specifics — the tool treats ROKS as Kubernetes with a thin SCC + project overlay; the few OpenShift-specific gotchas are called out in Chapter 22 and Chapter 26.
F5 BIG-IP Next — BNK is the thing the book deploys; you don’t need to be a BIG-IP engineer to evaluate it. Chapter 1 is the 5-minute “what is this product” primer.

Book conventions

Code blocks: shell commands use bash syntax highlighting; YAML snippets use yaml; HCL fragments use hcl; sample command output is shown in plain text blocks to distinguish output from input.
Cross-references: every chapter ends with a “Cross-references” section linking related chapters. Inline links use the form [Chapter 7 — Quick start](./07-quick-start.md) — a chapter number, an em-dash, the chapter title, and the relative path to the chapter source.
PRD links: design documents under docs/prd/ are linked as full GitHub URLs (e.g. https://github.com/jgruberf5/roksbnkctl/blob/main/docs/prd/03-EXECUTION-BACKENDS.md) so they resolve from the published book at GitHub Pages. The PRDs are the design surface; the book is the user surface — read PRDs only if you’re contributing or want the why behind a design call.
Forward references to post-v1.0 work: where a feature is explicitly queued for a v1.x release (e.g. terraform --backend k8s, multi-hop SSH ProxyJump), the prose flags it in future tense and points at docs/PLAN.md §“What’s deliberately deferred to post-v1.0” for the roadmap.

Welcome.

What is BIG-IP Next for Kubernetes (BNK)

F5 BIG-IP Next for Kubernetes (BNK) is F5’s containerised, Kubernetes-native re-imagining of the BIG-IP data plane. It runs the BIG-IP Traffic Management Microkernel (TMM) as pods inside a cluster, and exposes its configuration surface through Custom Resources rather than the classic TMSH / iControl REST APIs. The point of BNK is to give Kubernetes workloads the L4 and L7 traffic management features F5 customers already rely on — advanced load balancing, TLS termination, WAF policy, GSLB — without bolting an external appliance onto the cluster’s edge.

This chapter sets the context for the rest of the book. If you already deploy and operate BNK day-to-day you can skim it; if you arrived here knowing generic Kubernetes but new to F5’s product family, read it first.

Where BNK fits in F5’s product family

F5 has historically delivered traffic management as the BIG-IP appliance: a hardened Linux box (physical or virtual) running TMOS, with TMM as the data-plane kernel module. BIG-IP works extremely well at the cluster edge — north-south traffic — but it sits outside the cluster and is configured through its own control surface.

The next-generation lineage is BIG-IP Next: the same TMM data plane refactored to run as a regular Linux process, configurable through declarative APIs instead of imperative TMSH. BIG-IP Next ships in three deployment shapes:

BIG-IP Next for VMs / Bare Metal — same form factor as classic BIG-IP, modernised control plane.
BIG-IP Next Service Proxy for Kubernetes (SPK) — telco-focused, for 5G core workloads.
BIG-IP Next for Kubernetes (BNK) — general-purpose, runs inside any conformant Kubernetes cluster.

BNK is the focus of this book. It is the option you pick when your workloads already live in Kubernetes and you want F5 traffic management without standing up a separate appliance fleet.

What problems BNK solves

A standard Kubernetes cluster ships with a basic Service / Ingress story: kube-proxy iptables rules, a community ingress controller, maybe an external load balancer in front. That covers the common case but falls short when you need:

Real L7 traffic management for north-south traffic — fine-grained routing, header manipulation, TLS termination with custom cipher suites, mTLS enforcement, advanced HTTP/2 + HTTP/3 handling, WAF policy enforcement at the edge.
East-west service mesh-style features without a sidecar — connection pooling, circuit breaking, retries, observability for pod-to-pod traffic, applied at a per-namespace or per-workload granularity.
GSLB-style global traffic management — health-checked DNS responses that send a client to the nearest healthy cluster, integrated with the cluster’s own service health.
Compliance and regulated workloads — DDoS mitigation, behavioural anomaly detection, audit logging that an enterprise security team will accept.

BNK delivers all of the above as cluster-native primitives. You install it once, and from then on you express traffic management intent through CRDs (F5BigIpCtx, F5IngressTls, F5GslbPool, etc.) committed alongside your application manifests.

The components

BNK isn’t a single binary; it’s a set of cooperating components installed into a cluster. The pieces you’ll see most often:

TMM (Traffic Management Microkernel) — the data plane. Runs as DaemonSet pods on dedicated worker nodes. Every packet handled by BNK passes through TMM.
FLO (F5 Lifecycle Operator) — the control-plane operator. Watches BNK CRDs and reconciles them into TMM data-plane configuration. Owns the lifecycle of the TMM pods themselves: image pulls, version upgrades, rolling restarts.
CIS (Container Ingress Services) — the Kubernetes-native ingress controller component. Watches Ingress and BNK ingress CRDs and programmes TMM to terminate the corresponding traffic.
CNE Instance — Cloud-Native Edge configuration, the umbrella resource that ties a BNK install to its tenant context.
Cert-Manager — not strictly an F5 component, but a hard dependency. BNK uses cert-manager to mint and rotate the certificates TMM presents to clients.

Deeper chapters reference these names; you don’t need to memorise them now. The thing to take away is that BNK is an operator + DaemonSet pattern: the operator (FLO) reconciles your declarative intent into running data-plane pods (TMM).

Where BNK runs

BNK runs on a conformant Kubernetes cluster. F5 publishes a support matrix — read it for definitive answers — but in practice you’ll see BNK deployed on:

Managed Kubernetes: ROKS (IBM Cloud’s managed OpenShift), OpenShift Dedicated, EKS, AKS, GKE.
Self-managed OpenShift on bare metal or VMs.
Upstream Kubernetes in private clouds, with an LB provider that BNK can integrate with.

This book targets ROKS specifically. The next chapter explains why. The decisions and patterns documented here will translate to other Kubernetes flavours, but the bundled Terraform that roksbnkctl ships only knows how to provision ROKS.

North-south and east-west, in one install

It’s worth calling out explicitly: BNK is not “just an ingress controller” and it’s not “just a service mesh data plane”. It’s both, in the same install:

North-south (client outside the cluster talking to a workload inside) — BNK fronts a LoadBalancer-typed service, terminates TLS, applies WAF policy, routes to backend pods. Replaces the role a hardware BIG-IP or community ingress controller would play.
East-west (pod-to-pod or namespace-to-namespace inside the cluster) — BNK can be inserted into the path with no application sidecar, providing per-workload connection pooling, retries, and observability.

A single BNK install can handle both at once. Customer architectures often start with the north-south story (the obvious replacement for an existing BIG-IP appliance), then expand into east-west as the team gets comfortable with the operator-driven configuration model.

Pointer to F5’s official docs

Everything in this chapter is intentionally a sketch — enough to make the rest of this book legible. For definitive and up-to-date product information, including the full CRD reference, version compatibility matrix, sizing guidance, and license model, see F5’s official BNK documentation: https://clouddocs.f5.com/bigip-next/latest/.

The rest of this book focuses on deploying BNK with roksbnkctl and validating that the deployment works end-to-end. It does not duplicate F5’s product documentation; it complements it.

For an at-a-glance view of how roksbnkctl’s components fit together — the four execution backends, the cluster, the jumphost, the IBM Cloud control plane — see the architecture diagram at the top of Chapter 17 — Execution backends. For the happy-path lifecycle from one command to the next, see Chapter 7 — Quick start.

Why ROKS (Red Hat OpenShift on IBM Cloud)

This book and the roksbnkctl tool target ROKS — IBM Cloud’s managed Red Hat OpenShift offering — specifically. Other Kubernetes flavours can run BNK, and most of the patterns you’ll learn here translate, but the bundled Terraform that roksbnkctl ships only knows how to provision a ROKS cluster.

This chapter explains the rationale behind that choice. If you’re already using ROKS, you can skim this. If you’re evaluating whether ROKS is the right substrate for your BNK trial, read in full.

What ROKS is

ROKS is short for Red Hat OpenShift on IBM Cloud. It’s IBM Cloud’s managed-OpenShift service: you ask IBM for a cluster, IBM provisions the masters, etcd, the OpenShift control plane, and a pool of worker nodes; you get a kubeconfig and start deploying.

ROKS clusters are real OpenShift. They run the same Operator Lifecycle Manager (OLM), the same oc CLI, the same SecurityContextConstraints (SCC) model, the same routes-and-services machinery you’d find on any OpenShift install. The only thing IBM has done differently is take responsibility for keeping the control plane and the underlying infrastructure healthy.

What IBM manages, what you manage

The boundary between “IBM’s responsibility” and “your responsibility” is the principal value proposition of any managed Kubernetes service. For ROKS the line falls roughly here:

| Concern | Owner |
|---|---|
| Master nodes (API server, scheduler, controllers) | IBM |
| etcd (persistence + backups) | IBM |
| OpenShift control plane (OLM, ingress operator, image registry) | IBM |
| OpenShift version upgrades for the control plane | IBM (you opt in to a major-version bump) |
| Worker node provisioning (VPC VSIs, subnets, security groups) | IBM, on your behalf via the cluster API |
| Worker node OS patching and CVE remediation | IBM |
| Worker pool sizing and lifecycle (`workers create/delete`) | You |
| Pod workloads running on the cluster | You |
| Application-level RBAC, network policy, TLS, service accounts | You |
| BNK install + configuration | You (this is what `roksbnkctl` automates) |

The thing to internalise: with ROKS you do not rack hardware, install RHEL, run openshift-install, manage etcd backups, or chase CVE patches across a worker fleet. IBM does all of that. You start at “I have an OpenShift cluster” and go from there.

Why managed-OpenShift over self-managed for BNK evaluation

If you want to evaluate BNK quickly, the calculus is straightforward. Self-managed OpenShift is a multi-week lift before you have a cluster:

Provision the underlying VMs (OpenStack / vSphere / bare metal).
Run openshift-install and debug whatever doesn’t go right.
Configure DNS, load balancers, container registry mirrors.
Stand up monitoring + logging + cert-manager.
Now you can start thinking about BNK.

ROKS compresses that to one Terraform apply of ~50 minutes. You get back a kubeconfig that authenticates against a real OpenShift cluster, with cert-manager already installable via OLM, and a worker pool of the size and zone topology you specified. From there, the BNK install is the same set of CRDs and Helm charts it would be on any OpenShift cluster.

For a sales-engineering demo or a customer proof-of-concept, “I have a cluster in 50 minutes” beats “I have a cluster in 2 weeks” every time. That trade-off is the reason this book exists in this shape.

Why OpenShift (not just any Kubernetes) for BNK

BNK runs on conformant Kubernetes generally, but it integrates more cleanly with OpenShift specifically because:

Operator-driven install — BNK is shipped as a set of operators. OpenShift has Operator Lifecycle Manager (OLM) as a first-class citizen, so the install pattern is familiar to OpenShift admins.
SecurityContextConstraints (SCC) — TMM pods need elevated capabilities (notably NET_ADMIN, raw socket access, hugepages). OpenShift’s SCC model formalises that grant; on upstream Kubernetes you’d be configuring PodSecurityAdmission policies by hand.
Routes — OpenShift’s Route CRD predates and is more capable than Ingress. BNK can act as an alternate Route implementation, slotting into existing OpenShift application architectures without forcing teams to migrate.
Image streams + the internal registry — useful for the BNK supply chain (FAR images, license bundles) which can be mirrored once and consumed by many installs.

If you’re already an OpenShift shop, BNK fits naturally. If you’re not, BNK still works but you’ll need to translate this book’s OpenShift-specific examples (SCCs, oc adm policy, Route) to your platform’s equivalents.

What’s out of scope for this book

A short list of Kubernetes flavours this book does not cover:

EKS / AKS / GKE — BNK runs on these, but roksbnkctl up won’t provision them. You’d use cloud-specific tooling, then deploy BNK on top with the standard Helm charts F5 publishes.
Self-managed OpenShift on bare metal or VMs — same: no roksbnkctl up. You’d use openshift-install, then deploy BNK.
K3s, RKE2, microk8s — BNK’s not formally supported on these for production; useful for local dev work but outside this book’s scope.

The patterns from later chapters — workspaces, the --on flag, the connectivity / DNS / throughput tests — would still be useful on any of these, but the lifecycle commands (init, up, down, cluster register) assume ROKS.

What you need before continuing

To follow this book end-to-end you need:

An IBM Cloud account with billing enabled. The free tier won’t provision a worker pool; you’ll need a Pay-As-You-Go or Subscription account.
An IBM Cloud API key with permission to create ROKS clusters in the target account.
A resource group to scope cluster resources to. The account’s built-in Default group works fine for a single-user evaluation; production deployments tend to use a dedicated group per environment.

The next chapters walk through installation and the quick-start path. By the end of Chapter 7 you’ll have a deployed BNK trial on a fresh ROKS cluster.

What roksbnkctl does (and doesn’t do)

roksbnkctl is a single-binary CLI for deploying and validating F5 BIG-IP Next for Kubernetes (BNK) onto IBM Cloud ROKS. It exists to compress a multi-step deployment — clone the right Terraform, configure it, run terraform, fetch a kubeconfig, install BNK, run smoke tests — into a four-command lifecycle.

This chapter is about scope. What roksbnkctl owns, what it deliberately does not own, and what’s coming in future releases. Read it before you reach for the tool to do something it isn’t trying to do.

The 4-command lifecycle

The everyday user-facing flow is four commands:

roksbnkctl init        # answer a few prompts about region, RG, cluster name
roksbnkctl up          # terraform plan + apply (~50 min for fresh ROKS + BNK)
roksbnkctl test        # connectivity + DNS + throughput against the deployment
roksbnkctl down        # tear it all back down when you're done evaluating

That’s it. From “I have an IBM Cloud API key” to “deployed BNK with a passing throughput test” with no manual terraform apply, no hand-editing kubeconfig paths, no chasing down BNK Helm charts — then a clean tear-down when you’re done so you stop paying for the cluster.

Chapter 7 walks through this end-to-end with sample output.
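
For unattended runs, the middle of the lifecycle can be wrapped so that teardown is always attempted, even when the deployment or the tests fail. The following is a minimal sketch rather than anything shipped with the tool: `run_trial` and the `ROKSBNKCTL` override variable are hypothetical names, and the sketch assumes that `roksbnkctl init` has already been completed and that `up`, `test`, and `down` need no interactive input.

```bash
#!/usr/bin/env bash
# Sketch only: run_trial and ROKSBNKCTL are hypothetical names, not part of roksbnkctl.
# Assumes `roksbnkctl init` has already been run for the active workspace and that
# up/test/down do not prompt for input.

run_trial() {
  local cli="${ROKSBNKCTL:-roksbnkctl}"   # override to point at a stub or another path
  local rc=0

  "$cli" up || rc=$?

  # Only run the test suites when the deployment came up cleanly.
  if [ "$rc" -eq 0 ]; then
    "$cli" test || rc=$?
  fi

  # Always attempt teardown so a failed run does not leave a billable cluster behind.
  if ! "$cli" down; then
    echo "teardown failed; check the workspace before retrying" >&2
    [ "$rc" -ne 0 ] || rc=1
  fi

  return "$rc"
}

# Usage (commented out so that sourcing this file has no side effects):
# run_trial && echo "trial passed and was torn down"
```

The function returns the first failing exit status, so a CI job can treat a failed throughput test as a failed run while still getting the cluster cleaned up.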

What roksbnkctl owns

roksbnkctl’s scope is everything between “you have an IBM Cloud API key” and “you have a working BNK install you can run tests against”. Concretely:

Workspace state — kubectl-style per-environment isolation under ~/.roksbnkctl/<workspace>/. Each workspace has its own config, terraform state, kubeconfig, scratch artefacts. Switch with roksbnkctl ws use <name> or override per-command with -w <name>.
Terraform-exec orchestration — wraps HashiCorp’s terraform-exec library to drive terraform init/plan/apply/destroy with the right state file, the right TF_DATA_DIR, the right tfvars layering. You don’t run terraform directly; roksbnkctl up does.
Kubeconfig fetch — after a successful up, fetches the admin kubeconfig from IBM Cloud’s container service API and writes it to ~/.kube/config at mode 0600. Retries on the 404s that happen during cluster propagation lag.
COS supply chain — the BNK install needs FAR images and JWT licenses staged in IBM Cloud Object Storage. roksbnkctl cos instance/bucket/object handles instance creation, bucket lifecycle, and streaming object I/O (multipart for large files) without making you pip install the IBM COS SDK separately.
Post-deploy validation — roksbnkctl test runs three suites: HTTP/HTTPS connectivity (built-in net/http, no external curl), DNS resolution (built-in net.Resolver, no external dig), and iperf3 throughput (deploys an iperf3 -s pod into the cluster, runs the client, parses JSON output, tears down).
Credentials handling — IBM Cloud API key resolution chain: env vars (IBMCLOUD_API_KEY etc.), OS keychain (macOS Keychain / libsecret / Windows Credential Manager via zalando/go-keyring), opt-in base64 in workspace config, interactive prompt as last resort. Plaintext keys in config.yaml are rejected.

If any of those words don’t make sense yet, don’t worry — later chapters cover each in depth.
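
The retry behaviour mentioned under the kubeconfig fetch follows a common pattern that is also handy if you ever pull a kubeconfig by hand. The sketch below is not roksbnkctl's implementation: `retry` is a hypothetical helper, it re-runs the command on any non-zero exit rather than only on a 404, and the attempt count and delay in the usage line are arbitrary assumptions.

```bash
# Sketch only: `retry` is a hypothetical helper, not a roksbnkctl command.
# retry MAX DELAY CMD...: run CMD until it succeeds, at most MAX times,
# sleeping DELAY seconds between attempts.
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Usage (the cluster name is a placeholder):
# retry 10 30 ibmcloud ks cluster config --cluster my-cluster --admin
```

A fixed delay keeps the sketch short; a longer or growing delay may suit a freshly provisioned cluster better.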

What roksbnkctl does not try to do

Equally important: the explicit non-goals. roksbnkctl deliberately stays out of these spaces because well-established tools already cover them:

Not a generic IBM Cloud CLI. That’s ibmcloud. If you want to manage VPCs, IAM policies, classic infrastructure, Watson, or any of the hundred-plus other IBM Cloud services, use ibmcloud. roksbnkctl ibmcloud <args...> exists as a convenience passthrough that loads workspace credentials, but it doesn’t try to replace ibmcloud’s surface.
Not a generic Kubernetes CLI. That’s kubectl. roksbnkctl kubectl <args...> is again a passthrough that loads the workspace’s kubeconfig; it does not try to be a kubectl re-implementation. (v1.0 internalises a small subset, roksbnkctl k get/apply/logs/exec/port-forward, so the happy path doesn’t require a host kubectl binary, but that’s targeted convenience, not replacement.)
Not an OpenShift admin tool. That’s oc. Same story: roksbnkctl oc <args...> passthrough, no attempt to re-implement.
Not a BNK runtime UI. Once BNK is deployed, you configure it through its CRDs (F5BigIpCtx, F5IngressTls, etc.). roksbnkctl doesn’t ship a TUI / web UI for editing those — it gets you to a deployed BNK and steps out of the way.
Not a Terraform authoring tool. The HCL lives in this repo’s terraform/ directory and is embedded into the binary at build time. roksbnkctl runs that HCL; it doesn’t help you write more of it. If you fork the HCL, point roksbnkctl at your fork via tf_source: github or tf_source: local.
Not an arbitrary workload deployer. BNK is the workload. The iperf3 / nginx fixtures used by roksbnkctl test exist only to validate BNK; they’re not a general-purpose deployment surface.

The principle is “do one thing well”. roksbnkctl does BNK-on-ROKS lifecycle and validation. Every other concern is delegated to the right purpose-built tool.

The relationship to bundled HCL

A core design decision worth surfacing: the Terraform that drives the deployment lives in this repo under terraform/, and is embedded into the roksbnkctl binary at build time via Go’s embed package.

This means:

One install gets you the CLI + a matched HCL pair. No “clone the right tag of the terraform repo separately” step.
Versioning is unified. A roksbnkctl v1.0 release ships with a specific snapshot of the HCL. Upgrading the binary upgrades the HCL atomically. There’s no skew between “binary version” and “HCL version”.
Power users can override. The workspace config has a tf_source: block:
```yaml
tf_source:
  type: embedded     # default; uses HCL bundled into the binary
  # type: local
  # path: /path/to/your/terraform
  # type: github
  # repo: yourfork/roksbnkctl-terraform
  # ref: my-branch
```
Set type: local if you’re iterating on the HCL itself. Set type: github to point at a fork of the terraform repo that you’ve published separately. The default, type: embedded, covers the everyday case.

Chapter 13 covers the tfvars layering rules; this is just the elevator pitch for “the HCL ships with the binary”.

What v1.0 ships and what’s queued for v1.x

This book ships with v1.0. The surface it documents:

kubectl internalisation — roksbnkctl k get/apply/logs/exec/port-forward is a first-class verb talking to the cluster directly via client-go. Host kubectl is informational only; the only required prereqs on PATH are terraform and helm.
Four execution backends — external tools (ibmcloud, iperf3, terraform) are selectable across local | docker | k8s | ssh via --backend, except that terraform over k8s and ssh is deferred to v1.x. iperf3 runs entirely in-cluster by default; ibmcloud runs in a pinned-version Docker container if you don’t want to install it; tools other than terraform can be proxied through a jumphost via --backend ssh:<target>. Chapter 17 is the user-facing surface; PRD 03 is the design rationale.
GSLB-aware DNS testing — the DNS probe is miekg/dns-based with multi-vantage support, so you can verify that BNK’s GSLB is returning different answers from different network locations. Chapter 21 covers it.
Polished book — all 32 chapters, every code example verified, four Mermaid diagrams (architecture / lifecycle / GSLB cross-vantage / backend matrix), per-Part worked examples.
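
When wrapping roksbnkctl in scripts, it can help to validate a backend selector before passing it to --backend. The sketch below encodes only the forms named in this book (local, docker, k8s, and ssh:<target>). `backend_kind` and `backend_target` are hypothetical helpers, and treating a bare `ssh` with no target as invalid is an assumption; the CLI itself may accept or reject forms differently.

```bash
# Sketch only: hypothetical helpers for wrapper scripts, not part of roksbnkctl.

# backend_kind SELECTOR: print local, docker, k8s, or ssh; fail on anything else.
backend_kind() {
  case "$1" in
    local|docker|k8s) printf '%s\n' "$1" ;;
    ssh:?*)           printf 'ssh\n' ;;
    *)                echo "unrecognised backend: $1" >&2; return 2 ;;
  esac
}

# backend_target SELECTOR: print the part after "ssh:", or an empty line.
backend_target() {
  case "$1" in
    ssh:?*) printf '%s\n' "${1#ssh:}" ;;
    *)      printf '\n' ;;
  esac
}

# Usage (the jumphost name is a placeholder):
# backend_kind ssh:jumphost-1     # prints: ssh
# backend_target ssh:jumphost-1   # prints: jumphost-1
```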

A handful of items are explicitly deferred to v1.x:

terraform over --backend k8s and --backend ssh (state-file portability design needed).
Multi-hop SSH ProxyJump for the --on and ssh:<target> paths.
Windows full TTY (interactive shell on Windows ships as line-buffered; full PTY is a v1.x item).
Typed OpenShift CRDs (today’s unstructured printer works; richer per-type output is queued).
Cross-driver cluster-sharing for e2e-test-full.sh (each driver brings up its own cluster today).

See docs/PLAN.md §“What’s deliberately deferred to post-v1.0” for the full roadmap.

Pointers to the next chapters

Chapter 4 — Installation gets the binary on your machine.
Chapter 7 — Quick start walks the 4-command lifecycle with sample output.
Chapter 16 — The --on flag and SSH jumphosts covers running passthrough commands over SSH against an auto-discovered jumphost, which is useful in customer-firewalled and air-gapped scenarios.

Installation

This chapter gets a roksbnkctl binary onto your machine and verifies it works. Two build paths are covered: build-from-source (native Go) and build-with-Docker (no host Go required).

Pre-built binaries are attached to every GitHub Release (Linux, macOS, Windows × amd64, arm64). The book also ships as an offline PDF (roksbnkctl-book-<tag>.pdf) on the same release page. A Homebrew tap is on the v1.x roadmap; until then macOS users grab the binary from the release page or build from source.

Prerequisites

Linux or macOS for the day-to-day developer experience. Windows compiles cleanly but interactive features (TTY-bound SSH shell, ssh-agent integration) are not first-class on Windows yet.
Git to clone the repository (only if building from source — not needed if you grab a pre-built binary).
Go 1.25 or newer if you want a native build. If you don’t have Go (or have an older version), use the Docker-based build or a pre-built release binary.
Terraform >= 1.5 on PATH at runtime — required for roksbnkctl up / plan / apply / down.
Helm 3 on PATH at runtime — required during roksbnkctl up. The bundled terraform modules (cert_manager, flo, cne_instance) use null_resource + local-exec provisioners that shell out to helm upgrade --install; without helm the apply errors out with exit status 127 — helm: not found.

The remaining tools (ibmcloud, kubectl, oc, iperf3, docker) are optional and only needed for the corresponding passthrough or backend.

You do not need Docker installed to use roksbnkctl with the default local backend. Docker is required only if you opt in to --backend docker for terraform / ibmcloud. The k8s and ssh backends are alternatives that need neither host Docker nor host Go.
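
A quick preflight check can catch a missing helm or terraform before a long `roksbnkctl up` fails partway through. The following is a minimal sketch; `missing_tools` and `preflight` are hypothetical helper names, and the required and optional tool lists are taken from the prerequisites above.

```bash
# Sketch only: hypothetical helpers, not part of roksbnkctl.

# missing_tools TOOL...: print the names not found on PATH, space-separated.
missing_tools() {
  local tool missing=""
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  printf '%s\n' "${missing# }"
}

# preflight: fail when a required tool is absent; only warn for optional ones.
preflight() {
  local required optional
  required="$(missing_tools terraform helm)"
  optional="$(missing_tools ibmcloud kubectl oc iperf3 docker)"
  [ -z "$optional" ] || echo "optional tools not found: $optional" >&2
  if [ -n "$required" ]; then
    echo "required tools not found: $required" >&2
    return 1
  fi
}

# Usage:
# preflight && roksbnkctl up
```

The check only confirms that the binaries are on PATH; it does not verify versions such as the Terraform >= 1.5 floor.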

Installing prerequisites

Install paths per platform follow. terraform and helm are strictly required for v1.0 (helm is invoked by terraform’s local-exec provisioners during roksbnkctl up); the rest are optional, so install only what you need.

macOS — Homebrew

brew install terraform               # required
brew install helm                    # required — terraform `local-exec` provisioner shells out to `helm`
brew install --cask ibm-cloud-cli    # optional — only for `roksbnkctl ibmcloud …` passthrough
brew install kubectl                 # optional — only for `roksbnkctl kubectl …` passthrough (`roksbnkctl k *` is internalised)
brew install iperf3                  # optional — only for `--backend local`/`--backend ssh:<t>` throughput tests

# oc (Red Hat OpenShift CLI) — optional, only for `roksbnkctl oc …` passthrough.
# No brew formula; install via the Red Hat mirror tarball:
curl -sSL https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-mac.tar.gz \
  | sudo tar -xz -C /usr/local/bin oc

If you installed ibmcloud-cli, add the plugins roksbnkctl uses:

ibmcloud plugin install kubernetes-service -f
ibmcloud plugin install cloud-object-storage -f

Linux — Ubuntu / Debian

# terraform — required
wget -qO- https://apt.releases.hashicorp.com/gpg \
  | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] \
https://apt.releases.hashicorp.com $(lsb_release -cs) main" \
  | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt-get update && sudo apt-get install -y terraform

# helm 3 — required (terraform's null_resource + local-exec provisioner for cert_manager / flo / cne_instance shells out to `helm`)
curl https://baltocdn.com/helm/signing.asc \
  | sudo gpg --dearmor -o /usr/share/keyrings/helm.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] \
https://baltocdn.com/helm/stable/debian/ all main" \
  | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt-get update && sudo apt-get install -y helm

# ibmcloud CLI + plugins — optional, for `roksbnkctl ibmcloud …` passthrough with --backend local
curl -fsSL https://clis.cloud.ibm.com/install/linux | sudo sh
ibmcloud plugin install kubernetes-service -f
ibmcloud plugin install cloud-object-storage -f

# kubectl — optional, only for `roksbnkctl kubectl <args>` passthrough (`roksbnkctl k *` is internalised and needs no host install)
sudo snap install kubectl --classic
# or via direct download:
# curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
# chmod +x kubectl && sudo mv kubectl /usr/local/bin/

# oc (Red Hat OpenShift CLI) — optional, only for `roksbnkctl oc <args>` passthrough.
# No apt package; install via the Red Hat mirror tarball:
curl -sSL https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz \
  | sudo tar -xz -C /usr/local/bin oc

# iperf3 — optional, only for `--backend local` / `--backend ssh:<t>` throughput tests
sudo apt-get install -y iperf3

Instructions above target Ubuntu and Debian. For other Linux distributions (RHEL, Fedora, Arch, openSUSE, Alpine, …), a quick online search for “install terraform on <your distro>” — and the same pattern for ibmcloud, kubectl, and iperf3 — yields the equivalent commands. HashiCorp ships an RPM repo at https://rpm.releases.hashicorp.com covering RHEL/Fedora, and most distributions package kubectl and iperf3 in their official repos; the IBM Cloud CLI installer at https://clis.cloud.ibm.com/install/linux is a single curl-pipe-sh that works across distros.

Windows — Chocolatey

choco install terraform
choco install kubernetes-helm  # required — terraform local-exec provisioner shells out to `helm`
choco install ibmcloud-cli     # optional
choco install kubernetes-cli   # optional, provides kubectl
choco install openshift-cli    # optional, provides oc (Red Hat OpenShift CLI)
choco install iperf3           # optional

Or via Scoop:

scoop install terraform helm ibmcloud-cli kubernetes-cli openshift-cli iperf3

If choco/scoop don’t carry openshift-cli for your version, grab the Windows zip archive from the Red Hat mirror directly:

Invoke-WebRequest -Uri https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-windows.zip -OutFile oc.zip
Expand-Archive oc.zip -DestinationPath "$env:USERPROFILE\bin\"
# then add %USERPROFILE%\bin to your PATH

After installing ibmcloud-cli, add the plugins:

ibmcloud plugin install kubernetes-service -f
ibmcloud plugin install cloud-object-storage -f

TTY-bound SSH features (the roksbnkctl shell --on <target> interactive path) have known limitations on Windows; file-based SSH keys + non-interactive commands work, but ssh-agent named-pipe integration is a v1.x item. See docs/PLAN.md §“What’s deliberately deferred to post-v1.0”.

Path A — native build (requires Go 1.25+)

If go version reports 1.25 or newer, this is the simplest path:

git clone https://github.com/jgruberf5/roksbnkctl.git
cd roksbnkctl

go mod tidy                          # first time only — populates go.sum
make build                           # → bin/roksbnkctl

# Install via roksbnkctl itself (recommended — copies into ~/.local/bin):
./bin/roksbnkctl install

That’s the whole thing. The install subcommand is idempotent and copies the running binary into a directory on your PATH. Default destination is ~/.local/bin/roksbnkctl.

Make targets you’ll use most often:

make build      # go build -ldflags ... -o bin/roksbnkctl ./cmd/roksbnkctl
make test       # go test ./...
make vet        # go vet ./...
make tidy       # go mod tidy
make clean      # rm -rf bin/

If make build fails, the most likely cause is a Go toolchain that is too old. The module declares go 1.25.0 in go.mod (forced by transitive deps from the SSH/integration test layers); older versions error out with go: module requires Go 1.25. Either upgrade Go or fall back to the Docker path below.
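A wrapper script can check the toolchain before it picks a build path. The following is a minimal sketch, not part of roksbnkctl: meets_min_go is a hypothetical helper, and the 1.25 floor is taken from the go.mod statement above.

```shell
# Hypothetical helper: succeed when a Go version string is 1.25 or newer.
# Accepts "go1.25.3" (as printed by `go env GOVERSION`) or a bare "1.25".
meets_min_go() {
  local v="${1#go}"              # drop the optional "go" prefix
  local major="${v%%.*}"         # text before the first dot
  local rest="${v#*.}"           # text after the first dot
  local minor="${rest%%.*}"      # second version component
  minor="${minor%%[!0-9]*}"      # trim suffixes such as "rc1"
  case "$major" in ''|*[!0-9]*) return 1 ;; esac
  [ -n "$minor" ] || return 1
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 25 ]; }
}

# Usage (assumes `go` may or may not be installed):
#   if meets_min_go "$(go env GOVERSION 2>/dev/null)"; then
#     make build
#   else
#     echo "host Go missing or too old; use the Docker path" >&2
#   fi
```

An empty or unparseable version string is treated as "too old", so a host without Go falls through to the Docker path as well.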

Path B — Docker-based build (no host Go required)

This path is ideal for sealed CI workstations, custom VM images, or anywhere installing Go on the host is awkward. The official golang:1.25-alpine image has everything needed, and the build artefact lands in ./bin/ owned by your host user. (Sprint 1 bumped the minimum Go version from 1.23 to 1.25 because of testcontainers-go and gliderlabs/ssh transitive dependencies.)

git clone https://github.com/jgruberf5/roksbnkctl.git
cd roksbnkctl

docker run --rm -v "$PWD:/work" -w /work \
  --user "$(id -u):$(id -g)" -e HOME=/tmp \
  golang:1.25-alpine sh -c 'go mod tidy && go build -o bin/roksbnkctl ./cmd/roksbnkctl'

./bin/roksbnkctl install

Anatomy of the docker invocation:

Flag	Why
`-v "$PWD:/work"`	Bind-mount the repo into the container at `/work`.
`-w /work`	Container working directory matches the mount.
`--user "$(id -u):$(id -g)"`	Output binary is owned by your host user, not root.
`-e HOME=/tmp`	Go writes its build cache under `$HOME` by default; `/tmp` is writable by any user. Without this, the build fails with a permission error because the container’s default home directory is not writable by your host UID.
`golang:1.25-alpine`	Pinned major version; matches `go.mod`’s minimum.

Cross-compile via Docker

Set GOOS / GOARCH env vars in the same docker run to produce binaries for other platforms:

# macOS arm64 (Apple Silicon)
docker run --rm -v "$PWD:/work" -w /work \
  --user "$(id -u):$(id -g)" -e HOME=/tmp \
  -e GOOS=darwin -e GOARCH=arm64 \
  golang:1.25-alpine sh -c 'go mod tidy && go build -o bin/roksbnkctl-darwin-arm64 ./cmd/roksbnkctl'

# Windows amd64 (compile-only; not tested at runtime)
docker run --rm -v "$PWD:/work" -w /work \
  --user "$(id -u):$(id -g)" -e HOME=/tmp \
  -e GOOS=windows -e GOARCH=amd64 \
  golang:1.25-alpine sh -c 'go mod tidy && go build -o bin/roksbnkctl.exe ./cmd/roksbnkctl'

Each binary is statically linked (Alpine + CGO_ENABLED=0 is the cross-compile default) so the produced file has no runtime library dependencies.

The `install` subcommand

roksbnkctl install [--dir PATH] [--force]

install copies the running binary into a directory on PATH. Defaults:

Destination: ~/.local/bin/roksbnkctl — this directory is on the default PATH for most modern Linux and macOS user environments and does not require sudo.
Mode: 0755.
Idempotent: if the running binary is already at the destination, no-op (no error).

Override the destination with --dir:

./bin/roksbnkctl install --dir ~/bin
sudo ./bin/roksbnkctl install --dir /usr/local/bin

--force overwrites an existing file at the destination. Without it, install refuses if the destination is a different binary.

If ~/.local/bin is not on your PATH, add it. On bash:

echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
exec "$SHELL" -l

On zsh, swap ~/.bashrc for ~/.zshrc.
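If you script the PATH change (in a dotfiles repo or a VM image build, say), it helps to make it idempotent so repeated runs don’t stack duplicate lines. This is a sketch under stated assumptions: both function names are hypothetical, and the rc file defaults to ~/.bashrc.

```shell
# Hypothetical helper: succeed when directory $1 is an exact entry in $PATH.
path_contains() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Hypothetical helper: append the export line to an rc file only when needed.
# $1 = rc file to edit (defaults to ~/.bashrc).
ensure_local_bin_on_path() {
  local rc="${1:-$HOME/.bashrc}"
  path_contains "$HOME/.local/bin" && return 0
  # The line may already be present if the shell has not been reloaded yet.
  grep -qsF '.local/bin' "$rc" && return 0
  echo 'export PATH="$HOME/.local/bin:$PATH"' >> "$rc"
}
```

Wrapping $PATH in colons before matching means /opt/x does not falsely match an entry such as /opt/xy.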

Verifying the install

Two quick checks: version (proves the binary runs) and doctor (proves the runtime environment is set up for actual work).

`roksbnkctl version`

roksbnkctl version

Sample output:

roksbnkctl v1.0.0 (commit abc1234, built 2026-05-10T14:22:08Z)
Docs: https://jgruberf5.github.io/roksbnkctl/book/

The version string is populated via -ldflags at build time; make build VERSION=v1.0.0 injects an explicit tag. A bare make build produces something like dev (commit abc1234, built ...). The Docs: URL is a compile-time constant (internal/cli/meta.go::DocsURL) — every binary built from this tree points at the same book URL.
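Scripts that need to tell a tagged build from a dev build can read the second token of the first output line. This is a sketch that assumes the line layout shown in the sample output above; version_token is a hypothetical helper, and the exact text a bare make build prints may differ.

```shell
# Hypothetical helper: print the version token from the first line of
# `roksbnkctl version` output, read on stdin.
version_token() {
  awk 'NR == 1 && $1 == "roksbnkctl" { print $2 }'
}

# Usage:
#   case "$(roksbnkctl version | version_token)" in
#     v[0-9]*) echo "tagged build" ;;
#     *)       echo "dev or unknown build" ;;
#   esac
```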

`roksbnkctl doctor`

roksbnkctl doctor

doctor runs the prereq + credentials report. Sample output on a healthy machine looks like this (yours will differ depending on which optional binaries you have installed and whether you’ve initialised a workspace):

✓  terraform         /usr/bin/terraform (Terraform v1.15.2)                                   (required for `roksbnkctl up`)
✓  helm              /usr/local/bin/helm (v3.20.2)                                            (required for `roksbnkctl up`; terraform `local-exec` provisioners shell out to helm)
⚠  iperf3            not on PATH                                                              (needed for `roksbnkctl test throughput`)
✓  kubectl           /usr/local/bin/kubectl (clientVersion:)                                  (optional; `roksbnkctl kubectl` passthrough)
✓  oc                /usr/local/bin/oc (Client Version: 4.21.10)                              (optional; `roksbnkctl oc` passthrough)
✓  ibmcloud          /usr/local/bin/ibmcloud (ibmcloud 2.43.0 ...)                            (optional; `roksbnkctl ibmcloud` passthrough)
✓  kubeconfig        /home/jgruber/.kube/config                                               (needed for cluster-side ops)
✓  workspace         default                                                                  (per-environment config + state)
✓  ibmcloud api key  resolved via OS keychain                                                 (auth for terraform + IBM SDK calls)
✓  ibm cloud auth    OK (account: Main F5 Account)                                            (verifies API key works against IBM IAM)

Each row is <status> <name> <detail> <why we care>. Failures are red ✗ and make doctor exit non-zero; warnings are yellow ⚠ and don’t fail the run. terraform and helm are the hard-required checks at v1.0; the rest are either optional passthroughs or specific to test suites. Chapter 5 walks through what each check is verifying and how to fix common failures.

OS support matrix

OS	Native build	Docker build	Cross-compile target	Runtime status
Linux (amd64, arm64)	yes	yes	yes	first-class
macOS (amd64, arm64)	yes	yes	yes	first-class
Windows (amd64, arm64)	yes	yes	yes	compile-only; `roksbnkctl shell --on` and `roksbnkctl exec --on jumphost` PTY behaviour limited

“First-class” means the v1.0 acceptance criteria are validated on those platforms; “compile-only” means the binary builds and runs but interactive features (notably TTY-bound SSH) have known limitations and are not part of the v1.0 release gate.

The Windows limitations are tracked in PRD 01 (the SSH client design) and largely come down to golang.org/x/crypto/ssh’s incomplete PTY handling on Windows and the absence of an SSH agent named-pipe protocol. File-based SSH keys work; full PTY and ssh-agent integration on Windows are on the v1.x roadmap (see docs/PLAN.md §“What’s deliberately deferred to post-v1.0”).

Required prerequisites — `terraform` and `helm` at v1.0

The v1.0 cluster lifecycle needs two binaries on PATH:

terraform (>= 1.5) — hard-required for any cluster lifecycle command (up, down, plan, apply).
helm (3.x) — hard-required during roksbnkctl up. The bundled terraform modules (cert_manager, flo, cne_instance) use null_resource + local-exec provisioners that shell out to helm upgrade --install. Without it, the apply fails with exit status 127 — helm: not found. (A v1.x effort to refactor those modules onto the helm_release terraform resource would eliminate the host requirement; tracked in docs/PLAN.md §“What’s deliberately deferred to post-v1.0”.)
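roksbnkctl doctor is the authoritative check. A provisioning script that runs before the binary is installed could approximate the two hard requirements with something like the sketch below. missing_tools is a hypothetical helper, and it tests presence on PATH only, not the version floors listed above.

```shell
# Hypothetical preflight: print each named tool that is not on PATH,
# and return non-zero when at least one is missing.
missing_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   missing_tools terraform helm || {
#     echo "install the tools listed above before running roksbnkctl up" >&2
#     exit 1
#   }
```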

Optional binaries — only needed for the corresponding passthrough or fallback path:

iperf3 — only needed for --backend local and --backend ssh:<target> throughput modes. The default --backend k8s runs iperf3 entirely in cluster (no host binary needed).
kubectl / oc — only needed for the roksbnkctl kubectl <args...> / roksbnkctl oc <args...> passthroughs. The everyday verbs (get, apply, describe, delete, logs, exec, port-forward) are internalised under roksbnkctl k and need no host binary.
ibmcloud — only needed for the roksbnkctl ibmcloud <args...> passthrough on --backend local. The cluster-lifecycle path uses IBM Go SDKs internally and does not shell out to ibmcloud. The docker, k8s, and ssh backends ship their own ibmcloud — no host install needed.
docker — only needed for --backend docker. Optional; the k8s and ssh backends are alternatives if docker isn’t available.

Run roksbnkctl doctor to see exactly what your environment is missing for the workflow you intend to run.

Updating

git pull && make build is the source-build update mechanism (or re-run the Docker build for the containerised path).

roksbnkctl self update upgrades from a tagged GitHub release. Use it once you’ve installed an initial v1.0 binary:

roksbnkctl self update
# Checks https://github.com/jgruberf5/roksbnkctl/releases/latest, downloads
# the matching asset for your OS+arch, verifies the checksum, swaps the
# binary atomically.

With a working binary on PATH, Chapter 5 — Doctor explains what every doctor check is looking at, Chapter 6 — Workspaces explains the ~/.roksbnkctl/<workspace>/ layout, and Chapter 7 — Quick start walks the 4-command lifecycle end-to-end.

Doctor: checking your environment

roksbnkctl doctor is the prereq + credentials report. It runs in under five seconds, exits non-zero on any hard error, and prints a tabular report that maps one-to-one to the runtime dependencies the rest of the tool reaches for.

This chapter walks every check, explains what each row’s “why we care” blurb means, covers the post-Sprint 2 changes that move kubectl and oc from “needed” to “informational”, and describes the --target SSH probe added in Sprint 1.

What `doctor` checks

A bare roksbnkctl doctor runs the general checks: tooling on PATH, kubeconfig location, the resolved workspace, and the IBM Cloud authentication chain. Sample output on a healthy machine post-Sprint 2 looks like this:

roksbnkctl doctor
✓  terraform         /usr/bin/terraform (Terraform v1.15.2)                                   (required for `roksbnkctl up`)
✓  helm              /usr/local/bin/helm (v3.20.2)                                            (required for `roksbnkctl up`; terraform `local-exec` shells out to helm)
⚠  iperf3            not on PATH                                                              (needed for `roksbnkctl test throughput`)
✓  kubectl           /usr/local/bin/kubectl (clientVersion:)                                  (internalised in roksbnkctl k *; passthrough still works if installed)
✓  oc                /usr/local/bin/oc (Client Version: 4.21.10)                              (internalised in roksbnkctl k *; passthrough still works if installed)
✓  ibmcloud          /usr/local/bin/ibmcloud (ibmcloud 2.43.0 ...)                            (optional; `roksbnkctl ibmcloud` passthrough)
✓  kubeconfig        /home/jgruber/.kube/config                                               (needed for cluster-side ops)
✓  workspace         default                                                                  (per-environment config + state)
✓  ibmcloud api key  resolved                                                                 (auth for terraform + IBM SDK calls)
✓  ibm cloud auth    OK (account: 1a2b3c..., user: you@example.com)                           (verifies API key works against IBM IAM)

Each row has the same shape:

<status> <name> <detail> <why we care>

status is one of ✓ (green / OK), ⚠ (yellow / warning), or ✗ (red / error). Skipped checks render as ⚠.
name is the dependency or capability being checked.
detail is the resolved value — usually a path, a version line, or an error message.
why we care is a parenthetical clause naming the roksbnkctl feature that depends on this row.
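Because every row starts with its status marker, the report is easy to summarise in a script. This sketch assumes the markers sit at the very start of each line and that piped output carries no colour escape codes; doctor_summary is a hypothetical helper.

```shell
# Hypothetical helper: count doctor rows by status marker.
# Reads `roksbnkctl doctor` output on stdin.
doctor_summary() {
  awk '
    /^✓/ { ok++ }
    /^⚠/ { warn++ }
    /^✗/ { err++ }
    END  { printf "ok=%d warn=%d error=%d\n", ok, warn, err }
  '
}

# Usage: roksbnkctl doctor | doctor_summary
```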

The BackendName field on the underlying Check struct (internal/doctor/check.go) is reserved for the per-backend probes that land in Sprint 4 (PRD 03). Until then it stays empty for every general check.

Each check explained

`terraform` — required

One of two hard-required binaries for the roksbnkctl up happy path. roksbnkctl shells out to terraform via terraform-exec for plan/apply/destroy; without it nothing in the cluster lifecycle works.

Pass condition: a binary on PATH, version 1.5 or newer.

Failure mode: not on PATH. Fix: install Terraform from terraform.io, or your distro’s package manager, then re-run doctor.

`helm` — required

The second hard-required binary, added in v1.0.2. The bundled terraform modules (cert_manager, flo, cne_instance) use null_resource + local-exec provisioners that shell out to helm upgrade --install from inside terraform’s apply phase. Without helm on PATH, the apply fails partway through the cluster lifecycle with:

Error: local-exec provisioner error
Error running command 'helm upgrade --install cert-manager ...':
exit status 127. Output: /bin/sh: 1: helm: not found

Pass condition: a helm (v3.x) binary on PATH. Doctor parses helm version --short for the version detail.

Failure mode: not on PATH. Fix: install Helm 3 from helm.sh/docs/intro/install/, or via your distro’s package manager:

# Linux (Ubuntu/Debian — official Helm apt repo):
curl https://baltocdn.com/helm/signing.asc | sudo gpg --dearmor -o /usr/share/keyrings/helm.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt-get update && sudo apt-get install -y helm

# macOS:
brew install helm

# Windows:
choco install kubernetes-helm

A v1.x effort to refactor the cert_manager / flo / cne_instance modules onto the helm_release terraform resource type (which uses the hashicorp/helm provider’s embedded Helm 3 runtime) would eliminate this host requirement. Tracked in docs/PLAN.md §“What’s deliberately deferred to post-v1.0”.

`iperf3` — informational

Used only by roksbnkctl test throughput in its host-iperf3 modes. After Sprint 4 lands the k8s execution backend (PRD 03), iperf3 moves entirely in-cluster and this row goes away for the everyday user.

Failure mode: not on PATH. Fix: install iperf3 if you plan to use the throughput test today; otherwise ignore.

`kubectl` — informational (Sprint 2 change)

Before Sprint 2: kubectl was an optional warning when missing — useful for the roksbnkctl kubectl passthrough.

After Sprint 2: kubectl is informational. The everyday verbs (get, apply, describe, delete, logs, exec, port-forward) are now native Go via client-go and live under roksbnkctl k. Missing host kubectl no longer disables the happy path; it only disables the roksbnkctl kubectl <args...> passthrough.

If kubectl is on PATH, the row is still ✓ and shows the version line. If it’s missing, the row is informational, not a warning, and the detail explains where the equivalent functionality lives.

`oc` — informational (Sprint 2 change)

Same story as kubectl — Sprint 2 internalises the OpenShift-relevant verbs (Phase 2.1 adds Project, Route, ImageStream to roksbnkctl k get). Host oc is preserved as an escape hatch; missing oc no longer warns.

`ibmcloud` — optional

Required only for the roksbnkctl ibmcloud <args...> passthrough. The cluster-lifecycle path uses IBM Go SDKs internally — roksbnkctl up does not shell out to ibmcloud — so you can skip this binary if you don’t need the passthrough.

`kubeconfig`

Resolves the kubeconfig path via $KUBECONFIG first, then ~/.kube/config. Cluster-side commands (status, logs, every k <verb>) need it.

roksbnkctl up writes the admin kubeconfig at ~/.kube/config (mode 0600) on a fresh apply. If you already have a multi-cluster ~/.kube/config, point $KUBECONFIG at the workspace’s state directory instead:

export KUBECONFIG=~/.roksbnkctl/<workspace>/state/kubeconfig

Failure mode: $KUBECONFIG and ~/.kube/config both missing. Fix: run roksbnkctl kubeconfig --download to fetch the admin kubeconfig from IBM Cloud.
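The lookup order can be mirrored in a wrapper script. This is a sketch under two assumptions that the text above does not confirm: $KUBECONFIG holds a single path (kubectl also accepts a colon-separated list), and an unusable $KUBECONFIG falls through to ~/.kube/config. resolve_kubeconfig is a hypothetical helper.

```shell
# Sketch of the documented lookup order: $KUBECONFIG first, then ~/.kube/config.
# Assumes a single path in $KUBECONFIG; a colon-separated list is not handled.
resolve_kubeconfig() {
  if [ -n "${KUBECONFIG:-}" ] && [ -f "$KUBECONFIG" ]; then
    echo "$KUBECONFIG"
  elif [ -f "$HOME/.kube/config" ]; then
    echo "$HOME/.kube/config"
  else
    echo "\$KUBECONFIG and ~/.kube/config both missing" >&2
    return 1
  fi
}

# Usage:
#   cfg="$(resolve_kubeconfig)" || roksbnkctl kubeconfig --download
```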

`workspace`

Reports the resolved workspace name and whether its config.yaml exists.

✓ default — the current workspace pointer at ~/.roksbnkctl/config.yaml resolves and the named workspace has a populated config.yaml.
⚠ "default" not initialised — the directory may exist (created by roksbnkctl ws new) but config.yaml is empty. Run roksbnkctl init to populate.
✗ no config context — the global config can’t be loaded at all.

The one-off -w / --workspace flag overrides which workspace doctor reports against. See Chapter 6 — Workspaces.

`ibmcloud api key`

Resolves the API key via the chain documented in Chapter 14 — Credentials: env var → OS keychain → workspace config (base64) → TTY prompt.

Pass condition: the chain produces a non-empty key. The key value is never printed — only the source (“resolved”).

Failure mode: IBMCLOUD_API_KEY unset and no keychain entry for workspace "<name>". Fix: either export IBMCLOUD_API_KEY=... for the session, or re-run roksbnkctl init and accept the keychain-save prompt.

`ibm cloud auth`

Round-trips the resolved key against IBM IAM via the SDK (Verify() call). Confirms the key is not just present but actually authenticates.

Pass condition: IAM accepts the key; the row reports the resolved account and user identity.

Failure modes:

BXNIM0415E: Provided API key could not be found — the key is malformed or has been deleted in IBM Cloud.
network is unreachable / i/o timeout — your workstation can’t reach iam.cloud.ibm.com. Common in customer-firewall scenarios; route through a jumphost (Chapter 16) to confirm the key works from inside the customer network.

Common failures and how to fix them

This is the section readers most often land on. Each row maps a real-world symptom to its fix:

Symptom	Likely cause	Fix
`terraform not on PATH`	not installed	install Terraform `>= 1.5`; re-run `doctor`
`kubeconfig: $KUBECONFIG and ~/.kube/config both missing`	never ran `up` against this workspace	`roksbnkctl kubeconfig --download` or run `roksbnkctl up`
`ibmcloud api key: ... no keychain entry`	new shell, key not exported	`export IBMCLOUD_API_KEY=...` or re-run `roksbnkctl init`
`ibm cloud auth: BXNIM0415E`	bad / rotated key	regenerate the key in the IBM Cloud console; update the keychain via `roksbnkctl init`
`ibm cloud auth: i/o timeout`	corp-firewalled workstation	use `--on jumphost` to test from inside the customer network
`workspace "foo" not initialised`	`ws new` was run but `init` was not	run `roksbnkctl init -w foo`
`workspace: no config context`	`~/.roksbnkctl/config.yaml` corrupt	inspect the file; worst case delete it and re-run `init`

If a fix isn’t here, Chapter 26 — Troubleshooting covers the longer tail.

The `--target <name>` SSH check (Sprint 1)

Sprint 1’s --on jumphost flag introduced an optional second mode for doctor: probe an SSH target before you try to use it.

roksbnkctl doctor --target jumphost

This adds one row per resolved target:

✓  ssh:jumphost      ubuntu@169.45.91.177:22 (TOFU recorded)            (verifies the target is reachable)

The probe:

Resolves the target’s host, user, port, and key source from ~/.roksbnkctl/<workspace>/config.yaml.
Connects via the internal/remote SSH client.
Validates the host key against ~/.roksbnkctl/known_hosts (TOFU prompt on first contact, unless --insecure-host-key).
Runs a no-op command (true) to confirm the channel works end-to-end.

Failure modes specific to the SSH probe:

host key mismatch — the target was rebuilt; edit ~/.roksbnkctl/known_hosts to clear the entry, then re-probe.
unable to authenticate — the key source resolved but the remote rejected it. Check key_path / key_source in workspace config; if key_source: agent, verify ssh-add -l shows the right key.
dial tcp: i/o timeout — the host:port is unreachable. Verify with nc -vz <host> 22 from a known-good network.

Pass --target all to probe every target listed in the workspace’s targets: block. Useful in CI when you want a single command that asserts every entry is reachable.

Reading the exit code

doctor exits with:

0 — all checks are green or warnings only. Warnings do not fail doctor. The everyday workflow can proceed.
non-zero — at least one row produced an ✗ error. The first error string is also written to stderr so wrapper scripts can grep it.

This is the contract scripts/e2e-test.sh and the Makefile rely on: a script that runs roksbnkctl doctor && roksbnkctl up --auto will only proceed past doctor if the environment is genuinely ready.

The “warnings don’t fail” rule is deliberate. After Sprint 2, an iperf3 not on PATH warning is informational — the everyday up / test connectivity flow doesn’t need it. Forcing exit-1 on every warning would be too aggressive for the common case.

If you want to gate scripts strictly (e.g. CI workflows that must have iperf3 installed because they run the throughput suite), parse the output rather than relying on the exit code:

if ! roksbnkctl doctor | grep -q '^✓  iperf3'; then
  echo "iperf3 missing — install it before running test throughput" >&2
  exit 1
fi

What `doctor` is not

A few deliberate non-features worth naming:

Not a fix-it tool. doctor reports; it never installs, never modifies workspace config, never calls IBM Cloud APIs that mutate state. The IAM verify call is read-only. If doctor could break things, users couldn’t run it freely — and “run doctor” needs to be a safe first move.
Not a backend probe. Per-backend availability checks (docker daemon reachable, k8s ops pod healthy, ssh target reachable) ship as separate BackendName-tagged rows via doctor --backend <name> (PRD 03). The --target probe was the early prototype of that pattern.
Not concurrent-safe. The CLI invokes doctor once per command; the side-channel for “why we care” blurbs in internal/doctor/doctor.go doesn’t synchronise. Don’t call doctor concurrently from within a single process.

Cross-references

Chapter 4 — Installation introduces doctor as the post-install verification step.
Chapter 6 — Workspaces explains the workspace row and the -w override.
Chapter 14 — Credentials is the deep dive on the ibmcloud api key resolution chain.
Chapter 16 — The --on flag covers the --target probe’s underlying SSH client.
Chapter 24 — Day-2 ops is the canonical reference for the internalised k <verb> commands that make kubectl / oc informational.

Workspaces

A workspace is a per-environment bundle of config + state. The shape is modelled on kubectl contexts: you can have many of them, exactly one is “current” at a time, and a -w flag lets you address a specific one for a single command without flipping the pointer.

This chapter covers the on-disk layout, the everyday init / use / list flow, the full roksbnkctl workspaces command tree, the -w / --workspace override, and the “parking-lot” pattern the end-to-end test uses to delete the workspace it’s currently inside.

The on-disk layout

Every workspace lives under ~/.roksbnkctl/<name>/:

~/.roksbnkctl/
  config.yaml                          # global; current_workspace pointer
  known_hosts                          # SSH host keys (shared across workspaces)
  default/                             # workspace "default"
    config.yaml                        # this workspace's inputs
    cluster-outputs.json               # post-apply cluster identity (when present)
    state/                             # BNK trial state
      terraform.tfstate
      terraform.tfvars
      kubeconfig                       # admin kubeconfig (mode 0600)
      tf-source/                       # bundled HCL extracted to disk
      scratch/                         # docker bind-mounts, helm caches
    state-cluster/                     # cluster-phase state (separate tree)
      terraform.tfstate
      cluster-phase-override.tfvars
  prod/                                # workspace "prod"
    config.yaml
    state/
    ...

Three things are worth calling out:

~/.roksbnkctl/config.yaml is global — non-secret user-wide preferences plus the current_workspace pointer. It is not a workspace config; the per-workspace files live one level deeper.
state/ and state-cluster/ are intentionally separate so roksbnkctl cluster up and roksbnkctl up don’t tangle their Terraform state. Most users won’t touch either directly.
cluster-outputs.json is the persisted identity of the workspace’s ROKS cluster — written by cluster up or cluster register, read by roksbnkctl up so BNK trials don’t have to re-state cluster identity in every tfvars.

Override the base directory with the ROKSBNKCTL_HOME env var. Test fixtures use this; everyday users shouldn’t need it.
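Scripts that need a workspace path can honour the same override. This sketch assumes ROKSBNKCTL_HOME replaces ~/.roksbnkctl itself rather than the home directory above it; workspace_dir is a hypothetical helper.

```shell
# Hypothetical helper: print the directory for workspace $1 (default "default").
# Assumes ROKSBNKCTL_HOME stands in for ~/.roksbnkctl.
workspace_dir() {
  local base="${ROKSBNKCTL_HOME:-$HOME/.roksbnkctl}"
  echo "$base/${1:-default}"
}

# Usage:
#   export KUBECONFIG="$(workspace_dir prod)/state/kubeconfig"
```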

`terraform.applied.tfvars` — what’s deployed right now

v1.4.0 adds a per-phase snapshot of the effective Terraform var-file inputs that produced the workspace’s current state. After every successful terraform apply — roksbnkctl cluster up, roksbnkctl bnk up, or the legacy single-shape roksbnkctl up — roksbnkctl writes a canonical-HCL summary of “what var-files said” to the phase’s state directory. Re-create / audit / handoff workflows that previously needed config.yaml (or memory) now have a file-on-disk record of the inputs.

Where it lives

Workspace shape	Phase	Path
`ShapeSplit` / `ShapeClusterOnly`	Cluster phase	`~/.roksbnkctl/<workspace>/state-cluster/terraform.applied.tfvars`
`ShapeSplit`	Trial phase	`~/.roksbnkctl/<workspace>/state/terraform.applied.tfvars`
`ShapeLegacySingle`	both phases (collapsed)	`~/.roksbnkctl/<workspace>/state/terraform.applied.tfvars`

On ShapeLegacySingle, the file is a union of all sources (since the legacy shape doesn’t separate cluster and trial state) and the header comment records phase=legacy-single so the reader doesn’t mistake it for either a cluster-only or trial-only snapshot. See PRD 07 §“Design” for the format spec.

What it captures

A canonical HCL var-file: one assignment per line, variables sorted alphabetically within each source section. Each section is preceded by a comment line documenting which source contributed the values:

# === from config.yaml === — vars derived from the workspace’s config.yaml (written to terraform.tfvars on disk).
# === from terraform.tfvars.user === — the workspace-local user override file. If the file doesn’t exist, the section header is # === from terraform.tfvars.user (missing) === and the body is empty.
# === from cluster-phase override === — state-cluster/cluster-phase-override.tfvars (cluster-phase snapshots only).

Source-attribution comments matter because the same variable can appear in multiple sources; the “winner” — the value Terraform actually used — is the last section to mention it. The comments let the reader trace why a particular value ended up live.
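The last-section-wins rule is simple enough to replay with awk when you only want the effective values. This is a sketch that assumes one single-line assignment per line, as the canonical format describes; effective_vars is a hypothetical helper, and trailing inline comments are kept as part of the value.

```shell
# Hypothetical helper: read a snapshot on stdin and print one "name = value"
# line per variable, keeping the last assignment seen for each name.
effective_vars() {
  awk '
    /^[[:space:]]*#/ || /^[[:space:]]*$/ { next }   # skip comments and blank lines
    {
      eq = index($0, "=")
      if (eq == 0) next
      name = substr($0, 1, eq - 1)
      gsub(/[[:space:]]/, "", name)
      value = substr($0, eq + 1)
      sub(/^[[:space:]]+/, "", value)
      if (!(name in last)) order[++n] = name        # remember first-seen order
      last[name] = value                            # later sections overwrite
    }
    END { for (i = 1; i <= n; i++) print order[i] " = " last[order[i]] }
  '
}

# Usage:
#   effective_vars < ~/.roksbnkctl/<workspace>/state/terraform.applied.tfvars
```

If a variable such as worker_count appeared under both the config.yaml section and the terraform.tfvars.user section, only the later value would be printed.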

Lifecycle

Written after every successful terraform apply. Plan flows don’t write the snapshot — the name terraform.applied.tfvars would mislead if a plan-time write existed.
Overwritten each apply. If you want history, copy the file aside before re-running up or wire restic / a git commit hook against ~/.roksbnkctl/<workspace>/.
Untouched by destroy. cluster down / bnk down leave the prior up’s snapshot in place; that’s what was last deployed. The file’s mtime + the absence of Terraform state is the “torn down on <date>” signal.
Never read by roksbnkctl itself. The snapshot is an output for the user — never an input the tool depends on. Making it an input would create a feedback loop where redacted values get written back as the literal string <redacted>.
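Since each apply overwrites the snapshot, keeping history is left to you. One low-tech option is to copy the file aside before re-running up. In the sketch below, archive_snapshot is a hypothetical helper and the archive directory is your choice.

```shell
# Hypothetical helper: copy a snapshot aside under a UTC-timestamped name.
# $1 = path to terraform.applied.tfvars, $2 = archive directory.
archive_snapshot() {
  local src="$1" dest_dir="$2" stamp
  [ -f "$src" ] || { echo "no snapshot at $src" >&2; return 1; }
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  mkdir -p "$dest_dir"
  cp -p "$src" "$dest_dir/terraform.applied.tfvars.$stamp"
  # Keep the copy as private as the original (mode 0600).
  chmod 600 "$dest_dir/terraform.applied.tfvars.$stamp"
}

# Usage:
#   archive_snapshot ~/.roksbnkctl/prod/state/terraform.applied.tfvars ~/bnk-history/prod
```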

Redaction

Exactly one variable is redacted: ibmcloud_api_key. It’s the only var whose value comes from the cred resolver rather than being authored by the user in config.yaml or a tfvars file — so it’s the only value the snapshot would expose that the user didn’t put there themselves. See PRD 04 §“Cred tmpfile-bind-mount pattern” for why the API key isn’t in tfvars in the first place. The redacted line carries an inline comment:

ibmcloud_api_key = "<redacted>"  # source: cred resolver, not persisted

For team-handoff scenarios (a teammate receives this file out-of-band and wants to re-create the workspace): replace the <redacted> value with the teammate’s own API key, or simply remove the ibmcloud_api_key line so the cred resolver supplies it from the teammate’s own environment (keychain, shell env, ~/.bluemix/api_key, etc.) at apply time. Every other line round-trips verbatim.
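Removing the redacted line is a one-line filter. This sketch assumes the assignment sits on a single line, as in the redacted example above; strip_api_key is a hypothetical helper.

```shell
# Hypothetical helper: drop the ibmcloud_api_key assignment from a snapshot
# read on stdin, leaving every other line untouched.
strip_api_key() {
  grep -v '^[[:space:]]*ibmcloud_api_key[[:space:]]*='
}

# Usage:
#   strip_api_key < terraform.applied.tfvars > handoff.tfvars
```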

The file mode is 0600 regardless. The non-redacted contents (workspace identifiers, region, resource group, cluster name, tunable values) aren’t credential-grade secrets, but aren’t world-readable-grade either. Tight permissions are the cheap default.

What it’s not

Not an input to subsequent applies. The -var-file chain on the next apply is unchanged: config.yaml-derived → terraform.tfvars.user → phase overrides.
Not a record of Terraform defaults. If variable "foo" { default = "bar" } and the user never set foo, the snapshot omits foo entirely. Capturing defaults would require parsing the modules’ variable blocks, which is a separate concern.
Not a state-derived value capture. Computed expressions, resource references, locals, and data-source values aren’t var-file inputs and don’t appear. terraform console against the live state dir is the right tool for those.
Not a TF_VAR_* env capture. roksbnkctl doesn’t set TF_VAR_* today — everything goes via -var-file — so the snapshot covers the complete input surface. A future cycle that starts using TF_VAR_* will need to extend this file.

Safe-to-commit guidance

The file is suitable for git commit alongside config.yaml after the user verifies the redaction matches their threat model. The standard reminder applies: the workspace dir may contain other semi-sensitive material — cluster-outputs.json records the cluster’s crn and admin identity hints; the state/ and state-cluster/ trees include terraform.tfstate (which contains resource IDs, IAM bindings, and any value Terraform’s provider exposed); the kubeconfig files are mode 0600 for a reason. Review the whole workspace dir with the same lens before committing.

roksbnkctl does not touch .gitignore. If you commit the workspace, you commit the workspace; if you don’t, you don’t. The tool stays out of that decision.

Worked example

For a ShapeSplit cluster phase apply, ~/.roksbnkctl/canada-roks/state-cluster/terraform.applied.tfvars looks like:

# Generated by roksbnkctl v1.4.0 at 2026-05-14T10:23:17Z after terraform apply on phase=cluster.
# Re-generated each apply. Do not edit by hand — your changes will be overwritten.

# === from config.yaml ===
cluster_name = "canada-roks"
ibmcloud_api_key = "<redacted>"  # source: cred resolver, not persisted
region = "ca-tor"
resource_group_name = "default"

# === from terraform.tfvars.user ===
worker_count = 4

# === from cluster-phase override ===
deploy_bnk = false

Re-applying from this snapshot alone reconstructs the inputs the user wrote; embedded Terraform module defaults are not captured (see §“What it’s not” above for the full list of what’s out of scope).

The header records the binary version and apply timestamp so the reader can correlate the snapshot to a specific roksbnkctl invocation. Alphabetic ordering within each section means re-running apply with identical inputs produces an identical body; only the timestamp in the header line changes. That makes snapshots easy to diff across applies.
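When comparing two snapshots, it helps to skip the generated header line, since it carries a per-apply timestamp. This sketch assumes the header starts with “# Generated by ” as in the worked example; snapshot_diff is a hypothetical helper.

```shell
# Hypothetical helper: diff two snapshots while ignoring the generated
# header line. Exit status follows diff: 0 when the bodies match.
snapshot_diff() {
  local a b rc
  a="$(mktemp)" || return 2
  b="$(mktemp)" || { rm -f "$a"; return 2; }
  grep -v '^# Generated by ' "$1" > "$a" || true
  grep -v '^# Generated by ' "$2" > "$b" || true
  diff "$a" "$b"
  rc=$?
  rm -f "$a" "$b"
  return "$rc"
}

# Usage:
#   snapshot_diff old/terraform.applied.tfvars new/terraform.applied.tfvars \
#     && echo "inputs unchanged"
```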

The everyday workspace routine

The minimum daily routine:

# Initialise (creates ~/.roksbnkctl/<name>/config.yaml; defaults to "default")
roksbnkctl init

# Switch which workspace is "current"
roksbnkctl ws use prod

# See all workspaces and which one is current
roksbnkctl ws list

roksbnkctl init -w <name> is the one-shot path that creates the directory and populates config.yaml interactively. Everything else (ws new, ws use, ws delete) is the deconstructed form for users who want finer-grained control.

The full command tree

roksbnkctl workspaces ...     # canonical name
roksbnkctl ws ...              # alias

`ws new <name>` — empty skeleton

Creates ~/.roksbnkctl/<name>/ with no config.yaml. Useful when you want the directory to exist (so ws use works) before you run init.

roksbnkctl ws new staging
# ✓ Created workspace "staging" (run `roksbnkctl init -w staging` to configure)

Most users skip this and use roksbnkctl init -w staging directly, which does both steps in one go.

`ws use <name>` — switch current

Sets the current_workspace pointer in ~/.roksbnkctl/config.yaml:

roksbnkctl ws use prod
# ✓ Current workspace: prod

roksbnkctl ws current
# prod

Refuses to point at a non-existent workspace. The pointer is the only thing that changes — workspace state stays put.

`ws current` — print the pointer

roksbnkctl ws current
# default

Prints the current workspace name on stdout. If no pointer is set, prints a hint like “no current workspace; run roksbnkctl ws use <name> or roksbnkctl init” to stderr and exits 0 with empty stdout — so WS=$(roksbnkctl ws current) produces an empty string in scripts rather than spurious output.
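In a script, that contract means an empty string is the signal to check for, not the exit code. A minimal guard, assuming only the behaviour described above (current_workspace_or_fail is a hypothetical helper name):

```shell
# Print the current workspace name, or fail when no pointer is set.
# `ws current` exits 0 either way, so the test is on empty stdout.
current_workspace_or_fail() {
  ws=$(roksbnkctl ws current 2>/dev/null)
  if [ -z "$ws" ]; then
    echo "no current workspace; pass -w <name> or run 'roksbnkctl ws use <name>'" >&2
    return 1
  fi
  printf '%s\n' "$ws"
}

# Usage (illustrative):
#   WS=$(current_workspace_or_fail) || exit 1
```

The hint that roksbnkctl writes to stderr is discarded here because the helper prints its own message; drop the 2>/dev/null if you would rather see the original.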

`ws list` — table view

roksbnkctl ws list
NAME      CURRENT  REGION    CLUSTER          TF SOURCE
default   *        us-south  bnk-quickstart   embedded@v1.0.0
prod               eu-de     bnk-prod         embedded@v1.0.0
staging            us-south  bnk-staging      local:./terraform

The * marker on CURRENT highlights the active workspace. Other columns reflect each workspace’s config.yaml. Rows where config.yaml is missing or unparseable still show the name, with the other columns blank — the list never errors out because of one corrupt workspace.

`ws delete <name> [--force]`

Removes the workspace directory and the OS-keychain entry for its API key. Two safety rails:

Refuses to delete the current workspace. You’d be left with a dangling current_workspace pointer, so delete errors out with: cannot delete current workspace "foo"; switch first: roksbnkctl ws use <other>.
Refuses if Terraform state lists provisioned resources (unless --force). Catches the foot-gun where you forget to run roksbnkctl down first.

roksbnkctl ws delete staging
# Delete workspace "staging"? [y/N]: y
# ✓ Deleted workspace "staging"

# Refused — state still has resources
roksbnkctl ws delete prod
# Error: terraform state lists 77 resources; run `roksbnkctl down` first or pass --force

# I really mean it
roksbnkctl ws delete prod --force
# ✓ Deleted workspace "prod"

--force skips both the prompt and the state-non-empty check. Use it sparingly — there’s no “undo” for rm -rf ~/.roksbnkctl/<name>/.

The current-workspace pointer

The pointer lives at ~/.roksbnkctl/config.yaml:

current_workspace: prod

Every command that doesn’t pass -w reads this pointer. roksbnkctl init writes it on first run (so the very first init makes default current automatically). ws use rewrites it. Nothing else touches it.

If the pointer references a workspace that doesn’t exist (e.g. someone rm -rf’d the directory by hand), roksbnkctl errors out with a clear message: workspace "prod" referenced by current_workspace does not exist; run roksbnkctl ws use <other>.

`-w` / `--workspace` for one-off overrides

Every command accepts -w <name> to override the current pointer for a single invocation:

# Doctor against "prod" without flipping the global pointer
roksbnkctl -w prod doctor

# Run init for a new workspace called "staging"
roksbnkctl init -w staging

# Get pods from the "default" cluster while currently on "prod"
roksbnkctl -w default k get pods -A

Use this when:

You’re scripting against multiple workspaces in a single run (CI runner that exercises default + e2e-cleanup back-to-back).
You want to run a one-off command against a different environment without losing your current context.
You’re testing a fresh workspace before promoting it to current.

The flag only affects the running command — the pointer in ~/.roksbnkctl/config.yaml is unchanged. After the command exits, the next bare roksbnkctl reads the original pointer.
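The lookup order described in the last two sections can be sketched as a small shell function. This is an illustration under stated assumptions, not the tool's resolver: resolve_workspace is a hypothetical name, the pointer file is assumed to hold a plain unquoted current_workspace: <name> line, and the existence check on a -w value is an assumption (init -w, for one, creates the workspace instead).

```shell
# Hypothetical sketch of the lookup order: -w wins, otherwise the pointer file.
# $1 = value passed to -w (may be empty), $2 = root dir (normally ~/.roksbnkctl).
resolve_workspace() {
  flag=$1
  root=$2
  if [ -n "$flag" ]; then
    name=$flag
  else
    # Read "current_workspace: <name>" from the pointer file, if it exists.
    name=$(sed -n 's/^current_workspace:[[:space:]]*//p' "$root/config.yaml" 2>/dev/null)
  fi
  if [ -z "$name" ]; then
    echo "no current workspace" >&2
    return 1
  fi
  if [ ! -d "$root/$name" ]; then
    echo "workspace \"$name\" does not exist" >&2
    return 1
  fi
  printf '%s\n' "$name"
}

# Usage (illustrative):
#   resolve_workspace "" "$HOME/.roksbnkctl"      # follows the pointer
#   resolve_workspace prod "$HOME/.roksbnkctl"    # one-off override
```

Nothing in the function writes to the pointer file, which mirrors the point above: an override is read-only with respect to current_workspace.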

The parking-lot pattern

A subtle gotcha: ws delete refuses to remove the current workspace, but the end-to-end test suite needs to clean itself up after running against the default workspace.

The fix is the parking-lot pattern: have a throwaway workspace that exists only to be the “current” pointer while you delete other workspaces.

# End-to-end test cleanup (e2e-test.sh: Phase D destroys; Phase H runs the parking-lot dance below)

# Run the destroy against "default" (still current at this point)
roksbnkctl down --auto

# Park the pointer somewhere harmless
roksbnkctl ws new e2e-cleanup
roksbnkctl ws use e2e-cleanup

# Now we can drop the original workspace — it's no longer current
roksbnkctl ws delete default --force

# Optional: remove the parking lot too, by parking somewhere else first
roksbnkctl ws new tmp-park
roksbnkctl ws use tmp-park
roksbnkctl ws delete e2e-cleanup --force
roksbnkctl ws delete tmp-park --force   # leaves no current pointer

The pattern works because current_workspace only matters for commands that read workspace config. Once the pointer points elsewhere, the original workspace is just a directory and delete is happy to remove it.

If you want to delete every workspace including the parking lot, the last delete will leave you with an empty current_workspace. The next roksbnkctl init will populate it again with default.

Using a workspace’s environment in your shell

roksbnkctl shell drops you into a subshell with KUBECONFIG, IBMCLOUD_API_KEY, IC_API_KEY, and IBMCLOUD_REGION pre-loaded from the current workspace:

roksbnkctl shell
# (now in a subshell)
echo $KUBECONFIG
# /home/you/.roksbnkctl/default/state/kubeconfig
exit
# (back to the parent shell)

Same for -w:

roksbnkctl -w prod shell

This is useful when you want to run the host kubectl, the host oc, or arbitrary tools with the workspace context loaded. The internalised verbs (roksbnkctl k get, etc.) read the same context automatically, so you don’t need to be in a subshell to use them.

Common workspace patterns

A handful of patterns that come up in practice:

Use case	Pattern
Different IBM Cloud accounts	`default` for personal, `acct-foo` for an account-specific key
Different regions	`us-south`, `eu-de` workspaces with distinct `cluster.name` values
Throwaway short-lived clusters	`bnk-trial-N` workspaces; delete with `--force` after `down`
CI vs local dev	`dev` and `ci` workspaces; `ci` uses `IBMCLOUD_API_KEY` from env, `dev` uses keychain
Parking-lot cleanup	`e2e-cleanup` workspace per “the parking-lot pattern” above

Workspaces are cheap. If a flow benefits from isolation, make a new one rather than fighting with --var-file overrides on the existing one.

Forward-link to Chapter 12

This chapter covers the workspace as a unit: how to create, switch, list, and delete workspaces. The schema of the per-workspace config.yaml itself (every field, default, and valid range) is covered in Chapter 12 — Workspace config.

Quick start: from API key to deployed BNK

This chapter walks the 4-command lifecycle (init → up → test → down) end-to-end. By the time you reach the bottom you’ll have a deployed BNK trial on a fresh ROKS cluster, a passing connectivity test, and a clean tear-down command ready when you’re done.

The lifecycle, at a glance:

sequenceDiagram
    autonumber
    actor User
    participant CLI as roksbnkctl
    participant TF as terraform-exec (embedded HCL)
    participant IBM as IBM Cloud API
    participant K8s as ROKS cluster + BNK
    User->>CLI: roksbnkctl init
    CLI->>CLI: write workspace config.yaml
    CLI-->>User: workspace ready
    User->>CLI: roksbnkctl up --auto
    CLI->>TF: terraform plan + apply
    TF->>IBM: provision cluster, VPC, jumphost
    IBM-->>TF: 77 resources created
    TF->>K8s: helm install cert-manager + flo + BNK
    K8s-->>CLI: cluster + BNK up
    CLI-->>User: kubeconfig saved, jumphost target auto-populated
    User->>CLI: roksbnkctl test
    CLI->>K8s: connectivity + DNS + (optional) throughput
    K8s-->>CLI: pass/fail per check
    CLI-->>User: green
    User->>CLI: roksbnkctl down --auto
    CLI->>TF: terraform destroy
    TF->>IBM: 77 resources destroyed
    CLI-->>User: workspace state retained

The walkthrough assumes:

You have a roksbnkctl binary on PATH (Chapter 4).
You have an IBM Cloud API key for an account with permission to create ROKS clusters.
terraform >= 1.5 is on PATH and roksbnkctl doctor looks healthy (Chapter 5).

If roksbnkctl doctor is not green for terraform, or IBMCLOUD_API_KEY does not resolve, fix those first; nothing below will work otherwise.
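A wrapper script or CI job can run a cheap version of the same check before calling init. The sketch below is a stand-in for roksbnkctl doctor under stated assumptions, not a replacement: version_ge and preflight are hypothetical helpers, the comparison only looks at major.minor, and it insists on the environment variable even though the keychain or a prompt can also supply the key.

```shell
# version_ge A B: succeed when dotted version A is at least B (major.minor only).
version_ge() {
  a_major=${1%%.*}; a_rest=${1#*.}; a_minor=${a_rest%%.*}
  b_major=${2%%.*}; b_rest=${2#*.}; b_minor=${b_rest%%.*}
  [ "$a_major" -gt "$b_major" ] ||
    { [ "$a_major" -eq "$b_major" ] && [ "$a_minor" -ge "$b_minor" ]; }
}

# preflight: check the API key variable and the terraform version on PATH.
preflight() {
  [ -n "${IBMCLOUD_API_KEY:-}" ] || { echo "IBMCLOUD_API_KEY is not set" >&2; return 1; }
  # `terraform version` prints e.g. "Terraform v1.5.7" on its first line.
  tf_version=$(terraform version 2>/dev/null | sed -n '1s/^Terraform v//p')
  [ -n "$tf_version" ] || { echo "terraform not found on PATH" >&2; return 1; }
  version_ge "$tf_version" 1.5 || { echo "terraform $tf_version is older than 1.5" >&2; return 1; }
}

# Usage (illustrative):
#   preflight && roksbnkctl init
```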

Note. The output blocks below are illustrative — version strings, cluster IDs, IPs, and timing all vary between runs. The shape of each step is what to look for.

Step 1 — set the API key

The cleanest way to make roksbnkctl see your API key is the IBMCLOUD_API_KEY environment variable. roksbnkctl init will offer to save it to your OS keychain afterwards, so you only paste it once.

export IBMCLOUD_API_KEY="ibmcloud-api-key-value-here"

If you’d rather not export it in your shell, roksbnkctl init will prompt for it on a TTY and offer the same keychain-save afterwards. See Chapter 14 for the full resolution chain.

Step 2 — `roksbnkctl init`

Initialises a workspace under ~/.roksbnkctl/default/ (or under <name>/ if you pass -w <name>). Verifies the API key against IBM IAM, resolves the resource group, and writes config.yaml.

roksbnkctl init

Sample interactive session:

roksbnkctl init
→ Verifying IBMCLOUD_API_KEY against IBM IAM ... ok (account: 1a2b3c..., user: you@example.com)
? Workspace name (default):
? Region (us-south):
? Resource group (Default):
→ Resolving resource group "Default" ... ok (id: ...)
? Cluster name (bnk-quickstart):
? OpenShift version (4.14_openshift):
? Worker zone (us-south-1):
? Worker count (2):
? Save IBMCLOUD_API_KEY to OS keychain for this workspace? (y/N): y
→ Saved to keychain (service: roksbnkctl, account: default/ibmcloud_api_key)
✓ Wrote ~/.roksbnkctl/default/config.yaml

What just happened:

A workspace called default now exists at ~/.roksbnkctl/default/.
config.yaml records the region, resource group, cluster name, OpenShift version, worker pool sizing, and BNK component defaults.
The API key is saved to your OS keychain (macOS Keychain, libsecret on Linux, or Windows Credential Manager) under service roksbnkctl. Subsequent runs resolve it from there without prompting.

You can re-run roksbnkctl init to update workspace settings; existing values become the prompt defaults.

Step 3 — `roksbnkctl up --auto`

The deployment. Runs terraform plan, runs terraform apply, fetches the admin kubeconfig from IBM Cloud, writes it to ~/.kube/config at mode 0600. The --auto flag skips the plan-and-confirm gate; without it up shows the plan and asks “apply? [y/N]” before continuing.

roksbnkctl up --auto

Sample output (heavily abridged — a real run is ~50 minutes and prints terraform’s full plan + apply log):

roksbnkctl up --auto
→ Resolving terraform source ... embedded (v1.0.0)
→ Extracting bundled HCL to ~/.roksbnkctl/default/state/tf-source/embedded-terraform/
→ Pre-creating kubeconfig + scratch directories
→ Rendering auto-tfvars from config.yaml ... ok
→ terraform init -reconfigure
  Initializing provider plugins... done.
→ terraform apply (auto-approved)
  module.roks_cluster.ibm_container_vpc_cluster.cluster: Creating...
  module.roks_cluster.ibm_container_vpc_cluster.cluster: Still creating... [10m elapsed]
  module.roks_cluster.ibm_container_vpc_cluster.cluster: Still creating... [20m elapsed]
  module.roks_cluster.ibm_container_vpc_cluster.cluster: Still creating... [30m elapsed]
  module.roks_cluster.ibm_container_vpc_cluster.cluster: Creation complete after 38m12s
  module.cert_manager.helm_release.cert_manager: Creation complete after 2m11s
  module.flo.helm_release.flo: Creation complete after 4m02s
  module.cne_instance.kubernetes_manifest.cne_instance: Creation complete after 1m42s
  module.license.helm_release.license: Creation complete after 2m18s
  module.testing.tls_private_key.jumphost_shared_key: Creation complete after 0s
  module.testing.ibm_is_instance.tgw_jumphost: Creation complete after 1m48s

  Apply complete! Resources: 77 added, 0 changed, 0 destroyed.

→ Fetching admin kubeconfig for cluster "<cluster-id>"
✓ Wrote /home/you/.kube/config (chmod 0600)
✓ Auto-registered target jumphost (169.45.91.177); use `roksbnkctl --on jumphost ...`

What just happened:

77 resources were created across ROKS, cert-manager, FLO, CNE Instance, BNK license, and a small testing footprint (the TGW jumphost).
An admin kubeconfig was fetched directly from IBM Cloud’s container service API (no ibmcloud ks cluster config shell-out) and written at mode 0600.
A jumphost target was auto-populated in your workspace config from terraform outputs. This makes Chapter 16’s --on jumphost flag work without any further configuration.

The actual elapsed time on a fresh run is dominated by ROKS cluster creation (~30-40 min) and cert-manager + FLO Helm install (~10 min). Re-runs are dramatically faster because terraform’s idempotence skips already-created resources.

Step 4 — `roksbnkctl status`

Quick sanity check: workspace pointer is right, cluster is reachable, BNK pods are healthy.

roksbnkctl status

Sample output:

Workspace: default
Region:    us-south
RG:        Default (id: ...)
Cluster:   bnk-quickstart (id: <cluster-id>) — Ready
TF source: embedded (v1.0.0)
Last apply: 2026-05-08T14:22:08Z
Nodes:     2/2 Ready
BNK pods:  flo (3/3), cis (1/1), cert-manager (3/3), cne-instance (1/1)

If anything is not green here, jump to Chapter 26 — Troubleshooting.

Step 5 — `roksbnkctl test`

Run the built-in validation suite. Bare test runs the connectivity + DNS checks (the throughput test takes a few minutes and is opt-in).

roksbnkctl test

Sample output:

roksbnkctl test
→ Suite: connectivity
  ✓ https://www.f5.com (200, 312ms)
  ✓ https://api.openshift.com (200, 88ms)
  ✓ https://us-south.containers.cloud.ibm.com (200, 142ms)
→ Suite: dns
  ✓ www.f5.com → 23.50.149.94 (A, 12ms)
  ✓ api.openshift.com → 35.190.27.231 (A, 18ms)

3 connectivity checks passed; 2 DNS checks passed; 0 failed.

For the throughput suite specifically:

roksbnkctl test throughput --mode east-west

Sample output:

→ Deploying iperf3 server pod into namespace "roksbnkctl-test"
✓ Pod ready (iperf3-server-...)
→ Exposing via ClusterIP service
✓ Service ready (cluster-ip: 172.21.45.108:5201)
→ Running iperf3 -c against the service from local
✓ throughput: 9.41 Gbits/sec (mean over 10s)
→ Tearing down iperf3 fixture
✓ pod and service deleted

The --mode east-west flag uses a ClusterIP service and runs the host iperf3 client through oc port-forward for in-cluster traffic; --mode north-south uses a LoadBalancer for outside-the-cluster traffic. See Chapter 22 for the full design.

The connectivity test uses Go’s built-in net/http — no external curl is shelled out — and similarly DNS uses Go’s net.Resolver. The --insecure flag on test connectivity skips TLS validation if you need to test against self-signed endpoints.

Step 6 — explore (optional)

A few useful follow-ups now that the cluster’s up:

# tail the F5 Lifecycle Operator logs
roksbnkctl logs flo -f

# drop into a shell with the workspace's KUBECONFIG + IBMCLOUD_API_KEY exported
roksbnkctl shell

# run a one-shot kubectl with the workspace context loaded
roksbnkctl kubectl get pods -A

# run ibmcloud through the auto-discovered jumphost
roksbnkctl ibmcloud --on jumphost ks cluster ls

The --on jumphost flag is covered in detail in Chapter 16. It lets you run any of the passthrough commands (exec, shell, kubectl, oc, ibmcloud) from inside the cluster’s network — useful when your workstation is behind a corporate firewall that can’t reach IBM Cloud directly.

Step 7 — `roksbnkctl down --auto`

Tear it all back down when you’re finished. The teardown is terraform destroy under the hood, with the same resilience to transient IBM API errors as up.

roksbnkctl down --auto

Sample output:

roksbnkctl down --auto
→ terraform destroy (auto-approved)
  module.testing.ibm_is_instance.tgw_jumphost: Destroying...
  ...
  module.roks_cluster.ibm_container_vpc_cluster.cluster: Destroying...
  module.roks_cluster.ibm_container_vpc_cluster.cluster: Still destroying... [5m elapsed]
  module.roks_cluster.ibm_container_vpc_cluster.cluster: Destruction complete after 8m16s

  Destroy complete! Resources: 77 destroyed.

✓ Workspace "default" state retained at ~/.roksbnkctl/default/
  (run `roksbnkctl ws delete default` to remove the workspace dir)

down retains the workspace dir and config so you can up again with the same settings. To remove the workspace entirely:

roksbnkctl ws delete default

The command refuses if Terraform state still lists resources (use --force to override). On success it also cleans up the keychain entry.

What you just did

With effectively three commands (init, up, test) you:

Provisioned a fresh ROKS cluster on IBM Cloud.
Installed cert-manager, F5 Lifecycle Operator, and a complete BNK trial on top of it.
Validated the deployment with HTTP connectivity + DNS resolution + (optionally) throughput tests.
Got an auto-discovered jumphost target ready for any --on jumphost follow-ups.

The same flow runs against multiple workspaces, multiple regions, and multiple resource groups — see Chapter 6 for the multi-environment patterns. From here, Chapter 16 covers the --on flag, Chapter 24 covers day-2 operations, and Chapter 26 covers what to do when one of the above steps doesn’t go right.

The cluster phase (cluster up/down)

A roksbnkctl workspace is two phases on top of each other: a durable cluster phase (the ROKS cluster + cluster-shared services that take 30+ minutes to provision) and a short-lived trial phase (the BNK trial that iterates on top in 5-10 minutes). The cluster phase is exposed as its own command pair, roksbnkctl cluster up / roksbnkctl cluster down, so the cluster survives across many BNK trial cycles.

As of v1.1.0, this two-phase shape is the default for every new workspace. A fresh roksbnkctl up provisions the cluster phase first, then the trial phase, against separate state directories. Tearing down only the trial — the common iteration case — uses roksbnkctl bnk down and leaves the cluster intact. The unscoped up / down verbs are now shape-aware composites that delegate to the right phase commands underneath.

Workspaces created against v1.0.x that have cluster modules and trial modules in the same terraform.tfstate (the legacy single-state shape) keep working — roksbnkctl up and down continue to operate against them in-place, byte-for-byte the way they did in v1.0. See § Legacy single-state workspaces at the bottom of the chapter to identify which shape a workspace is.

This chapter covers what each phase deploys, why the two state directories are separate, the deploy_bnk=false override that makes “cluster only” work, the cluster-outputs.json artefact written on success, a worked example, and the legacy single-state shape. The companion BNK-trial chapter, Chapter 10, covers roksbnkctl bnk up / bnk down for the trial layer.

What’s deployed where

The bundled HCL has roughly two halves. The cluster phase owns the durable, cluster-scoped resources:

The ROKS cluster itself (VPC + subnets + worker pool)
A transit gateway (so the test jumphost can reach cluster internals)
The registry COS (Cloud Object Storage) instance — used by the BNK trial as its FAR image / license / schematic store
cert-manager (Helm release into the cluster)
The TGW jumphost VM (an Ubuntu VSI in the same VPC, used by --on jumphost)

The trial phase owns the BNK-specific resources:

F5 Lifecycle Operator (flo) Helm release
cne_instance Kubernetes manifest
BNK license + admin certs
Various cluster-side bits: ServiceAccounts, RoleBindings, Secrets

Two-phase split: cluster up provisions the first list; bnk up (or the trial step of the composite roksbnkctl up) provisions the second.

┌─────────────────────────────────────────────────────────┐
│  cluster phase (durable, reused across many trials)     │
│    ROKS cluster + VPC + transit gateway                 │
│    registry COS instance                                │
│    cert-manager (Helm)                                  │
│    TGW jumphost                                         │
├─────────────────────────────────────────────────────────┤
│  trial phase (one trial — destroyed by `roksbnkctl down`)│
│    flo (F5 Lifecycle Operator)                          │
│    cne_instance                                         │
│    license / admin cert / SCC bindings                  │
└─────────────────────────────────────────────────────────┘

The split exists because ROKS clusters take 30-50 minutes to provision and cost roughly $0.30/hour to run. Re-creating the cluster every time you want to re-test a BNK trial is wasteful; reusing one cluster for many trials cuts iteration time from “an hour” to “a few minutes”.

The two state directories

To keep cluster state and trial state from tangling, roksbnkctl uses separate Terraform state directories:

~/.roksbnkctl/<workspace>/
  state/                   # BNK trial state — written by `roksbnkctl up/down`
    terraform.tfstate
    terraform.tfvars
  state-cluster/           # cluster phase state — written by `roksbnkctl cluster up/down`
    terraform.tfstate
    cluster-phase-override.tfvars

Each phase’s commands read and write only their own state directory. Both phases use the same Terraform source (the bundled HCL) but with different effective tfvars — the trick is the deploy_bnk flag.

The `deploy_bnk=false` override

The bundled HCL has a top-level deploy_bnk boolean. When true, the BNK trial modules (flo, cne_instance, license) run; when false, they’re skipped and Terraform only provisions the cluster-phase resources.

roksbnkctl cluster up and roksbnkctl cluster down force deploy_bnk = false by writing a small auto-generated tfvars override into the cluster state directory:

# ~/.roksbnkctl/<workspace>/state-cluster/cluster-phase-override.tfvars
# Generated by roksbnkctl. Do not edit by hand.
# Cluster-phase override: BNK trial modules (flo / cne_instance /
# license) are skipped. cert-manager and the testing jumphost still run
# — they're cluster-shared singletons that belong with the cluster.
deploy_bnk = false

This file is layered onto the var-file chain after user-supplied --var-file flags so the override always wins. The user’s terraform.tfvars and --var-file <path> arguments still apply for everything else (region, RG, cluster name, worker count, …) — only deploy_bnk is forced.

roksbnkctl up doesn’t write this override file; its tfvars chain leaves deploy_bnk at the upstream default (true), so the trial modules run.
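Terraform gives later -var-file arguments precedence over earlier ones, which is why position in the chain matters. A sketch of the ordering (cluster_var_file_args is a hypothetical helper, and the real chain may include further files such as the config.yaml-derived tfvars):

```shell
# Hypothetical sketch: print the -var-file arguments for a cluster-phase run,
# one per line, with the override last so that it takes precedence.
# $1 = cluster state dir; remaining args = user-supplied var files, in order.
cluster_var_file_args() {
  state_dir=$1
  shift
  for f in "$@"; do
    printf '%s\n' "-var-file=$f"
  done
  printf '%s\n' "-var-file=$state_dir/cluster-phase-override.tfvars"
}

# Usage (illustrative):
#   cluster_var_file_args ~/.roksbnkctl/default/state-cluster team.tfvars
```

Because the override file is appended after every user file, a deploy_bnk = true in a user tfvars file cannot switch the trial modules back on during a cluster-phase run.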

`cluster-outputs.json` — the cluster identity record

When roksbnkctl cluster up apply succeeds, it reads the relevant Terraform outputs (cluster name, ID, region, RG, VPC, registry COS) and writes them to a workspace-scoped JSON file:

~/.roksbnkctl/<workspace>/cluster-outputs.json

Sample contents:

{
  "cluster_name": "bnk-quickstart",
  "cluster_id": "cre6h4l20jjsg4kvt3a0",
  "region": "us-south",
  "resource_group_id": "abc123...",
  "vpc_id": "r006-...",
  "registry_cos_crn": "crn:v1:bluemix:public:cloud-object-storage:global:a/...",
  "registry_cos_name": "bnk-quickstart-cos-instance",
  "master_url": "https://c106.us-south.containers.cloud.ibm.com:31415",
  "openshift_version": "4.14_openshift",
  "source": "cluster-up",
  "recorded_at": "2026-05-08T14:22:08Z"
}

The source field discriminates between cluster-up (we created it) and cluster-register (we discovered an existing cluster — see Chapter 9). Subsequent commands read this file to learn the workspace’s cluster identity without hitting IBM APIs.

roksbnkctl cluster down deletes the file as part of its post-destroy cleanup. roksbnkctl cluster show pretty-prints it for human readers:

roksbnkctl cluster show
workspace:        default
source:           cluster-up
recorded_at:      2026-05-08T14:22:08Z

cluster_name:     bnk-quickstart
cluster_id:       cre6h4l20jjsg4kvt3a0
region:           us-south
resource_group:   abc123...
openshift:        4.14_openshift
master_url:       https://c106.us-south.containers.cloud.ibm.com:31415

vpc_id:           r006-...
registry_cos:     bnk-quickstart-cos-instance
registry_cos_crn: crn:v1:bluemix:public:cloud-object-storage:global:a/...
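Scripts can read cluster-outputs.json directly. The helper below is a minimal sketch that avoids a jq dependency; cluster_output_field is a hypothetical name, and the sed pattern assumes the flat, one-string-field-per-line layout shown in the sample contents, so it is not a general JSON parser:

```shell
# Minimal sketch: print one string field from cluster-outputs.json.
# $1 = path to the file, $2 = field name (for example "source").
cluster_output_field() {
  sed -n "s/^[[:space:]]*\"$2\":[[:space:]]*\"\(.*\)\",\{0,1\}[[:space:]]*\$/\1/p" "$1"
}

# Usage (illustrative):
#   src=$(cluster_output_field ~/.roksbnkctl/default/cluster-outputs.json source)
#   [ "$src" = "cluster-register" ] && echo "cluster was registered, not created here"
```

If jq is available, a real JSON parser is the safer choice; the point here is only that the file is plain, workspace-scoped data that any tool can consume.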

Worked example: cluster up → kubectl get nodes → cluster down

The cluster-only flow, end to end:

Step 1 — `roksbnkctl init`

If you don’t have a workspace yet, initialise one. This is the same init flow as the trial path; the cluster commands reuse the workspace’s config.

roksbnkctl init

Step 2 — `roksbnkctl cluster up --auto`

Provisions the cluster phase only:

roksbnkctl cluster up --auto

Sample output (heavily abridged):

→ terraform plan (cluster phase: deploy_bnk=false forced)
→ Layering user tfvars from ~/.roksbnkctl/default/state-cluster/cluster-phase-override.tfvars (overrides config.yaml-derived values)
→ terraform init
→ terraform apply
  module.roks_cluster.ibm_container_vpc_cluster.cluster: Creating...
  module.roks_cluster.ibm_container_vpc_cluster.cluster: Still creating... [10m elapsed]
  ...
  module.roks_cluster.ibm_container_vpc_cluster.cluster: Creation complete after 38m12s
  module.cert_manager.helm_release.cert_manager: Creation complete after 2m11s
  module.testing.tls_private_key.jumphost_shared_key: Creation complete after 0s
  module.testing.ibm_is_instance.tgw_jumphost: Creation complete after 1m48s

  Apply complete! Resources: 36 added, 0 changed, 0 destroyed.

✓ Wrote ~/.roksbnkctl/default/cluster-outputs.json
✓ Wrote /home/you/.kube/config (chmod 0600)
✓ Auto-registered target jumphost (169.45.91.177); use `roksbnkctl --on jumphost ...`

Roughly 36 resources land, so the cluster phase is about half the size of the full 77-resource deployment. Time-to-ready is dominated by the ROKS cluster itself; everything after the cluster comes up is fast.

Step 3 — verify the cluster works

The post-apply admin kubeconfig is fetched automatically (unless you pass --no-kubeconfig). kubectl get nodes confirms reachability:

kubectl get nodes
# NAME           STATUS   ROLES           AGE   VERSION
# 10.243.0.4     Ready    master,worker   3m    v1.28.6+5e1b9a1
# 10.243.64.4    Ready    master,worker   3m    v1.28.6+5e1b9a1

Or the same thing through the internalised verb:

roksbnkctl k get nodes

roksbnkctl status reports cluster identity + reachability:

roksbnkctl status
Workspace:    default
Region:       us-south
Cluster:      bnk-quickstart  (attach existing)
TF source:    embedded@v1.0.0
Last apply:   2026-05-08 14:22:08 UTC  (3m ago)
Kubeconfig:   /home/you/.kube/config
Cluster:      2/2 nodes ready

Step 4 — (optional) deploy a BNK trial on top

Now that the cluster is up, roksbnkctl up deploys a BNK trial onto it. It reads cluster-outputs.json and reuses the cluster:

roksbnkctl up --auto

See Chapter 10 — Deploying BNK trials for the trial-phase walkthrough. You can run up / down many times against the same cluster — each cycle is ~5 minutes rather than the ~50 minutes of a fresh-cluster run.

Step 5 — `roksbnkctl cluster down --auto`

Tear down the cluster phase. In v1.1.0 cluster down is strictly scoped: it refuses with a hard error (rather than the v1.0.x warning-but-prompt) on any workspace whose trial state is non-empty, so an out-of-order destroy can’t accidentally orphan BNK resources. Destroy the trial first with roksbnkctl bnk down (or roksbnkctl down for both at once); see Chapter 11 for the full refusal catalogue.

roksbnkctl cluster down --auto

Sample output:

→ terraform destroy (cluster phase)
  module.testing.ibm_is_instance.tgw_jumphost: Destroying...
  module.cert_manager.helm_release.cert_manager: Destroying...
  module.roks_cluster.ibm_container_vpc_cluster.cluster: Destroying...
  module.roks_cluster.ibm_container_vpc_cluster.cluster: Still destroying... [5m elapsed]
  module.roks_cluster.ibm_container_vpc_cluster.cluster: Destruction complete after 8m16s

  Destroy complete! Resources: 36 destroyed.

Post-destroy, cluster-outputs.json is deleted. The workspace directory and its config.yaml survive — re-running cluster up against the same workspace re-creates the cluster with the same name and region.

Why split cluster from trial?

Two-phase is the default because the cost of conflating them is concrete. ROKS clusters take 30-50 minutes to provision and bill at roughly $0.30/hour; a BNK trial on top takes 5-10 minutes. Iterating on the trial — different flo versions, different cne_instance shapes, license bundle revisions — happens far more often than iterating on the cluster underneath. Splitting state means a bnk down / bnk up cycle is a five-minute round-trip instead of an hour.

Three scenarios this shape unlocks:

Many BNK trial iterations on one cluster. Run cluster up once, then loop bnk up / bnk down against the same cluster until you’ve covered all the trial permutations. Then cluster down once when you’re finished. This is the headline win of the v1.1.0 surface — see Chapter 10 §“Worked example — iterating on a BNK trial”.
Pre-provisioning for a workshop or demo. You want the cluster ready and warm before the demo starts; you’ll deploy the BNK trial live in front of the audience. cluster up the night before; bnk up during the demo.
Decoupling cluster lifecycle from trial lifecycle. A long-lived cluster used by multiple team members, where one person owns the cluster phase and others own the BNK trials. Cluster-phase outputs live in cluster-outputs.json; trials read it. Each trial can bnk up / bnk down without affecting the cluster.
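The first scenario can be sketched as a short loop. This is an illustration built only from the verbs named in this chapter; iterate_trials is a hypothetical wrapper, and because this chapter does not say whether the bnk verbs accept --auto, none is passed to them (so they may prompt):

```shell
# Sketch of "one cluster, many trials". $1 = number of trial cycles to run.
iterate_trials() {
  cycles=$1
  # Pay the long provisioning cost once.
  roksbnkctl cluster up --auto || return 1
  i=1
  while [ "$i" -le "$cycles" ]; do
    roksbnkctl bnk up || return 1
    # A failed test run is reported but does not stop the loop.
    roksbnkctl test || echo "cycle $i: tests failed" >&2
    roksbnkctl bnk down || return 1
    i=$((i + 1))
  done
  # The trial state is empty at this point, so the cluster can go.
  roksbnkctl cluster down --auto
}

# Usage (illustrative):
#   iterate_trials 3
```

Each pass through the loop touches only the trial state directory; the cluster state is written twice, once at the start and once at the end.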

For workspaces that just want “create a cluster, deploy BNK on it, test, tear it all down”, the unscoped roksbnkctl up / roksbnkctl down are still the right verbs — in v1.1.0 they’re shape-aware composites that drive the cluster + trial steps in the right order without you having to think about it.

Legacy single-state workspaces

Workspaces created against v1.0.x predate the split. Their terraform.tfstate under ~/.roksbnkctl/<workspace>/state/ contains both the cluster modules (module.roks_cluster, module.cert_manager, module.testing) and the trial modules (module.flo, module.cne_instance, module.license) in one file; state-cluster/ either doesn’t exist or is empty.

roksbnkctl calls this shape LegacySingle and identifies it by walking the trial state’s resource list for cluster-module addresses. To check a workspace’s shape from the outside, look at the state directories:

$ ls ~/.roksbnkctl/<workspace>/
config.yaml  state/  state-cluster/    # split (v1.1.0+) or cluster-only

$ ls ~/.roksbnkctl/<workspace>/
config.yaml  state/                    # legacy single-state, or empty

A state/terraform.tfstate that contains module.roks_cluster and friends is legacy single-state; a state-cluster/terraform.tfstate with content is the split shape.
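The same check can be scripted. The function below is a rough sketch of the rule just described, not the detection roksbnkctl performs internally: workspace_shape is a hypothetical name, and treating any non-empty state-cluster/terraform.tfstate as the split shape is a simplifying assumption.

```shell
# Rough sketch: classify a workspace directory by looking at its state files.
# $1 = workspace directory, e.g. ~/.roksbnkctl/default
workspace_shape() {
  ws_dir=$1
  if [ -s "$ws_dir/state-cluster/terraform.tfstate" ]; then
    # Cluster phase has its own state file with content.
    echo split
  elif grep -q 'module\.roks_cluster' "$ws_dir/state/terraform.tfstate" 2>/dev/null; then
    # Cluster modules live in the trial state: the v1.0.x layout.
    echo legacy-single
  else
    echo empty
  fi
}

# Usage (illustrative):
#   workspace_shape ~/.roksbnkctl/canada-roks
```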

The v1.1.0 binary handles both shapes:

Legacy single-state workspaces: roksbnkctl up and roksbnkctl down operate monolithically the way they did in v1.0 — same plan output, same resource count, same byte-for-byte behaviour. The phase-scoped commands (cluster up/down, bnk up/down) refuse with a message pointing you back at the unscoped lifecycle verbs.
Split workspaces (the new default): up / down are shape-aware composites that delegate to the phase commands underneath; cluster up/down and bnk up/down work directly.

The refusal messages on a legacy workspace look like:

$ roksbnkctl -w canada-roks cluster down
this workspace is legacy single-state; cluster and BNK trial share one state. Use `roksbnkctl down` to tear down both, or migrate the state first

$ roksbnkctl -w canada-roks bnk down
this workspace is legacy single-state; `bnk down` can't isolate the trial phase. Use `roksbnkctl down` to tear down both, or migrate the state first

The refusals print as a single line each; any wrapping is a function of your terminal width. Grepping for a fixed substring that includes the inline punctuation (e.g. `bnk down` can't isolate) lands a clean match.
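A wrapper script can detect the legacy refusal with a fixed-string match. This is a minimal sketch: the helper name is hypothetical, and it assumes you have already captured the command's output (this chapter does not say whether the refusal goes to stdout or stderr, so capture both).

```shell
# Hypothetical helper: succeed if the captured output (on stdin) contains the
# legacy single-state refusal. -F matches the text literally, so the
# punctuation in the message needs no regex escaping.
is_legacy_refusal() {
  grep -qF 'this workspace is legacy single-state'
}

# Possible use, with both streams captured:
#   roksbnkctl -w canada-roks bnk down 2>&1 | is_legacy_refusal && echo "legacy workspace"
```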

There is no automatic state-migration command in v1.1.0. The refusal text references migration (“or migrate the state first”) because a future roksbnkctl migrate is planned, but until it ships, legacy workspaces stay on the unscoped up / down flow that’s worked for them since v1.0. See Chapter 11 §“The phase-aware decision tree” for the full destruction-time decision matrix.

Cross-references

Chapter 9 — Registering an existing cluster — the alternative to cluster up when you already have a ROKS cluster you want roksbnkctl to manage.
Chapter 10 — Deploying BNK trials — roksbnkctl up and the bnk up / bnk down command group; the dispatch matrix; iteration walkthrough.
Chapter 11 — Tearing down — phase-aware decision matrix and the refusal-message catalogue.
Chapter 24 — Day-2 ops — roksbnkctl k get / apply / logs for working against the cluster after either phase.

Registering an existing cluster

roksbnkctl cluster register <name> wires roksbnkctl up to a ROKS cluster that already exists in your IBM Cloud account — one you didn’t provision via cluster up. After a successful register, the workspace behaves exactly as if you’d done cluster up: roksbnkctl up deploys BNK trials onto the registered cluster, roksbnkctl down tears those trials down, roksbnkctl status reports the cluster’s identity, and so on.

This chapter covers when registration is the right answer, what input is required vs auto-discovered, the COS naming convention, the cluster-outputs.json write, and a worked example.

When to use this

cluster register is the answer when all of these are true:

A ROKS cluster already exists in the IBM Cloud account.
You have IAM access to the cluster’s VPC + container service.
You want roksbnkctl to deploy BNK trials onto that cluster.
You don’t want roksbnkctl to own the cluster’s lifecycle (it shouldn’t be terraform destroy-able from your workstation).

Common scenarios:

Your team operates the ROKS cluster centrally. A platform team provisioned the cluster via their own Terraform / Pulumi / IBM Cloud Schematics; you just want to deploy BNK trials onto it. Register it; deploy trials; tear them back down. The cluster itself stays under the platform team’s ownership.
You’re attaching to an existing demo cluster. A workshop hosts a shared cluster that participants attach to. Each participant registers it in their own workspace and deploys their own trial — trials are isolated by namespace under the same cluster.
You provisioned the cluster manually for testing. You created a one-off cluster via ibmcloud ks cluster create vpc-gen2 ... and want to move forward with roksbnkctl rather than re-creating it.

If none of those apply — i.e. you want roksbnkctl to own cluster lifecycle end-to-end — use cluster up instead. Register and cluster up are mutually exclusive per workspace; the second one wins.

Required input vs auto-discovery

cluster register takes one positional argument (the cluster name or ID) and one optional flag (--registry-cos-name).

roksbnkctl cluster register <cluster-name-or-id> [--registry-cos-name <cos-instance-name>]

Everything else is auto-discovered via the IBM SDK:

Field	Source
`cluster_id`	`ibmcloud ks cluster get <name>` (resolved by name → ID)
`region`	from the cluster lookup
`resource_group_id`	from the cluster lookup
`vpc_id`	from the cluster’s `provider.vpcs[0].id`
`master_url`	from the cluster lookup
`openshift_version`	from the cluster’s `masterKubeVersion`
`registry_cos_crn`	discovered via the registry COS instance lookup (see below)

The cluster lookup goes through the same container-service endpoint ibmcloud ks cluster get uses — no host ibmcloud install required. If the named cluster doesn’t exist in the account, the call returns a clear no cluster named <foo> error rather than a 404 stack trace.

A vpc-gen2 cluster is required. Classic infrastructure clusters return successfully but their vpc_id is empty, and cluster register refuses to write a record without one:

Error: cluster "old-classic" has no VPC — roksbnkctl only supports vpc-gen2 clusters

The COS naming convention

roksbnkctl up needs a Cloud Object Storage instance to act as the registry for FAR images, JWT licenses, and schematic state. cluster register verifies that this COS instance exists at registration time so a later up doesn’t fail mid-apply with a missing-instance error.

Default convention

The bundled HCL falls back to <cluster-name>-cos if the user’s tfvars don’t override roks_cos_instance_name. So cluster register defaults to looking up <cluster-name>-cos:

# Cluster name: "canada-roks" → expects COS instance "canada-roks-cos"
roksbnkctl cluster register canada-roks

Override with `--registry-cos-name`

If your team set roks_cos_instance_name to something else in their tfvars (or named the COS instance via the IBM Cloud console with a different convention), pass --registry-cos-name <name>:

roksbnkctl cluster register canada-roks \
  --registry-cos-name canada-roks-bnk-registry

The instance name is case-sensitive and must match exactly — Canada-ROKS-COS and canada-roks-cos are different instances.
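The naming rule is small enough to state as code. The function below is illustrative only (its name is made up); it reproduces the default-plus-override behaviour described above and deliberately does no case normalisation.

```shell
# Hypothetical helper: the COS instance name cluster register would look for.
# $1 = cluster name, $2 = optional --registry-cos-name override.
registry_cos_name() {
  cluster_name=$1
  override=${2:-}
  if [ -n "$override" ]; then
    printf '%s\n' "$override"
  else
    printf '%s-cos\n' "$cluster_name"
  fi
}

registry_cos_name canada-roks                           # canada-roks-cos
registry_cos_name canada-roks canada-roks-bnk-registry  # canada-roks-bnk-registry
```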

What if the COS doesn’t exist yet?

cluster register errors out:

Error: registry COS instance "canada-roks-cos" not found in account: ...
  Either run `roksbnkctl cluster up` to create it, or pass --registry-cos-name <name>
  if your tfvars uses a different roks_cos_instance_name

You have two options:

Create the COS instance in the IBM Cloud console with the conventional name (<cluster>-cos), then re-run register. The instance can be empty — roksbnkctl up will populate it with the bucket structure it needs on its first apply.
Use a different name that already exists in the account, via --registry-cos-name <name>.

Either way, cluster register won’t write cluster-outputs.json until both the cluster and its registry COS instance exist.

The `cluster-outputs.json` write

On success, cluster register writes ~/.roksbnkctl/<workspace>/cluster-outputs.json — the same file cluster up writes. The contents look identical except for one field:

{
  "cluster_name": "canada-roks",
  "cluster_id": "cre6h4l20jjsg4kvt3a0",
  "region": "ca-tor",
  "resource_group_id": "abc123...",
  "vpc_id": "r038-...",
  "registry_cos_crn": "crn:v1:bluemix:public:cloud-object-storage:global:a/...",
  "registry_cos_name": "canada-roks-cos",
  "master_url": "https://c106.ca-tor.containers.cloud.ibm.com:31415",
  "openshift_version": "4.14_openshift",
  "source": "cluster-register",
  "recorded_at": "2026-05-08T14:22:08Z"
}

The source field is cluster-register (vs cluster-up for self-provisioned clusters). Downstream commands that care about provenance — for example, a future roksbnkctl cluster down would refuse to destroy a cluster-register-sourced cluster — read this field. Subnet IDs (subnet_ids) and transit gateway ID (transit_gateway_id) are left blank for registered clusters; the bundled HCL doesn’t need them when roksbnkctl up runs against a pre-existing cluster.
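Scripts of your own can read the same provenance field. The sketch below avoids a JSON parser and so assumes the one-key-per-line layout shown in the sample; both function names are hypothetical and are not part of roksbnkctl.

```shell
# Hypothetical helper: print the "source" field of a cluster-outputs.json file.
# Uses sed rather than a JSON parser, so it relies on one key per line.
cluster_source() {
  sed -n 's/^[[:space:]]*"source"[[:space:]]*:[[:space:]]*"\([^"]*\)".*$/\1/p' "$1" | head -n 1
}

# Hypothetical guard: succeed only for clusters this workspace provisioned itself.
is_self_provisioned() {
  [ "$(cluster_source "$1")" = "cluster-up" ]
}
```

For example, is_self_provisioned ~/.roksbnkctl/canada/cluster-outputs.json returns non-zero for the registered cluster in the sample, which is the signal a cleanup script would use to leave the cluster alone.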

Worked example: register canada-roks

The full flow for attaching to a hypothetical canada-roks cluster.

Step 1 — create or pick a workspace

roksbnkctl ws new canada
roksbnkctl init -w canada
# (interactive — fill in region as ca-tor; cluster.name = canada-roks)

You can also run cluster register against the current workspace; the -w is just for clarity.

Step 2 — `cluster register`

roksbnkctl -w canada cluster register canada-roks

Sample output:

→ Looking up cluster "canada-roks"
✓ Cluster canada-roks (cre6h4l20jjsg4kvt3a0) — state: normal, masters: 4.14_openshift
✓ VPC r038-... (resource group prod-rg)
→ Verifying registry COS instance "canada-roks-cos"
✓ COS instance canada-roks-cos (abc-123-def-...)
✓ Wrote ~/.roksbnkctl/canada/cluster-outputs.json

If the COS naming was non-conventional:

roksbnkctl -w canada cluster register canada-roks \
  --registry-cos-name canada-bnk-registry

Step 3 — verify with `cluster show`

roksbnkctl -w canada cluster show
workspace:        canada
source:           cluster-register
recorded_at:      2026-05-08T14:22:08Z

cluster_name:     canada-roks
cluster_id:       cre6h4l20jjsg4kvt3a0
region:           ca-tor
resource_group:   abc123...
openshift:        4.14_openshift
master_url:       https://c106.ca-tor.containers.cloud.ibm.com:31415

vpc_id:           r038-...
registry_cos:     canada-roks-cos
registry_cos_crn: crn:v1:bluemix:public:cloud-object-storage:global:a/...

Step 4 — fetch the kubeconfig

cluster register does not automatically download the kubeconfig — it’s a metadata-only operation. Grab it explicitly:

roksbnkctl -w canada kubeconfig --download
# → Fetching admin kubeconfig for "canada-roks"
# ✓ Wrote /home/you/.kube/config (12345 bytes)

Step 5 — use the cluster as if you’d done `cluster up`

From here, the workflow is identical to a self-provisioned cluster:

# Verify reachability
roksbnkctl -w canada k get nodes

# Deploy a BNK trial onto it
roksbnkctl -w canada up --auto

# Tear the trial back down (cluster survives)
roksbnkctl -w canada down --auto

roksbnkctl up reads cluster-outputs.json and uses the cluster identity directly — no need to re-state cluster name/region/RG in the trial’s tfvars.

When register isn’t enough

Some scenarios where cluster register won’t get you over the line:

The cluster is in a different IBM Cloud account. API keys are account-scoped; you’d need a key for the cluster’s account. cluster register doesn’t cross account boundaries.
The cluster is private (no public master endpoint). roksbnkctl up needs to apply Helm charts and Kubernetes manifests against the master. If the master is only reachable from inside a VPN, route the apply through --on jumphost (see Chapter 16).
The cluster is a classic-infrastructure ROKS (not vpc-gen2). Registration refuses; classic clusters aren’t supported.
The cluster’s worker pool is too small. BNK trials need at least 2 workers with adequate CPU/memory. The upstream HCL provisions appropriately-sized workers; an existing cluster might not.

For the first three, the cluster simply isn’t a candidate. For the last one, the apply may run but flo / cne_instance will fail to schedule — scale the worker pool first.

Re-registering and unregistering

To re-register with new data (e.g. you renamed the COS instance, or the master URL changed), just run cluster register again — it overwrites cluster-outputs.json in place.

To unregister without destroying anything, delete the file directly:

rm ~/.roksbnkctl/canada/cluster-outputs.json

The workspace’s config.yaml and state/ survive; only the cluster identity record is removed. The next roksbnkctl up will fail with workspace has no cluster-outputs.json until you either re-register or run cluster up.

There’s deliberately no roksbnkctl cluster unregister command. Deleting the JSON is a single-file operation that doesn’t deserve its own subcommand, and the absence of one nudges users toward “destroy the trial first, then deal with the cluster identity” rather than “unregister without thinking about the consequences”.
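If you want the "destroy the trial first" nudge enforced in your own tooling, a guard around the rm is one option. This is a sketch under an assumption: it relies on the version-4 Terraform state JSON layout, where every managed resource entry carries "mode": "managed". Both function names are hypothetical.

```shell
# Heuristic: does this Terraform state file still track managed resources?
# Assumes the version-4 state JSON layout ("mode": "managed" per resource).
state_has_resources() {
  [ -s "$1" ] && grep -Eq '"mode"[[:space:]]*:[[:space:]]*"managed"' "$1"
}

# Hypothetical wrapper around the rm above: refuse while the trial state
# under state/ still tracks resources.
unregister_cluster() {
  ws_dir=$1
  if state_has_resources "$ws_dir/state/terraform.tfstate"; then
    echo "trial state still has resources; destroy the trial first" >&2
    return 1
  fi
  rm -f "$ws_dir/cluster-outputs.json"
}
```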

Cross-references

Chapter 8 — The cluster phase — the alternative when you want roksbnkctl to provision the cluster.
Chapter 10 — Deploying BNK trials — what roksbnkctl up does on top of a registered (or cluster up’d) cluster.
Chapter 25 — COS supply chain management — the COS instance and bucket layout that --registry-cos-name points at.

Deploying BNK trials on top

roksbnkctl up deploys a BNK trial — F5’s Lifecycle Operator, the CNE Instance, license bundles, and the cluster-side glue that makes them work — onto a ROKS cluster that already exists. “Already exists” means either provisioned by cluster up or registered from a pre-existing cluster.

For workspaces where the cluster and the trial are managed as separate phases (the v1.1.0 default — see Chapter 8), the trial layer also gets its own command pair: roksbnkctl bnk up / bnk down. bnk down tears down only the trial; the cluster keeps running, so the next iteration starts in 5-10 minutes instead of an hour. The bnk group is documented in §“The bnk up / bnk down command group” below.

This chapter is the deeper-than-quick-start view of up: what each module does, the ~77-resource shape of a clean apply, the token-rotation observation when you re-run up against an existing cluster, how to read the Terraform plan output, and how the bnk group + the shape-aware composite up / down fit together.

Chapter 7 — Quick start shows the happy path end-to-end with sample output. This chapter goes deeper.

What “deploying BNK” means

A BNK trial is a deliberately small set of Kubernetes resources that share state with a cluster-shared cert-manager and a cluster-scoped registry COS. The components that roksbnkctl up is responsible for landing:

Component	What it is	Module in the bundled HCL
`flo`	F5 Lifecycle Operator — the controller that watches CNE Instance CRs and reconciles them into running BIG-IP Next pods	`module.flo` (Helm release)
`cne_instance`	The CR that declares “I want a BIG-IP Next data plane here” — drives `flo` to provision the TMM pods	`module.cne_instance` (Kubernetes manifest)
`license`	JWT licenses + activation tokens that gate BNK’s runtime — sourced from the registry COS	`module.license` (Helm release + null_resources)
`cluster-side bits`	ServiceAccounts, RoleBindings, SCC bindings, Secrets that flo / cne_instance / license need at runtime	scattered across the modules above

up does not own the cluster, cert-manager, the registry COS, or the jumphost — those are cluster-phase resources. See Chapter 8 for the split.

The 77-resource shape

A clean roksbnkctl up against a fresh cluster lands roughly 77 resources when the cluster phase is bundled in (i.e. cluster up and up were one combined run). Against a pre-existing cluster (cluster up then up), the trial-only count is smaller — roughly the difference, ~41 resources.

The number isn’t load-bearing; it shifts a few resources up or down between upstream HCL releases as the chart adds/removes null_resources and Secrets. Treat “77” as a sanity-check tag, not a contract.

A representative breakdown:

Cluster phase (~36 resources, owned by `cluster up`)
  ROKS cluster + worker pools          ~5
  VPC + subnets + security groups       ~6
  Transit gateway + connections          ~4
  Registry COS instance + bucket          ~3
  cert-manager Helm release               ~2
  TGW jumphost VSI + cloud-init         ~16

Trial phase (~41 resources, owned by `roksbnkctl up`)
  flo Helm release                       ~5
  cne_instance manifest + finalisers     ~4
  license Helm release                  ~10
  Cluster-side SAs / RoleBindings / SCC ~10
  null_resources for token bootstrap    ~12

The null_resources at the bottom of the list are interesting — they’re the ones that re-run on every apply (more on that below).
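As a quick arithmetic check, the approximate per-group counts in the breakdown do add up to the phase totals and to the headline 77. The numbers are copied from the breakdown; the sum helper is only there for the illustration.

```shell
# Sum the approximate per-group counts from the breakdown.
sum() {
  total=0
  for n in "$@"; do total=$((total + n)); done
  echo "$total"
}

cluster_phase=$(sum 5 6 4 3 2 16)
trial_phase=$(sum 5 4 10 10 12)
echo "cluster=$cluster_phase trial=$trial_phase total=$((cluster_phase + trial_phase))"
# prints: cluster=36 trial=41 total=77
```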

Apply timing

A clean up against a fresh cluster takes ~50 minutes:

ROKS cluster provisioning: 30-40 min (the bulk of the wait)
cert-manager + flo Helm install: ~5 min
cne_instance reconcile: 1-2 min
license bootstrap (token generation + activation): 2-3 min
Cluster-side bits + finalisers: 2-3 min

Against a pre-existing cluster (already-up’d or registered), the trial-only run is 5-10 minutes. Most of that is Helm waiting for flo to stabilise and the license module’s null_resources running.

The token-rotation observation

If you re-run roksbnkctl up against an already-deployed BNK trial, you’ll see ~41 resources re-create or update in-place even though “nothing changed”. This is expected.

The license module rotates admin certificate tokens between runs — the JWT used to authenticate against the BNK control plane is short-lived and re-minted on each apply. A token rotation cascades into ~12 null_resources that exist solely to inject the new token into Helm-managed Secrets:

module.license.null_resource.cncf_admin_cert_token: Refreshing state... [id=8746234876]
module.license.null_resource.cncf_admin_cert_token: Destroying... [id=8746234876]
module.license.null_resource.cncf_admin_cert_token: Destruction complete after 0s
module.license.null_resource.cncf_admin_cert_token: Creating...
module.license.null_resource.cncf_admin_cert_token: Creation complete after 12s [id=9183746183]

That’s why the count of “destroyed + created” can hit ~41 even when no infrastructure-meaningful changes have been made.

The rotation is harmless — running pods aren’t restarted, traffic isn’t interrupted. The new token replaces the old in the relevant Secret; flo notices and updates its in-memory cache. From the BNK trial’s runtime perspective, the second up is a no-op.

If you want to skip the rotation cycle and just check “would this plan change anything significant?”, use roksbnkctl plan rather than up — it shows the plan without applying.

Reading the Terraform plan output

roksbnkctl up runs terraform plan first and prints its output. The plan summary at the end is the most useful part:

Plan: 77 to add, 0 to change, 0 to destroy.

Or, post-rotation:

Plan: 12 to add, 0 to change, 12 to destroy.

The body of the plan marks each resource change with one of four symbols:

+ create: a new resource. Lines are green in a TTY.
<= read: a data source the plan reads but does not change. Common for data "ibm_resource_group" and similar lookups; effectively informational.
- destroy: an existing resource that will be removed. When the resource is being replaced (the null_resource rotation case), Terraform shows the pair as a single -/+ replacement.
~ update in-place: a resource whose attributes are being mutated without re-creation.

The <= data sources are the ones that look like:

data "ibm_resource_group" "default" {
  name = "Default"
  id   = "abc123..." (will be read)
}

These are read-only: Terraform is just resolving the resource group’s ID at plan time so downstream modules can reference it. They show up in every plan, including no-op plans.

A - destroy with no matching create (a resource actually leaving, rather than a -/+ replacement) should make you stop and read carefully. On a re-run of up, this generally means an upstream HCL change removed a resource. It’s rare but not zero.
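The summary line makes the "is anything actually leaving?" check easy to automate. The helpers below are a sketch with invented names. They assume the plan output has been captured without colour codes, and they use a deliberately crude heuristic: a plan that destroys more than it adds has at least one destroy with no matching create.

```shell
# Hypothetical helper: read plan output on stdin, print "<add> <change> <destroy>".
plan_counts() {
  sed -n 's/^Plan: \([0-9][0-9]*\) to add, \([0-9][0-9]*\) to change, \([0-9][0-9]*\) to destroy\.$/\1 \2 \3/p' | tail -n 1
}

# Heuristic: flag plans that destroy more than they create, i.e. at least one
# resource is leaving rather than being replaced.
plan_has_net_destroy() {
  set -- $(plan_counts)
  [ "$#" -eq 3 ] && [ "$3" -gt "$1" ]
}
```

With the rotation summary shown earlier (12 to add, 12 to destroy) the check stays quiet. A plan with no summary line at all, such as a no-op, is also treated as quiet.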

When `up` doesn’t apply (no-op runs)

If the plan reports zero changes, up skips apply and prints:

✓ no changes

But it still does two best-effort post-actions:

Fetch the kubeconfig (unless --no-kubeconfig). Useful when the cluster exists but you’ve never grabbed the admin kubeconfig on this workstation.
Auto-register the jumphost target. Reads testing_tgw_jumphost_ip and jumphost_shared_key from Terraform outputs and writes a targets:jumphost entry in workspace config. Re-runs are idempotent.

So roksbnkctl up against an unchanged cluster is a useful “re-establish my workstation’s view of this workspace” verb — it can’t hurt anything (no apply runs), and it freshens local artefacts.

The `--auto`, `--no-kubeconfig`, `--var-file` flags

roksbnkctl up [--auto] [--no-kubeconfig] [--var-file <path>]... [--tf-source <ref>]

Flag	Effect
`--auto`	Skip the “Apply this plan? [y/N]” prompt. Required for non-interactive runs (CI, scripted pipelines).
`--no-kubeconfig`	Skip the post-apply kubeconfig fetch. Useful when you’ve already got a kubeconfig and don’t want it overwritten.
`--var-file <path>`	Layer extra Terraform var-files onto the chain (repeatable; later wins). Lets you parameterise without editing config.yaml.
`--tf-source <ref>`	Override the pinned TF source for this run only. Skip the embedded HCL and use a path or URL instead. Mostly for dev.

--var-file is the canonical way to stage a non-default deploy. For example, deploying a BNK trial with a non-default cne_instance.replicas:

echo 'cne_replicas = 3' > ./more-replicas.tfvars
roksbnkctl up --auto --var-file ./more-replicas.tfvars

The var-file chain is, in order:

The auto-generated terraform.tfvars (rendered from config.yaml).
~/.roksbnkctl/<workspace>/terraform.tfvars.user if present.
Each --var-file flag, left-to-right.

Later wins on conflict — same as Terraform’s own ordering.
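To see which value a variable ends up with, you can walk the chain yourself. This sketch is not how Terraform parses var-files; it only handles single-line name = value assignments, and the function name is hypothetical. Pass the files in chain order.

```shell
# Hypothetical helper: print the value a simple `name = value` variable ends up
# with across a chain of var-files. Files are given in chain order; later wins.
# Only handles single-line assignments, not maps, lists, or heredocs.
effective_var() {
  var_name=$1
  shift
  value=
  for f in "$@"; do
    [ -f "$f" ] || continue
    found=$(sed -n "s/^[[:space:]]*${var_name}[[:space:]]*=[[:space:]]*//p" "$f" | tail -n 1)
    [ -n "$found" ] && value=$found
  done
  printf '%s\n' "$value"
}
```

For the example above, effective_var cne_replicas <generated tfvars> ./more-replicas.tfvars would print 3, because the --var-file entry comes last in the chain.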

Apply retries on transient errors

ROKS master endpoints take 1-5 minutes to fully propagate after the cluster reaches Ready. The cne_instance, license, and cert-manager modules all curl the master directly; on a fresh cluster, they sometimes race propagation and fail with exit status 7 (curl couldn’t connect) or Connection refused.

roksbnkctl up has built-in retry: up to 3 apply attempts, with a 60-second sleep between attempts, on any of these heuristic patterns:

exit status 7 (curl couldn’t connect)
Connection refused / connection refused
i/o timeout
no route to host
network is unreachable
no such host
TLS handshake timeout
failed to dial
to download the config doesn't exist

If your apply hits one of these, you’ll see:

→ apply attempt 1 hit a transient-looking failure; waiting 60s and retrying...

Terraform’s idempotence means already-created resources are skipped on the retry; only the failed null_resources / data sources re-execute. After 3 attempts, up gives up:

✗ apply still failing after 3 attempts — giving up

At that point, fix the underlying cause (usually wait longer or re-run manually) and try again. The retry is for transient races, not persistent failures.
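The retry behaviour can be pictured as a small loop. The sketch below illustrates the described policy (three attempts, 60 seconds apart, retry only on a transient-looking pattern); it is not roksbnkctl's implementation, and the names retry_transient, RETRY_MAX, and RETRY_SLEEP are invented for the example.

```shell
# Patterns copied from the list above, one per line. grep -F treats each line
# as a separate literal string.
TRANSIENT_PATTERNS="exit status 7
Connection refused
connection refused
i/o timeout
no route to host
network is unreachable
no such host
TLS handshake timeout
failed to dial
to download the config doesn't exist"

# Run a command up to RETRY_MAX times, sleeping RETRY_SLEEP seconds between
# attempts, but only while the failure output looks transient.
retry_transient() {
  max_attempts=${RETRY_MAX:-3}
  pause=${RETRY_SLEEP:-60}
  attempt=1
  log=$(mktemp)
  while :; do
    if "$@" >"$log" 2>&1; then
      cat "$log"
      rm -f "$log"
      return 0
    fi
    cat "$log" >&2
    if [ "$attempt" -ge "$max_attempts" ] || ! grep -qF -e "$TRANSIENT_PATTERNS" "$log"; then
      rm -f "$log"
      return 1
    fi
    echo "attempt $attempt hit a transient-looking failure; waiting ${pause}s and retrying" >&2
    sleep "$pause"
    attempt=$((attempt + 1))
  done
}
```

A failure that matches none of the patterns returns immediately, which mirrors the point above: the retry covers propagation races, not persistent errors.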

What happens on success

A successful up does five things in order:

Apply complete. Apply complete! Resources: 77 added, 0 changed, 0 destroyed.
Fetch the admin kubeconfig from IBM Cloud’s container service API. Written to $KUBECONFIG (or ~/.kube/config) at mode 0600.
Auto-register the jumphost target in workspace config (so --on jumphost works without manual config — see Chapter 16).
Stamp terraform.tfstate’s mtime. roksbnkctl status reads this as “last apply” timestamp.
Exit 0.

The kubeconfig fetch and jumphost registration are best-effort: they log warnings on failure but don’t fail the parent command. up succeeded if Terraform succeeded; the post-apply niceties are conveniences.

The `bnk up` / `bnk down` command group

New in v1.1.0. The roksbnkctl bnk group is the trial-only counterpart to roksbnkctl cluster — it operates on the trial state under state/ and leaves the cluster state under state-cluster/ untouched. The whole point is that iterating on a BNK trial no longer costs a 30-minute cluster rebuild: a bnk down / bnk up round-trip is the 5-10 minute trial-apply window, the cluster keeps running underneath.

`roksbnkctl bnk up`

Deploys the BNK trial against the workspace’s registered cluster.

If the workspace already has a cluster phase (either from cluster up or from cluster register), bnk up runs the trial apply directly — same plan, same ~41 resources, same 5-10 minute window as the trial half of a full up.
If the workspace is empty (no cluster registered yet), bnk up offers to bootstrap the cluster phase first with a confirmation prompt, then runs the trial apply. This keeps the new user’s quick-start path one command, even if they typed bnk up instead of up.
On a legacy single-state workspace, bnk up refuses — there’s no way to isolate the trial phase when the trial and cluster share one state file.

Sample output of the bootstrap-prompt path:

$ roksbnkctl bnk up
No cluster registered for this workspace.
→ Provisioning the cluster phase first (ROKS cluster + transit gateway +
  registry COS + cert-manager + jumphost; ~30 min) before the BNK trial.
Continue? [y/N]: y
→ terraform plan (cluster phase: deploy_bnk=false forced)
...
✓ Wrote ~/.roksbnkctl/default/cluster-outputs.json
→ terraform plan (trial phase)
...
Apply complete! Resources: 41 added, 0 changed, 0 destroyed.

Three prompts fire in the empty-workspace case: one for “do you want to bootstrap the cluster phase,” one for “apply this terraform plan” inside the nested cluster up, and a third when the trial-phase apply prompts. (On a non-empty workspace, where bnk up skips the cluster bootstrap and the nested cluster up with it, only the trial-phase prompt fires; a ShapeClusterOnly/ShapeSplit bnk up is the common iteration case.) For a 30-minute operation we kept the prompts explicit rather than collapsing them. --auto skips all three:

$ roksbnkctl bnk up --auto

`roksbnkctl bnk down`

Destroys the trial only. The cluster phase keeps running.

On a split workspace (cluster + trial both present), bnk down runs terraform destroy against the trial state — ~41 resources, the same as the trial half of a full down.
On an empty or cluster-only workspace, bnk down refuses: there’s no trial to destroy.
On a legacy single-state workspace, bnk down refuses: the cluster lives in the trial state so a trial-only destroy isn’t possible.

Sample output against a split workspace:

$ roksbnkctl bnk down --auto
→ terraform destroy (trial phase)
  module.license.helm_release.license: Destroying...
  module.cne_instance.kubernetes_manifest.cne: Destroying...
  module.flo.helm_release.flo: Destroying...
  ...
  Destroy complete! Resources: 41 destroyed.

✓ Trial phase destroyed. Cluster phase ~/.roksbnkctl/default/state-cluster/ is intact.
  Run `roksbnkctl bnk up` to deploy another trial against the same cluster.

The shape dispatch matrix

The unscoped roksbnkctl up / down verbs are now shape-aware composites — they detect the on-disk shape of the workspace and delegate to the right phase commands underneath. The full picture for all four shapes and all six commands:

Command	Empty (nothing applied)	ClusterOnly (`cluster up` ran)	Split (cluster + trial both applied)	LegacySingle (v1.0.x state)
`up`	`cluster up` → trial up	trial up	`cluster up` (refresh) → trial up	monolithic trial up (v1.0.x behaviour)
`down`	error: nothing to destroy	`cluster down`	trial down → `cluster down`	monolithic trial down (v1.0.x behaviour)
`bnk up`	confirm + `cluster up` → trial up	trial up	trial up	refuse
`bnk down`	refuse: no trial	refuse: no trial	trial down	refuse
`cluster up`	`cluster up`	`cluster up` (refresh)	`cluster up` (refresh)	refuse
`cluster down`	refuse: nothing to destroy	`cluster down`	refuse: trial exists	refuse

The user-facing simplification: the unscoped up / down “just work” against every shape (including v1.0.x legacy state). The phase-scoped commands (bnk, cluster) only operate when the shape allows isolation and refuse loudly with an actionable message otherwise. Refusals always point at the resolution; see Chapter 11 §“Refusal messages catalogue” for the full catalogue.
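For readers who think in code, the matrix is a two-key lookup. The function below simply transcribes the table into a case statement; the function name and the lower-case shape labels are illustrative and are not roksbnkctl identifiers.

```shell
# Hypothetical lookup mirroring the dispatch matrix: dispatch <command> <shape>.
dispatch() {
  case "$1:$2" in
    up:empty)                     echo "cluster up -> trial up" ;;
    up:cluster-only)              echo "trial up" ;;
    up:split)                     echo "cluster up (refresh) -> trial up" ;;
    up:legacy-single)             echo "monolithic trial up" ;;
    down:empty)                   echo "error: nothing to destroy" ;;
    down:cluster-only)            echo "cluster down" ;;
    down:split)                   echo "trial down -> cluster down" ;;
    down:legacy-single)           echo "monolithic trial down" ;;
    "bnk up:empty")               echo "confirm + cluster up -> trial up" ;;
    "bnk up:cluster-only"|"bnk up:split") echo "trial up" ;;
    "bnk down:empty"|"bnk down:cluster-only") echo "refuse: no trial" ;;
    "bnk down:split")             echo "trial down" ;;
    "cluster up:empty")           echo "cluster up" ;;
    "cluster up:cluster-only"|"cluster up:split") echo "cluster up (refresh)" ;;
    "cluster down:empty")         echo "refuse: nothing to destroy" ;;
    "cluster down:cluster-only")  echo "cluster down" ;;
    "cluster down:split")         echo "refuse: trial exists" ;;
    "bnk up:legacy-single"|"bnk down:legacy-single"|"cluster up:legacy-single"|"cluster down:legacy-single")
                                  echo "refuse" ;;
    *)                            echo "unknown"; return 1 ;;
  esac
}
```

Reading the legacy-single arm as one line makes the rule visible: every phase-scoped command refuses there, and only the unscoped verbs run.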

The engineering version of this table — with the implementation details, the ShapeUnknown edge cases, and the rationale — lives in PRD 06 §“Dispatch table”.

Worked example — iterating on a BNK trial

The headline workflow the v1.1.0 surface unlocks. You’re testing different cne_instance parameter combinations against a stable cluster.

# Step 1 — one-time cluster provision (~38 minutes)
roksbnkctl cluster up --auto
# → terraform apply (cluster phase: deploy_bnk=false forced)
#   ...
#   Apply complete! Resources: 36 added, 0 changed, 0 destroyed.
# ✓ Wrote ~/.roksbnkctl/default/cluster-outputs.json

# Step 2 — first BNK trial (~7 minutes — trial only, cluster is reused)
roksbnkctl bnk up --auto
# → terraform plan (trial phase)
#   Plan: 41 to add, 0 to change, 0 to destroy.
#   ...
#   Apply complete! Resources: 41 added, 0 changed, 0 destroyed.

# Step 3 — poke at the trial, find something to tune
roksbnkctl k get pods -n f5-bnk
roksbnkctl test connectivity

# Step 4 — destroy just the trial (~3 minutes — cluster persists)
roksbnkctl bnk down --auto
# → terraform destroy (trial phase)
#   Destroy complete! Resources: 41 destroyed.
# ✓ Trial phase destroyed. Cluster phase ~/.roksbnkctl/default/state-cluster/ is intact.

# Step 5 — edit config.yaml (or a --var-file) to change cne_instance settings
$EDITOR ~/.roksbnkctl/default/config.yaml

# Step 6 — second BNK trial against the same cluster (~7 minutes; the 30-minute
#          cluster provision from step 1 does NOT repeat)
roksbnkctl bnk up --auto
# → terraform plan (trial phase)
#   ...
#   Apply complete! Resources: 41 added, 0 changed, 0 destroyed.

The win is in step 6: the cluster persists across the bnk down / bnk up boundary, so the second trial deploy is ~7 minutes instead of the ~50 minutes a full down → up cycle would cost in v1.0.x. Across a day of iteration, that’s the difference between five trial permutations and one.

When you’re done with the whole session:

# Step 7 — tear down the cluster too
roksbnkctl cluster down --auto
# (or `roksbnkctl down` from any starting state — see the dispatch matrix above)

Cross-references

Chapter 7 — Quick start — happy-path walkthrough end-to-end.
Chapter 8 — The cluster phase — what cluster up provisions, the two state directories, and how to identify a legacy single-state workspace.
Chapter 11 — Tearing down — phase-aware decision matrix; full refusal-message catalogue; orphan recovery.
Chapter 13 — Terraform variables — full reference for what you can override via --var-file.
Chapter 22 — Throughput testing — once BNK is deployed, validating its data plane.
Chapter 26 — Troubleshooting — long-tail apply failures (SCC violations, propagation lag, kubeconfig 404s) and their fixes.

Tearing down

roksbnkctl down, roksbnkctl bnk down, and roksbnkctl cluster down are the three destroy verbs — the inverses of up, bnk up, and cluster up respectively. This chapter covers what each one removes, the ordering constraint between them, the refusal messages you’ll hit if you ask for the wrong one, what survives a destroy, the --auto flag for non-interactive runs, and the workspace-cleanup story.

The phase-aware decision tree

Which verb do you want? The shape of your workspace and your intent both matter. Start here:

I want to keep the cluster and just tear down the BNK trial:
    → roksbnkctl bnk down

I want to tear down everything (cluster + trial):
    → roksbnkctl down

I want to tear down only the cluster (no trial currently deployed):
    → roksbnkctl cluster down

I'm on a v1.0.x workspace (cluster + trial in one state):
    → roksbnkctl down       (tears down everything in one shot)
    → see Chapter 8 §"Legacy single-state workspaces" to confirm your shape

Quick shape check: ls ~/.roksbnkctl/<workspace>/. If you see state-cluster/, you’re on the v1.1.0 split shape (or cluster-only); if you see only state/, you’re on legacy single-state or the workspace is still empty.

The big rule, stated up front: destroy in reverse of create. Trial first (bnk down), cluster second (cluster down). The unscoped roksbnkctl down does this ordering for you — on a split workspace it runs the trial destroy first and then the cluster destroy. On a legacy single-state workspace it runs a monolithic destroy (the v1.0.x behaviour, byte-for-byte). Either way you don’t have to think about ordering; down is the safe default.
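If you prefer to drive the two phase destroys yourself on a split workspace, the ordering rule reduces to "stop at the first failure". The wrapper below is a sketch: teardown_split and the ROKSBNKCTL override variable are hypothetical, while the two commands it runs are the ones documented in this chapter.

```shell
# ROKSBNKCTL can be pointed at a stub for dry runs; it defaults to the real binary.
ROKSBNKCTL=${ROKSBNKCTL:-roksbnkctl}

# Hypothetical wrapper: trial first, cluster second. If the trial destroy
# fails, the cluster destroy is never attempted.
teardown_split() {
  workspace=$1
  "$ROKSBNKCTL" -w "$workspace" bnk down --auto || return 1
  "$ROKSBNKCTL" -w "$workspace" cluster down --auto
}
```

In day-to-day use the unscoped roksbnkctl down already does this; the wrapper is only useful when you want a hook between the two steps.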

The phase-scoped commands (bnk down, cluster down) are the precision tools — they let you keep one phase across many cycles of the other. They also refuse loudly if you ask them to do something that would orphan resources or that the shape doesn’t allow. The full refusal catalogue is in §“Refusal messages catalogue” below; the rule of thumb is that the error message always names the verb that would actually work.

The three destroys

There are three teardown verbs matching the three slices of state:

`roksbnkctl down` — shape-aware composite

The unscoped down is a shape-aware composite in v1.1.0: it detects the on-disk shape of the workspace and dispatches to the right phase destroys in the right order.

roksbnkctl down

Workspace shape	`down` does
Split (cluster + trial)	trial destroy → cluster destroy
ClusterOnly (only cluster applied)	cluster destroy
LegacySingle (v1.0.x — both in one state)	monolithic destroy (v1.0.x behaviour, byte-for-byte)
Empty	error: `nothing to destroy in this workspace`

This is the safe default: down always does the right thing regardless of shape, and it’s the only teardown verb that works on a legacy single-state workspace.
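The dispatch table can be read as a plain lookup from shape to an ordered list of phase destroys. The sketch below restates it in shell for illustration only. down_plan and the step labels are hypothetical names, and the real implementation may be structured quite differently.

```shell
# Hypothetical restatement of the dispatch table: print the ordered phase
# destroys that the unscoped down performs for a given workspace shape.
down_plan() {  # usage: down_plan <Split|ClusterOnly|LegacySingle|Empty>
  case $1 in
    Split)        echo "trial-destroy cluster-destroy" ;;  # trial first, then cluster
    ClusterOnly)  echo "cluster-destroy" ;;
    LegacySingle) echo "monolithic-destroy" ;;             # v1.0.x behaviour
    Empty)        echo "nothing to destroy in this workspace" >&2; return 1 ;;
    *)            echo "unknown shape: $1" >&2; return 2 ;;
  esac
}
```

Calling down_plan Split prints the two steps in the order they must run, and down_plan Empty fails with the same message the table lists.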

`roksbnkctl bnk down` — destroy the BNK trial only

New in v1.1.0. Tears down everything the trial phase created — the flo Helm release, cne_instance, the license module, cluster-side ServiceAccounts / RoleBindings / SCC bindings, and the null_resources that bootstrap admin tokens — and leaves the cluster running.

roksbnkctl bnk down

What survives:

The ROKS cluster itself
cert-manager
The registry COS instance and its bucket contents (FAR images, license artefacts)
The TGW jumphost
All cluster-phase Terraform state under state-cluster/
cluster-outputs.json (the cluster is still registered)
The workspace’s config.yaml

Roughly 41 resources destroyed on a clean trial-only bnk down. Time is dominated by Helm’s pre-delete hooks and the cne_instance finaliser unwind — usually 2-5 minutes total.

bnk down refuses on Empty, ClusterOnly, and LegacySingle workspaces — there’s nothing to destroy on the first two, and the trial-only isolation isn’t possible on the third. See §“Refusal messages catalogue” for the exact text.

`roksbnkctl cluster down` — destroy the cluster phase

Tears down the cluster + cluster-shared services: the ROKS cluster, transit gateway, registry COS instance, cert-manager Helm release, and the TGW jumphost.

roksbnkctl cluster down

What survives:

The workspace’s config.yaml
~/.roksbnkctl/<workspace>/state/ (now empty of resources but the directory persists)
~/.roksbnkctl/<workspace>/state-cluster/ Terraform state files (the cluster-side state itself is empty; the directory and terraform.tfstate persist)

Roughly 36 resources destroyed. The ROKS cluster destroy alone is 5-10 minutes; everything else is fast.

The post-destroy cleanup deletes cluster-outputs.json automatically — the workspace no longer has a registered cluster.

Order matters: trial first, then cluster

The upstream HCL’s resource graph requires this ordering. The trial-phase resources have implicit dependencies on cluster-phase resources (they live in the cluster, after all), and Terraform’s destroy graph traverses dependencies in reverse. If the cluster phase tries to destroy first, the trial phase’s resources are still there — finalisers block the destroy of the cluster’s namespaces, the cluster-side SCC bindings reference SCCs that are in the way, and so on.

In v1.1.0 roksbnkctl cluster down enforces this ordering with a hard refusal: if the trial state has any resources in it, cluster down errors out and points you at bnk down (or down) instead. The v1.0.x “warning-but-prompt” behaviour is gone — even --auto won’t bypass the guard, because correctness, not confirmation, is the issue. The full refusal text:

$ roksbnkctl cluster down
BNK trial state exists in this workspace; run `roksbnkctl bnk down` first
(or `roksbnkctl down` to tear down both phases)

So in practice, always destroy the trial before the cluster. The unscoped down does this ordering for you on a split workspace; the phase-scoped pair is bnk down then cluster down.

The clean teardown sequence — split workspace, explicit phase commands:

# 1. Destroy the BNK trial
roksbnkctl bnk down --auto

# 2. Now safe to destroy the cluster phase
roksbnkctl cluster down --auto

# 3. (Optional) Delete the workspace itself
roksbnkctl ws delete <name> --force

Or the one-shot equivalent:

# 1. Tear down both phases in order
roksbnkctl down --auto

# 2. (Optional) Delete the workspace itself
roksbnkctl ws delete <name> --force

If you ran roksbnkctl up against a registered cluster (one you didn’t cluster up yourself), the cluster down step doesn’t apply, because the cluster wasn’t yours to destroy. Just bnk down the trial and stop there, then optionally unregister by deleting cluster-outputs.json.

Refusal messages catalogue

The phase-scoped destroy verbs refuse loudly when the shape doesn’t allow what you’ve asked for. Every refusal names the verb that would actually work. If you hit one in the wild, grep your terminal output for the message text and you should land here:

Command + shape	Refusal text	Resolution
`bnk down` on LegacySingle	this workspace is legacy single-state; `bnk down` can't isolate the trial phase. Use `roksbnkctl down` to tear down both, or migrate the state first	Use `roksbnkctl down`; the legacy state has the trial and cluster in one file, so a trial-only destroy isn’t possible. See Chapter 8 §“Legacy single-state workspaces”.
`bnk down` on Empty or ClusterOnly	`no BNK trial state to destroy in this workspace`	Nothing to do — no trial is deployed. If you want to destroy the cluster, use `roksbnkctl cluster down`.
`cluster down` on LegacySingle	this workspace is legacy single-state; cluster and BNK trial share one state. Use `roksbnkctl down` to tear down both, or migrate the state first	Use `roksbnkctl down`.
`cluster down` on Split	BNK trial state exists in this workspace; run `roksbnkctl bnk down` first (or `roksbnkctl down` to tear down both phases)	Run `bnk down` first to remove the trial, then `cluster down` for the cluster — or `roksbnkctl down` to do both in one shot.
`cluster down` on Empty	`nothing to destroy in this workspace`	Nothing to do — the cluster hasn’t been provisioned.
`down` on Empty	`nothing to destroy in this workspace`	Nothing to do — the workspace has no state.
`cluster up` on LegacySingle	this workspace was provisioned with v1.0.x single-state — its cluster lives in the trial state file. Use `roksbnkctl up` to operate on it, or migrate the state to two-phase shape first	Use `roksbnkctl up`. The cluster already exists in the trial state; applying the cluster phase separately would create a second one.
`bnk up` on LegacySingle	this workspace is legacy single-state; `bnk up` can't isolate the trial phase. Use `roksbnkctl up` for in-place behavior, or migrate the state first	Use `roksbnkctl up`.

The “migrate the state first” references in the four LegacySingle messages describe a future roksbnkctl migrate command that does not exist in v1.1.0. The refusals point at it so the wording stays valid once migrate ships; until then, the unscoped up / down is the working alternative for legacy workspaces.
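For scripting, the teardown rows of the catalogue reduce to a lookup from verb and shape to either "ok" or the verb that works instead. The following is a sketch of that lookup only. teardown_advice is a hypothetical name, and the strings it prints are paraphrases, not the tool’s actual refusal text.

```shell
# Hypothetical lookup over the teardown rows of the refusal catalogue.
# Prints "ok" when the verb is allowed on that shape, otherwise the
# alternative the catalogue points at.
teardown_advice() {  # usage: teardown_advice "<verb>" <shape>
  case "$1:$2" in
    "bnk down:Split")            echo "ok" ;;
    "bnk down:LegacySingle")     echo "use: roksbnkctl down" ;;
    "bnk down:ClusterOnly"|"bnk down:Empty")
                                 echo "nothing to do: no trial state" ;;
    "cluster down:ClusterOnly")  echo "ok" ;;
    "cluster down:Split")        echo "use: roksbnkctl bnk down first, or roksbnkctl down" ;;
    "cluster down:LegacySingle") echo "use: roksbnkctl down" ;;
    "down:Split"|"down:ClusterOnly"|"down:LegacySingle")
                                 echo "ok" ;;
    "cluster down:Empty"|"down:Empty")
                                 echo "nothing to do: nothing to destroy" ;;
    *)                           echo "unknown verb or shape" >&2; return 2 ;;
  esac
}
```

teardown_advice "cluster down" Split prints the hint to run bnk down first, which matches the hard refusal described earlier in this chapter.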

What survives a destroy

The contract: roksbnkctl never destroys local state without explicit consent, and never destroys cloud resources outside its Terraform state.

After a successful down:

Survives	Where
Workspace config	`~/.roksbnkctl/<name>/config.yaml`
Workspace directory + state files	`~/.roksbnkctl/<name>/` (empty `state/`; `state-cluster/` untouched if `cluster down` not run)
OS keychain entry for the API key	per-workspace, named `roksbnkctl/<name>/ibmcloud_api_key`
`~/.kube/config`	left in place
The cluster (if only trial was destroyed)	runs and bills as before
The registry COS bucket’s contents	FAR images, JWT licenses, schematic state — survive cluster destroy too if the bucket was created outside the bundled HCL
`~/.roksbnkctl/known_hosts`	SSH host keys persist; deleting a workspace does not clear them

Re-running up against a down’d workspace re-creates everything from scratch. The workspace’s config.yaml is preserved precisely so this re-create can use the same inputs without re-prompting.

The COS bucket point is worth highlighting: the bundled HCL provisions the COS instance but generally does not provision the buckets inside it (those are written by post-apply provisioners or by the BNK runtime itself). When cluster down destroys the COS instance, the bucket goes with it — but if the COS instance was created out-of-band (e.g. by a registered cluster’s owner) and roksbnkctl is just attaching, then cluster down doesn’t apply and the COS survives.

`--auto` for non-interactive runs

All three destroy commands prompt for confirmation by default:

$ roksbnkctl down
This will destroy workspace "default"'s resources.
Continue? [y/N]:

$ roksbnkctl bnk down
This will destroy the BNK trial for workspace "default". The cluster phase
will remain in place — run `roksbnkctl cluster down` to remove it too.
Continue? [y/N]:

$ roksbnkctl cluster down
This will destroy the cluster phase for workspace "default" (ROKS + transit gateway + registry COS + cert-manager + jumphost).
Continue? [y/N]:

--auto skips the prompt — required for CI / scripted pipelines:

roksbnkctl down --auto
roksbnkctl bnk down --auto
roksbnkctl cluster down --auto

--auto does not override the shape-based refusals (see §“Refusal messages catalogue” above) — those are correctness guards, not confirmation prompts. If trial state is present, cluster down --auto still refuses; on a legacy single-state workspace, bnk down --auto and cluster down --auto still refuse.

Unlike `up`, transient errors need a manual re-run

down doesn’t share up’s explicit retry-on-transient-error logic, but Terraform’s destroy is naturally idempotent: re-running down after a partial destroy picks up where the previous run left off. If you see a transient network error during destroy, just re-run:

roksbnkctl down --auto
# (some resources destroyed, then transient error)

roksbnkctl down --auto
# (picks up where it left off, completes)

The same applies to cluster down. ROKS cluster destroy specifically can take longer than expected when the master is propagating its delete state; wait a few minutes and retry if you see master-not-found errors.
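In a pipeline, the manual re-run can be automated with a generic retry wrapper. This is a sketch rather than a roksbnkctl feature: retry is a hypothetical helper, and it assumes the wrapped command exits non-zero on failure. It cannot tell a transient error from a shape-based refusal, so keep the attempt count small.

```shell
# Hypothetical helper: re-run a command until it succeeds or the attempt
# budget is used up, sleeping between attempts.
retry() {  # usage: retry <max-attempts> <delay-seconds> <command> [args...]
  max=$1 delay=$2
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

A teardown step would then read retry 3 120 roksbnkctl down --auto, which allows up to three attempts two minutes apart.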

Cleaning up workspaces

A successful down leaves the workspace directory in place. You usually want to clean that up too:

roksbnkctl ws delete <name> --force

Two safety rails on ws delete:

Refuses to delete the current workspace. Use the parking-lot pattern if you need to drop your current workspace.
Refuses if Terraform state still lists resources (unless --force). Catches the case where you forgot to run down first.

The --force flag overrides the second check only; the current-workspace refusal still applies, which is why the parking-lot pattern exists. If you ws delete --force a workspace that still has provisioned cloud resources, you’ll have leaked them. There’s no auto-recovery; you’d need to find them via the IBM Cloud console and delete them by hand.

The full clean-as-you-go pattern from scripts/e2e-test.sh (Phase D destroys; Phase H parks and deletes):

# 1. Destroy the trial
roksbnkctl bnk down --auto

# 2. Destroy the cluster phase
roksbnkctl cluster down --auto

# 3. Park the current-workspace pointer somewhere harmless
roksbnkctl ws new e2e-cleanup
roksbnkctl ws use e2e-cleanup

# 4. Now the original workspace is no longer current — safe to delete
roksbnkctl ws delete default --force

# 5. (Optional) clean up the parking lot too
roksbnkctl ws delete e2e-cleanup --force

Steps 3-5 are the parking-lot pattern from Chapter 6. They’re necessary when the workspace you want to delete is currently the active one: ws delete refuses to remove the current workspace because that would leave a dangling current_workspace pointer.

Cost note: an undestroyed cluster keeps billing

ROKS clusters bill at roughly $0.30/hour per cluster + worker pool — call it $7/day for a 2-worker cluster, plus a few cents/day for the VPC / load balancers / COS / jumphost. A forgotten cluster can rack up real cost over a weekend.
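The arithmetic behind that estimate is easy to reproduce. The sketch below uses the approximate $0.30/hour figure quoted above as its default rate. cluster_cost is a hypothetical helper, and the result is a rough estimate, not a price lookup.

```shell
# Rough running cost for a forgotten cluster. The 0.30 USD/hour default is
# the approximate figure from the text, not a quoted IBM Cloud price.
cluster_cost() {  # usage: cluster_cost <hours> [hourly-rate]
  awk -v hours="$1" -v rate="${2:-0.30}" 'BEGIN { printf "%.2f\n", hours * rate }'
}

cluster_cost 24   # one day: prints 7.20
cluster_cost 60   # Friday 18:00 to Monday 06:00: prints 18.00
```

The ancillary VPC, load balancer, COS, and jumphost charges come on top of this figure.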

To verify what’s still running in your account:

IBM Cloud console → Kubernetes → Clusters: every cluster, billing or not.
IBM Cloud console → VPC Infrastructure → VPCs: networks left over after a partial destroy.
IBM Cloud console → Resource list: exhaustive view of everything in the account, filterable by resource group.

If you find a leaked cluster from a past roksbnkctl run, the right move is to re-attach to it via roksbnkctl cluster register <name> and then cluster down --auto — roksbnkctl cleans up cleanly when it has the cluster in its state. Manually deleting via the console works too but leaves dangling VPCs and security groups that the bundled HCL would have cleaned up.

roksbnkctl status and roksbnkctl cluster show both report the cluster identity recorded in cluster-outputs.json, but they don’t probe for “are there other clusters in this account?” — that’s deliberately not their job. The IBM Cloud console is the canonical source of truth for what’s billing.

Workspace deletion ≠ destroy

A subtle but important distinction. roksbnkctl ws delete removes the local workspace directory and the OS-keychain API key entry. It does not destroy any cloud resources. If you ws delete --force without first running down / cluster down, the cloud resources keep running and you’ve lost the local Terraform state that roksbnkctl would use to destroy them.

In that scenario, recovery is:

Find the leaked cluster in the IBM Cloud console.
Recreate the workspace: roksbnkctl init -w recovery.
Register the existing cluster: roksbnkctl cluster register <leaked-cluster-name>.
Then run roksbnkctl cluster down --auto to destroy it cleanly.

The Terraform state is regenerated implicitly during register + plan; the resources roksbnkctl would otherwise have tracked get re-discovered through the IBM SDK lookups. It’s not seamless, but it’s recoverable.

The “still has resources” check on ws delete exists exactly to prevent this scenario, so don’t bypass it with --force without thinking about the consequences.

Worked example: register an existing cluster, deploy BNK, tear down

End-to-end Part III scenario: somebody on your team already provisioned a ROKS cluster manually via the IBM Cloud console (or via a different terraform tree); you need to deploy BNK on top of it using roksbnkctl, validate, and tear the whole thing down cleanly. The flow exercises Chapter 9, Chapter 10, and this chapter end-to-end.

# 1. Workspace bootstrap — same as a fresh deploy
roksbnkctl init -w preexisting
# (answer prompts for region + resource group; pick the values matching
#  the existing cluster's location)

# 2. Register the already-running cluster into the workspace
roksbnkctl cluster register existing-bnk-cluster -w preexisting
# Expected:
#   → Discovering cluster "existing-bnk-cluster" via IBM Cloud API ...
#   ✓ Cluster ID: <crn>
#   ✓ Wrote ~/.roksbnkctl/preexisting/cluster-outputs.json
#   ✓ Fetched admin kubeconfig to ~/.kube/config (chmod 0600)

# 3. Verify roksbnkctl sees the cluster
roksbnkctl status -w preexisting
# Expected: cluster Ready, workers count, no BNK pods yet

# 4. Deploy BNK on top — `up` is idempotent over the existing cluster
roksbnkctl up --auto -w preexisting
# Expected: terraform applies the cert-manager + flo + cne_instance +
# license modules only; the roks_cluster module sees the cluster already
# exists and skips. ~10-15 min vs ~50 min for a from-scratch up.

# 5. Validate
roksbnkctl test -w preexisting
# Expected: green across connectivity + dns

# 6. Tear down — destroys the BNK overlay; the registered cluster survives
roksbnkctl down --auto -w preexisting
# Expected:
#   → terraform destroy (auto-approved)
#   Destroy complete! Resources: N destroyed.
#   ✓ Workspace "preexisting" state retained at ~/.roksbnkctl/preexisting/

The destroy count N is the BNK overlay + jumphost only — typically 30-40 resources, not the from-scratch ~77 count. cluster register is a discovery-only path: terraform state holds the overlay modules (cert_manager, flo, cne_instance, license) and the testing jumphost, but not the roks_cluster module, because the cluster pre-existed roksbnkctl. down destroys only what terraform knows about, so the registered cluster survives untouched.

If you also want to release the underlying cluster, you have to tear it down through whatever provisioned it originally (the IBM Cloud console, or the separate terraform tree your teammate used). roksbnkctl cluster down only works against clusters roksbnkctl cluster up created in the first place — see Chapter 8 for the cluster-phase boundary.

The full register → up → test → down loop above is what Phase E + Phase H of the e2e plan exercise; see Chapter 23 for the CI version.

Cross-references

Chapter 6 — Workspaces — ws delete mechanics and the parking-lot pattern.
Chapter 8 — The cluster phase — what cluster up provisions and cluster down removes.
Chapter 9 — Registering an existing cluster — the cluster register mechanics the walkthrough builds on.
Chapter 10 — Deploying BNK trials — what up provisions and down removes.
Chapter 26 — Troubleshooting — recovery from partial-destroy and orphan-state scenarios.

Workspace config (config.yaml)

This chapter is the field-by-field reference for the per-workspace config.yaml. If you’ve read Chapter 6 — Workspaces you’ve seen the on-disk layout; this chapter zooms in on the YAML file that drives everything else (init, up, down, cluster up, the test suite, the SSH targets, the new execution backends).

You don’t usually edit this file by hand. roksbnkctl init generates it interactively; later runs read it. But because every other knob in the tool reads from here, it’s worth knowing what every field means and what defaults apply when you leave one out.

File location

Each workspace’s config lives at:

~/.roksbnkctl/<workspace>/config.yaml

Override the base directory with the ROKSBNKCTL_HOME env var (test fixtures use this; everyday users shouldn’t need it). The file is created mode 0644: writable only by your user but readable by other local accounts, the same trust posture as the surrounding workspace directory.
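A script that needs to find the file can compute the path the same way. This is a minimal sketch, assuming ROKSBNKCTL_HOME replaces the ~/.roksbnkctl directory itself rather than its parent. workspace_config_path is a hypothetical helper name.

```shell
# Hypothetical helper: resolve the per-workspace config path, honouring
# ROKSBNKCTL_HOME when it is set (assumed to replace ~/.roksbnkctl).
workspace_config_path() {  # usage: workspace_config_path <workspace>
  base=${ROKSBNKCTL_HOME:-$HOME/.roksbnkctl}
  printf '%s/%s/config.yaml\n' "$base" "$1"
}
```

With ROKSBNKCTL_HOME unset, workspace_config_path dev resolves under your home directory; with it set to a fixture directory, the same call resolves under that directory instead.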

There’s also a global ~/.roksbnkctl/config.yaml at the top level — it holds the current_workspace pointer and other user-wide preferences. That’s a different file with a different schema; this chapter is about the per-workspace one.

When it gets written

Action	Effect on `config.yaml`
`roksbnkctl init`	Creates the file from interactive prompts. Existing file? Asks before overwriting.
`roksbnkctl init --upgrade-tf`	Updates `tf_source:` only; leaves every other field alone.
`roksbnkctl targets add <name> ...`	Adds an entry under `targets:`.
`roksbnkctl targets remove <name>`	Removes the entry.
`roksbnkctl up` (post-apply)	Auto-populates `targets.jumphost` if the upstream HCL emitted a TGW jumphost output.
Anything else	Reads the file. Doesn’t write back.

Direct hand-editing is supported (the file is plain YAML) but discouraged for fields that have dedicated commands — adding an SSH target via roksbnkctl targets add keeps the schema validation in one place.

Top-level structure

ibmcloud:        # IBM Cloud account + auth
  region: ca-tor
  resource_group: default
  api_key_source: keychain
  # api_key_b64: <base64-of-api-key>   # OPTIONAL fallback when keychain unavailable

cluster:         # ROKS cluster identity
  create: true
  name: tf-openshift-cluster
  openshift_version: "4.18"
  workers_per_zone: 2

bnk:             # BNK trial knobs (optional; falls through to upstream HCL defaults)
  cneinstance_size: Small
  far_repo_url: repo.f5.com
  manifest_version: 2.3.0-3.2598.3-0.0.170

test:            # test-suite tuning (optional)
  throughput:
    duration: 30
    streams: 8
  connectivity:
    extra_hosts:
      - https://my.gslb.example.com

tf_source:       # where the Terraform HCL comes from
  type: embedded         # embedded | github | local

targets:         # SSH targets (see Chapter 15)
  jumphost:
    host: 169.45.91.177
    user: ubuntu
    key_source: tf-output:jumphost_shared_key

exec:            # per-tool execution backend defaults (see Chapter 17)
  ibmcloud:  { backend: local }
  iperf3:    { backend: k8s }
  terraform: { backend: local }

cos:             # optional COS supply-chain config
  instance: bnk-orchestration
  bucket: bnk-schematics-resources

Every block except ibmcloud:, cluster:, and tf_source: is optional. Omit a block and the tool falls through to either a documented default (covered below) or the upstream HCL’s own default for terraform variables.

`ibmcloud:`

ibmcloud:
  region: ca-tor
  resource_group: default
  api_key_source: keychain
  api_key_b64: ""

Field	Type	Default	Notes
`region`	string	none — required	IBM Cloud region for cluster, VPC, COS. Examples: `ca-tor`, `us-south`, `eu-de`.
`resource_group`	string	`default`	Account-level resource group all created resources land in.
`api_key_source`	enum	empty (auto-resolve chain)	`env` \| `keychain` \| `config` \| `prompt`. Pin the resolver to one source; leave empty to walk the full chain. See Chapter 14.
`api_key_b64`	string	empty	Base64-encoded API key, obfuscation only — not encryption. The fallback when no OS keychain is available (e.g. WSL2 without libsecret). Treat the file as plaintext-credential-equivalent.

The plaintext field name api_key: is rejected at load time — roksbnkctl refuses to read a workspace config that contains it. The encoded api_key_b64: form is the only inline path. Full discussion in Chapter 14 — Credentials and the resolver chain.
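Because api_key_b64 is obfuscation rather than encryption, anyone who can read the file can recover the key with one command. The sketch below illustrates that, assuming the field holds standard padded base64 of the raw key (the exact encoding variant is an assumption here). encode_api_key and decode_api_key are hypothetical helper names.

```shell
# Encode a key the way the api_key_b64 field is assumed to store it
# (standard base64 on a single line).
encode_api_key() { printf '%s' "$1" | base64 | tr -d '\n'; }

# Decoding needs no secret at all, which is why this is not encryption.
decode_api_key() { printf '%s' "$1" | base64 --decode; }

encoded=$(encode_api_key 'not-a-real-key')
decode_api_key "$encoded"; echo   # prints not-a-real-key
```

This is why the file should be handled like any other plaintext credential whenever the field is populated.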

`cluster:`

cluster:
  create: true
  name: tf-openshift-cluster
  openshift_version: "4.18"
  workers_per_zone: 2

Field	Type	Default	Notes
`create`	bool	`true`	When `true`, `roksbnkctl cluster up` provisions a new ROKS cluster. When `false`, `cluster register <name>` adopts an existing one.
`name`	string	none — required	OpenShift cluster name when `create=true`; cluster ID-or-name to adopt when `create=false`.
`openshift_version`	string	empty (latest)	E.g. `"4.18"`. Empty lets IBM Cloud pick the current default. Quote it — YAML otherwise parses `4.18` as a float.
`workers_per_zone`	int	`1`	Worker nodes per AZ; cluster runs across 3 AZs by default in MZR regions, so `2` ⇒ 6 workers total.

The cluster: block translates to terraform variables create_roks_cluster, openshift_cluster_name, roks_cluster_id_or_name, openshift_cluster_version, roks_workers_per_zone — see Chapter 13 and Chapter 29 for the full mapping.

`bnk:`

bnk:
  cneinstance_size: Small
  far_repo_url: repo.f5.com
  manifest_version: 2.3.0-3.2598.3-0.0.170

Field	Type	Default	Notes
`cneinstance_size`	enum	upstream HCL default (`Small`)	`Small` \| `Medium` \| `Large`. Sets `cneinstance_deployment_size`.
`far_repo_url`	string	upstream HCL default (`repo.f5.com`)	The FAR Docker/Helm repo. Override only for staging/internal repos.
`manifest_version`	string	upstream HCL default	Pin a specific BNK manifest chart version. Leave empty to track the upstream HCL’s pin.

Every field here is optional — leave the block out entirely and you get the upstream HCL’s defaults for all three.

`test:`

test:
  throughput:
    image: networkstatic/iperf3:latest
    duration: 30
    streams: 8
    default_mode: north-south
  connectivity:
    extra_hosts:
      - https://my.gslb.example.com
      - https://internal.example.test

Field	Type	Default	Notes
`throughput.image`	string	`networkstatic/iperf3:latest`	iperf3 image used by the throughput test (when running with the `local` or `ssh` backends). The `k8s` backend uses the GHCR image (`ghcr.io/jgruberf5/roksbnkctl-tools-iperf3:<version>`) instead.
`throughput.duration`	int seconds	`30`	iperf3 `-t` flag.
`throughput.streams`	int	`8`	iperf3 `-P` flag.
`throughput.default_mode`	enum	`north-south`	`north-south` \| `east-west`. The connectivity vector to test by default.
`connectivity.extra_hosts`	[]string	empty	Extra URLs the connectivity test probes alongside the canonical IBM/F5 endpoints.

`tf_source:`

tf_source:
  type: embedded

`type`	Other fields	Use case
`embedded` (default)	none	Use the HCL bundled into the `roksbnkctl` binary via `go:embed`. The recommended path for users — install one binary, get matched CLI + Terraform together.
`github`	`repo: "owner/name"`, `ref: "v0.6.1"`	Pull a tarball from a GitHub release. Useful for testing forks or pinning to a specific upstream tag.
`local`	`path: "/abs/path/to/tf-source"`	Point Terraform at an on-disk directory. For active development on the HCL itself.

An empty type is treated as embedded (legacy / forgot-to-set).

roksbnkctl init --upgrade-tf is the helper for bumping the source between versions without retyping the rest of the config — see “Editing by hand vs helpers” below.

`targets:` — SSH targets

targets:
  jumphost:
    host: 169.45.91.177
    user: ubuntu
    key_source: tf-output:jumphost_shared_key
  bastion:
    host: ops.example.com
    user: jgruber
    key_path: ~/.ssh/id_ed25519

Each entry has host, user, optional port (default 22), and exactly one of key_path or key_source. The key_source enum supports agent and tf-output:<name>.
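The entry rules can be expressed as a small check. This sketch mirrors only the constraints stated above (required host and user, port defaulting to 22, exactly one key field, and the two documented key_source forms). validate_target is a hypothetical helper, and the real schema validation may enforce more.

```shell
# Hypothetical check of one targets: entry against the rules in the text.
validate_target() {  # usage: validate_target <host> <user> <key_path> <key_source> [port]
  host=$1 user=$2 key_path=$3 key_source=$4 port=${5:-22}
  if [ -z "$host" ] || [ -z "$user" ]; then
    echo "host and user are required" >&2; return 1
  fi
  if [ -n "$key_path" ] && [ -n "$key_source" ]; then
    echo "set only one of key_path or key_source" >&2; return 1
  fi
  if [ -z "$key_path" ] && [ -z "$key_source" ]; then
    echo "set one of key_path or key_source" >&2; return 1
  fi
  case $key_source in
    ""|agent|tf-output:?*) ;;   # empty means key_path is in use
    *) echo "unsupported key_source: $key_source" >&2; return 1 ;;
  esac
  echo "ok: $user@$host:$port"
}
```

Passing an empty string for the unused key field keeps the positional arguments aligned, for example validate_target ops.example.com jgruber ~/.ssh/id_ed25519 "".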

The deep reference is Chapter 15 (SSH targets), and the user-facing prose is Chapter 16 (The --on flag and SSH jumphosts). This chapter just notes the schema’s place in the overall config.

You don’t typically edit this block by hand. roksbnkctl up auto-populates jumphost post-apply, and roksbnkctl targets add ... populates the rest.

`exec:` — execution-backend defaults

exec:
  ibmcloud:  { backend: local }
  iperf3:    { backend: k8s }
  terraform: { backend: local }

Per-tool defaults for the --backend system. Each entry is keyed by the tool name (ibmcloud, iperf3, terraform, and others as the matrix grows) and selects which execution backend that tool uses by default. Allowed backend values:

Backend	Notes
`local`	`os/exec` against the host binary. The default for `terraform` and `ibmcloud`.
`docker`	Runs inside a vendored container image. Frozen toolchain version, no host install.
`k8s`	Runs inside the cluster (long-lived ops pod or one-shot Job). Default for `iperf3`.
`ssh`	Runs on a registered SSH target. Format: `ssh:<target-name>`.

A --backend <value> flag on the command line overrides the workspace config for that single invocation. The flag wins; the config sets the default.

The iperf3 default is k8s because measuring throughput from a laptop’s internet uplink isn’t useful — you want the test to run from a network location adjacent to or inside the cluster. The local default is wrong for that tool, so the workspace config flips it.
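The precedence between the flag, the workspace config, and the built-in default amounts to "first non-empty value wins". Here is that rule as a sketch. resolve_backend is a hypothetical name, and the three inputs are passed in by hand rather than read from config.yaml.

```shell
# Hypothetical helper: pick the effective backend for one tool.
# Order: --backend flag, then the exec: entry in config.yaml, then the
# tool's built-in default.
resolve_backend() {  # usage: resolve_backend <flag-value> <config-value> <builtin-default>
  if [ -n "$1" ]; then
    echo "$1"
  elif [ -n "$2" ]; then
    echo "$2"
  else
    echo "$3"
  fi
}

resolve_backend "" "" k8s                # no flag, no config: prints k8s
resolve_backend "ssh:jumphost" k8s k8s   # the flag wins: prints ssh:jumphost
```

The flag applies to a single invocation only, so the next run without it falls back to the configured value.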

Chapter 17 — Execution backends covers the full backend matrix; Chapter 18 — Choosing a backend per tool is the decision tree.

`cos:` — COS supply-chain (optional)

cos:
  instance: bnk-orchestration
  bucket: bnk-schematics-resources
  upload:
    - source: ./local/f5-far-auth-key.tgz
      key: f5-far-auth-key.tgz
    - source: ./local/trial.jwt
      key: trial.jwt

Field	Type	Notes
`instance`	string	COS instance name holding the FAR auth key + JWT.
`bucket`	string	COS bucket name within that instance.
`upload`	[]{source, key}	Optional pre-flight uploads from local files into the bucket. Useful for CI scenarios where the supply-chain artefacts are produced by the pipeline.

The block is optional — if you’ve already populated COS by hand or via the upstream HCL’s roks_cos_instance_name variable, you don’t need it. Chapter 25 — COS supply chain management covers the full workflow.

Behaviour when fields are missing

roksbnkctl falls through three layers in order: workspace config → upstream HCL default → fail.

Missing field	Behaviour
`ibmcloud.region`	`roksbnkctl init` prompts; programmatic loads error with “region is empty”.
`ibmcloud.api_key_source`	Resolver walks the full chain (env → keychain → config → prompt).
`ibmcloud.api_key_b64`	Skipped in the resolver chain.
`cluster.name`	`init` prompts; programmatic loads error.
`cluster.openshift_version`	Empty string passed to upstream HCL; the module picks the current default.
`cluster.workers_per_zone`	Falls through to `1` (upstream default).
`bnk.*`	Field is omitted from the generated `terraform.tfvars` and the upstream HCL default applies.
`tf_source`	Treated as `type: embedded` (legacy default).
`targets.*`	Block absent ⇒ `roksbnkctl --on jumphost` errors with “no target named jumphost”; auto-populated by `up`.
`exec.*`	Per-tool defaults at v1.0: `ibmcloud`→`local`, `terraform`→`local`, `iperf3`→`k8s`, DNS probe→`local`. Override per-tool via this block, or per-invocation via `--backend`.
`cos.*`	No pre-flight uploads; the COS instance/bucket are read from the upstream HCL’s tfvars instead.

The general rule: if you don’t write it in config.yaml, roksbnkctl doesn’t write it into terraform.tfvars, and the upstream HCL’s default = ... clause takes over. The full upstream defaults are listed in Chapter 29.

How `--var-file` interacts with `config.yaml`

Both roksbnkctl up and roksbnkctl plan/apply/destroy accept the same --var-file flag terraform itself accepts (repeatable, later files win). The layering rule is:

1. config.yaml-derived terraform.tfvars        (written first by roksbnkctl)
2. ~/.roksbnkctl/<ws>/terraform.tfvars.user  (optional manual override)
3. --var-file <path>                           (CLI; repeatable)

Later layers override earlier. Concretely: config.yaml’s cluster.workers_per_zone: 2 writes roks_workers_per_zone = 2 into the generated tfvars. If you then pass --var-file ./bigger.tfvars containing roks_workers_per_zone = 5, terraform sees 5. The config.yaml value didn’t get re-applied; --var-file wins.

The terraform.tfvars.user middle layer is for when you want a workspace-local override that survives across runs without modifying config.yaml — it’s typically used for fields the YAML schema doesn’t model (rare; the schema covers the common knobs). Chapter 13 goes deep on this.
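To preview which value wins before running terraform, the layering can be approximated with awk. This is a rough sketch that only understands single-line name = value assignments; real tfvars files can contain maps, lists, and heredocs, which it does not parse. effective_tfvars is a hypothetical helper, and files are passed in layer order.

```shell
# Hypothetical preview of tfvars layering: later files override earlier
# ones, and keys keep the position where they first appeared.
effective_tfvars() {  # usage: effective_tfvars <file>...
  awk '
    /^[ \t]*(#|$)/ { next }                       # skip comments and blank lines
    index($0, "=") == 0 { next }                  # skip lines without an assignment
    {
      key = substr($0, 1, index($0, "=") - 1)
      val = substr($0, index($0, "=") + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", key)
      gsub(/^[ \t]+|[ \t]+$/, "", val)
      if (!(key in value)) order[++n] = key       # remember first-seen order
      value[key] = val                            # last assignment wins
    }
    END { for (i = 1; i <= n; i++) printf "%s = %s\n", order[i], value[order[i]] }
  ' "$@"
}
```

Running effective_tfvars on the generated terraform.tfvars followed by ./bigger.tfvars shows roks_workers_per_zone = 5, matching the worked case above.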

The IBMCLOUD_API_KEY is the one exception that never goes through tfvars on disk. It’s passed as a TF_VAR_ibmcloud_api_key env var on the terraform invocation. --var-file cannot supply the API key — the resolver chain in Chapter 14 is the only path.
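The same technique works in your own wrapper scripts: export the value for one child process only, so it never lands in a file. The sketch below is an illustration of that pattern, not roksbnkctl’s implementation. run_with_api_key is a hypothetical helper, and it assumes IBMCLOUD_API_KEY is already set in the calling shell.

```shell
# Hypothetical wrapper: run one command with TF_VAR_ibmcloud_api_key in its
# environment. The function body is a subshell, so the exported variable
# disappears when the command returns and nothing is written to disk.
run_with_api_key() (
  TF_VAR_ibmcloud_api_key=${IBMCLOUD_API_KEY:?IBMCLOUD_API_KEY is not set}
  export TF_VAR_ibmcloud_api_key
  "$@"
)
```

For example, run_with_api_key terraform plan gives terraform the variable for that one run, and the calling shell never sees TF_VAR_ibmcloud_api_key.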

Editing by hand vs helpers

Several commands manage subsets of config.yaml so you don’t have to:

Subset	Helper
Whole file (interactive)	`roksbnkctl init`
`tf_source:` only	`roksbnkctl init --upgrade-tf`
`targets:` block	`roksbnkctl targets add/remove`
`ibmcloud.api_key_b64`	`roksbnkctl init` (after entering the key, it offers to save)

When you do edit by hand, the load-time validators run on next roksbnkctl invocation:

The plaintext-secret heuristic rejects an api_key: field (it must be api_key_b64: to be tolerated).
Workspace name validation runs on directory access (workspace names must match [A-Za-z0-9][A-Za-z0-9_.-]{0,63}).
YAML parse errors surface a line number.

If a hand edit breaks the file, every command that reads the workspace fails fast with the parse error path, so you’ll know within one invocation.
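The workspace-name rule is easy to check before creating directories by hand. This sketch applies the pattern quoted above with grep -E. valid_workspace_name is a hypothetical helper and assumes a single-line name.

```shell
# Hypothetical check: does a (single-line) name match
# [A-Za-z0-9][A-Za-z0-9_.-]{0,63}, i.e. 1 to 64 characters starting with
# a letter or digit?
valid_workspace_name() {  # usage: valid_workspace_name <name>
  printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9][A-Za-z0-9_.-]{0,63}$'
}

valid_workspace_name e2e-cleanup && echo "e2e-cleanup: ok"
valid_workspace_name -dev || echo "-dev: rejected"
```

Names that start with punctuation, contain spaces, or run past 64 characters are rejected.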

Worked example: bootstrap a workspace from scratch

End-to-end Part IV scenario: brand-new laptop, no roksbnkctl workspaces yet, an IBM Cloud API key in your password manager. Goal: a usable workspace with the key in the OS keychain, the right region + resource group resolved, and terraform.tfvars ready to drive the HCL.

# 1. roksbnkctl init — interactive bootstrap
$ roksbnkctl init
Workspace name [default]: dev
IBM Cloud region [ca-tor]:
IBM Cloud resource group [default]:
Enter IBM Cloud API key (input hidden):
Save the key for future runs? [Y/n]: y
  ✓ saved to OS keychain (service: roksbnkctl, account: dev/ibmcloud_api_key)
Cluster name [tf-openshift-cluster]: dev-cluster
Workers per zone [1]: 2
✓ Created workspace "dev"

The resulting ~/.roksbnkctl/dev/config.yaml:

ibmcloud:
  region: ca-tor
  resource_group: default
  api_key_source: keychain
cluster:
  create: true
  name: dev-cluster
  workers_per_zone: 2
tf_source:
  type: embedded

That’s the minimum. Everything else (bnk:, test:, targets:, exec:, cos:) is empty and falls through to defaults. The API key can also be supplied non-interactively from your password manager’s CLI by setting IBMCLOUD_API_KEY in the environment of the init invocation:

# Alternative: pre-set IBMCLOUD_API_KEY so init resolves it from env rather than prompting
IBMCLOUD_API_KEY=$(op read 'op://Private/IBM Cloud/api-key') roksbnkctl init -w dev

op here is the 1Password CLI; the op://... URI is its secret-reference scheme. Any password-manager CLI that prints a secret to stdout works the same way: Bitwarden (bw), gopass, aws secretsmanager get-secret-value, Doppler, and so on. The only thing roksbnkctl cares about is that IBMCLOUD_API_KEY is set in the environment when init runs.

Chapter 14 §“The IBMCLOUD_API_KEY resolver chain” covers the full env → keychain → workspace api_key_b64 → TTY-prompt order; this env-var path is the first link in that chain, so anything init resolves at bootstrap time follows the same precedence later invocations use. Once init has saved the key to the OS keychain (the default sink), no further prompting is needed. init still prompts interactively for the remaining workspace metadata (region, resource group, cluster name) — a fully non-interactive bootstrap is on the v1.x roadmap.
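The resolver order can be pictured as a loop over the four sources that stops at the first one holding a value. The sketch below is a simplified model, not the real resolver. api_key_source is a hypothetical name, the keychain lookup is stubbed as a plain argument, and the behaviour of a pinned source that turns out to be empty (an error here) is an assumption.

```shell
# Hypothetical model of the resolver chain: env -> keychain -> config -> prompt.
# Prints the name of the source that would supply the key.
api_key_source() {  # usage: api_key_source <pinned-or-empty> <keychain-value> <api_key_b64>
  pinned=$1 keychain_value=$2 b64=$3
  for src in env keychain config prompt; do
    # A pinned source skips every other link in the chain.
    if [ -n "$pinned" ] && [ "$pinned" != "$src" ]; then continue; fi
    case $src in
      env)      if [ -n "${IBMCLOUD_API_KEY:-}" ]; then echo env; return 0; fi ;;
      keychain) if [ -n "$keychain_value" ]; then echo keychain; return 0; fi ;;
      config)   if [ -n "$b64" ]; then echo config; return 0; fi ;;
      prompt)   echo prompt; return 0 ;;   # last resort: ask on the TTY
    esac
  done
  echo "no API key available from source: $pinned" >&2
  return 1
}
```

With IBMCLOUD_API_KEY exported, api_key_source "" "" "" prints env; with nothing available anywhere, the same call prints prompt.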

Now render terraform.tfvars so subsequent up runs have explicit HCL inputs to point --var-file at:

# 2. Render terraform.tfvars from config.yaml
$ roksbnkctl tfvars -w dev > ~/.roksbnkctl/dev/terraform.tfvars
$ head ~/.roksbnkctl/dev/terraform.tfvars
ibmcloud_region         = "ca-tor"
ibmcloud_resource_group = "default"
openshift_cluster_name  = "dev-cluster"
roks_workers_per_zone   = 2
# ...

Chapter 13 covers the precedence rules between config.yaml, terraform.tfvars, and terraform.tfvars.user (the hand-edit overlay).

Finally, verify the workspace is healthy before the first real up:

# 3. Sanity-check
$ roksbnkctl doctor -w dev
✓ terraform     1.6.2  on PATH
✓ IBMCLOUD_API_KEY resolves via keychain
✓ region "ca-tor" accepts the key (IAM round-trip OK)
✓ resource group "default" exists (id: ...)
✓ workspace dev healthy

From here, roksbnkctl up --auto -w dev is the next step (see Chapter 7 — Quick start). You can layer on bnk:, test:, targets:, exec:, cos: blocks by hand-editing config.yaml whenever you need them — init only writes the minimum to keep first-run friction low.

Cross-references

Chapter 13 — Terraform variables — the layering between config.yaml and terraform.tfvars.
Chapter 14 — Credentials and the resolver chain — the api_key_* fields and how they’re resolved.
Chapter 15 — SSH targets — the targets: block.
Chapter 17 — Execution backends — the exec: block.
Chapter 28 — Configuration reference — auto-generated complete field list.
Chapter 29 — Terraform variable reference — the upstream HCL variables config.yaml translates to.

Terraform variables (terraform.tfvars)

roksbnkctl is a thin orchestration layer over a Terraform HCL bundle. The HCL has its own variables — well over 60 of them — declared in terraform/variables.tf. The workspace’s config.yaml covers the common knobs; for the rest, you reach into terraform.tfvars directly.

This chapter is the surface for that lower layer: where the example file lives, how roksbnkctl tfvars bootstraps a starter, what --var-file does, the layering rule between config.yaml-derived tfvars and your overrides, and the one variable that never goes on disk (ibmcloud_api_key).

Where the bundled HCL lives

The Terraform HCL is bundled into the roksbnkctl binary via go:embed. On first use of a workspace, it gets extracted to:

~/.roksbnkctl/<workspace>/state/tf-source/embedded-terraform/
├── main.tf
├── variables.tf
├── outputs.tf
├── providers.tf
├── versions.tf
├── terraform.tfvars.example
└── modules/

That terraform.tfvars.example file is the canonical reference for what’s tunable — every variable with a sensible starter value, grouped by module (ROKS cluster, cert-manager, FLO, CNEInstance, License, testing). terraform/variables.tf (linked at the GitHub canonical URL) is the formal declaration with types, descriptions, and defaults.

You don’t edit the example file in place. Copy or generate from it instead.

`roksbnkctl tfvars` — bootstrap a starter

The roksbnkctl tfvars subcommand prints a starter terraform.tfvars to stdout, populated from the current workspace state:

$ roksbnkctl tfvars > ~/.roksbnkctl/dev/terraform.tfvars.user

What gets pre-filled:

Every field from config.yaml that maps to a tfvar (cluster name, region, workers, BNK fields, COS fields)
The cluster’s identity from cluster-outputs.json if cluster up has already run
A commented-out section for the variables you might want to tune next (jumphost profile, GSLB datacenter, license mode)

What’s deliberately excluded:

ibmcloud_api_key — never on disk (see “The IBMCLOUD_API_KEY exception” below)
Sensitive outputs (BIG-IP passwords, COS HMAC secrets) — left as upstream defaults

The starter is meant to be copied into ~/.roksbnkctl/<ws>/terraform.tfvars.user (the workspace-local override file) or into a --var-file path you keep alongside the workspace.

What you typically edit

The variables that matter for day-to-day BNK trial work, ordered by likely-to-touch:

Variable	Default	What it controls
`openshift_cluster_name`	`tf-openshift-cluster`	Cluster name. Mirrors `config.yaml`’s `cluster.name`.
`roks_workers_per_zone`	`1`	Worker nodes per zone. `2` ⇒ 6 workers in a three-zone MZR.
`create_roks_cluster`	`true`	Set `false` to adopt an existing cluster. Pair with `roks_cluster_id_or_name`.
`openshift_cluster_version`	`"4.18"`	OpenShift minor. Quote it — YAML/HCL parses `4.18` as float otherwise.
`cneinstance_deployment_size`	`Small`	`Small`/`Medium`/`Large`. CNEInstance sizing.
`f5_bigip_k8s_manifest_version`	upstream pin	Pin a specific BNK manifest chart version.
`far_repo_url`	`repo.f5.com`	FAR Docker/Helm registry. Override only for staging.
`flo_namespace`	`f5-bnk`	Where the F5 Lifecycle Operator runs.
`testing_create_tgw_jumphost`	`true`	Create the testing jumphost in a client VPC over Transit Gateway.
`testing_ssh_key_name`	`""` (must set)	Existing IBM Cloud SSH key name for jumphost provisioning.
`cneinstance_gslb_datacenter_name`	`""`	Set when wiring BNK into an F5 BIG-IP GSLB datacenter.
`license_mode`	`connected`	`connected` \| `disconnected`.

For the full list with types and per-field descriptions, see terraform/variables.tf directly (the URL is listed under Cross-references at the end of this chapter) or the auto-generated Chapter 29, Terraform variable reference.

The layering rule

When roksbnkctl up (or plan/apply/destroy) invokes Terraform, it composes three layers of tfvars in this order:

1. terraform.tfvars              (rendered by roksbnkctl from config.yaml)
2. terraform.tfvars.user         (workspace-local override, optional)
3. --var-file <path> ...         (CLI flag, repeatable, later file wins)

Later layers override earlier ones — same rule Terraform itself uses for -var-file chaining.

Concretely:

# config.yaml says cluster.workers_per_zone: 2
# ~/.roksbnkctl/dev/terraform.tfvars.user contains:
#   roks_workers_per_zone = 4
# Run with no flag:
roksbnkctl up
# → terraform sees 4 (.user wins over generated .tfvars)

# Pass a CLI override:
roksbnkctl up --var-file ./perf-test.tfvars
# perf-test.tfvars contains: roks_workers_per_zone = 8
# → terraform sees 8 (--var-file wins over .user)

# Multiple --var-files; later wins:
roksbnkctl up \
  --var-file ./base.tfvars \
  --var-file ./override.tfvars
# → values in override.tfvars win over base.tfvars;
#   both win over .user, which wins over .tfvars

The --var-file flag mirrors Terraform’s own -var-file: it is repeatable, and relative paths are interpreted against the working directory at invocation time.
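The layering rule can be sketched as a function that assembles Terraform's -var-file arguments in precedence order. This is a minimal illustration, not roksbnkctl's actual code: varFileArgs and fileExists are hypothetical names, and the assumption is that each layer is handed to Terraform as its own -var-file argument so that Terraform's "later file wins" rule does the merging.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// varFileArgs returns terraform -var-file arguments in the documented order:
// generated terraform.tfvars, optional terraform.tfvars.user, then CLI files.
// Terraform lets later files override earlier ones, so order is precedence.
// (Hypothetical helper; the real wiring inside roksbnkctl may differ.)
func varFileArgs(wsDir string, cliFiles []string, exists func(string) bool) []string {
	args := []string{"-var-file=" + filepath.Join(wsDir, "terraform.tfvars")}
	user := filepath.Join(wsDir, "terraform.tfvars.user")
	if exists(user) { // the override layer is optional
		args = append(args, "-var-file="+user)
	}
	for _, f := range cliFiles { // CLI files keep their command-line order
		args = append(args, "-var-file="+f)
	}
	return args
}

// fileExists reports whether path can be stat'ed.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	ws := filepath.Join(os.TempDir(), "roksbnkctl-demo-ws") // illustrative path only
	cli := []string{"./base.tfvars", "./override.tfvars"}
	for _, a := range varFileArgs(ws, cli, fileExists) {
		fmt.Println(a)
	}
}
```

Because precedence is purely positional, adding a fourth layer later would only mean appending to the slice in the right place.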

The `IBMCLOUD_API_KEY` exception

The upstream HCL declares ibmcloud_api_key as a sensitive variable. Every other tfvar can land in a file on disk; this one never does.

Instead, the API key flows through the resolver chain (env → keychain → config-b64 → prompt — see Chapter 14), and roksbnkctl exports it as TF_VAR_ibmcloud_api_key in the environment of the terraform-exec child process. Terraform reads the env var and injects it as if it had been declared in tfvars, but no plaintext key ever touches the filesystem.
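As a rough sketch of that hand-off, the snippet below builds (but does not run) a terraform child process whose environment carries the key. It uses only the standard library's os/exec rather than terraform-exec, and terraformCmd is a hypothetical name; treat it as an illustration of the env-var path, not the tool's implementation.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// terraformCmd prepares a terraform invocation with the API key injected as
// TF_VAR_ibmcloud_api_key. The key appears in the child's environment only:
// not in argv, and not in any file. (Hypothetical helper for illustration.)
func terraformCmd(apiKey string, args ...string) *exec.Cmd {
	cmd := exec.Command("terraform", args...)
	// Start from the parent's environment and append the variable; when a
	// name is duplicated, os/exec uses the last value.
	cmd.Env = append(os.Environ(), "TF_VAR_ibmcloud_api_key="+apiKey)
	return cmd
}

func main() {
	cmd := terraformCmd("example-key", "plan")
	for _, kv := range cmd.Env {
		if strings.HasPrefix(kv, "TF_VAR_ibmcloud_api_key=") {
			fmt.Println("child env carries TF_VAR_ibmcloud_api_key; argv is", cmd.Args)
		}
	}
}
```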

If you put ibmcloud_api_key = "..." in a hand-edited tfvars and run terraform directly (not via roksbnkctl), it works — Terraform itself is happy. But this is not how roksbnkctl runs Terraform, and putting the key in a .tfvars.user or --var-file is strongly discouraged: the file persists on disk, gets backed up, gets committed to git by accident, and gets read by other processes. The env-var path eliminates the on-disk window entirely.

Other secrets in scope:

bigip_password — upstream HCL declares it as a regular string (not sensitive). If you set it in tfvars, the value lands on disk. Treat that file like a credential.
COS HMAC keys — auto-generated by the roks_cluster module via the COS service-credentials resource; they live in terraform.tfstate (which is itself sensitive — chmod 0600, never commit, treat the workspace as a secret store).

Worked example: bigger cluster for a perf test

Default workspace, default cluster. You want to bump worker count for one perf-test run, then go back.

# 1. Confirm the current value comes from config.yaml
$ grep workers ~/.roksbnkctl/dev/config.yaml
  workers_per_zone: 2

# 2. Drop a one-off override into a file
$ cat > ~/perf-cluster.tfvars <<'EOF'
roks_workers_per_zone = 6
roks_min_worker_vcpu_count = 32
roks_min_worker_memory_gb = 128
EOF

# 3. Plan against it (note: --var-file passes through to terraform plan)
$ roksbnkctl plan --var-file ~/perf-cluster.tfvars

# 4. Apply
$ roksbnkctl apply --var-file ~/perf-cluster.tfvars

# 5. Run the throughput test
$ roksbnkctl test throughput

# 6. Roll back: re-apply WITHOUT the var-file
$ roksbnkctl apply
# → terraform sees roks_workers_per_zone=2 again from config.yaml-derived tfvars

Notice step 6 — dropping the --var-file flag is the rollback. Terraform compares its current state to the new desired state (from config.yaml) and scales the worker pool back down. No special “undo” command needed.

For a more permanent override (you want this workspace to always run with bigger nodes), put the contents of perf-cluster.tfvars into ~/.roksbnkctl/dev/terraform.tfvars.user instead. Then every roksbnkctl up/apply picks it up automatically without a CLI flag.

When to edit `config.yaml` vs `.tfvars.user` vs `--var-file`

A rough decision matrix:

You want to change…	Edit…
Cluster identity, region, OpenShift version, worker count	`config.yaml` (via `roksbnkctl init` or by hand)
BNK chart version, CNEInstance size, FAR repo	`config.yaml` (the `bnk:` block)
A variable not modelled in `config.yaml` (e.g. `cneinstance_gslb_datacenter_name`, `bigip_password`)	`terraform.tfvars.user` (workspace-local, persistent)
A one-off override for a single run (perf test, capacity bump)	`--var-file ./oneoff.tfvars` (CLI)
A CI-pipeline variable bundle that’s checked into git	`--var-file ./ci-overrides.tfvars` (CLI; the file lives in your CI repo, not the workspace)

The schema in config.yaml covers about a third of the upstream HCL variables — the ones that nearly every workspace needs to set. The other two-thirds (jumphost details, every BNK module’s full surface, the testing module’s full surface) are reachable through the lower layers.

Cross-references

Chapter 12 — Workspace config — what config.yaml covers vs what falls through to tfvars.
Chapter 14 — Credentials and the resolver chain — why ibmcloud_api_key doesn’t go in tfvars.
Chapter 29 — Terraform variable reference — auto-generated complete reference for terraform/variables.tf.
The upstream terraform/variables.tf source: https://github.com/jgruberf5/roksbnkctl/blob/main/terraform/variables.tf
The upstream starter file: https://github.com/jgruberf5/roksbnkctl/blob/main/terraform/terraform.tfvars.example

Credentials and the resolver chain

roksbnkctl handles four kinds of secrets: an IBM Cloud API key, a kubeconfig, an SSH private key, and the Terraform state file. Each has a different threat model, a different lookup chain, and a different rule for “what’s safe to commit to a workspace”.

This chapter is the user-facing distillation of PRD 04 — credential propagation. PRD 04 is the design surface for developers extending the credential system; this chapter is the operational surface for users who need to know “where does my key live, and how does the tool find it”.

The four secrets in scope

Credential	Used by	Resolved from
IBMCLOUD_API_KEY	`ibmcloud` CLI, terraform’s IBM provider, IBM SDK calls	Env → OS keychain → workspace `api_key_b64` → prompt
kubeconfig	`kubectl`/`oc` passthroughs, `roksbnkctl k get/apply/...`, terraform’s k8s + helm providers	`KUBECONFIG` env → `~/.kube/config` (kubectl-style)
SSH private key	The SSH client backing `--on` and the `ssh:<target>` execution backend	Per-target: file path, ssh-agent, or `tf-output:<name>`
Terraform state	The `terraform-exec` calls inside `roksbnkctl up`/`apply`/`destroy`	Workspace `state/terraform.tfstate` (filesystem only)

Each has its own discovery rules. Walk them in turn.

The IBMCLOUD_API_KEY resolver chain

The single most-used credential. Resolved by internal/cred/resolver.go (extracted from the formerly scattered logic in internal/config/secrets.go). The resolver walks four sources in order:

1. Environment variables (process-scoped, never persisted)
2. OS keychain (per-user, system-managed)
3. Workspace config api_key_b64 (per-workspace, base64 obfuscation)
4. Interactive prompt (TTY-only)

The first source that yields a non-empty value wins. The chain stops there — the key isn’t re-fetched from a “more authoritative” source on subsequent calls.
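The first-non-empty-wins behaviour can be sketched as a loop over an ordered slice of sources. The source type, resolve, and the stubbed lookups are hypothetical names chosen for illustration; the real resolver in internal/cred/resolver.go may be shaped differently.

```go
package main

import (
	"errors"
	"fmt"
)

// source is one link in the chain: get returns a key, or "" when it has none.
type source struct {
	name string
	get  func() string
}

var errNoKey = errors.New("no IBM Cloud API key available")

// resolve walks the chain in order and stops at the first non-empty value.
// Later sources are never consulted once an earlier one has answered.
func resolve(chain []source) (key, from string, err error) {
	for _, s := range chain {
		if v := s.get(); v != "" {
			return v, s.name, nil
		}
	}
	return "", "", errNoKey
}

func main() {
	// Stubbed lookups: env is empty, so the keychain entry wins and the
	// config and prompt sources are never called.
	chain := []source{
		{"env", func() string { return "" }},
		{"keychain", func() string { return "key-from-keychain" }},
		{"config", func() string { return "key-from-config" }},
		{"prompt", func() string { return "key-from-prompt" }},
	}
	key, from, err := resolve(chain)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("resolved %q from %s\n", key, from)
}
```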

Source 1 — Environment

The resolver checks these env vars in order, returning the first non-empty value:

IBMCLOUD_API_KEY            # canonical
IC_API_KEY                  # short alias
TF_VAR_ibmcloud_api_key     # terraform passthrough form
TF_VAR_IBMCLOUD_API_KEY     # uppercase variant some pipelines use
TF_VAR_IC_API_KEY           # uppercase variant of the short alias

Env vars are first because they’re the most explicit path — if you’ve gone to the trouble of setting one, you’ve made a deliberate choice. Pre-existing CI pipelines, automation scripts, and direnv setups all live here. The resolver respects that ordering even when a keychain entry also exists.
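A minimal sketch of the environment lookup, assuming the five names are checked exactly in the order listed above. keyFromEnv is a hypothetical name, and the getenv parameter exists only so the function can be exercised without touching the real environment.

```go
package main

import (
	"fmt"
	"os"
)

// apiKeyEnvVars lists the variable names in the order documented above.
var apiKeyEnvVars = []string{
	"IBMCLOUD_API_KEY",
	"IC_API_KEY",
	"TF_VAR_ibmcloud_api_key",
	"TF_VAR_IBMCLOUD_API_KEY",
	"TF_VAR_IC_API_KEY",
}

// keyFromEnv returns the first non-empty value and the name it came from.
// An empty name means none of the variables was set.
func keyFromEnv(getenv func(string) string) (value, name string) {
	for _, n := range apiKeyEnvVars {
		if v := getenv(n); v != "" {
			return v, n
		}
	}
	return "", ""
}

func main() {
	v, n := keyFromEnv(os.Getenv)
	if n == "" {
		fmt.Println("no API key in the environment; the chain would move on to the keychain")
		return
	}
	// Print the length, never the value.
	fmt.Printf("found a %d-character key in %s\n", len(v), n)
}
```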

Source 2 — OS keychain

roksbnkctl stores per-workspace API keys in the OS-native keychain via github.com/zalando/go-keyring:

OS	Backend
macOS	Keychain (`security` framework)
Linux (with libsecret)	GNOME Keyring / KWallet via Secret Service API
Windows	Credential Manager
Linux (no libsecret)	Falls back to source 3 (config base64)

Entries are namespaced under service roksbnkctl, with user <workspace>/ibmcloud_api_key:

# What `roksbnkctl init` writes (no-op shown for clarity)
$ keyring set roksbnkctl dev/ibmcloud_api_key

This is the recommended secure default. The OS handles process isolation; roksbnkctl only sees the value during the brief window between fetch and use.

Source 3 — Workspace `api_key_b64`

A base64-encoded blob in ~/.roksbnkctl/<workspace>/config.yaml:

ibmcloud:
  api_key_b64: ZW5jb2RlZC1hcGkta2V5LXZhbHVl

Important framing: base64 is obfuscation, not encryption. Anyone with read access to the file can decode it instantly:

echo -n "ZW5jb2RlZC1hcGkta2V5LXZhbHVl" | base64 -d
# → encoded-api-key-value

The encoding exists for two reasons:

Visual. A glancing cat config.yaml doesn’t surface the literal API key.
Format. API keys can contain characters that need quoting or escaping in YAML. Base64 normalises the value to one predictable alphabet.

api_key_b64 is the fallback when the OS keychain isn’t available — most commonly WSL2 without libsecret, headless Linux servers, and CI runners where bringing up a keychain daemon is more friction than it’s worth. Treat the file like a plaintext credential: chmod 0600, never commit, never share.
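To make the "obfuscation, not encryption" point concrete, here is the same round trip in Go. It is a sketch that assumes the standard base64 alphabet (the sample value above decodes identically under the URL-safe alphabet, so the source does not settle which one roksbnkctl uses); encodeKey and decodeKey are hypothetical names.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// encodeKey produces the api_key_b64 form. No key or passphrase is involved,
// which is exactly why this is obfuscation rather than encryption.
func encodeKey(key string) string {
	return base64.StdEncoding.EncodeToString([]byte(key))
}

// decodeKey reverses encodeKey; anyone who can read the file can do the same.
func decodeKey(b64 string) (string, error) {
	raw, err := base64.StdEncoding.DecodeString(b64)
	if err != nil {
		return "", fmt.Errorf("api_key_b64 is not valid base64: %w", err)
	}
	return string(raw), nil
}

func main() {
	enc := encodeKey("encoded-api-key-value")
	dec, err := decodeKey(enc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(enc) // ZW5jb2RlZC1hcGkta2V5LXZhbHVl
	fmt.Println(dec) // encoded-api-key-value
}
```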

File-mode note: roksbnkctl init writes config.yaml mode 0644 by default (Chapter 12 §“File permissions”). When you populate api_key_b64, chmod 0600 the file yourself, and re-chmod after any subsequent roksbnkctl init that rewrites the file, since init doesn’t preserve the tightened mode. The keychain and env-var paths sidestep this entirely: nothing sensitive lands in config.yaml, so the default 0644 is fine.

The plaintext field name api_key: is rejected at workspace-load time:

$ roksbnkctl up
error: ~/.roksbnkctl/dev/config.yaml: plaintext secret detected (offset 47):
       workspace config.yaml must not contain credentials — use IBMCLOUD_API_KEY
       env var or the OS keychain (see `roksbnkctl init`)

The regex catches api_key, apikey, ibmcloud_api_key, password, token, secret_access_key, hmac_secret. The _b64 suffix is the documented escape — it’s the only inline form the loader tolerates.
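One way such a check could look is sketched below. The pattern is an assumption built from the field names listed above, not the loader's actual regex: it matches those names only when they appear as a complete YAML key, which is what lets api_key_b64 and api_key_source through. findPlaintextSecret is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
)

// plaintextSecret matches a forbidden field name used as a whole YAML key.
// Because the name must be followed directly by optional spaces and a colon,
// api_key_b64 and api_key_source do not match. (Assumed pattern.)
var plaintextSecret = regexp.MustCompile(
	`(?m)^\s*(api_key|apikey|ibmcloud_api_key|password|token|secret_access_key|hmac_secret)\s*:`)

// findPlaintextSecret returns the offending field name and its byte offset.
func findPlaintextSecret(yamlText string) (field string, offset int, found bool) {
	m := plaintextSecret.FindStringSubmatchIndex(yamlText)
	if m == nil {
		return "", 0, false
	}
	return yamlText[m[2]:m[3]], m[2], true
}

func main() {
	bad := "ibmcloud:\n  region: ca-tor\n  api_key: not-a-real-key\n"
	good := "ibmcloud:\n  api_key_source: keychain\n  api_key_b64: ZW5jb2RlZA==\n"

	if field, off, found := findPlaintextSecret(bad); found {
		fmt.Printf("rejected: field %q at offset %d\n", field, off)
	}
	if _, _, found := findPlaintextSecret(good); !found {
		fmt.Println("accepted: only the _b64 and _source forms are present")
	}
}
```

A line-oriented regex like this is cheap but blunt; it would also flag an unrelated field that happens to be named token, which matches the conservative stance the loader takes.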

Source 4 — Interactive prompt

When sources 1-3 all come up empty AND stdin is a TTY, the resolver prompts:

Enter IBM Cloud API key for workspace "dev": ********
Save the key for future runs? [Y/n]: y
  ✓ saved to OS keychain

The key is read with echo disabled (via golang.org/x/term). The prompt offers to persist the key: by default it tries the OS keychain first and falls back to api_key_b64 in config.yaml if the keychain is unavailable.

If stdin isn’t a TTY (CI runner, piped input, daemon process), the resolver errors instead of hanging:

error: no IBM Cloud API key available and stdin is not a TTY (cannot prompt;
       set IBMCLOUD_API_KEY or run `roksbnkctl init`)

Pinning a single source

The chain is the default. To force one specific source, set ibmcloud.api_key_source in config.yaml:

ibmcloud:
  api_key_source: keychain    # env | keychain | config | prompt

This is useful in two scenarios:

CI: api_key_source: env makes a missing env var a hard error rather than falling through to a (locked / non-existent) keychain.
Auditable single-source-of-truth: pinning to keychain documents that this workspace’s key lives in the OS keychain and nowhere else; reading the key from a different source becomes an error rather than a silent fallback.

kubeconfig discovery

Different chain, different rules. roksbnkctl discovers the kubeconfig the same way kubectl does — two sources, in this order:

1. KUBECONFIG environment variable (first existing path in a colon-separated list)
2. ~/.kube/config

This is the kubectl-standard discovery chain, implemented in internal/k8s/client.go::DefaultKubeconfigPath(). Whatever you’ve already taught kubectl to read, roksbnkctl reads too.
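The two-step chain might be sketched as follows. This is not the body of DefaultKubeconfigPath(); kubeconfigPath is a hypothetical stand-in, and falling through to ~/.kube/config when KUBECONFIG is set but none of its paths exist is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// kubeconfigPath returns the first existing path from the KUBECONFIG list,
// or ~/.kube/config if that exists. filepath.SplitList uses the OS list
// separator (":" on Linux and macOS).
func kubeconfigPath(kubeconfigEnv, home string, exists func(string) bool) (string, bool) {
	for _, p := range filepath.SplitList(kubeconfigEnv) {
		if p != "" && exists(p) {
			return p, true
		}
	}
	def := filepath.Join(home, ".kube", "config")
	if exists(def) {
		return def, true
	}
	return "", false
}

func main() {
	home, err := os.UserHomeDir()
	if err != nil {
		fmt.Println("cannot determine home directory:", err)
		return
	}
	exists := func(p string) bool { _, err := os.Stat(p); return err == nil }
	if p, ok := kubeconfigPath(os.Getenv("KUBECONFIG"), home, exists); ok {
		fmt.Println("kubeconfig:", p)
	} else {
		fmt.Println("no kubeconfig found")
	}
}
```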

cluster up’s post-apply step writes the admin kubeconfig to ~/.kube/config (mode 0600) by default — so the second source in the chain is also the destination of the tool’s own output, and the same KUBECONFIG-overrides-everything rule applies. If KUBECONFIG is set when cluster up runs, the download lands at that path instead.

Note: there is also a ~/.roksbnkctl/<workspace>/state/kubeconfig/ directory under the workspace state dir. It’s a Terraform input (kubeconfig_dir tfvar) that the upstream HCL writes per-component sub-files into (cert_manager, cne_instance, flo, license); it is not a kubeconfig file the resolver reads. Don’t confuse the two.

When the file is missing

If neither source yields a kubeconfig, commands that need one error with:

error: no kubeconfig: KUBECONFIG env not set, ~/.kube/config not present.
       Run `roksbnkctl cluster up`, `roksbnkctl cluster register <name>`,
       or set KUBECONFIG.

The remediation message tells you which path to take. cluster register <name> is the path for an existing cluster you want to adopt without re-creating it (see Chapter 9).

File permissions

cluster up writes ~/.kube/config with mode 0600 (owner read/write only). It contains the cluster admin token; treat it like a credential. Don’t commit it, don’t email it, don’t cat it in screen-shared sessions.

SSH private keys

Per-target, not per-workspace. Each entry under targets: in config.yaml declares exactly one of:

Source	Form	Notes
File	`key_path: ~/.ssh/id_ed25519`	Standard OpenSSH key formats. Tilde expansion honoured.
Agent	`key_source: agent`	Talks to ssh-agent over `$SSH_AUTH_SOCK`. Linux/macOS only at v1.0; Windows ssh-agent named-pipe support is on the v1.x roadmap.
TF output	`key_source: tf-output:jumphost_shared_key`	Reads from terraform state at connect time; never written to disk separately.

The tf-output: form is the auto-discovered jumphost path — the upstream HCL provisions a tls_private_key resource per cluster create, marks it sensitive, and surfaces it as a terraform output. roksbnkctl reads the output via terraform output -raw at SSH-connect time, never persists it, and the key only exists in TF state plus in memory during a connect.
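A small sketch of how a key_source value could be interpreted, and of the argv for the terraform output -raw read described above. parseKeySource and terraformOutputArgs are hypothetical names; only the two key_source forms and the terraform output -raw command come from the text.

```go
package main

import (
	"fmt"
	"strings"
)

// parseKeySource splits a key_source value into its kind and, for the
// tf-output form, the terraform output name.
func parseKeySource(s string) (kind, output string, err error) {
	switch {
	case s == "agent":
		return "agent", "", nil
	case strings.HasPrefix(s, "tf-output:"):
		name := strings.TrimPrefix(s, "tf-output:")
		if name == "" {
			return "", "", fmt.Errorf("key_source %q names no terraform output", s)
		}
		return "tf-output", name, nil
	default:
		return "", "", fmt.Errorf("unsupported key_source %q", s)
	}
}

// terraformOutputArgs is the argv that prints one output value undecorated.
// The caller would capture stdout in memory and never write it to disk.
func terraformOutputArgs(name string) []string {
	return []string{"terraform", "output", "-raw", name}
}

func main() {
	kind, name, err := parseKeySource("tf-output:jumphost_shared_key")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(kind, terraformOutputArgs(name))
}
```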

Chapter 15 — SSH targets is the deep reference for the targets: block; this chapter just notes the credential-side framing.

Terraform state

~/.roksbnkctl/<workspace>/state/terraform.tfstate is the workspace’s terraform state file. It contains:

IBM Cloud admin tokens (cluster admin, COS HMAC credentials)
Generated TLS private keys (the jumphost shared key referenced above)
Sensitive outputs (FAR auth bundles, license JWTs)
Every resource attribute terraform tracks

It is plaintext-credential-equivalent. The file mode is 0600; the parent directory is 0700. Back up the workspace dir intact, never commit it to git, and treat compromise of the state file as compromise of every secret it contains.

There is no separate “TF state credential” — the file’s filesystem ACL is the only access control. PRD 04 covers the cross-backend story for moving state into a Docker bind-mount, a Kubernetes Secret, or an SCP’d remote temp directory; at v1.0 the local file is the only path (terraform --backend k8s / ssh are deferred to v1.x; see docs/PLAN.md §“What’s deliberately deferred to post-v1.0”).

What’s safe to commit vs not

A short rule:

SAFE TO COMMIT:    nothing in ~/.roksbnkctl/<workspace>/
NOT SAFE:          everything in ~/.roksbnkctl/<workspace>/

The longer version, by file:

Path	Commit?	Why
`~/.roksbnkctl/<ws>/config.yaml`	No	Even without `api_key_b64`, this file documents your cluster identity, region, COS bucket — useful inventory for an attacker.
`~/.roksbnkctl/<ws>/config.yaml` (with `api_key_b64`)	Hard no	The base64 value is plaintext-equivalent. Committing it = leaking the key.
`~/.roksbnkctl/<ws>/state/kubeconfig`	No	Cluster admin token.
`~/.roksbnkctl/<ws>/state/terraform.tfstate`	No	Every secret terraform manages, in plaintext.
`~/.roksbnkctl/<ws>/state/terraform.tfvars`	No	Generated; references no secrets directly but documents resource layout.
`~/.roksbnkctl/<ws>/terraform.tfvars.user`	Maybe	If you’ve kept secrets out (no `bigip_password`, no `ibmcloud_api_key`), it’s just config. Audit before committing.
`~/.roksbnkctl/<ws>/cluster-outputs.json`	No	Cluster identity + COS instance name. Not directly a secret but tied to the workspace.
`~/.roksbnkctl/known_hosts`	Yes (if you want)	Host-key fingerprints; not a secret. Same threat model as OpenSSH’s `~/.ssh/known_hosts`.

The simplest policy: a .gitignore that excludes the entire ~/.roksbnkctl/ tree. If you really want to share a workspace skeleton with a colleague, send the config.yaml minus api_key_b64 and let them re-run roksbnkctl init against their own account.

How `roksbnkctl init` writes the API key

Walk through the writeable side of the resolver:

$ roksbnkctl init
Workspace name [default]: dev
IBM Cloud region [ca-tor]:
Enter IBM Cloud API key (input hidden): ********
Save the key for future runs? [Y/n]: y
  ✓ saved to OS keychain

What just happened:

1. init prompted for the key. Input echo was off; the key never appeared on screen.
2. The user said “save”.
3. SaveAPIKeyForWorkspace tried SaveAPIKeyToKeychain first.
4. The OS keychain accepted the entry (Linux + libsecret in this case). The success path returned "OS keychain" and init printed the confirmation.
5. The key was not written to terraform.tfvars (that’s the resolver’s job at terraform-invoke time, via the TF_VAR_ibmcloud_api_key env var).

If step 4 had failed (no keychain, WSL2 without libsecret), SaveAPIKeyForWorkspace would have fallen through to saveAPIKeyToConfig, which base64-encodes the key, writes it into config.yaml’s api_key_b64 field, and returns "config.yaml (base64)". init would have printed:

  ✓ saved to config.yaml (base64)
  warning: base64 is obfuscation, not encryption — chmod 0600 the file

Both destinations work. The keychain path is the recommended default; the config-b64 path is the documented fallback.

What’s new in v1.2: the cred-tmpfile and trusted-profile paths

v1.2.0 closes the two longest-deferred items from PRD 04 §“Open questions”: roksbnkctl --backend docker no longer leaks IBMCLOUD_API_KEY in docker inspect, and roksbnkctl --backend k8s ops install auto-provisions an IBM Cloud trusted profile so the ops pod never sees a static API key. Both have fallbacks for environments where the new path doesn’t apply; v1.0.x / v1.1.x workspaces continue to work without change.

The tmpfile-bind-mount pattern (docker backend)

The docker backend writes the resolved IBMCLOUD_API_KEY to a 0600 tempfile on the host, bind-mounts that single file read-only at /run/secrets/ibmcloud_api_key inside the container, and points the container at the file via IBMCLOUD_API_KEY_FILE=/run/secrets/ibmcloud_api_key. The value never appears in the container’s stored env metadata — docker inspect <id> shows the path, not the key. The tempfile is owned by the calling user and is removed on backend exit (and on context cancellation, so an interrupted run still cleans up).

You don’t have to do anything to opt in — the pattern is the default for --backend docker on v1.2 and up. The engineering shape (lifecycle, the inline sh -c shim that re-exports the value into the legacy IBMCLOUD_API_KEY env name for tools that read from env, the why-not-just-use---secret discussion) lives in PRD 04 §“Resolved in Sprint 9” → “Cred tmpfile-bind-mount pattern (docker backend)”. For most users the takeaway is one line: in v1.2, --backend docker is docker inspect-clean.
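The host side of the pattern can be sketched in a few lines. This is an illustration under stated assumptions, not the backend's code: stageKey is a hypothetical name, the docker arguments are the standard --volume and --env flags, and the context-cancellation cleanup and the in-container sh -c shim are left out.

```go
package main

import (
	"fmt"
	"os"
)

// secretMount is the in-container path named in the text above.
const secretMount = "/run/secrets/ibmcloud_api_key"

// stageKey writes the key to a tempfile and returns the docker run arguments
// that bind-mount it read-only, plus a cleanup function that removes it.
// os.CreateTemp creates the file with mode 0600 (before umask).
func stageKey(key string) ([]string, func(), error) {
	f, err := os.CreateTemp("", "roksbnkctl-key-*")
	if err != nil {
		return nil, nil, err
	}
	path := f.Name()
	cleanup := func() { os.Remove(path) }
	if _, err := f.WriteString(key); err != nil {
		f.Close()
		cleanup()
		return nil, nil, err
	}
	if err := f.Close(); err != nil {
		cleanup()
		return nil, nil, err
	}
	args := []string{
		"--volume", path + ":" + secretMount + ":ro", // single file, read-only
		"--env", "IBMCLOUD_API_KEY_FILE=" + secretMount, // a path, not the key
	}
	return args, cleanup, nil
}

func main() {
	args, cleanup, err := stageKey("example-key")
	if err != nil {
		fmt.Println("staging failed:", err)
		return
	}
	defer cleanup() // the tempfile never outlives the run
	fmt.Println("docker run", args)
}
```

The only thing stored in the container's metadata is the IBMCLOUD_API_KEY_FILE path, which is why docker inspect stays clean.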

The `--trusted-profile` flag (k8s backend)

New flag on roksbnkctl ops install that controls how the ops pod gets its IBM Cloud credential. Three values:

Value	What it does	When to use
`auto` (default)	Try to provision an IBM Cloud trusted profile (`roksbnkctl-ops-<workspace>`) linked to the ops pod’s ServiceAccount. The pod assumes the profile via its projected SA token and the static API key never lands in any Secret. If your workspace API key doesn’t have IAM `iam-identity` permissions, automatically fall back to the v1.0.x static-key Secret with a stderr warning that names the missing perm and how to silence it (`--trusted-profile=off`).	Default for new installs. Production users get the secure path automatically; restricted-IAM users still complete `ops install` successfully.
`on`	Try to provision; fail loudly with a non-zero exit if perms don’t allow. No fallback.	CI / hardened environments where the static-key path is unacceptable and a perm-missing case should block, not warn.
`off`	Skip the trusted-profile path entirely; provision the v1.0.x static-key Secret (matches v1.0.x / v1.1.x behaviour).	Compatibility / debugging — and the documented path for clusters whose IAM admin doesn’t grant `iam-identity` perms and isn’t expected to.

Chapter 19 — The in-cluster ops pod walks through the --trusted-profile=auto install flow, the verification commands (oc get serviceaccount roksbnkctl-ops -o yaml showing the trusted-profile annotation), the fallback warning shape, and how ops uninstall cleans up a provisioned profile.
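The three values reduce to a small decision function, sketched below. credPath is a hypothetical name, canProvision stands in for "the workspace API key has the required IAM permissions", and treating an empty value as auto is an assumption that follows from auto being the default.

```go
package main

import (
	"errors"
	"fmt"
)

// credPath decides which credential path ops install takes.
// warn is true only when auto had to fall back to the static-key Secret.
func credPath(mode string, canProvision bool) (path string, warn bool, err error) {
	switch mode {
	case "off":
		return "static-key-secret", false, nil
	case "on":
		if !canProvision {
			return "", false, errors.New("trusted profile required but IAM permissions are missing")
		}
		return "trusted-profile", false, nil
	case "auto", "":
		if canProvision {
			return "trusted-profile", false, nil
		}
		return "static-key-secret", true, nil // fall back and warn on stderr
	default:
		return "", false, fmt.Errorf("invalid --trusted-profile value %q", mode)
	}
}

func main() {
	for _, mode := range []string{"auto", "on", "off"} {
		path, warn, err := credPath(mode, false) // simulate a restricted-IAM key
		fmt.Printf("%-4s -> path=%q warn=%v err=%v\n", mode, path, warn, err)
	}
}
```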

Compatibility note

v1.0.x and v1.1.x workspaces continue to work without migration. The docker tmpfile pattern is a transparent replacement — the resolver chain is unchanged, the workspace config is unchanged, and no flag is required to opt in. The k8s --trusted-profile=auto default with auto-fallback means existing workspaces against an IAM-restricted key keep getting the static-key Secret as before, with one extra stderr warning line on ops install naming the fallback and how to silence it. Setting --trusted-profile=off reproduces the v1.0.x behaviour byte-for-byte (no warning, static-key Secret straight away).

Backend-specific cred propagation

The credential-propagation rules differ per backend. All four backends ship at v1.0:

Backend	Where creds live	Mechanism
`local`	The user’s environment	`os/exec` inherits parent env
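`docker`	v1.0.x / v1.1.x: caller’s env, propagated by reference. v1.2+: a 0600 tempfile on the host	v1.0.x / v1.1.x: `docker run --env IBMCLOUD_API_KEY` (no `=value`) keeps the value off the command line, but it is still visible in `docker inspect`. v1.2+: read-only bind mount at `/run/secrets/ibmcloud_api_key` plus `IBMCLOUD_API_KEY_FILE`, which keeps `docker inspect` clean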
`k8s`	Kubernetes Secret in the `roksbnkctl-ops` namespace	Mounted into the ops pod via `envFrom: secretRef`; or IAM trusted profile (preferred)
`ssh`	Remote env or wrapper script	`ssh -o SetEnv=IBMCLOUD_API_KEY=...` first; falls back to a 0700 wrapper script with `trap rm EXIT`

Each backend’s “where creds live” surface is summarised in Chapter 17 — Execution backends; the design rationale is in PRD 04.

The user-facing invariant across all four: you put the key into one of the resolver chain’s sources, and roksbnkctl figures out the rest. You don’t have to learn four different credential APIs to use four different backends.

The redactor

roksbnkctl writes a fair amount to its own logs (stdout, stderr) — terraform plan output, ibmcloud CLI output, error traces. Anywhere we can plausibly print the IBM API key (because a downstream tool printed it, because an error message included it, because a debug trace dumped a struct), the redactor masks it before the bytes leave the binary.

What gets redacted:

The IBM Cloud API key value, anywhere it appears in Stdout or Stderr of an exec backend’s RunOpts. Replaced with [REDACTED].
The same value in roksbnkctl’s own log output (the lifecycle commands that wrap terraform-exec).

What does not get redacted:

Output captured by callers via -o yaml/-o json for resources that legitimately contain the key (e.g., a Secret returned from roksbnkctl k get). The redactor doesn’t know about Kubernetes resource semantics; if you k get secret -o yaml, you’ll see the key. (The same is true of kubectl.)
Output from a tool you ran outside roksbnkctl (e.g., piping to tee after invoking terraform directly). The redactor only sees bytes that pass through the exec backend’s Stdout/Stderr writers.
The terraform state file. State is on-disk; the redactor is an in-memory stream filter.

The implementation is internal/exec/redact.go — a wrapping io.Writer with byte-comparison redaction and cross-write prefix buffering (so a secret split across two Write calls still gets masked). The matcher uses the resolved API key value verbatim (a known string at run-time) rather than a generic “looks like an IBM API key” pattern, to avoid false positives on legitimate output.
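The cross-write buffering idea can be sketched as follows. This is a from-scratch illustration of the technique, not the contents of internal/exec/redact.go: the redactor type, newRedactor, Flush, and redactChunks are hypothetical names. The writer holds back any trailing bytes that could be the start of the secret until the next Write shows whether they are.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

var mask = []byte("[REDACTED]")

// redactor masks one known secret in everything written through it.
type redactor struct {
	w      io.Writer
	secret []byte
	held   []byte // tail of earlier writes that may be the start of the secret
}

func newRedactor(w io.Writer, secret string) *redactor {
	return &redactor{w: w, secret: []byte(secret)}
}

func (r *redactor) Write(p []byte) (int, error) {
	if len(r.secret) == 0 { // nothing to mask: pass straight through
		return r.w.Write(p)
	}
	data := append(r.held, p...)
	data = bytes.ReplaceAll(data, r.secret, mask)

	// Hold back the longest suffix that is a proper prefix of the secret.
	limit := len(r.secret) - 1
	if limit > len(data) {
		limit = len(data)
	}
	keep := 0
	for n := limit; n > 0; n-- {
		if bytes.HasSuffix(data, r.secret[:n]) {
			keep = n
			break
		}
	}
	r.held = append([]byte(nil), data[len(data)-keep:]...)
	if out := data[:len(data)-keep]; len(out) > 0 {
		if _, err := r.w.Write(out); err != nil {
			return 0, err
		}
	}
	return len(p), nil
}

// Flush emits any held-back bytes; call it once the stream has ended.
func (r *redactor) Flush() error {
	if len(r.held) == 0 {
		return nil
	}
	_, err := r.w.Write(r.held)
	r.held = nil
	return err
}

// redactChunks feeds chunks through a redactor and returns the result.
func redactChunks(secret string, chunks ...string) string {
	var out bytes.Buffer
	r := newRedactor(&out, secret)
	for _, c := range chunks {
		r.Write([]byte(c))
	}
	r.Flush()
	return out.String()
}

func main() {
	// The secret is split across two writes and is still masked.
	fmt.Println(redactChunks("s3cr3t", "key=s3", "cr3t done")) // key=[REDACTED] done
}
```

The cost of this approach is latency of at most len(secret)-1 bytes, which is why a Flush at end of stream is needed.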

PRD 04’s acceptance criteria require that the API key never appears in docker inspect, ps -ef, kubectl get pods/events -o yaml, or kubectl describe pod. The redactor is the defence-in-depth layer; the per-backend cred-propagation rules are the primary control.

Cross-references

PRD 04 — credential propagation across execution backends — the full design.
Chapter 12 — Workspace config — the ibmcloud: block schema.
Chapter 13 — Terraform variables — why ibmcloud_api_key doesn’t go in tfvars.
Chapter 15 — SSH targets — the SSH key sources.
Chapter 17 — Execution backends — backend-specific cred mechanics.
internal/cred/resolver.go — the resolver implementation: https://github.com/jgruberf5/roksbnkctl/blob/main/internal/cred/resolver.go
internal/config/secrets.go — the keychain + config-b64 helpers: https://github.com/jgruberf5/roksbnkctl/blob/main/internal/config/secrets.go

SSH targets

This chapter is the technical reference for the targets: system. Its companion is Chapter 16 (The --on flag and SSH jumphosts), which is the user-facing prose for “how do I run a command on the jumphost”. Chapter 16 introduces targets briefly; this chapter goes deeper into the schema, the key sources, the host-key trust model, the auto-discovery pipeline, and what the ssh execution backend layers on top of the lightweight --on flag.

If you arrived here from Chapter 16 looking for “where do I learn the full surface”, you’re in the right place.

The `targets:` block schema

Targets live under targets: in ~/.roksbnkctl/<workspace>/config.yaml:

targets:
  jumphost:
    host: 169.45.91.177
    port: 22
    user: ubuntu
    key_source: tf-output:jumphost_shared_key

  bastion:
    host: ops.example.com
    user: jgruber
    key_path: ~/.ssh/id_ed25519

  prod-jump:
    host: 10.0.0.5
    user: ec2-user
    key_source: agent

The Go struct backing it is internal/config.TargetCfg:

type TargetCfg struct {
    Host      string `yaml:"host"`
    Port      int    `yaml:"port,omitempty"`        // default 22
    User      string `yaml:"user"`
    KeyPath   string `yaml:"key_path,omitempty"`    // file path
    KeySource string `yaml:"key_source,omitempty"`  // "agent" | "tf-output:<name>"
}

Field	Type	Required	Notes
`host`	string	yes	IP or hostname. Resolved via the standard Go resolver chain (no special DNS handling).
`port`	int	no	Defaults to `22`. Only set when the remote sshd listens elsewhere.
`user`	string	yes	Remote login username.
`key_path`	string	one-of	File path to a PEM-encoded private key. Tilde expansion honoured.
`key_source`	string	one-of	`agent` or `tf-output:<output-name>`.

Validation rules at load time:

Exactly one of key_path or key_source must be set. Setting neither, or both, fails the load with a clear error.
The target name (the YAML map key) must be non-empty and stable across YAML round-trips — roksbnkctl targets show <name> and roksbnkctl targets remove <name> look up by this name.
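The load-time rules above can be sketched in a few lines of Go. This is a minimal illustration, not the code in internal/config: the validateTarget and effectivePort helpers are hypothetical names, and the struct is a local copy of the documented fields so the example compiles on its own.

```go
package main

import (
	"errors"
	"fmt"
)

// TargetCfg is a local copy of the documented fields (YAML tags omitted).
type TargetCfg struct {
	Host      string
	Port      int
	User      string
	KeyPath   string
	KeySource string
}

// validateTarget is a hypothetical helper that applies the documented rules:
// a non-empty name, required host and user, and exactly one key setting.
func validateTarget(name string, t TargetCfg) error {
	if name == "" {
		return errors.New("target name must be non-empty")
	}
	if t.Host == "" || t.User == "" {
		return fmt.Errorf("target %q: host and user are required", name)
	}
	hasPath, hasSource := t.KeyPath != "", t.KeySource != ""
	switch {
	case hasPath && hasSource:
		return fmt.Errorf("target %q: set key_path or key_source, not both", name)
	case !hasPath && !hasSource:
		return fmt.Errorf("target %q: one of key_path or key_source is required", name)
	}
	return nil
}

// effectivePort applies the documented default of 22 when port is omitted.
func effectivePort(t TargetCfg) int {
	if t.Port == 0 {
		return 22
	}
	return t.Port
}

func main() {
	ok := TargetCfg{Host: "ops.example.com", User: "jgruber", KeyPath: "~/.ssh/id_ed25519"}
	bad := TargetCfg{Host: "10.0.0.5", User: "ec2-user"} // neither key field set

	fmt.Println(validateTarget("bastion", ok), effectivePort(ok))
	fmt.Println(validateTarget("prod-jump", bad))
}
```

The error strings here are illustrative; the wording roksbnkctl actually prints may differ.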

The TargetCfg type lives in internal/config rather than internal/remote to avoid an import cycle: the YAML (de)serialiser needs the wire shape, and the SSH client (internal/remote) needs to consume it. Keeping the shape in config and the runtime Target (parsed key, dialer config, etc.) in remote keeps the dependency direction one-way.

Key sources

There are three ways to tell roksbnkctl where to find the SSH private key for a target.

`key_path: <file>`

A PEM-encoded private key on disk:

bastion:
  host: ops.example.com
  user: jgruber
  key_path: ~/.ssh/id_ed25519

Standard OpenSSH private-key files are accepted, whatever the key type: RSA (id_rsa), Ed25519 (id_ed25519), ECDSA (id_ecdsa), and DSA (id_dsa; deprecated but supported). Tilde expansion follows os.UserHomeDir() semantics: ~/ expands to the user’s home directory, while ~user/ is not supported (use an absolute path instead).
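The tilde rule is easy to get subtly wrong, so here is a small sketch of it. It assumes a Unix-style path separator, and both helper names (expandWithHome, expandKeyPath) are hypothetical rather than taken from the roksbnkctl source.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// expandWithHome resolves a leading "~/" against the supplied home directory.
// Any other path starting with "~" (for example "~user/key") is rejected.
// Paths without a leading tilde are returned unchanged.
func expandWithHome(path, home string) (string, error) {
	switch {
	case strings.HasPrefix(path, "~/"):
		return filepath.Join(home, path[2:]), nil
	case strings.HasPrefix(path, "~"):
		return "", errors.New("~user/ paths are not supported; use an absolute path")
	default:
		return path, nil
	}
}

// expandKeyPath looks up the home directory only when the path needs it.
func expandKeyPath(path string) (string, error) {
	if !strings.HasPrefix(path, "~") {
		return path, nil
	}
	home, err := os.UserHomeDir()
	if err != nil {
		return "", err
	}
	return expandWithHome(path, home)
}

func main() {
	for _, p := range []string{"~/.ssh/id_ed25519", "/etc/keys/jump.pem", "~alice/.ssh/id_rsa"} {
		resolved, err := expandKeyPath(p)
		fmt.Printf("%-22s -> %q (err: %v)\n", p, resolved, err)
	}
}
```

Splitting out the pure expandWithHome function keeps the rule testable without depending on the environment of the machine running the test.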

The file is read at SSH-connect time, not at config-load time. A missing or unreadable file fails the connect, not the workspace load. This matters for ergonomics: you can edit a target into config.yaml referencing a key path that doesn’t exist yet, then create the key separately, without roksbnkctl init/use failing in between.

Encrypted keys (passphrase-protected) are not currently supported in the SSH client — the agent path is the recommended workflow for keys that need a passphrase.

`key_source: agent`

Talks to ssh-agent over $SSH_AUTH_SOCK:

prod-jump:
  host: 10.0.0.5
  user: ec2-user
  key_source: agent

The agent presents whichever keys it currently holds; roksbnkctl tries each in turn against the target’s authorized_keys (via SSH’s standard publickey-authentication exchange). The first key the server accepts is the one that gets used.

This is the right setting when:

Your team manages keys via 1Password / hardware tokens / gpg-agent and you don’t want a key file on disk.
You’re on a shared workstation where putting the key file in ~/.ssh/ is undesirable.
You’re already using ssh-agent for everything else and want consistent behaviour.

Platform note: ssh-agent integration is Linux/macOS-only. Windows users should use key_path to a file. The restriction is structural to the Go SSH library, which doesn’t wrap the Windows ssh-agent named-pipe protocol — see golang.org/x/crypto/ssh/agent and upstream tracking issues for status; full Windows support is a v2 item.

`key_source: tf-output:<output-name>`

Reads the key from the workspace’s terraform state output of that name:

jumphost:
  host: 169.45.91.177
  user: ubuntu
  key_source: tf-output:jumphost_shared_key

This is the auto-discovered jumphost path. The upstream HCL provisions a tls_private_key resource per cluster create, marks it sensitive, and surfaces it as a terraform output named jumphost_shared_key. roksbnkctl reads it via the equivalent of terraform output -raw <name> at SSH-connect time.

What this gets you that key_path doesn’t:

No on-disk key file separate from terraform state. The key only exists in terraform.tfstate (which is already a sensitive workspace artefact) and in memory during the SSH handshake.
Auto-rotation on cluster re-create. Destroy and re-create the cluster, terraform generates a new tls_private_key, and the next --on jumphost invocation picks up the new key without any manual rewriting of the workspace config.
Single source of truth. The key value is in terraform state — the same place every other cluster-generated secret lives.

The terraform output must be a string-typed PEM-encoded private key. terraform output -raw <name> returns the value regardless of the sensitive flag (the flag just suppresses display; the data is still readable to anyone with state access).
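As an illustration of how a key_source value maps onto the terraform lookup, the sketch below splits the descriptor and builds the argument list for the terraform output -raw call mentioned above. The keyRef type and both function names are hypothetical; the real resolver may be structured differently.

```go
package main

import (
	"fmt"
	"strings"
)

// keyRef is a hypothetical parsed form of a key_source value.
type keyRef struct {
	Kind   string // "agent" or "tf-output"
	Output string // terraform output name; empty for "agent"
}

// parseKeySource accepts the two documented forms: "agent" and
// "tf-output:<output-name>". Anything else is an error.
func parseKeySource(s string) (keyRef, error) {
	const prefix = "tf-output:"
	switch {
	case s == "agent":
		return keyRef{Kind: "agent"}, nil
	case strings.HasPrefix(s, prefix):
		name := strings.TrimPrefix(s, prefix)
		if name == "" {
			return keyRef{}, fmt.Errorf("key_source %q: missing output name", s)
		}
		return keyRef{Kind: "tf-output", Output: name}, nil
	default:
		return keyRef{}, fmt.Errorf("key_source %q: expected \"agent\" or \"tf-output:<name>\"", s)
	}
}

// terraformOutputArgs returns the arguments for `terraform output -raw <name>`.
func terraformOutputArgs(name string) []string {
	return []string{"output", "-raw", name}
}

func main() {
	ref, err := parseKeySource("tf-output:jumphost_shared_key")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ref.Kind, terraformOutputArgs(ref.Output))
}
```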

Host-key TOFU and `~/.roksbnkctl/known_hosts`

roksbnkctl keeps its own known_hosts file at ~/.roksbnkctl/known_hosts. It does not read or write ~/.ssh/known_hosts. The two files are independent.

Why a separate file

Three reasons:

Isolation. roksbnkctl’s SSH client is a different program from ssh(1), and each keeps its own host-key state: deleting a key from ~/.ssh/known_hosts doesn’t clear it from roksbnkctl’s view, or vice versa. Sharing one file would let a change made for one program silently alter the other’s trust decisions.
Audit. A roksbnkctl-managed file lets the tool’s behaviour be reasoned about without inspecting the user’s broader SSH state.
Cleanup. roksbnkctl ws delete <name> could theoretically scrub host-key entries on workspace destroy; mixing into ~/.ssh/known_hosts would mean editing a file the tool didn’t own.

The format matches OpenSSH’s ~/.ssh/known_hosts exactly (so future cross-pollination is technically possible), but the filenames are deliberately separate.

TOFU on first connect

The first time you connect to a target, roksbnkctl shows the host key fingerprint and asks whether to trust it:

$ roksbnkctl exec --on jumphost -- whoami
Add 169.45.91.177:22's key (SHA256:abc123def456ghi789jkl0mnopqrstuvwxyz/+=) to ~/.roksbnkctl/known_hosts? [y/N]: y
ubuntu

Answer y and the key is appended. Subsequent connects to the same host:port with the same server key trust silently.

Answer n and the connect aborts with exit code 126.

Mismatch behaviour

If the host key changes — re-provisioned VM, MITM attack, configuration drift — roksbnkctl refuses to connect:

error: host key mismatch: 169.45.91.177:22 known with SHA256:abc123... but
       server presented SHA256:zyx987...; if the host was rebuilt, edit
       ~/.roksbnkctl/known_hosts

Same model OpenSSH uses. The fix is the same: edit (or ssh-keygen -R) the file to remove the stale entry, then re-connect to re-trigger the TOFU prompt.

The default ssh-keygen binary works against ~/.roksbnkctl/known_hosts — pass -f:

ssh-keygen -R 169.45.91.177 -f ~/.roksbnkctl/known_hosts
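The three outcomes described in this section (unknown host, matching key, mismatched key) reduce to a small decision function. The sketch below is a simplification under stated assumptions: it keys a map by host:port and stores fingerprints, whereas a real known_hosts file stores full public keys and may hold several per host. The verdict type and function names are hypothetical. The fingerprint format (SHA-256 digest, unpadded base64, SHA256: prefix) is the one OpenSSH prints.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

type verdict int

const (
	verdictPrompt   verdict = iota // host not seen before: ask the TOFU question
	verdictTrust                   // recorded key matches: connect silently
	verdictMismatch                // recorded key differs: refuse to connect
)

// fingerprint renders the SHA-256 digest of a public-key blob in OpenSSH style.
func fingerprint(pubKeyBlob []byte) string {
	sum := sha256.Sum256(pubKeyBlob)
	return "SHA256:" + base64.RawStdEncoding.EncodeToString(sum[:])
}

// checkHostKey compares the presented fingerprint with the recorded one.
func checkHostKey(known map[string]string, hostPort, presented string) verdict {
	recorded, seen := known[hostPort]
	switch {
	case !seen:
		return verdictPrompt
	case recorded == presented:
		return verdictTrust
	default:
		return verdictMismatch
	}
}

func main() {
	// The byte slices stand in for real public-key blobs.
	original := fingerprint([]byte("original host key"))
	rebuilt := fingerprint([]byte("rebuilt host key"))
	known := map[string]string{"169.45.91.177:22": original}

	fmt.Println(checkHostKey(known, "10.0.0.5:22", original) == verdictPrompt)
	fmt.Println(checkHostKey(known, "169.45.91.177:22", original) == verdictTrust)
	fmt.Println(checkHostKey(known, "169.45.91.177:22", rebuilt) == verdictMismatch)
}
```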

`--insecure-host-key` for CI

Automation contexts can’t answer a TOFU prompt. The --insecure-host-key flag skips host-key verification entirely:

roksbnkctl exec --on jumphost --insecure-host-key -- whoami

This is insecure — anyone in the network path can MITM the connection — and is intended only for short-lived CI runs against ephemeral test infrastructure. Don’t use it where session content is sensitive.

The flag is per-invocation, not per-target. There’s deliberately no targets.<name>.insecure_host_key: true config knob: forcing the choice into the call site keeps the security implications visible at every invocation.

When to use it:

E2E tests against a freshly-provisioned cluster jumphost where the host key is just-generated and changes per run.
Pipeline runs against ephemeral test VMs that get torn down within minutes.
Recovery scenarios where the known-hosts file is corrupt and you need to bootstrap.

When not to use it:

Production jumphosts with stable identity.
Customer environments where session integrity matters.
Anything where the SSH session carries secrets you can’t afford to leak to a passive attacker.

`roksbnkctl targets` — full reference

Four subcommands. Chapter 16 introduces them with worked examples; here’s the complete flag surface.

`roksbnkctl targets list`

NAME       HOST                USER     KEY
jumphost   169.45.91.177:22    ubuntu   tf-output:jumphost_shared_key
bastion    ops.example.com:22  jgruber  file:~/.ssh/id_ed25519
prod-jump  10.0.0.5:22         ec2-user agent

Flags:

--verbose / -v: also prints whether the target has a known-hosts entry recorded.
-o json: machine-readable form. Schema: {"targets": [{"name": ..., "host": ..., "port": ..., "user": ..., "key_source": ...}]}.

The KEY column shows the source descriptor — never the key material. File-backed sources are prefixed file: to distinguish them visually from tf-output: and agent.
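For scripted callers, the -o json schema above maps directly onto a pair of Go structs. The sketch below decodes a hand-written sample that follows the documented field names; it was not captured from a real run. It assumes port is encoded as a JSON number, and how file-backed targets appear in the JSON is not covered by the schema shown, so the sample sticks to a key_source entry.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// targetEntry follows the field names in the documented schema.
type targetEntry struct {
	Name      string `json:"name"`
	Host      string `json:"host"`
	Port      int    `json:"port"` // assumed to be a JSON number
	User      string `json:"user"`
	KeySource string `json:"key_source"`
}

type targetList struct {
	Targets []targetEntry `json:"targets"`
}

// decodeTargets parses the output of `roksbnkctl targets list -o json`.
func decodeTargets(raw []byte) ([]targetEntry, error) {
	var out targetList
	if err := json.Unmarshal(raw, &out); err != nil {
		return nil, err
	}
	return out.Targets, nil
}

func main() {
	sample := []byte(`{"targets":[{"name":"jumphost","host":"169.45.91.177",` +
		`"port":22,"user":"ubuntu","key_source":"tf-output:jumphost_shared_key"}]}`)

	targets, err := decodeTargets(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, t := range targets {
		fmt.Printf("%s %s@%s:%d (%s)\n", t.Name, t.User, t.Host, t.Port, t.KeySource)
	}
}
```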

`roksbnkctl targets show <name>`

name:        jumphost
host:        169.45.91.177
port:        22
user:        ubuntu
key_source:  tf-output:jumphost_shared_key

Same restriction: key material itself is never printed.

-o json is supported for scripted callers.

`roksbnkctl targets add <name> ...`

Required flags: --host, --user, and exactly one of --key-path / --key-source.

# File-backed key
roksbnkctl targets add bastion \
  --host ops.example.com \
  --user jgruber \
  --key-path ~/.ssh/id_ed25519

# ssh-agent
roksbnkctl targets add prod-jump \
  --host 10.0.0.5 \
  --user ec2-user \
  --key-source agent

# Non-default port
roksbnkctl targets add custom \
  --host 10.0.0.5 \
  --user root \
  --key-path ~/.ssh/custom \
  --port 2222

# tf-output (rare; usually auto-populated by `up`)
roksbnkctl targets add backup-jump \
  --host 10.0.0.6 \
  --user ubuntu \
  --key-source tf-output:backup_jumphost_key

Refuses to add a target whose name collides with an existing entry — use targets remove <name> first, or pick a different name.

`roksbnkctl targets remove <name>`

roksbnkctl targets remove bastion

Removes the entry from config.yaml. Does not remove the corresponding host-key line from ~/.roksbnkctl/known_hosts — re-adding the same target later doesn’t re-trigger TOFU. This is deliberate; if you want to wipe the host key too, edit the known-hosts file by hand.

Auto-discovery from terraform outputs

The single most-used target — jumphost — is auto-populated post-roksbnkctl up. The flow:

roksbnkctl up runs terraform apply against the workspace’s HCL.
After successful apply, roksbnkctl reads three outputs: testing_tgw_jumphost_ip, testing_tgw_jumphost_user, jumphost_shared_key.
If testing_tgw_jumphost_ip is non-empty AND not the literal sentinel string "TGW jumphost not created" (which the upstream HCL emits when the testing module is disabled), roksbnkctl writes a jumphost target into config.yaml:
```
targets:
  jumphost:
    host: <testing_tgw_jumphost_ip>
    user: <testing_tgw_jumphost_user || "ubuntu">
    key_source: tf-output:jumphost_shared_key
```

A confirmation line is printed:

✓ Auto-registered target jumphost (169.45.91.177); use `roksbnkctl --on jumphost ...`

The auto-population is idempotent: re-running up against a workspace that already has a jumphost target rewrites the same fields. It is also unconditional. If you’ve manually customised the entry (changed the user, swapped to key_path), the next up overwrites your changes. There’s no merge logic; the latest up wins.

If testing_create_tgw_jumphost = false in tfvars, the upstream HCL skips creating the jumphost VM and emits the sentinel output. Auto-population is then a no-op, and you’re free to create your own jumphost (or differently-named) entry via targets add.

Per-AZ cluster jumphosts (`jumphost-<zone>`)

When testing_create_cluster_jumphosts = true, the deploy builds one cluster jumphost per cluster-VPC availability zone in addition to the single TGW jumphost — each on its own floating IP, all sharing the same key as the TGW jumphost. Since v1.5.0, the same post-up hook that seeds the singular jumphost also auto-registers one target per AZ:

After a successful up, roksbnkctl reads the testing_cluster_jumphost_ips terraform output — a map { zone => floating-IP }.

For each zone => fip, it upserts a target named jumphost-<zone>, reusing the same shared key the singular jumphost uses:

targets:
  jumphost-ca-tor-1:
    host: <ca-tor-1-fip>
    user: ubuntu
    key_source: tf-output:jumphost_shared_key
  # …one per AZ…

A summary line is printed:

✓ Auto-registered 3 per-AZ cluster jumphost targets (jumphost-ca-tor-1, jumphost-ca-tor-2, jumphost-ca-tor-3); use `roksbnkctl --on jumphost-<zone> ...`

Verify with roksbnkctl targets list — you should see jumphost plus one jumphost-<zone> per AZ. Each is a first-class --on target (full kubectl/oc/ibmcloud/shell passthrough, no SSH hop): roksbnkctl --on jumphost-ca-tor-2 kubectl get pods. Like the singular jumphost, registration is best-effort and idempotent — a parse/write failure logs a single warning: and does not fail up, and re-running up after a floating-IP rotation refreshes each jumphost-<zone> host in place. When testing_create_cluster_jumphosts = false (or the output is absent/empty), this is a silent no-op — only jumphost is seeded, with no warning noise.
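The upsert-only behaviour can be illustrated with a map of targets and a zone-to-IP map shaped like the testing_cluster_jumphost_ips output. This is a sketch under assumptions: upsertZoneJumphosts is a hypothetical name, the struct is a trimmed copy of TargetCfg, and the addresses are documentation-range placeholders.

```go
package main

import (
	"fmt"
	"sort"
)

// TargetCfg is a trimmed local copy of the documented target fields.
type TargetCfg struct {
	Host      string
	User      string
	KeySource string
}

// upsertZoneJumphosts writes one jumphost-<zone> target per entry in zoneIPs,
// overwriting any existing entry of the same name. It never deletes anything,
// so targets for zones that have disappeared are left in place.
// It returns the names it wrote, sorted for stable output.
func upsertZoneJumphosts(targets map[string]TargetCfg, zoneIPs map[string]string) []string {
	names := make([]string, 0, len(zoneIPs))
	for zone, fip := range zoneIPs {
		name := "jumphost-" + zone
		targets[name] = TargetCfg{
			Host:      fip,
			User:      "ubuntu",
			KeySource: "tf-output:jumphost_shared_key",
		}
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	targets := map[string]TargetCfg{
		// Left over from an earlier apply that still had a third zone.
		"jumphost-ca-tor-3": {Host: "192.0.2.30", User: "ubuntu", KeySource: "tf-output:jumphost_shared_key"},
	}
	zoneIPs := map[string]string{"ca-tor-1": "192.0.2.10", "ca-tor-2": "192.0.2.20"}

	fmt.Println("registered:", upsertZoneJumphosts(targets, zoneIPs))
	_, stale := targets["jumphost-ca-tor-3"]
	fmt.Println("stale ca-tor-3 entry still present:", stale)
}
```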

Orphaned-target caveat (option (a) upsert-only). Auto-registration upserts but never prunes. If a later apply removes a zone, or testing_create_cluster_jumphosts is flipped to false, the now-orphaned jumphost-<oldzone> target points at a destroyed host and lingers in your config until you remove it by hand:

roksbnkctl targets remove jumphost-ca-tor-3

A host re-created on a recycled floating IP will also trip the host-key mismatch refusal; see §“Host-key TOFU and ~/.roksbnkctl/known_hosts” and clear the stale known_hosts line with ssh-keygen -R <fip> -f ~/.roksbnkctl/known_hosts. An automatic-prune (reconcile) mode that removes orphans on the next up is a deliberate post-v1.5.0 follow-up (it needs unambiguous “this target is auto-managed” ownership semantics so a hand-named jumphost-mybox is never deleted). See PRD 09.

Pre-v1.5.0 fallback. On a release before v1.5.0 the per-AZ jumphosts are not auto-registered, so register each by hand. Look up the floating IPs with the read-only terraform command (v1.5.0+; Chapter 16 §“Per-AZ cluster jumphosts”):

roksbnkctl terraform output testing_cluster_jumphost_ips
roksbnkctl terraform output testing_cluster_jumphost_ssh_commands

…or, on an even older release without roksbnkctl terraform, use the raw form:

cd ~/.roksbnkctl/<ws>/state && TF_DATA_DIR=$PWD/terraform terraform output testing_cluster_jumphost_ips

Then register one target per AZ. Note that --key-source tf-output:jumphost_shared_key is correct because one shared key covers all jumphosts (see §“key_source: tf-output:<output-name>”):

roksbnkctl targets add jumphost-ca-tor-1 \
  --host <ca-tor-1-fip> --user ubuntu \
  --key-source tf-output:jumphost_shared_key
# …repeat per zone…

Each new IP triggers a one-time host-key TOFU prompt on first connect (see §“Host-key TOFU and ~/.roksbnkctl/known_hosts”). Manually-added targets are not auto-managed: a destroy+recreate rotates the FIPs and you must run targets add again for each one (contrast the v1.5.0 auto-registered targets, which up refreshes in place).

What is not auto-discovered

The auto-discovery flow registers the TGW jumphost (whenever the upstream HCL creates one) and, since v1.5.0, the per-AZ jumphost-<zone> targets when testing_create_cluster_jumphosts = true. It does not register:

Per-AZ jumphosts by private IP. There is no top-level testing_cluster_jumphost_private_ips terraform output; the private-IP hop pattern in Chapter 16 §“Per-AZ cluster jumphosts” is a documented zero-setup technique, not an auto-registered target.
Any non-jumphost host. Bastions, ops boxes, and the like are always targets add by hand.

Inspecting what the post-`up` flow saw

When the auto-population doesn’t happen and you expected it to, check:

roksbnkctl tf output testing_tgw_jumphost_ip
roksbnkctl tf output testing_tgw_jumphost_user
roksbnkctl tf output -json jumphost_shared_key | head -c 50

(The third one returns a JSON-encoded string for a sensitive output; truncate to confirm it’s non-empty without dumping the key.)

If all three are populated and the auto-write didn’t fire, that’s a bug — file an issue with the output values redacted.

What the SSH execution backend adds on top of `--on`

The --on <target> flag is the lightweight remote-exec path — one SSH session, one command, no remote state. The ssh:<target> execution backend layers more on top, reusing the same internal/remote.Client under internal/exec/ssh.go:

Capability	What it gives you
File materialisation	`RunOpts.Files` map gets written to `/tmp/roksbnkctl.<rand>/<basename>` on the remote, available as the working directory for the command. Cleanup via `trap 'rm -rf' EXIT` in a wrapper.
Env passing with fallback	First tries `ssh -o SetEnv=KEY=VALUE` (requires remote sshd `AcceptEnv`). On failure, writes a 0700 wrapper script that exports the env and execs the command, with `trap 'rm -f $0' EXIT` to scrub.
Apt bootstrap	If the remote target doesn’t have a tool (`iperf3`, `ibmcloud`) installed, the backend can `sudo apt-get install` it on demand (Ubuntu only at v1.0).
SCP-and-cleanup for kubeconfig	The backend’s recommended path for shipping a kubeconfig to the remote: SCP to a tempdir, run, `trap 'rm -rf' EXIT` to scrub.
Wrapper-script credential propagation	Detailed in PRD 04 § SSH. Brief on-disk window with strict cleanup.
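To make the “Env passing with fallback” row concrete, the sketch below generates a wrapper script of the kind the table describes: exports first, then the command, with a self-deleting EXIT trap. It is an illustration only, not the backend's actual generator. The function names are hypothetical, environment names are assumed to be valid shell identifiers, and the command is run as a child rather than through exec so that the EXIT trap can still fire; whether the real wrapper does the same is not stated here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// shellQuote wraps s in single quotes for POSIX sh, escaping any embedded
// single quote as '\''.
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// envWrapperScript builds a script that removes itself on exit, exports the
// given variables in sorted order, and then runs command.
// Keys are assumed to be valid shell variable names.
func envWrapperScript(env map[string]string, command string) string {
	keys := make([]string, 0, len(env))
	for k := range env {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	b.WriteString("#!/bin/sh\n")
	b.WriteString("trap 'rm -f \"$0\"' EXIT\n")
	for _, k := range keys {
		fmt.Fprintf(&b, "export %s=%s\n", k, shellQuote(env[k]))
	}
	b.WriteString(command + "\n")
	return b.String()
}

func main() {
	script := envWrapperScript(
		map[string]string{"IBMCLOUD_REGION": "ca-tor", "IBMCLOUD_API_KEY": "placeholder-not-a-real-key"},
		"ibmcloud ks cluster ls",
	)
	fmt.Print(script)
}
```

A caller would write this text to the remote with mode 0700 before running it, as the table row describes.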

The targets: schema and the roksbnkctl targets commands are the same surface for both — the backend just uses each target in a heavier-weight way. Anything you set up for --on keeps working under --backend ssh:<target>.

The split between the lightweight --on path and the full ssh backend is deliberate: --on stays simple — one SSH session, one command, no remote state. The backend handles the heavier lifting (file materialisation, package installation, multi-step orchestration).

Cross-references

Chapter 12 — Workspace config — where targets: fits in the overall schema.
Chapter 14 — Credentials and the resolver chain — the SSH-key sources from a credential-discipline perspective.
Chapter 16 — The --on flag and SSH jumphosts — the user-facing prose for “how do I use this”.
Chapter 17 — Execution backends — where the SSH backend sits in the broader backend matrix.
PRD 01 — SSH client + --on flag — the design rationale for targets: and the SSH client.
internal/remote/ package: https://github.com/jgruberf5/roksbnkctl/tree/main/internal/remote
internal/cli/targets.go: https://github.com/jgruberf5/roksbnkctl/blob/main/internal/cli/targets.go

The --on flag and SSH jumphosts

The --on <target> flag (most commonly --on jumphost) re-runs a roksbnkctl passthrough command (exec, shell, kubectl, oc, ibmcloud) on a remote SSH host instead of locally. After a successful roksbnkctl up, a jumphost target is auto-populated from the upstream HCL’s terraform outputs, so --on jumphost works with no manual configuration in the common case.

This chapter covers when to reach for --on, the targets: workspace config block, the auto-population behaviour, the roksbnkctl targets command tree for managing your own targets, and how host-key trust is established.

The full design rationale for this feature lives in PRD 01. This chapter is the user-facing distillation.

Why this exists

There are a handful of scenarios where running a command from your laptop is the wrong answer:

Customer-firewall scenarios. Your customer’s network policy lets the corporate jumphost reach *.cloud.ibm.com but blocks your laptop’s egress to anything except web traffic. ibmcloud iam oauth-tokens works from the jumphost; from your laptop it times out.
Air-gapped environments. The cluster lives in a VPC with no public ingress, accessible only via a bastion VM. The cluster API server isn’t reachable from your laptop at all; you need to be inside the network to talk to it.
Pre-cluster operations. You want to run ibmcloud commands against the IBM Cloud API but your workstation doesn’t have ibmcloud installed and you’d rather not install it. The jumphost has it; route through there.

--on makes those scenarios one flag rather than “ssh to the jumphost, install your tools there, copy your kubeconfig over manually”. The SSH client is built into roksbnkctl (using golang.org/x/crypto/ssh); no host ssh binary is required.

The `targets:` workspace config block

Targets are stored in your workspace config at ~/.roksbnkctl/<workspace>/config.yaml under a targets: key:

targets:
  jumphost:                                # auto-populated after `roksbnkctl up`
    host: 169.45.91.177
    user: ubuntu
    key_source: tf-output:jumphost_shared_key
    port: 22                               # default; can be omitted

  bastion:                                 # user-defined
    host: ops.example.com
    user: jgruber
    key_path: ~/.ssh/id_ed25519

  prod-jump:
    host: 10.0.0.5
    user: ec2-user
    key_source: agent                      # use ssh-agent

Each entry has at minimum host and user. Port defaults to 22. Key resolution is determined by exactly one of key_path or key_source — see “Key sources” below.

You don’t typically edit this file by hand. The auto-discovery flow populates jumphost for you, and roksbnkctl targets add ... populates other entries.

Auto-discovery from `roksbnkctl up`

The upstream HCL provisions a small testing jumphost as part of the cluster apply unless jumphost creation is disabled (testing_create_tgw_jumphost = false). Three terraform outputs surface it:

testing_tgw_jumphost_ip: the public IP of the jumphost VM.
testing_tgw_jumphost_user: the login user for the jumphost.
jumphost_shared_key: the private key (PEM) the jumphost was provisioned with, marked sensitive in the HCL.

After a successful roksbnkctl up, roksbnkctl reads these outputs and writes a jumphost target into your workspace config:

✓ Auto-registered target jumphost (169.45.91.177); use `roksbnkctl --on jumphost ...`

The auto-registered target takes its user from testing_tgw_jumphost_user and falls back to ubuntu when that output is empty (the upstream HCL provisions an Ubuntu cloud image whose default user is ubuntu). Chapter 15 §“Auto-discovery from terraform outputs” describes the full flow.

The key_source: tf-output:jumphost_shared_key form means the private key is read from terraform state at SSH-connect time rather than being copied into the workspace config or written to disk separately. The key only ever exists in terraform state and in memory during a connect; destroying and re-creating the cluster generates a new key, and roksbnkctl picks up the new one without any manual intervention.

If your cluster apply produced a testing_tgw_jumphost_ip output of "TGW jumphost not created" (the upstream HCL emits this string when the testing module is disabled) the auto-population is skipped. You can still add a jumphost target manually with roksbnkctl targets add if you have a different bastion in mind.

Key sources

Three ways to tell roksbnkctl how to find the SSH private key:

key_path: <path> — a file on disk. Standard OpenSSH key formats are accepted (~/.ssh/id_ed25519, ~/.ssh/id_rsa, etc.). Tilde expansion is honoured.
key_source: agent — talk to the user’s ssh-agent over the socket pointed at by $SSH_AUTH_SOCK. The agent presents whichever keys it currently holds; roksbnkctl tries each in turn against the target’s authorized_keys. This is the right setting if your team already manages keys via 1Password / hardware tokens / gpg-agent and you don’t want a key file on disk. Note: ssh-agent integration is Linux/macOS-only at v1.0; Windows users should use key_path instead. Windows ssh-agent named-pipe support is on the v1.x roadmap.
key_source: tf-output:<output-name> — read the key from the workspace’s terraform state output of that name. Used by the auto-discovered jumphost target. The terraform output must be a string-typed PEM-encoded private key; sensitive outputs work fine because terraform output -raw <name> returns the value regardless of the sensitive flag.

Exactly one of key_path or key_source must be set per target. roksbnkctl targets show <name> will tell you which is in use without printing the key material.

Host-key TOFU on first connect

The first time you connect to a target, roksbnkctl shows the host key fingerprint and asks whether to trust it. The prompt is a single line:

$ roksbnkctl exec --on jumphost -- whoami
Add 169.45.91.177:22's key (SHA256:abc123def456ghi789jkl0mnopqrstuvwxyz/+=) to ~/.roksbnkctl/known_hosts? [y/N]: y
ubuntu

Answer y and the key is appended to ~/.roksbnkctl/known_hosts (the same format as OpenSSH’s ~/.ssh/known_hosts). Subsequent connects trust silently.

Answer n and the connect fails with a clear “host key not trusted” error.

If the host key changes between runs — which would happen on a re-provisioned VM, or could happen as a man-in-the-middle attack — roksbnkctl refuses to connect:

error: host key mismatch: 169.45.91.177:22 known with SHA256:abc123... but server presented SHA256:zyx987...; if the host was rebuilt, edit ~/.roksbnkctl/known_hosts

This is “trust on first use” (TOFU) — the same model OpenSSH uses for new hosts. Exit code is 126 on host-key rejections.

`--insecure-host-key` for CI

In automation contexts where a TOFU prompt would block forever, pass --insecure-host-key to skip host-key verification entirely:

roksbnkctl exec --on jumphost --insecure-host-key -- whoami

This is insecure — anyone in the network path can MITM the connection — and is intended only for short-lived CI runs against ephemeral test infrastructure. Don’t use it in any context where the SSH session matters for security.

The `roksbnkctl targets` command tree

Four subcommands for managing target entries:

roksbnkctl targets list
roksbnkctl targets show <name>
roksbnkctl targets add <name> --host ... --user ... --key-path ...
roksbnkctl targets remove <name>

`targets list`

roksbnkctl targets list
NAME       HOST                USER     KEY
jumphost   169.45.91.177:22    ubuntu   tf-output:jumphost_shared_key
bastion    ops.example.com:22  jgruber  file:~/.ssh/id_ed25519

Prints every target in the current workspace’s config. The KEY column shows the key source — never the key material itself. File-backed keys are prefixed with file: so they’re visually distinct from tf-output: and agent sources.

`targets show <name>`

roksbnkctl targets show jumphost
name:        jumphost
host:        169.45.91.177
port:        22
user:        ubuntu
key_source:  tf-output:jumphost_shared_key

Prints the full record. Note that the key material itself is never printed — only the source descriptor (file path, ssh-agent, or terraform-output name).

`targets add <name> ...`

roksbnkctl targets add bastion \
  --host ops.example.com \
  --user jgruber \
  --key-path ~/.ssh/id_ed25519

# or with ssh-agent:
roksbnkctl targets add prod-jump \
  --host 10.0.0.5 \
  --user ec2-user \
  --key-source agent

# or with a non-default port:
roksbnkctl targets add custom \
  --host 10.0.0.5 \
  --user root \
  --key-path ~/.ssh/custom \
  --port 2222

Writes the new target into ~/.roksbnkctl/<workspace>/config.yaml. Refuses if a target of that name already exists (use targets remove first).

`targets remove <name>`

roksbnkctl targets remove bastion

Removes the entry from config.yaml. Does not remove the corresponding line from ~/.roksbnkctl/known_hosts — the host key stays recorded so re-adding the same target later doesn’t re-trigger a TOFU prompt.

Working examples

The everyday verbs:

# Run an arbitrary command on the jumphost
roksbnkctl exec --on jumphost -- whoami
# → ubuntu

roksbnkctl exec --on jumphost -- uname -a
# → Linux jumphost-vm 5.15.0-... #... SMP ... x86_64 GNU/Linux

# Interactive PTY shell
roksbnkctl shell --on jumphost
# → drops you into the jumphost's default shell as the configured user
# → exit returns you to your local prompt

# ibmcloud passthrough — runs `ibmcloud ks cluster ls` on the jumphost
# (handy when your laptop's network can't reach IBM Cloud APIs)
roksbnkctl ibmcloud --on jumphost ks cluster ls

# kubectl passthrough — same pattern
roksbnkctl kubectl --on jumphost get pods -A

# oc passthrough
roksbnkctl oc --on jumphost projects

Per-AZ cluster jumphosts

When your deploy sets testing_create_cluster_jumphosts = true, the upstream HCL builds one cluster jumphost per cluster-VPC availability zone in addition to the single TGW jumphost. Since v1.5.0 these are auto-registered by roksbnkctl up as jumphost-<zone> targets (alongside the singular jumphost) — see Chapter 15 §“Auto-discovery from terraform outputs”. They are first-class --on targets: full passthrough, no hop.

# verify what `up` registered:
roksbnkctl targets list
# → jumphost, jumphost-ca-tor-1, jumphost-ca-tor-2, jumphost-ca-tor-3

# run directly on a specific per-AZ cluster jumphost (no hop, full passthrough):
roksbnkctl kubectl --on jumphost-ca-tor-2 get nodes
roksbnkctl shell --on jumphost-ca-tor-1

Pre-v1.5.0 fallback. On a release before v1.5.0 the per-AZ jumphosts are not auto-registered. Look up their floating IPs and register each by hand — roksbnkctl terraform output testing_cluster_jumphost_ips (v1.5.0’s read-only terraform, Chapter 15), or on an even older release cd ~/.roksbnkctl/<ws>/state && TF_DATA_DIR=$PWD/terraform terraform output testing_cluster_jumphost_ips — then roksbnkctl targets add jumphost-<zone> --host <fip> --user ubuntu --key-source tf-output:jumphost_shared_key per zone. Chapter 15 §“Auto-discovery from terraform outputs” documents this fallback in full.

Hopping to a cluster jumphost via the registered `jumphost` (zero-setup, no roksbnkctl state)

The shared key is installed on every jumphost and the private key file is present on each box at /home/ubuntu/.ssh/id_rsa. So from the auto-registered TGW jumphost you can hop to any cluster jumphost by its private IP (the TGW jumphost reaches the cluster VPC over the Transit Gateway) with no key copying and nothing added to roksbnkctl state — handy for a one-off against a host you don’t want to register:

# the per-AZ private IPs are not a top-level terraform output; read them
# from inside the TGW jumphost (it sits in the routed network):
roksbnkctl exec --on jumphost -- \
  curl -s http://169.254.169.254/...   # or your own inventory source

# run a command on a cluster jumphost, hopping through the TGW jumphost,
# using the shared key already on the box (no scp):
roksbnkctl exec --on jumphost -- \
  ssh -i /home/ubuntu/.ssh/id_rsa -o StrictHostKeyChecking=accept-new \
  ubuntu@<cluster-jumphost-private-ip> kubectl get nodes

The inner ssh uses the on-box /home/ubuntu/.ssh/id_rsa (the same shared key the auto-registered targets use — one key for all jumphosts), so no key material is copied. StrictHostKeyChecking=accept-new on the inner hop avoids an interactive prompt the outer non-PTY session can’t answer. This is the zero-setup path: nothing is written to ~/.roksbnkctl/. For first-class repeated access prefer the auto-registered jumphost-<zone> targets above.

Orphaned-target caveat. The auto-registered jumphost-<zone> targets use option-(a) upsert-only registration: if a destroy removes a zone (or testing_create_cluster_jumphosts is flipped to false), the stale jumphost-<oldzone> entry lingers in your config until you roksbnkctl targets remove jumphost-<oldzone> by hand. A re-created host on a recycled floating IP will also trip the host-key TOFU mismatch — see Chapter 15 §“Host-key TOFU and ~/.roksbnkctl/known_hosts”. An automatic-prune (reconcile) mode is a tracked post-v1.5.0 follow-up.

Behaviour details worth knowing:

Streaming I/O. stdout, stderr, stdin all stream in real time — the same as running the command locally. Long-running commands (oc adm top nodes, ibmcloud ks cluster get on a slow API call) work normally.
Exit code propagation. The remote command’s exit code is the local exit code. A failing remote command produces a non-zero roksbnkctl exit; a succeeding remote command produces 0. CI scripts can rely on this.
TTY auto-detection. roksbnkctl shell --on auto-allocates a PTY. Other verbs (exec, kubectl, oc, ibmcloud) run without a PTY at v1.0; if you need a PTY for top or another isatty()-sensitive command, fall back to roksbnkctl shell --on jumphost and run the command from the interactive shell.
Environment passthrough. Only machine-portable values cross the SSH boundary: IBMCLOUD_API_KEY, IC_API_KEY, IBMCLOUD_REGION, and IBMCLOUD_VERSION_CHECK are propagated to the remote session via SSH SetEnv, so ibmcloud iam oauth-tokens on the jumphost authenticates with the same key your local workspace uses. The remote sshd must accept these names through its AcceptEnv directive (for example AcceptEnv IBMCLOUD_* IC_API_KEY) for this to work; the upstream HCL’s jumphost is already configured for it. KUBECONFIG is not forwarded: it is a local filesystem path, meaningless on the target, and forwarding it would shadow the jumphost’s own cloud-init-provisioned /home/ubuntu/.kube/config. Passthrough kubectl/oc on the target therefore use the target’s own kubeconfig, which is the correct behaviour. (Before v1.5.0 the local KUBECONFIG path was forwarded; after a successful local up that deterministically broke --on jumphost kubectl|oc with the error “connection to the server localhost:8080 was refused”; see the v1.5.0 changelog. Local roksbnkctl kubectl without --on still resolves KUBECONFIG via the local chain, unchanged.)
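The environment passthrough rule is an allow-list, which the following sketch makes explicit. The four names come from the text above; the forwarded map and filterEnv function are hypothetical, and the input is a hand-written slice in os.Environ() format so that no real credentials are printed.

```go
package main

import (
	"fmt"
	"strings"
)

// forwarded lists the variable names documented as crossing the SSH boundary.
var forwarded = map[string]bool{
	"IBMCLOUD_API_KEY":       true,
	"IC_API_KEY":             true,
	"IBMCLOUD_REGION":        true,
	"IBMCLOUD_VERSION_CHECK": true,
}

// filterEnv takes "KEY=VALUE" entries (the os.Environ() format) and keeps
// only the allow-listed names. KUBECONFIG and everything else is dropped.
func filterEnv(environ []string) map[string]string {
	out := make(map[string]string)
	for _, kv := range environ {
		key, value, ok := strings.Cut(kv, "=")
		if ok && forwarded[key] {
			out[key] = value
		}
	}
	return out
}

func main() {
	local := []string{
		"HOME=/home/jgruber",
		"KUBECONFIG=/home/jgruber/.kube/config",
		"IBMCLOUD_REGION=ca-tor",
		"IBMCLOUD_API_KEY=placeholder-not-a-real-key",
	}
	remote := filterEnv(local)
	_, hasKubeconfig := remote["KUBECONFIG"]
	fmt.Println("forwarded count:", len(remote), "KUBECONFIG forwarded:", hasKubeconfig)
}
```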

What `--on` doesn’t do (yet)

A few things deliberately deferred to later phases:

Lifecycle commands (up, down, plan, apply) reject --on with a clear error at v1.0. Running terraform on a remote host has different state-handling considerations and is the job of the SSH execution backend (Chapter 17) — terraform over --backend ssh:<target> is itself deferred to v1.x (state-file portability).
ProxyJump / multi-hop SSH. If your jumphost itself is reached through another bastion, that’s not directly supported at v1.0. The upstream HCL’s jumphost design lets the TGW jumphost reach cluster-internal VMs natively, so you usually don’t need multi-hop in practice. ProxyJump support is on the v1.x roadmap. (The per-AZ cluster jumphosts themselves are auto-registered as first-class jumphost-<zone> targets since v1.5.0 — see §“Per-AZ cluster jumphosts” above; the manual ssh -i … via the registered jumphost hop there is the zero-setup alternative for one-offs, not a roksbnkctl-native multi-hop.)
~/.ssh/config parsing. Targets must be defined explicitly in workspace config; roksbnkctl does not read your existing ~/.ssh/config.
Password auth. Keys + agent only. Passwords are not supported and won’t be.
SCP / SFTP. File transfer is the SSH execution backend’s job (handled via RunOpts.Files materialisation; see Chapter 17 §“SSH backend”). --on does one-shot remote exec only.
Windows ssh-agent. The key_source: agent path is Linux/macOS only at v1.0; Windows users must use key_path to a file. Already noted in Key sources above; called out here so a Windows reader who skipped to this section doesn’t miss it.

Cross-reference

Chapter 17 — Execution backends extends the SSH client used here into a full execution backend with file materialisation, env-file fallback for sshd configurations that can’t AcceptEnv, and apt-bootstrap of missing tools on Ubuntu jumphosts. The --on flag stays as the lightweight one-shot path; --backend ssh is the deeper integration. The two share the same internal/remote.Client so what you learn here translates directly.

For the design rationale, edge cases, and open questions, read PRD 01 — this chapter is the user-facing surface; PRD 01 is the developer-facing surface.

Execution backends: local, docker, k8s, ssh

roksbnkctl runs a handful of external tools as part of its job: ibmcloud, terraform, iperf3, and eventually dig-equivalents and others. By default each tool runs as a child process on your laptop. That’s fine for some tools and wrong for others: iperf3 from your laptop measures your laptop’s internet uplink, not the cluster’s bandwidth. Where a tool runs also determines which version of it runs: terraform via docker (the --backend docker mode covered below) lets you pin a frozen tool version for CI reproducibility without installing it on the host.

The execution-backend system lets you pick where each tool runs without changing the surface command. The same roksbnkctl ibmcloud ks cluster ls invocation can run as a local process, inside a vendored container, inside the cluster, or on a remote SSH host — selected by a flag or a per-tool default in your workspace config.

This chapter is the user-facing reference for all four backends. After the introduction, each backend gets its own deep-dive section covering the mechanics, the credential-propagation rules, the failure modes, and a short “when to use it” callout. Chapter 18 is the decision-tree companion that picks one for a given (tool, scenario) pair.

Architecture at a glance

The four backends sit between roksbnkctl (the binary on your laptop) and the external tools each backend runs. Every backend produces the same observable behaviour from the user’s point of view — the same roksbnkctl ibmcloud ks cluster ls invocation — but routes the actual tool execution to a different network vantage.

graph LR
    User[laptop<br/>roksbnkctl binary]
    subgraph local[local backend]
        L_tf[terraform]
        L_ibm[ibmcloud]
        L_iperf[iperf3]
        L_dns[dns probe]
    end
    subgraph docker[docker backend]
        D_ibm[ibmcloud<br/>frozen image]
        D_tf[terraform<br/>frozen image]
    end
    subgraph k8s[k8s backend]
        K_ops[ops pod<br/>ibmcloud]
        K_job[one-shot Job<br/>iperf3 / dns]
    end
    subgraph ssh[ssh:target backend]
        S_ibm[ibmcloud<br/>on jumphost]
        S_iperf[iperf3<br/>on jumphost]
    end
    Cluster[ROKS cluster<br/>cert-manager + flo + BNK]
    Jump[SSH jumphost<br/>auto-discovered from terraform]
    IBMAPI[IBM Cloud API<br/>+ IAM]

    User --> local
    User --> docker
    User --> k8s
    User --> ssh
    local --> IBMAPI
    docker --> IBMAPI
    k8s --> Cluster
    Cluster --> IBMAPI
    ssh --> Jump
    Jump --> IBMAPI
    Jump -.->|optional<br/>private path| Cluster

    classDef bk fill:#f4f4f4,stroke:#666,color:#000;
    class local,docker,k8s,ssh bk;

Chapter 18 is the decision-tree companion that picks a backend for a given scenario; the rest of this chapter is the per-backend mechanics.

The four backends at a glance

Backend	What it does
local	`os/exec` — spawns the tool as a child process, inheriting your env and PATH
docker	`docker run` against a vendored image (`ghcr.io/jgruberf5/roksbnkctl-tools-<tool>:<v>`); frozen toolchain version
k8s	Runs inside the cluster, either in a long-lived ops pod or as a one-shot Job; auth via the pod’s ServiceAccount token
ssh	Runs on a registered SSH target via the built-in SSH client; opt-in apt-bootstrap of missing tools on Ubuntu

Each backend solves a different problem:

local: fastest startup, simplest mental model, requires the host tool to exist on PATH.
docker: reproducible across dev machines, no host install needed, frozen at a known-good tool version.
k8s: network-correct (private IPs reachable, cluster-internal services accessible), zero host install, in-cluster identity via ServiceAccount.
ssh: pre-cluster ops from a known-IP bastion, customer-firewall workflows, air-gapped environments where the laptop can’t reach IBM Cloud APIs but the jumphost can.

All four implementations conform to the same Go interface (internal/exec.Backend) so callers don’t branch on backend type — they just call backend.Run(ctx, argv, opts) and let the implementation handle the mechanics. That uniformity is what lets the same roksbnkctl ibmcloud ks cluster ls work across all four with no surface-level change.

The `--backend` CLI flag

Override the per-tool default for a single invocation:

# Local (the implicit default for ibmcloud + terraform)
roksbnkctl ibmcloud ks cluster ls

# Same command, in a vendored docker image
roksbnkctl ibmcloud --backend docker ks cluster ls

# Same command, in the cluster (requires `roksbnkctl ops install` first)
roksbnkctl ibmcloud --backend k8s ks cluster ls

# Same command, on a remote SSH host
roksbnkctl ibmcloud --backend ssh:jumphost ks cluster ls

Format:

--backend local|docker|k8s|ssh:<target>

The ssh:<target> form pins the SSH backend to a specific named target from roksbnkctl targets list (registered via roksbnkctl targets add; see Chapter 16).

The flag is persistent at the root — it works for any command that runs an external tool. Commands that don’t run external tools (like roksbnkctl ws list) ignore it.

The flag wins over the workspace-config default. If config.yaml says iperf3: { backend: k8s } and you pass --backend local, the local backend runs.
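As a rough sketch of the flag format and the precedence rule, the following Go program parses a --backend value and lets a non-empty flag override the config default. BackendSpec, parseBackend, and resolveBackend are hypothetical names chosen for illustration; the real parsing code may be structured differently.

```go
package main

import (
	"fmt"
	"strings"
)

// BackendSpec is a hypothetical parsed form of a --backend value.
type BackendSpec struct {
	Kind   string // local, docker, k8s, or ssh
	Target string // set only for ssh:<target>
}

// parseBackend accepts local|docker|k8s|ssh:<target> and rejects anything else.
func parseBackend(s string) (BackendSpec, error) {
	switch s {
	case "local", "docker", "k8s":
		return BackendSpec{Kind: s}, nil
	}
	if strings.HasPrefix(s, "ssh:") {
		if target := strings.TrimPrefix(s, "ssh:"); target != "" {
			return BackendSpec{Kind: "ssh", Target: target}, nil
		}
	}
	return BackendSpec{}, fmt.Errorf("invalid backend %q: want local|docker|k8s|ssh:<target>", s)
}

// resolveBackend applies the precedence rule: a non-empty flag value wins
// over the per-tool default from workspace config.
func resolveBackend(flagValue, configDefault string) (BackendSpec, error) {
	if flagValue != "" {
		return parseBackend(flagValue)
	}
	return parseBackend(configDefault)
}

func main() {
	// config.yaml says iperf3: { backend: k8s }; the flag overrides it.
	spec, err := resolveBackend("local", "k8s")
	fmt.Println(spec.Kind, err)

	// No flag given: the config default applies.
	spec, err = resolveBackend("", "ssh:jumphost")
	fmt.Println(spec.Kind, spec.Target, err)
}
```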

Backend-failure semantics

Each backend has a different failure surface. The convention is:

Backend startup failure (Docker daemon unreachable, k8s API unreachable, SSH connect refused, binary not on PATH for local) ⇒ exit code 127, with a message naming the cause. No silent fallback to local. Silent fallback hides intent and produces confusing test results.
Backend mid-run failure (the container started but couldn’t pull a sub-resource; the pod was OOMKilled before the wrapped tool ran; the SSH session died after apt-get install but before the tool exec) ⇒ exit code 126, distinct from 127 so CI can tell “we never got going” from “we got going then broke”.
Tool exit code (the actual ibmcloud / terraform / iperf3 exit code, anything in 0-125 or 128-255) ⇒ propagated 1:1, including non-zero codes.
Context cancellation / timeout ⇒ exit code 137 (128 + 9, the conventional code for a process terminated by SIGKILL).

This way, your CI script can tell “the tool said X failed” (typical exit codes) from “we never reached the tool” (127) from “we reached the tool, then the backend died mid-flight” (126) from “we ran out of time” (137).
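A CI wrapper could key on these codes along the following lines. This is a minimal sketch of the convention as stated above; the classify function and its category labels are hypothetical, not part of roksbnkctl.

```go
package main

import "fmt"

// classify maps a roksbnkctl exit code onto the four categories described
// above. The function name and the labels are illustrative only.
func classify(code int) string {
	switch code {
	case 0:
		return "success"
	case 126:
		return "backend-mid-run-failure"
	case 127:
		return "backend-startup-failure"
	case 137:
		return "timeout-or-cancel"
	default:
		return "tool-failure"
	}
}

func main() {
	for _, code := range []int{0, 1, 126, 127, 137} {
		fmt.Printf("%d -> %s\n", code, classify(code))
	}
}
```

A pipeline might retry on the two backend categories and fail fast on tool-failure, since only the latter says something about the system under test.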

Per-tool defaults from `exec:`

Workspace config carries the per-tool default backend in the exec: block:

# ~/.roksbnkctl/<workspace>/config.yaml
exec:
  ibmcloud:  { backend: local }
  iperf3:    { backend: k8s }
  terraform: { backend: local }

The defaults shipped today:

Tool	Default backend	Supported backends	Why
`terraform`	`local`	`local`, `docker`	The terraform-exec local path is the established workflow. State handling is simplest here. The `docker` backend runs frozen `hashicorp/terraform:1.5.7` with a bind-mounted state dir — see § terraform via docker. `k8s` and `ssh` are deferred to v1.x.
`ibmcloud`	`local`	`local`, `docker`, `k8s`, `ssh:<target>`	Most users have it on PATH or are happy installing it. Compliance/firewall scenarios opt in via `--backend ssh:jumphost` or `docker`.
`iperf3`	`k8s`	`local`, `k8s`, `ssh:<target>`	Throughput from a laptop’s uplink isn’t the cluster’s bandwidth. The k8s default runs the iperf3 client adjacent to (or inside) the cluster so the number reflects cluster fabric, not your office Wi-Fi.
`dns`	`local`	`local`, `k8s`, `ssh:<target>`	Single-vantage by default; `--gslb-compare` fans out across configured vantages for GSLB validation. See Chapter 21.
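The table above can be read as a small policy map: one default and one supported set per tool. The sketch below encodes it that way and rejects unsupported combinations. The toolPolicy type, the policies map, and checkBackend are hypothetical names, and the error wording is illustrative rather than the binary's actual message.

```go
package main

import (
	"fmt"
	"strings"
)

// toolPolicy is a hypothetical encoding of one row of the defaults table.
type toolPolicy struct {
	Default   string
	Supported []string
}

// policies transcribes the table; "ssh" stands for any ssh:<target>.
var policies = map[string]toolPolicy{
	"terraform": {Default: "local", Supported: []string{"local", "docker"}},
	"ibmcloud":  {Default: "local", Supported: []string{"local", "docker", "k8s", "ssh"}},
	"iperf3":    {Default: "k8s", Supported: []string{"local", "k8s", "ssh"}},
	"dns":       {Default: "local", Supported: []string{"local", "k8s", "ssh"}},
}

// checkBackend returns nil when the tool supports the backend kind.
func checkBackend(tool, kind string) error {
	p, ok := policies[tool]
	if !ok {
		return fmt.Errorf("unknown tool %q", tool)
	}
	for _, s := range p.Supported {
		if s == kind {
			return nil
		}
	}
	return fmt.Errorf("%s doesn't support backend %q; supported: %s",
		tool, kind, strings.Join(p.Supported, ", "))
}

func main() {
	fmt.Println(checkBackend("iperf3", "k8s"))    // supported
	fmt.Println(checkBackend("terraform", "k8s")) // rejected
	fmt.Println(checkBackend("dns", "docker"))    // rejected
}
```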

Chapter 12 — Workspace config covers the exec: block schema in detail; this chapter just notes its place in the backend system.

Chapter 18 — Choosing a backend per tool is the decision tree for “which backend should I pick for this tool in this scenario”.

Per-backend deep dives

`local` backend

The default for ibmcloud and terraform. os/exec.CommandContext(ctx, argv[0], argv[1:]...), inheriting the parent process’s environment, PATH, and working directory. Mechanically the simplest of the four — no container, no cluster, no network handshake.

`os/exec` shape

internal/exec/local.go resolves argv[0] via exec.LookPath, then builds a *exec.Cmd:

bin, err := exec.LookPath(argv[0])
// fall through to argv[0] verbatim if it's an absolute path that LookPath rejects
cmd := exec.CommandContext(ctx, bin, argv[1:]...)
cmd.Env    = effectiveEnv     // os.Environ() + opts.Env + Credentials.EnvVars()
cmd.Dir    = opts.WorkDir     // empty → inherit caller's CWD
cmd.Stdin  = opts.Stdin
cmd.Stdout = redactor(opts.Stdout, creds)
cmd.Stderr = redactor(opts.Stderr, creds)

The redactor wrap is defense-in-depth (see Chapter 14 §“The redactor”). If a wrapped tool ever prints the IBMCLOUD_API_KEY value to stdout (a debug trace, an error message), the redactor replaces it with [REDACTED] before the bytes leave the binary.

Env propagation

Three sources, in order:

The host process’s environment (os.Environ()) — your shell’s PATH, HOME, KUBECONFIG, etc.
RunOpts.Env — caller-supplied KEY=VALUE strings (e.g., IBMCLOUD_REGION=ca-tor from the workspace config).
Credentials.EnvVars() — IBMCLOUD_API_KEY=… plus the legacy IC_API_KEY=… alias older ibmcloud versions accept.

os/exec documents that for duplicate keys the last entry wins. So caller-supplied vars override host env, and credential vars override caller-supplied — meaning a workspace’s API key always wins over a stale IBMCLOUD_API_KEY in your shell.
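The ordering can be made concrete with a short sketch. The effectiveEnv and lookup helpers are hypothetical; lookup only mimics the last-entry-wins rule so the result can be inspected without spawning a child process.

```go
package main

import (
	"fmt"
	"strings"
)

// effectiveEnv concatenates the three sources in the documented order:
// host environment, caller-supplied vars, then credential vars.
func effectiveEnv(host, caller, creds []string) []string {
	env := make([]string, 0, len(host)+len(caller)+len(creds))
	env = append(env, host...)
	env = append(env, caller...)
	env = append(env, creds...)
	return env
}

// lookup returns the value a child would see for key, scanning from the end
// because the last duplicate wins.
func lookup(env []string, key string) (string, bool) {
	for i := len(env) - 1; i >= 0; i-- {
		k, v, ok := strings.Cut(env[i], "=")
		if ok && k == key {
			return v, true
		}
	}
	return "", false
}

func main() {
	host := []string{"IBMCLOUD_API_KEY=stale-shell-key", "IBMCLOUD_REGION=us-south"}
	caller := []string{"IBMCLOUD_REGION=ca-tor"}
	creds := []string{"IBMCLOUD_API_KEY=workspace-key"}

	env := effectiveEnv(host, caller, creds)
	key, _ := lookup(env, "IBMCLOUD_API_KEY")
	region, _ := lookup(env, "IBMCLOUD_REGION")
	fmt.Println(key, region)
}
```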

The local backend does not scrub the host env. If you have an unrelated AWS_ACCESS_KEY_ID in your shell, the wrapped tool sees it. That’s by design — local is the “trust the user’s shell” path; if you want a hermetic env, switch to docker.

Working directory

RunOpts.WorkDir becomes cmd.Dir. Empty → inherit the caller’s CWD (Cobra’s RootCmd.Run runs from wherever the user invoked roksbnkctl).

When RunOpts.Files is non-empty and WorkDir is empty, the local backend creates a tempdir under os.TempDir(), writes each Files entry as a 0600 file inside, and uses the tempdir as WorkDir. The tempdir is removed via defer after Run returns. This is mostly there for symmetry with the docker / k8s / ssh backends; today’s ibmcloud passthrough never uses it.

Signal handling

exec.CommandContext wires ctx cancellation to the child: when the ctx ticks past its deadline (or the user hits Ctrl-C and the root cobra command cancels), Go sends SIGKILL (the default Cmd.Cancel) to the child. The child has no opportunity to clean up; this is intentional — we’d rather kill a stuck terraform than wait on an indefinite hang.

The kill is process-only, not process-group. If terraform has spawned grandchildren (the IBM provider’s helpers, an SSH key generator, etc.) those grandchildren may outlive the ctx-cancel by a few seconds. We haven’t seen this matter in practice; if it does, a pgid kill is a small follow-up.

Exit-code mapping

Outcome	Exit code	Source
Child exits 0	`0`	child
Child exits non-zero (e.g., `terraform plan` saw drift)	child’s exit code, `1-125` or `128-255`	child
`argv[0]` not on PATH and not an absolute path	`127`	local backend (POSIX shell convention)
Child binary couldn’t be exec’d despite being present (e.g., not executable)	`126`	local backend (mid-run failure: we found the binary but couldn’t spawn it)
Ctx cancelled mid-run, child SIGKILL’d	`137`	`128 + SIGKILL`

Note the 126 vs 127 split: 127 means “we never reached the tool” (binary missing, daemon unreachable, SSH refused); 126 means “we reached the tool but the backend itself broke after that point” (couldn’t fork, container created but crashed, pod scheduled but evicted before exec). Sprint 3 collapsed both to 127 in the local + docker implementations; this sprint splits them per PRD 03 §“Backend interface”. CI scripts that distinguish “test infra broken” from “real test failure” can now key on the difference.

When to use it

You have the tool installed and on PATH already.
You want the fastest startup — no container daemon, no SSH handshake, no cluster API call.
You’re running terraform against the workspace’s local state (the established workflow).
You’re debugging and want the simplest mental model for “where did that output come from”.

Chapter 18 §“Decision tree” expands these into a per-(tool, scenario) walkthrough.

`docker` backend

Runs the tool inside a vendored container image, talking to the local docker daemon over its socket. docker on PATH is not required — roksbnkctl uses the official Docker Go SDK (github.com/moby/moby/client) and dials the socket directly.

roksbnkctl ibmcloud --backend docker ks cluster ls

Container shape

Mechanically (the ibmcloud passthrough; iperf3 client is similar with a different image and ports):

# -v is passed only if Credentials.KubeconfigBytes is set.
# -e takes bare names (IBMCLOUD_API_KEY plus the legacy IC_API_KEY alias); the values are inherited.
docker run --rm \
  -v <tempdir>/kubeconfig:/root/.kube/config:ro \
  -e IBMCLOUD_API_KEY \
  -e IC_API_KEY \
  ghcr.io/jgruberf5/roksbnkctl-tools-ibmcloud:<v> \
  ks cluster ls

internal/exec/docker.go doesn’t shell out to docker run; it builds a container.Config + container.HostConfig and calls cli.ContainerCreate → ContainerStart → ContainerLogs (with Follow set, so the logs stream). The shell command above is the conceptual equivalent.

There’s no workspace-wide bind-mount. Per-invocation mounts come from three sources only:

Credentials.KubeconfigBytes — written to <tempdir>/kubeconfig (mode 0600) on the host, bind-mounted as a single file at /root/.kube/config read-only. Single-file mount per PRD 04 §“Anti-patterns” — bind-mounting ~/.kube/ exposes other clusters’ configs.
RunOpts.Files — each name → bytes entry written to <tempdir>/<basename> and bind-mounted at /work/<basename>. The container’s WorkingDir is set to /work so callers can reference files by relative path. (ibmcloud passthrough doesn’t use this; it lands when the iperf3 client backend wants to ship iperf3.json to the pod, or when a future tool wants a config file.)
RunOpts.WorkDir — overrides WorkingDir if explicitly set.

The tempdir is removed via defer after Run returns, regardless of exit code or panic.

Credential propagation specifics

Three things matter, all enforced by internal/exec/creds.go::Credentials.DockerArgs(...):

--env IBMCLOUD_API_KEY (bare name, no =value). In docker run terms, the bare-name form makes the docker client take the value from its own process environment rather than from argv, so the literal API key string never appears on a command line, in ps output, or in shell history. The resolved value still becomes part of the container’s environment, so anyone who can run docker inspect against the live container can read it; the short container lifetime and AutoRemove bound that window. PRD 04 §“Anti-patterns” calls out the --env IBMCLOUD_API_KEY=$KEY form as a leak vector, and we don’t use it. See Chapter 14 — Credentials.DockerArgs() for the full call shape.
Single-file kubeconfig mount, read-only. Not the parent dir. The container can read exactly the kubeconfig you handed it — nothing else under ~/.kube/.
Stdout/stderr through the redactor. Same defense-in-depth as the local backend: if the wrapped tool prints the API key value (rare but possible), the redactor masks it before the bytes leave roksbnkctl’s process.

`:dev` tag resolution

The vendored images live at:

Tool	Image
`ibmcloud`	`ghcr.io/jgruberf5/roksbnkctl-tools-ibmcloud:<tag>` (vendored from `icr.io/ibm-cloud/ibmcloud-cli` upstream)
`iperf3`	`ghcr.io/jgruberf5/roksbnkctl-tools-iperf3:<tag>` (Alpine + iperf3)
`terraform`	`hashicorp/terraform:<v>` (official upstream)

The <tag> for the vendored per-tool images (ibmcloud, iperf3) is resolved at runtime by internal/exec/docker.go::toolImageTag(). It reads the binary’s internal/version.Version (set via ldflags at build time): a release-built binary like v0.10.0 pulls :v0.10.0; a dev build (Version == "dev") pulls :dev. Sprint 4 landed this version-pinning in place of Sprint 3’s hard-coded :dev so a go install of a tagged release pulls a matching tagged image rather than a :dev that may not exist for the published binary. The terraform row is the exception — it points at the upstream hashicorp/terraform image and stays pinned to a specific version (currently 1.5.7) regardless of roksbnkctl’s own version.
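The resolution rule reduces to a few lines. The sketch below uses a hypothetical imageRef helper and takes the version as a parameter instead of reading internal/version.Version, so it can be exercised in isolation.

```go
package main

import "fmt"

// terraformPin is the literal upstream version this chapter cites.
const terraformPin = "1.5.7"

// imageRef returns the image reference for a tool given the binary's version
// string ("dev" for a dev build, e.g. "v0.10.0" for a release build).
func imageRef(tool, version string) string {
	if tool == "terraform" {
		// The exception: pinned upstream image, independent of our version.
		return "hashicorp/terraform:" + terraformPin
	}
	return "ghcr.io/jgruberf5/roksbnkctl-tools-" + tool + ":" + version
}

func main() {
	fmt.Println(imageRef("ibmcloud", "v0.10.0"))
	fmt.Println(imageRef("iperf3", "dev"))
	fmt.Println(imageRef("terraform", "v0.10.0"))
}
```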

The :dev tag is still the local-development idiom: cd tools/docker && make build-all builds and tags every tools image as :dev locally; a dev-build roksbnkctl finds them via the local docker cache without a ghcr.io round-trip.

If you’re cutting a custom tools image and want roksbnkctl to pick it up, the simplest path is docker tag your-image ghcr.io/jgruberf5/roksbnkctl-tools-ibmcloud:dev locally; the docker backend uses the locally cached image before attempting a pull.

Auto-remove and ctx-cancel-kill

Two cleanup mechanisms work together:

AutoRemove: true in HostConfig. The docker daemon removes the container as soon as it exits, regardless of exit code. No docker ps -a clutter, no manual docker rm ever required.
Ctx-cancel triggers ContainerKill. When ctx.Done() fires, the docker backend issues cli.ContainerKill(ctx, id, "SIGKILL") and waits a few seconds for the daemon to confirm. The --rm then takes care of removal. Net effect: hitting Ctrl-C during a stuck ibmcloud login doesn’t leave a zombie container behind.

Combined with the daemon’s own watchdog on the container, the worst case is a few seconds of “container is dying” between Ctrl-C and the container disappearing. We haven’t seen leaked containers in dev or CI.

Image build pipeline

Image versions are tagged in lock-step with roksbnkctl releases; the GitHub Actions workflow that builds + pushes them runs on every release tag. See Chapter 31 — Building from source for the build pipeline details.

terraform via docker

terraform is the second tool routed through the docker backend (alongside ibmcloud). The shape is similar to ibmcloud (docker run against a vendored image, single-file mounts for sensitive data, no creds in argv) but with two terraform-specific concerns: state persistence across runs, and host-user UID alignment so state files written inside the container stay readable on the host.

State persistence via bind-mount

Terraform’s local state file lives at terraform.tfstate in the working directory. For the docker backend the working directory has to be a host-side path bind-mounted into the container, not a container-internal path that disappears on --rm. The docker backend bind-mounts the workspace’s state directory into the container:

docker run --rm \
  -v ~/.roksbnkctl/<workspace>/state:/state \
  --workdir /state/tf-source/embedded-terraform \
  --user $(id -u):$(id -g) \
  hashicorp/terraform:1.5.7 \
  apply -auto-approve

Concretely:

Host source: ~/.roksbnkctl/<workspace>/state/ — the same directory the local terraform backend writes state to today, so switching between --backend local and --backend docker against the same workspace doesn’t fork state.
Container target: /state — the bind-mount root inside the container.
Container working directory: /state/tf-source/<source>/ (e.g., /state/tf-source/embedded-terraform/ for the default embedded source) — the same path the local backend resolves to, so terraform sees the same main.tf either way.
The HCL is bind-mounted alongside state. The embedded HCL is materialised at run time into ~/.roksbnkctl/<workspace>/state/tf-source/<source>/ (chapter 31 covers the embedded-source layout); since state/ is the bind-mount root, both terraform.tfstate and the HCL tree land inside the container together. There’s no separate HCL projection.

The bind-mount is read-write — terraform needs to write terraform.tfstate, rotate terraform.tfstate.backup, and populate the .terraform/ cache. Combined with --rm, the file lifecycle is: container creates state, container exits, --rm removes the container, state files persist on the host. Subsequent runs (re-mounted at the same host path) pick up where the prior run left off.

Image: `hashicorp/terraform:1.5.7`

The image is the official upstream hashicorp/terraform published by HashiCorp on Docker Hub, pinned to a literal version in internal/exec/docker.go’s toolImages map (currently 1.5.7). The pin is intentional — the embedded HCL has been validated against this terraform version, and the docker backend’s whole point is reproducibility. Bumping the pin is a deliberate change to the binary and lands as a release.

The vendored per-tool images (ibmcloud, iperf3) get their tag from the roksbnkctl binary’s own version (see § :dev tag resolution above). Terraform is the exception — the binary’s version doesn’t follow upstream terraform’s release cadence, so the pin stays literal.

The UID/GID alignment gotcha

Linux Docker containers run as root by default. With a root-owned container writing into a bind-mount, the resulting host files end up owned by root — and any subsequent local-backend terraform apply (or even a cat ~/.roksbnkctl/<ws>/state/terraform.tfstate) hits permission errors. The docker backend works around this by passing --user $(id -u):$(id -g) explicitly:

# --user carries the host's caller-uid:caller-gid (1000:1000 in this example).
docker run --rm \
  --user 1000:1000 \
  -v ~/.roksbnkctl/dev-tor/state:/state \
  --workdir /state/tf-source/embedded-terraform \
  hashicorp/terraform:1.5.7 \
  apply -auto-approve

The container process runs as the host user, so files written into the bind-mount are owned by the host user — same as a local-backend terraform apply would have produced. Switching backends mid-debug doesn’t strand state files behind a permission wall.

The UID/GID values are read from the host process at run time (Go’s os.Getuid() / os.Getgid()). On macOS this is mostly cosmetic — Docker Desktop’s VM normalises ownership on the host bind-mount automatically — but it’s required for clean Linux behaviour, so the backend always passes the flag.

Supported commands

The terraform docker backend honours --backend docker for the four lifecycle commands:

roksbnkctl up    --backend docker  [--var-file <path>] [--auto]
roksbnkctl plan  --backend docker  [--var-file <path>]
roksbnkctl apply --backend docker  [--var-file <path>] [--auto]
roksbnkctl down  --backend docker  [--var-file <path>] [--auto]

Flags that the local terraform backend honours (--var-file, --auto, plus the -w/--workspace selector) plumb through to the docker backend identically — the backend’s job is to spawn hashicorp/terraform:1.5.7 with the right argv; it doesn’t filter or rewrite the lifecycle commands’ flags. (--auto is roksbnkctl’s shorthand for terraform’s -auto-approve; the wrapper renames it for terseness and consistency across up/apply/down.)

roksbnkctl up --backend docker is the apply-with-auto-approve shorthand the existing local lifecycle uses; --backend docker switches the spawn target without changing the command shape.

Deferred: k8s and ssh terraform backends

--backend k8s and --backend ssh:<target> for terraform are not in v1.0. The blocker is state-handling: the local backend keeps state on the host filesystem, the docker backend bind-mounts the same path, but k8s (run terraform in a one-shot Job) and ssh:<target> (run terraform on a remote host) need a story for shipping state between the run vantage and the canonical workspace state dir. Designs under consideration include a versioned ConfigMap/Secret pair for k8s and an scp-pre-and-post atomic move for ssh; both are deferred to v1.x once the trade-offs have settled (see docs/PLAN.md §“What’s deliberately deferred to post-v1.0”).

PRD 03 §“State concerns” is the design spec; trying --backend k8s against terraform errors at parse time:

$ roksbnkctl up --backend k8s
error: terraform doesn't support backend `k8s` at v1.0 (state-handling design
       open; tracked in PRD 03 § State concerns); supported: local, docker

When to use it

You’re on a clean dev machine without ibmcloud installed and don’t want to install it.
You need a frozen tool version for CI reproducibility.
You’re debugging a “works on my machine” issue and want to factor out the host install variable.

When docker is the wrong call:

The tool needs network access that your laptop has but the container doesn’t (rare; default bridge networking usually preserves the laptop’s egress).
You’re running iperf3 and want a network-locality benefit — docker doesn’t give you that vs local. Use k8s instead.
You’re running a DNS probe and want a different network vantage. The container shares the host’s network identity, so docker adds nothing; the DNS subcommand rejects --backend docker by design.
You’re on Windows. Linux/macOS docker daemons are in scope; Windows Docker Desktop coverage is deferred to a future round.

`k8s` backend

Runs the wrapped tool inside the cluster. Two distinct execution patterns share the same Backend.Run interface:

Pattern	Used for	Lives in	Lifetime
Long-lived ops pod	ad-hoc `ibmcloud` commands, future interactive shells	`roksbnkctl-ops` namespace	manually managed via `roksbnkctl ops install/uninstall`
One-shot Job	iperf3 client runs, future `terraform` runs, future DNS probes	`roksbnkctl-test` namespace	per-invocation; auto-deleted after `ttlSecondsAfterFinished: 60`

The split mirrors the two latency budgets. Long-lived pods amortise the pod-startup cost across many invocations — perfect for ibmcloud iam oauth-tokens which you might run twenty times in a debugging session. One-shot Jobs are clean (no leftover state, no concurrency questions) — perfect for iperf3 -c <server> which runs once, emits its JSON, and exits.

Long-lived ops pod pattern

The pod is named roksbnkctl-ops in the roksbnkctl-ops namespace. roksbnkctl ops install deploys it (see Chapter 19 for the full lifecycle). The image bundles ibmcloud CLI plus kubectl as backup; future iterations may add oc, terraform, etc. The container inside the pod is named tools.

Backend.Run(ctx, argv, opts) for the ops-pod path is essentially:

exec, _ := remotecommand.NewSPDYExecutor(restConfig, "POST",
    clientset.CoreV1().RESTClient().Post().
        Resource("pods").Namespace("roksbnkctl-ops").Name("roksbnkctl-ops").
        SubResource("exec").
        VersionedParams(&corev1.PodExecOptions{
            Container: "tools",
            Command:   argv,
            Stdin:     opts.Stdin != nil,
            Stdout:    true,
            Stderr:    true,
            TTY:       opts.TTY,
        }, scheme.ParameterCodec).URL())
exec.StreamWithContext(ctx, remotecommand.StreamOptions{
    Stdin: opts.Stdin, Stdout: redactor(opts.Stdout, creds), Stderr: redactor(opts.Stderr, creds), Tty: opts.TTY,
})

The exit code comes back via the SPDY channel’s metav1.Status; the executor surfaces it as an exec.CodeExitError. We propagate that as the backend’s exit code, same as local propagates exec.ExitError.ExitCode().

opts.WorkDir is ignored for the ops pod path. The pod’s WorkingDir is fixed at container-spec time (/work); per-exec working-dir changes would require recreating the pod. Callers that need a specific cwd should fold cd <dir> && into argv themselves, which in practice means passing a shell invocation such as sh -c 'cd <dir> && <command>' (the local backend’s symmetric escape hatch).

One-shot Job pattern

For each invocation, the backend builds a batchv1.Job spec, applies it, streams logs from the Job’s pod, waits for completion, reads the exit code from the pod’s container status, and lets ttlSecondsAfterFinished clean up.

Skeleton:

apiVersion: batch/v1
kind: Job
metadata:
  generateName: roksbnkctl-iperf3-client-     # randomized; multiple runs don't collide
  namespace: roksbnkctl-test
spec:
  ttlSecondsAfterFinished: 60                  # auto-delete the Job + its Pod 60s after completion
  backoffLimit: 0                              # no retries; the test reports failure once and stops
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: iperf3-client
        image: ghcr.io/jgruberf5/roksbnkctl-tools-iperf3:<v>
        command: ["iperf3", "-c", "<server-svc>", "-J"]
        envFrom:
        - secretRef:
            name: roksbnkctl-job-creds-<random>   # projected per invocation
        volumeMounts:
        - name: files
          mountPath: /work
      volumes:
      - name: files
        projected:
          sources:
          - secret:
              name: roksbnkctl-job-files-<random>  # one Secret per invocation, holds RunOpts.Files

Three details to call out:

Projected Secret for cred propagation. Credentials.IBMCloudAPIKey (when set) becomes a one-shot Secret, exposed to the container via envFrom: secretRef. Per PRD 04 §“In-cluster pod” this beats argv (which would show in kubectl describe pod) and beats inline env: blocks (which surface in kubectl get pod -o yaml). The Secret follows the Job’s lifecycle: when ttlSecondsAfterFinished deletes the Job, the Kubernetes garbage collector sweeps the Secret too via ownerReferences.
Log streaming via client-go. Once the Job’s pod is in Running state, clientset.CoreV1().Pods(ns).GetLogs(name, &corev1.PodLogOptions{Follow: true}).Stream(ctx) returns an io.ReadCloser that we copy through the redactor into opts.Stdout. The stream stays open until the pod terminates or ctx cancels.
Exit-code extraction. When the pod transitions to Succeeded or Failed, we read pod.Status.ContainerStatuses[0].State.Terminated.ExitCode and return that as the backend’s exit code. A Failed pod with ExitCode: 0 (rare; usually OOMKilled or evicted) maps to backend exit code 126 — backend mid-run failure rather than tool failure.
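The exit-code rule in the last item can be sketched without client-go by modelling only the two fields it reads. podResult and backendExitCode are hypothetical stand-ins for the pod phase and the first container's terminated exit code; the handling of a non-terminal phase is an assumption, since the text above only covers Succeeded and Failed.

```go
package main

import "fmt"

// podResult is a hypothetical reduction of the pod status fields the Job
// path reads: the pod phase and the first container's terminated exit code.
type podResult struct {
	Phase    string // "Succeeded" or "Failed"
	ExitCode int32
}

// backendExitCode applies the rule described above.
func backendExitCode(p podResult) int {
	switch p.Phase {
	case "Succeeded":
		return int(p.ExitCode)
	case "Failed":
		if p.ExitCode == 0 {
			// Failed without a tool exit code: backend mid-run failure.
			return 126
		}
		return int(p.ExitCode)
	default:
		// Assumption: a pod that never reached a terminal phase is also
		// treated as a backend mid-run failure.
		return 126
	}
}

func main() {
	fmt.Println(backendExitCode(podResult{Phase: "Succeeded", ExitCode: 0}))
	fmt.Println(backendExitCode(podResult{Phase: "Failed", ExitCode: 1}))
	fmt.Println(backendExitCode(podResult{Phase: "Failed", ExitCode: 0}))
}
```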

The roksbnkctl-test namespace is a fresh namespace dedicated to one-shot test workloads. It’s separate from roksbnkctl-ops (the long-lived pod’s home) so RBAC can be scoped tighter — see Chapter 19 §“RBAC”.

iperf3 server side

Worth calling out because it’s the asymmetric piece. The iperf3 test deploys a server bare Pod + Service into roksbnkctl-test, then runs the client as the one-shot Job described above:

Side	Resource	Lifetime
Server	`roksbnkctl-iperf3` bare Pod + Service (`LoadBalancer` for `--mode north-south`; `ClusterIP` for `--mode east-west`)	torn down after the client Job completes
Client	one-shot Job	`ttlSecondsAfterFinished: 60`

The bare-Pod (rather than Deployment) shape is intentional — the iperf3 server is single-shot, scoped to one test, torn down on completion; the controller-managed replica machinery a Deployment provides is unused and would only confuse the cleanup story. Service type is driven by --mode: north-south measures laptop-to-cluster bandwidth and needs a publicly reachable endpoint (LoadBalancer); east-west measures node-to-pod and stays in-cluster (ClusterIP). See Chapter 22 — Throughput testing for the user-facing flag surface.

The client Job’s argv is iperf3 -c <server-cluster-ip-or-lb> -J. The -J JSON flows back via log streaming, parsed in internal/test/throughput.go, surfaced as roksbnkctl test throughput JSON output.
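To make the parsing step concrete, here is a minimal sketch of pulling the receiver-side rate out of iperf3's -J output. The real parser in internal/test/throughput.go is not shown in this book, so the type and function names below are hypothetical; the JSON field names (end, sum_received, bits_per_second) are the ones iperf3 emits for TCP tests.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// iperf3Sum and iperf3Report model only the fields this sketch needs.
type iperf3Sum struct {
	BitsPerSecond float64 `json:"bits_per_second"`
}

type iperf3Report struct {
	End struct {
		SumSent     iperf3Sum `json:"sum_sent"`
		SumReceived iperf3Sum `json:"sum_received"`
	} `json:"end"`
}

// receivedGbps returns the receiver-side throughput in Gbit/s.
func receivedGbps(raw []byte) (float64, error) {
	var r iperf3Report
	if err := json.Unmarshal(raw, &r); err != nil {
		return 0, fmt.Errorf("parse iperf3 JSON: %w", err)
	}
	return r.End.SumReceived.BitsPerSecond / 1e9, nil
}

func main() {
	// Hand-written sample, not captured tool output.
	sample := []byte(`{"end":{"sum_received":{"bits_per_second":8000000000}}}`)
	gbps, err := receivedGbps(sample)
	fmt.Println(gbps, err)
}
```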

The server pod’s securityContext is set to satisfy OpenShift’s restricted-v2 SCC: runAsNonRoot: true, allowPrivilegeEscalation: false, seccompProfile.type: RuntimeDefault, capabilities.drop: [ALL]. iperf3 listens on port 5201 (unprivileged) so no root is needed. The Sprint 3 cluster baseline tripped the SCC by missing one or more of these fields; the manifest the k8s backend emits this sprint sets all four.

When to use it

You’re running iperf3 and want a number that reflects cluster fabric, not your office Wi-Fi.
You’re running ibmcloud from a network that can reach the cluster but not *.cloud.ibm.com directly. The ops pod has both lines of sight; your laptop has only one.
You want a cluster-side ad-hoc shell for debugging — roksbnkctl exec --backend k8s -- bash (when implemented) drops into the ops pod.

When k8s is the wrong call:

The cluster doesn’t exist yet (roksbnkctl ops install requires a working kubeconfig). Use local or ssh for pre-cluster ops.
You haven’t run roksbnkctl ops install. Run it first; it’s a one-time setup per cluster.
You’re running terraform — --backend k8s for terraform is deferred to a future release pending a state-handling design (see PRD 03 §“State concerns”).

Chapter 19 is the full reference for the cluster-side mechanics: namespace, ServiceAccount, ClusterRole, Secret, lifecycle.

`ssh` backend

Runs the wrapped tool on a registered SSH target. Builds on Sprint 1’s internal/remote.Client (the same SSH client backing the --on flag); this section assumes you’ve read Chapter 16 for the target-config and host-key TOFU framing.

roksbnkctl ibmcloud --backend ssh:jumphost ks cluster ls
roksbnkctl ibmcloud --backend ssh:bastion --bootstrap iam oauth-tokens

Per-tool apt-bootstrap and the `--bootstrap` flag

Before exec’ing the wrapped tool, the SSH backend probes whether it’s installed:

ssh <target> 'command -v <tool>'

Exit 0 → tool present, proceed. Non-zero → tool missing. What happens next depends on --bootstrap:

Without --bootstrap (the default). The backend errors with exit 127 and a clear message:
```
error: tool `iperf3` not found on ssh target jumphost; re-run with --bootstrap to install via apt-get,
       or pre-install on the target manually
```
No sudo apt-get ever runs. The backend won’t surprise the user with package-manager invocations or sudo password prompts on a remote they didn’t expect mutation on.

With --bootstrap. The backend runs the per-tool bootstrap recipe. For Ubuntu (the only OS supported this round), the recipe is roughly:

# ibmcloud needs IBM's apt repo + GPG key first
curl -fsSL https://download.clis.cloud.ibm.com/Linux/Ubuntu/repo.gpg | sudo apt-key add -
echo 'deb https://download.clis.cloud.ibm.com/Linux/Ubuntu jammy main' \
  | sudo tee /etc/apt/sources.list.d/ibmcloud.list
sudo -n apt-get update -y
sudo -n apt-get install -y ibmcloud-cli

iperf3 is simpler — no repo addition, just sudo -n apt-get install -y iperf3.

The opt-in default reflects PRD 03 open question §“--bootstrap opt-in for SSH”: silent sudo apt-get on a remote host is the kind of surprise that erodes operator trust, especially when the remote is shared between teams. Make the user say “yes, install for me”.

Bootstrap failure modes split between the two backend-failure exit codes per §“Backend-failure semantics”: 127 when we never got going (couldn’t reach the repo, no apt mapping, tool missing without --bootstrap); 126 when we got partway in and then something broke (sudo / OS-detect / install).

Failure	Exit	What you see
`--bootstrap` not set and tool missing	`127`	“tool `<name>` not found on ssh target `<target>`; re-run with --bootstrap to install via apt-get, or pre-install on the target manually”
`sudo` requires a password (NOPASSWD not configured)	`126`	`sudo: a password is required` → “the SSH user needs passwordless sudo for `apt-get install`. Configure `<user> ALL=(ALL) NOPASSWD: /usr/bin/apt-get` in /etc/sudoers, or pre-install `<pkg>` manually.”
Non-Ubuntu OS (`lsb_release -is` doesn’t return `Ubuntu`)	`126`	“auto-install only supports Ubuntu. Pre-install `<pkg>` on the target (RHEL: `yum install <pkg>`).”
Network unreachable from target (apt-get can’t reach the repo)	`127`	“target can’t reach the package repo. Check the target’s egress policy or pre-install `<pkg>` manually.”
No apt mapping for the requested tool	`127`	“no bootstrap recipe known for tool `<name>`; the SSH backend only auto-installs `ibmcloud` + `iperf3` today”
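The decision made right after the command -v probe can be sketched as follows. This is an illustration only: preflight and aptPackages are hypothetical names, and the exit code for a missing recipe follows the "never got going" rule stated in the prose (127). Failures that occur after installation starts (sudo, OS detection, the install itself) would map to 126 and are not modelled here.

```go
package main

import "fmt"

// aptPackages lists the tools the text says the SSH backend can auto-install.
var aptPackages = map[string]string{
	"ibmcloud": "ibmcloud-cli",
	"iperf3":   "iperf3",
}

// preflight decides what to do after the `command -v <tool>` probe.
// It returns the apt package to install ("" when nothing needs installing),
// the backend exit code (0 means "carry on"), and an error for the user.
func preflight(tool, target string, present, bootstrap bool) (string, int, error) {
	if present {
		return "", 0, nil
	}
	if !bootstrap {
		return "", 127, fmt.Errorf("tool `%s` not found on ssh target %s; re-run with --bootstrap to install via apt-get, or pre-install on the target manually", tool, target)
	}
	pkg, ok := aptPackages[tool]
	if !ok {
		// 127: we never got going, per the prose rule above.
		return "", 127, fmt.Errorf("no bootstrap recipe known for tool `%s`", tool)
	}
	return pkg, 0, nil
}

func main() {
	pkg, code, err := preflight("iperf3", "jumphost", false, false)
	fmt.Println(pkg, code, err)
}
```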

File materialisation

RunOpts.Files entries are written to a per-invocation tempdir on the remote. The tempdir is /tmp/roksbnkctl.<random>/ where <random> is a fresh 16-byte hex string per Run:

# pseudo-flow
ssh <target> 'mkdir -m 0700 /tmp/roksbnkctl.<rand>'
scp <local-temp>/<basename> <target>:/tmp/roksbnkctl.<rand>/<basename>
ssh <target> '
  trap "rm -rf /tmp/roksbnkctl.<rand>" EXIT
  cd /tmp/roksbnkctl.<rand>
  <argv...>
'

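The trap … EXIT is a shell builtin. In bash it fires on normal exit, on set -e failure, and on SIGINT (Ctrl-C) or SIGTERM; strictly POSIX shells such as dash do not run the EXIT trap when a signal kills the shell, so the behaviour on signals depends on the remote user’s shell. Under bash, even if the user kills their roksbnkctl invocation mid-run, the remote tempdir is cleaned up by the wrapper script’s own trap before the SSH session terminates.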

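The 0700 mode on the tempdir ensures only the SSH user can read it during the brief on-disk window. On shared bastions (multi-user jumphosts) this matters: /tmp itself is world-writable, so the protection comes from the per-invocation directory, which the SSH user owns and nobody else can enter. That is why we materialise into a private directory under /tmp rather than writing straight into /var/tmp or some other shared scratch path.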
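As a rough sketch of the naming and cleanup scheme described above: remoteTempdir and remoteScript are hypothetical helpers, and reading "16-byte hex string" as 16 random bytes rendered as 32 hex characters is an assumption.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// remoteTempdir returns a fresh per-invocation path such as
// /tmp/roksbnkctl.<32 hex chars>, drawn from the OS CSPRNG.
func remoteTempdir() (string, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return "/tmp/roksbnkctl." + hex.EncodeToString(buf), nil
}

// remoteScript renders the remote-side script body: register cleanup,
// enter the tempdir, run the command. dir holds only a fixed prefix plus
// hex digits, so it is safe inside single quotes.
func remoteScript(dir, command string) string {
	return "trap 'rm -rf " + dir + "' EXIT\ncd " + dir + "\n" + command + "\n"
}

func main() {
	dir, err := remoteTempdir()
	if err != nil {
		panic(err)
	}
	fmt.Print(remoteScript(dir, "ibmcloud ks cluster ls"))
}
```

Using crypto/rand rather than a time-seeded generator keeps the path unpredictable to other users on a shared bastion.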

Kubeconfig follows the same pattern: Credentials.KubeconfigBytes becomes <tempdir>/kubeconfig, the wrapper exports KUBECONFIG=<tempdir>/kubeconfig, the trap removes the file on exit. PRD 04 §“Kubeconfig options for SSH backend” calls this “Option A” — scp-and-cleanup. We picked it over the in-memory <() process-substitution alternative because it’s robust across remote shells and sshd configs.

Env propagation: `SetEnv` vs wrapper script

OpenSSH supports two ways to pass an env var to a remote command:

ssh -o SetEnv=KEY=VALUE target … — client tells the server “please add this to the env”. Works only if the server’s sshd_config has AcceptEnv KEY matching. Most stock sshd configs don’t enable AcceptEnv for arbitrary keys.
Wrapper script with export KEY=VALUE — the script exports the env var in its own process before running "$@". Works regardless of sshd config, but the value lives briefly in a 0700 file on the remote.

The SSH backend tries SetEnv first. On the first connect to a new target, it sends a sentinel env var (ROKSBNKCTL_SETENV_TEST=ok) and runs echo "$ROKSBNKCTL_SETENV_TEST". If the output is ok, SetEnv works on this target — the result is cached in workspace state, and subsequent runs use SetEnv directly.
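The sentinel probe can be sketched as two small helpers. Only the sentinel name and value come from the text above; probeArgs and mechanismFor are hypothetical names, and the arguments are those of a plain ssh client invocation.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	sentinelKey   = "ROKSBNKCTL_SETENV_TEST"
	sentinelValue = "ok"
)

// probeArgs builds the ssh client arguments for the one-off SetEnv probe.
func probeArgs(target string) []string {
	return []string{
		"-o", "SetEnv=" + sentinelKey + "=" + sentinelValue,
		target,
		`echo "$` + sentinelKey + `"`,
	}
}

// mechanismFor interprets the probe's stdout: the sentinel coming back
// means sshd accepted the variable; anything else means it was dropped.
func mechanismFor(probeOutput string) string {
	if strings.TrimSpace(probeOutput) == sentinelValue {
		return "SetEnv"
	}
	return "wrapper script"
}

func main() {
	fmt.Println(probeArgs("jumphost"))
	fmt.Println(mechanismFor("ok\n"))
	fmt.Println(mechanismFor(""))
}
```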

If the sentinel doesn’t surface, sshd silently dropped it (it logs refused setenv request on the server side, but clients don’t see that). The backend falls back to a wrapper script:

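#!/bin/bash
# /tmp/roksbnkctl.<rand>/wrap.sh, mode 0700, owner-readable only
# bash rather than sh: `set +o history` is a bash option that dash
# (Ubuntu's /bin/sh) rejects.
trap 'rm -f "$0"' EXIT
set +o history
export IBMCLOUD_API_KEY='<value>'
# Run the tool as a child instead of exec'ing it: exec would replace this
# shell, and the EXIT trap above would never fire.
"$@"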

Then: ssh <target> /tmp/roksbnkctl.<rand>/wrap.sh ibmcloud iam oauth-tokens.
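Because the wrapper embeds the secret inside single quotes, whatever generates the script has to escape any single quote in the value. A minimal sketch of that quoting, with hypothetical helper names (the book does not show how roksbnkctl renders the wrapper):

```go
package main

import (
	"fmt"
	"strings"
)

// shellQuote wraps s in single quotes for a POSIX shell, rewriting each
// embedded single quote as '\'' (close quote, escaped quote, reopen quote).
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// exportLine renders one `export KEY='value'` line for the wrapper script.
func exportLine(key, value string) string {
	return "export " + key + "=" + shellQuote(value)
}

func main() {
	// Dummy value for illustration; never print a real key.
	fmt.Println(exportLine("IBMCLOUD_API_KEY", "abc'def"))
}
```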

The wrapper-script path is the Sprint 1 validator Issue 4 carry-over — the same shape --on uses for env passing today. Risks (file content includes the secret) are mitigated by:

Mode 0700 so only the SSH user can read.
set +o history so the value doesn’t leak into shell history.
trap 'rm -f "$0"' EXIT deletes the wrapper as soon as it exits — including on Ctrl-C, since the trap covers SIGINT/SIGTERM by virtue of being in the script’s main process.
The key is never in argv, so ps -ef on the remote doesn’t show it.

roksbnkctl targets show <name> reports which mechanism the target uses (e.g., env propagation: SetEnv (AcceptEnv ok) or env propagation: wrapper script (sshd refused SetEnv)) so users can choose to enable AcceptEnv server-side if they prefer.

Bootstrap failure modes (consolidated)

Symptom	Cause	Remediation
`sudo: a password is required`	NOPASSWD sudo not configured	Add `<ssh-user> ALL=(ALL) NOPASSWD: /usr/bin/apt-get` to `/etc/sudoers.d/roksbnkctl` on the target
`auto-install only supports Ubuntu`	`/etc/os-release` ID is not `ubuntu`	Pre-install the tool manually; RHEL: `sudo yum install <pkg>`; Alpine: `sudo apk add <pkg>`
`target can't reach the package repo`	Target’s egress policy blocks `download.clis.cloud.ibm.com` (or upstream Ubuntu mirrors)	Pre-install or open egress; doctor’s `--backend ssh:<target>` flags this
`tool not found on ssh target …; re-run with --bootstrap`	`--bootstrap` not passed and tool missing	Re-run with `--bootstrap`, or pre-install on the target

When to use it

You’re running ibmcloud from a customer-firewalled office where the corporate jumphost can reach IBM Cloud APIs but your laptop can’t.
You’re working in an air-gapped environment where roksbnkctl runs on your laptop but the IBM Cloud API conversations have to happen from a specific bastion’s IP.
You want a low-overhead remote-exec path that doesn’t require a cluster (the k8s backend’s prereq).

When ssh is the wrong call:

The target lacks the tool and you don’t want to mutate it. Skip --bootstrap; the backend errors clearly without installing anything.
The target isn’t Ubuntu and you don’t want to pre-install. Bootstrap won’t work; pre-install or use local/docker/k8s.
You’re running iperf3 to measure cluster bandwidth. SSH puts the client somewhere on the network path to the cluster but not necessarily adjacent to it — k8s is the right answer for that case.

Chapter 16 covers the lighter-weight --on jumphost predecessor that uses the same targets: config block. The SSH backend is the heavier-duty form: file materialisation, env propagation hardening, opt-in bootstrap. Chapter 18 is the decision tree.

The `Backend` interface

For the curious, the Go interface every backend conforms to:

package exec

import (
    "context"
    "io"
)

type Backend interface {
    Run(ctx context.Context, argv []string, opts RunOpts) (int, error)
    Name() string
}

type RunOpts struct {
    Stdin           io.Reader
    Stdout, Stderr  io.Writer
    Env             []string         // KEY=VALUE pairs
    WorkDir         string           // best-effort; some backends ignore (k8s)
    TTY             bool             // request PTY where supported
    Files           map[string][]byte // files materialized at exec time
    Credentials     *Credentials     // routed via PRD 04's per-backend mechanism
}

type Credentials struct {
    KubeconfigBytes []byte
    IBMCloudAPIKey  string
}

All four implementations satisfy this interface. Call sites in cli/cluster.go, cli/test.go, etc., get a Backend from the registry and call Run(...) — no branching on backend type. The uniformity is what makes the system extensible without rewriting callers each time a backend lands.
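A toy version of the registry-and-call-site pattern, to show why callers never branch on backend type. Everything here is illustrative: echoBackend, registry, and runToString are hypothetical, RunOpts is trimmed to one field, and returning 127 for an unknown backend name is an assumption.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"strings"
)

// RunOpts is trimmed to the single field this sketch needs.
type RunOpts struct {
	Stdout io.Writer
}

type Backend interface {
	Run(ctx context.Context, argv []string, opts RunOpts) (int, error)
	Name() string
}

// echoBackend is a stand-in that prints what it would run.
type echoBackend struct{ name string }

func (b echoBackend) Name() string { return b.name }

func (b echoBackend) Run(_ context.Context, argv []string, opts RunOpts) (int, error) {
	fmt.Fprintf(opts.Stdout, "[%s] %v\n", b.name, argv)
	return 0, nil
}

var registry = map[string]Backend{
	"local":  echoBackend{name: "local"},
	"docker": echoBackend{name: "docker"},
}

// runToString is the call site: look the backend up, call Run, no branching.
func runToString(backendName string, argv []string) (string, int, error) {
	var out strings.Builder
	b, ok := registry[backendName]
	if !ok {
		return "", 127, fmt.Errorf("unknown backend %q", backendName)
	}
	code, err := b.Run(context.Background(), argv, RunOpts{Stdout: &out})
	return out.String(), code, err
}

func main() {
	out, code, err := runToString("docker", []string{"ks", "cluster", "ls"})
	fmt.Print(out)
	fmt.Println(code, err)
}
```

Adding a backend in this shape means adding one registry entry; the call site stays unchanged.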

The Credentials struct is the bridge between the resolver chain (env → keychain → config-b64 → prompt) covered in Chapter 14 and the per-backend propagation rules in PRD 04. Each backend translates the struct into the mechanism appropriate to where it runs: env vars for local, --env KEY (no =value) for docker, a Secret consumed via envFrom: secretRef for k8s, SetEnv or wrapper script for ssh.

Cross-references

PRD 03 — pluggable execution backends — the design rationale and full per-backend spec.
PRD 04 — credential propagation — the cred-passing rules every backend implements.
Chapter 12 — Workspace config — the exec: block schema.
Chapter 14 — Credentials and the resolver chain — how creds reach the backend in the first place.
Chapter 16 — The --on flag and SSH jumphosts — the lightweight remote-exec predecessor to the SSH backend.
Chapter 18 — Choosing a backend per tool — the decision tree.
Chapter 19 — The in-cluster ops pod — deploy-time mechanics for the k8s backend.

Choosing a backend per tool

Chapter 17 covered the mechanics of each backend. This chapter is the decision tree: given a tool and a scenario, which of local / docker / k8s / ssh is the right call.

If you’re searching for “which backend should I use”, you’ve landed on the right page.

The four backends in one line each

Backend	One-line summary	Deep dive
`local`	`os/exec` on your laptop; fastest, requires the tool on PATH	§ Local backend
`docker`	`docker run` against a vendored image; frozen tool version, no host install	§ Docker backend
`k8s`	inside the cluster (long-lived ops pod or one-shot Job); cluster-correct network identity	§ K8s backend
`ssh:<target>`	on a registered SSH target; opt-in apt-bootstrap on Ubuntu	§ SSH backend

If you’re skimming, the cheat-sheet is:

local when you have the tool installed and the host’s network identity is correct for the call.
docker when you don’t have the tool and don’t want to install it, or you need a frozen version for CI.
k8s when the call’s network position matters and the cluster is the right vantage point.
ssh:<target> when the call needs to originate from a specific external host (a customer bastion, an air-gapped bridge).

The rest of this chapter is the longer version.

Per-tool default backends

Every tool has a default backend baked into roksbnkctl. Workspace config (exec: block) can override the default per workspace; --backend overrides for a single invocation.

Tool	Default	Resolved by
`ibmcloud`	`local`	`internal/cli/cluster.go::resolveBackendSpecWith("ibmcloud", flagOverride)`
`iperf3`	`k8s`	`internal/cli/test.go::resolveBackendSpecWith("iperf3", flagOverride)`
`terraform`	`local`	`internal/cli/lifecycle.go::resolveBackendSpecWith("terraform", flagOverride)`

The defaults reflect “what’s the right answer for the most common scenario”:

ibmcloud defaults to local because most users have it on PATH or are happy installing it. The compliance + firewall scenarios where ssh or docker are better are the minority of calls.
iperf3 defaults to k8s because throughput from a laptop’s uplink isn’t the cluster’s bandwidth. The k8s backend places the iperf3 client in (or adjacent to) the cluster so the number reflects fabric, not Wi-Fi. Laptop-uplink-to-cluster is a real measurement too, but it’s the special case — opt in via --backend local.
terraform defaults to local because the terraform-exec local path is the established workflow. State handling is simplest there. Frozen-version CI runs use --backend docker; non-local network-locality use cases (cluster-side, SSH-bastion-side) are deferred to a future release pending a state-handling design — see PRD 03 §“State concerns”.

To change a default per workspace, edit ~/.roksbnkctl/<workspace>/config.yaml:

exec:
  iperf3:    { backend: k8s }      # already the default; shown for clarity
  ibmcloud:  { backend: ssh:bastion }
  terraform: { backend: docker }

Chapter 12 §“exec:” covers the schema. The --backend CLI flag overrides whatever is in exec: for a single invocation.

Per-tool supported-backend matrix

Not every tool supports every backend. The authoritative matrix at v1.0:

Tool	`local`	`docker`	`k8s`	`ssh:<target>`
`ibmcloud`	yes (default)	yes (frozen image)	yes (long-lived ops pod)	yes
`iperf3`	yes (opt-in: laptop vantage)	not supported (same network identity as `local`)	yes (default)	yes
`terraform`	yes (default)	yes (frozen image)	deferred to v1.x (state-file handling)	deferred to v1.x (state-file handling)
DNS probe	yes (default for laptop vantage)	not supported (same network identity as `local`)	yes (cluster vantage)	yes (remote vantage)
`kubectl` / `oc`	internalised — runs via the Go client, not via a host binary	n/a	n/a	n/a
`dig`	internalised — DNS probe replaces `dig` for in-tree work	n/a	n/a	n/a

Legend:

yes — supported; same surface command works on this backend.
yes (default) — this backend is the per-tool default; pass --backend other to override.
not supported — rejected at CLI parse time with a clear error pointing at the right alternative.
deferred to v1.x — a real design constraint, not a gap; see the cell text and PRD 03 §“State concerns”.
internalised — roksbnkctl performs the operation via its own embedded library, not by shelling out; no backend selection applies.

The “not supported” and “deferred” entries are intentional design decisions, not gaps:

iperf3 over docker is rejected because a Docker container running locally has the same network identity as the host — same NAT egress, same uplink, same observed bandwidth as --backend local. The user’s mental model would be “I picked docker, so the iperf3 must be hermetic now” but the throughput number wouldn’t actually differ. Better to refuse and force the user to pick local (deliberate laptop measurement) or k8s (cluster measurement).
DNS probe over docker is rejected for the same reason. DNS resolution from a Docker container with default bridge networking goes through the same resolver as the host. There’s no GSLB-relevant network-locality difference. The probe subcommand errors with “DNS probe doesn’t benefit from docker; use local instead” when --backend docker is passed.
terraform over k8s and ssh is deferred to v1.x. The state file is sensitive (admin tokens, generated TLS keys, license bundles); moving it into a Kubernetes Secret or scp’ing it pre/post-run requires a state-handling design that hasn’t shipped yet. PRD 03 §“State concerns” lays out the considerations; the roadmap entry lives in docs/PLAN.md §“What’s deliberately deferred to post-v1.0”.

Passing an unsupported (tool, backend) pair errors at the CLI layer before the backend is invoked:

$ roksbnkctl test throughput --backend docker
error: iperf3 doesn't support backend `docker` (same network identity as `local`,
       no value-add); supported: local, k8s, ssh:<target>
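The parse-time check can be pictured as a lookup against the matrix above. This is a sketch under assumptions: supported, backendKind, and checkSupported are hypothetical names, the DNS probe row is omitted, and the error text is abbreviated rather than copied from the CLI.

```go
package main

import (
	"fmt"
	"strings"
)

// supported mirrors the v1.0 matrix for the three wrapped tools.
var supported = map[string][]string{
	"ibmcloud":  {"local", "docker", "k8s", "ssh"},
	"iperf3":    {"local", "k8s", "ssh"},
	"terraform": {"local", "docker"},
}

// backendKind reduces a spec such as "ssh:jumphost" to its kind, "ssh".
func backendKind(spec string) string {
	if strings.HasPrefix(spec, "ssh:") {
		return "ssh"
	}
	return spec
}

// checkSupported rejects an unsupported (tool, backend) pair before any
// backend is invoked.
func checkSupported(tool, spec string) error {
	kinds, ok := supported[tool]
	if !ok {
		return fmt.Errorf("unknown tool %q", tool)
	}
	for _, k := range kinds {
		if k == backendKind(spec) {
			return nil
		}
	}
	return fmt.Errorf("%s doesn't support backend `%s`", tool, spec)
}

func main() {
	fmt.Println(checkSupported("iperf3", "docker"))
	fmt.Println(checkSupported("iperf3", "ssh:jumphost"))
}
```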

Decision tree

Pick the question that matches your scenario.

“I want to measure cluster bandwidth”

Use --backend k8s. The default for iperf3 is already k8s — the explicit flag is redundant unless you’ve overridden the default in workspace config:

roksbnkctl test throughput
# equivalent to:
roksbnkctl test throughput --backend k8s

The k8s backend deploys a server-side bare Pod + Service in roksbnkctl-test (Service type set by --mode), runs the iperf3 client as a one-shot Job in the same namespace, collects the JSON output from the client pod’s logs, and tears everything down. The bandwidth number reflects the cluster fabric.

If you instead want to measure your laptop’s uplink to the cluster:

roksbnkctl test throughput --backend local --endpoint <cluster-LB-ip>:5201

That’s a deliberately different measurement — useful when you suspect office Wi-Fi, not cluster fabric, is the bottleneck.

“I’m doing GSLB DNS validation”

Use both local and k8s. F5 BIG-IP Next’s GSLB returns different answers depending on the requesting resolver’s IP — geographic affinity, datacenter routing, health-check state. To validate that the GSLB is actually doing this, query from multiple network vantage points and compare.

The multi-vantage probe ships at v1.0 via roksbnkctl test dns --gslb-compare:

roksbnkctl test dns \
  --target www.example.com \
  --type A \
  --server gslb-vip.f5.example.com \
  --gslb-compare

--gslb-compare fans out to every configured vantage (local for your office IP, k8s for the cluster’s egress IP, ssh:<region-bastion> for a bastion in another region) in parallel and emits a single comparison JSON with a gslb_divergence boolean. Different answers across vantages are expected in a healthy GSLB; identical answers might mean the GSLB rules aren’t taking effect for the resolver positions you queried from.
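One plausible way to derive the gslb_divergence boolean is sketched below. The comparison rule is an assumption (divergence means at least two vantages returned different answer sets, ignoring record order), and answersKey and gslbDivergence are hypothetical names; the real output schema is documented in Chapter 21.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// answersKey normalises one vantage's answers so that record ordering
// alone never counts as divergence.
func answersKey(answers []string) string {
	sorted := append([]string(nil), answers...)
	sort.Strings(sorted)
	return strings.Join(sorted, ",")
}

// gslbDivergence reports whether any two vantages saw different answers.
func gslbDivergence(byVantage map[string][]string) bool {
	first := true
	var ref string
	for _, answers := range byVantage {
		key := answersKey(answers)
		if first {
			ref, first = key, false
			continue
		}
		if key != ref {
			return true
		}
	}
	return false
}

func main() {
	// Documentation-range addresses, made up for illustration.
	fmt.Println(gslbDivergence(map[string][]string{
		"local":       {"192.0.2.10"},
		"k8s":         {"198.51.100.20"},
		"ssh:bastion": {"192.0.2.10"},
	}))
}
```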

Chapter 21 — DNS testing for GSLB is the full reference.

“I need to run `ibmcloud` from a customer-firewalled office”

Use --backend ssh:<bastion>. Your customer’s network policy lets the corporate jumphost reach *.cloud.ibm.com but blocks your laptop. The SSH backend ships your kubeconfig to the bastion (single file, mode 0600, removed via trap on session exit), runs ibmcloud there, streams the output back:

roksbnkctl ibmcloud --backend ssh:bastion ks cluster ls

If ibmcloud isn’t installed on the bastion, you’ll get a clear error:

error: tool `ibmcloud` not found on ssh target bastion; re-run with --bootstrap to install
       via apt-get, or pre-install on the target manually

Re-run with --bootstrap if you want roksbnkctl to sudo apt-get install -y ibmcloud-cli on the bastion. The opt-in default reflects “we don’t surprise users with sudo apt-get on a remote they didn’t expect mutation on” — see Chapter 17 §“SSH backend” for the bootstrap mechanics.

“I’m in CI and want a frozen toolchain version”

Use --backend docker. The vendored images are tagged in lock-step with roksbnkctl releases — ghcr.io/jgruberf5/roksbnkctl-tools-ibmcloud:v1.0.0 is the exact same ibmcloud binary every CI run sees, regardless of when the runner image was built or what apt-get happens to ship that day:

roksbnkctl ibmcloud --backend docker iam oauth-tokens
roksbnkctl up --backend docker     # terraform inside hashicorp/terraform:<v>

For CI specifically, also pin ibmcloud.api_key_source: env in workspace config so the API key resolution is unambiguous (no keychain fallback to confuse a non-interactive runner) — see Chapter 14 §“Pinning a single source”.

“I’m on a clean dev machine without `ibmcloud` installed”

Use --backend docker. No apt-get install ibmcloud-cli, no IBM repo + GPG key dance, no upstream-package-version mismatch — docker pull ghcr.io/jgruberf5/roksbnkctl-tools-ibmcloud:dev is the only setup, and roksbnkctl does that for you on first invocation.

Alternatively, if your laptop is the dev machine and you’ll run ibmcloud more than once, just install it. The local backend has lower per-invocation startup latency than docker (no container create/start/log-attach), so once you’ve paid the install cost the local path is faster for the rest of the session.

“I want a cluster-side ad-hoc shell”

Use --backend k8s with the long-lived ops pod. Once roksbnkctl ops install has run, --backend k8s for ibmcloud (or any future tool) runs the equivalent of kubectl exec -n roksbnkctl-ops ops -- <argv> through client-go, so no host kubectl binary is needed. The pod stays alive between invocations, so the second and subsequent commands skip pod-startup latency.

roksbnkctl ops install
roksbnkctl ibmcloud --backend k8s iam oauth-tokens
roksbnkctl ibmcloud --backend k8s ks cluster ls
roksbnkctl ibmcloud --backend k8s account list

Chapter 19 is the full reference for the ops pod lifecycle.

“I’m pre-cluster — there’s no cluster yet”

Use local or ssh:<target>. The k8s backend prereq is a working kubeconfig pointing at a running cluster; before roksbnkctl up has succeeded, that doesn’t exist. For pre-cluster ibmcloud calls (account inspection, IAM tinkering), local and ssh:bastion are the usual paths; docker also works, since it needs only a local Docker daemon. The cluster-create itself runs terraform, which supports only local and docker at v1.0.

When not to use a backend

Common foot-guns, in rough order of how often they come up:

`--backend k8s` without `roksbnkctl ops install`

The ops pod must exist before the k8s backend can route ibmcloud calls through it. First-time use:

roksbnkctl ops install         # one-time setup per cluster
roksbnkctl ibmcloud --backend k8s ks cluster ls

If you skip the install, the backend errors with a clear remediation:

error: ops pod not found in roksbnkctl-ops namespace; run `roksbnkctl ops install` first

Chapter 19 covers the install/show/uninstall lifecycle.

`--backend docker` for a network-locality test

iperf3 and the DNS probe both reject --backend docker because a local Docker container has the same network identity as the host (default bridge networking). The probe wouldn’t measure anything different. The CLI errors at parse time:

$ roksbnkctl test throughput --backend docker
error: iperf3 doesn't support backend `docker` (same network identity as `local`,
       no value-add); supported: local, k8s, ssh:<target>

If you actually want a hermetic-tools throughput test, --backend k8s is the right answer.

`--backend ssh:host` without `--bootstrap` on a fresh target

If ibmcloud (or iperf3) isn’t installed on the target, the SSH backend won’t silently sudo apt-get for you — --bootstrap is opt-in. The first call on a fresh target tells you exactly what’s needed:

error: tool `ibmcloud` not found on ssh target bastion; re-run with --bootstrap to install
       via apt-get, or pre-install on the target manually

Re-run with --bootstrap if mutation is OK; otherwise pre-install via your config-management of choice (Ansible, Salt, baked-in-image).

`--backend ssh:host` to a non-Ubuntu target with `--bootstrap`

The apt-bootstrap recipe is Ubuntu-only this round. RHEL / CentOS / Alpine targets need pre-installation via yum / dnf / apk — --bootstrap errors out cleanly:

error: auto-install only supports Ubuntu. Pre-install `ibmcloud-cli` on the target
       (RHEL: `yum install ibmcloud-cli`)

Once the tool is installed, --backend ssh:host works without --bootstrap.

`--backend k8s` for `terraform`

Deferred to v1.x. The terraform tool’s k8s + ssh backends require a state-handling design that hasn’t shipped — moving the state file into a Kubernetes Secret or scp’ing it pre/post-run is fiddly enough to be a feature in its own right (PRD 03 §“State concerns”; roadmap in docs/PLAN.md §“What’s deliberately deferred to post-v1.0”). For now, terraform supports local and docker only. If the network-locality use case (running terraform from a customer VPC for IP-egress reasons) is blocking, file an issue.

Mixing `--on` and `--backend ssh:<target>`

--on <target> is the Chapter 16 lightweight remote-exec — it runs the passthrough shape (exec, shell, kubectl, oc, ibmcloud) on the target by literally re-running the command via SSH. --backend ssh:<target> is the heavier-duty form — it routes through the Backend interface, which means file materialisation, env propagation hardening, opt-in apt-bootstrap, and the redactor are all wired in.

You generally want one or the other, not both. The supported precedence is “--backend ssh:<target> wins”; passing both flags on the same invocation surfaces a warning. If you’re calling roksbnkctl ibmcloud …, prefer --backend ssh:<target> for the same target — you get the better cred-handling story automatically.

Workspace config + `--backend` flag interaction

Recap of Chapter 12 §“exec:”:

The flag wins. If ~/.roksbnkctl/<ws>/config.yaml says:

exec:
  iperf3: { backend: k8s }

…and you run roksbnkctl test throughput --backend local, the local backend runs. The flag is the per-invocation override; the workspace config is the per-workspace default.

If neither is set, the per-tool default from § Per-tool default backends applies (iperf3 → k8s, ibmcloud → local, terraform → local). The resolution order, highest precedence first, is:

--backend flag
exec.<tool>.backend in workspace config
Per-tool baked-in default

There’s no fallback chain inside this resolution — if you pass --backend k8s and the cluster is unreachable, the backend errors with “cluster API unreachable” (exit 127). It does not fall through to local. Silent fallback hides intent and produces confusing CI results; the failure-mode discipline in Chapter 17 §“Backend-failure semantics” applies here too.
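The three-step resolution with no fallback can be sketched in a few lines. resolveBackend and builtinDefaults are hypothetical stand-ins for the resolveBackendSpecWith helpers named earlier, whose real signatures take different arguments.

```go
package main

import "fmt"

// builtinDefaults mirrors the per-tool defaults listed in this chapter.
var builtinDefaults = map[string]string{
	"ibmcloud":  "local",
	"iperf3":    "k8s",
	"terraform": "local",
}

// resolveBackend applies the precedence: flag, then workspace config,
// then baked-in default. It only picks a name; it never retries with a
// different backend if the chosen one later fails.
func resolveBackend(tool, flag string, workspace map[string]string) (string, error) {
	if flag != "" {
		return flag, nil
	}
	if b, ok := workspace[tool]; ok && b != "" {
		return b, nil
	}
	if b, ok := builtinDefaults[tool]; ok {
		return b, nil
	}
	return "", fmt.Errorf("no default backend for tool %q", tool)
}

func main() {
	ws := map[string]string{"ibmcloud": "ssh:bastion"}
	fmt.Println(resolveBackend("ibmcloud", "", ws))
	fmt.Println(resolveBackend("ibmcloud", "docker", ws))
	fmt.Println(resolveBackend("iperf3", "", ws))
}
```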

Summary table

The decision-tree contents collapsed into one table:

If you want to…	Backend	Notes
Measure cluster bandwidth	`k8s`	iperf3 client + server in cluster (the default)
Measure laptop-uplink-to-cluster bandwidth	`local`	deliberate; not the iperf3 default
GSLB DNS cross-vantage compare	`local` + `k8s` (`--gslb-compare`)	multiple vantages in parallel
`ibmcloud` from a customer-firewalled office	`ssh:bastion`	with `--bootstrap` if first call on fresh Ubuntu
Frozen-version CI for any tool	`docker`	image tag matches `roksbnkctl` release
Cluster-side ad-hoc `ibmcloud` debugging	`k8s`	requires `roksbnkctl ops install` first
Pre-cluster ibmcloud / terraform	`local` or `ssh`	`k8s` requires a working cluster; terraform itself is `local` / `docker` only
`roksbnkctl up` (terraform) on a clean dev machine	`local` (default) or `docker`	k8s + ssh deferred
Air-gapped: laptop can’t reach IBM Cloud, bastion can	`ssh:bastion`	with kubeconfig propagation
Just learning the tool	`local`	simplest mental model

Worked example: bare-metal + jumphost office workflow

End-to-end Part V scenario: you’re an F5 SE running a customer POC from a corporate-firewalled office. The laptop can’t reach *.cloud.ibm.com directly (the office proxy blocks it) but a customer-provisioned Ubuntu jumphost at 10.20.30.40 can. The jumphost was already auto-discovered by an earlier roksbnkctl up against this customer’s account, so targets list shows it. You need to: install the in-cluster ops pod, run ibmcloud from the bastion, and run a throughput test from inside the cluster — all without installing tools locally.

# 1. Verify jumphost is registered + reachable
$ roksbnkctl targets list -w customer
NAME       HOST          KEY_SOURCE         STATUS
jumphost   10.20.30.40   workspace/state    reachable

# 2. Run ibmcloud from the jumphost (Sprint 1 --on flag, lightweight)
$ roksbnkctl ibmcloud --on jumphost ks cluster ls
OK
Name              ID                                     State    Created     ...
customer-cluster  c4abc123def456                         normal   3 days ago  ...

# 3. For the same call routed through the Backend interface (cred-handling
# hardened, redactor wired, opt-in apt-bootstrap available), use --backend
$ roksbnkctl ibmcloud --backend ssh:jumphost ks cluster ls
# Same output; different code path. Prefer --backend ssh:<target> for
# everything except quick interactive shells where --on is faster to type.

# 4. Install the in-cluster ops pod (one-time per cluster)
$ roksbnkctl ops install
✓ Namespace roksbnkctl-ops created
✓ ServiceAccount + Role + RoleBinding applied
✓ Secret roksbnkctl-ibm-creds applied (envFrom secretRef)
✓ Pod roksbnkctl-ops Running (2.3s)

# 5. Same ibmcloud call routed through the ops pod (k8s backend)
$ roksbnkctl ibmcloud --backend k8s iam oauth-tokens
IAM token: Bearer eyJ...
# (The token comes from inside the cluster — different egress IP from the
# jumphost's, useful when IAM policy is IP-conditional.)

# 6. Throughput test using the cluster vantage (default for iperf3)
$ roksbnkctl test throughput
→ Deploying iperf3 server pod into namespace "roksbnkctl-test"
✓ Pod ready (iperf3-server-...)
→ Deploying iperf3 client Job in the same namespace
✓ Client Job complete
✓ throughput: 8.92 Gbits/sec (mean over 10s)
→ Tearing down iperf3 fixture
✓ pod, service, and Job deleted

# 7. Throughput test from the jumphost into the cluster (north-south, real
# customer-network bandwidth — not laptop wifi)
$ roksbnkctl test throughput --backend ssh:jumphost --mode north-south
✓ throughput: 936 Mbits/sec  (jumphost → cluster LB; customer's WAN)

# 8. Persist the per-tool routing in workspace config (one-time)
$ cat >> ~/.roksbnkctl/customer/config.yaml <<'YAML'
exec:
  ibmcloud:  { backend: ssh:jumphost }
  iperf3:    { backend: k8s }
  terraform: { backend: local }
YAML
# Subsequent runs skip the --backend flag — every ibmcloud call routes via
# the jumphost automatically; every iperf3 runs in-cluster.

The point of this walkthrough: with no tools installed locally beyond roksbnkctl itself, you’ve reached the IBM Cloud control plane from the customer’s bastion (compliance-correct egress), exercised the cluster fabric for throughput, and persisted the routing per workspace. The same laptop with the same roksbnkctl binary handles a different customer’s POC by pointing at a different workspace; nothing on the laptop is workspace-specific.

Chapter 19 covers the ops-pod lifecycle in detail; Chapter 22 covers the north-south vs east-west modes.

Cross-references

Chapter 12 — Workspace config — the exec: block schema.
Chapter 14 — Credentials and the resolver chain — how creds reach each backend.
Chapter 16 — The --on flag and SSH jumphosts — the lightweight remote-exec predecessor to the SSH backend.
Chapter 17 — Execution backends — per-backend mechanics.
Chapter 19 — The in-cluster ops pod — the cluster-side prerequisite for --backend k8s.
Chapter 22 — Throughput testing — iperf3-specific flags.
PRD 03 — pluggable execution backends — the design spec.

The in-cluster ops pod

The k8s execution backend has two execution patterns: a long-lived ops pod for ad-hoc commands, and one-shot Jobs for throughput tests, DNS probes, and other per-invocation workloads. Chapter 17 §“K8s backend” covered the interface mechanics — how Backend.Run dispatches into either pattern.

This chapter is the reference for the pod itself: what roksbnkctl ops install deploys, what RBAC it grants, where credentials live, how to rotate them, and how to debug when something goes wrong.

If you’ve never run roksbnkctl ops install, you can read this chapter front-to-back; otherwise the § Operability section near the end is the troubleshooting jump-off point.

What the ops pod is

A long-lived pod in the roksbnkctl-ops namespace, running an image bundled with the tools roksbnkctl may want to invoke cluster-side: ibmcloud CLI plus kubectl as a fallback, with oc and terraform reserved for future iterations.

The pod sits idle waiting for kubectl exec calls. Each roksbnkctl ibmcloud --backend k8s … invocation routes through client-go’s SPDY executor, runs the wrapped tool inside the existing pod, streams stdout/stderr back, and returns the exit code. No pod create/start latency between invocations — a session of twenty ibmcloud commands pays the startup cost once.

Compared to the one-shot Job pattern (used for iperf3 and the DNS probe), the ops pod trades a small idle resource footprint for substantially lower per-call latency. It’s the right shape when you want to debug interactively or run many small commands.
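To make the trade-off concrete, here is a back-of-the-envelope model. The startup and per-command figures are placeholder assumptions, not measurements of roksbnkctl; substitute values observed on your own cluster.

```python
def session_cost_seconds(calls: int, startup_s: float, exec_s: float,
                         long_lived: bool) -> float:
    """Total wall-clock cost of a session of `calls` commands.

    long_lived=True models the ops pod: pod startup is paid once.
    long_lived=False models one-shot Jobs: pod startup is paid on every call.
    """
    if calls < 0:
        raise ValueError("calls must be non-negative")
    if calls == 0:
        return 0.0
    if long_lived:
        return startup_s + calls * exec_s
    return calls * (startup_s + exec_s)


# Hypothetical figures: 5 s to schedule and start a pod, 1 s per command.
ops_pod = session_cost_seconds(20, startup_s=5.0, exec_s=1.0, long_lived=True)
one_shot = session_cost_seconds(20, startup_s=5.0, exec_s=1.0, long_lived=False)
print(f"ops pod: {ops_pod:.0f}s, one-shot Jobs: {one_shot:.0f}s")
```

With these made-up numbers the gap grows linearly with the number of commands, which is why the long-lived pod suits interactive sessions and the Job pattern suits isolated, self-cleaning workloads.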

`roksbnkctl ops install`

Idempotent setup. Run once per cluster; re-run any time you want to refresh the image, rotate the API key Secret, or recover from a partial uninstall.

roksbnkctl ops install

What it does, step by step:

1. Create the namespace

apiVersion: v1
kind: Namespace
metadata:
  name: roksbnkctl-ops
  labels:
    app.kubernetes.io/name: roksbnkctl
    app.kubernetes.io/component: ops-pod

The roksbnkctl-ops namespace is dedicated to the long-lived pod. Separate from roksbnkctl-test (where one-shot Jobs run) so RBAC can be scoped per namespace — see § RBAC below.

2. Create the ServiceAccount

apiVersion: v1
kind: ServiceAccount
metadata:
  name: roksbnkctl-ops
  namespace: roksbnkctl-ops

The pod runs as this SA. Its projected token is auto-mounted at /var/run/secrets/kubernetes.io/serviceaccount/, which is what the bundled kubectl uses for in-cluster authentication. The IBM Cloud API key (a separate credential) reaches the pod through a Kubernetes Secret — see § Credential propagation below.

3. Create the ClusterRole + ClusterRoleBinding

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: roksbnkctl-ops
rules:
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs:     ["get", "list", "watch", "create", "delete"]
- apiGroups: [""]
  resources: ["pods"]
  verbs:     ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["pods/log"]
  verbs:     ["get", "list"]
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs:     ["create", "get"]
- apiGroups: [""]
  resources: ["secrets"]
  resourceNames: ["roksbnkctl-ibm-creds"]
  verbs:     ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: roksbnkctl-ops
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: roksbnkctl-ops
subjects:
- kind: ServiceAccount
  name: roksbnkctl-ops
  namespace: roksbnkctl-ops

The full manifest lives at internal/exec/k8s_install.yaml (embedded into the binary). § RBAC walks through what each rule is for.

4. Create or update the credential Secret

v1.2+ note. This step describes the static-key Secret applied under --trusted-profile=off and under the auto-fallback when IAM perms don’t allow the trusted-profile path. Under --trusted-profile=auto success the Secret is still applied with empty data fields (placeholder so envFrom: secretRef always resolves); the cred propagation happens via the trusted-profile annotation on the SA + the IAM_PROFILE_ID env var instead. See §“Trusted-profile flow (v1.2+)” below.

apiVersion: v1
kind: Secret
metadata:
  name: roksbnkctl-ibm-creds
  namespace: roksbnkctl-ops
  annotations:
    helm.sh/resource-policy: keep            # don't sweep on accidental destroy
type: Opaque
stringData:
  IBMCLOUD_API_KEY: <resolved-key-value>
  IC_API_KEY: <resolved-key-value>           # legacy alias, same resolved value

The key value comes from the workspace’s resolver chain (env → keychain → config-b64 → prompt) — see Chapter 14 for the resolution order. The Secret carries two keys (IBMCLOUD_API_KEY and the legacy alias IC_API_KEY) both populated from the same resolved value, so older ibmcloud CLI versions that look for the IC_ name find it.

If the Secret already exists (re-running ops install after a key rotation), roksbnkctl does a client-side Get + Update: the Secret’s data is overwritten with the freshly resolved value, the roksbnkctl.io/rotated-at annotation is stamped with the current timestamp, and the rest of the Secret’s metadata is left untouched. roksbnkctl ops show surfaces that timestamp on its secret: line, as (rotated <timestamp>), by reading the annotation.
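The Get + Update semantics can be sketched as a pure function over a Secret represented as a dictionary. This is an illustration of the merge rule described above, not the code in the binary; the function name rotate_secret and the dictionary layout are assumptions.

```python
import base64
import copy
from datetime import datetime, timezone

ROTATED_AT = "roksbnkctl.io/rotated-at"


def rotate_secret(existing: dict, api_key: str, now: datetime) -> dict:
    """Return an updated copy of a Secret with fresh data and a new timestamp."""
    updated = copy.deepcopy(existing)  # never mutate the object we fetched
    encoded = base64.b64encode(api_key.encode()).decode()
    # Data is overwritten wholesale; both key names carry the same value.
    updated["data"] = {"IBMCLOUD_API_KEY": encoded, "IC_API_KEY": encoded}
    # Only the rotation annotation changes; other metadata is left alone.
    annotations = updated.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[ROTATED_AT] = now.astimezone(timezone.utc).strftime(
        "%Y-%m-%dT%H:%M:%SZ")
    return updated


old = {
    "metadata": {
        "name": "roksbnkctl-ibm-creds",
        "labels": {"team": "demo"},  # hypothetical label added by an operator
        "annotations": {"helm.sh/resource-policy": "keep"},
    },
    "data": {"IBMCLOUD_API_KEY": "b2xk"},
}
new = rotate_secret(old, "new-key",
                    datetime(2026, 5, 10, 11, 3, 17, tzinfo=timezone.utc))
print(new["metadata"]["annotations"][ROTATED_AT])
```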

5. Create the Pod

apiVersion: v1
kind: Pod
metadata:
  name: roksbnkctl-ops
  namespace: roksbnkctl-ops
  labels:
    app: roksbnkctl-ops
    roksbnkctl.io/managed: "true"
spec:
  serviceAccountName: roksbnkctl-ops
  restartPolicy: Always
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: tools
    image: ${OPS_IMAGE}                  # resolved from roksbnkctl's version at install time
    imagePullPolicy: IfNotPresent
    command: ["sleep", "infinity"]
    envFrom:
    - secretRef:
        name: roksbnkctl-ibm-creds
    securityContext:
      allowPrivilegeEscalation: false
      runAsNonRoot: true
      capabilities:
        drop: ["ALL"]
    resources:
      requests: { cpu: 50m,  memory: 64Mi }
      limits:   { cpu: 500m, memory: 256Mi }

Three details to call out:

command: ["sleep", "infinity"] is the pod’s own command. It keeps the container alive without doing any work; each Backend.Run invocation issues a kubectl exec that starts a new process alongside this idle one, so the pod’s main process never exits as long as the pod is healthy.
securityContext is set explicitly for OpenShift’s restricted-v2 SCC. Pod-level runAsNonRoot + seccompProfile.type: RuntimeDefault; container-level allowPrivilegeEscalation: false + capabilities.drop: [ALL] + runAsNonRoot — the same fields the iperf3 server pod sets, for the same reason.
envFrom: secretRef — the API key reaches the pod’s env without ever touching the pod manifest’s argv or env: block. kubectl describe pod roksbnkctl-ops shows the secret reference name but not the value, per PRD 04 §“In-cluster pod”.
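If you maintain a customised copy of the manifest, a small check like the following can catch a dropped hardening field before the cluster does. It is a sketch that only looks at the fields named in this chapter; it does not reproduce OpenShift’s full restricted-v2 admission logic, and the function name is an assumption.

```python
def restricted_v2_gaps(pod_spec: dict) -> list[str]:
    """List the hardening fields from this chapter that a pod spec is missing."""
    gaps = []
    pod_sc = pod_spec.get("securityContext", {})
    if pod_sc.get("runAsNonRoot") is not True:
        gaps.append("spec.securityContext.runAsNonRoot")
    if pod_sc.get("seccompProfile", {}).get("type") != "RuntimeDefault":
        gaps.append("spec.securityContext.seccompProfile.type")
    for container in pod_spec.get("containers", []):
        sc = container.get("securityContext", {})
        prefix = f"containers[{container.get('name', '?')}].securityContext"
        if sc.get("allowPrivilegeEscalation") is not False:
            gaps.append(f"{prefix}.allowPrivilegeEscalation")
        if sc.get("runAsNonRoot") is not True:
            gaps.append(f"{prefix}.runAsNonRoot")
        if "ALL" not in sc.get("capabilities", {}).get("drop", []):
            gaps.append(f"{prefix}.capabilities.drop")
    return gaps


ops_pod_spec = {
    "securityContext": {"runAsNonRoot": True,
                        "seccompProfile": {"type": "RuntimeDefault"}},
    "containers": [{
        "name": "tools",
        "securityContext": {"allowPrivilegeEscalation": False,
                            "runAsNonRoot": True,
                            "capabilities": {"drop": ["ALL"]}},
    }],
}
print(restricted_v2_gaps(ops_pod_spec))
```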

6. Wait for readiness

roksbnkctl ops install waits for Pod.Status.Phase == Running and the container’s Ready condition before returning. Default timeout is 60 seconds; longer for clusters with slow image pulls (the ghcr.io image is ~80 MiB). Failures surface a kubectl describe pod roksbnkctl-ops excerpt for context.
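The readiness wait amounts to a poll-until-deadline loop. The sketch below shows that shape with the probe, clock, and sleep injected so it can be exercised without a cluster; the one-second polling interval and all names are assumptions rather than the tool’s actual values.

```python
import time
from typing import Callable, Tuple


def wait_ready(probe: Callable[[], Tuple[str, bool]],
               timeout_s: float = 60.0,
               interval_s: float = 1.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the pod is Running and Ready, or the deadline passes.

    `probe` returns (phase, ready) for the pod, for example ("Pending", False).
    """
    deadline = clock() + timeout_s
    while True:
        phase, ready = probe()
        if phase == "Running" and ready:
            return True
        if clock() >= deadline:
            return False  # the caller would print a describe-pod excerpt here
        sleep(interval_s)


# Simulated pod that becomes Ready on the third poll, with a fake clock.
states = iter([("Pending", False), ("Running", False), ("Running", True)])
fake_now = [0.0]
ok = wait_ready(lambda: next(states),
                clock=lambda: fake_now[0],
                sleep=lambda s: fake_now.__setitem__(0, fake_now[0] + s))
print(ok, fake_now[0])
```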

Trusted-profile flow (v1.2+)

The static-key Secret described above is the v1.0.x / v1.1.x path. In v1.2.0 it becomes the fallback: the default ops install invocation auto-provisions an IBM Cloud IAM trusted profile linked to the ops pod’s ServiceAccount, and the static API key no longer needs to land in any Kubernetes Secret. PRD 04 §“Resolved in Sprint 9” → “Trusted-profile auto-provisioning (k8s backend)” is the design reference; this section is the operational walkthrough.

v1.3.0 closes both the provisioning and the runtime sides of this flow — ops install --trusted-profile=auto provisions the profile, and the in-pod ibmcloud login wrap detects the SA’s trusted-profile annotation and authenticates via the projected SA token at runtime. The v1.2.x partial-closure history (provisioning shipped, runtime deferred) is preserved in CHANGELOG v1.3.0 → ### Changed for readers who specifically want the chronology.

`roksbnkctl ops install --trusted-profile=auto`

--trusted-profile=auto is the default as of v1.2 — running roksbnkctl ops install with no flag picks the auto path. Naming the flag explicitly is useful in scripts that pin behaviour or in docs that want to read unambiguously:

$ roksbnkctl ops install --trusted-profile=auto
✓ Provisioned IAM trusted profile roksbnkctl-ops-canada-roks (iam-Profile-9f2…)
✓ created namespace roksbnkctl-ops
✓ created sa roksbnkctl-ops/roksbnkctl-ops
✓ created secret roksbnkctl-ops/roksbnkctl-ibm-creds
✓ created clusterrole roksbnkctl-ops
✓ created crb roksbnkctl-ops
✓ created pod roksbnkctl-ops/roksbnkctl-ops
→ Waiting for ops pod to be Ready (60s timeout)
✓ Ops pod is Ready (trusted profile roksbnkctl-ops-canada-roks)

Re-runs against an existing install emit updated <kind> … / <kind> … exists instead of created for each resource that already matches the desired state. The trusted-profile provisioning line above is the single line internal/cli/ops.go emits for the whole IBM IAM-side flow (perm probe + profile create + compute-resource link + SA annotation) — the work happens silently inside resolveTrustedProfileForInstall; the one line you see is the receipt.

What just happened, in order (the binary doesn’t narrate these steps but they’re what’s actually going on):

IAM perm probe. ops install calls IBM IAM Identity to confirm the resolved API key has iam-identity perms. On 403, the flag value drives the next step: --trusted-profile=auto falls back to the static-key Secret with a warning (see §“--trusted-profile=auto falling back” below), while --trusted-profile=on errors out with a non-zero exit.
Profile creation. Names the profile roksbnkctl-ops-<workspace> so multiple workspaces against the same IBM Cloud account don’t race for a single shared name. The compute-resource link binds the profile to your cluster’s OIDC issuer URL + the roksbnkctl-ops/roksbnkctl-ops ServiceAccount specifically — other SAs on the same cluster can’t assume the profile.
Policy attachment. v1.2 ships with no default policies attached — the profile inherits whatever IAM policies your account has set up for trusted profiles in general (typically nothing, until you grant). A future cycle will surface ibmcloud.trusted_profile.policies as a workspace-config block; tracked under v1.x deferred. If you need the profile to actually authorise specific actions (Container Registry pulls, Cloud Object Storage reads), grant the policies via IBM Cloud Console or ibmcloud iam trusted-profile-policy-create after ops install returns.
SA annotation. The ServiceAccount gets iam.cloud.ibm.com/trusted-profile: <profile-id>, where the value is the IBM IAM Profile ID (Profile-<uuid>) rather than the friendly roksbnkctl-ops-<workspace> name, plus the roksbnkctl.io/trusted-profile-managed: "true" marker that signals ops uninstall to delete the profile during cleanup.
Pod creation. The pod’s container always has envFrom: secretRef: roksbnkctl-ibm-creds; what changes between modes is the Secret’s contents and the pod’s token wiring. Under --trusted-profile=auto success the Secret is created with empty data (IBMCLOUD_API_KEY is the empty string), and the pod gains an extra IAM_PROFILE_ID env var pointing at the provisioned profile’s ID plus a projected ServiceAccount-token volume mounted at /var/run/secrets/tokens/token (audience iam), so the pod has a cluster-issued JWT the IBM IAM endpoint will accept. The in-pod ibmcloud login wrap detects the SA’s trusted-profile annotation and runs ibmcloud login -a https://cloud.ibm.com --cr-token @/var/run/secrets/tokens/token --profile "$IAM_PROFILE_ID" -r "${IBMCLOUD_REGION:-us-south}" --quiet. The --cr-token @<path> form reads the projected SA token from disk; IBM IAM validates that JWT against the trusted profile’s ROKS_SA compute-resource link (the link internal/ibm/trusted_profile.go::ensureLink provisions). The static API key never transits the pod env. Under --trusted-profile=off (or the auto-fallback) the Secret carries the resolved API key, no projected token volume is mounted, and the wrap runs ibmcloud login --apikey "$IBMCLOUD_API_KEY", which is the v1.0.x path.
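The two login shapes described in the last step can be summarised as a small selection function. This is a sketch only: it branches on whether IAM_PROFILE_ID is set in the environment, which is an assumption made for illustration (the text above says the real wrap keys off the SA’s trusted-profile annotation), and the function name is hypothetical.

```python
def login_argv(env: dict) -> list[str]:
    """Build the ibmcloud login command line for the current credential mode."""
    profile_id = env.get("IAM_PROFILE_ID", "")
    if profile_id:
        # Trusted-profile path: authenticate with the projected SA token.
        return ["ibmcloud", "login", "-a", "https://cloud.ibm.com",
                "--cr-token", "@/var/run/secrets/tokens/token",
                "--profile", profile_id,
                # Same default as ${IBMCLOUD_REGION:-us-south} in the shell form.
                "-r", env.get("IBMCLOUD_REGION") or "us-south",
                "--quiet"]
    api_key = env.get("IBMCLOUD_API_KEY", "")
    if not api_key:
        raise RuntimeError("neither IAM_PROFILE_ID nor IBMCLOUD_API_KEY is set")
    # Static-key path (the v1.0.x shape).
    return ["ibmcloud", "login", "--apikey", api_key]


print(login_argv({"IAM_PROFILE_ID": "Profile-example", "IBMCLOUD_API_KEY": ""}))
```

Note that under the trusted-profile path the empty IBMCLOUD_API_KEY is never consulted, which matches the placeholder Secret described earlier.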

Verifying the profile is in use

The ServiceAccount carries the truth-of-record annotation:

$ oc get serviceaccount roksbnkctl-ops -n roksbnkctl-ops -o yaml
# or, kubectl-equivalent via the bundled passthrough:
$ roksbnkctl k get sa roksbnkctl-ops -n roksbnkctl-ops -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    iam.cloud.ibm.com/trusted-profile: Profile-ccba11f2-3b1f-4b1a-b8a4-aeed2b7b3320  # ← the IBM IAM Profile ID
    roksbnkctl.io/trusted-profile-managed: "true"                                     # ← ops uninstall will delete it
    roksbnkctl.io/provisioned-at: "2026-05-13T14:08:33Z"
  name: roksbnkctl-ops
  namespace: roksbnkctl-ops

End-to-end smoke test of the runtime cred flow:

$ roksbnkctl --backend k8s ibmcloud iam oauth-tokens
IAM token:  Bearer eyJ…

The token is fresh-each-call: the in-pod wrap detects the SA’s iam.cloud.ibm.com/trusted-profile annotation, trades the pod’s projected SA token for an IAM token against the trusted profile, and returns it to the caller. No static API key transits the pod env.

The first invocation may take 30–60 seconds after ops install returns because IBM IAM needs to pick up the cluster’s OIDC issuer URL before it will accept the projected SA token as proof for the profile. The wrap absorbs this with a 3-attempt × 20s-backoff retry — up to ~40s of waiting inside the wrap before it surfaces the failure. On triple-fail the wrap prints trusted-profile login failed after 3 attempts: <captured-stderr> to your terminal (the captured stderr will include the underlying ibmcloud login diagnostic — typically the “Unable to authenticate” / FAILED banner shape). If your first smoke test produces that line, give IAM a few more seconds and re-run. After the first successful call the wrap’s auth state is cached for the pod’s lifetime, so subsequent roksbnkctl --backend k8s ibmcloud <subcommand> invocations don’t re-pay the propagation window.
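The retry arithmetic is easy to check with a small model: three attempts with a 20-second pause between them means two pauses, so about 40 seconds of waiting before the failure surfaces. The sketch below assumes a fixed (non-exponential) backoff and uses hypothetical names; only the error prefix is taken from the text above.

```python
import time
from typing import Callable, List, Tuple


def login_with_retry(attempt: Callable[[], Tuple[bool, str]],
                     attempts: int = 3,
                     backoff_s: float = 20.0,
                     sleep: Callable[[float], None] = time.sleep) -> None:
    """Call `attempt` until it succeeds or the attempts are exhausted.

    `attempt` returns (ok, captured_stderr).
    """
    last_stderr = ""
    for i in range(attempts):
        ok, last_stderr = attempt()
        if ok:
            return
        if i < attempts - 1:  # no pause after the final failure
            sleep(backoff_s)
    raise RuntimeError(
        f"trusted-profile login failed after {attempts} attempts: {last_stderr}")


# Simulate IAM not yet accepting the token; record pauses instead of sleeping.
pauses: List[float] = []
try:
    login_with_retry(lambda: (False, "FAILED Unable to authenticate"),
                     sleep=pauses.append)
except RuntimeError as exc:
    print(sum(pauses), exc)
```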

`--trusted-profile=auto` falling back

auto falls back to the v1.0.x static-key Secret when any of three pre-conditions for trusted-profile provisioning isn’t met. The warning prints first (the fallback decision is made before any cluster-side resource is applied), then the rest of the install proceeds with the v1.0.x static-key shape:

$ roksbnkctl ops install
warning: IAM perm 'iam-identity' missing; using static-key Secret. Pass `--trusted-profile=off` to silence.
✓ created namespace roksbnkctl-ops
✓ created sa roksbnkctl-ops/roksbnkctl-ops
✓ created secret roksbnkctl-ops/roksbnkctl-ibm-creds
✓ created clusterrole roksbnkctl-ops
✓ created crb roksbnkctl-ops
✓ created pod roksbnkctl-ops/roksbnkctl-ops
→ Waiting for ops pod to be Ready (60s timeout)
✓ Ops pod is Ready (static-key Secret)

The three warning shapes (in source order — internal/cli/ops.go resolveTrustedProfileForInstall):

Trigger	Warning
Workspace has no registered cluster yet (`cluster-outputs.json` missing; run `cluster up` or `cluster register` first)	``warning: trusted-profile mode 'auto' needs a registered cluster (<err>); falling back to static-key Secret. Pass `--trusted-profile=off` to silence.``
Registered cluster lookup against the IBM Cloud API failed (network, key auth, cluster deleted out-of-band)	``warning: trusted-profile mode 'auto' couldn't look up cluster (<err>); falling back to static-key Secret. Pass `--trusted-profile=off` to silence.``
API key lacks IAM `iam-identity` permission (the most common fallback)	``warning: IAM perm 'iam-identity' missing; using static-key Secret. Pass `--trusted-profile=off` to silence.``

All three are non-fatal; the install completes and the pod works exactly as it did in v1.0.x. The warnings are terse on purpose — the actionable detail belongs in this chapter, not in every stderr line. Three ways to clear them permanently:

Run cluster up or cluster register first if the warning names the missing cluster registration. Re-running ops install after the registration completes detects the cluster and switches to the trusted-profile path.
Ask your IAM admin to grant iam-identity Operator role on the API key (or use a different key that already has it) if the warning names the missing IAM perm. Re-run ops install — the install detects the changed perm posture on re-run and replaces the static-key Secret with a trusted-profile binding.
Opt out via --trusted-profile=off (next subsection) if you don’t want the warning every install.

`--trusted-profile=off`

Explicit opt-out. Skips the IAM perm check entirely and provisions the v1.0.x static-key Secret:

$ roksbnkctl ops install --trusted-profile=off
✓ created namespace roksbnkctl-ops
✓ created sa roksbnkctl-ops/roksbnkctl-ops
✓ created secret roksbnkctl-ops/roksbnkctl-ibm-creds
✓ created clusterrole roksbnkctl-ops
✓ created crb roksbnkctl-ops
✓ created pod roksbnkctl-ops/roksbnkctl-ops
→ Waiting for ops pod to be Ready (60s timeout)
✓ Ops pod is Ready (static-key Secret)

Use cases:

Reproducing v1.0.x behaviour exactly — for byte-for-byte parity tests against an older deployment, or for scripts whose assertions match the v1.0.x ops show output verbatim.
Air-gapped clusters that can’t reach the IBM IAM API at runtime — without that connectivity, the pod can’t trade its projected SA token for an IAM token, so the trusted-profile path is non-functional regardless of perms.
Cred rotation runbooks that already automate the static-key path and aren’t yet ready to switch to the projected-token model.

The third value, --trusted-profile=on, is the inverse — it forces the trusted-profile path and refuses to fall back on perm-missing, returning a non-zero exit with the same warning text. Use it in CI to surface IAM-perm regressions explicitly.
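The behaviour of the three flag values can be condensed into one decision function. This is a sketch of the rules as described in this chapter, not the logic in internal/cli/ops.go: the reason codes are invented for illustration, and treating every unmet pre-condition as fatal under on is an assumption (the text only confirms that on refuses to fall back when the IAM perm is missing).

```python
from typing import Optional, Tuple


def resolve_mode(flag: str, cluster_registered: bool, lookup_ok: bool,
                 has_iam_perm: bool) -> Tuple[str, Optional[str]]:
    """Return (credential mode, fallback reason or None)."""
    if flag not in ("auto", "on", "off"):
        raise ValueError(f"unknown --trusted-profile value: {flag!r}")
    if flag == "off":
        return ("static-key", None)  # explicit opt-out: no probes, no warning
    # Pre-conditions are checked in the order the warnings are listed above.
    reason = None
    if not cluster_registered:
        reason = "no-registered-cluster"
    elif not lookup_ok:
        reason = "cluster-lookup-failed"
    elif not has_iam_perm:
        reason = "iam-perm-missing"
    if reason is None:
        return ("trusted-profile", None)
    if flag == "on":
        raise RuntimeError(f"trusted profile required but unavailable: {reason}")
    return ("static-key", reason)  # auto: warn and fall back


print(resolve_mode("auto", True, True, False))
```

In a CI job, running with on turns the third branch into a hard failure, which is what surfaces an IAM-perm regression instead of silently reverting to the static key.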

Cleanup on `ops uninstall`

roksbnkctl ops uninstall --confirm honors the roksbnkctl.io/trusted-profile-managed: "true" annotation on the SA and deletes the IBM Cloud trusted profile alongside the cluster-side objects:

$ roksbnkctl ops uninstall --confirm
✓ deleted trusted profile roksbnkctl-ops-canada-roks
✓ deleted pod roksbnkctl-ops
✓ deleted secret roksbnkctl-ibm-creds
✓ deleted serviceaccount roksbnkctl-ops
✓ deleted clusterrolebinding roksbnkctl-ops
✓ deleted clusterrole roksbnkctl-ops
✓ deleted namespace roksbnkctl-ops
✓ deleted namespace roksbnkctl-test

The trusted profile is deleted first, before the cluster-side objects, so even if the cluster API becomes unreachable mid-uninstall the IBM Cloud-side state isn’t left orphaned. The Secret is always deleted regardless of mode (it’s always rendered by ops install, just with empty data under --trusted-profile=auto success).

The trusted-profile delete is best-effort — if the calling user’s API key has lost iam-identity perms in the meantime (or the key itself has rotated and the new key doesn’t have those perms), the cluster-side objects still delete and a warning line is printed instructing the user to delete the profile manually via the IBM Cloud console. The annotation remains correct documentation of what was provisioned; roksbnkctl ops install on a fresh cluster will pick a fresh profile name unconditionally so an orphaned profile from a prior install doesn’t collide.

--trusted-profile=off installs leave no trusted profile to clean up — ops uninstall --confirm just deletes the cluster-side objects + the static-key Secret as it did in v1.0.x.
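The ordering and the best-effort rule can be sketched as follows. The two callables stand in for the IBM Cloud and Kubernetes API calls and are hypothetical, as is the wording of the warning; the annotation keys are the ones shown earlier in this chapter.

```python
from typing import Callable, Dict, List

MANAGED = "roksbnkctl.io/trusted-profile-managed"
PROFILE = "iam.cloud.ibm.com/trusted-profile"


def uninstall(sa_annotations: Dict[str, str],
              delete_profile: Callable[[str], None],
              delete_cluster_objects: Callable[[], None]) -> List[str]:
    """Run the two uninstall phases in order and return any warnings."""
    warnings: List[str] = []
    if sa_annotations.get(MANAGED) == "true" and sa_annotations.get(PROFILE):
        profile_id = sa_annotations[PROFILE]
        try:
            delete_profile(profile_id)  # IBM Cloud side goes first
        except Exception as exc:  # best-effort: never block cluster cleanup
            warnings.append(
                f"could not delete trusted profile {profile_id} ({exc}); "
                "delete it manually in the IBM Cloud console")
    delete_cluster_objects()
    return warnings


order: List[str] = []
result = uninstall({MANAGED: "true", PROFILE: "Profile-example"},
                   lambda pid: order.append(f"profile:{pid}"),
                   lambda: order.append("cluster"))
print(order, result)
```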

`roksbnkctl ops show`

Reports current state without making any changes:

$ roksbnkctl ops show
namespace:    roksbnkctl-ops
pod:          roksbnkctl-ops
phase:        Running
ready:        true
image:        ghcr.io/jgruberf5/roksbnkctl-tools-ibmcloud:v0.9.0
rbac subject: system:serviceaccount:roksbnkctl-ops:roksbnkctl-ops
trusted-profile: Profile-ccba11f2-3b1f-4b1a-b8a4-aeed2b7b3320
secret:       roksbnkctl-ibm-creds (rotated 2026-05-10T11:03:17Z)

What each line surfaces:

Pod phase + readiness — Running + true is green; anything else means the pod is unhealthy and Backend.Run calls will fail. The container count is exactly one (tools); ready: true is a single bool, not a 2/2-style ratio.
Image: the :v… tag matches the roksbnkctl release the image was published with (resolved at install time from the binary’s version; see Chapter 17 §:dev tag resolution). If the tag doesn’t match your roksbnkctl --version, re-running ops install pulls the matching image.
RBAC subject — the SA the pod runs as. kubectl describe clusterrole roksbnkctl-ops prints the full ruleset (the ClusterRoleBinding is named the same as the role).
Trusted-profile line — reads the SA’s iam.cloud.ibm.com/trusted-profile annotation. The value is the IBM IAM Profile ID (Profile-<uuid>, the canonical IAM identifier — runOpsInstall annotates with tp.ID rather than the friendly roksbnkctl-ops-<workspace> name so the value is grep-friendly against IBM Cloud IAM audit logs; the parenthetical form on the ✓ Provisioned IAM trusted profile … install line cross-references both). Present + non-empty means ops install ran with --trusted-profile=auto or =on and the provisioning succeeded; the runtime cred flow is going via the projected SA token. Under --trusted-profile=off (or auto-fallback when IAM perms were missing) the line reads trusted-profile: (none — static-key Secret path).
Secret line — the cred Secret’s name + the roksbnkctl.io/rotated-at annotation that ops install stamps each time the Secret is applied. Always emitted (the Secret manifest is always rendered — empty data under trusted-profile success, populated under the static-key paths). If the Secret resource is missing entirely on the cluster, the line reads secret: (missing: …).

The current output is a fixed eight-line key/value block; a structured --output json mode is on the v1.x roadmap once ops show grows additional fields (image-id hash, env-hash reconciliation against the live pod, etc.). See docs/PLAN.md §“What’s deliberately deferred to post-v1.0”.
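Until a structured output mode exists, scripts that need individual fields can split the key/value block themselves. The parser below is a minimal sketch that assumes the layout shown in the sample above stays stable; the function names are hypothetical, and the health rule simply restates the “Running + true is green” criterion.

```python
def parse_ops_show(text: str) -> dict:
    """Split `ops show` output into a dict, cutting each line at the first colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep or not key.strip():
            continue  # skip blank or malformed lines
        fields[key.strip()] = value.strip()
    return fields


def is_healthy(fields: dict) -> bool:
    return fields.get("phase") == "Running" and fields.get("ready") == "true"


SAMPLE = """\
namespace:    roksbnkctl-ops
pod:          roksbnkctl-ops
phase:        Running
ready:        true
image:        ghcr.io/jgruberf5/roksbnkctl-tools-ibmcloud:v0.9.0
rbac subject: system:serviceaccount:roksbnkctl-ops:roksbnkctl-ops
trusted-profile: Profile-ccba11f2-3b1f-4b1a-b8a4-aeed2b7b3320
secret:       roksbnkctl-ibm-creds (rotated 2026-05-10T11:03:17Z)
"""

parsed = parse_ops_show(SAMPLE)
print(parsed["image"], is_healthy(parsed))
```

Cutting at the first colon matters because several values (the image reference, the RBAC subject, the rotation timestamp) contain colons of their own.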

`roksbnkctl ops uninstall`

Full removal. Run when decommissioning the cluster, or when you want a clean re-install. The command is a destructive-action gate: by default it prints a preview of what would be deleted and exits successfully; the actual deletion only runs with --confirm.

$ roksbnkctl ops uninstall
Would delete (re-run with --confirm to proceed):
  - Pod        roksbnkctl-ops/roksbnkctl-ops
  - Secret     roksbnkctl-ops/roksbnkctl-ibm-creds
  - ServiceAccount roksbnkctl-ops/roksbnkctl-ops
  - ClusterRole/ClusterRoleBinding roksbnkctl-ops
  - Namespace  roksbnkctl-ops
  - Namespace  roksbnkctl-test

$ roksbnkctl ops uninstall --confirm
✓ deleted pod roksbnkctl-ops
✓ deleted secret roksbnkctl-ibm-creds
✓ deleted serviceaccount roksbnkctl-ops
✓ deleted clusterrolebinding roksbnkctl-ops
✓ deleted clusterrole roksbnkctl-ops
✓ deleted namespace roksbnkctl-ops
✓ deleted namespace roksbnkctl-test

Note that the cluster-scoped objects (ClusterRole, ClusterRoleBinding) get cleaned too — they’re not garbage-collected by namespace deletion since they live above the namespace. roksbnkctl ops uninstall --confirm makes this explicit so a stale roksbnkctl-ops ClusterRole can’t outlive a namespace removed via kubectl delete ns.

Both managed namespaces (roksbnkctl-ops and roksbnkctl-test) are deleted. The roksbnkctl-test namespace is where one-shot Job pods (iperf3 client, future probes) land — by the time you’re running uninstall you’ve already concluded those test workloads are finished, so removing the namespace alongside the ops-pod surface keeps the cluster clean.

When to run uninstall:

Cluster decommission — the cluster is going away, clean up cluster-scoped objects before destroying it.
Cred rotation when paranoid — the rotation story (next section) doesn’t require uninstall, but if you’re worried about old secrets persisting in etcd snapshots, an uninstall + re-install regenerates the Secret cleanly.
Image upgrade with a major manifest change — if the embedded k8s_install.yaml evolves (new RBAC rule, security-context tweak), uninstall + install is the cleanest way to apply.

RBAC: the ClusterRole rules

The full ClusterRole rule set (transcribed from internal/exec/k8s_install.yaml):

API group	Resources	Verbs	Why
`batch`	`jobs`	get, list, watch, create, delete	One-shot Job lifecycle (iperf3 client, future probes). The backend creates the Job, watches it, reads logs, deletes it.
`""` (core)	`pods`	get, list, watch	The backend lists/watches pods to find Job-spawned pods + to wait for the ops pod’s Ready state. No `create`/`delete` — pods are owned by their Jobs (or by `ops install`’s user-side privilege), not by the pod’s SA.
`""` (core)	`pods/log`	get, list	Log streaming from one-shot Job pods (the bytes the wrapped tool wrote to stdout/stderr).
`""` (core)	`pods/exec`	create, get	`kubectl exec` is a `create` against the `pods/exec` subresource — the SPDY-channel verb the long-lived ops-pod path uses.
`""` (core)	`secrets` (named `roksbnkctl-ibm-creds`)	get	The pod reads the cred Secret directly only if a future workflow opts to (kubelet’s projection of `envFrom: secretRef` runs as kubelet, not as this SA). The `resourceNames` filter keeps the SA from reading any other Secret in the namespace — least-privilege per PRD 04 §“In-cluster pod”.

Notably not granted:

pods create / delete — the SA can list and watch pods but can’t create or delete them. Pod lifecycle is mediated by Jobs (which the SA does manage) and by the ops pod itself (which ops install creates with the user’s privilege, not the SA’s).
secrets create / update / delete / list — the pod never writes Secrets, and can’t even list to discover which Secrets exist in the namespace. The install-time Secret creation is done by the user invoking ops install (whose kubeconfig has cluster-admin or comparable), not by the pod’s SA. Combined with the resourceNames filter on get, this is the tightest practical surface that still lets the pod consume its own cred.
services, deployments, namespaces — the SA can’t touch these at all. The iperf3 server fixture (when the throughput test runs) is provisioned by roksbnkctl test throughput running on the caller’s side using the user’s kubeconfig, not by anything inside the ops pod.
clusterroles, clusterrolebindings — the pod never modifies its own RBAC.
cluster-admin is explicitly avoided. The pod has exactly the verbs it needs and nothing else.

This matches PRD 04 §“Least privilege per backend” and PRD 03 §“K8s”: the ops pod is a powerful tool but its blast radius is bounded.
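The rule table above can be encoded as data and queried the way kubectl auth can-i would be. This evaluator is a deliberately simplified sketch: it handles exact group, resource, verb, and resourceNames matches only, and ignores wildcards, aggregation, and any other bindings the SA might pick up on a real cluster.

```python
from typing import Optional

# One entry per row of the table; names=None means "any object name".
RULES = [
    {"group": "batch", "resource": "jobs", "names": None,
     "verbs": {"get", "list", "watch", "create", "delete"}},
    {"group": "", "resource": "pods", "names": None,
     "verbs": {"get", "list", "watch"}},
    {"group": "", "resource": "pods/log", "names": None,
     "verbs": {"get", "list"}},
    {"group": "", "resource": "pods/exec", "names": None,
     "verbs": {"create", "get"}},
    {"group": "", "resource": "secrets", "names": {"roksbnkctl-ibm-creds"},
     "verbs": {"get"}},
]


def can_i(verb: str, group: str, resource: str,
          name: Optional[str] = None) -> bool:
    """Return True if some rule allows the verb on the resource (and name)."""
    for rule in RULES:
        if rule["group"] != group or rule["resource"] != resource:
            continue
        if verb not in rule["verbs"]:
            continue
        if rule["names"] is not None and name not in rule["names"]:
            continue  # resourceNames restricts the rule to listed objects
        return True
    return False


print(can_i("create", "", "pods/exec"), can_i("delete", "", "pods"))
```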

To audit the rules on a running cluster:

kubectl describe clusterrole roksbnkctl-ops
kubectl auth can-i --as=system:serviceaccount:roksbnkctl-ops:roksbnkctl-ops \
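  '*' '*' --all-namespaces       # should print "no"
kubectl auth can-i --list -n roksbnkctl-ops \
  --as=system:serviceaccount:roksbnkctl-ops:roksbnkctl-ops   # lists the verbs that are allowed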

Credential propagation

v1.2+ note. What follows is the static-key propagation path. As of v1.2 it’s the fallback rather than the default — --trusted-profile=auto installs assume an IBM Cloud trusted profile via the pod’s projected SA token and the static API key never lands in a Kubernetes Secret. See §“Trusted-profile flow (v1.2+)” above for that path. The hop-by-hop description below still describes what happens under --trusted-profile=off (and under the auto-fallback when IAM perms don’t allow the trusted-profile path).

The IBMCLOUD_API_KEY reaches the wrapped tool in three hops:

resolver chain (env → keychain → config-b64 → prompt)
       ↓                        on the laptop, at `roksbnkctl ops install` time
  Kubernetes Secret roksbnkctl-ibm-creds                in roksbnkctl-ops namespace
       ↓                        applied by `ops install` via kubectl-equivalent
  Pod env (IBMCLOUD_API_KEY=…)                          via `envFrom: secretRef`
       ↓                        kubelet reads Secret, sets env on container start
  Wrapped tool (`ibmcloud iam oauth-tokens`)            reads from os.Getenv

Three properties this gives you:

The key never appears in argv. kubectl describe pod roksbnkctl-ops shows envFrom: secretRef: name: roksbnkctl-ibm-creds, not the value. kubectl get pod roksbnkctl-ops -o yaml shows the same.
The key never appears in the pod’s own logs. The wrapped tool uses the env var; the env var name (not value) is what the pod’s startup logs print.
The redactor is the defense-in-depth backstop. If the wrapped tool ever prints the value (e.g., ibmcloud --debug), the SPDY stream from the pod is wrapped through internal/exec/redact.go before reaching the caller’s stdout — same as the local + docker backends.
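As an illustration of the backstop idea in the third property, the sketch below masks known secret values in a piece of output. It is not the code in internal/exec/redact.go: it works on a complete string, whereas a streaming redactor also has to handle a secret that straddles two chunks, and the mask text is an assumption.

```python
from typing import List


def redact(text: str, secrets: List[str], mask: str = "[REDACTED]") -> str:
    """Replace every occurrence of each known secret value with a mask."""
    # Longest first, so a secret that contains a shorter one is masked whole.
    for secret in sorted((s for s in secrets if s), key=len, reverse=True):
        text = text.replace(secret, mask)
    return text


line = "POST /identity/token apikey=s3cr3t-value HTTP/1.1"
print(redact(line, ["s3cr3t-value"]))
```

Empty strings are skipped on purpose: under the trusted-profile path the key is the empty string, and replacing "" would insert the mask between every character.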

The Secret carries two keys today — IBMCLOUD_API_KEY and the legacy IC_API_KEY alias older ibmcloud versions accept — both populated from the same resolved value. Names are stable; embedded in internal/exec/k8s_install.yaml. Future cluster-side credentials (an AWS access key, a GCP service-account JSON) will add new keys to the same Secret rather than spinning up new Secrets, simplifying RBAC.

Rotation: rotating the API key

v1.2+ note. Under --trusted-profile=auto / =on (default), there’s nothing to rotate — the ops pod’s IAM tokens are short-lived and the IBM IAM endpoint refreshes them transparently each time the SDK trades the projected SA token. Key rotation only matters when the install ran with --trusted-profile=off or fell back to the static-key Secret because the resolved key lacked iam-identity perms. The procedure below covers that static-key case.

When the IBM Cloud API key changes (key rotation, account takeover, key compromise), you need to update the cluster-side Secret. The flow:

# 1. Update the local resolver chain — pick whichever source you populated
#    initially (the chain order is: env > keychain > config-b64; see chapter 14):
export IBMCLOUD_API_KEY=<new-key>             # env (one-shot)
# or update the keychain entry directly: `keyring` / `secret-tool` / Keychain.app
# or edit ~/.roksbnkctl/<workspace>/config.yaml's api_key_b64 field

# 2. Re-run ops install — this re-resolves the key, updates the cluster
#    Secret, and rolls the pod
roksbnkctl ops install

What ops install does on re-run:

The Secret roksbnkctl-ibm-creds is updated with the new value via a client-side Get + Update (the existing Secret’s data is overwritten, the roksbnkctl.io/rotated-at annotation is refreshed, the rest of the metadata is left alone).
The pod’s env, however, is set at container-start time — kubelet reads the Secret value when the pod is created, not on every Secret update. So an updated Secret doesn’t propagate to the running pod’s env until the pod is recreated.
ops install therefore deletes and recreates the bare ops pod after the Secret update. New pod → kubelet reads the updated Secret → env contains the new value. (Re-creation takes a few seconds when the image is already cached on the node, and up to ~30 seconds on a cold cluster.)
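The reason for the recreate step is that container env is a snapshot, not a live view. A toy model makes the sequence explicit; the class and method names are invented for illustration and stand in for the Kubernetes API and kubelet behaviour.

```python
from typing import Dict, Optional


class ToyCluster:
    """Minimal model: a Secret plus one pod whose env is copied at start."""

    def __init__(self) -> None:
        self.secret: Dict[str, str] = {}
        self.pod_env: Optional[Dict[str, str]] = None

    def apply_secret(self, data: Dict[str, str]) -> None:
        self.secret = dict(data)

    def create_pod(self) -> None:
        # envFrom: secretRef is resolved once, when the container starts.
        self.pod_env = dict(self.secret)

    def delete_pod(self) -> None:
        self.pod_env = None


cluster = ToyCluster()
cluster.apply_secret({"IBMCLOUD_API_KEY": "old-key"})
cluster.create_pod()

cluster.apply_secret({"IBMCLOUD_API_KEY": "new-key"})  # rotation, step 1
stale = cluster.pod_env["IBMCLOUD_API_KEY"]            # still the old value

cluster.delete_pod()                                   # rotation, step 2
cluster.create_pod()
fresh = cluster.pod_env["IBMCLOUD_API_KEY"]
print(stale, fresh)
```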

The ops pod is a bare Pod, not a Deployment or DaemonSet, so kubectl rollout restart won’t work on it (rollout restart only operates on controller resources). The canonical way to force a fresh pod is roksbnkctl ops install (idempotent — it’ll handle the delete-and-recreate). If you really want to do it by hand:

kubectl delete pod roksbnkctl-ops -n roksbnkctl-ops
# then re-run `roksbnkctl ops install` to create the replacement;
# the bare pod has no controller, so nothing else will recreate it.

roksbnkctl ops show will report phase: Pending briefly during recreation, then Running + ready: true once kubelet finishes projecting the updated Secret.

Operability

Things to know when something’s wrong.

Where pod logs go

roksbnkctl k logs -n roksbnkctl-ops roksbnkctl-ops
# or
kubectl logs -n roksbnkctl-ops roksbnkctl-ops

The pod’s main process is sleep infinity, so the log is mostly empty. Each kubectl exec invocation runs in its own ephemeral process — those processes’ stdout/stderr go back through the SPDY channel to the caller, not into the pod’s log. So kubectl logs is helpful for debugging pod startup (image pull failures, SCC denials, OOMKills) but not for “what did ibmcloud iam oauth-tokens actually print” — that’s just the caller’s stdout.

For a paper trail of recent invocations, capture roksbnkctl ibmcloud --backend k8s … 2>&1 | tee /tmp/ibmcloud.log on the calling side.

Debugging a stuck `ops install`

ops install waits up to 60 seconds for the pod to become Ready. If it times out:

roksbnkctl k describe -n roksbnkctl-ops pod/roksbnkctl-ops
roksbnkctl k get -n roksbnkctl-ops events --sort-by=.lastTimestamp | tail -20

Common causes:

Symptom	Cause	Fix
`ImagePullBackOff`	ghcr.io rate limit, or image tag doesn’t exist	check `roksbnkctl --version`, ensure ghcr.io is reachable from the cluster
`CreateContainerConfigError` referencing the Secret	Secret was deleted between Secret apply and Pod create (race)	re-run `roksbnkctl ops install` (idempotent)
`RunContainerError` with SCC denial	the cluster’s PodSecurity admission rejected the manifest	`kubectl get events` will name the missing field; usually means an OpenShift cluster expects the `restricted-v2` profile and a manifest field is wrong — file an issue with the event message
Pod stuck in `Pending` with no Events	cluster is at capacity / out of CPU	scale the cluster or trim resources; the pod requests `50m` CPU + `128Mi` mem, very small

Cluster API outage during `ops install`

If the kube-apiserver becomes unreachable mid-install (transient cloud-provider issue, kubeconfig expired, network partition), ops install fails fast at whichever step hit the apiserver:

✓ applied namespace roksbnkctl-ops
applying secret roksbnkctl-ops/roksbnkctl-ibm-creds       ... ERROR: Get "https://...": dial tcp: i/o timeout

The install is partial at that point — earlier steps succeeded, later steps didn’t. ops install is idempotent, so just re-run once the apiserver is back; the steps that already completed are no-ops the second time, the steps that didn’t will run.

If the apiserver is permanently gone (cluster destroyed): ops uninstall will fail the same way, since it also needs the apiserver. In that case the cluster-scoped objects (ClusterRole, ClusterRoleBinding) become orphans you can clean up manually if you ever rebuild the cluster, or ignore if you’re done with this cluster’s identity entirely.

Verifying the install end-to-end

A one-liner sanity check:

roksbnkctl ibmcloud --backend k8s iam oauth-tokens

If the SA/Secret/RBAC/Pod chain is healthy, this prints a fresh OAuth token. If it errors, the error message names which link in the chain broke (pod not found, Secret missing, exec denied, ibmcloud CLI exit non-zero).

Chapter 26 — Troubleshooting covers the broader “ops pod is unhappy” failure modes alongside other end-user troubleshooting.

Cross-references

PRD 03 — pluggable execution backends, §“K8s” — the ops-pod design rationale.
PRD 04 — credential propagation, §“In-cluster pod” — Secret-based propagation rules.
Chapter 14 — Credentials and the resolver chain — where the IBMCLOUD_API_KEY value comes from before it lands in the Secret.
Chapter 17 §“K8s backend” — the interface mechanics this chapter complements.
Chapter 18 — Choosing a backend per tool — when --backend k8s is the right call.
internal/exec/k8s_install.yaml — the embedded RBAC manifests: https://github.com/jgruberf5/roksbnkctl/blob/main/internal/exec/k8s_install.yaml
internal/cli/ops.go — the roksbnkctl ops install/show/uninstall command implementation: https://github.com/jgruberf5/roksbnkctl/blob/main/internal/cli/ops.go

Connectivity testing

roksbnkctl test connectivity answers one question: can my workspace reach the HTTP/HTTPS endpoints I care about right now?

It’s the simplest of the three test suites — no cluster fixtures, no remote vantage, no JSON parsing harness. Each configured URL gets one HTTP GET, the suite reports pass/fail, and the runner exits 0 if every probe passed.

Use it as the first sanity check after roksbnkctl up, as a CI smoke step against a known-good fixture set, or as the “is it me or is it the network” baseline before reaching for curl -v or openssl s_client.

What the connectivity suite does

For each configured URL the runner:

Adds an https:// scheme if you didn’t write one.
Issues a single GET with a 10-second timeout and the user-agent roksbnkctl/test.
Records the HTTP status code, the wall-clock duration, and (for HTTPS) the negotiated TLS version.
Marks the probe pass if the status code is in [200, 400) (any 2xx or 3xx); fail for anything else, any TLS error, any DNS error, any timeout.
Aggregates the per-URL results into a suite result; the suite passes only when every URL passed.

That’s it. No retries, no expected-body matching, no configurable status assertions, no L4 reachability — those are deliberate non-goals (see § When connectivity is the wrong tool below).
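The per-URL rules above are small enough to sketch in a few lines. The Python below is an illustrative model only, not roksbnkctl's Go implementation: the function names are hypothetical, and no network I/O is performed. It shows the scheme defaulting, the [200, 400) pass window, and the all-or-nothing suite result.

```python
def normalize_url(entry: str) -> str:
    """Add https:// when the entry has no scheme (the bare-hostname form)."""
    entry = entry.strip()
    return entry if "://" in entry else "https://" + entry


def classify(status_code=None, error=None) -> str:
    """Pass for any 2xx/3xx; fail for other statuses or any transport error."""
    if error is not None or status_code is None:
        return "fail"  # TLS error, DNS error, timeout, ...
    return "pass" if 200 <= status_code < 400 else "fail"


def suite_result(statuses) -> str:
    """The suite passes only when every probe passed."""
    return "pass" if all(s == "pass" for s in statuses) else "fail"


def exit_code(statuses) -> int:
    """0 on suite pass, 1 otherwise."""
    return 0 if suite_result(statuses) == "pass" else 1
```

Feeding it the three outcomes from a mixed run (`["pass", "pass", "fail"]`) yields a suite result of fail and an exit code of 1, which is the behaviour described above.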

Configuring `extra_hosts`

The list of URLs to probe lives in your workspace config under test.connectivity.extra_hosts:

# ~/.roksbnkctl/<workspace>/config.yaml
test:
  connectivity:
    extra_hosts:
      - https://my-bnk-cis-controller.example.com
      - https://bigip-next-admin.example.com:8443
      - https://gslb.example.com
      - my-bare-host.example.com    # scheme defaults to https://

The schema is intentionally minimal — extra_hosts is a []string of URLs (or bare hostnames; https:// is added when no scheme is present). One entry per line. The order in the file is the order the runner probes.

There’s no per-host method, no per-host expected-status, and no per-host TLS-trust override today. If you need to assert something more specific than “does HTTP work” (a particular status code, a custom header, a body match), curl is the right tool, not roksbnkctl test connectivity. A richer per-host schema is queued for v1.x; the v1.0 surface keeps the YAML simple on purpose.

Chapter 12 — Workspace config covers the full test: block; this chapter expands the connectivity slice.

What `extra_hosts` typically holds

Three classes of URL show up most often in a real workspace:

The BNK CIS controller — confirms the data-plane front-end is reachable and is returning a sane status code.
The F5 BIG-IP Next admin endpoint — confirms the management plane is reachable from your seat (often :8443 rather than :443).
The GSLB VIP that fronts the application — confirms the routed name actually serves a 2xx; pair with roksbnkctl test dns for the GSLB-aware DNS-side validation.

What doesn’t belong in extra_hosts: anything where you care about a specific TLS error, anything that needs a request body, anything where pass/fail is more nuanced than “got a 2xx or 3xx”. Those are curl jobs, not connectivity-suite jobs.

The `--insecure` flag

Self-signed certs are common in pre-production BNK deployments — the F5 BIG-IP Next admin endpoint, the CIS controller, an internal GSLB VIP that hasn’t yet been re-fronted with a public CA cert. By default Go’s TLS stack rejects them and the probe fails with x509: certificate signed by unknown authority.

Pass --insecure to skip certificate verification for the run:

roksbnkctl test connectivity --insecure

What --insecure does:

Sets tls.Config.InsecureSkipVerify = true on the HTTP client used by the connectivity suite.
Applies for the duration of one invocation only.
Affects every URL probed in that run.

What --insecure does not do:

It does not change L4 / DNS behaviour. A name that won’t resolve still fails; a host that drops TCP still fails.
It is not per-host — there’s no --insecure-only=foo.example.com. Once set, the run skips verification for everything in extra_hosts.
It is not persisted. Setting it in one invocation does not affect the next.
It is not the same as a config-level insecure_tls: true per host. The v1.0 schema doesn’t have that knob; the only way to skip cert verification today is the session-wide flag.

If you need different TLS-trust posture per endpoint (one URL strict, another lenient), run two invocations with two different extra_hosts lists in two workspaces — that’s the workaround until per-host trust lands.
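For readers more at home in Python than Go, the sketch below shows the rough equivalent of flipping InsecureSkipVerify on a client. It is an analogy under stated assumptions, not the tool's code: `build_tls_context` is a hypothetical helper, and the point is only that one context is built per run, so the setting is necessarily session-wide.

```python
import ssl


def build_tls_context(insecure: bool) -> ssl.SSLContext:
    """Build one TLS context for the whole run; `insecure` applies to every URL."""
    ctx = ssl.create_default_context()  # strict: verifies chain and hostname
    if insecure:
        # Order matters: hostname checking must be disabled before
        # certificate verification can be switched off.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


strict = build_tls_context(insecure=False)
lenient = build_tls_context(insecure=True)
```

Because the context is shared by every probe in the invocation, there is no natural place to express "strict for this host, lenient for that one", which mirrors why the v1.0 flag is all-or-nothing.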

Reading the output

Default output is human-readable on stderr; pass -o json for machine-readable on stdout.

Human-readable

$ roksbnkctl test connectivity
running connectivity ...
  PASS  https://my-bnk-cis-controller.example.com  200 OK in 142ms
  PASS  https://bigip-next-admin.example.com:8443  302 Found in 88ms
  FAIL  https://gslb.example.com                   Get "...": dial tcp: i/o timeout
connectivity FAIL (2/3 passed)
$ echo $?
1

A 3xx redirect counts as pass — the runner doesn’t follow redirects, but the redirect itself is a successful HTTP response, which is what the suite measures. If you specifically need the final 200 after a redirect chain, curl -L is the tool.

JSON

$ roksbnkctl test connectivity -o json

{
  "schema": "roksbnkctl.v1",
  "command": "test",
  "suite": "connectivity",
  "timestamp": "2026-05-10T14:32:01.123Z",
  "duration_ms": 235,
  "overall": "fail",
  "results": [
    {
      "suite": "connectivity",
      "name": "https://my-bnk-cis-controller.example.com",
      "status": "pass",
      "detail": "200 OK in 142ms",
      "duration_ms": 142,
      "extra": { "status_code": 200, "tls_version": "TLS 1.3" }
    },
    {
      "suite": "connectivity",
      "name": "https://gslb.example.com",
      "status": "fail",
      "detail": "Get \"https://gslb.example.com\": dial tcp: i/o timeout",
      "duration_ms": 10003
    }
  ]
}

Exit code follows the same rules as the human-readable form: 0 on overall: pass, 1 on overall: fail. CI runners can branch on the exit code; richer assertions (e.g., “I tolerate one fail out of five”) need to consume the JSON.
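A "tolerate N failures" policy can be sketched as a small consumer of the JSON document. This is a minimal example, assuming only the `results[].name` and `results[].status` fields shown above; the helper names and the embedded sample are illustrative, and in CI the JSON would come from the command's stdout rather than a string literal.

```python
import json

# Trimmed sample in the shape shown above (illustrative data only).
SAMPLE_REPORT = """
{
  "schema": "roksbnkctl.v1",
  "suite": "connectivity",
  "overall": "fail",
  "results": [
    {"name": "https://my-bnk-cis-controller.example.com", "status": "pass"},
    {"name": "https://gslb.example.com", "status": "fail"}
  ]
}
"""


def failed_probes(report_json: str) -> list:
    """Return the names of every probe whose status is not 'pass'."""
    report = json.loads(report_json)
    return [r["name"] for r in report["results"] if r["status"] != "pass"]


def within_failure_budget(report_json: str, max_failures: int) -> bool:
    """True when the number of failed probes does not exceed the budget."""
    return len(failed_probes(report_json)) <= max_failures
```

A CI step would exit non-zero when `within_failure_budget(...)` returns False, and print `failed_probes(...)` so the log names the offending URLs.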

Running connectivity inside `roksbnkctl test all`

Connectivity is one of the suites the bare roksbnkctl test (or roksbnkctl test all) command dispatches. The runner walks every configured suite, prints per-suite summaries on stderr, and exits non-zero if any suite failed:

$ roksbnkctl test
running connectivity ...
  PASS  https://bnk-cis.dev-tor.example.com  200 OK in 174ms
running dns ...
  PASS  bnk-cis.dev-tor.example.com  resolved 1 address(es)
connectivity PASS (1/1 passed)
dns          PASS (1/1 passed)

PASS overall (2/2 suites passed)

In -o json mode, roksbnkctl test all emits an all-shape envelope with one suites[] entry per suite. CI assertions can pin to either the suite-level overall or to a specific probe’s status:

roksbnkctl test all -o json | jq -e '.suites[] | select(.suite=="connectivity") | .overall == "pass"'

The bare roksbnkctl test defaults to the all suite. To run connectivity in isolation:

roksbnkctl test connectivity            # explicit suite
roksbnkctl test connectivity --insecure # session-wide TLS skip

Exit codes and CI integration

exit 0  →  every probe passed (every URL returned 2xx or 3xx)
exit 1  →  any probe failed (non-2xx/3xx, TLS error, DNS error, timeout)

There’s no third “infra error” exit code from the connectivity suite specifically: the suite is straight Go HTTP, with no external tooling and no backend dispatch. If roksbnkctl test connectivity exits non-zero, the cause is a failed probe against one of your configured URLs (a non-2xx/3xx status, a TLS or DNS error, or a timeout).

For a CI step that tolerates a known-flaky endpoint while still failing on the others, consume the JSON instead of relying on the exit code:

roksbnkctl test connectivity -o json \
  | jq -e '[.results[] | select(.name | test("flaky-staging.example") | not) | .status] | all(. == "pass")'

Worked example: probing a BNK deployment

A typical post-up config for a BNK trial — cover the data-plane VIP, the admin endpoint, and the GSLB front:

# ~/.roksbnkctl/dev-tor/config.yaml
test:
  connectivity:
    extra_hosts:
      - https://bnk-cis.dev-tor.bnkfun.example.com         # BNK CIS controller (data plane)
      - https://bigip-next-admin.dev-tor.bnkfun.example.com:8443  # F5 BIG-IP Next admin
      - https://gslb-vip.dev-tor.bnkfun.example.com        # the GSLB front

Then:

$ roksbnkctl test connectivity --insecure
running connectivity ...
  PASS  https://bnk-cis.dev-tor.bnkfun.example.com               200 OK in 174ms
  PASS  https://bigip-next-admin.dev-tor.bnkfun.example.com:8443 302 Found in 91ms
  PASS  https://gslb-vip.dev-tor.bnkfun.example.com              200 OK in 211ms
connectivity PASS (3/3 passed)

--insecure is needed here because BNK’s admin endpoint and the GSLB VIP are fronted by a self-signed cert in dev. Once the trial moves to a production cert chain, drop the flag — the strict path is what you want for staging and prod.

When connectivity is the wrong tool

roksbnkctl test connectivity is “does HTTP work”. For anything finer-grained, reach for the right tool:

Scenario	Use this instead
You want to see the full TLS handshake, the cert chain, the SNI resolution, the negotiated cipher	`openssl s_client -connect host:port -servername host`
You want headers, redirect-following, body matching, a specific status assertion	`curl -v -L --fail-with-body <url>`
You want to confirm L4 reachability on a specific port, no HTTP layer	`nc -vz host port` (or `bash -c 'echo > /dev/tcp/host/port'`)
You want to confirm DNS resolution from a specific resolver, especially across vantages for GSLB	`roksbnkctl test dns`
You want to see what answer a name returns from inside the cluster vs from your laptop	`roksbnkctl test dns --gslb-compare`
You want to measure bandwidth between two endpoints	`roksbnkctl test throughput`

The connectivity suite is intentionally a thin probe. When the answer to “is it broken” is “yes” and you need to know why, the suite has done its job — it’s flagged the URL — and the next step is one of the tools above.

Cross-references

Chapter 12 — Workspace config — full test: block schema, including connectivity.extra_hosts.
Chapter 21 — DNS testing for GSLB — when “the URL fails” actually means “the name doesn’t resolve from this vantage”.
Chapter 22 — Throughput testing — the bandwidth-measurement companion suite.
Chapter 26 — Troubleshooting — common patterns for diagnosing connectivity failures across BNK / ROKS deployments.

DNS testing for GSLB

roksbnkctl test dns is the diagnostic surface for DNS-driven traffic management — the kind of behaviour an F5 BIG-IP Next GSLB deployment depends on, where the answer a name returns isn’t a single global truth but a function of who’s asking from where.

This is the longest chapter in the testing section because the question it answers is the most subtle. Connectivity testing tells you “the URL works”; throughput testing tells you “the path is fast”. Both assume the name resolved. When the GSLB is the thing under test, “the name resolved” is itself the question — and the answer changes depending on the network vantage of whoever’s asking.

The flag surface, the JSON output, and the multi-vantage workflow on this page are all what v1.0 ships. The design rationale lives in PRD 03 §“DNS probe (GSLB-aware)”; read that for the why, this chapter for the how.

Three vantages, one comparison

--gslb-compare is the flagship workflow: a single roksbnkctl test dns invocation fans out across local, k8s, and (optionally) ssh:<target> vantages in parallel, asks each one to resolve the same name, and reports whether the answers diverged.

graph TB
    subgraph runner[roksbnkctl test dns --gslb-compare]
        cmd[fan-out parallel<br/>per-vantage probe]
        cmp[divergence detector<br/>+ JSON aggregate]
    end
    subgraph vantage_local[local vantage]
        L[laptop resolver<br/>public DNS / GSLB outside-the-cluster answer]
    end
    subgraph vantage_k8s[k8s vantage]
        K[ops-pod resolver<br/>cluster CoreDNS / cluster-routed GSLB answer]
    end
    subgraph vantage_ssh[ssh:jumphost vantage]
        S[remote-host resolver<br/>third network path's GSLB answer]
    end
    GSLB[F5 BIG-IP Next GSLB<br/>per-resolver dispatch rule]

    cmd --> L
    cmd --> K
    cmd --> S
    L --> GSLB
    K --> GSLB
    S --> GSLB
    GSLB -. region A IP .-> L
    GSLB -. region B IP .-> K
    GSLB -. region C IP .-> S
    L --> cmp
    K --> cmp
    S --> cmp
    cmp -->|gslb_divergence: true/false| out[stdout JSON<br/>roksbnkctl.dns.v1]

The point of the diagram: a single roksbnkctl invocation probes from network positions a dig from your laptop can’t reach. The cluster vantage answers from the cluster’s egress IP; the SSH vantage answers from a third network path. Comparing the three is exactly what the assertion “is the GSLB rule taking effect” needs.

The design rationale for this shape lives in PRD 03 §“DNS probe (GSLB-aware)”. The rest of this chapter is the user-facing surface.

The GSLB problem

F5 BIG-IP Next’s GSLB (Global Server Load Balancing) returns different DNS answers depending on the requesting resolver’s IP address. The discrimination is a feature, not a bug — that’s the whole point of GSLB:

Geographic affinity: a user in the US gets a US datacenter IP; a user in the EU gets an EU datacenter IP. The dispatch rule is per-region.
Datacenter routing: a request from a known partner CIDR gets a private VIP; a request from the public internet gets a public VIP.
Health-check state: when the primary pool member goes unhealthy, GSLB starts handing out the secondary pool member’s IP, but a client only sees the change once its resolver’s cached TTL on the prior answer expires.
Anycast vs unicast: a name fronted by an anycast resolver fleet may return the same answer everywhere; the same name fronted by GSLB returns a per-region answer.

Validating GSLB means validating that the right answer comes back from the right vantage. From your laptop in Toronto, you should see the US datacenter IP. From a workload running in the EU, you should see the EU datacenter IP. From an east-Asia bastion, you should see whatever your GSLB rule says east-Asia gets.

The standard dig www.example.com from your laptop only ever tells you what your laptop’s resolver, talking to its configured upstream, gets back. That’s one vantage. To validate GSLB you need several, and you need to be able to compare the answers.
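To make the vantage dependence concrete, here is a toy model of a per-resolver dispatch rule. Everything in it is an assumption for illustration: the CIDRs are documentation ranges standing in for "US" and "EU" resolver pools, the answers are the example IPs this chapter uses, and real GSLB policy is configured on the BIG-IP Next side, not in Python.

```python
import ipaddress

# Hypothetical dispatch table: resolver source CIDR -> answer handed out.
RULES = [
    (ipaddress.ip_network("198.51.100.0/24"), "169.45.91.10"),  # stand-in for US resolvers
    (ipaddress.ip_network("203.0.113.0/24"), "52.123.45.67"),   # stand-in for EU resolvers
]
DEFAULT_ANSWER = "169.45.91.10"  # assumed fallback when no rule matches


def gslb_answer(resolver_ip: str) -> str:
    """Return the answer this toy GSLB gives to a query from resolver_ip."""
    addr = ipaddress.ip_address(resolver_ip)
    for network, answer in RULES:
        if addr in network:
            return answer
    return DEFAULT_ANSWER
```

The same name yields different answers for `gslb_answer("198.51.100.7")` and `gslb_answer("203.0.113.9")`. A single dig only ever exercises one row of the table, which is why validation needs one probe per vantage.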

Why per-vantage probing matters

A concrete worked example. Suppose your GSLB rule says:

“Users in the US get the IP for dc1.example.com (169.45.91.10). Users in the EU get the IP for dc2.example.com (52.123.45.67).”

You run from your laptop in the US:

$ dig +short www.example.com
169.45.91.10

That confirms the US rule. But it tells you nothing about the EU rule. To verify the EU rule you’d have to actually be in the EU — or, more practically, run the query from a network vantage that the GSLB will treat as EU.

roksbnkctl test dns makes that vantage selection a flag:

# From the laptop (your home / office / coffee-shop IP)
roksbnkctl test dns --target www.example.com --type A --backend local

# From inside the cluster (the cluster's egress IP — often a different region)
roksbnkctl test dns --target www.example.com --type A --backend k8s

# From a registered SSH target in the EU
roksbnkctl test dns --target www.example.com --type A --backend ssh:eu-bastion

The --gslb-compare flag fans out across the configured backends in parallel and emits a single JSON report that calls out whether the answers diverged across vantages — exactly the assertion you need for “is the GSLB rule taking effect”.

The `roksbnkctl test dns` flag surface

roksbnkctl test dns \
  [--target <name>] \
  [--type <record-type>] \
  [--server <server-spec>] \
  [--iterations <N>] \
  [--backend <local|k8s|ssh:<target>>] \
  [--gslb-compare] \
  [-o json]

Flag	Default	Notes
`--target`	the workspace’s `test.dns.default_target` if set, otherwise required	The DNS name to query. FQDN preferred (the trailing dot is added if missing).
`--type`	`A`	Any record type the underlying `miekg/dns` library accepts via `dns.StringToType`. Common picks: `A`, `AAAA`, `CNAME`, `MX`, `NS`, `TXT`, `SRV`, `SOA`, `PTR`, `CAA`, `DS`, `DNSKEY`, `ANY`. The full table also includes `HTTPS`, `SVCB`, `TLSA`, `SSHFP`, `URI`, `NAPTR`, `RRSIG`, `NSEC`/`NSEC3`, `LOC`, etc.
`--server`	`system`	Where to send the query. Literal IP, `host:port`, the keyword `system` (use the host’s `/etc/resolv.conf`), the keyword `cluster` (use the cluster’s CoreDNS — `--backend k8s` only), or a name from the workspace’s `test.dns.resolvers` map.
`--iterations`	`1`	How many queries to send to the same server. The runner reports per-query RTT plus p50/p95/p99 across the run.
`--backend`	per-tool default (see § Backend selection)	`local`, `k8s`, or `ssh:<target>`. Docker is rejected — see § Why `--backend docker` is rejected.
`--gslb-compare`	off	Fan out across all configured vantages and emit a comparison JSON. See § The `--gslb-compare` workflow.
`-o json`	text	Switch from human-readable text on stderr to JSON on stdout. Two schemas: `roksbnkctl.dns.v1.vantage` for single-vantage runs (a flat document), `roksbnkctl.dns.v1` for `--gslb-compare` (wraps one or more vantages plus a `gslb_divergence` boolean).

The probe library is github.com/miekg/dns — the same DNS implementation CoreDNS uses. Replacing the standard library’s net.Resolver got us three things we couldn’t get otherwise:

Full record-type surface: net.Resolver only exposes a fixed subset; GSLB validation often needs CAA (cert provisioning), DS/DNSKEY (DNSSEC), or SOA (authority chain) which net.Resolver doesn’t.
Per-query server selection: standard library hides the upstream resolver behind whatever /etc/resolv.conf says; we need to be able to point at a specific GSLB VIP for the query.
Per-query RTT measurement: miekg/dns’s Client.Exchange() returns the round-trip time as a time.Duration alongside the response. No timing-overhead fudging from running queries serially through the resolver chain.

When --backend k8s is selected, the probe self-execs as a one-shot Job in the roksbnkctl-test namespace — no separate image. roksbnkctl is its own probe runner; the Job’s pod runs roksbnkctl test dns ... with --backend local from the cluster’s network vantage. See Chapter 17 §“K8s backend” for the Job mechanics.

Server resolution

--server accepts five forms:

Literal IP or `host:port`

roksbnkctl test dns --target www.example.com --type A --server 8.8.8.8
roksbnkctl test dns --target www.example.com --type A --server 8.8.8.8:53
roksbnkctl test dns --target www.example.com --type A --server gslb-vip.example.com:53

Bare hosts default to port 53. IPv6 literals must be bracketed: [2001:4860:4860::8888]:53.

`system`

roksbnkctl test dns --target www.example.com --type A --server system

Reads /etc/resolv.conf from the host running the probe: for --backend local that’s your laptop; for --backend k8s that’s the Pod’s /etc/resolv.conf, which points at the cluster’s CoreDNS; for --backend ssh:<target> that’s the target’s /etc/resolv.conf. This is the default if --server is omitted entirely.

`cluster`

roksbnkctl test dns --target www.example.com --type A --server cluster --backend k8s

Identical to system when running with --backend k8s (CoreDNS is what the Pod’s /etc/resolv.conf points at). Allowed only with --backend k8s — using --server cluster from a local or ssh vantage errors at parse time, since “cluster CoreDNS” isn’t a meaningful concept from outside.

Named resolver from workspace config

roksbnkctl test dns --target www.example.com --type A --server gslb-vip

Looks up gslb-vip in test.dns.resolvers (see next section). Useful for checking the same name against several different upstream resolvers without remembering each IP.

Workspace config: `test.dns`

Two keys live under the existing test: block:

# ~/.roksbnkctl/<workspace>/config.yaml
test:
  dns:
    default_target: www.example.com
    resolvers:
      google:     "8.8.8.8:53"
      cloudflare: "1.1.1.1:53"
      gslb-vip:   "169.45.91.5:53"

Field	Type	Default	Notes
`dns.default_target`	string	empty	The name `roksbnkctl test dns` queries when `--target` isn’t passed. Lets you keep the per-workspace canonical name out of every CLI invocation.
`dns.resolvers`	map[string]string	empty	Named resolvers usable as `--server <name>`. Values are `<host>:<port>` (port required; mirrors the `--server` `host:port` form).

Both are optional. With neither set, --target is required on every invocation and --server only accepts literal IPs / system / cluster.
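The --server forms and the test.dns.resolvers map combine into a small resolution routine. The sketch below is a hedged model of those rules, not the tool's parser: the function names are hypothetical, and the precedence shown (keywords first, then named resolvers, then literals) is an assumption, since the text above does not specify an order.

```python
def with_default_port(spec: str, port: int = 53) -> str:
    """Append :53 to bare hosts; require brackets around IPv6 literals."""
    if spec.startswith("["):
        if "]" not in spec:
            raise ValueError(f"unterminated IPv6 bracket: {spec!r}")
        # "[addr]:port" is kept as-is; "[addr]" gets the default port.
        return spec if "]:" in spec else f"{spec}:{port}"
    if spec.count(":") > 1:
        raise ValueError(f"IPv6 literal must be bracketed: {spec!r}")
    return spec if ":" in spec else f"{spec}:{port}"


def resolve_server(spec: str, backend: str, resolvers: dict) -> str:
    """Map a --server value to 'system' or a concrete host:port string."""
    if spec == "system":
        return "system"  # caller reads /etc/resolv.conf at the vantage
    if spec == "cluster":
        if backend != "k8s":
            raise ValueError("--server cluster is only valid with --backend k8s")
        return "system"  # identical to system inside the cluster
    if spec in resolvers:  # named resolver from test.dns.resolvers
        return resolvers[spec]
    return with_default_port(spec)
```

With `resolvers = {"gslb-vip": "169.45.91.5:53"}`, the call `resolve_server("gslb-vip", "local", resolvers)` returns the configured address, while `resolve_server("cluster", "local", resolvers)` raises, matching the parse-time error described earlier.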

Chapter 12 §“test:” is the full workspace-config reference; this is the GSLB-relevant subset.

Backend selection for the probe

The probe runs from one network vantage at a time per --backend:

--backend local: in-process. Runs in the roksbnkctl binary itself, so the network vantage is your laptop’s. No cluster prereq, no SSH prereq.
--backend k8s: a one-shot Job in roksbnkctl-test. The Job’s pod runs the bundled tools image ghcr.io/jgruberf5/roksbnkctl-tools-ibmcloud:<tag> — the same image the in-cluster ops pod uses, which carries both ibmcloud and roksbnkctl on PATH. (If a Job fails to pull, kubectl describe pod will name the roksbnkctl-tools-ibmcloud image — there is no separate roksbnkctl-cli image to look for.) The Job’s command is roksbnkctl test dns --target ... --type ... --server ... --backend local -o json; the stdout is collected via the k8s backend’s log-stream path. Vantage is the cluster’s egress IP.
--backend ssh:<target>: scps the roksbnkctl binary onto the target if it’s missing (or skips if it’s already there, marker-file gated), then runs the same roksbnkctl test dns ... --backend local -o json over SSH. The vantage is the target’s IP.
--backend docker: rejected — see below.

The default backend per roksbnkctl invocation (when --backend is omitted and there’s no exec.dns.backend in workspace config) is local. To run GSLB cross-vantage you generally pass --gslb-compare, which fans out instead of picking a single vantage.

Chapter 17 §“K8s backend” has the one-shot-Job mechanics; Chapter 17 §“SSH backend” has the file-materialisation and bootstrap story. Both apply to the DNS probe verbatim.

The `--gslb-compare` workflow

Pass --gslb-compare to fan out across all configured vantages in parallel and emit a single comparison JSON:

roksbnkctl test dns \
  --target www.example.com \
  --type A \
  --server gslb-vip.example.com \
  --gslb-compare \
  -o json

What happens:

The runner enumerates configured vantages: local always; k8s when a kubeconfig is reachable on the host (the probe runs as a one-shot Job in roksbnkctl-test — the long-lived ops pod isn’t required); plus every entry in the workspace’s targets: block, each as ssh:<name>.
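Each vantage runs the probe in parallel (one probe per vantage; the run completes when the slowest vantage returns). The query (target, type, server) is identical; only the backend differs. Because the vantages run concurrently, wall time tracks the slowest vantage rather than the sum of all three: with the default 2-second per-query timeout, the query phase is bounded by roughly 2 seconds, not the ~6 seconds a serial run would take.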
Per-vantage results are collected with their full RTT distribution and answer set.
The runner compares the answer sets across vantages. If they differ, gslb_divergence is set to true in the output and a human-readable summary names the diverging vantages.
The output is a single roksbnkctl.dns.v1 JSON document wrapping one vantages[] entry per backend.

gslb_divergence: true is not a failure signal — for a healthy GSLB it’s the expected outcome. The exit code is 0 whenever every per-vantage probe succeeded (got an answer, even if the answers differ). The exit code is 1 when any per-vantage probe failed (NXDOMAIN, SERVFAIL, timeout).
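The comparison and exit-code rules can be modelled in a few lines. This is a sketch under assumptions, not the runner's code: it treats an answer set as the set of (type, rdata) pairs and ignores TTLs, which is one reasonable reading of "answer sets differ"; the function names are hypothetical.

```python
def answer_set(vantage: dict) -> frozenset:
    """Order-independent view of a vantage's answers (TTL deliberately ignored)."""
    return frozenset((a["type"], a["rdata"]) for a in vantage.get("answers", []))


def gslb_divergence(vantages: list) -> bool:
    """True when at least two vantages returned different answer sets."""
    return len({answer_set(v) for v in vantages}) > 1


def exit_code(vantages: list) -> int:
    """0 when every vantage got a usable answer, even if the answers differ."""
    ok = all(v.get("rcode") == "NOERROR" and "error" not in v for v in vantages)
    return 0 if ok else 1
```

Two vantages that both return NOERROR with different A records give `gslb_divergence(...) == True` and `exit_code(...) == 0`: divergence is reported, not punished. A single SERVFAIL anywhere flips the exit code to 1 regardless of divergence.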

JSON output schema

There are two distinct JSON shapes depending on whether --gslb-compare was passed. Both are versioned, both pin against PRD 03 §“DNS probe”, and both can be consumed by CI:

roksbnkctl.dns.v1.vantage — single-vantage probe. A flat document describing one vantage’s result.
roksbnkctl.dns.v1 — multi-vantage --gslb-compare. Wraps an array of per-vantage entries plus a gslb_divergence boolean.

Single-vantage output (`roksbnkctl.dns.v1.vantage`)

roksbnkctl test dns --target www.cloudflare.com --type A --server 8.8.8.8 \
  --iterations 10 --backend local -o json

{
  "schema": "roksbnkctl.dns.v1.vantage",
  "backend": "local",
  "server": "8.8.8.8:53",
  "iterations": 10,
  "rtt_ms": { "p50": 12.4, "p95": 18.1, "p99": 22.7 },
  "answers": [
    { "name": "www.cloudflare.com.", "type": "A", "ttl": 60, "rdata": "104.16.132.229" },
    { "name": "www.cloudflare.com.", "type": "A", "ttl": 60, "rdata": "104.16.133.229" }
  ],
  "rcode": "NOERROR",
  "authoritative": false,
  "truncated": false
}

This is the per-vantage shape — no target / type wrapper at the top level (the caller already knows what they queried), no vantages[] array, no gslb_divergence field.

Multi-vantage output (`roksbnkctl.dns.v1`, divergence detected)

roksbnkctl test dns --target www.example.com --type A --server gslb-vip.example.com \
  --gslb-compare -o json

{
  "schema": "roksbnkctl.dns.v1",
  "target": "www.example.com.",
  "type": "A",
  "vantages": [
    {
      "schema": "roksbnkctl.dns.v1.vantage",
      "backend": "local",
      "server": "169.45.91.5:53",
      "iterations": 1,
      "rtt_ms": { "p50": 14.2, "p95": 14.2, "p99": 14.2 },
      "answers": [
        { "name": "www.example.com.", "type": "A", "ttl": 30, "rdata": "169.45.91.10" }
      ],
      "rcode": "NOERROR",
      "authoritative": true,
      "truncated": false
    },
    {
      "schema": "roksbnkctl.dns.v1.vantage",
      "backend": "k8s",
      "server": "169.45.91.5:53",
      "iterations": 1,
      "rtt_ms": { "p50": 8.7, "p95": 8.7, "p99": 8.7 },
      "answers": [
        { "name": "www.example.com.", "type": "A", "ttl": 30, "rdata": "10.20.30.40" }
      ],
      "rcode": "NOERROR",
      "authoritative": true,
      "truncated": false
    }
  ],
  "gslb_divergence": true,
  "gslb_divergence_summary": "answers differ between local (169.45.91.10) and k8s (10.20.30.40) — GSLB returning location-specific records as expected"
}

The comparison document embeds the per-vantage shape unchanged inside vantages[] — each entry still carries "schema": "roksbnkctl.dns.v1.vantage" so a downstream parser can validate per-vantage entries against the same schema independent of whether they came in standalone or as part of a comparison.
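On the consuming side, a CI check usually wants more than "did the answers diverge": it wants "did each vantage get the answer the GSLB rule says it should". The sketch below assumes only the `vantages[].backend` and `vantages[].answers[].rdata` fields from the comparison document; the helper names, the expectation map, and the trimmed sample are illustrative.

```python
import json

# Trimmed comparison document in the shape shown above (illustrative data).
SAMPLE_COMPARE = """
{
  "schema": "roksbnkctl.dns.v1",
  "vantages": [
    {"backend": "local", "answers": [{"type": "A", "rdata": "169.45.91.10"}]},
    {"backend": "k8s",   "answers": [{"type": "A", "rdata": "10.20.30.40"}]}
  ],
  "gslb_divergence": true
}
"""


def rdata_by_backend(doc: dict) -> dict:
    """Map each vantage's backend name to its sorted list of rdata values."""
    return {v["backend"]: sorted(a["rdata"] for a in v["answers"])
            for v in doc["vantages"]}


def check_expectations(doc: dict, expected: dict) -> list:
    """Return human-readable mismatches; an empty list means all vantages match."""
    actual = rdata_by_backend(doc)
    problems = []
    for backend, want in expected.items():
        got = actual.get(backend)
        if got is None:
            problems.append(f"{backend}: vantage missing from report")
        elif got != sorted(want):
            problems.append(f"{backend}: expected {sorted(want)}, got {got}")
    return problems


DOC = json.loads(SAMPLE_COMPARE)
```

Calling `check_expectations(DOC, {"local": ["169.45.91.10"], "k8s": ["10.20.30.40"]})` returns an empty list for the sample; any non-empty result is a reason to fail the pipeline and print the mismatches.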

Schema field reference

Per-vantage shape (roksbnkctl.dns.v1.vantage):

Path	Type	Meaning
`schema`	string	Always `roksbnkctl.dns.v1.vantage`.
`backend`	string	`local`, `k8s`, or `ssh:<target>`.
`server`	string	The resolver address actually used (literal, system-resolvconf result, or named-resolver lookup).
`iterations`	int	How many queries went to that vantage. Mirrors `--iterations`.
`rtt_ms`	object	`{ p50, p95, p99 }` across the iterations. Single-iteration runs report the same number for all three.
`answers[]`	array	The RRs returned. `name` is the FQDN, `type` is the record type, `ttl` is from the response, `rdata` is the RR’s data (IP for A/AAAA, target for CNAME, etc.).
`rcode`	string	The DNS response code: `NOERROR`, `NXDOMAIN`, `SERVFAIL`, `REFUSED`, or `TIMEOUT` (reported when no response arrived; not a wire-level rcode).
`authoritative`	bool	Whether the AA flag was set in the response.
`truncated`	bool	Whether the TC flag was set. (The probe automatically retries truncated UDP responses over TCP; the field reflects the final response.)
`error`	string	Present only when the probe could not get a usable response. Carries the underlying Go error string.

Comparison shape (roksbnkctl.dns.v1, emitted by --gslb-compare):

Path	Type	Meaning
`schema`	string	Always `roksbnkctl.dns.v1`.
`target`	string	The queried name, normalised to FQDN (trailing dot included).
`type`	string	The record type queried.
`vantages[]`	array	One per-vantage entry (each conforms to `roksbnkctl.dns.v1.vantage`).
`gslb_divergence`	bool	True when the answer sets across `vantages[]` differ.
`gslb_divergence_summary`	string	Present only when `gslb_divergence: true`. Human-readable explanation naming the diverging vantages and answers.

Both schemas are stable at v1.0: additive changes (new optional fields) are allowed within v1, while field renames or removals would bump the suffix to .v2.

v1.0 includes the optional edns_client_subnet object on each per-vantage entry. It is emitted when the resolver echoes an EDNS Client Subnet option (RFC 7871) in its response. Most GSLB-aware authoritative servers do; vanilla recursive resolvers don’t, in which case the field is omitted from the JSON via omitempty. Sub-fields are family (1 = IPv4, 2 = IPv6), source_netmask, scope_netmask, and address. The object is useful for confirming the GSLB actually saw your client’s geographic scope rather than the resolver’s IP.
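
Because edns_client_subnet is optional and v1 may gain further optional fields, a parser should treat missing and unknown keys as normal. A minimal sketch of that tolerance, assuming address is serialised as a string and the netmasks as integers (the field reference names the sub-fields but not their JSON types); echoed_client_subnet and the sample values are hypothetical:

```python
import ipaddress

VANTAGE_SCHEMA = "roksbnkctl.dns.v1.vantage"


def echoed_client_subnet(vantage: dict):
    """Return the echoed ECS scope as an ip_network, or None when absent."""
    if vantage.get("schema") != VANTAGE_SCHEMA:
        raise ValueError(f"unexpected schema: {vantage.get('schema')!r}")
    ecs = vantage.get("edns_client_subnet")  # omitted when no ECS was echoed
    if ecs is None:
        return None
    # strict=False tolerates host bits set beyond the scope prefix.
    return ipaddress.ip_network(
        f"{ecs['address']}/{ecs['scope_netmask']}", strict=False
    )


# Illustrative entries; the address and prefix lengths are made up.
with_ecs = {
    "schema": VANTAGE_SCHEMA,
    "backend": "local",
    "edns_client_subnet": {
        "family": 1,
        "source_netmask": 24,
        "scope_netmask": 24,
        "address": "203.0.113.0",
    },
    "some_future_field": "ignored",  # additive v1 fields must not break parsing
}
without_ecs = {"schema": VANTAGE_SCHEMA, "backend": "k8s"}

print(echoed_client_subnet(with_ecs))
print(echoed_client_subnet(without_ecs))
```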

RTT measurement and `--iterations`

Per-query RTT is captured directly from miekg/dns’s Exchange(), which returns the round-trip duration without the runner having to wrap a stopwatch around the call. That keeps the timing honest — no scheduling jitter from a goroutine yield between time.Now() and the actual UDP send.

--iterations N runs the same query against the same server N times serially and reports p50/p95/p99 across the samples. Use cases:

Detecting health-check flapping: if a GSLB pool member is on the edge of its health threshold, the answer can flip back and forth across iterations. A single query catches one snapshot; ten iterations show the ratio.
Detecting anycast routing changes: when an anycast resolver changes which BGP path serves your AS, RTT can jump 30-100ms. p99 catches the worst case; p50 stays steady.
Establishing a baseline before a change: run with --iterations 30 before a GSLB rule change, again after, and compare distributions.
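
To make the p50/p95/p99 summary concrete, here is a minimal sketch using the nearest-rank percentile method. Which percentile method roksbnkctl uses is not stated here, so nearest-rank is an assumption; it does reproduce the documented property that a single-iteration run reports the same number for all three. The function names are illustrative.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarise_rtt(samples_ms):
    """Build the { p50, p95, p99 } object from raw per-query RTTs."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


# Ten iterations where one query took a slower path.
print(summarise_rtt([12.0] * 9 + [95.0]))
# A single iteration reports the same number for all three.
print(summarise_rtt([18.2]))
```

In the ten-sample case the median stays at the steady value while the upper percentiles pick up the outlier, which is the pattern the anycast use case describes.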

For --backend k8s and --backend ssh:<target>, RTT is measured inside the remote vantage. The number reflects the path from that vantage to the queried server, not the laptop-to-cluster (or laptop-to-jumphost) transit. That’s the correct measurement: when you ask “how slow is the GSLB from inside the cluster”, you mean cluster-side latency, not how long your laptop waited for the SSH-tunnelled answer.

Sample F5 BIG-IP Next GSLB scenarios

Three concrete scenarios you’ll hit when validating a real BNK deployment.

Scenario 1: Geographic affinity working as expected

You’ve configured a GSLB rule that returns dc1 (169.45.91.10) for US queries and dc2 (52.123.45.67) for EU queries. You’re in the US, and your jumphost is in eu-de (Frankfurt).

roksbnkctl test dns \
  --target www.example.com \
  --type A \
  --server gslb-vip.example.com \
  --gslb-compare \
  -o json

Expected output:

{
  "schema": "roksbnkctl.dns.v1",
  "target": "www.example.com",
  "type": "A",
  "vantages": [
    { "backend": "local",            "answers": [{ "rdata": "169.45.91.10" }], "rcode": "NOERROR" },
    { "backend": "ssh:eu-jumphost",  "answers": [{ "rdata": "52.123.45.67" }], "rcode": "NOERROR" }
  ],
  "gslb_divergence": true,
  "gslb_divergence_summary": "answers differ between local (169.45.91.10) and ssh:eu-jumphost (52.123.45.67) — GSLB returning location-specific records as expected"
}

gslb_divergence: true is the assertion you wanted. There are two ways to key CI on it:

# Option A: parse the JSON yourself
roksbnkctl test dns ... --gslb-compare -o json | jq -e '.gslb_divergence == true'

# Option B: built-in --require-divergence flag (the binary returns
# non-zero exit when --gslb-compare finds NO divergence)
roksbnkctl test dns ... --gslb-compare --require-divergence

Both forms produce a non-zero exit when gslb_divergence is false — the --require-divergence flag is the same assertion baked into the binary so CI scripts don’t need a jq dependency. Pick whichever fits your pipeline.

If gslb_divergence flips to false, something has changed — the GSLB rule was disabled, the geographic dispatch broke, or the jumphost moved out of the EU range. Surface that as a CI failure.
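
For pipelines that already run Python and prefer not to shell out to jq, the same assertion can be sketched as a small function. This is an illustration only: divergence_exit_code and its numeric codes are this sketch's own convention, not something roksbnkctl defines.

```python
import json


def divergence_exit_code(report_text: str) -> int:
    """Map a --gslb-compare JSON document to a CI exit code.

    0 = divergence present, 1 = no divergence, 2 = not a comparison document.
    The numeric codes are this sketch's own convention.
    """
    report = json.loads(report_text)
    if report.get("schema") != "roksbnkctl.dns.v1":
        return 2
    return 0 if report.get("gslb_divergence") is True else 1


diverged = '{"schema": "roksbnkctl.dns.v1", "gslb_divergence": true, "vantages": []}'
collapsed = '{"schema": "roksbnkctl.dns.v1", "gslb_divergence": false, "vantages": []}'

print(divergence_exit_code(diverged), divergence_exit_code(collapsed))
```

In a real job the function would be fed the probe's stdout, for example via sys.exit(divergence_exit_code(sys.stdin.read())).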

Scenario 2: Health-check-driven failover

A BNK pool member backing www.example.com has an active health check. You want to verify that GSLB stops returning that member’s IP when the health check fails.

Steps:

# 1. Baseline: while the member is healthy, both vantages return its IP
roksbnkctl test dns --target www.example.com --type A --server gslb-vip.example.com \
  --gslb-compare -o json > /tmp/before.json

# 2. Take the member offline (BNK admin UI / API; outside roksbnkctl's surface)
# 3. Wait for the GSLB's TTL to expire (in this example, 30 seconds)
sleep 35

# 4. Probe again
roksbnkctl test dns --target www.example.com --type A --server gslb-vip.example.com \
  --gslb-compare -o json > /tmp/after.json

# 5. Diff the answer sets
diff <(jq '.vantages[].answers' /tmp/before.json) <(jq '.vantages[].answers' /tmp/after.json)

You expect the IP in vantages[].answers[].rdata to change from the failed member’s IP to the secondary’s IP across both vantages — and to do so on roughly the same TTL boundary across vantages. If one vantage flips and the other doesn’t, the GSLB’s health-check propagation is asymmetric — useful diagnostic data.
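
The symmetric-versus-asymmetric check described above can be automated instead of read off a diff. A minimal sketch, assuming both files follow the comparison shape; failover_verdict and the sample addresses (taken from the 192.0.2.0/24 documentation range) are illustrative:

```python
def answers_by_vantage(report: dict) -> dict:
    """Map each backend name to its sorted list of rdata values."""
    return {
        v["backend"]: sorted(rr["rdata"] for rr in v.get("answers", []))
        for v in report["vantages"]
    }


def failover_verdict(before: dict, after: dict) -> str:
    """Classify how the answers changed between two comparison documents."""
    old, new = answers_by_vantage(before), answers_by_vantage(after)
    flipped = sorted(name for name in old if new.get(name) != old[name])
    if not flipped:
        return "no vantage changed: failover not observed"
    if len(flipped) == len(old):
        return "all vantages changed: failover propagated"
    return "asymmetric: only " + ", ".join(flipped) + " changed"


# Illustrative addresses from the documentation range, not real pool members.
before = {"vantages": [
    {"backend": "local", "answers": [{"rdata": "192.0.2.10"}]},
    {"backend": "k8s", "answers": [{"rdata": "192.0.2.10"}]},
]}
after = {"vantages": [
    {"backend": "local", "answers": [{"rdata": "192.0.2.20"}]},
    {"backend": "k8s", "answers": [{"rdata": "192.0.2.10"}]},
]}

print(failover_verdict(before, after))
```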

Scenario 3: Anycast vs unicast detection

You suspect a name is fronted by a public anycast resolver fleet (Cloudflare’s 1.1.1.1, Google’s 8.8.8.8) rather than a real GSLB. Anycast returns the same answer everywhere; GSLB returns per-region answers.

roksbnkctl test dns --target www.cloudflare.com --type A --server 8.8.8.8 \
  --gslb-compare -o json

{
  "schema": "roksbnkctl.dns.v1",
  "target": "www.cloudflare.com",
  "vantages": [
    { "backend": "local", "answers": [{ "rdata": "104.16.132.229" }] },
    { "backend": "k8s",   "answers": [{ "rdata": "104.16.132.229" }] }
  ],
  "gslb_divergence": false
}

gslb_divergence: false despite probing from two vantages → the answer is anycast, not GSLB-dispatched. Useful when handing off to a customer who’s claiming “the GSLB isn’t routing me right” — you can prove the name they’re querying isn’t actually under GSLB control.

Why `--backend docker` is rejected

A Docker container running locally on the user’s laptop has the same network identity as the host (default bridge networking). The container’s egress NATs out via the host’s interface, so a DNS query from inside the container reaches the upstream resolver from the host’s IP.

For GSLB validation, that means --backend docker would give an answer identical to --backend local. No new vantage. The CLI rejects the combination at parse time:

$ roksbnkctl test dns --backend docker --target www.example.com --type A
error: DNS probe doesn't benefit from --backend docker (same network identity
       as --backend local, no GSLB-relevant vantage difference). Use --backend
       local, --backend k8s, or --backend ssh:<target> instead.

This is by design and called out in PRD 03 §“DNS probe”. If you want frozen-toolchain DNS testing for CI reproducibility, roksbnkctl is already a single static binary, so pinning it to a specific version pins the probe.
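
Wrapper scripts that accept a backend from the user can apply the same rule up front and fail before invoking the CLI. The sketch below mirrors the documented accept/reject behaviour only; it is not the binary's parser, and validate_dns_backend is a hypothetical helper name.

```python
def validate_dns_backend(backend: str) -> str:
    """Accept the vantages the DNS probe supports; reject everything else."""
    if backend in ("local", "k8s"):
        return backend
    if backend.startswith("ssh:") and backend != "ssh:":
        return backend
    if backend == "docker":
        raise ValueError(
            "docker has the same network identity as local; "
            "use local, k8s, or ssh:<target>"
        )
    raise ValueError(f"unknown backend: {backend!r}")


for candidate in ("local", "ssh:eu-jumphost", "docker"):
    try:
        print(candidate, "->", validate_dns_backend(candidate))
    except ValueError as err:
        print(candidate, "-> rejected:", err)
```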

Integration with `extra_hosts`

If you’ve configured connectivity.extra_hosts for your workspace and want a quick “does each of those names resolve” check:

roksbnkctl test dns

With no --target and no --gslb-compare, the runner falls back to its legacy behaviour: it probes each host in connectivity.extra_hosts, single-vantage, single-iteration, with the host’s system resolver. This is the same shape as the connectivity suite’s reachability probe (no GSLB awareness, no per-server selection) and is the right tool for “did I typo a hostname in my config” rather than “is the GSLB doing what I configured”.

For real GSLB validation you almost always want --gslb-compare plus an explicit --target and --server. The extra_hosts fallback is the carry-over from earlier roksbnkctl releases, kept for compatibility.

Chapter 20 — Connectivity testing covers connectivity.extra_hosts in full.

Worked example: GSLB divergence troubleshooting

End-to-end Part VI scenario: a customer’s BNK GSLB rule says “users in the US get IP A, users in the EU get IP B”. They claim the rule isn’t working — EU users keep landing on the US datacenter. You have roksbnkctl configured with a local laptop (your office), the customer’s cluster (ops pod installed), and a customer-provided EU bastion as an SSH target named eu-bastion. Goal: prove or disprove the divergence, point at where to look next.

# 1. Baseline — three-vantage probe of the GSLB-fronted name
$ roksbnkctl test dns \
    --target www.example.com \
    --type A \
    --server gslb-vip.example.com \
    --gslb-compare \
    -o json | tee /tmp/gslb-baseline.json | jq .
{
  "schema": "roksbnkctl.dns.v1",
  "target": "www.example.com",
  "type": "A",
  "server": "gslb-vip.example.com",
  "gslb_divergence": false,
  "vantages": [
    {
      "backend": "local",
      "answers": [{"rdata": "192.0.2.10"}],
      "rtt_ms": {"p50": 18.2, "p95": 22.1}
    },
    {
      "backend": "k8s",
      "answers": [{"rdata": "192.0.2.10"}],
      "rtt_ms": {"p50": 6.4, "p95": 9.8}
    },
    {
      "backend": "ssh:eu-bastion",
      "answers": [{"rdata": "192.0.2.10"}],
      "rtt_ms": {"p50": 24.7, "p95": 31.4}
    }
  ]
}

# Note: `gslb_divergence: false` means ALL THREE vantages got 192.0.2.10
# back. That's the smoking gun — the customer is right. The GSLB is
# returning the same answer regardless of where the resolver is.

# 2. Confirm via `dig` from the EU bastion directly (sanity check that
# roksbnkctl isn't somehow masking the answer)
$ roksbnkctl exec --on eu-bastion -- dig +short @gslb-vip.example.com www.example.com
192.0.2.10
# Same answer. The probe isn't lying.

# 3. Check the GSLB rule's resolver-IP detection
$ roksbnkctl ibmcloud --backend ssh:eu-bastion ks cluster get \
    --cluster customer-cluster --json | jq '.serviceEndpoints'
# (Inspect the cluster's egress IP that GSLB sees for k8s vantage queries)

# 4. Repeat the probe with a known-good test name (one with a working
# anycast resolver, to rule out probe-side bugs)
$ roksbnkctl test dns \
    --target www.cloudflare.com \
    --type A \
    --server 8.8.8.8 \
    --gslb-compare
✓ local         : 104.16.124.96 (RTT 12ms)
✓ k8s           : 104.16.124.96 (RTT 6ms)
✓ ssh:eu-bastion: 104.16.124.96 (RTT 28ms)
  gslb_divergence: false  (anycast — expected)

# 5. Hand off the artefact to the GSLB owner with three pieces of evidence:
#    - /tmp/gslb-baseline.json (the divergence-false JSON)
#    - The cluster's resolver-IP-as-seen-by-GSLB from step 3
#    - Confirmation that the probe machinery itself works (step 4)

What this walkthrough lets you say with confidence: “the GSLB is returning 192.0.2.10 to all three vantages including the EU bastion. The rule isn’t dispatching. Check the rule’s resolver-IP-CIDR-match clause against the EU bastion’s egress IP <X>.” That’s a falsifiable claim a GSLB engineer can act on.

Common follow-up failure modes (covered in Chapter 26):

Probe says diverged but customer says it’s not — likely the customer’s testing point is behind a NAT that masks their network position; their resolver looks like a different region to the GSLB.
Probe says NOT diverged but customer says it IS — likely the customer is querying through a CDN-fronted resolver that anycasts the request. Re-probe with --server <gslb-vip-directly> to skip the resolver chain.
All vantages return SERVFAIL — the GSLB-VIP isn’t reachable from any of them. Check connectivity (Chapter 20) before re-running.

Cross-references

PRD 03 §“DNS probe (GSLB-aware)” — design rationale, the full schema spec, the rejected-by-design list.
PRD 05 §“Phase L-DNS” — the end-to-end test sequence that validates each row of this chapter against a live cluster.
Chapter 12 §“test:” — workspace-config schema, including test.dns.resolvers and test.dns.default_target.
Chapter 14 — Credentials and the resolver chain — how roksbnkctl itself reaches the cluster / SSH target the probe runs on.
Chapter 17 §“K8s backend” — the one-shot-Job mechanics the probe reuses for --backend k8s.
Chapter 17 §“SSH backend” — file materialisation and bootstrap for --backend ssh:<target>.
Chapter 18 §“I’m doing GSLB DNS validation” — the decision-tree row that picks backends for a GSLB scenario.
Chapter 20 — Connectivity testing — the simpler “does HTTP work” companion suite.
Chapter 22 — Throughput testing — the bandwidth-measurement companion suite.
miekg/dns upstream — the underlying DNS library, used by CoreDNS and a long list of other reference implementations.

Throughput testing

roksbnkctl test throughput measures TCP bandwidth between an iperf3 client and an iperf3 server, with at least one side running adjacent to (or inside) the cluster so the number reflects something useful — cluster fabric, the inbound path through a LoadBalancer (the iperf3 north-south mode, default), the outbound path from a jumphost, or pod-to-pod (east-west).

The heavy lifting (server pod lifecycle, OpenShift SCC compliance, in-cluster client Job, log streaming) lives in Chapter 17 §“K8s backend”. This chapter is the user-facing flag surface, the mode selection, and the output-interpretation guide.

What the suite measures

Plain TCP throughput, plus jitter and retransmits, between two endpoints both running iperf3:

The server runs in the cluster — a single bare Pod plus a Service, deployed in the roksbnkctl-test namespace. Service type is ClusterIP for east-west, LoadBalancer for north-south. See Chapter 17 §“iperf3 server side” for the manifest details.
The client runs wherever you point the backend — by default in the cluster as a one-shot Job, alternatively on your laptop or on a registered SSH target.
Output is iperf3’s native -J JSON, parsed and surfaced as roksbnkctl test throughput JSON.

The suite is appropriate for “is the cluster fabric healthy”, “is the BNK data path delivering the bandwidth I expect from outside”, and “is this jumphost the bottleneck between my office and the cluster”. It is not a precision benchmark — TCP throughput is sensitive to MTU, NIC offloads, kernel tunables, and the iperf3 server’s own resource limits, none of which the suite tries to control.

The two modes

Mode is selected by --mode. The default is north-south.

roksbnkctl test throughput --mode north-south   # default
roksbnkctl test throughput --mode east-west

`--mode north-south`

Measures the inbound path from outside the cluster to a Pod inside it. The server’s Service is a LoadBalancer, so the cluster provisions an external endpoint (an IBM Cloud LB on ROKS, an external IP / hostname on bare-metal k8s). The client connects to that endpoint.

Use cases:

“Is the BNK ingress path delivering the bandwidth I expect”
“Is my office Wi-Fi or my home connection the bottleneck”
“Is the cluster’s egress capacity what the cloud provider promised”

Combine with --backend local (run the client on your laptop) when you specifically want to measure the laptop-to-cluster path. Combine with --backend ssh:<jumphost> when you want a known-stable measurement vantage from a jumphost in a known IP block — useful when laptop Wi-Fi is suspect.

`--mode east-west`

Measures the intra-cluster fabric — Pod-to-Pod or host-to-Pod. The server’s Service is ClusterIP, reachable only from inside the cluster. The default --backend k8s runs the client adjacent to the server (a one-shot Job in the same namespace), so the number reflects the CNI’s pod-to-pod throughput.

Use cases:

“Is the cluster’s network plugin healthy”
“Are the worker nodes hitting the link rate the underlying fabric promises”
“Has the BNK CIS deployment regressed cluster-internal throughput”

East-west currently still allows --backend local: the client runs on the host and reaches the ClusterIP Service through NodePort-equivalent access, using the same kubeconfig a kubectl port-forward would use. The number in that case is a host-to-cluster hybrid rather than a true pod-to-pod measurement. True pod-to-pod east-west, with both sides scheduled to specific pods and optionally pinned to different nodes via --cross-node, is the v1.x refinement; today the in-cluster Job client gets you most of the way there.

Per-tool default backend

The default backend for iperf3 is k8s. From the per-tool defaults table in Chapter 18 §“Per-tool default backends”:

Tool	Default backend	Why
`iperf3`	`k8s`	Throughput from a laptop’s uplink isn’t the cluster’s bandwidth. Default to running adjacent to the cluster so the number reflects fabric, not Wi-Fi.

The default holds whether or not you’ve set exec.iperf3.backend in workspace config. To override per-invocation:

roksbnkctl test throughput --backend local                  # client on laptop
roksbnkctl test throughput --backend ssh:jumphost           # client on jumphost
roksbnkctl test throughput --backend k8s                    # default; explicit

--backend docker is rejected by the throughput suite. A Docker container running locally has the same network identity as the host (default bridge networking), so the client’s view of the network is identical to --backend local. The CLI errors at parse time:

$ roksbnkctl test throughput --backend docker
error: --backend docker isn't supported for iperf3 — docker shares the host
       network namespace by default and gives no network-locality benefit over
       local. Use --backend local or --backend k8s instead

Chapter 18 §“Throughput testing” is the decision-tree row that walks through the (mode, backend) matrix.

When local or ssh:<target> makes sense:

local: you’re deliberately measuring laptop-to-cluster bandwidth (north-south from your seat). Typical use is debugging “the dashboard feels slow from my desk” — you want to confirm the office uplink, not the cluster fabric, is the bottleneck.
ssh:<target>: you have a registered SSH target with a known IP (often a customer jumphost in a specific datacenter) and want a bandwidth measurement from that vantage. The SSH backend ensures iperf3 is on the target (auto-installs via apt with --bootstrap on Ubuntu; see Chapter 17 §“SSH backend”).

The bundled image and the `runAsNonRoot` constraint

The iperf3 server pod’s securityContext is set to satisfy OpenShift’s restricted-v2 SCC:

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  seccompProfile:
    type: RuntimeDefault
containers:
- name: iperf3
  securityContext:
    allowPrivilegeEscalation: false
    runAsNonRoot: true
    capabilities:
      drop: ["ALL"]

iperf3 listens on port 5201 (unprivileged) so root isn’t needed. The bundled image at ghcr.io/jgruberf5/roksbnkctl-tools-iperf3:<v> declares USER 1000 in its Dockerfile, matching the pod’s runAsUser: 1000.

Two things follow:

Stock images that run as root will fail admission. The default in workspace config is networkstatic/iperf3:latest, which runs as root. On OpenShift / on any cluster with restricted-v2 PodSecurity admission, that image will fail with forbidden: violates PodSecurity "restricted:v1.x". The fix is to switch to the bundled image:
```
# ~/.roksbnkctl/<workspace>/config.yaml
test:
  throughput:
    image: ghcr.io/jgruberf5/roksbnkctl-tools-iperf3:v0.9.0
```
The bundled image is the --backend k8s default for the client Job regardless; the workspace override only affects the server pod’s image. Keep them in sync to avoid version skew during a debug session.
A custom workspace-overridden image must respect runAsNonRoot. If you point test.throughput.image at your own iperf3 image, that image must not require root to start. iperf3 itself doesn’t need privilege; if your image does, drop the USER root line and rebuild.

Chapter 17 §“iperf3 server side” goes deeper on the SCC story — what the four securityContext fields do, why each is required, and how to debug an admission failure.
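
When hand-rolling a manifest or reviewing a custom image's pod spec, a quick pre-flight check against the securityContext fields listed in this section can catch an admission failure before the pod is submitted. The sketch below is deliberately narrow: it checks only the fields shown in this section's manifest, not the full restricted profile, and restricted_violations is a hypothetical helper operating on a pod spec already loaded into a dict.

```python
def restricted_violations(pod_spec: dict) -> list:
    """List the ways pod_spec departs from the securityContext in this section."""
    problems = []
    pod_sc = pod_spec.get("securityContext", {})
    if pod_sc.get("seccompProfile", {}).get("type") != "RuntimeDefault":
        problems.append("pod: seccompProfile.type should be RuntimeDefault")
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext", {})
        # A container-level runAsNonRoot overrides the pod-level value.
        effective_non_root = sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot"))
        if effective_non_root is not True:
            problems.append(f"{name}: runAsNonRoot should be true")
        if sc.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation should be false")
        if "ALL" not in sc.get("capabilities", {}).get("drop", []):
            problems.append(f"{name}: capabilities.drop should include ALL")
    return problems


compliant = {
    "securityContext": {
        "runAsNonRoot": True,
        "runAsUser": 1000,
        "seccompProfile": {"type": "RuntimeDefault"},
    },
    "containers": [{
        "name": "iperf3",
        "securityContext": {
            "allowPrivilegeEscalation": False,
            "runAsNonRoot": True,
            "capabilities": {"drop": ["ALL"]},
        },
    }],
}
bare = {"containers": [{"name": "iperf3"}]}

print(restricted_violations(compliant))
print(restricted_violations(bare))
```

An empty list means the spec matches the shape shown here; it does not guarantee admission on a cluster that enforces additional policy.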

OpenShift SCC failure mode

If your throughput pod fails to start with one of:

Forbidden: violates PodSecurity "restricted:v1.x"
unable to validate against any security context constraint: ... restricted-v2
runAsNonRoot is required

…then one of two things is true. Either the configured image runs as root, in which case switch to the bundled ghcr.io/jgruberf5/roksbnkctl-tools-iperf3:<v> image by setting test.throughput.image in workspace config. Or the cluster’s admission policy is stricter than restricted-v2: the manifest the k8s backend builds is written to satisfy restricted-v2 and nothing tighter, so any additional constraints the cluster enforces on the test namespace are a cluster policy question outside the suite’s control.

Chapter 17 §“iperf3 server side” is the canonical source for the manifest’s securityContext. If you’re hand-rolling an iperf3 image for the suite, it’s the spec to match.

Reading the output

Default output is human-readable on stderr; -o json switches to JSON on stdout.

Human-readable

$ roksbnkctl test throughput
→ Deploying iperf3 fixture
→ Waiting for iperf3 server pod ready
→ Waiting for LoadBalancer endpoint (can take 30–90s on IBM Cloud)
✓ iperf3 endpoint: 169.45.91.10:5201
running throughput ...
  PASS  iperf3 north-south → 169.45.91.10:5201 (k8s)  3.41 Gbps received, 0% retransmits in 30s
throughput PASS (1/1 passed)
✓ iperf3 fixture removed

JSON

iperf3’s -J output is rich (sender, receiver, per-stream stats, CPU usage). The roksbnkctl wrapper preserves the iperf3 JSON in the probe’s detail field so all of iperf3’s data survives, while the suite-level shell follows the roksbnkctl.v1 schema:

roksbnkctl test throughput -o json

{
  "schema": "roksbnkctl.v1",
  "command": "test",
  "suite": "throughput",
  "timestamp": "2026-05-10T14:32:01Z",
  "duration_ms": 31420,
  "overall": "pass",
  "results": [
    {
      "suite": "throughput",
      "name": "iperf3 north-south → 169.45.91.10:5201 (k8s)",
      "status": "pass",
      "duration_ms": 30015,
      "detail": "{ ...full iperf3 -J JSON, including sum_received, sum_sent... }"
    }
  ]
}

The fields you’ll most often want from the embedded iperf3 JSON, in order of usefulness:

Field	What it tells you
`end.sum_received.bits_per_second`	The throughput number you should report. iperf3 measures both sender and receiver and the receiver number is the right one to quote — it accounts for retransmits and any path losses.
`end.sum_sent.bits_per_second`	Sender-side throughput. If sent ≫ received, packets were dropped on the path. If sent ≈ received, the path is healthy.
`end.sum_sent.retransmits`	TCP retransmits over the run. A handful is normal; double-digit-percent of streams indicates congestion or a bad NIC.
`end.streams[].sender.jitter_ms`	Per-stream jitter. Useful for diagnosing variable-latency paths.
`end.cpu_utilization_percent.host_total`	Whether the client CPU was the bottleneck. >80% suggests the iperf3 client maxed out CPU before the network did. Raise the parallel stream count (test.throughput.streams in workspace config, which maps to the client-side iperf3 -P flag; it is not a roksbnkctl flag) to spread load, or run on a beefier client.

Example interpretation:

sum_received: 3.41 Gbps     → headline number
sum_sent:     3.42 Gbps     → very close, healthy path
retransmits:  127           → normal-low

sum_received: 1.21 Gbps     → headline number
sum_sent:     2.95 Gbps     → ≫ received; >50% of bytes lost or retransmitted
retransmits:  18743         → heavy

The second shape is what a saturated link or a flaky NIC looks like. The first is a healthy gigabit-class path.
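
The interpretation above can be scripted. A minimal sketch, assuming the detail field is a JSON-encoded string holding the iperf3 -J document (the example output shows it as a string, but that encoding is an inference); throughput_summary and the synthetic report are illustrative:

```python
import json


def throughput_summary(report: dict) -> dict:
    """Extract the headline numbers from the first probe's embedded iperf3 JSON."""
    # Assumption: detail is a JSON-encoded string, as the example output suggests.
    iperf = json.loads(report["results"][0]["detail"])
    end = iperf["end"]
    sent = end["sum_sent"]["bits_per_second"]
    received = end["sum_received"]["bits_per_second"]
    return {
        "received_gbps": round(received / 1e9, 2),
        "sent_gbps": round(sent / 1e9, 2),
        "shortfall_pct": round(100 * (sent - received) / sent, 1),
        "retransmits": end["sum_sent"]["retransmits"],
    }


# Synthetic report reproducing the second (unhealthy) shape above.
iperf_doc = {"end": {
    "sum_sent": {"bits_per_second": 2.95e9, "retransmits": 18743},
    "sum_received": {"bits_per_second": 1.21e9},
}}
report = {
    "schema": "roksbnkctl.v1",
    "results": [{"status": "pass", "detail": json.dumps(iperf_doc)}],
}

print(throughput_summary(report))
```

The shortfall_pct value is the sender-versus-receiver gap expressed as a share of what was sent; for the unhealthy sample it lands just under 60%, consistent with the “>50%” note in the interpretation.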

Tuning knobs in workspace config

# ~/.roksbnkctl/<workspace>/config.yaml
test:
  throughput:
    image: ghcr.io/jgruberf5/roksbnkctl-tools-iperf3:v0.9.0
    duration: 30        # iperf3 -t flag, seconds
    streams: 8          # iperf3 -P flag, parallel streams
    default_mode: north-south

The defaults (30s, 8 streams, north-south) are a reasonable starting point for “is the BNK data path healthy”. For deeper diagnosis:

Bump duration to 60-90s if the path is variable and you want a stable average.
Bump streams to 16 or 32 if the path’s bandwidth-delay product is high (long-haul links benefit from more parallelism).
Drop streams to 1 if you’re specifically testing single-flow throughput (e.g., reproducing a customer’s “single-stream upload feels slow” complaint).
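
The bandwidth-delay product guidance can be turned into rough arithmetic. The sketch below estimates how many parallel streams it takes to keep a path full when each stream is limited to a fixed amount of in-flight data. The 4 MiB per-stream window is a hypothetical figure (the real limit depends on kernel TCP buffer settings on both ends), and the function names are illustrative.

```python
def bdp_bytes(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return bandwidth_bps * rtt_ms // 8000  # /1000 for ms, /8 for bits -> bytes


def streams_to_fill(bandwidth_bps: int, rtt_ms: int, window_bytes: int) -> int:
    """Streams needed if each one is capped at window_bytes in flight."""
    bdp = bdp_bytes(bandwidth_bps, rtt_ms)
    return max(1, -(-bdp // window_bytes))  # ceiling division


# Hypothetical long-haul path: 3 Gbps, 80 ms RTT, 4 MiB window per stream.
print(bdp_bytes(3_000_000_000, 80))
print(streams_to_fill(3_000_000_000, 80, 4 * 1024 * 1024))
# Same link at 2 ms RTT (in-region): one stream is enough.
print(streams_to_fill(3_000_000_000, 2, 4 * 1024 * 1024))
```

Under these assumptions the long-haul case needs about 30 MB in flight, so a single capped stream cannot fill the path, while the in-region case fits comfortably in one stream.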

Chapter 12 — Workspace config lists the full schema.

Cleanup and `--keep`

By default the suite tears down the iperf3 server pod and Service after the client run completes. If a test fails and you want to poke at the fixture (kubectl exec into the server, hand-run iperf3 -c from a third location, etc.), pass --keep:

roksbnkctl test throughput --keep
# ... fixture stays up; debug to your heart's content ...
kubectl delete -n roksbnkctl-test pod/roksbnkctl-iperf3 svc/roksbnkctl-iperf3

The fixture is in the roksbnkctl-test namespace (same namespace the k8s backend uses for one-shot Jobs). It’s a bare Pod plus a Service; nothing else lingers when you delete the two resources.

Cross-references

Chapter 17 §“K8s backend” — server-side mechanics (manifest, SCC, log streaming, exit-code extraction).
Chapter 17 §“iperf3 server side” — the asymmetric server-pod-plus-client-Job shape.
Chapter 18 §“I want to measure cluster bandwidth” — the decision-tree entry that picks (mode, backend) for your scenario.
Chapter 12 §“test:” — workspace-config schema for test.throughput.*.
Chapter 20 — Connectivity testing — the simpler “does HTTP work” companion suite.
Chapter 21 — DNS testing for GSLB — the DNS validation companion.
PRD 03 §“iperf3” — the design spec.

The E2E test plan

roksbnkctl ships a layered end-to-end test suite that exercises the full surface — install, lifecycle, four execution backends, internalised kubectl, the DNS probe, the cred-leak audit, and a mixed-mode lifecycle — against a live IBM Cloud account. This chapter is the user-facing guide: what the suite is, how to run it locally, what each phase validates, what it costs, and how it’s re-run when (not if) part of it flakes.

The design rationale lives in PRD 05; read that for the why. This chapter is the how and what.

What the E2E suite is

The suite is 14 automated phases organised into two tiers, plus Phase J as a manual integrator step (kubectl internalisation requires sudo mv of the host kubectl/oc binaries; that mutation is too disruptive to automate, so PRD 05 §J leaves it as a release-checklist item):

Tier	Phases	What it covers	Driver script
Baseline	A, B, C, D, E, F, G, H	install, init, plan, up, post-apply checks, test suites, down	`scripts/e2e-test.sh`
Backends + extras	I, K, L, L-DNS, M, N	SSH backend, docker backend, k8s backend + ops pod, DNS probe with GSLB compare, cred-leak audit, mixed-mode lifecycle	`scripts/e2e-test-backends.sh`
Manual	J	kubectl internalisation (PATH-stripped, integrator-driven)	per-release checklist

A combined driver, scripts/e2e-test-full.sh, runs both automated tiers in sequence: A-H first to bring up, exercise, and tear down the baseline cluster, then I-N, which provision a fresh cluster via Phase N’s mixed-mode-lifecycle step. The two drivers stay decoupled, so each can be run standalone, at the cost of an extra cluster apply (~70 min of extra wall-time; ~5-7 h for the combined run). Cluster-sharing across the two drivers (the PRD-envisioned design) is queued for v1.x; see PRD 05 §“Test infrastructure”.

Phase coverage at a glance

Phase	Tier	Validates	PRD
A	baseline	doctor + init	—
B	baseline	plan (read-only)	—
C	baseline	targets list, registration	01
D	baseline	`up` lifecycle — provision + deploy BNK	—
E	baseline	post-apply checks (`status`, `k get`, `logs`)	02
F	baseline	`test connectivity` (HTTP probes)	—
G	baseline	`test throughput` (iperf3)	—
H	baseline	`down` — destroy + cleanup	—
I	backends	SSH backend / `--on jumphost`, host-key TOFU	01
J	manual	kubectl internalisation (PATH-stripped — requires `sudo mv`)	02
K	backends	docker backend (ibmcloud + iperf3 client)	03 § Docker
L	backends	k8s backend (ops pod + ibmcloud + iperf3)	03 § K8s
L-DNS	backends	DNS probe + GSLB cross-vantage compare	03 § DNS
M	backends	cred-leak audit (docker inspect, k8s events, ssh tempfiles)	04
N	backends	mixed-mode lifecycle (each tool on a different backend)	all of the above

Phase J is an integrator-driven manual step; the per-release checklist in docs/E2E_TEST.md covers its procedure (PATH-strip kubectl + oc, then re-run the baseline driver’s Phase E to confirm roksbnkctl k get/apply/describe/... still works against the cluster).

How to run it locally

The three driver scripts all live under scripts/:

# Baseline only (A-H) — ~90 minutes
./scripts/e2e-test.sh

# Backends + extras only (I-N + L-DNS) — requires a live cluster (run after
# Phase D of e2e-test.sh, or provision via `roksbnkctl up` in a separate
# workspace first)
./scripts/e2e-test-backends.sh

# Combined — A-H baseline, then I-N + L-DNS against a fresh cluster the
# backends driver brings up via Phase N's mixed-mode-lifecycle step,
# ~5-7 hours total (two separate cluster applies)
./scripts/e2e-test-full.sh

Pre-requisites

Pre-req	Required for	Notes
IBM Cloud account with API key	every phase	`IBMCLOUD_API_KEY` env var, or `roksbnkctl init` writes a keychain entry
`terraform` binary on PATH	phases B-H	the only strictly-required host tool for the baseline
Docker daemon	phase K	`dockerd` or `colima` or Rancher Desktop — anything that publishes a docker socket
`kind` binary	phase L on CI	the in-CI k8s backend uses a kind cluster; on a real run it uses the ROKS cluster from D
An SSH bastion or jumphost	phase I, N	provisioned automatically by phase D’s terraform when `testing_create_tgw_jumphost = true` (the default)
Adequate disk for terraform plan output	phases B-H	~200 MB for the embedded module’s state

Everything else (kubectl, oc, dig, iperf3) is internalised by the binary — phase J explicitly verifies the suite passes with kubectl and oc moved out of PATH.

Resuming a partial run

Every phase is re-runnable. The driver scripts respect a PHASE_FROM= env var:

# Restart from phase G (skipping A-F, which already ran)
PHASE_FROM=G ./scripts/e2e-test.sh

# Same for the backend driver
PHASE_FROM=L ./scripts/e2e-test-backends.sh

The phase pointer is read at startup and the script fast-forwards past every step before it. Assertion phases that hit external APIs (DNS resolvers, IBM Cloud control plane) include jitter and retry on the typical transient failure shapes — short DNS timeouts, IAM 5xx blips, etc. See PRD 05 §“Risks” for the retry policy.
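
The fast-forward behaviour is simple to model. The drivers themselves are shell scripts, so the following Python sketch only illustrates the documented semantics (start at the named phase, keep the original order); phases_to_run is a hypothetical name and the phase lists are copied from the tier table.

```python
BASELINE_PHASES = ["A", "B", "C", "D", "E", "F", "G", "H"]
BACKEND_PHASES = ["I", "K", "L", "L-DNS", "M", "N"]


def phases_to_run(phases, phase_from=None):
    """Return the phases at or after phase_from, preserving order."""
    if not phase_from:
        return list(phases)  # unset pointer means "run everything"
    if phase_from not in phases:
        raise ValueError(f"unknown phase {phase_from!r}; expected one of {phases}")
    return phases[phases.index(phase_from):]


print(phases_to_run(BASELINE_PHASES, "G"))
print(phases_to_run(BACKEND_PHASES, "L"))
print(phases_to_run(BASELINE_PHASES))
```

In a wrapper, the second argument would come from the PHASE_FROM environment variable, for example os.environ.get("PHASE_FROM").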

Run logs

Each driver writes a single combined log per run:

scripts/e2e-test.sh → /tmp/roksbnkctl-e2e/run-<timestamp>.log
scripts/e2e-test-backends.sh → /tmp/roksbnkctl-e2e-backends/run-<timestamp>.log
scripts/e2e-test-full.sh → both of the above (the combined runner re-uses each child driver’s log directory)

Logs are preserved on both success and failure; clean them up manually when disk pressure warrants. On a CI machine that runs the suite nightly, the logs are the only forensics you get — keep them for at least 7 days. Per-phase log splitting is a v1.x consideration.

Dry-run

DRY_RUN=1 short-circuits every roksbnkctl invocation to a no-op that prints the command it would have run. This is useful for re-validating the script wiring after edits without paying the cluster-apply tax (30-50 minutes in Phase D). The validator agent’s “is the test plan still well-formed” check runs in this mode.
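
The same print-instead-of-run pattern is easy to reuse in custom harnesses around roksbnkctl. A minimal sketch of such a wrapper, not the drivers' actual implementation (they are shell scripts); run and the "DRY_RUN: " prefix are this sketch's own choices:

```python
import shlex
import subprocess


def run(argv, dry_run):
    """Execute argv, or only print it when dry_run is true."""
    if dry_run:
        print("DRY_RUN: " + shlex.join(argv))
        return 0
    return subprocess.run(argv, check=False).returncode


# With dry_run=True nothing is executed, so roksbnkctl need not be installed.
status = run(["roksbnkctl", "up", "--auto"], dry_run=True)
print(status)
```

A harness would typically derive the flag from the environment, for example dry_run=os.environ.get("DRY_RUN") == "1".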

What each phase validates

Phase A — `init`

roksbnkctl init prompts for region, resource group, cluster name, and BNK version, then writes ~/.roksbnkctl/<workspace>/config.yaml. The phase asserts the file exists, contains no plaintext API key (the rejection regex in internal/config/workspace.go catches that), and that roksbnkctl doctor reports green for terraform and informational for kubectl/oc.

Phase B — `plan`

roksbnkctl plan runs terraform init (downloads providers, ~30s) and terraform plan (computes the resource diff, ~30-60s on a clean workspace). No infrastructure is provisioned. The phase asserts the plan reports ~77 resources to add (the exact count is the upstream HCL’s full set of cluster + cert-manager + flo + cne_instance + license + testing resources).

Phase C — targets and registration

roksbnkctl targets list against a fresh workspace returns empty. roksbnkctl cluster register <existing-cluster> (optional, skipped if the workspace is provisioning new infra) ties an existing ROKS cluster’s COS instance + bucket discovery into the workspace.

Phase D — `up` lifecycle

The dominant cost phase. roksbnkctl up --auto runs terraform apply against the embedded HCL — provisioning ~77 resources: VPC, subnets, transit gateway, ROKS cluster, cert-manager, FLO, CNEInstance, License, jumphost. Expect 30-50 minutes on a clean apply, 5-15 minutes longer when IBM Cloud’s control plane is slow. The phase asserts terraform exits zero and the admin kubeconfig was fetched and written to $KUBECONFIG.

Post-apply, the phase auto-registers the jumphost target (per Chapter 16 §“Auto-discovery from roksbnkctl up”) so subsequent phases can --on jumphost without manual config.

Phase E — post-apply checks

roksbnkctl status shows the deployed BNK components. roksbnkctl k get nodes lists 3 worker nodes Ready. roksbnkctl logs flo (the F5 Lifecycle Operator) prints recent log lines. The phase asserts each command exits zero and that the cluster’s BNK install (FLO + CNE Instance + License) is in a healthy state.

Phase F — `test connectivity`

roksbnkctl test connectivity walks the workspace’s test.connectivity.extra_hosts list and probes each URL. Pass criteria: every URL returned a 2xx (or the expected status, when Chapter 20’s richer assertion shape lands).

Phase G — `test throughput`

roksbnkctl test throughput deploys the iperf3 server-pod fixture, runs a 30-second client measurement, and tears the fixture down. Pass criteria: bandwidth > 100 Mbps (a conservative floor — actual numbers on IBM Cloud are typically 1-5 Gbps), retransmits < 5% of streams, fixture removed afterwards.

Phase H — `down`

roksbnkctl down --auto runs terraform destroy. Pass criteria: all ~77 resources destroyed (terraform reports Destroy complete!), no orphan IBM Cloud resources detectable via roksbnkctl ibmcloud resource search.

Phase I — SSH backend / `--on jumphost`

roksbnkctl exec --on jumphost -- whoami returns root (the jumphost is auto-provisioned with cloud-init’s root user). roksbnkctl ibmcloud --on jumphost iam oauth-tokens validates IBMCLOUD_API_KEY propagation over SSH. A negative test mutates ~/.roksbnkctl/known_hosts to a wrong fingerprint and asserts the next call exits 126 with a clear “host key mismatch” error. Phase I is the user-facing acceptance test for PRD 01.

Phase J — kubectl internalisation

The phase strips kubectl and oc out of PATH (via env-var sanitisation, not filesystem moves — see PRD 05 §“Open questions”) and verifies that roksbnkctl k get nodes, roksbnkctl k apply -f, roksbnkctl k describe, roksbnkctl k exec, roksbnkctl k port-forward, and roksbnkctl k delete all work against the cluster. A supplementary byte-equivalence step (run separately, not gated on PATH stripping) diffs kubectl get nodes -o yaml against roksbnkctl k get nodes -o yaml and asserts the diff (excluding managedFields, resourceVersion, creationTimestamp) is empty.
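One way to approximate the PATH stripping by hand is to rebuild PATH without any directory that holds an executable named kubectl or oc. This is a minimal sketch under that assumption; strip_tools_from_path is a hypothetical helper, and the suite's own sanitisation may work differently.

```shell
# strip_tools_from_path: print $PATH minus every directory that contains an
# executable kubectl or oc. Hypothetical helper for illustration only.
# The body runs in a subshell so the IFS and globbing changes do not leak.
strip_tools_from_path() (
    set -f      # no pathname expansion while splitting PATH
    IFS=:
    cleaned=
    for dir in $PATH; do
        [ -n "$dir" ] || continue        # drop empty entries
        if [ -x "$dir/kubectl" ] || [ -x "$dir/oc" ]; then
            continue
        fi
        cleaned="${cleaned:+$cleaned:}$dir"
    done
    printf '%s\n' "$cleaned"
)

# Usage (not executed here):
#   PATH=$(strip_tools_from_path) roksbnkctl k get nodes
```

Because only the environment of the one command changes, the host's kubectl and oc stay installed and the rest of the shell session is unaffected.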

Phase K — docker backend

roksbnkctl ibmcloud --backend docker iam oauth-tokens pulls the roksbnkctl-tools-ibmcloud image on first call and runs the ibmcloud CLI inside it. The phase asserts the API key is not baked into the image (via docker history inspection) and not exposed in the running container’s env (via docker inspect). Chapter 17 §“docker backend” covers the credential-passing mechanism.

Phase L — k8s backend and ops pod

roksbnkctl ops install creates the roksbnkctl-ops namespace, deploys the long-lived ops pod, projects the IBM API key as a Kubernetes Secret, and binds the pod’s ServiceAccount to a least-privilege ClusterRole. Subsequent steps run roksbnkctl ibmcloud --backend k8s iam oauth-tokens (executes inside the ops pod) and roksbnkctl test throughput --backend k8s (the iperf3 server pod and client Job both run in-cluster). RBAC assertions confirm the SA can create Jobs in roksbnkctl-test but cannot delete Pods in default — least-privilege is enforced. Chapter 19 is the ops-pod reference.

Phase L-DNS — DNS probe and GSLB compare

The DNS phase exercises the miekg/dns-backed probe:

Single-vantage A and AAAA lookups against 8.8.8.8
NXDOMAIN negative test (asserts rcode=NXDOMAIN)
Iterated probe (10 queries to the same server, RTT p50/p95/p99 reported)
K8s-backend probe (runs as a Job in roksbnkctl-test, the binary self-execs in-cluster, RTT reflects in-cluster network path)
--server cluster (uses the pod’s /etc/resolv.conf, validates CoreDNS visibility)
--gslb-compare happy path (fans out local + k8s, asserts the answer schema)
--gslb-compare divergence (target a geo-resolved name where laptop and cluster IPs hit different DCs, asserts gslb_divergence: true)
Docker rejection negative (asserts the parse-time rejection error for --backend docker --target ...)

LD9 (SSH vantage) is exercised only when a jumphost is configured; LD5-LD8 are the must-pass set.
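To make the iterated probe's p50/p95/p99 figures concrete, the sketch below computes them from a list of RTT samples with the nearest-rank method. The function name and the choice of nearest-rank are assumptions; the probe's own percentile calculation may differ.

```shell
# rtt_percentiles: read one RTT in milliseconds per line on stdin and print
# p50/p95/p99 using the nearest-rank method (rank = ceil(p/100 * N)).
rtt_percentiles() {
    sort -n | awk '
        function rank(p,   r) {
            r = int(p * NR / 100)
            if (r * 100 < p * NR) r++    # round up to the next whole rank
            if (r < 1) r = 1
            return r
        }
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            printf "p50=%s p95=%s p99=%s\n", v[rank(50)], v[rank(95)], v[rank(99)]
        }'
}

# Ten samples, as in the iterated probe.
printf '%s\n' 12 11 15 10 14 13 40 11 12 90 | rtt_percentiles
```

With only ten samples, p95 and p99 both land on the slowest query, so a single outlier dominates the tail figures.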

Phase M — cred-leak audit

Cross-cutting check that runs after I-L — confirms no credential value leaked during any prior phase. Concrete assertions:

M1. docker history <ibmcloud-tool-image>: no IBMCLOUD_API_KEY=... ENV layer
M2. docker inspect <last-container>: no API key value in env
M3. kubectl get events -n roksbnkctl-ops -o yaml: grepping for the API key value returns nothing
M4. kubectl logs <ops-pod>: grepping for the API key value returns nothing (the redactor masks any tool output that prints it)
M5. ssh jumphost ls /tmp/roksbnkctl.*: empty (the SSH backend’s trap cleans up tempfiles on exit)
M6. sshd auth.log: Accepted publickey lines present; the SetEnv var name (IBMCLOUD_API_KEY) is logged but not the value
M7. ~/.roksbnkctl/*/state/*.log host-side logs: no API key value

The audit is the single most important gate on the v1.0 release. A leak in any of M1-M7 is a stop-ship. See PRD 04 for the threat model.
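Several of the assertions above reduce to "the API key value does not appear in this output". A minimal sketch of that check, assuming the outputs have first been captured to files (for example kubectl logs <ops-pod> > ops.log); assert_no_secret is a hypothetical helper, not the suite's own code.

```shell
# assert_no_secret SECRET FILE...: return 1 if SECRET appears verbatim in any FILE.
assert_no_secret() {
    secret=$1
    shift
    if [ -z "$secret" ]; then
        echo "refusing to scan for an empty secret" >&2
        return 2
    fi
    [ "$#" -gt 0 ] || return 0
    # -F: fixed-string match. -l: print file names only, so the value is never echoed.
    # -e: safe even if the secret starts with a dash.
    leaked=$(grep -F -l -e "$secret" -- "$@" 2>/dev/null || true)
    if [ -n "$leaked" ]; then
        printf 'credential value found in:\n%s\n' "$leaked" >&2
        return 1
    fi
    return 0
}
```

Matching on the value rather than the variable name matters here: the name IBMCLOUD_API_KEY is expected to show up in some logs, and only the value counts as a leak.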

Phase N — mixed-mode lifecycle

A realistic scenario: workspace config routes each tool to its preferred backend, then a full up + test + down cycle runs end-to-end. Concretely, exec.terraform=local, exec.ibmcloud=ssh:jumphost, exec.iperf3=k8s — three different backends in one lifecycle. The phase asserts state is preserved across the per-tool dispatch (the workspace’s terraform state file is touched only by the local-backend terraform; the API key projected into the k8s ops pod is the same one resolved for the SSH dispatch) and that down cleanly destroys everything.

How CI runs it

.github/workflows/ci.yml runs unit + integration on every PR — go test ./... plus the testcontainers-go-backed integration tests. The full e2e suite is too expensive (4-6 hours, $5-10 of IBM Cloud spend per run) to gate on every PR.

A separate manual-trigger workflow runs scripts/e2e-test-full.sh on demand and on release branches. The workflow is dispatched via the GitHub Actions UI (“Run workflow”) and stamps the resulting log artefacts onto the workflow run. See .github/workflows/e2e-full.yml for the workflow YAML — the workflow accepts optional cluster_region + teardown_on_success inputs and runs automatically on every release/** branch push.

The release-cut policy is: don’t tag vX.Y.Z until the manual-trigger workflow has come back green on the release branch for three consecutive nightly runs. This catches the flakes that don’t reproduce locally, most of which are IBM Cloud control-plane blips rather than real regressions.

Cost and time

A full scripts/e2e-test-full.sh run currently costs:

Resource	Approximate cost (USD)
ROKS cluster (3 workers, ~5 hours uptime)	$3-6
1-2 LoadBalancer Service objects (for north-south throughput)	$0.50-1
COS instance + objects for the supply chain	$0.10-0.20
Egress bandwidth (throughput tests, image pulls)	$0.20-0.50
Total per run	$5-10

Per-phase time estimates:

Phase	Wall time
A (init)	<1 minute
B (plan)	1-2 minutes
C (targets / register)	<1 minute
D (up)	30-50 minutes (the dominant cost)
E (post-apply checks)	1-2 minutes
F (connectivity)	1-2 minutes
G (throughput)	1-3 minutes
H (down)	15-25 minutes
I (SSH backend)	2-3 minutes
J (kubectl internal)	3-5 minutes
K (docker backend)	3-5 minutes (first call pulls the image, +1-2 minutes)
L (k8s backend + ops pod)	3-5 minutes
L-DNS (DNS probe + GSLB compare)	2-4 minutes
M (cred audit)	<1 minute
N (mixed-mode lifecycle)	30-50 minutes (full up + down again)
Total	~4-6 hours

Phase N is the second-dominant cost — it runs a complete up/down cycle on top of D’s. Contributors who want a shorter test loop should skip N (PHASE_FROM= past it) and rely on D + I-M coverage; full N is a release-gate concern, not a per-PR concern.

Re-runnability

Every phase is re-runnable via PHASE_FROM=. The driver scripts are idempotent in two senses:

Phase ordering: the phase named by PHASE_FROM=<X> and every phase after it run unconditionally; phases before X are skipped. The script doesn’t try to remember whether earlier phases succeeded; that’s the user’s job (the run logs are the evidence).
Per-phase actions: each phase’s individual step calls are themselves idempotent where possible. roksbnkctl up on an already-applied workspace is a no-op (terraform plan reports zero changes). roksbnkctl ops install on an already-installed cluster is a no-op (the namespace + RBAC exist). The redactor + cred-resolver short-circuit cleanly on repeated invocations.
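The phase-ordering rule can be sketched as a single predicate that each phase checks before running. The phase list below follows the order used in this chapter; treating it as one sequence, and the function name should_run, are assumptions made for illustration.

```shell
# Phase order as listed in this chapter (L-DNS sits between L and M).
PHASES="A B C D E F G H I J K L L-DNS M N"

# should_run PHASE: succeed when PHASE is at or after $PHASE_FROM.
# An unset or empty PHASE_FROM means "run everything".
should_run() {
    [ -z "${PHASE_FROM:-}" ] && return 0
    reached=0
    for p in $PHASES; do
        [ "$p" = "$PHASE_FROM" ] && reached=1
        if [ "$p" = "$1" ]; then
            [ "$reached" -eq 1 ] && return 0
            return 1
        fi
    done
    return 1    # unknown phase name
}

# A driver would guard each phase like this (not executed here):
#   should_run K && run_phase_k
```

The predicate keeps no record of earlier results, which matches the rule that the script does not remember whether skipped phases succeeded.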

External-API steps include jitter+retry per PRD 05 §“Risks”: DNS resolvers occasionally return SERVFAIL on first query and succeed on the second, IBM IAM occasionally 5xxs during high-load periods, and the in-cluster ops pod can take a few seconds to be Running after ops install returns. Each of these is retried with short exponential backoff plus jitter rather than failing the phase.
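A retry loop of the kind described here might look like the following. It is a sketch, not the suite's retry policy: the function name, the doubling schedule, and the jitter range (up to 100% of the current delay) are all assumptions.

```shell
# retry MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
    max=$1
    base=$2
    shift 2
    attempt=1
    while :; do
        "$@" && return 0
        [ "$attempt" -ge "$max" ] && return 1
        # Exponential backoff: base * 2^(attempt-1), plus random jitter of up
        # to the same amount so parallel runs do not retry in lockstep.
        delay=$(awk -v b="$base" -v n="$attempt" -v seed="$$$attempt" \
            'BEGIN { srand(seed); d = b * 2 ^ (n - 1); printf "%d", d + rand() * d }')
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Usage (not executed here): up to 5 attempts, starting at a 2-second delay.
#   retry 5 2 roksbnkctl ibmcloud iam oauth-tokens
```

Capping the attempt count keeps a genuine outage from hiding behind endless retries; once the cap is hit, the phase fails and the PHASE_FROM= workflow takes over.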

The intended workflow on a flake is:

# Phase L flaked on "ops pod not yet Running"
# Re-run from L; everything before is preserved
PHASE_FROM=L ./scripts/e2e-test-backends.sh

If the same phase flakes on consecutive PHASE_FROM= runs, it’s a real bug: open an issue with the run log attached.

Cross-references

Chapter 26 — Troubleshooting — symptom → root cause → fix entries for the failure modes phase D through phase N can surface.
Chapter 17 — Execution backends — the four-backend matrix that phases I, K, L exercise.
Chapter 19 — The in-cluster ops pod — what phase L installs and what RBAC it carries.
Chapter 21 — DNS testing for GSLB — the probe behaviour phase L-DNS exercises.
Chapter 22 — Throughput testing — the iperf3 fixture phase G uses.
PRD 05 — the design spec for the suite.

Day-2 ops: status, logs, k get/apply/exec

Read this chapter once the cluster is up, BNK is deployed, and you’re living with the result. It opens with roksbnkctl status, the workspace-level read of what’s deployed, then covers the per-resource verbs: read pod state, tail logs, apply a manifest, port-forward to a pod, exec into a pod. Sprint 2 internalises all the per-resource verbs into native Go via client-go so you no longer need kubectl on PATH for the everyday workflow.

The full design rationale lives in PRD 02. This chapter is the user-facing surface — the canonical “what’s the kubectl-equivalent in roksbnkctl?” reference.

Why internalise

Three reasons, in order of weight:

Single binary. roksbnkctl is meant to be the one thing you install. After Sprint 2, the only required external prerequisite for the happy path is terraform. Everything else — kubectl, oc, iperf3, dig — is either built-in or an optional escape hatch.
No version skew. The vendored client-go matches the kube API the bundled HCL targets. You can’t accidentally use kubectl 1.20 against a 1.28 cluster and have its print column heuristics go sideways.
First-class output formats. cli-runtime gives byte-identical -o yaml/-o json/-o jsonpath output to kubectl. The validator agent’s golden-file tests in internal/k8s/golden_test.go assert this for representative resources.

`roksbnkctl status`

The one-shot read of a workspace’s posture. Always best-effort — every section reports its own missing pieces so a partial state still produces useful output rather than a hard error.

$ roksbnkctl status
Workspace:        canada-roks
Region:           ca-tor
Resource group:   default
Cluster:          canada-roks  (attach existing)
TF source:        jgruberf5/ibmcloud_terraform_bigip_next_for_kubernetes_2_3@v1.3.0
Cluster phase:    deployed (last apply 2026-05-13 14:08:33 MST)
BNK trial:        deployed (last apply 2026-05-13 14:15:01 MST)
Kubeconfig:       /home/you/.kube/config
Cluster:          2/2 nodes ready

The header rows (workspace, region, resource group, cluster identity, TF source pin, kubeconfig path, cluster reachability) are the same across every workspace shape. The per-phase deployment lines below them are shape-dependent: roksbnkctl status reads each phase’s terraform.tfstate mtime independently and emits one line per phase. The subsections below show the per-phase output for each workspace shape.

The two Cluster: lines are by design: the first (in the header block) reports cluster identity — which cluster you’re targeting and whether the workspace creates it or attaches to an existing one. The second (the trailer line) reports cluster reachability — node count and ready count from a live API call. The label is reused because both pieces of information are about “the cluster”; the column to the right disambiguates.

TF source: reflects the workspace’s tf_source.type. A github source renders as <Repo>@<Ref> (the canonical happy-path shape since Sprint 5, e.g. jgruberf5/ibmcloud_terraform_bigip_next_for_kubernetes_2_3@v1.3.0); a local source renders as local:<Path>; embedded or unset renders as (unset). The samples below use the github shape where possible since it’s what most readers will see.

`ShapeEmpty` — fresh workspace, neither phase deployed

$ roksbnkctl status
Workspace:        dev
Region:           us-south
Resource group:   default
Cluster:          (unset)  (attach existing)
TF source:        jgruberf5/ibmcloud_terraform_bigip_next_for_kubernetes_2_3@v1.3.0
Cluster phase:    not deployed
BNK trial:        not deployed
Kubeconfig:       (none — run `roksbnkctl kubeconfig --download`)

Both phases report not deployed — the state directories either don’t exist or hold zero-resource state files. Running roksbnkctl cluster up (or roksbnkctl up for the monolithic path) advances the workspace to ShapeClusterOnly.

`ShapeClusterOnly` — cluster phase deployed, no BNK trial yet

$ roksbnkctl status
Workspace:        canada-roks
Region:           ca-tor
Resource group:   default
Cluster:          canada-roks  (attach existing)
TF source:        jgruberf5/ibmcloud_terraform_bigip_next_for_kubernetes_2_3@v1.3.0
Cluster phase:    deployed (last apply 2026-05-13 14:08:33 MST)
BNK trial:        not deployed
Kubeconfig:       /home/you/.kube/config
Cluster:          2/2 nodes ready

The Cluster phase line reads the mtime of <state-cluster-dir>/terraform.tfstate; the BNK trial line reads <state-dir>/terraform.tfstate and falls back to not deployed when the trial state is empty or missing. Running roksbnkctl bnk up advances the workspace to ShapeSplit.

`ShapeSplit` — both phases deployed (the v1.1+ steady state)

$ roksbnkctl status
Workspace:        canada-roks
Region:           ca-tor
Resource group:   default
Cluster:          canada-roks  (attach existing)
TF source:        jgruberf5/ibmcloud_terraform_bigip_next_for_kubernetes_2_3@v1.3.0
Cluster phase:    deployed (last apply 2026-05-13 14:08:33 MST)
BNK trial:        deployed (last apply 2026-05-13 14:15:01 MST)
Kubeconfig:       /home/you/.kube/config
Cluster:          2/2 nodes ready

Each phase has its own mtime; the timestamps move independently. Re-running roksbnkctl bnk down then roksbnkctl bnk up updates the BNK trial line without touching the Cluster phase line — useful for confirming which phase you most recently exercised.

`ShapeLegacySingle` — v1.0.x workspace, cluster + trial in one tfstate

$ roksbnkctl status
Workspace:        legacy-canada
Region:           ca-tor
Resource group:   default
Cluster:          canada-roks  (attach existing)
TF source:        (unset)
Shape:            legacy single-state (cluster + trial in one tfstate)
Last apply:       2026-05-13 14:15:01 MST  (4h22m18s ago)
Kubeconfig:       /home/you/.kube/config
Cluster:          2/2 nodes ready

Script-compat note. ShapeLegacySingle preserves the v1.0.x Last apply: line verbatim. Scripts that parsed roksbnkctl status output for the Last apply line on a legacy workspace continue to work unchanged. New script targets should switch to the per-phase Cluster phase: / BNK trial: lines (or to roksbnkctl cluster show + bnk show for a structured read); the per-phase lines are emitted for ShapeEmpty, ShapeClusterOnly, and ShapeSplit, not for ShapeLegacySingle. The Shape: line is a one-line callout so you don’t have to grep Chapter 8 to figure out which shape you’re on.
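A script that has to cope with both the per-phase lines and the legacy Last apply: line could extract them with one filter. The sketch below parses only the line labels shown in the samples above; last_apply_lines is a hypothetical helper, and roksbnkctl cluster show + bnk show remain the better choice for a structured read.

```shell
# last_apply_lines: read `roksbnkctl status` output on stdin and print the
# deployment lines as KEY=VALUE, whichever workspace shape produced them.
last_apply_lines() {
    awk '
        /^(Cluster phase|BNK trial|Last apply):/ {
            key = $0
            sub(/:.*/, "", key)          # label: everything before the first colon
            val = $0
            sub(/^[^:]*: */, "", val)    # value: everything after the label and padding
            print key "=" val
        }'
}

# Usage (not executed here):
#   roksbnkctl status | last_apply_lines
```

On a ShapeSplit workspace this yields two lines (Cluster phase and BNK trial); on a ShapeLegacySingle workspace it yields the single Last apply line.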

The shape detection logic lives in internal/config/tfstate.go::DetectShape; the per-phase emission in runStatus is in internal/cli/inspect.go. See PRD 06 §“status command integration” for the design rationale.

The `k` command tree

All internalised verbs live under roksbnkctl k:

roksbnkctl k get          # fetch resources
roksbnkctl k describe     # human-readable detail
roksbnkctl k apply        # server-side apply from file/dir
roksbnkctl k delete       # delete with cascade options
roksbnkctl k logs         # pod or component logs
roksbnkctl k exec         # exec into a pod (SPDY)
roksbnkctl k port-forward # forward local ports to a pod (SPDY)

Two of those have top-level shortcuts for muscle-memory convenience — the verbs you’d type a hundred times a day:

roksbnkctl get  ↔  roksbnkctl k get
roksbnkctl logs ↔  roksbnkctl k logs

apply, exec, delete, describe, and port-forward only work under the k prefix.

Two verbs are deliberately not aliased to avoid shadowing existing top-level commands:

roksbnkctl apply is the existing top-level lifecycle verb that runs terraform apply against the workspace (Sprint 0/1 surface). Adding a second apply would shadow it and break roksbnkctl up / roksbnkctl apply muscle memory. Use roksbnkctl k apply -f ... explicitly for the Kubernetes-side server-side apply.
roksbnkctl exec runs a command on the host with the workspace’s env loaded (Sprint 1’s host-exec verb — see Chapter 16, specifically the “Working examples” section). roksbnkctl k exec runs in a pod. The split keeps both meanings unambiguous without surprising name-collision behaviour.

kubectl/oc passthroughs stay as escape hatches

The existing roksbnkctl kubectl <args...> and roksbnkctl oc <args...> passthroughs are preserved post-Sprint 2. They still shell out to the host binary (with the workspace’s KUBECONFIG and credentials loaded) for anything outside the internalised subset.

When to reach for the passthrough:

Use case	Why passthrough
`kubectl rollout` (status/history/undo/restart)	Out of scope for v1.0; PRD 02 explicitly defers
`kubectl scale` / `kubectl autoscale`	Out of scope; passthrough is fine
`kubectl edit` / `kubectl patch`	Low frequency for BNK ops; out of scope for v1.0
`kubectl auth can-i` / RBAC introspection	Out of scope; passthrough is fine
`kubectl drain` / `cordon` / `taint`	Cluster admin operations; not roksbnkctl’s role
`kubectl run` / `kubectl create`	Imperative resource creation; use `k apply -f` instead
`oc adm` / `oc image` / OpenShift admin verbs	Niche enough to defer; passthrough
Niche flag combos	Anything not in the internalised verb’s flag set

If kubectl is missing from PATH, the passthrough errors with:

Error: kubectl not on PATH; use `roksbnkctl k get/apply/...` for the in-process path,
       or install kubectl

Same for oc. The doctor check (post-Sprint 2) treats both as informational rather than warnings — see Chapter 5 — Doctor.

Worked examples

The verbs in everyday order. Every example below assumes roksbnkctl k and accepts the top-level alias where one exists.

`roksbnkctl k get`

The most-used verb. Resource type + optional name + optional flags:

# All pods in the default namespace
roksbnkctl k get pods

# Pods in a specific namespace
roksbnkctl k get pods -n f5-bnk

# Pods across all namespaces
roksbnkctl k get pods -A

# A specific pod by name
roksbnkctl k get pod flo-controller-abc123 -n f5-bnk

# Label selector
roksbnkctl k get pods -A -l app.kubernetes.io/name=f5-lifecycle-operator

# Cluster-scoped resources (no namespace)
roksbnkctl k get nodes
roksbnkctl k get storageclasses

Output formats — these match kubectl byte-for-byte:

roksbnkctl k get pods -n f5-bnk -o yaml
roksbnkctl k get pods -n f5-bnk -o json
roksbnkctl k get pods -n f5-bnk -o wide
roksbnkctl k get pods -n f5-bnk -o name
roksbnkctl k get pods -n f5-bnk -o jsonpath='{.items[*].metadata.name}'
roksbnkctl k get pods -n f5-bnk -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}'

Plural / singular / shortname handling comes from the cluster’s RESTMapper via the discovery client, so pod, pods, po all work and pick up CRDs without a hardcoded list. roksbnkctl k get cneinstances (a BNK CRD) works as soon as the CRD is registered with the API server — no rebuild required.

Using the top-level alias:

roksbnkctl get pods -A

`roksbnkctl k describe`

Delegates to k8s.io/kubectl/pkg/describe — the same library kubectl uses internally. Output is identical to kubectl describe:

roksbnkctl k describe pod flo-controller-abc123 -n f5-bnk
roksbnkctl k describe node 10.243.0.4
roksbnkctl k describe service flo-webhook -n f5-bnk
roksbnkctl k describe cneinstance my-instance -n f5-bnk

The describe output’s “Events” section is especially useful for debugging stuck resources — pod scheduling failures, image pull errors, finaliser hangs all surface here.

`roksbnkctl k apply`

Server-side apply (SSA) with field-manager roksbnkctl. Inputs:

# Single file
roksbnkctl k apply -f pod.yaml

# Directory of YAMLs (recurses *.yaml)
roksbnkctl k apply -f manifests/

# Kustomize base (auto-detected if kustomization.yaml is present)
roksbnkctl k apply -f my-kustomize-base/

# stdin
cat pod.yaml | roksbnkctl k apply -f -

# Apply into a specific namespace (used when metadata.namespace is absent)
roksbnkctl k apply -f manifests/ -n f5-bnk

# Force conflicts (SSA force-conflicts=true)
roksbnkctl k apply -f manifests/ --force

There is no top-level roksbnkctl apply alias for this verb — roksbnkctl apply is the lifecycle command that runs terraform apply. Always use roksbnkctl k apply for the Kubernetes-side apply.

Differences from kubectl apply:

Always SSA. Field manager is roksbnkctl. Client-side apply is not supported.
Kustomize auto-detect. A directory containing kustomization.yaml is built via sigs.k8s.io/kustomize/api before applying — no -k flag needed.
--force maps to SSA’s force-conflicts=true. Without it, conflicts with another field manager produce a clean error rather than silently winning.

For a vanilla kubectl apply -f workflow, the behaviour is functionally identical. For workflows that depend on client-side three-way merge or specific --server-side flag combinations, fall back to the passthrough.

`roksbnkctl k delete`

Cascade-aware deletion via the dynamic client:

# Delete by name
roksbnkctl k delete pod flo-controller-abc123 -n f5-bnk

# Cascade: orphan, background (default), foreground
roksbnkctl k delete deployment flo -n f5-bnk --cascade=foreground

# Force (bypass graceful deletion; immediate)
roksbnkctl k delete pod stuck-pod -n f5-bnk --force

# Custom grace period (seconds)
roksbnkctl k delete pod my-pod -n f5-bnk --grace-period=5

Use --cascade=foreground when you want to wait for owned resources (Pods owned by a Deployment, etc.) to be deleted before the parent disappears — useful for tearing down BNK trial CRs cleanly so finalisers run in order.

`roksbnkctl k logs` and `roksbnkctl logs`

Two paths, one verb. The component-aware path was introduced in Sprint 1 for BNK-specific workflows; the raw pod-name path is new in Sprint 2.

Component-aware (existing — by label selector):

roksbnkctl logs flo                # F5 Lifecycle Operator (label selector under the hood)
roksbnkctl logs cis                # F5 BNK CIS controller
roksbnkctl logs cert-manager       # cert-manager
roksbnkctl logs cneinstance        # BIG-IP TMM data plane pods

Raw pod-name (new in Sprint 2):

roksbnkctl k logs flo-controller-abc123 -n f5-bnk

Common flags (both paths):

-f, --follow              # stream live (kubectl logs -f)
-c, --container <name>    # specific container in a multi-container pod
--previous                # logs from the previous instance (after a crash)
--since=10m               # only logs in the last 10 minutes
--tail=100                # last N lines only

Top-level alias:

roksbnkctl logs flo -f --since=5m

If the first argument matches one of the well-known BNK components (flo, cis, cert-manager, cneinstance), the component-aware path is used; otherwise it’s treated as a pod name. The component map lives in internal/cli/inspect.go and is keyed off the upstream chart’s default labels.
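The dispatch rule is simple enough to state as a case statement. This is an illustration of the rule as described, not the code in internal/cli/inspect.go; the function name is hypothetical.

```shell
# classify_logs_target NAME: print "component" when NAME is one of the
# well-known BNK components, otherwise "pod".
classify_logs_target() {
    case "$1" in
        flo|cis|cert-manager|cneinstance) echo component ;;
        *) echo pod ;;
    esac
}

classify_logs_target flo                     # component
classify_logs_target flo-controller-abc123   # pod
```

One consequence worth knowing: a pod whose name is exactly one of the component names would be read as a component, so give the full generated pod name when you mean a specific pod.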

`roksbnkctl k exec`

SPDY exec into a pod. Same semantics as kubectl exec:

# One-shot command
roksbnkctl k exec flo-controller-abc123 -n f5-bnk -- ls -la /

# stdin attached
roksbnkctl k exec flo-controller-abc123 -n f5-bnk -i -- cat /etc/hostname

# Interactive PTY (the bash-style use)
roksbnkctl k exec flo-controller-abc123 -n f5-bnk -i -t -- bash

# Specific container in a multi-container pod
roksbnkctl k exec flo-controller-abc123 -n f5-bnk -c sidecar -- env

The -i and -t flags map directly to kubectl exec’s -i (stdin) and -t (PTY). For top / bash / interactive Python sessions, pass both.

There is no roksbnkctl exec (top-level) alias — roksbnkctl exec runs on the host. See “Disambiguating roksbnkctl exec” in PRD 02.

`roksbnkctl k port-forward`

SPDY port-forward to a pod:

# Forward localhost:8080 → pod's :80
roksbnkctl k port-forward flo-controller-abc123 -n f5-bnk 8080:80

# Multiple ports
roksbnkctl k port-forward flo-controller-abc123 -n f5-bnk 8080:80 8443:443

# Random local port (let the kernel pick)
roksbnkctl k port-forward flo-controller-abc123 -n f5-bnk :80

Ctrl+C closes the tunnel cleanly, leaving no orphaned local listeners. The forward survives idle periods (reads/writes are bidirectional); it’s torn down only on signal or pod restart.

For a Service rather than a Pod, port-forward via the Service’s underlying pod or use kubectl port-forward svc/<name> through the passthrough — Service-targeted port-forwarding is currently passthrough-only.

Output format compatibility

The biggest user-visible promise: -o yaml / -o json / -o wide / -o jsonpath produce the same bytes as kubectl, modulo a small set of timestamp-and-resourceVersion fields that change between calls.

Concretely, the validator agent’s golden-file tests at internal/k8s/golden_test.go capture kubectl get <resource> -o yaml and roksbnkctl k get <resource> -o yaml against a live ROKS cluster and diff them, ignoring:

metadata.managedFields (ordering varies between callers; not user-visible)
metadata.resourceVersion (monotonic counter; changes on every read)
metadata.creationTimestamp (set server-side; not under our control)

Anything else differing is a test failure. The covered resources at v1.0 are Node, Pod, Service, ConfigMap — representative both of cluster-scoped (Node) and namespace-scoped (Pod, Service, ConfigMap), and of the typed-client (Node, Pod, Service) and dynamic-client (anything via cli-runtime’s resource.Builder) paths.
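For a quick manual comparison outside the golden tests, the three ignored fields can be stripped with a line-oriented filter before diffing. This is a rough sketch that relies on the indentation kubectl uses for -o yaml output; it is not a YAML parser and not how the golden tests themselves work, and normalize_yaml is a hypothetical name.

```shell
# normalize_yaml: read `-o yaml` output on stdin and drop managedFields blocks
# plus resourceVersion and creationTimestamp lines.
normalize_yaml() {
    awk '
        function indent_of(s) { match(s, /^ */); return RLENGTH }
        skipping {
            ind = indent_of($0)
            # Still inside managedFields: deeper lines, or list items
            # ("- ...") at the same indent as the key itself.
            if (ind > skip_ind || (ind == skip_ind && $0 ~ /^ *- /)) next
            skipping = 0
        }
        /^ *managedFields:/ { skipping = 1; skip_ind = indent_of($0); next }
        /^ *(resourceVersion|creationTimestamp):/ { next }
        { print }
    '
}

# Usage (not executed here):
#   kubectl get nodes -o yaml      | normalize_yaml > a.yaml
#   roksbnkctl k get nodes -o yaml | normalize_yaml > b.yaml
#   diff a.yaml b.yaml
```

An empty diff after normalisation is the same property the golden tests assert, checked by hand against whatever cluster your KUBECONFIG points at.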

Run them locally with:

make test-live

…against a KUBECONFIG that points at a real ROKS cluster. They’re not part of the unit-test CI run because they need a live cluster; the integrator runs them before tagging a release. Documented in CONTRIBUTING.md.

OpenShift extensions

Beyond the core kubectl-equivalent verbs, ROKS clusters surface OpenShift-specific resource types — Project, Route, ImageStream, BuildConfig. roksbnkctl k get discovers these natively today via the dynamic client + RESTMapper path (the cluster advertises them through the API discovery doc; the deferred-discovery mapper picks them up):

roksbnkctl k get projects                    # OpenShift projects (vs Kubernetes namespaces)
roksbnkctl k get routes -n f5-bnk            # OpenShift Routes (vs Ingress)
roksbnkctl k get imagestreams -n f5-bnk      # OpenShift ImageStreams
roksbnkctl k get buildconfigs                # BuildConfigs (mostly empty in BNK trials)

Same verb shape (get / describe / delete); the dynamic-client + RESTMapper combination handles type discovery without needing a per-type Go-side scheme registration.

Phase 2.1 of PRD 02 adds typed clients via github.com/openshift/client-go for nicer printing and describe integration of these resources. This is on the v1.x roadmap (see docs/PLAN.md §“What’s deliberately deferred to post-v1.0”). Until typed clients land, roksbnkctl k get/describe still works against these OpenShift resource types, just with the generic unstructured printer. If you want richer per-type output today, fall back to the oc passthrough:

roksbnkctl oc get projects                   # typed-client output today
roksbnkctl oc describe route f5-bnk-svc      # typed Route fields

Doctor change recap

A reminder of what changed in Sprint 2’s doctor (covered in Chapter 5):

kubectl — was “needed (warning when missing)”; now informational (no warning when missing).
oc — same downgrade.

A fresh dev box without kubectl / oc installed should run roksbnkctl doctor and see green-or-informational across the board for the everyday workflow. The host-binary requirement is gone; the binaries are nice-to-have for the passthroughs.

kubectl muscle-memory cheat sheet

A reader migrating from kubectl should be able to use this section as a Rosetta Stone:

`kubectl ...`	`roksbnkctl ...`
`kubectl get pods`	`roksbnkctl get pods` (or `roksbnkctl k get pods`)
`kubectl get pods -A`	`roksbnkctl get pods -A`
`kubectl get pods -o yaml`	`roksbnkctl get pods -o yaml`
`kubectl describe pod <name>`	`roksbnkctl k describe pod <name>`
`kubectl apply -f manifests/`	`roksbnkctl k apply -f manifests/`
`kubectl apply -k overlay/`	`roksbnkctl k apply -f overlay/` (auto-detects kustomize)
`kubectl delete pod <name>`	`roksbnkctl k delete pod <name>`
`kubectl logs <pod> -f`	`roksbnkctl logs <pod> -f` (or `roksbnkctl k logs <pod> -f`)
`kubectl exec -it <pod> -- bash`	`roksbnkctl k exec <pod> -i -t -- bash`
`kubectl port-forward <pod> 8080:80`	`roksbnkctl k port-forward <pod> 8080:80`
`kubectl rollout status deploy/foo`	`roksbnkctl kubectl rollout status deploy/foo` (passthrough)
`kubectl edit deployment foo`	`roksbnkctl kubectl edit deployment foo` (passthrough)
`kubectl scale deployment foo --replicas=3`	`roksbnkctl kubectl scale deployment foo --replicas=3` (passthrough)
`oc projects`	`roksbnkctl k get projects` (works today via dynamic-client) or `roksbnkctl oc projects` for typed-client output

The general pattern: if it’s get / describe / apply / delete / logs / exec / port-forward against a typed or unstructured Kubernetes resource, the internalised verb is the right answer. Anything else, fall back to the passthrough.
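The general pattern can be written down as a verb-based translation. The sketch below routes on the verb alone, so it does not cover the exceptions noted earlier (kubectl apply -k becomes k apply -f, and Service-targeted port-forward stays on the passthrough); to_roksbnkctl is a hypothetical helper that only prints the equivalent command.

```shell
# to_roksbnkctl VERB [ARGS...]: print the roksbnkctl equivalent of
# `kubectl VERB ARGS...`, choosing the internalised verb when one exists.
to_roksbnkctl() {
    case "$1" in
        get|describe|apply|delete|logs|exec|port-forward)
            printf 'roksbnkctl k %s\n' "$*" ;;
        *)
            printf 'roksbnkctl kubectl %s\n' "$*" ;;
    esac
}

to_roksbnkctl get pods -A
to_roksbnkctl rollout status deploy/foo
```

The first call prints the internalised form and the second prints the passthrough form, matching the corresponding rows of the cheat sheet.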

Cross-references

Chapter 5 — Doctor — the kubectl/oc downgrade in context.
Chapter 6 — Workspaces — the KUBECONFIG resolution chain that powers every k <verb>.
Chapter 8 — Cluster/trial phase split — the workspace-shape concept that drives the per-phase status lines.
Chapter 10 — Deploying BNK trials on top — the verb that advances ShapeClusterOnly → ShapeSplit (and back via bnk down).
Chapter 11 — Tearing down — independent teardown of the cluster and trial phases.
Chapter 16 — The --on flag — --on plus the passthroughs for customer-firewalled scenarios.
PRD 02 — the design rationale and acceptance criteria for the work in this chapter.
PRD 06 §“status command integration” — the design rationale for the per-phase status output.

COS supply chain management

BIG-IP Next for Kubernetes (BNK) pulls its runtime artefacts from IBM Cloud Object Storage (COS): the F5 Artifact Registry (FAR) container images, the JWT licence used at install + renewal time, the f5-bigip-k8s-manifest Helm chart, and the schematic JSON the deployer renders. The COS bucket is the supply chain: it’s how artefacts produced upstream (F5 build pipeline, licence-issuing service, schematic generator) reach the cluster.

roksbnkctl cos is the management surface for that supply chain. Three command levels — cos instance, cos bucket, cos object — cover the full CRUD on COS resources without touching the ibmcloud CLI; everything (most visibly cos object put for uploads and cos object get for downloads) goes through the IBM Cloud Go SDKs (go-sdk-core, platform-services-go-sdk, ibm-cos-sdk-go).

What COS is in this stack

COS is IBM’s S3-compatible object store. Two layers matter here:

Instance: a service instance under Resource Controller. Instances are global; they don’t pin to a region. A workspace typically uses one COS instance per environment (dev, staging, prod) and shares it across multiple buckets.
Bucket: an S3-style bucket with regional affinity, hosted on a COS instance. Buckets carry the storage class (standard, vault, cold, smart) and the access policy (HMAC keys, service-instance creds, public ACLs; public ACLs are off-limits for BNK supply-chain use).

Each cluster’s BNK install reads from a single COS bucket. The bucket holds:

Object	What it is	Consumed by
`f5-far-auth-key.tgz`	FAR repository pull credentials — the F5-internal artefact key that lets FLO download FAR container images	`flo` module at install time
`trial.jwt` (or production equivalent)	BNK subscription JWT — the licence the CNE Instance presents	`flo` and `license` modules
`schematic-<v>.json`	The deployer’s schematic JSON for the deployed BNK version	informational, not directly mounted into the cluster
(optional) FAR image tarballs	Pre-pulled FAR images for air-gapped installs	`flo` when running in disconnected mode

The bucket name comes from the upstream HCL, concretely the ibmcloud_resources_cos_bucket variable, which defaults to bnk-schematics-resources. The instance name defaults to bnk-orchestration.

The three command levels

roksbnkctl cos instance {create|delete|list}
roksbnkctl cos bucket   {create|delete|list} --instance <name-or-CRN>
roksbnkctl cos object   {put|get|delete|list} --instance <name-or-CRN>

All three layers resolve credentials through the standard credential resolver chain — env var, OS keychain, workspace api_key_b64, prompt. There’s no separate “COS credential”; the IBM API key authenticates against Resource Controller (instance ops) and IAM-signed S3 requests (bucket and object ops).

`cos instance`

Manages COS service instances at the account level via Resource Controller.

# Create a Standard-plan instance under the workspace's resource group
roksbnkctl cos instance create bnk-orchestration --plan standard

# Override the plan by catalog UUID when roksbnkctl hasn't mapped the tier
roksbnkctl cos instance create bnk-orchestration --plan-id <uuid>

# List instances in the account
roksbnkctl cos instance list

# Delete an instance (default: recursive — removes bound HMAC keys, service creds)
roksbnkctl cos instance delete bnk-orchestration
roksbnkctl cos instance delete bnk-orchestration --no-recursive --auto

Flag	Default	Notes
`--plan`	`standard`	Friendly name (`standard`, `lite`); maps to a Resource Controller plan UUID internally.
`--plan-id`	—	Catalog UUID — bypasses the friendly-name mapping. Use when IBM ships a plan tier `roksbnkctl` hasn’t seen yet.
`--target`	`global`	COS instances are global; this is left as a flag for forward compatibility.
`--no-recursive`	(off)	On delete, do NOT remove bound HMAC keys and service credentials. Hardly ever what you want.
`--auto`	(off)	On delete, skip the y/N confirmation.

The resource group is read from the workspace’s ibmcloud.resource_group field (defaulting to default when unset).

`cos bucket`

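Manages buckets within a named instance. The --instance flag is required for every bucket and object call: COS bucket names are unique across the whole service, but listing and creating buckets is scoped to a service instance, so roksbnkctl needs to know which instance to address.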

# Create a standard-class bucket
roksbnkctl cos bucket create bnk-schematics-resources \
  --instance bnk-orchestration \
  --class standard

# List buckets on the instance
roksbnkctl cos bucket list --instance bnk-orchestration

# Delete (the bucket must be empty first; cos object delete --recursive isn't implemented yet)
roksbnkctl cos bucket delete bnk-schematics-resources --instance bnk-orchestration

Flag	Default	Notes
`--instance`	(required)	Instance name or CRN — the CRN starts with `crn:v1:` and is used as-is; a bare name is looked up via Resource Controller.
`--region`	workspace region	The IBM Cloud region the bucket is pinned to. Override only when you’re crossing regions deliberately.
`--class`	`standard`	Storage class: `standard` (frequently accessed), `vault` (infrequent), `cold` (archive), `smart` (auto-tiered). The BNK supply chain uses `standard` because FLO reads at install + every restart.
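The --instance rule in the table (a value starting with crn:v1: is used as-is, anything else is treated as a name to look up) can be sketched in a few lines of shell. This is an illustration only: classify_instance is a hypothetical helper, not part of roksbnkctl, and the CRN below is a made-up placeholder.

```shell
# Hypothetical helper: classify an --instance value the way the table describes.
# Values starting with "crn:v1:" count as CRNs; anything else is a bare name
# that would need a Resource Controller lookup.
classify_instance() {
  case "$1" in
    crn:v1:*) echo "crn" ;;
    *)        echo "name" ;;
  esac
}

classify_instance "bnk-orchestration"
classify_instance "crn:v1:bluemix:public:cloud-object-storage:global:a/example:example-id::"
```

The first call prints name and the second prints crn.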

`cos object`

Manages objects (files) within a bucket. The key syntax is <bucket>/<key/with/slashes>. The parser splits on the first slash, so bucket/dir/file.tgz parses as bucket “bucket” and key “dir/file.tgz”.
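The split-on-first-slash rule can be reproduced with POSIX parameter expansion. This is a sketch of the rule as described, not the tool’s actual parser; split_object_path is a hypothetical name.

```shell
# Hypothetical helper mirroring the "split on the first slash" rule.
split_object_path() {
  path=$1
  case "$path" in
    */*) bucket=${path%%/*}   # everything before the first slash
         key=${path#*/} ;;    # everything after the first slash
    *)   bucket=$path
         key="" ;;            # no slash: bucket only, empty key
  esac
  printf 'bucket=%s key=%s\n' "$bucket" "$key"
}

split_object_path "bucket/dir/file.tgz"
split_object_path "bnk-schematics-resources"
```

The first call prints bucket=bucket key=dir/file.tgz; the second prints an empty key, which corresponds to the bare-bucket form that cos object list accepts.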

# Upload (streaming; multipart auto-engages for large files)
roksbnkctl cos object put bnk-schematics-resources/f5-far-auth-key.tgz \
  ./local/f5-far-auth-key.tgz \
  --instance bnk-orchestration

# Download (streaming)
roksbnkctl cos object get bnk-schematics-resources/f5-far-auth-key.tgz \
  ./downloaded.tgz \
  --instance bnk-orchestration

# Delete
roksbnkctl cos object delete bnk-schematics-resources/old-trial.jwt \
  --instance bnk-orchestration

# List (with an optional key prefix)
roksbnkctl cos object list bnk-schematics-resources \
  --instance bnk-orchestration

roksbnkctl cos object list bnk-schematics-resources/schematics/ \
  --instance bnk-orchestration

The list output is a tab-separated KEY SIZE MODIFIED table — pipe through column -t for readability or cut -f1 to extract just the keys.
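A minimal sketch of the key-extraction pipeline follows. It assumes the output starts with a header row, and it feeds the pipeline from a stand-in function (sample_listing, hypothetical) so it runs without a cluster; in practice the roksbnkctl cos object list command would take its place.

```shell
# Stand-in for the command's output: tab-separated KEY SIZE MODIFIED rows.
# The values are copied from the example listing in this chapter.
sample_listing() {
  printf 'KEY\tSIZE\tMODIFIED\n'
  printf 'f5-far-auth-key.tgz\t2412\t2026-05-08T14:12:33Z\n'
  printf 'trial.jwt\t1857\t2026-05-08T14:12:34Z\n'
}

# Keys only: drop the header row, then keep the first tab-separated field.
sample_listing | tail -n +2 | cut -f1
```

Because cut splits on tabs by default, keys that contain spaces come through intact.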

The BNK supply chain shape

A typical bnk-schematics-resources bucket after a clean install looks like:

$ roksbnkctl cos object list bnk-schematics-resources --instance bnk-orchestration
KEY                                     SIZE        MODIFIED
f5-far-auth-key.tgz                     2412        2026-05-08T14:12:33Z
trial.jwt                               1857        2026-05-08T14:12:34Z
schematic-2.3.0-3.2598.3-0.0.170.json   18432       2026-05-08T14:13:01Z

Three pieces of metadata in the upstream HCL (terraform/variables.tf) pin the bucket layout:

HCL variable	Default	Object
`f5_cne_far_auth_file`	`f5-far-auth-key.tgz`	FAR pull credentials
`f5_cne_subscription_jwt_file`	`trial.jwt`	Subscription JWT
`f5_bigip_k8s_manifest_version`	`2.3.0-3.2598.3-0.0.170`	Schematic filename inferred from this

Changing any of these in terraform.tfvars (or the workspace bnk: block, which renders into tfvars) changes which COS keys FLO will look for. The HCL doesn’t auto-discover key names — they’re literal.

For air-gapped installs where the cluster can’t reach repo.f5.com, additional pre-pulled FAR image tarballs go in the same bucket and the far_repo_url variable points at a COS-backed proxy. That topology is out of scope for v1.0; the supply chain shape described here is the connected-mode happy path.

Multipart upload and streaming download

FAR image tarballs run 1-5 GB. cos object put streams the input file into the bucket in 5 MB parts using S3-style multipart uploads:

For files under 5 MB, a single-part PutObject is used (the SDK’s default).
For files over 5 MB, the SDK auto-engages multipart upload — the file is split into 5 MB parts, each uploaded in parallel (up to 4 concurrent parts), and finalised with a CompleteMultipartUpload call.

The split is transparent: there’s no --multipart flag to set, and the SDK handles it under the hood. To verify that multipart is happening for a specific file, run roksbnkctl --verbose cos object put …; the SDK’s debug logging surfaces the part count.
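To sanity-check a part count reported in the verbose output, the arithmetic can be sketched as a ceiling division. Two assumptions are made here that the text does not confirm: “5 MB” is taken to mean 5 MiB (5 × 1024 × 1024 bytes), and a file of exactly that size is treated as a single part. part_count is a hypothetical helper.

```shell
# Assumption: the part size is 5 MiB.
PART_SIZE=$((5 * 1024 * 1024))

# Number of parts for a file of the given size in bytes.
part_count() {
  size=$1
  if [ "$size" -le "$PART_SIZE" ]; then
    echo 1   # small file: single PutObject
  else
    echo $(( (size + PART_SIZE - 1) / PART_SIZE ))   # ceiling division
  fi
}

part_count 2412                         # the auth key from the example listing
part_count $((3 * 1024 * 1024 * 1024))  # a 3 GiB tarball
```

Under these assumptions the first call prints 1 and the second prints 615.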

cos object get is similarly streaming: the SDK pipes the body straight to the destination file without buffering it in memory. Multi-gigabyte downloads on a memory-constrained jumphost are safe.

If a multipart upload is interrupted (network drop, ^C), the partial-upload state lingers on COS until cleaned up. Today roksbnkctl doesn’t expose a “list and abort orphan multipart uploads” command — that’s a v1.x addition. The workaround is to use ibmcloud cos list-multipart-uploads directly via Chapter 17’s docker backend or the IBM Cloud console.

Workspace config integration

The workspace cos: block is optional. If the bucket is already populated (manually, or by an external CI pipeline), the block can be omitted entirely. When set, it triggers an auto-upload at roksbnkctl up time so the FAR pull credentials and licence land before FLO needs them.

# ~/.roksbnkctl/<workspace>/config.yaml
cos:
  instance: bnk-orchestration
  bucket: bnk-schematics-resources
  upload:
    - source: ./local/f5-far-auth-key.tgz
      key: f5-far-auth-key.tgz
    - source: ./local/trial.jwt
      key: trial.jwt

The block maps directly to internal/config/workspace.go::COSCfg:

Field	Type	Purpose
`instance`	string	COS instance name or CRN. Looked up via Resource Controller at runtime.
`bucket`	string	Bucket name within the instance.
`upload`	list of `{source, key}`	Optional pre-flight uploads. `source` is a host filesystem path (relative or absolute); `key` is the destination object key in the bucket.

Pre-flight uploads run before terraform apply, so FLO sees the artefacts when it pulls. Idempotent: re-running up re-uploads, which COS treats as overwrite — safe.
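For readers who prefer to populate the bucket by hand, the upload list maps onto one cos object put call per entry. The sketch below only prints those commands (a dry run) and does not execute them; print_uploads and its source=key argument format are hypothetical conventions for this example, and the mapping to what up does internally is an assumption.

```shell
# Print (not run) one "cos object put" command per source=key pair.
# Limitation of this sketch: source paths containing "=" are not handled.
print_uploads() {
  instance=$1; bucket=$2; shift 2
  for pair in "$@"; do
    src=${pair%%=*}   # text before the first "="
    key=${pair#*=}    # text after the first "="
    printf 'roksbnkctl cos object put %s/%s %s --instance %s\n' \
      "$bucket" "$key" "$src" "$instance"
  done
}

print_uploads bnk-orchestration bnk-schematics-resources \
  ./local/f5-far-auth-key.tgz=f5-far-auth-key.tgz \
  ./local/trial.jwt=trial.jwt
```

Each printed line follows the cos object put syntax shown earlier in this chapter.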

When the supply chain matters

Three lifecycle moments where the COS bucket is in play:

Install time

roksbnkctl up provisions FLO, which queries the bucket for f5-far-auth-key.tgz and trial.jwt. Missing either object → FLO fails to start → terraform apply retries (per internal/cli/lifecycle.go::applyWithRetry) for ~3 attempts before erroring. The fix is always “put the missing object in the bucket and re-run up”; the lifecycle retry hides transient bucket-policy propagation lag but won’t paper over a genuinely-empty bucket.

Upgrade time

When bnk.manifest_version (or the f5_bigip_k8s_manifest_version HCL variable) bumps, FLO pulls a new FAR image tarball and re-renders the CNE Instance. If the new manifest version references a FAR image that isn’t already in repo.f5.com’s public registry (rare, but happens for pre-release builds), the bucket holds the air-gapped fallback. Standard upgrades, the connected-mode case, need no new objects in the bucket; they pull from repo.f5.com using the credentials already stored in f5-far-auth-key.tgz.

Licence rotation

When the trial expires or a production licence arrives, swap trial.jwt for the new file:

roksbnkctl cos object put bnk-schematics-resources/trial.jwt \
  ./new-license.jwt \
  --instance bnk-orchestration

# Force FLO to re-read the licence (delete the CNE Instance's License resource;
# FLO's reconciler re-creates it from the updated JWT)
roksbnkctl k delete license -n f5-bnk --all

FLO picks up the new JWT within 60-90 seconds. No roksbnkctl up re-run required for licence rotation alone.

Worked example: rotating COS supply-chain assets

End-to-end Part VII scenario: the FAR auth key on file is about to expire, a new one arrived from the F5 distribution side, and you need to rotate it without taking BNK down. The same flow handles licence-JWT rotation (swap trial.jwt for the production JWT) and FAR-image-tarball uploads for air-gapped clusters. See Chapter 14 for the API-key half of the rotation story; this walkthrough focuses on the COS object half.

# 1. Sanity-check the current state
roksbnkctl cos object list bnk-schematics-resources --instance bnk-orchestration

# 2. Upload the new auth key (overwrites the existing file)
roksbnkctl cos object put bnk-schematics-resources/f5-far-auth-key.tgz \
  ./new-far-auth-key.tgz \
  --instance bnk-orchestration

# 3. Verify the upload
roksbnkctl cos object list bnk-schematics-resources --instance bnk-orchestration
# Expected: the f5-far-auth-key.tgz row's MODIFIED timestamp matches the upload you just ran

# 4. (optional, air-gapped only) Upload the FAR image tarball
roksbnkctl cos object put bnk-schematics-resources/far-2.3.0-images.tgz \
  ./far-2.3.0-images.tgz \
  --instance bnk-orchestration

# 5. Force FLO to re-read the supply chain
roksbnkctl k delete pod -n f5-bnk -l app=flo
# (FLO's controller restarts; the new pod re-pulls f5-far-auth-key.tgz on first reconcile)

# 6. Verify FLO is healthy with the new key
roksbnkctl logs flo
# Expected: no "failed to pull FAR image: unauthorized" lines

Step 6 is the verification gate. If FLO’s logs still show auth failures after the pod restart, the new auth key was rejected by repo.f5.com; the fix is on the F5 side (re-issue the key), not in the bucket.

Cross-references

Chapter 12 §“cos: — COS supply-chain (optional)” — the workspace-config block this chapter operationalises.
Chapter 13 — Terraform variables — f5_cne_far_auth_file, f5_cne_subscription_jwt_file, ibmcloud_cos_instance_name, ibmcloud_resources_cos_bucket are the HCL handles.
Chapter 14 — Credentials — the API-key resolution that auths every cos call.
Chapter 24 — Day-2 ops — roksbnkctl logs flo is the post-rotation verification command.
Chapter 26 — Troubleshooting — bucket-policy propagation lag, missing-object failure shapes.

Troubleshooting

Common failure modes you’ll hit when running roksbnkctl against real IBM Cloud accounts, organised as symptom → root cause → fix. The entries here are mined from the issue logs accumulated over Sprints 0-5 plus the failure shapes documented in PRD 05 §“Risks”.

Use the page as a lookup table. If your symptom isn’t here, Chapter 23 — The E2E test plan lists what every phase asserts; reverse-engineering from the assertions can narrow your diagnosis. For deeper-than-here debugging, the per-phase log files under /tmp/roksbnkctl-e2e-backends/ are the first stop.

Install and init

Symptom: `roksbnkctl init` errors with `plaintext secret detected`

Root cause: an existing ~/.roksbnkctl/<workspace>/config.yaml has a credential value sitting in a field whose name matches the rejection regex (api_key, password, token, secret_access_key, hmac_secret). The rejection is a deliberate safety net — see Chapter 14 §“What’s safe to commit vs not”.

Fix: move the credential into IBMCLOUD_API_KEY (env var) or the OS keychain (roksbnkctl init writes it via zalando/go-keyring). For a single-user dev box, the supported plaintext-on-disk channel is ibmcloud.api_key_b64 — base64-encoded, which doesn’t trip the regex.
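Producing the value for ibmcloud.api_key_b64 is a one-liner. This is a sketch, and encode_api_key is a hypothetical helper name. Keep in mind that base64 is an encoding, not encryption.

```shell
# Encode a key for the ibmcloud.api_key_b64 field.
# printf avoids the trailing newline that echo would add to the input;
# tr removes the line wrapping some base64 implementations apply to long output.
encode_api_key() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

encode_api_key 'abc'   # YWJj
```

Paste the printed string into the api_key_b64 field of the workspace config.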

Symptom: `roksbnkctl init` interactive prompts loop forever asking for the API key

Root cause: you’re running under CI / a non-TTY shell and roksbnkctl can’t read stdin. The interactive prompt fallback is the last step in the credential resolver chain and it doesn’t gracefully skip when stdin is closed.

Fix: set IBMCLOUD_API_KEY in the env, or pre-populate the keychain entry. For batch / CI runs, the documented invocation is:

IBMCLOUD_API_KEY=$(cat /path/to/secret) roksbnkctl init -w my-workspace

Pre-setting IBMCLOUD_API_KEY skips the API-key prompt (it’s the first link in the resolver chain). init still prompts for the remaining workspace metadata (region, resource group, cluster name) on TTY-bound stdin — a fully non-interactive bootstrap is on the v1.x roadmap.
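In a CI script, a guard in front of init can turn the prompt loop into an immediate failure. The sketch below covers only the API-key prompt; it does not address the remaining metadata prompts. init_preflight is a hypothetical helper, not a roksbnkctl feature.

```shell
# Returns 0 when the API-key prompt is either skipped or able to work,
# 1 when init would reach the prompt with no terminal to read from.
init_preflight() {
  if [ -n "${IBMCLOUD_API_KEY:-}" ]; then
    return 0   # env var set: first link in the resolver chain
  fi
  if [ -t 0 ]; then
    return 0   # stdin is a terminal: an interactive prompt can be answered
  fi
  echo "IBMCLOUD_API_KEY is unset and stdin is not a TTY" >&2
  return 1
}

# Usage (not executed here):
#   init_preflight && roksbnkctl init -w my-workspace
```

The [ -t 0 ] test is the standard POSIX check for whether stdin is attached to a terminal.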

Symptom: doctor reports `terraform: not found` on a fresh dev box

Root cause: terraform is the only strictly-required host tool for v1.0 (everything else is internalised). Doctor checks PATH; if your shell session hasn’t picked up the install location, doctor won’t find the binary.

Fix: install terraform via your package manager (brew install terraform, apt-get install terraform, etc.) and re-source the shell, or set the TERRAFORM_BIN env var pointing at the binary explicitly.

`roksbnkctl up` lifecycle

Symptom: `terraform apply` errors `timeout while waiting for state to become 'normal'`

Root cause: IBM Cloud’s control plane is occasionally 5-15 minutes slow propagating cluster state — a known transient. The cluster was created; the API just hasn’t caught up to reporting it as Ready.

Fix: roksbnkctl up retries the apply automatically, up to 3 attempts with a 60-second sleep between them (see applyWithRetry in internal/cli/lifecycle.go). If all three attempts fail, re-run roksbnkctl up manually: terraform’s state is durable, and the re-run skips every resource that’s already provisioned.
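If you are hand-rolling terraform outside roksbnkctl, the same bounded-retry shape can be approximated in shell. This is a sketch of the pattern described above, not a port of applyWithRetry; retry is a hypothetical helper.

```shell
# Run a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 on the first success, 1 once the attempts are exhausted.
retry() {
  attempts=$1; delay=$2; shift 2
  n=1
  while :; do
    "$@" && return 0
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}

# With the numbers from the text (not executed here):
#   retry 3 60 terraform apply -auto-approve
retry 3 0 true && echo "succeeded"
```

The final line is a zero-delay smoke test that prints succeeded.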

Symptom: `roksbnkctl up` returns success but `roksbnkctl k get nodes` says `No resources found`

Root cause: the ROKS cluster’s worker nodes take 5-10 minutes to provision after the cluster’s master endpoint returns Ready. Terraform considers the cluster “applied” as soon as the master is up; the workers come up asynchronously.

Fix: wait 5-10 minutes and re-run. If you want a deterministic gate, watch the IBM Cloud console’s cluster page until the worker count matches workers_per_zone × zones, then proceed. There’s no roksbnkctl wait command in v1.0 — that’s a v1.x addition.
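Until a native wait command exists, a small polling loop can act as the gate. The sketch below computes the expected count from the formula above and polls a caller-supplied command until it prints that number. Both helper names are hypothetical, and the counting command is left as a placeholder because the exact node-count invocation depends on your setup.

```shell
# Expected worker count: workers_per_zone * zones.
expected_workers() {
  echo $(( $1 * $2 ))
}

# Poll "$@" until its output equals the wanted count; give up after
# max_tries polls. Arguments: want, max_tries, delay_seconds, command...
wait_for_count() {
  want=$1; max_tries=$2; delay=$3; shift 3
  i=0
  while [ "$i" -lt "$max_tries" ]; do
    [ "$("$@")" = "$want" ] && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

want=$(expected_workers 2 3)   # 2 workers per zone across 3 zones
# "echo 6" stands in for a real node-counting command.
wait_for_count "$want" 3 0 echo 6 && echo "all $want workers present"
```

With the stand-in command this prints all 6 workers present.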

Symptom: `roksbnkctl up` post-apply hook fails fetching the admin kubeconfig with a 404

Root cause: the IBM Cloud kubeconfig API (/global/v2/applications/kubeconfig) returns 404 for ~30-60 seconds after the cluster create call returns. The cluster exists but the kubeconfig endpoint hasn’t materialised.

Fix: the binary retries with exponential backoff and usually succeeds within a minute. If it still 404s after the retry budget, run roksbnkctl kubeconfig --download -w <workspace> to retry just the fetch without re-applying.

Symptom: `Error: Inappropriate value for attribute "kubeconfig_dir": directory does not exist`

Root cause: the upstream HCL’s IBM provider doesn’t MkdirAll for the kubeconfig output directory; it expects the parent dir to exist already. The variable’s default (/work/.bnk/scratch/kubeconfig) is the in-container path; on a direct-on-host run it’s a path that doesn’t exist.

Fix: roksbnkctl writes a workspace-scoped override (kubeconfig_dir = ~/.roksbnkctl/<ws>/state/kubeconfig) and creates the dir at apply time. If you’re hand-rolling terraform without roksbnkctl up, mkdir -p ~/.roksbnkctl/<ws>/state/kubeconfig first.

Symptom: `terraform destroy` leaves orphan IBM Cloud resources (LBs, security groups, VPEs)

Root cause: ROKS occasionally leaves dangling cluster-owned resources after the cluster itself is destroyed — the destroy returns success but the IBM Cloud account still shows a load balancer or a Virtual Private Endpoint Gateway tagged with the deleted cluster’s ID.

Fix: run roksbnkctl ibmcloud is load-balancers | grep <cluster-name> (and similarly for endpoint-gateways and security-groups), then ibmcloud is load-balancer-delete each orphan by ID. A future roksbnkctl cluster destroy --sweep-orphans will automate this; for now the sweep is manual.

Workspaces

Symptom: `roksbnkctl ws delete <name>` succeeds but subsequent commands still use the deleted workspace

Root cause: workspace context is set by the --workspace/-w flag (or the persistent value the active shell remembers from the last roksbnkctl ws use); deleting the workspace directory doesn’t reset that context, so subsequent commands try to operate on a non-existent workspace dir.

Fix: switch to another workspace before deleting the current one:

roksbnkctl ws use default
roksbnkctl ws delete my-old-workspace

The parking-lot pattern is the recommended flow: keep a default workspace as the always-safe destination after deletes. Documented in Chapter 6 — Workspaces.

Symptom: `workspace "<name>" has terraform-managed resources; pass --force to delete anyway`

Root cause: the workspace’s terraform.tfstate is non-empty — live infrastructure exists. roksbnkctl ws delete refuses to orphan the resources by removing the state file out from under them.

Fix: run roksbnkctl down -w <name> --auto first to destroy the resources, then roksbnkctl ws delete <name> (no --force needed once state is empty). If you genuinely want to abandon the infra and clean up by hand later, roksbnkctl ws delete --force skips the check.

Backends

Symptom: `--backend docker` errors with `Cannot connect to the Docker daemon`

Root cause: dockerd isn’t running, or your user isn’t in the docker group, or you’re on a system that needs a separate rootless-docker socket path.

Fix:

Linux with system docker: sudo systemctl start docker; add yourself to the docker group (sudo usermod -aG docker $USER) and log out + back in.
Linux with rootless docker: systemctl --user start docker; set DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock.
macOS / Windows: launch Docker Desktop / Rancher Desktop.

Verify with docker info | head -1 — if that fails, roksbnkctl --backend docker will too.

Symptom: `--backend k8s` errors with `ops pod not found in roksbnkctl-ops namespace`

Root cause: you haven’t run roksbnkctl ops install against the target cluster. The k8s backend dispatches into a long-lived ops pod that has to be provisioned first.

Fix:

roksbnkctl ops install

Verify with roksbnkctl k get pod -n roksbnkctl-ops — the pod should be Running. See Chapter 19 for the install model.

Symptom: `--backend ssh:<target>` errors with `tool not found: iperf3 — run with --bootstrap to apt-install`

Root cause: the SSH target doesn’t have the tool installed, and roksbnkctl doesn’t auto-install without explicit opt-in (because apt-installing on a production jumphost without consent is rude).

Fix: pass --bootstrap once per fresh target:

roksbnkctl --backend ssh:jumphost --bootstrap test throughput

The bootstrap step runs apt-get install -y <tool> (or the equivalent for ibmcloud — adding the IBM apt repo first). Subsequent calls skip the install check and run normally. See Chapter 17 §“SSH backend” for the bootstrap mechanism.

Symptom: `--backend ssh:jumphost` errors `host key mismatch for jumphost (got SHA256:..., known_hosts has SHA256:...)`

Root cause: the jumphost was re-provisioned (terraform destroy + apply) and now has a fresh host key, but ~/.roksbnkctl/known_hosts still has the old fingerprint. TOFU refuses to silently accept the change — that’s the threat model the prompt exists to defend against.

Fix: if you know the re-provision is legitimate, delete the stale entry:

ssh-keygen -R '<jumphost-ip>' -f ~/.roksbnkctl/known_hosts
# Or for the whole roksbnkctl known_hosts:
rm ~/.roksbnkctl/known_hosts

The next roksbnkctl --on jumphost call will TOFU-prompt with the new fingerprint. For CI use the --insecure-host-key flag, which records the key on first contact without prompting.

OpenShift and PodSecurity

Symptom: throughput test pod fails admission: `violates PodSecurity "restricted:v1.x": runAsNonRoot != true`

Root cause: the throughput suite’s default iperf3 image is networkstatic/iperf3:latest which runs as root. OpenShift’s restricted-v2 SCC rejects root pods.

Fix: set the workspace config to use the bundled image, which is built with USER 1000:

# ~/.roksbnkctl/<workspace>/config.yaml
test:
  throughput:
    image: ghcr.io/jgruberf5/roksbnkctl-tools-iperf3:v0.9.0

Chapter 22 §“The bundled image and the runAsNonRoot constraint” is the full backstory. The same chapter’s §“OpenShift SCC failure mode” lists the three error-message variants OpenShift produces.

Symptom: `roksbnkctl ops install` errors `ServiceAccount "roksbnkctl-ops" forbidden: violates PodSecurityPolicy`

Root cause: rare — the cluster is running PodSecurityPolicy (the deprecated predecessor to PodSecurity admission) and the ops pod’s ServiceAccount doesn’t have the SCC binding it needs.

Fix: the ops manifest assumes restricted-v2 is acceptable. If your cluster forces privileged, that’s a cluster-policy question outside roksbnkctl’s control — talk to your cluster admin about granting restricted-v2 to the roksbnkctl-ops namespace.

Symptom: `ImagePullBackOff` on the ops pod or throughput pod

Root cause: most commonly, the cluster can’t reach the image registry. Three sub-causes:

The cluster’s egress NAT doesn’t route to ghcr.io (the image host for roksbnkctl-tools-*).
The image tag doesn’t exist for the version you’re running (e.g., you built roksbnkctl from main at a commit between releases, and :dev isn’t published).
ghcr.io itself is rate-limiting unauthenticated pulls (rare; usually only an issue for shared CI hosts hitting ghcr.io en masse).

Fix:

Check egress with roksbnkctl k exec <ops-pod> -- curl -sI https://ghcr.io — if that hangs, you have a network path issue, not a roksbnkctl issue.
Check the tag with docker manifest inspect ghcr.io/jgruberf5/roksbnkctl-tools-iperf3:<version> — if 404, pin to a tagged release version in workspace config rather than running from main head.
For rate-limit issues, pre-pull images to a local registry mirror and override the workspace test.throughput.image to point there.

DNS

Symptom: `roksbnkctl test dns` returns NXDOMAIN against an internal GSLB record that you know exists

Root cause: your laptop’s resolver chain doesn’t have a route to the internal GSLB VIP. The default --server system uses your /etc/resolv.conf, which resolves against your office or ISP resolver — neither of which knows about the cluster-private GSLB.

Fix: query the GSLB VIP explicitly, or query from inside the cluster:

# Query the GSLB VIP directly
roksbnkctl test dns --target www.example.com --type A --server 169.45.91.5

# Or run the probe from inside the cluster (the cluster's resolvers reach the GSLB)
roksbnkctl test dns --target www.example.com --type A --backend k8s --server cluster

Chapter 21 §“Server resolution” is the full --server reference.

Symptom: `--gslb-compare` always reports `gslb_divergence: false` against a target you expect to diverge

Root cause: the chosen target’s GSLB rule isn’t differentiating your local vantage (laptop) from your k8s vantage (cluster). Two common shapes:

The name is fronted by an anycast resolver fleet (Cloudflare, Google Public DNS) — same answer everywhere by design.
Your laptop and your cluster are both in the same geographic region from GSLB’s perspective (both in North America hitting the same datacenter).

Fix: pick a target known to be geo-resolved (www.google.com is the canonical “different IPs from different regions” example), or add an SSH-based vantage (--backend ssh:eu-bastion) to bring in a third region. Chapter 21 §“GSLB cross-vantage compare” covers the multi-vantage workflow.

Symptom: `roksbnkctl test dns --backend docker` errors `DNS probe doesn't benefit from docker`

Root cause: design choice. A container on Docker’s default bridge network reaches the outside world through NAT on the host, so a docker-backend probe has the same external network identity as a --backend local probe, with no GSLB-relevant vantage difference.

Fix: use --backend local, --backend k8s, or --backend ssh:<target> instead.

Cluster registration

Symptom: `roksbnkctl cluster register <name>` errors `cluster not found`

Root cause: the cluster name doesn’t exist in the workspace’s resource group, or the API key doesn’t have visibility into the resource group.

Fix: verify the name with roksbnkctl ibmcloud ks cluster ls --output json | jq '.[].name', and verify the resource-group scope in workspace config matches where the cluster lives. If the cluster is in a different resource group, set ibmcloud.resource_group in the workspace config to that group.

Symptom: `register` succeeds but `roksbnkctl k get nodes` immediately errors `Unauthorized`

Root cause: the kubeconfig was fetched but the auth token has already expired, or the IAM-based token that the kubeconfig embeds doesn’t match the API key that’s currently in env. Common after a 1-hour idle window.

Fix:

roksbnkctl kubeconfig --download --cluster <name>

The token refresh is automatic on every up/apply, but register against a cluster you didn’t just provision sometimes lands you with a stale token in the kubeconfig.

COS supply chain

Symptom: FLO fails to start with `failed to pull FAR image: 403 Forbidden`

Root cause: the f5-far-auth-key.tgz object in the bucket has stale credentials (the F5-side pull key was rotated, but the bucket still has the old one).

Fix: re-issue the key on the F5 side and upload to COS:

roksbnkctl cos object put bnk-schematics-resources/f5-far-auth-key.tgz \
  ./new-f5-far-auth-key.tgz \
  --instance bnk-orchestration

# Restart FLO so it re-reads
roksbnkctl k delete pod -n f5-bnk -l app=flo

See Chapter 25 §“Worked example” for the full flow.

Symptom: `cos object put` for a 3 GB file errors midway with `RequestTimeout`

Root cause: the multipart upload SDK encountered a transient COS HTTP timeout on one of the part uploads. Multipart uploads aren’t currently resumed from the failure point — they restart from zero.

Fix: re-run the cos object put. If it fails reproducibly on the same part, the underlying network is the problem (your egress link is saturated, or COS is having a regional outage — check the IBM Cloud status page).

Symptom: `cos bucket delete` errors `Bucket not empty`

Root cause: COS requires buckets to be empty before delete; there’s no --recursive flag on bucket delete today.

Fix: list and delete each object, then delete the bucket:

# Split on tabs so keys containing spaces survive; NR>1 skips the header row
roksbnkctl cos object list bnk-schematics-resources --instance bnk-orchestration | \
  awk -F'\t' 'NR>1 {print $1}' | \
  xargs -I{} roksbnkctl cos object delete "bnk-schematics-resources/{}" --instance bnk-orchestration

roksbnkctl cos bucket delete bnk-schematics-resources --instance bnk-orchestration

Don’t forget to abort any pending multipart uploads first — they don’t appear in the standard object list but they do prevent bucket deletion. The workaround for now is ibmcloud cos list-multipart-uploads followed by ibmcloud cos abort-multipart-upload until v1.x lands a native command.

Networking

Symptom: `roksbnkctl test connectivity` reports `Get "https://...": dial tcp: i/o timeout` for an internal-only URL

Root cause: connectivity probes run from --backend local by default. From your laptop, internal-only URLs (cluster-private VIPs, internal GSLB names) aren’t reachable.

Fix: route the probe through the cluster’s network via an SSH target inside the cluster’s VPC (--backend ssh:cluster-jumphost). --backend k8s is not an option for this suite yet: in v1.0 the k8s backend covers iperf3 and DNS only, and the connectivity suite gains k8s support only when that lands in a later release.

Symptom: `roksbnkctl test connectivity` fails with `x509: certificate signed by unknown authority` against a self-signed internal endpoint

Root cause: the URL’s TLS cert isn’t in the host’s trust store.

Fix: pass --insecure (session-wide; skips TLS validation for every probe in the run). The flag is deliberately session-wide rather than per-host — see Chapter 20 §“Mixed TLS-trust posture”. For mixed trust posture across multiple internal endpoints, run two separate test connectivity invocations, one per trust group.

CI-specific

Symptom: nightly e2e run fails on phase D with `Error: Provider configuration is missing`

Root cause: a terraform init cache invalidation under ~/.roksbnkctl/<ws>/state/.terraform/ left a partial provider download. Happens after a CI worker is recycled mid-init.

Fix: rm -rf ~/.roksbnkctl/<ws>/state/.terraform/ then re-run roksbnkctl up. Terraform-init re-downloads the providers cleanly. For CI workers that get recycled often, add a pre-step that purges .terraform/ before each run.

Symptom: cred audit (phase M) reports `IBMCLOUD_API_KEY found in docker inspect output`

Root cause: real stop-ship — credentials leaked into a docker container’s runtime env. Check internal/exec/docker.go::buildEnvArgv for any code path that passes the credential by value (-e IBMCLOUD_API_KEY=<value>) rather than by reference (-e IBMCLOUD_API_KEY — let docker pull from the caller’s env).

Fix: file an issue immediately, do not tag a release until this is green. Phase M is the v1.0 release gate; a leak here means the redactor or the cred-passing logic regressed. See PRD 04 for the threat model.

Getting more help

When the symptom isn’t on this page:

Re-run with --verbose (-v) — the verbose output usually surfaces the root cause directly.
Check /tmp/roksbnkctl-e2e-backends/<phase>-<ts>.log for the per-phase trail.
Cross-reference Chapter 23 — The E2E test plan — the phase-by-phase pass criteria usually narrow down where the breakage lives.
File an issue on github.com/jgruberf5/roksbnkctl with the verbose output, the roksbnkctl --version stamp, and the per-phase log if there is one.

Command reference

Auto-generated by go run ./tools/refgen/cobra-md > book/src/27-command-reference.md. Re-run on every CLI surface change.

roksbnkctl deploys F5 BIG-IP Next for Kubernetes (BNK) onto IBM Cloud ROKS, manages the COS supply chain BNK depends on, and runs built-in connectivity, DNS, and throughput tests against the deployed environment.

The 4-command lifecycle:
roksbnkctl init    Interactive setup; writes the workspace config
roksbnkctl up      Provision (or attach) and deploy BNK
roksbnkctl test    Run connectivity, DNS, and throughput tests
roksbnkctl down    Tear down BNK (and the cluster if cluster up provisioned it)

See https://jgruberf5.github.io/roksbnkctl/book/ for the canonical user guide.

Global flags

These flags apply to every command. They are declared on the root command and inherited by every subcommand.

Flag	Type	Default	Description
`--backend`	`string`	—	execution backend: local \| docker \| k8s \| ssh:`<target>` (default: per-tool from workspace exec: block, else local)
`--bootstrap`	`bool`	`false`	for --backend ssh:`<target>`: auto-install missing tools on Ubuntu via apt-get (requires passwordless sudo on the target)
`--insecure-host-key`	`bool`	`false`	skip the host-key TOFU prompt; record on first contact (CI use)
`--no-color`	`bool`	`false`	disable colored output
`--on`	`string`	—	run on the named SSH target instead of locally (`roksbnkctl targets list` to see options)
`--output` / `-o`	`string`	`text`	output format: text \| json
`--quiet` / `-q`	`bool`	`false`	suppress all but errors
`--verbose` / `-v`	`bool`	`false`	verbose output
`--workspace` / `-w`	`string`	—	workspace name (default: current; first run creates ‘default’)
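
The --backend default chain (flag, then the per-tool entry in the workspace exec: block, then local) can be sketched as follows. This is an illustrative sketch only: resolveBackend and splitBackend are hypothetical names, and the exec: block is modelled as a plain tool-to-backend map.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveBackend applies the documented precedence: an explicit flag wins,
// then the per-tool workspace default, then "local".
func resolveBackend(flag string, execDefaults map[string]string, tool string) string {
	if flag != "" {
		return flag
	}
	if backend, ok := execDefaults[tool]; ok && backend != "" {
		return backend
	}
	return "local"
}

// splitBackend separates "ssh:<target>" into its kind and target name and
// rejects anything outside the four documented backends.
func splitBackend(backend string) (kind, target string, err error) {
	switch {
	case backend == "local", backend == "docker", backend == "k8s":
		return backend, "", nil
	case strings.HasPrefix(backend, "ssh:") && len(backend) > len("ssh:"):
		return "ssh", strings.TrimPrefix(backend, "ssh:"), nil
	}
	return "", "", fmt.Errorf("unknown backend %q", backend)
}

func main() {
	defaults := map[string]string{"iperf3": "k8s"} // hypothetical exec: block
	fmt.Println(resolveBackend("", defaults, "iperf3"))
	fmt.Println(splitBackend("ssh:cluster-jumphost"))
}
```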

`roksbnkctl apply`

Apply Terraform without re-prompting (assumes config.yaml exists)

roksbnkctl apply [flags]

Flags

Flag	Type	Default	Description
`--auto`	`bool`	`false`	skip the confirmation prompt
`--no-kubeconfig`	`bool`	`false`	skip the post-apply admin kubeconfig fetch
`--var-file`	`stringArray`	`[]`	extra TF var-file (repeatable; later files override earlier)
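
The “later files override earlier” rule for repeated --var-file flags behaves like a left-to-right map merge. The sketch below only models that precedence; Terraform itself does the real merging, and the variable names are illustrative.

```go
package main

import "fmt"

// mergeVarFiles folds variable maps left to right; a key set in a later
// file replaces the value from any earlier file.
func mergeVarFiles(files ...map[string]string) map[string]string {
	merged := make(map[string]string)
	for _, vars := range files {
		for name, value := range vars {
			merged[name] = value
		}
	}
	return merged
}

func main() {
	base := map[string]string{"region": "ca-tor", "deploy_bnk": "true"}
	override := map[string]string{"deploy_bnk": "false"}
	// Equivalent in spirit to: --var-file base.tfvars --var-file override.tfvars
	fmt.Println(mergeVarFiles(base, override))
}
```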

`roksbnkctl cluster`

ROKS cluster lifecycle (separate from BNK trials)

Manage the ROKS cluster as a durable, reusable resource that sits underneath your BNK trials.

Commands:
roksbnkctl cluster up         Create the ROKS cluster (+ transit gateway, registry COS, cert-manager, jumphost)
roksbnkctl cluster down       Destroy the cluster and everything cluster-scoped
roksbnkctl cluster register   Discover an already-existing cluster and persist its identity
roksbnkctl cluster show       Print the registered cluster from cluster-outputs.json

Each roksbnkctl up against this workspace will reuse the registered cluster (reading cluster-outputs.json) so multiple BNK trials can share one cluster.

`roksbnkctl cluster down`

Destroy the cluster phase (ROKS + cluster-shared services)

roksbnkctl cluster down [flags]

Tears down everything roksbnkctl cluster up created. Refuses to run if any BNK trial state exists for this workspace — destroy those first with roksbnkctl down to avoid orphaned BNK resources.

Flags

Flag	Type	Default	Description
`--auto`	`bool`	`false`	skip the destroy confirmation
`--var-file`	`stringArray`	`[]`	extra TF var-file (repeatable; later files override earlier)

← back to roksbnkctl cluster

`roksbnkctl cluster register`

Discover an existing ROKS cluster and persist its identity

roksbnkctl cluster register [cluster-name-or-id] [flags]

Looks up an existing ROKS cluster in your IBM Cloud account, verifies its registry COS instance exists, and writes the cluster’s identity to ~/.roksbnkctl/<workspace>/cluster-outputs.json.

Subsequent roksbnkctl up runs in this workspace will pick up the registered cluster automatically — no need to repeat its identity in trial tfvars.

By default the registry COS instance name follows the upstream HCL fallback formula “<cluster-name>-cos”. Pass --registry-cos-name to override (e.g. if your tfvars sets roks_cos_instance_name to a different value).
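
As a minimal sketch of that fallback, assuming an empty override means “use the formula” (registryCOSName is a hypothetical name, not the project’s function):

```go
package main

import "fmt"

// registryCOSName returns the override when one is given, otherwise the
// "<cluster-name>-cos" fallback.
func registryCOSName(clusterName, override string) string {
	if override != "" {
		return override
	}
	return clusterName + "-cos"
}

func main() {
	fmt.Println(registryCOSName("demo-cluster", ""))
	fmt.Println(registryCOSName("demo-cluster", "shared-registry-cos"))
}
```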

Flags

Flag	Type	Default	Description
`--prompt`	`bool`	`false`	prompt for the cluster name even if one is given as an argument
`--registry-cos-name`	`string`	—	expected registry COS instance name (default “`<cluster>`-cos” — matches the upstream HCL fallback)

← back to roksbnkctl cluster

`roksbnkctl cluster show`

Print the registered cluster (cluster-outputs.json)

← back to roksbnkctl cluster

`roksbnkctl cluster up`

Provision the ROKS cluster (and cluster-shared services) only

roksbnkctl cluster up [flags]

Runs terraform apply with deploy_bnk=false forced — creates the ROKS cluster, transit gateway, registry COS, cert-manager, and the test jumphost, but skips the BNK trial modules (flo, cne_instance, license). On success, writes the cluster’s identity to ~/.roksbnkctl/<workspace>/cluster-outputs.json so subsequent roksbnkctl up runs can deploy BNK trials onto this cluster.

Uses a separate state directory (~/.roksbnkctl/<workspace>/state-cluster/) so it doesn’t tangle with BNK-trial state.

Flags

Flag	Type	Default	Description
`--auto`	`bool`	`false`	skip the confirmation prompt before apply
`--no-kubeconfig`	`bool`	`false`	skip the post-apply admin kubeconfig fetch
`--var-file`	`stringArray`	`[]`	extra TF var-file (repeatable; later files override earlier)

← back to roksbnkctl cluster

`roksbnkctl cos`

Manage IBM Cloud Object Storage (instances, buckets, objects)

roksbnkctl cos provides full CRUD on the COS supply chain BNK depends on: COS instances (via Resource Controller), buckets, and keyed objects (FAR pull keys, JWT licenses, etc.). All calls go through the IBM Go SDKs — no ibmcloud CLI dependency.

`roksbnkctl cos bucket`

Manage COS buckets

Flags

Flag	Type	Default	Description
`--instance`	`string`	—	COS instance name or CRN (required)

← back to roksbnkctl cos

`roksbnkctl cos bucket create`

Create a bucket on the named instance

roksbnkctl cos bucket create <bucket> [flags]

Flags

Flag	Type	Default	Description
`--class`	`string`	`standard`	storage class (standard, vault, cold, smart)
`--region`	`string`	—	bucket region (default: workspace region)

← back to roksbnkctl cos bucket

`roksbnkctl cos bucket delete`

Delete a bucket (must be empty)

roksbnkctl cos bucket delete <bucket>

← back to roksbnkctl cos bucket

`roksbnkctl cos bucket list`

List buckets on the named instance

← back to roksbnkctl cos bucket

`roksbnkctl cos instance`

Manage COS instances (service instances under Resource Controller)

← back to roksbnkctl cos

`roksbnkctl cos instance create`

Create a COS instance

roksbnkctl cos instance create <name> [flags]

Create a COS service instance under the workspace’s resource group.

--plan accepts a friendly name (standard | lite); --plan-id takes a catalog UUID directly when IBM ships a tier roksbnkctl hasn’t mapped yet. --target defaults to “global” (COS instances are global; buckets carry the regional affinity).

Flags

Flag	Type	Default	Description
`--plan`	`string`	`standard`	service plan name (standard \| lite)
`--plan-id`	`string`	—	service plan UUID (overrides --plan; for plans roksbnkctl hasn’t mapped)
`--target`	`string`	`global`	target region (default: global; COS instances are global)

← back to roksbnkctl cos instance

`roksbnkctl cos instance delete`

Delete a COS instance (and its bound resources unless --no-recursive)

roksbnkctl cos instance delete <name> [flags]

Flags

Flag	Type	Default	Description
`--auto`	`bool`	`false`	skip the confirmation prompt
`--no-recursive`	`bool`	`false`	do NOT delete bound resources (HMAC keys, service credentials)

← back to roksbnkctl cos instance

`roksbnkctl cos instance list`

List COS instances in the current account

← back to roksbnkctl cos instance

`roksbnkctl cos object`

Manage objects in COS buckets

Flags

Flag	Type	Default	Description
`--instance`	`string`	—	COS instance name or CRN (required)

← back to roksbnkctl cos

`roksbnkctl cos object delete`

Delete an object

roksbnkctl cos object delete <bucket>/<key>

← back to roksbnkctl cos object

`roksbnkctl cos object get`

Download an object (streaming)

roksbnkctl cos object get <bucket>/<key> <local-file>

← back to roksbnkctl cos object

`roksbnkctl cos object list`

List objects (optionally under a prefix)

roksbnkctl cos object list <bucket>[/<prefix>]

← back to roksbnkctl cos object

`roksbnkctl cos object put`

Upload an object (multipart for large files, streaming)

roksbnkctl cos object put <bucket>/<key> <local-file>

← back to roksbnkctl cos object

`roksbnkctl doctor`

Check prerequisites and report missing pieces

roksbnkctl doctor [flags]

Verifies the host has what roksbnkctl needs.

Required (hard fail on missing):

terraform on PATH (the local backend’s workhorse for roksbnkctl up)

Informational (the binary internalises each surface; missing → no warning):

kubectl / oc: internalised via client-go (roksbnkctl k *)
ibmcloud: bundled image, run via --backend docker / --backend ssh:<target>
iperf3: bundled image, run via --backend k8s
dig: DNS probe internalised via miekg/dns

A stock dev box with only terraform installed should produce exit 0 and zero warnings.

Pass --target <name> to additionally probe an SSH target (runs whoami). Pass --backend k8s | ssh:<target> for per-backend prereq checks.

Exits non-zero only when a required check fails (warnings don’t block).
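
The required-versus-informational split can be sketched as below. This is an illustration under assumptions, not the tool’s implementation: check and doctorExit are hypothetical names, and PATH lookup is injected so the logic can be exercised without touching the host.

```go
package main

import (
	"fmt"
	"os/exec"
)

// check describes one prerequisite; only required ones can fail the run.
type check struct {
	tool     string
	required bool
}

// doctorExit returns exit code 1 when a required tool is missing.
// Informational tools never affect the exit code.
func doctorExit(checks []check, onPath func(tool string) bool) (code int, failed []string) {
	for _, c := range checks {
		if c.required && !onPath(c.tool) {
			failed = append(failed, c.tool)
			code = 1
		}
	}
	return code, failed
}

func main() {
	checks := []check{
		{tool: "terraform", required: true},
		{tool: "kubectl"}, {tool: "ibmcloud"}, {tool: "iperf3"}, {tool: "dig"},
	}
	code, failed := doctorExit(checks, func(tool string) bool {
		_, err := exec.LookPath(tool)
		return err == nil
	})
	fmt.Println("exit code:", code, "missing required:", failed)
}
```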

Flags

Flag	Type	Default	Description
`--backend`	`string`	—	additionally run per-backend checks: k8s \| ssh:`<target>`
`--target`	`string`	—	additionally probe the named SSH target with `whoami`

`roksbnkctl down`

Destroy everything in the workspace — terraform destroy

roksbnkctl down [flags]

Flags

Flag	Type	Default	Description
`--auto`	`bool`	`false`	skip the destroy confirmation
`--var-file`	`stringArray`	`[]`	extra TF var-file (repeatable; later files override earlier)

`roksbnkctl exec`

Run a single command with cluster context loaded

roksbnkctl exec [command...]

`roksbnkctl get`

Get one or more resources (pods, nodes, services, CRDs, …)

roksbnkctl get <resource> [name] [-n <ns> | -A] [-l <selector>] [-o <fmt>] [flags]

Fetches Kubernetes resources via client-go, no host kubectl required.

The resource argument accepts plurals, singulars, and short names from RESTMapper (pods/pod/po, services/svc, deployments/deploy). Multiple types can be comma-separated:

roksbnkctl k get pods,services -n f5-bnk
roksbnkctl k get nodes -o yaml
roksbnkctl k get pods -A -l app.kubernetes.io/name=f5-lifecycle-operator
roksbnkctl k get pod my-pod -n default -o jsonpath='{.status.phase}'

CRDs work via dynamic discovery without a hardcoded list — roksbnkctl k get cneinstances resolves the same way kubectl does.

Flags

Flag	Type	Default	Description
`--all-namespaces` / `-A`	`bool`	`false`	list across all namespaces
`--namespace` / `-n`	`string`	—	namespace scope (default: current-context’s namespace)
`--output` / `-o`	`string`	—	output format: yaml \| json \| wide \| name \| jsonpath=… \| go-template=…
`--selector` / `-l`	`string`	—	label selector (e.g. ‘app=foo,tier!=cache’)

`roksbnkctl ibmcloud`

Passthrough to local ibmcloud with workspace API key + region loaded

roksbnkctl ibmcloud [args...]

`roksbnkctl init`

Interactive setup; writes the workspace config.yaml

roksbnkctl init [flags]

roksbnkctl init walks through the prompts (region, resource group, cluster, BNK version) and writes ~/.roksbnkctl/<workspace>/config.yaml.

On first run with no -w flag, creates and uses the ‘default’ workspace. Re-run with --upgrade-tf to bump the pinned Terraform source to its latest release.

Flags

Flag	Type	Default	Description
`--tf-source`	`string`	—	override TF source (path or URL); pinned into config.yaml
`--upgrade-tf`	`bool`	`false`	resolve and pin the latest TF release into config.yaml

`roksbnkctl install`

Copy the running roksbnkctl binary into a directory on PATH

roksbnkctl install [flags]

Install the roksbnkctl binary you’re currently running into a directory on $PATH so you can invoke it as roksbnkctl from any working directory.

Default destination, in order of preference:
$HOME/.local/bin   (preferred; typically writable without sudo)
$HOME/bin          (older convention; still on PATH for some setups)
/usr/local/bin     (system-wide; usually needs sudo)

Override the destination with --dir.

Idempotent: if the running binary already lives at the destination, prints a message and exits 0. Use --force to overwrite (useful right after a local rebuild that landed at the install path).
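
One way to read the preference order is “take the first candidate that is already on PATH”. The sketch below implements that reading; the selection rule and the fallback to the first candidate are assumptions of this example, and pickInstallDir is a hypothetical name. It assumes Unix-style paths.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// pickInstallDir walks the documented preference order and returns the
// first candidate that is already on PATH. Falling back to the first
// candidate when none is on PATH is an assumption of this sketch.
func pickInstallDir(home, pathEnv string) string {
	candidates := []string{
		filepath.Join(home, ".local", "bin"),
		filepath.Join(home, "bin"),
		"/usr/local/bin",
	}
	onPath := make(map[string]bool)
	for _, dir := range filepath.SplitList(pathEnv) {
		onPath[filepath.Clean(dir)] = true
	}
	for _, dir := range candidates {
		if onPath[dir] {
			return dir
		}
	}
	return candidates[0]
}

func main() {
	fmt.Println(pickInstallDir(os.Getenv("HOME"), os.Getenv("PATH")))
}
```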

Examples:
roksbnkctl install                              # default: ~/.local/bin
roksbnkctl install --dir ~/bin                  # specific user dir
sudo roksbnkctl install --dir /usr/local/bin    # system-wide

Note: this is distinct from roksbnkctl self update, which pulls the latest GitHub release tarball over the network.

Flags

Flag	Type	Default	Description
`--dir`	`string`	—	destination directory (default: ~/.local/bin or /usr/local/bin)
`--force`	`bool`	`false`	overwrite even if destination resolves to the running binary

`roksbnkctl k`

Kubernetes verbs (kubectl-internalised; no host kubectl required)

roksbnkctl k <verb> runs the BNK-relevant kubectl/oc verb subset natively in-process via client-go, with no host kubectl/oc binary required. Output formatting matches kubectl byte-for-byte for -o yaml/json/wide/name/jsonpath/go-template.

Verbs:
k get            list/show resources
k describe       human-friendly resource detail (delegates to kubectl/pkg/describe)
k apply          server-side apply for files, dirs, kustomize bases, or stdin
k delete         delete with cascade + grace period control
k logs           pod or component logs (extends roksbnkctl logs)
k exec           exec into a pod via SPDY
k port-forward   forward a local port to a pod via SPDY

The existing roksbnkctl kubectl / roksbnkctl oc passthroughs remain as escape hatches for verbs not internalised here (edit, patch, rollout, scale, etc.) — they require kubectl/oc on PATH.

`roksbnkctl k apply`

Server-side apply YAML/JSON manifests, directories, or kustomize bases

roksbnkctl k apply -f <file-or-dir> [-n <ns>] [--force] [flags]

Server-side apply with field-manager ‘roksbnkctl’.

-f <file>   single YAML/JSON file (multi-doc YAML supported)
-f <dir>    directory: kustomization.yaml-detected → krusty build; otherwise recursive *.yaml / *.yml
-f -        stdin (multi-doc YAML)

--force passes through to SSA’s force-conflicts flag, identical to kubectl apply --server-side --force-conflicts.

Examples:

roksbnkctl k apply -f deploy.yaml -n f5-bnk
roksbnkctl k apply -f manifests/
cat deploy.yaml | roksbnkctl k apply -f -

Flags

Flag	Type	Default	Description
`--filename` / `-f`	`string`	—	file, directory, or ‘-’ for stdin
`--force`	`bool`	`false`	force-conflicts on server-side apply (kubectl apply --force-conflicts)
`--namespace` / `-n`	`string`	—	namespace for namespaced resources without an explicit namespace field

← back to roksbnkctl k

`roksbnkctl k delete`

Delete resources by name or label selector

roksbnkctl k delete <resource> [name] [-n <ns> | -A] [-l <selector>] [--force] [--grace-period N] [--cascade orphan|background|foreground] [flags]

Deletes resources via the dynamic client. Cascade options match kubectl’s:

--cascade=background   delete the object; controller cleans dependents async (default)
--cascade=foreground   block until dependents are gone
--cascade=orphan       delete only the object, leave dependents

Examples:

roksbnkctl k delete pod my-pod -n f5-bnk
roksbnkctl k delete pods -l app=stale --force --grace-period=0
roksbnkctl k delete deployment foo --cascade=foreground

Flags

Flag	Type	Default	Description
`--all-namespaces` / `-A`	`bool`	`false`	delete across all namespaces
`--cascade`	`string`	`background`	cascade: orphan\|background\|foreground
`--force`	`bool`	`false`	force-delete: implies --grace-period=0 unless overridden
`--grace-period`	`int`	`-1`	graceful termination period (seconds); -1 = use resource default
`--namespace` / `-n`	`string`	—	namespace scope
`--selector` / `-l`	`string`	—	label selector
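
The interaction between --force and --grace-period in the table above can be sketched as a small resolver. effectiveGracePeriod is a hypothetical name; the pointer return mirrors the optional grace-period field in the Kubernetes delete options, where nil means “use the resource default”.

```go
package main

import "fmt"

// effectiveGracePeriod resolves the two flags into one optional value:
// an explicit non-negative --grace-period always wins, --force alone
// implies 0, and -1 without --force leaves the resource default (nil).
func effectiveGracePeriod(force bool, gracePeriod int64) *int64 {
	if gracePeriod >= 0 {
		return &gracePeriod
	}
	if force {
		zero := int64(0)
		return &zero
	}
	return nil
}

func describe(p *int64) string {
	if p == nil {
		return "resource default"
	}
	return fmt.Sprintf("%ds", *p)
}

func main() {
	fmt.Println(describe(effectiveGracePeriod(false, -1)))
	fmt.Println(describe(effectiveGracePeriod(true, -1)))
	fmt.Println(describe(effectiveGracePeriod(true, 30)))
}
```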

← back to roksbnkctl k

`roksbnkctl k describe`

Show detailed human-readable resource info (events, conditions, related objects)

roksbnkctl k describe <resource> [name] [-n <ns> | -A] [-l <selector>] [flags]

Delegates to k8s.io/kubectl/pkg/describe — the same library kubectl/oc use internally, so output is byte-equivalent.

Examples:

roksbnkctl k describe pod my-pod -n f5-bnk
roksbnkctl k describe nodes
roksbnkctl k describe deployment f5-lifecycle-operator -n f5-bnk

Flags

Flag	Type	Default	Description
`--all-namespaces` / `-A`	`bool`	`false`	describe across all namespaces
`--namespace` / `-n`	`string`	—	namespace scope
`--selector` / `-l`	`string`	—	label selector
`--show-events`	`bool`	`true`	include the Events block (kubectl default: true)

← back to roksbnkctl k

`roksbnkctl k exec`

Exec into a pod via SPDY (kubectl-equivalent in-process)

roksbnkctl k exec <pod> [-n <ns>] [-c <container>] [-i] [-t] -- <cmd> [args...] [flags]

Opens an exec stream against the named pod over SPDY. The semantics mirror kubectl exec:

-i / --stdin       attach stdin to the remote process
-t / --tty         allocate a PTY (use for top, bash-style interactive work)
-c / --container   pick a container in a multi-container pod

Examples:

roksbnkctl k exec my-pod -- ls /tmp
roksbnkctl k exec my-pod -it -- bash
roksbnkctl k exec my-pod -c sidecar -- cat /etc/hostname

Note: this is the cluster-side exec. The host-side equivalent is ‘roksbnkctl exec <cmd>’ — distinct on purpose (PRD 02 §“Disambiguating roksbnkctl exec”, Option B).

Flags

Flag	Type	Default	Description
`--container` / `-c`	`string`	—	container name in a multi-container pod
`--namespace` / `-n`	`string`	—	namespace scope (default: default)
`--stdin` / `-i`	`bool`	`false`	attach stdin
`--tty` / `-t`	`bool`	`false`	allocate a PTY

← back to roksbnkctl k

`roksbnkctl k get`

Get one or more resources (pods, nodes, services, CRDs, …)

roksbnkctl k get <resource> [name] [-n <ns> | -A] [-l <selector>] [-o <fmt>] [flags]

Fetches Kubernetes resources via client-go, no host kubectl required.

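The resource argument accepts plurals, singulars, and short names from RESTMapper (pods/pod/po, services/svc, deployments/deploy). Multiple types can be comma-separated:

roksbnkctl k get pods,services -n f5-bnk
roksbnkctl k get nodes -o yaml
roksbnkctl k get pods -A -l app.kubernetes.io/name=f5-lifecycle-operator
roksbnkctl k get pod my-pod -n default -o jsonpath='{.status.phase}'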

CRDs work via dynamic discovery without a hardcoded list — roksbnkctl k get cneinstances resolves the same way kubectl does.

Flags

Flag	Type	Default	Description
`--all-namespaces` / `-A`	`bool`	`false`	list across all namespaces
`--namespace` / `-n`	`string`	—	namespace scope (default: current-context’s namespace)
`--output` / `-o`	`string`	—	output format: yaml \| json \| wide \| name \| jsonpath=… \| go-template=…
`--selector` / `-l`	`string`	—	label selector (e.g. ‘app=foo,tier!=cache’)

← back to roksbnkctl k

`roksbnkctl k logs`

Stream pod logs (kubectl-equivalent direct path)

roksbnkctl k logs <pod-name> [-n <ns>] [-c <container>] [-f] [--previous] [--since 5m] [--tail N] [flags]

Streams logs for a named pod. Differs from the top-level ‘roksbnkctl logs <component>’ in that this takes a literal pod name — matching kubectl’s surface — while the component variant maps a known BNK component name to a label selector.

Both forms honour -n, -c, -f, --previous, --since, --tail.

Flags

Flag	Type	Default	Description
`--container` / `-c`	`string`	—	container name in a multi-container pod
`--follow` / `-f`	`bool`	`false`	follow log output
`--namespace` / `-n`	`string`	—	namespace scope (default: default)
`--previous`	`bool`	`false`	fetch logs from the previous container instance
`--since`	`string`	—	only return logs newer than this duration (e.g. 5s, 2m, 1h)
`--tail`	`int64`	`-1`	tail the last N lines (-1 = full log)

← back to roksbnkctl k

`roksbnkctl k port-forward`

Forward local port(s) to a pod via SPDY

Aliases: port_forward

roksbnkctl k port-forward <pod> [-n <ns>] <local-port>[:<remote-port>] [...] [flags]

Forwards one or more local TCP ports to ports on the named pod. Equivalent to kubectl port-forward; signal handling closes the tunnel cleanly on Ctrl+C.

Port spec:

8080      local 8080 → pod 8080
8080:80   local 8080 → pod 80
:80       ephemeral local port → pod 80
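
The three port-spec forms can be parsed with a few lines of Go. This is a sketch of the parsing rule only, not the tool’s code; parsePortSpec and parsePort are hypothetical names, and a local port of 0 stands for “ephemeral”.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePort converts one side of a port spec and checks the TCP range.
func parsePort(s string) (int, error) {
	port, err := strconv.Atoi(s)
	if err != nil || port < 1 || port > 65535 {
		return 0, fmt.Errorf("invalid port %q", s)
	}
	return port, nil
}

// parsePortSpec handles "8080", "8080:80", and ":80". A local port of 0
// means the OS should pick an ephemeral port.
func parsePortSpec(spec string) (local, remote int, err error) {
	localStr, remoteStr, hasColon := strings.Cut(spec, ":")
	if !hasColon {
		remoteStr = localStr // "8080" forwards to the same pod port
	}
	if localStr != "" {
		if local, err = parsePort(localStr); err != nil {
			return 0, 0, err
		}
	}
	if remote, err = parsePort(remoteStr); err != nil {
		return 0, 0, err
	}
	return local, remote, nil
}

func main() {
	for _, spec := range []string{"8080", "8080:80", ":80"} {
		local, remote, err := parsePortSpec(spec)
		fmt.Println(spec, local, remote, err)
	}
}
```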

Examples:

roksbnkctl k port-forward my-pod 8080:80
roksbnkctl k port-forward my-pod -n f5-bnk 9090:9090 8080:80

Flags

Flag	Type	Default	Description
`--namespace` / `-n`	`string`	—	namespace scope (default: default)

← back to roksbnkctl k

`roksbnkctl kubeconfig`

Print the kubeconfig path (or contents with --export)

roksbnkctl kubeconfig [flags]

Flags

Flag	Type	Default	Description
`--cluster`	`string`	—	cluster name or ID for --download (default: workspace cluster.name)
`--download`	`bool`	`false`	fetch admin kubeconfig from IBM Cloud and save to ~/.kube/config
`--export`	`bool`	`false`	print kubeconfig contents instead of path

`roksbnkctl kubectl`

Passthrough to local kubectl with workspace KUBECONFIG loaded

roksbnkctl kubectl [args...]

`roksbnkctl logs`

Tail logs for a BNK component (flo, cis, cert-manager, cneinstance)

roksbnkctl logs <component> [flags]

Looks up the named BNK component, finds its pod(s) by label, and streams logs to stdout. With --follow, streams live. With multiple matching pods, tails the first and prints a hint about using roksbnkctl kubectl for per-pod selection.

The component → namespace/selector map is hardcoded for v1 against the upstream TF chart’s default labels; if your install renamed namespaces or relabelled, fall back to:

roksbnkctl kubectl logs -n <ns> <pod>

Flags

Flag	Type	Default	Description
`--container` / `-c`	`string`	—	container name in a multi-container pod
`--follow` / `-f`	`bool`	`false`	follow log output
`--namespace` / `-n`	`string`	—	override the component’s default namespace
`--previous`	`bool`	`false`	fetch logs from the previous container instance
`--since`	`string`	—	only return logs newer than this duration (e.g. 5s, 2m, 1h)
`--tail`	`int64`	`-1`	tail the last N lines (-1 = full log)

`roksbnkctl oc`

Passthrough to local oc with workspace KUBECONFIG loaded

roksbnkctl oc [args...]

`roksbnkctl ops`

Manage the in-cluster ops pod (k8s execution backend)

roksbnkctl ops manages the long-lived ops pod into which the k8s execution backend execs its tools. The pod runs in the roksbnkctl-ops namespace with a least-privilege ServiceAccount + ClusterRole, and gets its IBM Cloud API key from a Secret that is templated at apply time from the workspace credential.

Subcommands:
install     apply the embedded manifests (idempotent)
show        print pod + Secret + RBAC status
uninstall   delete every roksbnkctl.io/managed object created by install

`roksbnkctl ops install`

Apply (or update) the in-cluster ops fixtures

Applies the embedded namespaces, ServiceAccount, Secret, ClusterRole, ClusterRoleBinding, and ops Pod. Idempotent: re-running with a new API key updates the Secret and rolls the Pod.

← back to roksbnkctl ops

`roksbnkctl ops show`

Print the ops pod’s status, image, RBAC subject, and Secret rotation timestamp

← back to roksbnkctl ops

`roksbnkctl ops uninstall`

Delete the ops fixtures (namespaces, RBAC, Pod, Secret)

roksbnkctl ops uninstall [flags]

Flags

Flag	Type	Default	Description
`--confirm`	`bool`	`false`	actually perform the uninstall (otherwise prints what would be deleted)

← back to roksbnkctl ops

`roksbnkctl plan`

Read-only; show what roksbnkctl up would change

roksbnkctl plan [flags]

Flags

Flag	Type	Default	Description
`--var-file`	`stringArray`	`[]`	extra TF var-file (repeatable; later files override earlier)

`roksbnkctl self`

Manage the roksbnkctl binary itself

`roksbnkctl self update`

Pull the latest roksbnkctl release matching the host arch

Downloads the latest GitHub release tarball for this platform, verifies its SHA256 against the release’s checksums.txt, and replaces the running binary in place.
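
The verification step can be sketched as follows. It assumes checksums.txt uses the common “<hex digest>  <filename>” layout, which this book does not confirm; verifyChecksum, sha256Hex, and the asset name are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// sha256Hex returns the lowercase hex SHA256 digest of data.
func sha256Hex(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// verifyChecksum finds filename in a checksums listing (assumed format:
// "<hex digest>  <filename>" per line) and compares digests.
func verifyChecksum(data []byte, filename, checksums string) error {
	want := ""
	for _, line := range strings.Split(checksums, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[1] == filename {
			want = fields[0]
			break
		}
	}
	if want == "" {
		return fmt.Errorf("no checksum entry for %s", filename)
	}
	if !strings.EqualFold(want, sha256Hex(data)) {
		return fmt.Errorf("SHA256 mismatch for %s", filename)
	}
	return nil
}

func main() {
	tarball := []byte("pretend release tarball")
	name := "roksbnkctl_linux_amd64.tar.gz" // hypothetical asset name
	checksums := sha256Hex(tarball) + "  " + name + "\n"
	fmt.Println(verifyChecksum(tarball, name, checksums))            // matching digest
	fmt.Println(verifyChecksum([]byte("tampered"), name, checksums)) // mismatch
}
```

Checking the digest before replacing the binary means a truncated or tampered download fails early and the running binary is left untouched.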

Linux/macOS only — Windows can’t replace a running .exe in place; use scoop update roksbnkctl instead.

Requires write permission on the binary’s directory (typical install under /usr/local/bin needs sudo; brew/scoop should use their own upgrade verb).

← back to roksbnkctl self

`roksbnkctl shell`

Interactive subshell with KUBECONFIG, IBMCLOUD_API_KEY, and region pre-loaded

roksbnkctl shell drops into a $SHELL subshell with the workspace’s KUBECONFIG, IBMCLOUD_API_KEY, IC_API_KEY, and IBMCLOUD_REGION exported so locally-installed kubectl / oc / ibmcloud commands work without further setup. Exits when the subshell does.

`roksbnkctl status`

Summary of the workspace: cluster, components, last apply

roksbnkctl status reports a quick read of the workspace:

workspace name + region
configured cluster name
pinned Terraform source
last terraform apply timestamp (mtime of terraform.tfstate)
kubeconfig path (if any)
cluster reachability (node count + ready count)

v1.x will add per-BNK-component readiness (flo, cis, cert-manager, cneinstance) once the component-discovery shape is finalised.

`roksbnkctl targets`

Manage SSH targets used by --on

Targets are named SSH endpoints stored under the workspace’s targets: block. They become reachable via the persistent --on flag on commands like roksbnkctl exec, roksbnkctl shell, roksbnkctl kubectl, etc.

A jumphost target is auto-populated after a successful roksbnkctl up when the upstream HCL provisions one (testing_tgw_jumphost outputs).

`roksbnkctl targets add`

Add or update a target

roksbnkctl targets add <name> --host H --user U [--port P] [--key-path P | --key-source S] [flags]

Flags

Flag	Type	Default	Description
`--host`	`string`	—	host or IP
`--key-path`	`string`	—	path to a PEM private key
`--key-source`	`string`	—	key source — “agent” or “tf-output:`<name>`”
`--port`	`int`	`0`	ssh port (default 22)
`--user`	`string`	—	remote user

← back to roksbnkctl targets

`roksbnkctl targets list`

List all targets in the current workspace

← back to roksbnkctl targets

`roksbnkctl targets remove`

Remove a target

roksbnkctl targets remove <name>

← back to roksbnkctl targets

`roksbnkctl targets show`

Show detail for one target

roksbnkctl targets show <name>

← back to roksbnkctl targets

`roksbnkctl test`

Run deployment validation tests (default: all)

roksbnkctl test [suite] [flags]

roksbnkctl test runs deployment validation against the current workspace.

Suites:
connectivity   HTTP/HTTPS reachability of deployed BNK services
dns            DNS resolution of ingress and service hostnames
throughput     iperf3 measurements (north-south by default; v1.x)
all            run all of the above (default if no suite is specified)

Honors -o json with the roksbnkctl.v1 schema. Exit code 0 on all-pass, non-zero on any-fail — CI-friendly.

Flags

Flag	Type	Default	Description
`--insecure`	`bool`	`false`	skip TLS certificate validation (connectivity only)

`roksbnkctl test connectivity`

HTTP/HTTPS reachability against configured hosts

← back to roksbnkctl test

`roksbnkctl test dns`

DNS resolution probe (single-vantage, GSLB-compare, or workspace-driven)

roksbnkctl test dns [flags]

roksbnkctl test dns runs DNS probes against configured resolvers.

Two modes:

Workspace-driven (no flags): resolves each host listed under test.connectivity.extra_hosts via the std-lib resolver. Same as Sprint 0–4 behaviour; preserves CI invocations using the legacy roksbnkctl.v1 schema.

Flag-driven (any of --target/--type/--server/--gslb-compare set): uses the embedded miekg/dns probe (no external dig install needed). Single-vantage emits roksbnkctl.dns.v1.vantage; --gslb-compare emits roksbnkctl.dns.v1 with a gslb_divergence boolean across all configured backends (local + k8s + ssh:<targets>).

Use --backend local|k8s|ssh:<target> to pick a single vantage point; --gslb-compare fans out across all available vantages. See PRD 03 §“DNS probe (GSLB-aware)”.
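
A plausible reading of gslb_divergence is “at least two vantages returned different answer sets”. The sketch below implements that reading and the --require-divergence exit rule; the definition is an assumption, and gslbDivergence, answerKey, and compareExitCode are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// answerKey normalises one vantage's records so that ordering differences
// between resolvers do not count as divergence.
func answerKey(records []string) string {
	sorted := append([]string(nil), records...)
	sort.Strings(sorted)
	return strings.Join(sorted, ",")
}

// gslbDivergence reports whether the vantages disagree on the answer set.
func gslbDivergence(byVantage map[string][]string) bool {
	seen := make(map[string]bool)
	for _, records := range byVantage {
		seen[answerKey(records)] = true
	}
	return len(seen) > 1
}

// compareExitCode fails only when divergence was required but not observed.
func compareExitCode(divergent, requireDivergence bool) int {
	if requireDivergence && !divergent {
		return 1
	}
	return 0
}

func main() {
	answers := map[string][]string{
		"local":                {"192.0.2.10"},
		"k8s":                  {"198.51.100.20"},
		"ssh:cluster-jumphost": {"198.51.100.20"},
	}
	divergent := gslbDivergence(answers)
	fmt.Println(divergent, compareExitCode(divergent, true))
}
```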

Flags

Flag	Type	Default	Description
`--gslb-compare`	`bool`	`false`	fan out the probe across all configured backends (local + k8s + ssh:`<targets>`) and emit a comparison JSON with gslb_divergence
`--iterations`	`int`	`1`	number of repeated queries; >1 enables RTT distribution
`--require-divergence`	`bool`	`false`	with --gslb-compare: exit non-zero if NO divergence is observed (CI assertion that GSLB is doing something)
`--server`	`string`	—	resolver: `<ip>`[:`<port>`] \| system \| cluster \| `<named-from-workspace>` (default: system)
`--target`	`string`	—	DNS name to query (overrides workspace test.dns.default_target)
`--timeout`	`duration`	`2s`	per-query timeout
`--type`	`string`	`A`	record type: A \| AAAA \| CNAME \| MX \| NS \| TXT \| SRV \| SOA \| PTR \| CAA \| DS \| DNSKEY \| ANY

← back to roksbnkctl test

`roksbnkctl test list`

List available test suites

← back to roksbnkctl test

`roksbnkctl test throughput`

iperf3 throughput; deploys server pod automatically (v1.x)

roksbnkctl test throughput [flags]

Deploys an iperf3 server in the test namespace and runs the client either from the roksbnkctl host (--mode north-south, default) or from a second in-cluster pod (--mode east-west).

Not yet implemented — landing in v1.x once the internal/k8s client-go fixture lifecycle is wired.

Flags

Flag	Type	Default	Description
`--cross-node`	`bool`	`false`	force east-west client and server onto different nodes
`--keep`	`bool`	`false`	leave the iperf3 server pod running after the test
`--mode`	`string`	`north-south`	throughput mode: north-south \| east-west

← back to roksbnkctl test

`roksbnkctl tfvars`

Emit the upstream TF’s terraform.tfvars.example for editing

roksbnkctl tfvars [flags]

Resolves the workspace’s pinned TF source (downloading the tarball if not yet cached) and writes its terraform.tfvars.example as a starting point you can edit and pass to roksbnkctl up.

Default writes to ./terraform.tfvars in the current directory. Pass -o <path> to write elsewhere, or -o - to print to stdout.

Refuses to overwrite an existing destination unless --force is set.

Workflow:
roksbnkctl init                               # pins a TF source
roksbnkctl tfvars                             # writes ./terraform.tfvars from the upstream example
$EDITOR ./terraform.tfvars
roksbnkctl up --var-file ./terraform.tfvars

Flags

Flag	Type	Default	Description
`--force`	`bool`	`false`	overwrite the destination if it already exists
`--output` / `-o`	`string`	`./terraform.tfvars`	destination file (or - for stdout)

`roksbnkctl up`

Provision (or attach) and deploy BNK — terraform plan + apply

roksbnkctl up [flags]

roksbnkctl up validates credentials, resolves the pinned Terraform source, runs plan, and (after confirmation, unless --auto) applies. Idempotent and resumable: a partial failure is recovered by re-running ‘roksbnkctl up’.

Flags

Flag	Type	Default	Description
`--auto`	`bool`	`false`	skip the confirmation prompt before apply
`--no-kubeconfig`	`bool`	`false`	skip the post-apply admin kubeconfig fetch
`--tf-source`	`string`	—	override TF source for this run only
`--var-file`	`stringArray`	`[]`	extra TF var-file (repeatable; later files override earlier)

`roksbnkctl version`

Print version, commit, and build date

`roksbnkctl workspaces`

Manage roksbnkctl workspaces (per-environment config + state bundles)

Aliases: ws

Each workspace lives under ~/.roksbnkctl/<name>/ with its own config.yaml and state. The current_workspace pointer in ~/.roksbnkctl/config.yaml decides which one commands run against; -w/--workspace overrides for one invocation.

`roksbnkctl workspaces current`

Print the current workspace name

← back to roksbnkctl workspaces

`roksbnkctl workspaces delete`

Delete a workspace (refuses if state is non-empty unless --force)

roksbnkctl workspaces delete <name> [flags]

Flags

Flag	Type	Default	Description
`--force`	`bool`	`false`	delete even if Terraform state lists provisioned resources

← back to roksbnkctl workspaces

`roksbnkctl workspaces list`

List workspaces and their states

← back to roksbnkctl workspaces

`roksbnkctl workspaces new`

Create a new (empty) workspace skeleton — run roksbnkctl init -w <name> to populate

roksbnkctl workspaces new <name>

← back to roksbnkctl workspaces

`roksbnkctl workspaces use`

Set the current workspace pointer

roksbnkctl workspaces use <name>


Configuration reference

Field-by-field schema reference for the workspace config.yaml. Source of truth is the Workspace struct in internal/config/workspace.go; this chapter is the human-readable rendering of those tags.

Chapter 12 — Workspace config is the teaching chapter; this one is the lookup chapter. Use chapter 12 to learn the shape, use this one to look up the type of a specific field.

File location and lifecycle

Property	Value
Path	`~/.roksbnkctl/<workspace>/config.yaml`
Default workspace	`default` (auto-created on first run)
Overridable home	`ROKSBNKCTL_HOME` env var (defaults to `~/.roksbnkctl/`)
Mode	`0644`
Created by	`roksbnkctl init`
Updated by	`roksbnkctl init --upgrade-tf`, `roksbnkctl kubeconfig --download`, hand-editing
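
The path rule in the table can be sketched in a few lines. This is a minimal illustration, not the tool's implementation; the function name workspace_config_path is hypothetical, and only the ROKSBNKCTL_HOME variable and the directory layout come from the table above.

```python
import os
from pathlib import Path


def workspace_config_path(workspace="default", env=None):
    """Return the expected config.yaml path for a workspace (illustrative helper)."""
    env = os.environ if env is None else env
    # ROKSBNKCTL_HOME overrides the home directory; otherwise ~/.roksbnkctl/ is used.
    home = env.get("ROKSBNKCTL_HOME") or str(Path.home() / ".roksbnkctl")
    return Path(home) / workspace / "config.yaml"


print(workspace_config_path("demo", env={"ROKSBNKCTL_HOME": "/tmp/rbk"}))
```

Passing env explicitly keeps the helper easy to exercise without touching the real environment.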

The file is hand-editable; YAML is parsed with gopkg.in/yaml.v3 so anchors and aliases work but are not idiomatic for this file. Plaintext credentials in any of the regex-matched secret fields (api_key, apikey, password, token, secret_access_key, hmac_secret) are rejected at load time — the file fails to parse with a clear error. Base64-encoded credentials in ibmcloud.api_key_b64 are allowed (the field name doesn’t match the rejection regex). See Chapter 14.
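
To make the load-time rejection concrete, here is a rough sketch of how such a check could work on an already-parsed config. The exact regex roksbnkctl uses is not shown in this chapter, so the anchored pattern below is an assumption built from the listed field names; find_plaintext_secrets is a hypothetical helper name.

```python
import re

# Assumed pattern: anchored on the full key name, so api_key_b64 does not match.
SECRET_KEY = re.compile(r"(api_key|apikey|password|token|secret_access_key|hmac_secret)")


def find_plaintext_secrets(node, path=""):
    """Return dotted paths of non-empty string values stored under secret-looking keys."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}.{key}" if path else str(key)
            if SECRET_KEY.fullmatch(str(key)) and isinstance(value, str) and value:
                hits.append(here)
            else:
                hits.extend(find_plaintext_secrets(value, here))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_plaintext_secrets(item, f"{path}[{index}]"))
    return hits
```

A loader built on this idea would refuse the file whenever the returned list is non-empty and name the offending paths in the error.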

Top-level structure

ibmcloud:        # required
cluster:         # required
bnk:             # optional; populates upstream HCL bnk variables
test:            # optional; populates test.* settings
tf_source:       # optional; defaults to embedded if omitted
cos:             # optional; supply-chain auto-upload
targets:         # optional; populated automatically by up's post-apply hook
exec:            # optional; per-tool default-backend map

The order of the top-level keys in the file doesn’t matter, because the top level is a YAML mapping and mappings are unordered. The order shown above is the canonical render order produced by roksbnkctl init.

`ibmcloud:` block

ibmcloud:
  region: ca-tor
  resource_group: default
  api_key_source: keychain
  api_key_b64: <base64>

Field	Type	Default	Allowed	Notes
`region`	string	— (prompted by `init`)	any IBM Cloud region: `us-south`, `us-east`, `ca-tor`, `eu-de`, `eu-gb`, `jp-tok`, `au-syd`, etc.	The IBM Cloud region for all cluster + COS resources. Crosses module boundaries — must match the upstream HCL’s `ibmcloud_cluster_region`.
`resource_group`	string	`default`	any RG name in the account	The resource group cluster + COS resources are provisioned into.
`api_key_source`	string	(resolver chain runs)	`env` \| `keychain` \| `config` \| `prompt`	Pins the resolver to a single source rather than walking the chain. Set explicitly when you want predictable behaviour in CI. See Chapter 14 §“Pinning a single source”.
`api_key_b64`	string	—	base64-encoded API key	Obfuscation, not encryption — anyone with file-read access decodes instantly. For single-user dev only; never commit. The field name deliberately doesn’t match the plaintext-secret rejection regex.
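
The resolver behaviour described in the table (walk the chain, or pin a single source) can be sketched as follows. This is an illustration under assumptions: the real resolver lives in the Go binary, and the names resolve_api_key, from_config, and the sources mapping are hypothetical. Each source is modelled as a callable so the sketch does not have to guess at environment variable or keychain entry names.

```python
import base64

CHAIN = ("env", "keychain", "config", "prompt")


def resolve_api_key(sources, pinned=None):
    """sources maps a source name to a zero-argument callable returning a key or None."""
    order = (pinned,) if pinned else CHAIN
    for name in order:
        if name not in CHAIN:
            raise ValueError(f"unknown api_key_source: {name}")
        lookup = sources.get(name)
        key = lookup() if lookup else None
        if key:
            return name, key
    raise LookupError("no API key found in: " + ", ".join(order))


def from_config(api_key_b64):
    """Decode ibmcloud.api_key_b64; base64 is reversible by anyone who can read the file."""
    return base64.b64decode(api_key_b64).decode("utf-8") if api_key_b64 else None
```

The from_config helper also shows why api_key_b64 is obfuscation only: decoding needs no secret at all.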

`cluster:` block

cluster:
  create: true
  name: tf-openshift-cluster
  openshift_version: "4.18"
  workers_per_zone: 1

Field	Type	Default	Allowed	Notes
`create`	bool	`true`	`true` \| `false`	`true` provisions a new ROKS cluster; `false` attaches to an existing one (set `name` to the existing cluster’s name or ID).
`name`	string	— (prompted by `init`)	RFC 1123 DNS label	The cluster name. Used as the OpenShift cluster identity and as the resource group disambiguator.
`openshift_version`	string	`4.18`	any version IBM Cloud’s catalog accepts	Pinned to a minor (`4.18`) rather than patch — IBM ships continuous patch updates within a minor. Leave empty for “latest”.
`workers_per_zone`	integer	`1`	1+	Worker nodes provisioned per availability zone. Multiply by the zone count (typically 3) for the total cluster size. BNK needs ≥1 worker; production deployments use 2-3 per zone.
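
Two of the rules above are easy to check by hand: the name must be an RFC 1123 DNS label, and the total cluster size is workers_per_zone times the zone count. A small sketch, assuming the lowercase label convention that Kubernetes applies; the helper names are hypothetical.

```python
import re

# RFC 1123 label: 1-63 characters, lowercase alphanumerics or '-', alphanumeric at both ends.
DNS_LABEL = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def is_dns_label(name):
    return DNS_LABEL.fullmatch(name) is not None


def total_workers(workers_per_zone, zone_count=3):
    """Cluster size = workers per zone times number of zones (3 is the typical count)."""
    if workers_per_zone < 1 or zone_count < 1:
        raise ValueError("workers_per_zone and zone_count must be at least 1")
    return workers_per_zone * zone_count
```

For example, workers_per_zone: 2 in a three-zone region gives total_workers(2), which is six worker nodes.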

`bnk:` block

bnk:
  cneinstance_size: Small
  far_repo_url: repo.f5.com
  manifest_version: 2.3.0-3.2598.3-0.0.170

Field	Type	Default	Allowed	Notes
`cneinstance_size`	string	`Small`	`Small` \| `Medium` \| `Large`	Sizing for the deployed CNE Instance. Renders into the upstream HCL `cneinstance_deployment_size` variable.
`far_repo_url`	string	`repo.f5.com`	URL of a Docker-compatible image registry	The image registry FLO pulls FAR container images from. Override for air-gapped installs pointing at a local mirror.
`manifest_version`	string	`2.3.0-3.2598.3-0.0.170`	a published `f5-bigip-k8s-manifest` chart version	Pins the FLO + CIS versions transitively (both are extracted from the manifest chart).

All three fields are optional; omitting renders the HCL’s own defaults. See Chapter 13 — Terraform variables for the upstream defaults.
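
The omit-means-default rule can be sketched as a tiny renderer. Only the cneinstance_size to cneinstance_deployment_size mapping is stated in the table above; the other two variable names are assumptions inferred from the matching names in the Terraform variable reference, and render_bnk_tfvars is a hypothetical helper.

```python
import json

BNK_TO_HCL = {
    "cneinstance_size": "cneinstance_deployment_size",    # stated in the table above
    "far_repo_url": "far_repo_url",                       # assumed: same name upstream
    "manifest_version": "f5_bigip_k8s_manifest_version",  # assumed: inferred from the description
}


def render_bnk_tfvars(bnk):
    """Render only the bnk fields that are set; omitted fields fall back to HCL defaults."""
    lines = []
    for field, variable in BNK_TO_HCL.items():
        value = bnk.get(field)
        if value is None or value == "":
            continue  # omitted: nothing is written, so the upstream default applies
        # json.dumps yields a double-quoted string, which is valid tfvars syntax for plain strings.
        lines.append(f"{variable} = {json.dumps(value)}")
    return "\n".join(lines)
```

An empty bnk block renders to an empty string, which is the whole point: nothing is pinned unless the workspace asks for it.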

`test:` block

test:
  throughput:
    image: ghcr.io/jgruberf5/roksbnkctl-tools-iperf3:v0.9.0
    duration: 30
    streams: 8
    default_mode: north-south
  connectivity:
    extra_hosts:
      - https://www.example.com/healthz
      - https://internal.bnk.local/status
  dns:
    resolvers:
      google: "8.8.8.8:53"
      cloudflare: "1.1.1.1:53"
      gslb-vip: "169.45.91.5:53"
    default_target: www.example.com

`test.throughput`

Field	Type	Default	Allowed	Notes
`image`	string	`networkstatic/iperf3:latest`	any iperf3 Docker image	The image used for both server pod and client Job. The default runs as root and fails on OpenShift’s `restricted-v2`; use the bundled image `ghcr.io/jgruberf5/roksbnkctl-tools-iperf3:<v>` for SCC-clean installs. See Chapter 22.
`duration`	integer	`30`	1-300 (seconds)	The iperf3 `-t` flag — test duration in seconds.
`streams`	integer	`8`	1-128	The iperf3 `-P` flag — parallel TCP streams.
`default_mode`	string	`north-south`	`north-south` \| `east-west`	Default `--mode` when not passed on the command line.
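
As an illustration of how these settings map onto iperf3 flags, the sketch below merges the coded defaults with workspace values, checks the allowed ranges, and builds a client command line. The function name and the way roksbnkctl actually assembles the command are assumptions; the -c, -t, and -P flags are standard iperf3 options.

```python
DEFAULTS = {"duration": 30, "streams": 8, "default_mode": "north-south"}


def iperf3_client_args(server, settings=None):
    """Build an iperf3 client argv from test.throughput settings (illustrative)."""
    cfg = {**DEFAULTS, **(settings or {})}
    if not 1 <= cfg["duration"] <= 300:
        raise ValueError("duration must be 1-300 seconds")
    if not 1 <= cfg["streams"] <= 128:
        raise ValueError("streams must be 1-128")
    if cfg["default_mode"] not in ("north-south", "east-west"):
        raise ValueError("default_mode must be north-south or east-west")
    # -c: run as client against server, -t: duration in seconds, -P: parallel streams.
    return ["iperf3", "-c", server, "-t", str(cfg["duration"]), "-P", str(cfg["streams"])]
```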

`test.connectivity`

Field	Type	Default	Allowed	Notes
`extra_hosts`	list of string	(empty)	URLs	Each URL is probed via HTTP GET; pass criterion is a 2xx response. The v1.0 shape is a bare list — no per-host method, expected-status, or TLS-trust override. Use `--insecure` (session-wide) for self-signed certs. See Chapter 20 §“Configuring extra_hosts”.
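
The pass criterion is simple enough to sketch with the standard library. This is not the tool's prober, only an approximation of the described behaviour; probe and passes are hypothetical names, and note that urllib follows redirects automatically, which the real probe may or may not do.

```python
import ssl
import urllib.error
import urllib.request


def passes(status):
    """A probe passes on any 2xx status."""
    return 200 <= status <= 299


def probe(url, insecure=False, timeout=5.0):
    """HTTP GET url and return (status, passed); status is None if no response arrived."""
    context = ssl.create_default_context()
    if insecure:
        # Rough equivalent of a session-wide --insecure: skip certificate verification.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # 4xx/5xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return None, False
    return status, passes(status)
```

Because the v1.0 shape is a bare list of URLs, the insecure switch applies to every host in the run rather than to one entry.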

`test.dns`

Field	Type	Default	Allowed	Notes
`resolvers`	map[string]string	(empty)	name → `<ip>[:<port>]`	Friendly-name aliases for `--server <name>`. Lets workspace config push GSLB VIP addresses out of the command line.
`default_target`	string	(empty)	DNS name	Default `--target` when not passed on the command line. Useful for “always probe this name”.
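
Alias lookup for --server can be pictured like this. A minimal sketch, assuming port 53 when none is given and handling only IPv4 addresses and hostnames; resolve_server is a hypothetical helper, not the tool's code.

```python
def resolve_server(arg, resolvers=None, default_port=53):
    """Map a --server argument to (host, port), consulting the resolvers map first."""
    value = (resolvers or {}).get(arg, arg)  # an alias wins; otherwise treat arg as a literal
    host, sep, port = value.partition(":")
    if not host:
        raise ValueError(f"invalid server value: {value!r}")
    return host, int(port) if sep else default_port


resolvers = {"google": "8.8.8.8:53", "gslb-vip": "169.45.91.5:53"}
print(resolve_server("gslb-vip", resolvers))
```

With the map in place, the GSLB VIP address lives in workspace config and the command line only carries the friendly name.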

`tf_source:` block

tf_source:
  type: embedded                   # or: github | local
  repo: jgruberf5/roksbnkctl-tf    # type: github only
  ref: v1.0.0                      # type: github only
  path: /path/to/checkout          # type: local only

Field	Type	Default	Allowed	Notes
`type`	string	`embedded`	`embedded` \| `github` \| `local`	Where the Terraform source comes from. `embedded` uses the HCL bundled into the binary at compile time via `//go:embed`. `github` downloads a tarball from a GitHub release. `local` points at a directory on disk.
`repo`	string	—	`owner/name` form	Required for `type: github`. The GitHub repo holding the HCL.
`ref`	string	—	a tag, branch, or SHA	Required for `type: github`. The release tag or git ref to fetch.
`path`	string	—	absolute or relative directory	Required for `type: local`. The on-disk directory containing `main.tf`.

Most users want embedded (the default). The github mode is for testing forks or pinning to an upstream tag that’s newer than the bundled one. The local mode is for active development on the HCL itself.
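
The per-type requirements in the table reduce to a small validation rule. The sketch below is illustrative only; validate_tf_source is a hypothetical name and the error wording is made up.

```python
REQUIRED = {"embedded": (), "github": ("repo", "ref"), "local": ("path",)}


def validate_tf_source(tf_source=None):
    """Return the effective type, raising if a field required for that type is missing."""
    cfg = dict(tf_source or {})
    kind = cfg.get("type") or "embedded"  # an absent block or type is treated as embedded
    if kind not in REQUIRED:
        raise ValueError(f"tf_source.type must be one of {sorted(REQUIRED)}, got {kind!r}")
    missing = [field for field in REQUIRED[kind] if not cfg.get(field)]
    if missing:
        raise ValueError(f"tf_source.type={kind} requires: {', '.join(missing)}")
    return kind
```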

`cos:` block

cos:
  instance: bnk-orchestration
  bucket: bnk-schematics-resources
  upload:
    - source: ./local/f5-far-auth-key.tgz
      key: f5-far-auth-key.tgz
    - source: ./local/trial.jwt
      key: trial.jwt

Field	Type	Default	Allowed	Notes
`instance`	string	—	COS instance name or CRN	The instance the supply-chain bucket lives on. Names are resolved via Resource Controller at runtime.
`bucket`	string	—	S3 bucket name	The bucket within the instance.
`upload`	list of `{source, key}`	(empty)	host path → bucket key	Pre-flight uploads run before `roksbnkctl up`. Idempotent — re-running overwrites the bucket objects.

See Chapter 25 — COS supply chain management for the full surface.

`targets:` block

targets:
  jumphost:
    host: 169.45.91.10
    port: 22
    user: ubuntu
    key_path: /path/to/private/key.pem      # one of key_path
    key_source: tf-output:jumphost_shared_key  # ...or key_source

The top-level value is a map; the key is the target name (jumphost, eu-bastion, etc.). Each entry:

Field	Type	Default	Allowed	Notes
`host`	string	—	hostname or IP	The SSH endpoint. IPv6 literals must be unbracketed (the SSH client brackets internally).
`port`	integer	`22`	1-65535	SSH port.
`user`	string	—	a username on the target	Typically `ubuntu` for HCL-provisioned jumphosts (cloud-init writes the user); `root` for direct-IBM-Cloud Linux VSIs.
`key_path`	string	—	a path to a PEM file	One of `key_path` or `key_source` is required. Path to the PEM-encoded private key.
`key_source`	string	—	`agent` \| `tf-output:<output-name>`	Alternative to `key_path`. `agent` uses ssh-agent; `tf-output:<name>` reads the named terraform output as the PEM.

Auto-populated by roksbnkctl up post-apply for the upstream HCL’s TGW jumphost when testing_create_tgw_jumphost = true. See Chapter 15 — SSH targets and Chapter 16 — The --on flag.
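
The field rules for one targets entry can be sketched as a normalisation step. This is a hedged illustration: normalize_target is a hypothetical helper, and treating host and user as required is an assumption drawn from their having no default in the table.

```python
def normalize_target(name, entry):
    """Apply the port default and check the fields of one targets.<name> entry."""
    target = {"port": 22, **entry}
    for field in ("host", "user"):  # assumed required: the table lists no default
        if not target.get(field):
            raise ValueError(f"targets.{name}.{field} is required")
    if not 1 <= target["port"] <= 65535:
        raise ValueError(f"targets.{name}.port must be 1-65535")
    key_path = target.get("key_path")
    key_source = target.get("key_source")
    if not key_path and not key_source:
        raise ValueError(f"targets.{name} needs key_path or key_source")
    if key_source:
        prefix = "tf-output:"
        is_tf_output = key_source.startswith(prefix) and len(key_source) > len(prefix)
        if key_source != "agent" and not is_tf_output:
            raise ValueError(f"targets.{name}.key_source must be agent or tf-output:<output-name>")
    return target
```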

`exec:` block

exec:
  ibmcloud:  { backend: local }
  iperf3:    { backend: k8s }
  terraform: { backend: local }

Top-level value is a map keyed by tool name. Each entry has one field:

Field	Type	Default	Allowed	Notes
`backend`	string	`local` (varies by tool)	`local` \| `docker` \| `k8s` \| `ssh:<target>`	The default execution backend for this tool. A `--backend <value>` flag on the command line overrides the workspace config for that single invocation.

The per-tool defaults at v1.0:

Tool	Default backend	Supported backends
`terraform`	`local`	`local`, `docker` (k8s and ssh deferred to v1.x)
`ibmcloud`	`local`	`local`, `docker`, `k8s`, `ssh:<target>`
`iperf3`	`k8s`	`local`, `k8s`, `ssh:<target>` (docker rejected)
`dns`	`local`	`local`, `k8s`, `ssh:<target>` (docker rejected)

See Chapter 17 — Execution backends and Chapter 18 — Choosing a backend per tool.
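
Putting the override order and the support matrix together, backend selection could look roughly like this. The table data comes from this section; the function name pick_backend and the error text are hypothetical, and the real selection logic may differ.

```python
SSH = "ssh:<target>"

# tool -> (built-in default, supported backend kinds), copied from the v1.0 table above.
BUILTIN = {
    "terraform": ("local", {"local", "docker"}),
    "ibmcloud": ("local", {"local", "docker", "k8s", SSH}),
    "iperf3": ("k8s", {"local", "k8s", SSH}),
    "dns": ("local", {"local", "k8s", SSH}),
}


def pick_backend(tool, exec_cfg=None, flag=None):
    """Resolve the backend: --backend flag, then workspace exec config, then built-in default."""
    default, supported = BUILTIN[tool]
    backend = flag or (exec_cfg or {}).get(tool, {}).get("backend") or default
    # Collapse any ssh:<name> value to one kind; a bare "ssh:" with no target stays invalid.
    kind = SSH if backend.startswith("ssh:") and len(backend) > len("ssh:") else backend
    if kind not in supported:
        raise ValueError(f"backend {backend!r} is not supported for {tool}")
    return backend
```

For instance, pick_backend("iperf3", flag="docker") raises, matching the "docker rejected" note in the table.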

Field-by-field reference table

Every field that appears in internal/config/workspace.go, grouped by top-level block for quick lookup.

Path	Type	Default	Notes
`ibmcloud.region`	string	(prompted)	IBM Cloud region (`ca-tor`, `us-south`, …).
`ibmcloud.resource_group`	string	`default`	Resource group name.
`ibmcloud.api_key_source`	string	(chain)	`env` \| `keychain` \| `config` \| `prompt`.
`ibmcloud.api_key_b64`	string	(empty)	Base64-encoded API key. Obfuscation only.
`cluster.create`	bool	`true`	Provision new vs attach existing.
`cluster.name`	string	(prompted)	Cluster name.
`cluster.openshift_version`	string	`4.18`	OpenShift minor version.
`cluster.workers_per_zone`	integer	`1`	Workers per AZ.
`bnk.cneinstance_size`	string	`Small`	`Small` \| `Medium` \| `Large`.
`bnk.far_repo_url`	string	`repo.f5.com`	FAR image registry URL.
`bnk.manifest_version`	string	`2.3.0-3.2598.3-0.0.170`	f5-bigip-k8s-manifest chart version.
`test.throughput.image`	string	`networkstatic/iperf3:latest`	iperf3 image.
`test.throughput.duration`	integer	`30`	iperf3 `-t` (seconds).
`test.throughput.streams`	integer	`8`	iperf3 `-P` (parallel streams).
`test.throughput.default_mode`	string	`north-south`	Default mode.
`test.connectivity.extra_hosts`	[]string	(empty)	URLs to probe.
`test.dns.resolvers`	map[string]string	(empty)	Name → `<ip>[:<port>]`.
`test.dns.default_target`	string	(empty)	Default `--target` value.
`tf_source.type`	string	`embedded`	`embedded` \| `github` \| `local`.
`tf_source.repo`	string	(empty)	GitHub `owner/name`; required for `github`.
`tf_source.ref`	string	(empty)	Git ref; required for `github`.
`tf_source.path`	string	(empty)	Local directory; required for `local`.
`cos.instance`	string	(empty)	COS instance name or CRN.
`cos.bucket`	string	(empty)	Bucket name.
`cos.upload[].source`	string	—	Local file path.
`cos.upload[].key`	string	—	Bucket key.
`targets.<name>.host`	string	—	SSH host.
`targets.<name>.port`	integer	`22`	SSH port.
`targets.<name>.user`	string	—	SSH user.
`targets.<name>.key_path`	string	(empty)	PEM file path.
`targets.<name>.key_source`	string	(empty)	`agent` \| `tf-output:<name>`.
`exec.<tool>.backend`	string	`local` (varies by tool)	`local` \| `docker` \| `k8s` \| `ssh:<target>`.

Behaviour when fields are missing

roksbnkctl falls through three layers: workspace config → coded or upstream HCL default → fail.

Missing field	Behaviour
`ibmcloud.region`	`roksbnkctl init` prompts; programmatic loads error with “region is empty”.
`ibmcloud.resource_group`	Defaults to `default`.
`ibmcloud.api_key_source`	Resolver walks the full chain (env → keychain → config → prompt).
`ibmcloud.api_key_b64`	Skipped in the resolver chain.
`cluster.create`	Defaults to `true`.
`cluster.name`	`init` prompts; programmatic loads error.
`cluster.openshift_version`	Empty string passed to upstream HCL; the module picks the current default.
`cluster.workers_per_zone`	Falls through to `1` (upstream HCL default).
`bnk.*`	Each field is omitted from the generated `terraform.tfvars` and the upstream HCL default applies.
`test.throughput.*`	Coded defaults (30s, 8 streams, `networkstatic/iperf3:latest`) apply.
`test.connectivity.extra_hosts`	Connectivity probe runs with built-in URLs only.
`test.dns.resolvers`	`--server` requires a literal IP or `host:port`.
`test.dns.default_target`	`--target` becomes required on the command line.
`tf_source`	Treated as `type: embedded` (legacy default).
`cos`	Block omitted ⇒ no pre-flight uploads; FLO reads whatever’s already in the configured bucket.
`targets.*`	Block absent ⇒ `roksbnkctl --on jumphost` errors with “no target named jumphost”; auto-populated by `up` when terraform provisions a jumphost.
`exec.*`	Each tool falls back to its built-in default (typically `local`; `iperf3` is `k8s`).

How `--var-file` interacts with `config.yaml`

roksbnkctl up --var-file <file> layers user-supplied tfvars after the auto-rendered tfvars derived from config.yaml. Later wins, terraform-style. Multiple --var-file flags are accepted and stack in command-line order.

The auto-render path: config.yaml → typed Workspace struct → key/value tfvars → ~/.roksbnkctl/<ws>/state/terraform.tfvars. The user’s --var-file is appended to the terraform invocation as an additional -var-file=<path> argument. See Chapter 13 — Terraform variables for the layering rules.

A workspace-persistent override file is ~/.roksbnkctl/<ws>/terraform.tfvars.user. When present, it is auto-layered after the rendered tfvars and before any explicit --var-file. This is useful for “always pass this bigip_password value when applying this workspace” without putting the value in config.yaml, where the plaintext-secret check would reject it at load time.
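
The layering order can be sketched as the list of -var-file arguments handed to terraform. The file locations come from the text above; the helper name var_file_args is hypothetical and the real argument assembly happens inside the binary.

```python
from pathlib import Path


def var_file_args(workspace_dir, cli_var_files=()):
    """Return -var-file arguments in precedence order (later files win)."""
    ws = Path(workspace_dir)
    files = [ws / "state" / "terraform.tfvars"]  # auto-rendered from config.yaml
    user_file = ws / "terraform.tfvars.user"     # optional workspace-persistent overrides
    if user_file.exists():
        files.append(user_file)
    files.extend(Path(f) for f in cli_var_files)  # explicit --var-file flags, in CLI order
    return [f"-var-file={f}" for f in files]
```

Because terraform applies var-files left to right, a value in the last --var-file beats the same key in terraform.tfvars.user, which in turn beats the rendered file.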

Cross-references

Chapter 12 — Workspace config — the teaching counterpart to this lookup.
Chapter 13 — Terraform variables — how config.yaml fields render into tfvars.
Chapter 14 — Credentials and the resolver chain — the ibmcloud.api_key_* semantics.
Chapter 29 — Terraform variable reference — the upstream HCL variable surface that bnk.* and cluster.* populate.

Terraform variable reference

Auto-generated by go run ./tools/refgen/tfvars-md > book/src/29-terraform-variable-reference.md. Re-run on every terraform/variables.tf change.

Every variable below is settable via terraform.tfvars, -var, -var-file, or (for sensitive values) the corresponding TF_VAR_<name> environment variable. Variables whose Default column reads required have no default and must be set explicitly. See Chapter 13 for how roksbnkctl threads these through the workspace config.
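
For sensitive values, the TF_VAR_<name> route keeps secrets out of files on disk. A minimal sketch of building such an environment for a direct terraform run; terraform_env is a hypothetical helper, and how roksbnkctl itself passes secrets is covered in Chapter 13 rather than here.

```python
import os


def terraform_env(sensitive, base_env=None):
    """Return a copy of the environment with TF_VAR_<name> set for each sensitive value."""
    env = dict(os.environ if base_env is None else base_env)
    for name, value in sensitive.items():
        env[f"TF_VAR_{name}"] = value
    return env


# Example: subprocess.run(["terraform", "plan"], env=terraform_env({"ibmcloud_api_key": key}))
```

Terraform reads TF_VAR_ibmcloud_api_key as the value of the ibmcloud_api_key variable, so nothing sensitive needs to appear in terraform.tfvars.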

Root module variables

Source: terraform/variables.tf

Variable	Type	Default	Description	Sensitive
`ibmcloud_api_key`	`string`	required	IBM Cloud API key	yes
`ibmcloud_cluster_region`	`string`	`"ca-tor"`	IBM Cloud region for all cluster resources	no
`ibmcloud_resource_group`	`string`	`"default"`	IBM Cloud resource group name	no
`create_roks_cluster`	`bool`	`true`	Create a new ROKS cluster. When false, supply roks_cluster_id_or_name instead.	no
`roks_cluster_id_or_name`	`string`	`""`	ID or name of an existing ROKS cluster — used when create_roks_cluster = false	no
`create_roks_transit_gateway`	`bool`	`true`	Create Transit Gateway and VPC connections	no
`create_roks_registry_cos_instance`	`bool`	`true`	Create Cloud Object Storage instance for the OpenShift image registry	no
`roks_cluster_vpc_name`	`string`	`"tf-cluster-vpc"`	Name of the cluster VPC	no
`openshift_cluster_name`	`string`	`"tf-openshift-cluster"`	Name of the OpenShift cluster	no
`openshift_cluster_version`	`string`	`"4.18"`	OpenShift cluster version (e.g. 4.18). Leave empty to use the latest available.	no
`roks_workers_per_zone`	`number`	`1`	Number of worker nodes per availability zone	no
`roks_min_worker_vcpu_count`	`number`	`16`	Minimum vCPU count when auto-selecting the worker node flavor	no
`roks_min_worker_memory_gb`	`number`	`64`	Minimum memory in GB when auto-selecting the worker node flavor	no
`roks_cos_instance_name`	`string`	`"tf-openshift-cos-instance"`	Name of the COS instance for the OpenShift image registry	no
`roks_transit_gateway_name`	`string`	`"tf-tgw"`	Name of the Transit Gateway. Must reference an existing TGW when create_roks_transit_gateway = false and testing_create_tgw_jumphost = true.	no
`install_cert_manager`	`bool`	`true`	Install cert-manager. When false, cert_manager_namespace is passed directly to flo.	no
`cert_manager_namespace`	`string`	`"cert-manager"`	Kubernetes namespace for cert-manager	no
`cert_manager_version`	`string`	`"v1.17.3"`	cert-manager Helm chart version	no
`ibmcloud_cos_bucket_region`	`string`	`"us-south"`	IBM Cloud region where the COS bucket is located	no
`ibmcloud_cos_instance_name`	`string`	`"bnk-orchestration"`	IBM Cloud COS instance name	no
`ibmcloud_resources_cos_bucket`	`string`	`"bnk-schematics-resources"`	IBM Cloud COS bucket containing FAR auth key and JWT files	no
`deploy_bnk`	`bool`	`true`	Deploy BIG-IP Next for Kubernetes — creates flo, cne_instance, and license. When false all three modules are skipped.	no
`far_repo_url`	`string`	`"repo.f5.com"`	FAR repository URL for Docker and Helm images	no
`f5_bigip_k8s_manifest_version`	`string`	`"2.3.0-3.2598.3-0.0.170"`	Version of the f5-bigip-k8s-manifest chart (FLO and CIS versions are extracted from this)	no
`f5_cne_far_auth_file`	`string`	`"f5-far-auth-key.tgz"`	FAR auth key filename in the COS bucket (.tgz)	no
`f5_cne_subscription_jwt_file`	`string`	`"trial.jwt"`	Subscription JWT filename in the COS bucket — used by flo and license	no
`flo_namespace`	`string`	`"f5-bnk"`	Kubernetes namespace for the F5 Lifecycle Operator	no
`flo_utils_namespace`	`string`	`"f5-utils"`	Kubernetes namespace for F5 utility components — used by flo, cne_instance, and license	no
`bigip_username`	`string`	`"admin"`	BIG-IP username for the CIS controller	no
`bigip_password`	`string`	`"admin"`	BIG-IP password for the CIS controller	yes
`bigip_url`	`string`	`"192.168.1.245"`	BIG-IP URL for the CIS controller	no
`flo_trusted_profile_id`	`string`	`""`	IBM Cloud Trusted Profile ID created by flo — wired automatically from flo output; set here to override	no
`flo_cluster_issuer_name`	`string`	`""`	Kubernetes ClusterIssuer name created by flo — wired automatically from flo output; set here to override	no
`cneinstance_network_attachments`	`list(string)`	`["ens3-ipvlan-l2", "macvlan-conf"]`	Network attachment names for cne_instance — wired automatically from flo output; set here to override	no
`cneinstance_deployment_size`	`string`	`"Small"`	Deployment size for CNEInstance (Small, Medium, Large)	no
`cneinstance_gslb_datacenter_name`	`string`	`""`	GSLB datacenter name for CNEInstance (optional)	no
`license_mode`	`string`	`"connected"`	License operation mode (connected or disconnected)	no
`testing_create_tgw_jumphost`	`bool`	`true`	Create a jumphost in a client VPC connected to the cluster via the Transit Gateway	no
`testing_create_cluster_jumphosts`	`bool`	`false`	Create one jumphost per availability zone directly inside the cluster VPC	no
`testing_ssh_key_name`	`string`	`""`	Name of the IBM Cloud SSH key to inject into all jumphosts	no
`testing_jumphost_profile`	`string`	`""`	Instance profile for all jumphosts (leave empty to auto-select based on min_vcpu_count and min_memory_gb)	no
`testing_min_vcpu_count`	`number`	`4`	Minimum vCPU count when auto-selecting the jumphost instance profile	no
`testing_min_memory_gb`	`number`	`8`	Minimum memory in GB when auto-selecting the jumphost instance profile	no
`testing_create_client_vpc`	`bool`	`false`	Create a new client VPC for the TGW jumphost. When false, testing_client_vpc_name must reference an existing VPC.	no
`testing_client_vpc_name`	`string`	`"tf-testing-vpc"`	Name of the client VPC — created when testing_create_client_vpc = true, or looked up when false	no
`testing_client_vpc_region`	`string`	`"ca-tor"`	IBM Cloud region for the client VPC and TGW jumphost	no
`testing_tgw_jumphost_name`	`string`	`"tf-testing-jumphost-tgw"`	Name of the TGW-connected jumphost instance	no
`testing_cluster_jumphost_name_prefix`	`string`	`"tf-testing-jumphost-cluster"`	Name prefix for cluster jumphosts — zone name is appended (`<prefix>`-`<zone>`)	no
`kubeconfig_dir`	`string`	`"/work/.bnk/scratch/kubeconfig"`	Parent directory where ibm_container_cluster_config writes admin kubeconfigs. Each submodule appends its name as a subdir. Default is the bnk runner image’s /work mount; override for direct-on-host runs.	no
`scratch_dir`	`string`	`"/work/.bnk/scratch"`	Persistent scratch directory for FLO’s FAR/manifest cross-apply artifacts. Default is the bnk runner image’s /work mount; override for direct-on-host runs.	no

Module: `cert_manager`

Source: terraform/modules/cert_manager/variables.tf

Variable	Type	Default	Description	Sensitive
`ibmcloud_api_key`	`string`	required	IBM Cloud API Key	yes
`ibmcloud_cluster_region`	`string`	`"ca-tor"`	IBM Cloud region where the cluster resides	no
`ibmcloud_resource_group`	`string`	`"default"`	IBM Cloud Resource Group name (leave empty to use account default)	no
`roks_cluster_name_or_id`	`string`	required	Name or ID of the existing OpenShift ROKS cluster to deploy BNK onto	no
`cert_manager_namespace`	`string`	`"cert-manager"`	Kubernetes namespace for cert-manager	no
`cert_manager_version`	`string`	`"v1.17.3"`	cert-manager Helm chart version	no
`create_roks_cluster`	`bool`	`false`	When true, cluster is being created by roks_cluster — skip plan-time cluster credential fetch	no
`roks_cluster_dependency_id`	`string`	`null`	roks_cluster sentinel ID — when set, defers runtime_config fetch to apply time after roks_cluster completes	no
`kubeconfig_dir`	`string`	`"/work/.bnk/scratch/kubeconfig/cert_manager"`	Persistent, writable dir for ibm_container_cluster_config kubeconfig downloads. Defaults to a host-bind-mounted, module-scoped path under .bnk/scratch.	no

Module: `cne_instance`

Source: terraform/modules/cne_instance/variables.tf

Variable	Type	Default	Description	Sensitive
`ibmcloud_api_key`	`string`	required	IBM Cloud API Key	yes
`ibmcloud_cluster_region`	`string`	`"ca-tor"`	IBM Cloud region where the cluster resides	no
`ibmcloud_resource_group`	`string`	`"default"`	IBM Cloud Resource Group name (leave empty to use account default)	no
`roks_cluster_name_or_id`	`string`	required	Name or ID of the existing OpenShift ROKS cluster to deploy BNK onto	no
`far_repo_url`	`string`	`"repo.f5.com"`	FAR Repository URL for Docker and Helm registry	no
`flo_namespace`	`string`	`"f5-bnk"`	Namespace for F5 Lifecycle Operator	no
`flo_utils_namespace`	`string`	`"f5-utils"`	Namespace for F5 utility components	no
`f5_bigip_k8s_manifest_version`	`string`	`"2.3.0-3.2598.3-0.0.170"`	Version of f5-bigip-k8s-manifest chart - used by flo, cneinstance modules	no
`flo_trusted_profile_id`	`string`	`""`	IBM IAM Trusted Profile ID for provisioning VPC routes	no
`flo_cluster_issuer_name`	`string`	`""`	mTLS certificate issuer name	no
`cneinstance_deployment_size`	`string`	`"Small"`	Deployment size for CNEInstance (Small, Medium, Large)	no
`cneinstance_gslb_datacenter_name`	`string`	`""`	GSLB datacenter name for CNEInstance (optional)	no
`cneinstance_network_attachments`	`list(string)`	`["ens3-ipvlan-l2", "macvlan-conf"]`	The Multus Network Attachment Definitions for the CNEInstance TMM deployments	no
`create_roks_cluster`	`bool`	`false`	When true, cluster is being created by roks_cluster — skip plan-time cluster credential fetch	no
`roks_cluster_dependency_id`	`string`	`null`	roks_cluster sentinel ID — when set, defers runtime_config fetch to apply time after roks_cluster completes	no
`flo_dependency_id`	`string`	`null`	flo_ready sentinel ID — pass module.flo.flo_ready_id to defer cne_instance until flo completes and CRDs are registered	no
`deploy_bnk`	`bool`	`true`	Deploy BIG-IP Next for Kubernetes — when false the inner cneinstance module is disabled and no CNEInstance resources are created	no
`kubeconfig_dir`	`string`	`"/work/.bnk/scratch/kubeconfig/cne_instance"`	Persistent, writable dir for ibm_container_cluster_config kubeconfig downloads. Defaults to a host-bind-mounted, module-scoped path under .bnk/scratch.	no

Module: `flo`

Source: terraform/modules/flo/variables.tf

Variable	Type	Default	Description	Sensitive
`ibmcloud_api_key`	`string`	required	IBM Cloud API Key	yes
`ibmcloud_cluster_region`	`string`	`"ca-tor"`	IBM Cloud region where the cluster resides	no
`ibmcloud_resource_group`	`string`	`"default"`	IBM Cloud Resource Group name (leave empty to use account default)	no
`roks_cluster_name_or_id`	`string`	required	Name or ID of the existing OpenShift ROKS cluster to deploy BNK onto	no
`far_repo_url`	`string`	`"repo.f5.com"`	FAR Repository URL for Docker and Helm registry	no
`f5_bigip_k8s_manifest_version`	`string`	`"2.3.0-3.2598.3-0.0.170"`	Version of the f5-bigip-k8s-manifest chart (FLO/CIS versions are extracted from this)	no
`use_cos_bucket`	`bool`	`true`	Fetch FAR auth key and JWT from IBM Cloud Object Storage instead of local variables	no
`ibmcloud_cos_bucket_region`	`string`	`"us-south"`	IBM Cloud region where the COS bucket is located	no
`ibmcloud_cos_instance_name`	`string`	`"bnk-orchestration"`	IBM Cloud COS instance name	no
`ibmcloud_resources_cos_bucket`	`string`	`"bnk-schematics-resources"`	IBM Cloud COS bucket containing the FAR auth key and JWT files	no
`f5_cne_far_auth_file`	`string`	`"f5-far-auth-key.tgz"`	FAR auth key filename in the COS bucket (.tgz)	no
`f5_cne_subscription_jwt_file`	`string`	`"trial.jwt"`	Subscription JWT filename in the COS bucket	no
`flo_namespace`	`string`	`"f5-bnk"`	Namespace for F5 Lifecycle Operator	no
`flo_utils_namespace`	`string`	`"f5-utils"`	Namespace for F5 utility components	no
`cert_manager_namespace`	`string`	`"cert-manager"`	Kubernetes namespace for cert-manager - used by cert-manager, flo modules	no
`bigip_username`	`string`	`"admin"`	BIG-IP username for CIS controller login	no
`bigip_password`	`string`	`"admin"`	BIG-IP password for CIS controller login	yes
`bigip_url`	`string`	`"https://192.168.1.245"`	BIG-IP URL for CIS controller login	no
`create_roks_cluster`	`bool`	`false`	When true, cluster is being created by roks_cluster — skip plan-time cluster credential fetch	no
`roks_cluster_dependency_id`	`string`	`null`	roks_cluster sentinel ID — when set, defers runtime_config fetch to apply time after roks_cluster completes	no
`cert_manager_dependency_id`	`string`	`null`	cert_manager ready sentinel ID — when set, blocks flo inner module until cert-manager CRDs are available	no
`deploy_bnk`	`bool`	`true`	Deploy BIG-IP Next for Kubernetes — when false the inner flo module is disabled and no FLO resources are created	no
`kubeconfig_dir`	`string`	`"/work/.bnk/scratch/kubeconfig/flo"`	Persistent, writable dir for ibm_container_cluster_config kubeconfig downloads. Defaults to a host-bind-mounted, module-scoped path under .bnk/scratch.	no
`scratch_dir`	`string`	`"/work/.bnk/scratch"`	Persistent scratch directory for FAR/manifest cross-apply artifacts. Default is the bnk runner image’s /work mount.	no

Module: `license`

Source: terraform/modules/license/variables.tf

Variable	Type	Default	Description	Sensitive
`ibmcloud_api_key`	`string`	required	IBM Cloud API Key	yes
`ibmcloud_cluster_region`	`string`	`"ca-tor"`	IBM Cloud region where the cluster resides	no
`ibmcloud_resource_group`	`string`	`"default"`	IBM Cloud Resource Group name (leave empty to use account default)	no
`ibmcloud_cos_bucket_region`	`string`	`"us-south"`	IBM Cloud region where the COS bucket is located	no
`ibmcloud_cos_instance_name`	`string`	`"bnk-orchestration"`	IBM Cloud COS instance name	no
`ibmcloud_resources_cos_bucket`	`string`	`"bnk-schematics-resources"`	IBM Cloud COS bucket containing the FAR auth key and JWT files	no
`roks_cluster_name_or_id`	`string`	required	Name or ID of the existing OpenShift ROKS cluster to deploy BNK onto	no
`flo_utils_namespace`	`string`	`"f5-utils"`	Namespace for F5 utility components	no
`f5_cne_subscription_jwt_file`	`string`	`"trial.jwt"`	Subscription JWT filename in the COS bucket	no
`license_mode`	`string`	`"connected"`	License operation mode (connected or disconnected)	no
`create_roks_cluster`	`bool`	`false`	When true, cluster is being created by roks_cluster — skip plan-time cluster credential fetch	no
`roks_cluster_dependency_id`	`string`	`null`	roks_cluster sentinel ID — when set, defers runtime_config fetch to apply time after roks_cluster completes	no
`cneinstance_dependency_id`	`string`	`null`	cneinstance_ready_id from ws4 — when set, ensures License CRD is available before applying License CR	no
`deploy_bnk`	`bool`	`true`	Deploy BIG-IP Next for Kubernetes — when false the inner license module is disabled and no License resources are created	no
`kubeconfig_dir`	`string`	`"/work/.bnk/scratch/kubeconfig/license"`	Persistent, writable dir for ibm_container_cluster_config kubeconfig downloads. Defaults to a host-bind-mounted, module-scoped path under .bnk/scratch.	no

Module: `roks_cluster`

Source: terraform/modules/roks_cluster/variables.tf

Variable	Type	Default	Description	Sensitive
`ibmcloud_api_key`	`string`	required	IBM Cloud API key	yes
`ibmcloud_cluster_region`	`string`	required	IBM Cloud region for all cluster resources	no
`ibmcloud_resource_group`	`string`	`"default"`	IBM Cloud resource group name	no
`create_roks_cluster`	`bool`	`true`	Create a new ROKS cluster. When false, supply roks_cluster_id_or_name instead.	no
`roks_cluster_id_or_name`	`string`	`""`	ID or name of an existing ROKS cluster — used when create_roks_cluster = false	no
`create_roks_transit_gateway`	`bool`	`true`	Create Transit Gateway and VPC connections	no
`create_roks_registry_cos_instance`	`bool`	`true`	Create Cloud Object Storage instance for the OpenShift image registry	no
`roks_cluster_vpc_name`	`string`	`"tf-cluster-vpc"`	Name of the cluster VPC	no
`openshift_cluster_name`	`string`	`"tf-openshift-cluster"`	Name of the OpenShift cluster	no
`openshift_cluster_version`	`string`	`"4.18"`	OpenShift cluster version (e.g. 4.18)	no
`roks_workers_per_zone`	`number`	`1`	Number of worker nodes per availability zone	no
`roks_min_worker_vcpu_count`	`number`	`16`	Minimum vCPU count when auto-selecting the worker node flavor	no
`roks_min_worker_memory_gb`	`number`	`64`	Minimum memory in GB when auto-selecting the worker node flavor	no
`roks_cos_instance_name`	`string`	`"tf-openshift-cos-instance"`	Name of the COS instance for the OpenShift image registry	no
`roks_transit_gateway_name`	`string`	`"tf-tgw"`	Name of the Transit Gateway	no

Module: `testing`

Source: terraform/modules/testing/variables.tf

Variable	Type	Default	Description	Sensitive
`ibmcloud_api_key`	`string`	required	IBM Cloud API Key	yes
`ibmcloud_cluster_region`	`string`	`"ca-tor"`	IBM Cloud region where the referenced cluster resides	no
`ibmcloud_resource_group`	`string`	`""`	IBM Cloud Resource Group name (leave empty to use account default)	no
`roks_cluster_name_or_id`	`string`	required	Name or ID of the existing OpenShift ROKS cluster	no
`testing_create_tgw_jumphost`	`bool`	`true`	Create a jumphost in a client VPC and (optionally) connect it to the cluster via a Transit Gateway	no
`testing_create_cluster_jumphosts`	`bool`	`false`	Create one jumphost per availability zone directly inside the cluster VPC	no
`testing_ssh_key_name`	`string`	`""`	Name of the SSH key to inject into all jumphosts. Must exist in client_vpc_region (for TGW jumphost) and in ibmcloud_cluster_region (for cluster jumphosts)	no
`testing_jumphost_profile`	`string`	`""`	Instance profile for all jumphosts (leave empty to auto-select from min_vcpu_count and min_memory_gb)	no
`testing_min_vcpu_count`	`number`	`4`	Minimum vCPU count when auto-selecting the instance profile	no
`testing_min_memory_gb`	`number`	`8`	Minimum memory in GB when auto-selecting the instance profile	no
`testing_create_client_vpc`	`bool`	`false`	Create a new client VPC for the TGW jumphost. When false, client_vpc_name must reference an existing VPC	no
`testing_client_vpc_name`	`string`	`"tf-testing-vpc"`	Name of the client VPC — created when create_client_vpc = true, or looked up when create_client_vpc = false	no
`testing_client_vpc_region`	`string`	`"ca-tor"`	IBM Cloud region for the client VPC and TGW jumphost	no
`testing_transit_gateway_name`	`string`	`""`	Name of an existing Transit Gateway to connect the client VPC to (leave empty to skip TGW attachment)	no
`testing_tgw_jumphost_name`	`string`	`"tf-testing-jumphost-tgw"`	Name of the TGW-connected jumphost instance (used as prefix for subnet, gateway, security group, and floating IP)	no
`testing_cluster_jumphost_name_prefix`	`string`	`"tf-testing-jumphost-cluster"`	Name prefix for cluster jumphosts — zone name is appended (`<prefix>`-`<zone>`)	no
`roks_cluster_dependency_id`	`string`	`null`	roks_cluster sentinel ID — when set, defers cluster/TGW data source reads to apply time after roks_cluster completes	no
`create_roks_cluster`	`bool`	`false`	Set to true when the ROKS cluster is being created in this run — skips cluster-VPC-derived data sources that require a pre-existing cluster	no
`cluster_vpc_id`	`string`	`""`	ID of the cluster VPC — pass module.roks_cluster.roks_cluster_vpc_id directly; avoids deriving via worker-pool subnet chain which is deferred to apply time	no

Glossary

Plain-English definitions of the terms used across the book. Project-specific concepts, IBM-Cloud-specific products, OpenShift / Kubernetes admission concepts, and the F5 BIG-IP Next networking vocabulary all live here. Entries are deliberately one or two sentences; the deep-dive lives in the linked chapter where applicable.

A — D

api_key_b64 Base64-encoded IBM Cloud API key stored inline in the workspace config.yaml. Obfuscation, not encryption — anyone with file-read access decodes instantly. The field name deliberately doesn’t match the plaintext-secret rejection regex. See Chapter 14 §“Source 3”.

Backend (--backend) The execution context for a tool dispatch. One of local (os/exec on the host), docker (containerised), k8s (in-cluster ops pod or Job), or ssh:<target> (a registered SSH endpoint). See Chapter 17.

BNK BIG-IP Next for Kubernetes. F5’s Kubernetes-native CNF deployment of BIG-IP, made up of FLO (F5 Lifecycle Operator) + CNE Instance + License + CIS. The reason this CLI exists. See Chapter 1.

CIS Two unrelated CISes appear in this stack. Inside the cluster, CIS is F5’s Container Ingress Services — the F5 controller that watches Kubernetes Ingress + Route resources and programs the BIG-IP data plane. At the IBM Cloud account level, CIS is Cloud Internet Services — IBM’s DNS, CDN, WAF, and DDoS-protection product. Context disambiguates; when in doubt, “F5 CIS” vs “IBM Cloud CIS”.

ClusterIP (k8s) A Service type that gives a Service an internal cluster IP, reachable only from inside the cluster. Used by the throughput suite’s east-west mode.

ClusterRole (k8s) A cluster-scoped RBAC role granting verbs on resources. The ops pod’s least-privilege ClusterRole grants jobs.create in roksbnkctl-test but not pods.delete in default.

CNE Instance A Custom Resource defined by FLO. Represents one deployed instance of the BNK data plane (TMM pods + control plane). Sizing is Small/Medium/Large. See Chapter 10.

Cobra github.com/spf13/cobra — the Go CLI library roksbnkctl is built on. The command tree at internal/cli/ is a cobra command tree.

COS IBM Cloud Object Storage — S3-compatible object store. The BNK supply chain bucket lives on a COS instance. See Chapter 25.

cred resolver chain The ordered lookup for IBMCLOUD_API_KEY: env var → OS keychain → workspace api_key_b64 → interactive prompt. The chain stops at the first source that yields a non-empty value. See Chapter 14.

CRN Cloud Resource Name — IBM’s globally-unique resource identifier. Starts with crn:v1: and encodes account, region, service, and resource ID. Most roksbnkctl cos commands accept either a friendly name (resolved at runtime) or a CRN.

E — J

east-west Network direction term: traffic between two endpoints inside the cluster (pod-to-pod, service-to-service). The throughput suite’s --mode east-west measures CNI fabric throughput. See Chapter 22.

Embedded HCL The Terraform source tree compiled into the roksbnkctl binary via Go’s //go:embed directive. The default tf_source is embedded. Rebuilding the binary picks up HCL changes. See Chapter 31 §“The embedded HCL”.

envFrom (k8s) A Pod spec field that references a Secret or ConfigMap and projects all of its keys as environment variables into the container. The k8s backend’s ops pod uses envFrom: secretRef: roksbnkctl-ibm-creds to receive the API key without listing it in the manifest plaintext.

extra_hosts The workspace config’s list of additional URLs to probe under roksbnkctl test connectivity. In v1.0 the value is a bare []string of URLs; per-host method/expected-status overrides are deferred. See Chapter 20.

FAR F5 Application Runtime — the container-image distribution of the BIG-IP Next data plane. FLO pulls FAR images from repo.f5.com using the auth key in the COS supply-chain bucket.

FLO F5 Lifecycle Operator — the Kubernetes operator that owns the CNE Instance + License + supporting resources. The control plane piece of BNK. (The acronym sometimes also surfaces as “F5 Logging Operator” — context disambiguates; in this book it always means Lifecycle Operator.)

FQDN Fully Qualified Domain Name — the absolute form of a DNS name ending with a trailing dot (www.example.com.).

FAR auth key The credential tarball (f5-far-auth-key.tgz) that FLO uses to pull FAR images from repo.f5.com. Lives in the COS supply-chain bucket. Rotated periodically; see Chapter 25 §“Licence rotation”.

ghcr.io GitHub Container Registry — where the roksbnkctl-tools-* images are published. The k8s backend pulls from ghcr.io/jgruberf5/roksbnkctl-tools-{ibmcloud,iperf3}.

GSLB Global Server Load Balancing — DNS-driven traffic management where the answer a name returns depends on the requesting resolver’s network vantage. The thing Chapter 21 is built to validate.

--gslb-compare The DNS-probe flag that fans out across all configured backends in parallel and emits a comparison JSON with gslb_divergence: true|false. The signature workflow for “is the GSLB rule taking effect”. See Chapter 21 §“The --gslb-compare workflow”.

HCL HashiCorp Configuration Language — the syntax of Terraform .tf files. The upstream HCL is bundled into the binary; see Embedded HCL.

ibmcloud Two senses. The IBM Cloud CLI binary (which roksbnkctl ibmcloud … passes through to or replaces, depending on the backend). And the YAML block in config.yaml (ibmcloud:) holding region, resource group, and API key source.

ImagePullBackOff (k8s) A Pod status indicating the image couldn’t be pulled from the registry. Usually a network or auth problem; sometimes a tag-doesn’t-exist problem. See Chapter 26 §“ImagePullBackOff…”.

iperf3 mode (--mode) The throughput-suite flag selecting north-south (LoadBalancer Service, client outside the cluster) or east-west (ClusterIP Service, client inside the cluster). See Chapter 22 §“The two modes”.

JWT JSON Web Token — the signed-token format BNK uses for the subscription licence (trial.jwt in the COS supply-chain bucket).

K — N

k (roksbnkctl k <verb>) The internalised kubectl subtree. roksbnkctl k get/apply/describe/delete/exec/logs/port-forward — built on k8s.io/client-go directly so no host kubectl binary is required. See Chapter 24.

kubeconfig The Kubernetes client-configuration file (clusters, contexts, credentials). Defaults to ~/.kube/config. roksbnkctl up auto-fetches the admin kubeconfig post-apply.

LoadBalancer (k8s Service type) A Service type that provisions an external endpoint (a cloud LB on managed Kubernetes; an external IP on bare-metal CNI). Used by the throughput suite’s north-south mode and by BNK’s exposed VIPs.

Long-lived ops pod The k8s backend’s persistent execution context. Deployed by roksbnkctl ops install; subsequent --backend k8s dispatches kubectl exec into the same pod rather than starting a fresh Pod each call. Contrasted with the one-shot Job pattern used for iperf3 and DNS probes. See Chapter 19.

Manifest version (f5_bigip_k8s_manifest_version) The version pin on the f5-bigip-k8s-manifest Helm chart. Transitively pins both the FLO and CIS versions (both are extracted from the manifest chart). See Chapter 13.

mdBook rust-lang/mdBook — the static-site generator the book is built with. Markdown source under book/src/, HTML output under book/book/. See Chapter 31 §“The book build”.

miekg/dns github.com/miekg/dns — the Go DNS library the GSLB probe is built on. Same library CoreDNS uses; gives roksbnkctl test dns full record-type coverage and per-query server selection. See Chapter 21 §“The roksbnkctl test dns flag surface”.

north-south Network direction term: traffic crossing the cluster boundary — from outside the cluster to a pod inside, or vice versa. The throughput suite’s --mode north-south measures inbound LoadBalancer-path throughput. See Chapter 22.

NXDOMAIN DNS response code indicating “this name does not exist”. roksbnkctl test dns against a non-existent name exits 1 with rcode=NXDOMAIN.

O — R

--on <target> The persistent CLI flag dispatching an ibmcloud/exec/shell/kubectl/oc passthrough over SSH to a named target instead of running it locally. The other half of the SSH-client + --on feature alongside the SSH backend. See Chapter 16.

OpenShift Red Hat’s enterprise Kubernetes distribution. ROKS = managed OpenShift on IBM Cloud.

Ops pod Shorthand for the long-lived k8s-backend execution pod deployed in the roksbnkctl-ops namespace by roksbnkctl ops install. See Chapter 19.

passthrough A command that proxies its argv to an underlying tool. roksbnkctl ibmcloud … passes through to the ibmcloud CLI; roksbnkctl kubectl … passes through to kubectl. Passthroughs run on whatever backend is selected (local by default).

PRD Product Requirements Document. The project uses numbered PRDs under docs/prd/ to coordinate larger feature work. See Chapter 32 §“The PRD process”.

PHASE_FROM= The env-var resume mechanism on the e2e driver scripts. PHASE_FROM=L ./scripts/e2e-test-backends.sh fast-forwards past phases A-K. See Chapter 23 §“Resuming a partial run”.

RBAC Role-Based Access Control — the Kubernetes authorization model. The ops pod has a least-privilege RBAC binding; see ClusterRole.

restricted-v2 The default OpenShift PodSecurity policy / SCC at admission. Rejects pods that run as root, allow privilege escalation, or hold the ALL capability set. All roksbnkctl-managed pods (ops pod, iperf3 server, DNS probe Job) are written to satisfy restricted-v2. See Chapter 22 §“The bundled image and the runAsNonRoot constraint”.

redactor The output-stream wrapper at internal/exec/redact.go that masks the IBM API key value in any subprocess’s stdout/stderr before it reaches the user’s terminal or the log. The defence-in-depth net for credential leaks. See Chapter 14 §“The redactor”.

ROKS Red Hat OpenShift on IBM Cloud — IBM’s managed OpenShift offering. The cluster roksbnkctl up provisions. See Chapter 2.

runAsNonRoot A Pod / container securityContext field. Required true by restricted-v2. Images that have USER root in the Dockerfile fail admission with this set.

RTT Round-Trip Time — measured in milliseconds for each DNS query. roksbnkctl test dns -o json surfaces p50/p95/p99 across the run.

S — Z

Schematic JSON The deployer-rendered JSON document describing a BNK deployment. Lives in the COS supply-chain bucket; not consumed at install time, kept for forensics.

SCC Security Context Constraint — OpenShift’s pod-admission policy. restricted-v2 is the default; pods that violate it (e.g., by running as root) are rejected by the admission controller. See Chapter 22.

Secret (k8s) A namespaced resource holding key/value data, typically base64-encoded credentials. The k8s backend creates roksbnkctl-ibm-creds in the roksbnkctl-ops namespace at ops install time.

secretRef (k8s) The Pod spec form that references a Secret for environment-variable projection. Used together with envFrom for the ops pod’s credential injection.

Service (k8s sense) A Kubernetes resource that provides a stable endpoint for accessing one or more Pods. Types: ClusterIP (default), NodePort, LoadBalancer, ExternalName. See ClusterIP, LoadBalancer.

SPDY Speedy (protocol). The websocket-like, multiplexed-stream protocol Kubernetes uses for exec and port-forward. roksbnkctl k exec is a SPDY client implementation on top of k8s.io/client-go’s SPDY executor.

SSH backend The --backend ssh:<target> execution path. Runs the tool on a registered SSH endpoint via the internal/remote.Client wrapper. See Chapter 17 §“SSH backend”.

TGW Transit Gateway — IBM Cloud’s VPC-to-VPC connectivity service. The upstream HCL provisions a TGW between the cluster VPC and the testing-client VPC so the jumphost can reach the cluster’s internal endpoints.

tfvars (terraform.tfvars) Variable-value file for Terraform — assigns concrete values to the HCL’s variable blocks. roksbnkctl auto-renders one from config.yaml; user overrides layer on top via terraform.tfvars.user and --var-file. See Chapter 13.

tf_source The workspace config.yaml block selecting where the Terraform source comes from: embedded (compiled into the binary; the default), github (downloaded tarball), local (an on-disk directory). See Chapter 12 §“tf_source:”.

TLS (--insecure) Transport Layer Security. The --insecure flag on roksbnkctl test connectivity skips TLS certificate validation for every probe in the run (session-wide, not per-host).

TMM Traffic Management Microkernel — the BIG-IP data-plane process. BNK runs TMM as a Pod; the CNE Instance specifies how many and at what size.

TOFU Trust On First Use — the SSH-style host-key acceptance pattern. On first connection to a new SSH target, roksbnkctl prompts the user to verify the fingerprint; subsequent connections check against the saved fingerprint in ~/.roksbnkctl/known_hosts. A fingerprint mismatch refuses to connect. See Chapter 16 §“Host-key handling”.

Trusted Profile An IBM IAM construct that lets a Kubernetes ServiceAccount assume IBM Cloud permissions. FLO uses one to authenticate against IBM Cloud APIs without storing an API key in the cluster.

TTL Time To Live — DNS-record cache duration in seconds. roksbnkctl test dns -o json surfaces each answer’s TTL.

v1.0 The release this book is the launch deliverable for. All E2E phases pass on a clean dev box; doctor green-by-default with terraform-only required.

VPE Virtual Private Endpoint — IBM Cloud’s private-network access point for managed services. Sometimes left dangling after a cluster destroy (see Chapter 26 §“orphan IBM Cloud resources”).

VPC Virtual Private Cloud — IBM Cloud’s network-isolation primitive. The cluster lives in one VPC; the testing client jumphost lives in another, connected via TGW.

VSI Virtual Server Instance — IBM Cloud’s general-purpose VM. The jumphosts are VSIs.

workspace A named slot under ~/.roksbnkctl/<name>/ containing one config.yaml, one Terraform state directory, and (usually) one kubeconfig. The kubectl-style multi-environment isolation primitive. See Chapter 6.

ws / workspace The CLI subtree managing workspaces. roksbnkctl ws new/use/list/delete.

Cross-references

Chapter 1 — BNK context.
Chapter 2 — ROKS context.
Chapter 14 — credentials terminology.
Chapter 17 — backend terminology.
Chapter 21 — DNS / GSLB terminology.
Chapter 22 — throughput / SCC terminology.

Building from source

This chapter is for contributors and operators who want to build roksbnkctl themselves — whether to test an unreleased change, to verify a release artefact, or to embed a custom HCL fork into the binary.

For users who just want to install the binary, Chapter 4 — Installation is the right page. This chapter is the build-side companion.

Go version requirement

The minimum Go version is the one pinned in go.mod:

go 1.25.0

We pin to a recent toolchain for two reasons:

The IBM Cloud Go SDKs (go-sdk-core/v5, platform-services-go-sdk, ibm-cos-sdk-go) and k8s.io/client-go v0.30+ both make liberal use of Go’s modern generics — pre-1.21 toolchains won’t build.
Several dependencies (miekg/dns, docker/docker) test on the current and previous minor only; we follow upstream.

Install Go via your package manager (brew install go, apt-get install golang-1.25, etc.) or from go.dev/dl. Verify with go version.

Quick build

The shortest path from a fresh clone to a working binary:

git clone https://github.com/jgruberf5/roksbnkctl.git
cd roksbnkctl
go build -o roksbnkctl ./cmd/roksbnkctl
./roksbnkctl --version

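go build writes the binary to the working directory. On Linux, set CGO_ENABLED=0 if the result must be fully static; with cgo enabled, the Go toolchain may link the net and os/user packages against the system libc. Cross-compilation to a different OS/arch needs GOOS / GOARCH set: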

GOOS=linux   GOARCH=amd64 go build -o roksbnkctl-linux-amd64   ./cmd/roksbnkctl
GOOS=darwin  GOARCH=arm64 go build -o roksbnkctl-darwin-arm64  ./cmd/roksbnkctl
GOOS=windows GOARCH=amd64 go build -o roksbnkctl.exe          ./cmd/roksbnkctl

A full multi-platform build is easier through goreleaser:

goreleaser release --snapshot --clean
# Output lands in dist/

The --snapshot --clean flags produce a local build without trying to publish to GitHub. The release shape is described in .goreleaser.yml — Linux + macOS × amd64 + arm64, plus a Windows compile-only check.

Build via the Makefile

The repo’s Makefile wraps the common build steps and stamps version metadata into the binary:

make build              # builds bin/roksbnkctl with -ldflags version stamping
make test               # go test ./...
make vet                # go vet ./...
make tidy               # go mod tidy
make test-short         # go test -short ./...
make test-integration   # testcontainers-go-backed integration tests (needs Docker)
make test-cred-audit    # the security-spine regression suite
make lint               # gofmt + vet + staticcheck (if installed)

The version stamp comes from three ldflags variables baked into internal/cli:

var (
    Version   = "dev"
    Commit    = "none"
    BuildDate = "unknown"
)

make build passes -X github.com/jgruberf5/roksbnkctl/internal/cli.Version=$VERSION (and the others) so roksbnkctl --version reports the actual git-rev and build timestamp rather than the placeholders. Set VERSION explicitly when stamping a release:

VERSION=v1.0.0 make build
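The stamping mechanism can be illustrated with a minimal sketch. It makes two assumptions: the three variables are declared in package main so the file stands alone (the real ones live in internal/cli), and versionString is a hypothetical formatter, not the project's actual --version output.

```go
package main

import "fmt"

// Placeholders that the linker overwrites when -X is passed. In this sketch
// they live in package main; in roksbnkctl they live in internal/cli.
var (
	Version   = "dev"
	Commit    = "none"
	BuildDate = "unknown"
)

// versionString is a hypothetical formatter for the three stamped values.
func versionString() string {
	return fmt.Sprintf("roksbnkctl %s (commit %s, built %s)", Version, Commit, BuildDate)
}

func main() {
	fmt.Println(versionString())
}
```

Building this file with go build -ldflags "-X main.Version=v1.0.0" replaces the "dev" placeholder at link time. The -X flag only applies to package-level string variables, which is why the three values are declared with var rather than const.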

The embedded HCL

The Terraform source tree at terraform/ is compiled into the binary via Go’s //go:embed directive. The embed declaration lives at the repo root in embedded.go (and is wired through internal/tf/ to be served as the tf_source: embedded provider).

Two implications:

Rebuilding the binary picks up HCL changes. If you’re hacking on the HCL, make build produces a binary that ships your changes embedded. No separate “deploy the HCL” step.
The HCL is read-only at runtime. The binary extracts it to a temporary directory on first use; the extracted copy is what terraform operates on. The original embedded source is immutable.
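The extract-to-a-temporary-directory step might look roughly like the sketch below. It is not the code in internal/tf/; the function names (demoFS, extractToTemp, countFiles) and the temp-dir prefix are made up for illustration. Because //go:embed needs the terraform/ tree present at compile time, the sketch substitutes an in-memory fstest.MapFS, which satisfies the same fs.FS interface that embed.FS does.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"testing/fstest"
)

// In a real build the source would be an embed.FS declared as:
//
//	//go:embed terraform
//	var hcl embed.FS
//
// demoFS stands in for it so the sketch runs without the terraform/ tree.
func demoFS() fs.FS {
	return fstest.MapFS{
		"terraform/main.tf":      {Data: []byte("# root module\n")},
		"terraform/variables.tf": {Data: []byte("# variables\n")},
	}
}

// extractToTemp copies every file in src into a fresh temporary directory
// and returns that directory's path. The source FS is never modified.
func extractToTemp(src fs.FS) (string, error) {
	dst, err := os.MkdirTemp("", "roksbnkctl-hcl-")
	if err != nil {
		return "", err
	}
	err = fs.WalkDir(src, ".", func(p string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		target := filepath.Join(dst, filepath.FromSlash(p))
		if d.IsDir() {
			return os.MkdirAll(target, 0o755)
		}
		data, err := fs.ReadFile(src, p)
		if err != nil {
			return err
		}
		return os.WriteFile(target, data, 0o644)
	})
	return dst, err
}

// countFiles reports how many regular files exist under dir.
func countFiles(dir string) (int, error) {
	n := 0
	err := filepath.WalkDir(dir, func(_ string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		if d.Type().IsRegular() {
			n++
		}
		return nil
	})
	return n, err
}

func main() {
	dir, err := extractToTemp(demoFS())
	if err != nil {
		fmt.Fprintln(os.Stderr, "extract failed:", err)
		os.Exit(1)
	}
	defer os.RemoveAll(dir)
	n, _ := countFiles(dir)
	fmt.Printf("extracted %d files to %s\n", n, dir)
}
```

Writing the extraction against fs.FS rather than embed.FS directly keeps the copy logic testable with an in-memory tree, which is the only reason the sketch takes that shape.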

Users who would rather not use the embedded HCL can bypass it entirely with the tf_source: github or tf_source: local options in the workspace config. See Chapter 12 §“tf_source:”.

The bundled tools images

The tools/docker/ directory holds Dockerfiles for the images the docker and k8s backends use:

tools/docker/
├── Makefile
├── ibmcloud/
│   └── Dockerfile      # roksbnkctl-tools-ibmcloud
└── iperf3/
    └── Dockerfile      # roksbnkctl-tools-iperf3

tools/docker/Makefile builds both images locally as :dev:

cd tools/docker
make ibmcloud           # builds roksbnkctl-tools-ibmcloud:dev
make iperf3             # builds roksbnkctl-tools-iperf3:dev
make all                # both

The :dev tag is what a from-source roksbnkctl resolves to when the binary’s Version is dev. A tag-released binary (v1.0.0) resolves to ghcr.io/jgruberf5/roksbnkctl-tools-ibmcloud:v1.0.0 instead — the resolver logic lives in internal/exec/ (SetToolImageTag is wired in internal/cli/root.go::init). See Chapter 17 §“:dev tag resolution”.

The GitHub Actions workflow tools-images.yml builds and pushes the published images on a tag push or when tools/docker/** changes.

The book build

The book is built with mdBook. Install:

cargo install mdbook
# or
brew install mdbook

The book source lives under book/src/ with book.toml as the config. Common operations:

make book-serve         # mdbook serve book/ --open
                        # opens http://localhost:3000 with live-reload
make book               # mdbook build book/
                        # static HTML at book/book/
make book-clean         # rm -rf book/book

The published site at https://jgruberf5.github.io/roksbnkctl/book/ is built and deployed by .github/workflows/book.yml on every push to main. The workflow runs mdbook build book/ and pushes the output to the gh-pages branch via peaceiris/actions-gh-pages.

For PR-time verification, .github/workflows/spellcheck.yml runs cspell on book/src/**/*.md. It is a warning, not a gate, but the output is worth eyeballing before merging.

The auto-generated chapters

Two reference chapters are generated rather than hand-written. The generators live under tools/refgen/:

# Chapter 27 — command reference (walks the cobra command tree)
go run ./tools/refgen/cobra-md > book/src/27-command-reference.md

# Chapter 29 — terraform variable reference (parses terraform/variables.tf)
go run ./tools/refgen/tfvars-md > book/src/29-terraform-variable-reference.md

When to re-run:

Chapter 27: any change to the cobra command tree under internal/cli/ or cmd/roksbnkctl/ — new commands, renamed flags, edited Long: / Example: strings.
Chapter 29: any change to terraform/variables.tf or any submodule variables.tf referenced from the root — new variables, default-value changes, edited descriptions.

Both generators emit deterministic output — the same input HCL or cobra tree always produces the same markdown — so you can commit the rendered output to source control without worrying about spurious diff churn.
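One common way a Go generator loses determinism is by ranging over a map, since Go randomises map iteration order. The sketch below shows the usual guard, sorting the keys before rendering. The variable struct and renderTable function are hypothetical, and the real generators under tools/refgen/ may well preserve declaration order instead of sorting; the point is only that map order must not leak into the output.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// variable is a hypothetical, trimmed-down view of one Terraform variable.
type variable struct {
	Type    string
	Default string
}

// renderTable emits one tab-separated row per variable. The keys are sorted
// first; without the sort, two runs over the same input could emit the rows
// in different orders.
func renderTable(vars map[string]variable) string {
	names := make([]string, 0, len(vars))
	for name := range vars {
		names = append(names, name)
	}
	sort.Strings(names)

	var b strings.Builder
	b.WriteString("Variable\tType\tDefault\n")
	for _, name := range names {
		v := vars[name]
		fmt.Fprintf(&b, "`%s`\t`%s`\t`%s`\n", name, v.Type, v.Default)
	}
	return b.String()
}

func main() {
	vars := map[string]variable{
		"roks_workers_per_zone": {Type: "number", Default: "1"},
		"create_roks_cluster":   {Type: "bool", Default: "true"},
	}
	fmt.Print(renderTable(vars))
}
```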

Cross-compile matrix

goreleaser covers the canonical matrix:

OS	Architecture	Status
Linux	amd64	Fully supported
Linux	arm64	Fully supported
macOS	amd64 (Intel)	Fully supported
macOS	arm64 (Apple Silicon)	Fully supported
Windows	amd64	Compile-only; SSH TTY support degraded
Windows	arm64	Compile-only; same caveat
FreeBSD	amd64	Not tested

The Windows caveat is real: golang.org/x/crypto/ssh’s PTY allocation isn’t complete on Windows, so roksbnkctl shell --on jumphost falls back to a non-TTY shell. The other commands (exec, ibmcloud, kubectl) work fine on Windows.

Output from goreleaser release --snapshot --clean lands in dist/:

dist/
├── roksbnkctl_linux_amd64_v1/
│   └── roksbnkctl
├── roksbnkctl_linux_arm64/
│   └── roksbnkctl
├── roksbnkctl_darwin_amd64_v1/
│   └── roksbnkctl
├── roksbnkctl_darwin_arm64/
│   └── roksbnkctl
└── ...

Each archive bundles the binary plus LICENSE, README.md, and the rendered book/book/ directory (when the snapshot is built from a tagged commit).

Release process

Tagged releases are cut on the main branch:

# Update CHANGELOG.md with the release notes
git tag -a v1.0.0 -m "v1.0.0 — book launch + full E2E coverage"
git push origin v1.0.0

The push triggers release.yml, which runs goreleaser release to:

Cross-compile the binary for the supported OS/arch matrix.
Build the matching tools images and push to ghcr.io/jgruberf5/roksbnkctl-tools-*:<tag>.
Attach the binaries, checksums (checksums.txt), and the rendered book PDF (if mdbook-pdf is configured) to the GitHub release.
Generate release notes from the CHANGELOG and the commits since the previous tag.

The release-gate criteria (what has to hold before tagging) are documented in docs/PLAN.md §“v1.0 (M4)”. The most important: full E2E green for 3 consecutive nights on the release branch.

Cross-references

Chapter 4 — Installation — for users who just want the binary, not the source.
Chapter 17 §“:dev tag resolution” — how a from-source binary picks tool images.
Chapter 32 — Extending roksbnkctl — once you’ve built, what to actually hack on.
docs/PLAN.md — the release-gate policy.

Extending roksbnkctl

This chapter is the hacking guide for contributors. It covers the four most common extension shapes — adding a new execution backend, a new test suite, a new tool to an existing backend, a new chapter to the book — plus the PRD process the project uses to coordinate larger changes and the four-agent sprint-dispatch pattern Sprints 0-6 ran on.

For building the binary, see Chapter 31 — Building from source. For using the binary, see the rest of the book.

Adding a new execution backend

A backend is anything implementing the Backend interface in internal/exec/backend.go. The four backends shipped at v1.0 — local, docker, k8s, ssh:<target> — are each a single file under that package.

The end-to-end shape:

Implement the interface. Create internal/exec/<your-backend>.go. The contract is Run(ctx context.Context, argv []string, opts RunOpts) (int, error). Honour opts.Stdin/Stdout/Stderr, opts.WorkDir, opts.Env, opts.Credentials, opts.HostMounts, and opts.RunAsUser. Return the subprocess exit code as the first return value; the second is for backend-side errors (couldn’t start, ctx cancelled, etc.).
Register it. Call exec.Register(name string, b Backend) from the package’s init() block. The ResolveBackend(spec string) function in internal/exec/backend.go dispatches --backend <name> to the registered backend.
Handle credentials safely. Read PRD 04 before touching opts.Credentials. The cardinal rule: never pass credential values via argv — they end up in ps output, container metadata, and process accounting. Pass by reference (env var by name, projected Secret, SSH SetEnv) and let the runtime do the value plumbing.
Wire the redactor. Wrap opts.Stdout and opts.Stderr with internal/exec.NewRedactor before handing them to the subprocess. The redactor masks any credential value that leaks into the tool’s stdout/stderr. The local and docker backends do this in their wrappers; copy the pattern.
Add a doctor check. Doctor’s per-backend availability check needs to recognise your backend. Add an entry under internal/cli/doctor_backend.go reporting whether your backend’s prerequisites are satisfied (e.g., “is the daemon running”, “is the SDK reachable”). Green-by-default on a stock dev box is the goal — yellow-skip rather than red-fail when prerequisites are missing.
Add per-backend cred-audit assertions. The cred-leak audit at internal/exec/audit_test.go (and Phase M of the e2e plan) needs to know what surfaces your backend produces — container inspection, process listing, log files. Add a TestCredAudit_<YourBackend> subtest asserting the API key value never appears in any of them.
E2E phase. Add a new phase to scripts/e2e-test-backends.sh with concrete pass/fail criteria. Cross-link from PRD 05 so the test plan stays the source of truth.
Documentation. Add a deep-dive subsection to Chapter 17 — Execution backends and a decision-tree entry to Chapter 18 — Choosing a backend per tool. Without docs the backend doesn’t exist for users.

A backend PR that lands all eight steps is a complete contribution; one that lands the code but skips the audit and docs will get a “please come back with…” review comment.
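Steps 1, 2, and 4 can be sketched in a single self-contained file. Everything here is a simplified stand-in: RunOpts carries only two fields with guessed types, Register and ResolveBackend are a plain map lookup, maskingWriter is a naive substitute for the project's redactor, and printenvBackend is a toy backend invented for the example. Note that the credential reaches the backend through Env, never through argv, in line with step 3.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"os"
)

// RunOpts is a trimmed-down stand-in with guessed field types; the real
// struct also carries Stdin, Stderr, WorkDir, Credentials, and more.
type RunOpts struct {
	Stdout io.Writer
	Env    map[string]string
}

// Backend mirrors the Run contract quoted in step 1.
type Backend interface {
	Run(ctx context.Context, argv []string, opts RunOpts) (int, error)
}

var registry = map[string]Backend{}

// Register and ResolveBackend are simplified: a plain map lookup with no
// handling of parameterised specs such as ssh:<target>.
func Register(name string, b Backend) { registry[name] = b }

func ResolveBackend(spec string) (Backend, error) {
	b, ok := registry[spec]
	if !ok {
		return nil, fmt.Errorf("unknown backend %q", spec)
	}
	return b, nil
}

// maskingWriter masks the secret within each Write call. A secret split
// across two writes would slip through, so treat this as illustration only.
type maskingWriter struct {
	w      io.Writer
	secret []byte
}

func (m maskingWriter) Write(p []byte) (int, error) {
	out := p
	if len(m.secret) > 0 {
		out = bytes.ReplaceAll(p, m.secret, []byte("[REDACTED]"))
	}
	if _, err := m.w.Write(out); err != nil {
		return 0, err
	}
	return len(p), nil // report the caller's byte count, per io.Writer
}

// printenvBackend imitates a tool that dumps the variables named in argv[1:].
type printenvBackend struct{}

func (printenvBackend) Run(ctx context.Context, argv []string, opts RunOpts) (int, error) {
	if err := ctx.Err(); err != nil {
		return -1, err // backend-side error: cancelled before start
	}
	if len(argv) == 0 {
		return -1, fmt.Errorf("empty argv")
	}
	out := maskingWriter{w: opts.Stdout, secret: []byte(opts.Env["IBMCLOUD_API_KEY"])}
	for _, name := range argv[1:] {
		if _, err := fmt.Fprintf(out, "%s=%s\n", name, opts.Env[name]); err != nil {
			return -1, err
		}
	}
	return 0, nil
}

func init() { Register("printenv", printenvBackend{}) }

// demo resolves the toy backend and runs it with a fake API key in Env.
func demo() (int, string, error) {
	b, err := ResolveBackend("printenv")
	if err != nil {
		return -1, "", err
	}
	var buf bytes.Buffer
	opts := RunOpts{Stdout: &buf, Env: map[string]string{"IBMCLOUD_API_KEY": "s3cr3t"}}
	code, err := b.Run(context.Background(), []string{"printenv", "IBMCLOUD_API_KEY"}, opts)
	return code, buf.String(), err
}

func main() {
	code, out, err := demo()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
	os.Exit(code)
}
```

The toy tool tries to print the key, and the wrapper masks it before it reaches the caller's buffer. A production redactor would also need to handle a secret that straddles two Write calls, which this sketch deliberately ignores.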

Adding a new test suite

The test subtree (roksbnkctl test <suite>) holds three suites at v1.0: connectivity, dns, throughput. Adding a fourth (e.g., tls-handshake, latency, tcp-flowstate) follows a six-step recipe:

Implement the runner. Create internal/test/<suite>.go. The suite produces results in the roksbnkctl.<suite>.v1 JSON schema — pick a top-level shape consistent with the existing suites (ProbeResult for single-probe, ProbeSuiteResult for an aggregate with results[]).
Wire a subcommand. Add internal/cli/test_<suite>.go with a cobra command under test. The flag surface should mirror the existing suites’ patterns — --target, --iterations, -o json, --backend (when the suite is backend-aware).
Pick the backends. Most test suites are backend-aware (the suite runs from a network vantage that the backend selects). DNS and throughput accept local / k8s / ssh:<target> and reject docker; connectivity is currently local-only. Decide which backends make sense for your suite — the deciding question is “does the vantage change the answer?”.
Wire the JSON schema constant. Add roksbnkctl.<suite>.v1 to your suite’s output. CI assertions diff against this, so bumping the version is a breaking change; document it in CHANGELOG.
Add an E2E phase. New phase under PRD 05 and corresponding script section in scripts/e2e-test-backends.sh.
Documentation. New chapter or major section in Part VI of the book (currently chapters 20-23). Cross-link from Chapter 23 — The E2E test plan and Chapter 18 — Choosing a backend per tool.

The DNS probe is the canonical worked example — read internal/test/dns.go + internal/cli/test.go to see all six steps in their landed form, plus the Sprint 5 architect prompt for the design framing.
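For a feel of what steps 1 and 4 produce, here is a minimal sketch of an aggregate result for a hypothetical tls-handshake suite. The type names ProbeResult and ProbeSuiteResult are borrowed from the existing suites, but the fields, JSON keys, and the encodeSuite helper are illustrative guesses, not the real schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// schemaTLSHandshake follows the roksbnkctl.<suite>.v1 naming convention
// for a hypothetical tls-handshake suite.
const schemaTLSHandshake = "roksbnkctl.tls-handshake.v1"

// ProbeResult holds one probe's outcome. The fields are assumptions.
type ProbeResult struct {
	Target string `json:"target"`
	OK     bool   `json:"ok"`
}

// ProbeSuiteResult is the aggregate shape: a schema tag plus results[].
type ProbeSuiteResult struct {
	Schema  string        `json:"schema"`
	Results []ProbeResult `json:"results"`
}

// encodeSuite stamps the schema constant onto the results and marshals them.
func encodeSuite(results []ProbeResult) (string, error) {
	b, err := json.Marshal(ProbeSuiteResult{Schema: schemaTLSHandshake, Results: results})
	return string(b), err
}

func main() {
	out, err := encodeSuite([]ProbeResult{{Target: "example.com:443", OK: true}})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(out)
}
```

Keeping the schema string in a single constant means a version bump is a one-line diff, which makes the breaking change easy to spot in review.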

Adding a new tool to an existing backend

The docker, k8s, and ssh backends each maintain a map of tool-name → image / package. Adding a new tool (e.g., mtr, tcpdump, helm) means an entry in each backend’s map.

Docker backend

internal/exec/docker.go::toolImages maps tool names to image specs:

var toolImages = map[string]string{
    "ibmcloud":  "ghcr.io/jgruberf5/roksbnkctl-tools-ibmcloud",
    "iperf3":    "ghcr.io/jgruberf5/roksbnkctl-tools-iperf3",
    "terraform": "hashicorp/terraform:1.5.7",
    "<your>":    "<your-image-ref>",
}

Tag resolution is handled by SetToolImageTag (set in internal/cli/root.go::init) — a :dev tag for a from-source binary, :<release-tag> for a tagged release. If your image needs its ENTRYPOINT bypassed (e.g., for image-specific argv mangling), add a jobToolCmdOverride entry.
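The interaction between the map and the version-derived tag can be sketched as follows. This is not the SetToolImageTag implementation; resolveToolImage and hasTag are hypothetical helpers. The sketch assumes an entry that already pins a tag (such as the terraform image) is left alone, and that only the tag changes between a from-source and a released binary. Whether the real resolver also changes the registry prefix for :dev images is not covered here.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

var toolImages = map[string]string{
	"ibmcloud":  "ghcr.io/jgruberf5/roksbnkctl-tools-ibmcloud",
	"iperf3":    "ghcr.io/jgruberf5/roksbnkctl-tools-iperf3",
	"terraform": "hashicorp/terraform:1.5.7",
}

// hasTag reports whether the last path component of ref carries a tag.
// Looking only after the final "/" avoids mistaking a registry port
// (localhost:5000/img) for a tag.
func hasTag(ref string) bool {
	last := ref[strings.LastIndex(ref, "/")+1:]
	return strings.Contains(last, ":")
}

// resolveToolImage looks the tool up and appends the binary's version as
// the tag unless the map entry already pins one.
func resolveToolImage(tool, version string) (string, error) {
	ref, ok := toolImages[tool]
	if !ok {
		return "", fmt.Errorf("no image registered for tool %q", tool)
	}
	if hasTag(ref) {
		return ref, nil
	}
	return ref + ":" + version, nil
}

func main() {
	for _, tool := range []string{"ibmcloud", "iperf3", "terraform"} {
		ref, err := resolveToolImage(tool, "dev")
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Println(tool, "->", ref)
	}
}
```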

K8s backend

internal/exec/k8s.go holds two patterns — long-lived ops pod (for tools that share state, like ibmcloud) and one-shot Job (for tools that produce a single output, like iperf3 or DNS probes). New tools pick one pattern:

Ops pod: add the tool’s image to the ops pod’s container spec at install time, or kubectl exec into the existing ops pod and run the host-installed binary.
One-shot Job: build a Pod template using the same image conventions as iperf3, run, stream logs, capture exit code, delete. The Job pattern is the right call for tools where the result is the only thing that matters.

SSH backend

internal/exec/ssh.go maintains a map of tool names to apt-package names for the --bootstrap auto-install:

// toolPackage carries apt-repo metadata + package name; see the
// production form in internal/exec/ssh.go for the full struct shape
// (IBM repo URL + GPG key + apt-source line for ibmcloud-cli, etc.).
var toolPackages = map[string]toolPackage{
    "ibmcloud": { /* IBM apt repo + key + "ibmcloud-cli" */ },
    "iperf3":   { /* plain ubuntu-main "iperf3" */ },
    "<your>":   { /* repo + key + "<deb-package>" */ },
}

The bootstrap step runs apt-get install -y <packages> on the SSH target when the tool isn’t already on PATH. Non-Debian targets are out of scope for v1.0; the bootstrap fails clearly with a message pointing at the manual-install path.
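A rough sketch of the "install only what is missing" rule follows. The trimmed toolPackage struct, the Deb field, and bootstrapCommand are hypothetical and model only the package name; the production struct also carries the apt-repo and GPG-key metadata, which this sketch leaves out.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// toolPackage is a trimmed, hypothetical stand-in for the struct in
// internal/exec/ssh.go; only the package name is modelled here.
type toolPackage struct {
	Deb string
}

var toolPackages = map[string]toolPackage{
	"ibmcloud": {Deb: "ibmcloud-cli"},
	"iperf3":   {Deb: "iperf3"},
}

// bootstrapCommand returns the apt-get line for the tools that are not yet
// on PATH, or "" when nothing needs installing. onPath reports whether the
// SSH target already has the tool.
func bootstrapCommand(tools []string, onPath func(string) bool) (string, error) {
	var pkgs []string
	for _, t := range tools {
		if onPath(t) {
			continue
		}
		p, ok := toolPackages[t]
		if !ok {
			return "", fmt.Errorf("no package mapping for tool %q; install it manually", t)
		}
		pkgs = append(pkgs, p.Deb)
	}
	if len(pkgs) == 0 {
		return "", nil
	}
	sort.Strings(pkgs) // stable order keeps logs and tests deterministic
	return "apt-get install -y " + strings.Join(pkgs, " "), nil
}

func main() {
	// Pretend iperf3 is already installed on the target.
	onPath := func(tool string) bool { return tool == "iperf3" }
	cmd, err := bootstrapCommand([]string{"ibmcloud", "iperf3"}, onPath)
	if err != nil {
		fmt.Println("bootstrap error:", err)
		return
	}
	fmt.Println(cmd)
}
```

A tool with no map entry returns an error instead of being silently skipped, mirroring the "fail clearly" behaviour described for unsupported targets.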

For each backend, the implementation work is small (one map entry). The doctor checks, e2e coverage, and docs are the bulk — same shape as adding a new backend, scaled to the smaller surface.

Adding a new chapter to the book

The book is mdBook with markdown source under book/src/. Adding a chapter:

Create book/src/<NN>-<slug>.md — the file. Numbered prefix for sort order.
Add the chapter to book/src/SUMMARY.md — the TOC. Use the existing parts (Concepts, Getting Started, Cluster Lifecycle, …) or add a new part if it doesn’t fit.
Run make book-serve to live-preview at http://localhost:3000 with auto-reload.
Cross-link from related chapters at the bottom (the “Cross-references” section every chapter ends with).
Push. .github/workflows/book.yml re-deploys to gh-pages on every merge to main.

The book follows a consistent style:

Lower-case prose, sentence-case section headers.
Code blocks for any command, inline code for filenames and identifiers.
Short paragraphs, one idea each.
Examples should be runnable as written.
PRD references use the full GitHub URL (https://github.com/jgruberf5/roksbnkctl/blob/main/docs/prd/03-EXECUTION-BACKENDS.md) to avoid the published-book 404 issue surfaced in Sprint 1.

The PRD process

The project uses numbered Product Requirements Documents under docs/prd/ for larger feature work — anything that touches multiple files, spans more than one sprint, or has open design questions that need to be settled before code lands.

When a feature warrants a PRD vs. a direct PR:

Use a PR	Use a PRD
Single-file change	Multi-file change across `internal/{exec,cli,config,…}`
Bug fix	New subsystem (a new backend, a new test suite)
Doc fix	New surface that needs a stable contract (a JSON schema, a workspace-config field)
Refactor with no behaviour change	A change that needs threat-model thinking (creds, network, multi-tenancy)
Drive-by polish	Anything cross-cutting >50 LOC

The PRD lifecycle:

Draft: open as a markdown file under docs/prd/NN-<TITLE>.md. The structure should follow the existing PRDs (00-OVERVIEW, 01-SSH, 02-KUBECTL, 03-BACKENDS, 04-CREDS, 05-E2E): goal, approach, file-by-file plan, test plan, acceptance criteria, open questions.
Review: open a PR adding the PRD. Discuss in the PR. Open questions get resolved by edit or by punting to a follow-up issue.
Implement: the PRD becomes the implementation plan. Per-sprint tasks land in docs/PLAN.md referencing the PRD by number.
Land: code PRs reference the PRD; the PRD itself is the spec, code is the implementation. When the implementation diverges from the PRD, the PRD gets updated to match — never the other way around (the binary’s behaviour is the source of truth).

The PLAN.md per-sprint planning rhythm interleaves code + tests + docs per sprint. Each sprint’s prompts (under prompts/sprint<N>/) translate the PLAN into concrete agent tasks.

The four-agent sprint dispatch

Larger sprints (Sprints 3-6) are dispatched as four parallel agents:

Architect — designs the surface, drafts the book chapters that explain it, files architect-side issues.
Staff engineer — writes the production Go and shell code, modifies the bundled HCL when needed.
Tech-writer — reviews the architect’s chapters for accuracy, fluency, and cross-link integrity. Files tech-writer-side issues.
Validator — writes / extends the e2e test scripts and CI workflows, files validator-side issues.

The dispatch lives at prompts/sprint<N>/{architect,staff,tech-writer,validator}.md — one prompt per agent. Each agent runs independently against the same repo snapshot. An integrator at the end folds the four agents’ outputs together, resolves the issues each filed against the others, and commits the aggregate.

When to dispatch four agents vs. just open a PR:

Direct PR	Four-agent sprint
Single feature, single sprint, <10 files	Multi-feature sprint with code + docs + tests scope
Bug fix	New PRD landing
Drive-by improvement	Sprint-gate milestone work
You’re the only contributor	Coordinating with reviewers who’d otherwise serialise

prompts/README.md documents the agent-coordination pattern. The sprint dispatch is the project’s way of running review-and-implementation in parallel rather than serial — it works when the surfaces are well-separated (code vs docs vs tests don’t conflict on file ownership) and the integrator has enough context to merge the four lanes.

Worked example: adding a new execution backend

End-to-end Part IX scenario: you want to add a podman backend (rootless container runtime as an alternative to docker) so users on Fedora/RHEL hosts that ship podman by default don’t have to install Docker just to use the --backend docker workflow. Same surface, different runtime underneath (podman itself runs without a central daemon). The walkthrough below tracks the eight-step recipe above with concrete file paths and a diff-shaped sketch of each change.

# 1. Implement the interface — new backend file
cat > internal/exec/podman.go <<'GO'
package exec

import (
    "context"
    "os/exec"
)

type podmanBackend struct{}

func (p *podmanBackend) Run(ctx context.Context, argv []string, opts RunOpts) (int, error) {
    args := append([]string{"run", "--rm"}, dockerStyleArgs(opts)...)
    args = append(args, opts.Image)
    args = append(args, argv...)
    cmd := exec.CommandContext(ctx, "podman", args...)
    return runWithRedactor(cmd, opts)
}

func init() {
    Register("podman", &podmanBackend{})
}
GO

# 2. Add the tool image mapping (podman uses the same OCI images as docker)
# Edit internal/exec/podman.go: either add a toolImages-style map analogous
# to docker.go, or reuse the docker.go map as-is (both files are in package
# exec, so no export is needed). Both runtimes pull the same OCI images, so
# reusing the map is the usual choice.

# 3. Doctor check — internal/cli/doctor_backend.go
# Add a `checkPodmanBackend()` function that runs `podman info` once with a
# 2s timeout. Green if exit 0, yellow if podman not found, red if podman is
# present but `podman info` fails or times out (podman has no daemon to ping).

# 4. Wire credentials — re-use the docker backend's cred-propagation logic
# (the `-e VAR` pattern works identically for podman). Pass opts.Credentials
# by env-var reference, never by argv. See internal/exec/docker.go::
# dockerStyleArgs for the pattern to copy.

# 5. Add cred-audit test
cat > internal/exec/podman_audit_test.go <<'GO'
package exec_test

import "testing"

func TestCredAudit_Podman(t *testing.T) {
    // Run a no-op command via the podman backend with a known API key,
    // then inspect `podman inspect`'s output for the key value. Assert
    // the value never appears in the container's labels, env, or args.
}
GO

# 6. E2E phase — extend scripts/e2e-test-backends.sh
# Add Phase P (or extend Phase K) with a parallel sequence to K2-K6 but
# using --backend podman. Cross-link to PRD 05.

# 7. Documentation — chapters 17 + 18
# - Chapter 17: add a "Podman backend" section parallel to "Docker backend",
#   noting it's rootless-by-default and a drop-in alternative.
# - Chapter 18: add a row to the per-tool matrix; add a decision-tree entry
#   ("I'm on a podman-only host"); update the at-a-glance table.

# 8. Run the full test suite
go build ./...
go vet ./...
go test ./...
DRY_RUN=1 ./scripts/e2e-test-backends.sh

The PR should land all eight steps in one commit-set. A reviewer will look for: registered init(), doctor check, cred-audit test, e2e phase, and the two chapter additions. Without the audit + docs, the PR isn’t complete — see the cardinal rule at the top of the Adding a new execution backend section.

The same pattern applies to a new test suite, a new tool on an existing backend, or a new chapter — the eight-step recipe is the long version; the worked example is the copy-paste short version. Pick the shape that matches your contribution.

Cross-references

Chapter 17 — Execution backends — the four-backend matrix you’re extending.
Chapter 19 — The in-cluster ops pod — the k8s-backend pattern your new tool might join.
Chapters 20-22 — the three existing test suites your new suite would join.
Chapter 23 — The E2E test plan — where your new phase belongs.
Chapter 31 — Building from source — the build-side counterpart to the hacking side.
PRD 00 — Overview — the PRD index.
docs/PLAN.md — the per-sprint planning rhythm.
prompts/README.md — the four-agent dispatch pattern.
