Roadmap

Project-level planning. Items here feed sit_flow.

How to Use

Quick capture: Add rough ideas to Inbox anytime (even mid-implementation)
Pick work: Use item ID with sit_flow(slug: "wf-01")
After reflect: Inbox items get triaged to themes or discarded

Inbox

ri-04 scripts/reset-dns.sh doesn’t clean continuation records or group device namespaces — stale records survive across 3-node test runs, causing spurious decrypt/key-slot errors on the next launch. (group-records-binary-encoding in-wild)
ri-05 Background scp via run_in_background reports success while stalling mid-upload — partial files land on the target with no error signal. Need a size/checksum verification step after server-transfer-sync.sh uploads. (group-records-binary-encoding in-wild)
ri-07 Cross-provider tor-gw topology documentation: when tor-gw and Tamago deploy on different GCP projects / providers, Noise (ca-91) already encrypts the tor-gw→Tamago link. Document the cross-provider deployment variant (operator guide, firewall implications, Tamago private-IP distribution across projects). (tor-gw-vm-deployment-topology-automation retro §Recommendations 4) To be delivered by pd-11 Phase 3 cross-provider validation.

Workflow Improvements

wf-01 Add a propose-phase check that compiles or grep-checks module paths in code snippets against go.mod (cmd-identity-init)
wf-02 Add a propose-phase step: for each code snippet from spec docs, verify library API matches actual installed version via go doc before implementation (record-encrypt)
wf-03 Add a propose-phase check: for ECDH-based functions, trace key flow through both directions (publisher and resolver) to verify whether each parameter needs a seed (private key derivation) or a public key (record-format)
wf-04 Add a propose-phase mechanical check: for each function referenced in code snippets, verify it exists in the actual source via go doc or grep (device-provision)
wf-05 When adding fields to signed records, the implementation plan should explicitly enumerate all golden-value test files that need updating (device-provision)
wf-08 In propose-phase LOC estimates for features involving multi-key cryptographic protocols (master + device + WG + delegation + outer envelope construction), estimate the test fixture helper separately from test cases. A 3:1 ratio (test LOC to production LOC) is realistic; the daemon-facilitator-serve estimate of 400 test LOC underran by 665 LOC because the serverTestFixture helper alone was ~100 LOC and the multi-key setup work compounds across every test case (daemon-facilitator-serve)
wf-09 For every feature using the privilege-gated “fail at the right step” test pattern, or any caller-side lookup that has a fallback branch (elevated 2026-04-13 after ca-25 recurrence during master-device-roster), also require at least one success-path integration test gated behind //go:build integration, documented as manual-run or root-capable-CI only. Retroactively apply to cmd-connect, daemon-minimal, daemon-facilitator, daemon-facilitator-serve to catch latent success-path bugs. The missing UAPI socket in wgplane.CreateInterface (shipped through three feature cycles, broke every first-run attempt) is the canonical privileged-ops case: fail-at-right-step tests assert logic preceding the privileged call is correct, but cannot tell you whether the downstream logic would have worked if it could. The ca-25 recurrence (non-privileged caller-side lookup with a Path A / Path B fallback confusion) generalizes the pattern: unit tests prove the crypto pipeline with pristine inputs; integration tests use pristine fixtures; neither exercises the lookup-failure path where real-world inputs diverge from pristine state. Both layers are required: caller-layer integration tests that exercise fallback branches AND the lookup-path inputs that trigger them (e.g., peerstore entries with missing fields, legacy imports, partially-populated state). (first-run bug hunt, elevated by master-device-roster retrospective)
wf-10 When a CLI subcommand needs root but reads user-level XDG config paths, document the sudo HOME=$HOME pattern prominently in both the README and the tutorial. Alternatively support --keyring / --device-keyring flags universally for explicit override. The first-run tutorial hit this because sudo resets HOME to /root and config.KeyringPath resolves against $HOME/.config. (first-run bug hunt)
wf-11 For any internal/wgplane / WireGuard interface code, integration tests must exercise the success path with actual packet traffic (ping or equivalent) through a pair of real interfaces. Merely creating and tearing down an interface does not validate that the configuration is functional. Three latent bugs in this package (UAPI socket missing, AllowedIPs default wrong, ListenPort never set) shipped through three feature cycles because wgplane_integration_test.go creates interfaces without verifying traffic flow. A single integration test that sets up two real interfaces on localhost (different netns or userspace loopback) and pings between them would catch all three bugs in one run. Apply retroactively to wgplane as the acceptance criterion for any future change to the package. Generalizes wf-09: not just success-path tests, but traffic-validating success-path tests for network code. (first-run bug hunt Session 2)
wf-12 When drafting study notes, proposal design options, or ROADMAP items, explicitly enumerate applicable howtos from .sit/howto/ and apply them as design lenses before committing to a solution. Prior howtos are institutional knowledge that propagates forward only if consulted at design time, not just code time. The master-device-roster study initially failed to apply reorder_verification_with_channel_evidence.md despite it being directly applicable. (master-device-roster)
wf-13 For features combining scheduling + crypto + time manipulation, budget 70% production LOC overshoot and 5:1 test-to-production ratio (vs standard 30-50% / 3:1). Source: daemon-syncplan retrospective LOC analysis.
wf-14 Expand [SCOPE] trigger to fire on step-level contraction (not only expansion). Cycle 1’s silent omission of the write path would have been caught if [SCOPE] also fired when planned steps are not delivered. Methodology change to scope_expansion_protocol.md. (daemon-groups-foundation)
wf-15 LOC estimate revision for crypto + messaging + daemon + CLI features (the full-lifecycle variant, not just primitives). daemon-groups-foundation shipped ~2231 LOC total (975 production + 1256 test) vs the proposal’s ~1700 LOC wf-13-budgeted estimate — a 31% overshoot on top of wf-13’s already-generous 5:1 ratio. Candidate revision: budget 6:1 test-to-production for features that cross the primitives → daemon → CLI boundary end-to-end, treating wf-13 as a floor for primitives-only features and wf-15 as the ceiling for CLI-exercised full-stack features. Consider annotating wf-08 with the daemon-groups-foundation data point or letting wf-15 supersede it. (daemon-groups-foundation)
wf-16 Read-path-first bias sentinel: cycle 1 of crypto/messaging features tends to complete read-path components and stop, because the read path has fewer dependency edges than the write path (no live tunnel required, no message exchange, no mock tunnel infrastructure in tests). Add a cycle-1 completion-gate sentinel grep for write-path symbols (CLI subcommands, daemon message handlers, message producers) BEFORE advancing from Implement to Reflect. If any write-path symbol returns zero matches, fail the gate and require explicit [SCOPE] acknowledgment or a second cycle to close the gap. Directly addresses the cycle 1 failure mode that this feature hit. Related to but distinct from wf-14 ([SCOPE] fires on contraction) — wf-16 is the detection mechanism, wf-14 is the logging response. (daemon-groups-foundation) Strengthening (2026-04-14 from cycle 2.5 finding in commit 316b1a3): the symbol-grep sentinel is necessary but insufficient. A symbol existing in a package is not the same as the daemon/CLI actually calling it at runtime. Cycle 2 of daemon-groups-foundation created GroupMsgHandler and runGroupInvite/Accept/DevicePublish symbols that satisfied the cycle 1 sentinel — but the daemon’s cmd/portdistrictd/main.go never populated the corresponding daemon.Config fields, making the if d.cfg.GroupDir != "" and if d.cfg.GroupMsgListenAddr != "" branches in RunChores dead code at runtime. Caught only during in-wild test runbook authoring. Logically complete sentinel traces from main.go → daemon.Config{...} struct literal → RunChores conditional branches → target struct’s Run() call. Harder to mechanize than a grep but catches the “symbol exists, symbol unreachable” failure mode. Candidate: AST-based reachability checker that validates every non-test .Run() method on a daemon Task implementation is reachable from a code path rooted at a main.go entry point. 
Minimum viable version: a per-feature manual check in the Reflect phase that opens main.go and greps for d.cfg.<NewField> references for every new Config field added in the cycle. Second strengthening (2026-04-14 in-wild test evening session, cycles 2.6 + 2.7): the pattern manifested at two deeper levels in the same session. Cycle 2.6: GroupInvite.AdminNodeID was set to the pseudonymous group node ID, but the trust-gate in handleInvite needed the master node ID — mocked unit test masked the failure. Cycle 2.7: publishRecord was declared as a callback field on GroupMsgHandler, used in handleAccept conditionally (if h.publishRecord != nil), but daemon.go:RunChores never populated it — the else branch (encode-and-discard) was the only code path in production, and unit tests constructed handlers with whatever fields each test needed. Both caught only by in-wild execution observing log vs DNS state divergence. This strengthens the case that reachability checks must also cover (a) non-nil callback struct fields, (b) identity-lookup semantics in mocks (mocks must receive specific values, not stub true/TRUSTED/nil unconditionally). Promotion candidate: howto “Every struct field on a daemon Task should be required at construction time or documented as optional” — replacing struct-literal construction with constructor functions (NewGroupMsgHandler(conn, ..., publishRecord, ...)) that fail at compile time when a caller forgets a required field. Optional fields get explicit With* methods.
wf-17 GroupMsgHandler SIGTERM cleanup: conn.Close() on ctx.Done() to unblock ReadFrom and release UDP socket. Observed 2026-04-14 in-wild test (daemon-groups-foundation finding #4): after SIGTERM, the GroupMsgHandler.Run goroutine stays blocked on conn.ReadFrom() because there is no context-cancellation wiring. The process doesn’t exit within the script’s 1-second post-SIGTERM sleep window; UDP 9998 stays bound by the old process; the new daemon fails with bind: address already in use. Workaround applied in scripts (fe9670d refine(scripts)): two-phase kill (SIGTERM → sleep 1 → SIGKILL). Real fix: in internal/daemon/group_msg_handler.go:Run, add a goroutine that waits on ctx.Done() and calls h.conn.Close(), which will unblock the pending ReadFrom with an error. Same pattern as internal/daemon/facilitator_server.go UDP listener. Related but distinct from cq-03 (gofmt/goimports gate) — wf-17 is a runtime-cleanup bug, cq-03 is a pre-commit formatting gate. (daemon-groups-foundation in-wild test)
wf-18 Author in-wild test runbooks during Propose phase for features with daemon + CLI + external-IO integration surface. The runbook is a third verification layer (after unit tests and call-graph tracing), not a planning artifact. daemon-groups-foundation: runbook authored post-Reflect caught cycles 2.5/2.6/2.7; if authored during Propose, all three would have been caught during Implement. (daemon-groups-foundation late retrospective)
wf-19 Promote candidate howto daemon_task_constructor_functions.md — constructor-vs-struct-literal for daemon task types. Two manifestations in daemon-groups-foundation (cycle 2.5 GroupDir/GroupMsgListenAddr not populated in daemon.Config literal; cycle 2.7 publishRecord callback not populated in GroupMsgHandler literal), but zero in other features. Held back by late-reflect subagent pending a second feature independently hitting the same shape. Rule: daemon chore task types with required dependencies get NewXxx(required1, required2, ...) *Xxx constructors; optional fields become WithXxx() methods. Promote when a third data point lands, or downgrade to an observation note inside scope_expansion_protocol.md if no new data point emerges within 2 features. (daemon-groups-foundation late retrospective)
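A minimal sketch of the constructor-plus-With* rule follows. Handler, PublishFunc, WithLogger, and HandleAccept are hypothetical names chosen for illustration; they do not mirror the real GroupMsgHandler fields.

```go
package main

import (
	"errors"
	"fmt"
)

// PublishFunc is a hypothetical required dependency of the task type.
type PublishFunc func(record []byte) error

// Handler keeps its fields unexported so callers outside the package cannot
// build it with a struct literal and silently skip a required dependency.
type Handler struct {
	publishRecord PublishFunc                      // required
	logf          func(format string, args ...any) // optional
}

// NewHandler takes every required dependency as a positional argument, so a
// caller that forgets one fails to compile.
func NewHandler(publish PublishFunc) *Handler {
	return &Handler{
		publishRecord: publish,
		logf:          func(string, ...any) {}, // safe default for the optional field
	}
}

// WithLogger sets an optional dependency and returns the handler for chaining.
func (h *Handler) WithLogger(logf func(format string, args ...any)) *Handler {
	h.logf = logf
	return h
}

// HandleAccept has no "if h.publishRecord != nil" branch: the constructor
// is the only way to obtain a Handler, so the callback is always supplied.
func (h *Handler) HandleAccept(record []byte) error {
	if len(record) == 0 {
		return errors.New("empty record")
	}
	h.logf("publishing %d bytes", len(record))
	return h.publishRecord(record)
}

func main() {
	var published [][]byte
	h := NewHandler(func(r []byte) error {
		published = append(published, r)
		return nil
	}).WithLogger(func(f string, a ...any) { fmt.Printf(f+"\n", a...) })

	if err := h.HandleAccept([]byte("record")); err != nil {
		panic(err)
	}
	fmt.Println("published records:", len(published))
}
```

A caller can still pass nil explicitly, so the compile-time guarantee covers forgotten arguments only; a nil check inside the constructor would be a separate design decision.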
wf-20 Promote or fold candidate howto layered_verification_protocol.md — the five-layer verification model (L1 grep → L2 gofmt/goimports → L3 call-graph trace → L4 in-wild execution → L5 cascading in-wild after upstream unblocks). Currently documented across two retros (daemon-groups-foundation original + late) but not consolidated. Decision needed: (A) standalone howto, or (B) appendix to existing scope_expansion_protocol.md which already covers L1-L2. Recommendation: (B) — extend scope_expansion_protocol.md with layers 3-5 rather than fragmenting. wf-16 sentinel work is the mechanization of L3. (daemon-groups-foundation late retrospective)
wf-21 Spec-section cross-reference gate in Propose phase: when implementing a design spec, the migration map must cross-reference EVERY numbered section. Aggregator sections, worked examples, argument type rules, and deferred-command lists each define behaviors that a rename-only pass will miss. Howto written: spec_section_cross_reference_for_cli_migration.md. (portdistrict-cli-redesign)
wf-22 In-wild validation as first-class implementation phase: proposals with deployment scripts should include “in-wild validation” as an explicit phase in the implementation plan, not a “mechanical follow-on.” (portdistrict-cli-redesign)
wf-23 When a proposal migrates a CLI to match a design spec, the migration map must cross-reference EVERY numbered spec section (§1–§N), not just the command-tree tables. Aggregator sections (§3), worked examples (§4), and argument-type rules (§5) each define behaviors that a rename-only pass will miss. Driven by 3 PIVOTs during portdistrict-cli-redesign where §2 was mapped exhaustively but §3/§4/§5 were treated as Phase-2 deferrals; all three PIVOTs came from skipped sections.
wf-24 Evaluate promotion of howto/platform_route_spike.md — mandate a Phase 0 spike exercising the actual OS route-installation path per platform (not just WireGuard mock interfaces) whenever AllowedIPs prefix length changes. Single supporting data point today: peer-overlay-derivation Finding 1 (darwin/BSD AddPeerRoute stripped CIDR prefix, broke /64 routing, only discovered in-wild because Phase 0 spike validated coexistence on a mock interface not through the real route syscall). Decision needed: promote now (preventive), wait for a second data point (avoid over-fitting to one bug), or drop entirely (cost/benefit too low for a change class that rarely happens). (peer-overlay-derivation deferred)
wf-25 In-wild test scripts should SSH via ~/.ssh/config aliases (not overlay/public DNS names) and use nohup ... </dev/null >log 2>&1 & for remote background commands. Two script regressions in test-3node-exit.sh (hostnames, SSH backgrounding) predated the wgplane-native-platform-stubs workflow and were only caught by running the in-wild suite. (wgplane-native-platform-stubs)
wf-26 TamaGo exit-node --open/--demo trust mode: accept any exit_request (logging consumer pubkey for audit) so exit try with ephemeral node IDs works against stricter trust policies. Currently the TamaGo binary uses allowAll{} (main.go:133) which makes exit try work today, but any future move to a stricter TrustChecker would break ephemeral consumers. Filed from exit-try-onramp retrospective — flagged as blocking prerequisite in the study.
wf-27 Health-gate retry with exponential backoff. DONE 2026-04-28 (closed by pd-19): cmd/portdistrict/exit_try_tor.go:373 healthBackoffs extended from [2s, 4s, 8s] (~14s budget) to [5s, 10s, 20s, 30s] (~65s budget) as defense-in-depth alongside pd-19’s truthful server-side readiness gate. Different specific values from wf-27’s planned [500ms, 1s, 2s] because the dominant delay is descriptor-publication cold-start (seconds-to-minutes) rather than WG handshake (sub-second). The intent — exponential-backoff retry to absorb transient delays — is fully realized. See workflow::2026-04-28_tor-gw-consensus-hsdir-cold-start::retrospective.md.
wf-28 Add context.Context to ExitSender.Receive interface to fix the goroutine leak on timeout. Both runExitConnect and runExitTry share the pattern where the receive goroutine leaks if the timeout fires. Pre-existing debt flagged in exit-try-onramp retrospective.
wf-29 Name Noise protocol cipher states by role, not ordinal: cs_init/cs_resp instead of cs0/cs1 (and similarly owner_send/owner_recv instead of c1/c2 for any channel pair). The trusted-tamago-node bootstrap responder shipped with the cipher states reversed because the ordinal names did not encode the Noise spec’s role mapping. Role-based naming makes the initiator→responder / responder→initiator direction explicit at the declaration site. Howto exists: howto::noise_cipher_state_direction.md. Apply as a convention check in any future Noise_IK or Noise_XX implementation review. (trusted-tamago-node)
wf-30 Add a post-create verification step to scripts/tamago-publish-image.sh AND scripts/tor-gw-publish-image.sh that runs gcloud compute images describe --format='value(guestOsFeatures)' and fails if UEFI_COMPATIBLE, SEV_SNP_CAPABLE, or GVNIC is missing. Catches misconfiguration at image-publish time (~5 minutes earlier in the deploy loop than at instance-create time). Sources: workflow::2026-04-23_gvnic-tls-tcp-relay-bug::retrospective.md (UEFI_COMPATIBLE + SEV_SNP_CAPABLE), workflow::2026-04-25_tor-gw-hardened-image-cloud-portable::late-retrospective.md (GVNIC requirement, grounded in pd-11 in-wild GCE failure when the flag was missing). Partial closure 2026-04-30: Go publish path (machine server build --publish) now verifies all three features via ImagesClient.Get after image creation — see wf-48. Shell-script verification remains open; consider deprecating shell scripts (see wf-55).
wf-31 Audit bootstrap-then-real setup patterns in TamaGo + gVisor integration. cmd/portdistrict-exitnode-tamago/nic_gvnic.go added a bootstrap link-local address and kept it after adding the real address, which caused ca-88’s destination-selective failure. Search for similar AddProtocolAddress / AddAddress patterns where a temporary is installed for bootstrap and the real value is added later without removing the temporary. Source: report::2026-04-23_ca88-fix-source-address-selection.md.
wf-32 Transport Invariant Audit step in proposals: when reusing an adjacent module across transport topologies, list its implicit invariants and verify each against the new transport’s physical properties. Would have caught proxyhealth.Check IP-inequality mismatch at proposal time. (exit-try-via-tor)
wf-33 Transport-invariant audit step in Propose phase: when a proposal reuses a module across transport topologies (e.g., proxyhealth.Check reused from --direct into --via-tor), the proposal must enumerate the reused module’s implicit invariants and map each to a physical property of the new transport. Mismatches get flagged and either resolved at proposal time or logged as known-risks. Would have caught the proxyhealth.Check strict-IP invariant mismatch at proposal time — instead surfaced as a [PIVOT] during in-wild testing of exit-try-via-tor. Candidate howto: auditing_inherited_modules_across_transports.md. Single data point today (exit-try-via-tor); promote to howto on second occurrence, or write preventively if the next transport cycle (e.g., ca-91 chainer-tamago-noise) proves this lens is useful. (exit-try-via-tor retrospective)
wf-34 When introducing a flag that resolves a bundle of values from disk (like --from <ns> resolving onion + tamago_key + future fields), enumerate EVERY value the bundle supplies and mutex it against the resolver flag. Single-value mutex is incomplete if the resolver supplies multiple values. The pd-08 in-wild test caught this: --from correctly mutex’d against --via-tor <onion> but silently accepted --from together with --tamago-key, then used the grant’s key while ignoring the explicit one. Grounded in workflow::2026-04-25_signed-exit-grant-bundle-for-via-tor::retrospective.md §[BUG] 2.
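The bundle-wide mutex can be sketched as a table of every flag the resolver supplies. checkFromMutex and bundleSupplies are hypothetical names; the flag names come from the item above, and future bundle fields would be appended to the same list.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// bundleSupplies lists every flag whose value --from resolves from disk.
// Adding a field to the bundle means adding its flag here.
var bundleSupplies = []string{"via-tor", "tamago-key"}

// checkFromMutex returns an error when --from is combined with any flag whose
// value the bundle already supplies. set holds the names of flags that were
// explicitly passed on the command line.
func checkFromMutex(set map[string]bool) error {
	if !set["from"] {
		return nil
	}
	var conflicts []string
	for _, name := range bundleSupplies {
		if set[name] {
			conflicts = append(conflicts, "--"+name)
		}
	}
	if len(conflicts) == 0 {
		return nil
	}
	sort.Strings(conflicts)
	return fmt.Errorf("--from cannot be combined with %s", strings.Join(conflicts, ", "))
}

func main() {
	// The case that slipped through: --from together with --tamago-key.
	err := checkFromMutex(map[string]bool{"from": true, "tamago-key": true})
	fmt.Println(err)
}
```

With the standard flag package, the set map can be filled by FlagSet.Visit, which visits only flags that were actually set.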
wf-35 String-typed enum constants used in cross-package comparisons must be imported from their canonical setter, not duplicated as string literals. The pd-08 in-wild test caught this: cmd/portdistrict/exit_accept.go compared peer.TrustState != "trusted" (lowercase) but cmd/portdistrict/peer.go:329 writes "TRUSTED" (uppercase). The error message itself revealed the mismatch (is not trusted (state: TRUSTED)). Unit tests masked it because they constructed peerstore records inline using whatever string the test wrote. Candidate fix: promote TrustStateTrusted/Verified/Seen/Revoked constants in internal/peerstore/peerstore.go and require all callers to use them. Grounded in workflow::2026-04-25_signed-exit-grant-bundle-for-via-tor::retrospective.md §[BUG] 1.
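A sketch of the candidate fix follows. Only the "TRUSTED" value is confirmed by the item above; the other three string values, the Peer struct, and requireTrusted are assumptions made for illustration.

```go
package main

import "fmt"

// Canonical trust-state values. "TRUSTED" matches what peer.go writes;
// the remaining values are assumed to follow the same uppercase style.
const (
	TrustStateTrusted  = "TRUSTED"
	TrustStateVerified = "VERIFIED"
	TrustStateSeen     = "SEEN"
	TrustStateRevoked  = "REVOKED"
)

// Peer is a hypothetical, trimmed-down peerstore record.
type Peer struct {
	NodeID     string
	TrustState string
}

// requireTrusted compares against the shared constant, so the reader and the
// writer cannot drift apart on spelling or case.
func requireTrusted(p Peer) error {
	if p.TrustState != TrustStateTrusted {
		return fmt.Errorf("peer %s is not trusted (state: %s)", p.NodeID, p.TrustState)
	}
	return nil
}

func main() {
	fmt.Println(requireTrusted(Peer{NodeID: "a", TrustState: TrustStateTrusted}))
	fmt.Println(requireTrusted(Peer{NodeID: "b", TrustState: TrustStateSeen}))
}
```

Declaring a named string type would not by itself stop a caller from comparing against an untyped literal such as "trusted", since Go converts untyped constants implicitly. The convention (and ideally a lint or grep check) is what enforces use of the constants, in tests as well as production code.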
wf-36 Kernel package signing with project key: replace --allow-untrusted with a project-specific APK signing key for the custom kernel package. Using --allow-untrusted is acceptable for development but should be revisited before any release workflow. Source: sev-snp-capable-alpine-kernel-package study §Deferred Scope.
wf-37 Integration with portdistrict exit grant to auto-populate measurement field — programmatic piping from tamago-show.sh --json to portdistrict exit grant --from-json to reduce operator error. Deferred from tamago-show-script study.
wf-38 Merge tor-gw-show.sh and tamago-show.sh into a single unified operator show script — both scrape from GCE serial; a combined script could reduce operator confusion. Deferred from tamago-show-script study.
wf-40 Implement non-confidential bootstrap protocol (simplified Noise → owner-key proof → WireGuard handoff, no attestation exchange) so machine server claim works for --on hetzner / --on self deployments. Currently errors when deploy metadata lacks --confidential. From workflow::2026-04-29_machine-server-provisioning-cli::retrospective.md Recommendation #1.
wf-42 Specify --on self provider path: doctor --on self should check port bindability, SSH reachability, and distro detection (not ADC). Currently underspecified. From workflow::2026-04-29_machine-server-provisioning-cli::retrospective.md Recommendation #4.
wf-43 Implement remote capability toggling (enable --label / disable --label / show --label for non-local servers) via a daemon-to-daemon control protocol. Currently errors with “SSH in and run locally”. Restores the 5-command operator path documented in machine-server-provisioning-cli proposal Summary. From workflow::2026-04-29_machine-server-provisioning-cli::retrospective.md Recommendation #5.
wf-44 Extend bootstrap protocol to include emit report alongside session report, enabling claim-time app hash verification. Currently the claim path only receives the session report (nonce+sessionHash), not the emit report (noisePub+appHash). From workflow::2026-04-30_offline-measurement-prediction::retrospective.md Recommendation #1.
wf-47 In-wild GCE byte-equality validation: deploy a real GCE Confidential VM, capture its reported LAUNCH_DIGEST, run PredictFirmware against the corresponding OVMF binary downloaded from gs://gce_tcb_integrity/ovmf_x64_csm/<hash>.fd, and assert byte-equality. Phase 0 spike A1 only validated that gce-tcb-verifier/sev.LaunchDigest compiles and rejects bad input; it did NOT compare a predicted digest to a real CVM’s reported value. Without this test, PredictFirmware is plausibly-typed but unproven against reality. Manual gate before any production rollout that relies on --measurement-file for first-deployment substitution detection. Uses GCE spend; suitable for a one-time validation run after the 43 NOT-VERIFIED markers are human-reviewed. From workflow::2026-04-30_offline-measurement-prediction::retrospective.md follow-on identified during reflect. First data point captured 2026-04-30 (in-wild test, label wild-test-1, project portdistrict, zone europe-west3-b, --vcpus 2, OVMF 759990ee... from public TCB bucket dated 2026-04-15): predict ≠ reality. Predicted 2d24cf9624ee36449e50c6c84042540b05898f6559f02741b7b354e0cc2ed18d108352ade7dfc4cecce4fa974e51c773 vs reported 7a5ed176bad8a9ff02cebb94b24b076a0b1905042a85d9fca7670d3a3ff466db3b1c2b76f8eca888f8d806d2ec92434e. Attestation chain (ECDSA-P384, VCEK→ASK→ARK) valid; app hash matched exactly. Hypotheses: (a) regional OVMF rollout lag (the public bucket’s newest binary isn’t what GCE actually loads in europe-west3-b yet); (b) predict.LaunchDigest missing GCE-specific input (kernel/IDBlock/boot params); (c) vcpu/topology assumption mismatch. Pivot wf-47 from “deploy and validate” to “fix prediction or document divergence”. See workflow::2026-04-30_machine-server-publish-image::retrospective.md §In-Wild Validation [BUG-2].
wf-49 Add OVMF caching to ~/.cache/portdistrict/ovmf/ to avoid repeated ~4MB downloads when building frequently with the same measurement
wf-50 Add automatic measurement discovery from running CVM metadata to complement machine server build --ovmf-from-measurement (eliminates need to manually copy measurement hex)
wf-51 Add OVMF binary caching to ~/.cache/portdistrict/ovmf/<digest>.fd to avoid repeated ~4MB downloads on rebuilds. Use xdg.CacheHome (already a dep). Skip download if cached file SHA-384 matches expected digest. From workflow::2026-04-30_add-ovmf-firmware-auto-download-to-build::retrospective.md Recommendation #1.
wf-52 Add automatic measurement discovery from running CVM (e.g. via SSH or GCE metadata) to complement attest verb — eliminates manual --ovmf-from-measurement <hex> lookup. From workflow::2026-04-30_add-ovmf-firmware-auto-download-to-build::retrospective.md Recommendation #2.
wf-53 Standalone publish-image verb: publish pre-built archives to GCS + GCE without rebuilding. Currently --publish requires --source-dir. From workflow::2026-04-30_machine-server-publish-image::retrospective.md deferred scope.
wf-54 Image versioning: support timestamped image names (e.g. exitnode-tamago-gvnic-20260430) instead of fixed exitnode-tamago-gvnic, enabling rollback. Currently uses delete-then-create with a single name. From workflow::2026-04-30_machine-server-publish-image::retrospective.md deferred scope.
wf-55 Assess whether shell publish scripts (scripts/tamago-publish-image.sh, scripts/tor-gw-publish-image.sh) can be deprecated now that machine server build --publish exists. wf-30 shell-script verification is still open. From workflow::2026-04-30_machine-server-publish-image::retrospective.md recommendation #3. Assessment done 2026-05-11 — full classification in scripts/README.md (code::scripts/README.md). Verdict: (1) eight scripts are fully replaced by the portdistrict CLI today and can be deleted in one PR — tamago-build-exit-node.sh, tamago-package-disk.sh, tamago-deploy-gce.sh, tamago-show.sh, tor-gw-image-build.sh, tor-gw-deploy.sh, tor-gw-show.sh, tor-gw-startup.sh (closes/retargets wf-37, wf-38); (2) the two named publish scripts have a thin residual gap (CLI --publish requires --source-dir so it rebuilds; no publish-pre-built-archive path) — keep until wf-53 lands, then delete and close wf-30 as superseded by the Go ImagesClient.Get check (wf-48), retargeting pd-25 to the Go build path; (3) the remaining ~25 scripts (kernel-APK build input build-kernel-pd-snp.sh, CI harnesses tamago-verify-reproducible.sh/tamago-attest-e2e.sh, QEMU+spike harnesses, the 3-node integration suite, daemon wrappers, debug/reset utilities) are out of scope and stay. Item stays open for the §1 deletion PR.
wf-57 --ovmf-from-measurement first-deploy ergonomics — add help-text disambiguation that the flag expects an SEV-SNP launch measurement (the digest attest reports), not the OVMF binary SHA, and explicitly hint “first deploy: use --ovmf <path>”. Optional: add an --ovmf-from-bucket-latest shortcut that downloads the most recent .fd from the public TCB bucket — covers greenfield in one flag. Related to wf-50 (auto-discover from running CVM). From workflow::2026-04-30_machine-server-publish-image::retrospective.md §In-Wild Validation [BUG-1].
wf-60 exit try --invite <single-blob>: collapse the three-flag friend-side invite (--direct <ip:port> --pubkey <key> --token <token>) into one labeled base64url blob so the operator sends one short string in chat instead of a long multi-flag command. Reuse prior art: mirror the portdistrict-grant1:<base64url-payload> format already used by exit grant/exit accept (see workflow::2026-04-25_signed-exit-grant-bundle-for-via-tor::retrospective.md and code::internal/signalbundle/bundle.go Encode/Decode matched-pair pattern: struct → map → canonjson → ed25519.Sign → re-marshal). Picks: label portdistrict-invite1: (versioned, parallel to grant1:); payload {transport: "direct", direct, pubkey, token} (the token field is already a base64url blob from code::internal/exitnodeshared/invite_trust.go). Schema extensibility: include an explicit transport field on v1 even though only "direct" is supported initially — this lets a future item add via-tor (or other transports) by extending the same invite1: envelope rather than minting invite2:. Decoder should reject unknown transport values with a clear error directing the user to upgrade portdistrict. Bundling with tor-gw-show.sh is deliberately out of scope (different producer language, different trust model — the via-tor path already has portdistrict-grant1:, and tor-gw-show.sh is on the wf-55 deprecation track). Implementation: ~30 LOC encode/decode + a new --invite <blob> flag on exit try that expands to the existing three flags; machine server invite output gains a second line printing the compact form (existing three-flag form stays for back-compat). Update site: code::cmd/portdistrict/machine_server_remote.go:454 where runMachineServerRemoteInvite prints portdistrict exit try --direct ... --pubkey ... --token .... Sibling of wf-53 (standalone publish verb) — same theme of “ergonomics for the invited side.” Origin: 2026-04-30 conversation about friend-side convenience after the wf-48 in-wild test.
wf-61 Add release binary signing (cosign or GPG) to .forgejo/workflows/release.yml. Distinct from APK signing (wf-36) — this covers the standard Go binaries produced by the release workflow. (release-binaries retro 2026-04-30)
wf-62 Docker image registry publishing for portdistrictd. Distroless Dockerfile exists (test/fixtures/distroless/Dockerfile) but no image registry publishing workflow. (release-binaries study deferred scope 2026-04-30)
wf-64 exit invite --legacy: emit the three-flag portdistrict exit try --direct ... --pubkey ... --token ... form alongside the fenced display, for friends running older portdistrict builds that lack the positional invite-blob detection. Was in the proposal but skipped during implementation; the three-flag form is still available via direct invocation, so this is a small ergonomic add (~10 LOC). From workflow::2026-04-30_exit-invite-convenient-defaults::retrospective.md Recommendation #1.
wf-67 Extend machine server with a --with-facilitator augmentation flag that opts an exit-node deploy into colocating the facilitator capability on the same VM as the exit-node (default), with a --facilitator-vm escape hatch for the rare case where a separate VM is wanted. Local-host case is already shipped: on a regular host machine server enable --exit-node --facilitator already runs both capabilities inside one portdistrictd (cli_semantics_design.md §2.3.2; cmd/portdistrictd/main.go:22,229,242 wires facilitatorplane based on facilitator.json). wf-67 brings the same colocation to remote-orchestration. Two paths, both colocated by default: (1) non-confidential deploy (deploy --on hetzner --label foo --with-facilitator or deploy --on gce --label foo --with-facilitator without --confidential): trivial — provisions a single regular Linux VM running portdistrictd --exit-node --facilitator, where both capabilities share the daemon already; the augmentation flag just toggles the facilitator.json config-file write at deploy time and opens the firewall port. No image change needed beyond the existing exit-node deploy path. (2) confidential / Tamago deploy (deploy --on gce --confidential --label foo --with-facilitator): the Tamago binary built by machine server build (cmd/portdistrict-exitnode-tamago) is single-purpose and does not contain facilitator code today — colocation requires a Tamago port of the facilitator (filed as ca-103): TCP listener (Tamago net stack handles this), registration store (in-memory map suffices for single-VM scope), no glibc dependencies. ca-103 is wf-67’s main implementation cost; spike before propose to confirm the Tamago net stack covers what internal/facilitatorplane/ needs. Once ported, build --confidential --with-facilitator produces a single Tamago binary with both planes, and deploy --confidential --with-facilitator provisions one CVM that runs both — preserving the pd-18 trust chain for the exit-node plane unchanged. 
--facilitator-vm escape hatch (low priority, defer behind colocation): when the operator wants to scale the facilitator independently or place it in a different network zone, --with-facilitator --facilitator-vm provisions a second VM running portdistrictd --facilitator, paired by --label. The default stays colocated. Per-verb behaviour with --with-facilitator (colocated mode, mirrors wf-66’s label-paired idiom but with one VM): build --with-facilitator produces an image whose daemon has both capabilities compiled in (Tamago port required for --confidential); deploy --with-facilitator writes both exit-node.json and facilitator.json into instance metadata and opens both ports (51821/UDP + 7777/TCP); attest --label foo validates pd-18 on the single CVM as today (no extra check — facilitator runs in the same measured boot); claim --label foo registers both the exit-node endpoint and the facilitator endpoint in the local device record in one Noise → owner-key proof exchange; show / doctor surface both planes’ state from the one VM; destroy tears down the one VM. Compatibility with --via-tor: build/deploy --confidential --via-tor --with-facilitator --label foo provisions two VMs (exit-node-with-facilitator CVM + tor-gw CVM) — --via-tor always adds a second VM (different image, different role), --with-facilitator does not. So the VM-count matrix is 1 + (--via-tor ? 1 : 0) + (--with-facilitator --facilitator-vm ? 1 : 0). Confidential vs regular axis: --with-facilitator does not by itself require --confidential (regular path is path 1 above and is essentially free), but the confidential path is gated on the Tamago facilitator port landing first. Until ca-103 lands, --confidential --with-facilitator either errors out pointing at ca-103, or auto-implies --facilitator-vm (decide during study). Doc impact: cli_semantics_design.md §2.3.2 grows a --with-facilitator paragraph parallel to wf-66’s --via-tor paragraph, noting colocation as the default. 
Scope split: (a) non-confidential colocated path is small — a config-file plumbing change + firewall port. (b) confidential colocated path is medium — Tamago facilitator port (likely warrants its own ca-* item + spike). (c) --facilitator-vm escape hatch is medium — reuses wf-66’s two-VM pattern. Recommend study sequences (a) → (b), with (c) deferred. Order: file after wf-66 lands; the augmentation pattern wf-66 establishes (label-paired co-deployment in deploy state, attest/claim recovering augmentations from deploy state, show displaying multi-plane state) is the foundation wf-67 reuses. Origin: 2026-05-01 conversation — first framed as a separate-VM augmentation, then corrected: facilitator can colocate on the same VM as the exit-node (matches the local-host two-capability case in cli_semantics §2.3.2), with the separate-VM path kept as an opt-in escape hatch.
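The VM-count matrix stated for wf-67 can be written down as a small pure function, which may be handy as a table-driven test target once the flags are wired. This is a minimal sketch: the deployFlags struct and vmCount name are hypothetical and not taken from cmd/portdistrict/.

```go
package main

import "fmt"

// deployFlags is a hypothetical flag bundle used only for this illustration.
type deployFlags struct {
	ViaTor          bool // --via-tor
	WithFacilitator bool // --with-facilitator
	FacilitatorVM   bool // --facilitator-vm (only meaningful with --with-facilitator)
}

// vmCount implements 1 + (--via-tor ? 1 : 0) + (--with-facilitator --facilitator-vm ? 1 : 0).
func vmCount(f deployFlags) int {
	n := 1 // the exit-node VM is always provisioned
	if f.ViaTor {
		n++ // tor-gw companion VM
	}
	if f.WithFacilitator && f.FacilitatorVM {
		n++ // separate facilitator VM (escape hatch)
	}
	return n
}

func main() {
	fmt.Println(vmCount(deployFlags{ViaTor: true, WithFacilitator: true}))
}
```

Note that --facilitator-vm on its own does not add a VM in this sketch, matching the rule that colocation is the default and the escape hatch only applies together with --with-facilitator.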
wf-68 Context-timeout audit on existing multi-step deploy paths in cmd/portdistrict/. The --via-tor augmentation (wf-66 deploy half) surfaced a latent context-timeout budget mismatch: the original 5-min context.WithTimeout was insufficient because serial polling alone can take up to 6 min (240s onion poll + 120s Noise pubkey poll). Fixed by extending to 12 min when viaTor=true. Audit the other multi-step functions (runMachineServerBuild, runMachineServerAttest, runMachineServerClaim, runMachineServerReconnect) for similar latent issues, especially before wf-67 --with-facilitator adds another sequential VM operation. Apply the rule from ~~howto::context_timeout_budget_for_multi_step_operations.md~~. From workflow::2026-05-01_machine-server-via-tor-deploy-augmentation::retrospective.md Recommendation #1.
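One way to make the budget rule mechanical is to derive the outer timeout from the per-step timeouts instead of hard-coding it. The sketch below assumes that approach. The stepBudget type, the totalBudget helper and the 2-minute margin are hypothetical; only the 240s and 120s poll durations come from the item above.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// stepBudget names one serial step and the longest it may block.
type stepBudget struct {
	name    string
	timeout time.Duration
}

// totalBudget sums the per-step timeouts and adds a safety margin, so the
// outer context cannot expire before the last serial step has had its full
// allowance.
func totalBudget(margin time.Duration, steps ...stepBudget) time.Duration {
	total := margin
	for _, s := range steps {
		total += s.timeout
	}
	return total
}

func main() {
	// Poll durations are from wf-68; the margin is illustrative only and
	// does not account for VM creation or other earlier steps.
	budget := totalBudget(2*time.Minute,
		stepBudget{"onion poll", 240 * time.Second},
		stepBudget{"noise pubkey poll", 120 * time.Second},
	)
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()
	_, hasDeadline := ctx.Deadline()
	fmt.Println("budget:", budget, "deadline set:", hasDeadline)
}
```

With this shape, adding another sequential operation (for example a wf-67 facilitator step) extends the outer deadline automatically as long as the new step is listed.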
wf-69 Programmatic length assertions on fixed-length test constants. The wf-66 deploy-half implementation hit a [TEST] entry where a hand-constructed v3 onion address was 54 chars instead of the required 56 (regex [a-z2-7]{56} correctly rejected the test data; the test was wrong). Add a small test helper or convention that asserts len(constant) == expected at the test-file declaration site for fixed-length crypto/protocol values (onion-v3 = 56 base32, ed25519 pubkey = 44 base64url, X25519 pubkey = 43 base64url, etc.) so length errors fail at test-load time with a clear message instead of inside opaque regex non-matches. From workflow::2026-05-01_machine-server-via-tor-deploy-augmentation::retrospective.md Recommendation #2.
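A possible shape for the helper, as a sketch only: mustLen wraps a constant at its declaration site and panics during package initialization if the length is wrong, while checkLen is a non-panicking variant for use inside a test body. Both names are hypothetical, and the expected lengths would be the ones listed in the item above.

```go
package main

import "fmt"

// checkLen reports an error when s does not have exactly want bytes.
func checkLen(name, s string, want int) error {
	if len(s) != want {
		return fmt.Errorf("%s: length %d, want %d", name, len(s), want)
	}
	return nil
}

// mustLen returns s unchanged when its length is correct and panics with a
// descriptive message otherwise. Used in a package-level var initializer,
// it fails when the test binary loads, before any regex is evaluated.
func mustLen(name, s string, want int) string {
	if err := checkLen(name, s, want); err != nil {
		panic(err)
	}
	return s
}

func main() {
	// A 54-character value like the one that tripped the wf-66 test.
	short := "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
	fmt.Println(checkLen("testOnionLabel", short, 56))
}
```

In a test file this would typically be used as var testOnionLabel = mustLen("testOnionLabel", "...", 56), so the failure names the constant and both lengths.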
wf-70 DONE 2026-05-05 Added [--via-tor] [--tor-gw-zone <zone>] [--tor-gw-image <name>] continuation line to the deploy entry in printMachineServerUsage (cmd/portdistrict/machine_server.go:48-49), mirroring the existing build continuation style. Cosmetic 1-LOC fix; flags themselves were already wired in machine_server_deploy.go:26-93 since wf-66. Shipped in commit 453d35e.
wf-71 [BUG] DONE 2026-05-02 machine server deploy --via-tor --confidential did not propagate --confidential to the tor-gw companion VM. Fix shipped same day: added confidential bool parameter to runDeployTorGW; when true, sets ConfidentialInstanceConfig.ConfidentialInstanceType = "SEV_SNP", Scheduling.OnHostMaintenance = "TERMINATE", and MinCpuPlatform = "AMD Milan" on the tor-gw InsertInstanceRequest; call site in runDeployGCE updated to thread confidential through. Validated in-wild same day on instance wf66-smoke3-110535-tor-gw: gcloud instances describe returns SEV_SNP / TERMINATE / AMD Milan; tor-gw serial shows sev-guest sev-guest: Initialized SEV guest driver → tor-gw-init: SEV-SNP detected (/dev/sev-guest present) → derived key obtained from SEV-SNP → sealed onion identity written to tmpfs → attestation report signature verified against locally-cached AMD cert chain (pd-18 chain) with measurement 7a5ed176bad8a9ff02cebb94b24b076a0b1905042a85d9fca7670d3a3ff466db... (canonical pd-18 measurement); produced canonical pd-15/pd-16 sealed onion 3efkxftj…7i7qd.onion. See workflow::2026-05-01_machine-server-via-tor-deploy-augmentation::retrospective.md “In-wild re-run” section. Closes the gap between wf-66’s design intent and shipped behavior.
pd-27 [BUG] DONE 2026-05-02 Tor inside the tor-gw-hardened image failed with /var/lib/tor: Permission denied on fresh boot. Serial output sequence: tor-gw-init: starting (PID 471) → tor-gw-init: tamago private IP = 10.156.0.37 → tor-gw-init: torrc written → tor-gw-init: tor started (PID 476) → Permission denied → [warn] Failed to parse/validate config: Couldn't create private data directory "/var/lib/tor" → [err] Reading config failed--see warnings above. → FATAL: hostname not generated within 1m0s (the hostnameTimeout = 60s from cmd/tor-gw-init/main.go:40). Tor exits before it can publish the onion descriptor; the deploy-side pollSerialForMarker correctly times out at 240s and saves an empty OnionHostname. Pre-existing image regression — not caused by wf-66: the legacy scripts/tor-gw-deploy.sh would hit the identical failure if run against this image build. The pd-09 hardened-image design (Alpine + read-only rootfs + writable tmpfs at runtime-state paths) shipped working in pd-11 Phase 1 + pd-16 (two-CVM same-onion validation). Something in a later image rebuild broke the writable mount at /var/lib/tor — most likely the tmpfs-mount step in init or the directory ownership/permissions. To investigate: (a) gcloud compute images describe tor-gw-hardened to check creation timestamp + family/source; (b) cmd/tor-gw-init/main.go for any change to writable-path setup since pd-11; (c) Alpine image inittab/sysinit scripts for mount -t tmpfs ... /var/lib/tor or chown to the tor user; (d) torrc generation for explicit User tor / DataDirectory directives. Surfaced by: in-wild test of wf-66 deploy half on 2026-05-02 (instance wf66-smoke-101539-tor-gw). Filed as pd because this is platform-deployment infrastructure, not a CLI workflow issue. Root cause identified same day: mountTmpfs() in cmd/tor-gw-init/main.go:303-318 mounts a fresh tmpfs at /var/lib/tor and chmods to 0700 but never chowns to the tor user. 
The image-build script’s chown -R tor:tor /rootfs/var/lib/tor only affects the underlying read-only rootfs directory, which is overlaid by the tmpfs at runtime. When tor drops privileges to the tor user (per User tor in the generated torrc), it cannot write to a root-owned tmpfs. Fix shipped same day: in mountTmpfs(), after mounting tmpfs and before chmod, do user.Lookup("tor") and os.Chown("/var/lib/tor", uid, gid) (using the same pattern already established in writeHSKeyFiles at line 435). Image rebuilt + republished as tor-gw-hardened. Validated in-wild same day on instance wf66-smoke3-110535-tor-gw: tor created its DataDirectory successfully (no permission error), tor-gw-init reached the descriptor-publication step, and printed TOR_GW_ONION_HOSTNAME=3efkxftjm2numzarrt77x677biqmnbosoygeo6tcuxp5dhowwge7i7qd.onion to serial within the deploy-side 240s pollSerialForMarker budget. See workflow::2026-05-01_machine-server-via-tor-deploy-augmentation::retrospective.md “In-wild re-run” section.
wf-75 DONE 2026-05-05 Downstream per-verb via-tor cycles (claim-reconnect, show-doctor, build, attest) — references Decision 3 (always-via-tor claim) and Decision 2 (show bridge verb) from via-tor-unification-design. Partial DONE 2026-05-04 via workflow::2026-05-04_via-tor-cli-parity::retrospective.md: attest/reconnect/doctor per-verb wiring closed for local via-tor hosts; claim/show already shipped in parent cycles. Build half closed 2026-05-05 via wf-80 (workflow::2026-05-04_machine-server-build-via-tor::retrospective.md): machine server build --via-tor now embeds the tor-gw companion image-build pipeline. All per-verb wiring is now complete.
wf-80 DONE 2026-05-05 machine server build --via-tor — Tamago image-build pipeline embedded into binary alongside the new tor-gw companion image-build pipeline (closes the build/publish half of the via-tor deploy flow). Shipped via workflow::2026-05-04_machine-server-build-via-tor::retrospective.md. Validated end-to-end in-wild: both images published, deploy succeeded, canonical pd-15/pd-16/pd-18 sealed onion (3efkxftj…7i7qd.onion) re-published.
wf-84 DONE 2026-05-05 Embed tor-gw-image-build.sh + tor-gw-publish-image.sh into portdistrict. New ROADMAP item filed at study time; closed simultaneously with wf-80. The 6-step tor-gw build pipeline (init shim, Alpine rootfs, kernel/initramfs extract, grub EFI, FAT32 disk, tar.gz) now runs from Go via os/exec + embedded shell scripts. Scripts remain in-tree for reference but are no longer required for the build/publish workflow. Shipped via workflow::2026-05-04_machine-server-build-via-tor::retrospective.md.
wf-81 DONE 2026-05-04 exit connect --via-tor persistent friend tunnel. Implemented in cmd/portdistrict/exit_connect_tor.go: reuses handleBrowser chainer goroutines + grant resolution + health-check backoff from exit try --via-tor; key difference is no browser launch — saves ExitState{Mode: "tor-socks5"}, blocks on SIGINT/SIGTERM, clears state on exit. Outer runExit dispatch pre-scans --via-tor/--from on connect (mirroring the try pre-scan, addresses parser-bug class from completion retrospective). Tests in cmd/portdistrict/exit_connect_tor_test.go. Shipped via workflow::2026-05-04_via-tor-cli-parity::retrospective.md.
wf-82 DONE 2026-05-04 exit list implementation. Replaces “not yet implemented” stub in cmd/portdistrict/exit.go with transport-agnostic provider table (OPERATOR / TRANSPORT / ONION / EXPIRES) plus --json mode. Reads from internal/exitgrant.Store with cross-reference into internal/peerstore. Tests in cmd/portdistrict/exit_list_test.go. Shipped via workflow::2026-05-04_via-tor-cli-parity::retrospective.md.
wf-83 DONE 2026-05-04 machine server destroy --via-tor for local hosts (provider == "self", OnionHostname != ""). Disables via-tor config (load → ViaTorEnabled = false → save), sends SIGHUP to the daemon to stop the tor supervisor, removes registry entry. Onion keys retained by default with manual-wipe note. Reuses the disable --via-tor config+SIGHUP pattern from machine_server.go:361-382. Shipped via workflow::2026-05-04_via-tor-cli-parity::retrospective.md.
wf-85 Default the release pipeline and friend-side build to CGO_ENABLED=0 static binaries. The two-friend GCE test required static binaries for cross-distro compatibility (Friend B on Ubuntu 20.04, glibc 2.31 rejected the dynamically-linked default built on Debian forky, glibc 2.42). (via-tor-socks5-hardening retro)
wf-86 Cross-binary wiring checklist in Done When criteria: when a feature spans client/server/config with metadata layers in between (invite → operator config → GCE metadata → tor-gw-init), the Done When criteria must explicitly name each connecting wire. “Client side ✓” and “server side ✓” can both be true while the wire between them is missing. Two data points from via-tor-access-control (Phase 4 wiring gap, ClientOnionAuthDir gap). (via-tor-access-control)

Code Architecture

ca-08 cmd-group-publish DeviceRecord for group context. Extends the publish pattern to shared infrastructure. Partially delivered by daemon-groups-foundation 2026-04-14 — core group device publish implemented at cmd/portdistrictctl/group.go:472 (runGroupDevicePublish), encrypts group DeviceRecord with group_read_key, publishes to DNS via existing Cloudflare path. Still open: CLI ergonomics polish (richer error messages, help text refinement, interactive confirmation prompts, --dry-run flag), edge case handling (missing group_read_key, stale state files, in-flight key rotation), user-facing docs. Consider closing as “delivered” if the polish is out of scope for the current milestone, or keep open and rescope to cover only the polish items.
ca-09 cmd-peer-revoke Credential revocation for the trust lifecycle. Complements peer trust/verify.
ca-12 Add prompt or documentation hint in device init output suggesting users set an endpoint before publishing (cmd-connect)
ca-13 device-serve One-time HTTPS server for provision file transfer, URI-based join, and QR code rendering. Deferred from device-provision.
ca-14 Enable //go:build integration tests for WireGuardManager in CI when root-capable CI runners are available (daemon-minimal)
ca-18 Trust-gate redesign for sibling devices: runConnect and any other peerstore-based trust gate should support sibling-device provenance natively, either via a self-entry in the peerstore or via a direct check against the local device keyring’s MasterNodeID. The current local patch in cmd/portdistrictctl/main.go is a point fix; the architectural question is whether the peerstore should be the trust oracle for sibling devices at all. (first-run bug hunt)
ca-19 Add --keyring flag to connect (and other subcommands that read the master keyring under sudo) so sudo HOME=$HOME is not required. Currently only --device-keyring is accepted. Small tactical fix (~5 lines) that unblocks first-run tutorials without requiring environment gymnastics. (first-run bug hunt)
ca-21 Facilitator client configuration path on portdistrictd: daemon.Config.Facilitators is read in RunChores (daemon.go:120) to spawn per-facilitator Registration tasks, but cmd/portdistrictd/main.go never populates it — there is no --facilitator flag, no device-keyring field, no config file. The --facilitator-listen flag (server side) is implemented but there is no corresponding client-side path, so the daemon cannot register with any facilitator even when one is running. This blocks end-to-end Phase H testing of the daemon-facilitator-serve authorization challenge flow on real hardware. ca-15 (facilitator enable CLI) and ca-16 (facilitator_endpoint persistence in DeviceKeyring) cover the proper long-term fix but neither has been implemented. Tactical fix option: add a --facilitator <endpoint>:<wg-pubkey>:<node-id> flag for testing, clearly marked as a test harness, to be replaced by ca-15/ca-16. Blocks sibling-topology and cross-identity Phase H testing until this or ca-15 lands. (first-run bug hunt, Phase H preparation)
ca-22 identity import silent stale-pubkey update: the “Peer already known” branch of runIdentityImport (cmd/portdistrictctl/main.go:372-393) updates only LastBundleTimestamp, silently keeping stale IdentityPubkeyB64 and IdentityECDHPubkeyB64 values if the peer rotates their master identity. Re-importing a new bundle with a changed pubkey has no effect; the downstream identity verify then reports a spurious “identity_pubkey mismatch” that is correctly detected but caused by import silently ignoring the incoming pubkey. Proper fix options: (A) fail with a clear error telling the operator to remove the stale entry; (B) compare stored vs incoming pubkey and refuse to update silently; (C) prompt for confirmation. Option B is the smallest change with the most explicit failure mode. Discovered in Session 2 of the first-run bug hunt when an earlier rm -rf ~/.config/portdistrict/ left a stale peer file behind and the re-import didn’t catch the divergence. (first-run bug hunt Session 2)
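Option B could look roughly like the following. This is a sketch under assumptions: the peerEntry struct, its field types and the updateKnownPeer name are hypothetical stand-ins for whatever the "Peer already known" branch actually operates on. Only the three field names come from the item above.

```go
package main

import (
	"errors"
	"fmt"
)

// peerEntry is a hypothetical subset of a stored peer record.
type peerEntry struct {
	IdentityPubkeyB64     string
	IdentityECDHPubkeyB64 string
	LastBundleTimestamp   int64
}

// errPubkeyChanged signals that the incoming bundle carries different
// identity keys than the stored entry.
var errPubkeyChanged = errors.New("incoming bundle has a different identity pubkey; remove the stale peer entry before re-importing")

// updateKnownPeer refuses to touch the stored entry when either pubkey
// differs, and otherwise refreshes only the bundle timestamp.
func updateKnownPeer(stored *peerEntry, incoming peerEntry) error {
	if stored.IdentityPubkeyB64 != incoming.IdentityPubkeyB64 ||
		stored.IdentityECDHPubkeyB64 != incoming.IdentityECDHPubkeyB64 {
		return errPubkeyChanged
	}
	stored.LastBundleTimestamp = incoming.LastBundleTimestamp
	return nil
}

func main() {
	stored := peerEntry{IdentityPubkeyB64: "old", IdentityECDHPubkeyB64: "x", LastBundleTimestamp: 1}
	err := updateKnownPeer(&stored, peerEntry{IdentityPubkeyB64: "new", IdentityECDHPubkeyB64: "x", LastBundleTimestamp: 2})
	fmt.Println(err)
}
```

The comparison here is on the encoded strings, so it assumes both sides use the same base64 variant. If the ca-58 encoding inconsistency applies to these fields, comparing decoded bytes would be the safer choice.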
ca-26 Spec update: fold Path B identity ECDH derivation (DeriveIdentityECDHKeypair / BLAKE3 domain separation) into identity_and_trust.md section 21 as a normative requirement. Currently the load-bearing privacy invariant (domain separation between Ed25519 and X25519 KDF branches) is implementation-correct but spec-silent. See doc::portdistrict::identity_ecdh_domain_separation.md. (master-device-roster)
ca-27 Identity bundle carrying device namespaces: extend signalbundle.Encode to optionally include device_namespaces so peers learn devices at import time (convenience, not required — DNS remains authoritative). Deferred from master-device-roster study. (master-device-roster)
ca-28 Master record expiry-based republish in daemon: extend RecordPublishTask to periodically republish the MasterRecord v2 (not just the DeviceRecord), using the same chore-runner interval. Currently only CLI commands trigger master record publishes. Deferred from master-device-roster study as larger scope. (master-device-roster)
ca-29 Daemon treats cross-master roster entries as local siblings. device roster add <ns> warns when <ns> does not end with the master namespace but stores the entry in device_namespaces unchanged. On daemon startup, the sibling iteration at code::internal/daemon/daemon.go:88 attempts to resolve every roster entry as a sibling device, producing repeated resolve _portdistrict.<ns>: no such host (or similar) errors in the log for any cross-master or typo’d entry every refresh cycle. Concrete symptom observed during master-device-roster manual validation (2026-04-13): adding stranger.alice.example.com to the roster for testing the cross-master warning path made the daemon log it as sibling stranger.alice.example.com: resolve device stranger.alice.example.com: ...: no such host on every startup and DNS refresh cycle. Design options: (A) strip cross-master entries from sibling iteration at daemon startup (filter roster to entries where strings.HasSuffix(entry, "." + ownMasterNamespace) before iterating); (B) separate roster field for foreign namespaces (device_namespaces []string for local siblings + foreign_namespaces []string for cross-master entries the owner explicitly claims, with different consumer semantics); (C) enforce rejection of cross-master entries at device roster add time (turn the warning into an error unless a --force flag is set); (D) keep current behavior but improve the log line to sibling <ns>: cross-master entry not reachable, skipping so the operator understands the degradation is intentional. Option A is the minimum correct fix and matches the implicit model of the roster as “my local devices”. Option C is the strictest but breaks the current (no-op but warn) UX. Option B is a schema change that would let users legitimately claim namespaces they don’t directly control. Related to but distinct from ca-20 (sibling discovery) and ca-24 (cross-identity discovery) — this is about roster hygiene, not discovery mechanism. 
Captured in workflow::2026-04-13_master-device-roster::retrospective.md Manual Validation Session § New findings.
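Option A for ca-29 amounts to a suffix filter applied before the sibling iteration. A minimal sketch follows, with a hypothetical localSiblings helper; the real iteration in internal/daemon/daemon.go is not reproduced here.

```go
package main

import (
	"fmt"
	"strings"
)

// localSiblings keeps only roster entries that sit under the owner's master
// namespace. The leading dot in the suffix prevents "notbob.example.com"
// from matching the master namespace "bob.example.com".
func localSiblings(roster []string, ownMasterNamespace string) []string {
	suffix := "." + ownMasterNamespace
	var out []string
	for _, entry := range roster {
		if strings.HasSuffix(entry, suffix) {
			out = append(out, entry)
		}
	}
	return out
}

func main() {
	roster := []string{"laptop.bob.example.com", "stranger.alice.example.com"}
	fmt.Println(localSiblings(roster, "bob.example.com"))
}
```

Filtered-out entries are dropped silently in this sketch; combining it with the option D log line would keep the degradation visible to the operator.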
ca-30 Empty-roster v2 publishers degrade to legacy path in ConnectAllTrusted tier 2. The check at code::internal/daemon/daemon.go:136 is if len(peerRoster) > 0 { ... continue }. resolveDeviceRoster at code::internal/daemon/roster_resolve.go:20 returns nil for v1 records and returns the decoded DeviceNamespaces slice for v2 records. A peer publishing a v2 master record with an empty device_namespaces slice (e.g., they just ran identity publish before adding any devices, or deliberately want to assert “no devices under this identity”) returns a non-nil empty slice from resolveDeviceRoster. The if len(peerRoster) > 0 check is false, the block is skipped, and the daemon falls through to the legacy tier-3 path (d.pm.AddPeer(nodeID)), which hits the ca-24 anti-pattern for any peer whose p.Namespace is a master namespace. A v2 peer explicitly publishing “I have no devices” should be treated as conclusive (“no devices to connect to for this peer, done”) — not as a signal to try the known-broken legacy path. Fix: distinguish nil (v1 record or resolution failure, fall back to legacy) from non-nil empty slice (v2 with empty roster, do not fall back). Option (A): check peerRoster != nil at the daemon, treat empty non-nil as “definitively no devices”; Option (B): change resolveDeviceRoster to return an explicit enum or sentinel for “v2 with empty roster” vs “v1 / resolution failure”. Option A is the smaller change. Minor edge case — unlikely to fire in practice because most users add devices before publishing — but it’s a latent logic bug that would confuse debugging of an actually-empty v2 roster. Discovered during manual validation of master-device-roster (2026-04-13). Captured in workflow::2026-04-13_master-device-roster::retrospective.md Manual Validation Session § New findings.
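The nil versus empty-slice distinction behind option A can be isolated in a small classifier. The sketch below uses hypothetical names (rosterAction, classifyRoster) and relies on the behavior described above, namely that resolveDeviceRoster returns nil for v1 records and a non-nil slice for v2 records.

```go
package main

import "fmt"

// rosterAction is the decision the daemon takes for one peer.
type rosterAction int

const (
	fallBackToLegacy rosterAction = iota // v1 record or resolution failure
	noDevices                            // v2 record with an explicitly empty roster
	connectRoster                        // v2 record with at least one device
)

// classifyRoster distinguishes a nil roster from a non-nil empty one.
// In Go, len() is 0 for both, so the nil check has to come first.
func classifyRoster(peerRoster []string) rosterAction {
	switch {
	case peerRoster == nil:
		return fallBackToLegacy
	case len(peerRoster) == 0:
		return noDevices
	default:
		return connectRoster
	}
}

func main() {
	fmt.Println(classifyRoster(nil), classifyRoster([]string{}), classifyRoster([]string{"laptop.bob.example.com"}))
}
```

This only works if every code path between the decoder and the daemon preserves the distinction; a helper that rebuilds the slice with append onto a nil value would collapse an empty v2 roster back to nil.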
ca-33 Audit trust-state filtering consistency across encoder-side key-collection functions: collectPeerECDHPubkeys (TRUSTED only) vs collectECDHPubkeys and runDevicePublish inline loop (VERIFIED+TRUSTED). Pre-existing discrepancy unrelated to Path A fix but worth harmonizing. Source: workflow::2026-04-13_encoder-side-path-a-fallback-removal::retrospective.md
ca-34 daemon-relay: Relay fallback connectivity for peers that can’t hole-punch via sync windows. Companion to daemon-syncplan. Deferred from daemon-syncplan study scope.
ca-35 Adaptive backoff on repeated sync window failures (Doc 5 §15 implementation-defined gap): currently a failed window attempt has no memory — the same peer is re-attempted on the next scheduling cycle. Add exponential backoff per (nodeID, deviceNS) pair. Source: daemon-syncplan deferred scope.
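A minimal sketch of per-pair exponential backoff for ca-35, assuming a simple doubling schedule with a cap. The peerKey and backoffTable types, and the base and max parameters, are hypothetical; the spec leaves the policy implementation-defined.

```go
package main

import (
	"fmt"
	"time"
)

// peerKey identifies one (nodeID, deviceNS) pair.
type peerKey struct {
	NodeID   string
	DeviceNS string
}

// backoffTable remembers consecutive failures per pair.
// It is not safe for concurrent use without external locking.
type backoffTable struct {
	base, max time.Duration
	failures  map[peerKey]int
}

func newBackoffTable(base, max time.Duration) *backoffTable {
	return &backoffTable{base: base, max: max, failures: make(map[peerKey]int)}
}

// Delay returns how long to wait before the next attempt for k:
// 0 with no recorded failures, then base, 2*base, 4*base, ... capped at max.
func (b *backoffTable) Delay(k peerKey) time.Duration {
	n := b.failures[k]
	if n == 0 {
		return 0
	}
	d := b.base
	for i := 1; i < n; i++ {
		d *= 2
		if d >= b.max {
			return b.max
		}
	}
	if d > b.max {
		return b.max
	}
	return d
}

// RecordFailure notes one more failed window and returns the new delay.
func (b *backoffTable) RecordFailure(k peerKey) time.Duration {
	b.failures[k]++
	return b.Delay(k)
}

// RecordSuccess clears the failure history for k.
func (b *backoffTable) RecordSuccess(k peerKey) { delete(b.failures, k) }

func main() {
	b := newBackoffTable(30*time.Second, 10*time.Minute)
	k := peerKey{NodeID: "node-a", DeviceNS: "laptop.bob.example.com"}
	for i := 0; i < 3; i++ {
		fmt.Println(b.RecordFailure(k))
	}
}
```

Keying on the pair rather than on the node alone means one unreachable device does not delay attempts to a sibling device of the same peer.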
ca-36 Multi-peer sync window prioritization: v1 uses first-match-wins. Future versions should prioritize by peer trust level, last-seen recency, and window overlap quality. Source: daemon-syncplan deferred scope.
ca-37 User-configurable sync window schedules via CLI: v1 auto-generates 4 windows at 30-minute spacing. Allow users to specify explicit schedules for deterministic rendezvous timing. Source: daemon-syncplan deferred scope.
ca-38 Group read key slot ECDH path decision (blocks daemon-groups-management member-removal + read-key-rotation work). Protocol spec §2.5 specifies group read key slots literally as X25519(admin_ephemeral_sk, member_group_pseudonym_pk_as_x25519) — Path A (Ed25519→X25519 conversion). But (1) spec §2.5 itself carries an [ASSUMPTION — NEEDS VERIFICATION] marker on the Ed25519→X25519 procedure at spec line 174, and (2) production code has removed all Path A key-slot callers in favour of Path B (DeriveIdentityECDHKeypair + BLAKE3 KDF → independent X25519 keypair) via ca-25/ca-31/ca-32, documented in doc::portdistrict::identity_ecdh_domain_separation.md as “Path A… is not used for key-slot ECDH in any production caller.” Three candidate resolutions: (A) Adopt Path A for groups only — justify explicitly in identity_ecdh_domain_separation.md that the ca-17 privacy invariant does not apply (the group_pseudonym_pk is published in the group record, so there is no hidden X25519 pub to protect), reintroduce a single Path A key-slot caller, amend the spec divergence doc. Smallest code, biggest doc surface. (B) Define DeriveGroupPseudonymECDHKeypair(masterSeed, groupPubkey) via blake3.NewDeriveKey("portdistrict-group-pseudonym-ecdh-v1"), publish each member’s group_pseudonym_ecdh_pub alongside the Ed25519 group_pseudonym_pk (either in the group record member entry or in the group_accept tunnel message), update spec §2.5 and §9.1 with the new field. Consistent with production convention; largest code + spec surface. (C) Keep foundation’s tunnel-delivered key permanently and drop read_key_slots from the spec entirely — requires solving read-key rotation without slots (re-delivery via tunnel to every active member on every rotation), which is viable for small groups but scales poorly. Out of scope for v1. Resolution requires a focused study (~4-8h) producing a definitive call graph and a spec amendment before the first slot-construction commit. 
Source: workflow::2026-04-14_daemon-groups-foundation::draft.md §14 S2, §15 F13, D6. (daemon-groups-foundation)
ca-49 Endpoint detection ergonomics — automatic discovery of the daemon’s own public endpoint so operators don’t need to manually run portdistrictctl device update --endpoint <ip:port> or configure --ip-echo-url before publishing a device record. Observed concretely 2026-04-16 during ca-45 in-wild validation: mac (EC2 1:1 NAT, public v4 3.21.98.174) and rentamac (home/office NAT, public v4 81.4.164.198) both had endpoints: [] in their device records because (1) DefaultInterfaceScanner at internal/endpointdetect/detect.go:20 only scans for global unicast IPv6 on local interfaces — no IPv4 path, (2) no default --ip-echo-url is configured, (3) the device keyring was never populated with an endpoint via device update, (4) the daemon was launched with --endpoint-detect 0 which disables auto-detection entirely, and (5) device publish produces no warning when publishing a record with zero endpoints. This is a silent failure mode: daemons run, publish records, appear to work, but peers cannot dial them. It caused a wrong diagnosis in the 2026-04-16 three-node test (“both hosts are NAT’d, mesh impossible”) when the actual fix was a one-line device update --endpoint on each host. See Finding correction in the 2026-04-16 in-wild validation session for the grounded reproduction. Design gap enumerated: (A) IPv4 interface scan in DefaultInterfaceScanner — symmetric to the existing v6 path, filter !IsLoopback && !IsPrivate && ip.To4() != nil. Works for bare-metal / VPS / self-hosted with a public v4 NIC. Doesn’t work for 1:1 NAT (EC2, GCE, K8s pods). (B) Cloud metadata service fallback — detect running on AWS/GCE/Azure via http://169.254.169.254/latest/meta-data/public-ipv4 (AWS) or equivalent and use the response. Per-cloud detection code, more reliable than HTTP echo for cloud, no third-party call. (C) Trusted-peer endpoint echo protocol (see ca-50) — preferred, uses portdistrict’s own trust domain instead of external services. 
(D) device publish emits a loud warning (or error with --allow-empty-endpoints escape hatch) when a record has zero endpoints, pointing at device update --endpoint + --ip-echo-url + ca-50 as remediation. Smallest code change, highest operator-facing value per LOC; complements A/B/C rather than replacing them. (E) Cascading fallback: try A (local v4 scan) → B (cloud metadata, short timeout) → C (trusted-peer echo) → warn per D if all fail. Recommended combo: A + B + C + D, skip HTTP echo defaults (privacy leak, external dependency), skip STUN/UPnP/NAT-PMP (explicitly out of scope per doc::portdistrict::wireguard_integration.md §1044). Acceptance: a daemon started on any of {bare-metal with public v4, AWS EC2, GCE VM, rentamac-class home NAT with at least one trusted public peer} should auto-populate DeviceKeyring.Endpoints at startup without operator intervention OR emit a prominent warning that names the specific remediation. Ergonomic impact: eliminates the 2026-04-16 morning confusion where the user had to manually invoke device update --endpoint on two hosts before the test could proceed. Supersedes ca-12 (shallow “device init prompt” item — fold into this). Related to ca-50 (the specific protocol design for option C). Full grounded evidence in 2026-04-16 in-wild validation session retrospective. (ca-45 in-wild validation 2026-04-16)
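The option A filter for ca-49 maps directly onto methods of the standard library's net.IP type. The sketch below is illustrative only: isPublicIPv4, isPublicIPv4String and publicIPv4s are hypothetical names, and DefaultInterfaceScanner itself is not shown.

```go
package main

import (
	"fmt"
	"net"
)

// isPublicIPv4 applies the filter from option A:
// !IsLoopback && !IsPrivate && ip.To4() != nil.
func isPublicIPv4(ip net.IP) bool {
	return ip.To4() != nil && !ip.IsLoopback() && !ip.IsPrivate()
}

// isPublicIPv4String is a convenience wrapper for textual addresses;
// unparsable input is treated as not public.
func isPublicIPv4String(s string) bool {
	ip := net.ParseIP(s)
	return ip != nil && isPublicIPv4(ip)
}

// publicIPv4s picks the candidate endpoints out of a list of interface addresses.
func publicIPv4s(addrs []net.Addr) []net.IP {
	var out []net.IP
	for _, a := range addrs {
		if ipnet, ok := a.(*net.IPNet); ok && isPublicIPv4(ipnet.IP) {
			out = append(out, ipnet.IP)
		}
	}
	return out
}

func main() {
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		fmt.Println("interface scan failed:", err)
		return
	}
	fmt.Println("public IPv4 candidates:", publicIPv4s(addrs))
}
```

The filter as stated does not exclude link-local (169.254.0.0/16) or carrier-grade NAT (100.64.0.0/10) addresses; whether to add those checks is a design decision for the study phase. On a 1:1 NAT host such as EC2 the scan returns nothing, which is where options B and C take over.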
ca-47 DNSRefreshTask fallback to AddPeerDevice(p.Namespace) when roster resolution fails transiently — observed 2026-04-16 on remote_access startup log at 21:52:15: dns-refresh: adding peer RUPKZ7C7YXFARBLCQYUJVEUFXKTKAXUT (rentamac.portdistrict.net) followed by add peer RUPKZ7C7YXFARBLCQYUJVEUFXKTKAXUT (rentamac.portdistrict.net): add peer device RUPKZ7C7YXFARBLCQYUJVEUFXKTKAXUT (rentamac.portdistrict.net): dnsdisc: expected record_type "device", got "master". The parenthesized namespace is the peer’s master namespace (rentamac.portdistrict.net), which is what identity import writes to peerstore.Namespace. Fix ① (95fec6c) rewrote wgmanager.resolvePeerConfig (the AddPeer(nodeID) path) to use master-roster lookup, but the DNSRefreshTask path in internal/daemon/dns_refresh_task.go has its own fallback that calls pm.AddPeerDevice(nodeID, p.Namespace) when roster resolution fails transiently (stale DNS cache, stale key slots, upstream decrypt error). AddPeerDevice expects a device namespace and tries to decode _portdistrict.<master-ns> as a device record, which fails the strict record_type check. This is the DNSRefreshTask analog of ca-20/ca-24 — same root cause (peerstore.Namespace is master), different call site, not covered by 95fec6c. Every 10s refresh tick fires the error until the cache refreshes and roster resolution succeeds. Self-heals eventually but generates misleading log noise and wastes CPU on retried decode failures. Related to but distinct from ca-30 (which is about ConnectAllTrusted’s empty-roster fallback, already obsoleted by fix ①’s error-message change). 
Candidate fixes: (A) dns_refresh_task does not fall back at all — on roster resolution failure, log a clear “transient roster resolve failure, retrying next tick” message and skip the peer; (B) route the fallback through wgmanager.AddPeer(nodeID) (which now uses fix ①’s master-roster lookup) instead of AddPeerDevice(nodeID, p.Namespace), getting the benefit of the already-landed consumer-side fix; (C) heuristic detection of whether p.Namespace looks like a device or master namespace (fuzzy, not recommended). Option B is the smallest change with the most benefit because the error message from fix ① is already operator-actionable. Acceptance: startup log under a transient DNS cache failure shows “transient roster resolve” or fix ①’s “master roster at <fqdn> is empty or unreachable” error, not the obsolete “expected record_type "device", got "master"” error. Triage-level fix (~10 LOC in one function). Full evidence in ~~report::2026-04-16_three-node-test-aborted_findings.md~~ Grounded Observation Capture section. Finding ① from session follow-up discussion. (three-node in-wild test 2026-04-16)
ca-46 Remove the sudo -v call from scripts/mac-daemon.sh: the pre-auth credential refresh at line 102 of the script shipped in f7e4c07 demands a tty even when the sudoers rule is NOPASSWD, failing every non-interactive invocation over ssh with sudo: a terminal is required to read the password. Caught during the 2026-04-16 three-node test attempt when the script refused to launch on both mac (NOPASSWD) and rentamac. One-line fix: remove the sudo -v call (line 102), or, to preserve the interactive ad-hoc use case, keep it but gate it on [ -t 0 ] (stdin is a tty). The comment at lines 99-101 explains the original intent (pre-auth credential caching for interactive operators), but that use case is marginal compared to the automation use case that the script exists to support. Triage-level fix; 5-minute follow-up. Full finding in ~~report::2026-04-16_three-node-test-aborted_findings.md~~ Finding B.
ca-53 peer demote / peer forget CLI commands for trust-state management (cli_semantics_design §2.4). Complements existing peer trust + deferred peer revoke (ca-09). Lets users downgrade TRUSTED→VERIFIED or remove peers entirely from the local peerstore.
ca-54 group member decline <group-id> and group member leave <name> — group lifecycle commands (cli_semantics_design §2.5.2). Current impl has group member accept/publish only.
ca-55 group network show <name> and group network ping <member> — group overlay inspection + connectivity helper (cli_semantics_design §2.5.3). Currently the network subtree is a stub.
ca-56 tunnel list, tunnel show <peer-ns>, tunnel disconnect <peer-ns> — tunnel management wrappers (cli_semantics_design §2.6). Currently only tunnel connect exists. These also unblock the deferred status sections for tunnels.
ca-57 identity rotate — master-key rotation command (cli_semantics_design §2.2 / §8 deferred). Re-publishes master record with new seed; requires peers to re-verify. Touches keystore, peerstore, and master-record envelope formats.
ca-58 normalize-base64-encoding: Storage encoding inconsistency (base64.StdEncoding vs base64.RawURLEncoding) across keystore/devicestore. Requires schema migration. (refactor-cmd-portdistrict deferred scope)
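During a schema migration, readers may need to accept values written with either encoding. The sketch below shows one way to do that; decodeKeyLenient is a hypothetical helper name, and the try order (RawURLEncoding first) is an assumption, not a decision recorded in this roadmap.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
)

// decodeKeyLenient is a hypothetical migration helper. It tries the two
// encodings named in ca-58 in a fixed order and reports which one matched,
// so callers can rewrite the stored value in the canonical form.
func decodeKeyLenient(s string) (key []byte, encoding string, err error) {
	if b, e := base64.RawURLEncoding.DecodeString(s); e == nil {
		return b, "RawURLEncoding", nil
	}
	if b, e := base64.StdEncoding.DecodeString(s); e == nil {
		return b, "StdEncoding", nil
	}
	return nil, "", errors.New("value is neither RawURLEncoding nor StdEncoding base64")
}

func main() {
	raw := []byte{0xfb, 0xff}
	for _, s := range []string{
		base64.StdEncoding.EncodeToString(raw),    // "+/8="
		base64.RawURLEncoding.EncodeToString(raw), // "-_8"
	} {
		_, enc, err := decodeKeyLenient(s)
		fmt.Println(s, enc, err)
	}
}
```

The two encodings differ in alphabet (+/ versus -_) and in padding, so a value that needs padding or uses those characters decodes under only one of them. That makes the detected encoding a usable migration signal.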
ca-59 secure-passphrase-input: Replace readPassphrase (fixed 1024-byte buffer with single Read) with term.ReadPassword + TTY detection. Behavioral change needing integration test for piped passphrases. (refactor-cmd-portdistrict deferred scope)
ca-60 peer-key-load-diagnostics: loadVerifiedPeerKeys silently skips all errors (bad JSON, bad base64). Should warn on non-IsNotExist failures so corrupt peer files are not invisibly dropped. (refactor-cmd-portdistrict deferred scope)
ca-62 peer show should display derived overlay prefix and per-device addresses (UX followup from peer-overlay-derivation Finding 3)
ca-63 tunnel show should display peer overlay AllowedIPs in derived/labeled form rather than raw WG state (peer-overlay-derivation Finding 3 followup)
ca-64 DNS cache propagation helper (wait-for-resolver-fresh) for test harness — poll loop replacing manual sleep-and-retry in init-3node.sh (pre-existing Finding 4 from peer-overlay-derivation in-wild)
ca-65 Two-phase daemon teardown (SIGTERM → sleep → SIGKILL) in test-3node.sh — propagate pattern from mac-daemon.sh to avoid orphan daemons (pre-existing Finding 6 from peer-overlay-derivation in-wild)
ca-66 Group member custom pseudonyms / display labels — inside a group, members are currently identified only by cryptographically-derived blake3 node IDs with no user-facing label. Evaluate three shapes: local-only aliases (git-config-style, zero protocol change), self-signed announcement in the group record (visible within group, cross-group correlation risk), or sealed-to-admin announcement (admin as naming authority). Draft at workflow::2026-04-18_group-member-custom-pseudonyms::draft.md. Raised during peer-overlay-derivation tutorial refinement.
ca-67 Investigate intermittent group device decrypt failure: chacha20poly1305: message authentication failed during group-refresh may indicate a race between republish and resolver caching.
ca-70 End-to-end system-mode fixture with real peer-trust bootstrap: the current test/fixtures/system-mode/ fixture validates container plumbing only (TUN creation, NET_ADMIN cap, --kill-switch flag parsing, binary accessibility). The actual exit_request/exit_ack handshake is not exercised because no peer trust is provisioned between provider and consumer containers. Extend the fixture (or adapt scripts/test-3node-exit.sh) to keygen both containers, publish DeviceRecords, mutual peer add + peer verify, provider runs portdistrictd with exit-node enabled, consumer runs portdistrict use <provider> --mode system, verify real default route through provider + curl ifconfig.me returns provider IP + portdistrict use off restores routing. Estimated ~100–200 lines of shell + Dockerfile extension. Upstream lesson from test-3node-exit.sh is directly applicable. (exit-node-system-mode 2026-04-19)
ca-72 daemon-query-ipc: Unix domain socket IPC for daemon live state queries — machine server show currently reads disk only (deferred-data footer). Live consumer data (active connections, tx/rx, last-handshake) requires daemon IPC. Deferred from exit-node-cli-alignment study scope.
ca-73 exit list implementation: The exit list verb is stubbed (prints “not yet implemented”). Implementing candidate enumeration (listing available exit nodes) requires a discovery mechanism — DNS-based listing, facilitator query, or peer-list scan. Filed from exit-node-cli-alignment deferred scope.
ca-77 Add integration test for gvnic.BringUp() that creates a stack and verifies UDP connectivity via loopback — catches the defaultTransportProtocols stub bug class
ca-78 Investigate MSI-X on AMD EPYC for TamaGo interrupt-driven drivers — polling wastes CPU core, resolution would benefit all interrupt-driven drivers and potentially close meta-04
ca-80 TamaGo interrupt-driven I/O (LAPIC + MSI-X) follow-on. Polling-only path in the current gvnic driver wastes a CPU core and busy-loops in the RX poll. Draft exists at workflow::2026-04-21_tamago-x86-interrupts::draft.md. Blocks: nothing critical (driver works polling-only) but unlocks efficient multi-queue operation. Related to meta-04 (MSI-X boot-crash howto).
ca-79-B Consumer-side peer adopt --pubkey <key> --endpoint <ip:port> to synthesize a local DeviceRecord from out-of-band credentials (e.g., a TamaGo exit-node pubkey copy-pasted from serial console). Replaces the exit connect --direct/--pubkey operator escape hatch (ca-79) with a path that fits the existing trust model: after peer adopt, the standard exit connect <peer-name> flow works through the normal resolveProvider/resolveOverlayAddr/resolveProviderWGPubkey calls. CLI design pinned in the ca-79 study’s “Design Question” section: keep peer adopt and peer add as separate verbs sharing a registerPeer(record, source PeerSource) function in internal/peerstore — the verb separation keeps the trust-model exception visible (source: bundle | adopted in peer list); collapsing into peer add --pubkey ... was rejected because it would erode the cryptographic-bundle invariant. When this lands, remove --direct/--pubkey per the TODO(ca-79-B) sunset comment in cmd/portdistrict/exit.go.
ca-81 Health-gate retry with exponential backoff in proxyhealth.Check or runExitTry: 3 attempts at 500ms / 1s / 2s. Planned in exit-try-onramp proposal but not implemented. (exit-try-onramp)
ca-82 ExitSender.Receive goroutine leak on timeout: add context.Context to the ExitSender interface so the Receive goroutine can be cancelled when timeout fires. Pre-existing debt shared by runExitConnect and runExitTry. (exit-try-onramp)
ca-83 Consolidate keyring loading into a loadDefaultKeyring() (*keystore.Keyring, error) helper: four new commands in trusted-tamago-node violated the config.KeyringPath("") → keystore.Load(path) two-step XDG resolution pattern. A shared helper centralises the pattern and makes misuse a compile-time issue rather than a runtime failure. (trusted-tamago-node)
ca-84 Store both control port and data port explicitly in the remote node registry: the invite command had to derive controlPort = dataPort - 1 from the registry’s single stored port, an implicit TamaGo convention that caused the invite-command port bug. Storing both ports (or a typed ControlAddr/DataAddr pair) makes the convention explicit and prevents the same class of bug in future consumers of the registry. (trusted-tamago-node)
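One possible shape for the typed pair is sketched below. The nodeAddrs struct and fromLegacyPort are hypothetical; the sketch assumes existing registry entries hold only the data port and applies the controlPort = dataPort - 1 convention once, at migration time, so later consumers never re-derive it.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// nodeAddrs is a hypothetical registry field pair; the names are illustrative.
type nodeAddrs struct {
	ControlAddr string // host:port of the control plane
	DataAddr    string // host:port of the data plane
}

// fromLegacyPort migrates an entry that stored only the data port.
func fromLegacyPort(host string, dataPort int) (nodeAddrs, error) {
	if dataPort < 2 || dataPort > 65535 {
		return nodeAddrs{}, fmt.Errorf("data port %d out of range", dataPort)
	}
	return nodeAddrs{
		ControlAddr: net.JoinHostPort(host, strconv.Itoa(dataPort-1)),
		DataAddr:    net.JoinHostPort(host, strconv.Itoa(dataPort)),
	}, nil
}

func main() {
	// 203.0.113.7 is a documentation-range address used as a placeholder.
	a, err := fromLegacyPort("203.0.113.7", 51820)
	fmt.Println(a.ControlAddr, a.DataAddr, err)
}
```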
ca-85 Document production interrupt integration path — Once Stream A hardware validation succeeds, define where interrupt ordering fix applies to production binary (cmd/portdistrict-exitnode-tamago/main.go currently has no LAPIC usage)
ca-86 tamago-sev-snp-vc-handler: investigate whether TamaGo’s generic setIDT() clobbers the EFI-installed SEV-SNP #VC handler (vector 29) and implement preservation. Symptom (2026-04-23 hardware validation): gvnic-spike reboot-loops on GCE N2D SEV-SNP (AMD EPYC Milan) — 162 boot cycles in 90s, halts at yielding to scheduler to ensure IDT installed before enabling LAPIC… before scheduler yield complete prints; never reaches LAPIC.Enable. Ordering fix from 2026-04-22 retro resolves Intel+OVMF triple-fault but NOT AMD+SEV-SNP hang. Working theory: ServiceInterrupts goroutine runs during the 100ms yield, setIDT() overwrites the EFI-installed #VC entry, subsequent RMP-check page touch raises unhandled #VC → triple-fault → SEV-SNP auto-reset. Spike approach: log IDT vector 29 before/after setIDT runs; confirm clobber; implement re-install or generic IDT SEV-SNP awareness. Success criterion: spike completes LAPIC.Enable + 20 heartbeats on GCE N2D SEV-SNP. Blocks: AMD EPYC interrupt-mode bring-up (ca-80 on that platform). See workflow::2026-04-23_tamago-x86-interrupts_hardware-validation::late-retrospective.md.
ca-87 Stronger --via-tor health-gate: evaluate whether CheckWithOptions(AllowSameIP: true) should be replaced with attestation-based proof-of-tunneling. Closed by ca-93 2026-04-27: evaluation complete; CheckWithOptions retained as connectivity check, and attestation proof comes from the Noise NK handshake (option (b) from ca-93). See ~~proposal::via-tor-attestation-health-gate.md~~. (exit-try-via-tor)
ca-92 Attestation-bind Tamago’s Noise static key. DONE 2026-04-27 via option (a): GHCB DeriveKey (MSG_KEY_REQ) with KeySelect=VCEK and GuestFieldSelect=Measurement|GuestPolicy produces a 32-byte PSP-mediated secret bound to chip + image; BLAKE3-KDF (context portdistrict-tamago-noisewrap-v1) + X25519 clamping → static keypair. Cross-instance determinism validated end-to-end on GCE N2D Confidential VM: TAMAGO_NOISE_PUBKEY=1c45alg7aQwwNZCU-M59ofTVdaKU9KD7fOVT8O1EbCk byte-identical across destroy + fresh deploy. No protocol changes — internal/noisewrap, internal/exitgrant, client-side code untouched. Bonus: “early emit” pattern added so operators can read the pubkey at derivation time, before bootstrap claim completes. See workflow::2026-04-27_tamago-noise-key-attestation-binding::retrospective.md. Unblocks ca-93 (~50 LOC client-side comparison at noisewrap.Dial). Option (b) (in-band attestation report transport) deferred for untrusted-operator threat models.
ca-93 Strengthen --via-tor health-gate beyond “tunneled HTTPS request succeeded”. DONE 2026-04-27 via option (b): the Noise NK handshake IS proof of tunneling. When tamagoKey != nil, the health gate now reports Attestation: Noise handshake authenticated (key from <source>) and, when the grant carries ExpectedMeasurement, a second line displays the measurement hex as defense-in-depth confirmation. CheckWithOptions retained as connectivity check; attestation proof comes from the Noise handshake (ca-92 key binding). ~30 LOC production + ~80 LOC tests (3 new tests covering grant+measurement, --tamago-key flag, and no-key paths). Also closes ca-87. See ~~proposal::via-tor-attestation-health-gate.md~~. Related: ca-91, ca-92.
ca-95 Investigate Noise NK handshake auth failure with ca-92-derived keys. DONE 2026-04-28: misdiagnosed at filing — was actually three separate issues compounding. (a) Pre-clamping interaction with flynn/noise — falsified by TestDerivedKey_NK_Handshake unit test (passes). (b) curve25519.X25519 vs flynn/noise pubkey divergence — falsified by TestDerivedKey_PubkeyConsistency (byte-identical). (c) TamaGo runtime crypto difference — falsified by on-device fixed-input diagnostic (DIAG_FIXED_PUB_HEX matched host Go byte-for-byte). The actual fix: base64 encoding inconsistency (socks5_noise.go:67 was StdEncoding, main.go:103 was RawURLEncoding); both now RawURLEncoding. The proximate causes of the original test failure were tor-gw cold-start (pd-19 fix) + operator key-paste error in earlier session retests. 🎯 First end-to-end real-traffic demonstration of the SEV-SNP chain: with pd-19’s truthful gate + ca-95’s encoding fix + correct-chip pubkey paste, portdistrict exit try --via-tor reached Tamago’s GCE external IP (tunneled IP 35.198.72.12 ≠ direct IP 46.188.164.184 — OK), Attestation: Noise handshake authenticated line fired, real HTTPS request to ifconfig.me round-tripped through laptop tor → onion → tor-gw → Tamago WG plane → public internet. The 5-link trust chain (pd-10 + pd-17 + pd-18 + ca-92 + ca-93 + pd-19 + ca-95) is operationally complete. New unit tests in internal/noisewrap/noisewrap_test.go are durable regression coverage. See workflow::2026-04-27_noisewrap-derived-key-nk-handshake-bug::retrospective.md.
ca-97 Bootstrap protocol integration test covering both claim and reconnect paths — net.Pipe-based test exercising full frame sequence (handshake → attestation → cert chain → mode message) to catch frame-sequence mismatches automatically. (tamago-attestation-client-verify)
ca-98 Non-confidential bootstrap protocol: implement simplified Noise → owner-key proof → WireGuard handoff flow (without attestation exchange) for --on hetzner / --on self claim path (machine-server-provisioning-cli)
ca-99 machine server build verb: implement Phase 2 once offline-measurement-prediction proposal lands — Tamago build pipeline in pure Go via bitfield/script, GCE image publish via Go SDK (machine-server-provisioning-cli)
ca-100 --on self provider path: implement doctor/deploy/claim for bare-metal/VPS servers — check port bindability, SSH reachability, distro detection instead of ADC (machine-server-provisioning-cli)
ca-101 Remote capability toggling: implement enable --label / disable --label for non-local servers via daemon-to-daemon control transport or SSH-based config push (machine-server-provisioning-cli)
ca-102 Make runMachineServerRemoteEnable v2-aware (preserve extended NodeEntry fields across Remove+Add) to retire the fragile save-restore bridge in claim. Amortize the tamago E2E rerun cost the next time it triggers for another reason. See howto::registry_save_restore_bridge.md for the bridge being replaced. From workflow::2026-04-29_machine-server-provisioning-cli::retrospective.md Recommendation #3.
ca-103 Tamago port of internal/facilitatorplane/: compile the facilitator plane into the Tamago bare-metal exit-node binary so a single CVM can serve both --exit-node and --facilitator capabilities under the same pd-18 attestation, with no second VM. Why this is a separate item from wf-67: wf-67’s non-confidential colocated path (portdistrictd --exit-node --facilitator on regular Linux) is essentially free, because the daemon already does this on the local-host. The confidential colocated path is gated on this port: today cmd/portdistrict-exitnode-tamago does not import internal/facilitatorplane/ and the Tamago build set is single-purpose. wf-67 explicitly declares colocation as the default; ca-103 is what makes the default work for --confidential deploys. Spike-before-propose: confirm the Tamago net stack supplies what the facilitator needs: TCP listener (the via-tor + WG planes already use Tamago net, so likely yes), accept loop with concurrent connections (no goroutine constraints in current Tamago), in-memory registration store keyed by node-id (trivial: sync.Map or similar), no glibc/cgo dependencies in facilitatorplane (audit internal/facilitatorplane/ for any linux-only sys imports). If any gap surfaces, document and route around; the registration store is the only piece with substantive state, and it already has no persistence requirements (rebuilt on connect). Out of scope: the registration store does not need to be measurement-attested separately. The Tamago binary’s pd-18 measurement covers it implicitly because it’s compiled into the same image; the trust story for via-facilitator connections is “the facilitator runs in the measured exit-node CVM” rather than a new attestation channel.
Implementation sketch: (a) add facilitator to the build tags Tamago accepts in cmd/portdistrict-exitnode-tamago/main.go (or always-on if the size cost is small); (b) wire facilitatorplane.New(...) into the existing run loop alongside the WG / Noise planes; (c) read facilitator.json-equivalent config from instance metadata at boot (mirror the existing exit-node metadata path); (d) emit TAMAGO_FACILITATOR_LISTENING=:7777 serial marker for machine server doctor/show to scrape. Dependency direction: ca-103 unblocks the --confidential --with-facilitator path of wf-67. Until ca-103 lands, wf-67’s confidential colocated path either errors out pointing at this item, or auto-implies --facilitator-vm (deploy a second regular Linux VM with portdistrictd --facilitator). The non-confidential colocated path of wf-67 is independent and ships without ca-103. Code touch points: cmd/portdistrict-exitnode-tamago/main.go (run-loop wire-up), internal/facilitatorplane/* (audit for non-Tamago deps), cmd/portdistrict/machine_server_build.go (include facilitator in the build tags), serial-marker glue. Origin: 2026-05-01 conversation about colocating facilitator on the same CVM as the exit-node, after wf-67 was reframed from “second VM by default” to “colocate by default”. Filed because the confidential colocated path is the substantive engineering cost wf-67 inherits from the Tamago single-purpose-binary constraint and warrants its own roadmap entry with its own spike + study + propose cycle.
ca-106 DONE 2026-05-05 Fixed compileTamago relative-path mismatch: --output-dir now resolved to absolute via filepath.Abs(args[i]) at flag-parse time (cmd/portdistrict/machine_server_build.go). Eliminates ELF-not-found error when cwd != sourceDir. Surfaced by machine-server-build-via-tor Scope B in-wild verification; fixed inline in same cycle commit 08fe547. (machine-server-build-via-tor)
ca-107 DONE 2026-05-05 Extended machine_server_doctor’s checkToolOnPath to fall back to /usr/sbin and /sbin when exec.LookPath misses, with explicit hint when found there. Added ensureSbinInPath() helper called at runMachineServerBuild entry to augment the process PATH so exec.Command lookups in the build path also succeed. Affects all doctor users. Shipped inline in same cycle commit 08fe547. (machine-server-build-via-tor)
ca-108 Tamago LoopbackAllowlist config plumbing — currently uses a hardcoded default ["127.0.0.1:8888"] (no config file in Tamago). If future Tamago services need additional loopback destinations (e.g. ca-103 facilitator port, multi-service CVMs), wire the allowlist through Tamago’s instance metadata or config-emission path. Low priority — single destination suffices today. (via-tor-socks5-hardening retro)
ca-112 tor-gw metadata-poll for ca-89 opt-in mode: if operator demand materializes for --tor-client-auth, tor-gw needs the same metadata-poll pattern to read authorized_clients from GCE metadata and write .auth files + SIGHUP Tor. SIGHUP-Tor pattern validated by wf-73. (via-tor-trusted-peers-tamago retro)
ca-113 GCE metadata wait-for-change long-poll optimization: replace 60s fixed-interval poll with ?wait_for_change=true&last_etag= for event-driven trusted_peers updates. Requires adjusting HTTP client timeout (currently 8s in bootstrap.go:43). (via-tor-trusted-peers-tamago retro)
ca-115 BLOCKS cycle-4 merge. Fresh per-connection attestation (ca-90) is the cumulative-load choke point under cycle-4. Serial logs socks5-noise: fresh attestation failed (falling back to cached): fresh attestation report: VMGEXIT 0x80000012: info1=0x0 info2=0x200000000 (err_code=0x2 detail=0x0 rbx_out=4) on most IK connections. Diagnosis empirically confirmed 2026-05-08 (Tier 2 in-wild diagnostic): a patched build that force-skips the fresh-attestation GHCB call (always uses cached report) handles 8 concurrent SOCKS5 curl bursts cleanly (8/8 success, single Tor warmup retry on first health check, no cumulative degradation, response body confirms traffic correctly routed Tor → onion → IK → Tamago). With fresh attestation re-enabled, the same workload saturates Tamago and the chainer’s health check fails. Concrete user-facing impact: friend-opens-browser is broken in practice — a typical web page loads 30–80 resources concurrently → 30–80 IK handshakes → queue saturates Tamago well before the page finishes → chainer reports tunnel sick → browser hangs. Single-connection workloads (one curl) still work because the queue never grows. Surprising finding: cycle-4 did NOT modify freshAttestationReport, GetExtendedAttestationReport, or any GHCB code — those are byte-identical to cycle-3.5. The regression is from indirect runtime effects of cycle-4 (new poll goroutine, new TrustedPeers map, IK accept loop’s faster acceptance, or heap layout shift) on the SAME GHCB call that worked in cycle-3.5. Investigation directions (Tier 3): (a) info2=0x200000000 (= 2³³) suggests a misaligned guest physical address or a high-bit pollution in the response page GPA; (b) check if the new metadata-poll goroutine’s HTTP allocations affect the GHCB request page’s memory locality; (c) confirm whether cycle-3.5 in-wild ever exercised concurrent fresh-attestation under similar load (the regression may be pre-existing but masked by cycle-3.5’s slower NK accept loop). 
Available workaround if fix is hard: revert ca-90 freshness — unsatisfactory because it loses replay-resistance defense. Available adjacent option: rate-limit IK accepts upstream of attestation (effectively serializing GHCB calls), but that throttles legitimate friends. Surfaced + diagnosed 2026-05-08 during cycle-4 in-wild test. (via-tor-trusted-peers-tamago in-wild)
cq-17 Loopback allowlist deny path returns EOF instead of SOCKS5 RuleFailure under cycle-4 IK channel: TestInWildLoopbackAllowlistDeny expected socks connect ... not allowed by ruleset (RepRuleFailure 0x02) but got bare EOF. The deny path still rejects (bootstrap allow case still differentiates), but the response code surface changed from a typed SOCKS5 reply to a connection close. Likely a side effect of the cycle-4 accept-loop refactor that inserted the trusted-peers gate before the inner SOCKS5 server. Operations clarity regression — operator probing a denied target now gets ambiguous EOF instead of “ruleset” diagnostic. Investigation: compare inner SOCKS5 server’s denial path behavior under NK (cycle-3) vs IK (cycle-4) accept loops. (via-tor-trusted-peers-tamago in-wild 2026-05-08)

Code Quality & Markers

cq-02 Phase 0 spike files should use //go:build spike build tags or separate _test packages to prevent name collisions with production tests. Update spike file template/guidance. (daemon-groups-foundation) — Note: originally filed as cq-01 by the Reflect subagent but renamed to cq-02 to avoid ID collision with the pre-existing cq-01 in CHANGELOG.md (cmd-identity-verify era, closed).
cq-03 Implementation subagents should run gofmt + goimports (or godoctor smart_edit) before declaring phase complete. Cycle 1 of daemon-groups-foundation shipped a misordered import in internal/daemon/daemon.go (groups inserted between endpointdetect and facilitatorplane instead of after facilitatorplane); caught by human post-cycle-2 manual sed, not by the subagent’s self-review. Self-review gate should include a mechanical formatting check. (daemon-groups-foundation)
cq-04 Downgrade first-contact resolve log spam to Debug level: recordcrypt: no matching key slot and expected record_type "device", got "master" fire repeatedly during normal operation. Should be Debug-level, not Info/Warn.
cq-05 Add lifecycle/Teardown regression test for the kernel TUN path (not just netstack). The double-close bug in Teardown was latent because kernel TUN silently tolerates double-close. A privileged test exercising CreateInterface → Teardown → verify no panic would have caught this earlier. (wgplane-native-platform-stubs)
cq-07 Implement full queue reset in stall watchdog — Current implementation logs warning and resets lastActivity but does not rebuild queues; add DESTROY_TX_QUEUE + DESTROY_RX_QUEUE + CREATE_TX_QUEUE + CREATE_RX_QUEUE sequence for true recovery
cq-08 Extend fault-injection test coverage — Add status override tests (AdminQ command failures), RX descriptor corruption tests (seq jumps, oversized lengths), queue exhaustion tests (TX ring full, RX ring empty)
cq-09 Add coverage threshold gate to test suite — 46.0% coverage represents progress but is still below robustness threshold; consider adding CI gate at 60% or 70% to prevent backsliding
cq-10 Unsilence txNotifier.WriteNotify drop path in third_party/gvnic/net.go:162-167. The if len(dstMAC) == 0 { drop } branch is a silent drop with just pkt.DecRef() and continue. Add a rate-limited log.Printf + a counter exposed via stats. Source: workflow::2026-04-23_gvnic-tcp-destination-selective-pcap::retrospective.md — would have ruled out this branch in 1 probe cycle during ca-88 if the counter had existed.
cq-11 CI reproducibility gate — add a CI step that runs scripts/tamago-verify-reproducible.sh to prevent build-determinism regressions. Requires TamaGo toolchain availability in CI. Deferred from tamago-reproducible-measurement study. (tamago-reproducible-measurement retro §Recommendations 3)
cq-12 Refactor cliSim cert-chain read placement to mirror production’s per-branch structure (move from common section into runClaim/runReconnect), so falsification surfaces the original BUG’s deadlock signature instead of verify-step error
cq-13 Registry v2 metadata loss hardening: make runMachineServerRemoteEnable v2-aware (preserve additional fields) when tamago E2E rerun is next triggered for other reasons (machine-server-provisioning-cli)
cq-14 Context timeout audit on existing deploy paths: review other multi-step functions in cmd/portdistrict/ for similar context timeout budget mismatches, especially any that may be augmented in future (e.g., wf-67 --with-facilitator). From workflow::2026-05-01_machine-server-via-tor-deploy-augmentation::retrospective.md Recommendation #1.
cq-15 Programmatic test data length assertion: for fixed-length cryptographic/protocol test constants (onion addresses, base32/base64 keys), add len() assertions at the test constant declaration site rather than relying on visual inspection. From workflow::2026-05-01_machine-server-via-tor-deploy-augmentation::retrospective.md Recommendation #2.
cq-16 Surface the remaining via-tor failure mode: distinct error string for “ca-89 enabled but auth-file missing”. The other 3 named strings (“client_static not in trusted_peers”, “token expired”, “token signature invalid”) shipped with ca-109 in the cycle-4 accept loop. The 4th requires the ca-89 ON path, which lands alongside ca-112. (via-tor-access-control + via-tor-trusted-peers-tamago retros)

Cross-Platform Porting

cp-06 Add GOOS=windows/GOOS=darwin smoke-build matrix to CI (blocked on CI existing at all — there’s no .github/workflows/, no Makefile, no .gitlab-ci.yml today). First dedicated CI step should be go build ./... across linux, darwin, windows to catch portability regressions before they snowball. File CI-bootstrap as a separate meta-item if needed. (cross-platform audit 2026-04-15)
cp-10 Replace per-platform exec.Command shell-outs in internal/wgplane/platform_*.go with pure-Go syscalls. Partial (Linux delivered 2026-04-19 by sit-feature/wgplane-native-platform-stubs, commits f7f837b..a6f0248): both internal/wgplane/platform_linux.go and internal/exitnode/platform_linux.go are now netlink-based (vishvananda/netlink); in-wild zero-execve validated via scripts/test-3node-strace.sh; distroless container fixture at test/fixtures/distroless/Dockerfile runs portdistrictd in a 24.6 MB image without iproute2. Darwin/BSD/Windows still shell out — successor items cp-12 (darwin+BSD) and cp-13 (Windows) track the remaining platforms. The overall cp-10 acceptance (grep → zero across ALL platform files) is NOT yet satisfied. Original context below. Current state (post-cp-02): each platform file shells out to the OS base tool for interface address assignment and bring-up — platform_linux.go uses ip addr add <cidr> dev <iface> + ip link set <iface> up (iproute2); platform_darwin.go + platform_bsd.go use ifconfig <iface> inet6 <ip> prefixlen <n> + ifconfig <iface> up; platform_windows.go uses netsh interface ipv6 add address .... These are all OS-base tools shipped with every modern install, so cp-02 correctly prioritized them over pure-Go for portability. 
The residual concerns that motivate further work: (a) fork/exec overhead on every interface configure (negligible at daemon startup but could matter at scale with many interfaces), (b) fragile error handling via CombinedOutput parsing instead of structured errno/netlink response codes, (c) attack surface from subprocess spawning on privileged paths (each exec is a potential abuse vector if an attacker can influence the argv), (d) minimal-container/embedded deployments where the base image may strip userland tools for size/security (distroless Linux images don’t ship iproute2 by default; embedded BSD images may omit ifconfig), (e) cleaner dependency graph — the daemon currently requires the OS to have specific userland commands available, a fact that’s invisible until runtime. Scope: (A) Linux: replace ip shell-outs with vishvananda/netlink (the canonical Go netlink library) or direct golang.org/x/sys/unix netlink socket calls. vishvananda/netlink is ~500KB and widely used (by containerd, CNI plugins, etc). Estimated ~50 LOC replacing the two exec.Command calls in platform_linux.go. (B) darwin + BSD: replace ifconfig shell-outs with ioctl(SIOCAIFADDR_IN6) via golang.org/x/sys/unix. More involved than netlink because darwin’s utun ioctl interface is less documented, but wireguard-go’s own tun/tun_darwin.go already does this and can be referenced. Estimated ~100 LOC. (C) Windows: replace netsh shell-outs with golang.org/x/sys/windows/iphlpapi which has functions like CreateUnicastIpAddressEntry for programmatic IP assignment. Estimated ~80 LOC. Priority: LOW — current shell-out approach works correctly on all supported platforms; this is a code-quality / lean-deployment improvement, not a correctness fix. Dependencies: vishvananda/netlink adds a new module dependency (acceptable given its ubiquity in Go networking code) but the darwin/BSD/Windows paths stay within existing golang.org/x/sys/* packages. 
Acceptance: grep -r "exec.Command" internal/wgplane/ returns zero matches in production code (test fixtures exempt), and the existing wgplane integration tests + cp-07 (in-wild darwin) + cp-08 (in-wild Windows) all continue to pass. Filed 2026-04-16 from the device-record-overlay-address session’s external-tools audit: the audit confirmed portdistrict has zero wg CLI dependency (cp-01 done) but still has four platform-specific shell-outs to OS base tools. The audit was prompted by the user’s question “do we rely on external tools such as wg on portdistrict nodes?” Full audit in the 2026-04-16 device-record-overlay-address retrospective when it lands. (device-record-overlay-address in-wild audit 2026-04-16)
cp-12 Darwin + BSD native API for internal/wgplane/platform_{darwin,bsd}.go: replace ifconfig shell-outs with ioctl(SIOCAIFADDR_IN6) via golang.org/x/sys/unix, mirroring wireguard-go’s tun/tun_darwin.go. Successor to cp-10 for the darwin/BSD portion. Priority LOW — current ifconfig shell-outs work correctly in production (cp-07 in-wild validated on macOS 14.8 and 26.4.1); this is code-quality/lean-deployment work. Acceptance: grep -r "exec.Command" internal/wgplane/platform_darwin.go internal/wgplane/platform_bsd.go internal/exitnode/platform_darwin.go returns zero; macOS in-wild mesh still passes. When filed: consult howto::shellout_to_native_api_porting.md for semantic preservation (EEXIST idempotency, split-route vs true-default-route). (wgplane-native-platform-stubs 2026-04-19)
cp-13 Windows native API for internal/wgplane/platform_windows.go: replace netsh shell-outs with golang.org/x/sys/windows/iphlpapi.CreateUnicastIpAddressEntry. Successor to cp-10 for Windows. Blocked on cp-08 (Windows runtime validation) — adding native API without in-wild runtime test is risk without reward. When filed: consult howto::shellout_to_native_api_porting.md. (wgplane-native-platform-stubs 2026-04-19)
cp-14 TamaGo-compatible conn.Bind for wireguard-go (resolves OQ7 fully): conn.NewDefaultBind() opens UDP sockets via syscall.Socket, which TamaGo does not expose. internal/wgplane/create_netstack.go currently uses conn.NewDefaultBind() and compiles with -tags wgplane_netstack on Linux but will NOT link under GOOS=tamago. Needs a netstack-UDP-endpoint-backed conn.Bind implementation that uses gVisor’s stack.UDPEndpoint (or equivalent) instead of host syscalls, gated by the tamago build tag. Estimated ~200–300 LOC. Prerequisite for the exit-node role on the TamaGo binary (per doc::portdistrict::tamago_sevsnp_facilitator_design_pending.md §11 OQ7/OQ8). Not blocking for the SEV-SNP facilitator role itself, which doesn’t use WireGuard. (wgplane-native-platform-stubs 2026-04-19)
cp-08 Windows in-wild verification of internal/wgplane/platform_windows.go. Cross-compilation proves GOOS=windows go build ./internal/wgplane/... passes, but netsh interface ipv6 add address and Wintun TUN creation paths are untested at runtime. Needs: Windows 10/11 test environment with Wintun installed, administrator shell, run the daemon and verify the interface appears in netsh interface show interface. Expect edge cases around netsh silent failures without elevation, Wintun adapter naming, UAPI named-pipe path (\\.\pipe\ProtectedPrefix\Administrators\WireGuard\<ifname>). If netsh path is problematic, fallback is the proposal’s errNotImplemented stub with [SCOPE] log. Risk flagged in wgplane-portable-transport proposal’s Risks table. (wgplane-portable-transport 2026-04-15)
cp-09 Config/state XDG directory split on Linux (cp-04-v2): move force-window.json, peers/, groups/ from xdg.ConfigHome to xdg.StateHome for XDG-proper separation. Deferred from config-path-xdg-adoption because of migration complexity and no user demand. Zero urgency — current layout works, and on macOS ConfigHome/StateHome collapse to the same path anyway. (config-path-xdg-adoption)
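A minimal sketch of the split, assuming the defaults from the XDG Base Directory specification ($HOME/.config and $HOME/.local/state). It uses only the standard library rather than the xdg package the project imports, and resolveDirs is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// dirs holds the per-application config and state directories.
type dirs struct{ Config, State string }

// resolveDirs applies the XDG Base Directory defaults. Environment values
// that are empty or not absolute are ignored, as the specification requires.
func resolveDirs(getenv func(string) string, home, app string) dirs {
	pick := func(key, fallback string) string {
		if v := getenv(key); filepath.IsAbs(v) {
			return v
		}
		return filepath.Join(home, fallback)
	}
	return dirs{
		Config: filepath.Join(pick("XDG_CONFIG_HOME", ".config"), app),
		State:  filepath.Join(pick("XDG_STATE_HOME", ".local/state"), app),
	}
}

func main() {
	home, err := os.UserHomeDir()
	if err != nil {
		fmt.Println("cannot resolve home directory:", err)
		return
	}
	d := resolveDirs(os.Getenv, home, "portdistrict")
	// Under the proposed split, mutable runtime data moves to the state dir.
	for _, name := range []string{"force-window.json", "peers", "groups"} {
		fmt.Println(filepath.Join(d.State, name))
	}
	fmt.Println("config stays in", d.Config)
}
```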
cp-15 Add internal/exitnode/platform_windows.go stubs to unblock Windows cross-compilation for portdistrict and portdistrictd. Same pattern as wgplane-native-platform-stubs (cp-12/cp-13). Once stubs exist, re-add windows/amd64 to .forgejo/workflows/{release,build}.yml platform matrices. (release-binaries retro 2026-04-30)
cp-16 Homebrew/APT/AUR packaging for portdistrict and portdistrictd. Downstream of release binaries — now that release infrastructure exists, distribution packaging can be added. (release-binaries study deferred scope 2026-04-30)
cp-17 Relax //go:build linux constraint on cmd/tor-gw-init to allow a darwin stub mode for development convenience. Today tor-gw-init is linux-only by build tag, so the release matrix excludes darwin even though the binary’s high-level CLI surface could be exercised on macOS during development if the linux-specific paths fell back to clear errors. Low priority — current linux-only matrix matches deployment reality. (release-binaries retro 2026-04-30 Recommendation #3)

Regression & Introspection

ri-01 Widen scripts/test-3node.sh probe timing for high-latency pairs: the current 15s DNS-publish sleep plus single warmup ping is insufficient for WG handshakes between remote_access and mac (~113ms RTT), producing false 2/6 failures on first launch while the mesh is actually healthy (verified 2026-04-18 in-wild during wire-format-single-version-reset retest). Options: (a) increase sleep to 30s, (b) retry-aware warmup loop that pings each pair until handshake succeeds or timeout, (c) check wg show handshake age before probing. Retrospective reference: .sit/reports/2026-04-18_wire-format-single-version-reset_retrospective.md § In-Wild Verification.
ri-02 Retry-with-backoff in test-3node-group.sh network-show step: replace single-shot sleep with a poll loop for DNS propagation to reduce false failures from Cloudflare timing.
ri-03 Multi-host cross-compile deployment script: server-transfer-sync.sh should support darwin/arm64 targets alongside linux/amd64 for 3-node tests.
ri-04 Add scripts/tamago-attest-e2e.sh to periodic CI on SEV-SNP-capable GCE instances for on-silicon bootstrap protocol verification.
wf-39 Refactor cliSim (cmd/portdistrict/bootstrap_protocol_test.go) to read the bootstrap cert-chain frame per-branch (runClaim with verify, runReconnect read-and-discard) instead of in the common section. Mirrors production decomposition (machine_server_remote.go:206-213 and :583-591). Falsification surface will then match the original ca-94 BUG’s deadlock signature instead of attestation verify: no cert chain provided. Source: 2026-04-29 bootstrap-protocol-integration-test retrospective.

Parallel Development

pd-02 TamaGo --open/--demo trust mode: provider-side permissive trust checker that accepts any exit_request from unknown consumers (logging pubkeys for audit). Required for exit try to work against non-open providers. (exit-try-onramp)
pd-03 IPv6-only KVM guests on self-hosted SEV-SNP — evaluate provisioning KVM instances as IPv6-only (no per-guest IPv4) to avoid IPv4 address-space pressure when migrating off GCE to self-hosted SEV-SNP. What’s preserved for free: portdistrict overlay is already v6 (fd00::/8 ULA); WireGuard endpoints are address-family agnostic; Tor onion services are address-family agnostic (fits exit-try-via-tor natively); SEV-SNP attestation is orthogonal to IP family; v6 removes most NAT traversal edge cases (ca-50 trusted-peer echo gets simpler, not harder). What’s lost / needs mitigation: (1) egress to v4-only destinations from Tamago exit-nodes — mitigate with NAT64 + DNS64 on the host (Jool or Tayga, ~20 lines of config); without it, a large fraction of the public internet (including many banks, older infra, ifconfig.me, some telemetry endpoints) fails from the “browse to any site” exit path; (2) inbound from v4-only user ISPs/cafes — facilitator should keep at least one dual-stack entry point (small $5/mo v4 VPS forwarding to v6 backend) until v4-user traffic is measured; (3) occasional gaps in v4-only package/registry mirrors (rare for major registries but bites private/niche ones) — dual-stack cache proxy or NAT64 solves; (4) reputational effects on v6 egress ranges (more CAPTCHAs on some sites from new v6 /64s) — same caveat as any new v6 range, time-decays; (5) legacy admin/CI paths from v4-only runners — one dual-stack jump host suffices. SEV-SNP-specific note: attestation reports don’t constrain address family, but cloud-init / guest networking config in the sealed image must be v6-ready (SLAAC + DHCPv6); TamaGo guests are already IPv6-first so this is a v1 concern only for Linux-guest workers. Rough recommendation: trusted-mesh nodes v6-only (preferable — matches overlay); Tamago exit-nodes v6-only + NAT64 on host; facilitators dual-stack until v4-user share is measured; admin/CI one dual-stack jump host. 
Cost framing: a /24 is ~$5k capex or ~$200/mo rental; NAT64 + one v4 entry point is ~10% of that. Action when self-hosted SEV-SNP plan firms up: (a) pick a NAT64 stack (Jool vs Tayga) and measure real-world v4-dest coverage; (b) audit Tamago exit-node dial paths for v4-literal assumptions; (c) instrument facilitator for client-side v4/v6 arrival ratios before going v6-only; (d) decide whether pd-03 supersedes or supplements a prospective dual-stack fallback mode. Related: ca-50 (trusted-peer echo — v6 makes this simpler), ca-87 (gVNIC TLS — orthogonal but the self-hosted migration is one possible resolution path). Filed during exit-try-via-tor study session 2026-04-24.
pd-05 Multi-gateway tor-gw load balancing: when a second operator deploys, design a client gateway selection mechanism (e.g., onion hostname list in DNS TXT, facilitator-mediated discovery). Currently 1:1 tor-gw:Tamago with out-of-band onion hostname. (tor-gw-vm-deployment-topology-automation)
pd-09 tor-gw hardened image (cloud-portable). Phase 1 COMPLETE 2026-04-25 — code + rootless build pipeline shipped (cmd/tor-gw-init/, scripts/tor-gw-image-build.sh, scripts/tor-gw-publish-image.sh, updated tor-gw-deploy.sh); see workflow::2026-04-25_tor-gw-hardened-image-cloud-portable::retrospective.md. Phase 3 in-wild GCE validation tracked as pd-11. Replaces today’s “stock Debian 12 + apt install tor via startup script” deploy pattern (pd-04) with a purpose-built immutable VM image. Small distroless/Alpine/buildroot base + stock c-tor + ~100-200 LOC init shim that reads forwarding target from instance metadata, writes torrc, and execs tor; read-only rootfs, no SSH, no package manager at runtime; runs unchanged on GCE / AWS / Azure / Hetzner / bare metal. Properties independent of attestation: reproducible image hash, smaller TCB, cross-provider portability, reduced post-compromise blast radius. Phasing: Phase 1 image build + GCE validation (~1wk) → Phase 3 cross-provider validation (~3d). Subsumes cq-11 (read-only rootfs). Closes ri-07 (cross-provider docs) by example. Unblocks pd-08 (gives signed grants a hardened-image deployment to point at) and pd-10 (provides the substrate for SEV-SNP onion-identity sealing). Joint design with pd-10 — see report::2026-04-25_tor-gw-hardened-image-draft.md, “Layer 1” sections. Recommended sequence: pd-09 → pd-08 → pd-10. Filed from tor-gw-vm-deployment-topology-automation UX review 2026-04-25.
pd-11 Phase 3 in-wild GCE validation for tor-gw hardened image. Phase 1 (image-side) DONE 2026-04-25 via 5 pivots — see workflow::2026-04-25_pd-11::retrospective.md. Phase 2 (e2e exit try --via-tor) DONE 2026-05-02 via the new wf-66 CLI path on instance wf66-e2e-120748. Full chain validated end-to-end: curl --socks5-hostname 127.0.0.1:9999 https://ifconfig.me returned 34.179.176.103 (= Tamago VM external IP) instead of the laptop’s direct IP 46.188.165.75, confirming traffic round-tripped through laptop curl → chainer 9999 → laptop tor 9050 → onion 3efkxftj…7i7qd → tor-gw SOCKS5+Noise → Tamago WG plane → public internet → ifconfig.me. The Noise NK handshake authenticated with the chip-derived ca-92 key. Same chain ca-95 retrospective proved on 2026-04-28 with manually-deployed infra — now via CLI-deployed infra (machine server deploy --via-tor + machine server claim --accept-discovered-measurement), no shell scripts. Also validated claim attestation chain end-to-end: nonce + session hash binding confirmed, ECDSA-P384 signature verified against VCEK, VCEK→ASK→ARK cert chain valid, measurement matched canonical pd-18 7a5ed176bad8a9ff02cebb94b24b076a0b1905042a85d9fca7670d3a3ff466db3b1c2b76f8eca888f8d806d2ec92434e. See workflow::2026-05-01_machine-server-via-tor-deploy-augmentation::retrospective.md “In-wild e2e (pd-11 Phase 2 closure)” section.
pd-12 Grant CLI discoverability verbs (grant list, grant show): display stored grants and their status. Completes the pd-08 discoverability story (“discoverable via peer list”). Low priority follow-on.
pd-13 pd-11 Phase 2 (exit try --via-tor e2e) should exercise the grant-based path (exit try --via-tor --from <operator-ns>) rather than the raw flag path, now that pd-08 is shipped.
pd-15 Rebuild linux-pd-snp without linux-firmware-any bloat. DONE 2026-04-27 via [PIVOT]: original mechanism (swap main/linux-lts → main/linux-virt) was falsified — main/linux-virt/ does not exist as a separate APKBUILD directory in current Alpine aports; linux-virt is a flavor of main/linux-lts/APKBUILD. Fix A landed instead: inline sed -i patch in scripts/build-kernel-pd-snp.sh:84-92 strips linux-firmware-any from the cloned APKBUILD’s main package() depends list, with a grep -q drift assertion. Validated on GCE N2D Confidential VMs: disk.tar.gz 659MB → 76MB (3.3× under the ≤250MB target), rootfs 722MB → 130.5MB, identical .onion across two independent CVMs. See workflow::2026-04-27_tor-gw-image-size-optimization::retrospective.md.
pd-16 Phase 3 GCE validation: Deploy two CVMs from the same image, scrape serial output via tor-gw-show.sh, confirm same .onion address. Validates assumptions 2b, 3, and 4 from the study. Reuse existing test/spike/pd-10/a1_sev_guest_probe_gce_test.sh orchestrator pattern. (tor-gw-sev-snp-onion-identity-sealing retro §Recommendations) DONE 2026-04-27: validated twice on independent CVMs from the same image, both pre-pd-15 and post-pd-15 builds; same .onion 3efkxftj…7i7qd.onion and same attestation measurement on both occasions. See workflow::2026-04-27_tor-gw-image-size-optimization::retrospective.md.
pd-17 Grant struct population (pd-10 Phase 2): Extend internal/exitgrant/grant.go to accept attestation measurement and serialize into grants. Operator workflow: scrape TOR_GW_ATTESTATION_MEASUREMENT from serial, pass to portdistrict exit grant --measurement. DONE 2026-04-27: ExpectedMeasurement []byte added to Grant struct, --measurement flag on exit grant with 48-byte length validation, hex-encoded on the wire (matches SEV-SNP convention), nullable-tolerant Decode preserves backward compatibility with pd-08 grants. The reserved-null field pd-08 added required zero migration. See workflow::2026-04-27_exit-grant-attestation-measurement::retrospective.md. Unblocks ca-93.
pd-18 Verify the SEV-SNP attestation report — server-side. DONE 2026-04-27 via two pivots: original mechanism (AMD KDS lookup) was falsified — the library’s strict policy[17]=1 check rejected GCE reports, AND AMD public KDS does not publish VCEKs for GCE chip+TCB pairs (verified 404 from multiple egress points). Final implementation uses client.GetRawExtendedReport which issues SNP_GET_EXT_REPORT; GCE pre-caches VCEK + ASK + ARK in guest memory via SNP_SET_EXT_CONFIG so the certs come back bundled with the report. No network calls during verification. Validated on GCE N2D Confidential VM: TOR_GW_ATTESTATION_REPORT_VALID=true on serial. Bonus retroactive bug fix: the pd-10/pd-15-era hand-rolled ioctl wrapper produced a struct-padded measurement (32 zero bytes prefix); pd-18’s switch to library wrapper returns proper bytes. Canonical measurement: 7a5ed176bad8a9ff02cebb94b24b076a0b1905042a85d9fca7670d3a3ff466db3b1c2b76f8eca888f8d806d2ec92434e. .onion unaffected. See workflow::2026-04-27_tor-gw-attestation-report-verify::retrospective.md. Unblocks ca-93 client-side comparison.
pd-19 tor-gw consensus / HSDir publication issue on freshly-deployed VMs. DONE 2026-04-28: root cause was the “ready” marker firing on hostname-file existence (signal-of-presence) rather than on actual HSDir publication (truthful readiness). Fix: added ControlSocket /var/lib/tor/control.sock to the generated torrc, implemented a minimal Tor control-protocol client in cmd/tor-gw-init/main.go (waitForDescriptorPublished()), and wait for HS_DESC UPLOADED event before emitting TOR_GW_DESCRIPTOR_PUBLISHED=true. Non-fatal degradation if control socket fails. The “no exit nodes” warning that triggered the original filing was a red herring — hidden services don’t need exit nodes. Phase 0 spike (test/spike/pd-19/) validated control auth + HS_DESC parseability against a private spike-tor instance on the laptop; measured 7.6s warm-laptop cold-start delay vs 60-120s budget for cold GCE. Implementation pivot count: 0. Validated in-wild on tor-gw-poc-a: TOR_GW_DESCRIPTOR_PUBLISHED=true emitted before subsequent attestation/onion markers. e2e chain test through Tamago surfaced ca-95 (Noise NK handshake bug between client noisewrap.Dial with ca-92-derived RemoteStatic and Tamago’s socks5_noise.go listener — chacha20poly1305: message authentication failed on accept NK msg1). ca-95 is a separate latent bug that pd-19 unblocked the test enough to surface; pd-19’s own claim is met. See workflow::2026-04-28_tor-gw-consensus-hsdir-cold-start::retrospective.md. Also closes wf-27.
pd-20 Tamago analog of pd-18: server-side cryptographic verification of Tamago’s own SEV-SNP attestation report at boot. Today cmd/portdistrict-exitnode-tamago derives a chip-bound Noise key (ca-92) and produces an attestation report inside runBootstrapServer (used to bind the bootstrap claim to this session), but never validates its OWN report’s cert chain or ECDSA-P384 signature locally. So Tamago doesn’t actually know whether it’s running on a genuine AMD chip; it trusts the kernel’s claim of SEV-SNP availability. The fix mirrors pd-18 (which did the same for tor-gw via client.GetRawExtendedReport): use the GHCB extended-report path to fetch the report PLUS the VCEK + ASK + ARK certs pre-cached by GCE in guest memory; parse the cert table; validate VCEK→ASK→ARK chain locally against a pinned ARK; verify the report’s signature on the raw bytes. Emit serial markers TAMAGO_ATTESTATION_REPORT_VALID=true|false and TAMAGO_ATTESTATION_REPORT_REASON=<msg> on failure. Twist for Tamago: the TamaGo runtime (bare-metal Go, no glibc) differs from the Linux init shim, and pd-18 used github.com/google/go-sev-guest/client, which is Linux-specific. Tamago’s kvm/sev/key.go already has GHCB.DeriveKey; Phase 0 spike should validate whether GHCB also exposes an extended-report path (or if Tamago needs its own ioctl-equivalent reading via the same MSG_REPORT_REQ but with extended-config response handling). Falls into the same family as pd-15 / pd-18 falsifications; spike before propose. Also: include the Tamago Noise pubkey fingerprint in the report’s report_data (64-byte client-supplied field) so ca-94’s client-side check can bind report-to-key in one shot. Filed 2026-04-28 from session discussion identifying gap between ca-92’s chip-binding (operator-side inference) and lack of in-binary cryptographic self-verification. Unblocks ca-94. Related: pd-18 (tor-gw equivalent), ca-92 (key derivation), ca-94 (client-side wiring).
pd-23 Self-hosted SEV-SNP in-binary verification for Tamago: on non-GCE hardware where AMD’s public KDS publishes VCEKs, Tamago can do full in-binary self-verification via gvisor netstack TLS. Revisit when self-hosted substrate is available. (tamago-attestation-report-verify)
pd-24 File upstream TamaGo issue for GHCB multi-call timing constraint on GCE Confidential VMs. Reproduction: 3 rapid SNP_GUEST_REQUEST calls fail on 3rd; 2 rapid + 1 delayed succeeds. Low urgency (workaround validated). See howto/tamago_sev_snp_multi_call_ghcb.md.
pd-25 Apply three-control reproducibility pattern (volume serial pinning, SOURCE_DATE_EPOCH, deterministic tar+gzip) to scripts/tor-gw-image-build.sh. pd-09 lists “reproducible image hash” as a goal but does not yet implement these controls. Pattern documented in howto/reproducible_fat32_disk_images.md. (tamago-reproducible-measurement retro §Recommendations 1)
pd-26 Offline LAUNCH_DIGEST computation — compute the expected SEV-SNP measurement from source without deploying a CVM. Requires understanding GCE’s OVMF firmware binary and the sev-snp-measure tool. Completes the end-to-end reproducibility story for third-party verification. Deferred from tamago-reproducible-measurement study. (tamago-reproducible-measurement retro §Recommendations 2)

Workflow Methodology

meta-01 Write a meta-howto on “expired warnings”: when a prior retro predicts failures that don’t recur, the absence is evidence the prior fix worked (cmd-identity-init)
meta-04 Document the MSI-X boot-crash pitfall on Intel+OVMF as a howto — symptom: tamago-example’s startInterruptHandler pattern (cpu.LAPIC.Enable() + ioa.EnableInterrupt(irq, vector) + goos.Idle override + cpu.ServiceInterrupts(isr)) causes guest triple-fault immediately after WireGuard interface-up on QEMU + OVMF + Intel host (serial cuts off mid-line, QEMU exits via -no-reboot). Root cause unidentified; likely LAPIC state or goos.Idle assumptions. Verify whether it reproduces on AMD EPYC + OVMF before committing to IRQ-driven designs. Reproducer: commit 7aa7cbc. (tamago-exit-node-gce late retro)
meta-05 Promote rootless UEFI disk-image packaging pattern to howto — scripts/tamago-package-disk.sh uses dd + /sbin/mkfs.vfat + mmd/mcopy (from mtools) to produce a FAT32 image with /EFI/BOOT/BOOTX64.EFI + /EFI/BOOT/shimx64.efi, then tars it for GCE compute images create. Replaces the sudo losetup + mount pattern that go-boot wiki’s GCE guide assumes. One-time setup handoff: sudo apt install dosfstools mtools. Validated 2026-04-20 producing tamago-exit-node-gce’s exitnode-tamago-v1 / v2 GCE images. (tamago-exit-node-gce late retro)
meta-07 Methodology note for Study: when Study writes “cheaper and more conclusive than a spike/pcap” about a non-trivial mechanism, test whether the cheap probe actually falsifies the mechanism or just tests one consequence of it. In gvnic-tls-tcp-relay-bug, the arithmetic MTU hypothesis fit the symptom pattern cleanly and looked conclusive on paper, but only a two-iteration A/B (MTU 1400 vs 1340) falsified it, while pcap would have been decisive with a single deploy. Guideline draft: when multiple mechanisms can explain the same symptom pattern, do the diagnostic that discriminates between them, not the one that merely confirms the most appealing one. Source: workflow::2026-04-23_gvnic-tls-tcp-relay-bug::retrospective.md.
meta-08 When spiking bitfield/script or similar libraries, create a coverage checklist of specific API methods used in proposal code snippets. Test terminal methods (WriteFile, String), WithEnv with partial vs full environment, and WithDir existence — not just the core Exec pattern. From workflow::2026-04-30_machine-server-build-tamago-pipeline::retrospective.md Recommendation #1.
meta-09 When adding a required field to a cross-binary struct, pre-count callers with grep before committing: NoiseStaticKey touched 8+ callers, all mechanical, but more than the estimate assumed. Methodology note. (via-tor-trusted-peers-tamago retro)

Considered but Not Pursued

meta-06 Consider splitting dual-stream workflows into separate proposals: Stream A and Stream B had zero dependencies, and managing both in one proposal added coordination overhead. Future dual-stream work may benefit from separate SIT workflows running in parallel.
ca-88 Bundled tor client: retired 2026-04-24 as out of scope for portdistrict. Rationale: portdistrict’s target users are technical (CLI-comfortable, security/privacy-aware), for whom sudo apt install tor / brew install tor is acceptable setup. Bundling tor would require one of: shipping a per-platform signed/notarized binary (macOS Gatekeeper, Windows SmartScreen), a CGo-linked libtor (cross-compile matrix pain, CVE treadmill), or download-and-verify on first run (Tor Project GPG pinning + per-OS Gatekeeper workarounds). Every one of these is sustained engineering that distracts from portdistrict’s actual domain (attested exit nodes + SOCKS5 chaining). The robust + cheap alternative is a clear error message on missing tor, with per-platform install hints, which is shippable as ~10 LOC in the existing --via-tor error path. Original scope item was filed from exit-try-via-tor deferred scope 2026-04-24. Full discussion in this session’s transcript.

Completed

wf-72 DONE 2026-05-03 machine server enable/disable --via-tor capability flag for non-remote-orchestrated hosts. Initial PARTIAL state (CLI shipped but produced an unreachable onion) was closed by the via-tor-host-capability-completion cycle. Full in-wild round-trip verified 2026-05-03 on a single Linux host: enable → tor onion published → claim populated registry → invite minted → friend exit try --accept-tofu recorded TOFU fingerprint matching the invite blob’s. See workflow::2026-05-03_via-tor-host-capability-completion::retrospective.md.
wf-73 DONE 2026-05-03 Daemon-side managed-Tor process supervision: internal/torsupervisor/Supervisor lifecycle (Start/Stop/Reattach/OnionHostname/Running, torrc generation with 0700 dir enforcement, hostname polling, descriptor publication wait via ControlSocket HS_DESC UPLOADED, SIGHUP reload, graceful shutdown). Phase 0 spike validated ControlSocket-from-Go-subprocess, re-attach across daemon restart, and non-standard HiddenServiceDir with daemon UID. Initial PARTIAL state closed by the completion cycle: torrc port wired to 1080 (matching the new SOCKS5+Noise listener); descriptor-publication race fixed (waitForDescriptor now polls the control socket via dialControlSocketWithRetry). In-wild round-trip verified 2026-05-03 (~6s for HS_DESC UPLOADED after tor start). See workflow::2026-05-03_via-tor-host-capability-completion::retrospective.md.
wf-77 DONE 2026-05-03 In-wild verification of via-tor host capability. Full enable --exit-node --via-tor → show → claim → exit invite --via-tor → friend exit try --accept-tofu → fingerprint pinned in tofu-onions.json cycle exercised end-to-end on a single Linux host. TOFU branch (exit_try_tor.go:236-273) confirmed working: friend received empty attestation frame (“Attestation: no report received (Tamago dev mode)”), recorded fingerprint 7459a125ba2a97780e7d1cc6973b2e580852f911d8aad92054d0f37dce6782f0 (matches the invite blob’s fingerprint). One in-wild bug fixed during the test: torsupervisor’s descriptor-wait race (one-shot net.DialTimeout raced tor’s async ControlSocket creation; replaced with bounded retry). See workflow::2026-05-03_via-tor-host-capability-completion::retrospective.md.
wf-78 DONE 2026-05-03 Registry write path for local non-confidential hosts. machine server claim --label <l> extended to handle non-confidential local via-tor hosts: reads OnionHostname + NoisePubkey from via-tor.json and WGPubKey from device keyring, writes NodeEntry{Confidential: false} directly to remote-nodes.json (no SSH delegation, no deploy step required). The Resolve(label) error path now creates a fresh entry when the local-host case applies (scope expansion logged in retrospective). In-wild verified 2026-05-03 — claim populated registry; subsequent exit invite --via-tor --node vt-test minted a usable TOFU blob. See workflow::2026-05-03_via-tor-host-capability-completion::retrospective.md.
wf-79 DONE 2026-05-03 Operator-side SOCKS5+Noise listener for non-CVM Linux hosts. internal/daemon/socks5noise.go ships a TCP listener on 127.0.0.1:1080 that wraps net.Listen with noisewrap.Listen, emits an empty attestation frame (writeAttestationFrame(conn, nil, nil) = 8 zero bytes — triggers TOFU branch on friend side), and runs a go-socks5 server with kernel net.Dial for egress. Daemon spawns/stops the listener alongside the tor supervisor (decoupled lifecycle, port-agreement-by-constant). Phase 0 spike validated 4/4 hypotheses including kernel-vs-netstack (chose kernel), empty-attestation-triggers-TOFU, and noisewrap-with-net.Listen. In-wild verified 2026-05-03: friend exit try --via-tor completed Noise NK handshake against the local listener through the tor circuit. See workflow::2026-05-03_via-tor-host-capability-completion::retrospective.md.
ca-109 DONE 2026-05-08 Trusted-peers-at-Tamago with metadata-polled hot-add. Moved primary auth from anonymous Noise NK to identity-revealing Noise IK with a trusted_peers map at Tamago, refreshed at runtime by a 60s metadata-poll goroutine reading the trusted_peers GCE instance attribute. Eliminates the cycle-3 chicken-and-egg between exit invite (post-deploy credentials) and Tamago boot (boot-time-only credential read). wf-76 invite tokens retained as mandatory second layer (24h credential expiry). Tor v3 client auth (ca-89) defaulted OFF (operator opt-in via --tor-client-auth); ClientOnionAuthDir footgun eliminated. Phase 0 spike validated flynn/noise IK + PeerStatic() API; zero pivots during implementation (4th consecutive zero-pivot cycle when spikes are used). All 797 tests pass. See workflow::2026-05-08_via-tor-trusted-peers-tamago::retrospective.md.
ca-110 DONE 2026-05-08 Token replay protection — addressed by ca-109’s Noise IK handshake freshness. Each connection establishes fresh ephemeral keys via the IK pattern, so a captured invite token cannot be replayed without also producing a new IK handshake (and the client static pub must be in trusted_peers). Combined with the existing 24h token expiry, replay risk is sufficiently mitigated; explicit single-use-token logic remains unnecessary. See workflow::2026-05-08_via-tor-trusted-peers-tamago::retrospective.md.
ca-111 DONE 2026-05-08 Client/token revocation — addressed by ca-109’s metadata-poll hot-remove. Operator removes a friend by updating the trusted_peers GCE metadata attribute (e.g. gcloud compute instances add-metadata); Tamago’s poll goroutine picks up the change within 60s and replaces the trusted-peers map under a write lock. No CRL/OCSP infrastructure needed. See workflow::2026-05-08_via-tor-trusted-peers-tamago::retrospective.md.
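The hot-add / hot-remove mechanism in ca-109 and ca-111 comes down to swapping a whole map under a write lock while handshakes read it under a read lock. A minimal sketch, with trustedPeers, Replace, and Allowed as hypothetical names and string keys standing in for Noise static public keys:

```go
package main

import (
	"fmt"
	"sync"
)

// trustedPeers is a concurrency-safe allowlist keyed by peer public key.
type trustedPeers struct {
	mu    sync.RWMutex
	peers map[string]struct{}
}

// Replace swaps in a freshly polled allowlist. Keys missing from the new
// list are revoked as a side effect, which is how hot-remove works.
func (t *trustedPeers) Replace(keys []string) {
	next := make(map[string]struct{}, len(keys))
	for _, k := range keys {
		next[k] = struct{}{}
	}
	t.mu.Lock()
	t.peers = next
	t.mu.Unlock()
}

// Allowed reports whether key is currently trusted. It is intended to be
// called from the handshake path.
func (t *trustedPeers) Allowed(key string) bool {
	t.mu.RLock()
	defer t.mu.RUnlock()
	_, ok := t.peers[key]
	return ok
}

func main() {
	var tp trustedPeers

	tp.Replace([]string{"alice-pub", "bob-pub"}) // first poll result
	fmt.Println(tp.Allowed("alice-pub"))

	tp.Replace([]string{"bob-pub"}) // operator removed alice from metadata
	fmt.Println(tp.Allowed("alice-pub"))
}
```

Building the new map before taking the lock keeps the write-locked section to a single pointer swap, so handshakes are blocked only briefly during a poll.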
ca-114 DONE 2026-05-08 Post-deploy trusted_peers metadata push (closes the “no-redeploy-to-add-friend” promise from ca-109). exit invite --via-tor now reads node Project/Zone/Label from the registry and uses the GCE Compute SDK (compute.InstancesClient.Get + SetMetadata with fingerprint preservation) to push the updated trusted_peers to the running Tamago VM whenever a new client noise pubkey is appended. Non-GCE / non-confidential nodes skip the push silently with a clarifying log line. ADC failures degrade to a warning + a gcloud compute instances add-metadata fallback hint instead of failing invite generation. Replaces the previous misleading “Run portdistrict machine server deploy --via-tor” hint that suggested redeploying. Landed as a [SCOPE] expansion on the cycle-4 branch ahead of in-wild verification. See workflow::2026-05-08_via-tor-trusted-peers-tamago::retrospective.md.
wf-48 Add --publish flag or machine server build-publish verb to upload disk image to GCS and create GCE custom image. Requires spike for Go SDK storage.NewWriter and compute.Images.Insert. Delivered 2026-04-30 as --publish flag on machine server build (with --bucket, --image, --project); pure Go SDK (cloud.google.com/go/storage + cloud.google.com/go/compute/apiv1); all three guest-os-features set; delete-then-create with 404 tolerance; post-create verification via ImagesClient.Get. Originated from workflow::2026-04-30_machine-server-build-tamago-pipeline::retrospective.md Rec #2. Closing retrospective: workflow::2026-04-30_machine-server-publish-image::retrospective.md.
wf-56 machine server build --source-dir semantics — clarify in help-text and enforce that --source-dir expects the project root, not the app subdirectory. Delivered 2026-04-30 on branch sit-feature/publish-image-ux-cleanup: help-text in printMachineServerUsage clarifies the project-root requirement; early os.Stat(<source-dir>/cmd/portdistrict-exitnode-tamago) check fires before compileTamago with a friendly error pointing at --source-dir . from repo root. Verified live (sanity check fires immediately for the wrong path; help-text shows the project-root note).
wf-58 claim --accept-discovered-measurement TOFU output. Delivered 2026-04-30 on branch sit-feature/publish-image-ux-cleanup: flag now threads from runMachineServerClaim through delegateArgs into runMachineServerRemoteEnable, which, on the TOFU branch taken when --measurement is absent, extracts the sevsnp.MeasurementLen measurement bytes from rawReport starting at sevsnp.MeasurementOffset and prints Attestation: measurement accepted (TOFU first-deploy: <hex>) plus a hint to record the value for subsequent strict --measurement claims. The misleading “measurement not checked” line remains for the path where neither flag is given.
wf-59 claim bootstrap-port closure messaging. Delivered 2026-04-30 on branch sit-feature/publish-image-ux-cleanup: after the “Node claimed. Exit-node active on port N.” line, claim now prints Bootstrap port 8888 closed; subsequent claim attempts will require redeploy. so operators understand why a re-attempt fails.