Roadmap
Project-level planning for this project. Items here feed sit_flow.
How to Use
- Quick capture: Add rough ideas to Inbox anytime (even mid-implementation)
- Pick work: Use item ID with sit_flow(slug: "wf-01")
- After reflect: Inbox items get triaged to themes or discarded
Inbox
- ri-04 scripts/reset-dns.sh doesn’t clean continuation records or group device namespaces: stale records survive across 3-node test runs, causing spurious decrypt/key-slot errors on the next launch. (group-records-binary-encoding in-wild)
- ri-05 Background scp via run_in_background reports success while stalling mid-upload: partial files land on the target with no error signal. Need a size/checksum verification step after server-transfer-sync.sh uploads. (group-records-binary-encoding in-wild)
- ri-07 Cross-provider tor-gw topology documentation: when tor-gw and Tamago deploy on different GCP projects / providers, Noise (ca-91) already encrypts the tor-gw→Tamago link. Document the cross-provider deployment variant (operator guide, firewall implications, Tamago private-IP distribution across projects). (tor-gw-vm-deployment-topology-automation retro §Recommendations 4) To be delivered by pd-11 Phase 3 cross-provider validation.
Workflow Improvements
- wf-01 Add a propose-phase check that compiles or grep-checks module paths in code snippets against go.mod (cmd-identity-init)
- wf-02 Add a propose-phase step: for each code snippet from spec docs, verify that the library API matches the actual installed version via go doc before implementation (record-encrypt)
- wf-03 Add a propose-phase check: for ECDH-based functions, trace key flow through both directions (publisher and resolver) to verify whether each parameter needs a seed (private key derivation) or a public key (record-format)
- wf-04 Add a propose-phase mechanical check: for each function referenced in code snippets, verify it exists in the actual source via go doc or grep (device-provision)
- wf-05 When adding fields to signed records, the implementation plan should explicitly enumerate all golden-value test files that need updating (device-provision)
- wf-08 In propose-phase LOC estimates for features involving multi-key cryptographic protocols (master + device + WG + delegation + outer envelope construction), estimate the test fixture helper separately from test cases. A 3:1 ratio (test LOC to production LOC) is realistic; the daemon-facilitator-serve estimate of 400 test LOC fell short by 665 LOC because the serverTestFixture helper alone was ~100 LOC and the multi-key setup work compounds across every test case (daemon-facilitator-serve)
- wf-09 For every feature using the privilege-gated “fail at the right step” test pattern, or any caller-side lookup that has a fallback branch (elevated 2026-04-13 after ca-25 recurrence during master-device-roster), also require at least one success-path integration test gated behind //go:build integration, documented as manual-run or root-capable-CI only. Retroactively apply to cmd-connect, daemon-minimal, daemon-facilitator, daemon-facilitator-serve to catch latent success-path bugs. The missing UAPI socket in wgplane.CreateInterface (shipped through three feature cycles, broke every first-run attempt) is the canonical privileged-ops case: fail-at-right-step tests assert that the logic preceding the privileged call is correct, but cannot tell you whether the downstream logic would have worked had the call succeeded. The ca-25 recurrence (non-privileged caller-side lookup with a Path A / Path B fallback confusion) generalizes the pattern: unit tests prove the crypto pipeline with pristine inputs; integration tests use pristine fixtures; neither exercises the lookup-failure path where real-world inputs diverge from pristine state. Both layers are required: caller-layer integration tests that exercise fallback branches AND the lookup-path inputs that trigger them (e.g., peerstore entries with missing fields, legacy imports, partially-populated state). (first-run bug hunt, elevated by master-device-roster retrospective)
- wf-10 When a CLI subcommand needs root but reads user-level XDG config paths, document the sudo HOME=$HOME pattern prominently in both the README and the tutorial. Alternatively, support --keyring / --device-keyring flags universally for explicit override. The first-run tutorial hit this because sudo resets HOME to /root and config.KeyringPath resolves against $HOME/.config. (first-run bug hunt)
- wf-11 For any internal/wgplane/ WireGuard interface code, integration tests must exercise the success path with actual packet traffic (ping or equivalent) through a pair of real interfaces. Merely creating and tearing down an interface does not validate that the configuration is functional. Three latent bugs in this package (UAPI socket missing, AllowedIPs default wrong, ListenPort never set) shipped through three feature cycles because wgplane_integration_test.go creates interfaces without verifying traffic flow. A single integration test that sets up two real interfaces on localhost (different netns or userspace loopback) and pings between them would catch all three bugs in one run. Apply retroactively to wgplane as the acceptance criterion for any future change to the package. Generalizes wf-09: not just success-path tests, but traffic-validating success-path tests for network code. (first-run bug hunt Session 2)
- wf-12 When drafting study notes, proposal design options, or ROADMAP items, explicitly enumerate applicable howtos from .sit/howto/ and apply them as design lenses before committing to a solution. Prior howtos are institutional knowledge that propagates forward only if consulted at design time, not just code time. The master-device-roster study initially failed to apply reorder_verification_with_channel_evidence.md despite it being directly applicable. (master-device-roster)
- wf-13 For features combining scheduling + crypto + time manipulation, budget 70% production LOC overshoot and 5:1 test-to-production ratio (vs standard 30-50% / 3:1). Source: daemon-syncplan retrospective LOC analysis.
- wf-14 Expand the [SCOPE] trigger to fire on step-level contraction (not only expansion). Cycle 1’s silent omission of the write path would have been caught if [SCOPE] also fired when planned steps are not delivered. Methodology change to scope_expansion_protocol.md. (daemon-groups-foundation)
- wf-15 LOC estimate revision for crypto + messaging + daemon + CLI features (the full-lifecycle variant, not just primitives). daemon-groups-foundation shipped ~2231 LOC total (975 production + 1256 test) vs the proposal’s ~1700 LOC wf-13-budgeted estimate, a 31% overshoot on top of wf-13’s already-generous 5:1 ratio. Candidate revision: budget 6:1 test-to-production for features that cross the primitives → daemon → CLI boundary end-to-end, treating wf-13 as a floor for primitives-only features and wf-15 as the ceiling for CLI-exercised full-stack features. Consider annotating wf-08 with the daemon-groups-foundation data point or letting wf-15 supersede it. (daemon-groups-foundation)
- wf-16 Read-path-first bias sentinel: cycle 1 of crypto/messaging features tends to complete read-path components and stop, because the read path has fewer dependency edges than the write path (no live tunnel required, no message exchange, no mock tunnel infrastructure in tests). Add a cycle-1 completion-gate sentinel grep for write-path symbols (CLI subcommands, daemon message handlers, message producers) BEFORE advancing from Implement to Reflect. If any write-path symbol returns zero matches, fail the gate and require explicit [SCOPE] acknowledgment or a second cycle to close the gap. Directly addresses the cycle 1 failure mode that this feature hit. Related to but distinct from wf-14 ([SCOPE] fires on contraction): wf-16 is the detection mechanism, wf-14 is the logging response. (daemon-groups-foundation)
  Strengthening (2026-04-14, from cycle 2.5 finding in commit 316b1a3): the symbol-grep sentinel is necessary but insufficient. A symbol existing in a package is not the same as the daemon/CLI actually calling it at runtime. Cycle 2 of daemon-groups-foundation created GroupMsgHandler and runGroupInvite/Accept/DevicePublish symbols that satisfied the cycle 1 sentinel, but the daemon’s cmd/portdistrictd/main.go never populated the corresponding daemon.Config fields, making the if d.cfg.GroupDir != "" and if d.cfg.GroupMsgListenAddr != "" branches in RunChores dead code at runtime. Caught only during in-wild test runbook authoring. A logically complete sentinel traces from main.go → daemon.Config{...} struct literal → RunChores conditional branches → target struct’s Run() call. This is harder to mechanize than a grep but catches the “symbol exists, symbol unreachable” failure mode. Candidate: AST-based reachability checker that validates every non-test .Run() method on a daemon Task implementation is reachable from a code path rooted at a main.go entry point. Minimum viable version: a per-feature manual check in the Reflect phase that opens main.go and greps for d.cfg.<NewField> references for every new Config field added in the cycle.
  Second strengthening (2026-04-14 in-wild test evening session, cycles 2.6 + 2.7): the pattern manifested at two deeper levels in the same session. Cycle 2.6: GroupInvite.AdminNodeID was set to the pseudonymous group node ID, but the trust-gate in handleInvite needed the master node ID; a mocked unit test masked the failure. Cycle 2.7: publishRecord was declared as a callback field on GroupMsgHandler, used in handleAccept conditionally (if h.publishRecord != nil), but daemon.go:RunChores never populated it; the else branch (encode-and-discard) was the only code path in production, and unit tests constructed handlers with whatever fields each test needed. Both were caught only by in-wild execution observing log vs DNS state divergence. This strengthens the case that reachability checks must also cover (a) non-nil callback struct fields, (b) identity-lookup semantics in mocks (mocks must receive specific values, not stub true/TRUSTED/nil unconditionally).
  Promotion candidate: howto “Every struct field on a daemon Task should be required at construction time or documented as optional”, replacing struct-literal construction with constructor functions (NewGroupMsgHandler(conn, ..., publishRecord, ...)) that fail at compile time when a caller forgets a required field. Optional fields get explicit With* methods.
- wf-17 GroupMsgHandler SIGTERM cleanup: conn.Close() on ctx.Done() to unblock ReadFrom and release the UDP socket. Observed 2026-04-14 in-wild test (daemon-groups-foundation finding #4): after SIGTERM, the GroupMsgHandler.Run goroutine stays blocked on conn.ReadFrom() because there is no context-cancellation wiring. The process doesn’t exit within the script’s 1-second post-SIGTERM sleep window; UDP 9998 stays bound by the old process; the new daemon fails with bind: address already in use. Workaround applied in scripts (fe9670d refine(scripts)): two-phase kill (SIGTERM → sleep 1 → SIGKILL). Real fix: in internal/daemon/group_msg_handler.go:Run, add a goroutine that waits on ctx.Done() and calls h.conn.Close(), which will unblock the pending ReadFrom with an error. Same pattern as the internal/daemon/facilitator_server.go UDP listener. Related but distinct from cq-03 (gofmt/goimports gate): wf-17 is a runtime-cleanup bug, cq-03 is a pre-commit formatting gate. (daemon-groups-foundation in-wild test)
- wf-18 Author in-wild test runbooks during Propose phase for features with daemon + CLI + external-IO integration surface. The runbook is a third verification layer (after unit tests and call-graph tracing), not a planning artifact. daemon-groups-foundation: runbook authored post-Reflect caught cycles 2.5/2.6/2.7; if authored during Propose, all three would have been caught during Implement. (daemon-groups-foundation late retrospective)
- wf-19 Promote candidate howto daemon_task_constructor_functions.md: constructor-vs-struct-literal for daemon task types. Two manifestations in daemon-groups-foundation (cycle 2.5 GroupDir / GroupMsgListenAddr not populated in daemon.Config literal; cycle 2.7 publishRecord callback not populated in GroupMsgHandler literal), but zero in other features. Held back by late-reflect subagent pending a second feature independently hitting the same shape. Rule: daemon chore task types with required dependencies get NewXxx(required1, required2, ...) *Xxx constructors; optional fields become WithXxx() methods. Promote when a third data point lands, or downgrade to an observation note inside scope_expansion_protocol.md if no new data point emerges within 2 features. (daemon-groups-foundation late retrospective)
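A minimal sketch of the wf-19 rule, under stated assumptions: Handler, PublishFunc, and the field names are hypothetical stand-ins, not the project's GroupMsgHandler. The sketch also returns an error alongside the pointer, which goes beyond the NewXxx(...) *Xxx shape in the rule, because a nil callback passes the compiler and has to be rejected at runtime.

```go
package main

import (
	"errors"
	"fmt"
)

// PublishFunc is a hypothetical callback type standing in for publishRecord.
type PublishFunc func(record []byte) error

// Handler is a hypothetical daemon task with two required dependencies and
// one optional setting.
type Handler struct {
	listenAddr string      // required
	publish    PublishFunc // required
	logPrefix  string      // optional
}

// NewHandler takes every required dependency as a positional parameter, so a
// caller that omits one does not compile. Empty or nil values are rejected
// explicitly because the compiler cannot see them.
func NewHandler(listenAddr string, publish PublishFunc) (*Handler, error) {
	if listenAddr == "" {
		return nil, errors.New("listenAddr is required")
	}
	if publish == nil {
		return nil, errors.New("publish callback is required")
	}
	return &Handler{listenAddr: listenAddr, publish: publish}, nil
}

// WithLogPrefix sets an optional field and returns the handler for chaining.
func (h *Handler) WithLogPrefix(prefix string) *Handler {
	h.logPrefix = prefix
	return h
}

func main() {
	publish := func(record []byte) error { return nil }

	h, err := NewHandler("127.0.0.1:9998", publish)
	if err != nil {
		panic(err)
	}
	h.WithLogPrefix("groupmsg")
	fmt.Println("constructed handler for", h.listenAddr)

	_, err = NewHandler("127.0.0.1:9998", nil)
	fmt.Println("nil callback rejected:", err != nil)
}
```

With a struct literal, both failure cases above would have produced a value that looks valid and silently takes the fallback branch.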
- wf-20 Promote or fold candidate howto layered_verification_protocol.md: the five-layer verification model (L1 grep → L2 gofmt/goimports → L3 call-graph trace → L4 in-wild execution → L5 cascading in-wild after upstream unblocks). Currently documented across two retros (daemon-groups-foundation original + late) but not consolidated. Decision needed: (A) standalone howto, or (B) appendix to existing scope_expansion_protocol.md, which already covers L1-L2. Recommendation: (B), extending scope_expansion_protocol.md with layers 3-5 rather than fragmenting. wf-16 sentinel work is the mechanization of L3. (daemon-groups-foundation late retrospective)
- wf-21 Spec-section cross-reference gate in Propose phase: when implementing a design spec, the migration map must cross-reference EVERY numbered section. Aggregator sections, worked examples, argument type rules, and deferred-command lists each define behaviors that a rename-only pass will miss. Howto written: spec_section_cross_reference_for_cli_migration.md. (portdistrict-cli-redesign)
- wf-22 In-wild validation as first-class implementation phase: proposals with deployment scripts should include “in-wild validation” as an explicit phase in the implementation plan, not a “mechanical follow-on.” (portdistrict-cli-redesign)
- wf-23 When a proposal migrates a CLI to match a design spec, the migration map must cross-reference EVERY numbered spec section (§1–§N), not just the command-tree tables. Aggregator sections (§3), worked examples (§4), and argument-type rules (§5) each define behaviors that a rename-only pass will miss. Driven by 3 PIVOTs during portdistrict-cli-redesign where §2 was mapped exhaustively but §3/§4/§5 were treated as Phase-2 deferrals; all three PIVOTs came from skipped sections.
- wf-24 Evaluate promotion of howto/platform_route_spike.md: mandate a Phase 0 spike exercising the actual OS route-installation path per platform (not just WireGuard mock interfaces) whenever AllowedIPs prefix length changes. Single supporting data point today: peer-overlay-derivation Finding 1 (darwin/BSD AddPeerRoute stripped CIDR prefix, broke /64 routing, only discovered in-wild because the Phase 0 spike validated coexistence on a mock interface, not through the real route syscall). Decision needed: promote now (preventive), wait for a second data point (avoid over-fitting to one bug), or drop entirely (cost/benefit too low for a change class that rarely happens). (peer-overlay-derivation deferred)
- wf-25 In-wild test scripts should SSH via ~/.ssh/config aliases (not overlay/public DNS names) and use nohup ... </dev/null >log 2>&1 & for remote background commands. Two script regressions in test-3node-exit.sh (hostnames, SSH backgrounding) predated the wgplane-native-platform-stubs workflow and were only caught by running the in-wild suite. (wgplane-native-platform-stubs)
- wf-26 TamaGo exit-node --open / --demo trust mode: accept any exit_request (logging consumer pubkey for audit) so exit try with ephemeral node IDs works against stricter trust policies. Currently the TamaGo binary uses allowAll{} (main.go:133), which makes exit try work today, but any future move to a stricter TrustChecker would break ephemeral consumers. Filed from the exit-try-onramp retrospective, where the study flagged it as a blocking prerequisite.
- wf-27 Health-gate retry with exponential backoff. DONE 2026-04-28 (closed by pd-19): cmd/portdistrict/exit_try_tor.go:373 healthBackoffs extended from [2s, 4s, 8s] (~14s budget) to [5s, 10s, 20s, 30s] (~65s budget) as defense-in-depth alongside pd-19’s truthful server-side readiness gate. Different specific values from wf-27’s planned [500ms, 1s, 2s] because the dominant delay is descriptor-publication cold-start (seconds-to-minutes) rather than WG handshake (sub-second). The intent (exponential-backoff retry to absorb transient delays) is fully realized. See workflow::2026-04-28_tor-gw-consensus-hsdir-cold-start::retrospective.md.
- wf-28 Add context.Context to the ExitSender.Receive interface to fix the goroutine leak on timeout. Both runExitConnect and runExitTry share the pattern where the receive goroutine leaks if the timeout fires. Pre-existing debt flagged in exit-try-onramp retrospective.
- wf-29 Name Noise protocol cipher states by role, not ordinal: cs_init / cs_resp instead of cs0 / cs1 (and similarly owner_send / owner_recv instead of c1 / c2 for any channel pair). The trusted-tamago-node bootstrap responder shipped with the cipher states reversed because the ordinal names did not encode the Noise spec’s role mapping. Role-based naming makes the initiator→responder / responder→initiator direction explicit at the declaration site. Howto exists: howto::noise_cipher_state_direction.md. Apply as a convention check in any future Noise_IK or Noise_XX implementation review. (trusted-tamago-node)
- wf-30 Add a post-create verification step to scripts/tamago-publish-image.sh AND scripts/tor-gw-publish-image.sh that runs gcloud compute images describe --format='value(guestOsFeatures)' and fails if UEFI_COMPATIBLE, SEV_SNP_CAPABLE, or GVNIC is missing. Catches misconfiguration at image-publish time (~5 minutes earlier in the deploy loop than at instance-create time). Sources: workflow::2026-04-23_gvnic-tls-tcp-relay-bug::retrospective.md (UEFI_COMPATIBLE + SEV_SNP_CAPABLE), workflow::2026-04-25_tor-gw-hardened-image-cloud-portable::late-retrospective.md (GVNIC requirement, grounded in pd-11 in-wild GCE failure when the flag was missing). Partial closure 2026-04-30: the Go publish path (machine server build --publish) now verifies all three features via ImagesClient.Get after image creation; see wf-48. Shell-script verification remains open; consider deprecating the shell scripts (see wf-55).
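The feature check itself is small enough to sketch independently of how the feature list is fetched. The function name missingFeatures is hypothetical, and the sketch assumes the caller has already extracted the guest OS feature names as strings (from gcloud output or from an API response); it does not show the project's actual ImagesClient.Get code.

```go
package main

import "fmt"

// requiredFeatures are the three guest OS features named in wf-30.
var requiredFeatures = []string{"UEFI_COMPATIBLE", "SEV_SNP_CAPABLE", "GVNIC"}

// missingFeatures returns the required features absent from reported, in the
// order they appear in requiredFeatures. An empty result means the image passes.
func missingFeatures(reported []string) []string {
	have := make(map[string]bool, len(reported))
	for _, f := range reported {
		have[f] = true
	}
	var missing []string
	for _, f := range requiredFeatures {
		if !have[f] {
			missing = append(missing, f)
		}
	}
	return missing
}

func main() {
	// Hypothetical feature list for an image published without the gVNIC flag.
	reported := []string{"UEFI_COMPATIBLE", "SEV_SNP_CAPABLE"}
	if missing := missingFeatures(reported); len(missing) > 0 {
		fmt.Println("image verification failed, missing:", missing)
		return
	}
	fmt.Println("image verification passed")
}
```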
- wf-31 Audit bootstrap-then-real setup patterns in TamaGo + gVisor integration. cmd/portdistrict-exitnode-tamago/nic_gvnic.go added a bootstrap link-local address and kept it after adding the real address, which caused ca-88’s destination-selective failure. Search for similar AddProtocolAddress / AddAddress patterns where a temporary is installed for bootstrap and the real value is added later without removing the temporary. Source: report::2026-04-23_ca88-fix-source-address-selection.md.
- wf-32 Transport Invariant Audit step in proposals: when reusing an adjacent module across transport topologies, list its implicit invariants and verify each against the new transport’s physical properties. Would have caught the proxyhealth.Check IP-inequality mismatch at proposal time. (exit-try-via-tor)
- wf-33 Transport-invariant audit step in Propose phase: when a proposal reuses a module across transport topologies (e.g., proxyhealth.Check reused from --direct into --via-tor), the proposal must enumerate the reused module’s implicit invariants and map each to a physical property of the new transport. Mismatches get flagged and either resolved at proposal time or logged as known-risks. Would have caught the proxyhealth.Check strict-IP invariant mismatch at proposal time; instead it surfaced as a [PIVOT] during in-wild testing of exit-try-via-tor. Candidate howto: auditing_inherited_modules_across_transports.md. Single data point today (exit-try-via-tor); promote to howto on second occurrence, or write preventively if the next transport cycle (e.g., ca-91 chainer-tamago-noise) proves this lens is useful. (exit-try-via-tor retrospective)
- wf-34 When introducing a flag that resolves a bundle of values from disk (like --from <ns> resolving onion + tamago_key + future fields), enumerate EVERY value the bundle supplies and mutex it against the resolver flag. Single-value mutex is incomplete if the resolver supplies multiple values. The pd-08 in-wild test caught this: --from correctly mutex’d against --via-tor <onion> but silently accepted --from together with --tamago-key, then used the grant’s key while ignoring the explicit one. Grounded in workflow::2026-04-25_signed-exit-grant-bundle-for-via-tor::retrospective.md §[BUG] 2.
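A minimal sketch of the full-bundle mutex check from wf-34. The function fromConflicts and the bundleFlags table are hypothetical; the sketch assumes the CLI can report which flags the user set explicitly (for example by walking the parsed flag set) and does not reflect the project's actual flag handling.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// bundleFlags lists every flag whose value a --from bundle supplies. A new
// field added to the bundle must get an entry here (hypothetical table).
var bundleFlags = []string{"via-tor", "tamago-key"}

// fromConflicts returns the explicitly set flags that clash with --from.
// explicit maps flag names (without dashes) to "the user passed this flag".
func fromConflicts(explicit map[string]bool) []string {
	if !explicit["from"] {
		return nil
	}
	var clash []string
	for _, name := range bundleFlags {
		if explicit[name] {
			clash = append(clash, "--"+name)
		}
	}
	sort.Strings(clash)
	return clash
}

func main() {
	// The case the pd-08 in-wild test hit: --from together with --tamago-key.
	explicit := map[string]bool{"from": true, "tamago-key": true}
	if clash := fromConflicts(explicit); len(clash) > 0 {
		fmt.Printf("error: --from cannot be combined with %s\n", strings.Join(clash, ", "))
	}
}
```

Keeping the bundle's fields in one table means the mutex and the resolver stay in step as the bundle grows.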
- wf-35 String-typed enum constants used in cross-package comparisons must be imported from their canonical setter, not duplicated as string literals. The pd-08 in-wild test caught this: cmd/portdistrict/exit_accept.go compared peer.TrustState != "trusted" (lowercase) but cmd/portdistrict/peer.go:329 writes "TRUSTED" (uppercase). The error message itself revealed the mismatch (is not trusted (state: TRUSTED)). Unit tests masked it because they constructed peerstore records inline using whatever string the test wrote. Candidate fix: promote TrustStateTrusted/Verified/Seen/Revoked constants in internal/peerstore/peerstore.go and require all callers to use them. Grounded in workflow::2026-04-25_signed-exit-grant-bundle-for-via-tor::retrospective.md §[BUG] 1.
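The candidate fix can be sketched as a typed string with named constants. Only the "TRUSTED" value is confirmed by the item above; the other three string values, the Peer struct, and canAccept are assumptions made for illustration.

```go
package main

import "fmt"

// TrustState is a typed string so that comparisons go through named constants.
type TrustState string

const (
	TrustStateTrusted  TrustState = "TRUSTED"  // value confirmed by the item above
	TrustStateVerified TrustState = "VERIFIED" // assumed spelling
	TrustStateSeen     TrustState = "SEEN"     // assumed spelling
	TrustStateRevoked  TrustState = "REVOKED"  // assumed spelling
)

// Peer is a hypothetical, trimmed-down peerstore record.
type Peer struct {
	TrustState TrustState
}

// canAccept compares against the canonical constant, never a string literal.
func canAccept(p Peer) bool {
	return p.TrustState == TrustStateTrusted
}

func main() {
	p := Peer{TrustState: TrustStateTrusted}
	fmt.Println("constant comparison:", canAccept(p))
	// The original bug: a duplicated lowercase literal never matches.
	fmt.Println("lowercase literal comparison:", p.TrustState == "trusted")
}
```

A typed string still accepts untyped literals in comparisons, as the last line shows, so the constants need a review or lint rule behind them to be effective.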
- wf-36 Kernel package signing with project key: replace --allow-untrusted with a project-specific APK signing key for the custom kernel package. --allow-untrusted is acceptable for development but should be revisited before any release workflow. Source: sev-snp-capable-alpine-kernel-package study §Deferred Scope.
- wf-37 Integration with portdistrict exit grant to auto-populate the measurement field: programmatic piping from tamago-show.sh --json to portdistrict exit grant --from-json to reduce operator error. Deferred from tamago-show-script study.
- wf-38 Merge tor-gw-show.sh and tamago-show.sh into a single unified operator show script: both scrape from GCE serial; a combined script could reduce operator confusion. Deferred from tamago-show-script study.
- wf-40 Implement non-confidential bootstrap protocol (simplified Noise → owner-key proof → WireGuard handoff, no attestation exchange) so machine server claim works for --on hetzner / --on self deployments. Currently errors when deploy metadata lacks --confidential. From workflow::2026-04-29_machine-server-provisioning-cli::retrospective.md Recommendation #1.
- wf-42 Specify the --on self provider path: doctor --on self should check port bindability, SSH reachability, and distro detection (not ADC). Currently underspecified. From workflow::2026-04-29_machine-server-provisioning-cli::retrospective.md Recommendation #4.
- wf-43 Implement remote capability toggling (enable --label / disable --label / show --label for non-local servers) via a daemon-to-daemon control protocol. Currently errors with “SSH in and run locally”. Restores the 5-command operator path documented in the machine-server-provisioning-cli proposal Summary. From workflow::2026-04-29_machine-server-provisioning-cli::retrospective.md Recommendation #5.
- wf-44 Extend bootstrap protocol to include emit report alongside session report, enabling claim-time app hash verification. Currently the claim path only receives the session report (nonce+sessionHash), not the emit report (noisePub+appHash). From workflow::2026-04-30_offline-measurement-prediction::retrospective.md Recommendation #1.
- wf-47 In-wild GCE byte-equality validation: deploy a real GCE Confidential VM, capture its reported LAUNCH_DIGEST, run PredictFirmware against the corresponding OVMF binary downloaded from gs://gce_tcb_integrity/ovmf_x64_csm/<hash>.fd, and assert byte-equality. Phase 0 spike A1 only validated that gce-tcb-verifier/sev.LaunchDigest compiles and rejects bad input; it did NOT compare a predicted digest to a real CVM’s reported value. Without this test, PredictFirmware is plausibly-typed but unproven against reality. Manual gate before any production rollout that relies on --measurement-file for first-deployment substitution detection. Uses GCE spend; suitable for a one-time validation run after the 43 NOT-VERIFIED markers are human-reviewed. From workflow::2026-04-30_offline-measurement-prediction::retrospective.md follow-on identified during reflect.
  First data point captured 2026-04-30 (in-wild test, label wild-test-1, project portdistrict, zone europe-west3-b, --vcpus 2, OVMF 759990ee... from public TCB bucket dated 2026-04-15): predict ≠ reality. Predicted 2d24cf9624ee36449e50c6c84042540b05898f6559f02741b7b354e0cc2ed18d108352ade7dfc4cecce4fa974e51c773 vs reported 7a5ed176bad8a9ff02cebb94b24b076a0b1905042a85d9fca7670d3a3ff466db3b1c2b76f8eca888f8d806d2ec92434e. Attestation chain (ECDSA-P384, VCEK→ASK→ARK) valid; app hash matched exactly. Hypotheses: (a) regional OVMF rollout lag (the public bucket’s newest binary isn’t what GCE actually loads in europe-west3-b yet); (b) predict.LaunchDigest missing GCE-specific input (kernel/IDBlock/boot params); (c) vcpu/topology assumption mismatch. Pivot wf-47 from “deploy and validate” to “fix prediction or document divergence”. See workflow::2026-04-30_machine-server-publish-image::retrospective.md §In-Wild Validation [BUG-2].
- wf-49 Add OVMF caching to ~/.cache/portdistrict/ovmf/ to avoid repeated ~4MB downloads when building frequently with the same measurement
- wf-50 Add automatic measurement discovery from running CVM metadata to complement machine server build --ovmf-from-measurement (eliminates the need to manually copy measurement hex)
- wf-51 Add OVMF binary caching to ~/.cache/portdistrict/ovmf/<digest>.fd to avoid repeated ~4MB downloads on rebuilds. Use xdg.CacheHome (already a dep). Skip download if the cached file’s SHA-384 matches the expected digest. From workflow::2026-04-30_add-ovmf-firmware-auto-download-to-build::retrospective.md Recommendation #1.
- wf-52 Add automatic measurement discovery from running CVM (e.g. via SSH or GCE metadata) to complement the attest verb: eliminates manual --ovmf-from-measurement <hex> lookup. From workflow::2026-04-30_add-ovmf-firmware-auto-download-to-build::retrospective.md Recommendation #2.
- wf-53 Standalone publish-image verb: publish pre-built archives to GCS + GCE without rebuilding. Currently --publish requires --source-dir. From workflow::2026-04-30_machine-server-publish-image::retrospective.md deferred scope.
- wf-54 Image versioning: support timestamped image names (e.g. exitnode-tamago-gvnic-20260430) instead of fixed exitnode-tamago-gvnic, enabling rollback. Currently uses delete-then-create with a single name. From workflow::2026-04-30_machine-server-publish-image::retrospective.md deferred scope.
- wf-55 Assess whether shell publish scripts (scripts/tamago-publish-image.sh, scripts/tor-gw-publish-image.sh) can be deprecated now that machine server build --publish exists. wf-30 shell-script verification is still open. From workflow::2026-04-30_machine-server-publish-image::retrospective.md recommendation #3. Assessment done 2026-05-11; full classification in scripts/README.md (code::scripts/README.md). Verdict: (1) eight scripts are fully replaced by the portdistrict CLI today and can be deleted in one PR: tamago-build-exit-node.sh, tamago-package-disk.sh, tamago-deploy-gce.sh, tamago-show.sh, tor-gw-image-build.sh, tor-gw-deploy.sh, tor-gw-show.sh, tor-gw-startup.sh (closes/retargets wf-37, wf-38); (2) the two named publish scripts have a thin residual gap (CLI --publish requires --source-dir so it rebuilds; no publish-pre-built-archive path): keep until wf-53 lands, then delete and close wf-30 as superseded by the Go ImagesClient.Get check (wf-48), retargeting pd-25 to the Go build path; (3) the remaining ~25 scripts (kernel-APK build input build-kernel-pd-snp.sh, CI harnesses tamago-verify-reproducible.sh / tamago-attest-e2e.sh, QEMU+spike harnesses, the 3-node integration suite, daemon wrappers, debug/reset utilities) are out of scope and stay. Item stays open for the verdict (1) deletion PR.
- wf-57 --ovmf-from-measurement first-deploy ergonomics: add help-text disambiguation that the flag expects an SEV-SNP launch measurement (the digest attest reports), not the OVMF binary SHA, and explicitly hint “first deploy: use --ovmf <path>”. Optional: add an --ovmf-from-bucket-latest shortcut that downloads the most recent .fd from the public TCB bucket, covering greenfield in one flag. Related to wf-50 (auto-discover from running CVM). From workflow::2026-04-30_machine-server-publish-image::retrospective.md §In-Wild Validation [BUG-1].
- wf-60 exit try --invite <single-blob>: collapse the three-flag friend-side invite (--direct <ip:port> --pubkey <key> --token <token>) into one labeled base64url blob so the operator sends one short string in chat instead of a long multi-flag command. Reuse prior art: mirror the portdistrict-grant1:<base64url-payload> format already used by exit grant / exit accept (see workflow::2026-04-25_signed-exit-grant-bundle-for-via-tor::retrospective.md and code::internal/signalbundle/bundle.go Encode/Decode matched-pair pattern: struct → map → canonjson → ed25519.Sign → re-marshal). Picks: label portdistrict-invite1: (versioned, parallel to grant1:); payload {transport: "direct", direct, pubkey, token} (the token field is already a base64url blob from code::internal/exitnodeshared/invite_trust.go). Schema extensibility: include an explicit transport field on v1 even though only "direct" is supported initially; this lets a future item add via-tor (or other transports) by extending the same invite1: envelope rather than minting invite2:. Decoder should reject unknown transport values with a clear error directing the user to upgrade portdistrict. Bundling with tor-gw-show.sh is deliberately out of scope (different producer language, different trust model; the via-tor path already has portdistrict-grant1:, and tor-gw-show.sh is on the wf-55 deprecation track). Implementation: ~30 LOC encode/decode + a new --invite <blob> flag on exit try that expands to the existing three flags; machine server invite output gains a second line printing the compact form (existing three-flag form stays for back-compat). Update site: code::cmd/portdistrict/machine_server_remote.go:454 where runMachineServerRemoteInvite prints portdistrict exit try --direct ... --pubkey ... --token .... Sibling of wf-53 (standalone publish verb): same theme of “ergonomics for the invited side.” Origin: 2026-04-30 conversation about friend-side convenience after the wf-48 in-wild test.
- wf-61 Add release binary signing (cosign or GPG) to .forgejo/workflows/release.yml. Distinct from APK signing (wf-36): this covers the standard Go binaries produced by the release workflow. (release-binaries retro 2026-04-30)
- wf-62 Docker image registry publishing for portdistrictd. Distroless Dockerfile exists (test/fixtures/distroless/Dockerfile) but no image registry publishing workflow. (release-binaries study deferred scope 2026-04-30)
- wf-64 exit invite --legacy: emit the three-flag portdistrict exit try --direct ... --pubkey ... --token ... form alongside the fenced display, for friends running older portdistrict builds that lack the positional invite-blob detection. Was in the proposal but skipped during implementation; the three-flag form is still available via direct invocation, so this is a small ergonomic add (~10 LOC). From workflow::2026-04-30_exit-invite-convenient-defaults::retrospective.md Recommendation #1.
wf-67Extendmachine serverwith a--with-facilitatoraugmentation flag that opts an exit-node deploy into colocating the facilitator capability on the same VM as the exit-node (default), with a--facilitator-vmescape hatch for the rare case where a separate VM is wanted. Local-host case is already shipped: on a regular hostmachine server enable --exit-node --facilitatoralready runs both capabilities inside oneportdistrictd(cli_semantics_design.md §2.3.2;cmd/portdistrictd/main.go:22,229,242wiresfacilitatorplanebased onfacilitator.json). wf-67 brings the same colocation to remote-orchestration. Two paths, both colocated by default: (1) non-confidential deploy (deploy --on hetzner --label foo --with-facilitatorordeploy --on gce --label foo --with-facilitatorwithout--confidential): trivial — provisions a single regular Linux VM runningportdistrictd --exit-node --facilitator, where both capabilities share the daemon already; the augmentation flag just toggles thefacilitator.jsonconfig-file write at deploy time and opens the firewall port. No image change needed beyond the existing exit-node deploy path. (2) confidential / Tamago deploy (deploy --on gce --confidential --label foo --with-facilitator): the Tamago binary built bymachine server build(cmd/portdistrict-exitnode-tamago) is single-purpose and does not contain facilitator code today — colocation requires a Tamago port of the facilitator (filed asca-103): TCP listener (Tamago net stack handles this), registration store (in-memory map suffices for single-VM scope), no glibc dependencies. ca-103 is wf-67’s main implementation cost; spike before propose to confirm the Tamago net stack covers whatinternal/facilitatorplane/needs. 
Once ported, build --confidential --with-facilitator produces a single Tamago binary with both planes, and deploy --confidential --with-facilitator provisions one CVM that runs both, preserving the pd-18 trust chain for the exit-node plane unchanged. --facilitator-vm escape hatch (low priority, defer behind colocation): when the operator wants to scale the facilitator independently or place it in a different network zone, --with-facilitator --facilitator-vm provisions a second VM running portdistrictd --facilitator, paired by --label. The default stays colocated. Per-verb behaviour with --with-facilitator (colocated mode, mirrors wf-66’s label-paired idiom but with one VM): build --with-facilitator produces an image whose daemon has both capabilities compiled in (Tamago port required for --confidential); deploy --with-facilitator writes both exit-node.json and facilitator.json into instance metadata and opens both ports (51821/UDP + 7777/TCP); attest --label foo validates pd-18 on the single CVM as today (no extra check, since the facilitator runs in the same measured boot); claim --label foo registers both the exit-node endpoint and the facilitator endpoint in the local device record in one Noise → owner-key proof exchange; show/doctor surface both planes’ state from the one VM; destroy tears down the one VM. Compatibility with --via-tor: build/deploy --confidential --via-tor --with-facilitator --label foo provisions two VMs (exit-node-with-facilitator CVM + tor-gw CVM); --via-tor always adds a second VM (different image, different role), --with-facilitator does not. So the VM-count matrix is 1 + (--via-tor ? 1 : 0) + (--with-facilitator --facilitator-vm ? 1 : 0). Confidential vs regular axis: --with-facilitator does not by itself require --confidential (regular path is path 1 above and is essentially free), but the confidential path is gated on the Tamago facilitator port landing first.
Until ca-103 lands, --confidential --with-facilitator either errors out pointing at ca-103, or auto-implies --facilitator-vm (decide during study). Doc impact: cli_semantics_design.md §2.3.2 grows a --with-facilitator paragraph parallel to wf-66’s --via-tor paragraph, noting colocation as the default. Scope split: (a) non-confidential colocated path is small: a config-file plumbing change + firewall port. (b) confidential colocated path is medium: Tamago facilitator port (tracked as ca-103, plus a spike). (c) --facilitator-vm escape hatch is medium: reuses wf-66’s two-VM pattern. Recommended study sequence: (a) → (b), with (c) deferred. Order: file after wf-66 lands; the augmentation pattern wf-66 establishes (label-paired co-deployment in deploy state, attest/claim recovering augmentations from deploy state, show displaying multi-plane state) is the foundation wf-67 reuses. Origin: 2026-05-01 conversation. First framed as a separate-VM augmentation, then corrected: facilitator can colocate on the same VM as the exit-node (matches the local-host two-capability case in cli_semantics §2.3.2), with the separate-VM path kept as an opt-in escape hatch. -
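Sketch for wf-67: the VM-count matrix above written as a Go function. The deployFlags struct and vmCount name are hypothetical and only illustrate the arithmetic; they are not taken from cmd/portdistrict/.

```go
package main

import "fmt"

// deployFlags is a hypothetical holder for the three flags that
// influence how many VMs a deploy provisions.
type deployFlags struct {
	ViaTor          bool // --via-tor
	WithFacilitator bool // --with-facilitator
	FacilitatorVM   bool // --facilitator-vm
}

// vmCount mirrors the matrix in wf-67:
// 1 + (--via-tor ? 1 : 0) + (--with-facilitator --facilitator-vm ? 1 : 0).
func vmCount(f deployFlags) int {
	n := 1 // the exit-node VM is always present
	if f.ViaTor {
		n++ // tor-gw companion VM
	}
	if f.WithFacilitator && f.FacilitatorVM {
		n++ // separate facilitator VM (escape hatch only)
	}
	return n
}

func main() {
	// Colocated facilitator plus tor-gw: the facilitator adds no VM.
	fmt.Println(vmCount(deployFlags{ViaTor: true, WithFacilitator: true}))
}
```

Note that --facilitator-vm without --with-facilitator contributes nothing in this sketch; whether the CLI should reject that combination is left to the study.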
wf-68 Context-timeout audit on existing multi-step deploy paths in cmd/portdistrict/. The --via-tor augmentation (wf-66 deploy half) surfaced a latent context-timeout budget mismatch: the original 5-min context.WithTimeout was insufficient because serial polling alone can take 6 min (240s onion poll + 120s Noise pubkey poll). Fixed by extending to 12 min when viaTor=true. Audit the other multi-step functions (runMachineServerBuild, runMachineServerAttest, runMachineServerClaim, runMachineServerReconnect) for similar latent issues, especially before wf-67 --with-facilitator adds another sequential VM operation. Apply the rule from howto::context_timeout_budget_for_multi_step_operations.md. From workflow::2026-05-01_machine-server-via-tor-deploy-augmentation::retrospective.md Recommendation #1. -
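Sketch for wf-68: one way to make the budget explicit is to derive the context timeout from the worst-case duration of each sequential step instead of hard-coding a total. The stepBudget type, the totalBudget helper, and the one-minute headroom are assumptions for illustration; the howto document may state the rule differently.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// stepBudget (hypothetical) names one sequential step and the longest
// time it is allowed to poll.
type stepBudget struct {
	Name string
	Max  time.Duration
}

// totalBudget sums the worst-case duration of every sequential step and
// adds headroom for the non-polling work in between.
func totalBudget(steps []stepBudget, headroom time.Duration) time.Duration {
	total := headroom
	for _, s := range steps {
		total += s.Max
	}
	return total
}

func main() {
	// The two poll budgets named in wf-68.
	steps := []stepBudget{
		{"onion poll", 240 * time.Second},
		{"noise pubkey poll", 120 * time.Second},
	}
	budget := totalBudget(steps, time.Minute)

	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()

	deadline, _ := ctx.Deadline()
	fmt.Printf("budget=%s, exceeds 5 min: %v\n", budget, time.Until(deadline) > 5*time.Minute)
}
```

With this shape, adding a step (for example a facilitator poll in wf-67) extends the timeout automatically instead of silently eating into the existing budget.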
wf-69 Programmatic length assertions on fixed-length test constants. The wf-66 deploy-half implementation hit a [TEST] entry where a hand-constructed v3 onion address was 54 chars instead of the required 56 (regex [a-z2-7]{56} correctly rejected the test data; the test was wrong). Add a small test helper or convention that asserts len(constant) == expected at the test-file declaration site for fixed-length crypto/protocol values (onion-v3 = 56 base32, ed25519 pubkey = 44 base64url, X25519 pubkey = 43 base64url, etc.) so length errors fail at test-load time with a clear message instead of inside opaque regex non-matches. From workflow::2026-05-01_machine-server-via-tor-deploy-augmentation::retrospective.md Recommendation #2. -
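Sketch for wf-69: a helper that checks the length at the declaration site. Because package-level variables are initialized before any test runs, a wrong constant panics at load time with a readable message. The mustLen name and the placeholder onion label are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// mustLen returns s unchanged when it has exactly want characters and
// panics with a descriptive message otherwise.
func mustLen(name, s string, want int) string {
	if len(s) != want {
		panic(fmt.Sprintf("%s: got %d chars, want %d", name, len(s), want))
	}
	return s
}

// testOnionV3 is a placeholder 56-char label, checked when the package
// initializes. In a real test file this would live in a _test.go file.
var testOnionV3 = mustLen("testOnionV3", strings.Repeat("a", 56), 56)

func main() {
	fmt.Println(len(testOnionV3))
}
```

The same call works for the other fixed-length values listed in the item by passing the expected length for each.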
wf-70 DONE 2026-05-05 Added [--via-tor] [--tor-gw-zone <zone>] [--tor-gw-image <name>] continuation line to the deploy entry in printMachineServerUsage (cmd/portdistrict/machine_server.go:48-49), mirroring the existing build continuation style. Cosmetic 1-LOC fix; flags themselves were already wired in machine_server_deploy.go:26-93 since wf-66. Shipped in commit 453d35e. -
wf-71 [BUG] DONE 2026-05-02 machine server deploy --via-tor --confidential did not propagate --confidential to the tor-gw companion VM. Fix shipped same day: added confidential bool parameter to runDeployTorGW; when true, sets ConfidentialInstanceConfig.ConfidentialInstanceType = "SEV_SNP", Scheduling.OnHostMaintenance = "TERMINATE", and MinCpuPlatform = "AMD Milan" on the tor-gw InsertInstanceRequest; call site in runDeployGCE updated to thread confidential through. Validated in-wild same day on instance wf66-smoke3-110535-tor-gw: gcloud instances describe returns SEV_SNP / TERMINATE / AMD Milan; tor-gw serial shows sev-guest sev-guest: Initialized SEV guest driver → tor-gw-init: SEV-SNP detected (/dev/sev-guest present) → derived key obtained from SEV-SNP → sealed onion identity written to tmpfs → attestation report signature verified against locally-cached AMD cert chain (pd-18 chain) with measurement 7a5ed176bad8a9ff02cebb94b24b076a0b1905042a85d9fca7670d3a3ff466db... (canonical pd-18 measurement); produced canonical pd-15/pd-16 sealed onion 3efkxftj…7i7qd.onion. See workflow::2026-05-01_machine-server-via-tor-deploy-augmentation::retrospective.md “In-wild re-run” section. Closes the gap between wf-66’s design intent and shipped behavior. -
pd-27 [BUG] DONE 2026-05-02 Tor inside the tor-gw-hardened image failed with /var/lib/tor: Permission denied on fresh boot. Serial output sequence: tor-gw-init: starting (PID 471) → tor-gw-init: tamago private IP = 10.156.0.37 → tor-gw-init: torrc written → tor-gw-init: tor started (PID 476) → Permission denied → [warn] Failed to parse/validate config: Couldn't create private data directory "/var/lib/tor" → [err] Reading config failed--see warnings above. → FATAL: hostname not generated within 1m0s (the hostnameTimeout = 60s from cmd/tor-gw-init/main.go:40). Tor exits before it can publish the onion descriptor; the deploy-side pollSerialForMarker correctly times out at 240s and saves an empty OnionHostname. Pre-existing image regression, not caused by wf-66: the legacy scripts/tor-gw-deploy.sh would hit the identical failure if run against this image build. The pd-09 hardened-image design (Alpine + read-only rootfs + writable tmpfs at runtime-state paths) shipped working in pd-11 Phase 1 + pd-16 (two-CVM same-onion validation). Something in a later image rebuild broke the writable mount at /var/lib/tor, most likely the tmpfs-mount step in init or the directory ownership/permissions. To investigate: (a) gcloud compute images describe tor-gw-hardened to check creation timestamp + family/source; (b) cmd/tor-gw-init/main.go for any change to writable-path setup since pd-11; (c) Alpine image inittab/sysinit scripts for mount -t tmpfs ... /var/lib/tor or chown to the tor user; (d) torrc generation for explicit User tor / DataDirectory directives. Surfaced by: in-wild test of wf-66 deploy half on 2026-05-02 (instance wf66-smoke-101539-tor-gw). Filed as pd because this is platform-deployment infrastructure, not a CLI workflow issue. Root cause identified same day: mountTmpfs() in cmd/tor-gw-init/main.go:303-318 mounts a fresh tmpfs at /var/lib/tor and chmods to 0700 but never chowns to the tor user.
The image-build script’s chown -R tor:tor /rootfs/var/lib/tor only affects the underlying read-only rootfs directory, which is overlaid by the tmpfs at runtime. When tor drops privileges to the tor user (per User tor in the generated torrc), it cannot write to a root-owned tmpfs. Fix shipped same day: in mountTmpfs(), after mounting tmpfs and before chmod, do user.Lookup("tor") and os.Chown("/var/lib/tor", uid, gid) (using the same pattern already established in writeHSKeyFiles at line 435). Image rebuilt + republished as tor-gw-hardened. Validated in-wild same day on instance wf66-smoke3-110535-tor-gw: tor created its DataDirectory successfully (no permission error), tor-gw-init reached the descriptor-publication step, and printed TOR_GW_ONION_HOSTNAME=3efkxftjm2numzarrt77x677biqmnbosoygeo6tcuxp5dhowwge7i7qd.onion to serial within the deploy-side 240s pollSerialForMarker budget. See workflow::2026-05-01_machine-server-via-tor-deploy-augmentation::retrospective.md “In-wild re-run” section. -
wf-75 DONE 2026-05-05 Downstream per-verb via-tor cycles (claim-reconnect, show-doctor, build, attest); references Decision 3 (always-via-tor claim) and Decision 2 (show bridge verb) from via-tor-unification-design. Partial DONE 2026-05-04 via workflow::2026-05-04_via-tor-cli-parity::retrospective.md: attest/reconnect/doctor per-verb wiring closed for local via-tor hosts; claim/show already shipped in parent cycles. Build half closed 2026-05-05 via wf-80 (workflow::2026-05-04_machine-server-build-via-tor::retrospective.md): machine server build --via-tor now embeds the tor-gw companion image-build pipeline. All per-verb wiring is now complete. -
wf-80 DONE 2026-05-05 machine server build --via-tor: Tamago image-build pipeline embedded into binary alongside the new tor-gw companion image-build pipeline (closes the build/publish half of the via-tor deploy flow). Shipped via workflow::2026-05-04_machine-server-build-via-tor::retrospective.md. Validated end-to-end in-wild: both images published, deploy succeeded, canonical pd-15/pd-16/pd-18 sealed onion (3efkxftj…7i7qd.onion) re-published. -
wf-84 DONE 2026-05-05 Embed tor-gw-image-build.sh + tor-gw-publish-image.sh into portdistrict. New ROADMAP item filed at study time; closed simultaneously with wf-80. The 6-step tor-gw build pipeline (init shim, Alpine rootfs, kernel/initramfs extract, grub EFI, FAT32 disk, tar.gz) now runs from Go via os/exec + embedded shell scripts. Scripts remain in-tree for reference but are no longer required for the build/publish workflow. Shipped via workflow::2026-05-04_machine-server-build-via-tor::retrospective.md. -
wf-81 DONE 2026-05-04 exit connect --via-tor persistent friend tunnel. Implemented in cmd/portdistrict/exit_connect_tor.go: reuses handleBrowser chainer goroutines + grant resolution + health-check backoff from exit try --via-tor; key difference is no browser launch: saves ExitState{Mode: "tor-socks5"}, blocks on SIGINT/SIGTERM, clears state on exit. Outer runExit dispatch pre-scans --via-tor/--from on connect (mirroring the try pre-scan, addresses parser-bug class from completion retrospective). Tests in cmd/portdistrict/exit_connect_tor_test.go. Shipped via workflow::2026-05-04_via-tor-cli-parity::retrospective.md. -
wf-82 DONE 2026-05-04 exit list implementation. Replaces “not yet implemented” stub in cmd/portdistrict/exit.go with transport-agnostic provider table (OPERATOR / TRANSPORT / ONION / EXPIRES) plus --json mode. Reads from internal/exitgrant.Store with cross-reference into internal/peerstore. Tests in cmd/portdistrict/exit_list_test.go. Shipped via workflow::2026-05-04_via-tor-cli-parity::retrospective.md. -
wf-83 DONE 2026-05-04 machine server destroy --via-tor for local hosts (provider == "self", OnionHostname != ""). Disables via-tor config (load → ViaTorEnabled = false → save), sends SIGHUP to the daemon to stop the tor supervisor, removes registry entry. Onion keys retained by default with manual-wipe note. Reuses the disable --via-tor config+SIGHUP pattern from machine_server.go:361-382. Shipped via workflow::2026-05-04_via-tor-cli-parity::retrospective.md. -
wf-85 Default the release pipeline and friend-side build to CGO_ENABLED=0 static binaries. The two-friend GCE test required static binaries for cross-distro compatibility (Friend B on Ubuntu 20.04, glibc 2.31 rejected the dynamically-linked default built on Debian forky, glibc 2.42). (via-tor-socks5-hardening retro) -
wf-86 Cross-binary wiring checklist in Done When criteria: when a feature spans client/server/config with metadata layers in between (invite → operator config → GCE metadata → tor-gw-init), the Done When criteria must explicitly name each connecting wire. “Client side ✓ + server side ✓” can both hold while the wire between them is missing. Two data points from via-tor-access-control (Phase 4 wiring gap, ClientOnionAuthDir gap). (via-tor-access-control)
Code Architecture
-
ca-08 cmd-group-publish: DeviceRecord for group context. Extends the publish pattern to shared infrastructure. Partially delivered by daemon-groups-foundation 2026-04-14: core group device publish implemented at cmd/portdistrictctl/group.go:472 (runGroupDevicePublish), encrypts group DeviceRecord with group_read_key, publishes to DNS via existing Cloudflare path. Still open: CLI ergonomics polish (richer error messages, help text refinement, interactive confirmation prompts, --dry-run flag), edge case handling (missing group_read_key, stale state files, in-flight key rotation), user-facing docs. Consider closing as “delivered” if the polish is out of scope for the current milestone, or keep open and rescope to cover only the polish items. -
ca-09 cmd-peer-revoke: Credential revocation for the trust lifecycle. Complements peer trust/verify. -
ca-12 Add prompt or documentation hint in device init output suggesting users set an endpoint before publishing (cmd-connect) -
ca-13 device-serve: One-time HTTPS server for provision file transfer, URI-based join, and QR code rendering. Deferred from device-provision. -
ca-14 Enable //go:build integration tests for WireGuardManager in CI when root-capable CI runners are available (daemon-minimal) -
ca-18 Trust-gate redesign for sibling devices: runConnect and any other peerstore-based trust gate should support sibling-device provenance natively, either via a self-entry in the peerstore or via a direct check against the local device keyring’s MasterNodeID. The current local patch in cmd/portdistrictctl/main.go is a point fix; the architectural question is whether the peerstore should be the trust oracle for sibling devices at all. (first-run bug hunt) -
ca-19 Add --keyring flag to connect (and other subcommands that read the master keyring under sudo) so sudo HOME=$HOME is not required. Currently only --device-keyring is accepted. Small tactical fix (~5 lines) that unblocks first-run tutorials without requiring environment gymnastics. (first-run bug hunt) -
ca-21 Facilitator client configuration path on portdistrictd: daemon.Config.Facilitators is read in RunChores (daemon.go:120) to spawn per-facilitator Registration tasks, but cmd/portdistrictd/main.go never populates it: there is no --facilitator flag, no device-keyring field, no config file. The --facilitator-listen flag (server side) is implemented but there is no corresponding client-side path, so the daemon cannot register with any facilitator even when one is running. This blocks end-to-end Phase H testing of the daemon-facilitator-serve authorization challenge flow on real hardware. ca-15 (facilitator enable CLI) and ca-16 (facilitator_endpoint persistence in DeviceKeyring) cover the proper long-term fix but neither has been implemented. Tactical fix option: add a --facilitator <endpoint>:<wg-pubkey>:<node-id> flag for testing, clearly marked as a test harness, to be replaced by ca-15/ca-16. Blocks sibling-topology and cross-identity Phase H testing until this or ca-15 lands. (first-run bug hunt, Phase H preparation) -
ca-22 identity import silent stale-pubkey update: the “Peer already known” branch of runIdentityImport (cmd/portdistrictctl/main.go:372-393) updates only LastBundleTimestamp, silently keeping stale IdentityPubkeyB64 and IdentityECDHPubkeyB64 values if the peer rotates their master identity. Re-importing a new bundle with a changed pubkey has no effect; the downstream identity verify then reports a spurious “identity_pubkey mismatch” that is correctly detected but caused by import silently ignoring the incoming pubkey. Proper fix options: (A) fail with a clear error telling the operator to remove the stale entry; (B) compare stored vs incoming pubkey and refuse to update silently; (C) prompt for confirmation. Option B is the smallest change with the most explicit failure mode. Discovered in Session 2 of the first-run bug hunt when an earlier rm -rf ~/.config/portdistrict/ left a stale peer file behind and the re-import didn’t catch the divergence. (first-run bug hunt Session 2) -
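Sketch for ca-22 Option B: compare the stored and incoming pubkeys before touching the entry, and refuse the update when they differ. The knownPeer struct, the int64 timestamp type, and the updateKnownPeer and errPubkeyChanged names are assumptions; only the three field names come from the item above.

```go
package main

import (
	"errors"
	"fmt"
)

// errPubkeyChanged (hypothetical) marks a re-import whose keys differ
// from the stored entry.
var errPubkeyChanged = errors.New("peer identity pubkey changed")

// knownPeer is a reduced, hypothetical view of a peerstore entry.
type knownPeer struct {
	IdentityPubkeyB64     string
	IdentityECDHPubkeyB64 string
	LastBundleTimestamp   int64
}

// updateKnownPeer handles the "Peer already known" branch: it only
// refreshes the timestamp when both pubkeys are unchanged.
func updateKnownPeer(stored *knownPeer, incoming knownPeer) error {
	if stored.IdentityPubkeyB64 != incoming.IdentityPubkeyB64 ||
		stored.IdentityECDHPubkeyB64 != incoming.IdentityECDHPubkeyB64 {
		return fmt.Errorf("%w: remove the stale entry before re-importing", errPubkeyChanged)
	}
	stored.LastBundleTimestamp = incoming.LastBundleTimestamp
	return nil
}

func main() {
	stored := knownPeer{IdentityPubkeyB64: "old", IdentityECDHPubkeyB64: "ecdh", LastBundleTimestamp: 1}
	err := updateKnownPeer(&stored, knownPeer{IdentityPubkeyB64: "new", IdentityECDHPubkeyB64: "ecdh", LastBundleTimestamp: 2})
	fmt.Println(err)
}
```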
ca-26 Spec update: fold Path B identity ECDH derivation (DeriveIdentityECDHKeypair / BLAKE3 domain separation) into identity_and_trust.md section 21 as a normative requirement. Currently the load-bearing privacy invariant (domain separation between Ed25519 and X25519 KDF branches) is implementation-correct but spec-silent. See doc::portdistrict::identity_ecdh_domain_separation.md. (master-device-roster) -
ca-27 Identity bundle carrying device namespaces: extend signalbundle.Encode to optionally include device_namespaces so peers learn devices at import time (convenience, not required; DNS remains authoritative). Deferred from master-device-roster study. (master-device-roster) -
ca-28 Master record expiry-based republish in daemon: extend RecordPublishTask to periodically republish the MasterRecord v2 (not just the DeviceRecord), using the same chore-runner interval. Currently only CLI commands trigger master record publishes. Deferred from master-device-roster study as larger scope. (master-device-roster) -
ca-29 Daemon treats cross-master roster entries as local siblings. device roster add <ns> warns when <ns> does not end with the master namespace but stores the entry in device_namespaces unchanged. On daemon startup, the sibling iteration at code::internal/daemon/daemon.go:88 attempts to resolve every roster entry as a sibling device, producing repeated resolve _portdistrict.<ns>: no such host (or similar) errors in the log for any cross-master or typo’d entry every refresh cycle. Concrete symptom observed during master-device-roster manual validation (2026-04-13): adding stranger.alice.example.com to the roster for testing the cross-master warning path made the daemon log it as sibling stranger.alice.example.com: resolve device stranger.alice.example.com: ...: no such host on every startup and DNS refresh cycle. Design options: (A) strip cross-master entries from sibling iteration at daemon startup (filter roster to entries where strings.HasSuffix(entry, "." + ownMasterNamespace) before iterating); (B) separate roster field for foreign namespaces (device_namespaces []string for local siblings + foreign_namespaces []string for cross-master entries the owner explicitly claims, with different consumer semantics); (C) enforce rejection of cross-master entries at device roster add time (turn the warning into an error unless a --force flag is set); (D) keep current behavior but improve the log line to sibling <ns>: cross-master entry not reachable, skipping so the operator understands the degradation is intentional. Option A is the minimum correct fix and matches the implicit model of the roster as “my local devices”. Option C is the strictest but breaks the current (no-op but warn) UX. Option B is a schema change that would let users legitimately claim namespaces they don’t directly control. Related to but distinct from ca-20 (sibling discovery) and ca-24 (cross-identity discovery): this is about roster hygiene, not discovery mechanism.
Captured in workflow::2026-04-13_master-device-roster::retrospective.md Manual Validation Session § New findings. -
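Sketch for ca-29 Option A: filter the roster to entries under the owner's master namespace before the sibling iteration. The localSiblings name and the bob.example.com namespace are hypothetical; the suffix test is the one quoted in the item.

```go
package main

import (
	"fmt"
	"strings"
)

// localSiblings keeps only the roster entries that sit under
// ownMasterNamespace, dropping cross-master and typo'd entries.
func localSiblings(roster []string, ownMasterNamespace string) []string {
	suffix := "." + ownMasterNamespace
	var out []string
	for _, entry := range roster {
		if strings.HasSuffix(entry, suffix) {
			out = append(out, entry)
		}
	}
	return out
}

func main() {
	roster := []string{"laptop.bob.example.com", "stranger.alice.example.com"}
	fmt.Println(localSiblings(roster, "bob.example.com"))
}
```

The leading dot in the suffix matters: without it, an entry such as evilbob.example.com would pass as a sibling of bob.example.com.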
ca-30 Empty-roster v2 publishers degrade to legacy path in ConnectAllTrusted tier 2. The check at code::internal/daemon/daemon.go:136 is if len(peerRoster) > 0 { ... continue }. resolveDeviceRoster at code::internal/daemon/roster_resolve.go:20 returns nil for v1 records and returns the decoded DeviceNamespaces slice for v2 records. A peer publishing a v2 master record with an empty device_namespaces slice (e.g., they just ran identity publish before adding any devices, or deliberately want to assert “no devices under this identity”) returns a non-nil empty slice from resolveDeviceRoster. The if len(peerRoster) > 0 check is false, the block is skipped, and the daemon falls through to the legacy tier-3 path (d.pm.AddPeer(nodeID)), which hits the ca-24 anti-pattern for any peer whose p.Namespace is a master namespace. A v2 peer explicitly publishing “I have no devices” should be treated as conclusive (“no devices to connect to for this peer, done”), not as a signal to try the known-broken legacy path. Fix: distinguish nil (v1 record or resolution failure, fall back to legacy) from non-nil empty slice (v2 with empty roster, do not fall back). Option (A): check peerRoster != nil at the daemon, treat empty non-nil as “definitively no devices”; Option (B): change resolveDeviceRoster to return an explicit enum or sentinel for “v2 with empty roster” vs “v1 / resolution failure”. Option A is the smaller change. Minor edge case (unlikely to fire in practice because most users add devices before publishing), but it’s a latent logic bug that would confuse debugging of an actually-empty v2 roster. Discovered during manual validation of master-device-roster (2026-04-13). Captured in workflow::2026-04-13_master-device-roster::retrospective.md Manual Validation Session § New findings. -
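Sketch for ca-30 Option A: Go distinguishes a nil slice from a non-nil empty slice, so the daemon can branch on peerRoster != nil before checking the length. The rosterAction enum and classifyRoster name are hypothetical and stand in for the three branches the item describes.

```go
package main

import "fmt"

// rosterAction (hypothetical) names the three outcomes for one peer.
type rosterAction int

const (
	useLegacyPath rosterAction = iota // nil: v1 record or resolution failure
	noDevices                         // non-nil and empty: v2 peer asserts no devices
	connectRoster                     // non-nil and non-empty: connect each device
)

// classifyRoster separates "no roster information" from "roster is
// known to be empty", which len(peerRoster) > 0 alone cannot do.
func classifyRoster(peerRoster []string) rosterAction {
	switch {
	case peerRoster == nil:
		return useLegacyPath
	case len(peerRoster) == 0:
		return noDevices
	default:
		return connectRoster
	}
}

func main() {
	fmt.Println(classifyRoster(nil), classifyRoster([]string{}), classifyRoster([]string{"laptop.example.com"}))
}
```

This only works if the resolver preserves the distinction: a decoder that returns nil for an empty v2 roster would collapse the two cases again, which is the argument for Option B's explicit sentinel.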
ca-33 Audit trust-state filtering consistency across encoder-side key-collection functions: collectPeerECDHPubkeys (TRUSTED only) vs collectECDHPubkeys and runDevicePublish inline loop (VERIFIED+TRUSTED). Pre-existing discrepancy unrelated to Path A fix but worth harmonizing. Source: workflow::2026-04-13_encoder-side-path-a-fallback-removal::retrospective.md -
ca-34 daemon-relay: Relay fallback connectivity for peers that can’t hole-punch via sync windows. Companion to daemon-syncplan. Deferred from daemon-syncplan study scope. -
ca-35 Adaptive backoff on repeated sync window failures (Doc 5 §15 implementation-defined gap): currently a failed window attempt has no memory, so the same peer is re-attempted on the next scheduling cycle. Add exponential backoff per (nodeID, deviceNS) pair. Source: daemon-syncplan deferred scope. -
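Sketch for ca-35: a failure counter keyed by (nodeID, deviceNS) with a doubling delay and a ceiling. The type and method names, the base delay, and the ceiling are all assumptions; Doc 5 §15 leaves the policy implementation-defined, and this version is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"time"
)

// peerKey identifies one (nodeID, deviceNS) pair.
type peerKey struct {
	NodeID   string
	DeviceNS string
}

// windowBackoff (hypothetical) remembers consecutive failures per pair.
type windowBackoff struct {
	base, ceiling time.Duration
	failures      map[peerKey]int
}

func newWindowBackoff(base, ceiling time.Duration) *windowBackoff {
	return &windowBackoff{base: base, ceiling: ceiling, failures: make(map[peerKey]int)}
}

// Fail records a failed window attempt and returns how long to wait
// before retrying: base, 2*base, 4*base, ... capped at ceiling.
func (b *windowBackoff) Fail(k peerKey) time.Duration {
	b.failures[k]++
	d := b.base
	for i := 1; i < b.failures[k] && d < b.ceiling; i++ {
		d *= 2
	}
	if d > b.ceiling {
		d = b.ceiling
	}
	return d
}

// Succeed clears the failure history for the pair.
func (b *windowBackoff) Succeed(k peerKey) { delete(b.failures, k) }

func main() {
	b := newWindowBackoff(30*time.Second, 10*time.Minute)
	k := peerKey{NodeID: "NODE-A", DeviceNS: "laptop.example.com"}
	for i := 0; i < 4; i++ {
		fmt.Println(b.Fail(k))
	}
	b.Succeed(k)
	fmt.Println(b.Fail(k))
}
```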
ca-36 Multi-peer sync window prioritization: v1 uses first-match-wins. Future versions should prioritize by peer trust level, last-seen recency, and window overlap quality. Source: daemon-syncplan deferred scope. -
ca-37 User-configurable sync window schedules via CLI: v1 auto-generates 4 windows at 30-minute spacing. Allow users to specify explicit schedules for deterministic rendezvous timing. Source: daemon-syncplan deferred scope. -
ca-38 Group read key slot ECDH path decision (blocks daemon-groups-management member-removal + read-key-rotation work). Protocol spec §2.5 specifies group read key slots literally as X25519(admin_ephemeral_sk, member_group_pseudonym_pk_as_x25519), i.e. Path A (Ed25519→X25519 conversion). But (1) spec §2.5 itself carries an [ASSUMPTION — NEEDS VERIFICATION] marker on the Ed25519→X25519 procedure at spec line 174, and (2) production code has removed all Path A key-slot callers in favour of Path B (DeriveIdentityECDHKeypair + BLAKE3 KDF → independent X25519 keypair) via ca-25/ca-31/ca-32, documented in doc::portdistrict::identity_ecdh_domain_separation.md as “Path A… is not used for key-slot ECDH in any production caller.” Three candidate resolutions: (A) Adopt Path A for groups only: justify explicitly in identity_ecdh_domain_separation.md that the ca-17 privacy invariant does not apply (the group_pseudonym_pk is published in the group record, so there is no hidden X25519 pub to protect), reintroduce a single Path A key-slot caller, amend the spec divergence doc. Smallest code, biggest doc surface. (B) Define DeriveGroupPseudonymECDHKeypair(masterSeed, groupPubkey) via blake3.NewDeriveKey("portdistrict-group-pseudonym-ecdh-v1"), publish each member’s group_pseudonym_ecdh_pub alongside the Ed25519 group_pseudonym_pk (either in the group record member entry or in the group_accept tunnel message), update spec §2.5 and §9.1 with the new field. Consistent with production convention; largest code + spec surface. (C) Keep foundation’s tunnel-delivered key permanently and drop read_key_slots from the spec entirely. This requires solving read-key rotation without slots (re-delivery via tunnel to every active member on every rotation), which is viable for small groups but scales poorly. Out of scope for v1. Resolution requires a focused study (~4-8h) producing a definitive call graph and a spec amendment before the first slot-construction commit.
Source: workflow::2026-04-14_daemon-groups-foundation::draft.md §14 S2, §15 F13, D6. (daemon-groups-foundation) -
ca-49 Endpoint detection ergonomics: automatic discovery of the daemon’s own public endpoint so operators don’t need to manually run portdistrictctl device update --endpoint <ip:port> or configure --ip-echo-url before publishing a device record. Observed concretely 2026-04-16 during ca-45 in-wild validation: mac (EC2 1:1 NAT, public v4 3.21.98.174) and rentamac (home/office NAT, public v4 81.4.164.198) both had endpoints: [] in their device records because (1) DefaultInterfaceScanner at internal/endpointdetect/detect.go:20 only scans for global unicast IPv6 on local interfaces, with no IPv4 path, (2) no default --ip-echo-url is configured, (3) the device keyring was never populated with an endpoint via device update, (4) the daemon was launched with --endpoint-detect 0 which disables auto-detection entirely, and (5) device publish produces no warning when publishing a record with zero endpoints. This is a silent failure mode: daemons run, publish records, appear to work, but peers cannot dial them. It caused a wrong diagnosis in the 2026-04-16 three-node test (“both hosts are NAT’d, mesh impossible”) when the actual fix was a one-line device update --endpoint on each host. See Finding correction in the 2026-04-16 in-wild validation session for the grounded reproduction. Design gap enumerated: (A) IPv4 interface scan in DefaultInterfaceScanner: symmetric to the existing v6 path, filter !IsLoopback && !IsPrivate && ip.To4() != nil. Works for bare-metal / VPS / self-hosted with a public v4 NIC. Doesn’t work for 1:1 NAT (EC2, GCE, K8s pods). (B) Cloud metadata service fallback: detect running on AWS/GCE/Azure via http://169.254.169.254/latest/meta-data/public-ipv4 (AWS) or equivalent and use the response. Per-cloud detection code, more reliable than HTTP echo for cloud, no third-party call. (C) Trusted-peer endpoint echo protocol (see ca-50): preferred, uses portdistrict’s own trust domain instead of external services.
(D) device publish emits a loud warning (or error with --allow-empty-endpoints escape hatch) when a record has zero endpoints, pointing at device update --endpoint + --ip-echo-url + ca-50 as remediation. Smallest code change, highest operator-facing value per LOC; complements A/B/C rather than replacing them. (E) Cascading fallback: try A (local v4 scan) → B (cloud metadata, short timeout) → C (trusted-peer echo) → warn per D if all fail. Recommended combo: A + B + C + D, skip HTTP echo defaults (privacy leak, external dependency), skip STUN/UPnP/NAT-PMP (explicitly out of scope per doc::portdistrict::wireguard_integration.md §1044). Acceptance: a daemon started on any of {bare-metal with public v4, AWS EC2, GCE VM, rentamac-class home NAT with at least one trusted public peer} should auto-populate DeviceKeyring.Endpoints at startup without operator intervention OR emit a prominent warning that names the specific remediation. Ergonomic impact: eliminates the 2026-04-16 morning confusion where the user had to manually invoke device update --endpoint on two hosts before the test could proceed. Supersedes ca-12 (shallow “device init prompt” item; fold into this). Related to ca-50 (the specific protocol design for option C). Full grounded evidence in 2026-04-16 in-wild validation session retrospective. (ca-45 in-wild validation 2026-04-16) -
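Sketch for ca-49 option A: an IPv4 filter over local interface addresses using the standard net package. The function names are hypothetical and the real DefaultInterfaceScanner signature is not shown in this roadmap. The filter uses IsGlobalUnicast plus !IsPrivate, which implies the !IsLoopback && !IsPrivate && To4() != nil condition from the item and additionally drops link-local, multicast, and unspecified addresses (that extra strictness is an assumption).

```go
package main

import (
	"fmt"
	"net"
)

// isPublicIPv4 reports whether ip is an IPv4 address that could be
// dialed from the public internet.
func isPublicIPv4(ip net.IP) bool {
	v4 := ip.To4()
	return v4 != nil && v4.IsGlobalUnicast() && !v4.IsPrivate()
}

// publicV4 is a string convenience wrapper; unparsable input is false.
func publicV4(s string) bool { return isPublicIPv4(net.ParseIP(s)) }

// scanPublicIPv4 walks the addresses of all local interfaces and
// returns the ones that pass the filter.
func scanPublicIPv4() ([]net.IP, error) {
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		return nil, err
	}
	var out []net.IP
	for _, a := range addrs {
		if ipn, ok := a.(*net.IPNet); ok && isPublicIPv4(ipn.IP) {
			out = append(out, ipn.IP)
		}
	}
	return out, nil
}

func main() {
	ips, err := scanPublicIPv4()
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	// Empty on 1:1 NAT hosts (EC2, GCE), which is where options B/C apply.
	fmt.Println("public IPv4 candidates:", ips)
}
```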
ca-47 DNSRefreshTask fallback to AddPeerDevice(p.Namespace) when roster resolution fails transiently. Observed 2026-04-16 on remote_access startup log at 21:52:15: dns-refresh: adding peer RUPKZ7C7YXFARBLCQYUJVEUFXKTKAXUT (rentamac.portdistrict.net) followed by add peer RUPKZ7C7YXFARBLCQYUJVEUFXKTKAXUT (rentamac.portdistrict.net): add peer device RUPKZ7C7YXFARBLCQYUJVEUFXKTKAXUT (rentamac.portdistrict.net): dnsdisc: expected record_type "device", got "master". The parenthesized namespace is the peer’s master namespace (rentamac.portdistrict.net), which is what identity import writes to peerstore.Namespace. Fix ① (95fec6c) rewrote wgmanager.resolvePeerConfig (the AddPeer(nodeID) path) to use master-roster lookup, but the DNSRefreshTask path in internal/daemon/dns_refresh_task.go has its own fallback that calls pm.AddPeerDevice(nodeID, p.Namespace) when roster resolution fails transiently (stale DNS cache, stale key slots, upstream decrypt error). AddPeerDevice expects a device namespace and tries to decode _portdistrict.<master-ns> as a device record, which fails the strict record_type check. This is the DNSRefreshTask analog of ca-20/ca-24: same root cause (peerstore.Namespace is master), different call site, not covered by 95fec6c. Every 10s refresh tick fires the error until the cache refreshes and roster resolution succeeds. Self-heals eventually but generates misleading log noise and wastes CPU on retried decode failures. Related to but distinct from ca-30 (which is about ConnectAllTrusted’s empty-roster fallback, already obsoleted by fix ①’s error-message change).
Candidate fixes: (A) dns_refresh_task does not fall back at all: on roster resolution failure, log a clear “transient roster resolve failure, retrying next tick” message and skip the peer; (B) route the fallback through wgmanager.AddPeer(nodeID) (which now uses fix ①’s master-roster lookup) instead of AddPeerDevice(nodeID, p.Namespace), getting the benefit of the already-landed consumer-side fix; (C) heuristic detection of whether p.Namespace looks like a device or master namespace (fuzzy, not recommended). Option B is the smallest change with the most benefit because the error message from fix ① is already operator-actionable. Acceptance: startup log under a transient DNS cache failure shows “transient roster resolve” or fix ①’s “master roster at <fqdn> is empty or unreachable” error, not the obsolete “expected record_type "device", got "master"” error. Triage-level fix (~10 LOC in one function). Full evidence in report::2026-04-16_three-node-test-aborted_findings.md Grounded Observation Capture section. Finding ① from session follow-up discussion. (three-node in-wild test 2026-04-16) -
ca-46 scripts/mac-daemon.sh removes sudo -v call: the pre-auth credential refresh at line 102 of the script shipped in f7e4c07 demands a tty even when the sudoers rule is NOPASSWD, failing every non-interactive invocation over ssh with sudo: a terminal is required to read the password. Caught during the 2026-04-16 three-node test attempt when the script refused to launch on both mac (NOPASSWD) and rentamac. One-line fix: remove sudo -v (line 102), or alternatively gate it on [ -t 0 ] (stdin is a tty) to preserve the interactive ad-hoc use case. The comment at lines 99-101 explains the original intent (pre-auth credential caching for interactive operators) but that use case is marginal compared to the automation use case that the script exists to support. Triage-level fix; 5-minute follow-up. Full finding in report::2026-04-16_three-node-test-aborted_findings.md Finding B. -
ca-53 peer demote / peer forget CLI commands for trust-state management (cli_semantics_design §2.4). Complements existing peer trust + deferred peer revoke (ca-09). Lets users downgrade TRUSTED→VERIFIED or remove peers entirely from the local peerstore. -
ca-54 group member decline <group-id> and group member leave <name>: group lifecycle commands (cli_semantics_design §2.5.2). Current impl has group member accept/publish only. -
ca-55 group network show <name> and group network ping <member>: group overlay inspection + connectivity helper (cli_semantics_design §2.5.3). Currently the network subtree is a stub. -
ca-56 tunnel list, tunnel show <peer-ns>, tunnel disconnect <peer-ns>: tunnel management wrappers (cli_semantics_design §2.6). Currently only tunnel connect exists. These also unblock the deferred status sections for tunnels. -
ca-57 identity rotate: master-key rotation command (cli_semantics_design §2.2 / §8 deferred). Re-publishes master record with new seed; requires peers to re-verify. Touches keystore, peerstore, and master-record envelope formats. -
ca-58normalize-base64-encoding: Storage encoding inconsistency (base64.StdEncoding vs base64.RawURLEncoding) across keystore/devicestore. Requires schema migration. (refactor-cmd-portdistrict deferred scope) -
ca-59secure-passphrase-input: Replace readPassphrase (fixed 1024-byte buffer with single Read) with term.ReadPassword + TTY detection. Behavioral change needing integration test for piped passphrases. (refactor-cmd-portdistrict deferred scope) -
ca-60peer-key-load-diagnostics: loadVerifiedPeerKeys silently skips all errors (bad JSON, bad base64). Should warn on non-IsNotExist failures so corrupt peer files are not invisibly dropped. (refactor-cmd-portdistrict deferred scope) -
ca-62peer showshould display derived overlay prefix and per-device addresses (UX followup from peer-overlay-derivation Finding 3) -
ca-63tunnel showshould display peer overlay AllowedIPs in derived/labeled form rather than raw WG state (peer-overlay-derivation Finding 3 followup) -
ca-64DNS cache propagation helper (wait-for-resolver-fresh) for test harness — poll loop replacing manual sleep-and-retry ininit-3node.sh(pre-existing Finding 4 from peer-overlay-derivation in-wild) -
ca-65Two-phase daemon teardown (SIGTERM → sleep → SIGKILL) intest-3node.sh— propagate pattern frommac-daemon.shto avoid orphan daemons (pre-existing Finding 6 from peer-overlay-derivation in-wild) -
ca-66Group member custom pseudonyms / display labels — inside a group, members are currently identified only by cryptographically-derived blake3 node IDs with no user-facing label. Evaluate three shapes: local-only aliases (git-config-style, zero protocol change), self-signed announcement in the group record (visible within group, cross-group correlation risk), or sealed-to-admin announcement (admin as naming authority). Draft at workflow::2026-04-18_group-member-custom-pseudonyms::draft.md. Raised during peer-overlay-derivation tutorial refinement. -
ca-67Investigate intermittent group device decrypt failure:chacha20poly1305: message authentication failedduring group-refresh may indicate a race between republish and resolver caching. -
ca-70End-to-end system-mode fixture with real peer-trust bootstrap: the currenttest/fixtures/system-mode/fixture validates container plumbing only (TUN creation, NET_ADMIN cap,--kill-switchflag parsing, binary accessibility). The actualexit_request/exit_ackhandshake is not exercised because no peer trust is provisioned between provider and consumer containers. Extend the fixture (or adaptscripts/test-3node-exit.sh) to keygen both containers, publish DeviceRecords, mutualpeer add+peer verify, provider runsportdistrictdwith exit-node enabled, consumer runsportdistrict use <provider> --mode system, verify real default route through provider +curl ifconfig.mereturns provider IP +portdistrict use offrestores routing. Estimated ~100–200 lines of shell + Dockerfile extension. Upstream lesson from test-3node-exit.sh is directly applicable. (exit-node-system-mode 2026-04-19) -
ca-72daemon-query-ipc: Unix domain socket IPC for daemon live state queries —machine server showcurrently reads disk only (deferred-data footer). Live consumer data (active connections, tx/rx, last-handshake) requires daemon IPC. Deferred from exit-node-cli-alignment study scope. -
ca-73exit listimplementation: Theexit listverb is stubbed (prints “not yet implemented”). Implementing candidate enumeration (listing available exit nodes) requires a discovery mechanism — DNS-based listing, facilitator query, or peer-list scan. Filed from exit-node-cli-alignment deferred scope. -
ca-77Add integration test forgvnic.BringUp()that creates a stack and verifies UDP connectivity via loopback — catches thedefaultTransportProtocolsstub bug class -
ca-78Investigate MSI-X on AMD EPYC for TamaGo interrupt-driven drivers — polling wastes CPU core, resolution would benefit all interrupt-driven drivers and potentially closemeta-04 -
ca-80TamaGo interrupt-driven I/O (LAPIC + MSI-X) follow-on. Polling-only path in the current gvnic driver wastes a CPU core and busy-loops in the RX poll. Draft exists at workflow::2026-04-21_tamago-x86-interrupts::draft.md. Blocks: nothing critical (driver works polling-only) but unlocks efficient multi-queue operation. Related to meta-04 (MSI-X boot-crash howto). -
ca-79-BConsumer-sidepeer adopt --pubkey <key> --endpoint <ip:port>to synthesize a local DeviceRecord from out-of-band credentials (e.g., a TamaGo exit-node pubkey copy-pasted from serial console). Replaces theexit connect --direct/--pubkeyoperator escape hatch (ca-79) with a path that fits the existing trust model: afterpeer adopt, the standardexit connect <peer-name>flow works through the normalresolveProvider/resolveOverlayAddr/resolveProviderWGPubkeycalls. CLI design pinned in the ca-79 study’s “Design Question” section: keeppeer adoptandpeer addas separate verbs sharing aregisterPeer(record, source PeerSource)function ininternal/peerstore— the verb separation keeps the trust-model exception visible (source: bundle | adoptedinpeer list); collapsing intopeer add --pubkey ...was rejected because it would erode the cryptographic-bundle invariant. When this lands, remove--direct/--pubkeyper theTODO(ca-79-B)sunset comment incmd/portdistrict/exit.go. -
ca-81 Health-gate retry with exponential backoff in proxyhealth.Check or runExitTry: 3 attempts at 500ms / 1s / 2s. Planned in the exit-try-onramp proposal but not implemented. (exit-try-onramp) -
ca-82 ExitSender.Receive goroutine leak on timeout: add context.Context to the ExitSender interface so the Receive goroutine can be cancelled when the timeout fires. Pre-existing debt shared by runExitConnect and runExitTry. (exit-try-onramp) -
ca-83 Consolidate keyring loading into a loadDefaultKeyring() (*keystore.Keyring, error) helper: four new commands in trusted-tamago-node violated the config.KeyringPath("") → keystore.Load(path) two-step XDG resolution pattern. A shared helper centralises the pattern and makes misuse a compile-time issue rather than a runtime failure. (trusted-tamago-node) -
ca-84 Store both control port and data port explicitly in the remote node registry: the invite command had to derive controlPort = dataPort - 1 from the registry’s single stored port, an implicit TamaGo convention that caused the invite-command port bug. Storing both ports (or a typed ControlAddr/DataAddr pair) makes the convention explicit and prevents the same class of bug in future consumers of the registry. (trusted-tamago-node) -
ca-85Document production interrupt integration path — Once Stream A hardware validation succeeds, define where interrupt ordering fix applies to production binary (cmd/portdistrict-exitnode-tamago/main.go currently has no LAPIC usage) -
ca-86tamago-sev-snp-vc-handler: investigate whether TamaGo’s genericsetIDT()clobbers the EFI-installed SEV-SNP #VC handler (vector 29) and implement preservation. Symptom (2026-04-23 hardware validation): gvnic-spike reboot-loops on GCE N2D SEV-SNP (AMD EPYC Milan) — 162 boot cycles in 90s, halts atyielding to scheduler to ensure IDT installed before enabling LAPIC…beforescheduler yield completeprints; never reachesLAPIC.Enable. Ordering fix from 2026-04-22 retro resolves Intel+OVMF triple-fault but NOT AMD+SEV-SNP hang. Working theory:ServiceInterruptsgoroutine runs during the 100ms yield,setIDT()overwrites the EFI-installed #VC entry, subsequent RMP-check page touch raises unhandled #VC → triple-fault → SEV-SNP auto-reset. Spike approach: log IDT vector 29 before/aftersetIDTruns; confirm clobber; implement re-install or generic IDT SEV-SNP awareness. Success criterion: spike completesLAPIC.Enable+ 20 heartbeats on GCE N2D SEV-SNP. Blocks: AMD EPYC interrupt-mode bring-up (ca-80 on that platform). See workflow::2026-04-23_tamago-x86-interrupts_hardware-validation::late-retrospective.md. -
ca-87 Stronger --via-tor health-gate: evaluate whether CheckWithOptions(AllowSameIP: true) should be replaced with attestation-based proof-of-tunneling. Closed by ca-93 2026-04-27: evaluation complete; CheckWithOptions retained as connectivity check, and attestation proof comes from the Noise NK handshake (option (b) from ca-93). See proposal::via-tor-attestation-health-gate.md. (exit-try-via-tor) -
ca-92Attestation-bind Tamago’s Noise static key. DONE 2026-04-27 via option (a): GHCBDeriveKey(MSG_KEY_REQ) withKeySelect=VCEKandGuestFieldSelect=Measurement|GuestPolicyproduces a 32-byte PSP-mediated secret bound to chip + image; BLAKE3-KDF (contextportdistrict-tamago-noisewrap-v1) + X25519 clamping → static keypair. Cross-instance determinism validated end-to-end on GCE N2D Confidential VM:TAMAGO_NOISE_PUBKEY=1c45alg7aQwwNZCU-M59ofTVdaKU9KD7fOVT8O1EbCkbyte-identical across destroy + fresh deploy. No protocol changes —internal/noisewrap,internal/exitgrant, client-side code untouched. Bonus: “early emit” pattern added so operators can read the pubkey at derivation time, before bootstrap claim completes. See workflow::2026-04-27_tamago-noise-key-attestation-binding::retrospective.md. Unblocks ca-93 (~50 LOC client-side comparison at noisewrap.Dial). Option (b) (in-band attestation report transport) deferred for untrusted-operator threat models. -
ca-93 Strengthen --via-tor health-gate beyond “tunneled HTTPS request succeeded”. DONE 2026-04-27 via option (b): Noise NK handshake IS proof of tunneling. When tamagoKey != nil, the health gate now reports Attestation: Noise handshake authenticated (key from <source>) and, when the grant carries ExpectedMeasurement, a second line displays the measurement hex as defense-in-depth confirmation. CheckWithOptions retained as connectivity check; attestation proof comes from the Noise handshake (ca-92 key binding). ~30 LOC production + ~80 LOC tests (3 new tests covering grant+measurement, --tamago-key flag, and no-key paths). Also closes ca-87. See proposal::via-tor-attestation-health-gate.md. Related: ca-91, ca-92. -
ca-95Investigate Noise NK handshake auth failure with ca-92-derived keys. DONE 2026-04-28: misdiagnosed at filing — was actually three separate issues compounding. (a) Pre-clamping interaction with flynn/noise — falsified byTestDerivedKey_NK_Handshakeunit test (passes). (b)curve25519.X25519vs flynn/noise pubkey divergence — falsified byTestDerivedKey_PubkeyConsistency(byte-identical). (c) TamaGo runtime crypto difference — falsified by on-device fixed-input diagnostic (DIAG_FIXED_PUB_HEXmatched host Go byte-for-byte). The actual fix: base64 encoding inconsistency (socks5_noise.go:67was StdEncoding,main.go:103was RawURLEncoding); both now RawURLEncoding. The proximate causes of the original test failure were tor-gw cold-start (pd-19 fix) + operator key-paste error in earlier session retests. 🎯 First end-to-end real-traffic demonstration of the SEV-SNP chain: with pd-19’s truthful gate + ca-95’s encoding fix + correct-chip pubkey paste,portdistrict exit try --via-torreached Tamago’s GCE external IP (tunneled IP 35.198.72.12 ≠ direct IP 46.188.164.184 — OK),Attestation: Noise handshake authenticatedline fired, real HTTPS request to ifconfig.me round-tripped throughlaptop tor → onion → tor-gw → Tamago WG plane → public internet. The 5-link trust chain (pd-10 + pd-17 + pd-18 + ca-92 + ca-93 + pd-19 + ca-95) is operationally complete. New unit tests ininternal/noisewrap/noisewrap_test.goare durable regression coverage. See workflow::2026-04-27_noisewrap-derived-key-nk-handshake-bug::retrospective.md. -
ca-97Bootstrap protocol integration test covering both claim and reconnect paths — net.Pipe-based test exercising full frame sequence (handshake → attestation → cert chain → mode message) to catch frame-sequence mismatches automatically. (tamago-attestation-client-verify) -
ca-98Non-confidential bootstrap protocol: implement simplified Noise → owner-key proof → WireGuard handoff flow (without attestation exchange) for--on hetzner/--on selfclaim path (machine-server-provisioning-cli) -
ca-99machine server buildverb: implement Phase 2 once offline-measurement-prediction proposal lands — Tamago build pipeline in pure Go via bitfield/script, GCE image publish via Go SDK (machine-server-provisioning-cli) -
ca-100--on selfprovider path: implement doctor/deploy/claim for bare-metal/VPS servers — check port bindability, SSH reachability, distro detection instead of ADC (machine-server-provisioning-cli) -
ca-101Remote capability toggling: implementenable --label/disable --labelfor non-local servers via daemon-to-daemon control transport or SSH-based config push (machine-server-provisioning-cli) -
ca-102MakerunMachineServerRemoteEnablev2-aware (preserve extended NodeEntry fields across Remove+Add) to retire the fragile save-restore bridge inclaim. Amortize the tamago E2E rerun cost the next time it triggers for another reason. See howto::registry_save_restore_bridge.md for the bridge being replaced. From workflow::2026-04-29_machine-server-provisioning-cli::retrospective.md Recommendation #3. -
ca-103Tamago port ofinternal/facilitatorplane/— compile the facilitator plane into the Tamago bare-metal exit-node binary so a single CVM can serve both--exit-nodeand--facilitatorcapabilities under the same pd-18 attestation, with no second VM. Why this is a separate item from wf-67: wf-67’s non-confidential colocated path (portdistrictd --exit-node --facilitatoron regular Linux) is essentially free — the daemon already does this on the local-host. The confidential colocated path is gated on this port: todaycmd/portdistrict-exitnode-tamagodoes not importinternal/facilitatorplane/and the Tamago build set is single-purpose. wf-67 explicitly declares colocation as the default; ca-103 is what makes the default work for--confidentialdeploys. Spike-before-propose: confirm the Tamago net stack supplies what the facilitator needs — TCP listener (the via-tor + WG planes already use Tamago net, so likely yes), accept loop with concurrent connections (no goroutine constraints in current Tamago), in-memory registration store keyed by node-id (trivial — sync.Map or similar), no glibc/cgo dependencies infacilitatorplane(auditinternal/facilitatorplane/for any linux-only sys imports). If any gap surfaces, document and route around — the registration store is the only piece with substantive state, and it already has no persistence requirements (rebuilt on connect). Out of scope: the registration store does not need to be measurement-attested separately — the Tamago binary’s pd-18 measurement covers it implicitly because it’s compiled into the same image; the trust story for via-facilitator connections is “the facilitator runs in the measured exit-node CVM” rather than a new attestation channel. 
Implementation sketch: (a) addfacilitatorto the build tags Tamago accepts incmd/portdistrict-exitnode-tamago/main.go(or always-on if the size cost is small); (b) wirefacilitatorplane.New(...)into the existing run loop alongside the WG / Noise planes; (c) readfacilitator.json-equivalent config from instance metadata at boot (mirror the existing exit-node metadata path); (d) emitTAMAGO_FACILITATOR_LISTENING=:7777serial marker formachine server doctor/showto scrape. Dependency direction: ca-103 unblocks the--confidential --with-facilitatorpath of wf-67. Until ca-103 lands, wf-67’s confidential colocated path either errors out pointing at this item, or auto-implies--facilitator-vm(deploy a second regular Linux VM withportdistrictd --facilitator). The non-confidential colocated path of wf-67 is independent and ships without ca-103. Code touch points:cmd/portdistrict-exitnode-tamago/main.go(run-loop wire-up),internal/facilitatorplane/*(audit for non-Tamago deps),cmd/portdistrict/machine_server_build.go(include facilitator in the build tags), serial-marker glue. Origin: 2026-05-01 conversation about colocating facilitator on the same CVM as the exit-node, after wf-67 was reframed from “second VM by default” to “colocate by default”. Filed because the confidential colocated path is the substantive engineering cost wf-67 inherits from the Tamago single-purpose-binary constraint and warrants its own roadmap entry with its own spike + study + propose cycle. -
ca-106 DONE 2026-05-05 Fixed compileTamago relative-path mismatch: --output-dir now resolved to absolute via filepath.Abs(args[i]) at flag-parse time (cmd/portdistrict/machine_server_build.go). Eliminates the ELF-not-found error when cwd != sourceDir. Surfaced by machine-server-build-via-tor Scope B in-wild verification; fixed inline in the same cycle commit 08fe547. (machine-server-build-via-tor) -
ca-107 DONE 2026-05-05 Extended machine_server_doctor’s checkToolOnPath to fall back to /usr/sbin and /sbin when exec.LookPath misses, with an explicit hint when the tool is found there. Added an ensureSbinInPath() helper called at runMachineServerBuild entry to augment the process PATH so exec.Command lookups in the build path also succeed. Affects all doctor users. Shipped inline in the same cycle commit 08fe547. (machine-server-build-via-tor) -
ca-108TamagoLoopbackAllowlistconfig plumbing — currently uses a hardcoded default["127.0.0.1:8888"](no config file in Tamago). If future Tamago services need additional loopback destinations (e.g. ca-103 facilitator port, multi-service CVMs), wire the allowlist through Tamago’s instance metadata or config-emission path. Low priority — single destination suffices today. (via-tor-socks5-hardening retro) -
ca-112 tor-gw metadata-poll for ca-89 opt-in mode: if operator demand materializes for --tor-client-auth, tor-gw needs the same metadata-poll pattern to read authorized_clients from GCE metadata and write .auth files + SIGHUP Tor. SIGHUP-Tor pattern validated by wf-73. (via-tor-trusted-peers-tamago retro) -
ca-113 GCE metadata wait-for-change long-poll optimization: replace the 60s fixed-interval poll with ?wait_for_change=true&last_etag= for event-driven trusted_peers updates. Requires raising the HTTP client timeout (currently 8s in bootstrap.go:43), because a long-poll request stays open until the value changes. (via-tor-trusted-peers-tamago retro) -
ca-115BLOCKS cycle-4 merge. Fresh per-connection attestation (ca-90) is the cumulative-load choke point under cycle-4. Serial logssocks5-noise: fresh attestation failed (falling back to cached): fresh attestation report: VMGEXIT 0x80000012: info1=0x0 info2=0x200000000 (err_code=0x2 detail=0x0 rbx_out=4)on most IK connections. Diagnosis empirically confirmed 2026-05-08 (Tier 2 in-wild diagnostic): a patched build that force-skips the fresh-attestation GHCB call (always uses cached report) handles 8 concurrent SOCKS5 curl bursts cleanly (8/8 success, single Tor warmup retry on first health check, no cumulative degradation, response body confirms traffic correctly routed Tor → onion → IK → Tamago). With fresh attestation re-enabled, the same workload saturates Tamago and the chainer’s health check fails. Concrete user-facing impact: friend-opens-browser is broken in practice — a typical web page loads 30–80 resources concurrently → 30–80 IK handshakes → queue saturates Tamago well before the page finishes → chainer reports tunnel sick → browser hangs. Single-connection workloads (one curl) still work because the queue never grows. Surprising finding: cycle-4 did NOT modifyfreshAttestationReport,GetExtendedAttestationReport, or any GHCB code — those are byte-identical to cycle-3.5. The regression is from indirect runtime effects of cycle-4 (new poll goroutine, new TrustedPeers map, IK accept loop’s faster acceptance, or heap layout shift) on the SAME GHCB call that worked in cycle-3.5. Investigation directions (Tier 3): (a)info2=0x200000000(= 2³³) suggests a misaligned guest physical address or a high-bit pollution in the response page GPA; (b) check if the new metadata-poll goroutine’s HTTP allocations affect the GHCB request page’s memory locality; (c) confirm whether cycle-3.5 in-wild ever exercised concurrent fresh-attestation under similar load (the regression may be pre-existing but masked by cycle-3.5’s slower NK accept loop). 
Available workaround if fix is hard: revert ca-90 freshness — unsatisfactory because it loses replay-resistance defense. Available adjacent option: rate-limit IK accepts upstream of attestation (effectively serializing GHCB calls), but that throttles legitimate friends. Surfaced + diagnosed 2026-05-08 during cycle-4 in-wild test. (via-tor-trusted-peers-tamago in-wild) -
cq-17Loopback allowlist deny path returns EOF instead of SOCKS5 RuleFailure under cycle-4 IK channel:TestInWildLoopbackAllowlistDenyexpectedsocks connect ... not allowed by ruleset(RepRuleFailure 0x02) but got bareEOF. The deny path still rejects (bootstrap allow case still differentiates), but the response code surface changed from a typed SOCKS5 reply to a connection close. Likely a side effect of the cycle-4 accept-loop refactor that inserted the trusted-peers gate before the inner SOCKS5 server. Operations clarity regression — operator probing a denied target now gets ambiguous EOF instead of “ruleset” diagnostic. Investigation: compare inner SOCKS5 server’s denial path behavior under NK (cycle-3) vs IK (cycle-4) accept loops. (via-tor-trusted-peers-tamago in-wild 2026-05-08)
Code Quality & Markers
-
cq-02Phase 0 spike files should use//go:build spikebuild tags or separate_testpackages to prevent name collisions with production tests. Update spike file template/guidance. (daemon-groups-foundation) — Note: originally filed ascq-01by the Reflect subagent but renamed tocq-02to avoid ID collision with the pre-existingcq-01in CHANGELOG.md (cmd-identity-verify era, closed). -
cq-03Implementation subagents should rungofmt+goimports(orgodoctor smart_edit) before declaring phase complete. Cycle 1 of daemon-groups-foundation shipped a misordered import ininternal/daemon/daemon.go(groupsinserted betweenendpointdetectandfacilitatorplaneinstead of afterfacilitatorplane); caught by human post-cycle-2 manual sed, not by the subagent’s self-review. Self-review gate should include a mechanical formatting check. (daemon-groups-foundation) -
cq-04 Downgrade first-contact resolve log spam to Debug level: recordcrypt: no matching key slot and expected record_type "device", got "master" fire repeatedly during normal operation. They should be Debug-level, not Info/Warn. -
cq-05Add lifecycle/Teardown regression test for the kernel TUN path (not just netstack). The double-close bug in Teardown was latent because kernel TUN silently tolerates double-close. A privileged test exercising CreateInterface → Teardown → verify no panic would have caught this earlier. (wgplane-native-platform-stubs) -
cq-07Implement full queue reset in stall watchdog — Current implementation logs warning and resets lastActivity but does not rebuild queues; add DESTROY_TX_QUEUE + DESTROY_RX_QUEUE + CREATE_TX_QUEUE + CREATE_RX_QUEUE sequence for true recovery -
cq-08Extend fault-injection test coverage — Add status override tests (AdminQ command failures), RX descriptor corruption tests (seq jumps, oversized lengths), queue exhaustion tests (TX ring full, RX ring empty) -
cq-09Add coverage threshold gate to test suite — 46.0% coverage represents progress but is still below robustness threshold; consider adding CI gate at 60% or 70% to prevent backsliding -
cq-10 Unsilence txNotifier.WriteNotify drop path in third_party/gvnic/net.go:162-167. The if len(dstMAC) == 0 { drop } branch is a silent drop with just pkt.DecRef() and continue. Add a rate-limited log.Printf + a counter exposed via stats. Source: workflow::2026-04-23_gvnic-tcp-destination-selective-pcap::retrospective.md (the counter would have ruled out this branch in 1 probe cycle during ca-88 had it existed). -
cq-11CI reproducibility gate — add a CI step that runsscripts/tamago-verify-reproducible.shto prevent build-determinism regressions. Requires TamaGo toolchain availability in CI. Deferred from tamago-reproducible-measurement study. (tamago-reproducible-measurement retro §Recommendations 3) -
cq-12RefactorcliSimcert-chain read placement to mirror production’s per-branch structure (move from common section intorunClaim/runReconnect), so falsification surfaces the original BUG’s deadlock signature instead of verify-step error -
cq-13Registry v2 metadata loss hardening: makerunMachineServerRemoteEnablev2-aware (preserve additional fields) when tamago E2E rerun is next triggered for other reasons (machine-server-provisioning-cli) -
cq-14Context timeout audit on existing deploy paths: review other multi-step functions incmd/portdistrict/for similar context timeout budget mismatches, especially any that may be augmented in future (e.g., wf-67--with-facilitator). From workflow::2026-05-01_machine-server-via-tor-deploy-augmentation::retrospective.md Recommendation #1. -
cq-15 Programmatic test data length assertion: for fixed-length cryptographic/protocol test constants (onion addresses, base32/base64 keys), add len() assertions at the test constant declaration site rather than relying on visual inspection. From workflow::2026-05-01_machine-server-via-tor-deploy-augmentation::retrospective.md Recommendation #2. -
cq-16Surface the remaining via-tor failure mode: distinct error string for “ca-89 enabled but auth-file missing”. The other 3 named strings (“client_static not in trusted_peers”, “token expired”, “token signature invalid”) shipped with ca-109 in the cycle-4 accept loop. The 4th requires the ca-89 ON path, which lands alongside ca-112. (via-tor-access-control + via-tor-trusted-peers-tamago retros)
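Sketch for cq-15: one possible shape for a length assertion that sits at the declaration site. mustLen is a hypothetical helper and the two values are generated placeholders, not real addresses or keys. The expected lengths are standard: a v3 onion address is 56 base32 characters plus ".onion", and 32 bytes in unpadded base64 is 43 characters.

```go
package main

import (
	"fmt"
	"strings"
)

// mustLen returns s unchanged when it has exactly want bytes and panics
// otherwise, so a mistyped constant fails at package init rather than deep
// inside a test. Hypothetical helper.
func mustLen(s string, want int) string {
	if len(s) != want {
		panic(fmt.Sprintf("test constant %q: length %d, want %d", s, len(s), want))
	}
	return s
}

// Placeholder values built with strings.Repeat so the sketch cannot itself
// contain a miscounted literal; real tests would wrap the literal directly.
var (
	testOnionAddr = mustLen(strings.Repeat("a", 56)+".onion", 62)
	testNoisePub  = mustLen(strings.Repeat("A", 43), 43)
)

func main() {
	fmt.Println(len(testOnionAddr), len(testNoisePub))
}
```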
Cross-Platform Porting
-
cp-06 Add GOOS=windows / GOOS=darwin smoke-build matrix to CI (blocked on CI existing at all: there’s no .github/workflows/, no Makefile, no .gitlab-ci.yml today). First dedicated CI step should be go build ./... across linux, darwin, windows to catch portability regressions before they snowball. File CI-bootstrap as a separate meta-item if needed. (cross-platform audit 2026-04-15) -
cp-10Replace per-platformexec.Commandshell-outs ininternal/wgplane/platform_*.gowith pure-Go syscalls. Partial (Linux delivered 2026-04-19 bysit-feature/wgplane-native-platform-stubs, commitsf7f837b..a6f0248): bothinternal/wgplane/platform_linux.goandinternal/exitnode/platform_linux.goare now netlink-based (vishvananda/netlink); in-wild zero-execve validated viascripts/test-3node-strace.sh; distroless container fixture attest/fixtures/distroless/Dockerfileruns portdistrictd in a 24.6 MB image without iproute2. Darwin/BSD/Windows still shell out — successor itemscp-12(darwin+BSD) andcp-13(Windows) track the remaining platforms. The overall cp-10 acceptance (grep → zero across ALL platform files) is NOT yet satisfied. Original context below. Current state (post-cp-02): each platform file shells out to the OS base tool for interface address assignment and bring-up —platform_linux.gousesip addr add <cidr> dev <iface>+ip link set <iface> up(iproute2);platform_darwin.go+platform_bsd.gouseifconfig <iface> inet6 <ip> prefixlen <n>+ifconfig <iface> up;platform_windows.gousesnetsh interface ipv6 add address .... These are all OS-base tools shipped with every modern install, so cp-02 correctly prioritized them over pure-Go for portability. 
The residual concerns that motivate further work: (a) fork/exec overhead on every interface configure (negligible at daemon startup but could matter at scale with many interfaces), (b) fragile error handling viaCombinedOutputparsing instead of structured errno/netlink response codes, (c) attack surface from subprocess spawning on privileged paths (each exec is a potential abuse vector if an attacker can influence the argv), (d) minimal-container/embedded deployments where the base image may strip userland tools for size/security (distroless Linux images don’t ship iproute2 by default; embedded BSD images may omitifconfig), (e) cleaner dependency graph — the daemon currently requires the OS to have specific userland commands available, a fact that’s invisible until runtime. Scope: (A) Linux: replaceipshell-outs withvishvananda/netlink(the canonical Go netlink library) or directgolang.org/x/sys/unixnetlink socket calls.vishvananda/netlinkis ~500KB and widely used (by containerd, CNI plugins, etc). Estimated ~50 LOC replacing the two exec.Command calls inplatform_linux.go. (B) darwin + BSD: replaceifconfigshell-outs withioctl(SIOCAIFADDR_IN6)viagolang.org/x/sys/unix. More involved than netlink because darwin’s utun ioctl interface is less documented, but wireguard-go’s owntun/tun_darwin.goalready does this and can be referenced. Estimated ~100 LOC. (C) Windows: replacenetshshell-outs withgolang.org/x/sys/windows/iphlpapiwhich has functions likeCreateUnicastIpAddressEntryfor programmatic IP assignment. Estimated ~80 LOC. Priority: LOW — current shell-out approach works correctly on all supported platforms; this is a code-quality / lean-deployment improvement, not a correctness fix. Dependencies:vishvananda/netlinkadds a new module dependency (acceptable given its ubiquity in Go networking code) but the darwin/BSD/Windows paths stay within existinggolang.org/x/sys/*packages. 
Acceptance:grep -r "exec.Command" internal/wgplane/returns zero matches in production code (test fixtures exempt), and the existing wgplane integration tests + cp-07 (in-wild darwin) + cp-08 (in-wild Windows) all continue to pass. Filed 2026-04-16 from the device-record-overlay-address session’s external-tools audit: the audit confirmed portdistrict has zerowgCLI dependency (cp-01 done) but still has four platform-specific shell-outs to OS base tools. The audit was prompted by the user’s question “do we rely on external tools such aswgon portdistrict nodes?” Full audit in the 2026-04-16 device-record-overlay-address retrospective when it lands. (device-record-overlay-address in-wild audit 2026-04-16) -
cp-12Darwin + BSD native API forinternal/wgplane/platform_{darwin,bsd}.go: replaceifconfigshell-outs withioctl(SIOCAIFADDR_IN6)viagolang.org/x/sys/unix, mirroring wireguard-go’stun/tun_darwin.go. Successor to cp-10 for the darwin/BSD portion. Priority LOW — currentifconfigshell-outs work correctly in production (cp-07 in-wild validated on macOS 14.8 and 26.4.1); this is code-quality/lean-deployment work. Acceptance:grep -r "exec.Command" internal/wgplane/platform_darwin.go internal/wgplane/platform_bsd.go internal/exitnode/platform_darwin.goreturns zero; macOS in-wild mesh still passes. When filed: consult howto::shellout_to_native_api_porting.md for semantic preservation (EEXIST idempotency, split-route vs true-default-route). (wgplane-native-platform-stubs 2026-04-19) -
cp-13Windows native API forinternal/wgplane/platform_windows.go: replacenetshshell-outs withgolang.org/x/sys/windows/iphlpapi.CreateUnicastIpAddressEntry. Successor to cp-10 for Windows. Blocked oncp-08(Windows runtime validation) — adding native API without in-wild runtime test is risk without reward. When filed: consult howto::shellout_to_native_api_porting.md. (wgplane-native-platform-stubs 2026-04-19) -
cp-14TamaGo-compatibleconn.Bindfor wireguard-go (resolves OQ7 fully):conn.NewDefaultBind()opens UDP sockets viasyscall.Socket, which TamaGo does not expose.internal/wgplane/create_netstack.gocurrently usesconn.NewDefaultBind()and compiles with-tags wgplane_netstackon Linux but will NOT link underGOOS=tamago. Needs a netstack-UDP-endpoint-backedconn.Bindimplementation that uses gVisor’sstack.UDPEndpoint(or equivalent) instead of host syscalls, gated by thetamagobuild tag. Estimated ~200–300 LOC. Prerequisite for the exit-node role on the TamaGo binary (per doc::portdistrict::tamago_sevsnp_facilitator_design_pending.md §11 OQ7/OQ8). Not blocking for the SEV-SNP facilitator role itself, which doesn’t use WireGuard. (wgplane-native-platform-stubs 2026-04-19) -
cp-08 Windows in-wild verification of internal/wgplane/platform_windows.go. Cross-compilation proves GOOS=windows go build ./internal/wgplane/... passes, but netsh interface ipv6 add address and Wintun TUN creation paths are untested at runtime. Needs: Windows 10/11 test environment with Wintun installed, administrator shell, run the daemon and verify the interface appears in netsh interface show interface. Expect edge cases around netsh silent failures without elevation, Wintun adapter naming, UAPI named-pipe path (\\.\pipe\ProtectedPrefix\Administrators\WireGuard\<ifname>). If the netsh path is problematic, the fallback is the proposal’s errNotImplemented stub with [SCOPE] log. Risk flagged in wgplane-portable-transport proposal’s Risks table. (wgplane-portable-transport 2026-04-15) -
cp-09 Config/state XDG directory split on Linux (cp-04-v2): move force-window.json, peers/, groups/ from xdg.ConfigHome to xdg.StateHome for XDG-proper separation. Deferred from config-path-xdg-adoption because of migration complexity and no user demand. Zero urgency: current layout works, and on macOS ConfigHome/StateHome collapse to the same path anyway. (config-path-xdg-adoption) -
cp-15 Add internal/exitnode/platform_windows.go stubs to unblock Windows cross-compilation for portdistrict and portdistrictd. Same pattern as wgplane-native-platform-stubs (cp-12/cp-13). Once stubs exist, re-add windows/amd64 to .forgejo/workflows/{release,build}.yml platform matrices. (release-binaries retro 2026-04-30) -
cp-16 Homebrew/APT/AUR packaging for portdistrict and portdistrictd. Downstream of release binaries: now that release infrastructure exists, distribution packaging can be added. (release-binaries study deferred scope 2026-04-30) -
cp-17 Relax //go:build linux constraint on cmd/tor-gw-init to allow a darwin stub mode for development convenience. Today tor-gw-init is linux-only by build tag, so the release matrix excludes darwin even though the binary’s high-level CLI surface could be exercised on macOS during development if the linux-specific paths fell back to clear errors. Low priority: current linux-only matrix matches deployment reality. (release-binaries retro 2026-04-30 Recommendation #3)
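
The exec.Command acceptance checks listed for the wgplane shell-out items (cp-12 and its predecessor) could also be expressed as a small Go program instead of a grep one-liner. The following is a minimal sketch under stated assumptions: the names isProductionGoFile and findShellOuts are hypothetical, and "test fixtures exempt" is read as skipping *_test.go files only. main runs the walker against a throwaway temp directory so the sketch executes anywhere.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// isProductionGoFile reports whether path is a Go source file that is not a
// test file. Treating *_test.go as the only exempt fixtures is an assumption.
func isProductionGoFile(path string) bool {
	return strings.HasSuffix(path, ".go") && !strings.HasSuffix(path, "_test.go")
}

// findShellOuts walks root and returns the production Go files whose text
// contains "exec.Command" (a substring match, so exec.CommandContext and
// comments mentioning exec.Command are flagged too, as with grep).
func findShellOuts(root string) ([]string, error) {
	var hits []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() || !isProductionGoFile(path) {
			return nil
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		if strings.Contains(string(data), "exec.Command") {
			hits = append(hits, path)
		}
		return nil
	})
	return hits, err
}

func main() {
	// Demo on a temp directory with hypothetical file names.
	dir, err := os.MkdirTemp("", "shellout-audit")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	files := map[string]string{
		"platform_demo.go":      "package demo\n// calls exec.Command(\"ifconfig\")\n",
		"platform_demo_test.go": "package demo\n// exec.Command inside a fixture\n",
		"native_demo.go":        "package demo\n",
	}
	for name, body := range files {
		if err := os.WriteFile(filepath.Join(dir, name), []byte(body), 0o644); err != nil {
			panic(err)
		}
	}

	hits, err := findShellOuts(dir)
	if err != nil {
		panic(err)
	}
	for _, h := range hits {
		fmt.Println("shell-out remains in:", filepath.Base(h))
	}
}
```

Pointing findShellOuts at internal/wgplane/ and failing the build when the returned slice is non-empty would mirror the acceptance criterion.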
Regression & Introspection
-
ri-01 Widen scripts/test-3node.sh probe timing for high-latency pairs: the current 15s DNS-publish sleep plus single warmup ping is insufficient for WG handshakes between remote_access and mac (~113ms RTT), producing false 2/6 failures on first launch while the mesh is actually healthy (verified 2026-04-18 in-wild during wire-format-single-version-reset retest). Options: (a) increase sleep to 30s, (b) retry-aware warmup loop that pings each pair until handshake succeeds or timeout, (c) check wg show handshake age before probing. Retrospective reference: .sit/reports/2026-04-18_wire-format-single-version-reset_retrospective.md § In-Wild Verification. -
ri-02 Retry-with-backoff in test-3node-group.sh network-show step: replace single-shot sleep with a poll loop for DNS propagation to reduce false failures from Cloudflare timing. -
ri-03 Multi-host cross-compile deployment script: server-transfer-sync.sh should support darwin/arm64 targets alongside linux/amd64 for 3-node tests. -
ri-04 Add scripts/tamago-attest-e2e.sh to periodic CI on SEV-SNP-capable GCE instances for on-silicon bootstrap protocol verification. -
wf-39 Refactor cliSim (cmd/portdistrict/bootstrap_protocol_test.go) to read the bootstrap cert-chain frame per-branch (runClaim with verify, runReconnect read-and-discard) instead of in the common section. Mirrors production decomposition (machine_server_remote.go:206-213 and :583-591). Falsification surface will then match the original ca-94 BUG’s deadlock signature instead of “attestation verify: no cert chain provided”. Source: 2026-04-29 bootstrap-protocol-integration-test retrospective.
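
The retry-aware warmup loop (ri-01 option b) and the DNS-propagation poll loop (ri-02) share one shape: probe, wait with a growing delay, stop on success or deadline. The test scripts themselves are shell; the Go sketch below only illustrates that shape. The names nextDelay and pollUntil are hypothetical, and the probe is a stand-in for a ping or DNS lookup.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// nextDelay doubles the current delay and caps it at max.
func nextDelay(cur, max time.Duration) time.Duration {
	next := cur * 2
	if next > max {
		return max
	}
	return next
}

// pollUntil calls probe until it returns nil or ctx expires, sleeping with
// capped exponential backoff between attempts. It returns the number of
// attempts made and, on timeout, the last probe error wrapped with context.
func pollUntil(ctx context.Context, initial, max time.Duration, probe func() error) (int, error) {
	delay := initial
	var lastErr error
	for attempt := 1; ; attempt++ {
		if lastErr = probe(); lastErr == nil {
			return attempt, nil
		}
		select {
		case <-ctx.Done():
			return attempt, fmt.Errorf("gave up after %d attempts: %w", attempt, lastErr)
		case <-time.After(delay):
		}
		delay = nextDelay(delay, max)
	}
}

func main() {
	// Stand-in probe: fails twice, then succeeds, like a handshake that
	// completes shortly after the first warmup ping.
	calls := 0
	probe := func() error {
		calls++
		if calls < 3 {
			return errors.New("handshake not complete")
		}
		return nil
	}

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	attempts, err := pollUntil(ctx, 10*time.Millisecond, 80*time.Millisecond, probe)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Compared with a fixed 15s or 30s sleep, a loop like this returns as soon as the pair is healthy and only spends the full budget on pairs that are actually failing.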
Parallel Development
-
pd-02 TamaGo --open/--demo trust mode: provider-side permissive trust checker that accepts any exit_request from unknown consumers (logging pubkeys for audit). Required for exit try to work against non-open providers. (exit-try-onramp) -
pd-03 IPv6-only KVM guests on self-hosted SEV-SNP: evaluate provisioning KVM instances as IPv6-only (no per-guest IPv4) to avoid IPv4 address-space pressure when migrating off GCE to self-hosted SEV-SNP. What’s preserved for free: portdistrict overlay is already v6 (fd00::/8 ULA); WireGuard endpoints are address-family agnostic; Tor onion services are address-family agnostic (fits exit-try-via-tor natively); SEV-SNP attestation is orthogonal to IP family; v6 removes most NAT traversal edge cases (ca-50 trusted-peer echo gets simpler, not harder). What’s lost / needs mitigation: (1) egress to v4-only destinations from Tamago exit-nodes: mitigate with NAT64 + DNS64 on the host (Jool or Tayga, ~20 lines of config); without it, a large fraction of the public internet (including many banks, older infra, ifconfig.me, some telemetry endpoints) fails from the “browse to any site” exit path; (2) inbound from v4-only user ISPs/cafes: facilitator should keep at least one dual-stack entry point (small $5/mo v4 VPS forwarding to v6 backend) until v4-user traffic is measured; (3) occasional gaps in v4-only package/registry mirrors (rare for major registries but bites private/niche ones): dual-stack cache proxy or NAT64 solves; (4) reputational effects on v6 egress ranges (more CAPTCHAs on some sites from new v6 /64s): same caveat as any new v6 range, time-decays; (5) legacy admin/CI paths from v4-only runners: one dual-stack jump host suffices. SEV-SNP-specific note: attestation reports don’t constrain address family, but cloud-init / guest networking config in the sealed image must be v6-ready (SLAAC + DHCPv6); TamaGo guests are already IPv6-first so this is a v1 concern only for Linux-guest workers. Rough recommendation: trusted-mesh nodes v6-only (preferable, matches overlay); Tamago exit-nodes v6-only + NAT64 on host; facilitators dual-stack until v4-user share is measured; admin/CI one dual-stack jump host. Cost framing: a /24 is ~$5k capex or ~$200/mo rental; NAT64 + one v4 entry point is ~10% of that. Action when self-hosted SEV-SNP plan firms up: (a) pick a NAT64 stack (Jool vs Tayga) and measure real-world v4-dest coverage; (b) audit Tamago exit-node dial paths for v4-literal assumptions; (c) instrument facilitator for client-side v4/v6 arrival ratios before going v6-only; (d) decide whether pd-03 supersedes or supplements a prospective dual-stack fallback mode. Related: ca-50 (trusted-peer echo; v6 makes this simpler), ca-87 (gVNIC TLS; orthogonal, but the self-hosted migration is one possible resolution path). Filed during exit-try-via-tor study session 2026-04-24. -
pd-05 Multi-gateway tor-gw load balancing: when a second operator deploys, design a client gateway selection mechanism (e.g., onion hostname list in DNS TXT, facilitator-mediated discovery). Currently 1:1 tor-gw:Tamago with out-of-band onion hostname. (tor-gw-vm-deployment-topology-automation) -
pd-09 tor-gw hardened image (cloud-portable). Phase 1 COMPLETE 2026-04-25: code + rootless build pipeline shipped (cmd/tor-gw-init/, scripts/tor-gw-image-build.sh, scripts/tor-gw-publish-image.sh, updated tor-gw-deploy.sh); see workflow::2026-04-25_tor-gw-hardened-image-cloud-portable::retrospective.md. Phase 3 in-wild GCE validation tracked as pd-11. Replaces today’s “stock Debian 12 + apt install tor via startup script” deploy pattern (pd-04) with a purpose-built immutable VM image. Small distroless/Alpine/buildroot base + stock c-tor + ~100-200 LOC init shim that reads forwarding target from instance metadata, writes torrc, and execs tor; read-only rootfs, no SSH, no package manager at runtime; runs unchanged on GCE / AWS / Azure / Hetzner / bare metal. Properties independent of attestation: reproducible image hash, smaller TCB, cross-provider portability, reduced post-compromise blast radius. Phasing: Phase 1 image build + GCE validation (~1wk) → Phase 3 cross-provider validation (~3d). Subsumes cq-11 (read-only rootfs). Closes ri-07 (cross-provider docs) by example. Unblocks pd-08 (gives signed grants a hardened-image deployment to point at) and pd-10 (provides the substrate for SEV-SNP onion-identity sealing). Joint design with pd-10: see report::2026-04-25_tor-gw-hardened-image-draft.md, “Layer 1” sections. Recommended sequence: pd-09 → pd-08 → pd-10. Filed from tor-gw-vm-deployment-topology-automation UX review 2026-04-25. -
pd-11 Phase 3 in-wild GCE validation for tor-gw hardened image. Phase 1 (image-side) DONE 2026-04-25 via 5 pivots; see workflow::2026-04-25_pd-11::retrospective.md. Phase 2 (e2e exit try --via-tor) DONE 2026-05-02 via the new wf-66 CLI path on instance wf66-e2e-120748. Full chain validated end-to-end: curl --socks5-hostname 127.0.0.1:9999 https://ifconfig.me returned 34.179.176.103 (= Tamago VM external IP) instead of the laptop’s direct IP 46.188.165.75, confirming traffic round-tripped through laptop curl → chainer 9999 → laptop tor 9050 → onion 3efkxftj…7i7qd → tor-gw SOCKS5+Noise → Tamago WG plane → public internet → ifconfig.me. The Noise NK handshake authenticated with the chip-derived ca-92 key. This is the same chain the ca-95 retrospective proved on 2026-04-28 with manually-deployed infra, now via CLI-deployed infra (machine server deploy --via-tor + machine server claim --accept-discovered-measurement), no shell scripts. Also validated the claim attestation chain end-to-end: nonce + session hash binding confirmed, ECDSA-P384 signature verified against VCEK, VCEK→ASK→ARK cert chain valid, measurement matched the canonical pd-18 value 7a5ed176bad8a9ff02cebb94b24b076a0b1905042a85d9fca7670d3a3ff466db3b1c2b76f8eca888f8d806d2ec92434e. See workflow::2026-05-01_machine-server-via-tor-deploy-augmentation::retrospective.md “In-wild e2e (pd-11 Phase 2 closure)” section. -
pd-12 Grant CLI discoverability verbs (grant list, grant show): display stored grants and their status. Completes the pd-08 discoverability story (“discoverable via peer list”). Low priority follow-on. -
pd-13 pd-11 Phase 2 (exit try --via-tor e2e) should exercise the grant-based path (exit try --via-tor --from <operator-ns>) rather than the raw flag path, now that pd-08 is shipped. -
pd-15 Rebuild linux-pd-snp without linux-firmware-any bloat. DONE 2026-04-27 via [PIVOT]: original mechanism (swap main/linux-lts → main/linux-virt) was falsified because main/linux-virt/ does not exist as a separate APKBUILD directory in current Alpine aports; linux-virt is a flavor of main/linux-lts/APKBUILD. Fix A landed instead: inline sed -i patch in scripts/build-kernel-pd-snp.sh:84-92 strips linux-firmware-any from the cloned APKBUILD’s main package() depends list, with a grep -q drift assertion. Validated on GCE N2D Confidential VMs: disk.tar.gz 659MB → 76MB (3.3× under the ≤250MB target), rootfs 722MB → 130.5MB, identical .onion across two independent CVMs. See workflow::2026-04-27_tor-gw-image-size-optimization::retrospective.md. -
pd-16 Phase 3 GCE validation: Deploy two CVMs from the same image, scrape serial output via tor-gw-show.sh, confirm same .onion address. Validates assumptions 2b, 3, and 4 from the study. Reuse existing test/spike/pd-10/a1_sev_guest_probe_gce_test.sh orchestrator pattern. (tor-gw-sev-snp-onion-identity-sealing retro §Recommendations) DONE 2026-04-27: validated twice on independent CVMs from the same image, both pre-pd-15 and post-pd-15 builds; same .onion 3efkxftj…7i7qd.onion and same attestation measurement on both occasions. See workflow::2026-04-27_tor-gw-image-size-optimization::retrospective.md. -
pd-17 Grant struct population (pd-10 Phase 2): Extend internal/exitgrant/grant.go to accept attestation measurement and serialize into grants. Operator workflow: scrape TOR_GW_ATTESTATION_MEASUREMENT from serial, pass to portdistrict exit grant --measurement. DONE 2026-04-27: ExpectedMeasurement []byte added to Grant struct, --measurement flag on exit grant with 48-byte length validation, hex-encoded on the wire (matches SEV-SNP convention), nullable-tolerant Decode preserves backward compatibility with pd-08 grants. The reserved-null field pd-08 added required zero migration. See workflow::2026-04-27_exit-grant-attestation-measurement::retrospective.md. Unblocks ca-93. -
pd-18 Verify the SEV-SNP attestation report, server-side. DONE 2026-04-27 via two pivots. The original mechanism (AMD KDS lookup) was falsified: the library’s strict policy[17]=1 check rejected GCE reports, AND AMD public KDS does not publish VCEKs for GCE chip+TCB pairs (verified 404 from multiple egress points). Final implementation uses client.GetRawExtendedReport which issues SNP_GET_EXT_REPORT; GCE pre-caches VCEK + ASK + ARK in guest memory via SNP_SET_EXT_CONFIG so the certs come back bundled with the report. No network calls during verification. Validated on GCE N2D Confidential VM: TOR_GW_ATTESTATION_REPORT_VALID=true on serial. Bonus retroactive bug fix: the pd-10/pd-15-era hand-rolled ioctl wrapper produced a struct-padded measurement (32 zero bytes prefix); pd-18’s switch to library wrapper returns proper bytes. Canonical measurement: 7a5ed176bad8a9ff02cebb94b24b076a0b1905042a85d9fca7670d3a3ff466db3b1c2b76f8eca888f8d806d2ec92434e. .onion unaffected. See workflow::2026-04-27_tor-gw-attestation-report-verify::retrospective.md. Unblocks ca-93 client-side comparison. -
pd-19 tor-gw consensus / HSDir publication issue on freshly-deployed VMs. DONE 2026-04-28: root cause was the “ready” marker firing on hostname-file existence (signal-of-presence) rather than on actual HSDir publication (truthful readiness). Fix: added ControlSocket /var/lib/tor/control.sock to the generated torrc, implemented a minimal Tor control-protocol client in cmd/tor-gw-init/main.go (waitForDescriptorPublished()), and wait for the HS_DESC UPLOADED event before emitting TOR_GW_DESCRIPTOR_PUBLISHED=true. Non-fatal degradation if control socket fails. The “no exit nodes” warning that triggered the original filing was a red herring: hidden services don’t need exit nodes. Phase 0 spike (test/spike/pd-19/) validated control auth + HS_DESC parseability against a private spike-tor instance on the laptop; measured 7.6s warm-laptop cold-start delay vs 60-120s budget for cold GCE. Implementation pivot count: 0. Validated in-wild on tor-gw-poc-a: TOR_GW_DESCRIPTOR_PUBLISHED=true emitted before subsequent attestation/onion markers. e2e chain test through Tamago surfaced ca-95 (Noise NK handshake bug between client noisewrap.Dial with ca-92-derived RemoteStatic and Tamago’s socks5_noise.go listener, failing with “chacha20poly1305: message authentication failed” on accept NK msg1). ca-95 is a separate latent bug that pd-19 unblocked the test enough to surface; pd-19’s own claim is met. See workflow::2026-04-28_tor-gw-consensus-hsdir-cold-start::retrospective.md. Also closes wf-27. -
pd-20 Tamago analog of pd-18: server-side cryptographic verification of Tamago’s own SEV-SNP attestation report at boot. Today cmd/portdistrict-exitnode-tamago derives a chip-bound Noise key (ca-92) and produces an attestation report inside runBootstrapServer (used to bind the bootstrap claim to this session), but never validates its OWN report’s cert chain or ECDSA-P384 signature locally. So Tamago doesn’t actually know whether it’s running on a genuine AMD chip: it trusts the kernel’s claim of SEV-SNP availability. The fix mirrors pd-18 (which did the same for tor-gw via client.GetRawExtendedReport): use the GHCB extended-report path to fetch the report PLUS the VCEK + ASK + ARK certs pre-cached by GCE in guest memory; parse the cert table; validate VCEK→ASK→ARK chain locally against a pinned ARK; verify the report’s signature on the raw bytes. Emit serial markers TAMAGO_ATTESTATION_REPORT_VALID=true|false and, on failure, TAMAGO_ATTESTATION_REPORT_REASON=<msg>. Twist for Tamago: the TamaGo runtime (bare-metal Go, no glibc) is a different runtime than the Linux init shim; pd-18 used github.com/google/go-sev-guest/client, which is Linux-specific. Tamago’s kvm/sev/key.go already has GHCB.DeriveKey; Phase 0 spike should validate whether GHCB also exposes an extended-report path (or if Tamago needs its own ioctl-equivalent reading via the same MSG_REPORT_REQ but with extended-config response handling). Falls into the same family as pd-15 / pd-18 falsifications; spike before propose. Also: include the Tamago Noise pubkey fingerprint in the report’s report_data (64-byte client-supplied field) so ca-94’s client-side check can bind report-to-key in one shot. Filed 2026-04-28 from session discussion identifying gap between ca-92’s chip-binding (operator-side inference) and lack of in-binary cryptographic self-verification. Unblocks ca-94. Related: pd-18 (tor-gw equivalent), ca-92 (key derivation), ca-94 (client-side wiring). -
pd-23 Self-hosted SEV-SNP in-binary verification for Tamago: on non-GCE hardware where AMD’s public KDS publishes VCEKs, Tamago can do full in-binary self-verification via gvisor netstack TLS. Revisit when self-hosted substrate is available. (tamago-attestation-report-verify) -
pd-24 File upstream TamaGo issue for GHCB multi-call timing constraint on GCE Confidential VMs. Reproduction: 3 rapid SNP_GUEST_REQUEST calls fail on the 3rd; 2 rapid + 1 delayed succeeds. Low urgency (workaround validated). See howto/tamago_sev_snp_multi_call_ghcb.md. -
pd-25 Apply three-control reproducibility pattern (volume serial pinning, SOURCE_DATE_EPOCH, deterministic tar+gzip) to scripts/tor-gw-image-build.sh. pd-09 lists “reproducible image hash” as a goal but does not yet implement these controls. Pattern documented in howto/reproducible_fat32_disk_images.md. (tamago-reproducible-measurement retro §Recommendations 1) -
pd-26 Offline LAUNCH_DIGEST computation: compute the expected SEV-SNP measurement from source without deploying a CVM. Requires understanding GCE’s OVMF firmware binary and the sev-snp-measure tool. Completes the end-to-end reproducibility story for third-party verification. Deferred from tamago-reproducible-measurement study. (tamago-reproducible-measurement retro §Recommendations 2)
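
The measurement handling described in pd-17 (hex-encoded on the wire, 48-byte length validation) can be sketched as below. This is an illustration only, not the code in internal/exitgrant/grant.go: the names parseMeasurement and measurementLen are hypothetical, and the sample input is the canonical measurement recorded under pd-18.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// measurementLen is the size of an SEV-SNP launch measurement in bytes.
const measurementLen = 48

// parseMeasurement decodes a hex string and rejects anything that is not
// exactly 48 bytes, so a truncated or padded value fails early.
func parseMeasurement(s string) ([]byte, error) {
	raw, err := hex.DecodeString(s)
	if err != nil {
		return nil, fmt.Errorf("measurement is not valid hex: %w", err)
	}
	if len(raw) != measurementLen {
		return nil, fmt.Errorf("measurement is %d bytes, want %d", len(raw), measurementLen)
	}
	return raw, nil
}

func main() {
	// Canonical measurement as listed in the pd-18 entry.
	const canonical = "7a5ed176bad8a9ff02cebb94b24b076a0b1905042a85d9fca7670d3a3ff466db3b1c2b76f8eca888f8d806d2ec92434e"

	m, err := parseMeasurement(canonical)
	fmt.Println("decoded bytes:", len(m), "err:", err)

	// A truncated value is rejected by the length check.
	_, err = parseMeasurement("7a5ed176")
	fmt.Println("truncated input:", err)
}
```

A strict length check of this kind would also have flagged the struct-padded measurement that the hand-rolled ioctl wrapper produced before pd-18, assuming the padded value was longer than 48 bytes when serialized.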
Workflow Methodology
-
meta-01 Write a meta-howto on “expired warnings”: when a prior retro predicts failures that don’t recur, the absence is evidence the prior fix worked (cmd-identity-init) -
meta-04 Document the MSI-X boot-crash pitfall on Intel+OVMF as a howto. Symptom: tamago-example’s startInterruptHandler pattern (cpu.LAPIC.Enable() + ioa.EnableInterrupt(irq, vector) + goos.Idle override + cpu.ServiceInterrupts(isr)) causes guest triple-fault immediately after WireGuard interface-up on QEMU + OVMF + Intel host (serial cuts off mid-line, QEMU exits via -no-reboot). Root cause unidentified; likely LAPIC state or goos.Idle assumptions. Verify whether it reproduces on AMD EPYC + OVMF before committing to IRQ-driven designs. Reproducer: commit 7aa7cbc. (tamago-exit-node-gce late retro) -
meta-05 Promote rootless UEFI disk-image packaging pattern to howto: scripts/tamago-package-disk.sh uses dd + /sbin/mkfs.vfat + mmd/mcopy (from mtools) to produce a FAT32 image with /EFI/BOOT/BOOTX64.EFI + /EFI/BOOT/shimx64.efi, then tars it for GCE compute images create. Replaces the sudo losetup + mount pattern that go-boot wiki’s GCE guide assumes. One-time setup handoff: sudo apt install dosfstools mtools. Validated 2026-04-20 producing tamago-exit-node-gce’s exitnode-tamago-v1/v2 GCE images. (tamago-exit-node-gce late retro) -
meta-07 Methodology note for Study: when Study writes “cheaper and more conclusive than a spike/pcap” about a non-trivial mechanism, test whether the cheap probe actually falsifies the mechanism or just tests one consequence of it. In gvnic-tls-tcp-relay-bug, the arithmetic MTU hypothesis fit the symptom pattern cleanly and looked conclusive on paper, but only a two-iteration A/B (MTU 1400 vs 1340) falsified it, while pcap would have been decisive with a single deploy. Guideline draft: when multiple mechanisms can explain the same symptom pattern, do the diagnostic that discriminates between them, not the one that merely confirms the most appealing one. Source: workflow::2026-04-23_gvnic-tls-tcp-relay-bug::retrospective.md. -
meta-08 When spiking bitfield/script or similar libraries, create a coverage checklist of specific API methods used in proposal code snippets. Test terminal methods (WriteFile, String), WithEnv with partial vs full environment, and WithDir existence, not just the core Exec pattern. From workflow::2026-04-30_machine-server-build-tamago-pipeline::retrospective.md Recommendation #1. -
meta-09 When adding a required field to a cross-binary struct, pre-count callers with grep before committing: NoiseStaticKey touched 8+ callers, all mechanical, but more than the estimate anticipated. Methodology note. (via-tor-trusted-peers-tamago retro)
Considered but Not Pursued
-
meta-06 Consider splitting dual-stream workflows into separate proposals: Stream A and Stream B had zero dependencies; managing both in one proposal added coordination overhead. Future dual-stream work may benefit from separate SIT workflows running in parallel. -
ca-88 Bundled tor client: retired 2026-04-24 as out of scope for portdistrict. Rationale: portdistrict’s target users are technical (CLI-comfortable, security/privacy-aware) for whom sudo apt install tor / brew install tor is acceptable setup. Bundling tor would require either per-platform signed/notarized binary ship (macOS Gatekeeper, Windows SmartScreen), CGo-linked libtor (cross-compile matrix pain, CVE treadmill), or download-and-verify on first run (Tor Project GPG pinning + per-OS Gatekeeper workarounds). Every one of these is sustained engineering that distracts from portdistrict’s actual domain (attested exit nodes + SOCKS5 chaining). The robust + cheap alternative is a clear error message on missing tor, with per-platform install hints, which is shippable as ~10 LOC in the existing --via-tor error path. Original scope item was filed from exit-try-via-tor deferred scope 2026-04-24. Full discussion in this session’s transcript.
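
The "clear error message on missing tor, with per-platform install hints" alternative named in ca-88 might look roughly like the sketch below. The function names installHint and checkTor are hypothetical, the Linux hint assumes a Debian-family system (the entry itself cites apt), and the real --via-tor error path may be structured differently.

```go
package main

import (
	"fmt"
	"os/exec"
	"runtime"
)

// installHint returns a per-platform install suggestion. The darwin and
// linux commands are the ones cited in ca-88; the fallback is generic.
func installHint(goos string) string {
	switch goos {
	case "darwin":
		return "brew install tor"
	case "linux":
		return "sudo apt install tor"
	default:
		return "install the tor daemon with your platform's package manager"
	}
}

// checkTor returns a descriptive error when no tor binary is on PATH.
func checkTor() error {
	if _, err := exec.LookPath("tor"); err != nil {
		return fmt.Errorf("--via-tor needs a tor binary on PATH, none found; try: %s",
			installHint(runtime.GOOS))
	}
	return nil
}

func main() {
	if err := checkTor(); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("tor found on PATH")
}
```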
Completed
-
wf-72 DONE 2026-05-03 machine server enable/disable --via-tor capability flag for non-remote-orchestrated hosts. Initial PARTIAL state (CLI shipped but produced an unreachable onion) was closed by the via-tor-host-capability-completion cycle. Full in-wild round-trip verified 2026-05-03 on a single Linux host: enable → tor onion published → claim populated registry → invite minted → friend exit try --accept-tofu recorded TOFU fingerprint matching the invite blob’s. See workflow::2026-05-03_via-tor-host-capability-completion::retrospective.md. -
wf-73 DONE 2026-05-03 Daemon-side managed-Tor process supervision: internal/torsupervisor/ Supervisor lifecycle (Start/Stop/Reattach/OnionHostname/Running, torrc generation with 0700 dir enforcement, hostname polling, descriptor publication wait via ControlSocket HS_DESC UPLOADED, SIGHUP reload, graceful shutdown). Phase 0 spike validated ControlSocket-from-Go-subprocess, re-attach across daemon restart, and non-standard HiddenServiceDir with daemon UID. Initial PARTIAL state closed by the completion cycle: torrc port wired to 1080 (matching the new SOCKS5+Noise listener); descriptor-publication race fixed (waitForDescriptor now polls the control socket via dialControlSocketWithRetry). In-wild round-trip verified 2026-05-03 (~6s for HS_DESC UPLOADED after tor start). See workflow::2026-05-03_via-tor-host-capability-completion::retrospective.md. -
wf-77 DONE 2026-05-03 In-wild verification of via-tor host capability. Full enable --exit-node --via-tor → show → claim → exit invite --via-tor → friend exit try --accept-tofu → fingerprint pinned in tofu-onions.json cycle exercised end-to-end on a single Linux host. TOFU branch (exit_try_tor.go:236-273) confirmed working: friend received empty attestation frame (“Attestation: no report received (Tamago dev mode)”), recorded fingerprint 7459a125ba2a97780e7d1cc6973b2e580852f911d8aad92054d0f37dce6782f0 (matches the invite blob’s fingerprint). One in-wild bug fixed during the test: torsupervisor’s descriptor-wait race (one-shot net.DialTimeout raced tor’s async ControlSocket creation; replaced with bounded retry). See workflow::2026-05-03_via-tor-host-capability-completion::retrospective.md. -
wf-78 DONE 2026-05-03 Registry write path for local non-confidential hosts. machine server claim --label <l> extended to handle non-confidential local via-tor hosts: reads OnionHostname + NoisePubkey from via-tor.json and WGPubKey from device keyring, writes NodeEntry{Confidential: false} directly to remote-nodes.json (no SSH delegation, no deploy step required). The Resolve(label) error path now creates a fresh entry when the local-host case applies (scope expansion logged in retrospective). In-wild verified 2026-05-03: claim populated registry; subsequent exit invite --via-tor --node vt-test minted a usable TOFU blob. See workflow::2026-05-03_via-tor-host-capability-completion::retrospective.md. -
wf-79 DONE 2026-05-03 Operator-side SOCKS5+Noise listener for non-CVM Linux hosts. internal/daemon/socks5noise.go ships a TCP listener on 127.0.0.1:1080 that wraps net.Listen with noisewrap.Listen, emits an empty attestation frame (writeAttestationFrame(conn, nil, nil) = 8 zero bytes, which triggers the TOFU branch on the friend side), and runs a go-socks5 server with kernel net.Dial for egress. Daemon spawns/stops the listener alongside the tor supervisor (decoupled lifecycle, port-agreement-by-constant). Phase 0 spike validated 4/4 hypotheses including kernel-vs-netstack (chose kernel), empty-attestation-triggers-TOFU, and noisewrap-with-net.Listen. In-wild verified 2026-05-03: friend exit try --via-tor completed Noise NK handshake against the local listener through the tor circuit. See workflow::2026-05-03_via-tor-host-capability-completion::retrospective.md. -
ca-109 DONE 2026-05-08 Trusted-peers-at-Tamago with metadata-polled hot-add. Moved primary auth from anonymous Noise NK to identity-revealing Noise IK with a trusted_peers map at Tamago, refreshed at runtime by a 60s metadata-poll goroutine reading the trusted_peers GCE instance attribute. Eliminates the cycle-3 chicken-and-egg between exit invite (post-deploy credentials) and Tamago boot (boot-time-only credential read). Wf-76 invite tokens retained as mandatory second layer (24h credential expiry). Tor v3 client auth (ca-89) defaulted OFF (operator opt-in via --tor-client-auth); ClientOnionAuthDir footgun eliminated. Phase 0 spike validated flynn/noise IK + PeerStatic() API; zero pivots during implementation (4th consecutive zero-pivot cycle when spikes are used). All 797 tests pass. See workflow::2026-05-08_via-tor-trusted-peers-tamago::retrospective.md. -
ca-110 DONE 2026-05-08 Token replay protection, addressed by ca-109’s Noise IK handshake freshness. Each connection establishes fresh ephemeral keys via the IK pattern, so a captured invite token cannot be replayed without also producing a new IK handshake (and the client static pub must be in trusted_peers). Combined with the existing 24h token expiry, replay risk is sufficiently mitigated; explicit single-use-token logic remains unnecessary. See workflow::2026-05-08_via-tor-trusted-peers-tamago::retrospective.md. -
ca-111 DONE 2026-05-08 Client/token revocation, addressed by ca-109’s metadata-poll hot-remove. Operator removes a friend by updating the trusted_peers GCE metadata attribute (e.g. gcloud compute instances add-metadata); Tamago’s poll goroutine picks up the change within 60s and replaces the trusted-peers map under a write lock. No CRL/OCSP infrastructure needed. See workflow::2026-05-08_via-tor-trusted-peers-tamago::retrospective.md. -
ca-114 DONE 2026-05-08 Post-deploy trusted_peers metadata push (closes the “no-redeploy-to-add-friend” promise from ca-109). exit invite --via-tor now reads node Project/Zone/Label from the registry and uses the GCE Compute SDK (compute.InstancesClient.Get + SetMetadata with fingerprint preservation) to push the updated trusted_peers to the running Tamago VM whenever a new client noise pubkey is appended. Non-GCE / non-confidential nodes skip the push with only a clarifying log line. ADC failures degrade to a warning + a gcloud compute instances add-metadata fallback hint instead of failing invite generation. Replaces the previous misleading “Run `portdistrict machine server deploy --via-tor`” hint that suggested redeploying. Landed as a [SCOPE] expansion on the cycle-4 branch ahead of in-wild verification. See workflow::2026-05-08_via-tor-trusted-peers-tamago::retrospective.md. -
wf-48 Add --publish flag or machine server build-publish verb to upload disk image to GCS and create GCE custom image. Requires spike for Go SDK storage.NewWriter and compute.Images.Insert. Delivered 2026-04-30 as --publish flag on machine server build (with --bucket, --image, --project); pure Go SDK (cloud.google.com/go/storage + cloud.google.com/go/compute/apiv1); all three guest-os-features set; delete-then-create with 404 tolerance; post-create verification via ImagesClient.Get. Originated from workflow::2026-04-30_machine-server-build-tamago-pipeline::retrospective.md Rec #2. Closing retrospective: workflow::2026-04-30_machine-server-publish-image::retrospective.md. -
wf-56 machine server build --source-dir semantics: clarify in help-text and enforce that --source-dir expects the project root, not the app subdirectory. Delivered 2026-04-30 on branch sit-feature/publish-image-ux-cleanup: help-text in printMachineServerUsage clarifies the project-root requirement; early os.Stat(<source-dir>/cmd/portdistrict-exitnode-tamago) check fires before compileTamago with a friendly error pointing at --source-dir . from repo root. Verified live (sanity check fires immediately for the wrong path; help-text shows the project-root note). -
wf-58 claim --accept-discovered-measurement TOFU output. Delivered 2026-04-30 on branch sit-feature/publish-image-ux-cleanup: flag now threads from runMachineServerClaim through delegateArgs into runMachineServerRemoteEnable, which on the branch with no --measurement flag plus TOFU extracts the measurement bytes from rawReport[sevsnp.MeasurementOffset:+sevsnp.MeasurementLen] and prints “Attestation: measurement accepted (TOFU first-deploy: <hex>)” plus a hint to record the value for subsequent strict --measurement claims. The misleading “measurement not checked” line remains for the path with neither the TOFU flag nor --measurement. -
wf-59 claim bootstrap-port closure messaging. Delivered 2026-04-30 on branch sit-feature/publish-image-ux-cleanup: after the “Node claimed. Exit-node active on port N.” line, claim now prints “Bootstrap port 8888 closed; subsequent claim attempts will require redeploy.” so operators understand why a re-attempt fails.
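
The hot-add and hot-remove behaviour described in ca-109 and ca-111 rests on replacing the whole trusted-peers map under a write lock while handshakes read it concurrently. A minimal sketch of that pattern follows. The type and method names are hypothetical, keys are plain strings here (the real encoding of the trusted_peers attribute is not specified in this roadmap), and the 60s metadata poll is left out.

```go
package main

import (
	"fmt"
	"sync"
)

// TrustedPeers holds the set of client static public keys allowed to
// complete a handshake. The zero value is an empty, usable set.
type TrustedPeers struct {
	mu   sync.RWMutex
	keys map[string]struct{}
}

// Replace swaps in a complete new set. Building the map before taking the
// write lock keeps the critical section to a single pointer assignment.
func (t *TrustedPeers) Replace(keys []string) {
	next := make(map[string]struct{}, len(keys))
	for _, k := range keys {
		next[k] = struct{}{}
	}
	t.mu.Lock()
	t.keys = next
	t.mu.Unlock()
}

// Has reports whether key is currently trusted. Many handshakes can call it
// concurrently because it only takes the read lock.
func (t *TrustedPeers) Has(key string) bool {
	t.mu.RLock()
	defer t.mu.RUnlock()
	_, ok := t.keys[key]
	return ok
}

func main() {
	var tp TrustedPeers

	// First poll result: two friends trusted.
	tp.Replace([]string{"alice-pub", "bob-pub"})
	fmt.Println("bob trusted:", tp.Has("bob-pub"))

	// Later poll result: operator removed bob from the metadata attribute.
	tp.Replace([]string{"alice-pub"})
	fmt.Println("bob trusted:", tp.Has("bob-pub"))
}
```

Replacing the whole map, rather than diffing entries, means removal needs no separate code path: a key absent from the latest poll result is simply no longer trusted.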