4 Commits

Author SHA1 Message Date
-Puter
5f667038d8 docs: plan robust retry and dlq layer 2026-06-05 22:01:00 +05:30
-Puter
ef5d7bb378 docs: map staging and production backend behavior 2026-06-05 22:01:00 +05:30
-Puter
d4f9b0edcb docs: inventory backend dead code candidates 2026-06-05 22:01:00 +05:30
-Puter
01e9cc92d4 docs: audit backend organization and actor flow 2026-06-05 22:01:00 +05:30
4 changed files with 695 additions and 0 deletions

docs/backend-dead-code.md Normal file

@@ -0,0 +1,84 @@
# Backend Dead Code Inventory
PRM-46 inventory pass for `growqr-backend`.
No source code was deleted in this pass. Static search and manual inspection were used. Typecheck was run successfully with `pnpm typecheck`.
## Summary
The codebase is mostly wired, but it contains several compatibility, demo, and partially superseded paths. The main cleanup risk is accidentally removing code still used by the frontend's older workflow screens or by demo environments.
## Candidates
| Priority | Candidate | Recommendation | Evidence |
| --- | --- | --- | --- |
| High | `src/actors/product-service-actors.ts` | Keep for now; consider deleting only after confirming no Rivet clients call these actors. | Actors are registered in `src/actors/registry.ts`, but local code routes service calls through `src/routes/services.ts` and `src/services/product-service-clients.ts` directly. No local `getOrCreate` references for `interviewServiceActor`, `roleplayServiceActor`, or `resumeServiceActor` were found. |
| High | Legacy `/workflows/job-application*` route aliases in `src/routes/workflows.ts` and large portions of `src/actors/user-actor.ts` workflow state | Keep until frontend migration is verified; likely cleanup after DB-backed workflow runs fully replace it. | `job-application` aliases call `userActor`; newer `/workflow-runs` path uses `workflowRuns`, `workflowRunModules`, and `workflowRunActor`/`executeWorkflowModule`. Two workflow systems coexist. |
| High | `src/workflows/module-runner.ts` synchronous execution from routes | Keep, but consolidate behind `workflowRunActor` before cleanup. | Used both by `workflowRunActor` and directly by route handlers. Direct route use undercuts actor durability, but the module runner itself is active. |
| Medium | `src/workflows/smoke-test.ts` | Keep as script if used manually; otherwise convert to documented test or remove. | Only referenced by `package.json` script `workflows:smoke`; not part of app runtime. |
| Medium | `scripts/rivet-actors.ts` | Keep if used by ops; document or remove if not. | Standalone admin script; not imported by source. It relies on `RIVET_ENDPOINT`, `RIVET_NAMESPACE`, and admin token defaults. |
| Medium | Demo home seeder `src/home/seed-demo-home.ts` and `/home/seed-demo` | Keep in staging/demo only; move behind explicit environment gate. | `src/routes/home.ts` exposes a seed endpoint. Schema has `generatedBy: "demo"` for notifications. This is live source behavior rather than isolated fixture code. |
| Medium | Static fallback mission registry vs persisted registry (`src/missions/registry.ts` and `src/missions/postgres-registry.ts`) | Keep both until migration/backfill is confirmed; then decide whether DB registry or static registry is source of truth. | `routes/missions.ts` reads persisted definitions, while actor factory and conversations read static definitions. `postgres-registry` falls back to static definitions. |
| Medium | Duplicate mission actor wrappers (`career-transition-actor.ts`, `salary-negotiation-war-room-actor.ts`, `promotion-readiness-actor.ts`, `personal-brand-opportunity-engine-actor.ts`) | Keep; low-cost wrappers are active. | Thin wrappers are mapped in routes, registry, event actor, and actor registry. |
| Medium | `src/events/projectors/projection-agent.ts` LLM insight path | Keep, but verify product use. | Referenced by `userEventActor` and `reducer-types`, so not dead. It can silently fall back when no LLM API key exists. |
| Medium | Legacy Redis observers in `src/events/redis-consumer.ts` | Keep until services emit canonical Grow Events. | Comments state these observe existing service A2A traffic. They are enabled by `INTERVIEW_REDIS_URL`, `ROLEPLAY_REDIS_URL`, and `RESUME_REDIS_URL`. |
| Medium | `events` audit table in `src/db/schema.ts` | Keep until old frontend timelines and route writes are audited. | Older user/service paths still import/use `events` table, while newer Grow Event tables also exist. |
| Low | `src/workflows/registry.ts` and `src/missions/registry.ts` duplicate product concepts | Keep; consolidate later. | Workflows are commercial product definitions; missions are actor-backed variants. The overlap is intentional but duplicative. |
| Low | `docker/opencode/workspace-template/*/README.md` placeholders | Keep as template docs or remove if generated workspaces no longer need empty folders. | Template-only files are not runtime code, but useful for preserving folder structure. |
| Low | `docs/architecture.html` | Keep unless replaced by Markdown architecture docs. | Existing doc artifact, not source. |
## Unused or Underused Env Vars / Config Values
| Env/config | Recommendation | Evidence |
| --- | --- | --- |
| `config.required` | Keep or remove after scanning call sites; currently exported but not used in local source. | `required` is attached to config, but no local `config.required(` references were found. |
| `clerkPublishableKey` | Keep if clients read backend config elsewhere; otherwise remove from backend config. | Defined in `config.ts` and `.env.example`, but backend auth uses secret key. |
| `opencodeApiKey` | Keep only if future direct OpenCode auth requires it; currently `llmApiKey` consumes `OPENCODE_API_KEY`. | Defined separately in config; most OpenCode runtime calls use per-container password, not this field. |
| `userServiceUrl` | Keep; used by missions profile lookup. | `routes/missions.ts` fetches `/api/v1/users/me`. |
| `legacyServiceTaskObserverGroup` | Keep while legacy Redis observers exist. | Used in `redis-consumer.ts`. |
| `migrationVersion`, `promptVersion`, `opencodeImageVersion` | Keep; active Docker rollout labels. | Used by `docker/manager.ts` and Docker build metadata. |
## Stale or Demo-Oriented Behavior
- Demo generated home notifications and `/home/seed-demo` should move to a staging/demo module or be guarded by `config.environment`.
- `service-agents.ts` includes demo-like defaults, such as `formula_version: "workflow-demo"` and synthetic Q Score fallback summaries.
- `config.ts` defaults many production-sensitive values to local/dev values, including Gitea admin credentials, service token fallback, A2A key, and localhost URLs.
- Docker/OpenCode scripts are active but dev-biased, using image tags like `growqr/opencode:dev`.
## Prompt Workflow Inventory
All prompt workflow files under `prompts/workflows/*` are referenced by `src/workflows/registry.ts` through `promptPath` values:
- `career-transition/orchestrator.md`
- `interview-to-offer/interview-plan.md`
- `salary-negotiation-war-room/orchestrator.md`
- `promotion-readiness/orchestrator.md`
- `personal-brand-opportunity-engine/orchestrator.md`
Additional interview-to-offer prompt files (`resume-analysis.md`, `story-bank.md`, `final-readiness-report.md`) are not referenced by `workflowDefinitions` directly in this pass. Recommendation: keep until OpenCode/agent prompt loading is audited, then either wire them into module definitions or archive them.
## Delete/Keep Decisions Before Cleanup
Do not delete yet:
- `userActor` workflow code
- `product-service-actors`
- static mission/workflow registries
- Redis legacy observers
- demo home seeder
- standalone scripts
Good first cleanup after approval:
1. Move demo seeding to `src/staging` and guard it with a staging/demo environment.
2. Remove or document unused config fields (`config.required`, `clerkPublishableKey`, `opencodeApiKey`) after a second pass across frontend/deployment references.
3. Convert `workflows:smoke` into a real test or delete the script.
4. Consolidate mission actor type mapping into one helper and remove duplicate mapping functions.
## Verification
`pnpm typecheck` passed:
```txt
tsc -p tsconfig.json --noEmit
```


@@ -0,0 +1,179 @@
# Backend Organization Audit
PRM-41 audit pass for `growqr-backend`.
Scope reviewed: `src/routes`, `src/actors`, `src/events`, `src/missions`, `src/workflows`, and `src/services`.
## Executive Summary
The backend currently has three overlapping orchestration layers:
1. HTTP routes that directly perform database writes, service calls, and some synchronous workflow execution.
2. Rivet actors that own durable user, workflow, mission, conversation, memory, and event processing state.
3. Event/projector code that normalizes service events into Grow Events, updates mission state, records service sessions, and projects Q Score signals.
That split is workable for a demo-stage backend, but it blurs ownership. Several routes contain business logic that should live in services or actors, while actors and event consumers need stronger idempotency, retry, and replay boundaries before production traffic.
## High-Level Architecture
```mermaid
flowchart LR
FE[Frontend / service clients] --> Hono[Hono routes]
Hono --> DB[(Postgres / Drizzle)]
Hono --> Rivet[Rivet actors]
Hono --> Svc[Product services]
Hono --> Docker[Docker + Gitea + OpenCode]
Svc --> Redis[Redis streams / pubsub]
Redis --> Consumer[events/redis-consumer]
Consumer --> GrowEvents[(grow_events)]
Consumer --> EventActor[userEventActor]
EventActor --> MissionActors[mission actors]
EventActor --> Projectors[QScore/session/projectors]
MissionActors --> DB
Rivet --> DB
Rivet --> Svc
Rivet --> Docker
```
## Route to Actor/Service/Event/Data Flow Map
| Route module | Mounted path | Primary flow | Actor/service/data dependencies | Notes |
| --- | --- | --- | --- | --- |
| `src/routes/actors.ts` | `/actors` | Auth-gated user stack control | `docker/manager`, `actors` table | Provisions/stops OpenCode stack directly from route. |
| `src/routes/agents.ts` | `/agents` | Catalog read | `agents/catalog` | Thin route. |
| `src/routes/chat.ts` | `/api/chat` | Chat request, Rivet first, direct LLM fallback | `userActor`, `lib/llm`, `services/service-agents` | Contains fallback tool orchestration and timeout logic in route. |
| `src/routes/conversations.ts` | `/conversations` | Conversation CRUD/chat/mission bridging | `conversationActor`, mission actors, `grow_conversations`, messages | Heavy route; mixes persistence, actor bootstrapping, mission resolution, and response shaping. |
| `src/routes/events.ts` | `/events` | User/service event ingestion and listing | `recordGrowEvent`, `routeGrowEventToUserActor`, `grow_events` | Good ingestion boundary, but service auth is environment-sensitive. |
| `src/routes/git.ts` | `/git` | Repo/file operations | `docker/manager`, `GiteaClient` | Route owns path safety and repo operation decisions. |
| `src/routes/grow.ts` | `/grow` | Grow bootstrap and active state | `growActor` | Thin actor gateway. |
| `src/routes/home.ts` | `/home` | Home feed, notifications, demo seed | `home-feed`, `seed-demo-home` | Includes demo seeding endpoint. |
| `src/routes/missions.ts` | `/missions` | Mission catalog, start/pause/resume/stage/artifacts/coach | `growActor`, mission actors, user service, mission registry | Heavy route; owns mission selection, profile fallback, actor type mapping, and artifact commands. |
| `src/routes/opencode.ts` | `/opencode` | OpenCode stack/session/message proxy | `docker/manager`, `OpencodeClient` | Directly provisions stack and opens sessions. |
| `src/routes/services.ts` | `/services` | Product service proxy and event recording | `product-service-clients`, `recordGrowEvent`, Q Score onboarding | Very heavy route; contains service-specific payload shaping and event side effects. |
| `src/routes/users.ts` | `/users` | User profile/bootstrap | `auth/clerk`, `users` table, onboarding Q Score | Includes Clerk profile mirroring and onboarding side effects. |
| `src/routes/workflows.ts` | `/workflows`, `/workflow-runs` | Workflow definitions/runs/modules/approvals | `userActor`, `workflowRunActor`, `workflows/module-runner`, DB | Two paths: legacy `userActor` job-application flow and DB-backed workflow runs. |
## Actor Inventory
| Actor | Current role | Main inputs | Outputs/effects | Robustness observations |
| --- | --- | --- | --- | --- |
| `userActor` | Legacy unified user orchestration: chat, memory tools, workflow status, service handoffs, OpenCode/Gitea interactions | `/api/chat`, `/workflows/job-application`, workflow route aliases | Actor state, DB events, service calls, Gitea reads/writes | Very broad responsibilities; failures in service calls often become summaries rather than durable retryable jobs. |
| `workflowRunActor` | Queued workflow module runner | `/workflow-runs/:runId/pause`, `/workflow-runs/:runId/resume`, and direct client use | `workflowRunModules`, `workflowEvents`, `qscoreSnapshots` via module runner | Has Rivet loop retry settings for module execution, but route-level `/run` bypasses actor queue and executes synchronously. |
| `conversationActor` | Durable streaming conversation state | `/conversations` | Actor state and generated messages | Queue usage exists for messages; needs documented idempotency per turn/message id. |
| `memoryActor` | Durable memory file state | Internal client use | Actor state/file-like memory | Queue writes exist; external call idempotency unclear. |
| `growActor` | Active mission list/state control | `/grow`, `/missions` | `grow_active_missions`, mission state | Mission lifecycle split across growActor, mission actors, and routes. |
| `userEventActor` | Routes normalized Grow Events to missions/projectors | Redis consumer, `/events` ingestion | Mission stage patches, projector DB updates, event status | Central point for event idempotency, but retries/replay/DLQ are not yet formalized. |
| Mission actors | Per-mission state machines | `/missions`, `/conversations`, event actor | `grow_active_missions`, artifacts, suggestions | Four mission actors are thin factory wrappers; interview-to-offer has a custom implementation. |
| Product service actors | Actor wrappers for interview/roleplay/resume clients | Registry only; possible client use | Service calls | Registered, but routes call clients directly. These may be underused compared to direct service proxy routes. |
## Event and Projector Flow
```mermaid
sequenceDiagram
participant Service as Product service
participant Redis as Redis stream/pubsub
participant Route as /events or service routes
participant Store as grow_events
participant UserEvent as userEventActor
participant Mission as mission actor
participant Projection as projectors
Service->>Redis: canonical GrowEvent or legacy task response
Redis->>Route: redis-consumer normalizes message
Route->>Store: recordGrowEvent with dedupeKey
Route->>UserEvent: routeGrowEventToUserActor
UserEvent->>Mission: apply reducer-derived stage patches
UserEvent->>Projection: service session and Q Score projections
Projection->>Store: update projection tables
```
Current event strengths:
- `normalizeGrowEvent` accepts multiple service field conventions.
- `recordGrowEvent` uses `dedupeKey` and a unique index on `grow_events.dedupe_key`.
- Legacy Redis observer bridges `tasks:*` and `responses:*` without service changes.
- Projector surfaces exist for session tracking, Q Score, and LLM-derived insights.
Current event gaps:
- Redis canonical consumer always `xAck`s in `finally`, even when `recordAndRoute` fails, so failed messages do not remain pending for retry.
- No DLQ stream/table for failed canonical or legacy event processing.
- No replay script for `grow_events.processing_status in ('failed', 'unresolved')`.
- Legacy task context is in-memory only, so response events can lose user/action context after a backend restart.
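The first two gaps suggest acknowledging a message only after it has been processed or parked in a DLQ. A minimal sketch of that decision follows. It assumes a per-message attempt counter and a retry budget, neither of which exists in the reviewed source; `decideAck`, `ProcessingOutcome`, and `AckDecision` are illustrative names.

```typescript
// Hypothetical outcome of one attempt to record and route a Redis message.
type ProcessingOutcome = { ok: true } | { ok: false; error: string };

// What the consumer should do with the message after this attempt.
type AckDecision = "ack" | "leave-pending" | "dead-letter-then-ack";

// Decide whether to acknowledge a message. `attempt` is 1-based and
// `maxAttempts` is an assumed retry budget, not an existing config value.
function decideAck(
  outcome: ProcessingOutcome,
  attempt: number,
  maxAttempts: number,
): AckDecision {
  // Durable record and routing succeeded: safe to acknowledge.
  if (outcome.ok) return "ack";
  // Budget not exhausted: do not ack, so the message stays pending.
  if (attempt < maxAttempts) return "leave-pending";
  // Budget exhausted: copy to the DLQ first, then acknowledge.
  return "dead-letter-then-ack";
}
```

With Redis Streams, an unacknowledged message stays in the consumer group's pending entries list, so a later pass can claim and retry it.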
## Business Logic in Routes
Highest concentration:
- `src/routes/services.ts`: service-specific request construction, event emission, Q Score baseline/onboarding side effects, mission association, and UI response shaping.
- `src/routes/workflows.ts`: run creation, module row initialization, baseline Q Score, approval gate progression, artifact content lookup, and synchronous module execution.
- `src/routes/missions.ts`: mission profile lookup from user service, actor type mapping, start/resume/pause/stage/artifact commands, and coach run orchestration.
- `src/routes/conversations.ts`: active conversation persistence, mission-aware chat routing, actor fallback behavior, and response normalization.
- `src/routes/chat.ts`: Rivet fallback, direct LLM tool loop, service agent selection, and timeout handling.
Low-risk thin routes:
- `src/routes/agents.ts`
- `src/routes/grow.ts`
- parts of `src/routes/events.ts`
Recommended ownership target:
- Routes validate/authenticate and translate HTTP to commands.
- Actors own durable user/mission/workflow progression.
- Services own outbound HTTP details.
- Projectors own derived read models.
- Routes should not decide retry, idempotency, or service fallback behavior beyond returning HTTP errors.
## Idempotency Gaps
| Area | Existing behavior | Gap |
| --- | --- | --- |
| Grow Event ingestion | `dedupeKey` unique index; normalizer uses explicit key or source id | Service routes do not consistently set stable dedupe keys for all service-created side effects. |
| Workflow runs | `/workflow-runs/:runId/modules/:moduleId/run` reads `idempotency-key` header | `executeWorkflowModule` does not use the key to suppress duplicate service calls; `/run` generates timestamp keys. |
| Workflow module rows | Has `idempotencyKey`, `retryCount`, `maxRetries` columns | Counters are mostly passive; no central retry state machine. |
| Actor queues | Rivet queues and `loop` step names provide some dedupe for `workflowRunActor` | Several routes bypass actor queue and execute directly. |
| Service session creation | `stableUuid` exists in service-agent helper | Not consistently used as a request id/idempotency key across service calls. |
| OpenCode artifacts | `onConflictDoNothing` for workflow artifacts | OpenCode prompt/message send can duplicate work before artifact row conflict applies. |
## Retry Gaps
| Area | Existing behavior | Gap |
| --- | --- | --- |
| `workflowRunActor` | Rivet `loop` has `retryBackoffBase` and `retryBackoffMax` | Only applies when execution goes through actor loop. |
| HTTP service clients | Throw on non-2xx after `fetch` | No timeout, retry classification, request id, or backoff. |
| Gitea client | Some wait/poll helpers exist | Most API calls are single-shot. |
| OpenCode client | Health polling exists | Session/message calls are single-shot. |
| Redis consumer | Infinite loop catches top-level errors | Per-message failures are acked; no retry budget or DLQ. |
| Projectors | Called by event actor | Projector failures need durable retry/replay semantics and status transitions. |
## Actor Robustness Gaps
- `userActor` is too broad to reason about failure domains. It owns chat, service tools, memory, workflow, Gitea, OpenCode, and DB event writes.
- Product service actors are registered but not the primary path for service proxy routes, so actor-level durability is uneven.
- Mission actor mapping is manually duplicated in routes, registry, and event actor.
- Route-level synchronous workflow execution can hold HTTP requests open across slow service/OpenCode calls.
- Actor initialization is repeated in routes; a central actor gateway could enforce init/idempotency/logging.
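The duplicated mission actor mapping could collapse into one lookup shared by routes, registry, and event actor. The sketch below is an assumption throughout: the mission ids are inferred from the wrapper and prompt names in these docs, and the actor type strings are hypothetical placeholders rather than the real registry keys.

```typescript
// Single source of truth for mission id -> actor type.
// Both the ids and the actor type strings are assumptions for illustration.
const missionActorTypes = {
  "career-transition": "careerTransitionActor",
  "interview-to-offer": "interviewToOfferActor",
  "salary-negotiation-war-room": "salaryNegotiationWarRoomActor",
  "promotion-readiness": "promotionReadinessActor",
  "personal-brand-opportunity-engine": "personalBrandOpportunityEngineActor",
} as const;

type MissionId = keyof typeof missionActorTypes;
type MissionActorType = (typeof missionActorTypes)[MissionId];

// Type guard so callers can narrow untrusted strings (route params, events).
function isMissionId(value: string): value is MissionId {
  return Object.prototype.hasOwnProperty.call(missionActorTypes, value);
}

// Resolve the actor type or throw, instead of silently falling back.
function resolveMissionActorType(missionId: string): MissionActorType {
  if (!isMissionId(missionId)) {
    throw new Error(`Unknown mission id: ${missionId}`);
  }
  return missionActorTypes[missionId];
}
```

Throwing on unknown ids keeps a typo in one caller from quietly routing to a default actor.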
## Priority-Ranked Recommendations
1. Create a backend command layer for route-to-actor/service translation. Move mission start, workflow run, approval, service configure, and chat tool dispatch logic out of routes.
2. Make `workflowRunActor` the only executor for workflow modules. Routes should enqueue commands and return command ids.
3. Add a shared outbound `withRetry`/timeout/idempotency wrapper for service clients, Gitea, OpenCode, and LLM calls.
4. Add DLQ and replay support for Redis/event processing. Do not ack canonical Redis messages until durable record/projector status is successful or DLQ-ed.
5. Normalize mission actor mapping into a single registry source used by routes, event actor, and mission registry.
6. Split `userActor` responsibilities: chat/memory/workflow/OpenCode paths should be smaller actors or delegated services with explicit contracts.
7. Convert route-created side effects to stable idempotency keys. Use request id, user id, mission instance id, service id, and operation name.
8. Add structured logging fields across routes/actors/events: `requestId`, `userId`, `missionInstanceId`, `runId`, `moduleId`, `eventId`, `idempotencyKey`, `retryAttempt`.
9. Add focused tests around duplicate workflow module run, duplicate service event ingest, Redis failure handling, and mission projector replay.
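The logging fields in recommendation 8 can be carried as one typed context object. This is a minimal sketch, assuming JSON-line logs; `LogFields`, `withFields`, and `formatLogLine` are hypothetical helpers, not existing backend code.

```typescript
// Correlation fields from recommendation 8. All are optional because
// not every code path has every id.
type LogFields = {
  requestId?: string;
  userId?: string;
  missionInstanceId?: string;
  runId?: string;
  moduleId?: string;
  eventId?: string;
  idempotencyKey?: string;
  retryAttempt?: number;
};

// Derive a child context, for example when a route hands work to an actor.
function withFields(base: LogFields, extra: LogFields): LogFields {
  return { ...base, ...extra };
}

// Build one JSON log line. Fields set to undefined are dropped by JSON.stringify.
function formatLogLine(
  level: "info" | "warn" | "error",
  message: string,
  fields: LogFields,
): string {
  return JSON.stringify({ level, message, ...fields });
}
```

A route would create the base context from the request, and each actor or service call would extend it with its own ids before logging.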
## Suggested Next Slice
Use PRM-43 to introduce shared retry/idempotency primitives first. Then return to this audit and migrate the highest-risk route logic in this order:
1. `/workflow-runs/*/run`
2. `/services/interview|roleplay configure/review`
3. `/missions/:missionId/start`
4. `/api/chat` direct LLM fallback

docs/environment-matrix.md Normal file

@@ -0,0 +1,148 @@
# Environment Matrix
PRM-42 staging vs production separation inventory for `growqr-backend`.
No refactor was performed in this pass.
## Current Environment Model
The backend currently uses `config.nodeEnv` plus many individual env vars. There is no explicit first-class `environment` such as `development | staging | production | demo`.
Important consequence: local/dev defaults can leak into staging or production unless deployment env vars override every sensitive value.
## Current Config Inventory
| Area | Config/env | Current default | Production concern |
| --- | --- | --- | --- |
| Runtime | `PORT`, `LOG_LEVEL`, `NODE_ENV` | `4000`, `info`, `development` | `NODE_ENV` is too broad for staging/demo behavior. |
| Database | `DATABASE_URL` | hardcoded fallback DSN in `config.ts` | Production should fail fast instead of falling back. |
| Auth | `CLERK_SECRET_KEY`, `CLERK_PUBLISHABLE_KEY` | empty | Secret key absence changes auth behavior; publishable key appears underused. |
| Service auth | `SERVICE_TOKEN`, `A2A_ALLOWED_KEY` | empty / `dev-a2a-key` | Dev token fallback must not be accepted in production. |
| Redis events | `GROW_EVENTS_REDIS_URL`, `REDIS_URL`, stream/group/consumer names | disabled unless set | Staging/prod need explicit stream, group, and replay policy. |
| Legacy Redis | `INTERVIEW_REDIS_URL`, `ROLEPLAY_REDIS_URL`, `RESUME_REDIS_URL` | fallback to event Redis | Legacy observation should be explicitly enabled per environment. |
| LLM | `LLM_PROVIDER`, `LLM_API_KEY`, `OPENCODE_API_KEY`, `LLM_BASE_URL`, `GROW_AGENT_MODEL`, `LLM_MODEL` | `opencode`, `https://opencode.ai/zen/v1`, `kimi-k2.6` | Staging/prod should pin provider/model and require API key where features are enabled. |
| Rivet | `RIVET_ENDPOINT`, `RIVET_CLIENT_ENDPOINT` | localhost/127.0.0.1 | Docker compose overrides endpoint; production needs internal and public separation. |
| Product services | `INTERVIEW_SERVICE_URL`, `ROLEPLAY_SERVICE_URL`, `QSCORE_SERVICE_URL`, `RESUME_SERVICE_URL`, `USER_SERVICE_URL`, `MATCHMAKING_SERVICE_URL`, `SOCIAL_BRANDING_SERVICE_URL` | localhost ports | Production should require service URLs or feature-disable explicitly. |
| Public URLs | `INTERVIEW_PUBLIC_URL`, `ROLEPLAY_PUBLIC_URL`, `RESUME_PUBLIC_URL`, `WORKFLOWS_DASHBOARD_URL`, `FRONTEND_ORIGIN` | localhost/frontend fallback | Public and internal service URLs need separate semantics. |
| Gitea | `GITEA_PUBLIC_URL`, `GITEA_INTERNAL_URL`, `GITEA_ADMIN_USER`, `GITEA_ADMIN_PASSWORD`, `GITEA_ADMIN_TOKEN`, `GITEA_ORG_NAME` | localhost, `growqr-admin`, `growqr-admin-dev`, empty token | Admin password fallback is dev-only. Production should require token/secret. |
| OpenCode | `OPENCODE_IMAGE`, `OPENCODE_IMAGE_VERSION`, `MIGRATION_VERSION`, `PROMPT_VERSION`, `USER_CONTAINER_HOST`, `USER_DATA_ROOT`, `USER_PORT_RANGE_*` | dev image/version, local paths/ports | Needs staging/prod image tags and storage policy. |
| CORS/admin | `FRONTEND_ORIGIN`, `ADMIN_USER_IDS` | localhost / empty | An empty admin list currently opens `/workflows/admin/ops` to all authenticated users. |
| Agent limits | `MAX_AGENT_TOKENS`, `PROJECTION_AGENT_MODEL`, `CONVERSATION_ACTOR_MODEL` | 4096 / agent model | Model overrides should be pinned by environment. |
## Environment-Dependent Code Paths
| File | Behavior |
| --- | --- |
| `src/config.ts` | Central env parsing with dev defaults for database, tokens, local service URLs, Gitea, OpenCode, Rivet, frontend, and ports. |
| `src/auth/clerk.ts` | In non-production, `A2A_ALLOWED_KEY` is accepted as an auth fallback. Clerk client is only created when `CLERK_SECRET_KEY` exists. |
| `src/index.ts` | Proxies `/api/rivet` only when `process.env.RIVET_ENDPOINT` is set. Starts Redis consumer opportunistically. CORS uses `FRONTEND_ORIGIN`. |
| `src/events/redis-consumer.ts` | Canonical consumer disabled if no Redis URL. Legacy observers enabled by legacy Redis URLs. |
| `src/events/projectors/projection-agent.ts` | Falls back if no LLM API key; model can be overridden by `PROJECTION_AGENT_MODEL`. |
| `src/actors/conversation/agent.ts` | Requires LLM key for streaming; model can be overridden by `CONVERSATION_ACTOR_MODEL`. |
| `src/routes/events.ts` | In non-production, service ingest auth accepts requests without a service token. |
| `src/routes/home.ts` | Exposes demo seeding route. |
| `src/home/seed-demo-home.ts` | Generates demo notifications; also runs directly as an executable script. |
| `src/services/service-agents.ts` | Synthetic/demo fallbacks for some unavailable services and Q Score estimate behavior. |
| `src/docker/manager.ts` | Uses Gitea/OpenCode image/version/host/path/port config and mutates Docker runtime. |
| `scripts/rivet-actors.ts` | Uses dev Rivet namespace/token defaults. |
| `docker-compose.yml` | Dev compose defaults for Postgres, Gitea, Rivet, backend, services, frontend origins, and OpenCode image. |
| `docker/opencode/*` | Dev-oriented OpenCode image/template behavior. |
## Hardcoded URL and Default Hotspots
- `http://localhost:*` defaults in `src/config.ts`, `.env.example`, `README.md`, and `docker-compose.yml`.
- `http://127.0.0.1:*` defaults for Rivet client, Gitea, and user container host.
- `http://host.docker.internal:*` compose service defaults.
- OpenCode base image `ghcr.io/anomalyco/opencode:latest` in `docker/opencode/Dockerfile`.
- Dev image tag `growqr/opencode:dev`.
- Gitea admin defaults `growqr-admin` / `growqr-admin-dev`.
- A2A fallback `dev-a2a-key`.
## Clerk / JWKS Assumptions
The code uses the Clerk SDK with `CLERK_SECRET_KEY`; there is no explicit JWKS URL configuration in the reviewed backend source. Service-to-service auth is token-based, with dev fallback behavior. The production target should document which of these mechanisms apply:
- Clerk session token verification for user requests.
- `SERVICE_TOKEN` for service-to-backend event ingestion.
- Separate internal A2A key for legacy product service calls.
- Optional JWKS validation if services send JWTs instead of opaque service tokens.
## Target Config Model
Introduce:
```ts
type RuntimeEnvironment = "development" | "test" | "staging" | "demo" | "production";
```
Recommended top-level config shape:
```ts
config.environment
config.isProduction
config.isStaging
config.isDemo
config.features.demoDataEnabled
config.features.legacyRedisObserversEnabled
config.features.opencodeProvisioningEnabled
config.features.serviceProxyEnabled
config.urls.internal.*
config.urls.public.*
config.auth.*
config.retry.*
config.events.*
```
Rules:
- Production must fail fast for missing `DATABASE_URL`, `CLERK_SECRET_KEY`, `SERVICE_TOKEN`, `FRONTEND_ORIGIN`, Gitea credentials/token, and any enabled service URL.
- Staging may use staging service URLs and demo data only when `DEMO_DATA_ENABLED=true`.
- Development may keep local defaults.
- Demo behavior should be impossible in production unless an explicit, audited flag is set and the route remains auth/admin-gated.
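The rules above could be enforced at startup roughly as follows. This is a sketch under assumptions: `APP_ENV` is one of the candidate variable names, defaulting to `development` when unset is a choice made here, and the required list omits Gitea credentials and per-service URLs for brevity.

```typescript
type RuntimeEnvironment = "development" | "test" | "staging" | "demo" | "production";

const environments: readonly RuntimeEnvironment[] = [
  "development", "test", "staging", "demo", "production",
];

type Env = Record<string, string | undefined>;

// Parse APP_ENV (assumed name) rather than overloading NODE_ENV.
function parseEnvironment(env: Env): RuntimeEnvironment {
  const raw = env["APP_ENV"] ?? "development";
  const match = environments.find((candidate) => candidate === raw);
  if (!match) throw new Error(`Unknown APP_ENV: ${raw}`);
  return match;
}

// Subset of the variables the rules require in production.
const requiredInProduction = [
  "DATABASE_URL", "CLERK_SECRET_KEY", "SERVICE_TOKEN", "FRONTEND_ORIGIN",
] as const;

// Names of required variables that are unset or empty.
function missingProductionVars(env: Env): string[] {
  return requiredInProduction.filter((name) => !env[name]);
}

// Fail fast: throw before the server starts if production config is incomplete.
function assertConfig(env: Env): RuntimeEnvironment {
  const environment = parseEnvironment(env);
  if (environment === "production") {
    const missing = missingProductionVars(env);
    if (missing.length > 0) {
      throw new Error(`Missing required env vars: ${missing.join(", ")}`);
    }
  }
  return environment;
}
```

Calling `assertConfig(process.env)` once at boot would surface every missing variable in a single error instead of failing on first use.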
## What Should Move to `src/staging`
Proposed `src/staging` candidates:
- `home/seed-demo-home.ts`
- `/home/seed-demo` route handler
- demo notification factories
- demo Q Score formulas/fallback constants in service-agent behavior, if not product-approved
- local-only service session scaffolding helpers
- any future seeders/backfills used only for demos
Suggested layout:
```txt
src/staging/
  demo-home.ts
  demo-qscore.ts
  seed-routes.ts
  guards.ts
```
`src/staging/guards.ts` should expose `requireStagingOrDemo(config)` and fail closed in production.
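A possible shape for that guard, as a sketch only. `GuardConfig` is a hypothetical slice of the proposed config, and requiring `demoDataEnabled` in addition to the environment is an assumption drawn from the rules above.

```typescript
type RuntimeEnvironment = "development" | "test" | "staging" | "demo" | "production";

// Minimal slice of the proposed config shape that the guard needs.
type GuardConfig = {
  environment: RuntimeEnvironment;
  features: { demoDataEnabled: boolean };
};

// Fail closed: only an explicit allow-list passes, so production and any
// environment added later are rejected by default.
function requireStagingOrDemo(config: GuardConfig): void {
  const allowedEnvironment =
    config.environment === "staging" || config.environment === "demo";
  if (!allowedEnvironment || !config.features.demoDataEnabled) {
    throw new Error(`Demo-only code path blocked in "${config.environment}"`);
  }
}
```

An allow-list is used instead of a `!== "production"` check so that a misconfigured or new environment cannot enable demo behavior by accident. Local development could be added to the allow-list if seeding should stay available there.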
## Target Environment Matrix
| Behavior | Development | Staging | Demo | Production |
| --- | --- | --- | --- | --- |
| Localhost defaults | Allowed | Not allowed | Not allowed unless local demo | Not allowed |
| Demo seed endpoints | Allowed | Explicit flag + admin | Enabled by flag + admin | Disabled |
| Service token fallback | Allowed | Not allowed | Not allowed | Not allowed |
| Legacy Redis observers | Optional | Explicit flag | Explicit flag | Disable unless migration requires |
| Redis canonical events | Optional | Required for event demos | Required | Required |
| OpenCode image | `:dev` ok | pinned staging tag | pinned demo tag | pinned release tag |
| Admin ops route | Any authenticated user may be acceptable | `ADMIN_USER_IDS` required | `ADMIN_USER_IDS` required | `ADMIN_USER_IDS` required |
| Missing Clerk secret | Allowed only for local mock if implemented | Fail | Fail | Fail |
| Gitea admin password default | Allowed | Fail | Fail | Fail |
## Priority Recommendations
1. Add `APP_ENV` or `GROWQR_ENV` and derive `config.environment`; stop relying on `NODE_ENV` for product behavior.
2. Fail fast in staging/production for missing secrets and localhost/default service URLs.
3. Move demo seed code into `src/staging` and guard routes with `DEMO_DATA_ENABLED` plus admin check.
4. Require `ADMIN_USER_IDS` before enabling `/workflows/admin/ops` outside development.
5. Split public URLs and internal URLs in config names consistently across frontend, services, Gitea, Rivet, and OpenCode.
6. Add a deployment checklist that records every required env var per environment.
7. Make legacy Redis observers an explicit feature flag and set a removal date.
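Recommendation 4 amounts to making the admin check fail closed when `ADMIN_USER_IDS` is empty. A minimal sketch follows; the comma-separated format and both helper names are assumptions, not the current implementation.

```typescript
// Parse ADMIN_USER_IDS. A comma-separated list is assumed here.
function parseAdminUserIds(raw: string | undefined): Set<string> {
  const ids = (raw ?? "")
    .split(",")
    .map((id) => id.trim())
    .filter((id) => id.length > 0);
  return new Set(ids);
}

// Decide access to admin routes such as /workflows/admin/ops.
function isAdmin(
  userId: string,
  adminUserIds: Set<string>,
  environment: string,
): boolean {
  if (adminUserIds.size === 0) {
    // Empty list: only development keeps the permissive behavior.
    return environment === "development";
  }
  return adminUserIds.has(userId);
}
```

Outside development, an empty list then denies everyone, which turns a missing env var into a visible failure instead of an open route.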


@@ -0,0 +1,284 @@
# Retry, Idempotency, and DLQ Plan
PRM-43 design pass for `growqr-backend`.
No implementation was performed in this pass.
## Goals
- Bound every outbound call with timeouts.
- Retry only safe operations with classified errors.
- Make repeated commands safe through idempotency keys.
- Preserve failed event/workflow work in a DLQ with replay tooling.
- Add logs that let support trace one user action across route, actor, service, Redis, projector, and database writes.
## Outbound Call Site Inventory
| Area | Files | Current behavior | Needed behavior |
| --- | --- | --- | --- |
| Product service clients | `src/services/product-service-clients.ts` | Direct `fetch`, no timeout/retry/idempotency header | Shared service client with timeout, retry, idempotency key, and request id. |
| Service agent probes | `src/services/service-agents.ts` | Direct `fetch`, some fallback summaries | Same shared client; distinguish "unavailable" from retriable failure. |
| Gitea | `src/lib/gitea.ts`, `src/docker/manager.ts`, `src/actors/user-actor.ts` | Direct `fetch`, some wait-for-ready helpers | Retry transient Gitea API errors; idempotent repo/user/file operations. |
| OpenCode | `src/lib/opencode.ts`, `src/workflows/executors/opencode-executor.ts` | Direct `fetch`, health polling, no command dedupe | Timeout and retry health/session/message calls; stable command id for prompts. |
| LLM | `src/lib/llm.ts`, `src/actors/conversation/agent.ts`, `src/events/projectors/projection-agent.ts` | Direct SDK/fetch calls | Timeout, retry on provider transient errors, no retry on content/schema errors. |
| Actor sends | routes, `src/events/route-to-user-actor.ts`, actors | `getOrCreate(...).method(...)`, queue sends | Standard command envelope with idempotency key and correlation ids. |
| Redis consumer | `src/events/redis-consumer.ts` | Loops forever; canonical messages ack in `finally`; no DLQ | Retry budget, pending handling, DLQ stream/table, replay. |
| Projectors | `src/events/projectors/*`, `src/actors/events/user-event-actor.ts` | Called within event actor processing | Per-projector idempotency and failure status; replay from stored Grow Events. |
| Workflow module runner | `src/workflows/module-runner.ts`, `src/actors/workflow-run-actor.ts` | Actor loop retries in one path; direct route execution in another | Actor-only execution, durable command id, retry state in DB. |
## Shared `withRetry` API
Add `src/lib/retry.ts`:
```ts
export type RetryPolicy = {
maxAttempts: number;
baseDelayMs: number;
maxDelayMs: number;
timeoutMs: number;
jitter: boolean;
};
export async function withRetry<T>(
operation: string,
fn: (ctx: { signal: AbortSignal; attempt: number }) => Promise<T>,
options: {
policy?: Partial<RetryPolicy>;
idempotencyKey?: string;
classify?: (error: unknown) => "retry" | "fail";
logFields?: Record<string, unknown>;
},
): Promise<T>;
```
Default policy:
- `maxAttempts: 3`
- `baseDelayMs: 250`
- `maxDelayMs: 5_000`
- `timeoutMs: 10_000`
- jitter enabled
Classification:
- Retry: network errors, abort/timeout, HTTP `408`, `425`, `429`, `500`, `502`, `503`, `504`.
- Do not retry: HTTP `400`, `401`, `403`, `404`, validation/schema errors, duplicate/idempotency conflicts that already completed.
- Special case: `409` may be treated as success for idempotent create-if-absent operations, because the resource already exists.
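The default policy and classification above could be wired together roughly as follows. This is a minimal sketch, not the planned implementation: `httpError`, `defaultClassify`, and `backoffDelayMs` are hypothetical helper names, the backoff uses "full jitter" as one possible choice, and treating `TypeError` as a network failure is an assumption based on how `fetch` reports connection errors. `idempotencyKey` and `logFields` are left out for brevity.

```ts
type RetryPolicy = {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  timeoutMs: number;
  jitter: boolean;
};

const DEFAULT_POLICY: RetryPolicy = {
  maxAttempts: 3,
  baseDelayMs: 250,
  maxDelayMs: 5_000,
  timeoutMs: 10_000,
  jitter: true,
};

const RETRYABLE_STATUSES = new Set([408, 425, 429, 500, 502, 503, 504]);

// Hypothetical helper: an Error that carries the HTTP status for the classifier.
function httpError(status: number): Error & { status: number } {
  return Object.assign(new Error(`HTTP ${status}`), { status });
}

// Assumption: errors with a numeric `status` are HTTP errors; abort/timeout and
// fetch network failures (TypeError) are retried; everything else fails fast.
function defaultClassify(error: unknown): "retry" | "fail" {
  const status = (error as { status?: unknown } | null)?.status;
  if (typeof status === "number") {
    return RETRYABLE_STATUSES.has(status) ? "retry" : "fail";
  }
  const name = error instanceof Error ? error.name : "";
  return name === "AbortError" || name === "TimeoutError" || name === "TypeError"
    ? "retry"
    : "fail";
}

// Exponential backoff capped at maxDelayMs; with jitter, pick a random delay below the cap.
function backoffDelayMs(
  attempt: number,
  policy: RetryPolicy,
  random: () => number = Math.random,
): number {
  const capped = Math.min(policy.maxDelayMs, policy.baseDelayMs * 2 ** (attempt - 1));
  return policy.jitter ? Math.floor(random() * capped) : capped;
}

async function withRetry<T>(
  operation: string,
  fn: (ctx: { signal: AbortSignal; attempt: number }) => Promise<T>,
  options: {
    policy?: Partial<RetryPolicy>;
    classify?: (error: unknown) => "retry" | "fail";
  } = {},
): Promise<T> {
  const policy: RetryPolicy = { ...DEFAULT_POLICY, ...options.policy };
  const classify = options.classify ?? defaultClassify;
  for (let attempt = 1; ; attempt++) {
    // The timeout only takes effect if `fn` passes `signal` to its I/O call.
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), policy.timeoutMs);
    let failure: unknown;
    try {
      return await fn({ signal: controller.signal, attempt });
    } catch (error) {
      failure = error;
    } finally {
      clearTimeout(timer);
    }
    if (attempt >= policy.maxAttempts || classify(failure) === "fail") throw failure;
    console.warn(`${operation}: attempt ${attempt} failed, retrying`);
    await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt, policy)));
  }
}
```

A caller would typically pass `signal` straight into `fetch(url, { signal })` and throw `httpError(res.status)` on a non-2xx response so that the classifier can see the status code.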
## Idempotency Model
Add a command/event idempotency key convention:
```txt
<domain>:<userId>:<entityId>:<operation>:<version>
```
Examples:
- `workflow:user_123:run_456:module:resume:v1`
- `mission:user_123:instance_456:start:v1`
- `service:user_123:interview:configure:session_abc`
- `event:user_123:growEventId:project:qscore:v1`
- `opencode:user_123:run_456:interview-plan:prompt-v4`
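A small helper can keep keys consistent with this convention. The sketch below is illustrative only: `buildIdempotencyKey` and `IdempotencyKeyParts` are hypothetical names, and it assumes that `operation` may itself contain a colon (as in `module:resume`) while no segment may be empty.

```ts
type IdempotencyKeyParts = {
  domain: string;
  userId: string;
  entityId: string;
  operation: string;
  version: string;
};

// Joins the five segments in the documented order and rejects empty segments,
// so that a missing id cannot silently collapse two different commands into one key.
function buildIdempotencyKey(parts: IdempotencyKeyParts): string {
  const segments = [
    parts.domain,
    parts.userId,
    parts.entityId,
    parts.operation,
    parts.version,
  ];
  for (const segment of segments) {
    if (segment.trim() === "") {
      throw new Error("idempotency key segments must be non-empty");
    }
  }
  return segments.join(":");
}

const exampleKey = buildIdempotencyKey({
  domain: "workflow",
  userId: "user_123",
  entityId: "run_456",
  operation: "module:resume",
  version: "v1",
});
// exampleKey === "workflow:user_123:run_456:module:resume:v1"
```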
Where to store:
- `workflowRunModules.idempotencyKey` for module commands.
- `workflowEvents.payload.idempotencyKey` for audit trail.
- `growEvents.dedupeKey` for event ingestion.
- Add a future `idempotency_keys` table only if multiple domains need durable response reuse.
Minimum table design if needed:
```txt
idempotency_keys
key text primary key
domain text not null
user_id text
status text check (processing, completed, failed)
request_hash text
response jsonb
error text
expires_at timestamptz
created_at timestamptz
updated_at timestamptz
```
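To illustrate how the `status`, `request_hash`, and `response` columns might interact, here is an in-memory sketch of the claim/complete flow. It is an assumption about the intended semantics, not a schema-backed implementation: `InMemoryIdempotencyStore`, `ClaimResult`, and the `conflict`/`in_progress` outcomes are hypothetical names, and expiry via `expires_at` is omitted.

```ts
type IdempotencyStatus = "processing" | "completed" | "failed";

type IdempotencyRecord = {
  key: string;
  status: IdempotencyStatus;
  requestHash: string;
  response?: unknown;
  error?: string;
};

type ClaimResult =
  | { action: "execute" } // first attempt, or retry after a recorded failure
  | { action: "replay"; response: unknown } // already completed: reuse the stored response
  | { action: "in_progress" } // another worker holds the key
  | { action: "conflict" }; // same key, different request body

class InMemoryIdempotencyStore {
  private readonly records = new Map<string, IdempotencyRecord>();

  claim(key: string, requestHash: string): ClaimResult {
    const existing = this.records.get(key);
    if (existing && existing.requestHash !== requestHash) return { action: "conflict" };
    if (existing?.status === "completed") {
      return { action: "replay", response: existing.response };
    }
    if (existing?.status === "processing") return { action: "in_progress" };
    // No record yet, or the previous attempt failed: (re)claim the key.
    this.records.set(key, { key, status: "processing", requestHash });
    return { action: "execute" };
  }

  complete(key: string, response: unknown): void {
    const record = this.mustGet(key);
    record.status = "completed";
    record.response = response;
  }

  fail(key: string, error: string): void {
    const record = this.mustGet(key);
    record.status = "failed";
    record.error = error;
  }

  private mustGet(key: string): IdempotencyRecord {
    const record = this.records.get(key);
    if (!record) throw new Error(`unknown idempotency key: ${key}`);
    return record;
  }
}
```

A Postgres-backed version would need the claim step to be atomic, for example a single `INSERT ... ON CONFLICT DO NOTHING` followed by a read of the existing row, so that two concurrent workers cannot both receive `execute`.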
## HTTP Service Client Plan
Create `src/services/http-client.ts`:
- Accepts `baseUrl`, `path`, `method`, `json`, `headers`, `idempotencyKey`, `operation`, `timeoutMs`.
- Adds:
- `authorization: Bearer <A2A_ALLOWED_KEY>` when configured.
- `x-request-id`
- `x-idempotency-key` or `idempotency-key` (choose one header name and use it for every downstream service).
- `x-growqr-user` when user-scoped.
- Uses `withRetry`.
- Parses text once and returns typed JSON.
- Logs attempt, latency, status, and error class.
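The header and parsing rules above might look like this in isolation. This is a sketch under assumptions: `buildServiceHeaders`, `parseJsonBody`, and `ServiceRequestContext` are hypothetical names, the `idempotency-key` spelling is picked arbitrarily from the two options listed, and the caller is assumed to supply the request id and the configured `A2A_ALLOWED_KEY` value.

```ts
type ServiceRequestContext = {
  requestId: string;
  apiKey?: string; // value of A2A_ALLOWED_KEY when configured
  idempotencyKey?: string;
  userId?: string; // set only for user-scoped calls
};

// Builds the outbound headers; optional headers are added only when a value is present.
function buildServiceHeaders(ctx: ServiceRequestContext): Record<string, string> {
  const headers: Record<string, string> = { "x-request-id": ctx.requestId };
  if (ctx.apiKey) headers["authorization"] = `Bearer ${ctx.apiKey}`;
  if (ctx.idempotencyKey) headers["idempotency-key"] = ctx.idempotencyKey;
  if (ctx.userId) headers["x-growqr-user"] = ctx.userId;
  return headers;
}

// Parses a response body that has already been read as text exactly once.
// An empty body (for example on 204) yields undefined instead of a parse error.
function parseJsonBody<T>(text: string): T | undefined {
  return text.length === 0 ? undefined : (JSON.parse(text) as T);
}
```

Reading the body as text first keeps the raw payload available for error logs when JSON parsing fails.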
Then migrate:
1. `product-service-clients.ts`
2. `service-agents.ts`
3. mission route direct user-service fetch
4. workflow service health checks
## Workflow Retry Plan
Target behavior:
- Routes enqueue commands to `workflowRunActor`; routes do not call `executeWorkflowModule` directly.
- `workflowRunActor` writes command state before execution.
- `executeWorkflowModule` receives `idempotencyKey` and passes it to service/OpenCode calls.
- On failure, increment `workflowRunModules.retryCount`, store `error`, and emit `workflowEvents` with `retryAttempt`.
- Exceeding the retry budget marks the module `blocked` or `failed`, depending on module type, and writes a DLQ row/event.
Module status transition:
```mermaid
stateDiagram-v2
[*] --> idle
idle --> queued
queued --> running
running --> done
running --> retry_wait
retry_wait --> running
running --> blocked
running --> dlq
dlq --> replaying
replaying --> running
```
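The diagram can be encoded as a transition table so that the actor rejects illegal status changes before writing to the database. This is a sketch that mirrors only the edges drawn above; `ModuleStatus`, `ALLOWED_TRANSITIONS`, `canTransition`, and `assertTransition` are hypothetical names.

```ts
type ModuleStatus =
  | "idle"
  | "queued"
  | "running"
  | "done"
  | "retry_wait"
  | "blocked"
  | "dlq"
  | "replaying";

// One entry per state; the arrays list the states reachable in a single step.
const ALLOWED_TRANSITIONS: Record<ModuleStatus, readonly ModuleStatus[]> = {
  idle: ["queued"],
  queued: ["running"],
  running: ["done", "retry_wait", "blocked", "dlq"],
  retry_wait: ["running"],
  done: [],
  blocked: [],
  dlq: ["replaying"],
  replaying: ["running"],
};

function canTransition(from: ModuleStatus, to: ModuleStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}

// Throws instead of silently writing an invalid status.
function assertTransition(from: ModuleStatus, to: ModuleStatus): void {
  if (!canTransition(from, to)) {
    throw new Error(`illegal module status transition: ${from} -> ${to}`);
  }
}
```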
## Redis Consumer and DLQ Plan
Do not ack canonical Redis messages until one of these is true:
- event persisted and routed/projected successfully;
- event persisted but routing failed and a durable retry record was created;
- message moved to DLQ after retry budget.
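Expressed as code, the rule replaces "ack in `finally`" with "ack only for a durable outcome". The sketch below is a simplified model, not the consumer itself: `ProcessingOutcome`, `shouldAck`, and `consumeOne` are hypothetical names, and `handle` and `ack` stand in for the real processing and Redis acknowledgement calls.

```ts
type StreamMessage = { id: string; payload: unknown };

type ProcessingOutcome =
  | { kind: "processed" } // persisted and routed/projected
  | { kind: "retry_recorded"; retryRecordId: string } // persisted, durable retry record exists
  | { kind: "moved_to_dlq"; dlqId: string } // retry budget exhausted
  | { kind: "failed"; error: string }; // nothing durable was recorded

// Only outcomes that leave a durable trace are safe to acknowledge.
function shouldAck(outcome: ProcessingOutcome): boolean {
  return outcome.kind !== "failed";
}

// Returns true when the message was acked, false when it stays pending for redelivery.
async function consumeOne(
  message: StreamMessage,
  handle: (message: StreamMessage) => Promise<ProcessingOutcome>,
  ack: (id: string) => Promise<void>,
): Promise<boolean> {
  let outcome: ProcessingOutcome;
  try {
    outcome = await handle(message);
  } catch (error) {
    outcome = { kind: "failed", error: String(error) };
  }
  if (!shouldAck(outcome)) return false;
  await ack(message.id);
  return true;
}
```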
Add DLQ options:
1. Redis stream DLQ: `grow.events.dlq`
2. Postgres table: `grow_event_dlq`
Recommendation: use both.
- Redis DLQ for operational stream tooling.
- Postgres DLQ for admin UI, audit, and replay metadata.
DLQ row fields:
```txt
id
source_stream
source_message_id
payload
error
attempts
last_attempt_at
status: pending | replaying | replayed | discarded
created_at
updated_at
```
Replay script:
```txt
pnpm events:replay --status failed --limit 100
pnpm events:replay --dlq --id <dlq-id>
pnpm events:replay --event-id <grow-event-id> --projectors qscore,service-session
```
Script responsibilities:
- Re-read stored payload.
- Re-run `recordGrowEvent` if needed.
- Re-run `routeGrowEventToUserActor`.
- Optionally run only selected projectors.
- Preserve original `dedupeKey`.
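The three invocations above imply a small flag grammar. One way to parse it without a CLI library is sketched here; `parseReplayArgs` and `ReplayOptions` are hypothetical names, and the validation rules (positive integer `--limit`, comma-separated `--projectors`) are assumptions rather than decided behavior.

```ts
type ReplayOptions = {
  status?: string;
  limit?: number;
  dlq: boolean;
  id?: string;
  eventId?: string;
  projectors: string[];
};

// Parses the arguments that follow `pnpm events:replay`.
function parseReplayArgs(argv: string[]): ReplayOptions {
  const options: ReplayOptions = { dlq: false, projectors: [] };
  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    if (flag === "--dlq") {
      options.dlq = true; // the only flag that takes no value
      continue;
    }
    const value = argv[i + 1];
    if (value === undefined || value.startsWith("--")) {
      throw new Error(`missing value for ${flag}`);
    }
    i++; // consume the value
    switch (flag) {
      case "--status":
        options.status = value;
        break;
      case "--limit": {
        const limit = Number(value);
        if (!Number.isInteger(limit) || limit <= 0) {
          throw new Error(`invalid --limit: ${value}`);
        }
        options.limit = limit;
        break;
      }
      case "--id":
        options.id = value;
        break;
      case "--event-id":
        options.eventId = value;
        break;
      case "--projectors":
        options.projectors = value.split(",").filter(Boolean);
        break;
      default:
        throw new Error(`unknown flag: ${flag}`);
    }
  }
  return options;
}
```

For example, `parseReplayArgs(["--event-id", "evt_1", "--projectors", "qscore,service-session"])` yields `eventId: "evt_1"` and two projector names, which the script can use to run only the selected projectors.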
## Projector Idempotency Plan
Projectors should be repeatable:
- The Q Score latest table already has a `(userId, signalId)` primary key.
- Mission service sessions are already unique on `(serviceId, externalId)`.
- Artifacts should dedupe by `(missionInstanceId, serviceId, externalId, type)` or a stable artifact key.
- Mission stage patches should be applied with deterministic status/progress and no duplicate suggestions.
Add projector event logs:
```txt
grow_event_projector_runs
event_id
projector
status
attempt
error
started_at
completed_at
```
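A run log like this lets a replay skip projectors that already completed for an event. The in-memory sketch below shows the intended skip-and-count behavior; `ProjectorRunLog` and `runOnce` are hypothetical names, and the status values `running`, `completed`, and `failed` are assumptions since the plan does not enumerate them.

```ts
type ProjectorRun = {
  eventId: string;
  projector: string;
  status: "running" | "completed" | "failed";
  attempt: number;
  error?: string;
  startedAt: Date;
  completedAt?: Date;
};

class ProjectorRunLog {
  private readonly runs = new Map<string, ProjectorRun>();

  // Runs the projector unless a completed run already exists for (eventId, projector).
  async runOnce(
    eventId: string,
    projector: string,
    project: () => Promise<void>,
  ): Promise<"skipped" | "completed" | "failed"> {
    const key = `${eventId}:${projector}`;
    const previous = this.runs.get(key);
    if (previous?.status === "completed") return "skipped";

    const run: ProjectorRun = {
      eventId,
      projector,
      status: "running",
      attempt: (previous?.attempt ?? 0) + 1,
      startedAt: new Date(),
    };
    this.runs.set(key, run);

    let result: "completed" | "failed";
    try {
      await project();
      result = "completed";
    } catch (error) {
      result = "failed";
      run.error = String(error);
    }
    run.status = result;
    run.completedAt = new Date();
    return result;
  }

  get(eventId: string, projector: string): ProjectorRun | undefined {
    return this.runs.get(`${eventId}:${projector}`);
  }
}
```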
## Logging Fields
Every route/actor/event/retry log should include as many of these as available:
- `requestId`
- `traceId`
- `userId`
- `orgId`
- `actorType`
- `actorKey`
- `runId`
- `moduleId`
- `missionId`
- `missionInstanceId`
- `stageId`
- `eventId`
- `source`
- `eventType`
- `idempotencyKey`
- `operation`
- `attempt`
- `maxAttempts`
- `latencyMs`
- `httpStatus`
- `retryable`
- `dlqId`
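Because the instruction is "as many of these as available", a helper that merges partial contexts and drops unset values keeps log lines compact. This is a sketch only: `LogFields` and `buildLogFields` are hypothetical names, and the value types are assumptions (ids as strings, counters as numbers).

```ts
type LogFieldValues = {
  requestId: string;
  traceId: string;
  userId: string;
  orgId: string;
  actorType: string;
  actorKey: string;
  runId: string;
  moduleId: string;
  missionId: string;
  missionInstanceId: string;
  stageId: string;
  eventId: string;
  source: string;
  eventType: string;
  idempotencyKey: string;
  operation: string;
  attempt: number;
  maxAttempts: number;
  latencyMs: number;
  httpStatus: number;
  retryable: boolean;
  dlqId: string;
};

type LogFields = { [K in keyof LogFieldValues]?: LogFieldValues[K] | undefined };

// Merges contexts left to right (later sources win) and omits undefined/null values.
function buildLogFields(...sources: LogFields[]): LogFields {
  const merged: Record<string, unknown> = {};
  for (const source of sources) {
    for (const [key, value] of Object.entries(source)) {
      if (value !== undefined && value !== null) merged[key] = value;
    }
  }
  return merged as LogFields;
}
```

A route handler could build the base context once (`requestId`, `userId`) and each retry attempt could layer `attempt`, `latencyMs`, and `httpStatus` on top.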
## Test Plan
Unit tests:
- `withRetry` retries transient errors and stops on non-retryable errors.
- Timeout aborts fetch and logs retry attempt.
- Idempotency key helper returns stable keys.
- HTTP client adds auth, request id, and idempotency headers.
Integration tests:
- Duplicate `/workflow-runs/:runId/modules/:moduleId/run` command does not duplicate service call.
- Duplicate Grow Event with same `dedupeKey` is stored once and projection remains stable.
- Redis message failure is not acked until retry/DLQ path is recorded.
- DLQ replay reprocesses a failed event and updates projector status.
- OpenCode module execution retry does not create duplicate artifact rows.
Manual staging drills:
1. Stop interview service, run interview module, verify retry and blocked/DLQ behavior.
2. Emit duplicate Redis events, verify one `grow_events` row and stable projector state.
3. Break Gitea token, provision stack, verify retry logs and no partial untracked state.
4. Replay a DLQ event, verify mission progress and Q Score update.
## Implementation Order
1. Add `src/lib/retry.ts` and focused unit tests.
2. Add service HTTP client and migrate product service calls.
3. Add workflow command idempotency and route-to-actor queueing.
4. Add Redis DLQ and replay script.
5. Add projector run records.
6. Migrate Gitea/OpenCode/LLM calls to `withRetry`.
7. Add staging failure drills to deployment checklist.