Compare commits
4 Commits
213987a9e0...5f667038d8

| Author | SHA1 | Date |
|---|---|---|
| | 5f667038d8 | |
| | ef5d7bb378 | |
| | d4f9b0edcb | |
| | 01e9cc92d4 | |
84
docs/backend-dead-code.md
Normal file
@@ -0,0 +1,84 @@
# Backend Dead Code Inventory

PRM-46 inventory pass for `growqr-backend`.

No source code was deleted in this pass. The inventory relied on static search and manual inspection, and `pnpm typecheck` ran successfully.

## Summary

The codebase is mostly wired, but it contains several compatibility, demo, and partially superseded paths. The main cleanup risk is accidentally removing code still used by the frontend's older workflow screens or by demo environments.

## Candidates

| Priority | Candidate | Recommendation | Evidence |
| --- | --- | --- | --- |
| High | `src/actors/product-service-actors.ts` | Keep for now; consider deleting only after confirming no Rivet clients call these actors. | Actors are registered in `src/actors/registry.ts`, but local code routes service calls through `src/routes/services.ts` and `src/services/product-service-clients.ts` directly. No local `getOrCreate` references for `interviewServiceActor`, `roleplayServiceActor`, or `resumeServiceActor` were found. |
| High | Legacy `/workflows/job-application*` route aliases in `src/routes/workflows.ts` and large portions of `src/actors/user-actor.ts` workflow state | Keep until frontend migration is verified; likely cleanup after DB-backed workflow runs fully replace it. | `job-application` aliases call `userActor`; newer `/workflow-runs` path uses `workflowRuns`, `workflowRunModules`, and `workflowRunActor`/`executeWorkflowModule`. Two workflow systems coexist. |
| High | `src/workflows/module-runner.ts` synchronous execution from routes | Keep, but consolidate behind `workflowRunActor` before cleanup. | Used both by `workflowRunActor` and directly by route handlers. Direct route use undercuts actor durability, but the module runner itself is active. |
| Medium | `src/workflows/smoke-test.ts` | Keep as script if used manually; otherwise convert to documented test or remove. | Only referenced by `package.json` script `workflows:smoke`; not part of app runtime. |
| Medium | `scripts/rivet-actors.ts` | Keep if used by ops; document or remove if not. | Standalone admin script; not imported by source. It relies on `RIVET_ENDPOINT`, `RIVET_NAMESPACE`, and admin token defaults. |
| Medium | Demo home seeder `src/home/seed-demo-home.ts` and `/home/seed-demo` | Keep in staging/demo only; move behind explicit environment gate. | `src/routes/home.ts` exposes a seed endpoint. Schema has `generatedBy: "demo"` for notifications. This is live source behavior rather than isolated fixture code. |
| Medium | Static fallback mission registry vs persisted registry (`src/missions/registry.ts` and `src/missions/postgres-registry.ts`) | Keep both until migration/backfill is confirmed; then decide whether DB registry or static registry is source of truth. | `routes/missions.ts` reads persisted definitions, while actor factory and conversations read static definitions. `postgres-registry` falls back to static definitions. |
| Medium | Duplicate mission actor wrappers (`career-transition-actor.ts`, `salary-negotiation-war-room-actor.ts`, `promotion-readiness-actor.ts`, `personal-brand-opportunity-engine-actor.ts`) | Keep; low-cost wrappers are active. | Thin wrappers are mapped in routes, registry, event actor, and actor registry. |
| Medium | `src/events/projectors/projection-agent.ts` LLM insight path | Keep, but verify product use. | Referenced by `userEventActor` and `reducer-types`, so not dead. It can silently fall back when no LLM API key exists. |
| Medium | Legacy Redis observers in `src/events/redis-consumer.ts` | Keep until services emit canonical Grow Events. | Comments state these observe existing service A2A traffic. They are enabled by `INTERVIEW_REDIS_URL`, `ROLEPLAY_REDIS_URL`, and `RESUME_REDIS_URL`. |
| Medium | `events` audit table in `src/db/schema.ts` | Keep until old frontend timelines and route writes are audited. | Older user/service paths still import/use `events` table, while newer Grow Event tables also exist. |
| Low | `src/workflows/registry.ts` and `src/missions/registry.ts` duplicate product concepts | Keep; consolidate later. | Workflows are commercial product definitions; missions are actor-backed variants. The overlap is intentional but duplicative. |
| Low | `docker/opencode/workspace-template/*/README.md` placeholders | Keep as template docs or remove if generated workspaces no longer need empty folders. | Template-only files are not runtime code, but useful for preserving folder structure. |
| Low | `docs/architecture.html` | Keep unless replaced by Markdown architecture docs. | Existing doc artifact, not source. |

## Unused or Underused Env Vars / Config Values

| Env/config | Recommendation | Evidence |
| --- | --- | --- |
| `config.required` | Keep or remove after scanning call sites; currently exported but not used in local source. | `required` is attached to config, but no local `config.required(` references were found. |
| `clerkPublishableKey` | Keep if clients read backend config elsewhere; otherwise remove from backend config. | Defined in `config.ts` and `.env.example`, but backend auth uses secret key. |
| `opencodeApiKey` | Keep only if future direct OpenCode auth requires it; currently `llmApiKey` consumes `OPENCODE_API_KEY`. | Defined separately in config; most OpenCode runtime calls use per-container password, not this field. |
| `userServiceUrl` | Keep; used by missions profile lookup. | `routes/missions.ts` fetches `/api/v1/users/me`. |
| `legacyServiceTaskObserverGroup` | Keep while legacy Redis observers exist. | Used in `redis-consumer.ts`. |
| `migrationVersion`, `promptVersion`, `opencodeImageVersion` | Keep; active Docker rollout labels. | Used by `docker/manager.ts` and Docker build metadata. |

## Stale or Demo-Oriented Behavior

- Demo-generated home notifications and `/home/seed-demo` should move to a staging/demo module or be guarded by `config.environment`.
- `service-agents.ts` includes demo-like defaults, such as `formula_version: "workflow-demo"` and synthetic Q Score fallback summaries.
- `config.ts` defaults many production-sensitive values to local/dev values, including Gitea admin credentials, service token fallback, A2A key, and localhost URLs.
- Docker/OpenCode scripts are active but dev-biased, using image tags like `growqr/opencode:dev`.

## Prompt Workflow Inventory

All prompt workflow files under `prompts/workflows/*` are referenced by `src/workflows/registry.ts` through `promptPath` values:

- `career-transition/orchestrator.md`
- `interview-to-offer/interview-plan.md`
- `salary-negotiation-war-room/orchestrator.md`
- `promotion-readiness/orchestrator.md`
- `personal-brand-opportunity-engine/orchestrator.md`

Additional interview-to-offer prompt files (`resume-analysis.md`, `story-bank.md`, `final-readiness-report.md`) were not found to be referenced directly by `workflowDefinitions` in this pass. Recommendation: keep them until OpenCode/agent prompt loading is audited, then either wire them into module definitions or archive them.

## Delete/Keep Decisions Before Cleanup

Do not delete yet:

- `userActor` workflow code
- `product-service-actors`
- static mission/workflow registries
- Redis legacy observers
- demo home seeder
- standalone scripts

Good first cleanup after approval:

1. Move demo seeding to `src/staging` and guard it with a staging/demo environment.
2. Remove or document unused config fields (`config.required`, `clerkPublishableKey`, `opencodeApiKey`) after a second pass across frontend/deployment references.
3. Convert `workflows:smoke` into a real test or delete the script.
4. Consolidate mission actor type mapping into one helper and remove duplicate mapping functions.

## Verification

`pnpm typecheck` passed:

```txt
tsc -p tsconfig.json --noEmit
```

179
docs/backend-organization-audit.md
Normal file
@@ -0,0 +1,179 @@
# Backend Organization Audit

PRM-41 audit pass for `growqr-backend`.

Scope reviewed: `src/routes`, `src/actors`, `src/events`, `src/missions`, `src/workflows`, and `src/services`.

## Executive Summary

The backend currently has three overlapping orchestration layers:

1. HTTP routes that directly perform database writes, service calls, and some synchronous workflow execution.
2. Rivet actors that own durable user, workflow, mission, conversation, memory, and event processing state.
3. Event/projector code that normalizes service events into Grow Events, updates mission state, records service sessions, and projects Q Score signals.

That split is workable for a demo-stage backend, but it blurs ownership. Several routes contain business logic that should live in services or actors, while actors and event consumers need stronger idempotency, retry, and replay boundaries before production traffic.

## High-Level Architecture

```mermaid
flowchart LR
  FE[Frontend / service clients] --> Hono[Hono routes]
  Hono --> DB[(Postgres / Drizzle)]
  Hono --> Rivet[Rivet actors]
  Hono --> Svc[Product services]
  Hono --> Docker[Docker + Gitea + OpenCode]

  Svc --> Redis[Redis streams / pubsub]
  Redis --> Consumer[events/redis-consumer]
  Consumer --> GrowEvents[(grow_events)]
  Consumer --> EventActor[userEventActor]
  EventActor --> MissionActors[mission actors]
  EventActor --> Projectors[QScore/session/projectors]
  MissionActors --> DB

  Rivet --> DB
  Rivet --> Svc
  Rivet --> Docker
```

## Route to Actor/Service/Event/Data Flow Map

| Route module | Mounted path | Primary flow | Actor/service/data dependencies | Notes |
| --- | --- | --- | --- | --- |
| `src/routes/actors.ts` | `/actors` | Auth-gated user stack control | `docker/manager`, `actors` table | Provisions/stops OpenCode stack directly from route. |
| `src/routes/agents.ts` | `/agents` | Catalog read | `agents/catalog` | Thin route. |
| `src/routes/chat.ts` | `/api/chat` | Chat request, Rivet first, direct LLM fallback | `userActor`, `lib/llm`, `services/service-agents` | Contains fallback tool orchestration and timeout logic in route. |
| `src/routes/conversations.ts` | `/conversations` | Conversation CRUD/chat/mission bridging | `conversationActor`, mission actors, `grow_conversations`, messages | Heavy route; mixes persistence, actor bootstrapping, mission resolution, and response shaping. |
| `src/routes/events.ts` | `/events` | User/service event ingestion and listing | `recordGrowEvent`, `routeGrowEventToUserActor`, `grow_events` | Good ingestion boundary, but service auth is environment-sensitive. |
| `src/routes/git.ts` | `/git` | Repo/file operations | `docker/manager`, `GiteaClient` | Route owns path safety and repo operation decisions. |
| `src/routes/grow.ts` | `/grow` | Grow bootstrap and active state | `growActor` | Thin actor gateway. |
| `src/routes/home.ts` | `/home` | Home feed, notifications, demo seed | `home-feed`, `seed-demo-home` | Includes demo seeding endpoint. |
| `src/routes/missions.ts` | `/missions` | Mission catalog, start/pause/resume/stage/artifacts/coach | `growActor`, mission actors, user service, mission registry | Heavy route; owns mission selection, profile fallback, actor type mapping, and artifact commands. |
| `src/routes/opencode.ts` | `/opencode` | OpenCode stack/session/message proxy | `docker/manager`, `OpencodeClient` | Directly provisions stack and opens sessions. |
| `src/routes/services.ts` | `/services` | Product service proxy and event recording | `product-service-clients`, `recordGrowEvent`, Q Score onboarding | Very heavy route; contains service-specific payload shaping and event side effects. |
| `src/routes/users.ts` | `/users` | User profile/bootstrap | `auth/clerk`, `users` table, onboarding Q Score | Includes Clerk profile mirroring and onboarding side effects. |
| `src/routes/workflows.ts` | `/workflows`, `/workflow-runs` | Workflow definitions/runs/modules/approvals | `userActor`, `workflowRunActor`, `workflows/module-runner`, DB | Two paths: legacy `userActor` job-application flow and DB-backed workflow runs. |

## Actor Inventory

| Actor | Current role | Main inputs | Outputs/effects | Robustness observations |
| --- | --- | --- | --- | --- |
| `userActor` | Legacy unified user orchestration: chat, memory tools, workflow status, service handoffs, OpenCode/Gitea interactions | `/api/chat`, `/workflows/job-application`, workflow route aliases | Actor state, DB events, service calls, Gitea reads/writes | Very broad responsibilities; failures in service calls often become summaries rather than durable retryable jobs. |
| `workflowRunActor` | Queued workflow module runner | `/workflow-runs/:runId/pause\|resume` and direct client use | `workflowRunModules`, `workflowEvents`, `qscoreSnapshots` via module runner | Has Rivet loop retry settings for module execution, but route-level `/run` bypasses actor queue and executes synchronously. |
| `conversationActor` | Durable streaming conversation state | `/conversations` | Actor state and generated messages | Queue usage exists for messages; needs documented idempotency per turn/message id. |
| `memoryActor` | Durable memory file state | Internal client use | Actor state/file-like memory | Queue writes exist; external call idempotency unclear. |
| `growActor` | Active mission list/state control | `/grow`, `/missions` | `grow_active_missions`, mission state | Mission lifecycle split across growActor, mission actors, and routes. |
| `userEventActor` | Routes normalized Grow Events to missions/projectors | Redis consumer, `/events` ingestion | Mission stage patches, projector DB updates, event status | Central point for event idempotency, but retries/replay/DLQ are not yet formalized. |
| Mission actors | Per-mission state machines | `/missions`, `/conversations`, event actor | `grow_active_missions`, artifacts, suggestions | Four mission actors are thin factory wrappers; interview-to-offer has custom implementation. |
| Product service actors | Actor wrappers for interview/roleplay/resume clients | Registry only; possible client use | Service calls | Registered, but routes call clients directly. These may be underused compared to direct service proxy routes. |

## Event and Projector Flow

```mermaid
sequenceDiagram
  participant Service as Product service
  participant Redis as Redis stream/pubsub
  participant Route as /events or service routes
  participant Store as grow_events
  participant UserEvent as userEventActor
  participant Mission as mission actor
  participant Projection as projectors

  Service->>Redis: canonical GrowEvent or legacy task response
  Redis->>Route: redis-consumer normalizes message
  Route->>Store: recordGrowEvent with dedupeKey
  Route->>UserEvent: routeGrowEventToUserActor
  UserEvent->>Mission: apply reducer-derived stage patches
  UserEvent->>Projection: service session and Q Score projections
  Projection->>Store: update projection tables
```

Current event strengths:

- `normalizeGrowEvent` accepts multiple service field conventions.
- `recordGrowEvent` uses `dedupeKey` and a unique index on `grow_events.dedupe_key`.
- Legacy Redis observer bridges `tasks:*` and `responses:*` without service changes.
- Projector surfaces exist for session tracking, Q Score, and LLM-derived insights.

Current event gaps:

- Redis canonical consumer always `xAck`s in `finally`, even when `recordAndRoute` fails, so failed messages do not remain pending for retry.
- No DLQ stream/table for failed canonical or legacy event processing.
- No replay script for `grow_events.processing_status in ('failed', 'unresolved')`.
- Legacy task context is in-memory only, so response events can lose user/action context after a backend restart.

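The first two gaps above could be closed with a handler shaped roughly like the sketch below. It is illustrative only: `handleStreamMessage`, `StreamDeps`, `sendToDlq`, and `ack` are hypothetical names, and only `recordAndRoute` corresponds to something named in this audit. The point is the ordering: acknowledge a message only after it is durably recorded or durably parked in a DLQ.

```ts
// Hypothetical sketch: ack a stream message only after it is recorded or dead-lettered.
// None of these names are taken from redis-consumer.ts except recordAndRoute.
type StreamMessage = { id: string; payload: unknown };

type StreamDeps = {
  recordAndRoute: (msg: StreamMessage) => Promise<void>;
  sendToDlq: (msg: StreamMessage, reason: string) => Promise<void>; // assumed DLQ writer
  ack: (id: string) => Promise<void>; // assumed wrapper around the stream ack call
};

type Outcome = "processed" | "dead-lettered" | "left-pending";

async function handleStreamMessage(msg: StreamMessage, deps: StreamDeps): Promise<Outcome> {
  try {
    await deps.recordAndRoute(msg);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    try {
      await deps.sendToDlq(msg, reason);
    } catch {
      // The DLQ write failed too: skip the ack so the message stays pending for retry.
      return "left-pending";
    }
    await deps.ack(msg.id);
    return "dead-lettered";
  }
  await deps.ack(msg.id);
  return "processed";
}
```

Compared with acking in `finally`, this keeps a failed message either in the DLQ or in the pending list, so it is never silently dropped.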
## Business Logic in Routes

Highest concentration:

- `src/routes/services.ts`: service-specific request construction, event emission, Q Score baseline/onboarding side effects, mission association, and UI response shaping.
- `src/routes/workflows.ts`: run creation, module row initialization, baseline Q Score, approval gate progression, artifact content lookup, and synchronous module execution.
- `src/routes/missions.ts`: mission profile lookup from user service, actor type mapping, start/resume/pause/stage/artifact commands, and coach run orchestration.
- `src/routes/conversations.ts`: active conversation persistence, mission-aware chat routing, actor fallback behavior, and response normalization.
- `src/routes/chat.ts`: Rivet fallback, direct LLM tool loop, service agent selection, and timeout handling.

Low-risk thin routes:

- `src/routes/agents.ts`
- `src/routes/grow.ts`
- parts of `src/routes/events.ts`

Recommended ownership target:

- Routes validate/authenticate and translate HTTP to commands.
- Actors own durable user/mission/workflow progression.
- Services own outbound HTTP details.
- Projectors own derived read models.
- Routes should not decide retry, idempotency, or service fallback behavior beyond returning HTTP errors.

## Idempotency Gaps

| Area | Existing behavior | Gap |
| --- | --- | --- |
| Grow Event ingestion | `dedupeKey` unique index; normalizer uses explicit key or source id | Service routes do not consistently set stable dedupe keys for all service-created side effects. |
| Workflow runs | `/workflow-runs/:runId/modules/:moduleId/run` reads `idempotency-key` header | `executeWorkflowModule` does not use the key to suppress duplicate service calls; `/run` generates timestamp keys. |
| Workflow module rows | Has `idempotencyKey`, `retryCount`, `maxRetries` columns | Counters are mostly passive; no central retry state machine. |
| Actor queues | Rivet queues and `loop` step names provide some dedupe for `workflowRunActor` | Several routes bypass actor queue and execute directly. |
| Service session creation | `stableUuid` exists in service-agent helper | Not consistently used as a request id/idempotency key across service calls. |
| OpenCode artifacts | `onConflictDoNothing` for workflow artifacts | OpenCode prompt/message send can duplicate work before artifact row conflict applies. |

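To make the "suppress duplicate service calls" gap concrete, here is a minimal in-memory sketch of key-based deduplication. `IdempotentRunner` is a hypothetical name, and the `Map` is a stand-in: a real implementation would presumably persist the key (for example on the workflow module row) so that it survives restarts and works across processes.

```ts
// Hypothetical sketch: run a piece of work at most once per idempotency key.
// In-memory only; a durable store would be needed for real deduplication.
class IdempotentRunner<T> {
  private readonly results = new Map<string, Promise<T>>();

  run(key: string, work: () => Promise<T>): Promise<T> {
    const existing = this.results.get(key);
    if (existing) return existing; // duplicate call: reuse the first call's result

    const started = work().catch((err) => {
      // Forget failures so a later retry with the same key can run again.
      this.results.delete(key);
      throw err;
    });
    this.results.set(key, started);
    return started;
  }
}
```

With this shape, two requests carrying the same `idempotency-key` header would share one downstream call, while a failed call does not permanently block the key.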
## Retry Gaps

| Area | Existing behavior | Gap |
| --- | --- | --- |
| `workflowRunActor` | Rivet `loop` has `retryBackoffBase` and `retryBackoffMax` | Only applies when execution goes through actor loop. |
| HTTP service clients | Throw on non-2xx after `fetch` | No timeout, retry classification, request id, or backoff. |
| Gitea client | Some wait/poll helpers exist | Most API calls are single-shot. |
| OpenCode client | Health polling exists | Session/message calls are single-shot. |
| Redis consumer | Infinite loop catches top-level errors | Per-message failures are acked; no retry budget or DLQ. |
| Projectors | Called by event actor | Projector failures need durable retry/replay semantics and status transitions. |

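As a rough illustration of what a shared wrapper for the single-shot clients above might look like, the sketch below combines a per-attempt timeout, retry classification, and capped exponential backoff. All names and option fields (`withRetry`, `withTimeout`, `RetryOptions`) are assumptions, not existing backend code.

```ts
// Hypothetical sketch of a shared outbound wrapper; names and fields are assumptions.
type RetryOptions = {
  attempts: number; // total tries, including the first
  baseDelayMs: number; // backoff base, doubled after each failed attempt
  maxDelayMs: number; // backoff cap
  timeoutMs: number; // per-attempt time limit
  isRetryable: (err: unknown) => boolean; // error classification hook
};

// Rejects if the work does not settle in time. Note: this does not cancel the
// underlying work; a real client would also pass an abort signal to fetch.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${timeoutMs}ms`)), timeoutMs);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

async function withRetry<T>(op: (attempt: number) => Promise<T>, opts: RetryOptions): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.attempts; attempt++) {
    try {
      return await withTimeout(op(attempt), opts.timeoutMs);
    } catch (err) {
      lastError = err;
      // Stop on the last attempt or on errors classified as non-retryable.
      if (attempt === opts.attempts || !opts.isRetryable(err)) break;
      const delay = Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** (attempt - 1));
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Retrying is only safe for idempotent operations, so a wrapper like this would normally be paired with a stable idempotency key on the request.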
## Actor Robustness Gaps

- `userActor` is too broad to reason about failure domains. It owns chat, service tools, memory, workflow, Gitea, OpenCode, and DB event writes.
- Product service actors are registered but not the primary path for service proxy routes, so actor-level durability is uneven.
- Mission actor mapping is manually duplicated in routes, registry, and event actor.
- Route-level synchronous workflow execution can hold HTTP requests open across slow service/OpenCode calls.
- Actor initialization is repeated in routes; a central actor gateway could enforce init/idempotency/logging.

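For the duplicated mission actor mapping, one possible shape is a single typed table that routes, the registry, and the event actor all import. The mission ids below mirror the prompt workflow directory names; the actor type strings and helper names are hypothetical placeholders, since the real export names were not part of this audit.

```ts
// Hypothetical single source of truth for mission id -> actor type.
// The actor type strings are placeholders, not the real registry names.
const MISSION_ACTOR_TYPES = {
  "career-transition": "careerTransitionActor",
  "interview-to-offer": "interviewToOfferActor",
  "salary-negotiation-war-room": "salaryNegotiationWarRoomActor",
  "promotion-readiness": "promotionReadinessActor",
  "personal-brand-opportunity-engine": "personalBrandOpportunityEngineActor",
} as const;

type MissionId = keyof typeof MISSION_ACTOR_TYPES;
type MissionActorType = (typeof MISSION_ACTOR_TYPES)[MissionId];

// Own-property check so inherited keys like "toString" are rejected.
function isMissionId(value: string): value is MissionId {
  return Object.prototype.hasOwnProperty.call(MISSION_ACTOR_TYPES, value);
}

function resolveMissionActorType(missionId: string): MissionActorType {
  if (!isMissionId(missionId)) {
    throw new Error(`Unknown mission id: ${missionId}`);
  }
  return MISSION_ACTOR_TYPES[missionId];
}
```

Adding a mission then means touching one table, and an unknown id fails loudly in one place instead of falling through several hand-written switch statements.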
## Priority-Ranked Recommendations

1. Create a backend command layer for route-to-actor/service translation. Move mission start, workflow run, approval, service configure, and chat tool dispatch logic out of routes.
2. Make `workflowRunActor` the only executor for workflow modules. Routes should enqueue commands and return command ids.
3. Add a shared outbound `withRetry`/timeout/idempotency wrapper for service clients, Gitea, OpenCode, and LLM calls.
4. Add DLQ and replay support for Redis/event processing. Do not ack canonical Redis messages until durable record/projector status is successful or DLQ-ed.
5. Normalize mission actor mapping into a single registry source used by routes, event actor, and mission registry.
6. Split `userActor` responsibilities: chat/memory/workflow/OpenCode paths should be smaller actors or delegated services with explicit contracts.
7. Convert route-created side effects to stable idempotency keys. Use request id, user id, mission instance id, service id, and operation name.
8. Add structured logging fields across routes/actors/events: `requestId`, `userId`, `missionInstanceId`, `runId`, `moduleId`, `eventId`, `idempotencyKey`, `retryAttempt`.
9. Add focused tests around duplicate workflow module run, duplicate service event ingest, Redis failure handling, and mission projector replay.

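Recommendation 7 could start from a small key builder like the sketch below. The function name, field names, and the `:`-delimited format are assumptions; the only requirement taken from this audit is that the key is derived from request id, user id, mission instance id, service id, and operation name, so that a retried request yields the same key.

```ts
// Hypothetical sketch: derive a stable idempotency key from request context.
type IdempotencyParts = {
  operation: string;
  userId: string;
  requestId: string;
  serviceId?: string;
  missionInstanceId?: string;
};

function buildIdempotencyKey(parts: IdempotencyParts): string {
  const ordered = [
    parts.operation,
    parts.userId,
    parts.serviceId ?? "", // missing optional parts become empty segments
    parts.missionInstanceId ?? "",
    parts.requestId,
  ];
  // encodeURIComponent escapes ":" so the delimiter stays unambiguous.
  return ordered.map((part) => encodeURIComponent(part)).join(":");
}

const exampleKey = buildIdempotencyKey({
  operation: "workflow.module.run",
  userId: "u1",
  requestId: "r1",
});
// exampleKey === "workflow.module.run:u1:::r1"
```

The key point is that no timestamp or random value enters the key; the request id has to come from the client or an upstream gateway for retries to collapse onto one key.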
## Suggested Next Slice

Use PRM-43 to introduce shared retry/idempotency primitives first. Then return to this audit and migrate the highest-risk route logic in this order:

1. `/workflow-runs/*/run`
2. `/services/interview|roleplay configure/review`
3. `/missions/:missionId/start`
4. `/api/chat` direct LLM fallback

148
docs/environment-matrix.md
Normal file
@@ -0,0 +1,148 @@
# Environment Matrix

PRM-42 staging vs production separation inventory for `growqr-backend`.

No refactor was performed in this pass.

## Current Environment Model

The backend currently uses `config.nodeEnv` plus many individual env vars. There is no explicit first-class `environment` value such as `development | staging | production | demo`.

Important consequence: local/dev defaults can leak into staging or production unless deployment env vars override every sensitive value.

## Current Config Inventory

| Area | Config/env | Current default | Production concern |
| --- | --- | --- | --- |
| Runtime | `PORT`, `LOG_LEVEL`, `NODE_ENV` | `4000`, `info`, `development` | `NODE_ENV` is too broad for staging/demo behavior. |
| Database | `DATABASE_URL` | hardcoded fallback DSN in `config.ts` | Production should fail fast instead of falling back. |
| Auth | `CLERK_SECRET_KEY`, `CLERK_PUBLISHABLE_KEY` | empty | Secret key absence changes auth behavior; publishable key appears underused. |
| Service auth | `SERVICE_TOKEN`, `A2A_ALLOWED_KEY` | empty / `dev-a2a-key` | Dev token fallback must not be accepted in production. |
| Redis events | `GROW_EVENTS_REDIS_URL`, `REDIS_URL`, stream/group/consumer names | disabled unless set | Staging/prod need explicit stream, group, and replay policy. |
| Legacy Redis | `INTERVIEW_REDIS_URL`, `ROLEPLAY_REDIS_URL`, `RESUME_REDIS_URL` | fallback to event Redis | Legacy observation should be explicitly enabled per environment. |
| LLM | `LLM_PROVIDER`, `LLM_API_KEY`, `OPENCODE_API_KEY`, `LLM_BASE_URL`, `GROW_AGENT_MODEL`, `LLM_MODEL` | `opencode`, `https://opencode.ai/zen/v1`, `kimi-k2.6` | Staging/prod should pin provider/model and require API key where features are enabled. |
| Rivet | `RIVET_ENDPOINT`, `RIVET_CLIENT_ENDPOINT` | localhost/127.0.0.1 | Docker compose overrides endpoint; production needs internal and public separation. |
| Product services | `INTERVIEW_SERVICE_URL`, `ROLEPLAY_SERVICE_URL`, `QSCORE_SERVICE_URL`, `RESUME_SERVICE_URL`, `USER_SERVICE_URL`, `MATCHMAKING_SERVICE_URL`, `SOCIAL_BRANDING_SERVICE_URL` | localhost ports | Production should require service URLs or feature-disable explicitly. |
| Public URLs | `INTERVIEW_PUBLIC_URL`, `ROLEPLAY_PUBLIC_URL`, `RESUME_PUBLIC_URL`, `WORKFLOWS_DASHBOARD_URL`, `FRONTEND_ORIGIN` | localhost/frontend fallback | Public and internal service URLs need separate semantics. |
| Gitea | `GITEA_PUBLIC_URL`, `GITEA_INTERNAL_URL`, `GITEA_ADMIN_USER`, `GITEA_ADMIN_PASSWORD`, `GITEA_ADMIN_TOKEN`, `GITEA_ORG_NAME` | localhost, `growqr-admin`, `growqr-admin-dev`, empty token | Admin password fallback is dev-only. Production should require token/secret. |
| OpenCode | `OPENCODE_IMAGE`, `OPENCODE_IMAGE_VERSION`, `MIGRATION_VERSION`, `PROMPT_VERSION`, `USER_CONTAINER_HOST`, `USER_DATA_ROOT`, `USER_PORT_RANGE_*` | dev image/version, local paths/ports | Needs staging/prod image tags and storage policy. |
| CORS/admin | `FRONTEND_ORIGIN`, `ADMIN_USER_IDS` | localhost / empty | An empty admin list currently opens `/workflows/admin/ops` to all authenticated users. |
| Agent limits | `MAX_AGENT_TOKENS`, `PROJECTION_AGENT_MODEL`, `CONVERSATION_ACTOR_MODEL` | 4096 / agent model | Model overrides should be pinned by environment. |

## Environment-Dependent Code Paths

| File | Behavior |
| --- | --- |
| `src/config.ts` | Central env parsing with dev defaults for database, tokens, local service URLs, Gitea, OpenCode, Rivet, frontend, and ports. |
| `src/auth/clerk.ts` | In non-production, `A2A_ALLOWED_KEY` is accepted as an auth fallback. Clerk client is only created when `CLERK_SECRET_KEY` exists. |
| `src/index.ts` | Proxies `/api/rivet` only when `process.env.RIVET_ENDPOINT` is set. Starts Redis consumer opportunistically. CORS uses `FRONTEND_ORIGIN`. |
| `src/events/redis-consumer.ts` | Canonical consumer disabled if no Redis URL. Legacy observers enabled by legacy Redis URLs. |
| `src/events/projectors/projection-agent.ts` | Falls back if no LLM API key; model can be overridden by `PROJECTION_AGENT_MODEL`. |
| `src/actors/conversation/agent.ts` | Requires LLM key for streaming; model can be overridden by `CONVERSATION_ACTOR_MODEL`. |
| `src/routes/events.ts` | Service ingest auth allows no service token in non-production. |
| `src/routes/home.ts` | Exposes demo seeding route. |
| `src/home/seed-demo-home.ts` | Demo notifications and executable direct script behavior. |
| `src/services/service-agents.ts` | Synthetic/demo fallbacks for some unavailable services and Q Score estimate behavior. |
| `src/docker/manager.ts` | Uses Gitea/OpenCode image/version/host/path/port config and mutates Docker runtime. |
| `scripts/rivet-actors.ts` | Uses dev Rivet namespace/token defaults. |
| `docker-compose.yml` | Dev compose defaults for Postgres, Gitea, Rivet, backend, services, frontend origins, and OpenCode image. |
| `docker/opencode/*` | Dev-oriented OpenCode image/template behavior. |

## Hardcoded URL and Default Hotspots

- `http://localhost:*` defaults in `src/config.ts`, `.env.example`, `README.md`, and `docker-compose.yml`.
- `http://127.0.0.1:*` defaults for Rivet client, Gitea, and user container host.
- `http://host.docker.internal:*` compose service defaults.
- OpenCode base image `ghcr.io/anomalyco/opencode:latest` in `docker/opencode/Dockerfile`.
- Dev image tag `growqr/opencode:dev`.
- Gitea admin defaults `growqr-admin` / `growqr-admin-dev`.
- A2A fallback `dev-a2a-key`.

## Clerk / JWKS Assumptions

The code uses the Clerk SDK with `CLERK_SECRET_KEY`; there is no explicit JWKS URL configuration in the reviewed backend source. Service-to-service auth is token-based, with dev fallback behavior. The production target should document which of these auth modes apply:

- Clerk session token verification for user requests.
- `SERVICE_TOKEN` for service-to-backend event ingestion.
- Separate internal A2A key for legacy product service calls.
- Optional JWKS validation if services send JWTs instead of opaque service tokens.

## Target Config Model

Introduce:

```ts
type RuntimeEnvironment = "development" | "test" | "staging" | "demo" | "production";
```

Recommended top-level config shape:

```ts
config.environment
config.isProduction
config.isStaging
config.isDemo
config.features.demoDataEnabled
config.features.legacyRedisObserversEnabled
config.features.opencodeProvisioningEnabled
config.features.serviceProxyEnabled
config.urls.internal.*
config.urls.public.*
config.auth.*
config.retry.*
config.events.*
```

Rules:

- Production must fail fast for missing `DATABASE_URL`, `CLERK_SECRET_KEY`, `SERVICE_TOKEN`, `FRONTEND_ORIGIN`, Gitea credentials/token, and any enabled service URL.
- Staging may use staging service URLs and demo data only when `DEMO_DATA_ENABLED=true`.
- Development may keep local defaults.
- Demo behavior should be impossible in production unless an explicit, audited flag is set and the route remains auth/admin-gated.

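The fail-fast rule for production could be sketched as follows. This is an assumption-heavy illustration: the `APP_ENV` variable name, `resolveEnvironment`, and `assertProductionConfig` are hypothetical, and the required list covers only the four variables named verbatim in the rules (Gitea credentials and per-service URLs would need feature-aware checks on top).

```ts
// Hypothetical sketch: resolve the runtime environment and fail fast in production.
type RuntimeEnvironment = "development" | "test" | "staging" | "demo" | "production";

const ENVIRONMENTS: readonly RuntimeEnvironment[] = [
  "development", "test", "staging", "demo", "production",
];

type Env = Record<string, string | undefined>;

// APP_ENV is an assumed variable name; unset falls back to development.
function resolveEnvironment(env: Env): RuntimeEnvironment {
  const raw = env["APP_ENV"] ?? "development";
  const match = ENVIRONMENTS.find((candidate) => candidate === raw);
  if (!match) throw new Error(`Unsupported APP_ENV: ${raw}`);
  return match;
}

const REQUIRED_IN_PRODUCTION = [
  "DATABASE_URL", "CLERK_SECRET_KEY", "SERVICE_TOKEN", "FRONTEND_ORIGIN",
];

// Throws at startup instead of silently using a dev default.
function assertProductionConfig(env: Env): void {
  if (resolveEnvironment(env) !== "production") return;
  const missing = REQUIRED_IN_PRODUCTION.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(", ")}`);
  }
}
```

Calling `assertProductionConfig(process.env)` before building `config` would turn a leaked dev default into a startup failure rather than a runtime surprise.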
## What Should Move to `src/staging`

Proposed `src/staging` candidates:

- `home/seed-demo-home.ts`
- `/home/seed-demo` route handler
- demo notification factories
- demo Q Score formulas/fallback constants in service-agent behavior, if not product-approved
- local-only service session scaffolding helpers
- any future seeders/backfills used only for demos

Suggested layout:

```txt
src/staging/
  demo-home.ts
  demo-qscore.ts
  seed-routes.ts
  guards.ts
```

`src/staging/guards.ts` should expose `requireStagingOrDemo(config)` and fail closed in production.

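A guard along those lines might look like the sketch below. Only the function name `requireStagingOrDemo` comes from this document; the `GuardConfig` shape is an assumed subset of the proposed config, and treating development as rejected here is a deliberate simplification that the team may want to relax for local work.

```ts
// Hypothetical sketch of src/staging/guards.ts; the config shape is assumed.
type RuntimeEnvironment = "development" | "test" | "staging" | "demo" | "production";

type GuardConfig = {
  environment: RuntimeEnvironment;
  features: { demoDataEnabled: boolean };
};

// Fails closed: only an allowlisted environment with the flag set may proceed.
function requireStagingOrDemo(config: GuardConfig): void {
  const allowedEnvironment =
    config.environment === "staging" || config.environment === "demo";
  if (!allowedEnvironment || !config.features.demoDataEnabled) {
    throw new Error(
      `Demo/staging code is disabled in environment "${config.environment}"`,
    );
  }
}
```

Using an allowlist rather than a "not production" check means a new or misspelled environment is rejected by default.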
## Target Environment Matrix

| Behavior | Development | Staging | Demo | Production |
| --- | --- | --- | --- | --- |
| Localhost defaults | Allowed | Not allowed | Not allowed unless local demo | Not allowed |
| Demo seed endpoints | Allowed | Explicit flag + admin | Enabled by flag + admin | Disabled |
| Service token fallback | Allowed | Not allowed | Not allowed | Not allowed |
| Legacy Redis observers | Optional | Explicit flag | Explicit flag | Disable unless migration requires |
| Redis canonical events | Optional | Required for event demos | Required | Required |
| OpenCode image | `:dev` acceptable | pinned staging tag | pinned demo tag | pinned release tag |
| Admin ops route | Any authenticated user may be acceptable | `ADMIN_USER_IDS` required | `ADMIN_USER_IDS` required | `ADMIN_USER_IDS` required |
| Missing Clerk secret | Allowed only for local mock if implemented | Fail | Fail | Fail |
| Gitea admin password default | Allowed | Fail | Fail | Fail |

## Priority Recommendations
|
||||
|
||||
1. Add `APP_ENV` or `GROWQR_ENV` and derive `config.environment`; stop relying on `NODE_ENV` for product behavior.
|
||||
2. Fail fast in staging/production for missing secrets and localhost/default service URLs.
|
||||
3. Move demo seed code into `src/staging` and guard routes with `DEMO_DATA_ENABLED` plus admin check.
|
||||
4. Require `ADMIN_USER_IDS` before enabling `/workflows/admin/ops` outside development.
|
||||
5. Split public URLs and internal URLs in config names consistently across frontend, services, Gitea, Rivet, and OpenCode.
|
||||
6. Add a deployment checklist that records every required env var per environment.
|
||||
7. Make legacy Redis observers an explicit feature flag and set a removal date.
|
||||
284
docs/retry-idempotency-dlq-plan.md
Normal file
@@ -0,0 +1,284 @@
# Retry, Idempotency, and DLQ Plan

PRM-43 design pass for `growqr-backend`.

No implementation was performed in this pass.

## Goals

- Bound every outbound call with timeouts.
- Retry only safe operations with classified errors.
- Make repeated commands safe through idempotency keys.
- Preserve failed event/workflow work in a DLQ with replay tooling.
- Add logs that let support trace one user action across route, actor, service, Redis, projector, and database writes.

## Outbound Call Site Inventory

| Area | Files | Current behavior | Needed behavior |
| --- | --- | --- | --- |
| Product service clients | `src/services/product-service-clients.ts` | Direct `fetch`, no timeout/retry/idempotency header | Shared service client with timeout, retry, idempotency key, and request id. |
| Service agent probes | `src/services/service-agents.ts` | Direct `fetch`, some fallback summaries | Same shared client; distinguish "unavailable" from retriable failure. |
| Gitea | `src/lib/gitea.ts`, `src/docker/manager.ts`, `src/actors/user-actor.ts` | Direct `fetch`, some wait-for-ready helpers | Retry transient Gitea API errors; idempotent repo/user/file operations. |
| OpenCode | `src/lib/opencode.ts`, `src/workflows/executors/opencode-executor.ts` | Direct `fetch`, health polling, no command dedupe | Timeout and retry health/session/message calls; stable command id for prompts. |
| LLM | `src/lib/llm.ts`, `src/actors/conversation/agent.ts`, `src/events/projectors/projection-agent.ts` | Direct SDK/fetch calls | Timeout, retry on provider transient errors, no retry on content/schema errors. |
| Actor sends | routes, `src/events/route-to-user-actor.ts`, actors | `getOrCreate(...).method(...)`, queue sends | Standard command envelope with idempotency key and correlation ids. |
| Redis consumer | `src/events/redis-consumer.ts` | Loops forever; canonical messages ack in `finally`; no DLQ | Retry budget, pending handling, DLQ stream/table, replay. |
| Projectors | `src/events/projectors/*`, `src/actors/events/user-event-actor.ts` | Called within event actor processing | Per-projector idempotency and failure status; replay from stored Grow Events. |
| Workflow module runner | `src/workflows/module-runner.ts`, `src/actors/workflow-run-actor.ts` | Actor loop retries in one path; direct route execution in another | Actor-only execution, durable command id, retry state in DB. |

## Shared `withRetry` API

Add `src/lib/retry.ts`:

```ts
export type RetryPolicy = {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  timeoutMs: number;
  jitter: boolean;
};

export async function withRetry<T>(
  operation: string,
  fn: (ctx: { signal: AbortSignal; attempt: number }) => Promise<T>,
  options: {
    policy?: Partial<RetryPolicy>;
    idempotencyKey?: string;
    classify?: (error: unknown) => "retry" | "fail";
    logFields?: Record<string, unknown>;
  },
): Promise<T>;
```

Default policy:

- `maxAttempts: 3`
- `baseDelayMs: 250`
- `maxDelayMs: 5_000`
- `timeoutMs: 10_000`
- jitter enabled

Classification:

- Retry: network errors, abort/timeout, HTTP `408`, `425`, `429`, `500`, `502`, `503`, `504`.
- Do not retry: HTTP `400`, `401`, `403`, `404`, validation/schema errors, duplicate/idempotency conflicts that already completed.
- Special case: `409` may be success for idempotent create-if-absent operations.
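
The signature, default policy, and classification rules could fit together roughly as follows. This is a sketch, not the planned implementation: `defaultClassify`, `backoffDelayMs`, the duck-typed numeric `status` property on errors, and the choice of full-jitter exponential backoff are all assumptions. The timeout only takes effect if `fn` passes `signal` to its underlying call, and the `409` special case is left to caller-supplied `classify` functions.

```ts
export type RetryPolicy = {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  timeoutMs: number;
  jitter: boolean;
};

export const DEFAULT_POLICY: RetryPolicy = {
  maxAttempts: 3,
  baseDelayMs: 250,
  maxDelayMs: 5_000,
  timeoutMs: 10_000,
  jitter: true,
};

const RETRYABLE_STATUS = new Set([408, 425, 429, 500, 502, 503, 504]);

// Assumption: HTTP failures are thrown as errors carrying a numeric `status`.
export function defaultClassify(error: unknown): "retry" | "fail" {
  const status = (error as { status?: unknown } | null | undefined)?.status;
  if (typeof status === "number") return RETRYABLE_STATUS.has(status) ? "retry" : "fail";
  // fetch reports network failures as TypeError and aborts as AbortError/TimeoutError.
  if (error instanceof TypeError) return "retry";
  if (error instanceof Error && (error.name === "AbortError" || error.name === "TimeoutError")) {
    return "retry";
  }
  return "fail";
}

// Exponential backoff capped at maxDelayMs, with optional full jitter.
export function backoffDelayMs(
  attempt: number,
  policy: RetryPolicy,
  random: () => number = Math.random,
): number {
  const capped = Math.min(policy.maxDelayMs, policy.baseDelayMs * 2 ** (attempt - 1));
  return policy.jitter ? Math.floor(random() * capped) : capped;
}

export async function withRetry<T>(
  operation: string,
  fn: (ctx: { signal: AbortSignal; attempt: number }) => Promise<T>,
  options: {
    policy?: Partial<RetryPolicy>;
    idempotencyKey?: string;
    classify?: (error: unknown) => "retry" | "fail";
    logFields?: Record<string, unknown>;
  } = {},
): Promise<T> {
  const policy: RetryPolicy = { ...DEFAULT_POLICY, ...options.policy };
  const classify = options.classify ?? defaultClassify;
  let lastError: unknown;
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    // A fresh controller per attempt so one timeout does not poison the next try.
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), policy.timeoutMs);
    try {
      return await fn({ signal: controller.signal, attempt });
    } catch (error) {
      lastError = error;
    } finally {
      clearTimeout(timer);
    }
    const retryable = classify(lastError) === "retry";
    console.warn("retry.attempt_failed", {
      ...options.logFields,
      operation,
      attempt,
      maxAttempts: policy.maxAttempts,
      retryable,
      idempotencyKey: options.idempotencyKey,
    });
    if (!retryable || attempt === policy.maxAttempts) break;
    await new Promise<void>((resolve) => setTimeout(resolve, backoffDelayMs(attempt, policy)));
  }
  throw lastError;
}
```

The last error is rethrown unchanged so callers can still inspect the original status or error name after the retry budget is spent.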

## Idempotency Model

Add a command/event idempotency key convention:

```txt
<domain>:<userId>:<entityId>:<operation>:<version>
```

Examples:

- `workflow:user_123:run_456:module:resume:v1`
- `mission:user_123:instance_456:start:v1`
- `service:user_123:interview:configure:session_abc`
- `event:user_123:growEventId:project:qscore:v1`
- `opencode:user_123:run_456:interview-plan:prompt-v4`
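
A small helper could keep keys in this shape consistent across call sites. The sketch below is illustrative: `buildIdempotencyKey` and `IdempotencyKeyParts` are hypothetical names, and the only validation assumed is that segments are non-empty and contain no whitespace. Colons inside `operation` are allowed because the examples above use them (`module:resume`).

```ts
// Hypothetical helper; field names follow the convention above.
export type IdempotencyKeyParts = {
  domain: string;
  userId: string;
  entityId: string;
  operation: string; // may itself contain ":" (for example "module:resume")
  version: string;
};

// Joins the parts in convention order. The same inputs always produce the
// same key, which is what makes a repeated command detectable.
export function buildIdempotencyKey(parts: IdempotencyKeyParts): string {
  const segments = [parts.domain, parts.userId, parts.entityId, parts.operation, parts.version];
  for (const segment of segments) {
    if (segment.length === 0 || /\s/.test(segment)) {
      throw new Error("idempotency key segments must be non-empty and contain no whitespace");
    }
  }
  return segments.join(":");
}
```

For example, `buildIdempotencyKey({ domain: "mission", userId: "user_123", entityId: "instance_456", operation: "start", version: "v1" })` yields the second example key above.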

Where to store:

- `workflowRunModules.idempotencyKey` for module commands.
- `workflowEvents.payload.idempotencyKey` for audit trail.
- `growEvents.dedupeKey` for event ingestion.
- Add a future `idempotency_keys` table only if multiple domains need durable response reuse.

Minimum table design if needed:

```txt
idempotency_keys
  key text primary key
  domain text not null
  user_id text
  status text check (processing, completed, failed)
  request_hash text
  response jsonb
  error text
  expires_at timestamptz
  created_at timestamptz
  updated_at timestamptz
```

## HTTP Service Client Plan

Create `src/services/http-client.ts`:

- Accepts `baseUrl`, `path`, `method`, `json`, `headers`, `idempotencyKey`, `operation`, `timeoutMs`.
- Adds:
  - `authorization: Bearer <A2A_ALLOWED_KEY>` when configured.
  - `x-request-id`
  - `x-idempotency-key` or `idempotency-key`.
  - `x-growqr-user` when user-scoped.
- Uses `withRetry`.
- Parses text once and returns typed JSON.
- Logs attempt, latency, status, and error class.
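
The header and parsing behavior listed above can be sketched as two pure helpers. Everything here is an assumption for illustration: `buildServiceHeaders`, `parseJsonBody`, and the input field names are hypothetical, the request id is supplied by the caller, and `x-idempotency-key` is picked arbitrarily from the two candidate header names.

```ts
// Hypothetical input shape for the shared client's header builder.
export type ServiceHeaderInput = {
  requestId: string;
  serviceToken?: string; // value of A2A_ALLOWED_KEY when configured
  idempotencyKey?: string;
  userId?: string; // set only for user-scoped calls
};

// Builds the headers the plan lists, omitting any that do not apply.
export function buildServiceHeaders(input: ServiceHeaderInput): Record<string, string> {
  const headers: Record<string, string> = { "x-request-id": input.requestId };
  if (input.serviceToken) headers["authorization"] = `Bearer ${input.serviceToken}`;
  if (input.idempotencyKey) headers["x-idempotency-key"] = input.idempotencyKey;
  if (input.userId) headers["x-growqr-user"] = input.userId;
  return headers;
}

// Parses a response body that has already been read as text exactly once.
// An empty body (for example a 204 response) yields undefined rather than a
// parse error. The cast is unchecked; schema validation would happen elsewhere.
export function parseJsonBody<T>(text: string): T | undefined {
  if (text.trim() === "") return undefined;
  return JSON.parse(text) as T;
}
```

Reading the body as text first keeps the raw payload available for logging when `JSON.parse` throws.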

Then migrate:

1. `product-service-clients.ts`
2. `service-agents.ts`
3. mission route direct user-service fetch
4. workflow service health checks

## Workflow Retry Plan

Target behavior:

- Routes enqueue commands to `workflowRunActor`; routes do not call `executeWorkflowModule` directly.
- `workflowRunActor` writes command state before execution.
- `executeWorkflowModule` receives `idempotencyKey` and passes it to service/OpenCode calls.
- On failure, increment `workflowRunModules.retryCount`, store `error`, and emit `workflowEvents` with `retryAttempt`.
- Exceeding retry budget marks module `blocked` or `failed` based on module type and writes a DLQ row/event.

Module status transition:

```mermaid
stateDiagram-v2
    [*] --> idle
    idle --> queued
    queued --> running
    running --> done
    running --> retry_wait
    retry_wait --> running
    running --> blocked
    running --> dlq
    dlq --> replaying
    replaying --> running
```
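
The diagram can be mirrored as a transition table so that invalid status changes are rejected before they reach the database. This is a sketch: `ModuleStatus`, `canTransition`, and `assertTransition` are hypothetical names, and the table contains only the edges drawn above. The `failed` status mentioned in the target behavior has no edge in the diagram, so it is not modeled here.

```ts
export type ModuleStatus =
  | "idle"
  | "queued"
  | "running"
  | "done"
  | "retry_wait"
  | "blocked"
  | "dlq"
  | "replaying";

// One entry per state; the arrays list the states reachable in a single step.
const TRANSITIONS: Record<ModuleStatus, readonly ModuleStatus[]> = {
  idle: ["queued"],
  queued: ["running"],
  running: ["done", "retry_wait", "blocked", "dlq"],
  retry_wait: ["running"],
  done: [],
  blocked: [],
  dlq: ["replaying"],
  replaying: ["running"],
};

export function canTransition(from: ModuleStatus, to: ModuleStatus): boolean {
  return TRANSITIONS[from].includes(to);
}

// Throws on an edge that the diagram does not contain.
export function assertTransition(from: ModuleStatus, to: ModuleStatus): void {
  if (!canTransition(from, to)) {
    throw new Error(`invalid module status transition: ${from} -> ${to}`);
  }
}
```

Calling `assertTransition` inside the actor before each status write would make an out-of-order command fail loudly instead of silently corrupting module state.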

## Redis Consumer and DLQ Plan

Do not ack canonical Redis messages until one of these is true:

- event persisted and routed/projected successfully;
- event persisted but routing failed and a durable retry record was created;
- message moved to DLQ after retry budget.
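
The three ack conditions above reduce to a single decision function that the consumer can call instead of acking in `finally`. The sketch below is illustrative only: `ConsumeOutcome` and `shouldAck` are hypothetical names, and the outcome variants are an assumed way to describe what happened to a message.

```ts
// Hypothetical description of how processing one Redis message ended.
export type ConsumeOutcome =
  | { kind: "processed" } // persisted and routed/projected successfully
  | { kind: "routing_failed"; retryRecordCreated: boolean } // persisted, routing failed
  | { kind: "dead_lettered" } // moved to the DLQ after the retry budget
  | { kind: "failed" }; // nothing durable was recorded

// Returns true only when the message is safe to ack, so a crash before that
// point leaves it pending for redelivery.
export function shouldAck(outcome: ConsumeOutcome): boolean {
  switch (outcome.kind) {
    case "processed":
      return true;
    case "routing_failed":
      return outcome.retryRecordCreated;
    case "dead_lettered":
      return true;
    case "failed":
      return false;
  }
}
```

Keeping the decision pure makes the rule "a failure is not acked until the retry or DLQ path is recorded" directly unit-testable.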

Add DLQ options:

1. Redis stream DLQ: `grow.events.dlq`
2. Postgres table: `grow_event_dlq`

Using both is recommended:

- Redis DLQ for operational stream tooling.
- Postgres DLQ for admin UI, audit, and replay metadata.

DLQ row fields:

```txt
id
source_stream
source_message_id
payload
error
attempts
last_attempt_at
status: pending | replaying | replayed | discarded
created_at
updated_at
```

Replay script:

```txt
pnpm events:replay --status failed --limit 100
pnpm events:replay --dlq --id <dlq-id>
pnpm events:replay --event-id <grow-event-id> --projectors qscore,service-session
```

Script responsibilities:

- Re-read stored payload.
- Re-run `recordGrowEvent` if needed.
- Re-run `routeGrowEventToUserActor`.
- Optionally run only selected projectors.
- Preserve original `dedupeKey`.

## Projector Idempotency Plan

Projectors should be repeatable:

- Q Score latest table already has `(userId, signalId)` primary key.
- Mission service sessions have unique `(serviceId, externalId)`.
- Artifacts should dedupe by `(missionInstanceId, serviceId, externalId, type)` or a stable artifact key.
- Mission stage patches should be applied with deterministic status/progress and no duplicate suggestions.
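
The artifact rule above can be illustrated with a stable key and an upsert. This is a sketch under assumptions: `artifactKey` and `upsertArtifact` are hypothetical names, and the in-memory `Map` only stands in for a unique index on the artifacts table, which is where the real guarantee would have to live.

```ts
// The tuple that identifies one artifact, as listed above.
export type ArtifactIdentity = {
  missionInstanceId: string;
  serviceId: string;
  externalId: string;
  type: string;
};

// JSON-encoding the tuple avoids collisions that a plain join could produce
// when an id itself contains the separator.
export function artifactKey(artifact: ArtifactIdentity): string {
  return JSON.stringify([
    artifact.missionInstanceId,
    artifact.serviceId,
    artifact.externalId,
    artifact.type,
  ]);
}

// Replaying the same event overwrites the existing entry instead of adding a
// second one, so the projector can run any number of times.
export function upsertArtifact<T extends ArtifactIdentity>(
  store: Map<string, T>,
  artifact: T,
): "inserted" | "updated" {
  const key = artifactKey(artifact);
  const outcome = store.has(key) ? "updated" : "inserted";
  store.set(key, artifact);
  return outcome;
}
```

In the database the same effect would come from a unique constraint on the four columns combined with an insert-or-update write.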

Add projector event logs:

```txt
grow_event_projector_runs
  event_id
  projector
  status
  attempt
  error
  started_at
  completed_at
```

## Logging Fields

Every route/actor/event/retry log should include as many of these as available:

- `requestId`
- `traceId`
- `userId`
- `orgId`
- `actorType`
- `actorKey`
- `runId`
- `moduleId`
- `missionId`
- `missionInstanceId`
- `stageId`
- `eventId`
- `source`
- `eventType`
- `idempotencyKey`
- `operation`
- `attempt`
- `maxAttempts`
- `latencyMs`
- `httpStatus`
- `retryable`
- `dlqId`

## Test Plan

Unit tests:

- `withRetry` retries transient errors and stops on non-retryable errors.
- Timeout aborts fetch and logs retry attempt.
- Idempotency key helper returns stable keys.
- HTTP client adds auth, request id, and idempotency headers.

Integration tests:

- Duplicate `/workflow-runs/:runId/modules/:moduleId/run` command does not duplicate service call.
- Duplicate Grow Event with same `dedupeKey` is stored once and projection remains stable.
- Redis message failure is not acked until retry/DLQ path is recorded.
- DLQ replay reprocesses a failed event and updates projector status.
- OpenCode module execution retry does not create duplicate artifact rows.

Manual staging drills:

1. Stop interview service, run interview module, verify retry and blocked/DLQ behavior.
2. Emit duplicate Redis events, verify one `grow_events` row and stable projector state.
3. Break Gitea token, provision stack, verify retry logs and no partial untracked state.
4. Replay a DLQ event, verify mission progress and Q Score update.

## Implementation Order

1. Add `src/lib/retry.ts` and focused unit tests.
2. Add service HTTP client and migrate product service calls.
3. Add workflow command idempotency and route-to-actor queueing.
4. Add Redis DLQ and replay script.
5. Add projector run records.
6. Migrate Gitea/OpenCode/LLM calls to `withRetry`.
7. Add staging failure drills to deployment checklist.