Backend Organization Audit
PRM-41 audit pass for growqr-backend.
Scope reviewed: src/routes, src/actors, src/events, src/missions, src/workflows, and src/services.
Executive Summary
The backend currently has three overlapping orchestration layers:
- HTTP routes that directly perform database writes, service calls, and some synchronous workflow execution.
- Rivet actors that own durable user, workflow, mission, conversation, memory, and event processing state.
- Event/projector code that normalizes service events into Grow Events, updates mission state, records service sessions, and projects Q Score signals.
That split is workable for a demo-stage backend, but it blurs ownership. Several routes contain business logic that should live in services or actors, while actors and event consumers need stronger idempotency, retry, and replay boundaries before production traffic.
High-Level Architecture
flowchart LR
FE[Frontend / service clients] --> Hono[Hono routes]
Hono --> DB[(Postgres / Drizzle)]
Hono --> Rivet[Rivet actors]
Hono --> Svc[Product services]
Hono --> Docker[Docker + Gitea + OpenCode]
Svc --> Redis[Redis streams / pubsub]
Redis --> Consumer[events/redis-consumer]
Consumer --> GrowEvents[(grow_events)]
Consumer --> EventActor[userEventActor]
EventActor --> MissionActors[mission actors]
EventActor --> Projectors[QScore/session/projectors]
MissionActors --> DB
Rivet --> DB
Rivet --> Svc
Rivet --> Docker
Route to Actor/Service/Event/Data Flow Map
| Route module | Mounted path | Primary flow | Actor/service/data dependencies | Notes |
|---|---|---|---|---|
| `src/routes/actors.ts` | `/actors` | Auth-gated user stack control | `docker/manager`, `actors` table | Provisions/stops OpenCode stack directly from route. |
| `src/routes/agents.ts` | `/agents` | Catalog read | `agents/catalog` | Thin route. |
| `src/routes/chat.ts` | `/api/chat` | Chat request, Rivet first, direct LLM fallback | `userActor`, `lib/llm`, `services/service-agents` | Contains fallback tool orchestration and timeout logic in route. |
| `src/routes/conversations.ts` | `/conversations` | Conversation CRUD/chat/mission bridging | `conversationActor`, mission actors, `grow_conversations`, messages | Heavy route; mixes persistence, actor bootstrapping, mission resolution, and response shaping. |
| `src/routes/events.ts` | `/events` | User/service event ingestion and listing | `recordGrowEvent`, `routeGrowEventToUserActor`, `grow_events` | Good ingestion boundary, but service auth is environment-sensitive. |
| `src/routes/git.ts` | `/git` | Repo/file operations | `docker/manager`, `GiteaClient` | Route owns path safety and repo operation decisions. |
| `src/routes/grow.ts` | `/grow` | Grow bootstrap and active state | `growActor` | Thin actor gateway. |
| `src/routes/home.ts` | `/home` | Home feed, notifications, demo seed | `home-feed`, `seed-demo-home` | Includes demo seeding endpoint. |
| `src/routes/missions.ts` | `/missions` | Mission catalog, start/pause/resume/stage/artifacts/coach | `growActor`, mission actors, user service, mission registry | Heavy route; owns mission selection, profile fallback, actor type mapping, and artifact commands. |
| `src/routes/opencode.ts` | `/opencode` | OpenCode stack/session/message proxy | `docker/manager`, `OpencodeClient` | Directly provisions stack and opens sessions. |
| `src/routes/services.ts` | `/services` | Product service proxy and event recording | `product-service-clients`, `recordGrowEvent`, Q Score onboarding | Very heavy route; contains service-specific payload shaping and event side effects. |
| `src/routes/users.ts` | `/users` | User profile/bootstrap | `auth/clerk`, `users` table, onboarding Q Score | Includes Clerk profile mirroring and onboarding side effects. |
| `src/routes/workflows.ts` | `/workflows`, `/workflow-runs` | Workflow definitions/runs/modules/approvals | `userActor`, `workflowRunActor`, `workflow/module-runner`, DB | Two paths: legacy `userActor` job-application flow and DB-backed workflow runs. |
Actor Inventory
| Actor | Current role | Main inputs | Outputs/effects | Robustness observations |
|---|---|---|---|---|
| `userActor` | Legacy unified user orchestration: chat, memory tools, workflow status, service handoffs, OpenCode/Gitea interactions | `/api/chat`, `/workflows/job-application`, workflow route aliases | Actor state, DB events, service calls, Gitea reads/writes | Very broad responsibilities; failures in service calls often become summaries rather than durable retryable jobs. |
| `workflowRunActor` | Queued workflow module runner | `/workflow-runs/:runId/pause\|resume` and direct client use | `workflowRunModules`, `workflowEvents`, `qscoreSnapshots` via module runner | |
| `conversationActor` | Durable streaming conversation state | `/conversations` | Actor state and generated messages | Queue usage exists for messages; needs documented idempotency per turn/message id. |
| `memoryActor` | Durable memory file state | Internal client use | Actor state/file-like memory | Queue writes exist; external call idempotency unclear. |
| `growActor` | Active mission list/state control | `/grow`, `/missions` | `grow_active_missions`, mission state | Mission lifecycle split across `growActor`, mission actors, and routes. |
| `userEventActor` | Routes normalized Grow Events to missions/projectors | Redis consumer, `/events` ingestion | Mission stage patches, projector DB updates, event status | Central point for event idempotency, but retries/replay/DLQ are not yet formalized. |
| Mission actors | Per-mission state machines | `/missions`, `/conversations`, event actor | `grow_active_missions`, artifacts, suggestions | Four mission actors are thin factory wrappers; interview-to-offer has custom implementation. |
| Product service actors | Actor wrappers for interview/roleplay/resume clients | Registry only; possible client use | Service calls | Registered, but routes call clients directly. These may be underused compared to direct service proxy routes. |
Event and Projector Flow
sequenceDiagram
participant Service as Product service
participant Redis as Redis stream/pubsub
participant Route as /events or service routes
participant Store as grow_events
participant UserEvent as userEventActor
participant Mission as mission actor
participant Projection as projectors
Service->>Redis: canonical GrowEvent or legacy task response
Redis->>Route: redis-consumer normalizes message
Route->>Store: recordGrowEvent with dedupeKey
Route->>UserEvent: routeGrowEventToUserActor
UserEvent->>Mission: apply reducer-derived stage patches
UserEvent->>Projection: service session and Q Score projections
Projection->>Store: update projection tables
Current event strengths:
- `normalizeGrowEvent` accepts multiple service field conventions.
- `recordGrowEvent` uses `dedupeKey` and a unique index on `grow_events.dedupe_key`.
- Legacy Redis observer bridges `tasks:*` and `responses:*` without service changes.
- Projector surfaces exist for session tracking, Q Score, and LLM-derived insights.
Current event gaps:
- Redis canonical consumer always `xAck`s in `finally`, even when `recordAndRoute` fails, so failed messages do not remain pending for retry.
- No DLQ stream/table for failed canonical or legacy event processing.
- No replay script for `grow_events.processing_status in ('failed', 'unresolved')`.
- Legacy task context is in-memory only, so response events can lose user/action context after a backend restart.
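The first two gaps could be closed by acknowledging a message only after it has been handled or parked. The sketch below is one possible shape, not the current consumer: `StreamClient`, `StreamMessage`, `deadLetter`, and the `maxDeliveries` budget are hypothetical names introduced for illustration, and `handle` stands in for `recordAndRoute`.

```typescript
// Hypothetical shapes: none of these names are taken from the growqr-backend code.
interface StreamMessage {
  id: string;
  payload: string;
  deliveryCount: number; // how many times the stream has delivered this message
}

interface StreamClient {
  ack(id: string): Promise<void>;
  deadLetter(msg: StreamMessage, reason: string): Promise<void>;
}

type Outcome = "acked" | "dead-lettered" | "left-pending";

async function processMessage(
  client: StreamClient,
  msg: StreamMessage,
  handle: (payload: string) => Promise<void>, // stands in for recordAndRoute
  maxDeliveries = 3, // assumed retry budget
): Promise<Outcome> {
  try {
    await handle(msg.payload);
  } catch (err) {
    if (msg.deliveryCount < maxDeliveries) {
      // No ack: the message stays pending and can be claimed again later.
      return "left-pending";
    }
    // Retry budget exhausted: park the message, then ack so it stops redelivering.
    const reason = err instanceof Error ? err.message : String(err);
    await client.deadLetter(msg, reason);
    await client.ack(msg.id);
    return "dead-lettered";
  }
  // Success path: ack only after the handler has finished.
  await client.ack(msg.id);
  return "acked";
}
```

The key difference from an ack in `finally` is that a failed message is either left pending or moved to a dead-letter destination before it is acknowledged, so it is never silently dropped.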
Business Logic in Routes
Highest concentration:
- `src/routes/services.ts`: service-specific request construction, event emission, Q Score baseline/onboarding side effects, mission association, and UI response shaping.
- `src/routes/workflows.ts`: run creation, module row initialization, baseline Q Score, approval gate progression, artifact content lookup, and synchronous module execution.
- `src/routes/missions.ts`: mission profile lookup from user service, actor type mapping, start/resume/pause/stage/artifact commands, and coach run orchestration.
- `src/routes/conversations.ts`: active conversation persistence, mission-aware chat routing, actor fallback behavior, and response normalization.
- `src/routes/chat.ts`: Rivet fallback, direct LLM tool loop, service agent selection, and timeout handling.
Low-risk thin routes:
- `src/routes/agents.ts`
- `src/routes/grow.ts`
- parts of `src/routes/events.ts`
Recommended ownership target:
- Routes validate/authenticate and translate HTTP to commands.
- Actors own durable user/mission/workflow progression.
- Services own outbound HTTP details.
- Projectors own derived read models.
- Routes should not decide retry, idempotency, or service fallback behavior beyond returning HTTP errors.
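To make the ownership target concrete, a thin route under these rules might look roughly like the following. This is a framework-free sketch: `StartMissionCommand`, `RouteRequest`, `Dispatch`, and the `202` response are assumptions for illustration and do not describe the existing Hono handlers.

```typescript
// Hypothetical command and request shapes, for illustration only.
interface StartMissionCommand {
  type: "mission.start";
  userId: string;
  missionId: string;
  requestId: string;
}

interface RouteRequest {
  userId?: string; // set by auth middleware when the caller is authenticated
  requestId: string;
  params: { missionId?: string };
}

interface RouteResponse {
  status: number;
  body: Record<string, unknown>;
}

// The dispatcher (actor gateway or command layer) owns retries and idempotency.
type Dispatch = (cmd: StartMissionCommand) => Promise<{ commandId: string }>;

async function startMissionRoute(
  req: RouteRequest,
  dispatch: Dispatch,
): Promise<RouteResponse> {
  // Authenticate and validate; nothing else is decided in the route.
  if (!req.userId) return { status: 401, body: { error: "unauthenticated" } };
  const missionId = req.params.missionId?.trim();
  if (!missionId) return { status: 400, body: { error: "missionId is required" } };

  const { commandId } = await dispatch({
    type: "mission.start",
    userId: req.userId,
    missionId,
    requestId: req.requestId,
  });
  // 202: the command was accepted; progression happens in the actor.
  return { status: 202, body: { commandId } };
}
```

The route contains no fallback, retry, or mission-selection logic, so those decisions have exactly one home behind `dispatch`.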
Idempotency Gaps
| Area | Existing behavior | Gap |
|---|---|---|
| Grow Event ingestion | `dedupeKey` unique index; normalizer uses explicit key or source id | Service routes do not consistently set stable dedupe keys for all service-created side effects. |
| Workflow runs | `/workflow-runs/:runId/modules/:moduleId/run` reads `idempotency-key` header | `executeWorkflowModule` does not use the key to suppress duplicate service calls; `/run` generates timestamp keys. |
| Workflow module rows | Has `idempotencyKey`, `retryCount`, `maxRetries` columns | Counters are mostly passive; no central retry state machine. |
| Actor queues | Rivet queues and loop step names provide some dedupe for `workflowRunActor` | Several routes bypass actor queue and execute directly. |
| Service session creation | `stableUuid` exists in service-agent helper | Not consistently used as a request id/idempotency key across service calls. |
| OpenCode artifacts | `onConflictDoNothing` for workflow artifacts | OpenCode prompt/message send can duplicate work before artifact row conflict applies. |
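Several of these gaps come down to two missing pieces: a key that is stable across retries of the same operation, and something that actually consults the key before doing work. A minimal sketch follows, assuming hypothetical helpers `buildIdempotencyKey` and `IdempotentRunner`; the in-memory `Map` stands in for whatever durable store would be used in practice.

```typescript
// Hypothetical helper names; the Map is a stand-in for a durable store.
interface IdempotencyParts {
  requestId: string;
  userId: string;
  missionInstanceId?: string;
  serviceId: string;
  operation: string;
}

// Deterministic key: the same inputs always produce the same string.
// Components are URI-encoded so a ":" inside a value cannot shift fields.
function buildIdempotencyKey(parts: IdempotencyParts): string {
  return [
    parts.requestId,
    parts.userId,
    parts.missionInstanceId ?? "",
    parts.serviceId,
    parts.operation,
  ]
    .map((part) => encodeURIComponent(part))
    .join(":");
}

class IdempotentRunner<T> {
  private readonly inFlightOrDone = new Map<string, Promise<T>>();

  // Runs fn at most once per key; duplicate callers share the first result.
  run(key: string, fn: () => Promise<T>): Promise<T> {
    const existing = this.inFlightOrDone.get(key);
    if (existing) return existing;

    const started = fn().catch((err) => {
      // Forget failures so a later retry with the same key can run again.
      this.inFlightOrDone.delete(key);
      throw err;
    });
    this.inFlightOrDone.set(key, started);
    return started;
  }
}
```

Unlike a timestamp-derived key, this key does not change when the same request is replayed, which is what allows a duplicate module run to be suppressed rather than re-executed.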
Retry Gaps
| Area | Existing behavior | Gap |
|---|---|---|
| `workflowRunActor` | Rivet loop has `retryBackoffBase` and `retryBackoffMax` | Only applies when execution goes through actor loop. |
| HTTP service clients | Throw on non-2xx after fetch |
No timeout, retry classification, request id, or backoff. |
| Gitea client | Some wait/poll helpers exist | Most API calls are single-shot. |
| OpenCode client | Health polling exists | Session/message calls are single-shot. |
| Redis consumer | Infinite loop catches top-level errors | Per-message failures are acked; no retry budget or DLQ. |
| Projectors | Called by event actor | Projector failures need durable retry/replay semantics and status transitions. |
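For the single-shot HTTP, Gitea, and OpenCode calls in the table, a shared wrapper could combine a per-attempt timeout, retry classification, and capped exponential backoff. The sketch below is an assumption about what such a wrapper might look like; `RetryOptions` and its fields are hypothetical, and the timeout only takes effect if the wrapped operation honors the `AbortSignal` it is given (as `fetch` does).

```typescript
// Hypothetical options; names and defaults are not taken from the backend.
interface RetryOptions {
  attempts: number; // total tries, including the first
  baseDelayMs: number;
  maxDelayMs: number;
  timeoutMs: number; // per-attempt timeout
  isRetryable?: (err: unknown) => boolean; // defaults to retrying everything
}

async function withRetry<T>(
  op: (signal: AbortSignal) => Promise<T>,
  opts: RetryOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.attempts; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), opts.timeoutMs);
    try {
      return await op(controller.signal);
    } catch (err) {
      lastError = err;
      const retryable = opts.isRetryable ? opts.isRetryable(err) : true;
      if (!retryable || attempt === opts.attempts) break;
      // Exponential backoff, capped at maxDelayMs.
      const delay = Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** (attempt - 1));
      await new Promise((resolve) => setTimeout(resolve, delay));
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError;
}
```

A caller would pass the signal through, for example `withRetry((signal) => fetch(url, { signal }), options)`, and supply an `isRetryable` that treats 4xx responses as permanent failures.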
Actor Robustness Gaps
- `userActor` is too broad to reason about failure domains. It owns chat, service tools, memory, workflow, Gitea, OpenCode, and DB event writes.
- Product service actors are registered but not the primary path for service proxy routes, so actor-level durability is uneven.
- Mission actor mapping is manually duplicated in routes, registry, and event actor.
- Route-level synchronous workflow execution can hold HTTP requests open across slow service/OpenCode calls.
- Actor initialization is repeated in routes; a central actor gateway could enforce init/idempotency/logging.
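The duplicated mission actor mapping is the easiest of these to illustrate. One option is a single typed table that routes, the registry, and the event actor all import. In the sketch below every actor name and the `example-mission` id are hypothetical; only the `interview-to-offer` mission name comes from this audit.

```typescript
// Hypothetical ids and actor names; only "interview-to-offer" appears in this audit.
const MISSION_ACTORS = {
  "interview-to-offer": "interviewToOfferMissionActor",
  "example-mission": "exampleMissionActor",
} as const;

type MissionId = keyof typeof MISSION_ACTORS;
type MissionActorName = (typeof MISSION_ACTORS)[MissionId];

// Own-property check so inherited keys such as "toString" are rejected.
function isMissionId(value: string): value is MissionId {
  return Object.prototype.hasOwnProperty.call(MISSION_ACTORS, value);
}

// Single lookup used by every caller; unknown ids fail loudly in one place.
function actorForMission(missionId: string): MissionActorName {
  if (!isMissionId(missionId)) {
    throw new Error(`Unknown mission id: ${missionId}`);
  }
  return MISSION_ACTORS[missionId];
}
```

With one table, adding a mission becomes a one-line change, and the compiler can flag callers that assume an actor name that no longer exists.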
Priority-Ranked Recommendations
- Create a backend command layer for route-to-actor/service translation. Move mission start, workflow run, approval, service configure, and chat tool dispatch logic out of routes.
- Make `workflowRunActor` the only executor for workflow modules. Routes should enqueue commands and return command ids.
- Add a shared outbound `withRetry`/timeout/idempotency wrapper for service clients, Gitea, OpenCode, and LLM calls.
- Add DLQ and replay support for Redis/event processing. Do not ack canonical Redis messages until durable record/projector status is successful or DLQ-ed.
- Normalize mission actor mapping into a single registry source used by routes, event actor, and mission registry.
- Split `userActor` responsibilities: chat/memory/workflow/OpenCode paths should be smaller actors or delegated services with explicit contracts.
- Convert route-created side effects to stable idempotency keys. Use request id, user id, mission instance id, service id, and operation name.
- Add structured logging fields across routes/actors/events: `requestId`, `userId`, `missionInstanceId`, `runId`, `moduleId`, `eventId`, `idempotencyKey`, `retryAttempt`.
- Add focused tests around duplicate workflow module run, duplicate service event ingest, Redis failure handling, and mission projector replay.
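The structured logging recommendation can be sketched as a typed context plus a formatter. The field names are the ones listed in the recommendation; `LogContext`, `formatLog`, and `withContext` are hypothetical helpers, and the JSON line format is an assumption rather than an existing convention in the backend.

```typescript
// Field names follow the recommendation; the helper names are hypothetical.
interface LogContext {
  requestId?: string;
  userId?: string;
  missionInstanceId?: string;
  runId?: string;
  moduleId?: string;
  eventId?: string;
  idempotencyKey?: string;
  retryAttempt?: number;
}

type Level = "info" | "warn" | "error";

function formatLog(level: Level, message: string, ctx: LogContext = {}): string {
  // JSON.stringify drops keys whose value is undefined, so absent ids stay out of the line.
  return JSON.stringify({ level, message, ...ctx });
}

// Binds ids once (per request, run, or event) so call sites only add what changes.
function withContext(base: LogContext) {
  return (level: Level, message: string, extra: LogContext = {}): string =>
    formatLog(level, message, { ...base, ...extra });
}

// Usage: bind request-level ids once, then add per-attempt fields at the call site.
const log = withContext({ requestId: "req-1", userId: "user-1" });
console.log(log("warn", "module retry scheduled", { runId: "run-1", retryAttempt: 2 }));
```

Binding the context once keeps correlation ids consistent across a route, the actor it calls, and any events it emits.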
Suggested Next Slice
Use PRM-43 to introduce shared retry/idempotency primitives first. Then return to this audit and migrate the highest-risk route logic in this order:
1. `/workflow-runs/*/run`
2. `/services/interview|roleplay` configure/review
3. `/missions/:missionId/start`
4. `/api/chat` direct LLM fallback