Backend Organization Audit

PRM-41 audit pass for growqr-backend.

Scope reviewed: src/routes, src/actors, src/events, src/missions, src/workflows, and src/services.

Executive Summary

The backend currently has three overlapping orchestration layers:

HTTP routes that directly perform database writes, service calls, and some synchronous workflow execution.
Rivet actors that own durable user, workflow, mission, conversation, memory, and event processing state.
Event/projector code that normalizes service events into Grow Events, updates mission state, records service sessions, and projects Q Score signals.

That split is workable for a demo-stage backend, but it blurs ownership. Several routes contain business logic that should live in services or actors, while actors and event consumers need stronger idempotency, retry, and replay boundaries before production traffic.

High-Level Architecture

flowchart LR
  FE[Frontend / service clients] --> Hono[Hono routes]
  Hono --> DB[(Postgres / Drizzle)]
  Hono --> Rivet[Rivet actors]
  Hono --> Svc[Product services]
  Hono --> Docker[Docker + Gitea + OpenCode]

  Svc --> Redis[Redis streams / pubsub]
  Redis --> Consumer[events/redis-consumer]
  Consumer --> GrowEvents[(grow_events)]
  Consumer --> EventActor[userEventActor]
  EventActor --> MissionActors[mission actors]
  EventActor --> Projectors[QScore/session/projectors]
  MissionActors --> DB

  Rivet --> DB
  Rivet --> Svc
  Rivet --> Docker

Route to Actor/Service/Event/Data Flow Map

Route module	Mounted path	Primary flow	Actor/service/data dependencies	Notes
`src/routes/actors.ts`	`/actors`	Auth-gated user stack control	`docker/manager`, `actors` table	Provisions/stops OpenCode stack directly from route.
`src/routes/agents.ts`	`/agents`	Catalog read	`agents/catalog`	Thin route.
`src/routes/chat.ts`	`/api/chat`	Chat request, Rivet first, direct LLM fallback	`userActor`, `lib/llm`, `services/service-agents`	Contains fallback tool orchestration and timeout logic in route.
`src/routes/conversations.ts`	`/conversations`	Conversation CRUD/chat/mission bridging	`conversationActor`, mission actors, `grow_conversations`, messages	Heavy route; mixes persistence, actor bootstrapping, mission resolution, and response shaping.
`src/routes/events.ts`	`/events`	User/service event ingestion and listing	`recordGrowEvent`, `routeGrowEventToUserActor`, `grow_events`	Good ingestion boundary, but service auth is environment-sensitive.
`src/routes/git.ts`	`/git`	Repo/file operations	`docker/manager`, `GiteaClient`	Route owns path safety and repo operation decisions.
`src/routes/grow.ts`	`/grow`	Grow bootstrap and active state	`growActor`	Thin actor gateway.
`src/routes/home.ts`	`/home`	Home feed, notifications, demo seed	`home-feed`, `seed-demo-home`	Includes demo seeding endpoint.
`src/routes/missions.ts`	`/missions`	Mission catalog, start/pause/resume/stage/artifacts/coach	`growActor`, mission actors, user service, mission registry	Heavy route; owns mission selection, profile fallback, actor type mapping, and artifact commands.
`src/routes/opencode.ts`	`/opencode`	OpenCode stack/session/message proxy	`docker/manager`, `OpencodeClient`	Directly provisions stack and opens sessions.
`src/routes/services.ts`	`/services`	Product service proxy and event recording	`product-service-clients`, `recordGrowEvent`, Q Score onboarding	Very heavy route; contains service-specific payload shaping and event side effects.
`src/routes/users.ts`	`/users`	User profile/bootstrap	`auth/clerk`, `users` table, onboarding Q Score	Includes Clerk profile mirroring and onboarding side effects.
`src/routes/workflows.ts`	`/workflows`, `/workflow-runs`	Workflow definitions/runs/modules/approvals	`userActor`, `workflowRunActor`, `workflow/module-runner`, DB	Two paths: legacy userActor job-application flow and DB-backed workflow runs.

Actor Inventory

Actor	Current role	Main inputs	Outputs/effects	Robustness observations
`userActor`	Legacy unified user orchestration: chat, memory tools, workflow status, service handoffs, OpenCode/Gitea interactions	`/api/chat`, `/workflows/job-application`, workflow route aliases	Actor state, DB events, service calls, Gitea reads/writes	Very broad responsibilities; failures in service calls often become summaries rather than durable retryable jobs.
`workflowRunActor`	Queued workflow module runner	`/workflow-runs/:runId/pause|resume` and direct client use	`workflowRunModules`, `workflowEvents`, `qscoreSnapshots` via module runner	
`conversationActor`	Durable streaming conversation state	`/conversations`	Actor state and generated messages	Queue usage exists for messages; needs documented idempotency per turn/message id.
`memoryActor`	Durable memory file state	Internal client use	Actor state/file-like memory	Queue writes exist; external call idempotency unclear.
`growActor`	Active mission list/state control	`/grow`, `/missions`	`grow_active_missions`, mission state	Mission lifecycle split across growActor, mission actors, and routes.
`userEventActor`	Routes normalized Grow Events to missions/projectors	Redis consumer, `/events` ingestion	Mission stage patches, projector DB updates, event status	Central point for event idempotency, but retries/replay/DLQ are not yet formalized.
Mission actors	Per-mission state machines	`/missions`, `/conversations`, event actor	`grow_active_missions`, artifacts, suggestions	Four mission actors are thin factory wrappers; interview-to-offer has custom implementation.
Product service actors	Actor wrappers for interview/roleplay/resume clients	Registry only; possible client use	Service calls	Registered, but routes call clients directly. These may be underused compared to direct service proxy routes.

Event and Projector Flow

sequenceDiagram
  participant Service as Product service
  participant Redis as Redis stream/pubsub
  participant Route as /events or service routes
  participant Store as grow_events
  participant UserEvent as userEventActor
  participant Mission as mission actor
  participant Projection as projectors

  Service->>Redis: canonical GrowEvent or legacy task response
  Redis->>Route: redis-consumer normalizes message
  Route->>Store: recordGrowEvent with dedupeKey
  Route->>UserEvent: routeGrowEventToUserActor
  UserEvent->>Mission: apply reducer-derived stage patches
  UserEvent->>Projection: service session and Q Score projections
  Projection->>Store: update projection tables

Current event strengths:

normalizeGrowEvent accepts multiple service field conventions.
recordGrowEvent uses dedupeKey and a unique index on grow_events.dedupe_key.
Legacy Redis observer bridges tasks:* and responses:* without service changes.
Projector surfaces exist for session tracking, Q Score, and LLM-derived insights.

Current event gaps:

The canonical Redis consumer always calls xAck in a finally block, even when recordAndRoute fails, so failed messages are acknowledged and never remain pending for retry.
No DLQ stream/table for failed canonical or legacy event processing.
No replay script for grow_events rows whose processing_status is 'failed' or 'unresolved'.
Legacy task context is in-memory only, so response events can lose user/action context after a backend restart.
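
The ack-related gaps above could be narrowed by acknowledging a message only after it has been processed or parked in a dead-letter stream. The following is a minimal sketch under stated assumptions, not the current redis-consumer: `StreamClient`, `StreamMessage`, `handleMessage`, the stream and group names, and the `maxAttempts` budget are all hypothetical, and `recordAndRoute` is passed in as a plain function.

```typescript
// Hypothetical, minimal view of a stream client; real Redis clients expose different APIs.
interface StreamClient {
  ack(stream: string, group: string, id: string): Promise<void>;
  addToDlq(stream: string, entry: Record<string, string>): Promise<void>;
}

interface StreamMessage {
  id: string;
  payload: string;
  deliveryCount: number; // how many times this message has been delivered so far
}

interface ConsumerOptions {
  stream: string;
  group: string;
  dlqStream: string;
  maxAttempts: number;
}

type Outcome = "acked" | "left-pending" | "dead-lettered";

// Illustrative names only; none of these come from the audited code.
const DEFAULT_OPTIONS: ConsumerOptions = {
  stream: "example:events",
  group: "example-group",
  dlqStream: "example:events:dlq",
  maxAttempts: 3,
};

async function handleMessage(
  client: StreamClient,
  msg: StreamMessage,
  recordAndRoute: (payload: string) => Promise<void>,
  opts: ConsumerOptions = DEFAULT_OPTIONS,
): Promise<Outcome> {
  try {
    await recordAndRoute(msg.payload);
  } catch (err) {
    if (msg.deliveryCount < opts.maxAttempts) {
      // No ack: the message stays pending so it can be claimed and retried.
      return "left-pending";
    }
    // Retry budget exhausted: park the message first, then ack so it stops redelivering.
    await client.addToDlq(opts.dlqStream, {
      originalId: msg.id,
      payload: msg.payload,
      error: err instanceof Error ? err.message : String(err),
    });
    await client.ack(opts.stream, opts.group, msg.id);
    return "dead-lettered";
  }
  // Ack only on the success path, never in a finally block.
  await client.ack(opts.stream, opts.group, msg.id);
  return "acked";
}
```

The DLQ write happens before the ack so that a crash between the two steps leaves the message pending rather than lost.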

Business Logic in Routes

Highest concentration:

src/routes/services.ts: service-specific request construction, event emission, Q Score baseline/onboarding side effects, mission association, and UI response shaping.
src/routes/workflows.ts: run creation, module row initialization, baseline Q Score, approval gate progression, artifact content lookup, and synchronous module execution.
src/routes/missions.ts: mission profile lookup from user service, actor type mapping, start/resume/pause/stage/artifact commands, and coach run orchestration.
src/routes/conversations.ts: active conversation persistence, mission-aware chat routing, actor fallback behavior, and response normalization.
src/routes/chat.ts: Rivet fallback, direct LLM tool loop, service agent selection, and timeout handling.

Low-risk thin routes:

src/routes/agents.ts
src/routes/grow.ts
parts of src/routes/events.ts

Recommended ownership target:

Routes validate/authenticate and translate HTTP to commands.
Actors own durable user/mission/workflow progression.
Services own outbound HTTP details.
Projectors own derived read models.
Routes should not decide retry, idempotency, or service fallback behavior beyond returning HTTP errors.
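
To illustrate the ownership target, a route handler can be reduced to a pure translation step that either yields a command or an HTTP error. This is a sketch only: `StartMissionCommand`, `toStartMissionCommand`, and the field names are hypothetical and do not describe the current `src/routes/missions.ts`.

```typescript
// Hypothetical command shape; field names are illustrative only.
interface StartMissionCommand {
  kind: "mission.start";
  userId: string;
  missionId: string;
  requestId: string;
}

type ParseResult =
  | { ok: true; command: StartMissionCommand }
  | { ok: false; status: 400 | 401; error: string };

// The route's whole job: authenticate, validate, translate.
// No retries, no fallbacks, no persistence decisions.
function toStartMissionCommand(input: {
  authUserId: string | null;
  missionId: string | undefined;
  requestId: string | undefined;
}): ParseResult {
  if (!input.authUserId) {
    return { ok: false, status: 401, error: "unauthenticated" };
  }
  if (!input.missionId) {
    return { ok: false, status: 400, error: "missionId is required" };
  }
  if (!input.requestId) {
    return { ok: false, status: 400, error: "request id is required" };
  }
  return {
    ok: true,
    command: {
      kind: "mission.start",
      userId: input.authUserId,
      missionId: input.missionId,
      requestId: input.requestId,
    },
  };
}
```

The resulting command would then be handed to an actor or command layer, which owns everything after validation.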

Idempotency Gaps

Area	Existing behavior	Gap
Grow Event ingestion	`dedupeKey` unique index; normalizer uses explicit key or source id	Service routes do not consistently set stable dedupe keys for all service-created side effects.
Workflow runs	`/workflow-runs/:runId/modules/:moduleId/run` reads `idempotency-key` header	`executeWorkflowModule` does not use the key to suppress duplicate service calls; `/run` generates timestamp keys.
Workflow module rows	Has `idempotencyKey`, `retryCount`, `maxRetries` columns	Counters are mostly passive; no central retry state machine.
Actor queues	Rivet queues and `loop` step names provide some dedupe for `workflowRunActor`	Several routes bypass the actor queue and execute directly.
Service session creation	`stableUuid` exists in service-agent helper	Not consistently used as a request id/idempotency key across service calls.
OpenCode artifacts	`onConflictDoNothing` for workflow artifacts	OpenCode prompt/message send can duplicate work before artifact row conflict applies.
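
One way to close several of these gaps is to derive keys deterministically from the identity of the side effect instead of from timestamps. The sketch below assumes a hypothetical helper `stableIdempotencyKey` and a hypothetical `SideEffectIdentity` shape; it uses Node's built-in `createHash` and is not the existing `stableUuid` helper.

```typescript
import { createHash } from "node:crypto";

// Hypothetical identity of one route-created side effect.
interface SideEffectIdentity {
  userId: string;
  missionInstanceId?: string;
  serviceId: string;
  operation: string;
  requestId: string;
}

function stableIdempotencyKey(id: SideEffectIdentity): string {
  const parts = [
    id.userId,
    id.missionInstanceId ?? "",
    id.serviceId,
    id.operation,
    id.requestId,
  ];
  // JSON-encode the tuple so field boundaries stay unambiguous
  // (plain concatenation would make "ab" + "c" collide with "a" + "bc").
  const digest = createHash("sha256").update(JSON.stringify(parts)).digest("hex");
  // Keep a readable prefix for log searches; the digest carries the uniqueness.
  return `${id.serviceId}:${id.operation}:${digest.slice(0, 32)}`;
}
```

This only deduplicates retries if the caller reuses the same request id when it retries, which is an assumption about client behavior rather than something the audit confirms.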

Retry Gaps

Area	Existing behavior	Gap
`workflowRunActor`	Rivet `loop` has `retryBackoffBase` and `retryBackoffMax`	Only applies when execution goes through actor loop.
HTTP service clients	Throw on non-2xx after `fetch`	No timeout, retry classification, request id, or backoff.
Gitea client	Some wait/poll helpers exist	Most API calls are single-shot.
OpenCode client	Health polling exists	Session/message calls are single-shot.
Redis consumer	Infinite loop catches top-level errors	Per-message failures are acked; no retry budget or DLQ.
Projectors	Called by event actor	Projector failures need durable retry/replay semantics and status transitions.
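
A shared outbound wrapper could cover the timeout, retry classification, and backoff gaps in one place. The following is a minimal sketch, assuming hypothetical names `withRetry`, `withTimeout`, `RetryOptions`, and `TimeoutError`; it relies only on the standard `AbortController` and `setTimeout`, and it is not tied to any specific client in the backend.

```typescript
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  timeoutMs: number;
  isRetryable: (err: unknown) => boolean; // caller decides which failures are transient
}

class TimeoutError extends Error {}

function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // lets fetch-style callers cancel the in-flight request
      reject(new TimeoutError(`timed out after ${timeoutMs}ms`));
    }, timeoutMs);
  });
  return Promise.race([run(controller.signal), timeout]).finally(() => clearTimeout(timer));
}

async function withRetry<T>(
  run: (signal: AbortSignal, attempt: number) => Promise<T>,
  opts: RetryOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await withTimeout((signal) => run(signal, attempt), opts.timeoutMs);
    } catch (err) {
      lastError = err;
      if (attempt === opts.maxAttempts || !opts.isRetryable(err)) break;
      // Exponential backoff capped at maxDelayMs.
      const delay = Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** (attempt - 1));
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

A service client would pass the `signal` to `fetch` and supply an `isRetryable` that treats network errors and 5xx responses as transient; that classification is a design choice left to each caller here.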

Actor Robustness Gaps

userActor is too broad to reason about failure domains. It owns chat, service tools, memory, workflow, Gitea, OpenCode, and DB event writes.
Product service actors are registered but not the primary path for service proxy routes, so actor-level durability is uneven.
Mission actor mapping is manually duplicated in routes, registry, and event actor.
Route-level synchronous workflow execution can hold HTTP requests open across slow service/OpenCode calls.
Actor initialization is repeated in routes; a central actor gateway could enforce init/idempotency/logging.
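
The duplicated mission actor mapping could collapse into one typed registry that routes, the event actor, and the mission registry all import. In this sketch only the `interview-to-offer` id comes from the audit; the actor type names, the `example-mission` entry, and the helper names are hypothetical placeholders.

```typescript
// Single source of truth for mission id -> actor mapping.
// Actor type names and "example-mission" are hypothetical placeholders.
const MISSION_REGISTRY = {
  "interview-to-offer": { actorType: "interviewToOfferActor", implementation: "custom" },
  "example-mission": { actorType: "exampleMissionActor", implementation: "factory" },
} as const;

type MissionId = keyof typeof MISSION_REGISTRY;
type MissionActorType = (typeof MISSION_REGISTRY)[MissionId]["actorType"];

// Own-property check so inherited keys such as "toString" are rejected.
function isMissionId(value: string): value is MissionId {
  return Object.prototype.hasOwnProperty.call(MISSION_REGISTRY, value);
}

function actorTypeForMission(missionId: string): MissionActorType {
  if (!isMissionId(missionId)) {
    throw new Error(`Unknown mission: ${missionId}`);
  }
  return MISSION_REGISTRY[missionId].actorType;
}
```

Because `MissionId` is derived from the registry keys, adding a mission in one place updates the type that every caller checks against.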

Priority-Ranked Recommendations

Create a backend command layer for route-to-actor/service translation. Move mission start, workflow run, approval, service configure, and chat tool dispatch logic out of routes.
Make workflowRunActor the only executor for workflow modules. Routes should enqueue commands and return command ids.
Add a shared outbound withRetry/timeout/idempotency wrapper for service clients, Gitea, OpenCode, and LLM calls.
Add DLQ and replay support for Redis/event processing. Ack a canonical Redis message only after it has been durably recorded and projected, or has been moved to the DLQ.
Normalize mission actor mapping into a single registry source used by routes, event actor, and mission registry.
Split userActor responsibilities: chat/memory/workflow/OpenCode paths should be smaller actors or delegated services with explicit contracts.
Give route-created side effects stable idempotency keys derived from the request id, user id, mission instance id, service id, and operation name.
Add structured logging fields across routes/actors/events: requestId, userId, missionInstanceId, runId, moduleId, eventId, idempotencyKey, retryAttempt.
Add focused tests around duplicate workflow module run, duplicate service event ingest, Redis failure handling, and mission projector replay.
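
The "enqueue commands and return command ids" recommendation can be sketched with an in-memory stand-in for the actor queue. Everything here is hypothetical (`InMemoryCommandQueue`, `RunModuleCommand`, `Enqueued`); a real implementation would enqueue into `workflowRunActor` and persist the key, which this sketch does not attempt.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical command for running one workflow module.
interface RunModuleCommand {
  runId: string;
  moduleId: string;
  idempotencyKey: string;
}

interface Enqueued {
  commandId: string;
  duplicate: boolean;
}

// In-memory stand-in for a durable actor queue, for illustration only.
class InMemoryCommandQueue {
  private readonly commandIdByKey = new Map<string, string>();
  readonly pending: Array<RunModuleCommand & { commandId: string }> = [];

  enqueue(cmd: RunModuleCommand): Enqueued {
    const existing = this.commandIdByKey.get(cmd.idempotencyKey);
    if (existing !== undefined) {
      // Same key seen before: return the original command id, enqueue nothing.
      return { commandId: existing, duplicate: true };
    }
    const commandId = randomUUID();
    this.commandIdByKey.set(cmd.idempotencyKey, commandId);
    this.pending.push({ ...cmd, commandId });
    return { commandId, duplicate: false };
  }
}
```

A route would return the `commandId` immediately (for example with a 202 response) instead of holding the request open while the module executes. The same shape also gives the duplicate workflow module run test a direct assertion target.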

Suggested Next Slice

Use PRM-43 to introduce shared retry/idempotency primitives first. Then return to this audit and migrate the highest-risk route logic in this order:

/workflow-runs/*/run
/services/interview|roleplay configure/review
/missions/:missionId/start
/api/chat direct LLM fallback
