Backend Organization Audit

PRM-41 audit pass for growqr-backend.

Scope reviewed: src/routes, src/actors, src/events, src/missions, src/workflows, and src/services.

Executive Summary

The backend currently has three overlapping orchestration layers:

  1. HTTP routes that directly perform database writes, service calls, and some synchronous workflow execution.
  2. Rivet actors that own durable user, workflow, mission, conversation, memory, and event processing state.
  3. Event/projector code that normalizes service events into Grow Events, updates mission state, records service sessions, and projects Q Score signals.

That split is workable for a demo-stage backend, but it blurs ownership. Several routes contain business logic that should live in services or actors, while actors and event consumers need stronger idempotency, retry, and replay boundaries before production traffic.

High-Level Architecture

```mermaid
flowchart LR
  FE[Frontend / service clients] --> Hono[Hono routes]
  Hono --> DB[(Postgres / Drizzle)]
  Hono --> Rivet[Rivet actors]
  Hono --> Svc[Product services]
  Hono --> Docker[Docker + Gitea + OpenCode]

  Svc --> Redis[Redis streams / pubsub]
  Redis --> Consumer[events/redis-consumer]
  Consumer --> GrowEvents[(grow_events)]
  Consumer --> EventActor[userEventActor]
  EventActor --> MissionActors[mission actors]
  EventActor --> Projectors[QScore/session/projectors]
  MissionActors --> DB

  Rivet --> DB
  Rivet --> Svc
  Rivet --> Docker
```

Route to Actor/Service/Event/Data Flow Map

| Route module | Mounted path | Primary flow | Actor/service/data dependencies | Notes |
| --- | --- | --- | --- | --- |
| src/routes/actors.ts | /actors | Auth-gated user stack control | docker/manager, actors table | Provisions/stops OpenCode stack directly from route. |
| src/routes/agents.ts | /agents | Catalog read | agents/catalog | Thin route. |
| src/routes/chat.ts | /api/chat | Chat request, Rivet first, direct LLM fallback | userActor, lib/llm, services/service-agents | Contains fallback tool orchestration and timeout logic in route. |
| src/routes/conversations.ts | /conversations | Conversation CRUD/chat/mission bridging | conversationActor, mission actors, grow_conversations, messages | Heavy route; mixes persistence, actor bootstrapping, mission resolution, and response shaping. |
| src/routes/events.ts | /events | User/service event ingestion and listing | recordGrowEvent, routeGrowEventToUserActor, grow_events | Good ingestion boundary, but service auth is environment-sensitive. |
| src/routes/git.ts | /git | Repo/file operations | docker/manager, GiteaClient | Route owns path safety and repo operation decisions. |
| src/routes/grow.ts | /grow | Grow bootstrap and active state | growActor | Thin actor gateway. |
| src/routes/home.ts | /home | Home feed, notifications, demo seed | home-feed, seed-demo-home | Includes demo seeding endpoint. |
| src/routes/missions.ts | /missions | Mission catalog, start/pause/resume/stage/artifacts/coach | growActor, mission actors, user service, mission registry | Heavy route; owns mission selection, profile fallback, actor type mapping, and artifact commands. |
| src/routes/opencode.ts | /opencode | OpenCode stack/session/message proxy | docker/manager, OpencodeClient | Directly provisions stack and opens sessions. |
| src/routes/services.ts | /services | Product service proxy and event recording | product-service-clients, recordGrowEvent, Q Score onboarding | Very heavy route; contains service-specific payload shaping and event side effects. |
| src/routes/users.ts | /users | User profile/bootstrap | auth/clerk, users table, onboarding Q Score | Includes Clerk profile mirroring and onboarding side effects. |
| src/routes/workflows.ts | /workflows, /workflow-runs | Workflow definitions/runs/modules/approvals | userActor, workflowRunActor, workflow/module-runner, DB | Two paths: legacy userActor job-application flow and DB-backed workflow runs. |

Actor Inventory

| Actor | Current role | Main inputs | Outputs/effects | Robustness observations |
| --- | --- | --- | --- | --- |
| userActor | Legacy unified user orchestration: chat, memory tools, workflow status, service handoffs, OpenCode/Gitea interactions | /api/chat, /workflows/job-application, workflow route aliases | Actor state, DB events, service calls, Gitea reads/writes | Very broad responsibilities; failures in service calls often become summaries rather than durable retryable jobs. |
| workflowRunActor | Queued workflow module runner | `/workflow-runs/:runId/pause`, `/workflow-runs/:runId/resume`, and direct client use | workflowRunModules, workflowEvents, qscoreSnapshots via module runner | |
| conversationActor | Durable streaming conversation state | /conversations | Actor state and generated messages | Queue usage exists for messages; needs documented idempotency per turn/message id. |
| memoryActor | Durable memory file state | Internal client use | Actor state/file-like memory | Queue writes exist; external call idempotency unclear. |
| growActor | Active mission list/state control | /grow, /missions | grow_active_missions, mission state | Mission lifecycle split across growActor, mission actors, and routes. |
| userEventActor | Routes normalized Grow Events to missions/projectors | Redis consumer, /events ingestion | Mission stage patches, projector DB updates, event status | Central point for event idempotency, but retries/replay/DLQ are not yet formalized. |
| Mission actors | Per-mission state machines | /missions, /conversations, event actor | grow_active_missions, artifacts, suggestions | Four mission actors are thin factory wrappers; interview-to-offer has custom implementation. |
| Product service actors | Actor wrappers for interview/roleplay/resume clients | Registry only; possible client use | Service calls | Registered, but routes call clients directly. These may be underused compared to direct service proxy routes. |

Event and Projector Flow

```mermaid
sequenceDiagram
  participant Service as Product service
  participant Redis as Redis stream/pubsub
  participant Route as /events or service routes
  participant Store as grow_events
  participant UserEvent as userEventActor
  participant Mission as mission actor
  participant Projection as projectors

  Service->>Redis: canonical GrowEvent or legacy task response
  Redis->>Route: redis-consumer normalizes message
  Route->>Store: recordGrowEvent with dedupeKey
  Route->>UserEvent: routeGrowEventToUserActor
  UserEvent->>Mission: apply reducer-derived stage patches
  UserEvent->>Projection: service session and Q Score projections
  Projection->>Store: update projection tables
```

Current event strengths:

  • normalizeGrowEvent accepts multiple service field conventions.
  • recordGrowEvent uses dedupeKey and a unique index on grow_events.dedupe_key.
  • Legacy Redis observer bridges tasks:* and responses:* without service changes.
  • Projector surfaces exist for session tracking, Q Score, and LLM-derived insights.
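
The dedupe behaviour described above can be sketched in a few lines. This is a minimal illustration, not the real `recordGrowEvent`: the `IncomingEvent` shape, `resolveDedupeKey`, and the in-memory store are all hypothetical stand-ins, and the `Map` plays the role of the unique index on `grow_events.dedupe_key`.

```typescript
// Hypothetical shapes: the real GrowEvent and recordGrowEvent signatures are not shown in this audit.
interface IncomingEvent {
  source: string;      // emitting service, e.g. "interview"
  sourceId?: string;   // id assigned by the emitting service
  dedupeKey?: string;  // explicit key, when the producer supplies one
  type: string;
  userId: string;
}

interface StoredEvent extends IncomingEvent {
  dedupeKey: string;
}

// Prefer an explicit key; otherwise derive one from the source id.
function resolveDedupeKey(event: IncomingEvent): string {
  if (event.dedupeKey) return event.dedupeKey;
  if (event.sourceId) return `${event.source}:${event.type}:${event.sourceId}`;
  throw new Error("event has neither dedupeKey nor sourceId");
}

// In-memory stand-in for grow_events with a unique index on dedupe_key.
class InMemoryGrowEventStore {
  private readonly byKey = new Map<string, StoredEvent>();

  // Returns inserted=false for a duplicate, mirroring an insert that hits the unique index.
  record(event: IncomingEvent): { inserted: boolean; event: StoredEvent } {
    const dedupeKey = resolveDedupeKey(event);
    const existing = this.byKey.get(dedupeKey);
    if (existing) return { inserted: false, event: existing };
    const stored: StoredEvent = { ...event, dedupeKey };
    this.byKey.set(dedupeKey, stored);
    return { inserted: true, event: stored };
  }

  get size(): number {
    return this.byKey.size;
  }
}
```

Callers can use the `inserted` flag to skip routing and projection for events that were already recorded.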

Current event gaps:

  • The canonical Redis consumer always calls xAck in a finally block, even when recordAndRoute fails, so failed messages are acknowledged and never remain pending for retry.
  • No DLQ stream/table for failed canonical or legacy event processing.
  • No replay script for grow_events rows whose processing_status is 'failed' or 'unresolved'.
  • Legacy task context is in-memory only, so response events can lose user/action context after a backend restart.
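
One way to close the first two gaps is to ack only after success, and to move a message to a dead-letter destination once a retry budget is spent. The sketch below assumes hypothetical names throughout (`ConsumerDeps`, `handleMessage`, `deadLetter`, `maxAttempts`) and is synchronous for brevity; the real consumer is asynchronous. It also counts attempts in memory, whereas Redis itself tracks a delivery count per pending entry (visible through XPENDING).

```typescript
// All names here are hypothetical; the real consumer in events/redis-consumer is not shown in this audit.
interface StreamMessage {
  id: string;
  payload: string;
}

type Outcome = "acked" | "left-pending" | "dead-lettered";

interface ConsumerDeps {
  recordAndRoute: (message: StreamMessage) => void; // throws on failure
  ack: (id: string) => void;                        // stand-in for XACK
  deadLetter: (message: StreamMessage, reason: string) => void;
  maxAttempts: number;
}

// Delivery attempts per message id, kept in memory for this sketch only.
const attempts = new Map<string, number>();

function handleMessage(message: StreamMessage, deps: ConsumerDeps): Outcome {
  const attempt = (attempts.get(message.id) ?? 0) + 1;
  attempts.set(message.id, attempt);
  try {
    deps.recordAndRoute(message);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    if (attempt >= deps.maxAttempts) {
      // Retry budget exhausted: park the message, then ack so it stops redelivering.
      deps.deadLetter(message, reason);
      deps.ack(message.id);
      attempts.delete(message.id);
      return "dead-lettered";
    }
    // No ack: the message stays in the pending list for a later retry.
    return "left-pending";
  }
  // Ack only after the record-and-route step has succeeded.
  deps.ack(message.id);
  attempts.delete(message.id);
  return "acked";
}
```

If `deadLetter` itself throws, the message is not acked and stays pending, which keeps the failure visible instead of silently dropping the event.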

Business Logic in Routes

Highest concentration:

  • src/routes/services.ts: service-specific request construction, event emission, Q Score baseline/onboarding side effects, mission association, and UI response shaping.
  • src/routes/workflows.ts: run creation, module row initialization, baseline Q Score, approval gate progression, artifact content lookup, and synchronous module execution.
  • src/routes/missions.ts: mission profile lookup from user service, actor type mapping, start/resume/pause/stage/artifact commands, and coach run orchestration.
  • src/routes/conversations.ts: active conversation persistence, mission-aware chat routing, actor fallback behavior, and response normalization.
  • src/routes/chat.ts: Rivet fallback, direct LLM tool loop, service agent selection, and timeout handling.

Low-risk thin routes:

  • src/routes/agents.ts
  • src/routes/grow.ts
  • parts of src/routes/events.ts

Recommended ownership target:

  • Routes validate/authenticate and translate HTTP to commands.
  • Actors own durable user/mission/workflow progression.
  • Services own outbound HTTP details.
  • Projectors own derived read models.
  • Routes should not decide retry, idempotency, or service fallback behavior beyond returning HTTP errors.
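
The ownership target above can be illustrated with a route that only validates, translates, and hands off. This is a sketch under stated assumptions: `Command`, `HttpRequestLike`, `toCommand`, and `handleMissionAction` are hypothetical names, the request and response shapes are framework-neutral stand-ins for Hono types, and `dispatch` represents whatever actor or command layer ends up owning progression.

```typescript
// Hypothetical command and request shapes, used only to illustrate the ownership split.
type Command =
  | { kind: "mission.start"; userId: string; missionId: string }
  | { kind: "mission.pause"; userId: string; missionId: string };

interface HttpRequestLike {
  userId?: string;                 // set by auth middleware when authenticated
  params: { missionId?: string };
  body: { action?: unknown };
}

type HttpResponseLike = { status: number; body: Record<string, unknown> };

// Pure translation: no database, actor, or service access here.
function toCommand(req: HttpRequestLike): Command | { error: string; status: number } {
  const userId = req.userId;
  if (!userId) return { error: "unauthenticated", status: 401 };
  const missionId = req.params.missionId;
  if (!missionId) return { error: "missionId is required", status: 400 };
  switch (req.body.action) {
    case "start":
      return { kind: "mission.start", userId, missionId };
    case "pause":
      return { kind: "mission.pause", userId, missionId };
    default:
      return { error: "unknown action", status: 400 };
  }
}

// The route validates, translates, and hands off. The dispatcher owns
// progression, retries, and idempotency, and returns a command id.
function handleMissionAction(
  req: HttpRequestLike,
  dispatch: (command: Command) => string,
): HttpResponseLike {
  const result = toCommand(req);
  if ("error" in result) return { status: result.status, body: { error: result.error } };
  return { status: 202, body: { commandId: dispatch(result) } };
}
```

Because `toCommand` is pure, it can be unit tested without a database, an actor runtime, or a running service.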

Idempotency Gaps

| Area | Existing behavior | Gap |
| --- | --- | --- |
| Grow Event ingestion | dedupeKey unique index; normalizer uses explicit key or source id | Service routes do not consistently set stable dedupe keys for all service-created side effects. |
| Workflow runs | /workflow-runs/:runId/modules/:moduleId/run reads idempotency-key header | executeWorkflowModule does not use the key to suppress duplicate service calls; /run generates timestamp keys. |
| Workflow module rows | Has idempotencyKey, retryCount, maxRetries columns | Counters are mostly passive; no central retry state machine. |
| Actor queues | Rivet queues and loop step names provide some dedupe for workflowRunActor | Several routes bypass actor queue and execute directly. |
| Service session creation | stableUuid exists in service-agent helper | Not consistently used as a request id/idempotency key across service calls. |
| OpenCode artifacts | onConflictDoNothing for workflow artifacts | OpenCode prompt/message send can duplicate work before artifact row conflict applies. |
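
Several of these gaps come down to keys that are either missing or not stable. A minimal sketch of a deterministic key builder and a guard that suppresses duplicate side effects follows. `IdempotencyParts`, `buildIdempotencyKey`, and `IdempotentExecutor` are hypothetical names, and the in-memory map stands in for a durable table; the existing stableUuid helper may already cover part of this.

```typescript
// Hypothetical helper names; none of these exist in the codebase as far as this audit shows.
interface IdempotencyParts {
  requestId: string;
  userId: string;
  operation: string;          // e.g. "interview.configure"
  serviceId?: string;
  missionInstanceId?: string;
}

// Deterministic: the same parts always produce the same key. No timestamps or random values.
// Each part is URI-encoded so a ":" inside a value cannot be confused with the separator.
function buildIdempotencyKey(parts: IdempotencyParts): string {
  return [
    parts.operation,
    parts.userId,
    parts.missionInstanceId ?? "-",
    parts.serviceId ?? "-",
    parts.requestId,
  ]
    .map((part) => encodeURIComponent(part))
    .join(":");
}

// In-memory stand-in for a durable idempotency table keyed by the value above.
class IdempotentExecutor<T> {
  private readonly results = new Map<string, T>();

  // Runs the effect once per key; later calls with the same key return the stored result.
  run(key: string, effect: () => T): T {
    if (this.results.has(key)) return this.results.get(key) as T;
    const result = effect();
    this.results.set(key, result);
    return result;
  }
}
```

The key has to be built from values that survive a client retry (for example a caller-supplied request id), otherwise a retried request produces a new key and the guard never fires.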

Retry Gaps

| Area | Existing behavior | Gap |
| --- | --- | --- |
| workflowRunActor | Rivet loop has retryBackoffBase and retryBackoffMax | Only applies when execution goes through actor loop. |
| HTTP service clients | Throw on non-2xx after fetch | No timeout, retry classification, request id, or backoff. |
| Gitea client | Some wait/poll helpers exist | Most API calls are single-shot. |
| OpenCode client | Health polling exists | Session/message calls are single-shot. |
| Redis consumer | Infinite loop catches top-level errors | Per-message failures are acked; no retry budget or DLQ. |
| Projectors | Called by event actor | Projector failures need durable retry/replay semantics and status transitions. |
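
For the single-shot clients, a shared wrapper could add retry classification and capped exponential backoff. The following is a sketch, not an existing API in this codebase: `withRetry`, `RetryOptions`, `isRetryableHttpError`, and `backoffDelay` are hypothetical, errors are assumed to carry a numeric `status` when they come from a non-2xx response, and timeouts are left out (they could be layered inside `operation`, for example with an AbortController).

```typescript
// Hypothetical wrapper; option names are illustrative.
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  isRetryable: (error: unknown) => boolean;
  sleep?: (ms: number) => Promise<void>; // injectable so tests do not wait on real timers
}

// Reads a numeric `status` field from an error-like value, if present (an assumed convention).
function statusOf(error: unknown): number | undefined {
  if (typeof error === "object" && error !== null && "status" in error) {
    const status = (error as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

// Retry errors without a status (assumed network failures), 429, and 5xx; treat other statuses as permanent.
function isRetryableHttpError(error: unknown): boolean {
  const status = statusOf(error);
  if (status === undefined) return true;
  return status === 429 || status >= 500;
}

// Exponential backoff capped at maxDelayMs: base, 2*base, 4*base, ...
function backoffDelay(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
}

async function withRetry<T>(
  operation: (attempt: number) => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  const sleep =
    options.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      // Give up when the budget is spent or the error is classified as permanent.
      if (attempt >= options.maxAttempts || !options.isRetryable(error)) throw error;
      await sleep(backoffDelay(attempt, options.baseDelayMs, options.maxDelayMs));
    }
  }
}
```

Passing the attempt number to `operation` lets callers log it as `retryAttempt` and reuse the same idempotency key on every attempt, so a retried call is not treated as new work by the receiving service.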

Actor Robustness Gaps

  • userActor is too broad to reason about failure domains. It owns chat, service tools, memory, workflow, Gitea, OpenCode, and DB event writes.
  • Product service actors are registered but not the primary path for service proxy routes, so actor-level durability is uneven.
  • Mission actor mapping is manually duplicated in routes, registry, and event actor.
  • Route-level synchronous workflow execution can hold HTTP requests open across slow service/OpenCode calls.
  • Actor initialization is repeated in routes; a central actor gateway could enforce init/idempotency/logging.
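
The duplicated mission actor mapping could be replaced by one typed registry that routes, the event actor, and the mission registry all import. A minimal sketch follows; the mission ids and actor type names are placeholders (only "interview-to-offer" is named in this audit), and `resolveMissionActor` is a hypothetical helper.

```typescript
// Placeholder entries: the actor type strings and "example-mission" are invented for illustration.
const MISSION_ACTORS = {
  "interview-to-offer": { actorType: "interviewToOfferMissionActor", custom: true },
  "example-mission": { actorType: "exampleMissionActor", custom: false },
} as const;

type MissionId = keyof typeof MISSION_ACTORS;
type MissionActorType = (typeof MISSION_ACTORS)[MissionId]["actorType"];

// Own-property check so inherited keys such as "toString" are not accepted as mission ids.
function isMissionId(value: string): value is MissionId {
  return Object.prototype.hasOwnProperty.call(MISSION_ACTORS, value);
}

// Single lookup shared by every caller; unknown missions fail loudly instead of falling through.
function resolveMissionActor(missionId: string): MissionActorType {
  if (!isMissionId(missionId)) throw new Error(`unknown mission: ${missionId}`);
  return MISSION_ACTORS[missionId].actorType;
}
```

With `as const`, adding a mission is a one-line change, and the compiler flags any switch or lookup over `MissionId` that has not been updated.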

Priority-Ranked Recommendations

  1. Create a backend command layer for route-to-actor/service translation. Move mission start, workflow run, approval, service configure, and chat tool dispatch logic out of routes.
  2. Make workflowRunActor the only executor for workflow modules. Routes should enqueue commands and return command ids.
  3. Add a shared outbound withRetry/timeout/idempotency wrapper for service clients, Gitea, OpenCode, and LLM calls.
  4. Add DLQ and replay support for Redis/event processing. Do not ack a canonical Redis message until it has been durably recorded and its projector status is successful, or it has been moved to the DLQ.
  5. Normalize mission actor mapping into a single registry source used by routes, event actor, and mission registry.
  6. Split userActor responsibilities: chat/memory/workflow/OpenCode paths should be smaller actors or delegated services with explicit contracts.
  7. Convert route-created side effects to stable idempotency keys. Use request id, user id, mission instance id, service id, and operation name.
  8. Add structured logging fields across routes/actors/events: requestId, userId, missionInstanceId, runId, moduleId, eventId, idempotencyKey, retryAttempt.
  9. Add focused tests around duplicate workflow module run, duplicate service event ingest, Redis failure handling, and mission projector replay.
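
As an illustration of recommendation 8, the sketch below shows a context-carrying logger that emits one JSON line per entry. The field names are the ones listed in that recommendation; `createLogger`, `Logger`, and the `sink` callback are hypothetical and stand in for whichever logging library the backend adopts.

```typescript
// Field names come from recommendation 8; the logger itself is a hypothetical minimal one.
interface LogContext {
  requestId?: string;
  userId?: string;
  missionInstanceId?: string;
  runId?: string;
  moduleId?: string;
  eventId?: string;
  idempotencyKey?: string;
  retryAttempt?: number;
}

interface Logger {
  child(extra: LogContext): Logger;
  info(message: string, extra?: LogContext): void;
}

// Each child logger inherits its parent's context, so fields set at the route
// (requestId, userId) are still present on lines written deeper in the call chain.
function createLogger(sink: (line: string) => void, context: LogContext = {}): Logger {
  return {
    child: (extra) => createLogger(sink, { ...context, ...extra }),
    info: (message, extra = {}) =>
      sink(JSON.stringify({ level: "info", message, ...context, ...extra })),
  };
}
```

A route would typically create a child with `requestId` and `userId`, then pass it to the actor or service call, which adds `runId`, `moduleId`, or `retryAttempt` as it learns them.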

Suggested Next Slice

Use PRM-43 to introduce shared retry/idempotency primitives first. Then return to this audit and migrate the highest-risk route logic in this order:

  1. /workflow-runs/*/run
  2. /services/interview|roleplay configure/review
  3. /missions/:missionId/start
  4. /api/chat direct LLM fallback
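
For the first item in that list, the migrated route might enqueue a command and return its id instead of executing the module inline. The sketch below is an assumption-heavy illustration: `ModuleRunQueue` is an in-memory stand-in for workflowRunActor's durable queue, `handleRunModule` is a hypothetical handler, and the choice to reject requests without an idempotency-key header (rather than generating a timestamp key) is one possible policy, not a decision recorded in this audit.

```typescript
// Hypothetical in-memory queue standing in for workflowRunActor's durable queue.
interface ModuleRunCommand {
  commandId: string;
  runId: string;
  moduleId: string;
  idempotencyKey: string;
}

class ModuleRunQueue {
  private readonly byKey = new Map<string, ModuleRunCommand>();
  private readonly pending: ModuleRunCommand[] = [];
  private nextId = 1;

  // Enqueue at most once per idempotency key; a duplicate gets the original command id back.
  enqueue(
    runId: string,
    moduleId: string,
    idempotencyKey: string,
  ): { commandId: string; duplicate: boolean } {
    const existing = this.byKey.get(idempotencyKey);
    if (existing) return { commandId: existing.commandId, duplicate: true };
    const command: ModuleRunCommand = {
      commandId: `cmd-${this.nextId++}`,
      runId,
      moduleId,
      idempotencyKey,
    };
    this.byKey.set(idempotencyKey, command);
    this.pending.push(command);
    return { commandId: command.commandId, duplicate: false };
  }

  // Called by the executor (the actor loop), never by the route.
  take(): ModuleRunCommand | undefined {
    return this.pending.shift();
  }
}

// Route-side handler: require a caller-supplied key instead of generating a timestamp key.
function handleRunModule(
  queue: ModuleRunQueue,
  params: { runId: string; moduleId: string },
  idempotencyKeyHeader: string | undefined,
): { status: number; body: Record<string, unknown> } {
  if (!idempotencyKeyHeader) {
    return { status: 400, body: { error: "idempotency-key header is required" } };
  }
  const { commandId, duplicate } = queue.enqueue(params.runId, params.moduleId, idempotencyKeyHeader);
  return { status: duplicate ? 200 : 202, body: { commandId } };
}
```

The HTTP request returns as soon as the command is queued, so slow service or OpenCode calls no longer hold the connection open, and a client retry with the same key maps back to the same command.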