docs: audit backend organization and actor flow

-Puter
2026-06-05 22:01:00 +05:30
parent 213987a9e0
commit 01e9cc92d4


@@ -0,0 +1,179 @@
# Backend Organization Audit
PRM-41 audit pass for `growqr-backend`.
Scope reviewed: `src/routes`, `src/actors`, `src/events`, `src/missions`, `src/workflows`, and `src/services`.
## Executive Summary
The backend currently has three overlapping orchestration layers:
1. HTTP routes that directly perform database writes, service calls, and some synchronous workflow execution.
2. Rivet actors that own durable user, workflow, mission, conversation, memory, and event processing state.
3. Event/projector code that normalizes service events into Grow Events, updates mission state, records service sessions, and projects Q Score signals.

That split is workable for a demo-stage backend, but it blurs ownership. Several routes contain business logic that should live in services or actors, while actors and event consumers need stronger idempotency, retry, and replay boundaries before production traffic.
## High-Level Architecture
```mermaid
flowchart LR
FE[Frontend / service clients] --> Hono[Hono routes]
Hono --> DB[(Postgres / Drizzle)]
Hono --> Rivet[Rivet actors]
Hono --> Svc[Product services]
Hono --> Docker[Docker + Gitea + OpenCode]
Svc --> Redis[Redis streams / pubsub]
Redis --> Consumer[events/redis-consumer]
Consumer --> GrowEvents[(grow_events)]
Consumer --> EventActor[userEventActor]
EventActor --> MissionActors[mission actors]
EventActor --> Projectors[QScore/session/projectors]
MissionActors --> DB
Rivet --> DB
Rivet --> Svc
Rivet --> Docker
```
## Route to Actor/Service/Event/Data Flow Map
| Route module | Mounted path | Primary flow | Actor/service/data dependencies | Notes |
| --- | --- | --- | --- | --- |
| `src/routes/actors.ts` | `/actors` | Auth-gated user stack control | `docker/manager`, `actors` table | Provisions/stops OpenCode stack directly from route. |
| `src/routes/agents.ts` | `/agents` | Catalog read | `agents/catalog` | Thin route. |
| `src/routes/chat.ts` | `/api/chat` | Chat request, Rivet first, direct LLM fallback | `userActor`, `lib/llm`, `services/service-agents` | Contains fallback tool orchestration and timeout logic in route. |
| `src/routes/conversations.ts` | `/conversations` | Conversation CRUD/chat/mission bridging | `conversationActor`, mission actors, `grow_conversations`, messages | Heavy route; mixes persistence, actor bootstrapping, mission resolution, and response shaping. |
| `src/routes/events.ts` | `/events` | User/service event ingestion and listing | `recordGrowEvent`, `routeGrowEventToUserActor`, `grow_events` | Good ingestion boundary, but service auth is environment-sensitive. |
| `src/routes/git.ts` | `/git` | Repo/file operations | `docker/manager`, `GiteaClient` | Route owns path safety and repo operation decisions. |
| `src/routes/grow.ts` | `/grow` | Grow bootstrap and active state | `growActor` | Thin actor gateway. |
| `src/routes/home.ts` | `/home` | Home feed, notifications, demo seed | `home-feed`, `seed-demo-home` | Includes demo seeding endpoint. |
| `src/routes/missions.ts` | `/missions` | Mission catalog, start/pause/resume/stage/artifacts/coach | `growActor`, mission actors, user service, mission registry | Heavy route; owns mission selection, profile fallback, actor type mapping, and artifact commands. |
| `src/routes/opencode.ts` | `/opencode` | OpenCode stack/session/message proxy | `docker/manager`, `OpencodeClient` | Directly provisions stack and opens sessions. |
| `src/routes/services.ts` | `/services` | Product service proxy and event recording | `product-service-clients`, `recordGrowEvent`, Q Score onboarding | Very heavy route; contains service-specific payload shaping and event side effects. |
| `src/routes/users.ts` | `/users` | User profile/bootstrap | `auth/clerk`, `users` table, onboarding Q Score | Includes Clerk profile mirroring and onboarding side effects. |
| `src/routes/workflows.ts` | `/workflows`, `/workflow-runs` | Workflow definitions/runs/modules/approvals | `userActor`, `workflowRunActor`, `workflow/module-runner`, DB | Two paths: legacy `userActor` job-application flow and DB-backed workflow runs. |
## Actor Inventory
| Actor | Current role | Main inputs | Outputs/effects | Robustness observations |
| --- | --- | --- | --- | --- |
| `userActor` | Legacy unified user orchestration: chat, memory tools, workflow status, service handoffs, OpenCode/Gitea interactions | `/api/chat`, `/workflows/job-application`, workflow route aliases | Actor state, DB events, service calls, Gitea reads/writes | Very broad responsibilities; failures in service calls often become summaries rather than durable retryable jobs. |
| `workflowRunActor` | Queued workflow module runner | `/workflow-runs/:runId/pause`, `/workflow-runs/:runId/resume`, and direct client use | `workflowRunModules`, `workflowEvents`, `qscoreSnapshots` via module runner | Has Rivet loop retry settings for module execution, but route-level `/run` bypasses the actor queue and executes synchronously. |
| `conversationActor` | Durable streaming conversation state | `/conversations` | Actor state and generated messages | Queue usage exists for messages; needs documented idempotency per turn/message id. |
| `memoryActor` | Durable memory file state | Internal client use | Actor state/file-like memory | Queue writes exist; external call idempotency unclear. |
| `growActor` | Active mission list/state control | `/grow`, `/missions` | `grow_active_missions`, mission state | Mission lifecycle is split across `growActor`, mission actors, and routes. |
| `userEventActor` | Routes normalized Grow Events to missions/projectors | Redis consumer, `/events` ingestion | Mission stage patches, projector DB updates, event status | Central point for event idempotency, but retries/replay/DLQ are not yet formalized. |
| Mission actors | Per-mission state machines | `/missions`, `/conversations`, event actor | `grow_active_missions`, artifacts, suggestions | Four mission actors are thin factory wrappers; interview-to-offer has a custom implementation. |
| Product service actors | Actor wrappers for interview/roleplay/resume clients | Registry only; possible client use | Service calls | Registered, but routes call clients directly. These may be underused compared to direct service proxy routes. |
## Event and Projector Flow
```mermaid
sequenceDiagram
participant Service as Product service
participant Redis as Redis stream/pubsub
participant Route as /events or service routes
participant Store as grow_events
participant UserEvent as userEventActor
participant Mission as mission actor
participant Projection as projectors
Service->>Redis: canonical GrowEvent or legacy task response
Redis->>Route: redis-consumer normalizes message
Route->>Store: recordGrowEvent with dedupeKey
Route->>UserEvent: routeGrowEventToUserActor
UserEvent->>Mission: apply reducer-derived stage patches
UserEvent->>Projection: service session and Q Score projections
Projection->>Store: update projection tables
```
Current event strengths:
- `normalizeGrowEvent` accepts multiple service field conventions.
- `recordGrowEvent` uses `dedupeKey` and a unique index on `grow_events.dedupe_key`.
- Legacy Redis observer bridges `tasks:*` and `responses:*` without service changes.
- Projector surfaces exist for session tracking, Q Score, and LLM-derived insights.

Current event gaps:
- Redis canonical consumer always `xAck`s in `finally`, even when `recordAndRoute` fails, so failed messages do not remain pending for retry.
- No DLQ stream/table for failed canonical or legacy event processing.
- No replay script for `grow_events` rows whose `processing_status` is `failed` or `unresolved`.
- Legacy task context is in-memory only, so response events can lose user/action context after a backend restart.
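
The ack-in-`finally` and missing-DLQ gaps could be narrowed by acknowledging a message only after it is either processed or parked. The following is a minimal sketch, not the consumer's actual code: `StreamMessage`, `ConsumerDeps`, `handleMessage`, and `maxDeliveries` are hypothetical names, and `recordAndRoute` is assumed to reject on failure. In a Redis Streams consumer group, a message that is not acked stays in the pending entries list, which is what makes a later retry possible.

```typescript
// Hypothetical shapes: the real redis-consumer types are not shown in this audit.
interface StreamMessage {
  id: string;
  payload: Record<string, unknown>;
  deliveryCount: number; // deliveries so far, including the current one
}

interface ConsumerDeps {
  recordAndRoute(msg: StreamMessage): Promise<void>;
  deadLetter(msg: StreamMessage, reason: string): Promise<void>;
  ack(id: string): Promise<void>;
}

type Outcome = "acked" | "dead-lettered" | "left-pending";

async function handleMessage(
  msg: StreamMessage,
  deps: ConsumerDeps,
  maxDeliveries = 3,
): Promise<Outcome> {
  try {
    await deps.recordAndRoute(msg);
  } catch (err) {
    if (msg.deliveryCount < maxDeliveries) {
      // No ack: the message stays pending so it can be claimed and retried.
      return "left-pending";
    }
    const reason = err instanceof Error ? err.message : String(err);
    // Retry budget exhausted: park the message first, then ack it.
    await deps.deadLetter(msg, reason);
    await deps.ack(msg.id);
    return "dead-lettered";
  }
  // Ack only after processing succeeded.
  await deps.ack(msg.id);
  return "acked";
}
```

If `deadLetter` itself fails, the error propagates and the message is still not acked, so nothing is dropped silently.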
## Business Logic in Routes
Highest concentration:
- `src/routes/services.ts`: service-specific request construction, event emission, Q Score baseline/onboarding side effects, mission association, and UI response shaping.
- `src/routes/workflows.ts`: run creation, module row initialization, baseline Q Score, approval gate progression, artifact content lookup, and synchronous module execution.
- `src/routes/missions.ts`: mission profile lookup from user service, actor type mapping, start/resume/pause/stage/artifact commands, and coach run orchestration.
- `src/routes/conversations.ts`: active conversation persistence, mission-aware chat routing, actor fallback behavior, and response normalization.
- `src/routes/chat.ts`: Rivet fallback, direct LLM tool loop, service agent selection, and timeout handling.

Low-risk thin routes:
- `src/routes/agents.ts`
- `src/routes/grow.ts`
- parts of `src/routes/events.ts`

Recommended ownership target:
- Routes validate/authenticate and translate HTTP to commands.
- Actors own durable user/mission/workflow progression.
- Services own outbound HTTP details.
- Projectors own derived read models.
- Routes should not decide retry, idempotency, or service fallback behavior beyond returning HTTP errors.
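
The ownership target above can be illustrated with a route that only validates input and hands a command to another layer. This is a sketch under assumptions: `StartMissionCommand`, `CommandBus`, `parseStartMission`, and `handleStartMission` are hypothetical names, and the real mission start payload is not shown in this audit. Framework wiring (Hono context, auth middleware) is left out on purpose.

```typescript
// Hypothetical command shape; the real mission start payload is not shown here.
interface StartMissionCommand {
  kind: "mission.start";
  userId: string;
  missionId: string;
  requestId: string;
}

type ParseResult =
  | { tag: "ok"; command: StartMissionCommand }
  | { tag: "invalid"; error: string };

// Route-side work only: validate input and translate it into a command.
function parseStartMission(
  userId: string,
  missionId: string | undefined,
  requestId: string | undefined,
): ParseResult {
  if (!missionId) return { tag: "invalid", error: "missionId is required" };
  if (!requestId) return { tag: "invalid", error: "requestId is required" };
  return { tag: "ok", command: { kind: "mission.start", userId, missionId, requestId } };
}

// The command layer owns everything past this point: actor lookup, retries, idempotency.
interface CommandBus {
  dispatch(command: StartMissionCommand): Promise<{ commandId: string }>;
}

async function handleStartMission(
  bus: CommandBus,
  userId: string,
  missionId?: string,
  requestId?: string,
): Promise<{ status: number; body: Record<string, string> }> {
  const parsed = parseStartMission(userId, missionId, requestId);
  if (parsed.tag === "invalid") {
    return { status: 400, body: { error: parsed.error } };
  }
  const { commandId } = await bus.dispatch(parsed.command);
  // 202 Accepted: the work is queued, not finished, when the response is sent.
  return { status: 202, body: { commandId } };
}
```

The handler contains no retry, fallback, or persistence decisions; those would live behind `CommandBus`.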
## Idempotency Gaps
| Area | Existing behavior | Gap |
| --- | --- | --- |
| Grow Event ingestion | `dedupeKey` unique index; normalizer uses explicit key or source id | Service routes do not consistently set stable dedupe keys for all service-created side effects. |
| Workflow runs | `/workflow-runs/:runId/modules/:moduleId/run` reads `idempotency-key` header | `executeWorkflowModule` does not use the key to suppress duplicate service calls; `/run` generates timestamp keys. |
| Workflow module rows | Has `idempotencyKey`, `retryCount`, `maxRetries` columns | Counters are mostly passive; no central retry state machine. |
| Actor queues | Rivet queues and `loop` step names provide some dedupe for `workflowRunActor` | Several routes bypass actor queue and execute directly. |
| Service session creation | `stableUuid` exists in service-agent helper | Not consistently used as a request id/idempotency key across service calls. |
| OpenCode artifacts | `onConflictDoNothing` for workflow artifacts | OpenCode prompt/message send can duplicate work before artifact row conflict applies. |
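
One way to replace timestamp-based keys is to derive the key only from fields that identify the logical operation. The sketch below assumes a hypothetical `stableIdempotencyKey` helper and `IdempotencyParts` shape; it is not the existing `stableUuid` helper, whose signature is not shown in this audit.

```typescript
// Hypothetical field set; which fields identify an operation is a design decision.
interface IdempotencyParts {
  operation: string;
  userId: string;
  serviceId: string;
  missionInstanceId?: string;
  requestId?: string;
}

// Builds the same key for the same logical operation: no timestamps, no random values.
function stableIdempotencyKey(parts: IdempotencyParts): string {
  const ordered: Array<[string, string | undefined]> = [
    ["op", parts.operation],
    ["user", parts.userId],
    ["service", parts.serviceId],
    ["mission", parts.missionInstanceId],
    ["request", parts.requestId],
  ];
  return ordered
    .filter((entry): entry is [string, string] => entry[1] !== undefined)
    // Encode values so a ":" or "|" inside an id cannot collide with the separators.
    .map(([name, value]) => `${name}:${encodeURIComponent(value)}`)
    .join("|");
}

const key = stableIdempotencyKey({
  operation: "interview.configure",
  userId: "u1",
  serviceId: "interview",
});
// key === "op:interview.configure|user:u1|service:interview"
```

A retried request with the same inputs produces the same key, so a downstream unique index or cache lookup can suppress the duplicate.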
## Retry Gaps
| Area | Existing behavior | Gap |
| --- | --- | --- |
| `workflowRunActor` | Rivet `loop` has `retryBackoffBase` and `retryBackoffMax` | Applies only when execution goes through the actor loop. |
| HTTP service clients | Throw on non-2xx after `fetch` | No timeout, retry classification, request id, or backoff. |
| Gitea client | Some wait/poll helpers exist | Most API calls are single-shot. |
| OpenCode client | Health polling exists | Session/message calls are single-shot. |
| Redis consumer | Infinite loop catches top-level errors | Per-message failures are acked; no retry budget or DLQ. |
| Projectors | Called by event actor | Projector failures need durable retry/replay semantics and status transitions. |
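
The HTTP client, Gitea, and OpenCode rows share the same missing pieces: a timeout, a retry classification, and backoff. The following is a minimal sketch of a shared wrapper, with every name (`withRetry`, `withTimeout`, `RetryOptions`) hypothetical. It does not cancel the underlying call on timeout; a real client would likely also pass an abort signal to `fetch`.

```typescript
// Hypothetical options; none of these names come from the existing clients.
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  timeoutMs: number;
  isRetryable: (err: unknown) => boolean;
  sleep?: (ms: number) => Promise<void>; // injectable so tests can skip real waiting
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Rejects if the operation does not settle within timeoutMs.
// The operation keeps running in the background; it is not cancelled.
function withTimeout<T>(operation: () => Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
    Promise.resolve()
      .then(operation)
      .then(
        (value) => { clearTimeout(timer); resolve(value); },
        (err) => { clearTimeout(timer); reject(err); },
      );
  });
}

async function withRetry<T>(
  operation: (attempt: number) => Promise<T>,
  opts: RetryOptions,
): Promise<T> {
  const sleep = opts.sleep ?? defaultSleep;
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await withTimeout(() => operation(attempt), opts.timeoutMs);
    } catch (err) {
      lastError = err;
      // Stop on the last attempt or on errors classified as permanent.
      if (attempt === opts.maxAttempts || !opts.isRetryable(err)) break;
      // Exponential backoff capped at maxDelayMs: base, 2*base, 4*base, ...
      await sleep(Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** (attempt - 1)));
    }
  }
  throw lastError;
}
```

Passing the attempt number into `operation` lets callers log a `retryAttempt` field or reuse one idempotency key across attempts.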
## Actor Robustness Gaps
- `userActor` is too broad to reason about failure domains. It owns chat, service tools, memory, workflow, Gitea, OpenCode, and DB event writes.
- Product service actors are registered but not the primary path for service proxy routes, so actor-level durability is uneven.
- Mission actor mapping is manually duplicated in routes, registry, and event actor.
- Route-level synchronous workflow execution can hold HTTP requests open across slow service/OpenCode calls.
- Actor initialization is repeated in routes; a central actor gateway could enforce init/idempotency/logging.
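
The duplicated mission actor mapping could collapse into one lookup table that routes, the registry, and the event actor all import. The sketch below is illustrative only: the actor names and the `example-mission` entry are hypothetical, and only the `interview-to-offer` mission is named in this audit.

```typescript
// Hypothetical mission ids and actor names; only "interview-to-offer" appears in this audit.
const MISSION_ACTORS = {
  "interview-to-offer": { actor: "interviewToOfferActor", custom: true },
  "example-mission": { actor: "exampleMissionActor", custom: false },
} as const;

type MissionId = keyof typeof MISSION_ACTORS;

// Own-property check so inherited keys such as "toString" are not treated as missions.
function isMissionId(value: string): value is MissionId {
  return Object.prototype.hasOwnProperty.call(MISSION_ACTORS, value);
}

// Every caller resolves through this one function instead of keeping its own mapping.
function resolveMissionActor(missionId: string): string {
  if (!isMissionId(missionId)) {
    throw new Error(`Unknown mission id: ${missionId}`);
  }
  return MISSION_ACTORS[missionId].actor;
}
```

Adding a mission then means one new table entry, and an unknown id fails loudly in one place.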
## Priority-Ranked Recommendations
1. Create a backend command layer for route-to-actor/service translation. Move mission start, workflow run, approval, service configure, and chat tool dispatch logic out of routes.
2. Make `workflowRunActor` the only executor for workflow modules. Routes should enqueue commands and return command ids.
3. Add a shared outbound `withRetry`/timeout/idempotency wrapper for service clients, Gitea, OpenCode, and LLM calls.
4. Add DLQ and replay support for Redis/event processing. Do not ack canonical Redis messages until the event has been durably recorded and projected, or has been moved to the DLQ.
5. Normalize mission actor mapping into a single registry source used by routes, event actor, and mission registry.
6. Split `userActor` responsibilities: chat/memory/workflow/OpenCode paths should be smaller actors or delegated services with explicit contracts.
7. Give route-created side effects stable idempotency keys derived from request id, user id, mission instance id, service id, and operation name.
8. Add structured logging fields across routes/actors/events: `requestId`, `userId`, `missionInstanceId`, `runId`, `moduleId`, `eventId`, `idempotencyKey`, `retryAttempt`.
9. Add focused tests around duplicate workflow module run, duplicate service event ingest, Redis failure handling, and mission projector replay.
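
The structured logging fields in recommendation 8 can be captured in a single shared type so that routes, actors, and event code emit the same keys. A minimal sketch, assuming a hypothetical `createLogger` helper that writes one JSON object per line; the field names are taken from the list above.

```typescript
// Field names follow the audit's list; the logger itself is a hypothetical helper.
interface LogContext {
  requestId?: string;
  userId?: string;
  missionInstanceId?: string;
  runId?: string;
  moduleId?: string;
  eventId?: string;
  idempotencyKey?: string;
  retryAttempt?: number;
}

type LogLevel = "debug" | "info" | "warn" | "error";

// Serializes one record as a single JSON line; JSON.stringify omits undefined fields.
function formatLogLine(level: LogLevel, message: string, context: LogContext = {}): string {
  return JSON.stringify({ level, message, ...context });
}

// Returns a logger that merges a fixed base context into every line.
function createLogger(base: LogContext, write: (line: string) => void = console.log) {
  return (level: LogLevel, message: string, extra: LogContext = {}): void =>
    write(formatLogLine(level, message, { ...base, ...extra }));
}
```

A route would create the logger once per request with `requestId` and `userId`, then pass it down so that later lines add `runId` or `retryAttempt` without repeating the base fields.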
## Suggested Next Slice
Use PRM-43 to introduce shared retry/idempotency primitives first. Then return to this audit and migrate the highest-risk route logic in this order:
1. `/workflow-runs/*/run`
2. `/services` interview and roleplay configure/review endpoints
3. `/missions/:missionId/start`
4. `/api/chat` direct LLM fallback