docs: plan robust retry and dlq layer

2026-06-05 22:01:00 +05:30
parent ef5d7bb378
commit 5f667038d8
1 changed files with 284 additions and 0 deletions
--- a/docs/retry-idempotency-dlq-plan.md
+++ b/docs/retry-idempotency-dlq-plan.md
@@ -0,0 +1,284 @@
+# Retry, Idempotency, and DLQ Plan
+
+PRM-43 design pass for `growqr-backend`.
+
+No implementation was performed in this pass.
+
+## Goals
+
+- Bound every outbound call with timeouts.
+- Retry only safe operations with classified errors.
+- Make repeated commands safe through idempotency keys.
+- Preserve failed event/workflow work in a DLQ with replay tooling.
+- Add logs that let support trace one user action across route, actor, service, Redis, projector, and database writes.
+
+## Outbound Call Site Inventory
+
+| Area | Files | Current behavior | Needed behavior |
+| --- | --- | --- | --- |
+| Product service clients | `src/services/product-service-clients.ts` | Direct `fetch`, no timeout/retry/idempotency header | Shared service client with timeout, retry, idempotency key, and request id. |
+| Service agent probes | `src/services/service-agents.ts` | Direct `fetch`, some fallback summaries | Same shared client; distinguish "unavailable" from retriable failure. |
+| Gitea | `src/lib/gitea.ts`, `src/docker/manager.ts`, `src/actors/user-actor.ts` | Direct `fetch`, some wait-for-ready helpers | Retry transient Gitea API errors; idempotent repo/user/file operations. |
+| OpenCode | `src/lib/opencode.ts`, `src/workflows/executors/opencode-executor.ts` | Direct `fetch`, health polling, no command dedupe | Timeout and retry health/session/message calls; stable command id for prompts. |
+| LLM | `src/lib/llm.ts`, `src/actors/conversation/agent.ts`, `src/events/projectors/projection-agent.ts` | Direct SDK/fetch calls | Timeout, retry on provider transient errors, no retry on content/schema errors. |
+| Actor sends | routes, `src/events/route-to-user-actor.ts`, actors | `getOrCreate(...).method(...)`, queue sends | Standard command envelope with idempotency key and correlation ids. |
+| Redis consumer | `src/events/redis-consumer.ts` | Loops forever; canonical messages ack in `finally`; no DLQ | Retry budget, pending handling, DLQ stream/table, replay. |
+| Projectors | `src/events/projectors/*`, `src/actors/events/user-event-actor.ts` | Called within event actor processing | Per-projector idempotency and failure status; replay from stored Grow Events. |
+| Workflow module runner | `src/workflows/module-runner.ts`, `src/actors/workflow-run-actor.ts` | Actor loop retries in one path; direct route execution in another | Actor-only execution, durable command id, retry state in DB. |
+
+## Shared `withRetry` API
+
+Add `src/lib/retry.ts`:
+
+```ts
+export type RetryPolicy = {
+  maxAttempts: number;
+  baseDelayMs: number;
+  maxDelayMs: number;
+  timeoutMs: number;
+  jitter: boolean;
+};
+
+export async function withRetry<T>(
+  operation: string,
+  fn: (ctx: { signal: AbortSignal; attempt: number }) => Promise<T>,
+  options: {
+    policy?: Partial<RetryPolicy>;
+    idempotencyKey?: string;
+    classify?: (error: unknown) => "retry" | "fail";
+    logFields?: Record<string, unknown>;
+  },
+): Promise<T>;
+```
+
+Default policy:
+
+- `maxAttempts: 3`
+- `baseDelayMs: 250`
+- `maxDelayMs: 5_000`
+- `timeoutMs: 10_000`
+- jitter enabled
+
+Classification:
+
+- Retry: network errors, abort/timeout, HTTP `408`, `425`, `429`, `500`, `502`, `503`, `504`.
+- Do not retry: HTTP `400`, `401`, `403`, `404`, validation/schema errors, duplicate/idempotency conflicts that already completed.
+- Special case: `409` may be success for idempotent create-if-absent operations.
+
+## Idempotency Model
+
+Add a command/event idempotency key convention:
+
+```txt
+<domain>:<userId>:<entityId>:<operation>:<version>
+```
+
+Examples:
+
+- `workflow:user_123:run_456:module:resume:v1`
+- `mission:user_123:instance_456:start:v1`
+- `service:user_123:interview:configure:session_abc`
+- `event:user_123:growEventId:project:qscore:v1`
+- `opencode:user_123:run_456:interview-plan:prompt-v4`
+
+Where to store:
+
+- `workflowRunModules.idempotencyKey` for module commands.
+- `workflowEvents.payload.idempotencyKey` for audit trail.
+- `growEvents.dedupeKey` for event ingestion.
+- Add a future `idempotency_keys` table only if multiple domains need durable response reuse.
+
+Minimum table design if needed:
+
+```txt
+idempotency_keys
+  key text primary key
+  domain text not null
+  user_id text
+  status text check (processing, completed, failed)
+  request_hash text
+  response jsonb
+  error text
+  expires_at timestamptz
+  created_at timestamptz
+  updated_at timestamptz
+```
+
+## HTTP Service Client Plan
+
+Create `src/services/http-client.ts`:
+
+- Accepts `baseUrl`, `path`, `method`, `json`, `headers`, `idempotencyKey`, `operation`, `timeoutMs`.
+- Adds:
+  - `authorization: Bearer <A2A_ALLOWED_KEY>` when configured.
+  - `x-request-id`
+  - `x-idempotency-key` or `idempotency-key`.
+  - `x-growqr-user` when user-scoped.
+- Uses `withRetry`.
+- Parses text once and returns typed JSON.
+- Logs attempt, latency, status, and error class.
+
+Then migrate:
+
+1. `product-service-clients.ts`
+2. `service-agents.ts`
+3. mission route direct user-service fetch
+4. workflow service health checks
+
+## Workflow Retry Plan
+
+Target behavior:
+
+- Routes enqueue commands to `workflowRunActor`; routes do not call `executeWorkflowModule` directly.
+- `workflowRunActor` writes command state before execution.
+- `executeWorkflowModule` receives `idempotencyKey` and passes it to service/OpenCode calls.
+- On failure, increment `workflowRunModules.retryCount`, store `error`, and emit `workflowEvents` with `retryAttempt`.
+- Exceeding retry budget marks module `blocked` or `failed` based on module type and writes a DLQ row/event.
+
+Module status transition:
+
+```mermaid
+stateDiagram-v2
+  [*] --> idle
+  idle --> queued
+  queued --> running
+  running --> done
+  running --> retry_wait
+  retry_wait --> running
+  running --> blocked
+  running --> dlq
+  dlq --> replaying
+  replaying --> running
+```
+
+## Redis Consumer and DLQ Plan
+
+Do not ack canonical Redis messages until one of these is true:
+
+- event persisted and routed/projected successfully;
+- event persisted but routing failed and a durable retry record was created;
+- message moved to DLQ after retry budget.
+
+Add DLQ options:
+
+1. Redis stream DLQ: `grow.events.dlq`
+2. Postgres table: `grow_event_dlq`
+
+Recommended to use both:
+
+- Redis DLQ for operational stream tooling.
+- Postgres DLQ for admin UI, audit, and replay metadata.
+
+DLQ row fields:
+
+```txt
+id
+source_stream
+source_message_id
+payload
+error
+attempts
+last_attempt_at
+status: pending | replaying | replayed | discarded
+created_at
+updated_at
+```
+
+Replay script:
+
+```txt
+pnpm events:replay --status failed --limit 100
+pnpm events:replay --dlq --id <dlq-id>
+pnpm events:replay --event-id <grow-event-id> --projectors qscore,service-session
+```
+
+Script responsibilities:
+
+- Re-read stored payload.
+- Re-run `recordGrowEvent` if needed.
+- Re-run `routeGrowEventToUserActor`.
+- Optionally run only selected projectors.
+- Preserve original `dedupeKey`.
+
+## Projector Idempotency Plan
+
+Projectors should be repeatable:
+
+- Q Score latest table already has `(userId, signalId)` primary key.
+- Mission service sessions have unique `(serviceId, externalId)`.
+- Artifacts should dedupe by `(missionInstanceId, serviceId, externalId, type)` or a stable artifact key.
+- Mission stage patches should be applied with deterministic status/progress and no duplicate suggestions.
+
+Add projector event logs:
+
+```txt
+grow_event_projector_runs
+  event_id
+  projector
+  status
+  attempt
+  error
+  started_at
+  completed_at
+```
+
+## Logging Fields
+
+Every route/actor/event/retry log should include as many of these as available:
+
+- `requestId`
+- `traceId`
+- `userId`
+- `orgId`
+- `actorType`
+- `actorKey`
+- `runId`
+- `moduleId`
+- `missionId`
+- `missionInstanceId`
+- `stageId`
+- `eventId`
+- `source`
+- `eventType`
+- `idempotencyKey`
+- `operation`
+- `attempt`
+- `maxAttempts`
+- `latencyMs`
+- `httpStatus`
+- `retryable`
+- `dlqId`
+
+## Test Plan
+
+Unit tests:
+
+- `withRetry` retries transient errors and stops on non-retryable errors.
+- Timeout aborts fetch and logs retry attempt.
+- Idempotency key helper returns stable keys.
+- HTTP client adds auth, request id, and idempotency headers.
+
+Integration tests:
+
+- Duplicate `/workflow-runs/:runId/modules/:moduleId/run` command does not duplicate service call.
+- Duplicate Grow Event with same `dedupeKey` is stored once and projection remains stable.
+- Redis message failure is not acked until retry/DLQ path is recorded.
+- DLQ replay reprocesses a failed event and updates projector status.
+- OpenCode module execution retry does not create duplicate artifact rows.
+
+Manual staging drills:
+
+1. Stop interview service, run interview module, verify retry and blocked/DLQ behavior.
+2. Emit duplicate Redis events, verify one `grow_events` row and stable projector state.
+3. Break Gitea token, provision stack, verify retry logs and no partial untracked state.
+4. Replay a DLQ event, verify mission progress and Q Score update.
+
+## Implementation Order
+
+1. Add `src/lib/retry.ts` and focused unit tests.
+2. Add service HTTP client and migrate product service calls.
+3. Add workflow command idempotency and route-to-actor queueing.
+4. Add Redis DLQ and replay script.
+5. Add projector run records.
+6. Migrate Gitea/OpenCode/LLM calls to `withRetry`.
+7. Add staging failure drills to deployment checklist.