docs: plan robust retry and dlq layer

This commit is contained in:
-Puter
2026-06-05 22:01:00 +05:30
parent ef5d7bb378
commit 5f667038d8

View File

@@ -0,0 +1,284 @@
# Retry, Idempotency, and DLQ Plan
PRM-43 design pass for `growqr-backend`.
No implementation was performed in this pass.
## Goals
- Bound every outbound call with timeouts.
- Retry only safe operations with classified errors.
- Make repeated commands safe through idempotency keys.
- Preserve failed event/workflow work in a DLQ with replay tooling.
- Add logs that let support trace one user action across route, actor, service, Redis, projector, and database writes.
## Outbound Call Site Inventory
| Area | Files | Current behavior | Needed behavior |
| --- | --- | --- | --- |
| Product service clients | `src/services/product-service-clients.ts` | Direct `fetch`, no timeout/retry/idempotency header | Shared service client with timeout, retry, idempotency key, and request id. |
| Service agent probes | `src/services/service-agents.ts` | Direct `fetch`, some fallback summaries | Same shared client; distinguish "unavailable" from retriable failure. |
| Gitea | `src/lib/gitea.ts`, `src/docker/manager.ts`, `src/actors/user-actor.ts` | Direct `fetch`, some wait-for-ready helpers | Retry transient Gitea API errors; idempotent repo/user/file operations. |
| OpenCode | `src/lib/opencode.ts`, `src/workflows/executors/opencode-executor.ts` | Direct `fetch`, health polling, no command dedupe | Timeout and retry health/session/message calls; stable command id for prompts. |
| LLM | `src/lib/llm.ts`, `src/actors/conversation/agent.ts`, `src/events/projectors/projection-agent.ts` | Direct SDK/fetch calls | Timeout, retry on provider transient errors, no retry on content/schema errors. |
| Actor sends | routes, `src/events/route-to-user-actor.ts`, actors | `getOrCreate(...).method(...)`, queue sends | Standard command envelope with idempotency key and correlation ids. |
| Redis consumer | `src/events/redis-consumer.ts` | Loops forever; canonical messages ack in `finally`; no DLQ | Retry budget, pending handling, DLQ stream/table, replay. |
| Projectors | `src/events/projectors/*`, `src/actors/events/user-event-actor.ts` | Called within event actor processing | Per-projector idempotency and failure status; replay from stored Grow Events. |
| Workflow module runner | `src/workflows/module-runner.ts`, `src/actors/workflow-run-actor.ts` | Actor loop retries in one path; direct route execution in another | Actor-only execution, durable command id, retry state in DB. |
## Shared `withRetry` API
Add `src/lib/retry.ts`:
```ts
export type RetryPolicy = {
maxAttempts: number;
baseDelayMs: number;
maxDelayMs: number;
timeoutMs: number;
jitter: boolean;
};
export async function withRetry<T>(
operation: string,
fn: (ctx: { signal: AbortSignal; attempt: number }) => Promise<T>,
options: {
policy?: Partial<RetryPolicy>;
idempotencyKey?: string;
classify?: (error: unknown) => "retry" | "fail";
logFields?: Record<string, unknown>;
},
): Promise<T>;
```
Default policy:
- `maxAttempts: 3`
- `baseDelayMs: 250`
- `maxDelayMs: 5_000`
- `timeoutMs: 10_000`
- jitter enabled
Classification:
- Retry: network errors, abort/timeout, HTTP `408`, `425`, `429`, `500`, `502`, `503`, `504`.
- Do not retry: HTTP `400`, `401`, `403`, `404`, validation/schema errors, duplicate/idempotency conflicts that already completed.
- Special case: `409` may be success for idempotent create-if-absent operations.
## Idempotency Model
Add a command/event idempotency key convention:
```txt
<domain>:<userId>:<entityId>:<operation>:<version>
```
Examples:
- `workflow:user_123:run_456:module:resume:v1`
- `mission:user_123:instance_456:start:v1`
- `service:user_123:interview:configure:session_abc`
- `event:user_123:growEventId:project:qscore:v1`
- `opencode:user_123:run_456:interview-plan:prompt-v4`
Where to store:
- `workflowRunModules.idempotencyKey` for module commands.
- `workflowEvents.payload.idempotencyKey` for audit trail.
- `growEvents.dedupeKey` for event ingestion.
- Add a future `idempotency_keys` table only if multiple domains need durable response reuse.
Minimum table design if needed:
```txt
idempotency_keys
key text primary key
domain text not null
user_id text
status text check (processing, completed, failed)
request_hash text
response jsonb
error text
expires_at timestamptz
created_at timestamptz
updated_at timestamptz
```
## HTTP Service Client Plan
Create `src/services/http-client.ts`:
- Accepts `baseUrl`, `path`, `method`, `json`, `headers`, `idempotencyKey`, `operation`, `timeoutMs`.
- Adds:
- `authorization: Bearer <A2A_ALLOWED_KEY>` when configured.
- `x-request-id`
- `x-idempotency-key` or `idempotency-key`.
- `x-growqr-user` when user-scoped.
- Uses `withRetry`.
- Parses text once and returns typed JSON.
- Logs attempt, latency, status, and error class.
Then migrate:
1. `product-service-clients.ts`
2. `service-agents.ts`
3. mission route direct user-service fetch
4. workflow service health checks
## Workflow Retry Plan
Target behavior:
- Routes enqueue commands to `workflowRunActor`; routes do not call `executeWorkflowModule` directly.
- `workflowRunActor` writes command state before execution.
- `executeWorkflowModule` receives `idempotencyKey` and passes it to service/OpenCode calls.
- On failure, increment `workflowRunModules.retryCount`, store `error`, and emit `workflowEvents` with `retryAttempt`.
- Exceeding retry budget marks module `blocked` or `failed` based on module type and writes a DLQ row/event.
Module status transition:
```mermaid
stateDiagram-v2
[*] --> idle
idle --> queued
queued --> running
running --> done
running --> retry_wait
retry_wait --> running
running --> blocked
running --> dlq
dlq --> replaying
replaying --> running
```
## Redis Consumer and DLQ Plan
Do not ack canonical Redis messages until one of these is true:
- event persisted and routed/projected successfully;
- event persisted but routing failed and a durable retry record was created;
- message moved to DLQ after retry budget.
Add DLQ options:
1. Redis stream DLQ: `grow.events.dlq`
2. Postgres table: `grow_event_dlq`
Recommended to use both:
- Redis DLQ for operational stream tooling.
- Postgres DLQ for admin UI, audit, and replay metadata.
DLQ row fields:
```txt
id
source_stream
source_message_id
payload
error
attempts
last_attempt_at
status: pending | replaying | replayed | discarded
created_at
updated_at
```
Replay script:
```txt
pnpm events:replay --status failed --limit 100
pnpm events:replay --dlq --id <dlq-id>
pnpm events:replay --event-id <grow-event-id> --projectors qscore,service-session
```
Script responsibilities:
- Re-read stored payload.
- Re-run `recordGrowEvent` if needed.
- Re-run `routeGrowEventToUserActor`.
- Optionally run only selected projectors.
- Preserve original `dedupeKey`.
## Projector Idempotency Plan
Projectors should be repeatable:
- Q Score latest table already has `(userId, signalId)` primary key.
- Mission service sessions have unique `(serviceId, externalId)`.
- Artifacts should dedupe by `(missionInstanceId, serviceId, externalId, type)` or a stable artifact key.
- Mission stage patches should be applied with deterministic status/progress and no duplicate suggestions.
Add projector event logs:
```txt
grow_event_projector_runs
event_id
projector
status
attempt
error
started_at
completed_at
```
## Logging Fields
Every route/actor/event/retry log should include as many of these as available:
- `requestId`
- `traceId`
- `userId`
- `orgId`
- `actorType`
- `actorKey`
- `runId`
- `moduleId`
- `missionId`
- `missionInstanceId`
- `stageId`
- `eventId`
- `source`
- `eventType`
- `idempotencyKey`
- `operation`
- `attempt`
- `maxAttempts`
- `latencyMs`
- `httpStatus`
- `retryable`
- `dlqId`
## Test Plan
Unit tests:
- `withRetry` retries transient errors and stops on non-retryable errors.
- Timeout aborts fetch and logs retry attempt.
- Idempotency key helper returns stable keys.
- HTTP client adds auth, request id, and idempotency headers.
Integration tests:
- Duplicate `/workflow-runs/:runId/modules/:moduleId/run` command does not duplicate service call.
- Duplicate Grow Event with same `dedupeKey` is stored once and projection remains stable.
- Redis message failure is not acked until retry/DLQ path is recorded.
- DLQ replay reprocesses a failed event and updates projector status.
- OpenCode module execution retry does not create duplicate artifact rows.
Manual staging drills:
1. Stop interview service, run interview module, verify retry and blocked/DLQ behavior.
2. Emit duplicate Redis events, verify one `grow_events` row and stable projector state.
3. Break Gitea token, provision stack, verify retry logs and no partial untracked state.
4. Replay a DLQ event, verify mission progress and Q Score update.
## Implementation Order
1. Add `src/lib/retry.ts` and focused unit tests.
2. Add service HTTP client and migrate product service calls.
3. Add workflow command idempotency and route-to-actor queueing.
4. Add Redis DLQ and replay script.
5. Add projector run records.
6. Migrate Gitea/OpenCode/LLM calls to `withRetry`.
7. Add staging failure drills to deployment checklist.