docs: plan robust retry and dlq layer
docs/retry-idempotency-dlq-plan.md
@@ -0,0 +1,284 @@
# Retry, Idempotency, and DLQ Plan

PRM-43 design pass for `growqr-backend`.

No implementation was performed in this pass.

## Goals

- Bound every outbound call with timeouts.
- Retry only safe operations with classified errors.
- Make repeated commands safe through idempotency keys.
- Preserve failed event/workflow work in a DLQ with replay tooling.
- Add logs that let support trace one user action across route, actor, service, Redis, projector, and database writes.

## Outbound Call Site Inventory

| Area | Files | Current behavior | Needed behavior |
| --- | --- | --- | --- |
| Product service clients | `src/services/product-service-clients.ts` | Direct `fetch`, no timeout/retry/idempotency header | Shared service client with timeout, retry, idempotency key, and request id. |
| Service agent probes | `src/services/service-agents.ts` | Direct `fetch`, some fallback summaries | Same shared client; distinguish "unavailable" from retriable failure. |
| Gitea | `src/lib/gitea.ts`, `src/docker/manager.ts`, `src/actors/user-actor.ts` | Direct `fetch`, some wait-for-ready helpers | Retry transient Gitea API errors; idempotent repo/user/file operations. |
| OpenCode | `src/lib/opencode.ts`, `src/workflows/executors/opencode-executor.ts` | Direct `fetch`, health polling, no command dedupe | Timeout and retry health/session/message calls; stable command id for prompts. |
| LLM | `src/lib/llm.ts`, `src/actors/conversation/agent.ts`, `src/events/projectors/projection-agent.ts` | Direct SDK/fetch calls | Timeout, retry on provider transient errors, no retry on content/schema errors. |
| Actor sends | routes, `src/events/route-to-user-actor.ts`, actors | `getOrCreate(...).method(...)`, queue sends | Standard command envelope with idempotency key and correlation ids. |
| Redis consumer | `src/events/redis-consumer.ts` | Loops forever; canonical messages ack in `finally`; no DLQ | Retry budget, pending handling, DLQ stream/table, replay. |
| Projectors | `src/events/projectors/*`, `src/actors/events/user-event-actor.ts` | Called within event actor processing | Per-projector idempotency and failure status; replay from stored Grow Events. |
| Workflow module runner | `src/workflows/module-runner.ts`, `src/actors/workflow-run-actor.ts` | Actor loop retries in one path; direct route execution in another | Actor-only execution, durable command id, retry state in DB. |

## Shared `withRetry` API

Add `src/lib/retry.ts`:

```ts
export type RetryPolicy = {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  timeoutMs: number;
  jitter: boolean;
};

export async function withRetry<T>(
  operation: string,
  fn: (ctx: { signal: AbortSignal; attempt: number }) => Promise<T>,
  options: {
    policy?: Partial<RetryPolicy>;
    idempotencyKey?: string;
    classify?: (error: unknown) => "retry" | "fail";
    logFields?: Record<string, unknown>;
  },
): Promise<T>;
```

Default policy:

- `maxAttempts: 3`
- `baseDelayMs: 250`
- `maxDelayMs: 5_000`
- `timeoutMs: 10_000`
- jitter enabled

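To make the signature and default policy more concrete, here is a minimal sketch of how the retry loop might look. It is not the planned implementation: the exponential backoff with "full jitter" and the helper names `DEFAULT_POLICY`, `backoffDelayMs`, and `sleep` are assumptions, and `idempotencyKey`/`logFields` handling is left out for brevity.

```ts
type RetryPolicy = {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  timeoutMs: number;
  jitter: boolean;
};

const DEFAULT_POLICY: RetryPolicy = {
  maxAttempts: 3,
  baseDelayMs: 250,
  maxDelayMs: 5_000,
  timeoutMs: 10_000,
  jitter: true,
};

// Assumed strategy: exponential backoff capped at maxDelayMs. With jitter,
// pick a random delay in [0, cap) so concurrent retries spread out.
function backoffDelayMs(
  attempt: number,
  policy: RetryPolicy,
  random: () => number = Math.random,
): number {
  const cap = Math.min(policy.maxDelayMs, policy.baseDelayMs * 2 ** (attempt - 1));
  return policy.jitter ? Math.floor(random() * cap) : cap;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  operation: string,
  fn: (ctx: { signal: AbortSignal; attempt: number }) => Promise<T>,
  options: {
    policy?: Partial<RetryPolicy>;
    classify?: (error: unknown) => "retry" | "fail";
  },
): Promise<T> {
  const policy: RetryPolicy = { ...DEFAULT_POLICY, ...options.policy };
  // Assumption for this sketch: with no classifier, every error is retryable.
  const classify: (error: unknown) => "retry" | "fail" =
    options.classify ?? (() => "retry");
  let lastError: unknown = new Error(`${operation}: no attempts were made`);

  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    // The timeout only takes effect if fn forwards the signal (e.g. to fetch).
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), policy.timeoutMs);
    try {
      return await fn({ signal: controller.signal, attempt });
    } catch (error) {
      lastError = error;
    } finally {
      clearTimeout(timer);
    }
    if (classify(lastError) === "fail" || attempt === policy.maxAttempts) break;
    await sleep(backoffDelayMs(attempt, policy));
  }
  throw lastError;
}
```

A caller would pass the signal through, for example `withRetry("gitea.getRepo", ({ signal }) => fetch(url, { signal }), {})`, where the operation name and `url` are placeholders.
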
Classification:

- Retry: network errors, abort/timeout, HTTP `408`, `425`, `429`, `500`, `502`, `503`, `504`.
- Do not retry: HTTP `400`, `401`, `403`, `404`, validation/schema errors, duplicate/idempotency conflicts that already completed.
- Special case: `409` may be success for idempotent create-if-absent operations.

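The classification rules above could be expressed as a default `classify` function along these lines. This is a sketch under assumptions: `statusOf`, `classifyError`, and `isSuccessStatus` are hypothetical names, errors are assumed to carry a numeric `status` property when they come from an HTTP response, and treating every `TypeError` as a network failure is a simplification based on how `fetch` rejects.

```ts
type Classification = "retry" | "fail";

// Statuses the plan lists as retryable.
const RETRYABLE_STATUSES = new Set([408, 425, 429, 500, 502, 503, 504]);

// Hypothetical helper: read an HTTP status from a thrown value, if it has one.
function statusOf(error: unknown): number | undefined {
  if (typeof error === "object" && error !== null && "status" in error) {
    const status = (error as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

function classifyError(error: unknown): Classification {
  const status = statusOf(error);
  if (status !== undefined) {
    return RETRYABLE_STATUSES.has(status) ? "retry" : "fail";
  }
  if (error instanceof Error) {
    // Aborts and timeouts surface as AbortError or TimeoutError.
    if (error.name === "AbortError" || error.name === "TimeoutError") return "retry";
    // fetch rejects with a TypeError on network failure (assumed to be the
    // only source of TypeError reaching this function).
    if (error instanceof TypeError) return "retry";
  }
  // Validation/schema errors and anything unrecognized are not retried.
  return "fail";
}

// 409 counts as success only for create-if-absent operations.
function isSuccessStatus(
  status: number,
  opts: { createIfAbsent?: boolean } = {},
): boolean {
  if (status >= 200 && status < 300) return true;
  return status === 409 && opts.createIfAbsent === true;
}
```

Keeping the `409` rule in a separate success check, rather than in the classifier, means a create-if-absent call never enters the retry path for a conflict.
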
## Idempotency Model

Add a command/event idempotency key convention:

```txt
<domain>:<userId>:<entityId>:<operation>:<version>
```

Examples:

- `workflow:user_123:run_456:module:resume:v1`
- `mission:user_123:instance_456:start:v1`
- `service:user_123:interview:configure:session_abc`
- `event:user_123:growEventId:project:qscore:v1`
- `opencode:user_123:run_456:interview-plan:prompt-v4`

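A small helper could keep key construction consistent with the convention above. The sketch below is illustrative only: `buildIdempotencyKey` and `IdempotencyKeyParts` are hypothetical names, and the non-empty check is an assumed rule. As in the first example, the operation segment may itself contain `:`.

```ts
type IdempotencyKeyParts = {
  domain: string;
  userId: string;
  entityId: string;
  operation: string; // may contain ":" itself, e.g. "module:resume"
  version: string;
};

// Joins the parts in the documented order. The same inputs always produce
// the same key, which is what makes repeated commands detectable.
function buildIdempotencyKey(parts: IdempotencyKeyParts): string {
  const segments = [
    parts.domain,
    parts.userId,
    parts.entityId,
    parts.operation,
    parts.version,
  ];
  for (const segment of segments) {
    if (segment.trim() === "") {
      throw new Error("idempotency key segments must be non-empty");
    }
  }
  return segments.join(":");
}

const resumeKey = buildIdempotencyKey({
  domain: "workflow",
  userId: "user_123",
  entityId: "run_456",
  operation: "module:resume",
  version: "v1",
});
console.log(resumeKey);
```
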
Where to store:

- `workflowRunModules.idempotencyKey` for module commands.
- `workflowEvents.payload.idempotencyKey` for audit trail.
- `growEvents.dedupeKey` for event ingestion.
- Add a future `idempotency_keys` table only if multiple domains need durable response reuse.

Minimum table design if needed:

```txt
idempotency_keys
  key text primary key
  domain text not null
  user_id text
  status text check (processing, completed, failed)
  request_hash text
  response jsonb
  error text
  expires_at timestamptz
  created_at timestamptz
  updated_at timestamptz
```

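One way the `status`, `request_hash`, and `response` columns could work together is sketched below with an in-memory `Map` standing in for the table. The claim semantics (`execute`, `replay`, `in_progress`, `conflict`) and the names `claimKey` and `completeKey` are assumptions for illustration; the plan does not specify them. In Postgres the first-claim step would need to be atomic, for example via `INSERT ... ON CONFLICT DO NOTHING`.

```ts
type KeyStatus = "processing" | "completed" | "failed";

type IdempotencyRecord = {
  key: string;
  status: KeyStatus;
  requestHash: string;
  response?: unknown;
  error?: string;
};

type ClaimResult =
  | { action: "execute" } // first claim, or retry after a failed attempt
  | { action: "replay"; response: unknown } // completed: reuse stored response
  | { action: "in_progress" } // another attempt is still running
  | { action: "conflict" }; // same key, different request body

function claimKey(
  store: Map<string, IdempotencyRecord>,
  key: string,
  requestHash: string,
): ClaimResult {
  const existing = store.get(key);
  if (!existing) {
    store.set(key, { key, status: "processing", requestHash });
    return { action: "execute" };
  }
  if (existing.requestHash !== requestHash) return { action: "conflict" };
  if (existing.status === "completed") {
    return { action: "replay", response: existing.response };
  }
  if (existing.status === "failed") {
    existing.status = "processing";
    return { action: "execute" };
  }
  return { action: "in_progress" };
}

function completeKey(
  store: Map<string, IdempotencyRecord>,
  key: string,
  response: unknown,
): void {
  const record = store.get(key);
  if (record) {
    record.status = "completed";
    record.response = response;
  }
}
```
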
## HTTP Service Client Plan

Create `src/services/http-client.ts`:

- Accepts `baseUrl`, `path`, `method`, `json`, `headers`, `idempotencyKey`, `operation`, `timeoutMs`.
- Adds:
  - `authorization: Bearer <A2A_ALLOWED_KEY>` when configured.
  - `x-request-id`
  - `x-idempotency-key` or `idempotency-key`.
  - `x-growqr-user` when user-scoped.
- Uses `withRetry`.
- Parses text once and returns typed JSON.
- Logs attempt, latency, status, and error class.

Then migrate:

1. `product-service-clients.ts`
2. `service-agents.ts`
3. mission route direct user-service fetch
4. workflow service health checks

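The header rules for the shared client could be isolated in a pure function, which also makes them easy to unit test. This is a sketch, not the planned module: `buildServiceHeaders`, `ServiceRequestContext`, and `parseJsonBody` are hypothetical names, and `x-idempotency-key` is picked here only because the plan leaves the choice between the two header names open.

```ts
type ServiceRequestContext = {
  requestId: string;
  apiKey?: string; // value of A2A_ALLOWED_KEY, when configured
  idempotencyKey?: string;
  userId?: string; // set for user-scoped calls
  extraHeaders?: Record<string, string>;
};

// Caller-supplied headers go first so the standard headers below win
// if both define the same name.
function buildServiceHeaders(ctx: ServiceRequestContext): Record<string, string> {
  const headers: Record<string, string> = {
    ...ctx.extraHeaders,
    "x-request-id": ctx.requestId,
  };
  if (ctx.apiKey) headers["authorization"] = `Bearer ${ctx.apiKey}`;
  if (ctx.idempotencyKey) headers["x-idempotency-key"] = ctx.idempotencyKey;
  if (ctx.userId) headers["x-growqr-user"] = ctx.userId;
  return headers;
}

// "Parses text once": the response body is read as text a single time and
// JSON is decoded from that string, so the raw text stays available for logs.
function parseJsonBody<T>(text: string): T | undefined {
  return text.trim() === "" ? undefined : (JSON.parse(text) as T);
}

const exampleHeaders = buildServiceHeaders({
  requestId: "req_1",
  apiKey: "secret",
  idempotencyKey: "service:user_123:interview:configure:session_abc",
  userId: "user_123",
});
console.log(exampleHeaders);
```
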
## Workflow Retry Plan

Target behavior:

- Routes enqueue commands to `workflowRunActor`; routes do not call `executeWorkflowModule` directly.
- `workflowRunActor` writes command state before execution.
- `executeWorkflowModule` receives `idempotencyKey` and passes it to service/OpenCode calls.
- On failure, increment `workflowRunModules.retryCount`, store `error`, and emit `workflowEvents` with `retryAttempt`.
- Exceeding the retry budget marks the module `blocked` or `failed`, depending on module type, and writes a DLQ row/event.

Module status transition:

```mermaid
stateDiagram-v2
  [*] --> idle
  idle --> queued
  queued --> running
  running --> done
  running --> retry_wait
  retry_wait --> running
  running --> blocked
  running --> dlq
  dlq --> replaying
  replaying --> running
```

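The diagram above can be encoded as a transition table so that the actor rejects illegal status changes before writing them. The sketch below mirrors the diagram's edges exactly; the names `ModuleStatus`, `TRANSITIONS`, `canTransition`, and `assertTransition` are hypothetical.

```ts
type ModuleStatus =
  | "idle"
  | "queued"
  | "running"
  | "done"
  | "retry_wait"
  | "blocked"
  | "dlq"
  | "replaying";

// One entry per state; the arrays list the states reachable in one step.
const TRANSITIONS: Record<ModuleStatus, readonly ModuleStatus[]> = {
  idle: ["queued"],
  queued: ["running"],
  running: ["done", "retry_wait", "blocked", "dlq"],
  retry_wait: ["running"],
  done: [],
  blocked: [],
  dlq: ["replaying"],
  replaying: ["running"],
};

function canTransition(from: ModuleStatus, to: ModuleStatus): boolean {
  return TRANSITIONS[from].includes(to);
}

// Throws instead of silently writing an invalid status.
function assertTransition(from: ModuleStatus, to: ModuleStatus): void {
  if (!canTransition(from, to)) {
    throw new Error(`illegal module status transition: ${from} -> ${to}`);
  }
}
```
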
## Redis Consumer and DLQ Plan

Do not ack canonical Redis messages until one of these is true:

- event persisted and routed/projected successfully;
- event persisted but routing failed and a durable retry record was created;
- message moved to DLQ after retry budget.

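The three ack conditions above can be captured in a single predicate that the consumer evaluates before acknowledging, in place of the current ack in `finally`. The shape of `ProcessingOutcome` and the name `shouldAck` are assumptions for this sketch.

```ts
type ProcessingOutcome = {
  persisted: boolean; // event was stored
  routed: boolean; // routing/projection succeeded
  retryRecordCreated: boolean; // a durable retry record exists
  movedToDlq: boolean; // message was written to the DLQ
};

function shouldAck(outcome: ProcessingOutcome): boolean {
  // 1. Persisted and routed/projected successfully.
  if (outcome.persisted && outcome.routed) return true;
  // 2. Persisted, routing failed, but a durable retry record was created.
  if (outcome.persisted && outcome.retryRecordCreated) return true;
  // 3. Moved to the DLQ after the retry budget was exhausted.
  return outcome.movedToDlq;
}
```

Any outcome that matches none of the three leaves the message pending, so it can be picked up again instead of being lost.
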
DLQ storage options:

1. Redis stream DLQ: `grow.events.dlq`
2. Postgres table: `grow_event_dlq`

Using both is recommended:

- Redis DLQ for operational stream tooling.
- Postgres DLQ for admin UI, audit, and replay metadata.

DLQ row fields:

```txt
id
source_stream
source_message_id
payload
error
attempts
last_attempt_at
status: pending | replaying | replayed | discarded
created_at
updated_at
```

Replay script:

```txt
pnpm events:replay --status failed --limit 100
pnpm events:replay --dlq --id <dlq-id>
pnpm events:replay --event-id <grow-event-id> --projectors qscore,service-session
```

Script responsibilities:

- Re-read stored payload.
- Re-run `recordGrowEvent` if needed.
- Re-run `routeGrowEventToUserActor`.
- Optionally run only selected projectors.
- Preserve original `dedupeKey`.

## Projector Idempotency Plan

Projectors should be repeatable:

- Q Score latest table already has `(userId, signalId)` primary key.
- Mission service sessions have unique `(serviceId, externalId)`.
- Artifacts should dedupe by `(missionInstanceId, serviceId, externalId, type)` or a stable artifact key.
- Mission stage patches should be applied with deterministic status/progress and no duplicate suggestions.

Add projector event logs:

```txt
grow_event_projector_runs
  event_id
  projector
  status
  attempt
  error
  started_at
  completed_at
```

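As an illustration of how run records could make replays safe, the sketch below tracks one record per `(event_id, projector)` pair in an in-memory `Map` and skips projectors that already completed. The status values (`running`, `completed`, `failed`), the one-row-per-pair choice, and the name `runProjector` are assumptions; the plan only lists the column names.

```ts
type ProjectorRun = {
  eventId: string;
  projector: string;
  status: "running" | "completed" | "failed";
  attempt: number;
  error?: string;
  startedAt: Date;
  completedAt?: Date;
};

function runProjector(
  runs: Map<string, ProjectorRun>,
  eventId: string,
  projector: string,
  project: () => void,
): ProjectorRun {
  const runKey = `${eventId}:${projector}`;
  const previous = runs.get(runKey);
  // Already applied for this event: return the stored record, do not re-run.
  if (previous?.status === "completed") return previous;

  const run: ProjectorRun = {
    eventId,
    projector,
    status: "running",
    attempt: (previous?.attempt ?? 0) + 1,
    startedAt: new Date(),
  };
  runs.set(runKey, run);
  try {
    project();
    run.status = "completed";
  } catch (error) {
    // Record the failure so replay tooling can find and retry it.
    run.status = "failed";
    run.error = error instanceof Error ? error.message : String(error);
  }
  run.completedAt = new Date();
  return run;
}
```

With this shape, replaying an event re-runs only the projectors whose latest record is `failed`, which matches the `--projectors` style of selective replay.
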
## Logging Fields

Every route/actor/event/retry log should include as many of these as available:

- `requestId`
- `traceId`
- `userId`
- `orgId`
- `actorType`
- `actorKey`
- `runId`
- `moduleId`
- `missionId`
- `missionInstanceId`
- `stageId`
- `eventId`
- `source`
- `eventType`
- `idempotencyKey`
- `operation`
- `attempt`
- `maxAttempts`
- `latencyMs`
- `httpStatus`
- `retryable`
- `dlqId`

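A shared type for these fields would let the compiler catch misspelled names across routes, actors, and retries. The sketch below is one possible shape: the field types (string, number, boolean) are assumptions, and `compactLogFields` is a hypothetical helper that drops unset values so logs contain only the fields that are available.

```ts
type LogFields = Partial<{
  requestId: string;
  traceId: string;
  userId: string;
  orgId: string;
  actorType: string;
  actorKey: string;
  runId: string;
  moduleId: string;
  missionId: string;
  missionInstanceId: string;
  stageId: string;
  eventId: string;
  source: string;
  eventType: string;
  idempotencyKey: string;
  operation: string;
  attempt: number;
  maxAttempts: number;
  latencyMs: number;
  httpStatus: number;
  retryable: boolean;
  dlqId: string;
}>;

// Removes undefined/null entries while keeping the original field order.
function compactLogFields(fields: LogFields): LogFields {
  const out: Record<string, unknown> = {};
  for (const [name, value] of Object.entries(fields)) {
    if (value !== undefined && value !== null) out[name] = value;
  }
  return out as LogFields;
}

const retryLog = compactLogFields({
  requestId: "req_1",
  userId: undefined,
  operation: "interview.configure",
  attempt: 2,
  maxAttempts: 3,
  retryable: true,
});
console.log(JSON.stringify(retryLog));
```
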
## Test Plan

Unit tests:

- `withRetry` retries transient errors and stops on non-retryable errors.
- Timeout aborts fetch and logs retry attempt.
- Idempotency key helper returns stable keys.
- HTTP client adds auth, request id, and idempotency headers.

Integration tests:

- Duplicate `/workflow-runs/:runId/modules/:moduleId/run` command does not duplicate service call.
- Duplicate Grow Event with same `dedupeKey` is stored once and projection remains stable.
- Redis message failure is not acked until retry/DLQ path is recorded.
- DLQ replay reprocesses a failed event and updates projector status.
- OpenCode module execution retry does not create duplicate artifact rows.

Manual staging drills:

1. Stop interview service, run interview module, verify retry and blocked/DLQ behavior.
2. Emit duplicate Redis events, verify one `grow_events` row and stable projector state.
3. Break Gitea token, provision stack, verify retry logs and no partial untracked state.
4. Replay a DLQ event, verify mission progress and Q Score update.

## Implementation Order

1. Add `src/lib/retry.ts` and focused unit tests.
2. Add service HTTP client and migrate product service calls.
3. Add workflow command idempotency and route-to-actor queueing.
4. Add Redis DLQ and replay script.
5. Add projector run records.
6. Migrate Gitea/OpenCode/LLM calls to `withRetry`.
7. Add staging failure drills to deployment checklist.