9.0 KiB
9.0 KiB
Retry, Idempotency, and DLQ Plan
PRM-43 design pass for growqr-backend.
No implementation was performed in this pass.
Goals
- Bound every outbound call with timeouts.
- Retry only safe operations with classified errors.
- Make repeated commands safe through idempotency keys.
- Preserve failed event/workflow work in a DLQ with replay tooling.
- Add logs that let support trace one user action across route, actor, service, Redis, projector, and database writes.
Outbound Call Site Inventory
| Area | Files | Current behavior | Needed behavior |
|---|---|---|---|
| Product service clients | src/services/product-service-clients.ts |
Direct fetch, no timeout/retry/idempotency header |
Shared service client with timeout, retry, idempotency key, and request id. |
| Service agent probes | src/services/service-agents.ts |
Direct fetch, some fallback summaries |
Same shared client; distinguish "unavailable" from retriable failure. |
| Gitea | src/lib/gitea.ts, src/docker/manager.ts, src/actors/user-actor.ts |
Direct fetch, some wait-for-ready helpers |
Retry transient Gitea API errors; idempotent repo/user/file operations. |
| OpenCode | src/lib/opencode.ts, src/workflows/executors/opencode-executor.ts |
Direct fetch, health polling, no command dedupe |
Timeout and retry health/session/message calls; stable command id for prompts. |
| LLM | src/lib/llm.ts, src/actors/conversation/agent.ts, src/events/projectors/projection-agent.ts |
Direct SDK/fetch calls | Timeout, retry on provider transient errors, no retry on content/schema errors. |
| Actor sends | routes, src/events/route-to-user-actor.ts, actors |
getOrCreate(...).method(...), queue sends |
Standard command envelope with idempotency key and correlation ids. |
| Redis consumer | src/events/redis-consumer.ts |
Loops forever; canonical messages ack in finally; no DLQ |
Retry budget, pending handling, DLQ stream/table, replay. |
| Projectors | src/events/projectors/*, src/actors/events/user-event-actor.ts |
Called within event actor processing | Per-projector idempotency and failure status; replay from stored Grow Events. |
| Workflow module runner | src/workflows/module-runner.ts, src/actors/workflow-run-actor.ts |
Actor loop retries in one path; direct route execution in another | Actor-only execution, durable command id, retry state in DB. |
Shared withRetry API
Add src/lib/retry.ts:
export type RetryPolicy = {
maxAttempts: number;
baseDelayMs: number;
maxDelayMs: number;
timeoutMs: number;
jitter: boolean;
};
export async function withRetry<T>(
operation: string,
fn: (ctx: { signal: AbortSignal; attempt: number }) => Promise<T>,
options: {
policy?: Partial<RetryPolicy>;
idempotencyKey?: string;
classify?: (error: unknown) => "retry" | "fail";
logFields?: Record<string, unknown>;
},
): Promise<T>;
Default policy:
maxAttempts: 3baseDelayMs: 250maxDelayMs: 5_000timeoutMs: 10_000- jitter enabled
Classification:
- Retry: network errors, abort/timeout, HTTP
408,425,429,500,502,503,504. - Do not retry: HTTP
400,401,403,404, validation/schema errors, duplicate/idempotency conflicts that already completed. - Special case:
409may be success for idempotent create-if-absent operations.
Idempotency Model
Add a command/event idempotency key convention:
<domain>:<userId>:<entityId>:<operation>:<version>
Examples:
workflow:user_123:run_456:module:resume:v1mission:user_123:instance_456:start:v1service:user_123:interview:configure:session_abcevent:user_123:growEventId:project:qscore:v1opencode:user_123:run_456:interview-plan:prompt-v4
Where to store:
workflowRunModules.idempotencyKeyfor module commands.workflowEvents.payload.idempotencyKeyfor audit trail.growEvents.dedupeKeyfor event ingestion.- Add a future
idempotency_keystable only if multiple domains need durable response reuse.
Minimum table design if needed:
idempotency_keys
key text primary key
domain text not null
user_id text
status text check (processing, completed, failed)
request_hash text
response jsonb
error text
expires_at timestamptz
created_at timestamptz
updated_at timestamptz
HTTP Service Client Plan
Create src/services/http-client.ts:
- Accepts
baseUrl,path,method,json,headers,idempotencyKey,operation,timeoutMs. - Adds:
authorization: Bearer <A2A_ALLOWED_KEY>when configured.x-request-idx-idempotency-keyoridempotency-key.x-growqr-userwhen user-scoped.
- Uses
withRetry. - Parses text once and returns typed JSON.
- Logs attempt, latency, status, and error class.
Then migrate:
product-service-clients.tsservice-agents.ts- mission route direct user-service fetch
- workflow service health checks
Workflow Retry Plan
Target behavior:
- Routes enqueue commands to
workflowRunActor; routes do not callexecuteWorkflowModuledirectly. workflowRunActorwrites command state before execution.executeWorkflowModulereceivesidempotencyKeyand passes it to service/OpenCode calls.- On failure, increment
workflowRunModules.retryCount, storeerror, and emitworkflowEventswithretryAttempt. - Exceeding retry budget marks module
blockedorfailedbased on module type and writes a DLQ row/event.
Module status transition:
stateDiagram-v2
[*] --> idle
idle --> queued
queued --> running
running --> done
running --> retry_wait
retry_wait --> running
running --> blocked
running --> dlq
dlq --> replaying
replaying --> running
Redis Consumer and DLQ Plan
Do not ack canonical Redis messages until one of these is true:
- event persisted and routed/projected successfully;
- event persisted but routing failed and a durable retry record was created;
- message moved to DLQ after retry budget.
Add DLQ options:
- Redis stream DLQ:
grow.events.dlq - Postgres table:
grow_event_dlq
Recommended to use both:
- Redis DLQ for operational stream tooling.
- Postgres DLQ for admin UI, audit, and replay metadata.
DLQ row fields:
id
source_stream
source_message_id
payload
error
attempts
last_attempt_at
status: pending | replaying | replayed | discarded
created_at
updated_at
Replay script:
pnpm events:replay --status failed --limit 100
pnpm events:replay --dlq --id <dlq-id>
pnpm events:replay --event-id <grow-event-id> --projectors qscore,service-session
Script responsibilities:
- Re-read stored payload.
- Re-run
recordGrowEventif needed. - Re-run
routeGrowEventToUserActor. - Optionally run only selected projectors.
- Preserve original
dedupeKey.
Projector Idempotency Plan
Projectors should be repeatable:
- Q Score latest table already has
(userId, signalId)primary key. - Mission service sessions have unique
(serviceId, externalId). - Artifacts should dedupe by
(missionInstanceId, serviceId, externalId, type)or a stable artifact key. - Mission stage patches should be applied with deterministic status/progress and no duplicate suggestions.
Add projector event logs:
grow_event_projector_runs
event_id
projector
status
attempt
error
started_at
completed_at
Logging Fields
Every route/actor/event/retry log should include as many of these as available:
requestIdtraceIduserIdorgIdactorTypeactorKeyrunIdmoduleIdmissionIdmissionInstanceIdstageIdeventIdsourceeventTypeidempotencyKeyoperationattemptmaxAttemptslatencyMshttpStatusretryabledlqId
Test Plan
Unit tests:
withRetryretries transient errors and stops on non-retryable errors.- Timeout aborts fetch and logs retry attempt.
- Idempotency key helper returns stable keys.
- HTTP client adds auth, request id, and idempotency headers.
Integration tests:
- Duplicate
/workflow-runs/:runId/modules/:moduleId/runcommand does not duplicate service call. - Duplicate Grow Event with same
dedupeKeyis stored once and projection remains stable. - Redis message failure is not acked until retry/DLQ path is recorded.
- DLQ replay reprocesses a failed event and updates projector status.
- OpenCode module execution retry does not create duplicate artifact rows.
Manual staging drills:
- Stop interview service, run interview module, verify retry and blocked/DLQ behavior.
- Emit duplicate Redis events, verify one
grow_eventsrow and stable projector state. - Break Gitea token, provision stack, verify retry logs and no partial untracked state.
- Replay a DLQ event, verify mission progress and Q Score update.
Implementation Order
- Add
src/lib/retry.tsand focused unit tests. - Add service HTTP client and migrate product service calls.
- Add workflow command idempotency and route-to-actor queueing.
- Add Redis DLQ and replay script.
- Add projector run records.
- Migrate Gitea/OpenCode/LLM calls to
withRetry. - Add staging failure drills to deployment checklist.