Files
growqr-backend/docs/retry-idempotency-dlq-plan.md
2026-06-05 22:01:00 +05:30

9.0 KiB

Retry, Idempotency, and DLQ Plan

PRM-43 design pass for growqr-backend.

No implementation was performed in this pass.

Goals

  • Bound every outbound call with timeouts.
  • Retry only safe operations with classified errors.
  • Make repeated commands safe through idempotency keys.
  • Preserve failed event/workflow work in a DLQ with replay tooling.
  • Add logs that let support trace one user action across route, actor, service, Redis, projector, and database writes.

Outbound Call Site Inventory

Area Files Current behavior Needed behavior
Product service clients src/services/product-service-clients.ts Direct fetch, no timeout/retry/idempotency header Shared service client with timeout, retry, idempotency key, and request id.
Service agent probes src/services/service-agents.ts Direct fetch, some fallback summaries Same shared client; distinguish "unavailable" from retriable failure.
Gitea src/lib/gitea.ts, src/docker/manager.ts, src/actors/user-actor.ts Direct fetch, some wait-for-ready helpers Retry transient Gitea API errors; idempotent repo/user/file operations.
OpenCode src/lib/opencode.ts, src/workflows/executors/opencode-executor.ts Direct fetch, health polling, no command dedupe Timeout and retry health/session/message calls; stable command id for prompts.
LLM src/lib/llm.ts, src/actors/conversation/agent.ts, src/events/projectors/projection-agent.ts Direct SDK/fetch calls Timeout, retry on provider transient errors, no retry on content/schema errors.
Actor sends routes, src/events/route-to-user-actor.ts, actors getOrCreate(...).method(...), queue sends Standard command envelope with idempotency key and correlation ids.
Redis consumer src/events/redis-consumer.ts Loops forever; canonical messages ack in finally; no DLQ Retry budget, pending handling, DLQ stream/table, replay.
Projectors src/events/projectors/*, src/actors/events/user-event-actor.ts Called within event actor processing Per-projector idempotency and failure status; replay from stored Grow Events.
Workflow module runner src/workflows/module-runner.ts, src/actors/workflow-run-actor.ts Actor loop retries in one path; direct route execution in another Actor-only execution, durable command id, retry state in DB.

Shared withRetry API

Add src/lib/retry.ts:

export type RetryPolicy = {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  timeoutMs: number;
  jitter: boolean;
};

export async function withRetry<T>(
  operation: string,
  fn: (ctx: { signal: AbortSignal; attempt: number }) => Promise<T>,
  options: {
    policy?: Partial<RetryPolicy>;
    idempotencyKey?: string;
    classify?: (error: unknown) => "retry" | "fail";
    logFields?: Record<string, unknown>;
  },
): Promise<T>;

Default policy:

  • maxAttempts: 3
  • baseDelayMs: 250
  • maxDelayMs: 5_000
  • timeoutMs: 10_000
  • jitter enabled

Classification:

  • Retry: network errors, abort/timeout, HTTP 408, 425, 429, 500, 502, 503, 504.
  • Do not retry: HTTP 400, 401, 403, 404, validation/schema errors, duplicate/idempotency conflicts that already completed.
  • Special case: 409 may be success for idempotent create-if-absent operations.

Idempotency Model

Add a command/event idempotency key convention:

<domain>:<userId>:<entityId>:<operation>:<version>

Examples:

  • workflow:user_123:run_456:module:resume:v1
  • mission:user_123:instance_456:start:v1
  • service:user_123:interview:configure:session_abc
  • event:user_123:growEventId:project:qscore:v1
  • opencode:user_123:run_456:interview-plan:prompt-v4

Where to store:

  • workflowRunModules.idempotencyKey for module commands.
  • workflowEvents.payload.idempotencyKey for audit trail.
  • growEvents.dedupeKey for event ingestion.
  • Add a future idempotency_keys table only if multiple domains need durable response reuse.

Minimum table design if needed:

idempotency_keys
  key text primary key
  domain text not null
  user_id text
  status text check (processing, completed, failed)
  request_hash text
  response jsonb
  error text
  expires_at timestamptz
  created_at timestamptz
  updated_at timestamptz

HTTP Service Client Plan

Create src/services/http-client.ts:

  • Accepts baseUrl, path, method, json, headers, idempotencyKey, operation, timeoutMs.
  • Adds:
    • authorization: Bearer <A2A_ALLOWED_KEY> when configured.
    • x-request-id
    • x-idempotency-key or idempotency-key.
    • x-growqr-user when user-scoped.
  • Uses withRetry.
  • Parses text once and returns typed JSON.
  • Logs attempt, latency, status, and error class.

Then migrate:

  1. product-service-clients.ts
  2. service-agents.ts
  3. mission route direct user-service fetch
  4. workflow service health checks

Workflow Retry Plan

Target behavior:

  • Routes enqueue commands to workflowRunActor; routes do not call executeWorkflowModule directly.
  • workflowRunActor writes command state before execution.
  • executeWorkflowModule receives idempotencyKey and passes it to service/OpenCode calls.
  • On failure, increment workflowRunModules.retryCount, store error, and emit workflowEvents with retryAttempt.
  • Exceeding retry budget marks module blocked or failed based on module type and writes a DLQ row/event.

Module status transition:

stateDiagram-v2
  [*] --> idle
  idle --> queued
  queued --> running
  running --> done
  running --> retry_wait
  retry_wait --> running
  running --> blocked
  running --> dlq
  dlq --> replaying
  replaying --> running

Redis Consumer and DLQ Plan

Do not ack canonical Redis messages until one of these is true:

  • event persisted and routed/projected successfully;
  • event persisted but routing failed and a durable retry record was created;
  • message moved to DLQ after retry budget.

Add DLQ options:

  1. Redis stream DLQ: grow.events.dlq
  2. Postgres table: grow_event_dlq

Recommended to use both:

  • Redis DLQ for operational stream tooling.
  • Postgres DLQ for admin UI, audit, and replay metadata.

DLQ row fields:

id
source_stream
source_message_id
payload
error
attempts
last_attempt_at
status: pending | replaying | replayed | discarded
created_at
updated_at

Replay script:

pnpm events:replay --status failed --limit 100
pnpm events:replay --dlq --id <dlq-id>
pnpm events:replay --event-id <grow-event-id> --projectors qscore,service-session

Script responsibilities:

  • Re-read stored payload.
  • Re-run recordGrowEvent if needed.
  • Re-run routeGrowEventToUserActor.
  • Optionally run only selected projectors.
  • Preserve original dedupeKey.

Projector Idempotency Plan

Projectors should be repeatable:

  • Q Score latest table already has (userId, signalId) primary key.
  • Mission service sessions have unique (serviceId, externalId).
  • Artifacts should dedupe by (missionInstanceId, serviceId, externalId, type) or a stable artifact key.
  • Mission stage patches should be applied with deterministic status/progress and no duplicate suggestions.

Add projector event logs:

grow_event_projector_runs
  event_id
  projector
  status
  attempt
  error
  started_at
  completed_at

Logging Fields

Every route/actor/event/retry log should include as many of these as available:

  • requestId
  • traceId
  • userId
  • orgId
  • actorType
  • actorKey
  • runId
  • moduleId
  • missionId
  • missionInstanceId
  • stageId
  • eventId
  • source
  • eventType
  • idempotencyKey
  • operation
  • attempt
  • maxAttempts
  • latencyMs
  • httpStatus
  • retryable
  • dlqId

Test Plan

Unit tests:

  • withRetry retries transient errors and stops on non-retryable errors.
  • Timeout aborts fetch and logs retry attempt.
  • Idempotency key helper returns stable keys.
  • HTTP client adds auth, request id, and idempotency headers.

Integration tests:

  • Duplicate /workflow-runs/:runId/modules/:moduleId/run command does not duplicate service call.
  • Duplicate Grow Event with same dedupeKey is stored once and projection remains stable.
  • Redis message failure is not acked until retry/DLQ path is recorded.
  • DLQ replay reprocesses a failed event and updates projector status.
  • OpenCode module execution retry does not create duplicate artifact rows.

Manual staging drills:

  1. Stop interview service, run interview module, verify retry and blocked/DLQ behavior.
  2. Emit duplicate Redis events, verify one grow_events row and stable projector state.
  3. Break Gitea token, provision stack, verify retry logs and no partial untracked state.
  4. Replay a DLQ event, verify mission progress and Q Score update.

Implementation Order

  1. Add src/lib/retry.ts and focused unit tests.
  2. Add service HTTP client and migrate product service calls.
  3. Add workflow command idempotency and route-to-actor queueing.
  4. Add Redis DLQ and replay script.
  5. Add projector run records.
  6. Migrate Gitea/OpenCode/LLM calls to withRetry.
  7. Add staging failure drills to deployment checklist.