Individual Project · Jan 2026
Published: January 2026
Modern products do not process every task in the request-response path. Expensive operations — emails, image transforms, AI tasks — are queued and executed asynchronously by workers. This project is a full implementation of that architecture.
Every serious production system eventually outgrows the naive request-response model. When a user uploads an image, you don't resize it inline — you queue it. When an order is placed, you don't send the receipt email synchronously — you queue it. When an AI task is triggered, you definitely don't block the HTTP thread — you queue it.
I wanted to understand this pattern end-to-end: not just use an off-the-shelf queue service, but build the full stack — producer API, Redis-backed queue, typed workers, real-time dashboard, and observability. This project is that implementation.
The system is a Turborepo monorepo with four apps and two shared packages. Each layer has a single, clear responsibility.
High-Level Architecture
═══════════════════════════════════════════════════════
Client (Dashboard / API Consumer)
│
│ HTTP + WebSocket
▼
Express API ──── Zod validation, rate-limit, auth
│
│ BullMQ enqueue
▼
Redis (BullMQ backend) ──── persistent job storage
│
│ BullMQ dequeue
▼
Workers (email / image / ai) ──── bounded concurrency
│
│ Prometheus /metrics
▼
Prometheus ──────────────────► Grafana dashboards

Repository Layout
═══════════════════════════════════════════════════════
distributed-job-queue/
├── apps/
│   ├── api/            # Express API · WebSocket · /metrics endpoint
│   ├── workers/        # Email · Image · AI worker processes
│   └── dashboard/      # Next.js monitoring dashboard (BFF pattern)
├── packages/
│   ├── queue-config/   # Shared Redis connection + queue declarations
│   └── metrics/        # Shared Prometheus metric registry
└── infra/
    ├── docker-compose.yml
    ├── prometheus.yml
    └── grafana/        # Provisioned datasource + starter dashboard

The API is an Express app that acts as the job producer. It accepts requests, validates them with Zod, enqueues work into Redis via BullMQ, and exposes status endpoints so clients can poll job state.
Every job submission payload is validated with Zod schemas before enqueueing. This catches bad data at the edge so workers never receive malformed jobs. Optional API key authentication and express-rate-limit protect the endpoints from abuse.
The `packages/queue-config` package is a shared module imported by both the API and workers. This single source of truth ensures both sides use identical queue names, Redis connection options, and default job settings.
// packages/queue-config/src/queues.ts
import { Queue, type DefaultJobOptions } from 'bullmq';
import { connection } from './connection'; // assumed sibling module exporting the shared Redis connection

export const defaultJobOptions: DefaultJobOptions = {
  attempts: 3,
  backoff: { type: 'exponential', delay: 1000 },
  removeOnComplete: { count: 100, age: 86400 },  // retain 24 h
  removeOnFail: { count: 200, age: 259200 },     // retain 72 h
};
export const queues = {
  email: new Queue('email', { connection, defaultJobOptions }),
  image: new Queue('image', { connection, defaultJobOptions }),
  ai: new Queue('ai', { connection, defaultJobOptions }),
};

Workers run as independent Node.js processes — separate from the API so they can be scaled horizontally without touching HTTP traffic. Each worker owns one queue, runs with bounded concurrency, and exposes its own metrics port.
The dashboard is a Next.js app using the BFF (Backend-For-Frontend) pattern — API routes on the server proxy to the Express API, so the browser never holds credentials.
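A sketch of what one such BFF route could look like, assuming Next.js App Router conventions and hypothetical `API_URL` / `API_KEY` environment variables; the URL-building is pulled into a pure helper so it can be tested offline:

```typescript
// Hypothetical sketch of app/api/jobs/[queue]/route.ts in the dashboard.
// Pure helper: builds the upstream request, testable without a network.
export function buildProxyRequest(
  queue: string,
  payload: unknown,
  apiUrl: string,
  apiKey: string,
) {
  return {
    url: `${apiUrl}/jobs/${encodeURIComponent(queue)}`,
    init: {
      method: 'POST' as const,
      headers: {
        'content-type': 'application/json',
        // The API key stays server-side; the browser never sees it.
        'x-api-key': apiKey,
      },
      body: JSON.stringify(payload),
    },
  };
}

// The route handler simply forwards to the Express API.
export async function POST(req: Request, ctx: { params: { queue: string } }) {
  const { url, init } = buildProxyRequest(
    ctx.params.queue,
    await req.json(),
    process.env.API_URL ?? 'http://api:3001',
    process.env.API_KEY ?? '',
  );
  const res = await fetch(url, init);
  return new Response(await res.text(), { status: res.status });
}
```

Because the handler runs on the Next.js server, credentials live only in server-side environment variables, which is the whole point of the BFF pattern here.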
The shared `packages/metrics` module exposes a single `prom-client` registry. Both the API and each worker import this package and register their counters/histograms/gauges against the same registry, then expose `/metrics` on separate ports.
# infra/prometheus.yml — scrape config
scrape_configs:
  - job_name: api
    static_configs:
      - targets: ['api:3001']
  - job_name: workers
    static_configs:
      - targets: ['worker-email:9101', 'worker-image:9102', 'worker-ai:9103']

Grafana is provisioned automatically with a starter dashboard (`bullmq-overview.json`) that visualises throughput, average duration, queue depth, and active worker count.

Job Lifecycle
═══════════════════════════════════════════════════════
Step 1 Client POSTs to /jobs/:queue
Step 2 API validates payload (Zod) + checks auth + rate-limit
Step 3 API calls queue.add() → BullMQ serialises job to Redis
Step 4 Worker poll loop picks up job (FIFO / priority)
Step 5 Worker calls job.updateProgress() at each stage
Step 6 BullMQ persists result or failure state in Redis
Step 7 API /jobs/:queue/:id returns current state to poller
Step 8 QueueEvents broadcasts event → WebSocket /live → dashboard
Step 9 Metrics scraped by Prometheus every 15 s → Grafana renders

The `infra/docker-compose.yml` spins up the entire stack — Redis, API, all three workers, the dashboard, Prometheus, and Grafana — with a single command:
docker compose -f infra/docker-compose.yml up --build
# Then open:
# Dashboard → http://localhost:3000
# API → http://localhost:3001
# Prometheus → http://localhost:9090
# Grafana    → http://localhost:3002 (admin / admin)

Building this end-to-end forced me to think about concerns that tutorials gloss over: retry and backoff policy, job retention windows, validating payloads at the edge, and wiring observability across separate processes.
GitHub: github.com/aniketghavte — source code available on request.
Built with Node.js, BullMQ, Redis, Next.js, Prometheus, and Grafana.