Job Queue Runbook

This runbook covers how to detect and recover from a stalled Vendure job queue (update-search-index / apply-collection-filters / send-email / clean-sessions / rits-report-generation) or a crashed worker process. For background, see #848.

Monitoring surfaces

Signal	Location	Threshold
`/health/ready` `dependencies.jobQueueLag`	Vendure app (production / staging)	`fail` when PENDING for `update-search-index` / `apply-collection-filters` exceeds its threshold, or when a stale RUNNING job exists
`/health/ready` `jobQueueLag.queues[*]`	Same as above	Exposes per-queue `pendingCount` / `oldestPendingAgeSeconds` / `runningCount` / `oldestRunningAgeSeconds` as JSON
`/health/ready` `collectionSearchIndexConsistency`	Same as above	`degraded` when the actual membership of recently updated collections diverges from the search index
Fly machine state (`fly status -a ritsubi-ecommerce`)	Fly	Abnormal when the `worker` process stays `stopped` for 1 minute or longer
`pendingSearchIndexUpdates` (Admin API)	Vendure Dashboard / Admin API	Unapplied updates awaiting a manual flush. Normal operation uses `bufferUpdates: false`, so treat the SQL job queue as the source of truth when investigating a backlog

Source of truth for `jobQueueLag` thresholds

MONITORED_JOB_QUEUES (apps/vendure-server/src/observability/observability.service.ts) is the source of truth.

queue	maxPending	maxOldestAgeSeconds	maxRunningAgeSeconds	Rationale
`update-search-index`	300	600	600	Conservative value based on measurements, chosen so that a backlog like incident #848 (1333 jobs stuck for 17h) fails early
`apply-collection-filters`	50	300	600	Low-throughput queue that fires only on collection / filter changes. Conservative initial value that assumes steady-state pending ≈ 0; tune after observing production steady state

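Queues with provisional values are retuned after steady-state observation. When changing a threshold, update the code (the source of truth) and this table together.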
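To illustrate how the table reads, here is a minimal shell sketch of the per-queue check. The function name `queue_lag_status` and its argument order are hypothetical. It also assumes that "exceeds" means strictly greater than the limit. The actual logic lives in observability.service.ts and may differ.

```shell
# Sketch only: mirrors the threshold table in this runbook.
# queue_lag_status is a hypothetical helper, not part of the codebase.
#
# Usage: queue_lag_status QUEUE PENDING OLDEST_PENDING_AGE_S OLDEST_RUNNING_AGE_S
# Prints "fail" when any limit is exceeded, "ok" otherwise, and
# "unmonitored" for queues that have no row in the table.
queue_lag_status() {
  queue=$1 pending=$2 pending_age=$3 running_age=$4
  case "$queue" in
    update-search-index)      max_pending=300 max_age=600 max_running=600 ;;
    apply-collection-filters) max_pending=50  max_age=300 max_running=600 ;;
    *) echo "unmonitored"; return 0 ;;
  esac
  # Assumption: a value equal to the limit still counts as healthy.
  if [ "$pending" -gt "$max_pending" ] ||
     [ "$pending_age" -gt "$max_age" ] ||
     [ "$running_age" -gt "$max_running" ]; then
    echo "fail"
  else
    echo "ok"
  fi
}
```

For example, `queue_lag_status update-search-index 1333 61200 0` reports `fail` for a backlog the size of incident #848 (1333 jobs, 17h = 61200s).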

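Detection

Alert on /health/ready flipping from 200 to 503 via Sentry / external uptime monitoring (this rides on the existing env-status alert).
Receive the Fly notification machine ... has not been seen in Slack.
If 設定 > 検索インデックス (Settings > Search Index) in the Dashboard keeps showing 「修復が必要」 (Needs repair), run auto repair from the same screen.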
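The 200 → 503 flip in the first item can be sketched as a comparison of two consecutive samples. This is an illustrative helper, not the existing alert rule: the name `ready_transition` and the `recovered` / `none` outputs are assumptions made for the example.

```shell
# Sketch only: classify the change between two consecutive /health/ready
# HTTP status codes. ready_transition is a hypothetical helper.
#
# Usage: ready_transition PREVIOUS_STATUS CURRENT_STATUS
ready_transition() {
  if [ "$1" = "200" ] && [ "$2" = "503" ]; then
    echo "alert"       # readiness just started failing
  elif [ "$1" = "503" ] && [ "$2" = "200" ]; then
    echo "recovered"   # readiness is healthy again
  else
    echo "none"        # no change worth reporting
  fi
}
```

A probe could obtain the current status with `curl -sS -o /dev/null -w '%{http_code}' https://commerce.ritsubi-platform.com/health/ready` and pass it in together with the previous sample.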

Recovery procedure

Step 0: Check the current state

# Check whether the Fly machines are alive
flyctl status -a ritsubi-ecommerce

# Inspect jobQueueLag directly via /health/ready
curl -sS https://commerce.ritsubi-platform.com/health/ready | jq .jobQueueLag

# Breakdown of recent job states (Admin API: superadmin)
just env-status production --format json | jq '.vendure.jobQueueLag'

Step 1: Start the worker process

If the worker machine is stopped:

flyctl machine start <machine_id> -a ritsubi-ecommerce

fly.toml sets [[restart]] policy = "always", so OS-level crashes recover automatically. A machine stuck in stopped therefore has one of three causes: an explicit stop, max_retries exceeded, or Fly machine config drift. Run flyctl machine status <machine_id> -a ritsubi-ecommerce --display-config and confirm that the worker's restart policy is always. If it is still on-failure, the deploy config has not been applied to the process group, so first fix [[restart]].processes in apps/vendure-server/fly.toml and the predeploy check. For max_retries exceeded, identify the cause (OOM / a throw at startup) with flyctl logs -a ritsubi-ecommerce -i <machine_id>, then fix the migration / config.

Step 2: Run auto repair from the Dashboard

Open 設定 > 検索インデックス (Settings > Search Index, /settings/search-index) in the React Dashboard and press 自動修復 (Auto repair) first. Auto repair runs the following in order.

Flush pending search updates
Check the backlog of update-search-index / apply-collection-filters
Detect stale RUNNING jobs
Check index drift for recently updated collections, or for the target collection
Differential reindex of only the variants that have drifted

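The repair is complete once the on-screen status returns to 反映済み (Up to date). Operators do not need to know internal queue names or SQL. When a stale RUNNING job exists, auto repair does not enqueue a duplicate reindex; the screen stays at 「修復が必要」 (Needs repair) and shows the stalled process. In that case, check the worker / job queue side in Step 3.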
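The ordering and the stale RUNNING guard described above can be sketched as follows. This models only the decision flow, not the Dashboard's implementation. The helper name `auto_repair_plan` and the step labels are hypothetical. It assumes that the drift check still runs when a stale RUNNING job is present, and that a reindex, when issued, succeeds.

```shell
# Sketch only: print the auto repair steps in order for a given situation.
# auto_repair_plan and the step labels are hypothetical names.
#
# Usage: auto_repair_plan STALE_RUNNING_COUNT DRIFTED_VARIANT_COUNT
auto_repair_plan() {
  stale=$1 drifted=$2
  echo "flush-pending-search-updates"
  echo "check-queue-backlog"
  echo "detect-stale-running"
  echo "check-index-drift"
  if [ "$stale" -gt 0 ]; then
    # A stale RUNNING job exists: do not enqueue another reindex on top of it.
    echo "result:needs-repair"
    return 0
  fi
  if [ "$drifted" -gt 0 ]; then
    # Reindex only the variants that have drifted.
    echo "reindex-drifted-variants"
  fi
  echo "result:up-to-date"
}
```

Calling `auto_repair_plan 2 3` ends with `result:needs-repair` and never prints the reindex step, which corresponds to the case that is handed over to Step 3.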

Step 3: Clear the queue backlog

If the worker is running, DefaultJobQueuePlugin polling (5s) drains the queue automatically. Each job completes within a few seconds, so confirm that oldestPendingAgeSeconds keeps decreasing.
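One way to make "keeps decreasing" concrete is to compare a few consecutive samples of oldestPendingAgeSeconds. The helper below is a hypothetical sketch: the name `is_draining`, the rule that every sample must be lower than the previous one, and the treatment of 0 as "queue empty" are assumptions, not part of the existing tooling.

```shell
# Sketch only: decide whether a queue is draining from successive
# oldestPendingAgeSeconds samples (oldest first). is_draining is hypothetical.
#
# Usage: is_draining AGE_1 AGE_2 [AGE_3 ...]
# Exit status: 0 = draining, 1 = not draining, 2 = fewer than two samples.
is_draining() {
  [ "$#" -ge 2 ] || return 2
  prev=$1
  shift
  for age in "$@"; do
    # Each sample must drop below the previous one, or be 0 (queue empty).
    [ "$age" -lt "$prev" ] || [ "$age" -eq 0 ] || return 1
    prev=$age
  done
  return 0
}
```

For instance, `is_draining 900 640 310 0` succeeds, while `is_draining 900 930 960` fails and points to the manual reindex described next.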

If it does not progress, re-enqueue a reindex manually:

SECRETS_CONFIG=production_vendure ./scripts/ops/with-env.sh -- bash -c '
ADMIN_URL="https://commerce.ritsubi-platform.com/admin-api"
ORIGIN="https://dashboard.ritsubi-platform.com"

login=$(curl -sS -i -H "content-type: application/json" -H "origin: $ORIGIN" -H "referer: $ORIGIN/" \
  "$ADMIN_URL" --data "{\"query\":\"mutation L(\$u:String!,\$p:String!){login(username:\$u,password:\$p,rememberMe:false){__typename}}\",\"variables\":{\"u\":\"${SUPERADMIN_USERNAME}\",\"p\":\"${SUPERADMIN_PASSWORD}\"}}")
TOKEN=$(printf "%s" "$login" | grep -i "^vendure-auth-token:" | awk "{print \$2}" | tr -d "\r")

curl -sS -H "content-type: application/json" -H "origin: $ORIGIN" -H "authorization: Bearer $TOKEN" \
  "$ADMIN_URL" --data "{\"query\":\"mutation { reindex { id state } }\"}"
'

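Step 4: Search update flush / advanced repair

DefaultSearchPlugin normally runs with bufferUpdates: false, so a backlog like this one shows up as update-search-index PENDING rows in SQL job_record. pendingSearchIndexUpdates / runPendingSearchIndexUpdates is a separate path for cases where updates awaiting a manual flush remain; it is not a substitute for fixing a missing job queue worker.

Normally, open 設定 > 検索インデックス (Settings > Search Index, /settings/search-index) in the React Dashboard and press 自動修復 (Auto repair). 保留中の更新を実行 (Run pending updates) can also be used on its own when needed. Use 高度な修復: 全件再構築 (Advanced repair: full rebuild) on the same screen only when collection contents or product data diverge substantially from the storefront search results.

If the screen is inaccessible, flush directly with the mutation: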

mutation {
  runPendingSearchIndexUpdates {
    success
  }
}

Step 5: Confirm recovery

# Confirm that jobQueueLag is back to ok
curl -sS https://commerce.ritsubi-platform.com/health/ready | jq '.dependencies.jobQueueLag, .jobQueueLag, .collectionSearchIndexConsistency'

# Confirm that a customer-facing collection that was blank is back (e.g. Dr.MESOCEUTICAL)
curl -sS https://commerce.ritsubi-platform.com/shop-api \
  -H "content-type: application/json" \
  --data '{"query":"{ search(input:{collectionSlug:\"drmesoceutical\",take:1}) { totalItems } }"}'

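Design assumptions

production: the app process serves HTTP only (VENDURE_RUN_JOB_QUEUE=false), and the worker process group has exclusive ownership of the job queue. A missing worker immediately affects customer-facing flows.
staging: the job queue runs inside the single app machine (VENDURE_RUN_JOB_QUEUE=true). If the app stops, the storefront stops too, so the problem is detected before the queue backs up.
DefaultSearchPlugin uses bufferUpdates: false. Search index updates treat update-search-index in SQL job_record as the source of truth, and the worker process group drains them. pendingSearchIndexUpdates / runPendingSearchIndexUpdates is an auxiliary path for cases where updates awaiting a manual flush remain.
DefaultJobQueuePlugin uses concurrency: 1 (production) and pollInterval: 5s, i.e. one job at a time per queue.
The worker startup reaper marks SQL jobs left in RUNNING as CANCELLED after 10 minutes. This matches the readiness maxRunningAgeSeconds=600 and avoids a state where only health fails first after a worker restart.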
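The alignment between the reaper and the readiness threshold can be sketched as a single age comparison. This is a hypothetical illustration, not the reaper's code: the name `reaper_action`, the epoch-second arguments, and the choice to cancel at exactly 600 seconds are assumptions.

```shell
# Sketch only: decide what the startup reaper would do with one job.
# reaper_action is a hypothetical helper; times are epoch seconds.
#
# Usage: reaper_action STATE STARTED_AT NOW
# Prints the resulting job state.
reaper_action() {
  state=$1 started=$2 now=$3
  max_running_age=600   # 10 minutes, the same value as maxRunningAgeSeconds
  if [ "$state" = "RUNNING" ] && [ $((now - started)) -ge "$max_running_age" ]; then
    echo "CANCELLED"    # left over from a previous worker; clear it
  else
    echo "$state"       # young RUNNING jobs and other states are untouched
  fi
}
```

Because both sides use 600 seconds, a job old enough to make readiness fail on stale RUNNING is also old enough for the reaper to cancel at the next worker startup.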