Alerts

Notify Discord, an HTTP endpoint, or email (and eventually Slack and PagerDuty) when matching events cross a threshold. Three rule shapes: any match, count exceeds, no matches.

An alert is a rule that watches for matching events in a rolling window and posts to a destination when its threshold trips. v1 shipped with Discord webhooks only; v1.2 adds HTTP and email destinations. Native Slack and PagerDuty destinations are planned.

Alerts (Logs ▸ Alerts) live next to your dashboards. Most customers create an alert from a dashboard they're already watching: click ◆ SAVE AS ALERT on the dashboard detail page (or ◆ SAVE AS ALERT next to the search bar) to pre-populate the editor with the current predicate.

Three rule shapes

| Shape | Fires when... | Common use |
| --- | --- | --- |
| Any match | Every single matching event | Cheat-detect, banned-word triggers, admin actions |
| Count exceeds | Count of matching events in window ≥ threshold | Error rate spikes, kick floods, retry storms |
| No matches (heartbeat) | Zero matching events in window | Dead ingestion, silent player counts, broken integrations |

no_match is edge-triggered: it fires ONCE when the no-match condition begins, then clears the moment events return. You won't get a Discord message every 30 seconds for hours during a sustained outage.
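
The three shapes and the edge-triggered behaviour of no_match can be sketched as a single evaluation step. This is an illustrative sketch, not the actual evaluator: the function name evaluate and its arguments are hypothetical, in_alarm stands in for the rule's in_alarm_state, and Any match is simplified to one decision per tick rather than one fire per event.

```python
def evaluate(threshold_type, match_count, threshold=0, in_alarm=False):
    """Return (fire, in_alarm) for one evaluation tick.

    threshold_type is one of "any", "count_gte", "no_match".
    in_alarm is the state carried over from the previous tick.
    """
    if threshold_type == "any":
        # Simplification: one decision per tick, not one fire per event.
        return match_count > 0, False
    if threshold_type == "count_gte":
        return match_count >= threshold, False
    if threshold_type == "no_match":
        silent = match_count == 0
        # Edge-triggered: fire only on the transition into silence,
        # and clear the alarm state as soon as events return.
        fire = silent and not in_alarm
        return fire, silent
    raise ValueError(f"unknown threshold type: {threshold_type}")
```

Feeding the returned in_alarm value back into the next tick is what keeps a sustained outage from firing repeatedly.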

The predicate language

Same grammar as dashboards. Fields: dataset, event, severity. Comma list for OR:

severity:error
severity:warn,error,fatal AND dataset:player-actions
event:login,logout

Empty predicate matches everything (useful for Any match rules on a specific dataset).
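
To make the grammar concrete, here is a minimal sketch of how such a predicate could be parsed and matched against an event. It is written only from the rules described above (three fields, comma lists for OR, AND between clauses, empty matches everything); parse_predicate and matches are hypothetical names, and the real grammar check may be stricter or more lenient.

```python
ALLOWED_FIELDS = {"dataset", "event", "severity"}

def parse_predicate(text):
    """Parse 'field:v1,v2 AND field:v3' into a list of (field, values)."""
    clauses = []
    text = text.strip()
    if not text:
        return clauses  # empty predicate: no clauses, matches everything
    for clause in text.split(" AND "):
        field, sep, values = clause.strip().partition(":")
        if not sep or field not in ALLOWED_FIELDS or not values:
            raise ValueError(f"invalid clause: {clause!r}")
        clauses.append((field, set(values.split(","))))
    return clauses

def matches(clauses, event):
    """True if the event satisfies every clause; a comma list is an OR."""
    return all(event.get(field) in values for field, values in clauses)
```

For example, matches(parse_predicate("event:login,logout"), {"event": "login"}) is true, while an event with event set to "kick" does not match.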

If you need richer filters (payload values, free text, etc.), you're outside what the rollup table supports. Build the query on the search page first and confirm it returns what you want. If it can't be expressed as a dashboard predicate, an alert can't run on it.

Windows and cooldowns

Window is how far back the rule looks each tick. Presets: 1m, 5m, 15m, 1h, 6h, 24h.

Cooldown is the minimum wait between consecutive fires. After a rule fires, it's silent for the cooldown period regardless of how many matches arrive. Presets: 1m, 5m, 15m, 1h, 4h, 1d.

Tuning rule: cooldown should be ≥ window, otherwise you'll re-evaluate the same matches that triggered the last fire.
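
A small sketch of the cooldown check and of why the tuning rule holds. The names can_fire and overlap_seconds are hypothetical, times are plain seconds, and last_fired_at mirrors the rule field of the same name.

```python
def can_fire(now, last_fired_at, cooldown_seconds):
    """True if the rule has never fired or its cooldown has elapsed."""
    if last_fired_at is None:
        return True
    return now - last_fired_at >= cooldown_seconds

def overlap_seconds(window_seconds, cooldown_seconds):
    """Seconds of the previous fire's window that the first post-cooldown
    evaluation looks at again."""
    return max(0, window_seconds - cooldown_seconds)
```

With a 5m window and a 1m cooldown, overlap_seconds(300, 60) is 240: the first evaluation after the cooldown still sees four minutes of the matches that caused the previous fire. With cooldown ≥ window the overlap is 0.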

Destinations

v1.2 supports three destinations: Discord, HTTP, and Email. Slack and PagerDuty are planned.

Discord

Paste a Discord channel webhook URL (https://discord.com/api/webhooks/...). We POST application/json with a content field. Real fires look like:

🚨 Alert `errors-by-severity-spike` fired: ≥10 matches in 5m. Match count: 14

Test fires are prefixed [SERVEROPS TEST] so teammates seeing them in Discord know they're not real.

Rate limiting

Discord limits webhooks at ~30 messages/min per URL. We cap at 5/min per URL to leave headroom for shared channels. Multiple alerts pointing at the same webhook share the budget.

If a fire hits the rate limit, it is not delivered; the audit log records rate_limited: per-URL bucket exhausted instead. The next minute's tokens let it through.
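
The shared per-URL budget can be pictured with a sketch like the one below. It assumes a simple budget of 5 deliveries per URL per clock minute; the actual refill strategy is not specified here, and PerUrlLimiter is a hypothetical name.

```python
from collections import defaultdict

class PerUrlLimiter:
    """Allow at most `limit` deliveries per URL per clock minute.

    Sketch only: old (url, minute) keys are never pruned.
    """

    def __init__(self, limit=5):
        self.limit = limit
        self._used = defaultdict(int)  # (url, minute) -> deliveries so far

    def allow(self, url, now_seconds):
        key = (url, int(now_seconds // 60))
        if self._used[key] >= self.limit:
            return False  # this fire would be audit-logged as rate_limited
        self._used[key] += 1
        return True
```

Because the key is the URL rather than the rule, two alerts pointing at the same webhook draw from the same five slots.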

Retry behaviour

  • 2xx: success, no retry.
  • 4xx: permanent failure (revoked webhook, bad URL). No retry. Customer must fix the URL.
  • 5xx / network: 3 attempts with exponential backoff (200ms, 1s, 5s). After that, audit logged as failed.
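
The retry policy above can be sketched as follows. One assumption to flag: the text lists three backoff delays, so this sketch reads "3 attempts" as three retries after the initial attempt; the real delivery code may count differently. deliver and send are hypothetical names, and send is any callable that returns an HTTP status code or raises OSError on a network error.

```python
import time

BACKOFF_SECONDS = (0.2, 1.0, 5.0)

def deliver(send, sleep=time.sleep):
    """Return "delivered", "permanent_failure", or "failed"."""
    # First pass has no delay; each later pass waits for its backoff first.
    for delay in (None,) + BACKOFF_SECONDS:
        if delay is not None:
            sleep(delay)
        try:
            status = send()
        except OSError:
            continue  # network error: retry
        if 200 <= status < 300:
            return "delivered"
        if 400 <= status < 500:
            return "permanent_failure"  # e.g. revoked webhook; do not retry
        # 5xx or anything else: fall through and retry
    return "failed"
```

Passing sleep as a parameter keeps the function testable without real waiting.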

HTTP destination

Point at any HTTPS endpoint. We POST application/json with this shape:

{
  "version": 1,
  "kind": "fire",
  "fired_at": "2026-05-22T14:08:09Z",
  "rule": {
    "id": "ar_...",
    "name": "errors-spike",
    "project_id": "proj_...",
    "project_slug": "departures-rp",
    "window_seconds": 300
  },
  "match_count": 12,
  "threshold_human": "≥10 matches in 5m",
  "message": "rendered template",
  "permalink": "https://serverops.gg/dashboard/logs/alerts/ar_..."
}

kind is one of fire / recovery / test. Schema is pinned by version; we bump that integer when the shape changes.
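
On the receiving side, a handler might check version first and then branch on kind. The sketch below is one way to do that; handle_alert is a hypothetical name, and it assumes recovery and test payloads carry the same rule object as the fire example above, which this page does not confirm.

```python
import json

SUPPORTED_VERSION = 1

def handle_alert(raw_body):
    """Parse an alert POST body and return a one-line summary.

    Returns None for test fires so they can be ignored downstream.
    """
    payload = json.loads(raw_body)
    if payload.get("version") != SUPPORTED_VERSION:
        raise ValueError(f"unsupported schema version: {payload.get('version')}")
    kind = payload["kind"]
    name = payload["rule"]["name"]
    if kind == "test":
        return None
    if kind == "fire":
        return (
            f"FIRED {name}: {payload['threshold_human']} "
            f"(count {payload['match_count']})"
        )
    if kind == "recovery":
        return f"RECOVERED {name}"
    raise ValueError(f"unknown kind: {kind}")
```

Rejecting unknown versions explicitly means a future schema bump fails loudly instead of being misread.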

Signature verification. Every POST includes:

X-ServerOps-Signature: t=<unix-seconds>,v1=<base64-hmac>

Where v1 is base64(hmac_sha256(secret, "<t>.<body>")). The signing secret is shown to you ONCE when you create the rule - copy it then; we cannot show it again. To rotate, save a new URL on the rule and we mint a fresh secret.

Verification example (Python):

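import base64, hashlib, hmac, time

def verify(signature_header, body, secret):
    """Return True if the header carries a fresh, valid signature for body."""
    # Header format: t=<unix-seconds>,v1=<base64-hmac>
    try:
        parts = dict(p.split("=", 1) for p in signature_header.split(","))
        t = int(parts["t"])
        received = parts["v1"]
    except (KeyError, ValueError):
        return False  # malformed header
    if abs(time.time() - t) > 300:
        return False  # too old; replay protection
    # Sign the raw request body exactly as received, not a re-serialized copy.
    raw = body.encode() if isinstance(body, str) else body
    expected = base64.b64encode(
        hmac.new(secret.encode(), f"{t}.".encode() + raw, hashlib.sha256).digest()
    ).decode()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, received)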
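
When testing a receiver locally, it can help to produce a header yourself. The helper below is a sketch that follows the formula stated above, base64(hmac_sha256(secret, "<t>.<body>")); the name sign is hypothetical and it is not part of any ServerOps SDK.

```python
import base64
import hashlib
import hmac

def sign(body: bytes, secret: str, t: int) -> str:
    """Build an X-ServerOps-Signature value for body at unix time t."""
    mac = hmac.new(secret.encode(), f"{t}.".encode() + body, hashlib.sha256)
    return f"t={t},v1={base64.b64encode(mac.digest()).decode()}"
```

Call it with the current time, for example sign(raw_body, secret, int(time.time())), so the result passes the 300-second freshness check.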

Use this destination to wire alerts into Slack (their incoming-webhooks API), PagerDuty (their Events API v2), n8n / Zapier, or your own incident-response service.
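
As an example of that wiring, a small relay could convert the alert payload into the JSON body that Slack incoming webhooks accept (a text field) and POST it on. This is a minimal sketch: to_slack_body and forward_to_slack are hypothetical names, the link label is arbitrary, and the relay should verify the signature before forwarding.

```python
import json
import urllib.request

def to_slack_body(payload):
    """Build the JSON body for a Slack incoming webhook from an alert payload."""
    # <url|label> is Slack's link syntax in message text.
    text = f"{payload['message']}\n<{payload['permalink']}|Open in ServerOps>"
    return json.dumps({"text": text}).encode("utf-8")

def forward_to_slack(payload, slack_webhook_url):
    """POST the converted body to Slack and return the HTTP status code."""
    request = urllib.request.Request(
        slack_webhook_url,
        data=to_slack_body(payload),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping the conversion separate from the network call makes the mapping easy to unit-test.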

Email destination

Point at any recipient address. The email arrives with a subject like [ServerOps] 🚨 FIRED · errors-spike and a body matching the Discord embed visually. Plain-text fallback is always included.

Daily cap. 50 alert emails per org per day. The cap is per-org, not per-rule - if you have 10 rules pointing at the same email address they share the budget. Hit the cap and subsequent fires are audit-logged as rate_limited instead of delivered. Counter resets ~24h after the first fire.
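
The cap behaves roughly like a counter whose 24-hour window starts at the first fire. The sketch below models that under stated assumptions: OrgEmailBudget is a hypothetical name, times are unix seconds, and the real reset is only described as approximately 24h.

```python
DAILY_CAP = 50
WINDOW_SECONDS = 24 * 60 * 60

class OrgEmailBudget:
    """One instance per org; every rule in the org shares it."""

    def __init__(self):
        self.window_start = None
        self.sent = 0

    def allow(self, now):
        # Start a fresh window on the first fire, or once 24h have passed.
        if self.window_start is None or now - self.window_start >= WINDOW_SECONDS:
            self.window_start = now
            self.sent = 0
        if self.sent >= DAILY_CAP:
            return False  # audit-logged as rate_limited
        self.sent += 1
        return True
```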

Email destinations don't have retry. Failed sends (bad address, email provider outage) show up in the audit log; you'll need to fix the address or wait for the provider to recover.

Test fire

Every alert detail page has a ↗ TEST FIRE button. It POSTs a synthetic Discord message labeled [SERVEROPS TEST] to the configured URL. Server-side rate limit: 1 test per rule per minute.

Test fires:

  • Show up in the audit log with a TEST badge.
  • Do NOT update the rule's last_fired_at (no cooldown impact).
  • Do NOT update in_alarm_state (no_match rules unaffected).

Use this to confirm a webhook URL works before waiting for a real event.

Audit log

The alert detail page shows the last 50 firings (real + test). Each row has timestamp, match count, destination status code, and any error message. Failed deliveries are tinted red.

Lifecycle and the settling period

After you save a new alert, it can't fire for 5 minutes. This is the settling period - it lets you notice a misconfigured rule before it floods Discord with historical matches.

To pause a rule without losing it, toggle ENABLED off. To remove it, archive it. Archiving is reversible, but archiving and unarchiving are currently API-only; the UI lands in a follow-up.

What this page does NOT do (yet)

  • Slack / PagerDuty native destinations. Use the HTTP destination + their inbound webhooks for now.
  • Throttling (N fires per hour). Use cooldown for the equivalent behaviour.
  • Business-hours silencing. Roadmap.
  • Ratio thresholds ("error rate > 5%"). Customer-driven; ask if you need it.
  • Multiple destinations per rule. Create one rule per destination.
  • Destination type change post-create. Make a new alert if you need to switch from Discord to HTTP, etc. The sealed URL + signing secret are bound to the type at create time.

Common error responses

| Code | Meaning |
| --- | --- |
| nameInvalid | Name empty or > 100 chars |
| queryInvalid | Predicate failed the grammar check (see field:value AND ...) |
| thresholdTypeInvalid | Threshold type must be any / count_gte / no_match |
| thresholdValueInvalid | Count threshold must be a positive integer |
| destinationUrlInvalid | Discord webhook URL must start with https:// |
| destinationTypeInvalid | Only discord is accepted in v1 |
| testFireRateLimited | 1 test per rule per minute; wait a moment and retry |
| alertsNotConfigured | Operator hasn't set ALERTS_SECRET_KEY on the deployment |
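
If you create alerts through the API, a client-side pre-check can catch several of these before the request is sent. The sketch below mirrors four of the rows above; validate_rule and its parameter names are hypothetical, and the server remains the source of truth.

```python
def validate_rule(name, threshold_type, threshold_value, destination_url):
    """Return the list of error codes this rule would likely trigger."""
    errors = []
    if not name or len(name) > 100:
        errors.append("nameInvalid")
    if threshold_type not in ("any", "count_gte", "no_match"):
        errors.append("thresholdTypeInvalid")
    if threshold_type == "count_gte":
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(threshold_value, int) and not isinstance(threshold_value, bool)
        if not is_int or threshold_value <= 0:
            errors.append("thresholdValueInvalid")
    if not destination_url.startswith("https://"):
        errors.append("destinationUrlInvalid")
    return errors
```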

Architecture notes (curious customers)

The evaluator runs as a goroutine inside worker-logs, ticking every 30 seconds. Per (org, project) it issues ONE query against the per-minute rollup table; all your rules for that project evaluate in memory against the returned rows. CH cost stays small even at 50+ rules per project. See runbooks/logs/65_alerts_architecture.md in our open-source repo for the full decision log + invariants + failure modes.
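
The "one query, many rules" idea can be illustrated in a few lines. The real evaluator is a goroutine in worker-logs; the sketch below uses Python only to match the other examples on this page. It assumes rollup rows shaped as dicts with minute, dataset, event, severity, and count keys, and rules with name, predicate, and window_minutes keys; all of those names are hypothetical.

```python
def count_matches(rows, predicate, now_minute, window_minutes):
    """Sum rollup counts for rows in the window that satisfy the predicate.

    predicate maps field -> set of allowed values; an empty dict matches all.
    """
    total = 0
    for row in rows:
        if not (now_minute - window_minutes < row["minute"] <= now_minute):
            continue
        if all(row[field] in allowed for field, allowed in predicate.items()):
            total += row["count"]
    return total

def evaluate_project(rows, rules, now_minute):
    """Evaluate every rule against one shared set of rows (one query's result)."""
    return {
        rule["name"]: count_matches(
            rows, rule["predicate"], now_minute, rule["window_minutes"]
        )
        for rule in rules
    }
```

Adding a rule adds only an in-memory pass over rows that were already fetched, which is why query cost stays flat as the rule count grows.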
