Hands-On: Detecting and Alerting on Cross-Service Failures (X + Cloudflare + AWS)

2026-02-20

Step-by-step guide to correlate X, Cloudflare and AWS errors — implement request-id propagation, normalize logs, and build Grafana alerts to surface root cause fast.

Why multi-service monitoring must change after the X + Cloudflare + AWS incidents

When multiple providers fail at once — an edge layer returning 5xx, a CDN rate-limiting traffic, and origin targets flapping in AWS — SREs waste hours chasing symptoms instead of the root cause. If your team still treats X, Cloudflare, and AWS outages as separate silos, you will repeat that waste. This tutorial walks through a repeatable, production-ready approach (2026-era) to correlating logs, metrics, and traces across X, Cloudflare, and AWS, so your alerts point to the real culprit within minutes.

What you'll get (most important first)

  • Outcome: Alerts that surface correlated multi-service failures (edge vs. CDN vs. origin) and include the likely root cause.
  • Architecture: Cloudflare Logpush → S3 → Lambda / Kinesis → Central store (CloudWatch/OpenSearch/Loki/Athena) + Grafana for dashboards/alerts.
  • Key techniques: request-id propagation, structured logs, automated enrichment, cross-datasource Grafana alerts, and SRE runbooks.
  • Tools: Cloudflare Logpush, Cloudflare Workers (optional), AWS CloudWatch/Logs Insights, S3, Lambda, Athena/Glue, Grafana (2026 features), OpenTelemetry patterns and AI-assisted correlation tips.

Context: Why this matters in 2026

Late 2025 through early 2026 saw more frequent multi-vendor incident reports — edge providers, CDNs, and major cloud regions experiencing overlapping failures. The industry responded by accelerating:

  • OpenTelemetry adoption for consistent trace/metric formats across providers and services.
  • Edge observability (Cloudflare Workers and edge tracing) to capture request context early in the chain.
  • AI-assisted correlation in observability platforms to reduce toil.

If your stack doesn't normalize request context at the edge and feed it into a central observability layer, alerts will continue to be noisy and ambiguous.

Design principles (rules you'll follow)

  • Propagate a single request id from edge → origin (X-Request-ID or traceparent).
  • Collect structured logs at every control plane (Cloudflare, ALB/NLB, application).
  • Enrich logs with geographic, customer, and feature flags where applicable.
  • Store raw dumps in S3 for forensic replay and long-term retention.
  • Correlate by request id where possible; fall back to time-window correlation when exact IDs aren't available.

High-level architecture

  1. Instrument Cloudflare to forward an X-Request-ID (inject with Workers if needed).
  2. Enable Cloudflare Logpush to send edge logs to an S3 bucket (partitioned by date/service).
  3. Enable AWS access logs: ALB/ELB + CloudFront (if used) + Route53 + CloudWatch application logs.
  4. Use AWS Lambda (or Kinesis Data Firehose) to transform Cloudflare logs and application logs into a normalized JSON schema and write to CloudWatch Logs or OpenSearch/Loki.
  5. Use AWS Glue/Athena for ad-hoc queries over raw S3 logs; use Grafana to visualize CloudWatch/Athena/OpenSearch.
  6. Create correlated alerting in Grafana that combines a Cloudflare 5xx spike and ALB 5xx/target unhealthy metric to infer origin vs edge issues.

Step-by-step implementation

Step 1 — Ensure request ID propagation at the edge

Many outages become opaque when the CDN/edge strips or fails to forward a request identifier. Add a lightweight Cloudflare Worker to generate or forward an X-Request-ID header. That ID is the primary correlation key.

// Cloudflare Worker: add X-Request-ID if missing
addEventListener('fetch', event => {
  event.respondWith(handle(event.request))
})

async function handle(request) {
  const headers = new Headers(request.headers)
  if (!headers.get('x-request-id')) {
    headers.set('x-request-id', crypto.randomUUID())
  }
  const modified = new Request(request, { headers })
  return fetch(modified)
}

Deploy this to the routes that front your site or API. In 2026, Workers are mature and low-latency; injecting a UUID here is standard practice.
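
Optionally, echo the ID back on the response so clients and synthetic probes can report it alongside failures. A minimal variant of the handler above; handleWithEcho is an illustrative name, not a Workers API:

// Variant: forward the ID to the origin and echo it on the response
async function handleWithEcho(request) {
  const headers = new Headers(request.headers)
  const id = headers.get('x-request-id') || crypto.randomUUID()
  headers.set('x-request-id', id)
  const response = await fetch(new Request(request, { headers }))
  // clone the response so its headers are mutable, then expose the id to the client
  const echoed = new Response(response.body, response)
  echoed.headers.set('x-request-id', id)
  return echoed
}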

Step 2 — Configure Cloudflare Logpush to S3

Use the Logpush interface to stream all edge logs to an S3 bucket. Partition by date and zone for efficient querying. Include fields such as EdgeResponseStatus, OriginResponseStatus, ClientRequestMethod, ClientRequestURI, ClientIP, ClientRequestUserAgent, and RayID in your Logpush job, and capture the forwarded x-request-id header as a custom request-header field if your plan supports it.
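
Logpush jobs can also be created through the Cloudflare API rather than the dashboard. A minimal sketch, assuming the v4 logpush/jobs endpoint and the documented s3:// destination_conf form; the zone ID, bucket path, and field list are placeholders, and newer accounts may configure fields via output_options instead of logpull_options:

// Create a Logpush job for the HTTP requests dataset (Node 18+, run as an ES module)
const ZONE_ID = process.env.CF_ZONE_ID
const CF_API_TOKEN = process.env.CF_API_TOKEN   // token needs zone-level Logs edit permission

const res = await fetch(`https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/logpush/jobs`, {
  method: 'POST',
  headers: { 'Authorization': `Bearer ${CF_API_TOKEN}`, 'Content-Type': 'application/json' },
  body: JSON.stringify({
    name: 'edge-http-to-s3',
    dataset: 'http_requests',
    // new S3 destinations must pass Cloudflare's one-time ownership-challenge verification first
    destination_conf: 's3://my-observability-bucket/cloudflare/{DATE}?region=us-east-1',
    logpull_options: 'fields=EdgeResponseStatus,OriginResponseStatus,ClientRequestMethod,ClientRequestURI,ClientIP,ClientRequestUserAgent,RayID,EdgeStartTimestamp&timestamps=rfc3339',
    enabled: true
  })
})
console.log(await res.json())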

Step 3 — Ingest AWS logs and expose metrics

Turn on ALB access logs and CloudWatch application logs with structured JSON output (a sketch for enabling ALB access logs via the SDK follows the metric list below). For ECS/EKS, use Fluent Bit or the OpenTelemetry Collector to push logs and traces to CloudWatch or your centralized store. Expose the following CloudWatch metrics:

  • ALB 5xx count
  • TargetGroup HealthyHostCount
  • Lambda function errors and throttles
  • Route53 health check failure count
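
ALB access logging is a load balancer attribute, so it can also be enabled from code or IaC. A minimal sketch with the AWS SDK for JavaScript v3; the load balancer ARN, bucket, and prefix are placeholders, and the bucket policy must grant the regional ELB log-delivery account write access:

// Enable ALB access logs to S3 (Node 18+, run as an ES module)
import { ElasticLoadBalancingV2Client, ModifyLoadBalancerAttributesCommand } from '@aws-sdk/client-elastic-load-balancing-v2'

const elbv2 = new ElasticLoadBalancingV2Client({})
await elbv2.send(new ModifyLoadBalancerAttributesCommand({
  LoadBalancerArn: 'arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123',
  Attributes: [
    { Key: 'access_logs.s3.enabled', Value: 'true' },
    { Key: 'access_logs.s3.bucket', Value: 'my-observability-bucket' },
    { Key: 'access_logs.s3.prefix', Value: 'alb' }
  ]
}))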

Step 4 — Normalize logs with Lambda or Kinesis

Create a lightweight Lambda that triggers on S3 object-created events for the Cloudflare log bucket. The Lambda parses the gzip-compressed NDJSON that Logpush delivers, converts each event to a canonical JSON schema, and forwards the result to CloudWatch Logs and optionally OpenSearch or another observability pipeline. A sketch:

// Lambda (Node.js, AWS SDK v3): normalize Cloudflare Logpush objects (gzip NDJSON by default) dropped into S3
const { S3Client, GetObjectCommand } = require('@aws-sdk/client-s3')
const { gunzipSync } = require('zlib')
const s3 = new S3Client({})

exports.handler = async (event) => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '))
    const obj = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }))
    const raw = gunzipSync(await obj.Body.transformToByteArray()).toString('utf8')
    // parse each NDJSON line and tag its source so it maps onto the canonical schema below
    const events = raw.split('\n').filter(Boolean).map((line) => ({ ...JSON.parse(line), source: 'cloudflare' }))
    // forward `events` to CloudWatch Logs (PutLogEvents) or a Kinesis Data Firehose delivery stream
  }
}

Normalization schema example (fields to keep):

  • timestamp
  • request_id (x-request-id)
  • source (cloudflare | alb | app)
  • status (http status)
  • url
  • client_ip
  • cf_ray / trace_id if present
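
For reference, a single normalized event under this schema might look like the following (all values are illustrative):

{
  "timestamp": "2026-02-20T09:43:12Z",
  "request_id": "123e4567-e89b-12d3-a456-426614174000",
  "source": "cloudflare",
  "status": 503,
  "url": "/api/v1/checkout",
  "client_ip": "203.0.113.7",
  "cf_ray": "8f1c2a3b4c5d6e7f-IAD"
}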

Step 5 — Centralize storage and enable fast queries

Two complementary stores are useful:

  1. Cold storage (S3 + Athena) for forensic queries and long-term retention.
  2. Hot store (CloudWatch Logs / OpenSearch / Grafana Loki) for live alerts and dashboards.

Use AWS Glue to create a table over Cloudflare logs in S3 so Athena queries are fast. Example Athena query to find edge 5xx spikes:

SELECT date_trunc('minute', request_timestamp) AS minute,
       count(*) AS edge_5xx
FROM cloudflare_logs
WHERE http_status >= 500
GROUP BY 1
ORDER BY 1 DESC
LIMIT 100
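
To run this kind of query from automation (for example, to attach a result link to an alert), here is a minimal sketch with the AWS SDK for JavaScript v3; the database name and output location are placeholders:

// Kick off the Athena query above; results land in the configured S3 output location (Node 18+, ES module)
import { AthenaClient, StartQueryExecutionCommand } from '@aws-sdk/client-athena'

const athena = new AthenaClient({})
const { QueryExecutionId } = await athena.send(new StartQueryExecutionCommand({
  QueryString: "SELECT date_trunc('minute', request_timestamp) AS minute, count(*) AS edge_5xx FROM cloudflare_logs WHERE http_status >= 500 GROUP BY 1 ORDER BY 1 DESC LIMIT 100",
  QueryExecutionContext: { Database: 'observability' },
  ResultConfiguration: { OutputLocation: 's3://my-observability-bucket/athena-results/' }
}))
console.log('Athena query started:', QueryExecutionId)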

Step 6 — Build Grafana dashboards and correlated panels

Grafana's 2026 releases have richer alerting features: you can build alert rules from expressions that combine queries across different data sources. Create three panels:

  1. Edge errors (Athena or CloudWatch): 5xx/min from Cloudflare logs.
  2. Origin errors (CloudWatch): ALB 5xx or Lambda errors.
  3. User impact (synthetic / RUM): failed check rate from synthetic probes or RUM SDK.

Then define alert rules that classify the failure:

  • edge_5xx_rate > X AND origin_5xx_rate <= Y → likely edge/CDN issue.
  • edge_5xx_rate > X AND origin_5xx_rate > Y → likely origin or networking issue.

Example Grafana expression (pseudocode):

A = query_result(edge_5xx_rate)
B = query_result(origin_5xx_rate)
alert = A > 50 AND B <= 10

Step 7 — Correlate individual requests for root-cause

When an alert fires, the first action should be to inspect a small time-window of enriched logs and trace data filtered by request_id or cf-ray. Workflow:

  1. Grafana alert includes example request_id and a link to a Logs Insights query.
  2. Open the CloudWatch Logs Insights or Athena query to view the request_id timeline across sources.
  3. If traces exist (OpenTelemetry), open the trace view; the trace shows latency per hop and where the 5xx originates.

// CloudWatch Logs Insights example
fields @timestamp, @message
| filter request_id = '123e4567-e89b-12d3-a456-426614174000'
| sort @timestamp desc
| limit 50

Alert examples and runbook actions

Make alerts actionable by embedding the probable root cause and next steps.

Alert: Edge 5xx spike with low origin errors

  • Condition: Cloudflare edge 5xx/min > threshold AND ALB 5xx/min <= baseline.
  • Likely cause: CDN configuration, rate-limiting, WAF rules, or upstream networking to Cloudflare.
  • Runbook: Check Cloudflare dashboard for WAF rule spikes, recent config changes, and Cloudflare status page. Rollback recent Cloudflare rules if correlated.

Alert: Edge & origin 5xx both high

  • Condition: Edge 5xx/min > threshold AND ALB 5xx/min > threshold.
  • Likely cause: Origin overloaded, autoscaling failures, or inter-region networking issues.
  • Runbook: Check origin CPU/memory, target health, autoscaling events, and EKS pod restarts. Roll traffic to warm region or enable service limit increase.

Alert: Synthetic failures with no logs

  • Condition: Synthetic probe failures but no logs recorded.
  • Likely cause: Logging pipeline outage (Kinesis/Firehose/Lambda) or IAM permission issues.
  • Runbook: Validate log delivery status, check S3 PutObject metrics, and check Glue/Athena partitions.

Advanced correlation techniques (2026 best practices)

Use these to reduce mean time to identification (MTTI):

  • Probabilistic correlation: When request IDs are missing, correlate by 5-tuple (client IP, user-agent, URL path, timestamp window, cookie/session id) using vector similarity queries in OpenSearch.
  • Edge-enforced trace headers: Use Cloudflare Workers to add W3C traceparent headers compatible with OpenTelemetry so traces stitch across edge and origin (a Worker sketch follows this list).
  • AI-assisted triage: In 2026 observability platforms often include LLM-driven correlation to suggest root causes; use it to prioritize alerts but validate before automated remediation.
  • Real-time enrichment: Enrich events with feature-flags, release version, and deployment id at ingress so alerts can attribute failures to recent deploys.
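
A minimal sketch of the edge-enforced trace-header idea from the list above: generate a W3C traceparent at the edge when the client didn't send one, so origin-side OpenTelemetry SDKs join the same trace. The header layout follows the W3C Trace Context spec (version-traceid-parentid-flags); the handler name and sampling choice are illustrative.

// Cloudflare Worker: inject a W3C traceparent (00-<32 hex trace id>-<16 hex parent id>-<flags>) if missing
function randomHex(bytes) {
  const buf = crypto.getRandomValues(new Uint8Array(bytes))
  return [...buf].map((b) => b.toString(16).padStart(2, '0')).join('')
}

async function handleTrace(request) {
  const headers = new Headers(request.headers)
  if (!headers.get('traceparent')) {
    // '01' marks the trace as sampled so downstream OpenTelemetry SDKs record it
    headers.set('traceparent', `00-${randomHex(16)}-${randomHex(8)}-01`)
  }
  return fetch(new Request(request, { headers }))
}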

Testing and validation (SRE checks)

  1. Run synthetic tests that emulate header propagation: verify that X-Request-ID is present in both Cloudflare logs and application logs (a minimal probe sketch follows this list).
  2. Simulate an ALB 502 and confirm Grafana rule triggers with expected classification (origin vs edge).
  3. Run chaos tests for Cloudflare rate-limiting or WAF rules to ensure alerts and runbooks catch and recover.
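
For check 1, here is a minimal probe sketch (Node 18+, run as an ES module): it sends a request with a known x-request-id; you then plug that ID into the Logs Insights query from Step 7 to confirm it appears in both Cloudflare and application logs. The target URL is a placeholder.

// Synthetic probe: send a request with a known x-request-id, then search for that ID in your logs
import { randomUUID } from 'node:crypto'

const requestId = randomUUID()
// the probe target is a placeholder; point it at an endpoint fronted by Cloudflare
const res = await fetch('https://example.com/healthz', { headers: { 'x-request-id': requestId } })
// log the ID and status so on-call (or the Logs Insights query from Step 7) can trace this exact request
console.log(JSON.stringify({ requestId, status: res.status, ok: res.ok }))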

Example incident walkthrough (short case study)

Scenario: At 09:43 UTC, users report intermittent 503s. Grafana alert fires: edge_5xx_rate = 120/min, origin_5xx_rate = 15/min, synthetic failure rate = 40%.

  1. On-call opens the Grafana alert panel which includes a sample request_id and a link to the Logs Insights query pre-filled for the last 10 minutes.
  2. Logs show Cloudflare EdgeResponseStatus = 503, with WAF rule X triggering on 30% of requests; the ALB shows only a low rate of 502s. The cf-ray values map to a single POP (point of presence).
  3. Conclusion: Edge-side WAF rule or POP network issue. On-call disabled the WAF rule and traffic returned to normal in 2 minutes.
  4. Postmortem: Add more granular WAF metrics and create a rule to auto-open a safe-mode bypass for high user-impact alerts.

Operational notes & security

  • Protect your log bucket (S3) with strict IAM policies and object-level encryption.
  • Redact PII in the normalization step; never forward sensitive headers to third-party tools unmasked.
  • Use IAM roles for Lambda and Kinesis with least privilege. Monitor for suspicious access patterns.

Checklist before you go to production

  • Edge request-id present and forwarded: PASS
  • Cloudflare logs flowing to S3: PASS
  • Lambda normalization jobs healthy: PASS
  • Grafana dashboards show cross-datasource correlated panels: PASS
  • Runbooks linked from alerts: PASS

Looking ahead

Through 2026, expect these trends to shape multi-service monitoring:

  • Edge-native tracing will be widely supported; plan to adopt W3C traceparent across Workers and origins.
  • Federated observability — vendor-neutral telemetry meshes and OpenTelemetry will make correlation easier across providers.
  • AI/ML correlation will suggest probable root causes but requires human verification; automate low-risk mitigations only.
  • Regulatory and privacy rules will require redaction pipelines in log normalization; bake this in now.

Actionable takeaways

  • Start at the edge: Ensure X-Request-ID or traceparent headers are injected and forwarded.
  • Centralize and normalize Cloudflare and AWS logs into a single JSON schema for fast correlation.
  • Use cross-datasource alerts in Grafana to classify incidents quickly (edge vs origin vs pipeline).
  • Automate runbooks and include links to the exact Logs Insights/Athena query for the alert’s timeframe.
  • Keep raw logs in S3 for forensic replay and postmortems.

Final checklist & next steps

  1. Deploy Cloudflare Worker to generate X-Request-ID.
  2. Enable Cloudflare Logpush to S3 and configure Athena table.
  3. Set up Lambda normalization and streaming to CloudWatch/OpenSearch.
  4. Create Grafana dashboards and cross-datasource alert rules.
  5. Write runbooks for common multi-service failure patterns and test with chaos exercises.

Correlation is the fastest path to root cause: invest in identifiers, normalization, and combined alerting.

Call to action

Ready to reduce MTTI for cross-service failures? Start by deploying the Cloudflare Worker and enabling Logpush to S3 today. If you want a hands-on workshop or a reference repo with Lambda normalization code, runbook templates, and Grafana dashboards tuned for X + Cloudflare + AWS stacks, contact our team for a guided build-out.
