Webhook Integration Guide: How to Build Reliable Event-Driven Systems

Webhooks are one of the most widely used integration patterns in modern software and one of the most frequently built incorrectly. The concept is simple — a server pushes a notification to your endpoint when something happens. The implementation details, however, are where most integrations develop reliability problems: missed events, duplicate processing, security vulnerabilities, and silent failures during provider outages.

This guide covers the full picture: what webhooks are and how they differ from polling, the inbound vs outbound distinction, security implementation via HMAC verification, retry logic design, and a failure mode table you can use to audit an existing integration. At the end, there is a cost comparison between webhooks and polling at realistic event volumes, and guidance on when to use a webhook broker versus building your own infrastructure.

Webhooks vs REST Polling: The Core Distinction

REST polling is the simpler pattern. Your system sends a GET request to an API on a schedule (every minute, every five minutes, every hour) and checks whether anything has changed since the last request. It is easy to implement, easy to understand, and widely supported. It is also inherently inefficient: most polling requests return nothing, and the delay between an event occurring and your system learning about it can be as long as your poll interval.

Webhooks invert the direction of information flow. Instead of your system asking "has anything changed?" on a schedule, the external system pushes a notification to your endpoint the moment something happens. Latency drops from minutes to seconds. API call volume changes from one request per poll interval to one delivery per event, which is a reduction whenever events arrive less often than polls.

Dimension	Polling	Webhooks
Event latency	0 to poll interval (0–15 min typical)	1–5 seconds
API call volume at 1,000 events/hr	60 calls/hr (1-min interval)	1,000 calls/hr
API call volume at 10 events/hr	60 calls/hr (wasteful)	10 calls/hr (efficient)
Infrastructure needed	Scheduler (cron job, scheduled Lambda)	HTTP endpoint (always on)
Reliability model	You control retry frequency	Provider controls retry on failure
Missed event handling	Self-managed (query historical data)	Provider retry window (varies by provider, from minutes to about 3 days)

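The efficiency comparison depends on event volume. At high volumes a single poll can return many events, so polling makes fewer calls (60 versus 1,000 per hour in the table above) and pays for it in latency. At low volumes the balance reverses. If you expect 5–10 events per day from an external system and your poll interval is 15 minutes, polling generates 96 API calls per day for 5–10 actual events. Webhooks generate 5–10 calls. At that volume, either approach is operationally trivial, but webhooks are still more efficient and lower latency.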
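The call-volume arithmetic above can be checked with a short sketch. The function names are illustrative, and the model assumes a fixed poll interval and exactly one webhook delivery per event (no retries).

```python
def polling_calls_per_day(poll_interval_minutes: float) -> int:
    """Number of polling requests per day at a fixed interval."""
    minutes_per_day = 24 * 60
    return int(minutes_per_day // poll_interval_minutes)


def webhook_calls_per_day(events_per_day: int) -> int:
    """One delivery per event, ignoring provider retries."""
    return events_per_day


if __name__ == "__main__":
    print(polling_calls_per_day(15))  # 15-minute poll interval
    print(webhook_calls_per_day(10))  # 10 events per day
```

With a 15-minute interval the polling count is 96 per day no matter how many events occur, while the webhook count tracks the event count directly.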

Where polling retains a genuine advantage: data sources that do not support webhooks (many legacy systems, internal databases, file systems), reconciliation checks where you want to verify state periodically regardless of events, and very low-frequency updates where the infrastructure overhead of a persistent endpoint is not justified.

Inbound vs Outbound Webhooks

The terms inbound and outbound are relative to your system, and the distinction matters because the implementation responsibilities are opposite.

Inbound Webhooks: You Receive

An inbound webhook means an external system is sending events to your endpoint. Stripe sends a payment.succeeded event to your server when a payment completes. GitHub sends a push event to your CI/CD pipeline when a commit lands. Shopify sends an order.created event to your fulfillment system when a customer checks out.

Your responsibilities: provide a stable, authenticated HTTPS endpoint; verify the provider's signature on every request; respond with a 200 status code immediately; process the event asynchronously; handle duplicates with idempotency keys; and implement a queue if you need guaranteed processing order.

Outbound Webhooks: You Send

An outbound webhook means your system sends events to external systems when things happen in your application. If you are building a SaaS product, outbound webhooks let your customers subscribe to events in your platform. A project management tool might send a task.completed event to a customer's internal system when a task closes. A payment platform might send a transfer.initiated event to a customer's accounting software when money moves.

Your responsibilities: maintain a registry of customer endpoint URLs and the events they subscribe to; sign every outgoing request with HMAC; implement retry logic with exponential backoff; maintain delivery logs per customer so they can debug failed deliveries; and provide a way to test and replay events from your dashboard.
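As a rough illustration of two of these responsibilities, signing and retry scheduling, here is a minimal sketch. The header name X-Example-Signature, the 5-second base delay, and the one-hour cap are assumptions chosen for the example, not values taken from any provider.

```python
import hashlib
import hmac
import json


def sign_payload(secret: bytes, body: bytes) -> str:
    """Return the hex HMAC-SHA256 of the exact bytes that will be sent."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def backoff_delays(max_attempts: int, base_seconds: float = 5.0,
                   cap_seconds: float = 3600.0) -> list:
    """Delay before each retry: base * 2^n, capped at cap_seconds."""
    return [min(base_seconds * (2 ** n), cap_seconds) for n in range(max_attempts)]


def build_delivery(secret: bytes, event: dict) -> tuple:
    """Serialize once, sign those bytes, and return (body, headers)."""
    body = json.dumps(event, separators=(",", ":")).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Hypothetical header name; pick one and document it for customers.
        "X-Example-Signature": sign_payload(secret, body),
    }
    return body, headers


if __name__ == "__main__":
    body, headers = build_delivery(b"customer-secret", {"type": "task.completed"})
    print(headers["X-Example-Signature"])
    print(backoff_delays(4))
```

The body is serialized once and the same bytes are both signed and sent, so the receiver's raw-body check can succeed. Production senders commonly add random jitter to the delays so that retries to a recovering endpoint do not arrive in lockstep.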

Outbound webhooks are an order of magnitude more complex to build reliably than inbound webhooks. You are now responsible for the reliability story — retry queues, delivery guarantees, backpressure, and per-customer observability. This is why webhook broker services (Hookdeck, Svix) exist and are worth evaluating before building your own.

HMAC Signature Verification: Why It Matters and How It Works

An inbound webhook endpoint that accepts requests without signature verification is a security vulnerability. Any actor who discovers your endpoint URL can send fabricated events — fake payment completions, fake order creations, fake user actions. At best, this triggers unintended workflows. At worst, it is a vector for fraud.

HMAC (Hash-based Message Authentication Code) is the standard solution. The mechanism works as follows:

Setup: When you register your webhook endpoint with the provider, they give you a signing secret — a random string that only you and the provider know.

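On send: When the provider sends a webhook, they compute HMAC-SHA256 using the signing secret, typically over the raw request body. The result goes into a request header (Stripe uses Stripe-Signature; GitHub uses X-Hub-Signature-256; Shopify uses X-Shopify-Hmac-SHA256). The signed content and the encoding vary by provider: Stripe signs a timestamp together with the body, GitHub sends a hex digest prefixed with sha256=, and Shopify base64-encodes the digest.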

On receive: Your endpoint reads the raw request body (before any JSON parsing) and computes the same HMAC-SHA256 using your stored secret. You compare your computed value to the value in the header using a timing-safe comparison function. If they match, the request is genuine. If they do not match, you return 401 and log the discrepancy.

Two common implementation mistakes: parsing the JSON body before computing the HMAC (re-serializing parsed JSON changes the byte representation and breaks the comparison), and using a standard string equality check instead of a timing-safe comparison (an ordinary comparison returns as soon as it finds a mismatch, and that timing difference could theoretically let an attacker recover a valid signature byte by byte).
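The receive-side steps can be sketched with Python's standard library. This assumes the provider sends a plain hex digest of the raw body; providers that prefix the digest, base64-encode it, or sign a timestamp along with the body need the corresponding adjustment. The secret and payload below are placeholders.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, raw_body: bytes, received_hex: str) -> bool:
    """Recompute HMAC-SHA256 over the raw bytes and compare in constant time."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest does not short-circuit on the first mismatching byte.
    return hmac.compare_digest(expected.encode("utf-8"), received_hex.encode("utf-8"))


if __name__ == "__main__":
    secret = b"example-signing-secret"  # placeholder, not a real secret
    body = b'{"id":"evt_1","type":"payment.succeeded"}'
    good = hmac.new(secret, body, hashlib.sha256).hexdigest()
    print(verify_signature(secret, body, good))          # True
    print(verify_signature(secret, body + b" ", good))   # False
```

The second call fails because a single extra byte in the body changes the digest, which is the same reason the check must run on the bytes as received rather than on re-serialized JSON.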

The Security Checklist

For any inbound webhook integration, verify these five points before considering the integration production-ready:

1. Always verify the HMAC signature before processing. Not just in theory — verify the code path and confirm the endpoint returns 401 on an invalid signature. Write a test for this. Provider test tools typically let you send a bad signature to verify the rejection path works.

2. Use HTTPS endpoints only. HTTP endpoints expose the payload and the signature header to network observers. Any integration that delivers webhooks over plain HTTP is misconfigured. Most production providers refuse to deliver to non-HTTPS endpoints.

3. Respond with 200 immediately, then process asynchronously. Providers wait for your 200 response before marking the delivery as successful. If your endpoint does processing before returning 200 — database writes, downstream API calls, email sending — and any of those operations take longer than the provider's timeout (typically 5–30 seconds), the provider marks the delivery as failed and retries. This causes duplicate processing. The pattern is: queue the raw payload immediately, return 200, then process from the queue in a background worker.

4. Implement idempotency keys on database writes. Providers retry events on timeout or failure. Your endpoint will receive the same event multiple times in normal operation — not just in failure scenarios. Every database write triggered by a webhook must be idempotent: running it twice with the same event ID must produce the same result as running it once. The standard approach is to store the webhook event ID and check for it before processing.

5. Whitelist source IPs where the provider publishes them. Stripe, GitHub, and several other major providers publish their webhook source IP ranges. If your infrastructure allows it, restrict your webhook endpoint to accept requests only from those IP ranges. This adds a network-level security layer on top of the HMAC verification.
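The "respond with 200 immediately, then process asynchronously" item from the checklist can be sketched as follows. An in-memory queue stands in for a durable queue such as SQS or Redis, and handle_webhook, start_worker, and the processed list are hypothetical names used only for this illustration.

```python
import queue
import threading

event_queue: "queue.Queue[bytes]" = queue.Queue()
processed: list = []


def handle_webhook(raw_body: bytes, signature_ok: bool) -> int:
    """Return an HTTP status code quickly; do no slow work here."""
    if not signature_ok:
        return 401
    event_queue.put(raw_body)  # a real deployment would use a durable queue
    return 200


def _worker() -> None:
    """Background consumer: slow processing happens here, after the 200."""
    while True:
        raw_body = event_queue.get()
        try:
            processed.append(raw_body)  # stand-in for database writes, API calls
        finally:
            event_queue.task_done()


def start_worker() -> threading.Thread:
    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    return thread


if __name__ == "__main__":
    start_worker()
    print(handle_webhook(b'{"id":"evt_1"}', signature_ok=True))
    event_queue.join()  # wait for the worker to drain the queue
    print(len(processed))
```

The handler's only work is a signature decision and an enqueue, so its response time does not depend on how long processing takes. An in-memory queue loses events if the process restarts, which is why the text recommends a durable queue in production.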

Retry Logic: What Providers Do and What You Need to Handle

Every major webhook provider retries failed deliveries, but the retry window and schedule vary significantly. Understanding your provider's retry behavior determines how you design your own failure handling.

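Stripe retries failed live-mode deliveries for up to three days using exponential backoff, starting at short intervals and extending to several hours per attempt. Shopify and Jira retry over windows of hours to days, depending on the product and API version. Slack retries only a few times (up to three attempts) over several minutes. GitHub does not automatically redeliver failed deliveries; you redeliver them manually from the webhook's delivery log or through the REST API. These policies change, so confirm the current values in each provider's documentation before designing around them.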

The practical implication: a webhook endpoint that is down for more than 24 hours will miss events from some providers permanently. This is why your webhook endpoint needs to be one of the most reliable components in your infrastructure — ideally more reliable than the application logic that processes the events, which is why the queuing pattern matters so much.

If your endpoint goes down for an extended period, most providers offer a way to replay missed events from their dashboard. Build the operational knowledge to do this quickly into your runbooks before you need it.

Failure Modes and How to Prevent Them

Webhook integrations fail in predictable ways. The following table documents the common failure modes, their root causes, and the prevention pattern for each.

Failure	Cause	Prevention
Missed events	Endpoint downtime exceeds provider retry window	Queue events to SQS or Redis before processing; endpoint only needs to enqueue, not process
Duplicate processing	Provider retried after your endpoint timed out	Idempotency keys: store event ID, skip if already processed
Out-of-order events	Network delays or parallel delivery	Event sequence numbers and deduplication logic; avoid strict ordering assumptions
Signature mismatch	Wrong secret, body parsed before HMAC check, encoding error	Test with provider test tool; compute HMAC on raw body bytes, not parsed JSON
Payload schema change	Provider updated API version, changed field names or structure	Schema validation on ingestion; versioned endpoint URLs; subscribe to provider changelog
Endpoint timeout	Synchronous processing in webhook handler takes >5s	Respond 200 immediately; all processing in background queue worker
Silent drops under load	Burst of events exceeds endpoint capacity	Rate-limit-aware queue; autoscaling workers; dead-letter queue for unprocessed events
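For the out-of-order row in the table, one possible guard is to apply an event only when it is newer than the state already stored. The sketch below assumes each event carries a monotonically increasing sequence number per object, which is a hypothetical field; where a provider offers none, an event creation timestamp or a re-fetch of the current object from the provider's API can play the same role.

```python
from dataclasses import dataclass, field


@dataclass
class OrderState:
    """Latest known status per object, guarded by a sequence number."""
    latest: dict = field(default_factory=dict)

    def apply(self, object_id: str, sequence: int, status: str) -> bool:
        """Apply the event only if it is newer than what is stored."""
        current = self.latest.get(object_id)
        if current is not None and sequence <= current[0]:
            return False  # stale or duplicate event: ignore it
        self.latest[object_id] = (sequence, status)
        return True


if __name__ == "__main__":
    state = OrderState()
    print(state.apply("order_1", 2, "paid"))     # newer event arrives first
    print(state.apply("order_1", 1, "created"))  # older event arrives late
    print(state.latest["order_1"])
```

The late "created" event is ignored, so the stored status stays at "paid" rather than regressing.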

The most common failure in production webhook integrations is duplicate processing. Providers retry on timeout, and developers write synchronous handlers. The handler takes 8 seconds, the provider's timeout is 5 seconds, so the provider marks the delivery as failed and retries while the first attempt is still finishing. Both deliveries end up processed. The result: two records created, two emails sent, two charges attempted. The async queue pattern removes this timeout-driven cause because your handler returns 200 in milliseconds regardless of what the background worker does. Idempotency keys remain necessary for the retries that still occur, such as when your response is lost on the network.
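The idempotency check can be sketched with a unique constraint on the event ID. This example uses an in-memory SQLite database and hypothetical names (processed_events, process_once); a real deployment would use the shared database that the workers already write to.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")  # illustration only; not shared or durable
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    return conn


def process_once(conn: sqlite3.Connection, event_id: str, side_effect) -> bool:
    """Run side_effect only the first time event_id is seen."""
    with conn:  # commits on success, rolls back if side_effect raises
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
        if cur.rowcount == 0:
            return False  # event ID already recorded: skip
        side_effect()
        return True


if __name__ == "__main__":
    store = make_store()
    calls = []
    print(process_once(store, "evt_1", lambda: calls.append("charge")))
    print(process_once(store, "evt_1", lambda: calls.append("charge")))
    print(calls)
```

Because the insert and the side effect share one transaction, a failure in the side effect rolls the event ID back out and a later retry can process it. That protection only covers writes to the same database; external effects such as emails or third-party API calls need their own idempotency handling.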

The Cost Comparison at Real Event Volumes

The economics of webhooks versus polling are often discussed abstractly. At real event volumes, the difference is concrete.

Consider polling Stripe at 1-minute intervals. You make 60 API calls per hour to check for new events, regardless of whether any occurred, which comes to 43,200 calls in a 30-day month. At 10,000 events per month (a modest but not trivial event volume for a growing B2B product), that is about 4.3 polling calls per actual event, against one delivery per event with webhooks.

Stripe charges nothing for webhook deliveries or for its Events API, but the same arithmetic applies to any system where calls carry a cost: internal rate limits, third-party API quotas, Zapier task limits. On a Zapier Professional plan ($50/month for 2,000 tasks), 10,000 events per month is five times the task allowance, and a plan covering 10,000 tasks would cost $249/month at Zapier's pricing. The comparison is academic, but it illustrates that event-driven architecture is not just about latency; it is also about resource efficiency.

For custom integrations where infrastructure cost matters: a webhook endpoint on AWS Lambda costs essentially nothing at typical event volumes. A Lambda handling 1,000 events/hour at 200ms per invocation with 256MB memory runs approximately $0.80/month before any free-tier allowance. The equivalent polling scheduler (a Lambda running every minute regardless of events) runs approximately $0.36/month, a figure consistent with each polling run taking about 2 seconds, but delivers events with up to 60-second latency instead of under 5 seconds. For anything user-facing, the latency difference alone justifies webhooks.
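The Lambda estimates can be reproduced with a back-of-the-envelope calculation. The prices below are assumed on-demand x86 rates ($0.20 per million requests, $0.0000166667 per GB-second) with no free tier applied, and the 2-second duration for each polling run is an assumption; check current AWS pricing for your region before relying on the numbers.

```python
# Assumed on-demand prices (x86, free tier ignored); verify against current AWS pricing.
PRICE_PER_MILLION_REQUESTS = 0.20
PRICE_PER_GB_SECOND = 0.0000166667
HOURS_PER_MONTH = 720  # 30-day month


def lambda_monthly_cost(invocations: int, duration_s: float, memory_mb: int) -> float:
    """Request charge plus compute charge for one month of invocations."""
    request_cost = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    gb_seconds = invocations * duration_s * (memory_mb / 1024)
    return request_cost + gb_seconds * PRICE_PER_GB_SECOND


# Webhook handler: 1,000 events/hour, 200 ms each, 256 MB.
webhook_cost = lambda_monthly_cost(1_000 * HOURS_PER_MONTH, 0.2, 256)
# Polling scheduler: one run per minute, assumed 2 s per run, 256 MB.
polling_cost = lambda_monthly_cost(60 * HOURS_PER_MONTH, 2.0, 256)

if __name__ == "__main__":
    print(f"webhook handler: ${webhook_cost:.2f}/month")
    print(f"polling scheduler: ${polling_cost:.2f}/month")
```

Under these assumptions the webhook handler comes to roughly $0.74 per month and the poller to roughly $0.37, in line with the estimates above. Both are small enough that latency, not cost, is the deciding factor.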

Provider-Specific Quirks

The major webhook providers each have specific behaviors that are not always in the primary documentation.

Stripe: Stripe's signature header includes a timestamp to prevent replay attacks. Your verification must check that the timestamp is within a tolerance window (Stripe recommends 300 seconds). Stripe sends webhooks for both live mode and test mode; use separate endpoints or a header check to distinguish them. Stripe's dashboard lets you replay any event from the last 30 days.

GitHub: GitHub sends the payload signature in two headers: X-Hub-Signature (HMAC-SHA1, kept for legacy compatibility) and X-Hub-Signature-256 (HMAC-SHA256). Always verify the SHA-256 header. GitHub allows up to 3 custom headers per webhook. GitHub Actions workflows triggered by webhooks have their own retry and timeout model separate from the webhook delivery itself.

Shopify: Shopify webhooks are API-version-specific. When Shopify releases a new API version, the event structure may change. Shopify requires you to respond within 5 seconds or marks the delivery as failed. Shopify tracks failed webhook endpoints and may disable them after repeated failures — check your webhook health in the admin dashboard periodically.

Slack: Slack's Events API requires an initial URL verification challenge (a POST request whose JSON body has type url_verification and a challenge value that you must return in your response). Slack sends event payloads with a 3-second timeout. Any response besides 200 within 3 seconds triggers a retry. Slack's retry headers include X-Slack-Retry-Num so you can distinguish original delivery from retries.

Jira: Jira's webhook system (JQL-based automation triggers) has a shorter retry window than other providers. Jira webhooks do not include a signing secret by default — you must configure a secret token manually if you want HMAC verification. Jira Cloud and Jira Data Center have slightly different webhook payload schemas for the same event types.
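The timestamp tolerance check described for Stripe can be sketched as below. It assumes a header of the form t=<unix seconds>,v1=<hex digest> and a signed string of the timestamp, a period, and the raw body. The function name and the now parameter (injected to make the check testable) are illustrative. This simplified parser keeps only one v1 value, so in production the provider's official library is the safer choice.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_timestamped_signature(secret: bytes, raw_body: bytes, header: str,
                                 tolerance_s: int = 300,
                                 now: Optional[float] = None) -> bool:
    """Check a 't=<unix>,v1=<hex>' header against the string '<t>.<body>'."""
    parts = dict(item.split("=", 1) for item in header.split(",") if "=" in item)
    timestamp, received = parts.get("t"), parts.get("v1")
    if not timestamp or not received:
        return False
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    current = time.time() if now is None else now
    if abs(current - sent_at) > tolerance_s:
        return False  # outside the tolerance window: possible replay
    signed_payload = timestamp.encode("utf-8") + b"." + raw_body
    expected = hmac.new(secret, signed_payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))


if __name__ == "__main__":
    secret = b"whsec_example"  # placeholder secret
    body = b'{"id":"evt_1"}'
    t = 1_700_000_000
    sig = hmac.new(secret, f"{t}.".encode() + body, hashlib.sha256).hexdigest()
    header = f"t={t},v1={sig}"
    print(verify_timestamped_signature(secret, body, header, now=t + 10))
    print(verify_timestamped_signature(secret, body, header, now=t + 301))
```

Because the timestamp is part of the signed string, an attacker cannot refresh an old captured request by editing the header; the signature would no longer match.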

When to Use a Webhook Broker

A webhook broker sits between the external provider and your application. Hookdeck and Svix are the two most common options for different use cases.

Hookdeck is designed for receiving webhooks reliably. It accepts the provider's delivery, stores it, and delivers it to your endpoint with its own retry logic. If your endpoint is down, Hookdeck queues the event and retries it until your endpoint comes back. Hookdeck also provides delivery logs, filtering, and transformation before events reach your system. It is particularly valuable for high-volume integrations where event replay and observability matter. Pricing starts at a free tier and scales with event volume.

Svix is designed for sending webhooks, the outbound use case. If you are building a SaaS product that needs to offer webhook subscriptions to your customers, Svix handles the endpoint registry, signature generation, retry logic, per-customer delivery logs, and dashboard for your customers. Building this infrastructure from scratch takes an estimated 6–10 weeks; Svix reduces it to an API integration.

Build your own webhook infrastructure when: your event volume is high enough that per-event broker pricing is more expensive than your own infrastructure, your reliability and security requirements are specific enough that a broker cannot meet them, or you need to own the complete data path for compliance reasons.

The Cost to Build a Webhook Integration

The engineering effort depends heavily on your reliability requirements.

A simple inbound webhook receiver — endpoint URL, HMAC verification, basic logging, synchronous processing — can be built in 1–3 days. This is production-ready for low-volume, low-criticality integrations where the occasional missed event is acceptable.

A production-grade inbound webhook receiver — HMAC verification, async queue (SQS or Redis), idempotency keys, retry handling, schema validation, monitoring, and alerting on delivery failures — takes 2–3 weeks and costs $5K–10K at typical contractor rates. This is what you should build for any integration where missed or duplicate events have business consequences.

A full outbound webhook service — endpoint registry, HMAC signing, retry queue with exponential backoff, per-customer delivery logs, dashboard UI, and test event tooling — takes 6–10 weeks and costs $15K–25K. Using Svix instead drops that to 1–2 weeks at a fraction of the custom build cost, plus ongoing Svix subscription fees.

Integration cost summary:

Simple inbound receiver: 1–3 days, ~$1K–3K

Production-grade inbound with queue + idempotency: 2–3 weeks, $5K–10K

Full outbound webhook service: 6–10 weeks, $15K–25K (or 1–2 weeks via Svix)

A Real-World Example

A 30-person operations team at a logistics company was using Stripe for billing and needed to update their internal job management system whenever a payment was confirmed. They had been polling the Stripe API every 5 minutes to check for new payments, which introduced up to 5-minute delays in job dispatch after payment confirmation. At their volume (around 800 payments per day), polling generated 288 API calls per day to retrieve 800 events.

The webhook integration took 6 days of engineering time. A new endpoint received Stripe's payment_intent.succeeded events, verified the signature, queued each event in Redis, and returned 200 immediately. A background worker consumed the queue, updated the job management system, and wrote an idempotency record to prevent duplicate processing. Delivery latency dropped from up to 5 minutes to under 10 seconds. The polling cron job was retired.

The integration has been in production for 18 months with no missed events and no duplicate processing incidents. One schema change from Stripe (a field rename in a minor API version update) required a 2-hour fix, caught immediately by the schema validation layer.

Frequently Asked Questions

How do I secure a webhook endpoint?

The standard approach is HMAC signature verification. The provider shares a secret with you at setup time. When they send a webhook, they compute HMAC-SHA256 of the raw request body using that secret and include the result in a header (Stripe uses Stripe-Signature, GitHub uses X-Hub-Signature-256). Your endpoint computes the same HMAC and compares it to the header using a timing-safe comparison function. If they match, the request is genuine. Never skip this check. Additionally: use HTTPS only, respond with 200 immediately and process asynchronously, implement idempotency keys for database writes, and whitelist source IPs where the provider publishes them.

How do I test webhooks locally during development?

Use a tunneling tool to expose your local server to the internet. Stripe CLI handles this natively for Stripe events, including signature injection: stripe listen --forward-to localhost:3000/webhooks. For other providers, ngrok creates a temporary public URL that forwards requests to your local port: ngrok http 3000. Hookdeck and Svix both offer local testing proxies with event replay capability, which is more convenient than re-triggering events in the provider dashboard. Most major providers (Stripe, GitHub, Shopify) also have test event tools in their dashboard that let you send sample payloads without triggering real actions.

When should I use webhooks instead of polling?

Use webhooks whenever the provider supports them and polling would either introduce unacceptable latency or spend most of its calls returning nothing. The rule of thumb: if you would poll more often than every 15 minutes, webhooks are the better fit. At 1,000 events per hour, polling every minute makes 60 API calls per hour and delivers events up to a minute late; webhooks make 1,000 deliveries, each within seconds. At low event volumes webhooks also make fewer calls, and they require no scheduled infrastructure. Polling remains appropriate when the data source does not support webhooks, when you need periodic state reconciliation regardless of events, or when event frequency is very low and the webhook infrastructure cost is not justified.

Build a Reliable Integration

Webhook integrations look simple on the surface and reveal their complexity in production. If you are building a critical integration — payment confirmations, order processing, customer lifecycle events — it is worth taking the time to implement the full reliability pattern: async queue, idempotency keys, schema validation, and delivery monitoring.


Evgeny Goncharov
