Zapier's default behavior on errors: fire once, fail silently, halt the Zap, and lose data. This tutorial walks through auto-replay, dedicated error Zaps, fallback paths, and the monitoring discipline that catches breaks within an hour, not after the next quarterly review.
Who this is for
Operators with 5+ Zaps in production who have already lost data to a silent failure once. If you cannot answer "how would I know if a Zap broke right now?", this is the tutorial to fix that.
What you'll need
Step 1
For every published Zap: Zap → Settings → Notifications → "Send email if Zap stops working" → ON. Default is OFF on Free and Starter.
Open Zapier → Zaps → click a published Zap.
Navigate to "Settings" (left nav within the Zap view).
Find "Notification settings" → toggle "Send email if Zap stops working" to ON.
Set the recipient email — usually a shared inbox or alias your team monitors, not a single personal inbox.
Repeat for every published Zap. There is no global default — you must do this per Zap.
On Pro and higher plans, also enable "Auto-replay errors" in the same panel. This retries failed runs automatically (more on this in the next step).
Step 2
In Zap Settings → Notifications → "Auto-replay errors" → ON. Zapier retries failed runs with exponential backoff for 24 hours.
On Pro and above, Zapier offers Auto-Replay: when a run errors due to a transient issue (API rate limit, brief timeout, network blip), it automatically retries.
In each Zap → Settings → Notifications → enable "Auto-replay errors."
Retry schedule: 5 min, 10 min, 30 min, 1 hr, 4 hr, 12 hr, 24 hr. If all retries fail, the Zap halts and a notification fires.
Auto-replay only handles transient errors. Logical errors (missing field, invalid value) will fail on every retry — those need manual investigation.
Worth knowing: a Zap with 100 errored runs that all auto-replay successfully consumes 100 Tasks (one per successful action), not 700 (100 runs × 7 possible retries). Failed retries do not count.
Step 3
Create a dedicated "Error Monitor" Zap with Webhooks trigger. Send notifications from every other Zap's halt event to this Webhooks URL, which then fans out to Slack/email/PagerDuty.
Per-Zap email notifications fragment across many emails. A central error Zap aggregates everything into one channel.
Build a new Zap. Trigger: "Webhooks by Zapier → Catch Hook." Get the unique webhook URL.
Action 1: Slack → Send Channel Message. Map fields from the incoming payload (Zap name, error message, timestamp).
Action 2 (optional): Tables → Create Record. Log every error to a Zapier Table for post-mortem analysis.
Now in every production Zap, add a final "catch-all" branch: a Filter step (or Path) that detects "did this Zap halt" and posts to the central webhook URL. This is the meta-pattern most teams skip.
Easier alternative: use Zapier's built-in "Zap stopped working" emails and forward them to a Slack channel via a Gmail-to-Slack integration.
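If you build the central error Zap, the field mapping in Action 1 can also be done in a Code by Zapier step placed before the Slack action. The sketch below is one possible way to turn the incoming webhook payload into a single alert line. The key names zap_name, error_message, and timestamp are assumptions for illustration; use whatever fields your source Zaps actually post.

```python
from datetime import datetime, timezone


def format_error_alert(payload):
    """Build one Slack-ready line from a central error webhook payload.

    The keys zap_name, error_message and timestamp are assumed names,
    not fields that Zapier sends on its own.
    """
    zap_name = payload.get("zap_name") or "unknown Zap"
    error_message = payload.get("error_message") or "no error message provided"
    # Fall back to the current UTC time if the sender did not include one.
    timestamp = payload.get("timestamp") or datetime.now(timezone.utc).isoformat(
        timespec="seconds"
    )
    return f":rotating_light: {zap_name} failed at {timestamp}: {error_message}"


# Inside a Code by Zapier (Python) step this would be wired up roughly as:
#   output = {"text": format_error_alert(input_data)}
```

Defaulting missing fields rather than raising means a malformed error report still reaches the channel, which matters more here than a perfectly formatted message.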
Step 4
If a step is mission-critical (writing a real order to HubSpot), build a Path or Filter that, on failure, sends the data to a backup destination (Tables, sheet, Slack) so it isn't lost.
Default Zapier behavior: if an action fails, the whole Zap halts. Subsequent steps do not run.
For mission-critical write operations, design a fallback before failure occurs.
Pattern: wrap the risky action in a Path. Path A condition: primary write succeeded. Path B condition: primary failed. Path B logs the input to a Zapier Table (or Sheet) and Slacks the team.
Implementation: use Code by Zapier to perform the API call with try/except. Return a `success` boolean. Use Paths after the Code step to branch on `success`.
Now even when the destination app has an outage, the data is captured. Replay from the backup table once the outage is over.
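The Code step described above might look roughly like the following. This is a minimal sketch, not a definitive implementation: the function name write_with_fallback, the opener parameter (there only so the call can be tested without a network), and the example URL are all hypothetical, and a real destination API will usually need authentication headers as well.

```python
import json
import urllib.request


def write_with_fallback(payload, url, opener=urllib.request.urlopen, timeout=10):
    """Try the primary write and report the outcome instead of raising.

    Returning a dict (rather than letting the exception escape) keeps the
    Zap running so a later Paths step can branch on `success`.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with opener(request, timeout=timeout) as response:
            status = response.status
    except Exception as exc:  # HTTP 4xx/5xx, timeouts, DNS failures, ...
        return {"success": False, "status": 0, "error": str(exc)}
    return {"success": 200 <= status < 300, "status": status, "error": ""}


# Inside a Code by Zapier (Python) step this would be wired up roughly as:
#   output = write_with_fallback(input_data, "https://example.com/orders")
# where the URL is a placeholder for your destination's real endpoint.
```

Path A then runs when `success` is true. Path B runs when it is false, logging the original input plus the `error` text to the backup table.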
Step 5
Create a Schedule-trigger Zap that runs weekly. It compares Zap History run counts vs. expected counts and posts a summary to Slack.
Even with auto-replay and notifications, slow drift is hard to catch. A weekly sanity check surfaces it.
New Zap. Trigger: Schedule by Zapier → "Every Week" on Monday morning.
Action: Code by Zapier → script that calls Zapier API to fetch run counts for each of your production Zaps over the past 7 days.
Action: Compare against expected counts, stored in a Zapier Table that you either maintain by hand or derive from a rolling average.
Action: Slack a summary: "Last week: Zap A 412 runs (expected 400-500), Zap B 0 runs (expected 50-100 — INVESTIGATE)."
This catches slow halts (Zap technically still on but no triggers firing) that error notifications miss.
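The comparison and summary steps can be sketched as a single function. How you fetch the run counts is left out on purpose. The function below assumes you already have them as a dictionary, and the name summarize_run_counts and both input shapes are illustrative choices rather than anything Zapier prescribes.

```python
def summarize_run_counts(actual, expected):
    """Compare 7-day run counts to expected (low, high) ranges.

    actual:   {"Zap A": 412, ...}          counts you fetched elsewhere
    expected: {"Zap A": (400, 500), ...}   ranges from your Zapier Table
    Returns (summary_text, list_of_zaps_out_of_range).
    """
    parts = []
    flagged = []
    for name in sorted(expected):
        low, high = expected[name]
        runs = actual.get(name, 0)  # a Zap missing from the data counts as 0 runs
        note = f"expected {low}-{high}"
        if not low <= runs <= high:
            note += ", INVESTIGATE"
            flagged.append(name)
        parts.append(f"{name} {runs} runs ({note})")
    return "Last week: " + ", ".join(parts) + ".", flagged


summary, flagged = summarize_run_counts(
    {"Zap A": 412, "Zap B": 0},
    {"Zap A": (400, 500), "Zap B": (50, 100)},
)
print(summary)
```

Iterating over the expected ranges rather than the actual counts is deliberate: a Zap that produced no data at all still appears in the summary with 0 runs, which is exactly the case this check exists to catch.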
Step 6
For every critical Zap, write a one-paragraph runbook: what does this Zap do, what does failure look like, who owns it, how to manually recover.
A Zap that runs flawlessly for 6 months will eventually break at 2am. The person who responds needs context.
For every business-critical Zap, write a runbook entry (Notion, Confluence, README — anywhere your team looks).
Include: business purpose, trigger frequency, expected daily volume, common failure modes, manual recovery steps, who to escalate to.
In Zapier itself, paste a link to the runbook in the Zap Description field (Zap → Settings → Description).
Without this, the on-call person spends 30+ minutes orienting before they can debug. With it, they fix in 10.
Common mistakes
Relying on email notifications only
What goes wrong: Error emails get buried in a personal inbox. The on-call person is out for the weekend. The Zap is halted for 72 hours and you lose 100+ leads before anyone notices.
How to avoid: Route errors to a shared Slack channel where the whole team can see them. Email is a fine secondary, but not a primary alert channel.
No fallback for mission-critical writes
What goes wrong: HubSpot has a 2-hour outage. Every lead-capture Zap halts. 50 leads are lost permanently because you have no backup capture mechanism.
How to avoid: For any flow that writes high-value records, build a fallback path (Zapier Table or Sheet) that captures the data when the primary destination fails. Replay from the backup once the outage clears.
Never replaying old failed runs
What goes wrong: You fix the underlying issue (e.g., a stale OAuth) but never replay the 47 runs that failed during the outage. Those 47 events stay lost forever.
How to avoid: After every Zap halt, open Zap History → filter to errored runs in the outage window → click "Replay" on each. Or use bulk replay in newer Zapier versions.
Assuming auto-replay covers logical errors
What goes wrong: Auto-replay retries the same payload up to 7 times. If the payload is malformed (for example, a missing field), every retry fails identically. You burn 7 retry windows over 24 hours and the data is still not delivered.
How to avoid: Auto-replay only fixes transient errors. For logical errors, you need to manually fix the source data, then replay. Distinguish in monitoring: transient errors auto-resolve; logical errors need human attention.
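One way to make that distinction in monitoring is a small classifier in the error-handling path. The version below is a heuristic sketch: the set of status codes and the message hints are assumptions chosen for illustration, not a list Zapier publishes, and you should tune them to the apps you actually connect.

```python
# Assumed set of HTTP statuses that usually clear up on their own.
TRANSIENT_STATUS_CODES = {408, 429, 500, 502, 503, 504}
TRANSIENT_HINTS = ("timeout", "timed out", "rate limit")


def classify_error(status_code=None, message=""):
    """Return "transient" or "logical" for a failed run.

    Heuristic only: rate limits, timeouts and server-side failures are
    treated as transient; everything else is assumed to need a human.
    """
    if status_code in TRANSIENT_STATUS_CODES:
        return "transient"
    lowered = message.lower()
    if any(hint in lowered for hint in TRANSIENT_HINTS):
        return "transient"
    return "logical"


print(classify_error(429))
print(classify_error(400, "Missing required field: email"))
```

Treating unknown errors as logical is the safer default here, since a false "transient" label would hide a failure that no retry can fix.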
No drift detection
What goes wrong: A Zap that should fire 100x/day silently drops to firing 10x/day because the upstream trigger app changed schemas. Notifications fire only on errors, not on volume drops. You discover the problem in next quarter's metrics review.
How to avoid: Run a weekly sanity-check Zap that compares run counts to expected ranges. Volume drift then surfaces in days, not quarters.
Cluttered runbook (or none at all)
What goes wrong: On-call person opens Zap History at 2am, sees an error, has no idea what the Zap is supposed to do or who owns it. Wastes 45 minutes orienting before fixing.
How to avoid: One-paragraph runbook per critical Zap, linked from the Zap Description field. Keep it dead simple: purpose, owner, common failures, recovery steps.
Recap
Done — what's next
How to troubleshoot a failing Zap (step-by-step debug)
Hand it off
Monitoring is the work that compounds across every Zap you ship. Skipping it means every new Zap adds risk. Specialists wire monitoring as a default — never bolted on later. EverestX automation specialists set up centralized error handling across a 10-20 Zap stack in one engagement, typically $300-600 at $14-16/hr.
An error is a single failed run — the next trigger event will still try. A halt means Zapier has stopped trying due to repeated errors — the Zap is effectively off until you investigate and replay.
Zap History keeps runs for 30 days on Pro and above, 7 days on Starter, and 24 hours on Free. Past that window, failed runs are deleted and the data is unrecoverable from inside Zapier.
Different error types can go to different channels via a central error Zap with Paths. Route by error type (auth → Slack #infra, rate limit → Slack #ops, logical error → email the Zap owner). The fork happens inside the error Zap, not inside each source Zap.
Auto-replay does not consume extra Tasks. It only consumes Tasks on the successful retry, the same as a normal successful run. Failed retries are free, so a transient error that retries 5 times and succeeds on the 6th costs 1 Task.
Tables and Interfaces have no built-in error notifications. Build a meta-Zap that checks them on a Schedule trigger (e.g., row count daily, key fields nonempty) and alerts on anomalies.
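The checks such a meta-Zap runs can be very small. The sketch below assumes the table rows have already been fetched into a list of dictionaries. The function name check_table_rows and its parameters are hypothetical, and the thresholds are yours to choose.

```python
def check_table_rows(rows, required_fields, min_rows=1):
    """Return a list of human-readable anomalies; an empty list means healthy."""
    anomalies = []
    if len(rows) < min_rows:
        anomalies.append(
            f"row count {len(rows)} is below the minimum of {min_rows}"
        )
    for index, row in enumerate(rows):
        missing = []
        for field in required_fields:
            value = row.get(field)
            # None and blank or whitespace-only strings count as empty;
            # numeric zero does not.
            if value is None or str(value).strip() == "":
                missing.append(field)
        if missing:
            anomalies.append(f"row {index} is missing: {', '.join(missing)}")
    return anomalies


rows = [
    {"email": "a@example.com", "name": "Ana"},
    {"email": " ", "name": "Bo"},
]
print(check_table_rows(rows, ["email", "name"], min_rows=3))
```

A Filter step after this check can then alert only when the returned list is non-empty.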