If you're new to the Polygon team or haven't been through an incident yet, this page tells you what to read first and in what order. You don't need to absorb everything at once — start here, and the rest will make sense when you need it.
Understand what counts as an incident.
Read our severity definitions so you know the difference between a Critical (S0) and a Low (S3) incident. The key thing is that you shouldn't have to debate whether something is an incident during a response — the definitions exist so that decision is already made.
Our failure scenarios document maps specific technical failures to severity levels and blast radius. You don't need to memorize it, but skim through it so you have a sense of what's been cataloged and how failures are classified.
When in doubt, escalate.
If you're unsure whether something qualifies as an incident, post in #service-issue-alert and tag @FirstResponders. It's always better to raise a potential issue early than to wait for certainty.
Know how alerts and escalation work.
Evidence of an incident can come from multiple places:
- Automated alerts on #alerts-ps-pos — these are Datadog monitors covering chain health, block production, checkpoint submissions, and more.
- Direct customer contact through Sales, Account Management, Partnerships, Developer Relations, or any other team.
- Social media, via 0xPolygon on X, Polygon Support on Discord, or @PolygonHQ on Telegram.
When an incident is confirmed, the First Responder creates a new public Slack channel and Google Meet link for the war room. The incident is declared on #incidents-pos using a standardized template. All discussion happens in the dedicated incident channel — not in #incidents-pos.
Know your channels
- #alerts-ps-pos — where automated alerts land
- #service-issue-alert — where anyone can report a potential issue
- #incidents-pos — where incidents are formally declared (no discussion here)
- Dedicated incident channel — created per-incident for all discussion and coordination
Make sure you're a member of these channels and that your Squadcast notifications are working before your first on-call shift.
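As a quick memory aid, the channel layout above can be sketched as a small lookup. This is only an illustrative sketch, not tooling the team actually runs: the `Activity` enum, the `channel_for` helper, and the example incident channel name are all hypothetical. Only the three fixed channel names come from this page.

```python
from enum import Enum


class Activity(Enum):
    """Hypothetical labels for the things you might need to post."""
    AUTOMATED_ALERT = "automated_alert"
    REPORT_POTENTIAL_ISSUE = "report_potential_issue"
    DECLARE_INCIDENT = "declare_incident"
    DISCUSS_INCIDENT = "discuss_incident"


# The three fixed channels named on this page.
FIXED_CHANNELS = {
    Activity.AUTOMATED_ALERT: "#alerts-ps-pos",
    Activity.REPORT_POTENTIAL_ISSUE: "#service-issue-alert",
    Activity.DECLARE_INCIDENT: "#incidents-pos",
}


def channel_for(activity, incident_channel=None):
    """Return the Slack channel where the given activity belongs."""
    if activity is Activity.DISCUSS_INCIDENT:
        # Discussion never happens in #incidents-pos; it always goes to
        # the dedicated channel created for that specific incident.
        if incident_channel is None:
            raise ValueError("discussion requires the dedicated incident channel")
        return incident_channel
    return FIXED_CHANNELS[activity]
```

For example, `channel_for(Activity.DISCUSS_INCIDENT, "#inc-example")` returns the per-incident channel passed in (the name `#inc-example` is made up), while calling it without a channel raises an error. That mirrors the rule that discussion has no home until the First Responder has created the dedicated channel.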
Learn the roles.
There are three primary roles during an incident. Read the different roles page for the full picture, but here's the short version:
- First Responder — The person from the engineering reliability team who catches the alert, assesses the failure scenario and severity, declares the incident, and manages status updates on status.polygon.technology and the Partner Comms Bot throughout.
- Incident Commander — A senior member of the core engineering team who leads the technical response: qualifying the incident, identifying root cause, coordinating a fix, and monitoring stability post-resolution.
- Communication Commander — A member of the marketing team who handles public communications on X, Telegram, and Discord.
There are also secondary on-call roles (DevOps, Applications, Smart Contracts, SecOps, IT, BD) that can be pulled in as needed.
Know which role you might fill, and read the corresponding training guide. If you're going on-call as a First Responder, the incident response training course is the place to start.
You are responsible for knowing your role.
It is your responsibility to know if you are on-call at any given time and what role you're expected to fill. Check the on-call schedule before your rotation starts.
Understand the postmortem process.
After every significant incident, we write a postmortem. This is how we learn and improve. Read the postmortem process and familiarize yourself with the template.
The Incident Commander is responsible for driving the postmortem to completion. If you're involved in an incident, you'll likely be asked to contribute your section of the timeline and any relevant findings.
Key principles:
- Postmortems are blameless — we focus on systems and processes, not individuals.
- Every postmortem produces action items with owners and due dates.
- The postmortem is published to status.polygon.technology for transparency.
Practice before it's real.
If you haven't been through an incident yet, don't let your first one be the real thing. Run through the process with a simulated scenario:
- Walk through the failure scenarios and pick one. Talk through how you'd detect it, who you'd page, and what the first 15 minutes look like.
- Join a tabletop exercise if one is being run. These simulate an incident end-to-end — from first alert through resolution and postmortem — without any real systems being affected.
- Play a game of "Keep Talking and Nobody Explodes". It's a surprisingly good way to practice the communication skills that matter most during incident response.
The switch from normal day-to-day work to emergency operations can be jarring. Practicing in a low-stakes environment makes the real thing significantly less stressful.
Going deeper.
Once you're comfortable with the basics, here's the recommended reading order for the rest of the documentation:
On-call fundamentals
- Being On-Call — What your responsibilities are (and what they're not).
- Alerting Principles — How we decide what pages an engineer and when.
Incident mechanics
- Incident Call Etiquette — How to conduct yourself on an incident call.
- During an Incident — What to do and how to contribute constructively.
- Complex Incidents — How larger incidents with multiple workstreams are handled.
Specialized response
- Security Incident Response — Security incidents follow a different process.
- Crisis Response — When an incident escalates beyond a technical issue.
Writing effective postmortems
- Effective Postmortems — How to write postmortems that actually lead to improvement.