What is an incident?#

Any unplanned disruption or degradation of service that is actively affecting — or has the potential to affect — users, partners, or the Polygon network.

An incident is defined by its impact on service delivery, not merely by the occurrence of a technical failure. A node crashing is a failure. If that node was a block producer and blocks stop being included, that's an incident. If the crash was on a sentry node with three other healthy sentries behind the load balancer, it's a failure handled by redundancy — not an incident.

What is a major incident?#

Any incident classified as S0 (Critical) or S1 (High) that requires a coordinated response across multiple teams. Major incidents trigger the full incident response process: war room, Incident Commander, Communication Commander, status page updates, and 15-minute update cadence.

See our severity definitions for the full classification.

Can a lower-severity issue be a major incident?

All S0 and S1 incidents are major incidents, but you don't need a high severity classification to trigger full incident response. If a situation requires coordinated response across multiple teams, even at S2, the First Responder or Incident Commander can escalate it. The point is coordination, not the label.

What is incident response?#

An organized approach to addressing and managing an incident. The goal isn't just to resolve the incident, but to handle the situation in a way that limits damage, reduces recovery time, and produces the information needed to prevent recurrence.

Incident response is distinct from problem management. Incident response restores service. Problem management investigates root causes after service is restored, to prevent the next one. Both matter, but they happen at different times and involve different thinking.

What triggers our incident response process?#

Our incident response process can be triggered in two ways: automated monitoring or human escalation.

Automated Monitoring#

Datadog monitors on #alerts-ps-pos cover chain health, block production, checkpoint submissions, consensus performance, and application-level metrics. When a monitor fires, the on-duty First Responder triages the alert, identifies the failure scenario, assesses severity, and initiates the appropriate response.
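As an illustration only, a chain-health monitor of the kind described above might be defined like this. The metric name, thresholds, and tags here are hypothetical placeholders, not our actual monitor definitions:

```json
{
  "name": "[PoS] Block production lag",
  "type": "metric alert",
  "query": "max(last_5m):max:polygon.bor.block_production_lag{env:mainnet} > 30",
  "message": "Block production lag exceeded 30s. Triage per the First Responder runbook. @slack-alerts-ps-pos",
  "options": {
    "thresholds": { "critical": 30, "warning": 10 },
    "notify_no_data": true,
    "no_data_timeframe": 10
  }
}
```

The `notify_no_data` option matters for chain-health monitors: silence from a block-production metric is itself a signal worth triaging.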

Human Escalation#

Automated monitoring doesn't catch everything. Incidents can surface through customer reports, social media, partner escalations, or an engineer noticing something off in a dashboard. Any Polygon team member can trigger incident response by posting in #service-issue-alert and tagging @FirstResponders.

If you see something, say something. You don't need to be certain it's an incident — the First Responder will triage it.

When in doubt, escalate.

If you're unsure whether something requires incident response, escalate anyway. It is always better to trigger a response for something that turns out to be minor than to delay a response for something that turns out to be critical.

Incident Severity#

Our severity definitions classify incidents from S0 (Critical) through S3 (Low) based on predefined criteria. The severity guides the type of response: higher severity means you can take riskier actions to restore service, more people are mobilized, and communication requirements increase.

Severity is useful for quickly determining the response level, but it's not a gate. If something isn't neatly covered by the severity definitions but you believe it requires coordinated response — it requires coordinated response. Assign a severity later; don't let classification slow down the response.

Mentality Shift#

One of the more important concepts in incident response is the mentality shift that happens when an incident is declared. We call this the "Peacetime vs. Wartime" shift.

During normal operations, decisions are made carefully, changes go through review, and risk tolerance is low. During an incident, the calculus changes. Speed matters more. You are authorized to take actions that would normally require more process — rolling back a deployment without a full review, restarting a service without waiting for approval, pulling someone off their current work to help investigate.

This shift can be hard to internalize, especially for people who haven't been through many incidents. Your incident response process can stall if responders stick to peacetime thinking and hesitate to take action because it feels risky. The Incident Commander's job is partly to give people permission to operate in wartime mode — to make it clear that the normal rules are temporarily suspended in service of restoring the system.

You can read more about this in the Responder Training Guide.

Naming doesn't matter

Some people prefer "Normal vs. Emergency" or simply "OK vs. Not OK." The specific terms aren't important — the shift in decision-making speed and risk tolerance is.


Key Terminology: Incidents, Failures, and Related Concepts#

Definitions from ISO/IEC and NIST standards to establish a shared vocabulary for incident management, observability, and alerting.

The Causal Chain#

These terms form a causal hierarchy, not a set of synonyms. Understanding the chain clarifies when each concept applies:

Error → Fault / Defect → Failure → Incident → Outage

A human error introduces a fault into a system. When conditions activate that fault, it manifests as an externally observable failure. If that failure affects service delivery, it becomes an incident. If service is rendered completely unavailable for a sustained period, the incident constitutes an outage. A problem is the underlying cause, investigated after resolution to prevent recurrence.

Not every step in the chain triggers the next: a fault may never manifest, a failure may not affect service, and an incident may not escalate to an outage.
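The chain and its non-mandatory transitions can be sketched in code. This is a toy model for illustration, not tooling we run:

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Stage(Enum):
    ERROR = auto()     # human mistake
    FAULT = auto()     # latent defect introduced by the error
    FAILURE = auto()   # fault activated, externally observable
    INCIDENT = auto()  # failure affects service delivery
    OUTAGE = auto()    # sustained, complete loss of availability

@dataclass
class Event:
    description: str
    # Every event in this model starts with a human error that planted a fault.
    stages_reached: list[Stage] = field(
        default_factory=lambda: [Stage.ERROR, Stage.FAULT]
    )

    def activate_fault(self) -> None:
        """Conditions trigger the latent fault: it becomes a failure."""
        if Stage.FAILURE not in self.stages_reached:
            self.stages_reached.append(Stage.FAILURE)

    def affects_service(self) -> None:
        """The failure impacts (or could impact) service delivery."""
        self.activate_fault()
        if Stage.INCIDENT not in self.stages_reached:
            self.stages_reached.append(Stage.INCIDENT)

    def full_loss(self) -> None:
        """Sustained, complete unavailability: the incident is an outage."""
        self.affects_service()
        if Stage.OUTAGE not in self.stages_reached:
            self.stages_reached.append(Stage.OUTAGE)

# A fault may never manifest...
dormant = Event("fault in an unused code path")
assert Stage.FAILURE not in dormant.stages_reached

# ...and an incident need not escalate to an outage.
degraded = Event("nonce gaps stall user transactions")
degraded.affects_service()
assert Stage.INCIDENT in degraded.stages_reached
assert Stage.OUTAGE not in degraded.stages_reached
```

The two assertions at the end encode the point above: each arrow in the chain is a possibility, not an inevitability.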

[Diagram: Incident Terminology — The Causal Chain]

Definitions#

  • Error — A human action that produces an incorrect result, or the difference between a computed/observed value and the true/correct value. An error is the originating human mistake that introduces a fault into a system. (NIST SP 500-209, based on IEEE 610.12-1990)

  • Fault / Defect — An incorrect step, process, or data definition in a system; a state of unfitness for use or nonconformance to specification. A fault is the latent problem embedded in the system as a result of an error. It may exist without being detected until conditions cause it to manifest as a failure. (NIST SP 500-209, based on IEEE 610.12-1990)

  • Failure — A discrepancy between the external results of a system's operation and its requirements. A failure is the observable, external consequence of a fault — it is evidence that a fault exists. Not every fault leads to a failure (e.g., a fault in an unused code path), and not every failure constitutes an incident. (NIST SP 500-209, based on IEEE 610.12-1990)

  • Incident — An unplanned interruption to a service, a reduction in the quality of a service, or an event that has not yet impacted the service to the customer or user. In broader usage, any anomalous or unexpected event, set of events, condition, or situation at any time during the lifecycle of a project, product, service, or system. An incident is defined by its impact (or potential impact) on service delivery, not merely by the occurrence of a technical failure. (ISO/IEC 20000-1:2018 §3.2.5; ISO/IEC/IEEE 29119-1:2022 §3.39)

  • Problem — The underlying cause of one or more actual or potential incidents. Problem management is a separate discipline from incident management: incident management restores service, while problem management investigates root causes to prevent recurrence. When a root cause or workaround has been identified but not permanently resolved, the problem is reclassified as a Known Error. (ISO/IEC 20000-1:2018 §3.2.10, §3.2.9)

  • Outage* / Disruption — An unplanned event that causes a system to be inoperable for a length of time. An outage represents a sustained, complete loss of availability of a service or system, and is a specific and severe type of incident. Not all incidents are outages — a degraded service is an incident but not necessarily an outage. (NIST SP 800-34 Rev. 1, Glossary)

*A note on "outage" at the protocol level. The chain is decentralized. Calling a chain halt an "outage" implies we are the service provider in the traditional sense, and that is not quite how it works. A halt is a serious incident, but it is a protocol-level event. It can cause downstream outages for RPC providers, bridge operators, and exchanges; those are their service outages, triggered by our incident.

Where "outage" maps cleanly is services where we own the availability commitment: AggLayer, managed RPC endpoints, private mempool features. Each has a defined customer, a measurable availability expectation, and a process that is either running or not.

So the framing is: protocol-level events (halts, consensus issues, bad blocks) are always incidents, severity-classified by blast radius and duration. "Outage" is reserved for specific services in our catalogue where we have made an explicit availability promise. As we build out SLOs, the service catalogue will make this distinction concrete rather than something we relitigate every time something breaks.


Example: December 11, 2025 — Transactions Stuck Due to Nonce Gaps#

This example walks through a real Polygon incident using the causal chain to show how each stage calls for different tools, different responses, and different teams.

Error#

Several weeks before anything went sideways, the block producer's transaction pool parameters got dialed back. accountslots and accountqueue dropped from 256 to 64, a move to address performance concerns. Made sense at the time. But it quietly shrank the system's ability to absorb traffic spikes. Separately, the Madhugiri hardfork (deployed December 9th) introduced a block announcement timing change that left a stale wait-time sitting in the block building path. Both were reasonable decisions that produced unintended consequences.
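For context, these knobs live in bor's txpool configuration. A sketch of the relevant fragment with the restored (pre-reduction) per-account values is below; the exact file layout varies by bor version, and the global limits shown are illustrative, not our production settings:

```toml
# config.toml (bor): txpool sizing, per-account limits (illustrative values)
[txpool]
  accountslots = 256   # executable (pending) tx slots guaranteed per account
  accountqueue = 256   # non-executable (queued) tx slots allowed per account
  globalslots  = 32768 # max executable tx slots across all accounts
  globalqueue  = 32768 # max non-executable tx slots across all accounts
```

Dropping the per-account values to 64 did not break anything on its own; it shrank the buffer that absorbs traffic spikes, which is exactly why the fault stayed latent until load conditions changed.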

Fault / Defect#

The reduced txpool params and the block announcement delay both sat in production without tripping any alarms. Latent flaws, baked right in. The system was running inside its now-smaller headroom, but nobody could see how little margin was left. These defects lived silently for days (the pool config) and hours (the hardfork timing issue), waiting for the right conditions to surface.

Failure#

December 11th. A 7x spike in DoS traffic slammed one of the 3 block producers. The shrunken pool capacity, the block building delay (fewer transactions picked up per cycle), and the traffic surge all collided. The block producer couldn't maintain a usable set of pending transactions. Nonce gaps formed, which blocked entire transaction chains from the same accounts. The system was technically running. It just wasn't doing its job.
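To see why a nonce gap stalls an entire chain of transactions from one account, consider a minimal pool model (a toy sketch, not bor's actual txpool code): transactions are executable only if their nonces contiguously extend the account's confirmed nonce.

```python
def executable(confirmed_nonce: int, pool_nonces: set[int]) -> list[int]:
    """Return the nonces that can be included, in order. Only a contiguous
    run starting at the account's next expected nonce is executable;
    everything after a gap is stuck in the queue."""
    runnable = []
    next_nonce = confirmed_nonce + 1
    while next_nonce in pool_nonces:
        runnable.append(next_nonce)
        next_nonce += 1
    return runnable

# Account's last confirmed nonce is 10. It submitted nonces 11..15, but
# nonce 12 was evicted when the shrunken pool ran out of headroom.
pool = {11, 13, 14, 15}
print(executable(10, pool))  # [11] -- 13..15 are stuck behind the gap
```

One evicted transaction therefore strands every later transaction from the same account, which is how a capacity problem on a single block producer turned into indefinitely pending transactions for users.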

Incident#

Users and partners started reporting stuck transactions around 5:00 PM UTC. Polymarket escalated later that evening. Transactions sat pending indefinitely despite valid gas parameters. The chain was still producing blocks, but a meaningful share of user transactions could not be processed. This is where failure crossed into service impact. Because this was a protocol-level event, it is classified as an incident, severity-weighted by blast radius and duration, not an outage: a chain-level incident like this can cause downstream outages for RPC providers, bridge operators, and exchanges, but those are their service outages, triggered by our incident. The distinction matters.

Degraded Service (not Outage)#

The chain never halted. Blocks kept getting produced and finalized, but for affected users the experience was effectively broken: their transactions could not be included. The team rotated the problematic block producer 3 separate times (Dec 11 at 5:45 PM, Dec 11 at 10:20 PM, Dec 13 at 2:15 AM) before permanently pulling it from rotation on December 13th. So where does "outage" apply? To services where we own the availability commitment (AggLayer, managed RPC endpoints, private mempool features): a defined customer, a measurable availability expectation, and a process that is either running or not. Protocol-level events like this one are always incidents; "outage" is reserved for catalogued services where we have made an explicit availability promise.

Problem#

Post-resolution investigation turned up multiple causes feeding back into each other: the reduced txpool config, the announcement timing bug inherited from the hardfork, the DoS traffic amplifying both, and the fundamental inadequacy of the block producer's hardware. Problem management led to bumping pool parameters back up, deploying the timing fix, retiring the problematic hardware, and bolting on new monitoring for underutilized blocks and low transaction relay rates. Each fix maps back to eliminating or detecting the faults before they can surface again (the "prevents recurrence" loop in the causal chain diagram).

Key Takeaway#

The error happened weeks before anyone noticed. The faults sat latent for days. The failure kicked off when load conditions shifted. The incident got declared when users were impacted. The service was degraded (partial, sustained, never a full halt) but severe enough to require 3 rounds of intervention. And the problem investigation after resolution is what prevents the next one. Each stage calls for different tools, different responses, different teams. Lumping everything under "incident" or reaching for "outage" on protocol-level events obscures which response is actually appropriate at each point. Precise terminology isn't pedantic here. It's operational. And once the service catalogue and SLOs are in place, the terminology stops being a judgment call and starts being something you can just look up.