This page describes what to do during a major incident. See our severity level descriptions for what constitutes a major incident (S0 or S1).

#incidents-pos (declarations)
Dedicated per-incident Slack channel (created by First Responder)
Google Meet link is pinned in the incident channel once the war room is started.
For executive summary updates only, see the incident channel topic or #service-issue-alert.

Security Incident?

If this is a security incident, you should follow the Security Incident Response process.

Don't Panic!

  1. Join the incident Slack channel and the Google Meet war room (the Meet link is pinned in the incident channel).

    • Anyone is free to join the channel or call to observe and follow along with the incident.
    • If you wish to actively participate, you should join both Slack and the call. If you can't join the call for some reason, you should have a dedicated proxy on the call relaying information. Disjointed discussions across multiple channels are ultimately distracting.
  2. Follow along with the call/chat and add any comments you feel are appropriate, but keep the discussion relevant to the problem at hand.

    • If you are not an SME, try to filter any discussion through the primary SME for the affected service. Too many people discussing at once can become overwhelming, so we should try to maintain a hierarchical structure to the call if possible.
  3. Follow instructions from the Incident Commander.

    • Is there no IC on the call?
      • Escalate via the First Responder, who can page the appropriate IC through incident.io or directly in Slack. See Who's On-Call for product-specific IC assignments.
      • Never hesitate to escalate. It's much better to have an IC and not need them than the other way around.

Steps for Incident Commander

Resolve the incident as quickly and as safely as possible. Delegate tasks to relevant experts at your discretion.

  1. Announce in the incident Slack channel and on the Google Meet call that you are the Incident Commander.

  2. Confirm who is acting as First Responder (scribe) and Communication Commander. If these roles are not yet filled, assign them.

  3. Identify if there is an obvious cause to the incident (recent deployment, spike in traffic, chain-level anomaly, etc.) and delegate investigation to the relevant on-call group(s).

    • Use the on-call SMEs to assist in the analysis. They can often confirm the cause quickly, but not always. When the cause is not positively known, it is the IC's call on how to proceed. Confer with service owners and use their knowledge to help you.
  4. Identify investigation & repair actions and delegate them to the relevant on-call experts. Some common examples (not exhaustive):

    • Bad Deployment: Roll it back.
    • Chain Halt / Block Production Stalled: Identify stuck validators; coordinate emergency validator restart or hotfix if needed.
    • Bridge Stuck / Delayed: Check checkpoint submitter, root chain contracts, and bridge service health.
    • RPC Degradation: Validate auto-scaling, check node sync status, consider traffic re-routing.
    • Smart Contract Bug: Assess whether a pause is needed; coordinate with the Smart Contracts on-call.
    • Spike in Failed Transactions: Identify whether it's gas-related, mempool congestion, or application logic.
    • Cloud Provider / Data Center Outage: Validate failover automation has kicked in; force it if not.
    • Degraded Service Behavior without load: Gather forensic data (logs, metrics, heap dumps), and consider a rolling restart.
  5. Decide whether we need to announce publicly, and instruct the Communication Commander accordingly.

  6. Keep track of your span of control. If the response grows or the incident becomes more complex, consider splitting off sub-teams to keep the response effective.

  7. Once the incident has recovered or is actively recovering, you can announce that the incident is over and that the call is ending. This usually indicates there's no more productive work to be done for the incident right now.

    • Move the remaining, non-time-critical discussion to Slack.
    • Follow up to ensure the Communication Commander wraps up any public messaging.
    • Identify any post-incident clean-up work.
    • You may need to debrief and analyze the underlying contributing factors.
  8. Once the call is over, you can start to follow the steps from After an Incident.
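
The common examples in step 4 can be kept as a simple lookup so that a responder or a chat helper can retrieve the suggested first action quickly. The following is a minimal sketch, not an existing tool: the category keys (`bad_deployment`, `chain_halt`, and so on) and the function name `suggested_action` are hypothetical labels introduced for illustration, and the action text is copied from the list above. The list is not exhaustive, so anything unrecognized falls back to the IC's judgment.

```python
# Illustrative lookup table. The keys are hypothetical labels; the action
# text is copied from the "common examples" list in this runbook.
COMMON_ACTIONS: dict[str, str] = {
    "bad_deployment": "Roll it back.",
    "chain_halt": "Identify stuck validators; coordinate emergency validator restart or hotfix if needed.",
    "bridge_stuck": "Check checkpoint submitter, root chain contracts, and bridge service health.",
    "rpc_degradation": "Validate auto-scaling, check node sync status, consider traffic re-routing.",
    "smart_contract_bug": "Assess whether a pause is needed; coordinate with the Smart Contracts on-call.",
    "failed_tx_spike": "Identify whether it's gas-related, mempool congestion, or application logic.",
    "provider_outage": "Validate failover automation has kicked in; force it if not.",
    "degraded_without_load": "Gather forensic data (logs, metrics, heap dumps), and consider a rolling restart.",
}

# The list is explicitly not exhaustive, so unknown categories defer to the IC.
FALLBACK = "No common playbook entry; the IC decides how to proceed."


def suggested_action(category: str) -> str:
    """Return the suggested first action for a category, or the fallback."""
    return COMMON_ACTIONS.get(category.strip().lower(), FALLBACK)


if __name__ == "__main__":
    print(suggested_action("rpc_degradation"))
    print(suggested_action("something_new"))
```

The lookup only suggests a starting point; the IC still decides which action to delegate and to whom.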

Steps for First Responder (Scribe)

You are the first person on the scene. You triage, declare the incident, and document key information throughout.

  1. Update the incident Slack channel topic with who the IC is and that you're acting as scribe.

    • e.g. "IC: Adam Dossa | Scribe: [Your Name] | Comms: [Comms Commander Name]"
  2. Add notes to the incident Slack channel when significant actions are taken or findings are made. You don't need to wait for the IC to direct this; use your own judgment.

    • You should also add TODO notes that indicate follow-ups slated for later.
  3. Follow instructions from the Incident Commander.

  4. Once the call is over, you can start to follow the steps from After an Incident.
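
To keep the channel topic consistent across incidents, the topic string from step 1 can be generated rather than typed by hand. This is a minimal sketch that only builds the string in the format shown in the example above; the function name `format_topic` and the `"unassigned"` default are assumptions for illustration, and setting the topic in Slack is left to whatever tooling you already use.

```python
def format_topic(ic: str, scribe: str, comms: str = "unassigned") -> str:
    """Build the incident channel topic in the 'IC | Scribe | Comms' format.

    The "unassigned" default is a hypothetical placeholder for when no
    Communication Commander has been named yet.
    """
    return f"IC: {ic} | Scribe: {scribe} | Comms: {comms}"


if __name__ == "__main__":
    # "Jane Doe" and "Sam Lee" are placeholder names.
    print(format_topic("Adam Dossa", "Jane Doe", "Sam Lee"))
    # prints: IC: Adam Dossa | Scribe: Jane Doe | Comms: Sam Lee
```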

Steps for Subject Matter Experts

You are there to support the Incident Commander in identifying the cause of the incident, suggesting and evaluating repair actions, and following through on those actions.

  1. Investigate the incident by analyzing any graphs, logs, or dashboards at your disposal (Datadog, chain explorers, node logs, etc.). Announce all findings to the Incident Commander.

    • If you are unsure of the cause, that is fine. Simply state that you are investigating and provide regular updates to the IC.
  2. Announce all suggestions for resolution to the Incident Commander. It is their decision on how to proceed — do not take actions unless told to do so!

  3. Follow instructions from the Incident Commander.

  4. Once the call is over, you can start to follow the steps from After an Incident.

Steps for Communication Commander

Be on stand-by to post public-facing messages regarding the incident.

  1. You will typically be required to update status pages, social media accounts, and community channels (Discord, Telegram, X/Twitter) at certain times during the call.

  2. Follow instructions from the Incident Commander.

  3. Once the call is over, you can start to follow the steps from After an Incident.

Steps for Internal Liaison

You are there to provide updates to internal stakeholders and to mobilize additional internal responders as necessary.

  1. Be prepared to page other people as directed by the Incident Commander, using incident.io or direct Slack mentions of on-call groups.

  2. Notify internal stakeholders as necessary. Post updates in #service-issue-alert for broad visibility.

  3. Provide regular status updates in Slack (roughly every 30 minutes) to leadership, giving an executive summary of the current status. Keep it short and to the point, and use @here.

  4. Follow instructions from the Incident Commander.

  5. Once the call is over, you can start to follow the steps from After an Incident.
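
The roughly 30-minute cadence in step 3 is easy to lose track of during a busy call. The sketch below shows one way to compute when the next leadership update is due and whether it is overdue. The names `UPDATE_INTERVAL`, `next_update_due`, and `is_update_overdue` are hypothetical, and the interval is treated as exactly 30 minutes for simplicity even though the guidance is approximate.

```python
from datetime import datetime, timedelta, timezone

# The runbook asks for updates "roughly every 30 minutes"; this sketch
# treats that as a fixed interval.
UPDATE_INTERVAL = timedelta(minutes=30)


def next_update_due(last_update: datetime,
                    interval: timedelta = UPDATE_INTERVAL) -> datetime:
    """Return the time at which the next status update should be posted."""
    return last_update + interval


def is_update_overdue(last_update: datetime, now: datetime,
                      interval: timedelta = UPDATE_INTERVAL) -> bool:
    """Return True once the interval since the last update has elapsed."""
    return now >= next_update_due(last_update, interval)


if __name__ == "__main__":
    last = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(next_update_due(last).isoformat())
    print(is_update_overdue(last, datetime(2024, 1, 1, 12, 45, tzinfo=timezone.utc)))
```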