First Responder Guide#

This is your operational playbook. It tells you exactly what to do, in order, from the moment you notice something might be wrong through incident resolution and handoff. Every step has the exact command, channel, or action you need. Nothing is left to memory.

Your job is not to fix the problem. Your job is to detect, declare, coordinate, document, and communicate. You are the air traffic controller, not the pilot.

If you are reading this during an active incident, skip to Step 4: Declare the Incident.


Before Your Shift Starts#

Do this checklist before every shift. It takes 5 minutes and will save you 30 minutes of panic later.

  • [ ] incident.io notifications working? Have someone send you a test page. Verify you get a Slack notification AND a phone call if unacknowledged after 5 minutes.
  • [ ] Slack channels joined? At minimum: #alerts-ps-pos, #service-issue-alert, and #incidents-pos (see the Slack Channels Reference below).
  • [ ] Can you access these tools?
    • Datadog -- monitoring dashboards
    • status.polygon.technology -- status page (you will update this)
    • incident.io -- incident management (you will declare incidents here)
    • Google Meet -- you will create or join war rooms
  • [ ] Do you know who is on-call with you? Check incident.io schedules for the current IC and on-call SMEs for DevOps, Applications, Smart Contracts, etc.

The Guide#

Step 1: Something Looks Wrong#

You noticed something. Maybe it's a Datadog alert in #alerts-ps-pos. Maybe someone pinged #service-issue-alert. Maybe you saw a spike on a dashboard. Maybe a partner DMed you on Telegram.

Do not ignore it. Do not wait for a second signal.

Go to Step 2.


Step 2: Quick Triage (2 minutes max)#

You need to answer one question: Is this a real problem, or noise?

Spend no more than 2 minutes on this. You are not diagnosing root cause. You are deciding whether to escalate.

Check these things, in this order:

  1. Is there a Datadog alert firing? Check #alerts-ps-pos. If yes, read the alert name and severity. This tells you what the monitor thinks is wrong.
  2. Is anyone else already talking about this? Check #service-issue-alert and #incidents-pos. If an incident is already declared, join that incident -- do not create a new one. Go to Step 7.
  3. Can you see the problem in a dashboard? Open Datadog. Look at the relevant service dashboard (chain health, block production, RPC latency, bridge status -- whatever matches the alert or report). Is the metric clearly abnormal?

Decision point:

What you see | What you do
Clear alert firing + dashboard confirms a problem | Go to Step 3
Someone reported something but no alert is firing | Go to Step 3 -- alerts don't catch everything
Alert fired but dashboard looks normal now | Post in #service-issue-alert: "Alert [NAME] fired at [TIME] but metrics appear to have recovered. Monitoring for recurrence." Monitor for 15 minutes. If it fires again, go to Step 3.
You're not sure if it's real | Go to Step 2a

Step 2a: You're Not Sure#

This is normal. It's 2 AM and you're looking at a graph that might be weird or might be fine. Here's what you do:

Post in #service-issue-alert right now. Copy-paste this template and fill in the blanks:

@FirstResponders Potential issue detected.

What I'm seeing: [DESCRIBE WHAT YOU SAW -- alert name, dashboard metric, user report, etc.]
When it started: [TIME, or "unsure"]
Affected service: [BEST GUESS -- PoS chain, bridge, RPC, Portal, etc., or "unsure"]
Current status: Investigating. Will update in 10 minutes.

This does three things:

  1. Gets other eyes on it (someone awake might immediately recognize the pattern).
  2. Creates a paper trail in case this escalates.
  3. Starts the clock -- if no one responds and the problem gets worse, you already have context posted.

Wait up to 10 minutes. During those 10 minutes, keep watching the dashboard. If the problem:

  • Gets worse or stays bad -- Go to Step 3
  • Clearly resolves on its own -- Post an update in #service-issue-alert: "Issue appears resolved. [BRIEF DESCRIPTION]. Will monitor." Done.
  • You're still not sure after 10 minutes -- Escalate. Go to Step 3. It is always better to declare an incident that turns out to be nothing than to miss one that turns out to be real.

Step 3: Assess Severity#

You've decided this is a real problem. Now you need to assign a severity. This determines how big the response is.

Use this decision tree. Start at the top.

Is the chain halted? (No blocks being produced at all)
 YES --> S0 (Critical). Go to Step 4.

Is the chain severely degraded? (Block times >5s, consensus issues,
 checkpoint production stopped >1hr, bridge completely frozen)
 YES --> S1 (High). Go to Step 4.

Is there a complete outage of a major user-facing service?
 (Portal down for all users, all RPC endpoints down, bridge fully stuck)
 YES --> S0 (Critical). Go to Step 4.

Is there a security incident or suspected exploit?
 YES --> S0 (Critical). Go to Step 4.
      ALSO: Follow the Security Incident Response process.

Is there significant degradation but the service is still working?
 (Slow blocks, elevated latency, some users affected, one RPC provider
 down, partial feature loss)
 YES --> S1 (High). Go to Step 4.

Is the impact limited? (Single node failure with redundancy intact,
 single provider outage, cosmetic issues, minor perf degradation)
 YES --> S2 (Medium) or S3 (Low). Go to Step 4.

You're still not sure?
 --> Treat it as S1. You can downgrade later. Go to Step 4.
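The decision tree above can be sketched as a function. This is purely an illustration of the triage logic -- the boolean condition names are paraphrases of the questions in the tree, not an official API:

```python
def assess_severity(chain_halted: bool,
                    security_incident: bool,
                    full_outage: bool,
                    severely_degraded: bool,
                    significant_degradation: bool,
                    limited_impact: bool) -> str:
    """Mirror of the severity decision tree: start at the top, first YES wins."""
    if chain_halted or security_incident or full_outage:
        return "S0"  # chain halt, suspected exploit, or total user-facing outage
    if severely_degraded or significant_degradation:
        return "S1"  # degraded but the service is still working
    if limited_impact:
        return "S2"  # or S3 for cosmetic/minor issues -- judgment call
    return "S1"      # still not sure? treat as S1 and downgrade later
```

Note the fallthrough: when no branch clearly matches, the answer is S1, never "wait and see".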

Do Not Debate Severity

If you're torn between two levels, always pick the more severe one. Severity can be downgraded later. It cannot be un-delayed. We do not litigate severity during an active incident.

Ask: What's the worst that can happen?

Once the IC and SMEs are on the call, explicitly ask: "What's the worst-case scenario if this continues unchecked?" This forces a concrete answer about blast radius -- fund loss, chain halt, data exposure, partner impact -- and ensures severity is calibrated to the real risk, not just the current symptoms. If the answer changes your severity assessment, upgrade immediately.

S0 and S1 are major incidents. They trigger the full incident response process: war room, IC, Communication Commander, status page, 15-minute update cadence.

S2 and S3 get a coordinated response without the full war room. But if an S2 requires coordination across multiple teams, you can escalate it to full incident response at any time.


Step 4: Declare the Incident#

You are about to formally declare an incident. This is the right thing to do. Even if it turns out to be a false alarm, that's fine -- we'd rather over-declare than under-declare.

In Slack, type this command:

/inc

A dialog will open. Fill in:

  • Name: Short description of what's happening. Examples:
    • "PoS chain halt - no blocks produced since 14:32 UTC"
    • "Bridge withdrawals stuck for 45+ minutes"
    • "RPC latency spike across multiple providers"
    • "Portal completely unreachable"
  • Severity: The severity you determined in Step 3 (S0, S1, S2, or S3).

What happens automatically when you do this:

  • incident.io creates a dedicated Slack channel for this incident
  • An announcement posts to #incidents-pos
  • The incident timeline starts being tracked

Go to Step 5.


Step 5: Page the Right People#

You need people on this incident. Page them now, in this order of priority.

For S0 and S1 (major incidents), page ALL of these:

Who | How | Why
Incident Commander | See below for who to page | Leads the response. You need them.
DevOps On-Call | Type @DevOpsOnCall in the incident channel | Infrastructure, nodes, deployments
Communication Commander | Page via incident.io or Slack | Public comms on X, Telegram, Discord
BD On-Call | Type @BDOnCall in the incident channel | T0 customer outreach

Which Incident Commander to page depends on the affected product:

Affected Product | IC to Page
Polygon PoS Chain | Adam Dossa or Mudit Gupta
Agglayer | Taylan or Mudit Gupta
Other / not sure | Mudit Gupta (catch-all escalation)

Also page the relevant domain SMEs based on what's affected:

If the problem involves... | Page...
Infrastructure, nodes, deploys, chain recovery | @DevOpsOnCall
Portal, Staking UI, backend services | @ApplicationsOnCall
Bridge, staking contracts, state sync, checkpoints | @SmartContractsOnCall
Agglayer service, cross-chain settlement | @AgglayerOnCall
Security concern of any kind | @SecOpsOnCall

For S2 and S3 (non-major):

Page the relevant on-call SME team for the affected service. You do not need the full war room unless the situation requires cross-team coordination. If it does, escalate to S1.

If nobody responds within 5 minutes: Escalate. Page the backup on-call. If still no response, page Mudit Gupta directly. Never sit and wait quietly.

If you genuinely don't know who to page -- the on-call group doesn't exist yet, you're not sure which team owns the affected service, or incident.io isn't giving you a clear answer -- check the Polygon org charts in ClickUp to find the team lead or engineering manager for the affected area and page them directly. This is the fallback of last resort; in general, the @OnCall groups in Slack (managed through incident.io) should be your first path.
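The escalation rule above (primary on-call, then the backup after 5 minutes, then Mudit Gupta directly) amounts to walking an ordered paging chain. A sketch under loose assumptions -- `page` and `acknowledged` are injected placeholders for "type the @handle in the channel" and "did they respond within the 5-minute window", not real incident.io calls:

```python
# Ordered escalation chain from Step 5. The waiting (5 minutes per page)
# happens on the wall clock; here it's modeled inside `acknowledged`.
ESCALATION_CHAIN = ["primary on-call", "backup on-call", "Mudit Gupta"]

def escalate(page, acknowledged):
    """Page each person in order; move to the next if unacknowledged."""
    for who in ESCALATION_CHAIN:
        page(who)              # e.g. @DevOpsOnCall in the incident channel
        if acknowledged(who):  # responded within ~5 minutes?
            return who
    return ESCALATION_CHAIN[-1]  # never sit and wait quietly
```

The point of the structure: there is always a next name, so "nobody answered" is never a terminal state.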


Step 6: Set Up the War Room (S0 and S1 Only)#

For major incidents, you need a Google Meet war room.

  1. Create a Google Meet link. (Or use a pre-configured incident bridge if one exists.)
  2. Pin the Google Meet link in the incident Slack channel.
  3. Set the channel topic with the key info. Copy-paste and fill in:
IC: [NAME] | Scribe: [YOUR NAME] | Comms: [COMMS COMMANDER NAME or "TBD"] | Meet: [LINK]

If you don't know who the IC or Comms Commander is yet, put "TBD" and update it as people join.

  4. Enable Gemini note-taking and transcription. Once you're in the Google Meet, click the pencil/notepad icon (or go to "Activities" > "Meeting notes") and turn on Gemini. This gives you an automatic transcript and AI-generated summary of the entire call. This is extremely valuable for postmortems -- it captures things the scribe misses and provides a verbatim record. If you can't find the Gemini toggle, ask someone else on the call to enable it. Do not skip this step.

  5. Join the Google Meet yourself. You need to be on both Slack and the call.


Step 7: Join the War Room and Start Scribing#

You are now the scribe. This is your primary role for the rest of the incident. Here's what that means in practice:

Your job is to document what happens in the incident Slack channel. You are keeping the written record. The IC runs the response. The SMEs do the technical work. You write down what happened.

What to document (post these in the incident Slack channel as they happen):

  • Key actions taken: "14:47 UTC - @alice is restarting the block producer node prod-bor-bp-01"
  • Status updates from the IC: "14:52 UTC - IC update: S0, block production halted, suspected stuck lock on BP. Alice restarting service. Next check-in in 3 min."
  • Decisions and their outcomes: "14:55 UTC - Polled for rolling restart. No objections. Proceeding. @bob executing."
  • Follow-up items (prefix with TODO): "TODO: Why did the block production monitor not fire until 10 minutes after blocks stopped?"
  • When SMEs join or leave: "15:01 UTC - @charlie (Smart Contracts) joined the call."
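All of the example entries above share one timestamp convention: `HH:MM UTC - message`. A one-line helper (purely a convenience sketch, not a required tool) that produces it:

```python
from datetime import datetime, timezone

def scribe_entry(message, at=None):
    """Format a timeline entry the way the examples above do: 'HH:MM UTC - ...'."""
    at = at or datetime.now(timezone.utc)  # always UTC, never local time
    return f"{at:%H:%M} UTC - {message}"
```

Consistent UTC timestamps matter because the postmortem timeline is assembled from these entries later, often by someone in a different timezone.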

What NOT to do:

  • Do not investigate the problem yourself. Do not check logs. Do not SSH into anything.
  • Do not document every word said on the call. Capture key events, decisions, and actions only.
  • Do not perform remediations unless the IC explicitly asks you to (and even then, push back -- your scribing role is more important).
  • Do not restart services, disable features, or take any "fix" actions on your own -- even if you think you know what's wrong. Blindly disabling or restarting things can make the problem worse, destroy evidence needed for investigation, and create new problems. Wait for the IC to assign remediation to an SME. The only exception is an active security breach where funds or data are at immediate risk (see Security Incident Response).

Step 8: Status Page and Partner Comms (S0 and S1)#

This is your responsibility. Do not wait for someone to tell you to do this.

Within 5 minutes of declaring the incident, update the status page:

  1. Go to status.polygon.technology (or use the incident.io integration if configured).
  2. Create a new incident on the status page. Use this template for the initial message:
We are currently investigating reports of [BRIEF DESCRIPTION OF IMPACT].
We are aware of the issue and are actively working on resolution.
Next update in 15 minutes.

Examples of good initial messages:

  • "We are currently investigating reports of degraded block production on Polygon PoS. We are aware of the issue and are actively working on resolution. Next update in 15 minutes."
  • "We are currently investigating reports of delayed bridge withdrawals. We are aware of the issue and are actively working on resolution. Next update in 15 minutes."
  3. Trigger the Partner Comms Bot notification (for S0 and S1). This is sent automatically via incident.io for major incidents. Verify it went out.

Update cadence:

Time since declaration | Update frequency
First 2 hours | Every 15-20 minutes
After 2 hours | Every 30 minutes minimum (reduce if updates are content-less)
When impact changes meaningfully | Immediately, regardless of cadence
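The cadence table reduces to a simple rule; a sketch with the minutes taken straight from the table:

```python
def next_update_minutes(hours_since_declaration, impact_just_changed=False):
    """Minutes until the next status-page update, per the cadence table."""
    if impact_just_changed:
        return 0   # update immediately, regardless of cadence
    if hours_since_declaration < 2:
        return 15  # every 15-20 minutes in the first 2 hours
    return 30      # every 30 minutes minimum after that
```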

Each update should include:

  • Current impact (what users/validators/partners are experiencing right now)
  • What we're doing about it (investigation, mitigation, fix in progress)
  • When the next update will be

You do not decide the content of these updates alone. Listen to what the IC and SMEs are saying on the call, distill it into user-facing language, and post it. If you're unsure what to say, ask the IC: "IC, what should I say on the status page?"


Step 9: During the Incident -- Your Ongoing Responsibilities#

While the incident is active, you are doing four things in a loop:

1. LISTEN to the call and Slack channel
2. DOCUMENT key events, actions, and decisions in the incident channel
3. UPDATE the status page on cadence (see Step 8)
4. PAGE additional people if the IC asks you to

Specific things to watch for and act on:

Situation | What you do
IC asks for a specific person/team to be paged | Page them immediately. Use the @handle in the incident channel. Post: "Paging @DevOpsOnCall per IC request."
IC announces a severity change | Update the channel topic. Update the status page. Post in channel: "Severity changed from S1 to S0 per IC."
IC asks if anyone has strong objections to a plan | Note the result in channel: "Polled for [PLAN]. No objections. Proceeding."
SME reports a finding | Note it: "15:12 UTC - @alice reports: mempool is full on bp-01, likely cause of block delay."
Someone asks for a status update in #service-issue-alert | Post a brief update there. Keep discussion in the incident channel, not in #service-issue-alert.
30 minutes pass without a status update to internal stakeholders | Post an update in #service-issue-alert with @here. Template: "@here S0 update: [CURRENT STATE]. [WHAT WE'RE DOING]. [NEXT UPDATE TIME]."
You lose track of what's happening | Ask the IC: "IC, can we get a quick status update for the channel?" This is normal and expected.
IC is not on the call yet | Post in the incident channel: "No IC present yet. Paging [NAME]." Page them again if no response in 5 min.

Step 10: The IC Arrives -- Handoff#

When the IC joins the call, they will announce themselves:

"This is [NAME], I am the Incident Commander for this call."

Once the IC is on the call:

  1. Brief them. Give a 30-second summary: "We're seeing [PROBLEM]. It started at [TIME]. Current severity is [S-LEVEL]. [WHO] is investigating. Status page has been updated. Here's what we know so far: [KEY FACTS]."
  2. Confirm your role. "I'm acting as scribe. Channel topic is updated."
  3. Follow their instructions from this point forward. They are in charge. You document and communicate.

If the IC asks you to also handle Communication Commander duties (for smaller incidents), you can do both. If the incident is large, push back: "IC, I think we need a separate Comms Commander. Can we page Marketing?"


Step 11: Resolution#

The IC will announce that the incident is resolved or recovering:

"The incident is resolved. We're ending the call."

Before you post the resolution, confirm recovery is real:

The IC and SMEs should have verified that the fix is working, but as scribe you are the last safety check before we tell the world it's over. Before posting the resolution message:

  1. Check the dashboards yourself. Open the same Datadog dashboard you looked at during triage. Are the metrics back to normal? Are the alerts cleared? If anything still looks off, say so: "IC, I'm still seeing [METRIC] elevated on the dashboard. Are we confident this is resolved?"
  2. Ask the IC explicitly: "IC, can you confirm the fix is verified and we're clear to post resolution?" Wait for a clear yes.

Once confirmed, do the following in order:

  1. Post in the incident Slack channel:

    Call is over. Thanks everyone. Follow-up discussion continues in this channel.
    

  2. Update the status page with the resolution message. Get approval from the IC on the wording. Template:

    The issue affecting [SERVICE] has been resolved. [BRIEF DESCRIPTION OF WHAT HAPPENED AND THE FIX].
    [If there are lingering impacts, describe them. If not:] There are no ongoing impacts.
    We will publish a full postmortem within [3 days for S0 / 5 business days for S1].
    

  3. Update the incident channel topic to reflect resolved status.

  4. Trigger the Partner Comms Bot with the resolution message if applicable.

  5. Close the incident in incident.io:

    /inc close
    

  6. Monitor for regression. For the next 30 minutes after resolution, keep an eye on #alerts-ps-pos and the relevant dashboards. If the problem comes back, re-open the incident immediately: post in the incident channel, page the IC back, and update the status page. Recurrence within 30 minutes is common -- don't walk away.
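The 30-minute regression watch can be pictured as a polling loop. Everything here is illustrative: `problem_recurred` stands in for "re-check #alerts-ps-pos and the dashboards", and `sleep` is injectable only so the sketch can be exercised without real waiting:

```python
import time

def watch_for_regression(problem_recurred, watch_minutes=30, poll_seconds=60,
                         sleep=time.sleep):
    """Poll for recurrence; return True the moment the problem comes back."""
    for _ in range(int(watch_minutes * 60 / poll_seconds)):
        if problem_recurred():
            return True   # re-open the incident, page the IC back, update the status page
        sleep(poll_seconds)
    return False          # 30 quiet minutes -- safe to step away
```

In practice you are the loop: keep the dashboard tab open and glance at it every minute or so until the window closes.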


Step 12: After the Incident#

Your job isn't done yet. These are your post-incident responsibilities:

  1. Review the Slack channel history. Go through the incident channel and extract any key events you missed while on the call. Clean up your timeline notes.

  2. Collect all TODO items. Search the channel for "TODO". Copy all of them into the postmortem document once it's created.

  3. Contribute to the postmortem. The IC will assign a postmortem owner. That person will reach out to you for the timeline and your notes. Be responsive -- the postmortem needs to be scheduled within 3 calendar days (S0) or 5 business days (S1).

  4. Hand off to the next shift. If your shift is ending, brief the incoming First Responder on what happened and any ongoing concerns.
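Collecting the TODOs (item 2 above) is a plain text filter over the channel history. A minimal sketch, assuming you've exported or copied the channel messages as lines of text:

```python
def collect_todos(channel_lines):
    """Pull every TODO item out of exported incident-channel messages."""
    return [line.strip() for line in channel_lines if "TODO" in line]
```

This works because Step 7 asked you to prefix every follow-up item with "TODO" -- the discipline during the incident is what makes the cleanup afterward a one-liner.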


Quick Reference Card#

Print this. Tape it to your monitor. Seriously.

SOMETHING LOOKS WRONG
  |
  v
CAN I TELL IN 2 MIN IF IT'S REAL?
  |           |
  YES         NO --> Post in #service-issue-alert, wait 10 min
  |                   |           |
  |                   RESOLVES    DOESN'T RESOLVE / GETS WORSE
  |                   |           |
  |                   Done        |
  v                               v
ASSESS SEVERITY
  Chain halted / security / full service outage  --> S0
  Severe degradation / most users affected       --> S1
  Degraded but functional / limited impact       --> S2
  Minor / cosmetic / redundancy handled it       --> S3
  Not sure                                       --> S1
  |
  v
DECLARE: /inc (set name + severity)
  |
  v
PAGE (S0/S1):
  IC (Adam/Taylan/Mudit depending on product)
  @DevOpsOnCall
  Comms Commander
  @BDOnCall
  + relevant SME team
  |
  v
WAR ROOM (S0/S1):
  Create Google Meet, pin in channel, set topic
  |
  v
STATUS PAGE: Update within 5 min of declaration
  |
  v
SCRIBE: Document. Update status page on cadence. Page people. Repeat.
  |
  v
RESOLUTION: Post resolution, update status page, /inc close
  |
  v
AFTER: Review notes, collect TODOs, contribute to postmortem

Slack Channels Reference#

Channel | What it's for | When you use it
#alerts-ps-pos | Automated Datadog alerts | Check here when triaging a potential issue
#service-issue-alert | Anyone can report a potential issue | Post here when you're not sure if something is an incident
#incidents-pos | Formal incident declarations | /inc posts here automatically. No discussion in this channel.
Dedicated incident channel | All incident discussion and coordination | Created automatically by /inc. This is where everything happens.

On-Call Groups Reference#

Group | Slack Handle | When to page them
First Responders | @FirstResponders | That's you.
DevOps | @DevOpsOnCall | Infrastructure, nodes, deployments, chain recovery
Applications | @ApplicationsOnCall | Portal, Staking, managed services, backend
Smart Contracts | @SmartContractsOnCall | Bridge, staking contracts, state sync, checkpoints
Agglayer | @AgglayerOnCall | Agglayer service, cross-chain settlement
SecOps | @SecOpsOnCall | Any security concern, suspected exploit
BD | @BDOnCall | T0 customer outreach, partner comms
IT | @ITOnCall | Internal tools down (VPN, SSO, Slack itself)

Common Situations and What To Do#

"An alert fired but it resolved before I could look at it"#

Post in #service-issue-alert: "Alert [NAME] fired at [TIME] and auto-resolved at [TIME]. Monitoring for recurrence." Watch for 15 minutes. If it fires again, treat it as a real incident.

"Someone on Twitter/Discord is reporting an issue but I can't see it in monitoring"#

Investigate for 2 minutes. If you can confirm or can't rule it out, post in #service-issue-alert and start triage. User reports are valid escalation sources -- our monitors don't catch everything.

"The IC isn't responding to pages"#

After 5 minutes with no response: page the backup IC. If you don't know who that is, page Mudit Gupta directly. If no IC is available after 10 minutes and this is S0/S1, announce on the call: "No IC available. I am coordinating the response until one joins. [SENIOR SME], can you help lead the technical investigation?"

"I declared an incident and it turned out to be nothing"#

Good. You did the right thing. Post in the incident channel: "False alarm -- [REASON]. Closing incident." Run /inc close. No one will be upset. We'd rather have a false alarm than a missed incident.

"I think this might be a security incident"#

If there is any chance it's security-related (suspected exploit, unauthorized access, unusual fund movements, vulnerability report), page @SecOpsOnCall immediately and follow the Security Incident Response process. Key differences: use a private Slack channel, give it an innocuous codename, and do not discuss details publicly until cleared by SecOps.

"The status page tool / incident.io / Slack is down"#

Use whatever channel is working. If Slack is down, use Google Meet + phone calls. If incident.io is down, create a Slack channel manually (name it #inc-YYYY-MM-DD-short-description). Update the status page manually via its web UI. Adapt, don't freeze.
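The manual fallback channel name (#inc-YYYY-MM-DD-short-description) can be built consistently even at 2 AM. A sketch -- the slug rules (lowercase, hyphens for anything non-alphanumeric) are an assumption about sensible Slack naming, not a formal spec:

```python
import re
from datetime import date

def fallback_channel_name(short_description, on=None):
    """Build the manual incident channel name: #inc-YYYY-MM-DD-short-description."""
    on = on or date.today()
    # Lowercase and collapse runs of non-alphanumerics into single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", short_description.lower()).strip("-")
    return f"#inc-{on:%Y-%m-%d}-{slug}"
```

For example, "RPC latency spike" on 2024-03-02 becomes `#inc-2024-03-02-rpc-latency-spike`.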

"It's an S2 or S3 -- do I need to do all of this?"#

No. For S2/S3:

  • Declare with /inc (still do this for tracking).
  • Page the relevant on-call SME team.
  • Monitor the issue. Work it as top priority.
  • No war room needed unless cross-team coordination is required.
  • Update status page only if there's user-facing impact.
  • If it escalates, upgrade to S1 and spin up the full process.

Key Principles to Remember#

  1. Never hesitate to escalate. It is always better to page someone and not need them than to not page someone and need them.
  2. Assume the worst severity. If you're not sure if it's S1 or S0, treat it as S0. Downgrade later.
  3. Don't diagnose -- coordinate. You are not expected to find the root cause. You are expected to get the right people into the room and keep the process moving.
  4. Document as you go. Your notes in the incident channel become the postmortem timeline. Future you (and the postmortem owner) will thank you.
  5. Silence is not inaction. If the call goes quiet, people are working. But if YOU haven't posted a status update in 20 minutes, post one.
  6. When in doubt, post in #service-issue-alert. It's the lowest-risk escalation path. More eyes are always better.