Senior SREs or specialists often hold critical incident resolution knowledge that is not documented or maintained, leading to delayed incident recovery when they leave. Existing documentation (e.g., Google Docs) is often outdated, incomplete, or hard to find. Teams frequently spend excessive time digging through old messages or calling former employees during incidents, increasing downtime and operational risk.
“RunReplay captures the Slack threads, timelines, and key actions from every PagerDuty incident—then surfaces exactly what worked last time the moment a similar incident fires again. Built for SRE teams who've lost a senior engineer and had to rebuild institutional knowledge from scratch.”
An app that automatically captures incident activities in real-time—logging commands run, chat conversations, timelines, and actions taken during an incident—and links these directly to incident types. It maintains living, versioned runbooks tied to real incident data that evolve as new incidents occur, with automatic prompts for stale content review. It includes searchability, access control, and alerts for outdated procedures, enabling rapid onboarding and knowledge sharing.
Increasing complexity and velocity of cloud infrastructure combined with recent advances in real-time collaboration and logging tools make automated, living incident documentation feasible and valuable.
Staff SRE or Platform Engineering Lead at a Series B–D SaaS company (50–300 engineers), running 3+ Kubernetes clusters, who has experienced knowledge loss from engineer attrition and is measured on MTTR reduction.
~18,000 US-based companies fit the Series B–D, 50–300 engineer, Kubernetes-heavy profile (Crunchbase + LinkedIn filter estimate). At $500–$2,000/mo per team, serviceable addressable market is ~$150–350M/yr—enough for a $10–30M ARR outcome without enterprise sales.
Build a Framer landing page with a 2-minute Loom demo showing the concept. Add a $299 'Founding Team' pre-order via Stripe (lifetime discount, billed when MVP ships). DM 50 staff SREs and platform leads who commented on the r/devops thread above, plus post in r/kubernetes and r/sre. Offer 5 free 'incident audit' calls where you manually replay their last 3 incidents using their own Slack exports—this concierge version validates the workflow without writing code.
5 pre-orders at $299 or 8 companies that complete the free incident audit and say they would pay $49+/user/mo — whichever comes first. If neither happens within 3 weeks, the messaging or buyer needs revision.
PagerDuty dominates alerting and incident coordination but has minimal focus on knowledge capture or runbook evolution from real incident data — it routes alerts, not institutional knowledge. Lynx and Edgedive are AI-first incident resolution tools focused on automation and triage, not on capturing and persisting what humans learn during incidents. OneGrep automates runbook execution but doesn't address the knowledge decay problem — runbooks still need to be authored and maintained manually. The specific gap is the feedback loop: no major player automatically enriches and versions runbooks from live incident telemetry, closing the knowledge degradation cycle.
Incident response platform focused on alerting, on-call scheduling, and orchestration with some runbook features, but limited automatic knowledge capture from incidents.
Enterprise IT service management with incident management, AIOps, and runbook automation; recent AI enhancements for post-incident analysis.
Incident alerting and response integrated with Jira; supports runbooks, but authoring is manual.
AI-first incident resolution for SREs, focusing on triage and automation.
Incident management with retrospective capture, runbook linking, and post-mortems.
Modern incident response with runbook execution and knowledge base.
AI-powered runbook execution and automation for incidents.
All-in-one incident management with timelines, runbooks, and post-mortems.
Incident management platform with Slack integration, timelines, and runbooks.
The core differentiator is passive, real-time knowledge capture tied to actual incident events — commands run, Slack threads, timelines — rather than requiring engineers to manually document post-incident. A verticalized focus on knowledge half-life (staleness alerts, version history, reviewer prompts) addresses a workflow no current tool owns. Pricing as a knowledge operations layer that integrates with existing toolchains (PagerDuty, Slack, Jira) rather than replacing them reduces adoption friction significantly.
RunReplay is the only tool that surfaces what actually worked in past incidents—not AI-generated suggestions, but real human-approved action sequences—directly inside the active incident Slack channel.
We are the incident replay layer for SRE teams who can't afford to forget.
Historical incident data gravity: the longer a team uses RunReplay, the richer its incident memory becomes, making the tool dramatically more valuable at month 12 than month 1 and creating high switching costs because leaving means losing institutional memory.
SREs don't want AI to write their runbooks—they want to find the Slack message from 8 months ago where Maya typed exactly what she did to fix the pod crash loop, and RunReplay is the only tool built on that insight rather than on auto-generation.
PagerDuty, Atlassian (Confluence + Jira), or Slack could add AI-driven incident summarization and runbook linking as native features, commoditizing the core value proposition. The product requires deep integrations across heterogeneous DevOps stacks (terminals, Kubernetes, Slack, PagerDuty, etc.), making the build complex and the sales cycle long. Security and compliance concerns around logging raw commands and chat conversations during incidents may block enterprise adoption or require significant trust-building. AI-generated runbook quality may be inconsistent enough to erode trust; SREs may not rely on auto-captured content without heavy human curation, reducing the automation value. Market timing risk: AI coding and ops tools are evolving so rapidly that the knowledge management angle could be absorbed into broader AI SRE platforms within 12–18 months.
Market timing poses the most significant risk: with AI tooling evolving rapidly, larger competitors consolidating around similar capabilities within a 12–18 month window could render RunReplay obsolete before it finds product-market fit. Privacy and data-security concerns around captured commands and conversations could also create compliance hurdles that stall enterprise sales.
OpsGenie initially focused on alerting without knowledge capture, leaving users to hunt for incident context and learnings in other tools. VictorOps faced a similar fate: after its acquisition by Splunk it was overshadowed by the larger vendor's ecosystem and never developed a distinct knowledge management strategy.
The claim that existing tools do not capture incident knowledge underestimates the adaptability of platforms like FireHydrant and Incident.io, which are already interested in enhancing their offerings in the knowledge management space. As teams prioritize integrations that combine alerting and runbook updates, RunReplay may struggle to maintain its position as a 'must-have' tool due to the competition's speed and resources.
Viable opportunity in a $2.5B+ growing market (11-19% CAGR) with clear gap in automatic, living runbook evolution from incidents—PagerDuty and ServiceNow dominate alerting/ITSM but neglect knowledge persistence. Most dangerous is Incident.io/FireHydrant for modern UX/timelines, yet none fully close the feedback loop on human SRE learnings. Best breakthrough via mid-market DevOps with Slack-native auto-capture, targeting pain of stale docs and ex-employee calls; less entrenched than enterprise giants.
Step 1: Reply with genuine insight to the 852-upvote r/devops thread and DM the 15–20 users who described the exact knowledge-loss pain. Step 2: Post a 90-second Loom in r/sre showing the concierge incident audit offer (free for first 10 teams). Step 3: Search LinkedIn for 'Staff SRE' or 'Platform Lead' at Series B–D companies with 50–300 employees, filter by companies using PagerDuty (check their job postings), send 100 cold DMs with the subject 'How did you fix that last K8s incident?' Step 4: Convert 3 of those audit calls to $499/mo paid pilots before writing more code.
$39/user/mo (Growth, up to 15 users), $69/user/mo (Team, unlimited users + runbook versioning history), 14-day free trial, no credit card required. Minimum 3-user seat. Annual discount: 2 months free.
A single 30-minute MTTR reduction per incident saves $1,500–$5,000 in eng cost at these companies. At $39/user/mo for a 5-person on-call rotation ($195/mo), payback is achieved in the first incident. This is priced below Incident.io and FireHydrant to win on value-per-dollar, not features.
The aha moment occurs when a recurring incident fires and RunReplay posts the prior resolution summary in the channel within 60 seconds—engineers resolve it faster, and someone says 'oh, this is actually useful' in the thread.
If horizontal DevOps positioning fails to convert, niche down to fintech/banking SRE teams where audit trail requirements for incident resolution are regulatory (SOC2, PCI-DSS)—same product, compliance-first messaging
If incident replay is too niche, reposition the same incident history product as a 30-day onboarding accelerator for new SRE hires—let them replay the last 12 months of incidents to ramp faster
If direct sales CAC stays above $300 with no improvement after 90 days, sell the incident replay engine as an embeddable module to FireHydrant, Rootly, or Incident.io as a knowledge layer add-on
Next.js + Supabase + Slack Bolt SDK + PagerDuty Webhooks + Stripe — deploy on Vercel, zero DevOps overhead at launch
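As a hedged sketch of the ingestion path in that stack, the PagerDuty webhook could be normalized by a small pure function before anything touches Supabase. The field names below follow PagerDuty's documented v3 webhook payload shape (event.data.title, event.data.service.summary), but they are an assumption to verify against the live API; the IncidentRecord type is a hypothetical internal shape, not part of any SDK.

```typescript
// Hedged sketch: normalize a PagerDuty v3 webhook event into the minimal
// record RunReplay would store. Payload field names are assumed from
// PagerDuty's v3 webhook docs and should be verified before shipping.

interface PagerDutyWebhook {
  event: {
    id: string;
    event_type: string; // e.g. "incident.triggered"
    occurred_at: string; // ISO-8601 timestamp
    data: {
      id: string;
      title: string;
      service?: { summary?: string };
    };
  };
}

interface IncidentRecord {
  incidentId: string;
  title: string;
  service: string;
  triggeredAt: string;
}

// Returns null for event types RunReplay does not ingest.
function normalizeWebhook(payload: PagerDutyWebhook): IncidentRecord | null {
  if (payload.event.event_type !== "incident.triggered") return null;
  const d = payload.event.data;
  return {
    incidentId: d.id,
    title: d.title,
    service: d.service?.summary ?? "unknown",
    triggeredAt: payload.event.occurred_at,
  };
}
```

In a Next.js deployment this function would sit behind an API route that verifies PagerDuty's webhook signature before inserting the record.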
5–7 weeks solo dev. Weeks 1–2: Slack bot + PagerDuty webhook ingestion; weeks 3–4: similarity matching + summary rendering; week 5: runbook divergence flag + Stripe billing; weeks 6–7: QA + onboarding flow.
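The similarity-matching step in weeks 3–4 could start far simpler than embeddings: token-overlap scoring over incident titles. A minimal sketch under that assumption, where mostSimilarIncident and the 0.3 threshold are hypothetical starting points rather than a settled design:

```typescript
// Minimal sketch of incident similarity matching via Jaccard token overlap.
// A hypothetical baseline; production would likely graduate to embeddings
// once this plateaus on real incident titles.

function tokenize(title: string): Set<string> {
  return new Set(
    title
      .toLowerCase()
      .split(/[^a-z0-9]+/)
      .filter((t) => t.length > 2) // drop short stopword-ish tokens
  );
}

function jaccard(a: Set<string>, b: Set<string>): number {
  let inter = 0;
  for (const t of a) if (b.has(t)) inter++;
  const union = a.size + b.size - inter;
  return union === 0 ? 0 : inter / union;
}

// Returns the best-matching past incident title, or null if nothing
// clears the threshold (0.3 is an arbitrary starting point to tune).
function mostSimilarIncident(
  newTitle: string,
  pastTitles: string[],
  threshold = 0.3
): string | null {
  const target = tokenize(newTitle);
  let best: string | null = null;
  let bestScore = threshold;
  for (const past of pastTitles) {
    const score = jaccard(target, tokenize(past));
    if (score > bestScore) {
      bestScore = score;
      best = past;
    }
  }
  return best;
}
```

When a new PagerDuty incident fires, the bot would run this against stored titles and, on a hit, post that incident's resolution summary into the active Slack channel.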
Strong validated pain with 852-upvote proof signal and a clear gap in the market that well-funded incumbents haven't closed—but the incident-frequency dependency for activation and the real risk of Slack or PagerDuty shipping a 'good enough' native summary feature within 18 months cap the ceiling without a faster moat-building strategy around historical data lock-in.