Every platform team eventually hits the same wall. A pipeline fails at 2pm on a Tuesday. The error message says something about a timeout, a missing secret mount, or an OOM kill — but not enough to tell a mid-level engineer what to do next. They open a ticket. That ticket waits for someone who has seen this exact failure before. And while it waits, a release window closes.
This article documents the thinking behind an AI-assisted root cause analysis (RCA) system I designed for CI/CD failure investigation. The goal wasn't to automate fixes. It was to surface the right question fast enough that engineers could answer it themselves.
1. Problem — What Breaks at Scale
CI/CD pipelines fail in patterned ways. Exit code 137 almost always means an OOM kill. A connection refused during artifact push usually points to registry throttling under load. Test stage timeouts in a specific service often correlate with database migration locks introduced by another team.
At the level of one team running five pipelines a day, this knowledge stays manageable. At the level of 40 teams running several hundred pipelines across a shared platform, it doesn't. The failure modes multiply. The patterns are still there — they just aren't written down anywhere systematic. They live in Slack threads, in the institutional memory of three platform engineers who have been around long enough, and in runbooks that were last updated when the infrastructure looked completely different.
The result is that mean time to resolution (MTTR) for CI/CD failures scales poorly with team growth. Onboarding a new engineer doesn't help because they have no pattern library to draw from. Every failure investigation starts at zero.
2. Why Current Approaches Fail
The standard DevOps answer to this has been runbooks. Write down what you know, link it from your alerting system, and hope the next engineer finds it. Runbooks work — until they don't. They go stale the moment the infrastructure changes. They're hard to search across. They assume the engineer already understands the symptom well enough to pick the right runbook.
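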
Log-based post-mortems have the opposite problem. The signal is all there — pipeline logs capture everything that happened — but in raw form they're overwhelming. A failed build can easily produce 10,000+ lines across multiple stages and parallel jobs. Expecting an engineer under pressure to manually grep through that, identify the relevant signal, and map it to a known failure pattern is unrealistic.
Alert correlation tools help at the infrastructure level, but most CI/CD failures don't trigger alerts. They aren't monitoring gaps; they're workflow failures. No metric fires when a Helm chart rendering step silently truncates a ConfigMap. No PagerDuty alert fires when a test fails because a feature flag was toggled by a different team twenty minutes ago.
What's missing isn't more tooling. It's a way to match what just happened against what has happened before, fast enough to be useful, and accurate enough to be trustworthy.
3. Architecture Thinking
The architecture for this system came from a simple constraint: whatever it does, it cannot take action. The moment an AI system starts making changes to production pipelines, build configurations, or deployment state, the blast radius of a wrong inference becomes unacceptable. So the design started from the output and worked backwards.
The output is a ranked list of suggested causes, each with a confidence score and a link to the historical evidence behind it. That's it. The system presents findings; humans decide what to do with them.
Working backwards from that:
- To produce ranked suggestions, you need a pattern library — structured records of past failures, their symptoms, and their confirmed resolutions.
- To build that library, you need log processing that extracts structured signals from raw pipeline output.
- To match a new failure against the library, you need a similarity layer — something that handles paraphrasing, log format variation, and partial matches.
- To make any of this trustworthy, every inference needs to be traceable: which logs triggered it, which historical records matched, what the confidence level is.
That gives you three layers: ingestion, pattern matching, and advisory output. Each layer is independently replaceable. The ingestion layer doesn't care how matching works. The matching layer doesn't care how suggestions are presented.
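The three layers can be sketched as independent interfaces. All names here are illustrative, not the real system's; the point is that each layer is swappable without touching the others:

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class FailureSummary:
    """Structured signal extracted by the ingestion layer."""
    pipeline_id: str
    stage: str
    exit_code: int
    key_phrases: list[str] = field(default_factory=list)

@dataclass
class Suggestion:
    """One ranked cause, with traceable evidence."""
    cause: str
    confidence: float   # 0.0-1.0, surfaced to the engineer as-is
    evidence_url: str   # link to the matched historical incident

class Ingestion(Protocol):
    def summarize(self, raw_log: str) -> FailureSummary: ...

class Matcher(Protocol):
    def match(self, summary: FailureSummary) -> list[Suggestion]: ...

class Advisory(Protocol):
    def present(self, suggestions: list[Suggestion]) -> str: ...
```

Because each boundary is a plain data type, replacing the matcher (say, swapping embedding models) never touches ingestion or presentation code.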
4. Solution Model — The Advisory RCA Assistant
Here is how each layer works in practice.
Ingestion Layer
Pipeline logs from the CI platform, legacy CI systems, and GitOps controller workflows are collected via a structured log pipeline — in this case, OpenTelemetry with a custom exporter that tags each log line with pipeline ID, stage name, job name, exit code, and wall-clock duration. The raw text is preserved, but structured metadata travels alongside it.
Logs are segmented by stage. A 12,000-line build log becomes a set of smaller, stage-scoped windows. This matters because the failure signal is almost never in the middle of a successful stage — it's at the boundary where something stopped working. Segmentation reduces noise significantly before any AI processing begins.
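A minimal sketch of the segmentation step. It assumes the CI platform emits a boundary marker line like `=== STAGE: build ===` between stages; the marker format is hypothetical and would need to be adapted to your platform's actual log layout:

```python
import re

# Hypothetical stage-boundary marker; adapt to your CI platform's format.
STAGE_MARKER = re.compile(r"^=== STAGE: (?P<name>[\w-]+) ===$")

def segment_by_stage(raw_log: str) -> dict[str, list[str]]:
    """Split a multi-stage pipeline log into {stage_name: lines} windows."""
    stages: dict[str, list[str]] = {}
    current = "preamble"  # lines before the first marker, if any
    for line in raw_log.splitlines():
        m = STAGE_MARKER.match(line)
        if m:
            current = m.group("name")
            stages[current] = []
        else:
            stages.setdefault(current, []).append(line)
    return stages
```

After this step, downstream processing only ever sees the window for the stage that actually failed, which is where most of the noise reduction comes from.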
Pattern Library
The pattern library is built from historical incident data. Every resolved CI/CD failure in the ticketing system gets a structured record: what was the symptom (log snippets, error codes), what was the confirmed root cause, what was the fix. These records are embedded as dense vectors using a sentence transformer model running on internal infrastructure — nothing leaves the network.
Critically, the library is maintained by humans. Platform engineers review and tag records. An AI-generated failure record that hasn't been reviewed by a human is flagged as unverified and weighted lower in matching. This prevents the library from drifting on its own.
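The record shape and the down-weighting rule can be sketched like this. The 0.5 penalty for unverified records is an illustrative value, not the real system's:

```python
from dataclasses import dataclass

@dataclass
class PatternRecord:
    symptom: str            # log snippets, error codes
    root_cause: str         # confirmed cause from the resolved incident
    fix: str
    verified: bool = False  # True only after platform-engineer review

    def match_weight(self, raw_similarity: float) -> float:
        """Down-weight matches against records no human has reviewed yet."""
        return raw_similarity if self.verified else raw_similarity * 0.5
```

The effect is that an unreviewed, AI-generated record can still surface, but it has to be a much stronger raw match to outrank a human-verified one.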
Matching and Advisory Output
When a new failure comes in, the ingestion layer produces a structured summary: stage where failure occurred, exit code, key log phrases, metadata about the job context. This summary is embedded and compared against the pattern library using cosine similarity.
The top five matches are returned with similarity scores. A large language model is then used for one specific purpose only: to write a plain-English summary of why this failure likely matches the historical record. Not to suggest a fix. Not to take action. Just to explain the match in language that doesn't require reading the raw vector distances.
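The embed-and-rank step can be sketched as follows. A toy bag-of-words embedding stands in for the sentence-transformer model (which, in the real system, runs on internal infrastructure); the cosine-similarity ranking is the same either way:

```python
import math

def embed(text: str) -> dict[str, float]:
    """Toy bag-of-words embedding, standing in for a sentence transformer."""
    vec: dict[str, float] = {}
    for token in text.lower().split():
        vec[token] = vec.get(token, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_matches(summary: str, library: list[tuple[str, str]], k: int = 5):
    """Rank library entries (symptom_text, incident_id) against a new failure."""
    q = embed(summary)
    scored = [(cosine(q, embed(symptom)), incident_id)
              for symptom, incident_id in library]
    return sorted(scored, reverse=True)[:k]
```

In production the scores would additionally be down-weighted for unverified records before ranking, and the top five results are what the LLM is asked to explain.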
The output surfaces in the CI/CD platform's failure notification — a sidebar next to the failed build, showing the three most probable causes ranked by confidence, each with a link to the historical incident it matched and the log excerpt that triggered it.
5. Real-World Scenario
A microservice pipeline fails during the integration test stage. Exit code is 1. The log shows a database connection timeout after 30 seconds. The error is generic enough that a new engineer wouldn't know whether this is a test environment issue, a connection pool exhaustion problem, or a schema migration that hasn't run yet.
The RCA assistant ingests the failure. The stage segmentation isolates the test stage log. The structured summary captures: stage = integration-test, exit code = 1, error phrase = "connection timeout", service = payments-api, duration = 31s.
The vector match returns three historical incidents. The top match, with 87% confidence, is a record from six weeks ago: integration tests for payments-api failed with the same timeout because a Liquibase migration introduced by the accounts team had locked a shared table. The resolution was to add a pre-test migration health check.
The engineer sees this in the build failure sidebar. They check whether the accounts team deployed recently. They did — 40 minutes ago. The engineer runs the migration check manually, confirms the lock, and coordinates a resolution. Total investigation time: 11 minutes instead of an average of 47.
6. Trade-offs
It's worth being honest about what a system like this costs.
It requires sustained investment in the pattern library. An empty library produces useless results. Building it from historical data takes time, and keeping it accurate requires someone to own it. The system gets better the more it's used and maintained — but that means the first six months are the hardest, before the flywheel builds momentum.
Confidence scores can mislead. An 87% match sounds authoritative. If engineers start treating it as a diagnosis rather than a lead, they stop thinking critically about whether the suggested cause actually fits their context. The UX has to make the uncertainty visible — not just the score, but the underlying evidence. Engineers need to see why the system thinks this, not just what it thinks.
Read-only is a genuine constraint. There are scenarios where the system could theoretically trigger a safe automated remediation — re-running a flaky test, refreshing a stale cache, rotating an expired credential. Choosing not to do this means some failures still take longer to resolve than they technically need to. That's a deliberate trade. The cost of a wrong automated action on a production-adjacent pipeline is higher than the time saved by automation.
The LLM summary step adds latency and a potential failure mode. If the language model is unavailable, the system falls back to showing raw matched log excerpts and similarity scores. Less readable, but still useful. The LLM is a convenience layer, not a critical path dependency.
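The degradation path is worth making explicit in code. A minimal sketch; `explain_with_llm` is a hypothetical client callable, not a real API:

```python
def render_explanation(match_score: float, incident_id: str,
                       log_excerpt: str, explain_with_llm=None) -> str:
    """Prefer the LLM's plain-English summary; fall back to raw evidence."""
    if explain_with_llm is not None:
        try:
            return explain_with_llm(incident_id, log_excerpt)
        except Exception:
            pass  # LLM unavailable: fall through to the raw-evidence view
    return (f"Matched incident {incident_id} "
            f"(similarity {match_score:.2f}):\n{log_excerpt}")
```

Because the fallback path carries the same evidence (incident link, score, excerpt), an LLM outage degrades readability, not correctness.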
7. Future Direction
The immediate next step is integrating the pattern library with change management records. Right now the system matches against log patterns. It doesn't know that the failure occurred 23 minutes after an infrastructure-as-code apply in an adjacent environment. Adding that causal context — change event plus failure event — would significantly improve both the match quality and the explanation.
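The core of that correlation is just a time-window join between change events and the failure. A hypothetical sketch; the 30-minute window is an assumed value:

```python
from datetime import datetime, timedelta

def recent_changes(failure_time: datetime,
                   change_events: list[tuple[datetime, str]],
                   window: timedelta = timedelta(minutes=30)) -> list[str]:
    """Return descriptions of changes applied within `window` before the failure."""
    return [desc for ts, desc in change_events
            if timedelta(0) <= failure_time - ts <= window]
```

Anything this returns would be attached to the advisory output as extra causal context, alongside the log-pattern matches.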
Longer term, the pattern library should be federated. Different teams have different failure domains. A security scanning stage fails differently than a build stage. Maintaining one monolithic pattern library eventually becomes a bottleneck. A federated model where each team owns their domain's patterns, with a shared retrieval layer, scales better.
The read-only constraint will be revisited — but carefully. The right question isn't whether AI should ever act; it's which actions are safe to automate and under what conditions. Re-triggering a known-flaky test with a specific seed value, when the confidence is above a threshold and the pipeline is in a non-production environment, is a reasonable candidate. That decision belongs in a governance framework, not in an engineering sprint.
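A governance gate of that kind reduces to a conjunction of explicit conditions. A sketch under assumed values (the action allowlist, the 0.9 threshold, and the environment names are all illustrative):

```python
# Illustrative allowlist of actions a governance process has pre-approved.
SAFE_ACTIONS = {"retrigger_flaky_test"}

def may_automate(action: str, confidence: float, environment: str,
                 threshold: float = 0.9) -> bool:
    """Every condition must hold; any doubt means the human stays in the loop."""
    return (action in SAFE_ACTIONS
            and confidence >= threshold
            and environment != "production")
```

Keeping the policy as data (an allowlist plus a threshold) rather than scattered conditionals is what lets a governance body, not a sprint, own the decision.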
Finally: the biggest value this system delivered wasn't the time saved on individual failures. It was the change in how new engineers experience the platform. When you join a team and immediately have access to the institutional memory of every failure that team has investigated and resolved, onboarding changes character. The knowledge stops being tribal.