Tired of digging through Jenkins logs? I built a RAG-based AI Assistant to diagnose CI/CD failures using OpenSearch and Groq.
I got tired of the "tab-switching marathon" every time a build failed—jumping between Jenkins consoles, CloudWatch logs, and GitHub to figure out why a pipeline died. I wanted a way to just ask my data what went wrong in plain English.
So I built an AI-powered SRE assistant that works as a log intelligence layer: it hooks into your CI/CD data and lets you query your infrastructure via chat.
What it does:
💬 "Why did build #100 fail?"
💬 "Which jobs have been the most unstable this week?"
💬 "Show me build failure trends in a bar chart."
The Tech Stack:
LLM: Llama 3.1 70B (via Groq for sub-second inference).
Vector DB: OpenSearch using hybrid search (BM25 keyword matching + k-NN vectors).
Embeddings: sentence-transformers.
UI: Streamlit for the conversational dashboard.
Pipeline: An automated ETL agent that pulls fresh Jenkins build and CloudWatch log data every 30 minutes.
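To give a feel for the ingestion side, here's a minimal sketch of the chunking step that would run before embedding and indexing. The function name and window sizes are illustrative assumptions, not the project's actual API:

```python
# Hypothetical sketch: chunk a Jenkins console log into overlapping
# line windows so stack traces and error context aren't cut mid-trace.
# Window/overlap values are illustrative defaults, not the real config.

def chunk_log(text: str, window: int = 40, overlap: int = 10) -> list[str]:
    """Split a console log into overlapping windows of `window` lines,
    each sharing `overlap` lines with the previous chunk."""
    lines = text.splitlines()
    if len(lines) <= window:
        return ["\n".join(lines)] if lines else []
    step = window - overlap
    chunks = []
    for start in range(0, len(lines) - overlap, step):
        chunks.append("\n".join(lines[start:start + window]))
    return chunks

# Each chunk would then be embedded (e.g. with a sentence-transformers
# model) and bulk-indexed into OpenSearch alongside build metadata
# (job name, build number, timestamp) in a knn_vector-mapped index.
```

Overlapping windows are a common RAG trick for logs: a failure's root cause often sits a few lines above the final error, and overlap keeps both in the same retrievable chunk.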
Why it doesn't hallucinate:
The system uses a RAG (Retrieval-Augmented Generation) architecture. The AI is strictly grounded in the context retrieved from your actual Jenkins data and CloudWatch logs. If the answer isn't in the logs, it won't invent a reason for the failure. The hybrid search is key here—it catches specific error codes while understanding the semantic intent of the query.
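A rough sketch of what that retrieval and grounding looks like. The index name, field names (`log_text`, `embedding`), and prompt wording are my assumptions for illustration, not the project's actual schema:

```python
# Hypothetical sketch of hybrid retrieval + grounded prompting.
# Assumes an OpenSearch index with a BM25-analyzed "log_text" field
# and a knn_vector "embedding" field (names are assumptions).

def build_hybrid_query(question: str, query_vector: list[float], k: int = 5) -> dict:
    """Combine BM25 keyword matching with k-NN vector search in one
    bool query; OpenSearch adds the two relevance scores together."""
    return {
        "size": k,
        "query": {
            "bool": {
                "should": [
                    # Lexical leg: catches exact error codes like "exit code 137"
                    {"match": {"log_text": question}},
                    # Semantic leg: catches the intent behind the question
                    {"knn": {"embedding": {"vector": query_vector, "k": k}}},
                ]
            }
        },
    }

def grounded_prompt(question: str, chunks: list[str]) -> str:
    """Constrain the LLM to the retrieved log context so it can't
    invent a failure cause that isn't in the logs."""
    context = "\n---\n".join(chunks)
    return (
        "Answer using ONLY the log excerpts below. "
        "If the answer is not in them, say the logs don't show a cause.\n\n"
        f"LOGS:\n{context}\n\nQUESTION: {question}"
    )
```

A bool/should combination is the simplest way to mix the two signals; newer OpenSearch versions also offer a dedicated `hybrid` query type with score-normalization pipelines, which handles the fact that BM25 and cosine scores live on different scales.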
Current Roadmap:
Automated root cause analysis (RCA) by indexing historical failure patterns.
Cross-account CloudWatch correlation.
I’d love feedback from the community: how are you handling log noise at scale, and have you explored OpenSearch for similar SRE use cases?