r/mlops

▲ 41 r/mlops+1 crossposts

Is MLOps a safer direction for ML Engineers right now?

I’m currently working as an ML Engineer, and lately I’ve been thinking about shifting more toward MLOps.

My assumption is that companies will still need DevOps-type engineers who can deploy and maintain LLM models bought from other companies.

I understand nobody really knows where the industry will end up, but I’d like to hear from you all: which skills are worth investing time in during this uncertain phase, rather than just doing nothing?

reddit.com
u/stardust_137 — 2 days ago
▲ 6 r/mlops

[D] I built a free platform to learn Machine Learning through interactive coding challenges

Hi everyone,

When I started learning Machine Learning, I found plenty of tutorials and courses, but I struggled to find a structured way to practice what I was learning.

So I built **ML Playground**: a hands-on platform designed to help learners progress from fundamentals to advanced topics by writing real code.

**What’s included**
  • 17 structured chapters
  • 140+ interactive coding stations
  • 120+ coding problems with automated test cases
  • Daily challenges
  • XP and leaderboard system

The goal is to make ML learning more structured and practice-oriented.

It’s free to start:
https://mlplayground.in/

I’d love to hear your feedback on:
  • The learning experience
  • The curriculum structure
  • Features you’d like to see added

Thanks for checking it out.

reddit.com
u/Lopsided-Bit8321 — 20 hours ago
▲ 14 r/mlops+5 crossposts

Turns out "Claude Code over files in S3" quickly becomes "rebuild half the data warehouse stack"

Schemas, lineage, datasets, file refs: the agent needs to know everything! And there needs to be a system that stores all of this.

OpenAI's Data Agent post made us feel slightly less insane because they ended up building many of the same layers internally just on top of warehouses instead of object storage - https://openai.com/index/inside-our-in-house-data-agent/

Yes, most of these problems are already solved in the warehouse world, but they still need solving when you're working in S3/GCS/Azure.
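To make "the agent needs to know everything" concrete, the cheapest first layer is a schema manifest over the Parquet files in the bucket. A sketch with pyarrow (the bucket, prefix, and region are placeholders):

```python
import pyarrow.dataset as ds
import pyarrow.fs as pafs

# Build a small, agent-readable manifest for one dataset prefix in S3.
s3 = pafs.S3FileSystem(region="us-east-1")
dataset = ds.dataset("my-bucket/warehouse/events/", filesystem=s3, format="parquet")

manifest = {
    "path": "s3://my-bucket/warehouse/events/",
    "schema": {field.name: str(field.type) for field in dataset.schema},
    "n_files": len(dataset.files),
}
# Collect manifests for every prefix and hand them to the agent as context;
# lineage and dataset versioning still need a real system on top of this.
```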

I'd appreciate feedback from folks here: how do you work with large-scale datasets in object storage, and how do you supply context about them to agents?

u/dmpetrov — 1 day ago
▲ 1 r/mlops

Any tips on learning MLOps?

I've started learning Python, and I'm curious: do you have any tips for learning MLOps and doing it right?

reddit.com
u/Deziak_ — 1 day ago
▲ 5 r/mlops

Need your feedback on my assumption about how to prevent agents from failing

A thing that surprised me while digging into agent reliability is that a model with 95% accuracy per step sounds excellent. But if your agent takes 10 steps to complete a task, the overall success rate drops to ~60%. And at 100 steps, it’s basically unusable (~0.6%). The failure compounds fast.
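The compounding is just exponentiation, easy to verify (assuming independent step failures):

```python
# Success rate of an agent that must complete n independent steps,
# each succeeding with probability p, is simply p ** n.
def agent_success_rate(p_step: float, n_steps: int) -> float:
    return p_step ** n_steps

for n in (1, 10, 100):
    print(f"{n:>3} steps at 95% per step -> {agent_success_rate(0.95, n):.1%}")
# 1 step -> 95.0%, 10 steps -> 59.9%, 100 steps -> 0.6%
```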

Then I came across a few numbers that made this feel less theoretical. Datadog tracked 8.4M AI model request failures in March 2026 and reported that ~5% of AI requests fail in production. A large chunk of these aren’t infra outages, but logic/quality failures that teams can’t properly debug. Similarly, McKinsey reported that while many enterprises are experimenting with agents, very few are actually scaling them successfully in production.

The more I look at this, the more it feels like an experimentation infrastructure problem, not a model capability problem. Most teams still test agents in playgrounds/staging and then hope production behaves similarly. But prompts, tools, memory, routing, temperature, context length, fallback logic, etc. all interact in weird ways under real traffic.

Web teams solved this years ago with A/B testing and controlled rollouts. Feels like agent teams need the same thing: experiment on live traffic, compare prompt/config variants, isolate regressions, and measure task success over time.
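The routing half of that carries over almost directly; a sketch of sticky, hash-based variant assignment (the experiment name, variant names, and split are made up):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: float = 0.10) -> str:
    """Deterministic bucketing: the same user always lands in the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform-ish value in [0, 1]
    return "prompt_v2" if bucket < treatment_pct else "prompt_v1"

# Route each live request, log (variant, task_success), compare over time.
print(assign_variant("u_123", "fallback-instruction"))
```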

Curious if you agree with this or think there are better ways to solve these production issues.

reddit.com
u/wassupabhishek — 20 hours ago
▲ 8 r/mlops+7 crossposts

I changed a system prompt. Quality dropped 84% → 52%. HTTP 200. No errors. Found out 11 days later from a user complaint.

Built TraceMind to solve this. It's free, self-hosted, runs on Groq free tier.

What it does:

- Auto-scores every LLM response in background

- Per-claim hallucination detection (4 types)

- ReAct eval agent that diagnoses WHY quality dropped

- Statistical A/B prompt testing (Mann-Whitney U; sketch after this list)

- Python SDK — one decorator, nothing else changes
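I don't know TraceMind's internals, but for anyone curious what the statistical comparison amounts to, a Mann-Whitney U test over per-response quality scores is a few lines of scipy (the scores below are invented):

```python
from scipy.stats import mannwhitneyu

# Hypothetical per-response quality scores (0-10) for two prompt variants.
scores_a = [8.1, 7.9, 8.4, 6.2, 8.8, 7.5, 8.0, 7.7]  # current prompt
scores_b = [5.9, 6.4, 5.2, 7.0, 5.5, 6.1, 4.8, 6.3]  # candidate prompt

# Non-parametric two-sided test: do the score distributions differ?
stat, p_value = mannwhitneyu(scores_a, scores_b, alternative="two-sided")
print(f"U={stat}, p={p_value:.4f}")  # small p -> quality genuinely differs
```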

The agent investigation looks like this:

Step 1: search_similar_failures
  → Found 3 similar past failures (82% match)
Step 2: fetch_recent_traces
  → 14 low-quality traces in last 24h. Lowest score: 3.2
Step 3: analyze_failure_pattern
  → Root cause: prompt has no fallback for ambiguous questions
  → Fix: add explicit fallback instruction

45 seconds. Specific root cause. Specific fix.

GitHub: github.com/Aayush-engineer/tracemind

Self-hosted, MIT license, no vendor lock-in.

Happy to answer any questions about the architecture.

u/ZealousidealCorgi472 — 9 hours ago
▲ 124 r/mlops

r/mlops has been re-opened

r/mlops is open again. Yep, you read that right!

The old mods were inactive and the community entered a restricted mode. There was a huge amount of spam piling up. I'm going to clean it up and see if we can streamline the experience.

For those of you who stumbled upon this place by curiosity:

This community is for practical discussions around ML in production: infrastructure, deployment, serving, evaluation, monitoring, tooling, platforms, reliability, data pipelines, orchestration, LLMOps, platform engineering, and real-world operational lessons.

What’s welcome:

  • Technical discussions and architecture deep-dives
  • Open-source tools and projects
    • But do not spam your project or use the sub to fish for free market research!
  • Case studies and postmortems
  • Research with clear operational relevance
  • Tutorials, benchmarks, and implementation details

What’s not:

  • Low-effort self-promotion
  • Generic AI hype/content farming/AI-generated posts
  • “What AI startup should I build?” posts
  • Hiring posts. There are dedicated communities for those.
  • Affiliate spam, SEO dumps, or engagement bait

If you’re building, operating, or scaling ML systems, you’re in the right place.

Enjoy, but don't wreck the place!

u/MyBossIsOnReddit

reddit.com
u/MyBossIsOnReddit — 7 days ago
▲ 14 r/mlops

How are you guys catching upstream schema drift before it silently poisons your models in production?

Hey all. We're dealing with a nightmare right now where upstream software/data engineering teams keep making subtle schema changes (dropping columns, changing unit types, renaming API fields).

The traditional ETL/dbt tests all pass because the data pipelines themselves don't technically "break." But the feature pipelines ingest that skewed data, and our downstream ML models (specifically credit/fraud) just silently rot in production. We don't realize the model's predictions have degraded until days later.

It feels like there’s a massive gap between the data warehouse and the feature store. Great Expectations feels too heavy and slow for this, and generic pipeline monitoring doesn't catch the ML-specific context.
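For what it's worth, a contract check at that boundary doesn't have to be heavy. A minimal hand-rolled sketch with pandas (the table, columns, and ranges are illustrative) that runs before the feature pipeline ingests a batch:

```python
import pandas as pd

# Hypothetical contract for one upstream table: expected columns, dtypes, and
# plausible value ranges (range checks catch silent unit changes, e.g. USD -> cents).
CONTRACT = {
    "columns": {"account_id": "int64", "balance_usd": "float64"},
    "ranges": {"balance_usd": (0.0, 1e7)},
}

def check_contract(df: pd.DataFrame, contract: dict) -> list:
    errors = []
    for col, dtype in contract["columns"].items():
        if col not in df.columns:
            errors.append(f"missing/renamed column: {col}")
        elif str(df[col].dtype) != dtype:
            errors.append(f"dtype drift on {col}: {df[col].dtype} != {dtype}")
    for col, (lo, hi) in contract["ranges"].items():
        if col in df.columns and not df[col].dropna().between(lo, hi).all():
            errors.append(f"values outside expected range on {col}")
    return errors

# Fail the run loudly (circuit breaker) if check_contract(batch, CONTRACT) is non-empty.
```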

How are your teams handling data contracts or putting circuit breakers in place before the data hits the models? Is anyone actually doing this well, or is everyone just manually firefighting feature drift?

reddit.com
u/Tricky_Ad9372 — 3 days ago
▲ 27 r/mlops

I got tired of spending 30 minutes setting up GPU instances every time I wanted to test a model so I built a CLI that does it in 2 minutes. It's free and open source.

I kept running into the same problem. I want to test a new model, so I open RunPod, check Vast.ai, check Lambda, compare prices, spin something up, SSH in, install vLLM, figure out TP settings, pull the model, configure everything. By the time I'm actually running inference I've wasted an hour on ops work.

Then I'd forget to terminate the instance and wake up to a $96 bill. Did that twice before I snapped and built something.

It's called swm. One CLI that talks to 10 GPU clouds. Search available GPUs across all of them sorted by price, spin up an instance, and install vLLM or Ollama with one command. It auto-detects your GPU count and sets tensor parallelism for you.

The part that actually saves the most time though is the workspace sync. Your whole environment lives in S3. When you're done you run swm pod down and it pushes everything, terminates the pod, and you can resume on any provider later with everything exactly where you left it. Models, configs, all of it.

Also built a lifecycle guard that monitors GPU utilization and SSH sessions. If nothing's happening for 30 minutes it saves your workspace and kills the pod automatically. No more overnight bills.

A few things it does:

  • swm gpus -g h100 --max-price 3.00 --sort price — compare across RunPod, Vast.ai, Lambda, AWS, GCP, Azure, CoreWeave, Vultr, TensorDock, FluidStack
  • swm setup install vllm — installs and configures vLLM with correct TP settings automatically
  • swm models pull — search HuggingFace and pull to any pod
  • swm pod down — push workspace to S3, terminate, resume later on any cloud
  • Works with Cursor, Claude Code, Codex, Windsurf, or any agent that runs shell commands

It's free, open source, Apache 2.0. Install with pipx install swm-gpu

Site: https://swmgpu.com GitHub: https://github.com/swm-gpu/swm

Would love feedback from anyone who rents GPUs regularly. What's annoying about your current workflow that I should build for next?

reddit.com
u/Smurgels — 3 days ago
▲ 6 r/mlops

How do I bring feature engineering pipelines to production?

I'm relatively new to MLOps and I've been tasked with productionising feature engineering code (mostly written in SQL) into Lakeflow Spark Declarative Pipelines (SDP) on Databricks.

The current workflow is a bit tedious: the DS decides the model is ready and hands me the feature logic (huge, complex SQL with many joins and aggregations, covering every feature they've ever researched), and based on the features the model actually needs, I slim the SQL down to output only those features. This is necessary because the project requires features to be served within 1 hour of raw data being ingested, and a "master" pipeline computing all features continuously to meet that window was extremely expensive.

As you can guess, with this workflow, whenever the DS updates their model or adds a feature, I have to manually edit the pipeline code. Sometimes even one added feature is a lot of work, since there may be many intermediate operations and/or CTEs involved in its computation, and tracing back through the original complex logic is a PITA.

I'm still new to this, so I would like to hear from this community any advice or solution you may have on approaching this problem, preferably one that integrates smoothly with Databricks.

ChatGPT suggested a framework where the DS adds feature metadata to a feature registry, each model gets a config file listing its features, and a parser reads the config and auto-generates the pipeline by piecing the feature engineering operations together.

Sounds great, except I can't quite wrap my head around a parser that can reliably assemble the SQL without pulling in too many unneeded features (since features may be computed together), especially because the code I have is very complex and I still have to reduce joins and nesting in each file so that the pipeline's materialized views can refresh incrementally.
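If it helps, the registry idea doesn't have to start with a full SQL parser. A minimal sketch of the config-driven assembly (all table, feature, and fragment names are hypothetical, and this is plain Python, not a Databricks API): each feature registers its SQL fragment plus the fragments it depends on, and a resolver walks only the needed subgraph:

```python
# Hypothetical registry: each feature declares the SQL fragment that computes it
# and the fragments it depends on. A model config then lists feature names only.
REGISTRY = {
    "txn_base":     {"sql": "SELECT * FROM raw.transactions", "deps": []},
    "txn_7d_sum":   {"sql": "SELECT account_id, SUM(amount) AS txn_7d_sum "
                            "FROM txn_base GROUP BY account_id", "deps": ["txn_base"]},
    "txn_7d_count": {"sql": "SELECT account_id, COUNT(*) AS txn_7d_count "
                            "FROM txn_base GROUP BY account_id", "deps": ["txn_base"]},
}

def resolve(features):
    """Return fragment names in dependency order, covering only what's requested."""
    ordered, seen = [], set()
    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in REGISTRY[name]["deps"]:
            visit(dep)
        ordered.append(name)
    for feature in features:
        visit(feature)
    return ordered

needed = resolve(["txn_7d_sum"])  # pulls txn_base + txn_7d_sum, nothing else
query = "WITH " + ",\n".join(f"{n} AS ({REGISTRY[n]['sql']})" for n in needed)
# ...followed by a final SELECT joining the requested feature fragments.
```

The catch you raise (features computed together) is exactly what forces the fragments to be small and single-purpose, which is also what incremental materialized-view refresh wants.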

reddit.com
u/botsunny — 13 hours ago
▲ 24 r/mlops

Is it a mistake to start with MLOps instead of traditional DevOps?

I am currently learning the basics of DevOps. While researching resources, I came across 'MLOps,' which intrigued me. I’ve done some basic research, but I’m confused: should I master DevOps first to get into MLOps, or can I start with MLOps directly?

Some roadmaps suggest you can start MLOps with no prior knowledge, while others claim the exact opposite. Could someone please guide me with a realistic roadmap or share some solid resources?

Also, I’d love to know: is it actually possible for a fresher to break into this domain, or is it strictly for experienced engineers?

Thanks in advance 🥲🤝

reddit.com
u/Atomic_rizz — 6 days ago
▲ 56 r/mlops

How I approach MLOps system design questions in interviews: sharing the thinking, not just the diagram

Got asked "design a data ingestion pipeline for an ML team that needs daily data from 3 external APIs" in a system design round.

Sharing my approach.

Ask clarifying questions first. Most candidates skip this and start drawing immediately. But every answer below changes the design:

  • JSON vs streaming vs flat files? Changes the entire ingestion layer.
  • 5 GB/day vs 50 GB vs 1 TB? Python + PostgreSQL vs Spark vs full data lake with Delta Lake/Iceberg.
  • Real-time vs daily batch? Kafka + Flink vs a scheduled Airflow DAG. Massive complexity difference.
  • One team vs twenty? Simple DB vs access control, data catalogue, feature store.

I assumed: structured JSON, 5-10 GB/day, daily batch, single team, Kubernetes available.

The pipeline:

3 API sources → Airflow (KubernetesExecutor, one pod per task) → parallel extraction → raw JSON stored in MinIO untouched → transform (clean, cast, validate) → PostgreSQL.
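A skeleton of that DAG with Airflow's TaskFlow API (the task bodies, MinIO, and Postgres wiring are stubs; the KubernetesExecutor pod-per-task part is executor config, not DAG code):

```python
from datetime import datetime
from airflow.decorators import dag, task

API_SOURCES = ["source_a", "source_b", "source_c"]  # the 3 external APIs

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def ingestion_pipeline():
    @task
    def extract(source: str, ds: str = None) -> str:
        # Call the API, write the raw JSON to MinIO untouched,
        # partitioned by source and run date; return the object path.
        return f"s3://raw/{source}/{ds}/data.json"

    @task
    def transform_load(raw_path: str) -> None:
        # Clean, cast, validate, then load into PostgreSQL.
        ...

    for source in API_SOURCES:  # fan out: parallel extraction per source
        transform_load(extract(source))

ingestion_pipeline()
```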

Key pattern: store raw and processed separately. Transform logic has a bug? Fix code, reprocess from raw. No re-fetching from APIs. Interviewer asks, "Reprocess last month?" → You have an answer.

Production concerns that matter:

  • Exponential backoff on retries (1 min, 5 min, 15 min)
  • Idempotency: re-running the same date must not create duplicates (upsert, partition overwrite, or staging table merge; sketch after this list)
  • Data quality checks after every load — null counts, row counts, duplicates
  • Backfill support from raw storage
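The idempotency bullet in practice, for a Postgres sink (table and column names are illustrative): delete the run date's slice and reinsert inside one transaction, so a rerun replaces instead of duplicating:

```python
import psycopg2

def load_partition(conn, payloads, run_date: str):
    """Idempotent daily load: re-running the same run_date cannot duplicate rows."""
    with conn, conn.cursor() as cur:  # 'with conn' = one atomic transaction
        # Remove anything a previous (possibly partial) run wrote for this date...
        cur.execute("DELETE FROM events WHERE run_date = %s", (run_date,))
        # ...then insert the full batch fresh.
        cur.executemany(
            "INSERT INTO events (run_date, payload) VALUES (%s, %s)",
            [(run_date, p) for p in payloads],
        )
```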

Mistakes I have seen (and made):

  • Saying "I would use Kafka" before knowing volume or freshness
  • No raw storage layer = no reprocessing ability
  • Only describing the happy path, never mentioning failures
  • Over-engineering a single-team problem with Spark Streaming and data mesh

Actually built this pipeline on Kubernetes with real Binance API data. Code: github.com/var1914/mlops-boilerplate

Full visual walkthrough on YouTube

u/Extension_Key_5970 — 5 days ago
▲ 9 r/mlops

Local AI needs to be the norm. The 1000ms cloud latency tax is killing production.

The cloud is convenient until the API bill hits. Until the rate limits kick in. Until the model you depend on gets deprecated overnight with a polite email. I have been auditing infrastructure setups for the past three months, looking at the telemetry from dozens of enterprise deployments. The consensus is clear. Local AI needs to be the baseline architecture for most predictable tasks. Renting compute indefinitely for every single prompt is an architectural failure. Numbers do not lie. I ran the numbers on cloud API overhead, and the latency tax alone is enough to justify moving your core logic back to local silicon.

Let us look at the latency telemetry. Network latency is the hidden cost of cloud AI. A typical API call to a hosted model adds 200 to 1000 milliseconds of overhead before the model even starts generating. This is not a compute bottleneck. This is pure physics and routing. You have DNS resolution, TLS handshakes, API gateway routing, load balancers, and queueing before the inference engine even sees your prompt. When you are building agentic loops or chaining multiple calls, that 500ms delay compounds. Four steps in an agent workflow just cost you two full seconds of dead time. It ruins the user experience. In production tests, local execution drops that network overhead to exactly zero. Direct memory access. Time to first token is dictated purely by your hardware, not by internet traffic.

Then we have the data leakage problem. Every Copilot keystroke sends your proprietary code to someone else's server. Your trade secrets are just the next training data point for a foundational model. Companies are blissfully ignorant about this until a compliance audit forces them to look at where their data goes. Using local AI means your code stays safe. Zero leaks. Zero unwanted training. When your data never leaves your device, you bypass months of compliance review and security theater.

The common pushback I hear is that local hardware is too expensive or too weak. That is outdated data. Most people assume their laptop cannot run AI. They are wrong. You can install a local model in five minutes flat. Tools like LM Studio and Ollama have removed the technical setup entirely. No terminal wrangling. No dependency hell. You just pick a quantized GGUF model and start generating. I have seen developers running Sonnet-level logic on a Mac Studio for exactly zero dollars in token costs. Even an off-the-shelf S21 phone can run an offline AI agent today. The hardware floor has dropped significantly, while the output quality has spiked. Owning the silicon hits different when you realize you are completely disconnected from the internet and still getting high-tier reasoning.

Let us break down the cost. The financial argument for renting cloud models relies on low utilization. If you are running high volumes of predictable tasks that do not require the absolute frontier reasoning models, cloud APIs are a budget drain. A continuous background task analyzing logs, structuring JSON, or proofreading text can easily consume millions of tokens a day. At cloud rates, that adds up to thousands of dollars a month. A dedicated machine with dual RTX 4090s or a fully loaded Mac Studio costs a few thousand dollars upfront. The break-even point is often under four months. After that, your marginal cost per token is zero. You are just paying for electricity.
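That break-even figure is worth checking against your own bill; the arithmetic is trivial (all three numbers below are assumptions, plug in your own):

```python
hardware_cost = 7_000      # assumed: dual-4090 box or loaded Mac Studio, USD
monthly_api_spend = 2_000  # assumed: current cloud token bill, USD/month
monthly_power = 40         # assumed: electricity for the box, USD/month

months_to_break_even = hardware_cost / (monthly_api_spend - monthly_power)
print(f"break-even after {months_to_break_even:.1f} months")  # ~3.6 here
```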

Let us dig into the MLOps reality of managing local versus cloud. Deploying a local instance of Llama 3 70B or a quantized Qwen 1.5 requires upfront configuration. You have to map the VRAM, configure the context window, and handle continuous batching if you are serving multiple users. But modern inference servers like vLLM or TGI have made this highly deterministic. You assign the hardware, you measure the throughput, and you get a flat operational cost. When you rely on a cloud API, your throughput is at the mercy of their current load. I have tracked API response times during peak US business hours. The variance is unacceptable for enterprise SLAs. A prompt that takes 1.2 seconds at 3 AM can easily take 4.5 seconds at 10 AM. You cannot build a reliable synchronous application on top of unpredictable latency spikes.
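For reference, pinning those serving knobs in vLLM is a handful of constructor arguments (the model choice and the specific values are illustrative):

```python
from vllm import LLM, SamplingParams

# Deterministic serving config: you pick the parallelism, context window,
# and VRAM budget instead of inheriting a provider's load. Continuous
# batching is handled by the engine.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=2,       # split weights across two GPUs
    max_model_len=8192,           # context window
    gpu_memory_utilization=0.90,  # fraction of VRAM the engine may claim
)

outputs = llm.generate(
    ["Extract the error codes from this log: ..."],
    SamplingParams(temperature=0.2, max_tokens=512),
)
```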

Look at the ecosystem shifts. We are seeing major players open-sourcing models aggressively. This is a strategic move to commoditize the inference layer. When you have access to highly capable open weights, the value shifts from the model provider to the infrastructure owner. By keeping your AI local, you capitalize on this commoditization. You uncouple your product's performance from a vendor's pricing strategy.

Consider the operational workflow. When a developer needs a private environment to test sensitive financial data or unreleased proprietary software, cloud APIs require extensive data masking. Masking data reduces the context quality. The LLM gets a sanitized, broken version of the problem and returns a suboptimal solution. Local execution allows you to feed raw, unfiltered production data straight into the model context. The model has full visibility. The reasoning improves because the context is complete.

Beyond the financial math, cloud reliance introduces existential product risk. You are building on sand. If a major provider decides to change their safety filters, alter the model behavior, or simply turn off the specific endpoint you use, your application breaks. Local customization gives you absolute control. You can fine-tune models for your specific use case. You control the weights, you control the infrastructure, and you control the uptime.

We need to stop defaulting to cloud APIs for every single AI feature. Regional models and local execution should handle the baseline load. Use the massive global giant models for edge cases that require immense reasoning depth. But for the daily grind of data extraction, code generation, and standard text manipulation, local is the only logical choice. Benchmark or it didn't happen. The data shows that localized compute is faster, dramatically cheaper at scale, and more secure by construction, since the data never leaves your machine. Run your own hardware. Here is the data, do the math yourself.

reddit.com
u/TroyNoah6677 — 3 days ago
▲ 7 r/mlops

MLOps on Databricks

Hi guys, what does your model training pipeline (train → validate → promote) on Databricks look like?

The basic idea is to use the deploy-code pattern: on dev you have access to prod data, so you can experiment with different models, different parameters, hyperparameter tuning, etc. (the classic model development cycle). Once you're confident in your model's performance on dev, you manually take the best training parameters out of the experiment, put them into some human-readable code (a YAML file), deploy the code pipeline to staging, and run some tests to check nothing breaks. Then in production you run the model training pipeline again with those best parameters, where the new model possibly challenges the one currently running in production.
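One way to take the manual copy step out of that handoff would be to let the staging/prod pipeline pull the winning run's parameters straight from MLflow tracking; a sketch (the experiment ID, metric name, and train_and_register entry point are placeholders):

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Find the best dev run by the validation metric the DS optimized for...
best = client.search_runs(
    experiment_ids=["<dev-experiment-id>"],
    order_by=["metrics.val_auc DESC"],
    max_results=1,
)[0]

# ...and retrain in production with exactly those logged parameters,
# instead of hand-copying them into a YAML file.
params = best.data.params        # dict of the run's logged hyperparameters
train_and_register(params)       # placeholder for the prod training entry point
```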

Is this standard? My worry is that this way you are never sure you will reproduce in production what you got on dev while experimenting. How do you promote your models? How do you train them?

reddit.com
u/ptab0211 — 6 days ago
▲ 16 r/mlops+1 crossposts

I got tired of copy-pasting ML pipeline YAML across projects, so I built a reusable GitLab CI/CD component

Every ML project I've worked on had the same boilerplate CI: MLflow wiring, data validation, metric checks, model registration. By around the fifth project, I could no longer remember which config I'd fixed the MLFLOW_RUN_ID passing bug in.

So I built a GitLab CI/CD component that turns this into 10 lines:

```yaml
include:
  - component: gitlab.com/netOpyr/gitlab-mlops-component/full-pipeline@1.0.0
    inputs:
      model_name: wine-classifier
      training_script: scripts/train.py
      data_path: data/train.csv
      framework: sklearn
      metric_name: accuracy
      min_threshold: '0.85'
```

Which gives you a full 4-stage pipeline:

validate → train → evaluate → register
  • validate: schema, nulls, Evidently drift, Great Expectations
  • train: MLflow autologging (sklearn/PyTorch/TF/XGBoost/LightGBM), GPU support
  • evaluate: threshold check + optional comparison vs production model
  • register: GitLab Model Registry, only runs if eval passed

Works on GitLab Free. DVC integration and parallel multi-model training also supported.

Published in GitLab CI/CD Catalog: https://gitlab.com/netOpyr/gitlab-mlops-component

Happy to answer questions — especially on the evaluate stage, compare_with_production was the trickiest part to get right.

u/Na_S04 — 2 days ago
▲ 4 r/mlops

How do you actually catch when your production model is silently outputting garbage?

I've seen plenty of production ML failure stories, and the pattern is always the same: the model trains at 87% accuracy, deploys fine, no errors in the logs, the API returns 200s, predictions look reasonable, everything seems healthy.
Then two to three weeks later a business metric quietly starts to drop, and surprisingly no one notices until someone manually digs into the data and realizes the model has been degrading the whole time.
I'm curious how you all handle this in practice, and how much time gets wasted catching these issues.

reddit.com
u/SignalForge007 — 5 days ago
▲ 3 r/mlops

How are teams treating edge model deployment in their MLOps pipeline?

I’m trying to compare notes on MLOps for edge / physical AI deployments.

For cloud models, the loop is fairly mature: train, eval, deploy, monitor, roll back. For edge models running on robots, Jetsons, mobile NPUs, ARM CPUs, etc., the deployment process seems much less standardized.

The issues I keep seeing:

- model works on workstation/cloud GPU but misses latency on-device

- quantization/pruning changes behavior in ways the normal eval set does not catch

- cold start matters separately from steady-state latency

- unsupported ops or vendor SDK differences force target-specific work

- monitoring is hard when the runtime has to stay offline or privacy constrained

Recent datapoint from a deployment I worked on: multimodal classifier on Jetson Orin NX, 111ms cold start, 100% of decisions inside a 150ms budget, zero cloud calls.

How are people handling this in practice?

- Is edge compilation a separate release gate?

- Do you maintain hardware-specific evals?

- Are model + runtime + target device versioned together?

- What tools are you using for regression testing after compression?
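On that last question, the simplest gate I can sketch (all thresholds illustrative): run the fp32 reference and the compressed model over the same eval batch on the target device, and fail the release if prediction agreement or per-sample latency slips:

```python
import time
import numpy as np

def compression_regression_gate(ref_predict, quant_predict, eval_batch,
                                min_agreement=0.99, latency_budget_ms=150.0):
    """Release gate run per target device after quantization/pruning."""
    ref = np.asarray([ref_predict(x) for x in eval_batch])

    start = time.perf_counter()
    quant = np.asarray([quant_predict(x) for x in eval_batch])
    latency_ms = (time.perf_counter() - start) / len(eval_batch) * 1e3

    agreement = float((ref == quant).mean())  # label agreement with fp32 reference
    print(f"agreement={agreement:.3f}  per-sample latency={latency_ms:.1f}ms")
    return agreement >= min_agreement and latency_ms <= latency_budget_ms
```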

reddit.com
u/Hairy_Strawberry7028 — 5 days ago
▲ 11 r/mlops

Need tips/guidance

I'm planning to start/switch to MLOps. I know the basics of ML and DL and have done a small internship in data science as well. Can anyone suggest how to proceed and good resources to follow? Also a roadmap, if anyone can share one.
Thanks

reddit.com
u/JokoBitch — 5 days ago