u/Acceptable_Gap9697

DeepClaude hit 476 points on HN this weekend, and I've been running a similar setup for the past week, so I figured I'd share some actual numbers.

The setup: DeepSeek V4 Pro (1.6T params, 49B active, 1M context window) via their Anthropic-compatible API endpoint. You set ANTHROPIC_BASE_URL to https://api.deepseek.com/anthropic, swap your API key, and Claude Code works exactly as before.
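If you want to poke the endpoint outside Claude Code first, the same swap works from the official anthropic Python SDK. Minimal sketch below; the model ID string is my placeholder, check DeepSeek's docs for the real one:

```python
import os
from anthropic import Anthropic

# Point the official Anthropic SDK at DeepSeek's compatible endpoint.
client = Anthropic(
    base_url="https://api.deepseek.com/anthropic",
    api_key=os.environ["DEEPSEEK_API_KEY"],
)

resp = client.messages.create(
    model="deepseek-v4-pro",  # placeholder model ID, check DeepSeek's docs
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a CRUD handler for a users table."}],
)
print(resp.content[0].text)
```

If that round-trips, Claude Code itself should behave the same way, since it's talking to the same Messages API shape.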

Cost comparison over 7 days of real usage:

  • Claude Opus 4.6 (my previous setup): noticeably higher per-session cost
  • DeepSeek V4 Pro (same workload): roughly 15-20x cheaper on per-token pricing
  • Net effect for my daily usage pattern: substantial savings

Where quality is equivalent (my subjective assessment):

  • Scaffolding new modules and pipelines
  • Writing integration code between services
  • Test generation
  • Refactoring existing code with clear patterns
  • Documentation generation
  • Boilerplate and CRUD operations

Where Claude still wins noticeably:

  • Ambiguous architectural decisions across large codebases (10k+ lines of context)
  • Complex multi-file refactors where the agent needs to reason about side effects across modules
  • Tasks where the prompt is vague and the agent needs to infer intent from project structure

My current approach is routing: DeepSeek V4 Pro handles the first category (roughly 80% of my daily agent usage), and I switch to Claude Opus for the second category manually. I'm working on automating the routing with a simple classifier that looks at task complexity signals.
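Nothing fancy so far. The rough shape is below; every signal and threshold in it is a guess I'm still tuning, not a measured cutoff:

```python
def pick_backend(prompt: str, context_files: list[str], context_tokens: int) -> str:
    """Crude routing heuristic: cheap backend by default, escalate to
    Opus when complexity signals stack up. All thresholds are guesses."""
    score = 0

    # Big context windows correlate with the cross-module reasoning
    # tasks where Claude still wins for me.
    if context_tokens > 50_000:
        score += 2

    # Multi-file refactors: more files touched, more side effects to track.
    if len(context_files) > 5:
        score += 1

    # Vague prompts force the agent to infer intent from project structure.
    vague_markers = ("somehow", "figure out", "clean up", "make it better")
    if any(m in prompt.lower() for m in vague_markers):
        score += 2

    # Explicit architecture language is another escalation signal.
    if any(w in prompt.lower() for w in ("architecture", "redesign", "restructure")):
        score += 1

    return "claude-opus" if score >= 3 else "deepseek-v4-pro"
```

A dumb heuristic that keeps 80% of traffic on the cheap backend is already most of the win; the classifier only has to catch the escalations.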

The Anthropic-compatible API endpoint is the key enabler here. DeepSeek built it so any tool in the Claude ecosystem works with a config change. Kimi is doing something similar. The model layer is commoditising fast, and the practical implication for anyone running agent-heavy workflows is that you should be testing cheaper backends for your routine tasks.

One caveat: in my testing, DeepSeek V4 Pro's long-context performance degrades more noticeably than Claude's once you get past ~200k tokens. If your agent sessions regularly hit high token counts, you'll want to test this carefully before switching.
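A bare-bones needle-in-a-haystack probe is enough to see where your own backend falls off a cliff. Sketch below; the filler text, the token estimates, and the model ID are all rough placeholders of mine:

```python
import os
from anthropic import Anthropic

client = Anthropic(
    base_url="https://api.deepseek.com/anthropic",
    api_key=os.environ["DEEPSEEK_API_KEY"],
)

NEEDLE = "The deploy password is zx-4417."
FILLER = "The quick brown fox jumps over the lazy dog. " * 500  # roughly 5k tokens

def needle_survives(n_chunks: int) -> bool:
    """Bury NEEDLE in the middle of ~5k-token filler chunks and check retrieval."""
    half = FILLER * (n_chunks // 2)
    resp = client.messages.create(
        model="deepseek-v4-pro",  # placeholder model ID, check DeepSeek's docs
        max_tokens=64,
        messages=[{
            "role": "user",
            "content": half + NEEDLE + " " + half
                       + "\n\nWhat is the deploy password? Answer with just the password.",
        }],
    )
    return "zx-4417" in resp.content[0].text

# roughly 20k, 100k, and 200k tokens of surrounding context
for n in (4, 20, 40):
    print(f"~{n * 5}k tokens:", needle_survives(n))
```

Run the same loop against both backends and you get a crude picture of where each one starts dropping context.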

Has anyone else been running this setup? Curious about quality comparisons on different task types.
