Lesson from rebuilding my scoring engine: domain correctness > general accuracy
I'm building MixDoctor — an AI mix/mastering analyzer for iOS/macOS (Swift, SwiftUI, Claude API). Just finished a significant overhaul of the core scoring system and wanted to share what I learned.
The original engine used flat thresholds for loudness, dynamic range, and frequency balance. It worked for mainstream genres but was actively wrong outside them: Metal, Classical, and EDM all have "correct" values that look like problems by general standards.
Rebuilt it around 9 genre groups with research-backed thresholds. The engineering wasn't the hard part — the calibration and prompt engineering to get reliable, genre-appropriate feedback from the AI layer took most of the time.
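To make the flat-vs-genre difference concrete, here is a minimal Swift sketch. It is not the actual MixDoctor code: `GenreGroup`, `GenreProfile`, `Verdict`, and `judgeDynamicRange` are hypothetical names, only the genres mentioned above are listed (not all 9 groups), and every number is a placeholder for illustration, not one of the research-backed thresholds.

```swift
// Hypothetical genre groups: only the ones named in the post, plus a fallback.
enum GenreGroup {
    case mainstream, metal, classical, edm
}

// Acceptable ranges per genre. All values are PLACEHOLDERS, not calibrated data.
struct GenreProfile {
    let integratedLoudnessLUFS: ClosedRange<Double>
    let dynamicRangeDB: ClosedRange<Double>
}

enum Verdict {
    case tooLow, inRange, tooHigh
}

// An exhaustive switch means adding a genre forces you to give it a profile.
func profile(for genre: GenreGroup) -> GenreProfile {
    switch genre {
    case .mainstream:
        return GenreProfile(integratedLoudnessLUFS: (-16.0)...(-9.0), dynamicRangeDB: 6.0...12.0)
    case .metal:
        return GenreProfile(integratedLoudnessLUFS: (-12.0)...(-6.0), dynamicRangeDB: 4.0...9.0)
    case .classical:
        return GenreProfile(integratedLoudnessLUFS: (-24.0)...(-16.0), dynamicRangeDB: 12.0...25.0)
    case .edm:
        return GenreProfile(integratedLoudnessLUFS: (-11.0)...(-5.0), dynamicRangeDB: 4.0...9.0)
    }
}

// Compare one measurement against a range.
func judge(_ value: Double, against range: ClosedRange<Double>) -> Verdict {
    if value < range.lowerBound { return .tooLow }
    if value > range.upperBound { return .tooHigh }
    return .inRange
}

// Same measurement, different verdict depending on the genre's own range.
func judgeDynamicRange(_ value: Double, genre: GenreGroup) -> Verdict {
    return judge(value, against: profile(for: genre).dynamicRangeDB)
}

let wideDynamics = 18.0
let asMainstream = judgeDynamicRange(wideDynamics, genre: .mainstream)
let asClassical = judgeDynamicRange(wideDynamics, genre: .classical)
print(asMainstream, asClassical)
```

With these placeholder ranges, the same 18 dB reading gets flagged under the mainstream profile and passes under the classical one, which is exactly the kind of false alarm a single flat threshold produces.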
Key takeaway: if you're building any domain-specific AI tool, your scoring/evaluation layer has to speak the domain's language. In my case, a generalist model paired with domain-specific prompting and thresholds clearly beat the same model running on generic prompts and flat thresholds.
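One way to get the thresholds into the AI layer is to put the genre's targets directly in the prompt, so the model judges against them instead of its own general sense of "normal". A rough Swift sketch under stated assumptions: `GenreTargets`, `Measurements`, and `buildAnalysisPrompt` are hypothetical names, the prompt wording is illustrative, and the numbers in the usage lines are placeholders. It only builds the prompt text and makes no API call.

```swift
import Foundation

// Hypothetical container for one genre's targets (values supplied by the caller).
struct GenreTargets {
    let name: String
    let loudnessLUFS: ClosedRange<Double>
    let dynamicRangeDB: ClosedRange<Double>
}

// Hypothetical container for what the analyzer measured on a track.
struct Measurements {
    let loudnessLUFS: Double
    let dynamicRangeDB: Double
}

// Render a range as plain text for the prompt.
func describe(_ range: ClosedRange<Double>, unit: String) -> String {
    return "\(range.lowerBound) to \(range.upperBound) \(unit)"
}

// Build prompt text that states the genre targets next to the measured values.
func buildAnalysisPrompt(targets: GenreTargets, measured: Measurements) -> String {
    let loudnessTarget = describe(targets.loudnessLUFS, unit: "LUFS")
    let dynamicsTarget = describe(targets.dynamicRangeDB, unit: "dB")
    return """
    You are reviewing a \(targets.name) mix. Judge it against the genre targets below, not general-purpose norms.
    Target integrated loudness: \(loudnessTarget)
    Target dynamic range: \(dynamicsTarget)
    Measured integrated loudness: \(measured.loudnessLUFS) LUFS
    Measured dynamic range: \(measured.dynamicRangeDB) dB
    Only flag a value as a problem if it falls outside its target range.
    """
}

// Usage with placeholder numbers (not calibrated thresholds).
let classical = GenreTargets(
    name: "Classical",
    loudnessLUFS: (-24.0)...(-16.0),
    dynamicRangeDB: 12.0...25.0
)
let reading = Measurements(loudnessLUFS: -20.0, dynamicRangeDB: 18.0)
let prompt = buildAnalysisPrompt(targets: classical, measured: reading)
print(prompt)
```

Keeping the numeric comparison in code and giving the model the same targets means the two layers cannot disagree about what "in range" means for a genre.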
What are others building where this kind of domain-specific calibration has come up?