u/ajdevrel

Who are the developers here who care about AI quality?

Something I keep running into: shipping LLM features is easy, but knowing whether they're actually good is not.

Curious how people are handling this. Do you...

  • maintain a golden dataset and re-run it on every prompt change? (rough sketch of what I mean below)
  • use LLM-as-judge? If so, how do you trust the judge?
  • ship and watch user feedback?
  • something else?
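
For concreteness, this is roughly the kind of bare-bones golden-set runner I have in mind. Just a stdlib sketch, not a real harness: `call_model`, `golden.jsonl`, and the substring check are all placeholders for whatever client, data, and grading (exact match, fuzzy match, LLM judge) you actually use.

```python
import json

def call_model(prompt: str) -> str:
    # Placeholder: swap in your actual LLM client call here.
    raise NotImplementedError

def run_golden_set(path: str) -> float:
    """Re-run every golden example and report the pass rate."""
    with open(path) as f:
        # Each line: {"input": "...", "expected": "..."}
        cases = [json.loads(line) for line in f]

    passed = 0
    for case in cases:
        output = call_model(case["input"])
        # Crude substring check; this is where a fuzzy match or
        # an LLM-as-judge call would plug in instead.
        if case["expected"].strip().lower() in output.strip().lower():
            passed += 1

    return passed / len(cases) if cases else 0.0

if __name__ == "__main__":
    print(f"pass rate: {run_golden_set('golden.jsonl'):.1%}")
```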

I've been going back and forth on opening a focused group chat for developers who care about this stuff. Just a place for comparing notes and experiences. What do you all think?

Regardless, super interested in how folks here are approaching AI quality.
