u/Rent_South

I benchmarked the new release Gemini 3.5 Flash on ~10 saved evals. Using the exact same prompts.

I added and tested Gemini 3.5 Flash, running it through around 10 saved evals I use for model selection decisions in production.

So far, the result is not what I expected.

On most of my tasks, Gemini 3.5 Flash underperformed older Gemini variants. The results below are from a vision emotion-detection eval with 5 runs per model:

In this eval it ended up way down in 13th place, even though 3.1-pro and 3.1 flash lite are top 1 & 2. It's even lower than gemini 3 flash, actually. It's more than 10x more expensive than flash lite ($0.0168 vs $0.00114 in the table) for a worse result. It's an average of 5 runs, so it's not a one-time fluke. On top of that, this is one of 10 benchmarks with similar outcomes, although admittedly it is one of the worst cases.

====================================================================================================
LLM Benchmark Results - Emotion Detection - Increasing Complexity
====================================================================================================

Model                   Provider    Avg Score           Stability   Rec. Temp Pricing     Cost*       Time      Acc/$     Acc/min   Completion
----------------------------------------------------------------------------------------------------------------------------------------------
gemini-3.1-pro          gemini      80% (3.2/4.0)       ±1.000      0.3       High        $0.0292     23.48s    109.58    8.18      100.0%    
gemini-3.1-flash-lite   gemini      75% (3.0/4.0)       ±0.000      0.3       Medium      $0.00114    6.24s     2.63K     28.85     100.0%    
gpt-5.4                 openai      75% (3.0/4.0)       ±0.000      N/A       High        $0.0128     8.45s     234.24    21.31     100.0%    
claude-opus-4.6         anthropic   75% (3.0/4.0)       ±0.000      0.3       High        $0.0246     12.44s    121.73    14.46     100.0%    
gemini-3-flash          gemini      65% (2.6/4.0)       ±1.000      0.3       Medium      $0.00735    16.36s    353.81    9.54      100.0%    
sonar                   perplexity  65% (2.6/4.0)       ±1.000      0.3       Medium      $0.0256     10.61s    101.60    14.71     100.0%    
grok-4-fast-non-reason  xai         55% (2.2/4.0)       ±1.000      0.3       Low         $0.000375   7.31s     5.87K     18.06     100.0%    
gpt-5-nano              openai      55% (2.2/4.0)       ±1.000      N/A       Very Low    $0.000592   12.35s    3.72K     10.69     100.0%    
mistral-medium-latest   mistral     55% (2.2/4.0)       ±1.000      0.3       Medium      $0.00219    8.29s     1.01K     15.93     100.0%    
llama4-maverick         meta        50% (2.0/4.0)       ±0.000      0.3       Low         $0.00202    7.35s     988.82    16.33     100.0%    
gpt-5.4-mini            openai      50% (2.0/4.0)       ±0.000      N/A       Medium      $0.00384    12.95s    520.53    9.26      100.0%    
claude-sonnet-4.6       anthropic   50% (2.0/4.0)       ±0.000      0.3       High        $0.0148     8.96s     135.25    13.39     100.0%    
gemini-3.5-flash        gemini      50% (2.0/4.0)       ±0.000      0.3       High        $0.0168     11.32s    118.99    10.60     100.0%    
gpt-5.4-nano            openai      38% (1.5/4.0)       ±1.000      N/A       Low         $0.00103    11.31s    1.46K     7.96      100.0%    
claude-haiku-4.5        anthropic   25% (1.0/4.0)       ±0.000      0.3       Medium      $0.00493    5.74s     202.88    10.46     100.0%    

Total models tested: 15
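
The Acc/$ and Acc/min columns look like they are derived from the other columns: dividing the raw score (out of 4.0) by cost, and by elapsed minutes, reproduces the table values to within rounding. Here is a minimal sketch under that assumption. The function names are made up for illustration and are not the benchmarking tool's API.

```python
# Rows copied from the table above: (model, raw score out of 4.0, cost in USD, time in seconds).
ROWS = [
    ("gemini-3.1-pro", 3.2, 0.0292, 23.48),
    ("gemini-3.1-flash-lite", 3.0, 0.00114, 6.24),
    ("gemini-3-flash", 2.6, 0.00735, 16.36),
    ("gemini-3.5-flash", 2.0, 0.0168, 11.32),
]


def acc_per_dollar(score: float, cost_usd: float) -> float:
    """Raw score divided by cost (assumed definition of Acc/$)."""
    return score / cost_usd


def acc_per_minute(score: float, seconds: float) -> float:
    """Raw score divided by elapsed minutes (assumed definition of Acc/min)."""
    return score / (seconds / 60.0)


def derived_metrics(rows):
    """Map each model name to its (Acc/$, Acc/min) pair."""
    return {
        model: (acc_per_dollar(score, cost), acc_per_minute(score, secs))
        for model, score, cost, secs in rows
    }


if __name__ == "__main__":
    for model, (per_dollar, per_min) in derived_metrics(ROWS).items():
        print(f"{model:<24}{per_dollar:>10.2f}{per_min:>8.2f}")
```

Small differences from the table (a few hundredths on Acc/$) are expected, since the cost column shown here is already rounded.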

I ran this via an online benchmarking tool. I'm not claiming this means Gemini 3.5 Flash is bad universally. These are my saved evals, and Gemini, like any model, can be prompt-sensitive. But for my workflows, these benchmarks unfortunately indicate that I can't use it as is.

I really hope this will change, because I had high expectations for this model given their previous release. To me it just goes to show that Artificial Analysis and other generic benchmarks can be really misleading when it comes to model decisions. From the results they were showing, I was expecting much better...

u/Rent_South — 11 hours ago

Dropped my AI bill by 13x last month

You'd think that newer or more expensive models are better at everything. After running 1000s of evals in the last year, I can tell you that's just not true. Older or cheaper models often perform better on a given task, AND are quicker.

Quick example.

Had a classification flow in one of my pipelines running on GPT-5.4. Hundreds of calls a day. Default choice, never questioned it.

Tested it across 21 models on openmark.ai. Real samples from my production data, 10 nuanced classification tests. Real API cost from actual token counts.

https://preview.redd.it/hn1ose7kjx0h1.png?width=2288&format=png&auto=webp&s=33ee1a1c9b50c53d643220c672cb6f6dfc916130

- gemini-3.1-flash-lite: 85% accuracy, $1.55 per 10K calls
- gpt-5.4: 85% accuracy, $20.30 per 10K calls
- llama4-maverick: 80%, $1.84 per 10K calls
- claude-opus-4.6: 80%, $42.80 per 10K calls

Flash Lite matched GPT-5.4 at 13x less cost. Opus, the most expensive model in the test, scored lower than both.

Switched. Bill dropped 92%.
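
For anyone checking the math, the 13x and 92% figures follow directly from the per-10K-call costs listed above. A quick sketch, assuming call volume stays the same after the switch and that this flow accounts for the whole bill (the helper names are just for illustration):

```python
# Cost per 10K calls in USD, copied from the list above.
COST_PER_10K = {
    "gemini-3.1-flash-lite": 1.55,
    "gpt-5.4": 20.30,
    "llama4-maverick": 1.84,
    "claude-opus-4.6": 42.80,
}


def cost_ratio(old_model: str, new_model: str) -> float:
    """How many times more expensive old_model is than new_model."""
    return COST_PER_10K[old_model] / COST_PER_10K[new_model]


def savings_pct(old_model: str, new_model: str) -> float:
    """Percentage saved by moving the same call volume to new_model."""
    return (1.0 - COST_PER_10K[new_model] / COST_PER_10K[old_model]) * 100.0


if __name__ == "__main__":
    ratio = cost_ratio("gpt-5.4", "gemini-3.1-flash-lite")
    saved = savings_pct("gpt-5.4", "gemini-3.1-flash-lite")
    print(f"{ratio:.1f}x cheaper, {saved:.0f}% saved")
```

With these numbers the ratio comes out to about 13.1x, which is a saving of roughly 92%.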

On a different task it would be a different ranking. That's the point. You can't know without testing on your own data. There's a near-infinity of real-world AI agent use cases and the best model is rarely the obvious one.

Also worth knowing, real API cost varies wildly from the announced price per million tokens. Some models output thousands of CoT tokens when you just need a single word. A model that looks cheap on paper can cost 10x more in practice. Only way to know is to measure.

If you want to automate it, there's an open-source OpenClaw router that takes the benchmark results and auto-selects the best model per task in your pipeline with fallbacks: https://clawhub.ai/plugins/openmark-router

reddit.com
u/Rent_South — 7 days ago

I cut my OpenClaw costs by 90%

Was running a classification flow through GPT-5.4 by default. Hundreds of calls a day in one of my agentic pipelines. Wasn't cheap, but it worked, so I never questioned it.

Decided to actually test it.

Ran the same task through 21 models on openmark.ai. 10 nuanced classification tests, real samples from my production data. Real API cost calculated from actual input/output token counts, not estimated from the advertised price per million tokens.

https://preview.redd.it/urggdq4ahx0h1.png?width=2288&format=png&auto=webp&s=810eb73345e920626e4430b5f573064075210ac0

Top of the ranking:
- gemini-3.1-flash-lite: 85% accuracy, $1.55 per 10K calls, 16s
- gpt-5.4: 85% accuracy, $20.30 per 10K calls, 13s
- llama4-maverick: 80%, $1.84 per 10K calls, 17s
- claude-opus-4.6: 80%, $42.80 per 10K calls, 26s

Flash Lite tied GPT-5.4 on accuracy. 13x cheaper. Opus, the most expensive model in the test, scored lower than both.

Switched the flow to Flash Lite. Bill dropped 90% overnight.

Couple things worth saying.

This doesn't mean Flash Lite is "the best model". Best model depends entirely on the task. After running 1000s of evals in the last 12 months, the ranking flips completely depending on what I'm testing. Generic leaderboards tell you nothing about your specific workflow.

And "real API cost" is rarely what providers advertise per million tokens. Models tokenize the same text differently. Some output thousands of CoT tokens when you need a one-word answer. A model that looks cheap on paper can cost 10x more in practice. Only way to know is to measure on your actual tasks.
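
To make the "cheap on paper" point concrete, here is a rough sketch of per-call cost computed from token counts. All prices and token counts below are made-up placeholders, not any provider's real pricing, and the sketch assumes reasoning (CoT) tokens are billed as output tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Advertised price in USD per million tokens (hypothetical values below)."""
    input_per_million: float
    output_per_million: float


def call_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call from its actual token counts."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


# Hypothetical models: one is 5x cheaper per token but emits a long reasoning trace.
cheap_on_paper = Pricing(input_per_million=0.10, output_per_million=0.40)
pricey_on_paper = Pricing(input_per_million=0.50, output_per_million=2.00)

# Same 500-token prompt; one model answers in 5 tokens, the other in 3000.
terse_cost = call_cost(pricey_on_paper, input_tokens=500, output_tokens=5)
verbose_cost = call_cost(cheap_on_paper, input_tokens=500, output_tokens=3000)

if __name__ == "__main__":
    print(f"pricey but terse:  ${terse_cost:.6f} per call")
    print(f"cheap but verbose: ${verbose_cost:.6f} per call")
```

In this made-up case the model that is 5x cheaper per token ends up costing nearly 5x more per call, purely because of output length.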

There's also an open-source OpenClaw router plugin you can feed benchmark results into, so each task in your pipeline automatically gets the model that actually passed your quality bar, with fallbacks: https://clawhub.ai/plugins/openmark-router
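
The routing idea itself is simple enough to sketch. This is not the plugin's actual API, just a hypothetical illustration of "cheapest model that clears the quality bar, with fallbacks", using the four results listed above.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    accuracy: float      # fraction between 0 and 1
    cost_per_10k: float  # USD per 10K calls


# Benchmark results copied from the ranking above.
RESULTS = [
    Result("gemini-3.1-flash-lite", 0.85, 1.55),
    Result("gpt-5.4", 0.85, 20.30),
    Result("llama4-maverick", 0.80, 1.84),
    Result("claude-opus-4.6", 0.80, 42.80),
]


def route(results: list[Result], min_accuracy: float) -> list[str]:
    """Return the models that meet the bar, cheapest first.

    The first entry is the primary choice; the rest are fallbacks in order.
    An empty list means no model passed the quality bar.
    """
    passing = [r for r in results if r.accuracy >= min_accuracy]
    ranked = sorted(passing, key=lambda r: (r.cost_per_10k, -r.accuracy))
    return [r.model for r in ranked]


if __name__ == "__main__":
    print(route(RESULTS, min_accuracy=0.85))
    print(route(RESULTS, min_accuracy=0.80))
```

With an 85% bar this picks Flash Lite first and falls back to GPT-5.4. Lowering the bar to 80% puts llama4-maverick in as the first fallback, since it is cheaper than GPT-5.4.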

reddit.com
u/Rent_South — 7 days ago