u/Aditya_10204

I’m working on an assessment where I need to create a coding task (basically SWE-bench style). The idea is:

- take an existing repo (I’m using pydantic)
- write tests that fail on the current code
- provide a patch that fixes those failures
- make the task non-trivial for an LLM: it should be solvable, but a small model like Haiku should only solve it around 4 times out of 10

The difficulty requirement is the tricky part. It shouldn’t be impossible, but also not something a model solves instantly every time.
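One wrinkle with a "4/10" target: with only a handful of trials, the observed solve rate is very noisy. A Wilson score interval (plain binomial statistics, nothing specific to SWE-bench or this assessment) shows how wide the uncertainty actually is:

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - margin, center + margin

lo, hi = wilson_interval(4, 10)
print(f"4/10 solves -> true rate plausibly in [{lo:.2f}, {hi:.2f}]")  # roughly [0.17, 0.69]
```

In other words, 4/10 is consistent with a true solve rate anywhere from roughly 20% to 70%, so it's worth running more than ten trials before concluding a task hits the band.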

What I’ve been doing so far:

- using Claude Opus to explore the repo and identify possible bugs or edge cases
- writing tests around those cases
- then, in a separate run, giving the instructions to a smaller model (like Haiku)
- letting it generate a patch
- running that patch against the tests I wrote

I’ve been repeating this loop for quite a while.
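For what it's worth, the apply-patch-then-run-tests half of that loop can be a small harness. This is a minimal sketch under assumptions (a git checkout of the repo, unified-diff patches, pytest as the test command — none of which the post specifies):

```python
import subprocess

def run_trial(repo_dir: str, patch: str, test_cmd: list[str]) -> bool:
    """Apply a candidate patch to a clean checkout and report whether the tests pass."""
    applied = subprocess.run(
        ["git", "-C", repo_dir, "apply", "-"],
        input=patch.encode(), capture_output=True,
    )
    if applied.returncode != 0:
        return False  # patch didn't even apply cleanly
    result = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    # reset the working tree so the next trial starts clean
    subprocess.run(["git", "-C", repo_dir, "checkout", "--", "."], capture_output=True)
    return result.returncode == 0

def solve_rate(outcomes: list[bool]) -> float:
    """Fraction of trials where the model's patch made the tests pass."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

# e.g. outcomes of 10 independent Haiku runs against the same task:
outcomes = [True, False, False, True, False, True, False, False, True, False]
print(f"solve rate: {solve_rate(outcomes):.0%}")
```

Each trial would call `run_trial` with the patch Haiku produced, then `solve_rate` over all trials gives the number to compare against the 4/10 target.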

The problem is that most of the time the model just figures it out. Even with edge cases, chained conditions, or slightly more complex scenarios, it still manages to fix things pretty reliably.

So I’m clearly missing something.

I feel like I’m designing bugs that are too local or too easy to pattern match, but I don’t really know how to move beyond that. At the same time, I can’t just make things random or overly complex because the task still needs to be fair and testable.
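One pattern that tends to resist purely local patching is an invariant maintained at two coordinated sites, so that a fix touching only one site still fails the tests. A toy illustration (hypothetical class and names, nothing from pydantic — this is the correct version; a task would break one side of the invariant):

```python
class Registry:
    """Toy store where a name index must stay in sync with the items list.
    A one-line patch to `add` or `remove` alone can't satisfy the
    round-trip test below; the fix has to respect both sites."""

    def __init__(self):
        self.items = []   # list of (name, value) pairs
        self.index = {}   # name -> position in self.items

    def add(self, name, value):
        self.index[name] = len(self.items)
        self.items.append((name, value))

    def remove(self, name):
        pos = self.index.pop(name)
        self.items.pop(pos)
        # the non-local part: every later position shifts down by one
        for other, p in self.index.items():
            if p > pos:
                self.index[other] = p - 1

    def get(self, name):
        return self.items[self.index[name]][1]

# round-trip test: forces add/remove to agree on the shifting rule
r = Registry()
r.add("a", 1); r.add("b", 2); r.add("c", 3)
r.remove("a")
print(r.get("c"))  # 3
```

The general idea: make the failing test exercise the *interaction* between two code paths, so a model that pattern-matches on one function in isolation passes some assertions but not all of them.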

Also, I don’t have the option to modify the codebase directly — I can only define behavior through tests and provide a patch — so that constraint makes it harder to think creatively about it.

At this point I kind of know I’m not approaching it with the right mental model, just not sure what the correct approach is.

If anyone here has worked on:

- SWE-bench style tasks
- LLM evals / coding agent benchmarks
- or even just tricky real-world debugging cases

I’d really appreciate any pointers on:

- how you think about difficulty in these tasks
- what patterns actually make models struggle
- how you come up with good task ideas

Right now it just feels like I’m going in circles.

reddit.com
u/Aditya_10204 — 11 days ago

I have a hiring manager interview (30 min) coming up on Tuesday.

Role: Technical Support Engineer (Remote)

Company: US based AI infra company.

For some context, I have a software engineer (full stack) background with 1.5 YOE, and this is my first time interviewing for this kind of role, so idk what to expect.

I had a screening interview before this where they told me about the responsibilities and about this role. I will be mostly helping customers with technical difficulties and troubleshooting. Scope will expand as I get experience.

Now, considering all this: how should I prepare for the interview, how should I anchor the compensation discussion, and should I prepare any technical topics as well?

Edit: the interview will be conducted by the head of operations, not the HR manager.

Thank you!!

reddit.com
u/Aditya_10204 — 12 days ago