r/MachineLearning

First time NeurIPS. How different is it from low-ranked conferences? [D]

I'm a PhD student and have already published 10+ papers at A/B-ranked venues. My field never really let me work on something exciting enough for a core A* conference, but after years I finally think I have work worthy of discussion at a top venue.

I've been reading papers from previous editions (both from my field and top papers generally), and I notice a big difference in how people write and how they put their message on the table; the work is also sometimes quite theoretical.

Are there any golden rules followed by people who frequently get into these conferences? Should I be more measured when making novelty claims?

Also, for those who moved from submitting to niche conferences to NeurIPS/ICML/CVPR: did you change your approach?

My field is imaging in healthcare.

reddit.com
u/ade17_in — 2 hours ago

[D] TMLR reviews seem more reliable than ICML/NeurIPS/ICLR

This year I submitted a paper to ICML for the first time. I have also been through the review process at TMLR and ICLR. Given that these venues all take close to (or less than) four months until the final decision, the quality of the reviews at TMLR was much more on point than what I'm seeing at ICML right now. Many ICML reviews (for my own paper and for the papers I received to review) feel rushed, low-confidence, or sometimes overly hostile without providing constructive feedback. All this makes me appreciate the quality of TMLR reviews: the reviewers there are more aware of the topic, ask reasonable questions, and raise concerns where apt. It's making me wonder whether the big conferences (ICML/NeurIPS/ICLR) are even worth it.

u/MT1699 — 14 hours ago

[D] ICML 2026 Average Score

Hi all,

I’m curious about the current review dynamics for ICML 2026, especially after the rebuttal phase.

For those who are reviewers (or have insight into the process), could you share what the average scores look like in your batch after rebuttal?

Also, do trackers like https://papercopilot.com/statistics/icml-statistics/icml-2026-statistics/ reflect the true score distributions to some degree?

Appreciate any insights.

u/Hope999991 — 6 hours ago

[D] Physicist-turned-ML-engineer looking to get into ML research. What's worth working on and where can I contribute most?

After years of focus on building products, I'm carving out time to do independent research again and trying to find the right direction. I have stayed reasonably up-to-date regarding major developments of the past years (reading books, papers, etc) ... but I definitely don't have a full understanding of today's research landscape. Could really use the help of you experts :-)

A bit more about myself: PhD in string theory/theoretical physics (Oxford), then quant finance, then built and sold an ML startup to a large company where I now manage the engineering team.
Skills/knowledge I bring that don't come as standard with physics:

  • Differential Geometry & Topology
  • (numerical solution of) Partial Differential Equations
  • (numerical solution of) Stochastic Differential Equations
  • Quantum Field Theory / Statistical Field Theory
  • tons of Engineering/Programming experience (in prod envs)

Especially curious to hear from anyone who made a similar transition already!

u/BalcksChaos — 21 hours ago
[P] I trained a Mamba-3 log anomaly detector that hit 0.9975 F1 on HDFS — and I’m curious how far this can go


Experiment #324 ended well. ;)

This time I built a small project around log anomaly detection. In about two days, I went from roughly 60% effectiveness in the first runs to a final F1 score of 0.9975 on the HDFS benchmark.

Under my current preprocessing and evaluation setup, LogAI reaches F1=0.9975, which is slightly above the 0.996 HDFS result reported for LogRobust in a recent comparative study.

What that means in practice:

  • on 3,368 anomalous sessions in the test set, it missed about 9 (recall = 0.9973)
  • on roughly 112k normal sessions, it raised only about 3 false alarms (precision = 0.9976)

What I find especially interesting is that this is probably the first log anomaly detection model built on top of Mamba-3 / SSM, which was only published a few weeks ago.

The model is small:

  • 4.9M parameters
  • trains in about 36 minutes on an RTX 4090
  • needs about 1 GB of GPU memory
  • inference is below 2 ms on a single consumer GPU, so over 500 log events/sec

For comparison, my previous approach took around 20 hours to train.

The dataset here is the classic HDFS benchmark from LogHub / Zenodo, based on Amazon EC2 logs:

  • 11M+ raw log lines
  • 575,061 sessions
  • 16,838 anomalous sessions (2.9%)

This benchmark has been used in a lot of papers since 2017, so it’s a useful place to test ideas.

The part that surprised me most was not just the score, but what actually made the difference.

I started with a fairly standard NLP-style approach:

  • BPE tokenizer
  • relatively large model, around 40M parameters

That got me something like 0.61–0.74 F1, depending on the run. It looked reasonable at first, but I kept hitting a wall. Hyperparameter tuning helped a bit, but not enough.

The breakthrough came when I stopped treating logs like natural language.

Instead of splitting lines into subword tokens, I switched to template-based tokenization: one log template = one token representing an event type.

So instead of feeding the model something like text, I feed it sequences like this:

[5, 3, 7, 5, 5, 3, 12, 12, 5, ...]

Where for example:

  • "Receiving block blk_123 from 10.0.0.1" - Template #5
  • "PacketResponder 1 terminating" - Template #3
  • "Unexpected error deleting block blk_456" - Template #12
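
A minimal sketch of that tokenization idea (the regex patterns and `TemplateVocab` here are illustrative, not the author's code; production pipelines usually extract templates with a log parser such as Drain):

```python
import re

# Mask variable fields (block IDs, IPs, numbers) so each distinct
# message shape collapses to one template, then map templates to IDs.
PATTERNS = [
    (re.compile(r"blk_-?\d+"), "<BLK>"),
    (re.compile(r"\d+\.\d+\.\d+\.\d+"), "<IP>"),
    (re.compile(r"\d+"), "<NUM>"),
]

def to_template(line: str) -> str:
    for pat, repl in PATTERNS:
        line = pat.sub(repl, line)
    return line

class TemplateVocab:
    def __init__(self):
        self.ids = {}

    def encode(self, line: str) -> int:
        tpl = to_template(line)
        if tpl not in self.ids:
            self.ids[tpl] = len(self.ids)  # assign next free template ID
        return self.ids[tpl]

vocab = TemplateVocab()
session = [
    "Receiving block blk_123 from 10.0.0.1",
    "PacketResponder 1 terminating",
    "Receiving block blk_456 from 10.0.0.2",
]
token_ids = [vocab.encode(line) for line in session]
# The two "Receiving block ..." lines collapse to the same template ID,
# which is exactly why the vocabulary shrinks from thousands to dozens.
```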

That one change did a lot at once:

  • vocabulary dropped from about 8000 to around 50
  • model size shrank by roughly 10x
  • training went from hours to minutes
  • and, most importantly, the overfitting problem mostly disappeared

The second important change was matching the classifier head to the architecture. Mamba is causal, so the last token carries a compressed summary of the sequence context. Once I respected that in the pooling/classification setup, the model started behaving the way I had hoped.
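
That pooling choice can be sketched like this (a NumPy sketch assuming a generic causal encoder's hidden states; `anomaly_score`, `W`, and `b` are hypothetical stand-ins, not the author's actual Mamba-3 head):

```python
import numpy as np

# For a causal model, the hidden state at the final position already
# summarizes the whole sequence, so the classifier reads h[-1] instead
# of averaging over all positions (which would dilute it with early,
# low-context states).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(16, 64))   # (seq_len, d_model) from a causal encoder
W = rng.normal(size=(64, 1)) * 0.1   # hypothetical linear head weights
b = 0.0

def anomaly_score(h: np.ndarray) -> float:
    """Last-token pooling followed by a sigmoid, giving a score in (0, 1)."""
    logit = float(h[-1] @ W + b)
    return float(1.0 / (1.0 + np.exp(-logit)))

score = anomaly_score(hidden)
```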

The training pipeline was simple:

  • Pretrain (next-token prediction): the model only sees normal logs and learns what “normal” looks like
  • Finetune (classification): the model sees labeled normal/anomalous sessions
  • Test: the model gets unseen sessions and predicts normal vs anomaly

Data split was 70% train / 10% val / 20% test, so the reported F1 is on sessions the model did not see during training.
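
The three stages above can be skeletoned as follows (`DummyModel` and its methods are placeholders, not the author's code; only the control flow mirrors the post):

```python
import random

# 1) pretrain on normal logs only, 2) finetune on labeled sessions,
# 3) score unseen sessions with a continuous anomaly score.
class DummyModel:
    def __init__(self):
        self.lm_steps = 0
        self.clf_steps = 0

    def step_lm(self, inputs, targets):
        self.lm_steps += 1           # one next-token-prediction update

    def step_classify(self, session, label):
        self.clf_steps += 1          # one supervised classification update

    def predict(self, session):
        return random.random()       # continuous anomaly score in [0, 1)

sessions = [[5, 3, 7, 5], [5, 5, 3, 12], [12, 12, 5, 3]]
labels = [0, 0, 1]                   # 0 = normal, 1 = anomalous

model = DummyModel()
for s, y in zip(sessions, labels):   # pretrain: normal sessions only
    if y == 0:
        model.step_lm(s[:-1], s[1:])
for s, y in zip(sessions, labels):   # finetune: all labeled sessions
    model.step_classify(s, y)
scores = [model.predict(s) for s in sessions]   # test-time scoring
```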

Another useful thing is that the output is not just binary. The model gives a continuous anomaly score from 0 to 1.

So in production this could be used with multiple thresholds, for example:

  • > 0.7 = warning
  • > 0.95 = critical

Or with an adaptive threshold that tracks the baseline noise level of a specific system.
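
A hypothetical triage policy over that continuous score, using the two example thresholds from the post:

```python
# Map a continuous anomaly score to an alert level; the threshold
# values are the post's examples, tunable per system.
def triage(score: float, warn: float = 0.7, critical: float = 0.95) -> str:
    if score > critical:
        return "critical"
    if score > warn:
        return "warning"
    return "ok"

print(triage(0.5), triage(0.8), triage(0.99))  # prints: ok warning critical
```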

A broader lesson for me: skills and workflows I developed while playing with AI models for chess transfer surprisingly well to other domains. That’s not exactly new - a lot of AI labs started with games, and many still do - but it’s satisfying to see it work in practice.

Also, I definitely did not get here alone. This is a combination of:

  • reading a lot of papers
  • running automated experiment loops
  • challenging AI assistants instead of trusting them blindly
  • and then doing my own interpretation and tuning

Very rough split:

  • 50% reading papers and extracting ideas
  • 30% automated hyperparameter / experiment loops
  • 20% manual tuning and changes based on what I learned

Now I’ll probably build a dashboard and try this on my own Astrography / Astropolis production logs. Or I may push it further first on BGL, Thunderbird, or Spirit.

Honestly, I still find it pretty wild how much can now be done on a gaming PC if you combine decent hardware, public research, and newer architectures quickly enough.

Curious what people here think:

  • does this direction look genuinely promising to you?
  • has anyone else tried SSMs / Mamba for log modeling?
  • and which benchmark would you hit next: BGL, Thunderbird, or Spirit?

If there’s interest, I can also share more about the preprocessing, training loop, and the mistakes that got me stuck at 60-70% before it finally clicked.

P.S. I also tested its effectiveness and reproducibility across different seeds. On most of them, it actually performed slightly better than before.

https://preview.redd.it/3hrr4prgbzsg1.png?width=1794&format=png&auto=webp&s=d50ff21226e9aa97c2c0bbefed77be5dd8389cb8

u/Adam_Jesion — 8 hours ago

[D] ICML, no rebuttal ack so far...

Almost all the papers I reviewed have received at least one ack, but I haven’t gotten a single rebuttal acknowledgment yet. Is there anyone else who hasn’t received theirs?

u/tuejan11 — 15 hours ago

[D] When to transition from simple heuristics to ML models (e.g., DensityFunction)?

Two questions:

  1. What are the recommendations around when to transition from a simple heuristic baseline to machine learning (ML) models?
    • For example, say I have a search that establishes how many authentications are “just right”, so I can flag activity that spikes above or below normal. When would I consider transitioning from that baseline search to one that applies an ML model like DensityFunction?
  2. Any recommendations around books that address/tackle this subject?
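
For context, DensityFunction presumably refers to the Splunk MLTK algorithm of that name; the underlying idea (fit a distribution to historical counts, flag points in the tails) can be sketched in plain Python. `fit_baseline` and the 3-sigma rule below are illustrative, not Splunk's implementation:

```python
import statistics

def fit_baseline(counts):
    """Fit a simple Gaussian baseline to historical hourly auth counts."""
    mu = statistics.fmean(counts)
    sigma = statistics.stdev(counts)
    return mu, sigma

def is_anomalous(x, mu, sigma, k=3.0):
    """Flag counts more than k standard deviations from the baseline."""
    return abs(x - mu) > k * sigma

history = [100, 97, 105, 99, 102, 98, 101, 103, 96, 104]
mu, sigma = fit_baseline(history)
print(is_anomalous(100, mu, sigma))  # typical hour -> False
print(is_anomalous(500, mu, sigma))  # spike        -> True
```

The usual advice is to move from the fixed heuristic to a fitted model like this once the "normal" level drifts over time or varies by segment, since then no single hand-picked threshold stays correct.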

Thx

u/DerRoteBaron1 — 4 hours ago

[D] Reviewer said he will increase his score but he hasn’t (yet)

Maybe someone here can help me figure this out.

I have a reviewer who acknowledged my rebuttal and said they will increase their score, but they haven't: their score is still the initial 4. Now I'm anxious about how the AC will read this. Another reviewer who said they would raise their score did so on the spot, and I can see the updated number. Because this reviewer said they would raise theirs but didn't, I'm afraid it will look as if the 4 is already the updated score, i.e. that their initial score was a 3 (meaning their initial thought was reject).

I can still reply to the rebuttal thread (they chose option A, fully resolved). Should I hint in my answer that they have yet to make the update? As a reviewer, would you be annoyed by that?

Or should I wait until the 7th (the reply deadline) and, if there is no update, send a private comment to the AC explaining this? Or do nothing and accept the risk that this might penalize my paper's outcome?

Do ACs get annoyed with authors who obsess over scores? Would the AC penalize me by rejecting my paper because I pointed out that the reviewer didn't increase their score as promised?

What can I do? At this point it would have been better if the reviewer had never promised an increase: a 4 that stayed a 4 is fine, but a 4 that looks like an updated score may be read as if the initial score were a 3 or a 2, which is bad.

Also, if accepted, the score affects whether the paper gets a spotlight, so having it updated genuinely matters to me. I don't know how to handle this, and I don't understand why the reviewer couldn't just update the score while they were already in their reviewer console on OpenReview; it would have taken ten seconds. Why postpone it? 😞

u/DazzlingPin3965 — 11 hours ago
[R] VOID: Video Object and Interaction Deletion (physically-consistent video inpainting)


We present VOID, a model for video object removal that aims to handle *physical interactions*, not just appearance.

Most existing video inpainting / object removal methods can fill in pixels behind an object (e.g., removing shadows or reflections), but they often fail when the removed object affects the dynamics of the scene.

For example:
- A domino chain is falling → removing the middle blocks should stop the chain
- Two cars are about to crash → removing one car should prevent the collision

Current models typically remove the object but leave its effects unchanged, resulting in physically implausible outputs.

VOID addresses this by modeling counterfactual scene evolution:
“What would the video look like if the object had never been there?”

Key ideas:
- Counterfactual training data: paired videos with and without objects (generated using Kubric and HUMOTO)
- VLM-guided masks: a vision-language model identifies which regions of the scene are affected by the removal
- Two-pass generation: first predict the new motion, then refine with flow-warped noise for temporal consistency

In a human preference study on real-world videos, VOID was selected 64.8% of the time over baselines such as Runway (Aleph), Generative Omnimatte, and ProPainter.

Project page: https://void-model.github.io/
Code: https://github.com/Netflix/void-model
Demo: https://huggingface.co/spaces/sam-motamed/VOID
Paper: https://arxiv.org/abs/2604.02296

Happy to answer questions!

Removing the compressor and saving the duckie.

u/Least_Light6037 — 12 hours ago

[D] Are there REAL success stories of autonomous AI dev agents working reliably in production?

I’m having a serious debate with a colleague, and I want to settle this with actual evidence instead of opinions.

The claim:

That it’s possible today to run orchestrated AI developer agents (multiple agents, coordinated workflows) that can autonomously build and maintain software — under supervision of a senior AI/dev — without running into unfixable errors or constant breakdowns.

I’m skeptical. He believes it’s already happening.

So I’m looking for real-world examples, not theory:

- Have you actually used autonomous dev agents in production?

- What was the setup? (tools, stack, orchestration method)

- What level of autonomy are we talking about?

- What still breaks?

- Did it scale beyond small experiments or toy projects?

Especially interested in:

- Multi-agent setups (not just Copilot-style assistance)

- Systems that run for extended periods (not one-off demos)

- Cases where human input is minimal but still controlled

If you’ve seen this work (or fail), I’d really appreciate detailed insights.

Trying to separate hype from reality here.

u/MegaMillyMansion — 5 hours ago

[D] Best websites for pytorch/numpy interviews

Hello,

I'm in the last year of my PhD and starting to prepare for interviews. I'm mainly aiming at applied scientist, research engineer, or research scientist roles.

For now I’m doing mainly leetcode. I’m looking for websites that can help me train for coding interviews in pytorch/numpy. I did some research and these websites popped up: nexskillai, tensorgym, deep-ml, leetgpu and the torch part of neetcode.

However, I couldn't really decide which of these websites is best.

I’m open to suggestions in this matter, thanks.

u/Training-Adeptness57 — 2 hours ago

[R] Differentiable Clustering & Search!

Hey guys,

I occasionally write articles on my blog, and I'm happy to share the new one with you: https://bornlex.github.io/posts/differentiable-clustering/.

It came from something I was working on at work; we ended up implementing something else because of the constraints we have.

The method mixes different loss terms to achieve a differentiable clustering method that takes into account mutual info, semantic proximity and even constraints such as the developer enforcing two tags (could be documents) to be part of the same cluster.

Then it is possible to search the catalog using the clusters.
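
To make the loss-mixing idea concrete, here is a rough NumPy sketch of one possible formulation (soft assignments, a centroid-reconstruction term for semantic proximity, and a must-link penalty); the blog's actual losses, including the mutual-information term, may well differ:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def clustering_loss(logits, embeddings, must_link):
    """Differentiable clustering loss: semantic term + must-link constraint."""
    p = softmax(logits)                 # (n_items, n_clusters) soft assignments
    # Soft centroids, then pull each item toward its (soft) cluster centroid.
    centroids = (p.T @ embeddings) / (p.sum(0)[:, None] + 1e-9)
    recon = p @ centroids
    semantic = ((embeddings - recon) ** 2).mean()
    # Must-link: penalize divergent assignment distributions for paired items.
    link = sum(((p[i] - p[j]) ** 2).sum() for i, j in must_link)
    return semantic + link

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 3))        # learnable assignment logits
emb = rng.normal(size=(6, 4))           # semantic embeddings of the tags/docs
loss = clustering_loss(logits, emb, must_link=[(0, 1)])
```

Everything here is built from differentiable operations, so the assignment logits could be optimized by gradient descent, which is the point of the approach.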

All of it comes from my own thinking; I used an AI to double-check the sentences and spelling, so it may have rewritten a few of them, but most of it is human-made.

I've added the research flair even though it is not exactly research, but more experimental work.

Can't wait for your feedback!

Ju

u/bornlex — 7 hours ago