u/Skye7821

5090 FE water block compatibility with PRO 6000 WE

Hello everyone. As the title suggests, I am wondering whether I could simply use an RTX 5090 FE waterblock on the PRO 6000, on the main block side. I ask because I would really like the RGB and flow-through design of the many 5090 waterblocks out there. If anyone has had any success with this, please let me know!

reddit.com
u/Skye7821 — 2 days ago

Slop is making me feel disconnected from AI Research [D]

Hello everyone. This is just a small rant on my part. I’m relatively young, a final-year undergrad, and I’ve been interested in AI research since I was in high school. Over that time I feel there has been a significant shift in the culture surrounding the research.

While I’ve really enjoyed producing some interesting and creative work, I can’t help but feel more and more frustrated by the wave of low-quality AI research and researchers. To give a summary of what I and many others have seen:

- Papers with hallucinated citations, and even leftover prompts in the text.
- Papers with clearly misleading data that does not tell the whole story.
- Labs that have built a culture of quantity over quality: pumping out pubs, citing each other, and putting the whole lab on every paper to inflate each student’s publication record.
- High schoolers…. Yes, HIGH SCHOOLERS, increasingly submitting to conferences without really knowing what they are doing, after paying a pretty penny to participate in “research programs” that are really just cash cows taking advantage of the fierce competition. See the post on the subreddit for more info.
- Even the so-called “top labs” producing work that is somewhat misleading or not fully representative. For instance, see what happened recently with TurboQuant.
- High-quality research from “low tier institutions” being drowned out because it is not good for clickbait and farming views on LinkedIn and X.

It’s… a lot, I know. Of course these problems have been around for a long time, but I feel that lately they have gotten noticeably worse. I was originally drawn to AI research for the creativity and freedom, but ironically AI itself has become a hindrance to the quality of the work being published.

Of course, I don’t mean to say that AI has been all bad for ML research. Even I use it extensively to help polish my writing and generate seaborn plots for my data, but that is very, very different from just pumping out low-quality, cookie-cutter work.

Anyways, just wondering if anyone else shares similar thoughts. I know I’m relatively young here so maybe some of you have better insights into the broader trends over the decades.

u/Skye7821 — 3 days ago

Hello everyone. I’m excited to share our new paper!

Figure 1: Comparison Across Architectures

A lot of recent Transformer variants try to improve information flow across depth by exposing later layers to earlier representations. You may have recently heard about methods like DenseFormer, MUDDFormer, and HyperConnections, which add more dense or dynamic cross-layer pathways. These are expressive, but they can also come with meaningful throughput and memory costs.

Our question was more specific: Can we improve the efficiency-performance tradeoff at scale by enabling more principled reuse of early representations?

We introduce SATFormer, which keeps the same cheap first-layer value pathway used by value residual learning, but replaces static layer-wise mixing with a per-token, per-head, context-dependent gate. Instead of uniformly copying early features into every later layer, SATFormer learns when and where each head should re-access the first-layer value stream.
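To make the gating idea concrete, here is a minimal sketch in plain Python (no framework, so the shapes stay visible). It assumes one simple form of the gate: a sigmoid of a per-head linear projection of each token's hidden state, which scales how much of the first-layer value is added back for that token and head. The function name, the additive combination, and the `gate_w` / `gate_b` parameters are illustrative assumptions, not the formulation or code from the paper or the repo.

```python
import math
from typing import List

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def gated_first_layer_values(
    x: List[List[float]],               # [tokens][d_model] hidden state per token
    v_layer: List[List[List[float]]],   # [tokens][heads][d_head] current-layer values
    v_first: List[List[List[float]]],   # [tokens][heads][d_head] first-layer values
    gate_w: List[List[float]],          # [heads][d_model] hypothetical gate weights
    gate_b: List[float],                # [heads] hypothetical gate biases
) -> List[List[List[float]]]:
    """Add the first-layer value stream back in, scaled by a per-token, per-head gate."""
    out = []
    for t, x_t in enumerate(x):
        token_out = []
        for h, (w_h, b_h) in enumerate(zip(gate_w, gate_b)):
            # The gate depends on the token's own hidden state, so it is
            # context-dependent rather than a fixed per-layer coefficient.
            g = sigmoid(sum(w * xi for w, xi in zip(w_h, x_t)) + b_h)
            token_out.append(
                [vl + g * vf for vl, vf in zip(v_layer[t][h], v_first[t][h])]
            )
        out.append(token_out)
    return out

# One token, two heads, d_head = 1. A zero hidden state and zero bias give g = 0.5.
x = [[0.0, 0.0]]
v_layer = [[[1.0], [2.0]]]
v_first = [[[2.0], [4.0]]]
mixed = gated_first_layer_values(x, v_layer, v_first, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

A strongly negative bias for a head pushes its gate toward zero, so that head effectively ignores the first-layer stream. That is the kind of sparse, head-specific access a static layer-wise mixing coefficient cannot express.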

Main results:

  • Across 130M–1.3B models, SATFormer improves validation loss over both Transformer and ResFormer baselines.
  • On retrieval-intensive benchmarks, SATFormer gets the best average score among the evaluated architectures, narrowly surpassing MUDDFormer and improving over ResFormer by about 1.5 average points.
  • SATFormer runs at close to Transformer/ResFormer throughput, and those two have roughly 1.75×–1.82× higher throughput than HyperConnections and MUDDFormer.
  • Mechanistic analysis suggests the gate is not just acting like a dense residual shortcut: access is sparse, depth-dependent, head-specific, and stronger for specific tokens.

The core framing is that early-representation reuse may be better treated as a retrieval/control problem than as a connectivity/maximal routing problem. Overall, I am excited to discuss what better approaches there may be for improving the transformer architecture while maintaining high throughput.

Arxiv: https://arxiv.org/pdf/2605.03953

github (still WIP): https://github.com/SkyeGunasekaran/SATFormer

u/Skye7821 — 14 days ago

Hey everyone, I've been working on a small Python package called AutoMuon that makes the Muon optimizer usable as a drop-in replacement for AdamW in arbitrary PyTorch training pipelines.

The core idea is relatively simple: Muon is meant primarily for the 2D weight matrices that act on hidden states (linear projections, conv layers), but you still need AdamW for embeddings, norms, biases, etc. AutoMuon scans your model at init and figures out the right optimizer for each parameter automatically.
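As a rough illustration of that routing step, here is a minimal sketch in plain Python that splits parameters into a Muon group and an AdamW group. It works on `(name, shape)` pairs and uses a name-based exclusion tuple; both of those, along with `partition_params` and `ADAMW_NAME_HINTS`, are hypothetical simplifications for the example and not AutoMuon's actual API, which is described above as working from a module-type exclusion list.

```python
from typing import Dict, List, Tuple

# Hypothetical name fragments for parameters that stay on AdamW even when
# they are matrix-shaped (embeddings and norms, per the description above).
ADAMW_NAME_HINTS = ("embed", "norm")

def partition_params(
    named_shapes: List[Tuple[str, Tuple[int, ...]]]
) -> Dict[str, List[str]]:
    """Route matrix-like hidden weights to Muon and everything else to AdamW."""
    groups: Dict[str, List[str]] = {"muon": [], "adamw": []}
    for name, shape in named_shapes:
        # Linear weights are 2D; conv kernels have more dims but are
        # treated as matrix-like here. Biases and norm scales are 1D.
        is_matrix_like = len(shape) >= 2
        excluded = any(hint in name.lower() for hint in ADAMW_NAME_HINTS)
        if is_matrix_like and not excluded:
            groups["muon"].append(name)
        else:
            groups["adamw"].append(name)
    return groups

# Made-up parameter names and shapes, as they might come from a small model.
example = [
    ("tok_embed.weight", (32000, 512)),
    ("blocks.0.attn.q_proj.weight", (512, 512)),
    ("blocks.0.attn.q_proj.bias", (512,)),
    ("blocks.0.norm.weight", (512,)),
    ("stem.conv.weight", (64, 3, 3, 3)),
]
groups = partition_params(example)
```

In a real PyTorch model the pairs could come from `model.named_parameters()`, with each group handed to its own optimizer. Name matching is brittle, which is why a custom architecture would probably need the exclusion list tuned by hand.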

I am open to PRs, especially for expanding the module-type exclusion list if you hit edge cases in your architecture. Would love to know if anyone tries it on something other than transformers or CNNs and what they find. I feel that it would likely struggle with fully custom architectures, like flash-linear-attention for instance, so that would require some user tuning.

I am planning to add more tests for time series forecasting, genomics, language modeling, etc. I want to see how generalizable Muon really is!

https://github.com/SkyeGunasekaran/automuon

pip install git+https://github.com/SkyeGunasekaran/automuon.git

u/Skye7821 — 24 days ago