r/CUDA
I'm starting a Master's focused on AI systems, GPU computing, and HPC, and I'm trying to better understand where the genuinely important problems are in AI infrastructure and inference engineering.
My background so far has mostly been applied ML systems work:
- production LLM serving with vLLM (rough sketch below)
- real-time ASR → LLM → TTS pipelines
- LoRA fine-tuning/merging
- latency-sensitive voice agents
- benchmarking and pipeline optimization
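For concreteness, the vLLM serving side looks roughly like this. A minimal sketch: the model id, sampling settings, and prompts are placeholders, not my production setup.

```python
# Minimal vLLM offline-batch sketch; model id and sampling settings
# are placeholders for illustration, not a production config.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # any HF model id works
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Summarize continuous batching in one sentence.",
    "Why does KV-cache size dominate serving memory?",
]

# vLLM handles batching internally (continuous batching + PagedAttention);
# how well the GPU stays utilized depends on how requests actually arrive,
# which is exactly where the systems problems below come from.
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```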
Over time I realized many of the hardest problems weren't really model problems, but systems problems:
- GPU underutilization
- irregular batching
- memory movement
- scheduling/backpressure
- latency propagation through pipelines (toy sketch after this list)
- inference efficiency under real-time constraints
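To make the backpressure/latency point concrete, here's the shape of the problem in toy form: three stages joined by bounded queues, where the slowest stage makes every queue upstream of it fill and end-to-end latency grow. A sketch with invented stage delays, not my actual pipeline:

```python
# Toy ASR -> LLM -> TTS pipeline: three stages connected by bounded
# queues. Stage delays are made-up numbers purely for illustration.
import asyncio
import time

async def stage(name, delay, inq, outq):
    while True:
        t0, item = await inq.get()
        await asyncio.sleep(delay)       # stand-in for real compute
        if outq is not None:
            await outq.put((t0, item))   # blocks when full -> backpressure
        else:
            print(f"{name}: item {item} end-to-end {time.perf_counter() - t0:.2f}s")
        inq.task_done()

async def main():
    q1, q2, q3 = (asyncio.Queue(maxsize=2) for _ in range(3))
    tasks = [
        asyncio.create_task(stage("asr", 0.05, q1, q2)),
        asyncio.create_task(stage("llm", 0.20, q2, q3)),  # bottleneck stage
        asyncio.create_task(stage("tts", 0.05, q3, None)),
    ]
    for i in range(10):                  # arrivals faster than the LLM stage
        await q1.put((time.perf_counter(), i))
        await asyncio.sleep(0.05)
    await q1.join(); await q2.join(); await q3.join()
    for t in tasks:
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)

asyncio.run(main())
```

Run it and the printed end-to-end latency climbs with each item: the queues in front of the 0.20s stage fill up, and every later request inherits the wait.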
Long term, I'm much more interested in the systems/infrastructure side of AI than in pure modeling. Things like:
- inference runtimes
- GPU systems
- CUDA/Triton (toy kernel below)
- compiler/runtime optimization
- distributed inference
- memory efficiency
- scheduling for irregular workloads
- HPC for AI workloads
- AI-assisted systems optimization
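On the CUDA/Triton point, the level of abstraction I mean is roughly this; it's just the standard toy vector-add from the Triton tutorial, nothing novel, and it needs a CUDA GPU plus triton/torch installed:

```python
# Standard toy Triton vector-add, to show the abstraction level I mean.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
assert torch.allclose(add(x, y), x + y)
```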
Right now I’m trying to figure out what problems in this space are:
- genuinely important
- underexplored
- likely to matter over the next 5 years
- realistic for a strong Master’s thesis
A few questions I’d really love practitioner/researcher opinions on:
- What problems in AI infrastructure or inference engineering still feel painfully unsolved in practice?
- What research directions seem overhyped vs genuinely valuable?
- Which areas do you think will matter most over the next few years: kernel optimization, distributed inference, memory systems, compiler/runtime work, scheduling, networking, etc.?
- What’s a technically deep but realistic Master’s-level research problem in this space?
Happy to hear “your framing is wrong because X” too — that’s probably the most useful feedback I can get right now.
u/Quirky-Guide-762 — 7 days ago