r/pytorch


Is the DGX Spark worth the money?

I've seen a lot of DGX Spark discussions here focused on inference performance, and yeah, if you compare it to 4x 3090s for running small models, the DGX loses on both price and performance.

The Spark actually excels at prototyping

Let me break it down:

I just finished continued pretraining (CPT) of Nemotron-3-Nano on a ~6B-token dataset.

I spent about a week on my two Sparks debugging everything: an FP32 logit tensor that allocated 34 GB on its own, parallelization issues, Triton kernel crashes on big batches on Blackwell, Mamba-2 backward-pass race conditions, causal-mask waste, and more. In total I fixed 10+ issues on the Sparks.
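To see why a single logit tensor can balloon like that: an FP32 logit tensor materializes batch × sequence × vocabulary elements at 4 bytes each. A quick sketch with illustrative shapes (these are not the actual run's config, which the post doesn't give, just one combination that lands near 34 GB with a ~128k vocab):

```python
# Memory footprint of one FP32 logit tensor: batch * seq_len * vocab * 4 bytes.
batch, seq_len, vocab = 8, 8192, 131_072  # illustrative shapes, not the real run config

bytes_fp32 = batch * seq_len * vocab * 4
print(f"FP32 logits: {bytes_fp32 / 1e9:.1f} GB")  # ~34.4 GB for a single tensor
```

Halving to BF16, or using a chunked/fused cross-entropy that never materializes the full tensor, is the usual fix.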

The Sparks ran stable at 1,130 tokens/sec after all the patches. ETA for the full 6B-token run? 30 days!!! Not viable for production. So I tried the same setup on bigger Blackwell hardware: the B200, actually 8x B200.
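The ETA follows directly from the throughput. A quick check, assuming the quoted 1,130 tokens/sec is per node and both Sparks contribute (the post doesn't state whether the figure is per node or aggregate; per node is the reading that reproduces the ~30-day estimate):

```python
# ETA = total tokens / aggregate throughput, converted to days.
tokens = 6e9                    # ~6B-token dataset
tok_per_sec_per_node = 1_130    # quoted stable rate, assumed per Spark
nodes = 2                       # the 2-node Spark cluster

eta_days = tokens / (tok_per_sec_per_node * nodes) / 86_400
print(f"ETA: {eta_days:.1f} days")  # ~30.7 days
```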

Scaling to 8x B200

When I moved to 8x B200 on Verda (unbelievable spot pricing at €11.86/h), the whole setup took about an hour. All the patches, hyperparameters, and dataset format worked identically to the DGX setup; I just needed to scale. The Spark's 30-day run finished in about 8 hours on the B200s. 167x faster (see image).

For context, before Verda I tried Azure, but their quota approval process for high-end GPU instances takes too long. Verda instead let me spin up immediately on spot at roughly a quarter of what comparable on-demand instances cost elsewhere.

Cost analysis (see image)    

If I had prototyped directly on cloud B200s at on-demand rates, debugging and getting the complete model-dataset setup working would have cost roughly €1,220. On the Spark? €0, since the hardware is mine.

Production run: €118. Total project cost: €118.
Cloud-only equivalent: €1,338 (same setup I used for training). That's 91% less by prototyping on the DGX first.
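The savings figure checks out arithmetically, using only the numbers above:

```python
# Cloud-only: ~€1,220 of prototyping/debugging + €118 production = €1,338.
# Actual: €0 prototyping (owned Sparks) + €118 production run.
cloud_only = 1_338.0  # EUR
actual = 118.0        # EUR

savings = cloud_only - actual
pct_less = savings / cloud_only * 100
print(f"Saved €{savings:.0f} ({pct_less:.0f}% less)")  # Saved €1220 (91% less)
```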

OK, the Spark itself isn't free, but at ~€1,200 saved per prototyping cycle it pays for itself in about 6-7 serious training projects. And most importantly, you'll never get a bill while prototyping, figuring out the setup, and fixing bugs.

The honest opinion

The DGX Spark is not an inference machine and it's not a training cluster. It's a prototyping and debugging workstation. If you're doing large training work and want to iterate locally before burning cloud credits, it makes a lot of sense. If you just want to run LLMs for single-turn or few-turn chat, buy something like the 3090s or the latest Macs.

For anyone interested in more details and the full process, from prototyping on the DGX to deploying on the big Blackwell GPUs, you can find the whole research here.

Happy to answer any questions about the Spark, the 2-node cluster setup, and B200/B300 Blackwell deployment.

u/Lorelabbestia — 5 days ago

A visual workspace for "Transformer Surgery": Building, pruning, and exporting hybrid architectures (Gemma 4, Mistral, Llama and more)

I’ve spent a lot of time lately digging into the "surgical" side of LLMs: specifically, trying to understand how the internal math changes when you mix architectural concepts, like putting a Llama-style MLP into a Gemma-style soft-capping attention block.

One thing that consistently slows down research is how rigid the standard libraries are. If you want to swap a normalization layer or test a hybrid GQA/SWA (Grouped-Query/Sliding Window) setup, you usually end up monkey-patching deep inside a modeling_xxx.py file or writing one-off scripts that break when you change a hidden dimension.
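For a sense of what that monkey-patching looks like in practice, here's a generic pure-PyTorch sketch of one such swap: replacing every LayerNorm in a model with a hand-rolled Llama-style RMSNorm. The helper and toy model are illustrative, not part of any library or of the tool described here:

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Minimal RMSNorm: scale by reciprocal RMS, with a learned weight."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


def swap_norms(model: nn.Module) -> int:
    """Recursively replace every nn.LayerNorm with RMSNorm; return swap count."""
    swapped = 0
    for name, child in model.named_children():
        if isinstance(child, nn.LayerNorm):
            setattr(model, name, RMSNorm(child.normalized_shape[-1]))
            swapped += 1
        else:
            swapped += swap_norms(child)
    return swapped


toy = nn.Sequential(nn.Linear(64, 64), nn.LayerNorm(64),
                    nn.Linear(64, 64), nn.LayerNorm(64))
n = swap_norms(toy)
print(f"Swapped {n} norm layers")  # Swapped 2 norm layers
```

This works on a toy model, but against a real `modeling_xxx.py` you also have to chase down config fields, weight initialization, and checkpoint key names, which is exactly the boilerplate that makes these experiments brittle.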

To solve this for my own research, I built a visual workspace called Neural Playground (part of OLLA) that handles the boilerplate and exports the results as clean, runnable PyTorch code. I’m opening it up for others to use for their own prototyping and architecture experiments.

What you can do with it:

  • Deconstruct Model Families: Inspect the exact layer structures of Mistral, Llama, Gemma, and Phi.
  • Configure Every Parameter: Directly adjust KV heads, RoPE settings, hidden sizes, and attention variants through the UI.
  • Export to PyTorch: Once you’ve designed a hybrid variant, you can export the entire thing as a clean PyTorch project.
  • Local Pruning: I’ve also included a one-click local checkpoint pruner with VRAM reporting to see the impact of architectural changes before you even hit train.
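As a rough back-of-the-envelope version of that parameter/VRAM reporting, here's a sketch of how KV-head count feeds into model size under GQA. The formula ignores norms, gating, and biases, and every shape value is illustrative rather than a default from the tool:

```python
def estimate_params(vocab: int, hidden: int, layers: int,
                    n_heads: int, n_kv_heads: int,
                    ffn_mult: int = 4, tie_embeddings: bool = True) -> int:
    """Rough transformer parameter count (ignores norms, gates, biases)."""
    head_dim = hidden // n_heads
    # Q and output projections are full hidden x hidden;
    # K and V shrink with the number of KV heads under GQA.
    attn = hidden * hidden * 2 + hidden * head_dim * n_kv_heads * 2
    mlp = hidden * (hidden * ffn_mult) * 2  # up + down projections
    embed = vocab * hidden * (1 if tie_embeddings else 2)
    return layers * (attn + mlp) + embed


# Illustrative 7B-class shapes with 8 KV heads (GQA).
p = estimate_params(vocab=32_000, hidden=4096, layers=32,
                    n_heads=32, n_kv_heads=8)
print(f"~{p / 1e9:.2f}B params, ~{p * 2 / 1e9:.1f} GB weights at fp16")
```

Dropping `n_kv_heads` from 32 to 8 with these shapes shaves the K/V projections by 4x per layer, which is the kind of before-training impact the pruner's VRAM report is meant to surface.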

Why I’m sharing this: I’m looking for technical feedback from people who do a lot of model surgery or local deployment. Specifically:

  1. Are there specific hybrid combinations (like MoE variants) that are currently a pain for you to implement manually?
  2. What additional "model surgery" tools would be most useful? I'm currently looking at adding Knowledge Distillation support next.

The project is live at: https://olla.work. I’m hoping this helps lower the barrier to entry for custom architecture research and helps people "see" the math behind the layers.

u/ColdPassenger9550 — 23 hours ago
Real-Time Instance Segmentation using YOLOv8 and OpenCV


For anyone working through the tutorial "Dog Segmentation Magic: YOLOv8 for Images and Videos (with Code)":

The primary technical challenge addressed in this tutorial is the transition from standard object detection, which merely identifies a bounding box, to instance segmentation, which requires pixel-level accuracy. YOLOv8 was selected for this implementation because it maintains high inference speeds while providing a sophisticated architecture for mask prediction. By utilizing a model pre-trained on the COCO dataset, we can leverage transfer learning to achieve precise boundaries for canine subjects without the computational overhead typically associated with heavy transformer-based segmentation models.


The workflow begins with environment configuration using Python and OpenCV, followed by the initialization of the YOLOv8 segmentation variant. The logic focuses on processing both static image data and sequential video frames, where the model performs simultaneous detection and mask generation. This approach ensures that the spatial relationship of the subject is preserved across various scales and orientations, demonstrating how real-time segmentation can be integrated into broader computer vision pipelines.
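Once the model has produced per-instance masks, compositing them onto a frame is plain array math. A minimal NumPy sketch of that overlay step, with a synthetic frame and mask standing in for real YOLOv8 output (the function name and shapes are illustrative, not from the tutorial's code):

```python
import numpy as np


def overlay_mask(frame: np.ndarray, mask: np.ndarray,
                 color=(0, 255, 0), alpha: float = 0.4) -> np.ndarray:
    """Alpha-blend a binary instance mask onto a BGR frame."""
    out = frame.astype(np.float32)
    colored = np.zeros_like(out)
    colored[mask.astype(bool)] = color  # paint masked pixels in the overlay color
    blended = np.where(mask[..., None].astype(bool),
                       (1 - alpha) * out + alpha * colored,
                       out)
    return blended.astype(np.uint8)


frame = np.full((480, 640, 3), 30, dtype=np.uint8)  # dark synthetic frame
mask = np.zeros((480, 640), dtype=np.uint8)
mask[100:300, 200:400] = 1                          # fake instance region
vis = overlay_mask(frame, mask)
```

The same blending applies unchanged per video frame, which is why the detection-plus-mask loop stays real-time once inference itself is fast enough.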


Reading on Medium: https://medium.com/image-segmentation-tutorials/fast-yolov8-dog-segmentation-tutorial-for-video-images-195203bca3b3

Detailed written explanation and source code: https://eranfeit.net/fast-yolov8-dog-segmentation-tutorial-for-video-images/

Deep-dive video walkthrough: https://youtu.be/eaHpGjFSFYE


This content is provided for educational purposes only. The community is invited to provide constructive feedback or post technical questions regarding the implementation details.


Eran Feit


u/Feitgemel — 14 hours ago