LLM Hardware Chronicles: The Unexpected Journey from NVIDIA to AMD and Back
After months of experimenting with local Large Language Models (LLMs) via Ollama, I’ve learned one fundamental truth: You can never have enough VRAM.
This journey from consumer GPUs to datacenter cards has been filled with interesting discoveries about the balance between hardware capabilities, software ecosystems, and practical limitations.
The Initial Setup
My journey with the Derr AI system began with the wild west of models, experimenting with everything I could get my hands on to understand what would run locally. After much tinkering, I settled on running Mistral Small (22B parameters) on an NVIDIA RTX 3060 with 12GB of VRAM. While the 3060 is a solid card for many tasks, it quickly hit its limits when running larger language models.
Even with extensive optimization techniques, including:
- KV quantization for memory efficiency
- Mixed-precision models using llama.cpp’s q4_K_M quantization
- Offloading layers to CPU
the 12GB VRAM ceiling remained a constant barrier to experimenting with larger, more capable models (the settings involved are sketched just below).
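All of these tweaks live in Ollama’s environment variables and Modelfile parameters. What follows is only an illustration of the kind of configuration involved rather than my exact setup; the model tag and the num_gpu value are examples that need adjusting per model and card.

```bash
# Sketch only: enable KV cache quantization (requires flash attention) and
# cap how many layers are offloaded to the GPU so the remainder runs on CPU.
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0     # f16 (default), q8_0, or q4_0
ollama serve &

# Pull a q4_K_M build of the model (exact tag names vary per model in the library)
ollama pull mistral-small:22b-instruct-2409-q4_K_M

# Create a variant that keeps only ~30 layers on the GPU; the right number
# depends on the model size and how much VRAM is actually free.
cat > Modelfile <<'EOF'
FROM mistral-small:22b-instruct-2409-q4_K_M
PARAMETER num_gpu 30
EOF
ollama create mistral-small-offloaded -f Modelfile
```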
Enter the MI60s
To break through these limitations, I made an unconventional choice: purchasing two AMD Radeon Instinct MI60s, each with 32GB of HBM2 memory. At $400 each, these datacenter cards presented an intriguing opportunity, offering a massive 64GB of total VRAM at a fraction of the cost of current-gen AI cards. The potential to run larger models like Llama 3.3 70B was exciting, even though I knew I’d be working with some significant constraints: the aging GCN architecture from 2017, limited FP16 and INT8 performance crucial for modern LLM inference, and the complexity of working with end-of-life ROCm drivers that would require custom compilation.
The full reality of datacenter hardware meeting consumer setups hit home during installation. After shutting down the server and removing the 3060, I encountered my first obstacle: my case simply couldn’t accommodate both MI60s. The motherboard’s PCIe slot placement meant one card needed to go in the bottom slot, but the power supply blocked any hope of that working. While waiting 10 days for a new case to arrive, I took the opportunity to run initial performance comparisons with a single card (detailed below).
One silver lining during the wait was discovering that the software transition would be far smoother than anticipated. I had braced myself for a complex Docker reconfiguration, including building custom images with recompiled drivers, but it turned out to be surprisingly straightforward: a change to the Docker image tag and a few environment-variable tweaks were all it took to get everything running. What I thought would be days of troubleshooting and headaches ended up taking less than an hour.
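For anyone curious, the change amounted to something like the following (paraphrased from memory, so treat the exact flags as a sketch rather than a recipe):

```bash
# Before: the NVIDIA/CUDA variant of the Ollama container
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama

# After: the ROCm variant swaps the image tag and passes the AMD device nodes
# instead of --gpus; ROCR_VISIBLE_DEVICES just pins it to the first MI60.
docker run -d --device /dev/kfd --device /dev/dri \
  -e ROCR_VISIBLE_DEVICES=0 \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama:rocm
```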
Other Applications
The challenge of building custom Docker images for other applications highlighted a significant weakness in the ROCm ecosystem: these applications ship prebuilt CUDA and CPU images, but no ROCm-based ones. Rebuilding their images on top of a ROCm build of PyTorch was technically feasible (roughly along the lines sketched below), but the time investment required began to outweigh the potential benefits. This became one of several factors that ultimately led to my decision to return the cards.
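To give a sense of what such a rebuild looks like in principle, here is a minimal, hypothetical sketch; the app name and file layout are placeholders, and real projects usually need more than a base-image swap (pinned ROCm wheels, patched requirements, and so on):

```bash
# Hypothetical rebuild of a CUDA-only app image on top of AMD's ROCm PyTorch base.
cat > Dockerfile.rocm <<'EOF'
FROM rocm/pytorch:latest
WORKDIR /app
COPY . /app
# torch comes from the base image; install only the app's remaining dependencies
RUN pip install --no-cache-dir -r requirements.txt
CMD ["python", "main.py"]
EOF

docker build -f Dockerfile.rocm -t myapp:rocm .
```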
Performance Analysis
Ollama Performance Comparison
| Model | 3060 (t/s) | 1x MI60 (t/s) | 2x MI60 (t/s) |
|---|---|---|---|
| Llama 3.2 3B (q4_K_M) | 118.54 | 71.53 | N/A |
| Llama 3.1 8B (q3_K_L) | 48.37 | 38.64 | N/A |
| Mistral Nemo 12B (q4_K_M) | 41.6 | 31.75 | N/A |
| Mistral Small 22B (q4_K_M) | 11.81 | 22.06 | N/A |
| Command R 32B 08-2024 (q4_0) | OOM | 15.96 | N/A |
| Llama 3.3 70B (q4_K_M) | OOM | OOM | N/A |
| Mistral-Large-Instruct-2411 123B (q3_K_M) | OOM | OOM | N/A |

(OOM = out of memory. Dual-card results are N/A because these comparisons were run while only a single MI60 was installed.)
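For anyone wanting to reproduce this kind of benchmark: Ollama reports generation throughput when run with --verbose, which is the simplest way to get a comparable tokens/second figure. The prompt and model tag below are just examples:

```bash
# --verbose prints timing stats after the response, including the eval rate
ollama run --verbose llama3.2:3b "Summarize the trade-offs of CPU layer offloading."
# ...
# eval rate:            XX.XX tokens/s
```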
ComfyUI Testing
The ComfyUI testing revealed perhaps the most concerning performance issues. Generating a 1024x1024 image with flux.1-dev never completed successfully on the MI60, while my 3060 Ti finished the same task in 2 minutes. That was a showstopper for me: despite the MI60’s theoretical advantages in memory bandwidth and FP16 capability, it highlighted the practical limitations of working with ROCm in a consumer context.
I should also note that ComfyUI was much harder to set up than Ollama and never fully worked, though at least it did not require a fully custom image like the other applications. I used the hardandheavy/comfyui-rocm:latest Docker image, but I had to modify the installation script for ComfyUI and its dependencies so that it correctly created a Python virtual environment (the general shape of that fix is shown below). While not a complex fix, it epitomizes ROCm’s second-class status compared to CUDA. This doesn’t mean ROCm is unsuitable for production workloads; rather, it’s simply not yet ready for prime time with these consumer software stacks. One further observation about this container: it does not auto-update ComfyUI.
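For reference, the venv fix boiled down to something like this; it is not the container’s actual install script, just its general shape, and the paths are assumptions:

```bash
# Create a dedicated virtual environment and install ComfyUI's requirements
# into it before launching (paths are illustrative).
python3 -m venv /opt/comfyui-venv
. /opt/comfyui-venv/bin/activate
pip install --no-cache-dir -r /ComfyUI/requirements.txt
python /ComfyUI/main.py --listen 0.0.0.0
```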
Lessons Learned & Decision to Return
After extensive testing and consideration, I made the difficult decision to return both MI60 cards. While the cards offered impressive VRAM capacity at an attractive price point, several factors influenced this choice:
- Performance Regressions: Despite superior theoretical specifications, the MI60s generally underperformed the 3060 Ti in real-world tasks.
- Software Ecosystem Limitations: The ROCm ecosystem’s relative immaturity compared to CUDA created ongoing challenges:
  - Limited optimization for modern AI workloads
  - Lack of prebuilt containers for many applications
  - Time-consuming custom builds required for basic functionality
- Future Support Concerns: As end-of-life datacenter cards, the MI60s were unlikely to see improved ROCm support or further optimization.
Instead of proceeding with the dual MI60 setup, I’ve decided to pursue a used NVIDIA card with more VRAM. While this may mean a higher initial investment, the robust CUDA ecosystem and proven optimization paths make it a more practical choice for my needs.
Conclusion
This experiment with datacenter GPUs provided valuable insights into the real-world challenges of building AI workstations. While the allure of massive VRAM at budget prices is tempting, the importance of software ecosystem maturity cannot be overstated. The decision to return to the NVIDIA ecosystem, despite higher costs, reflects a crucial lesson: raw specifications alone don’t determine real-world utility.
For others considering similar paths:
- Carefully weigh ecosystem support against hardware specifications
- Consider the total cost of ownership, including time investment
- Remember that theoretical performance doesn’t always translate to practical advantages
Next Steps
Moving forward, I’ll be:
- Researching used NVIDIA options with larger VRAM capacity
- Documenting the transition back to CUDA-based workflows
- Researching a tensor parallelism implementation in Ollama
- Exploring optimization techniques within the NVIDIA ecosystem
- [Upcoming: Comparative analysis of different NVIDIA options for AI workloads]