LLM Hardware Chronicles: The Unexpected Journey from NVIDIA to AMD and Back
After months of experimenting with local Large Language Models (LLMs) via Ollama, I’ve learned one fundamental truth: You can never have enough VRAM.
This journey from consumer GPUs to datacenter cards has been filled with interesting discoveries about the balance between hardware capabilities, software ecosystems, and practical limitations.
The Initial Setup
My journey with the Derr AI system began with the wild west of models, experimenting with everything I could get my hands on to understand what would run locally. After much tinkering, I settled on running Mistral Small (22B parameters) on an NVIDIA RTX 3060 with 12GB of VRAM. While the 3060 is a solid card for many tasks, it quickly hit its limits when running larger language models.
Even with extensive optimization techniques, including:
- KV quantization for memory efficiency
- Mixed-precision models using llama.cpp’s q4_K_M quantization
- Offloading layers to CPU
the 12GB VRAM ceiling remained a constant barrier to experimenting with larger, more capable models (the settings involved are sketched just below).
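All of these tweaks live in Ollama’s environment variables and Modelfile parameters. What follows is only an illustration of the kind of configuration involved rather than my exact setup; the model tag and the num_gpu value are examples that need adjusting per model and card.

```bash
# Sketch only: enable KV cache quantization (requires flash attention) and
# cap how many layers are offloaded to the GPU so the remainder runs on CPU.
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0     # f16 (default), q8_0, or q4_0
ollama serve &

# Pull a q4_K_M build of the model (exact tag names vary per model in the library)
ollama pull mistral-small:22b-instruct-2409-q4_K_M

# Create a variant that keeps only ~30 layers on the GPU; the right number
# depends on the model size and how much VRAM is actually free.
cat > Modelfile <<'EOF'
FROM mistral-small:22b-instruct-2409-q4_K_M
PARAMETER num_gpu 30
EOF
ollama create mistral-small-offloaded -f Modelfile
```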
Enter the MI60s
To break through these limitations, I made an unconventional choice: purchasing two AMD Radeon Instinct MI60s, each with 32GB of HBM2 memory. At $400 each, these datacenter cards presented an intriguing opportunity, offering a massive 64GB of total VRAM at a fraction of the cost of current-gen AI cards. The potential to run larger models like Llama 3.3 70B was exciting, even though I knew I’d be working with some significant constraints: the aging GCN architecture from 2017, limited FP16 and INT8 performance crucial for modern LLM inference, and the complexity of working with end-of-life ROCm drivers that would require custom compilation.
The full reality of datacenter hardware meeting consumer setups hit home during installation. After shutting down the server and removing the 3060, I encountered my first obstacle: my case simply couldn’t accommodate both MI60s. The motherboard’s PCIe slot placement meant one card needed to go in the bottom slot, but the power supply blocked any hope of that working. While waiting 10 days for a new case to arrive, I took the opportunity to run initial performance comparisons with a single card (detailed below).
One silver lining during the wait was discovering that the software transition would be far smoother than anticipated. I had braced myself for a complex Docker reconfiguration, including building custom images with recompiled drivers, but it turned out to be surprisingly straightforward: a change to the Docker image tag and a few environment-variable tweaks were all it took to get everything running. What I thought would be days of troubleshooting and headaches ended up taking less than an hour.
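For anyone curious, the change amounted to something like the following (paraphrased from memory, so treat the exact flags as a sketch rather than a recipe):

```bash
# Before: the NVIDIA/CUDA variant of the Ollama container
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama

# After: the ROCm variant swaps the image tag and passes the AMD device nodes
# instead of --gpus; ROCR_VISIBLE_DEVICES just pins it to the first MI60.
docker run -d --device /dev/kfd --device /dev/dri \
  -e ROCR_VISIBLE_DEVICES=0 \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama:rocm
```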
Other Applications
The challenge of building custom Docker images for other applications highlighted a significant weakness in the ROCm ecosystem: these applications ship prebuilt CUDA and CPU images, but no ROCm-based ones. Rebuilding their images on top of a ROCm build of PyTorch was technically feasible (roughly along the lines sketched below), but the time investment required began to outweigh the potential benefits. This became one of several factors that ultimately led to my decision to return the cards.
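To give a sense of what such a rebuild looks like in principle, here is a minimal, hypothetical sketch; the app name and file layout are placeholders, and real projects usually need more than a base-image swap (pinned ROCm wheels, patched requirements, and so on):

```bash
# Hypothetical rebuild of a CUDA-only app image on top of AMD's ROCm PyTorch base.
cat > Dockerfile.rocm <<'EOF'
FROM rocm/pytorch:latest
WORKDIR /app
COPY . /app
# torch comes from the base image; install only the app's remaining dependencies
RUN pip install --no-cache-dir -r requirements.txt
CMD ["python", "main.py"]
EOF

docker build -f Dockerfile.rocm -t myapp:rocm .
```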
Performance Analysis
Ollama Performance Comparison
| Model | 3060 (t/s) | 1x MI60 (t/s) | 2x MI60 (t/s) |
|---|---|---|---|
| Llama 3.2 3B (q4_K_M) | 118.54 | 71.53 | N/A |
| Llama 3.1 8B (q3_K_L) | 48.37 | 38.64 | N/A |
| Mistral Nemo 12B (q4_K_M) | 41.6 | 31.75 | N/A |
| Mistral Small 22B (q4_K_M) | 11.81 | 22.06 | N/A |
| Command R 32B 08-2024 (q4_0) | OOM | 15.96 | N/A |
| Llama 3.3 70B (q4_K_M) | OOM | OOM | N/A |
| Mistral-Large-Instruct-2411 123B (q3_K_M) | OOM | OOM | N/A |

(OOM = out of memory. Dual-card results are N/A because these comparisons were run while only a single MI60 was installed.)
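For anyone wanting to reproduce this kind of benchmark: Ollama reports generation throughput when run with --verbose, which is the simplest way to get a comparable tokens/second figure. The prompt and model tag below are just examples:

```bash
# --verbose prints timing stats after the response, including the eval rate
ollama run --verbose llama3.2:3b "Summarize the trade-offs of CPU layer offloading."
# ...
# eval rate:            XX.XX tokens/s
```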
ComfyUI Testing
The ComfyUI testing revealed perhaps the most concerning performance issues. Generating a 1024x1024 image with flux.1-dev never completed successfully on the MI60, while my 3060 Ti finished the same task in 2 minutes. That was a showstopper for me: despite the MI60’s theoretical advantages in memory bandwidth and FP16 capability, it highlighted the practical limitations of working with ROCm in a consumer context.
I should also note that ComfyUI was much harder to set up than Ollama and never fully worked, though at least it did not require a fully custom image like the other applications. I used the hardandheavy/comfyui-rocm:latest Docker image, but I had to modify the installation script for ComfyUI and its dependencies so that it correctly created a Python virtual environment (the general shape of that fix is shown below). While not a complex fix, it epitomizes ROCm’s second-class status compared to CUDA. This doesn’t mean ROCm is unsuitable for production workloads; rather, it’s simply not yet ready for prime time with these consumer software stacks. One further observation about this container: it does not auto-update ComfyUI.
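For reference, the venv fix boiled down to something like this; it is not the container’s actual install script, just its general shape, and the paths are assumptions:

```bash
# Create a dedicated virtual environment and install ComfyUI's requirements
# into it before launching (paths are illustrative).
python3 -m venv /opt/comfyui-venv
. /opt/comfyui-venv/bin/activate
pip install --no-cache-dir -r /ComfyUI/requirements.txt
python /ComfyUI/main.py --listen 0.0.0.0
```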
Lessons Learned & Decision to Return
After extensive testing and consideration, I made the difficult decision to return both MI60 cards. While the cards offered impressive VRAM capacity at an attractive price point, several factors influenced this choice:
- Performance Regressions: Despite superior theoretical specifications, the MI60s generally underperformed the 3060 Ti in real-world tasks.
- Software Ecosystem Limitations: The ROCm ecosystem’s relative immaturity compared to CUDA created ongoing challenges:
  - Limited optimization for modern AI workloads
  - Lack of prebuilt containers for many applications
  - Time-consuming custom builds required for basic functionality
- Future Support Concerns: As end-of-life datacenter cards, the MI60s were unlikely to see improved ROCm support or further optimization.
Instead of proceeding with the dual MI60 setup, I’ve decided to pursue a used NVIDIA card with more VRAM. While this may mean a higher initial investment, the robust CUDA ecosystem and proven optimization paths make it a more practical choice for my needs.
Conclusion
This experiment with datacenter GPUs provided valuable insights into the real-world challenges of building AI workstations. While the allure of massive VRAM at budget prices is tempting, the importance of software ecosystem maturity cannot be overstated. The decision to return to the NVIDIA ecosystem, despite higher costs, reflects a crucial lesson: raw specifications alone don’t determine real-world utility.
For others considering similar paths:
- Carefully weigh ecosystem support against hardware specifications
- Consider the total cost of ownership, including time investment
- Remember that theoretical performance doesn’t always translate to practical advantages
Next Steps
Moving forward, I’ll be:
- Researching used NVIDIA options with larger VRAM capacity
- Documenting the transition back to CUDA-based workflows
- Researching a tensor parallelism implementation in Ollama
- Exploring optimization techniques within the NVIDIA ecosystem
- [Upcoming: Comparative analysis of different NVIDIA options for AI workloads]