Great choice: Ollama makes running LLMs locally straightforward, with pre-quantized model builds, automatic GPU offload, and multi-GPU support. For your RTX 8000 (48GB VRAM) + 128GB RAM setup:
- 4-bit quantized 70B models (e.g., Llama 3.1 70B): ~40GB VRAM plus KV cache, roughly 5-10 tokens/sec on this card. Command: `ollama run llama3.1:70b` (the default tag is already a 4-bit quant; see the command sketch right after this list).
- 30B-class models (e.g., Qwen2.5 32B or Gemma 2 27B): ~16-20GB VRAM at the default 4-bit quant and noticeably faster, roughly 15-20 tokens/sec, with room left over for a larger context window or an 8-bit quant (~35GB for 32B) if you want less quality loss. Try `ollama run qwen2.5:32b`.
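If it helps, here is a minimal shell sketch of the pull/run/verify loop for the 70B option; the tag names come from the current Ollama model library (adjust if they have changed), and `ollama ps` / `nvidia-smi` are only there to confirm nothing spilled out of VRAM:

```bash
# Pull the default llama3.1:70b build (already 4-bit quantized)
ollama pull llama3.1:70b

# Start an interactive session
ollama run llama3.1:70b

# In a second terminal: verify the whole model landed on the GPU
ollama ps      # the PROCESSOR column should read "100% GPU"
nvidia-smi     # cross-check actual VRAM usage on the RTX 8000
```

If `ollama ps` reports a CPU/GPU split instead of "100% GPU", the model plus its context did not fit and some layers are running from system RAM.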
Install Ollama from their site (ollama.com), then pull models as above. GPU acceleration works out of the box as long as your NVIDIA driver supports CUDA 12+, and the RTX 8000 is fully supported. When VRAM runs short (for example with a long context), Ollama automatically spills layers to system RAM; you can control that split explicitly with the `num_gpu` parameter in a Modelfile or via API options, as sketched below. With 48GB of VRAM this card handles models at this scale comfortably; just keep an eye on thermals under sustained inference.
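As a sketch of that knob, you can bake `num_gpu` (layers kept in VRAM) and `num_ctx` (context length) into a named variant with a Modelfile; the variant name, layer count, and context size below are placeholders to tune against what `nvidia-smi` shows, not recommended values:

```bash
# Derive a custom model with a longer context window and an explicit GPU layer count.
# num_ctx: context length in tokens (a larger window needs more VRAM for the KV cache).
# num_gpu: how many layers stay in VRAM; lower it to spill layers to system RAM if
#          you hit out-of-memory (slower, but workable with 128GB of RAM).
cat > Modelfile.70b-longctx <<'EOF'
FROM llama3.1:70b
PARAMETER num_ctx 16384
PARAMETER num_gpu 70
EOF

ollama create llama3.1-70b-longctx -f Modelfile.70b-longctx
ollama run llama3.1-70b-longctx
```

The same settings can also be passed per-request through the `options` field of Ollama's REST API if you would rather not create a named variant.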