Run even larger AI models locally with LM Studio

in LeoFinance11 months ago

A few days ago, I wrote a post about how to run large language AI models on your local PC using Ollama. I am a big fan of Ollama, but I have been using a new tool that is even better for interactive use.

LM Studio offers many of the same features and ease of use as Ollama and a lot more. It runs on Windows, Mac, and Linux and can be used interactively, or as a server that mimics OpenAI's API.

I've been liking LM Studio so much, I think I am going to remove Ollama from my machine. I have been using Ollama interactively as well as a server for other processes. I even have Ollama linked into VS Code to act as my own version of Github Copilot.

Installing LM Studio

Super easy, barely an inconvience. Just go to https://lmstudio.ai/ and choose your operating system.

Using LM Studio

Once you have LM Studio installed, you are going to first need to download some models. Depending on your system, and how much VRAM you have, you choices may be limited. One of the great things about LM Studio is you can use VRAM along with your system ram (at a big performance penalty). This will allow you to use much larger and less quantized models.

I have an AMD 5950X with 64G DDR4 and an nVidia 3090 in my main system. This gives me 24G VRAM and 64G of system ram. One of the models I have been playing with lately is Dolphin-Mixtral, which is a MoE model. I'm not going to get into a MoE model in this post, but it is a newer approach to LLM that uses multiple smaller models to provide fine tuned experts to break up responses.

Let's look at my options for this model and my hardware.

First we got to go to the search models tab, and then select the 2.7 Mixtral version. This is the latest release for Dolphin-Mixtral.

At the top of LM Studio, you can see my resources available. Which is the amount I said above but with the current overhead factored in.

Select the model, and on the right you will see all the files associated with it.

You can see I have two installed which I chose specifically. The first one will fit entirely on my GPU and run at top performance, the other is a 4 bit quantized version which is a lot better but requires a lot of ram. If you look at the first model, it is a 2 bit quantized model, which means the precision is highly reduced resulting in more potentially inaccurate choices as you go through the neural network. I would recommend using a 4 bit model if possible, the source model is typically 16 bit, so anything less than this will have reduced accuracy. 4 bit is generally a good compromise while making it accessable with consumer hardware.

Let's try the 2 bit version, and see how that goes.


First you need to go into the chat tab, and then load the model up top.

On the right you will see some choices, for this model we are going to want to set GPU Layers to -1, this will force all layers onto the GPU, this is ideal for this model as it will fit on my 24G 3090. If your GPU can't fit it, you will get an error. You will also want to set the context window, this is how much data the model can reference. 2048 is the default, but the more tokens you have the further back the conversation can go. 2048 is a good starting point for most tasks, if you are consuming more information you may need to increase this.

The first prompt I am going to use to test, is:

I have a hot dog on a plate in the kitchen, I take the plate into the living room and sit down. Where is the hot dog?

Even the 2 bit model is able to answer this, despite many other models failing.

On the bottom, we can see some performance numbers to see how fast we are generating responses. 48 tokens per second is a very acceptable speed. In fact, this is faster than you can read.

I'm going to switch to the 4 bit quantized version, this version requires 26GB, just over my available VRam, so before I load it, I need to change the GPU layers parameter. I found through some testing I can offload 20 layers to the GPU and use most of the available VRAM.

This configuration though I lost a lot of performance, dropping down just under 8 tokens per second. This is still usable, and not as fast as most poeple can read, but not slow enough that you are waiting forever. Most of the model is fitted on the GPU, with a few layers done on the CPU and system ram.

I can tweak the settings a little bit, and get 22 layers on the GPU for a slight improvement but I can't get the last couple layers on the GPU due to the ram requirements. This gave me a slight increase in performance, but nothing major. Depending on how much VRAM you have, your results will vary. I can also increase the CPU threads to 12 ( I have 16 native cores on my CPU) to get similar performance without increasing layers.

Just as important as your prompt, is the system prompt you give the model before asking it a question. The default prompt is very simple and can be modified to suit your needs.

For example, you can give it a prompt "You are an expert lawyer, and your client gives you a call and asks a question. Please answer their questions to the best of your ability". You can save this as a preset "Lawyer".

LM Studio also exposes a lot more advanced settings you can use to tweak your experience. From my experience, LM Studio is a bit buggy, at least on the Linux beta and you may have better results from Ollama if the bugs creep up in your use.

For most people, the Dolphin Mixtral may be too big of a model to work with, and you might want to look at someting like Open Orca 7B. As always, you can explore models on Hugging Face, the goto stop for Open LLM models.

Posted Using InLeo Alpha

Sort:  

Very interesting. I've been looking for an easy to use a large LLM that could use the 64Gb of unified memory on my M1 Max chip.

I don't like interacting with AI in the cloud that collects my data and is not private.

I was surprised to learn that Apple has released a M3 Max with 128Gb unified memory. That would really be powerful and could run huge models.

I'll let you know how it goes.

Mac Studio and even Mac Minis are very popular option for LLM due to how unified memory works. Nowhere can you get ~188 VRAM for less than the cost of even a single A100 40G.

I'm getting 23 tokens per second using the 5 bit Mixtal 2.7 model.

macs have a big edge for this.
I would recommend the 4 bit, the 5 bit isn't much better and takes a lot more ram. I'd stick with 4 bit, or something like 8 bit if you can get there.

prior knowledge required

Not really, this gets you up and running.

Mmmm i am not trying this,
It Is so very complicated to me, i uses the Telegram bot only for traducción,
By the way i have a Homework to you, kevinwong here are publishing an Is proyect calles taunet with His Coín named agrs, they appears as the wonderfull development proyect, right bow they do not have a product finish, my answer Is this programing in github caller IDNI could be confiable or it been only a scam.

Thanks a Lot AND sorry for the abuse.
I appreciate your opinión.

!PGM

Sent 0.1 PGM - 0.1 LVL- 1 STARBITS - 0.05 DEC - 1 SBT - 0.1 THG - 0.000001 SQM - 0.1 BUDS - 0.01 WOO - 0.005 SCRAP - 0.001 INK tokens

remaining commands 12

BUY AND STAKE THE PGM TO SEND A LOT OF TOKENS!

The tokens that the command sends are: 0.1 PGM-0.1 LVL-0.1 THGAMING-0.05 DEC-15 SBT-1 STARBITS-[0.00000001 BTC (SWAP.BTC) only if you have 2500 PGM in stake or more ]

5000 PGM IN STAKE = 2x rewards!

image.png
Discord image.png

Support the curation account @ pgm-curator with a delegation 10 HP - 50 HP - 100 HP - 500 HP - 1000 HP

Get potential votes from @ pgm-curator by paying in PGM, here is a guide

I'm a bot, if you want a hand ask @ zottone444


Thanks for your contribution to STEM content.

Congrats!

Thanks for your contribution to the STEMsocial community. Feel free to join us on discord to get to know the rest of us!

Please consider delegating to the @stemsocial account (85% of the curation rewards are returned).

You may also include @stemsocial as a beneficiary of the rewards of this post to get a stronger support.