Ollama-cpu
Why Ollama? It is an easy way to run LLMs locally.
https://github.com/ollama/ollama
ollama:
- url: http://ollama-service.ollama.svc.cluster.local:11434 (k8s internal)
- http ingress: http://ollama.uaiso.lan
- no API key required
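Quick check from outside the cluster (a minimal sketch: it assumes the ingress above resolves from your machine and, for the generate call, that a model such as llama3.2:3b from the example further down has already been pulled):
curl http://ollama.uaiso.lan/api/tags    # lists the models available on the server
curl http://ollama.uaiso.lan/api/generate -d '{"model": "llama3.2:3b", "prompt": "Why is the sky blue?", "stream": false}'    # assumes llama3.2:3b is already pulled
With "stream": false the /api/generate call returns a single JSON response instead of a stream of chunks.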
Tool to browse models by size:
https://uaiso-serious.github.io/ollama-helper/
Don't have an Nvidia GPU?
The CPU version will be very slow for inference, and the RAM limit in the yaml is set to 1Gi to avoid memory leaks and crashes.
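To see the limit that is currently applied, you can inspect the pod (a minimal sketch; it assumes the Ollama container is the first container in pod ollama-0):
kubectl -n ollama get pod ollama-0 -o jsonpath='{.spec.containers[0].resources}'    # prints the requests/limits of the first container
If your node has memory to spare, raise the memory limit in the yaml and re-apply it.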
Alternatively, you can try Ollama Cloud: https://docs.ollama.com/cloud.
Create an account and follow the instructions.
Enter the pod:
kubectl -n ollama exec -it ollama-0 -- bash
Run the sign-in command:
ollama signin
Follow the instructions to log in to Ollama Cloud.
Try a small model first; bigger models will eat through your tokens/quota very fast:
ollama pull ministral-3:3b-cloud
Available cloud models:
https://ollama.com/search?c=cloud
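Once the pull finishes, a quick smoke test from inside the pod (a sketch only: it reuses the ministral-3:3b-cloud tag pulled above and will consume a little of your cloud quota):
kubectl -n ollama exec ollama-0 -- ollama run ministral-3:3b-cloud "Reply with one short sentence."    # runs a single prompt and exits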
If you have an Nvidia GPU
Requirements:
- Ollama GPU support list: https://docs.ollama.com/gpu
- NVIDIA setup on the cluster (drivers and Kubernetes GPU support), following NVIDIA's instructions
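Before applying the CUDA manifest, you can confirm the cluster actually exposes GPUs (a minimal check; it assumes the NVIDIA device plugin or GPU Operator is installed, otherwise nothing will match):
kubectl describe nodes | grep -i "nvidia.com/gpu"    # should show nvidia.com/gpu under Capacity/Allocatable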
Deploy Ollama with CUDA support:
kubectl apply -f ollama-cuda.yaml
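A quick way to confirm the pod came up and detected the GPU (the exact log wording varies by Ollama version, so treat the grep pattern as a rough filter):
kubectl -n ollama get pods
kubectl -n ollama logs ollama-0 | grep -iE "cuda|gpu"    # startup logs normally report the detected GPU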
Choose a model whose size fits in your GPU VRAM:
https://uaiso-serious.github.io/ollama-helper/
Example: pull the llama3.2 3B model:
kubectl -n ollama exec ollama-0 -- bash -c "ollama pull llama3.2:3b"
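To check that the model actually runs on the GPU (a minimal sketch; ollama ps reports where each loaded model is running):
kubectl -n ollama exec ollama-0 -- ollama run llama3.2:3b "Hello"    # loads the model and answers one prompt
kubectl -n ollama exec ollama-0 -- ollama ps    # the processor column should show GPU rather than CPU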