
How to Use Ollama for Beginners

Artificial intelligence has been gaining significant popularity recently. Along with cloud-based models, many users want the ability to run artificial intelligence locally. In any case, it's much safer because your data isn't transmitted anywhere, and it's also more reliable because a locally run model can't be blocked or taken away from you.

However, this approach has its drawbacks. The hardware capabilities of personal computers are quite limited, so locally run models will be significantly less capable than those available in the cloud. In this article, we will look at how local models work and how to use Ollama to run artificial intelligence models on your own machine.



Essential Fundamentals

Ollama has a fairly simple command-line interface, and you can download and run a model with just a few commands, so getting started with this tool is quite straightforward. Under the hood it is built on llama.cpp, and AI models are stored in the GGUF (GPT-Generated Unified Format) format, but you don't need to deal with any of that since Ollama handles downloading models and configuring llama.cpp for you.

However, to understand which models can and cannot be run on your computer, you need to know a few basics. In simple terms, an artificial intelligence model is a huge set of numbers (weights) arranged in a specific structure and organized into layers. The number of these parameters determines the size of the model.

When we make a request to the model, our query is also converted into numbers, and these numbers are multiplied by the numbers stored in the model. Of course, it is not a single simple operation: everything happens according to the structure of connections in the model and its layers, and at the end we receive the result. This is greatly simplified, but it reflects the general idea of what happens.

Since it is necessary to perform a vast number of mathematical operations in a short period of time, graphics cards are used to run artificial intelligence models. Graphics cards have many processor cores that perform such tasks much faster than a CPU. However, for this to work, all model numbers and your query must fit into your graphics card's video memory.

Model Size

For each model on the Ollama website, the number of parameters and the download size are specified.

Of course, if there isn't enough video memory, you can load only part of the model into the graphics card's memory and keep the rest in RAM. Ollama supports this by default. However, this makes interaction with the model many times slower, because the computer has to constantly move data through RAM and the CPU, which are much slower than the graphics card.

That's why model developers try to release models in different sizes, for example 1b, 4b, 12b, 27b. Here 'b' stands for billion: 12b means 12 billion parameters. If Float16 (FP16) is used to store the weights, each parameter takes 2 bytes, so at full precision a model with 12 billion parameters occupies 12,000,000,000 * 2 bytes = 24 gigabytes. This is the maximum precision at which models are available in Ollama. Considering that the query data (context) also has to fit somewhere, you would need a graphics card with more than 24 gigabytes of video memory to run such a model.

Quantization

Graphics cards with a lot of memory are quite expensive, which is why quantization was invented. Quantization reduces the precision of the numbers that make up the model: in other words, the tail is cut off, and a number that occupied 2 bytes starts to occupy one byte, or even half a byte. This certainly affects quality: the more heavily a model is quantized, the worse it performs, but the less memory it requires. Here are the main quantization levels supported by Ollama:

  • fp16 - maximum precision, 16 bits, 2 bytes per parameter
  • q8 - 8 bits, 1 byte per parameter
  • q4 - 4 bits, 0.5 bytes (half a byte) per parameter
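These levels translate directly into memory requirements: multiplying the parameter count by the bytes per parameter gives a rough lower bound for the weights alone (real GGUF files and runtime usage are somewhat larger because of the context and metadata). A quick back-of-the-envelope check for a 12-billion-parameter model:

awk 'BEGIN {
  params = 12e9                            # 12 billion parameters
  print "fp16:", params * 2   / 1e9, "GB"  # 2 bytes per parameter
  print "q8:  ", params * 1   / 1e9, "GB"  # 1 byte per parameter
  print "q4:  ", params * 0.5 / 1e9, "GB"  # half a byte per parameter
}'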

Of course, there are various modifications of these quantization methods that help preserve model quality while keeping the same memory savings, but I won't discuss them here. When selecting a model on the Ollama website, you can choose the quantization level from a dropdown list.

It's optimal to use fp16 in production, while q4 is fine for testing and most everyday use. For example, the gemma3 model with 12 billion parameters takes up 24 GB at fp16, but in Q4 format it's only 8.8 GB, which means it can run on an RTX 3060 with 12 GB of video memory.

This will be enough to understand which model can be loaded, and now let's move on directly to installing Ollama.

Installing Ollama

To install Ollama, simply visit https://ollama.com/ and click the Download button. This will open a page containing the installation command for any Linux distribution. It's essentially a curl command that downloads and runs the Ollama installation script on your system. At the time of writing this article, the command looks like this:

curl -fsSL https://ollama.com/install.sh | sh

After the installation script is complete, you can verify that the ollama service is running:

sudo systemctl status ollama
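You can also check which version of Ollama was installed; the exact number will, of course, depend on when you install it:

ollama --version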

How to Use Ollama

After that, you can start using Ollama. First, let's look at how to download a model.

Loading the Model

Find the model you need on the Ollama website using the search, for example gemma3.

Next, in the dropdown list, select the number of parameters and the quantization, for example 12b. By default, Q4_K_M quantization is typically used.

To see all available options, go to the Tags tab.

After selecting the model, you can view information about it, including the number of parameters, quantization, size, and so on.

To the right of the dropdown list, you will see the command to launch the model, which includes its name.

You can run the suggested command right away: it will not only download the model but also start a chat with it in the terminal. Alternatively, you can simply download the model using the pull command:

ollama pull gemma3:12b

You can also specify the quantization explicitly by taking it from the tag name:

ollama pull gemma3:12b-it-q4_K_M
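Once the model is downloaded, you can inspect it locally as well: the show command prints details such as the architecture, number of parameters, context length, and quantization:

ollama show gemma3:12b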

Running the Model in Terminal

After the model is downloaded, you can view the list of locally available models using the list command:

ollama list

You can start a chat with the required model in the terminal using the same run command:

ollama run gemma3:12b

Here you can ask questions and chat with the model, for example: Who created the Linux kernel?
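By the way, you don't have to use the interactive chat at all: you can pass a prompt directly as an argument to the run command, and Ollama will print the answer and exit. For example:

ollama run gemma3:12b "Who created the Linux kernel? Answer with only the name."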

Ollama can load models into video memory and, if there is not enough of it, offload some layers to RAM. This is done automatically without your intervention. If you are using an Nvidia graphics card, you can use the nvidia-smi utility to verify that video memory is actually being used:

nvidia-smi

Here you can see that Ollama uses slightly more than 8 GB of video memory, which means the model is loaded onto the graphics card.
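You can also ask Ollama itself how a loaded model is distributed: the ps command lists the currently loaded models and shows how much of each is placed on the GPU versus the CPU:

ollama ps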

Usage Statistics

When using artificial intelligence, the size of requests (prompts) that you send to the model, as well as the responses you receive, is measured not in the number of characters but in the number of tokens. Usually, a word takes about one and a half to two tokens. You can enable statistics display in the chat using the command:

/set verbose
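Alternatively, you can enable statistics right from the start by passing the --verbose flag when launching the chat:

ollama run --verbose gemma3:12b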

In the statistics, you can see not only the number of prompt and generated tokens but also the generation speed in tokens per second, shown in the eval rate field.

System Prompt

When interacting with an artificial intelligence model, not only your current request is sent but also a system prompt. By default, the system prompt contains something like "You are a helpful assistant" or is empty. If you need the model to perform a specific task, you can define it in the system prompt using the /set system command. For example:

/set system You are my Linux tutor. Today we are going to learn Linux commands. Your task is to name a command, and I have to guess what it is used for. Prepare ten command names, ask them one by one, and then check my answers.

After that, you just need to write something to the model to get it started, for example: let's go.

You can view the current system prompt using the command:

/show system

If you go through the command-learning example with statistics enabled, you will notice that the prompt size keeps increasing.

This happens because all of your previous requests and the model's responses are sent along with each new prompt. You need to keep this in mind, since models have a limited context size; if the context overflows, the model will either perform worse or completely forget what it was asked to do and generate something random. In such cases, you'll need to start a new chat and break large tasks into smaller ones.
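If you just want to reset the conversation without leaving the chat, there is also a /clear command that drops the accumulated context:

/clear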

Ending Chat in Terminal

There are quite a few settings for the model's main operating parameters. You can view all available chat commands with the /? command.

To end the chat, type the command:

/bye

API Requests

The command-line interface is not always convenient, and most often you will interact with Ollama models through some graphical interface that talks to Ollama via its API. So let's look at how to make an API request.

By default, Ollama listens for requests on port 11434. The chat API is OpenAI-compatible, so the request URL looks like this:

http://localhost:11434/v1/chat/completions

For example, you can make a request using curl:

curl http://localhost:11434/v1/chat/completions -d '{
    "model": "gemma3:12b",
    "messages": [
      { "role": "user", "content": "Who created the Linux kernel? Answer with only the name." }
    ],
    "stream": false
  }' | json_pp
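Besides the OpenAI-compatible endpoint, Ollama has its own native API. For example, the /api/generate endpoint takes a single prompt instead of a list of chat messages; the full set of fields is described in the Ollama API documentation:

curl http://localhost:11434/api/generate -d '{
    "model": "gemma3:12b",
    "prompt": "Who created the Linux kernel? Answer with only the name.",
    "stream": false
  }' | json_pp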

Removing the Model

Artificial intelligence models take up quite a lot of disk space. If you want to delete one of them, you can do it using the rm command:

ollama rm gemma3:12b

Removing Ollama

First, stop and disable the Ollama service:

sudo systemctl --now disable ollama

Then delete the service file itself:

sudo rm /etc/systemd/system/ollama.service

Remove the ollama executable file:

sudo rm $(which ollama)

Remove the Ollama data directory, which also contains the downloaded models:

sudo rm -r /usr/share/ollama

And finally, we need to remove the user and group that were created by the installation script:

sudo userdel ollama
sudo groupdel ollama

Wrapping Up

In this article, we explored how to install and use Ollama to run artificial intelligence models locally. Locally run models are not as powerful as cloud-based ones because they have fewer parameters, but they can handle simple tasks well. And unlike the cloud, you don't pay for each token sent and generated, and your data stays private.

Creative Commons License
The article is distributed under Creative Commons ShareAlike 4.0 license. Link to the source is required.
