
How To Use Local LLMs with Ollama? (A Complete Guide)

Jul 1, 2025 · 6 Min Read
Written by Dharshan

AI tools like chatbots and content generators are everywhere. But usually, they run online using cloud services. What if you could run those smart AI models directly on your own computer, just like running a regular app? That’s what Ollama helps you do. 

In this blog, you’ll learn how to set it up, use it in different ways (like with terminal, code, or API), change some basic settings, and know what it can and can't do.

What is Ollama?

Ollama is software that lets you run large, powerful AI models on your own machine without relying on the internet. It handles downloading and running the models, and lets you chat with them or get responses from them, much like ChatGPT.

You can interact with it through simple terminal commands, from code (Python and other languages), or via its API. It's great for experimenting with AI locally, building your own projects, and testing ideas while keeping everything private and simple.

Why Use Ollama?

Ollama is aimed at developers and AI enthusiasts who want to run large language models locally, giving them more control and flexibility with less dependency on cloud services.

  • Run Models Offline with Full Control: Ollama lets you download and run models directly on your machine. It automatically uses your GPU for acceleration if available, and falls back to CPU when a GPU isn't present, so it works across a wide range of systems.
  • Fast Testing and Development: A local setup means quicker iteration, easier debugging, and smoother experimentation without waiting for remote servers or hitting rate limits.
  • No Cloud Dependency: With no need for internet access or cloud APIs, Ollama removes the reliance on third-party providers, making your workflow more stable and self-contained.

How to Install Ollama?

Installing Ollama is quick and straightforward. It works on macOS, Windows, and Linux.

Install Ollama for macOS and Linux (with Homebrew):

brew install ollama

Install Ollama For Windows:

  1. Visit the official website: https://ollama.com
  2. Download the Windows installer (.exe file)
  3. Run the installer and follow the setup steps

Alternative (manual download for all platforms):

Go to https://ollama.com/download and choose the right version for your operating system.

After installation, open a terminal and test it by running:

ollama --version

If the version number appears, Ollama is successfully installed and ready to use.
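
Ollama runs a small background server that both the CLI and the API talk to. On most systems the desktop app or system service starts it automatically, but if it isn't running you can start it yourself:

ollama serve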


Basic Ollama CLI Commands

Ollama provides a simple command-line interface to help you manage and interact with language models on your local machine. Below are some of the most commonly used commands:

Command | Description
ollama run <model> | Starts and runs a specified model
ollama list | Displays a list of installed models
ollama pull <model> | Downloads a model from the Ollama library
ollama create <name> -f Modelfile | Creates a custom model using a Modelfile
ollama serve | Starts the Ollama API server
ollama stop <model> | Stops a running model

These commands form the foundation of how you interact with Ollama through the terminal, making it easy to manage and use models locally.
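
As a quick illustration of ollama create, here's a minimal sketch of a Modelfile that builds a custom variant of llama3.2 (the name my-assistant below is just an example):

FROM llama3.2
PARAMETER temperature 0.3
SYSTEM "You are a concise technical assistant."

Save it as Modelfile, then build and run the custom model:

ollama create my-assistant -f Modelfile
ollama run my-assistant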

Ollama REST API Endpoints

Ollama provides a RESTful API so you can access and work with models through code. These endpoints let you generate text, manage models, create embeddings, and more, all running locally on your machine.

Method | Endpoint | Purpose
POST | /api/generate | Text generation
POST | /api/chat | Chat-style message handling
POST | /api/create | Create custom models
GET | /api/tags | List available (installed) models
DELETE | /api/delete | Delete a model
POST | /api/pull | Download a model
POST | /api/push | Upload a model
POST | /api/embed | Generate embeddings
GET | /api/ps | List running model processes
POST | /api/embeddings | Generate embeddings (legacy endpoint, superseded by /api/embed)
GET | /api/version | Get the current Ollama version

These endpoints will allow you to easily incorporate Ollama into your apps, tools, or workflows with just a few HTTP requests.
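
As a minimal sketch, here's how you might call two of these endpoints from Python with the requests library, assuming the Ollama server is running on its default port (11434) and the llama3.2 model used later in this guide is installed:

import requests

# List the models installed locally
models = requests.get("http://localhost:11434/api/tags").json()
print([m["name"] for m in models["models"]])

# Send a chat-style request and read the assistant's reply
chat = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "stream": False
    }
)
print(chat.json()["message"]["content"])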

Suggested Read: What are Embedding Models in Machine Learning?

How to Run Ollama Models

Before using Ollama through code or APIs, you first need to install and run a supported model. Here's how to get started:

Step 1: Install and Run a Model

  1. Open your terminal.
  2. Choose a model from the Ollama model library: 🔗 https://ollama.com/library
  3. Pull (install) the model you want to use. For example, to install llama3.2 (LLaMA 3.2):
ollama pull llama3.2
  4. Once it's downloaded, run it:
ollama run llama3.2

Now you can start chatting with the model directly in the terminal.
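
You can also pass a one-off prompt directly to ollama run instead of starting an interactive chat, which is handy for quick checks or shell scripts:

ollama run llama3.2 "Explain what a large language model is in one sentence."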

Step 2: Use the API with Python

If you want to use Python without any third-party SDKs like OpenAI's, you can make direct HTTP requests to Ollama's local server. Here’s how to do it:

import requests

url = "http://localhost:11434/api/generate"

payload = {
    "model": "llama3.2",
    "prompt": "What is LLM?",
    "stream": False,
    # Sampling parameters go inside "options" for Ollama's native API
    "options": {
        "temperature": 0.2,
        "top_p": 0.7,
        "top_k": 30,
        "repeat_penalty": 1.1,
        "num_predict": 100  # maximum number of tokens to generate
    }
}

response = requests.post(url, json=payload)

if response.status_code == 200:
    result = response.json()
    print(result["response"])
else:
    print("Error:", response.status_code, response.text)
Run the script to see the model's response printed in the terminal.


This setup lets you use Ollama as a local LLM server and test a variety of model behaviors with real API calls. If you don't want the defaults, you can change the model name or adjust parameters such as temperature, top_p, and repeat_penalty in the script to influence how the model responds.
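
If you set "stream" to True instead, Ollama sends the response back incrementally as newline-delimited JSON chunks, which is useful for displaying text as it's generated. A minimal sketch of reading that stream:

import json
import requests

with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "What is LLM?", "stream": True},
    stream=True
) as response:
    for line in response.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        # Each chunk carries a fragment of text; the final chunk has "done": true
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            print()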

OpenAI Compatibility Setup with Ollama

Ollama is designed to follow the OpenAI API format (the same one used by ChatGPT). This means you can use Ollama as a local drop-in replacement in apps or tools that were originally built for those services, without changing much of your code.

Why OpenAI Compatibility Matters

  • Ollama follows the same structure for chat, completion, and embedding endpoints used by many leading LLM providers.
  • You can connect it with tools and frameworks like LangChain, LlamaIndex, and more.
  • Easily reuse your existing ChatGPT-style apps or backend code by simply switching the base URL to Ollama's OpenAI-compatible endpoint (http://localhost:11434/v1).
  • It allows fast, offline testing and development with full control and no cloud dependency.

In this blog we'll use the OpenAI-compatible approach, since it's one of the most widely supported and easiest ways to get started with local LLMs. The example below points the official openai Python package at your local Ollama server:

from openai import OpenAI

# Point the OpenAI client at the local Ollama server
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required by the client, but not actually checked
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
    ],
    temperature=0.7,
    top_p=0.9,
    max_tokens=1000
    # Ollama-specific options such as top_k, repeat_penalty, and num_ctx are not
    # part of the OpenAI schema; use the native /api endpoints to set them.
)

print(response.choices[0].message.content)

Make sure the model is installed and running:

ollama run llama3.2
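
The compatibility layer also covers embeddings. As a sketch, assuming you have pulled an embedding model such as nomic-embed-text (ollama pull nomic-embed-text), you can reuse the same client:

embedding_response = client.embeddings.create(
    model="nomic-embed-text",
    input="Ollama runs large language models locally."
)

# Each item in data holds one embedding vector
print(len(embedding_response.data[0].embedding))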

4 Ollama Limitations You Should Know

While Ollama is a powerful tool for running local LLMs, it does come with a few limitations to keep in mind:

  • High RAM Usage for Larger Models: Running bigger models like DeepSeek-R1 or Mixtral may require a lot of system memory, which can be a challenge on lower-end machines.
  • No Built-in GPU Support in Some Environments: GPU acceleration isn't available everywhere by default, so model performance can be slower, especially on CPU-only setups.
  • Limited Community or Contributed Models: Unlike platforms like Hugging Face, Ollama currently has a smaller library of models and fewer community-made variations.
  • Not Meant for Large-Scale Production: Ollama is best suited for local testing, development, or personal use. While it can handle small-scale or low-traffic production setups, it is not optimized for large-scale, high-load, or enterprise-level deployments.

These limitations don’t affect most local development or testing needs, but they’re important to be aware of depending on your use case.

Conclusion

Ollama is easy-to-deploy, high-performing software for running AI language models locally on your desktop or server without relying on the cloud. You can use it from the command line, through its API, or in code, and it follows the same structure as widely used tools like ChatGPT.

Although it can require extra memory and doesn't support every advanced feature, it's well suited for learning, testing, and building local AI projects. Ollama is a good place to begin if you value privacy, control, and offline access to AI.

Dharshan

Passionate AI/ML Engineer with an interest in OpenCV, MediaPipe, and LLMs. Exploring computer vision and NLP to build smart, interactive systems.

