
How To Use Local LLMs with Ollama? (A Complete Guide)

Jul 3, 2025 · 6 Min Read
Written by Dharshan

AI tools like chatbots and content generators are everywhere. But usually, they run online using cloud services. What if you could run those smart AI models directly on your own computer, just like running a regular app? That’s what Ollama helps you do. 

In this blog, you’ll learn how to set it up, use it in different ways (like with terminal, code, or API), change some basic settings, and know what it can and can't do.

What is Ollama?

Ollama is software that lets you run large, powerful AI models on your own machine without relying on the internet. It handles downloading and running the models and lets you chat with them or get responses from them, much like ChatGPT.

You can interact with it through simple terminal commands, programming languages like Python, or other API tools. It's great for running AI models locally, building your own projects, and experimenting, all while staying private and easy to manage.

Why Use Ollama?

Ollama is aimed at developers and AI enthusiasts who want to run large language models locally, giving them more control and flexibility and less dependency on cloud services.

  • Run Models Offline with Full Control: Ollama lets you download and run models directly on your machine. It automatically uses your GPU for acceleration if available and falls back to CPU when a GPU isn't present, so it works across a wide range of systems.
  • Fast Testing and Development: A local setup means quicker iteration, easier debugging, and smoother experimentation without waiting on remote servers or rate limits.
  • No Cloud Dependency: With no need for internet access or cloud APIs, Ollama removes the reliance on third-party providers, making your workflow more stable and self-contained.

How to Install Ollama?

Installing Ollama is quick and straightforward. It works on macOS, Windows, and Linux.

Install Ollama for macOS and Linux (with Homebrew):

brew install ollama

Install Ollama For Windows:

  1. Visit the official website: https://ollama.com
  2. Download the Windows installer (.exe file)
  3. Run the installer and follow the setup steps

Alternative (manual download, all platforms):

Go to https://ollama.com/download and choose the right version for your operating system.

After installation, open a terminal and test it by running:

ollama --version

If the version number appears, Ollama is successfully installed and ready to use.


Basic Ollama CLI Commands

Ollama provides a simple command-line interface to help you manage and interact with language models on your local machine. Below are some of the most commonly used commands:

| Command | Description |
| --- | --- |
| ollama run <model> | Starts and runs a specified model |
| ollama list | Displays a list of installed models |
| ollama pull <model> | Downloads a model from the Ollama library |
| ollama create <name> -f Modelfile | Creates a custom model using a Modelfile |
| ollama serve | Starts the Ollama API server |
| ollama stop <model> | Stops a running model |

These commands form the foundation of how you interact with Ollama through the terminal, making it easy to manage and use models locally.

Ollama REST API Endpoints

Ollama provides a REST API so you can access and work with models through code. These endpoints let you generate text, manage models, create embeddings, and more, all running locally on your machine.

| Method | Endpoint | Purpose |
| --- | --- | --- |
| POST | /api/generate | Text generation |
| POST | /api/chat | Chat-style message handling |
| POST | /api/create | Create custom models |
| GET | /api/tags | List available (installed) models |
| DELETE | /api/delete | Delete a model |
| POST | /api/pull | Download a model |
| POST | /api/push | Upload a model |
| POST | /api/embed | Generate embeddings |
| GET | /api/ps | List running model processes |
| POST | /api/embeddings | Generate embeddings (legacy endpoint) |
| GET | /api/version | Get the current Ollama version |

These endpoints will allow you to easily incorporate Ollama into your apps, tools, or workflows with just a few HTTP requests.
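
For example, here's a minimal Python sketch (assuming the Ollama server is running on its default port, 11434) that lists your installed models via /api/tags and checks the server version via /api/version:

import requests

BASE_URL = "http://localhost:11434"  # default local Ollama server

# List installed models
tags = requests.get(f"{BASE_URL}/api/tags")
tags.raise_for_status()
for model in tags.json().get("models", []):
    print(model["name"])

# Check the Ollama version
version = requests.get(f"{BASE_URL}/api/version")
print("Ollama version:", version.json().get("version"))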

Suggested Read: What are Embedding Models in Machine Learning?

How to Run Ollama Models

Before using Ollama through code or APIs, you first need to install and run a supported model. Here's how to get started:

Step 1: Install and Run a Model

  1. Open your terminal.
  2. Choose a model from the Ollama model library: https://ollama.com/library
  3. Pull (install) the model you want to use. For example, to install llama3.2 (LLaMA 3.2):

ollama pull llama3.2

  4. Once it's downloaded, run it:

ollama run llama3.2

Now you can start chatting with the model directly in the terminal.

Step 2: Use the API with Python

If you want to use Python without any third-party SDKs like OpenAI's, you can make direct HTTP requests to Ollama's local server. Here’s how to do it:

import requests

# Ollama's local /api/generate endpoint (default port 11434)
url = "http://localhost:11434/api/generate"

payload = {
    "model": "llama3.2",
    "prompt": "What is LLM?",
    "stream": False,  # return the full response in a single JSON object
    "options": {      # generation settings go inside "options"
        "temperature": 0.2,
        "top_p": 0.7,
        "top_k": 30,
        "repeat_penalty": 1.1,
        "num_predict": 100  # maximum number of tokens to generate
    }
}

response = requests.post(url, json=payload)

if response.status_code == 200:
    result = response.json()
    print(result["response"])
else:
    print("Error:", response.status_code, response.text)
Run the script to see the model's response being generated.


This setup lets you use Ollama as a local LLM server and test different model behaviors with real API calls. If you don't want to keep the defaults, you can change the model name or adjust parameters such as temperature, top_p, and repeat_penalty in the script to shape how the model responds.
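
If you set "stream": True instead, Ollama sends the response back incrementally as newline-delimited JSON objects, which is useful for showing output as it's generated. Here's a minimal sketch of reading that stream (same local endpoint and model as above):

import json
import requests

url = "http://localhost:11434/api/generate"
payload = {
    "model": "llama3.2",
    "prompt": "What is LLM?",
    "stream": True,  # stream the answer chunk by chunk
    "options": {"temperature": 0.2}
}

with requests.post(url, json=payload, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)           # each line is one JSON chunk
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):              # final chunk marks completion
            print()
            break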

OpenAI Compatibility Setup with Ollama

Ollama is designed to follow the OpenAI API format (the same one used by ChatGPT-style tools). This means you can use Ollama as a local drop-in replacement in apps or tools that were originally built for those services, without changing much of your code.

Why OpenAI Compatibility Matters

  • Ollama follows the same structure for chat, completion, and embedding endpoints used by many leading LLM providers.
  • You can connect it with tools and frameworks like LangChain, LlamaIndex, and more.
  • Easily reuse your existing ChatGPT-style apps or backend code by simply switching the base URL to Ollama (http://localhost:11434).
  • It allows fast, offline testing and development with full control and no cloud dependency.

In this blog we’ll explore how to use OpenAI-compatible code and tools, since it's one of the most widely supported and easiest ways to get started with local LLMs.

from openai import OpenAI

# Point the OpenAI client at Ollama's local OpenAI-compatible endpoint
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required by the client, but not actually checked
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
    ],
    temperature=0.7,
    top_p=0.9,
    max_tokens=1000
)

print(response.choices[0].message.content)

Make sure the model is installed and running:

ollama run llama3.2
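
The same OpenAI-compatible endpoint also covers embeddings. Here's a small sketch, assuming you've already pulled an embedding model such as nomic-embed-text (ollama pull nomic-embed-text):

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required by the client, but not actually checked
)

# Generate an embedding vector for a piece of text
embedding = client.embeddings.create(
    model="nomic-embed-text",
    input="Ollama runs large language models locally."
)

print(len(embedding.data[0].embedding))  # dimensionality of the returned vector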

4 Ollama Limitations You Should Know

While Ollama is a powerful tool for running local LLMs, it does come with a few limitations to keep in mind:

  • High RAM Usage for Larger Models: Running bigger models like DeepSeek-R1 or Mixtral may require a lot of system memory, which can be a challenge on lower-end machines.
  • No Built-in GPU Support in Some Environments: GPU acceleration isn't available everywhere by default, so model performance can be slower, especially on CPU-only setups.
  • Limited Community or Contributed Models: Unlike platforms such as Hugging Face, Ollama currently has a smaller library of models and fewer community-made variations.
  • Not Meant for Large-Scale Production: Ollama is best suited for local testing, development, or personal use. While it can serve small-scale or low-traffic production setups, it is not optimized for large-scale, high-load, or enterprise-level deployments.

These limitations don’t affect most local development or testing needs, but they’re important to be aware of depending on your use case.

Conclusion

Ollama is an easy-to-deploy, high-performing tool for running AI language models locally on your desktop or server, with no cloud required. You can use it in several ways, from the command line, through its REST API, or in code, and it follows the same structure as widely used tools like ChatGPT.

It may need a fair amount of memory and doesn't support every advanced feature, but it's great for learning, testing, and building local AI projects. Ollama is a good place to start if you value privacy, control, and offline access to AI.

Dharshan

Passionate AI/ML Engineer with an interest in OpenCV, MediaPipe, and LLMs. Exploring computer vision and NLP to build smart, interactive systems.

