
How To Use Local LLMs with Ollama? (A Complete Guide)

Written by Dharshan
Oct 23, 2025
6 Min Read

AI tools like chatbots and content generators are everywhere. But usually, they run online using cloud services. What if you could run those smart AI models directly on your own computer, just like running a regular app? That’s what Ollama helps you do. 

In this blog, you’ll learn how to set it up, use it in different ways (through the terminal, code, or the API), change some basic settings, and understand what it can and can’t do.

What is Ollama?

Ollama is software that lets you run large, powerful AI models on your own machine without relying on the internet. It handles downloading and running the models, and lets you chat with them or get responses from them, much like ChatGPT.

You can interact with it through simple commands, programming languages such as Python, or API tools. It’s great for experimenting with AI locally and building your own projects, privately and easily.

Why Use Ollama?

Ollama is aimed at developers and AI enthusiasts who are interested in running large language models locally, so they have more control and flexibility, and less dependency on cloud services.

  • Run Models Offline with Full Control: Ollama lets you download and run models directly on your machine. It automatically uses your GPU for acceleration if available, and falls back to CPU when a GPU isn’t present, so it works across a wide range of systems.
  • Fast Testing and Development: A local setup means quicker iteration, easier debugging, and smoother experimentation without waiting for remote servers or hitting rate limits.
  • No Cloud Dependency: With no need for internet access or cloud APIs, Ollama removes the reliance on third-party providers, making your workflow more stable and self-contained.

How to Install Ollama?

Installing Ollama is quick and straightforward. It works on macOS, Windows, and Linux.

Install Ollama for macOS and Linux (with Homebrew):

brew install ollama

Install Ollama for Windows:

  1. Visit the official website: https://ollama.com
  2. Download the Windows installer (.exe file)
  3. Run the installer and follow the setup steps

Alternative: manual download (all platforms):

Go to https://ollama.com/download and choose the right version for your operating system.

After installation, open a terminal and test it by running:

ollama --version

If the version number appears, Ollama is successfully installed and ready to use.


Basic Ollama CLI Commands

Ollama provides a simple command-line interface to help you manage and interact with language models on your local machine. Below are some of the most commonly used commands:

Command                               Description
ollama run <model>                    Starts and runs a specified model
ollama list                           Displays a list of installed models
ollama pull <model>                   Downloads a model from the Ollama library
ollama create <name> -f Modelfile     Creates a custom model using a Modelfile
ollama serve                          Starts the Ollama API server
ollama stop <model>                   Stops a running model

These commands form the foundation of how you interact with Ollama through the terminal, making it easy to manage and use models locally.

Ollama REST API Endpoints

Ollama provides a RESTful API that lets you access and work with models through code. These endpoints let you generate text, manage models, create embeddings, and more, all running locally on your machine.

Method    Endpoint           Purpose
POST      /api/generate      Text generation
POST      /api/chat          Chat-style message handling
POST      /api/create        Create custom models
GET       /api/tags          List available (installed) models
DELETE    /api/delete        Delete a model
POST      /api/pull          Download a model
POST      /api/push          Upload a model
POST      /api/embed         Generate embeddings
GET       /api/ps            List running model processes
POST      /api/embeddings    Legacy embedding endpoint
GET       /api/version       Get the current Ollama version

These endpoints make it easy to incorporate Ollama into your apps, tools, or workflows with just a few HTTP requests.
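
For example, here is a minimal sketch in Python, assuming the Ollama server is running on its default port (11434) and a model such as llama3.2 has already been pulled. It lists the installed models and then sends a chat-style request:

import requests

BASE_URL = "http://localhost:11434"  # default Ollama server address

# List installed models (GET /api/tags)
models = requests.get(f"{BASE_URL}/api/tags").json()
print([m["name"] for m in models.get("models", [])])

# Send a chat-style request (POST /api/chat)
chat_payload = {
    "model": "llama3.2",  # assumes this model has already been pulled
    "messages": [
        {"role": "user", "content": "Explain embeddings in one sentence."}
    ],
    "stream": False
}
reply = requests.post(f"{BASE_URL}/api/chat", json=chat_payload).json()
print(reply["message"]["content"])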

Suggested Read: What are Embedding Models in Machine Learning?

How to Run Ollama Models

Before using Ollama through code or APIs, you first need to install and run a supported model. Here's how to get started:

Step 1: Install and Run a Model

  1. Open your terminal.
  2. Choose a model from the Ollama model library: https://ollama.com/library
  3. Pull (install) the model you want to use. For example, to install llama3.2 (LLaMA 3.2):

ollama pull llama3.2

  4. Once it's downloaded, run it:

ollama run llama3.2

Now you can start chatting with the model directly in the terminal.

Step 2: Use the API with Python

If you want to use Python without any third-party SDKs like OpenAI's, you can make direct HTTP requests to Ollama's local server. Here’s how to do it:

import requests

url = "http://localhost:11434/api/generate"

payload = {
    "model": "llama3.2",
    "prompt": "What is LLM?",
    "stream": False,
    # Sampling parameters go inside "options" in Ollama's native API
    "options": {
        "temperature": 0.2,
        "top_p": 0.7,
        "top_k": 30,
        "repeat_penalty": 1.1,
        "num_predict": 100  # maximum number of tokens to generate
    }
}

response = requests.post(url, json=payload)

if response.status_code == 200:
    result = response.json()
    print(result["response"])
else:
    print("Error:", response.status_code, response.text)
Run the script to see the model's response being generated.

This setup lets you use Ollama as a local LLM server and test different model behaviors with real API calls. If you don't want the defaults, you can change the model name or adjust parameters such as temperature, top_p, and repeat_penalty in the script to influence how the model responds.
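
By default the example above returns the full response at once ("stream": False). If you'd rather see tokens as they are generated, you can set "stream": True and read the response line by line. Here is a minimal sketch, assuming the same local endpoint and the llama3.2 model:

import requests
import json

url = "http://localhost:11434/api/generate"

payload = {
    "model": "llama3.2",
    "prompt": "What is LLM?",
    "stream": True,  # stream tokens as they are generated
    "options": {"temperature": 0.2}
}

# Each line of the streamed response is a separate JSON object
with requests.post(url, json=payload, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if line:
            chunk = json.loads(line)
            print(chunk.get("response", ""), end="", flush=True)
print()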

OpenAI Compatibility Setup with Ollama

Ollama is designed to follow the OpenAI API format (the same format used by ChatGPT). This means you can use Ollama as a local drop-in replacement in apps or tools that were originally built for those services, without changing much of your code.

Why OpenAI Compatibility Matters

  • Ollama follows the same structure for chat, completion, and embedding endpoints used by many leading LLM providers.
  • You can connect it with tools and frameworks like LangChain, LlamaIndex, and more.
  • Easily reuse your existing ChatGPT-style apps or backend code by simply switching the base URL to Ollama (http://localhost:11434/v1).
  • It allows fast, offline testing and development with full control and no cloud dependency.

In this blog we’ll explore how to use OpenAI-compatible code and tools, since it's one of the most widely supported and easiest ways to get started with open source LLMs.

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required by the client, but not actually checked by Ollama
)

# Only standard OpenAI-style parameters are accepted here.
# Ollama-specific options such as top_k, repeat_penalty, and num_ctx
# are set through Ollama's native API or a Modelfile instead.
response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
    ],
    temperature=0.7,
    top_p=0.9,
    max_tokens=1000
)

print(response.choices[0].message.content)

Make sure the model is installed and running:

ollama run llama3.2
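
The same OpenAI-compatible setup also covers embeddings. Below is a small sketch, assuming an embedding model such as nomic-embed-text has been pulled (ollama pull nomic-embed-text); any embedding model from the Ollama library should work the same way:

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required by the client, but not actually checked by Ollama
)

# Generate an embedding vector for a piece of text
embedding = client.embeddings.create(
    model="nomic-embed-text",  # assumed embedding model; swap in any pulled embedding model
    input="Ollama runs large language models locally."
)

print(len(embedding.data[0].embedding))  # dimensionality of the vector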

4 Ollama Limitations You Should Know

While Ollama is a powerful tool for running local LLMs, it does come with a few limitations to keep in mind:

  • High RAM Usage for Larger Models: Running bigger models like DeepSeek-R1 or Mixtral may require a lot of system memory, which can be a challenge on lower-end machines.
  • No Built-in GPU Support in Some Environments: GPU acceleration isn’t available everywhere by default, which means model performance might be slower, especially on CPU-only setups.
  • Limited Community or Contributed Models: Unlike platforms like Hugging Face and frameworks such as Transformers, vLLM, and SGLang, Ollama currently has a smaller library of models and fewer community-made variations.
  • Not Meant for Large-Scale Production: Ollama is best suited for local testing, development, or personal use. While it can be used for small-scale or low-traffic production setups, it is not optimized for large-scale, high-load, or enterprise-level deployments.

These limitations don’t affect most local development or testing needs, but they’re important to be aware of depending on your use case.

Conclusion

Ollama is an easy-to-deploy, well-performing tool for running AI language models locally on your desktop or server without requiring the cloud. It can be used in many ways, from the command line, through APIs, or in code, and it follows the same structure as widely used tools like ChatGPT.

It may require extra memory and doesn’t support every advanced feature, but it’s great for learning, testing, and building local AI projects. Ollama is a good place to begin if you value privacy, control, and offline access to AI.

Dharshan

Passionate AI/ML Engineer with interest in OpenCV, MediaPipe, and LLMs. Exploring computer vision and NLP to build smart, interactive systems.

