
How To Use Local LLMs with Ollama? (A Complete Guide)

Written by Dharshan
Oct 23, 2025
6 Min Read

AI tools like chatbots and content generators are everywhere. But usually, they run online using cloud services. What if you could run those smart AI models directly on your own computer, just like running a regular app? That’s what Ollama helps you do. 

In this blog, you’ll learn how to set it up, use it in different ways (through the terminal, code, or the API), adjust basic settings, and understand what it can and can't do.

What is Ollama?

Ollama is software that lets you run large, powerful AI models on your own machine without relying on the internet. It handles downloading and running the models, and lets you chat with them or get responses from them, much like ChatGPT.

You can interact with it through simple terminal commands, from code (Python and other languages), or via its REST API. That makes it great for testing AI models locally and building your own projects, all while keeping things private and easy.

Why Use Ollama?

Ollama is aimed at developers and AI enthusiasts who want to run large language models locally, giving them more control and flexibility and less dependency on cloud services.

  • Run Models Offline with Full Control: Ollama lets you download and run models directly on your machine. It automatically uses your GPU for acceleration if available, and falls back to CPU when a GPU isn’t present, ensuring it works across a wide range of systems.
  • Fast Testing and Development: Local setup means quicker iteration, easier debugging, and smoother experimentation without waiting for remote servers or rate limits.
  • No Cloud Dependency: With no need for internet or cloud APIs, Ollama removes the reliance on third-party providers, making your workflow more stable and self-contained.

How to Install Ollama?

Installing Ollama is quick and straightforward. It works on macOS, Windows, and Linux.

Install Ollama for macOS and Linux (with Homebrew):

brew install ollama

Install Ollama For Windows:

  1. Visit the official website: https://ollama.com
  2. Download the Windows installer (.exe file)
  3. Run the installer and follow the setup steps

Alternative (manual download for all platforms):

Go to https://ollama.com/download and choose the right version for your operating system.

After installation, open a terminal and test it by running:

ollama --version

If the version number appears, Ollama is successfully installed and ready to use.
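
On macOS and Windows, the desktop app normally starts Ollama's background server for you. If the server isn't running (for example, after installing with Homebrew on Linux), you can start it manually:

ollama serve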


Basic Ollama CLI Commands

Ollama provides a simple command-line interface to help you manage and interact with language models on your local machine. Below are some of the most commonly used commands:

| Command | Description |
| --- | --- |
| ollama run <model> | Starts and runs a specified model |
| ollama list | Displays a list of installed models |
| ollama pull <model> | Downloads a model from the Ollama library |
| ollama create <name> -f Modelfile | Creates a custom model using a Modelfile |
| ollama serve | Starts the Ollama API server |
| ollama stop <model> | Stops a running model |

These commands form the foundation of how you interact with Ollama through the terminal, making it easy to manage and use models locally.
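
For example, a typical first session might look like this (the model name is just an example from the Ollama library):

ollama pull llama3.2       # download the model
ollama list                # confirm it is installed
ollama run llama3.2 "Explain what a large language model is in one sentence."
ollama stop llama3.2       # unload it from memory when you are done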

Ollama REST API Endpoints

Ollama provides a REST API so you can access and work with models from code. Its endpoints let you generate text, manage models, create embeddings, and more, all running locally on your machine.

| Method | Endpoint | Purpose |
| --- | --- | --- |
| POST | /api/generate | Text generation |
| POST | /api/chat | Chat-style message handling |
| POST | /api/create | Create custom models |
| GET | /api/tags | List available (installed) models |
| DELETE | /api/delete | Delete a model |
| POST | /api/pull | Download a model |
| POST | /api/push | Upload a model |
| POST | /api/embed | Generate embeddings |
| GET | /api/ps | List running model processes |
| POST | /api/embeddings | Legacy embeddings endpoint (superseded by /api/embed) |
| GET | /api/version | Get the current Ollama version |

These endpoints will allow you to easily incorporate Ollama into your apps, tools, or workflows with just a few HTTP requests.
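
For a quick test, here's one way to call the chat endpoint with curl (assuming the llama3.2 model from the next section is already pulled and the server is running on the default port 11434):

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "Why is the sky blue?"}
  ],
  "stream": false
}'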

Suggested Read: What are Embedding Models in Machine Learning?

How to Run Ollama Models

Before using Ollama through code or APIs, you first need to install and run a supported model. Here's how to get started:

Step 1: Install and Run a Model

  1. Open your terminal.
  2. Choose a model from the Ollama model library: https://ollama.com/library
  3. Pull (install) the model you want to use. For example, to install llama3.2 (LLaMA 3.2):
ollama pull llama3.2
  4. Once it's downloaded, run it:
ollama run llama3.2

Now you can start chatting with the model directly in the terminal.

Step 2: Use the API with Python

If you want to use Python without any third-party SDKs like OpenAI's, you can make direct HTTP requests to Ollama's local server. Here’s how to do it:

import requests

url = "http://localhost:11434/api/generate"

payload = {
    "model": "llama3.2",
    "prompt": "What is LLM?",
    "stream": False,
    # Sampling parameters go inside "options" for Ollama's native API
    "options": {
        "temperature": 0.2,
        "top_p": 0.7,
        "top_k": 30,
        "repeat_penalty": 1.1,
        "num_predict": 100  # maximum number of tokens to generate
    }
}

response = requests.post(url, json=payload)

if response.status_code == 200:
    result = response.json()
    print(result["response"])
else:
    print("Error:", response.status_code, response.text)
Run the script, and the model's generated response will be printed in your terminal.

This setup lets you use Ollama as a local LLM server and test different model behaviors with real API calls. If you don't want the defaults, you can change the model name or tune parameters such as temperature, top_p, and repeat_penalty in the script to shape how the model responds. If you'd rather receive the output token by token, see the streaming sketch below.
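
If you set "stream": True instead, Ollama sends the answer back incrementally as newline-delimited JSON objects. Here's a minimal sketch of handling that, assuming the same local server and llama3.2 model as above:

import json
import requests

url = "http://localhost:11434/api/generate"

payload = {
    "model": "llama3.2",
    "prompt": "What is LLM?",
    "stream": True  # ask Ollama to stream tokens as they are generated
}

# Each streamed line is a separate JSON object containing a "response" chunk
with requests.post(url, json=payload, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            print()
            break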

OpenAI Compatibility Setup with Ollama

Ollama is designed to follow the OpenAI API format (the same one ChatGPT-style apps use). This means you can use Ollama as a local drop-in replacement in apps or tools that were originally built for those services, without changing much of your code.

Why OpenAI Compatibility Matters

  • Ollama follows the same structure for chat, completion, and embedding endpoints used by many leading LLM providers.
  • You can connect it with tools and frameworks like LangChain, LlamaIndex, and more.
  • Easily reuse your existing ChatGPT-style apps or backend code by simply switching the base URL to Ollama (http://localhost:11434/v1).
  • It allows fast, offline testing and development with full control and no cloud dependency.

In this blog, we’ll use the OpenAI-compatible Python client, since it's one of the most widely supported and easiest ways to get started with open-source LLMs.

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required by the client, but not actually checked by Ollama
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
    ],
    # Standard OpenAI parameters that Ollama's /v1 endpoint understands
    temperature=0.7,
    top_p=0.9,
    max_tokens=1000
)

print(response.choices[0].message.content)

Ollama-specific options such as top_k, repeat_penalty, and num_ctx are not standard arguments of the OpenAI client; if you need them, use the native /api/generate endpoint shown earlier.

Make sure the model is installed and running:

ollama run llama3.2
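
The same compatibility layer also exposes an embeddings endpoint (/v1/embeddings). Here's a minimal sketch, assuming you have pulled an embedding model such as nomic-embed-text (ollama pull nomic-embed-text):

from openai import OpenAI

client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

# Ollama's /v1/embeddings mirrors the OpenAI embeddings API
result = client.embeddings.create(
    model="nomic-embed-text",  # example embedding model; pull it first
    input="Ollama runs large language models locally."
)

print(len(result.data[0].embedding))  # dimensionality of the returned vector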

4 Ollama Limitations You Should Know

While Ollama is a powerful tool for running local LLMs, it does come with a few limitations to keep in mind:

  • High RAM Usage for Larger Models: Running bigger models like DeepSeek-R1 or Mixtral may require a lot of system memory, which can be a challenge on lower-end machines.
  • No Built-in GPU Support in Some Environments: GPU acceleration isn’t available everywhere by default, which means model performance might be slower, especially on CPU-only setups.
  • Limited Community or Contributed Models: Unlike platforms like Hugging Face and frameworks such as Transformers, vLLM, and SGLang, Ollama currently has a smaller library of models and fewer community-made variations.
  • Not Meant for Large-Scale Production: Ollama is best suited for local testing, development, or personal use. While it can be used for small-scale or low-traffic production setups, it is not optimized for large-scale, high-load, or enterprise-level deployments.

These limitations don’t affect most local development or testing needs, but they’re important to be aware of depending on your use case.

Conclusion

Ollama is an easy-to-deploy, high-performing tool for running AI language models locally on your desktop or server, without requiring the cloud. You can use it from the command line, through its API, or from code, and it follows the same request structure as widely used tools like ChatGPT.

It may need extra memory for larger models and doesn't support every advanced feature, but it's well suited for learning, testing, and building local AI projects. If you value privacy, control, and offline access to AI, Ollama is a good place to begin.

Author: Dharshan

Passionate AI/ML Engineer with interest in OpenCV, MediaPipe, and LLMs. Exploring computer vision and NLP to build smart, interactive systems.
