
How To Use Local LLMs with Ollama? (A Complete Guide)

Jul 3, 2025 · 6 Min Read
Written by Dharshan

AI tools like chatbots and content generators are everywhere. But usually, they run online using cloud services. What if you could run those smart AI models directly on your own computer, just like running a regular app? That’s what Ollama helps you do. 

In this blog, you’ll learn how to set it up, use it in different ways (like with terminal, code, or API), change some basic settings, and know what it can and can't do.

What is Ollama?

Ollama is a tool that lets you run large, powerful AI models on your own machine without relying on the internet. It handles downloading and running the models and lets you chat with them or get responses from them, much like ChatGPT.

You can interact with it through simple terminal commands, from code (Python and other languages), or through its REST API. That makes it a great fit for local experiments and personal projects: private, easy, and entirely under your control.

Why Use Ollama?

Ollama is aimed at developers and AI enthusiasts who want to run large language models locally, giving them more control and flexibility and less dependency on cloud services.

  • Run Models Offline with Full Control: Ollama lets you download and run models directly on your machine. It automatically uses your GPU for acceleration when one is available and falls back to the CPU when it isn't, so it works across a wide range of systems.
  • Fast Testing and Development: A local setup means quicker iteration, easier debugging, and smoother experimentation without waiting on remote servers or hitting rate limits.
  • No Cloud Dependency: With no need for the internet or cloud APIs, Ollama removes the reliance on third-party providers, making your workflow more stable and self-contained.

How to Install Ollama?

Installing Ollama is quick and straightforward. It works on macOS, Windows, and Linux.

Install Ollama for macOS and Linux (with Homebrew):

brew install ollama

Install Ollama for Windows:

  1. Visit the official website: https://ollama.com
  2. Download the Windows installer (.exe file)
  3. Run the installer and follow the setup steps

Alternative: manual download (all platforms):

Go to https://ollama.com/download and choose the right version for your operating system.

After installation, open a terminal and test it by running:

ollama --version

If the version number appears, Ollama is successfully installed and ready to use.
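
If you also want to confirm that the local API server is reachable (it needs to be running, for example via the desktop app or ollama serve), here is a small Python sketch that queries the /api/version endpoint on the default port:

import requests

# Ask the local Ollama server for its version (default address and port)
resp = requests.get("http://localhost:11434/api/version", timeout=5)
resp.raise_for_status()
print("Ollama server version:", resp.json()["version"])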


Basic Ollama CLI Commands

Ollama provides a simple command-line interface to help you manage and interact with language models on your local machine. Below are some of the most commonly used commands:

Command | Description
ollama run <model> | Starts and runs a specified model
ollama list | Displays a list of installed models
ollama pull <model> | Downloads a model from the Ollama library
ollama create <name> -f Modelfile | Creates a custom model using a Modelfile
ollama serve | Starts the Ollama API server
ollama stop <model> | Stops a running model

These commands form the foundation of how you interact with Ollama through the terminal, making it easy to manage and use models locally.

Ollama REST API Endpoints

Ollama provides a REST API so you can access and work with models from code. Its endpoints let you generate text, manage models, create embeddings, and more, all running locally on your machine.

Method | Endpoint | Purpose
POST | /api/generate | Text generation
POST | /api/chat | Chat-style message handling
POST | /api/create | Create custom models
GET | /api/tags | List available (installed) models
DELETE | /api/delete | Delete a model
POST | /api/pull | Download a model
POST | /api/push | Upload a model
POST | /api/embed | Generate embeddings
GET | /api/ps | List running model processes
POST | /api/embeddings | Generate embeddings (older endpoint, superseded by /api/embed)
GET | /api/version | Get the current Ollama version

These endpoints will allow you to easily incorporate Ollama into your apps, tools, or workflows with just a few HTTP requests.
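
For example, the /api/chat endpoint accepts a list of role-tagged messages, much like a chat UI. Here is a minimal Python sketch, assuming the server is running on the default port 11434 and the llama3.2 model is already pulled:

import requests

# Chat-style request against the local Ollama server
url = "http://localhost:11434/api/chat"

payload = {
    "model": "llama3.2",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain what an embedding is in one sentence."}
    ],
    "stream": False  # return the full reply as a single JSON object
}

response = requests.post(url, json=payload)
response.raise_for_status()

# With stream set to False, the reply lives in the "message" field
print(response.json()["message"]["content"])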

Suggested Reads: What are Embedding Models in Machine Learning?

How to Run Ollama Models

Before using Ollama through code or APIs, you first need to install and run a supported model. Here's how to get started:

Step 1: Install and Run a Model

  1. Open your terminal.
  2. Choose a model from the Ollama model library: https://ollama.com/library
  3. Pull (install) the model you want to use. For example, to install llama3.2 (LLaMA 3.2):
ollama pull llama3.2
  4. Once it's downloaded, run it:
ollama run llama3.2

Now you can start chatting with the model directly in the terminal.

Step 2: Use the API with Python

If you want to use Python without any third-party SDKs like OpenAI's, you can make direct HTTP requests to Ollama's local server. Here’s how to do it:

import requests
import json

url = "http://localhost:11434/api/generate"

payload = {
    "model": "llama3.2",
    "prompt": "What is LLM?",
    "stream": False,
    # Sampling parameters go inside "options" for the /api/generate endpoint
    "options": {
        "temperature": 0.2,
        "top_p": 0.7,
        "top_k": 30,
        "repeat_penalty": 1.1,
        "num_predict": 100  # maximum number of tokens to generate
    }
}

response = requests.post(url, json=payload)

if response.status_code == 200:
    result = response.json()
    print(result["response"])
else:
    print("Error:", response.status_code, response.text)
Run the script, and the model's generated response will be printed in your terminal.


This setup lets you use Ollama as a local LLM server and test different model behaviours with real API calls. If you don't want to stick with the defaults, change the model name or tuning parameters such as temperature, top_p, and repeat_penalty in the script to adjust how the model responds.
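
If you'd rather see tokens appear as they are generated (the way chat interfaces do), set "stream" to True and read the response line by line; the server sends one JSON object per line. A minimal sketch, assuming the same local server and model as above:

import requests
import json

url = "http://localhost:11434/api/generate"

payload = {
    "model": "llama3.2",
    "prompt": "Explain what a large language model is in two sentences.",
    "stream": True  # stream partial results instead of one final response
}

with requests.post(url, json=payload, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        # Each chunk carries a small piece of the answer in "response"
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            print()  # final newline once the model signals completion
            break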

OpenAI Compatibility Setup with Ollama

Ollama is designed to follow the OpenAI API format (the same one ChatGPT clients use). This means you can use Ollama as a local drop-in replacement in apps or tools that were originally built for those services, without changing much of your code.

Why OpenAI Compatibility Matters

  • Ollama follows the same structure for chat, completion, and embedding endpoints used by many leading LLM providers.
  • You can connect it with tools and frameworks like LangChain, LlamaIndex, and more.
  • Easily reuse your existing ChatGPT-style apps or backend code by simply switching the base URL to Ollama's local server (http://localhost:11434/v1).
  • It allows fast, offline testing and development with full control and no cloud dependency.

In this blog, we'll use the OpenAI-compatible approach, since it's one of the most widely supported and easiest ways to get started with local LLMs. Here's an example using the official OpenAI Python SDK, pointed at the local Ollama server:

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required, but not actually used
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
    ],
    # Only OpenAI-style parameters are accepted here; Ollama-specific options
    # such as top_k or num_ctx are not part of the OpenAI chat API
    temperature=0.7,
    top_p=0.9,
    max_tokens=1000
)

print(response.choices[0].message.content)

Make sure the model is installed and running:

ollama run llama3.2
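
Because the endpoint speaks the OpenAI format, frameworks that target OpenAI can usually point at it as well. As a rough sketch, here is how a LangChain connection might look, assuming you have installed the separate langchain-openai package (not part of Ollama itself):

from langchain_openai import ChatOpenAI

# Point LangChain's OpenAI chat wrapper at the local Ollama server
llm = ChatOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client, but ignored by Ollama
    model="llama3.2",
    temperature=0.7,
)

reply = llm.invoke("Summarise what Ollama does in one sentence.")
print(reply.content)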

4 Ollama Limitations You Should Know

While Ollama is a powerful tool for running local LLMs, it does come with a few limitations to keep in mind:

  • High RAM Usage for Larger Models: Running bigger models like DeepSeek-R1 or Mixtral can require a lot of system memory, which is a challenge on lower-end machines.
  • No Built-in GPU Support in Some Environments: GPU acceleration isn't available everywhere by default, so model performance can be slower, especially on CPU-only setups.
  • Limited Community or Contributed Models: Unlike platforms such as Hugging Face, Ollama currently offers a smaller library of models and fewer community-made variations.
  • Not Meant for Large-Scale Production: Ollama is best suited for local testing, development, or personal use. It can handle small-scale or low-traffic setups, but it is not optimized for large-scale, high-load, or enterprise-level deployments.

These limitations don’t affect most local development or testing needs, but they’re important to be aware of depending on your use case.

Conclusion

Ollama is an easy-to-deploy, high-performing tool for running AI language models locally on your desktop or server, without requiring the cloud. You can use it from the command line, through its REST API, or from code, and it follows the same structure as widely used tools like ChatGPT.

It may need a fair amount of memory for larger models and doesn't support every advanced feature, but it's great for learning, testing, and building local AI projects. Ollama is a good place to begin if you value privacy, control, and offline access to AI.

Dharshan

Passionate AI/ML Engineer with an interest in OpenCV, MediaPipe, and LLMs. Exploring computer vision and NLP to build smart, interactive systems.

