
How to Use Ollama to Run LLMs Locally (Step-by-Step Guide)

Written by Dharshan
Feb 10, 2026
9 Min Read

Running large language models locally is becoming increasingly popular, especially among developers who care deeply about privacy, predictable costs, and having full control over their AI stack.

I started exploring Ollama for this exact reason. I wanted to run powerful LLMs directly on my own machine, interact with them through the terminal, APIs, and Python, and avoid relying on cloud services for everyday experimentation and development.

In this guide, you’ll learn how to use Ollama to run LLMs locally, start the Ollama server, manage models, use common Ollama commands, and connect Ollama to other tools using its local LLM API. Whether you’re exploring AI for learning, experimentation, personal projects, or application development, this guide will help you get started with full control over your models and data.

How We Tested Ollama Locally

To create this guide, I installed Ollama on both macOS and Windows systems and ran multiple local LLMs, such as LLaMA 3.2, directly on my own machines. I tested core Ollama commands, started the local server using ollama serve, interacted with models via the CLI, and made REST API calls from Python to validate real outputs.

Every example in this article is based on actual local execution on my system, not simulated responses or theoretical setups.

What is Ollama?

Ollama is a tool that lets you run a large language model locally on your machine, which is exactly why I started using it for hands-on experimentation and private development workflows. An Ollama LLM runs entirely on your system, giving you full control over data, models, and execution without relying on cloud-based AI services or an internet connection. It handles downloading and running the models and allows you to chat with them or get responses from them, much like ChatGPT.

You can interact with it through simple commands, programming tools like Python, or APIs. I’ve found it especially useful for learning, experimentation, private projects, and offline AI workflows where control and transparency matter.

If you’re wondering how to run LLM locally without relying on cloud platforms, Ollama provides one of the simplest and most accessible ways to get started.

Why Use Ollama?

Ollama is aimed at developers and AI enthusiasts who want to run large language models locally, especially those who’ve felt the friction of cloud limits, API costs, and restricted experimentation. It gives you more control and flexibility, with less dependency on cloud services.

Run Models Offline with Full Control: Ollama lets you download and run models directly on your machine. It automatically uses your GPU for acceleration if available, and falls back to CPU when GPU isn’t present, ensuring it works across a wide range of systems.

Fast Testing and Development: In my experience, local setup means quicker iteration, easier debugging, and smoother experimentation without waiting on remote servers or hitting rate limits.

No Cloud Dependency: With no need for internet or cloud APIs, Ollama removes the reliance on third-party providers, making your workflow more stable and self-contained.

How to Install and Start Ollama Locally

Installing Ollama is quick and straightforward. It works on macOS, Windows, and Linux.

Install Ollama for macOS and Linux (with Homebrew):

brew install ollama

Install Ollama For Windows:

  1. Visit the official website: https://ollama.com
  2. Download the Windows installer (.exe file)
  3. Run the installer and follow the setup steps

Once installed, Ollama runs locally and can be started using the ollama serve command.

Alternative: manual download (all platforms):

Go to https://ollama.com/download and choose the right version for your operating system.

After installation, open a terminal and test it by running:

ollama --version

If the version number appears, Ollama is successfully installed and ready to use.

Essential Ollama Commands to Run Local LLMs

Ollama commands cheatsheet to run local LLMs

Ollama provides a simple command-line interface to help you manage and interact with language models on your local machine. Below are some of the most commonly used commands:

  • ollama run <model>: starts and runs a specified model
  • ollama list: displays a list of installed models
  • ollama pull <model>: downloads a model from the Ollama library
  • ollama create <name> -f Modelfile: creates a custom model using a Modelfile
  • ollama serve: starts the Ollama API server
  • ollama stop <model>: stops a running model


These commands form the foundation of how you interact with Ollama through the terminal, making it easy to manage and use models locally.
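These CLI commands can also be scripted. As a minimal sketch (assuming the ollama binary is installed and on your PATH; the helper names ollama_args and run_ollama are my own, not part of Ollama), here is how you might drive them from Python using only the standard library:

```python
import subprocess

def ollama_args(command, *extra):
    """Build the argv list for an Ollama CLI call, e.g. ["ollama", "pull", "llama3.2"]."""
    return ["ollama", command, *extra]

def run_ollama(command, *extra):
    """Run an Ollama CLI command and return its stdout (raises on a non-zero exit)."""
    result = subprocess.run(
        ollama_args(command, *extra),
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    try:
        # e.g. show installed models, same as typing `ollama list` in a terminal
        print(run_ollama("list"))
    except FileNotFoundError:
        print("ollama binary not found on PATH")
```

This is handy for automation scripts that need to pull or start models before calling the API.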


Ollama Rest API Endpoints

Ollama provides a REST API so you can access and work with models through code. These endpoints let you generate text, manage models, create embeddings, and more, all running locally on your machine.

  • POST /api/generate: text generation
  • POST /api/chat: chat-style message handling
  • POST /api/create: create custom models
  • GET /api/tags: list available (installed) models
  • DELETE /api/delete: delete a model
  • POST /api/pull: download a model
  • POST /api/push: upload a model
  • POST /api/embed: generate embeddings
  • GET /api/ps: list running model processes
  • POST /api/embeddings: legacy embedding endpoint (superseded by /api/embed)
  • GET /api/version: get the current Ollama version


These endpoints will allow you to easily incorporate Ollama into your apps, tools, or workflows with just a few HTTP requests.
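To see how one of these endpoints behaves in code, here is a small sketch that calls GET /api/tags and extracts the installed model names. The helper names are mine, and the sample dictionary only illustrates the response shape (a JSON object with a "models" array):

```python
import json
import urllib.request

def extract_model_names(tags_json):
    """Pull the model names out of an /api/tags response body."""
    return [m["name"] for m in tags_json.get("models", [])]

def list_installed_models(base_url="http://localhost:11434"):
    """GET /api/tags from a running Ollama server and return installed model names."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return extract_model_names(json.loads(resp.read()))

# Illustrative (abridged) response shape for GET /api/tags:
sample = {"models": [{"name": "llama3.2:latest"}, {"name": "mistral:latest"}]}
print(extract_model_names(sample))  # -> ['llama3.2:latest', 'mistral:latest']
```

With the server running, list_installed_models() returns the same names you would see from ollama list.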

Suggested Read: What are Embedding Models in Machine Learning?

How to Run LLMs Locally Using Ollama

Before using Ollama through code or APIs, it’s important to understand how to start Ollama, run it as a local LLM server, and load a supported model on your machine.

Ollama runs models locally on your machine and exposes them through a local server. Make sure the Ollama service is running before you interact with models or APIs.

Step 1: Install and Run a Model

  1. Open your terminal.
  2. Choose a model from the Ollama model library: https://ollama.com/library
  3. Pull (install) the model you want to use. For example, to install LLaMA 3.2:

ollama pull llama3.2

  4. Start the Ollama local LLM server:

ollama serve

  5. Run the model:

ollama run llama3.2

Output of local LLMs with Ollama

Once the model starts, you can chat with the LLM directly from your terminal. This confirms that Ollama is running correctly as a local LLM environment.

Ollama Python Integration Using the Local LLM API Server

Ollama exposes a local LLM server, allowing you to run an Ollama local LLM directly on your machine and access it from Python without relying on any external cloud services. This makes it ideal for building private, offline AI applications using a simple REST-based integration.

The Ollama API runs locally on your machine and allows you to send prompts, generate text, and control model behavior through HTTP requests.

Below is the exact setup I used to run Ollama as a local LLM backend inside a Python application during my testing.

Example: Using Ollama’s Local LLM API Server With Python

import requests

url = "http://localhost:11434/api/generate"

payload = {
    "model": "llama3.2",
    "prompt": "What is a large language model?",
    "stream": False,
    # Sampling parameters go inside the "options" object in Ollama's API
    "options": {
        "temperature": 0.2,
        "top_p": 0.7,
        "top_k": 30,
        "repeat_penalty": 1.1,
        "num_predict": 100,  # maximum number of tokens to generate
    },
}

response = requests.post(url, json=payload)

if response.status_code == 200:
    result = response.json()
    print(result["response"])
else:
    print("Error:", response.status_code, response.text)

This example reflects how I used Ollama as a drop-in local LLM backend for Python applications while validating real responses on my machine. The request is sent to Ollama’s local API server, which processes the prompt and returns the generated response from the model running on your machine.

In your script, you can change the model name or adjust options such as temperature, top_p, top_k, repeat_penalty, and num_predict (all nested under the "options" field) to control how the local LLM responds.
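The same /api/generate endpoint also supports streaming: with "stream": true, Ollama returns one JSON object per line, each carrying a "response" text fragment and a "done" flag. A minimal stdlib-only sketch (the helper names are my own):

```python
import json
import urllib.request

def chunk_text(line):
    """Extract the text fragment from one NDJSON line of a streaming response."""
    obj = json.loads(line)
    return "" if obj.get("done") else obj.get("response", "")

def stream_generate(prompt, model="llama3.2", base_url="http://localhost:11434"):
    """Stream a /api/generate response, printing fragments as they arrive."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # one JSON object per line
            print(chunk_text(line), end="", flush=True)
    print()

# Parsing a sample stream line (illustrative shape only):
print(chunk_text(b'{"response": "Hello", "done": false}'))
```

Streaming is useful for chat-style UIs where you want tokens to appear as they are generated instead of waiting for the full response.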

How Ollama Serve Works as a Local LLM Server

When you run ollama serve, Ollama starts a local LLM server on your machine.

This server manages model loading, inference, and request handling in the background.

Once running, Ollama exposes a local API (default: http://localhost:11434) that allows you to:

  • Chat with models from the CLI
  • Send requests via REST APIs
  • Integrate Ollama into Python applications or other tools

This architecture is what turns Ollama into a local LLM API server, enabling offline, private AI workflows.
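Before sending requests, it can help to confirm the server is actually up. Here is a small sketch (the server_running helper is my own) that probes the GET /api/version endpoint:

```python
import json
import urllib.error
import urllib.request

def server_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an Ollama server answers GET /api/version at base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if server_running():
    print("Ollama server is up")
else:
    print("Start it first with: ollama serve")
```

Dropping a check like this at the top of a script gives a clearer error message than a raw connection failure mid-request.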

OpenAI Compatibility Setup with Ollama

Ollama exposes an OpenAI-compatible API, following the same request and response format used by ChatGPT-style services. This means you can use Ollama as a local drop-in replacement in apps or tools that were originally built for those services without changing much of your code.

Why OpenAI Compatibility Matters

  • Ollama follows the same structure for chat, completion, and embedding endpoints used by many leading LLM providers.
  • You can connect it with tools and frameworks like LangChain, LlamaIndex, and more.
  • You can reuse existing ChatGPT-style apps or backend code by switching the base URL to Ollama’s OpenAI-compatible endpoint (http://localhost:11434/v1).
  • It allows fast, offline testing and development with full control and no cloud dependency.

In this blog we’ll explore how to use OpenAI-compatible code and tools, since it's one of the most widely supported and easiest ways to get started with open source LLMs.

Make sure the model is installed and running:

ollama run llama3.2
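As a sketch of the OpenAI-compatible route (the chat and chat_payload helpers are my own, and this uses the standard library rather than the official openai client), a chat completion against /v1/chat/completions looks like this:

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible prefix

def chat_payload(model, user_message, temperature=0.2):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }

def chat(model, user_message):
    """POST to /v1/chat/completions and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(chat_payload(model, user_message)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

if __name__ == "__main__":
    try:
        print(chat("llama3.2", "Say hello in five words."))
    except OSError:
        print("Ollama server not reachable at", BASE_URL)
```

Because the payload and response shapes match the OpenAI format, the same code works against existing OpenAI-compatible tooling by only changing the base URL.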

4 Ollama Limitations You Should Know

While Ollama is a powerful tool for running local LLMs, I did run into a few limitations that are worth keeping in mind before adopting it.


High RAM Usage for Larger Models: Running bigger models like DeepSeek-R1 or Mixtral may require a lot of system memory, which can be a challenge on lower-end machines.

No Built-in GPU Support in Some Environments: GPU acceleration isn’t available everywhere by default, which means model performance might be slower, especially on CPU-only setups.

Limited Community or Contributed Models: Unlike platforms like Hugging Face and frameworks such as Transformers, vLLM, and SGLang, Ollama currently has a smaller library of models and fewer community-made variations.

Not Meant for Large-Scale Production: Ollama is best suited for local testing, development, or personal use. While it can be used for small-scale or low-traffic production setups, it is not optimized for large-scale, high-load, or enterprise-level deployments.

These limitations don’t affect most local development or testing needs, but they’re important to be aware of, depending on your use case.

End-to-End Workflow: How to Use Ollama to Run LLMs Locally

Here’s how a complete Ollama workflow looks from start to finish:

  1. Install and start Ollama on your machine using the official installer or Homebrew.
  2. Run ollama serve to start Ollama as a local LLM server.
  3. Pull a model such as LLaMA using ollama pull llama3.2.
  4. Run the model locally via the CLI using ollama run.
  5. Send prompts programmatically using the Ollama REST API or Python integration.
  6. Tune parameters like temperature, top-p, and max tokens to control model behavior.

This workflow allows you to run LLMs locally with full control, privacy, and no dependency on cloud APIs.

FAQ

How to use Ollama to run LLMs locally?

To use Ollama, first install it on your machine and start the local LLM server using ollama serve. Then pull a supported model such as llama3.2 and run it with ollama run. Once running, you can interact with the model via the terminal, REST API, or Python integration.

How do I start Ollama and run a local LLM server?

After installing Ollama, start the local LLM server by running ollama serve in your terminal. This launches a local API server on your machine. You can then run models using ollama run <model-name> or send requests to the local API from applications.

What are the most common Ollama commands?

Some commonly used Ollama commands include:

  • ollama pull <model> to download a model
  • ollama run <model> to start a model
  • ollama list to view installed models
  • ollama serve to start the local LLM API server
  • ollama stop <model> to stop a running model

These commands allow you to manage and run local LLMs efficiently.

Can I use Ollama with Python?

Yes. Ollama provides a local LLM API server that can be accessed directly from Python using HTTP requests. This allows you to build Python applications that generate text, chat with models, or control inference parameters without relying on cloud-based APIs.

Is Ollama suitable for running LLMs offline?

Yes. Ollama is designed to run LLMs locally on your machine without an internet connection once models are downloaded. This makes it ideal for privacy-sensitive projects, offline experimentation, and local development workflows.

What are the limitations of running LLMs locally with Ollama?

Running LLMs locally may require significant system resources, especially RAM and disk space for larger models. Ollama is best suited for development, testing, and small-scale deployments rather than large, high-traffic production environments.

Conclusion

Ollama turned out to be one of the easiest and most practical ways I’ve found to run AI language models locally on my desktop without relying on the cloud. You can use it from the command line, through its REST API, or in code, and its OpenAI-compatible interface mirrors the structure of widely used tools like ChatGPT.

It can demand a fair amount of memory and doesn’t support every advanced feature, but it’s well suited to learning, testing, and building local AI projects. If you value privacy, control, and offline access to AI, as I do, Ollama is a strong place to begin.

Dharshan

Passionate AI/ML Engineer with interest in OpenCV, MediaPipe, and LLMs. Exploring computer vision and NLP to build smart, interactive systems.
