Blogs/AI

How to Integrate Local LLMs With Ollama and Python

Written by Dharshan
Apr 21, 2026
8 Min Read
How to Integrate Local LLMs With Ollama and Python Hero

Running large language models locally is becoming a popular choice for developers who want better privacy, predictable costs, and full control over their AI stack. Instead of depending entirely on cloud APIs, local models offer faster testing, offline access, and more flexible development workflows.

Ollama makes this process much easier by helping you download, run, and manage local LLMs from your own machine. It also supports terminal commands, REST APIs, and Python integration, making it useful for both experimentation and real applications.

In this guide, I’ll show you how to integrate local LLMs with Ollama and Python, run models locally, and start building with your own private AI environment.

How We Tested Ollama Locally

To create this guide, I installed Ollama on both macOS and Windows systems and tested multiple local LLMs, including LLaMA 3.2, directly on my machines. I ran core Ollama commands, started the server with ollama serve, interacted through the CLI, and sent REST API requests using Python.

Every example in this article is based on real local execution and tested outputs, not theoretical setups.

What is Ollama?

Ollama is a tool that lets you run large language models locally on your own machine. It downloads, manages, and runs models directly on your system, giving you full control over data, privacy, and execution without relying on cloud AI services.

You can interact with Ollama through terminal commands, APIs, or programming languages like Python, making it ideal for learning, experimentation, private projects, and offline AI workflows. If you want a simple way to run LLMs locally, Ollama is one of the easiest places to start.

Why Use Ollama?

Ollama is built for developers and AI enthusiasts who want to run large language models locally with more privacy, flexibility, and less dependence on cloud services. It is especially useful for anyone frustrated by API costs, rate limits, or restricted experimentation.

Run Models Offline with Full Control

Ollama lets you download and run models directly on your machine. It uses your GPU when available and falls back to CPU when needed, making it accessible across different systems.

Faster Testing and Development

Local models allow quicker iteration, easier debugging, and smoother experimentation without waiting on remote servers or usage limits.

No Cloud Dependency

Because everything runs locally, you are not dependent on internet access or third-party AI providers, giving you a more stable and self-contained workflow.

How to Install and Start Ollama Locally

Installing Ollama is quick and works on macOS, Windows, and Linux. Once installed, you can start running local LLMs directly from your machine.

Install Ollama for macOS and Linux

Using Homebrew:

brew install ollamaCopy

Install Ollama on Windows

  1. Visit the official Ollama website.
  2. Download the Windows installer (.exe).
  3. Run the installer and complete the setup.

Start Ollama Locally

After installation, start the local Ollama server with:

ollama serve

Once running, Ollama is ready to load models, accept commands, and connect with Python or local APIs.

Alternative (for all platforms using manual download) Ollama:

Go to https://ollama.com/download and choose the right version for your operating system.

After installation, open a terminal and test it by running:

ollama --version

If the version number appears, Ollama is successfully installed and ready to use.

Running Local LLMs with Ollama
Understand how Ollama hosts open-weight LLMs locally. Learn model management, quantization, and prompt tuning.
Murtuza Kutub
Murtuza Kutub
Co-Founder, F22 Labs

Walk away with actionable insights on AI adoption.

Limited seats available!

Calendar
Saturday, 30 May 2026
10PM IST (60 mins)

Essential Ollama Commands to Run Local LLMs

Ollama commands cheatsheet to run local LLMs

Ollama provides a simple command-line interface to help you manage and interact with language models on your local machine. Below are some of the most commonly used commands:

CommandDescription

ollama run <model>

Starts and runs a specified model

ollama list

Displays a list of installed models

ollama pull <model>

Downloads a model from the Ollama library

ollama create <name> -f Modelfile

Creates a custom model using a Modelfile

ollama serve

Starts the Ollama API server

ollama stop <model>

Stops a running model

ollama run <model>

Description

Starts and runs a specified model

1 of 6

These commands form the foundation of how you interact with Ollama through the terminal, making it easy to manage and use models locally.

Ollama Rest API Endpoints

Ollama provides a RESTful api to be able to access and play with models through code. These are endpoints that allow you to process text, manage models, generate embeddings, and more, all of which work locally on your machine

MethodEndpointPurpose

POST

/api/generate

Text generation

POST

/api/chat

Chat-style message handling

POST

/api/create

Create custom models

GET

/api/tags

List available (installed) models

DELETE

/api/delete

Delete a model

POST

/api/pull

Download a model

POST

/api/push

Upload a model

POST

/api/embed

Generate embeddings

GET

/api/ps

List running model processes

POST

/api/embeddings

OpenAI-compatible embedding endpoint

GET

/api/version

Get the current Ollama version

POST

Endpoint

/api/generate

Purpose

Text generation

1 of 11

These endpoints will allow you to easily incorporate Ollama into your apps, tools, or workflows with just a few HTTP requests.

Suggested Reads- What are Embedding Models in Machine Learning?

How to Run LLMs Locally Using Ollama

Before using Ollama through code or APIs, it’s important to understand how to start Ollama, run it as a local LLM server, and load a supported model on your machine.

Ollama runs models locally on your machine and exposes them through a local server. Make sure the Ollama service is running before you interact with models or APIs.

Step 1: Install and Run a Model

  1. Open your terminal.
  2. Choose a model from the Ollama model library: https://ollama.com/library
  3. Pull (install) the model you want to use. For example, to install LLaMA 3.2:
ollama pull llama3.2
  1. Start the Ollama local LLM server:

ollama serve

  1. Run the model:
ollama run llama3.2
Output of local LLMs with Ollama

Once the model starts, you can chat with the LLM directly from your terminal.This confirms that Ollama is running correctly as a local LLM environment.

Ollama Python Integration Using the Local LLM API Server

Ollama exposes a local LLM server, allowing you to run an Ollama local LLM directly on your machine and access it from Python without relying on cloud services. Without relying on any external cloud services. This makes it ideal for building private, offline AI applications using a simple REST-based integration.

The Ollama API runs locally on your machine and allows you to send prompts, generate text, and control model behavior through HTTP requests.

Below is the exact setup I used to run Ollama as a local LLM backend inside a Python application during my testing.

Example: Using Ollama’s Local LLM API Server With Python

import requests
import json

url = "http://localhost:11434/api/generate"

payload = {
    "model": "llama3.2",
    "prompt": "What is a large language model?",
    "temperature": 0.2,
    "top_p": 0.7,
    "top_k": 30,
    "repeat_penalty": 1.1,
    "max_tokens": 100,
    "stream": False
}

response = requests.post(url, json=payload)

if response.status_code == 200:
    result = response.json()
    print(result["response"])
else:
    print("Error:", response.status_code, response.text)

This example reflects how I used Ollama as a drop-in local LLM backend for Python applications while validating real responses on my machine. The request is sent to Ollama’s local API server, which processes the prompt and returns the generated response from the model running on your machine.

In your script, you can change the model name or adjust parameters such as temperature, top_p, top_k, and repeat_penalty to control how the local LLM responds.

How Ollama Serve Works as a Local LLM Server

When you run ollama serve, Ollama starts a local LLM server on your machine. It handles model loading, inference, and request processing in the background.

Once active, Ollama exposes a local API by default at http://localhost:11434, allowing you to:

  • Chat with models from the terminal
  • Send prompts through REST APIs
  • Integrate Ollama with Python applications or other tools

This is what makes Ollama a practical local LLM API server for private, offline AI workflows.

OpenAI Compatibility Setup with Ollama

Ollama is designed to follow the OpenAI API format (like ChatGPT).This means you can use Ollama as a local drop-in replacement in apps or tools that were originally built for those services without changing much of your code.

Why OpenAI Compatibility Matters

Ollama follows the same structure for chat, completion, and embedding endpoints used by many leading LLM providers.

You can connect it with tools and frameworks like LangChain, LlamaIndex, and more.

Easily reuse your existing ChatGPT-style apps or backend code by simply switching the base URL to Ollama (http://localhost:11434).

It allows fast, offline testing and development with full control and no cloud dependency.

In this blog we’ll explore how to use OpenAI-compatible code and tools, since it's one of the most widely supported and easiest ways to get started with open source LLMs.

Running Local LLMs with Ollama
Understand how Ollama hosts open-weight LLMs locally. Learn model management, quantization, and prompt tuning.
Murtuza Kutub
Murtuza Kutub
Co-Founder, F22 Labs

Walk away with actionable insights on AI adoption.

Limited seats available!

Calendar
Saturday, 30 May 2026
10PM IST (60 mins)

Make sure the model is installed and running:

ollama run llama3.2

4 Ollama Limitations You Should Know

While Ollama is a powerful tool for running local LLMs, I did run into a few limitations that are worth keeping in mind before adopting it.

High RAM Usage for Larger Models: Running bigger models like DeepSeek-R1 or Mixtral may require a lot of system memory, which can be a challenge on lower-end machines.

No Built-in GPU Support in Some Environments: GPU acceleration isn’t available everywhere by default, which means model performance might be slower, especially on CPU-only setups.

Limited Community or Contributed Models: Unlike platforms like Hugging Face and frameworks such as Transformers, vLLM, and SGLang, Ollama currently has a smaller library of models and fewer community-made variations.

Not Meant for Large-Scale Production: Ollama is best suited for local testing, development, or personal use. While it can be used for small-scale or low-traffic production setups, it is not optimized for large-scale, high-load, or enterprise-level deployments.

These limitations don’t affect most local development or testing needs, but they’re important to be aware of, depending on your use case

End-to-End Workflow: How to Use Ollama to Run LLMs Locally

Here’s how a complete Ollama workflow looks from start to finish:

  1. Install and start Ollama on your machine using the official installer or Homebrew.
  2. Run ollama serve to start Ollama as a local LLM server.
  3. Pull a model such as LLaMA using ollama pull llama3.2.
  4. Run the model locally via the CLI using ollama run.
  5. Send prompts programmatically using the Ollama REST API or Python integration.
  6. Tune parameters like temperature, top-p, and max tokens to control model behavior.

This workflow allows you to run LLMs locally with full control, privacy, and no dependency on cloud APIs.

Conclusion

Ollama is one of the easiest ways to run large language models locally without relying on cloud services. It supports terminal usage, APIs, and Python integration, making it a practical choice for developers, learners, and private AI projects.

While larger models may require more system resources, Ollama is excellent for testing, experimentation, and building offline workflows. If you value privacy, control, and local AI access, Ollama is a strong place to start.

FAQ

How to use Ollama to run LLMs locally?

To use Ollama, first install it on your machine and start the local LLM server using ollama serve. Then pull a supported model such as llama3.2 and run it with ollama run. Once running, you can interact with the model via the terminal, REST API, or Python integration.

How do I start Ollama and run a local LLM server?

After installing Ollama, start the local LLM server by running ollama serve in your terminal. This launches a local API server on your machine. You can then run models using ollama run <model-name> or send requests to the local API from applications.

What are the most common Ollama commands?

Some commonly used Ollama commands include:

  • ollama pull <model> to download a model
  • ollama run <model> to start a model
  • ollama list to view installed models
  • ollama serve to start the local LLM API server
  • ollama stop <model> to stop a running model

These commands allow you to manage and run local LLMs efficiently.

Can I use Ollama with Python?

Yes. Ollama provides a local LLM API server that can be accessed directly from Python using HTTP requests. This allows you to build Python applications that generate text, chat with models, or control inference parameters without relying on cloud-based APIs.

Is Ollama suitable for running LLMs offline?

Yes. Ollama is designed to run LLMs locally on your machine without an internet connection once models are downloaded. This makes it ideal for privacy-sensitive projects, offline experimentation, and local development workflows.

What are the limitations of running LLMs locally with Ollama?

Running LLMs locally may require significant system resources, especially RAM and disk space for larger models. Ollama is best suited for development, testing, and small-scale deployments rather than large, high-traffic production environments.

Author-Dharshan
Dharshan

Passionate AI/ML Engineer with interest in OpenCV, MediaPipe, and LLMs. Exploring computer vision and NLP to build smart, interactive systems.

Share this article

Phone

Next for you

3,000 Tokens/Sec on Two RTX 4090s for Free Cover

AI

May 22, 2026 • 7 min read

3,000 Tokens/Sec on Two RTX 4090s for Free

We had 475,000 candidate profiles to synthesise for HuntVox, our internal tool. The data came from multiple sources, including LinkedIn, Weekday, resume parsing pipelines, and Lemlist, resulting in duplicate fields, inconsistent formats, and noisy profile information. Our goal was simple: convert raw profiles into semantic summaries, structured skills, and domain tags that could improve search quality and retrieval. At this scale, hosted APIs became difficult to justify. Rate limits reduced th

TRT-LLM vs vLLM vs SGLang: What to Choose in 2026 Cover

AI

May 15, 2026 • 11 min read

TRT-LLM vs vLLM vs SGLang: What to Choose in 2026

Running LLMs efficiently is one of the most important engineering challenges in today’s world. We need to choose the right inference engine. The wrong choice can mean slow responses, wasted GPU memory, and poor user experience. This blog documents what we learned after benchmarking three inference engines on a RTX 4090 server: NVIDIA TensorRT-LLM, vLLM, and SGLang. We explain not just the numbers, but why each engine behaves the way it does at the GPU level. What Are These Engines? Before co

Speculative Speculative Decoding Explained Cover

AI

May 25, 2026 • 12 min read

Speculative Speculative Decoding Explained

If you have worked with large language models in production, you have probably faced this problem: Models are powerful, but they are slow. Even with good GPUs, generating responses one token at a time adds latency. For real-world applications like chat systems, copilots, or voice assistants, this delay is noticeable and often unacceptable. Several techniques have been proposed to speed up inference. One of the most effective is speculative decoding, which uses a smaller model to guess the nex