As large language models become more widely adopted, developers are looking for flexible ways to integrate them without being tied to a single provider. Hugging Face’s newly introduced OpenAI-compatible API offers a practical solution, allowing you to run models like LLaMA, Mixtral, or DeepSeek using the same syntax as OpenAI’s Python client. According to Hugging Face, hundreds of models are now accessible using the OpenAI-compatible client across providers like Together AI, Replicate, and more.
In this article, you’ll learn how to set up and use the OpenAI-compatible interface step by step: from configuring your environment and authenticating your API key, to choosing the right model and provider, and making your first chat completion request. We’ll also look at how to compare different providers based on speed, cost, or availability, all without changing your existing code.
Start building with more flexibility, right from your existing codebase.
Hugging Face Inference Providers is a system that lets you run AI models from many different backends: Hugging Face's own servers, AWS, Azure, or third-party companies, all through one single interface. Instead of learning a separate API for each provider, you work with one consistent, unified method.
This is particularly helpful for developers who like to switch between providers based on performance, cost, or availability, but don't want to modify their code each time they do. Combine that with OpenAI compatibility, and you can write OpenAI-style code and run it against models hosted by any backend Hugging Face routes to.
Hugging Face recently introduced support for OpenAI-compatible APIs, allowing you to use calls like chat.completions.create() or embeddings.create() just as you would with the OpenAI Python client. The key difference is that instead of sending your request to OpenAI's servers, you point it to Hugging Face's API, which can route the call to a variety of models, both open and third-party. This makes it possible to plug in alternatives like Mixtral, Kimi, or LLaMA with minimal changes to your existing code.
To use OpenAI-style code with Hugging Face, you only need to update your API settings and model reference. This section walks you through the exact steps to get started, including how to select specific providers like Together AI or Replicate.
Unlike OpenAI, you must also specify which provider will run the model by adding a :provider suffix to the model name. This section shows exactly how to set it up.
pip install openai python-dotenv
Create a .env file and add your Hugging Face token:
HF_TOKEN=hf_your_token_here
Then, load it in your Python script:
from dotenv import load_dotenv
import os

load_dotenv()  # reads variables from the .env file into the environment
api_key = os.getenv("HF_TOKEN")
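If the token is missing, every request will fail with an authentication error, so it can help to fail fast. A small optional check (the error message here is just illustrative):

if not api_key:
    # Fail early with a clear message instead of a confusing 401 later.
    raise RuntimeError("HF_TOKEN is not set; add it to your .env file or export it in your shell.")

Next, create the client and point it at Hugging Face's router: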
from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",  # Hugging Face's OpenAI-compatible router
    api_key=api_key  # your Hugging Face token loaded above
)
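Optionally, you can sanity-check the connection before sending anything. The snippet below assumes the router exposes the standard OpenAI-style model listing endpoint; if it doesn't, skip straight to the chat request in the next step.

# Optional sanity check: print a few model IDs visible through the router.
# Assumes the router implements the standard OpenAI /v1/models listing.
models = client.models.list()
for model in list(models)[:5]:
    print(model.id)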
You must include the provider in the model name using the :provider format, for example: model-id:provider
You can explore available models here: https://huggingface.co/models. Each model's page also lists which inference providers currently serve it. Once you've chosen a model and provider, make your first chat completion request:
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1:together",  # the ":provider" suffix (here, Together AI) is required
    messages=[{"role": "user", "content": "Tell me a fun fact."}]
)
print(response.choices[0].message.content)
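Because the interface follows the OpenAI Python client, familiar options such as streaming should work the same way. Here is a sketch that streams the reply token by token, using the same illustrative model and provider as above:

# Stream the reply chunk by chunk instead of waiting for the full completion.
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1:together",
    messages=[{"role": "user", "content": "Tell me a fun fact."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()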
If you don’t want to specify a provider manually, you can use :auto — it will automatically select a supported provider.
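Based on that description, the same request with automatic provider selection would look like this (a sketch, assuming :auto behaves as described above):

# Let Hugging Face choose a supported provider for the model.
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1:auto",
    messages=[{"role": "user", "content": "Tell me a fun fact."}],
)
print(response.choices[0].message.content)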
Hugging Face's Inference Providers system gives you access to a wide range of AI models hosted by different backend providers, all through one unified API. When using the OpenAI-compatible interface, you are required to specify the provider by adding a suffix like :together or :replicate to the model name. This tells Hugging Face exactly where to route the request.
Each provider offers different strengths: some are optimized for speed, others for specific hardware, and some for cost-efficiency. Here's a list of the most commonly used providers you can access via Hugging Face:
| Provider | Suffix | Highlights |
| --- | --- | --- |
| Hugging Face | :hf-inference | Models hosted directly by Hugging Face |
| Together AI | :together | Fast LLM inference with sub-100 ms latency |
| Replicate | :replicate | Supports both text and image models |
| fal.ai | :fal-ai | Lightweight, fast response time |
| SambaNova | :sambanova | Enterprise-grade AI infrastructure |
| Groq | :groq | High-speed inference on custom silicon |
| Nscale | :nscale | Scalable inference with private model hosting |
| Cerebras | :cerebras | AI models running on wafer-scale compute |
To use any of these, just append the suffix to your model name. For example:
model="deepseek-ai/DeepSeek-R1:together"
You can browse huggingface.co/models and filter by inference provider to see which models are available under each backend. If you point a model at a provider that doesn't support it, or forget the suffix entirely, the request will fail, so it's important to get this right.
This system gives you flexibility to try different models or backends just by changing the provider tag, all without modifying your application logic.
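That flexibility also makes side-by-side comparisons straightforward. The sketch below reuses the client from earlier and times the same prompt across a few provider suffixes; the providers listed are illustrative, and each must actually host the model you choose:

import time

# Illustrative comparison: same model, same prompt, different providers.
# Check each provider's availability for the model on its Hugging Face model page first.
providers = ["together", "sambanova", "groq"]
for provider in providers:
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=f"deepseek-ai/DeepSeek-R1:{provider}",
        messages=[{"role": "user", "content": "Tell me a fun fact."}],
    )
    elapsed = time.perf_counter() - start
    print(f"{provider}: {elapsed:.2f}s -> {response.choices[0].message.content[:60]}...")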
In conclusion, setting up Hugging Face’s OpenAI-compatible API involves just a few key steps: updating the base URL, providing your Hugging Face token, and including the required provider suffix in the model name. This simple setup allows developers to access a wide range of models without changing their existing code.
Throughout this blog, we explored how this compatibility works, why specifying a provider is essential, and how it fits into Hugging Face’s broader inference system. It’s a practical and flexible approach for anyone looking to build with language models beyond a single provider.