Introduction¶
1. What is this API?¶
The Azure AI Model Inference API exposes a common set of capabilities for foundation models, letting developers consume predictions from a diverse set of models in a uniform and consistent way. Developers can talk to different models deployed in Azure AI Studio without changing their underlying code!
Another way to think about it is as a model wrapper that abstracts frequently used model capabilities into a common API that applications can interact with, giving a consistent developer experience independent of the specific model implementing them.
- It fails cleanly, raising a clear exception, if the underlying model lacks a specific API feature.
- It supports extensibility by "passing through" custom parameters for model-unique features.
This allows us to explore model choice with the same codebase, simply by swapping the model deployment details in the configuration or environment - without changing the application code. We can also compare models side-by-side, by running the same code in different terminals, to contrast quality of responses or performance of execution.
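As a minimal sketch of this idea (assuming the Python `azure-ai-inference` package; the `AZURE_INFERENCE_ENDPOINT` / `AZURE_INFERENCE_KEY` environment variable names are illustrative, not prescribed by the API), the same code can target any chat-capable deployment:

```python
import os

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

# Point these at any chat-capable deployment - swapping models means
# changing only the environment, never this application code.
client = ChatCompletionsClient(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_KEY"]),
)

response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="What is the capital of France?"),
    ]
)
print(response.choices[0].message.content)
```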
2. Why should we use it?¶
Building generative AI applications requires rapid prototyping and ideation, with the flexibility to make decisions like model selection, model configuration, prompt template design, and orchestration framework. The API provides an abstraction layer between application code and the model invocation interface, allowing us to evolve each side independently. This lets us do the following:
- Rapid Ideation - Quickly prototype against the API, then explore diverse models for best fit.
- Design Flexibility - Build for common capabilities, and extend to custom features if present.
- Deploy Flexibility - Works with Managed Compute and Serverless API model deployments.
- Ecosystem Expansion - Growing partner list, e.g., GitHub Marketplace Models, LlamaIndex.
- Prompty Enabled - Prompt asset & runtime for rapid prototyping, works out-of-the-box (see the sketch below).
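As a rough illustration of the Prompty point above (assuming the `prompty` Python package with its Azure invoker installed; the `chat.prompty` file and `question` input below are hypothetical placeholders):

```python
import prompty
import prompty.azure  # registers the Azure invoker for .prompty assets

# Execute a Prompty asset directly - the model and prompt template details
# live in the .prompty file, not in application code.
response = prompty.execute(
    "chat.prompty",
    inputs={"question": "What is the Model Inference API?"},
)
print(response)
```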
3. Where could we use it?¶
Use it in any situation where you think model choice matters. For instance, suppose you have a specific cost or performance target and want to see if an alternative model offers better trade-offs. The API lets you swap in alternative models without changing the application code, and compare the results side-by-side to make effective decisions (see the sketch after the list below). For example:
- Performance optimization - swap downstream "bottleneck" models for faster ones.
- Cost optimization - scale down to cheaper models when justified (e.g., traffic lulls).
- Multi-model composition - use different API endpoints for orchestrated flows (agentic, RAG).
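One way to run such a side-by-side comparison, sketched under the same assumptions as before (the `azure-ai-inference` package; the deployment names and environment variable names here are illustrative):

```python
import os
import time

from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

# Two hypothetical deployments to compare - only the configuration differs.
DEPLOYMENTS = {
    "model-a": ("MODEL_A_ENDPOINT", "MODEL_A_KEY"),
    "model-b": ("MODEL_B_ENDPOINT", "MODEL_B_KEY"),
}

messages = [{"role": "user", "content": "Summarize RAG in one sentence."}]

for name, (endpoint_var, key_var) in DEPLOYMENTS.items():
    client = ChatCompletionsClient(
        endpoint=os.environ[endpoint_var],
        credential=AzureKeyCredential(os.environ[key_var]),
    )
    start = time.perf_counter()
    response = client.complete(messages=messages)
    elapsed = time.perf_counter() - start
    # Contrast latency and response quality across models.
    print(f"{name}: {elapsed:.2f}s -> {response.choices[0].message.content}")
```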
4. How would we use it?¶
The current version of the Azure AI Model Inference API supports the following endpoints, each reflecting a type of inference task exposed through the common API. See these in action in the Quickstart section of this lab, and in the sketch after this list.
- Get Info - about underlying model
- Text Embeddings - generate vectors from text
- Image Embeddings - generate vectors from images and text
- Text Completion - single turn model response (for prompt)
- Chat Completion - multi-turn model response (for conversation)
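A quick tour of a few of these endpoints, again assuming the `azure-ai-inference` package and the same hypothetical environment variables (the embeddings call requires an embeddings-capable deployment):

```python
import os

from azure.ai.inference import ChatCompletionsClient, EmbeddingsClient
from azure.core.credentials import AzureKeyCredential

endpoint = os.environ["AZURE_INFERENCE_ENDPOINT"]
credential = AzureKeyCredential(os.environ["AZURE_INFERENCE_KEY"])

# Get Info - ask the endpoint which model sits behind it.
chat = ChatCompletionsClient(endpoint=endpoint, credential=credential)
info = chat.get_model_info()
print(info.model_name, info.model_type, info.model_provider_name)

# Chat Completion - multi-turn model response.
reply = chat.complete(messages=[{"role": "user", "content": "Hello!"}])
print(reply.choices[0].message.content)

# Text Embeddings - generate vectors from text.
embeddings = EmbeddingsClient(endpoint=endpoint, credential=credential)
result = embeddings.embed(input=["first phrase", "second phrase"])
print(len(result.data[0].embedding))
```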
The API samples showcase both sync client and async client usage patterns.
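For instance, a minimal async sketch (same assumptions as above) uses the `azure.ai.inference.aio` namespace and awaits the same `complete` call:

```python
import asyncio
import os

from azure.ai.inference.aio import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

async def main():
    # "async with" ensures the underlying HTTP session is closed.
    async with ChatCompletionsClient(
        endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
        credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_KEY"]),
    ) as client:
        response = await client.complete(
            messages=[{"role": "user", "content": "Say hello."}]
        )
        print(response.choices[0].message.content)

asyncio.run(main())
```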
Next - Setup Environment