Sunday, May 17, 2026

Microsoft Azure OpenAI & AI Services Complete Guide

Azure OpenAI · GPT Models · Prompt Engineering · RAG · Semantic Kernel · AI Search · AI Studio · Responsible AI · Scenarios · Cheat Sheet

Core Concepts — Azure AI Services Overview
Azure OpenAI Service — Deep Dive
Prompt Engineering & Model Parameters
RAG — Retrieval-Augmented Generation
Semantic Kernel & AI Orchestration
Responsible AI & Governance
Scenario-Based Questions
Cheat Sheet — Quick Reference

1. Core Concepts — Azure AI Services Overview

What are Azure AI Services and how are they organised?

Azure AI Services (formerly Azure Cognitive Services) is Microsoft's portfolio of pre-built, cloud-hosted AI capabilities accessible via REST APIs — enabling developers to add AI to applications without building or training models from scratch.

Category	Services
Language	Azure OpenAI, Language Service (NLP, sentiment, NER, summarisation), Translator
Speech	Speech-to-Text, Text-to-Speech, Speaker Recognition, Speech Translation
Vision	Computer Vision, Custom Vision, Face API, Document Intelligence (Form Recognizer)
Decision	Anomaly Detector, Content Moderator, Personalizer
Search	Azure AI Search (cognitive search with vector + hybrid search)
Generative AI	Azure OpenAI Service (GPT-4o, GPT-4, GPT-3.5, DALL-E 3, Whisper, Embeddings)
AI Platform	Azure AI Studio, Azure Machine Learning, Prompt Flow

Tip: Azure AI Services = pre-built models via API (no training needed). Azure Machine Learning = custom model training and MLOps. Azure OpenAI = Microsoft-hosted OpenAI models with enterprise security and compliance, where your data stays in your tenant.

What is the difference between Azure OpenAI and the OpenAI API?

OpenAI API (openai.com):
→ Hosted by OpenAI (US-based)
→ Data sent to OpenAI's servers
→ No enterprise compliance guarantees
→ Shared infrastructure
→ No Azure RBAC, no VNet, no private endpoints
→ Billing: per token via OpenAI account
→ No data residency guarantees

Azure OpenAI Service:
→ Hosted by Microsoft in Azure regions
→ Data stays within your Azure tenant
→ Enterprise compliance: SOC 2, ISO 27001, HIPAA, FedRAMP
→ Dedicated capacity (Provisioned Throughput Units — PTUs)
→ Azure RBAC integration, VNet support, private endpoints
→ Managed Identity authentication (no API key required)
→ Content filtering: Azure Responsible AI content filters
→ Data residency: choose Azure region (UK South, East US, etc.)
→ Billing: via Azure subscription (same invoice)
→ Access: requires application approval by Microsoft

When to use Azure OpenAI over OpenAI direct:
→ Enterprise/regulated organisations (GDPR, HIPAA, ISO 27001)
→ Need data residency guarantees
→ Need VNet integration / private endpoints
→ Need enterprise SLA and support
→ Need Azure RBAC for access control
→ M365 or Azure-integrated solutions

What Azure OpenAI models are available and what are they used for?

GPT-4o (Omni) — flagship multimodal:
→ Text, image, audio input/output
→ Fastest and most capable in the GPT-4 family
→ Context window: 128K tokens
→ Use for: complex reasoning, document analysis, vision tasks

GPT-4o mini:
→ Smaller, faster, cheaper than GPT-4o
→ Context window: 128K tokens
→ Use for: high-volume tasks where cost matters, classification,
           extraction, summarisation at scale

GPT-4 Turbo:
→ High capability text model
→ Context window: 128K tokens
→ Use for: complex generation, analysis, code

GPT-3.5 Turbo:
→ Fast, economical text model
→ Context window: 16K tokens
→ Use for: simple tasks, high-volume, chat applications

text-embedding-ada-002 / text-embedding-3-large:
→ Converts text to vector embeddings (numerical representations)
→ Output: 1536-dimensional float vectors (ada-002) or up to 3072-dimensional (3-large)
→ Use for: semantic search, RAG, similarity comparison, clustering

DALL-E 3:
→ Text-to-image generation
→ Generates images from natural language descriptions
→ Resolutions: 1024×1024, 1024×1792, 1792×1024
→ Use for: image generation, marketing assets, creative tools

Whisper:
→ Speech-to-text transcription
→ Multilingual, handles accents and background noise well
→ Use for: transcription, meeting notes, voice interfaces

o1 / o3-mini (Reasoning models):
→ Extended thinking before responding
→ Excels at complex multi-step reasoning, math, coding
→ Slower and more expensive than GPT-4o
→ Use for: complex problem solving requiring step-by-step reasoning

2. Azure OpenAI Service — Deep Dive

How do you provision and call Azure OpenAI?

Provisioning steps:
1. Apply for Azure OpenAI access (Microsoft approval required)
2. Create Azure OpenAI resource in Azure Portal
   → Choose resource name, region, pricing tier
3. Deploy a model: Azure AI Studio → Deployments → Create deployment
   → Select model (gpt-4o), set deployment name, set TPM capacity
4. Note: endpoint URL + API key (or use Managed Identity)

Endpoint format:
https://{resourceName}.openai.azure.com/openai/deployments/{deploymentName}/chat/completions?api-version=2024-02-01

REST API call:
POST https://{resource}.openai.azure.com/openai/deployments/{deployment}/chat/completions?api-version=2024-02-01
Headers:
  api-key: {your-api-key}
  Content-Type: application/json

Body:
{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user",   "content": "Summarise this contract in 3 bullet points." }
  ],
  "max_tokens": 500,
  "temperature": 0.3
}

Python SDK (openai):
import os

from openai import AzureOpenAI

client = AzureOpenAI(
  azure_endpoint="https://{resource}.openai.azure.com",
  api_key=os.getenv("AZURE_OPENAI_API_KEY"),
  api_version="2024-02-01"
)

response = client.chat.completions.create(
  model="gpt-4o",  # deployment name
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "Summarise this contract in 3 bullet points."}
  ],
  max_tokens=500,
  temperature=0.3
)
print(response.choices[0].message.content)
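
The endpoint format and request body above can also be assembled by hand, which is useful when debugging what the SDK sends. The following is a minimal sketch using only the standard library; it builds the URL, headers, and JSON body but sends nothing. `build_chat_request`, the resource name `my-resource`, and the key placeholder are hypothetical and exist only for illustration.

```python
import json

API_VERSION = "2024-02-01"  # same api-version used in the examples above

def build_chat_request(resource, deployment, api_key, user_content,
                       system_content="You are a helpful assistant.",
                       max_tokens=500, temperature=0.3):
    """Return (url, headers, body) for a chat completions call. Nothing is sent."""
    url = (
        f"https://{resource}.openai.azure.com/openai/deployments/"
        f"{deployment}/chat/completions?api-version={API_VERSION}"
    )
    headers = {"api-key": api_key, "Content-Type": "application/json"}
    body = json.dumps({
        "messages": [
            {"role": "system", "content": system_content},
            {"role": "user", "content": user_content},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    })
    return url, headers, body

# Hypothetical resource and deployment names:
url, headers, body = build_chat_request(
    "my-resource", "gpt-4o", "<your-api-key>",
    "Summarise this contract in 3 bullet points.",
)
print(url)
```

Note that the deployment name, not the underlying model name, is what appears in the URL path.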

What are tokens and how do they affect cost and limits?

Tokens = the unit of text that LLMs process
→ 1 token ≈ 4 characters ≈ 0.75 words (English)
→ "Hello, world!" ≈ 4 tokens
→ Common words = 1 token, rare words = 2-3 tokens
→ Whitespace, punctuation each consume tokens

Token types:
Input tokens:  tokens in your prompt (system + user messages + history)
Output tokens: tokens in the model's response
Total:         input + output (both are billed, at different rates)

Why tokens matter:
1. Cost: charged per token, with separate rates for input and output
   GPT-4o: ~$5/1M input, ~$15/1M output tokens (approximate)
   GPT-4o mini: ~$0.15/1M input, ~$0.60/1M output (much cheaper)

2. Context window: maximum total tokens (input + output) per request
   GPT-4o: 128K context window (≈ 100,000 words = a full novel)
   GPT-3.5 Turbo: 16K context window

3. Rate limits: Tokens Per Minute (TPM) quota per deployment
   Manage in Azure AI Studio → Deployments → Edit → TPM slider
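
The token arithmetic above can be turned into a rough budgeting helper. This is a sketch under stated assumptions: it uses the "about 4 characters per token" rule of thumb rather than a real tokenizer, and the price table simply copies the approximate figures quoted above, so treat the results as estimates only. The function names are hypothetical.

```python
import math

# Approximate USD prices per 1M tokens, copied from the figures quoted above.
PRICES_PER_MILLION = {
    "gpt-4o": {"input": 5.00, "output": 15.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def estimate_tokens(text):
    """Rough heuristic: about 4 characters per token for English text."""
    return math.ceil(len(text) / 4)

def estimate_cost(model, input_tokens, output_tokens):
    """Estimated USD cost of one request, pricing input and output separately."""
    price = PRICES_PER_MILLION[model]
    return (input_tokens * price["input"]
            + output_tokens * price["output"]) / 1_000_000

prompt = "Hello, world!"
print(estimate_tokens(prompt))                   # heuristic token count
print(estimate_cost("gpt-4o-mini", 2000, 500))   # estimated cost in USD
```

For exact counts you would use the tokenizer that matches the deployed model instead of the character heuristic.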

Token optimisation:
→ Ask for concise responses and cap max_tokens (the prompt-level analogue of OData $select: request only what you need)
→ Truncate long documents before sending — chunk and summarise
→ Use embeddings + RAG: send only relevant chunks, not full document
→ Use streaming: receive tokens as generated (better UX, same cost)
→ Cache responses for identical queries (Semantic Cache in APIM)

What is function calling (tool use) in Azure OpenAI?

Function calling allows the model to request that your application call a specific function and return the result — enabling the model to take actions and access real-time data.

import json

# Define available functions (tools):
tools = [
  {
    "type": "function",
    "function": {
      "name": "get_stock_price",
      "description": "Get the current stock price for a given ticker symbol",
      "parameters": {
        "type": "object",
        "properties": {
          "ticker": {
            "type": "string",
            "description": "Stock ticker symbol e.g. MSFT, AAPL"
          }
        },
        "required": ["ticker"]
      }
    }
  }
]

# First call: model decides to call the function.
# Keep the conversation in a list so the tool result can be appended later.
messages = [{"role": "user", "content": "What is Microsoft's current stock price?"}]
response = client.chat.completions.create(
  model="gpt-4o",
  messages=messages,
  tools=tools,
  tool_choice="auto"
)

# Check if model wants to call a function:
if response.choices[0].finish_reason == "tool_calls":
  tool_call = response.choices[0].message.tool_calls[0]
  function_name = tool_call.function.name  # "get_stock_price"
  arguments = json.loads(tool_call.function.arguments)  # {"ticker": "MSFT"}

  # Call the actual function:
  result = get_stock_price(arguments["ticker"])  # "$415.32"

  # Second call: send function result back to model:
  messages.append(response.choices[0].message)  # assistant's tool call
  messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": str(result)
  })

  final_response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools
  )
  print(final_response.choices[0].message.content)
  # "Microsoft's current stock price is $415.32."

Tip: Function calling is the foundation of AI agents. The model decides WHEN to call a function and HOW to structure the arguments — your code executes the actual function and returns results. This pattern enables LLMs to interact with live data and external systems.
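
On the application side, "your code executes the actual function" usually means a small dispatcher that maps the tool name requested by the model onto a local callable. The sketch below shows one way to do that without any SDK; `TOOL_REGISTRY`, `execute_tool_call`, and the hard-coded sample price are assumptions made for illustration, not part of the Azure OpenAI API.

```python
import json

def get_stock_price(ticker):
    """Stand-in implementation with hard-coded sample data (not a real market feed)."""
    sample_prices = {"MSFT": "$415.32"}
    return sample_prices.get(ticker.upper(), "unknown")

# Map the tool names advertised to the model onto local Python callables.
TOOL_REGISTRY = {"get_stock_price": get_stock_price}

def execute_tool_call(name, arguments_json):
    """Run the requested tool and return a string for the 'tool' message content."""
    if name not in TOOL_REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        arguments = json.loads(arguments_json)
        return str(TOOL_REGISTRY[name](**arguments))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or wrong argument names: report back instead of crashing.
        return json.dumps({"error": str(exc)})

print(execute_tool_call("get_stock_price", '{"ticker": "MSFT"}'))
```

Returning an error string rather than raising lets the model see what went wrong and try again with corrected arguments.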

What is streaming in Azure OpenAI and why use it?

# Streaming: receive tokens as they are generated:
stream = client.chat.completions.create(
  model="gpt-4o",
  messages=[{"role": "user", "content": "Write a 500-word blog post about Azure AI."}],
  max_tokens=1000,
  stream=True  # Enable streaming
)

for chunk in stream:
  if chunk.choices and chunk.choices[0].delta.content:
    print(chunk.choices[0].delta.content, end="", flush=True)

# Why use streaming:
→ Better UX: user sees output immediately rather than waiting for full response
→ Same cost as non-streaming (tokens still counted the same way)
→ Allows early termination if user navigates away
→ Essential for chat interfaces — shows "typing" effect
→ Helps with long responses: the first tokens arrive before a client or gateway timeout would trigger

3. Prompt Engineering & Model Parameters

What are the key model parameters and how do they affect output?

temperature (0.0 – 2.0):
→ Controls randomness of output
→ 0.0: near-deterministic (almost always the same output for the same input)
→ 0.3-0.7: balanced — creative but coherent (general use)
→ 1.0+: highly creative, variable, sometimes incoherent
→ Use low (0.0-0.3) for: factual Q&A, extraction, classification, code
→ Use high (0.7-1.0) for: creative writing, brainstorming, marketing copy

top_p (0.0 – 1.0) — nucleus sampling:
→ Consider only tokens whose cumulative probability is ≤ top_p
→ 1.0: consider all tokens (no restriction)
→ 0.9: consider top 90% probability mass tokens
→ Adjust temperature or top_p, not both at the same time

max_tokens:
→ Maximum output tokens to generate
→ Does NOT affect input tokens
→ Set conservatively to control costs
→ Response truncated (not summarised) when limit reached

frequency_penalty (-2.0 – 2.0):
→ Penalises tokens based on how often they appeared in the response
→ Positive value: reduces repetition of exact words
→ 0.0: no penalty (default)
→ 0.5-1.0: useful for reducing repetitive phrasing in long outputs

presence_penalty (-2.0 – 2.0):
→ Penalises tokens based on whether they appeared at all in the response
→ Positive value: encourages talking about new topics
→ 0.0: no penalty (default)

stop sequences:
→ Array of strings — model stops generating when it produces one
→ E.g., stop: ["###", "\n\n"]
→ Useful for structured generation where you control the format
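
To make temperature and top_p more concrete, the toy sketch below applies both to a made-up list of token scores. It assumes the common textbook description of these techniques (softmax with temperature scaling; nucleus sampling keeping the smallest set of top tokens whose cumulative probability reaches top_p) and is not a description of how the service implements them internally. All function names and numbers are illustrative.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; temperature must be > 0 in this sketch."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def nucleus_filter(probs, top_p):
    """Keep the most probable tokens until their mass reaches top_p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
print(softmax_with_temperature(logits, 0.2))  # sharply peaked on the first token
print(softmax_with_temperature(logits, 1.5))  # flatter, more variety when sampling
print(nucleus_filter([0.5, 0.3, 0.15, 0.05], 0.75))  # only the top tokens survive
```

Lower temperature concentrates probability on the top token, while a lower top_p removes the unlikely tail entirely, which is why tuning both at once makes the effect hard to reason about.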

What are prompt engineering techniques?

1. Zero-shot prompting:
Ask the model to perform a task with no examples.
"Classify the sentiment of this review: 'The product broke after one day.'"
→ Good for: simple tasks the model already knows well

2. Few-shot prompting:
Provide 2-5 examples of input → output before the actual task.
"Classify sentiment:
  Review: 'Amazing quality!' → Positive
  Review: 'Total waste of money.' → Negative
  Review: 'The product broke after one day.' → ???"
→ Good for: custom output formats, domain-specific classification

3. Chain-of-thought (CoT):
Ask the model to show its reasoning step-by-step before answering.
"Think step by step. A train travels 120km in 2 hours. What is its speed?"
→ Model shows: "Distance = 120km, Time = 2 hours, Speed = Distance/Time = 60 km/h"
→ Dramatically improves accuracy on reasoning, maths, logic tasks

4. System prompt (role assignment):
Set the model's persona, rules, and constraints in the system message.
"You are a senior Microsoft 365 administrator. Answer only questions
 related to M365 administration. If asked about unrelated topics,
 politely decline. Always recommend following Microsoft best practices."

5. Structured output:
Ask the model to respond in a specific format.
"Return your answer as JSON only:
 { 'sentiment': '...', 'confidence': 0.0-1.0, 'key_phrases': [...] }"
→ Use with JSON mode (response_format: { type: 'json_object' })

6. ReAct (Reason + Act):
Prompt the model to reason about what to do, then take an action.
Combines chain-of-thought with function calling.
"Think about what information you need, then call the appropriate
 function to get it, then reason about the result."

7. Grounding:
Provide context/documents in the prompt before asking questions.
"Here is the company policy document: [document content]
 Based only on the above document, answer: What is the leave policy?"
→ Reduces hallucination by providing factual grounding
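
Few-shot prompting and role assignment combine naturally in the chat message format: the system message carries the instruction, and each example becomes a user/assistant pair. The helper below is a sketch of that layout; `build_few_shot_messages` is a hypothetical name, and the examples are the ones from the few-shot section above.

```python
def build_few_shot_messages(instruction, examples, query):
    """Build a chat messages list: system rule, worked examples, then the real input."""
    messages = [{"role": "system", "content": instruction}]
    for review, label in examples:
        messages.append({"role": "user", "content": review})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": query})
    return messages

messages = build_few_shot_messages(
    "Classify the sentiment of each review as Positive or Negative.",
    [("Amazing quality!", "Positive"), ("Total waste of money.", "Negative")],
    "The product broke after one day.",
)
for message in messages:
    print(message["role"], "->", message["content"])
```

The resulting list can be passed as the `messages` argument of a chat completions call.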

What are hallucinations and how do you mitigate them?

Hallucination: the model generates plausible-sounding but factually
incorrect information — presented with confidence.

Why it happens:
→ LLMs predict the next most likely token — not verify facts
→ Training data may contain errors or outdated information
→ Model "fills in" when it doesn't know the answer

Mitigation strategies:
1. RAG (Retrieval-Augmented Generation):
   Retrieve relevant facts from a trusted knowledge base → inject into prompt
   Model answers from provided context → grounded in real data
   "Answer based ONLY on the following documents: [retrieved chunks]"

2. Grounding instructions:
   "If you don't know the answer based on the provided context,
    say 'I don't have enough information to answer that.'"

3. Low temperature (0.0-0.3):
   Less randomness = more conservative, consistent output
   High temperature encourages creative but potentially incorrect output

4. Citation requirements:
   "For each claim, cite the source document and page number."
   Forces model to stay close to provided source material

5. Structured output validation:
   Parse model output programmatically
   Validate against known schemas or business rules
   Reject invalid outputs and retry

6. Human review for high-stakes outputs:
   Don't use AI output directly for medical, legal, financial decisions
   Always include human review in the workflow for critical decisions

7. Azure Content Safety + Groundedness detection:
   Azure AI Studio → Prompt flow → Groundedness evaluator
   Detects when model answer is not supported by provided context
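
Strategy 5 (structured output validation) can be sketched as a validator plus a retry loop. The schema here follows the sentiment JSON shown in the structured output example earlier; the allowed labels, the retry count, and the `generate` callable (a stand-in for whatever function calls the model) are assumptions for illustration.

```python
import json

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}  # assumed label set

def validate_sentiment_output(raw):
    """Return the parsed dict if it passes every check, otherwise None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    sentiment = data.get("sentiment")
    if not isinstance(sentiment, str) or sentiment not in ALLOWED_SENTIMENTS:
        return None
    confidence = data.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    if not isinstance(data.get("key_phrases"), list):
        return None
    return data

def call_with_retry(generate, max_attempts=3):
    """Call generate() until its output validates or the attempts run out."""
    for _ in range(max_attempts):
        parsed = validate_sentiment_output(generate())
        if parsed is not None:
            return parsed
    return None

# Simulated model outputs: the first is malformed, the second is valid.
fake_outputs = iter([
    "Sure! Here is the JSON you asked for...",
    '{"sentiment": "negative", "confidence": 0.9, "key_phrases": ["broke"]}',
])
print(call_with_retry(lambda: next(fake_outputs)))
```

Returning `None` after the final attempt gives the caller a clear signal to fall back to human review instead of passing unvalidated output downstream.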

4. RAG — Retrieval-Augmented Generation

What is RAG and why is it the most important pattern for enterprise AI?

RAG (Retrieval-Augmented Generation) combines a retrieval system (search over your own data) with an LLM (generation) — allowing the model to answer questions based on your organisation's private, up-to-date documents rather than only its training data.

Why RAG:
→ LLMs have a training cutoff — they don't know about recent events/docs
→ LLMs don't have access to your private company documents
→ LLMs hallucinate when they don't know the answer
→ RAG solves all three: retrieves relevant, up-to-date, private content

RAG architecture:
                    User Question
                         ↓
              [Query Embedding Generation]
              text-embedding-ada-002 converts
              question to a vector
                         ↓
              [Vector Search — Azure AI Search]
              Find documents whose vectors are
              closest to the question vector
              (cosine similarity)
                         ↓
              [Context Assembly]
              Top K chunks retrieved
              and assembled into prompt
                         ↓
              [Augmented Prompt → LLM]
              "Answer based only on these documents:
               [chunk 1][chunk 2][chunk 3]
               Question: {user question}"
                         ↓
              [LLM Response — Grounded Answer]
              Answer with citations to source docs
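
The retrieval and context assembly steps in the diagram can be illustrated with a self-contained toy: cosine similarity over tiny hand-written vectors, followed by prompt assembly. This is only a sketch. Real embeddings have 1536 or more dimensions and come from the embedding model, and Azure AI Search performs the vector search at scale; the three-dimensional vectors, chunk texts, and function names here are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_chunks(query_vector, chunks, k=3):
    """Return the k chunks whose vectors are most similar to the query vector."""
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, chunk["vector"]),
        reverse=True,
    )
    return ranked[:k]

def build_augmented_prompt(question, chunks):
    """Assemble retrieved chunks and the question into one grounded prompt."""
    context = "\n".join(f"[{chunk['id']}] {chunk['content']}" for chunk in chunks)
    return f"Answer based only on these documents:\n{context}\nQuestion: {question}"

# Toy index with invented 3-dimensional "embeddings":
chunks = [
    {"id": "leave", "content": "Employees get 20 days of annual leave.", "vector": [0.9, 0.1, 0.0]},
    {"id": "refund", "content": "Refunds are issued within 14 days.", "vector": [0.0, 0.2, 0.9]},
    {"id": "travel", "content": "Travel must be booked via the portal.", "vector": [0.1, 0.9, 0.1]},
]
query_vector = [1.0, 0.2, 0.0]  # pretend embedding of "What is the leave policy?"

retrieved = top_k_chunks(query_vector, chunks, k=2)
print(build_augmented_prompt("What is the leave policy?", retrieved))
```

Keeping the chunk id in the assembled context is what later allows the model to cite its sources.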

What is Azure AI Search and how does it enable RAG?

Azure AI Search (formerly Cognitive Search):
→ Enterprise search service combining keyword, vector, and hybrid search
→ Core component of most Azure-based RAG architectures

Key concepts:
Index:         schema defining fields (title, content, vector, metadata)
Indexer:       crawls data sources and populates the index automatically
Skillset:      AI enrichment pipeline (OCR, entity extraction, chunking,
               embedding generation) applied during indexing
Search:        query the index via REST API or SDK

Search modes:
Keyword search:    BM25 full-text search (a probabilistic refinement of TF-IDF relevance)
Vector search:     ANN similarity search over embedding vectors
Hybrid search:     combines keyword + vector results (RECOMMENDED)
Semantic reranking: reranks hybrid results using language model understanding
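
Hybrid search has to merge two ranked lists (keyword and vector) whose scores are not directly comparable. Azure AI Search documents Reciprocal Rank Fusion (RRF) as its merging method; the sketch below shows the general RRF idea on toy data and is not the service's actual implementation. The constant `k=60` is the value commonly used in the RRF literature and is an assumption here, as are the document ids.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; each appearance adds 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

keyword_results = ["a", "b", "c"]  # ranked by BM25
vector_results = ["b", "d", "a"]   # ranked by vector similarity

print(reciprocal_rank_fusion([keyword_results, vector_results]))
```

Documents that rank well in both lists rise to the top, which is why hybrid search tends to be more robust than either mode alone.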

Setting up RAG with Azure AI Search:
Step 1 — Ingest:
  Upload documents to Azure Blob Storage
  → Indexer crawls blob storage
  → Skillset: chunk documents (512 tokens), generate embeddings
    (call text-embedding-ada-002), enrich with metadata
  → Documents stored in index with vector field

Step 2 — Retrieve:
  User query → generate query embedding
  → AI Search: hybrid search (keyword + vector)
  → Semantic reranker: reorder results by relevance
  → Return top K chunks with scores

Step 3 — Generate:
  Assemble prompt: system prompt + retrieved chunks + user question
  → Call GPT-4o: answer based only on provided context
  → Return answer with source citations

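Python retrieval:
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

# Environment variable names and the index name are placeholders.
search_client = SearchClient(
  endpoint=os.getenv("AZURE_SEARCH_ENDPOINT"),
  index_name="documents-index",
  credential=AzureKeyCredential(os.getenv("AZURE_SEARCH_API_KEY"))
)

# get_embedding is your own helper: it calls the embeddings deployment
# and returns the query vector as a list of floats.
query_embedding = get_embedding(user_question)

results = search_client.search(
  search_text=user_question,           # keyword search
  vector_queries=[VectorizedQuery(     # vector search
    vector=query_embedding,
    k_nearest_neighbors=5,
    fields="contentVector"
  )],
  query_type="semantic",               # semantic reranking
  semantic_configuration_name="my-semantic-config",
  top=5,
  select=["title", "content", "sourcePage"]
)

# Each result exposes the selected fields by name:
for doc in results:
  print(doc["title"], doc["sourcePage"])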

What is chunking and why does it matter for RAG quality?

Chunking: splitting documents into smaller pieces before indexing
→ LLMs have token limits — can't send entire documents
→ Smaller chunks = more precise retrieval
→ Too small = loses context; too large = less precise retrieval

Chunking strategies:
Fixed-size chunking:
  Split every N tokens (e.g., 512 tokens)
  Simple but may split sentences or paragraphs mid-thought
  Overlap: include 50-100 token overlap between chunks to preserve context

Recursive character text splitter:
  Split at paragraph → sentence → word level
  Tries to preserve natural text boundaries
  Most common general-purpose strategy

Semantic chunking:
  Group sentences with similar semantic meaning together
  Produces more coherent chunks
  More expensive (requires embedding each sentence)

Document structure-aware:
  Use document headings, sections as chunk boundaries
  Tables remain intact as single chunks
  Best for structured documents (PDFs with headings, manuals)

Metadata per chunk (critical for citation):
{
  "id": "doc1-chunk-3",
  "content": "The refund policy states...",
  "title": "Customer Service Policy",
  "sourcePage": 4,
  "sourceFile": "policy-2025.pdf",
  "category": "policy",
  "contentVector": [0.023, -0.145, ...]  // 1536-dim embedding
}

Recommended chunk size: 512-1024 tokens with 10-20% overlap
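
Fixed-size chunking with overlap is simple enough to sketch directly. The version below works on any list of tokens; the usage example splits on whitespace as a stand-in for a real tokenizer, which is an assumption made to keep the sketch dependency-free. `chunk_tokens` is a hypothetical helper name.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks

# Whitespace-separated words stand in for model tokens in this example.
words = "The refund policy states that customers may return items within 30 days".split()
for chunk in chunk_tokens(words, chunk_size=5, overlap=1):
    print(" ".join(chunk))
```

Each chunk starts with the last token(s) of the previous one, so a sentence cut at a boundary still appears with some of its surrounding context in the next chunk.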

5. Semantic Kernel & AI Orchestration

What is Semantic Kernel and what does it provide?

Semantic Kernel (SK) is Microsoft's open-source SDK for building AI-powered applications — an orchestration layer that connects LLMs with your existing code, data sources, and APIs.

Semantic Kernel key concepts:

Kernel:
→ The central orchestration object
→ Configures AI services (Azure OpenAI), memory, plugins
→ Manages the lifecycle of AI calls

Plugins (formerly Skills):
→ Collections of functions the AI can call
→ Native functions: C#/Python/Java code
→ Semantic functions: prompts with template variables

Planners:
→ AI generates a plan to complete a multi-step goal
→ Sequentially executes steps using available plugins
→ Types: Sequential planner, Stepwise planner (ReAct); recent SK releases deprecate these in favour of automatic function calling

Memory:
→ Vector store integration (Azure AI Search, Qdrant, Chroma)
→ Store and retrieve semantic memories by similarity
→ Enables long-term context across conversations

Agents (new architecture):
→ Goal-oriented autonomous AI entities
→ Use function calling to select actions
→ Multi-agent orchestration: agents collaborating on tasks

import asyncio
import os

from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion
from semantic_kernel.core_plugins import TimePlugin

# Create kernel:
kernel = Kernel()
kernel.add_service(AzureChatCompletion(
  service_id="gpt4o",
  deployment_name="gpt-4o",
  endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
  api_key=os.getenv("AZURE_OPENAI_API_KEY")
))

# Add plugins:
kernel.add_plugin(TimePlugin(), plugin_name="time")

# Create semantic function:
summarise = kernel.add_function(
  function_name="summarise",
  plugin_name="document",
  prompt="Summarise the following in 3 bullet points:\n{{$input}}"
)

# Invoke (kernel.invoke is a coroutine, so run it inside an event loop):
async def main():
  document_text = "Paste the document text here."  # placeholder input
  result = await kernel.invoke(summarise, input=document_text)
  print(result)

asyncio.run(main())

What is Prompt Flow in Azure AI Studio?

Prompt Flow is a visual development tool in Azure AI Studio for building, testing, evaluating, and deploying LLM workflows.

Prompt Flow concepts:
Flow:      directed acyclic graph (DAG) of nodes
Node:      a processing step (LLM call, Python function, API call, search)
Input:     flow inputs (user question, documents, parameters)
Output:    flow outputs (final answer, citations, metadata)

Node types:
LLM:      call an Azure OpenAI deployment with a prompt template
Python:   run custom Python code (data processing, API calls)
Prompt:   compose complex prompts with Jinja2 templating
Search:   query Azure AI Search
Tool:     built-in tools (embedding, content safety check)

Evaluation flow:
→ Run the same flow against a test dataset
→ Measure: groundedness, coherence, relevance, fluency
→ Compare multiple prompt variations (A/B testing)
→ Before deploying: ensure quality meets threshold

Deployment:
→ Deploy flow as a managed endpoint (REST API)
→ Monitor: token usage, latency, error rates
→ Version: update prompts without code changes

CI/CD for Prompt Flow:
→ Export flow as YAML → store in Git
→ Azure DevOps pipeline: evaluate → test → deploy
→ Environment variables for different environments (dev/prod)

What is Azure AI Foundry (AI Studio)?

Azure AI Foundry (formerly Azure AI Studio):
→ Unified platform for building enterprise AI applications
→ Replaces/unifies Azure OpenAI Studio + Prompt Flow + AI Studio

Key capabilities:
Model catalogue:   browse 1,700+ models (OpenAI, Meta LLaMA, Mistral,
                   Cohere, Hugging Face, custom) and deploy with one click
Playground:        test prompts and models interactively
Fine-tuning:       customise GPT-3.5/GPT-4 on your own training data
Prompt Flow:       visual LLM workflow orchestration (see above)
Evaluation:        measure AI output quality at scale
Safety:            content filtering, groundedness detection
Deployments:       manage model deployments, TPM quotas, PTUs
Connections:       connect to Azure AI Search, blob storage, CosmosDB

Fine-tuning (when to use):
→ When prompt engineering alone cannot achieve desired behaviour
→ Custom domain terminology (medical, legal, technical jargon)
→ Specific output formats or styles that prompting can't enforce
→ Improving accuracy on narrow, repetitive tasks at lower cost
→ Process: prepare JSONL training data → upload → start fine-tune job
           → deploy fine-tuned model → test → production

6. Responsible AI & Governance

What are Microsoft's Responsible AI principles?

Microsoft has six core Responsible AI principles that apply to all AI development on Azure:

Principle	Description
Fairness	AI systems should treat all people fairly — not amplify bias based on gender, race, age, disability
Reliability & Safety	AI systems should perform reliably and safely — fail gracefully, tested across conditions
Privacy & Security	AI systems should respect privacy — protect personal data, resist attacks
Inclusiveness	AI systems should empower everyone — consider people with disabilities, diverse languages
Transparency	AI systems should be understandable — people should know when they're interacting with AI
Accountability	People should be accountable for AI systems — governance, oversight, and human review

What is Azure Content Safety and how does it work?

Azure AI Content Safety:
→ Detect harmful content in text and images before it reaches users
→ Part of Azure AI Services — REST API or SDK

Content categories detected:
Hate:          content promoting hatred based on identity attributes
Violence:      content depicting or encouraging violence
Sexual:        sexually explicit content
Self-harm:     content promoting self-harm or suicide

Severity levels: 0 (safe) | 2 (low) | 4 (medium) | 6 (high)
Action: allow | block | human review (based on threshold you set)
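
The mapping from severity scores to an action is application logic that you own. The sketch below shows one possible policy over the four categories and the 0/2/4/6 severity scale described above; the thresholds (`block_at=4`, `review_at=2`) and the function name are assumptions, not service defaults.

```python
def decide_action(severities, block_at=4, review_at=2):
    """Pick an action from per-category severities using the worst score."""
    worst = max(severities.values(), default=0)
    if worst >= block_at:
        return "block"
    if worst >= review_at:
        return "human review"
    return "allow"

# Example severities as they might be returned for one piece of text:
analysis = {"hate": 0, "violence": 2, "sexual": 0, "self_harm": 0}
print(decide_action(analysis))
```

Stricter applications (for example, those aimed at children) could lower both thresholds, while internal moderation tools might raise them.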

Azure OpenAI built-in content filters:
→ Every Azure OpenAI deployment has default content filters
→ Applied to both input (prompt) and output (completion)
→ Filters: hate, violence, sexual, self-harm content
→ Custom filters: configure thresholds per category per deployment
→ Jailbreak detection: detect attempts to bypass safety instructions
→ Prompt shield: detect indirect prompt injection attacks

Groundedness detection (Prompt Flow evaluator):
→ Detects when model answer is NOT supported by provided context
→ Example: model claims "Policy X allows unlimited leave" but
           retrieved document says "Maximum 20 days leave per year"
→ Score: 1-5 (1 = completely ungrounded, 5 = fully grounded)

Best practices:
→ Apply content filters to both input and output for every deployment
→ Use groundedness evaluator in RAG flows before production deployment
→ Log all AI interactions for audit and incident investigation
→ Implement human review for high-risk AI decisions
→ Display AI disclosures: "This answer was generated by AI"

What is the Responsible AI toolbox and governance approach?

Responsible AI toolbox (open source, integrated in Azure ML):
Interpretability:  SHAP, LIME — explain model predictions
Fairness:          Fairlearn — measure and mitigate unfairness across groups
Error analysis:    identify where and why a model fails
Causal analysis:   measure causal effects of features on predictions
Counterfactuals:   "what would need to change for a different outcome?"

Enterprise AI governance:
1. AI Impact Assessment: before deploying any AI system, document:
   → Purpose and intended use
   → Potential harms and mitigation measures
   → Stakeholders affected
   → Data provenance and quality

2. Model cards: document model capabilities, limitations,
   training data, evaluation results, intended use

3. Data governance: all training and retrieval data must comply
   with data classification and privacy policies

4. Monitoring: track model performance, drift, and bias over time
   → Azure ML Model Monitoring: detect data drift
   → Application Insights: track query patterns and user feedback

5. Human oversight: for high-risk decisions (medical diagnosis,
   loan approval, legal advice) — always require human review
   AI assists, human decides

6. Incident response: process for responding to AI failures
   → Rollback to previous model version
   → Disable AI feature if causing harm
   → Notify affected users

7. Scenario-Based Questions

Scenario: Build an enterprise Q&A chatbot over internal documents using Azure OpenAI + RAG.

Architecture:

Document ingestion pipeline:
- Upload documents (PDFs, Word, SharePoint) to Azure Blob Storage
- Azure AI Search indexer crawls blob storage
- Skillset: document cracking (OCR for scanned PDFs) → chunking (512 tokens, 20% overlap) → embedding generation (text-embedding-ada-002) → index population
- Azure AI Search index with fields: id, content, contentVector (1536-dim), title, sourcePage, sourceFile, category

RAG query flow (Prompt Flow):

1. User submits question
2. Generate query embedding via text-embedding-ada-002
3. Hybrid search (keyword + vector) + semantic reranking → top 5 chunks
4. Assemble prompt:

System: "You are a helpful assistant for Contoso employees.
         Answer based ONLY on the provided documents.
         If the answer is not in the documents, say so.
         Always cite the source document and page."
User:   "Context: [chunk1] [chunk2] [chunk3]
         Question: {user_question}"

5. GPT-4o generates grounded answer with citations

Safety: Azure Content Safety filters on input + output. Groundedness evaluator in Prompt Flow before production.
Authentication: Managed Identity for Azure OpenAI + AI Search. No API keys in code.
Frontend: Teams bot (Bot Framework) or SharePoint SPFx web part calling the Prompt Flow endpoint.

Scenario: Design a multi-agent AI system for automated invoice processing.

Orchestrator agent: receives invoice image/PDF → routes to appropriate specialist agents
Extraction agent (Document Intelligence + GPT-4o vision):
- Azure Document Intelligence extracts structured fields (invoice number, date, line items, totals)
- GPT-4o vision validates extraction and handles complex layouts
- Output: structured JSON invoice object
Validation agent:
- Checks extracted data against business rules (amounts match, vendor in approved list, PO number valid)
- Calls ERP API (function calling) to validate PO and vendor
- Flags anomalies for review
Approval routing agent:
- Based on invoice amount → routes to correct approver
- <£1,000: auto-approve, £1,000-£10,000: manager approval, >£10,000: finance director
- Sends Teams Adaptive Card approval request to approver
Posting agent:
- On approval: calls ERP API to post invoice
- Sends confirmation notification via Teams
Semantic Kernel orchestrates all agents. Each agent is a plugin. The orchestrator planner selects which plugins to invoke in sequence.
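
The approval routing rule is deterministic business logic, so it is usually better implemented as plain code (exposed to the agent as a function) than left to the model. A minimal sketch using the thresholds listed above follows; `route_invoice` is a hypothetical name, and treating exactly £1,000 and exactly £10,000 as manager approvals is an interpretation of the stated ranges.

```python
from decimal import Decimal

def route_invoice(amount_gbp):
    """Return the approval route for an invoice total in GBP."""
    amount = Decimal(str(amount_gbp))  # Decimal avoids float rounding on money
    if amount < 0:
        raise ValueError("invoice amount cannot be negative")
    if amount < Decimal("1000"):
        return "auto-approve"
    if amount <= Decimal("10000"):
        return "manager approval"
    return "finance director"

for total in ("250.00", "1000", "10000.01"):
    print(total, "->", route_invoice(total))
```

Keeping this rule in code makes the routing auditable and testable, while the agents handle the parts that genuinely need language understanding.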

Scenario: How do you reduce hallucinations in a customer-facing Azure OpenAI application?

RAG architecture: retrieve grounding context from trusted knowledge base (Azure AI Search over curated content). Never rely on the model's parametric knowledge for facts.
Explicit grounding instruction: "Answer based ONLY on the following documents. If the answer is not in the documents, respond: 'I don't have information about that in our knowledge base.'" — gives the model a safe exit.
Low temperature (0.0–0.2): minimise randomness for factual Q&A. High temperature increases creative but potentially incorrect output.
Structured output + validation: require JSON responses → parse and validate programmatically → reject malformed or out-of-range values and retry.
Groundedness evaluation: use Azure AI Studio's groundedness evaluator on a test dataset before production. Set minimum groundedness threshold (e.g., ≥ 4/5) as a deployment gate.
Citation requirement: force citations to source documents. Verify citations actually exist in the retrieved context.
Human review for high-risk outputs: legal, medical, financial answers → flagged for human review before delivery to end user.
Feedback loop: user thumbs-down triggers logging of the query + response → review queue → improve chunking or prompts.
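The "structured output + validation" and "citation requirement" steps above could look roughly like the sketch below. It assumes the model was asked to return a JSON object with "answer" and "citations" keys. That response shape, the function name, and the document IDs are assumptions made for illustration.

```python
import json

# The safe-exit sentence from the grounding instruction above.
FALLBACK = "I don't have information about that in our knowledge base."


def validate_answer(raw_response: str, retrieved_ids: set[str]) -> dict:
    """Accept a JSON model response only if every citation refers to a
    chunk that was actually retrieved for this query."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return {"ok": False, "reason": "malformed JSON"}
    if not isinstance(data, dict):
        return {"ok": False, "reason": "expected a JSON object"}

    answer = data.get("answer")
    citations = data.get("citations")
    if not isinstance(answer, str) or not isinstance(citations, list):
        return {"ok": False, "reason": "missing answer or citations"}

    # The fallback sentence is a valid answer and needs no citations.
    if answer.strip() == FALLBACK:
        return {"ok": True, "answer": answer, "citations": []}

    if not citations:
        return {"ok": False, "reason": "no citations"}
    unknown = [c for c in citations
               if not isinstance(c, str) or c not in retrieved_ids]
    if unknown:
        return {"ok": False, "reason": f"unknown citations: {unknown}"}
    return {"ok": True, "answer": answer, "citations": citations}
```

A caller would retry the request, or return the fallback sentence, whenever "ok" is False. The check only proves that each citation exists in the retrieved context. It does not prove the cited text supports the answer, which is what the groundedness evaluation is for.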

Scenario: How do you secure an Azure OpenAI deployment in an enterprise?

Managed Identity: use system-assigned Managed Identity for App Service / Azure Function to authenticate to Azure OpenAI — no API keys in code or config.
Private endpoints: disable public network access on Azure OpenAI resource. Deploy private endpoint in your VNet. DNS: private DNS zone resolves endpoint to private IP.
Azure RBAC: assign Cognitive Services OpenAI User role to the Managed Identity. Restrict Cognitive Services OpenAI Contributor (can deploy models) to admin SPs only.
Content filtering: configure custom content filters per deployment. Enable jailbreak detection and prompt shield on every deployment.
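No API keys: rotate any keys already issued, then disable key-based (local) authentication on the resource and use Entra ID authentication exclusively. API keys are shared secrets: anyone holding a key can call the resource until the key is rotated.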
APIM as gateway: route all Azure OpenAI calls through Azure API Management — rate limiting per consumer, logging, masking of sensitive data in logs, retry policies, semantic caching.
Audit logging: enable diagnostic settings → send to Log Analytics. Monitor for: high error rates, unusual token consumption, suspicious query patterns.
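Data residency: deploy Azure OpenAI in the required region (UK South for UK data residency) and choose a regional deployment type, since global deployment types may process requests in other regions. Confirm the data-processing location for the chosen deployment type before go-live.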
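The "masking of sensitive data in logs" point above can be illustrated in application code as well. The sketch below is a deliberately simple regex-based masker, assuming only email addresses and card-like digit runs need redacting. The patterns and names are illustrative assumptions, not a complete PII detector and not an APIM policy.

```python
import re

# Illustrative patterns only: a real deployment needs a vetted PII detector.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def mask_sensitive(text: str) -> str:
    """Replace each matched value with a bracketed label such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

The same function can be applied to prompts before they are sent and to request/response text before it is written to logs.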

8. Cheat Sheet — Quick Reference

Model Selection Guide

Task                                   → Best Model
Complex reasoning, vision, multimodal  → GPT-4o
High-volume classification/extraction  → GPT-4o mini (lower cost)
Multi-step math / complex reasoning    → o1 or o3-mini
Text to vector (semantic search/RAG)   → text-embedding-3-large
Image generation                       → DALL-E 3
Speech to text                         → Whisper
Code generation                        → GPT-4o or GPT-4 Turbo
Simple chat, quick responses           → GPT-3.5 Turbo (cheapest)

Temperature Quick Reference

0.0:     Deterministic — factual Q&A, extraction, classification, code
0.1-0.3: Highly consistent — data extraction, structured output, RAG
0.4-0.6: Balanced — general assistant, summarisation, translation
0.7-0.9: Creative — marketing copy, brainstorming, storytelling
1.0+:    Highly variable — poetry, experimental, maximum creativity

RAG Pipeline Steps

INGEST (one-time / scheduled):
Documents → chunk (512 tokens, 20% overlap) → embed (text-embedding-3-large)
→ index in Azure AI Search (keyword + vector fields)

RETRIEVE (per query):
User question → embed query → hybrid search (keyword + vector)
→ semantic reranking → top K chunks

GENERATE (per query):
System prompt + retrieved chunks + user question → GPT-4o
→ grounded answer with citations
→ content safety check on output

Key metrics:
Recall:     are relevant documents being retrieved?
Precision:  are retrieved documents actually relevant?
Groundedness: is the answer supported by retrieved context?
Latency:    end-to-end response time (target: < 3 seconds)
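The chunking step in INGEST can be sketched as a sliding window over a token list. This is a minimal illustration that uses the 512-token, 20% overlap figures above as defaults. It takes an already tokenised list, so the choice of tokenizer is left open, and the function name is hypothetical.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 512,
                 overlap_ratio: float = 0.2) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)
    step = max(1, chunk_size - overlap)  # always move forward

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With the defaults, each window starts 410 tokens after the previous one, so consecutive chunks share 102 tokens. In a real pipeline each chunk would also carry metadata (source document, page, position) so that the GENERATE step can produce citations.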

Azure OpenAI Security Checklist

✓ Use Managed Identity — no API keys in code
✓ Private endpoints — disable public internet access
✓ Azure RBAC — least privilege role assignments
✓ Content filters — custom thresholds per deployment
✓ Jailbreak detection + Prompt Shield enabled
✓ APIM gateway — rate limiting, logging, caching
✓ Diagnostic logging → Log Analytics
✓ Data residency — correct Azure region
✓ No sensitive PII in prompts (strip before sending)
✓ Human review for high-risk AI decisions

Prompt Engineering Quick Reference

Zero-shot:     just the task — good for simple, well-known tasks
Few-shot:      2-5 examples + task — good for custom formats/domain
Chain-of-thought: "Think step by step" — good for reasoning/maths
System prompt: persona + rules + constraints in system message
Grounding:     "Answer based ONLY on: [documents]" — reduces hallucination
JSON mode:     response_format: {type: "json_object"} — structured output
ReAct:         reason → act (function call) → observe → reason again

Temperature:   0.0-0.3 for facts, 0.7-1.0 for creativity
max_tokens:    set conservatively; output is cut off at the limit, not summarised
stop:          ["###"] — stop generation at a specific sequence
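Several of the entries above (system prompt, grounding, JSON mode, temperature, max_tokens, stop) combine into one chat completion request body. The sketch below only builds the payload as a dictionary and makes no API call. The function name, the [docN] labelling scheme, and the specific max_tokens value are assumptions chosen for illustration.

```python
def build_grounded_request(question: str, documents: list[str]) -> dict:
    """Build a chat completion request body for a grounded, JSON-mode answer."""
    # Label each document so the model can cite it as [doc1], [doc2], ...
    numbered = "\n\n".join(
        f"[doc{i}] {text}" for i, text in enumerate(documents, start=1)
    )
    # JSON mode expects the word "JSON" to appear somewhere in the messages.
    system = (
        "Answer based ONLY on the documents below. "
        "If the answer is not in the documents, say you do not know. "
        'Respond as a JSON object with keys "answer" and "citations".\n\n'
        f"{numbered}"
    )
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
        "temperature": 0.0,       # factual Q&A: keep output consistent
        "max_tokens": 400,        # illustrative cap; output is cut off here
        "stop": ["###"],
        "response_format": {"type": "json_object"},
    }
```

The returned dictionary maps onto the keyword arguments of a chat completions call once a deployment name is added as the model.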

Semantic Kernel Architecture

Kernel
  ├── AI Services (Azure OpenAI, Ollama, etc.)
  ├── Memory (Azure AI Search, Qdrant, Chroma)
  └── Plugins
        ├── Native functions (C#/Python code)
        └── Semantic functions (prompt templates)

Agent architecture:
Agent = Kernel + Instructions + Plugins
Planner: AI generates multi-step plan → executes plugins in sequence
Multi-agent: Orchestrator agent → delegates to specialist agents

Common plugins:
TimePlugin:      get current date/time
HttpPlugin:      call external REST APIs
SearchPlugin:    query Azure AI Search
EmailPlugin:     send emails via Graph API
CodePlugin:      execute Python code safely
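To make the Kernel/Plugins relationship concrete, here is a toy registry in plain Python. It is not the Semantic Kernel API: the ToyKernel class, its method names, and the "Plugin-function" naming are all hypothetical. It only illustrates the idea that plugins expose named functions, the model picks one and supplies arguments, and your code executes it.

```python
import json
from datetime import datetime, timezone
from typing import Callable


class ToyKernel:
    """Hypothetical stand-in for a kernel: a registry of plugin functions."""

    def __init__(self) -> None:
        self._functions: dict[str, Callable[..., object]] = {}

    def add_function(self, plugin: str, name: str,
                     func: Callable[..., object]) -> None:
        # Functions are addressed as "<plugin>-<name>".
        self._functions[f"{plugin}-{name}"] = func

    def invoke(self, tool_call: dict) -> object:
        """Execute a model-requested call of the assumed shape
        {"name": "...", "arguments": "<JSON string>"}."""
        func = self._functions.get(tool_call["name"])
        if func is None:
            raise KeyError(f"unknown function: {tool_call['name']}")
        args = json.loads(tool_call.get("arguments") or "{}")
        return func(**args)


def utc_now() -> str:
    """Native function for a TimePlugin: current UTC time, ISO 8601."""
    return datetime.now(timezone.utc).isoformat()


kernel = ToyKernel()
kernel.add_function("TimePlugin", "utc_now", utc_now)
```

In an agent loop, the result of invoke would be appended to the conversation as a tool message so the model can observe it and reason again, which is the ReAct cycle.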

Top 10 Tips

Azure OpenAI ≠ OpenAI API — Azure OpenAI keeps data in your tenant, has enterprise compliance (GDPR, HIPAA, SOC 2), supports Managed Identity and private endpoints. Always recommend Azure OpenAI for enterprise solutions.
RAG is the most important pattern — most enterprise AI questions come back to RAG. Know the three phases: ingest (chunk + embed + index), retrieve (hybrid search), generate (grounded prompt). This pattern is the answer to "how do you build AI on your own data."
Tokens are the unit of cost and limits: 1 token ≈ 4 characters of English text. The context window covers input and output tokens combined. Chunking, and sending only the fields and passages the model needs, reduces token usage and cost.
Temperature 0 for facts, higher for creativity — never leave temperature at default (1.0) for factual Q&A or extraction tasks. Low temperature = consistent, deterministic output for structured tasks.
Function calling = the foundation of agents — the model decides WHEN and HOW to call a function; your code executes it. This enables LLMs to access real-time data and take actions. Essential for agentic AI.
Chunking strategy matters for RAG quality — chunk too small: loses context. Chunk too large: imprecise retrieval. 512-1024 tokens with 10-20% overlap is the practical starting point. Always include metadata per chunk for citations.
Hybrid search over pure vector search — hybrid (keyword + vector) + semantic reranking consistently outperforms pure vector search. Use Azure AI Search's hybrid mode for production RAG.
Managed Identity over API keys — API keys are shared secrets that can be leaked. Managed Identity requires no credential management and integrates with Azure RBAC. Always recommend for production deployments.
Content filters + groundedness detection before production — configure custom content filters, enable jailbreak detection, and run groundedness evaluation on a test dataset. These are the RAG quality gates before going live.
Responsible AI is not optional — every enterprise AI project needs an AI Impact Assessment, content safety filters, bias evaluation, and human oversight for high-risk decisions. Microsoft's six principles (fairness, reliability, privacy, inclusiveness, transparency, accountability) apply to every deployment.
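For the hybrid search tip above, one way to see how keyword and vector results combine is Reciprocal Rank Fusion (RRF), the merging method Azure AI Search documents for hybrid queries. The sketch below illustrates the idea only, not the service's implementation. The constant k = 60 is a commonly used default and an assumption here.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs: each appearance adds 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a", "b", "c"]   # e.g. keyword (BM25) order
vector_hits = ["c", "a", "d"]    # e.g. vector similarity order
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Documents that rank well in both lists ("a" and "c" here) rise above documents that appear in only one, which is why hybrid retrieval tends to be more robust than either method alone.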
