Wednesday, April 29, 2026

Microsoft Fabric & OneLake — Complete Guide

OneLake · Lakehouse · Data Warehouse · Direct Lake · Medallion Architecture · Real-Time Intelligence · Scenarios · Cheat Sheet


Table of Contents

  1. Core Concepts — Basics
  2. OneLake — The Foundation
  3. Fabric Workloads
  4. Data Engineering — Notebooks, Pipelines & Medallion
  5. Security & Governance
  6. Performance Optimisation
  7. Scenario-Based Questions
  8. Cheat Sheet — Quick Reference

1. Core Concepts — Basics

What is Microsoft Fabric and what problem does it solve?

Microsoft Fabric is an all-in-one analytics platform that unifies data engineering, data integration, data warehousing, real-time analytics, data science, and business intelligence into a single SaaS platform.

Before Fabric (fragmented):

  • Azure Synapse → warehousing
  • Azure Data Factory → pipelines
  • Azure Databricks → Spark
  • Power BI Premium → BI
  • Separate storage, separate governance, separate billing, complex integration

With Fabric (unified):

  • All workloads in one platform
  • Single copy of data in OneLake
  • Unified security and governance via Microsoft Purview
  • Single capacity billing model (F SKUs)

Key insight: Fabric is NOT a new product — it is a platform that unifies existing Microsoft data tools under one roof with OneLake as the shared storage layer. Everything reads and writes from the same data lake.


What are the core workloads in Microsoft Fabric?

Workload | Description | Evolution From
Data Factory | Data integration and orchestration — pipelines and dataflows | Azure Data Factory
Data Engineering | Big data processing — Lakehouses, Spark notebooks | Azure Synapse Spark
Data Warehouse | SQL-based analytics warehouse — T-SQL querying | Azure Synapse SQL Dedicated Pool
Real-Time Intelligence | Streaming ingestion and KQL analytics | Azure Data Explorer
Data Science | ML experiments, models, Python/R notebooks | Azure Synapse ML
Power BI | Business intelligence (Direct Lake mode) | Power BI Premium
Data Activator | No-code data alerting and automation | New in Fabric

What is a Fabric Workspace and how does it relate to capacity?

Workspace: collaboration container holding all Fabric items (Lakehouses, Warehouses, Notebooks, Pipelines, Reports, Semantic Models).

Capacity: the compute allocation powering Fabric workloads. Every workspace must be assigned to a capacity.

SKU | Type | Use Case
Trial | Free 60 days | Evaluation
F2–F2048 | Fabric SKUs | Pay-as-you-go or reserved
P SKUs | Power BI Premium | Existing Premium → Fabric compatible

Tip: Capacity is shared across ALL workloads in assigned workspaces. A Spark job, pipeline run, and Power BI report refresh all consume from the same capacity pool. Size based on peak concurrent workload requirements.


How does Microsoft Fabric relate to Azure Synapse Analytics?

Fabric is the strategic successor to Azure Synapse Analytics:

Synapse Feature | Fabric Equivalent
Synapse Spark | Fabric Data Engineering (Lakehouses + Notebooks)
Synapse SQL Dedicated Pool | Fabric Data Warehouse
Synapse Pipelines | Fabric Data Factory Pipelines
Synapse Link | Fabric Mirroring
Azure Data Explorer | Fabric Real-Time Intelligence (Eventhouse/KQL DB)

Key additions in Fabric vs Synapse: OneLake (shared storage), Direct Lake Power BI mode, Data Activator, unified Purview governance, simpler capacity-based pricing.

Warning: Azure Synapse is not deprecated — existing workloads continue. But all new analytics projects should be built on Fabric. Microsoft's investment is entirely focused on Fabric going forward.


What is the Fabric item hierarchy?

Microsoft Fabric Tenant
└── Capacity (F64, F128, etc.)
    └── Workspace (collaboration container)
        ├── Lakehouse (data lake + Delta tables)
        ├── Data Warehouse (T-SQL warehouse)
        ├── Notebook (Spark development)
        ├── Spark Job Definition (packaged Spark job)
        ├── Data Pipeline (orchestration)
        ├── Dataflow Gen2 (Power Query ETL)
        ├── Semantic Model (Power BI dataset)
        ├── Report (Power BI report)
        ├── KQL Database / Eventhouse (real-time)
        ├── Eventstream (streaming ingestion)
        └── Reflex (Data Activator alerts)

2. OneLake — The Foundation

What is OneLake and what makes it architecturally significant?

OneLake is a single, unified, tenant-wide data lake that is the storage foundation of Microsoft Fabric.

Key architectural principles:

  1. One copy of data: all Fabric workloads read from the same data — no copies between workloads
  2. Automatic with Fabric: every Fabric tenant has exactly one OneLake, automatically provisioned
  3. Built on ADLS Gen2: fully compatible with any tool supporting Azure Data Lake Storage
  4. Open formats: data stored in Delta Parquet — readable by any engine (Spark, Trino, DuckDB) without conversion
  5. Hierarchical: Tenant → Workspace → Lakehouse/Warehouse → Tables/Files

Analogy: OneLake is the "OneDrive for data" — just as OneDrive gives every user one place for files, OneLake gives every organisation one place for data.
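
To make "open formats" concrete, here is a minimal sketch using the open-source delta-rs Python package (pip install deltalake) to read a Delta table without any Spark engine. The path is a placeholder; OneLake paths additionally need credentials via storage_options:

from deltalake import DeltaTable

# Placeholder path: any Delta table location works (local, ADLS, OneLake)
dt = DeltaTable("/data/silver_orders")
print(dt.version())        # current table version
df = dt.to_pandas()        # load the table into pandas, no Spark needed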


What is the OneLake folder structure?

OneLake (tenant-level)
└── Workspace: ContosoAnalytics
    ├── Bronze_Lakehouse.Lakehouse/
    │   ├── Tables/                    ← Delta tables (queryable via SQL)
    │   │   ├── raw_orders/            ← Delta Parquet files
    │   │   └── raw_customers/
    │   └── Files/                     ← Raw files (CSV, JSON, Parquet, images)
    │       ├── raw/
    │       └── processed/
    ├── Gold_Lakehouse.Lakehouse/
    │   └── Tables/
    │       ├── FactSales/
    │       └── DimProduct/
    └── Finance_Warehouse.Warehouse/
        └── Tables/                    ← Warehouse tables (Delta format)

Access paths:
ADLS DFS:  abfss://workspace@onelake.dfs.fabric.microsoft.com/Bronze_Lakehouse.Lakehouse/Tables/raw_orders
OneLake:   https://onelake.dfs.fabric.microsoft.com/ContosoAnalytics/Bronze_Lakehouse.Lakehouse/Tables/raw_orders

What are OneLake Shortcuts and how do they enable a "single copy" architecture?

Shortcuts are virtual pointers to data stored outside the current Lakehouse — appearing as folders but not copying data.

Supported shortcut sources:

  • Another workspace in OneLake
  • Azure Data Lake Storage Gen2
  • Amazon S3
  • Google Cloud Storage
  • Dataverse (via Fabric Mirroring)

Shortcut scenarios:

1. Cross-workspace data sharing (zero copy):
   Finance Lakehouse → shortcut → Sales Gold Lakehouse/Tables/FactSales
   Finance queries FactSales without any data duplication
   Sales updates are immediately visible to Finance

2. External ADLS access:
   Lakehouse → shortcut → abfss://existing-lake.dfs.core.windows.net/raw
   Query existing ADLS data from Fabric without migration

3. Multi-cloud data access:
   Lakehouse → shortcut → AWS S3 bucket
   Spark notebooks query S3 data alongside OneLake data

Tip: Shortcuts eliminate the need to copy data between teams, workspaces, and clouds. This is how Fabric implements "single copy of truth" across a large organisation — no ETL duplication pipelines.


What is the Delta Lake format and why does Fabric use it as the standard?

Delta Lake is an open-source storage format adding ACID transactions, schema enforcement, time travel, and efficient upserts to Parquet files.

Feature | Benefit
ACID transactions | Concurrent reads/writes without corruption
Time travel | Query data as of a previous point in time
Schema evolution | Add columns without breaking existing readers
Efficient upserts | MERGE INTO for CDC patterns
Open format | Any Spark, Trino, DuckDB engine can read
Direct Lake | Power BI reads Delta files at in-memory speed

-- Time travel:
SELECT * FROM silver_orders VERSION AS OF 5
SELECT * FROM silver_orders TIMESTAMP AS OF '2025-01-01'

-- Upsert (MERGE):
MERGE INTO silver_orders AS target
USING new_orders AS source
ON target.OrderId = source.OrderId
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

-- View Delta history:
DESCRIBE HISTORY silver_orders
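
The same history features are available from PySpark. A small sketch, reusing the silver_orders table from the examples above:

# Read an older version of the table
df_v5 = (spark.read.format("delta")
         .option("versionAsOf", 5)
         .load("Tables/silver_orders"))

# Or the state as of a timestamp
df_jan = (spark.read.format("delta")
          .option("timestampAsOf", "2025-01-01")
          .load("Tables/silver_orders"))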

3. Fabric Workloads

What is a Fabric Lakehouse and how does it differ from a Data Warehouse?

Attribute | Lakehouse | Data Warehouse
Schema | On-read (Files) + on-write (Tables) | Strict schema-on-write (DDL required)
Data types | Structured + semi-structured + unstructured | Structured only
Primary interface | Spark (Python, Scala, SQL) | T-SQL (INSERT, UPDATE, DELETE, MERGE)
Secondary interface | Auto SQL Endpoint (read-only T-SQL) | Spark (read-only, via OneLake)
Stored procedures | No | Yes
Best for | Data engineering, data science, raw→curated | Business analysts, BI, structured semantic layer
Storage | OneLake (Delta Parquet) | OneLake (Delta Parquet)

Tip: Both store data in OneLake as Delta Parquet. The Lakehouse auto-generates a SQL Endpoint — so a Lakehouse IS also queryable via T-SQL (read-only). A Warehouse is T-SQL read-write. This overlap confuses many candidates.


What is Direct Lake mode in Power BI and why is it a breakthrough?

Direct Lake is a new Power BI connectivity mode that reads Delta Parquet files directly from OneLake into Power BI's VertiPaq in-memory engine.

Three Power BI connection modes:

Import:
→ Data COPIED into VertiPaq memory
→ Fastest queries | Stale (last refresh only) | Size limited

DirectQuery:
→ Live queries to source on EVERY visual interaction
→ Always current | Slow (source DB performance) | No size limit

Direct Lake (NEW — Fabric only):
→ Reads Delta Parquet from OneLake into VertiPaq ON DEMAND
→ Import-speed performance (in-memory columnar)
→ Always current (reads latest Delta files automatically)
→ No dataset size limit
→ "Framing": loads column segments on first access
→ Falls back to DirectQuery only when a query cannot be served from Delta files (e.g., capacity guardrails exceeded)

Tip: Direct Lake eliminates the Import vs DirectQuery trade-off — Import-level performance + DirectQuery-level freshness. This is the biggest Power BI innovation in years and a guaranteed question for any Fabric role.


What is Real-Time Intelligence in Fabric?

Component | Description
Eventstream | No-code streaming ingestion — capture, transform, route events from Event Hubs, IoT Hub, Kafka, CDC
KQL Database (Eventhouse) | High-performance analytical DB for time-series data. KQL queries. Billions of rows in seconds.
KQL Queryset | Saved KQL queries for reuse
Real-Time Dashboard | Auto-refreshing dashboards (sub-second) powered by KQL
Data Activator (Reflex) | No-code alerting — trigger email/Teams/flows when data conditions are met

Real-time pipeline:
IoT sensors / Event Hubs / Kafka / CDC
  → Eventstream (ingest + route + transform)
  → KQL Database / Eventhouse (store + query)
  → Real-Time Dashboard (visualise live)
  → Data Activator (alert on anomalies)

What is Fabric Mirroring and what databases does it support?

Fabric Mirroring replicates operational database data into OneLake as Delta Parquet tables in near real-time — using Change Data Capture (CDC), not traditional ETL.

Supported sources:

  • Azure SQL Database
  • Azure Cosmos DB
  • Snowflake
  • Azure SQL Managed Instance
  • Azure Database for PostgreSQL
  • MongoDB Atlas

How it works:
Source DB (Azure SQL)
  → CDC captures inserts/updates/deletes
  → Fabric replicates changes into OneLake
  → Latency: typically < 1 minute
  → Stored as Delta tables (queryable via Spark, SQL Endpoint, Power BI)
  → Appears as a shortcut in the workspace

Benefits:
→ No ETL pipeline to build or maintain
→ Near real-time analytics on operational data
→ Zero duplication — reads directly from CDC log
→ Combines with medallion architecture for Silver/Gold transformation

Tip: Mirroring eliminates complex ADF pipelines for operational DB analytics. It is the recommended pattern for near real-time analytics on Azure SQL, Cosmos DB, and Snowflake.


What is the Semantic Model in Fabric?

A Semantic Model (previously called a Power BI Dataset) is the reusable analytical layer between data storage and reports. In Fabric:

  • Created directly on a Lakehouse or Warehouse
  • Uses Direct Lake mode by default (no data import needed)
  • Contains measures (DAX), hierarchies, relationships, and security roles
  • Multiple reports can share a single semantic model
  • Can be certified/endorsed for enterprise use
  • Accessible via XMLA endpoint for external tools (Excel, Tabular Editor, SSMS)

4. Data Engineering — Notebooks, Pipelines & Medallion

What is the Medallion Architecture and how is it implemented in Fabric?

The Medallion Architecture organises data into three progressive quality layers stored as Delta tables in Lakehouses.

Bronze (Raw) layer:
→ Raw data as-is from source systems
→ No transformations, append-only or full snapshot
→ Preserves source fidelity for reprocessing
→ Lakehouse: Tables/bronze/ or Files/bronze/

Silver (Cleansed) layer:
→ Cleaned, validated, deduplicated
→ Standardised data types and naming conventions
→ Joined/enriched from multiple Bronze sources
→ Optimised: partitioned, Z-ordered
→ Lakehouse: Tables/silver/

Gold (Business) layer:
→ Business-level aggregations and metrics
→ Conformed dimensions and fact tables (star schema)
→ Ready for Power BI Direct Lake consumption
→ Domain-specific: Sales Gold, Finance Gold, HR Gold
→ Lakehouse: Tables/gold/

Recommended Fabric implementation:
Bronze Workspace   → Bronze Lakehouse
Silver Workspace   → Silver Lakehouse (shortcuts to Bronze)
Gold Workspace     → Gold Lakehouse (shortcuts to Silver)
Reporting          → Semantic Models → Power BI Reports

Separate workspaces = separate security roles per layer

Tip: Separate Lakehouses per medallion layer is the recommended enterprise pattern — different security roles, workspace assignments, and access control per layer.


What are Fabric Notebooks and what do they support?

Fabric Notebooks are interactive Spark development environments supporting:

  • PySpark (Python + Spark) — primary language
  • Spark SQL — SQL against Delta tables
  • Scala — JVM Spark code
  • R (SparkR) — R with Spark
  • %%sql magic — inline SQL in Python notebooks
# Read from Bronze Lakehouse
df = spark.read.format("delta").load("Tables/raw_orders")

# Transform (PySpark)
from pyspark.sql.functions import col, to_date, when, trim

df_silver = (df
  .filter(col("OrderStatus") != "TEST")
  .filter(col("OrderId").isNotNull())
  .withColumn("OrderDate", to_date(col("OrderDateStr"), "yyyy-MM-dd"))
  .withColumn("CustomerName", trim(col("CustomerName")))
  .withColumn("Region",
    when(col("Country").isin("UK", "IE", "DE", "FR"), "EMEA")
    .when(col("Country").isin("US", "CA"), "AMER")
    .otherwise("APAC"))
  .dropDuplicates(["OrderId"])
)

# Write to Silver Lakehouse as Delta (upsert pattern)
from delta.tables import DeltaTable

if spark.catalog.tableExists("silver_orders"):
    DeltaTable.forName(spark, "silver_orders").alias("target") \
      .merge(df_silver.alias("source"), "target.OrderId = source.OrderId") \
      .whenMatchedUpdateAll() \
      .whenNotMatchedInsertAll() \
      .execute()
else:
    df_silver.write.format("delta").saveAsTable("silver_orders")

What are Fabric Data Pipelines?

Fabric Data Pipelines use the same visual designer and JSON format as Azure Data Factory. Key activities:

Activity | Purpose
Copy Data | Move data between 90+ sources/sinks
Notebook | Execute a Fabric Notebook (Spark)
Spark Job Definition | Run a packaged PySpark/Scala job
Dataflow Gen2 | Run a Power Query Online dataflow
Stored Procedure | Execute SQL in a Warehouse
If Condition / Switch | Conditional control flow
ForEach / Until | Iterative control flow
Get Metadata / Validation | File/table checks

Tip: ADF pipelines are compatible with Fabric pipelines (same JSON format). The key difference: Fabric pipelines have native OneLake connectivity — no linked service configuration needed for internal Fabric data access.


What is a Dataflow Gen2 and when should you use it vs Spark?

Use Dataflow Gen2 when:

  • Business analyst / maker-level ETL (no Spark knowledge needed)
  • 150+ Power Query connectors not available in pipelines
  • Simple transformations: filter, merge, pivot, custom M columns
  • Small-to-medium datasets

Use Spark Notebooks when:

  • Large-scale data (billions of rows)
  • Complex transformations requiring ML or custom logic
  • Full developer control and unit testability
  • Integration with Python ecosystem (pandas, scikit-learn, PyTorch)

Dataflow Gen2 outputs to:

  • Fabric Lakehouse (Tables or Files)
  • Fabric Data Warehouse
  • Azure SQL, other external databases
  • OneLake directly

What is V-Order and Z-Order optimisation in Fabric Delta tables?

V-Order: a Microsoft-specific write-time optimisation for Delta Parquet files. Sorts and compresses data within Parquet row groups for maximum Power BI Direct Lake read performance. Enabled by default in Fabric.

Z-Order: a data skipping optimisation co-locating related data in the same Parquet files. When filtering on a Z-ordered column, Spark skips entire files that don't match — dramatically reducing I/O.

# Z-Order after data ingestion (run periodically):
spark.sql("""
  OPTIMIZE gold_FactSales
  ZORDER BY (CustomerId, OrderDate)
""")
# Queries filtering by CustomerId or OrderDate now read far fewer files

# Optimised write and V-Order (both enabled by default in Fabric):
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")

# Compact small files:
spark.sql("OPTIMIZE silver_orders")  # merges small files into ~256MB chunks

# Vacuum old Delta versions (default: 7 days retention):
spark.sql("VACUUM silver_orders RETAIN 168 HOURS")

Key distinction: V-Order = Fabric/Power BI optimisation (Direct Lake performance). Z-Order = Spark/Delta general query optimisation. Know both for architect-level.


5. Security & Governance

What are the workspace roles in Fabric?

Role | Capabilities
Admin | Full control — manage workspace, members, all items. Can delete the workspace.
Member | Create, edit, delete items. Share items. Cannot manage workspace settings.
Contributor | Create and edit items. Cannot delete others' items or share.
Viewer | Read-only — view reports, query the SQL Endpoint. Cannot edit.

Warning: Workspace roles are coarse-grained — they apply to ALL items. For fine-grained data access (which rows/columns), use Row-Level Security and Column-Level Security within Lakehouses and Warehouses.


What is OneSecurity and how does it implement Row and Column-Level Security?

-- ROW-LEVEL SECURITY (Warehouse / Lakehouse SQL Endpoint):
-- 1. Create security predicate function
CREATE FUNCTION dbo.fn_region_filter(@Region AS sysname)
  RETURNS TABLE WITH SCHEMABINDING AS
  RETURN SELECT 1 AS result
    WHERE @Region = USER_NAME()
       OR IS_MEMBER('GlobalDataAdmin') = 1;

-- 2. Apply security policy
CREATE SECURITY POLICY RegionFilter
  ADD FILTER PREDICATE dbo.fn_region_filter(Region)
  ON dbo.SalesOrders WITH (STATE = ON);
-- Users only see rows where Region matches their username

-- COLUMN-LEVEL SECURITY:
GRANT SELECT ON dbo.Employees(EmployeeId, Name, Department) TO [HRViewers];
DENY SELECT ON dbo.Employees(Salary, PersonalEmail, NationalId) TO [HRViewers];

-- DYNAMIC DATA MASKING:
ALTER TABLE dbo.Customers
  ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');
-- Non-privileged users see: aXXX@XXXX.com instead of actual email

How does Microsoft Purview integrate with Fabric for governance?

Purview Feature Fabric Integration
Data Catalog Auto-scans Lakehouses, Warehouses, Datasets, Reports — discovers schemas and metadata
Data Lineage End-to-end lineage from source → pipeline → table → report
Sensitivity Labels Labels on Fabric items flow downstream to reports and exports
Information Protection Prevent export of highly confidential data
Audit Logs All Fabric workspace and item activity logged

Tip: Data lineage in Purview is especially powerful for Fabric — you can trace a single Power BI KPI back through every transformation step to its source system. This is an architect-level differentiator.


What is Fabric capacity management and how do you handle throttling?

All Fabric workloads consume Capacity Units (CUs) from the assigned capacity.

Throttling tiers (in order of impact):

  1. Background operations throttled first (Spark jobs, pipelines)
  2. Interactive operations throttled second (SQL queries, report loads)
  3. Users see delays/errors if throttling is severe

Management approach:

  1. Monitor: Fabric Capacity Metrics App (Power BI app) — CU consumption per workload, workspace, item
  2. Identify top consumers: find which notebooks, pipelines, or reports consume the most CUs
  3. Workspace isolation: assign heavy workloads (batch pipelines) to a separate capacity from interactive reporting
  4. Schedule off-peak: run heavy Spark jobs during off-peak hours
  5. Right-size: F64 for most enterprise scenarios. Scale up if Capacity Metrics shows consistent throttling.

6. Performance Optimisation

What are the key performance best practices for Fabric Lakehouses?

Delta table optimisation:

# 1. Compact small files (run after incremental loads)
spark.sql("OPTIMIZE tablename")

# 2. Z-Order on high-cardinality filter columns
spark.sql("OPTIMIZE tablename ZORDER BY (CustomerId, OrderDate)")

# 3. Partition large tables by low-cardinality column (e.g., date)
df.write.format("delta") \
  .partitionBy("Year", "Month") \
  .saveAsTable("partitioned_table")

# 4. Vacuum old file versions (keep 7 days for time travel)
spark.sql("VACUUM tablename RETAIN 168 HOURS")

# 5. Enable V-Order for Power BI Direct Lake performance
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")

Notebook performance:

  • Use .select() early to reduce columns processed
  • Filter data as early as possible (predicate pushdown)
  • Cache frequently reused DataFrames: df.cache()
  • Use broadcast() for small lookup tables in joins
  • Avoid collect() on large DataFrames (brings all data to driver)
  • Write Delta with mergeSchema only when needed — adds overhead
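
A minimal PySpark sketch combining these practices (table and column names are illustrative):

from pyspark.sql.functions import broadcast, col

# Select and filter as early as possible so less data is shuffled
orders = (spark.read.format("delta").load("Tables/silver_orders")
          .select("OrderId", "CustomerId", "Amount", "OrderDate")
          .filter(col("OrderDate") >= "2025-01-01"))

# Broadcast the small dimension table to avoid a shuffle join
dim_customer = spark.read.format("delta").load("Tables/dim_customer")
enriched = orders.join(broadcast(dim_customer), "CustomerId")

enriched.cache()            # reused twice below, so cache once
print(enriched.count())     # materialises the cache
enriched.groupBy("CustomerId").sum("Amount").show(10)  # aggregate on the cluster instead of collect()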

What causes Direct Lake fallback to DirectQuery?

Direct Lake falls back to DirectQuery (slower) when a query cannot be served from Delta files:

  1. Capacity guardrails exceeded — table row counts or model size beyond the limits of the capacity SKU
  2. Unsupported objects — the semantic model is built on SQL views, or SQL Endpoint security (e.g., RLS) forces DirectQuery

Separately, framing and read performance degrade (without a strict fallback) when:

  3. V-Order not applied — non-V-Order Delta files are slower for Direct Lake to read
  4. Too many small files — hundreds of tiny Parquet files slow framing
  5. Schema changes — recent DDL invalidates framed column segments, forcing reframing
  6. Large Delta history — many old, unvacuumed versions slow file listing

Fix: Run OPTIMIZE + VACUUM on Gold tables. Ensure V-Order is enabled at write time. Monitor Direct Lake framing status in the Fabric workspace.


7. Scenario-Based Questions

Scenario: Design a modern data platform using Microsoft Fabric for a retail organisation with 50 stores.

Architecture:

  1. Ingestion:

    • POS transaction data → Fabric Pipeline Copy Activity → Bronze Lakehouse (daily batch)
    • Inventory DB (Azure SQL) → Fabric Mirroring → Bronze Lakehouse (near real-time)
    • Web analytics events → Eventstream → KQL Database (real-time)
  2. Medallion layers:

    • Bronze Lakehouse: raw data as-is
    • Silver Lakehouse: cleansed, deduped, joined (Spark notebooks)
    • Gold Lakehouse: star schema — FactSales, DimStore, DimProduct, DimDate
  3. Reporting:

    • Power BI Direct Lake Semantic Model on Gold Lakehouse
    • Store manager dashboards (RLS: each manager sees only their store)
    • Executive KPI reports (no RLS)
  4. Real-time operations:

    • Real-Time Dashboard from KQL Database — live sales by store
    • Data Activator: alert when stock drops below reorder threshold
  5. Data science:

    • Fabric Notebooks: demand forecasting model (Python + Spark)
    • Predictions written back to Gold Lakehouse → visible in Power BI
  6. Governance:

    • Purview sensitivity labels on customer PII tables
    • RLS in Gold Lakehouse (store managers see only their store)
    • Data lineage tracking in Purview
  7. Capacity: F64 — separate workspaces for Bronze/Silver/Gold layers


Scenario: How do you migrate from Azure Synapse Analytics to Microsoft Fabric?

  1. Assessment: inventory all Synapse assets — pipelines, Spark notebooks, SQL scripts, dedicated pool tables. Use Fabric migration assessment tool.
  2. OneLake setup: create Fabric workspace, create Lakehouses matching Synapse storage structure.
  3. Data migration:
    • ADLS Gen2 data → OneLake Shortcut (zero-copy, immediate access — fastest approach)
    • Or: Fabric Pipeline copy from ADLS to OneLake Tables
  4. Pipeline migration: Synapse pipelines → Fabric Pipelines (JSON format compatible). Update linked services to native Fabric connections.
  5. Notebook migration: copy Synapse Spark notebooks into Fabric. Update storage paths from abfss://container@storage.../ to OneLake paths. Most PySpark code runs unchanged.
  6. SQL migration: Synapse Dedicated Pool T-SQL → Fabric Data Warehouse T-SQL. Validate stored procedures, views, and custom functions.
  7. Power BI migration: update Semantic Models to use Direct Lake mode pointing to Fabric Lakehouse/Warehouse instead of Synapse connection.
  8. Parallel run: validate data output matches between Synapse and Fabric for 2–4 weeks before decommissioning.

Scenario: A Gold layer table is slow to query in Power BI Direct Lake. How do you diagnose and fix it?

  1. Check Direct Lake vs DirectQuery fallback: Power BI Desktop → Performance Analyzer → verify queries show "Direct Lake" mode, not "DirectQuery." Fallback indicates framing not complete.

  2. V-Order check: ensure V-Order was applied when writing the Gold table. Re-write with V-Order enabled if missing.

  3. OPTIMIZE the table: run OPTIMIZE gold_FactSales ZORDER BY (CustomerId, OrderDate) to compact small files and improve data skipping.

  4. Check small file count: DESCRIBE DETAIL gold_FactSales — if numFiles is in the hundreds, run OPTIMIZE to merge them.

  5. DAX measure performance: use Performance Analyzer in Power BI Desktop to separate "DAX query" time from "Direct Lake" time. Slow DAX = model issue, not storage issue.

  6. Add aggregation table: for very large facts (billions of rows), create a pre-aggregated summary table in Gold. Point the Semantic Model to the aggregation first.


Scenario: How do you implement data sharing between two business units without duplicating data?

Requirement: Finance needs Sales Gold tables for combined reporting. Data stays in Sales workspace. Zero data copy.

Solution: OneLake Shortcuts

  1. In Finance Workspace → Finance Lakehouse → New Shortcut
  2. Source: Another workspace in OneLake → Sales Gold Lakehouse → Tables/FactSales
  3. FactSales appears in Finance Lakehouse as a shortcut (virtual — no copy)
  4. Finance Spark notebooks, SQL Endpoint, and Power BI Direct Lake all read via the shortcut
  5. Apply RLS to Sales Gold Lakehouse SQL Endpoint — Finance users see only shared records
  6. Finance Power BI Semantic Model spans both Finance tables and Sales shortcuts — single model, zero data duplication

Tip: This is the canonical Fabric cross-workspace data sharing pattern. Shortcuts eliminate inter-team ETL pipelines entirely.


Scenario: How do you implement near real-time analytics on Azure SQL operational data?

Solution: Fabric Mirroring

  1. In Fabric workspace → New → Mirrored database → Azure SQL Database
  2. Connect to Azure SQL (provide connection string and credentials)
  3. Select tables to mirror: Orders, Customers, Products
  4. Fabric automatically starts CDC-based replication
  5. Within minutes, tables appear in OneLake as Delta tables
  6. Ongoing: changes in Azure SQL replicate to OneLake with < 1 minute latency
  7. Build Silver transformations in Spark notebooks reading from mirrored tables
  8. Power BI Direct Lake Semantic Model reads Gold tables for near real-time dashboards

No ETL pipeline to build or maintain. Mirroring handles all change capture automatically.


8. Cheat Sheet — Quick Reference

Fabric Workload Selection Guide

Need to...                                    → Use
Ingest data from 90+ sources                  → Data Pipeline (Copy Activity)
Low-code ETL with 150+ connectors             → Dataflow Gen2
Big data transformation at scale              → Spark Notebook
Query structured data with T-SQL              → Data Warehouse
Store mixed-format data (raw + Delta)         → Lakehouse
Near real-time DB replication into OneLake    → Fabric Mirroring
Ingest streaming events                       → Eventstream
Query time-series at extreme speed            → KQL Database (Eventhouse)
Visualise live streaming data                 → Real-Time Dashboard
Alert when data condition is met              → Data Activator (Reflex)
Build Power BI reports                        → Semantic Model + Reports
Share data without copying                    → OneLake Shortcut

OneLake Access Paths

ADLS Gen2 DFS endpoint:
abfss://{workspaceName}@onelake.dfs.fabric.microsoft.com/{itemName}.Lakehouse/Tables/{tableName}

OneLake REST endpoint:
https://onelake.dfs.fabric.microsoft.com/{workspaceName}/{itemName}.Lakehouse/Tables/{tableName}

In Fabric Notebook (relative path within same Lakehouse):
spark.read.format("delta").load("Tables/tableName")
spark.read.format("delta").load("Files/raw/filename.csv")

Medallion Architecture Quick Reference

Layer     Storage        Quality      Users
Bronze    Lakehouse      Raw          Data Engineers only
Silver    Lakehouse      Cleansed     Data Engineers + Data Scientists
Gold      Lakehouse      Business     Analysts + Power BI + Business Users

Transformation:
Bronze → Silver: Spark Notebooks (clean, dedupe, join)
Silver → Gold:   Spark Notebooks or Dataflow Gen2 (aggregate, model as star schema)
Gold → Reports:  Power BI Direct Lake Semantic Model

Scheduling:
Fabric Pipeline orchestrates Notebook runs
Bronze refresh → trigger Silver → trigger Gold → trigger Semantic Model refresh

Delta Table Maintenance Commands

-- Compact small files
OPTIMIZE tablename

-- Compact + Z-Order on filter columns
OPTIMIZE tablename ZORDER BY (col1, col2)

-- Remove old Delta versions (keep 7 days)
VACUUM tablename RETAIN 168 HOURS

-- View Delta history
DESCRIBE HISTORY tablename

-- Time travel query
SELECT * FROM tablename VERSION AS OF 5
SELECT * FROM tablename TIMESTAMP AS OF '2025-01-15'

-- Table details (file count, size, format)
DESCRIBE DETAIL tablename

Direct Lake vs Import vs DirectQuery

                  Import    DirectQuery    Direct Lake
Performance       ★★★★★    ★★☆☆☆          ★★★★★
Data freshness    ★★☆☆☆    ★★★★★          ★★★★★
Dataset size      Limited   Unlimited      Unlimited
Requires Fabric   No        No             YES
Storage           VertiPaq  None           VertiPaq (framed)
Needs refresh?    Yes       No             No (auto-current)

Top 10 Tips

  1. OneLake = "OneDrive for data" — one tenant-wide lake, all workloads share one copy of data. This single sentence is the most important concept in Fabric.
  2. Direct Lake = Import speed + DirectQuery freshness — the breakthrough that eliminates the historic Power BI trade-off. Know how framing works and what causes DirectQuery fallback.
  3. Lakehouse SQL Endpoint is read-only T-SQL — a Lakehouse auto-generates a SQL endpoint. A Warehouse is full read-write T-SQL. Many candidates confuse these.
  4. V-Order for Power BI, Z-Order for Spark queries — two different optimisations. V-Order is Fabric-specific and write-time. Z-Order is Delta standard and applied post-write.
  5. Medallion = Bronze → Silver → Gold — the universal data engineering pattern in Fabric. Know what happens at each layer and how separate Lakehouses enforce access control per layer.
  6. Shortcuts = zero-copy data sharing — the answer to any "how do you share data without duplicating it" question. Finance → shortcut → Sales Gold is the canonical example.
  7. Mirroring > ETL pipelines for operational DBs — near real-time CDC-based replication into OneLake. No pipeline to build or maintain. For Azure SQL, Cosmos DB, Snowflake.
  8. Fabric is Synapse's successor — know the mapping: Synapse Spark → Fabric Data Engineering, Synapse SQL → Fabric Warehouse, Synapse Pipelines → Fabric Pipelines.
  9. Capacity = shared pool — all workloads in assigned workspaces share CUs. Heavy Spark jobs and interactive Power BI reports compete for the same capacity. Workspace isolation is the mitigation.
  10. Purview for lineage — end-to-end lineage from source DB → pipeline → Silver table → Gold table → Power BI report. This is the governance story that closes enterprise architecture discussions.


Azure Integration Services — Complete Guide

Logic Apps · Service Bus · API Management · Event Grid · Integration Patterns · Scenarios · Cheat Sheet


Table of Contents

  1. Azure Integration Services Overview
  2. Azure Logic Apps — Deep Dive
  3. Azure Service Bus — Deep Dive
  4. Azure API Management — Deep Dive
  5. Azure Event Grid
  6. Integration Patterns & Architecture
  7. Scenario-Based Questions
  8. Cheat Sheet — Quick Reference

1. Azure Integration Services Overview

What are Azure Integration Services and what problems do they solve?

Azure Integration Services is a suite of cloud services for connecting applications, data, and processes across cloud and on-premises environments.

Service | Purpose
Azure Logic Apps | Workflow orchestration — automate processes and integrate systems using a designer-based workflow engine
Azure Service Bus | Enterprise message broker — reliable, ordered, durable message queuing and pub/sub
Azure API Management (APIM) | API gateway — publish, secure, transform, and monitor APIs
Azure Event Grid | Event routing — reactive, event-driven architecture at scale
Azure Event Hubs | Big data streaming — high-throughput event ingestion (millions/sec)

Key positioning: Logic Apps = orchestration (do things in sequence). Service Bus = decoupling (reliable async messaging). APIM = API facade (secure and manage). Event Grid = events (react to things that happened). Event Hubs = telemetry streaming.


What is the difference between Logic Apps and Power Automate?

Logic Apps and Power Automate share the same underlying engine and connector library.

Attribute | Logic Apps | Power Automate
Audience | Developers, IT, enterprise architects | Business users, makers
Hosting | Azure subscription | Microsoft-managed SaaS
Pricing | Pay-per-execution (Consumption) or App Service Plan (Standard) | Per user/flow licence
VNET/private endpoints | Yes (Standard) | No
Source control / IaC | ARM/Bicep templates, workflow JSON | Solution packages
B2B EDI | Yes (Integration Account) | No
Local development | Yes (VS Code, Standard tier) | No

Tip: Same engine, different audience and infrastructure control. Enterprise integration with VNET, IaC, and B2B = Logic Apps. Business user automation = Power Automate.


What is the difference between Logic Apps Consumption and Standard plans?

Attribute | Consumption | Standard
Infrastructure | Shared multi-tenant | Dedicated App Service Plan
Pricing | Pay per action execution | Fixed plan cost
VNET integration | No | Yes
Private endpoints | No | Yes
Workflows per resource | One | Many
Local development | No | Yes (VS Code)
Stateless workflows | No | Yes (much faster)
Best for | Low-volume, simple, quick deploy | Enterprise production, high-volume

What is Azure Event Grid and how does it differ from Service Bus?

Attribute | Event Grid | Service Bus
Model | Pub/sub event routing | Message broker
Delivery guarantee | At-least-once | At-least-once with acknowledgement
Ordering | Not guaranteed | FIFO with sessions
Message retention | 24h (default, max 7 days) | Up to 14 days
Throughput | 10M events/sec | Up to 1000 msg/sec (Premium)
Use for | Something happened — notify subscribers | Message MUST be processed exactly once
Dead letter | Yes | Yes (DLQ)

Use Event Grid when:
→ Azure resource events (blob created, VM deallocated)
→ High-volume, low-latency event fan-out
→ Fire-and-forget is acceptable
→ Many subscribers need the same event

Use Service Bus when:
→ Message MUST be processed exactly once
→ Order of processing matters (sessions)
→ Consumer may be temporarily offline
→ Transactional messaging required
→ Complex routing with filter rules

2. Azure Logic Apps — Deep Dive

What are the key components of a Logic Apps workflow?

Component | Description
Trigger | Starts the workflow. Types: polling, push (webhook), recurrence, manual (HTTP)
Actions | Steps after the trigger. Each is a connector operation.
Connectors | Wrappers around APIs. 400+ built-in. Custom connectors via OpenAPI.
Control flow | Condition (if/else), Switch, For Each, Until, Scope
Variables | Mutable state within a workflow run
Expressions | @{body('action')}, @{utcNow()} — same syntax as Power Automate

What is the difference between stateful and stateless workflows in Logic Apps Standard?

Stateful workflow: each step's input/output persisted in Azure Storage. Full run history available. Can pause for external callbacks or approvals. Supports long-running workflows (days/weeks). Higher latency, higher cost.

Stateless workflow: data kept in memory only — not persisted. No run history. Cannot pause/resume. Maximum duration: minutes. Much faster (no storage I/O). Much lower cost. Best for high-volume synchronous processing.

Tip: Use stateless for high-volume, fast API transformations (hundreds per second). Use stateful for long-running processes, human-in-the-loop approvals, and workflows needing audit trails.


How does error handling work in Logic Apps?

Logic Apps uses the same Run After and Scope patterns as Power Automate:

Try-Catch-Finally pattern:

[Try Scope]
  → Main workflow steps

[Catch Scope]  ← Run After: Try scope → Failed
  → Alert via email/Teams
  → Log to Application Insights
  → Access error: @{result('Try_scope')[0]['error']['message']}
  → Access failed action: @{result('Try_scope')[0]['name']}

[Finally Scope]  ← Run After: Try + Catch → all states
  → Cleanup (always runs)

Retry policy per action:
→ None / Fixed / Exponential
→ Count: 1–90, Interval: configurable
→ Auto-retries on HTTP 408, 429, 5xx

What is an Integration Account in Logic Apps?

An Integration Account is an Azure resource providing B2B EDI integration capabilities.

Feature Description
Trading partners Define business partners and their identities
Agreements Message exchange agreements (send/receive settings)
Schemas XML schemas for validating/transforming EDI messages
Maps (XSLT) Transform message formats between partners
Certificates Message signing and encryption
EDI protocols AS2 (secure HTTP), X12 (US EDI), EDIFACT (international EDI)

Tip: Integration Account is the answer to any question about B2B EDI, trading partners, or AS2/X12/EDIFACT in Logic Apps.


How do you implement Logic Apps in a DevOps CI/CD pipeline?

Consumption plan (ARM templates):

# Export workflow as ARM template
# Store in Git
# Deploy via Azure CLI or DevOps ARM task
az deployment group create \
  --resource-group myRG \
  --template-file logicapp.json \
  --parameters @params.dev.json

Standard plan (workflow JSON — like Azure Functions):

# GitHub Actions deployment
- name: Deploy Logic App Standard
  uses: Azure/functions-action@v1
  with:
    app-name: my-logic-app-standard
    package: ./src/logicapp

Environment-specific config:

App Settings per environment (dev/test/prod):
→ Connection strings
→ API endpoints
→ Feature flags

Key Vault references for secrets:
@Microsoft.KeyVault(SecretUri=https://myvault.vault.azure.net/secrets/apikey)

What are managed connectors vs built-in connectors in Logic Apps Standard?

Managed connectors: run in Microsoft's shared cloud infrastructure. Make outbound calls to external services (Salesforce, SAP, ServiceNow). Subject to connector throttling limits. Same as Power Automate connectors.

Built-in connectors (Standard only): run inside the Logic App's own host process. Faster, no throttling limits. Include: HTTP, Service Bus, Event Hubs, Azure Blob, SQL, Dataverse, B2B operations. Use built-in over managed where available for better performance.


3. Azure Service Bus — Deep Dive

What is Azure Service Bus and what are its key components?

Component | Description
Namespace | Top-level container. Unique hostname: contoso.servicebus.windows.net
Queue | Point-to-point. One sender, competing consumers. Each message processed once.
Topic | Pub/sub. One publisher, multiple subscriptions each receiving a copy.
Subscription | Named receiver on a topic. Has its own queue + filter rules.
Message | Body (256KB Standard / 100MB Premium) + system + custom properties

Queue (point-to-point):
Producer → [Queue] → Consumer A (competing)
                   → Consumer B (competing)
→ ONE consumer processes each message

Topic/Subscription (pub/sub):
Publisher → [Topic] → [Sub A: filter Region='EMEA'] → Consumer A
                    → [Sub B: filter Priority='High'] → Consumer B
                    → [Sub C: no filter]              → Consumer C
→ EACH subscription gets its own independent copy

What are Service Bus tiers and which should you use for production?

Tier | Queues | Topics | Max Msg Size | VNET | Best For
Basic | Yes | No | 256KB | No | Dev/test only
Standard | Yes | Yes | 256KB | No | Low-criticality workloads
Premium | Yes | Yes | 100MB | Yes | Enterprise production

Premium features: dedicated capacity units, geo-disaster recovery, VNET/private endpoints, predictable performance, availability zones.

Warning: Always use Premium for enterprise production — dedicated resources, VNET integration, and predictable latency. Standard has noisy neighbour risk and no network isolation.


What is the Dead Letter Queue (DLQ) and when are messages sent to it?

The DLQ is a sub-queue storing messages that cannot be processed. Every queue and subscription has its own DLQ.

Messages are dead-lettered when:

  1. Max delivery count exceeded — consumer abandons (nacks) message more than MaxDeliveryCount times (default: 10)
  2. TTL expired — message not consumed before Time-To-Live expires (if dead-lettering on expiry enabled)
  3. Subscription filter error — filter expression throws an exception
  4. Consumer explicitly dead-letters — calls deadLetterAsync() for unprocessable messages

DLQ message properties:
DeadLetterReason            → WHY it was dead-lettered
DeadLetterErrorDescription  → Detailed error message
EnqueuedTimeUtc             → When original message was sent
DeliveryCount               → How many times delivery was attempted

DLQ path:
Queue: myqueue/$DeadLetterQueue
Subscription: mytopic/mysubscription/$DeadLetterQueue

Critical: An overflowing DLQ means data loss. Always monitor DLQ counts with Azure Monitor alerts and have a process for reviewing, resubmitting, or archiving dead-lettered messages.
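
A minimal sketch of a DLQ review process using the azure-servicebus Python SDK (connection string and queue name are placeholders):

from azure.servicebus import ServiceBusClient, ServiceBusSubQueue

client = ServiceBusClient.from_connection_string("<connection-string>")

# The DLQ is addressed as a sub-queue of the main queue
with client.get_queue_receiver(
        queue_name="myqueue",
        sub_queue=ServiceBusSubQueue.DEAD_LETTER) as receiver:
    for msg in receiver.receive_messages(max_message_count=10, max_wait_time=5):
        print(msg.dead_letter_reason, msg.dead_letter_error_description)
        receiver.complete_message(msg)   # remove after reviewing/resubmitting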


What are Sessions in Service Bus and when do you use them?

Sessions enable FIFO ordered processing of related messages. A session groups messages by SessionId — all messages with the same SessionId are processed in order by the same consumer instance.

Without sessions (no ordering guarantee):
Producer sends: Order-123-Created, Order-123-Paid, Order-123-Shipped
3 competing consumers → may process Shipped before Paid → wrong state

With sessions (SessionId = "Order-123"):
→ All messages with SessionId="Order-123" locked to ONE consumer
→ Processed in order: Created → Paid → Shipped
→ Other sessions (Order-456, Order-789) processed in parallel by other consumers
→ Scale: number of concurrent sessions = throughput

Enable sessions on queue/subscription:
RequiresSession = true (must be set at creation time)

Send with session:
message.SessionId = orderId;

Receive with session:
var session = await receiver.AcceptNextSessionAsync();

Tip: Sessions are the answer to "how do you guarantee message ordering in Service Bus." Without sessions, ordering is not guaranteed even with a single consumer.
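
The C# fragments above translate directly to the azure-servicebus Python SDK. A sketch with placeholder names, assuming the queue was created with RequiresSession = true:

from azure.servicebus import ServiceBusClient, ServiceBusMessage

client = ServiceBusClient.from_connection_string("<connection-string>")

# Sender: stamp every related message with the same SessionId
with client.get_queue_sender("orders-sessions") as sender:
    for step in ("Created", "Paid", "Shipped"):
        sender.send_messages(
            ServiceBusMessage(f"Order-123 {step}", session_id="Order-123"))

# Receiver: lock the session; its messages arrive in FIFO order
with client.get_queue_receiver("orders-sessions",
                               session_id="Order-123") as receiver:
    for msg in receiver.receive_messages(max_wait_time=5):
        print(str(msg))
        receiver.complete_message(msg)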


What is the difference between peek-lock and receive-and-delete?

Peek-Lock (recommended):

  1. Consumer receives message — it is locked (invisible to others) for lock duration
  2. Consumer processes and calls Complete() → deleted from queue
  3. Or calls Abandon() → released back to queue for retry
  4. Or calls DeadLetter() → moved to DLQ
  5. If consumer crashes — lock expires, message re-queues automatically
  6. Guaranteed at-least-once delivery

Receive-and-Delete:

  1. Message deleted immediately on receive
  2. If consumer crashes after receive but before processing → message is permanently lost
  3. Use only for non-critical, idempotent scenarios (metrics, logging)

Warning: Always use Peek-Lock for business-critical messages. Receive-and-Delete risks data loss if the consumer fails between receiving and completing processing.
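
A minimal peek-lock consumer sketch in Python (azure-servicebus SDK; peek-lock is the default receive mode, and process() is a stand-in for your business logic):

from azure.servicebus import ServiceBusClient
from azure.servicebus.exceptions import ServiceBusError

def process(msg):
    print(str(msg))          # stand-in business logic

client = ServiceBusClient.from_connection_string("<connection-string>")

with client.get_queue_receiver(queue_name="orders") as receiver:
    for msg in receiver.receive_messages(max_message_count=10, max_wait_time=5):
        try:
            process(msg)
            receiver.complete_message(msg)      # success: delete from queue
        except ValueError:
            receiver.dead_letter_message(       # unprocessable: straight to DLQ
                msg, reason="BadPayload")
        except ServiceBusError:
            receiver.abandon_message(msg)       # transient: release for retry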


What is duplicate detection in Service Bus?

Duplicate detection allows Service Bus to discard messages with a previously seen MessageId within a configurable time window (1 min – 7 days).

// Enable at queue/topic creation:
new CreateQueueOptions("myqueue") {
  RequiresDuplicateDetection = true,
  DuplicateDetectionHistoryTimeWindow = TimeSpan.FromMinutes(10)
};

// Set MessageId on the sender side:
var message = new ServiceBusMessage(body) {
  MessageId = $"{orderId}-{timestamp}"
};

Use when producers may retry sending on network failure — prevents double-processing of the same business event.


4. Azure API Management — Deep Dive

What is Azure API Management and what problem does it solve?

Azure API Management (APIM) is a fully managed API gateway between API consumers (clients) and API backends.

Core capabilities:

Capability | Description
Security | OAuth 2.0, JWT validation, subscription keys, client certificates, IP filtering
Rate limiting | Rate limits and quotas per subscription, IP, or custom key
Transformation | Modify requests/responses without touching backend code
Versioning | Manage multiple API versions, route to different backends
Developer portal | Self-service API documentation and subscription management
Caching | Cache responses to reduce backend load
Analytics | Request logs, metrics, tracing via Azure Monitor + Application Insights

Key value: Decouple API consumers from backends. Backend can change without consumers knowing — APIM handles the translation.


What are APIM policies and what are the most important ones to know?

Policies are XML-based rules applied in four sections: inbound, backend, outbound, on-error.

<policies>
  <inbound>
    <!-- Validate JWT (Entra ID) -->
    <validate-jwt header-name="Authorization" failed-validation-httpcode="401">
      <openid-config url="https://login.microsoftonline.com/{tenantId}/v2.0/.well-known/openid-configuration"/>
      <required-claims>
        <claim name="aud"><value>api://my-api-id</value></claim>
      </required-claims>
    </validate-jwt>

    <!-- Rate limit per subscription: 100 calls/60s -->
    <rate-limit-by-key calls="100" renewal-period="60"
      counter-key="@(context.Subscription.Id)"/>

    <!-- Add internal key to backend request -->
    <set-header name="X-Internal-Key" exists-action="override">
      <value>{{internal-api-key-named-value}}</value>
    </set-header>

    <!-- Route to different backend based on header -->
    <choose>
      <when condition="@(context.Request.Headers.GetValueOrDefault(&quot;X-Version&quot;,&quot;v1&quot;) == &quot;v2&quot;)">
        <set-backend-service base-url="https://v2.api.backend.com"/>
      </when>
      <otherwise>
        <set-backend-service base-url="https://v1.api.backend.com"/>
      </otherwise>
    </choose>
  </inbound>

  <outbound>
    <!-- Remove internal header from response -->
    <set-header name="X-Powered-By" exists-action="delete"/>
    <!-- Cache successful responses for 5 minutes -->
    <cache-store duration="300"/>
    <!-- Transform XML response to JSON -->
    <xml-to-json kind="direct" apply="always"/>
  </outbound>

  <on-error>
    <return-response>
      <set-status code="@((int)context.Response.StatusCode)" reason="@(context.Response.StatusReason)"/>
      <set-body>@("Error: " + context.LastError.Message)</set-body>
    </return-response>
  </on-error>
</policies>

What are APIM tiers?

Tier | SLA | VNET | Multi-region | Scale Units | Best For
Developer | None | Yes (no SLA) | No | 1 | Dev/test only
Basic | 99.95% | No | No | 2 | Small, non-critical
Standard | 99.95% | No | No | 4 | Medium volume
Premium | 99.99% | Yes | Yes | 31 | Enterprise production
Consumption | 99.95% | No | No | Serverless | Very low volume, serverless

Tip: Premium is the only production tier with VNET integration and multi-region deployment (Developer supports VNET but carries no SLA). Required when backends are in private networks or when global availability is needed.


What are APIM Products, APIs, and Subscriptions?

API: a backend service exposed through APIM
  → Has operations: GET /orders, POST /orders, DELETE /orders/{id}
  → Has policies applied at API or operation level

Product: a bundle of one or more APIs
  → Controls access: who can subscribe
  → Applies shared rate limits and quotas
  → Types: Open (no approval needed) | Protected (subscription required)

Subscription: a consumer's access key to a Product
  → Primary and secondary keys (for key rotation without downtime)
  → Passed in header: Ocp-Apim-Subscription-Key: {key}
  → Can be scoped to: All APIs | Product | Single API

Example:
APIs: Orders, Inventory, Customers, Analytics

Products:
  "Developer Tier" → Orders API only, 10 calls/min, free
  "Standard Tier"  → Orders + Inventory, 100 calls/min
  "Enterprise"     → All APIs, 1000 calls/min, SLA guaranteed

Consumer A subscribes to "Standard Tier":
→ Gets subscription key
→ Can call Orders and Inventory up to 100 calls/min
→ Cannot call Customers or Analytics

How do you implement API versioning in APIM?

Three versioning schemes:

Scheme | URL Format | Example
URL path | /v{version}/resource | /v1/orders, /v2/orders
Query string | /resource?api-version={ver} | /orders?api-version=2024-01-01
Header | Api-Version: {version} | Header: Api-Version: 2024-01-01

Each version can point to a different backend, have different policies, and be independently documented. A version set groups all versions together in the developer portal.

Tip: Always create a version set when deploying v1. Retrofitting versioning to an existing API with consumers is painful and disruptive.


What are Named Values in APIM and why are they important?

Named Values (formerly Properties) are key-value pairs stored in APIM and referenced in policies using {{name}} syntax.

<!-- Instead of hardcoding: -->
<set-header name="X-API-Key"><value>abc123supersecret</value></set-header>

<!-- Use Named Value: -->
<set-header name="X-API-Key"><value>{{backend-api-key}}</value></set-header>

Types:

  • Plain: static string value
  • Secret: encrypted, not visible after saving
  • Key Vault reference: value retrieved from Azure Key Vault at runtime

Best practice: Always use Named Values (preferably Key Vault-backed) for secrets and environment-specific values in APIM policies. Never hardcode secrets in policy XML.


5. Azure Event Grid

What is Azure Event Grid and what are its key components?

Event Grid is a fully managed event routing service for building reactive, event-driven architectures.

Component | Description
Event source | What emits events: Azure resources (Blob Storage, Resource Groups, Service Bus) or custom apps
Topic | Endpoint where events are sent. System topics (Azure resources) or custom topics
Event subscription | Maps a topic to an event handler with optional filter rules
Event handler | What processes the event: Azure Functions, Logic Apps, Event Hubs, Service Bus, Webhooks

Azure Blob Storage (event source)
  → BlobCreated event published to System Topic
  → Event Grid routes to:
    Subscription 1 (filter: blobType=image) → Azure Function (resize image)
    Subscription 2 (no filter) → Logic App (archive to cold storage)
    Subscription 3 (filter: container=reports) → Power Automate (notify team)
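
Custom topics accept events published from your own code. A minimal publish sketch with the azure-eventgrid Python SDK (endpoint and key are placeholders):

from azure.eventgrid import EventGridPublisherClient, EventGridEvent
from azure.core.credentials import AzureKeyCredential

client = EventGridPublisherClient(
    "https://my-topic.westeurope-1.eventgrid.azure.net/api/events",
    AzureKeyCredential("<topic-access-key>"))

client.send([EventGridEvent(
    subject="orders/1001",
    event_type="Contoso.OrderCreated",
    data={"orderId": 1001, "amount": 250.0},
    data_version="1.0")])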

When should you use Event Grid vs Event Hubs vs Service Bus?

Event Grid:
→ Discrete events (something happened)
→ Azure resource events (blob created, VM stopped)
→ Low volume, reactive architecture
→ Fan-out to many subscribers
→ Events expire after 24h–7 days

Event Hubs:
→ High-throughput streaming (millions/sec)
→ Time-series data, telemetry, logs
→ Replay capability (retention up to 90 days)
→ Big data pipelines (Spark, Stream Analytics)

Service Bus:
→ Reliable transactional messaging
→ Message must be processed exactly once
→ Order matters (sessions)
→ Consumer may be offline
→ Complex routing with business rules

6. Integration Patterns & Architecture

What is the Competing Consumers pattern?

Multiple consumer instances process messages from a single queue concurrently — scaling throughput horizontally. Each message processed by exactly one consumer.

Service Bus Queue with 3 consumers:
Producer → [Queue: 1000 messages]
  Consumer 1 (Azure Function instance) ← picks up messages
  Consumer 2 (Azure Function instance) ← picks up messages
  Consumer 3 (Azure Function instance) ← picks up messages

→ Each message processed by EXACTLY ONE consumer
→ Scale out by adding consumers
→ Azure Functions + Service Bus trigger auto-scales based on queue depth
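
A sketch of one competing consumer as an Azure Function (Python v2 programming model; queue and connection setting names are placeholders):

import azure.functions as func

app = func.FunctionApp()

# Each function instance is one competing consumer. The platform adds
# instances as queue depth grows, and each message still goes to
# exactly one of them.
@app.service_bus_queue_trigger(arg_name="msg",
                               queue_name="orders",
                               connection="ServiceBusConnection")
def process_order(msg: func.ServiceBusMessage):
    print(f"Processing {msg.message_id}: {msg.get_body().decode()}")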

What is the Saga pattern and how do you implement it?

The Saga pattern manages long-running distributed transactions without a central transaction coordinator. Each step publishes an event; compensating transactions undo completed steps if a later step fails.

Order Processing Saga — Orchestration (Logic App):

Step 1: Reserve Inventory API
  → Success: continue
  → Failure: STOP (nothing to compensate)

Step 2: Charge Payment API
  → Success: continue
  → Failure: call Inventory API to RELEASE reservation (compensate Step 1)

Step 3: Create Shipment API
  → Success: Saga complete
  → Failure: call Payment API to REFUND (compensate Step 2)
           + call Inventory API to RELEASE (compensate Step 1)

Choreography (Service Bus Topics):
  Each service subscribes to its trigger event
  Publishes result event → triggers next service
  Compensating events flow in reverse on failure
  No central orchestrator — fully decoupled

What is the Claim Check pattern?

Offloads large message payloads to external storage and sends only a reference in the Service Bus message.

Problem: Message payload = 5MB → exceeds 256KB Service Bus limit

Solution:
Producer:
  1. Upload full payload to Azure Blob Storage
  2. Get SAS URL (time-limited access token)
  3. Send small message: { "claimCheck": "{sasUrl}", "type": "OrderCreated" }

Consumer:
  1. Receive small message from Service Bus
  2. Download full payload from Blob using claimCheck URL
  3. Process the full payload
  4. Delete the blob after successful processing
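
A condensed Python sketch of both sides of the pattern (all names are placeholders; for brevity it passes the blob name as the claim check rather than generating the SAS URL described above):

import json, uuid
from azure.storage.blob import BlobServiceClient
from azure.servicebus import ServiceBusClient, ServiceBusMessage

blob_svc = BlobServiceClient.from_connection_string("<storage-connection>")
sb = ServiceBusClient.from_connection_string("<servicebus-connection>")
large_payload = b"x" * 5_000_000   # stand-in for a 5MB body

# Producer: park the large payload in Blob Storage, send only a reference
blob_name = f"claims/{uuid.uuid4()}.json"
blob_svc.get_blob_client("payloads", blob_name).upload_blob(large_payload)
with sb.get_queue_sender("orders") as sender:
    sender.send_messages(ServiceBusMessage(
        json.dumps({"claimCheck": blob_name, "type": "OrderCreated"})))

# Consumer: fetch the payload via the claim check, then clean up
with sb.get_queue_receiver("orders") as receiver:
    for msg in receiver.receive_messages(max_wait_time=5):
        ref = json.loads(str(msg))
        blob = blob_svc.get_blob_client("payloads", ref["claimCheck"])
        payload = blob.download_blob().readall()   # process payload here
        blob.delete_blob()
        receiver.complete_message(msg)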

How do Logic Apps, Service Bus, and APIM work together in enterprise integration?

External systems / Trading partners / Mobile apps
        ↓ HTTPS
[APIM] ← API Gateway (North-South traffic)
  → Authenticate: validate JWT/OAuth/mTLS
  → Rate limit external callers
  → Route to correct backend version
  → Transform: REST→SOAP, JSON→XML
        ↓
[Logic Apps] ← Orchestration layer
  → Receives HTTP request from APIM
  → Calls multiple backend services (SAP, D365, SQL)
  → Handles retry, error handling, compensation
  → Sends responses back and notifications
        ↓
[Service Bus] ← Async decoupling layer (East-West traffic)
  → Decouples Logic App from slow backend systems
  → Reliable delivery to downstream consumers
  → Topics/subscriptions fan out to multiple consumers
        ↓
Backend services (SAP, Dynamics 365, SQL, Custom APIs)

Design principle:
APIM    = North-South gateway (external ↔ internal boundary)
Service Bus = East-West bus (internal service ↔ service decoupling)
Logic Apps  = Orchestration across both layers

What is the Throttling / Rate Limiting pattern in APIM?

<!-- Rate limit per subscriber: 100 calls per 60 seconds -->
<rate-limit-by-key calls="100" renewal-period="60"
  counter-key="@(context.Subscription.Id)"
  increment-condition="@(context.Response.StatusCode >= 200
    and context.Response.StatusCode < 300)"/>

<!-- Rate limit per IP address -->
<rate-limit-by-key calls="50" renewal-period="60"
  counter-key="@(context.Request.IpAddress)"/>

<!-- Weekly quota (business tier limit) -->
<quota-by-key calls="50000" bandwidth="102400"
  renewal-period="604800"
  counter-key="@(context.Subscription.Id)"/>

<!-- Spike arrest: max 10 calls/second to protect backend -->
<rate-limit calls="10" renewal-period="1"/>

When limit exceeded:

  • Returns HTTP 429 Too Many Requests
  • Include Retry-After header
  • Custom error body via policy

Tip: Implement both rate-limit (short window, protect from spikes) and quota (long window, enforce business tier limits) for complete throttling strategy.
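
On the consumer side, a well-behaved client honours the Retry-After header. A hypothetical Python helper:

import time
import requests

def call_with_retry(url, key, max_attempts=5):
    """Call an APIM-fronted API, backing off on HTTP 429 responses."""
    for _ in range(max_attempts):
        resp = requests.get(url, headers={"Ocp-Apim-Subscription-Key": key})
        if resp.status_code != 429:
            return resp
        # Wait for the period the gateway asked for (default 1 second)
        time.sleep(int(resp.headers.get("Retry-After", "1")))
    raise RuntimeError("Rate limit still exceeded after retries")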


7. Scenario-Based Questions

Scenario: Design a reliable order processing system using Service Bus for 10,000 orders/hour.

  1. Service Bus Premium namespace: dedicated capacity, ~1000 msg/sec throughput, VNET integration
  2. Topic: orders with subscriptions:
    • inventory-sub: filter OrderType = 'Physical'
    • payment-sub: no filter (all orders), sessions enabled (SessionId = OrderId)
    • analytics-sub: no filter (all orders)
  3. Azure Functions consumers with Service Bus trigger: KEDA-based auto-scaling
  4. Dead Letter Queue monitoring: Azure Monitor alert when DLQ count > 0. Logic App notifies ops team.
  5. Duplicate detection: MessageId = OrderId + SubmissionTimestamp — prevents double-processing on producer retry (see the producer sketch after this list)
  6. Geo-Disaster Recovery: paired secondary namespace in a secondary region (metadata-only replication; in-flight messages are not copied). Failover RTO < 1 minute.
  7. Message TTL: 24h — orders not processed in 24h → DLQ alert for manual review
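A producer-side sketch of items 2 and 5 above (sessions and duplicate detection), with illustrative identifiers:

using System;
using Azure.Messaging.ServiceBus;

string orderId = "ORD-1001";
DateTime submittedUtc = DateTime.UtcNow;

await using var client = new ServiceBusClient("<servicebus-connection-string>");
ServiceBusSender sender = client.CreateSender("orders");

var message = new ServiceBusMessage("{ /* order JSON */ }")
{
    // Duplicate detection: a producer retry re-sends the SAME MessageId,
    // so Service Bus drops the second copy within the dedup window
    MessageId = $"{orderId}-{submittedUtc:yyyyMMddHHmmssfff}",
    // Sessions on payment-sub: all messages for one order handled in order
    SessionId = orderId
};
// Subscription SQL filters evaluate application properties, not the body
message.ApplicationProperties["OrderType"] = "Physical";

await sender.SendMessageAsync(message);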

Scenario: A backend API averages 3 second responses. How do you use APIM to improve consumer experience?

  1. Response caching for stable reference data (product catalogue, lookup tables):
    <!-- Inbound: -->
    <cache-lookup vary-by-developer="false" vary-by-developer-groups="false">
      <vary-by-header>Accept</vary-by-header>
    </cache-lookup>
    <!-- Outbound: -->
    <cache-store duration="300"/>
    
  2. Retry transient backend failures; for repeated failures, a circuit breaker (configurable on the APIM backend resource) can short-circuit calls and return a cached last-good response or a meaningful error:
    <retry condition="@(context.Response.StatusCode >= 500)"
      count="3" interval="2" first-fast-retry="true"/>
    
  3. Async pattern for operations > 1s:
    • APIM accepts request → sends to Service Bus → returns 202 Accepted + jobId
    • Backend processes async → stores result
    • Consumer polls GET /jobs/{jobId} for status (see the sketch after this list)
  4. Mock response for dev/test environments — no backend call
  5. Backend timeout policy: set explicit timeout so slow responses don't exhaust APIM threads
  6. Backend load balancing: configure multiple backend pool members — APIM round-robins or health-checks
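Item 3 can be sketched as an ASP.NET Core minimal API; the /orders and /jobs routes and the in-memory job store are assumptions, and a production version would enqueue to Service Bus and persist job state:

using System;
using System.Collections.Concurrent;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

var app = WebApplication.CreateBuilder(args).Build();
var jobs = new ConcurrentDictionary<string, string>();

// APIM forwards the POST here; the caller immediately gets 202 + a job URL
app.MapPost("/orders", () =>
{
    var jobId = Guid.NewGuid().ToString("N");
    jobs[jobId] = "Pending";              // real system: enqueue to Service Bus here
    return Results.Accepted($"/jobs/{jobId}", new { jobId });
});

// Consumer polls until the backend marks the job complete
app.MapGet("/jobs/{jobId}", (string jobId) =>
    jobs.TryGetValue(jobId, out var status)
        ? Results.Ok(new { jobId, status })
        : Results.NotFound());

app.Run();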

Scenario: How do you expose an on-premises SAP API securely to external partners via APIM?

  1. APIM Premium with Internal VNET mode: APIM deployed inside a VNET. SAP reachable via ExpressRoute/VPN into the same VNET.
  2. Application Gateway in front of APIM: AG provides public endpoint, WAF, SSL termination. Forwards to internal APIM only.
  3. Partner authentication: mutual TLS (client certificates) or OAuth 2.0 client credentials. APIM validates certificates against trusted CA store.
  4. Request transformation: APIM policy transforms REST JSON → SOAP for SAP. Partners never see SAP's native SOAP interface.
  5. Rate limiting: per-partner subscription limits prevent any partner overwhelming SAP.
  6. Schema validation: APIM validates request payloads at gateway — bad requests rejected before reaching SAP.
  7. Logging: all partner API calls → Azure Monitor + Application Insights for audit compliance.

Scenario: Design a Logic App that syncs orders from e-commerce to Dynamics 365 every 15 minutes with deduplication.

  1. Trigger: Recurrence — every 15 minutes
  2. Fetch new orders: HTTP GET ?modifiedAfter=@{addMinutes(utcNow(),-15)}
  3. Parse JSON: extract orders array
  4. For Each (concurrency 5): parallel processing of 5 orders at a time
    • Query D365 by external order ID → does it exist?
    • No: create new D365 record
    • Yes: compare modifiedOn → if changed, update
  5. Try-Catch per order: wrap each order in a Try scope. Catch: log failed orderId + error to Azure Table Storage. Continue to next order — don't fail entire run on one bad record.
  6. Summary notification: Teams message after loop: "Sync complete: X created, Y updated, Z failed. See log for details."
  7. Persist last sync timestamp: update a D365 config record with current timestamp for next run's filter query.

Scenario: How do you implement a pub/sub notification system where different teams receive different order events?

Architecture:
Order Service → publishes to Service Bus Topic: "order-events"

Subscriptions with SQL filter rules:
inventory-team-sub:
  Filter: OrderStatus = 'Confirmed' OR OrderStatus = 'Cancelled'
  Action: Set RouteTo = 'inventory-queue'

finance-team-sub:
  Filter: OrderTotal > 1000 AND PaymentStatus = 'Charged'

shipping-team-sub:
  Filter: OrderStatus = 'ReadyToShip' AND DeliveryType = 'Express'

analytics-sub:
  No filter — receives ALL events for reporting

Implementation:
Order service sends with properties:
message.ApplicationProperties["OrderStatus"] = "Confirmed";
message.ApplicationProperties["OrderTotal"] = 1500.00;
message.ApplicationProperties["DeliveryType"] = "Express";

Each team's consumer only receives events matching their filter.
Teams can be added/removed without changing the producer.
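These subscriptions and filters can be provisioned in code with the administration client. A sketch, assuming the topic order-events already exists:

using Azure.Messaging.ServiceBus.Administration;

var admin = new ServiceBusAdministrationClient("<servicebus-connection-string>");

// finance-team-sub: created with its SQL filter replacing the $Default rule
await admin.CreateSubscriptionAsync(
    new CreateSubscriptionOptions("order-events", "finance-team-sub"),
    new CreateRuleOptions("HighValueCharged",
        new SqlRuleFilter("OrderTotal > 1000 AND PaymentStatus = 'Charged'")));

// analytics-sub: no rule supplied, so it keeps the match-all $Default rule
await admin.CreateSubscriptionAsync("order-events", "analytics-sub");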

8. Cheat Sheet — Quick Reference

Service Selection Guide

Need to...                              → Use
Orchestrate a multi-step process        → Logic Apps
Decouple services reliably              → Service Bus Queue
Fan out events to multiple consumers    → Service Bus Topic OR Event Grid
React to Azure resource events          → Event Grid
Ingest millions of telemetry events/sec → Event Hubs
Secure and manage APIs                  → API Management
Transform request/response format       → APIM policy
Rate limit API callers                  → APIM rate-limit policy
Guarantee message order                 → Service Bus + Sessions
Handle large payloads (>256KB)          → Claim Check pattern + Blob Storage
B2B EDI (AS2, X12, EDIFACT)            → Logic Apps + Integration Account

Service Bus Quick Reference

Namespace tiers: Basic (queues only) | Standard | Premium (enterprise)
Max message size: 256KB (Standard) | 100MB (Premium)
Message retention: up to 14 days
Max delivery count: default 10 (configurable)

Queue (point-to-point):
→ Competing consumers
→ Each message processed once
→ Enable sessions for FIFO ordering

Topic (pub/sub):
→ Multiple subscriptions
→ SQL filter rules per subscription
→ Each subscription gets independent copy

Message receive modes:
Peek-Lock: safe, acknowledged, retryable    ← always use for business data
Receive-Delete: immediate delete, risk loss ← only for non-critical

DLQ: every queue/subscription has one
Monitor: alert when DLQ count > 0
Path: myqueue/$DeadLetterQueue
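A minimal Peek-Lock consumer sketch matching the lines above; the processing step and the transient-error type are placeholders:

using System;
using Azure.Messaging.ServiceBus;

await using var client = new ServiceBusClient("<servicebus-connection-string>");
ServiceBusReceiver receiver = client.CreateReceiver("myqueue"); // Peek-Lock is the default mode

ServiceBusReceivedMessage msg = await receiver.ReceiveMessageAsync();
try
{
    // ... process msg.Body here ...
    await receiver.CompleteMessageAsync(msg);      // success: message removed from queue
}
catch (TimeoutException)
{
    await receiver.AbandonMessageAsync(msg);       // transient: lock released, DeliveryCount + 1
}
catch (Exception ex)
{
    // Poison message: parked in myqueue/$DeadLetterQueue for the DLQ alert to catch
    await receiver.DeadLetterMessageAsync(msg, "ProcessingFailed", ex.Message);
}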

APIM Policy Reference

<!-- JWT validation -->
<validate-jwt header-name="Authorization" failed-validation-httpcode="401">
  <openid-config url="https://login.microsoftonline.com/{tid}/v2.0/.well-known/openid-configuration"/>
</validate-jwt>

<!-- Rate limit per subscription -->
<rate-limit-by-key calls="100" renewal-period="60"
  counter-key="@(context.Subscription.Id)"/>

<!-- Cache response -->
<cache-store duration="300"/>

<!-- Transform XML to JSON -->
<xml-to-json kind="direct" apply="always"/>

<!-- Set backend URL -->
<set-backend-service base-url="https://api.backend.com"/>

<!-- Add header -->
<set-header name="X-Key" exists-action="override">
  <value>{{named-value}}</value>
</set-header>

<!-- Remove response header -->
<set-header name="X-Powered-By" exists-action="delete"/>

<!-- Mock response -->
<mock-response status-code="200" content-type="application/json"/>

<!-- Return custom error -->
<return-response>
  <set-status code="400"/>
  <set-body>{"error": "Invalid request"}</set-body>
</return-response>

Logic Apps — Hosting Plans

Consumption plan:
→ Shared infrastructure
→ Pay per action (~$0.000025/action)
→ No VNET
→ One workflow per resource
→ Good for: low-volume, quick start

Standard plan:
→ Dedicated App Service Plan
→ Fixed cost, predictable
→ VNET integration + private endpoints
→ Multiple workflows per resource
→ Stateful and stateless workflows (stateless = faster, no run history)
→ Local development in VS Code
→ Good for: enterprise production

Top 10 Tips

  1. Logic Apps vs Power Automate — same engine, different infrastructure control. Enterprise = Logic Apps (VNET, IaC, B2B). Business users = Power Automate. Never say "they're the same thing."
  2. Standard vs Consumption — Standard is the enterprise choice (VNET, stateless, multi-workflow). Consumption for quick prototypes only. Know this before any architecture question.
  3. Service Bus sessions = FIFO — sessions are the ONLY way to guarantee message ordering. Without them, ordering is not guaranteed even with a single consumer.
  4. Peek-Lock, always — Receive-and-Delete risks message loss on consumer crash. Peek-Lock with Complete/Abandon is the correct pattern for any business-critical message.
  5. DLQ monitoring is non-negotiable — an overflowing DLQ is silent data loss. Azure Monitor alert on DLQ count > 0 is a standard architecture requirement.
  6. APIM Premium for VNET — it's the only tier with private network integration. If backends are on-premises or in a private VNET, you need Premium.
  7. Named Values for secrets in APIM policies — never hardcode API keys or connection strings in policy XML. Named Values (Key Vault-backed) are the correct approach.
  8. Products in APIM — how you bundle APIs and control consumer access tiers. Many candidates know APIs but miss Products as the access control layer.
  9. Integration Account for B2B EDI — AS2, X12, EDIFACT in Logic Apps requires an Integration Account. This is the expected answer for any B2B/trading partner question.
  10. APIM + App Gateway pattern — exposing APIM in Internal VNET mode behind an Application Gateway is the standard enterprise pattern for secure public API exposure with backend network isolation.


Microsoft 365 Copilot Extensibility Complete Guide

 

Microsoft 365 Copilot Extensibility — Complete Guide

Plugins · Declarative Agents · Graph Connectors · Security · Scenarios · Cheat Sheet


Table of Contents

  1. Core Concepts — Basics
  2. Plugins — API Plugins & Message Extensions
  3. Declarative Agents
  4. Microsoft Graph Connectors
  5. Security, Governance & Responsible AI
  6. Scenario-Based Questions
  7. Cheat Sheet — Quick Reference

1. Core Concepts — Basics

What is Microsoft 365 Copilot and what does it do?

Microsoft 365 Copilot is an AI assistant integrated across Microsoft 365 apps (Teams, Outlook, Word, Excel, PowerPoint, SharePoint, Loop). It combines large language models (GPT-4 class) with the Microsoft Graph — giving it access to your organisation's data (emails, meetings, documents, chats) to generate contextually relevant responses.

Key capabilities:

  • Microsoft 365 Chat (BizChat): cross-app AI assistant — ask questions across emails, meetings, documents, chats
  • In-app Copilot: context-aware AI within specific apps — Word (draft, rewrite), Excel (analyse data), PowerPoint (create), Teams (meeting summaries)
  • Extensibility: developers extend Copilot with plugins, agents, and Graph Connectors

Key positioning: M365 Copilot = LLM + Microsoft Graph (your org data) + Microsoft 365 apps. Extensibility = adding YOUR data and YOUR actions to this system.


What are the three main extensibility mechanisms for Microsoft 365 Copilot?

Mechanism What It Does Who Builds It
Plugins Extend Copilot's ability to take actions and retrieve real-time data from external systems Developers
Declarative Agents Customised Copilot experiences with specific persona, scope, knowledge, and plugins Makers + Developers
Graph Connectors Index external data into Microsoft Graph so Copilot can search and reason over it Developers

Mental model: Graph Connectors = bring your data IN. Plugins = let Copilot take actions OUT. Declarative Agents = package it all into a focused experience.


What is the Microsoft 365 Copilot architecture and how does it process a user prompt?

User types a prompt in Teams / BizChat
        ↓
Copilot orchestrator (powered by Semantic Kernel)
  1. Understands intent via LLM
  2. Decides which skills/plugins to invoke
  3. Calls Microsoft Graph for user context
     (emails, calendar, files, chats, contacts)
  4. Calls relevant plugins for external data/actions
  5. Passes all retrieved data as grounding context to LLM
        ↓
LLM (GPT-4 class) generates a response
grounded in the user's actual organisational data
        ↓
Response rendered in Teams/Outlook/BizChat
with citations to source documents and items

Tip: The orchestrator is powered by Semantic Kernel — Microsoft's open-source AI SDK. Understanding this pipeline is essential for architect-level extensibility.


What is Teams App Manifest v1.13+ and how does it relate to Copilot extensibility?

The Teams App Manifest (now called Microsoft 365 App Manifest) defines an app's capabilities across Microsoft 365 surfaces. Version 1.13+ introduced Copilot extensibility support:

{
  "$schema": "https://developer.microsoft.com/json-schemas/teams/v1.17/MicrosoftTeams.schema.json",
  "manifestVersion": "1.17",
  "id": "unique-app-guid",
  "name": { "short": "Contoso HR", "full": "Contoso HR Assistant" },
  "copilotAgents": {
    "declarativeAgents": [{
      "id": "hrAgent",
      "file": "agents/hrAgent.json"
    }]
  },
  "plugins": [{
    "id": "hrApiPlugin",
    "file": "plugins/hrApiPlugin.json"
  }]
}

What is Teams Toolkit and how does it help with Copilot extensibility development?

Teams Toolkit is a VS Code extension and CLI providing project templates, local debugging, and deployment automation.

Capability Description
Project templates Pre-built scaffolds for API plugins, declarative agents, Graph Connectors
Local debugging Dev Tunnel for local API testing with real Copilot
Provision & deploy Automates Azure resource creation and deployment
Environment management Dev/test/prod configs with .env files
App publishing Packages and submits to Teams Admin Center

What is the difference between extending Copilot and building a standalone chatbot?

Extending M365 Copilot Standalone Chatbot
Base AI Uses M365 Copilot's LLM + Graph Bring your own LLM
User experience Within Teams/Outlook — familiar surface Separate app/interface
Org data access Built-in via Microsoft Graph Custom integration needed
Authentication Inherits M365 SSO Build your own auth
Governance Managed by Teams Admin Center Custom governance
Licence required M365 Copilot licence per user Depends on platform
Best for Enhancing existing M365 workflows Fully custom experiences

2. Plugins — API Plugins & Message Extensions

What is an API Plugin for Microsoft 365 Copilot and how does it work?

An API Plugin allows Copilot to call a REST API based on an OpenAPI (Swagger) specification. When a user asks something requiring real-time data or an external action, Copilot invokes the relevant API operation.

How it works:
1. Developer creates an OpenAPI spec for the REST API
2. Developer creates an AI plugin definition (ai-plugin.json)
   describing which operations Copilot can call + when
3. Plugin packaged in Teams app manifest (.zip)
4. Copilot's orchestrator reads plugin description + OpenAPI spec
5. When user prompt matches plugin's purpose → orchestrator calls the API
6. API response passed to LLM as grounding context
7. LLM generates response citing real-time API data

Key files:
manifest.json     ← Teams app manifest referencing the plugin
apiPlugin.json    ← AI plugin definition (name, desc, auth, api spec)
openapi.json      ← OpenAPI 3.0 spec describing REST endpoints

Tip: The quality of descriptions in the OpenAPI spec and plugin definition determines whether Copilot invokes the plugin correctly. Poor descriptions = plugin never triggered or triggered for wrong queries.


What is an AI Plugin definition file (ai-plugin.json) and what does it contain?

{
  "schema_version": "v2.1",
  "name_for_human": "Contoso HR Plugin",
  "name_for_model": "ContosoHR",
  "description_for_human": "Access HR data including leave balances and policies",
  "description_for_model": "Use this plugin to retrieve employee leave balances, HR policies, and submit leave requests. Call this when users ask about their leave, time off, holidays, annual leave, sick leave, or HR policies.",
  "auth": {
    "type": "OAuthPluginVault",
    "reference_id": "${{OAUTH_CONNECTION_NAME}}"
  },
  "api": {
    "type": "openapi",
    "url": "https://hrapi.contoso.com/openapi.json"
  },
  "logo_url": "https://hrapi.contoso.com/logo.png",
  "contact_email": "support@contoso.com",
  "functions": [
    {
      "name": "getLeaveBalance",
      "description": "Get the current leave balance for the authenticated employee. Use when user asks about remaining leave days, annual leave, sick leave, or time off balance."
    },
    {
      "name": "submitLeaveRequest",
      "description": "Submit a leave request for the employee. Use when user wants to apply for leave, book time off, or request annual/sick/personal leave."
    }
  ]
}

Warning: The description_for_model is what the Copilot orchestrator reads to decide when to invoke the plugin. Be very specific and include example scenarios. Vague descriptions cause the plugin to be ignored or misused.


What authentication options are available for API Plugins?

Auth Type Description Best For
None (anonymous) No authentication Public, non-sensitive APIs
API Key Static API key in header/query string Simple external APIs
OAuth 2.0 (OAuthPluginVault) User authenticates via OAuth. Token stored in Teams token vault. External systems with OAuth support
Microsoft Entra ID SSO Seamless SSO — user's M365 identity used automatically. No login prompt. Internal enterprise APIs

Tip: OAuthPluginVault with Entra ID is the recommended enterprise pattern — users get seamless SSO and the API receives a proper OAuth token scoped to the user's identity. No shared credentials.


What is a Message Extension plugin and how does it differ from an API plugin?

Message Extension plugin: a Teams Message Extension (search/action command) also surfaced as a Copilot plugin. Existing Teams apps can become Copilot plugins with minimal changes.

API plugin: built purely for Copilot, based on OpenAPI spec. No Teams UI surface — only callable by Copilot's orchestrator.

Message Extension Plugin API Plugin
Works in Teams compose box AND Copilot Copilot only
Backend Bot Framework handler Any REST API
Best for Search and insert records into Teams messages Read/write operations, complex workflows
Code required Yes (Bot Framework) No (just OpenAPI spec + ai-plugin.json)

What are Adaptive Cards in the context of Copilot plugins?

Adaptive Cards are JSON-based UI templates that Copilot renders as rich structured responses when a plugin returns data.

{
  "type": "AdaptiveCard",
  "version": "1.5",
  "body": [
    {
      "type": "TextBlock",
      "text": "Your Leave Balance",
      "weight": "Bolder",
      "size": "Large"
    },
    {
      "type": "FactSet",
      "facts": [
        { "title": "Annual Leave", "value": "12 days remaining" },
        { "title": "Sick Leave",   "value": "5 days remaining" },
        { "title": "Personal",     "value": "2 days remaining" }
      ]
    }
  ],
  "actions": [
    {
      "type": "Action.Execute",
      "title": "Apply for Leave",
      "verb": "submitLeaveRequest"
    }
  ]
}

Tip: Always design Adaptive Card responses for data-heavy plugin results. Structured cards are far more readable than plain text responses — and they support actionable buttons for follow-up operations.


How do you write effective OpenAPI descriptions for Copilot plugins?

The Copilot orchestrator reads operation descriptions to decide when and how to invoke plugin functions.

# POOR description (will be ignored or mis-triggered):
operationId: getItems
summary: Get items

# GOOD description (Copilot knows exactly when to use this):
operationId: getEmployeeLeaveBalance
summary: Get an employee's current leave balance and entitlements
description: >
  Returns the authenticated employee's current leave balances for all
  leave types including annual leave, sick leave, personal leave, and
  parental leave. Use this operation when the user asks about:
  - How many days of leave they have left
  - Their annual/sick/personal leave balance
  - Time off entitlements
  - Remaining holidays
parameters:
  - name: leaveType
    description: >
      Optional. Filter by leave type: 'annual', 'sick', 'personal',
      'parental'. Omit to return all leave types.

3. Declarative Agents

What is a Declarative Agent and how does it differ from Microsoft 365 Copilot?

Microsoft 365 Copilot (general): broad AI assistant across all M365 data. General-purpose, answers any work question. No persona customisation.

Declarative Agent: a focused, scoped Copilot experience with:

  1. Custom persona: name, description, avatar — "Contoso HR Assistant", "IT Helpdesk Bot"
  2. Scoped knowledge: only specific SharePoint sites, OneDrive folders, or Graph Connector data
  3. Custom instructions: system prompt defining tone, behaviour, what to answer/refuse
  4. Specific plugins: only the plugins relevant to this agent's purpose
  5. Conversation starters: suggested prompts shown to users on first open

Tip: Declarative Agents are the answer to "build a custom Copilot for our HR team." They are not a separate AI model — they are a configured, scoped view of M365 Copilot.


What is the declarative agent manifest file and what does it contain?

{
  "$schema": "https://developer.microsoft.com/json-schemas/copilot/declarative-agent/v1.4/schema.json",
  "version": "v1.4",
  "name": "Contoso HR Assistant",
  "description": "Your personal HR assistant for leave, policies, and benefits.",
  "instructions": "You are an HR assistant for Contoso. Only answer HR-related questions about leave policies, employee benefits, and HR procedures. If asked about non-HR topics, politely redirect. Always be professional and empathetic. Never share other employees' personal information.",
  "conversation_starters": [
    {
      "title": "Check leave balance",
      "text": "How many days of annual leave do I have remaining?"
    },
    {
      "title": "Submit leave request",
      "text": "I want to apply for annual leave next week"
    },
    {
      "title": "Find HR policy",
      "text": "What is the remote working policy?"
    }
  ],
  "capabilities": [
    {
      "name": "OneDriveAndSharePoint",
      "items_by_sharepoint_ids": [
        {
          "site_id": "contoso.sharepoint.com,abc123,...",
          "web_id": "...",
          "list_id": "..."
        }
      ]
    },
    {
      "name": "GraphConnectors",
      "connections": [{ "connection_id": "hrdocuments" }]
    }
  ],
  "actions": [
    { "id": "hrPlugin", "file": "plugins/hrApiPlugin.json" }
  ]
}

What knowledge sources can a Declarative Agent use?

Source Description
SharePoint sites and libraries Specific sites, libraries, or folders scoped to the agent
OneDrive files Specific folders or files
Microsoft Graph Connectors External data indexed into Graph (ServiceNow, SAP, Confluence, custom DBs)
Web search Public internet via Bing (can be enabled/disabled)
Plugins Real-time data and actions via API plugins

Warning: Knowledge source scoping is a security control — an agent scoped to HR SharePoint sites cannot access Finance sites, even if the user has permission. The agent's scope is a strict constraint on what Copilot searches.


How do you write effective instructions for a Declarative Agent?

The instructions field is the system prompt defining persona, scope, behaviour, and constraints.

Effective instruction structure:

1. PERSONA: "You are [name], the [role] for [organisation]."

2. SCOPE: "Only answer questions about [domain]. If asked about
   [out-of-scope topic], say [specific redirect message]."

3. TONE: "Always be [professional/friendly/concise]. Use
   [plain language / technical language appropriate for audience]."

4. DATA CONSTRAINTS: "Never reveal other employees' [personal data].
   Only show the current user's own [records]."

5. PLUGIN GUIDANCE: "When asked about [specific topic], always use
   the [plugin name] to retrieve real-time data rather than relying
   on documents."

6. ESCALATION: "For questions you cannot answer, direct users to
   [contact/resource]."

Example — IT Helpdesk Agent:
"You are ITBot, the IT Helpdesk assistant for Contoso. Only answer
questions about IT support, software, hardware, network connectivity,
and access requests. For HR questions, direct users to the HR
Assistant. Always check the user's open tickets using the
ServiceNow plugin before suggesting solutions. Never close a ticket
without explicit user confirmation."

Tip: Treat instructions like a detailed job description. The more specific and clear, the better the agent behaves. Vague instructions lead to unpredictable responses and scope creep.


What is Copilot Studio and how does it relate to Declarative Agents?

Copilot Studio provides a low-code UI for building and publishing Declarative Agents without writing JSON files.

Copilot Studio for M365 Copilot agents:

  1. Create and configure declarative agents visually
  2. Add plugins from the catalogue or connect to custom APIs
  3. Add knowledge sources (SharePoint sites, web URLs, uploaded files)
  4. Test the agent in the test panel
  5. Publish to Microsoft 365 Copilot — appears in the Copilot agent store
  6. Share with specific users or deploy org-wide via Teams Admin Center

Copilot Studio Teams Toolkit + JSON
Audience Business users, makers Developers
Authoring GUI-based, low-code Code-first, JSON manifests
Source control Limited Full Git/DevOps support
Best for Quick deployment, business-owned agents Complex agents, ALM, pro-code

4. Microsoft Graph Connectors

What is a Microsoft Graph Connector and what problem does it solve?

A Microsoft Graph Connector indexes external data into the Microsoft Graph — making it searchable by Microsoft 365 Search, Microsoft 365 Copilot, and other Graph-aware services.

The problem it solves: organisations have valuable data in non-Microsoft systems (ServiceNow, Confluence, SAP, Salesforce, SQL databases, intranets). Without a connector, Copilot cannot access this data. A Graph Connector brings it into the Microsoft Search and Copilot ecosystem without migrating data to SharePoint.

Data flow:
ServiceNow / Confluence / SAP / Custom DB
        ↓ (Graph Connector indexes items via Graph API)
Microsoft Graph — external items index
        ↓
Microsoft 365 Search + Copilot can find and cite it
        ↓
User: "What is the status of IT ticket #12345?"
Copilot: retrieves from ServiceNow index → answers with citation

Tip: Graph Connectors are what makes Copilot truly enterprise-ready — they eliminate "Copilot doesn't know about our systems" by bringing external data into Graph's searchable index.


What are the components of a Graph Connector solution?

Component Description API
Connection Defines the connector — name, description POST /external/connections
Schema Structure of external items — property types, searchability PATCH /external/connections/{id}/schema
External items The actual data records pushed into the index PUT /external/connections/{id}/items/{itemId}
Result type How search results are displayed (Adaptive Card template) Admin Center configuration
Crawler/sync agent Your app that reads the external system and pushes items to Graph Custom application

// 1. Create connection
POST https://graph.microsoft.com/v1.0/external/connections
{
  "id": "contosohr",
  "name": "Contoso HR Documents",
  "description": "HR policies, procedures, and employee handbook"
}

// 2. Define schema
PATCH https://graph.microsoft.com/v1.0/external/connections/contosohr/schema
{
  "baseType": "microsoft.graph.externalItem",
  "properties": [
    { "name": "title", "type": "String", "isSearchable": true,
      "isRetrievable": true, "labels": ["title"] },
    { "name": "content", "type": "String", "isSearchable": true,
      "isRetrievable": true, "labels": ["body"] },
    { "name": "url", "type": "String", "isRetrievable": true,
      "labels": ["url"] },
    { "name": "lastModified", "type": "DateTime",
      "isRetrievable": true, "labels": ["lastModifiedDateTime"] },
    { "name": "category", "type": "String", "isSearchable": true,
      "isRefinable": true, "isRetrievable": true }
  ]
}

// 3. Push an external item
PUT https://graph.microsoft.com/v1.0/external/connections/contosohr/items/policy_001
{
  "acl": [
    { "type": "everyone", "value": "everyone", "accessType": "grant" }
  ],
  "properties": {
    "title": "Remote Working Policy 2025",
    "content": "Employees may work remotely up to 3 days per week...",
    "url": "https://intranet.contoso.com/hr/policies/remote-working",
    "lastModified": "2025-01-15T10:00:00Z",
    "category": "Work Arrangements"
  },
  "content": {
    "value": "Full policy text for search indexing...",
    "type": "text"
  }
}

What is ACL in Graph Connectors and why is it critical?

The ACL (Access Control List) on each external item defines who can see it in search results and Copilot responses.

"acl": [
  // Everyone in the org can see:
  { "type": "everyone", "value": "everyone",
    "accessType": "grant" },

  // Specific Entra ID user can see:
  { "type": "user",
    "value": "entra-user-object-id",
    "accessType": "grant" },

  // Members of an Entra ID group can see:
  { "type": "group",
    "value": "aad-group-object-id",
    "accessType": "grant" },

  // Deny a specific user (overrides grants):
  { "type": "user",
    "value": "blocked-user-object-id",
    "accessType": "deny" }
]

Critical: ACL is the security boundary for Graph Connector data. If ACLs are misconfigured (e.g., everyone can see confidential HR records), Copilot will expose that data to all users. Always map source system permissions to Entra ID identities — never default to "everyone" for sensitive data.


What are pre-built Graph Connectors and when would you build custom?

Pre-built connectors (available in M365 Admin Center → Search → Data Sources): ServiceNow, Confluence, Jira, Salesforce, SAP, Azure DevOps, GitHub, MediaWiki, and more.

Build custom when:

  • No pre-built connector exists for your system
  • Pre-built connector doesn't index the specific data/schema you need
  • Custom ACL mapping logic is required for your org's permission model
  • You need to transform or enrich data before indexing
  • You need incremental sync with custom change detection

Tip: Always check the pre-built catalogue first. ServiceNow, Confluence, and Jira connectors cover the majority of enterprise use cases. Custom connectors are for unique or proprietary systems.


5. Security, Governance & Responsible AI

How does Microsoft 365 Copilot enforce data security and permissions?

  1. Microsoft Graph respects existing permissions: Copilot only retrieves data the current user has permission to access in SharePoint, OneDrive, Exchange, and Teams. It cannot elevate permissions.
  2. Graph Connector ACLs: external items only shown to users whose Entra ID identity matches the item's ACL
  3. Plugin authentication: OAuth plugins call external API as the current user — not a service account. Users only see their own data.
  4. Tenant isolation: Copilot data never crosses tenant boundaries. Org data is never used to train or improve the LLM.
  5. Microsoft Purview integration: sensitivity labels on documents are respected — confidential documents are not surfaced to users without appropriate permissions

Key principle: Copilot can only show what the user could find themselves. It is a search and reasoning layer on top of existing permissions — not a permission bypass.


What admin controls exist for governing Copilot extensibility?

Control Location Purpose
App approval Teams Admin Center → Manage apps Allow/block specific plugins and agents
App permission policies Teams Admin Center Control which users can access which Copilot apps
App setup policies Teams Admin Center Pre-install agents for users automatically
Copilot settings M365 Admin Center → Copilot Tenant-wide: web search, connected experiences
Purview audit Microsoft Purview Audit Copilot interactions, retention policies
Graph Connector admin M365 Admin Center → Search Enable/disable connections, view indexed items

Warning: Any plugin or agent published to the organisation must be approved in Teams Admin Center before users can access it. Unapproved apps are blocked by default.


What is Responsible AI and how does it apply to Copilot extensibility?

Microsoft's Responsible AI principles applied to extensibility:

Principle Developer Responsibility
Transparency Be honest about what the agent can/cannot do. Don't design agents that deceive users about their AI nature.
Fairness Don't design agents that return different quality responses based on protected characteristics.
Privacy Don't log user prompts in violation of privacy policies. Use minimum data needed. Respect sensitivity labels.
Reliability Handle plugin errors gracefully. Never let a failed API call result in a misleading response.
Human oversight For high-stakes actions (approvals, purchases, deletions), require explicit confirmation via Adaptive Card before executing.
Accountability Maintain audit logs of agent actions. Ensure humans can review and override agent decisions.

6. Scenario-Based Questions

Scenario: Build a Copilot plugin that lets employees check their ServiceNow IT tickets.

  1. Approach: API Plugin (not message extension) — structured data, Copilot-only needed
  2. ServiceNow API: use Table API. OpenAPI spec for:
    • GET /api/now/table/incident?caller_id={userId} — get user's tickets
    • GET /api/now/table/incident/{id} — get ticket detail
  3. Authentication: OAuth 2.0 with Entra ID SSO — users authenticate once
  4. Critical — OpenAPI descriptions:
    operationId: getMyTickets
    description: "Retrieves all IT support tickets raised by the current
      employee. Use when user asks about their tickets, incidents, IT
      requests, open issues, support cases, or IT problems."
    
  5. Adaptive Card: ticket number, title, status, priority, last updated + "View in ServiceNow" deep-link button
  6. ai-plugin.json: detailed description_for_model with example user queries
  7. Deploy: Teams Toolkit → package → Teams Admin Center → admin approves

Key insight: Description quality determines plugin invocation. Poor descriptions = Copilot never uses the plugin even when it should.


Scenario: Build a Declarative Agent for the Sales team with CRM and product data.

Agent configuration:

  • Name: "Contoso Sales Assistant"
  • Instructions: "You are a Sales Assistant for Contoso. Help sales reps with product information, pricing, opportunity management, and competitive analysis. Always reference official pricing from the pricing API. Only access the current user's opportunities from Salesforce."

Knowledge sources:

  • SharePoint: Product catalogue site, Sales playbooks, Pricing documents
  • Graph Connector: Salesforce CRM (opportunities, accounts, contacts — indexed nightly)
  • OneDrive: Sales team shared folder

Plugins:

  • Salesforce API plugin: get/update opportunities, log activities
  • Pricing API plugin: real-time pricing for product configurations

Conversation starters:

  • "What are my open opportunities this quarter?"
  • "What is the pricing for Product X with Enterprise support?"
  • "Help me prepare talking points for my meeting with [Company]"

Deploy: Copilot Studio → publish → share with the Sales Entra ID group


Scenario: Your organisation has a legacy intranet with thousands of HR policy documents. How do you make them available to Copilot?

Solution: Custom Graph Connector

  1. Build a sync agent: Azure Function or .NET app that reads the intranet's content via its API or crawl
  2. Create connection: POST /external/connections with id: "hrintranet", descriptive name and description
  3. Define schema: properties for Title, Content, URL, Author, LastModified, PolicyCategory — mark Content and Title as isSearchable: true, set appropriate labels
  4. Map permissions: if HR policies are org-wide, use ACL type "everyone". For restricted policies (e.g., management-only), map intranet groups to Entra ID group object IDs
  5. Push items: for each document, PUT /external/connections/hrintranet/items/{id} with properties and content (see the sketch after this list)
  6. Schedule incremental sync: track lastCrawledDateTime, only push items modified since last sync
  7. Configure result type: Adaptive Card template for how results appear in search
  8. Add to HR Declarative Agent: reference "connection_id": "hrintranet" in agent capabilities
  9. Test: ask Copilot "What is the parental leave policy?" — should cite intranet document with link
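A sketch of step 5 using raw HttpClient; the item payload and IDs follow the earlier examples, and acquiring an app-only token (e.g. via MSAL client credentials with ExternalItem.ReadWrite.All) is assumed:

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;

string accessToken = "<app-only token via MSAL client credentials>";

var http = new HttpClient { BaseAddress = new Uri("https://graph.microsoft.com/v1.0/") };
http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", accessToken);

var item = new
{
    acl = new[] { new { type = "everyone", value = "everyone", accessType = "grant" } },
    properties = new
    {
        title = "Remote Working Policy 2025",
        url = "https://intranet.contoso.com/hr/policies/remote-working",
        lastModified = "2025-01-15T10:00:00Z"
    },
    content = new { value = "Full policy text for search indexing...", type = "text" }
};

// PUT is an upsert: the same call creates or updates item policy_001
HttpResponseMessage response = await http.PutAsync(
    "external/connections/hrintranet/items/policy_001",
    new StringContent(JsonSerializer.Serialize(item), Encoding.UTF8, "application/json"));
response.EnsureSuccessStatusCode();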

Scenario: How do you handle a plugin write operation with user confirmation before executing?

Requirement: Plugin submits a leave request. User must confirm before the request is submitted — Copilot should not submit automatically on a single ambiguous prompt.

Two-step API design:

# Step 1 — Preview (safe, no side effects):
GET /leaveRequest/preview?type={leaveType}&start={date}&end={date}
Returns: { dates, days, remainingBalance, approver, confirmation_id }

# Step 2 — Submit (only called after user confirms):
POST /leaveRequest/submit
Body: { confirmation_id }

Adaptive Card confirmation returned from preview:

{
  "type": "AdaptiveCard",
  "body": [
    { "type": "TextBlock", "text": "Leave Request Preview", "weight": "Bolder" },
    { "type": "FactSet", "facts": [
      { "title": "Type", "value": "Annual Leave" },
      { "title": "Dates", "value": "15 Jan – 19 Jan 2026" },
      { "title": "Days", "value": "5 days" },
      { "title": "Balance after", "value": "7 days remaining" }
    ]}
  ],
  "actions": [
    { "type": "Action.Execute", "title": "✓ Confirm & Submit",
      "verb": "submitLeave" },
    { "type": "Action.Execute", "title": "✗ Cancel",
      "verb": "cancelLeave" }
  ]
}

Plugin instructions in ai-plugin.json:

"Always call the preview endpoint first and display the confirmation card. Only call the submit endpoint when the user explicitly clicks Confirm in the card."

Responsible AI principle: The two-step preview-then-confirm pattern is the standard for any write operation in Copilot plugins. It gives users agency and prevents accidental actions from ambiguous prompts.


Scenario: How do you ensure a Declarative Agent doesn't leak confidential Finance data to HR users?

  1. Scope knowledge sources strictly: only add HR SharePoint sites to the HR agent — never Finance sites, even if the admin has access to both
  2. Use Graph Connector ACLs: HR-specific connector items have ACL entries for the HR security group only — Finance documents have Finance group ACL. Copilot automatically enforces these.
  3. Plugin authentication: use OAuth SSO — the plugin calls the HR API as the current user. The API enforces its own role-based access. Copilot cannot bypass API-level security.
  4. Instructions boundary: "Only answer questions about HR. If asked about financial data, budgets, or Finance matters, say 'I only have access to HR information.'"
  5. Test with non-HR users: verify Finance users cannot retrieve HR-restricted content through the agent

7. Cheat Sheet — Quick Reference

Extensibility Mechanisms at a Glance

API Plugin
→ OpenAPI spec + ai-plugin.json + manifest.json
→ Copilot calls your REST API for real-time data/actions
→ Auth: None / API Key / OAuth / Entra SSO
→ Response: plain text or Adaptive Cards

Message Extension Plugin
→ Existing Teams app (composeExtension) surfaced in Copilot
→ Works in Teams compose box AND Copilot
→ Uses Bot Framework
→ Good for search + insert into Teams messages

Declarative Agent
→ agents/myAgent.json + manifest.json
→ Scoped Copilot with custom persona + knowledge + plugins
→ No custom AI model — configured view of M365 Copilot
→ Authoring: Teams Toolkit (code) or Copilot Studio (low-code)

Graph Connector
→ Custom app calling Graph API to index external items
→ Creates: connection → schema → push items
→ Items appear in M365 Search and Copilot responses
→ ACL per item controls who can see it

AI Plugin Definition — Key Fields

{
  "name_for_model": "ShortNoSpacesName",
  "description_for_model": "DETAILED description of what this plugin does,
    when to use it, example scenarios. This is what Copilot reads.",
  "auth": {
    "type": "OAuthPluginVault | ApiKeyAuth | None",
    "reference_id": "${{OAUTH_CONNECTION}}"
  },
  "functions": [{
    "name": "operationId from OpenAPI spec",
    "description": "SPECIFIC description of this function with
      example user queries that should trigger it"
  }]
}

Declarative Agent — Key Fields

{
  "name": "Display name shown to users",
  "instructions": "System prompt — persona, scope, tone, data rules",
  "conversation_starters": [
    { "title": "Short title", "text": "Full prompt to send" }
  ],
  "capabilities": [
    {
      "name": "OneDriveAndSharePoint",
      "items_by_sharepoint_ids": [{ "site_id": "..." }]
    },
    {
      "name": "GraphConnectors",
      "connections": [{ "connection_id": "myconnection" }]
    }
  ],
  "actions": [
    { "id": "myPlugin", "file": "plugins/plugin.json" }
  ]
}

Graph Connector — API Flow

1. Create connection
POST /v1.0/external/connections
{ "id": "myconn", "name": "...", "description": "..." }

2. Register schema (async — poll until Completed)
PATCH /v1.0/external/connections/myconn/schema
{ "baseType": "microsoft.graph.externalItem",
  "properties": [{ "name": "title", "type": "String",
    "isSearchable": true, "labels": ["title"] }] }

3. Push items
PUT /v1.0/external/connections/myconn/items/{itemId}
{ "acl": [...],
  "properties": { "title": "...", "content": "..." },
  "content": { "value": "...", "type": "text" } }

4. Update items (same PUT — upsert semantics)
5. Delete items
DELETE /v1.0/external/connections/myconn/items/{itemId}

Schema property labels (map to Microsoft semantics):
title, url, iconUrl, createdBy, lastModifiedBy,
createdDateTime, lastModifiedDateTime, fileName, fileExtension, body

Deployment Path

Development:
Teams Toolkit → scaffold project → code → local debug with Dev Tunnel

Packaging:
Teams Toolkit → Provision → Deploy → Package

Internal deployment:
Upload .zip to Teams Admin Center → Admin approves → Users access

Public (Teams Store):
Microsoft Partner Center → Submit for validation → Store listing

Top 10 Tips

  1. Three mechanisms, one sentence each — "Graph Connectors bring data IN. Plugins let Copilot act OUT. Declarative Agents package it into a focused experience." Start every architecture question here.
  2. description_for_model quality = plugin success — the orchestrator reads this to decide when to invoke the plugin. Vague descriptions = plugin never used. This is the #1 developer mistake.
  3. Copilot cannot bypass permissions — it only surfaces what the user could already access. Copilot is a reasoning layer, not a permission bypass. Security questions always return to this principle.
  4. ACL misconfiguration is the #1 Graph Connector risk — defaulting to "everyone" on sensitive items exposes confidential data to all users through Copilot queries.
  5. Two-step confirm pattern for write operations — preview first, submit only on explicit user confirmation. Responsible AI standard for any destructive or irreversible action.
  6. Declarative Agents ≠ new AI model — they are a configured, scoped view of M365 Copilot. Many candidates think they require a custom LLM — they do not.
  7. Teams Admin Center approval is mandatory — no plugin or agent reaches users without admin approval. This is the governance gate — always mention it in deployment discussions.
  8. OAuthPluginVault with Entra SSO is the correct enterprise auth answer. Shared service account credentials in plugins are a security anti-pattern.
  9. Copilot Studio vs Teams Toolkit — Copilot Studio for makers and business users; Teams Toolkit for developers needing source control and ALM. Both produce the same underlying manifest JSON.
  10. Schema property labels in Graph Connectors map external properties to Microsoft's semantic understanding — labels: ["title"], labels: ["body"] etc. Correct labelling dramatically improves Copilot's ability to summarise and cite your external content.

