Wednesday, April 29, 2026

Microsoft Fabric & OneLake Complete Guide

OneLake · Lakehouse · Data Warehouse · Direct Lake · Medallion Architecture · Real-Time Intelligence · Scenarios · Cheat Sheet


Table of Contents

  1. Core Concepts — Basics
  2. OneLake — The Foundation
  3. Fabric Workloads
  4. Data Engineering — Notebooks, Pipelines & Medallion
  5. Security & Governance
  6. Performance Optimisation
  7. Scenario-Based Questions
  8. Cheat Sheet — Quick Reference

1. Core Concepts — Basics

What is Microsoft Fabric and what problem does it solve?

Microsoft Fabric is an all-in-one analytics platform that unifies data engineering, data integration, data warehousing, real-time analytics, data science, and business intelligence into a single SaaS platform.

Before Fabric (fragmented):

  • Azure Synapse → warehousing
  • Azure Data Factory → pipelines
  • Azure Databricks → Spark
  • Power BI Premium → BI
  • Separate storage, separate governance, separate billing, complex integration

With Fabric (unified):

  • All workloads in one platform
  • Single copy of data in OneLake
  • Unified security and governance via Microsoft Purview
  • Single capacity billing model (F SKUs)

Key insight: Fabric is NOT a new product — it is a platform that unifies existing Microsoft data tools under one roof with OneLake as the shared storage layer. Everything reads and writes from the same data lake.


What are the core workloads in Microsoft Fabric?

Workload                  Description                                                     Evolution From
Data Factory              Data integration and orchestration — pipelines and dataflows    Azure Data Factory
Data Engineering          Big data processing — Lakehouses, Spark notebooks               Azure Synapse Spark
Data Warehouse            SQL-based analytics warehouse — T-SQL querying                  Azure Synapse SQL Dedicated Pool
Real-Time Intelligence    Streaming ingestion and KQL analytics                           Azure Data Explorer
Data Science              ML experiments, models, Python/R notebooks                      Azure Synapse ML
Power BI                  Business intelligence (Direct Lake mode)                        Power BI Premium
Data Activator            No-code data alerting and automation                            New in Fabric

What is a Fabric Workspace and how does it relate to capacity?

Workspace: collaboration container holding all Fabric items (Lakehouses, Warehouses, Notebooks, Pipelines, Reports, Semantic Models).

Capacity: the compute allocation powering Fabric workloads. Every workspace must be assigned to a capacity.

SKU         Type                Use Case
Trial       Free 60 days        Evaluation
F2–F2048    Fabric SKUs         Pay-as-you-go or reserved
P SKUs      Power BI Premium    Existing Premium → Fabric compatible

Tip: Capacity is shared across ALL workloads in assigned workspaces. A Spark job, pipeline run, and Power BI report refresh all consume from the same capacity pool. Size based on peak concurrent workload requirements.


How does Microsoft Fabric relate to Azure Synapse Analytics?

Fabric is the strategic successor to Azure Synapse Analytics:

Synapse Feature               Fabric Equivalent
Synapse Spark                 Fabric Data Engineering (Lakehouses + Notebooks)
Synapse SQL Dedicated Pool    Fabric Data Warehouse
Synapse Pipelines             Fabric Data Factory Pipelines
Synapse Link                  Fabric Mirroring
Azure Data Explorer           Fabric Real-Time Intelligence (Eventhouse/KQL DB)

Key additions in Fabric vs Synapse: OneLake (shared storage), Direct Lake Power BI mode, Data Activator, unified Purview governance, simpler capacity-based pricing.

Warning: Azure Synapse is not deprecated — existing workloads continue. But all new analytics projects should be built on Fabric. Microsoft's investment is entirely focused on Fabric going forward.


What is the Fabric item hierarchy?

Microsoft Fabric Tenant
└── Capacity (F64, F128, etc.)
    └── Workspace (collaboration container)
        ├── Lakehouse (data lake + Delta tables)
        ├── Data Warehouse (T-SQL warehouse)
        ├── Notebook (Spark development)
        ├── Spark Job Definition (packaged Spark job)
        ├── Data Pipeline (orchestration)
        ├── Dataflow Gen2 (Power Query ETL)
        ├── Semantic Model (Power BI dataset)
        ├── Report (Power BI report)
        ├── KQL Database / Eventhouse (real-time)
        ├── Eventstream (streaming ingestion)
        └── Reflex (Data Activator alerts)

2. OneLake — The Foundation

What is OneLake and what makes it architecturally significant?

OneLake is a single, unified, tenant-wide data lake that is the storage foundation of Microsoft Fabric.

Key architectural principles:

  1. One copy of data: all Fabric workloads read from the same data — no copies between workloads
  2. Automatic with Fabric: every Fabric tenant has exactly one OneLake, automatically provisioned
  3. Built on ADLS Gen2: fully compatible with any tool supporting Azure Data Lake Storage
  4. Open formats: data stored in Delta Parquet — readable by any engine (Spark, Trino, DuckDB) without conversion
  5. Hierarchical: Tenant → Workspace → Lakehouse/Warehouse → Tables/Files

Analogy: OneLake is the "OneDrive for data" — just as OneDrive gives every user one place for files, OneLake gives every organisation one place for data.


What is the OneLake folder structure?

OneLake (tenant-level)
└── Workspace: ContosoAnalytics
    ├── Bronze_Lakehouse.Lakehouse/
    │   ├── Tables/                    ← Delta tables (queryable via SQL)
    │   │   ├── raw_orders/            ← Delta Parquet files
    │   │   └── raw_customers/
    │   └── Files/                     ← Raw files (CSV, JSON, Parquet, images)
    │       ├── raw/
    │       └── processed/
    ├── Gold_Lakehouse.Lakehouse/
    │   └── Tables/
    │       ├── FactSales/
    │       └── DimProduct/
    └── Finance_Warehouse.Warehouse/
        └── Tables/                    ← Warehouse tables (Delta format)

Access paths:
ADLS DFS:  abfss://workspace@onelake.dfs.fabric.microsoft.com/Bronze_Lakehouse.Lakehouse/Tables/raw_orders
OneLake:   https://onelake.dfs.fabric.microsoft.com/ContosoAnalytics/Bronze_Lakehouse.Lakehouse/Tables/raw_orders
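The two access paths above follow a fixed pattern, which can be captured with a small helper. This is an illustrative function, not an official API; it simply assembles the URI shapes shown above from the workspace, item, and table names:

```python
# Illustrative helper (not an official SDK call): build the two OneLake
# URI styles for a table inside a Lakehouse or Warehouse item.
def onelake_paths(workspace: str, item: str, table: str, item_type: str = "Lakehouse"):
    suffix = f"{item}.{item_type}/Tables/{table}"
    return {
        # ADLS Gen2 DFS-style path (workspace acts as the container)
        "abfss": f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/{suffix}",
        # HTTPS-style path (workspace appears in the URL path)
        "https": f"https://onelake.dfs.fabric.microsoft.com/{workspace}/{suffix}",
    }

paths = onelake_paths("ContosoAnalytics", "Bronze_Lakehouse", "raw_orders")
print(paths["abfss"])
# abfss://ContosoAnalytics@onelake.dfs.fabric.microsoft.com/Bronze_Lakehouse.Lakehouse/Tables/raw_orders
```

Any ADLS Gen2-compatible client can use the abfss form, which is why existing Spark and tooling integrations work against OneLake unchanged.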

What are OneLake Shortcuts and how do they enable a "single copy" architecture?

Shortcuts are virtual pointers to data stored outside the current Lakehouse — appearing as folders but not copying data.

Supported shortcut sources:

  • Another workspace in OneLake
  • Azure Data Lake Storage Gen2
  • Amazon S3
  • Google Cloud Storage
  • Dataverse (via Link to Microsoft Fabric)

Shortcut scenarios:

1. Cross-workspace data sharing (zero copy):
   Finance Lakehouse → shortcut → Sales Gold Lakehouse/Tables/FactSales
   Finance queries FactSales without any data duplication
   Sales updates are immediately visible to Finance

2. External ADLS access:
   Lakehouse → shortcut → abfss://existing-lake.dfs.core.windows.net/raw
   Query existing ADLS data from Fabric without migration

3. Multi-cloud data access:
   Lakehouse → shortcut → AWS S3 bucket
   Spark notebooks query S3 data alongside OneLake data

Tip: Shortcuts eliminate the need to copy data between teams, workspaces, and clouds. This is how Fabric implements "single copy of truth" across a large organisation — no ETL duplication pipelines.


What is the Delta Lake format and why does Fabric use it as the standard?

Delta Lake is an open-source storage format adding ACID transactions, schema enforcement, time travel, and efficient upserts to Parquet files.

Feature              Benefit
ACID transactions    Concurrent reads/writes without corruption
Time travel          Query data as of a previous point in time
Schema evolution     Add columns without breaking existing readers
Efficient upserts    MERGE INTO for CDC patterns
Open format          Any Spark, Trino, or DuckDB engine can read
Direct Lake          Power BI reads Delta files at in-memory speed

-- Time travel:
SELECT * FROM silver_orders VERSION AS OF 5
SELECT * FROM silver_orders TIMESTAMP AS OF '2025-01-01'

-- Upsert (MERGE):
MERGE INTO silver_orders AS target
USING new_orders AS source
ON target.OrderId = source.OrderId
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

-- View Delta history:
DESCRIBE HISTORY silver_orders

3. Fabric Workloads

What is a Fabric Lakehouse and how does it differ from a Data Warehouse?

                       Lakehouse                                         Data Warehouse
Schema                 On-read (Files) + on-write (Tables)               Strict schema-on-write (DDL required)
Data types             Structured + semi-structured + unstructured       Structured only
Primary interface      Spark (Python, Scala, SQL)                        T-SQL (INSERT, UPDATE, DELETE, MERGE)
Secondary interface    Auto SQL Endpoint (read-only T-SQL)               —
Stored procedures      No                                                Yes
Best for               Data engineering, data science, raw→curated      Business analysts, BI, structured semantic layer
Storage                OneLake (Delta Parquet)                           OneLake (Delta Parquet)

Tip: Both store data in OneLake as Delta Parquet. The Lakehouse auto-generates a SQL Endpoint — so a Lakehouse IS also queryable via T-SQL (read-only). A Warehouse is T-SQL read-write. This overlap confuses many candidates.


What is Direct Lake mode in Power BI and why is it a breakthrough?

Direct Lake is a new Power BI connectivity mode that reads Delta Parquet files directly from OneLake into Power BI's VertiPaq in-memory engine.

Three Power BI connection modes:

Import:
→ Data COPIED into VertiPaq memory
→ Fastest queries | Stale (last refresh only) | Size limited

DirectQuery:
→ Live queries to source on EVERY visual interaction
→ Always current | Slow (source DB performance) | No size limit

Direct Lake (NEW — Fabric only):
→ Reads Delta Parquet from OneLake into VertiPaq ON DEMAND
→ Import-speed performance (in-memory columnar)
→ Always current (reads latest Delta files automatically)
→ No dataset size limit
→ "Framing": loads column segments on first access
→ Falls back to DirectQuery if framing not complete

Tip: Direct Lake eliminates the Import vs DirectQuery trade-off — Import-level performance + DirectQuery-level freshness. This is the biggest Power BI innovation in years and a guaranteed question for any Fabric role.
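The "framing" behaviour described above — column segments loaded into VertiPaq memory on first access, then served from memory — can be sketched as a lazy cache. This is a conceptual toy, not the real engine; the class and names are invented for illustration:

```python
# Conceptual sketch of Direct Lake "framing" (illustrative only):
# columns load into an in-memory cache on first access; later reads
# hit the cache, which is why only the first query pays the cold cost.
class LazyColumnCache:
    def __init__(self, storage: dict):
        self.storage = storage   # stands in for Delta Parquet files in OneLake
        self.memory = {}         # stands in for VertiPaq column segments
        self.loads = 0           # counts cold loads from storage

    def read(self, column: str):
        if column not in self.memory:        # cold path: "frame" the column
            self.memory[column] = self.storage[column]
            self.loads += 1
        return self.memory[column]           # warm path: in-memory speed

cache = LazyColumnCache({"SalesAmount": [10, 20, 30], "Region": ["EMEA", "AMER"]})
cache.read("SalesAmount")   # first access loads from storage
cache.read("SalesAmount")   # second access served from memory
print(cache.loads)          # 1
```

When the underlying Delta table changes, the engine reframes against the new file versions — which is why freshness comes for free, unlike Import mode.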


What is Real-Time Intelligence in Fabric?

Component                    Description
Eventstream                  No-code streaming ingestion — capture, transform, route events from Event Hubs, IoT Hub, Kafka, CDC
KQL Database (Eventhouse)    High-performance analytical DB for time-series data; KQL queries over billions of rows in seconds
KQL Queryset                 Saved KQL queries for reuse
Real-Time Dashboard          Auto-refreshing dashboards (sub-second) powered by KQL
Data Activator (Reflex)      No-code alerting — trigger email/Teams/flows when data conditions are met

Real-time pipeline:
IoT sensors / Event Hubs / Kafka / CDC
  → Eventstream (ingest + route + transform)
  → KQL Database / Eventhouse (store + query)
  → Real-Time Dashboard (visualise live)
  → Data Activator (alert on anomalies)
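The final step of the pipeline — Data Activator evaluating a condition over incoming events — boils down to a rule check. A minimal sketch, with invented event fields and a hypothetical reorder threshold:

```python
# Toy version of a Data Activator-style rule (names are illustrative):
# raise an alert for each event whose stock level drops below a threshold.
def check_events(events, threshold):
    """Return alert messages for events breaching the stock threshold."""
    alerts = []
    for e in events:
        if e["stock"] < threshold:
            alerts.append(f"Reorder {e['product']}: stock={e['stock']}")
    return alerts

events = [
    {"product": "SKU-1", "stock": 42},
    {"product": "SKU-2", "stock": 3},   # below the reorder threshold
]
print(check_events(events, threshold=5))
# ['Reorder SKU-2: stock=3']
```

In the real service this evaluation is configured without code, and the "alert" is an email, a Teams message, or a Power Automate flow rather than a returned string.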

What is Fabric Mirroring and what databases does it support?

Fabric Mirroring replicates operational database data into OneLake as Delta Parquet tables in near real-time — using Change Data Capture (CDC), not traditional ETL.

Supported sources:

  • Azure SQL Database
  • Azure Cosmos DB
  • Snowflake
  • Azure SQL Managed Instance
  • Azure Database for PostgreSQL
  • MongoDB Atlas

How it works:
Source DB (Azure SQL)
  → CDC captures inserts/updates/deletes
  → Fabric replicates changes into OneLake
  → Latency: typically < 1 minute
  → Stored as Delta tables (queryable via Spark, SQL Endpoint, Power BI)
  → Appears as a shortcut in the workspace

Benefits:
→ No ETL pipeline to build or maintain
→ Near real-time analytics on operational data
→ Zero duplication — reads directly from CDC log
→ Combines with medallion architecture for Silver/Gold transformation

Tip: Mirroring eliminates complex ADF pipelines for operational DB analytics. It is the recommended pattern for near real-time analytics on Azure SQL, Cosmos DB, and Snowflake.
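The CDC replay that Mirroring automates — applying inserts, updates, and deletes in commit order to a keyed copy of the table — can be sketched in plain Python. The function and event shape below are invented for illustration:

```python
# Plain-Python sketch of CDC apply semantics (what Mirroring automates):
# replay insert/update/delete change events against a keyed table copy.
def apply_cdc(table: dict, changes: list) -> dict:
    """table: {key: row}; changes: [{'op', 'key', 'row'}] in commit order."""
    for c in changes:
        if c["op"] in ("insert", "update"):
            table[c["key"]] = c["row"]      # upsert semantics
        elif c["op"] == "delete":
            table.pop(c["key"], None)       # idempotent delete
    return table

orders = {1: {"status": "open"}}
changes = [
    {"op": "update", "key": 1, "row": {"status": "shipped"}},
    {"op": "insert", "key": 2, "row": {"status": "open"}},
    {"op": "delete", "key": 1, "row": None},
]
print(apply_cdc(orders, changes))   # {2: {'status': 'open'}}
```

Because changes are applied in commit order, the mirrored Delta table converges to the source table's current state without any batch ETL window.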


What is the Semantic Model in Fabric?

A Semantic Model (previously called a Power BI Dataset) is the reusable analytical layer between data storage and reports. In Fabric:

  • Created directly on a Lakehouse or Warehouse
  • Uses Direct Lake mode by default (no data import needed)
  • Contains measures (DAX), hierarchies, relationships, and security roles
  • Multiple reports can share a single semantic model
  • Can be certified/endorsed for enterprise use
  • Accessible via XMLA endpoint for external tools (Excel, Tabular Editor, SSMS)

4. Data Engineering — Notebooks, Pipelines & Medallion

What is the Medallion Architecture and how is it implemented in Fabric?

The Medallion Architecture organises data into three progressive quality layers stored as Delta tables in Lakehouses.

Bronze (Raw) layer:
→ Raw data as-is from source systems
→ No transformations, append-only or full snapshot
→ Preserves source fidelity for reprocessing
→ Lakehouse: Tables/bronze/ or Files/bronze/

Silver (Cleansed) layer:
→ Cleaned, validated, deduplicated
→ Standardised data types and naming conventions
→ Joined/enriched from multiple Bronze sources
→ Optimised: partitioned, Z-ordered
→ Lakehouse: Tables/silver/

Gold (Business) layer:
→ Business-level aggregations and metrics
→ Conformed dimensions and fact tables (star schema)
→ Ready for Power BI Direct Lake consumption
→ Domain-specific: Sales Gold, Finance Gold, HR Gold
→ Lakehouse: Tables/gold/

Recommended Fabric implementation:
Bronze Workspace   → Bronze Lakehouse
Silver Workspace   → Silver Lakehouse (shortcuts to Bronze)
Gold Workspace     → Gold Lakehouse (shortcuts to Silver)
Reporting          → Semantic Models → Power BI Reports

Separate workspaces = separate security roles per layer

Tip: Separate Lakehouses per medallion layer is the recommended enterprise pattern — different security roles, workspace assignments, and access control per layer.
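The three layers above can be walked through end to end with plain Python rows (illustrative only; in Fabric each step would be a Spark notebook writing Delta tables):

```python
# Bronze: raw, as-is — duplicates and test rows included
bronze = [
    {"OrderId": 1, "Country": "UK", "Amount": 100, "Status": "OK"},
    {"OrderId": 1, "Country": "UK", "Amount": 100, "Status": "OK"},   # duplicate
    {"OrderId": 2, "Country": "US", "Amount": 250, "Status": "TEST"}, # test row
    {"OrderId": 3, "Country": "US", "Amount": 80,  "Status": "OK"},
]

# Silver: filter test rows, deduplicate on the business key
seen, silver = set(), []
for r in bronze:
    if r["Status"] != "TEST" and r["OrderId"] not in seen:
        seen.add(r["OrderId"])
        silver.append(r)

# Gold: business-level aggregate (sales by country)
gold = {}
for r in silver:
    gold[r["Country"]] = gold.get(r["Country"], 0) + r["Amount"]

print(gold)   # {'UK': 100, 'US': 80}
```

Note how Bronze keeps the bad rows: because the raw layer preserves source fidelity, the Silver rules can be changed and replayed at any time.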


What are Fabric Notebooks and what do they support?

Fabric Notebooks are interactive Spark development environments supporting:

  • PySpark (Python + Spark) — primary language
  • Spark SQL — SQL against Delta tables
  • Scala — JVM Spark code
  • R (SparkR) — R with Spark
  • %%sql magic — inline SQL in Python notebooks

# Read from Bronze Lakehouse
df = spark.read.format("delta").load("Tables/raw_orders")

# Transform (PySpark)
from pyspark.sql.functions import col, to_date, when, trim

df_silver = (df
  .filter(col("OrderStatus") != "TEST")
  .filter(col("OrderId").isNotNull())
  .withColumn("OrderDate", to_date(col("OrderDateStr"), "yyyy-MM-dd"))
  .withColumn("CustomerName", trim(col("CustomerName")))
  .withColumn("Region",
    when(col("Country").isin("UK", "IE", "DE", "FR"), "EMEA")
    .when(col("Country").isin("US", "CA"), "AMER")
    .otherwise("APAC"))
  .dropDuplicates(["OrderId"])
)

# Write to Silver Lakehouse as Delta (upsert pattern)
from delta.tables import DeltaTable

if spark.catalog.tableExists("silver_orders"):
    DeltaTable.forName(spark, "silver_orders").alias("target") \
      .merge(df_silver.alias("source"), "target.OrderId = source.OrderId") \
      .whenMatchedUpdateAll() \
      .whenNotMatchedInsertAll() \
      .execute()
else:
    df_silver.write.format("delta").saveAsTable("silver_orders")

What are Fabric Data Pipelines?

Fabric Data Pipelines use the same visual designer and JSON format as Azure Data Factory. Key activities:

Activity                     Purpose
Copy Data                    Move data between 90+ sources/sinks
Notebook                     Execute a Fabric Notebook (Spark)
Spark Job Definition         Run a packaged PySpark/Scala job
Dataflow Gen2                Run a Power Query Online dataflow
Stored Procedure             Execute SQL in a Warehouse
If Condition / Switch        Conditional control flow
ForEach / Until              Iterative control flow
Get Metadata / Validation    File/table checks

Tip: ADF pipelines are compatible with Fabric pipelines (same JSON format). The key difference: Fabric pipelines have native OneLake connectivity — no linked service configuration needed for internal Fabric data access.


What is a Dataflow Gen2 and when should you use it vs Spark?

Use Dataflow Gen2 when:

  • Business analyst / maker-level ETL (no Spark knowledge needed)
  • 150+ Power Query connectors not available in pipelines
  • Simple transformations: filter, merge, pivot, custom M columns
  • Small-to-medium datasets

Use Spark Notebooks when:

  • Large-scale data (billions of rows)
  • Complex transformations requiring ML or custom logic
  • Full developer control and unit testability
  • Integration with Python ecosystem (pandas, scikit-learn, PyTorch)

Dataflow Gen2 outputs to:

  • Fabric Lakehouse (Tables or Files)
  • Fabric Data Warehouse
  • Azure SQL, other external databases
  • OneLake directly

What is V-Order and Z-Order optimisation in Fabric Delta tables?

V-Order: a Microsoft-specific write-time optimisation for Delta Parquet files. Sorts and compresses data within Parquet row groups for maximum Power BI Direct Lake read performance. Enabled by default in Fabric.

Z-Order: a data skipping optimisation co-locating related data in the same Parquet files. When filtering on a Z-ordered column, Spark skips entire files that don't match — dramatically reducing I/O.

# Z-Order after data ingestion (run periodically):
spark.sql("""
  OPTIMIZE gold_FactSales
  ZORDER BY (CustomerId, OrderDate)
""")
# Queries filtering by CustomerId or OrderDate now read far fewer files

# V-Order (write-time, enabled by default in Fabric; the session-level
# config key may vary by runtime version):
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")

# Compact small files:
spark.sql("OPTIMIZE silver_orders")  # merges small files into ~256MB chunks

# Vacuum old Delta versions (default: 7 days retention):
spark.sql("VACUUM silver_orders RETAIN 168 HOURS")

Key distinction: V-Order = Fabric/Power BI optimisation (Direct Lake performance). Z-Order = Spark/Delta general query optimisation. Know both for architect-level.
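The data skipping that Z-Order enables rests on per-file min/max statistics: a filter can prune any file whose value range cannot match. A simplified sketch with invented file stats:

```python
# Simplified data-skipping illustration: each Parquet file carries min/max
# stats per column; a point filter prunes files whose range can't match.
files = [
    {"name": "part-0", "min_cust": 1,   "max_cust": 100},
    {"name": "part-1", "min_cust": 101, "max_cust": 200},
    {"name": "part-2", "min_cust": 201, "max_cust": 300},
]

def files_to_read(files, cust_id):
    """Keep only files whose [min, max] range could contain cust_id."""
    return [f["name"] for f in files if f["min_cust"] <= cust_id <= f["max_cust"]]

print(files_to_read(files, 150))   # ['part-1'] — two of three files skipped
```

Z-ordering matters because it co-locates related values, keeping each file's min/max range narrow; without it, ranges overlap and far fewer files can be skipped.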


5. Security & Governance

What are the workspace roles in Fabric?

Role           Capabilities
Admin          Full control — manage workspace, members, and all items; can delete the workspace
Member         Create, edit, and delete items; share items; cannot manage workspace settings
Contributor    Create and edit items; cannot delete others' items or share
Viewer         Read-only — view reports, query the SQL Endpoint; cannot edit

Warning: Workspace roles are coarse-grained — they apply to ALL items. For fine-grained data access (which rows/columns), use Row-Level Security and Column-Level Security within Lakehouses and Warehouses.


What is OneSecurity and how does it implement Row and Column-Level Security?

-- ROW-LEVEL SECURITY (Warehouse / Lakehouse SQL Endpoint):
-- 1. Create security predicate function
CREATE FUNCTION dbo.fn_region_filter(@Region AS sysname)
  RETURNS TABLE WITH SCHEMABINDING AS
  RETURN SELECT 1 AS result
    WHERE @Region = USER_NAME()
       OR IS_MEMBER('GlobalDataAdmin') = 1;

-- 2. Apply security policy
CREATE SECURITY POLICY RegionFilter
  ADD FILTER PREDICATE dbo.fn_region_filter(Region)
  ON dbo.SalesOrders WITH (STATE = ON);
-- Users only see rows where Region matches their username

-- COLUMN-LEVEL SECURITY:
GRANT SELECT ON dbo.Employees(EmployeeId, Name, Department) TO [HRViewers];
DENY SELECT ON dbo.Employees(Salary, PersonalEmail, NationalId) TO [HRViewers];

-- DYNAMIC DATA MASKING:
ALTER TABLE dbo.Customers
  ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');
-- Non-privileged users see: aXXX@XXXX.com instead of actual email
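The default email mask shown above (first character kept, the rest replaced, a generic domain appended) can be approximated in a few lines. The real masking happens inside the SQL engine; this invented helper only mimics the output format:

```python
# Approximation of SQL Dynamic Data Masking's default email() mask
# (illustrative only — the engine applies this, not application code).
def mask_email(email: str) -> str:
    local, _, _domain = email.partition("@")
    return f"{local[0]}XXX@XXXX.com"   # keep first character, mask the rest

print(mask_email("alice@contoso.com"))   # aXXX@XXXX.com
```

A key property of masking (unlike column-level security) is that the column remains queryable; only the rendered value is obscured for non-privileged users.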

How does Microsoft Purview integrate with Fabric for governance?

Purview Feature           Fabric Integration
Data Catalog              Auto-scans Lakehouses, Warehouses, Datasets, Reports — discovers schemas and metadata
Data Lineage              End-to-end lineage from source → pipeline → table → report
Sensitivity Labels        Labels on Fabric items flow downstream to reports and exports
Information Protection    Prevents export of highly confidential data
Audit Logs                All Fabric workspace and item activity logged

Tip: Data lineage in Purview is especially powerful for Fabric — you can trace a single Power BI KPI back through every transformation step to its source system. This is an architect-level differentiator.


What is Fabric capacity management and how do you handle throttling?

All Fabric workloads consume Capacity Units (CUs) from the assigned capacity.

Throttling tiers (in order of impact):

  1. Background operations throttled first (Spark jobs, pipelines)
  2. Interactive operations throttled second (SQL queries, report loads)
  3. Users see delays/errors if throttling is severe

Management approach:

  1. Monitor: Fabric Capacity Metrics App (Power BI app) — CU consumption per workload, workspace, item
  2. Identify top consumers: find which notebooks, pipelines, or reports consume the most CUs
  3. Workspace isolation: assign heavy workloads (batch pipelines) to a separate capacity from interactive reporting
  4. Schedule off-peak: run heavy Spark jobs during off-peak hours
  5. Right-size: F64 for most enterprise scenarios. Scale up if Capacity Metrics shows consistent throttling.
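The throttling order described above — interactive operations protected, background operations delayed first when demand exceeds capacity — can be sketched as a simple admission rule. All names and CU figures below are invented for illustration:

```python
# Sketch of the throttling priority described above (illustrative only):
# when CU demand exceeds capacity, background work is delayed before
# interactive work.
def schedule(ops, capacity_cu):
    """ops: [(name, kind, cu)]; interactive ops claim capacity first."""
    run, delayed, used = [], [], 0
    # stable sort: interactive before background, original order preserved
    for name, kind, cu in sorted(ops, key=lambda o: o[1] != "interactive"):
        if used + cu <= capacity_cu:
            used += cu
            run.append(name)
        else:
            delayed.append(name)
    return run, delayed

ops = [("spark_job", "background", 40), ("report", "interactive", 20),
       ("pipeline", "background", 30), ("sql_query", "interactive", 10)]
run, delayed = schedule(ops, capacity_cu=64)
print(run, delayed)   # ['report', 'sql_query', 'pipeline'] ['spark_job']
```

The real capacity engine uses smoothing and bursting rather than hard admission, but the priority ordering — background first to suffer — is the behaviour to remember.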

6. Performance Optimisation

What are the key performance best practices for Fabric Lakehouses?

Delta table optimisation:

# 1. Compact small files (run after incremental loads)
spark.sql("OPTIMIZE tablename")

# 2. Z-Order on high-cardinality filter columns
spark.sql("OPTIMIZE tablename ZORDER BY (CustomerId, OrderDate)")

# 3. Partition large tables by low-cardinality column (e.g., date)
df.write.format("delta") \
  .partitionBy("Year", "Month") \
  .saveAsTable("partitioned_table")

# 4. Vacuum old file versions (keep 7 days for time travel)
spark.sql("VACUUM tablename RETAIN 168 HOURS")

# 5. Enable V-Order for Power BI Direct Lake performance (enabled by default
#    in Fabric; config key may vary by runtime version)
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")

Notebook performance:

  • Use .select() early to reduce columns processed
  • Filter data as early as possible (predicate pushdown)
  • Cache frequently reused DataFrames: df.cache()
  • Use broadcast() for small lookup tables in joins
  • Avoid collect() on large DataFrames (brings all data to driver)
  • Write Delta with mergeSchema only when needed — adds overhead

What causes Direct Lake fallback to DirectQuery?

Direct Lake falls back to DirectQuery (slower) when:

  1. Capacity guardrails exceeded — row counts or file counts above the SKU's Direct Lake limits
  2. Too many small files — hundreds of tiny Parquet files slow framing and inflate the file count toward the guardrail
  3. Schema changes — recent DDL changes may invalidate framed data until reframing completes
  4. Large Delta history — many un-vacuumed file versions inflate the file count

Note: missing V-Order does not force fallback, but non-V-Order files noticeably degrade Direct Lake read performance.

Fix: Run OPTIMIZE + VACUUM on Gold tables. Ensure V-Order is enabled at write time. Verify the query mode with Performance Analyzer in Power BI Desktop.


7. Scenario-Based Questions

Scenario: Design a modern data platform using Microsoft Fabric for a retail organisation with 50 stores.

Architecture:

  1. Ingestion:

    • POS transaction data → Fabric Pipeline Copy Activity → Bronze Lakehouse (daily batch)
    • Inventory DB (Azure SQL) → Fabric Mirroring → Bronze Lakehouse (near real-time)
    • Web analytics events → Eventstream → KQL Database (real-time)
  2. Medallion layers:

    • Bronze Lakehouse: raw data as-is
    • Silver Lakehouse: cleansed, deduped, joined (Spark notebooks)
    • Gold Lakehouse: star schema — FactSales, DimStore, DimProduct, DimDate
  3. Reporting:

    • Power BI Direct Lake Semantic Model on Gold Lakehouse
    • Store manager dashboards (RLS: each manager sees only their store)
    • Executive KPI reports (no RLS)
  4. Real-time operations:

    • Real-Time Dashboard from KQL Database — live sales by store
    • Data Activator: alert when stock drops below reorder threshold
  5. Data science:

    • Fabric Notebooks: demand forecasting model (Python + Spark)
    • Predictions written back to Gold Lakehouse → visible in Power BI
  6. Governance:

    • Purview sensitivity labels on customer PII tables
    • RLS in Gold Lakehouse (store managers see only their store)
    • Data lineage tracking in Purview
  7. Capacity: F64 — separate workspaces for Bronze/Silver/Gold layers


Scenario: How do you migrate from Azure Synapse Analytics to Microsoft Fabric?

  1. Assessment: inventory all Synapse assets — pipelines, Spark notebooks, SQL scripts, dedicated pool tables. Use Fabric migration assessment tool.
  2. OneLake setup: create Fabric workspace, create Lakehouses matching Synapse storage structure.
  3. Data migration:
    • ADLS Gen2 data → OneLake Shortcut (zero-copy, immediate access — fastest approach)
    • Or: Fabric Pipeline copy from ADLS to OneLake Tables
  4. Pipeline migration: Synapse pipelines → Fabric Pipelines (JSON format compatible). Update linked services to native Fabric connections.
  5. Notebook migration: copy Synapse Spark notebooks into Fabric. Update storage paths from abfss://container@storage.../ to OneLake paths. Most PySpark code runs unchanged.
  6. SQL migration: Synapse Dedicated Pool T-SQL → Fabric Data Warehouse T-SQL. Validate stored procedures, views, and custom functions.
  7. Power BI migration: update Semantic Models to use Direct Lake mode pointing to Fabric Lakehouse/Warehouse instead of Synapse connection.
  8. Parallel run: validate data output matches between Synapse and Fabric for 2–4 weeks before decommissioning.

Scenario: A Gold layer table is slow to query in Power BI Direct Lake. How do you diagnose and fix it?

  1. Check Direct Lake vs DirectQuery fallback: Power BI Desktop → Performance Analyzer → verify queries show "Direct Lake" mode, not "DirectQuery." Fallback typically indicates a capacity guardrail was exceeded or an unsupported object (such as a SQL view) is in the query path.

  2. V-Order check: ensure V-Order was applied when writing the Gold table. Re-write with V-Order enabled if missing.

  3. OPTIMIZE the table: run OPTIMIZE gold_FactSales ZORDER BY (CustomerId, OrderDate) to compact small files and improve data skipping.

  4. Check small file count: DESCRIBE DETAIL gold_FactSales — if numFiles is in the hundreds, run OPTIMIZE to merge them.

  5. DAX measure performance: use Performance Analyzer in Power BI Desktop to separate "DAX query" time from "Direct Lake" time. Slow DAX = model issue, not storage issue.

  6. Add aggregation table: for very large facts (billions of rows), create a pre-aggregated summary table in Gold. Point the Semantic Model to the aggregation first.


Scenario: How do you implement data sharing between two business units without duplicating data?

Requirement: Finance needs Sales Gold tables for combined reporting. Data stays in Sales workspace. Zero data copy.

Solution: OneLake Shortcuts

  1. In Finance Workspace → Finance Lakehouse → New Shortcut
  2. Source: Another workspace in OneLake → Sales Gold Lakehouse → Tables/FactSales
  3. FactSales appears in Finance Lakehouse as a shortcut (virtual — no copy)
  4. Finance Spark notebooks, SQL Endpoint, and Power BI Direct Lake all read via the shortcut
  5. Apply RLS to Sales Gold Lakehouse SQL Endpoint — Finance users see only shared records
  6. Finance Power BI Semantic Model spans both Finance tables and Sales shortcuts — single model, zero data duplication

Tip: This is the canonical Fabric cross-workspace data sharing pattern. Shortcuts eliminate inter-team ETL pipelines entirely.


Scenario: How do you implement near real-time analytics on Azure SQL operational data?

Solution: Fabric Mirroring

  1. In Fabric workspace → New → Mirrored database → Azure SQL Database
  2. Connect to Azure SQL (provide connection string and credentials)
  3. Select tables to mirror: Orders, Customers, Products
  4. Fabric automatically starts CDC-based replication
  5. Within minutes, tables appear in OneLake as Delta tables
  6. Ongoing: changes in Azure SQL replicate to OneLake with < 1 minute latency
  7. Build Silver transformations in Spark notebooks reading from mirrored tables
  8. Power BI Direct Lake Semantic Model reads Gold tables for near real-time dashboards

No ETL pipeline to build or maintain. Mirroring handles all change capture automatically.


8. Cheat Sheet — Quick Reference

Fabric Workload Selection Guide

Need to...                                    → Use
Ingest data from 90+ sources                  → Data Pipeline (Copy Activity)
Low-code ETL with 150+ connectors             → Dataflow Gen2
Big data transformation at scale              → Spark Notebook
Query structured data with T-SQL              → Data Warehouse
Store mixed-format data (raw + Delta)         → Lakehouse
Near real-time DB replication into OneLake    → Fabric Mirroring
Ingest streaming events                       → Eventstream
Query time-series at extreme speed            → KQL Database (Eventhouse)
Visualise live streaming data                 → Real-Time Dashboard
Alert when data condition is met              → Data Activator (Reflex)
Build Power BI reports                        → Semantic Model + Reports
Share data without copying                    → OneLake Shortcut

OneLake Access Paths

ADLS Gen2 DFS endpoint:
abfss://{workspaceName}@onelake.dfs.fabric.microsoft.com/{itemName}.Lakehouse/Tables/{tableName}

OneLake REST endpoint:
https://onelake.dfs.fabric.microsoft.com/{workspaceName}/{itemName}.Lakehouse/Tables/{tableName}

In Fabric Notebook (relative path within same Lakehouse):
spark.read.format("delta").load("Tables/tableName")
spark.read.format("delta").load("Files/raw/filename.csv")

Medallion Architecture Quick Reference

Layer     Storage        Quality      Users
Bronze    Lakehouse      Raw          Data Engineers only
Silver    Lakehouse      Cleansed     Data Engineers + Data Scientists
Gold      Lakehouse      Business     Analysts + Power BI + Business Users

Transformation:
Bronze → Silver: Spark Notebooks (clean, dedupe, join)
Silver → Gold:   Spark Notebooks or Dataflow Gen2 (aggregate, model as star schema)
Gold → Reports:  Power BI Direct Lake Semantic Model

Scheduling:
Fabric Pipeline orchestrates Notebook runs
Bronze refresh → trigger Silver → trigger Gold → trigger Semantic Model refresh
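The trigger chain above is a linear dependency: each stage runs only if its upstream stage succeeded. A toy orchestrator (illustrative; in Fabric a Data Pipeline invokes each notebook and the semantic model refresh):

```python
# Toy orchestrator for the Bronze → Silver → Gold → Semantic Model chain.
def run_chain(steps):
    """Run steps in order; stop the chain if any step fails."""
    completed = []
    for name, fn in steps:
        if not fn():
            return completed, name   # downstream steps never trigger
        completed.append(name)
    return completed, None

steps = [
    ("bronze_refresh",         lambda: True),
    ("silver_notebook",        lambda: True),
    ("gold_notebook",          lambda: True),
    ("semantic_model_refresh", lambda: True),
]
completed, failed = run_chain(steps)
print(completed, failed)
```

Failing fast like this is the point of orchestrating with a pipeline: a broken Silver load never pollutes Gold or triggers a misleading report refresh.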

Delta Table Maintenance Commands

-- Compact small files
OPTIMIZE tablename

-- Compact + Z-Order on filter columns
OPTIMIZE tablename ZORDER BY (col1, col2)

-- Remove old Delta versions (keep 7 days)
VACUUM tablename RETAIN 168 HOURS

-- View Delta history
DESCRIBE HISTORY tablename

-- Time travel query
SELECT * FROM tablename VERSION AS OF 5
SELECT * FROM tablename TIMESTAMP AS OF '2025-01-15'

-- Table details (file count, size, format)
DESCRIBE DETAIL tablename

Direct Lake vs Import vs DirectQuery

                  Import    DirectQuery    Direct Lake
Performance       ★★★★★    ★★☆☆☆          ★★★★★
Data freshness    ★★☆☆☆    ★★★★★          ★★★★★
Dataset size      Limited   Unlimited      Unlimited
Requires Fabric   No        No             YES
Storage           VertiPaq  None           VertiPaq (framed)
Needs refresh?    Yes       No             No (auto-current)

Top 10 Tips

  1. OneLake = "OneDrive for data" — one tenant-wide lake, all workloads share one copy of data. This single sentence is the most important concept in Fabric.
  2. Direct Lake = Import speed + DirectQuery freshness — the breakthrough that eliminates the historic Power BI trade-off. Know how framing works and what causes DirectQuery fallback.
  3. Lakehouse SQL Endpoint is read-only T-SQL — a Lakehouse auto-generates a SQL endpoint. A Warehouse is full read-write T-SQL. Many candidates confuse these.
  4. V-Order for Power BI, Z-Order for Spark queries — two different optimisations. V-Order is Fabric-specific and write-time. Z-Order is Delta standard and applied post-write.
  5. Medallion = Bronze → Silver → Gold — the universal data engineering pattern in Fabric. Know what happens at each layer and how separate Lakehouses enforce access control per layer.
  6. Shortcuts = zero-copy data sharing — the answer to any "how do you share data without duplicating it" question. Finance → shortcut → Sales Gold is the canonical example.
  7. Mirroring > ETL pipelines for operational DBs — near real-time CDC-based replication into OneLake. No pipeline to build or maintain. For Azure SQL, Cosmos DB, Snowflake.
  8. Fabric is Synapse's successor — know the mapping: Synapse Spark → Fabric Data Engineering, Synapse SQL → Fabric Warehouse, Synapse Pipelines → Fabric Pipelines.
  9. Capacity = shared pool — all workloads in assigned workspaces share CUs. Heavy Spark jobs and interactive Power BI reports compete for the same capacity. Workspace isolation is the mitigation.
  10. Purview for lineage — end-to-end lineage from source DB → pipeline → Silver table → Gold table → Power BI report. This is the governance story that closes enterprise architecture discussions.

