Microsoft Fabric & OneLake — Complete Guide
OneLake · Lakehouse · Data Warehouse · Direct Lake · Medallion Architecture · Real-Time Intelligence · Scenarios · Cheat Sheet
Table of Contents
- Core Concepts — Basics
- OneLake — The Foundation
- Fabric Workloads
- Data Engineering — Notebooks, Pipelines & Medallion
- Security & Governance
- Performance Optimisation
- Scenario-Based Questions
- Cheat Sheet — Quick Reference
1. Core Concepts — Basics
What is Microsoft Fabric and what problem does it solve?
Microsoft Fabric is an all-in-one analytics platform that unifies data engineering, data integration, data warehousing, real-time analytics, data science, and business intelligence into a single SaaS platform.
Before Fabric (fragmented):
- Azure Synapse → warehousing
- Azure Data Factory → pipelines
- Azure Databricks → Spark
- Power BI Premium → BI
- Separate storage, separate governance, separate billing, complex integration
With Fabric (unified):
- All workloads in one platform
- Single copy of data in OneLake
- Unified security and governance via Microsoft Purview
- Single capacity billing model (F SKUs)
Key insight: Fabric is NOT a new product — it is a platform that unifies existing Microsoft data tools under one roof with OneLake as the shared storage layer. Everything reads and writes from the same data lake.
What are the core workloads in Microsoft Fabric?
| Workload | Description | Evolution From |
|---|---|---|
| Data Factory | Data integration and orchestration — pipelines and dataflows | Azure Data Factory |
| Data Engineering | Big data processing — Lakehouses, Spark notebooks | Azure Synapse Spark |
| Data Warehouse | SQL-based analytics warehouse — T-SQL querying | Azure Synapse SQL Dedicated Pool |
| Real-Time Intelligence | Streaming ingestion and KQL analytics | Azure Data Explorer |
| Data Science | ML experiments, models, Python/R notebooks | Azure Synapse ML |
| Power BI | Business intelligence (Direct Lake mode) | Power BI Premium |
| Data Activator | No-code data alerting and automation | New in Fabric |
What is a Fabric Workspace and how does it relate to capacity?
Workspace: collaboration container holding all Fabric items (Lakehouses, Warehouses, Notebooks, Pipelines, Reports, Semantic Models).
Capacity: the compute allocation powering Fabric workloads. Every workspace must be assigned to a capacity.
| SKU | Type | Use Case |
|---|---|---|
| Trial | Free 60 days | Evaluation |
| F2–F2048 | Fabric SKUs | Pay-as-you-go or reserved |
| P SKUs | Power BI Premium | Existing Premium → Fabric compatible |
Tip: Capacity is shared across ALL workloads in assigned workspaces. A Spark job, pipeline run, and Power BI report refresh all consume from the same capacity pool. Size based on peak concurrent workload requirements.
How does Microsoft Fabric relate to Azure Synapse Analytics?
Fabric is the strategic successor to Azure Synapse Analytics:
| Synapse Feature | Fabric Equivalent |
|---|---|
| Synapse Spark | Fabric Data Engineering (Lakehouses + Notebooks) |
| Synapse SQL Dedicated Pool | Fabric Data Warehouse |
| Synapse Pipelines | Fabric Data Factory Pipelines |
| Synapse Link | Fabric Mirroring |
| Azure Data Explorer | Fabric Real-Time Intelligence (Eventhouse/KQL DB) |
Key additions in Fabric vs Synapse: OneLake (shared storage), Direct Lake Power BI mode, Data Activator, unified Purview governance, simpler capacity-based pricing.
Warning: Azure Synapse is not deprecated — existing workloads continue. But all new analytics projects should be built on Fabric. Microsoft's investment is entirely focused on Fabric going forward.
What is the Fabric item hierarchy?
Microsoft Fabric Tenant
└── Capacity (F64, F128, etc.)
└── Workspace (collaboration container)
├── Lakehouse (data lake + Delta tables)
├── Data Warehouse (T-SQL warehouse)
├── Notebook (Spark development)
├── Spark Job Definition (packaged Spark job)
├── Data Pipeline (orchestration)
├── Dataflow Gen2 (Power Query ETL)
├── Semantic Model (Power BI dataset)
├── Report (Power BI report)
├── KQL Database / Eventhouse (real-time)
├── Eventstream (streaming ingestion)
└── Reflex (Data Activator alerts)
2. OneLake — The Foundation
What is OneLake and what makes it architecturally significant?
OneLake is a single, unified, tenant-wide data lake that is the storage foundation of Microsoft Fabric.
Key architectural principles:
- One copy of data: all Fabric workloads read from the same data — no copies between workloads
- Automatic with Fabric: every Fabric tenant has exactly one OneLake, automatically provisioned
- Built on ADLS Gen2: fully compatible with any tool supporting Azure Data Lake Storage
- Open formats: data stored in Delta Parquet — readable by any engine (Spark, Trino, DuckDB) without conversion
- Hierarchical: Tenant → Workspace → Lakehouse/Warehouse → Tables/Files
Analogy: OneLake is the "OneDrive for data" — just as OneDrive gives every user one place for files, OneLake gives every organisation one place for data.
What is the OneLake folder structure?
OneLake (tenant-level)
└── Workspace: ContosoAnalytics
├── Bronze_Lakehouse.Lakehouse/
│ ├── Tables/ ← Delta tables (queryable via SQL)
│ │ ├── raw_orders/ ← Delta Parquet files
│ │ └── raw_customers/
│ └── Files/ ← Raw files (CSV, JSON, Parquet, images)
│ ├── raw/
│ └── processed/
├── Gold_Lakehouse.Lakehouse/
│ └── Tables/
│ ├── FactSales/
│ └── DimProduct/
└── Finance_Warehouse.Warehouse/
└── Tables/ ← Warehouse tables (Delta format)
Access paths:
ADLS DFS: abfss://ContosoAnalytics@onelake.dfs.fabric.microsoft.com/Bronze_Lakehouse.Lakehouse/Tables/raw_orders
OneLake: https://onelake.dfs.fabric.microsoft.com/ContosoAnalytics/Bronze_Lakehouse.Lakehouse/Tables/raw_orders
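Both paths resolve to the same Delta files. A minimal notebook sketch, assuming the illustrative ContosoAnalytics workspace above (not a real endpoint):
# Read the same table via the full ABFSS path and via the relative path
# (relative paths resolve against the notebook's default Lakehouse):
abfss_path = ("abfss://ContosoAnalytics@onelake.dfs.fabric.microsoft.com/"
              "Bronze_Lakehouse.Lakehouse/Tables/raw_orders")
df_full = spark.read.format("delta").load(abfss_path)
df_rel = spark.read.format("delta").load("Tables/raw_orders")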
What are OneLake Shortcuts and how do they enable a "single copy" architecture?
Shortcuts are virtual pointers to data stored outside the current Lakehouse — appearing as folders but not copying data.
Supported shortcut sources:
- Another workspace in OneLake
- Azure Data Lake Storage Gen2
- Amazon S3
- Google Cloud Storage
- Dataverse (via Link to Microsoft Fabric)
Shortcut scenarios:
1. Cross-workspace data sharing (zero copy):
Finance Lakehouse → shortcut → Sales Gold Lakehouse/Tables/FactSales
Finance queries FactSales without any data duplication
Sales updates are immediately visible to Finance
2. External ADLS access:
Lakehouse → shortcut → abfss://existing-lake.dfs.core.windows.net/raw
Query existing ADLS data from Fabric without migration
3. Multi-cloud data access:
Lakehouse → shortcut → AWS S3 bucket
Spark notebooks query S3 data alongside OneLake data
Tip: Shortcuts eliminate the need to copy data between teams, workspaces, and clouds. This is how Fabric implements "single copy of truth" across a large organisation — no ETL duplication pipelines.
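Because a shortcut surfaces as an ordinary table, consuming it needs no special API. A brief sketch assuming the Finance → Sales example above (table and column names illustrative):
# FactSales here is a shortcut into the Sales Gold Lakehouse; reading it is
# identical to reading a local Delta table, and no copy is made:
fact_sales = spark.read.format("delta").load("Tables/FactSales")
fact_sales.groupBy("Region").sum("Amount").show()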
What is the Delta Lake format and why does Fabric use it as the standard?
Delta Lake is an open-source storage format adding ACID transactions, schema enforcement, time travel, and efficient upserts to Parquet files.
| Feature | Benefit |
|---|---|
| ACID transactions | Concurrent reads/writes without corruption |
| Time travel | Query data as of previous point in time |
| Schema evolution | Add columns without breaking existing readers |
| Efficient upserts | MERGE INTO for CDC patterns |
| Open format | Any Spark, Trino, DuckDB engine can read |
| Direct Lake | Power BI reads Delta files at in-memory speed |
-- Time travel:
SELECT * FROM silver_orders VERSION AS OF 5
SELECT * FROM silver_orders TIMESTAMP AS OF '2025-01-01'
-- Upsert (MERGE):
MERGE INTO silver_orders AS target
USING new_orders AS source
ON target.OrderId = source.OrderId
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
-- View Delta history:
DESCRIBE HISTORY silver_orders
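The same operations are available from PySpark. A short sketch mirroring the SQL above, using the same silver_orders table:
# Time travel via DataFrame reader options:
df_v5 = spark.read.format("delta").option("versionAsOf", 5).load("Tables/silver_orders")
df_ts = spark.read.format("delta").option("timestampAsOf", "2025-01-01").load("Tables/silver_orders")
# Delta history as a DataFrame:
from delta.tables import DeltaTable
DeltaTable.forName(spark, "silver_orders").history().show()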
3. Fabric Workloads
What is a Fabric Lakehouse and how does it differ from a Data Warehouse?
| Aspect | Lakehouse | Data Warehouse |
|---|---|---|
| Schema | On-read (Files) + on-write (Tables) | Strict schema-on-write (DDL required) |
| Data types | Structured + semi-structured + unstructured | Structured only |
| Primary interface | Spark (Python, Scala, SQL) | T-SQL (INSERT, UPDATE, DELETE, MERGE) |
| Secondary interface | Auto SQL Endpoint (read-only T-SQL) | — |
| Stored procedures | No | Yes |
| Best for | Data engineering, data science, raw→curated | Business analysts, BI, structured semantic layer |
| Storage | OneLake (Delta Parquet) | OneLake (Delta Parquet) |
Tip: Both store data in OneLake as Delta Parquet. The Lakehouse auto-generates a SQL Endpoint — so a Lakehouse IS also queryable via T-SQL (read-only). A Warehouse is T-SQL read-write. This overlap confuses many candidates.
What is Direct Lake mode in Power BI and why is it a breakthrough?
Direct Lake is a new Power BI connectivity mode that reads Delta Parquet files directly from OneLake into Power BI's VertiPaq in-memory engine.
Three Power BI connection modes:
Import:
→ Data COPIED into VertiPaq memory
→ Fastest queries | Stale (last refresh only) | Size limited
DirectQuery:
→ Live queries to source on EVERY visual interaction
→ Always current | Slow (source DB performance) | No size limit
Direct Lake (NEW — Fabric only):
→ Reads Delta Parquet from OneLake into VertiPaq ON DEMAND
→ Import-speed performance (in-memory columnar)
→ Always current (reads latest Delta files automatically)
→ No dataset size limit
→ "Framing": loads column segments on first access
→ Falls back to DirectQuery if framing not complete
Tip: Direct Lake eliminates the Import vs DirectQuery trade-off — Import-level performance + DirectQuery-level freshness. This is the biggest Power BI innovation in years and a guaranteed question for any Fabric role.
What is Real-Time Intelligence in Fabric?
| Component | Description |
|---|---|
| Eventstream | No-code streaming ingestion — capture, transform, route events from Event Hubs, IoT Hub, Kafka, CDC |
| KQL Database (Eventhouse) | High-performance analytical DB for time-series data. KQL queries. Billions of rows in seconds. |
| KQL Queryset | Saved KQL queries for reuse |
| Real-Time Dashboard | Auto-refreshing dashboards (sub-second) powered by KQL |
| Data Activator (Reflex) | No-code alerting — trigger email/Teams/flows when data conditions are met |
Real-time pipeline:
IoT sensors / Event Hubs / Kafka / CDC
→ Eventstream (ingest + route + transform)
→ KQL Database / Eventhouse (store + query)
→ Real-Time Dashboard (visualise live)
→ Data Activator (alert on anomalies)
What is Fabric Mirroring and what databases does it support?
Fabric Mirroring replicates operational database data into OneLake as Delta Parquet tables in near real-time — using Change Data Capture (CDC), not traditional ETL.
Supported sources:
- Azure SQL Database
- Azure Cosmos DB
- Snowflake
- Azure SQL Managed Instance
- Azure Database for PostgreSQL
- MongoDB Atlas
How it works:
Source DB (Azure SQL)
→ CDC captures inserts/updates/deletes
→ Fabric replicates changes into OneLake
→ Latency: typically < 1 minute
→ Stored as Delta tables (queryable via Spark, SQL Endpoint, Power BI)
→ Appears as a shortcut in the workspace
Benefits:
→ No ETL pipeline to build or maintain
→ Near real-time analytics on operational data
→ Single queryable copy in OneLake — no per-consumer extracts to build or maintain
→ Combines with medallion architecture for Silver/Gold transformation
Tip: Mirroring eliminates complex ADF pipelines for operational DB analytics. It is the recommended pattern for near real-time analytics on Azure SQL, Cosmos DB, and Snowflake.
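Once mirrored, the tables are plain Delta. A hedged sketch of reading one from a notebook, assuming a shortcut named Orders has been surfaced in the Lakehouse (name illustrative):
# Mirrored operational table, queried like any other Delta table:
orders = spark.read.format("delta").load("Tables/Orders")
orders.filter("OrderStatus = 'OPEN'").count()   # near real-time operational view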
What is the Semantic Model in Fabric?
A Semantic Model (previously called a Power BI Dataset) is the reusable analytical layer between data storage and reports. In Fabric:
- Created directly on a Lakehouse or Warehouse
- Uses Direct Lake mode by default (no data import needed)
- Contains measures (DAX), hierarchies, relationships, and security roles
- Multiple reports can share a single semantic model
- Can be certified/endorsed for enterprise use
- Accessible via XMLA endpoint for external tools (Excel, Tabular Editor, SSMS)
4. Data Engineering — Notebooks, Pipelines & Medallion
What is the Medallion Architecture and how is it implemented in Fabric?
The Medallion Architecture organises data into three progressive quality layers stored as Delta tables in Lakehouses.
Bronze (Raw) layer:
→ Raw data as-is from source systems
→ No transformations, append-only or full snapshot
→ Preserves source fidelity for reprocessing
→ Lakehouse: Tables/bronze/ or Files/bronze/
Silver (Cleansed) layer:
→ Cleaned, validated, deduplicated
→ Standardised data types and naming conventions
→ Joined/enriched from multiple Bronze sources
→ Optimised: partitioned, Z-ordered
→ Lakehouse: Tables/silver/
Gold (Business) layer:
→ Business-level aggregations and metrics
→ Conformed dimensions and fact tables (star schema)
→ Ready for Power BI Direct Lake consumption
→ Domain-specific: Sales Gold, Finance Gold, HR Gold
→ Lakehouse: Tables/gold/
Recommended Fabric implementation:
Bronze Workspace → Bronze Lakehouse
Silver Workspace → Silver Lakehouse (shortcuts to Bronze)
Gold Workspace → Gold Lakehouse (shortcuts to Silver)
Reporting → Semantic Models → Power BI Reports
Separate workspaces = separate security roles per layer
Tip: Separate Lakehouses per medallion layer is the recommended enterprise pattern — different security roles, workspace assignments, and access control per layer.
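As a concrete illustration of a Silver → Gold step, here is a minimal aggregation sketch (table and column names are hypothetical, not from the source):
from pyspark.sql.functions import sum as sum_, countDistinct
# Read cleansed Silver data and aggregate to a business-level Gold table:
silver = spark.read.format("delta").load("Tables/silver_orders")
gold = (silver
    .groupBy("Region", "OrderDate")
    .agg(sum_("Amount").alias("TotalSales"),
         countDistinct("OrderId").alias("OrderCount")))
gold.write.format("delta").mode("overwrite").saveAsTable("gold_SalesDaily")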
What are Fabric Notebooks and what do they support?
Fabric Notebooks are interactive Spark development environments supporting:
- PySpark (Python + Spark) — primary language
- Spark SQL — SQL against Delta tables
- Scala — JVM Spark code
- R (SparkR) — R with Spark
- %%sql magic — inline SQL in Python notebooks
# Read from Bronze Lakehouse
df = spark.read.format("delta").load("Tables/raw_orders")
# Transform (PySpark)
from pyspark.sql.functions import col, to_date, when, trim
df_silver = (df
.filter(col("OrderStatus") != "TEST")
.filter(col("OrderId").isNotNull())
.withColumn("OrderDate", to_date(col("OrderDateStr"), "yyyy-MM-dd"))
.withColumn("CustomerName", trim(col("CustomerName")))
.withColumn("Region",
when(col("Country").isin("UK", "IE", "DE", "FR"), "EMEA")
.when(col("Country").isin("US", "CA"), "AMER")
.otherwise("APAC"))
.dropDuplicates(["OrderId"])
)
# Write to Silver Lakehouse as Delta (upsert pattern)
from delta.tables import DeltaTable
if spark.catalog.tableExists("silver_orders"):
DeltaTable.forName(spark, "silver_orders").alias("target") \
.merge(df_silver.alias("source"), "target.OrderId = source.OrderId") \
.whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
else:
df_silver.write.format("delta").saveAsTable("silver_orders")
What are Fabric Data Pipelines?
Fabric Data Pipelines use the same visual designer and JSON format as Azure Data Factory. Key activities:
| Activity | Purpose |
|---|---|
| Copy Data | Move data between 90+ sources/sinks |
| Notebook | Execute a Fabric Notebook (Spark) |
| Spark Job Definition | Run a packaged PySpark/Scala job |
| Dataflow Gen2 | Run a Power Query Online dataflow |
| Stored Procedure | Execute SQL in Warehouse |
| If Condition / Switch | Conditional control flow |
| ForEach / Until | Iterative control flow |
| Get Metadata / Validation | File/table checks |
Tip: ADF pipelines are compatible with Fabric pipelines (same JSON format). The key difference: Fabric pipelines have native OneLake connectivity — no linked service configuration needed for internal Fabric data access.
What is a Dataflow Gen2 and when should you use it vs Spark?
Use Dataflow Gen2 when:
- Business analyst / maker-level ETL (no Spark knowledge needed)
- 150+ Power Query connectors not available in pipelines
- Simple transformations: filter, merge, pivot, custom M columns
- Small-to-medium datasets
Use Spark Notebooks when:
- Large-scale data (billions of rows)
- Complex transformations requiring ML or custom logic
- Full developer control and unit testability
- Integration with Python ecosystem (pandas, scikit-learn, PyTorch)
Dataflow Gen2 outputs to:
- Fabric Lakehouse (Tables or Files)
- Fabric Data Warehouse
- Azure SQL, other external databases
- OneLake directly
What is V-Order and Z-Order optimisation in Fabric Delta tables?
V-Order: a Microsoft-specific write-time optimisation for Delta Parquet files. Sorts and compresses data within Parquet row groups for maximum Power BI Direct Lake read performance. Enabled by default in Fabric.
Z-Order: a data skipping optimisation co-locating related data in the same Parquet files. When filtering on a Z-ordered column, Spark skips entire files that don't match — dramatically reducing I/O.
# Z-Order after data ingestion (run periodically):
spark.sql("""
OPTIMIZE gold_FactSales
ZORDER BY (CustomerId, OrderDate)
""")
# Queries filtering by CustomerId or OrderDate now read far fewer files
# V-Order (enabled by default in Fabric):
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.microsoft.delta.optimizeWrite.vorder.enabled", "true")
# Compact small files:
spark.sql("OPTIMIZE silver_orders") # merges small files into ~256MB chunks
# Vacuum old Delta versions (default: 7 days retention):
spark.sql("VACUUM silver_orders RETAIN 168 HOURS")
Key distinction: V-Order = Fabric/Power BI optimisation (Direct Lake performance). Z-Order = Spark/Delta general query optimisation. Know both for architect-level.
5. Security & Governance
What are the workspace roles in Fabric?
| Role | Capabilities |
|---|---|
| Admin | Full control — manage workspace, members, all items. Can delete workspace. |
| Member | Create, edit, delete items. Share items. Cannot manage workspace settings. |
| Contributor | Create and edit items. Cannot delete others' items or share. |
| Viewer | Read-only — view reports, query SQL Endpoint. Cannot edit. |
Warning: Workspace roles are coarse-grained — they apply to ALL items. For fine-grained data access (which rows/columns), use Row-Level Security and Column-Level Security within Lakehouses and Warehouses.
What is OneSecurity and how does it implement Row and Column-Level Security?
-- ROW-LEVEL SECURITY (Warehouse / Lakehouse SQL Endpoint):
-- 1. Create security predicate function
CREATE FUNCTION dbo.fn_region_filter(@Region AS sysname)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN SELECT 1 AS result
WHERE @Region = USER_NAME()
OR IS_MEMBER('GlobalDataAdmin') = 1;
-- 2. Apply security policy
CREATE SECURITY POLICY RegionFilter
ADD FILTER PREDICATE dbo.fn_region_filter(Region)
ON dbo.SalesOrders WITH (STATE = ON);
-- Users only see rows where Region matches their username
-- COLUMN-LEVEL SECURITY:
GRANT SELECT ON dbo.Employees(EmployeeId, Name, Department) TO [HRViewers];
DENY SELECT ON dbo.Employees(Salary, PersonalEmail, NationalId) TO [HRViewers];
-- DYNAMIC DATA MASKING:
ALTER TABLE dbo.Customers
ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');
-- Non-privileged users see: aXXX@XXXX.com instead of actual email
How does Microsoft Purview integrate with Fabric for governance?
| Purview Feature | Fabric Integration |
|---|---|
| Data Catalog | Auto-scans Lakehouses, Warehouses, Datasets, Reports — discovers schemas and metadata |
| Data Lineage | End-to-end lineage from source → pipeline → table → report |
| Sensitivity Labels | Labels on Fabric items flow downstream to reports and exports |
| Information Protection | Prevent export of highly confidential data |
| Audit Logs | All Fabric workspace and item activity logged |
Tip: Data lineage in Purview is especially powerful for Fabric — you can trace a single Power BI KPI back through every transformation step to its source system. This is an architect-level differentiator.
What is Fabric capacity management and how do you handle throttling?
All Fabric workloads consume Capacity Units (CUs) from the assigned capacity.
Throttling tiers (in order of impact):
- Background operations throttled first (Spark jobs, pipelines)
- Interactive operations throttled second (SQL queries, report loads)
- Users see delays/errors if throttling is severe
Management approach:
- Monitor: Fabric Capacity Metrics App (Power BI app) — CU consumption per workload, workspace, item
- Identify top consumers: find which notebooks, pipelines, or reports consume the most CUs
- Workspace isolation: assign heavy workloads (batch pipelines) to a separate capacity from interactive reporting
- Schedule off-peak: run heavy Spark jobs during off-peak hours
- Right-size: F64 for most enterprise scenarios. Scale up if Capacity Metrics shows consistent throttling.
6. Performance Optimisation
What are the key performance best practices for Fabric Lakehouses?
Delta table optimisation:
# 1. Compact small files (run after incremental loads)
spark.sql("OPTIMIZE tablename")
# 2. Z-Order on high-cardinality filter columns
spark.sql("OPTIMIZE tablename ZORDER BY (CustomerId, OrderDate)")
# 3. Partition large tables by low-cardinality column (e.g., date)
df.write.format("delta") \
.partitionBy("Year", "Month") \
.saveAsTable("partitioned_table")
# 4. Vacuum old file versions (keep 7 days for time travel)
spark.sql("VACUUM tablename RETAIN 168 HOURS")
# 5. Enable V-Order for Power BI Direct Lake performance
spark.conf.set("spark.microsoft.delta.optimizeWrite.vorder.enabled", "true")
Notebook performance:
- Use .select() early to reduce columns processed
- Filter data as early as possible (predicate pushdown)
- Cache frequently reused DataFrames: df.cache()
- Use broadcast() for small lookup tables in joins
- Avoid collect() on large DataFrames (brings all data to the driver)
- Write Delta with mergeSchema only when needed — adds overhead
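A short sketch pulling these practices together (table and column names are hypothetical):
from pyspark.sql.functions import broadcast, col
orders = (spark.read.format("delta").load("Tables/silver_orders")
    .select("OrderId", "CustomerId", "Amount")   # project early
    .filter(col("Amount") > 0))                  # filter early (predicate pushdown)
customers = spark.read.format("delta").load("Tables/dim_customer")
enriched = orders.join(broadcast(customers), "CustomerId")  # broadcast the small side
enriched.cache()                                 # reuse without recomputation
enriched.groupBy("CustomerId").sum("Amount").show()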
What causes Direct Lake fallback to DirectQuery?
Direct Lake falls back to DirectQuery (slower) when:
- Capacity guardrails exceeded — row counts, file counts, or model size beyond the SKU limit
- SQL Endpoint views — a model built on views always falls back to DirectQuery
- SQL Endpoint security — RLS/OLS defined at the endpoint forces fallback for affected users
Separately, a model can stay in Direct Lake mode but run slowly when:
- V-Order not applied — non-V-Order Delta files degrade Direct Lake read performance
- Too many small files — hundreds of tiny Parquet files slow framing
- Large Delta history — too many old, un-vacuumed versions slow metadata reads
Fix: Run OPTIMIZE + VACUUM on Gold tables. Ensure V-Order is enabled at write time. Monitor Direct Lake framing status in the Fabric workspace.
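A quick health-check sketch before and after maintenance, reusing the gold_FactSales table from the example above:
# Inspect file count and size (DESCRIBE DETAIL returns a single row), then compact:
detail = spark.sql("DESCRIBE DETAIL gold_FactSales").collect()[0]
print(detail["numFiles"], detail["sizeInBytes"])
spark.sql("OPTIMIZE gold_FactSales ZORDER BY (CustomerId, OrderDate)")
spark.sql("VACUUM gold_FactSales RETAIN 168 HOURS")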
7. Scenario-Based Questions
Scenario: Design a modern data platform using Microsoft Fabric for a retail organisation with 50 stores.
Architecture:
- Ingestion:
  - POS transaction data → Fabric Pipeline Copy Activity → Bronze Lakehouse (daily batch)
  - Inventory DB (Azure SQL) → Fabric Mirroring → Bronze Lakehouse (near real-time)
  - Web analytics events → Eventstream → KQL Database (real-time)
- Medallion layers:
  - Bronze Lakehouse: raw data as-is
  - Silver Lakehouse: cleansed, deduped, joined (Spark notebooks)
  - Gold Lakehouse: star schema — FactSales, DimStore, DimProduct, DimDate
- Reporting:
  - Power BI Direct Lake Semantic Model on Gold Lakehouse
  - Store manager dashboards (RLS: each manager sees only their store)
  - Executive KPI reports (no RLS)
- Real-time operations:
  - Real-Time Dashboard from KQL Database — live sales by store
  - Data Activator: alert when stock drops below reorder threshold
- Data science:
  - Fabric Notebooks: demand forecasting model (Python + Spark)
  - Predictions written back to Gold Lakehouse → visible in Power BI
- Governance:
  - Purview sensitivity labels on customer PII tables
  - RLS in Gold Lakehouse (store managers see only their store)
  - Data lineage tracking in Purview
- Capacity: F64 — separate workspaces for Bronze/Silver/Gold layers
Scenario: How do you migrate from Azure Synapse Analytics to Microsoft Fabric?
- Assessment: inventory all Synapse assets — pipelines, Spark notebooks, SQL scripts, dedicated pool tables. Use the Fabric migration assessment tool.
- OneLake setup: create Fabric workspace, create Lakehouses matching Synapse storage structure.
- Data migration:
- ADLS Gen2 data → OneLake Shortcut (zero-copy, immediate access — fastest approach)
- Or: Fabric Pipeline copy from ADLS to OneLake Tables
- Pipeline migration: Synapse pipelines → Fabric Pipelines (JSON format compatible). Update linked services to native Fabric connections.
- Notebook migration: copy Synapse Spark notebooks into Fabric. Update storage paths from abfss://container@storage.../ to OneLake paths. Most PySpark code runs unchanged.
- SQL migration: Synapse Dedicated Pool T-SQL → Fabric Data Warehouse T-SQL. Validate stored procedures, views, and custom functions.
- Power BI migration: update Semantic Models to use Direct Lake mode pointing to Fabric Lakehouse/Warehouse instead of Synapse connection.
- Parallel run: validate data output matches between Synapse and Fabric for 2–4 weeks before decommissioning.
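For the notebook migration step, the path change is usually the only code edit. An illustrative before/after (storage account, workspace, and item names are hypothetical):
# Synapse-era ADLS Gen2 path:
old_path = "abfss://raw@contosostorage.dfs.core.windows.net/orders"
# Fabric OneLake equivalent after migration (assuming the data lands as Delta):
new_path = ("abfss://ContosoAnalytics@onelake.dfs.fabric.microsoft.com/"
            "Bronze_Lakehouse.Lakehouse/Files/raw/orders")
df = spark.read.format("delta").load(new_path)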
Scenario: A Gold layer table is slow to query in Power BI Direct Lake. How do you diagnose and fix it?
- Check Direct Lake vs DirectQuery fallback: Power BI Desktop → Performance Analyzer → verify queries show "Direct Lake" mode, not "DirectQuery." Fallback usually means a capacity guardrail was exceeded or the model uses an unsupported feature.
- V-Order check: ensure V-Order was applied when writing the Gold table. Re-write with V-Order enabled if missing.
- OPTIMIZE the table: run OPTIMIZE gold_FactSales ZORDER BY (CustomerId, OrderDate) to compact small files and improve data skipping.
- Check small file count: DESCRIBE DETAIL gold_FactSales — if numFiles is in the hundreds, run OPTIMIZE to merge them.
- DAX measure performance: use Performance Analyzer in Power BI Desktop to separate "DAX query" time from "Direct Lake" time. Slow DAX = model issue, not storage issue.
- Add aggregation table: for very large facts (billions of rows), create a pre-aggregated summary table in Gold. Point the Semantic Model to the aggregation first.
Scenario: How do you implement data sharing between two business units without duplicating data?
Requirement: Finance needs Sales Gold tables for combined reporting. Data stays in Sales workspace. Zero data copy.
Solution: OneLake Shortcuts
- In Finance Workspace → Finance Lakehouse → New Shortcut
- Source: Another workspace in OneLake → Sales Gold Lakehouse → Tables/FactSales
- FactSales appears in Finance Lakehouse as a shortcut (virtual — no copy)
- Finance Spark notebooks, SQL Endpoint, and Power BI Direct Lake all read via the shortcut
- Apply RLS to Sales Gold Lakehouse SQL Endpoint — Finance users see only shared records
- Finance Power BI Semantic Model spans both Finance tables and Sales shortcuts — single model, zero data duplication
Tip: This is the canonical Fabric cross-workspace data sharing pattern. Shortcuts eliminate inter-team ETL pipelines entirely.
Scenario: How do you implement near real-time analytics on Azure SQL operational data?
Solution: Fabric Mirroring
- In Fabric workspace → New → Mirrored database → Azure SQL Database
- Connect to Azure SQL (provide connection string and credentials)
- Select tables to mirror: Orders, Customers, Products
- Fabric automatically starts CDC-based replication
- Within minutes, tables appear in OneLake as Delta tables
- Ongoing: changes in Azure SQL replicate to OneLake with < 1 minute latency
- Build Silver transformations in Spark notebooks reading from mirrored tables
- Power BI Direct Lake Semantic Model reads Gold tables for near real-time dashboards
No ETL pipeline to build or maintain. Mirroring handles all change capture automatically.
8. Cheat Sheet — Quick Reference
Fabric Workload Selection Guide
Need to... → Use
Ingest data from 90+ sources → Data Pipeline (Copy Activity)
Low-code ETL with 150+ connectors → Dataflow Gen2
Big data transformation at scale → Spark Notebook
Query structured data with T-SQL → Data Warehouse
Store mixed-format data (raw + Delta) → Lakehouse
Near real-time DB replication into OneLake → Fabric Mirroring
Ingest streaming events → Eventstream
Query time-series at extreme speed → KQL Database (Eventhouse)
Visualise live streaming data → Real-Time Dashboard
Alert when data condition is met → Data Activator (Reflex)
Build Power BI reports → Semantic Model + Reports
Share data without copying → OneLake Shortcut
OneLake Access Paths
ADLS Gen2 DFS endpoint:
abfss://{workspaceName}@onelake.dfs.fabric.microsoft.com/{itemName}.Lakehouse/Tables/{tableName}
OneLake REST endpoint:
https://onelake.dfs.fabric.microsoft.com/{workspaceName}/{itemName}.Lakehouse/Tables/{tableName}
In Fabric Notebook (relative path within same Lakehouse):
spark.read.format("delta").load("Tables/tableName")
spark.read.format("delta").load("Files/raw/filename.csv")
Medallion Architecture Quick Reference
| Layer | Storage | Quality | Users |
|---|---|---|---|
| Bronze | Lakehouse | Raw | Data Engineers only |
| Silver | Lakehouse | Cleansed | Data Engineers + Data Scientists |
| Gold | Lakehouse | Business | Analysts + Power BI + Business Users |
Transformation:
Bronze → Silver: Spark Notebooks (clean, dedupe, join)
Silver → Gold: Spark Notebooks or Dataflow Gen2 (aggregate, model as star schema)
Gold → Reports: Power BI Direct Lake Semantic Model
Scheduling:
Fabric Pipeline orchestrates Notebook runs
Bronze refresh → trigger Silver → trigger Gold → trigger Semantic Model refresh
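Where a pipeline is not available, notebooks can also be chained directly. A hedged sketch using mssparkutils in a Fabric notebook (notebook names are hypothetical):
from notebookutils import mssparkutils
# Run each medallion layer in sequence; the second argument is a timeout in seconds:
mssparkutils.notebook.run("Load_Bronze", 1800)
mssparkutils.notebook.run("Build_Silver", 1800)
mssparkutils.notebook.run("Build_Gold", 1800)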
Delta Table Maintenance Commands
-- Compact small files
OPTIMIZE tablename
-- Compact + Z-Order on filter columns
OPTIMIZE tablename ZORDER BY (col1, col2)
-- Remove old Delta versions (keep 7 days)
VACUUM tablename RETAIN 168 HOURS
-- View Delta history
DESCRIBE HISTORY tablename
-- Time travel query
SELECT * FROM tablename VERSION AS OF 5
SELECT * FROM tablename TIMESTAMP AS OF '2025-01-15'
-- Table details (file count, size, format)
DESCRIBE DETAIL tablename
Direct Lake vs Import vs DirectQuery
| | Import | DirectQuery | Direct Lake |
|---|---|---|---|
| Performance | ★★★★★ | ★★☆☆☆ | ★★★★★ |
| Data freshness | ★★☆☆☆ | ★★★★★ | ★★★★★ |
| Dataset size | Limited | Unlimited | Unlimited |
| Requires Fabric | No | No | Yes |
| Storage | VertiPaq | None | VertiPaq (framed) |
| Needs refresh? | Yes | No | No (auto-current) |
Top 10 Tips
- OneLake = "OneDrive for data" — one tenant-wide lake, all workloads share one copy of data. This single sentence is the most important concept in Fabric.
- Direct Lake = Import speed + DirectQuery freshness — the breakthrough that eliminates the historic Power BI trade-off. Know how framing works and what causes DirectQuery fallback.
- Lakehouse SQL Endpoint is read-only T-SQL — a Lakehouse auto-generates a SQL endpoint. A Warehouse is full read-write T-SQL. Many candidates confuse these.
- V-Order for Power BI, Z-Order for Spark queries — two different optimisations. V-Order is Fabric-specific and write-time. Z-Order is Delta standard and applied post-write.
- Medallion = Bronze → Silver → Gold — the universal data engineering pattern in Fabric. Know what happens at each layer and how separate Lakehouses enforce access control per layer.
- Shortcuts = zero-copy data sharing — the answer to any "how do you share data without duplicating it" question. Finance → shortcut → Sales Gold is the canonical example.
- Mirroring > ETL pipelines for operational DBs — near real-time CDC-based replication into OneLake. No pipeline to build or maintain. For Azure SQL, Cosmos DB, Snowflake.
- Fabric is Synapse's successor — know the mapping: Synapse Spark → Fabric Data Engineering, Synapse SQL → Fabric Warehouse, Synapse Pipelines → Fabric Pipelines.
- Capacity = shared pool — all workloads in assigned workspaces share CUs. Heavy Spark jobs and interactive Power BI reports compete for the same capacity. Workspace isolation is the mitigation.
- Purview for lineage — end-to-end lineage from source DB → pipeline → Silver table → Gold table → Power BI report. This is the governance story that closes enterprise architecture discussions.