Wednesday, June 3, 2026

Building an End-to-End Retail Analytics Platform on Microsoft Fabric

Building an End-to-End Retail Analytics Platform on Microsoft Fabric

A complete, reproducible walkthrough — from raw CSVs to a secured, refreshable Power BI report — using the Lakehouse medallion pattern, Direct Lake, dynamic row-level security, and a Dev → Prod deployment pipeline.

Companion repo: https://github.com/usreekanthreddy/contoso-retail-fabric


Why this project

Most "Fabric demos" stop at loading a CSV and drawing a bar chart. Real analytics platforms need more: a layered data architecture you can trust, a dimensional model that performs, security that respects who is looking, and a release process that gets changes from development to production safely.

This post documents a complete build for a fictional retailer, Contoso Retail 360. Everything here is reproducible — the repo ships seed data, the notebooks, the SQL validation, and the DAX so you can stand the whole thing up in your own Fabric workspace in well under an hour.

The finished platform answers questions like:

  • How are sales trending month over month, and how does that compare to last year?
  • Which regions and stores are pulling their weight, and which channel (Store vs. Online) is winning?
  • What's our gross margin, and which products drive it?
  • …while making sure a regional manager only ever sees their own region's numbers.

The architecture at a glance

The solution follows the medallion architecture — three progressively refined layers inside a single Fabric Lakehouse (lh_retail):

Source CSVs (Files/landing/)
        │   file-based ingest
        ▼
┌─────────────────┐
│  BRONZE         │  raw, as-landed, + ingest metadata
│  *_raw tables   │  (_ingest_ts, _source_file)
└─────────────────┘
        │   clean, cast, de-dupe, hash PII
        ▼
┌─────────────────┐
│  SILVER          │  conformed, typed business entities
│  sales/products  │  customers (FullName hashed)
│  stores/customers│
└─────────────────┘
        │   build star schema, surrogate keys
        ▼
┌─────────────────┐
│  GOLD           │  DimDate / DimProduct / DimStore /
│  star schema    │  DimCustomer / DimChannel + FactSales
│  (+ UserRegion) │  Delta change-data-feed ON
└─────────────────┘
        │   Direct Lake
        ▼
   Semantic model  →  Power BI report (with RLS)

Why these choices:

  • Bronze / Silver / Gold keeps raw data immutable and auditable, isolates cleaning logic, and gives BI a stable, well-shaped surface to build on. If a transform is wrong, you fix Silver/Gold and re-run — Bronze is untouched.
  • Direct Lake lets Power BI read the Gold Delta tables directly from OneLake — no import refresh, no DirectQuery latency. You get import-like speed with live data.
  • Delta change data feed on the Gold tables makes Direct Lake "framing" (the refresh that picks up new data) efficient — only changed files are reprocessed.

The data model: a classic star schema

Gold is a textbook star: one fact table surrounded by conformed dimensions.

FactSales (one row per order line)

ColumnNotes
SalesKeysurrogate key
DateKey, ProductKey, StoreKey, CustomerKey, ChannelKeyforeign keys to dimensions
OrderNumberdegenerate dimension (for distinct order counts)
Quantity, UnitPrice, DiscountAmountadditive measures
SalesAmountQuantity × UnitPrice − DiscountAmount
UnitCostcarried for margin math

Dimensions

  • DimDate — a generated calendar with Day/Month/Quarter/Year, MonthName, DayOfWeek, IsWeekend. Marked as the date table to unlock time-intelligence.
  • DimProduct — product, category, subcategory, brand, list price, unit cost.
  • DimStore — store, city, region, country, format, open date.
  • DimCustomer — segment, city, loyalty tier, signup date (name hashed for privacy).
  • DimChannel — Store / Online.

All relationships are single-direction, one-to-many from each dimension to FactSales.


Step 1 — Create the Lakehouse and schemas

In the Fabric workspace, create a Lakehouse named lh_retail and three schemas: bronze, silver, gold. The notebooks create the schemas defensively too:

for s in ["bronze", "silver", "gold"]:
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {s}")

Step 2 — Land the source data

The kit ships four seed CSVs (generated deterministically with seed=42, so everyone gets identical numbers):

  • sales.csv — 3,001 order lines (Dec 2025 – May 2026)
  • products.csv — 30 products
  • stores.csv — 12 stores across West / Central / East
  • customers.csv — 500 customers

Upload them to the Lakehouse under Files/landing/. In production this folder would be the drop zone for a Data Factory Copy pipeline (pl_ingest_bronze); the notebook plays that role for the quick-test path.


Step 3 — Bronze: file-based ingestion

Bronze reads the landing CSVs and writes them straight to Delta tables, stamping each row with ingest metadata so you always know where a record came from and when it arrived:

def read_csv(name):
    return (spark.read.option("header", True).option("inferSchema", True)
            .csv(f"Files/landing/{name}.csv")
            .withColumn("_ingest_ts", F.current_timestamp())
            .withColumn("_source_file", F.lit(f"Files/landing/{name}.csv")))

W = lambda df, tbl: (df.write.format("delta")
                       .mode("overwrite")
                       .option("overwriteSchema", "true")
                       .saveAsTable(tbl))

W(read_csv("products"),  "bronze.products_raw")
W(read_csv("stores"),    "bronze.stores_raw")
W(read_csv("customers"), "bronze.customers_raw")
W(read_csv("sales"),     "bronze.sales_raw")

mode("overwrite") means every notebook is safe to re-run from scratch — no duplicate-on-rerun surprises.


Step 4 — Silver: clean, type, de-duplicate, and protect PII

Silver is where raw becomes trustworthy. Strings become dates and decimals, duplicates are dropped on natural keys, and — importantly — customer names are hashed and the raw value dropped before anything sensitive reaches the analytics layer:

# Sales: type-cast, default discounts, de-dupe on the natural key
sl = (spark.read.table("bronze.sales_raw")
      .withColumn("OrderDate", F.to_date("OrderDate"))
      .withColumn("Quantity", F.col("Quantity").cast("int"))
      .withColumn("UnitPrice", F.col("UnitPrice").cast("decimal(18,2)"))
      .withColumn("DiscountAmount",
                  F.coalesce(F.col("DiscountAmount").cast("decimal(18,2)"), F.lit(0)))
      .dropDuplicates(["OrderNumber", "ProductId"]))
W(sl.drop("_ingest_ts", "_source_file"), "silver.sales")

# Customers: SHA-256 the name, then drop the raw column
W(spark.read.table("bronze.customers_raw")
    .withColumn("SignupDate", F.to_date("SignupDate"))
    .withColumn("FullNameHash", F.sha2(F.col("FullName"), 256))
    .drop("FullName", "_ingest_ts", "_source_file")
    .dropDuplicates(["CustomerId"]), "silver.customers")

The hash preserves the ability to count and join on a customer without ever exposing their name downstream — a small change that makes the platform far more defensible from a privacy standpoint.


Step 5 — Gold: build the star schema

Gold assembles dimensions with surrogate keys and joins everything into FactSales.

# Surrogate-key helper
def skey(df, name):
    return df.withColumn(name, F.row_number().over(
        Window.orderBy(F.monotonically_increasing_id())))

W(skey(spark.read.table("silver.products"),  "ProductKey"),  "gold.DimProduct")
W(skey(spark.read.table("silver.stores"),    "StoreKey"),    "gold.DimStore")
W(skey(spark.read.table("silver.customers"), "CustomerKey"), "gold.DimCustomer")
W(spark.createDataFrame([(1, "Store"), (2, "Online")],
                        ["ChannelKey", "Channel"]), "gold.DimChannel")

The date dimension is generated from the actual min/max order dates in the data:

b = (spark.read.table("silver.sales")
        .agg(F.min("OrderDate").alias("mn"), F.max("OrderDate").alias("mx"))
        .collect()[0])
dd = spark.sql(
    f"SELECT explode(sequence(to_date('{b.mn}'), to_date('{b.mx}'), "
    f"interval 1 day)) AS Date")
dd = (dd.withColumn("DateKey", F.date_format("Date", "yyyyMMdd").cast("int"))
        .withColumn("Year", F.year("Date")).withColumn("Quarter", F.quarter("Date"))
        .withColumn("Month", F.month("Date"))
        .withColumn("MonthName", F.date_format("Date", "MMMM"))
        .withColumn("DayOfWeek", F.date_format("Date", "EEEE"))
        .withColumn("IsWeekend", F.dayofweek("Date").isin([1, 7])))
W(dd, "gold.DimDate")

And the fact table resolves every natural key to a surrogate key, computes SalesAmount, and selects the final shape:

fact = (s.join(dp, "ProductId", "left").join(ds, "StoreId", "left")
         .join(dc, "CustomerId", "left").join(dch, "Channel", "left")
         .withColumn("DateKey", F.date_format("OrderDate", "yyyyMMdd").cast("int"))
         .withColumn("SalesAmount",
                     (F.col("Quantity") * F.col("UnitPrice")
                      - F.col("DiscountAmount")).cast("decimal(18,2)"))
         .select(F.monotonically_increasing_id().alias("SalesKey"),
                 "DateKey", "ProductKey", "StoreKey", "CustomerKey", "ChannelKey",
                 "OrderNumber", "Quantity", "UnitPrice", "DiscountAmount",
                 "SalesAmount", "UnitCost"))
W(fact, "gold.FactSales")

A security table for dynamic RLS

Gold also includes a small mapping table that drives dynamic row-level security — which user can see which region:

user_region = spark.createDataFrame([
    ("sreekanth@vkn2k.onmicrosoft.com",       "West"),
    ("east.manager@vkn2k.onmicrosoft.com",    "East"),
    ("central.manager@vkn2k.onmicrosoft.com", "Central"),
], ["Email", "Region"])
W(user_region, "gold.UserRegion")

Turn on change data feed

Finally, enable Delta change data feed on every Gold table so Direct Lake refreshes stay incremental and fast:

for t in ["DimProduct","DimStore","DimCustomer","DimChannel",
          "DimDate","UserRegion","FactSales"]:
    spark.sql(f"ALTER TABLE gold.{t} "
              f"SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

Step 6 — Validate the Gold layer

Trust, but verify. The notebook prints a built-in validation block, and sql/validate_gold.sql runs the same checks against the SQL analytics endpoint. With the shipped seed data you should get exactly these numbers:

MetricExpected value
Orphan fact rows0
Distinct orders1,224
Total units9,097
Total sales1,673,866.79
Total cost1,028,681.42
Gross margin645,185.37
Gross margin %38.5%
Date range2025-12-01 → 2026-05-31
Raw FullName present in GoldFalse

Sales by region and channel (rounded):

RegionOnlineStore
Central224,875317,640
East226,002356,374
West226,808322,167

The orphan check is the most important line: it confirms every fact row resolved to a real product, store, customer, and channel. Zero orphans means the joins are clean.

orphans = f.filter(
    F.col("ProductKey").isNull() | F.col("StoreKey").isNull()
    | F.col("CustomerKey").isNull() | F.col("ChannelKey").isNull()).count()
assert orphans == 0

Step 7 — Build the Direct Lake semantic model

From the Lakehouse SQL analytics endpoint, choose New semantic model and select only the gold.* tables. Then wire up the relationships — all single-direction, one-to-many from dimension to fact:

  • DimDate[DateKey]FactSales[DateKey]
  • DimProduct[ProductKey]FactSales[ProductKey]
  • DimStore[StoreKey]FactSales[StoreKey]
  • DimCustomer[CustomerKey]FactSales[CustomerKey]
  • DimChannel[ChannelKey]FactSales[ChannelKey]

Mark DimDate as the date table (on DimDate[Date]) — this is what makes time-intelligence work.


Step 8 — Add the DAX measures

The model ships with a full measure library. The base measures:

Total Sales   = SUM ( FactSales[SalesAmount] )
Total Units   = SUM ( FactSales[Quantity] )
Total Cost    = SUMX ( FactSales, FactSales[Quantity] * FactSales[UnitCost] )
Gross Margin  = [Total Sales] - [Total Cost]
Gross Margin %= DIVIDE ( [Total Sales] - [Total Cost], [Total Sales] )
Order Count   = DISTINCTCOUNT ( FactSales[OrderNumber] )
Avg Basket Size = DIVIDE ( [Total Sales], [Order Count] )

A note on a bug worth remembering: define Gross Margin % as a self-contained expression — DIVIDE([Total Sales] - [Total Cost], [Total Sales]) — rather than referencing [Gross Margin]. During the build, having % depend on the plain Gross Margin measure (and accidentally overwriting one with the other) produced a circular-dependency error. Making the percentage stand on its own avoids the trap.

Time-intelligence (these are why DimDate had to be marked):

Sales LY   = CALCULATE ( [Total Sales], SAMEPERIODLASTYEAR ( DimDate[Date] ) )
Sales YoY  = [Total Sales] - [Sales LY]
Sales YoY %= DIVIDE ( [Sales YoY], [Sales LY] )
Sales MTD  = CALCULATE ( [Total Sales], DATESMTD ( DimDate[Date] ) )
Sales YTD  = CALCULATE ( [Total Sales], DATESYTD ( DimDate[Date] ) )
Sales MoM %= DIVIDE ( [Total Sales] - [Sales PM], [Sales PM] )

Mix and ranking:

Active Customers = DISTINCTCOUNT ( FactSales[CustomerKey] )
Online Sales %   = DIVIDE (
        CALCULATE ( [Total Sales], DimChannel[Channel] = "Online" ),
        [Total Sales] )
Product Sales Rank = RANKX (
        ALLSELECTED ( DimProduct[ProductName] ), [Total Sales], , DESC )

Step 9 — Dynamic row-level security

This is where the UserRegion table earns its place. Instead of hard-coding a role per region, a single role filters DimStore to the regions mapped to the logged-in user:

[Region] IN
    SELECTCOLUMNS (
        FILTER ( UserRegion, UserRegion[Email] = USERPRINCIPALNAME () ),
        "r", UserRegion[Region]
    )

Add one manager's email + region to gold.UserRegion, and they automatically see only their slice — no model changes, no new roles. The West manager sees West; the East manager sees East; an admin row could map to all three.


Step 10 — Build the report

Two pages cover the core audience:

  • Executive Overview — KPI cards (Total Sales, Gross Margin %, Order Count, Active Customers), a sales-over-time line with YoY, a region/channel matrix, and a top-products bar using the rank measure.
  • Store Performance — store-level table with margin, format breakdown, and a regional map, laid out so the most-scanned numbers sit top-left.

Because the model is Direct Lake, the report reflects new data as soon as the Gold tables are framed — no scheduled import refresh required.


Step 11 — Orchestrate, monitor, and deploy

A platform isn't done when the report renders. Three operational pieces close the loop:

Orchestration. A Data Factory pipeline chains ingest → Silver → Gold notebooks and runs on a schedule, so the medallion stays current without manual runs.

Failure notification. The pipeline includes an on-failure activity that alerts when a run breaks — you find out from a notification, not from a stakeholder asking why the numbers look stale.

Dev → Prod deployment pipeline. Fabric deployment pipelines move the Lakehouse, semantic model, and report from a Development stage to Production with a controlled promotion, instead of editing production artifacts by hand.

The semantic model is also certified, signalling to report authors across the tenant that it's the trusted, governed source for retail sales.


What's in the repo

contoso-retail-fabric/
├─ README.md
├─ .gitignore
├─ data/landing/         products.csv · stores.csv · customers.csv · sales.csv
├─ notebooks/            nb_01_bronze_ingest · nb_02_silver_transform · nb_03_gold_model
├─ sql/                  validate_gold.sql
└─ semantic-model/       measures.dax

Clone it, follow the run order, and compare your validation numbers against the table above. If they match, your pipeline ran correctly end to end.


Lessons learned

A few things that are easy to get wrong and worth doing right the first time:

  • Hash PII in Silver, not later. Once a name reaches Gold or the semantic model, it's much harder to claw back. Do it at the conforming step.
  • Keep Gross Margin % self-contained to avoid circular-dependency errors between dependent measures.
  • Mark the date table before writing time-intelligenceSAMEPERIODLASTYEAR and friends silently misbehave otherwise.
  • Enable change data feed on Gold so Direct Lake framing stays incremental as data grows.
  • Make RLS data-driven. A UserRegion mapping table beats a proliferation of hard-coded roles — onboarding a new manager becomes a single row insert.
  • Validate with assertions, not eyeballs. The orphan check and the fixed expected totals turn "looks right" into "is right."

Wrapping up

What started as raw CSVs is now a governed analytics platform: layered and auditable, modeled for performance, secured per-user, monitored, and promotable from dev to prod. The same pattern scales well beyond a demo — swap the seed CSVs for a real Copy pipeline, point Bronze at your sources, and the rest of the medallion carries the weight.

The full, reproducible kit lives at https://github.com/usreekanthreddy/contoso-retail-fabric

Tuesday, June 2, 2026

Building a Daily Chatbot Transcript Archiver with Power Automate

Building a Daily Chatbot Transcript Archiver with Power Automate

How to automatically export Copilot conversation transcripts from Dataverse to SharePoint as a daily CSV.

Overview

This Power Automate cloud flow ("Daily [Assistant] Transcripts to SharePoint") runs once a day, pulls the previous day's chatbot conversation transcripts out of Dataverse, flattens each message into a CSV row, and drops a single dated .csv file into a SharePoint document library. It's a clean pattern for archiving conversational AI logs for analytics or compliance.

Flow Architecture

Recurrence (daily 6:00 AM)
   v
Initialize variable  -> csvContent (CSV header)
   v
List rows            -> Dataverse: conversationtranscripts (yesterday only)
   v
Apply to each        -> over each transcript record
   |- Compose         -> extract bot name
   \- Condition (If)   -> only "[Assistant Name]" transcripts
        |- Filter array -> keep only "message" activities
        |- Select       -> map each message to a CSV row
        \- Append to string variable -> add rows to csvContent
   v
Create file          -> SharePoint: ChatTranscripts_<yesterday>.csv

Step 1 - Trigger: Recurrence

A scheduled trigger fires the flow once per day.

SettingValue
Interval1
FrequencyDay
Time zonePacific Standard Time
At these hours6
At these minutes0

Preview: Runs at 6:00 every day.

{
  "type": "Recurrence",
  "recurrence": {
    "interval": 1,
    "frequency": "Day",
    "timeZone": "Pacific Standard Time",
    "schedule": { "hours": [ "6" ], "minutes": [ 0 ] }
  }
}

Step 2 - Initialize variable (CSV header)

A string variable csvContent is seeded with the CSV header row. decodeUriComponent('%0A') injects a real newline character so the next rows start on a fresh line.

  • Name: csvContent
  • Type: String
  • Value (fx):
@concat('AgentName,Session,ConversationStartTime,Role,MessageText,Timestamp', decodeUriComponent('%0A'))

Step 3 - List rows (Dataverse)

Retrieves transcript records from the Dataverse conversationtranscripts table, scoped to yesterday using an OData $filter.

  • Table name: conversationtranscripts
  • Filter rows (fx):
conversationstarttime ge @{startOfDay(addDays(utcNow(),-1))} and conversationstarttime lt @{startOfDay(utcNow())}

startOfDay(addDays(utcNow(),-1)) = start of yesterday; startOfDay(utcNow()) = start of today. Together they form a clean 24-hour window for the prior day.

{
  "type": "OpenApiConnection",
  "inputs": {
    "parameters": {
      "entityName": "conversationtranscripts",
      "$filter": "conversationstarttime ge @{startOfDay(addDays(utcNow(),-1))} and conversationstarttime lt @{startOfDay(utcNow())}"
    },
    "host": {
      "apiId": "/providers/Microsoft.PowerApps/apis/shared_commondataserviceforapps",
      "connection": "[REDACTED - connection reference]",
      "operationId": "ListRecords"
    }
  }
}

Step 4 - Apply to each (per transcript)

Loops over every record returned by List rows: @outputs('List_rows')?['body/value'].

4a - Compose (extract bot name)

Pulls the friendly (formatted) name of the bot tied to the transcript:

@items('Apply_to_each')?['_bot_conversationtranscriptid_value@OData.Community.Display.V1.FormattedValue']

4b - Condition (If)

Only process transcripts that belong to the target assistant:

@equals(outputs('Compose'), '[Assistant Name]')

If true, the three actions below run. If false, nothing happens for that record.

Filter array - keep only chat messages

The transcript content is JSON; parse it, grab the activities array, and keep only items whose type is message (dropping system/event entries).

  • From (fx): @json(items('Apply_to_each')?['content'])?['activities']
  • Where (fx): @equals(item()?['type'], 'message')

Select - map each message to a CSV row

Builds one quoted, comma-separated CSV line per message. Note the careful quote-escaping (" -> ""), role mapping, timestamp conversion from Unix epoch seconds, and newline stripping inside message text.

  • From (fx): @body('Filter_array')
  • Map / select (fx):
@concat(
  '"', replace(outputs('Compose'), '"', '""'), '","',
  replace(string(items('Apply_to_each')?['conversationtranscriptid']), '"', '""'), '","',
  formatDateTime(items('Apply_to_each')?['conversationstarttime'], 'dd MMM yyyy HH:mm'), '","',
  if(equals(string(item()?['from']?['role']), '1'), 'User', 'Bot'), '","',
  replace(replace(replace(coalesce(item()?['text'], ''), '"', '""'), decodeUriComponent('%0D'), ''), decodeUriComponent('%0A'), ' '), '","',
  formatDateTime(addSeconds('1970-01-01T00:00:00Z', int(string(item()?['timestamp']))), 'dd MMM yyyy HH:mm:ss'), '"'
)

What the formula does, field by field:

  • AgentName - bot name from Compose, double-quotes escaped.
  • Session - conversationtranscriptid, double-quotes escaped.
  • ConversationStartTime - formatted dd MMM yyyy HH:mm.
  • Role - message from/role"1" -> User, otherwise Bot.
  • MessageText - coalesce(..., '') guards nulls; quotes escaped; carriage returns (%0D) removed and line feeds (%0A) replaced with a space so each message stays on one CSV line.
  • Timestamp - Unix epoch seconds converted to a date via addSeconds('1970-01-01T00:00:00Z', ...), formatted to the second.

Append to string variable - accumulate rows

Joins the row array with newlines and appends to csvContent; the if(empty(...)) guard avoids adding a stray blank line when a transcript has no messages.

  • Name: csvContent
  • Value (fx):
@if(empty(body('Select')), '', concat(join(body('Select'), decodeUriComponent('%0A')), decodeUriComponent('%0A')))

Step 5 - Create file (SharePoint)

Writes the fully assembled CSV to a SharePoint library with a date-stamped filename.

ParameterValue
Site Address[REDACTED - SharePoint site URL]
Folder Path[REDACTED - document library path]
File Name (fx)ChatTranscripts_@{formatDateTime(addDays(utcNow(),-1),'yyyy-MM-dd')}.csv
File Content (fx)@{variables('csvContent')}

Transfer mode is Chunked to support larger files.

{
  "type": "OpenApiConnection",
  "inputs": {
    "parameters": {
      "dataset": "[REDACTED - SharePoint site URL]",
      "folderPath": "[REDACTED - document library path]",
      "name": "ChatTranscripts_@{formatDateTime(addDays(utcNow(),-1),'yyyy-MM-dd')}.csv",
      "body": "@{variables('csvContent')}"
    },
    "host": {
      "apiId": "/providers/Microsoft.PowerApps/apis/shared_sharepointonline",
      "connection": "[REDACTED - connection reference]",
      "operationId": "CreateFile"
    }
  },
  "runtimeConfiguration": { "contentTransfer": { "transferMode": "Chunked" } }
}

Key Takeaways

  • Time-windowing with startOfDay + addDays gives a reliable "yesterday only" query without hardcoding dates.
  • Build CSV by hand with Select + concat when you need precise control over quoting and escaping; always escape " -> "" and strip %0D/%0A from free-text fields.
  • coalesce + if(empty(...)) guards prevent null errors and stray blank lines.
  • Unix-epoch timestamps convert cleanly via addSeconds('1970-01-01T00:00:00Z', int(...)).
  • Chunked transfer mode on Create file keeps large exports reliable.

Featured Post

Building an End-to-End Retail Analytics Platform on Microsoft Fabric

Building an End-to-End Retail Analytics Platform on Microsoft Fabric A complete, reproducible walkthrough — from raw CSVs to a secured, refr...

Popular posts