Sunday, May 17, 2026

Microsoft Azure Solutions Architect Expert (AZ-305) Complete Guide

Well-Architected Framework · Identity & Governance · Data Storage · Infrastructure · Business Continuity · Landing Zones · Scenarios · Cheat Sheet

Exam Overview & Architect Mindset
Design Identity, Governance & Monitoring (25–30%)
Design Data Storage Solutions (25–30%)
Design Infrastructure Solutions (30–35%)
Design Business Continuity Solutions (10–15%)
Azure Well-Architected Framework & Landing Zones
Scenario-Based Questions
Cheat Sheet — Quick Reference

1. Exam Overview & Architect Mindset

AZ-305 Exam at a Glance

The AZ-305 exam validates expertise in designing cloud and hybrid solutions — translating business requirements into Azure architectures aligned with the Well-Architected Framework and Cloud Adoption Framework. This is a design exam, not a configuration exam.

Skill Domain	Exam Weight
Design infrastructure solutions	30–35% ← LARGEST
Design identity, governance, and monitoring solutions	25–30%
Design data storage solutions	25–30%
Design business continuity solutions	10–15%

Prerequisite: the Azure Administrator Associate certification (AZ-104) is required to earn the Azure Solutions Architect Expert certification; passing AZ-305 alone is not enough.

Critical mindset shift: AZ-305 questions present a business scenario with stated requirements and ask you to recommend, evaluate, or compare Azure services. The key skill is understanding WHY one service fits a scenario better than another — not just knowing what each service does.

What is the Azure Well-Architected Framework?

The Azure Well-Architected Framework (WAF) is Microsoft's set of guiding principles for building high-quality cloud solutions. Every AZ-305 design decision should be justified using WAF pillars.

Five WAF Pillars:

1. Reliability:
→ Design for failure — assume components will fail
→ Use redundancy: Availability Zones, geo-replication, multi-region
→ Define and meet RTO (Recovery Time Objective) and RPO (Recovery Point Objective)
→ Health monitoring, automatic failover, circuit breakers
→ Example: deploy SQL Database with zone-redundant Business Critical tier
             + auto-failover group to paired region

2. Security:
→ Defence in depth: multiple layers of security
→ Principle of least privilege: minimum permissions needed
→ Zero Trust: verify explicitly, assume breach, limit blast radius
→ Encrypt data at rest and in transit; manage secrets in Key Vault
→ Example: Private Endpoints for PaaS, WAF on Application Gateway,
             Managed Identity instead of service accounts

3. Cost Optimisation:
→ Right-size resources — don't over-provision
→ Match consumption to business value (pay for what you use)
→ Reserved Instances for predictable workloads
→ Use PaaS over IaaS where possible (lower management overhead)
→ Azure Hybrid Benefit for existing licences

4. Operational Excellence:
→ Infrastructure as Code (Bicep/Terraform) — repeatable, version-controlled
→ CI/CD pipelines for deployment automation
→ Monitoring: Azure Monitor, Log Analytics, Application Insights
→ Runbooks and automated remediation
→ Change management: blue-green deployments, feature flags

5. Performance Efficiency:
→ Scale horizontally (add instances) not just vertically (bigger VM)
→ Cache frequently accessed data (Azure Cache for Redis)
→ CDN for static content delivery (Azure Front Door / CDN)
→ Choose the right service tier for the workload profile
→ Autoscaling: VMSS, App Service autoscale, AKS cluster autoscaler

What is the Cloud Adoption Framework (CAF)?

Cloud Adoption Framework (CAF):
→ Microsoft's guidance for adopting Azure at enterprise scale
→ Covers: strategy, planning, landing zones, governance, operations

CAF phases:
Strategy:    define business justification, expected outcomes
Plan:        assess current state, create migration backlog
Ready:       set up landing zones, management groups hierarchy
Adopt:       migrate or modernise workloads (lift-and-shift, re-architecture)
Govern:      establish guardrails — Azure Policy, cost management, security baseline
Manage:      operations baseline — monitoring, backup, disaster recovery
Innovate:    build cloud-native solutions, AI/ML, data analytics

Azure Landing Zone:
→ Preconfigured, governance-ready Azure environment for enterprise adoption
→ Includes: management group hierarchy, Azure Policy assignments, networking,
            identity, security baseline, monitoring
→ Conceptual architecture: Platform vs Application landing zones
→ Platform: connectivity hub, identity, management subscriptions (managed by central IT)
→ Application: individual workload subscriptions (managed by application teams)
→ Accelerator: Bicep/Terraform modules to deploy landing zone in hours

Management Group hierarchy (CAF-recommended):
Tenant Root Group
  ├── Platform (managed by central IT)
  │     ├── Management (Log Analytics, Automation)
  │     ├── Identity (AD DS, Entra Connect)
  │     └── Connectivity (hub VNet, VPN/ExpressRoute, Firewall)
  └── Landing Zones (workload teams)
        ├── Corp (connected to hub, private)
        └── Online (internet-facing, DMZ)

2. Design Identity, Governance & Monitoring (25–30%)

How do you design an identity architecture for a hybrid enterprise?

Hybrid identity decision tree:

Question 1: Do you have on-premises Active Directory?
No:  → Cloud-only Entra ID. Use Entra ID Joined devices. Done.
Yes: → Need Entra Connect for identity synchronisation

Question 2: Authentication method for synchronised identities?
Password Hash Sync (PHS): → RECOMMENDED for most orgs
  Pros: works if on-prem unavailable, leaked credential detection,
        supports Seamless SSO
  Cons: password hash in cloud (some orgs have policy against this)

Pass-Through Auth (PTA): → when policy prohibits cloud password hashes
  Pros: passwords never leave on-prem, real-time account state
  Cons: requires on-prem PTA agent availability, no leaked cred detection

Federation (AD FS): → only for advanced requirements (smart cards,
                       complex claims transformation)
  Pros: full control over authentication claims
  Cons: complex, expensive infrastructure, Microsoft recommends migrating away

Question 3: Device management?
Cloud-native devices:  → Entra ID Joined + Intune (MDM)
Existing domain:       → Hybrid Entra ID Joined (domain join + Entra registration)
BYOD:                  → Entra ID Registered + MAM (App Protection Policies)

Identity architecture for a 10,000-person hybrid enterprise:
→ Entra Connect with PHS + Seamless SSO
→ All admin roles in PIM (eligible assignments, not permanent)
→ Break-glass accounts: 2 cloud-only Global Admins, excluded from all CA
→ Conditional Access: MFA all users, block legacy auth, require compliant device
→ Identity Protection: sign-in risk ≥ Medium → MFA, user risk High → password reset
→ Named locations: corporate IP ranges for trusted sign-in signals
→ B2B: Entitlement Management for partner access, time-limited, self-service
→ Entra ID P2 required for PIM, Identity Protection risk policies, Access Reviews

How do you design a governance architecture at scale?

Governance architecture layers:

Layer 1 — Management Group hierarchy:
Root → Decommissioned MG | Platform MG | Landing Zones MG | Sandbox MG
→ Azure Policy applied at MG level inherits to all child subscriptions
→ RBAC applied at MG level inherits to all child subscriptions
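
The inheritance rule above can be sketched as a walk from a subscription up to the root. This is a minimal illustration only: the Scope class, the effective_policies helper, and the sample hierarchy are hypothetical names, not an Azure API.

```python
from dataclasses import dataclass, field


@dataclass
class Scope:
    """A management group or subscription (hypothetical model, not an Azure type)."""
    name: str
    parent: "Scope | None" = None
    policies: list[str] = field(default_factory=list)


def effective_policies(scope: Scope) -> list[str]:
    """Return every policy that applies at this scope, inherited ones first."""
    chain = []
    node = scope
    while node is not None:
        chain.append(node)
        node = node.parent
    result = []
    for node in reversed(chain):  # start at the root and work down
        result.extend(node.policies)
    return result


root = Scope("Root", policies=["Allowed locations: UK South, UK West"])
landing_zones = Scope("Landing Zones MG", parent=root, policies=["No public IP on VMs"])
sales_prod = Scope("Sales-Prod", parent=landing_zones)  # subscription with no own policies

print(effective_policies(sales_prod))
```

The subscription defines nothing itself yet ends up with both assignments, which is why guardrails are usually placed as high in the hierarchy as they can go.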

Layer 2 — Azure Policy:
Initiative: "CIS Azure Benchmark" or custom baseline
Key policies:
  Deny:  "No public IP on VMs" (enforces private-only VMs)
  Deny:  "Allowed locations: UK South, UK West" (data residency)
  Deny:  "Allowed VM SKUs" (prevent large SKUs in dev)
  Audit: "Storage accounts should use private endpoints"
  DINE:  "Deploy Azure Monitor Agent if missing" (auto-remediate)

Layer 3 — Azure Blueprints (or Bicep/Terraform for landing zones):
→ Combine: role assignments + policy assignments + ARM templates
→ Assign to subscriptions for consistent baseline deployment
→ Note: Azure Blueprints is being deprecated; Bicep or Terraform deployed through a pipeline (for example Azure DevOps) is preferred for new designs

Layer 4 — RBAC design:
Management Group → Owner (cloud platform team only — use PIM)
Subscription → Contributor (workload team — via PIM)
Resource Group → Contributor/specific roles (workload developers)
Resources → Reader/specific data roles (application service principals)

Layer 5 — Tagging strategy:
Mandatory tags enforced by Azure Policy:
  Environment: Production | Development | Test
  Owner:        team or product owner email
  CostCentre:   cost allocation code
  Application:  application name/ID
  DataClass:    Public | Internal | Confidential | Restricted
Cost Management: filter by tag to see per-team, per-app spending
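
As a rough sketch of what the mandatory-tag policy checks, the function below validates a resource's tags against the list above. In Azure this is enforced by Azure Policy, not application code; REQUIRED_TAGS and tag_violations are hypothetical names used only for illustration.

```python
# Required tags from the strategy above. A set lists the allowed values;
# None means any non-empty value is accepted.
REQUIRED_TAGS = {
    "Environment": {"Production", "Development", "Test"},
    "Owner": None,
    "CostCentre": None,
    "Application": None,
    "DataClass": {"Public", "Internal", "Confidential", "Restricted"},
}


def tag_violations(tags: dict[str, str]) -> list[str]:
    """Return a human-readable list of problems; empty means compliant."""
    problems = []
    for key, allowed in REQUIRED_TAGS.items():
        value = tags.get(key)
        if not value:
            problems.append(f"missing tag: {key}")
        elif allowed is not None and value not in allowed:
            problems.append(f"invalid value for {key}: {value}")
    return problems


print(tag_violations({"Environment": "Prod", "Owner": "team@example.com"}))
```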

Layer 6 — Subscription design:
Option A: One subscription per environment (simplest)
  Prod-Sub | Dev-Sub | Test-Sub
Option B: One subscription per workload (better isolation)
  Sales-Prod | HR-Prod | Finance-Prod (separate blast radius, billing)
Option C: Hybrid (recommended for large orgs)
  Platform subs (connectivity, identity, management) + Workload subs

How do you design a monitoring solution for Azure?

Monitoring architecture:
→ Centralised Log Analytics workspace (one per environment or per org)
→ All resources send diagnostic logs and metrics to central workspace
→ Azure Monitor alerts trigger action groups (email, Teams, PagerDuty)

Data sources and collection:
Activity Log:        subscription-level control plane operations (all subs → workspace)
Resource Diagnostics: service-level logs (SQL queries, App Gateway access, Key Vault ops)
VM Guest OS:          Azure Monitor Agent + Data Collection Rules → workspace
Application:          Application Insights SDK → telemetry (requests, exceptions, dependencies)
Container:            Container Insights → AKS node + pod metrics to workspace
Security:             Microsoft Defender for Cloud → security findings to workspace
Network:              Network Watcher → flow logs, connection monitor

Application Insights:
→ APM (Application Performance Monitoring) for web applications
→ Instrumentation: add SDK or use auto-instrumentation (App Service, AKS)
→ Telemetry: requests, exceptions, dependencies (SQL, HTTP calls), custom events
→ Live Metrics: real-time monitoring during deployment
→ Availability tests: standard test (successor to the classic URL ping test) every 5 minutes from multiple Azure regions
→ Smart detection: ML-based anomaly detection (failure rate spikes, performance degradation)

Azure Monitor Alerts:
Metric alerts:   trigger on: CPU > 80%, DTU > 90%, response time > 2s
Log alerts (KQL): trigger on: error rate > 5%, specific exception count
Activity log:    trigger on: resource deleted, policy denied, security event

Action Groups (what happens when alert fires):
Email:           notify team
SMS:             notify on-call
Teams webhook:   post to Teams channel
Azure Function:  auto-remediate (restart service, scale out)
Logic App:       create ITSM ticket, update status page
Automation Runbook: execute remediation script

Dashboard and visualisation:
Azure Monitor Dashboards: operational view for NOC
Azure Workbooks:          custom parameterised reporting and analysis
Power BI:                 management-level KPI reporting from Log Analytics
Grafana:                  popular open-source dashboards (integrates with Azure Monitor)

3. Design Data Storage Solutions (25–30%)

How do you choose the right Azure storage service?

Storage service selection matrix:

Requirement                                  → Best Service
Unstructured blobs (documents, images, video) → Azure Blob Storage
Virtual machine disks                         → Azure Managed Disks
File shares (SMB/NFS, lift-and-shift)         → Azure Files
Hybrid file server (on-prem + cloud cache)    → Azure File Sync
NoSQL document database (JSON)                → Azure Cosmos DB (SQL API)
NoSQL key-value at massive scale              → Azure Cosmos DB (Table API)
Globally distributed, multi-write            → Azure Cosmos DB (any API)
Relational data, OLTP workloads              → Azure SQL Database
SQL Server features, lift-and-shift           → Azure SQL Managed Instance
Data warehouse, analytical queries (OLAP)    → Azure Synapse Analytics / Fabric
Real-time analytics over streaming data      → Azure Stream Analytics + Event Hubs
Message queue (simple, cheap, massive)        → Azure Storage Queues
Enterprise messaging (ordering, DLQ, pub/sub) → Azure Service Bus
Event streaming (millions/sec, telemetry)     → Azure Event Hubs
Time-series / IoT analytics                  → Azure Data Explorer / KQL
Graph data (relationships between entities)  → Azure Cosmos DB (Gremlin API)
Cache layer (session, query results)          → Azure Cache for Redis

Blob storage access tiers (cost vs latency):
Hot:     frequent access, highest storage cost, lowest access cost
Cool:    30-day minimum, cheaper storage, higher access cost
Cold:    90-day minimum, very cheap storage, higher access cost
Archive: 180-day minimum, cheapest storage, OFFLINE (rehydrate before read)
Lifecycle policy: automatically tier blobs based on last modified age
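
A lifecycle rule of this kind boils down to a threshold lookup on blob age. The sketch below assumes illustrative thresholds of 30, 90, and 180 days (chosen here to mirror the minimum retention periods, not Azure defaults); RULES and target_tier are hypothetical names.

```python
from datetime import date

# Assumed rule set: (minimum age in days, target tier), checked oldest first.
RULES = [(180, "Archive"), (90, "Cold"), (30, "Cool")]


def target_tier(last_modified: date, today: date) -> str:
    """Pick the tier a blob should sit in, based on days since last modification."""
    age_days = (today - last_modified).days
    for min_age, tier in RULES:
        if age_days >= min_age:
            return tier
    return "Hot"


print(target_tier(date(2026, 1, 1), date(2026, 4, 11)))  # blob is 100 days old
```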

Blob storage redundancy (durability vs cost):
LRS:   3 copies, single datacenter, cheapest
ZRS:   3 copies across AZs, zone-resilient (recommended for most production)
GRS:   LRS + async copy to paired region, secondary readable only after failover
GZRS:  ZRS + async to paired region, highest durability (16 nines)
RA-GRS/RA-GZRS: read from secondary at all times (DR read scale-out)

How do you design a relational database solution on Azure?

Azure SQL options:

Azure SQL Database (fully managed PaaS):
→ Serverless tier: auto-pause during idle (cost saving for dev/test)
→ General Purpose: balanced compute/storage (most production OLTP)
→ Business Critical: in-memory OLTP, readable secondary, local SSD
→ Hyperscale: up to 128 TB, fast backup/restore (large databases)
→ Elastic Pool: share resources across multiple databases (multi-tenant SaaS)
→ Auto-failover groups: automatic geo-replication + failover (RPO < 5 sec)
→ Zone-redundant: Business Critical and General Purpose (zonal HA)

Azure SQL Managed Instance (PaaS with near-full SQL Server compatibility):
→ Lift-and-shift of SQL Server workloads needing: SQL Agent, CLR, linked servers,
  Service Broker, cross-database queries (not available in SQL Database)
→ Deployed INSIDE a VNet (private, no public endpoint by default)
→ Business Critical tier: in-memory OLTP + readable secondary
→ Migration: Database Migration Service from on-prem SQL Server
→ Choice: if application needs SQL Agent or cross-database queries → Managed Instance

SQL Server on Azure VM (IaaS):
→ Full SQL Server control (any version, any edition, OS-level access)
→ Use when: need features only available on bare SQL Server (SSRS, SSAS, SSIS),
            or specific versions/configurations not supported in PaaS
→ Higher management overhead vs PaaS options

Decision framework:
Can Azure SQL Database meet the requirements?
  Yes: use SQL Database (lowest management overhead)
  No:  needs SQL Agent, cross-DB queries, or Service Broker?
    Yes: use SQL Managed Instance
    No:  needs full OS control or a specific SQL Server version?
      Yes: use SQL Server on Azure VM
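
Read as a cascade, the framework picks the most managed option that still satisfies the requirements. A minimal sketch, with hypothetical parameter names standing in for the requirement checks:

```python
def choose_sql_option(needs_os_control: bool, needs_instance_features: bool) -> str:
    """Pick the most managed Azure SQL option that still meets the requirements.

    needs_instance_features covers SQL Agent, cross-database queries, Service Broker.
    needs_os_control covers OS-level access or a specific SQL Server version.
    """
    if needs_os_control:
        return "SQL Server on Azure VM"
    if needs_instance_features:
        return "Azure SQL Managed Instance"
    return "Azure SQL Database"


print(choose_sql_option(needs_os_control=False, needs_instance_features=True))
```

The strictest requirement is checked first so that a workload needing OS access never lands on a PaaS option by accident.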

High availability for SQL:
Single region:    Business Critical tier → built-in Always On AG (3 replicas, ZR)
Multi-region:     Auto-failover group → geo-replication + automatic DNS failover
RPO:              < 5 seconds (geo-replication log shipping)
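RTO:              seconds to minutes for a customer-initiated failover; about 1 hour when relying on the
                  Microsoft-managed automatic policy (grace period of at least 1 hour)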

How do you design a Cosmos DB solution?

Azure Cosmos DB:
→ Globally distributed, multi-model NoSQL database
→ SLA: up to 99.999% availability (multi-region accounts; 99.99% for single-region), single-digit millisecond latency (P99)
→ APIs: SQL (Core), MongoDB, Cassandra, Gremlin (graph), Table, PostgreSQL

Key design decisions:

1. Partition key selection (MOST CRITICAL decision):
→ Must provide even distribution of data (high cardinality)
→ Must be included in most queries (enables partition-targeted reads)
→ Cannot be changed after container creation
→ Good: UserId, CustomerId, TenantId, ProductId
→ Bad: Boolean, Status (low cardinality — hot partitions)
→ Synthetic key: if single key has low cardinality, combine fields
   e.g., {country}-{city} instead of country alone
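
To see why a synthetic key helps, the sketch below builds one and measures how much of a small sample lands in the largest logical partition. The sample documents and both helper names are made up for illustration; a real assessment would use production data volumes and query patterns.

```python
from collections import Counter


def synthetic_key(doc: dict, fields: tuple[str, ...]) -> str:
    """Join the chosen fields into one partition key value, e.g. 'UK-London'."""
    return "-".join(str(doc[f]) for f in fields)


def largest_partition_share(docs: list[dict], fields: tuple[str, ...]) -> float:
    """Fraction of documents that fall into the single busiest partition."""
    counts = Counter(synthetic_key(d, fields) for d in docs)
    return max(counts.values()) / len(docs)


docs = [
    {"country": "UK", "city": "London"},
    {"country": "UK", "city": "Leeds"},
    {"country": "UK", "city": "Bristol"},
    {"country": "FR", "city": "Paris"},
]

print(largest_partition_share(docs, ("country",)))         # country alone
print(largest_partition_share(docs, ("country", "city")))  # synthetic key
```

With country alone, three of the four documents share one partition; the combined key spreads them evenly.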

2. Consistency levels (5 levels, stronger = more latency/cost):
Strong:           linearisable — always reads latest write (highest cost)
Bounded Staleness: reads lag writes by configurable window (K ops or T seconds)
Session:           strong consistency within a session (DEFAULT — best for most apps)
Consistent Prefix: reads never see out-of-order writes
Eventual:          weakest — reads may lag, highest performance

3. Request Units (RUs):
→ Currency of Cosmos DB — normalised unit of throughput
→ Every operation (read, write, query) consumes RUs
→ Provision throughput (guaranteed RU/s) or serverless (pay per op)
→ 1 RU = reading a 1 KB document by its partition key + id

4. Global distribution:
→ Add regions with one click — data replicated automatically
→ Single-write region: one primary write, any region reads (default)
→ Multi-region write: any region can write (use for write-heavy global apps)
→ Conflict resolution: needed for multi-write (last write wins or custom)

5. Change Feed:
→ Ordered stream of inserts and updates in a Cosmos DB container
→ Use for: event-driven microservices, materialised views, ETL to analytics
→ Consumers: Azure Functions trigger, Spark, SDK change feed processor

4. Design Infrastructure Solutions (30–35%)

How do you design a compute solution — choosing between VM, App Service, AKS, and Functions?

Compute service selection framework:

Azure Virtual Machines (IaaS):
→ Use when: need full OS control, specific OS/software, GPU, legacy apps
→ Lift-and-shift: move existing apps without code changes
→ HA pattern: Availability Zones + Standard Load Balancer (99.99% SLA)
→ Scale: VM Scale Sets (VMSS) for auto-scaling clusters
→ Avoid for: stateless web apps (use App Service/containers instead)

Azure App Service (PaaS):
→ Use when: web apps, REST APIs, mobile backends in .NET, Java, Node, Python, PHP
→ No VM management: Azure handles OS, patching, load balancing, scaling
→ Deployment slots: zero-downtime deployments, blue-green, A/B testing
→ VNet Integration: outbound private connectivity to databases/services
→ Private Endpoint: inbound private access (no public internet)
→ Scale: horizontal (add instances) up to 30, vertical (bigger plan)
→ Choose tier: Free/Basic (dev/test) → Standard (production, autoscale)
             → Premium (VNet integration) → Isolated (ASE, dedicated)

Azure Kubernetes Service (AKS):
→ Use when: microservices, containerised workloads, complex orchestration
→ Benefits: auto-scaling, rolling deployments, self-healing, rich ecosystem
→ Design: multiple node pools (system + user), node autoscaler, HPA
→ Networking: Azure CNI (direct pod IPs from VNet — needed for private)
→ Storage: Azure Disk (ReadWriteOnce) or Azure Files (ReadWriteMany)
→ Security: Managed Identity for pod-level Azure access (Workload Identity)
→ Ingress: Azure Application Gateway Ingress Controller (AGIC) for WAF

Azure Functions (Serverless):
→ Use when: event-driven compute, short-lived operations, unknown/spiky traffic
→ Scale to zero: pay only during execution (Consumption plan)
→ Premium plan: pre-warmed instances, VNet integration, longer timeout
→ Triggers: HTTP, Service Bus, Event Hub, Blob Storage, Timer, Cosmos DB
→ Durable Functions: stateful orchestration (fan-out/fan-in, long workflows)
→ Avoid for: long-running processes on the Consumption plan, which caps execution at 10 minutes (use a Premium plan, App Service, or AKS)

Decision tree:
Need OS-level control?              → VM
Containerised microservices?        → AKS
Web/API app, managed PaaS?          → App Service
Event-driven, short duration?       → Azure Functions
Burst traffic, stateless?           → Functions or App Service with autoscale
Complex stateful workflow?          → Durable Functions or Logic Apps

How do you design a network architecture for enterprise Azure?

Hub-Spoke network topology (recommended for most enterprises):

Hub VNet (central):
→ Shared services: Azure Firewall, VPN/ExpressRoute Gateway, Bastion, DNS
→ Managed by central networking team
→ All spoke traffic routes through hub (user-defined routes point to the hub firewall)
→ Hub in each Azure region for global presence

Spoke VNets (per workload/environment):
→ Each application or team gets own spoke VNet
→ Peered to Hub: VNet Peering (low latency, same region)
               Global VNet Peering (cross-region)
→ No direct spoke-to-spoke peering (routes through hub firewall for inspection)
→ Subnet design: AppSubnet, DataSubnet, PrivateEndpointSubnet, etc.

Virtual WAN (alternative for large/global organisations):
→ Microsoft-managed hub-and-spoke at scale
→ Supports: any-to-any connectivity (spoke-spoke via VWAN hub)
→ Integrates: VPN, ExpressRoute, SD-WAN partners, Azure Firewall Manager
→ Use for: organisations with 10+ spokes or multiple regions

On-premises connectivity:
VPN Gateway:     encrypted IPsec over internet (up to 10 Gbps, simpler, cheaper)
ExpressRoute:    private dedicated circuit via connectivity provider
                 (consistent latency, up to 100 Gbps, higher cost, compliance)
ExpressRoute + VPN: redundant connectivity (ExpressRoute primary, VPN backup)

DNS architecture:
Azure Private DNS zones: private name resolution within Azure VNet
  privatelink.blob.core.windows.net → maps to storage private endpoint IPs
  privatelink.database.windows.net  → maps to SQL private endpoint IPs
Azure DNS Private Resolver: inbound endpoint answers on-prem DNS queries for Azure private zones
                    outbound endpoint forwards Azure queries to on-prem DNS servers
  → Enables split-horizon DNS for hybrid environments

DDoS Protection:
DDoS Network Protection: per-VNet (recommended for public-facing)
DDoS IP Protection:      per-public IP (simpler, fewer features)
WAF:                     Application Gateway WAF or Front Door WAF
                         (protects against OWASP Top 10, layer 7 attacks)

How do you design a secure solution using Azure Key Vault?

Azure Key Vault:
→ Centralised secrets, keys, and certificates management
→ Every application should use Key Vault — NEVER store secrets in code/config

Key Vault uses:
Secrets:      connection strings, API keys, passwords
Keys:         RSA/EC keys for envelope encryption, signing
Certificates: X.509 certificates with auto-renewal (integrated CAs: DigiCert, GlobalSign)

Access patterns:
Managed Identity (RECOMMENDED):
  → Application Managed Identity → assigned Key Vault Secrets User role
  → No credentials to manage, no rotation risk
  → Code: var secret = await client.GetSecretAsync("ConnectionString");

RBAC roles for Key Vault:
  Key Vault Secrets User:    read secret values (applications)
  Key Vault Secrets Officer: create, read, update, and delete secrets, but not manage access (operations)
  Key Vault Administrator:   full control (admin — use PIM)
  Key Vault Reader:          read metadata, not values (audit)

Key Vault networking:
  Private Endpoint:     Key Vault accessible only via private IP in VNet
  Service Endpoint:     traffic stays on Azure backbone (no private IP)
  Firewall:             allowlist specific IPs/VNets; deny all others
  Production: always Private Endpoint + disable public access

Soft delete + Purge Protection:
  Soft delete (90 days): deleted secrets recoverable for 90 days
  Purge Protection:      prevents permanent deletion for retention period
  Both required for: compliance, ransomware protection, accidental deletion
  Enable purge protection in production; once enabled it cannot be turned off

Key rotation:
  Manual:   generate new version, update application to use latest version
  Automatic (certificates): Key Vault rotates before expiry via CA integration
  Event-driven: Key Vault fires Event Grid event on near-expiry
                → Logic App / Azure Function generates and stores new key
                → Notifies application to reload secret

5. Design Business Continuity Solutions (10–15%)

How do you design for high availability and disaster recovery?

Key definitions:
RTO (Recovery Time Objective):
  Maximum acceptable time for service to be restored after failure
  "We can be down for no more than 4 hours" → RTO = 4 hours

RPO (Recovery Point Objective):
  Maximum acceptable data loss measured in time
  "We cannot lose more than 15 minutes of transactions" → RPO = 15 minutes

SLA tiers and what they require:
99.9%   = 8.76 hours downtime/year  → single VM with Premium SSD or Ultra Disk on all disks
99.95%  = 4.38 hours downtime/year  → Availability Set (2+ VMs)
99.99%  = 52.6 minutes downtime/year → Availability Zones (2+ VMs, 2+ AZs)
99.999% = 5.26 minutes downtime/year → multi-region active-active
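
The downtime figures follow directly from the SLA percentage. A small sketch of the arithmetic, assuming a 365-day year (the function name is illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def downtime_hours_per_year(sla_percent: float) -> float:
    """Maximum downtime per year that still meets the given SLA."""
    return HOURS_PER_YEAR * (1 - sla_percent / 100)


for sla in (99.9, 99.95, 99.99, 99.999):
    hours = downtime_hours_per_year(sla)
    print(f"{sla}% -> {hours:.2f} hours ({hours * 60:.1f} minutes)")
```

Each extra nine cuts the allowance by a factor of ten, which is why 99.99% leaves under an hour per year.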

Azure HA architecture patterns:

Single region HA (protects against hardware/datacenter failure):
  Azure Load Balancer + VM Scale Set across 3 Availability Zones
  SQL: Business Critical (3 replicas, zone-redundant)
  Storage: ZRS (3 copies across AZs)
  App Service: Premium tier (zone redundancy)
  SLA: 99.99% for compute + 99.99% for database
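
When a request needs both the compute tier and the database, the end-to-end availability is lower than either figure alone. A minimal sketch, assuming the components fail independently and every one of them must be up (composite_sla is an illustrative name):

```python
from math import prod


def composite_sla(*slas_percent: float) -> float:
    """Availability of a serial chain in which every component must be available."""
    return prod(s / 100 for s in slas_percent) * 100


# Compute at 99.99% in front of a database at 99.99%
print(f"{composite_sla(99.99, 99.99):.4f}%")
```

Two 99.99% components in series give roughly 99.98%, so each dependency added to the request path lowers what the whole solution can promise.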

Multi-region HA (protects against region-wide outage):
Active-Passive (failover):
  Primary region handles traffic
  Secondary region: pre-deployed, inactive (warm standby)
  DNS failover via Azure Traffic Manager or Azure Front Door
  SQL: auto-failover group with automatic DNS failover
  RTO: minutes for the web tier (DNS TTL); database RTO depends on the failover group policy. RPO: < 5 seconds (geo-replication)
  Use for: moderate RTO/RPO requirements

Active-Active (both regions serve traffic simultaneously):
  Both regions active, load balanced globally
  Data: Cosmos DB multi-write, or synchronous replication
  DNS: Azure Front Door (anycast routing, fastest region wins)
  RTO: near-zero (no failover needed — traffic shifts automatically)
  RPO: near-zero (no replication lag for active-active write)
  Use for: highest availability requirements, global user base

Azure Site Recovery (ASR):
  VM replication to secondary region
  Recovery Plans: orchestrated failover in correct order
  Test failover: validate DR without impact to production
  RTO: 1-2 hours (VM startup + DNS propagation)
  RPO: typically minutes (continuous replication; crash-consistent recovery points every 5 minutes for Azure VMs)

How do you design a backup strategy?

Azure Backup architecture:

Recovery Services Vault vs Backup Vault:
Recovery Services Vault: VMs, SQL in VM, Azure Files, on-prem (MARS/MABS)
Backup Vault:            newer vault type — Blobs, Disks, AKS, PostgreSQL Flexible

Backup policies (design per RPO/RTO/retention):
Frequency:  hourly, daily, weekly
Retention:  daily (7-30 days), weekly (up to 5 years), monthly/yearly (compliance)
Cross-region restore: restore to paired region for DR (opt-in, extra cost)

Azure Backup for VMs:
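→ Application-consistent (VSS on Windows; pre/post scripts on Linux), file-system-consistent, or crash-consistent snapshots
→ Instant restore: restore from locally retained snapshots in minutes, without waiting for the transfer to the vault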
→ Soft delete: 14 days protection against accidental/ransomware deletion
→ Immutable vault: prevents deletion of backup data (locked permanently)

SQL Server in Azure VM:
→ Workload-aware backup: full + differential + log backup
→ Log backup every 15 minutes → RPO of 15 minutes for SQL
→ Compression enabled by default

Azure SQL Database:
→ Built-in automated backups: full weekly, differential every 12-24h, log roughly every 10 min
→ Point-In-Time Restore (PITR): restore to any second within retention period
→ Long-term retention (LTR): store backups up to 10 years (compliance)

Geo-redundant backup storage:
→ Recovery Services Vault: GRS (default) or LRS
→ Cross-region restore: restore backup from primary in secondary region
→ Required for: regulatory compliance, DR strategy validation

Backup monitoring:
→ Azure Backup Reports (Power BI): compliance dashboard, which VMs NOT backed up
→ Azure Monitor alerts: backup job failures → email/Teams notification
→ Recovery Services Vault → Backup Jobs → filter by status = Failed

6. Azure Well-Architected Framework & Landing Zones

What is the Azure Architecture Centre and key architecture patterns?

Azure Architecture Centre (docs.microsoft.com/azure/architecture):
→ Microsoft's official reference for proven Azure architectural patterns
→ Reference architectures: tested designs for common scenarios
→ Design patterns: reusable solutions to recurring problems

Key cloud design patterns for AZ-305:

1. Retry pattern:
   Transient failures (network timeout, throttling) → automatically retry
   Exponential backoff: wait 1s, 2s, 4s, 8s between retries
   Circuit Breaker: stop retrying after threshold → fail fast
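
A minimal sketch of retry with exponential backoff, using only the standard library. TransientError and retry are hypothetical names; production code would more likely rely on a library such as Polly or an SDK's built-in retry policy. The sleep function is injectable so the example runs without actually waiting.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def retry(operation, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call operation; on TransientError wait 1s, 2s, 4s, ... and try again."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)


# Demo: an operation that fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("simulated timeout")
    return "ok"


delays = []
result = retry(flaky, sleep=delays.append)  # record the delays instead of sleeping
print(result, delays)
```

Only errors marked as transient are retried; anything else propagates immediately, since retrying a permanent failure just adds latency.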

2. Circuit Breaker pattern:
   Prevent cascading failures when a downstream service is unhealthy
   States: Closed (normal) → Open (failing, reject calls) → Half-Open (test)
   Implementation: Polly (C#), resilience4j (Java)
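
The three states can be sketched as a small class. This is an illustrative, single-threaded version with hypothetical names and thresholds, not how Polly or resilience4j implement it; the clock is injectable so the demo can move time forward by hand.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # timeout elapsed: allow one trial call
        return "open"

    def call(self, operation):
        if self.state == "open":
            raise CircuitOpenError("downstream marked unhealthy")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None  # any success closes the circuit
        return result


# Demo with a fake clock
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])


def failing():
    raise ConnectionError("downstream unavailable")


for _ in range(2):
    try:
        breaker.call(failing)
    except ConnectionError:
        pass
state_after_failures = breaker.state
now[0] = 10.0  # pretend the reset timeout has elapsed
state_after_timeout = breaker.state
breaker.call(lambda: "ok")
state_after_success = breaker.state
print(state_after_failures, state_after_timeout, state_after_success)
```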

3. CQRS (Command Query Responsibility Segregation):
   Separate read and write models
   Write model: transactional database (SQL), optimised for writes
   Read model: denormalised read store (Cosmos DB, Redis), optimised for reads
   Event sourcing: store events, not state; rebuild state by replaying events

4. Strangler Fig pattern:
   Incrementally replace a monolith with microservices
   Requests for migrated functionality → new service; everything else → legacy system
   Gradually migrate functionality until the legacy system can be retired

5. Event-driven architecture:
   Services communicate via events (loose coupling)
   Event Grid: routing discrete state change events
   Service Bus: reliable ordered message delivery
   Event Hubs: high-volume event streaming

6. Bulkhead pattern:
   Isolate critical resources — if one component fails, others continue
   e.g., separate thread pools, connection pools per downstream service
   Prevents one slow service from exhausting all resources
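
One way to sketch a bulkhead is a fixed number of slots per downstream service, with calls rejected once the slots are taken. The Bulkhead class and the limit of one slot are illustrative assumptions.

```python
import threading


class Bulkhead:
    """Caps concurrent calls to one downstream service."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        # Reject immediately rather than queueing, so callers are never tied up.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return operation()
        finally:
            self._slots.release()


# Demo: with one slot, a second call made while the first is running is rejected.
payments = Bulkhead(max_concurrent=1)


def nested():
    try:
        payments.call(lambda: "inner")
    except RuntimeError:
        return "inner rejected"
    return "inner ran"


outcome = payments.call(nested)
after = payments.call(lambda: "ok")  # the slot is free again once the first call ends
print(outcome, after)
```

Giving each downstream service its own Bulkhead instance means a slow dependency can only exhaust its own slots.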

7. Cache-Aside pattern:
   Application manages cache explicitly:
   1. Check cache (Redis) — if found, return cached value
   2. If not found: query database, store in cache, return value
   3. On update: update database, INVALIDATE cache (not update)
   Use: Azure Cache for Redis + Azure SQL / Cosmos DB
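
The three steps above can be sketched with plain dictionaries standing in for Redis and the database. CacheAsideStore is a hypothetical class; a real implementation would also set a TTL on cached entries.

```python
class CacheAsideStore:
    def __init__(self, cache: dict, database: dict):
        self.cache = cache        # stand-in for Azure Cache for Redis
        self.database = database  # stand-in for Azure SQL / Cosmos DB

    def get(self, key):
        if key in self.cache:            # 1. cache hit
            return self.cache[key]
        value = self.database[key]       # 2. miss: read from the database
        self.cache[key] = value          #    and populate the cache
        return value

    def update(self, key, value):
        self.database[key] = value       # 3. write to the database
        self.cache.pop(key, None)        #    then invalidate, not update


database = {"product:1": "Laptop"}
store = CacheAsideStore(cache={}, database=database)

first = store.get("product:1")           # miss, loads from the database
store.update("product:1", "Laptop v2")   # removes the stale cache entry
second = store.get("product:1")          # miss again, loads the new value
print(first, second)
```

Invalidating instead of updating avoids leaving a stale value in the cache when two writers race.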

8. Federated Identity pattern:
   Delegate authentication to external identity provider (IdP)
   Application trusts tokens from Entra ID (or ADFS, Google, etc.)
   Benefits: SSO across apps, no password management per app

What are the key networking design patterns for AZ-305?

Private Endpoint (most important networking concept for AZ-305):
→ Assign a private IP from your VNet to an Azure PaaS service
→ Traffic between your VNet and the PaaS service never leaves the Microsoft backbone
→ Disable public endpoint on PaaS resource for maximum security
→ Requires: Private DNS Zone to resolve FQDN to private IP

Pattern: secure multi-tier application
  App Service (private endpoint inbound) ← only VNet traffic allowed
    ↓ (VNet Integration — outbound)
  Azure SQL Database (private endpoint) ← no public access
  Azure Key Vault (private endpoint) ← no public access
  Azure Storage (private endpoint) ← no public access

Azure Front Door:
→ Global application delivery network: CDN + WAF + load balancing + failover
→ Anycast: routes users to the closest Azure PoP (Point of Presence)
→ WAF: OWASP protection globally at the edge
→ Origin groups: pool of backends (App Service in US + App Service in EU)
→ Health probes: automatically route away from unhealthy backends
→ Use for: global web applications with multi-region active-active

Azure Traffic Manager:
→ DNS-based global load balancing (NOT a reverse proxy — users connect directly to origin)
→ Routing methods:
  Priority:      primary region + failover regions
  Weighted:      distribute traffic by percentage (canary deployments)
  Performance:   route to lowest latency region for the user
  Geographic:    route users to specific region (data residency)
  Multivalue:    return all healthy endpoints, client picks
→ Use for: global routing of any protocol (not just HTTP), DNS-based failover
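
The Priority and Weighted methods can be sketched as selection functions over healthy endpoints. This only models the routing decision, not Traffic Manager's DNS behaviour; Endpoint and both functions are hypothetical names. As in Traffic Manager, a lower priority number means the endpoint is preferred.

```python
import random
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    priority: int   # lower number = preferred
    weight: int     # relative share of traffic
    healthy: bool = True


def pick_priority(endpoints):
    """Priority routing: the healthy endpoint with the lowest priority number."""
    healthy = [e for e in endpoints if e.healthy]
    return min(healthy, key=lambda e: e.priority) if healthy else None


def pick_weighted(endpoints, rng=random):
    """Weighted routing: random choice among healthy endpoints, in proportion to weight."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    return rng.choices(healthy, weights=[e.weight for e in healthy], k=1)[0]


endpoints = [
    Endpoint("west-europe", priority=1, weight=90),
    Endpoint("east-us", priority=2, weight=10),
]

primary = pick_priority(endpoints).name
endpoints[0].healthy = False             # simulate a failed health probe
failover = pick_priority(endpoints).name
only_choice = pick_weighted(endpoints).name
print(primary, failover, only_choice)
```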

Azure Front Door vs Traffic Manager:
Front Door:       HTTP/HTTPS only, reverse proxy, WAF, anycast, CDN, SSL offload
Traffic Manager:  any protocol, DNS-based (no proxy), no WAF, no CDN
Choose Front Door for modern web applications
Choose Traffic Manager for non-HTTP workloads or existing architectures

7. Scenario-Based Questions

Scenario: Design a globally available, zero-trust e-commerce platform handling 1M users.

Requirements:
→ 99.99% SLA, global users, GDPR compliance (EU data residency)
→ Zero Trust security, no public access to backend services
→ Auto-scaling for seasonal traffic spikes

Architecture:

GLOBAL LAYER:
Azure Front Door Premium:
  → WAF policy: managed Default Rule Set (OWASP-based), bot protection, custom rate-limit rules
  → Origins: App Service (West Europe) + App Service (East US)
  → Health probes: route away from unhealthy origins automatically
  → CDN: static assets (images, JS, CSS) cached at edge PoPs
  → Private Link to App Service backend (Front Door → App Service private)

COMPUTE LAYER (West Europe — primary):
Azure App Service (Premium v3, zone-redundant):
  → Outbound via VNet Integration → private access to all backends
  → Private Endpoint for inbound (Front Door is the only entry point)
  → Auto-scale: 2-30 instances based on CPU and HTTP queue length
  → Deployment slots: blue-green deployments (swap staging → production)

DATA LAYER:
Azure SQL Database (Business Critical, zone-redundant):
  → Auto-failover group → East US (passive secondary)
  → Private Endpoint: no public internet access
  → Encryption: Always Encrypted for PII columns (GDPR)
  → Auditing: all queries logged to Log Analytics

Azure Cosmos DB (SQL API):
  → Multi-region: West Europe (write) + East US (read)
  → Product catalogue, shopping cart, user sessions
  → Session consistency (read-your-writes within a session, eventual for other sessions and regions)
  → PITR: 30-day point-in-time restore

Azure Cache for Redis:
  → Product catalogue cache (60-minute TTL)
  → Session state store (24-hour TTL)
  → Private Endpoint

SECURITY LAYER:
Azure Key Vault (private endpoint): connection strings, API keys, TLS certs
Managed Identity: App Service → Key Vault, SQL, Storage (no stored credentials)
Private DNS Zones: all PaaS services resolve to private IPs
DDoS Network Protection: on all public-facing VNets
Microsoft Defender for Cloud: security posture management across all services
Microsoft Purview: classify and track PII data (GDPR compliance)

MONITORING:
Application Insights: request traces, exceptions, dependency calls, availability tests
Azure Monitor: infrastructure metrics, alerts, action groups
Log Analytics workspace: centralised logs from all services
KQL alert: failed login rate > 5% in 5 minutes → Teams alert
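
The alert condition above (failed login rate above 5% in a 5-minute window) can be expressed as a small Python check. This is a sketch of the evaluation logic only, not KQL and not an Azure Monitor API; the function name and the (timestamp, succeeded) event shape are assumptions.

```python
from datetime import datetime, timedelta


def failed_login_alert(events, now, window=timedelta(minutes=5), threshold=0.05):
    """Return True when the failure rate inside the window exceeds the threshold.

    events: iterable of (timestamp, succeeded) tuples.
    """
    recent = [ok for ts, ok in events if now - window <= ts <= now]
    if not recent:
        return False                      # no sign-ins, nothing to alert on
    failure_rate = recent.count(False) / len(recent)
    return failure_rate > threshold


now = datetime(2026, 5, 17, 12, 0)
one_minute_ago = now - timedelta(minutes=1)
sample = [(one_minute_ago, True)] * 94 + [(one_minute_ago, False)] * 6
should_fire = failed_login_alert(sample, now)   # 6% failures in the window
```

Using a rate instead of an absolute count keeps the alert meaningful during seasonal spikes, when the number of failed logins grows simply because traffic grows.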

Scenario: A company wants to migrate a 3-tier on-premises application to Azure. Design the migration approach.

Current state: 
  Web tier: 3 IIS servers (Windows Server 2019)
  App tier:  4 .NET application servers
  Data tier: SQL Server 2019 cluster (500 GB, 100 concurrent connections)

Migration phases using CAF Migrate + Modernise:

Phase 1 — Assess (2 weeks):
  Azure Migrate: discover and assess all servers
    → Right-size recommendations (map on-prem specs to Azure SKUs)
    → Dependency analysis (what does each server call?)
    → Cost estimates for IaaS (lift-and-shift) vs PaaS (modernise)

Phase 2 — Lift-and-Shift (IaaS first — 4 weeks):
  Web tier:  Application Gateway → 3 × D4s_v5 VMs, one per zone (AZ1/2/3)
  App tier:  Internal Load Balancer → 4 × D8s_v5 VMs spread across AZ1/2/3
  Data tier: SQL Server on Azure VMs (Always On availability group across zones)
  Migration tool: Azure Migrate replication (agent-based for physical servers)
  Cutover: DNS switch, 4-hour maintenance window, rollback plan ready
  
Phase 3 — Optimise and Modernise (3 months post-migration):
  Web tier:  migrate IIS apps → Azure App Service (Premium, ZR)
             Benefits: no VM management, built-in scaling, deployment slots
  App tier:  containerise .NET apps → Azure Kubernetes Service
             Benefits: better resource utilisation, rolling deployments
  Data tier: assess SQL Managed Instance compatibility
             Azure Database Migration Service (DMS) → online migration (< 1h cutover)
             If SQL Agent or Service Broker needed → SQL Managed Instance
             If standard T-SQL → Azure SQL Database Business Critical

Phase 4 — Govern and Secure (ongoing):
  Azure Policy: enforce tagging, no public IPs on VMs, encryption at rest
  Cost Management: reserved instances for stable workloads (40-72% savings)
  Monitoring: Azure Monitor + Application Insights (enable for modernised apps)
  Security: Defender for Servers (VMs) + Defender for SQL (databases)
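
The Phase 3 data-tier decision can be written as a simple rule. This is a hedged sketch: the function name is hypothetical, and the feature list combines the two features named in Phase 3 with the two extra ones named in the Top 10 Tips (cross-database queries, CLR). A real assessment should rely on the compatibility report, not on a lookup like this.

```python
# Instance-scoped SQL Server features that point to SQL Managed Instance.
INSTANCE_SCOPED_FEATURES = {
    "sql agent",
    "service broker",
    "cross-database queries",
    "clr",
}


def choose_sql_target(features_in_use):
    """Recommend a PaaS target from the SQL Server features the app depends on."""
    needed = {f.strip().lower() for f in features_in_use}
    blockers = sorted(needed & INSTANCE_SCOPED_FEATURES)
    if blockers:
        return "Azure SQL Managed Instance", blockers
    return "Azure SQL Database (Business Critical)", []


target, reasons = choose_sql_target(["SQL Agent", "Full-text search"])
```

Returning the blocking features alongside the recommendation gives the justification that a design answer needs, not just the service name.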

Scenario: Design a multi-tenant SaaS architecture on Azure.

Multi-tenant SaaS: one deployment serves multiple customer organisations

Tenancy model decision:

Model A — Shared everything (full multi-tenant):
  → One database, one compute, all tenants share
  → Lowest cost, highest density
  → Risk: noisy neighbour, data isolation challenges
  → Use for: small tenants, cost-sensitive, lower compliance requirements
  → Data isolation: Row-Level Security in SQL Database (TenantId filter)
  → Compute: shared multi-tenant App Service; restrict sign-in by validating the Entra ID tenant (tid claim) in OIDC tokens

Model B — Shared compute, separate databases (hybrid):
  → One App Service plan, one Cosmos DB container per tenant (or a shared container partitioned by TenantId when lighter isolation is acceptable)
  → OR: Elastic Pool — separate databases sharing one SQL server
  → Elastic Pool: up to 500 databases, pooled DTUs/vCores
  → Good balance: isolation at data layer, shared compute cost
  → Use for: SMB SaaS with moderate isolation requirements

Model C — Fully isolated (silo):
  → Separate subscription or resource group per tenant
  → Highest isolation, highest cost
  → Use for: enterprise tenants, strict compliance (FinTech, Healthcare)
  → Deployment: Bicep template, deploy new resources per tenant onboarding

SaaS-specific design patterns:
Onboarding automation: Logic App / Azure Function → provision new tenant infra
Tenant routing:   App Service → header/domain-based routing to correct data
Tenant billing:   Azure Cost Management tags (TenantId) → per-tenant cost tracking
Tenant offboarding: Data export → soft delete retention → hard delete after 90 days
Compliance:       Purview sensitivity labels per tenant, separate Key Vault per tenant
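
Domain-based tenant routing and the TenantId filter from Model A can be sketched together. Everything named here is hypothetical (the catalogue, host names, and database names), and the string-built filter is for illustration only: in Azure SQL Database the filter would be enforced server-side by Row-Level Security.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantRoute:
    tenant_id: str
    database: str
    shared_database: bool   # True for Model A, False for a per-tenant database


# Hypothetical tenant catalogue keyed by the request's host name.
CATALOG = {
    "contoso.example.com": TenantRoute("contoso", "shared-db", True),
    "fabrikam.example.com": TenantRoute("fabrikam", "fabrikam-db", False),
}


def route_request(host):
    """Map the Host header (port stripped, case-insensitive) to a tenant route."""
    key = host.split(":")[0].lower()
    try:
        return CATALOG[key]
    except KeyError:
        raise LookupError(f"unknown tenant host: {host}") from None


def scoped_query(route, base_query):
    """Add a TenantId filter when tenants share one database."""
    if route.shared_database:
        return base_query + " WHERE TenantId = @tenant_id", {"tenant_id": route.tenant_id}
    return base_query, {}


route = route_request("Contoso.example.com:443")
query, params = scoped_query(route, "SELECT * FROM Orders")
```

Keeping the routing decision in one catalogue means a tenant can move from the shared database to its own database by changing a single entry.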

Scenario: How do you design for cost optimisation without sacrificing reliability?

Cost optimisation architecture decisions:

1. Right-size with Azure Advisor:
   → Azure Advisor → Cost → Underutilised resources
   → VMs with < 5% average CPU → downsize (D4 → D2)
   → SQL DTUs < 20% average → move to smaller service objective

2. Reserved Instances (40-72% savings):
   → 1-year RI for VMs known to run 24/7 (web servers, databases)
   → 3-year RI for very stable workloads (domain controllers, monitoring)
   → Scope: shared (across all subscriptions) for maximum flexibility

3. Azure Savings Plans (up to 65%):
   → Commitment to hourly spend on compute (any region, any size)
   → More flexible than Reserved Instances
   → Best when total compute spend is steady but VM sizes, regions, or services change

4. Azure Hybrid Benefit:
   → Windows Server with SA: up to 40% savings on Windows VMs
   → SQL Server with SA: up to 55% savings on SQL licensing
   → Stack together with Reserved Instances for up to 80% savings

5. PaaS over IaaS:
   → Serverless/PaaS: pay only for what you use
   → App Service vs VMs: no OS patching or VM management; the plan is billed while it runs, so it does not scale to zero
   → Azure Functions (Consumption): pay per execution, scales to zero when idle (monthly free grant covers light dev/test use)

6. Autoscaling + scheduled scaling:
   → Scale in nights/weekends for non-production
   → VMSS: minimum 2 instances (HA), scale out on CPU > 70%
   → Dev VMs: auto-shutdown at 6pm via Azure Automation

7. Storage lifecycle management:
   → Blob: auto-tier (Hot → Cool after 30 days → Archive after 365 days)
   → Saves 70-95% on old data that must be retained

8. Dev/Test pricing:
   → Azure Dev/Test subscriptions: discounted Windows Server licensing
   → SQL in Dev/Test: no SQL licence cost (dev only workloads)

9. Spot VMs (up to 90% discount):
   → Fault-tolerant, interruptible workloads: batch processing, rendering, CI runners
   → Use VMSS with a mix of Spot and regular (on-demand) instances; Azure evicts Spot VMs when it needs the capacity back

10. Network costs:
    → Outbound bandwidth costs money; keep data in same region as consumers
    → VNet Peering cross-region: transfer costs (consider VPN vs peering)
    → CDN: serve static content from edge to reduce origin egress
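
Three of the levers above (stacking Reserved Instances with Hybrid Benefit, autoscaling with a floor of 2 instances, and blob lifecycle tiering) reduce to simple arithmetic or rules. The sketch below uses made-up hourly rates and a made-up 60% reservation discount purely for illustration; the scale-in threshold of 30% CPU and the maximum of 10 instances are also assumptions, not values from this guide.

```python
def hourly_cost(compute_rate, licence_rate, ri_discount=0.0, hybrid_benefit=False):
    """Hourly VM cost: the reservation discounts compute, Hybrid Benefit removes the licence."""
    compute = compute_rate * (1.0 - ri_discount)
    licence = 0.0 if hybrid_benefit else licence_rate
    return compute + licence


def savings_percent(baseline, actual):
    """Percentage saved versus the pay-as-you-go baseline."""
    return round(100.0 * (1.0 - actual / baseline), 1)


def desired_instances(current, avg_cpu, minimum=2, maximum=10,
                      scale_out_above=70, scale_in_below=30):
    """One autoscale step, clamped so the floor of 2 instances keeps HA."""
    if avg_cpu > scale_out_above:
        target = current + 1
    elif avg_cpu < scale_in_below:
        target = current - 1
    else:
        target = current
    return max(minimum, min(maximum, target))


def blob_tier(days_since_modified, cool_after=30, archive_after=365):
    """Tier a lifecycle rule would select (strictly greater than each threshold)."""
    if days_since_modified > archive_after:
        return "Archive"
    if days_since_modified > cool_after:
        return "Cool"
    return "Hot"


# Hypothetical rates: 0.60 per hour compute plus 0.40 per hour Windows licence.
payg = hourly_cost(0.60, 0.40)
stacked = hourly_cost(0.60, 0.40, ri_discount=0.60, hybrid_benefit=True)
stacked_saving = savings_percent(payg, stacked)
```

The two discounts stack because they act on different parts of the bill: the reservation lowers the compute rate, while Hybrid Benefit removes the licence charge entirely.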

8. Cheat Sheet — Quick Reference

AZ-305 Exam Domain Weights

Domain                                     Weight   Focus
Design infrastructure solutions            30-35%   Compute (VM/AKS/App Service/Functions)
                                                     Networking (hub-spoke, Front Door, VPN)
                                                     Security (Key Vault, Private Endpoints)
Design identity, governance & monitoring   25-30%   Hybrid identity, Conditional Access, PIM
                                                     Management Groups, Azure Policy, tagging
                                                     Azure Monitor, App Insights, Log Analytics
Design data storage solutions              25-30%   SQL DB vs MI vs VM, Cosmos DB design
                                                     Blob tiers, redundancy, Key Vault
                                                     Synapse / Fabric for analytics
Design business continuity solutions       10-15%   RTO/RPO, AZ vs multi-region, ASR
                                                     Azure Backup, immutable vault, PITR

Compute Decision Tree

Need OS-level control / specific software?     → VM + Availability Zones
Containerised microservices / complex scaling? → AKS
Web/API managed PaaS (.NET/Java/Node/Python)?  → App Service
Event-driven, short burst, unknown traffic?    → Azure Functions
Stateful orchestration (long workflow)?        → Durable Functions / Logic Apps
GPU workloads (AI/ML training, rendering)?     → N-series VMs
Burst compute (CI runners, batch)?             → Spot VMs or Batch
Virtual desktop for remote workers?            → Azure Virtual Desktop

SLA achieved by:
Single VM + Premium SSD:  99.9%
Availability Set (2+ VMs): 99.95%
Availability Zones (2+ AZs): 99.99%
Multi-region active-active: ~99.999% (composite design target, not a published SLA)
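
Figures for multi-component designs come from composite availability arithmetic. The sketch below shows the two standard rules, assuming component failures are independent (a simplification); the function names are hypothetical and the input percentages are illustrative, not quoted SLAs for a specific design.

```python
def serial(*availabilities):
    """Components that must ALL be up: multiply their availabilities."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """Redundant copies where ANY one suffices: 1 minus the product of failure rates."""
    failure = 1.0
    for a in availabilities:
        failure *= (1.0 - a)
    return 1.0 - failure


# One region: web tier (99.95%) in series with a database (99.99%).
region = serial(0.9995, 0.9999)

# Two such regions in active-active, either one able to serve all traffic.
two_regions = parallel(region, region)
```

Chaining services lowers availability below the weakest component, while redundancy raises it. Anything placed in front of both regions, such as a global router, is in series with the result and caps the end-to-end figure.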

Storage Decision Matrix

Use case                              → Service
Blobs (unstructured: media, docs)     → Blob Storage (Hot/Cool/Archive)
VM Disks                              → Managed Disks (Premium SSD for prod)
Shared file system (SMB/NFS)          → Azure Files
Hybrid file server caching            → Azure File Sync
Relational OLTP (managed)             → Azure SQL Database
SQL Server features (SQL Agent, etc.) → Azure SQL Managed Instance
Full SQL Server control               → SQL Server on Azure VM
NoSQL document / JSON                 → Azure Cosmos DB (SQL API)
Global multi-region, multi-write      → Azure Cosmos DB
Data warehouse / OLAP                 → Azure Synapse Analytics / Fabric
Enterprise message queue              → Azure Service Bus
Event streaming (IoT, telemetry)      → Azure Event Hubs
Session cache, query cache            → Azure Cache for Redis
Graph data                            → Cosmos DB (Gremlin API)

Networking Quick Reference

Topology:           Hub-Spoke (most enterprises) or Azure Virtual WAN (large scale)
On-prem private:    ExpressRoute (dedicated, consistent) preferred over VPN
On-prem encrypted:  VPN Gateway (encrypted over internet, simpler, cheaper)
Global HTTP/HTTPS:  Azure Front Door (anycast, WAF, CDN, private link origins)
Global any protocol: Azure Traffic Manager (DNS-based, any protocol)
Regional HTTP:      Application Gateway (L7, WAF, SSL termination, URL routing)
Regional TCP/UDP:   Azure Load Balancer Standard (L4, zone-redundant)
Private PaaS:       Private Endpoint + Private DNS Zone (production standard)
Admin access:       Azure Bastion (no public IP on VMs, browser-based RDP/SSH)
Perimeter firewall: Azure Firewall Premium (IDPS, TLS inspection, centralised)
DDoS:               DDoS Network Protection (per-VNet for public-facing workloads)
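
The four load-balancing rows above form a two-by-two matrix (global or regional scope, HTTP or non-HTTP traffic). A minimal lookup sketch, with a hypothetical function name, makes that explicit:

```python
LOAD_BALANCER_MATRIX = {
    ("global", True): "Azure Front Door",
    ("global", False): "Azure Traffic Manager",
    ("regional", True): "Application Gateway",
    ("regional", False): "Azure Load Balancer Standard",
}


def choose_load_balancer(scope, is_http):
    """Pick a load-balancing service from scope ('global' or 'regional') and protocol."""
    key = (scope.lower(), bool(is_http))
    if key not in LOAD_BALANCER_MATRIX:
        raise ValueError(f"scope must be 'global' or 'regional', got {scope!r}")
    return LOAD_BALANCER_MATRIX[key]


web_entry = choose_load_balancer("global", is_http=True)
```

Real designs often combine cells, for example Front Door in front of a regional Application Gateway, so treat the lookup as a first filter and not a final answer.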

Business Continuity Quick Reference

RTO/RPO targets → design pattern:
Hours/Hours:     Azure Backup + ASR warm standby (passive failover)
Minutes/Minutes: Auto-failover group + Traffic Manager / Front Door failover
Near-zero/zero:  Active-active multi-region (Cosmos DB multi-write, Front Door)
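
The mapping above can be sketched as a function in which the stricter of the two targets drives the design. The cut-offs used here (under 1 minute for "near-zero", under 1 hour for "minutes") are assumptions chosen for illustration, and the function name is hypothetical.

```python
from datetime import timedelta


def dr_pattern(rto, rpo):
    """Map RTO/RPO targets to a design pattern; the stricter target decides."""
    strictest = min(rto, rpo)
    if strictest < timedelta(minutes=1):
        return "Active-active multi-region"
    if strictest < timedelta(hours=1):
        return "Auto-failover group + Traffic Manager / Front Door failover"
    return "Azure Backup + ASR warm standby"


# A 4-hour RTO looks relaxed, but a 30-second RPO forces the strictest pattern.
pattern = dr_pattern(rto=timedelta(hours=4), rpo=timedelta(seconds=30))
```

Taking the minimum reflects how requirements are usually stated in scenarios: a single tight RPO is enough to rule out backup-based recovery, even when the allowed downtime is generous.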

SQL DR:
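Single region: Business Critical (4 replicas: 1 primary + 3 secondaries, zone-redundant) — 99.995% SLA
Multi-region:  Auto-failover group → listener endpoint repoints automatically via DNS
               RPO: < 5 seconds; RTO: about a minute for customer-initiated failover,
               Microsoft-managed automatic failover waits for the grace period (1 hour minimum)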

VM DR:
Azure Site Recovery: replicate VMs to secondary region
Recovery Plans: orchestrate failover order (DC first, then app, then DB)
Test failover: validate quarterly without impacting production

Backup essentials:
Soft delete: 14 days recovery after accidental deletion
Immutable vault (locked): ransomware protection (cannot be disabled)
Cross-region restore: restore backup in secondary region for DR validation
PITR (SQL): restore to any second in the retention window

Well-Architected Framework Pillars Summary

Pillar           Key services / patterns
Reliability      AZs, multi-region, auto-failover, health probes, retry+circuit breaker
Security         Zero Trust, Private Endpoints, Key Vault, Managed Identity, WAF
Cost             Reserved Instances, right-sizing, lifecycle tiers, PaaS over IaaS
Operational Exc. IaC (Bicep), CI/CD, Azure Monitor, runbooks, deployment slots
Performance      Autoscale, CDN (Front Door), Redis cache, read replicas, AKS HPA

Top 10 Tips

AZ-305 is a DESIGN exam, not a configuration exam — every question presents a business scenario asking you to RECOMMEND, COMPARE, or EVALUATE. The key skill is justifying WHY one service fits better than another — not knowing every configuration option.
Infrastructure is 30–35% of the exam — the largest domain. Master: compute service selection (VM vs App Service vs AKS vs Functions), hub-spoke networking, Private Endpoints, Key Vault, and Azure Front Door vs Traffic Manager.
Always anchor decisions in the Well-Architected Framework — every design choice has a WAF justification. "Why Private Endpoints?" → Security pillar (defence in depth). "Why Reserved Instances?" → Cost pillar. Scenario questions reward WAF-framed answers.
Private Endpoints are the production networking standard — any PaaS service (SQL, Storage, Key Vault, App Service) in production should have a Private Endpoint and disabled public access. Service Endpoints are insufficient for most enterprise requirements.
Managed Identity eliminates credential management — any Azure resource authenticating to another Azure service should use Managed Identity. No secrets in code, no rotation risk, no leakage. Always the first recommendation for any "how do you authenticate" question.
Hub-Spoke topology with Azure Firewall — the standard enterprise network architecture. Hub contains shared services (Firewall, VPN/ExpressRoute GW, Bastion). Spokes are workload VNets peered to hub. All spoke-to-spoke traffic inspected via hub Azure Firewall.
RTO and RPO drive the HA design — before answering any HA/DR question, establish the RTO and RPO requirements. RTO/RPO in hours → warm standby + ASR. Minutes → auto-failover groups + Traffic Manager. Near-zero → active-active multi-region with Cosmos DB.
SQL Database vs SQL Managed Instance — the most commonly confused SQL decision. SQL Database = most managed, best for greenfield. SQL Managed Instance = when you need SQL Agent, cross-database queries, Service Broker, or CLR. Know this boundary cold.
Azure Front Door vs Traffic Manager — Front Door is a reverse proxy (HTTP/HTTPS, WAF, CDN, anycast, SSL offload). Traffic Manager is DNS-based routing (any protocol, no proxy, no WAF). Global web apps → Front Door. Non-HTTP global routing → Traffic Manager.
The Cloud Adoption Framework provides the governance story — management group hierarchy, landing zones, Azure Policy at scale, and subscription design are all CAF concepts. For any "how do you govern 50 subscriptions" question, the answer starts with CAF landing zones.
