VMware — Complete Guide
vSphere · ESXi · vCenter · NSX-T · vSAN · Tanzu · HCX · Site Recovery · Aria · VCF · Cheat Sheet
1. Core Concepts — VMware Portfolio Overview
What is VMware and how is its product portfolio organised?
VMware (now part of Broadcom) is the industry leader in virtualization and software-defined infrastructure — enabling organisations to abstract compute, network, and storage resources from underlying hardware and manage them as software.
| Category | Products |
|---|---|
| Compute Virtualization | vSphere (ESXi + vCenter), VMware Workstation, VMware Fusion |
| Network Virtualization | NSX-T Data Center, NSX Advanced Load Balancer (Avi) |
| Storage Virtualization | vSAN, vSAN ESA, vSphere Virtual Volumes (vVols) |
| Cloud Management | Aria Suite (Operations, Automation, Log Insight, Network Insight) |
| Cloud-Native / Containers | Tanzu (TKGs, TKGi, TMC, App Platform) |
| Hybrid Cloud | VMware Cloud Foundation (VCF), VMware Cloud on AWS/Azure/GCP |
| Disaster Recovery | Site Recovery Manager (SRM), VMware Live Recovery |
| Migration | VMware HCX, vSphere Replication |
| Desktop Virtualization | Horizon (VDI), App Volumes, Dynamic Environment Manager |
What is the difference between VMware vSphere, ESXi, and vCenter?
ESXi (hypervisor):
→ Type-1 bare-metal hypervisor: runs directly on physical hardware
→ Replaces the OS: no underlying Windows or Linux required
→ Manages: CPUs, memory, storage, network interfaces
→ Hosts: virtual machines (VMs) and containers
→ Single host: can be managed standalone via DCUI or Host Client
→ Version: ESXi 8.x (current under Broadcom)
vCenter Server:
→ Centralised management platform for multiple ESXi hosts
→ Provides: vSphere Client (HTML5 UI), REST APIs, PowerCLI
→ Features: DRS, HA, vMotion, DPM, vSphere Lifecycle Manager
→ Deployment: vCenter Server Appliance (VCSA, a Linux-based OVA)
→ No Windows vCenter: VCSA only since vSphere 7.0
→ Single Sign-On (SSO): authentication domain for all vSphere services
vSphere:
→ The product suite = ESXi + vCenter Server + related features
→ Not a single component: it is the umbrella term for the platform
→ Licences: vSphere Standard, vSphere Enterprise Plus (now VCF bundles)
Relationship:
Physical Server → ESXi (hypervisor) → Virtual Machines
ESXi (one or many) → managed by → vCenter Server
vCenter Server → part of → vSphere (the platform)
What are the core virtualisation concepts every VMware professional must know?
Hypervisor types:
Type-1 (bare-metal): ESXi, Hyper-V, KVM
→ Runs directly on hardware: no host OS
→ Lower overhead, better performance, enterprise use
Type-2 (hosted): VMware Workstation, VirtualBox, Fusion
→ Runs on top of an existing OS (Windows/macOS)
→ Higher overhead, dev/test use only
Virtual Machine (VM):
→ Software-emulated computer with vCPU, vRAM, vNIC, vDisk
→ Isolated from other VMs on the same host
→ Runs any OS independently of the underlying hardware
→ Portable: can migrate between hosts (vMotion) and datacentres (HCX)
VM files:
.vmx: VM configuration file (settings, hardware definition)
.vmdk: virtual disk descriptor (small text header)
-flat.vmdk: actual disk data referenced by the descriptor
.nvram: VM BIOS/EFI state
.vmsd: snapshot metadata
.vmsn: snapshot state
.vmxf: supplemental config
.log: VM log file
Snapshots:
→ Point-in-time state capture (disk, plus optional memory and power state)
→ NOT a backup: child (delta) disks grow and performance degrades over time
→ Best practice: no more than 2–3 snapshots per VM
→ Consolidate before decommissioning: right-click → Manage Snapshots
VMFS (Virtual Machine File System):
→ VMware's clustered file system for shared block storage (FC/iSCSI SAN)
→ Allows multiple ESXi hosts to access the same datastore concurrently
→ Shared datastores let HA restart VMs on other hosts and keep vMotion/DRS migrations compute-only
→ Current version: VMFS-6 (supports 4K native drives, automatic UNMAP)
2. vSphere & ESXi — Deep Dive
How do vMotion, Storage vMotion, and Cross-vCenter vMotion work?
vMotion (live migration, compute):
→ Migrates a running VM between ESXi hosts with zero downtime
→ Requirements:
  - Shared storage visible to both hosts (VMFS, NFS, vSAN, or vVols); without it, the migration must also move the disks (combined compute + storage vMotion)
  - Compatible CPUs (EVC mode if CPU generations differ)
  - L2 network connectivity (same port group or NSX-T stretched segment)
  - vMotion VMkernel port on both hosts
→ Process:
  1. The source host pre-copies VM memory pages to the destination host (vCenter orchestrates)
  2. VM continues running on source; changed pages are tracked and re-sent
  3. At switchover: VM pauses (< 1 second), final memory state transferred
  4. VM resumes on destination; the source instance is removed
→ Use case: host maintenance (Enter Maintenance Mode triggers automatic vMotion when DRS is fully automated)
Storage vMotion (live migration, storage):
→ Migrates VM disk files between datastores with the VM running
→ No CPU/network compatibility requirements (moves storage, not compute)
→ Requirements: vCenter; for physical-mode RDMs only the mapping file moves (the LUN stays in place)
→ Use case: datastore maintenance, storage tiering, LUN reclamation
Cross-vCenter vMotion (xvMotion, long-distance):
→ Migrates a VM between different vCenter Servers / sites
→ Can move compute + storage simultaneously
→ Requires: Enhanced Linked Mode when driven from the vSphere Client (Advanced Cross vCenter vMotion in 7.0 U1c+ lifts this); HCX can provide the stretched Layer 2
EVC (Enhanced vMotion Compatibility):
→ Masks CPU features to a common baseline across a cluster
→ Allows vMotion between hosts with different CPU generations
→ Set at cluster level: Intel Broadwell, Cascade Lake, Ice Lake, etc.
→ EVC mode cannot be lowered while powered-on VMs use the higher baseline
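The iterative pre-copy in the process above can be illustrated with a small model. This is a sketch under simplifying assumptions (constant page-dirty rate, constant link throughput, a hypothetical 64 MB switchover threshold), not VMware's actual algorithm; the function and parameter names are invented for illustration.

```python
def precopy_rounds(memory_mb, dirty_mb_per_s, link_mb_per_s,
                   switchover_mb=64, max_rounds=20):
    """Model iterative pre-copy: each round re-sends what was dirtied
    while the previous round was being copied.

    Returns (MB sent in each round, MB left for the switchover pause).
    """
    if dirty_mb_per_s >= link_mb_per_s:
        # The VM dirties memory faster than the link can copy it.
        raise ValueError("pre-copy cannot converge: dirty rate >= link throughput")

    sent_per_round = []
    remaining = float(memory_mb)
    while remaining > switchover_mb and len(sent_per_round) < max_rounds:
        sent_per_round.append(remaining)
        copy_seconds = remaining / link_mb_per_s    # time spent copying this round
        remaining = copy_seconds * dirty_mb_per_s   # pages dirtied in the meantime
    return sent_per_round, remaining
```

For a 16 GB VM dirtying 100 MB/s over a link that moves 1000 MB/s, `precopy_rounds(16384, 100, 1000)` converges in three rounds, each a tenth the size of the previous one. The model also shows why a busy VM on a slow vMotion network may never reach a short switchover.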
What are vSphere HA, DRS, and DPM and when does each activate?
vSphere HA (High Availability):
→ Restarts VMs on surviving hosts if a host fails
→ Heartbeat: HA (FDM) agents exchange network heartbeats every second (port 8182); datastore heartbeats act as a secondary check
→ Failure detection: no network heartbeats for about 10s → the primary host checks datastore heartbeats, then declares the host failed
→ VM restart: the HA primary host restarts the affected VMs on remaining hosts (vCenter configures HA but is not needed for the restart)
→ VM/App monitoring: restarts VMs that stop sending VMware Tools heartbeats
→ Admission Control policies:
  - Percentage of cluster resources reserved (e.g., 25% = tolerate 1 of 4 equal hosts)
  - Slot-based: reserves slots per host for worst-case failover
  - Dedicated failover hosts
DRS (Distributed Resource Scheduler):
→ Balances VM workloads across hosts in a cluster automatically
→ Monitors: CPU and memory utilisation every 5 minutes (every minute, using a per-VM DRS score, from vSphere 7.0)
→ Automation levels:
  Manual: suggestions only; admin approves each migration
  Partially Automated: initial placement automatic, balancing manual
  Fully Automated: vCenter migrates VMs automatically
→ DRS rules:
  Affinity: keep VMs on the same host (e.g., app + DB for latency)
  Anti-affinity: keep VMs on different hosts (e.g., HA pairs)
  Host affinity: pin VMs to specific hosts (licensing, hardware)
DPM (Distributed Power Management):
→ Consolidates workloads and powers down idle hosts to save energy
→ Wakes hosts when capacity is needed (IPMI, iLO/iDRAC, or Wake-on-LAN)
→ Best practice: enable only if hosts support remote power management
Cluster design best practice:
→ N+1 minimum for HA (lose 1 host, all VMs still fit)
→ N+2 for business-critical clusters
→ Never over-commit with HA admission control disabled
→ DRS Fully Automated + balanced threshold for production clusters
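The N+1 and percentage-based admission control ideas above reduce to simple arithmetic. The following is a minimal sketch that only considers memory and assumes worst-case loss of the largest hosts; the function names are hypothetical and real admission control also accounts for CPU, reservations, and overhead.

```python
def reserve_percent(host_count, failures=1):
    """Share of cluster resources to reserve so `failures` equal-sized
    hosts can be lost (e.g. 1 of 4 hosts -> 25%)."""
    if not 0 < failures < host_count:
        raise ValueError("failures must be between 1 and host_count - 1")
    return 100.0 * failures / host_count


def survives_host_failures(host_memory_gb, vm_memory_gb, failures=1):
    """True if all VM memory still fits after losing the `failures`
    largest hosts (the worst case)."""
    if failures >= len(host_memory_gb):
        return False
    # Sort ascending and drop the largest hosts from the end.
    survivors = sorted(host_memory_gb)[:len(host_memory_gb) - failures]
    return sum(vm_memory_gb) <= sum(survivors)
```

With four 512 GB hosts and 1200 GB of VM memory, `survives_host_failures([512] * 4, [300] * 4)` is true for one failure (1536 GB left) and false for two (1024 GB left), which is the difference between an N+1 and an N+2 design.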
What is the VMkernel and what ports does it require?
VMkernel (vmk) — management network interfaces on ESXi:
→ Not a VM NIC — it's ESXi's own network stack
→ Each vmk port has: IP, subnet mask, gateway, VLAN tag, MTU
| Service | Typical vmk | Port / Protocol |
|---|---|---|
| Management | vmk0 | TCP 443 (vSphere Client, API); TCP/UDP 902 (vCenter agent traffic and heartbeat) |
| vMotion | vmk1 | TCP 8000 (vMotion data) |
| vSAN | vmk2 | TCP 2233 (vSAN transport); UDP 12321 (vSAN clustering) |
| iSCSI / NFS | vmk3 | TCP 3260 (iSCSI); TCP/UDP 111, 2049 (NFS) |
| Fault Tolerance (FT) | vmk4 | TCP/UDP 8100, 8200, 8300 (FT logging) |
| Replication | vmk5 | TCP 31031, 44046 (vSphere Replication) |
Only vmk0 is created by default; the other vmk numbers are a common convention, not a requirement.
Best practice:
→ Dedicate separate vmk ports to each service type
→ Use separate physical NICs (pNICs) for management vs vMotion vs vSAN
→ Jumbo frames (MTU 9000) for vMotion and vSAN traffic
→ vSAN and vMotion on 10/25/100 GbE NICs; management can share 1GbE
Networking constructs:
vSwitch (Standard): configured per host — no shared state
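dvSwitch (VDS): configured in vCenter; shared across hosts
→ Not required for vMotion, DRS, or HA (standard switches work if port group names match on every host), but it keeps configuration consistent and is needed for LACP, NetFlow, Network I/O Control, and NSX-T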
→ Port groups: named network segments with policy (VLAN, security, QoS)
NSX-T segments: overlay networks — no VLAN dependency
3. vCenter Server & Administration
How do you manage vSphere via PowerCLI?
# Connect to vCenter:
Connect-VIServer -Server vcenter.domain.local -Credential (Get-Credential)
# List all VMs in a cluster:
Get-Cluster "Production" | Get-VM | Select Name, PowerState, NumCpu, MemoryGB
# Power operations:
Start-VM -VM "WebServer01"
Stop-VMGuest -VM "WebServer01" -Confirm:$false # graceful shutdown via Tools
Stop-VM -VM "WebServer01" -Confirm:$false # power off (hard)
# vMotion — live migrate VM to a different host:
Move-VM -VM "AppServer01" -Destination (Get-VMHost "esxi02.domain.local")
# Storage vMotion — migrate VM disk to a different datastore:
Move-VM -VM "AppServer01" -Datastore (Get-Datastore "SSD-DS-02")
# Create a snapshot (quiesced, no memory):
New-Snapshot -VM "DBServer01" -Name "Pre-Patch-$(Get-Date -f yyyyMMdd)" `
    -Description "Before monthly patching" -Memory:$false -Quiesce:$true
# Remove all snapshots on a VM:
Get-VM "DBServer01" | Get-Snapshot | Remove-Snapshot -Confirm:$false
# VM inventory export:
Get-VM | Select Name, @{N="ToolsVersion";E={$_.ExtensionData.Guest.ToolsVersion}},
@{N="GuestOS";E={$_.ExtensionData.Guest.GuestFullName}},
NumCpu, MemoryGB, ProvisionedSpaceGB |
Export-Csv "vm-inventory.csv" -NoTypeInformation
# List hosts in maintenance mode:
Get-VMHost | Where-Object {$_.ConnectionState -eq "Maintenance"} |
Select Name, ConnectionState, Version
# Get datastore space usage — sorted by used %:
Get-Datastore | Select Name, CapacityGB,
@{N="FreeGB";E={[math]::Round($_.FreeSpaceGB,1)}},
@{N="UsedPct";E={[math]::Round((1-($_.FreeSpaceGB/$_.CapacityGB))*100,1)}} |
Sort UsedPct -Descending
# Place host into maintenance mode (evacuates VMs via vMotion):
Set-VMHost -VMHost "esxi01.domain.local" -State Maintenance
What are the key vCenter components and their dependencies?
vCenter Server Appliance (VCSA) components:
| Service | Port | Purpose |
|---|---|---|
| vSphere Client | 443 | HTML5 web UI |
| vCenter SSO | 443/7444 | Authentication; issues SAML tokens |
| Lookup Service | 7444 | Service registry for vSphere services |
| Inventory Service | 10443 | VM/host/datastore inventory |
| vSphere API (SDK) | 443 | REST + SOAP APIs for automation |
| PostgreSQL DB | 5432 | Embedded DB (events, tasks, inventory) |
| vSAN Health | 8006 | vSAN monitoring service |
| Update Manager | 9084/9087 | vSphere Lifecycle Manager (patching) |
On current VCSA releases most services sit behind the reverse proxy on 443; ports 7444 and 10443 date from older releases.
VCSA sizing (vSphere 7.0 defaults; later releases need slightly more, so check the installation guide for your version):
| Size | Hosts / VMs | vCPU | RAM | Default storage |
|---|---|---|---|---|
| Tiny | up to 10 / 100 | 2 | 12 GB | 415 GB |
| Small | up to 100 / 1,000 | 4 | 19 GB | 480 GB |
| Medium | up to 400 / 4,000 | 8 | 28 GB | 700 GB |
| Large | up to 1,000 / 10,000 | 16 | 37 GB | 1065 GB |
| X-Large | up to 2,000 / 35,000 | 24 | 56 GB | 1805 GB |
vCenter High Availability (vCHA):
→ Active-Passive pair with a Witness node (3 nodes total)
→ Automatic failover if the Active VCSA fails (RTO: ~5 minutes)
→ No impact on running VMs (vSphere HA runs on the hosts and is independent of vCenter)
→ Requires: 3 VMs, shared/stretched network, 250 Mbps between sites
vSphere Lifecycle Manager (vLCM):
→ Manages ESXi patches, updates, firmware (replaces VUM in 7.x+)
→ Image-based management: define the desired ESXi image per cluster
→ Depot: download patches from VMware or import an offline bundle for air-gapped sites
→ Remediation: puts host in maintenance mode → patches → reboots → exits maintenance mode
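Choosing a deployment size from the host and VM limits above is a simple lookup. The sketch below uses only the host/VM ceilings quoted in the sizing list (Small through X-Large); the names `VCSA_LIMITS` and `pick_vcsa_size` are illustrative, not part of any VMware tool.

```python
# (size name, max hosts, max VMs), smallest first, taken from the sizing list.
VCSA_LIMITS = [
    ("Small", 100, 1_000),
    ("Medium", 400, 4_000),
    ("Large", 1_000, 10_000),
    ("X-Large", 2_000, 35_000),
]


def pick_vcsa_size(hosts, vms):
    """Return the smallest deployment size whose host AND VM limits
    both cover the planned inventory."""
    for name, max_hosts, max_vms in VCSA_LIMITS:
        if hosts <= max_hosts and vms <= max_vms:
            return name
    raise ValueError("inventory exceeds the largest single-vCenter size")
```

Note that either dimension can force the next size up: 80 hosts with 1,500 VMs needs Medium even though the host count alone would fit Small.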
4. VMware NSX-T — Network Virtualization
What is NSX-T and what problem does it solve?
NSX-T Data Center decouples networking and security from physical hardware — delivering software-defined networking, micro-segmentation, and consistent policy across on-premises and cloud environments.
Why NSX-T:
→ Traditional VLANs require physical switch changes for every new segment
→ Firewall rules applied at perimeter only — east-west traffic unprotected
→ No consistent networking policy across vSphere, bare-metal, and cloud
→ NSX-T solves all three with overlay networking and distributed security
NSX-T Architecture:
┌─────────────────────────────────────────────────────────────┐
│ NSX Manager (Control Plane + Management Plane) │
│ - 3-node cluster for HA │
│ - Stores policy, pushes config to data plane │
└──────────────────────────────┬──────────────────────────────┘
│ Pushes fabric config
▼
┌─────────────────────────────────────────────────────────────┐
│ Transport Nodes (Data Plane — ESXi hosts + bare-metal) │
│ - NSX kernel modules installed on ESXi │
│ - N-VDS or VDS7+ as the virtual switch │
│ - TEP (Tunnel Endpoint): GENEVE overlay tunnels │
└─────────────────────────────────────────────────────────────┘
Key NSX-T constructs:
Overlay Segments: Layer-2 logical switches — no VLAN dependency
Tier-1 (T1) Router: per-tenant/application gateway
Tier-0 (T0) Router: datacenter edge — BGP/OSPF to physical fabric
Edge Nodes: NSX-T gateways for N-S traffic
Distributed Firewall (DFW): stateful firewall in ESXi kernel — every vNIC
NSX ALB (Avi): L4/L7 load balancer, WAF, GSLB
Micro-segmentation with DFW:
→ Firewall rules enforced at vNIC level — every VM protected
→ Rules follow the VM (vMotion, migrations) — not tied to IP/port
→ Groups: dynamic workload grouping by tag, OS, VM name
→ East-west traffic secured without hairpinning to perimeter firewall
→ Example policy:
Group "WebTier" (tag:Web) → Group "AppTier" (tag:App) → Allow TCP 8080
Group "AppTier" → Group "DBTier" (tag:DB) → Allow TCP 1433
Any → Any → Block (default deny)
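The example policy can be modelled as a first-match rule table over VM tags. This is a simplified sketch of tag-based evaluation, not the NSX API: the `Rule` class, the `"*"` wildcard, and the `evaluate` function are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    src_tag: str          # "*" matches any source
    dst_tag: str          # "*" matches any destination
    port: Optional[int]   # None matches any port
    action: str           # "ALLOW" or "DROP"


# The three rules from the example policy, in evaluation order.
POLICY = [
    Rule("Web", "App", 8080, "ALLOW"),
    Rule("App", "DB", 1433, "ALLOW"),
    Rule("*", "*", None, "DROP"),   # default deny
]


def evaluate(src_tags, dst_tags, port, policy=POLICY):
    """Return the action of the first rule matching the flow."""
    for rule in policy:
        src_ok = rule.src_tag == "*" or rule.src_tag in src_tags
        dst_ok = rule.dst_tag == "*" or rule.dst_tag in dst_tags
        port_ok = rule.port is None or rule.port == port
        if src_ok and dst_ok and port_ok:
            return rule.action
    return "DROP"   # nothing matched: fail closed
```

Because matching is on tags rather than addresses, `evaluate({"Web"}, {"DB"}, 1433)` is dropped even if the web and database VMs share a subnet, which is the point of micro-segmentation.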
How does NSX-T routing work (Tier-0 / Tier-1)?
Two-tier routing model:
Tier-0 (T0) Gateway:
→ Connects the NSX overlay to the physical network (underlay)
→ Runs BGP/OSPF with physical ToR switches
→ Handles: N-S traffic, NAT (SNAT/DNAT), VPN (IPsec/L2VPN)
→ Deployed on Edge Transport Nodes (dedicated VMs or bare-metal)
→ Active-Active or Active-Standby HA modes
Tier-1 (T1) Gateway:
→ Per-application or per-tenant logical router
→ Connected to T0 upstream, overlay segments downstream
→ Handles: inter-segment routing, DHCP relay/server, DNS forwarder
→ Runs distributed on all ESXi transport nodes (low latency)
Traffic flow examples:
VM-to-VM (same segment): switched locally on the host; no router
VM-to-VM (different segments, same T1): routed at the distributed T1 in the host kernel
VM-to-VM (different T1s): routed up to T0, back down to the other T1
VM-to-Internet: T1 SR → T0 SR on Edge → physical → Internet
Route redistribution:
→ T0 advertises NSX overlay prefixes into BGP to the physical fabric
→ Physical fabric advertises default route / external routes into T0
→ T1 routes are auto-redistributed to T0 via an internal protocol
5. vSAN — Software-Defined Storage
What is vSAN and how does it work?
VMware vSAN aggregates local disks from ESXi hosts into a shared distributed datastore — eliminating the need for external SAN or NAS in most workloads.
vSAN architecture:
Each ESXi host contributes: NVMe/SSD (cache) + SSD/HDD (capacity)
→ All-Flash: NVMe cache tier + SSD capacity tier (best performance)
→ Hybrid: SSD cache tier + HDD capacity tier (lower cost)
→ ESA (Express Storage Architecture): all-NVMe, single-tier, vSAN 8.x
vSAN Objects and Components:
VM Storage Objects:
  - VM Home namespace (vmx, logs)
  - VM Swap object
  - VMDK data object (one per disk)
Each object is split into Components distributed across hosts:
  Component = chunk of data stored on a single host's disk group
  Witness = tie-breaker metadata component (no user data)
Storage Policy Based Management (SPBM):
→ Per-VM storage policies, not per-datastore
→ FTT (Failures to Tolerate):
  FTT=1: survives 1 host/disk failure
  FTT=2: survives 2 failures
→ RAID type:
  RAID-1 (mirroring): FTT=1 requires 3 hosts
  RAID-5 (erasure coding): FTT=1 requires 4 hosts (more space-efficient)
  RAID-6 (erasure coding): FTT=2 requires 6 hosts
vSAN Cluster sizing:
Minimum hosts: 3 (FTT=1 RAID-1), 4 (FTT=1 RAID-5), 6 (FTT=2 RAID-6)
Recommended: 6+ hosts for production flexibility
vSAN network: dedicated VMkernel port, 10/25 GbE, MTU 9000
2-node cluster: requires a separate Witness Host (no storage contributed)
vSAN Stretched Cluster:
→ Two active sites + witness site
→ All writes mirrored to both sites (synchronous): RPO = 0
→ Automatic HA failover between sites
→ Requires: < 5ms RTT between sites, < 200ms RTT to witness
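The FTT/RAID choices above translate directly into raw capacity and minimum host counts. A minimal sketch follows, using the multipliers implied by mirroring (2x, 3x) and erasure coding (3+1, 4+2); it deliberately ignores slack space, deduplication, and compression, and the names are illustrative.

```python
# (RAID type, FTT) -> (minimum hosts, raw-to-usable capacity multiplier)
VSAN_POLICIES = {
    ("RAID-1", 1): (3, 2.0),      # two full copies + witness
    ("RAID-5", 1): (4, 4 / 3),    # 3 data + 1 parity
    ("RAID-1", 2): (5, 3.0),      # three full copies
    ("RAID-6", 2): (6, 1.5),      # 4 data + 2 parity
}


def raw_capacity_needed(usable_gb, raid, ftt, host_count):
    """Raw vSAN capacity consumed by `usable_gb` of VM data under a policy.

    Raises ValueError if the cluster has too few hosts for the policy.
    """
    min_hosts, multiplier = VSAN_POLICIES[(raid, ftt)]
    if host_count < min_hosts:
        raise ValueError(f"{raid} with FTT={ftt} needs at least {min_hosts} hosts")
    return usable_gb * multiplier
```

For example, 1000 GB of VM data consumes 2000 GB raw with RAID-1 FTT=1 but only 1500 GB with RAID-6 FTT=2, provided the cluster has the six hosts RAID-6 requires.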
What are vSAN performance monitoring best practices?
Key vSAN metrics (Skyline Health / vSAN Proactive Tests):
Latency: read/write latency (target: <1ms read, <5ms write — all-flash)
Throughput: MB/s per disk group and per VM object
IOPS: read/write IOPS per VM, per host, per disk group
Congestion: >0 congestion events = disk group bottleneck
Resync: % resync traffic — elevated after host failure/add
Cache hit %: flash read cache hit ratio (hybrid tier only)
Common vSAN issues and resolutions:
| Issue | Resolution |
|---|---|
| vSAN component inaccessible | Check host connectivity, disk health |
| High congestion / latency | Check disk group IOPS; add hosts/disks |
| Resync traffic impacting production | Enable resync throttling (IOPS limit) |
| Capacity imbalance between hosts | Rebalance via vSAN proactive rebalance |
| Disk failures | Replace disk; policy compliance auto-restores |
| Snapshot performance degradation | Limit snapshot count; consolidate regularly |
PowerCLI — vSAN health check:
Get-VsanDisk | Select VMHost, CanonicalName, State, Tier, OperationalState |
Where-Object {$_.OperationalState -ne "ok"} |
Format-Table -AutoSize
Get-VsanClusterConfiguration -Cluster "vSAN-Cluster" |
Select SpaceEfficiencyEnabled, EncryptionEnabled, HealthCheckEnabled
6. VMware Tanzu & Cloud-Native
What is VMware Tanzu and how does it fit with vSphere?
VMware Tanzu is VMware's portfolio of products for running, managing, and securing Kubernetes workloads — both on-premises (on vSphere) and across clouds.
Tanzu product breakdown:
| Product | Description |
|---|---|
| TKGs (Tanzu Kubernetes Grid with Supervisor) | K8s clusters provisioned in vSphere Namespaces; vSphere 7/8 required; NSX-T or NSX ALB for load balancing |
| TKGi (Tanzu Kubernetes Grid Integrated) | Enterprise K8s with BOSH; air-gapped, FIPS; the path for existing PKS deployments (TKGi is the renamed Enterprise PKS) |
| Tanzu Mission Control (TMC) | Centralised multi-cluster K8s management (SaaS); policy, compliance, cost visibility |
| Tanzu Application Platform (TAP) | Developer-focused app platform on K8s; supply chain security, app accelerators |
| Tanzu Service Mesh | Service mesh for microservices (Global Namespace) |
vSphere with Tanzu (Workload Control Plane — WCP):
→ vSphere 7+ Supervisor feature: run K8s natively in vCenter
→ vSphere Namespace: RBAC-controlled project space for K8s clusters
→ Storage: vSAN/vVols as PersistentVolumes via CNS (Container Native Storage)
→ Networking: NSX-T recommended; VDS 7+ also supported
Key concepts:
Supervisor Cluster: vSphere control plane K8s cluster (runs on ESXi hosts)
Tanzu K8s Cluster: guest cluster in a vSphere Namespace for workloads
VM Class: pre-defined VM sizes (best-effort, guaranteed)
Storage Policy: maps to vSAN SPBM policy for PVC provisioning
Load Balancer: NSX ALB or HAProxy for external service exposure
kubectl on vSphere:
# Login to Supervisor:
kubectl vsphere login --vsphere-username admin@vsphere.local \
--server wcp.domain.local
# List vSphere Namespaces:
kubectl get namespaces
# Apply a Tanzu K8s Cluster manifest:
kubectl apply -f tanzu-cluster.yaml
# Access workload cluster:
kubectl vsphere login --server wcp.domain.local \
--vsphere-username admin@vsphere.local \
--tanzu-kubernetes-cluster-name prod-cluster \
--tanzu-kubernetes-cluster-namespace dev-namespace
7. VMware Cloud Foundation & Multi-Cloud
What is VMware Cloud Foundation (VCF) and what does it include?
VCF = VMware's integrated private cloud platform
→ Bundles: vSphere, vSAN, NSX-T, Aria Suite under one subscription
→ Deployed on validated hardware (VxRail, off-the-shelf servers)
→ Managed by: SDDC Manager (lifecycle automation, brownfield import)
VCF components:
┌─────────────────────────────────────────────────────────────────────┐
│ SDDC Manager: orchestrates bring-up, patching, scaling              │
├─────────────────────────────────────────────────────────────────────┤
│ vCenter Server (management) │ vSAN (storage) │ NSX-T (networking)   │
├─────────────────────────────────────────────────────────────────────┤
│ Aria Operations │ Aria Automation │ Aria Log Insight │ Aria Networks│
└─────────────────────────────────────────────────────────────────────┘
VCF domains:
Management Domain: vCenter, NSX Manager, SDDC Manager, Aria Suite
Workload Domains: isolated vSphere+vSAN+NSX-T cluster for production VMs
VI Workload Domain: standard VM workloads
VVD (VMware Validated Design): pre-tested reference architecture for specific use cases (a design guide, not a domain type)
Bring-up workflow:
1. Physical server preparation (BIOS settings, NIC connectivity verified)
2. Cloud Builder VM: orchestrates automated VCF bring-up (JSON spec input)
3. SDDC Manager deployed → manages ESXi patching, domain expansion
4. Workload domain: add hosts → SDDC Manager creates vSphere+vSAN+NSX cluster
VMware HCX (Hybrid Cloud Extension):
→ Application mobility across any vSphere environment or VMware Cloud
→ Migration types:
  Cold migration: offline VM copy; no vMotion required
  vMotion: live migration (zero downtime) up to 150ms RTT
  Bulk migration: mass move with a maintenance window (brief reboot at cutover)
  Replication-assisted vMotion (RAV): pre-seeds data, then live cutover
→ Components: HCX Manager, Service Mesh, Interconnect, WAN Optimisation, Network Extension
→ Network Extension: stretch an L2 VLAN/NSX segment between sites; no IP renumbering
VMware Cloud on AWS (VMC):
→ VMware-managed vSphere SDDC running on dedicated AWS bare-metal
→ Same vSphere/NSX-T/vSAN stack: no re-training required
→ Native access to AWS services (RDS, S3, ELB in same AZ)
→ HCX included: migrate VMs from on-prem to VMC non-disruptively
→ Use case: DR, datacenter extension, cloud bursting
8. Scenario-Based Questions
Immediate HA response (automatic):
- Failure detection: surviving ESXi hosts stop receiving heartbeats from the failed host (default 10 seconds). vCenter marks the host as "Not Responding."
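- HA Master election: the HA Master (primary) host, elected by the HA agents over the management network with preference for the host that sees the most datastores, takes ownership of recovery. If the failed host was the Master, the survivors elect a new one first.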
- VM restart: the Master instructs surviving hosts to restart VMs from the failed host. Restart order follows VM restart priority (High → Medium → Low → Disabled).
- Admission Control validation: HA checks reserved capacity (N+1 design). If sufficient, all VMs restart within 1–3 minutes.
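The restart ordering described above can be sketched as a stable sort that skips disabled VMs. This uses the four priority levels named in the text (newer vSphere releases add Highest and Lowest); the function and variable names are illustrative only.

```python
# Lower number = restarted earlier. "Disabled" VMs are never restarted by HA.
RESTART_ORDER = {"High": 0, "Medium": 1, "Low": 2}


def restart_plan(vms):
    """Order (name, priority) pairs the way HA would restart them.

    The sort is stable, so VMs with equal priority keep their input order.
    """
    eligible = [(name, prio) for name, prio in vms if prio != "Disabled"]
    eligible.sort(key=lambda vm: RESTART_ORDER[vm[1]])
    return [name for name, _ in eligible]
```

Given `[("web", "Low"), ("db", "High"), ("batch", "Disabled"), ("app", "Medium"), ("dc", "High")]`, the plan is `["db", "dc", "app", "web"]`, with `batch` left powered off.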
Post-failure remediation:
- Identify root cause: check iDRAC/iLO hardware alerts, DCUI on host, vCenter Alarms, Aria Log Insight for ESXi syslog around failure time.
- Host replacement/repair: replace failed components (PSU, NIC, disk). If connectivity issue, check ToR switch and vmk0 management network.
- Reintroduce host: reconnect host to vCenter. vSphere Lifecycle Manager (vLCM) remediates host to cluster image baseline.
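- vSAN resync: in a vSAN cluster, components on the failed host are marked absent; if the host has not returned after the repair delay (60 minutes by default), vSAN rebuilds them on surviving hosts to restore FTT compliance. Monitor resync completion in vSAN Health.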
- Post-mortem: document timeline, RCA, and verify HA design covers future single-host failures.
Primary datacenter design:
- Compute: 6 ESXi hosts per cluster (N+2 for HA; DRS Fully Automatic). Two clusters: Management (vCenter, NSX Manager, Aria) + Workload (production VMs). All hosts: 2× 25GbE (vSAN + NSX-T TEP), 2× 10GbE (management + vMotion).
- Storage: vSAN All-Flash on workload hosts. Policy: FTT=1 RAID-5 for standard workloads, FTT=2 RAID-6 for Tier-1 databases. vSAN stretched cluster for RPO=0 if budget permits.
- Networking: NSX-T Data Center. Tier-0 BGP peering to ToR switches (ECMP for bandwidth). Tier-1 per application tier. DFW micro-segmentation: default deny east-west. NSX ALB for application load balancing.
- Management: VCSA in vCHA (3-node HA) on management cluster. NSX Manager 3-node cluster. Aria Operations for performance monitoring and capacity planning.
DR site design:
- Site Recovery Manager (SRM): policy-based replication and automated failover orchestration. Recovery Plans define failover order, IP customisation, pre/post scripts.
- vSphere Replication: RPO 5-minute minimum per VM (no shared storage required). Configure replication groups matching application tiers.
- RTO/RPO targets: RTO 4 hours (SRM automated failover + smoke tests), RPO 15 minutes. Tier-1: RPO 5 minutes.
- DR testing: SRM supports non-disruptive test failovers (isolated network bubble). Test quarterly; review recovery plan annually.
- Assessment: inventory VM sizes, dependencies, storage consumption, and network segments using RVTools and vSphere Replication assessment.
- HCX deployment: deploy HCX Connector on-premises (OVA), activate with VMC HCX Cloud Manager. Create Service Mesh (interconnect, WAN optimisation, network extension).
- Network Extension: extend on-premises VLANs/NSX segments to VMC via HCX Network Extension. VMs retain IP addresses — no renumbering during migration.
- Migration wave planning: group VMs by application dependency (DB before App, App before Web). Schedule maintenance windows for stateful workloads.
- Migration execution:
- Tier-3 (dev/test): Bulk Migration — replicate overnight, cutover in window
- Tier-2 (standard): RAV — pre-seed data, live cutover under 1 minute
- Tier-1 (business-critical): vMotion — zero downtime live migration
- DNS/LB cutover: update DNS records and load balancer backends to VMC IPs after each wave. Validate application health before proceeding.
- Network extension retirement: once all VMs migrated, unextend segments, update routing, decommission HCX Network Extension.
- Post-migration: right-size VMs using Aria Operations recommendations. Apply VMC-appropriate storage policies. Remove HCX Service Mesh.
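Migration wave planning by dependency (DB before App, App before Web) is a topological sort. The sketch below uses Python's standard `graphlib` to group VMs into waves; the VM names and the `plan_waves` function are hypothetical, and real wave planning would also weigh maintenance windows and bandwidth.

```python
from graphlib import TopologicalSorter


def plan_waves(depends_on):
    """Group VMs into migration waves.

    `depends_on` maps each VM to the set of VMs that must migrate first.
    Every VM in a wave has all its dependencies in earlier waves.
    Raises graphlib.CycleError on circular dependencies.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

For `{"db01": set(), "app01": {"db01"}, "web01": {"app01"}, "web02": {"app01"}}` this yields three waves: the database, then the app server, then both web servers together.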
Step 1 — VM layer:
→ Ping default gateway from within VM
→ Check NIC driver and VMware Tools version
→ Verify IP/subnet/gateway configuration
Step 2 — NSX-T logical port:
→ NSX Manager → Networking → Segments → find VM's segment → check logical port
→ Verify port state: UP, Admin State: UP, Attachment: correct VM
→ API: GET /api/v1/logical-ports?attachment_id={vif-id} (the attachment ID is the vNIC's VIF ID, not the VM ID)
Step 3 — Distributed Firewall:
→ NSX Manager → Security → Distributed Firewall → check flow analysis
→ Enable DFW logging on suspect rules temporarily
→ On ESXi host:
vsipioctl getrules -f nic:xxxx (DFW rules applied to vNIC)
vsipioctl getflows -f nic:xxxx (active connections)
Step 4 — Segment and T1 Gateway:
→ Check T1 routing table: NSX Manager → Tier-1 → Routes
→ Verify segment is attached to correct T1
→ Test: SSH to T1 edge SR node → ping VM gateway IP
Step 5 — T0 / Edge Node:
→ Check T0 BGP status: NSX Manager → Tier-0 → BGP Neighbours
→ Verify routes propagated to physical switches
→ Test N-S: ping from VM to external IP → check SNAT rules
Step 6 — Transport Node (ESXi):
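→ NSX Manager → System → Fabric → Transport Nodes → Health (check TEP and tunnel status)
→ On host: vmkping ++netstack=vxlan -d -s 1572 <remote-TEP-IP> (tests TEP reachability and MTU; the TEP netstack keeps the name "vxlan" although NSX-T encapsulates with GENEVE)
→ Packet capture on the TEP vmkernel interface: pktcap-uw --dir 2 --vmk vmk10 (substitute your TEP vmk; vmk10 is common)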
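When checking TEP connectivity, a do-not-fragment ping only proves the path MTU if its payload is sized correctly. The sketch below shows the arithmetic, assuming IPv4 (20-byte IP header, 8-byte ICMP header) and the commonly cited guidance of at least 100 bytes of underlay headroom for GENEVE; the function names and the 100-byte default are assumptions for illustration.

```python
IPV4_HEADER_BYTES = 20
ICMP_HEADER_BYTES = 8


def df_ping_payload(mtu):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet
    of size `mtu`."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def overlay_fits(guest_mtu, underlay_mtu, encap_headroom=100):
    """True if the underlay leaves enough room for the encapsulation
    headers on top of a full-size guest frame."""
    return underlay_mtu >= guest_mtu + encap_headroom
```

So an underlay MTU of 1600 is tested with a 1572-byte payload, and a 9000-byte jumbo path with 8972 bytes. A guest MTU of 1500 over a 1500-byte underlay fails `overlay_fits`, which is the classic cause of overlay traffic that pings fine but stalls on large transfers.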
9. Cheat Sheet — Quick Reference
VMware Component Selection Guide
| Requirement | Component |
|---|---|
| Server virtualisation | vSphere (ESXi + vCenter) |
| Shared storage (all-flash, hyperconverged) | vSAN All-Flash / ESA |
| Network virtualisation + micro-segmentation | NSX-T Data Center |
| Load balancing (L4/L7, WAF, GSLB) | NSX Advanced LB (Avi) |
| Kubernetes on vSphere | vSphere with Tanzu (WCP) |
| Multi-cluster K8s management | Tanzu Mission Control (TMC) |
| VM performance monitoring / forecasting | Aria Operations |
| Log aggregation (vSphere/NSX/guest) | Aria Log Insight |
| IaC / self-service cloud automation | Aria Automation |
| Network visibility and flow analysis | Aria Networks (vRNI) |
| Live VM migration between sites | vMotion / HCX vMotion |
| Mass migration on-prem → VMC/cloud | HCX Bulk / RAV |
| DR orchestration + automated failover | Site Recovery Manager (SRM) |
| Replication (no shared storage required) | vSphere Replication |
| Integrated SDDC lifecycle management | VMware Cloud Foundation (VCF) |
| VDI / virtual desktops | VMware Horizon |
vSphere HA / DRS / DPM / FT Summary
| Feature | Trigger | Action |
|---|---|---|
| HA | Host heartbeat loss (10s) | Restart VMs on surviving hosts |
| HA App Monitor | VMware Tools heartbeat loss | Restart VM OS |
| DRS | CPU/Mem imbalance (every 5 min) | vMotion VMs to balance hosts |
| DRS Rules | Affinity policy defined | Keep / separate VMs on hosts |
| DPM | Low cluster utilisation | Power off hosts; wake on demand |
| FT | Host failure | Instant failover to secondary (shadow) VM (RPO=0, RTO≈0) |
ESXi Networking Quick Reference
Standard vSwitch (VSS):
→ Per-host config: no central management
→ No distributed features; port groups must be created with identical names on every host for vMotion to work
→ Use for: management vmk0 on standalone/small hosts only
Distributed vSwitch (VDS):
→ Managed centrally in vCenter: config pushed to all member hosts
→ Not strictly required for DRS, vMotion, or HA, but it keeps port groups consistent across hosts
→ LACP, LLDP, NetFlow, port mirroring support
→ Use for: all production clusters
NSX-T N-VDS / VDS7+:
→ Overlay networking: VMs on virtual segments (no VLAN dependency)
→ GENEVE encapsulation for TEP tunnels (replaces VXLAN in NSX-T)
→ vSphere 7.x+: NSX-T runs on VDS 7.0, so a single switch serves both
NIC Teaming policies (VDS):
Route based on originating port: default; good for most workloads
Route based on IP hash: requires a static EtherChannel or LACP on the physical switch
Route based on physical NIC load: LBT; VDS only; dynamic balancing
Use explicit failover order: active-passive, deterministic
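To see why "Route based on IP hash" spreads a single VM's traffic across uplinks only when it talks to several destinations, consider a simplified model of the hash: XOR the source and destination IPv4 addresses and take the result modulo the number of uplinks. This is an illustrative sketch of that commonly documented approach, not the exact ESXi implementation; the function name is hypothetical.

```python
import ipaddress


def ip_hash_uplink(src_ip, dst_ip, uplink_count):
    """Pick an uplink index for a flow: (src XOR dst) mod uplink_count."""
    if uplink_count < 1:
        raise ValueError("need at least one uplink")
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % uplink_count
```

With two uplinks, `ip_hash_uplink("10.0.0.1", "10.0.0.2", 2)` and `ip_hash_uplink("10.0.0.1", "10.0.0.3", 2)` land on different uplinks, while every packet of one source/destination pair always uses the same uplink. That per-flow stickiness is also why the physical switch must treat the uplinks as one port channel.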
PowerCLI Quick Reference
# Connect / Disconnect:
Connect-VIServer -Server vcenter.domain.local
Disconnect-VIServer -Confirm:$false
# VM management:
Get-VM | Where PowerState -eq "PoweredOff"
Get-VM "VMName" | Start-VM
Get-VM "VMName" | Restart-VMGuest
Get-VM "VMName" | Get-Snapshot | Remove-Snapshot -Confirm:$false
Move-VM -VM "VMName" -Destination (Get-VMHost "esxi02")
Move-VM -VM "VMName" -Datastore (Get-Datastore "DS01")
# Host management:
Get-VMHost | Select Name, ConnectionState, Version, NumCpu, MemoryTotalGB
Set-VMHost -VMHost "esxi01" -State Maintenance
Get-VMHost -State Maintenance | Set-VMHost -State Connected
# Inventory and reporting:
Get-VM | Select Name, NumCpu, MemoryGB, ProvisionedSpaceGB |
Export-Csv "vm-report.csv" -NoTypeInformation
Get-Datastore | Select Name, CapacityGB, FreeSpaceGB |
Sort FreeSpaceGB | Format-Table
# vSAN health:
Get-VsanDisk | Where OperationalState -ne "ok"
Get-VsanClusterConfiguration -Cluster "vSAN-Cluster"
# Alarms and events:
Get-AlarmDefinition | Where Enabled -eq $true
Get-VIEvent -MaxSamples 100 -Types Error | Select CreatedTime, FullFormattedMessage
vSAN Storage Policy Reference
| FTT | RAID Type | Min Hosts | Overhead | Use Case |
|---|---|---|---|---|
| 1 | RAID-1 | 3 | 100% | Small clusters, Tier-1 (mirroring) |
| 1 | RAID-5 | 4 | 33% | Standard production (efficient) |
| 2 | RAID-1 | 5 | 200% | Highest resilience, Tier-0 |
| 2 | RAID-6 | 6 | 50% | Balanced resilience + efficiency |
| 0 | None | 1 | 0% | Test/Dev only — no protection |
Top 10 VMware Tips
- vSphere is not just ESXi. vSphere is the platform (ESXi + vCenter + features). Always distinguish ESXi (hypervisor), VCSA (management), and vSAN/NSX-T (infrastructure) when designing or troubleshooting solutions.
- Snapshots are not backups. Snapshots consume disk space proportional to VM change rate and degrade performance over time. Use a dedicated backup solution (Veeam, VADP-based) for data protection. Consolidate snapshots before every maintenance window.
- EVC protects vMotion across CPU generations. Set EVC at cluster creation, before adding hosts. You can raise EVC mode but cannot lower it while VMs are powered on. Baseline to the oldest CPU family in the cluster.
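- Jumbo frames on the vSAN network must be end-to-end or not used at all. Jumbo frames (MTU 9000) are recommended for vSAN but not required. If you enable them, configure MTU 9000 on the VMkernel port, the VDS, ToR switch ports, and all inter-switch links. An MTU mismatch silently drops frames larger than the smallest MTU in the path, which is often misdiagnosed as disk failure.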
- NSX-T DFW follows the VM — not the IP. Firewall rules applied to Security Groups (dynamic by VM tag/name) stay with VMs through vMotion, DR failover, and migrations. Design security policy around workload identity, not network location.
- vCenter HA is not the same as vSphere HA. vSphere HA restarts VMs on host failure. vCenter HA protects the VCSA management appliance itself. Both coexist independently. Running VMs keep running during a vCenter outage and vSphere HA still restarts them, but DRS and other vCenter-driven operations pause until vCenter returns.
- HCX is the safest migration path for large-scale moves. HCX Network Extension eliminates IP renumbering. Use RAV when many VMs must move without downtime: replication pre-seeds the data and the final cutover is a live vMotion-style switchover.
- SPBM decouples storage policy from infrastructure. Define VM storage requirements (FTT, RAID, IOPS) in a policy; vSAN enforces it automatically. When hosts are added, vSAN rebalances to meet all policies without manual intervention.
- VCF SDDC Manager owns lifecycle management. In VCF deployments, never patch ESXi, vCenter, or NSX manually. SDDC Manager orchestrates the patching sequence (SDDC Manager → NSX → vCenter → ESXi, with vSAN updated as part of ESXi) to maintain component compatibility. Manual patching breaks SDDC Manager state.
- Monitor capacity proactively with Aria Operations. vSAN "time remaining" and Aria capacity forecasting give weeks of notice before exhaustion. Set alarms at 70% capacity threshold (not 90%) to allow orderly expansion without emergency procurement.
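The idea behind capacity forecasting against a 70% threshold can be sketched with a least-squares trend line. This is a deliberately simple linear model, not how Aria Operations computes "time remaining"; the function name and sample format are assumptions for illustration.

```python
def days_until_threshold(samples, capacity_gb, threshold=0.70):
    """Estimate days until usage crosses `threshold` of capacity.

    `samples` is a list of (day, used_gb) pairs ordered by day.
    Returns 0.0 if already over the threshold, or None if usage is
    flat or shrinking (no crossing predicted).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_day = sum(day for day, _ in samples) / n
    mean_used = sum(used for _, used in samples) / n
    sxx = sum((day - mean_day) ** 2 for day, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span at least two distinct days")
    sxy = sum((day - mean_day) * (used - mean_used) for day, used in samples)
    slope = sxy / sxx                       # GB per day
    intercept = mean_used - slope * mean_day

    last_day, last_used = samples[-1]
    target_gb = threshold * capacity_gb
    if last_used >= target_gb:
        return 0.0
    if slope <= 0:
        return None
    return max(0.0, (target_gb - intercept) / slope - last_day)
```

A 1000 GB datastore growing from 500 GB to 600 GB over 20 days (5 GB/day) reaches the 700 GB mark in about 20 more days; with a 90% threshold the same trend would leave roughly 60 days, but far less room to procure and install capacity once the alarm fires.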