VMware vSphere:

Troubleshooting Workshop
Lecture Manual
ESXi 6.5 and vCenter Server 6.5

VMware® Education Services


VMware, Inc.
www.vmware.com/education
Part Number EDU-EN-VTSW65-LECT (6/2017)
Copyright © 2017 VMware, Inc. All rights reserved. This manual and its accompanying materials
are protected by U.S. and international copyright and intellectual property laws. VMware products
are covered by one or more patents listed at http://www.vmware.com/go/patents. VMware is a
registered trademark or trademark of VMware, Inc. in the United States and/or other jurisdictions.
All other marks and names mentioned herein may be trademarks of their respective companies.
The training material is provided "as is," and all express or implied conditions, representations,
and warranties, including any implied warranty of merchantability, fitness for a particular purpose
or noninfringement, are disclaimed, even if VMware, Inc., has been advised of the possibility of
such claims. This training material is designed to support an instructor-led training course and is
intended to be used for reference purposes in conjunction with the instructor-led training course.
The training material is not a standalone training tool. Use of the training material for self-study
without class attendance is not recommended.
These materials and the computer programs to which they relate are the property of, and embody
trade secrets and confidential information proprietary to, VMware, Inc., and may not be
reproduced, copied, disclosed, transferred, adapted or modified without the express written
approval of VMware, Inc.

CONTENTS

MODULE 1 Course Introduction
Importance
Learner Objectives
You Are Here
Typographical Conventions
References (1)
References (2)
References (3)
VMware Online Resources
VMware Certification
VMware Education Overview

MODULE 2 Introduction to Troubleshooting
You Are Here
Importance
Learner Objectives
Troubleshooting Process
Definition of a System Problem
Effects of a System Problem
Collecting Symptoms of a Problem
Gathering Supplemental Information
Viewing and Interpreting Diagnostic Information
Identifying Possible Causes and Taking an Appropriate Approach
Determining the Root Cause
Resolving the Problem
Example Scenario: Defining the Problem
Example: Gathering Information
Example: Identifying Possible Causes
Example: Determining the Root Cause
Example: Resolving the Problem
Review of Learner Objectives
Key Points

MODULE 3 Troubleshooting Tools
You Are Here
Importance
Module Lessons
Lesson 1: Command Line
Learner Objectives
Methods to Run Commands
Accessing vSphere ESXi Shell
vSphere ESXi Shell and SSH Timeouts
vSphere ESXi Shell and SSH Timeouts (2)
ESXCLI Commands
Viewing vSphere Storage Information
Viewing vSphere Network Information
Viewing Standard Switch Information
Viewing Distributed Switch Information
Viewing Hardware Information
Lab 1: Using the Command Line
Review of Learner Objectives
Lesson 2: vSphere Management Assistant
Learner Objectives
vSphere Management Assistant Components
Configuring vSphere Management Assistant for AD Authentication
Adding vSphere Management Assistant to Active Directory
vicfg-* Commands
vmware-cmd Command
Viewing Virtual Machine Information
Viewing Snapshot Information
Direct Console, SSH, or vSphere Management Assistant
Lab 2: Adding vSphere Management Assistant to Active Directory
Review of Learner Objectives
Lesson 3: Logging, Log Files, and vRealize Log Insight
Learner Objectives
Location of vCenter Server Logs
Common Logs
Management Node Logs
Platform Services Controller Logs
Important vCenter Server Logs for Troubleshooting
Viewing vCenter Server Log Files in vSphere Web Client
Location of ESXi Host Logs
Useful ESXi Host Logs for Troubleshooting
Viewing Log Files in the DCUI
vSphere Syslog Collector
vRealize Log Insight
Searching and Filtering Log Events
Analyzing Logs with the Interactive Analytics Charts
Dynamic Field Extraction
Troubleshooting Using Customized Dashboards
Monitoring Log Events and Sending Alerts
Lab 3: Searching Log Files
Lab 4: Searching Log Files
Review of Learner Objectives
Key Points

MODULE 4 Troubleshooting Virtual Networking
You Are Here
Importance
Learner Objectives
Networking Troubleshooting Overview
Review of Standard Switch
Network Problem 1
Identifying Possible Causes
Possible Cause: ESXi Network Misconfiguration (1)
Possible Cause: ESXi Network Misconfiguration (2)
Resolving ESXi Network Misconfiguration
Possible Cause: NIC Teaming Misconfiguration
Possible Cause: Unsupported or Faulty Hardware
Possible Cause: Slow Network Performance
Review of Virtual Machine Connectivity
Network Problem 2
Identifying Possible Causes
Possible Cause: IP Settings and Firewall Problems
Possible Cause: Port Group Misconfiguration
Possible Cause: ESXi Network Connectivity Problems
Network Problem 3
Heartbeat Communication Between vCenter Server and ESXi
Identifying Possible Causes
Possible Cause: Port Blocked by Firewall
IPTables Firewall
Possible Cause: vCenter Server Not Using Port 902
Resolving the Use of a Port Other Than 902 (1)
Resolving the Use of a Port Other Than 902 (2)
Resolving Network Congestion
Network Problem 4
Preventing Loss of Management Network Connectivity
Host Networking Rollback
Recovering a Lost Management Network: Standard Switch
Network Restore Options in the DCUI
Review of Distributed Switch Network Connectivity
Distributed Switch Rollback
Recovering from a Distributed Switch Misconfiguration
Lab 5: Troubleshooting Network Problems
Review of Learner Objectives
Key Points

MODULE 5 Troubleshooting Storage
You Are Here
Importance
Module Lessons
Lesson 1: Storage Connectivity and Configuration
Learner Objectives
Review of vSphere Storage Architecture
Review of iSCSI Storage
Storage Problem 1
Identifying Possible Causes
Possible Cause: Hardware-Level Problems
Possible Cause: Poor iSCSI Storage Performance
Possible Cause: VMkernel Interface Misconfiguration
Possible Cause: iSCSI HBA Misconfiguration (1)
Possible Cause: iSCSI HBA Misconfiguration (2)
Possible Cause: iSCSI HBA Misconfiguration (3)
Possible Cause: iSCSI HBA Misconfiguration (4)
Possible Cause: iSCSI HBA Misconfiguration (5)
Possible Cause: iSCSI HBA Misconfiguration (6)
Possible Cause: iSCSI HBA Misconfiguration (7)
Possible Cause: Port Unreachable
Possible Cause: VMFS Metadata Inconsistency
Use vSphere On-Disk Metadata Analyzer (1)
Use vSphere On-Disk Metadata Analyzer (2)
Use vSphere On-Disk Metadata Analyzer (3)
Possible Cause: NFS Misconfiguration
NFS Version Compatibility with Other vSphere Technologies
NFS Dual Stack Not Supported
NFS Client Authentication
Configuring Active Directory and NFS Servers to Use Kerberos
Configuring Host Time Synchronization
Configuring Host Authentication Services
Configuring the Datastore to Use Kerberos
Viewing Session Information
Review of Learner Objectives
Lesson 2: Multipathing
Learner Objectives
Review of iSCSI Multipathing
Storage Problem 2
Identifying Possible Causes
PDL Condition
Recovering from an Unplanned PDL
APD Condition
Recovering from an APD Condition
Possible Cause: NIC Teaming Misconfiguration
Possible Cause: Path Selection Policy Misconfiguration
Possible Cause: NFSv3 and v4.1 Misconfiguration
Lab 6: Troubleshooting Storage Problems
Review of Learner Objectives
Lesson 3: vSAN and Virtual Volumes
Learner Objectives
Review of vSAN
vSAN Troubleshooting Tools
vSAN Disk Query
vSAN Problem 1
vSAN Problem 2
vSAN Problem 3
vSAN Problem 4
vSAN Problem 5
Review of vSphere Virtual Volumes
vSphere Virtual Volume Object Types
About Protocol Endpoints
About Storage Containers
Bidirectional Discovery Process
Troubleshooting vSphere Virtual Volumes
vSphere Virtual Volumes Problem 1
vSphere Virtual Volumes Problem 2
vSphere Virtual Volumes Problem 3
vSphere Virtual Volumes Problem 4
Review of Learner Objectives
Key Points

MODULE 6 Troubleshooting vSphere Clusters
You Are Here
Importance
Learner Objectives
Review of vSphere HA
vSphere HA Problem 1
Identifying Possible Causes
Possible Cause: Improper Configuration of vSphere HA
Possible Cause: Heartbeat Datastore Inaccessible
Possible Cause: Failure to Install FDM Agent on ESXi Host (1)
Possible Cause: Failure to Install FDM Agent on ESXi Host (2)
Possible Cause: Loss of Network Connectivity
vSphere HA Problem 2
Identifying Possible Causes
Possible Cause: Insufficient Physical Resources
Bandwidth Reservation
Possible Cause: Excessive Virtual Machine Reservations (1)
Possible Cause: Excessive Virtual Machine Reservations (2)
High Availability Configuration
Possible Cause: Admission Control Policy Misconfiguration
vSphere HA Cluster: Admission Control Guidelines
Example of Calculating Slot Size
Applying Slot Size
Distorted Slot Size
Reserving a Percentage of Cluster Resources
Calculating Current Failover Capacity
Using VMCP
Useful Troubleshooting Commands
Cluster Utilization Graph
Review of vSphere vMotion
vSphere vMotion TCP/IP Stacks
Use esxcli to Display vMotion Network Information
Long Distance vMotion
Cross vCenter Server vMotion
vSphere vMotion Problem 1
Identifying Possible Causes
Possible Cause: VMkernel Interface Misconfiguration
Possible Cause: Invalid Name Resolution on the Host
Possible Cause: Required Disk Space Not Available
Possible Cause: Reservation Requirements Not Met
Possible Cause: log.rotateSize Set to Low Value
Resetting Migrate.Enabled
vSphere vMotion Problem 2
Possible Cause: vSphere DRS Configuration
Possible Cause: Configuration Problems
Lab 7: Troubleshooting Cluster Problems
Review of Learner Objectives
Key Points

MODULE 7 Troubleshooting Virtual Machines
You Are Here
Importance
Learner Objectives
Review of Virtual Machine Files
Disk Content IDs
Virtual Machine Problem 1
CID Mismatch Example
Resolving a CID Mismatch
Virtual Machine Problem 2
Resolving Quiesced Snapshot Failure
Virtual Machine Problem 3
Identifying Possible Causes
Possible Cause: No Permissions to Create Snapshots
Possible Cause: Missing Delta Descriptor File
Possible Cause: Insufficient Space on Datastore
Virtual Machine Problem 4
Identifying Possible Causes
Possible Cause: Virtual Machine Files Missing
Possible Cause: Virtual Machine File Locked
Resolving a Locked Virtual Machine File
Possible Cause: Insufficient Resources on ESXi Host
Possible Cause: ESXi Host Unresponsive
Review of Virtual Machine Connection States
Virtual Machine Problem 5
Identifying Possible Causes
Possible Cause: vSphere vMotion or vSphere DRS Migration Occurred
Possible Cause: VM Deleted Outside vCenter Server
Possible Cause: Special Characters in the .vmx File
Recovering from an Invalid or Orphaned Virtual Machine
Virtual Machine Problem 6
Identifying Possible Causes
Possible Cause: Wrong Guest Operating System
Possible Cause: ISO Image Not Being Loaded
Possible Cause: ISO Image Cannot Be Found
Possible Cause: VMware Tools ISO Image Corrupt
Lab 8: Troubleshooting Virtual Machine Problems
Review of Learner Objectives
Key Points

MODULE 8 Troubleshooting vCenter Server and ESXi
You Are Here
Importance
Learner Objectives
Review of vSphere 6.x Deployment Modes
vCenter Server Deployment Options
Platform Services Controller Deployment Options
Review of vCenter Single Sign-On
VMware CA
VMware Certificate Store
Trust and Certificates (1)
Trust and Certificates (2)
Chain of Trust (1)
Chain of Trust (2)
Chain of Trust (3)
Multinode Chains of Trust
Certificate Problem
vCenter Server Problem 1
vCenter Server Problem 2
Use the VMware Appliance Management Console
Growth of the vCenter Server Database
vCenter Server Database Tables That Typically Grow
Rollup Jobs Control Growth
Query the Status of Rollup Jobs on MS SQL Server
Verifying the Size of the Database Tables
Resolving Performance Data Growth Issues
PostgreSQL Database Out of Space
Set the Statistics Level
Modify the Database Settings
Reinitializing the vCenter Server Database
Other PostgreSQL Troubleshooting
Accessing the vCenter Server Appliance Shell
Configuring Access Settings
Log in to the Appliance Shell
Querying Service Status and Restarting Services
Using API Commands and Plug-Ins from the Appliance Shell
ESXi Problem 1
Verifying That the ESXi Host Has Crashed
Recovering from a Purple Diagnostic Screen Crash
ESXi Problem 2
Verifying That the ESXi Host Has Stopped Responding
Recovering from an ESXi Host Failure
Lab 9: Managing the PostgreSQL Database
Lab 10: Troubleshooting vCenter Server and ESXi Host Problems
Lab 11: (Optional) Working with Certificates
Review of Learner Objectives
Key Points (1)
Key Points (2)

MODULE 1
Course Introduction
Slide 1-1

Module 1

VMware vSphere:
Troubleshooting Workshop 6.5

Importance
Slide 1-2

VMware vSphere® administrators should be able to troubleshoot various
vSphere problems caused by misconfigurations and system failures.

Learner Objectives
Slide 1-3

By the end of this course, you should be able to meet the following
objectives:
• Use VMware vSphere® Web Client, the command line, and log files to
configure or diagnose and correct problems in vSphere
• Troubleshoot networking problems
• Troubleshoot storage problems
• Troubleshoot VMware vSphere® High Availability problems
• Troubleshoot VMware vSphere® Distributed Resource Scheduler™ problems
• Troubleshoot VMware vSphere® vMotion® problems
• Troubleshoot VMware vCenter Server® problems
• Troubleshoot VMware vCenter® Single Sign-On and certificate problems
• Troubleshoot VMware ESXi™ host problems
• Troubleshoot virtual machine problems

You Are Here
Slide 1-4

1. Course Introduction
2. Introduction to Troubleshooting
3. Troubleshooting Tools
4. Troubleshooting Virtual Networking
5. Troubleshooting Storage
6. Troubleshooting vSphere Clusters
7. Troubleshooting Virtual Machines
8. Troubleshooting vCenter Server and ESXi

Typographical Conventions
Slide 1-5

The following typographical conventions are used in this course.


Monospace
    Filenames, folder names, path names, command names, and code:
    Navigate to the VMS folder.

Monospace Bold
    What the user inputs:
    Enter ipconfig /release.

Boldface
    User interface controls:
    Click the Configuration tab.

Italic
    Book titles and placeholder variables:
    • vSphere Virtual Machine Administration
    • ESXi-host-name

References (1)
Slide 1-6

• vSphere Troubleshooting: http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-troubleshooting-guide.pdf
• vCenter Server and Host Management: http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-host-management-guide.pdf
• vSphere Virtual Machine Administration: http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-virtual-machine-admin-guide.pdf
• vSphere Networking: http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-networking-guide.pdf
• vSphere Security: http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-security-guide.pdf

References (2)
Slide 1-7

• vSphere Resource Management: http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-resource-management-guide.pdf
• vSphere Availability: http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-availability-guide.pdf
• vSphere Installation and Setup: http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-installation-setup-guide.pdf
• vSphere Platform Services Controller Administration Guide: http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-platform-services-controller-administration-guide.pdf

References (3)
Slide 1-8

• vSphere Monitoring and Performance: http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-monitoring-performance-guide.pdf
• vSphere Command-Line Interface Documentation: https://www.vmware.com/support/developer/vcli/
• VMware vSphere 6.5 Documentation Center: https://pubs.vmware.com/vsphere-65/index.jsp
• vSphere Management Assistant Guide for vSphere 6.5: http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-management-assistant-65-guide.pdf
• Configuration Maximums for vSphere 6.5: http://www.vmware.com/pdf/vsphere6/r65/vsphere-65-configuration-maximums.pdf



VMware Online Resources
Slide 1-9

• VMware vSphere Blog: http://blogs.vmware.com/vsphere/


• VMware Communities: http://communities.vmware.com
• VMware Support: http://www.vmware.com/support
• VMware Education: http://www.vmware.com/education
• VMware Certifications: http://mylearn.vmware.com/portals/certification
• VMware Education and Certification Blog: http://blogs.vmware.com/education/
• VMware Knowledge Base: http://kb.vmware.com
• vSphere Release Notes: http://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-pubs.html



VMware Certification
Slide 1-10

Although this course is not required for VMware certification, the content of
this course is a subset of the knowledge tested in the VCIX6-DCV certification.

The VMware Certified Implementation Expert 6 - Data Center Virtualization
certification consists of two exams, focused on design and deployment skills,
respectively. This course provides training on a subset of knowledge found in
the deployment exam.
For details about VMware certifications, go to:
http://mylearn.vmware.com/portals/certification

The VMware Certified Implementation Expert 6 - Data Center Virtualization certification (VCIX6-DCV)
program tests candidates on two skill sets. The design exam portion of the certification tests
candidates on their ability to design a VMware vSphere® 6.x solution in both single and multisite
environments. Candidates should have a strong understanding of vSphere 6.x core components and
their relation to the data center, including virtual storage and networking technologies and their
relation to physical data center resources.
The deployment exam portion of the certification tests candidates on their ability to administer a
vSphere 6.x data center. Candidates should be capable of working with large and complex
virtualized data centers and demonstrate technical leadership with vSphere 6.x technologies.
Candidates must be capable of using automation tools, implementing virtualized environments, and
administering all vSphere 6.x enterprise components.
The VCIX6-DCV certification is also an entry point to the prestigious VMware Certified Design
Expert 6 certification.
The training in this course covers the troubleshooting objectives found in the VCIX6-DCV
deployment exam.



MODULE 2
Introduction to Troubleshooting
Slide 2-1

Module 2

You Are Here
Slide 2-2

1. Course Introduction
2. Introduction to Troubleshooting
3. Troubleshooting Tools
4. Troubleshooting Virtual Networking

5. Troubleshooting Storage
6. Troubleshooting vSphere Clusters
7. Troubleshooting Virtual Machines
8. Troubleshooting vCenter Server and ESXi



Importance
Slide 2-3

You can quickly identify, diagnose, and solve a problem if you use an
efficient troubleshooting methodology in a consistent and repeatable
manner.



Learner Objectives
Slide 2-4

By the end of this module, you should be able to meet the following
objectives:
• Define the scope of troubleshooting
• Use a structured approach to solve configuration and operational problems
• Apply troubleshooting methodology to logically diagnose faults and improve
troubleshooting efficiency



Troubleshooting Process
Slide 2-5

Troubleshooting is a systematic approach to identifying the root cause of
a problem and defining steps to resolve the problem.
A typical troubleshooting process involves the following steps:
1. Defining the problem.
2. Identifying the cause of the problem.
3. Resolving the problem.

The troubleshooting process begins when a user reports a problem. In this context, the user is
anyone using the system, from an end user to an administrator. The problem reported by the user
might not be the problem. A user might be reporting symptoms of the problem.
An observed problem might be directly causing the symptoms, but typically the problem has a more
fundamental cause.



Definition of a System Problem
Slide 2-6

A system problem is a fault in a system, or one of its components, that
negatively affects the services needed for normal production.
System problems arise from various sources:
• Configuration issues
• Resource contention
• Network attacks
• Software bugs
• Hardware failures
Do not assume that you understand the problem after you have identified
one symptom:
• The first symptom reported might not indicate the true source of the problem.
• Do a thorough analysis. Verify that nothing else is broken.

A system consists of several components, both software and hardware. For example, a VMware®
ESXi™ host consists of components such as CPU, memory, storage, networking, and hypervisor
software. A virtual machine consists of various components, such as one or more applications, a
guest operating system, and virtual hardware. A problem that occurs in a system can disrupt and
negatively affect production services that were functioning normally.
This course concentrates on the configuration and operational issues.



Effects of a System Problem
Slide 2-7

These problems can affect certain aspects of a system:


• Usability
• Accuracy
• Reliability
• Performance
Perceived effects or symptoms are generally exposed and reported.
Symptoms of a system problem often appear to be the problem itself.
You must look at all of the symptoms of a system problem to determine
the root cause.

Usability is about whether users can complete tasks and achieve goals with the given product.
Usability is also about the amount of effort (often measured in time) that is required by a user to
perform a certain task.
Accuracy is about a system's precision and the system's ability to repeatedly show the same results
under unchanged conditions.
Reliability can be defined in terms of whether a system consistently produces correct outputs up to
some given time. Reliability is enhanced by system features that help avoid and detect problems.
Reliability is often defined in business service-level agreements (SLAs) in the form of availability.
Performance is also defined in terms of an SLA. An SLA establishes performance and reliability
requirements for applications. An SLA enables tracking and analyzing the achieved performance
and reliability to ensure that those requirements are met. A performance problem exists when an
application fails to meet its SLA. Depending on the SLA, the failure might be in the form of
excessively long response times or an unacceptable length of time when the system was unavailable.
Although performance is a predominant symptom in reported problems, this course does not focus
on performance issues. Performance troubleshooting is covered in the VMware vSphere: Optimize
and Scale course.



Collecting Symptoms of a Problem
Slide 2-8

Collecting symptoms is the first step in troubleshooting a problem.


A single root cause often presents itself as several symptoms that users
report.
Differentiating between symptoms and the root cause of a problem is
imperative.

Symptom: One or more LUNs on a storage array are not visible to a specific ESXi host.
Possible causes:
• The LUNs that are not visible are not presented correctly to the ESXi host.
• Pathing failure
Root cause: Network has failed between the ESXi host and the storage array. No redundant path
available.

Symptom: You cannot connect to vCenter Server with vSphere Web Client.
Possible causes:
• The VirtualCenter Server service failed to start.
• Network path between you and vCenter Server Appliance is down.
Root cause: vCenter Server Appliance has a corrupt database.

Problems can arise in any computing environment. Complex application behaviors, changing
demands, and shared infrastructure can lead to problems arising in previously stable environments.
Troubleshooting problems requires an understanding of the interactions between the software and
hardware components of a computing environment. Moving to a virtualized computing environment
adds new software layers and new types of interactions that must be considered when
troubleshooting problems.



Gathering Supplemental Information
Slide 2-9

Ask questions to gather additional information to define the problem:


• Can the problem be reproduced?
- Provide a repeatable means to recreate the problem and a way to validate that the
problem was resolved.
• What is the scope?
- Does the problem affect only one object or multiple objects?
• Was the system working before?
- If so, what changed in the environment or configuration?
• Is the problem a known problem?
- Consult references, such as release notes, to determine whether the problem is
documented.

Proper troubleshooting requires starting with a broad view of the computing environment and
systematically narrowing the scope of the investigation as possible sources of problems are
eliminated. Troubleshooting efforts that start with a narrowly conceived idea of the source of a
problem often get stuck in detailed analysis of one component, when the real source of the problem
is elsewhere in the infrastructure. To quickly isolate the source of a problem, you must adhere to a
logical troubleshooting methodology that avoids preconceptions about the source of the
problem.



Viewing and Interpreting Diagnostic Information
Slide 2-10

View diagnostic messages in the GUI or in log files.


Interpret the diagnostic messages to find the root cause.
[Screenshot: the Recent Tasks pane, listing the tasks Power On virtual machine (linux-a-05),
Initialize powering On (Training), and Reconfigure virtual machine (linux-a-04, linux-a-05).
The Power On virtual machine task shows the following error stack:]
• Module 'MonitorLoop' power on failed.
• An error was received from the ESX host while powering on VM linux-a-05.
• Failed to start the virtual machine.
• Failed to power on VM.
• Could not power on virtual machine: No space left on device.
• Failed to extend the virtual machine swap file.
• Current swap file size is 0 KB.
• Failed to extend swap file from 0 KB to 524288 KB.
• File system specific implementation of LookupAndOpen[file] failed
• File system specific implementation of Lookup[file] failed

View diagnostic messages that were generated by the problem. If diagnostic information does not
appear in the GUI or in an event viewer, then check the appropriate log files for useful entries.
Use the information in the diagnostic messages to help focus on the area of the system that is most
likely causing the problem. For example, the user received an error message when powering on a
virtual machine. The error message indicates that the datastore on which the virtual machine is
located has insufficient disk space. This information tells you to focus on the storage component
instead of, for example, on the networking component. The rest of the error message indicates that
the virtual machine's swap file cannot be extended because no space is left on the disk.
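
The idea of using diagnostic messages to narrow the focus area can be sketched in a few lines of Python. This is only an illustration: the keyword lists and the `likely_components` helper are assumptions made for this example, not a VMware classification or tool.

```python
# Hypothetical keyword-to-component map. The keywords are illustrative
# assumptions, not an official VMware classification.
COMPONENT_KEYWORDS = {
    "storage": ("no space left on device", "swap file", "file system"),
    "networking": ("network", "timeout", "connect"),
}


def likely_components(messages):
    """Count keyword hits per component and rank components by hit count."""
    hits = {}
    for message in messages:
        text = message.lower()
        for component, keywords in COMPONENT_KEYWORDS.items():
            if any(keyword in text for keyword in keywords):
                hits[component] = hits.get(component, 0) + 1
    return sorted(hits, key=hits.get, reverse=True)


# Messages taken from the power-on error stack shown on the slide.
messages = [
    "Could not power on virtual machine: No space left on device.",
    "Failed to extend the virtual machine swap file",
    "Failed to extend swap file from 0 KB to 524288 KB.",
    "File system specific implementation of Lookup[file] failed",
]
ranked = likely_components(messages)
print(ranked)
```

With the messages from this example, only the storage keywords match, which agrees with the conclusion in the notes that the storage component is the place to look first.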



Identifying Possible Causes and Taking an Appropriate Approach
Slide 2-11

A structured approach to troubleshooting enables you to determine the root cause
quickly and effectively.
Based on the problem's characteristics, take one of the following troubleshooting
approaches:
• Investigate the cause top-down.
• Investigate the cause bottom-up.
• Approach the cause by halves.
[Figure: the troubleshooting stack, from Top-Down (most specific) to Bottom-Up (most general):
Application or Guest OS; Virtual Machine; ESXi Host; Hardware (CPU, Memory, Network, Storage).
The figure is labeled with the top-down, bottom-up, and approach-the-cause-by-halves directions.]

In a VMware virtual environment, the root cause of a problem can occur in any one of the virtual
components. Knowing where to start looking for the root cause is often not obvious. Thus, gathering
as much information as you can about the problem can help determine which virtual component to
check first.
You might take one of the following troubleshooting approaches:
• Top-down: Start troubleshooting in the guest operating system first, then work your way down
the stack to the virtual machine, then to the ESXi host, and finally to the hardware.
• Bottom-up: Start troubleshooting at the hardware level first, then work your way up the stack to
the ESXi host, then to the virtual machine, and finally to the guest operating system.
• Approach cause by halves: Start troubleshooting at the middle of the stack. For example, start
with the virtual machine and test possible causes. The test results determine whether you should
continue troubleshooting up the stack or down the stack.
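
The three approaches differ only in the order in which the layers of the stack are visited. The following Python sketch makes that order explicit. The `probe` callback is a hypothetical stand-in for whatever test you run at a layer, and the assumption that each test clearly points up or down the stack is a simplification.

```python
# Layers of the stack, listed from top (most specific) to bottom (most general).
STACK = ["guest operating system", "virtual machine", "ESXi host", "hardware"]


def top_down(stack):
    """Visit layers from the guest operating system down to the hardware."""
    return list(stack)


def bottom_up(stack):
    """Visit layers from the hardware up to the guest operating system."""
    return list(reversed(stack))


def by_halves(stack, probe):
    """Start in the middle; each test result decides whether to go up or down.

    probe(layer) must return "here", "up", or "down".
    """
    low, high = 0, len(stack) - 1
    visited = []
    while low <= high:
        mid = (low + high) // 2
        layer = stack[mid]
        visited.append(layer)
        verdict = probe(layer)
        if verdict == "here":
            break
        if verdict == "up":
            high = mid - 1
        else:
            low = mid + 1
    return visited


def make_probe(stack, faulty_layer):
    """Build a simulated probe for a known fault location (demo only)."""
    fault_index = stack.index(faulty_layer)

    def probe(layer):
        index = stack.index(layer)
        if index == fault_index:
            return "here"
        return "down" if fault_index > index else "up"

    return probe


visited = by_halves(STACK, make_probe(STACK, "hardware"))
print(visited)
```

In this simulated run the fault is in the hardware layer, so the search starts at the virtual machine, moves down to the ESXi host, and ends at the hardware.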



Determining the Root Cause
Slide 2-12

To determine the root cause, test your environment and eliminate possible causes.
Example: The virtual machine stopped responding.

Possible causes, by layer of the stack:
• Application or Guest OS: Problem is triggered by an operation (snapshot or vSphere vMotion
migration) performed on the virtual machine.
• Virtual Machine: Limit and share values are misconfigured on the virtual machine.
• ESXi Host: Not enough host resources (CPU, memory) are available.
• Hardware (CPU, Memory, Network, Storage): Physical resources are inaccessible.

General virtual infrastructure knowledge and knowledge of your specific system configuration are
very helpful in identifying possible causes. Prioritize the list of possible causes, ordering them from
most probable to least probable. Then test each possible cause to determine the most likely cause of
the problem, called the root cause.
In the example, the problem is that a virtual machine has stopped responding. In a nonresponsive
system, the operating system seems to be paralyzed and no error messages appear. However, the
operating system is still running. Such problems might require guidance from documents, such as
VMware knowledge base articles. For example, to troubleshoot a virtual machine that has stopped
responding, see VMware knowledge base article 1007819 at http://kb.vmware.com/kb/1007819.
For this problem, you might take a top-down approach. Start with the operations performed on the
virtual machine, check the virtual machine configuration, and check for sufficient resources on the
host where the virtual machine is located.
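
Prioritizing possible causes and then testing them one by one can be sketched as a short loop. In this Python sketch the probability values and the test results are invented for illustration; in practice they come from your own knowledge of the environment and from the tests you actually run.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PossibleCause:
    description: str
    probability: float            # your own estimate, 0.0 to 1.0 (assumed values)
    is_cause: Callable[[], bool]  # test that returns True when the cause is confirmed


def find_root_cause(causes):
    """Test causes from most to least probable; return the match and the test order."""
    tested = []
    for cause in sorted(causes, key=lambda c: c.probability, reverse=True):
        tested.append(cause.description)
        if cause.is_cause():
            return cause, tested
    return None, tested


# Stubbed tests for the "virtual machine stopped responding" example.
causes = [
    PossibleCause("Not enough host resources", 0.2, lambda: False),
    PossibleCause("Operation (snapshot or vMotion) on the VM", 0.5, lambda: False),
    PossibleCause("Limit and share values misconfigured", 0.3, lambda: True),
]
root, tested = find_root_cause(causes)
print(root.description, tested)
```

The loop stops at the first confirmed cause, so the less probable causes are never tested unless the more probable ones have been eliminated.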



Resolving the Problem
Slide 2-13

After identifying the root cause, assess the impact of the problem on
operations:
• High impact: Resolve immediately.
• Medium impact: Resolve when possible.
• Low impact: Resolve during next maintenance window.
Identify possible solutions and their impact on the vSphere environment:
• Short-term solution: Workaround.
• Long-term solution: Reconfiguration.
• Impact analysis: Assess the impact of the solution on operations.
Resolve the problem by implementing the most effective solution.

After identifying the root cause, resolve the problem. To resolve the problem, you identify possible
solutions to the problem, then implement a solution.
In determining the best solution, assess the impact that the problem has on normal operations. For
example, if the problem causes business-critical applications to be inaccessible, then the impact of
the problem is high, and immediate resolution is necessary.
When identifying possible solutions, you might decide to first implement a short-term fix so that
systems can be brought back online quickly. Before implementing the short-term solution, document
all changes that you have made to the system from the time the problem occurred. Also, back up
your log files from the time the problem occurred. Some short-term solutions can be destructive and
truncate important log information necessary for additional assistance.
Eventually, you want to implement a more permanent, long-term solution to prevent the problem
from happening again.
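
Backing up log files before applying a short-term fix can be as simple as copying the log directory to a timestamped location. The following Python sketch assumes the logs have already been collected into a local directory; the directory names and the `back_up_logs` helper are hypothetical and not part of any VMware tool.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def back_up_logs(log_dir, backup_root, now=None):
    """Copy log_dir into a new timestamped directory under backup_root."""
    now = now or datetime.now()
    target = Path(backup_root) / f"logs-{now:%Y%m%d-%H%M%S}"
    shutil.copytree(log_dir, target)  # fails if the target already exists
    return target


# Demo with temporary directories so that the sketch runs anywhere.
work = Path(tempfile.mkdtemp())
logs = work / "collected-logs"
logs.mkdir()
(logs / "example.log").write_text("sample log entry\n")

backup = back_up_logs(logs, work / "backups", now=datetime(2017, 6, 1, 9, 30, 0))
print(backup.name)
```

Because each backup goes into its own timestamped directory, repeated backups during a long troubleshooting session do not overwrite one another.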



Example Scenario: Defining the Problem
Slide 2-14

Scenario:
• You attempt to migrate the virtual machine named VM01 from the host named
ESXi01 to the host named ESXi02. After waiting a couple of minutes, the
vSphere vMotion migration fails with an error.
Is this failure a vSphere vMotion problem or a symptom of an underlying
problem?
• The error message will provide additional information.

In the example, you use the troubleshooting methodology to diagnose a VMware vSphere®
vMotion® migration problem. You use VMware vSphere® Web Client to perform a vSphere
vMotion migration, but the migration fails with an error.
At this point, you cannot tell whether the problem is specific to vSphere vMotion or whether the
problem is in the underlying infrastructure, such as storage or networking.
To pinpoint the problem area, gather information about the problem, starting with any diagnostic
messages displayed in vSphere Web Client.



Example: Gathering Information
Slide 2-15

Error messages can help determine the problem.


[Screenshot: the Recent Tasks pane in vSphere Web Client, showing the error stack for the failed
Relocate virtual machine task.]
Root problem: The IP address assigned to the VMkernel port on the vMotion network is in the
wrong subnet.

vSphere Web Client shows the following error messages for the failed vSphere vMotion migration
task:
• A general system error occurred: The vSphere vMotion migrations failed because the ESXi
hosts were not able to connect over the vSphere vMotion network. Check the vSphere vMotion
network settings and physical network configuration.
• vSphere vMotion migration failed to create a connection with remote host 172.20.13.52: The
ESXi hosts failed to connect over the vSphere vMotion network.
• Migration failed to connect to remote host 172.20.13.52 from host 172.20.12.51: Timeout.
The IP addresses refer to the vSphere vMotion VMkernel interfaces on the remote host
(ESXi02) and the local host (ESXi01).
• The vSphere vMotion migration failed because the destination host did not receive data from
the source host on the vSphere vMotion network. Verify that your vSphere vMotion network
settings and physical network configuration are correct.
The first error message in the stack is helpful and tells you to check the vSphere vMotion network
settings and physical network configuration. Not all error messages are this helpful.
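
A quick way to check whether two VMkernel interfaces are on the same subnet is to compare their network addresses. The following Python sketch uses the two addresses from the error messages. The /24 prefix length is an assumption, because the notes do not state the subnet mask used on the vSphere vMotion network.

```python
import ipaddress


def same_subnet(ip_a, ip_b, prefix_length):
    """Return True when both addresses fall in the same network."""
    net_a = ipaddress.ip_interface(f"{ip_a}/{prefix_length}").network
    net_b = ipaddress.ip_interface(f"{ip_b}/{prefix_length}").network
    return net_a == net_b


# Addresses from the error messages; the /24 prefix length is assumed.
local_vmk = "172.20.12.51"   # vSphere vMotion VMkernel interface on ESXi01
remote_vmk = "172.20.13.52"  # vSphere vMotion VMkernel interface on ESXi02

print(same_subnet(local_vmk, remote_vmk, 24))  # prints False
```

Under the /24 assumption the two interfaces are in different networks (172.20.12.0/24 and 172.20.13.0/24), which fits the timeout reported in the error stack.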



Example: Identifying Possible Causes
Slide 2-16

Use the information that you gathered to identify possible causes:


• Based on error messages, the vSphere vMotion migration failed because
ESXi01 and ESXi02 failed to connect over the network named vMotion.
• This error indicates a possible misconfiguration on the ESXi host.
• Check the connectivity of the vSphere vMotion VMkernel interface.

Possible causes:
• vSphere vMotion is misconfigured.
• Network connectivity is down on one of the ESXi hosts.
• vSphere vMotion VMkernel interface connectivity is down on one of the ESXi hosts.
[Figure: the troubleshooting stack: Application or Guest OS; Virtual Machine; ESXi Host;
Hardware (CPU, Memory, Network, Storage).]

The error message points to connectivity issues with the network named vMotion, with the
following possible causes:
• vSphere vMotion is misconfigured.
• Network connectivity between ESXi01 and ESXi02 is down.
• The vSphere vMotion VMkernel interface connectivity between ESXi01 and ESXi02 is down.
When you initiate vSphere vMotion migration, several compatibility checks are performed before
the migration is initiated. Thus, you can eliminate possible causes such as vSphere vMotion not
being enabled or incompatible CPUs, because these configuration items are checked before the
migration begins.



Example: Determining the Root Cause
Slide 2-17

If possible, test possible causes using a repeatable flow to determine the root cause.

[Flowchart:]
1. Start here: ping ESXi02.
- No: Fix network configuration to get a successful ping, then perform vSphere vMotion
migration. If the migration fails, test next possible cause.
- Yes: Continue with the next test.
2. ping 172.20.12.52
- No: Fix VMkernel configuration to get a successful ping, then perform vSphere vMotion
migration.
- Yes: Further investigation necessary.

Test each possible cause and eliminate possible causes to determine the root cause.
First, use the ping command to test network connectivity between the hosts. For example, from
ESXi01, ping ESXi02.
If the ping command fails, then investigate why the ping is failing. For example, the ping might fail
because of a network misconfiguration or faulty physical hardware. Make a change to your
environment and try the ping again.
After the ping is successful, test the vSphere vMotion migration. If the migration is successful, then
you have identified the root cause of the problem. If the migration is not successful, then test the
next possible cause in the list. If the ping command is successful, then you know that network
connectivity exists between the two hosts.
Test the VMkernel interface connectivity. You use the ping command for this test too. From one
host, run the ping command, pointing to the VMkernel interface that you want to check on the
target host. For example, from ESXi01, use the ping command to ping the vSphere vMotion
VMkernel interface on ESXi02 (172.20.13.52).



If the ping command fails, then investigate why the ping command is failing. Verify that the
VMkernel interface is configured correctly. Make a change to your environment and try the ping
command again.
When the ping command is successful, test the vSphere vMotion migration. If the migration is
successful, you have identified the root cause of the problem. If the migration is not successful, you
must further investigate the root cause.
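
The test, fix, and retest flow described above can be sketched as a small loop. In this Python sketch the `test`, `fix`, and `migrate` callables are hypothetical placeholders for the manual steps (running ping, correcting the configuration, retrying the vSphere vMotion migration), and the `state` dictionary simulates the environment so that the sketch runs on its own.

```python
def run_flow(checks, migrate):
    """Walk through (name, test, fix) checks in order, retrying the migration after each fix."""
    for name, test, fix in checks:
        if test():
            continue  # the test passes, so this possible cause is eliminated
        fix()
        if not test():
            return f"{name}: test still fails after the change"
        if migrate():
            return f"{name}: root cause identified"
        # The migration still fails, so move on to the next possible cause.
    return "further investigation necessary"


# Simulated environment: host connectivity works, but the vMotion
# VMkernel interface on ESXi02 is misconfigured.
state = {"network_ok": True, "vmkernel_ok": False}

checks = [
    ("Network connectivity",
     lambda: state["network_ok"],
     lambda: state.update(network_ok=True)),
    ("VMkernel interface connectivity",
     lambda: state["vmkernel_ok"],
     lambda: state.update(vmkernel_ok=True)),
]


def migrate():
    """Simulated migration: succeeds only when both tests would pass."""
    return state["network_ok"] and state["vmkernel_ok"]


result = run_flow(checks, migrate)
print(result)
```

Keeping the order of the checks fixed is what makes the flow repeatable: anyone who reruns it against the same environment tests the same possible causes in the same sequence.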



Example: Resolving the Problem
Slide 2-18

In this example, assume that the root cause is an incorrect IP address of
a VMkernel interface for vSphere vMotion on the ESXi02 host.
Assess the impact of the problem on operations:
• Probably high impact:
- The problem affects any virtual machine that is migrated to the ESXi02 host.
- The problem also affects the proper operation of vSphere DRS.

Identify possible solutions to resolve the problem:


• Short-term solution: Do not migrate virtual machines to the ESXi02 host.
• Long-term solution: Fix the IP address of the vSphere vMotion VMkernel
interface of the ESXi02 host.
Implementing the solution should not require downtime.

When you have identified the root cause, identify possible solutions to fix the problem. The impact
that the problem has on normal operations (high, medium, or low) determines how quickly the
solution should be implemented.
Finally, determine the appropriate type of solution for this problem. You might implement a
short-term solution so that the system works normally. Document all changes that you made to the
system since the problem occurred. Also, back up your log files from the time the problem occurred,
because logs rotate and might not be available at a future time.



Review of Learner Objectives
Slide 2-19

You should be able to meet the following objectives:


• Use a structured approach to solve configuration and operational problems
• Apply troubleshooting methodology to logically diagnose faults and improve
troubleshooting efficiency



Key Points
Slide 2-20

• A structured approach to troubleshooting enables you to resolve problems
quickly and effectively.
• Differentiating between the symptoms and the problem is an important step in
the troubleshooting process.
• Prerequisite knowledge of how the VMware virtual infrastructure works as well
as your knowledge of your system's configuration are very useful in the
troubleshooting process.
Questions?



MODULE 3

Troubleshooting Tools
Slide 3-1

Module 3

You Are Here
Slide 3-2

1. Course Introduction
2. Introduction to Troubleshooting

3. Troubleshooting Tools
4. Troubleshooting Virtual Networking

5. Troubleshooting Storage
6. Troubleshooting vSphere Clusters
7. Troubleshooting vCenter Server and ESXi
8. Troubleshooting Virtual Machines



Importance
Slide 3-3

Knowing how to use the right tools to solve various types of problems
can save time and improve your troubleshooting results.
The GUI, the command line, the log files, and VMware vRealize® Log
Insight™ can help you analyze problems and guide you toward
resolution.



Module Lessons
Slide 3-4

Lesson 1: Command Line


Lesson 2: vSphere Management Assistant
Lesson 3: Logging, Log Files, and vRealize Log Insight



Lesson 1: Command Line
Slide 3-5

Lesson 1: Command Line



Learner Objectives
Slide 3-6

By the end of this lesson, you should be able to meet the following
objectives:
• Discuss the various methods to run commands
• Discuss the various ways to access VMware vSphere® ESXi™ Shell
• Use commands to view, configure, and manage your vSphere components



Methods to Run Commands
Slide 3-7

You can obtain command-line access on an ESXi host in several ways:


• vSphere ESXi Shell, which includes:
- esxcli commands
- A set of other troubleshooting commands
- Available through either the Direct Console User Interface (DCUI) or SSH session
• VMware vSphere® Management Assistant:
- With the installed VMware vSphere® Command-Line Interface (vCLI) package, an
administrator can carry out configuration and troubleshooting tasks.
- vSphere Management Assistant is available as an appliance that can be downloaded.
You can also install the vCLI software package on Windows and Linux
virtual machines.

VMware vSphere® ESXi™ Shell includes a set of fully supported ESXCLI commands and a set of
commands for diagnosing and managing ESXi hosts. Be familiar with vSphere ESXi Shell in case
VMware Technical Support directs you to use it.
The esxcfg-* commands are included in the VMware vSphere® Command-Line Interface (vCLI)
package, mainly for compatibility reasons. Although the esxcfg-* commands are still
available, they have been deprecated. VMware recommends that you use the newer ESXCLI
commands instead.
The vCLI command set allows you to run common system administration and configuration tasks
against vSphere systems from an administration server of your choice. The vCLI package can be
installed on supported operating systems, such as Windows and Linux.



Accessing vSphere ESXi Shell
Slide 3-8

You can access vSphere ESXi Shell in different ways:
• Local access by using the Direct Console User Interface (DCUI):
1. Enable the vSphere ESXi Shell service, either in the DCUI or vSphere Web Client.
2. Access vSphere ESXi Shell from the DCUI by pressing Alt-F1.
3. Disable the vSphere ESXi Shell service when not using it.
4. Log out of the DCUI by pressing Alt-F2.
• Remote access by using SSH:
1. Enable the SSH service, either in the DCUI or VMware vSphere® Web Client.
2. Use an SSH client, such as PuTTY, to access vSphere ESXi Shell.
3. Disable the SSH service when not using it.
[Screenshot: vSphere Web Client, host sa-esxi-01.vclass.local, Configure tab, Security Profile.
The Services list shows Direct Console UI, ESXi Shell, and SSH, each with the status Running.]

An ESXi system includes a direct console that enables you to start and stop the system and to
perform a limited set of maintenance and troubleshooting tasks. The Direct Console User Interface
(DCUI) includes vSphere ESXi Shell, which is disabled by default. You can enable vSphere ESXi
Shell in the DCUI or through VMware vSphere® Client™ or vSphere Web Client.
To access vSphere ESXi Shell locally, you require physical access to the DCUI and administrator
privileges. Local users that are assigned to the administrator group automatically have local shell
access.
To remotely access vSphere ESXi Shell, you enable the SSH service. However, you should enable
SSH access only for a limited time. SSH should never be left open on an ESXi host in a production
environment. Enabling SSH creates a security vulnerability and consumes ESXi resources.
Perform the following procedure to enable shell and SSH access in vSphere Web Client:
1. Select the ESXi host.
2. Click Configure.
3. Click Security Profile.
4. Scroll down to Services and click Edit.
5. Start the Shell and SSH services.
For more information about methods of accessing vSphere ESXi Shell, see vSphere Command-Line
Interface Documentation at https://www.vmware.com/support/developer/vcli.



vSphere ESXi Shell and SSH Timeouts
Slide 3-9

[Screenshot: vSphere Web Client, host sa-esxi-01.vclass.local, Configure tab, System > Advanced
System Settings, filtered on "Timeout". Summaries are truncated in the screenshot.]
Name: Value (Summary)
• Scsi.SCSITimeout_ScanTime: 1000 (Time (in milliseconds) to sleep betwe...)
• Scsi.TimeoutTMThreadExpires: 1800 (Life in seconds of timeout task mgmt...)
• Scsi.TimeoutTMThreadLatency: 2000 (Delay in ms before waking up new tas...)
• Scsi.TimeoutTMThreadMax: 16 (Max number of timeout task-mgmt han...)
• Scsi.TimeoutTMThreadMin: value not legible (Min number of timeout task-mgmt han...)
• Scsi.TimeoutTMThreadRetry: 2000 (Delay in milliseconds before retrying...)
• UserVars.DcuiTimeOut: 600 (An idle time in seconds before DCUI i...)
• UserVars.EsximageNetTimeout: 60 (Set the timeout in seconds for downlo...)
• UserVars.ESXiShellInteractiveTimeOut: 0 (Idle time before an interactive shell is...)
• UserVars.ESXiShellTimeOut: value not legible (Time before automatically disabling lo...)
• UserVars.HostClient.SessionTimeout: 900 (Default timeout for Host Client sessio...)

The Availability timeout setting determines how long both SSH and vSphere ESXi Shell remain
enabled:
• The default value is 0, which means that SSH and vSphere ESXi Shell remain enabled until they
are manually disabled.
• A value of 1 or higher determines how many minutes (in the DCUI) or seconds (in vSphere
Web Client) the services remain enabled before being automatically disabled.
If the Idle Timeout setting is configured, local and remote users are automatically logged out if their
sessions are idle for the defined period:
• The default value is 0, which means that sessions are not logged out automatically.
• A value of 1 or higher determines how long an idle session remains active before being
automatically logged out. This value is measured in minutes in the DCUI and in seconds in
vSphere Web Client.
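
Because the DCUI expresses both timeouts in minutes and vSphere Web Client expresses them in
seconds, it is easy to enter a value in the wrong unit. The following Python sketch only illustrates
the conversion and the meaning of 0. The function names are hypothetical and are not part of any
VMware tool.

```python
def dcui_minutes_to_web_client_seconds(minutes: int) -> int:
    """Convert a timeout entered in the DCUI (minutes) to the
    equivalent value shown in vSphere Web Client (seconds)."""
    if minutes < 0:
        raise ValueError("a timeout cannot be negative")
    return minutes * 60


def describe_timeout(seconds: int) -> str:
    """Describe a timeout value as it appears in vSphere Web Client."""
    if seconds < 0:
        raise ValueError("a timeout cannot be negative")
    if seconds == 0:
        # 0 is the default: nothing is disabled or logged out automatically.
        return "disabled (no automatic timeout)"
    return "expires after {} seconds".format(seconds)


# A 15-minute value typed in the DCUI, expressed in Web Client units.
print(dcui_minutes_to_web_client_seconds(15))
print(describe_timeout(0))
```

The same conversion applies to both the Availability timeout and the Idle timeout.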



vSphere ESXi Shell and SSH Timeouts (2)
Slide 3-10

Both options can be configured in the DCUI when the services are disabled. In
vSphere Web Client, the services must be restarted after changing these values.

[Screenshot: DCUI Troubleshooting Mode Options screen]

Troubleshooting Mode Options            Modify ESXi Shell and SSH timeouts

Enable ESXi Shell                       Modify the number of minutes that can elapse before you must
Enable SSH                              log in after ESXi Shell access is enabled and the idle
Modify ESXi Shell and SSH timeouts      timeout for interactive sessions.

Availability timeout   [ 1 ]
Idle timeout
<Enter> OK   <Esc> Cancel



ESXCLI Commands
Slide 3-11

The esxcli command offers options in the following namespaces:

• esxcli esxcli namespace
• esxcli device namespace
• esxcli elxnet namespace
• esxcli fcoe namespace
• esxcli graphics namespace
• esxcli hardware namespace
• esxcli iscsi namespace
• esxcli network namespace
• esxcli nvme namespace
• esxcli rdma namespace
• esxcli sched namespace
• esxcli software namespace
• esxcli storage namespace
• esxcli system namespace
• esxcli vm namespace
• esxcli vsan namespace

Run esxcli esxcli command list for a full listing.

The ESXCLI commands are part of the vCLI command set. The ESXCLI commands are a
comprehensive set of commands for managing most aspects of the vSphere environment.
Eventually, the ESXCLI command set will replace other command sets that are part of vCLI.
Help is available at all levels of the ESXCLI command set. For example, to see the namespaces
available with ESXCLI, enter esxcli. A list of available namespaces appears.
To determine the options available in the network namespace, enter esxcli network. A list of
available options for the network namespace appears.
To determine the configuration options available for firewalls, enter esxcli network firewall.
Each level displays command format help and options available for the namespace.
For more information about ESXCLI commands and their descriptions, see vSphere Command-Line
Interface Reference at https://www.vmware.com/support/developer/vcli/.
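
The drill-down described above, in which each additional namespace on the command line displays
help for the next level, can be sketched in Python. This is an illustration only: the function
name is hypothetical, and the sketch builds the command strings without running them.

```python
from typing import List


def esxcli_help_commands(*namespaces: str) -> List[str]:
    """Return the command to enter at each help level, starting with bare esxcli."""
    commands = ["esxcli"]
    for depth in range(1, len(namespaces) + 1):
        # Append one more namespace at each level of the hierarchy.
        commands.append(" ".join(("esxcli",) + namespaces[:depth]))
    return commands


# The firewall example from the text: three levels of help.
for command in esxcli_help_commands("network", "firewall"):
    print(command)
```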



Viewing vSphere Storage Information
Slide 3-12

You use the esxcli storage command to retrieve storage information,
including multipathing configuration, LUN specifics, and datastore settings.

[root@esxi-a-01:~] esxcli storage
Usage: esxcli storage {cmd} [cmd options]

Available Namespaces:
  core        VMware core storage commands.
  nfs         Operations to create, manage, and remove Network Attached Storage
              filesystems.
  nfs41       Operations to create, manage, and remove NFS v4.1 filesystems.
  nmp         VMware Native Multipath Plugin (NMP). This is the VMware default
              implementation of the Pluggable Storage Architecture.
  san         IO device management operations to the SAN devices on the system.
  vflash      virtual flash Management Operations on the system.
  vmfs        VMFS operations.
  vvol        Operations pertaining to Virtual Volumes
  filesystem  Operations pertaining to filesystems, also known as datastores, on
              the ESX host.
  iofilter    IOFilter related commands.

• The esxcli storage command includes options, such as esxcli storage nfs41,
esxcli storage vflash, esxcli storage vvol, and so on.

The esxcli storage command set includes the following namespaces:

• core: Provides configuration options and details on adapters, devices, paths, plug-ins, claiming
and claim rules.
• nmp: Provides command-line options to the default Native Multipathing Plug-in.
• san: Provides display and reset options for the available types of adapter, including Fibre
Channel, iSCSI, Fibre Channel over Ethernet (FCoE), and SAS.
• vmfs: Provides you the option of upgrading a VMFS3 datastore to VMFS5 and using the
command line to manage snapshots and extents.
• filesystem: File system operations include mounting, unmounting, rescanning, listing, and
performing an automount on VMware vSphere® VMFS and NFS datastores.
• nfs: Provides a way to add, remove, and list NFS datastores using the command line.
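
When you need the namespace list in a script, the help text can be parsed. The following Python
sketch is a minimal illustration, not a VMware-provided parser. It assumes the layout shown on the
slide: namespace names indented by two spaces, with wrapped description lines indented further.
The sample text is an abbreviated copy of the esxcli storage output, and the function name is
hypothetical.

```python
SAMPLE_HELP = """\
Usage: esxcli storage {cmd} [cmd options]

Available Namespaces:
  core        VMware core storage commands.
  nfs         Operations to create, manage, and remove Network Attached Storage
              filesystems.
  vmfs        VMFS operations.
"""


def parse_namespaces(help_text):
    """Map each namespace name to its description."""
    namespaces = {}
    in_section = False
    current = None
    for line in help_text.splitlines():
        if line.strip() == "Available Namespaces:":
            in_section = True
            continue
        if not in_section:
            continue
        if not line.strip() or not line.startswith(" "):
            break  # a blank line or a new unindented header ends the section
        indent = len(line) - len(line.lstrip())
        if indent <= 2:
            # A new entry: the first word is the name, the rest is the description.
            current, _, description = line.strip().partition(" ")
            namespaces[current] = description.strip()
        elif current is not None:
            # A wrapped line continues the previous description.
            namespaces[current] += " " + line.strip()
    return namespaces


for name, description in parse_namespaces(SAMPLE_HELP).items():
    print(name, "->", description)
```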



Viewing vSphere Network Information
Slide 3-13

You can use esxcli network commands to display physical and virtual
network information.

[root@esxi-a-01:~] esxcli network
Usage: esxcli network {cmd} [cmd options]

Available Namespaces:
  firewall   A set of commands for firewall related operations
  ip         Operations that can be performed on vmknics
  multicast  Operations having to do with multicast
  nic        Operations having to do with the configuration of Network
             Interface Card and getting and updating the NIC settings.
  port       Commands to get information about a port
  sriovnic   Operations having to do with the configuration of SRIOV enabled
             Network Interface Card and getting and updating the NIC settings.
  vm         A set of commands for VM related operations
  vswitch    Commands to list and manipulate Virtual Switches on an ESX host.
  diag       Operations pertaining to network diagnostics

The following options to the esxcli network command are available:

• firewall: Provides a way to view, load, refresh, set, and unload firewall settings
• ip: Enables you to view and configure properties of the VMkernel interfaces to include DNS,
Internet Protocol Security (IPsec), and route information
• nic: Provides a command-line interface for physical NIC operations including enabling and
disabling the adapter, setting some general options, and listing the current NIC setup
• port: Provides the ability to filter and get port statistics
• sriovnic: Lists Single Root I/O Virtualization-capable physical adapters
• vm: Lists networking information for virtual machines that have active ports and lists ports used
by virtual machines
• vswitch: Provides command-line options for standard and distributed switches
• diag: Sends ICMP echo requests to network hosts



Viewing Standard Switch Information
Slide 3-14

You can use the esxcli network vswitch standard command to create
standard switches.

[root@esxi-a-01:~] esxcli network vswitch standard
Usage: esxcli network vswitch standard {cmd} [cmd options]

Available Namespaces:
  policy     Commands to manipulate network policy settings governing the given
             virtual switch.
  portgroup  Commands to list and manipulate Port Groups on an ESX host.
  uplink     Commands to add and remove uplink on given virtual switch.

Available Commands:
  add        Add a new virtual switch to the ESXi networking system.
  list       List the virtual switches current on the ESXi host.
  remove     Remove a virtual switch from the ESXi networking system.
  set        This command sets the MTU size and CDP status of a given virtual
             switch.

You can use the esxcli network vswitch standard namespace to create a virtual switch, map
physical adapters to it, create port groups on the switch, and configure port group and switch
policies.



Viewing Distributed Switch Information
Slide 3-15

Although you cannot create a distributed switch from the command line, you can
use the esxcli command to list recorded distributed switch information.

[root@esxi-a-01:~] esxcli network vswitch dvs vmware
Usage: esxcli network vswitch dvs vmware {cmd} [cmd options]

Available Namespaces:
  lacp   A set of commands for LACP related operations

Available Commands:
  list   List the VMware vSphere Distributed Switch currently configured on
         the ESXi host.
[root@esxi-a-01:~] esxcli network vswitch dvs vmware list
LabVDS
   Name: LabVDS
   VDS ID: 5c 09 2c 50 89 a7 10 5d-26 f8 b1 bd 1d 9d 26 de
   Class: etherswitch
   Num Ports: 1536
   Used Ports: 23
   Configured Ports: 512
   MTU: 1500
   CDP Status: listen
   Beacon Timeout: -1
   Uplinks: vmnic7, vmnic2, vmnic5, vmnic0, vmnic6, vmnic4, vmnic3, vmnic1
   VMware Branded: true

The esxcli network vswitch dvs namespace enables you to list the distributed switches in
your environment and to get the details of your LACP or VXLAN configurations.
No method is available to create a distributed switch using the command line. Using vSphere Web
Client is the preferred way to create a distributed switch. However, you can use the vicfg-vswitch
command to add, modify, and remove uplinks on existing distributed switches.
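
The list output uses one unindented line per switch followed by indented Key: Value lines, which
makes it straightforward to process in a script. The following Python sketch is a hedged example
of such processing. The sample text is transcribed from the slide, the function name is
hypothetical, and the layout assumption might not hold for every ESXi release.

```python
SAMPLE_OUTPUT = """\
LabVDS
   Name: LabVDS
   Class: etherswitch
   Num Ports: 1536
   Used Ports: 23
   Configured Ports: 512
   MTU: 1500
   Uplinks: vmnic7, vmnic2, vmnic5, vmnic0, vmnic6, vmnic4, vmnic3, vmnic1
"""


def parse_dvs_list(output):
    """Return {switch name: {field: value}} from the list output."""
    switches = {}
    current = None
    for line in output.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new distributed switch.
            current = switches.setdefault(line.strip(), {})
        elif current is not None:
            key, _, value = line.strip().partition(":")
            current[key.strip()] = value.strip()
    return switches


lab_vds = parse_dvs_list(SAMPLE_OUTPUT)["LabVDS"]
free_ports = int(lab_vds["Num Ports"]) - int(lab_vds["Used Ports"])
uplinks = lab_vds["Uplinks"].split(", ")
print(free_ports, uplinks)
```

Values are kept as strings so that fields such as the VDS ID are not altered. Convert numeric
fields only where a calculation requires it.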



Viewing Hardware Information
Slide 3-16

You can use esxcli hardware commands to discover the physical
composition of ESXi hosts.

[root@esxi-a-01:~] esxcli hardware
Usage: esxcli hardware {cmd} [cmd options]

Available Namespaces:
  cpu          CPU information.
  ipmi         IPMI information.
  smartcard    Smart card subsystem.
  usb          VMware USB Plugin.
  bootdevice   Boot device information.
  clock        Interaction with the hardware clock.
  memory       Memory information.
  pci          PCI device information and configuration.
  platform     Platform information.
  trustedboot  Information about the status of trusted boot.

The esxcli hardware namespace provides a method for viewing the hardware configuration and
server information of an ESXi host. You can also set the hardware clock. You can enable or
disable hyperthreading with the esxcli hardware cpu global set command.



Lab 1: Using the Command Line
Slide 3-17

Use the command line to review the ESXi host configuration


1. Access Your Student Desktop System
2. Validate the vSphere Licenses
3. Directly Access the DCUI of the ESXi Host
4. Remotely Access the DCUI of the ESXi Host
5. Use ESXCLI Commands to Verify the Host Hardware Configuration
6. Use ESXCLI Commands to Verify the Storage Information
7. Use ESXCLI Commands to Verify the Virtual Switch Information



Review of Learner Objectives
Slide 3-18

You should be able to meet the following objectives:


• Discuss the various methods to run commands
• Discuss the various ways to access vSphere ESXi Shell
• Use commands to view, configure, and manage your ESXi hosts



Lesson 2: vSphere Management Assistant
Slide 3-19




Learner Objectives
Slide 3-20

By the end of this lesson, you should be able to meet the following
objectives:
• Use the vSphere Management Assistant virtual appliance
• Use commands to view, configure, and manage your vSphere components
• Identify the tool for command-line interface troubleshooting



vSphere Management Assistant Components
Slide 3-21

vSphere Management Assistant is a virtual appliance that includes
components for running vSphere commands:
• vCLI command set:
  - Enables you to run common system administration commands against ESXi hosts,
    such as:
    • esxcli
    • vmware-cmd
    • vicfg-* commands
  - Requires credential connection options to a server
• vi-fastpass authentication component:
  - Automates authentication to the vCenter Server system or ESXi host targets.
  - Relieves the user from having to continually add login credentials to every command
    that is executed.
  - Facilitates unattended scripted operations.
• The vifptarget command sets the target server for vCLI commands:
  - Entering this command changes the command prompt, which indicates the target
    server designation.

VMware vSphere® Management Assistant is a downloadable virtual appliance that includes several
components, including vCLI. vSphere Management Assistant enables administrators to run scripts
or agents that interact with ESXi hosts and vCenter Server systems without having to authenticate
each time.
The vSphere Management Assistant authentication interface enables users and applications to
authenticate with the target servers by using vi-fastpass or Active Directory (AD). While adding
a server as a target, the administrator can determine whether the target must use vi-fastpass or
AD authentication. For vi-fastpass authentication, the credentials that a user has on the vCenter
Server system or ESXi host are stored in a local credential store. For AD authentication, the user is
authenticated with an AD server.
When you add an ESXi host as a fastpass target server, vi-fastpass creates two users with
obfuscated passwords (in an unreadable format) on the target server and stores the password
information on vSphere Management Assistant:
• vi-admin with administrator privileges
• vi-user with read-only privileges



Run vifptarget -s server, where server is a vCenter Server system or ESXi host, before you
run vSphere CLI commands or vSphere SDK for Perl scripts against that system. The system is set
as the target for vCLI commands run on vSphere Management Assistant. The vSphere Management
Assistant prompt changes to show which system is the target. The system remains a vSphere
Management Assistant target across appliance reboots, but running vifptarget again is required
each time that you log in to vSphere Management Assistant.
For more information about the commands included in vSphere Management Assistant, see Getting
Started with vSphere Command-Line Interfaces and vSphere Command-Line Interface Concepts and
Examples at https://www.vmware.com/support/developer/vcli/.
For more information about deploying vSphere Management Assistant and using vi-fastpass,
see vSphere Management Assistant Guide at http://www.vmware.com/support/developer/vima.



Configuring vSphere Management Assistant for AD
Authentication
Slide 3-22

vSphere Management Assistant can be configured to participate in Active
Directory (AD) if the ESXi hosts and the vCenter Server systems are part
of an AD domain:
• vSphere Management Assistant can add targets without storing credentials.
• AD, as a centralized solution, offers more security controls than vi-fastpass.
Follow these best practices before you configure vSphere Management
Assistant for AD:
• Verify that the DNS servers and vSphere Management Assistant are in the
same domain.
• Verify that the domain is accessible from vSphere Management Assistant.
• Verify that IP addresses translate to the proper DNS name.

When using vSphere Management Assistant to manage ESXi hosts and vCenter Server systems
without using AD, vSphere Management Assistant stores credentials in its credential store. Adding
vSphere Management Assistant, ESXi hosts, and vCenter Server systems to an AD domain is more
secure because the credentials are stored in AD.
Configuring vSphere to use AD also has the advantage of using a single security model for both
virtual and nonvirtual environments.
Before you configure vSphere Management Assistant for use with AD, verify that the following
prerequisites are met:
• The DNS server configured for vSphere Management Assistant is the same as the DNS server
of the domain.
You can change the DNS server by using vSphere Management Assistant Console or the Web
User Interface.
• The domain is accessible from vSphere Management Assistant.



• You must be able to ping the ESXi hosts and the vCenter Server systems that you want to add to
vSphere Management Assistant.
• Ensure that pinging resolves the IP address to target_server_name.domain_name, where
domain_name is the domain to which vSphere Management Assistant is to be added.
For more information about configuring vSphere Management Assistant for AD authentication, see
vSphere Management Assistant Guide at https://www.vmware.com/support/developer/vima/.



Adding vSphere Management Assistant to Active Directory
Slide 3-23

The following command adds vSphere Management Assistant to an AD
domain:
sudo domainjoin-cli join domain_name domain_admin_user
The following command checks the domain settings of vSphere
Management Assistant:
sudo domainjoin-cli query

The following command adds a target system for AD authentication:
vifp addserver FQDN_of_server --authpolicy adauth --username ADDOMAIN\\user_ID
The following command removes vSphere Management Assistant from
the domain:
sudo domainjoin-cli leave

After you run sudo domainjoin-cli join and authenticate by using a domain administrator
user name and password, vSphere Management Assistant becomes a member of the domain. The
domainjoin-cli command also adds entries in the /etc/hosts file with the fully qualified
domain name of vSphere Management Assistant.
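
The double backslash in the --username argument is shell escaping: the shell passes a single
backslash to vifp. The following Python sketch illustrates this point when the command is built in
a script instead of typed at the prompt. It is an example under stated assumptions: the function
name is hypothetical, and the domain name VCLASS and the user administrator are placeholders, not
values from the lab environment.

```python
import shlex


def vifp_addserver_adauth(server_fqdn, ad_domain, user_id):
    """Build the argument list for adding a target that uses AD authentication."""
    return [
        "vifp", "addserver", server_fqdn,
        "--authpolicy", "adauth",
        # One literal backslash separates the domain from the user.
        "--username", ad_domain + "\\" + user_id,
    ]


argv = vifp_addserver_adauth("sa-esxi-01.vclass.local", "VCLASS", "administrator")

# The same command as typed at a shell prompt, with the backslash doubled.
typed = ("vifp addserver sa-esxi-01.vclass.local "
         "--authpolicy adauth --username VCLASS\\\\administrator")

# shlex.split applies POSIX shell escaping rules, so both forms agree.
print(shlex.split(typed) == argv)
print(" ".join(shlex.quote(arg) for arg in argv))
```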



vicfg-* Commands
Slide 3-24

• You can use vicfg-* commands to manage your storage, network, and host
configuration.
• For example, you can run the vicfg-vmknic -l command to display the IP
information of your VMkernel interfaces.
vi-admin@sa-vma-01:~[sa-esxi-01.vclass.local]> vicfg-vmknic -l
Interface  Port Group/DVPort  IP family  IP Address    Netmask        MAC Addre
vmk0       0                  IPv4       172.20.10.51  255.255.255.0  00:50:56:
vmk3       46                 IPv4       172.20.13.51  255.255.255.0  00:50:56:
vmk4       52                 IPv4       172.20.13.61  255.255.255.0  00:50:56:
vmk1       29                 IPv4       172.20.12.51  255.255.255.0  00:50:56:
vmk2       33                 IPv4       172.20.12.61  255.255.255.0  00:50:56:
vi-admin@sa-vma-01:~[sa-esxi-01.vclass.local]>

In addition to the ESXCLI commands, the vCLI command set also includes a set of commands with
the vicfg- prefix. For more information about each of these commands, use the --help option
with the vicfg command, for example, vicfg-route --help.



vmware-cmd Command
Slide 3-25

The vmware-cmd command is used for configuring virtual machines and
gathering information about them.
Following is a list of vmware-cmd operations:
VM Operations:
vmware-cmd <cfg> getstate
vmware-cmd <cfg> start <powerop_mode>
vmware-cmd <cfg> stop <powerop_mode>
vmware-cmd <cfg> reset <powerop_mode>
vmware-cmd <cfg> suspend <powerop_mode>
vmware-cmd <cfg> setguestinfo <variable> <value>
vmware-cmd <cfg> getguestinfo <variable>
vmware-cmd <cfg> getproductinfo <prodinfo>
vmware-cmd <cfg> connectdevice <device name>
vmware-cmd <cfg> disconnectdevice <device name>
vmware-cmd <cfg> getconfigfile
vmware-cmd <cfg> getuptime
vmware-cmd <cfg> answer
vmware-cmd <cfg> gettoolslastactive
vmware-cmd <cfg> hassnapshot
vmware-cmd <cfg> createsnapshot <name> <description> <quiesce> <memory>
vmware-cmd <cfg> revertsnapshot
vmware-cmd <cfg> removesnapshots

The vmware-cmd command is dedicated to performing operations on virtual machines.
Most operations that can be done using vSphere Web Client can also be done using vmware-cmd.
Getting help with vmware-cmd is similar to getting help with esxcli. On the command line, enter
vmware-cmd to display a list of available options and syntax help.

The path to the .vmx file must be provided in the command line for the command to work for a
specific virtual machine. For example, to unregister a virtual machine using vmware-cmd, you enter
vmware-cmd path_to_the_.vmx_file unregister.



Viewing Virtual Machine Information
Slide 3-26

The vmware-cmd -l command lists the virtual machines that are located on
the target host according to the path to their .vmx file.

vi-admin@sa-vma-01:~[sa-esxi-01.vclass.local]> vmware-cmd -l

/vmfs/volumes/54f7fff9-757c9064-548b-005056011403/linux-a-01/linux-a-01.vmx
/vmfs/volumes/560e8ea9-e4b51cd0-4c0d-005056011403/linux-a-02/linux-a-02.vmx
/vmfs/volumes/560e8ea9-e4b51cd0-4c0d-005056011403/linux-a-03/linux-a-03.vmx
/vmfs/volumes/560d0f97-f4de674a-fed0-005056011403/linux-a-04/linux-a-04.vmx

Assuming that the target host is already set, you can enter vmware-cmd -l on the command line to
list the registered virtual machines that reside on the target ESXi host. The command output displays
the physical path to the .vmx file, including the universally unique identifier (UUID) of the VMFS
datastore.
When a VMFS datastore is created, a UUID is created for the datastore. The UUID is used
internally by the ESXi host to uniquely identify a datastore. An alias is created when the datastore is
created, which is the label assigned to the VMFS datastore. The label provides a logical name that
you can provide to identify the datastore. Using the logical name as part of the .vmx file path makes
it easier to run vmware-cmd.
You can specify the path to the .vmx file in the following ways:
• The physical path to the .vmx file:
/vmfs/volumes/4f870db6-5ed5460c-e0c7-005056370612/Win02-A/Win02-A.vmx
• Replacing the UUID with the datastore label:
/vmfs/volumes/Shared/Win02-A/Win02-A.vmx
• Using brackets and quotation marks to represent the path:
"[Shared] Win02-A/Win02-A.vmx"

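The relationship between the bracketed form and the label-based path can be sketched in Python.
This is an illustration of the naming convention only. The function name is hypothetical, and the
sketch does not check whether the datastore or the file exists.

```python
import re

# "[label] relative/path.vmx": a datastore label in brackets, then the path.
BRACKET_PATH = re.compile(r"\[(?P<label>[^\]]+)\]\s+(?P<rest>.+)")


def bracket_path_to_vmfs(path):
    """Convert "[label] dir/file.vmx" to /vmfs/volumes/label/dir/file.vmx."""
    match = BRACKET_PATH.fullmatch(path.strip())
    if match is None:
        raise ValueError("not a bracketed datastore path: {!r}".format(path))
    return "/vmfs/volumes/{}/{}".format(match.group("label"), match.group("rest"))


print(bracket_path_to_vmfs("[Shared] Win02-A/Win02-A.vmx"))
```

The quotation marks in the bracketed form are needed at the shell prompt because the path contains
a space. They are not part of the path itself.
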


Viewing Snapshot Information
Slide 3-27

You can use the vmware-cmd path_to_.vmx_file hassnapshot command to
determine whether a virtual machine is currently using a snapshot. You can also
use the vmware-cmd command to perform snapshot operations.
Here, the = 1 indicates that the virtual machine is currently using a snapshot.

vi-admin@sa-vma-01:~[sa-esxi-01.vclass.local]> vmware-cmd /vmfs/volumes/560d0f97-f4de674a-fed0-005056011403/linux-a-04/linux-a-04.vmx hassnapshot
hassnapshot() = 1

The hassnapshot option of the vmware-cmd command determines whether a virtual machine is
using a snapshot. If the command returns 0, a snapshot is not present. If the command returns 1,
the virtual machine is using a snapshot.
Other command-line options for snapshots include creating snapshots (createsnapshot),
reverting snapshots (revertsnapshot), and removing snapshots (removesnapshots).
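
In a script, the hassnapshot result can be converted to a true or false value. The following
Python sketch assumes the output format shown on the slide, hassnapshot() = 1 or
hassnapshot() = 0. The function name is hypothetical, and other vCLI versions might format the
result differently.

```python
import re

HASSNAPSHOT_RESULT = re.compile(r"hassnapshot\(\)\s*=\s*([01])\s*$")


def parse_hassnapshot(output):
    """Return True if the output reports a snapshot, False if it reports none."""
    match = HASSNAPSHOT_RESULT.search(output.strip())
    if match is None:
        raise ValueError("unexpected hassnapshot output: {!r}".format(output))
    return match.group(1) == "1"


print(parse_hassnapshot("hassnapshot() = 1"))
```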



Direct Console, SSH, or vSphere Management Assistant
Slide 3-28

Use the direct console when the management network connection to an
ESXi host is down:
• vCenter Server commands (vicfg) are not available.
• Virtual machine commands (vmware-cmd) are not available.
SSH is a simple type of connection:
• Management network must be functioning.
• SSH and console access must be enabled.
• Available if vSphere Management Assistant is not installed and configured.
vSphere Management Assistant gives you a single command-line
platform to communicate with all of your ESXi hosts and your vCenter
Server systems:
• Preconfigure authentication to all ESXi hosts and vCenter Server systems.
• No need to open separate consoles to multiple systems.
• Use both vCenter Server commands and ESXi host commands.



Lab 2: Adding vSphere Management Assistant to Active
Directory
Slide 3-29

Configure vSphere Management Assistant to use Active Directory


1. Log In to vSphere Management Assistant
2. Add the vSphere Management Assistant Instance to an Active
Directory Domain
3. Configure the Target Server
4. Use the more and less Commands
5. Use vicfg-* Commands to Verify the Virtual Switch Information
6. Use vmware-cmd Commands to Verify the Virtual Machine
Information



Review of Learner Objectives
Slide 3-30

You should be able to meet the following objectives:


• Use the vSphere Management Assistant virtual appliance
• Use commands to view, configure, and manage your vSphere components
• Identify the tool for command-line interface troubleshooting



Lesson 3: Logging, Log Files, and vRealize Log Insight
Slide 3-31




Learner Objectives
Slide 3-32

By the end of this lesson, you should be able to meet the following
objectives:
• Find important log files
• Use VMware vSphere® Syslog Collector
• Use vRealize Log Insight for log aggregation, log analysis, and log search



Location of vCenter Server Logs
Slide 3-33

Most of the vCenter Server log files are on vCenter Server Appliance in
the /var/log/vmware/ directory.
• Subdirectories exist for vCenter Server components. For example, a
vpxd-svcs subdirectory exists for vCenter Server services logs.
• Some logs, such as FirstBoot.log, are located in /var/log.
root@sa-vcsa-01 [ /var/log/vmware ]# ls
applmgmt         rsyslogd       vmdird
applmgmt-audit   rsyslogd-2068  vmdnsd
cis-license      rsyslogd-2078  vmon
cloudvm          sca            vmware-imagebuilder
cm               sso            vmware-sps
content-library  syslog         vmware-updatemgr
eam              vapi           vpostgres
journal          vcha           vpxd
mbcs             vctop          vpxd-svcs
netdumper        vmafd          vsan-health
perfcharts       vmafdd         vsm
psc-client       vmcad          vsphere-client
rbd              vmcam          vsphere-ui
rhttpproxy       vmdir

For more information about the location of vCenter Server log files, see VMware knowledge base
article 2110014 at http://kb.vmware.com/kb/2110014.



Common Logs
Slide 3-34

Common logs are present on all vCenter Server systems.

Log Directory     Description
applmgmt          VMware Appliance Management Service
cloudvm           Logs for allotment and distribution of resources between
                  services
cm                VMware Component Manager
rhttpproxy        Reverse Web Proxy
sca               VMware Service Control Agent
vapi              VMware vAPI Endpoint
vmafdd            VMware Authentication Framework daemon
vmdird            VMware Directory Service daemon
vmon              VMware Service Lifecycle Manager
vsphere-client    VMware vSphere Web Client

These are some of the more important common logs.



Management Node Logs
Slide 3-35

Management node logs appear on vCenter Server Appliance systems
that are combined nodes (single-node deployment) or are third-party
nodes.

Log               Description
Auto Deploy       VMware vSphere Auto Deploy Waiter
content-library   VMware Content Library Service
EAM               VMware ESX Agent Manager
InvSvc            VMware Inventory Service
vmcam             VMware vSphere Authentication Proxy
vpxd              VMware VirtualCenter Server
vPostgres         vFabric Postgres database service

These are some of the more important management node logs.



Platform Services Controller Logs
Slide 3-36

VMware Platform Services Controller™ logs appear on VMware Platform
Services Controller nodes.

Log               Description
cis-license       VMware Licensing Service
sso               VMware Secure Token Service
vmcad             VMware Certificate Authority daemon



Important vCenter Server Logs for Troubleshooting
Slide 3-37

The vpxd.log file is the main log file for vCenter Server. Most vCenter
Server actions are captured in vpxd.log. It is located in
/var/log/vmware/vpxd/.
2017-03-03T17:37:30.809Z info vpxd[7FD78F66C700] [Originator@6876 sub=vpxLro opID=urn:vmomi:VirtualMachine:vm-128:596e26b2-df48-4c88-b39c-eb702ba52e66.properties:01-3d] [VpxLRO] -- BEGIN lro-24393 -- ResourceModel -- cis.data.provider.ResourceModel.query -- 5266df8c-2f36-e3c4-1d59-c47466b43db1(52723af4-e7c1-92c5-02c8-4108140b6ba2)
2017-03-03T17:37:30.810Z info vpxd[7FD78F66C700] [Originator@6876 sub=vpxLro opID=urn:vmomi:VirtualMachine:vm-128:596e26b2-df48-4c88-b39c-eb702ba52e66.properties:01-3d] [VpxLRO] -- FINISH lro-24393
2017-03-03T17:37:30.813Z info vpxd[7FD78F66C700] [Originator@6876 sub=vpxLro opID=urn:vmomi:VirtualMachine:vm-128:596e26b2-df48-4c88-b39c-eb702ba52e66.properties:01-ef] [VpxLRO] -- BEGIN lro-24394 -- ResourceModel -- cis.data.provider.ResourceModel.query -- 5266df8c-2f36-e3c4-1d59-c47466b43db1(52723af4-e7c1-92c5-02c8-4108140b6ba2)
2017-03-03T17:37:30.813Z info vpxd[7FD78F66C700] [Originator@6876 sub=vpxLro opID=urn:vmomi:VirtualMachine:vm-128:596e26b2-df48-4c88-b39c-eb702ba52e66.properties:01-ef] [VpxLRO] -- FINISH lro-24394
2017-03-03T17:37:30.815Z info vpxd[7FD78FE7C700] [Originator@6876 sub=vpxLro opID=urn:vmomi:VirtualMachine:vm-128:596e26b2-df48-4c88-b39c-eb702ba52e66.properties:01-df] [VpxLRO] -- BEGIN lro-24395 -- ResourceModel -- cis.data.provider.ResourceModel.query -- 5266df8c-2f36-e3c4-1d59-c47466b43db1(52723af4-e7c1-92c5-02c8-4108140b6ba2)
2017-03-03T17:37:30.815Z info vpxd[7FD78FE7C700] [Originator@6876 sub=vpxLro opID=urn:vmomi:VirtualMachine:vm-128:596e26b2-df48-4c88-b39c-eb702ba52e66.properties:01-df] [VpxLRO] -- FINISH lro-24395
2017-03-03T17:37:30.816Z info vpxd[7FD78FE7C700] [Originator@6876 sub=vpxLro opID=urn:vmomi:VirtualMachine:vm-128:596e26b2-df48-4c88-b39c-eb702ba52e66.properties:01-67] [VpxLRO] -- BEGIN lro-24396 -- ResourceModel -- cis.data.provider.ResourceModel.query -- 5266df8c-2f36-e3c4-1d59-c47466b43db1(52723af4-e7c1-92c5-02c8-4108140b6ba2)
2017-03-03T17:37:30.816Z info vpxd[7FD78FE7C700] [Originator@6876 sub=vpxLro opID=urn:vmomi:VirtualMachine:vm-128:596e26b2-df48-4c88-b39c-eb702ba52e66.properties:01-67] [VpxLRO] -- FINISH lro-24396
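
Each vpxd.log entry in the excerpt follows the same pattern: timestamp, severity, thread, a
bracketed header with sub= and opID= fields, and the message. The following Python sketch splits
an entry into those fields so that entries can be filtered by severity or opID. It is a minimal
example based only on the entries shown on the slide. The regular expression and function name
are assumptions, the sample is a cleaned-up transcription of one slide entry, and other entry
layouts might not match.

```python
import re

VPXD_LINE = re.compile(
    r"^(?P<timestamp>\S+)\s+"
    r"(?P<level>\w+)\s+"
    r"vpxd\[(?P<thread>[0-9A-Fa-f]+)\]\s+"
    r"\[Originator@\d+\s+sub=(?P<sub>[^\s\]]+)"
    r"(?:\s+opID=(?P<opid>[^\]]+))?\]\s*"
    r"(?P<message>.*)$"
)

SAMPLE_LINE = (
    "2017-03-03T17:37:30.810Z info vpxd[7FD78F66C700] "
    "[Originator@6876 sub=vpxLro opID=urn:vmomi:VirtualMachine:vm-128:"
    "596e26b2-df48-4c88-b39c-eb702ba52e66.properties:01-3d] "
    "[VpxLRO] -- FINISH lro-24393"
)


def parse_vpxd_line(line):
    """Return the fields of one vpxd.log entry as a dict, or None if no match."""
    match = VPXD_LINE.match(line.strip())
    return match.groupdict() if match else None


entry = parse_vpxd_line(SAMPLE_LINE)
print(entry["timestamp"], entry["level"], entry["sub"], entry["message"])
```

Matching BEGIN and FINISH messages that share the same lro number shows how long a task took.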



Viewing vCenter Server Log Files in vSphere Web Client
Slide 3-38

You can use vSphere Web Client to view log files on vCenter Server
systems. Use the Monitor tab for vCenter Server systems.
[Screenshot: vSphere Web Client, sa-vcsa-01.vclass.local, Monitor tab, System Logs view. The vCenter Server log (vpxd.log) is selected and the Export System Logs button is available. The viewer shows 4000 of 46629 lines, with the Show line numbers, Show Next 2000 Lines, and Show All Lines controls.]



Location of ESXi Host Logs
Slide 3-39

The ESXi host log files are on the ESXi host at /var/log.
[root@sa-esxi-01:/var/log] ls
Xorg.log              ipmi                  vitd.log
auth.log              jumpstart-stdout.log  vmauthd.log
boot.gz               kickstart.log         vmkdevmgr.log
clomd.log             lacp.log              vmkernel.log
configRP.log          nfcd.log              vmkeventd.log
ddecomd.log           osfsd.log             vmksummary.log
dhclient.log          rabbitmqproxy.log     vmkwarning.log
epd.log               rhttpproxy.log        vmware
esxcli-software.log   sdrsinjector.log      vmware-vmsvc.log
esxupdate.log         shell.log             vobd.log
fdm.log               smbios.bin            vprobe.log
hbrca.log             storagerm.log         vpxa.log
hostd-probe.log       swapobjd.log          vsanmgmt.log
hostd.log             sysboot.log           vsansystem.log
hostdCgiServer.log    syslog.log            vsanvpd.log
hostprofiletrace.log  tallylog              vvold.log
iofilter-init.log     upitd.log
iofiltervpd.log       usb.log

Most ESXi host log files are under the /var/log directory.
If persistent scratch space is configured, many of these logs are on the scratch volume,
/scratch/log. The /var/log directory contains symbolic links (which are identified as light
blue names in the output) to log files in /scratch/log. Run the ls -l command to see the log
files that these symbolic links point to.
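
The symbolic-link check described above can be sketched as a small shell function. This is a minimal sketch, not a VMware tool: the function name list_log_links is hypothetical, and it assumes that the readlink utility is available in the shell.

```shell
#!/bin/sh
# List each symbolic link in a log directory together with its target.
# Usage: list_log_links [directory]   (defaults to /var/log)
# Note: list_log_links is an illustrative name, not a built-in command.
list_log_links() {
    dir="${1:-/var/log}"
    for entry in "$dir"/*; do
        # -L is true only for symbolic links (including dangling ones)
        if [ -L "$entry" ]; then
            printf '%s -> %s\n' "${entry##*/}" "$(readlink "$entry")"
        fi
    done
}
```

Running list_log_links with no argument reports the links in /var/log, which gives the same information as reading the ls -l output line by line.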



Useful ESXi Host Logs for Troubleshooting
Slide 3-40

ESXi hosts write to multiple log files, depending on which action is being
performed.

Log File          Purpose

hostd.log         Host management service logs
syslog.log        Management service initialization, watchdogs,
                  scheduled tasks, and DCUI use
vmkernel.log      Core VMkernel logs, including device discovery,
                  storage and networking device and driver events,
                  and virtual machine startups
vmkwarning.log    A summary of warning and alert log messages
                  excerpted from the VMkernel logs
vmksummary.log    A summary of ESXi host startup and shutdown, and
                  an hourly heartbeat with uptime, number of virtual
                  machines running, and service resource
                  consumption

The table provides the name and description of ESXi host log files that are useful for
troubleshooting.



Viewing Log Files in the DCUI
Slide 3-41

You can use the DCUI to view log files if vCenter Server is not available.
Only the log files for a single ESXi host can be viewed in the DCUI.

[Screenshot: DCUI System Customization menu with View System Logs selected. The View System Logs pane lists:
<1> Syslog
<2> Vmkernel
<3> Config
<4> Management Agent (hostd)
<5> VirtualCenter Agent (vpxa)
<6> VMware ESXi Observation log (vobd)
Press the corresponding key to view a log. Press <Q> to return to this screen.]

In the ESXi host console, press F2 and log in to the DCUI, using the root user name and password.
Press the appropriate key to search the log file (/), request help (H), or quit viewing the log file (Q).



vSphere Syslog Collector
Slide 3-42

vSphere Syslog Collector provides a single location for all ESXi hosts to
write log files.
vSphere Syslog Collector enables logs from multiple hosts to be
combined and provides a structure for network logging.
vSphere Syslog Collector is preinstalled on vCenter Server Appliance.
By default, the vSphere Syslog Collector server uses port 514 for TCP
and UDP, and port 1514 for SSL.

VMware vSphere® Syslog Collector provides a unified architecture for system logging. With
vSphere Syslog Collector, you can direct logs from ESXi hosts to a server on the network, rather
than save the logs to a local disk.
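
Directing a host's logs to a collector can be sketched as follows. The helper name loghost_url and the server name syslog.example.com are hypothetical; the default ports follow the slide (514 for TCP and UDP, 1514 for SSL). The esxcli lines appear only as comments because they must be run on an ESXi host.

```shell
#!/bin/sh
# Build a loghost URL from a protocol, a server, and an optional port.
# Defaults follow the slide: 514 for TCP and UDP, 1514 for SSL.
# loghost_url is an illustrative helper, not part of esxcli.
loghost_url() {
    proto="$1"; server="$2"; port="$3"
    case "$proto" in
        tcp|udp) : "${port:=514}" ;;
        ssl)     : "${port:=1514}" ;;
        *)       echo "unsupported protocol: $proto" >&2; return 1 ;;
    esac
    printf '%s://%s:%s\n' "$proto" "$server" "$port"
}

# On the ESXi host (syslog.example.com is a placeholder name):
#   esxcli system syslog config set --loghost="$(loghost_url tcp syslog.example.com)"
#   esxcli system syslog reload
#   esxcli network firewall ruleset set --ruleset-id=syslog --enabled=true
```

The firewall ruleset line is included because outbound syslog traffic might otherwise be blocked by the host firewall.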



vRealize Log Insight
Slide 3-43

Machine-generated log data is typically massive in scale, difficult to
manage, and overwhelming.
vRealize Log Insight provides a single location to collect, store, and
analyze logs at scale.
vRealize Log Insight is a log management solution for physical, virtual,
and cloud environments.
vRealize Log Insight helps monitor events and metrics. It analyzes root
causes and performs intelligent grouping, which helps you to
troubleshoot efficiently.
vRealize Log Insight provides faster analytical queries and aggregation
than traditional tools, especially on larger data sets. It adds structure to
all types of unstructured log data, so administrators can troubleshoot
quickly.



Searching and Filtering Log Events
Slide 3-44

You can search and filter log events by specifying the keywords, time
range, field operations, and so on.

[Screenshot: vRealize Log Insight search for "scsi performance" with the time range menu open (Latest hour of data, Latest 6 hours of data, Latest 24 hours of data, Latest 7 days of data, All time, Custom time range). The results list scsiCorrelator events from ESXi hosts reporting that device performance has improved and I/O latency has decreased.]

Event types grouping uses machine learning to group similar events
together, making root cause analysis and troubleshooting faster and easier.
You can view events in context and analyze event trends to uncover
anomalies.
You can also save, export, rename, share, or delete a query.

Often, without intelligent log analysis tools, you do not notice log errors until they affect production
workloads and the business. With VMware vRealize® Log Insight™, you can uncover patterns that
can ultimately lead to problems and take action when these patterns arise.
You can enter any complete keywords, globs, or phrases in the search text box to find only events
that contain the specified keywords. You can search for log events that match certain values of
specific fields. Time ranges are inclusive when filtering. In the query example shown in the slide,
you can see log entries related to SCSI performance for the past 24-hour period. Filtering log events
helps you narrow down the troubleshooting scope by displaying only relevant information.
You can use the list of existing fields to search log events with specific values for a field. You can
also search the list of log events for events that occurred before, after, and around an event in the
list. You can analyze log events for trends and anomalies.
vRealize Log Insight enables a troubleshooter to view the context of a log event and browse the log
events that arrived before and after it. If you want to know more about the status of your
environment before and after an event, you can check the surrounding events.
You can take snapshots of queries and save them to dashboards.



Analyzing Logs with the Interactive Analytics Charts
Slide 3-45

The charts provide visual representation of data and enable you to
perform visual analysis on your query results.

[Screenshot: Interactive Analytics chart with the matching hostd and vpxa log events listed below it.]

You can select different chart types to graphically analyze log events.
You can modify the aggregation and grouping of query results to
correlate events and make the chart meaningful for troubleshooting.

vRealize Log Insight provides faster analytical queries and aggregation than traditional tools,
especially on larger data sets. vRealize Log Insight adds structure to all types of unstructured log
data, so that administrators can troubleshoot quickly, without needing to know the data beforehand.
You can select different chart types to change the way data is visualized on the Interactive Analytics
page. Different chart types require different aggregation functions, the use of time series, and group-
by fields. You can use multifunction charts to compare variables that are not the same scale. You can
change how charts look, add charts to your custom dashboards, and manage dashboard charts.



Dynamic Field Extraction
Slide 3-46

When facing a large environment with numerous log events, you can
locate the data fields that are important to you, and then extract and save them.

[Screenshot: search for "scsi" with the Fields pane open. In a scsiCorrelator event from esx-03a.corp.local that reports "performance has deteriorated", the I/O latency value is highlighted. A context menu offers Contains and Does not contain filters, and the field extraction dialog shows the extracted value, its type (Integer), and its pre and post context (the value is followed by "microseconds").]

Log field extraction turns unstructured data
to structured data that is meaningful for
troubleshooters.

In a large environment with numerous log events, you cannot always locate the data fields that are
important to you. vRealize Log Insight provides runtime field extraction to address this problem.
You can extract any field dynamically from the data by providing a regular expression.
In the example shown on the slide, the log search was based on the scsi performance deteriorated
keywords. You can now highlight the number representing a higher latency value and then extract
the field.
Field extraction is a very flexible feature that takes any event log and turns it into structured
data, which you can include in your analysis.
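
The idea of extracting a field with a regular expression can be illustrated outside vRealize Log Insight with sed. This is only a sketch of the concept, not how the product implements extraction. The function name extract_scsi_latency is hypothetical, and it assumes the ESXi message wording "performance has deteriorated. I/O latency increased from average value of N microseconds to M microseconds".

```shell
#!/bin/sh
# Read log lines on stdin and print "<device> <old latency> <new latency>"
# for each SCSI latency-deterioration event. Lines that do not match the
# pattern produce no output. extract_scsi_latency is an illustrative name.
extract_scsi_latency() {
    sed -n 's|.*\(naa\.[0-9a-f]*\) performance has deteriorated\. I/O latency increased from average value of \([0-9][0-9]*\) microseconds to \([0-9][0-9]*\) microseconds.*|\1 \2 \3|p'
}
```

For example, grep deteriorated /var/log/vobd.log | extract_scsi_latency would turn free-form messages into three whitespace-separated columns that can be sorted or compared numerically.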



Troubleshooting Using Customized Dashboards
Slide 3-47

vRealize Log Insight Dashboards are collections of chart, field table, and
query list widgets.
You may customize dashboards by adding, modifying, and deleting them
and tailor them to your troubleshooting needs.
For example, you can save a filtered query to your custom dashboard by
creating a query list widget.

[Screenshot: query "scsi performance deteriorated" with a filter on the extracted SCSI latency field (value 1000000) over a custom time range.]



Monitoring Log Events and Sending Alerts
Slide 3-48

You can run specific alert queries at scheduled intervals. When the query
exceeds the preconfigured threshold, alerts can be sent to your email or
VMware vRealize® Operations Manager™.

[Screenshot: query "spooler" over the latest hour of data, filtered on the hostname win7-01a. The matching events read "The Print Spooler service entered the stopped state." The New Alert dialog names the alert "win7-01a Spooler Issue" and offers two Raise an alert options: On any match, or When more than a set number of matches are found in the last 5 minutes.]

You can be proactive by configuring vRealize Log Insight to run specific queries at scheduled
intervals.
If the number of events that match the query exceeds the thresholds that you have set, vRealize Log
Insight can send email notifications and trigger notification events in VMware vRealize®
Operations Manager™.
For example, the spooler service is stuck in a state of flux and needs to be fixed. You can create a
query that captures this pattern, and send an alert to the vRealize Operations dashboard for the
affected virtual machine named Win7-01a. Thus, the operations team becomes aware of and
resolves the problem in a timely fashion.
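
The threshold rule behind such an alert ("raise an alert when more than N matches are found") can be sketched in a few lines of shell. This is a conceptual sketch only, not the vRealize Log Insight implementation. The function name should_alert is hypothetical, and the input is assumed to contain only the events from the time window of interest.

```shell
#!/bin/sh
# Count the lines on stdin that match a pattern, print the count, and
# succeed (exit status 0) only when the count exceeds the threshold.
# should_alert is an illustrative name; the time window is not modeled.
should_alert() {
    pattern="$1"; threshold="$2"
    # grep -c prints 0 but exits non-zero when nothing matches
    count="$(grep -c -- "$pattern" || true)"
    echo "$count"
    [ "$count" -gt "$threshold" ]
}
```

A threshold of 0 corresponds to the On any match option, because a single matching event already exceeds it.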



Lab 3: Searching Log Files
Slide 3-49

Search log files for events


1. Modify the vSphere Environment
2. Extract log files from vCenter Server
3. Search Log Files for Event Information
4. Clean Up for the Next Lab



Lab 4: Searching Log Files
Slide 3-50

Use vRealize Log Insight to monitor the health of vSphere systems


1. Log In to vRealize Log Insight
2. Search and Filter Log Events
3. Use Interactive Analytics Charts
4. Use Dynamic Field Extraction



Review of Learner Objectives
Slide 3-51

You should be able to meet the following objectives:


• Find important log files
• Use vSphere Syslog Collector
• Use vRealize Log Insight for log aggregation, log analysis, and log search



Key Points
Slide 3-52

• You can use vSphere ESXi Shell and vSphere Management Assistant to run
commands.
• You can use vCLI in vSphere Management Assistant to view and troubleshoot
the system configuration.
• Log files are useful when trying to resolve vSphere problems.
• vSphere offers various features that enable you to collect, search, and export
log files.
• vSphere Syslog Collector provides a single location for all ESXi hosts to write
log files.
• vRealize Log Insight provides a single location to collect, store, and analyze all
types of machine-generated log data.
• vRealize Log Insight provides fast analytical queries and aggregation,
especially on large data sets. As a result, the troubleshooting time is reduced
and operational efficiency is improved.
Questions?



MODULE 4
Troubleshooting Virtual Networking
Slide 4-1

You Are Here
Slide 4-2

1. Course Introduction
2. Introduction to Troubleshooting
3. Troubleshooting Tools
4. Troubleshooting Virtual Networking
5. Troubleshooting Storage
6. Troubleshooting vSphere Clusters
7. Troubleshooting Virtual Machines
8. Troubleshooting vCenter Server and ESXi



Importance
Slide 4-3

Networks are used to access or control nearly every component in the
vSphere environment. When a network problem occurs, you must quickly
diagnose and resolve it to minimize the negative user impact.



Learner Objectives
Slide 4-4

By the end of this module, you should be able to meet the following
objectives:
• Provide a network troubleshooting overview
• Analyze and troubleshoot standard switch problems
• Analyze and troubleshoot virtual machine connectivity problems
• Analyze and troubleshoot management network problems
• Analyze and troubleshoot distributed switch problems



Networking Troubleshooting Overview
Slide 4-5

In vSphere, networking problems can occur with the following types of
connectivity:
• Virtual switch connectivity:
- Standard switches
- Distributed switches
• Virtual machine network connectivity
• ESXi host management network connectivity
A vSphere administrator should know how to troubleshoot common
networking issues.

Networks are used to access or control nearly every component in a vSphere environment. Virtual
switches, whether standard or distributed, provide ESXi host networking capabilities. Virtual
machines connect to virtual switches for access to both internal and external networks.
The management networks also use virtual switches for connectivity and are very important. Loss of
management network connectivity prevents important functions from occurring, such as allowing
ESXi hosts to be managed by VMware vCenter Server™.



Review of Standard Switch
Slide 4-6

If a virtual machine loses network connectivity, the cause of the problem
might be anywhere from the virtual machine's NIC to the ESXi host's
physical network.

[Diagram: an ESXi host with a standard switch that connects virtual NICs, the management network, and IP storage to a team of two physical NICs, vmnic0 and vmnic1.]

All network communication that is handled by a host passes through one or more virtual switches. A
virtual switch provides connections for virtual machines to communicate with one another, whether
they run on the same host or on a different host. A virtual switch allows connections for the
management and migration networks as well as connections to access IP storage.
You can also configure settings on virtual machine port groups, such as allowing promiscuous
mode, and you can team uplinks to aggregate bandwidth.



Network Problem 1
Slide 4-7

As an initial check from vSphere ESXi Shell, ping a system that is known
to be up and accessible by the ESXi host.

[Screenshot: vSphere ESXi Shell session. After logging in as root, the user pings another system. The ping statistics report 3 packets transmitted, 0 packets received, 100% packet loss.]

If your ESXi host experiences intermittent or no network connectivity, first try to ping a
system from your ESXi host. Choose a system that is active and that your ESXi host can access.
You can use an SSH client, such as PuTTY, to log in to your ESXi host and get to the command line.
Ensure that the SSH service is enabled in your ESXi host's security profile.
If you cannot open a PuTTY session, you can use the Direct Console User Interface (DCUI) to
get a command line (press Alt+F1 from the main DCUI screen). Ensure that vSphere ESXi Shell is enabled.
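
The ping check can be scripted by reading the packet loss figure from the statistics line. This is a minimal sketch: the function name packet_loss is hypothetical, the address 192.0.2.10 is a placeholder, and the parsing assumes a statistics line that ends in "N% packet loss", as in the slide.

```shell
#!/bin/sh
# Read ping output on stdin and print the packet loss percentage
# taken from the statistics line. packet_loss is an illustrative name.
packet_loss() {
    sed -n 's/.* \([0-9][0-9.]*\)% packet loss.*/\1/p'
}

# On the ESXi host (192.0.2.10 is a placeholder address):
#   ping -c 3 192.0.2.10 | packet_loss
#   vmkping -I vmk0 192.0.2.10 | packet_loss   # test through a specific VMkernel port
```

A result of 100 means that no replies were received. A value between 0 and 100 suggests intermittent connectivity.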



Identifying Possible Causes
Slide 4-8

If you know that your hardware is functioning correctly, take the top-down
approach to troubleshooting, starting with the ESXi host configuration.

                            Possible Causes

ESXi Host                   The ESXi host network configuration is incorrect.
                            The VLAN ID of the port group is incorrect.
                            The speed and duplex of the network links are not
                            consistent.
                            The network link is down.
                            NIC teaming is not configured properly.

Hardware                    The network adapter or server hardware is not supported.
(Network, Server)           The physical hardware is faulty or misconfigured.
                            Network performance is slow.

When identifying possible causes, take a structured approach. For this issue, you might start with the
ESXi host. Check the host's configuration. If the host's configuration is correct, then check for
hardware problems.
For information about how to troubleshoot ESXi hosts that have intermittent or no network
connectivity, see VMware knowledge base article 1004109 at http://kb.vmware.com/kb/1004109.



Possible Cause: ESXi Network Misconfiguration (1)
Slide 4-9

Verify that your ESXi host network is configured properly:
• Check vSphere standard switches, vmnics, port groups, and VMkernel ports:
- In vSphere Management Assistant, use vicfg-vswitch -l
- In vSphere ESXi Shell, use esxcfg-vswitch -l and esxcfg-vmknic -l

[Screenshot: output of the esxcfg-vswitch -l and esxcfg-vmknic -l commands.]

• Check VLAN IDs of port groups:
- esxcli network vswitch standard portgroup list

Verify that the components in your ESXi network configuration are configured correctly.
From vSphere Management Assistant, use the vicfg-vswitch command to list information about
your standard and distributed switches, your vmnics, and your port groups.
From vSphere ESXi Shell, use the esxcli command to list information about each of your port
groups and their assigned VLAN IDs.
The first command output in the slide shows the vmnic0 uplink available to both port groups.
The second command output in the slide shows a VMkernel port that is manually disabled
with the esxcfg-vmknic command. To re-enable the VMkernel port, use the esxcfg-vmknic -e
command.
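
To review VLAN assignments quickly, the port group listing can be reduced to one line per port group. This is a minimal sketch: the function name portgroup_vlans is hypothetical, and it assumes that the listing has the columns Name, Virtual Switch, Active Clients, and VLAN ID, with two header lines, and that virtual switch names contain no spaces. Port group names may contain spaces.

```shell
#!/bin/sh
# Read `esxcli network vswitch standard portgroup list` output on stdin and
# print "VLAN<TAB>vSwitch<TAB>port group name" for each port group.
# The last three fields are fixed columns; everything before them is the name.
# portgroup_vlans is an illustrative name.
portgroup_vlans() {
    awk 'NR > 2 && NF >= 4 {
        vlan = $NF
        vswitch = $(NF - 2)
        name = $1
        for (i = 2; i <= NF - 3; i++) name = name " " $i
        printf "%s\t%s\t%s\n", vlan, vswitch, name
    }'
}

# On the ESXi host:
#   esxcli network vswitch standard portgroup list | portgroup_vlans
```

Comparing the first column with the VLAN IDs configured on the physical switch ports is a quick way to spot a mismatched port group.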



Possible Cause: ESXi Network Misconfiguration (2)
Slide 4-10

Verify that your ESXi host network is configured properly:
• Speed and duplex:
- vicfg-nics -l
• Network uplink and NIC status (up or down):
- vicfg-nics -l
- esxcli network nic list

From vSphere Management Assistant, use the vicfg-nics command to check the network
adapter's speed and duplex as well as the link status.
The command output in the slide shows a vmnic that is manually brought down by the
esxcli network nic down -n command. You can manually bring the vmnic up. For example,
to bring up vmnic2, use the esxcli network nic up -n vmnic2 command.
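
On a host with many adapters, the NIC listing can be filtered so that only problem adapters remain. This is a minimal sketch: the function name nics_not_up is hypothetical, and it assumes the ESXi 6.5 column order Name, PCI Device, Driver, Admin Status, Link Status, preceded by two header lines.

```shell
#!/bin/sh
# Read `esxcli network nic list` output on stdin and print every NIC whose
# administrative status or link status is not "Up".
# Fields: $1 = name, $4 = admin status, $5 = link status.
# nics_not_up is an illustrative name.
nics_not_up() {
    awk 'NR > 2 && ($4 != "Up" || $5 != "Up") { print $1, "admin=" $4, "link=" $5 }'
}

# On the ESXi host:
#   esxcli network nic list | nics_not_up
```

An adapter that reports admin=Down was brought down manually and can be restored with esxcli network nic up -n. An adapter that reports admin=Up but link=Down points to cabling or the physical switch port.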



Resolving ESXi Network Misconfiguration
Slide 4-11

Adjust any settings in your ESXi network configuration that are not configured
properly:
• Standard switches, vmnics, port groups:
- Add standard switch: vicfg-vswitch -a vswitch#
- Add port group: vicfg-vswitch -A pg_name vswitch#
- Add uplink: vicfg-vswitch -L vmnic# vswitch#
• VLAN IDs of port groups:
- esxcli network vswitch standard portgroup set
-p pg_name -v vlan_ID
• Speed and duplex:
- vicfg-nics -d duplex -s speed vmnic#
• Network link status (up or down):
- Connect network adapters to the intended physical switch ports.

To edit your ESXi network configuration, you can use the same commands: vicfg-vswitch,
esxcli, and vicfg-nics.

For example, to add a virtual switch named vSwitch5, run the following command from vSphere
Management Assistant:
vicfg-vswitch -a vSwitch5
To add a port group named Production to vSwitch5, run the following command from vSphere
Management Assistant:
vicfg-vswitch -A Production vSwitch5
To add the uplink, vmnic4, to the standard switch named vSwitch5, run the following command
from vSphere Management Assistant:
vicfg-vswitch -L vmnic4 vSwitch5
To set the VLAN ID of the port group Production to ID 34, run the following command from the
ESXi command line:
esxcli network vswitch standard portgroup set -p Production -v 34
To set the speed of vmnic3 to 10,000 Mbps and its duplex to full, run the following command from vSphere
Management Assistant:
vicfg-nics -s 10000 -d full vmnic3
You can also use -a to set the speed and duplex settings to autonegotiate.



Possible Cause: NIC Teaming Misconfiguration
Slide 4-12

Verify that NIC teaming is configured properly.


[Screenshot: Production-A - Edit Settings dialog, Teaming and failover page. The Load balancing menu lists Route based on source MAC hash, Route based on originating virtual port, Use explicit failover order, and Route based on physical NIC load. The page also shows the Network failure detection, Notify switches, and Failback settings. Under Failover order, Uplink 3 and Uplink 4 are listed as active uplinks.]

When setting up NIC teaming, you can configure settings such as the load-balancing policy and the
failover order.
If you are using NIC teaming on the virtual switch, verify that the physical switch ports are
configured consistently for each teamed network adapter. Also verify that the proper load-balancing
policy is configured on the virtual switch. VMware recommends that you use the default
load-balancing policy, Route Based On The Originating Virtual Port ID. If link aggregation is
configured on the physical switch, use the Route Based On IP Hash load-balancing policy.
To use some adapters but reserve others for emergencies, you can use the Failover Order conditions
to specify how to distribute the workload for the network adapters:
• Active adapters: Continue to use the adapter when the network adapter connectivity is available
and active.
• Standby adapters: Use this adapter if the connectivity of one of the active adapters is unavailable.
• Unused adapters: Do not use this adapter.
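
The load-balancing guidance above can be sketched as a small helper that returns the esxcli value for the recommended policy. This is a minimal sketch for standard switches: the function name recommended_lb_policy is hypothetical, vSwitch0 is a placeholder, and the esxcli lines appear only as comments because they must be run on an ESXi host.

```shell
#!/bin/sh
# Map the physical switch configuration to an esxcli load-balancing value.
# $1 = "yes" if link aggregation is configured on the physical switch ports;
#      any other value means that it is not.
# recommended_lb_policy is an illustrative name.
recommended_lb_policy() {
    if [ "$1" = "yes" ]; then
        echo iphash    # Route Based On IP Hash
    else
        echo portid    # Route Based On The Originating Virtual Port ID (default)
    fi
}

# On the ESXi host (vSwitch0 is a placeholder):
#   esxcli network vswitch standard policy failover get -v vSwitch0
#   esxcli network vswitch standard policy failover set -v vSwitch0 -l "$(recommended_lb_policy yes)"
```

Running the get command first shows the current policy and the active, standby, and unused adapters, so you can confirm a mismatch before changing anything.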



Possible Cause: Unsupported or Faulty Hardware
Slide 4-13

Verify that you are not encountering the following ESXi network hardware
issues:
• The network adapter or server hardware is not supported:
- vicfg-nics -l
- Verify that the network hardware is listed in VMware Compatibility Guide.
• The physical hardware is faulty or misconfigured:
- esxcfg-vswitch, vicfg-vswitch, or esxcli

vi-admin@sa-vma-01:~[sa-esxi-01.vclass.local]> vicfg-nics -l
Name    PCI           Driver  Link  Speed     Duplex  MAC Address        MTU   Description
vmnic0  0000:02:00.0  e1000   Up    1000Mbps  Full    00:50:56:01:c1:cb  1500  Intel Corporation 82545EM Gigabit Ethernet Controller (Copper)
vmnic1  0000:02:01.0  e1000   Up    1000Mbps  Full    00:50:56:01:c1:cc  1500  Intel Corporation 82545EM Gigabit Ethernet Controller (Copper)
vmnic2  0000:02:02.0  e1000   Up    1000Mbps  Full    00:50:56:01:c1:cd  1500  Intel Corporation 82545EM Gigabit Ethernet Controller (Copper)
vmnic3  0000:02:03.0  e1000   Up    1000Mbps  Full    00:50:56:01:c1:ce  1500  Intel Corporation 82545EM Gigabit Ethernet Controller (Copper)
vmnic4  0000:02:05.0  e1000   Up    1000Mbps  Full    00:50:56:01:c1:cf  1500  Intel Corporation 82545EM Gigabit Ethernet Controller (Copper)
vmnic5  0000:02:06.0  e1000   Up    1000Mbps  Full    00:50:56:01:c1:d0  1500  Intel Corporation 82545EM Gigabit Ethernet Controller (Copper)
vmnic6  0000:02:07.0  e1000   Up    1000Mbps  Full    00:50:56:01:c1:d1  1500  Intel Corporation 82545EM Gigabit Ethernet Controller (Copper)
vmnic7  0000:02:08.0  e1000   Up    1000Mbps  Full    00:50:56:01:c1:d2  1500  Intel Corporation 82545EM Gigabit Ethernet Controller (Copper)

vi-admin@sa-vma-01:~[sa-esxi-01.vclass.local]> esxcli network nic list
Name    PCI Device    Driver  Admin Status  Link Status  Speed  Duplex  MAC Address         MTU  Description
------  ------------  ------  ------------  -----------  -----  ------  -----------------  ----  -----------------------------------------------------------------
vmnic0  0000:02:00.0  e1000   Up            Up            1000  Full    00:50:56:01:c1:cb  1500  Intel Corporation 82545EM Gigabit Ethernet Controller (Copper)
vmnic1  0000:02:01.0  e1000   Up            Up            1000  Full    00:50:56:01:c1:cc  1500  Intel Corporation 82545EM Gigabit Ethernet Controller (Copper)
vmnic2  0000:02:02.0  e1000   Up            Up            1000  Full    00:50:56:01:c1:cd  1500  Intel Corporation 82545EM Gigabit Ethernet Controller (Copper)
vmnic3  0000:02:03.0  e1000   Up            Up            1000  Full    00:50:56:01:c1:ce  1500  Intel Corporation 82545EM Gigabit Ethernet Controller (Copper)
vmnic4  0000:02:05.0  e1000   Up            Up            1000  Full    00:50:56:01:c1:cf  1500  Intel Corporation 82545EM Gigabit Ethernet Controller (Copper)
vmnic5  0000:02:06.0  e1000   Up            Up            1000  Full    00:50:56:01:c1:d0  1500  Intel Corporation 82545EM Gigabit Ethernet Controller (Copper)
vmnic6  0000:02:07.0  e1000   Up            Up            1000  Full    00:50:56:01:c1:d1  1500  Intel Corporation 82545EM Gigabit Ethernet Controller (Copper)
vmnic7  0000:02:08.0  e1000   Up            Up            1000  Full    00:50:56:01:c1:d2  1500  Intel Corporation 82545EM Gigabit Ethernet Controller (Copper)

From vSphere Management Assistant, use the vicfg-nics command to view the model of your
network adapters. Compare your hardware information to the network I/O device list in the VMware
Compatibility Guide. To use the VMware Compatibility Guide, go to
http://www.vmware.com/resources/compatibility.
If the command output contains no line for a card that has been added to the server, you must
rule out faulty hardware:
• If the network adapter is an add-in card, reseat the adapter or move it to an alternate PCI slot on the
server's motherboard.
• Try an alternate network card.
• Update the BIOS of the server to the latest version recommended by the manufacturer.
• Run hardware diagnostics to identify any potential hardware issues.
If you installed third-party NIC hardware that is certified and supported by the vendor, verify that
vendor-specific drivers have been loaded correctly into the VMkernel after ESXi is installed.



Possible Cause: Slow Network Performance
Slide 4-14

Use vSphere Web Client to display advanced traffic load information.


• Monitor > Performance > Advanced > Network.
• Traffic load by vmnic interface can be isolated.
[Screenshot: vSphere Web Client, host sa-esxi-01.vclass.local, Monitor > Performance > Advanced. The performance chart legend lists the Data receive rate (Average rollup, in KBps) for the host and for individual vmnics, including vmnic2, vmnic4, vmnic6, and vmnic7.]

Network performance depends on application workload and network configuration. Dropped
network packets indicate a bottleneck in the network.
If packets are not being dropped and the data receive rate is slow, the host probably lacks the CPU
resources required to handle the load . Check the number of virtual machines assigned to each
physical NIC. If necessary, perform load balancing by moving virtual machines to different virtual
switches or by adding more NICs to the host. You can also move virtual machines to another host or
increase the CPU resources of the host or virtual machines.



Review of Virtual Machine Connectivity
Slide 4-15

If your virtual machine loses network connectivity, the cause of the
problem might be in the physical layer, the virtual layer, or the guest
operating system itself.

[Diagram: a virtual machine connected to a virtual switch, whose uplink ports connect to the physical NICs.]

Virtual machine connectivity is achieved through multiple layers of networking. A virtual network
provides networking for virtual machines. The fundamental component of a virtual network is a
virtual switch. A virtual switch is a software construct, implemented in the VMkernel, that provides
networking connectivity for virtual machines that run on an ESXi host.
When two or more virtual machines are connected to the same virtual switch, network traffic among
them is forwarded locally. If an uplink adapter (physical Ethernet adapter) is attached to the virtual
switch, each virtual machine can access the external network that the adapter is connected to.



Network Problem 2
Slide 4-16

As an initial check, ping the virtual machine from another system.


If the ping command fails, ping other virtual machines on the same
network to determine the scope of the problem.

If you find that no network connectivity exists to a virtual machine, first ping the virtual machine
from another system to verify this behavior. Ping the virtual machine's name. If the
ping fails, then ping the virtual machine's IP address. If the ping is successful, then the problem
might be with the application accessing the network.
You might also want to determine whether loss of network connectivity is being experienced by
other virtual machines on the same network.
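The initial check can be summarized as a small decision function. This is a sketch, not part of the course tooling: the function name and return strings are hypothetical, and reading "name fails, IP address succeeds" as a name resolution problem is an assumption added here, not a statement from the slide.

```python
def triage_ping(name_ok, ip_ok):
    """Map the results of the two ping tests to a suggested next step."""
    if name_ok:
        # The virtual machine answers by name, so the network path works.
        return "reachable: check the application that uses the network"
    if ip_ok:
        # Assumption: name fails but IP works, so suspect name resolution.
        return "reachable by IP only: check name resolution"
    # Neither test works: find out how widespread the problem is.
    return "unreachable: ping other virtual machines on the same network"


print(triage_ping(name_ok=False, ip_ok=True))
```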



Identifying Possible Causes
Slide 4-17

Take a top-down approach to troubleshooting, from the guest operating
system to the virtual machine and the ESXi host.

Possible Causes
Guest OS:
• Application or IP settings are misconfigured.
• The firewall in the guest OS is blocking traffic.
Virtual Machine:
• The port group name does not exist.
• The virtual network adapter is not connected.
ESXi Host:
• Underlying issues with ESXi network connectivity exist.
• Storage or resource contention on the ESXi host exists.

Based on the results of the initial ping test, if the ping is successful, then ensure that the application
accessing the network is not encountering problems.
If the ping is not successful, then take a top-down troubleshooting approach to identify possible
causes. Troubleshoot the guest operating system first, then troubleshoot the virtual machine, then the
ESXi host.
For information about how to troubleshoot virtual machine network connection issues, see VMware
knowledge base article 1003893 at http://kb.vmware.com/kb/1003893.



Possible Cause: IP Settings and Firewall Problems
Slide 4-18

IP settings and problems with firewalls might cause the problem.


Check IP settings to ensure that the TCP/IP settings in the guest
operating system are correct.
The firewall in the guest operating system might be blocking traffic.
Ensure that the firewall does not block required ports.

Incorrect TCP/IP settings, such as an incorrect IP address, subnet mask, default gateway, or DNS
servers, can cause communication problems.
To verify TCP/IP settings:
1. Run the IP configuration command.
• On a Windows system, run the ipconfig command.
• On a Linux system, run the ifconfig command.
2. If DHCP is configured, confirm that DHCP is assigning the IP address correctly by renewing
the IP address.
• On a Windows system, run the ipconfig /renew command.
• On a Linux system, renew the DHCP address with the following commands:
dhclient -r
dhclient eth0
3. If a firewall is enabled in the guest operating system, verify that it is correctly configured to
allow the required traffic and block unwanted traffic.
If the root cause lies within the guest operating system (incorrect IP settings or misconfigured
firewall), use the guest operating system tools to resolve the problem.
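Some TCP/IP misconfigurations can be spotted by checking the values against each other. The following sketch uses the Python standard library ipaddress module. The function name, the checks chosen, and the sample addresses are illustrative assumptions, not part of the course material.

```python
import ipaddress


def check_ip_settings(address, netmask, gateway):
    """Return a list of problems found in a guest's IPv4 settings."""
    problems = []
    iface = ipaddress.ip_interface(f"{address}/{netmask}")
    gw = ipaddress.ip_address(gateway)
    network = iface.network

    if gw not in network:
        problems.append(f"default gateway {gw} is outside {network}")
    if gw == iface.ip:
        problems.append("default gateway equals the guest's own address")
    # Network and broadcast addresses are not usable host addresses
    # on ordinary subnets (prefix length shorter than /31).
    if network.prefixlen < 31:
        if iface.ip == network.network_address:
            problems.append("address is the network address")
        if iface.ip == network.broadcast_address:
            problems.append("address is the broadcast address")
    return problems


# Hypothetical values: the gateway sits in a different /24 than the guest.
print(check_ip_settings("172.20.10.51", "255.255.255.0", "172.20.11.10"))
```

An empty list means only that these particular checks passed. It does not confirm that the address, gateway, or DNS servers are the ones intended for the guest.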



Possible Cause: Port Group Misconfiguration
Slide 4-19

The port group name that the virtual machine uses is incorrect:
• View the standard switch port group names on the ESXi host:
- vicfg-vswitch -l
• Verify that the virtual machine is using the correct port group.
The virtual network adapter is not connected to the port group:
• Verify that the network adapter is connected to the correct port group.

[Screenshot: Win01-C - Edit Settings dialog box, Virtual Hardware tab, showing Network adapter 1 assigned to the Production port group.]

Verify that the port group names associated with the virtual machine's network adapters are on your
standard switch and distributed switch. From vSphere Management Assistant, use the vicfg-vswitch
command.

Verify that the virtual network adapters for the virtual machine are present and connected. Use the
VMware vSphere® Web Client to view the virtual machine settings. Verify that the network adapter
status is Connected.
To set the status of the network adapter to Connected from vSphere Management Assistant, use the
following command:
vmware-cmd -H ESXi_host_name Full_path_name_of_VM_config_file
connectdevice "Network adapter 1"

For example:
vmware-cmd -H esxi02.vclass.local /vmfs/volumes/Shared/Win01-C/Win01-C.vmx
connectdevice "Network adapter 1"



Possible Cause: ESXi Network Connectivity Problems
Slide 4-20

Storage or resource contention on the ESXi host can cause network
connectivity issues:
• Ensure that the virtual machine has no underlying issues with storage and that
it is not in resource contention.
Problems might exist with the ESXi host network, the port group ID, the
speed or duplex settings, the physical network link, or the NIC teaming
configuration.
To eliminate a NIC failure or physical configuration issue, connect the
virtual machine to a virtual switch that uses NIC teaming.

Verify that the virtual machine has no underlying storage issues and is not in resource contention,
because either condition might result in networking issues with the virtual machine.
As a long-term solution, you might want to consider NIC teaming for the virtual switches that your
virtual machines are connected to. A NIC team can either share the load of traffic between physical
and virtual networks among some or all of its members, or provide passive failover in the event of a
hardware failure or a network outage.



Network Problem 3
Slide 4-21

Another symptom is that the ESXi host is successfully added to the
vCenter Server inventory but disconnects 30 to 90 seconds after the task
completes.
The problem is that heartbeat packets are being dropped, blocked, or lost
between vCenter Server and the ESXi host.

The ESXi host is successfully added to the vCenter Server inventory but after approximately 60
seconds, vCenter Server changes the ESXi host's state to Not Responding or Disconnected.
Although the ESXi host frequently disconnects from vCenter Server, you can still use vSphere
Client to connect directly to the ESXi host.



Heartbeat Communication Between vCenter Server and ESXi
Slide 4-22

The ESXi host sends a heartbeat to vCenter Server to signal that the
host is accessible by the management network.

[Diagram: the ESXi host sends heartbeats from its management network (vmk0) to vCenter Server. Heartbeat sent over UDP port 902.]

The ESXi host sends heartbeats every 10 seconds to vCenter Server. By default, this traffic is sent
over UDP port 902. vCenter Server has a window of 60 seconds to receive the heartbeats. If the
UDP heartbeat message is not received by vCenter Server within that window, vCenter Server treats
the host as not responding.
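The timing described above can be modeled in a few lines. This sketch only illustrates the 10-second interval and 60-second window. The function names are hypothetical, and treating a heartbeat that arrives exactly at the 60-second mark as on time is an assumption.

```python
HEARTBEAT_INTERVAL_S = 10  # the host sends a heartbeat every 10 seconds
HEARTBEAT_WINDOW_S = 60    # vCenter Server waits up to 60 seconds


def host_state(last_heartbeat_s, now_s, window_s=HEARTBEAT_WINDOW_S):
    """Classify the host from the time elapsed since the last heartbeat."""
    elapsed = now_s - last_heartbeat_s
    return "connected" if elapsed <= window_s else "not responding"


def missed_heartbeats(last_heartbeat_s, now_s, interval_s=HEARTBEAT_INTERVAL_S):
    """Count the heartbeats that should have arrived since the last one."""
    return max(0, int((now_s - last_heartbeat_s) // interval_s))


# 25 seconds of silence: two heartbeats missed, still inside the window.
print(missed_heartbeats(0, 25), host_state(0, 25))
```

With these numbers, roughly six consecutive heartbeats must be lost before the host is marked as not responding, which is consistent with the host disconnecting about 60 seconds after it is added.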



Identifying Possible Causes
Slide 4-23

Take a top-down approach to troubleshooting, from the vCenter Server
system to the ESXi host and the hardware.

Possible Causes
vCenter Server:
• Windows Firewall is enabled on the vCenter Server system, and UDP port 902 is blocked.
ESXi Host:
• If ports are not permitted, disable the firewall to test.
Hardware (CPU, Memory, Network, Storage):
• The network between ESXi and vCenter Server is congested.

If the Windows firewall is not enabled on your vCenter Server system, then begin troubleshooting at
the ESXi host.
For information about how to troubleshoot an ESXi host that frequently disconnects from vCenter
Server, see VMware knowledge base article 2020100 at http://kb.vmware.com/kb/2020100.
For information about the ports required for communications between vSphere components, see
VMware knowledge base article 2106283 at http://kb.vmware.com/kb/2106283. Also see the
information about TCP and UDP ports required to access vCenter Server, ESXi hosts, and other
network components in VMware knowledge base article 1012382 at
http://kb.vmware.com/kb/1012382.



Possible Cause: Port Blocked by Firewall
Slide 4-24

If the firewall is enabled and UDP port 902 is blocked, view the ports
blocked by the vCenter Server Appliance firewall.
To resolve this problem, adjust the firewall settings on the vCenter Server
Appliance virtual machine:
• If ports are not configured, disable the firewall.
• If the firewall is configured to affect ports, ensure that the firewall is not
blocking UDP port 902.

[Screenshot: vSphere Web Client System Configuration view for node sa-vcsa-01.vclass.local, Manage > Settings > Advanced > Firewall. The firewall rule list is empty.]

Check the firewall on the vCenter Server Appliance virtual machine. If ports are not configured,
then disable the firewall. If ports are configured, then verify that network traffic is allowed to pass
from the ESXi host to the vCenter Server system. That is, verify that the firewall is not blocking
UDP port 902.
To reach the firewall settings on your vCenter Server Appliance:
1. Log in to vSphere Web Client.
2. Click the Home icon.
3. Click System Configuration under Administration.
4. Expand Nodes.
5. Select the vCenter Server Appliance node.
6. Click the Manage tab.
7. Click Settings.
8. Expand Advanced.

9. Click Firewall.
10. Click Edit.
For additional information about how to troubleshoot an ESXi host that disconnects from vCenter
Server after being added or connected to the inventory, see VMware knowledge base article
2040630 at http://kb.vmware.com/kb/2040630.
For additional information about configuring the vCenter Server Appliance, see vCenter Server
Appliance Configuration at
http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-appliance-configuration-guide.pdf.



IPTables Firewall
Slide 4-25

vCenter Server Appliance uses the iptables firewall.
• Examine firewall rules from the command line or an SSH session.
• Remove firewall rules.
• Modify firewall rules.
[Screenshot: output of iptables -L on sa-vcsa-01, listing the INPUT (policy DROP), FORWARD (policy DROP), OUTPUT (policy ACCEPT), and port_filter chains. A highlighted DROP rule has sa-esxi-01.vclass.local as its source.]

VMware vCenter® Server Appliance™ uses the iptables firewall. You can list the firewall rules in
every chain with the command:
iptables -L

You can list the iptables firewall rules in a specific chain by line number with the command:
iptables -L <chain name> -n --line-numbers

Example:
iptables -L inbound -n --line-numbers

To delete a specific rule, use the command:
iptables -D <chain name> <line number>

Example:
iptables -D inbound 1

After changing the firewall rules, save the rules with the command:
iptables-save
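When a chain holds many rules, picking out the line number to delete can be scripted. The following sketch parses the text printed by a line-numbered iptables listing and returns the numbers of rules that match a target and source. The sample listing, the address 172.20.10.51, and the function name are illustrative assumptions, not output captured from a real appliance.

```python
# Hypothetical output of: iptables -L inbound -n --line-numbers
SAMPLE_LISTING = """\
Chain inbound (1 references)
num  target     prot opt source               destination
1    DROP       all  --  172.20.10.51         0.0.0.0/0
2    RETURN     all  --  0.0.0.0/0            0.0.0.0/0
"""


def find_rules(listing, target, source):
    """Return line numbers of rules with the given target and source."""
    matches = []
    for line in listing.splitlines():
        fields = line.split()
        # Rule rows start with a number; chain and column headers do not.
        if len(fields) >= 6 and fields[0].isdigit():
            num, tgt, _prot, _opt, src, _dst = fields[:6]
            if tgt == target and src == source:
                matches.append(int(num))
    return matches


print(find_rules(SAMPLE_LISTING, "DROP", "172.20.10.51"))
```

Rules are renumbered after each deletion, so when several rules must be removed from the same chain, delete the highest line number first.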



Possible Cause: vCenter Server Not Using Port 902
Slide 4-26

By default, the vpxa agent on the ESXi host sends heartbeats to vCenter
Server (vpxd) through UDP port 902.
A problem might exist if the host is configured to send heartbeats over a
port other than 902.
Use the less /etc/vmware/vpxa/vpxa.cfg command on the host
to determine the port that is used to send heartbeats.
[Screenshot: PuTTY session on esxi01.vclass.local, with the serverPort line highlighted]
~ # less /etc/vmware/vpxa/vpxa.cfg

<vpxa>
<bundleVersion>1000000</bundleVersion>
<datastorePrincipal>root</datastorePrincipal>
<hostIp>172.20.10.51</hostIp>
<hostKey>521d9d38-20c7-df53-cbcd-4457cf6eae69</hostKey>
<hostPort>443</hostPort>
<licenseExpiryNotificationThreshold>15</licenseExpiryNot
<memoryCheckerTimeInSecs>30</memoryCheckerTimeInSecs>
<serverIp>172.20.10.91</serverIp>
<serverPort>9020</serverPort>
</vpxa>
<workingDir>/var/log/vmware/vpx</workingDir>
</config>
A rule in the ESXi firewall allows vCenter Server heartbeat traffic on the default port. If vCenter
Server has been configured to receive traffic over an alternate port, that traffic will be blocked.
Determine whether an ESXi host is using a port other than the default port, 902. At the ESXi host
command prompt, use the less /etc/vmware/vpxa/vpxa.cfg command to determine the port
in use. The port number in use is contained in the serverPort tags.
In this example, serverPort is set to port 9020, not the default port.
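If you need to check many hosts, the same lookup can be scripted once the file contents have been collected. The sketch below reads the serverPort value from the text of a vpxa.cfg file with the Python standard library. The sample text is a shortened stand-in for the real file, the config root element is assumed from the closing tag visible on the slide, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

DEFAULT_HEARTBEAT_PORT = 902

# Shortened, illustrative stand-in for /etc/vmware/vpxa/vpxa.cfg.
SAMPLE_CFG = """<config>
  <vpxa>
    <serverIp>172.20.10.91</serverIp>
    <serverPort>9020</serverPort>
  </vpxa>
</config>"""


def heartbeat_port(cfg_text):
    """Return the serverPort value as an int, or None if it is absent."""
    root = ET.fromstring(cfg_text)
    node = root.find("./vpxa/serverPort")
    if node is None or not (node.text or "").strip():
        return None
    return int(node.text.strip())


port = heartbeat_port(SAMPLE_CFG)
print(port, "default" if port == DEFAULT_HEARTBEAT_PORT else "nondefault")
```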



Resolving the Use of a Port Other Than 902 ( 1)
Slide 4-27

If you prefer to use a nondefault port for heartbeats, ensure that the
ESXi firewall does not block that port.
Contents of heartbeat.xml
<!-- Firewall configuration sample -->
<ConfigRoot>
  <service>
    <id>nondefheartbeat</id>
    <rule id='0000'>
      <direction>inbound</direction>
      <protocol>udp</protocol>
      <porttype>dst</porttype>
      <port>9020</port>
    </rule>
    <rule id='0001'>
      <direction>outbound</direction>
      <protocol>udp</protocol>
      <porttype>dst</porttype>
      <port>9020</port>
    </rule>
    <enabled>true</enabled>
    <required>true</required>
  </service>
</ConfigRoot>
The path to the heartbeat.xml file is
/etc/vmware/firewall/heartbeat.xml.

If you prefer to use a port other than the default port 902 for the heartbeat traffic between vCenter
Server and the ESXi host, you must configure the firewall to specifically allow traffic on that port.
To add a firewall rule to the ESXi host:
1. Use SSH to connect to the ESXi host.
2. Navigate to the /etc/vmware/firewall directory:
cd /etc/vmware/firewall
3. Use a text editor, such as vi, to create a new file named heartbeat.xml.
The alternate port number used in this example is 9020. Be sure to use the port number you
determined earlier for your configuration.
4. Enable the new firewall rule by running the command:
esxcli network firewall refresh
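To avoid typing mistakes in the rule file, its contents can be generated for whichever port you determined earlier. This is a sketch that reproduces the element names shown on the slide with the Python standard library. The function name is hypothetical, and the output should be compared with the slide before it is copied to a host.

```python
import xml.etree.ElementTree as ET


def build_heartbeat_service(port, service_id="nondefheartbeat"):
    """Build firewall rule XML with one inbound and one outbound UDP rule."""
    if not 1 <= int(port) <= 65535:
        raise ValueError(f"invalid port: {port}")

    root = ET.Element("ConfigRoot")
    service = ET.SubElement(root, "service")
    ET.SubElement(service, "id").text = service_id
    for index, direction in enumerate(("inbound", "outbound")):
        # Rule ids follow the slide: 0000 for inbound, 0001 for outbound.
        rule = ET.SubElement(service, "rule", id=f"{index:04d}")
        ET.SubElement(rule, "direction").text = direction
        ET.SubElement(rule, "protocol").text = "udp"
        ET.SubElement(rule, "porttype").text = "dst"
        ET.SubElement(rule, "port").text = str(port)
    ET.SubElement(service, "enabled").text = "true"
    ET.SubElement(service, "required").text = "true"

    if hasattr(ET, "indent"):  # pretty-print where the helper exists
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


print(build_heartbeat_service(9020))
```

The generated text is what you would paste into heartbeat.xml in step 3. The firewall refresh in step 4 is still required.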



Resolving the Use of a Port Other Than 902 (2)
Slide 4-28

Check the vCenter Server configuration to verify the port number used
for heartbeats.
[Screenshot: Windows Registry Editor showing the VMware VirtualCenter key under SOFTWARE > VMware, Inc. The highlighted REG_SZ value is set to 902. Other values shown include HttpProxyPort (80) and HttpsProxyPort (443).]

Using the default port (UDP 902) for vpxa and vpxd communication is encouraged. VMware
recommends that you configure vCenter Server and the ESXi hosts to use the default port instead of
a nondefault port. As a good practice, before changing the port number on vCenter Server, ensure
that no other application installed on it is using this port.



Resolving Network Congestion
Slide 4-29

Resolving network congestion has both short-term and long-term
solutions.
Short-term solution to this problem:
• Increase the timeout limit in vCenter Server to keep the ESXi host continuously
connected.
Long-term solution to this problem:
• Resolve the underlying network congestion problems.
• If using distributed switches, use VMware vSphere® Network I/O Control to
reprioritize traffic and increase the number of shares for management traffic.

The long-term solution is to resolve any issues that affect network performance. You might also
consider enabling VMware vSphere® Network I/O Control if you are using distributed switches.
Network I/O Control lets you prioritize different network traffic flowing through the same pipe.
Until the network issues can be resolved, you can work around the issue by increasing the timeout
limit. Increasing the timeout limit in vCenter Server allows the ESXi host to be connected
continuously. For information about how to increase the heartbeat timeout limit, see VMware
knowledge base article 1005757 at http://kb.vmware.com/kb/1005757.



Network Problem 4
Slide 4-30

This problem can occur if the ESXi host's management network was
misconfigured or manipulated from the command line.
For example, you can bring a physical network card up or down with the
esxcli command:
esxcli network nic up -n vmnic0
esxcli network nic down -n vmnic0
esxcli network nic list

The management network is configured on every ESXi host and is used to communicate with
vCenter Server. Communication between an ESXi host and vCenter Server is critical for centrally
managing hosts through vCenter Server. The management network is also used to interact with other
hosts in a VMware vSphere® High Availability configuration. If the management network on the
host becomes unavailable or misconfigured, vCenter Server cannot connect to the host and cannot
centrally manage resources.



Preventing Loss of Management Network Connectivity
Slide 4-31

vSphere network rollback prevents accidental misconfiguration of
management networking and loss of connectivity:
• For example, if you try to change the IP address of your management
VMkernel interface, vSphere Web Client returns the error message in the
screenshot.

[Screenshot: Update virtual NIC task error in vSphere Web Client]
Status: An error occurred while communicating with the remote host
Initiator: SYSTEM-DOMAIN\admin
Target: esxi01.vclass.local
Server: VC01.vclass.local
Error stack:
• Network configuration change disconnected the host 'esxi01.vclass.local' from vCenter Server and has been rolled back.
• fault.NetworkDisruptedAndConfigRolledBack.summary

Rollback detects configuration changes on the management network. If a configuration change is
made to the management network, and if that change would result in the host losing connection to
vCenter Server, the change is rolled back automatically to the previous configuration so that the
connection is maintained. This feature works the same regardless of whether the host's management
network is configured on a standard or distributed switch. Rollback is enabled by default.
For information about network rollback and recovery, see VMware knowledge base article 2032908
at http://kb.vmware.com/kb/2032908.



Host Networking Rollback
Slide 4-32

Rollback enables you to roll back to a previous valid configuration.


The host networking rollback is triggered when a network configuration
change is made that disconnects the host.
Several events can trigger a host networking rollback:
• Updating DNS and routing settings
• Updating the speed or duplex of a physical NIC
• Changing the IP settings of a management VMkernel network adapter
• Updating teaming and failover policies to a port group that contains the
management VMkernel network adapter
If a network disconnects for any of these reasons, the task fails and the
host reverts to the last valid configuration.

Host networking rollbacks occur when an invalid change is made to the host networking
configuration. Every network change that disconnects a host also triggers a rollback.
In addition to the events mentioned previously, other events that might trigger a rollback are the
following:
• Updating the VLAN of a standard port group that contains the management VMkernel network
adapter
• Increasing the MTU of management VMkernel network adapters and their switch to values not
supported by the physical infrastructure
• Removing the management VMkernel network adapter from a standard or distributed switch
• Removing a physical NIC of a standard or distributed switch containing the management
VMkernel network adapter



Recovering a Lost Management Network: Standard Switch
Slide 4-33

If your management network is on a standard switch and you lose
management network connectivity, the solution uses the Configure
Management Network option in the DCUI.

[Screenshot: DCUI System Customization menu with Configure Management Network selected. The Configure Management Network pane lists Network Adapters, VLAN (optional), IP Configuration, IPv6 Configuration, DNS Configuration, and Custom DNS Suffixes.]

The DCUI allows you to correct your management network settings, such as your IP configuration
and DNS configuration. From the DCUI, you can also restart your management network and test the
management network.



Network Restore Options in the DCUI
Slide 4-34

To restore the network through the DCUI:


1. Select Network Restore Options.
2. Perform a full network restore.
3. Repair the Management network on a misconfigured standard or
distributed switch.
[Screenshot: DCUI System Customization menu with Network Restore Options selected. The Network Restore Options pane includes Restore Standard Switch and Restore vDS.]

The Restore Network Settings option deletes all the current network settings
except for the Management network.

The Network Restore Options selection enables you to recover from management network
configuration errors on a distributed switch. The management network must be configured on a
distributed switch.
Using the DCUI is the only way to fix distributed switch configuration errors. The DCUI clones a
host local port from the existing misconfigured port and copies all VLAN ID and blocked port
information that you configured on the distributed switch. The DCUI changes the management
network to use the new host local port to restore connectivity to vCenter Server. vCenter Server
picks up the new host local port and updates its database with the new information. vCenter Server
creates a standalone port that is connected to the management network.
As a last resort, if you cannot recover your management network by fixing the management network
settings, you can revert to a default network setting. The Restore Network Settings option reverts
your entire network configuration to network system defaults. Restoring the network settings stops
all the running virtual machines on the host.
Use this option carefully. Verify that you have a record of your network configuration so that you
can recreate your production environment. If you do revert your network settings to the network
system default, you can apply a host profile, if you have one, to recreate your virtual switches.



Review of Distributed Switch Network Connectivity
Slide 4-35

The cause of a network connectivity problem might be in the virtual
machines, the vCenter Server system, or the ESXi hosts that have NICs
assigned to the distributed switch and the physical network.
[Diagram: distributed switch architecture. vCenter Server holds the distributed switch (control plane) with its distributed ports and port groups. Each ESXi host runs a hidden virtual switch (I/O plane) that connects to the physical NICs (uplinks). A management port and VMs attach to distributed ports.]

If your ESXi host is experiencing intermittent or no network connectivity, and your ESXi host is
connected to a distributed switch, then verify that your distributed switch is configured properly.
The distributed switch is managed by vCenter Server; however, the physical adapters and virtual
adapters (VMkernel interfaces) are managed directly on the ESXi host.



Distributed Switch Rollback
Slide 4-36

The distributed switch rollback is triggered when invalid updates are
made to distributed switch-related objects.
Examples of events that might trigger a distributed switch rollback:
• Changing the MTU of a distributed switch
• Changing the following settings in the distributed port group of the
management VMkernel network adapter:
- NIC teaming and failover
- VLAN
- Traffic shaping

If an invalid configuration occurs, one or more hosts might be out of
synchronization with the distributed switch.

Distributed switch rollbacks occur when invalid updates are made to distributed switch-related
objects, such as distributed switches, distributed port groups, or distributed ports.
In addition to the events mentioned on the slide, the following events might trigger a distributed
switch rollback:
• Blocking all ports in the distributed port group containing the management VMkernel network
adapter
• Overriding the preceding policies for the distributed port to which the management VMkernel
network adapter is connected
If you know where the conflicting configuration setting is located, you can manually correct the
setting. For example, if you incorrectly migrated a management VMkernel network adapter to a new
VLAN, the VLAN might not be trunked on the physical switch. When you correct the physical
switch configuration, the next distributed switch-to-host synchronization will resolve the
configuration issue.



Recovering from a Distributed Switch Misconfiguration
Slide 4-37

Always back up your distributed switch before you make a change to its
configuration:
• If your distributed switch loses network connectivity because of a
misconfiguration, you can restore from your latest backup.
vSphere Web Client provides you with features to back up and restore
distributed switch configuration:
• Export: Back up your distributed switch configuration.
• Restore: Reset the configuration of a distributed switch from an exported
configuration file.
• Import: Create a distributed switch from an exported configuration file.

If you are troubleshooting a network connectivity problem on your distributed switch, and you are
not sure where the problem exists, you can roll back the distributed switch or distributed port group
to a previous configuration. vSphere Web Client allows you to back up your distributed switch
configuration to a file and restore the configuration if necessary.



Lab 5: Troubleshooting Network Problems
Slide 4-38

Identify, diagnose, and resolve virtual networking problems


1. Run a Break Script
2. Verify That the System Is Not Functioning Properly
3. Troubleshoot and Repair the Problem
4. Verify That the Problem Is Repaired



Review of Learner Objectives
Slide 4-39

You should be able to meet the following objectives:


• Provide a network troubleshooting overview
• Analyze and troubleshoot standard switch problems
• Analyze and troubleshoot virtual machine connectivity problems
• Analyze and troubleshoot management network problems
• Analyze and troubleshoot distributed switch problems



Key Points
Slide 4-40

• Virtual network connectivity problems might occur with standard switches,
distributed switches, virtual machines, or management networks.
• A virtual machine connectivity problem might exist in the physical layer, the
virtual layer, or the guest operating system.
• The ping command is useful when troubleshooting ESXi host and virtual
machine connectivity issues.
• When an ESXi host frequently disconnects from vCenter Server, heartbeat
packets are being lost between vCenter Server and the ESXi host.
• vSphere network rollback prevents accidental misconfiguration of management
networking and loss of connectivity.
• A good practice is to back up your distributed switch configuration with the
vSphere Web Client whenever you make a change to the configuration.
• You can use the restore or the import function to reset the distributed switch
configuration.
Questions?



MODULE 5
Troubleshooting Storage
Slide 5-1

Module 5

You Are Here
Slide 5-2

1. Course Introduction
2. Introduction to Troubleshooting
3. Troubleshooting Tools
4. Troubleshooting Virtual Networking
5. Troubleshooting Storage
6. Troubleshooting vSphere Clusters
7. Troubleshooting vCenter Server and ESXi
8. Troubleshooting Virtual Machines



Importance
Slide 5-3

Nearly every service depends on access to storage. When a host's
access to a certain storage device is lost, you must use your
troubleshooting knowledge and skills to quickly identify the problem and
restore the access.



Module Lessons
Slide 5-4

Lesson 1: Storage Connectivity and Configuration


Lesson 2: Multipathing
Lesson 3: vSAN and Virtual Volumes



Lesson 1: Storage Connectivity and Configuration
Slide 5-5

Lesson 1: Storage Connectivity and Configuration



Learner Objectives
Slide 5-6

By the end of this lesson, you should be able to meet the following
objectives:
• Discuss vSphere storage architecture
• Identify possible causes of problems in various types of datastores
• Analyze common storage connectivity and configuration problems and discuss
possible causes
• Solve storage connectivity problems, correct misconfigurations, and restore
LUN visibility



Review of vSphere Storage Architecture
Slide 5-7

If a virtual machine cannot access its virtual disks, the cause of the
problem might be anywhere from the virtual machine to physical storage.

[Diagram: layers of the vSphere storage architecture: virtual disk, datastore type, transport, and backing.]

A virtual machine's virtual disks reside on one or more datastores. Datastores are logical containers
that are configured on ESXi hosts. Datastores provide a uniform model for storing virtual machine
files. Depending on the type of storage that you use, datastores can be formatted in different ways.
When a problem occurs with accessing storage, the root cause can be in one of the layers that form
the virtual storage architecture.



Review of iSCSI Storage
Slide 5-8

If the ESXi host has iSCSI storage connectivity issues, check the iSCSI
configuration on the ESXi host and, if necessary, the iSCSI hardware
configuration.
[Figure: an ESXi host connected to a disk array over iSCSI.]

Disk array (target):
  iSCSI target name: iqn.1992-08.com.acme:storage1
  iSCSI alias: storage1
  IP address: 192.168.36.101

Host (initiator):
  iSCSI initiator name: iqn.1998-01.com.vmware:train1
  iSCSI alias: train1
  IP address: 192.168.36.88

An iSCSI storage system contains one or more LUNs and one or more storage processors (SPs).
Communication between the host and the storage array occurs over a TCP/IP network.
The ESXi host is configured with an iSCSI initiator. An initiator can be hardware-based or it can be
software-based. The software-based initiator is called the iSCSI software initiator. An initiator
resides in the ESXi host. Targets reside in the storage arrays that are supported by the ESXi host.
iSCSI arrays can use various mechanisms, including IP address, subnets, and authentication
requirements, to restrict access to targets from hosts.



Storage Problem 1
Slide 5-9

Initial checks using the command line look at connectivity on the host:
• Verify that the ESXi host can see the LUN:
- esxcli storage core path list
  # esxcli storage core path list | more
  iqn.1998-01.com.vmware:esxi01-00023d000001,iqn.2003-10.com.lefthandnetworks:iscsi-mg:203:tgt0,t,1-naa.6000eb3a2b3b330e00000000000000cb
     UID: iqn.1998-01.com.vmware:esxi01-00023d000001,iqn.2003-10.com.lefthandnetworks:iscsi-mg:203:tgt0,t,1-naa.6000eb3a2b3b330e00000000000000cb
     Runtime Name: vmhba33:C0:T[illegible]:L0
     Device: naa.6000eb3a2b3b330e00000000000000cb
     Device Display Name: LEFTHAND iSCSI Disk (naa.6000eb3a2b3b330e00000000000000cb)
     Adapter: vmhba33
     Channel: 0
     Target: [illegible]
     LUN: 0
     Plugin: NMP

• Check whether a rescan restores visibility to the LUNs.


- esxcli storage core adapter rescan -A vmhba##
• Check how many datastores exist and how full they are:
- df -h | grep VMFS
[Screenshot: output of df -h | grep VMFS, listing two VMFS volumes with their size, used space, and available space.]

The first check to perform is to see what paths to your IP storage LUNs are visible to your ESXi
host. You can use vSphere Web Client to perform this check or you can use the command line.
The esxcli storage core path list command prints a mapping between the HBAs and the
devices that they provide paths to. Use a PuTTY session to run the esxcli command on the ESXi host.
If you do not see any paths to your IP storage, use the esxcli command to rescan
the problem adapter to try to restore LUN visibility.
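
The mapping that the path list prints can also be summarized per device. The following Python sketch is illustrative only: it assumes the record layout shown in this lesson (an unindented path identifier followed by indented "Key: Value" lines), and the two path identifiers in the sample text are hypothetical placeholders, not real output.

```python
from collections import defaultdict

# Hypothetical sample in the layout of `esxcli storage core path list`.
# The path identifiers are placeholders; the device and adapter names are
# borrowed from examples in this lesson.
SAMPLE_OUTPUT = """\
example-path-uid-1
   UID: example-path-uid-1
   Runtime Name: vmhba65:C0:T4:L0
   Device: eui.dbfd8ac3cf56761b
   Adapter: vmhba65

example-path-uid-2
   UID: example-path-uid-2
   Runtime Name: vmhba65:C1:T4:L0
   Device: eui.dbfd8ac3cf56761b
   Adapter: vmhba65
"""


def parse_paths(text):
    """Return one dict of fields per path record."""
    records = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line is the path identifier that opens a record.
            current = {"Path": line.strip()}
            records.append(current)
        elif current is not None:
            # Split on the first colon only; values such as runtime names
            # contain further colons.
            key, _, value = line.strip().partition(":")
            current[key.strip()] = value.strip()
    return records


def runtime_names_by_device(records):
    """Group the runtime name of every path under the device it leads to."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.get("Device", "unknown")].append(record.get("Runtime Name", ""))
    return dict(grouped)
```

A device that is missing from the grouped result has no visible path from this host, which is the condition that the rescan is meant to correct.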



Identifying Possible Causes
Slide 5-10

If the ESXi host accessed IP storage in the past, and no recent changes
were made to the host configuration, you might take a bottom-up
approach to troubleshooting.

Possible Causes

ESXi Host
• The VMkernel interface for IP storage is misconfigured.
• IP storage is not configured correctly on the ESXi host.
• iSCSI TCP port 3260 is unreachable.
• A firewall is interfering with iSCSI traffic.
• NFS storage is not configured correctly.
• VMFS datastore metadata is inconsistent.

Hardware (Storage Network, Storage Array)
• The iSCSI storage array is not supported.
• The LUN is not presented to the ESXi host.
• The physical hardware is not functioning correctly.
• Poor iSCSI storage performance is observed.

Instead of a top-down approach, you might take a bottom-up approach and look for possible causes
at the hardware level first. Taking a bottom-up approach is especially appropriate if you know that
your ESXi hosts have been able to access datastores located on IP storage and you know that you
did not make recent changes to the IP storage configuration.
If you know that you have a solid configuration on your ESXi host, start troubleshooting by
verifying proper operation and acceptable performance of your storage hardware.
For information about troubleshooting LUN connectivity issues on ESXi hosts, see VMware
knowledge base article 1003955 at http://kb.vmware.com/kb/1003955.
For information about troubleshooting iSCSI array connectivity issues in ESXi, see VMware
knowledge base article 1003681 at http://kb.vmware.com/kb/1003681.



Possible Cause: Hardware-Level Problems
Slide 5-11

Check the VMware Compatibility Guide to see if the iSCSI HBA or iSCSI
storage array is supported.
Verify that the LUN is presented correctly to the ESXi host:
• The LUN is in the same storage group as all the ESXi hosts.
• The LUN is configured correctly for use with the ESXi host.
• The LUN is not set to read-only on the array.
• The host ID on the array for the ESXi LUN is 0 - 16383.
• Max LUNs per ESXi host is 512.
If the storage device is malfunctioning, use hardware diagnostic tools to
identify the faulty component.

Verify that the storage array is listed in VMware Compatibility Guide. Some array vendors have a
minimum-recommended microcode or firmware version to operate with ESXi. This information can
be obtained from the array vendor and VMware Compatibility Guide.
On the array side, ensure that the LUN IQNs and access control list (ACL) allow the ESXi host
HBAs to access the array targets.
Ensure that the host ID on the array for the LUN is in the range 0 through 16383. On ESXi, the
host ID appears as the LUN ID. The maximum LUN ID is 16383. Any LUN that has a host ID
greater than 16383 might not appear as available under Storage Adapters. However, on the array, the
LUN might reside in the same storage group as the other LUNs that have host IDs within the range.
Verify that the physical hardware is functioning correctly, including:
• The storage processors (sometimes called heads) on the array
• The storage array
• The SAN and switch configuration
For information about configuration maximums, see Configuration Maximums at
http://www.vmware.com/pdf/vsphere6/r65/vsphere-65-configuration-maximums.pdf.
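
The two limits on the slide (host IDs from 0 through 16383 and at most 512 LUNs per host) are simple to check against a list exported from the array. The following Python sketch is a minimal illustration. The function name and the idea of passing a plain list of host IDs are assumptions made for this example, not part of any VMware tool.

```python
MAX_LUN_ID = 16383       # highest host ID (LUN ID), as stated on the slide
MAX_LUNS_PER_HOST = 512  # maximum LUNs per ESXi host, as stated on the slide


def check_lun_presentation(host_lun_ids):
    """Return a list of problems; an empty list means no limit is exceeded."""
    problems = []
    for lun_id in host_lun_ids:
        if not 0 <= lun_id <= MAX_LUN_ID:
            problems.append(f"LUN ID {lun_id} is outside the range 0-{MAX_LUN_ID}")
    if len(host_lun_ids) > MAX_LUNS_PER_HOST:
        problems.append(
            f"{len(host_lun_ids)} LUNs presented; the maximum is {MAX_LUNS_PER_HOST}"
        )
    return problems
```

For example, check_lun_presentation([0, 1, 20000]) reports one problem, for the host ID 20000.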



Possible Cause: Poor iSCSI Storage Performance
Slide 5-12

Adhere to best practices for your IP storage networks:


• Avoid oversubscribing your links.
• Isolate iSCSI traffic from NFS traffic and any other network traffic.
Monitor device latency metrics:
• Use the esxtop or resxtop command: Enter d in the window.

10:46:28am up 2 days 3:16, 77 worlds; CPU load average: 0.32, 0.31, 0.32

ADAPTR  ...  CMDS/s  READS/s  WRITES/s  MBREAD/s  MBWRTN/s  DAVG/cmd  KAVG/cmd  GAVG/cmd  QAVG/cmd
vmhba0         5.24     0.00      5.24      0.00      0.09     46.11     29.87     75.98      0.00
vmhba1         5.06     0.00      5.06      0.00      0.02      1.10      0.01      1.11      0.00
vmhba2         0.00     0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

DAVG/cmd: device average. KAVG/cmd: kernel average. GAVG/cmd: guest average.

Make sure that your network topology does not contain Ethernet bottlenecks. Bottlenecks where
multiple links are routed through fewer links can result in oversubscription and dropped network
packets. Recovering from dropped network packets results in large performance degradation.
Isolating iSCSI and NFS traffic, or creating separate VLANs for NFS and iSCSI, is beneficial. This
separation minimizes network interference from other packet sources.
A performance problem can usually be identified and corrected by monitoring the following latency
metrics, which esxtop reports in milliseconds:
• DAVG/cmd: The average amount of time it takes a device (which includes the HBA, the storage
array, and everything in between) to service a single I/O request (read or write).
• If the value is less than 10, the system is healthy. If the value is 10 through 20 (inclusive),
be aware of the situation by monitoring the value more frequently. If the value is greater
than 20, this most likely indicates a problem.
• KAVG/cmd: The average amount of time it takes the VMkernel to service a disk operation. This
number represents time spent by the CPU to manage I/O. Because processors are much faster
than disks, this value should be close to zero. A value of 1 or 2 is considered high for this metric.
• GAVG/cmd: The total latency seen from the virtual machine when performing an I/O request.
GAVG is the sum of DAVG and KAVG.
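
The thresholds above can be applied mechanically to values read from the esxtop adapter view. The following Python sketch is an illustration under stated assumptions: the function names are invented for this example, values between 10 and 11 are treated as part of the "watch" band, and any KAVG of 1 or more is flagged as high.

```python
def classify_davg(davg_ms):
    """Map a DAVG/cmd value (milliseconds) to the bands described in the notes."""
    if davg_ms < 10:
        return "healthy"
    if davg_ms <= 20:
        return "watch"    # monitor the value more frequently
    return "problem"


def kavg_is_high(kavg_ms):
    """KAVG/cmd should be close to zero; 1 or more is treated as high here."""
    return kavg_ms >= 1


def expected_gavg(davg_ms, kavg_ms):
    """GAVG/cmd is the sum of DAVG/cmd and KAVG/cmd."""
    return davg_ms + kavg_ms
```

Applied to the sample esxtop output in this lesson, vmhba0 (DAVG 46.11, KAVG 29.87) falls in the "problem" band with a high KAVG, and its two values add up to the reported GAVG of 75.98. vmhba1 (DAVG 1.10, KAVG 0.01) is healthy.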



Possible Cause: VMkernel Interface Misconfiguration
Slide 5-13

A misconfigured VMkernel interface for IP storage affects any IP storage,
whether iSCSI or NFS:
• To test configuration from the ESXi host, ping the iSCSI target IP address:
  - For example, ping 172.20.13.12
    • 172.20.13.12 is the IP address of the iSCSI target.

[Screenshot: Standard Switch: vSwitch3. VMkernel Port: iSCSI01, vmk2: 172.20.13.52. Physical Adapters: vmnic4, 1000 Full.]


• If the ping command fails, ensure that the IP settings are correct.

Log in to the ESXi host and use the ping command to test connectivity between the ESXi host and
the iSCSI target. The ping command is a symbolic link to the vmkping command.
If the ping command fails, check that you are using the correct iSCSI target IP address. Also check
that the VMkernel interface used for IP storage is correct. In this example, the VMkernel port for IP
storage is vmk2, 172.20.13.52.
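
One quick way to sanity-check the IP settings is to confirm that the target address falls inside the subnet of the VMkernel interface. The following Python sketch uses the standard ipaddress module. The subnet mask 255.255.255.0 is an assumption, because the slide shows only the address of vmk2 and not its mask.

```python
import ipaddress


def same_subnet(vmk_ip, netmask, target_ip):
    """Return True if target_ip lies in the subnet defined by vmk_ip and netmask."""
    # strict=False lets us pass a host address instead of the network address.
    network = ipaddress.ip_network(f"{vmk_ip}/{netmask}", strict=False)
    return ipaddress.ip_address(target_ip) in network


# vmk2 and the iSCSI target from the example; the mask is assumed.
VMK2_IP = "172.20.13.52"
ASSUMED_NETMASK = "255.255.255.0"
TARGET_IP = "172.20.13.12"
```

With these values, same_subnet(VMK2_IP, ASSUMED_NETMASK, TARGET_IP) is true. A false result does not prove a fault, because routed storage networks are possible, but it is a reason to review the address, the mask, and the route to the target.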



Possible Cause: iSCSI HBA Misconfiguration (1)
Slide 5-14

The iSCSI initiator might be configured incorrectly on the ESXi host.
Use vSphere Web Client to check the configured components:
• iSCSI initiator name
• iSCSI target address and port number

[Screenshot: storage adapter list showing the iSCSI Software Adapter vmhba65 (type iSCSI, status Online) with the identifier iqn.1998-01.com.vmware:esxi-a-01-3dbe31a6. Under Adapter Details, the Targets tab offers Dynamic Discovery and Static Discovery and lists the iSCSI server 172.20.13.12:3260.]

Here the iSCSI software adapter has an initiator name of
iqn.1998-01.com.vmware:esxi-a-01-3dbe31a6.
This is a static discovery iSCSI target.
The target address here is 172.20.13.12 on port number 3260.

Verify that the iSCSI initiator name and the target address and port number of the iSCSI array are
correct.
Here the iSCSI software adapter has an initiator name of
iqn.1998-01.com.vmware:esxi-a-01-3dbe31a6. This is an automatically generated initiator name.
You can manually set that name. A common practice is to edit the name and remove the MAC
address string, leaving only the node name. The first part of the name (iqn.1998-01.com.vmware)
is the same for any ESXi host using the iSCSI software adapter.
iSCSI storage providers can be configured so that only certain specific nodes are allowed to connect.
In these cases you might have to change the automatically assigned iSCSI initiator name to
something the iSCSI storage provider is expecting.
Target addresses can be set up for dynamic or static discovery. This one is static.
The target address here is 172.20.13.12 on port number 3260. That is the IP address and port of the
iSCSI provider. Any typos in the definition of the iSCSI storage provider (IP address or port
number) will cause the connection to fail.
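
Because a single typo breaks the connection, it can help to validate the initiator name and the target address before comparing them with the array. The following Python sketch is illustrative: the regular expression covers only the common iqn.yyyy-mm.authority:unique form, the target parser handles IPv4 addresses only, and both function names are invented for this example.

```python
import ipaddress
import re

# iqn.<year>-<month>.<reversed domain>[:<unique name>]
IQN_PATTERN = re.compile(r"^iqn\.(\d{4})-(\d{2})\.([^:]+)(?::(.+))?$")


def parse_iqn(name):
    """Split an IQN into its parts, or raise ValueError if it is malformed."""
    match = IQN_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a well-formed IQN: {name!r}")
    year, month, authority, unique = match.groups()
    if not 1 <= int(month) <= 12:
        raise ValueError(f"invalid month in IQN: {name!r}")
    return {"date": f"{year}-{month}", "authority": authority, "unique": unique}


def parse_target(address, default_port=3260):
    """Parse 'ip[:port]' (IPv4 only) into (ip, port), validating both parts."""
    host, separator, port_text = address.rpartition(":")
    if not separator:
        host, port_text = address, str(default_port)
    ip = ipaddress.ip_address(host)  # raises ValueError on a malformed address
    port = int(port_text)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return str(ip), port
```

For example, parse_target("172.20.13.12:3260") returns the address and port as a pair, while a name that starts with "isn." instead of "iqn." is rejected by parse_iqn.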



Possible Cause: iSCSI HBA Misconfiguration (2)
Slide 5-15

Confirm that the Challenge Handshake Authentication Protocol (CHAP)
authentication settings are correct.
• The CHAP authentication setting can be inherited from the ESXi host.
• Authentication Method can be bidirectional, or one of three kinds of unidirectional.

[Screenshot: 172.20.13.12:3260 - Authentication Settings dialog box. Option: Inherit settings from parent - vmhba65. Authentication Method: Use bidirectional CHAP. Outgoing CHAP Credentials (target authenticates the initiator): Name (Use initiator name), Secret. Incoming CHAP Credentials (initiator authenticates the target): Name (Use initiator name), Secret. Buttons: OK, Cancel.]

Verify your Challenge Handshake Authentication Protocol (CHAP) authentication configuration. If
CHAP is configured on the storage array, ensure that the authentication settings for the ESXi hosts
are the same as the settings on the array.
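
The reason the settings must match exactly is visible in how a CHAP response is computed. The following Python sketch follows the generic CHAP calculation from RFC 1994, in which the response is the MD5 hash of the identifier, the shared secret, and the challenge. It is an illustration of the principle, not ESXi code, and the secrets used with it are hypothetical.

```python
import hashlib
import hmac


def chap_response(identifier, secret, challenge):
    """RFC 1994 CHAP response: MD5(identifier || secret || challenge)."""
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()


def authenticator_accepts(identifier, challenge, response, secret_on_peer):
    """The authenticating side recomputes the response with its own copy of the secret."""
    expected = chap_response(identifier, secret_on_peer, challenge)
    return hmac.compare_digest(expected, response)
```

If the secret stored on the array differs from the secret configured on the host by even one character, the recomputed hash does not match and the login is rejected. With bidirectional CHAP the same exchange also runs in the opposite direction with the second secret.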



Possible Cause: iSCSI HBA Misconfiguration (3)
Slide 5-16

Verify that the VMkernel port bindings are configured properly.
• Here there are two VMkernel ports (vmk3 and vmk4) with each one connected to a single
port group.
• Port group policy is compliant.
• Path status is active.
• The vSphere Web Client also shows which NIC is being used and its configuration.

[Screenshot: Adapter Details, Network Port Binding tab.]

Port Group             VMkernel Adapter  Port Group Policy  Path Status  Physical Network Adapter
IPStorage01 (LabV...)  vmk3              Compliant          Active       vmnic6 (1 Gbit/s, Full)
IPStorage02 (LabV...)  vmk4              Compliant          Active       vmnic7 (1 Gbit/s, Full)

[Screenshot: details pane with the tabs Status, Port Group, Switch, VMkernel Adapter, and Physical Adapter. Port group policy: Compliant. Path status: Active.]

Verify that your VMkernel port bindings are configured correctly. Port binding is used in iSCSI when
multiple VMkernel ports for iSCSI reside on the same broadcast domain to allow multiple paths to an
iSCSI array that broadcasts a single IP address. When using port binding, consider the following facts:
• Array target iSCSI ports must reside on the same broadcast domain as the VMkernel port.
• All VMkernel ports must reside on the same broadcast domain.
If you configure port bindings on multiple VMkernel ports in different broadcast domains and the
target ports also reside in different broadcast domains, you might experience the following issues:
• Rescan times take longer than usual.
• An incorrect number of paths is seen per device.
• You cannot see storage from the storage device.
If you do not configure port bindings on multiple VMkernel ports that reside in the same broadcast
domain, you might experience the following symptoms:
• You cannot see storage presented to the ESXi host.
• Paths to the storage report as Dead.
• Loss of path redundancy messages are logged in vCenter Server.
For more information about considerations for using software iSCSI port bindings in ESXi, see
VMware knowledge base article 2038869 at http://kb.vmware.com/kb/2038869.
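
The two port binding rules can be checked from a list of addresses. The following Python sketch approximates "same broadcast domain" by "same IP subnet", which is an assumption: VLANs and physical topology also matter and are not visible from addresses alone. The function name is invented for this example.

```python
import ipaddress


def port_binding_issues(vmk_interfaces, target_ips):
    """Check bound VMkernel ports and target ports against the port binding rules.

    vmk_interfaces maps a VMkernel port name to an 'address/prefix' string.
    """
    networks = {
        name: ipaddress.ip_interface(address).network
        for name, address in vmk_interfaces.items()
    }
    issues = []
    # Rule: all bound VMkernel ports must share one broadcast domain.
    if len(set(networks.values())) > 1:
        issues.append("bound VMkernel ports are in different subnets")
    # Rule: every target port must be in the same broadcast domain as the ports.
    for target in target_ips:
        for name, network in networks.items():
            if ipaddress.ip_address(target) not in network:
                issues.append(f"target {target} is outside the subnet of {name} ({network})")
    return issues
```

An empty result means the addressing is consistent with port binding. A non-empty result suggests either moving the ports into one subnet or removing the port bindings, as the knowledge base article describes.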



Possible Cause: iSCSI HBA Misconfiguration (4)
Slide 5-17

Verify that the network port group is configured properly.
Here is one of the port group configurations (IPStorage01).
Note the following settings on this port group:
• Elastic port allocation
• Ports can be blocked
• No traffic shaping
• No VLAN
• No teaming

[Screenshot: sa-esxi-01.vclass.local: Details for vmk3, Port Group tab. Distributed port group: IPStorage01.]
General
  Name: IPStorage01
  Port binding: Static binding
  Port allocation: Elastic
  Number of ports:
  Network resource pool: (default)
Advanced
  Configure reset at disconnect: Enabled
Override port policies
  Block ports: Allowed
  Traffic shaping: Disabled
  Vendor configuration: Disabled
  VLAN: Disabled
  Uplink teaming: Disabled
  Security policy: Disabled
  NetFlow: Disabled
  Traffic filtering and marking: Disabled

If you do not have elastic port allocation configured in the port group, it is possible to run into a
"no available ports" condition, which can prevent storage from connecting.
Port blocking, traffic shaping, VLAN configuration, and teaming can have either a positive or
negative impact on network-connected storage.



Possible Cause: iSCSI HBA Misconfiguration (5)
Slide 5-18

Verify that the virtual switch is configured properly.
• This is a distributed switch. Standard switches can also be used but are not recommended.
• MTU is set to 1500.
  - An incorrect MTU can degrade storage performance.
• Only basic multicast filtering is configured.
• No administrator contact for the network is documented.

[Screenshot: sa-esxi-01.vclass.local: Details for vmk3, Switch tab. Distributed switch: LabVDS.]
General
  Name: LabVDS
  Manufacturer: VMware, Inc.
  Version: 6.0.0
  Number of uplinks:
  Network I/O Control: Disabled
Advanced
  MTU: 1500 Bytes
  Multicast filtering mode: Basic
Discovery protocol
  Type: Cisco Discovery Protocol
  Operation: Listen
Administrator contact
  Name:
  Other details:

The Switch tab is the first place that the MTU is reported. An incorrect MTU
configuration can have an extremely negative impact on network-connected storage.
If the network is being managed by a different administration group, contact information and other
details should be available.



Possible Cause: iSCSI HBA Misconfiguration (6)
Slide 5-19

Verify that the VMkernel adapter is configured properly.
• This VMkernel adapter is vmk3. It is connected to the IPStorage01 port group.
• This adapter has a static address.
• Depending on how the iSCSI is configured, the gateway and DNS configuration can impact
storage.
• Best practice is for storage to be on non-routed networks.
• The iSCSI target can be configured by node name or IP address.

[Screenshot: sa-esxi-01.vclass.local: Details for vmk3, VMkernel Adapter tab. VMkernel network adapter: vmk3.]
Port properties
  Network label: IPStorage01
  Enabled services:
IPv4 settings
  DHCP: Disabled
  IPv4 address: 172.20.13.51 (static)
  Subnet mask: 255.255.255.0
  Default gateway: 172.20.10.10
  DNS server addresses: 172.20.10.10
NIC settings
  MAC address: 00:50:56:6e:00:02
  MTU: 1500

Although storage adapters can be configured to use DHCP, it is best practice to use a static address.
If the network configuration (routing, gateway, DNS) is incorrect, storage access can be degraded
or prevented entirely. The iSCSI target can be configured by node name as well as IP address. But
if you have a node name configured, you must have a correct DNS configuration.



Possible Cause: iSCSI HBA Misconfiguration (7)
Slide 5-20

Verify that the physical network adapter is configured properly.
• Here is the full information on the type and configuration of the physical network adapter.
• Some adapters and configurations are not supported.

[Screenshot: sa-esxi-01.vclass.local: Details for vmk3, Physical Adapter tab. Physical network adapter: vmnic6.]
Adapter: Intel Corporation 82545EM Gigabit Ethernet Controller (Copper)
Name: vmnic6
Location: PCI 0000:02:07.0
Driver: e1000
Status
  Status: Connected
  Configured speed, Duplex: Auto negotiate
  Actual speed, Duplex: 1000 Mb, Full Duplex
  Networks: 172.20.111.10-172.20.111.10
DirectPath I/O
  Status: Not supported
  The physical NIC does not support DirectPath I/O.
Cisco Discovery Protocol
  Cisco Discovery Protocol is not available on this physical network adapter
Link Layer Discovery Protocol
  Link Layer Discovery Protocol is not available on this physical network adapter

The configuration of your network adapter must match the requirements of the physical network it is
connected to. Best practice is to set up an isolated network that is dedicated to storage.



Possible Cause: Port Unreachable
Slide 5-21

Failure could occur because iSCSI TCP port 3260 is unreachable.
• From the ESXi host, use the nc (netcat) command to reach port 3260 on
the iSCSI storage array.
  - nc -z IPaddr 3260
    • IPaddr is the IP address of the iSCSI storage array.

Resolve this problem by checking paths between the host and hardware:
• Verify that the iSCSI storage array is configured properly and is active.
• Verify that a firewall is not interfering with iSCSI traffic.

You use the netcat (nc) command to verify that you can reach the iSCSI TCP port (default 3260)
on the storage array from the host.
If you receive an error message or receive no response, then verify that the iSCSI storage array is
configured to permit connections on that port.
Also, if a firewall is between your ESXi host and iSCSI storage array, check that the firewall is not
blocking connections to TCP port 3260.
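
The same reachability test can be scripted where nc is not convenient, for example from a management workstation. The following Python sketch is a rough equivalent of nc -z that uses only the standard socket module. The function name and the 3-second default timeout are choices made for this example. A test run from a workstation checks the path from that workstation, not from the VMkernel interface of the ESXi host.

```python
import socket


def tcp_port_open(host, port=3260, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the TCP handshake and then we close at once,
        # which is what `nc -z` does.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

For example, tcp_port_open("172.20.13.12") tests the target address used earlier in this lesson. A false result has the same meaning as a failed nc test: check the array configuration and any firewall in the path.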



Possible Cause: VMFS Metadata Inconsistency
Slide 5-22

Verify that your VMware vSphere® VMFS datastore metadata is consistent:
• Use the vSphere On-disk Metadata Analyzer to check VMFS metadata
consistency:
  - voma -m vmfs -d /vmfs/devices/disks/naa.00000000000000000000000000:1 -s /tmp/analysis.txt

A file system's metadata must be checked under the following conditions:
• Disk replacement
• Reports of metadata errors in the vmkernel.log file
• Inability to access files on the VMFS volume that are not in use by any other
host
If you encounter VMFS inconsistencies, perform these tasks:
1. Recreate the VMFS datastore and restore files from your last backup to the
VMFS datastore.
2. If necessary, complete a support request.

The vSphere On-disk Metadata Analyzer (VOMA) is a utility for performing VMFS file system
metadata checks. This utility scans the VMFS volume metadata and highlights any inconsistencies
for which you might need to open a support request.
You perform a VOMA check on a VMFS datastore and send the results to a specific log file.
Before running VOMA, you must ensure that the following conditions are true:
• All virtual machines on the affected datastore are powered off or migrated to another datastore.
• The datastore consists of only a single extent and has been unmounted on all ESXi hosts.
VOMA must be run against a disk partition and not the disk device. If VOMA is run against a disk
device, it produces an error similar to the following:
Error: Missing LVM Magic. Disk doesn't have a valid LVM Device
Error: Failed to Initialize LVM Metadata
When the corruption is irreversible, VMware recommends that you restore the datastore files from a
backup or consult with a data recovery organization. VMware does not perform data recovery.
For more information about using VOMA, see VMware knowledge base article 2036767 at
http://kb.vmware.com/kb/2036767.
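
Because VOMA fails when it is pointed at a disk device rather than a partition, a small guard can catch the mistake before the command is run. The following Python sketch only builds the command string from the options used in this lesson (-m, -d, and -s). The function name is invented, and treating a trailing ":<number>" as the partition marker is an assumption based on the example paths shown here.

```python
import re
import shlex

# A partition is written as <device>:<partition number>, for example naa.xxx:1
PARTITION_SUFFIX = re.compile(r":\d+$")


def build_voma_command(device_path, log_file=None):
    """Return a voma command line, refusing paths that name a whole disk device."""
    if not PARTITION_SUFFIX.search(device_path):
        raise ValueError(
            "VOMA must be run against a partition, for example <device>:1"
        )
    args = ["voma", "-m", "vmfs", "-d", device_path]
    if log_file:
        args += ["-s", log_file]
    # Quote each argument so the result can be pasted into a shell safely.
    return " ".join(shlex.quote(arg) for arg in args)
```

Calling the function with a path that has no partition suffix raises an error instead of producing a command that would end in the "Missing LVM Magic" message.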



Use vSphere On-Disk Metadata Analyzer (1)
Slide 5-23

To use the vSphere On-disk Metadata Analyzer (VOMA) utility:
1. Determine the runtime name in vSphere Web Client.
   • Runtime name might be in a Cx:Ty:Lz format or might be in a different format like an
   NAA or EUI ID.
   • Runtime name might not be the same as Device ID.

[Screenshot: the Shared datastore, Device Backing settings. "A VMFS Datastore can span multiple hard disk partitions, or extents ... Select an extent to view its device details." The extent listed is STARWIND iSCSI Disk (eui.dbfd8ac3cf56761b):1.]
Device Details
  Device: STARWIND iSCSI Disk (eui.dbfd8ac3cf56761b)
  Capacity: 10.00 GB
  Partition Format: GPT



Use vSphere On-Disk Metadata Analyzer (2)
Slide 5-24

To use the VOMA utility:
2. Migrate virtual machines off of shared storage.
3. Unmount the datastore from all ESXi hosts.
4. Determine the device name with:
   esxcli storage core path list

iqn.1998-01.com.vmware:esxi-a-01-3dbe31a6-00023d000004,iqn.2008-08.com.starwindsc
   UID: iqn.1998-01.com.vmware:esxi-a-01-3dbe31a6-00023d000004,iqn.2008-08.com.st
   Runtime Name: vmhba65:C1:T4:L0
   Device: eui.dbfd8ac3cf56761b
   Device Display Name: STARWIND iSCSI Disk (eui.dbfd8ac3cf56761b)
   Adapter: vmhba65
   Channel: 1
   Target: 4
   LUN: 0



Use vSphere On-Disk Metadata Analyzer (3)
Slide 5-25

Use the VOMA utility:
voma -m vmfs -d /vmfs/devices/disks/devicename

[root@sa-esxi-01:~] voma -m vmfs -d /vmfs/devices/disks/eui.dbfd8ac3cf56761b
Checking if device is actively used by other hosts
Running VMFS Checker version 2.1 in default mode
Initializing LVM metadata, Basic Checks will be done
Phase 1: Checking VMFS header and resource files
   Detected VMFS file system (labeled:'Shared') with UUID:560d0f97-f4de674a-fed0-005056011403, Version 5:61
Phase 2: Checking VMFS heartbeat region
Phase 3: Checking all file descriptors.
Phase 4: Checking pathname and connectivity.
Phase 5: Checking resource reference counts.

Total Errors Found: 0



Possible Cause: NFS Misconfiguration
Slide 5-26

If your virtual machines reside on NFS datastores, verify that your NFS
configuration is correct.

[Figure: an NFS server and an ESXi host connected over the network.]
NFS server:
• Directory to share with the ESXi host over the network
• NFS server name or IP address
• Mount permission (Read/Write or Read-Only) and ACLs
ESXi host:
• NIC mapped to virtual switch
• VMkernel port configured with IP address

An NFS file system is located on an NFS server. The NFS server contains one or more directories
that are shared with the ESXi host over a TCP/IP network. An ESXi host accesses the NFS server
through a VMkernel port that is defined on a virtual switch.
If the ESXi host is unable to access its NFS datastore, verify that the properties of your NFS
datastore are configured correctly:
• Host name or IP address of the NFS server
• no_root_squash set on the NFS share
• The path to the folder on the NFS server that you want this datastore to correspond to
• Mount permissions (read-only or read/write)
• The name of the datastore



NFS Version Compatibility with Other vSphere Technologies
Slide 5-27

vSphere Technologies                                               NFS v3   NFS v4.1
vSphere vMotion/VMware vSphere® Storage vMotion®                   Yes      Yes
vSphere HA                                                         Yes      Yes
VMware vSphere® Fault Tolerance                                    Yes      Yes
vSphere DRS/VMware vSphere® Distributed Power Management™          Yes      Yes
Stateless ESXi/Host Profiles                                       Yes      Yes
VMware vSphere® Storage DRS™/VMware vSphere® Storage I/O Control   Yes      No
VMware Site Recovery Manager™                                      Yes      No
VMware vSphere® Virtual Volumes™                                   Yes      No
Hardware acceleration support, vSphere Storage APIs -
Array Integration                                                  Yes      Yes



NFS Dual Stack Not Supported
Slide 5-28

NFS v3 and v4.1 use different locking semantics:
• NFS v3 uses proprietary client-side cooperative locking.
• NFS v4.1 uses server-side locking.
The best practices are:
• Configure an NFS array to allow only one NFS protocol.
• Use either NFS v3 or NFS v4.1 to mount the same NFS share across
all ESXi hosts in a cluster.

[Screenshot: New Datastore wizard, step 2, Select NFS version, with the options NFS 3 and NFS 4.1 and the warning: "Use only one NFS version to access a given datastore. Consequences of mounting one or more hosts to the same datastore using different versions can include data corruption."]

Data corruption might occur if hosts attempt to access the same NFS
share using different NFS client versions.
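
A mixed-version mount is easy to miss in a large cluster, so an inventory check can help. The following Python sketch assumes that you have already collected, by whatever means, one (host, server, share, version) tuple per mount. The host names, server address, and share path in the sample data are hypothetical.

```python
from collections import defaultdict


def mixed_version_shares(mounts):
    """Return the shares that are mounted with more than one NFS version.

    mounts is an iterable of (host, nfs_server, share_path, nfs_version) tuples.
    """
    versions_by_share = defaultdict(set)
    for _host, server, share, version in mounts:
        versions_by_share[(server, share)].add(version)
    return {
        share: sorted(versions)
        for share, versions in versions_by_share.items()
        if len(versions) > 1
    }


# Hypothetical inventory: the second host mounts the same share with NFS 4.1.
SAMPLE_MOUNTS = [
    ("esxi-01", "nfs.example.local", "/exports/ds01", "3"),
    ("esxi-02", "nfs.example.local", "/exports/ds01", "4.1"),
    ("esxi-01", "nfs.example.local", "/exports/ds02", "4.1"),
]
```

For the sample inventory, only /exports/ds01 is reported, because it is mounted with both version 3 and version 4.1.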



NFS Client Authentication
Slide 5-29

The NFS v3 and v4.1 clients support AUTH_SYS:
• Server trusts UID or GID included in the RPC request.
• Client trusts the response received from the server.
Security exceptions must be configured to enable root access to files:
• Disable root_squash to allow root access to files.
The NFS v4.1 client supports Kerberos 5i authentication:
• Introduced in vSphere 6.5.
• In addition to identity verification, provides data integrity services.
• Supports stronger encryption types:
  - AES256-CTS-HMAC-SHA1-96
  - AES128-CTS-HMAC-SHA1-96



Configuring Active Directory and NFS Servers to Use Kerberos
Slide 5-30

If you use NFS 4.1 with Kerberos, you must perform the following tasks
before enabling Kerberos on your ESXi hosts:
• Verify that Active Directory (AD) and NFS servers are configured to use
Kerberos (5 or 5i).
• In AD, enable one of the following encryption modes:
  - AES256-CTS-HMAC-SHA1-96
  - AES128-CTS-HMAC-SHA1-96
• In vSphere 6.5, the NFS 4.1 client does not support the DES-CBC-MD5 encryption
mode.
• Verify that the NFS server exports are configured to grant full access to the
Kerberos user.



Configuring Host Time Synchronization
Slide 5-31

Configure Network Time Protocol settings for each ESXi host:
• Configuring Network Time Protocol settings can be automated with host
profiles.
• An incorrect time setting can cause services such as Kerberos to stop functioning.
• Best practice is to use NTP whenever possible.

[Screenshot: host Manage tab, Settings, Time Configuration.]
  Date & Time: 2/8/2015 1:31 PM
  NTP Client: Disabled
  NTP Service Status: Stopped
  NTP Servers:

This host is not configured for NTP, which can cause problems.
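
Kerberos rejects requests when the clocks of the participants differ by more than a configured tolerance. The following Python sketch compares a host clock reading with a reference clock reading. The 5-minute tolerance is the common Kerberos default and is an assumption here, so check the policy of your own domain. The function name and the sample timestamps are invented for this example.

```python
from datetime import datetime, timedelta, timezone

# Common Kerberos default tolerance; an assumption, verify against your KDC policy.
MAX_CLOCK_SKEW = timedelta(minutes=5)


def skew_exceeds_tolerance(host_time, reference_time, tolerance=MAX_CLOCK_SKEW):
    """Return True if the two clock readings differ by more than the tolerance."""
    return abs(host_time - reference_time) > tolerance


# Hypothetical readings: the host clock is seven minutes behind the reference.
REFERENCE = datetime(2017, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
HOST = REFERENCE - timedelta(minutes=7)
```

With these readings the function returns true, which is the situation in which Kerberos-authenticated NFS 4.1 mounts can start to fail on a host that is not synchronized with NTP.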



Configuring Host Authentication Services
Slide 5-32

Each ESXi host must be added to the AD domain.
• Adding each ESXi host to the AD domain can be automated with host profiles.
Each ESXi host can be configured to use an AD account that was
created for NFS access:
• You can automate this configuration with host profiles.
• The user name and password can also be set using the CLI:
  esxcfg-nas -U nfsuser -v 4.1
Use the same AD user on all ESXi hosts:
• vSphere vMotion and other components might fail if individual hosts use
different AD user accounts.
• Use host profiles to automate and avoid errors.



Configuring the Datastore to Use Kerberos
Slide 5-33

Enable Kerberos authentication when creating each datastore.
You can use host profiles to automate Kerberos authentication.

[Screenshot: New Datastore wizard, step 5, Configure Kerberos authentication. "The NFS 4.1 client supports Kerberos authentication of RPC headers. You can enable it here." Check box: Enable Kerberos-based authentication. Warning: "In order to use Kerberos authentication, each host that mounts this datastore has to be a part of an Active Directory domain and its NFS authentication credentials need to be set. This is done on the Authentication Services page on each host."]



Viewing Session Information
Slide 5-34

You use the esxcli storage nfs41 list command to view the
volume name, IP address, and other information.

[Screenshot: an NFS 4.1 datastore in vSphere Web Client and the output of esxcli storage nfs41 list on the ESXi host, showing the volume name, the NFS server IP address, and the share.]



Review of Learner Objectives
Slide 5-35

You should be able to meet the following objectives:


• Discuss vSphere storage architecture
• Identify possible causes of problems in various types of datastores
• Analyze common storage connectivity and configuration problems and discuss
possible causes
• Solve storage connectivity problems, correct misconfigurations, and restore
LUN visibility



Lesson 2: Multipathing
Slide 5-36

Lesson 2: Multipathing



Learner Objectives
Slide 5-37

By the end of this lesson, you should be able to meet the following
objectives:
• Review multipathing
• Identify common causes of missing paths, including PDL and APD conditions
• Solve missing path problems between hosts and storage devices



Review of iSCSI Multipathing
Slide 5-38

If your ESXi host has iSCSI multipathing issues, check the multipathing
configuration on the ESXi host and, if necessary, the iSCSI hardware
configuration.

[Figure: VMkernel ports used for iSCSI multipathing]

With software iSCSI, you can use multiple NICs that provide failover for iSCSI connections
between your host and iSCSI storage systems.
For this setup, because multipathing plug-ins do not have direct access to physical NICs on your host,
you first connect each physical NIC to a separate VMkernel port. You then use port binding to
associate all VMkernel ports with the iSCSI initiator. As a result, each VMkernel port connected to a
separate NIC becomes a different path that the iSCSI storage stack and its multipathing plug-in can
use.
After iSCSI multipathing is set up, each port on the ESXi host has its own IP address, but they all
share the same iSCSI initiator IQN. Due to the latency that can be incurred, VMware does not
recommend routing iSCSI traffic.
For more about configuring iSCSI multipathing, see vSphere Storage at
https://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-6-pubs.html.



Storage Problem 2
Slide 5-39

Initial checks of LUN paths are performed using the esxcli command:
• Find detailed information regarding multiple paths to the LUNs:
- esxcli storage core path list
• List LUN multipathing information:
- esxcli storage nmp device list
• Check whether a rescan restores visibility to the LUNs:
- esxcli storage core adapter rescan -A vmhba##

Loss of connectivity to a specific storage device might be due to one or more paths to a LUN being
lost. If all paths are lost, then any virtual machines using the affected datastore become
unresponsive.
The first check is to get detailed information on available LUNs and paths on your ESXi host. The
esxcli storage core path list command prints a mapping between the HBAs and the
devices for which they provide paths.
You might also list the devices controlled by VMware Native Multipathing (NMP) and show the
Storage Array Type Plug-In (SATP) and Path Selection Plug-In (PSP) information associated with
each device. SATP, also called Storage Array Type Policy, handles path failover for a given storage
array. PSP, also called Path Selection Policy, handles path selection for a given device.
If you determine that certain paths to a LUN are missing, perform a rescan of the troubled adapter to
try to restore LUN visibility.
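The path check described above can be scripted. The following is a minimal sketch, not a VMware-supplied tool. It assumes that each path block in the output of esxcli storage core path list contains indented Device: and State: fields and that a lost path reports the state dead. The function name summarize_paths is hypothetical.

```shell
# Summarize path states per device.
# Input (stdin): output of "esxcli storage core path list".
# Output: one line per device, "<device> total=<n> dead=<n>".
# Assumption: each path block has indented "Device:" and "State:" fields,
# and the "Device:" field appears before the "State:" field.
summarize_paths() {
    awk '
        /^ *Device: / {
            dev = $2
            if (!(dev in total)) {
                order[++n] = dev
            }
            total[dev]++
        }
        /^ *State: / {
            if ($2 == "dead") {
                dead[dev]++
            }
        }
        END {
            for (i = 1; i <= n; i++) {
                d = order[i]
                printf "%s total=%d dead=%d\n", d, total[d], dead[d] + 0
            }
        }
    '
}
```

In vSphere ESXi Shell, you might run esxcli storage core path list | summarize_paths. A device whose dead count equals its total count has lost all of its paths.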



Identifying Possible Causes
Slide 5-40

If you see errors in /var/log/vmkernel.log that refer to a permanent device loss (PDL) or
all paths down (APD) condition, then take a bottom-up approach to troubleshooting.

Possible Causes

ESXi Host:
• For NFS storage, NIC teaming is misconfigured.
• The path selection policy for a storage device is misconfigured.
Hardware (Storage Network, Storage Array):
• A PDL condition has occurred.
• An APD condition has occurred.

For information about troubleshooting lost redundant paths to a storage device, see VMware
knowledge base article 1009554 at http://kb.vmware.com/kb/1009554.
For more information about the permanent device loss (PDL) and all paths down (APD) conditions,
see VMware knowledge base article 2004684 at http://kb.vmware.com/kb/2004684.



PDL Condition
Slide 5-41

A storage device is in a PDL state when it becomes permanently unavailable to the ESXi host.
Possible causes of an unplanned PDL:
• The device is unintentionally removed.
• The device's unique ID changes.
• The device experiences an unrecoverable hardware error.
• The device ran out of space, causing it to become inaccessible.
vSphere Web Client displays pertinent information when a device is in a
PDL state:
• The operational state of the device changes to Lost Communication.
• All paths appear as dead.
• Datastores on the device are unavailable.

When a PDL occurs, vSphere Web Client displays the following information for the device:
• The operational state of the device changes to Lost Communication.
• All paths appear as Dead.
• Datastores on the device are dimmed.
Check /var/log/vmkernel.log.
When the storage array determines that the device is permanently unavailable, it sends SCSI sense
codes to the ESXi host. The sense codes enable your host to recognize that the device has failed and
register the state of the device as PDL.
The following VMkernel log (/var/log/vmkernel.log) example of a SCSI sense code indicates
that the device is in the PDL state:
H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0 or Logical Unit Not Supported
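A quick way to look for this sense code is to filter the log. The following is a minimal sketch that assumes the sense data appears in the log exactly as in the example above. The function name find_pdl_lines is hypothetical.

```shell
# Print VMkernel log lines that carry the PDL sense data 0x5 0x25 0x0.
# Input (stdin): contents of /var/log/vmkernel.log.
# Assumption: the sense data is logged in the form shown in the example.
find_pdl_lines() {
    grep -F 'Valid sense data: 0x5 0x25 0x0'
}
```

For example, you might run find_pdl_lines with /var/log/vmkernel.log redirected to its standard input. Matching lines identify candidates for a PDL condition. Confirm the device state in vSphere Web Client before you take action.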

In the case of iSCSI arrays with a single LUN per target, PDL is detected through iSCSI login
failure. An iSCSI storage array rejects your host's attempts to start an iSCSI session with the reason
Target Unavailable. As with the sense codes, this response must be received on all paths for the
device to be considered permanently lost.
After registering the PDL state of the device, the host stops attempts to reestablish connectivity or to
issue commands to the device to avoid becoming blocked or unresponsive. The I/O from virtual
machines is terminated.
vSphere HA can detect a PDL condition and restart the virtual machines that were terminated
during it. The objective is to start the affected virtual machines on another host that might not be
in a PDL state for the shared storage device. vSphere HA does not migrate a virtual machine's disks.
In the event of a failure, vSphere HA attempts to start a virtual machine on another host but on the
same storage.
The das.maskCleanShutdownEnabled option enables vSphere HA to differentiate between
virtual machines that were shut down gracefully during a PDL, and thus should not be restarted, and
virtual machines that were terminated during a PDL, and thus should be restarted.



Recovering from an Unplanned PDL
Slide 5-42

If the LUN was not in use when the PDL condition occurred, the LUN is
removed automatically after the PDL condition clears.
If the LUN was in use, manually detach the device and remove the LUN
from the ESXi host.
When storage reconfiguration is complete, perform these steps:
1. Reattach the storage device.
2. Mount the datastore.
3. Restore from backups if necessary.
4. Restart the virtual machines.

For detailed information about detaching devices and removing a LUN, see vSphere Command-Line
Interface Concepts and Examples at http://www.vmware.com/support/developer/vcli.
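The manual detach and removal can be sketched with ESXCLI. This is an illustration under stated assumptions, not the documented procedure: it assumes that virtual machines on the device are already powered off and that the datastore is unmounted. The function name detach_lost_device and the ESXCLI override variable are hypothetical. Setting ESXCLI=echo prints the commands instead of running them.

```shell
# ESXCLI can be overridden (for example, ESXCLI=echo) to preview the commands.
ESXCLI="${ESXCLI:-esxcli}"

# Detach a lost device and remove its stale entry from this host.
#   $1 = device identifier, for example an naa.* ID
# Assumption: the datastore on the device is already unmounted.
detach_lost_device() {
    dev="$1"
    [ -n "$dev" ] || { echo "usage: detach_lost_device <device-id>" >&2; return 1; }
    # Detach the device by setting its state to off.
    $ESXCLI storage core device set --state=off -d "$dev" || return 1
    # Remove the device from the list of detached devices.
    $ESXCLI storage core device detached remove -d "$dev" || return 1
    # Rescan all adapters so that the host updates its view of the paths.
    $ESXCLI storage core adapter rescan --all
}
```

Preview the commands first, and compare them with the procedure in the referenced guide before you run them on a production host.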



APD Condition
Slide 5-43

An APD condition occurs when a storage device becomes unavailable to your ESXi host for an
unspecified amount of time:
• This condition is transient. The device is expected to be available again.
An APD condition has several possible causes:
• The storage device is removed in an uncontrolled manner from the host.
• The storage device fails:
- The VMkernel cannot detect how long the loss of device access will last.
• Network connectivity fails, which brings down all paths to iSCSI storage.
vSphere Web Client displays pertinent information when an APD
condition occurs:
• The operational state of the device changes to Dead or Error.
• All paths appear as dead.
• Datastores on the device are unavailable.

In contrast with the PDL state, the host treats an APD condition as transient and expects the device
to be available again.
An APD condition can occur on an ESXi host when a storage device is removed in an uncontrolled
manner from the host. An APD can also occur if the device fails and the VMkernel core storage
stack cannot detect how long the loss of device access will last. One possible scenario for an APD
condition is a Fibre Channel switch failure that brings down all the storage paths. Or, in the case of
an iSCSI array, an APD condition can occur if a network connectivity issue similarly brings down
all the storage paths.
The host indefinitely continues to retry issued commands in an attempt to reestablish connectivity
with the device. If the host's commands fail the retries for a prolonged period of time, the host and
its virtual machines might be at risk of having performance problems and becoming unresponsive.
When an APD occurs, vSphere Web Client displays the following information for the device:
• The operational state of the device changes to Dead or Error.
• All paths are shown as Dead.
• Datastores on the device are dimmed.
• Host might appear as disconnected in vCenter Server.
Check /var/log/vmkernel.log.



Although the device and datastores are unavailable, virtual machines might remain responsive. You
can power off the virtual machines or migrate them to a different host. If one or more device paths
later become operational, subsequent I/O to the device is issued normally and all special APD
treatment ends.
An APD condition has an effect on the management agents, such as the ESXi host's hostd process.
The commands from these management agents are not responded to until the device is accessible
again. As a result, an ESXi host becomes inaccessible in vCenter Server because the hostd process
has hung.
hostd is responsible for managing many ESXi host operations, such as virtual machine creation,
virtual machine power state changes, vSphere vMotion migrations, and LUN and VMFS volume
discovery. If an administrator issues a rescan of the SAN, hostd worker threads wait indefinitely
for I/O to return from the device in APD. However, hostd has a finite number of worker threads.
So, if all these threads are busy waiting for disk I/O, other hostd tasks will be affected.



Recovering from an APD Condition
Slide 5-44

The APD condition must be resolved at the storage array or fabric layer
to restore connectivity to the host:
• All affected ESXi hosts might require a reboot.
vSphere vMotion migration of unaffected virtual machines cannot be
attempted:
• Management agents might be affected by the APD condition.
To avoid APD problems, the ESXi host has a default APD handling
feature:
• Global setting: Misc.APDHandlingEnable
- By default, set to 1, which enables storage APD handling
• Timeout setting: Misc.APDTimeout
- By default, set to 140, the number of seconds that a device can be in APD before
failing

No clean way exists to recover from an APD condition. All affected ESXi hosts might require a
reboot to remove any residual references to the affected devices that are in an APD state.
Management agents might be affected by the APD condition, and the ESXi host might become
unmanaged. As a result, a reboot of an affected ESXi host forces an outage to all unaffected virtual
machines on that host.
When a device enters the APD state, the system immediately turns on a timer and allows the ESXi
host to continue retrying non-virtual machine commands for a limited period of time.
By default, the APD timeout is set to 140 seconds, which is typically longer than most devices need
to recover from a connection loss. If the device becomes available within this time, the host and its
virtual machines continue to run without experiencing any problems.
If the device does not recover and the timeout ends, the host stops its attempts at retries and
terminates any non-virtual machine I/O. The device is marked as APD Timeout. Any further I/Os
are fast-failed with a status of No Connect, preventing hostd and others from getting hung. Virtual
machine I/O will continue retrying.
If a path to the device recovers, subsequent I/Os to the device are issued normally and special APD
treatment concludes.
A global setting exists called Misc.APDHandlingEnable. If the value is set to 0, then the ESXi
host retries failing I/Os indefinitely. If Misc.APDHandlingEnable is set to 1, APD handling uses
the timeout setting called Misc.APDTimeout. This setting has a default value of 140 seconds.
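You can read both settings from vSphere ESXi Shell. The following is a minimal sketch that assumes the output of esxcli system settings advanced list contains an indented Int Value: line for the queried option. The function names advanced_int_value and describe_apd are hypothetical.

```shell
# Extract the current value from the output of
# "esxcli system settings advanced list -o <option path>".
# Assumption: the output has an indented "Int Value:" line.
advanced_int_value() {
    awk -F': *' '/^ *Int Value:/ { print $2; exit }'
}

# Describe the APD behavior for the given settings.
#   $1 = value of Misc.APDHandlingEnable
#   $2 = value of Misc.APDTimeout, in seconds
describe_apd() {
    if [ "$1" = "1" ]; then
        echo "APD handling on: non-VM I/O fast-fails after $2 seconds"
    else
        echo "APD handling off: failing I/O is retried indefinitely"
    fi
}
```

For example, pipe the output of esxcli system settings advanced list -o /Misc/APDHandlingEnable and of esxcli system settings advanced list -o /Misc/APDTimeout through advanced_int_value, and pass the two results to describe_apd.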



Possible Cause: NIC Teaming Misconfiguration
Slide 5-45

Verify that NIC teaming is configured properly.

[Screenshot: Production-A - Edit Settings dialog box, Teaming and failover page. The Load balancing list shows Route based on source MAC hash, Route based on originating virtual port, Use explicit failover order, and Route based on physical NIC load. Other settings are Network failure detection, Notify switches, and Failback. The Failover order section lists Uplink3 and Uplink4 under Active uplinks, followed by Standby uplinks.]

If you have not configured port binding and you are using NIC teaming on the virtual switch, verify
that the physical switch ports are configured consistently for each teamed network adapter. Also
verify that the proper load-balancing policy is configured on the virtual switch. VMware
recommends that you use the default load-balancing policy, Route based on originating virtual
port. If link aggregation on the physical switch is configured, use the Route based on IP hash
load-balancing policy.
To use some adapters but reserve others for emergencies, you can use the failover order conditions
to specify how to distribute the workload for the network adapters:
• Active adapters: Continue to use the adapter when the network adapter connectivity is available
and active.
• Standby adapters: Use this adapter if one of the active adapters' connectivity is unavailable.
• Unused adapters: Do not use this adapter.
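For a standard switch, the load-balancing policy can also be checked from the command line. The following is a minimal sketch. It assumes that esxcli network vswitch standard policy failover get prints an indented Load Balancing: line and that the values srcport and iphash stand for Route based on originating virtual port and Route based on IP hash. The function names are hypothetical, and the check does not apply to distributed switches.

```shell
# Extract the load-balancing policy from the output of
# "esxcli network vswitch standard policy failover get -v <vSwitch>".
# Assumption: the output has an indented "Load Balancing:" line.
teaming_policy() {
    awk -F': *' '/^ *Load Balancing:/ { print $2; exit }'
}

# Compare the policy with the recommendation in the text.
#   $1 = policy name (assumed values: srcport, iphash, and others)
#   $2 = "yes" if the physical switch uses link aggregation, else "no"
check_teaming() {
    case "$2:$1" in
        yes:iphash)  echo "ok: IP hash with link aggregation" ;;
        yes:*)       echo "mismatch: link aggregation requires IP hash" ;;
        no:iphash)   echo "mismatch: IP hash without link aggregation" ;;
        no:srcport)  echo "ok: default policy (originating virtual port)" ;;
        *)           echo "review: non-default policy $1" ;;
    esac
}
```

A mismatch result is only a hint. Confirm the physical switch port configuration before you change the policy.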



Possible Cause: Path Selection Policy Misconfiguration
Slide 5-46

Verify that the path selection policy for a storage device is configured
properly.
[Screenshot: Edit Multipathing Policies dialog box for a storage device on esxi-a-01.vclass.local. The Path selection policy list shows Fixed (VMware), Most Recently Used (VMware), and Round Robin (VMware). Two paths are listed: vmhba33:C0:T4:L0 with status Active (I/O), marked as preferred, and vmhba33:C1:T4:L0 with status Active.]

You usually do not have to change the default multipathing settings that your host uses for a specific
storage device. However, if you want to make changes, you can modify a path selection policy and
specify the preferred path for the Fixed policy.
By default, VMware supports the following path selection policies. If you have a third-party PSP
installed on your host, its policy also appears on the list:
• Fixed (VMware): The host uses the designated preferred path, if it has been configured.
Otherwise, it selects the first working path discovered at system boot time. If you want the host
to use a particular preferred path, specify it manually. Fixed is the default policy for most
active-active storage devices.
• Most Recently Used (VMware): The host selects the path that it used most recently. When the
path becomes unavailable, the host selects an alternative path. The host does not revert to the
original path when that path becomes available again. No preferred path setting is associated
with the MRU policy. MRU is the default policy for most active-passive storage devices.
• Round Robin (VMware): The host uses an automatic path-selection algorithm rotating through
all active paths when connecting to active-passive arrays, or through all available paths when
connecting to active-active arrays. Round Robin is the default for a number of arrays and can be
used with both active-active and active-passive arrays to implement load balancing across paths
for different LUNs. Round Robin is also the path selection policy recommended by VMware.



You can use the esxcli command to make Round Robin the default policy for new LUNs:
esxcli storage nmp satp set -s SATP_NAME -P VMW_PSP_RR

SATP_NAME is the name of an SATP, such as VMW_SATP_LOCAL.

You can use the following code to change all live devices to use Round Robin without changing
them one by one:
for i in `esxcli storage nmp device list | grep naa | grep -v Device`;
do
  esxcli storage nmp device set --device $i --psp VMW_PSP_RR;
done
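Before or after such a bulk change, you might list the devices that do not use Round Robin. The following is a minimal sketch that assumes the layout of esxcli storage nmp device list output: each device identifier starts in the first column, followed by indented fields that include Path Selection Policy:. The function name non_rr_devices is hypothetical.

```shell
# List devices whose path selection policy is not VMW_PSP_RR.
# Input (stdin): output of "esxcli storage nmp device list".
# Output: one line per device, "<device> <current policy>".
# Assumption: device IDs start in column 1 and fields are indented.
non_rr_devices() {
    awk '
        /^[^ ]/ { dev = $1 }
        /^ *Path Selection Policy: / {
            if ($4 != "VMW_PSP_RR") {
                print dev, $4
            }
        }
    '
}
```

For example, esxcli storage nmp device list | non_rr_devices prints nothing when every device already uses Round Robin.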



Possible Cause: NFSv3 and v4.1 Misconfiguration
Slide 5-47

Virtual machines on an NFS 4.1 datastore fail after the NFS 4.1 share
recovers from an APD state.
The lock protecting VM.vmdk has been lost error message is
displayed.
This issue occurs because NFSv3 and v4.1 are two different protocols
with different behaviors. After the grace period (array vendor-specific),
the NFS server flushes the client state.
This behavior is expected in NFSv4 servers.

When the NFS 4.1 storage enters an APD state and then exits it after a grace period, you experience
these symptoms:
• Powered-on virtual machines that run on the NFS 4.1 datastore fail.
• After the NFS 4.1 share recovers from the APD condition, you see this message on the virtual
machine's Summary page in vSphere Web Client: The lock protecting VM.vmdk has been lost,
possibly due to underlying storage issues. If this virtual machine is configured to be highly
available, ensure that the virtual machine is running on some other host before clicking OK.
• After you click OK, crash files are generated and the virtual machine powers off.
This problem occurs because NFSv3 and v4.1 are different protocols. These protocols behave
differently. After the grace period (array vendor-specific), the NFS server flushes the client state.
This behavior is expected in NFSv4 servers. Currently, no resolution or workaround is available.
For more information about this problem, see VMware knowledge base article 2089321 at
http://kb.vmware.com/kb/2089321.



Lab 6: Troubleshooting Storage Problems
Slide 5-48

Identify, diagnose, and resolve virtual storage problems


1. Run a Break Script
2. Verify That the System Is Not Functioning Properly
3. Troubleshoot and Repair the Problem
4. Verify That the Problem Is Repaired



Review of Learner Objectives
Slide 5-49

You should be able to meet the following objectives:


• Review multipathing
• Identify common causes of missing paths, including PDL and APD conditions
• Solve missing path problems between hosts and storage devices



Lesson 3: vSAN and Virtual Volumes
Slide 5-50

Lesson 3: vSAN and Virtual Volumes



Learner Objectives
Slide 5-51

By the end of this lesson, you should be able to meet the following
objectives:
• Become familiar with the various types of tools that are available to
troubleshoot VMware vSAN™
• Use the appropriate tools to identify, analyze, and quickly resolve common
configuration problems related to vSAN
• Describe vSphere Virtual Volumes
• Use vSphere Virtual Volumes Diagnostic Commands



Review of vSAN
Slide 5-52
vSAN is a hyper-converged, software-based storage solution.
vSAN aggregates locally attached disks of multiple hosts that are
members of a vSphere cluster to provide distributed shared storage to
virtual machines.
[Figure: virtual machines running on a vSAN datastore that spans 3-64 hosts, each contributing local SSD and HD/SSD devices.]

vSAN is entirely dependent on the proper functioning of its underlying
hardware components, such as network and storage devices.
The vSAN network must be configured correctly. Only one vSAN
VMkernel port can exist per subnet.
Because vSAN™ is a software-based storage product, it is entirely dependent on the proper
functioning of its underlying hardware components such as the network, the storage I/O controller,
and the storage devices. Because vSAN is an enterprise storage product, it can put an unusually
demanding load on supporting components and subsystems, exposing flaws and gaps that might not
be seen with simplistic testing or other, less-demanding use cases.
Most vSAN troubleshooting exercises involve determining whether or not the network is functioning
properly, or whether the vSAN VMware Compatibility Guide (VCG) has been rigorously followed.
Because vSAN uses the network to communicate between nodes, a properly configured and fully
functioning network is essential to operations. Many vSAN errors can be traced back to things like
improperly configured multicast, mismatched MTU sizes, and so on. vSAN requires more than
simple TCP/IP connectivity.
vSAN uses server-based storage components to recreate the functions normally found in enterprise
storage arrays. This architectural approach demands a rigorous discipline in sourcing and
maintaining the correct storage I/O controllers, disks, flash devices, device drivers and firmware, as
documented in the vSAN VCG. Failure to adhere to these guidelines often results in erratic
performance, excessive error messages, or both.
For more information about common vSAN problems and solutions, see VMware Virtual SAN
Diagnostics and Troubleshooting Reference Manual at http://www.vmware.com/files/pdf/products/
vsan/VSAN-Troubleshooting-Reference-Manual.pdf.



vSAN Troubleshooting Tools
Slide 5-53

A set of monitoring and troubleshooting tools is available:
• vSphere Web Client:
- Primary tool to configure and manage vSAN
• ESXCLI:
- For diagnostics and troubleshooting
• vSAN health check plug-in:
- For configuration, operation, and device health check
• Ruby vSphere Console (RVC):
- Ruby-based expandable management platform
- Command-line console UI
• vSAN Observer:
- Built on RVC, for performance troubleshooting

[Screenshot: RVC sample output. The vsan.whatif_host_failures command simulates one host failure and compares current resource usage (HDD capacity, components, RC reservations) with usage after the failure and re-protection. The vsan.resync_dashboard command reports the objects and bytes that remain to be synchronized.]

A set of tools is available, which enables you to monitor, diagnose, and troubleshoot vSAN:
• vSphere Web Client:
• vSphere Web Client is used to configure storage policies and monitor their compliance.
vSphere Web Client can also be used to inspect underlying disk devices and how these
devices are being used by vSAN.
• As a troubleshooting tool, vSphere Web Client can be configured to present specific alarms
and warnings associated with vSAN. vSphere Web Client also highlights certain network
misconfiguration issues and whether hardware components are functioning properly.
• Additionally, vSphere Web Client can provide at-a-glance overviews of individual virtual
machine performance and indicate whether vSAN is recovering from a failed hardware
component.
• vSphere Web Client is the logical place to start when diagnosing or troubleshooting a
suspected problem.
• Although vSphere Web Client does not include many of the low-level vSAN metrics and
counters, it has a pretty comprehensive set of virtual machine metrics. You can use the
performance charts in vSphere Web Client to examine the performance of individual virtual
machines running on vSAN.



• ESXCLI:
• Every ESXi host supports a direct console that can be used for limited administrative
purposes: starting and stopping the system, setting parameters and observing state. As such,
ESXCLI is an important tool for vSAN diagnostics and troubleshooting. ESXCLI works
through the use of individual namespaces that refer to different aspects of ESXi, including
vSAN. To see the options available to ESXCLI for monitoring and troubleshooting vSAN,
enter the esxcli vsan command at vSphere ESXi Shell. A list of commands appears.
• ESXCLI can only communicate with one host or ESXi instance at a time. To look at
cluster-wide information, the Ruby vSphere Console (RVC) should be used.
• You can display similar information by using the ESXCLI in many ways. In this course, the
focus is on the few best ways to get the information.
• vSAN health check plug-in:
• The vSAN health check plug-in examines hardware compatibility, networking
configuration and operations, advanced vSAN configuration options, storage device health,
and virtual machine object health.
• For more information about vSAN health check, see VMware Virtual SAN Health Check
Guide at https://www.vmware.com/files/pdf/products/vsan/VMware-Virtual-SAN-Health-
Check-Guide-6.1.pdf.
• Ruby vSphere Console (RVC):
• RVC is a Ruby-based expandable management platform from which you can use any API
that vCenter Server exposes.
• RVC can be described as a command-line console UI for ESXi hosts and vCenter Server.
• The vSphere inventory is presented as a virtual file system, allowing you to navigate and
run commands against managed entities using familiar shell syntax, such as cd to change
directory and ls to list directory (inventory) contents. RVC has been extended to provide a
wealth of useful information about vSAN health and status.
• RVC is included in both the Windows and Linux versions of vCenter Server.
• vSAN Observer:
• vSAN Observer is a monitoring and troubleshooting tool that is available with RVC, and
enables analysis of a vSAN cluster. vSAN Observer is launched from RVC. vSAN
Observer captures low-level metrics from vSAN and presents these metrics in a format that
is easily consumable through a Web browser. vSAN Observer is a tool that is typically
deployed to assist with monitoring and troubleshooting vSAN issues. vSAN Observer
monitors vSAN in real time.



• vSAN can be examined from the perspective of physical resources (CPU, memory,
disks), with a wealth of different metrics, or from a virtual machine perspective,
which shows the resources that a virtual machine consumes and whether it contends
with other virtual machines for resources.
• Third-party tools:
• Because vSAN uses third-party storage I/O controllers and flash devices, you might need
to use specific tools provided by these vendors to configure and check status.



vSAN Disk Query
Slide 5-54

To query the disks on your ESXi host, you can run the vdq -q command.
The output displays useful information, such as:
• Disk device name
• vSAN node UUID
• The state of the disk (whether or not it can be used by vSAN or if it is already in use)
• Reason
• Disk type: SSD or HDD
• Device state: PDL or not
The vdq command is useful to see whether the SSD is being used for cache or for capacity.

[Screenshot: output of vdq -q on host esxi01, listing two disks. Disk eui.68162383505096af has the state Ineligible for use by VSAN and the reason Has partitions. Disk eui.b0465b2d00000000 has the state Ineligible for use by VSAN and the reason Non-local disk. Each entry also shows the VSANUUID, ChecksumSupport, IsSSD, IsCapacityFlash, and IsPDL fields.]
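When a host has many disks, a short filter makes the vdq -q output easier to scan. The following is a minimal sketch that assumes each disk entry lists the quoted "Name", "State", and "Reason" keys in that order, one per line. The function name vdq_summary is hypothetical.

```shell
# Print one summary line per disk: "<name>: <state> (<reason>)".
# Input (stdin): output of "vdq -q".
# Assumption: each entry has "Name", "State", and "Reason" lines,
# in that order, with the key and value in double quotation marks.
vdq_summary() {
    awk -F'"' '
        $2 == "Name"   { name = $4 }
        $2 == "State"  { state = $4 }
        $2 == "Reason" { print name ": " state " (" $4 ")" }
    '
}
```

For example, vdq -q | vdq_summary shows at a glance which disks are ineligible for vSAN and why.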



vSAN Problem 1
Slide 5-55

A vSAN cluster fails to form correctly. Members of the cluster cannot
communicate with each other.

[Screenshot: vSAN cluster settings showing 3 hosts, SSD disks in use 3 of 3, data disks in use 18 of 18, total capacity of the vSAN datastore 2.73 TB, free capacity 2.72 TB, and Network status: Misconfiguration detected. The help text reads: One or more hosts cannot communicate with the vSAN datastore. Each host requires at least one VMkernel adapter with the vSAN service enabled. All those adapters need to be connected to the same physical network to ensure correct communication with the vSAN datastore. To view the network partition groups, check the respective column in the Disk Management grid.]

Solution:
• Ensure that the physical switch and the ports used for vSAN are active and
have multicast enabled.
• Validate the virtual switch configuration for correct uplink, VLAN, and NIC team
failover policy, and verify that the vSAN traffic service is enabled on the VMkernel
interfaces.

In this scenario, one or more members of the vSAN cluster are in different network groups. vSAN
fails to form correctly because the members cannot communicate with each other.
To solve the problem, ensure that the physical switch and the ports used for vSAN are active and
have multicast enabled. Enabling multicast can be done in one of two ways on your physical
switches:
• Disable IGMP snooping
• Configure IGMP snooping for selective traffic
You must also validate the virtual switch configuration for correct uplink, VLAN, and NIC team
failover policy, and verify that the vSAN traffic service is enabled on the VMkernel interfaces.
vSAN requires a VMkernel network interface with the vSAN traffic enabled. All members of the
cluster must communicate on the same Layer 2 network segment with multicast enabled, and all
members of the cluster should be able to ping each other. Failing to meet this requirement prevents
vSAN from being successfully configured, because hosts are prevented from communicating.
You can use vSphere Web Client to verify the vSAN configuration. You can use the vmkping
command to validate the vSAN network accessibility. You can also use the esxcli vsan
network namespace to examine and modify the vSAN network configurations.
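The network checks can be combined into a small test plan. The following is a minimal sketch that assumes esxcli vsan network list prints an indented VmkNic Name: line for each vSAN-enabled VMkernel interface. The function names and the peer addresses in the usage example are hypothetical. The sketch only prints the vmkping commands and does not run them.

```shell
# Print the VMkernel interfaces that carry vSAN traffic.
# Input (stdin): output of "esxcli vsan network list".
# Assumption: each interface has an indented "VmkNic Name:" line.
vsan_vmknics() {
    awk -F': *' '/^ *VmkNic Name:/ { print $2 }'
}

# Print one vmkping command per interface and peer address.
# Input (stdin): interface names, one per line.
#   $@ = vSAN IP addresses of the other cluster members
vmkping_plan() {
    while read -r nic; do
        for peer in "$@"; do
            echo "vmkping -I $nic $peer"
        done
    done
}
```

For example, esxcli vsan network list | vsan_vmknics | vmkping_plan 192.168.10.12 192.168.10.13 prints the commands to run on this host. If no interface is printed, vSAN traffic is not enabled on any VMkernel interface of the host.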



vSAN Problem 2
Slide 5-56

Automatic disk claiming operation fails to claim disks.


Analysis:
• For disks to be claimed automatically, they must be flagged as local by ESXi.
• Many SAS controllers allow disks to be shared. If ESXi determines that the
disks are shared, it does not report them as local.
• Some disks that are reported as shared are not actually shared. You must
mark such disks as local. This applies to both HDD and SSD.
Solution:
• Manually create the disk groups and ensure that the disks are flagged as local.

In this scenario, you can use the esxcli storage core namespace before or after enabling vSAN
to examine whether the disks are flagged as local, by checking the Is Local attribute. Marking a
disk as local can also be done from vSphere Web Client using the disk management dashboard. For
more information about the configuration steps to mark a storage device as local, see vSphere
Storage Guide at https://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-6-pubs.html.
With RVC, you can use vsan.disks_info to gather detailed disk capabilities and characteristics,
such as size, disk type, manufacturer, and model, and to identify whether the disks are flagged as
local or non-local.
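The Is Local check can be sketched in Python as follows. This is a minimal illustration, assuming the text comes from esxcli storage core device list and follows the usual layout of an unindented device name followed by indented "Key: value" lines. The function name and the abbreviated sample text are hypothetical and are not output from a real host.

```python
def find_nonlocal_devices(esxcli_output):
    """Return device names whose 'Is Local' attribute is not true.

    Assumes an unindented device name followed by indented
    'Key: value' lines for each device.
    """
    nonlocal_devices = []
    current = None
    for line in esxcli_output.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            current = line.strip()  # start of a new device block
        elif current and line.strip().startswith("Is Local:"):
            value = line.split(":", 1)[1].strip().lower()
            if value != "true":
                nonlocal_devices.append(current)
    return nonlocal_devices


# Abbreviated, made-up sample for illustration only.
SAMPLE = """\
naa.5000000000000001
   Display Name: Example SAS Disk 1
   Is Local: false
   Is SSD: false
naa.5000000000000002
   Display Name: Example SAS Disk 2
   Is Local: true
   Is SSD: true
"""

print(find_nonlocal_devices(SAMPLE))
```

Devices returned by this check are candidates for being marked as local before automatic disk claiming is retried.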



vSAN Problem 3
Slide 5-57

After manually selecting a desired group of HDDs and an SSD for the
creation of a disk group, the operation is successfully completed but disk
groups are not created.

[Screenshot: Summary tab of the VSAN-West cluster, with the warning "VSAN datastore vsanDatastore in cluster VSAN-West in datacenter SDDC-West does not have capacity".]

Causes and solutions:
• If vSAN is not licensed correctly, you can assign a vSAN license to the cluster
using vSphere Web Client.
• If the vSphere Web Client refresh timer has timed out, you can log out of
vSphere Web Client and log back in.

In this scenario, vSphere Web Client indicates that the vSAN datastore in the cluster does not have
capacity.
The following issues can cause this behavior:
• vSAN might not be licensed correctly. The vSAN feature is not automatically added to the
cluster when a vSAN license is added to the vCenter Server license catalog. You can assign a
vSAN license to the cluster enabled with vSAN by using vSphere Web Client.
• The vSphere Web Client refresh timer timed out. Depending on the number of disk groups and
the number of disks, the completion of the operation can take some time. You can log out and
log back in to vSphere Web Client.



vSAN Problem 4
Slide 5-58

You cannot delete disk groups from the vSphere Web Client user
interface.
[Screenshot: Disk Groups view listing three hosts. Each host has one disk group with 7 of 7 disks in use, and each host is Connected, Healthy, and in network partition group 1.]

Cause and solution:


• You cannot delete disk groups as a result of vSAN disk claiming operation
being set to automatic.
• Using vSphere Web Client, modify the disk claiming operation by changing it to
manual.



vSAN Problem 5
Slide 5-59

Too much multicast traffic for multiple vSAN clusters causes slowness.
Cause:
• If multiple vSAN clusters exist on the same Layer 2 network, each host
receives all multicast messages.
Solution:
• To reduce the amount of multicast traffic for each vSAN cluster, change the
multicast address for each vSAN cluster.
• Changing the multicast address on an active vSAN cluster can lead to network
partitioning until all of the ESXi hosts in the cluster are on the same multicast
network. VMware recommends that you schedule downtime before making this
change.

For information about how to change the multicast address on an ESXi host configured for vSAN,
see VMware knowledge base article 2075451 at http://kb.vmware.com/kb/2075451.



Review of vSphere Virtual Volumes
Slide 5-60

Here is a brief review of vSphere Virtual Volumes:
• No LUNs or NFS shares.
• Set up a single I/O access point called a protocol endpoint to set up a data path from
virtual machines to virtual volumes (VMDKs).
• Set up a logical entity called a storage container to group virtual volumes (VMDKs) for
easy management.

[Diagram: vSphere 6.x hosts above three storage containers.]

vSphere Virtual Volumes simplifies storage management in the following ways:


• Administrators do not need to configure LUNs or NFS shares.
• An administrator must create a single I/O access point called a protocol endpoint (PE):
• Virtual volumes (VMDKs) are bound and unbound to the PE by vSphere.
• An administrator defines a pool of storage capacity with associated capabilities, called a storage container (SC).
Storage containers can be of any size, are dynamic, and can span an entire array.



vSphere Virtual Volume Object Types
Slide 5-61

Each of the five vSphere Virtual Volume object types maps to a specific type of virtual
machine file:
• Config-VVol
- Virtual machine Home
- Configuration files
- Logs
• Data
- Equivalent to a VMDK
• Memory
- Snapshots
• SWAP
- Virtual machine memory swap
• Other
- vSphere solution specific object
Multiple vSphere Virtual Volume objects together represent a virtual machine
object.



About Protocol Endpoints
Slide 5-62

A protocol endpoint has the following features:


• Set up by the storage administrator.
• Part of the physical storage hardware:
- Treated like a proxy
• Supports typical SCSI and NFS commands.
• Virtual volumes (VMDKs) are bound and unbound to a protocol endpoint:
- ESXi or vCenter Server initiates the bind or unbind operation.
• Existing multipathing policies and NFS topology requirements can be applied.

[Diagram: virtual machines on vSphere 6.x above storage containers that hold virtual volumes.]

A PE is created by the storage administrator to define a single I/O access point. A PE is treated like
a LUN and handles industry-standard protocols, such as iSCSI. A PE creates a single configuration
regardless of the protocol used.
Virtual volumes are bound to a PE by using the bind/unbind commands that are initiated by ESXi
hosts and vCenter Server instances. PEs should be configured in a high availability environment so
that a single point of failure does not exist. Each PE configuration rests on the array and is
considered to be part of the physical storage fabric.



About Storage Containers
Slide 5-63

A storage container has the following features:


• Configured by the storage administrator.
• Logical grouping of virtual volumes (VMDKs).
• Storage container capacity limited only by hardware capacity.
• Must set up at least one storage container per storage system.
• Can have multiple storage containers per array.
• Assign capabilities to a storage container.

An SC is created by the storage administrator to consolidate the management of multiple virtual
volumes (VMDKs). At least one SC must be set up per storage system, and many SCs are allowed per
array. In the traditional datastore approach, one-to-one correlation existed between the datastore and
a LUN. The LUN had a fixed limit, whereas an SC does not. An SC can equal the overall capacity
of the array and can be dynamically adjusted.
An SC represents the following:
• An SC is the logical representation of the underlying hardware. An array can be represented by
a single SC, or might be represented by multiple SCs that equal the size of the array.
• An SC is a logical grouping of virtual volumes (VMDKs) that allows the storage administrator
to focus on higher level management versus per-virtual machine management:
• Replication, cloning, backup, and other data services are handled by the array at the SC
level
Each storage container is configured with a set of capabilities that are matched to the policy
requirements that are associated with provisioned virtual machines. SC capabilities allow the
definition of storage tiers that can be finely tuned to application requirements and customer needs.



Bidirectional Discovery Process
Slide 5-64

Protocol endpoint and storage container objects are discovered by ESXi
hosts and vCenter Server.

Protocol Endpoint
• Storage administrator sets up a protocol endpoint.
• ESXi discovers the protocol endpoint during a scan.
• vSphere API for Storage Awareness is used to bind virtual volumes to the protocol endpoint.

Storage Container
• Storage administrator sets up a storage container of defined capacity and capability.
• VASA provider discovers the storage container and reports to vCenter Server.
• Any new virtual volumes are created in a storage container.

The PE and SC columns in the illustration show how PE and SC objects are discovered by ESXi
hosts and vCenter Server.



Troubleshooting vSphere Virtual Volumes
Slide 5-65

vSphere Virtual Volumes and esxcli Commands.

Each entry gives the namespace and command option, followed by the description:
• esxcli storage core device list: Identify protocol endpoints. The output
entry Is VVOL PE: true indicates that the storage device is a protocol endpoint.
• esxcli storage vvol daemon unbindall: Unbind all virtual volumes from all VASA
providers known to the ESXi host.
• esxcli storage vvol protocolendpoint list: List all protocol endpoints that your host
can access.
• esxcli storage vvol storagecontainer list: List all available storage containers.
• esxcli storage vvol storagecontainer abandonedvvol scan: Scan the specified storage
container for abandoned virtual volumes.
• esxcli storage vvol vasacontext get: Show the VASA context (VC UUID)
associated with the host.
• esxcli storage vvol vasaprovider list: List all storage (VASA) providers
associated with the host.



vSphere Virtual Volumes Problem 1
Slide 5-66

Virtual Datastore is Inaccessible.


Cause:
• This problem can occur when you fail to configure protocol endpoints for the
SCSI-based storage container that is mapped to the virtual datastore.
• SCSI protocol endpoints need to be configured so that an ESXi host can detect
them.
Solution:
• Before creating virtual datastores for SCSI-based containers, make sure to
configure protocol endpoints on the storage side.



vSphere Virtual Volumes Problem 2
Slide 5-67

Failures occur when migrating virtual machines to vSphere Virtual
Volumes datastores.
Cause:
• The configuration virtual volume, or config-VVol, contains various VM-related
files. On traditional nonvirtual datastores, these files are stored in the VM home
directory. Similar to the VM home directory, the config-VVol typically includes
the VM configuration file, virtual disk and snapshot descriptor files, log files,
lock files, and so on.
• On virtual datastores, all other large-sized files, such as virtual disks, memory
snapshots, swap, and digest, are stored as separate virtual volumes.
Solution:
• Before migrating a virtual machine from a traditional datastore to a virtual
datastore, remove excess content from the virtual machine home directory so that
the config-VVol stays under its 4-GB limit.
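The size check before such a migration can be sketched in Python as follows. This is only an illustration under stated assumptions: the suffix list used to recognize large files that become separate virtual volumes is a simplification, the 4-GB limit is treated as 4 GiB, the 100 MiB reporting threshold is arbitrary, and all file names and sizes are hypothetical.

```python
CONFIG_VVOL_LIMIT_BYTES = 4 * 1024**3  # 4-GB limit, treated here as 4 GiB (assumption)

# Assumed suffixes of large files that become separate virtual volumes
# (virtual disk data, swap, memory snapshot data). Adjust as needed.
SEPARATE_VVOL_SUFFIXES = ("-flat.vmdk", "-delta.vmdk", "-sesparse.vmdk", ".vswp", ".vmem")


def config_vvol_usage(file_sizes):
    """Sum the sizes of files expected to land in the config-VVol."""
    return sum(
        size for name, size in file_sizes.items()
        if not name.endswith(SEPARATE_VVOL_SUFFIXES)
    )


def excess_files(file_sizes, threshold_bytes=100 * 1024**2):
    """List config-VVol candidates at or above the threshold, largest first."""
    big = [
        (name, size) for name, size in file_sizes.items()
        if not name.endswith(SEPARATE_VVOL_SUFFIXES) and size >= threshold_bytes
    ]
    return sorted(big, key=lambda item: item[1], reverse=True)


# Hypothetical VM home directory listing (name -> size in bytes).
home = {
    "web01.vmx": 4 * 1024,
    "web01.vmdk": 1024,               # descriptor, stays in the config-VVol
    "web01-flat.vmdk": 40 * 1024**3,  # disk data, separate virtual volume
    "vmware.log": 2 * 1024**2,
    "install.iso": 5 * 1024**3,       # excess content left in the home directory
}

usage = config_vvol_usage(home)
print(usage > CONFIG_VVOL_LIMIT_BYTES)  # True while install.iso is present
print(excess_files(home))
```

In this made-up listing, removing install.iso brings the estimated config-VVol content back under the limit.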



vSphere Virtual Volumes Problem 3
Slide 5-68

Failures occur when deploying virtual machine OVFs to vSphere Virtual
Volumes datastores.
Cause:
• Config-VVols are created as 4-GB virtual volumes. Generic content of the
config-VVol usually consumes only a fraction of this 4-GB allocation, so config-
VVols are typically thin-provisioned to conserve backing space. Any additional
large files, such as ISO disk images, DVD images, and image files, might
cause the config-VVol to exceed its 4-GB limit. If such files are included in an
OVF template, deployment of the VM OVF to vSphere Virtual Volumes storage
fails.
Solution:
• You cannot deploy an OVF template that contains excess files directly to a
virtual datastore. First deploy the VM to a nonvirtual datastore. Remove any
excess content from the VM home directory, and migrate the resulting VM to
vSphere Virtual Volumes storage.



vSphere Virtual Volumes Problem 4
Slide 5-69

Migrating virtual machines with memory snapshots can fail.


Cause:
• Virtual machines must be hardware version 11 or higher to migrate to a
vSphere Virtual Volume if memory snapshots are present.
• Virtual machines of hardware version 11 or later use separate files to store
their memory snapshots.
• Virtual machines on vSphere Virtual Volumes storage should have memory
snapshots created as separate virtual volumes instead of being stored as part
of a .vmsn file in the VM home directory.
• Virtual machines with hardware version 10 and earlier store their memory
snapshots as part of the .vmsn file in the VM home directory.
Solution:
• Best practice is to remove all memory snapshots on virtual machines of
hardware version 10 and earlier before attempting a storage migration
involving vSphere Virtual Volumes.
• Virtual machines with hardware version 11 or higher migrate with no problems.



Review of Learner Objectives
Slide 5-70

You should be able to meet the following objectives:


• Become familiar with the various types of tools that are available to
troubleshoot vSAN
• Use the appropriate tools to identify, analyze, and quickly resolve common
configuration problems related to vSAN
• Describe vSphere Virtual Volumes
• Use vSphere Virtual Volumes Diagnostic Commands



Key Points
Slide 5-71

• As a first troubleshooting step, if you do not see any paths to your LUNs,
perform a rescan of the troubled adapter to try to restore LUN visibility.
• For iSCSI connectivity issues, verify that the VMkernel port bindings are
configured properly.
• A storage device is considered to be in a PDL state when it becomes
permanently unavailable to the ESXi host.
• An APD condition occurs when a storage device becomes unavailable to your
ESXi host for an unspecified amount of time.
• Because vSAN is a software-based storage product, it is entirely dependent on
the proper functioning of its underlying hardware components.
• When troubleshooting vSAN, ensure that the network is functioning properly
and the vSAN compatibility guidelines have been closely followed.
• vSphere Virtual Volumes is a set of different vSphere Virtual Volume object
types that together function as a virtual machine.
• vSphere Virtual Volumes is based on storage policies.
Questions?



MODULE 6
Troubleshooting vSphere Clusters
Slide 6-1

Module 6

You Are Here
Slide 6-2

1. Course Introduction
2. Introduction to Troubleshooting
3. Troubleshooting Tools
4. Troubleshooting Virtual Networking
5. Troubleshooting Storage
6. Troubleshooting vSphere Clusters
7. Troubleshooting Virtual Machines
8. Troubleshooting vCenter Server and ESXi



Importance
Slide 6-3

As you scale your vSphere environment, you must be aware of the
vSphere features and functions that help you manage the hosts in your
environment.
The vSphere administrator must be able to identify and troubleshoot
vSphere HA, vSphere vMotion, and VMware vSphere® Storage DRS™
problems to keep the cluster running properly and continuously.



Learner Objectives
Slide 6-4

By the end of this module, you should be able to meet the following
objectives:
• Identify and troubleshoot vSphere HA problems
• Analyze and solve vSphere vMotion problems
• Diagnose and troubleshoot common vSphere DRS problems



Review of vSphere HA
Slide 6-5

A reliable network connection between the hosts and a vCenter Server
system is essential for enabling vSphere HA.

[Diagram: cluster hosts connected to heartbeat datastores, a primary heartbeat network (management network or vSAN network), a redundant heartbeat network, and the vCenter Server system.]


When vSphere HA is enabled, the Fault Domain Manager (FDM) Agent service starts on the
member hosts. After the FDM agents have started, the cluster hosts are in a fault domain. Hosts
cannot participate in a fault domain if they are in maintenance mode or standby mode. A host can be
in only one fault domain at a time.
The fault domain is managed by a master host. All other hosts are called slave hosts. To determine
which host will be the master, an election process takes place.
The master host sends periodic heartbeats to the slave hosts so that the slave hosts know that the
master host is alive. Heartbeats are sent to each slave host from the master host over all configured
heartbeat networks. The heartbeat network is the management network, except where the vSphere
HA cluster is already configured as a vSAN cluster, in which case the virtual SAN network will be
used as the heartbeat network. Slave hosts use only one heartbeat network to communicate with the
master. If the primary heartbeat network fails, the slave host switches to another VMkernel interface
to communicate with the master host over the redundant heartbeat network.
When the master host in a vSphere HA cluster cannot communicate with a slave host over the
heartbeat network, the master host uses datastore heartbeating to determine whether the slave host
has failed, is in a network partition, or is network isolated. vCenter Server selects a preferred set of
datastores for heartbeating. This selection is made to maximize the number of hosts that have access
to a heartbeating datastore and minimize the likelihood that the datastores are backed by the same
LUN or NFS server.



vSphere HA Problem 1
Slide 6-6

If vSphere HA fails to enable, try the following:
• Disable and reenable or reconfigure vSphere HA on the cluster:
- This step resolves most problems.
• Check the FDM logs on the ESXi hosts and search for error messages.
• Check the release notes for known problems with vSphere HA:
- Verify that you are using the latest version of vSphere.

Increase the verbosity of the FDM logs to collect more information about the cause of the issue.
Search the FDM log files for error messages:
• FDM operations log: /var/log/fdm.log or /var/run/log/fdm* (one log file for FDM
operations)
• FDM agent installation log: /var/log/fdm-installer.log
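Searching a log for error messages can be sketched in Python as follows. This is a minimal example, assuming that a case-insensitive match on "error", "fail", "failed", or "failure" is a useful first filter. The sample lines are placeholders and do not reproduce the real FDM log format.

```python
import re

# Words treated as signs of a problem (an assumption, not an FDM convention).
PATTERN = re.compile(r"\b(error|fail(ed|ure)?)\b", re.IGNORECASE)


def find_error_lines(lines):
    """Return (line_number, text) pairs for lines that mention an error or failure."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PATTERN.search(line)
    ]


# Placeholder lines for illustration only, not real FDM log output.
SAMPLE_LINES = [
    "2017-06-01T10:00:00Z info placeholder: agent started\n",
    "2017-06-01T10:00:05Z error placeholder: example error text\n",
    "2017-06-01T10:00:09Z info placeholder: heartbeat sent\n",
]

for number, text in find_error_lines(SAMPLE_LINES):
    print(number, text)

# On a host, the same function could be applied to an open log file, for example:
#     with open("/var/log/fdm.log") as log:
#         matches = find_error_lines(log)
```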



Identifying Possible Causes
Slide 6-7

If the ESXi hosts in the cluster are operating properly, take a top-down
approach to troubleshooting. Ensure that vSphere HA is configured
properly in vCenter Server.

Possible Causes

vCenter Server
• vSphere HA is not properly configured.
• The datastore for vSphere HA heartbeats is not accessible by all hosts.

ESXi Host
• The FDM agent was not installed properly on the ESXi host.
• Network connectivity is lost between the ESXi host and vCenter Server.
• The ESXi host is not properly connected to vCenter Server.

If communication between your ESXi hosts and vCenter Server is working properly, then start your
troubleshooting with the vSphere HA configuration. Verify that all the cluster settings are correct.
For information about troubleshooting FDM problems, see VMware knowledge base article
2004429 at http://kb.vmware.com/kb/2004429.



Possible Cause: Improper Configuration of vSphere HA
Slide 6-8

vSphere HA might not be enabled properly because of improper
configuration.
Ensure that you adhere to the vSphere HA cluster requirements:
• All hosts must be configured with static IP addresses:
- If you are using DHCP, ensure that the address for each host persists across reboots.
• All hosts must have at least one heartbeat network (management network or
vSAN network if vSAN was first enabled on the cluster) in common:
- The best practice is to have at least two management networks in common.
• To ensure that any virtual machine can run on any host in the cluster, all hosts
should have access to the same virtual machine networks and datastores.
• At least two hosts must exist in the cluster.
• For virtual machine monitoring to work, VMware Tools™ must be installed.
• All hosts must be licensed for vSphere HA.
• vSphere HA supports IPv4 and IPv6. However, a cluster that mixes the use of
both of these protocol versions is more likely to result in a network partition.

The vSphere HA checklist shown in the slide contains requirements that you must be aware of
before creating and using a vSphere HA cluster.
For more information about vSphere HA requirements, see vSphere Availability Guide at
http://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-pubs.html.



Possible Cause: Heartbeat Datastore Inaccessible
Slide 6-9

View vSphere HA heartbeat information in vSphere Web Client to verify
that the heartbeat datastore is accessible to all hosts.
To resolve an inaccessible heartbeat datastore, modify the cluster to
select a heartbeat datastore accessible by all hosts.

[Screenshot: vSphere HA Heartbeat view for the cluster, listing the datastores used for heartbeating (NFS01 and Shared) and the hosts mounting the selected datastore (sa-esxi-01.vclass.local and sa-esxi-02.vclass.local).]

VMware recommends two heartbeat datastores.

vCenter Server automatically selects a preferred set of datastores for heartbeating. This selection is
made with the goal of maximizing the number of hosts that have access to a given datastore and
minimizing the likelihood that the selected datastores are backed by the same storage array or NFS
server. In most cases, this selection should not be changed.
Verify that you have LUN connectivity. The heartbeat datastore might be inaccessible because of a
storage failure, such as all paths down or permanent device loss condition.
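Finding a heartbeat datastore that every host can access amounts to intersecting the sets of datastores mounted on each host. The following Python sketch illustrates the idea. The host and datastore names reuse the ones in the slide, the Local01 and Local02 datastores are hypothetical, and in practice the mount information would come from your own inventory data.

```python
def datastores_common_to_all(host_mounts):
    """Return the datastores mounted by every host, sorted by name."""
    mounted_sets = [set(datastores) for datastores in host_mounts.values()]
    if not mounted_sets:
        return []
    return sorted(set.intersection(*mounted_sets))


def hosts_missing(host_mounts, datastore):
    """Return the hosts that do not mount the given datastore, sorted by name."""
    return sorted(host for host, mounted in host_mounts.items() if datastore not in mounted)


# Illustrative mount information (host -> datastores mounted on that host).
mounts = {
    "sa-esxi-01.vclass.local": ["NFS01", "Shared", "Local01"],
    "sa-esxi-02.vclass.local": ["NFS01", "Shared", "Local02"],
}

print(datastores_common_to_all(mounts))
print(hosts_missing(mounts, "Local01"))
```

Only the datastores returned by the first call are suitable candidates when you select heartbeat datastores manually.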



Possible Cause: Failure to Install FDM Agent on ESXi Host (1)
Slide 6-10

If the FDM agent cannot be installed on the ESXi host, perform the
following checks:
• Check the FDM log for errors:
/var/log/fdm.log
• Verify that sufficient network bandwidth exists between the ESXi host and
vCenter Server.
• If network traffic throughput is causing the issue, determine whether a
hardware problem exists.
• Verify that there is disk space available on the ESXi host in /root.

If you determine that network traffic throughput is causing the problem, determine the cause of the
network issues:
• Hardware problem with the NIC
• Mismatch between NIC speed and switch speed
• Hardware or firewall is blocking or slowing traffic
Verify that the FDM agent files were pushed to the ESXi host:
• clusterconfig: Cluster configuration
• hostlist: Host membership list
• fdm.cfg: FDM configuration file
• vmmetadata: List of all virtual machines and the hosts that they are compatible with



Possible Cause: Failure to Install FDM Agent on ESXi Host (2)
Slide 6-11

Verify that all FDM agent files were pushed from vCenter Server to the
ESXi host:
• On the ESXi host, check for the following agent files in the
/etc/opt/vmware/fdm directory:
- clusterconfig
- fdm.cfg
- hostlist
- vmmetadata
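The file check above can be sketched in Python as follows. This is an illustration only: the directory and file names come from the list above, the function name is hypothetical, and the demonstration runs against a temporary directory so that it does not depend on a real ESXi host.

```python
import os
import tempfile

FDM_DIR = "/etc/opt/vmware/fdm"
EXPECTED_FILES = ("clusterconfig", "fdm.cfg", "hostlist", "vmmetadata")


def missing_fdm_files(directory=FDM_DIR):
    """Return the expected FDM agent files that are absent from the directory."""
    return [
        name for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(directory, name))
    ]


# Demonstration against a temporary directory that holds only two of the files.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("clusterconfig", "hostlist"):
        open(os.path.join(demo_dir, name), "w").close()
    demo_missing = missing_fdm_files(demo_dir)

print(demo_missing)
```

An empty result means that all four agent files are present in the directory that was checked.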



Possible Cause: Loss of Network Connectivity
Slide 6-12

Loss of network connectivity between the ESXi host and vCenter Server
might prevent vSphere HA from being enabled.
Test connectivity from the ESXi host:
ping vc01.vclass.local
If the ping fails, troubleshoot ESXi host network connectivity issues.
If the ESXi host is not properly connected to vCenter Server:
• Check the ESXi host's connection status in the vCenter Server hierarchy.
• If the host is disconnected, reconnect the ESXi host to vCenter Server.

If you suspect ESXi host network connectivity problems, use the troubleshooting information for
virtual networks.



vSphere HA Problem 2
Slide 6-13

The problem occurs when you try to power on a virtual machine that is
part of a vSphere HA cluster with insufficient failover resources.

[Screenshot: Power On Failures dialog box. vCenter Server was unable to find a suitable host to power on virtual machine Win02-C. The description reads "Insufficient resources to satisfy configured failover level for vSphere HA."]

The error message indicates that the operation performed violates the configured failover level of
the vSphere HA cluster.
This problem can arise when hosts in the cluster are disconnected, in maintenance mode, or not
responding, or when they have a vSphere HA error. Hosts are typically disconnected or in
maintenance mode because of user action. Hosts that are unresponsive or in an error state usually
indicate a more serious problem, for example, hosts or agents have failed or a networking problem exists.
Another possible cause of this problem is that your cluster contains virtual machines that have much
larger memory or CPU reservations than the others. As a virtual machine with high reservation
values starts up, slot calculation is affected and fewer slots are available. The Host Failures Cluster
Tolerates admission control policy is based on the calculation of a slot size that includes two
components: the CPU and memory reservations of a virtual machine. If the calculation of this slot
size is skewed by outlier virtual machines, the admission control policy can become restrictive and
result in virtual machines being unable to start. Check the CPU and Memory reservations of the
virtual machine. Either change the admission control policy to reserve fewer resources for failover
or add more hosts to the cluster.



Identifying Possible Causes
Slide 6-14

Excessive virtual machine reservations or insufficient resources in the
cluster can cause insufficient failover capacity for vSphere HA.

Possible Causes

vSphere HA
• vSphere HA admission control policy is not configured correctly.

Virtual Machine
• One or more of the virtual machines have excessive reservations.

ESXi Host
• The cluster has insufficient physical resources.

Use the bottom-up approach to troubleshoot this problem. Start by verifying the configuration of the
physical resources of the cluster. Then verify that virtual machine reservations are reasonable
and not excessive. Finally, verify that the vSphere HA admission control policy is configured to
protect failover capacity.



Possible Cause: Insufficient Physical Resources
Slide 6-15

A cluster with insufficient physical resources might cause the error
message.
You can use various strategies to resolve the resource problem:
• Add more physical resources to the cluster:
- Add ESXi hosts to the cluster (maximum of 64 hosts per cluster).
- Add CPU and memory resources to one or more hosts in the cluster.
• Build a new vSphere HA cluster:
- This option is reasonable if you are deploying small clusters of homogeneous ESXi
hosts. This option avoids CPU incompatibility issues.

The goal of the vSphere HA Admission Control policy is to ensure that a certain proportion of
cluster resources are never allocated to virtual machine reservations so that if one or multiple hosts
fail, sufficient unreserved resources are available in the cluster to power on the failed virtual
machines.
The best, long-term solution for resolving insufficient failover capacity in a vSphere HA cluster is to
add physical resources. Otherwise, you might be forced to compromise the spare cluster capacity
that the vSphere HA admission control policy is trying to safeguard.



Bandwidth Reservation
Slide 6-16

When a host failure or isolation (depending on the isolation settings)
occurs, vSphere HA powers on the virtual machines on other hosts in the
cluster, with respect to each virtual machine's admission control policies
and bandwidth reservation.
If a virtual machine cannot start because the bandwidth reservation
cannot be met, information about the failure is available in the UI and log
files:
NIOCES . etherswitch : NIOCES_UpdateNIOCVnicinfo: Fail to
reserve bandwidth for the port

The network reservation feature is part of VMware vSphere® Network
I/O Control.



Possible Cause: Excessive Virtual Machine Reservations (1)
Slide 6-17

Some virtual machines in a cluster might have much larger memory or
CPU reservations than others.

[Screenshot: Resource Reservation view (Memory) for Lab Cluster. Cluster total capacity is 12.00 GB, total reservation capacity is 4.33 GB, used reservation is 2.41 GB, and available reservation is 1.92 GB. The virtual machine list includes reservations of 4096 MB and 6144 MB. An error dialog box reports that the "Power on virtual machine" operation failed with the message "Insufficient resources to satisfy configured failover level for vSphere HA."]

vCenter Server uses admission control to ensure that sufficient resources in a vSphere HA cluster
are reserved for virtual machine recovery if host failure occurs.
Some clusters might contain virtual machines that have much larger memory or CPU reservations than
other virtual machines. The Host Failures Cluster Tolerates admission control policy is based on the
calculation of a slot size that includes two components: the CPU and memory reservations of a virtual
machine. If the calculation of this slot size is skewed by outlier virtual machines, the admission
control policy can become too restrictive and result in the inability to power on virtual machines.
Verify that none of the virtual machines have excessive reservations. If one or a few machines have
a much larger reservation than other machines in a cluster, the calculation will be skewed.
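The effect of an outlier on the slot calculation can be sketched in Python as follows. This is a simplified model, not the exact vSphere HA algorithm: it takes the largest CPU and memory reservations as the slot size, assumes a 32 MHz value for virtual machines without a CPU reservation, ignores memory overhead, and sets aside the hosts with the most slots as failover capacity. All host and virtual machine figures are hypothetical.

```python
def slot_size(vms, default_cpu_mhz=32, min_mem_mb=1):
    """Return (cpu_slot_mhz, mem_slot_mb) from the largest reservations.

    Simplifications: memory overhead is ignored, and min_mem_mb is a
    hypothetical floor that only prevents division by zero.
    """
    cpu = max(((vm["cpu_mhz"] or default_cpu_mhz) for vm in vms), default=default_cpu_mhz)
    mem = max((max(vm["mem_mb"], min_mem_mb) for vm in vms), default=min_mem_mb)
    return cpu, mem


def slots_per_host(host, cpu_slot, mem_slot):
    """A host holds as many slots as its scarcer resource allows."""
    return min(host["cpu_mhz"] // cpu_slot, host["mem_mb"] // mem_slot)


def usable_slots(hosts, cpu_slot, mem_slot, host_failures=1):
    """Slots left after setting aside the hosts with the most slots for failover."""
    per_host = sorted(slots_per_host(h, cpu_slot, mem_slot) for h in hosts)
    keep = max(0, len(per_host) - host_failures)
    return sum(per_host[:keep])


# Hypothetical cluster: three identical hosts, five typical VMs, one outlier.
hosts = [{"cpu_mhz": 9000, "mem_mb": 24576} for _ in range(3)]
typical = [{"cpu_mhz": 500, "mem_mb": 1024} for _ in range(5)]
outlier = {"cpu_mhz": 2000, "mem_mb": 8192}

before = usable_slots(hosts, *slot_size(typical))
after = usable_slots(hosts, *slot_size(typical + [outlier]))
print(before, after)  # 36 usable slots shrink to 6 once the outlier is powered on
```

In this made-up cluster, a single virtual machine with large reservations reduces the usable slots from 36 to 6, which is how one outlier can block later power-on operations.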
For more information about various causes of the insufficient failover resource problems, see
vSphere Troubleshooting at http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-troubleshooting-guide.pdf.



Possible Cause: Excessive Virtual Machine Reservations (2)
Slide 6-18

If one or more virtual machines have excessive reservations, errors can
occur:
• View the Resource Allocation tab for the vSphere cluster.
Try the following options to resolve this problem:
• Take large virtual machines out of the cluster.
• Group virtual machines into several clusters with similar reservations.

224 VMware vSphere: Troubleshooting Workshop


High Availability Configuration
Slide 6-19

Verify that your HA configuration will perform as required.

[Screenshot: vSphere Availability settings. vSphere Availability is comprised of vSphere HA and Proactive HA. To enable Proactive HA, you must also enable DRS on the cluster. The failure responses shown are:
• Host failure: Restart VMs (restart VMs using VM restart priority ordering)
• Proactive HA: Disabled (Proactive HA is not enabled)
• Host Isolation: Disabled (VMs on isolated hosts will remain powered on)
• Datastore with Permanent Device Loss: Disabled (datastore protection for All Paths Down and Permanent Device Loss is disabled)
• Datastore with All Paths Down: Disabled (datastore protection for All Paths Down and Permanent Device Loss is disabled)
• Guest not heartbeating: Disabled (VM and application monitoring disabled)]



Possible Cause: Admission Control Policy Misconfiguration
Slide 6-20

Verify that your admission control settings match your restart
expectations if a failure occurs.

[Screenshot: Admission Control settings. Admission control is a policy used by vSphere HA to ensure failover capacity within a cluster. Increasing the value of host failures cluster tolerates will increase the availability constraints and capacity reserved. The settings shown are:
• Host failures cluster tolerates: 1 (maximum is one less than number of hosts in cluster)
• Define host failover capacity by: Cluster resource percentage, with an option to override the calculated failover capacity
• Performance degradation VMs tolerate: 100% (percentage of performance degradation the VMs in the cluster are allowed to tolerate during a failure; 0% raises a warning if there is insufficient failover capacity to guarantee the same performance after VMs restart, and 100% disables the warning)]

If vSphere HA admission control does not function properly, there is no assurance that all virtual
machines in the cluster can be restarted after a host failure.
Ensure that your admission control settings match your restart expectations if a failure occurs.
Problems occur when no free slots are available in the cluster or if powering on a virtual machine
causes the slot size to increase because it has a larger reservation than existing virtual machines. In
either case, you might use the vSphere HA advanced options to reduce the slot size, use a different
admission control policy, or modify the policy to tolerate fewer host failures.
The vSphere HA Advanced Runtime Info pane shows the slot size and how many available slots are
in the cluster. If the slot size appears too high, view the Resource Allocation tab of the cluster and
sort the virtual machines by reservation to determine which have the largest CPU and memory
reservations. If outlier virtual machines with much higher reservations than the others exist, consider
using a different vSphere HA admission control policy. For example, define failover capacity by
reserving a percentage of the cluster resources. Or use the vSphere HA advanced options to place an
absolute cap on the slot size. Both of these options, however, increase the risk of resource
fragmentation. Resource fragmentation results in the inability to guarantee that the outlier virtual
machines can be restarted in the event of an ESXi host failure.



vSphere HA Cluster: Admission Control Guidelines
Slide 6-21

Ensure that you have enough resources in the vSphere HA cluster to
accommodate your admission control policy.

Admission Control Policy                   Space Calculations

Host Failures Cluster Tolerates            Calculate slot size.

Percentage of Cluster Resources            Calculate spare capacity based on
Reserved as Failover Spare                 total resource requirements for
Capacity                                   powered-on virtual machines.

Specify Failover Hosts                     Calculate the number of failover
                                           hosts needed to hold your virtual
                                           machines.

vCenter Server uses admission control to ensure that sufficient resources are available in a cluster to
provide failover protection. Admission control is also used to ensure that virtual machine resource
reservations are respected.
Choose a vSphere HA admission control policy based on your availability needs and the
characteristics of your cluster. When choosing an admission control policy, you should consider the
following factors:
• Avoiding resource fragmentation: Resource fragmentation occurs when enough resources in
aggregate exist for a virtual machine to be failed over. However, those resources are located on
multiple hosts and are unusable because a virtual machine can run on only one ESXi host at a
time. The Host Failures Cluster Tolerates policy avoids resource fragmentation by defining a
slot as the maximum virtual machine reservation. The Percentage of Cluster Resources policy
does not address the problem of resource fragmentation. If vSphere HA and vSphere DRS are
enabled on the cluster, then vSphere DRS ensures that no resource fragmentation occurs. If
vSphere DRS is not enabled, resource fragmentation can occur. With the Specify Failover Hosts
policy, resources are not fragmented, because hosts are reserved for failover.



• Flexibility of failover resource reservation: Admission control policies differ in the level of
control they give you when reserving cluster resources for failover protection. The Host
Failures Cluster Tolerates policy enables you to set the failover level as a number of hosts. The
Percentage of Cluster Resources policy enables you to reserve a percentage of cluster CPU or
memory resources for failover. The Specify Failover Hosts policy enables you to specify a set
of failover hosts.
• Heterogeneity of cluster: Clusters can be heterogeneous in terms of virtual machine resource
reservations and host total resource capacities. In a heterogeneous cluster, the Host Failures
Cluster Tolerates policy can be too conservative. Unless the Fixed slot size option is selected,
this policy uses the largest reservations of the powered-on virtual machines when selecting the
slot size. It also assumes that the largest hosts fail when computing the current
failover capacity. The other two admission control policies are not affected by cluster
heterogeneity.



Example of Calculating Slot Size
Slide 6-22

The vSphere HA cluster contains five powered-on virtual machines with
differing CPU and memory reservation requirements.
Reserved failover capacity is set to 1.

CPU: 2 GHz    CPU: 2 GHz    CPU: 1 GHz    CPU: 1 GHz    CPU: 1 GHz
RAM: 1 GB     RAM: 1 GB     RAM: 2 GB     RAM: 1 GB     RAM: 1 GB

Slot Size: 2 GHz, 2 GB (Plus Memory Overhead)

4 Slots Available    3 Slots Available    3 Slots Available

ESXi Host 1:         ESXi Host 2:         ESXi Host 3:
CPU: 9 GHz           CPU: 9 GHz           CPU: 6 GHz
RAM: 9 GB            RAM: 7 GB            RAM: 7 GB

A slot is a logical representation of memory and CPU resources. By default, a slot is sized to satisfy
requirements for any powered-on virtual machine in the cluster:
• vSphere HA calculates the CPU component by obtaining the CPU reservation of each powered-
on virtual machine and selecting the largest value. If you have not specified a CPU reservation
for a virtual machine, the virtual machine is assigned a default value of 32 MHz. You can
change this value by using the das.vmcpuminmhz advanced attribute.
• vSphere HA calculates the memory component by obtaining the memory reservation, plus
memory overhead, of each powered-on virtual machine and selecting the largest value. No
memory is reserved by default. The default value for memory reservation is zero.
In the example, the largest CPU requirement, 2 GHz, is shared by the first and second virtual
machines. The largest memory requirement, 2 GB, is held by the third virtual machine. Based on
this information, the slot size is 2 GHz CPU and 2 GB memory.
Of the three ESXi hosts in the example, ESXi host 1 can support four slots. ESXi hosts 2 and 3 can
support three slots each.
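
The default slot-size rules can be sketched in a few lines of Python. This is an illustrative calculation only, not how vSphere HA is implemented. The function names are hypothetical, values are expressed in MHz and MB, and memory overhead is ignored, as in the example.

```python
# Illustrative sketch of the default slot-size rules; not VMware code.
DEFAULT_CPU_MHZ = 32  # value used when a VM has no CPU reservation

# Reservations of the five powered-on virtual machines in the example.
VMS = [
    {"cpu_mhz": 2000, "mem_mb": 1024},
    {"cpu_mhz": 2000, "mem_mb": 1024},
    {"cpu_mhz": 1000, "mem_mb": 2048},
    {"cpu_mhz": 1000, "mem_mb": 1024},
    {"cpu_mhz": 1000, "mem_mb": 1024},
]

# Resources of the three ESXi hosts in the example.
HOSTS = [
    {"cpu_mhz": 9000, "mem_mb": 9 * 1024},
    {"cpu_mhz": 9000, "mem_mb": 7 * 1024},
    {"cpu_mhz": 6000, "mem_mb": 7 * 1024},
]


def slot_size(vms):
    """Return (cpu_mhz, mem_mb): the largest reservation of each type."""
    cpu = max(vm["cpu_mhz"] or DEFAULT_CPU_MHZ for vm in vms)
    mem = max(vm["mem_mb"] for vm in vms)
    return cpu, mem


def slots_per_host(host, slot):
    """A host supports as many slots as its scarcer resource allows."""
    cpu_slot, mem_slot = slot
    counts = [host["cpu_mhz"] // cpu_slot]
    if mem_slot > 0:  # with no memory reservations, only CPU limits the count
        counts.append(host["mem_mb"] // mem_slot)
    return min(counts)


slot = slot_size(VMS)
print("Slot size (MHz, MB):", slot)
print("Slots per host:", [slots_per_host(h, slot) for h in HOSTS])
```

Each host is limited by whichever resource runs out first, which is why host 2 supports only three slots despite having the same CPU capacity as host 1.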
A virtual machine requires a certain amount of available overhead memory to power on. The
amount of overhead required varies, depending on the amount of configured memory. The amount
of this overhead is added to the memory slot size value. For simplicity, the calculations do not
include the memory overhead. Such overhead does exist, though.
vSphere HA admission control does not consider the number of vCPUs that a virtual machine has.
Admission control looks only at the virtual machine's reservation values. Thus, a 4 GHz CPU
reservation means the same thing for both a four-way SMP virtual machine and a single-vCPU
virtual machine.



Applying Slot Size
Slide 6-23

After determining the maximum number of slots that each host can
support, the current failover capacity can be computed.
In this example, current failover capacity is 1.

CPU: 2 GHz    CPU: 2 GHz    CPU: 1 GHz    CPU: 1 GHz    CPU: 1 GHz
RAM: 1 GB     RAM: 1 GB     RAM: 2 GB     RAM: 1 GB     RAM: 1 GB

Slot Size: 2 GHz, 2 GB (Plus Memory Overhead)

3 Slots Available    3 Slots Available

ESXi Host 2:         ESXi Host 3:
CPU: 9 GHz           CPU: 6 GHz
RAM: 7 GB            RAM: 7 GB

In the example, host 1 is the largest host in the cluster. If host 1 fails, then six slots remain in the
cluster, which is sufficient for all five of the powered-on virtual machines. The cluster has one
available slot left.
If both hosts 1 and 2 fail, then only three slots remain, which is insufficient for the number of
powered-on virtual machines to fail over. Thus, the current failover capacity is one.
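
The reasoning above can be written as a short Python sketch. It assumes that each powered-on virtual machine occupies exactly one slot and that the hosts with the most slots fail first. The function name is hypothetical and the code only illustrates the arithmetic.

```python
# Illustrative sketch: count how many of the largest hosts can fail while
# the surviving hosts still provide one slot per powered-on VM.

def current_failover_capacity(host_slots, powered_on_vms):
    """Return the number of host failures the cluster can absorb."""
    remaining = sorted(host_slots, reverse=True)  # largest hosts fail first
    failures = 0
    while remaining:
        survivors = remaining[1:]  # remove the largest remaining host
        if sum(survivors) < powered_on_vms:
            break  # not enough slots left for every powered-on VM
        remaining = survivors
        failures += 1
    return failures


# Hosts 1, 2, and 3 support 4, 3, and 3 slots; five VMs are powered on.
print(current_failover_capacity([4, 3, 3], 5))
```

With host 1 removed, six slots remain for five virtual machines. Removing host 2 as well leaves three slots, so the loop stops after one failure.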



Distorted Slot Size
Slide 6-24

A single virtual machine with a very large memory or CPU reservation
can drastically reduce your slots available.
You can override the maximum upper bound of a slot size in vSphere
Availability Advanced Options.

CPU: 2 GHz    CPU: 2 GHz    CPU: 1 GHz    CPU: 1 GHz    CPU: 1 GHz
RAM: 1 GB     RAM: 1 GB     RAM: 4 GB     RAM: 1 GB     RAM: 1 GB

Slot Size: 2 GHz, 4 GB (Plus Memory Overhead)

2 Slots Available    1 Slot Available     1 Slot Available

ESXi Host 1:         ESXi Host 2:         ESXi Host 3:
CPU: 9 GHz           CPU: 9 GHz           CPU: 6 GHz
RAM: 9 GB            RAM: 7 GB            RAM: 7 GB

A single virtual machine with a very large memory or CPU reservation can drastically reduce the
number of slots available. By default, the slot size is calculated based on the largest CPU reservation
and the largest memory reservation (not actual CPU or memory usage) of powered-on virtual
machines. The memory reservation includes overhead.
You can manually set slot sizes to a fixed value in vSphere Availability Admission Control in the
vSphere Web Client. You can also set a maximum upper bound for the CPU or memory component of
the slot size by creating the parameters das.slotcpuinmhz and das.slotmeminmb in vSphere
Availability Advanced Options.
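
The effect of an upper bound on the memory component can be illustrated with the following Python sketch. It models only memory, ignores overhead, and uses hypothetical function names. It also assumes that a virtual machine whose reservation exceeds the capped slot size is counted as needing more than one slot. The 2048 MB cap is an example value, not a recommendation.

```python
import math

# Illustrative sketch: memory component only, overhead ignored.
HOST_MEM_MB = [9 * 1024, 7 * 1024, 7 * 1024]   # ESXi hosts 1, 2, and 3
VM_MEM_MB = [1024, 1024, 4096, 1024, 1024]     # reservations from the slide


def memory_slot_size(vm_mem_mb, cap_mb=None):
    """Largest reservation, limited by an optional cap (as das.slotmeminmb would)."""
    largest = max(vm_mem_mb)
    return largest if cap_mb is None else min(largest, cap_mb)


def slots_per_host(host_mem_mb, slot_mb):
    """Number of memory slots that fit on each host."""
    return [mem // slot_mb for mem in host_mem_mb]


def slots_needed(vm_mem_mb, slot_mb):
    """Slots each VM occupies; assumes oversized VMs take several slots."""
    return [max(1, math.ceil(mem / slot_mb)) for mem in vm_mem_mb]


uncapped = memory_slot_size(VM_MEM_MB)
capped = memory_slot_size(VM_MEM_MB, cap_mb=2048)  # hypothetical cap value

print("Uncapped:", uncapped, slots_per_host(HOST_MEM_MB, uncapped))
print("Capped:  ", capped, slots_per_host(HOST_MEM_MB, capped))
print("Slots needed per VM with cap:", slots_needed(VM_MEM_MB, capped))
```

The cap raises the slot count, but the 4 GB virtual machine then needs two slots on the same host, which is where the risk of resource fragmentation comes from.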



Reserving a Percentage of Cluster Resources
Slide 6-25

With the Percentage of Cluster Resources Reserved as Failover Spare
Capacity policy, vSphere HA ensures that a specified percentage of
aggregate CPU and memory resources is reserved for failover.

CPU: 2 GHz    CPU: 2 GHz    CPU: 1 GHz    CPU: 1 GHz    CPU: 1 GHz
RAM: 1 GB     RAM: 1 GB     RAM: 2 GB     RAM: 1 GB     RAM: 1 GB

Total Virtual Machine Host Requirements: 7 GHz, 6 GB

Total Host Resources: 24 GHz, 23 GB

ESXi Host 1:         ESXi Host 2:         ESXi Host 3:
CPU: 9 GHz           CPU: 9 GHz           CPU: 6 GHz
RAM: 9 GB            RAM: 7 GB            RAM: 7 GB

With this policy, vSphere HA enforces admission control in the following way:
1. vSphere HA calculates the total resource requirements for all powered-on virtual machines in
the cluster.
2. It calculates the total host resources available for virtual machines.
3. It calculates the current CPU failover capacity and current memory failover capacity for the
cluster.
4. It determines whether either the current CPU failover capacity or the current memory failover
capacity is less than the corresponding configured failover capacity (provided by the user).



Calculating Current Failover Capacity
Slide 6-26

The current CPU failover capacity is computed by subtracting the total
CPU resource requirements from the total host CPU resources and
dividing the result by the total host CPU resources.
The current memory failover capacity is calculated similarly.

CPU: 2 GHz    CPU: 2 GHz    CPU: 1 GHz    CPU: 1 GHz    CPU: 1 GHz
RAM: 1 GB     RAM: 1 GB     RAM: 2 GB     RAM: 1 GB     RAM: 1 GB

Total Virtual Machine Host Requirements: 7 GHz, 6 GB

Total Host Resources: 24 GHz, 23 GB

ESXi Host 1:         ESXi Host 2:         ESXi Host 3:
CPU: 9 GHz           CPU: 9 GHz           CPU: 6 GHz
RAM: 9 GB            RAM: 7 GB            RAM: 7 GB

Current CPU failover capacity = (24 GHz - 7 GHz) / 24 GHz = 70.8%

Current memory failover capacity = (23 GB - 6 GB) / 23 GB = 73.9%

In the example, the total resource requirements for the powered-on virtual machines are 7 GHz and 6
GB. The total host resources available for virtual machines are 24 GHz and 23 GB. Based on these
values, the current CPU failover capacity is approximately 70 percent. Similarly, the current memory
failover capacity is approximately 74 percent.
If the cluster's configured failover capacity is set to 25 percent, approximately 45 percent of the cluster's
total CPU resources and 49 percent of the cluster's memory resources are still available to power on
additional virtual machines.
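
The percentage calculation can be checked with a small Python sketch. The function names are hypothetical, and the code reproduces only the arithmetic described above, not the vSphere HA implementation.

```python
# Illustrative sketch of the Percentage of Cluster Resources arithmetic.

def failover_capacity_pct(total_host, total_vm_requirements):
    """(host resources - VM requirements) / host resources, as a percentage."""
    return (total_host - total_vm_requirements) / total_host * 100


def passes_admission_control(current_pct, configured_pct):
    """Both CPU and memory must stay at or above the configured capacity."""
    return current_pct >= configured_pct


cpu_pct = failover_capacity_pct(24, 7)   # GHz values from the example
mem_pct = failover_capacity_pct(23, 6)   # GB values from the example

print(f"Current CPU failover capacity:    {cpu_pct:.1f}%")
print(f"Current memory failover capacity: {mem_pct:.1f}%")
print("Within a 25% configured capacity:",
      passes_admission_control(cpu_pct, 25) and passes_admission_control(mem_pct, 25))
```

A power-on request would be refused only if it pushed either percentage below the configured value.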



Using VMCP
Slide 6-27

Earlier versions of vSphere cannot detect APD conditions or remediate
PDL conditions.
The new Virtual Machine Component Protection (VMCP) feature has the
following features:
• Provides enhanced recovery from APD and PDL conditions.
• Can automatically restart impacted virtual machines on nonimpacted hosts.

Failure conditions and responses

You can configure how vSphere HA responds to the failure conditions on this cluster. The following failure conditions are supported: host,
host isolation, VM component protection (datastore with PDL and APD), VM and application.

Enable Host Monitoring (check box)

Host Failure Response          Restart VMs
Response for Host Isolation    Disabled
Datastore with PDL             Disabled
Datastore with APD             Disabled
VM Monitoring                  Disabled

If Virtual Machine Component Protection (VMCP) is enabled, vSphere HA can detect datastore
accessibility failures and provide automated recovery for affected virtual machines.
VMCP provides protection against storage accessibility failures that can affect a virtual machine
running on a host in a vSphere HA cluster. When a datastore accessibility failure occurs, the affected
host cannot access the storage path for a specific datastore. You can determine the response of
vSphere HA to such a failure, ranging from the creation of event alarms to virtual machine restarts
on other hosts.



Useful Troubleshooting Commands
Slide 6-28

You can view the vSphere HA log in vSphere Web Client using the log
browser, or using the command line:
[root@esxi-a-03:~] ls /var/log/fdm.log
/var/log/fdm.log
[root@esxi-a-03:~] tail -f /var/log/fdm.log

You can start, stop, or restart the vSphere HA service from the command
line:
[root@esxi-a-03:~] ps | grep -i fdm

[root@esxi-a-03:~] /etc/init.d/vmware-fdm start

[root@esxi-a-03:~] /etc/init.d/vmware-fdm stop

[root@esxi-a-03:~] /etc/init.d/vmware-fdm restart



Cluster Utilization Graph
Slide 6-29

The Monitor > Utilization graph can show you the cluster load balance
across ESXi hosts.

[Screenshot: Monitor > Utilization tab for Lab Cluster in the vSphere Web Client, with Cluster CPU, Cluster Memory, and Guest Memory panels. Cluster CPU: Consumed 3.95 GHz, Active 6.83 GHz, Capacity 11.20 GHz. Cluster Memory: Consumed 2.95 GB, Overhead 249.00 MB, Capacity 12.00 GB. Guest Memory: Active Guest Memory 1.73 GB.]



Review of vSphere vMotion
Slide 6-30

A vSphere vMotion migration occurs over a network that is enabled for
vSphere vMotion.

[Figure: Migration of the VM's execution state between ESXi hosts over the vMotion network, through a VMkernel port enabled for vSphere vMotion.]

vSphere vMotion transfers the entire execution state of a running virtual machine from the source
ESXi host to the destination ESXi host over a high-speed network. The execution state primarily
consists of the following components:
• The virtual machine's physical memory
• The virtual device state, including the state of the CPU, network and disk adapters, and SVGA
• External network connections
• The virtual machine's virtual disks (migrated only when disks are not on shared storage)



vSphere vMotion TCP/IP Stacks
Slide 6-31

Each host has a second TCP/IP stack dedicated to vSphere vMotion.

[Figure: VMkernel networking diagram with the labels hostd, PING, DHCP, VMkernel, and VMKTCP-API, above the two TCP/IP stacks.]

Default TCP/IP Stack          vMotion TCP/IP Stack

• Separate Memory Heap        • Separate Memory Heap
• ARP Tables                  • ARP Tables
• Routing Table               • Routing Table
• Default Gateway             • Default Gateway



Use esxcli to Display vMotion Network Information
Slide 6-32

Use the esxcli network ip netstack command to display
vMotion network information for a specific ESXi host.

[root@sa-esxi-01:~] esxcli network ip netstack list
defaultTcpipStack
   Key: defaultTcpipStack
   Name: defaultTcpipStack
   State: 4660

vmotion
   Key: vmotion
   Name: vmotion
   State: 4660
[root@sa-esxi-01:~] esxcli network ip netstack get -N vmotion
vmotion
   Key: vmotion
   Name: vmotion
   Enabled: true
   Max Connections: 11000
   Current Max Connections: 11000
   Congestion Control Algorithm: newreno
   IPv6 Enabled: true
   Current IPv6 Enabled: false
   State: 4660
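
When you check many hosts, it can help to capture the command output and inspect it with a script. The following Python sketch parses text in the layout shown above. The parser and its function name are hypothetical, the sample text is modeled on the slide output, and the layout of real output might differ between releases.

```python
# Illustrative sketch: parse captured "esxcli network ip netstack list" text.

SAMPLE_OUTPUT = """\
defaultTcpipStack
   Key: defaultTcpipStack
   Name: defaultTcpipStack
   State: 4660

vmotion
   Key: vmotion
   Name: vmotion
   State: 4660
"""


def parse_netstack_list(text):
    """Return {stack_name: {field: value}} from netstack list output."""
    stacks = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if ":" in stripped:
            if current is None:
                continue  # field line before any stack name; ignore it
            key, _, value = stripped.partition(":")
            stacks[current][key.strip()] = value.strip()
        else:
            current = stripped  # a line without a colon names a new stack
            stacks[current] = {}
    return stacks


stacks = parse_netstack_list(SAMPLE_OUTPUT)
print("vMotion stack present:", "vmotion" in stacks)
```

A host whose parsed output lacks a vmotion entry would be a candidate for closer inspection.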



Long Distance vMotion
Slide 6-33

You can migrate virtual machines over long distances:
• You can perform reliable migrations between hosts and sites separated by high
network round-trip latency times.
• This feature requires the VMware vSphere® Enterprise Plus Edition™ license.

To migrate virtual machines over long distances, your environment must comply with these
requirements:
• A round-trip time (RTT) latency of 150 milliseconds or less between hosts.
• Your license must cover migrating virtual machines across long distances. The long distance
vMotion features require a VMware vSphere® Enterprise Plus Edition™ license. For more
information, see Compare vSphere Editions at http://www.vmware.com/uk/products/vsphere/
compare.html.
• You must place the traffic related to virtual machine file transfer to the destination host on the
provisioning TCP/IP stack. For more information about placing traffic for cold migration,
cloning, and snapshots on the provisioning TCP/IP stack, see the chapter about migrating
virtual machines in vCenter Server and Host Management Guide at http://pubs.vmware.com/
vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-host-management-
guide.pdf.
For more information about migrating virtual machines over long distances, see VMware
knowledge base article 2106949 at http://kb.vmware.com/kb/2106949.



Cross vCenter Server vMotion
Slide 6-34

You can migrate virtual machines across vCenter Server instances:


• The source and destination vCenter Server instances and ESXi hosts must be
running version 6.0 or later.
• Both vCenter Server instances must be in Enhanced Linked Mode and must be
in the same vCenter Single Sign-On domain if you are using the vSphere Web
Client (instead of the API).
• Both vCenter Server instances must be time-synchronized.
• For migration of compute resources only, both vCenter Server instances must
be connected to the shared virtual machine storage.
• This feature requires the vSphere Enterprise Plus license.

To migrate virtual machines across vCenter Server instances by using the vSphere Web Client, you
must enable Enhanced Linked Mode for both the source and destination vCenter Server instances.
You must also put the vCenter Server instances in the same VMware vCenter™ Single Sign-On™
domain, so that the source vCenter Server can authenticate to the destination vCenter Server.
When using the vSphere APIs or SDK, both vCenter Server instances can exist in separate vCenter
Single Sign-On domains. Additional parameters are required when performing a nonfederated cross
vCenter Server vMotion migration. For more information about the virtual machine relocation
specifications, see vSphere API Reference at https://www.vmware.com/support/developer/vc-sdk/.



vSphere vMotion Problem 1
Slide 6-35

If vSphere vMotion was working but fails in this way, begin by confirming
proper VMkernel port settings and values and verify physical component
functionality.
If the network configuration is correct and physical components are
functioning, restart the management agents:
• Restart the management agents on the ESXi host at the command prompt.
- /etc/init.d/hostd restart
- /etc/init.d/vpxa restart
• Use the DCUI to restart the management agents on the ESXi host.

[Screenshot: DCUI Troubleshooting Mode Options menu, listing Disable ESXi Shell, Disable SSH, Modify ESXi Shell and SSH timeouts, and Restart Management Agents.]

Before you restart the management agents, confirm that the network configuration is still correct and
that all network hardware is functioning correctly.
You can restart the management agents from the ESXi host command line. Or you can restart the
agents by selecting Troubleshooting Mode Options > Restart Management Agents in the DCUI.
Restarting the management agents might affect tasks that are running on the ESXi host at the time of
the restart.
For information about restarting the management service on an ESXi host, see VMware knowledge
base article 1005566 at http://kb.vmware.com/kb/1005566.



Identifying Possible Causes
Slide 6-36

The most probable causes of vSphere vMotion migration failing at
15 percent or less can be attributed to the ESXi host. Use a bottom-up
approach to troubleshooting this problem.

Possible Causes

Virtual Machine:
• The log.rotateSize parameter is set to a low value.

ESXi Host:
• VMkernel network connectivity is lost.
• VMkernel network configuration is invalid.
• Name resolution is not valid on the host.
• Time is not synchronized across the environment.
• The required disk space is not available.
• VM reservation requirements are not met on the target host.

If the vSphere vMotion network is not functioning properly, vSphere vMotion migrations can fail.
The ESXi host must be configured properly and have enough resources to allow virtual machine
migrations from one host to the next.
For information about diagnosing a vSphere vMotion failure at 15 percent or less, see VMware
knowledge base article 1003734 at http://kb.vmware.com/kb/1003734.



Possible Cause: VMkernel Interface Misconfiguration
Slide 6-37

vSphere vMotion might time out completely if the vSphere vMotion
VMkernel network interface is not configured properly:
• From the source host, verify that you can connect to the vSphere vMotion
VMkernel interface on the destination host:
- ping vSphere_vMotion_vmk_IP_addr_on_destination

If the ping command fails, ensure that the IP settings of the vSphere
vMotion VMkernel interface are correct:
- IP address
- Subnet mask
- VMkernel gateway

Verify that VMkernel network connectivity for your vSphere vMotion network exists. If the ping
command results in 100 percent packet loss, then verify that the VMkernel configuration for your
vSphere vMotion network is valid.



Possible Cause: Invalid Name Resolution on the Host
Slide 6-38

vSphere vMotion might time out completely if name resolution is not
working properly:
• Verify that the ESXi hosts can see each other by IP address and host name:
- nslookup source_or_destination_host_IP_address
- nslookup source_or_destination_host_FQDN
• If the name does not resolve properly, check the DNS server to ensure that it
has the correct information for your source and destination hosts.
If system time is not synchronized across the environment, vSphere
vMotion might time out completely:
• Check the time on the source and destination hosts:
- From the ESXi host command line, run date.
• If the times are not synchronized, configure the Network Time Protocol client in
the Time Configuration settings on the ESXi hosts.

Verify that name resolution is valid on both the source and destination hosts.
Verify that time is synchronized across the environment. If there are time discrepancies, synchronize
the time on the affected hosts. The time can be maintained by using a Network Time
Protocol server.
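
The forward and reverse lookups performed with nslookup can also be scripted from an administration workstation. The following Python sketch uses the standard socket module. The function name is hypothetical, the IP address in the example is invented for illustration, and the lookup functions are passed in as parameters so that the example runs without a DNS server.

```python
import socket

# Illustrative sketch: compare forward and reverse lookups for one host.

def name_resolution_problems(fqdn, expected_ip,
                             forward=socket.gethostbyname,
                             reverse=socket.gethostbyaddr):
    """Return a list of problems found; an empty list means both lookups agree."""
    problems = []
    try:
        ip = forward(fqdn)
        if ip != expected_ip:
            problems.append(f"{fqdn} resolves to {ip}, expected {expected_ip}")
    except OSError as exc:  # socket.gaierror and socket.herror derive from OSError
        problems.append(f"forward lookup of {fqdn} failed: {exc}")
    try:
        name = reverse(expected_ip)[0]
        if name.lower().rstrip(".") != fqdn.lower().rstrip("."):
            problems.append(f"{expected_ip} resolves to {name}, expected {fqdn}")
    except OSError as exc:
        problems.append(f"reverse lookup of {expected_ip} failed: {exc}")
    return problems


# Stand-in lookups with an invented address, used here instead of real DNS.
fake_forward = lambda name: "172.20.10.51"
fake_reverse = lambda ip: ("sa-esxi-01.vclass.local", [], [ip])

print(name_resolution_problems("sa-esxi-01.vclass.local", "172.20.10.51",
                               forward=fake_forward, reverse=fake_reverse))
```

Omit the forward and reverse arguments to query the resolver that the workstation is configured to use.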



Possible Cause: Required Disk Space Not Available
Slide 6-39

vSphere vMotion might fail if required disk space is not available on the
target host:
• On the destination host, run df -h.
- Verify that enough space is available on the destination datastore.

~ # df -h
Filesystem     Size     Used  Available  Use%  Mounted on
VMFS-5        55.0G     1.7G      53.3G        /vmfs/volumes/Local01
             249.7M   144.8M     104.9M   58%

• If insufficient space is available, try these resolutions:
- Migrate the virtual machine to a different destination datastore.
- Increase the size of the destination datastore.

vSphere vMotion presents a unified migration architecture that migrates live virtual machines,
including their memory and storage, between vSphere hosts without any requirement for shared
storage.
If vSphere vMotion must transfer the virtual machine's storage from the source host to a different
datastore on the destination host, ensure that enough disk space exists to accommodate the migrated
virtual machine.
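
The space check can be expressed as a simple comparison. The following Python sketch uses shutil.disk_usage from the standard library. The function name and the 40 GB virtual machine size are hypothetical, and the sketch assumes that the script runs on a system where the datastore path is mounted.

```python
import shutil

# Illustrative sketch: compare free space with the size of the VM to migrate.

def has_room(datastore_path, required_bytes, usage=shutil.disk_usage):
    """True if the file system holding datastore_path has enough free space."""
    return usage(datastore_path).free >= required_bytes


GIB = 1024 ** 3
vm_size_bytes = 40 * GIB  # hypothetical size of the VM's files

# "." stands in for a datastore path such as /vmfs/volumes/Local01.
print("Enough space:", has_room(".", vm_size_bytes))
```

In practice, leave additional headroom beyond the size of the virtual disks for files that are created during the migration.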



Possible Cause: Reservation Requirements Not Met
Slide 6-40

vSphere vMotion might time out completely if reservation requirements
are not met on the ESXi host.
Verify that the virtual machine does not have reservations set that
exceed the available resources on the destination ESXi host:
• Check the virtual machine's processor and memory values.
• Check the virtual machine's CPU and memory reservation values.
• Check the virtual machine's VMkernel overhead value.

Verify that virtual machine reservation values do not exceed available resources on the host. Check
the ESXi host's Summary tab for the number of processors, processor speed, and amount of
physical memory available. Then check the virtual machine's reservation values for CPU and
memory.
If the virtual machine has reservations configured that exceed available resources, enough resources
must be made available on the target ESXi host or the reservations must be lowered or removed.
Both the VMkernel overhead and the memory reservation must be available for a virtual machine to
power on.
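
The comparison described above can be summarized in a short Python sketch. The function name, dictionary keys, and numeric values are all hypothetical. The point is only that the memory check must include the overhead in addition to the reservation.

```python
# Illustrative sketch: can the destination host satisfy the VM's reservations?

def reservation_blockers(vm, host_available):
    """Return the resources ("cpu", "memory") that the host cannot satisfy."""
    blockers = []
    if vm["cpu_reservation_mhz"] > host_available["cpu_mhz"]:
        blockers.append("cpu")
    # Memory reservation and overhead must both fit on the destination host.
    memory_needed = vm["mem_reservation_mb"] + vm["overhead_mb"]
    if memory_needed > host_available["mem_mb"]:
        blockers.append("memory")
    return blockers


vm = {"cpu_reservation_mhz": 2000, "mem_reservation_mb": 4096, "overhead_mb": 90}
host_available = {"cpu_mhz": 5000, "mem_mb": 4100}

print(reservation_blockers(vm, host_available))
```

In this example the reservation alone would fit in the available memory, but the reservation plus overhead does not.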



Possible Cause: log.rotateSize Set to Low Value
Slide 6-41

If the log.rotateSize parameter in the virtual machine's
configuration file exists and is set to a low value, vSphere vMotion might
time out completely.
On the ESXi host, check the value of this parameter in the virtual
machine's .vmx file:
• If the parameter does not exist, then the default value is used (0 for unlimited).
To resolve this issue, take one of the following actions:
• Increase the log.rotateSize value to a larger number to prevent the log file
from rotating too quickly.
• Use the default value: 0.

The log.rotateSize setting defines the maximum size in bytes that the virtual machine log file,
vmware.log, can grow to. By default, the maximum size is set to zero, which means the log file
can grow to an unlimited size.
If the log.rotateSize value exists in the virtual machine's .vmx file and is set to a very low
value, vmware.log might rotate quickly. As a result, by the time the destination host is requesting
the VMFS lock for vmware.log, the log file has already rotated and a new vmware.log file is
created. The destination host is then unable to acquire a proper file lock, which causes the vSphere
vMotion failure.
For information about log rotation and logging options for vmware.log, see VMware knowledge
base article 8182749 at http://kb.vmware.com/kb/8182749.
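
A quick way to check a copy of a .vmx file for this setting is sketched below in Python. The function names are hypothetical. The sketch assumes the usual key = "value" line layout, matches the key without regard to case as a precaution, and uses an arbitrary 1,000,000-byte threshold for "low" because the source does not define one.

```python
import re

# Illustrative sketch: read log.rotateSize from the text of a .vmx file.
_SETTING = re.compile(r'^\s*([^#\s=]+)\s*=\s*"(.*)"\s*$')


def rotate_size_bytes(vmx_text):
    """Return log.rotateSize as an int, or 0 (unlimited) if it is not set."""
    for line in vmx_text.splitlines():
        match = _SETTING.match(line)
        if match and match.group(1).lower() == "log.rotatesize":
            return int(match.group(2))
    return 0


def rotate_size_is_low(vmx_text, threshold_bytes=1_000_000):
    """True if a limit is set and it is below the (assumed) threshold."""
    size = rotate_size_bytes(vmx_text)
    return 0 < size < threshold_bytes


sample_vmx = 'displayName = "linux-a-01"\nlog.rotateSize = "2048"\n'
print(rotate_size_bytes(sample_vmx), rotate_size_is_low(sample_vmx))
```

A value of 0, or a missing entry, is never flagged because it means that the log file is not size-limited.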



Resetting Migrate.Enabled
Slide 6-42

If vSphere vMotion fails between 10 and 20 percent with the error
message, A general system error occurred: Migration
failed while copying data, Broken Pipe, take the following
action:
• Reset the Migrate.Enabled advanced setting:
1. Change the value to 0 and save the setting.
2. Change the value back to 1 and save the setting.

[Screenshot: Advanced System Settings list for the ESXi host, showing the Migrate.Enabled setting (Enable hot migration support).]

If the vSphere vMotion migration still fails between 10 and 20 percent, reset the ESXi host's
advanced setting, Migrate.Enabled, on both the source and the destination ESXi hosts.



vSphere vMotion Problem 2
Slide 6-43

If hosts are equally balanced for CPU and memory consumption, no or
few vSphere vMotion migrations should occur.
If vSphere DRS is not operating normally and should be migrating virtual
machines, perform these checks:
• Verify that the vSphere DRS automation level is not set to manual mode.
• Verify that vSphere vMotion is working properly.

Troubleshoot vSphere DRS only if the hosts are out of balance. vSphere DRS might not be
migrating virtual machines because migrations are not needed at the time.
If the vSphere DRS automation level is set to manual mode, then vSphere vMotion migrations do
not take place automatically. You must approve the migration recommendation before the migration
takes place.
Verify that your vSphere vMotion configuration is correct on all hosts in the cluster. As a test, you
should be able to manually migrate your virtual machines between hosts without a problem.



Possible Cause: vSphere DRS Configuration
Slide 6-44

vSphere DRS might have valid reasons for not performing vSphere
vMotion migrations.

vSphere DRS Never Migrates:
• The automation level is set to manual mode.
• The automation level is fully automated mode. The migration threshold is
set to apply priority 1 recommendations.

vSphere DRS Seldom Migrates:
• Virtual machine loads are fairly consistent.
• The automation level is fully automated mode. The migration threshold is
set to apply priority 1, 2, and 3 recommendations.

vSphere DRS Often Migrates:
• Virtual machine loads are very erratic in their resource requirements.
• The automation level is fully automated mode. The migration threshold is
set to apply all recommendations.

vSphere DRS never performs migrations if the migration threshold is set to apply priority 1
recommendations. With this setting, vCenter Server applies only recommendations that must be
taken to satisfy cluster constraints like affinity rules and host maintenance. vCenter Server will not
apply load-balancing recommendations.
vSphere DRS seldom migrates virtual machines if the virtual machine load is fairly consistent. If the
hosts are load balanced, then the need for vSphere DRS to move virtual machines rarely occurs.
vSphere DRS often migrates virtual machines if the virtual machine loads are very erratic in their
resource requirements. In this case, vSphere DRS might need to frequently reshuffle the virtual
machines across the hosts in the cluster to keep the load balanced.
vSphere DRS seldom performs migrations if the migration threshold is set to apply priority 1, 2, and
3 recommendations. With this setting, vCenter Server performs vSphere vMotion migrations only
for extreme and high load imbalances across the hosts in the cluster.
vSphere DRS often performs migrations if the migration threshold is set to apply all
recommendations. With this setting, vCenter Server performs vSphere vMotion migrations at the
slightest load imbalance across the hosts in the cluster.
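
The relationship between the migration threshold and the recommendations that are applied can be sketched as follows. This Python example is illustrative only. It assumes five threshold levels and five recommendation priorities, with level n applying priorities 1 through n, which matches the three settings described above. The function names are hypothetical.

```python
# Illustrative sketch: which recommendation priorities a threshold level applies.

def applied_priorities(threshold_level):
    """Level 1 applies only priority 1; level 5 applies all recommendations."""
    if not 1 <= threshold_level <= 5:
        raise ValueError("threshold level must be between 1 and 5")
    return list(range(1, threshold_level + 1))


def is_applied(recommendation_priority, threshold_level):
    """True if a recommendation of this priority is applied at this level."""
    return recommendation_priority in applied_priorities(threshold_level)


for level in (1, 3, 5):
    print(level, applied_priorities(level))
```

A priority 4 recommendation is therefore ignored at level 3 and applied only when the threshold is moved to a more aggressive setting.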



Possible Cause: Configuration Problems
Slide 6-45

Verify that vSphere DRS and vSphere vMotion are configured correctly.

vSphere DRS Never Migrates:
• The vSphere vMotion network is not configured or is not working properly.
• The migration threshold is incorrectly set to apply priority 1
recommendations.

vSphere DRS Seldom Migrates:
• Some virtual machines cannot be migrated, because they are using local
host resources.
• Too many restrictive affinity or anti-affinity rules are enabled.

vSphere DRS Often Migrates:
• vSphere DRS might not have a problem. Virtual machines might be
performing erratically, causing vSphere DRS to do more work to maintain an
equal load balance across the hosts in the cluster.

If virtual machines cannot be migrated with vSphere vMotion, verify that these virtual machines are
not actively using local host resources such as local storage, local CD/DVD drives, or internal
networks.
If the hosts in the cluster are consistently out of balance, then vSphere DRS is not working correctly
and you must investigate whether misconfiguration is causing this behavior.



Lab 7: Troubleshooting Cluster Problems
Slide 6-46

Identify, diagnose, and resolve cluster problems


1. Run a Break Script
2. Verify That the System Is Not Functioning Properly
3. Troubleshoot and Repair the Problem
4. Verify That the Problem Is Repaired



Review of Learner Objectives
Slide 6-47

You should be able to meet the following objectives:


• Identify and troubleshoot vSphere HA problems
• Analyze and solve vSphere vMotion problems
• Diagnose and troubleshoot common vSphere DRS problems



Key Points
Slide 6-48

• Hosts in vSphere HA clusters have a master-slave relationship.


• If the FDM agent cannot be installed on the ESXi host, verify that sufficient
network bandwidth exists between the ESXi host and vCenter Server.
• VMCP can protect virtual machines from APD and PDL conditions by
automatically restarting impacted virtual machines on healthy hosts.
• Each ESXi host has a second TCP/IP stack dedicated to vSphere vMotion.
• Improperly configured admission control policies and network bandwidth
reservation might contribute to insufficient resource problems.
• If you cannot migrate virtual machines over long distances, check the licensing.
• If you cannot migrate virtual machines across vCenter Server instances, check
whether the vCenter Server instances belong to the same vCenter Single Sign-
On domain (Enhanced Linked Mode).
Questions?



MODULE 7
Troubleshooting Virtual Machines
Slide 7-1

Module 7

You Are Here
Slide 7-2

1. Course Introduction
2. Introduction to Troubleshooting
3. Troubleshooting Tools
4. Troubleshooting Virtual Networking
5. Troubleshooting Storage
6. Troubleshooting vSphere Clusters
7. Troubleshooting Virtual Machines

8. Troubleshooting vCenter Server and ESXi



Importance
Slide 7-3

Administrators must understand how to quickly identify and effectively
troubleshoot virtual machine problems to protect against application
downtime, especially for mission-critical applications.



Learner Objectives
Slide 7-4

By the end of this module, you should be able to meet the following
objectives:
• Discuss virtual machine files and content IDs
• Identify, analyze, and solve virtual machine snapshot problems
• Troubleshoot virtual machine power-on problems
• Identify possible causes and troubleshoot virtual machine connection state
problems
• Diagnose and recover from VMware Tools installation failures



Review of Virtual Machine Files
Slide 7-5

A virtual machine consists of a set of files located in a datastore,
controlled by the ESXi host to which it is registered.

Files for the virtual machine Win01-A in the datastore:

Configuration file              Win01-A.vmx
Swap files                      Win01-A.vswp
                                vmx-Win01-A.vswp
BIOS file                       Win01-A.nvram
Log files                       vmware.log
Raw device map file             Win01-A-rdm.vmdk
Disk descriptor file            Win01-A.vmdk
Disk data file                  Win01-A-flat.vmdk
Snapshot data file              Win01-A.vmsd
Snapshot state file             Win01-A.vmsn
Snapshot disk descriptor file   Win01-A-000001.vmdk
Snapshot disk data file         Win01-A-000001-delta.vmdk
Snapshot memory state file      Win01-A-000001.vmem

To troubleshoot common virtual machine issues, you must understand what the underlying virtual
machine files are used for. Sometimes resolving a virtual machine problem requires fixing one or
more of the virtual machine's files.
The example lists the files that make up a virtual machine named Win01-A. Except for the log files,
the name of each file starts with the virtual machine's name.
If the virtual machine has more than one disk file, the file pair for the second disk file and later is
named Win01-A_#.vmdk and Win01-A_#-flat.vmdk, where # is the next number in the
sequence, starting with 1.
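
The naming rule for disk file pairs can be sketched in Python as follows. The function name `disk_file_pairs` is hypothetical, and the sketch covers only the descriptor and flat data files described above.

```python
def disk_file_pairs(vm_name: str, disk_count: int) -> list[tuple[str, str]]:
    """Return (descriptor, data) file names for each virtual disk of a VM."""
    pairs = []
    for index in range(disk_count):
        # The first disk has no suffix; later disks use _1, _2, and so on.
        suffix = "" if index == 0 else f"_{index}"
        pairs.append((f"{vm_name}{suffix}.vmdk", f"{vm_name}{suffix}-flat.vmdk"))
    return pairs
```

For a virtual machine named Win01-A with two disks, the second pair is Win01-A_1.vmdk and Win01-A_1-flat.vmdk.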



Disk Content IDs
Slide 7-6

A content ID (CID) resides in the disk descriptor file of every virtual
machine for integrity and state tracking.

Win01-A.vmdk is the parent of Win01-A-000001.vmdk.
Win01-A-000001.vmdk is the parent of Win01-A-000002.vmdk.

Win01-A.vmdk

    # Disk DescriptorFile
    version=1
    encoding="UTF-8"
    CID=1eb89935
    parentCID=ffffffff
    isNativeSnapshot="no"
    createType="vmfs"

Win01-A-000001.vmdk

    # Disk DescriptorFile
    version=1
    encoding="UTF-8"
    CID=daab58f2
    parentCID=1eb89935
    isNativeSnapshot="no"
    createType="vmfsSparse"
    parentFileNameHint="Win01-A.vmdk"
    # Extent description

Win01-A-000002.vmdk

    # Disk DescriptorFile
    version=1
    encoding="UTF-8"
    CID=5a3e7ab4
    parentCID=daab58f2
    isNativeSnapshot="no"
    createType="vmfsSparse"
    parentFileNameHint="Win01-A-000001.vmdk"
    # Extent description

A virtual machine disk descriptor file details the basic geometry, format, or other identification and
handling for a virtual disk. If a virtual machine has snapshots, a disk descriptor file exists for each
snapshot file, also referred to as a delta file. The disk descriptor file also contains the CID. The CID
value of each disk descriptor file helps to ensure that the content in its parent disk descriptor file is
retained in a consistent state.
In the example, the virtual disk descriptor files and their relationships are as follows:
• Win01-A.vmdk: Descriptor file for the base virtual disk

• Win01-A-000001.vmdk: Descriptor file for the first snapshot. The parent file of this snapshot
file is the base disk descriptor file, Win01-A.vmdk.
• Win01-A-000002.vmdk: Descriptor file for the second snapshot. The parent file of this
snapshot is the first snapshot's disk descriptor file, Win01-A-000001.vmdk.
CIDs that are referenced correctly in the descriptor files confirm the integrity of the snapshot chain.
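
The chain check described above can be sketched in Python. This is a minimal illustration, assuming the descriptor text has already been read into memory; the function names are hypothetical and the sketch looks only at the CID, parentCID, and parentFileNameHint entries.

```python
import re

FIELD = re.compile(
    r'^(CID|parentCID|parentFileNameHint)\s*=\s*"?([^"\n]*)"?\s*$', re.M
)


def parse_descriptor(text: str) -> dict[str, str]:
    """Extract CID, parentCID, and parentFileNameHint from descriptor text."""
    return {key: value.strip() for key, value in FIELD.findall(text)}


def find_cid_mismatches(descriptors: dict[str, str]) -> list[str]:
    """Return descriptor files whose parentCID differs from the parent's CID.

    `descriptors` maps a file name to the text of that descriptor file.
    A child whose parent is not in `descriptors` is also reported.
    """
    parsed = {name: parse_descriptor(text) for name, text in descriptors.items()}
    mismatched = []
    for name, fields in parsed.items():
        parent = fields.get("parentFileNameHint")
        if parent is None:
            continue  # base disk: no parent to compare against
        if parsed.get(parent, {}).get("CID") != fields.get("parentCID"):
            mismatched.append(name)
    return mismatched
```

An empty result means that every child descriptor references the current CID of its parent, which is the condition for an intact snapshot chain.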



Virtual Machine Problem 1
Slide 7-7

CID mismatch conditions are triggered by several factors:


• Interruptions to vSphere vMotion migrations
• VMware software errors

As an initial check in this situation, view the virtual machine's
vmware.log file to identify the specific disk chain affected.

When a CID mismatch occurs, the virtual machine name is provided in the error message, but you
must identify the following information:
• Which virtual machine disk or disks are affected
• Which specific disk descriptor files are affected
• The cause of the mismatch, or what changes occurred
The virtual machine's log file, vmware.log, is in the virtual machine's directory on the ESXi host,
/vmfs/volumes/datastore-name/virtual-machine-name.



CID Mismatch Example
Slide 7-8

In this example, the parentCID descriptor in Win01-A-000002.vmdk
is incorrect.

Win01-A.vmdk

    # Disk DescriptorFile
    version=1
    encoding="UTF-8"
    CID=1eb89935
    parentCID=ffffffff
    isNativeSnapshot="no"
    createType="vmfs"

Win01-A-000001.vmdk

    # Disk DescriptorFile
    version=1
    encoding="UTF-8"
    CID=daab58f2
    parentCID=1eb89935
    isNativeSnapshot="no"
    createType="vmfsSparse"
    parentFileNameHint="Win01-A.vmdk"
    # Extent description

Win01-A-000002.vmdk

    # Disk DescriptorFile
    version=1
    encoding="UTF-8"
    CID=5a3e7ab4
    parentCID=ebbc69a3
    isNativeSnapshot="no"
    createType="vmfsSparse"
    parentFileNameHint="Win01-A-000001.vmdk"
    # Extent description

If a content ID (CID) mismatch is found, tasks are prevented from running on the virtual machine.
In effect, a CID mismatch ensures that deviance from the original disk state results in all dependent
child delta content being invalidated. Stored data is protected from further potential corruption.
In the example, a CID mismatch error will be generated because the parentCID descriptor in the
descriptor file, Win01-A-000002.vmdk, does not match the CID of the parent descriptor file,
Win01-A-000001.vmdk.



Resolving a CID Mismatch
Slide 7-9

To resolve the problem, make a backup of the disk descriptor files that
require editing and use a text editor to correct the mismatch.

Win01-A-000001.vmdk

    # Disk DescriptorFile
    version=1
    encoding="UTF-8"
    CID=daab58f2
    parentCID=1eb89935
    isNativeSnapshot="no"
    createType="vmfsSparse"
    parentFileNameHint="Win01-A.vmdk"
    # Extent description

Win01-A-000002.vmdk

    # Disk DescriptorFile
    version=1
    encoding="UTF-8"
    CID=5a3e7ab4
    parentCID=ebbc69a3
    isNativeSnapshot="no"
    createType="vmfsSparse"
    parentFileNameHint="Win01-A-000001.vmdk"
    # Extent description

Verify that the CID mismatch was corrected by running the following
command against the highest-level snapshot:
vmkfstools -e Win01-A-000002.vmdk
If you see failure messages, the CID mismatch was not corrected.

With the symptom determined, make backup copies of the disk descriptor files that require
corrections or editing. In the example, make a backup copy of Win01-A-000002.vmdk. Then edit
the value of parentCID to match the CID in Win01-A-000001.vmdk.
For information about how to resolve a CID mismatch error, see VMware knowledge base article
1007969 at http://kb.vmware.com/kb/1007969.
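
The backup-then-edit step can be sketched in Python. This is an illustration only, assuming the descriptor files are reachable as ordinary text files; the function names and the `.bak` backup suffix are hypothetical choices, not part of any VMware tool.

```python
import re
import shutil
from pathlib import Path


def read_field(text: str, key: str) -> str:
    """Return the value of an unquoted key=value line such as CID=..."""
    match = re.search(rf"^{key}=(\S+)\s*$", text, re.M)
    if match is None:
        raise ValueError(f"{key} not found")
    return match.group(1)


def fix_parent_cid(child: Path, parent: Path) -> Path:
    """Back up `child`, then set its parentCID to the CID of `parent`.

    Returns the path of the backup copy.
    """
    backup = child.with_name(child.name + ".bak")
    shutil.copy2(child, backup)  # keep the original descriptor
    parent_cid = read_field(parent.read_text(), "CID")
    fixed = re.sub(
        r"^parentCID=\S+",
        f"parentCID={parent_cid}",
        child.read_text(),
        count=1,
        flags=re.M,
    )
    child.write_text(fixed)
    return backup
```

After an edit like this, the chain would still need to be verified with the vmkfstools check shown on the slide.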



Virtual Machine Problem 2
Slide 7-10

A virtual machine generating heavy I/O workload might encounter issues
when quiescing before a snapshot operation.
Quiescing can be done by using the following technologies:
• Microsoft Volume Shadow Copy Service (VSS)
• The VMware Tools SYNC driver

As an initial check, verify that you can create a manual, nonquiesced
snapshot by using Snapshot Manager.

VMware products might require file systems in a guest operating system to be quiesced before a
snapshot operation for the purposes of backup and data integrity.
Services that are known to generate heavy I/O workload include Exchange, Active Directory, LDAP,
and MS-SQL.
The quiescing operation can be done by Microsoft Volume Shadow Copy Service (VSS). VSS is
provided by Microsoft in its operating systems since Windows Server 2003. Before verifying a VSS
quiescing issue, ensure that you are able to create a manual nonquiesced snapshot with the Snapshot
Manager in vSphere Web Client.
The quiescing operation can also be done by an optional VMware Tools™ component called the
SYNC driver.



Resolving Quiesced Snapshot Failure
Slide 7-11

When quiescing with VSS, the following conditions should be met:


• VSS prerequisites are met.
• Appropriate services are running and startup types are correct.
• The VSS provider is used.
• All the VSS writers are stable and not reporting errors.

When VSS is invoked, all VSS providers must be running. If a problem occurs with third-party
software providers or the VSS service itself, the snapshot operation might fail.
For detailed steps on how to troubleshoot VSS quiescing problems, see VMware knowledge base
article 1007696 at http://kb.vmware.com/kb/1007696.



Virtual Machine Problem 3
Slide 7-12

If you cannot create or commit a snapshot, multiple initial checks are
available:
• Verify that the virtual disk type is supported for snapshots:
- RDM in physical mode, independent disks, or virtual machines configured with
bus sharing are not supported.
• If you have reached 32 levels of snapshots (which includes the base disk), you
cannot create more snapshots:
- In vSphere Web Client, verify this limit with Manage Snapshots.

The limitations of snapshots include the following items:


• VMware does not support snapshots of raw disks, RDM physical mode disks, or guest
operating systems that use an iSCSI initiator in the guest.
• Virtual machines with independent disks must be powered off before you take a snapshot.
Snapshots of powered-on or suspended virtual machines with independent disks are not
supported.
• Snapshots are not supported when you are using VMware vSphere® DirectPath I/O™ to
directly access PCI devices.
• VMware does not support snapshots of virtual machines configured for bus sharing. If you
require bus sharing, consider running backup software in your guest operating system as an
alternative solution. If your virtual machine has snapshots that prevent you from configuring
bus sharing, delete (consolidate) the snapshots.
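
The initial checks above can be collected into a small Python sketch. The `DiskInfo` class and its mode strings are hypothetical stand-ins for information you would gather from the virtual machine's configuration; they are not vSphere API objects.

```python
from dataclasses import dataclass

MAX_CHAIN_LENGTH = 32  # snapshot levels, including the base disk


@dataclass
class DiskInfo:
    # Hypothetical summary of one virtual disk.
    mode: str          # "dependent", "independent", or "rdm-physical"
    chain_length: int  # base disk plus existing snapshot levels


def snapshot_blockers(disks: list[DiskInfo], bus_sharing: bool,
                      powered_off: bool) -> list[str]:
    """Return the reasons why a snapshot cannot be created, if any."""
    reasons = []
    if bus_sharing:
        reasons.append("bus sharing is configured")
    for disk in disks:
        if disk.mode == "rdm-physical":
            reasons.append("RDM in physical mode")
        if disk.mode == "independent" and not powered_off:
            reasons.append("independent disk on a VM that is not powered off")
        if disk.chain_length >= MAX_CHAIN_LENGTH:
            reasons.append("snapshot chain limit reached")
    return reasons
```

An empty list means that none of the limitations listed above applies; it does not rule out the permission, descriptor, or datastore problems covered on the following slides.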



Identifying Possible Causes
Slide 7-13

When you cannot create or commit snapshots, take a top-down
approach to troubleshooting, starting with vCenter Server permissions,
checking the virtual machine's files, and then checking the ESXi host's
datastores.

Possible Causes

vCenter Server    The user does not have permission to create or commit
                  snapshots.
Virtual Machine   The -delta.vmdk file does not have an associated
                  descriptor file.
ESXi Host         The snapshot file size reached a maximum not supported by
                  the datastore.

Creating or committing snapshots involves several layers. Users must have proper permissions, the
virtual machine's files must be in place, and the datastore on which the virtual machine and its
snapshots are stored must be in working order.



Possible Cause: No Permissions to Create Snapshots
Slide 7-14

The user might not have permission to create or commit snapshots.


To verify whether this is the case, check the user's permissions in
vCenter Server.
To resolve this issue, perform one of the following tasks:
• Give the user the virtual machine power-user role.
• Create a custom role that includes snapshot management permissions and
assign that role to the user.
• Grant the user permission to write to the datastore.

The authorization to perform tasks in vCenter Server is governed by an access control system. This
system enables the vCenter Server administrator to specify in great detail which users or groups can
perform which tasks on which objects. The access control system is defined with privileges, roles,
users and groups, and objects. Together, privileges, roles, users and groups, and objects define
permissions.
The vCenter Server roles, administrator and virtual machine power user, give a user the permission
to create and manage snapshots. To create and manage snapshots, you must verify that you have one
of these roles or an equivalent custom role associated with your user account.



Possible Cause: Missing Delta Descriptor File
Slide 7-15

You cannot create or commit a snapshot if a snapshot (delta) file does
not have an associated descriptor file:
• For example, examplevm-000001-delta.vmdk is missing its
corresponding descriptor file, examplevm-000001.vmdk.

The best practice is to restore from backups if snapshot corruption of
any kind exists.
If you do not have a valid backup and if the -delta.vmdk file has no
descriptor file, then you must create one:
1. Copy the base disk descriptor file, giving it the name of the missing
descriptor file.
2. Edit the new descriptor file to change its format from a base disk to a
snapshot delta disk descriptor.

The best practice is always to restore a virtual machine from backup if there is corruption in the
snapshots.
Snapshot data files are based on the vmfsSparse format. The vmfsSparse format is intended to
store delta content (changed content) for a period of time as a snapshot.
A missing delta descriptor file might occur if you delete the delta descriptor file or files of a virtual
machine whose snapshots are not visible through the Manage Snapshots pane in vSphere Web
Client. You might accidentally delete the descriptor files by navigating to the datastore from the
command line and deleting the files.
If you find that you are missing a delta descriptor file, then you must create one. For detailed steps
on how to create the missing descriptor file, see VMware knowledge base article 1026353 at
http://kb.vmware.com/kb/1026353.



Possible Cause: Insufficient Space on Datastore
Slide 7-16

You cannot create or commit a snapshot if the space on the datastore is
insufficient to commit all snapshots.
To verify that enough space exists, identify the ESXi host and datastore
on which the snapshot files reside and run df -h.

/var/log # df -h
Filesystem   Size   Used Available Use% Mounted on

Multiple ways to resolve the issue of insufficient space are available:
• Increase the size of the datastore.
• Move virtual machines to datastores with sufficient space.
• Remove unneeded files and virtual machines from the datastore with
insufficient space.

In the example, the df command displays the available space (in GB) of the VMware vSphere®
VMFS5 datastores named Local01 and Shared.
If necessary, increase the size of the datastore by adding another extent to the datastore. You might
also consider making room on a datastore by moving virtual machines to other datastores with
available space. Also check to make sure there are no unused files or virtual machines on the
datastore with insufficient space.
For more information about verifying the free disk space for an ESX/ESXi virtual machine, see
VMware knowledge base article 1003755 at http://kb.vmware.com/kb/1003755.
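
If you capture the df -h output as text, a short Python sketch can flag datastores that are running low on space. The sample rows and the 90 percent limit below are assumptions made for illustration and are not values from the course environment.

```python
# Hypothetical df -h output; only the header line matches the slide.
SAMPLE = """\
Filesystem   Size   Used Available Use% Mounted on
VMFS-5      39.8G  36.9G      2.9G  93% /vmfs/volumes/Local01
VMFS-5      99.8G  41.2G     58.6G  41% /vmfs/volumes/Shared
"""


def parse_df(output: str) -> list[dict]:
    """Parse df -h text into one dictionary per file system row."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # ignore wrapped or malformed rows
        _fs, _size, _used, available, use_pct, mount = fields
        rows.append({
            "mount": mount,
            "available": available,
            "use_pct": int(use_pct.rstrip("%")),
        })
    return rows


def nearly_full(output: str, limit_pct: int = 90) -> list[str]:
    """Return mount points whose usage is at or above limit_pct."""
    return [row["mount"] for row in parse_df(output) if row["use_pct"] >= limit_pct]
```

How much free space a snapshot commit actually needs depends on the size of the delta files, so treat a percentage limit like this only as a first warning sign.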



Virtual Machine Problem 4
Slide 7-17

If a virtual machine fails to power on, begin by viewing the error
message, either displayed in vSphere Web Client or found in the virtual
machine log file, vmware.log:
• Look for the reasons why the power-on operation is failing.

If you are trying to power on a virtual machine from vSphere Web Client, view the Tasks tab and
Events tab of the virtual machine for the reasons why the power-on operation is failing.



Identifying Possible Causes
Slide 7-18

When a virtual machine does not power on, take a top-down approach to
troubleshooting. Start with the virtual machine files and then check the
ESXi host.

Possible Causes

Virtual Machine   One or more virtual machine files are missing.
                  One of the virtual machine files is locked.
ESXi Host         Insufficient resources exist on the ESXi host.
                  The ESXi host is not responsive.

If you monitor your ESXi hosts periodically and they are stable and working properly, then start
your troubleshooting by checking that the virtual machine's files are in place and working properly.



Possible Cause: Virtual Machine Files Missing
Slide 7-19

One or more of the virtual machine files might be missing.
Determine whether virtual machine files are missing:
ls /vmfs/volumes/Shared/Win01-B

Win01-B-baf70764.vswp   Win01-B.vmx.lck   vmware-13.log
Win01-B-flat.vmdk       Win01-B.vmxf      vmware-14.log
Win01-B.nvram                             vmware-9.log
Win01-B.vmdk            vmware-10.log     vmware.log
Win01-B.vmsd            vmware-11.log     vmx-Win01-B-3136751460-1.vswp
                        vmware-12.log

To resolve this problem, restore the missing file or files from your last
backup. If a descriptor file is missing, recreate the descriptor file
manually.

A best practice is to make regular backups of your virtual machine files so that if files are
accidentally deleted from disk, you can restore them.
If the virtual machine's configuration (.vmx) file is missing, you can restore it from a backup. Or
you can recreate the file by recreating the virtual machine and pointing to the existing virtual disk
files. For more information about recovering from a missing virtual machine configuration file, see
VMware knowledge base article 1002294 at http://kb.vmware.com/kb/1002294.
If the virtual machine's virtual disk (-flat.vmdk) files are missing, then you must restore these
files from a backup. If the virtual machine's virtual disk descriptor (.vmdk) files are missing, then
you can restore them from a backup or you can manually recreate the missing descriptor file.
For detailed steps about how to recreate a missing virtual machine disk descriptor file, see VMware
knowledge base article 1002511 at http://kb.vmware.com/kb/1002511.
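
A directory listing can be compared against the files that a virtual machine needs in order to power on. The following Python sketch is a minimal illustration, assuming a single-disk virtual machine without RDMs; the function name and the choice of which files count as "core" are assumptions.

```python
def missing_core_files(vm_name: str, listing: list[str]) -> list[str]:
    """Return the core files that are absent from a VM directory listing.

    Assumption: the VM has one virtual disk, so the core files are the
    configuration file, the disk descriptor, and the disk data file.
    """
    expected = [
        f"{vm_name}.vmx",        # configuration file
        f"{vm_name}.vmdk",       # disk descriptor file
        f"{vm_name}-flat.vmdk",  # disk data file
    ]
    present = set(listing)
    return [name for name in expected if name not in present]
```

A missing -flat.vmdk file in the result means a restore from backup, whereas a missing .vmx or descriptor file can also be recreated as described above.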



Possible Cause: Virtual Machine File Locked
Slide 7-20

A virtual machine will not power on if one of the virtual machine files, for
example, the virtual machine disk file, is locked.
Perform these steps to find a locked file:
1. Power on a virtual machine:
- If the power-on fails, the error message identifies the affected file.
2. Determine whether the file can be locked:
- touch filename
3. Determine which ESXi host has locked the file:
- vmkfstools -D /vmfs/volumes/Shared/Win01-B/Win01-B-flat.vmdk

Hostname vmkernel: 17:00:38:46.977 cpu1:1033)Lock [type 10c00001 offset 58048 v 20, hb offset 3499520
Hostname vmkernel: gen 532, mode 1, owner 45feb537-9c52009b-e812-00137266e200 mtime 1174669462]
Hostname vmkernel: 17:00:38:46.977 cpu1:1033)Addr <4, 136, 2>, gen 19, links 1, type reg, flags 0x0, uid 0, gid 0, mode 600
Hostname vmkernel: 17:00:38:46.977 cpu1:1033)len 297795584, nb 142 tbz 0, zla 1, bs 2097152

To prevent concurrent changes to critical virtual machine files and file systems, ESXi hosts establish
locks on these files. In certain circumstances, these locks might not be released when the virtual
machine is powered off. The files cannot be accessed by the ESXi hosts while locked, and the
virtual machine is unable to power on.
The virtual machine's disk files (-flat.vmdk) and delta disk files (-delta.vmdk) are files that are
commonly locked during runtime.
You can use the vmkfstools command to determine which ESXi host has locked the file. The
second line shows the MAC address. In the example, the MAC address is 00:13:72:66:E2:00. Log in
to the ESXi host with this MAC address and identify the process that is maintaining a lock on the
virtual machine's file.
You might encounter the following types of locks:
• Mode 0: No lock on file.
• Mode 1: Exclusive lock. Only one specific host or process can access and update the file. Other
hosts or processes have no read permission.
• Mode 2: Read-only lock. All hosts can read the file but no one can update or modify it.
For more information about troubleshooting locked virtual disks, see VMware knowledge base
article 2107795 at http://kb.vmware.com/kb/2107795.
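
The lock mode and the owning host's MAC address can be pulled out of the vmkfstools -D output with a short Python sketch. It assumes the output follows the format of the example on the slide, where the last segment of the owner field is the MAC address; the function name is hypothetical.

```python
import re

LOCK_MODES = {0: "no lock", 1: "exclusive lock", 2: "read-only lock"}

OWNER = re.compile(
    r"mode (\d+), owner [0-9a-f]+-[0-9a-f]+-[0-9a-f]+-([0-9a-f]{12})"
)


def parse_lock_owner(output: str) -> tuple[int, str]:
    """Return (lock mode, MAC address of the host that owns the lock)."""
    match = OWNER.search(output)
    if match is None:
        raise ValueError("no lock owner found in output")
    mode = int(match.group(1))
    raw = match.group(2)
    # Insert a colon after every two hex digits: 00137266e200 -> 00:13:72:...
    mac = ":".join(raw[i:i + 2] for i in range(0, 12, 2)).upper()
    return mode, mac
```

Applied to the second line of the example output, this yields mode 1 (an exclusive lock) and the MAC address 00:13:72:66:E2:00.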



Resolving a Locked Virtual Machine File
Slide 7-21

The following steps are an alternative way to resolve a locked virtual
machine file:
1. Determine which ESXi host has a network adapter with the MAC address
from the vmkfstools output.
2. Identify the process that is holding the lock.
lsof | grep name_of_locked_file
3. Kill the process that is locking the file:
- If the file is being accessed by a running virtual machine, the lock cannot be taken
away or removed.
If you still cannot determine which process or entity has the virtual machine file
locked, perform the following tasks:
1. Migrate all virtual machines from the ESXi host that created the lock to
another ESXi host.
2. Reboot the ESXi host that created the lock.

On the ESXi host command line, use the lsof command to determine which process is locking the
file. Before terminating the process, understand what that process is used for.
Use the kill command to stop the process. Using the kill command abruptly terminates all the
running processes for the virtual machine without generating a core dump to analyze the status later.
Carefully consider the consequences before using this command if you decide to troubleshoot the
virtual machine state.
The ESXi host holding the lock might be running the virtual machine and might have become
unresponsive. Alternatively, another running virtual machine might have had the disk incorrectly
added to its configuration before the power-on attempts.
If you have already completed your investigation and you still cannot determine which process is
locking the file, then restart the ESXi host to allow the virtual machine to be powered on again.
For detailed steps on investigating virtual machine file locks on ESXi, see VMware knowledge base
article 10051 at http://kb.vmware.com/kb/10051.



Possible Cause: Insufficient Resources on ESXi Host
Slide 7-22

A virtual machine will not power on if the ESXi host has insufficient
resources.
To determine if sufficient resources exist, check CPU, memory, network
resource, and storage resource availability on the ESXi host.
To resolve this issue, perform one of the following tasks:
• Decrease the CPU and memory reservations on the virtual machine.
• Add more resources to the ESXi host, cluster, or resource pool.

Periodically check that the ESXi hosts in your cluster have enough CPU, memory, and storage to
accommodate your existing virtual machines as well as any new virtual machines.



Possible Cause: ESXi Host Unresponsive
Slide 7-23

If the ESXi host is unresponsive, a virtual machine will not power on.
Determine whether the host has crashed or is hanging:
• If the host is hanging, you cannot perform the following tasks:
- Ping the VMkernel network interface.
- Get a response to queries from VMware Host Client.
- Monitor network traffic from the ESXi host and its virtual machines.
• If the host has crashed, you will see the purple crash screen on the ESXi
console.

An unresponsive host is a serious problem and can cause your virtual machines to fail to power on.



Review of Virtual Machine Connection States
Slide 7-24

In vCenter Server, the virtual machine can have one of several
connectivity states:
• Connected: vCenter Server has access to the virtual machine.
• Disconnected: vCenter Server is disconnected from the virtual machine,
because its host is disconnected.
• Inaccessible: One or more of the virtual machine configuration files are
inaccessible.
• Invalid: The virtual machine configuration format is invalid.
• Orphaned: The virtual machine is no longer registered on the host with which it
is associated.

A virtual machine can become inaccessible due to transient disk failures.


An invalid virtual machine is accessible on disk but is corrupted in a way that prevents vCenter
Server from reading its content. In this state, no configuration can be returned for a virtual machine.
A virtual machine that is unregistered or deleted directly on a host managed by vCenter Server
becomes orphaned.



Virtual Machine Problem 5
Slide 7-25

If a virtual machine appears as invalid or orphaned, begin by determining
whether vCenter Server was restarted while a migration was in progress.
This situation causes the virtual machine being migrated to be listed as
orphaned, but the state is only temporary.

As a first check, use the event viewer on the vCenter Server system to determine whether the
VirtualCenter Server service was restarted at the time a migration was in progress. During startup,
vCenter Server reconnects to all hosts. If a migration completed while vCenter Server was down, a
virtual machine can be reported as an orphan until vCenter Server establishes a connection to the
virtual machine's target host.



Identifying Possible Causes
Slide 7-26

When a virtual machine is invalid or orphaned, take a top-down approach
to troubleshooting, starting with migration operations initiated in vCenter
Server, then the virtual machine, and then the ESXi host.

Possible Causes

vCenter Server    A vSphere vMotion or vSphere DRS migration occurred.
Virtual Machine   A virtual machine is deleted outside of vCenter Server.
                  A .vmx file contains special characters or incomplete
                  line-item entries.

If you monitor your ESXi hosts periodically and they are stable and working properly, then start
your troubleshooting from the top down.
For information about what orphaned virtual machines are, how they occur, and how you can fix
them, see VMware knowledge base article 1003742 at http://kb.vmware.com/kb/1003742.



Possible Cause: vSphere vMotion or vSphere DRS Migration
Occurred
Slide 7-27

A virtual machine appears as invalid or orphaned if a vSphere vMotion
migration or vSphere DRS migration occurred.
To determine whether a migration occurred, view the cluster Tasks tab.
Check where the orphaned virtual machine is registered: on the source
host or the destination host.
If the virtual machine is registered, restart the ESXi host management
services.
If the virtual machine is not registered, either register the virtual machine
or create a virtual machine with the original virtual disks.

In vCenter Server, you might find that you have a virtual machine that has an orphan designation or
has become invalid. An orphan virtual machine is one that exists in the vCenter Server database but
is no longer present on the ESXi host. A virtual machine also appears as orphaned if it exists on a
different ESXi host than the ESXi host expected by vCenter Server.
A virtual machine can have a status of invalid or orphaned after a vSphere vMotion migration or a
vSphere DRS migration, although this type of occurrence is unlikely.
For detailed steps about how to recover after a vSphere vMotion or vSphere DRS migration has
caused the virtual machine to become an orphan, see VMware knowledge base article 1003742 at
http://kb.vmware.com/kb/1003742.



Possible Cause: VM Deleted Outside vCenter Server
Slide 7-28

A virtual machine appears as orphaned if it was deleted outside vCenter
Server.
Verify that the virtual machine files still exist:
ls /vmfs/volumes/Shared/Win01-B

If the configuration file was deleted and the virtual disk remains, recreate
the virtual machine:
• Attach the original virtual disks to a new .vmx file.
If the virtual disk file was deleted, restore virtual machine files from your
last backup.

If vCenter Server is down, a user can still delete a virtual machine. A user can delete a virtual
machine from the command line at the ESXi host, by using vSphere Management Assistant, or by
using vSphere Client directly connected to an ESXi host.
If the configuration file was deleted and the virtual disk remains, you can recreate the virtual
machine by using vSphere Management Assistant or vSphere Web Client. Attach the existing virtual
disk to a new .vmx file.
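The decision described above (recreate the configuration file or restore from backup) depends on which files remain in the virtual machine directory. The following is a minimal sketch of that check. The function name vm_recovery_action and its three output labels are hypothetical, chosen only for illustration, and the check looks only at file extensions, not at file contents.

```shell
# Hypothetical helper: report which recovery path applies to a VM directory.
# Prints one of: intact, recreate-vmx, restore-from-backup.
vm_recovery_action() {
    dir=$1
    has_vmx=no
    has_disk=no
    for f in "$dir"/*.vmx; do
        if [ -f "$f" ]; then has_vmx=yes; fi
    done
    for f in "$dir"/*.vmdk; do
        if [ -f "$f" ]; then has_disk=yes; fi
    done
    if [ "$has_disk" = no ]; then
        # No virtual disk left: only a backup can bring the VM back.
        echo "restore-from-backup"
    elif [ "$has_vmx" = no ]; then
        # Disk survives but the configuration file is gone.
        echo "recreate-vmx"
    else
        echo "intact"
    fi
}
# Example call, using the directory from the slide:
#   vm_recovery_action /vmfs/volumes/Shared/Win01-B
```

A result of recreate-vmx corresponds to attaching the existing virtual disk to a new .vmx file. A result of restore-from-backup corresponds to restoring the virtual machine files from the last backup.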



Possible Cause: Special Characters in the .vmx File
Slide 7-29

A virtual machine appears as invalid if the virtual machine configuration
file (.vmx) contains special characters or incomplete line-item entries.
Using a text editor, check for lines that appear incomplete or that contain
trailing spaces or special characters in the .vmx file.
To resolve this problem:
1. Fix the .vmx file by performing one of these steps:
- Using a text editor, delete the partial lines and remove any trailing spaces or special
characters.
- Restore the .vmx file from your last backup.
2. Remove the virtual machine from the inventory.
3. Add the virtual machine to the inventory.

After completing the steps on the slide, try to power on the virtual machine. If you are prompted
with a question about the virtual machine, accept the default response. The virtual machine should
then power on normally.
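Before editing a configuration file by hand, it can help to list the suspicious lines first. The following is a minimal sketch that flags trailing spaces, characters outside printable ASCII, and lines that do not look complete. The function name check_vmx is hypothetical, and the expected line shape (key = "value") is an assumption made for illustration rather than a full description of the .vmx format.

```shell
# Print suspicious lines of a .vmx file as reason:line-number.
# Assumption: a complete entry has the shape  key = "value".
check_vmx() {
    LC_ALL=C awk '
        /^[ \t]*$/ { next }                          # skip blank lines
        /^#/       { next }                          # skip comment-style lines
        /[ \t]$/   { print "trailing-space:" NR; next }
        /[^ -~]/   { print "special-char:" NR; next }
        !/^[A-Za-z0-9_.:-]+ = ".*"$/ { print "incomplete:" NR }
    ' "$1"
}
# Example call:
#   check_vmx /vmfs/volumes/Shared/Win01-D/Win01-D.vmx
```

An empty result means that none of the three checks matched. It does not prove that the file is valid.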



Recovering from an Invalid or Orphaned Virtual Machine
Slide 7-30

For the majority of problems related to orphaned virtual machines on
ESXi, multiple resolutions are available:
1. Reregister the virtual machines:
a. Remove the virtual machine from the inventory.
b. Add the virtual machine to the inventory again.
2. If the virtual machines are still in the invalid or orphaned state, restart the host
management services on the ESXi host.

To restart the host management services, run the following commands at the ESXi host
command line:
• /etc/init.d/hostd restart
• /etc/init.d/vpxa restart
Alternatively, use the /sbin/services.sh restart command.
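The two restart commands can be wrapped so that the second agent is restarted only if the first restart succeeds. The following is a minimal sketch. The function name restart_mgmt_agents and its optional directory argument are hypothetical; the argument exists only so that the function can be tried against stub scripts before it is used on a host.

```shell
# Restart the ESXi management agents in the order given in the text.
# init_dir defaults to /etc/init.d; pass another directory only for dry runs.
restart_mgmt_agents() {
    init_dir=${1:-/etc/init.d}
    for svc in hostd vpxa; do
        if [ ! -x "$init_dir/$svc" ]; then
            echo "missing: $init_dir/$svc" >&2
            return 1
        fi
        # Stop at the first agent that fails to restart.
        "$init_dir/$svc" restart || return 1
    done
}
# Example call on a host:
#   restart_mgmt_agents
```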



Virtual Machine Problem 6
Slide 7-31

If VMware Tools fails to install, verify that the guest operating system is
supported by VMware.
Check the release notes or the knowledge base for any known issues
with installing VMware Tools in certain guest operating systems.

As an initial check, verify that you are using a guest operating system that is supported. For more
information about supported guest operating systems, see VMware Compatibility Guide at
http://www.vmware.com/resources/compatibility.



Identifying Possible Causes
Slide 7-32

When VMware Tools installation fails, take a top-down approach to
troubleshooting, starting with your virtual machine configuration and then
checking the ISO image configuration on the ESXi host.

Possible Causes

Virtual Machine:
• An incorrect guest OS is selected for the virtual machine.
ESXi Host:
• The correct VMware Tools ISO image is not being loaded.
• The VMware Tools ISO image cannot be found.
• The VMware Tools ISO image is corrupt.

Start your troubleshooting with the more probable cause of an incorrect guest operating system
configuration on the virtual machine. Then check for less likely causes that involve the ISO image
on the ESXi host.
For details on troubleshooting a failed VMware Tools installation in a guest operating system, see
VMware knowledge base article 1003908 at http://kb.vmware.com/kb/1003908.



Possible Cause: Wrong Guest Operating System
Slide 7-33

If VMware Tools installation fails, the wrong guest operating system
might have been configured for the virtual machine.
In vSphere Web Client, view the guest operating system settings under
VM Options. Ensure that the correct guest operating system is selected.

[Screenshot: vSphere Web Client, Manage > Settings > VM Options for Win01-D. General Options lists the VM name, VM config file, VM working location, and guest operating system (Windows).]

The VMware Tools ISO image that is loaded when you select Install/Upgrade VMware Tools is
determined by the guest operating system type and version that you selected when you created the
virtual machine. Before you install a guest operating system, ensure that the selected type and
version match the operating system that you intend to install.
For example, select Microsoft Windows Server 2008 (64-bit) instead of Microsoft Windows Server
2008 (32-bit) if the version that you are installing is 64-bit. Otherwise, the operating system
installation, the VMware Tools installation, or both might fail.



Possible Cause: ISO Image Not Being Loaded
Slide 7-34

If an incorrect VMware Tools ISO image is being loaded, VMware Tools
installation will fail.
Verify that the ISO image is connected to CD/DVD drive 1.
To resolve the problem:
• Manually start the VMware Tools installer.
• Manually connect the correct ISO image to the virtual machine.

[Screenshot: vSphere Web Client, Manage > Settings > VM Hardware for Win02-A. CD/DVD drive 1 is connected to the host image file /usr/lib/vmware/isoimages/windows.iso.]

VMware Tools ISO images are on the ESXi host in the /usr/lib/vmware/isoimages directory.
This directory contains the following files:
• linux.iso
• WinPreVista.iso
• windows.iso
Other VMware Tools ISO images are available for download from http://www.myvmware.com.
When the ISO image is connected to the CD/DVD drive of the virtual machine, the VMware Tools
installation starts automatically.
If the VMware Tools installation fails to start automatically, you can manually start the VMware
Tools installer from the guest operating system. For example, for Windows, go to the drive where
the first CD/DVD drive is configured for your virtual machine and run setup.exe.
For more information on installing VMware Tools manually from the ISO image, see VMware
knowledge base article 1003910 at http://kb.vmware.com/kb/1003910.



Possible Cause: ISO Image Cannot Be Found
Slide 7-35

If the VMware Tools ISO image cannot be found, VMware Tools
installation will fail.
Verify that ISO images exist on the ESXi host:
• ISO images are in /usr/lib/vmware/isoimages.
• This directory is a symbolic link to the /productLocker/vmtools directory.
To resolve this problem, ensure that the
/usr/lib/vmware/isoimages directory is correctly linked to the
/productLocker/vmtools directory.

lrwxrwxrwx 1 root root 22 Feb 19 00:39 -> /productLocker/vmtools

lrwxrwxrwx 1 root root 4 Aug 2 2012 -> /lib

VMware Tools ISO images are on the ESXi host in the /usr/lib/vmware/isoimages directory.
The isoimages directory is a symbolic link to the /productLocker/vmtools directory.
In the rare occurrence that the symbolic link does not exist, use the following command line to
recreate the link:
ln -s /productLocker/vmtools /usr/lib/vmware/isoimages
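The check and the repair can be combined so that the link is recreated only when it is actually missing. The following is a minimal sketch. The function name ensure_tools_link and its output labels are hypothetical. An existing path that is not the expected link is reported and left untouched.

```shell
# Verify that link_path is a symbolic link to target; recreate it if missing.
# Prints ok, recreated, or mismatch.
ensure_tools_link() {
    target=$1
    link_path=$2
    if [ -L "$link_path" ]; then
        if [ "$(readlink "$link_path")" = "$target" ]; then
            echo ok
        else
            echo mismatch      # a link exists but points somewhere else
            return 1
        fi
    elif [ -e "$link_path" ]; then
        echo mismatch          # a regular file or directory is in the way
        return 1
    else
        ln -s "$target" "$link_path" && echo recreated
    fi
}
# Example call with the paths from the text:
#   ensure_tools_link /productLocker/vmtools /usr/lib/vmware/isoimages
```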



Possible Cause: VMware Tools ISO Image Corrupt
Slide 7-36

A VMware Tools installation will fail if the VMware Tools ISO image is
corrupt.
To verify whether corruption has occurred, compare the checksum of the
suspect ISO image with the checksum of a known good ISO image.

/vmfs/volumes/4e5fc427-1e41de53-3780-0050562e0aa1/packages/5.1.0/vmtools # md5sum windows.iso
eabf6f843da3336ad3e825d13d3bf50e windows.iso
/vmfs/volumes/4e5fc427-1e41de53-3780-0050562e0aa1/packages/5.1.0/vmtools #

If the checksums differ, copy a known good ISO
image from another ESXi host to the /productLocker/vmtools directory
on the ESXi host with the corrupt image.

A corrupted VMware Tools ISO image can cause installation failures on your guest operating
system. Verifying that your ISO image is valid is key to a successful installation.
Use the md5sum command to calculate file checksums.
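The comparison can be scripted once the known good image is reachable from the same shell, for example after it has been copied to a scratch location. The following is a minimal sketch. The function name same_md5 and the example paths are hypothetical.

```shell
# Succeed only if both files exist and have the same MD5 checksum.
same_md5() {
    sum_a=$(md5sum "$1" 2>/dev/null | awk '{ print $1 }')
    sum_b=$(md5sum "$2" 2>/dev/null | awk '{ print $1 }')
    [ -n "$sum_a" ] && [ "$sum_a" = "$sum_b" ]
}
# Example call (the second path is a hypothetical copy of a known good image):
#   same_md5 /productLocker/vmtools/windows.iso /tmp/known-good/windows.iso \
#       || echo "windows.iso differs from the known good image"
```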



Lab 8: Troubleshooting Virtual Machine Problems
Slide 7-37

Identify, diagnose, and resolve virtual machine problems


1. Run a Break Script
2. Verify That the System Is Not Functioning Properly
3. Troubleshoot and Repair the Problem
4. Verify That the Problem Is Repaired



Review of Learner Objectives
Slide 7-38

You should be able to meet the following objectives:


• Discuss virtual machine files and disk content IDs
• Identify, analyze, and solve virtual machine snapshot problems
• Troubleshoot virtual machine power-on problems
• Identify possible causes and troubleshoot virtual machine connection state
problems
• Diagnose and recover from VMware Tools installation failures



Key Points
Slide 7-39

• A CID resides in each virtual machine's disk descriptor file for integrity and
state tracking.
• CID mismatch conditions can be caused by software errors or interruptions to
vSphere vMotion migrations.
• Virtual machine quiescing can be done by the Microsoft VSS or the VMware
Tools SYNC driver.
• If you cannot create a content library, check that you have the required content
library global permissions.
• When a virtual machine does not power on, check that there are sufficient
resources on the host, and virtual machine files are not missing or locked.
• For problems related to orphaned virtual machines on ESXi, reregistering the
virtual machines can return the virtual machines to a connected state.
• If VMware Tools installation fails, verify that the VMware Tools ISO image can
be loaded and is not corrupt.
Questions?



MODULE 8
Troubleshooting vCenter Server and ESXi
Slide 8-1
You Are Here
Slide 8-2

1. Course Introduction
2. Introduction to Troubleshooting
3. Troubleshooting Tools
4. Troubleshooting Virtual Networking
5. Troubleshooting Storage
6. Troubleshooting vSphere Clusters
7. Troubleshooting Virtual Machines
8. Troubleshooting vCenter Server and ESXi



Importance
Slide 8-3

Incorrect configuration of key components will lead to problems while
managing vCenter Server and ESXi hosts. You must correct all
configuration problems quickly to reestablish management control.



Learner Objectives
Slide 8-4

By the end of this module, you should be able to meet the following
objectives:
• Understand vSphere 6.x architecture and main components
• Troubleshoot authentication and certificate problems
• Analyze and solve vCenter Server service problems
• Diagnose and troubleshoot vCenter Server database problems
• Use vCenter Server Appliance shell and the Bash shell to identify and solve
problems
• Identify and troubleshoot ESXi host problems



Review of vSphere 6.x Deployment Modes
Slide 8-5

Multiple deployment modes are available:


• vCenter Server with an embedded Platform Services Controller
• vCenter Server with an external Platform Services Controller
VMware does not recommend using these deployment modes in
combination with each other.
Multiple Platform Services Controller instances can be used together
when used with a load balancer approved by VMware.

VMware Platform Services Controller™ provides infrastructure services for vCenter environments.
It hosts services that were previously installed as separate vCenter components:
• Lookup Service: Creates authenticated connections between multiple services endpoints from
the Platform Services Controller node.
• vCenter Single Sign-On service: Coordinates authentication credentials between vCenter Server
and other authentication endpoint services.
• VMware Certificate Authority: Provides vCenter Server components and ESXi hosts with
certificates and stores those certificates for authentication.
• License Service: Delivers centralized license management and reporting functionality to
vSphere and products that integrate with vSphere.
• Directory Service: Provides directory services associated with the vsphere.local domain.
The vCenter Server system provides the remainder of the vCenter Server services, including vCenter
Server, vSphere Web Client, Inventory Service, VMware vSphere® Auto Deploy™, VMware
vSphere® ESXi™ Dump Collector, and VMware vSphere® Syslog Collector or Syslog Service.



The number and type of vSphere components, the use of multiple VMware solutions together, and
the physical location of vCenter Server systems are the major factors in determining which
vCenter Server deployment mode to use.
For more information about the deployment modes, see vSphere Installation and Setup Guide at
http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-installation-setup-guide.pdf.



vCenter Server Deployment Options
Slide 8-6

vCenter Server can be deployed in different ways:
• vCenter Server configured with an embedded Platform Services
Controller.
• vCenter Server configured as a distributed vCenter Server instance
with an external Platform Services Controller.

[Diagram: an embedded deployment, with Platform Services Controller and vCenter Server in one vCenter Server Appliance, and an external deployment, with Platform Services Controller in its own vCenter Server Appliance and vCenter Server running on a vCenter Server Appliance and on Windows vCenter Server.]


Platform Services Controller Deployment Options
Slide 8-7

Deployment Models Recommended for Platform Services Controller in
Enhanced Linked Mode

[Diagram: Enhanced Linked Mode with an External Platform Services Controller Instance Without a Load Balancer]
[Diagram: Enhanced Linked Mode with an External Platform Services Controller Instance with a Load Balancer]



Review of vCenter Single Sign-On
Slide 8-8

vCenter Single Sign-On enables vSphere components to communicate
with one another for authentication purposes instead of requiring users to
authenticate separately with each component.

[Diagram: vCenter Single Sign-On, VMware CA, vCenter Server, and VMware Directory Service]

vCenter operations generally occur in the context of authenticated connections between the client,
vCenter Server, and other VMware product solutions. To support the requirements for secure
software environments, software components require authorization to perform operations on behalf
of a user. In a vCenter Single Sign-On environment, a user provides credentials once, and
components in the environment perform operations based on the original authentication.
A user logs in to the vSphere Web Client with a user name and password to access the vCenter Server
system or another vCenter service. The default user name for vSphere Web Client is
administrator@vsphere.local. Other user accounts can be granted access to sign on. A user can also
log in using Windows credentials by checking the Use Windows session authentication check box.
vSphere Web Client passes the login information to the vCenter Single Sign-On service, which
checks the SAML token of the vSphere Web Client. If the vSphere Web Client has a valid token,
vCenter Single Sign-On checks whether the user is in a configured identity source, for example
Active Directory (AD).
If no domain name is entered with the user name, vCenter Single Sign-On checks in the default
vCenter Single Sign-On domain, vsphere.local.
If a domain name is included with the user name (DOMAIN\user1 or user1@DOMAIN), vCenter
Single Sign-On checks that domain.
For more information about vCenter Single Sign-On, see vSphere Security Guide at
http://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-pubs.html.
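When a login fails, it can help to confirm which domain a given user name causes vCenter Single Sign-On to search. The following is a minimal sketch of the domain rules described above. It is a model for reasoning about login names, not part of any VMware tool, and the function name sso_domain_for is hypothetical.

```shell
# Print the domain that would be searched for a given login name.
# Rules from the text: DOMAIN\user, user@DOMAIN, otherwise the default domain.
sso_domain_for() {
    login=$1
    default_domain=${2:-vsphere.local}
    case $login in
        *\\*) printf '%s\n' "${login%%\\*}" ;;   # text before the backslash
        *@*)  printf '%s\n' "${login#*@}" ;;     # text after the at sign
        *)    printf '%s\n' "$default_domain" ;;
    esac
}
# Example calls:
#   sso_domain_for 'VCLASS\user1'
#   sso_domain_for administrator
```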
VMware CA
Slide 8-9

Platform Services Controller includes its own Certificate Authority named
VMware Certificate Authority.
VMware CA is the default root certificate authority that supplies
certificates to ensure secure communication between vCenter Server
and ESXi hosts.
By default, VMware CA provisions each ESXi host, each vCenter Server
service, each machine in the environment, and each solution user with a
certificate signed by VMware CA. You can change this default behavior.

[Diagram: a vCenter Server Appliance containing Platform Services Controller and vCenter Server, alongside ESXi hosts]



VMware Certificate Store
Slide 8-10

VMware Endpoint Certificate Store (VECS) serves as a local (client-side)
repository for certificates, private keys, and other certificate information
that can be stored in a keystore.
You can use the vSphere Certificate Manager command-line utility to
perform certificate replacement operations. In special cases, you can
replace certificates manually.

Platform Services Controller handles tasks such as single sign-on and licensing, and ships with its
own Certificate Authority called VMware CA. See VMware Certificate Authority Overview and
Using VMware CA Root Certificates in a Browser at
http://blogs.vmware.com/vsphere/2015/03/vmware-certificate-authority-overview-using-vmca-root-certificates-browser.html.
For more information about replacing certificate and key files, see vSphere Security Guide at
https://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-6-pubs.html. For more information
about replacing a vSphere 6.0 machine SSL certificate with a custom Certificate Authority signed
certificate, see VMware knowledge base article 2112277 at http://kb.vmware.com/kb/2112277.



Trust and Certificates (1)
Slide 8-11

In order for SSL to work, you must trust the certificate presented by the
server.
• A certificate binds a public key with a distinguished name (DN):
- A DN is the name of the person or entity that owns the public key.
• Certificates contain:
- Issuer name (CA)
- Name of system using the certificate (common name or URL)
- Public key of system
- Serial number
  • No two certificates from the same CA ever use the same serial number.
- Date range of when the certificate is valid
  • All certificates have an expiration date.
• CAs periodically release certificate revocation lists (CRLs):
- If a certificate is listed on a CRL from a CA you trust, then the system does not trust
that certificate.
- If a system cannot contact the CA and check the CRL, then some systems do not
trust the certificate.



Trust and Certificates (2)
Slide 8-12

In order for SSL to work, you must trust the certificate presented by the
server.
• A certificate is signed with the issuer's private key.
• A certificate contains all of the information needed to verify its validity.
• A certificate does not contain the CA's certificate:
- The CA's certificate contains the CA's public key.
- Anyone who has the CA's public key can decrypt a message encrypted with the CA's
private key.
- You must go to the CA's Website to download the CA's certificate (independent
verification).
• Certificates are stored in a local database called a keystore:
- Your type of keystore depends on your system and your software tools.
• A self-signed certificate is one in which the issuer and the subject (the system
that presents the certificate) are the same system.



Chain of Trust (1)
Slide 8-13

If you trust the CA, then you implicitly trust all of the certificates issued by
that CA.
• In order to trust a certificate, you must trust some part of the chain of trust. One
of the following must be true:
- You must say that you explicitly trust the certificate itself.
- You must say that you explicitly trust the CA that issued it.
• In a self-signed certificate, the issuer and the subject (the system that uses the
certificate) are the same system.
- To use self-signed certificates, every user (client) system must install and explicitly
trust every self-signed certificate that is in use in the entire network.
- Every time a new service is brought online, all clients must individually install and
trust its self-signed certificate.
• An in-house or commercial CA eliminates the requirement of each client
system installing each and every self-signed certificate so long as:
- All client systems trust the CA.
- All certificates come from that trusted CA.



Chain of Trust (2)
Slide 8-14

[Diagram: a client, a CA, and a web server. The client states "I trust the CA" and asks the web server "Why should I trust you?" The answer: "Because I trust the CA and the CA has issued you a certificate then I trust you."]

• This configuration is known as a two-node chain of trust.
• The two nodes are the client system and the CA.



Chain of Trust (3)
Slide 8-15

[Diagram: a client, a CA, an intermediate certificate authority (ICA), and a web server with a web-server certificate. The client states "I trust the CA" and asks the web server "Why should I trust you?" The answer: "Because I trust the CA, and it issued the certificate for the ICA, I trust you."]

• This configuration is known as a three-node chain of trust.
• The three nodes are:
- the client system
- the ICA
- the CA



Multinode Chains of Trust
Slide 8-16

Most chains of trust are three or more nodes deep.


• Commercial root CA certificates are carefully protected and have long lifetimes.
• ICAs are used by enterprises to issue lower certificates with shorter lifetimes to
local corporate systems.
• As long as a system trusts the root CA, all certificates issued by subordinate
CAs in the chain of trust should be trusted.
• Not all certificate resolvers are smart enough to resolve the complete chain of
trust if they only have an ICA certificate installed in their trusted store.
• Self-signed certificates used anywhere in the system can break the chain of
trust.
• If a certificate has a certificate revocation list (CRL) parameter configured and
if a client cannot reach the root CA to check the CRL, the certificate is not
trusted.
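The way a client walks a chain of trust can be modeled with plain text files. The following is a toy sketch, not a certificate validator: it follows issuer names only and performs no signature, expiry, or CRL checks. The file formats and the function name chain_trusted are assumptions made for illustration.

```shell
# chain_trusted CERTS_FILE TRUSTED_FILE SUBJECT
# CERTS_FILE:   one "subject|issuer" pair per line (stand-in for real certificates).
# TRUSTED_FILE: one explicitly trusted subject per line (stand-in for a trusted store).
# Succeeds if SUBJECT, or any issuer above it in the chain, is explicitly trusted.
chain_trusted() {
    certs=$1
    trusted=$2
    current=$3
    hops=0
    while [ "$hops" -lt 10 ]; do
        if grep -Fqx -e "$current" "$trusted"; then
            return 0
        fi
        issuer=$(awk -F'|' -v s="$current" '$1 == s { print $2; exit }' "$certs")
        # Unknown certificate, or self-signed and not explicitly trusted.
        if [ -z "$issuer" ] || [ "$issuer" = "$current" ]; then
            return 1
        fi
        current=$issuer
        hops=$((hops + 1))
    done
    return 1
}
```

With a certificate list of web.vclass.local|ICA, ICA|RootCA, and RootCA|RootCA, and a trusted store that contains only RootCA, the web server certificate resolves through the ICA to the root. A self-signed certificate such as lab.vclass.local|lab.vclass.local is trusted only after its own name is added to the trusted store. The host names here are hypothetical.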



Certificate Problem
Slide 8-17

Symptoms:
• Replacing the machine SSL certificate or solution user certificates with custom
certificate authority certificates fails at 0 percent.
• The certificate-manager.log file indicates that the dir-cli command
to publish the trusted certificate failed.
Causes:
• The script completes only if all intermediate CA certificates and the root CA
certificate are published into the trusted store in VECS. If any certificate in the
chain is missing, the replacement fails.
• This issue can also be caused by using non-Base64 certificates.
Solutions:
• To work around this issue, manually publish the full chain to the VECS or
upgrade to vCenter Server 6.0.0b or higher.

For more information about solving this certificate problem, see VMware knowledge base article
2111571 at http://kb.vmware.com/kb/2111571.
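Both causes can be screened for before the replacement is attempted. The following is a minimal sketch that checks whether a file carries Base64 (PEM) certificate markers and counts how many certificates a chain file contains. It inspects only the marker lines and does not validate the certificates. The function names are hypothetical.

```shell
# Succeed if the file looks like a Base64 (PEM) encoded certificate.
is_pem_cert() {
    grep -q -e '^-----BEGIN CERTIFICATE-----' "$1" &&
        grep -q -e '^-----END CERTIFICATE-----' "$1"
}

# Print the number of certificates found in a PEM chain file.
pem_cert_count() {
    grep -c -e '^-----BEGIN CERTIFICATE-----' "$1"
}
# Example calls (chain.pem is a hypothetical file name):
#   is_pem_cert chain.pem || echo "not Base64 encoded"
#   pem_cert_count chain.pem
```

For a chain that has one intermediate CA, a count lower than 2 suggests that the intermediate or the root certificate is missing from the file.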



vCenter Server Problem 1
Slide 8-18

Using the vSphere Web Client, check the vCenter Server service status.

[Screenshot: vSphere Web Client, Administration, with Nodes and Services in the navigator and VMware vCenter Server (vcsa-a.vclass.local) selected. The summary shows Startup Type: Automatic, Health: Warning, State: Running. Health Messages report that vCenter Server health is GREEN and Workflow Manager health is GREEN.]

As a first test, try to stop the VirtualCenter Server service and then restart it.



vCenter Server Problem 2
Slide 8-19

If the database is not healthy, vCenter Server will be slow to start.


For VMware vCenter Server Appliance, you can use only the PostgreSQL
database.
Ensure that the external database chosen is interoperable with vCenter
Server.
Validate the basic configuration of the vCenter Server database:
• Adequate disk space available
• Microsoft SQL Server: Healthy transaction logs, backed up regularly
• Oracle database: Adequate space available for tablespace growth
• Ability to connect to the database repository through SQL Server or Oracle
• Valid authentication to the vCenter Server database

For Windows-based vCenter Server, you can use either the embedded PostgreSQL database to
support up to 20 hosts and 200 virtual machines, or an external database (Oracle or Microsoft SQL)
for larger deployments. Ensure that the external database is interoperable with the version of
vCenter Server that you are installing. For information about supported database server versions, see
the VMware Product Interoperability Matrix at
http://www.vmware.com/resources/compatibility/sim/interop_matrix.php. If you want to use an
external database, ensure that you create a 64-bit DSN so that vCenter Server can connect to the
database.
For VMware vCenter™ Server Appliance™, you can use only the embedded PostgreSQL database. See the
VMware Product Interoperability Matrix at
http://www.vmware.com/resources/compatibility/sim/interop_matrix.php.
For more information about the vCenter Server database requirements, see vSphere Installation and
Setup Guide at https://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-6-pubs.html.
To validate the basic configuration of a vCenter Server database:
• Verify that adequate disk space is available on the volume that is storing the database files to
ensure the correct operation of the database. If adequate space is not available on the physical
volume that stores the database files, free some disk space.



• If you choose to use Microsoft SQL Server, verify that the transaction logs for the vCenter
Server database are healthy and are backed up on a regular basis.
• If you choose to use Oracle database, verify that adequate space is available in the tablespace
for growth of the database.
• Validate the authentication to the vCenter Server database. The vCenter Server service might
not be able to authenticate with the database if the vCenter Server database user is not granted
correct permissions.
• You might not be able to connect to the database repository through Oracle. If vCenter Server
disconnects from the Oracle database, the database server can maintain a lock on the vCenter
Server database. You might need to restart the Oracle instance to free the lock.
For more information about how to investigate the health of a vCenter Server database, see VMware
knowledge base article 1003979 at http://kb.vmware.com/kb/1003979.
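The disk space check in the list above can be scripted on a Linux-based system such as the one that hosts the database. The following is a minimal sketch. The function names, the default threshold of 90 percent, and the example path are assumptions for illustration; choose a threshold that fits your environment.

```shell
# Print the percentage of space used on the filesystem that holds the given path.
fs_used_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed only if usage is below the threshold (default 90, an example value).
has_free_space() {
    used=$(fs_used_percent "$1")
    [ -n "$used" ] && [ "$used" -lt "${2:-90}" ]
}
# Example call (the path is a hypothetical database volume):
#   has_free_space /storage/db 85 || echo "database volume is nearly full"
```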



Use the VMware Appliance Management Console
Slide 8-20

The VMware Appliance Management Console can display information on
the size and free space available in the PostgreSQL database.
• Connect to https://<vCenter Server appliance FQDN>:5480
• Go to Database

[Screenshot: vCenter Server Appliance management interface, Database view, with charts that include the transaction log and the VC inventory]

The VMware Appliance Management Console can display the health status of an embedded PostgreSQL
database. This includes the amount of space being consumed by Statistics, Events, Alarms, and
Tasks (SEAT) along with the space consumed by the transaction log and the vCenter Server inventory.



Growth of the vCenter Server Database
Slide 8-21

The size and rate of the vCenter Server database growth can affect
vCenter Server performance.
vCenter Server collects several types of data and stores the data in the
database:
• Performance data
• Log of tasks that were performed
• Log of events that occurred

The database increases in size because of the performance data.

Under many circumstances, the vCenter Server database can grow excessively. Growth of the
vCenter Server database is due to data being collected and stored in the database. This data can be
categorized in three ways:
• Performance data
• Log of tasks that were performed
• Log of events that occurred
When troubleshooting excessive growth of the database, start by examining where the growth is
occurring. From here, you can determine how to troubleshoot the issue.



vCenter Server Database Tables That Typically Grow
Slide 8-22

A subset of vCenter Server tables accounts for most of the cases that
show substantial growth in the database:
• vpx_hist_stat1 to vpx_hist_stat4
- Contains the collected performance data information
• vpx_sample_time1 to vpx_sample_time4
- Stores the reference time frames for the performance data in the vpx_hist_stat
tables
• vpx_event and vpx_event_arg
- Stores the event information in vCenter Server
• vpx_task
- Stores the task information in vCenter Server

The vCenter Server database is a complex database and several areas can cause problems. Of the
many tables in the vCenter Server database, only a few accumulate data during regular operation:
the vpx_hist_stat, vpx_sample_time, vpx_event, vpx_event_arg, and vpx_task tables.
For information about how to determine where growth is occurring in the vCenter Server database,
see VMware knowledge base article 1028356 at http://kb.vmware.com/kb/1028356.
For information about how to purge old data from the database used by vCenter Server, see VMware
knowledge base article 1025914 at http://kb.vmware.com/kb/1025914.



Rollup Jobs Control Growth
Slide 8-23

Rollup jobs exist only in external databases: SQL database (for
Windows-based vCenter Server) or Oracle databases (for Windows-based
vCenter Server or vCenter Server Appliance).
When the performance data is in the vpx_hist_stat1 and
vpx_sample_time1 tables, several outcomes are achieved:
• The scheduled past-day rollup job begins postprocessing the information.
• This job runs frequently to average and summarize the data, inserting it into
the vpx_hist_stat2 and vpx_sample_time2 tables.
Statistics
Specify settings for collecting vCenter Server statistics.

Default settings for collecting statistics:

Enabled   Interval Duration   Save For   Statistics Level
Yes       5 minutes           1 day      Level 1
Yes       30 minutes          1 week     Level 1
Yes       2 hours             1 month    Level 1
Yes       1 day               1 year     Level 1

The rollup jobs are scheduled to run by default on these intervals:


• Past day stats rollup : Every 30 minutes
• Past week stats rollup: Every 2 hours
• Past month stats roll up : Every 24 hours
The goal of each rollup job is to make the information progressively less granular over the
different intervals and to move the data all the way down to the vpx_hist_stat4 and
vpx_sample_time4 tables.
By default, the statistics level in vCenter Server is set to Level 1 for each collection interval.
This setting controls the amount of data that is gathered for that interval. There are four levels
in total.
If you set the statistics level higher than level 2, the amount of data collected is substantially
greater. Without adequate processing power on the database server, this setting could prevent
performance data from being processed properly.
Although there are no rollup jobs for internal databases (vPostgres in vSphere 6 and SQL Express in
vSphere 5.x), statistics processing, or rollup, of data still occurs. With external databases, the
rollup jobs are scheduled tasks offloaded to the database engine. These scheduled tasks run stored
procedures to process the data in the database. Internal databases do not have the ability to set up
such scheduled tasks. Instead, the vCenter Server process directly runs the stored procedure at the
required interval.
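The effect of the default collection intervals can be illustrated with a short calculation. The following Python sketch is not part of the course material. It only counts how many samples per counter each interval retains, and it assumes that "1 month" means 30 days and "1 year" means 365 days.

```python
from datetime import timedelta

# Default collection intervals: (sample interval, retention period).
# Assumption: "1 month" is treated as 30 days and "1 year" as 365 days.
DEFAULT_INTERVALS = {
    "past day": (timedelta(minutes=5), timedelta(days=1)),
    "past week": (timedelta(minutes=30), timedelta(weeks=1)),
    "past month": (timedelta(hours=2), timedelta(days=30)),
    "past year": (timedelta(days=1), timedelta(days=365)),
}


def samples_retained(interval: timedelta, retention: timedelta) -> int:
    """Number of samples kept per counter for one collection interval."""
    return retention // interval


def samples_per_counter() -> dict:
    """Samples kept per counter for every default collection interval."""
    return {
        name: samples_retained(interval, retention)
        for name, (interval, retention) in DEFAULT_INTERVALS.items()
    }


if __name__ == "__main__":
    for name, count in samples_per_counter().items():
        print(f"{name}: {count} samples per counter")
```

Because each rollup averages many fine-grained samples into fewer coarse ones, the row count per counter stays in the hundreds for every interval, provided that the rollup jobs actually run.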



Query the Status of Rollup Jobs on MS SQL Server
Slide 8-24

To query the status of rollup jobs on MS SQL Server:
1. Launch SQL Server Management Studio.
2. Connect to the SQL instance where the vCenter Server database resides.
3. Expand SQL Server Agent, within which the jobs reside.
4. Locate the Past Day Stats rollup job.
5. Right-click the job and click View History. The status of the job and other
details regarding the schedule are displayed.
6. Repeat the steps for the past week rollup and past month rollup jobs.
For information on how to query the status of rollup jobs on other
database servers, see VMware knowledge base article 2012226 at
http://kb.vmware.com/kb/2012226.
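The same job history can also be read with a query instead of the View History dialog. The following Python sketch is an illustration only. It assumes that the job names contain "stats rollup", and it uses the SQL Server Agent history tables msdb.dbo.sysjobs and msdb.dbo.sysjobhistory. The helper name describe_run is hypothetical, and running the query requires a database client, which is not shown.

```python
from datetime import datetime

# Transact-SQL against the msdb tables that record SQL Server Agent job history.
# Assumption: the job names contain "stats rollup"; adjust the LIKE pattern
# to the names that are listed under SQL Server Agent.
ROLLUP_JOB_HISTORY_SQL = """
SELECT j.name, h.run_date, h.run_time, h.run_status
FROM msdb.dbo.sysjobs AS j
JOIN msdb.dbo.sysjobhistory AS h ON h.job_id = j.job_id
WHERE j.name LIKE 'Past % stats rollup%'
  AND h.step_id = 0
ORDER BY h.run_date DESC, h.run_time DESC;
"""

# Meaning of the sysjobhistory.run_status values.
RUN_STATUS = {0: "Failed", 1: "Succeeded", 2: "Retry", 3: "Canceled", 4: "In Progress"}


def describe_run(name: str, run_date: int, run_time: int, run_status: int) -> str:
    """Format one history row; run_date is YYYYMMDD and run_time is HHMMSS."""
    started = datetime.strptime(f"{run_date:08d}{run_time:06d}", "%Y%m%d%H%M%S")
    status = RUN_STATUS.get(run_status, f"Unknown ({run_status})")
    return f"{name}: {status}, started {started:%Y-%m-%d %H:%M:%S}"
```

Rows with step_id 0 hold the overall outcome of each job run, so one row per run is returned.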



Verifying the Size of the Database Tables
Slide 8-25

To determine whether the size of the performance data might be
affecting vCenter Server performance, perform the following tasks:
• Check the size of the database tables:
- Start with the vpx_hist_stat1 table.
• This table is where nonprocessed performance data is stored if the rollup jobs are not
running.
- An acceptable number of rows depends on the size of the environment:
• A problem might occur if more than 10 million rows exist in the vpx_hist_stat1 table.

• If you think that performance data is not being processed, determine the last
time that data was successfully processed:
- If the date returned is more than 24 hours in the past, a problem with the rollup jobs
is likely.

To start diagnosing the performance data situation in vCenter Server, check the size of the database
tables. Because vpx_hist_stat1 and vpx_sample_time1 store the raw incoming data for the
statistics level, these tables frequently cause problems.
For details about how to verify the size of the vCenter Server database tables, see VMware
knowledge base article 2007388 at http://kb.vmware.com/kb/2007388.
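The two checks on the slide (row count and time of last successful processing) can be expressed as a small decision helper. The following Python sketch is an illustration under assumptions: the function name assess_backlog is hypothetical, the row count is obtained separately (for example with the query shown in the constant), and the 10 million row and 24 hour thresholds are the guideline values from the slide, not hard limits.

```python
from datetime import datetime, timedelta

# Query to obtain the row count; run it with the database client of your choice.
ROW_COUNT_SQL = "SELECT COUNT(*) FROM vpx_hist_stat1;"

ROW_LIMIT = 10_000_000                # guideline from the slide
MAX_ROLLUP_AGE = timedelta(hours=24)  # guideline from the slide


def assess_backlog(row_count: int, last_processed: datetime, now: datetime) -> list:
    """Return a list of findings; an empty list means that no check failed."""
    findings = []
    if row_count > ROW_LIMIT:
        findings.append(
            f"vpx_hist_stat1 holds {row_count:,} rows (more than {ROW_LIMIT:,})"
        )
    if now - last_processed > MAX_ROLLUP_AGE:
        findings.append(
            f"performance data was last processed at {last_processed:%Y-%m-%d %H:%M}"
        )
    return findings
```

Either finding on its own is a reason to check the rollup jobs before you consider purging data.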



Resolving Performance Data Growth Issues
Slide 8-26

Use any of several methods to resolve performance data growth issues:


• Ensure that statistic rollup jobs exist.
• Ensure that the MSSQL agent service is started on the database server.
• Ensure that statistic collection levels are not set too high for the given
configuration.
• Keep the statistics level at level 2 or lower:
- VMware does not recommend setting the level higher than level 2.
• Increase the statistics level only when you are debugging a problem:
- Decrease the statistics level as soon as you are finished debugging your problem.

In an upgrade or recovery scenario, the database, although properly restored, might not include the
rollup jobs. Validate within MSSQL or Oracle whether the Past Day, Week, and Month rollup jobs
exist and re-create them if necessary.
By default, when SQL Server is installed, the MSSQL agent service is set to Manual startup. After a
reboot, the rollup jobs might not run because the service is not started. Validate this configuration
by checking the agent services and making sure that the MSSQL agent service is started and set to
start automatically.
If statistics collection levels above level 2 are used, other than for debugging an issue, the
database can grow rapidly. Reducing the level stabilizes the system, but it might not be possible to
process the data that has already accumulated.
Truncating the unprocessed information from the vpx_hist_stat1 table is normally a last resort,
but can be the ultimate solution if it is not possible to process the data in an appropriate period of time.
A problem occurs when vCenter Server Appliance with an embedded vPostgres database stops running
because the disk partition that holds the vPostgres database contains no free space. You can verify
that the vPostgres database partition is full and, if necessary, extend the disk size for the vPostgres
database. For more information, see VMware knowledge base article 2058273 at
http://kb.vmware.com/kb/2058273.



PostgreSQL Database Out of Space
Slide 8-27

Use the following procedure to move the database to a drive with more
space:
1. Shut down your vCenter Server Appliance virtual machine.
2. Add a new hard disk to the virtual machine.
3. Start up the vCenter Server Appliance virtual machine.

Too high a statistics level can cause your database to fill up. If your database is filling up too
fast, reduce the statistics level to Level 1. The higher the statistics level, the greater the level of
detail that is recorded in the database.
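Before the database partition runs out of space, its usage can be checked from the Bash shell of the appliance. The following Python sketch uses only the standard library. The mount point /storage/db in the usage note and the 90 percent threshold are assumptions for illustration, so verify the actual mount point on your appliance.

```python
import shutil


def usage_percent(path: str) -> float:
    """Percentage of the file system that contains path that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def is_nearly_full(path: str, threshold_percent: float = 90.0) -> bool:
    """True if usage of the file system is at or above the threshold."""
    return usage_percent(path) >= threshold_percent
```

On the appliance, a call such as is_nearly_full("/storage/db") would flag the database partition before vPostgres stops, assuming that the database resides on that mount point.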



Set the Statistics Level
Slide 8-28

To set the statistics level in vCenter Server:


1. Select the vCenter Server.
2. Select Configure > Settings > Edit.

[Screenshot: Edit vCenter Server Settings dialog for sa-vcsa-01.vclass.local, Statistics page
("Enter settings for collecting vCenter Server statistics"). Each interval duration (5 minutes,
30 minutes, 2 hours, 1 day) has an Enabled check box, a Save For period (1 day, 1 week, 1 month,
1 year), and a Statistics Level drop-down menu with the values Level 1 through Level 4. The
Database size section reads: "Based on the current vCenter Server inventory size, the vCenter
Server database can be estimated. Enter the expected number of hosts and virtual machines in the
inventory to calculate an estimate." In the example, the estimated space required is 16.71 GB.]



Modify the Database Settings
Slide 8-29

You can set how long the database retains events.


[Screenshot: Edit vCenter Server Settings dialog for sa-vcsa-01.vclass.local, Database page
("Enter database settings. Use tasks and events retention settings to limit the growth of the
database."). The page shows Maximum connections, Task cleanup (Enabled check box), Task retention
(days), Event cleanup (Enabled check box), and Event retention (days). A tooltip reads "Maximum
age in days a task remains in the database. Minimum: 1. Maximum: 2,147,483,647." The dialog
warns: "Increasing the events retention to more than 30 days will result in significant increase
of vCenter database size and could shutdown the vCenter Server. Please ensure that you enlarge
the vCenter database accordingly." A footnote marks settings that require a manual restart of
vCenter Server.]

A reinitialization of the vCenter Server database resets it to the default configuration, as if the
vCenter Server was newly installed.
Reinitializing the database permanently erases all data in the database. If data must be protected,
verify that a proper backup of the database is taken before reinitializing the database.
For details about how to reinitialize the vCenter Server database, see VMware knowledge base
article 2031295 at http://kb.vmware.com/kb/2031295.



Reinitializing the vCenter Server Database
Slide 8-30

You reinitialize the vCenter Server database in the following situations:


• To rebuild vCenter Server.
• When there is a suspected data corruption.
• When there is a support request for reinitialization.
To reset the database configuration on the vCenter Server Appliance, run
/usr/sbin/vpxd with the options described in VMware knowledge base article 2031295.

It is a best practice to stop vCenter Server services before you stop the PostgreSQL database on the
vCenter Server Appliance. At the very least, you should stop the main vCenter Server service with
the following command:
service-control --stop vmware-vpxd
Then you can stop the database server:
service-control --stop vmware-vpostgres
Restart the services in the opposite order. Start the database server first:
service-control --start vmware-vpostgres
Then start the vCenter Server service:
service-control --start vmware-vpxd
If you have multiple services running on your vCenter Server Appliance virtual machine (VMware
License Service, VMware Identity Management Service, VMware Content Library Service, and so
on) you can stop them all safely with the command:
service-control --stop --all



After you perform your database management tasks (changing the log level, moving the database
files to a new larger disk drive, and so on), you can restart all services safely with the command:
service-control --start --all
For details about how to stop and start all services in the vCenter Server Appliance virtual machine,
see VMware knowledge base article 2109881 at http://kb.vmware.com/kb/2109881.
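The stop and start order described above (stop vCenter Server first, then the database, and start them in the opposite order) can be wrapped in a small script. The following Python sketch is an illustration only. The function names are hypothetical, the script is assumed to run as root in the Bash shell of the appliance, and only the service-control options shown in this lesson are used.

```python
import subprocess

# Services in the order in which they are stopped; they are started in reverse.
STOP_ORDER = ["vmware-vpxd", "vmware-vpostgres"]


def maintenance_commands():
    """Return (stop_commands, start_commands) as argument lists."""
    stop = [["service-control", "--stop", name] for name in STOP_ORDER]
    start = [["service-control", "--start", name] for name in reversed(STOP_ORDER)]
    return stop, start


def run_with_services_stopped(task, runner=subprocess.check_call):
    """Stop the services, run task(), and then restart the services.

    runner is injectable so that the sequence can be rehearsed without
    touching a real appliance.
    """
    stop, start = maintenance_commands()
    for command in stop:
        runner(command)
    try:
        task()
    finally:
        # Restart the services even if the maintenance task fails.
        for command in start:
            runner(command)
```

Passing a runner that only records or prints each command gives a dry run of the sequence before it is used on a production appliance.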



Other PostgreSQL Troubleshooting
Slide 8-31

You perform the following tasks to confirm that PostgreSQL is working
properly:
• Confirm that PostgreSQL is running
ps aux | grep postgres
• Start PostgreSQL
service-control --start vmware-vpostgres
• Stop PostgreSQL
service-control --stop vmware-vpostgres
All PostgreSQL commands must be run from an SSH session or remote
console session on the vCenter Server Appliance virtual machine.

The vCenter Server Appliance API commands and plug-ins included in the appliance shell enable
you to configure, troubleshoot, and monitor vCenter Server Appliance.
To use the plug-ins and API commands, you must first access the appliance shell.
You can access the appliance shell directly through the appliance console, or remotely using a
remote console connection, such as SSH.
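The output of ps aux | grep postgres includes the grep process itself, which is easy to misread as a running database. The following Python sketch filters captured ps aux output instead. The function name and the sample process line in the usage note are illustrative assumptions, not output taken from a real appliance.

```python
def postgres_processes(ps_output: str) -> list:
    """Return (pid, command) pairs for postgres processes in `ps aux` output."""
    found = []
    for line in ps_output.splitlines():
        # ps aux prints 11 columns; the last one is the full command line.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or malformed line
        command = fields[10]
        executable = command.split()[0].rsplit("/", 1)[-1]
        if executable == "grep":
            continue  # the grep process that searched for "postgres"
        if "postgres" in command:
            found.append((int(fields[1]), command))
    return found
```

For example, given a header line, a line for a hypothetical /opt/vmware/vpostgres/current/bin/postgres process, and a line for grep postgres, only the postgres process is returned. An empty list means that PostgreSQL is not running.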



Accessing the vCenter Server Appliance Shell
Slide 8-32

Use the vCenter Server Appliance console to configure access to the
shell.
Connect to https://<vCenter Server>:5480.
Enter a user name and password that is recognized by vCenter Server
Appliance to log in to the appliance console.

If you log in to the appliance shell as a user who has a super administrator's role, you can enable
access to the Bash shell of the appliance. The default user with a super administrator role is root.
Connect to https://<vCenter Server>:5480.
To enable access to the Bash shell, use the shell.set --enabled true command.
To access the Bash shell, run the shell or pi shell command. You can run all commands in the
appliance shell with or without the pi keyword.



Configuring Access Settings
Slide 8-33

Use the Access menu to configure access settings such as Bash shell
and SSH.
[Screenshot: vCenter Server Appliance management interface, Access page. The Navigator lists
Summary, Access, Networking, Time, Update, Administration, Syslog Configuration, CPU and Memory,
and Database. The Access Settings panel shows SSH Login: Enabled and Bash Shell: Disabled. The
Edit Access Settings dialog contains the Enable SSH Login and Enable BASH Shell check boxes and a
Timeout (Minutes) field.]



Log in to the Appliance Shell
Slide 8-34

After you have enabled access you can use SSH to log in to the vCenter
Server appliance shell.

VMware vCenter Server Appliance 6.5.0.5200

Type: vCenter Server with an embedded Platform Services Controller

Connected to service

• List APIs: "help api list"
• List Plugins: "help pi list"
• Launch BASH: "shell"

Command> shell
Shell access is granted to root
root@sa-vcsa-01 [ ~ ]#



Querying Service Status and Restarting Services
Slide 8-35

From the Bash shell, you can:


• Verify the status of a service.
• Start or restart a service if the service was interrupted.
VMware vCenter Server Appliance 6.5.0.5200

Type: vCenter Server with an embedded Platform Services Controller

Last login: Mon May 1 20:18:08 2017 from 172.20.10.80

Connected to service

• List APIs: "help api list"
• List Plugins: "help pi list"
• Launch BASH: "shell"

Command> shell
Shell access is granted to root
root@sa-vcsa-01 [ ~ ]# service-control --status vsphere-client
Stopped:
 vsphere-client
root@sa-vcsa-01 [ ~ ]# service-control --start vsphere-client
Perform start operation. vmon_profile=None, svc_names=['vsphere-client'], include_coreossvcs=False, include_leafossvcs=False
2017-05-01T20:24:04.736Z Service vsphere-client state STOPPED
Successfully started service vsphere-client
root@sa-vcsa-01 [ ~ ]#
The API commands enable you to perform various administrative tasks and facilitate
troubleshooting. For example, you can edit time synchronization settings, monitor processes and
services, set up the SNMP settings, and so on.
The plug-ins in vCenter Server Appliance reside in the CLI itself. The plug-ins are standalone Linux
or VMware utilities, which do not depend on any VMware service.
For more information about how to access the vCenter Server Appliance shell and about the available
API commands and plug-ins in the shell, see the vSphere Installation and Setup Guide at
https://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-6-pubs.html.
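When many services are checked at once, the status output is easier to act on after it is parsed into groups. The following Python sketch is an illustration only. It assumes that the output of service-control --status consists of state headings that end with a colon (such as "Stopped:" on the slide), each followed by lines of service names. The "Running:" heading in the usage note is an assumption.

```python
def parse_service_status(output: str) -> dict:
    """Group service names by the state heading that precedes them."""
    states = {}
    current = None
    for line in output.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.endswith(":"):
            # A heading such as "Stopped:" starts a new group.
            current = stripped[:-1]
            states.setdefault(current, [])
        elif current is not None:
            # Service names are separated by white space.
            states[current].extend(stripped.split())
    return states
```

For the output shown on the slide, the result is {"Stopped": ["vsphere-client"]}. A script could then start every service in the "Stopped" group that is expected to run.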



Using API Commands and Plug-Ins from the Appliance Shell
Slide 8-36

The API commands in vCenter Server Appliance enable you to perform
various administrative tasks, such as monitoring processes and services, and
facilitate troubleshooting.
The plug-ins in vCenter Server Appliance provide access to various
administrative and troubleshooting tools.

Command> help api list
Supported API calls by this server:
com.vmware.appliance.health.applmgmt.get
com.vmware.appliance.health.databasestorage.get
com.vmware.appliance.health.load.get
com.vmware.appliance.health.mem.get
com.vmware.appliance.health.softwarepackages.get
com.vmware.appliance.health.storage.get
com.vmware.appliance.health.swap.get
com.vmware.appliance.health.system.get
com.vmware.appliance.health.system.lastcheck
com.vmware.appliance.monitoring.get
com.vmware.appliance.monitoring.list
com.vmware.appliance.monitoring.query
com.vmware.appliance.recovery.backup.job.cancel
com.vmware.appliance.recovery.backup.job.create
com.vmware.appliance.recovery.backup.job.get
com.vmware.appliance.recovery.backup.job.list
com.vmware.appliance.recovery.backup.parts.get
com.vmware.appliance.recovery.backup.parts.list
com.vmware.appliance.recovery.backup.validate
com.vmware.appliance.recovery.restore.job.cancel
com.vmware.appliance.recovery.restore.job.create
com.vmware.appliance.recovery.restore.job.get
com.vmware.appliance.recovery.restore.validate

Command> help pi list
Available plugin API calls:
com.vmware.clear
com.vmware.cmsso-util
com.vmware.dcli
com.vmware.nslookup
com.vmware.pgrep
com.vmware.pgtop
com.vmware.ping
com.vmware.ping6
com.vmware.portaccess
com.vmware.ps
com.vmware.rvc
com.vmware.service-control
com.vmware.shell
com.vmware.showlog
com.vmware.shutdown
com.vmware.software-packages
com.vmware.support-bundle
com.vmware.top
com.vmware.tracepath
com.vmware.tracepath6
com.vmware.updatemgr-util
com.vmware.vcenter-restore
com.vmware.vimtop
Command>
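The API listing is long, so it can help to pick out one group of calls, for example the health checks. The following Python sketch is an illustration only. It works on saved text output of help api list, and the function name health_checks is hypothetical.

```python
HEALTH_PREFIX = "com.vmware.appliance.health."


def health_checks(api_listing: str) -> list:
    """Return the component names that offer a health 'get' call."""
    components = []
    for line in api_listing.splitlines():
        name = line.strip()
        if name.startswith(HEALTH_PREFIX) and name.endswith(".get"):
            # Keep only the part between the prefix and the ".get" suffix.
            components.append(name[len(HEALTH_PREFIX):-len(".get")])
    return components
```

Applied to the listing on the slide, the helper returns the eight components applmgmt, databasestorage, load, mem, softwarepackages, storage, swap, and system. Each name corresponds to a call such as com.vmware.appliance.health.databasestorage.get in the appliance shell.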



ESXi Problem 1
Slide 8-37

An ESXi host crash is typically caused by one of several reasons:


• CPU exception
• Driver or module panic
• Machine check exception
• Hardware fault
• Software defect

Available information for many problems might prove inconclusive. Server hangs, purple screen
crashes without disk dumps, or disk failures might leave the server with very little information
logged regarding a problem. While the root cause of this outage might be elusive, you can better
prepare for the next time the problem happens.
Review logs for diagnostic messages that were generated leading up to the issue as well as during
the issue.
For hardware faults, run hardware diagnostics.
Faulty CPUs can manifest as unusual behavior, such as abrupt reboots, hangs, or purple screens.
Most often, the CPU generates an exception that is trapped by the VMkernel and handled with a
purple screen.



Verifying That the ESXi Host Has Crashed
Slide 8-38

View the ESXi local console at the DCUI to verify that the purple screen
problem exists.
[Screenshot: purple diagnostic screen on the console of a VMware ESXi 6.5.0 host. The screen
reports "#PF Exception 14 in world 295215:vsish", followed by the CPU register contents and a
VMkernel backtrace whose first frames are CrashMeCurrentCore, IntrCookie_DoInterrupt,
IntrCookie_VmkernelInterrupt, and IDT_IntrHandler. The last lines report the core dump status:
"Coredump to disk", "DiskDump: Successful", "No file configured to dump data", "No vsan object
configured to dump data", and "No port for remote debugger. "Escape" for local debugger."]

The ESXi host crashes when the VMkernel enters a condition where it cannot or should not proceed.
A VMkernel fault is manifested by a purple screen on the ESXi console. This screen is referred to as
a purple screen of death (PSOD).
When recovering from a host crash in a production environment, the main goal is to get your virtual
machines back up and running as soon as possible. A vSphere HA cluster can help you recover
quickly when one of the hosts in the cluster fails.
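When you record the state of a purple diagnostic screen, the exception line is the most useful item to transcribe. The following Python sketch extracts its fields from a transcribed line. It is an illustration only: the line format is inferred from the example screen ("#PF Exception 14 in world 295215:vsish IP ... addr ..."), other crash types may be worded differently, and the function name is hypothetical.

```python
import re

# Assumed format: "#<mnemonic> Exception <vector> in world <id>:<name> IP <hex> addr <hex>"
PSOD_EXCEPTION = re.compile(
    r"#(?P<exception>[A-Z]+) Exception (?P<vector>\d+) in world "
    r"(?P<world_id>\d+):(?P<world_name>\S+) "
    r"IP (?P<ip>0x[0-9a-fA-F]+) addr (?P<addr>0x[0-9a-fA-F]+)"
)


def parse_exception_line(line: str):
    """Return the fields of the exception line, or None if it does not match."""
    match = PSOD_EXCEPTION.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["vector"] = int(fields["vector"])
    fields["world_id"] = int(fields["world_id"])
    return fields
```

The world name identifies the process that was running when the exception occurred (vsish in the example), and the IP value can be matched against the first frame of the backtrace.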



Recovering from a Purple Diagnostic Screen Crash
Slide 8-39

To recover from a purple diagnostic screen crash on an ESXi host:


1. Record the state of the system:
a. Take a screenshot or photograph of the purple diagnostic screen.
b. Note any relevant environmental issues or conditions.
2. Restart the host:
a. Get the virtual machines up and running.
b. Collect a vm-support log bundle from the affected host.
3. Contact VMware technical support:
- If VMware Technical Support determines that the issue is a hardware problem, you
must contact your hardware vendor.

When a host stops responding, the entire server becomes unresponsive. You might not be able to
determine whether the issue is related to the hardware or software without collecting further data.
If the host is unresponsive and you cannot boot the system properly, the problem could be due to a
corrupt configuration or a hardware fault. Try to boot from a diagnostics or installer CD if possible.



ESXi Problem 2
Slide 8-40

When the system becomes unresponsive, it indicates that the ESXi host
has stopped responding. The system might get into this state after a
power cycle.
An ESXi host stops responding for the following reasons:
• The VMkernel is too busy or deadlocked.
• A hardware lockup occurs.



Verifying That the ESXi Host Has Stopped Responding
Slide 8-41

To confirm that an ESXi host is not responding, determine whether you
can perform the following tasks on the host:
• Ping the VMkernel network interface.
• Determine whether vSphere Client responds to queries.
• Monitor network traffic from the ESXi host and its virtual machines.
If any of these tasks are successful, then your ESXi host should be at
least minimally operational.
To verify that the host is not responding, use the ESXi host's DCUI to
display VMkernel messages on the screen (ALT+F12).

The main goal is to get the virtual machines back up and running as soon as possible. After you have
done that, do some research and try to determine why the ESXi host locked up.
Check the VMkernel log file (/var/log/vmkernel.log) for error messages.
Use esxtop to gather performance statistics. esxtop shows current performance statistics of the
entire ESXi host.
A best practice is to have logs captured independent of disk or network connectivity when
troubleshooting issues. Enabling serial logging sends all VMkernel logs to the serial port in addition
to their normal destination.
For more information about backing up an ESXi host, see VMware knowledge base article 2042141
at http://kb.vmware.com/kb/2042141.
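Reviewing the VMkernel log by hand is slow when the file is large. The following Python sketch filters a copy of /var/log/vmkernel.log for messages marked WARNING or ALERT. It is an illustration only: the line format (timestamp, cpuN:worldID, then the message) is an assumption based on typical VMkernel log entries, and the function name is hypothetical.

```python
import re

# Assumed line format: "<timestamp> cpu<N>:<world ID>)<message>"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+) cpu(?P<cpu>\d+):(?P<world>\d+)\)(?P<message>.*)$"
)


def severe_messages(lines, levels=("WARNING", "ALERT")) -> list:
    """Return (timestamp, message) pairs for lines marked with one of the levels."""
    prefixes = tuple(f"{level}:" for level in levels)
    found = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match and match.group("message").startswith(prefixes):
            found.append((match.group("timestamp"), match.group("message")))
    return found
```

To use the helper on a log file that was copied from the host, open the file and pass the file object as lines. The timestamps of the matches show whether the messages were generated leading up to the outage or during it.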



Recovering from an ESXi Host Failure
Slide 8-42

To recover from an ESXi host failure:


1. Reboot the host.
2. Determine why the host locked up.
- Review logs that led to the outage.
- Set up serial-line logging.
- Gather performance statistics.

3. After hardware problems are corrected, reinstall and configure the
ESXi host, using your most recent backup to ensure that faulty
hardware did not corrupt the disk.
4. Install the latest patches and updates for the ESXi host.



Lab 9: Managing the PostgreSQL Database
Slide 8-43

Manage the PostgreSQL Database


1. Verify That PostgreSQL Is Running
2. Modify the Logging Level in the PostgreSQL Configuration File
3. Reload the PostgreSQL Server Instance
4. Examine the PostgreSQL Log File
5. Use the vCenter Server Appliance Management Interface
6. Use vSphere Web Client to Examine the Health of the PostgreSQL Database



Lab 10: Troubleshooting vCenter Server and ESXi Host
Problems
Slide 8-44

Identify, diagnose, and resolve vCenter Server and ESXi host problems
1. Run a Break Script
2. Verify That the System Is Not Functioning Properly
3. Troubleshoot and Repair the Problem
4. Verify That the Problem Is Repaired



Lab 11: (Optional) Working with Certificates
Slide 8-45

Generate and replace vCenter Server certificates


1. Examine vSphere certificates
2. Create a Windows 2012 certificate authority template for vSphere
3. Create a certificate signing request
4. Download the CSR to the student desktop
5. Request a signed custom certificate
6. Replace a Machine Certificate with the new custom certificate
7. Regenerate a new VMCA root certification and replace all certificates
8. Use the vSphere Web Client to confirm certificate replacement



Review of Learner Objectives
Slide 8-46

You should be able to meet the following objectives:


• Understand vSphere 6.5 architecture and main components
• Troubleshoot authentication and certificate problems
• Analyze and solve vCenter Server service problems
• Diagnose and troubleshoot vCenter Server database problems
• Use the vCenter Server Appliance shell and the Bash shell to identify and
solve problems
• Identify and troubleshoot ESXi host problems



Key Points (1)
Slide 8-47

• vSphere 6.5 includes vCenter Server and Platform Services Controller.


• vCenter Single Sign-On enables vSphere components to communicate with
one another for authentication purposes.
• VMware CA is the default root certificate authority that supplies certificates to
ensure secure communication between vCenter Server and ESXi hosts.
• VECS serves as a local (client-side) repository for certificates, private keys,
and other certificate information that can be stored in a keystore.
• vSphere 6.5 has multiple deployment modes: vCenter Server with an
embedded Platform Services Controller, vCenter Server with an external
Platform Services Controller, and Enhanced Linked Mode.
• The size and rate of the vCenter Server database growth can affect vCenter
Server performance.
• A subset of vCenter Server tables accounts for most of the cases that show
substantial growth in the database.



Key Points (2)
Slide 8-48

• Using the vCenter Server Appliance shell and the Bash shell enables you to
carry out configuration, monitoring, and troubleshooting tasks.
• The API commands and plug-ins in vCenter Server Appliance can be used to
perform administrative tasks and are useful for troubleshooting.
• An ESXi host crash is typically caused by CPU exception, driver or module
panic, machine check exception, hardware fault, or software defect.
• Use tools such as ping to determine if a host is hung, and use tools such as log
files or performance statistics to determine the causes of the host lock-up.
Questions?



