Troubleshooting Workshop
Lecture Manual
ESXi 6.5 and vCenter Server 6.5
www.vmware.com/education
CONTENTS
ESXCLI Commands
Viewing vSphere Storage Information
Viewing vSphere Network Information
Viewing Standard Switch Information
Viewing Distributed Switch Information
Viewing Hardware Information
Lab 1: Using the Command Line
Review of Learner Objectives
Lesson 2: vSphere Management Assistant
Learner Objectives
vSphere Management Assistant Components
Configuring vSphere Management Assistant for AD Authentication
Adding vSphere Management Assistant to Active Directory
vicfg-* Commands
vmware-cmd Command
Viewing Virtual Machine Information
Viewing Snapshot Information
Direct Console, SSH, or vSphere Management Assistant
Lab 2: Adding vSphere Management Assistant to Active Directory
Review of Learner Objectives
Lesson 3: Logging, Log Files, and vRealize Log Insight
Learner Objectives
Location of vCenter Server Logs
Common Logs
Management Node Logs
Platform Services Controller Logs
Important vCenter Server Logs for Troubleshooting
Viewing vCenter Server Log Files in vSphere Web Client
Location of ESXi Host Logs
Useful ESXi Host Logs for Troubleshooting
Viewing Log Files in the DCUI
vSphere Syslog Collector
vRealize Log Insight
Searching and Filtering Log Events
Analyzing Logs with the Interactive Analytics Charts
Dynamic Field Extraction
Troubleshooting Using Customized Dashboards
Monitoring Log Events and Sending Alerts
Lab 3: Searching Log Files
Lab 4: Searching Log Files
Review of Learner Objectives
Key Points
MODULE 5 Troubleshooting Storage
You Are Here
Importance
Module Lessons
Lesson 1: Storage Connectivity and Configuration
Learner Objectives
Review of vSphere Storage Architecture
Review of iSCSI Storage
Storage Problem 1
Identifying Possible Causes
Possible Cause: Hardware-Level Problems
Possible Cause: Poor iSCSI Storage Performance
Possible Cause: VMkernel Interface Misconfiguration
Possible Cause: iSCSI HBA Misconfiguration (1)
Possible Cause: iSCSI HBA Misconfiguration (2)
Possible Cause: iSCSI HBA Misconfiguration (3)
Possible Cause: iSCSI HBA Misconfiguration (4)
Possible Cause: iSCSI HBA Misconfiguration (5)
Possible Cause: iSCSI HBA Misconfiguration (6)
Possible Cause: iSCSI HBA Misconfiguration (7)
Possible Cause: Port Unreachable
Possible Cause: VMFS Metadata Inconsistency
Use vSphere On-Disk Metadata Analyzer (1)
Use vSphere On-Disk Metadata Analyzer (2)
Use vSphere On-Disk Metadata Analyzer (3)
Possible Cause: NFS Misconfiguration
NFS Version Compatibility with Other vSphere Technologies
NFS Dual Stack Not Supported
NFS Client Authentication
Configuring Active Directory and NFS Servers to Use Kerberos
Configuring Host Time Synchronization
Configuring Host Authentication Services
Configuring the Datastore to Use Kerberos
Viewing Session Information
Review of Learner Objectives
Lesson 2: Multipathing
Learner Objectives
Review of iSCSI Multipathing
Storage Problem 2
Identifying Possible Causes
PDL Condition
Recovering from an Unplanned PDL
APD Condition
Possible Cause: Insufficient Physical Resources
Bandwidth Reservation
Possible Cause: Excessive Virtual Machine Reservations (1)
Possible Cause: Excessive Virtual Machine Reservations (2)
High Availability Configuration
Possible Cause: Admission Control Policy Misconfiguration
vSphere HA Cluster: Admission Control Guidelines
Example of Calculating Slot Size
Applying Slot Size
Distorted Slot Size
Reserving a Percentage of Cluster Resources
Calculating Current Failover Capacity
Using VMCP
Useful Troubleshooting Commands
Cluster Utilization Graph
Review of vSphere vMotion
vSphere vMotion TCP/IP Stacks
Use esxcli to Display vMotion Network Information
Long Distance vMotion
Cross vCenter Server vMotion
vSphere vMotion Problem 1
Identifying Possible Causes
Possible Cause: VMkernel Interface Misconfiguration
Possible Cause: Invalid Name Resolution on the Host
Possible Cause: Required Disk Space Not Available
Possible Cause: Reservation Requirements Not Met
Possible Cause: log.rotateSize Set to Low Value
Resetting Migrate.Enabled
vSphere vMotion Problem 2
Possible Cause: vSphere DRS Configuration
Possible Cause: Configuration Problems
Lab 7: Troubleshooting Cluster Problems
Review of Learner Objectives
Key Points
MODULE 8 Troubleshooting vCenter Server and ESXi
You Are Here
Importance
Learner Objectives
Review of vSphere 6.x Deployment Modes
vCenter Server Deployment Options
Platform Services Controller Deployment Options
Review of vCenter Single Sign-On
VMware CA
VMware Certificate Store
Trust and Certificates (1)
Trust and Certificates (2)
Chain of Trust (1)
Chain of Trust (2)
Chain of Trust (3)
Multinode Chains of Trust
Certificate Problem
vCenter Server Problem 1
vCenter Server Problem 2
Use the VMware Appliance Management Console
Growth of the vCenter Server Database
vCenter Server Database Tables That Typically Grow
Rollup Jobs Control Growth
Query the Status of Rollup Jobs on MS SQL Server
Verifying the Size of the Database Tables
Resolving Performance Data Growth Issues
PostgreSQL Database Out of Space
Set the Statistics Level
Modify the Database Settings
Reinitializing the vCenter Server Database
Other PostgreSQL Troubleshooting
Accessing the vCenter Server Appliance Shell
Configuring Access Settings
Log in to the Appliance Shell
Querying Service Status and Restarting Services
Using API Commands and Plug-Ins from the Appliance Shell
ESXi Problem 1
Verifying That the ESXi Host Has Crashed
Recovering from a Purple Diagnostic Screen Crash
ESXi Problem 2
Verifying That the ESXi Host Has Stopped Responding
Recovering from an ESXi Host Failure
Lab 9: Managing the PostgreSQL Database
Lab 10: Troubleshooting vCenter Server and ESXi Host Problems
Lab 11: (Optional) Working with Certificates
Review of Learner Objectives
Key Points (1)
Key Points (2)
Module 1
VMware vSphere:
Troubleshooting Workshop 6.5
Importance
Slide 1-2
By the end of this course, you should be able to meet the following
objectives:
• Use VMware vSphere® Web Client, the command line, and log files to
configure or diagnose and correct problems in vSphere
• Troubleshoot networking problems
• Troubleshoot storage problems
• Troubleshoot VMware vSphere® High Availability problems
• Troubleshoot VMware vSphere® Distributed Resource Scheduler™ problems
• Troubleshoot VMware vSphere® vMotion® problems
• Troubleshoot VMware vCenter Server® problems
• Troubleshoot VMware vCenter® Single Sign-On and certificate problems
• Troubleshoot VMware ESXi™ host problems
• Troubleshoot virtual machine problems
1. Course Introduction
2. Introduction to Troubleshooting
3. Troubleshooting Tools
4. Troubleshooting Virtual Networking
5. Troubleshooting Storage
6. Troubleshooting vSphere Clusters
7. Troubleshooting Virtual Machines
8. Troubleshooting vCenter Server and ESXi
Title: Location
vSphere Troubleshooting: http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-troubleshooting-guide.pdf
vCenter Server and Host Management: http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-host-management-guide.pdf
vSphere Virtual Machine Administration: http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-virtual-machine-admin-guide.pdf
vSphere Networking: http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-networking-guide.pdf
vSphere Security: http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-security-guide.pdf
vSphere Resource Management: http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-resource-management-guide.pdf
vSphere Availability: http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-availability-guide.pdf
vSphere Installation and Setup: http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-installation-setup-guide.pdf
vSphere Platform Services Controller Administration Guide: http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-platform-services-controller-administration-guide.pdf
vSphere Monitoring and Performance: http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-monitoring-performance-guide.pdf
vSphere Command-Line Interface Documentation: https://www.vmware.com/support/developer/vcli/
VMware vSphere 6.5 Documentation Center: https://pubs.vmware.com/vsphere-65/index.jsp
vSphere Management Assistant Guide for vSphere 6.5: http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-management-assistant-65-guide.pdf
Configuration Maximums for vSphere 6.5: http://www.vmware.com/pdf/vsphere6/r65/vsphere-65-configuration-maximums.pdf
The VMware Certified Implementation Expert 6 - Data Center Virtualization certification (VCIX6-
DCV) program tests candidates on two skill sets. The design exam portion of the certification tests
candidates on their ability to design a VMware vSphere® 6.x solution in both single and multisite
environments. Candidates should have a strong understanding of vSphere 6.x core components and
their relation to the data center, including virtual storage and networking technologies and their
relation to physical data center resources.
The deployment exam portion of the certification tests candidates on their ability to administer a
vSphere 6.x data center. Candidates should be capable of working with large and complex
virtualized data centers and demonstrate technical leadership with vSphere 6.x technologies.
Candidates must be capable of using automation tools, implementing virtualized environments, and
administering all vSphere 6.x enterprise components.
The VCIX6-DCV certification is also an entry point to the prestigious VMware Certified Design
Expert 6 certification.
The training in this course covers the troubleshooting objectives found in the VCIX6-DCV
deployment exam.
Module 2
You Are Here
Slide 2-2
1. Course Introduction
2. Introduction to Troubleshooting
3. Troubleshooting Tools
4. Troubleshooting Virtual Networking
5. Troubleshooting Storage
6. Troubleshooting vSphere Clusters
7. Troubleshooting Virtual Machines
8. Troubleshooting vCenter Server and ESXi
You can quickly identify, diagnose, and solve a problem if you use an
efficient troubleshooting methodology in a consistent and repeatable
manner.
By the end of this module, you should be able to meet the following
objectives:
• Define the scope of troubleshooting
• Use a structured approach to solve configuration and operational problems
• Apply troubleshooting methodology to logically diagnose faults and improve
troubleshooting efficiency
The troubleshooting process begins when a user reports a problem. In this context, the user is
anyone using the system, from an end user to an administrator. The problem reported by the user
might not be the problem. A user might be reporting symptoms of the problem.
An observed problem might be directly causing the symptoms, but typically the problem has a more
fundamental cause.
A system consists of several components, both software and hardware. For example, a VMware®
ESXi™ host consists of components such as CPU, memory, storage, networking, and hypervisor
software. A virtual machine consists of various components, such as one or more applications, a
guest operating system, and virtual hardware. A problem that occurs in a system can disrupt and
negatively affect production services that were functioning normally.
This course concentrates on configuration and operational issues.
Usability is about whether users can complete tasks and achieve goals with the given product.
Usability is also about the amount of effort (often measured in time) that is required by a user to
perform a certain task.
Accuracy is about a system's precision and the system's ability to repeatedly show the same results
under unchanged conditions.
Reliability can be defined in terms of whether a system consistently produces correct outputs up to
some given time. Reliability is enhanced by system features that help avoid and detect problems.
Reliability is often defined in business service-level agreements (SLAs) in the form of availability.
Performance is also defined in terms of an SLA. An SLA establishes performance and reliability
requirements for applications. An SLA enables tracking and analyzing the achieved performance
and reliability to ensure that those requirements are met. A performance problem exists when an
application fails to meet its SLA. Depending on the SLA, the failure might be in the form of
excessively long response times or an unacceptable length of time when the system was unavailable.
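The availability form of an SLA can be illustrated with a small calculation. The following is a minimal sketch, assuming a hypothetical 99.9 percent availability target and illustrative downtime figures; none of these values come from the course material.

```python
def availability_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Return achieved availability as a percentage of the measurement period."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be between 0 and total_minutes")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def meets_sla(total_minutes: float, downtime_minutes: float,
              target_percent: float) -> bool:
    """Return True when achieved availability reaches the SLA target."""
    return availability_percent(total_minutes, downtime_minutes) >= target_percent


if __name__ == "__main__":
    minutes_in_30_days = 30 * 24 * 60  # hypothetical measurement period
    # Assumed target of 99.9 percent; 30 minutes of downtime stays within it.
    print(meets_sla(minutes_in_30_days, 30, 99.9))
```

With these assumed numbers, 30 minutes of downtime in a 30-day period meets the target, while 60 minutes does not.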
Although performance is a predominant symptom in reported problems, this course does not focus
on performance issues. Performance troubleshooting is covered in the VMware vSphere: Optimize
and Scale course.
One or more LUNs on a storage array are not visible to a specific ESXi host. | The LUNs that are not visible are not presented correctly to the ESXi host.
Pathing failure | Network has failed between the ESXi host and the storage array. No redundant path available.
You cannot connect to vCenter Server with vSphere Web Client. | The VirtualCenter Server service failed to start. | vCenter Server Appliance has a corrupt database.
Problems can arise in any computing environment. Complex application behaviors, changing
demands, and shared infrastructure can lead to problems arising in previously stable environments.
Troubleshooting problems requires an understanding of the interactions between the software and
hardware components of a computing environment. Moving to a virtualized computing environment
adds new software layers and new types of interactions that must be considered when
troubleshooting problems.
Proper troubleshooting requires starting with a broad view of the computing environment and
systematically narrowing the scope of the investigation as possible sources of problems are
eliminated. Troubleshooting efforts that start with a narrowly conceived idea of the source of a
problem often get stuck in detailed analysis of one component, when the real source of the problem
is elsewhere in the infrastructure. To quickly isolate the source of a problem, you must adhere to a
logical troubleshooting methodology that avoids preconceptions about the source of the
problem.
View diagnostic messages that were generated by the problem. If diagnostic information does not
appear in the GUI or in an event viewer, then check the appropriate log files for useful entries.
Use the information in the diagnostic messages to help focus on the area of the system that is most
likely causing the problem. For example, the user received an error message when powering on a
virtual machine. The error message indicates that the datastore on which the virtual machine is
located has insufficient disk space. This information tells you to focus on the storage component
instead of, for example, on the networking component. The rest of the error message indicates that
the virtual machine's swap file cannot be extended because no space is left on the disk.
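One way to picture how a diagnostic message narrows the focus area is a simple keyword lookup. This is only a sketch: the keyword table and the `focus_areas` function are hypothetical illustrations, not part of any VMware tool.

```python
from typing import List

# Hypothetical keyword-to-component table; extend it for your own environment.
FOCUS_KEYWORDS = {
    "storage": ("datastore", "disk space", "swap file", "lun"),
    "networking": ("network", "vmkernel", "timeout", "connect"),
}


def focus_areas(message: str) -> List[str]:
    """Return the components whose keywords appear in the diagnostic message."""
    text = message.lower()
    return [area for area, words in FOCUS_KEYWORDS.items()
            if any(word in text for word in words)]


if __name__ == "__main__":
    error = ("Insufficient disk space on datastore. "
             "The swap file cannot be extended.")
    print(focus_areas(error))
```

For the power-on error described above, the lookup points at the storage component rather than networking.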
In a VMware virtual environment, the root cause of a problem can occur in any one of the virtual
components. Knowing where to start looking for the root cause is often not obvious. Thus, gathering
as much information as you can about the problem can help determine which virtual component to
check first.
You might take one of the following troubleshooting approaches:
• Top-down: Start troubleshooting in the guest operating system first, then work your way down
the stack to the virtual machine, then to the ESXi host, and finally to the hardware.
• Bottom-up: Start troubleshooting at the hardware level first, then work your way up the stack to
the ESXi host, then to the virtual machine, and finally to the guest operating system.
• Approach cause by halves: Start troubleshooting at the middle of the stack. For example, start
with the virtual machine and test possible causes. The test results determine whether you should
continue troubleshooting up the stack or down the stack.
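The three approaches differ only in the order in which the layers of the stack are inspected. The sketch below models that ordering in Python; the function names and the four-layer list are illustrative assumptions based on the layers named in the list above.

```python
from typing import List

# Layers of the stack, ordered from top to bottom.
STACK = ["guest operating system", "virtual machine", "ESXi host", "hardware"]


def inspection_order(approach: str) -> List[str]:
    """Return the full inspection order for the top-down or bottom-up approach."""
    if approach == "top-down":
        return list(STACK)
    if approach == "bottom-up":
        return list(reversed(STACK))
    raise ValueError("approach must be 'top-down' or 'bottom-up'")


def halves_next(start: str, cause_is_lower: bool) -> List[str]:
    """For the halves approach, return the layers still to inspect after
    testing the start layer, nearest layer first."""
    index = STACK.index(start)
    if cause_is_lower:
        return STACK[index + 1:]
    return STACK[:index][::-1]


if __name__ == "__main__":
    print(inspection_order("bottom-up"))
    print(halves_next("virtual machine", cause_is_lower=True))
```

In the halves case, the test result at the starting layer decides which half of the stack remains to be checked.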
Possible Causes
Application or Guest OS: Problem is triggered by an operation (snapshot or vSphere vMotion
migration) performed on the virtual machine.
Virtual Machine: Limit and share values are misconfigured on the virtual machine.
General virtual infrastructure knowledge and knowledge of your specific system configuration are
very helpful in identifying possible causes. Prioritize the list of possible causes, ordering them from
most probable to least probable. Then test each possible cause to determine the most likely cause of
the problem, called the root cause.
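The prioritize-then-test step can be sketched as a short loop. This is a minimal illustration: the probability estimates and test callables are hypothetical placeholders that you would supply from your own knowledge of the environment.

```python
from typing import Callable, List, Optional, Tuple

# (description, estimated probability, test that returns True if confirmed)
Cause = Tuple[str, float, Callable[[], bool]]


def find_root_cause(causes: List[Cause]) -> Optional[str]:
    """Test causes from most to least probable; return the first confirmed one."""
    ordered = sorted(causes, key=lambda cause: cause[1], reverse=True)
    for description, _probability, test in ordered:
        if test():
            return description
    return None  # No listed cause was confirmed; widen the investigation.


if __name__ == "__main__":
    # Hypothetical causes and stand-in tests for a nonresponsive virtual machine.
    candidates = [
        ("Limit and share values misconfigured", 0.3, lambda: True),
        ("Operation on the virtual machine triggered the problem", 0.6, lambda: False),
    ]
    print(find_root_cause(candidates))
```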
In the example, the problem is that a virtual machine has stopped responding. In a nonresponsive
system, the operating system seems to be paralyzed and no error messages appear. However, the
operating system is still running. Such problems might require guidance from documents, such as
VMware knowledge base articles. For example, to troubleshoot a virtual machine that has stopped
responding, see VMware knowledge base article 1007819 at http://kb.vmware.com/kb/1007819.
For this problem, you might take a top-down approach. Start with the operations performed on the
virtual machine, check the virtual machine configuration, and check for sufficient resources on the
host where the virtual machine is located.
After identifying the root cause, assess the impact of the problem on
operations:
• High impact: Resolve immediately.
• Medium impact: Resolve when possible.
• Low impact: Resolve during next maintenance window.
Identify possible solutions and their impact on the vSphere environment:
• Short-term solution: Workaround.
• Long-term solution: Reconfiguration.
• Impact analysis: Assess the impact of the solution on operations.
Resolve the problem by implementing the most effective solution.
After identifying the root cause, resolve the problem. To resolve the problem, you identify possible
solutions to the problem, then implement a solution.
In determining the best solution, assess the impact that the problem has on normal operations. For
example, if the problem causes business-critical applications to be inaccessible, then the impact of
the problem is high, and immediate resolution is necessary.
When identifying possible solutions, you might decide to first implement a short-term fix so that
systems can be brought back online quickly. Before implementing the short-term solution, document
all changes that you have made to the system from the time the problem occurred. Also, back up
your log files from the time the problem occurred. Some short-term solutions can be destructive and
truncate important log information necessary for additional assistance.
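Backing up logs before applying a short-term fix can be as simple as copying the log directory to a timestamped folder. The following is a sketch under stated assumptions: both directory paths are supplied by the caller, and the `logs-<timestamp>` folder name is an arbitrary convention chosen for this example.

```python
import shutil
import time
from pathlib import Path


def back_up_logs(log_dir: str, backup_root: str) -> Path:
    """Copy log_dir into a new timestamped folder under backup_root.

    Returns the path of the new backup folder.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    destination = Path(backup_root) / f"logs-{stamp}"
    # copytree creates the destination (and missing parents) and fails if it
    # already exists, so an earlier backup is never overwritten.
    shutil.copytree(log_dir, destination)
    return destination
```

Keeping each backup in its own timestamped folder makes it easier to match log files to the change notes written at the same time.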
Eventually, you want to implement a more permanent, long-term solution to prevent the problem
from happening again.
Scenario:
• You attempt to migrate the virtual machine named VM01 from the host named
ESXi01 to the host named ESXi02. After waiting a couple of minutes, the
vSphere vMotion migration fails with an error.
Is this failure a vSphere vMotion problem or a symptom of an underlying
problem?
• The error message will provide additional information.
In the example, you use the troubleshooting methodology to diagnose a VMware vSphere®
vMotion® migration problem. You use VMware vSphere® Web Client to perform a vSphere
vMotion migration, but the migration fails with an error.
At this point, you cannot tell whether the problem is specific to vSphere vMotion or whether the
problem is in the underlying infrastructure, such as storage or networking.
To pinpoint the problem area, gather information about the problem, starting with any diagnostic
messages displayed in vSphere Web Client.
[Screenshot: vSphere Web Client task details for the failed Relocate virtual machine task, showing the error stack for the migration.]
Root problem: IP address assigned to the VMkernel port on the vMotion network is in the wrong subnet.
vSphere Web Client shows the following error messages for the failed vSphere vMotion migration
task:
• A general system error occurred: The vSphere vMotion migrations failed because the ESXi
hosts were not able to connect over the vSphere vMotion network. Check the vSphere vMotion
network settings and physical network configuration.
• vSphere vMotion migration failed to create a connection with remote host 172.20.13.52: The
ESXi hosts failed to connect over the vSphere vMotion network.
• Migration failed to connect to remote host 172.20.13.52 from host 172.20.12.51: Timeout.
The IP addresses refer to the vSphere vMotion VMkernel interfaces on the remote host
(ESXi02) and the local host (ESXi01).
• The vSphere vMotion migration failed because the destination host did not receive data from
the source host on the vSphere vMotion network. Verify that your vSphere vMotion network
settings and physical network configuration are correct.
The first error message in the stack is helpful and tells you to check the vSphere vMotion network
settings and physical network configuration. Not all error messages are this helpful.
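A quick way to check whether the two VMkernel addresses in the error messages share a subnet is Python's standard `ipaddress` module. This sketch assumes a /24 prefix length, which the error messages do not state; substitute the netmask configured on your vSphere vMotion network. Addresses in different subnets are a problem only when the vMotion network is not routed between them.

```python
import ipaddress


def same_subnet(ip_a: str, ip_b: str, prefix_length: int) -> bool:
    """Return True if both addresses fall in the same network for the prefix."""
    net_a = ipaddress.ip_interface(f"{ip_a}/{prefix_length}").network
    net_b = ipaddress.ip_interface(f"{ip_b}/{prefix_length}").network
    return net_a == net_b


if __name__ == "__main__":
    # Addresses from the error messages; the /24 prefix is an assumption.
    print(same_subnet("172.20.12.51", "172.20.13.52", 24))
```

Under the assumed /24 prefix the two interfaces are in different subnets, which is consistent with a timeout between them.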
Possible Causes
The error message points to connectivity issues with the network named vMotion, with the
following possible causes:
• vSphere vMotion is misconfigured.
• Network connectivity between ESXi01 and ESXi02 is down.
• The vSphere vMotion VMkernel interface connectivity between ESXi01 and ESXi02 is down.
When you initiate vSphere vMotion migration, several compatibility checks are performed before
the migration is initiated. Thus, you can eliminate possible causes such as vSphere vMotion not
being enabled or incompatible CPUs, because these configuration items are checked before the
migration begins.
[Flowchart: After a possible cause is tested, the Yes branch leads to "Further investigation necessary" and the No branch leads to "Test next possible cause."]
Test each possible cause and eliminate possible causes to determine the root cause.
First, use the ping command to test network connectivity between the hosts. For example, from
ESXi01, ping ESXi02.
If the ping command fails, then investigate why the ping is failing. For example, the ping might fail
because of a network misconfiguration or faulty physical hardware. Make a change to your
environment and try the ping again.
After the ping is successful, test the vSphere vMotion migration. If the migration is successful, then
you have identified the root cause of the problem. If the migration is not successful, then test the
next possible cause in the list. If the ping command is successful, then you know that network
connectivity exists between the two hosts.
Test the VMkernel interface connectivity. You use the ping command for this test too. From one
host, run the ping command, pointing to the VMkernel interface that you want to check on the
target host. For example, from ESXi01, use the ping command to ping the vSphere vMotion
VMkernel interface on ESXi02 (172.20.13.52).
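The text uses the ping command; on an ESXi host, vmkping is the variant that sends the request from a chosen VMkernel interface. The sketch below only builds the argument list and does not run anything. The interface name `vmk1` and the count of 3 are assumptions for illustration; use the VMkernel interface that carries vSphere vMotion traffic on your host.

```python
from typing import List


def vmkping_command(target_ip: str, interface: str = "vmk1",
                    count: int = 3) -> List[str]:
    """Build the argument list for a vmkping test; nothing is executed.

    -I selects the outgoing VMkernel interface and -c sets the packet count.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return ["vmkping", "-I", interface, "-c", str(count), target_ip]


if __name__ == "__main__":
    # Target address taken from the example above; vmk1 is an assumed name.
    print(" ".join(vmkping_command("172.20.13.52")))
```

The resulting command line can be run in vSphere ESXi Shell on the source host.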
When you have identified the root cause, identify possible solutions to fix the problem. The impact
that the problem has on normal operations (high, medium, or low) determines how quickly the
solution should be implemented.
Finally, determine the appropriate type of solution for this problem. You might implement a short-
term solution so that the system works normally. Document all changes that you made to the system
since the problem occurred. Also, back up your log files from the time the problem occurred,
because logs rotate and might not be available at a future time.
Module 3: Troubleshooting Tools
You Are Here
1. Course Introduction
2. Introduction to Troubleshooting
3. Troubleshooting Tools
4. Troubleshooting Virtual Networking
5. Troubleshooting Storage
6. Troubleshooting vSphere Clusters
7. Troubleshooting vCenter Server and ESXi
8. Troubleshooting Virtual Machines
Knowing how to use the right tools to solve various types of problems
can save time and improve your troubleshooting results.
The GUI, the command line, the log files, and VMware vRealize® Log
Insight™ can help you analyze problems and guide you toward
resolution.
By the end of this lesson, you should be able to meet the following
objectives:
• Discuss the various methods to run commands
• Discuss the various ways to access VMware vSphere® ESXi™ Shell
• Use commands to view, configure, and manage your vSphere components
VMware vSphere® ESXi™ Shell includes a set of fully supported ESXCLI commands and a set of
commands for diagnosing and managing ESXi hosts. Be familiar with vSphere ESXi Shell in case
VMware Technical Support directs you to use it.
The esxcfg-* commands are included in the VMware vSphere® Command-Line Interface (vCLI)
package, mainly for compatibility reasons. Although the esxcfg-* commands are still
available, they have been deprecated. VMware recommends that you use the newer ESXCLI
commands instead.
The vCLI command set allows you to run common system administration and configuration tasks
against vSphere systems from an administration server of your choice. The vCLI package can be
installed on supported operating systems, such as Windows and Linux.
Direct Console User Interface
An ESXi system includes a direct console that enables you to start and stop the system and to
perform a limited set of maintenance and troubleshooting tasks. The Direct Console User Interface
(DCUI) includes vSphere ESXi Shell, which is disabled by default. You can enable vSphere ESXi
Shell in the DCUI or through VMware vSphere® Client™ or vSphere Web Client.
To access vSphere ESXi Shell locally, you require physical access to the DCUI and administrator
privileges. Local users that are assigned to the administrator group automatically have local shell
access.
To remotely access vSphere ESXi Shell, you enable the SSH service. However, you should enable
SSH access only for a limited time. SSH should never be left open on an ESXi host in a production
environment. Enabling SSH creates a security vulnerability and reduces ESXi resources.
Perform the following procedure to enable shell and SSH access in vSphere Web Client:
1. Select the ESXi host.
2. Click Configure.
3. Click Security Profile.
4. Scroll down to Services and click Edit.
5. Start the Shell and SSH services.
For more information about methods of accessing vSphere ESXi Shell, see vSphere Command-Line
Interface Documentation at https://www.vmware.com/support/developer/vcli.
The Availability timeout setting determines how long SSH and vSphere ESXi Shell remain
enabled:
• The default value is 0, which means that SSH and vSphere ESXi Shell remain enabled until
manually disabled.
• A value of 1 or higher determines how many minutes (in the DCUI) or seconds (in vSphere
Web Client) the services remain enabled before being automatically disabled.
If the Idle Timeout setting is configured, local and remote users are automatically logged out if their
sessions are idle for the defined period:
• The default value is 0 and sessions are not logged out automatically.
• A value of 1 or higher determines how long an idle session remains active before being
automatically logged out. This value is measured in minutes in the DCUI and in seconds in
vSphere Web Client.
Both options can be configured in the DCUI when the services are disabled. In
vSphere Web Client, the services must be restarted after changing these values.
The ESXCLI commands are part of the vCLI command set. The ESXCLI commands are a
comprehensive set of commands for managing most aspects of the vSphere environment.
Eventually, the ESXCLI command set will replace other command sets that are part of vCLI.
Help is available at all levels of the ESXCLI command set. For example, to see the namespaces
available with ESXCLI, enter esxcli. A list of available namespaces appears.
To determine the options available in the network namespace, enter esxcli network. A list of
available options for the network namespace appears.
To determine the configuration options available for firewalls, enter esxcli network firewall.
Each level displays command format help and options available for the namespace.
For more information about ESXCLI commands and their descriptions, see vSphere Command-Line
Interface Reference at https://www.vmware.com/support/developer/vcli/.
You use the esxcli storage command to retrieve storage information,
including multipathing configuration, LUN specifics, and datastore settings.
Available Namespaces:
  core        VMware core storage commands.
  nfs         Operations to create, manage, and remove Network Attached Storage
              filesystems.
  nfs41       Operations to create, manage, and remove NFS v4.1 filesystems.
  nmp         VMware Native Multipath Plugin (NMP). This is the VMware default
              implementation of the Pluggable Storage Architecture.
  san         IO device management operations to the SAN devices on the system.
  vflash      virtual flash Management Operations on the system.
  vmfs        VMFS operations.
  vvol        Operations pertaining to Virtual Volumes
  filesystem  Operations pertaining to filesystems, also known as datastores, on
              the ESX host.
  iofilter    IOFilter related commands.
• san: Provides display and reset options for the available types of adapter, including Fibre
Channel, iSCSI, Fibre Channel over Ethernet (FCoE), and SAS.
• vmfs: Provides you the option of upgrading a VMFS3 datastore to VMFS5 and using the
command line to manage snapshots and extents.
• filesystem: File system operations include mounting, unmounting, rescanning, listing, and
performing an automount on VMware vSphere® VMFS and NFS datastores.
• nfs: Provides a way to add, remove, and list NFS datastores using the command line.
You can use esxcli network commands to display physical and virtual
network information.
Available Namespaces:
  firewall    A set of commands for firewall related operations
  ip          Operations that can be performed on vmknics
  multicast   Operations having to do with multicast
  nic         Operations having to do with the configuration of Network
              Interface Card and getting and updating the NIC settings.
  port        Commands to get information about a port
  sriovnic    Operations having to do with the configuration of SRIOV enabled
              Network Interface Card and getting and updating the NIC settings.
  vm          A set of commands for VM related operations
  vswitch     Commands to list and manipulate Virtual Switches on an ESX host.
  diag        Operations pertaining to network diagnostics
• ip: Enables you to view and configure properties of the VMkernel interfaces, including DNS,
Internet Protocol Security (IPsec), and route information
• nic: Provides a command-line interface for physical NIC operations including enabling and
disabling the adapter, setting some general options, and listing the current NIC setup
• port: Provides the ability to filter and get port statistics
• vm: Lists networking information for virtual machines that have active ports and lists ports used
by virtual machines
• vswitch: Provides command-line options for standard and distributed switches
Available Namespaces:
  policy      Commands to manipulate network policy settings governing the given
              virtual switch.
  portgroup   Commands to list and manipulate Port Groups on an ESX host.
  uplink      Commands to add and remove uplink on given virtual switch.
Available Commands:
  add         Add a new virtual switch to the ESXi networking system.
  list        List the virtual switches current on the ESXi host.
  remove      Remove a virtual switch from the ESXi networking system.
  set         This command sets the MTU size and CDP status of a given virtual
              switch.
You can use the esxcli network vswitch standard namespace to create a virtual switch, map physical
adapters to it, create port groups on the switch, and configure port group and switch
policies.
Although you cannot create a distributed switch from the command line, you can
use the esxcli command to list recorded distributed switch information.
Available Namespaces:
  lacp        A set of commands for LACP related operations
Available Commands:
  list        List the VMware vSphere Distributed Switch currently configured on
              the ESXi host.
[root@esxi-a-01:~] esxcli network vswitch dvs vmware list
LabVDS
   Name: LabVDS
   VDS ID: 5c 09 2c 50 89 a7 10 5d-26 f8 b1 bd 1d 9d 26 de
   Class: etherswitch
   Num Ports: 1536
   Used Ports: 23
   Configured Ports: 512
   MTU: 1500
   CDP Status: listen
   Beacon Timeout: -1
   Uplinks: vmnic7, vmnic2, vmnic5, vmnic0, vmnic6, vmnic4, vmnic3, vmnic1
   VMware Branded: true
The esxcli network vswitch dvs namespace enables you to list the distributed switches in
your environment and to get the details of your LACP or VXLAN configurations.
No method is available to create a distributed switch using the command line. Using vSphere Web
Client is the preferred way to create a distributed switch. However, you use the vicfg-vswitch
command to add, modify, and remove uplinks to existing distributed switches.
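When you capture the output of esxcli network vswitch dvs vmware list for later comparison, it can help to turn it into key-value pairs. The following is a minimal sketch. It assumes that the switch name is unindented and that each property line is indented and uses the "Key: Value" form, as in the LabVDS listing. The function name parse_dvs_list is hypothetical.

```python
def parse_dvs_list(output: str) -> dict:
    """Parse captured 'esxcli network vswitch dvs vmware list' text.

    Assumption: switch names are unindented and property lines are indented
    'Key: Value' pairs. Returns {switch_name: {key: value}} with string values.
    """
    switches = {}
    current = None
    for line in output.splitlines():
        if not line.strip():
            continue
        if not line.startswith(" "):
            # An unindented line starts a new switch section.
            current = switches.setdefault(line.strip(), {})
        elif current is not None and ":" in line:
            key, _, value = line.strip().partition(":")
            current[key.strip()] = value.strip()
    return switches


SAMPLE = """LabVDS
   Name: LabVDS
   VDS ID: 5c 09 2c 50 89 a7 10 5d-26 f8 b1 bd 1d 9d 26 de
   Class: etherswitch
   Num Ports: 1536
   Used Ports: 23
   Configured Ports: 512
   MTU: 1500
   CDP Status: listen
   Beacon Timeout: -1
   Uplinks: vmnic7, vmnic2, vmnic5, vmnic0, vmnic6, vmnic4, vmnic3, vmnic1
   VMware Branded: true
"""

dvs = parse_dvs_list(SAMPLE)["LabVDS"]
free_ports = int(dvs["Num Ports"]) - int(dvs["Used Ports"])
print(dvs["MTU"], free_ports)
```

Keeping the values as strings avoids guessing at types. Convert only the fields that you need, such as the port counts used here.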
Available Namespaces:
  cpu          CPU information.
  ipmi         IPMI information.
  smartcard    Smart card subsystem.
  usb          VMware USB Plugin.
  bootdevice   Boot device information.
  clock        Interaction with the hardware clock.
  memory       Memory information.
  pci          PCI device information and configuration.
  platform     Platform information.
  trustedboot  Information about the status of trusted boot.
The esxcli hardware command provides a method for viewing the hardware configuration of an
ESXi host. The hardware namespace provides a method for viewing server information. You can
also set the system clock. You can enable or disable hyperthreading with the esxcli hardware
cpu global set command.
By the end of this lesson, you should be able to meet the following
objectives:
• Use the vSphere Management Assistant virtual appliance
• Use commands to view, configure, and manage your vSphere components
• Identify the tool for command-line interface troubleshooting
VMware vSphere® Management Assistant is a downloadable virtual appliance that includes several
components, including vCLI. vSphere Management Assistant enables administrators to run scripts
or agents that interact with ESXi hosts and vCenter Server systems without having to authenticate
each time.
The vSphere Management Assistant authentication interface enables users and applications to
authenticate with the target servers by using vi-fastpass or Active Directory (AD). While adding
a server as a target, the administrator can determine whether the target must use vi-fastpass or
AD authentication. For vi-fastpass authentication, the credentials that a user has on the vCenter
Server system or ESXi host are stored in a local credential store. For AD authentication, the user is
authenticated with an AD server.
When you add an ESXi host as a fastpass target server, vi-fastpass creates two users with
obfuscated passwords (in an unreadable format) on the target server and stores the password
information on vSphere Management Assistant:
• vi-admin with administrator privileges
• vi-user with read-only privileges
When using vSphere Management Assistant to manage ESXi hosts and vCenter Server systems
without using AD, vSphere Management Assistant stores credentials in its credential store. Adding
vSphere Management Assistant, ESXi hosts, and vCenter Server systems to an AD domain is more
secure because the credentials are stored in AD.
Configuring vSphere to use AD also has the advantage of using a single security model for both
virtual and nonvirtual environments.
Before you configure vSphere Management Assistant for use with AD, verify that the following
prerequisites are met:
• The DNS server configured for vSphere Management Assistant is the same as the DNS server
of the domain.
You can change the DNS server by using vSphere Management Assistant Console or the Web
User Interface.
• The domain is accessible from vSphere Management Assistant.
After you run sudo domainjoin-cli join and authenticate by using a domain administrator
user name and password, vSphere Management Assistant becomes a member of the domain. The
domainjoin-cli command also adds entries in the /etc/hosts file with the fully qualified
domain name of vSphere Management Assistant.
vicfg-*
• You can use vicfg-* commands to manage your storage, network, and host
configuration.
• For example, you can run the vicfg-vmknic -l command to display the IP
information of your VMkernel interfaces.
vi-admin@sa-vma-01:~[sa-esxi-01.vclass.local]> vicfg-vmknic -l
Interface  Port Group/DVPort  IP family  IP Address    Netmask        MAC Addre
vmk0       0                  IPv4       172.20.10.51  255.255.255.0  00:50:56:
vmk3       46                 IPv4       172.20.13.51  255.255.255.0  00:50:56:
vmk4       52                 IPv4       172.20.13.61  255.255.255.0  00:50:56:
vmk1       29                 IPv4       172.20.12.51  255.255.255.0  00:50:56:
vmk2       33                 IPv4       172.20.12.61  255.255.255.0  00:50:56:
In addition to the ESXCLI commands, the vCLI command set also includes a set of commands with
the vicfg- prefix. For more information about each of these commands, use the --help option
with the vicfg command, for example, vicfg-route --help.
The vmware-cmd command is dedicated to performing operations on virtual machines.
Most operations that can be done using vSphere Web Client can also be done using vmware-cmd.
Getting help with vmware-cmd is similar to getting help with esxcli. On the command line, enter
vmware-cmd to display a list of available options and syntax help.
The path to the .vmx file must be provided in the command line for the command to work for a
specific virtual machine. For example, to unregister a virtual machine using vmware-cmd, you enter
vmware-cmd path_to_the_.vmx_file unregister.
The vmware-cmd -l command lists the virtual machines that are located on
the target host according to the path to their .vmx file.
/vmfs/volumes/54f7fff9-757c9064-548b-005056011403/linux-a-01/linux-a-01.vmx
/vmfs/volumes/560e8ea9-e4b51cd0-4c0d-005056011403/linux-a-02/linux-a-02.vmx
/vmfs/volumes/560e8ea9-e4b51cd0-4c0d-005056011403/linux-a-03/linux-a-03.vmx
/vmfs/volumes/560d0f97-f4de674a-fed0-005056011403/linux-a-04/linux-a-04.vmx
Assuming that the target host is already set, you can enter vmware-cmd -l on the command line to
list the registered virtual machines that reside on the target ESXi host. The command output displays
the physical path to the .vmx file, including the universally unique identifier (UUID) of the VMFS
datastore.
When a VMFS datastore is created, a UUID is created for the datastore. The UUID is used
internally by the ESXi host to uniquely identify a datastore. An alias is created when the datastore is
created, which is the label assigned to the VMFS datastore. The label provides a logical name that
you can provide to identify the datastore. Using the logical name as part of the .vmx file path makes
it easier to run vmware-cmd.
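The relationship between the UUID form and the label form of a .vmx path can be illustrated with a short sketch. The helper name use_datastore_label and the UUID-to-label mapping are assumptions for illustration only. On a real host, the label is simply the name that was assigned to the datastore.

```python
def use_datastore_label(vmx_path: str, labels: dict) -> str:
    """Swap the datastore UUID in a /vmfs/volumes path for its label.

    `labels` maps datastore UUIDs to labels. Paths whose volume component is
    not in the mapping, or that are not under /vmfs/volumes, are returned as is.
    """
    prefix = "/vmfs/volumes/"
    if not vmx_path.startswith(prefix):
        return vmx_path
    volume, sep, rest = vmx_path[len(prefix):].partition("/")
    return prefix + labels.get(volume, volume) + sep + rest


# Hypothetical mapping of one datastore UUID to its label.
labels = {"4f870db6-5ed5460c-e0c7-005056370612": "Shared"}
physical = "/vmfs/volumes/4f870db6-5ed5460c-e0c7-005056370612/Win02-A/Win02-A.vmx"
print(use_datastore_label(physical, labels))
```

Both forms identify the same file. The label form is only easier to read and type.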
You can specify the path to the .vmx file in the following ways:
• The physical path to the .vmx file:
/vmfs/volumes/4f870db6-5ed5460c-e0c7-005056370612/Win02-A/Win02-A.vmx
• Replacing the UUID with the datastore label:
/vmfs/volumes/Shared/Win02-A/Win02-A.vmx
• Using brackets and quotation marks to represent the path:
"[Shared] Win02-A/Win02-A.vmx"
You can use the vmware-cmd .vmx_file_path hassnapshot command to
determine whether a virtual machine is currently using a snapshot. You can also
use the command to perform snapshot operations.
Here, the = 1 indicates that the virtual machine is currently using a snapshot.
vi-admin@sa-vma-01:~[sa-esxi-01.vclass.local]> vmware-cmd /vmfs/volumes/560d0f97-f4de674a-fed0-005056011403/linux-a-04/linux-a-04.vmx hassnapshot
hassnapshot() = 1
The hassnapshot option of the vmware-cmd command provides a command-line option to
determine whether a virtual machine is using a snapshot. If the command returns 0, a snapshot is not
present. If the command returns 1, the virtual machine is using a snapshot.
Other command-line options for snapshots include creating snapshots (createsnapshot),
reverting snapshots (revertsnapshot), and removing snapshots (removesnapshots).
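If you collect hassnapshot results from a script, the 0 or 1 result can be read from the captured output. The following is a minimal sketch that assumes the output has the "hassnapshot() = 1" form shown for vmware-cmd. The function name parse_hassnapshot is hypothetical.

```python
import re


def parse_hassnapshot(output: str) -> bool:
    """Return True when captured vmware-cmd output reports a snapshot.

    Assumes the output contains 'hassnapshot() = 0' or 'hassnapshot() = 1'.
    """
    match = re.search(r"hassnapshot\(\)\s*=\s*([01])", output)
    if match is None:
        raise ValueError(f"unexpected hassnapshot output: {output!r}")
    return match.group(1) == "1"


print(parse_hassnapshot("hassnapshot() = 1"))
```

Raising an error on unrecognized output, rather than returning False, avoids treating a failed command as "no snapshot".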
By the end of this lesson, you should be able to meet the following
objectives:
• Find important log files
• Use VMware vSphere® Syslog Collector
• Use vRealize Log Insight for log aggregation, log analysis, and log search
Most of the vCenter Server log files are on vCenter Server Appliance in
the /var/log/vmware/ directory.
• Subdirectories exist for vCenter Server components. For example, a
vpxd-svcs subdirectory exists for vCenter Server services logs.
• Some logs, such as FirstBoot.log, are located in /var/log.
root@sa-vcsa-01 [ /var/log/vmware ]# ls
applmgmt         rsyslogd       vmdird
applmgmt-audit   rsyslogd-2068  vmdnsd
cis-license      rsyslogd-2078  vmon
cloudvm          sca            vmware-imagebuilder
cm               sso            vmware-sps
content-library  syslog         vmware-updatemgr
eam              vapi           vpostgres
journal          vcha           vpxd
mbcs             vctop          vpxd-svcs
netdumper        vmafd          vsan-health
perfcharts       vmafdd         vsm
psc-client       vmcad          vsphere-client
rbd              vmcam          vsphere-ui
rhttpproxy       vmdir
For more information about the location of vCenter Server log files, see VMware knowledge base
article 2110014 at http://kb.vmware.com/kb/2110014.
Log              Description
Auto Deploy      VMware vSphere Auto Deploy Waiter
content-library  VMware Content Library Service
EAM              VMware ESX Agent Manager
InvSvc           VMware Inventory Service
vmcam            VMware vSphere Authentication Proxy
vpxd             VMware VirtualCenter Server
vPostgres        vFabric Postgres database service
Log              Description
cis-license      VMware Licensing Service
sso              VMware Secure Token Service
vmcad            VMware Certificate Authority daemon
The vpxd.log file is the main log file for vCenter Server. Most vCenter
Server actions are captured in vpxd.log. It is located in
/var/log/vmware/vpxd/.
2017-03-03Tl7: 37: 30. 809Z into vpxd[ 7f'D78 F66C700] (Originator@ 6876 sub•vpxLro opID•urn : vmomi :Virtua1Hachine :vm-128: S96e2 6b2-df48-ic88-b39c-eb702
ba52e66 .properties :01-3d] [VpxLRO} -- BEGIN lro-2'1393 -- ResourceHodel -- cis.data.provider.Re:!lourceHodel.query -- 5266dfBc-2t36-e3c4-ld59-c47"l
66b43db 1 (52723at4-e7c1- 92 cS-02 cS-4108 l 4 0b6ba2)
2017-03 -0JT l 7: 37: 30. 8102 info vpxd[7FD7BF66C700] [Originat.or8 6876 sub•vpxLro opID•ur n :vmomi: Virtual Machine :vm-126 : 596e2 6b2-df1B-ic88-b39c-eb702
ba52e66 .properties:O l-3d] [VpxLRO] -- FINISH lro-24393
2017-03 - 0JTl 7 : 37: 30. 813 Z info vpxd[7FD7BF66C700] [Originator@ 6876 sub•vpxLro o pID• urn : vmomi :VirtualHachine :vm-128 : 596e2 6b2- d!48 - 4cBB-b39c- eb702
ba52e66. properties : 0 1-e!] (VpxLRO] -- BEGIN l ro- 2439 4 -- Resource Model -- cis. data . provider. Resource Model. query -- 52 66d!8c- 2!3 6- e3c4- ld59- c::474
66b43db l ( 52?23a! 4-e7c1-92 c5-02 c::8- 4108140b6ba2)
2017-03-03Tl7 : 37: 30. 813 2 i nfo vpxd[7F D7BF66C700] [Originator@ 6876 sub• v pxLro opID•urn : vmomi :Virtual Machine :vm-128: 596e2 6b2-dt48-1c88-b39c-eb702
ba52e66. properties: 0 1-e:t] [VpxLRO] -- FINIS H lro-2 4391
2017- 03-03Tl7: 37: 30. 8152 info vpxd[7FD7BFE7C700] [Originator@ 6876 s ub • vpxLro opID• uc-n : vmomi :VirtuslHachine :vm-128: 596e2 6b2-d.t48-1c::B8-b39c::- e b702
bs52e66. propertie:!!I: 01-d!] (VpxLRO] -- BEGIN lro-21395 -- ResourceHodel -- c::is, data. provide r. Resouc-c::eHodel. query -- 52 6 6d!8c-2f3 6-e3c4-ld59-c::474
66b43db 1 (5272 3 at 4-e7c 1-92 c5-02 ca-4108110b6ba2)
2017-03-03Tl 7: 37: 30. 615Z into vpxd[7FD78FE7C700] [Originator@ 6676 s ub•vpxLro opID•urn: vmomi :Virtual Machine :vm-128 : 596e2 6b2-dt48-4c88-b39c-eb702
ba52e66. properties: 01-d.t] [VpxLRO} -- f INISH l ro-2 4 395
2017-03-0JT l 7: 37: 30. 816Z into vpxd[7FD7BFE7C700] [Originator9 6876 sub•vpxLro opID•urn :vmomi :VirtualHachine :vm-128 : 596e2 6b2-dt48-4c::88-b39c-eb702
ba52e66. properties: 01-67] [VpxLRO] -- BEGIN lro-21396 -- Resource Model -- cis. data. provider. Re:!!lourceKode l. query -- 52 66dt8c-2t3 6-e3c4-ld59-c474
66b4 3db l ( 52723 at1-e7cl-92 c5-02c8-4108140b6ba2)
2017-03-03Tl 7 : 37 : 30 . 8162 into vpxd(7FD78FE7C?OOJ [orioi n ator @6876 sub•vpxLro opID•urn : vroomi :Virtual Machine : vm- 128 : 596e2 6b2-d!i8-1c88-b39c-eb702
h•<;? •1' 1' n r n n • r r i • • · ni -1'?1 rHnvl.Ot"'ll -- J"TllfT<:;M l r n- ? 4'"lQ1'
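Entries in vpxd.log that belong to the same operation share an lro identifier, with one BEGIN entry and one FINISH entry. The following sketch pairs them to estimate how long each operation took. It assumes the timestamp, level, and "-- BEGIN lro-N" layout seen in the vpxd.log excerpt. The sample lines use a shortened opID for readability, and the function name lro_durations is hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: "<timestamp> <level> vpxd[<thread>] ... -- BEGIN|FINISH lro-<n>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+) (?P<level>\w+) vpxd\[(?P<thread>[0-9A-F]+)\] "
    r".*-- (?P<phase>BEGIN|FINISH) (?P<lro>lro-\d+)"
)


def lro_durations(lines):
    """Return {lro_id: seconds between its BEGIN and FINISH entries}."""
    started, durations = {}, {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a BEGIN/FINISH entry
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S.%fZ")
        if m["phase"] == "BEGIN":
            started[m["lro"]] = ts
        elif m["lro"] in started:
            durations[m["lro"]] = (ts - started.pop(m["lro"])).total_seconds()
    return durations


# Sample entries with an abbreviated opID (illustrative only).
SAMPLE = [
    "2017-03-03T17:37:30.809Z info vpxd[7FD7BF66C700] [Originator@6876 sub=vpxLro "
    "opID=01-3d] [VpxLRO] -- BEGIN lro-24393 -- ResourceModel -- "
    "cis.data.provider.ResourceModel.query",
    "2017-03-03T17:37:30.810Z info vpxd[7FD7BF66C700] [Originator@6876 sub=vpxLro "
    "opID=01-3d] [VpxLRO] -- FINISH lro-24393",
]
durations = lro_durations(SAMPLE)
print(durations)
```

Operations with a BEGIN entry but no FINISH entry stay out of the result, which makes long-running or stuck operations easy to spot by comparing against the list of BEGIN identifiers.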
You can use vSphere Web Client to view log files on vCenter Server
systems. Use the Monitor tab for vCenter Server systems.
The ESXi host log files are on the ESXi host at /var/log.
[root@sa-esxi-01:/var/log] ls
Xorg.log              ipmi                  vitd.log
auth.log              jumpstart-stdout.log  vmauthd.log
boot.gz               kickstart.log         vmkdevmgr.log
clomd.log             lacp.log              vmkernel.log
configRP.log          nfcd.log              vmkeventd.log
ddecomd.log           osfsd.log             vmksummary.log
dhclient.log          rabbitmqproxy.log     vmkwarning.log
epd.log               rhttpproxy.log        vmware
esxcli-software.log   sdrsinjector.log      vmware-vmsvc.log
esxupdate.log         shell.log             vobd.log
fdm.log               smbios.bin            vprobe.log
hbrca.log             storagerm.log         vpxa.log
hostd-probe.log       swapobjd.log          vsanmgmt.log
hostd.log             sysboot.log           vsansystem.log
hostdCgiServer.log    syslog.log            vsanvpd.log
hostprofiletrace.log  tallylog              vvold.log
iofilter-init.log     upitd.log
iofiltervpd.log       usb.log
Most ESXi host log files are under the /var/log directory.
If persistent scratch space is configured, many of these logs are on the scratch volume,
/scratch/log. The /var/log directory contains symbolic links (which are identified as light
blue names in the output) to log files in /scratch/log. Run the ls -l command to see the log
files that these symbolic links point to.
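The same information that ls -l shows for symbolic links can be gathered programmatically, for example when you inspect a copy of a log directory on another system. This is a minimal sketch using only the Python standard library. The function name log_link_targets is hypothetical, and the directory to inspect is passed in by the caller.

```python
import os


def log_link_targets(log_dir: str) -> dict:
    """Map each symbolic link in log_dir to the path that it points to."""
    targets = {}
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if os.path.islink(path):
            # readlink returns the link target without following it further.
            targets[name] = os.readlink(path)
    return targets
```

For example, calling log_link_targets("/var/log") on a host with persistent scratch space would be expected to return entries whose targets are under /scratch/log, while regular files are skipped.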
ESXi hosts write to multiple log files, depending on which action is being
performed.
vmksummary.log   A summary of ESXi host startup and shutdown, and
                 an hourly heartbeat with uptime, number of virtual
                 machines running, and service resource
                 consumption
The table provides the name and description of ESXi host log files that are useful for
troubleshooting.
You can use the DCUI to view log files if vCenter Server is not available.
Only the log files for a single ESXi host can be viewed in the DCUI.
In the ESXi host console, press F2 and log in to the DCUI, using the root user name and password.
Press the appropriate key to search the log file (/), request help (H), or quit viewing the log file (Q).
vSphere Syslog Collector provides a single location for all ESXi hosts to
write log files.
vSphere Syslog Collector enables logs from multiple hosts to be
combined and provides a structure for network logging.
vSphere Syslog Collector is preinstalled on vCenter Server Appliance.
By default, the vSphere Syslog Collector server uses port 514 for TCP
and UDP, and port 1514 for SSL.
VMware vSphere® Syslog Collector provides a unified architecture for system logging. With
vSphere Syslog Collector, you can direct logs from ESXi hosts to a server on the network, rather
than save the logs to a local disk.
You can search and filter log events by specifying the keywords, time
range, field operations, and so on.
Often, without intelligent log analysis tools, you do not notice log errors until they affect production
workloads and the business. With VMware vRealize® Log Insight™, you can uncover patterns that
can ultimately lead to problems and take action when these patterns arise.
You can enter any complete keywords, globs, or phrases in the search text box to find only events
that contain the specified keywords. You can search for log events that match certain values of
specific fields. Time ranges are inclusive when filtering. In the query example shown in the slide,
you can see log entries related to SCSI performance for the past 24-hour period. Filtering log events
helps you narrow down the troubleshooting scope by displaying only relevant information.
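The filtering idea, keywords combined with an inclusive time range, can be illustrated outside the product. The sketch below is not how vRealize Log Insight is implemented. It only shows the logic on a hypothetical list of (timestamp, text) events, and the function name filter_events is an assumption.

```python
from datetime import datetime


def filter_events(events, keywords, start, end):
    """Keep events that contain every keyword and fall inside [start, end].

    Both ends of the time range are inclusive, and keyword matching is
    case-insensitive.
    """
    wanted = [k.lower() for k in keywords]
    return [
        (ts, text)
        for ts, text in events
        if start <= ts <= end and all(k in text.lower() for k in wanted)
    ]


# Hypothetical events for illustration.
events = [
    (datetime(2017, 3, 3, 9, 0), "SCSI device performance has deteriorated"),
    (datetime(2017, 3, 3, 10, 0), "User root logged in"),
    (datetime(2017, 3, 1, 9, 0), "scsi performance has improved"),
]
hits = filter_events(
    events, ["scsi", "performance"],
    start=datetime(2017, 3, 2, 9, 0), end=datetime(2017, 3, 3, 9, 0),
)
print(len(hits))
```

Only the first event matches: the second lacks the keywords, and the third falls outside the time range.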
You can use the list of existing fields to search log events with specific values for a field. You can
also search the list of log events for events that occurred before, after, and around an event in the
list. You can analyze log events for trends and anomalies.
vRealize Log Insight enables a troubleshooter to view the context of a log event and browse the log
events that arrived before and after it. If you want to know more about the status of your
environment before and after an event, you can check the surrounding events.
You can take snapshots of queries and save them to dashboards.
You can select different chart types to graphically analyze log events.
You can modify the aggregation and grouping of query results to
correlate events and make the chart meaningful for troubleshooting.
vRealize Log Insight provides faster analytical queries and aggregation than traditional tools,
especially on larger data sets. vRealize Log Insight adds structure to all types of unstructured log
data, so that administrators can troubleshoot quickly, without needing to know the data beforehand.
You can select different chart types to change the way data is visualized on the Interactive Analytics
page. Different chart types require different aggregation functions, the use of time series, and group-
by fields. You can use multifunction charts to compare variables that are not the same scale. You can
change how charts look, add charts to your custom dashboards, and manage dashboard charts.
When facing a large environment with numerous log events, you can
locate the data fields that are important to you, extract them, and save them.
In a large environment with numerous log events, you cannot always locate the data fields that are
important to you. vRealize Log Insight provides runtime field extraction to address this problem.
You can extract any field dynamically from the data by providing a regular expression.
In the example shown on the slide, the log search was based on the scsi performance deteriorated
keywords. You can now highlight the number representing a higher latency value and then extract
the field.
Field extraction is a very flexible feature that takes any event logs and turns them into structured
data, which you can include in your analysis.
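The idea behind extracting a field with a regular expression can be tried on a single event. The sketch below is an illustration only: vRealize Log Insight performs the extraction itself, and the sample event, device identifier, and field names (device, avg_us, now_us) are hypothetical.

```python
import re

# Named groups play the role of extracted fields.
LATENCY_RE = re.compile(
    r"(?P<device>naa\.[0-9a-f]+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<avg_us>\d+) microseconds "
    r"to (?P<now_us>\d+) microseconds"
)


def extract_latency(event: str):
    """Return the extracted fields as a dict, or None if the event does not match."""
    m = LATENCY_RE.search(event)
    if m is None:
        return None
    return {
        "device": m["device"],
        "avg_us": int(m["avg_us"]),
        "now_us": int(m["now_us"]),
    }


# Hypothetical event text for illustration.
EVENT = (
    "Jun 7 11:26:10 esx-03a.corp.local scsiCorrelator: Device "
    "naa.600508b1001c0000 performance has deteriorated. "
    "I/O latency increased from average value of 1343 microseconds "
    "to 666666 microseconds."
)
fields = extract_latency(EVENT)
print(fields)
```

Once the latency is available as a number, it can be charted or compared against a threshold like any other structured field.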
vRealize Log Insight Dashboards are collections of chart, field table, and
query list widgets.
You may customize dashboards by adding, modifying, and deleting them
and tailor them to your troubleshooting needs.
For example, you can save a filtered query to your custom dashboard by
creating a query list widget.
You can run specific alert queries at scheduled intervals. When the query
exceeds the preconfigured threshold, alerts can be sent to your email or
VMware vRealize® Operations Manager™.
You can be proactive by configuring vRealize Log Insight to run specific queries at scheduled
intervals.
If the number of events that match the query exceeds the thresholds that you have set, vRealize Log
Insight can send email notifications and trigger notification events in VMware vRealize®
Operations Manager™.
For example, the spooler service is stuck in a state of flux and needs to be fixed. You can create a
query that captures this pattern, and send an alert over to the vRealize Operations dashboard for the
affected virtual machine named Win7-01a. Thus, the operations team becomes aware of and
resolves the problem in a timely fashion.
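The threshold rule behind such an alert, "more than N matches in the last M minutes", can be expressed in a few lines. This sketch only illustrates the rule. The function name should_alert, the default window, and the sample timestamps are assumptions and do not describe how vRealize Log Insight evaluates alerts internally.

```python
from datetime import datetime, timedelta


def should_alert(match_times, now, window=timedelta(minutes=5), threshold=1):
    """Return True when more than `threshold` matches fall in the trailing window."""
    recent = [t for t in match_times if now - window <= t <= now]
    return len(recent) > threshold


# Hypothetical match timestamps for illustration.
now = datetime(2017, 3, 3, 12, 0)
matches = [
    now - timedelta(minutes=1),
    now - timedelta(minutes=2),
    now - timedelta(minutes=10),  # outside the 5-minute window
]
print(should_alert(matches, now, threshold=1))
```

With a threshold of 1, the two recent matches trigger the alert. Raising the threshold to 2 would not, because the third match is older than the window.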
• You can use vSphere ESXi Shell and vSphere Management Assistant to run
commands.
• You can use vCLI in vSphere Management Assistant to view and troubleshoot
the system configuration.
• Log files are useful when trying to resolve vSphere problems.
• vSphere offers various features that enable you to collect, search, and export
log files.
• vSphere Syslog Collector provides a single location for all ESXi hosts to write
log files.
• vRealize Log Insight provides a single location to collect, store, and analyze all
types of machine-generated log data.
• vRealize Log Insight provides fast analytical queries and aggregation,
especially on large data sets. As a result, the troubleshooting time is reduced
and operational efficiency is improved.
Questions?
Module 4
You Are Here
1. Course Introduction
2. Introduction to Troubleshooting
3. Troubleshooting Tools
4. Troubleshooting Virtual Networking
5. Troubleshooting Storage
6. Troubleshooting vSphere Clusters
7. Troubleshooting Virtual Machines
8. Troubleshooting vCenter Server and ESXi
By the end of this module, you should be able to meet the following
objectives:
• Provide a network troubleshooting overview
• Analyze and troubleshoot standard switch problems
• Analyze and troubleshoot virtual machine connectivity problems
• Analyze and troubleshoot management network problems
• Analyze and troubleshoot distributed switch problems
Networks are used to access or control nearly every component in a vSphere environment. Virtual
switches, whether standard or distributed, provide ESXi host networking capabilities. Virtual
machines connect to virtual switches for access to both internal and external networks.
The management networks also use virtual switches for connectivity and are very important. Loss of
management network connectivity prevents important functions from occurring, such as allowing
ESXi hosts to be managed by VMware vCenter Server™.
[Figure: an ESXi host whose virtual switches connect virtual NICs, the management network, and IP storage.]
All network communication that is handled by a host passes through one or more virtual switches. A
virtual switch provides connections for virtual machines to communicate with one another, whether
they run on the same host or on a different host. A virtual switch allows connections for the
management and migration networks as well as connections to access IP storage.
You can also configure settings on virtual machine port groups such as allowing promiscuous
behavior. And you can team uplinks to aggregate bandwidth.
As an initial check from vSphere ESXi Shell, ping a system that is known
to be up and accessible by the ESXi host.
[Screenshot: vSphere ESXi Shell session showing the login banner, followed by a ping command run against an IP address.]
If your ESXi host experiences intermittent or no network connectivity, then you must first try to ping a
system from your ESXi host. Choose a system that is active and that your ESXi host can access.
You can use an SSH client, such as PuTTY, to log in to your ESXi host and get to the command line.
Ensure that the SSH service is enabled in your ESXi host's security profile.
If you cannot open a PuTTY session, you can always use the Direct Console User Interface (DCUI) to
get a command line (Alt+F1 from the main DCUI screen). Ensure that vSphere ESXi Shell is enabled.
If you know that your hardware is functioning correctly, take the top-down
approach to troubleshooting, starting with the ESXi host configuration.
Possible Causes
When identifying possible causes, take a structured approach. For this issue, you might start with the
ESXi host. Check the host's configuration. If the host's configuration is correct, then check for
hardware problems.
For information about how to troubleshoot ESXi hosts that have intermittent or no network
connectivity, see VMware knowledge base article 1004109 at http://kb.vmware.com/kb/1004109.
Verify that the components in your ESXi network configuration are configured correctly.
From vSphere Management Assistant, use the vicfg-vswitch command to list information about
your standard and distributed switches, your vmnics, and your port groups.
From vSphere ESXi Shell, use the esxcli command to list information about each of your port
groups and their assigned VLAN IDs.
The first command output in the slide demonstrates the vmnic0 uplink available to both port groups.
The second command output in the slide demonstrates a VMkernel port that is manually disabled
with the esxcfg-vmknic command. To re-enable the VMkernel port, use the esxcfg-vmknic -e
command.
From vSphere Management Assistant, use the vicfg-nics command to check the network
adapter's speed and duplex as well as the link status.
The command output in the slide demonstrates a vmnic that is manually brought down by the
esxcli network nic down -n command. You can manually bring the vmnic up. For example,
to bring up vmnic2, use the esxcli network nic up -n vmnic2 command.
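When many adapters are listed, it can help to filter the listing for adapters that are not up. The following is a minimal Python sketch, not a VMware tool. It assumes that the text of esxcli network nic list has already been captured into a string, and that the columns appear in the order shown on the slide (Name, PCI Device, Driver, Admin Status, Link Status, Speed, Duplex, MAC Address, MTU, Description). The sample text and the function names are illustrative assumptions.

```python
# Illustrative sample only: the column order follows the slide, the values for
# the "down" adapter are made up for this sketch.
SAMPLE_OUTPUT = """\
Name    PCI Device    Driver  Admin Status  Link Status  Speed  Duplex  MAC Address        MTU   Description
------  ------------  ------  ------------  -----------  -----  ------  -----------------  ----  -----------
vmnic0  0000:02:00.0  e1000   Up            Up            1000  Full    00:50:56:01:c1:cb  1500  Intel Corporation 82545EM Gigabit Ethernet Controller (Copper)
vmnic2  0000:02:02.0  e1000   Down          Down             0  Half    00:50:56:01:c1:cd  1500  Intel Corporation 82545EM Gigabit Ethernet Controller (Copper)
"""


def parse_nic_list(text):
    """Turn the captured listing into a list of dictionaries, one per vmnic."""
    nics = []
    for line in text.splitlines():
        # The first nine columns contain no spaces; the description is the rest.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].startswith("vmnic"):
            continue  # skip the header, the separator row, and blank lines
        name, pci, driver, admin, link, speed, duplex, mac, mtu, desc = fields
        nics.append({
            "name": name, "pci": pci, "driver": driver,
            "admin": admin, "link": link, "speed": int(speed),
            "duplex": duplex, "mac": mac, "mtu": int(mtu), "description": desc,
        })
    return nics


def nics_needing_attention(nics):
    """Return the names of adapters whose admin or link status is not Up."""
    return [n["name"] for n in nics if n["admin"] != "Up" or n["link"] != "Up"]


if __name__ == "__main__":
    print(nics_needing_attention(parse_nic_list(SAMPLE_OUTPUT)))
```

An adapter that appears in the result can then be brought up with the esxcli network nic up -n command described above.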
To edit your ESXi network configuration, you can use the same commands: vicfg-vswitch,
esxcli, and vicfg-nics.
For example, to add a virtual switch named vSwitch5, run the following command from vSphere
Management Assistant:
vicfg-vswitch -a vSwitch5
To add a port group named Production to vSwitch5, run the following command from vSphere
Management Assistant:
vicfg-vswitch -A Production vSwitch5
To add the uplink, vmnic4, to the standard switch named vSwitch5, run the following command
from vSphere Management Assistant:
vicfg-vswitch -L vmnic4 vSwitch5
To set the VLAN ID of the port group Production to ID 34, run the following command from the
ESXi command line:
esxcli network vswitch standard portgroup set -p Production -v 34
To set the speed of vmnic3 to 10,000 Mbps and the duplex to full, run the following command from
vSphere Management Assistant:
vicfg-nics -s 10000 -d full vmnic3
You can also use -a to set the speed and duplex settings to autonegotiate.
[Screenshot: teaming and failover settings for a port group, listing uplinks that include standby uplinks.]
When setting up NIC teaming, you can configure settings such as the load balancing policy and the
failover order.
If you are using NIC teaming on the virtual switch, verify that the physical switch ports are
configured consistently for each teamed network adapter. Also verify that the proper load-balancing
policy is configured on the virtual switch. VMware recommends that you use the default
load-balancing policy, Route Based On The Originating Virtual Port ID. If link aggregation is
configured on the physical switch, use the load-balancing policy Route Based On IP Hash.
To use some adapters but reserve others for emergencies, you can use the Failover Order conditions
to specify how to distribute the workload for the network adapters:
• Active adapters: Continue to use the adapter when the network adapter connectivity is available
and active.
• Standby adapters: Use this adapter if the connectivity of one of the active adapters is unavailable.
• Unused adapters: Do not use this adapter.
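The failover order can be pictured with a small model. The following Python sketch is a simplification written for this manual's example, not the VMkernel's actual algorithm. It assumes that each failed active adapter is replaced by the next healthy standby adapter in list order, and that unused adapters never carry traffic. The function name and the vmnic names are hypothetical.

```python
def pick_uplinks(active, standby, link_up):
    """Return the uplinks that would carry traffic in this simplified model.

    active, standby: adapter names in failover order.
    link_up: dict mapping adapter name to True (link available) or False.
    Unused adapters are simply not passed in, so they are never selected.
    """
    in_use = [u for u in active if link_up.get(u, False)]
    failed_count = len(active) - len(in_use)
    healthy_standby = [u for u in standby if link_up.get(u, False)]
    # Assumption: one standby adapter takes over for each failed active adapter.
    in_use.extend(healthy_standby[:failed_count])
    return in_use


if __name__ == "__main__":
    links = {"vmnic0": True, "vmnic1": False, "vmnic2": True, "vmnic3": True}
    print(pick_uplinks(["vmnic0", "vmnic1"], ["vmnic2", "vmnic3"], links))
```

In the usage example, vmnic1 has lost its link, so the first standby adapter, vmnic2, joins vmnic0. When all active adapters are healthy, no standby adapter is selected.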
Verify that you are not encountering the following ESXi network hardware
issues:
• The network adapter or server hardware is not supported:
- vicfg-nics -l
- Verify that the network hardware is listed in VMware Compatibility Guide.
• The physical hardware is faulty or misconfigured:
- esxcfg-vswitch, vicfg-vswitch, or esxcli
[Screenshot: output of vicfg-nics -l and esxcli network nic list, run from vSphere Management Assistant against sa-esxi-01.vclass.local. Both commands list vmnic0 through vmnic7 with the e1000 driver, link status Up, 1000 Mbps, full duplex, and the description Intel Corporation 82545EM Gigabit Ethernet Controller (Copper). The esxcli output also shows an Admin Status of Up and an MTU of 1500 for each adapter.]
From vSphere Management Assistant, use the vicfg-nics command to view the model of your
network adapters. Compare your hardware information to the network I/O device list in the VMware
Compatibility Guide. To use the VMware Compatibility Guide, go to
http://www.vmware.com/resources/compatibility.
If the command output contains no lines at all for a card that has been added to the server, then
you must rule out faulty hardware:
• If the network adapter is an add-in, reseat the adapter or move it to an alternate PCI slot on the
server's motherboard.
• Try an alternate network card.
• Update the BIOS of the server to the latest version recommended by the manufacturer.
• Run hardware diagnostics to identify any potential hardware issues.
If you installed third-party NIC hardware that is certified and supported by the vendor, verify that
vendor-specific drivers have been loaded correctly into the VMkernel after ESXi is installed.
[Screenshot: advanced performance chart for the host, plotted over time with a performance chart legend.]
[Figure: a virtual machine connects to a virtual switch, whose uplink ports connect to the physical NICs.]
Virtual machine connectivity is achieved through multiple layers of networking. A virtual network
provides networking for virtual machines. The fundamental component of a virtual network is a
virtual switch. A virtual switch is a software construct, implemented in the VMkernel, that provides
networking connectivity for virtual machines that run on an ESXi host.
When two or more virtual machines are connected to the same virtual switch, network traffic among
them is forwarded locally. If an uplink adapter (physical Ethernet adapter) is attached to the virtual
switch, each virtual machine can access the external network that the adapter is connected to.
If you find that no network connectivity exists to a virtual machine, the first test is to try to ping the
virtual machine from another system to verify this behavior. Ping the virtual machine's name. If the
ping fails, then ping the virtual machine's IP address. If the ping is successful, then the problem
might be with the application accessing the network.
You might also want to determine whether loss of network connectivity is being experienced by
other virtual machines on the same network.
Possible Causes
Based on the results of the initial ping test, if the ping is successful, then ensure that the application
accessing the network is not encountering problems.
If the ping is not successful, then take a top-down troubleshooting approach to identify possible
causes. Troubleshoot the guest operating system first, then troubleshoot the virtual machine, then the
ESXi host.
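The decision flow above can be written down as a short function. The following Python sketch is only an illustration of the order in which to investigate. The function name is hypothetical, and the name resolution branch is an assumption that goes beyond the text: it reflects the common case where a ping by IP address works but a ping by name does not.

```python
def triage_vm_ping(name_ping_ok, ip_ping_ok):
    """Return the areas to investigate, in order, from the two ping results."""
    if name_ping_ok:
        # The network path works end to end.
        return ["application accessing the network"]
    if ip_ping_ok:
        # Assumption: the address answers but the name does not, so name
        # resolution is a likely suspect before looking at the application.
        return ["name resolution", "application accessing the network"]
    # Both pings fail: take the top-down approach described above.
    return ["guest operating system", "virtual machine", "ESXi host"]


if __name__ == "__main__":
    for result in [(True, True), (False, True), (False, False)]:
        print(result, "->", triage_vm_ping(*result))
```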
For information about how to troubleshoot virtual machine network connection issues, see VMware
knowledge base article 1003893 at http://kb.vmware.com/kb/1003893.
Incorrect TCP/IP settings, such as an incorrect IP address, subnet mask, default gateway, or DNS
servers, can cause communication problems.
To verify TCP/IP settings:
1. Run the IP configuration command.
• On a Windows system, run the ipconfig command.
• On a Linux system, run the ifconfig command.
2. If DHCP is configured, confirm that DHCP is assigning the IP address correctly by renewing
the IP address.
• On a Windows system, run the ipconfig /renew command.
• On a Linux system, renew the DHCP address with the following commands:
dhclient -r
dhclient eth0
3. If a firewall is enabled in the guest operating system, verify that it allows the traffic that the
virtual machine needs and blocks only the traffic that you intend to block.
If the root cause lies within the guest operating system (incorrect IP settings or misconfigured
firewall), use the guest operating system tools to resolve the problem.
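One incorrect setting that is easy to miss is a default gateway that lies outside the virtual machine's own subnet. The following Python sketch shows how such a check could be automated with the standard ipaddress module. The function name is hypothetical, and the addresses in the usage example are illustrative values, not settings from a lab in this course.

```python
import ipaddress


def check_ip_settings(ip, netmask, gateway):
    """Return a list of problems found in basic IPv4 settings (empty if none)."""
    iface = ipaddress.ip_interface(f"{ip}/{netmask}")
    gw = ipaddress.ip_address(gateway)
    problems = []
    if gw not in iface.network:
        problems.append(
            f"default gateway {gw} is outside the subnet {iface.network}")
    if gw == iface.ip:
        problems.append("default gateway is the same as the host address")
    return problems


if __name__ == "__main__":
    print(check_ip_settings("172.20.10.51", "255.255.255.0", "172.20.10.10"))
    print(check_ip_settings("172.20.10.51", "255.255.255.0", "172.20.11.10"))
```

The first call reports no problems. The second call reports that the gateway is outside the 172.20.10.0/24 subnet, which would prevent the guest from reaching other networks.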
The port group name that the virtual machine uses is incorrect:
• View the standard switch port group names on the ESXi host:
- vicfg-vswitch -l
• Verify that the virtual machine is using the correct port group.
The virtual network adapter is not connected to the port group:
• Verify that the network adapter is connected to the correct port group.
[Screenshot: virtual machine settings in vSphere Web Client, listing CPU, Memory, Hard disk 1, and CD/DVD drive.]
Verify that the port group names associated with the virtual machine's network adapters are on your
standard switch and distributed switch. From vSphere Management Assistant, use the
vicfg-vswitch command.
Verify that the virtual network adapters for the virtual machine are present and connected. Use the
VMware vSphere® Web Client to view the virtual machine settings. Verify that the network adapter
status is Connected.
If you want to use vSphere Management Assistant, use the following vSphere Management
Assistant command to set the status of the network adapter to Connected:
vmware-cmd -H ESXi_host_name Full_path_name_of_VM_config_file
connectdevice "Network adapter 1"
For example:
vmware-cmd -H esxi02.vclass.local /vmfs/volumes/Shared/Win01-C/Win01-C.vmx
connectdevice "Network adapter 1"
Verify that the virtual machine has no underlying storage issues and is not experiencing
resource contention, because either condition might result in networking issues with the virtual machine.
As a long-term solution, you might want to consider NIC teaming for the virtual switches that your
virtual machines are connected to. A NIC team can either share the load of traffic between physical
and virtual networks among some or all of its members, or provide passive failover in the event of a
hardware failure or a network outage.
The ESXi host is successfully added to the vCenter Server inventory but after approximately 60
seconds, vCenter Server changes the ESXi host's state to Not Responding or Disconnected.
Although the ESXi host frequently disconnects from vCenter Server, you can still use vSphere
Client to connect directly to the ESXi host.
The ESXi host sends a heartbeat to vCenter Server to signal that the
host is accessible by the management network.
[Figure: the ESXi host sends heartbeats over the management network (vmk0) to vCenter Server running on Windows.]
The ESXi host sends heartbeats every 10 seconds to vCenter Server. By default, this traffic is sent
over UDP port 902. vCenter Server has a window of 60 seconds to receive the heartbeats. If the
UDP heartbeat message is not received by vCenter Server within that window, vCenter Server treats
the host as not responding.
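The timing rule can be expressed as a small calculation. The following Python sketch only models the two figures stated above, a 10-second heartbeat interval and a 60-second window. It is not vCenter Server's implementation, and the function and constant names are hypothetical.

```python
HEARTBEAT_INTERVAL_S = 10   # the host sends a heartbeat every 10 seconds
TIMEOUT_WINDOW_S = 60       # vCenter Server waits up to 60 seconds


def host_state(last_heartbeat_s, now_s, window_s=TIMEOUT_WINDOW_S):
    """Classify the host from the time elapsed since the last heartbeat."""
    elapsed = now_s - last_heartbeat_s
    return "Not Responding" if elapsed > window_s else "Connected"


def heartbeats_per_window(interval_s=HEARTBEAT_INTERVAL_S,
                          window_s=TIMEOUT_WINDOW_S):
    """Number of heartbeats that the host sends during one timeout window."""
    return window_s // interval_s


if __name__ == "__main__":
    print(host_state(last_heartbeat_s=0, now_s=45))
    print(host_state(last_heartbeat_s=0, now_s=75))
    print(heartbeats_per_window())
```

With the default values, six consecutive heartbeats fit in one window, so a host is marked as not responding only after all of them are lost or delayed. This is why a congested network, rather than a single dropped packet, is the typical cause.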
Possible Causes
Hardware (CPU, Memory, Network, Storage): The network between ESXi and vCenter Server is congested.
If the Windows firewall is not enabled on your vCenter Server system, then begin troubleshooting at
the ESXi host.
For information about how to troubleshoot an ESXi host that frequently disconnects from vCenter
Server, see VMware knowledge base article 2020100 at http://kb.vmware.com/kb/2020100.
For information about the ports required for communications between vSphere components, see
VMware knowledge base article 2106283 at http://kb.vmware.com/kb/2106283. Also see the
information about TCP and UDP ports required to access vCenter Server, ESXi hosts, and other
network components in VMware knowledge base article 1012382 at
http://kb.vmware.com/kb/1012382.
If the firewall is enabled and UDP port 902 is blocked, view the ports
blocked by the vCenter Server Appliance firewall.
To resolve this problem, adjust the firewall settings on the vCenter Server
Appliance virtual machine:
• If ports are not configured, disable the firewall.
• If the firewall is configured to affect ports, ensure that the firewall is not
blocking UDP port 902.
Check the firewall on the vCenter Server Appliance virtual machine. If ports are not configured,
then disable the firewall. If ports are configured, then verify that network traffic is allowed to pass
from the ESXi host to the vCenter Server system. That is, verify that the firewall is not blocking
UDP port 902.
To reach the settings on your vCenter Server Appliance firewall, follow the procedure given
below.
• Modify firewall
[Screenshot: iptables rule listing on vCenter Server Appliance, with the DROP rule highlighted.]
LOG     all  --  anywhere                 anywhere     limit: avg 2/min burst 5 LOG
DROP    all  --  sa-esxi-01.vclass.local  anywhere
RETURN  all  --  anywhere                 anywhere

Chain port_filter (1 references)
target  prot opt source                   destination
ACCEPT  tcp  --  anywhere                 anywhere     tcp dpt:ldap
ACCEPT  tcp  --  anywhere                 anywhere     tcp dpt:ldaps
VMware vCenter® Server Appliance™ uses the iptables firewall. You can list the firewall tables
with the command:
iptables -L
You can list iptables firewall rules by line number in a specific table with the command:
iptables -L <table name> -n --line-numbers
Example:
iptables -L inbound -n --line-numbers
You can delete a rule by its line number. Example:
iptables -D inbound 1
After changing the firewall rules, save the rules with the command:
iptables-save
By default, the vpxa agent on the ESXi host sends heartbeats to vCenter
Server (vpxd) through UDP port 902.
A problem might exist if the host is configured to send heartbeats over a
port other than 902.
Use the less /etc/vmware/vpxa/vpxa.cfg command on the host
to determine the port that is used to send heartbeats.
[Screenshot: PuTTY session on esxi01.vclass.local, with the serverPort line highlighted.]
~ # less /etc/vmware/vpxa/vpxa.cfg
<vpxa>
<bundleVersion>1000000</bundleVersion>
<datastorePrincipal>root</datastorePrincipal>
<hostIp>172.20.10.51</hostIp>
<hostKey>521d9d38-20c7-df53-cbcd-4457cf6eae69</hostKey>
<hostPort>443</hostPort>
<licenseExpiryNotificationThreshold>15</licenseExpiryN...
<memoryCheckerTimeInSecs>30</memoryCheckerTimeInSecs>
<serverIp>172.20.10.91</serverIp>
<serverPort>9020</serverPort>
</vpxa>
<workingDir>/var/log/vmware/vpx</workingDir>
</config>
A rule in the ESXi firewall exists that allows for vCenter Server heartbeat traffic. If vCenter Server
has been configured to receive traffic over an alternate port, that traffic will be blocked.
Determine whether an ESXi host is using a port other than the default port, 902. At the ESXi host
command prompt, use the less /etc/vmware/vpxa/vpxa.cfg command to determine the port
in use. The port number in use is contained in the serverPort tags.
In this example, serverPort is set to port 9020, not the default port.
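If you need to check many hosts, the same lookup can be scripted. The following Python sketch assumes that the contents of vpxa.cfg have been copied off the host into a string and that the file is well-formed XML with a config root element that contains a vpxa element, as the closing tags on the slide suggest. The sample text is a shortened illustration, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

DEFAULT_HEARTBEAT_PORT = 902

# Shortened illustration of the file shown on the slide.
SAMPLE_CFG = """<config>
  <vpxa>
    <hostIp>172.20.10.51</hostIp>
    <serverIp>172.20.10.91</serverIp>
    <serverPort>9020</serverPort>
  </vpxa>
</config>"""


def heartbeat_port(cfg_text):
    """Return the serverPort value as an integer, or None if it is absent."""
    root = ET.fromstring(cfg_text)
    node = root.find("./vpxa/serverPort")
    if node is None or not (node.text or "").strip():
        return None
    return int(node.text)


def uses_default_port(cfg_text):
    """True only when serverPort is present and equals the default, 902."""
    return heartbeat_port(cfg_text) == DEFAULT_HEARTBEAT_PORT


if __name__ == "__main__":
    print(heartbeat_port(SAMPLE_CFG), uses_default_port(SAMPLE_CFG))
```

For the sample text, the sketch reports port 9020 and flags it as a nondefault port, which matches the example above.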
If you prefer to use a nondefault port for heartbeats, ensure that the
ESXi firewall does not block that port.
Contents of heartbeat.xml
<!-- Firewall configuration sample -->
<ConfigRoot>
  <service>
    <id>nondefheartbeat</id>
    <rule id='0000'>
      <direction>inbound</direction>
      <protocol>udp</protocol>
      <porttype>dst</porttype>
      <port>9020</port>
    </rule>
    <rule id='0001'>
      <direction>outbound</direction>
      <protocol>udp</protocol>
      <porttype>dst</porttype>
      <port>9020</port>
    </rule>
    <enabled>true</enabled>
    <required>true</required>
  </service>
</ConfigRoot>
The path to the heartbeat.xml file is
/etc/vmware/firewall/heartbeat.xml.
If you prefer to use a port other than the default port 902 for the heartbeat traffic between vCenter
Server and the ESXi host, you must configure the firewall to specifically allow traffic on that port.
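Before relying on a custom rule file, you can confirm that it covers the port in both directions. The following Python sketch assumes a rule file with the same element names as the heartbeat.xml sample on the slide (service, rule, direction, protocol, port, enabled). It only inspects text that you supply and does not query the ESXi firewall itself. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample modeled on the heartbeat.xml contents shown on the slide.
SAMPLE_RULES = """<ConfigRoot>
  <service>
    <id>nondefheartbeat</id>
    <rule id='0000'>
      <direction>inbound</direction>
      <protocol>udp</protocol>
      <porttype>dst</porttype>
      <port>9020</port>
    </rule>
    <rule id='0001'>
      <direction>outbound</direction>
      <protocol>udp</protocol>
      <porttype>dst</porttype>
      <port>9020</port>
    </rule>
    <enabled>true</enabled>
    <required>true</required>
  </service>
</ConfigRoot>"""


def directions_allowed(xml_text, protocol, port):
    """Return the set of directions that enabled rules cover for protocol/port."""
    root = ET.fromstring(xml_text)
    allowed = set()
    for service in root.iter("service"):
        if (service.findtext("enabled") or "").strip().lower() != "true":
            continue  # rules of a disabled service do not count
        for rule in service.iter("rule"):
            same_protocol = (rule.findtext("protocol") or "").strip().lower() == protocol
            same_port = (rule.findtext("port") or "").strip() == str(port)
            if same_protocol and same_port:
                allowed.add((rule.findtext("direction") or "").strip())
    return allowed


if __name__ == "__main__":
    print(sorted(directions_allowed(SAMPLE_RULES, "udp", 9020)))
```

For the sample, both inbound and outbound are reported for UDP port 9020. An empty result for the port that serverPort specifies would indicate that the rule file does not cover the heartbeat traffic.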
To add a firewall rule to the ESXi host
Check the vCenter Server configuration to verify the port number used
for heartbeats.
[Screenshot: Registry Editor on the vCenter Server system, listing REG_SZ values such as AdamLdapPort (389) and AdamSslPort (636). The highlighted heartbeat port value is 902.]
Using the default port (UDP 902) for vpxa and vpxd communication is encouraged. VMware
recommends that you configure vCenter Server and the ESXi hosts to use the default port instead of
a nondefault port. As a good practice, before changing the port number on vCenter Server, ensure
that no other application installed on it is using this port.
The long-term solution is to resolve any issues that affect network performance. You might also
consider enabling VMware vSphere® Network I/O Control if you are using distributed switches.
Network I/O Control lets you prioritize different network traffic flowing through the same pipe.
Until the network issues can be resolved, you can work around the issue by increasing the timeout
limit. Increasing the timeout limit in vCenter Server allows the ESXi host to be connected
continuously. For information about how to increase the heartbeat timeout limit, see VMware
knowledge base article 1005757 at http://kb.vmware.com/kb/1005757.
This problem can occur if the ESXi host's management network was
misconfigured or manipulated from the command line.
For example, you can bring a physical network card up or down with the
esxcli command:
esxcli network nic up -n vmnic0
esxcli network nic down -n vmnic0
esxcli network nic list
The management network is configured on every ESXi host and is used to communicate with
vCenter Server. Communication between an ESXi host and vCenter Server is critical for centrally
managing hosts through vCenter Server. The management network is also used to interact with other
hosts in a VMware vSphere® High Availability configuration. If the management network on the
host becomes unavailable or misconfigured, vCenter Server cannot connect to the host and cannot
centrally manage resources.
Host networking rollbacks occur when an invalid change is made to the host networking
configuration. Every network change that disconnects a host also triggers a rollback.
In addition to the events mentioned previously, other events that might trigger a rollback are the
following:
• Updating the VLAN of a standard port group that contains the management VMkernel network
adapter
• Increasing the MTU of the management VMkernel network adapter and its switch to values not
supported by the physical infrastructure
• Removing the management VMkernel network adapter from a standard or distributed switch
• Removing a physical NIC of a standard or distributed switch containing the management
VMkernel network adapter
[Screenshot: DCUI System Customization and Configure Management Network screens, with options that include IP Configuration, IPv6 Configuration, DNS Configuration, Custom DNS Suffixes, and Network Restore Options.]
The DCUI allows you to correct your management network settings, such as your IP configuration
and DNS configuration. From the DCUI, you can also restart your management network and test the
management network.
The Restore Network Settings option deletes all the current network settings
except for the Management network.
The Network Restore Options selection enables you to recover from management network
configuration errors on a distributed switch. The management network must be configured on a
distributed switch.
Using the DCUI is the only way to fix distributed switch configuration errors. The DCUI clones a
host local port from the existing misconfigured port and copies all VLAN ID and blocked port
information that you configured on the distributed switch. The DCUI changes the management
network to use the new host local port to restore connectivity to vCenter Server. vCenter Server
picks up the new host local port and updates its database with the new information. vCenter Server
creates a standalone port that is connected to the management network.
As a last resort, if you cannot recover your management network by fixing the management network
settings, you can revert to a default network setting. The Restore Network Settings option reverts
your entire network configuration to network system defaults. Restoring the network settings stops
all the running virtual machines on the host.
Use this option carefully. Verify that you have a record of your network configuration so that you
can recreate your production environment. If you do revert your network settings to the network
system default, you can apply a host profile, if you have one, to recreate your virtual switches.
[Figure: distributed switch architecture. vCenter Server holds the distributed switch (control plane) with its distributed ports and port groups. Each ESXi host holds a hidden virtual switch (I/O plane) that connects the virtual machines to the physical NICs (uplinks).]
Distributed switch rollbacks occur when invalid updates are made to distributed switch-related
objects, such as distributed switches, distributed port groups, or distributed ports.
In addition to the events mentioned on the slide, the following events might trigger a distributed
switch rollback:
• Blocking all ports in the distributed port group containing the management VMkernel network
adapter
• Overriding the preceding policies for the distributed port to which the management VMkernel
network adapter is connected
If you know where the conflicting configuration setting is located, you can manually correct the
setting. For example, if you incorrectly migrated a management VMkernel network adapter to a new
VLAN, the VLAN might not be trunked on the physical switch. When you correct the physical
switch configuration, the next distributed switch-to-host synchronization will resolve the
configuration issue.
Always back up your distributed switch before you make a change to its
configuration:
• If your distributed switch loses network connectivity because of a
misconfiguration, you can restore from your latest backup.
vSphere Web Client provides you with features to back up and restore
distributed switch configuration:
• Export: Back up your distributed switch configuration.
• Restore: Reset the configuration of a distributed switch from an exported
configuration file.
• Import: Create a distributed switch from an exported configuration file.
If you are troubleshooting a network connectivity problem on your distributed switch, and you are
not sure where the problem exists, you can roll back the distributed switch or distributed port group
to a previous configuration. vSphere Web Client allows you to back up your distributed switch
configuration to a file and restore the configuration if necessary.
Module 5
You Are Here
Slide 5-2
1. Course Introduction
2. Introduction to Troubleshooting
3. Troubleshooting Tools
4. Troubleshooting Virtual Networking
5. Troubleshooting Storage
6. Troubleshooting vSphere Clusters
7. Troubleshooting vCenter Server and ESXi
8. Troubleshooting Virtual Machines
By the end of this lesson, you should be able to meet the following
objectives:
• Discuss vSphere storage architecture
• Identify possible causes of problems in various types of datastores
• Analyze common storage connectivity and configuration problems and discuss
possible causes
• Solve storage connectivity problems, correct misconfigurations, and restore
LUN visibility
If a virtual machine cannot access its virtual disks, the cause of the
problem might be anywhere from the virtual machine to physical storage.
[Figure: layers of the virtual storage architecture: virtual disk, datastore type, transport, and backing.]
A virtual machine's virtual disks reside on one or more datastores. Datastores are logical containers
that are configured on ESXi hosts. Datastores provide a uniform model for storing virtual machine
files. Depending on the type of storage that you use, datastores can be formatted in different ways.
When a problem occurs with accessing storage, the root cause can be in one of the layers that form
the virtual storage architecture.
If the ESXi host has iSCSI storage connectivity issues, check the iSCSI
configuration on the ESXi host and, if necessary, the iSCSI hardware
configuration.
[Figure: ESXi host connected over the network to a disk array with the following properties.]
iSCSI target name: iqn.1992-08.com.acme:storage1
iSCSI alias: storage1
IP address: 192.168.36.101
An iSCSI storage system contains one or more LUNs and one or more storage processors (SPs).
Communication between the host and the storage array occurs over a TCP/IP network.
The ESXi host is configured with an iSCSI initiator. An initiator can be hardware-based or it can be
software-based. The software-based initiator is called the iSCSI software initiator. An initiator
resides in the ESXi host. Targets reside in the storage arrays that are supported by the ESXi host.
iSCSI arrays can use various mechanisms, including IP address, subnets, and authentication
requirements, to restrict access to targets from hosts.
Initial checks using the command line look at connectivity on the host:
• Verify that the ESXi host can see the LUN:
- esxcli storage core path list
~ # esxcli storage core path list | more
iqn.1998-01.com.vmware:esxi01-00023d000001,iqn.2003-10.com.lefthandnetworks:iscsi-mg:203:tgt0,t,1-naa.6000eb3a2b3b330e00000000000000cb
   UID: iqn.1998-01.com.vmware:esxi01-00023d000001,iqn.2003-10.com.lefthandnetworks:iscsi-mg:203:tgt0,t,1-naa.6000eb3a2b3b330e00000000000000cb
   Runtime Name: vmhba33:C0:T4:L0
   Device: naa.6000eb3a2b3b330e00000000000000cb
   Device Display Name: LEFTHAND iSCSI Disk (naa.6000eb3a2b3b330e00000000000000cb)
   Adapter: vmhba33
   Channel: 0
   Target: 4
   LUN: 0
   Plugin: NMP
The first check to perform is to see what paths to your IP storage LUNs are visible to your ESXi
host. You can use vSphere Web Client to perform this check or you can use the command line.
The esxcli storage core path list command prints a mapping between the HBAs and the
devices that they provide paths to. Use a PuTTY session to run the esxcli command on the ESXi host.
If you do not see any paths to your IP storage, then use the esxcli command to perform a rescan of
the problem adapter to try to restore LUN visibility.
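On a host with many paths, the listing is long, and a summary by adapter can show at a glance whether an adapter has lost its paths. The following Python sketch assumes that the command output has been captured as text and that each path starts with an unindented identifier line followed by indented "Key: Value" lines, as on the slide. The sample is a shortened illustration with a hypothetical path identifier, and the function names are hypothetical.

```python
from collections import Counter

# Shortened illustration; the first line stands in for the long path identifier.
SAMPLE_PATHS = """\
example-initiator,example-target,t,1-naa.6000eb3a2b3b330e00000000000000cb
   Runtime Name: vmhba33:C0:T4:L0
   Device: naa.6000eb3a2b3b330e00000000000000cb
   Adapter: vmhba33
   Channel: 0
   Target: 4
   LUN: 0
   Plugin: NMP
"""


def parse_paths(text):
    """Return one dictionary per path, keyed by the field names in the output."""
    paths = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new path record.
            current = {"Path": line.strip()}
            paths.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.strip().partition(": ")
            current[key] = value
    return paths


def paths_per_adapter(paths):
    """Count how many paths each adapter provides."""
    return Counter(p.get("Adapter", "unknown") for p in paths)


if __name__ == "__main__":
    print(dict(paths_per_adapter(parse_paths(SAMPLE_PATHS))))
```

An adapter that you expect to see but that is missing from the summary is a candidate for the rescan described above.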
If the ESXi host accessed IP storage in the past, and no recent changes
were made to the host configuration, you might take a bottom-up
approach to troubleshooting.
Possible Causes
Instead of a top-down approach, you might take a bottom-up approach and look for possible causes
at the hardware level first. Taking a bottom-up approach is especially appropriate if you know that
your ESXi hosts have been able to access datastores located on IP storage and you know that you
did not make recent changes to the IP storage configuration.
If you know that you have a solid configuration on your ESXi host, start troubleshooting by
verifying proper operation and acceptable performance of your storage hardware.
For information about troubleshooting LUN connectivity issues on ESXi hosts, see VMware
knowledge base article 1003955 at http://kb.vmware.com/kb/1003955.
For information about troubleshooting iSCSI array connectivity issues in ESXi, see VMware
knowledge base article 1003681 at http://kb.vmware.com/kb/1003681.
Check the VMware Compatibility Guide to see if the iSCSI HBA or iSCSI
storage array is supported.
Verify that the LUN is presented correctly to the ESXi host:
• The LUN is in the same storage group as all the ESXi hosts.
• The LUN is configured correctly for use with the ESXi host.
• The LUN is not set to read-only on the array.
• The host ID on the array for the ESXi LUN is 0 - 16383.
• Max LUNs per ESXi host is 512.
If the storage device is malfunctioning, use hardware diagnostic tools to
identify the faulty component.
Verify that the storage array is listed in VMware Compatibility Guide. Some array vendors have a
minimum-recommended microcode or firmware version to operate with ESXi. This information can
be obtained from the array vendor and VMware Compatibility Guide.
On the array side, ensure that the LUN IQNs and access control list (ACL) allow the ESXi host
HBAs to access the array targets.
Ensure that the host ID on the array for the LUN is in the range 0-16383. On ESXi, the
host ID appears as the LUN ID. The maximum LUN ID is 16383. Any LUN that has a host ID
greater than 16383 might not appear as available under Storage Adapters. However, on the array, the
LUN might reside in the same storage group as the other LUNs that have host IDs in the supported range.
Verify that the physical hardware is functioning correctly, including:
• The storage processors (sometimes called heads) on the array
• The storage array
• The SAN and switch configuration
For information about configuration maximums, see Configuration Maximums at
http://www.vmware.com/pdf/vsphere6/r65/vsphere-65-configuration-maximums.pdf.
[Figure: I/O latency components: Device Avg., Kernel Avg., Guest Avg.]
Make sure that your network topology does not contain Ethernet bottlenecks. Bottlenecks where
multiple links are routed through fewer links can result in oversubscription and dropped network
packets. Recovering from dropped network packets results in large performance degradation.
Isolating iSCSI and NFS traffic, or creating separate VLANs for NFS and iSCSI, is beneficial. This
separation minimizes network interference from other packet sources.
A performance problem can usually be identified and corrected by monitoring the following latency
metrics:
• DAVG/cmd: The average amount of time it takes a device (which includes the HBA, the storage
array, and everything in between) to service a single I/O request (read or write). The value is
reported in milliseconds.
• If the value is less than 10, the system is healthy. If the value is 11 through 20 (inclusive), be aware
of the situation by monitoring the value more frequently. If the value is greater than 20, this most
likely indicates a problem.
• KAVG/cmd: The average amount of time it takes the VMkernel to service a disk operation. This
number represents time spent by the CPU to manage I/O. Because processors are much faster
than disks, this value should be close to zero. A value of 1 or 2 is considered high for this metric.
• GAVG/cmd: The total latency seen from the virtual machine when performing an I/O request.
GAVG is the sum of DAVG and KAVG.
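The thresholds above can be expressed as a small helper for post-processing latency samples that you have already collected. This is a minimal sketch, not a VMware tool. The function names are hypothetical, and two boundary choices are assumptions because the text leaves them open: a DAVG of exactly 10 is grouped with the 11 through 20 band, and any KAVG of 1 or more is reported as high.

```python
def classify_davg(davg_ms):
    """Classify device latency (DAVG/cmd, milliseconds) using the thresholds in the text."""
    if davg_ms < 10:
        return "healthy"
    if davg_ms <= 20:
        # Assumption: values from 10 through 20 are all treated as "monitor more often".
        return "watch"
    return "problem"


def classify_kavg(kavg_ms):
    """KAVG/cmd should be close to zero; the text calls a value of 1 or 2 high."""
    # Assumption: anything at or above 1 ms is reported as high.
    return "high" if kavg_ms >= 1 else "ok"


def gavg(davg_ms, kavg_ms):
    """GAVG/cmd is the sum of DAVG/cmd and KAVG/cmd."""
    return davg_ms + kavg_ms


if __name__ == "__main__":
    sample = {"davg": 12.0, "kavg": 0.5}  # hypothetical sample values
    print(classify_davg(sample["davg"]), classify_kavg(sample["kavg"]),
          gavg(sample["davg"], sample["kavg"]))
```

A high KAVG with a healthy DAVG points toward the host rather than the array, so it can help to classify the two values separately before looking at GAVG.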
• If the ping command fails, ensure that the IP settings are correct.
Log in to the ESXi host and use the ping command to test connectivity between the ESXi host and
the iSCSI target. The ping command is a symbolic link to the vmkping command.
If the ping command fails, check that you are using the correct iSCSI target IP address. Also check
that the VMkernel interface used for IP storage is correct. In this example, the VMkernel port for IP
storage is vmk2, 172.20.13.52.
Verify that the iSCSI initiator name and the target address and port number of the iSCSI array are
correct.
Here the iSCSI software adapter has an initiator name of iqn.1998-01.com.vmware.esxi-a-01:3dbe31a8.
This is an automatically generated initiator name. You can manually set that name. A
common practice is to edit the name and remove the MAC address string, leaving only the node
name. The first part of the name (iqn.1998-01.com.vmware) is the same for any ESXi host
using the iSCSI software adapter.
iSCSI storage providers can be configured so that only certain specific nodes are allowed to connect.
In these cases you might have to change the automatically assigned iSCSI initiator name to
something the iSCSI storage provider is expecting.
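Because a mistyped initiator name is rejected by an array that restricts access to specific nodes, it can help to sanity-check the name before comparing it with the array's access list. The sketch below is illustrative only. The pattern is a simplified reading of the iqn.yyyy-mm.reversed-domain[:unique-string] naming convention, and the function names and the allowed-initiator set are hypothetical.

```python
import re

# Simplified pattern: "iqn." + year-month + "." + naming authority, optionally ":" + unique string.
IQN_PATTERN = re.compile(
    r"^iqn\.\d{4}-\d{2}\.[a-z0-9]+(?:[.-][a-z0-9]+)*(?::\S+)?$"
)


def looks_like_iqn(name):
    """Return True if the name has the general shape of an iSCSI qualified name."""
    return bool(IQN_PATTERN.match(name))


def initiator_allowed(name, allowed_initiators):
    """Mimic an array-side access list: exact match against the permitted initiator names."""
    return looks_like_iqn(name) and name in allowed_initiators


if __name__ == "__main__":
    host_iqn = "iqn.1998-01.com.vmware.esxi-a-01:3dbe31a8"
    acl = {"iqn.1998-01.com.vmware.esxi-a-01:3dbe31a8"}  # hypothetical array ACL
    print(looks_like_iqn(host_iqn), initiator_allowed(host_iqn, acl))
```

The exact-match comparison is deliberate: a single differing character between the host's initiator name and the name the array expects is enough for the login to be refused.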
Target addresses can be set up for dynamic or static discovery. This one is static.
The target address here is 172.20.13.12 on port number 3260. That is the IP address and port of the
iSCSI provider. Any typos in the definition of the iSCSI storage provider (IP address or port
number) will cause the connection to fail.
[Screenshot: CHAP authentication settings for the iSCSI adapter. Callout: the authentication setting
can be inherited from the ESXi host (Inherit settings from parent - vmhba65). Authentication Method:
Use bidirectional CHAP. Outgoing CHAP Credentials (target authenticates the initiator): Secret.]
[Screenshot: VMkernel port binding list. Callout (truncated in source): connected to a single port group.
IPStorage01 (LabV...): vmk3, Compliant, Active, vmnic6 (1 Gbit/s, Full).
IPStorage02 (LabV...): vmk4, Compliant, Active, vmnic7 (1 Gbit/s, Full).]
Verify that your VMkernel port bindings are configured correctly. Port binding is used in iSCSI when
multiple VMkernel ports for iSCSI reside on the same broadcast domain to allow multiple paths to an
iSCSI array that broadcasts a single IP address. When using port binding, consider the following facts:
• Array target iSCSI ports must reside on the same broadcast domain as the VMkernel port.
• All VMkernel ports must reside on the same broadcast domain.
If you configure port bindings on multiple VMkernel ports in different broadcast domains and the
target ports also reside in different broadcast domains, you might experience the following issues:
• Rescan times take longer than usual.
• An incorrect number of paths is seen per device.
• You cannot see storage from the storage device.
If you do not configure port bindings on multiple VMkernel ports that reside in the same broadcast
domain, you might experience the following symptoms:
• You cannot see storage presented to the ESXi host.
• Paths to the storage report as Dead.
• Loss of path redundancy messages are logged in vCenter Server.
For more information about considerations for using software iSCSI port bindings in ESXi, see
VMware knowledge base article 2038869 at http://kb.vmware.com/kb/2038869.
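The broadcast-domain requirement can be checked on paper before changing any binding. The following sketch treats "same broadcast domain" as "same IP subnet", which is an approximation (VLANs and switch configuration also matter). The function names, the second VMkernel address, and the /24 prefix length are assumptions for illustration; 172.20.13.52 and 172.20.13.12 are the VMkernel and target addresses used in this section.

```python
import ipaddress


def broadcast_domains(addresses, prefix_len):
    """Group IP addresses by the subnet they fall into for the given prefix length."""
    domains = {}
    for addr in addresses:
        network = ipaddress.ip_interface(f"{addr}/{prefix_len}").network
        domains.setdefault(network, []).append(addr)
    return domains


def port_binding_is_appropriate(vmk_addrs, target_addrs, prefix_len=24):
    """Port binding expects every VMkernel port and every target port in one subnet."""
    everything = list(vmk_addrs) + list(target_addrs)
    return len(broadcast_domains(everything, prefix_len)) == 1


if __name__ == "__main__":
    vmk_ports = ["172.20.13.52", "172.20.13.53"]  # second address is hypothetical
    targets = ["172.20.13.12"]
    print(port_binding_is_appropriate(vmk_ports, targets))
```

If the check reports more than one subnet, the symptoms listed above (long rescans, wrong path counts, missing storage) are the ones to expect when port binding is configured anyway.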
[Screenshot: distributed port group settings. Callouts: No traffic shaping, No VLAN, No teaming.
Advanced: Configure reset at disconnect: Enabled. Override port policies: Block ports: Allowed;
Traffic shaping, Vendor configuration, VLAN, Uplink teaming, Security policy, NetFlow, and
Traffic filtering and marking: Disabled.]
If you do not have elastic port allocation configured in the port group, it is possible to run into a "no
available ports" condition, which can prevent storage from connecting.
Port blocking, traffic shaping, VLAN configuration, and teaming can have either a positive or
negative impact on network-connected storage.
The Switch configuration tab is the first place that the MTU is reported. An incorrect MTU
configuration can have an extremely negative impact on network-connected storage.
If the network is being managed by a different administration group, contact information and other
details should be available.
Although storage adapters can be configured to use DHCP, it is best practice to use a static address.
If the network configuration (routing, gateway, DNS) is incorrect, storage performance can degrade
or storage can fail to connect. The iSCSI target can be configured by node name as well as by IP
address, but if you configure a node name, you must have a correct DNS configuration.
[Screenshot: physical network adapter details. Some adapters and configurations are not supported.]
Adapter: Intel Corporation 82545EM Gigabit Ethernet Controller (Copper)
Name: vmnic6
Location: PCI 0000:02:07.0
Driver: e1000
Status: Connected
Configured speed, Duplex: Auto negotiate
Actual speed, Duplex: 1000 Mb, Full Duplex
Networks: 172.20.111.10-172.20.111.10
DirectPath I/O status: Not supported (the physical NIC does not support DirectPath I/O)
The configuration of your network adapter must match the requirements of the physical network it is
connected to. Best practice is to set up an isolated network that is dedicated to storage.
Resolve this problem by checking paths between the host and hardware:
• Verify that the iSCSI storage array is configured properly and is active.
• Verify that a firewall is not interfering with iSCSI traffic.
You use the netcat (nc) command to verify that you can reach the iSCSI TCP port (default 3260)
on the storage array from the host.
If you receive an error message or receive no response, then verify that the iSCSI storage array is
configured to permit connections on that port.
Also, if a firewall is between your ESXi host and iSCSI storage array, check that the firewall is not
blocking connections to TCP port 3260.
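The same reachability test that nc performs can be sketched in a few lines for use from a management workstation. This is an illustration, not a replacement for running nc on the ESXi host: it tests the route from wherever the script runs, not from the VMkernel interface that carries storage traffic. The function name is hypothetical; 172.20.13.12 is the target address used earlier in this section.

```python
import socket


def iscsi_port_reachable(host, port=3260, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False


if __name__ == "__main__":
    target = "172.20.13.12"
    state = "reachable" if iscsi_port_reachable(target) else "not reachable"
    print(f"{target}:3260 is {state}")
```

A False result does not distinguish between a firewall silently dropping packets and an array that is not listening. A timeout tends to suggest the former and an immediate refusal the latter.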
The vSphere On-disk Metadata Analyzer (VOMA) is a utility for performing VMFS file system
metadata checks. This utility scans the VMFS volume metadata and highlights any inconsistencies
for which you might need to open a support request.
You perform a VOMA check on a VMFS datastore and send the results to a specific log file.
Before running VOMA, you must ensure that the following conditions are true:
• All virtual machines on the affected datastore are powered off or migrated to another datastore.
• The datastore consists of only a single extent and has been unmounted on all ESXi hosts.
VOMA must be run against a disk partition and not the disk device. If VOMA is run against a disk
device, it produces an error similar to the following:
Error: Missing LVM Magic. Disk doesn't have a valid LVM Device
Error: Failed to Initialize LVM Metadata
When the corruption is irreversible, VMware recommends that you restore the datastore files from a
backup or consult a data recovery organization. VMware does not perform data recovery.
For more information about using VOMA, see VMware knowledge base article 2036767 at
http://kb.vmware.com/kb/2036767.
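The partition requirement is easy to miss, so a wrapper can refuse to build the command when the device path has no partition suffix. This is a hedged sketch: the option letters (-m, -f, -d, -s) reflect commonly documented VOMA usage and should be confirmed with the tool's own help on your host, and the function name, partition number, and log path are hypothetical.

```python
import re

# A partition is written as the device name followed by ":<number>", for example "...cb:1".
PARTITION_SUFFIX = re.compile(r":\d+$")


def build_voma_check(device_path, log_path):
    """Build the argument list for a VOMA metadata check against a VMFS partition."""
    if not PARTITION_SUFFIX.search(device_path):
        raise ValueError(
            "VOMA must be run against a disk partition (for example ...:1), not the disk device"
        )
    # Assumed option meanings: -m module, -f function, -d device partition, -s log file.
    return ["voma", "-m", "vmfs", "-f", "check", "-d", device_path, "-s", log_path]


if __name__ == "__main__":
    device = "/vmfs/devices/disks/naa.6000eb3a2b3b330e00000000000000cb:1"  # partition 1 is assumed
    print(" ".join(build_voma_check(device, "/tmp/analysis.txt")))
```

The sketch only assembles the command. Powering off or migrating the virtual machines and unmounting the datastore on all hosts still has to happen first.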
[Screenshot: datastore Device Backing view. A VMFS datastore can span multiple hard disk partitions,
or extents. Select an extent to view its device details.]
If your virtual machines reside on NFS datastores, verify that your NFS
configuration is correct.
An NFS file system is located on an NFS server. The NFS server contains one or more directories
that are shared with the ESXi host over a TCP/IP network. An ESXi host accesses the NFS server
through a VMkernel port that is defined on a virtual switch.
If the ESXi host is unable to access its NFS datastore, verify that the properties of your NFS
datastore are configured correctly:
• Host name or IP address of the NFS server
• no_root_squash set on the NFS share
• The path to the folder on the NFS server that you want this datastore to correspond to
• Mount permissions (read-only or read/write)
• The name of the datastore
[Screenshot: New Datastore wizard warning: Use only one NFS version to access a given datastore.
Consequences of mounting one or more hosts to the same datastore using different versions can
include data corruption.]
Data corruption might occur if hosts attempt to access the same NFS
share using different NFS client versions.
If you use NFS 4.1 with Kerberos, you must perform the following tasks
before enabling Kerberos on your ESXi hosts:
• Verify that Active Directory (AD) and NFS servers are configured to use
Kerberos (5 or 5i).
• In AD, enable one of the following encryption modes:
- AES256-CTS-HMAC-SHA1-96
- AES128-CTS-HMAC-SHA1-96
In vSphere 6.5, the NFS 4.1 client does not support the DES-CBC-MD5 encryption mode.
• Verify that the NFS server exports are configured to grant full access to the
Kerberos user.
[Screenshot: host System settings (Licensing, Host Profile, Time Configuration).]
• An incorrect time setting can cause services such as Kerberos to stop functioning.
• Best practice is to use NTP whenever possible.
Lesson 2: Multipathing
By the end of this lesson, you should be able to meet the following
objectives:
• Review multipathing
• Identify common causes of missing paths, including PDL and APD conditions
• Solve missing path problems between hosts and storage devices
If your ESXi host has iSCSI multipathing issues, check the multipathing
configuration on the ESXi host and, if necessary, the iSCSI hardware
configuration.
With software iSCSI, you can use multiple NICs that provide failover for iSCSI connections
between your host and iSCSI storage systems.
For this setup, because multipathing plug-ins do not have direct access to physical NICs on your host,
you first connect each physical NIC to a separate VMkernel port. You then use port binding to
associate all VMkernel ports with the iSCSI initiator. As a result, each VMkernel port connected to a
separate NIC becomes a different path that the iSCSI storage stack and its multipathing plug-in can
use.
After iSCSI multipathing is set up, each port on the ESXi host has its own IP address, but they all
share the same iSCSI initiator IQN. Due to the latency that can be incurred, VMware does not
recommend routing iSCSI traffic.
For more about configuring iSCSI multipathing, see vSphere Storage at
https://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-6-pubs.html.
Initial checks of LUN paths are performed using the esxcli command:
• Find detailed information regarding multiple paths to the LUNs:
- esxcli storage core path list
• List LUN multipathing information:
- esxcli storage nmp device list
• Check whether a rescan restores visibility to the LUNs:
- esxcli storage core adapter rescan -A vmhba##
Loss of connectivity to a specific storage device might be due to one or more paths to a LUN being
lost. If all paths are lost, then any virtual machines using the affected datastore become
unresponsive.
The first check is to get detailed information on available LUNs and paths on your ESXi host. The
esxcli storage core path list command prints a mapping between the HBAs and the
devices for which they provide paths.
You might also list the devices controlled by VMware Native Multipathing (NMP) and show the
Storage Array Type Plug-In (SATP) and Path Selection Plug-In (PSP) information associated with
each device. SATP, also called Storage Array Type Policy, handles path failover for a given storage
array. PSP, also called Path Selection Policy, handles path selection for a given device.
If you determine that certain paths to a LUN are missing, perform a rescan of the troubled adapter to
try to restore LUN visibility.
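When a host has many LUNs, counting paths by eye in the path list output is error-prone. The sketch below parses saved output of the path list command and counts active and non-active paths per device. It assumes the output layout of one unindented path name line followed by indented "Key: Value" lines. The sample text is abbreviated and partly hypothetical (the second path and both State values are invented for illustration), and the function names are not part of any VMware tool.

```python
from collections import defaultdict

# Abbreviated, partly hypothetical sample of saved path list output.
SAMPLE = """\
iqn.1998-01.com.vmware:esxi01-00023d000001,t,1-naa.6000eb3a2b3b330e00000000000000cb
   Runtime Name: vmhba33:C0:T1:L0
   Device: naa.6000eb3a2b3b330e00000000000000cb
   Adapter: vmhba33
   State: active

iqn.1998-01.com.vmware:esxi01-00023d000001,t,2-naa.6000eb3a2b3b330e00000000000000cb
   Runtime Name: vmhba33:C1:T1:L0
   Device: naa.6000eb3a2b3b330e00000000000000cb
   Adapter: vmhba33
   State: dead
"""


def parse_path_list(text):
    """Turn path list output into a list of dicts, one per path."""
    paths = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new path record.
            current = {"Path": line.strip()}
            paths.append(current)
        elif current is not None:
            key, _, value = line.strip().partition(":")
            current[key.strip()] = value.strip()
    return paths


def paths_per_device(paths):
    """Count active and non-active paths for each device."""
    summary = defaultdict(lambda: {"active": 0, "other": 0})
    for path in paths:
        bucket = "active" if path.get("State") == "active" else "other"
        summary[path.get("Device", "unknown")][bucket] += 1
    return dict(summary)


if __name__ == "__main__":
    for device, counts in paths_per_device(parse_path_list(SAMPLE)).items():
        print(device, counts)
```

A device that shows zero active paths is a candidate for the adapter rescan described above. A device with some paths missing points toward the redundancy problems covered in the rest of this lesson.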
Possible Causes
For information about troubleshooting lost redundant paths to a storage device, see VMware
knowledge base article 1009554 at http://kb.vmware.com/kb/1009554.
For more information about the permanent device loss (PDL) and all paths down (APD) conditions,
see VMware knowledge base article 2004684 at http://kb.vmware.com/kb/2004684.
When a PDL occurs, vSphere Web Client displays the following information for the device:
• The operational state of the device changes to Lost Communication.
• All paths appear as Dead.
• Datastores on the device are dimmed.
Check /var/log/vmkernel.log.
When the storage array determines that the device is permanently unavailable, it sends SCSI sense
codes to the ESXi host. The sense codes enable your host to recognize that the device has failed and
register the state of the device as PDL.
The following VMkernel log (/var/log/vmkernel.log) example of a SCSI sense code indicates
that the device is in the PDL state:
H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0 or Logical Unit Not Supported
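Searching a large vmkernel.log for this sense code can be scripted. The sketch below extracts the three sense values from a log line and compares them with the one PDL code given in the text (0x5 0x25 0x0). Other sense codes can also indicate PDL and are not covered here. The function names and the second sample line are hypothetical.

```python
import re

# Matches the three hexadecimal values after "Valid sense data:".
SENSE_RE = re.compile(
    r"Valid sense data:\s*(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)"
)

# The only PDL sense code named in the text: Logical Unit Not Supported.
PDL_SENSE = (0x5, 0x25, 0x0)


def sense_data(line):
    """Return (sense key, ASC, ASCQ) as integers, or None if the line has no sense data."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    return tuple(int(value, 16) for value in match.groups())


def indicates_pdl(line):
    """True if the line carries the PDL sense code from the text."""
    return sense_data(line) == PDL_SENSE


if __name__ == "__main__":
    lines = [
        "H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0",
        "H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0",  # hypothetical non-PDL line
    ]
    for line in lines:
        print(indicates_pdl(line), line)
```

To scan a real log, the same functions can be applied line by line to a copy of /var/log/vmkernel.log.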
In the case of iSCSI arrays with a single LUN per target, PDL is detected through iSCSI login
failure . An iSCSI storage array rejects your host's attempts to start an iSCSI session with the reason
If the LUN was not in use when the PDL condition occurred, the LUN is
removed automatically after the PDL condition clears.
If the LUN was in use, manually detach the device and remove the LUN
from the ESXi host.
When storage reconfiguration is complete, perform these steps:
1. Reattach the storage device.
2. Mount the datastore.
3. Restore from backups if necessary.
4. Restart the virtual machines.
For detailed information about detaching devices and removing a LUN, see vSphere Command-Line
Interface Concepts and Examples at http://www.vmware.com/support/developer/vcli.
In contrast with the PDL state, the host treats an APD condition as transient and expects the device
to be available again.
An APD condition can occur on an ESXi host when a storage device is removed in an uncontrolled
manner from the host. An APD can also occur if the device fails and the VMkernel core storage
stack cannot detect how long the loss of device access will last. One possible scenario for an APD
condition is a Fibre Channel switch failure that brings down all the storage paths. Or, in the case of
an iSCSI array, an APD condition can occur if a network connectivity issue similarly brings down
all the storage paths.
The host indefinitely continues to retry issued commands in an attempt to reestablish connectivity
with the device. If the host's commands fail the retries for a prolonged period of time, the host and
its virtual machines might be at risk of having performance problems and becoming unresponsive.
When an APD occurs, vSphere Web Client displays the following information for the device:
• The operational state of the device changes to Dead or Error.
• All paths are shown as Dead.
• Datastores on the device are dimmed.
• Host might appear as disconnected in vCenter Server.
Check /var/log/vmkernel.log.
The APD condition must be resolved at the storage array or fabric layer
to restore connectivity to the host:
• All affected ESXi hosts might require a reboot.
vSphere vMotion migration of unaffected virtual machines cannot be
attempted:
• Management agents might be affected by the APD condition.
To avoid APD problems, the ESXi host has a default APD handling
feature:
• Global setting: Misc.APDHandlingEnable
- By default, set to 1, which enables storage APD handling
• Timeout setting: Misc.APDTimeout
- By default, set to 140, the number of seconds that a device can be in APD before
failing
No clean way exists to recover from an APD condition. All affected ESXi hosts might require a
reboot to remove any residual references to the affected devices that are in an APD state.
Management agents might be affected by the APD condition, and the ESXi host might become
unmanaged. As a result, a reboot of an affected ESXi host forces an outage to all unaffected virtual
machines on that host.
When a device enters the APD state, the system immediately turns on a timer and allows the ESXi
host to continue retrying non-virtual machine commands for a limited period of time.
By default, the APD timeout is set to 140 seconds, which is typically longer than most devices need
to recover from a connection loss. If the device becomes available within this time, the host and its
virtual machines continue to run without experiencing any problems.
If the device does not recover and the timeout ends, the host stops its attempts at retries and
terminates any non-virtual machine I/O. The device is marked as APD Timeout. Any further I/Os
are fast-failed with a status of No Connect, preventing hostd and others from getting hung. Virtual
machine I/O continues to be retried.
If a path to the device recovers, subsequent I/Os to the device are issued normally and special APD
treatment concludes.
A global setting exists called Misc.APDHandlingEnable. If the value is set to 0, then the ESXi
host permanently retries failing I/Os. If Misc.APDHandlingEnable is set to 1, APD handling uses
the timeout setting called Misc.APDTimeout. This setting has a default value of 140 seconds.
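The interaction between the two settings can be summarized as a small decision function. This is a simplified model of the behavior described above, not VMkernel code. The function name and return strings are hypothetical, and treating the exact moment the timer reaches the timeout as already expired is an assumption.

```python
def apd_io_action(seconds_in_apd, is_vm_io, handling_enabled=1, timeout=140):
    """Model what happens to an I/O issued to a device that is in the APD state.

    handling_enabled mirrors Misc.APDHandlingEnable and timeout mirrors Misc.APDTimeout.
    """
    if handling_enabled == 0:
        # APD handling disabled: the host keeps retrying failing I/O permanently.
        return "retry"
    if is_vm_io:
        # Virtual machine I/O continues to be retried even after the timeout.
        return "retry"
    if seconds_in_apd < timeout:
        # Non-virtual machine I/O is retried while the APD timer is still running.
        return "retry"
    # After the timeout, non-virtual machine I/O is fast-failed (No Connect).
    return "fast-fail"


if __name__ == "__main__":
    for elapsed in (30, 139, 140, 600):
        print(elapsed, apd_io_action(elapsed, is_vm_io=False), apd_io_action(elapsed, is_vm_io=True))
```

The model makes the purpose of the timeout visible: it protects management agents such as hostd from hanging on non-virtual machine I/O, while virtual machine I/O is left to keep retrying.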
If you have not configured port binding and you are using NIC teaming on the virtual switch, verify
that the physical switch ports are configured consistently for each teamed network adapter. Also
verify that the proper load-balancing policy is configured on the virtual switch. VMware
recommends that you use the default load-balancing policy, Route based on originating virtual
port. If link aggregation on the physical switch is configured, use the Route based on IP hash
load-balancing policy.
To use some adapters but reserve others for emergencies, you can use the failover order conditions
to specify how to distribute the work load for the network adapters:
• Active adapters: Continue to use the adapter when the network adapter connectivity is available
and active.
• Standby adapters: Use this adapter if one of the active adapters' connectivity is unavailable.
• Unused adapters: Do not use this adapter.
Verify that the path selection policy for a storage device is configured
properly.
You usually do not have to change the default multipathing settings that your host uses for a specific
storage device. However, if you want to make changes, you can modify a path selection policy and
specify the preferred path for the Fixed policy.
By default, VMware supports the following path selection policies. If you have a third-party PSP
installed on your host, its policy also appears on the list:
• Fixed (VMware): The host uses the designated preferred path, if it has been configured.
Otherwise, it selects the first working path discovered at system boot time. If you want the host
to use a particular preferred path, specify it manually. Fixed is the default policy for most
active-active storage devices.
• Most Recently Used (VMware): The host selects the path that it used most recently. When the
path becomes unavailable, the host selects an alternative path. The host does not revert to the
original path when that path becomes available again. No preferred path setting is associated
with the MRU policy. MRU is the default policy for most active-passive storage devices.
• Round Robin (VMware): The host uses an automatic path-selection algorithm rotating through
all active paths when connecting to active-passive arrays, or through all available paths when
connecting to active-active arrays. Round Robin is the default for a number of arrays and can be
used with both active-active and active-passive arrays to implement load balancing across paths
for different LUNs. Round Robin is also the path selection policy recommended by VMware.
do
esxcli storage nmp device set --device $i --psp VMW_PSP_RR;
done
Virtual machines on an NFS 4.1 datastore fail after the NFS 4.1 share
recovers from an APD state.
The error message The lock protecting VM.vmdk has been lost is
displayed.
This issue occurs because NFSv3 and v4.1 are two different protocols
with different behaviors. After the grace period (array vendor-specific),
the NFS server flushes the client state.
This behavior is expected in NFSv4 servers.
When the NFS 4.1 storage enters an APD state and then exits it after a grace period, you experience
these symptoms:
• Powered-on virtual machines that run on the NFS 4.1 datastore fail.
• After the NFS 4.1 share recovers from the APD condition, you see this message on the virtual
machine's Summary page in vSphere Web Client: The lock protecting VM.vmdk has been lost,
possibly due to underlying storage issues. If this virtual machine is configured to be highly
available, ensure that the virtual machine is running on some other host before clicking OK.
• After you click OK, crash files are generated and the virtual machine powers off.
This problem occurs because NFSv3 and v4.1 are different protocols that behave
differently. After the grace period (array vendor-specific), the NFS server flushes the client state.
This behavior is expected in NFSv4 servers. Currently, no resolution or workaround is available.
For more information about this problem, see VMware knowledge base article 2089321 at
http://kb.vmware.com/kb/2089321.
By the end of this lesson, you should be able to meet the following
objectives:
• Become familiar with the various types of tools that are available to
troubleshoot VMware vSAN™
• Use the appropriate tools to identify, analyze, and quickly resolve common
configuration problems related to vSAN
• Describe vSphere Virtual Volumes
• Use vSphere Virtual Volumes Diagnostic Commands
vSAN
A set of tools is available, which enables you to monitor, diagnose, and troubleshoot vSAN:
• vSphere Web Client:
• vSphere Web Client is used to configure storage policies and monitor their compliance.
vSphere Web Client can also be used to inspect underlying disk devices and how these
devices are being used by vSAN.
• As a troubleshooting tool, vSphere Web Client can be configured to present specific alarms
and warnings associated with vSAN. vSphere Web Client also highlights certain network
misconfiguration issues and whether hardware components are functioning properly.
• Additionally, vSphere Web Client can provide at-a-glance overviews of individual virtual
machine performance and indicate whether vSAN is recovering from a failed hardware
component.
• vSphere Web Client is the logical place to start when diagnosing or troubleshooting a
suspected problem.
• Although vSphere Web Client does not include many of the low-level vSAN metrics and
counters, it has a fairly comprehensive set of virtual machine metrics. You can use the
performance charts in vSphere Web Client to examine the performance of individual virtual
machines running on vSAN.
In this scenario, one or more members of the vSAN cluster are in different network groups. vSAN
fails to form correctly because the members cannot communicate with each other.
To solve the problem, ensure that the physical switch and the ports used for vSAN are active and
have multicast enabled. Enabling multicast can be done in one of two ways on your physical
switches:
• Disable IGMP snooping
• Configure IGMP snooping for selective traffic
You must also validate the virtual switch configuration for the correct uplink, VLAN, and NIC team
failover policy, and verify that the vSAN traffic service is enabled on the VMkernel interfaces. vSAN
requires a VMkernel network interface with vSAN traffic enabled. All members of the cluster must
communicate on the same Layer 2 network segment with multicast enabled, and all members of the
cluster should be able to ping each other. Failing to meet this requirement prevents vSAN from
being successfully configured, because hosts are prevented from communicating.
You can use vSphere Web Client to verify the vSAN configuration. You can use the vmkping
command to validate the vSAN network accessibility. You can also use the esxcli vsan
network namespace to examine and modify the vSAN network configurations.
In this scenario, you can use the esxcli storage core namespace before or after enabling vSAN
to examine whether the disks are flagged as local, by checking the Is Local attribute. Marking a
disk as local can also be done from vSphere Web Client using the disk management dashboard. For
more information about the configuration steps to mark a storage device as local, see vSphere
Storage Guide at https://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-6-pubs.html.
With RVC, you can use vsan.disks_info to gather detailed disk capabilities and characteristics,
such as size, disk type, manufacturer, and model, and to identify whether the disks are flagged as
local or non-local.
After manually selecting a desired group of HDDs and an SSD for the
creation of a disk group, the operation is successfully completed but disk
groups are not created.
[Screenshot: VSAN-West cluster summary in vSphere Web Client showing the alarm: VSAN datastore
vsanDatastore in cluster VSAN-West in datacenter SDDC-West does not have capacity.]
Causes and solutions:
• If vSAN is not licensed correctly, you can assign a vSAN license to the cluster
using vSphere Web Client.
• If the vSphere Web Client refresh timer has timed out, you can log out of the
system and log back in.
In this scenario, vSphere Web Client indicates that the vSAN datastore in the cluster does not have
capacity.
The following issues can cause this behavior:
• vSAN might not be licensed correctly. The vSAN feature is not automatically added to the
cluster when a vSAN license is added to the vCenter Server license catalog. You can assign a
vSAN license to the cluster enabled with vSAN by using vSphere Web Client.
• The vSphere Web Client refresh timer timed out. Depending on the number of disk groups and
the number of disks, the completion of the operation can take some time. You can log out and
log back in to vSphere Web Client.
You cannot delete disk groups from the vSphere Web Client user
interface.
Too much multicast traffic for multiple vSAN clusters causes slowness.
Cause:
• If multiple vSAN clusters exist on the same Layer 2 network, each host
receives all multicast messages.
Solution:
• To reduce the amount of multicast traffic for each vSAN cluster, change the
multicast address for each vSAN cluster.
• Changing the multicast address on an active vSAN cluster can lead to network
partitioning until all of the ESXi hosts in the cluster are on the same multicast
network. VMware recommends that you schedule downtime before making this
change.
For information about how to change the multicast address on an ESXi host configured for vSAN,
see VMware knowledge base article 2075451 at http://kb.vmware.com/kb/2075451.
• No LUNs or NFS shares.
• Set up a single I/O access point called a protocol endpoint to set up a data path from virtual
machines to virtual volumes (VMDKs).
• Set up a logical entity called a storage container to group virtual volumes (VMDKs) for
easy management.
• Treated like a proxy.
• Supports typical SCSI and NFS commands.
• Virtual volumes (VMDKs) are bound and unbound to a protocol endpoint:
- ESXi or vCenter Server initiates the bind or unbind operation.
• Existing multipathing policies and NFS topology requirements can be applied.
A PE is created by the storage administrator to define a single I/O access point. A PE is treated like
a LUN and handles industry-standard protocols, such as iSCSI. A PE creates a single configuration
regardless of the protocol used.
Virtual volumes are bound to a PE by using the bind/unbind commands that are initiated by ESXi
hosts and vCenter Server instances. PEs should be configured in a high availability environment so
that a single point of failure does not exist. Each PE configuration rests on the array and is
considered to be part of the physical storage fabric.
The PE and SC columns in the illustration show how PE and SC objects are discovered by ESXi
hosts and vCenter Server.
• esxcli storage core device list: Identify protocol endpoints. The output entry Is VVOL PE: true
indicates that the storage device is a protocol endpoint.
• esxcli storage vvol daemon unbindall: Unbind all virtual volumes from all VASA providers known
to the ESXi host.
• esxcli storage vvol protocolendpoint list: List all protocol endpoints that your host can access.
• esxcli storage vvol vasacontext get: Show the VASA context (VC UUID) associated with the host.
• esxcli storage vvol vasaprovider list: List all storage (VASA) providers associated with the host.
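The Is VVOL PE check described above can be automated. The following Python sketch is illustrative only: it assumes that the output of esxcli storage core device list consists of an unindented device identifier followed by indented Field: value lines, and the sample text and device names are hypothetical.

```python
from typing import List


def find_protocol_endpoints(device_list_output: str) -> List[str]:
    """Return the device identifiers whose block reports 'Is VVOL PE: true'.

    Assumption: each device block starts with an unindented identifier line
    and is followed by indented 'Field: value' lines.
    """
    endpoints = []
    current_device = None
    for line in device_list_output.splitlines():
        if not line.strip():
            continue  # skip blank separator lines
        if not line[0].isspace():
            current_device = line.strip()  # unindented line starts a new device
        elif current_device and line.strip().lower() == "is vvol pe: true":
            endpoints.append(current_device)
    return endpoints


# Hypothetical sample text, not captured from a real host.
SAMPLE_OUTPUT = """\
naa.0000000000000001
   Display Name: Example Disk A
   Is VVOL PE: false

naa.0000000000000002
   Display Name: Example Disk B
   Is VVOL PE: true
"""

print(find_protocol_endpoints(SAMPLE_OUTPUT))
```

To use the sketch against a real host, save the command output to a file and pass the file contents to find_protocol_endpoints.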
• As a first troubleshooting step, if you do not see any paths to your LUNs,
perform a rescan of the troubled adapter to try to restore LUN visibility.
• For iSCSI connectivity issues, verify that the VMkernel port bindings are
configured properly.
• A storage device is considered to be in a PDL state when it becomes
permanently unavailable to the ESXi host.
• An APD condition occurs when a storage device becomes unavailable to your
ESXi host for an unspecified amount of time.
• Because vSAN is a software-based storage product, it is entirely dependent on
the proper functioning of its underlying hardware components.
• When troubleshooting vSAN, ensure that the network is functioning properly
and the vSAN compatibility guidelines have been closely followed.
• vSphere Virtual Volumes is a set of different vSphere Virtual Volume object
types that together function as a virtual machine.
• vSphere Virtual Volumes is based on storage policies.
Module 6
You Are Here
1. Course Introduction
2. Introduction to Troubleshooting
3. Troubleshooting Tools
4. Troubleshooting Virtual Networking
5. Troubleshooting Storage
6. Troubleshooting vSphere Clusters
7. Troubleshooting Virtual Machines
8. Troubleshooting vCenter Server and ESXi
By the end of this module, you should be able to meet the following
objectives:
• Identify and troubleshoot vSphere HA problems
• Analyze and solve vSphere vMotion problems
• Diagnose and troubleshoot common vSphere DRS problems
Increase the verbosity of the FDM logs to collect more information about the cause of the issue.
Search the FDM log files for error messages:
• FDM operations log: /var/log/fdm.log or /var/run/log/fdm* (one log file for FDM
operations)
• FDM agent installation log: /var/log/fdm-installer.log
If the ESXi hosts in the cluster are operating properly, take a top-down
approach to troubleshooting. Ensure that vSphere HA is configured
properly in vCenter Server.
Possible Causes
If communication between your ESXi hosts and vCenter Server is working properly, then start your
troubleshooting with the vSphere HA configuration. Verify that all the cluster settings are correct.
For information about troubleshooting FDM problems, see VMware knowledge base article
2004429 at http://kb.vmware.com/kb/2004429.
The vSphere HA checklist shown in the slide contains requirements that you must be aware of
before creating and using a vSphere HA cluster.
For more information about vSphere HA requirements, see vSphere Availability Guide at
http://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-pubs.html.
[Screenshot: vSphere HA heartbeat datastore view listing datastore NFS01 and hosts sa-esxi-01.vclass.local and sa-esxi-02.vclass.local.]
vCenter Server automatically selects a preferred set of datastores for heartbeating. This selection is
made with the goal of maximizing the number of hosts that have access to a given datastore and
minimizing the likelihood that the selected datastores are backed by the same storage array or NFS
server. In most cases, this selection should not be changed.
Verify that you have LUN connectivity. The heartbeat datastore might be inaccessible because of a
storage failure, such as an all paths down (APD) or permanent device loss (PDL) condition.
If the FDM agent cannot be installed on the ESXi host, perform the
following checks:
• Check the FDM log for errors:
/var/log/fdm.log
• Verify that sufficient network bandwidth exists between the ESXi host and
vCenter Server.
• If network traffic throughput is causing the issue, determine whether a
hardware problem exists.
• Verify that there is disk space available on the ESXi host in /root.
If you determine that network traffic throughput is causing the problem, determine the cause of the
network issues:
• Hardware problem with the NIC
• Mismatch between NIC speed and switch speed
• Hardware or firewall is blocking or slowing traffic
Verify that the FDM agent files were pushed to the ESXi host:
• clusterconfig: Cluster configuration
• vmmetadata: List of all virtual machines and the hosts that they are compatible with
Verify that all FDM agent files were pushed from vCenter Server to the
ESXi host:
• On the ESXi host, check for the following agent files in the
/etc/opt/vmware/fdm directory:
- clusterconfig
- fdm.cfg
- hostlist
- vmmetadata
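The file check above can be expressed as a short script. This Python sketch is a minimal illustration, not a VMware tool: it only tests whether the four file names listed above exist in a directory. The function name is hypothetical.

```python
from pathlib import Path
from typing import List

# File names taken from the list above.
EXPECTED_FDM_FILES = ("clusterconfig", "fdm.cfg", "hostlist", "vmmetadata")


def missing_fdm_files(fdm_dir: str) -> List[str]:
    """Return the expected FDM agent files that are absent from fdm_dir."""
    directory = Path(fdm_dir)
    return [name for name in EXPECTED_FDM_FILES
            if not (directory / name).is_file()]


if __name__ == "__main__":
    missing = missing_fdm_files("/etc/opt/vmware/fdm")
    print("Missing FDM agent files:", ", ".join(missing) if missing else "none")
```

An empty result means that all four files are present. It does not confirm that their contents are valid.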
Loss of network connectivity between the ESXi host and vCenter Server
might cause vSphere HA not to enable.
Test connectivity from the ESXi host:
ping vc01.vclass.local
If the ping fails, troubleshoot ESXi host network connectivity issues.
The ESXi host is not properly connected to vCenter Server:
• Check the ESXi host's connection status in the vCenter Server hierarchy.
• If the host is disconnected, reconnect the ESXi host to vCenter Server.
If you suspect ESXi host network connectivity problems, use the troubleshooting information for
virtual networks.
The problem occurs when you try to power on a virtual machine that is
part of a vSphere HA cluster with insufficient failover resources.
Power On Failures
vCenter Server was unable to find a suitable host to power on the following virtual machines for the reasons listed below.
Win02-C: Insufficient resources to satisfy configured failover level for vSphere HA.
The error message indicates that the operation performed violates the configured failover level of
the vSphere HA cluster.
This problem can arise when hosts in the cluster are disconnected, in maintenance mode, or not
responding, or when they have a vSphere HA error. Disconnected and maintenance mode hosts are
typically caused by user action. Unresponsive hosts or hosts in an error state usually result from a more
serious problem, for example, hosts or agents have failed or a networking problem exists.
Another possible cause of this problem is that your cluster contains virtual machines that have much
larger memory or CPU reservations than the others. As a virtual machine with high reservation
values starts up, slot calculation is affected and fewer slots are available. The Host Failures Cluster
Tolerates admission control policy is based on the calculation of a slot size that includes two
components: the CPU and memory reservations of a virtual machine. If the calculation of this slot
size is skewed by outlier virtual machines, the admission control policy can become restrictive and
result in virtual machines being unable to start. Check the CPU and Memory reservations of the
virtual machine. Either change the admission control policy to reserve fewer resources for failover
or add more hosts to the cluster.
Possible Causes
ESXi Host: The cluster has insufficient physical resources.
Use the bottom-up approach to troubleshoot this problem. Start by verifying the configuration of the
physical resources of the cluster. Then verify that virtual machine reservations are reasonable
and not excessive. Finally, verify that the vSphere HA admission control policy is configured to
protect failover capacity.
The goal of the vSphere HA Admission Control policy is to ensure that a certain proportion of
cluster resources are never allocated to virtual machine reservations so that if one or multiple hosts
fail, sufficient unreserved resources are available in the cluster to power on the failed virtual
machines.
The best, long-term solution for resolving insufficient failover capacity in a vSphere HA cluster is to
add physical resources. Otherwise, you might be forced to compromise the spare cluster capacity
that the vSphere HA admission control policy is trying to safeguard.
[Screenshot: cluster Resource Reservation view for memory. Cluster Total Capacity: 12.00 GB. Total Reservation Capacity: 4.33 GB. Reservations shown (MB): linux-a-02: 4096, linux-a-03: 4096, linux-a-06: 6144.]
Error: The "Power on virtual machine" operation failed for the entity with the following error message.
Insufficient resources to satisfy configured failover level for vSphere HA.
vCenter Server uses admission control to ensure that sufficient resources in a vSphere HA cluster
are reserved for virtual machine recovery if host failure occurs.
Some clusters might contain virtual machines that have much larger memory or CPU reservations than
other virtual machines. The Host Failures Cluster Tolerates admission control policy is based on the
calculation of a slot size that includes two components: the CPU and memory reservations of a virtual
machine. If the calculation of this slot size is skewed by outlier virtual machines, the admission
control policy can become too restrictive and result in the inability to power on virtual machines.
Verify that none of the virtual machines have excessive reservations. If one or a few machines have
a much larger reservation than other machines in a cluster, the calculation will be skewed.
For more information about the various causes of insufficient failover resource problems, see
vSphere Troubleshooting at http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-troubleshooting-guide.pdf.
[Screenshot: vSphere Availability settings. vSphere Availability consists of vSphere HA and Proactive HA. To enable Proactive HA, you must also enable DRS on the cluster.]
Failure responses shown in the screenshot:
• Host failure: Restart VMs (restart VMs using VM restart priority ordering).
• Host Isolation: Disabled (VMs on isolated hosts will remain powered on).
• Datastore with Permanent Device Loss: Disabled.
• Datastore with All Paths Down: Disabled.
Admission control settings shown in the screenshot:
• Host failures cluster tolerates: 1 (the maximum is one less than the number of hosts in the cluster).
• Define host failover capacity by: Cluster resource percentage.
• Performance degradation VMs tolerate: 100% (the percentage of performance degradation that the VMs in the cluster are allowed to tolerate during a failure).
Admission control is a policy used by vSphere HA to ensure failover capacity within a cluster. Increasing the value of Host failures cluster
tolerates increases the availability constraints and the capacity reserved.
If vSphere HA admission control does not function properly, there is no assurance that all virtual
machines in the cluster can be restarted after a host failure.
Ensure that your admission control settings match your restart expectations if a failure occurs.
Problems occur when no free slots are available in the cluster or if powering on a virtual machine
causes the slot size to increase because it has a larger reservation than existing virtual machines. In
either case, you might use the vSphere HA advanced options to reduce the slot size, use a different
admission control policy, or modify the policy to tolerate fewer host failures.
The vSphere HA Advanced Runtime Info pane shows the slot size and how many available slots are
in the cluster. If the slot size appears too high, view the Resource Allocation tab of the cluster and
sort the virtual machines by reservation to determine which have the largest CPU and memory
reservations. If outlier virtual machines with much higher reservations than the others exist, consider
using a different vSphere HA admission control policy. For example, define failover capacity by
reserving a percentage of the cluster resources. Or use the vSphere HA advanced options to place an
absolute cap on the slot size. Both of these options, however, increase the risk of resource
fragmentation. Resource fragmentation results in the inability to guarantee that the outlier virtual
machines can be restarted in the event of an ESXi host failure.
vCenter Server uses admission control to ensure that sufficient resources are available in a cluster to
provide failover protection. Admission control is also used to ensure that virtual machine resource
reservations are respected.
Choose a vSphere HA admission control policy based on your availability needs and the
characteristics of your cluster. When choosing an admission control policy, you should consider the
following factors:
• Avoiding resource fragmentation: Resource fragmentation occurs when enough resources in
aggregate exist for a virtual machine to be failed over. However, those resources are located on
multiple hosts and are unusable because a virtual machine can run on only one ESXi host at a
time. The Host Failures Cluster Tolerates policy avoids resource fragmentation by defining a
slot as the maximum virtual machine reservation. The Percentage of Cluster Resources policy
does not address the problem of resource fragmentation. If vSphere HA and vSphere DRS are
enabled on the cluster, then vSphere DRS ensures that no resource fragmentation occurs. If
vSphere DRS is not enabled, resource fragmentation can occur. With the Specify Failover Hosts
policy, resources are not fragmented, because hosts are reserved for failover.
A slot is a logical representation of memory and CPU resources. By default, a slot is sized to satisfy
requirements for any powered-on virtual machine in the cluster:
• vSphere HA calculates the CPU component by obtaining the CPU reservation of each powered-
on virtual machine and selecting the largest value. If you have not specified a CPU reservation
for a virtual machine, the virtual machine is assigned a default value of 32 MHz. You can
change this value by using the das.vmcpuminmhz advanced attribute.
• vSphere HA calculates the memory component by obtaining the memory reservation, plus
memory overhead, of each powered-on virtual machine and selecting the largest value. No
memory is reserved by default. The default value for memory reservation is zero.
In the example, the largest CPU requirement, 2 GHz, is shared by the first and second virtual
machines. The largest memory requirement, 2 GB, is held by the third virtual machine. Based on
this information, the slot size is 2 GHz CPU and 2 GB memory.
Of the three ESXi hosts in the example, ESXi host 1 can support four slots. ESXi hosts 2 and 3 can
support three slots each.
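The slot calculation described above can be sketched in Python. This is a simplified illustration, not the vSphere HA implementation. The virtual machine reservations match the example, memory overhead is treated as zero for simplicity, and the host capacities are assumptions chosen only so that the result matches the slot counts stated above (four, three, and three).

```python
from typing import List, Tuple

DEFAULT_CPU_MHZ = 32  # default used when a VM has no CPU reservation


def slot_size(vms: List[Tuple[int, int, int]]) -> Tuple[int, int]:
    """Return (cpu_mhz, mem_mb) for the slot.

    Each VM is (cpu_reservation_mhz, mem_reservation_mb, mem_overhead_mb).
    The slot takes the largest CPU reservation and the largest
    memory reservation plus overhead among powered-on VMs.
    """
    cpu = max(cpu_mhz if cpu_mhz > 0 else DEFAULT_CPU_MHZ for cpu_mhz, _, _ in vms)
    mem = max(mem_mb + overhead_mb for _, mem_mb, overhead_mb in vms)
    return cpu, mem


def slots_on_host(host_cpu_mhz: int, host_mem_mb: int, slot: Tuple[int, int]) -> int:
    """A host supports as many slots as its scarcer resource allows."""
    slot_cpu, slot_mem = slot
    return min(host_cpu_mhz // slot_cpu, host_mem_mb // slot_mem)


# Reservations from the example; overhead assumed to be 0 MB for simplicity.
VMS = [(2000, 1024, 0), (2000, 1024, 0), (1000, 2048, 0),
       (1000, 1024, 0), (1000, 1024, 0)]

# Hypothetical host capacities (MHz, MB); not stated in the example.
HOSTS = [(9000, 9216), (9000, 6144), (6000, 6144)]

slot = slot_size(VMS)
slots_per_host = [slots_on_host(cpu, mem, slot) for cpu, mem in HOSTS]
print("Slot size (MHz, MB):", slot)
print("Slots per host:", slots_per_host)
```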
A virtual machine requires a certain amount of available overhead memory to power on. The
amount of overhead required varies, depending on the amount of configured memory.
After determining the maximum number of slots that each host can
support, the current failover capacity can be computed.
In this example, current failover capacity is 1.
[Figure: virtual machine reservations in the example. VM 1: 2 GHz CPU, 1 GB RAM. VM 2: 2 GHz CPU, 1 GB RAM. VM 3: 1 GHz CPU, 2 GB RAM. VM 4: 1 GHz CPU, 1 GB RAM. VM 5: 1 GHz CPU, 1 GB RAM.]
In the example, host 1 is the largest host in the cluster. If host 1 fails, then six slots remain in the
cluster, which is sufficient for all five of the powered-on virtual machines. The cluster has one
available slot left.
If both hosts 1 and 2 fail, then only three slots remain, which is insufficient for the number of
powered-on virtual machines to fail over. Thus, the current failover capacity is one.
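The reasoning above can be written as a small calculation. The following Python sketch is an illustration under simplifying assumptions: it removes the largest hosts first and counts how many can fail while the remaining slots still cover every powered-on virtual machine. The function name is hypothetical.

```python
from typing import List


def current_failover_capacity(slots_per_host: List[int], powered_on_vms: int) -> int:
    """Count how many of the largest hosts can fail while enough slots remain."""
    remaining = sorted(slots_per_host)  # ascending, so the largest host is last
    failures = 0
    # At least one host must survive, so stop when a single host remains.
    while len(remaining) > 1:
        remaining.pop()  # simulate failure of the largest remaining host
        if sum(remaining) < powered_on_vms:
            break  # not enough slots left to restart every VM
        failures += 1
    return failures


# Slot counts and VM count from the example: hosts with 4, 3, and 3 slots, 5 VMs.
print(current_failover_capacity([4, 3, 3], 5))
```

With the values from the example, losing host 1 leaves six slots for five virtual machines, and losing a second host leaves three, so the function returns 1.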
[Figure: virtual machine reservations with one outlier. VM 1: 2 GHz CPU, 1 GB RAM. VM 2: 2 GHz CPU, 1 GB RAM. VM 3: 1 GHz CPU, 4 GB RAM. VM 4: 1 GHz CPU, 1 GB RAM. VM 5: 1 GHz CPU, 1 GB RAM.]
A single virtual machine with a very large memory or CPU reservation can drastically reduce the
number of slots available. By default, the slot size is calculated based on the largest CPU reservation
and the largest memory reservation (not actual CPU or memory usage) of powered-on virtual
machines. The memory reservation includes overhead.
You can manually set slot sizes to a fixed value in vSphere Availability Admission Control in the
vSphere Web Client. You can also set an upper bound for the CPU or memory component of
the slot size by creating the das.slotcpuinmhz and das.slotmeminmb parameters in vSphere
Availability Advanced Options.
With this policy, vSphere HA enforces admission control in the following way:
1. vSphere HA calculates the total resource requirements for all powered-on virtual machines in
the cluster.
2. It calculates the total host resources available for virtual machines.
3. It calculates the current CPU failover capacity and current memory failover capacity for the
cluster.
4. It determines whether either the current CPU failover capacity or the current memory failover
capacity is less than the corresponding configured failover capacity (provided by the user).
[Figure: virtual machine reservations in the example. VM 1: 2 GHz CPU, 1 GB RAM. VM 2: 2 GHz CPU, 1 GB RAM. VM 3: 1 GHz CPU, 2 GB RAM. VM 4: 1 GHz CPU, 1 GB RAM. VM 5: 1 GHz CPU, 1 GB RAM.]
In the example, the total resource requirements for the powered-on virtual machines are 7 GHz and 6
GB. The total host resources available for virtual machines are 24 GHz and 23 GB. Based on these
values, the current CPU failover capacity is 70 percent. Similarly, the current memory failover
capacity is 74 percent.
If the cluster's configured failover capacity is set to 25 percent, 45 percent of the cluster's total CPU
resources and 49 percent of the cluster 's memory resources are still available to power on additional
virtual machines.
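The percentage arithmetic in this example can be checked with a short calculation. The Python sketch below assumes that current failover capacity is the unreserved share of total host resources, which is consistent with the figures in the example. The unrounded results are about 70.8 percent for CPU and 73.9 percent for memory, which the text reports as 70 and 74 percent. The function names are hypothetical.

```python
def failover_capacity_pct(total_host: float, total_vm_requirements: float) -> float:
    """Share of host resources not claimed by powered-on VMs, in percent."""
    return (total_host - total_vm_requirements) / total_host * 100


def admission_allowed(current_pct: float, configured_pct: float) -> bool:
    """Step 4: the check fails when current capacity drops below the configured value."""
    return current_pct >= configured_pct


CONFIGURED_PCT = 25  # configured failover capacity from the example

cpu_pct = failover_capacity_pct(24, 7)   # GHz
mem_pct = failover_capacity_pct(23, 6)   # GB

print(f"Current CPU failover capacity: {cpu_pct:.1f}%")
print(f"Current memory failover capacity: {mem_pct:.1f}%")
print(f"CPU headroom for new VMs: {cpu_pct - CONFIGURED_PCT:.1f}%")
print(f"Memory headroom for new VMs: {mem_pct - CONFIGURED_PCT:.1f}%")
print("Admission allowed:", admission_allowed(cpu_pct, CONFIGURED_PCT)
      and admission_allowed(mem_pct, CONFIGURED_PCT))
```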
You can configure how vSphere HA responds to the failure conditions on this cluster. The following failure conditions are supported: host,
host isolation, VM component protection (datastore with PDL and APD), VM, and application.
If Virtual Machine Component Protection (VMCP) is enabled, vSphere HA can detect datastore
accessibility failures and provide automated recovery for affected virtual machines.
VMCP provides protection against storage accessibility failures that can affect a virtual machine
running on a host in a vSphere HA cluster. When a datastore accessibility failure occurs, the affected
host cannot access the storage path for a specific datastore. You can determine the response of
vSphere HA to such a failure, ranging from the creation of event alarms to virtual machine restarts
on other hosts.
You can view the vSphere HA log in vSphere Web Client using the log
browser, or using the command line:
[root@esxi-a-03:~] ls /var/log/fdm.log
/var/log/fdm.log
[root@esxi-a-03:~] tail -f /var/log/fdm.log
You can start, stop, or restart the vSphere HA service from the command
line:
[root@esxi-a-03:~] ps | grep -i fdm
The Monitor> Utilization graph can show you the cluster load balance
across ESXi hosts.
[Screenshot: Monitor > Utilization view for Lab Cluster (hosts sa-esxi-01.vclass.local and sa-esxi-02.vclass.local), showing Cluster CPU on a scale of 0 GHz to 11.20 GHz and Cluster Memory on a scale of 0 GB to 12.00 GB.]
vSphere vMotion transfers the entire execution state of a running virtual machine from the source
ESXi host to the destination ESXi host over a high-speed network. The execution state primarily
consists of the following components:
• The virtual machine's physical memory
• The virtual device state, including the state of the CPU, network and disk adapters, and SVGA
• External network connections
• The virtual machine's virtual disks (migrated only when disks are not on shared storage)
[root@sa-esxi-01:~] esxcli network ip netstack get -N vmotion
vmotion
   Key: vmotion
   Name: vmotion
   Enabled: true
   Max Connections: 11000
   Current Max Connections: 11000
   Congestion Control Algorithm: newreno
   IPv6 Enabled: true
   Current IPv6 Enabled: false
   State: 4660
To migrate virtual machines over long distances, your environment must comply with these
requirements:
• An RTT (round-trip time) latency of 150 milliseconds or less between hosts.
• Your license must cover migrating virtual machines across long distances. The long-distance
vMotion features require a VMware vSphere® Enterprise Plus Edition™ license. For more
information, see Compare vSphere Editions at http://www.vmware.com/uk/products/vsphere/compare.html.
• You must place the traffic related to virtual machine file transfer to the destination host on the
provisioning TCP/IP stack. For more information about placing traffic for cold migration,
cloning, and snapshots on the provisioning TCP/IP stack, see the chapter about migrating
virtual machines in vCenter Server and Host Management Guide at
http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-host-management-guide.pdf.
For more information about migrating virtual machines over long distances, see VMware
knowledge base article 2106949 at http://kb.vmware.com/kb/2106949.
To migrate virtual machines across vCenter Server instances by using the vSphere Web Client, you
must enable Enhanced Linked Mode for both the source and destination vCenter Server instances.
You must also put the vCenter Server instances in the same VMware vCenter™ Single Sign-On™
domain, so that the source vCenter Server can authenticate to the destination vCenter Server.
When using the vSphere APIs or SDK, both vCenter Server instances can exist in separate vCenter
Single Sign-On domains. Additional parameters are required when performing a nonfederated cross
vCenter Server vMotion migration. For more information about the virtual machine relocation
specifications, see vSphere API Reference at https://www.vmware.com/support/developer/vc-sdk/.
If vSphere vMotion was working but fails in this way, begin by confirming
proper VMkernel port settings and values and verify physical component
functionality.
If the network configuration is correct and physical components are
functioning, restart the management agents:
• Restart the management agents on the ESXi host at the command prompt:
- /etc/init.d/hostd restart
- /etc/init.d/vpxa restart
• Use the DCUI to restart the management agents on the ESXi host.
[Screenshot: DCUI Troubleshooting Mode Options menu, including Disable ESXi Shell, Disable SSH, and Modify ESXi Shell and SSH timeouts.]
Before you restart the management agents, confirm that the network configuration is still correct and
that all network hardware is functioning correctly.
You can restart the management agents from the ESXi host command line. Or you can restart the
agents by selecting Troubleshooting Mode Options> Restart Management Agents in the DCUI.
Restarting the management agents might affect tasks that are running on the ESXi host at the time of
the restart.
For information about restarting the management service on an ESXi host, see VMware knowledge
base article 1005566 at http://kb.vmware.com/kb/1005566.
Possible Causes
Virtual Machine:
• The log.rotateSize parameter is set to a low value.
ESXi Host:
• Name resolution is not valid on the host.
• Time is not synchronized across the environment.
• The required disk space is not available.
• VM reservation requirements are not met on the target host.
If the vSphere vMotion network is not functioning properly, vSphere vMotion migrations can fail.
The ESXi host must be configured properly and have enough resources to allow virtual machine
migrations from one host to the next.
For information about diagnosing a vSphere vMotion failure at 15 percent or less, see VMware
knowledge base article 1003734 at http://kb.vmware.com/kb/1003734.
If the ping command fails, ensure that the IP settings of the vSphere
vMotion VMkernel interface are correct:
- IP address
- Subnet mask
- VMkernel gateway
Verify that VMkernel network connectivity for your vSphere vMotion network exists. If the ping
command results in 100 percent packet loss, then verify that the VMkernel configuration for your
vSphere vMotion network is valid.
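One way to sanity-check the three IP settings listed above is with the Python standard library. The sketch below is an illustration only. The addresses are hypothetical, and the checks are assumptions about what a valid configuration looks like: the gateway should fall inside the interface's subnet, and two hosts on a nonrouted vMotion network should share a subnet. Hosts that use routed vMotion can legitimately be in different subnets.

```python
import ipaddress
from typing import List


def check_vmkernel_ip(ip: str, netmask: str, gateway: str) -> List[str]:
    """Return a list of problems found in the interface settings.

    Raises ValueError if the address or subnet mask is malformed.
    """
    problems = []
    interface = ipaddress.ip_interface(f"{ip}/{netmask}")
    if ipaddress.ip_address(gateway) not in interface.network:
        problems.append(f"gateway {gateway} is outside {interface.network}")
    return problems


def same_subnet(ip_a: str, ip_b: str, netmask: str) -> bool:
    """True if both addresses fall in the same subnet for the given mask."""
    net_a = ipaddress.ip_interface(f"{ip_a}/{netmask}").network
    net_b = ipaddress.ip_interface(f"{ip_b}/{netmask}").network
    return net_a == net_b


# Hypothetical vMotion addresses for a source and a destination host.
print(check_vmkernel_ip("172.20.12.51", "255.255.255.0", "172.20.12.10"))
print(check_vmkernel_ip("172.20.12.51", "255.255.255.0", "172.20.13.10"))
print(same_subnet("172.20.12.51", "172.20.12.52", "255.255.255.0"))
```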
Verify that name resolution is valid on both the source and destination hosts.
Verify that time is synchronized across the environment. If time discrepancies exist, synchronize
the time. The time can be maintained by using a Network Time Protocol server.
vSphere vMotion might fail if required disk space is not available on the
target host:
• On the destination host, run df -h.
- Verify that enough space is available on the destination datastore.
~ # df -h
Filesystem  Size   Used  Available  Use%  Mounted on
VMFS-5      55.0G  1.7G  53.3G            /vmfs/volumes/Local01
vSphere vMotion presents a unified migration architecture that migrates live virtual machines,
including their memory and storage, between vSphere hosts without any requirement for shared
storage.
If vSphere vMotion must transfer the virtual machine's storage from the source host to a different
datastore on the destination host, ensure that enough disk space exists to accommodate the migrated
virtual machine.
Verify that virtual machine reservation values do not exceed available resources on the host. Check
the ESXi host's Summary tab for the number of processors, processor speed, and amount of
physical memory available. Then check the virtual machine's reservation values for CPU and
memory.
If the virtual machine has reservations configured that exceed available resources, enough resources
must be made available on the target ESXi host or the reservations must be lowered or removed.
The VMkernel overhead and the memory reservation must be available for a virtual machine to
power on.
The log.rotateSize setting defines the maximum size in bytes that the virtual machine log file,
vmware.log, can grow to. By default, the maximum size is set to zero, which means the log file
can grow to an unlimited size.
If the log.rotateSize value exists in the virtual machine's .vmx file and is set to a very low
value, vmware.log might rotate quickly. As a result, by the time the destination host requests
the VMFS lock for vmware.log, the log file has already rotated and a new vmware.log file has been
created. The destination host is then unable to acquire a proper file lock, which causes the vSphere
vMotion failure.
For information about log rotation and logging options for vmware.log, see VMware knowledge
base article 8182749 at http://kb.vmware.com/kb/8182749.
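A quick way to spot this condition is to read the setting from the .vmx file. The following Python sketch is illustrative: it assumes the usual key = "value" layout of .vmx entries, and the 1,000,000-byte threshold is an arbitrary, hypothetical cutoff for "very low", not a VMware recommendation.

```python
import re
from typing import Optional


def read_log_rotate_size(vmx_text: str) -> Optional[int]:
    """Return the log.rotateSize value in bytes, or None if it is not set."""
    match = re.search(r'^\s*log\.rotateSize\s*=\s*"(\d+)"\s*$',
                      vmx_text, flags=re.MULTILINE | re.IGNORECASE)
    return int(match.group(1)) if match else None


def rotate_size_is_suspicious(vmx_text: str, min_bytes: int = 1_000_000) -> bool:
    """Flag a nonzero value below min_bytes; zero means unlimited and is not flagged."""
    size = read_log_rotate_size(vmx_text)
    return size is not None and 0 < size < min_bytes


# Hypothetical .vmx fragment for illustration.
SAMPLE_VMX = 'displayName = "Win01-A"\nlog.rotateSize = "10000"\n'

print(read_log_rotate_size(SAMPLE_VMX))
print(rotate_size_is_suspicious(SAMPLE_VMX))
```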
If the vSphere vMotion migration still fails between 10 and 20 percent, reset the ESXi host's
advanced setting, Migrate.Enabled, on both the source and the destination ESXi hosts.
Troubleshoot vSphere DRS only if the hosts are out of balance. vSphere DRS might not be
migrating virtual machines because migrations are not needed at the time.
If the vSphere DRS automation level is set to manual mode, then vSphere vMotion migrations do
not take place automatically. You must approve the migration recommendation before the migration
takes place.
Verify that your vSphere vMotion configuration is correct on all hosts in the cluster. As a test, you
should be able to manually migrate your virtual machines between hosts without a problem.
vSphere DRS might have valid reasons for not performing vSphere
vMotion migrations.
vSphere DRS never performs load-balancing migrations if the migration threshold is set to apply only
priority 1 recommendations. With this setting, vCenter Server applies only recommendations that must be
taken to satisfy cluster constraints like affinity rules and host maintenance. vCenter Server will not
apply load-balancing recommendations.
vSphere DRS seldom migrates virtual machines if the virtual machine load is fairly consistent. If the
hosts are load balanced, then the need for vSphere DRS to move virtual machines rarely occurs.
vSphere DRS often migrates virtual machines if the virtual machine loads are very erratic in their
resource requirements. In this case, vSphere DRS might need to frequently reshuffle the virtual
machines across the hosts in the cluster to keep the load balanced.
vSphere DRS seldom performs migrations if the migration threshold is set to apply priority 1, 2, and
3 recommendations. With this setting, vCenter Server performs vSphere vMotion migrations only
for extreme and high load imbalances across the hosts in the cluster.
vSphere DRS often performs migrations if the migration threshold is set to apply all
recommendations. With this setting, vCenter Server performs vSphere vMotion migrations at the
slightest load imbalance across the hosts in the cluster.
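The relationship between the migration threshold and the recommendations that are applied can be modeled in a few lines. This Python sketch is a simplified model, not DRS logic: it assumes five priority levels, where priority 1 is the most urgent, and that a threshold level applies every recommendation with a priority number up to that level. The function names are hypothetical.

```python
from typing import List

MAX_LEVEL = 5  # assumption: five threshold levels and five priority levels


def applied_priorities(threshold_level: int) -> List[int]:
    """Return the recommendation priorities applied at a threshold level."""
    if not 1 <= threshold_level <= MAX_LEVEL:
        raise ValueError(f"threshold level must be between 1 and {MAX_LEVEL}")
    return list(range(1, threshold_level + 1))


def is_applied(priority: int, threshold_level: int) -> bool:
    """True if a recommendation of this priority is applied at this level."""
    return priority in applied_priorities(threshold_level)


print(applied_priorities(1))  # mandatory recommendations only
print(applied_priorities(3))  # priority 1, 2, and 3 recommendations
print(applied_priorities(5))  # all recommendations
```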
Verify that vSphere DRS and vSphere vMotion are configured correctly.
If virtual machines cannot be migrated with vSphere vMotion, verify that these virtual machines are
not actively using local host resources such as local storage, local CD/DVD drives, or internal
networks.
If the hosts in the cluster are consistently out of balance, then vSphere DRS is not working correctly
and you must investigate whether misconfiguration is causing this behavior.
Module 7
You Are Here
1. Course Introduction
2. Introduction to Troubleshooting
3. Troubleshooting Tools
4. Troubleshooting Virtual Networking
5. Troubleshooting Storage
6. Troubleshooting vSphere Clusters
7. Troubleshooting Virtual Machines
By the end of this module, you should be able to meet the following
objectives:
• Discuss virtual machine files and content IDs
• Identify, analyze, and solve virtual machine snapshot problems
• Troubleshoot virtual machine power-on problems
• Identify possible causes and troubleshoot virtual machine connection state
problems
• Diagnose and recover from VMware Tools installation failures
To troubleshoot common virtual machine issues, you must understand what the underlying virtual
machine files are used for. Sometimes resolving a virtual machine problem requires fixing one or
more of the virtual machine's files.
The example lists the files that make up a virtual machine named Win01-A. Except for the log files,
the name of each file starts with the virtual machine's name.
If the virtual machine has more than one disk file, the file pair for the second disk file and later is
named Win01-A_#.vmdk and Win01-A_#-flat.vmdk, where # is the next number in the
sequence, starting with 1.
A virtual machine disk descriptor file details the basic geometry, format, or other identification and
handling for a virtual disk. If a virtual machine has snapshots, a disk descriptor file exists for each
snapshot file, also referred to as a delta file. The disk descriptor file also contains the CID. The CID
value of each disk descriptor file helps to ensure that the content in its parent disk descriptor file is
retained in a consistent state.
In the example, the virtual disk descriptor files and their relationships are as follows:
• Win01-A.vmdk: Descriptor file for the base virtual disk
• Win01-A-000001.vmdk: Descriptor file for the first snapshot. The parent file of this snapshot
file is the base disk descriptor file, Win01-A.vmdk.
• Win01-A-000002.vmdk: Descriptor file for the second snapshot. The parent file of this
snapshot is the first snapshot's disk descriptor file, Win01-A-000001.vmdk.
CIDs that are referenced correctly in the descriptor files confirm the integrity of the snapshot chain.
When a CID mismatch occurs, the virtual machine name is provided in the error message, but you
must identify the following information:
• Which virtual machine disk or disks are affected
• Which specific disk descriptor files are affected
• The cause of the mismatch, or what changes occurred
The virtual machine's log file, vmware.log, is in the virtual machine's directory on the ESXi host,
/vmfs/volumes/datastore-name/virtual-machine-name.
If a content ID (CID) mismatch is found, tasks are prevented from running on the virtual machine.
In effect, a CID mismatch ensures that deviance from the original disk state results in all dependent
child delta content being invalidated. Stored data is protected from further potential corruption.
In the example, a CID mismatch error will be generated because the parentCID value in the
descriptor file, Win01-A-000002.vmdk, does not match the CID of the parent descriptor file,
Win01-A-000001.vmdk.
To resolve the problem, make a backup of the disk descriptor files that
require editing and use a text editor to correct the mismatch.
Verify that the CID mismatch was corrected by running the following
command against the highest-level snapshot:
vmkfstools -e Win01-A-000002.vmdk
If you see failure messages, the CID mismatch was not corrected.
With the symptom determined, make backup copies of the disk descriptor files that require
corrections or editing. In the example, make a backup copy of Win01-A-000002.vmdk. Then edit
the value of parentCID to match the CID in Win01-A-000001.vmdk.
For information about how to resolve a CID mismatch error, see VMware knowledge base article
1007969 at http://kb.vmware.com/kb/1007969.
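The CID relationship described above can be checked offline against copies of the descriptor files. The following Python sketch is illustrative only: it assumes that each descriptor contains one CID= line and one parentCID= line with eight hexadecimal digits, and the function names read_ids and find_mismatches are hypothetical. It does not replace the vmkfstools -e check.

```python
import re


def read_ids(descriptor_text):
    """Return (CID, parentCID) found in the text of a disk descriptor file."""
    cid = re.search(r"^CID=([0-9a-fA-F]{8})\s*$", descriptor_text, re.MULTILINE)
    parent = re.search(r"^parentCID=([0-9a-fA-F]{8})\s*$", descriptor_text, re.MULTILINE)
    if cid is None or parent is None:
        raise ValueError("descriptor is missing CID or parentCID")
    return cid.group(1).lower(), parent.group(1).lower()


def find_mismatches(chain):
    """Check a snapshot chain given as [(name, descriptor_text), ...], base disk first.

    Returns (child_name, parent_name) for every child whose parentCID
    differs from the CID recorded in its parent descriptor.
    """
    mismatches = []
    for (parent_name, parent_text), (child_name, child_text) in zip(chain, chain[1:]):
        parent_cid, _ = read_ids(parent_text)
        _, child_parent_cid = read_ids(child_text)
        if child_parent_cid != parent_cid:
            mismatches.append((child_name, parent_name))
    return mismatches
```

An empty result means that every parentCID matches the CID of its parent. Any pair that is returned identifies the descriptor file whose parentCID value needs attention.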
VMware products might require file systems in a guest operating system to be quiesced before a
snapshot operation for the purposes of backup and data integrity.
Services that are known to generate heavy I/O workloads include Exchange, Active Directory, LDAP,
and MS-SQL.
The quiescing operation can be done by Microsoft Volume Shadow Copy Service (VSS). VSS is
provided by Microsoft in its operating systems since Windows Server 2003. Before verifying a VSS
quiescing issue, ensure that you are able to create a manual nonquiesced snapshot with the Snapshot
Manager in vSphere Web Client.
The quiescing operation can also be done by an optional VMware Tools™ component called the
SYNC driver.
When VSS is invoked, all VSS providers must be running. If a problem occurs with third-party
software providers or the VSS service itself, the snapshot operation might fail.
For detailed steps on how to troubleshoot VSS quiescing problems, see VMware knowledge base
article 1007696 at http://kb.vmware.com/kb/1007696.
Possible Causes
Creating or committing snapshots involves several layers. Users must have proper permissions, the
virtual machine's files must be in place, and the datastore on which the virtual machine and its
snapshots are stored must be in working order.
The authorization to perform tasks in vCenter Server is governed by an access control system. This
system enables the vCenter Server administrator to specify in great detail which users or groups can
perform which tasks on which objects. The access control system is defined with privileges, roles,
users and groups, and objects. Together, privileges, roles, users and groups, and objects define
permissions.
The vCenter Server roles, administrator and virtual machine power user, give a user the permission
to create and manage snapshots. To create and manage snapshots, you must verify that you have one
of these roles or an equivalent custom role associated with your user account.
The best practice is always to restore a virtual machine from backup if there is corruption in the
snapshots.
Snapshot data files are based on the vmfsSparse format. The vmfsSparse format is intended to
store delta content (changed content) for a period of time as a snapshot.
A missing delta descriptor file might occur if you delete the delta descriptor file or files of a virtual
machine whose snapshots are not visible through the Manage Snapshots pane in vSphere Web
Client. You might accidentally delete the descriptor files by navigating to the datastore from the
command line and deleting the files.
If you find that you are missing a delta descriptor file, then you must create one. For detailed steps
on how to create the missing descriptor file, see VMware knowledge base article 1026353 at
http://kb.vmware.com/kb/1026353.
/var/log # df -h
Filesystem Size Used Available Use% Mounted on
In the example, the df command displays the available space (in GB) of the VMware vSphere®
VMFS5 datastores named Local01 and Shared.
If necessary, increase the size of the datastore by adding another extent to the datastore. You might
also consider making room on a datastore by moving virtual machines to other datastores with
available space. Also check to make sure there are no unused files or virtual machines on the
datastore with insufficient space.
For more information about verifying the free disk space for an ESX/ESXi virtual machine, see
VMware knowledge base article 1003755 at http://kb.vmware.com/kb/1003755.
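When the df output is captured to a file or pasted from a terminal session, a short script can flag the datastores that are close to full. This Python sketch assumes the column layout shown in the df -h header above (Use% in the fifth column, mount point last). The function name nearly_full and the 90 percent default threshold are hypothetical choices for illustration.

```python
def nearly_full(df_output, threshold_percent=90):
    """Return the mount points whose Use% is at or above threshold_percent."""
    flagged = []
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue  # not a complete df row
        use = fields[4]
        if not use.endswith("%"):
            continue
        if int(use.rstrip("%")) >= threshold_percent:
            # The mount point is everything after the Use% column.
            flagged.append(" ".join(fields[5:]))
    return flagged
```

A datastore that appears in the result is a candidate for an added extent or for moving virtual machines to another datastore.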
If you are trying to power on a virtual machine from vSphere Web Client, view the Tasks tab and
Events tab of the virtual machine to find out why the power-on operation is failing.
When a virtual machine does not power on, take a top-down approach to
troubleshooting. Start with the virtual machine files and then check the
ESXi host.
Possible Causes
ESXi Host:
• Insufficient resources exist on the ESXi host.
• The ESXi host is not responsive.
If you monitor your ESXi hosts periodically and they are stable and working properly, then start
your troubleshooting by checking that the virtual machine's files are in place and working properly.
To resolve this problem, restore the missing file or files from your last
backup. If a descriptor file is missing, recreate the descriptor file
manually.
A best practice is to make regular backups of your virtual machine files so that if files are
accidentally deleted from disk, you can restore them.
If the virtual machine's configuration (.vmx) file is missing, you can restore it from a backup. Or
you can recreate the file by recreating the virtual machine and pointing to the existing virtual disk
files. For more information about recovering from a missing virtual machine configuration file, see
VMware knowledge base article 1002294 at http://kb.vmware.com/kb/1002294.
If the virtual machine's virtual disk (-flat.vmdk) files are missing, then you must restore these
files from a backup. If the virtual machine's virtual disk descriptor (.vmdk) files are missing, then
you can restore them from a backup or you can manually recreate the missing descriptor file.
For detailed steps about how to recreate a missing virtual machine disk descriptor file, see VMware
knowledge base article 1002511 at http://kb.vmware.com/kb/1002511.
A virtual machine will not power on if one of the virtual machine files, for
example, the virtual machine disk file, is locked.
Perform these steps to find a locked file:
1. Power on a virtual machine:
- If the power-on fails, the error message identifies the affected file.
2. Determine whether the file can be locked:
- touch filename
3. Determine which ESXi host has locked the file:
- vmkfstools -D /vmfs/volumes/Shared/Win01-B/Win011-flat.vmdk
To prevent concurrent changes to critical virtual machine files and file systems, ESXi hosts establish
locks on these files. In certain circumstances, these locks might not be released when the virtual
machine is powered off. The files cannot be accessed by the ESXi hosts while locked, and the
virtual machine is unable to power on.
The virtual machine's disk files (-flat.vmdk) and delta disk files (-delta.vmdk) are files that are
commonly locked during runtime.
You can use the vmkfstools command to determine which ESXi host has locked the file. The
second line shows the MAC address. In the example, the MAC address is 00:13:72:66:E2:00. Log in
to the ESXi host with this MAC address and identify the process that is maintaining a lock on the
virtual machine's file.
You might encounter the following types of locks:
• Mode 0: No lock on file.
• Mode 1: Exclusive lock. Only one specific host or process can access and update the file. Other
hosts or processes have no read permission.
• Mode 2: Read-only lock. All hosts can read the file but no one can update or modify it.
For more information about troubleshooting locked virtual disks, see VMware knowledge base
article 2107795 at http://kb.vmware.com/kb/2107795.
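Reading the MAC address out of the lock information can be scripted. The sketch below assumes that the vmkfstools -D output contains an owner field in the form of four hyphen-separated hexadecimal groups, that the last 12 digits are the MAC address of the host holding the lock, and that an all-zero owner means no host holds it. The function name owner_mac is hypothetical, and the exact output layout should be confirmed on your ESXi build.

```python
import re

OWNER = re.compile(
    r"owner\s+[0-9a-fA-F]{8}-[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-([0-9a-fA-F]{12})"
)


def owner_mac(vmkfstools_output):
    """Return the lock owner's MAC address as AA:BB:CC:DD:EE:FF, or None."""
    match = OWNER.search(vmkfstools_output)
    if match is None:
        return None  # no owner field found in the text
    digits = match.group(1).upper()
    if digits == "0" * 12:
        return None  # all zeros: no host currently holds the lock
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))
```

The returned value can then be compared with the MAC addresses of the VMkernel network adapters on each host to find the host to log in to.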
On the ESXi host command line, use the lsof command to determine which process is locking the
file. Before terminating the process, understand what that process is used for.
Use the kill command to stop the process. Using the kill command abruptly terminates all the
running processes for the virtual machine without generating a core dump to analyze the status later.
Carefully consider the consequences before using this command if you decide to troubleshoot the
virtual machine state.
The ESXi host holding the lock might be running the virtual machine and might have become
unresponsive. Or another running virtual machine has the disk incorrectly added to its configuration
before power-on attempts.
If you have already completed your investigation and you still cannot determine which process is
locking the file, then restart the ESXi host to allow the virtual machine to be powered on again.
For detailed steps on investigating virtual machine file locks on ESXi, see VMware knowledge base
article 10051 at http://kb.vmware.com/kb/10051.
Periodically check that the ESXi hosts in your cluster have enough CPU, memory, and storage to
accommodate your existing virtual machines as well as any new virtual machines.
If the ESXi host is unresponsive, a virtual machine will not power on.
Determine whether the host has crashed or is hanging:
• If the host is hanging, you cannot perform the following tasks:
- Ping the VMkernel network interface.
- Determine whether host Client responds to queries.
- Monitor network traffic from the ESXi host and its virtual machine.
• If the host has crashed, you will see the purple crash screen on the ESXi
console.
An unresponsive host is a serious problem and can cause your virtual machines to fail to power on.
As a first check, check the event viewer on the vCenter Server system to determine whether the
VirtualCenter Server service was restarted at the time a migration was in progress. During startup,
vCenter Server reconnects to all hosts. If a migration completed while vCenter Server was down, a
virtual machine can be reported as an orphan until vCenter Server establishes a connection to the
virtual machine's target host.
Possible Causes
Virtual Machine:
• A virtual machine is deleted outside of vCenter Server.
• A .vmx file contains special characters or incomplete line-item entries.
If you monitor your ESXi hosts periodically and they are stable and working properly, then start
your troubleshooting from the top down.
For information about what orphaned virtual machines are, how they occur, and how you can fix
them, see VMware knowledge base article 1003742 at http://kb.vmware.com/kb/1003742.
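One of the causes listed for connection state problems is a .vmx file that contains special characters or incomplete line-item entries. The following Python sketch shows one way to scan a copy of a .vmx file for such lines. It assumes that each entry has the form key = "value" and that "special characters" means control characters. Both are simplifying assumptions, and the function name suspicious_vmx_lines is hypothetical.

```python
import re

# Assumed entry shape: key = "value", where the key may contain
# letters, digits, underscores, dots, colons, and hyphens.
ENTRY = re.compile(r'^\s*[\w.:\-]+\s*=\s*".*"\s*$')


def suspicious_vmx_lines(vmx_text):
    """Return (line_number, line) for lines that look malformed."""
    problems = []
    for number, line in enumerate(vmx_text.splitlines(), start=1):
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # blank lines and comment lines are ignored
        has_control = any(ord(ch) < 32 and ch != "\t" for ch in line)
        if has_control or not ENTRY.match(line):
            problems.append((number, line))
    return problems
```

Always work on a backup copy of the file, and review each flagged line manually before you change anything.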
In vCenter Server, you might find that you have a virtual machine that has an orphan designation or
has become invalid. An orphan virtual machine is one that exists in the vCenter Server database but
is no longer present on the ESXi host. A virtual machine also appears as orphaned if it exists on a
different ESXi host than the ESXi host expected by vCenter Server.
A virtual machine can have a status of invalid or orphaned after a vSphere vMotion migration or a
vSphere DRS migration, although this type of occurrence is unlikely.
For detailed steps about how to recover after a vSphere vMotion or vSphere DRS migration has
caused the virtual machine to become an orphan, see VMware knowledge base article 1003742 at
http://kb.vmware.com/kb/1003742.
If the configuration file was deleted and the virtual disk remains, recreate
the virtual machine:
• Attach the original virtual disks to a new .vmx file.
If the virtual disk file was deleted, restore virtual machine files from your
last backup.
If vCenter Server is down, a user can still delete a virtual machine. A user can delete a virtual
machine from the command line at the ESXi host, by using vSphere Management Assistant, or by
using vSphere Client directly connected to an ESXi host.
If the configuration file was deleted and the virtual disk remains, you can recreate the virtual
machine by using vSphere Management Assistant or vSphere Web Client. Attach the existing virtual
disk to a new .vmx file.
After completing the steps on the slide, try to power on the virtual machine. If you receive a
question about the virtual machine, continue with the default response and the virtual machine
powers on normally.
To restart the host management services, run the following commands at the ESXi host
command line:
• /etc/init.d/hostd restart
• /etc/init.d/vpxa restart
Alternatively, use the /sbin/services.sh restart command.
If VMware Tools fails to install, verify that the guest operating system is
supported by VMware.
Check the release notes or the knowledge base for any known issues
with installing VMware Tools in certain guest operating systems.
As an initial check, verify that you are using a guest operating system that is supported. For more
information about supported guest operating systems, see VMware Compatibility Guide at
http://www.vmware.com/resources/compatibility.
Possible Causes
Virtual Machine:
• An incorrect guest OS is selected for the virtual machine.
Start your troubleshooting with the more probable cause of an incorrect guest operating system
configuration on the virtual machine. Then check for less likely causes that involve the ISO image
on the ESXi host.
For details on troubleshooting a failed VMware Tools installation in a guest operating system, see
VMware knowledge base article 1003908 at http://kb.vmware.com/kb/1003908.
[Screenshot: VM Options, General Options. VM name: Win01-D. VM config file: [Shared] Win01-D/Win01-D.vmx. VM working location: [Shared] Win01-D/.]
The VMware Tools ISO image that is loaded when you select Install/Upgrade VMware Tools is
decided by the guest operating system type and version that you selected when you created the
virtual machine. Before you install a guest operating system, ensure that the guest operating system
type and version selected for the virtual machine match the version that you plan to install.
For example, select Microsoft Windows Server 2008 (64-bit) instead of Microsoft Windows Server
2008 (32-bit) if the version that you are installing is 64-bit. Otherwise, the operating system
installation, the VMware Tools installation, or both might fail.
• Manually start the VMware Tools installer.
• Manually connect the correct ISO image to the virtual machine.
[Screenshot: VM Hardware settings. CPU: 1 CPU(s), 322 MHz used. Memory: 384 MB, 299 MB used. Hard disk 1: 2.00 GB. Network adapter 1: VM Network (connected). CD/DVD drive file: /usr/lib/vmware/isoimages/windows.iso.]
VMware Tools ISO images are on the ESXi host in the /usr/lib/vmware/isoimages directory.
This directory contains the following files:
• linux.iso
• WinPreVista.iso
• windows.iso
Other VMware Tools ISO images are available for download from http://www.myvmware.com.
When connected to the CD/DVD drive of the virtual machine, the VMware Tools installation starts
automatically.
If the VMware Tools installation fails to start automatically, you can manually start the VMware
Tools installer from the guest operating system. For example, for Windows, go to the drive where
the first CD/DVD drive is configured for your virtual machine and run setup.exe.
For more information on installing VMware Tools manually from the ISO image, see VMware
knowledge base article 1003910 at http://kb.vmware.com/kb/1003910.
VMware Tools ISO images are on the ESXi host in the /usr/lib/vmware/isoimages directory.
The isoimages directory is a symbolic link to the /productLocker/vmtools directory.
In the rare occurrence that the symbolic link does not exist, use the following command line to
recreate the link:
ln -s /productLocker/vmtools /usr/lib/vmware/isoimages
A VMware Tools installation will fail if the VMware Tools ISO image is
corrupt.
To verify whether corruption has occurred, compare the checksum of the
corrupt ISO image with a known good ISO image.
A corrupted VMware Tools ISO image can cause installation failures on your guest operating
system. Verifying that your ISO image is valid is key to a successful installation.
Use the md5sum command to calculate file checksums.
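Where md5sum is not convenient, for example when the ISO images have been copied to a workstation, the same comparison can be made with a few lines of Python. This is a minimal sketch. The function names md5_of and same_checksum are hypothetical, and the paths that you pass in are your own copies of the suspect image and a known good image.

```python
import hashlib


def md5_of(path, chunk_size=1024 * 1024):
    """Return the MD5 checksum of a file as a hexadecimal string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large ISO images are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_checksum(suspect_path, known_good_path):
    """Return True when both files produce the same MD5 checksum."""
    return md5_of(suspect_path) == md5_of(known_good_path)
```

If same_checksum returns False, replace the suspect ISO image with the known good copy before you retry the VMware Tools installation.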
• A CID resides in each virtual machine's disk descriptor file for integrity and
state tracking.
• CID mismatch conditions can be caused by software errors or interruptions to
vSphere vMotion migrations.
• Virtual machine quiescing can be done by the Microsoft VSS or the VMware
Tools SYNC driver.
• If you cannot create a content library, check that you have the required content
library global permissions.
• When a virtual machine does not power on, check that there are sufficient
resources on the host, and virtual machine files are not missing or locked.
• For problems related to orphaned virtual machines on ESXi, reregistering the
virtual machines can return the virtual machines to a connected state.
• If VMware Tools installation fails, verify that the VMware Tools ISO image can
be loaded and is not corrupt.
Questions?
Module 8
You Are Here
Slide 8-2
1. Course Introduction
2. Introduction to Troubleshooting
3. Troubleshooting Tools
4. Troubleshooting Virtual Networking
5. Troubleshooting Storage
6. Troubleshooting vSphere Clusters
7. Troubleshooting Virtual Machines
8. Troubleshooting vCenter Server and ESXi
By the end of this module, you should be able to meet the following
objectives:
• Understand vSphere 6.x architecture and main components
• Troubleshoot authentication and certificate problems
• Analyze and solve vCenter Server service problems
• Diagnose and troubleshoot vCenter Server database problems
• Use vCenter Server Appliance shell and the Bash shell to identify and solve
problems
• Identify and troubleshoot ESXi host problems
VMware Platform Services Controller™ provides infrastructure services for vCenter environments
by providing services that were previously installed as separate vCenter components:
• Lookup Service: Creates authenticated connections between multiple services endpoints from
the Platform Services Controller node.
• vCenter Single Sign-On service: Coordinates authentication credentials between vCenter Server
and other authentication endpoint services.
• VMware Certificate Authority: Provides vCenter Server components and ESXi hosts with
certificates and stores those certificates for authentication.
• License Service: Delivers centralized license management and reporting functionality to
vSphere and products that integrate with vSphere.
• Directory Service: Provides directory services associated with the vsphere.local domain.
The vCenter Server system provides the remainder of the vCenter Server services, including vCenter
Server, vSphere Web Client, Inventory Service, VMware vSphere® Auto Deploy™, VMware
vSphere® ESXi™ Dump Collector, and VMware vSphere® Syslog Collector or Syslog Service.
[Figure: Enhanced Linked Mode with an external Platform Services Controller instance, shown without a load balancer and with a load balancer. Labeled components: vCenter Single Sign-On, VMware CA, vCenter Server, VMware Directory Service.]
vCenter operations generally occur in the context of authenticated connections between the client,
vCenter Server, and other VMware product solutions. To support the requirements for secure
software environments, software components require authorization to perform operations on behalf
of a user. In a vCenter Single Sign-On environment, a user provides credentials once, and
components in the environment perform operations based on the original authentication.
A user logs in to the vSphere Web Client with a user name and password to access the vCenter Server
system or another vCenter service. The default user name for vSphere Web Client is
administrator@vsphere.local. Other user accounts can be granted access to sign on. A user can also
log in using Windows credentials by checking the Use Windows session authentication check box.
vSphere Web Client passes the login information to the vCenter Single Sign-On service, which
checks the SAML token of the vSphere Web Client. If the vSphere Web Client has a valid token,
vCenter Single Sign-On checks whether the user is in a configured identity source, for example
Active Directory (AD).
If no domain name is entered with the user name, vCenter Single Sign-On checks in the default
vCenter Single Sign-On domain, vsphere.local.
If a domain name is included with the user name (DOMAIN\user1 or user1@DOMAIN), vCenter
Single Sign-On checks that domain.
For more information about vCenter Single Sign-On, see vSphere Security Guide at
http://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-pubs.html.
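The domain selection rules described above can be illustrated with a short Python sketch. It only mimics the decision about which domain is checked and is not how vCenter Single Sign-On is implemented. The function name split_login is hypothetical.

```python
DEFAULT_DOMAIN = "vsphere.local"


def split_login(name, default_domain=DEFAULT_DOMAIN):
    """Return (user, domain) for a login name, applying the default domain if none is given."""
    if "\\" in name:                     # DOMAIN\user1 form
        domain, user = name.split("\\", 1)
    elif "@" in name:                    # user1@DOMAIN form
        user, domain = name.rsplit("@", 1)
    else:                                # no domain entered
        user, domain = name, default_domain
    return user, domain
```

When a login fails unexpectedly, checking which domain the entered name resolves to is a quick first step, because a missing domain name causes the default vsphere.local domain to be checked.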
VMware CA
Slide 8-9
Platform Services Controller handles tasks such as single sign-on and licensing, and ships with its
own Certificate Authority called VMware CA. See VMware Certificate Authority Overview and
Using VMware CA Root Certificates in a Browser at
http://blogs.vmware.com/vsphere/2015/03/vmware-certificate-authority-overview-using-vmca-root-certificates-browser.html.
For more information about replacing certificate and key files, see vSphere Security Guide at
https://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-6-pubs.html. For more information
about replacing a vSphere 6.0 machine SSL certificate with a custom Certificate Authority signed
certificate, see VMware knowledge base article 2112277 at http://kb.vmware.com/kb/2112277.
In order for SSL to work, you must trust the certificate presented by the
server.
• A certificate binds a public key with a distinguished name (DN):
- A DN is the name of the person or entity that owns the public key.
• Certificates contain:
- Issuer name (CA)
- Name of system using the certificate (common name or URL)
- Public key of system
- Serial number
• No two certificates from the same CA ever use the same serial number.
- Date range of when the certificate is valid.
• All certificates have an expiration date.
• CAs periodically release certificate revocation lists (CRLs):
- If a certificate is listed on a CRL from a CA you trust, then the system does not trust
that certificate.
- If a system cannot contact the CA and check the CRL, then some systems do not
trust the certificate.
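The validity and revocation rules on the slide can be expressed as a simple check. The following Python sketch is a conceptual model only. It performs no signature verification, and the function name and parameters are hypothetical. Real validation is done by the TLS library of the client.

```python
from datetime import datetime


def certificate_acceptable(not_before, not_after, serial, revoked_serials, now):
    """Return True if the certificate is inside its date range and not on the CRL."""
    if not (not_before <= now <= not_after):
        return False  # not yet valid, or expired
    return serial not in revoked_serials  # listed on the CRL means not trusted
```

A certificate that fails either test is rejected even if the issuing CA is trusted.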
In order for SSL to work, you must trust the certificate presented by the
server.
• A certificate is signed with the issuer's private key.
• A certificate contains all of the information needed to verify its validity.
• A certificate does not contain the CA's certificate:
- The CA's certificate contains the CA's public key.
- Anyone who has the CA's public key can decrypt a message encrypted with the CA's
private key.
- You must go to the CA's Website to download the CA's certificate (independent
verification).
• Certificates are stored in a local database called a keystore:
- Your type of keystore depends on your system and your software tools.
• A self-signed certificate is where the issuer (server) and the user (client) are
the same system.
If you trust the CA, then you implicitly trust all of the certificates issued by
that CA.
• In order to trust a certificate, you must trust some part of the chain of trust. One
of the following must be true:
- You must say that you explicitly trust the certificate itself.
- You must say that you explicitly trust the CA that issued it.
• In a self-signed certificate the issuer and the user are the same system.
- To use self-signed certificates every user (client) system must install and explicitly
trust every self-signed certificate that is in use in the entire network.
- Every time a new service is brought on line all clients must individually install and
trust each and every self-signed certificate in the network.
• An in-house or commercial CA eliminates the requirement of each client
system installing each and every self-signed certificate so long as:
- All client systems trust the CA.
- All certificates come from that trusted CA.
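The trust rules above reduce to two questions: is the certificate itself explicitly trusted, or is its issuer a trusted CA? The Python sketch below models only that decision. It is a toy model under stated assumptions: certificates are plain dictionaries with hypothetical subject and issuer keys, and no cryptographic verification takes place.

```python
def is_self_signed(cert):
    """A self-signed certificate names the same system as subject and issuer."""
    return cert["subject"] == cert["issuer"]


def is_trusted(cert, trusted_certs, trusted_cas):
    """Return True if the certificate, or the CA that issued it, is explicitly trusted."""
    if cert["subject"] in trusted_certs:   # the certificate itself is explicitly trusted
        return True
    return cert["issuer"] in trusted_cas   # or the issuing CA is explicitly trusted
```

With a trusted CA, a client needs one entry in trusted_cas no matter how many services are deployed. With self-signed certificates, every service adds another entry to trusted_certs on every client.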
[Figure: Chain of trust among a client, a web server, and a CA. Captions: "Why should I trust you?", "I trust the CA", and "Because I trust the CA and the CA has issued you a certificate then I trust you."]
Symptoms:
• Replacing the machine SSL certificate or solution user certificates with custom
certificate authority certificates fails at 0 percent.
• The certificate-manager.log file indicates that the dir-cli command
to publish the trusted certificate failed.
Causes:
• All Intermediates and the root CA certificates must be published into the
trusted store in VECS for the script to complete.
• This issue can also be caused by using non-Base64 certificates.
Solutions:
• To work around this issue, manually publish the full chain to the VECS or
upgrade to vCenter Server 6.0.0b or higher.
For more information about solving this certificate problem, see VMware knowledge base article
2111571 at http://kb.vmware.com/kb/2111571.
Using the vSphere Web Client, check the vCenter Server service status.
[Screenshot: vSphere Web Client service health view. Listed services include Hardware Health Service and VMware ESX Agent Manager.]
As a first test, try to stop the VirtualCenter Server service and then restart it.
For Windows-based vCenter Server, you can use either the embedded PostgreSQL database to
support up to 20 hosts and 200 virtual machines, or an external database (Oracle or Microsoft SQL)
for larger deployments. Ensure that the external database is interoperable with the version of
vCenter Server that you are installing. For information about supported database server versions, see
the VMware Product Interoperability Matrix at
http://www.vmware.com/resources/compatibility/sim/interop_matrix.php. If you want to use an
external database, ensure that you create a 64-bit DSN so that vCenter Server can connect to the
external database.
For VMware vCenter™ Server Appliance™, you can use only the embedded PostgreSQL database. See the
VMware Product Interoperability Matrix at
http://www.vmware.com/resources/compatibility/sim/interop_matrix.php.
For more information about the vCenter Server database requirements, see vSphere Installation and
Setup Guide at https://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-6-pubs.html.
To validate the basic configuration of a vCenter Server database:
• Verify that adequate disk space is available on the volume that is storing the database files to
ensure the correct operation of the database. If adequate space is not available on the physical
volume that stores the database files, free some disk space.
[Figure: Database space utilization charts, including the transaction log and the vCenter Server inventory.]
The VMware Management Appliance can display the health status of an embedded PostgreSQL
database. This includes the amount of space being consumed by Statistics, Events, Alarms, and
Tasks (SEAT) along with the space consumed by the transaction log and the vCenter Server inventory.
The size and rate of the vCenter Server database growth can affect
vCenter Server performance.
vCenter Server collects several types of data and stores the data in the
database:
• Performance data
• Log of tasks that were performed
• Log of events that occurred
Under many circumstances, the vCenter Server database can grow excessively. Growth of the
vCenter Server database is due to data being collected and stored in the database. This data can be
categorized in three ways:
• Performance data
• Log of tasks that were performed
• Log of events that occurred
When troubleshooting excessive growth of the database, start by examining where the growth is
occurring. From here, you can determine how to troubleshoot the issue.
A subset of vCenter Server tables accounts for most of the cases that
show substantial growth in the database:
• vpx_hist_stat1 to vpx_hist_stat4
- Contains the collected performance data information
• vpx_sample_time1 to vpx_sample_time4
- Stores the reference time frames for the performance data in the vpx_hist_stat
tables
• vpx_event and vpx_event_arg
- Stores the event information in vCenter Server
• vpx_task
- Stores the task information in vCenter Server
The vCenter Server database is a complex database and several areas can cause problems. Of the
many tables in the vCenter Server database, few accumulate data during regular operation. The
previous tables accumulate data during regular operation.
For information about how to determine where growth is occurring in the vCenter Server database,
see VMware knowledge base article 1028356 at http://kb.vmware.com/kb/1028356.
For information about how to purge old data from the database used by vCenter Server, see VMware
knowledge base article 1025914 at http://kb.vmware.com/kb/1025914.
• If you think that performance data is not being processed, determine the last
time that data was successfully processed:
- If the date returned is more than 24 hours in the past, a problem with the rollup jobs
is likely.
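The 24-hour rule above is easy to apply once you have the timestamp of the last successfully processed data. The Python sketch below shows only the comparison. How you obtain the timestamp depends on your database, and the function name rollup_is_stale is hypothetical.

```python
from datetime import datetime, timedelta


def rollup_is_stale(last_processed, now, max_age=timedelta(hours=24)):
    """Return True if the last successful processing time is more than max_age in the past."""
    return now - last_processed > max_age
```

A True result points to the rollup jobs as the next thing to investigate.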
To start diagnosing the performance data situation in vCenter Server, check the size of the database
tables. Because vpx_hist_stat1 and vpx_sample_time1 store the raw incoming data for the
statistic level, these tables frequently cause problems.
For details about how to verify the size of the vCenter Server database tables, see VMware
knowledge base article 2007388 at http://kb.vmware.com/kb/2007388.
In an upgrade or recovery scenario, the database, although properly restored, might not include the
rollup jobs. Validate within MSSQL or Oracle whether the Past Day, Week, and
Month rollup jobs exist and re-create them if necessary.
By default, when SQL is installed, the MSSQL Agent service is set to start in Manual mode. After a
reboot, the rollup jobs might not run because the service is not started. Validate this configuration by
checking the agent services and making sure that the MSSQL Agent service is started and set to
start automatically.
If statistic collection levels above level 2 are used, other than for debugging an issue, growth of the
database might occur. Lowering the statistics level stabilizes the system, but it might not be
possible to recover.
Truncating the unprocessed information from the vpx_hist_stat1 table is normally a last resort,
but can be the ultimate solution if it is not possible to process the data in an appropriate period of time.
A problem occurs when vCenter Server Appliance with embedded vPostgres database stops running
because a disk partition on the vPostgres database contains no free space. You can verify that the
vPostgres database is full and if necessary, extend the disk size for the vPostgres database. For more
information, see VMware knowledge base article 2058273 at http://kb.vmware.com/kb/2058273.
Use the following procedure to move the database to a drive with more
space:
1. Shut down your vCenter Server Appliance virtual machine.
2. Add a new hard disk to the virtual machine.
3. Start up the vCenter Server Appliance virtual machine.
Too high a statistics level can cause your database to fill up. If your database is filling up too fast,
reduce the statistics level to Level 1. The higher the statistics level, the greater the level of
detail that is recorded in the database.
[Screenshot: vCenter Server statistics settings, showing collection intervals of 5 minutes (saved for 1 day) and 30 minutes (saved for 1 week) at statistics level 1.]
Monitor vCenter database consumption and disk partition in the Appliance Management UI.
[Screenshot: vCenter Server database settings, showing the maximum connections, task cleanup, task retention (days), event cleanup, and event retention (days) options. Use the task and event retention settings to limit the growth of the database. The dialog box warns that increasing the event retention to more than 30 days significantly increases the vCenter Server database size and could shut down vCenter Server, so enlarge the database accordingly.]
A reinitialization of the vCenter Server database resets it to the default configuration, as if the
vCenter Server was newly installed.
Reinitializing the database permanently erases all data in the database. If data must be protected,
verify that a proper backup of the database is taken before reinitializing the database.
For details about how to reinitialize the vCenter Server database, see VMware knowledge base
article 2031295 at http://kb.vmware.com/kb/2031295.
It is best practice to stop vCenter Server services before you stop the PostgreSQL database on
vCenter Server Appliance. At the very least, stop the main vCenter Server service with the
following command:
service-control --stop vmware-vpxd
Then you can stop the database server:
service-control --stop vmware-vpostgres
Restart the services in the opposite order. Start the database server first:
service-control --start vmware-vpostgres
Then start the vCenter Server service:
service-control --start vmware-vpxd
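The ordering rule above (stop the database last, start it first) can be captured in a short script. This is a minimal sketch that builds the same service-control commands shown above. The helper names build_commands and run_commands are hypothetical, and run_commands is meant to be run only on the appliance itself.

```python
import subprocess
from typing import List

# Service names as used in the commands above. The database service is
# stopped last and started first.
STOP_ORDER = ["vmware-vpxd", "vmware-vpostgres"]


def build_commands(action: str) -> List[List[str]]:
    """Return service-control commands for 'stop' or 'start' in a safe order."""
    if action == "stop":
        services = STOP_ORDER
    elif action == "start":
        services = list(reversed(STOP_ORDER))
    else:
        raise ValueError("action must be 'stop' or 'start'")
    return [["service-control", f"--{action}", name] for name in services]


def run_commands(commands: List[List[str]]) -> None:
    """Run each command in turn and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Keeping the order in one list means that the start sequence is always the exact reverse of the stop sequence.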
If you have multiple services running on your vCenter Server Appliance virtual machine (VMware
License Service, VMware Identity Management Service, VMware Content Library Service, and so
on), you can stop them all safely with the command:
service-control --stop --all
The vCenter Server Appliance API commands and plug-ins included in the appliance shell enable
you to configure, troubleshoot, and monitor vCenter Server Appliance.
To use the plug-ins and API commands, you must first access the appliance shell.
You can access the appliance shell directly through the appliance console, or remotely using a
remote console connection, such as SSH.
If you log in to the appliance shell as a user who has a super administrator's role, you can enable
access to the Bash shell of the appliance. The default user with a super administrator role is root.
Connect to https://<vCenter Server>:5480.
To enable access to the Bash shell, use the shell.set --enabled true command.
To access the Bash shell, run the shell or pi shell command. You can run all commands in the
appliance shell with or without the pi keyword.
Use the Access menu to configure access settings such as Bash shell
and SSH.
[Screenshot: vCenter Server Appliance Management UI, Access page. The Edit Access Settings dialog box contains the Enable SSH Login and Enable BASH Shell check boxes and a Timeout (Minutes) field.]
After you have enabled access you can use SSH to log in to the vCenter
Server appliance shell.
Connected to service
Command> shell
Shell access is granted to root
root@sa-vcsa-01 [ ~ ]#
Available information for many problems might prove inconclusive. Server hangs, purple screen
crashes without disk dumps, or disk failures might leave the server with very little information
logged regarding a problem. While the root cause of this outage might be elusive, you can better
prepare for the next time the problem happens.
Review logs for diagnostic messages that were generated leading up to the issue as well as during
the issue.
For hardware faults, run hardware diagnostics.
Faulty CPUs can manifest as unusual behavior, such as abrupt reboots, hangs, or purple screens.
Most often, the CPU generates an exception that is trapped by the VMkernel and handled with a
purple screen.
View the ESXi local console at the DCUI to verify that the purple screen
problem exists.
[Screenshot: VMware ESXi 6.5.0 purple diagnostic screen reporting a #PF Exception 14 in world 295215:vsish, followed by register values, a VMkernel backtrace, and a message that the coredump to disk was successful.]
The ESXi host crashes when the VMkernel enters a condition where it cannot or should not proceed.
A VMkernel fault is manifested by a purple screen on the ESXi console. This screen is referred to as
a purple screen of death (PSOD).
When recovering from a host crash in a production environment, the main goal is to get your virtual
machines back up and running as soon as possible. A vSphere HA cluster can help you recover
quickly when one of the hosts in the cluster fails.
When a host stops responding, the entire server becomes unresponsive. You might not be able to
determine whether the issue is related to the hardware or the software without collecting further data.
If the host is unresponsive and you cannot boot the system properly, the problem could be due to a
corrupt configuration or a hardware fault. Try to boot from a diagnostics or installer CD if possible.
When the system becomes unresponsive, it indicates that the ESXi host
has stopped responding. The system might get into this state after a
power cycle.
An ESXi host stops responding for the following reasons:
• The VMkernel is too busy or deadlocked.
• A hardware lockup occurs.
The main goal is to get the virtual machines back up and running as soon as possible. After you have
done that, do some research and try to determine why the ESXi host locked up.
Check the VMkernel log file (/var/log/vmkernel.log) for error messages.
Use esxtop to gather performance statistics. esxtop shows current performance statistics of the
entire ESXi host.
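The log check described above can be partly automated by filtering the log for suspicious lines. This is a minimal sketch. The keyword list and the function names find_suspect_lines and scan_log are assumptions for illustration, and the keywords should be adjusted to the messages that you are looking for.

```python
from typing import Iterable, List, Sequence

# Keywords are an assumption for illustration, not an official list.
KEYWORDS = ("error", "warning", "failed")


def find_suspect_lines(lines: Iterable[str],
                       keywords: Sequence[str] = KEYWORDS) -> List[str]:
    """Return the lines that contain any keyword, ignoring case."""
    matches = []
    for line in lines:
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            matches.append(line.rstrip("\n"))
    return matches


def scan_log(path: str = "/var/log/vmkernel.log") -> List[str]:
    """Read the log at path and return its suspect lines."""
    with open(path, errors="replace") as handle:
        return find_suspect_lines(handle)
```

A keyword filter only narrows the search. Read the surrounding lines in the log to understand what led up to each message.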
A best practice is to have logs captured independent of disk or network connectivity when
troubleshooting issues. Enabling serial logging sends all VMkernel logs to the serial port in addition
to their normal destination.
For more information about backing up an ESXi host, see VMware knowledge base article 2042141
at http://kb.vmware.com/kb/2042141.
Identify, diagnose, and resolve vCenter Server and ESXi host problems
1. Run a Break Script
2. Verify That the System Is Not Functioning Properly
3. Troubleshoot and Repair the Problem
4. Verify That the Problem Is Repaired
• Using the vCenter Server Appliance shell and the Bash shell enables you to
carry out configuration, monitoring, and troubleshooting tasks.
• The API commands and plug-ins in vCenter Server Appliance can be used to
perform administrative tasks and are useful for troubleshooting.
• An ESXi host crash is typically caused by a CPU exception, driver or module
panic, machine check exception, hardware fault, or software defect.
• Use tools such as ping to determine if a host is hung, and use tools such as log
files or performance statistics to determine the causes of the host lock-up.
Questions?