Initial Troubleshooting – Start with the basics in troubleshooting – Transport Network and Control Plane
Identifying Controller Deployment Issues:
- View the vCenter Tasks and Events logs during deployment.
Verify connectivity from NSX Manager to vCenter:
- On NSX Manager, run ping x.x.x.x, show arp, show ip route, or debug connection <IP-or-hostname> to verify network connectivity.
- On NSX Manager, run show log follow, then connect to vCenter and check the log for connection failures.
- Verify network configuration on NSX Manager by running command show running-config.
Identify EAM common issues:
- Check vSphere ESX Agent Manager for errors:
- vCenter home > vCenter Solutions Manager > vSphere ESX Agent Manager.
- Check status of Agencies prefixed with _VCNS_153.
- Log in as root to the ESXi host that is experiencing the issue and run tail /var/log/esxupdate.log.
NOTE: You can also access the Managed Object Browser by accessing address:
https://<VCIP or hostname>/eam/mob/
UWAs (vsfwd or netcpa) not functioning correctly? This manifests itself as the firewall showing a bad status, or as the control plane between the hypervisor(s) and the Controllers being down.
- To find faults in the messaging infrastructure, check NSX Manager System Events for a host reporting that Messaging Infrastructure is down.
- More than one ESXi host affected? Check the message bus service in the NSX Manager appliance web UI under the Summary tab.
- If RabbitMQ is stopped, then restart it.
Common Deployment Issues:
1. Connecting NSX to vCenter
- DNS/NTP incorrectly configured on NSX Manager and vCenter.
- User account without vCenter Administrator role used to connect NSX Manager to vCenter.
- No network connectivity between NSX Manager and vCenter Server.
- User logging into vCenter with an account with no role on NSX Manager.
2. Controller Deployment
- Insufficient resources available to host the Controllers.
- NTP on ESXi hosts and NSX Manager not in sync.
- IP connectivity issues between NSX Manager and the NSX Controllers.
3. Host Preparation
- EAM fails to deploy VIBs because of mis-configured DNS on ESXi hosts.
- EAM fails to deploy VIBs because of firewall blocking required ports between ESXi, NSX Manager and vCenter.
- A previous VIB of an older version is already installed, requiring user intervention to reboot the hosts.
- Incorrect teaming method selected for VXLAN.
- VTEPs do not have connectivity to each other.
- DHCP selected to assign VTEP IPs but unavailable.
- If a vmknic has a bad IP, configure manually via vCenter.
- MTU on Transport Network not set to 1600 bytes or greater.
NSX Controller CLI VXLAN Commands:
- show control-cluster logical-switches vni <vni>
- show control-cluster logical-switches connection-table <vni>
- show control-cluster logical-switches vtep-table <vni>
- show control-cluster logical-switches mac-table <vni>
- show control-cluster logical-switches arp-table <vni>
- show control-cluster logical-switches vni-stats <vni>
NSX Controller CLI cluster status and health:
- show control-cluster status
- show control-cluster startup-nodes
- show control-cluster roles
- show control-cluster connections
- show control-cluster core stats
- show network <arg>
- show log cloudnet/cloudnet_java-vnet-controller.<start-time-stamp>.log
- sync control-cluster <arg>
VXLAN namespace for esxcli:
- esxcli network vswitch dvs vmware vxlan list
- esxcli network vswitch dvs vmware vxlan network list --vds-name=<vds>
- esxcli network vswitch dvs vmware vxlan network mac list --vds-name=<vds> --vxlan-id=<vni>
- esxcli network vswitch dvs vmware vxlan network arp list --vds-name=<vds> --vxlan-id=<vni>
- esxcli network vswitch dvs vmware vxlan network port list --vds-name=<vds> --vxlan-id=<vni>
- esxcli network vswitch dvs vmware vxlan network stats list --vds-name=<vds> --vxlan-id=<vni>
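The per-switch commands above differ only in the sub-namespace, so they can be generated for one VDS and VNI. This is a minimal sketch, not a vendor tool: the function name vxlan_checks is hypothetical, and it assumes a VDS name without spaces. It only prints the commands, so the list can be reviewed before anything runs.

```shell
#!/bin/sh
# Print the per-logical-switch esxcli checks for one VDS and one VNI.
# $1 = VDS name (assumed to contain no spaces), $2 = VNI (VXLAN ID).
# vxlan_checks is a hypothetical helper name, not part of esxcli.
vxlan_checks() {
    vds=$1
    vni=$2
    base="esxcli network vswitch dvs vmware vxlan network"

    # The network list is per VDS and takes no VNI.
    echo "$base list --vds-name=$vds"

    # The remaining tables are per VNI.
    for sub in mac arp port stats; do
        echo "$base $sub list --vds-name=$vds --vxlan-id=$vni"
    done
}
```

On an ESXi host the printed commands could be run with something like vxlan_checks <vds> <vni> | sh, once the output looks right.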
Troubleshooting Components – Understand the component interactions to narrow the problem focus
No connectivity for new VMs, increased BUM traffic (ARP cache misses).
General NSX Controller troubleshooting steps:
- Verify Controller cluster status and roles.
- Verify Controller node network connectivity.
- Check Controller API service.
- Validate VXLAN and Logical Router mapping table entries to ensure they are consistent.
- Review source and destination netcpa logs and CLI to determine control plane connectivity issues between ESXi hosts and NSX Controller.
Verify VTEPs have sent network information to Controllers.
- show control-cluster logical-switches vni <vni.no>
- show control-cluster logical-switches vtep-table <vni.no>
User World Agent (UWA) issues:
- Logical switching not functioning (netcpa).
- Firewall rules not being updated (vsfwd).
General UWA troubleshooting steps:
- Start UWA if not running.
- tail netcpa logs: /var/log/netcpa.log
- tail vsfwd logs: /var/log/vsfwd.log
Check if the UWAs are connected to NSX Manager and the Controllers.
esxcli network ip connection list | grep 5671 (message bus TCP connection)
esxcli network ip connection list | grep 1234 (Controller TCP connection)
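A plain grep for the port also matches sessions that never completed, so it can help to count only ESTABLISHED sessions. The sketch below works on saved output of the connection list. It assumes each session is one line containing an address ending in :<port> and the word ESTABLISHED. The function names and the capture file are hypothetical.

```shell
#!/bin/sh
# Count lines on stdin that show an ESTABLISHED session involving the given port.
# Assumption: one session per line, with an address field ending in ":<port>".
count_established() {
    awk -v port="$1" '
        {
            hit = 0; est = 0
            for (i = 1; i <= NF; i++) {
                if ($i ~ (":" port "$")) hit = 1   # address ends in :<port>
                if ($i == "ESTABLISHED") est = 1   # session is fully set up
            }
            if (hit && est) n++
        }
        END { print n + 0 }'
}

# Summarise message bus (5671) and Controller (1234) sessions from a capture file.
uwa_report() {
    file=$1
    bus=$(count_established 5671 < "$file")
    ctl=$(count_established 1234 < "$file")
    echo "message-bus=$bus controller=$ctl"
}
```

For example, save the output with esxcli network ip connection list > /tmp/conns.txt and then call uwa_report /tmp/conns.txt. A count of 0 for either port points at the matching UWA (vsfwd for the message bus, netcpa for the Controllers).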
Check the configuration file /etc/vmware/netcpa/config-by-vsm.xml on the ESXi host that has the settings under UserVars/Rmq* (in particular UserVars/RmqIpAddress).
The list of UserVars needed for the message bus currently are:
NVS Issues – Limited/Intermittent connectivity for VMs on the same Logical Switch.
General VXLAN troubleshooting steps:
- Check for incorrect MTU on Physical Network for VXLAN traffic.
- Incorrect IP route configured during VXLAN configuration.
- Physical network connectivity issues.
Verify connectivity between VTEPs:
- Ping from the VXLAN dedicated TCP/IP stack: ping ++netstack=vxlan -I vmk1 <ip address>
- View the routing table of the VXLAN dedicated TCP/IP stack: esxcli network ip route ipv4 list -N vxlan
- Ping succeeds between VMs but TCP seems intermittent? A likely cause is an incorrect MTU on the physical network or on the VDS in vSphere. Check the MTU configured on the VDS in use by VXLAN.
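A small ping can succeed where full-size frames are dropped, so a useful follow-up to the MTU check is a don't-fragment ping sized to the transport MTU. This sketch assumes IPv4 ICMP overhead of 28 bytes (20-byte IP header plus 8-byte ICMP header). It also uses the ping options -d (don't fragment) and -s (payload size), which are not in the list above. The function names are hypothetical.

```shell
#!/bin/sh
# Largest ICMP payload that fits in one unfragmented IPv4 packet of the given MTU.
# 28 = 20-byte IPv4 header + 8-byte ICMP header.
vxlan_ping_size() {
    echo $(( $1 - 28 ))
}

# Print a don't-fragment ping for the VXLAN netstack.
# $1 = MTU to test, $2 = VTEP vmknic, $3 = remote VTEP IP (caller-supplied).
vxlan_mtu_ping() {
    mtu=$1
    vmk=$2
    dst=$3
    echo "ping ++netstack=vxlan -d -s $(vxlan_ping_size "$mtu") -I $vmk $dst"
}
```

For an MTU of 1600, vxlan_ping_size 1600 gives 1572. If that ping fails while a default-size ping succeeds, some hop on the transport network is likely configured below 1600.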
Verify VXLAN component:
- Verify the VXLAN VIB is installed and is the correct version: esxcli software vib get --vibname esx-vxlan
- Verify the VXLAN kernel module vdl2 is loaded: vmkload_mod -l | grep vdl2 (logs to /var/log/vmkernel.log with prefix VXLAN)
- Verify the control plane is up and active for the Logical Switch on the ESXi host: esxcli network vswitch dvs vmware vxlan network list --vds-name=<vds>
- Verify VM information. Verify that the ESXi host has learned the MAC addresses of remote VMs: esxcli network vswitch dvs vmware vxlan network mac list --vds-name=<vds> --vxlan-id=<vni>
- List active ports and the VTEP they are mapped to: esxcli network vswitch dvs vmware vxlan network port list --vds-name=<vds> --vxlan-id=<vni>
- Verify the host has a locally cached ARP entry for remote VMs: esxcli network vswitch dvs vmware vxlan network arp list --vds-name=<vds> --vxlan-id=<vni>
Distributed Firewall issues – Flow Monitoring provides vNIC level visibility of VM traffic flow.
- Detailed Flow Data for both Allow and Block flows.
- Global flow collection disabled by default – click the Enable button.
- By default NSX excludes its own VMs: NSX Manager, Controllers and Edges.
Note: Add a VM to the Exclusion List to remove it from the DFW. This allows you to determine if it’s a DFW issue. If there is still a problem, then it is not DFW-related.
- In the vSphere Web Client, go to Networking & Security and select the NSX Manager. From Manage > Exclusion List, click the + and select the VM.
- Verify the DFW VIB (esx-vsip) is installed on the ESXi host: esxcli software vib list | grep vsip
- Verify that the DFW kernel module is loaded on the ESXi host: vmkload_mod -l | grep vsip
- Verify the vsfwd service daemon is running on the ESXi host: ps | grep vsfwd
- To start/stop the vsfwd daemon on the ESXi host: /etc/init.d/vShield-Stateful-Firewall [stop|start|status|restart]
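The three host-side DFW checks above can be run together and reduced to one pass/fail result. This is a sketch under assumptions, not an official health check. The wrapper functions and dfw_check are hypothetical names, and the status strings are illustrative. The wrappers exist only so each command can be swapped out or stubbed.

```shell
#!/bin/sh
# Thin wrappers around the host commands from the checklist.
list_vibs()    { esxcli software vib list; }
list_modules() { vmkload_mod -l; }
list_procs()   { ps; }

# Run the three DFW checks in order; return non-zero if any of them fails.
dfw_check() {
    rc=0
    if list_vibs | grep -q vsip; then
        echo "vib esx-vsip: present"
    else
        echo "vib esx-vsip: MISSING"; rc=1
    fi
    if list_modules | grep -q vsip; then
        echo "module vsip: loaded"
    else
        echo "module vsip: NOT LOADED"; rc=1
    fi
    # [v]sfwd keeps grep from matching its own command line in the process list.
    if list_procs | grep -q '[v]sfwd'; then
        echo "daemon vsfwd: running"
    else
        echo "daemon vsfwd: NOT RUNNING"; rc=1
    fi
    return $rc
}
```

If only the daemon check fails, restarting vsfwd through its init script is the natural next step. A missing VIB or module points back at host preparation.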
NSX Manager Log (collected via web UI)
- Select Download Tech Support Log
Edge (VDR/ESG) Log (collected via web UI)
- From the vSphere Web Client, right click vCenter Server, select All vCenter Actions -> Export System Logs.
- You can also generate ESXi host logs from the ESXi host CLI: vm-support.
NSX Controller Logs
- From the vSphere Web Client, go to Networking & Security -> Installation -> NSX Controller Nodes.
- Select the Controller and select Download Tech Support logs.
Issues and Corresponding Logs
Installation/upgrade related issues
- NSX Manager Log.
- vCenter Support Bundle: /var/log/vmware/vpx/EAM.log and /var/log/esxupdate.log.
- NSX Manager log.
- VDR log from the affected VDR.
- VM support bundle: /var/log/netcpa.log and /var/log/vsfwd.log.
- Controller logs.
Edge Services Gateway issues
- NSX Manager log.
- Edge log for the affected ESG.
NSX Manager issues
- NSX Manager log.
- NSX Manager log.
- vCenter Support bundle.
- VM support bundle.
VXLAN data plane: /var/log/vmkernel.log.
VXLAN control plane: /var/log/netcpa.log.
Management plane: /var/log/vsfwd.log and /var/log/netcpa.log.
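The plane-to-log mapping above is small enough to keep as a lookup when moving between hosts. A minimal sketch follows. The function name logs_for_plane and the short plane keywords (data, control, management) are hypothetical, and the paths are the ones listed above.

```shell
#!/bin/sh
# Map a troubleshooting plane to the ESXi log file(s) listed for it.
# Keywords are illustrative: data, control, management.
logs_for_plane() {
    case "$1" in
        data)       echo "/var/log/vmkernel.log" ;;
        control)    echo "/var/log/netcpa.log" ;;
        management) echo "/var/log/vsfwd.log /var/log/netcpa.log" ;;
        *)          echo "unknown plane: $1" >&2; return 1 ;;
    esac
}
```

On a host this could feed tail directly, for example tail -n 50 $(logs_for_plane management).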
Distributed Firewall (DFW) issues
- NSX Manager log.
- VM support bundle: /var/log/vsfwd.log, /var/log/vmkernel.log.
- vCenter support bundle.