A Multi-Site vSphere/SDDC Upgrade – A Step-by-Step Guide

I wanted to share with you the steps required to upgrade a typical multi-site vSphere/SDDC implementation. This particular implementation had the following products:

 

| Product | Current version | Planned version |
|---|---|---|
| vCenter Server | 6.0 U3 | 6.5 U1 |
| ESXi | 6.0 U3 | 6.5 U1 |
| vSAN | 6.2 | 6.6.1 |
| NSX | 6.2.4 | 6.3.5 |
| vROPS | 6.3.0 | 6.6.1 |
| vRLI | 3.3.1 | 4.5 |
| vDP | 6.1.0.173 | 6.1.5 |
| vSphere Replication | 6.1.1 | 6.5.1 |
| VMware Tools | 6.0 U3 | 6.5 U1 |

 

To understand the upgrade order, here is some background information on this fictitious environment.

The environment has a Production site and a DR site. NSX is deployed across both sites using universal objects, SRM and vSphere Replication are in use, and vSAN is used locally at each site. There is also a deployment of vROPS and vRLI at both sites.

Note: the following should be taken as a rough guide only. Please make sure you check the appropriate VMware guides/KBs for your product versions before commencing any upgrade.

 

| Step | Action | Site | Impact to vSphere | Required | VM downtime |
|---|---|---|---|---|---|
| 1 | Carry out a health check of VC & PSC before starting (Go or No Go) | Both sites | N/A | Yes | No |
| 2 | Back up all components | Both sites | N/A | Yes | No |
| 3 | Back up PSCs with vDP & Data Protect | Both sites | N/A | Yes | No |
| 4 | Deploy a second PSC at each site and configure replication (see KB 2131191 for justification) | Both sites | vCenter management of ESXi hosts unavailable during upgrade | Yes | No |
| 5 | Disable vSphere Replication | Both sites | No protection of VMs (backup is the only method of restoring) | Yes | No |
| 6 | Upgrade the external Platform Services Controller servers from 6.0.x to 6.5 at both sites | Both sites | vCenter management of ESXi hosts unavailable during upgrade | Yes (if using an external Platform Services Controller) | No |
| 7 | Upgrade vDP at both sites | Both sites | Backups unavailable during upgrade | Yes, in order to restore into 6.5 during the upgrade | No |
| 8 | Upgrade NSX Manager at the DR site | DR site | No changes to DR NSX during upgrade | Yes | No |
| 9 | Upgrade the NSX Controller cluster at the DR site | DR site | NSX reverts to read-only mode; change window required | Yes | Yes – no disruption as long as VMs don't move and no changes are made |
| 10 | Upgrade NSX host preparation at the DR site | DR site | Hosts require a reboot | Yes | No |
| 11 | Upgrade NSX DLRs at the DR site | DR site | Disruption to service | Yes | Yes |
| 12 | Upgrade NSX Edges at the DR site | DR site | Outage required while each edge is redeployed and upgraded | Yes | Yes |
| 13 | Upgrade vCenter from 6.0.x to 6.5 at the DR site | DR site | vCenter management of ESXi hosts unavailable during upgrade | Yes | No |
| 14 | Upgrade vROPS | Both sites | N/A | Yes | No |
| 15 | Upgrade Log Insight | Both sites | N/A | Yes | No |
| 16 | Use vSphere Update Manager to scan and remediate an ESXi host | DR site | One host unavailable; reduced capacity | Yes | No |
| 17 | Repeat for the remaining hosts in the cluster | DR site | One host will always be unavailable while it is being upgraded with vSphere Update Manager | Yes | No |
| 18 | Upgrade vSAN at the DR site | DR site | Possible performance impact and increased risk of failure during the upgrade, which could take several days | Yes | No |
| 19 | Upgrade NSX Manager at the Prod site | Prod site | No changes to Prod NSX during upgrade | Yes | No |
| 20 | Upgrade the NSX Controller cluster at the Prod site | Prod site | NSX reverts to read-only mode; change window required | Yes | Yes – no disruption as long as VMs don't move and no changes are made |
| 21 | Upgrade NSX host preparation at the Prod site | Prod site | Hosts require a reboot | Yes | No |
| 22 | Upgrade NSX DLRs at the Prod site | Prod site | Disruption to service | Yes | Yes |
| 23 | Upgrade NSX Edges at the Prod site | Prod site | Outage required while each edge is redeployed and upgraded | Yes | Yes |
| 24 | Upgrade vCenter from 6.0.x to 6.5 at the Prod site | Prod site | vCenter management of ESXi hosts unavailable during upgrade | Yes | No |
| 25 | Upgrade vSphere Replication at both sites | Both sites | Replication unavailable until this point; it had to be disabled before starting the upgrade process | Yes | No vSphere Replication protection during upgrade |
| 26 | Use vSphere Update Manager to scan and remediate an ESXi host | Prod site | One host unavailable; reduced capacity | Yes | No |
| 27 | Repeat for the remaining hosts in the cluster | Prod site | One host will always be unavailable while it is being upgraded with vSphere Update Manager | Yes | No |
| 28 | Upgrade vSAN at the Prod site | Prod site | Possible performance impact and increased risk of failure during the upgrade, which could take several days | Yes | No |
| 29 | Optional: update VMware Tools on each VM with a vSphere Update Manager baseline | Prod site | Virtual machine reboot | Recommended | Reboot |
| 30 | Optional: update virtual hardware on each VM with a vSphere Update Manager baseline | Prod site | Virtual machine shutdown, 1+ reboots | No | Yes; upgrade during shutdown |

 

Further Upgrade Information

vSphere

 

  1. Upgrade all external Platform Services Controller servers running 6.0.x to 6.5.
  2. Upgrade vCenter Server 6.0.x to vCenter 6.5.

This step includes the upgrade of all components and installation of new components that were not previously addressed.

During the upgrade, vCenter Server will be unavailable to perform any provisioning operations or functions such as vSphere vMotion and vSphere DRS. Once started, the vCenter database is upgraded first, followed by the vCenter binary upgrade. After the upgrade, the hosts are automatically reconnected.

  3. Upgrade vSphere Update Manager (Windows only; on vSphere 6.5, Update Manager is included with the vCenter Server Appliance). As with vCenter, the database is upgraded first, followed by the binaries. With vSphere 6.5, vSphere Update Manager is configured and administered through the vSphere Web Client.
  4. Use vSphere Update Manager to create a baseline, scan the hosts, and then upgrade the ESXi hosts to version 6.5. A copy of the ESXi 6.5 installation media must be available and added to vSphere Update Manager.
  5. Use vSphere Update Manager to create a baseline for VMware Tools, scan virtual machines, and remediate to upgrade VMware Tools.
  6. Upgrade VMware virtual machine hardware to version 13 (if not already completed previously) to take advantage of new features in vSphere 6.5.
    • This step should not be performed until all hosts have been upgraded to ESXi 6.5, because virtual machine hardware version 13 cannot run on older hosts.
  7. If applicable, address VMFS versions on any ESXi host that has already been upgraded but is not yet running the latest VMFS (a quick way to check is shown after this list).
    • Only VMFS3 and above can be upgraded to VMFS5 on an ESXi 6.x host. Verify that older volumes are upgraded to at least VMFS3 before upgrading to ESXi 6.x.
    • VMFS5 cannot be upgraded in place to VMFS6; instead, create a new VMFS6 datastore and migrate virtual machines to it.
    • Once a VMFS file system upgrade has been completed, it is not possible to roll back the change. Therefore, hosts not running ESXi 5.x or 6.x will be unable to read these volumes.
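A quick way to confirm which VMFS version each datastore is running is from the host shell; a minimal sketch, where the datastore name is illustrative:

# List mounted filesystems with their type and VMFS version
esxcli storage filesystem list

# Show detailed attributes, including the exact VMFS version, for one datastore
vmkfstools -Ph /vmfs/volumes/datastore01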

 

 

vSAN

 

Before starting the vSAN upgrade process, ensure that the following requirements are met (the RVC checks sketched after this list cover most of them):

  1. The vSphere environment is up to date:
    • The vCenter Server managing the hosts must be at an equal or higher patch level than the ESXi hosts it manages.
    • All hosts should be running the same build of ESXi before the vSAN cluster upgrade is started.
      • If the ESXi host versions are not matched, the hosts should be patched to the same build before upgrading.
  2. All vSAN disks should be healthy:
    • No disk should be failed or absent.
    • This can be determined via the vSAN Disk Management view in the vSphere Web Client.
  3. There should be no inaccessible vSAN objects:
    • This can be verified with the vSAN Health Service in vSAN 6.0 and above, or with the Ruby vSphere Console (RVC) in all releases.
  4. There should not be any active resync at the start of the upgrade process.
    • Some resync activity is expected during the upgrade process, as data needs to be synchronized following host reboots.
  5. Ensure that there are no known compatibility issues between your current vSAN version and the desired target vSAN version. For information on upgrade requirements, see vSAN upgrade requirements (2145248).
    • If required, update the vSAN cluster to the required build before undertaking the upgrade process to avoid compatibility concerns.
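Where the vSAN Health Service is not available, the checks above can be approximated from RVC on the vCenter Server; a minimal sketch, where the cluster path is illustrative:

# Check for inaccessible or invalid vSAN objects (requirement 3)
vsan.check_state /localhost/<datacenter>/computers/<cluster>

# Review physical disk state and usage across the cluster (requirement 2)
vsan.disks_stats /localhost/<datacenter>/computers/<cluster>

# Confirm there is no resync in flight before starting (requirement 4)
vsan.resync_dashboard /localhost/<datacenter>/computers/<cluster>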

Host Preparation

Ensure you choose the right maintenance mode option. When you move a host into maintenance mode in vSAN, you have three options to choose from (the sketch after this list shows the equivalent host-shell command):

  • Ensure availability
    If you select Ensure availability, vSAN allows you to move the host into maintenance mode faster than Full data migration and keeps the virtual machines in the environment accessible.
  • Full data migration
    If you select Full data migration, vSAN evacuates all of the host's data to other hosts in the cluster before the host enters maintenance mode; this is the slowest but safest option.
  • No data migration
    If you select No data migration, vSAN does not evacuate any data from this host. If you power off or remove the host from the cluster, some virtual machines might become inaccessible.
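The same three choices map to the vSAN data-evacuation mode of esxcli; a minimal sketch, assuming the ESXi 6.x syntax:

# Enter maintenance mode while keeping vSAN objects accessible ("Ensure availability")
esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility

# Other --vsanmode values: evacuateAllData ("Full data migration") and noAction ("No data migration")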

Exit maintenance mode and resync

  • When the ESXi host has been upgraded and moved out of maintenance mode, a resync will occur. You can watch this through the Web Client, or through RVC as sketched below.
  • Ensure this is complete before moving on to the next host. A resync occurs because the host that has been updated can now contribute to the vSAN datastore again. It is vital to wait until this resync is complete to ensure there is no data loss.
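A minimal RVC sketch for watching that resync (cluster path illustrative):

# Shows bytes left to resync per object; move to the next host only when nothing is outstanding
vsan.resync_dashboard /localhost/<datacenter>/computers/<cluster>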

 

 

NSX

 

For detailed information please refer to the NSX Upgrade Guide: https://docs.vmware.com/en/VMware-NSX-for-vSphere/6.3/nsx_63_upgrade.pdf

 

vROPS

 

For detailed information please refer to the vROPS Upgrade guide.

https://docs.vmware.com/en/vRealize-Operations-Manager/6.6/vrealize-operations-manager-66-vapp-deploy-guide.pdf

 

vRLI

 

For detailed information please refer to the vRLI Upgrade guide.

https://docs.vmware.com/en/vRealize-Log-Insight/4.5/com.vmware.log-insight.administration.doc/GUID-4F6ACCE8-73F4-4380-80DE-19885F96FECB.html

 

vDP

 

For detailed information please refer to the vDP Upgrade guide.

https://docs.vmware.com/en/VMware-vSphere/6.5/vmware-data-protection-administration-guide-61.pdf

 

 

vRA 7 & NSX 6.3 – The Security Tag Gotcha!

Let's assume you're deploying a vRA 7.x blueprint into an environment where NSX 6.3.x has been deployed and the DFW default rule is set to deny. During provisioning, the vRA VMs will of course need firewall access to services such as Active Directory and DNS in order to customise successfully, and herein lies the problem.

You might assume you could create your security policies and security groups as normal and simply include the security tag within the blueprint to grant access to these services. However, vRA won't assign the security tag until after the machine has finished customizing. That creates a potential issue: with the default DFW rule set to deny, the VM won't have access to the network resources it needs, such as AD and DNS, to finish customizing successfully.

To design around this, consider placing some shared-services rules at the top of the DFW rule table that allow services such as Active Directory and DNS to reach these VMs; this gives the vRA VMs the network access they need to finish deploying and customizing successfully. You could achieve this in a number of ways, for example by creating a security group based on an OS name of "Windows" and a VM name matching the naming convention of your vRA VMs. As soon as vRA creates the VM object, it is placed in the shared-services security group and given the correct access; you can then layer in additional services using security tags as originally intended. A sketch of such a dynamic security group follows.
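A rough sketch of that approach through the NSX for vSphere API – the group name, VM-name prefix and exact criteria keys here are assumptions to validate against the API guide for your NSX version:

URL: https://<nsx-mgr-ip>/api/2.0/services/securitygroup/bulk/globalroot-0

HTTP Method: POST

Body:

<securitygroup>
  <name>sg-shared-services-vra</name>
  <description>Windows VMs created by vRA; grants AD/DNS access while customizing</description>
  <dynamicMemberDefinition>
    <dynamicSet>
      <operator>OR</operator>
      <!-- Match Windows guests whose VM name carries the vRA naming prefix -->
      <dynamicCriteria>
        <operator>AND</operator>
        <key>VM.GUEST_OS_FULL_NAME</key>
        <criteria>contains</criteria>
        <value>Windows</value>
      </dynamicCriteria>
      <dynamicCriteria>
        <operator>AND</operator>
        <key>VM.NAME</key>
        <criteria>starts_with</criteria>
        <value>vra-</value>
      </dynamicCriteria>
    </dynamicSet>
  </dynamicMemberDefinition>
</securitygroup>

A DFW rule near the top of the table can then allow AD and DNS from this group, and the security tags applied after customization layer the application-specific access on top.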

vRealize Log Insight – Tracking SSH Logins to Edge Devices

Here is an example of a really simple but cool query that can be set up in vRealize Log Insight to track accepted and failed SSH logins to Edge devices.

Query (match ALL of the following):

  • appname contains "sshd"
  • text contains "failed password" (change this to "accepted password" to track accepted logins)
  • hostname contains "hostname"

 

vDS is NOT a pre-requisite for NSX Guest Introspection

This seems to be a common misconception among both customers and third-party vendors. The vDS is also not licensed as part of NSX under the "NSX for Endpoint" license. A normal standard vSwitch (VSS) is therefore fully supported for deploying NSX Guest Introspection for use with third-party solutions, e.g. for anti-virus.

Please refer to the NSX 6.x Installation Guide on how to use the "Specified on host" option when deploying Guest Introspection.

 

NSX Troubleshooting Commands

Initial Troubleshooting – start with the basics: the transport network and the control plane.

Identifying Controller Deployment Issues:

  • View the vCenter Tasks and Events logs during deployment.

Verify connectivity from NSX Manager to vCenter:

  • Run ping x.x.x.x, show arp, show ip route, or debug connection <IP-or hostname> on NSX Manager to verify network connectivity.
  • On NSX Manager, run show log follow, then connect to vCenter and look in the log for connection failures.
  • Verify the network configuration on NSX Manager by running show running-config.

Identify common EAM issues:

  • Check vSphere ESX Agent Manager for errors:
  1. vCenter home > vCenter Solutions Manager > vSphere ESX Agent Manager.
  2. Check the status of agencies prefixed with _VCNS_153.
  3. Log in as root to the ESXi host that is experiencing the issue and run tail /var/log/esxupdate.log.

NOTE: You can also access the Managed Object Browser by accessing address:
https://<VCIP or hostname>/eam/mob/

UWA’s – vsfwd or netcpa – not functioning correctly? This manifest itself as firewall showing a bad status or the control plane between hypervisor(s) and the controllers being down.

  • To find a fault in the messaging infrastructure, check whether Messaging Infrastructure is reported as down for the host in NSX Manager System Events.
  • More than one ESXi host affected? Check the message bus service in the NSX Manager appliance web UI under the Summary tab.
  • If RabbitMQ is stopped, restart it.

Common Deployment Issues:

1. Connecting NSX to vCenter

  • DNS/NTP incorrectly configured on NSX Manager and vCenter.
  • User account without vCenter Administrator role used to connect NSX Manager to vCenter.
  • No network connectivity between NSX Manager and vCenter Server.
  • User logging into vCenter with an account with no role on NSX Manager.

2. Controller Deployment

  • Insufficient resources available to host the controllers.
  • NTP on ESXi hosts and NSX Manager not in sync.
  • IP connectivity issues between NSX Manager and the NSX Controllers.

3. Host Preparation

  • EAM fails to deploy VIBs because of misconfigured DNS on the ESXi hosts.
  • EAM fails to deploy VIBs because a firewall is blocking required ports between ESXi, NSX Manager and vCenter.
  • A previous VIB of an older version is already installed, requiring user intervention to reboot the hosts.

4. VXLAN

  • Incorrect teaming method selected for VXLAN.
  • VTEPs do not have connectivity between each other.
  • DHCP selected to assign VTEP IPs but unavailable.
  • If a vmknic has a bad IP, configure it manually via vCenter.
  • MTU on the transport network not set to 1600 bytes or greater (a quick test is sketched below).
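The MTU point is quick to validate with a don't-fragment ping between VTEPs: 1572 bytes of ICMP payload plus 28 bytes of IP/ICMP headers exercises a full 1600-byte packet. A minimal sketch, where the vmkernel interface and address are illustrative:

# From one host, ping a remote VTEP via the VXLAN netstack without fragmentation
ping ++netstack=vxlan -d -s 1572 -I vmk1 <remote-vtep-ip>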

CLI

NSX Controller CLI VXLAN Commands:

  • show control-cluster logical-switches vni <vni>
  • show control-cluster logical-switches connection-table <vni>
  • show control-cluster logical-switches vtep-table <vni>
  • show control-cluster logical-switches mac-table <vni>
  • show control-cluster logical-switches arp-table <vni>
  • show control-cluster logical-switches vni-stats <vni>

NSX Controller CLI cluster status and health:

  • show control-cluster status
  • show control-cluster startup-nodes
  • show control-cluster roles
  • show control-cluster connections
  • show control-cluster core stats
  • show network <arg>
  • show log cloudnet/cloudnet_java-vnet-controller.<start-time-stamp>.log
  • sync control-cluster <arg>

VXLAN namespace for esxcli (a worked example follows this list):

  • esxcli network vswitch dvs vmware vxlan list
  • esxcli network vswitch dvs vmware vxlan network list --vds-name=<vds>
  • esxcli network vswitch dvs vmware vxlan network mac list --vds-name=<vds> --vxlan-id=<vni>
  • esxcli network vswitch dvs vmware vxlan network arp list --vds-name=<vds> --vxlan-id=<vni>
  • esxcli network vswitch dvs vmware vxlan network port list --vds-name=<vds> --vxlan-id=<vni>
  • esxcli network vswitch dvs vmware vxlan network stats list --vds-name=<vds> --vxlan-id=<vni>
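As a worked example of pairing the host and controller views, assuming a logical switch with VNI 5001 on a VDS named dvs-compute (both illustrative):

# On the ESXi host: MAC addresses this host has learned for the VNI
esxcli network vswitch dvs vmware vxlan network mac list --vds-name=dvs-compute --vxlan-id=5001

# On an NSX Controller: the central MAC table for the same VNI – the entries should agree
show control-cluster logical-switches mac-table 5001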

Troubleshooting Components – Understand the component interactions to narrow the problem focus

Controller Issues:

No connectivity for new VMs; increased BUM traffic (ARP cache misses).

General NSX Controller troubleshooting steps:

  • Verify Controller cluster status and roles.
  • Verify Controller node network connectivity.
  • Check Controller API service.
  • Validate VXLAN and Logical Router mapping table entries to ensure they are consistent.
  • Review source and destination netcpa logs and CLI to determine control plane connectivity issues between ESXi hosts and NSX Controller.

Example:

Verify VTEPs have sent network information to Controllers.

On Controller:

  • show control-cluster logical-switches vni <vni.no>
  • show control-cluster logical-switches vtep-table <vni.no>

User World Agent (UWA) issues:

  • Logical switching not functioning (netcpa).
  • Firewall rules not being updated (vsfwd).

General UWA troubleshooting steps:

  • Start the UWAs if they are not running:
    /etc/init.d/netcpad [status|start]
    /etc/init.d/vShield-Stateful-Firewall [status|start]
  • Tail the netcpa logs: /var/log/netcpa.log
  • Tail the vsfwd logs: /var/log/vsfwd.log

Check whether the UWAs are connected to NSX Manager and the Controllers:

esxcli network ip connection list | grep 5671 (message bus TCP connection)

esxcli network ip connection list | grep 1234 (Controller TCP connection)

Check that the configuration file /etc/vmware/netcpa/config-by-vsm.xml on the ESXi host matches the settings under UserVars/Rmq* (in particular UserVars/RmqIpAddress).

The UserVars needed for the message bus are currently: RmqClientPeerName, RmqHostId, RmqClientResponseQueue, RmqClientExchange, RmqSslCertSha1ThumbprintBase64, RmqHostVer, RmqClientId, RmqClientToken, RmqClientRequestQueue, RmqVsmExchange, RmqPort, RmqVsmRequestQueue, RmqVHost, RmqPassword, RmqUsername, RmqIpAddress.
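One way to read those values on a host is through the advanced settings namespace of esxcli; a minimal sketch, assuming the option paths match the UserVars names above:

# Show the RabbitMQ broker address this host is configured to use
esxcli system settings advanced list --option=/UserVars/RmqIpAddress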

NVS (NSX vSwitch) issues – limited/intermittent connectivity for VMs on the same logical switch.

General VXLAN troubleshooting steps:

  • Check for incorrect MTU on Physical Network for VXLAN traffic.
  • Incorrect IP route configured during VXLAN configuration.
  • Physical network connectivity issues.

Verify connectivity between VTEPs:

  • Ping from the VXLAN dedicated TCP/IP stack: ping ++netstack=vxlan -I vmk1 <ip address>
  • View the routing table of the VXLAN dedicated TCP/IP stack: esxcli network ip route ipv4 list -N vxlan
  • If ping succeeds between VMs but TCP seems intermittent, a likely reason is an incorrect MTU on the physical network or on the vSphere VDS. Check the MTU configured on the VDS in use by VXLAN.

Verify VXLAN components:

  • Verify the VXLAN VIB is installed and at the correct version: esxcli software vib get --vibname esx-vxlan
  • Verify the VXLAN kernel module vdl2 is loaded: vmkload_mod -l | grep vdl2 (logs to /var/log/vmkernel.log – prefix VXLAN)
  • Verify the control plane is up and active for a logical switch on the ESXi host: esxcli network vswitch dvs vmware vxlan network list --vds-name=<vds>
  • Verify VM information – check that the ESXi host has learned the MAC addresses of remote VMs: esxcli network vswitch dvs vmware vxlan network mac list --vds-name=<vds> --vxlan-id=<vni>
  • List active ports and the VTEP they are mapped to: esxcli network vswitch dvs vmware vxlan network port list --vds-name=<vds> --vxlan-id=<vni>
  • Verify the host has locally cached ARP entries for remote VMs: esxcli network vswitch dvs vmware vxlan network arp list --vds-name=<vds> --vxlan-id=<vni>

Distributed Firewall issues – Flow Monitoring provides vNIC-level visibility of VM traffic flows.

  • Detailed flow data for both Allow and Block flows.
  • Global flow collection is disabled by default – click the Enable button.
  • By default NSX excludes its own VMs: NSX Manager, Controllers and Edges.

Note: Add a VM to the Exclusion List to remove it from the DFW. This allows you to determine whether a problem is DFW-related: if the problem persists, it is not the DFW.

  • In the vSphere Web Client, go to Networking & Security and select the NSX Manager. From Manage > Exclusion List, click the + and select the VM.
  • Verify the DFW VIB (esx-vsip) is installed on the ESXi host: esxcli software vib list | grep vsip
  • Verify the DFW kernel module is loaded on the ESXi host: vmkload_mod -l | grep vsip
  • Verify the vsfwd service daemon is running on the ESXi host: ps | grep vsfwd
  • To start/stop the vsfwd daemon on the ESXi host: /etc/init.d/vShield-Stateful-Firewall [stop|start|status|restart] (a sketch for inspecting the rules programmed on a vNIC follows this list)
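Once the VIB and daemon checks pass, the rules actually programmed on a VM's vNIC can be inspected from the host shell; the filter name below is illustrative output from the first command:

# Find the dvfilter attached to the VM's vNIC
summarize-dvfilter | grep -A 3 <vm-name>

# Dump the DFW rules applied to that filter
vsipioctl getrules -f nic-12345-eth0-vmware-sfw.2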

LOGS

NSX Manager Log (collected via WEB UI)

  • Select Download Tech Support Log

Edge (VDR/ESG) Log (collected via WEB UI)

  • From the vSphere Web Client, right-click the vCenter Server and select All vCenter Actions -> Export System Logs.
  • You can also generate ESXi host logs from the ESXi host CLI: vm-support.

NSX Controller Logs

  • From the vSphere Web Client, go to Networking & Security -> Installation -> NSX Controller Nodes.
  • Select the Controller and select Download Tech Support logs.

Issues and Corresponding Logs

Installation/upgrade related issues

  • NSX Manager Log.
  • vCenter Support Bundle: /var/log/vmware/vpx/EAM.log and /var/log/esxupdate.log.

VDR issues

  • NSX Manager log.
  • VDR log from the affected VDR.
  • VM support bundle: /var/log/netcpa.log and /var/log/vsfwd.log.
  • Controller logs.

Edge Services Gateway issues

  • NSX Manager log.
  • Edge log for the affected ESG.

NSX Manager issues

  • NSX Manager log.

VXLAN/Controller/Logical Switch

  • NSX Manager log.
  • vCenter Support bundle.
  • VM support bundle.

VXLAN data plane: /var/log/vmkernel.log.

VXLAN control plane: /var/log/netcpa.log.

Management plane: /var/log/vsfwd.log and /var/log/netcpa.log.

Distributed Firewall (DFW) issues

  • NSX Manager log.
  • VM support bundle: /var/log/vsfwd.log, /var/log/vmkernel.log.
  • VC support bundle.

vCNS to NSX Upgrades – T+1 Post-Upgrade Steps


After the upgrade, do the following:

  1. Delete the snapshot of the NSX Manager taken before the upgrade.
  2. Create a current backup of the NSX Manager after the upgrade.
  3. Check that the VIBs have been installed on the hosts. NSX installs these VIBs:

esxcli software vib get --vibname esx-vxlan

esxcli software vib get --vibname esx-vsip

  4. If Guest Introspection has been installed, also check that this VIB is present on the hosts:

esxcli software vib get --vibname epsec-mux

  5. Resynchronize the host message bus. VMware advises that all customers perform a resync after an upgrade. You can use the following API call to perform the resynchronization on each host:

URL: https://<nsx-mgr-ip>/api/4.0/firewall/forceSync/<host-id>

HTTP Method: POST

Headers:

Authorization: base64-encoded value of username:password

Accept: application/xml

Content-Type: application/xml
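A hedged example of issuing that call with curl – the credentials and host ID are illustrative (host IDs take the form host-123 and can be found in the vCenter MOB):

# Force a message bus resync for one host
curl -k -u 'admin:<password>' -X POST \
  -H 'Accept: application/xml' -H 'Content-Type: application/xml' \
  'https://<nsx-mgr-ip>/api/4.0/firewall/forceSync/host-123'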

vCNS to NSX Upgrades – vShield Endpoint to NSX Guest Introspection


  1. In the Installation tab, click Service Deployments. The Installation Status column says Upgrade Available.
  2. Select the Guest Introspection deployment that you want to upgrade. The Upgrade icon in the toolbar above the services table is enabled.
  3. Click the Upgrade icon and follow the UI prompts.

After Guest Introspection is upgraded, the installation status is Succeeded and the service status is Up. Guest Introspection service virtual machines are visible in the vCenter Server inventory.


vCNS to NSX Upgrades – vShield Edges to NSX Edges

vShield Edge to NSX Edge Upgrade Steps

  1. In the vSphere Web Client, select Networking & Security > NSX Edges.
  2. For each NSX Edge instance, double-click the edge and check the following configuration settings before upgrading:
    1. Click Manage > VPN > L2 VPN and check whether L2 VPN is enabled. If it is, take note of the configuration details and then delete all L2 VPN configuration.
    2. Click Manage > Routing > Static Routes and check whether any static routes are missing a next hop. If they are, add the next hop before upgrading the NSX Edge.
  3. For each NSX Edge instance, select Upgrade Version from the Actions menu.

After an NSX Edge is upgraded successfully, the Status is Deployed and the Version column displays the new NSX version. If the upgrade fails with the error message "Failed to deploy edge appliance", make sure that the host on which the NSX Edge appliance is deployed is connected and not in maintenance mode.

  4. If an Edge fails to upgrade and does not roll back to the old version, click the Redeploy NSX Edge icon and then retry the upgrade.