vSphere 6.7 – QuickBoot – A Must See Feature!

I must admit to not knowing about this until I stumbled upon it: vSphere 6.7 has added a new feature called Quick Boot.

Quick Boot enables the ESXi host to restart straight into the ESXi loading screen, skipping the initial POST and memory test screens. This is particularly cool when you are doing ESXi updates, as the reboot time is dramatically shortened.

So how does it work?

  • When a reboot is necessary, devices are shut down and the hypervisor restarts
  • Hardware initialization and memory tests are not performed
  • Improves host availability and shortens maintenance windows

Requirements

  • Supported server hardware (currently a short list of Dell and HPE systems)
  • Native device drivers only – no vmklinux driver support
  • Secure boot not supported

How to enable it?

Quick Boot is primarily intended to benefit Update Manager workflows, so there are stringent compatibility checks, driven by the following files:

  • /usr/lib/vmware/loadesx/whitelist.txt & platforms.txt

It is also possible to check compatibility and enable Quick Boot from the ESXi console:

  • /usr/lib/vmware/loadesx/bin/loadESXCheckCompat.py
  • loadESXEnable -e

To test whether the last reboot was a Quick Boot or a normal reboot:

  • vsish -e get /system/loadESX/isBootLoadESX

Additional troubleshooting info can be found in the logs

  • /var/log/loadESX.log
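
Putting those pieces together, a minimal check-and-enable sequence from the ESXi shell might look something like the following. This is a sketch based on the commands above: the compatibility script output varies by hardware, and the enable step should only be run if the check passes.

# /usr/lib/vmware/loadesx/bin/loadESXCheckCompat.py

# loadESXEnable -e

# reboot

# vsish -e get /system/loadESX/isBootLoadESX

After the host comes back up, the vsish query confirms whether the last reboot actually used Quick Boot.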

NSX Edge becomes unmanageable after upgrading to NSX 6.2.3

If you are in the process of upgrading to NSX 6.2.3 and hit the above issue, have a look at the following KB:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2145887

This issue occurs when serverSsl or clientSsl is configured on the load balancer but the ciphers value was set to NULL in the previous version.

VMware NSX for vSphere 6.2.3 introduced an approved ciphers list in NSX Manager and does not allow the ciphers value to be NULL for serverSsl and clientSsl in the load balancer.

Note: Default ciphers value is NULL in NSX 6.2.x.

Since the ciphers value defaults to NULL in earlier versions, if it has not been explicitly set, NSX Manager 6.2.3 considers the value invalid, which is why this issue occurs.

There is a workaround in the above KB which involves making a POST API call to change the ciphers value from NULL to Default.
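
As a reference point before applying the fix, you can pull the current load balancer configuration (including the clientSsl/serverSsl sections and their ciphers value) from the NSX Manager REST API. This is only a sketch: the manager address, credentials and edge ID below are placeholders, and the exact POST call and body for the workaround itself are documented in the KB.

# curl -k -u 'admin:<password>' -X GET https://<nsx-manager>/api/4.0/edges/<edge-id>/loadbalancer/config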

SRM 6.1.1 Released

What’s New in Site Recovery Manager 6.1.1

So SRM 6.1.1 was released yesterday and I thought I would share with you what's new!

VMware Site Recovery Manager 6.1.1 delivers new bug fixes described in the Resolved Issues section.

VMware Site Recovery Manager 6.1.1 provides the following new features:

  • Support for VMware vSphere 6.0 update 2.
  • Support for Two-factor authentication with RSA SecurID for vCenter Server 6.0U2.
  • Support for Smart Card (Common Access Card) authentication for vCenter Server 6.0U2.
  • Site Recovery Manager 6.1.1 now supports the following external databases:
    • Microsoft SQL Server 2012 Service Pack 3
    • Microsoft SQL Server 2014 Service Pack 1
  • Localization support for the Spanish language.

See the release notes for further information.


NSX Design & Deploy – Course Notes Part 2

Following on from the last post, here are my remaining key points:

  • L2 Bridging – only a single vDS is supported
  • DLRs – don't place two DLR uplinks in the same transport zone
  • OSPF – consider setting the OSPF timers to more aggressive values if you have a requirement to do so. The lowest setting is Hello 1 sec / Dead 3 secs, and the timers must match on both sides, so set the physical router as well (note the physical device may limit how aggressive you can go). The protocol default is Hello 10 secs / Dead 40 secs.
  • The recommendation is to use anti-affinity rules to avoid deploying the DLR Control VM on the same ESXi host as an active ESG – so a single host failure can't take out both the DLR Control VM and the Edge holding the summary routes (if you have set them – you should have! 😉 )
  • When applying DFW rules on the cluster object, if you move VMs between clusters there is a few-second gap in protection while the new rules are applied – see the sketch after this list for how to check which rules a vNIC currently has
  • Security policy weights – the higher number takes priority, e.g. 2000 wins over 1000
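
Following on from the DFW point above, a quick way to verify which rules are actually applied to a VM's vNIC on a given ESXi host (for example straight after a cross-cluster move) is sketched below. The filter name is a placeholder – take the real "vmware-sfw" name for the vNIC in question from the summarize-dvfilter output.

# summarize-dvfilter

# vsipioctl getrules -f nic-XXXXXX-eth0-vmware-sfw.2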

NSX Design & Deploy – Course Notes

So I have just finished a week in Munich on an internal NSX design & deploy course which had some great content. I have made a few notes and wanted to very briefly blog those, mainly for my benefit but also in the hope they may be useful or interesting to others as well.

In no particular order:

  • vSphere 5.5 – when upgrading the vDS from 5.1 to 5.5 you need to switch the vDS to Advanced mode after the version upgrade. This enables features such as LACP and NIOC v3, and is a requirement for NSX to function correctly.
  • When considering multicast mode with VTEPs in different subnets, be aware that PIM is not included in the Cisco standard licence and requires the advanced licence! PIM is a requirement when using multiple VTEPs in different subnets.
  • When using dynamic routing with an HA-enabled DLR the total failover time could be up to 120 seconds, made up of:
    • 15 sec dead time (6 sec is the minimum, 9-12 is recommended; 6 is not recommended as it means pinging every 3 seconds)
    • claiming interfaces / starting services
    • ESXi update
    • route reconvergence
  • Consider setting a static route (summary route) with the correct weight, lower than the dynamic route, so traffic isn't disrupted during DLR failover
  • Using different subnets for virtual and physical workloads lets you have one summary route for the virtual world (for the above event – this is likely not possible in a brownfield deployment)
  • Note that during a DLR failover the routing adjacency is lost – the upstream routing table entry is removed during this period, which drops the route. This can be avoided by increasing the OSPF timers upstream so the neighbours don't lose the route.
  • If you fail to span the transport zone across all the applicable hosts that are part of the same vDS, the port groups will exist on those hosts but won't be able to communicate (no traffic flow if someone accidentally selects that port group) – a quick sanity check for a host's VXLAN setup is sketched below
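
On that last point, a quick way to sanity-check VXLAN on a given host – confirming the dedicated vxlan netstack exists and the VDS VXLAN configuration is in place, then testing VTEP-to-VTEP connectivity – is sketched below. The vmknic and target VTEP IP are placeholders.

# esxcli network ip netstack list

# esxcli network vswitch dvs vmware vxlan list

# ping ++netstack=vxlan -d -s 1572 -I vmk3 <remote-vtep-ip>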

Part two coming soon!

Don’t forget the logs

Scratch Partition

A design consideration when using ESXi hosts which have no local storage, or which boot from an SD/USB device smaller than 5.2GB, is: where should I place my scratch location?

You may be aware that if you use Auto Deploy, or an SD or USB device, to boot your hosts then the scratch partition runs on a ramdisk, which is non-persistent and therefore lost after a host reboot.

Consider configuring a dedicated NFS or VMFS volume to store the scratch logs, making sure it has the IO characteristics to handle this. You also need to create a separate folder for each ESXi host.

The scratch log location can be specified by configuring the following ESXi advanced setting.

  • ScratchConfig.ConfiguredScratchLocation
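
If you prefer to do this from the ESXi shell rather than the Web Client, something along these lines should work – the datastore and folder names are placeholders, the per-host folder must exist first, and the host needs a reboot before the new scratch location takes effect:

# mkdir /vmfs/volumes/<shared-datastore>/.locker-esxi01

# vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /vmfs/volumes/<shared-datastore>/.locker-esxi01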


Syslog Server

Specifying a remote syslog server allows ESXi to ship its logs to a remote location, which helps mitigate some of the issues we have just discussed. However, the location of the syslog server is just as important as remembering to configure it in the first place.

You don't want to place your syslog server in the same cluster as the ESXi hosts you're configuring it on, or on the same primary storage for that matter. Consider placing it inside your management cluster or on another cluster altogether.

The syslog location can be specified by configuring the following ESXi advanced setting.

  • Syslog.global.logHost
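
The same setting can also be applied from the ESXi shell with esxcli – the server address below is a placeholder, and remember to reload the syslog service and open the outbound syslog port on the host firewall:

# esxcli system syslog config set --loghost='udp://<syslog-server>:514'

# esxcli system syslog reload

# esxcli network firewall ruleset set -r syslog -e true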

Horizon Sizing Estimator

Something I discovered today which I am sure many partners will find useful is the Horizon Sizing Estimator tool. It is located at https://developercenter.vmware.com/group/dp/horizon-sizing-tool and is, I believe, available to partners only. You can use the free Lakeside assessment tool to generate the assessment data, either via the free cloud-based tool at http://assessment.vmware.com or the free on-site version.

Using this you can generate an XML file which is then fed into the sizing tool to help size your customer's environment. You can also manually specify a whole ton of other variables.


Useful NSX Troubleshooting Commands


Having just completed an NSX 6.2 ICM course I wanted to share with you some commonly used troubleshooting commands which Simon Reynolds shared with us.

ESXi Host CLI Commands

Here are some useful CLI examples to run in the ESXi shell for logical switch info:

# esxcli network vswitch dvs vmware vxlan network ….

# esxcli network vswitch dvs vmware vxlan network arp list --vds-name=Compute_VDS --vxlan-id=5001

# /bin/net-vdl2 -M arp -s Compute_VDS -n 5001

# esxcli network vswitch dvs vmware vxlan network mac list --vds-name=Compute_VDS --vxlan-id=5001

# /bin/net-vdl2 -M mac -s Compute_VDS -n 5001

# esxcli network vswitch dvs vmware vxlan network vtep list --vds-name=Compute_VDS --vxlan-id=5001

# /bin/net-vdl2 -M vtep -s Compute_VDS -n 5001

a) Show vDS info:

# net-dvs -l

b) Show the separate ip stack for vxlan

# esxcli network ip netstack list

c) Raise the netcpa logging level to verbose (logs ESXi to controller messages in more detail)

# /etc/init.d/netcpad stop

# chmod +wt /etc/vmware/netcpa/netcpa.xml

# vi /etc/vmware/netcpa/netcpa.xml

and change info to verbose between the <level> tags (i.e. <level>verbose</level>), then save the file and restart the netcpa daemon:

# /etc/init.d/netcpad start

d) Packet capture commands

# pktcap-uw --uplink vmnic2 -o unencap.pcap --dir=1 --stage=0

# tcpdump-uw -enr unencap.pcap

or

# pktcap-uw --uplink vmnic2 --dir=1 --stage=0 -o - | tcpdump-uw -enr -

(--dir=1 implies "outbound", --dir=0 implies "inbound"; --stage=0 implies "before the capture point", --stage=1 implies "after the capture point")

Show vxlan encapsulated frames:

# pktcap-uw --uplink vmnic2 --dir=1 --stage=1 -o - | tcpdump-uw -enr -

Show frames at a vm switchport connection:

# pktcap-uw -o - --switchport <switchport-id> --dir 1 | tcpdump-uw -enr -

(get vm port id from esxtop network view)


Test VXLAN connectivity between hosts (the don't-fragment 1572-byte payload plus 28 bytes of ICMP/IP headers validates a 1600-byte MTU across the transport network):

# ping ++netstack=vxlan -d -s 1572 -I vmk3 xxx.xxx.xxx.xxx


NSX Controller Commands:

# show control-cluster status

# show control-cluster logical-switches vni 5000

# show control-cluster logical-switches connection-table 5000

You need to be on the controller that manages the VNI for the next commands (the vni output above shows which controller owns it):

# show control-cluster logical-switches vtep-table 5001

# show control-cluster logical-switches mac-table 5001

# show control-cluster logical-switches arp-table 5001


To TPS, or not to TPS, that is the question?


Something that crops up again and again when discussing vSphere designs with customers is whether or not they should enable inter-VM TPS (Transparent Page Sharing), since at the end of 2014 VMware decided to disable it by default.

To understand whether you should enable TPS, you first need to understand what it does. TPS is quite a misunderstood beast: many people attribute a 20-30% memory overcommitment to TPS, when in reality it's not even close to that, because on modern hosts (where large pages are only broken down and shared under pressure) TPS only really comes into effect when the ESXi host is close to memory exhaustion. TPS is the first trick in the box that ESXi uses to try to reduce memory pressure on the host; if that isn't enough, the host starts ballooning, closely followed by compression and then finally swapping – none of which are cool, guys!

So should I enable (Inter-VM) TPS?… well as that great IT saying goes… it depends!

The reason VMware disabled inter-VM TPS in the first place was their stronger stance on security (their secure-by-default stance): the concern was that shared pages could be leveraged by one VM to gain access to data belonging to another under certain conditions. So, in a nutshell, you need to consider the risk of enabling inter-VM TPS. If you are running and offering a public cloud solution from your vSphere environment then it may be best to leave TPS disabled, as you could argue you are under a greater risk of attack and have a greater responsibility for your customers' data.

If, however, you are running a private vSphere compute environment, then the chances are only IT admins have access to the VMs, so the risk is much lower. Therefore, to reduce the chance of running into performance issues caused by ballooning and swapping, you may want to consider enabling inter-VM TPS, which helps mitigate both – see below for how this is controlled per host.
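
For reference, this behaviour is controlled per host by the Mem.ShareForceSalting advanced setting: the default of 2 restricts page sharing to within each VM, while 0 restores the old inter-VM sharing behaviour. A quick way to check and, once you have weighed up the risk, change it from the ESXi shell is sketched below – treat it as an example rather than a recommendation:

# esxcli system settings advanced list -o /Mem/ShareForceSalting

# esxcli system settings advanced set -o /Mem/ShareForceSalting -i 0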