About me: I am Abbed Sedkaoui. I have worked on VMware virtualization since GSX and ESX 3, and before that on Virtual Server and Virtual PC from Connectix, the company that also first made Virtual Game Station (VGS, a PlayStation emulator that fit on a 1.41MB floppy disk) back in 1998, all the way up to today's VMware Cloud Foundation (VCF), the infrastructure the latest VMware Cloud SDDC is based on.
In my view, "it" (the Cloud) all started around 2008 with the advent of AMD Nested Pages and then, in 2009, Intel Extended Page Tables in their processors; hardware-assisted virtualization became the trend for a lot of compute, much as it had for routers (VRF, Virtual Routing and Forwarding), firewalls (contexts), and switches (VSI, Virtual Switching Instance).
And happily for us labbers, since then we have had the ability to deploy end-to-end, fully virtualized infrastructure :) I have been following William Lam since around that time. Fast forward to 2023: having successfully deployed VCF, I am looking to certify VCP-VMC, as the required course is offered for FREE! Look at Required Step 2.
About this site: I'll share what worked for me when facing issues, and "the problem solving critical thinking mindset" (I know, it's a mouthful :) used to document root cause analysis. Please don't mind the rusticness of this site; I literally created it from scratch on AWS in a few hours.
03/29/2024 Tutorial Install VMware Cloud Director 10.5.1.1 and Create Provider Virtual Data Center (pVDC) backed by vSphere with Tanzu Kubernetes 8.0u2 and NSX-T 4.1.2
The product page vmware.com/products/cloud-director.html
First, download the OVA file; it's a 2GB download from VMware Customer Connect: VMware_Cloud_Director-10.5.1.11019-23401219_OVF10.ova
- Prerequisite: NFS Network Path and DNS records
- Deploy the OVF Template
- Primary Appliance Setup (VAMI or Virtual Appliance Management Interface is on https://$VCDIP:5480)
- Add Resources: vCenter Server from a vSphere with Tanzu Supervisor Cluster already deployed
- Add Resources: NSX-T Manager already deployed, and create a Geneve-backed Network Pool
- Create Provider VDC backed by vSphere with Tanzu Supervisor Cluster and NSX-T. This continues the Service Provider's tasks highlighted below, in the highly available multi-vSphere-zone Supervisor lab.
- Next Service Provider's Tasks: Create a provider VDC backed by a Supervisor Cluster, publish a Provider VDC Kubernetes Policy to an Organization VDC in VMware Cloud Director, and offer Kubernetes as a Service (CaaS).
Prerequisite: NFS Network Path and DNS records
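Before deploying the appliance, the NFS transfer share and the DNS records can be sanity-checked from the shell. This is a hedged sketch only: the hostname, IP, path, and export options below are lab assumptions, not official VCD requirements (check the VMware docs for the exact export options your version expects).

```shell
# Hypothetical lab values, adjust to your environment
VCD_FQDN="vcd.lab.local"        # assumed appliance FQDN
EXPORT_PATH="/data/transfer"    # assumed NFS path for the VCD transfer share

# Build the /etc/exports entry; no_root_squash is commonly needed so the
# appliance's root user can write to the share (assumption, verify per docs).
make_export_line() {
  printf '%s *(rw,sync,no_subtree_check,no_root_squash)\n' "$1"
}
make_export_line "$EXPORT_PATH"

# Forward and reverse DNS must agree before deploying the OVA:
#   nslookup "$VCD_FQDN"     # A record
#   nslookup 172.17.31.50    # PTR record (hypothetical IP)
```

If the forward and reverse lookups disagree, the appliance setup will complain later, so it is worth checking now.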
Deploy the OVF Template
Primary Appliance Setup (VAMI or Virtual Appliance Management Interface is on https://$VCDIP:5480)
Add Resources: vCenter backed by a vSphere with Tanzu Supervisor Cluster
Add Resources: NSX-T Manager and Create a Geneve-backed Network Pool
Create Provider VDC backed by vSphere with Tanzu Supervisor Cluster and NSX-T
For a more comprehensive approach on how to offer Kubernetes as a Service with VMware Cloud Director,
whether you're a VMware Cloud Provider Partner or just want the high-level view,
take a look at the latest Feature Friday on the subject:
Feature Friday Episode 144 - Kubernetes as a Service with Cloud Director
and download the whitepaper: Architecting Kubernetes-as-a-Service Offering with VMware Cloud Director
03/07/2024 This website is finally updated to HTTPS, for easier access for everyone.
02/23/2024 Honored to be part of the VMware vExpert community again in 2024!
https://vexpert.vmware.com/directory/10999
02/22/2024 Updated vSphere with Tanzu using NSX-T Automated Lab Deployment: enabling VLANs for Management (1731), T0 uplink (1751), TEP (301), VRF (1683) and Trunk (1683-1687), with traffic separation across 2 N-VDS
Fork Branch https://github.com/abbedsedk/vsphere-with-tanzu-nsxt-automated-lab-deployment/tree/vlan
Added 2 NSX switches (N-VDS):
- Tanzu-VDS1 MGMT(+EDGE UPLINK T0 Segment) "North-South" Traffic
- Tanzu-VDS2 Overlay "East-West" Traffic
Ref: 7.4.2.2 Multiple virtual switches as a requirement, NSX Reference Design Guide 4-1_v1.5.pdf, pp. 291-293
(Compliance such as PCI requiring separate dedicated infra components, cloud providers separating internal and external traffic, telco provider NFV standard and enhanced vswitch)
Migrate VMkernel vmk0 from the VSS to Tanzu-VDS1, then remove the old vSwitch0
2 Edge T0 interfaces (1 per edge) in Active-Active, scaling out up to 10, load balancing "North-South" traffic
2 TEPs per ESXi, 2 TEPs per Edge x 2 Edges = 2 x (2x2) = 8 tunnels for the bare minimum of 1 ESXi and 2 Edges, scaling out and load balancing "East-West" traffic
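The tunnel arithmetic above can be sketched as a one-liner, counting only host-to-edge tunnels as in the formula 2 x (2x2):

```shell
# Host-to-edge Geneve tunnels with a full TEP mesh:
# tunnels = host TEPs x (edge TEPs per edge x number of edges)
host_teps=2
edge_teps=2
edges=2
tunnels=$(( host_teps * edge_teps * edges ))
echo "$tunnels"   # 2 x (2x2) = 8
```

Each additional edge adds host_teps x edge_teps tunnels per ESXi, which is how the East-West load balancing scales out.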
Requirements:
3 VLANs, with 3 subinterface VLAN gateways on a virtual router (VLAN 1731 MGMT, VLAN 1751 EDGE UPLINK T0, VLAN 301 VTEPs),
2 edge nodes (T0 Active-Active),
a Trunk VLAN 4095 port group ("VMTRUNK") and
a NestedVM Mgmt VLAN 1731 port group ("1731-Network"),
1 VRF VLAN (1683) within a trunk VLAN range (1683-1687) (of up to 5 in a single Project), 1 subinterface VLAN gateway, and correspondingly at least 2 (max 10) T0 VRF interfaces (1 per edge)
Virtual Router VM:
- NIC1: outer ESXi "VM Network",
- NIC2: outer "VMTRUNK",
Interfaces, VLANs, MTU, Source NAT, DNS Forwarding, NTP, SSH,
Static Routes:
- Default Route,
- Route to Supervisor Namespace (10.244.0.0/23) via T0 interfaces (A/A) (172.17.51.121, 172.17.51.122),
- Routes to Supervisor Ingress (172.17.31.128/27) and Egress (172.17.31.160/27) via T1 (10.244.0.1) (A/S)
VyOS config inspired by the template from the VyOS Module for PowerCLI,
and by William Lam's blog post How to automate... Here.
vyos@vyos:~$ show configuration commands | strip-private
set interfaces ethernet eth0 address 'xxx.xxx.1.253/24'
set interfaces ethernet eth0 hw-id 'xx:xx:xx:xx:xx:d9'
set interfaces ethernet eth0 ipv6 address no-default-link-local
set interfaces ethernet eth1 hw-id 'xx:xx:xx:xx:xx:e3'
set interfaces ethernet eth1 ipv6 address no-default-link-local
set interfaces ethernet eth1 mtu '1700'
set interfaces ethernet eth1 vif 301 address 'xxx.xxx.1.253/24'
set interfaces ethernet eth1 vif 301 description 'VLAN 301 for HOST/EDGE VTEP with MTU 1700'
set interfaces ethernet eth1 vif 301 ipv6 address no-default-link-local
set interfaces ethernet eth1 vif 301 mtu '1700'
set interfaces ethernet eth1 vif 1683 address 'xxx.xxx.3.253/24'
set interfaces ethernet eth1 vif 1683 description 'VLAN 1683 for EDGE UPLINK T0 VRF'
set interfaces ethernet eth1 vif 1683 ipv6 address no-default-link-local
set interfaces ethernet eth1 vif 1731 address 'xxx.xxx.31.253/24'
set interfaces ethernet eth1 vif 1731 description 'VLAN 1731 for MGMT'
set interfaces ethernet eth1 vif 1731 ipv6 address no-default-link-local
set interfaces ethernet eth1 vif 1751 address 'xxx.xxx.51.253/24'
set interfaces ethernet eth1 vif 1751 description 'VLAN 1751 for EDGE UPLINK T0'
set interfaces ethernet eth1 vif 1751 ipv6 address no-default-link-local
set nat source rule 1 outbound-interface name 'eth0'
set nat source rule 1 source address 'xxx.xxx.31.0/24'
set nat source rule 1 translation address 'masquerade'
set nat source rule 2 outbound-interface name 'eth0'
set nat source rule 2 source address 'xxx.xxx.51.0/24'
set nat source rule 2 translation address 'masquerade'
set nat source rule 3 outbound-interface name 'eth0'
set nat source rule 3 source address 'xxx.xxx.3.0/24'
set nat source rule 3 translation address 'masquerade'
set protocols static route xxx.xxx.0.0/0 next-hop xxx.xxx.1.x
set protocols static route xxx.xxx.0.0/23 next-hop xxx.xxx.51.121
set protocols static route xxx.xxx.0.0/23 next-hop xxx.xxx.51.122
set protocols static route xxx.xxx.31.128/27 next-hop xxx.xxx.51.121
set protocols static route xxx.xxx.31.128/27 next-hop xxx.xxx.51.122
set protocols static route xxx.xxx.31.160/27 next-hop xxx.xxx.51.121
set protocols static route xxx.xxx.31.160/27 next-hop xxx.xxx.51.122
set service dns forwarding allow-from 'xxx.xxx.0.0/0'
set service dns forwarding domain 3.168.192.in-addr.arpa. name-server xxx.xxx.1.100
set service dns forwarding domain 31.17.172.in-addr.arpa. name-server xxx.xxx.1.100
set service dns forwarding domain 51.17.172.in-addr.arpa. name-server xxx.xxx.1.100
set service dns forwarding listen-address 'xxx.xxx.31.253'
set service dns forwarding listen-address 'xxx.xxx.51.253'
set service dns forwarding listen-address 'xxx.xxx.3.253'
set service dns forwarding name-server xxx.xxx.8.8
set service dns forwarding name-server xxx.xxx.1.100
set service ntp allow-client xxxxxx 'xxx.xxx.0.0/0'
set service ntp allow-client xxxxxx '::/0'
set service ntp listen-address 'xxx.xxx.1.253'
set service ntp server xxxxx.tld
set service ssh port '22'
set system name-server 'xxx.xxx.1.100'
set system name-server 'xxx.xxx.8.8'
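With the TEP VLAN set to MTU 1700, the path can be validated end to end with an oversized do-not-fragment ping. A hedged sketch of the arithmetic (the peer TEP IP and the VyOS ping option names are assumptions; adjust to your VyOS version):

```shell
# Maximum ICMP payload that fits in a 1700-byte MTU:
# payload = MTU - 20 (IPv4 header) - 8 (ICMP header)
mtu=1700
payload=$(( mtu - 20 - 8 ))
echo "$payload"   # 1672

# Then, from the VyOS op mode (assumed syntax and peer TEP IP):
#   ping 172.16.1.1 size 1672 do-not-fragment
# If this fails while a smaller size works, something along the path
# is still at the default 1500 MTU.
```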
Multiple vApp Deployment - Prerequisite: rename any tanzu-vcsa-4 VM before redeploying!
2 nested ESXi nodes with 24GB of RAM are for testing purposes only; to allow a Tanzu Supervisor Cluster plus a Tanzu Kubernetes Cluster, at least 28GB of memory is needed.
02/22/2024 New Tanzu using NSX-T Automated Lab Deployment with a single Nested ESXi (28GB minimum) and Workload Enablement with a single SupervisorVM and single-replica deployments.
Editing the Workload Control Plane (wcp) file is for lab purposes only, where support is not needed, because this editing breaks support.
We will change the number of masters from 3 to 1, change the disk provisioning from "thick" to "thin", and restart the service.
ssh root@tanzu-vcsa-4
vi /etc/vmware/wcp/wcpsvc.yaml
minmasters: 1
maxmasters: 1
controlplane_vm_disk_provisioning: "thin"
:wq!
service-control --restart wcp
Next, let's go to Workload Management.
Further editing of files in the SupervisorVM is for lab purposes only, where support is not needed, because this editing breaks support.
We will change the number of replicas from 3 to 1, and from 2 to 1, for the deployments in the namespaces starting with "vmware-system-" or "kube-system".
ssh root@tanzu-vcsa-4
/usr/lib/vmware-wcp/decryptK8Pwd.py
ssh root@IP
kubectl get deployments -A
Reduce from 3 to 1 replica:
bash <(kubectl get deployments -A -o json | jq -r '.items[] | select(.metadata.namespace | (startswith("vmware-system-") or contains("kube-system"))) | select(.status.replicas == 3) | "kubectl scale deployments/\(.metadata.name) -n \(.metadata.namespace) --replicas=1"')
Reduce from 2 to 1 replica:
bash <(kubectl get deployments -A -o json | jq -r '.items[] | select(.metadata.namespace | (startswith("vmware-system-") or contains("kube-system"))) | select(.status.replicas == 2) | "kubectl scale deployments/\(.metadata.name) -n \(.metadata.namespace) --replicas=1"')
watch 'kubectl get deployments -A'
Since the deployments are created as the bits finish downloading, they appear in the watch, and we have to Ctrl+C back to the shell and up-arrow to re-run the scale-down commands.
This little babysitting task leaves fewer containers running, which is desirable in a lab with limited resources. Lastly, there is one deployment that needs to be edited to comment out its anti-affinity rule:
ssh root@tanzu-vcsa-4
/usr/lib/vmware-wcp/decryptK8Pwd.py
ssh root@IP
kubectl get deployments.apps -n vmware-system-registry -o yaml > vmware-registry-controller-manager.yaml
vi vmware-registry-controller-manager.yaml
kubectl apply -f vmware-registry-controller-manager.yaml
deployment.apps/vmware-registry-controller-manager configured
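For reference, the stanza to comment out in the exported YAML looks roughly like this. This is an illustration only: the field names follow the standard Kubernetes pod anti-affinity schema, but the exact labels and topology key in the VMware deployment may differ from what is shown here.

```yaml
# Illustrative fragment: comment out (or delete) the pod anti-affinity stanza
# so a single-replica deployment can schedule on a one-node lab Supervisor.
spec:
  template:
    spec:
#      affinity:
#        podAntiAffinity:
#          requiredDuringSchedulingIgnoredDuringExecution:
#          - labelSelector:
#              matchLabels:
#                app: vmware-registry-controller-manager   # assumed label
#            topologyKey: kubernetes.io/hostname
```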
Next, deploy the T0 VRF as described.
Here we only need the following Project variables turned on; the VPC variables can be left at 0.
$deployProjectExternalIPBlocksConfig = 1
$deployProject = 1
Next, head over to Namespaces to create a Namespace: tick "Override Supervisor network settings" and select the T0 VRF in the dropdown menu.
Namespace configuration:
We will add Storage, User Permissions, VM size (VM Class), Content Libraries (TKRs), and download the CLI tools.
For the sake of simplicity, we will add the king kong administrator as well.
With VM Class (aka flavor) we set the size of the VMs in our TKC; here I chose "xsmall" (2 vCPUs, 2GB) for each Master (aka Control Plane) or Worker VM.
Download the kubectl + vSphere plugin and the vSphere Docker Credential Helper.
We will log in to the Supervisor Namespace, then switch to our VRF Namespace context, to apply a NetworkPolicy from a YAML via the CLI using kubectl.
kubectl vsphere login --server=172.17.31.130 -u administrator@vsphere.local --insecure-skip-tls-verify
KUBECTL_VSPHERE_PASSWORD environment variable is not set. Please enter the password below
Password:
Logged in successfully.
You have access to the following contexts:
   172.17.31.130
   t0vrf-1683-prj-2-ns1
If the context you wish to use is not in this list, you may need to try logging in again later, or contact your cluster administrator.
To change context, use `kubectl config use-context <workload name>`
kubectl config use-context t0vrf-1683-prj-2-ns1
Switched to context "t0vrf-1683-prj-2-ns1".
kubectl apply -f enable-all-policy.yaml
networkpolicy.networking.k8s.io/allow-all created
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all
spec:
  podSelector: {}
  ingress:
  - {}
  egress:
  - {}
  policyTypes:
  - Ingress
  - Egress
Deploy a TKC on the VRF Namespace.
kubectl apply -f t0vrf-1683-prj2-tkc-v1alpha3.yaml
tanzukubernetescluster.run.tanzu.vmware.com/t0vrf-1683-prj2-tkc-v1alpha3 created
This Kubernetes YAML comes from the VMware Docs v1alpha3 example: TKC with Default Storage and Node Volumes.
apiVersion: run.tanzu.vmware.com/v1alpha3
kind: TanzuKubernetesCluster
metadata:
  name: t0vrf-1683-prj2-tkc-v1alpha3
  namespace: t0vrf-1683-prj-2-ns1
spec:
  topology:
    controlPlane:
      replicas: 1
      vmClass: best-effort-xsmall
      storageClass: tanzu-gold-storage-policy
      tkr:
        reference:
          name: v1.25.7---vmware.3-fips.1-tkg.1
    nodePools:
    - name: worker
      replicas: 1
      vmClass: best-effort-xsmall
      storageClass: tanzu-gold-storage-policy
      tkr:
        reference:
          name: v1.25.7---vmware.3-fips.1-tkg.1
      volumes:
      - name: containerd
        mountPath: /var/lib/containerd
        capacity:
          storage: 5Gi
      - name: kubelet
        mountPath: /var/lib/kubelet
        capacity:
          storage: 5Gi
  settings:
    storage:
      defaultClass: tanzu-gold-storage-policy
    network:
      cni:
        name: antrea
      services:
        cidrBlocks: ["198.53.100.0/16"]
      pods:
        cidrBlocks: ["192.0.5.0/16"]
      serviceDomain: cluster.local
kubectl get node -o wide - Log in to the Supervisor NS, then the VRF NS, then the TKC cluster, and switch to the TKC cluster context
kubectl vsphere login --server=172.17.31.130 -u administrator@vsphere.local --insecure-skip-tls-verify --tanzu-kubernetes-cluster-namespace t0vrf-1683-prj-2-ns1 --tanzu-kubernetes-cluster-name t0vrf-1683-prj2-tkc-v1alpha3
KUBECTL_VSPHERE_PASSWORD environment variable is not set. Please enter the password below
Password:
Logged in successfully.
You have access to the following contexts:
   172.17.31.130
   t0vrf-1683-prj-2-ns1
   t0vrf-1683-prj2-tkc-v1alpha3
If the context you wish to use is not in this list, you may need to try logging in again later, or contact your cluster administrator.
To change context, use `kubectl config use-context <workload name>`
kubectl config use-context t0vrf-1683-prj2-tkc-v1alpha3
Switched to context "t0vrf-1683-prj2-tkc-v1alpha3".
kubectl get node -o wide
NAME                                                         STATUS   ROLES           AGE   VERSION                   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                 KERNEL-VERSION       CONTAINER-RUNTIME
t0vrf-1683-prj2-tkc-v1alpha3-ch9cz-dw2p9                     Ready    control-plane   23m   v1.25.7+vmware.3-fips.1   10.244.2.18   <none>        VMware Photon OS/Linux   4.19.283-3.ph3-esx   containerd://1.6.18-1-gdbc99e5b1
t0vrf-1683-prj2-tkc-v1alpha3-worker-d4t44-59cd4bc4bf-vtp94   Ready    <none>          14m   v1.25.7+vmware.3-fips.1   10.244.2.19   <none>        VMware Photon OS/Linux   4.19.283-3.ph3-esx   containerd://1.6.18-1-gdbc99e5b1
VMware NSX Network Topology - Supervisor Cluster on T0 - Guest Cluster on T0 VRF
VMware NSX - 1ESXi 2 TEP x (2 x Edge VM 2TEP) = 8 Tunnels
VMware NSX - Edge1 Tunnel Endpoint (4 to ESXi + 4 to Edge2 TEP)
VMware NSX - Host Transport Node - ESXi Details
VMware vCenter - Outer VCSA Virtual switches - Trunk vswitch (VMTRUNK vlan 4095 + 1731-Network vlan 1731) - vswitch0
VMware vCenter - Inner VCSA Virtual switches - 2 NSX Switch (aka N-VDS) Tanzu-VDS1 MGMT - Tanzu-VDS2 TEP
Tanzu-VDS1 - Ports
Tanzu-VDS2 - Ports
VMware vCenter - Distributed Port Groups per N-VDS
ESXi 28GB takes SupervisorVM + TKC VMs
01/21/2024 Scaling out the NSX Edge Cluster T0 in Active-Active mode, load balancing Edge TEPs and ESXi TEPs, and multiple automated deployments.
You can grab it from my fork page, or from William Lam's master repo in the PR section.
More screenshots will follow soon!
Cheers
01/08/2024 What's in the new VMware vSphere Foundation (VVF) and VMware Cloud Foundation (VCF) offers?
Blog post and diagrams made by William Lam to better grasp the new offerings: vmwa.re/skus, in addition to the recently published VMware KB 95927.
12/22/2023 Recover from an NSX-T unrecoverableCorfuError due to power loss or a storage issue in a singleton NSX Manager cluster
In an unforeseen event like a power outage or an underlying storage issue, there is a procedure to detach the faulty node, redeploy the VM, and rejoin the NSX-T Manager cluster (Replacing a faulty NSX-T Manager node in a VCF environment (78967)).
That said, in use cases with limited resources, the NSX-T Manager cluster may consist of only a single VM (see the bottom of this article for the documentation reference).
In that case there is no cluster to rejoin when the single VM is the one that is corrupt.
Only an NSX backup could restore the environment, if it has been set up! But what if NSX Backup has not been set up yet? What to do?
Here I present a simple trick, by no means supported I believe, that might allow us to recover the cluster from the unrecoverableCorfuError that occurs when NSX-T's CorfuDB database finds its files corrupted.
Symptoms:
The NSX UI is stuck with error 101:
You have a single-node cluster, that is one NSX-T Manager and not the recommended three.
Cluster status may show either error or down.
admin >
get cluster status verbose
Another example, with a nested VCF 4-node setup and the outer datastore disk full.
Impact / Risks
Some NSX configurations may get deleted.
The trick is 3 simple steps, plus 1 step to confirm once the NSX-T cluster is stable:
1. Stop the CorfuDB server service
root ~#
systemctl stop corfu-server.service
2. Delete/rename the last log segment
root ~#
cd /config/corfu/log
root ~#
ls -lrth
root ~#
mv /config/corfu/log/77.log /config/corfu/log/77.bak
3. Start the CorfuDB server service
root ~#
systemctl start corfu-server.service
4. Get the cluster status while waiting for it to become stable
admin >
get cluster status
After recovering to a stable cluster, we check that the "LEASE VERSION" column matches the new (clean) log that was generated, using the following command:
admin >
get cluster status verbose
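The steps above can be sketched as one script. This stays unsupported and lab-only, exactly like the manual procedure; the function name is mine, and it simply renames the newest numbered .log segment (77.log in the example above) so Corfu rebuilds it on start:

```shell
# Hedged sketch of steps 1-3: stop Corfu, rename the newest log segment, restart.
rotate_latest_corfu_log() {  # usage: rotate_latest_corfu_log /config/corfu/log
  dir=$1
  # Segments are numbered (75.log, 76.log, 77.log...); sort -n picks the newest.
  latest=$(ls "$dir" | grep '\.log$' | sort -n | tail -n 1)
  [ -n "$latest" ] || { echo "no .log segment found in $dir" >&2; return 1; }
  mv "$dir/$latest" "$dir/${latest%.log}.bak"
  echo "renamed $latest -> ${latest%.log}.bak"
}

# systemctl stop corfu-server.service
# rotate_latest_corfu_log /config/corfu/log
# systemctl start corfu-server.service
# then: get cluster status   (from the admin CLI, wait for stable)
```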
Also, there is a VMware by Broadcom Knowledge Base article, KB90840, covering the same UnrecoverableCorfuError due to an underlying storage issue on the corfu-nonconfig-server service.
The service name is corfu-nonconfig-server and the log directory is /nonconfig/corfu/corfu/log/.
I believe the same trick would work there as well.
Note:
First, this trick is not 100% reliable, or it would have been acknowledged as a workaround; and given the lengthy dozen minutes of waiting for the cluster to come up stable,
we often find it simpler to restart the NSX Manager.
What I noticed in this case is a large amount of read I/O, most likely caused by the sync happening at service start.
In my case these power outages came as frequently as 2-3 times per month, due to bad RAM and a heavy nesting environment causing BSODs.
And finally, this points out the importance of setting up NSX Backup and testing the Restore!
And note the placement of this SFTP backup server, as per the latest VMware NSX-T Reference Design:
7.3.4.4 Singleton NSX Manager
The resources required to run a cluster of three NSX Managers may represent a challenge in
small environments. Starting with NSX version 3.1, VMware supports deploying a single NSX
manager in production environments. This minimal deployment model relies on vSphere HA
and the backup and restore procedure to maintain an adequate level of high availability.
vSphere HA will protect against the failure of the physical host where the NSX manager is
running. vSphere HA will restart NSX Manager on a different available host. Enough resources
must be available on the surviving hosts; vSphere HA admission control can help ensure they
are available in case of failure.
Backup and restore procedures help in case of failure of the NSX manager itself. The SFTP
server where the backup is stored should not be placed on an infrastructure shared by the
single NSX Manager node.
A quick deep dive into CorfuDB history:
It is a log-appending database with fast performance
(think Kafka),
where the log consists not only of text but also of binary data.
And when I say fast, I mean CorfuDB can write dozens if not hundreds of thousands of times per second!
Source: https://github.com/CorfuDB/CorfuDB/wiki/White-papers
12/14/2023 VMware by Broadcom Flings Continue
VMTN Flings
12/11/2023 VMware by Broadcom Dramatically Simplifies Offer Lineup and Licensing Model
By Krish Prasad, Senior Vice President and General Manager, VMware Cloud Foundation Division: VMware by Broadcom business transformation
Desktop Hypervisor Continue
11/22/2023 Broadcom announces successful acquisition of VMware
Hock Tan, President and Chief Executive Officer: "Providing best-in-class solutions for our customers, partners and the industry"
11/12/2023 VMware Explore 2023 Breakout Session URLs
Links to videos with Customer Connect account and direct download links to supporting presentation slides.
VMware Explore EMEA 2023 Breakout Session URLs
VMware Explore US 2023 Breakout Session URLs
11/07/2023 Updated script: Automated Tanzu Lab Deployment with NSX VRF, Project, VPC
11/11/2023 Update: now merged into William Lam's master repo! https://github.com/lamw/vsphere-with-tanzu-nsxt-automated-lab-deployment
My Fork with Branch NSX4 github.com/abbedsedk/vsphere-with-tanzu-nsxt-automated-lab-deployment/tree/nsx4
- Updated for vSphere 8.0 and NSX 4.1.1, due to API changes since vSphere 7 and NSX 3.
- Added a few checks to allow reuse of existing objects like vCenter VDS, VDPortGroup, StoragePolicy, Tag and TagCategory, and NSX TransportNodeProfile.
- Added a FAQ on creating multiple clusters using the same VDS/VDPortGroup. This allows multi Kubernetes cluster high availability with vSphere Zones and Workload Enablement.
- Added a few pauses for the use case where we deploy only a new cluster: to allow the Nested ESXi to boot and fully come online (180s), and before VSAN Diskgroup creation (30s).
- Added FTT configuration for VSAN, allowing 0 redundancy so a single node can serve as a demo lab VSAN cluster. (This allows the whole nested Multi-AZ Tanzu lab with NSX VRF, Project, VPC to run on a 128GB box; the play-by-play of this use case is next.)
$hostFailuresToTolerate = 0
- Added a pause to the script as a no-babysitting workaround for owners of AMD Zen DPDK FastPath capable CPUs.
$NSXTEdgeAmdZenPause = 0
- Added the -DownloadContentOnDemand option for the TKG Content Library, to avoid downloading 250GB in advance and reduce it to a few GB.
- Added automated T0 VRF Gateway creation with a static route like the parent T0 (note: an uplink segment '$NetworkSegmentProjectVRF' is connected to the parent T0 for connectivity to the outside world).
- Added automated Project and VPC creation.
11/07/2023 A use case: vSphere with Tanzu using NSX Project VPC networks, with multi-K8s-cluster high availability using vSphere Zones
- Deploy the 1st VSAN Cluster (+1h), vSphere with Tanzu using NSX-T Automated Lab, same as before
- Deploy the 2nd and 3rd VSAN Clusters (15 min each), vSphere with Tanzu using NSX-T Automated Lab
- To do after the 3 cluster deployments
- Deploy NSX T0 VRF and Project and VPC Subnets Segments IP Blocks (3 min)
- Create the Zonal Storage Policy Multi-AZ-Storage-Policy
- Create 3 zones with the 3 clusters
- Workload Control Plane (WCP) Enablement in Workload Management
- Enablement: Beginning to Ready
- Next Enterprise Developer's Tasks: Give a name to a Namespace, deploy a Class-Based or Tanzu Kubernetes Cluster (TKC), and deploy a stateful app with cluster HA.
- Next Service Provider's Tasks: Create a provider VDC backed by a Supervisor Cluster, publish a Provider VDC Kubernetes Policy to an Organization VDC in VMware Cloud Director, and offer Kubernetes as a Service (CaaS).
In the following section we will do a three-zone Supervisor deployment type.
Deploy 1st Cluster using vSphere with Tanzu using NSX-T Automated Lab same as before
With 3 nested ESXi; or, if fitting into a 128GB memory box is a requirement, specify only 1 ESXi hostname/IP, which is possible with $hostFailuresToTolerate = 0.
Fill the value of these 3 variables
$NestedESXiHostnameToIPs = @{...}
$NewVCVSANClusterName = "Workload-Cluster-1"
$vsanDatastoreName = "vsanDatastore-1"
1st Cluster
Now, to deploy the 2nd and 3rd clusters, follow these steps:
- Change the values of these 3 variables for the 2nd and 3rd cluster deployments,
- Change the VAppName to a fixed value,
- Change the value of the already deployed VMs (VCSA, NSXManager, NSXEdge) to 0,
- In postDeployNSXConfig, change all variables from $true to $false except $runHealth, $runTransportNodeProfile, and $runAddEsxiTransportNode.
$NestedESXiHostnameToIPs = @{...}
$NewVCVSANClusterName = "Workload-Cluster-2"
$vsanDatastoreName = "vsanDatastore-2"
$VAppName = "Nested-vSphere-with-Tanzu-NSX-T-Lab-qnateilb" # "Nested-vSphere-with-Tanzu-NSX-T-Lab-$random_string" # A random string can be used on the first cluster, but reuse the same $VAppName for the 2nd and 3rd cluster deployments.
$preCheck = 1
$confirmDeployment = 1
$deployNestedESXiVMs = 1
$deployVCSA = 0
$setupNewVC = 1
$addESXiHostsToVC = 1
$configureVSANDiskGroup = 1
$configureVDS = 1
$clearVSANHealthCheckAlarm = 1
$setupTanzuStoragePolicy = 1
$setupTKGContentLibrary = 1
$deployNSXManager = 0
$deployNSXEdge = 0
$postDeployNSXConfig = 1
$setupTanzu = 1
$moveVMsIntovApp = 1
$deployProjectExternalIPBlocksConfig = 0
$deployProject = 0
$deployVpc = 0
$deployVpcSubnetPublic = 0
$deployVpcSubnetPrivate = 0
if($postDeployNSXConfig -eq 1) {
    $runHealth=$true
    $runCEIP=$false
    $runAddVC=$false
    $runIPPool=$false
    $runTransportZone=$false
    $runUplinkProfile=$false
    $runTransportNodeProfile=$true
    $runAddEsxiTransportNode=$true
    $runAddEdgeTransportNode=$false
    $runAddEdgeCluster=$false
    $runNetworkSegment=$false
    $runT0Gateway=$false
    $runT0StaticRoute=$false
    $registervCenterOIDC=$false
}
2nd Cluster
3rd Cluster
NSX View
VCENTER View
Todo after the 3 cluster deployments:
- ESXi -> Configure -> TCP/IP Configuration -> IPV6 CONFIGURATION -> Disable
- ESXi -> Configure -> TCP/IP Configuration -> Default -> Edit -> copy 'Search domains' to 'Domain'
- ESXi -> Configure -> TCP/IP Configuration -> Default -> Edit -> swap the Preferred and Alternate DNS servers if needed. (In my case this was part of why Workload Enablement wouldn't come up.)
- SSH to the ESXi hosts and reboot them via 'Send to all' in MultiTab PuTTY, pressing Enter in each ESXi tab
- Snapshot/Export the outer ESXi VM or the Lab vApp
- Start the Lab vApp and reset the alarms
- SSH to the virtual router (I use VyOS) and configure a static route for each Project and VPC Subnet IP/netmask via $T0GatewayInterfaceAddress (in my case this was the other part of why Workload Enablement wouldn't come up).
- Deploy NSX T0 VRF and Project and VPC Subnets Segments IP Blocks (3 min)
- Deploy NSX T0 VRF and Project and VPC Subnets Segments IP Blocks (3 min)
Fill the variables of section:
# Project ,Public Ip Block, Private Ip Block
# VPC, Public Subnet, Private Subnet
VMware Docs - VMware-NSX 4.1 - Add a Subnet in an NSX VPC
Self Service Consumption with Virtual Private Clouds Powered by NSX
(Gotcha: $VpcPublicSubnetIpaddresses must be a subset of $ProjectPUBcidr, and can't use the first or last subnet block size.)
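The subset part of that gotcha can be checked before running the script. A hedged helper (function names are mine; the CIDRs in the example calls are illustrative, not my lab values; it does not check the first/last-block restriction, only containment):

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip2int() { set -- $(echo "$1" | tr '.' ' '); echo $(( ($1<<24) + ($2<<16) + ($3<<8) + $4 )); }

# in_block CHILD_CIDR PARENT_CIDR -> prints yes if CHILD is inside PARENT.
in_block() {
  child=${1%/*};  clen=${1#*/}
  parent=${2%/*}; plen=${2#*/}
  # A child block must be at least as long (small) as the parent prefix.
  [ "$clen" -ge "$plen" ] || { echo no; return; }
  mask=$(( (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF ))
  [ $(( $(ip2int "$child") & mask )) -eq $(( $(ip2int "$parent") & mask )) ] \
    && echo yes || echo no
}

in_block 172.16.20.64/28 172.16.20.0/24   # yes
in_block 172.16.21.0/28  172.16.20.0/24   # no
```

Run it with your $VpcPublicSubnetIpaddresses as CHILD and $ProjectPUBcidr as PARENT before confirming the deployment.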
# T0 VRF Gateway
# Which T0 to use for the Project's external connectivity: $T0GatewayName or $T0GatewayVRFName (this option is important, as it determines whether the T0 VRF Gateway gets created or not.)
$ProjectT0 = $T0GatewayVRFName
Change the values of all variables to 0, and set to 1 only $preCheck, $confirmDeployment, and the Project and VPC ones.
$VAppName = "Nested-vSphere-with-Tanzu-NSX-T-Lab-qnateilb" # "Nested-vSphere-with-Tanzu-NSX-T-Lab-$random_string" # A random string can be used on the first cluster, but reuse the same $VAppName for the 2nd and 3rd cluster deployments.
$preCheck = 1
$confirmDeployment = 1
$deployNestedESXiVMs = 0
$deployVCSA = 0
$setupNewVC = 0
$addESXiHostsToVC = 0
$configureVSANDiskGroup = 0
$configureVDS = 0
$clearVSANHealthCheckAlarm = 0
$setupTanzuStoragePolicy = 0
$setupTKGContentLibrary = 0
$deployNSXManager = 0
$deployNSXEdge = 0
$postDeployNSXConfig = 0
$setupTanzu = 0
$moveVMsIntovApp = 0
$deployProjectExternalIPBlocksConfig = 1
$deployProject = 1
$deployVpc = 1
$deployVpcSubnetPublic = 1
$deployVpcSubnetPrivate = 1
Note: Screenshot the summary before confirming, as a reminder of the Subnet IP/Netmask later.
Deploy the VRF, Project, and VPC with all associated networking (IP Blocks, Segments, Subnets, Routing, DHCP) in 3.27 minutes.
An assortment of NSX API calls from 2 PowerCLI modules and from straight REST calls.
NSX Topology T0/VRF - Project - VPC
- Create the Zonal Storage Policy "Multi-AZ-Storage-Policy" -> No redundancy (if you configured FTT = 0 on a one-node cluster)
VMware Docs - VMware-vSphere 8.0 - Create Storage Policies for vSphere with Tanzu
VMware Docs - VMware-vSphere 8.0 - Deploy a Three-Zone Supervisor with NSX Networking
- Create 3 zones with the 3 Clusters
Workload Control Plane (WCP) Enablement in Workload Management
Enablement: Beginning to Ready
Next Developers' Tasks: Give a name to a Namespace, deploy a Class-Based or Tanzu Kubernetes Cluster (TKC), and deploy a stateful app with cluster HA.
vSphere with Supervisor Cluster Configuration Files
Next Service Provider's Tasks: Create a provider VDC backed by a Supervisor Cluster, Publish a Provider VDC Kubernetes Policy to an Organization VDC in VMware Cloud Director, Offers Kubernetes as a Service (CaaS).
Publish a Provider VDC Kubernetes Policy to an Organization VDC in VMware Cloud Director
04/01/2023 Added an Export option to the Nested VCF Lab VMs PR script.
The option can be set to run right after the deployment, or at a later time, which is preferred for saving a state of the lab VMs as OVA.
A FAQ was added to explain how to set the option.
Note that the script is coded to export the VMs of the latest vApp deployed by the script whose name starts with Nested-VCF-Lab-.
15 min to stop, export as OVA, and start the VMs back up.
03/27/2023 Enable multiple vApp deployments on the same cluster
Because I was unable to deploy multiple times, I created an issue and then a PR that got merged.
03/05/2023 Comparing CPU and I/O usage during VCF SDDC Management Bringup on 4 vs 1 nested ESXi nodes
Follow-up on the previous issue of 02/14/2023.
Found the root cause to be a nested-lab-environment case of CPU/I/O contention on the hosts,
occurring on a task towards the end of the bringup called "Configure Base Install Image Repository on SDDC Manager",
which copies the vCSA ISO and NSX OVA to an NFS share on the 4 nested ESXi vSAN datastore.
That drove the CPU through the roof, and consequently applications running in the three VMs (vCenter, NSX and SDDC Manager) had their kernels stuck at one point or several.
Looking deeper into it, I think the subsequent tasks might have had issues with the kernel-stuck VMs (I feel there may be missing pieces to understand it all...).
I was monitoring while that contention happened, then made screenshots of CPU and I/O usage of 2 SDDC bringups at the time of that copy task to illustrate:
one when the whole issue occurred with 4 nested ESXi,
one with 1 nested ESXi using the FTT=0 trick given by William Lam.
Using fewer vCPUs (8 instead of 4x8) and an NVMe SSD capable of faster I/O (PCIe 4.0 instead of 3.0) confirmed that, without kernels getting stuck, all is well.
I think that on real gear this should not happen.
03/03/2023 PCIE 4.0 LAB UPGRADE - AMD Ryzen 3700X + Netac NV7000
B.O.M 308€
AMD Ryzen 7 3700X, 3.6 GHz, 7 nm, L3 = 32MB, at 158€
Netac SSD 2TB M.2 NVMe PCIe 4.0 x4 at 150€
Ordered on 02/11/2023 and received 03/03/2023, but it was worth the wait; not only did it come from the official Netac store, but on the back it says Quality Check "QC PASS 02/2023".
Note you have to have a PCIe 4.0 capable motherboard; I chose my MSI X570 just for that, and for the fact that it ran my older Ryzen 2700.
What to expect of this speedup, I mean from PCIe 3.0 at 2000MB/s to PCIe 4.0 at 7000MB/s sequential read/write throughput? Not quite that, because we all know OSes do mixed random 4KB reads/writes;
nevertheless, VCF Nested deploys twice as fast, in 15 minutes instead of 30, because the bandwidth is twice as fast 😀.
02/24/2023 VMware Cloud Foundation with a single ESXi host for Workload Management Domain made by William Lam.
02/23/2023 Removing NSX CPU/Memory reservations when deploying a VMware Cloud Foundation (VCF) Management or Workload Domain made by William Lam.
as a final step, to get the modified NSX OVA into the overlay part of "/mnt/iso/", known as "/upper/", from "/work/".
/mnt/iso/...ova # the bringup sees this directory, which is a combination of the following 'oldiso' RO + 'upper' RW directories
|
/root/oldiso/...ova # read only filesystems
+
/overlay/upper/...ova # read write filesystems
/overlay/work/work/...ova # read write filesystems
I simply issued a "cp" of the OVA from "/work" to "/upper", which is writable, and it was presented in "/mnt/iso"; thus
I shared on that page what worked for me.
Feel free to check it out; it not only removes the NSX reservation for the 'Workload Management Domain' bringup, but
also for the later subsequent 'Workload VI Domain', which is desirable with the limited resources of a lab environment.
And now, like "Neo" in "The Matrix" learning "Jiu-Jitsu", I can say "Yay, I know Linux overlay filesystems!" and how to make read-only writable (just a side-note pointer: Docker uses exactly this for its layering).
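The same oldiso(RO) + upper(RW) -> merged layering can be reproduced on any Linux box with a throwaway overlay mount. A minimal sketch, assuming root privileges; the paths and file names here are illustrative, not the Cloud Builder ones:

```shell
# Recreate the RO-lower + RW-upper -> merged view with a scratch overlay (run as root).
mkdir -p /tmp/demo/lower /tmp/demo/upper /tmp/demo/work /tmp/demo/merged
echo "original" > /tmp/demo/lower/nsx.ova          # stands in for the read-only OVA
mount -t overlay overlay \
  -o lowerdir=/tmp/demo/lower,upperdir=/tmp/demo/upper,workdir=/tmp/demo/work \
  /tmp/demo/merged
# Copying a modified file into the upper (RW) layer makes it win in the merged view,
# just like cp'ing the patched OVA from /work into /upper exposed it under /mnt/iso.
echo "modified" > /tmp/demo/upper/nsx.ova
cat /tmp/demo/merged/nsx.ova                       # shows the upper-layer copy
umount /tmp/demo/merged
```

The workdir must be an empty directory on the same filesystem as the upperdir; overlayfs uses it for atomic copy-up operations.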
02/14/2023 - SDDC Manager 8 accounts disconnected
Just as in the post before, click on the 3 dots and REMEDIATE using the same password used in the deployment script.
Steps to recover expired Service Accounts in VMware Cloud Foundation (KB 83615)
SSH into each of the 4 Nested ESXi
[root@vcf-m01-esx01:~] passwd svc-vcf-vcf-m01-esx01
Changing password for svc-vcf-vcf-m01-esx01
Enter new password:
Re-type new password:
passwd: password updated successfully
(note I didn't do the reset-failed-login part)
SDDC Manager ESXi svc accounts -> 3 dots, REMEDIATE with this newly created password.
We must be logged in with another SSO user with the ADMIN role
to be able to click REMEDIATE on PSC administrator@vsphere.local.
I think a proper SSO ADMIN user like vcf-secure-user@vsphere.local, illustrated in the KB, is the way to go in production.
In my case, since it was a lab, I found an SSO account and promoted it to the ADMIN role.
Disclaimer: I do not know if that is supported, even though
from the remediate-password window we learn that the service account password will be rotated after the remediate,
so we can remove the ADMIN role from this service account afterwards.
Using a) the SDDC Manager UI or b) the vCenter UI, it's easily done instead of the API.
a) SDDC manager UI as administrator@vsphere.local -> Single Sign On -> +USERS AND GROUPS -> Search User: svc , Refine search by: Single User, Domain: vsphere.local
Select the user svc-vcf-m01-nsx01-vcf-m01-vc01 -> Choose Role: ADMIN (note this can be done from vCenter, see below), then click ADD.
b) vCenter UI as administrator@vsphere.local -> Licensing -> Single Sign On -> Users and Groups -> Users -> Domain: vsphere.local, Find: svc -> EDIT: Password, Confirm Password
c) SDDC manager UI login as svc-vcf-m01-nsx01-vcf-m01-vc01@vsphere.local -> Security -> Password Management -> PSC -> administrator@vsphere.local -> REMEDIATE again using the same original password
d) logout
optionally e) redo a), but select the 3 dots and remove the ADMIN role from this service SSO user.
Update
So mine falls under expected behavior, because I didn't give it a chance to sync and refresh after the deployment (less than 24h had passed).
Lesson learned: if this happens again, I will wait 24h before taking action.
I related this to someone on VMTN experiencing a similar effect, reporting disconnected accounts on VCF 4.5.0.
02/18/2023 - Importing VMs Vyos and nested ESXi, Checking and Configuring NTP
First, the OVA import wizard doesn't need to be filled in, as the defaults are already set for our environment.
Vyos
One thing to do on the VyOS console
is to remove any occurrence of the old MAC address "hw-id" and
any new interfaces in the config.boot file:
"vi /config/config.boot", then
the "dd" command to delete a line, then
save it with ":" "wq!"
"configure"
"load /config/config.boot"
"commit"
"save"
"exit"
"reboot"
Note: you have to learn where the US QWERTY keys are if you have an AZERTY keyboard, or be sure to load your regional keymap with "sudo loadkeys fr" ("fr" for the French keymap).
nested ESXi and check NTP
One thing to do on all nested ESXi VMs upon import as well:
SSH into each of them to permanently remount the OS volume, with this one-liner for example, and recheck NTP.
using Multi Tabbed Putty mtputty
ssh all 4 nestedesxi
tick send to all
UUID=$( esxcfg-volume -l | grep UUID | cut -b 17-52 ); esxcfg-volume -M $UUID
hit ENTER
ssh cb
tick send to all
ntpq -p
hit ENTER
At this point, not all ESXi hosts had NTP running, or even set up, or it was sitting in the INIT state.
Configure NTP server on nested ESXi
We're tempted to edit ntp.conf, but there is a comment that tells us not to:
[root@vcf-m01-esx02:~]
cat /etc/ntp.conf
# Do not edit this file, config store overwites it
So how do we do it?
Troubleshooting NTP on ESX and ESXi 6.x / 7.x / 8.x (KB 1005092)
For builds 7.0.3 onwards,
this KB explains how to add the "tos maxdist 15" setting,
so we can use this same method to configure the server setting:
/etc/init.d/ntpd restart
NTPold="`cat /etc/ntp.conf | grep server`"
NTPprefered="server 0.pool.ntp.org"
cp /etc/ntp.conf /etc/ntp.conf.bak -f && sed -i 's/'"$NTPold"'/'"$NTPprefered"'/' /etc/ntp.conf.bak && esxcli system ntp set -f /etc/ntp.conf.bak
cp /etc/ntp.conf /etc/ntp.conf.bak -f && echo "tos maxdist 15" >> /etc/ntp.conf.bak && esxcli system ntp set -f /etc/ntp.conf.bak
esxcli system ntp set -e 0 && esxcli system ntp set -e 1
/etc/init.d/ntpd restart
ntpq -p
NTP service auto start is not working in ESXi 7.0 (KB 80189)
chkconfig --list ntpd
chkconfig ntpd on
reboot
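The edit-a-copy-then-apply pattern above can be sanity-checked on any Linux box before touching a host. A sketch using a scratch file (the stand-in server line is made up; on ESXi the edited copy would then be fed to `esxcli system ntp set -f`):

```shell
# Simulate the KB method on a throwaway copy of ntp.conf.
conf=$(mktemp)
printf 'server 127.0.0.1\n' > "$conf"        # stand-in for the existing server line
NTPold=$(grep '^server' "$conf")
NTPprefered="server 0.pool.ntp.org"
sed -i "s/$NTPold/$NTPprefered/" "$conf"     # swap the server entry
echo "tos maxdist 15" >> "$conf"             # append the KB's tolerance setting
cat "$conf"                                  # shows the rewritten server line plus tos maxdist 15
```

Note the substitution would break if the matched line contained sed metacharacters such as `/`; for plain `server <host>` lines it is safe.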
That's it, you're set for success! Remember, before you begin the bringup, to shut down all the VMs and snapshot them all, just to be safe!
02/09/2023 UPDATE - Contributed to William's vSphere with Tanzu using NSX-T Automated Lab Deployment script to allow additional Edge node creation. Now merged.
02/08/2023 - SDDC Manager account disconnected NSXT MANAGER
The trick here is to understand the text "Specify the password that was set manually on the component": it means the same password we set in the deployment script, rather than what the misleading warning suggests.
02/06/2023 UPDATE - Finally, the solution to the NSX installation and HA agent install issues on ESXi: they were due to a lack of memory.
Clicking on the NSX install failure, we see that the ESXi host is lacking memory.
This 2nd ESXi node happened to be the one hosting the NSX VM, but it had more than 13GB of free memory.
We can work around this issue by live migrating the NSX vm to the 3rd ESXi node, and then hit the Resolve Button
We see an unknown node status, but from KB 94377 we learn that is health check issue.
Next, the install of the HA agent onto this exact same 3rd ESXi node fails.
I was thinking of doing the same live-migration trick with NSX, but it was not possible, so I shut down NSX and migrated it to the 4th ESXi node.
But then it wouldn't power on, needing an extra ~200MB.
Looking at the 4th ESXi node, there was apparently plenty of memory: 28.7GB.
At that point I was curious; from vCenter I enabled the SSH service (since it's stopped during bringup) to have a look at the available memory reservation for the user namespace, using this command found on VMTN:
memstats -r group-stats -g0 -l2 -s gid:name:parGid:nChild:min:max:conResv:availResv:memSize -u mb 2> /dev/null | sed -n '/^-\+/,/.*\n/p'
I figured out that NSX needs 16384MB of reservation, while here we see 16372MB of reservation available + 178MB of overhead:
16384 - 16372 + 178 = 190MB, roughly the ~200MB that explains why the vCenter admission failure wouldn't let the NSX VM power on.
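The admission-control arithmetic can be checked in a line of shell (numbers in MB, taken from the memstats output above):

```shell
# required reservation - available reservation + VM overhead = admission shortfall
required=16384; available=16372; overhead=178
echo $(( required - available + overhead ))   # prints 190, i.e. the ~200MB the power-on lacked
```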
The solution is easy: just bump the ESXi memory a bit more. At the time I was testing 42GB, so I redid the lab with 46GB and it worked flawlessly on these tasks.
02/04/2023 UPDATE - Note I do not recommend doing the bringup with the nested environment nested again, like in the picture above, for better performance (having 3 hypervisors "in a row" is only meant for a lite lab deployment 😀). Below I'll explain how to deploy, then modify the Cloud Builder timeouts, then export.
This is also to avoid the "BUG: soft lockup", which feels pretty much like a PSOD with some nasty effects (more on that later in the disconnected-account posts here and here), and many other issues that could arise during bringup. Clearly, the expected I/O throughput is at minimum in the 100s of MB/s, not the 10s of MB/s.
Export the Nested VCF Lab's VMs
If you know how to connect to the virtual infrastructure, then it can be done with a PowerShell one-liner to export the vApp:
Get-VM -Name vcf-m01-* | Export-VApp -Destination "D:\VM\Nested\Vapp\" -Force -Format Ova | Out-Null
Update 04/01/2023 PR Done
Customization pre export:
Use multi tabbed SSH client, on Windows MTPuTTY is free.
For the Cloud Builder VM, SSH to it and extend these two timeouts:
sed -i 's/ovf.deployment.timeout.period.in.minutes=40/ovf.deployment.timeout.period.in.minutes=180/' /opt/vmware/bringup/webapps/bringup-app/conf/application.properties
sed -i -e 's/nsxt.disable.certificate.validation=true/nsxt.disable.certificate.validation=true\nnsxt.manager.wait.minutes=180/' /opt/vmware/bringup/webapps/bringup-app/conf/application.properties
systemctl status vcf-bringup
systemctl restart vcf-bringup
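The second sed's trick of appending a brand-new property line can be previewed on a scratch file before touching the appliance. A sketch (GNU sed interprets \n in the replacement as a newline):

```shell
# Show that the replacement splits one property line into two.
f=$(mktemp)
echo 'nsxt.disable.certificate.validation=true' > "$f"
sed -i -e 's/nsxt.disable.certificate.validation=true/nsxt.disable.certificate.validation=true\nnsxt.manager.wait.minutes=180/' "$f"
cat "$f"   # now two lines: the original property plus nsxt.manager.wait.minutes=180
```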
After the customization of the VM is done and CB validation is all green, rerun the script with all options set to 0 except:
$preCheck = 1
$confirmDeployment = 1
$exportVMs = 1
Export the Virtual router's VMs
Additionally, also export your virtual router(s); in my case it is a CSR1000v, assuming they are deployed with the naming convention csr-*.
Get-VM -Name csr-* | Export-VApp -Destination "D:\VM\Nested\Vapp\" -Force -Format Ova | Out-Null
Get-VM -Name vyos-* | Export-VApp -Destination "D:\VM\Nested\Vapp\" -Force -Format Ova | Out-Null
01/25/2023 UPDATE - Good news: a new version of the Automated VMware Cloud Foundation Lab Deployment script is already here!
I just asked for it a few days ago here, then shared some of these tips on William Lam's website, and on the same day (would you believe it?) a PR and a merge made it happen! The virtualization community is fast 😀. This version includes fixes for steps 1, 3, 4 (need to follow the KB; I chose option 2, patch with WinSCP, or integrate it into the OVA), 5 and 7.
01/21/2023 - VCF v4.5 Lab
PHYSICAL LAB B.O.M 900€ (GPU & HDD not counted)
• RYZEN 2700 BOX (230€), officially supports 64GB but it takes
• 128GB DDR4 Patriot, 4 x 32GB at 100€ each, with a few MEMORY MANAGEMENT BSODs
• MOTHERBOARD MSI X570 (170€)
• 1TB SSD NVMe M.2 Micron P1 (100€) (100GB for OS and 831GB for LAB that became full! I got a story)
VM DC+DNS+iSCSI+NFS 2vCPUs 2GB
VM HOST-ESXI+VCSA 16vCPUs 104GB
For Router specifically
1 adapter, not tagged, for management
8 adapters on a trunk port group VLAN 4095 (coming from the Windows VMware Workstation VMnet adapter configuration: Jumbo + VLAN 4095 + all IP protocols unchecked)
7 configured as sub-interfaces with dot1q tags corresponding to the VLANs desired for the bringup
1 configured as a trunk
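As an illustration of one such dot1q sub-interface, in VyOS syntax (the interface name, VLAN ID, address and MTU here are hypothetical, not the lab's actual values; a CSR1000v would use `encapsulation dot1Q` on a sub-interface instead):

```shell
# VyOS: a dot1q sub-interface (vif) on the trunk-facing NIC, one per bringup VLAN
set interfaces ethernet eth1 vif 1611 address '172.16.11.1/24'
set interfaces ethernet eth1 mtu '9000'
commit
save
```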
For Nested ESXi specifically
4 adapter on trunk port group
1. After deployment Automated VMware Cloud Foundation Lab Deployment
Open the outer vCenter and change the 1st disk from 12GB to 32GB in the Nested ESXi VMs, or Cloud Builder fails with "VSAN_MIN_BOOT_DISKS.error".
2. Change the 3rd (vSAN capacity) disk from 60GB to more than 150GB if the Nested ESXi are nested themselves in an ESXi VM!
(I go full Inception movie, running the outer ESXi in a VM on Windows VMware Workstation. The advantage of snapshotting the whole thing is
significantly appreciated, especially for the VCF bringup, but the slowness less so.) Regarding speed, I'm looking forward to trying PCIe 4.0 NVMe
once I upgrade my CPU to a 3700X, to speed up some tasks of the bringup and avoid some related CPU issues (Windows BSOD).
3. Change all four SddcManager passwords to ones as strong as the NSX ones.
4. I got "Gateway IP Management not contactable" -> patch it with KB 89990 (release notes).
5. Failed VSAN Diskgroup -> “esxcli system settings advanced set -o /VSAN/FakeSCSIReservations -i 1” on the Outer ESXi.
6. For DUP “esxcli system settings advanced set -o /Net/ReversePathFwdCheckPromisc -i 1”
7. Instead of DHCP, use an IP Pool; in the VMware Cloud Foundation API Reference Guide, under SDDC, look for
"ESXi Host Overlay TEP IP Pool".
8. Use a router IP as NTP for VCF, but configure on the router a reliable external NTP server with a good stratum.
9. After Validation All green, Before launching the bringup Modify some CloudBuilder timeout:
vim /opt/vmware/bringup/webapps/bringup-app/conf/application.properties
sed -i 's/ovf.deployment.timeout.period.in.minutes=40/ovf.deployment.timeout.period.in.minutes=180/' /opt/vmware/bringup/webapps/bringup-app/conf/application.properties
sed -i -e 's/nsxt.disable.certificate.validation=true/nsxt.disable.certificate.validation=true\nnsxt.manager.wait.minutes=180/' /opt/vmware/bringup/webapps/bringup-app/conf/application.properties
echo "bringup.mgmt.cluster.minimum.size=1" >> /etc/vmware/vcf/bringup/application.properties
systemctl restart vcf-bringup
watch "systemctl status vcf-bringup"
tail -f /opt/vmware/bringup/logs/vcf-bringup-debug.log
10. Disable automatic DRS for VC, NSX and SDDC Manager after each deployment
in the inner vCenter, or else DRS will rebalance those critical VMs while the others are being deployed:
Cluster -> Configure -> VM Overrides -> Automatic DRS -> Disabled or Manual
vcf-m01-vc01
vcf-m01-nsx01a
vcf-m01-sddcm01
01/19/2023 - Deploying Cloud Director in small form factor: the troubleshooting
This issue arises due to slow NFS access and a lack of CPU for the initial primary cell boot. Encountered in version 10.4.
Long story short, issue this command to relax the NFS access timeout:
sed -i 's/10s/60s/' /opt/vmware/appliance/bin/appliance-sync.sh
and bump up the vCPUs from 2 to 4.
The best way to avoid tinkering with the appliance script files is to give it at least 4 vCPUs before deploying; while there is a hard-coded value of 8 CPUs, I determined that 4 is sufficient based on the top utility showing 400% CPU usage, meaning 4 x 100% of one CPU core.
I had previously answered this issue on VMTN, where it was found helpful. VMware Technology Network > Cloud & SDDC > vCloud > VMware vCloud Director Discussions > Re: Configure-vcd script failed to complete