VXLAN

Sabyasachi K.

Network Engineer @ Google

Published Sep 30, 2019

Why VXLAN?

1) The first issue is the VLAN ID space itself: with 802.1Q we are restricted to 4096 VLANs.

Cloud providers need to accommodate different tenants on the same underlying physical infrastructure. Each tenant may in turn create multiple L2 and L3 networks within its own slice of the virtualized data center. This drives the need for a greater number of L2 networks.

2) The second issue is the operational model for deploying VLANs. Although VTP exists as a protocol for creating, disseminating, and deleting VLANs, as well as for pruning them for optimal extent, most networks disable it. That means some coordination is required among the network admin, cloud admin, and tenant admin to transport VLANs over the switches. Any proposed extension to VLANs must figure out a way to avoid such coordination. To be more precise, adding each new Layer 2 network must not require incremental config changes in the transport infrastructure.

3) VLANs are too restrictive for virtual data centers in terms of the physical constraints of distance and deployment. The new standard should ideally be free of these constraints. This allows data centers more flexibility in distributing workloads, for instance, across L3 boundaries.

What is VXLAN?

Virtual eXtensible LAN (VXLAN) provides the same service to connected Ethernet end systems that VLANs do today, but in a more extensible manner. VXLAN is extensible with regard to scale and with regard to the reach of its deployment.

The VXLAN identifier space is 24 bits, double the width of the 12-bit VLAN ID. This doubling allows the ID space to grow by over 400,000 percent, to about 16 million unique identifiers.
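
The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative and the 12-bit width is that of the VLAN ID field in the 802.1Q tag.

```python
# ID-space arithmetic for 802.1Q VLAN IDs versus VXLAN identifiers.
VLAN_ID_BITS = 12    # VLAN ID field in the 802.1Q tag
VXLAN_ID_BITS = 24   # VXLAN identifier field

vlan_ids = 2 ** VLAN_ID_BITS             # number of VLAN IDs
vxlan_ids = 2 ** VXLAN_ID_BITS           # number of VXLAN IDs
growth_factor = vxlan_ids // vlan_ids    # how many times larger the space is
growth_percent = (vxlan_ids - vlan_ids) / vlan_ids * 100

print(f"VLAN IDs:  {vlan_ids}")
print(f"VXLAN IDs: {vxlan_ids}")
print(f"Growth:    {growth_factor}x, or {growth_percent:.0f} percent")
```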

VXLAN uses the Internet Protocol (both unicast and multicast) as the transport medium.

Protocol Consideration :

Highly distributed systems : VXLAN should work in an environment where there could be thousands of nodes. The protocol should work without requiring a central point of control or a hierarchy of protocols.

VXLAN Encapsulation Packet :

The outer IP Header has the source IP and destination IP of the VTEP endpoints.

The Outer Ethernet Header has the source MAC of the Source VTEP and the destination MAC of the immediate Layer 3 next hop.

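VXLAN encapsulation adds 50 bytes of overhead to the original Ethernet frame: 14 bytes of outer Ethernet header, 20 bytes of outer IPv4 header, 8 bytes of UDP header, and 8 bytes of VXLAN header.

VTEPs must not fragment VXLAN packets.

Intermediate routers may fragment encapsulated VXLAN packets because of the larger packet size.

The destination VTEP may silently discard such VXLAN fragments.

To ensure end-to-end packet delivery without fragmentation, it is recommended to use a larger MTU across the transport network to accommodate the encapsulation.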
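
To make the encapsulation overhead concrete, the sketch below builds the 8-byte VXLAN header and computes the transport MTU needed for a given host MTU. It assumes an IPv4 underlay with no outer 802.1Q tag and no IP options; the function names are hypothetical, chosen only for this illustration.

```python
import struct

VXLAN_UDP_PORT = 4789   # IANA-assigned UDP destination port for VXLAN
ETHERNET_HEADER = 14    # MAC header without an 802.1Q tag
IPV4_HEADER = 20        # IPv4 header without options
UDP_HEADER = 8
VXLAN_HEADER = 8

# Per-frame overhead: outer Ethernet + outer IPv4 + UDP + VXLAN.
OVERHEAD = ETHERNET_HEADER + IPV4_HEADER + UDP_HEADER + VXLAN_HEADER


def build_vxlan_header(vni: int) -> bytes:
    """Return the 8-byte VXLAN header carrying a 24-bit identifier."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    flags_word = 0x08 << 24   # I flag set: the identifier field is valid
    vni_word = vni << 8       # low 8 bits of this word are reserved
    return struct.pack("!II", flags_word, vni_word)


def required_transport_mtu(host_mtu: int = 1500) -> int:
    """IP MTU the transport network needs so that no fragmentation occurs.

    The outer IP packet carries the UDP and VXLAN headers plus the whole
    inner Ethernet frame (inner MAC header + host IP packet).
    """
    return host_mtu + ETHERNET_HEADER + VXLAN_HEADER + UDP_HEADER + IPV4_HEADER


print(OVERHEAD)
print(build_vxlan_header(10).hex())
print(required_transport_mtu(1500))
```

With a 1500-byte host MTU, the transport network therefore has to carry IP packets 50 bytes larger than the hosts send.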

How does it work?

VXLAN defines the VTEP (VXLAN tunnel end point), which contains all the functionality needed to provide Ethernet Layer 2 services to connected end systems.

VTEPs are considered the edge of the network, typically connecting an access switch to an IP transport network.

The VTEP functionality is built into the access switch, but it is logically separate from the access switch.

End systems connected to the same access switch communicate through that access switch.

The access switch acts as any learning bridge does: it floods a frame out of its ports when it doesn't know the destination MAC, or sends it out of a single port when it has learned which direction leads to the end station, as determined by source MAC learning.

Broadcast traffic is sent out of all the ports.

Further, the access switch can support multiple bridge domains, which are typically identified as VLANs with an associated VLAN ID that is carried in the 802.1Q header on trunk ports. In the case of a VXLAN-enabled switch, the bridge domain would instead be associated with a VXLAN ID.

The VTEP uses its IP interface to exchange IP packets carrying the encapsulated Ethernet frames with other VTEPs.

A VTEP also acts as an IP host by using IGMP to join IP multicast groups.

In addition to the VXLAN ID carried over the IP interface between VTEPs, each VXLAN is associated with an IP multicast group.

The IP multicast group is used as a communication bus between the VTEPs to carry broadcast, multicast, and unknown unicast frames to every VTEP participating in the VXLAN at a given moment in time.

The VTEP function also works the same way as a learning bridge, in that if it doesn't know where the destination MAC is, it floods the frame, but it performs this flooding by sending the frame to the multicast group associated with the VXLAN.

Learning is similar, except that instead of learning the source interface associated with a frame's source MAC, the VTEP learns the encapsulating source IP address.

Once it has learned this MAC to remote IP association, frames can be encapsulated within a unicast IP packet and sent directly to the destination VTEP.
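
The learning and forwarding behaviour described above can be sketched as a small Python class. This is a simplified model, not an implementation of any real switch: the class name, method names, and the address 192.168.2.2 are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


def must_flood(dst_mac: str) -> bool:
    """Broadcast and multicast MACs have the low bit of the first octet set."""
    return (int(dst_mac.split(":")[0], 16) & 1) == 1


@dataclass
class Vtep:
    ip: str
    vni_to_group: Dict[int, str]   # VXLAN ID -> IP multicast group
    # (VXLAN ID, host MAC) -> IP address of the remote VTEP
    remote_macs: Dict[Tuple[int, str], str] = field(default_factory=dict)

    def learn(self, vni: int, inner_src_mac: str, outer_src_ip: str) -> None:
        """Learn from a received packet: inner source MAC -> outer source IP."""
        self.remote_macs[(vni, inner_src_mac)] = outer_src_ip

    def outer_destination(self, vni: int, dst_mac: str) -> str:
        """Choose the outer destination IP for a frame entering the overlay."""
        group = self.vni_to_group[vni]
        if must_flood(dst_mac):
            return group   # broadcast or multicast: flood to the group
        # Known unicast goes straight to the remote VTEP, unknown is flooded.
        return self.remote_macs.get((vni, dst_mac), group)


vtep2 = Vtep(ip="192.168.2.2", vni_to_group={10: "239.1.1.1"})
host_a = "00:00:00:00:00:aa"

print(vtep2.outer_destination(10, host_a))   # unknown MAC: multicast group
vtep2.learn(10, host_a, "192.168.1.1")       # learned from a received packet
print(vtep2.outer_destination(10, host_a))   # known MAC: unicast to the VTEP
```

Keying the table on the pair of VXLAN ID and MAC keeps tenants with overlapping MAC addresses separate.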

VXLAN Flood & Learn Mechanism

Flood and learn is a data plane learning technique for VXLAN, where a VNI is mapped to a multicast group on a VTEP.

Initially, host traffic is always in broadcast, unknown unicast, or multicast (BUM) form.

The BUM traffic is flooded to the multicast delivery group for the VNI that is sourcing the host packet.

The remote VTEPs that are part of the multicast group learn the remote host MAC, VNI, and source VTEP IP information from the flooded traffic.

Unicast packets to the host MAC are then sent directly to the destination VTEP as VXLAN packets.

Note : Local MACs are learned over a VLAN (VNI) on a VTEP.

STEP 1 : End system A, with MAC-AA and IP 10.1.1.1, sends an ARP request for the host with IP 10.1.1.2.

The source MAC address is MAC-AA and the destination MAC address is FF:FF:FF:FF:FF:FF.

The host with MAC-BB is in VLAN 10. The packet is sent to VTEP 1, which has VNID 10 mapped to VLAN 10.

STEP 2 : When the ARP packet is received at VTEP 1, it is encapsulated as a VXLAN packet and forwarded to the remote VTEPs, with the source IP address of VTEP 1 and the destination IP address 239.1.1.1 (the associated multicast group). In the VXLAN encapsulation, the VNID is set to 10, the inner source MAC remains MAC-AA, and the outer destination MAC is the multicast MAC for 239.1.1.1.

NOTE : The VTEPs that have subscribed to that particular multicast group receive the multicast packet.

STEP 3 : Both remote VTEPs receive the packet, decapsulate it, and forward it to the end systems connected to them.

The VTEPs update their MAC tables:

REMOTE VTEP : 192.168.1.1

VLAN 10

MAC ADDRESS : MAC-AA

In this process, each receiving VTEP learns that MAC-AA is reachable through the remote VTEP 192.168.1.1.

STEP 4 : After the ARP packet is forwarded to the hosts, the host that finds its own IP address in the request responds with an ARP reply.

STEP 5 : When the ARP reply is received at VTEP 2, VTEP 2 already knows MAC-AA and the IP of the VTEP behind which it sits, so it forwards the ARP reply as a unicast packet.

STEP 6 : When VTEP 1 receives the unicast ARP packet, it updates its MAC table with the remote host information.

STEP 7 : After the MAC table is updated, the ARP reply is forwarded to the host.
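
Step 2 refers to the multicast MAC for 239.1.1.1 without spelling it out. As a small sketch, the standard IPv4 multicast mapping places the low 23 bits of the group address behind the fixed prefix 01:00:5e; the function name below is illustrative.

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Map an IPv4 multicast group to its Ethernet multicast MAC address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low_23_bits = int(addr) & 0x7FFFFF       # only 23 bits are carried over
    mac = 0x01005E000000 | low_23_bits       # fixed prefix 01:00:5e
    octets = [(mac >> shift) & 0xFF for shift in range(40, -1, -8)]
    return ":".join(f"{octet:02x}" for octet in octets)


print(multicast_mac("239.1.1.1"))
```

Because only 23 of the 28 variable group bits survive the mapping, several groups can share one MAC address, which is worth keeping in mind when assigning groups to VNIs.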

VXLAN BGP EVPN

The BGP MPLS-based EVPN solution was adopted to address the limitations of the flood and learn mechanism.

In the BGP EVPN solution for the VXLAN overlay, a VLAN is mapped to a VNI for Layer 2 services, and a VRF is mapped to a VNI for Layer 3 services on a VTEP.

Control plane learning of end-host Layer 2 and Layer 3 reachability information is used to build a more robust and scalable VXLAN overlay network.

An iBGP EVPN session is established between all the VTEPs, or with an EVPN route reflector (RR), in order to provide the full mesh connectivity required by iBGP peering rules.

After the BGP EVPN sessions are established, the VTEPs exchange MAC-VNI or MAC-IP bindings as part of the BGP NLRI.
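
The effect of exchanging these bindings can be sketched as follows. This is a deliberately simplified model and not the real BGP encoding: the class below keeps only the fields mentioned in the text, and all names and addresses are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class MacIpBinding:
    """Simplified stand-in for a MAC-VNI or MAC-IP binding carried in BGP."""
    mac: str
    vni: int
    next_hop: str               # IP address of the advertising VTEP
    ip: Optional[str] = None    # set only for MAC-IP bindings


def install(bindings: Iterable[MacIpBinding]):
    """Build forwarding state from received bindings, with no flooding."""
    mac_table: Dict[Tuple[int, str], str] = {}   # (VNI, MAC) -> remote VTEP
    host_table: Dict[str, Tuple[str, str]] = {}  # host IP -> (MAC, remote VTEP)
    for b in bindings:
        mac_table[(b.vni, b.mac)] = b.next_hop
        if b.ip is not None:
            host_table[b.ip] = (b.mac, b.next_hop)
    return mac_table, host_table


received = [
    MacIpBinding(mac="00:00:00:00:00:aa", vni=10, next_hop="192.168.1.1",
                 ip="10.1.1.1"),
    MacIpBinding(mac="00:00:00:00:00:cc", vni=10, next_hop="192.168.3.3"),
]
mac_table, host_table = install(received)
print(mac_table)
print(host_table)
```

Unlike flood and learn, the tables are populated before any data packet for those hosts has been seen.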

Advantages of using BGP for VXLAN :

Minimizes network flooding through protocol-driven host MAC/IP distribution and ARP suppression on the local VTEPs.

Provides optimal forwarding for east-west and north-south traffic with the distributed anycast gateway function.

Provides VTEP peer discovery and authentication, which mitigates the risk of rogue VTEPs in the VXLAN overlay network.

Distributed Anycast Gateway :

Distributed anycast gateway refers to the use of anycast gateway addressing and an overlay network in order to provide a distributed control plane that governs the forwarding of frames within and across a Layer 3 core network.

The distributed anycast gateway functionality enables transparent VM mobility and optimal east-west routing by configuring the leaf switches with the same gateway IP and MAC address.

The main benefit of the distributed anycast gateway is that a host or VM uses the same default gateway IP and MAC address no matter which leaf it is connected to.

Traffic forwarding in Leaf and spine :

With the leaf and spine topology, there are various traffic forwarding combinations. Based on the forwarding types the distributed anycast gateway plays its role in one of these manners.

Intra Subnet and non IP traffic : For host-to-host communication that is intra-subnet or non-IP, the destination MAC address in the ingress frame is the target end host's MAC address. This traffic is bridged between VLAN and VNI on the ingress and egress VTEPs.

Inter Subnet IP traffic : The destination MAC address is the default gateway MAC address. This traffic gets routed. On the egress switch, however, there are two possible forwarding behaviours: the traffic can be either routed or bridged.
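
The ingress decision between these two cases comes down to a comparison against the gateway MAC. A minimal sketch, assuming a made-up anycast gateway MAC value and a hypothetical function name:

```python
# Hypothetical anycast gateway MAC; the same value is used on every leaf.
ANYCAST_GATEWAY_MAC = "00:00:11:11:22:22"


def ingress_action(dst_mac: str) -> str:
    """Decide how the ingress leaf handles a frame received from a host."""
    if dst_mac.lower() == ANYCAST_GATEWAY_MAC:
        return "route"    # addressed to the default gateway: inter-subnet
    return "bridge"       # addressed to an end host: intra-subnet or non-IP


print(ingress_action("00:00:00:00:00:bb"))   # frame to another host
print(ingress_action("00:00:11:11:22:22"))   # frame to the default gateway
```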

In order to configure the distributed anycast gateway, all the leaf switches or VTEPs must be configured with the global command “fabric forwarding anycast-gateway-mac”, where the MAC address is a statically assigned address used across all switches by the anycast gateway.

ARP Suppression :

A host sends a GARP when it comes online.

The local leaf node receives the GARP, creates a local ARP cache entry, and advertises it to the other leaf nodes through BGP as a route type 2.

Remote leaf nodes put the IP-MAC information into their remote ARP cache and suppress incoming ARP requests for this IP.

If the IP information is not found in the ARP suppression cache table, the VTEP floods the ARP request to the other VTEPs.
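
The steps above can be sketched as a small cache with a lookup on every incoming ARP request. The class and method names are hypothetical and the BGP advertisement is reduced to a returned tuple; this illustrates the logic only, not a vendor implementation.

```python
from typing import Dict, Optional, Tuple


class ArpSuppressionCache:
    """Per-leaf cache of IP to MAC bindings used to answer ARP locally."""

    def __init__(self) -> None:
        self.entries: Dict[str, str] = {}   # host IP -> host MAC

    def learn_local(self, ip: str, mac: str) -> Tuple[str, str]:
        """Handle a GARP from an attached host.

        Returns the binding that the leaf would advertise to the other
        leaf nodes through BGP.
        """
        self.entries[ip] = mac
        return (ip, mac)

    def learn_remote(self, ip: str, mac: str) -> None:
        """Install a binding received from another leaf through BGP."""
        self.entries[ip] = mac

    def handle_arp_request(self, target_ip: str) -> Tuple[str, Optional[str]]:
        """Answer from the cache if possible, otherwise flood the request."""
        mac = self.entries.get(target_ip)
        if mac is not None:
            return ("reply_locally", mac)   # request is suppressed
        return ("flood", None)              # unknown IP: send to other VTEPs


leaf1, leaf2 = ArpSuppressionCache(), ArpSuppressionCache()
advertised = leaf1.learn_local("10.1.1.1", "00:00:00:00:00:aa")
leaf2.learn_remote(*advertised)

print(leaf2.handle_arp_request("10.1.1.1"))   # answered from the cache
print(leaf2.handle_arp_request("10.1.1.9"))   # not cached, so flooded
```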
