.. contents:: Table of Contents
   :depth: 6

================================================
VNI based L2 switching, L3 forwarding and NATing
================================================

https://git.opendaylight.org/gerrit/#/q/topic:vni-based-l2-l3-nat

**Important**: All gerrit links raised for this feature will have the topic name **vni-based-l2-l3-nat**.

This feature attempts to realize the use of the VxLAN VNI (Virtual Network Identifier) for VxLAN
tenant traffic flowing on the cloud data-network. This is applicable to L2 switching, L3 forwarding
and NATing for all VxLAN based provider networks. In doing so, it eliminates the presence of
``LPort tags``, ``ELAN tags`` and ``MPLS labels`` on the wire and instead replaces them with VNIs
supplied by the tenant's OpenStack. This will be done selectively for the use-cases covered by this
spec and hence its implementation won't completely remove the usage of the above entities.

The usage of ``LPort tags`` and ``ELAN tags`` within an OVS datapath (not on the wire) of the
hypervisor will be retained, as eliminating it completely is a large redesign and can be pursued
incrementally later. This spec is the first step in the direction of enforcing datapath semantics
that use tenant supplied VNI values on VxLAN type networks created by tenants in OpenStack Neutron.

Problem description
===================

The OpenDaylight NetVirt service today supports the following types of networks:

* Flat
* VLAN
* VxLAN
* GRE

Amongst these, VxLAN-based overlay is supported only for traffic within the DataCenter. External
network accesses over the DC-Gateway are supported via VLAN or GRE type external networks. For the
rest of the traffic over the DC-Gateway, the only supported overlay is GRE.

Today, for VxLAN networks created by the tenant, MPLS labels are generated and used by the L3
forwarding service. Such labels are re-used for inter-DC use-cases with BGPVPN as well. This is not
in accordance with the datapath semantics expected from an orchestration point of view.

**This spec attempts to change the datapath semantics by enforcing the VNIs** (unique for every VxLAN enabled network in the cloud) **as dictated by the tenant's OpenStack configuration for L2 switching, L3 forwarding and NATing**.

This implementation will remove the reliance on using the following (on the wire) within the
DataCenter:

* Labels for L3 forwarding
* LPort tags for L2 switching

More specifically, the traffic from the source VM will be routed in the source OVS by the
L3VPN / ELAN pipeline. After that, the packet will travel as a switched packet in the VxLAN underlay
within the DC, carrying the VNI in the VxLAN header instead of an MPLS label / LPort tag. In the
destination OVS, the packet will be collected and sent to the destination VM through the existing
ELAN pipeline.

In the nodes themselves, the LPort tag will continue to be used when pushing the packet from the
ELAN / L3VPN pipeline towards the VM, as ACLService continues to use ``LPort tags``. Similarly,
``ELAN tags`` will continue to be used for handling L2 broadcast packets:

* locally generated in the OVS datapath
* remotely received from another OVS datapath via internal VxLAN tunnels

The LPort tag uses 8 bits and the ELAN tag uses 21 bits in the metadata. The existing use of both in
the metadata will remain unaffected.

In Scope
--------

Since VNIs are provisioned only for VxLAN based underlays, this feature has in its scope the
use-cases pertaining to **intra-DC connectivity over internal VxLAN tunnels only**.
On the cloud data network wire, all the VxLAN traffic for basic L2 switching within a VxLAN network
and for L3 forwarding across VxLAN-type networks using routers and BGPVPNs will use the tenant
supplied VNI values of those VxLAN networks. Inter-DC connectivity over external VxLAN tunnels is
covered by the EVPN_RT5_ spec.

Out of Scope
------------

* Complete removal of the use of ``LPort tags`` everywhere in ODL:
  use of ``LPort tags`` within the OVS datapath of a hypervisor, for steering traffic to the right
  virtual endpoint on that hypervisor (note: not on the wire), will be retained.

* Complete removal of the use of ``ELAN tags`` everywhere in ODL:
  use of ``ELAN tags`` within the OVS datapath to handle local/remote L2 broadcasts (note: not on
  the wire) will be retained.

* Complete removal of the use of ``MPLS labels`` everywhere in ODL:
  use of ``MPLS labels`` for realizing inter-DC communication over BGPVPN will be retained.

* Intra-DC NAT usecase where no explicit Internet VPN is created for VxLAN based external provider
  networks: detailed further in the Intra DC subsection of the NAT section below.

Complete removal of the use of ``LPort tags``, ``ELAN tags`` and ``MPLS labels`` for VxLAN-type
networks has large scale design/pipeline implications and thus needs to be attempted as future
initiatives via respective specs.

Use Cases
---------

This feature involves amendments/testing pertaining to the following:

L2 switching use cases
++++++++++++++++++++++

#. L2 Unicast frames exchanged within an OVS datapath
#. L2 Unicast frames exchanged over OVS datapaths that are on different hypervisors
#. L2 Broadcast frames transmitted within an OVS datapath
#. L2 Broadcast frames received from remote OVS datapaths

L3 forwarding use cases
+++++++++++++++++++++++

#. Router realized using VNIs for networks attached to a new router (with networks having
   pre-created VMs)
#. Router realized using VNIs for networks attached to a new router (with new VMs booted later on
   the network)
#. Router updated with one or more extra route(s) to an existing VM.
#. Router updated to remove one or more previously added extra routes.
#. Network associated BGPVPNs.

   * intra DC - destination network VNI to be used over the VxLAN tunnel.
   * inter DC - MPLS label of the destination IP to be used over the GRE tunnel.

#. Router associated BGPVPNs.

   * intra DC - destination network VNI to be used over the VxLAN tunnel.
   * inter DC - MPLS label of the destination IP to be used over the GRE tunnel.

#. Dual-stack routing with Router associated BGPVPN.

   * intra DC - IPv4 and IPv6 packets over VxLAN tunnels carry the VNI enforced by OpenStack.
   * inter DC - MPLS label of the destination IP to be used over the GRE tunnel.

#. Router associated BGPVPNs with import/export Route Targets of another VPN.
#. Traffic between a Trunk port and a Subport over a BGPVPN will carry the VNI enforced on the
   respective networks to which the Trunk and Subports belong. This will be applied universally,
   regardless of whether such trunk ports and subports were network-associated and/or
   router-associated to the BGPVPN.
#. For multi-segment networks that have one of their segments as a VxLAN network, the VNI of that
   VxLAN segment will be enforced in the dataplane over VxLAN tunnels.

NAT use cases
+++++++++++++

The provider network types for external networks supported today are:

* External VLAN Provider Networks (transparent Internet VPN)
* External Flat Networks (transparent Internet VPN)
* Tenant-orchestrated Internet VPN of type GRE (actually MPLSOverGRE)

Following are the SNAT/DNAT use-cases applicable to the network types listed above:

#. SNAT functionality.
#. DNAT functionality.
#. DNAT to DNAT functionality (Intra DC)

   * FIP VM to FIP VM on the same hypervisor
   * FIP VM to FIP VM on different hypervisors

#. SNAT to DNAT functionality (Intra DC)

   * Non-FIP VM to FIP VM on the same NAPT hypervisor
   * Non-FIP VM to FIP VM on the same hypervisor, but NAPT on a different hypervisor
   * Non-FIP VM to FIP VM on different hypervisors (with NAPT on the FIP VM hypervisor)
   * Non-FIP VM to FIP VM on different hypervisors (with NAPT on the Non-FIP VM hypervisor)

Proposed change
===============

The following components within the OpenDaylight Controller need to be enhanced:

* NeutronVPN Manager
* ELAN Manager
* VPN Engine (VPN Manager, VPN Interface Manager and VPN Subnet Route Handler)
* FIB Manager
* NAT Service

Pipeline changes
----------------

L2 Switching
++++++++++++

Unicast
^^^^^^^

Within hypervisor
~~~~~~~~~~~~~~~~~

There are no explicit pipeline changes for this use-case.

Across hypervisors
~~~~~~~~~~~~~~~~~~

* `Ingress OVS`

  Instead of setting the destination LPort tag, the destination network VNI will be set in the
  ``tun_id`` field in ``L2_DMAC_FILTER_TABLE`` (table 51) while egressing the packet on the
  tunnel port.

  The modifications in flows and groups on the ingress OVS are illustrated below:

  .. code-block:: bash
     :emphasize-lines: 8

     cookie=0x8000000, duration=65.484s, table=0, n_packets=23, n_bytes=2016, priority=4,in_port=6 actions=write_metadata:0x30000000000/0xffffff0000000001,goto_table:17
     cookie=0x6900000, duration=63.106s, table=17, n_packets=23, n_bytes=2016, priority=1,metadata=0x30000000000/0xffffff0000000000 actions=write_metadata:0x2000030000000000/0xfffffffffffffffe,goto_table:40
     cookie=0x6900000, duration=64.135s, table=40, n_packets=4, n_bytes=392, priority=61010,ip,dl_src=fa:16:3e:86:59:fd,nw_src=12.1.0.4 actions=ct(table=41,zone=5002)
     cookie=0x6900000, duration=5112.542s, table=41, n_packets=21, n_bytes=2058, priority=62020,ct_state=-new+est-rel-inv+trk actions=resubmit(,17)
     cookie=0x8040000, duration=62.125s, table=17, n_packets=15, n_bytes=854, priority=6,metadata=0x6000030000000000/0xffffff0000000000 actions=write_metadata:0x700003138a000000/0xfffffffffffffffe,goto_table:48
     cookie=0x8500000, duration=5113.124s, table=48, n_packets=24, n_bytes=3044, priority=0 actions=resubmit(,49),resubmit(,50)
     cookie=0x805138a, duration=62.163s, table=50, n_packets=15, n_bytes=854, priority=20,metadata=0x3138a000000/0xfffffffff000000,dl_src=fa:16:3e:86:59:fd actions=goto_table:51
     cookie=0x803138a, duration=62.163s, table=51, n_packets=6, n_bytes=476, priority=20,metadata=0x138a000000/0xffff000000,dl_dst=fa:16:3e:31:fb:91 actions=set_field:**0x710**->tun_id,output:1
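  The programming above can be sanity-checked directly on the hypervisor. The snippet below is a
  minimal, illustrative verification only; the integration bridge name ``br-int``, the VNI value
  ``0x710`` taken from the dump above, and the underlay interface name are assumptions to be
  adapted to the actual deployment.

  .. code-block:: bash

     # Confirm that table 51 now sets the network VNI (not an LPort tag) before the tunnel output
     sudo ovs-ofctl -OOpenFlow13 dump-flows br-int table=51 | grep 'set_field:0x710->tun_id'

     # Observe the VNI actually carried on the wire (VXLAN uses UDP port 4789)
     sudo tcpdump -nei <underlay-interface> udp port 4789 -c 5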
* `Egress OVS`

  On the egress OVS, for the packets coming in via the internal VxLAN tunnel (OVS - OVS),
  ``INTERNAL_TUNNEL_TABLE`` currently matches on the destination LPort tag for unicast packets.
  Since the incoming packets will now contain the network VNI in the VxLAN header, the
  ``INTERNAL_TUNNEL_TABLE`` will match on this VNI, set the ELAN tag in the metadata and forward
  the packet to ``L2_DMAC_FILTER_TABLE`` so as to reach the destination VM via the ELAN pipeline.

  The modifications in flows and groups on the egress OVS are illustrated below:

  .. code-block:: bash
     :emphasize-lines: 2-7

     cookie=0x8000001, duration=5136.996s, table=0, n_packets=12601, n_bytes=899766, priority=5,in_port=1,actions=write_metadata:0x10000000001/0xfffff0000000001,goto_table:36
     cookie=0x9000004, duration=1145.594s, table=36, n_packets=15, n_bytes=476, priority=5,**tun_id=0x710,actions=write_metadata:0x138a000001/0xfffffffff000000,goto_table:51**
     cookie=0x803138a, duration=62.163s, table=51, n_packets=9, n_bytes=576, priority=20,metadata=0x138a000001/0xffff000000,dl_dst=fa:16:3e:86:59:fd actions=load:0x300->NXM_NX_REG6[],resubmit(,220)
     cookie=0x6900000, duration=63.122s, table=220, n_packets=9, n_bytes=1160, priority=6,reg6=0x300 actions=load:0x70000300->NXM_NX_REG6[],write_metadata:0x7000030000000000/0xfffffffffffffffe,goto_table:251
     cookie=0x6900000, duration=65.479s, table=251, n_packets=8, n_bytes=392, priority=61010,ip,dl_dst=fa:16:3e:86:59:fd,nw_dst=12.1.0.4 actions=ct(table=252,zone=5002)
     cookie=0x6900000, duration=5112.299s, table=252, n_packets=19, n_bytes=1862, priority=62020,ct_state=-new+est-rel-inv+trk actions=resubmit(,220)
     cookie=0x8000007, duration=63.123s, table=220, n_packets=8, n_bytes=1160, priority=7,reg6=0x70000300 actions=output:6

Broadcast
^^^^^^^^^

Across hypervisors
~~~~~~~~~~~~~~~~~~

The ARP broadcast by the VM will be a (local + remote) broadcast. For the local broadcast on the
VM's OVS itself, the packet will continue to get flooded to all the VM ports by setting the
destination LPort tag in the local broadcast group. Hence, there are no explicit pipeline changes
when a packet is transmitted within the source OVS via a local broadcast.

The changes in the pipeline for the remote broadcast are illustrated below:

* `Ingress OVS`

  Instead of setting the ELAN tag, the network VNI will be set in the ``tun_id`` field as part of
  the bucket actions in the remote broadcast group while egressing the packet on the tunnel port.

  The modifications in flows and groups on the ingress OVS are illustrated below:
  .. code-block:: bash
     :emphasize-lines: 10

     cookie=0x8000000, duration=65.484s, table=0, n_packets=23, n_bytes=2016, priority=4,in_port=6 actions=write_metadata:0x30000000000/0xffffff0000000001,goto_table:17
     cookie=0x6900000, duration=63.106s, table=17, n_packets=23, n_bytes=2016, priority=1,metadata=0x30000000000/0xffffff0000000000 actions=write_metadata:0x2000030000000000/0xfffffffffffffffe,goto_table:40
     cookie=0x6900000, duration=64.135s, table=40, n_packets=4, n_bytes=392, priority=61010,ip,dl_src=fa:16:3e:86:59:fd,nw_src=12.1.0.4 actions=ct(table=41,zone=5002)
     cookie=0x6900000, duration=5112.542s, table=41, n_packets=21, n_bytes=2058, priority=62020,ct_state=-new+est-rel-inv+trk actions=resubmit(,17)
     cookie=0x8040000, duration=62.125s, table=17, n_packets=15, n_bytes=854, priority=6,metadata=0x6000030000000000/0xffffff0000000000 actions=write_metadata:0x700003138a000000/0xfffffffffffffffe,goto_table:48
     cookie=0x8500000, duration=5113.124s, table=48, n_packets=24, n_bytes=3044, priority=0 actions=resubmit(,49),resubmit(,50)
     cookie=0x805138a, duration=62.163s, table=50, n_packets=15, n_bytes=854, priority=20,metadata=0x3138a000000/0xfffffffff000000,dl_src=fa:16:3e:86:59:fd actions=goto_table:51
     cookie=0x8030000, duration=5112.911s, table=51, n_packets=18, n_bytes=2568, priority=0 actions=goto_table:52
     cookie=0x870138a, duration=62.163s, table=52, n_packets=9, n_bytes=378, priority=5,metadata=0x138a000000/0xffff000001 actions=write_actions(group:210004)
     group_id=210004,type=all,bucket=actions=group:210003,bucket=actions=set_field:**0x710**->tun_id,output:1

* `Egress OVS`

  On the egress OVS, for the packets coming in via the internal VxLAN tunnel (OVS - OVS),
  ``INTERNAL_TUNNEL_TABLE`` currently matches on the ELAN tag for broadcast packets. Since the
  incoming packets will now contain the network VNI in the VxLAN header, the
  ``INTERNAL_TUNNEL_TABLE`` will match on this VNI, set the ELAN tag in the metadata and forward
  the packet to ``L2_DMAC_FILTER_TABLE`` to be broadcast via the local broadcast groups traversing
  the ELAN pipeline. The ``TUNNEL_INGRESS_BIT`` being set in the ``CLASSIFIER_TABLE`` (table 0)
  ensures that the packet is always sent to the local broadcast group only and hence remains
  within the OVS. This is necessary to avoid a switching loop back to the source OVS.

  The modifications in flows and groups on the egress OVS are illustrated below:
  .. code-block:: bash
     :emphasize-lines: 2-10

     cookie=0x8000001, duration=5136.996s, table=0, n_packets=12601, n_bytes=899766, priority=5,in_port=1,actions=write_metadata:0x10000000001/0xfffff0000000001,goto_table:36
     cookie=0x9000004, duration=1145.594s, table=36, n_packets=15, n_bytes=476, priority=5,**tun_id=0x710,actions=write_metadata:0x138a000001/0xfffffffff000000,goto_table:51**
     cookie=0x8030000, duration=5137.609s, table=51, n_packets=9, n_bytes=1293, priority=0 actions=goto_table:52
     cookie=0x870138a, duration=1145.592s, table=52, n_packets=0, n_bytes=0, priority=5,metadata=0x138a000001/0xffff000001 actions=apply_actions(group:210003)
     group_id=210003,type=all,bucket=actions=set_field:0x4->tun_id,resubmit(,55)
     cookie=0x8800004, duration=1145.594s, table=55, n_packets=9, n_bytes=378, priority=9,tun_id=0x4,actions=load:0x400->NXM_NX_REG6[],resubmit(,220)
     cookie=0x6900000, duration=63.122s, table=220, n_packets=9, n_bytes=1160, priority=6,reg6=0x300 actions=load:0x70000300->NXM_NX_REG6[],write_metadata:0x7000030000000000/0xfffffffffffffffe,goto_table:251
     cookie=0x6900000, duration=65.479s, table=251, n_packets=8, n_bytes=392, priority=61010,ip,dl_dst=fa:16:3e:86:59:fd,nw_dst=12.1.0.4 actions=ct(table=252,zone=5002)
     cookie=0x6900000, duration=5112.299s, table=252, n_packets=19, n_bytes=1862, priority=62020,ct_state=-new+est-rel-inv+trk actions=resubmit(,220)
     cookie=0x8000007, duration=63.123s, table=220, n_packets=8, n_bytes=1160, priority=7,reg6=0x70000300 actions=output:6

The ARP response will be a unicast packet, and as indicated above, for unicast packets there are no
explicit pipeline changes.

L3 Forwarding
+++++++++++++

Between VMs on a single OVS
^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are no explicit pipeline changes for this use-case. The destination LPort tag will continue
to be set in the nexthop group, since it is used by the ACL service when the
``EGRESS_DISPATCHER_TABLE`` sends the packet to ``EGRESS_ACL_TABLE``.

Between VMs on two different OVS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

L3 forwarding between VMs on two different hypervisors is asymmetric forwarding, since the traffic
is routed in the source OVS datapath while it is switched over the wire and then all the way to the
destination VM on the destination OVS datapath.

VM sourcing the traffic (Ingress OVS)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``L3_FIB_TABLE`` will set the destination network VNI in the ``tun_id`` field instead of the MPLS
label.

.. code-block:: bash
   :emphasize-lines: 3

   CLASSIFIER_TABLE => DISPATCHER_TABLE => INGRESS_ACL_TABLE => DISPATCHER_TABLE =>
   L3_GW_MAC_TABLE =>
   L3_FIB_TABLE (set destination MAC, **set tunnel-ID as destination network VNI**) =>
   Output to tunnel port

The modifications in flows and groups on the ingress OVS are illustrated below:
.. code-block:: bash
   :emphasize-lines: 11

   cookie=0x8000000, duration=128.140s, table=0, n_packets=25, n_bytes=2716, priority=4,in_port=5 actions=write_metadata:0x50000000000/0xffffff0000000001,goto_table:17
   cookie=0x8000000, duration=4876.599s, table=17, n_packets=0, n_bytes=0, priority=0,metadata=0x5000000000000000/0xf000000000000000 actions=write_metadata:0x6000000000000000/0xf000000000000000,goto_table:80
   cookie=0x1030000, duration=4876.563s, table=80, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,17)
   cookie=0x6900000, duration=123.870s, table=17, n_packets=25, n_bytes=2716, priority=1,metadata=0x50000000000/0xffffff0000000000 actions=write_metadata:0x2000050000000000/0xfffffffffffffffe,goto_table:40
   cookie=0x6900000, duration=126.056s, table=40, n_packets=15, n_bytes=1470, priority=61010,ip,dl_src=fa:16:3e:63:ea:0c,nw_src=10.1.0.4 actions=ct(table=41,zone=5001)
   cookie=0x6900000, duration=4877.057s, table=41, n_packets=17, n_bytes=1666, priority=62020,ct_state=-new+est-rel-inv+trk actions=resubmit(,17)
   cookie=0x6800001, duration=123.485s, table=17, n_packets=28, n_bytes=3584, priority=2,metadata=0x2000050000000000/0xffffff0000000000 actions=write_metadata:0x5000050000000000/0xfffffffffffffffe,goto_table:60
   cookie=0x6800000, duration=3566.900s, table=60, n_packets=24, n_bytes=2184, priority=0 actions=resubmit(,17)
   cookie=0x8000001, duration=123.456s, table=17, n_packets=17, n_bytes=1554, priority=5,metadata=0x5000050000000000/0xffffff0000000000 actions=write_metadata:0x60000500000222e0/0xfffffffffffffffe,goto_table:19
   cookie=0x8000009, duration=124.815s, table=19, n_packets=15, n_bytes=1470, priority=20,metadata=0x222e0/0xfffffffe,dl_dst=fa:16:3e:51:da:ee actions=goto_table:21
   cookie=0x8000003, duration=125.568s, table=21, n_packets=9, n_bytes=882, priority=42,ip,metadata=0x222e0/0xfffffffe,nw_dst=12.1.0.3 actions=**set_field:0x710->tun_id**,set_field:fa:16:3e:31:fb:91->eth_dst,output:1

The ingress OVS traffic flows will remain the same as above for Router Associated and Network
Associated VPNs.
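A quick way to confirm the new behaviour on the sourcing hypervisor is to check what the
``L3_FIB_TABLE`` programs for intra-DC destinations. The commands below are only an illustrative
sketch; the bridge name ``br-int`` and the OpenFlow version flag are assumptions based on a typical
NetVirt deployment.

.. code-block:: bash

   # Hypothetical check: for destinations on VxLAN networks, the table 21 (L3_FIB_TABLE) entries
   # should now set the destination network VNI as tun_id instead of pushing an MPLS label.
   sudo ovs-ofctl -OOpenFlow13 dump-flows br-int table=21 | grep 'tun_id'
   sudo ovs-ofctl -OOpenFlow13 dump-flows br-int table=21 | grep 'push_mpls'   # should remain only for MPLSOverGRE (inter-DC) routes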
VM receiving the traffic (Egress OVS)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On the egress OVS, for the packets coming in via the VxLAN tunnel, ``INTERNAL_TUNNEL_TABLE``
currently matches on the MPLS label and sends the packet to the nexthop group to be taken to the
destination VM via ``EGRESS_ACL_TABLE``. Since the incoming packets will now contain the network
VNI in the VxLAN header, the ``INTERNAL_TUNNEL_TABLE`` will match on the VNI, set the ELAN tag in
the metadata and forward the packet to ``L2_DMAC_FILTER_TABLE``, from where it will be taken to the
destination VM via the ELAN pipeline.

.. code-block:: bash
   :emphasize-lines: 1

   CLASSIFIER_TABLE => INTERNAL_TUNNEL_TABLE (Match on network VNI, set ELAN tag in the metadata) =>
   L2_DMAC_FILTER_TABLE (Match on destination MAC) => EGRESS_DISPATCHER_TABLE =>
   EGRESS_ACL_TABLE => Output to destination VM port

The modifications in flows and groups on the egress OVS are illustrated below:

.. code-block:: bash
   :emphasize-lines: 2-7

   cookie=0x8000001, duration=4918.647s, table=0, n_packets=12292, n_bytes=877616, priority=5,in_port=1 actions=write_metadata:0x10000000001/0xfffff0000000001,goto_table:36
   cookie=0x9000004, duration=927.245s, table=36, n_packets=8234, n_bytes=52679, priority=5,**tun_id=0x710,actions=write_metadata:0x138a000001/0xfffffffff000000,goto_table:51**
   cookie=0x803138a, duration=62.163s, table=51, n_packets=9, n_bytes=576, priority=20,metadata=0x138a000001/0xffff000000,dl_dst=fa:16:3e:86:59:fd actions=load:0x300->NXM_NX_REG6[],resubmit(,220)
   cookie=0x6900000, duration=63.122s, table=220, n_packets=9, n_bytes=1160, priority=6,reg6=0x300 actions=load:0x70000300->NXM_NX_REG6[],write_metadata:0x7000030000000000/0xfffffffffffffffe,goto_table:251
   cookie=0x6900000, duration=65.479s, table=251, n_packets=8, n_bytes=392, priority=61010,ip,dl_dst=fa:16:3e:86:59:fd,nw_dst=12.1.0.4 actions=ct(table=252,zone=5002)
   cookie=0x6900000, duration=5112.299s, table=252, n_packets=19, n_bytes=1862, priority=62020,ct_state=-new+est-rel-inv+trk actions=resubmit(,220)
   cookie=0x8000007, duration=63.123s, table=220, n_packets=8, n_bytes=1160, priority=7,reg6=0x70000300 actions=output:6

NAT Service
+++++++++++

For NAT, we need VNIs to be used in two scenarios:

* When a packet is forwarded from the non-NAPT to the NAPT hypervisor (VNI per router)
* Between hypervisors (intra DC) over the Internet VPN (VNI per Internet VPN)

Hence, a pool titled ``opendaylight-vni-ranges``, non-overlapping with the OpenStack Neutron
vni_ranges configuration, needs to be configured by the OpenDaylight Controller administrator.

This ``opendaylight-vni-ranges`` pool will be used to carve out a unique VNI per router, which is
then used in the datapath for traffic forwarded from the non-NAPT to the NAPT switch for that
router. Similarly, for MPLSOverGRE based external networks, the ``opendaylight-vni-ranges`` pool
will be used to carve out a unique VNI per Internet VPN (GRE provider type), which is then used in
the datapath for ``SNAT-to-DNAT`` and ``DNAT-to-DNAT`` traffic within the DataCenter. Only one
external network can be associated to an Internet VPN today, and this spec doesn't attempt to
address that limitation.

A NeutronVPN configuration API will be exposed to the administrator to configure the lower and
upper limits for this pool. If the administrator doesn't configure this explicitly, the pool will be
created with default values of lower limit 70000 and upper limit 100000 during the first NAT
session configuration.

**FIB Manager changes**:

For an external network of type GRE, it is required to use the ``Internet VPN VNI`` for intra-DC
communication, but we still require ``MPLS labels`` to reach SNAT/DNAT VMs from external entities
via MPLSOverGRE. Hence, we will make use of the ``l3vni`` attribute added to the fibEntries
container as part of the EVPN_RT5_ spec. NAT will populate both ``label`` and ``l3vni`` values for
fibEntries created for floating IPs and external fixed IPs with an external network of type GRE.
The ``l3vni`` value will be used while programming remote FIB flow entries (on all the switches
which are part of the same VRF), but the MPLS label will still be used to advertise prefixes and in
``L3_LFIB_TABLE`` to take the packet to ``INBOUND_NAPT_TABLE`` and ``PDNAT_TABLE``.
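As an illustration of the dual ``label``/``l3vni`` programming described above, an operator could
inspect the FIB entry created for a floating IP via RESTCONF. The snippet below is only a sketch:
the RESTCONF port and credentials, the ``odl-fib`` module name and its ``vrfTables`` list are
assumptions (the spec itself only names the fibEntries container), and the route distinguisher is a
placeholder.

.. code-block:: bash

   # Hypothetical example: dump the vrfTables entry for the Internet VPN's route distinguisher and
   # check that the floating-ip / external-fixed-ip vrfEntry carries both "label" and "l3vni".
   ODL_IP=<controller-ip>
   RD=<internet-vpn-route-distinguisher>
   curl -s -u admin:admin \
     "http://${ODL_IP}:8181/restconf/config/odl-fib:fibEntries/vrfTables/${RD}" | python -m json.tool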
For SNAT/DNAT use-cases, we have the following provider network types for external networks:

#. VLAN - not VNI based
#. Flat - not VNI based
#. VxLAN - VNI based (covered by the EVPN_RT5_ spec)
#. GRE - not VNI based (will continue to use ``MPLS labels``)

Inter DC
^^^^^^^^

SNAT
~~~~

* From a VM on a NAPT switch to reach the Internet, and reverse traffic reaching back to the VM

  There are no explicit pipeline changes.

* From a VM on a non-NAPT switch to reach the Internet, and reverse traffic reaching back to the VM

  On the non-NAPT switch, ``PSNAT_TABLE`` (table 26) will set the ``tun_id`` field to the
  ``Router Based VNI`` allocated from the pool and send the packet to the group that reaches the
  NAPT switch. On the NAPT switch, ``INTERNAL_TUNNEL_TABLE`` (table 36) will match on the
  ``tun_id`` field, which will be the ``Router Based VNI``, and send the packet to
  ``OUTBOUND_NAPT_TABLE`` (table 46) for SNAT translation and onwards to the Internet.

  + `Non-NAPT switch`

    .. code-block:: bash
       :emphasize-lines: 1

       cookie=0x8000006, duration=2797.179s, table=26, n_packets=47, n_bytes=3196, priority=5,ip,metadata=0x23a50/0xfffffffe actions=**set_field:0x710->tun_id**,group:202501
       group_id=202501,type=all,bucket=actions=output:1

  + `NAPT switch`

    .. code-block:: bash
       :emphasize-lines: 2

       cookie=0x8000001, duration=4918.647s, table=0, n_packets=12292, n_bytes=877616, priority=5,in_port=1,actions=write_metadata:0x10000000001/0xfffff0000000001,goto_table:36
       cookie=0x9000004, duration=927.245s, table=36, n_packets=8234, n_bytes=52679, priority=10,ip,**tun_id=0x710**,actions=write_metadata:0x23a50/0xfffffffe,goto_table:46

  As part of the response from the NAPT switch, the packet will be taken back to the non-NAPT
  switch after SNAT reverse translation using the destination VM's Network VNI.

DNAT
~~~~

There is no NAT-specific explicit pipeline change for DNAT traffic to the DC-Gateway.

Intra DC
^^^^^^^^

* VLAN Provider External Networks:

  VNI is not applicable on the external VLAN provider network. However, the Router VNI will be used
  for datapath traffic from the non-NAPT switch to the NAPT switch over the internal VxLAN tunnel.

* VxLAN Provider External Networks:

  + **Explicit creation of Internet VPN**: An L3VNI, mandatorily falling within the
    ``opendaylight-vni-ranges``, will be provided by the Cloud admin (or tenant). This VNI will be
    used uniformly for all packet transfer over the VxLAN wire for this Internet VPN (uniformly
    meaning all the traffic on internal or external VxLAN tunnels, except the non-NAPT to NAPT
    communication). This usecase is covered by the EVPN_RT5_ spec.

  + **No explicit creation of Internet VPN**: A transparent Internet VPN having the same UUID as
    the corresponding external network UUID is created implicitly, and the VNI configured for this
    external network should be used on the VxLAN wire. This usecase is **out of scope** for this
    spec, as indicated in the `Out of Scope`_ section.

* GRE Provider External Networks:

  ``Internet VPN VNI`` will be carved out per Internet VPN using ``opendaylight-vni-ranges`` to be
  used on the wire.

DNAT to DNAT
~~~~~~~~~~~~

* FIP VM to FIP VM on different hypervisors

  After DNAT translation on the first hypervisor ``DNAT-OVS-1``, the traffic will be sent to the
  ``L3_FIB_TABLE`` (table 21) in order to reach the floating IP VM on the second hypervisor
  ``DNAT-OVS-2``. Here, the ``tun_id`` action field will be set to the ``Internet VPN VNI`` value.

  + `DNAT-OVS-1`

    .. code-block:: bash
       :emphasize-lines: 1

       cookie=0x8000003, duration=518.567s, table=21, n_packets=0, n_bytes=0, priority=42,ip,metadata=0x222e8/0xfffffffe,nw_dst=172.160.0.200 actions=**set_field:0x11178->tun_id**,output:9

  + `DNAT-OVS-2`
    .. code-block:: bash
       :emphasize-lines: 1-2, 4

       cookie=0x9011177, duration=411685.075s, table=36, n_packets=2, n_bytes=196, priority=**6**,**tun_id=0x11178** actions=resubmit(,25)
       cookie=0x9011179, duration=478573.171s, table=36, n_packets=2, n_bytes=140, priority=5,**tun_id=0x11178**,actions=goto_table:44
       cookie=0x8000004, duration=408145.805s, table=25, n_packets=600, n_bytes=58064, priority=10,ip,nw_dst=172.160.0.100,**eth_dst=fa:16:3e:e6:e3:c6** actions=set_field:10.0.0.5->ip_dst,write_metadata:0x222e0/0xfffffffe,goto_table:27
       cookie=0x8000004, duration=408145.805s, table=25, n_packets=600, n_bytes=58064, priority=10,ip actions=goto_table:44

  First, the ``INTERNAL_TUNNEL_TABLE`` (table 36) will take the packet to the ``PDNAT_TABLE``
  (table 25) for an exact FIP match in ``PDNAT_TABLE``.

  - In case of a successful FIP match, ``PDNAT_TABLE`` will further match on the floating IP MAC.
    This is done as a security prerogative, since in DNAT use-cases the packet can land on the
    hypervisor directly from the external world; hence it is better to have a second match
    criterion.

  - In case of no match, the packet will be redirected to the SNAT pipeline towards the
    ``INBOUND_NAPT_TABLE`` (table 44). This is the use-case where ``DNAT-OVS-2`` also acts as the
    NAPT switch.

  In summary, on a given NAPT switch, if both DNAT and SNAT are configured, the incoming traffic
  will first be sent to the ``PDNAT_TABLE``, and if no FIP match is found, it will be forwarded to
  ``INBOUND_NAPT_TABLE`` for SNAT translation.

  As part of the response, the ``Internet VPN VNI`` will be used as the ``tun_id`` to reach the
  floating IP VM on ``DNAT-OVS-1``.

* FIP VM to FIP VM on the same hypervisor

  The pipeline changes will be similar to those for different hypervisors, the only difference
  being that ``INTERNAL_TUNNEL_TABLE`` will never be hit in this case.

SNAT to DNAT
~~~~~~~~~~~~

* Non-FIP VM to FIP VM on different hypervisors (with NAPT elected as the FIP VM hypervisor)

  The packet will be sent to the NAPT hypervisor from the non-FIP VM (for SNAT translation) using
  the ``Router VNI`` (as described in the `SNAT`_ section). As part of the response from the NAPT
  switch after SNAT reverse translation, the packet is forwarded to the non-FIP VM using the
  destination VM's Network VNI.

* Non-FIP VM to FIP VM on the same NAPT hypervisor

  There are no explicit pipeline changes for this use-case.

* Non-FIP VM to FIP VM on the same hypervisor, but a different hypervisor elected as NAPT switch

  + `NAPT hypervisor`

    The packet will be sent to the NAPT hypervisor from the non-FIP VM (for SNAT translation) using
    the ``Router VNI`` (as described in the `SNAT`_ section). On the NAPT switch, the
    ``INTERNAL_TUNNEL_TABLE`` will match on the ``Router VNI`` in the ``tun_id`` field and send the
    packet to ``OUTBOUND_NAPT_TABLE`` for SNAT translation (as described in the `SNAT`_ section).

    .. code-block:: bash
       :emphasize-lines: 1

       cookie=0x8000005, duration=5073.829s, table=36, n_packets=61, n_bytes=4610, priority=10,ip,**tun_id=0x11170**,actions=write_metadata:0x222e0/0xfffffffe,goto_table:46

    The packet will later be sent back to the FIP VM hypervisor from ``L3_FIB_TABLE`` with the
    ``tun_id`` field set to the ``Internet VPN VNI``.

    .. code-block:: bash
       :emphasize-lines: 1

       cookie=0x8000003, duration=518.567s, table=21, n_packets=0, n_bytes=0, priority=42,ip,metadata=0x222e8/0xfffffffe,nw_dst=172.160.0.200 actions=**set_field:0x11178->tun_id**,output:9

  + `FIP VM hypervisor`

    On reaching the FIP VM hypervisor, the packet will be sent for DNAT translation.
    The ``INTERNAL_TUNNEL_TABLE`` will match on the ``Internet VPN VNI`` in the ``tun_id`` field
    and send the packet to ``PDNAT_TABLE``.

    .. code-block:: bash
       :emphasize-lines: 1-2

       cookie=0x9011177, duration=411685.075s, table=36, n_packets=2, n_bytes=196, priority=**6**,**tun_id=0x11178**,actions=resubmit(,25)
       cookie=0x8000004, duration=408145.805s, table=25, n_packets=600, n_bytes=58064, priority=10,ip,nw_dst=172.160.0.100,**eth_dst=fa:16:3e:e6:e3:c6** actions=set_field:10.0.0.5->ip_dst,write_metadata:0x222e0/0xfffffffe,goto_table:27

    Upon FIP VM response, DNAT reverse translation happens and the traffic is sent back to the NAPT
    switch for SNAT translation. The ``L3_FIB_TABLE`` will be set with the ``Internet VPN VNI`` in
    the ``tun_id`` field.

    .. code-block:: bash
       :emphasize-lines: 1

       cookie=0x8000003, duration=95.300s, table=21, n_packets=2, n_bytes=140, priority=42,ip,metadata=0x222ea/0xfffffffe,nw_dst=172.160.0.3 actions=**set_field:0x11178->tun_id**,output:5

  + `NAPT hypervisor`

    On the NAPT hypervisor, the ``INTERNAL_TUNNEL_TABLE`` will match on the ``Internet VPN VNI`` in
    the ``tun_id`` field and send the packet to ``INBOUND_NAPT_TABLE`` for SNAT reverse translation
    (external fixed IP to VM IP). The packet will then be sent back to the non-FIP VM using the
    destination VM's Network VNI.

* Non-FIP VM to FIP VM on different hypervisors (with NAPT elected as the non-FIP VM hypervisor)

  After SNAT translation, the ``Internet VPN VNI`` will be used to reach the FIP VM. On the FIP VM
  hypervisor, the ``INTERNAL_TUNNEL_TABLE`` will take the packet to the ``PDNAT_TABLE``, matching
  on the ``Internet VPN VNI`` in the ``tun_id`` field for DNAT translation. Upon response from the
  FIP VM, DNAT reverse translation happens and the ``Internet VPN VNI`` is used to reach back to
  the non-FIP VM.

YANG changes
------------

* ``opendaylight-vni-ranges`` and ``enforce-openstack-semantics`` leaf elements will be added to
  the neutronvpn-config container in ``neutronvpn-config.yang``:

  + ``opendaylight-vni-ranges`` will be introduced to accept inputs for the VNI range pool from the
    configurator via the corresponding exposed REST API. In case this is not defined, the default
    value defined in ``netvirt-neutronvpn-config.xml`` will be used to create this pool.

  + ``enforce-openstack-semantics`` will be introduced to provide the flexibility to enable or
    disable OpenStack semantics in the dataplane for this feature. It defaults to true, meaning
    these semantics will be enforced by default. If it is set to false, the dataplane will continue
    to be programmed with LPort tags / ELAN tags for switching and with labels for routing
    use-cases. Once this feature stabilizes and the semantics are in place to use VNIs on the wire
    for BGPVPN based forwarding too, this config can be permanently removed if deemed fit.

  .. code-block:: none
     :caption: neutronvpn-config.yang
     :emphasize-lines: 5-12

     container neutronvpn-config {
         config true;
         ...
         ...
         leaf opendaylight-vni-ranges {
             type string;
             default "70000:99999";
         }
         leaf enforce-openstack-semantics {
             type boolean;
             default true;
         }
     }
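  As an illustrative, hypothetical example of how an administrator might override the default range
  via RESTCONF (it assumes the container above is exposed under a ``neutronvpn-config`` module name,
  the default RESTCONF port 8181 and default credentials; none of these are mandated by this spec):

  .. code-block:: bash

     # Sketch only: module/container path and credentials are assumptions, not confirmed here.
     curl -u admin:admin -X PUT \
       -H "Content-Type: application/json" \
       -d '{"neutronvpn-config": {"opendaylight-vni-ranges": "70000:99999", "enforce-openstack-semantics": true}}' \
       http://<controller-ip>:8181/restconf/config/neutronvpn-config:neutronvpn-config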
* Provider network-type and provider segmentation-ID need to be propagated to FIB Manager to
  manipulate flows based on the same. Hence:

  + A new grouping ``network-attributes`` will be introduced in ``neutronvpn.yang`` to hold the
    network type and segmentation ID. This grouping will replace the leaf node ``network-id`` in
    the ``subnetmaps`` MD-SAL configuration datastore:

    .. code-block:: none
       :caption: neutronvpn.yang
       :emphasize-lines: 1-27

       grouping network-attributes {
           leaf network-id {
               type yang:uuid;
               description "UUID representing the network";
           }
           leaf network-type {
               type enumeration {
                   enum "FLAT";
                   enum "VLAN";
                   enum "VXLAN";
                   enum "GRE";
               }
           }
           leaf segmentation-id {
               type uint32;
               description "Optional. Isolated segment on the physical network.
                   If segment-type is vlan, this ID is a vlan identifier.
                   If segment-type is vxlan, this ID is a vni.
                   If segment-type is flat/gre, this ID is set to 0";
           }
       }

       container subnetmaps {
           ...
           ...
           uses network-attributes;
       }

  + These attributes will be propagated to the VPN Manager module, upon addition of a router
    interface or addition of a subnet to a BGPVPN, via the ``subnet-added-to-vpn`` notification
    modelled in ``neutronvpn.yang``. Hence, the following node will be added:

    .. code-block:: none
       :caption: neutronvpn.yang
       :emphasize-lines: 5

       notification subnet-added-to-vpn {
           description "new subnet added to vpn";
           ...
           ...
           uses network-attributes;
       }

  + VpnSubnetRouteHandler will act on these notifications and store these attributes in the
    ``subnet-op-data`` MD-SAL operational datastore as described below. FIB Manager will retrieve
    the ``subnetID`` from the primary adjacency of the concerned VPN interface. This ``subnetID``
    will be used as the key to retrieve the ``network-attributes`` from the ``subnet-op-data``
    datastore.

    .. code-block:: none
       :caption: odl-l3vpn.yang
       :emphasize-lines: 1-10

       import neutronvpn {
           prefix nvpn;
           revision-date "2015-06-02";
       }

       container subnet-op-data {
           ...
           ...
           uses nvpn:network-attributes;
       }

* ``subnetID`` and ``nat-prefix`` leaf elements will be added to the ``prefix-to-interface``
  container in ``odl-l3vpn.yang``:

  + For NAT use-cases where the VRF entry is not always associated with a VPN interface (e.g. NAT
    entries such as floating IPs and router-gateway IPs for external VLAN / flat networks), the
    ``subnetID`` leaf element will be added to make it possible to retrieve the
    ``network-attributes``.

  + To distinguish a non-NAT prefix from a NAT prefix, the ``nat-prefix`` leaf element will be
    added. This is a boolean attribute indicating whether the prefix is a NAT prefix (meaning a
    floating IP, or an external fixed IP of a router gateway). The VRFEntry corresponding to the
    NAT prefix entries here may carry both the ``MPLS label`` and the ``Internet VPN VNI``. For
    SNAT-to-DNAT within the datacenter, where the Internet VPN contains an MPLSOverGRE based
    external network, this VRF entry will publish the ``MPLS label`` to BGP while the
    ``Internet VPN VNI`` (also known as ``L3VNI``) will be used to carry intra-DC traffic on the
    external segment within the datacenter.

* While constructing remote FIB flows, it is required to know the network type and the network's
  segmentation ID. Currently, these values are back-pulled from NeutronVPN's subnet-map, which is
  incorrect.

  + ``vpn-interface`` will be augmented and the ``prefix-to-interface`` container in
    ``odl-l3vpn.yang`` will be enhanced to hold additional network attributes: ``network-id``,
    ``network-type`` and ``segmentation-id``. These will be referred to in order to obtain the
    required network attributes during remote FIB flow construction.

    .. code-block:: none
       :caption: odl-l3vpn.yang
       :emphasize-lines: 1-4

       augment "/l3vpn:vpn-interfaces/l3vpn:vpn-interface" {
           ext:augment-identifier "networkParameters";
           uses nvpn:network-attributes;
       }
    .. code-block:: none
       :caption: odl-l3vpn.yang
       :emphasize-lines: 10-16

       container prefix-to-interface {
           config false;
           list vpn-ids {
               key vpn-id;
               leaf vpn-id {type uint32;}
               list prefixes {
                   key ip_address;
                   ...
                   ...
                   leaf subnet-id {
                       type yang:uuid;
                   }
                   uses nvpn:network-attributes;
                   leaf nat-prefix {
                       type boolean;
                       default false;
                   }
               }
           }
       }

Configuration impact
--------------------

* We have to make sure that we do not accept configuration of VxLAN type provider networks without
  the ``segmentation-ID`` available in them, since we are using it to represent the VNI on the wire
  and in the flows/groups.

Clustering considerations
-------------------------

No specific additional clustering considerations need to be adhered to.

Other Infra considerations
--------------------------

None.

Security considerations
-----------------------

None.

Scale and Performance Impact
----------------------------

None.

Targeted Release(s)
-------------------

Carbon.

Known Limitations
-----------------

None.

Alternatives
------------

N.A.

Usage
=====

Features to Install
-------------------

odl-netvirt-openstack

REST API
--------

No new changes to the existing REST APIs.

CLI
---

No new CLI is being added.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Abhinav Gupta
  Vivekanandan Narasimhan

Other contributors:
  Chetan Arakere Gowdru
  Karthikeyan Krishnan
  Yugandhar Sarraju
  Shaik Zakir Basha

Work Items
----------

Trello card: https://trello.com/c/PfARbEmU/84-enforce-vni-on-the-wire-for-l2-switching-l3-forwarding-and-nating-on-vxlan-overlay-networks

#. Code changes to alter the pipeline and end-to-end testing of the use-cases mentioned.
#. Add documentation.

Dependencies
============

This doesn't add any new dependencies.

Testing
=======

Unit Tests
----------

Appropriate UTs will be added for the new code coming in, once the framework is in place.

Integration Tests
-----------------

There won't be any Integration tests provided for this feature.

CSIT
----

New testcases will be added to validate the functionality for L2 switching, L3 forwarding and NAT
with OpenStack semantics set.

Documentation Impact
====================

This will require changes to the Developer Guide. The Developer Guide needs to capture how this
feature modifies the existing NetVirt L3 forwarding service implementation.

References
==========

* http://docs.opendaylight.org/en/latest/documentation.html
* https://wiki.opendaylight.org/view/Genius:Carbon_Release_Plan
* `EVPN_RT5 `_
* `Dual Stack `_
* `Multi-segment L2 configuration `_
* `Trunk/Sub-Port Support `_