Network-oriented programming

Openstack SDN - OpenContrail With BGP VPN

2018-01-02T00:00:00+00:00

In this post I’ll show how to build a dockerized OpenStack and OpenContrail lab, integrate it with Juniper MX80 DC-GW and demonstrate one of Contrail’s most interesting and unique features called BGP-as-a-Service.

Continuing on the trend started in my previous post about OpenDaylight, I’ll move on to the next open-source product that uses BGP VPNs for optimal North-South traffic forwarding. OpenContrail is one of the most popular SDN solutions for OpenStack. It was one of the first hybrid SDN solutions, offering both pure overlay and overlay/underlay integration. It is the default SDN platform of choice for Mirantis Cloud Platform, it has multiple large-scale deployments in companies like Workday and AT&T. I, personally, don’t have any production experience with OpenContrail, however my impression, based on what I’ve heard and seen in the last 2-3 years that I’ve been following Telco SDN space, is that OpenContrail is the most mature SDN platform for Telco NFVs not least because of its unique feature set.

During the time of production deployment at AT&T, Contrail has added a lot of features required by Telco NFVs like QoS, VLAN trunking and BGP-as-a-service. My first acquaintance with BGPaaS took place when I started working on Telco DCs and I remember being genuinely shocked when I first saw the requirement for dynamic routing exchange with VNFs. To me this seemed to break one of the main rules of cloud networking - a VM is not to have any knowledge or interaction with the underlay. I gradually went though all stages of grief, all the way to acceptance and although it still feels “wrong” now, I can at least understand why it’s needed and what are the pros/cons of different BGPaaS solutions.

BGP-as-a-Service

There’s a certain range of VNFs that may require to advertise a set of IP addresses into the existing VPNs inside Telco network. The most notable example is PGW inside EPC. I won’t pretend to be an expert in this field, but based on my limited understanding PGW needs to advertise IP networks into various customer VPNs, for example to connect private APNs to existing customer L3VPNs. Obviously, when this kind of network function gets virtualised, it still retains this requirement which now needs to be fulfilled by DC SDN.

This requirement catches a lot of big SDN vendors off guard and the best they come up with is connecting those VNFs, through VLANs, directly to underlay TOR switches. Although this solution is easy to implement, it has an incredible amount of drawbacks since a single VNF can now affect the stability of the whole POD or even the whole DC network. Some VNFs vendors also require BFD to monitor liveliness of those BGP sessions which, in case a L3 boundary is higher than the TOR, may create even a bigger number of issues on a POD spine.

There’s a small range of SDN platforms that run a full routing stack on each compute node (e.g. Cumulus, Calico). These solutions are the best fit for this kind of scenarios since they allow BGP sessions to be established over a single hop (VNF <-> virtual switch). However they represent a small fraction of total SDN solutions space with majority of vendors implementing a much simpler OpenFlow or XMPP-based flow push model.

OpenContrail, as far as I know, is the only SDN controller that doesn’t run a full routing stack on compute nodes but still fulfills this requirement in a very elegant way. When BGPaaS is enabled for a particular VM’s interface, controller programs vRouter to proxy BGP TCP connections coming to virtual network’s default gateway IP and forward them to the controller. This way VNF thinks it peers with a next hop IP, however all BGP state and path computations still happen on the controller.

The diagram below depicts a sample implementation of BGPaaS using OpenContrail. VNF is connected to a vRouter using a dot1Q trunk interface (to allow multiple VRFs over a single vEth link). Each VRF has its own BGPaaS session setup to advertise network ranges (NET1-3) into customer VPNs. These BGP sessions get proxied to the controller which injects those prefixes into their respective VPNs. These updates are then sent to DC gateways using either a VPNv4/6 or EVPN and the traffic is forwarded through DC underlay with VPN segregation preserved by either an MPLS tag (for MPLSoGRE or MPLSoUDL encapsulation) or a VXLAN VNI.

Now let me briefly go over the lab that I’ve built to showcase the BGPaaS and DC-GW integration features.

Lab setup overview

OpenContrail follows a familiar pattern of DC SDN architecture with central controller orchestrating the work of multiple virtual switches. In case of OpenContrail, these switches are called vRouters and they communicate with controller using XMPP-based extension of BGP as described in this RFC draft. A very detailed description of its internal architecture is available on OpenContrail’s website so it would be pointless to repeat all of this information here. That’s why I’ll concentrate on how to get things done rather then on the architectural aspects. However to get things started, I always like to have a clear picture of what I’m trying to achieve. The below diagram depicts a high-level architecture of my lab setup. Although OpenContrail supports BGP VPNv4/6 with multiple dataplane encapsulations, in this post I’ll use EVPN as the only control plane protocol to communicate with MX80 and use VXLAN encapsulation in the dataplane.

EVPN as a DC-GW integration protocol is relatively new to OpenContrail and comes with a few limitations. One of them is the absence of EVPN type-5 routes, which means I can’t use it in the same way I did in OpenDaylight’s case. Instead I’ll demonstrate a DC-GW IRB scenario, which extends the existing virtual network to a DC-GW and makes IRB/SVI interface on that DC-GW act as a default gateway for this network. This is a very common scenario for L2 DCI and active-active DC deployment models. To demonstrate this scenario I’m going to setup a single OpenStack virtual network with a couple of VMs whose gateway will reside on MX80. Since I only have a single OpenStack instance and a single MX80, I’ll setup one half of L2 DCI and setup a mutual redistribution to make our overlay network reachable from MX80’s global routing table.

All-in-one VM setup

Physically, my lab will consist of a single hypervisor running an all-in-one VM with kolla-openstack and kolla-contrail and a physical Juniper MX80 playing the role of a DC-GW.

OpenContrail’s kolla github page contains a set of instructions to setup the environment. As usual, I have automated all of these steps which can be setup from a hypervisor with the following commands:

git clone https://github.com/networkop/kolla-odl-bgpvpn && cd kolla-odl-bgpvpn
./1-create.sh do
./2-contrail.sh do

OpenStack setup

Once installation is complete and all docker containers are up and running, we can setup the OpenStack side of our test environment. The script below will do the following:

Download cirros and CumulusVX images and upload them to Glance
Create a virtual network
Update security rules to allow inbound ICMP and SSH connections
Create a pair of VMs - one based on cirros and one based on CumulusVX image

curl -L -o ./cirros http://download.cirros-cloud.net/0.3.5/cirros-0.3.5-x86_64-disk.img
curl -L -o ./cumulusVX http://cumulusfiles.s3.amazonaws.com/cumulus-linux-3.5.0-vx-amd64.qcow2

openstack image create --disk-format qcow2 --container-format bare --public \
--property os_type=linux --file ./cirros cirros
rm ./cirros

openstack image create --disk-format qcow2 --container-format bare --public \
--property os_type=linux --file ./cumulusVX cumulus
rm ./cumulusVX

openstack network create --provider-network-type vxlan irb-net

openstack subnet create --subnet-range 10.0.100.160/27 --network irb-net \
      --host-route destination=0.0.0.0/0,gateway=10.0.100.190 \
      --gateway 10.0.100.161 --dns-nameserver 8.8.8.8 irb-subnet

openstack flavor create --id 1 --ram 256 --disk 1 --vcpus 1 m1.nano
openstack flavor create --id 2 --ram 512 --disk 10 --vcpus 1 m1.tiny

ADMIN_PROJECT_ID=$(openstack project show 'admin' -f value -c id)
ADMIN_SEC_GROUP=$(openstack security group list --project ${ADMIN_PROJECT_ID} | awk '/ default / {print $2}')
openstack security group rule create --ingress --ethertype IPv4 \
    --protocol icmp ${ADMIN_SEC_GROUP}
openstack security group rule create --ingress --ethertype IPv4 \
    --protocol tcp --dst-port 22 ${ADMIN_SEC_GROUP}

openstack server create --image cirros --flavor m1.nano --net irb-net VM1
openstack server create --image cumulus --flavor m1.tiny --net irb-net VR1

The only thing worth noting in the above script is that a default gateway 10.0.100.161 gets overridden by a default host route pointing to 10.0.100.190. Normally, to demonstrate DC-GW IRB scenario, I would have setup a gateway-less L2 only subnet, however in that case I wouldn’t have been able to demonstrate BGPaaS on the same network, since this feature relies on having a gateway IP setup (which later acts as a BGP session termination endpoint). So instead of setting up two separate networks I’ve decided to implement this hack to minimise the required configuration.

EVPN integration with MX80

DC-GW integration procedure is very simple and requires only a few simple steps:

Make sure VXLAN VNI is matched on both ends
Configure import/export route targets
Setup BGP peering with DC-GW

All of these steps can be done very easily through OpenContrail’s GUI. However as I’ve mentioned before, I always prefer to use API when I have a chance and in this case I even have a python library for OpenContrail’s REST API available on Juniper’s github page, which I’m going to use below to implement the above three steps.

Configuration

Before we can begin working with OpenContrail’s API, we need to authenticate with the controller and get a REST API connection handler.

import pycontrail.client as client
CONTRAIL_API = 'http://10.0.100.140:8082'
AUTH_URL = 'http://10.0.100.140:5000/v2.0'
AUTH_PARAMS = {
    'type': 'keystone',
    'username': 'admin',
    'password': 'mypassword',
    'tenant_name': 'admin',
    'auth_url': AUTH_URL
}
conn = client.Client(url=CONTRAIL_API,auth_params=AUTH_PARAMS)

The first thing I’m going to do is override the default VNI setup by OpenContrail for irb-net to a pre-defined value of 5001. To do that I first need to get a handler for irb-net object and extract the virtual_network_properties object containing a vxlan_network_identifier property. Once it’s overridden, I just need to update the parent irb-net object to apply the change to the running configuration on the controller.

irb_net = conn.virtual_network_read(fq_name = [ 'default-domain', 'admin' ,'irb-net'] )
vni_props=irb_net.get_virtual_network_properties()
vni_props.set_vxlan_network_identifier(5001)
irb_net.set_virtual_network_properties(vni_props)
conn.virtual_network_update(irb_net)

The next thing I need to do is explicitly set the import/export route-target properties for the irb-net object. This will require a new RouteTargetList object which then gets referenced by a route_target_list property of the irb-net object.

from pycontrail import types as t
new_rtl = t.RouteTargetList(['target:200:200'])
irb_net.set_route_target_list(new_rtl)
conn.virtual_network_update(irb_net)

The final step is setting up a peering with MX80. The main object that needs to be created is BgpRouter, which contains a pointer to BGP session parameters object with session-specific values like ASN and remote peer IP. BGP router is defined in a global context (default domain and default project) which will make it available to all configured virtual networks.

from pycontrail import types as t
ctx = ['default-domain', 'default-project', 'ip-fabric', '__default__']
af = t.AddressFamilies(family=['inet-vpn', 'e-vpn'])
bgp_params = t.BgpRouterParams(vendor='Juniper', \
                               autonomous_system=65411, \
                               address='10.0.101.15', \
                               address_families=af)
vrf = conn.routing_instance_read(fq_name = ctx)
bgp_router = t.BgpRouter(name='MX80', display_name='MX80', \
                         bgp_router_parameters=bgp_params,
                         parent_obj=vrf)
contrail = conn.bgp_router_read(fq_name = ctx + ['controller-1'])
bgp_router.set_bgp_router(contrail,t.BgpPeeringAttributes())
conn.bgp_router_create(bgp_router)

For the sake of brevity, I will not cover MX80’s configuration in details and simply include it here for reference with some minor explanatory comments.

# Interface and global settings configuration
set interfaces irb unit 5001 family inet address 10.0.100.190/27
set interfaces lo0 unit 0 family inet address 10.0.101.15/32
set routing-options router-id 10.0.101.15
set routing-options autonomous-system 65411

# Setup BGP peering with OpenContrail
set protocols bgp group CONTRAIL multihop
set protocols bgp group CONTRAIL local-address 10.0.101.15
set protocols bgp group CONTRAIL family inet-vpn unicast
set protocols bgp group CONTRAIL family evpn signaling
set protocols bgp group CONTRAIL peer-as 64512
set protocols bgp group CONTRAIL neighbor 10.0.100.140

# Setup EVPN instance type with IRB interface and matching RT and VNI
set routing-instances EVPN-L2-IRB vtep-source-interface lo0.0
set routing-instances EVPN-L2-IRB instance-type evpn
set routing-instances EVPN-L2-IRB vlan-id 501
set routing-instances EVPN-L2-IRB routing-interface irb.5001
set routing-instances EVPN-L2-IRB vxlan vni 5001
set routing-instances EVPN-L2-IRB route-distinguisher 200:200
set routing-instances EVPN-L2-IRB vrf-target target:200:200
set routing-instances EVPN-L2-IRB protocols evpn encapsulation vxlan

# Setup VRF instance with IRB interface
set routing-instances EVPN-L3-IRB instance-type vrf
set routing-instances EVPN-L3-IRB interface irb.5001
set routing-instances EVPN-L3-IRB route-distinguisher 201:200
set routing-instances EVPN-L3-IRB vrf-target target:200:200

# Setup route redistribution between EVPN and Global VRFs
set routing-options rib-groups CONTRAIL-TO-GLOBAL import-rib EVPN-L3-IRB.inet.0
set routing-options rib-groups CONTRAIL-TO-GLOBAL import-rib inet.0
set routing-options rib-groups GLOBAL-TO-CONTRAIL import-rib inet.0
set routing-options rib-groups GLOBAL-TO-CONTRAIL import-rib EVPN-L3-IRB.inet.0
set routing-options interface-routes rib-group inet CONTRAIL-TO-GLOBAL
set routing-instances EVPN-L3-IRB routing-options interface-routes rib-group inet CONTRAIL-TO-GLOBAL
set protocols bgp group EXTERNAL-BGP family inet unicast rib-group GLOBAL-TO-CONTRAIL

Verification

The easiest way to verify that BGP peering has been established is to query OpenContrail’s introspection API:

$ curl  -s http://10.0.100.140:8083/Snh_BgpNeighborReq?ip_address=10.0.101.15 | \
  xmllint --xpath '/BgpNeighborListResp/neighbors[1]/list/BgpNeighborResp/state' -
type="string" identifier="8">Established

Datapath verification can be done from either side, in this case I’m showing a ping from MX80’s global VRF towards one of the OpenStack VMs:

admin@MX80> ping 10.0.100.164 count 2
PING 10.0.100.164 (10.0.100.164): 56 data bytes
64 bytes from 10.0.100.164: icmp_seq=0 ttl=64 time=3.836 ms
64 bytes from 10.0.100.164: icmp_seq=1 ttl=64 time=3.907 ms

--- 10.0.100.164 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 3.836/3.872/3.907/0.035 ms

BGP-as-a-Service

To keep things simple I will not use multiple dot1Q interfaces and setup a BGP peering with CumulusVX over a normal, non-trunk interface. From CumulusVX I will inject a loopback IP 1.1.1.1/32 into the irb-net network. Since REST API python library I’ve used above is two major releases behind the current version of OpenContrail, it cannot be used to setup BGPaaS feature. Instead I will demonstrate how to use REST API directly from the command line of all-in-one VM using cURL.

Configuration

In order to start working with OpenContrail’s API, I first need to obtain an authentication token from OpenStack’s keystone. With that token I can now query the list of IPs assigned to all OpenStack instances and pick the one assigned to CumulusVX. I need the UUID of that particular IP address in order to extract the ID of the VM interface this IP is assigned to.

source /etc/kolla/admin-openrc.sh
TOKEN=$(openstack token issue -f value -c id)
CONTRAIL_AUTH="X-AUTH-TOKEN: $TOKEN"
CTYPE="Content-Type: application/json; charset=UTF-8"
curl -H "$CONTRAIL_AUTH" http://10.0.100.140:8082/instance-ips | jq
VMI_ID=$(curl -H "$CONTRAIL_AUTH" http://10.0.100.140:8082/instance-ip/2e7987be-3f53-4296-905a-0c64793307a9 | \
         jq '.["instance-ip"] .virtual_machine_interface_refs[0].uuid')

With VM interface ID saved in a VMI_ID variable I can create a BGPaaS service and link it to that particular VM interface.

cat << EOF > ./bgpaas.json
{
    "bgp-as-a-service":
    {
        "fq_name": ["default-domain", "admin", "cumulusVX-bgp" ],
        "autonomous_system": 321,
        "bgpaas_session_attributes": {
            "address_families": {"family": ["inet"] }
            },
        "parent_type": "project",
        "virtual_machine_interface_refs": [{
            "attr": null,
            "to": ["default-domain", "admin", ${VMI_ID}]
            }],
        "bgpaas-shared": false,
        "bgpaas-ip-address": "10.0.100.164"
    }
}
EOF

curl -X POST -H "$CONTRAIL_AUTH" -H "$CTYPE" -d @bgpaas.json http://10.0.100.140:8082/bgp-as-a-services

The final step is setting up a BGP peering on the CumulusVX side. CumulusVX configuration is very simple and self-explanatory. The BGP neighbor IP is the IP of virtual network’s default gateway located on local vRouter.

!
interface lo
 ip address 1.1.1.1/32
!
router bgp 321
 neighbor 10.0.100.161 remote-as 64512
 !
 address-family ipv4 unicast
  network 1.1.1.1/32
 exit-address-family
!

Verification

Here’s where we come across another limitation of EVPN. The loopback prefix 1.1.1.1/32 does not get injected into EVPN address family, however it does show up automatically in the VPNv4 address family which can be verified from the MX80:

admin@MX80> show route table bgp.l3vpn.0 hidden 1.1.1.1/32 extensive

bgp.l3vpn.0: 6 destinations, 6 routes (3 active, 0 holddown, 3 hidden)
10.0.100.140:2:1.1.1.1/32 (1 entry, 0 announced)
         BGP    Preference: 170/-101
                Route Distinguisher: 10.0.100.140:2
                Next hop type: Unusable, Next hop index: 0
                Next-hop reference count: 6
                State: 
                Local AS: 65411 Peer AS: 64512
                Age: 37:44
                Validation State: unverified
                Task: BGP_64512.10.0.100.140
                AS path: 64512 321 I
                Communities: target:200:200 target:64512:8000003 encapsulation:unknown(0x2) encapsulation:mpls-in-udp(0xd) unknown type 8004 value fc00:7a1201 unknown type 8071 value fc00:1389
                Import Accepted
                VPN Label: 31
                Localpref: 100
                Router ID: 10.0.100.140
                Secondary Tables: EVPN-L3-IRB.inet.0
                Indirect next hops: 1
                        Protocol next hop: 10.0.100.140
                        Label operation: Push 31
                        Label TTL action: prop-ttl
                        Load balance label: Label 31: None;
                        Indirect next hop: 0x0 - INH Session ID: 0x0

It’s hidden since I haven’t configured MPLSoUDP dynamic tunnels on MX80. However this proves that the prefix does get injected into customer VPNs and become available on all devices with the matching import route-target communities.

Outro

This post concludes Series 2 of my OpenStack SDN saga. I’ve covered quite an extensive range of topics in my two-part series, however, OpenStack networking landscape is so big, it’s simply impossible to cover everything I find interesting. I started writing about OpenStack SDN when I first learned I got a job with Nokia. Back then I knew little about VMware NSX and even less about OpenStack. That’s why I started researching topics that I found interesting and branching out into adjacent areas as I went along. Almost 2 years later, looking back I can say I’ve learned a lot about the internals of SDN in general and hopefully so have my readers. Now I’m leaving Nokia to rediscover my networking roots at Arista. I’ll dive into DC networking from a different perspective now and it may be awhile before I accumulate a critical mass of interesting material to start spilling it out in my blog again. I still may come back to OpenStack some day but for now I’m gonna take a little break, maybe do some house keeping (e.g. move my blog from Jekyll to Hugo, add TLS support) and enjoy my time being a farther.

OpenStack SDN - OpenDaylight With BGP VPN

2017-12-15T00:00:00+00:00

In this post I’ll demonstrate how to build a simple OpenStack lab with OpenDaylight-managed virtual networking and integrate it with a Cisco IOS-XE data centre gateway using EVPN.

For the last 5 years OpenStack has been the training ground for a lot of emerging DC SDN solutions. OpenStack integration use case was one of the most compelling and easiest to implement thanks to the limited and suboptimal implementation of the native networking stack. Today, in 2017, features like L2 population, local ARP responder, L2 gateway integration, distributed routing and service function chaining have all become available in vanilla OpenStack and don’t require a proprietary SDN controller anymore. Admittedly, some of the features are still not (and may never be) implemented in the most optimal way (e.g. DVR). This is where new opensource SDN controllers, the likes of OVN and Dragonflow, step in to provide scalable, elegant and efficient implementation of these advanced networking features. However one major feature still remains outside of the scope of a lot of these new opensource SDN projects, and that is data centre gateway (DC-GW) integration. Let me start by explain why you would need this feature in the first place.

Optimal forwarding of North-South traffic

OpenStack Neutron and VMware NSX, both being pure software solutions, rely on a special type of node to forward traffic between VMs and hosts outside of the data centre. This node acts as a L2/L3 gateway for all North-South traffic and is often implemented as either a VM or a network namespace. This kind of solution gives software developers greater independence from the underlying networking infrastructure which makes it easier for them to innovate and introduce new features.

However, from the traffic forwarding point of view, having a gateway/network node is not a good solution at all. There is no technological reason for a packet to have to go through this node when after all it ends up on a DC-GW anyway. In fact, this solution introduces additional complexity which needs to be properly managed (e.g. designed, configured and troubleshooted) and a potential bottleneck for high-throughput traffic flows.

It’s clear that the most optimal way to forward traffic is directly from a compute node to a DC-GW. The only question is how can this optimal forwarding be achieved? SDN controller needs to be able to exchange reachability information with DC-GW using a common protocol understood by most of the existing routing stacks. One such protocol, becoming very common in DC environments, is BGP, which has two address families we can potentially use:

VPNv4/6 will allow routes to be exchanged and the dataplance to use MPLSoGRE encapsulation. This should be considered a “legacy” approach since for a very long time DC-GWs did not have the VXLAN ecap/decap capabilities.
EVPN with VXLAN-based overlays. EVPN makes it possible to exchange both L2 and L3 information under the same AF, which means we have the flexibility of doing not only a L3 WAN integration, but also a L2 data centre interconnect with just a single protocol.

In OpenStack specifically, BGPVPN project was created to provide a pluggable driver framework for 3rd party BGP implementations. Apart from a reference BaGPipe driver (BaGPipe is an ExaBGP fork with lightweight implementation of BGP VPNs), which relies on a default openvswitch ML2 mechanism driver, only Nuage, OpenDaylight and OpenContrail have contributed their drivers to this project. In this post I will focus on OpenDaylight and show how to install containerised OpenStack with OpenDaylight and integrate it with Cisco CSR using EVPN.

OpenDaylight integration with OpenStack

Historically, OpenDaylight has had multiple projects implementing custom OpenStack networking drivers:

VTN (Virtual Tenant Networking) - spearheaded by NEC was the first project to provide OpenStack networking implementation
GBP (Group Based Policy) - a project led by Cisco, one of the first (if not THE first) commercial implementation of Intent-based networking
NetVirt - currently a default Neutron plugin from ODL, developed jointly by Brocade (RIP), RedHat, Ericsson, Intel and many others.

NetVirt provides several common Neutron services including L2 and L3 forwarding, ACL and NAT, as well as advanced services like L2 gateway, QoS and SFC. To do that it assumes full control over an OVS switch inside each compute node and implements the above services inside a single br-int OVS bridge. L2/L3 forwarding tables are built based on tenant IP/MAC addresses that have been allocated by Neutron and the current network topology. For high-level overview of NetVirt’s forwarding pipeline you can refer to this document.

It helps to think of an ODL-managed OpenStack as a big chassis switch. NetVirt plays the role of a supervisor by managing control plane and compiling RIB based on the information received from Neutron. Each compute node running an OVS is a linecard with VMs connected to its ports. Unlike the distributed architecture of OVN and Dragonflow, compute nodes do not contain any control plane elements and each OVS gets its FIB programmed directly by the supervisor. DC underlay is a backplane, interconnecting all linecards and a supervisor.

OpenDaylight BGP VPN service architecture

In order to provide BGP VPN functionality, NetVirt employs the use of three service components:

FIB service - maintains L2/L3 forwarding tables and reacts to topology changes
BGP manager - provides a translation of information sent to and received from an external BGP stack (Quagga BGP)
VPN Manager - ties together the above two components, creates VRFs and keeps track of RD/RT values

In order to exchange BGP updates with external DC-GW, NetVirt requires a BGP stack with EVPN and VPNV4/6 capabilities. Ideally, internal ODL BGP stack could have been used for that, however it didn’t meet all the performance requirements (injecting/withdrawing thousand of prefixes at the same time). Instead, an external Quagga fork with EVPN add-ons is connected to BGP manager via a high-speed Apache Thrift interface. This interface defines the format of data to be exchanged between Quagga (a.k.a QBGP) and NetVirt’s BGP Manager in order to do two things:

Configure BGP settings like ASN and BGP neighbors
Read/Write RIB entries inside QBGP

BGP session is established between QBGP and external DC-GW, however next-hop values installed by NetVirt and advertised by QBGP have IPs of the respective compute nodes, so that traffic is sent directly via the most optimal path.

Demo

Enough of the theory, let’s have a look at how to configure a L3VPN between QBGP (advertising ODL’s distributed router subnets) and IOS-XE DC-GW using EVPN route type 5 or, more specifically, Interface-less IP-VRF-to-IP-VRF model:

Installation

My lab environment is still based on a pair of nested VMs running containerised Kolla OpenStack I’ve described in my earlier post. A few months ago OpenDaylight role has been added to kolla-ansible so now it is possible to install OpenDaylight-intergrated OpenStack automatically. However, there is no option to install QBGP so I had to augment the default Kolla and Kolla-ansible repositories to include the QBGP Dockerfile template and QBGP ansible role. So the first step is to download my latest automated installer and make sure enable_opendaylight global variable is set to yes:

git clone https://github.com/networkop/kolla-odl-bgpvpn.git && cd kolla-odl-bgpvpn
mkdir group_vars
echo "enable_opendaylight: \"yes\"" >> group_vars/all.yaml

At the time of writing I was relying on a couple of latest bug fixes inside OpenDaylight, so I had to modify the default ODL role to install the latest master-branch ODL build. Make sure the link below is pointing to the latest zip file in 0.8.0-SNAPSHOT directory.

cat << EOF >> group_vars/all.yaml
odl_latest_enabled: true
odl_latest_url: https://nexus.opendaylight.org/content/repositories/opendaylight.snapshot/org/opendaylight/integration/netvirt/karaf/0.8.0-SNAPSHOT/karaf-0.8.0-20171106.102232-1767.zip
EOF

The next few steps are similar to what I’ve described in my Kolla lab post, will create a pair of VMs, build all Kolla containers, push them to a local Docker repo and finally deploy OpenStack using Kolla-ansible playbooks:

./1-create.sh do
./2-bootstrap.sh do
./3-build.sh do
./4-deploy.sh do

The final 4-deploy.sh script will also create a simple init.sh script inside the controller VM that can be used to setup a test topology with a single VM connected to a 10.0.0.0/24 subnet:

ssh kolla-controller
source /etc/kolla/admin-openrc.sh
./init.sh

Of course, another option to build a lab is to follow the official Kolla documentation to create your own custom test environment.

Configuration

Assuming the test topology was setup with no issues and a test VM can ping its default gateway 10.0.0.1, we can start configuring BGP VPNs. Unfortunately, we won’t be able to use OpenStack BGPVPN API/CLI, since ODL requires an extra parameter (L3 VNI for symmetric IRB) which is not available in OpenStack BGPVPN API, but we still can configure everything directly through ODL’s API. My interface of choice is always REST, since it’s easier to build it into a fully programmatic plugin, so even though all of the below steps can be accomplished through karaf console CLI, I’ll be using cURL to send and retrieve data from ODL’s REST API.

1. Source admin credentials and setup ODL’s REST variables

source /etc/kolla/admin-openrc.sh
export ODL_URL='http://192.168.133.100:8181/restconf'
export CT_JSON="Content-Type: application/json"

2. Configure local BGP settings and BGP peering with DC-GW

cat << EOF > ./bgp-full.json
{
    "bgp": {
        "as-id": {
            "announce-fbit": false,
            "local-as": 100,
            "router-id": "192.168.133.100",
            "stalepath-time": 0
        },
        "logging": {
            "file": "/var/log/bgp_debug.log",
            "level": "errors"
        },
        "neighbors": [
            {
                "address": "192.168.133.50",
                "remote-as": 100,
                "address-families": [
                   {
                     "ebgp:afi": "3",
                     "ebgp:peer-ip": "192.168.133.50",
                     "ebgp:safi": "6"
                   }
                ]
            }
        ]
    }
}
EOF

curl -X PUT -u admin:admin -k -v -H "$CT_JSON"  \
     $ODL_URL/config/ebgp:bgp -d @bgp-full.json

3. Define L3VPN instance and associate it with OpenStack `admin` tenant

TENANT_UUID=$(openstack project show admin -f value -c id | \
            sed 's/\(........\)\(....\)\(....\)\(....\)\(.*\)/\1-\2-\3-\4-\5/')

cat << EOF > ./l3vpn-full.json
{
   "input": {
      "l3vpn":[
         {
            "id":"f503fcb0-3fd9-4dee-8c3a-5034cf707fd9",
            "name":"L3EVPN",
            "route-distinguisher": ["100:100"],
            "export-RT": ["100:100"],
            "import-RT": ["100:100"],
            "l3vni": "5000",
            "tenant-id":"${TENANT_UUID}"
         }
      ]
   }
}
EOF

curl -X POST -u admin:admin -k -v -H "$CT_JSON"  \
      $ODL_URL/operations/neutronvpn:createL3VPN -d @l3vpn-full.json

4. Inject prefixes into L3VPN by associating the previously created L3VPN with a `demo-router`

ROUTER_UUID=$(openstack router show demo-router -f value -c id)

cat << EOF > ./l3vpn-assoc.json
{
  "input":{
     "vpn-id":"f503fcb0-3fd9-4dee-8c3a-5034cf707fd9",
     "router-id":[ "${ROUTER_UUID}" ]
   }
}
EOF

curl -X POST -u admin:admin -k -v -H "$CT_JSON"  \
     $ODL_URL/operations/neutronvpn:associateRouter -d @l3vpn-assoc.json

5. Configure DC-GW VTEP IP

ODL cannot automatically extract VTEP IP from updates received from DC-GW, so we need to explicitly configure it:

cat << EOF > ./tep.json
{
  "input": {
    "destination-ip": "1.1.1.1",
    "tunnel-type": "odl-interface:tunnel-type-vxlan"
  }
}
EOF
curl -X POST -u admin:admin -k -v -H "$CT_JSON"  \
     $ODL_URL/operations/itm-rpc:add-external-tunnel-endpoint -d @tep.json

6. DC-GW configuration

That is all what needs to be configured on ODL. Although I would consider this to be outside of the scope of the current post, for the sake of completeness I’m including the relevant configuration from the DC-GW:

!
vrf definition ODL
 rd 100:100
 route-target export 100:100
 route-target import 100:100
 !
 address-family ipv4
  route-target export 100:100 stitching
  route-target import 100:100 stitching
 exit-address-family
!
bridge-domain 5000
 member vni 5000
!
interface Loopback0
 ip address 1.1.1.1 255.255.255.255
!
interface GigabitEthernet1
 ip address 192.168.133.50 255.255.255.0
!
interface nve1
 no ip address
 source-interface Loopback0
 host-reachability protocol bgp
 member vni 5000 vrf ODL
!
interface BDI5000
 vrf forwarding ODL
 ip address 8.8.8.8 255.255.255.0
 encapsulation dot1Q 500
!
router bgp 100
 bgp log-neighbor-changes
 no bgp default ipv4-unicast
 neighbor 192.168.133.100 remote-as 100
 !
 address-family l2vpn evpn
  import vpnv4 unicast
  neighbor 192.168.133.100 activate
 exit-address-family
 !
 address-family ipv4 vrf ODL
  advertise l2vpn evpn
  redistribute connected
 exit-address-family
!

For detailed explanation of how EVPN RT5 is configured on Cisco CSR refer to the following guide.

Verification

There are several things that can be checked to verify that the DC-GW integration is working. One of the first steps would be to check if BGP session with CSR is up. This can be done from the CSR side, however it’s also possible to check this from the QBGP side. First we need to get into the QBGP’s interactive shell from the controller node:

[centos@controller-1 ~]$ sudo docker exec -it quagga /opt/quagga/bin/vtysh

From here, we can check that the BGP session has been established:

controller-1# sh bgp neighbors 192.168.133.50
BGP neighbor is 192.168.133.50, remote AS 100, local AS 100, internal link
  BGP version 4, remote router ID 1.1.1.1
  BGP state = Established, up for 00:03:05

We can also check the contents of EVPN RIB compiled by QBGP

controller-1# sh bgp evpn rd 100:100
BGP table version is 0, local router ID is 192.168.133.100
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: as2 100:100
*> [0][fa:16:3e:37:42:d8/48][10.0.0.2/32]
                    192.168.133.100         0          32768 i
*> [0][fa:16:3e:dc:77:65/48][10.0.0.3/32]
                    192.168.133.101         0          32768 i
*>i8.8.8.0/24       1.1.1.1         0     100       0 ?
*> 10.0.0.0/24      192.168.133.100         0          32768 i

Finally, we can verify that the prefix 8.8.8.0/24 advertised from DC-GW is being passed by QBGP and accepted by NetVirt’s FIB Manager:

$ curl -u admin:admin -k -v  $ODL_URL/config/odl-fib:fibEntries/\
  vrfTables/100%3A100/vrfEntry/8.8.8.0%2F24 | python -m json.tool
{
    "vrfEntry": [
        {
            "destPrefix": "8.8.8.0/24",
            "encap-type": "vxlan",
            "gateway_mac_address": "00:1e:49:69:24:bf",
            "l3vni": 5000,
            "origin": "b",
            "route-paths": [
                {
                    "nexthop-address": "1.1.1.1"
                }
            ]
        }
    ]
}

The last output confirms that the prefix is being received and accepted by ODL. To do a similar check on CSR side we can run the following command:

CSR1k#show bgp l2vpn evpn

     Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 100:100 (default for vrf ODL)
 *>i  [2][100:100][0][48][FA163E3742D8][32][10.0.0.2]/24
                      192.168.133.100          0    100      0 i
 *>i  [2][100:100][0][48][FA163EDC7765][32][10.0.0.3]/24
                      192.168.133.101          0    100      0 i
 *>   [5][100:100][0][24][8.8.8.0]/17
                      0.0.0.0                  0         32768 ?
 *>i  [5][100:100][0][24][10.0.0.0]/17
                      192.168.133.100          0    100      0 i

This confirms that the control plane information has been successfully exchanged between NetVirt and Cisco CSR.

At the time of writing, there was an open bug in ODL master branch that prevented the forwarding entries from being installed in OVS datapath. Once the bug is fixed I will update this post with the dataplance verification, a.k.a ping

Conclusion

OpenDaylight is a pretty advanced OpenStack SDN platform. Its functionality includes clustering, site-to-site federation (without EVPN) and L2/L3 EVPN DC-GW integration for both IPv4 and IPv6. It is yet another example of how an open-source platform can match even the most advanced proprietary SDN solutions from incumbent vendors. This is all thanks to the companies involved in OpenDaylight development. I also want to say special thanks to Vyshakh Krishnan, Kiran N Upadhyaya and Dayavanti Gopal Kamath from Ericsson for helping me clear up some of the questions I posted on netvirt-dev mailing list.

Openstack SDN - NFV Management and Orchestration

2017-11-23T00:00:00+00:00

In this post I’ll have a brief look at the NFV MANO framework developed by ETSI and create a simple vIDS network service using Tacker.

In the ongoing hysteria surrounding all things SDN, one important thing gets often overlooked. You don’t build SDN for its own sake. SDN is just a little cog in a big machine called “cloud”. To take it even further, I would argue that the best SDN solution is the one that you don’t know even exists. Despite what the big vendors tell you, operators are not supposed to interact with SDN interface, be it GUI or CLI. If you dig up some of the earliest presentation about Cisco ACI, when the people talking about it were the actual people who designed the product, you’ll notice one common motif being repeated over and over again. That is that ACI was never designed for direct human interaction, but rather was supposed to be configured by a higher level orchestrating system. In data center environments such orchestrating system may glue together services of virtualization layer and SDN layer to provide a seamless “cloud” experience to the end users. The focus of this post will be one incarnation of such orchestration system, specific to SP/Telco world, commonly known as NFV MANO.

NFV MANO for Telco SDN

At the early dawn of SDN/NFV era a lot of people got very excited by “the promise” and started applying the disaggregation and virtualization paradigms to all areas of networking. For Telcos that meant virtualizing network functions that built the service core of their networks - EPC, IMS, RAN. Traditionally those network functions were a collection of vertically-integrated baremetal appliances that took a long time to commission and had to be overprovisioned to cope with the peak-hour demand. Virtualizing them would have made it possible to achieve quicker time-to-market, elasticity to cope with a changing network demand and hardware/software disaggregation.

As expected however, such fundamental change has to come at price. Not only do Telcos get a new virtualization platform to manage but they also need to worry about lifecycle management and end-to-end orchestration (MANO) of VNFs. Since any such change presents an opportunity for new streams of revenue, it didn’t take long for vendors to jump on the bandwagon and start working on a new architecture designed to address those issues.

The first problem was the easiest to solve since VMware and OpenStack already existed at that stage and could be used to host VNFs with very little modifications. The management and orchestration problem, however, was only partially solved by existing orchestration solutions. There were a lot of gaps between the current operational model and the new VNF world and although these problems could have been solved by Telcos engaging themselves with the open-source community, this proved to be too big of a change for them and they’ve turned to the only thing they could trust - the standards bodies.

ETSI MANO

ETSI NFV MANO working group has set out to define a reference architecture for management and orchestration of virtualized resources in Telco data centers. The goal of NFV MANO initiative was to do a research into what’s required to manage and orchestrate VNFs, what’s currently available and identify potential gaps for other standards bodies to fill. Initial ETSI NFV Release 1 (2014) defined a base framework through relatively weak requirements and recommendations and was followed by Release 2 (2016) that made them more concrete by locking down the interfaces and data model specifications. For a very long time Release 1 was the only available NFV MANO standard, which led to a lot of inconsistencies in each vendors' implementations of it. This was very frustrating for Telcos since it required a lot of integration effort to build a multi-vendor MANO stack. Another potential issue with ETSI MANO standard is its limited scope - a lot of critical components like OSS and EMS are left outside of it which created a lot of confusion for Telcos and resulted in other standardisation efforts addressing those gaps.

On the below diagram I have shown an adbridged version of the original ETSI MANO reference architecture diagram adapted to the use case I’ll be demonstrating in this post.

This architecture consists of the following building blocks:

NFVI (NFV Infrastructure) - OpenStacks compute or VMware’s ESXI nodes
VIM (Virtual Infrastructure Manager) - OpenStack’s controller/API or VMware’s vCenter nodes
VNFM (VNF Manager) - an element responsible for lifecycle management (create,delete,scale) and monitoring of VNFs
NFVO (NFV Orchestrator) - an element responsible for lifecyle management of Network Services (described below)

All these elements are working together towards a single goal - managing and orchestrating a Network Service (NS), which itself is comprised of multiple VNFs, Virtual Links (VLs), VNF Forwarding Graphs (VNFFGs) and Physical Network Functions (PNFs). In this post I create a NS for a simple virtual IDS use case, described in my previous SFC post. The goal is to steer all ICMP traffic coming from VM1 through a vIDS VNF which will forward the traffic to its original destination.

Before I get to the implementation, let me give a quick overview of how a Network Service is build from its constituent parts, in the context of our vIDS use case.

Relationship between NS, VNF and VNFFG

According to ETSI MANO, a Network Service (NS) is a subset of end-to-end service implemented by VNFs and instantiated on the NFVI. As I’ve mentioned before, some examples of a NS would be vEPC, vIMS or vCPE. NS can be described in either a YANG or a Tosca template called NS Descriptor (NSD). The main goal of a NSD is to tie together VNFs, VLs, VNFFGs and PNFs by defining relationship between various templates describing those objects (VNFDs, VLDs, VNFFGDs). Once NSD is onboarded (uploaded), it can be instantiated by NFVO, which communicates with VIM and VNFM to create the constituent components and stitch them together as described in a template. NSD normally does not contain VNFD or VNFFGD templates, but imports them through their names, which means that in order to instantiate a NSD, the corresponding VNFDs and VNFFGDs should already be onboarded.

VNF Descriptor is a template describing the compute and network parameters of a single VNF. Each VNF consists of one or more VNF components (VNFCs), represented in Tosca as Virtual Deployment Units (VDUs). A VDU is the smallest part of a VNF and can be implemented as either a container or, as it is in our case, a VM. Apart from the usual set of parameters like CPU, RAM and disk, VNFD also describes all the virtual networks required for internal communication between VNFCs, called internal VLs. VNFM can ask VIM to create those networks when the VNF is being instantiated. VNFD also contains a reference to external networks, which are supposed to be created by NFVO. Those networks are used to connect different VNFs together or to connect VNFs to PNFs and other elements outside of NFVI platform. If external VLs are defined in a VNFD, VNFM will need to source them externally, either as input parameters to VNFM or from NFVO. In fact, VNF instantiation by VNFM, as described in Tacker documentation, is only used for testing purposes and since a VNF only makes sense as a part of a Network Service, the intended way is to use a NSD to instantiate all VNFs in production environment.

The final component that we’re going to use is VNF Forwarding Graph. VNFFG Descriptor is an optional component that describes how different VNFs are supposed to be chained together to form a Network Service. In the absence of VNFFG, VNFs will fall back to the default destination-based forwarding, when the IPs of VNFs forming a NS are either automatically discovered (e.g. through DNS) or provisioned statically. Tacker’s implementation of VNFFG is not fully integrated with NSD yet and VNFFGD has to be instantiated separately and, as will be shown below, linked to an already running instance of a Network Service through its ID.

Using Tacker to orchestrate a Network Service

Tacker is an OpenStack project implementing a generic VNFM and NFVO. At the input it consumes Tosca-based templates, converts them to Heat templates which are then used to spin up VMs on OpenStack. This diagram from Brocade, the biggest Tacker contributor (at least until its acquisition), is the best overview of internal Tacker architecture.

For this demo environment I’ll keep using my OpenStack Kolla lab environment described in my previous post.

Step 1 - VIM registration

Before we can start using Tacker, it needs to know how to reach the OpenStack environment, so the first step in the workflow is OpenStack or VIM registration. We need to provide the address of the keystone endpoint along with the admin credentials to give Tacker enough rights to create and delete VMs and SFC objects:

cat << EOF > ./vim.yaml
auth_url: 'http://192.168.133.254:35357/v3'
username: 'admin'
password: 'admin'
project_name: 'admin'
project_domain_name: 'Default'
user_domain_name: 'Default'
EOF

tacker vim-register --is-default --config-file vim.yaml --description MYVIM KOLLA-OPENSTACK

The successful result can be checked with tacker vim-list which should report that registered VIM is now reachable.

Step 2 - Onboarding a VNFD

VNFD defines a set of VMs (VNFCs), network ports (CPs) and networks (VLs) and their relationship. In our case we have a single cirros-based VM with a pair of ingress/egress ports. In this template we also define a special node type tosca.nodes.nfv.vIDS which will be used by NSD to pass the required parameters for ingress and egress VLs. These parameters are going to be used by VNFD to attach network ports (CPs) to virtual networks (VLs) as defined in the substitution_mappings section.

cat << EOF > ./vnfd.yaml
tosca_definitions_version: tosca_simple_profile_for_nfv_1_0_0
description: Cirros vIDS example

node_types:
  tosca.nodes.nfv.vIDS:
    requirements:
      - INGRESS_VL:
          type: tosca.nodes.nfv.VL
          required: true
      - EGRESS_VL:
          type: tosca.nodes.nfv.VL
          required: true

topology_template:
  substitution_mappings:
    node_type: tosca.nodes.nfv.vIDS
    requirements:
      INGRESS_VL: [CP1, virtualLink]
      EGRESS_VL:  [CP2, virtualLink]

  node_templates:
    VDU1:
      type: tosca.nodes.nfv.VDU.Tacker
      properties:
        availability_zone: nova
        flavor: m1.nano
        image: cirros
        mgmt_driver: noop
        user_data_format: RAW
        user_data: |
          #!/bin/sh
          sudo cirros-dhcpc up eth1
          sudo ip rule add iif eth0 table default
          sudo ip route add default via 10.0.0.1 dev eth1 table default
          sudo sysctl -w net.ipv4.ip_forward=1

    CP1:
      type: tosca.nodes.nfv.CP.Tacker
      properties:
        anti_spoofing_protection: false
      requirements:
        - virtualBinding:
            node: VDU1

    CP2:
      type: tosca.nodes.nfv.CP.Tacker
      properties:
        anti_spoofing_protection: false
      requirements:
        - virtualBinding:
            node: VDU1
EOF

tacker vnfd-create --vnfd-file vnfd.yaml vIDS-TEMPLATE

Step 4 - Onboarding a NSD

In our use case the NSD template is going to really small. All what we need to define is a single VNF of the tosca.nodes.nfv.vIDS type that was defined previously in the VNFD. We also define a VL node which points to the pre-existing demo-net virtual network and pass this VL to both INGRESS_VL and EGRESS_VL parameters of the VNFD.

cat << EOF > ./nsd.yaml
tosca_definitions_version: tosca_simple_profile_for_nfv_1_0_0
imports:
  - vIDS-TEMPLATE

topology_template:
  node_templates:
    vIDS:
      type: tosca.nodes.nfv.vIDS
      requirements:
        - INGRESS_VL: VL1
        - EGRESS_VL: VL1
    VL1:
      type: tosca.nodes.nfv.VL
      properties:
          network_name: demo-net
          vendor: tacker
EOF

tacker nsd-create --nsd-file nsd.yaml NSD-vIDS-TEMPLATE

Step 5 - Instantiating a NSD

As I’ve mentioned before, VNFFG is not integrated with NSD yet, so we’ll add it later. For now, we have provided enough information to instantiate our NSD.

tacker ns-create --nsd-name NSD-vIDS-TEMPLATE NS-vIDS-1

This last command creates a cirros-based VM with two interfaces and connects them to demo-net virtual network. All ICMP traffic from VM1 still goes directly to its default gateway so the last thing we need to do is create a VNFFG.

Step 6 - Onboarding and Instantiating a VNFFG

VNFFG consists of two two types of nodes. The first type defines a Forwarding Path (FP) as a set of virtual ports (CPs) and a flow classifier to build an equivalent service function chain inside the VIM. The second type groups multiple forwarding paths to build a complex service chain graphs, however only one FP is supported by Tacker at the time of writing.

The following template demonstrates another important feature - template parametrization. Instead of defining all parameters statically in a template, they can be provided as inputs during instantiation, which allows to keep templates generic. In this case I’ve replaced the network port id parameter with PORT_ID variable which will be provided during VNFFGD instantiation.

cat << EOF > ./vnffg.yaml
tosca_definitions_version: tosca_simple_profile_for_nfv_1_0_0

description: vIDS VNFFG tosca

topology_template:
  inputs:
    PORT_ID:
      type: string
      description: Port ID of the target VM

  node_templates:

    Forwarding_Path-1:
      type: tosca.nodes.nfv.FP.Tacker
      description: creates path (CP1->CP2)
      properties:
        id: 51
        policy:
          type: ACL
          criteria:
            - network_src_port_id: { get_input: PORT_ID }
            - ip_proto: 1
        path:
          - forwarder: vIDS-TEMPLATE
            capability: CP1
          - forwarder: vIDS-TEMPLATE
            capability: CP2


  groups:
    VNFFG1:
      type: tosca.groups.nfv.VNFFG
      description: Set of Forwarding Paths
      properties:
        vendor: tacker
        version: 1.0
        number_of_endpoints: 1
        dependent_virtual_link: [VL1]
        connection_point: [CP1]
        constituent_vnfs: [vIDS-TEMPLATE]
      members: [Forwarding_Path-1]
EOF

tacker vnffgd-create --vnffgd-file vnffgd.yaml VNFFG-TEMPLATE

Note that the VNFFGD has been updated to support multiple flow classifiers which means you many need to update the above template as per the sample VNFFGD template

In order to instantiate a VNFFGD we need to provide two runtime parameters:

OpenStack port ID of VM1 for forwarding path flow classifier
ID of the VNF created by the Network Service

All these parameters can be obtained using the CLI commands as shown below:

CLIENT_IP=$(openstack server list | grep VM1 | grep -Eo '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+')
PORT_ID=$(openstack port list | grep $CLIENT_IP | awk '{print $2}')
echo "PORT_ID: $PORT_ID" > params-vnffg.yaml
vIDS_ID=$(tacker ns-show NS-vIDS-1 -f value -c vnf_ids | sed "s/'/\"/g" | jq '.vIDS' | sed "s/\"//g")

The following command creates a VNFFG and an equivalent SFC to steer all ICMP traffic from VM1 through vIDS VNF. The result can be verified using Skydive following the procedure described in my previous post.

tacker vnffg-create --vnffgd-name VNFFG-TEMPLATE \
                    --vnf-mapping vIDS-TEMPLATE:$vIDS_ID \
                    --param-file params-vnffg.yaml VNFFG-1

Other Tacker features

This post only scratches the surface of what’s available in Tacker with a lot of other salient features left out of scope, including:

VNF monitoring - through monitoring driver its possible to do VNF monitoring from VNFM using various methods ranging from a single ICMP/HTTP ping to Alarm-based monitoring using OpenStack’s Telemetry framework
Enhanced Placement Awareness - VNFD Tosca template extensions that allow the definition of required performance features like NUMA topology mapping, SR-IOV and CPU pinning.
Mistral workflows - ability to drive Tacker workflows through Mistral

Conclusion

Tacker is one of many NFV orchestration platforms in a very competitive environment. Other open-source initiatives have been created in response to the shortcomings of the original ETSI Release 1 reference architecture. The fact the some of the biggest Telcos have finally realised that the only way to achieve the goal of NFV orchestration is to get involved with open-source and do it themselves, may be a good sign for the industry and maybe not so good for the ETSI NFV MANO working group. Whether ONAP with its broader scope becomes a new de-facto standard for NFV orchestration, still remains to be seen, until then ETSI MANO remains the only viable standard for NFV lifecycle management and orchestration.

Openstack SDN - Skydiving Into Service Function Chaining

2017-09-15T00:00:00+01:00

In this post I’ll show how to configure Neutron’s service function chaining, troubleshoot it with Skydive and how SFC is implemented in OVS forwarding pipeline.

SFC is another SDN feature that for a long time only used to be available in proprietary SDN solutions and that has recently become available in vanilla OpenStack. It serves as another proof that proprietary SDN solutions are losing the competitive edge, especially for Telco SDN/NFV use cases. Hopefully, by the end of this series of posts I’ll manage do demonstrate how to build a complete open-source solution that has feature parity (in terms of major networking features) with all the major proprietary data centre SDN platforms. But for now, let’s just focus on SFC.

SFC High-level overview

In most general terms, SFC refers to packet forwarding technique that uses more than just destination IP address to decide how to forward packets. In more specific terms, SFC refers to “steering” of traffic through a specific set of endpoints (a.k.a Service Functions), overriding the default destination-based forwarding. For those coming from a traditional networking background, think of SFC as a set of policy-based routing instances orchestrated from a central element (SDN controller). Typical use cases for SFC would be things like firewalling, IDS/IPS, proxying, NAT'ing, monitoring.

SFC is usually modelled as a directed (acyclic) graph, where the first and the last elements are the source and destination respectively and each vertex inside the graph represents a SF to be chained. IETF RFC7665 defines the reference architecture for SFC implementations and establishes some of the basic terminology. A simplified SFC architecture consists of the following main components:

Classifier - a network element that matches and redirects traffic flows to a chain
Service Function - an element responsible for packet processing
Service Function Forwarder - a network element that forwards traffic to and from a directly connected SF

One important property of a SF is elasticity. More instances of the same type can be added to a pool of SF and SFF will load-balance the traffic between them. This is the reason why, as we’ll see in the next section, SFF treats connections to a SF as a group of ports rather than just a single port.

Insertion modes and implementation models

In legacy, pre-SDN environments SFs had no idea if they were a part of a service chain and network devices (routers and switches) had to “insert” the interesting traffic into the service function using one of the following two modes:

L2 mode is when SF is physically inserted between the source and destination inside a single broadcast domain, so traffic flows through a SF without any intervention from a switch. Example of this mode could be a firewall in transparent mode, physically connected between a switch and a default gateway router. All packets entering a SF have their original source and destination MAC addresses, which requires SF to be in promiscuous mode.
L3 mode is when a router overrides its default destination-based forwarding and redirects the interesting traffic to a SF. In legacy networks this could have been achieved with PBR or WCCP. In this case SF needs to be L2-attached to a router and all redirected packets have their destination MAC updated to that of a SF’s ingress interface.

Modern SDN networks make it really easy to modify forwarding behaviour of network elements, both physical and virtual. There is no need for policy-based routing or bump-in-the-wire designs anymore. When flow needs to be redirected to a SF on a virtual switch, all what’s required is a matching OpenFlow entry with a high enough priority. However redirecting traffic to a SF is just one part of the problem. Another part is how to make SFs smarter, to provide greater visibility of end-to-end service function path.

So far SFs have only been able to extract metadata from the packet itself. This limited the flexibility of SF logic and became computationally expensive in case many SFs need to access some L7 header information. Ideal way would be to have an additional header which can be used to read and write arbitrary information and pass it along the service function chain. RFC7665 defines requirements for “SFC Encapsulation” header which can be used to uniquely identify an instance of a chain as well as share metadata between all its elements. Neutron API refers to SFC encapsulation as correlation since its primary function is to identify a particular service function path. There are two implementations of SFC encapsulation in use today:

MPLS - used by current OVS agent driver (as of Pike). This method does not provide any means to share metadata and serves only for SFP identification. It is intended as an interim solution until NSH becomes available upstream in OVS.
NSH - complete implementation of SFC encapsulation defined in RFC7665. This method is currently implemented in Opendaylight where NSH is used as a shim between VXLAN-GPE and the encapsulated packet

It should be noted that the new approach with SFC encapsulation still allows for legacy, non-SFC-aware SFs to be chained. In this case SFC encapsulation is stripped off the packet by an “SFC proxy” before the packet is sent to the ingress port of a service function. All logical elements forming an SFC forwarding pipeline, including SFC proxy, Classifier and Forwarder, are implemented inside the same OVS bridges (br-int and br-tun) used by vanilla OVS-agent driver.

Configuring Neutron SFC

We’ll pick up where we left off in the previous post. All Neutron and ML2 configuration files have already been updated thanks to the enable_sfc="yes" setting in the global Kolla-Ansible configuration file. If not, you can change it in /etc/kolla/globals.yaml and re-run kolla-ansible deployment script.

First, let’s generate OpenStack credentials using a post-deployment script. We later can use a default bootstrap script to downloads the cirros image and set up some basic networking and security rules.

kolla-ansible post-deploy
source /etc/kolla/admin-openrc.sh
/usr/share/kolla-ansible/init-runonce

The goal for this post is to create a simple uni-directional SFC to steer the ICMP requests from VM1 to its default gateway through another VM that will be playing the role of a firewall.

The network was already created by the bootstrap script so all what we have to do is create a test VM. I’m creating a port in a separate step simply so that I can refer to it by name instead of UUID.

openstack port create --network demo-net P0
openstack server create --image cirros --flavor m1.tiny --port P0 VM1

I’ll go over all the necessary steps to setup SFC, but will only provide a brief explanation. Refer to the official OpenStack Networking Guide for a complete SFC configuration guide.

First, let’s create a FW VM with two ports - P1 and P2.

openstack port create --network demo-net P1
openstack port create --network demo-net P2
openstack server create --image cirros --flavor m1.tiny --port P1 --port P2 FW

Next, we need create an ingress/egress port pair and assign it to a port pair group. The default setting for correlation in a port pair (not shown) is none. That means that SFC encapsulation header (MPLS) will get stripped before the packet is sent to P1.

openstack sfc port pair create --ingress P1 --egress P2 PPAIR
openstack sfc port pair group create --port-pair PPAIR PPGROUP

Port pair group also allows to specify the L2-L4 headers which to use for load-balancing in OpenFlow groups, overriding the default behaviour described in the next section.

Another required element is a flow classifier. We will be redirecting ICMP traffic coming from VM1’s port P0

openstack sfc flow classifier create --protocol icmp --logical-source-port P0 FLOW-ICMP

Finally, we can tie together flow classifier with a previously created port pair group. The default setting for correlation (not shown again) in this case is mpls. That means that each chain will have its own unique MPLS label to be used as an SFC encapsulation.

openstack sfc port chain create --port-pair-group PPGROUP --flow-classifier FLOW-ICMP PCHAIN

That’s all the configuration needed to setup SFC. However if you login VM1’s console and try pinging default gateway, it will fail. Next, I’m going to give a quick demo of how to use a real-time network analyzer tool called Skydive to troubleshoot this issue.

Using Skydive to troubleshoot SFC

Skydive is a new open-source distributed network probing and traffic analyzing tool. It consists of a set of agents running on compute nodes, collecting topology and flow information and forwarding it to a central element for analysis.

The idea of using Skydive to analyze and track SFC is not new. In fact, for anyone interested in this topic I highly recommend the following blogpost. In my case I’ll show how to use Skydive from a more practical perspective - troubleshooting multiple SFC issues.

Skydive CLI client is available inside the skydive_analyzer container. We need to start an interactive bash session inside this container and set some environment variables:

docker exec -it skydive_analyzer bash
export SKYDIVE_ANALYZERS=192.168.133.100:8085
export SKYDIVE_USERNAME=admin
export SKYDIVE_PASSWORD=admin

The first thing we can do to troubleshoot is see if ICMP traffic is entering the ingress port of the FW VM. Based on the output of openstack port list command I know that P1 has got an IP of 10.0.0.8. Let’s if we can identify a tap port corresponding to P1:

skydive client topology query --gremlin "G.V().Has('Neutron.IPs', '10.0.0.8', 'Type', 'tun').Values('Neutron')"
{
  "IPs": "10.0.0.8",
  "NetworkID": "8eabb451-b026-417c-b54b-8e79ee6e71c3",
  "NetworkName": "demo-net",
  "PortID": "e6334df9-a5c4-4e86-a5f3-671760c2bbbe",
  "TenantID": "bd5829e0cb5b40b68ab4f8e7dc68b14d"
}

The output above proves that skydive agent has successfully read the configuration of the port and we can start a capture on that object to see any packets arriving on P1.

skydive client capture create --gremlin "G.V().Has('Neutron.IPs', '10.0.0.8', 'Type', 'tun')"
skydive client topology query --gremlin "G.V().Has('Neutron.IPs', '10.0.0.8', 'Type', 'tun').Flows().Has('Application','ICMPv4').Values('Metric.ABPackets')"
[
  7
]

If you watch the last command for several seconds you should see that the number in brackets is increasing. That means that packets are hitting the ingress port of the FW VM. Now let’s repeat the same test on egress port P2.

skydive client capture create --gremlin "G.V().Has('Neutron.IPs', '10.0.0.4', 'Type', 'tun')"
skydive client topology query --gremlin "G.V().Has('Neutron.IPs', '10.0.0.4', 'Type', 'tun').Flows()"
[]

The output above tells us that there are no packets coming out of the FW VM. This is expected since we haven’t done any changes to the blank cirros image to make it forward the packets between the two interfaces. If we examine the IP configuration of the FW VM, we would see that it doesn’t have an IP address configured on the second interface. We would also need to create a source-based routing policy to force all traffic from VM1 (10.0.0.6) to egress via interface eth2 and make sure IP forwarding is turned on. The following commands would need to be executed on FW VM:

sudo cirros-dhcpc up eth1
sudo ip rule add from 10.0.0.6 table default
sudo ip route add default via 10.0.0.1 dev eth1 table default
sudo sysctl -w net.ipv4.ip_forward=1

Having done that, we should see some packets coming out of egress port P2.

skydive client topology query --gremlin "G.V().Has('Neutron.IPs', '10.0.0.4', 'Type', 'tun').Flows().Has('Application','ICMPv4').Values('Metric.ABPackets')"
[
  7
]

However form the VM1’s perspective the ping is still failing. Next step would be to see if the packets are hitting the integration bridge that port P2 is attached to:

skydive client capture create --gremlin "G.V().Has('Neutron.IPs', '10.0.0.4', 'Type', 'veth')"
skydive client topology query --gremlin "G.V().Has('Neutron.IPs', '10.0.0.4', 'Type', 'veth').Flows()"
[]

No packets means they are getting dropped somewhere between the P2 and the integration bridge. This can only be done by security groups. In fact, source MAC/IP anti-spoofing is enabled by default which would only allow packets matching the source MAC/IP addresses assigned to P2 and would drop any packets coming from VM1’s IP address. The easiest fix would be to disable security groups for P2 completely:

openstack port set --no-security-group --disable-port-security P2

After this step the counters should start incrementing and the ping from VM1 to its default gateway is resumed.

skydive client topology query --gremlin "G.V().Has('Neutron.IPs', '10.0.0.4', 'Type', 'veth').Flows().Has('Application','ICMPv4').Values('Metric.ABPackets')"
[
  79
]

SFC implementation in OVS forwarding pipeline

The only element being affected in our case (both VM1 and FW are on the same compute node) is the integration bridge. Refer to my older post about vanilla OpenStack networking for a refresher of the vanilla OVS-agent architecture.

Normally, I would start by collecting all port and flow details from the integration bridge with the following commands:

ovs-ofctl dump-ports-desc br-int  | grep addr
ovs-ofctl dump-flows br-int | cut -d ',' -f3-

However, for the sake of brevity, I will omit the actual outputs and only show graphical representation of forwarding tables and packet flows. The tables below have two columns - first showing what is being matched and second showing the resulting action. Let’s start with the OpenFlow rules in an integration bridge before SFC is configured:

As we can see, the table structure is quite simple, since integration bridge mostly relies on data-plane MAC learning. A couple of MAC and ARP anti-spoofing tables will check the validity of a packet and send it to table 60 where NORMAL action will trigger the “flood-and-learn” behaviour. Therefore, an ICMP packet coming from VM1 will take the following path:

After we’ve configured SFC, the forwarding pipeline is changed and now looks like this:

First, we can see that table 0 acts as a classifier, by redirecting the “interesting” packets towards group 1. This groups is an OpenFlow Group of type select, which load-balances traffic between multiple destinations. By default OVS will use a combination of L2-L4 header as described here to calculate a hash which determines the output bucket, similar to how per-flow load-balancing works in traditional routers and switches. This behaviour can be overridden with a specific set of headers in lb_fields setting of a port pair group.

In our case we’ve only got a single SF, so the packet gets its destination MAC updated to that of SF’s ingress port and is forwarded to a new table 5. Table 5 is where all packets destined for a SF are aggregated with a single MPLS label which uniquely identifies the service function path. The packet is then forwarded to table 10, which I’ve called SFC Ingress. This is where the packets are distributed to SF’s ingress ports based on the assigned MPLS label.

After being processed by a SF, the packet leaves the egress port and re-enters the integration bridge. This time table 0 knows that the packet has already been processed by a SF and, since the anti-spoofing rules have been disabled, simply floods the packet out of all ports in the same VLAN. The packet gets flooded to the tunnel bridge where it gets replicated and delivered to the qrouter sitting on the controller node as per the default behaviour.

Upcoming enhancements

SFC is a pretty vast topic and is still under active development. Some of the upcoming enhancement to the current implementation of SFC will include:

NSH header for SFC correlation
TAP functionality which can replace the separate Tap-as-a-service OpenStack project
Service graphs allowing multiple chains to be interconnected to create more complex service chain scenarios

Coming Up

SFC is one of the major features in Telco SDN and, like many things, it’s not meant to be configured manually. In fact, Telco SDN have their own framework for management and orchestration of VNFs (a.k.a. VMs) and VNF forwarding graphs (a.k.a. SFCs) called ETSI MANO. As it is expected from a Telco standard, it abounds with acronyms and confuses the hell out of anyone who’s name is not on the list of authors or contributors. That’s why in the next post I will try to provide a brief overview of what Telco SDN is and use Tacker, a software implementation of NFVO and VNFM, to automatically build a firewall VNF and provision a SFC, similar to what has been done in this post manually.

Openstack SDN - Building a Containerized OpenStack Lab

2017-09-08T00:00:00+01:00

I’m returning to my OpenStack SDN series to explore some of the new platform features like service function chaining, network service orchestration, intent-based networking and dynamic WAN routing. To kick things off I’m going to demonstrate my new fully-containerized OpenStack Lab that I’ve built using an OpenStack project called Kolla.

For quite a long time installation and deployment have been deemed as major barriers for OpenStack adoption. The classic “install everything manually” approach could only work in small production or lab environments and the ever increasing number of project under the “Big Tent” made service-by-service installation infeasible. This led to the rise of automated installers that over time evolved from a simple collection of scripts to container management systems.

Evolution of automated OpenStack installers

The first generation of automated installers were simple utilities that tied together a collection of Puppet/Chef/Ansible scripts. Some of these tools could do baremetal server provisioning through Cobbler or Ironic (Fuel, Compass) and some relied on server operating system to be pre-installed (Devstack, Packstack). In either case the packages were pulled from the Internet or local repository every time the installer ran.

The biggest problem with the above approach is the time it takes to re-deploy, upgrade or scale the existing environment. Even for relatively small environments it could be hours before all packages are downloaded, installed and configured. One of the ways to tackle this is to pre-build an operating system with all the necessary packages and only use Puppet/Chef/Ansible to change configuration files and turn services on and off. Redhat’s TripleO is one example of this approach. It uses a “golden image” with pre-installed OpenStack packages, which is dd-written bit-by-bit onto the baremetal server’s disk. The undercloud then decides which services to turn on based on the overcloud server’s role.

Another big problem with most of the existing deployment methods was that, despite their microservices architecture, all OpenStack services were deployed as static packages on top of a shared operating system. This made the ongoing operations, troubleshooting and ugprades really difficult. The obvious thing to do would be to have all OpenStack services (e.g. Neutron, Keyston, Nova) deployed as containers and managed by a container management system. The first company to implement that, as far as I know, was Canonical. The deployment process is quite complicated, however the end result is a highly flexible OpenStack cloud deployed using LXC containers, managed and orchestrated by Juju controller.

Today (September 2017) deploying OpenStack services as containers is becoming mainstream and in this post I’ll show how to use Kolla to build container images and Kolla-Ansible to deploy them on a pair of “baremetal” VMs.

Lab overview

My lab consists of a single controller and a single compute VM. The goal was to make them as small as possible so they could run on a laptop with limited resources. Both VMs are connected to three VM bridged networks - provisioning, management and external VM access.

I’ve written some bash and Ansible scripts to automate the deployment of VMs on top of any Fedora derivative (e.g. Centos7). These scripts should be run directly from the hypervisor:

git clone https://github.com/networkop/kolla-odl-bgpvpn.git && cd kolla-odl-bgpvpn
./1-create.sh do
./2-bootstrap.sh do

The first bash script downloads the VM OS (Centos7), creates two blank VMs and sets up a local Docker registry. The second script installs all the dependencies, including Docker and Ansible.

Building OpenStack docker containers with Kolla

The first step in Kolla deployment workflow is deciding where to get the Docker images. Kolla maintains a Docker Hub registry with container images built for every major OpenStack release. The easiest way to get them would be to pull the images from Docker hub either directly or via a pull-through caching registry.

In my case I needed to build the latest version of OpenStack packages, not just the latest major release. I also wanted to build a few additional, non-Openstack images (Opendaylight and Quagga). Because of that I had to build all Docker images locally and push them into a local docker registry. The procedure to build container images is very well documented in the official Kolla image building guide. I’ve modified it slightly to include the Quagga Dockerfile and automated it so that the whole process can be run with a single command:

./3-build.sh do

This step can take quite a long time (anything from 1 to 4 hours depending on the network and disk I/O speed), however, once it’s been done these container images can be used to deploy as many OpenStack instances as necessary.

Deploying OpenStack with Kolla-Ansible

The next step in OpenStack deployment workflow is to deploy Docker images on target hosts. Kolla-Ansible is a highly customizable OpenStack deployment tool that is also extemely easy to use, at least for people familiar with Ansible. There are two main sources of information for Kolla-Ansible:

Global configuration file (/etc/kolla/globals.yaml), which contains some of the most common customization options
Ansible inventory file (/usr/share/kolla-ansible/ansible/inventory/*), which maps OpenStack packages to target deployment hosts

To get started with Kolla-Ansible all what it takes is a few modifications to the global configuration file to make sure that network settings match the underlying OS interface configuration and an update to the inventory file to point it to the correct deployment hosts. In my case I’m making additional changes to enable SFC, Skydive and Tacker and adding files for Quagga container, all of which can be done with the following command:

./4-deploy.sh do

The best thing about this method of deployment is that it takes (in my case) under 5 minutes to get the full OpenStack cloud from scratch. That means if I break something or want to redeploy with some major changes (add/remove Opendaylight), all what I have to do is destroy the existing deployment (approx. 1 minute), modify global configuration file and re-deploy OpenStack. This makes Kolla-Ansible an ideal choice for my lab environment.

Overview of containerized Openstack

Once the deployment has been completed, we should be able to see a number of running Docker containers - one for each OpenStack process.

root@compute-1# docker ps
CONTAINER ID        IMAGE                                                                 COMMAND             CREATED             STATUS              PORTS               NAMES
0bb8a8eeb1a9        172.26.0.1:5000/kolla/centos-source-skydive-agent:5.0.0               "kolla_start"       3 days ago          Up 3 days                               skydive_agent
63b5b643dfae        172.26.0.1:5000/kolla/centos-source-neutron-openvswitch-agent:5.0.0   "kolla_start"       3 days ago          Up 3 days                               neutron_openvswitch_agent
f6f74c5982cb        172.26.0.1:5000/kolla/centos-source-openvswitch-vswitchd:5.0.0        "kolla_start"       3 days ago          Up 3 days                               openvswitch_vswitchd
3078421a3892        172.26.0.1:5000/kolla/centos-source-openvswitch-db-server:5.0.0       "kolla_start"       3 days ago          Up 3 days                               openvswitch_db
9146c16d561b        172.26.0.1:5000/kolla/centos-source-nova-compute:5.0.0                "kolla_start"       3 days ago          Up 3 days                               nova_compute
8079f840627f        172.26.0.1:5000/kolla/centos-source-nova-libvirt:5.0.0                "kolla_start"       3 days ago          Up 3 days                               nova_libvirt
220d617d31a5        172.26.0.1:5000/kolla/centos-source-nova-ssh:5.0.0                    "kolla_start"       3 days ago          Up 3 days                               nova_ssh
743ce602d485        172.26.0.1:5000/kolla/centos-source-cron:5.0.0                        "kolla_start"       3 days ago          Up 3 days                               cron
8b71f08d2781        172.26.0.1:5000/kolla/centos-source-kolla-toolbox:5.0.0               "kolla_start"       3 days ago          Up 3 days                               kolla_toolbox
f76d0a7fcf2a        172.26.0.1:5000/kolla/centos-source-fluentd:5.0.0                     "kolla_start"       3 days ago          Up 3 days                               fluentd

All the standard docker tools are available to interact with those containers. For example, this is how we can see what processes are running inside a container:

root@compute-1# docker exec nova_compute ps -www aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
nova         1  0.0  0.0    188     4 pts/3    Ss+  Sep04   0:00 /usr/local/bin/dumb-init /bin/bash /usr/local/bin/kolla_start
nova         7  0.7  1.3 2292560 134896 ?      Ssl  Sep04  35:33 /var/lib/kolla/venv/bin/python /var/lib/kolla/venv/bin/nova-compute
root        86  0.0  0.3 179816 32900 ?        S    Sep05   0:00 /var/lib/kolla/venv/bin/python /var/lib/kolla/venv/bin/privsep-helper --config-file /etc/nova/nova.conf --privsep_context vif_plug_ovs.privsep.vif_plug --privsep_sock_path /tmp/tmpFvP0GS/privsep.sock

Some of you may have noticed that none of the containers expose any ports. So how do they communicate? The answer is very simple - all containers run in a host networking mode, effectively disabling any network isolation and giving all contaners access to TCP/IP stacks of their Docker hosts. This is a simple way to avoid having to deal with Docker networking complexities, while at the same time preserving the immutability and portability of Docker containers.

All containers are configured to restart in case of a failure, however there’s no CMS to provide full lifecycle management and advanced scheduling. If upgrade of scale-in/out is needed, Kolla-Ansible will have to be re-run with updated configuration options. There is sibling project called Kolla-Kubernetes (still under developement), that’s designed to address some of the mentioned shortcomings.

Coming up

Now that the lab is up we can start exploring the new OpenStack SDN features. In the next post I’ll have a close look at Neutron’s SFC feature, how to configure it and how it’s been implemented in OVS forwarding pipeline.

Linux SSH Session Management for Network Engineers

2017-05-12T00:00:00+01:00

A short post about how I do SSH session management for network devices in Linux

A few weeks ago I bought myself a new Dell XPS-13 and decided for the n-th time to go all-in Linux, that is to have Linux as the main and only laptop OS. Since most of my Linux experience is with Fedora-family distros, I quickly installed Fedora-25 and embarked on a long and painful journey of getting out of my Windows comfort zone and re-establishing it in Linux. One of the most important aspects for me, as a network engineer, is to have a streamlined process of accessing network devices. In Windows I was using MTPutty and it helped define my expectations of an ideal SSH session manager:

I want a multi-tab terminal with the ability to switch between tabs quickly - default (GNOME) terminal does that out-of-the box with no extra modifications
I want to login the device without having to enter a password - Not available by default but is possible with some dirty expect hacks.
I want my SSH sessions to be organised in a hierarchical manner with groups representing various administrative domains - customer A, local VMs, lab.

Although GNOME terminal looked like a very good option, it didn’t meet all of my requirements. I briefly looked and PAC Manager and GNOME Connection Manager but quickly dismissed them due to their ugliness and clunkiness. Ideally I wanted to keep using GNOME terminal as the main terminal emulator, without having to configure and rely on other 3rd party apps. I also didn’t want to wrap my SSH session in expect as I didn’t want my password to be pasted in my screen every time I cat a file containing the trigger keyword Password:. I’ve finally managed to make everything work inside the native GNOME terminal and this post is a documentation of my approach.

1. Install ssh-copy-net

I’ve written a little tool that uses Netmiko to install (and remove) public SSH keys onto network devices. Assuming python-pip is already installed here’s what’s required to download and install ssh-copy-net:

$ pip install git+https://github.com/networkop/ssh-copy-net.git

Its functionality mimics the one of ssh-copy-id, so the next step is always to upload the public key to the device:

$ ssh-copy-net 10.6.142.1 juniper
Username: admin
Password:
All Done!

2. Define SSH config for network devices

OpenSSH client config file provides a nice way of managing user’s SSH sessions. Configuration file allows you to define per-host SSH settings including username, port forwarding options, key checking flags etc. In my case all what I had to do was define IP addresses of my network devices:

/home/null/.ssh/config

Host srx
  HostName 10.6.142.1

Host arista
  HostName 10.6.142.2

Now I am able to login the device by simply typing its name:

$ ssh arista
Last login: Sun May  7 10:57:30 2017 from 10.1.2.3
arista-1>

3. Define zsh aliases

The final step is session organisations. For that I’ve decided to use zsh aliases and have device groups encoded in the alias name, separated by dashes. For example, if my SRX device was in the lab and Arista was in Site-51 of Customer-A this is how I would write my aliases:

/home/null/.zshrc

alias lab-srx='ssh srx'
alias customer-a-site-51-arista='ssh arista'

4. Multi-pane sessions with tmux

As a network engineer, I often find myself troubleshooting issues spanning multiple devices, which is why I need multiple tabs inside a single terminal window. Simply pressing Ctrl+T in GNOME terminal opens a new tab and I can switch between tabs using Alt+[1-9]. However what would be really nice is to have a couple of tabs opened side by side so that I can see the logs and compare output on a number of devices at the same time. This is where tmux comes in. It can do much more than this, but I simply use it to have multiple panes inside the same terminal tab:

Here’s an example of my tmux configuration file:

/home/null/.tmux.conf

# Automatically set window title
set-window-option -g automatic-rename on
set-option -g set-titles on

# Use Alt-arrow keys without prefix key to switch panes
bind -n M-Left select-pane -L
bind -n M-Right select-pane -R
bind -n M-Up select-pane -U
bind -n M-Down select-pane -D

# Pane splitting keys
bind-key v split-window -h
bind-key s split-window -v

# New key-binding to reset hung SSH sessions
bind-key k respawn-pane -k

# Easy fix for arrow keys inside ssh
set -g default-terminal "xterm"

# Enable mouse mode (tmux 2.1 and above)
set -g mouse on

# Reload tmux config
bind r source-file ~/.tmux.conf

# No delay for escape key press
set -sg escape-time 0

Demo

Now having all the above defined and with the help of zsh command autocompletion, I can login the device with just a few keypresses (shown in square brackets below):

$ lab  [TAB]
$ lab-  [TAB]
lab-srx
$ lab-  [s][TAB]
$ lab-srx  [ENTER]
--- JUNOS 12.3X48-D30.7 built 2016-04-28 22:37:34 UTC
{primary:node0}
null@srx>

Press Ctrl+B v to split the terminal window vertically:

$ customer [TAB]
$ customer- [TAB]
customer-a-site-51-arista
$ customer- [a][TAB]
$ customer-a-arista [ENTER]
Last login: Thu May 11 15:28:03 2017 from 10.1.2.3
arista-1>

An so on and so forth…

Using YANG Models in Ansible to Configure and Verify State of IOS-XE and JUNOS Devices

2017-04-04T00:00:00+01:00

In this post I will show how to use IETF, OpenConfig and vendor-specific YANG models in Ansible to configure BGP peering and verify state of physical interfaces between IOS-XE and JUNOS devices.

The idea of using Ansible for configuration changes and state verification is not new. However the approach I’m going to demonstrate in this post, using YANG and NETCONF, will have a few notable differences:

I will not use any templates and absolutely no XML/JSON for device config generation
All changes will be pushed through a single, vendor and model-independent Ansible module
State verification will be done with no pattern-matching or screen-scraping
All configuration and operational state will be based on a couple of YAML files
To demonstrate the model-agnostic behaviour I will use a mixture of vendor’s native, IETF and OpenConfig YANG models

I hope this promise is exciting enough so without further ado, let’s get cracking.

Environment setup

The test environment will consist of a single instance of CSR1000v running IOS-XE version 16.4.1 and a single instance of vMX running JUNOS version 17.1R1.8. The VMs containing the two devices are deployed within a single hypervisor and connected with one interface to the management network and back-to-back with the second pair of interfaces for BGP peering.

Each device contains some basic initial configuration to allow it be reachable from the Ansible server.

IOS-XE initial configuration

interface GigabitEthernet1
ip address 192.168.145.51 255.255.255.0
!
netconf-yang
netconf-yang cisco-odm polling enable
netconf-yang cisco-odm actions parse Interfaces

vMX configuration is quite similar. Static MAC address is required in order for ge interfaces to work.

vMX initial configuration

set system login user admin class super password admin123
set system services netconf
set interface fxp0 unit 0 family inet address 192.168.145.53/24
set interface ge-0/0/0 mac 00:0c:29:fc:1a:b7

Ansible playbook configuration

My Ansible-101 repository contains two plays - one for configuration and one for state verification. The local inventory file contains details about the two devices along with the login credentials. All the work will be performed by a custom Ansible module stored in the ./library directory. This module is a wrapper for a ydk_yaml module described in my previous post. I had to heavily modify the original ydk_yaml module to work around some Ansible limitations, like the lack of support for set data structures.
This custom Ansible module also relies on a number of YDK Python bindings to be pre-installed. Refer to my YAML, Operational and JUNOS repositories for the instructions on how to install those modules.
The desired configuration and expected operational state are documented inside a couple of device-specific host variable files. For each device there is a configuration file config.yaml, describing the desired configuration state. For IOS-XE there is an additional file verify.yaml, describing the expected operational state using the IETF interface YANG model (I couldn’t find how to get the IETF or OpenConfig state models to work on Juniper).
All of these files follow the same structure:

Root container can be either config or verify and defines how the enclosed data is supposed to be used
First nested container has to match the top-most container of a YANG model. For example it could be bgp-state for cisco-bgp-state.yang or openconfig-bgp for openconfig-bgp.yang model
The remaining nested data has to follow the structure of the original YANG model as described in my previous post.

Here’s how IOS-XE will be configured, using IETF interfaca YANG models (to unshut the interface) and Cisco’s native YANG model for interface IP and BGP settings:

---
config:
  interfaces:
    interface:
      - name: GigabitEthernet3
        enabled: true
  native:
    interface:
      gigabitethernet:
        - name: '3'
          description: P2P link
          ip:
            address:
              primary:
                address: 12.12.12.1
                mask: 255.255.255.0
      loopback:
        - name: 0
          description: ROUTER ID
          ip:
            address:
              primary:
                address: 1.1.1.1
                mask: 255.255.255.255
    router:
      bgp:
        - id: 65111
          bgp:
            router_id: 1.1.1.1
          neighbor:
            - id: 12.12.12.2
              remote_as: 65222
          redistribute:
            connected:
              empty: empty

For JUNOS configuration, instead of the default humongous native model, I’ll use a set of much more light-weight OpenConfig YANG models to configure interfaces, BGP and redistribution policies:

---
config:
  openconfig-interfaces:
    interface:
      - name: ge-0/0/0
        subinterfaces:
          subinterface:
            - index: 0
              ipv4:
                addresses:
                  address:
                    - ip: 12.12.12.2/24
                      config:
                        ip: 12.12.2.2
                        prefix_length: 24
      - name: lo0
        subinterfaces:
          subinterface:
            - index: 0
              ipv4:
                addresses:
                  address:
                    - ip: 2.2.2.2/32
                      config:
                        ip: 2.2.2.2
                        prefix_length: 32
  openconfig-policy:
    policy_definitions:
      policy_definition:
        - name: CONNECTED->BGP
          statements:
            statement:
              - name: Loopback0
                conditions:
                  match_interface:
                    config:
                      interface: lo0
                      subinterface: 0
                actions:
                  config:
                    accept_route: empty
  openconfig-bgp:
    global_:
      config:
        as_: 65222
    neighbors:
      neighbor:
        - neighbor_address: 12.12.12.1
          config:
            peer_group: YANG
            peer_as: 65111
    peer_groups:
      peer_group:
        - peer_group_name: YANG
          config:
            peer_as: 65111
          apply_policy:
            config:
              export_policy:
                - CONNECTED->BGP

Configuration

Both devices now can be configured with just a single command:

ansible-playbook config.yaml

Behind the scenes, Ansible calls my custom ydk_module and passes to it the full configuration state and device credentials. This module then constructs an empty YDK binding based on the name of a YANG model and populates it recursively with the data from the config container. Finally, it pushes the data to the device with the help of YDK NETCONF service provider.

Verification

There’s one side to YANG which I have carefully avoided until now and it’s operational state models. These YANG models are built similarly to configuration models, but with a different goal - to extract the running state from a device. The reason why I’ve avoided them is that, unlike the configuration models, the current support for state models is limited and somewhat brittle.
For example, JUNOS natively only supports state models as RPCs, where each RPC represents a certain show command which, I assume, when passed to the devices gets evaluated, its output parsed and result returned back to the client. With IOX-XE things are a little better with a few of the operational models available in the current 16.4 release. You can check out my Github repo for some examples of how to check the interface and BGP neighbor state between the two IOS-XE devices. However, most of the models are still missing (I’m not counting the MIB-mapped YANG models) in the current release. The next few releases, though, are promised to come with an improved state model support, including some OpenConfig models, which is going to be super cool.
So in this post, since I couldn’t get JUNOS OpenConfig models report any state and my IOS-XE BGP state model wouldn’t return any output unless the BGP peering was with another Cisco device or in the Idle state, I’m going to have to resort to simply checking the state of physical interfaces. This is how a sample operational state file would look like (question marks are YAML’s special notation for sets which is how I decided to encode Enum data type):

---
verify:
  interfaces-state:
    interface:
      - name: GigabitEthernet3
        oper_status:
          ? up
      - name: Loopback0
        oper_status:
          ? up
      - name: GigabitEthernet2
        oper_status:
          ? down

Once again, all expected state can be verified with a single command:

ansible-playbook verify.yaml

If the state defined in that YAML file matches the data returned by the IOS-XE device, the playbook completes successfully. You can check that it works by shutting down one of the GigabitEthernet3 or Loopback0 interfaces and observing how Ansible module returns an error.

Outro

Now that I’ve come to the end of my YANG series of posts I feel like I need to provide some concise and critical summary of everything I’ve been through. However, if there’s one thing I’ve learned in the last couple of months about YANG, it’s that things are changing very rapidly. Both Cisco and Juniper are working hard introducing new models and improving support for the existing ones. So one thing to keep in mind, if you’re reading this post a few months after it was published (April 2017), is that some or most of the above limitations may not exist and it’s always worth checking what the latest software release has to offer.

Finally, I wanted to say that I’m a strong believer that YANG models are the way forward for network device configuration and state verification, despite the timid scepticism of the networking industry. I think that there are two things that may improve the industry’s perception of YANG and help increase its adoption:

Support from networking vendors - we’ve already seen Cisco changing by introducing YANG support on IOS-XE instead of producing another dubious One-PK clone. So big thanks to them and I hope that other vendors will follow suit.
Tools - this part, IMHO, is the most crucial. In order for people to start using YANG models we have to have the right tools that would be versatile enough to allow network engineers to be limited only by their imagination and at the same time be as robust as the CLI. So I wanted to give a big shout out to all the people contributing to open-source projects like pyang, YDK and many others that I have missed or don’t know about. You’re doing a great job guys, don’s stop.

Configuring Cisco IOS XE With YANG-based YAML Files

2017-03-13T00:00:00+00:00

One thing that puts a lot of network engineers off NETCONF and YANG is the complexity of the device configuration process. Even the simplest change involves multiple tools and requires some knowledge of XML. In this post I will show how to use simple, human-readable YAML configuration files to instantiate YANG models and push them down to network devices using a single command.

XML, just like many more structured data formats, was not designed to be human-friendly. That’s why many network engineers lose interest in YANG as soon as the conversation gets to the XML part. JSON is a much more human-readable alternative, however very few devices support RESTCONF, and the ones that do may have buggy implementations. At the same time, a lot of network engineers have happily embraced Ansible, which extensively uses YAML. That’s why I’ve decided to write a Python module that would program network devices using YANG and NETCONF according to configuration data described in a YAML format.

In the previous post I have introduced a new open-source tool called YDK, designed to create API bindings for YANG models and interact with network devices using NETCONF or RESTCONF protocols. I have also mentioned that I would still prefer to use pyangbind along with other open-source tools to achieve the same functionality. Now, two weeks later, I must admin I have been converted. Initially, I was planning to write a simple REST API client to interact with RESTCONF interface of IOS XE, create an API binding with pyangbind, use it to produce the JSON output, convert it to XML and send it to the device, similar to what I’ve described in my netconf and restconf posts. However, I’ve realised that YDK can already do all what I need with just a few function calls. All what I’ve got left to do is create a wrapper module to consume the YAML data and use it to automatically populate YDK bindings.

This post will be mostly about the internal structure of this wrapper module I call ydk_yaml.py, which will serve as a base library for a YANG Ansible module, which I will describe in my next post. This post will be very programming-oriented, I’ll start with a quick overview of some of the programming concepts being used by the module and then move on to the details of module implementation. Those who are not interested in technical details can jump straight to the examples sections at the end of this post for a quick demonstration of how it works.

Recursion

One of the main tasks of ydk_yaml.py module is to be able parse a YAML data structure. This data structure, when loaded into Python, is stored as a collection of Python objects like dictionaries, lists and primitive data types like strings, integers and booleans. One key property of YAML data structures is that they can be represented as trees and parsing trees is a very well-known programming problem.

After having completed this programming course I fell in love with functional programming and recursions. Every problem I see, I try to solve with a recursive function. Recursions are very interesting in a way that they are very difficult to understand but relatively easy to write. Any recursive function will consist of a number of if/then/else conditional statements. The first one (or few) if statements are called the base of a recursion - this is where recursion stops and the value is returned to the outer function. The remaining few if statements will implement the recursion by calling the same function with a reduced input. You can find a much better explanation of recursive functions here. For now, let’s consider the problem of parsing the following tree-like data structure:

{ 'parent': {
    'child_1': {
      'leaf_1': 'value_1'
    },
    'child_1': 'value_2'
    }
}

Recursive function to parse this data structure written in a pseudo-language will look something like this:

def recursion(input_key, input_value):
  if input_value is String:
    return process(input_value)
  elif input_value is Dictonary:
    for key, value in input_value.keys_and_values():
      return recursion(key, value)

The beauty of recursive functions is that they are capable parsing data structures of arbitrary complexity. That means if we had 1000 randomly nested child elements in the parent data structure, they all could have been parsed by the same 6-line function.

Introspection

Introspection refers to the ability of Python to examine objects at runtime. It can be useful when dealing with object of arbitrary structure, e.g. a YAML document. Introspection is used whenever there is a need for a function to behave differently based on the runtime data. In the above pseudo-language example, the two conditional statements are the examples of introspection. Whenever we need to determine the type of an object in Python we can either use a built-in function type(obj) which returns the type of an object or isinstance(obj, type) which checks if the object is an instance or a descendant of a particular type. This is how we can re-write the above two conditional statements using real Python:

if isinstance(input_value, str):
  print('input value is a string')
elif isinstance(input_value, dict):
  print('intput value is a dictionary')

Metaprogramming

Another programming concept used in my Python module is metaprogramming. Metaprogramming, in general, refers to an ability of programs to write themselves. This is what compilers normally do when they read the program written in a higher-level language and translate it to a lower-level language, like assembler. What I’ve used in my module is the simplest version of metaprogramming - dynamic getting and setting of object attributes. For example, this is how we would configure BGP using YDK Python binding, as described in my previous post:

bgp.id = 100
n = bgp.Neighbor()
n.id = '2.2.2.2'
n.remote_as = 65100
bgp.neighbor.append(n)

The same code could be re-written using the getattr and setattr method calls:

setattr(bgp, 'id', 100)
n = getattr(bgp, 'Neighbor')()
setattr(n, 'id', '2.2.2.2')
setattr(n, 'remote_as', 65100)
getattr(bgp, 'neighbor').append(n)

This is also very useful when working with arbitrary data structures and objects. In my case the goal was to write a module that would be completely independent of the structure of a particular YANG model, which means that I can not know the structure of the Python binding generated by YDK. However, I can “guess” the name of the attributes if I assume that my YAML document is structured exactly like the YANG model. This simple assumption allows me to implement YAML mapping for all possible YANG models with just a single function.

YANG mapping to YAML

As I’ve mentioned in my previous post, YANG is simply a way to define the structure of an XML document. At the same time, it is known that YANG-based XML can be mapped to JSON as described in this RFC. Since YAML is a superset of JSON, it’s easy to come up with a similar XML-to-YAML mapping convention. The following table contains the mapping between some of the most common YAML and YANG data structures and types:

YANG data	YAML representation
container	dictionary
container name	dictionary key
leaf name	dictionary key
leaf	dictionary value
list	list
string, bool, integer	string, bool, integer
empty	null

Using this table, it’s easy to map the YANG data model to a YAML document. Let me demonstrate it on IOS XE’s native OSPF data model. First, I’ve generated a tree representation of an OSPF data model using pyang:

pyang -f tree --tree-path "/native/router/ospf" ~/ydk-gen/gen-api/.cache/models/cisco_ios_xe@0.1.0/ned.yang -o ospf.tree

Next, I’ve trimmed it down to only contain the options that I would like to set and created a YAML document based on the model’s tree structure:

With the right knowledge of YANG model’s structure, it’s fairly easy to generate similar YAML configuration files for other configuration objects, like interface and BGP.

YANG instantiating function

At the heart of the ydk_yaml module is a single recursive function that traverses the input YAML data structure and uses it to instantiate the YDK-generated Python binding. Here is a simple, abridged version of the function that demonstrates the main logic.

def instantiate(binding, model_key, model_value):
    if any(isinstance(model_value, x) for x in [str, bool, int]):
        setattr(binding, model_key, model_value)
    elif isinstance(model_value, list):
        for el in model_value:
            getattr(binding, model_key).append(instantiate(binding, model_key, el))
    elif isinstance(model_value, dict):
        container_instance = getattr(binding, model_key)()
        for k, v in model_value.iteritems():
            instantiate(container_instance, k, v)
        setattr(binding, model_key, container_instance)

Most of it should already make sense based on what I’ve covered above. The first conditional statement is the base of the recursion and performs the action of setting the value of a YANG Leaf element. The second conditional statement takes care of a YANG List by traversing all its elements, instantiating them recursively, and appends the result to a YDK binding. The last elif statement creates a class instance for a YANG container, recursively populates its values and saves the final result inside a YDK binding.

The full version of this function covers a few extra corner cases and can be found here.

The YDK module wrapper

The final step is to write a wrapper class that would consume the YDK model binding along with the YAML data, and both instantiate and push the configuration down to the network device.

class YdkModel:

    def __init__(self, model, data):
        self.model = model
        self.data = data
        from ydk.models.cisco_ios_xe.ned import Native
        self.binding = Native()
        for k,v in self.data.iteritems():
            instantiate(self.binding, k, v)

    def action(self, crud_action, device):
        from ydk.services import CRUDService
        from ydk.providers import NetconfServiceProvider
        provider = NetconfServiceProvider(address=device['hostname'],
                                          port=device['port'],
                                          username=device['username'],
                                          password=device['password'],
                                          protocol='ssh')
        crud = CRUDService()
        crud_instance = getattr(crud, crud_action)
        crud_instance(provider, self.binding)
        provider.close()
        return

The structure of this class is pretty simple. The constructor instantiates a YDK native data model and calls the recursive instantiation function to populate the binding. The action method implements standard CRUD actions using the YDK’s NETCONF provider. The full version of this Python module can be found here.

Configuration examples

In my Github repo, I’ve included a few examples of how to configure Interface, OSPF and BGP settings of IOS XE device. A helper Python script 1_send_yaml.py accepts the YANG model name and the name of the YAML configuration file as the input. It then instantiates the YdkModel class and calls the create action to push the configuration to the device. Let’s assume that we have the following YAML configuration data saved in a bgp.yaml file:

---
router:
  bgp:
    - id: 100
      bgp:
        router_id: 1.1.1.1
        fast_external_fallover: null
        update_delay: 15
      neighbor:
        - id: 2.2.2.2
          remote_as: 200
        - id: 3.3.3.3
          remote_as: 300
      redistribute:
        connected: {}

To push this BGP configuration to the device all what I need to do is run the following command:

./1_send_yaml.py bgp bgp.yaml

The resulting configuration on IOS XE device would look like this:

router bgp 100
 bgp router-id 1.1.1.1
 bgp log-neighbor-changes
 bgp update-delay 15
 redistribute connected
 neighbor 2.2.2.2 remote-as 200
 neighbor 3.3.3.3 remote-as 300

To see more example, follow this link to my Github repo.

Configuring Cisco IOS XE With YDK and OpenDaylight

2017-02-22T00:00:00+00:00

Now it’s time to turn our gaze to the godfather of YANG models and one of the most famous open-source SDN controllers, OpenDaylight. In this post I’ll show how to connect Cisco IOS XE device to ODL and use Yang Development Kit to push a simple BGP configuration through ODL’s RESTCONF interface.

In the previous posts about NETCONF and RESTCONF I’ve demonstrated how to interact with Cisco IOS XE device directly from the Linux shell of my development VM. This approach works fine in some cases, e.g. whenever I setup a new DC fabric, I would make calls directly to the devices I’m configuring. However, it becomes impractical in the Ops world where change is constant and involves a large number of devices. This is where centralised service orchestrators come to the fore. The prime examples of such platforms are Network Services Orchestrator from Tail-f/Cisco and open-source project OpenDaylight. In this post we’ll concentrate on ODL and how to make it work with Cisco IOS XE. Additionally, I’ll show how to use an open-source tool YDK to generate Python bindings for native YANG models and how it compares with pyangbind.

OpenDaylight primer

OpenDaylight is a swiss army knife of SDN controllers. At the moment it is comprised of dozens of projects implementing all possible sorts of SDN functionality starting from Openflow controller all the way up to L3VPN orchestrator. ODL speaks most of the modern Southbound protocols like Openflow, SNMP, NETCONF and BGP. The brain of the controller is in the Service Abstraction Layer, a framework to model all network-related characteristics and properties. All logic inside SAL is modelled in YANG which is why I called it the godfather of YANG models. Towards the end users ODL exposes Java function calls for applications running on the same host and REST API for application running remotely.

OpenDaylight has several commercial offerings from companies involved in its development. Most notable ones are from Brocade and Cisco. Here I will allow myself a bit of a rant, feel free to skip it to go straight to the technical stuff.

One thing I find interesting is that Cisco are being so secretive about their Open SDN Controller, perhaps due to the earlier market pressure to come up with a single SDN story, but still have a very large number of contributors to this open-source project. It could be the case of having an egg in each basket, but the number of Cisco’s employees involved in ODL development is substantial. I wonder if, now that the use cases for ACI and ODL have finally formed and ACI still not showing the uptake originally expected, Cisco will change their strategy and start promoting ODL more aggressively, or at least stop hiding it deep in the bowels of cisco.com. Or, perhaps, it will always stay in the shade of Tail-f’s NSC and Insieme’s ACI and will be used only for customer with unique requirements, e.g. to have both OpenStack and network devices managed through the same controller.

Environment setup

We’ll use the same environment we’ve setup in the previous posts, consisting of a CSR1K and a Linux VM connected to the same network inside my hypervisor. IOS XE device needs to have netconf-yang configured in order to enable the northbound NETCONF interface.

On the same Linux VM, I’ve downloaded and launched the latest version of ODL (Boron-SR2), and enabled NETCONF and RESTCONF plugins.

unzip distribution-karaf-0.5.2-Boron-SR2.zip
mv distribution-karaf-0.5.2-Boron-SR2 odl-0.5.2
cd odl-0.5.2/
./bin/karaf
opendaylight-user@root>feature:install odl-netconf-connector-all
opendaylight-user@root>feature:install odl-restconf-all

We’ll use NETCONF to connect to Cisco IOS XE device and RESTCONF to interact with ODL from a Linux shell.

It might be useful to turn on logging in karaf console to catch any errors we might encounter later:

opendaylight-user@root>log:tail

Connecting IOS XE to ODL

According to ODL NETCONF user guide, in order to connect a new device to the controller, we need to create an XML document which will include the IP, port and user credentials of the IOS XE device. Here’s the excerpt from the full XML document:

 xmlns="urn:opendaylight:params:xml:ns:yang:controller:config">
   xmlns:prefix="urn:opendaylight:params:xml:ns:yang:controller:md:sal:connector:netconf">prefix:sal-netconf-connector
  CSR1K
   xmlns="urn:opendaylight:params:xml:ns:yang:controller:md:sal:connector:netconf">192.168.145.51
   xmlns="urn:opendaylight:params:xml:ns:yang:controller:md:sal:connector:netconf">830
   xmlns="urn:opendaylight:params:xml:ns:yang:controller:md:sal:connector:netconf">admin
   xmlns="urn:opendaylight:params:xml:ns:yang:controller:md:sal:connector:netconf">admin

Assuming this XML is saved in a file called new_device.xml.1, we can use curl to send it to ODL’s netconf-connector plugin:

curl -v -k -u admin:admin -H "Content-Type: application/xml" -X POST \
 http://localhost:8181/restconf/config/network-topology:network-topology\
 /topology/topology-netconf/node/controller-config/yang-ext:mount/config:modules\
  -d @new_device.xml.1

When the controller gets this information it will try to connect to the device via NETCONF and do the following three things:

Discover device capabilities advertised in the Hello message
Download all YANG models advertised by the device into the ./cache/schema directory
Go through all of the imports in each model and verify that they can be satisfied

After ODL downloads all of the 260 available models (can take up to 20 minutes) we will see the following errors in the karaf console:

Netconf device does not provide all yang models reported in hello message capabilities
Unable to build schema context, unsatisfied imports
Initialization in sal failed, disconnecting from device
No more sources for schema context

Due to inconsistencies between the advertised and the available models, ODL fails to build the full device YANG schema context, which ultimately results in inability to connect the device to the controller. However, we won’t need all of the 260 models advertised by the device. In fact, most of the configuration can be done through a single Cisco native YANG model, ned. With ODL it is possible to override the default capabilities advertised in the Hello message and “pin” only the ones that are going to be used. Assuming that ODL has downloaded most of the models at the previous step, we can simply tell it use the selected few with the following additions to the XML document:

 xmlns="urn:opendaylight:params:xml:ns:yang:controller:md:sal:connector:netconf">
    true
     xmlns="urn:opendaylight:params:xml:ns:yang:controller:md:sal:connector:netconf">
      urn:ietf:params:xml:ns:yang:ietf-inet-types?module=ietf-inet-types&revision=2013-07-15
    
     xmlns="urn:opendaylight:params:xml:ns:yang:controller:md:sal:connector:netconf">
      http://cisco.com/ns/yang/ned/ios?module=ned&revision=2016-10-24
    

Assuming the updated XML is saved in new_device.xml.2 file, the following command will update the current configuration of CSR1K device:

curl -v -k -u admin:admin -H "Content-Type: application/xml" -X PUT \
http://localhost:8181/restconf/config/network-topology:network-topology\
/topology/topology-netconf/node/controller-config\
/yang-ext:mount/config:modules/module\
/odl-sal-netconf-connector-cfg:sal-netconf-connector\
/CSR1K -d @new_device.xml.2

We can then verify that the device has been successfully mounted to the controller:

curl -v -k -u admin:admin http://localhost:8181/restconf/operational\
/network-topology:network-topology/ | python -m json.tool

The output should look similar to the following with the connection-status set to connected and no detected unavailable-capabilities:

"netconf-node-topology:connection-status": "connected",
"netconf-node-topology:host": "192.168.145.51",
"netconf-node-topology:port": 830,
"netconf-node-topology:unavailable-capabilities": {},
"node-id": "CSR1K"

At this point we should be able to interact with IOS XE’s native YANG model through ODL’s RESTCONF interface using the following URL

 http://localhost:8181/restconf/config/network-topology:network-topology\
 /topology/topology-netconf/node/CSR1K/yang-ext:mount/ned:native

The only thing that’s missing is the actual configuration data. To generate it, I’ll use a new open-source tool called YDK.

YDK primer

Yang Development Kit is a suite of tools to work with NETCONF/RESTCONF interfaces of a network device. The way I see it, YDK accomplishes two things:

Generates API bindings for programming languages (Python and C++) from YANG models
Creates an abstraction layer to interact with southbound protocols (NETCONF or RESTCONF) in a uniform way

There’s a lot of overlap between the tools that we’ve used before and YDK. Effectively YDK combines in itself the functions of a NETCONF client, a REST client, pyangbind and pyang(the latter is used internally for model verification). Since one of the main functions of YDK is API generation I thought it’d be interesting to know how it compares to Rob Shakir’s pyangbind plugin. The following information is what I’ve managed to find on the Internet and from the comment of Santiago Alvarez below:

Feature	Pyangbind	YDK
PL support	Python	Python, C++ with Ruby and Go in the pipeline
Serialization	JSON, XML	only XML at this stage with JSON coming up in a few weeks
Southbound interfaces	N/A	NETCONF, RESTCONF with ODL coming up in a few weeks
Support	Cisco’s devnet team	Rob Shakir

So it looks like YDK is a very promising alternative to pyangbind, however I, personally, would still prefer to use pyangbind due to familiarity, simplicity and the fact that I don’t need the above extra features offered by YDK right now. However, given that YDK has been able to achieve so much in just under one year of its existence, I don’t discount the possibility that I may switch to YDK as it becomes more mature and feature-rich.

Python binding generation with YDK-GEN

One of the first things we need to do is install YDK-GEN, the tools responsible for API bindings generation, and it’s core Python packages on the local machine. The following few commands are my version of the official installation procedure:

git clone https://github.com/CiscoDevNet/ydk-gen.git ~/ydk-gen
pip install -r ~/ydk-gen/requirements.txt
export YDKGEN_HOME=~/ydk-gen/
~/ydk-gen/generate.py --python --core
pip install ~/ydk-gen/gen-api/python/ydk/dist/ydk*.tar.gz

YDK-GEN generates Python bindings based on the so-called bundle profile. This is a simple JSON document which lists all YANG models to include in the output package. In our case we’d need to include a ned model along with all its imports. The sample below shows only the model specification. Refer to my Github repo for a complete bundle profile for Cisco IOS XE native YANG model.

{"models":{"git":[{"url":"https://github.com/YangModels/yang.git",
  "commits":[{"commitid":"6f4a025431103f8cbbf3405ce01bdc61d0811b1d",
    "file":["vendor/cisco/xe/1641/ned.yang",
      "vendor/cisco/xe/1641/tailf-common.yang",
      "vendor/cisco/xe/1641/tailf-meta-extensions.yang",
      "vendor/cisco/xe/1641/tailf-cli-extensions.yang",
      "standard/ietf/RFC/ietf-inet-types.yang",
      "standard/ietf/RFC/ietf-yang-types.yang"]
      }]}]}}

Assuming that the IOS XE bundle profile is saved in a file called cisco-ios-xe_0_1_0.json, we can use YDK to generate and install the Python binding package:

~/ydk-gen/generate.py --python --bundle cisco-ios-xe_0_1_0.json -v
pip install ~/ydk-gen/gen-api/python/cisco_ios_xe-bundle/dist/ydk*.tar.gz

Configuring BGP with YDK

Now we can start configuring BGP using our newly generated Python package. First, we need to create an instance of BGP configuration data:

from ydk.models.cisco_ios_xe.ned import Native
bgp = Native().router.Bgp()

The configuration will follow the pattern defined in the original model, which is why it’s important to understand the internal structure of a YANG model. YANG leafs are represented as simple instance attributes. All YANG containers need to be explicitly instantiated, just like the Native and Bgp classes in the example above. Presence containers (router in the above example) will be instantiated at the same time as its parent container, inside the __init__ function of the Native class. Don’t worry if this doesn’t make sense, use iPython or any IDE with autocompletion and after a few tries, you’ll get the hang of it.

Let’s see how we can set the local BGP AS number and add a new BGP peer to the neighbor list.

bgp.id = 100
new_neighbor = bgp.Neighbor()
new_neighbor.id = '2.2.2.2'
new_neighbor.remote_as = 65100
bgp.neighbor.append(new_neighbor)

At this point of time all data is stored inside the instance of a Bgp class. In order to get an XML representation of it, we need to use YDK’s XML provider and encoding service:

from ydk.providers import CodecServiceProvider
from ydk.services import CodecService
provider = CodecServiceProvider(type="xml")
codec = CodecService()
xml_string = codec.encode(provider, bgp)
print xml_string

All what we’ve got left now is to send the data to ODL:

import requests
url = ("http://localhost:8181/restconf"
       "/config/network-topology:network-topology"
       "/topology/topology-netconf/node"
       "/CSR1K/yang-ext:mount/ned:native"
       "/router")
headers = {'Content-Type': 'application/xml'}
result = requests.post(url, auth=('admin', 'admin'), headers=headers, data=xml_string)
print result.status_code

The controller should have returned the status code 204 No Content, meaning that configuration has been changed successfully.

Verification

Back at the IOS XE CLI we can see the new BGP configuration that has been pushed down from the controller.

TEST#sh run | i router
router bgp 100
 bgp log-neighbor-changes
 neighbor 2.2.2.2 remote-as 65100

More examples

You can find a shorter version of the above procedure in my ODL 101 repo.

Introduction to YANG Programming and RESTCONF on Cisco IOS XE

2017-02-15T00:00:00+00:00

The sheer size of some of the YANG models can scare away even the bravest of network engineers. However, as it is with any programming language, the complexity is built out of a finite set of simple concepts. In this post we’ll learn some of these concepts by building our own YANG model to program static IP routes on Cisco IOS XE.

In the previous post I have demonstrated how to make changes to interface configuration of Cisco IOS XE device using the standard IETF model. In this post I’ll show how to use Cisco’s native YANG model to modify static IP routes. To make things even more interesting I’ll use RESTCONF, an HTTP-based sibling of NETCONF.

RESTCONF primer

RESTCONF is a very close functional equivalent of NETCONF. Instead of SSH, RESTCONF relies on HTTP to interact with configuration data and operational state of the network device and encodes all exchanged data in either XML or JSON. RESTCONF borrows the idea of Create-Read-Update-Delete operations on resources from REST and maps them to YANG models and datastores. There is a direct relationship between NETCONF operations and RESTCONF HTTP verbs:

HTTP VERB	NETCONF OPERATION
POST	create
PUT	replace
PATCH	merge
DELETE	delete
GET	get/get-config

Both RESTfullness and the ability to encode data as JSON make RESTCONF a very attractive choice for application developers. In this post, for the sake of simplicity, we’ll use Python CLI and curl to interact with RESTCONF API. In the upcoming posts I’ll show how to implement the same functionality inside a simple Python library.

Environment setup

We’ll pick up from where we left our environment in the previous post right after we’ve configured a network interface. The following IOS CLI command enables RESTCONF’s root URL at http://192.168.145.51/restconf/api/

CSR1k(config)#restconf

You can start exploring the structure of RESTCONF interface starting at the root URL by specifying resource names separated by “/”. For example, the following command will return all configuration from Cisco’s native datastore.

curl -v -k admin:admin http://192.168.145.51/restconfi/api/config/native?deep

In order to get JSON instead of the default XML output the client should specify JSON media type application/vnd.yang.datastore+json and pass it in the Accept header.

Writing a YANG model

Normally, you would expect to download the YANG model from the device itself. However IOS XE’s NETCONF and RESTCONF support is so new that not all of the models are available. Specifically, Cisco’s native YANG model for static routing cannot be found in either Yang Github Repo or the device itself (via get_schema RPC), which makes it a very good candidate for this post.

Update 13-02-2017: As it turned out, the model was right under my nose the whole time. It’s called ned and encapsulates the whole of Cisco’s native datastore. So think of everything that’s to follow as a simple learning exercise, however the point I raise in the closing paragraph still stands.

The first thing we need to do is get an understanding of the structure and naming convention of the YANG model. The simplest way to do that would be to make a change on the CLI and observe the result via RESTCONF.

Retrieving running configuration data

Let’s start by adding the following static route to the IOS XE device:

ip route 2.2.2.2 255.255.255.255 GigabitEthernet2

Now we can view the configured static route via RESTCONF:

curl -v -k -u admin:admin -H "Accept: application/vnd.yang.data+json" \
 http://192.168.145.51/restconf/api/config/native/ip/route?deep

The returned output should look something like this:

{ "ned:route": {
    "ip-route-interface-forwarding-list": [
      { "prefix": "2.2.2.2",
        "mask": "255.255.255.255",
        "fwd-list": [ { "fwd": "GigabitEthernet2" } ]
      }
    ]
  }
}

This JSON object gives us a good understanding of how the YANG model should look like. The root element route contains a list of IP prefixes, called ip-route-interface-forwarding-list. Each element of this list contains values for IP network and mask as well as the list of next-hops called fwd-list. Let’s see how we can map this to YANG model concepts.

Building a simple YANG model

YANG RFC defines a number of data structures to model an XML tree. Let’s first concentrate on the three most fundamental data structures that constitute the biggest part of any YANG model:

Container is a node of a tree with a unique name which encloses a set of child elements. In JSON it is mapped to a name/object pair 'name': {...}
Leaf is a node which contains a value and does not contain any child elements. In JSON leaf is mapped to a single key/value pair 'name': 'value'
List can be thought of as a table that contains a set rows (list entries). Each list entry can contain Leafs, Containers and other elements and can be uniquely identified by at least one Leaf element called a key. In JSON lists are encoded as name/arrays pairs containing JSON objects 'name': [{...}, {...}]

Now let’s see how we can describe the received data in terms of the above data structures:

The value of the topmost route element is a JSON object, therefore it can only be mapped to a YANG container.
The value of ip-route-interface-forwarding-list is an array of JSON objects, therefore it must be a list.
The only entry of this list contains prefix and mask key/value pairs. Since they don’t contain any child elements and their values are strings they can only be mapped to YANG leafs.
The third element, fwd-list, is another YANG list and so far contains a single next-hop value inside a YANG leaf called fwd.
Finally, since fwd is the only leaf in the fwd-list list, it must be that lists' key. The ip-route-interface-forwarding-list list will have both prefix and mask as its key values since their combination represents a unique IP destination.

With all that in mind, this is how a skeleton of our YANG model will look like:

module cisco-route-static {
  namespace "http://cisco.com/ns/yang/ned/ios";
  prefix ned;
  container route {
    list ip-route-interface-forwarding-list {
      key "prefix mask";
      leaf prefix { type string; }
      leaf mask { type string; }
      list fwd-list {
        key "fwd";
        leaf fwd { type string; }
      }
    }
  }
}

YANG’s syntax is pretty light-weight and looks very similar to JSON. The topmost module defines the model’s name and encloses all other elements. The first two statements are used to define XML namespace and prefix that I’ve described in my previous post.

Refactoring a YANG model

At this stage the model can already be instantiated by pyang and pyangbind, however there’s a couple of very important changes and additions that I wanted to make to demonstrate some of the other features of YANG.

The first of them is common IETF data types. So far in our model we’ve assumed that prefix and mask can take any value in string format. But what if we wanted to check that the values we use are, in fact, the correctly-formatted IPv4 addresses and netmasks before sending them to the device? That is where IETF common data types come to the rescue. All what we need to do is add an import statement to define which model to use and we can start referencing them in our type definitions:

...
import ietf-yang-types { prefix "yang"; }
import ietf-inet-types { prefix "inet"; }
...
leaf prefix { type inet:ipv4-address; }
leaf mask { type yang:dotted-quad; }

This solves the problem for the prefix part of a static route but how about its next-hop? Next-hops can be defined as either strings (representing an interface name) or IPv4 addresses. To make sure we can use either of these two types in the fwd leaf node we can define its type as a union. This built-in type is literally a union, a logical OR, of all its member elements. This is how we can change the fwd leaf definition:

...
typedef ip-next-hop {
  type union {
    type inet:ipv4-address;
    type string;
  }
}
...
leaf fwd { type ip-next-hop; }

So far we’ve been concentrating on the simplest form of a static route, which doesn’t include any of the optional arguments. Let’s add the leaf nodes for name, AD, tag, track and permanent options of the static route:

...
leaf metric { type uint8; }
leaf name { type string; }
leaf tag { type uint8; }
leaf track { type uint8; }
leaf permanent { type empty; }
...

Since track and permanent options are mutually exclusive they should not appear in the configuration at the same time. To model that we can use the choice YANG statement. Let’s remove the track and permanent leafs from the model and replace them with this:

choice track-or-perm {
  leaf track { type uint8; }
  leaf permanent { type empty; }
}

And finally, we need to add an options for VRF. When VRF is defined the whole ip-route-interface-forwarding-list gets encapsulated inside a list called vrf. This list has just one more leaf element name which plays the role of this lists' key. In order to model this we can use another oft-used YANG concept called grouping. I like to think of it as a Python function, a reusable part of code that can be referenced multiple times by its name. Here are the final changes to our model to include the VRF support:

grouping ip-route-list {
  list ip-route-interface-forwarding-list {
      ...
  }
}
grouping vrf-grouping {
  list vrf {
    key "name";
    leaf name { type string; }
    uses ip-route-list;
  }
}
container route {
  uses vrf-grouping;
  uses ip-route-list;
}

Each element in a YANG model is optional by default, which means that the route container can include any number of VRF and non-VRF routes. The full YANG model can be found here.

Modifying static route configuration

Now let me demonstrate how to use our newly built YANG model to change the next-hop of an existing static route. Using pyang we need to generate a Python module based on the YANG model.

pyang --plugindir $PYBINDPLUGIN -f pybind -o binding.py cisco-route-static.yang

From a Python shell, download the current static IP route configuration:

import requests
url = "http://{h}:{p}/restconf/api/config/native/ip/route?deep".format(h='192.168.145.51', p='80')
headers = {'accept': 'application/vnd.yang.data+json'}
result = requests.get(url, auth=('admin', 'admin'), headers=headers)
current_json = result.text

Import the downloaded JSON into a YANG model instance:

import binding
import pyangbind.lib.pybindJSON as pybindJSON
model = pybindJSON.loads_ietf(current_json, binding, "cisco_route_static")

Delete the old next-hop and replace it with 12.12.12.2:

route = model.route.ip_route_interface_forwarding_list["2.2.2.2 255.255.255.255"]
route.fwd_list.delete("GigabitEthernet2")
route.fwd_list.add("12.12.12.2")

Save the updated model in a JSON file with the help of a write_file function:

json_data = pybindJSON.dumps(model, mode='ietf')
write_file('new_conf.json', json_data)

Updating running configuration

If we tried sending the new_conf.json file now, the device would have responded with an error:

missing element: prefix in /ios:native/ios:ip/ios:route/ios:ip-route-interface-forwarding-list

In our JSON file the order of elements inside a JSON object can be different from what was defined in the YANG model. This is expected since one of the fundamental principles of JSON is that an object is an unordered collection of name/value pairs. However it looks like behind the scenes IOS XE converts JSON to XML before processing and expects all elements to come in a strict, predefined order. Fortunately, this bug is already known and we can hope that Cisco will implement the fix for IOS XE soon. In the meantime, we’re gonna have to resort to sending XML.

Following the procedure described in my previous post, we can use json2xml tool to convert our instance into an XML document. Here we hit another issue. Since json2xml was designed to produce a NETCONF-compliant XML, it wraps the payload inside a data or a config element. Thankfully, json2xml is a Python script and can be easily patched to produce a RESTCONF-compliant XML. The following is a diff between the original and the patched files

408c409
<     if args.target not in ["data", "config"]:
---
>     if args.target not in ["data", "config", "restconf"]:
437c438,442
<     ET.ElementTree(root_el).write(outfile, encoding="utf-8", xml_declaration=True)
---
>     if args.target != 'restconf':
>         ET.ElementTree(root_el).write(outfile, encoding="utf-8", xml_declaration=True)
>     else:
>         ET.ElementTree(list(root_el)[0]).write(outfile, encoding="utf-8", xml_declaration=True)

Instead of patching the original file, I’ve applied the above changes to a local copy of the file. Once patched, the following commands should produce the needed XML.

pyang -f jtox -o static-route.jtox cisco-route-static.yang
./json2xml -t restconf -o new_conf.xml static-route.jtox new_conf.json

The final step would be to send the generated XML to the IOS XE device. Since we are replacing the old static IP route configuration we’re gonna have to use HTTP PUT to overwrite the old data.

curl -v -k -u admin:admin -H "Content-Type: application/vnd.yang.data+xml" \
 -X PUT http://192.168.145.51/restconf/api/config/native/ip/route/ -d @new_conf.xml

Verification

Back at the IOS XE CLI we can see the new static IP route installed.

TEST#sh run | i ip route
ip route 2.2.2.2 255.255.255.255 12.12.12.2

More examples

As always there are more examples available in my YANG 101 repo

The exercise we’ve done in this post, though useful from a learning perspective, can come in very handy when dealing with vendors who forget or simply don’t want to share their YANG models with their customers (I know of at least one vendor that would only publish tree representations of their YANG models). In the upcoming posts I’ll show how to create a simple Python library to program static routes via RESTCONF and finally how to build an Ansible module to do that.

Getting Started With NETCONF and YANG on Cisco IOS XE

2017-01-25T00:00:00+00:00

Everyone who has any interest in network automation inevitably comes across NETCONF and YANG. These technologies have mostly been implemented for and adopted by big telcos and service providers, while support in the enterprise/DC gear has been virtually non-existent. Things are starting to change now as NETCONF/YANG support has been introduced in the latest IOS XE software train. That’s why I think it’s high time I started another series of posts dedicated to YANG, NETCONF, RESTCONF and various open-source tools to interact with those interfaces.

To kick things off I will show how to use ncclient and pyang to configure interfaces on Cisco IOS XE device. In order to make sure everyone is on the same page and to provide some reference points for the remaining parts of the post, I would first need to cover some basic theory about NETCONF, XML and YANG.

NETCONF primer

NETCONF is a network management protocol that runs over a secure transport (SSH, TLS etc.). It defines a set of commands (RPCs) to change the state of a network device, however it does not define the structure of the exchanged information. The only requirement is for the payload to be a well-formed XML document. Effectively NETCONF provides a way for a network device to expose its API and in that sense it is very similar to REST. Here are some basic NETCONF operations that will be used later in this post:

hello - messages exchanged when the NETCONF session is being established, used to advertise the list of supported capabilities.
get-config - used by clients to retrieve the configuration from a network device.
edit-config - used by clients to edit the configuration of a network device.
close-session - used by clients to gracefully close the NETCONF session.

All of these standard NETCONF operations are implemented in ncclient Python library which is what we’re going to use to talk to CSR1k.

XML primer

There are several ways to exchange structured data over the network. HTML, YAML, JSON and XML are all examples of structured data formats. XML encodes data elements in tags and nests them inside one another to create complex tree-like data structures. Thankfully we are not going to spend much time dealing with XML in this post, however there are a few basic concepts that might be useful for the overall understanding:

Root - Every XML document has one root element containing one or more child elements.
Path - is a way of addressing a particular element inside a tree.
Namespaces - provide name isolation for potentially duplicate elements. As we’ll see later, the resulting XML document may be built from several YANG models and namespaces are required to make sure there are no naming conflicts between elements.

The first two concepts are similar to paths in a Linux filesystem where all of the files are laid out in a tree-like structure with root partition at its top. Namespace is somewhat similar to a unique URL identifying a particular server on the network. Using namespaces you can address multiple unique /etc/hosts files by prepending the host address to the path.

As with other structured data formats, XML by itself does not define the structure of the document. We still need something to organise a set of XML tags, specify what is mandatory and what is optional and what are the value constraints for the elements. This is exactly what YANG is used for.

YANG primer

YANG was conceived as a human-readable way to model the structure of an XML document. Similar to a programming language it has some primitive data types (integers, boolean, strings), several basic data structures (containers, lists, leafs) and allows users to define their own data types. The goal is to be able to formally model any network device configuration.

Anyone who has ever used Ansible to generate text network configuration files is familiar with network modelling. Coming up with a naming conventions for variables, deciding how to split them into different files, creating data structures for variables representing different parts of configuration are all a part of network modelling. YANG is similar to that kind of modelling, only this time the models are already created for you. There are three main sources of YANG models today:

Equipment Vendors create their own “native” models to interact with their devices.
Standards bodies (e.g. IETF and IEEE) were supposed to be the driving force of model creation. However in reality they have managed to produce only a few models that cover basic functionality like interface configuration and routing. Half of these models are still in the “DRAFT” stage.
OpenConfig working group was formed by major telcos and SPs to fill the gap left by IETF. OpenConfig has produced the most number of models so far ranging from LLDP and VLAN to segment routing and BGP configurations. Unfortunately these models are only supported by high-end SP gear and we can only hope that they will find their way into the lower-end part of the market.

Be sure to check of these and many other YANG models on YangModels Github repo.

Environment setup

My test environment consists of a single instance of Cisco CSR1k running IOS XE 16.04.01. For the sake of simplicity I’m not using any network emulator and simply run it as a stand-alone VM inside VMWare Workstation. CSR1k has the following configuration applied:

username admin privilege 15 secret admin
!
interface GigabitEthernet1
  ip address 192.168.145.51 255.255.255.0
  no shutdown
!
netconf-yang

The last command is all what’s required to enable NETCONF/YANG support.

On the same hypervisor I have my development CentOS7 VM, which is connected to the same network as the first interface of CSR1k. My VM is able to ping and ssh into the CSR1k. We will need the following additional packages installed:

yum install openssl-devel python-devel python-pip gcc
pip install ncclient pyang pyangbind ipython

Device configuration workflow

The following workflow will be performed in both interactive Python shell (e.g. iPython) and Linux bash shell. The best way to follow along is to have two sessions opened, one with each of the shells. This will save you from having to rerun import statements every time you re-open a python shell.

1. Discovering device capabilities

The first thing you have to do with any NETCONF-capable device is discover its capabilities. We’ll use ncclient’s manager module to establish a session to CSR1k. Method .connect() of the manager object takes device IP, port and login credentials as input and returns a reference to a NETCONF session established with the device.

from ncclient import manager

m = manager.connect(host='192.168.145.51', port=830, username='admin',
                    password='admin', device_params={'name': 'csr'})

print m.server_capabilities

When the session is established, server capabilities advertised in the hello message get saved in the server_capabilities variable. Last command should print a long list of all capabilities and supported YANG models.

2. Obtaining YANG models

The task we have set for ourselves is to configure an interface. CSR1k supports both native (Cisco-specific) and IETF-standard ways of doing it. In this post I’ll show how to use the IETF models to do that. First we need to identify which model to use. Based on the discovered capabilities we can guess that ietf-ip could be used to configure IP addresses, so let’s get this model first. One way to get a YANG model is to search for it on the Internet, and since its an IETF model, it most likely can be found in of the RFCs. Another way to get it is to download it from the device itself. All devices supporting RFC6022 must be able to send the requested model in response to the get_schema call. Let’s see how we can download the ietf-ip YANG model:

schema = m.get_schema('ietf-ip')
print schema

At this stage the model is embedded in the XML response and we still need to extract it and save it in a file. To do that we’ll use python lxml library to parse the received XML document, pick the first child from the root of the tree (data element) and save it into a variable. A helper function write_file simply saves the Python string contained in the yang_text variable in a file.

import xml.etree.ElementTree as ET
root = ET.fromstring(schema.xml)
yang_text = list(root)[0].text
write_file('ietf-ip.yang', yang_text)

Back at the Linux shell we can now start using pyang. The most basic function of pyang is to convert the YANG model into one of the many supported formats. For example, tree format can be very helpful for high-level understanding of the structure of a YANG model. It produces a tree-like representation of a YANG model and annotates element types and constraints using syntax described in this RFC.

$ pyang -f tree ietf-ip.yang | head -
module: ietf-ip
  augment /if:interfaces/if:interface:
    +--rw ipv4!
    |  +--rw enabled?      boolean
    |  +--rw forwarding?   boolean
    |  +--rw mtu?          uint16
    |  +--rw address* [ip]
    |  |  +--rw ip               inet:ipv4-address-no-zone
    |  |  +--rw (subnet)
    |  |     +--:(prefix-length)

From the output above we can see the ietf-ip augments or extends the interface model. It adds new configurable (rw) containers with a list of IP prefixes to be assigned to an interface. Another thing we can see is that this model cannot be used on its own, since it doesn’t specify the name of the interface it augments. This model can only be used together with ietf-interfaces YANG model which models the basic interface properties like MTU, state and description. In fact ietf-ip relies on a number of YANG models which are specified as imports at the beginning of the model definition.

module ietf-ip {
 namespace "urn:ietf:params:xml:ns:yang:ietf-ip";
 prefix ip;
 import ietf-interfaces {
   prefix if;
 }
 import ietf-inet-types {
   prefix inet;
 }
 import ietf-yang-types {
   prefix yang;
 }

Each import statement specifies the model and the prefix by which it will be referred later in the document. These prefixes create a clear separation between namespaces of different models.

We would need to download all of these models and use them together with the ietf-ip throughout the rest of this post. Use the procedure described above to download the ietf-interfaces, ietf-inet-types and ietf-yang-types models.

3. Instantiating YANG models

Now we can use pyangbind, an extension to pyang, to build a Python module based on the downloaded YANG models and start building interface configuration. Make sure your $PYBINDPLUGIN variable is set like its described here.

pyang --plugindir $PYBINDPLUGIN -f pybind -o ietf_ip_binding.py ietf-ip.yang ietf-interfaces.yang ietf-inet-types.yang ietf-inet-types.yang

The resulting ietf_ip_binding.py is now ready for use inside the Python shell. Note that we import ietf_interfaces as this is the parent object for ietf_ip. The details about how to work with generated Python binding can be found on pyangbind’s Github page.

from ietf_ip_binding import ietf_interfaces
model = ietf_interfaces()
model.get()
{'interfaces': {'interface': {}}, 'interfaces-state': {'interface': {}}}

To setup an IP address, we first need to create a model of an interface we’re planning to manipulate. We can then use .get() on the model’s instance to see the list of all configurable parameters and their defaults.

new_interface = model.interfaces.interface.add('GigabitEthernet2')
new_interface.get()
{'description': u'',
 'enabled': True,
 'ipv4': {'address': {},
  'enabled': True,
  'forwarding': False,
  'mtu': 0,
  'neighbor': {}},
 'ipv6': {'address': {},
  'autoconf': {'create-global-addresses': True,
   'create-temporary-addresses': False,
   'temporary-preferred-lifetime': 86400L,
   'temporary-valid-lifetime': 604800L},
  'dup-addr-detect-transmits': 1L,
  'enabled': True,
  'forwarding': False,
  'mtu': 0L,
  'neighbor': {}},
 'link-up-down-trap-enable': u'',
 'name': u'GigabitEthernet2',
 'type': u''}

The simples thing we can do is modify the interface description.

new_interface.description = 'NETCONF-CONFIGURED PORT'
new_interface.get()['description']

New objects are added by calling .add() on the parent object and passing unique key as an argument.

ipv4_addr = new_interface.ipv4.address.add('12.12.12.2')
ipv4_addr.get()
{'ip': u'12.12.12.2', 'netmask': u'', 'prefix-length': 0}
ipv4_addr.netmask = '255.255.255.0'

At the time of writing pyangbind only supported serialisation into JSON format which means we have to do a couple of extra steps to get the required XML. For now let’s dump the contents of our interface model instance into a file.

import pyangbind.lib.pybindJSON as pybindJSON
json_data = pybindJSON.dumps(model, mode='ietf')
write_file('new_interface.json',json_data)
print json_data

4. Applying configuration changes

Even though pyanbind does not support XML, it is possible to use other pyang plugins to generate XML from JSON.

pyang -f jtox -o interface.jtox ietf-ip.yang ietf-interfaces.yang ietf-inet-types.yang ietf-yang-types.yang
json2xml -t config -o interface.xml interface.jtox interface.json

The resulting interface.xml file contains the XML document ready to be sent to the device. I’ll use read_file helper function to read its contents and save it into a variable. We should still have a NETCONF session opened from one of the previous steps and we’ll use the edit-config RPC call to apply our changes to the running configuration of CSR1k.

xml = read_file('interface.xml')
reply = m.edit_config(target='running', config=xml)
print("Success? {}".format(reply.ok))
m.close_session()

If the change was applied successfully reply.ok should return True and we can close the session to the device.

Verifying changes

Going back to the CSR1k’s CLI we should see our changes reflected in the running configuration:

Router#sh run int gi 2
Building configuration...

Current configuration : 126 bytes
!
interface GigabitEthernet2
 description NETCONF-CONFIGURED PORT
 ip address 12.12.12.2 255.255.255.0
 negotiation auto
end

All-in-one scripts

Checkout this Github page for Python scripts that implement the above workflow in a more organised way.

In this post I have merely scratched the surface of YANG modelling and network device programming. In the following posts I am planning to take a closer look at the RESTCONF interface, internal structure of a YANG model, Ansible integration and other YANG-related topics until I run out of interest. So until that happens… stay tuned.

OpenStack SDN With OVN (Part 2) - Network Engineering Analysis

2016-12-10T00:00:00+00:00

In this post we will see how OVN implements virtual networks for OpenStack. The structure of this post is such that starting from the highest level of networking abstraction we will delve deeper into implementation details with each subsequent section. The biggest emphasis will be on how networking data model gets transformed into a set of logical flows, which eventually become OpenFlow flows. The final section will introduce a new overlay protocol GENEVE and explain why VXLAN no longer satisfies the needs of an overlay protocol.

OpenStack - virtual network topology

In the previous post we have installed OpenStack and created a simple virtual topology as shown below. In OpenStack’s data model this topology consists of the following elements:

Network defines a virtual L2 broadcast domain
Subnet attached to the network, defines an IP subnet within the network
Router provides connectivity between all directly connected subnets
Port VM’s point of attachment to the subnet

So far nothing unusual, this is a simple Neutron data model, all that information is stored in Neutron’s database and can be queried with neutron CLI commands.

OVN Northbound DB - logical network topology

Every call to implement an element for the above data model is forwarded to OVN ML2 driver as defined by the mechanism driver setting of the ML2 plugin. This driver is responsible for the creation of an appropriate data model inside the OVN Northbound DB. The main elements of this data model are:

Switch equivalent of a Neutron’s Subnet, enables L2 forwarding for all attached ports
Distributed Router provides distributed routing between directly connected subnets
Gateway Router provides connectivity between external networks and distributed routers, implements NAT and Load Balancing
Port of a logical switch, attaches VM to the switch

This is a visual representation of our network topology inside OVN’s Northbound DB, built based on the output of ovn-nbctl show command:

This topology is pretty similar to Neutron’s native data model with the exception of a gateway router. In OVN, a gateway router is a special non-distributed router which performs functions that are very hard or impossible to distribute amongst all nodes, like NAT and Load Balancing. This router only exists on a single compute node which is selected by the scheduler based on the ovn_l3_scheduler setting of the ML2 plugin. It is attached to a distributed router via a point-to-point /30 subnet defined in the ovn_l3_admin_net_cidr setting of the ML2 plugin.

Apart from the logical network topology, Northbound database keeps track of all QoS, NAT and ACL settings and their parent objects. The detailed description of all tables and properties of this database can be found in the official Northbound DB documentation.

OVN Southbound DB - logical flows

OVN northd process running on the controller node translates the above logical topology into a set of tables stored in Southbound DB. Each row in those tables is a logical flow and together they form a forwarding pipeline by stringing together multiple actions to be performed on a packet. These actions range from packet drop through packet header modification to packet output. The stringing is implemented with a special next action which moves the packet one step down the pipeline starting from table 0. Let’s have a look at the simplified versions of L2 and L3 forwarding pipelines using examples from our virtual topology.

L2 datapath

In the first example we’ll explore the L2 datapath between VM1 and VM3. Both VMs are attached to the ports of the same logical switch. The full datapath of a logical switch consists of two parts - ingress and egress datapath (the direction is from the perspective of a logical switch). The ultimate goal of an ingress datapath is to determine the output port or ports (in case of multicast) and pass the packet to the egress datapath. The egress datapath does a few security checks before sending the packet out to its destination. Two things are worth noting at this stage:

The two datapaths can be located either on the same or on two different hypervisor nodes. In the latter case, the packet is passed between the two nodes in an overlay tunnel.
The egress datapath does not have a destination lookup step which means that all information about the output port MUST be supplied by the ingress datapath. This means that destination lookup does not have to be done twice and it also has some interesting implications on the choice of encapsulation protocol as we’ll see in the next section.

Let’s have a closer look at each of the stages of the forwarding pipeline. I’ll include snippets of logical flows demonstrating the most interesting behaviour at each stage. Full logical datapath is quite long and can be viewed with ovn-sbctl lflow-list [DATAPATH] command. Here is some useful information, collected from the Northbound database, that will be used in the examples below:

VM#	IP	MAC	Port UUID
VM1	10.0.0.2	fa:16:3e:4f:2f:b8	26c23a54-6a91-48fd-a019-3bd8a7e118de
VM3	10.0.0.5	fa:16:3e:2a:60:32	5c62cfbe-0b2f-4c2a-98c3-7ee76c9d2879

Port security - makes sure that incoming packet has the correct source MAC and IP addresses.

table=0 (ls_in_port_sec_l2), priority=50, match=(inport == "26c23a54-6a91-48fd-a019-3bd8a7e118de"
  && eth.src == {fa:16:3e:4f:2f:b8}), action=(next;)
table=1 (ls_in_port_sec_ip), priority=90, match=(inport == "26c23a54-6a91-48fd-a019-3bd8a7e118de"
  && eth.src == fa:16:3e:4f:2f:b8 && ip4.src == {10.0.0.2}), action=(next;)

Egress ACL - set of tables that implement Neutron’s Egress Port Security functionality. Default rules allow all egress traffic from a VM. The first flow below matches all new connections coming from VM1 and marks them for connection tracking with reg0[1] = 1. The next table catches these marked packets and commits them to the connection tracker. Special ct_label=0/1 action ensures return traffic is allowed which is a standard behaviour of all stateful firewalls.

table=6 (ls_in_acl), priority=2002 , match=(((ct.new && !ct.est) ||
  (!ct.new && ct.est && !ct.rpl && ct_label.blocked == 1)) &&
  (inport == "26c23a54-6a91-48fd-a019-3bd8a7e118de" && ip4)),
  action=(reg0[1] = 1; next;)
table=9 (ls_in_stateful), priority=100  , match=(reg0[1] == 1),
  action=(ct_commit(ct_label=0/1); next;)

ARP Responder - matches an incoming ARP/ND request and generates an appropriate ARP/ND response. The way it is accomplished is similar to Neutron’s native ARP responder feature. Effectively an ARP request gets transformed into an ARP response by swapping source and destination fields.

table=10(ls_in_arp_rsp), priority=50, match=(arp.tpa == 10.0.0.5 && arp.op == 1),
  action=(eth.dst = eth.src; eth.src = fa:16:3e:2a:60:32; arp.op = 2; /* ARP reply */
  arp.tha = arp.sha; arp.sha = fa:16:3e:2a:60:32; arp.tpa = arp.spa; arp.spa = 10.0.0.5;
  outport = inport; flags.loopback = 1; output;)

DHCP Processing - set of tables that implement the DHCP server functionality using the approach similar to the ARP responder described above.

table=12(ls_in_dhcp_response), priority=100, match=(inport == "26c23a54-6a91-48fd-a019-3bd8a7e118de"
  && eth.src == fa:16:3e:4f:2f:b8 && ip4.src == 0.0.0.0 && ip4.dst == 255.255.255.255
  && udp.src == 68 && udp.dst == 67 && reg0[3]),
  action=(eth.dst = eth.src; eth.src = fa:16:3e:94:b6:bc; ip4.dst = 10.0.0.2;
  ip4.src = 10.0.0.1; udp.src = 67; udp.dst = 68; outport = inport; flags.loopback = 1; output;)

Destination Lookup - implements L2 forwarding based on the destination MAC address of a frame. At this stage the outport variable is set to the VM3’s port UUID.

table=13(ls_in_l2_lkup), priority=50, match=(eth.dst == fa:16:3e:2a:60:32),
  action=(outport = "5c62cfbe-0b2f-4c2a-98c3-7ee76c9d2879"; output;)

Ingress ACL - set of tables that implement Neutron’s Ingress Port security. For the sake of argument let’s assume that we have enabled inbound SSH connections. The principle is same as before - the packet gets matched in one table and submitted to connection tracking in another table.

table=4 (ls_out_acl), priority=2002 , match=(((ct.new && !ct.est)
  || (!ct.new && ct.est && !ct.rpl && ct_label.blocked == 1))
  && (outport == "26c23a54-6a91-48fd-a019-3bd8a7e118de" && ip4
  && ip4.src == 0.0.0.0/0 && tcp && tcp.dst == 22)),
  action=(reg0[1] = 1; next;
table=6 (ls_out_stateful), priority=100  , match=(reg0[1] == 1),
  action=(ct_commit(ct_label=0/1); next;)

Port Security - implements inbound port security for destination VM by checking the sanity of destination MAC and IP addresses.

table=7 (ls_out_port_sec_ip), priority=90, match=(outport == "5c62cfbe-0b2f-4c2a-98c3-7ee76c9d2879"
  && eth.dst == fa:16:3e:2a:60:32 && ip4.dst == {255.255.255.255, 224.0.0.0/4, 10.0.0.5}),
  action=(next;)
table=8 (ls_out_port_sec_l2), priority=50, match=(outport == "5c62cfbe-0b2f-4c2a-98c3-7ee76c9d2879"
  && eth.dst == {fa:16:3e:2a:60:32}),
  action=(output;)

L3 datapath

Similar to a logical switch pipeline, L3 datapath is split into ingress and egress parts. In this example we’ll concentrate on the Gateway router datapath. This router is connected to a distributed logical router via a transit subnet (SWtr) and to an external network via an external bridge (SWex) and performs NAT translation for all VM traffic.

Here is some useful information about router interfaces and ports that will be used in the examples below.

SW function	IP	MAC	Port UUID
External	169.254.0.54/24	fa:16:3e:39:c8:d8	lrp-dc1ae9e3-d8fd-4451-aed8-3d6ddc5d095b
DVR-GW transit	169.254.128.2/30	fa:16:3e:7e:96:e7	lrp-gtsp-186d8754-cc4b-40fd-9e5d-b0d26fc063bd

Port security - implements sanity check for all incoming packets.

table=0 (lr_in_admission), priority=50, match=((eth.mcast || eth.dst == fa:16:3e:7e:96:e7)
 && inport == "lrp-gtsp-186d8754-cc4b-40fd-9e5d-b0d26fc063bd"), action=(next;)

IP Input - performs additional L3 sanity checks and implements typical IP services of a router (e.g. ICMP/ARP reply)

table=1 (lr_in_ip_input), priority=100, match=(ip4.src == {169.254.128.2, 169.254.128.3}),
  action=(drop;)
table=1 (lr_in_ip_input), priority=90, match=(inport == "lrp-gtsp-186d8754-cc4b-40fd-9e5d-b0d26fc063bd"
  && arp.tpa == 169.254.128.2 && arp.op == 1),
  action=(eth.dst = eth.src; eth.src = fa:16:3e:7e:96:e7; arp.op = 2;
  /* ARP reply */ arp.tha = arp.sha; arp.sha = fa:16:3e:7e:96:e7;
  arp.tpa = arp.spa; arp.spa = 169.254.128.2;
  outport = "lrp-gtsp-186d8754-cc4b-40fd-9e5d-b0d26fc063bd"; flags.loopback = 1; output;)
table=1 (lr_in_ip_input), priority=90, match=(ip4.dst == 169.254.128.2
  && icmp4.type == 8 && icmp4.code == 0),
  action=(ip4.dst <-> ip4.src; ip.ttl = 255; icmp4.type = 0; flags.loopback = 1; next; )

UNSNAT - translates the destination IP to the real address for packets coming from external networks

table=3 (lr_in_unsnat), priority=100, match=(ip && ip4.dst == 169.254.0.54),
  action=(ct_snat; next;)

DNAT - implements what is commonly known as static NAT, i.e. performs one-to-one destination IP translation for every configured floating IP.

table=4 (lr_in_dnat), priority=100, match=(ip && ip4.dst == 169.254.0.52),
  action=(flags.loopback = 1; ct_dnat(10.0.0.5);)

IP routing - implements L3 forwarding based on the destination IP address. At this stage the outport is decided, IP TTL is decremented and the new next-hop IP is set in register0.

table=5 (lr_in_ip_routing), priority=1, match=(ip4.dst == 0.0.0.0/0),
  action=(ip.ttl--; reg0 = 169.254.0.1; reg1 = 169.254.0.54; eth.src = fa:16:3e:39:c8:d8;
  outport = "lrp-dc1ae9e3-d8fd-4451-aed8-3d6ddc5d095b"; flags.loopback = 1; next;)

Next Hop Resolver - discovers the next-hop MAC address for a packet. This could either be a statically configured value when the next-hop is an OVN-managed router or a dynamic binding learned through ARP and stored in a special MAC_Binding table of Southbound DB.

table=6 (lr_in_arp_resolve), priority=100, match=(outport == "lrp-gtsp-186d8754-cc4b-40fd-9e5d-b0d26fc063bd"
  && reg0 == 169.254.128.1), action=(eth.dst = fa:16:3e:2a:7f:25; next;)
table=6 (lr_in_arp_resolve), priority=0, match=(ip4),
  action=(get_arp(outport, reg0); next;)

SNAT - implements what is commonly known as overload NAT. Translates source IP, source UDP/TCP port number and ICMP Query ID to hide them behind a single IP address

table=0 (lr_out_snat), priority=25, match=(ip && ip4.src == 10.0.0.0/24),
  action=(ct_snat(169.254.0.54);)

Output - send the packet out the port determined during the IP routing stage.

table=1 (lr_out_delivery), priority=100, match=(outport == "lrp-gtsp-186d8754-cc4b-40fd-9e5d-b0d26fc063bd"),
  action=(output;)

This was a very high-level, abridged and simplified version of how logical datapaths are built in OVN. Hopefully this lays enough groundwork to move on to the official northd documentation which describes both L2 and L3 datapaths in much greater detail.

Apart from the logical flows, Southbound DB also contains a number of tables that establish the logical-to-physical bindings. For example, the Port_Binding table establishes binding between logical switch, logical port, logical port overlay ID (a.k.a. tunnel key) and the unique hypervisor ID. In the next section we’ll see how this information is used to translate logical flows into OpenFlow flows at each compute node. For full description of Southbound DB, its tables and their properties refer to the official SB schema documentation.

OVN Controller - OpenFlow flows

OVN Controller process is the distributed part of OVN SDN controller. This process, running on each compute node, connects to Southbound DB via OVSDB and configures local OVS according to information received from it. It also uses Southbound DB to exchange the physical location information with other hypervisors. The two most important bits of information that OVN controller contributes to Southbound DB are physical location of logical ports and overlay tunnel IP address. These are the last two missing pieces to map logical flows to physical nodes and networks.

The whole flat space of OpenFlow tables is split into multiple areas. Tables 16 to 47 implement an ingress logical pipeline and tables 48 to 63 implement an egress logical pipeline. These tables have no notion of physical ports and are functionally equivalent to logical flows in Southbound DB. Tables 0 and 65 are responsible for mapping between the physical and logical realms. In table 0 packets are matched on the physical incoming port and assigned to a correct logical datapath as was defined by the Port_Binding table. In table 65 the information about the outport, that was determined during the ingress pipeline processing, is mapped to a local physical interface and the packet is sent out.

To demonstrate the details of OpenFlow implementation, I’ll use the traffic flow between VM1 and external destination (8.8.8.8). For the sake of brevity I will only cover the major steps of packet processing inside OVS, omitting security checks and ARP/DHCP processing.

When packets traverse OpenFlow tables they get labelled or annotated with special values to simplify matching in subsequent tables. For example, when table 0 matches the incoming port, it annotates the packet with the datapath ID. Since it would have been impractical to label packets with globally unique UUIDs from Soutbound DB, these UUIDs get mapped to smaller values called tunnel keys. To make things even more confusing, each port will have a local kernel ID, unique within each hypervisor. We’ll need both tunnel keys and local port IDs to be able to track the packets inside the OVS. The figure below depicts all port and datapath IDs that have been collected from the Soutbound DB and local OVSDB on each hypervisor. Local port numbers are attached with a dotted line to their respective tunnel keys.

When VM1 sends the first packet to 8.8.8.8, it reaches OVS on local port 13. OVN Controller knows that this port belongs to VM1 and installs an OpenFlow rule to match all packets from this port and annotate them with datapath ID (OXM_OF_METADATA), incoming port ID (NXM_NX_REG14), conntrack zone (NXM_NX_REG13). It then moves these annotated packets to the first table of the ingress pipeline.

table=0, priority=100,in_port=13 actions=load:0x2->NXM_NX_REG13[],
  load:0x2->OXM_OF_METADATA[],load:0x2->NXM_NX_REG14[],
  resubmit(,16)

Skipping to the L2 MAC address lookup stage, the output port (0x1) is decided based on the destination MAC address and saved in register 15.

table=29, priority=50,metadata=0x2,dl_dst=fa:16:3e:0d:df:ea
  actions=load:0x1->NXM_NX_REG15[],resubmit(,32)

Finally, the packet reaches the last table where it is sent out the physical patch port interface towards R1.

table=65, priority=100,reg15=0x1,metadata=0x2 actions=output:1

The other end of this patch port is connected to a local instance of distributed router R1. That means our packet, unmodified, re-enters OpenFlow table 0, only this time on a different port. Local port 2 is associated with a logical pipeline of a router, hence metadata for this packet is set to 4.

table=0, priority=100,in_port=2 actions=load:0x4->OXM_OF_METADATA[],
  load:0x1->NXM_NX_REG14[],resubmit(,16)

The packet progresses through logical router datapath and finally gets to table 21 where destination IP lookup take place. It matches the catch-all default route rule and the values for its next-hop IP (0xa9fe8002), MAC address (fa:16:3e:2a:7f:25) and logical output port (0x03) are set.

table=21, priority=1,ip,metadata=0x4 actions=dec_ttl(),load:0xa9fe8002->NXM_NX_XXREG0[96..127],
  load:0xa9fe8001->NXM_NX_XXREG0[64..95],mod_dl_src:fa:16:3e:2a:7f:25,
  load:0x3->NXM_NX_REG15[],load:0x1->NXM_NX_REG10[0],resubmit(,22)

Table 65 converts the logical output port 3 to physical port 6, which is yet another patch port connected to a transit switch.

table=65, priority=100,reg15=0x3,metadata=0x4 actions=output:6

The packet once again re-enters OpenFlow pipeline from table 0, this time from port 5. Table 0 maps incoming port 5 to the logical datapath of a transit switch with Tunnel key 7.

table=0, priority=100,in_port=5 actions=load:0x7->OXM_OF_METADATA[],
  load:0x1->NXM_NX_REG14[],resubmit(,16)

Destination lookup determines the output port (2) but this time, instead of entering the egress pipeline locally, the packet gets sent out the physical tunnel port (7) which points to the IP address of a compute node hosting the GW router. The headers of an overlay packet are populated with logical datapath ID (0x7), logical input port (copied from register 14) and logical output port (0x2).

table=29, priority=50,metadata=0x7,dl_dst=fa:16:3e:7e:96:e7
  actions=load:0x2->NXM_NX_REG15[],resubmit(,32)
table=32, priority=100,reg15=0x2,metadata=0x7 actions=load:0x7->NXM_NX_TUN_ID[0..23],
  set_field:0x2/0xffffffff->tun_metadata0,move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30],
  output:7

When packet reaches the destination node, it once again enters the OpenFlow table 0, but this time all information is extracted from the tunnel keys.

table=0, priority=100,in_port=17 actions=move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],
  move:NXM_NX_TUN_METADATA0[16..30]->NXM_NX_REG14[0..14],
  move:NXM_NX_TUN_METADATA0[0..15]->NXM_NX_REG15[0..15],
  resubmit(,33)

At the end of the transit switch datapath the packet gets sent out port 12, whose peer is patch port 16.

table=65, priority=100,reg15=0x2,metadata=0x7 actions=output:12

The packet re-enters OpenFlow table 0 from port 16, where it gets mapped to the logical datapath of a gateway router.

table=0, priority=100,in_port=16 actions=load:0x2->NXM_NX_REG11[],
  load:0x6->NXM_NX_REG12[],load:0x6->OXM_OF_METADATA[],
  load:0x2->NXM_NX_REG14[],resubmit(,16)

Similar to a distributed router R1, table 21 determines the next-hop MAC address for a packet and saves the output port in register 15.

table=21, priority=1,ip,metadata=0x6 actions=dec_ttl(),load:0xa9fe0001->NXM_NX_XXREG0[96..127],
  load:0xa9fe0036->NXM_NX_XXREG0[64..95],mod_dl_src:fa:16:3e:39:c8:d8,
  load:0x1->NXM_NX_REG15[],load:0x1->NXM_NX_REG10[0],resubmit(,22)

The first table of an egress pipeline source-NATs packets to external IP address of the GW router.

table=48, priority=33,ip,metadata=0x6,nw_src=10.0.0.2
  actions=ct(commit,table=49,zone=NXM_NX_REG12[0..15],nat(src=169.254.0.56))

The modified packet is sent out the physical port 14 towards the external switch.

table=65, priority=100,reg15=0x1,metadata=0x6 actions=output:14

External switch determines the output port connected to the br-ex on a local hypervisor and send the packet out.

table=0, priority=100,in_port=13 actions=load:0x5->NXM_NX_REG11[],
  load:0x3->NXM_NX_REG12[],load:0x3->OXM_OF_METADATA[],
  load:0x2->NXM_NX_REG14[],resubmit(,16)
table=29, priority=0,metadata=0x3 actions=load:0xfffe->NXM_NX_REG15[],resubmit(,32)
table=33, priority=100,reg15=0xfffe,metadata=0x3
  actions=load:0x1->NXM_NX_REG13[],load:0x1->NXM_NX_REG15[],
  resubmit(,34),load:0xfffe->NXM_NX_REG15[]
table=65, priority=100,reg15=0x1,metadata=0x3 actions=output:15

As we’ve just seen, OpenFlow repeats the logical topology by interconnecting logical datapaths of switches and routers with virtual point-to-point patch cables. This may seem like an unnecessary modelling element with a potential for a performance impact. However, when flows get installed in kernel datapath, these patch ports do not exist, which means that there isn’t any performance impact on packets in fastpath.

Physical network - GENEVE overlay

Before we wrap up, let us have a quick look at the new overlay protocol GENEVE. The goal of any overlay protocol is to transport all the necessary tunnel keys. With VXLAN the only tunnel key that could be transported is the Virtual Network Identifier (VNI). In OVN’s case these tunnel keys include not only the logical datapath ID (commonly known as VNI) but also both input and output port IDs. You could have carved up the 24 bits of VXLAN tunnel ID to encode all this information but this would only have given you 256 unique values per key. Some other overlay protocols, like STT have even bigger tunnel ID header size but they, too, have a strict upper limit.

GENEVE was designed to have a variable-length header. The first few bytes are well-defined fixed size fields followed by variable-length Options. This kind of structure allows software developers to innovate at their own pace while still getting the benefits of hardware offload for the fixed-size portion of the header. OVN developers decided to use Options header type 0x80 to store the 15-bit logical ingress port ID and a 16-bit egress port ID (an extra bit is for logical multicast groups).

The figure above shows the ICMP ping coming from VM1(10.0.0.2) to Google’s DNS. As I’ve showed in the previous section, GENEVE is used between the ingress and egress pipelines of a transit switch (SWtr), whose datapath ID is encoded in the VNI field (0x7). Packets enter the transit switch on port 1 and leave it on port 2. These two values are encoded in the 00010002 value of the Options Data field.

So now that GENEVE has taken over as the inter-hypervisor overlay protocol, does that mean that VXLAN is dead? OVN still supports VXLAN but only for interconnects with 3rd party devices like VXLAN-VLAN gateways or VXLAN TOR switches. Rephrasing the official OVN documentation, VXLAN gateways will continue to be supported but they will have a reduced feature set due to lack of extensibility.

Conclusion

OpenStack networking has always been one of the first use cases of any new SDN controller. All the major SDN platforms like ACI, NSX, Contrail, VSP or ODL have some form of OpenStack integration. And it made sense, since native Neutron networking has always been one of the biggest pain points in OpenStack deployments. As I’ve just demonstrated, OVN can now do all of the common networking functionality natively, without having to rely on 3rd party agents. In addition to that it has a fantastic documentation, implements all forwarding inside a single OVS bridge and it is an open-source project. As an OpenStack networking solution it is still, perhaps, a few months away from being production ready - active/active HA is not supported with OVSDB, GW router scheduling options are limited, lack of native support for DNS and Metadata proxy. However I anticipate that starting from the next OpenStack release (Ocata, Feb 2017) OVN will be ready for mass deployment even by companies without an army of OVS/OpenStack developers. And when that happens there will even less need for proprietary OpenStack SDN platforms.

OpenStack SDN With OVN (Part 1) - Build and Install

2016-11-27T00:00:00+00:00

This is a first of a two-post series dedicated to project OVN. In this post I’ll show how to build, install and configure OVN to work with a 3-node RDO OpenStack lab.

Vanilla OpenStack networking has many functional, performance and scaling limitations. Projects like L2 population, local ARP responder, L2 Gateway and DVR were conceived to address those issues. However good a job these projects do, they still remain a collection of separate projects, each with its own limitations, configuration options and sets of dependencies. That led to an effort outside of OpenStack to develop a special-purpose OVS-only SDN controller that would address those issues in a centralised and consistent manner. This post will be about one such SDN controller, coming directly from the people responsible for OpenvSwitch, Open Virtual Network (OVN).

OVN quick introduction

OVN is a distributed SDN controller implementing virtual networks with the help OVS. Even though it is positioned as a CMS-independent controller, the main use case is still OpenStack. OVN was designed to address the following limitations of vanilla OpenStack networking:

Security groups could not be implemented directly on OVS ports and, therefore, required a dedicated Linux bridge between the VM and the OVS integration bridge.
Routing and DHCP agents required dedicated network namespaces.
NAT was implemented using a combination of network namespaces, iptables and proxy-ARP.

OVN implements security groups, distributed virtual routing, NAT and distributed DHCP server all inside a single OVS bridge. This dramatically improves performance by reducing the number of inter-process packet handling and ensures that all flows can benefit from kernel fast-path switching.

At a high level, OVN consists of 3 main components:

OVN ML2 Plugin - performs translation between Neutron data model and OVN logical data model stored in Northbound DB.
OVN northd - the brains of OVN, translates the high level networking abstractions (logical switches, routers and ports) into logical flows. These logical flows are not yet OpenFlow flows but similar in concept and a very powerful abstraction. All translated information is stored in Southbound DB.
OVN controllers - located on each compute node, receive identical copies of logical flows (centralised network view) and exchange logical port to overlay IP binding information via the central Southbound DB. This information is used to perform logical flow translation into OpenFlow which are then programmed into the local OVS instance.

If you want to learn more about OVN architecture and use cases, OpenStack OVN page has an excellent collection of resources for further reading.

OpenStack installation

I’ll use RDO packstack to help me build a 1 controller and 2 compute nodes OpenStack lab on CentOS7. I’ll use the master trunk to deploy the latest OpenStack Ocata packages. This is required since at the time of writing (Nov 2016) some of the OVN features were not available in OpenStack Newton.

Install latest RDO repositories on all 3 nodes

cd /etc/yum.repos.d/
wget http://trunk.rdoproject.org/centos7/delorean-deps.repo
wget https://trunk.rdoproject.org/centos7-master/current/delorean.repo

On the controller node, generate a sample answer file and modify settings to match the IPs of individual nodes. Optionally, you can disable some of the unused components like Nagios and Ceilometer similar to how I did it in my earlier post.

Modify answer file and deploy OpenStack

yum install -y openstack-packstack crudini
packstack --gen-answer-file=/root/packstack.answer
crudini --set --existing defautl CONFIG_COMPUTE_HOSTS 169.254.0.12,169.254.0.13
crudini --set --existing defautl CONFIG_CONTROLLER_HOST 169.254.0.11
crudini --set --existing defautl CONFIG_NETWORK_HOSTS 169.254.0.11
packstack --answer-file=/root/packstack.answer

After the last step we should have a working 3-node OpenStack lab, similar to the one depicted below. If you want to learn about how to automate this process, refer to my older posts about OpenStack and underlay Leaf-Spine fabric build using Chef.

OVN Build

OVN can be built directly from OVS source code. Instead of building and installing OVS on each of the OpenStack nodes individually, I’ll build a set of RPM’s on the Controller and will use them to install and upgrade OVS/OVN components on the remaining nodes.

Part of OVN build process includes building an OVS kernel module. In order to be able to use kmod RPM on all nodes we need to make sure all nodes use the same version of Linux kernel. The easiest way would be to fetch the latest updates from CentOS repos and reboot the nodes. This step should result in same kernel version on all nodes, which can be checked with uname -r command.

Upgrade kernel on all nodes

yum -y update kernel
reboot

The official OVS installation procedure for CentOS7 is pretty accurate and requires only a few modifications to account for the packages missing in the minimal CentOS image I’ve used as a base OS.

OVS build procedure

yum install rpm-build autoconf automake libtool systemd-units openssl openssl-devel python python-twisted-core python-zope-interface python-six desktop-file-utils groff graphviz procps-ng libcap-ng libcap-ng-devel
yum install selinux-policy-devel kernel-devel-`uname -r` git
git clone https://github.com/openvswitch/ovs.git
cd ovs
./boot.sh
./configure
make rpm-fedora RPMBUILD_OPT="--without check"
make rpm-fedora-kmod

At the end of the process we should have a set of rpms inside the ovs/rpm/rpmbuild/RPMS/ directory.

OVN Install

Before we can begin installing OVN, we need to prepare the existing OpenStack environment by disabling and removing legacy Neutron OpenvSwitch agents. Since OVN natively implements L2 and L3 forwarding, DHCP and NAT, we won’t need L3 and DHCP agents on any of the Compute nodes. Network node that used to provide North-South connectivity will no longer be needed.

OpenStack preparation

First, we need to make sure all Compute nodes have a bridge that would provide access to external provider networks. In my case, I’ll move the eth1 interface under the OVS br-ex on all Compute nodes.

/etc/sysconfig/network-scripts/ifcfg-eth1

DEVICE=eth1
NAME=eth1
DEVICETYPE=ovs
TYPE=OVSPort
OVS_BRIDGE=br-ex
ONBOOT=yes
BOOTPROTO=none

IP address needs to be moved to br-ex interface. Below example is for Compute node #2:

/etc/sysconfig/network-scripts/ifcfg-br-ex

ONBOOT=yes
DEFROUTE=yes
IPADDR=169.254.0.12
PREFIX=24
GATEWAY=169.254.0.1
DNS1=8.8.8.8
DEVICE=br-ex
NAME=br-ex
DEVICETYPE=ovs
OVSBOOTPROTO=none
TYPE=OVSBridge

At the same time OVS configuration on Network/Controller node will need to be completely wiped out. Once that’s done, we can remove the Neutron OVS package from all nodes.

Remove legacy Neutron OVS agents

yum remove openstack-neutron-openvswitch

OVS packages installation

Now everything is ready for OVN installation. First step is to install the kernel module and upgrade the existing OVS package. Reboot may be needed in order for the correct kernel module to be loaded.

Upgrade OVS on all nodes

rpm -i openvswitch-kmod-2.6.90-1.el7.centos.x86_64.rpm
rpm -U openvswitch-2.6.90-1.el7.centos.x86_64.rpm
reboot

Now we can install OVN. Controllers will be running the ovn-northd process which can be installed as follows:

Install OVN on the Controller

rpm -i openvswitch-ovn-common-*.x86_64.rpm
rpm -i openvswitch-ovn-central-*.x86_64.rpm
systemctl start ovn-northd

The following packages install the ovn-controller on all Compute nodes:

Install OVN on Compute nodes

rpm -i openvswitch-ovn-common-*.x86_64.rpm
rpm -i openvswitch-ovn-host-*.x86_64.rpm
systemctl start ovn-controller

The last thing is to install the OVN ML2 plugin, a python library that allows Neutron to talk to OVN Northbound database.

Install OVN ML2 plugin on the Controller

yum install python-networking-ovn

OVN Configuration

Now that we have all the required packages in place, it’s time to reconfigure Neutron to start using OVN instead of a default openvswitch plugin. The installation procedure is described in the official Neutron integration guide. At the end, once we’ve restarted ovn-northd on the controller and ovn-controller on the compute nodes, we should see the following output on the controller node:

ovs-sbctl show

Chassis "d03bdd51-e687-4078-aa54-0ff8007db0b5"
    hostname: "compute-3"
    Encap geneve
        ip: "10.0.0.4"
        options: {csum="true"}
    Encap vxlan
        ip: "10.0.0.4"
        options: {csum="true"}
Chassis "b89b8683-7c74-43df-8ac6-1d57ddefec77"
    hostname: "compute-2"
    Encap vxlan
        ip: "10.0.0.2"
        options: {csum="true"}
    Encap geneve
        ip: "10.0.0.2"
        options: {csum="true"}

This means that all instances of a distributed OVN controller located on each compute node have successfully registered with Southbound OVSDB and provided information about their physical overlay addresses and supported encapsulation types.

(Optional) Automating everything with Chef

At this point of time there’s no way to automate OVN deployment with Packstack (TripleO already has OVN integration templates). For those who want to bypass the manual build process I have created a new Chef cookbook, automating all steps described above. This Chef playbook assumes that OpenStack environment has been built as described in my earlier post. Optionally, you can automate the build of underlay network as well by following my other post. Once you’ve got both OpenStack and underlay built, you can use the following scripts to build, install and configure OVN:

git clone https://github.com/networkop/chef-unl-os.git
cd chef-unl-os
chef-client -z -E lab ovn.rb

Test topology setup

Now we should be able to create a test topology with two tenant subnets and an external network interconnected by a virtual router.

Creating a test topology

neutron net-create NET-RED
neutron net-create NET-BLUE
neutron subnet-create --name SUB-BLUE NET-BLUE 10.0.0.0/24
neutron subnet-create --name SUB-RED NET-RED 20.0.0.0/24
neutron net-create NET-EXT --provider:network_type flat \
                           --provider:physical_network extnet \
                           --router:external --shared
neutron subnet-create --name SUB-EXT --enable_dhcp=False \
                      --allocation-pool=start=169.254.0.50,end=169.254.0.99 \
                      --gateway=169.254.0.1 NET-EXT 169.254.0.0/24
neutron router-create R1
neutron router-interface-add R1 SUB-BLUE
neutron router-interface-add R1 SUB-RED
neutron router-gateway-set R1 NET-EXT

When we attach a few test VMs to each subnet we should be able to successfully ping between the VMs, assuming the security groups are setup to allow ICMP/ND.

Create VMs

curl http://download.cirros-cloud.net/0.3.4/cirros-0.3.4-x86_64-disk.img | glance \
image-create --name='IMG-CIRROS' \
  --visibility=public \
  --container-format=bare \
  --disk-format=qcow2
nova aggregate-create AGG-RED AZ-RED
nova aggregate-create AGG-BLUE AZ-BLUE
nova aggregate-add-host AGG-BLUE compute-2
nova aggregate-add-host AGG-RED compute-3
nova boot --flavor m1.tiny --image 'IMG-CIRROS' \
  --nic net-name=NET-BLUE \
  --availability-zone AZ-BLUE \
  VM1

nova boot --flavor m1.tiny --image 'IMG-CIRROS' \
  --nic net-name=NET-RED \
  --availability-zone AZ-RED \
  VM2
nova boot --flavor m1.tiny --image 'IMG-CIRROS' \
  --nic net-name=NET-BLUE \
  --availability-zone AZ-RED \
  VM3
openstack floating ip create NET-EXT
openstack server add floating ip VM3 169.254.0.53

In the next post we will use the above virtual topology to explore the dataplane packet flow inside an OVN-managed OpenvSwitch and how it uses the new encapsulation protocol GENEVE to optimise egress forwarding lookups on remote compute nodes.

Type-2 and Type-5 EPVN on vQFX 10k in UnetLab

2016-10-26T00:00:00+01:00

I was fortunate enough to be given a chance to test the new virtual QFX 10k image from Juniper. In this post I will show how to import this image into UnetLab and demonstrate the basic L2 and L3 EVPN services.

News about UnetLab

Those who read my blog regularly know that I’m a big fan of a network simulator called UnetLab. For the last two years I’ve done all my labbing in UNL and was constantly surprised by how extensible and stable it has been. I believe that projects like this are very important to our networking community because they help train the new generation of network engineers and enable them to expand their horizons. Recently UnetLab team has decided take the next step and create a new version of UNL. This new project, called EVE-NG, will help users build labs of any size and run full replicas of their production networks, which is ideal for pre-deployment testing of network changes. If you want to learn more, check out the EVE-NG page on indiegogo.

Creating vQFX nodes in UnetLab

Back to the business at hand, vQFX is not publically available yet but is expected to pop up at Juniper.net some time in the future. Similar to a recently released vMX, vQFX will consist of two virtual machines - one running the routing engine (RE) and second simulating the ASIC forwarding piplines (PFE). You can find more information about these images on Juniper’s Github page. Images get distributed in multiple formats but in the context of this post we’ll only deal with two VMDK files:

vQFX images

vqfx10k-re-15.1X53-D60.vmdk
vqfx10k-pfe-20160609-2.vmdk

To be able to use these images in UnetLab, we first need to convert them to qcow2 format and copy them to the directory where UNL stores all its qemu images:

Importing VMDK images

 mkdir /opt/unetlab/addons/qemu/qfx_re-15d1X53
 mkdir /opt/unetlab/addons/qemu/qfx_pfe-20160609

/opt/qemu/bin/qemu-img convert -f vmdk -O qcow2 vqfx10k-pfe-20160609-2.vmdk /opt/unetlab/addons/qemu/qfx_pfe-20160609/hda.qcow2

/opt/qemu/bin/qemu-img convert -f vmdk -O qcow2 vqfx10k-re-15.1X53-D60.vmdk /opt/unetlab/addons/qemu/qfx_re-15d1X53/hda.qcow2

Next, we need to create new node definitions for RE and PFE VMs. The easiest way would be to clone the linux node type:

Creating vQFX node definitions

cd /opt/unetlab/html/templates
cp linux.php qfx_pfe.php
cp linux.php qfx_re.php

sed -i 's/2048/1024/; s/virtio-net-pci/e1000/; s/Server/Switch/' qfx_re.php
sed -i 's/2048/1536/; s/virtio-net-pci/e1000/; s/Server/Switch/' qfx_pfe.php

sed -i 's/Linux/QFX_RE/g; s/linux/qfx_re/g' qfx_re.php
sed -i 's/Linux/QFX_PFE/g; s/linux/qfx_pfe/g' qfx_pfe.php

sed -ri 's/(.*ethernet.*) = 1/\1 = 2/' qfx_pfe.php
sed -ri 's/(.*ethernet.*) = 1/\1 = 8/' qfx_re.php

Now let’s add the QFX to the list of nodes by modifying the following file:

/opt/unetlab/html/includes/init.php

'openstack'             =>      'Openstack',
'qfx_re'                =>      'QFX10k-RE',
'qfx_pfe'               =>      'QFX10k-PFE',
'mikrotik'              =>      'MikroTik RouterOS',

Optionally, /opt/unetlab/html/includes/__node.php can be modified to change the default naming convention similar to the vmx node.

Once you’ve done all the above changes, you should have a working vQFX 10k node available in UNL GUI. For the purpose of demonstration of EVPN features I’ve created the following topology:

EVPN L2 and L3 services

EVPN standards define multiple routes types to distribute NLRI information across the network. The two most “significant” route types are 2 and 5. Type-2 NLRI was designed to carry the MAC (and optionally IP) address to VTEP IP binding information which is used to populate the dynamic MAC address table. This function, that was previously accomplished by a central SDN controller, is now performed in a scalable, standard-based, controller-independent fashion. Type-5 NLRI contains IP Prefix to VTEP IP mapping and is similar to the function of traditional L3 VPNs. In order to explore the capabilities of EVPN implementation on vQFX I’ve created and artificial scenario with 3 virtual switches, 3 VLANs and 4 hosts.

VLAN10 (green) is present on all 3 switches, VLAN20 (purple) is only configured on switches 1 and 2 and VLAN88 (red) only exists on SW3. I’ve provided configuration snippets below for reference purposes only and only for SW1. Remaining switches are configured similarly.

Configuring Basic IP and BGP setup

set interfaces xe-0/0/0 unit 0 family inet address 12.12.12.1/24
set interfaces xe-0/0/2 unit 0 family inet address 13.13.13.1/24
set interfaces lo0 unit 0 family inet address 99.99.99.1/32
set routing-options static route 99.99.99.2/32 next-hop 12.12.12.2
set routing-options static route 99.99.99.3/32 next-hop 13.13.13.3

set routing-options router-id 99.99.99.1
set routing-options autonomous-system 555

set routing-options autonomous-system 555
set protocols bgp group EVPN type internal
set protocols bgp group EVPN local-address 99.99.99.1
set protocols bgp group EVPN family evpn signaling
set protocols bgp group EVPN neighbor 99.99.99.2
set protocols bgp group EVPN neighbor 99.99.99.3

Configuring End-host connectivity and IRB

set vlans BD5010 vlan-id 10
set vlans BD5010 l3-interface irb.10
set vlans BD5020 vlan-id 20
set vlans BD5020 l3-interface irb.20
set interfaces xe-0/0/1 unit 0 family ethernet-switching vlan members 10
set interfaces xe-0/0/3 unit 0 family ethernet-switching vlan members 20
set interfaces irb unit 10 family inet address 10.0.0.254/24
set interfaces irb unit 20 family inet address 20.0.0.254/24

Configuring L2 EVPN services

set protocols evpn encapsulation vxlan
set protocols evpn extended-vni-list all
set protocols evpn multicast-mode ingress-replication

set switch-options vtep-source-interface lo0.0
set switch-options route-distinguisher 555:0
set switch-options vrf-target target:555:123

set vlans BD5010 vxlan vni 5010
set vlans BD5010 vxlan ingress-node-replication
set vlans BD5020 vxlan vni 5020
set vlans BD5020 vxlan ingress-node-replication

Configuring L3 EVPN service

set routing-instances EVPN-VRF instance-type vrf
set routing-instances EVPN-VRF interface irb.10
set routing-instances EVPN-VRF interface irb.20
set routing-instances EVPN-VRF interface lo0.10
set routing-instances EVPN-VRF route-distinguisher 555:1
set routing-instances EVPN-VRF vrf-target target:123:5055
set routing-instances EVPN-VRF protocols evpn ip-prefix-routes advertise direct-nexthop
set routing-instances EVPN-VRF protocols evpn ip-prefix-routes encapsulation vxlan
set routing-instances EVPN-VRF protocols evpn ip-prefix-routes vni 5555

Traffic flow overview

Once all the nodes have been configured, we can have a closer look at the traffic flows, specifically at how packets are being forwarded and where the L2 and L3 lookups take place.

L2 forwarding - H1 to H2 (00:50:79:66:68:06)

Traffic from H1 to H2 will never leave its own broadcast domain. As soon as the packet hits the incoming interface of SW1, MAC address lookup occurs pointing to the remote VTEP interface of SW2.

SW1> show ethernet-switching table | match 00:50:79:66:68:06

   BD5010              00:50:79:66:68:06   D        vtep.32769             99.99.99.2

Once SW2 decapsulates the packet, the lookup in the MAC address table returns the locally connected interface, where it gets forwarded next.

SW2> show ethernet-switching table | match 00:50:79:66:68:06

   BD5010              00:50:79:66:68:06   D        xe-0/0/1.0

L3 forwarding (symmetric) - H3 to H4

The route to 8.8.8.0/24 is advertised by SW3 in type-5 NLRI

SW1> show route receive-protocol bgp 99.99.99.3 extensive

* 5:555:1::0::8.8.8.0::24/304 (1 entry, 1 announced)
     Import Accepted
     Route Distinguisher: 555:1
     Route Label: 5555
     Overlay gateway address: 0.0.0.0
     Nexthop: 99.99.99.3
     Localpref: 100
     AS path: I
     Communities: target:123:5055 encapsulation0:0:0:0:vxlan router-mac:02:05:86:71:72:00

This NLRI doesn’t contain any overlay gateway address, however it does have a special “router-mac” community with a globally unique SW3’s chassis MAC. This MAC is advertised as normal type-2 MAC address and points to the remote VTEP interface of SW3:

SW1> show ethernet-switching table | match 02:05:86:71:72:00

   BD5010              02:05:86:71:72:00   D        vtep.32770             99.99.99.3

The above two pieces of information are fed into our EVPN-VRF routing table to produce the entry with the following parameters:

SW1> show route table EVPN-VRF.inet.0 detail 8.8.8.8 | match "VTEP|VNI|MAC"

Encap VNI: 5555, Decap VNI: 5555
Source VTEP: 99.99.99.1, Destination VTEP: 99.99.99.3
SMAC: 02:05:86:71:3b:00, DMAC: 02:05:86:71:72:00

This is the example of how “symmetric” IRB routing is performed. Instead of routing the packet at the ingress and switching at the egress node, how it was done in the case of Neutron’s DVR, the routing is performed twice. First the packet is routed into a “transit” VNI 5555, which glues all the switches in the same EVI together from the L3 perspective. Once the packet reaches the destination node, it gets routed into the intended VNI (5088 in our case) and forwarded out the local interface. This way switches may have different sets of VLANs and IRBs and still be able route packets between VXLANs.

L3 forwarding (asymmetric) - H1 to H4

As you may have noticed, the green broadcast domain extends to all three switches, even though hosts are only attached to the first two. Let’s see how it will affect the packet flows. The flow from H1 to H4 will be similar to the one from H3 to H4 described above. However return packets will get routed on SW3 directly into VXLAN5010, since that switch has an IRB.10 interface and then switched all the way to H1.

SW3> show route forwarding-table destination 10.0.0.1

Routing table: EVPN-VRF.inet
Internet:
Destination        Type RtRef Next hop           Type Index    NhRef Netif
10.0.0.1/32        dest     0 0:50:79:66:68:5    ucst     1772     1 vtep.32770

This is the example of “asymmetric” routing, similar to the one exhibited by Neutron DVR. You would see similar behaviour if you examined the flow between H3 and H2.

Conclusion

So why all the hassle configuring EPVN on data centre switches? For one, you can get rid of MLAG in TOR switches and replace it with EVPN multihoming. However, the main benefit is that you can stretch L2 broadcast domains across your whole data centre without the need for an SDN controller. So, for example, we can now easily satisfy the requirement of having external floating IP network on all compute nodes introduced by Neutron DVR. EVPN-enabled switches can also now perform functions similar to DC gateway routers (the likes of ASR, MX or SR) while giving you the benefits of horizontal scaling of Leaf/Spine networks. As more and more vendors introduce EVPN support, it is poised to become the ultimate DC routing protocol, complementing the functions already performed by the host-based virtual switches, and with all the DC switches running BGP already, introducing EVPN may be as easy as enabling a new address family.

OpenStack SDN - Distributed Virtual Routing

2016-10-13T00:00:00+01:00

In this post we’ll explore how DVR is implemented in OpenStack Neutron and what are some of its benefits and shortcomings.

To be honest I was a little hesitant to write this post because the topic of Neutron’s DVR has already been exhaustively covered by many, including Assaf Muller, Eran Gampel and in the official OpenStack networking guide. The coverage of the topic was so thorough that I barely had anything to add. However I still decided to write a DVR post of my own for the following two reasons:

I often use my own posts as references and it’s always easier for me to find information in my own writings.
I wanted to use this post as a reference platform for subsequent posts about dynamic routing and OVN project.

The topic of Neutron’s DVR is quite vast so I had to compromise between the length of this post and the level of details. In the end, I edited out most of the repeated content and replaced it with references to my older posts. I think I left everything that should be needed to follow along the narrative so hopefully it won’t seem too patchy.

Virtual topology overview

Let’s see what we’re going to be dealing with in this post. This is a simple virtual topology with two VMs sitting in two different subnets. VM1 has a floating IP assigned that is used for external access.

Before we get to the packet walk details, let me briefly describe how to build the above topology using Neutron CLI. I’ll assume that OpenStack has just been installed and nothing has been configured yet, effectively we’ll pick up from where we left our lab in the previous post.

Virtual topology setup

Upload Cirros Linux image to OpenStack's image repository

curl http://download.cirros-cloud.net/0.3.4/cirros-0.3.4-x86_64-disk.img | glance \
     image-create --name='IMG-CIRROS' \
     --visibility=public \
     --container-format=bare \
     --disk-format=qcow2

Create 2 virtual subnets - RED and BLUE

neutron net-create NET-BLUE
neutron subnet-create --name SUB-BLUE NET-BLUE 10.0.0.0/24 \
  --dns-nameserver 8.8.8.8

neutron net-create NET-RED
neutron subnet-create --name SUB-RED NET-RED 10.0.1.0/24 \
  --dns-nameserver 8.8.8.8

Create external network

neutron net-create EXT-NET --provider:network_type flat \
  --provider:physical_network extnet  \
  --router:external \
  --shared

neutron subnet-create --name EXT-SUB \
 --enable_dhcp=False \
 --allocation-pool=start=169.254.0.50,end=169.254.0.99 \
 --gateway=169.254.0.1 EXT-NET 169.254.0.0/24

Create a router and attach it to all three networks created above

neutron router-create R1
neutron router-interface-add R1 SUB-RED
neutron router-interface-add R1 SUB-BLUE
neutron router-gateway-set R1 EXT-NET

Create two host aggregates to spread the VMs across two different hosts

nova aggregate-create AGG-RED AZ-RED
nova aggregate-create AGG-BLUE AZ-BLUE
nova aggregate-add-host AGG-BLUE compute-2
nova aggregate-add-host AGG-RED compute-3

Boot VMs on two different hypervisors

nova boot --flavor m1.tiny --image 'IMG-CIRROS' \
     --nic net-name=NET-BLUE \
    --availability-zone AZ-BLUE \
     VM1

nova boot --flavor m1.tiny --image 'IMG-CIRROS' \
    --nic net-name=NET-RED \
    --availability-zone AZ-RED \
    VM2

Assign a floating IP (Static NAT) to VM1

nova floating-ip-create EXT-NET
nova floating-ip-associate VM1 169.254.0.55

Enable ingress ICMP and SSH access

nova secgroup-add-rule default tcp 22 22 0.0.0.0/0
nova secgroup-add-rule default icmp -1 -1 0.0.0.0/0

Make sure that both VMs are up and running.

nova list

+--------------------------------------+------+--------+------------+-------------+----------------------------------+
| ID                                   | Name | Status | Task State | Power State | Networks                         |
+--------------------------------------+------+--------+------------+-------------+----------------------------------+
| 92263ae8-43d1-4cd0-b271-2b11f0efbe7f | VM1  | ACTIVE | -          | Running     | NET-BLUE=10.0.0.12, 169.254.0.55 |
| b4562f24-2461-49fb-875b-fa1bf869dc4a | VM2  | ACTIVE | -          | Running     | NET-RED=10.0.1.4                 |
+--------------------------------------+------+--------+------------+-------------+----------------------------------+

non-DVR traffic flow

Using the technique described in my earlier post I’ve collected the dynamically allocated port numbers and created a physical representation of our virtual network.

For the sake of brevity I will omit the verification commands. The traffic flow between VM1 and VM2 will follow the standard path that I’ve explored in my native Neutron SDN post.
It is obvious that in this case traffic flows are suboptimal. Instead of going directly between the peer compute nodes, the packet has to hairpin through a Neutron router. This adds to the end-to-end latency and creates unnecessary load on the Network node. These are one of the main reasons why Distributed Virtual Routing was introduced in OpenStack Juno.

Enabling DVR

Enabling DVR requires configuration changes of multiple files on all OpenStack nodes. At a high level, all compute nodes will now run Neutron’s L3-agent service which will be responsible for provisioning of DVR and other auxiliary namespaces. The details of specific configuration options that need to be enabled can be found in the official OpenStack Networking guide. As usual, I’ve incorporated all the necessary changes into a single Chef cookbook, so in order to enable DVR in our lab all what you need to do is run the following commands from the UNetLab VM:

git pull origin master
chef-client -z -E lab neutron.rb

Once all changes has been made, we need to either create a new router or update the existing one to enable the DVR functionality:

neutron router-update --admin-state-up False --distributed True R1
neutron router-update --admin-state-up True R1

DVR East-West traffic flow

Now let’s see how the traffic flows have changed with the introduction of DVR.

We’re going to be examining the following traffic flow:

From VM1 (10.0.0.12/fa:16:3e:83:92:96)
Via router R1 (10.0.0.1/fa:16:3e:72:7a:50; 10.0.1.1/fa:16:3e:6a:2c:8b)
To VM2 (10.0.1.4/fa:16:3e:76:31:68)

R1 now has an instance on all compute nodes that have VMs in the BLUE or RED networks. That means that VM1 will send a packet directly to the R1’s BLUE interface via the integration bridge.

ovs-appctl fdb/show br-int

 port  VLAN  MAC                Age
   1  fa:16:3e:83:92:96    1
   1  fa:16:3e:72:7a:50    1
   2  fa:16:3e:6a:2c:8b    1

This is dynamically populated MAC address table of the integration bridge. You can see that the MAC address of VM1 and both interfaces of R1 have been learned. That means that when VM1 sends a packet to its default gateway’s MAC address, it will go directly to R1’s BLUE interface on port 4.

In this post I will omit the details of ARP resolution process which remains the same as before, however there’s one interesting detail that is worth mentioning before we move on. During the initial flood-and-learn phase on the br-int, the ARP request will get flooded down to the tunnel bridge. As per the standard behaviour, the packet should get replicated to all nodes. However, in this case we don’t want to hear responses from other nodes, since the router is hosted locally. In order to help that, tunnel bridges explicitly drop all packets coming from integration bridges and destined for MAC addresses of locally hosted routers:

ovs-appctl ofproto/trace br-tun in_port=1,dl_vlan=1,dl_dst=fa:16:3e:72:7a:50 | grep -E "Rule|action"

Rule: table=0 cookie=0xa3536ac94478bd1d priority=1,in_port=1
OpenFlow actions=goto_table:1
        Rule: table=1 cookie=0xa3536ac94478bd1d priority=2,dl_vlan=1,dl_dst=fa:16:3e:72:7a:50
        OpenFlow actions=drop

Getting back to our traffic flow, once the IP packet has reached the DVR instance of R1 on compute node #2, the routing lookup occurs and the packet is sent back to the integration bridge with a new source MAC of R1’s RED interface.

ip netns exec qrouter-uuid ip route

10.0.0.0/24 dev qr-102c4426-86  proto kernel  scope link  src 10.0.0.1
10.0.1.0/24 dev qr-3779302e-62  proto kernel  scope link  src 10.0.1.1

Tunnel bridge will do its usual work by locating the target compute node based on the destination MAC address of VM2 (DVR requires L2 population to be enabled) and will send the packet directly to the compute node #3.

ovs-appctl ofproto/trace br-tun in_port=1,dl_vlan=2,dl_src=fa:16:3e:6a:2c:8b,dl_dst=fa:16:3e:76:31:68 | grep -E "Rule|action"

Rule: table=0 cookie=0xa3536ac94478bd1d priority=1,in_port=1
OpenFlow actions=goto_table:1
        Rule: table=1 cookie=0xa3536ac94478bd1d priority=1,dl_vlan=2,dl_src=fa:16:3e:6a:2c:8b
        OpenFlow actions=set_field:fa:16:3f:d3:10:60->eth_src,goto_table:2
                Rule: table=2 cookie=0xa3536ac94478bd1d priority=0,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00
                OpenFlow actions=goto_table:20
                        Rule: table=20 cookie=0xa3536ac94478bd1d priority=2,dl_vlan=2,dl_dst=fa:16:3e:76:31:68
                        OpenFlow actions=pop_vlan,set_field:0x4d->tun_id,output:3

Since all instances of R1 have the same set of IP/MAC addresses, the MAC address of a local router can be learned by the remote integration bridge hosting the same instance of DVR. In order to prevent that from happening, the sending br-tun replaces the source MAC address of the frame with the set_field:fa:16:3f:d3:10:60->eth_src action. This way the real R1’s MAC address gets masked as the frame leaves the node. These “mask” MACs are generated by and learned from the Neutron server, which ensures that each node gets a unique address.

The receiving node’s br-tun will swap the VXLAN header with a VLAN ID and forward the frame up to the integration bridge.

ovs-appctl ofproto/trace br-tun in_port=3,tun_id=0x4d,dl_dst=fa:16:3e:76:31:68 | grep -E "Rule|action"

Rule: table=0 cookie=0x8a7dedf35101427f priority=1,in_port=3
OpenFlow actions=goto_table:4
        Rule: table=4 cookie=0x8a7dedf35101427f priority=1,tun_id=0x4d
        OpenFlow actions=push_vlan:0x8100,set_field:4097->vlan_vid,goto_table:9
                Rule: table=9 cookie=0x8a7dedf35101427f priority=0
                OpenFlow actions=goto_table:10
                        Rule: table=10 cookie=0x8a7dedf35101427f priority=1
                        OpenFlow actions=learn(table=20,hard_timeout=300,priority=1,cookie=0x8a7dedf35101427f,NXM_OF_VLAN_TCI[0..11],NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:0->NXM_OF_VLAN_TCI[],load:NXM_NX_TUN_ID[]->NXM_NX_TUN_ID[],output:OXM_OF_IN_PORT[]),output:1
                                Rule: table=0 cookie=0x9a1e0026794eadc5 priority=1
                                OpenFlow actions=NORMAL

Integration bridge of compute node #3 will lookup the destination MAC address and send the packet out port 2.

ovs-appctl fdb/show br-int

 port  VLAN  MAC                Age
   1  fa:16:3e:76:31:68    0
   2  fa:16:3e:72:7a:50    0
   1  fa:16:3e:6a:2c:8b    0
   1  fa:16:3e:29:de:20    0

The reverse packet flow is similar - the packet will get routed on the compute node #3 and sent in a BLUE network to the compute node #2.

External connectivity

External connectivity will be very different for VMs with and without a floating IP. We will examine each case individually.

Case 1 - Overload NAT (VM2 with no FIP)

External connectivity for VMs with no floating IP is still performed by the Network node. This time however, NATing is performed by a new element - SNAT namespace. As per the normal behaviour, VM2 will send a packet to its default gateway first. Let’s have a closer look at the routing table of the DVR:

ip netns exec qrouter-uuid ip route

10.0.0.0/24 dev qr-102c4426-86  proto kernel  scope link  src 10.0.0.1
10.0.1.0/24 dev qr-3779302e-62  proto kernel  scope link  src 10.0.1.1

There’s no default route in the main routing table, so how would it get routed out? DVRs extensively use Linux routing policy database (RPDB), a feature that has a lot in common with OpenFlow tables. The principle of RPDB is that every packet gets matched against a set of routing tables until there’s a hit. The tables are checked in the order of their priority (lowest to highest). One of the main features of RPDB is the ability to perform matches based on something other than the destination IP address, which is why it’s often referred to as policy-based routing. To view the contents of RPDB use the ip rule command under the DVR namespace:

ip netns exec qrouter-uuid ip rule

0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup default
167772161:      from 10.0.0.1/24 lookup 167772161
167772417:      from 10.0.1.1/24 lookup 167772417

In our case table 167772161 matches all packets sourced from the BLUE subnet and if we examine the corresponding routing table we’ll find the missing default route there.

ip netns exec qrouter-uuid ip route list table 167772417

default via 10.0.1.12 dev qr-3779302e-62

The next hop of this default route points to the SNAT’s interface in the BLUE network. MAC address is statically programmed by the local L3-agent.

ip netns exec qrouter-uuid ip neigh | grep 10.0.1.12

10.0.1.12 dev qr-3779302e-62 lladdr fa:16:3e:29:de:20 PERMANENT

Integration bridge sends the packet out port 1 to the tunnel bridge.

ovs-appctl fdb/show br-int | grep fa:16:3e:29:de:20

1     1  fa:16:3e:29:de:20    1

Tunnel bridge finds the corresponding match and sends the VXLAN-encapsulated packet to the Network node.

ovs-appctl ofproto/trace br-tun in_port=1,dl_vlan=1,dl_dst=fa:16:3e:29:de:20 | tail -n 1

Datapath actions: set(tunnel(tun_id=0x4d,src=10.0.0.4,dst=10.0.0.0,ttl=64,flags(df|key))),pop_vlan,3

Tunnel bridge of the Network node forwards the frame up to the integration bridge.

ovs-appctl ofproto/trace br-tun in_port=3,tun_id=0x4d,dl_dst=fa:16:3e:29:de:20 | grep output

OpenFlow actions=learn(table=20,hard_timeout=300,priority=1,cookie=0xb9be3fe62922c800,NXM_OF_VLAN_TCI[0..11],NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:0->NXM_OF_VLAN_TCI[],load:NXM_NX_TUN_ID[]->NXM_NX_TUN_ID[],output:OXM_OF_IN_PORT[]),output:1

Integration bridge sends the frame to port 10, which is where SNAT namespace is attached

ovs-appctl fdb/show br-int | grep fa:16:3e:29:de:20

   10     1  fa:16:3e:29:de:20    0

SNAT is a namespace with an interface in each of the subnets - BLUE, RED and External subnet

ip netns exec snat-uuid ip a | grep -E "UP|inet"

16: sg-fefd493b-a5:  mtu 1450 qdisc noqueue state UNKNOWN
    link/ether fa:16:3e:99:5c:3a brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.6/24 brd 10.0.0.255 scope global sg-fefd493b-a5
18: sg-b3d58360-b4:  mtu 1450 qdisc noqueue state UNKNOWN
    link/ether fa:16:3e:29:de:20 brd ff:ff:ff:ff:ff:ff
    inet 10.0.1.12/24 brd 10.0.1.255 scope global sg-b3d58360-b4
19: qg-765b5aca-ce:  mtu 1500 qdisc noqueue state UNKNOWN
    link/ether fa:16:3e:d5:75:0e brd ff:ff:ff:ff:ff:ff
    inet 169.254.0.57/24 brd 169.254.0.255 scope global qg-765b5aca-ce

SNAT has a single default route pointing to the External network’s gateway.

ip netns exec snat-uuid ip route | grep default

default via 169.254.0.1 dev qg-765b5aca-ce

Before sending the packet out, iptables will NAT the packet to hide it behind SNAT’s qg external interface IP.

ip netns exec snat-uuid iptables -t nat -L | grep SNAT

SNAT       all  --  anywhere             anywhere             to:169.254.0.57

Case 2 - Static NAT (VM1 with FIP)

The first step in this scenario is the same - VM1 sends a packet to the MAC address of its default gateway. As before, the default route is missing in the main routing table.

ip netns exec qrouter-uuid ip route list table main

0.0.0/24 dev qr-102c4426-86  proto kernel  scope link  src 10.0.0.1
0.1.0/24 dev qr-3779302e-62  proto kernel  scope link  src 10.0.1.1
254.106.114/31 dev rfp-e4d4897e-7  proto kernel  scope link  src 169.254.106.114

Looking at the ip rule configuration we can find that table 16 matches all packets from that particular VM (10.0.0.12).

ip netns exec qrouter-uuid ip rule

0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup default
57481:  from 10.0.0.12 lookup 16
167772161:      from 10.0.0.1/24 lookup 167772161
167772417:      from 10.0.1.1/24 lookup 167772417

Routing table 16 sends the packet via a point-to-point veth pair link to the FIP namespace.

ip netns exec qrouter-uuid ip route list table 16

default via 169.254.106.115 dev rfp-e4d4897e-7

Before sending the packet out, DVR translates the source IP of the packet to the FIP assigned to that VM.

ip netns exec qrouter-uuid iptables -t nat -L | grep NAT

SNAT       all  --  10.0.0.12            anywhere             to:169.254.0.55

A FIP namespace is a simple router designed to connect multiple DVRs to external network. This way all routers can share the same “uplink” namespace and don’t have to consume valuable addresses from external subnet.

ip netns exec fip-uuid ip a | grep -E "UP|inet"

2: fpr-e4d4897e-7:  mtu 1500 qdisc pfifo_fast state UP qlen 1000
    inet 169.254.106.115/31 scope global fpr-e4d4897e-7
15: fg-d3bb699d-af:  mtu 1500 qdisc noqueue state UNKNOWN
    inet 169.254.0.58/24 brd 169.254.0.255 scope global fg-d3bb699d-af

Default route inside the FIP namespace points to the External subnet’s gateway IP.

ip netns exec fip-uuid ip route | grep default

default via 169.254.0.1 dev fg-d3bb699d-af

The MAC address of the gateway is statically configured by the L3 agent.

ip netns exec fip-uuid ip neigh | grep 169.254.0.1

169.254.0.1 dev fg-d3bb699d-af lladdr 32:3e:7d:13:ca:78 DELAY

The packet is sent to the br-int with the destination MAC address of the default gateway, which is learned on port 3.

ovs-appctl fdb/show br-int | grep 32:3e:7d:13:ca:78

  3     3  32:3e:7d:13:ca:78    1

External bridge strips the VLAN ID of the packet coming from the br-int and does the lookup in the dynamic MAC address table.

ovs-appctl ofproto/trace br-ex in_port=2,dl_vlan=3 | grep actions=

OpenFlow actions=pop_vlan,NORMAL

The frame is forwarded out the physical interface.

ovs-appctl fdb/show br-ex | grep 32:3e:7d:13:ca:78

1     0  32:3e:7d:13:ca:78    1

Reverse packet flow will be quite similar, however in this case FIP namespace must be able to respond to ARP requests for the IPs that only exist on DVRs. In order to do that, it uses a proxy-ARP feature. First, L3 agent installs a static route for the FIP pointing back to the correct DVR over the veth pair interface:

ip netns exec fip-uuid ip route get 169.254.0.55

169.254.0.55 via 169.254.106.114 dev fpr-e4d4897e-7  src 169.254.106.115

Now that the FIP namespace knows the route to the floating IP, it can respond to ARPs on behalf of DVR as long as proxy-ARP is enabled on the external fg interface:

ip netns exec fip-uuid

cat /proc/sys/net/ipv4/conf/fg-d3bb699d-af/proxy_arp
1

Finally, the DVR NATs the packet back to its internal IP in the BLUE subnet and forwards it straight to VM1.

ip netns exec qrouter-uuid iptables -t nat -L | grep DNAT

DNAT       all  --  anywhere             169.254.0.55         to:10.0.0.12

DVR Pros and Cons

Without a doubt DVR has introduced a number of much needed improvements to OpenStack networking:

East-West traffic now follows the most optimal path thereby reducing the load on the Network node.
External connectivity to floating IPs now also follows the most optimal path directly to the compute node hosting the VM.

However, there’s a number of issues that either remain unaddressed or result directly from the current DVR architecture:

DHCP and SNAT are still hosted on the Network node.
Asymmetric routing means that every DVR needs to have an interface in every configured subnet, even when there are no VMs that belong to those subnets on the current compute node.
Direct connectivity to FIP means that all compute nodes now need to have direct L2 adjacency to external subnets.
FIP namespace on compute nodes consumes IP addresses from external subnets which can be a problem if external subnet is in a public IPv4 address range.
DVR implementation as a network namespace creates additional overhead in packet processing.

Some of the above issues are not critical and can be fixed with a little effort:

In order to reduce the scope of a External network VLAN span inside the DC, dedicate a subset of hosts that will have a direct L2 adjacency to external networks and only deploy external-facing VMs on those hosts.
Since FIP namespace only requires an external IP address for debugging purposes, we can create an additional, secondary, subnet in RFC1918 space for FIP connectivity. This is enabled by a feature called subnet “Service types” and is available in the latest Newton release.

However the main issue still remains unresolved. Every North-South packet has to hop several times between the global and DVR/FIP/NAT namespaces. These kind of operations are very expensive in terms of consumed CPU and memory resources and can be very detrimental to network performance. Using namespaces may be the most straight-forward and non-disruptive way of implementing DVR, however it’s definitely not the most optimal. Ideally we’d like to see both L2 and L3 pipelines implemented in OpenvSwitch tables. This way all packets can benefit from OVS fast-path flow caching. But fear not, the solution to this already exists in a shape of Open Virtual Network. OVN is a project spawned from the OVS and aims to address a number of shortcomings existing in current implementations of virtual networks.

Automating the Build of OpenStack Lab (Part 2)

2016-09-09T00:00:00+01:00

In this post we’ll use Chef, unnumbered BGP and Cumulus VX to build a massively scalable “Lapukhov” Leaf-Spine data centre.

In the last post we’ve seen how to use Chef to automate the build of a 3-node OpenStack cloud. The only thing remaining is to build an underlay network supporting communication between the nodes, which is what we’re going to do next. The build process will, again, be relatively simple and will include only a few manual steps, but before we get there let me go over some of the decisions and assumptions I’ve made in my network design.

High-level design

The need to provide more bandwidth for East-West traffic has made the Clos Leaf-Spine architecture a de facto standard in any data centre network design. The use of virtual overlay networks has obviated the requirement to have a strict VLAN and IP numbering schemes in the underlay. The only requirement for the compute nodes now is to have any-to-any layer 3 connectivity. This is how the underlay network design has converged to a Layer 3 Leaf-Spine architecture.
The choice of a routing protocol is not so straight-forward. My fellow countryman Petr Lapukhov and co-authors of RFC draft claim that having a single routing protocol in your WAN and DC reduces complexity and makes interoperability and operations a lot easier. This draft presents some of the design principles that can be used to build a L3 data centre with BGP as the only routing protocol. In our lab we’re going to implement a single “cluster” of the multi-tier topology proposed in that RFC.

In order to help us build this in an automated and scalable way, we’re going to use a relatively new feature called unnumbered BGP.

Unnumbered BGP as a replacement for IGP

As we all know, one of the main advantages of interior gateway protocols is the automatic discovery of adjacent routers which is accomplished with the help of link-local multicasts. On the other hand, BGP traditionally required you to explicitly define neighbor’s IP address in order to establish a peering relationship with it. This is where IPv6 comes to the rescue. With the help of neighbor discovery protocol and router advertisement messages, it becomes possible to accurately determine the address of the peer BGP router on an intra-fabric link. The only question is how we would exchange IPv4 information over and IPv6-only BGP network.
RFC 5549, described an “extended nexthop encoding capability” which allows BGP to exchange routing updates with nexthops that don’t belong to the address family of the advertised prefix. In plain English it means that BGP is now capable of advertising an IPV4 prefix with an IPv6 nexthop. This makes it possible to configure all transit links inside the Clos fabric with IPv6 link-local addresses and still maintain reachability between the edge IPv4 host networks. Since nexthop IPs will get updated at every hop, there is no need for an underlying IGP to distribute them between all BGP routers. What we see is, effectively, BGP absorbing the functions of an IGP protocol inside the data centre.

Configuration example on Cumulus VX

In order to implement BGP unnumbered on Cumulus Linux all you need to is:

Enable IPv6 router advertisements on all transit links
Enable BGP on the same interfaces

Example Quagga configuration snippet will look like this:

interface swp1
  ipv6 nd ra-interval 5
  no ipv6 nd suppress-ra

rouer bgp 
  neighbor swp1 interface
  neighbor swp1 external

As you can see, Cumulus simplifies it even more by allowing you to only specify the BGP peering type (external/internal) and learning the value of peer BGP AS dynamically from a neighbor.

Design assumptions and caveats

With all the above in mind, this is the list of decisions I’ve made while building the fabric configuration:

All switches inside the fabric will be running BGP peerings using IPv6 link-local addresses
eBGP will be used throughout to simplify configuration automation (all peers will be external)
Each Leaf/Spine switch will have a unique IPv4 loopback address assigned for management purposes (ICMP, SSH)
On each Leaf switch all directly connected IPv4 prefixes will get redistributed into BGP
BGP multipath rule will be “relaxed” to allow for different AS-PATHs. This is not used in our current topology but is required in an HA Leaf switch design (same IPv4 prefix will be advertised from two Leaf switches with different ASN)
Loop prevention on Leaf switches will also be “relaxed”. This, again, is not used in our single “cluster” topology, however it will allow same Leaf ASNs to be reused in a different cluster.

Implementation steps

Picking up where we left off after the OpenStack node provisioning described in the previous post

Get the latest OpenStack lab cookbooks

1
2
3

    git clone https://github.com/networkop/chef-unl-os.git
    cd chef-unl-os

Download and import Cumulus VX image similar to how it’s described here.
1 2
/opt/unetlab/addons/qemu/cumulus-vx/hda.qcow2
Build the topology inside UNL. Make sure that Node IDs inside UNL match the ones in chef-unl-os/environment/lab.rb file and that interfaces are connected as shown in the diagram below
Re-run UNL self-provisioning cookbook to create a zero touch provisioning file and update DHCP server configuration with static entries for the switches.
1 2
chef-client -z -E lab -o pxe
Cumulus ZTP allows you to run a predefined script on the first boot of the operating system. In our case we inject a UNL VM’s public key and enable passwordless sudo for cumulus user.
Kickoff Chef provisioning to bootstrap and configure the DC fabric.
1 2
chef-client -z -E lab fabric.rb
This command instructs Chef provisioning to connect to each switch, download and install the Chef client and run a simple recipe to create quagga configuration file from a template.

At the end of step 5 we should have a fully functional BGP-only fabric and all 3 compute nodes should be able to reach each other in at most 4 hops.

[root@controller-1 ~]# traceroute 10.0.0.4

traceroute to 10.0.0.4 (10.0.0.4), 30 hops max, 60 byte packets
10.0.0.1 (10.0.0.1)  0.609 ms  0.589 ms  0.836 ms
10.255.255.7 (10.255.255.7)  0.875 ms  2.957 ms  3.083 ms
10.255.255.6 (10.255.255.6)  3.473 ms  5.486 ms  3.147 ms
10.0.0.4 (10.0.0.4)  4.231 ms  4.159 ms  4.115 ms

Automating the Build of OpenStack Lab (Part 1)

2016-08-26T00:00:00+01:00

In this post we will explore what’s required to perform a zero-touch deployment of an OpenStack cloud. We’ll get a 3-node lab up and running inside UNetLab with just a few commands.

Now that I’m finally beginning to settle down at my new place of residence I can start spending more time on research and blogging. I have left off right before I was about to start exploring the native OpenStack distributed virtual routing function. However as I’d started rebuilding my OpenStack lab from scratch I realised that I was doing a lot of repetitive tasks which can be easily automated. Couple that with the fact that I needed to learn Chef for my new work and you’ve got this blogpost describing a few Chef cookbooks (similar to Ansible’s playbook) automating all those manual steps described in my earlier blogposts 1 and 2.
In addition to that in this post I’ll show how to build a very simple OpenStack baremetal provisioner and installer. Some examples of production-grade baremetal provisioners are Ironic, Crowbar and MAAS. In our case we’ll turn UNetLab VM into an undercloud, a server used to provision and deploy our OpenStack lab, an overcloud. To do that we’ll first install and configure DHCP, TFTP and Apache servers to PXE-boot our UNL OpenStack nodes. Once all the nodes are bootstrapped, we’ll use Chef to configure the server networking and kickoff the packstack OpenStack installer.

In this post I’ll try to use Chef recipes that I’ve written as much as possible, therefore you won’t see the actual configuration commands, e.g. how to configure Apache or DHCP servers. However I will try to describe everything that happens at each step and hopefully that will provide enough incentive for the curious to look into the Chef code and see how it’s done. To help with the Chef code understanding let me start with a brief overview of what to look for in a cookbook.

How to read a Chef cookbook (Optional)

A cookbook directory (/cookbooks/[cookbook_name]) contains all its configuration scripts in /recipes. Each file inside a recipe contains a list of steps to be performed on a server. Each step is an operation (add/delete/update) on a resource. Here are some of the common Chef resources:

Package - allows you to add, remove or update a package
Template - creates a file from an erb-formatted template
Execute - runs an ad-hoc CLI command

Just these three basic resources allow you to do 95% of administrative tasks on any server. Most importantly they do it in platform-independent (any flavour of Linux) and idempotent (only make changes if current state is different from a desired state) way. Other directories you might want to explore are:

/templates - contains all the erb-formatted templates
/attributes - contains recipe variables (file paths, urls etc.)
/files - contains the non-template files, i.e. files with static content

Bootstrapping the OpenStack nodes

If you haven’t done it yet, download a copy of the UNetLab VM from the official website. Set it up inside your hypervisor so that you can access Internet through the first interface pnet0 (i.e. connect the first NIC of the VM to hypervisor’s NAT interface). Make sure the VM has got at least 6GB of RAM and VT-x support enabled for nested virtualization.

Follow the official installation instructions to install Chef Development Kit inside UNetLab VM.

1
2
3

   wget https://packages.chef.io/stable/ubuntu/12.04/chefdk_0.16.28-1_amd64.deb
   dpkg -i chefdk_0.16.28-1_amd64.deb

Install git and clone chef cookbooks.

   apt-get -y update
   apt-get -y install git
   git clone https://github.com/networkop/chef-unl-os.git
   cd chef-unl-os

Examine the lab environment settings to see what values are going to be used. You can modify that file to your liking.

Note that the OpenStack node IDs (keys of os_lab hash) MUST have one to one correspondence with the UNL node IDs which will be created at step 5
1 2
cat environment/lab.rb
Run Chef against a local server to setup the baremetal provisioner. This step installs and configures DHCP, TFTP and Apache servers. It also creates all the necessary PXE-boot and kickstart files based on our environment settings. Note that a part of the process is the download of a 700MB CentOS image so it might take a while to complete.
1 2
chef-client -z -E lab -o pxe
At the start of the PXE-boot process, DCHP server sends an OFFER which, along with the standard IP information, includes the name of the PXE boot image and the IP address of TFTP server where to get it from. A server loads this image and then searches the TFTP server for the boot configuration file which tells it what kernel to load and where to get a kickstart file. Both kickstart and the actual installation files are accessed via HTTP and served by the same Apache server that runs UNL GUI.
From UNL GUI create a new lab, add 3 OpenStack nodes and connect them all to pnet10 interface as described in this guide. Note that the pnet10 interface has already been created by Chef so you don’t have to re-create it again.

Make sure that the UNL node IDs match the ones defined in the environment setting file
Fire-up the nodes and watch them being bootstrapped by our UNL VM.

Server provisioning

Next step is to configure the server networking and kickoff the OpenStack installer. These steps will also be done with a single command:

   chef-client -z -E lab lab.rb

The first part of this script will connect to each prospective OpenStack node and setup its network interfaces and hostnames. The second part of this script will generate a packstack answer file and modify its settings to exclude some of the components we’re not going to use (like Nagios, Ceph and Ceilometer). Have a look at cookbooks/packstack/recipe/default.rb for the list of modifications. The final step is a command to kickoff the packstack installer which will use another configuration management system, Puppet, to install and configure OpenStack according to the provided answer file.

At the end of these steps you should have a fully functional 3-node OpenStack environment.

To be continued…

This is a part of a 2-post series. In the next post we’ll look into how to use the same tools to perform the baremetal provisioning of our physical underlay network.

OpenStack SDN - Interconnecting VMs and Physical Devices With Cumulus VX L2 Gateway

2016-05-21T00:00:00+01:00

One of the basic function of any data centre network is the ability to communicate with baremetal servers. In this post we’ll see how Neutron L2 Gateway plugin can be used to configure a Cumulus VX switch for VXLAN-VLAN bridging.

Since I have all my OpenStack environment running inside UNetLab, it makes it really easy for me to extend my L3 fabric with a switch from another vendor. In my previous posts I’ve used Cisco and Arista switches to build a 4-leaf 2-spine CLOS fabric. For this task I’ve decided to use a Cumulus VX switch which I’ve downloaded and imported into my lab.

To simulate the baremetal server (10.0.0.100) I’ve VRF’d an interface on Arista “L4” switch and connected it directly to a “swp3” interface of the Cumulus VX. This is not shown on the diagram.

Solution overview

L2 Gateway is a relatively new service plugin for OpenStack Neutron. It provides the ability to interconnect a given tenant network with a VLAN on a physical switch. There are three main components that compose this solution:

Hardware switch implementing the OVSDB hardware vtep schema. This is a special “flavour” of OVSDB designed specifically to enable connectivity between logical (VXLAN VTEP) and physical (switchport) interfaces.
L2GW agent running on a network node. This is the process responsible for connecting to OVSDB server running on a hardware switch and updating that database based on instructions received from a L2GW service plugin.
L2GW Service Plugin residing on a control node. The task of this plugin is to notify the L2GW agent and normal L2 OVS agents running on compute hosts about network events and distribute VTEP IP address information between them.

Note that in our case both network and control nodes are running on the same VM.

Cumulux VX configuration

Cumulux is a debian-based linux distribution, therefore most of the basic networking configuration will be similar to how things are done in Ubuntu. First, let’s start by configuring basic IP addressing on Loopback (VTEP IP), Eth0 (OOB management), swp1 and swp2 (fabric) interfaces.

/etc/network/interfaces

iface lo inet loopback
        address 10.0.0.5/32

auto eth0
iface eth0 inet static
        address 192.168.91.21/24

auto swp1
iface swp1 inet static
        address 169.254.51.5/24

auto swp2
iface swp2 inet static
        address 169.254.52.5/24

auto swp3
iface swp3

Next, let’s enable OSPF

Enable OSPF process

sudo sed -i s/zebra=no/zebra=yes/ /etc/quagga/daemons
sudo sed -i s/ospfd=no/ospfd=yes/ /etc/quagga/daemons
sudo service quagga restart

Once OSPFd is running, we can use sudo vtysh to connect to local quagga shell and finalise the configuration.

show run

interface lo
 ip ospf area 0.0.0.0
 link-detect
!
interface swp1
 ip ospf area 0.0.0.0
 ip ospf network point-to-point
 link-detect
!
interface swp2
 ip ospf area 0.0.0.0
 ip ospf network point-to-point
 link-detect
!
router ospf
 ospf router-id 10.0.0.5
 passive-interface default
 no passive-interface swp1
 no passive-interface swp2

At this stage our Cumulus VX switch should be fully adjacent to both spines and its loopback IP (10.0.0.5) should be reachable from all OpenStack nodes.

The final step is to enable the hardware VTEP functionality. The process is fairly simple and involves only a few commands.

$ sudo sed -i s/START=no/START=yes/g /etc/default/openvswitch-vtep
$ sudo service openvswitch-vtep start
$ sudo vtep-bootstrap L5 10.0.0.5 192.168.91.21 --no_encryption

The last command runs a bootstrap script that does the following things:

Creates a hardware VTEP OVSDB schema
Inside that schema creates a new physical switch called “L5”
Sets the VTEP IP to 10.0.0.5
Starts listening to incoming OVSDB connections on 192.168.91.21

Hardware VTEP vs OpenvSwitch OVSDB schemas (Optional)

By now you’re probably wondering what’s that hardware VTEP OVSDB schema and how it’s different from a normal OVS schema. First of all, remember that OVSDB is just a database and OVSDB protocol is just a set of JSON RPC calls to work with that database. Information that can be stored in the database is defined by a schema - a structure that represents tables and their relations. Therefore, OVSDB can be used to store and manage ANY type of data which makes it very flexible. Specificallly OVS project defines two OVSDB schemas:

Open_vSwitch schema - used to manage bridges, ports and controllers of OpenvSwitch. This schema is used by OVS inside every compute host we have in our OpenStack environment.
Hardware_vtep schema - designed to be used by physical switches. The goal of this schema is to extend the virtual L2 switch into a physical realm by providing the ability to map physical ports to logical networks. For each logical network the hardware VTEP database holds mappings of MAC addresses to VTEPs and physical switchport.

The information from these databases is later consumed by another process that sets up the actual bridges and ports. The first schema is used by the ovs-vswitchd process running on all compute hosts to configure ports and flows of integration and tunnel bridges. In case of a Cumulus switch, the information from hardware_vtep OVSDB is used by a process called ovs-vtepd that is responsible for settings up VXLAN VTEP interfaces, provisioning of VLANs on physical switchports and interconnecting them with a Linux bridge.

If you want to learn more, check out this awesome post about hardware VTEP and OVS.

OpenStack Control node configuration

Most of the following procedure has been borrowed from another blog. It’s included it this post because I had to do some modifications and also for the sake of completeness.

Clone the L2GW repository

 git clone -b stable/mitaka https://github.com/openstack/networking-l2gw.git

Use pip to install the plugin
```
 pip install ./networking-l2gw/
```

Enable the L2GW service plugin

 sudo sed -ri 's/^(service_plugins.*)/\1,networking_l2gw.services.l2gateway.plugin.L2GatewayPlugin/' \
 /etc/neutron/neutron.conf

Copy L2GW configuration files into the neutron configuration directory
```
 cp  /usr/etc/neutron/l2g* /etc/neutron/
```

Point the L2GW plugin to our Cumulus VX switch

 sudo sed -ri "s/^#\s+(ovsdb_hosts).*/\1 = 'ovsdb1:192.168.91.21:6632'/" /etc/neutron/l2gateway_agent.ini

Update Neutron database with the new schema required by L2GW plugin

 systemctl stop neutron-server
 neutron-db-manage --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/l2gw_plugin.ini  upgrade head
 systemctl start neutron-server

Update Neutron startup script to load the L2GW plugin configuration file

 sed -ri "s/(ExecStart=.*)/\1 --config-file \/etc\/neutron\/l2gw_plugin.ini /" /usr/lib/systemd/system/neutron-server.service

Create a L2GW systemd unit file

 cat >> /usr/lib/systemd/system/neutron-l2gateway-agent.service << EOF
 [Unit]
 Description=OpenStack Neutron L2 Gateway Agent
 After=neutron-server.service

 [Service]
 Type=simple
 User=neutron
 ExecStart=/usr/bin/neutron-l2gateway-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/l2gateway_agent.ini
 KillMode=process

 [Install]
 WantedBy=multi-user.target
 EOF

Restart both L2GW and neutron server

 systemctl daemon-reload
 systemctl restart neutron-server.service
 systemctl start neutron-l2gateway-agent.service

Enter the “neutron configuration mode”
```
source ~/keystone_admin
neutron
```

Create a new L2 gateway device

l2-gateway-create --device name="L5",interface_names="swp3" CUMULUS-L2GW

Create a connection between a “private_network” and a native vlan (dot1q 0) of swp3 interface
```
l2-gateway-connection-create --default-segmentation-id 0 CUMULUS-L2GW private_network
```

Verification and Traffic Flows

At this stage everything should be ready for testing. We’ll start by examining the following traffic flow:

From VM-2 10.0.0.4/fa:16:3e:d7:0e:14
To baremetal server 10.0.0.100/50:00:00:6b:2e:70

The communication starts with VM-2 sending an ARP request for the MAC address of the baremetal server. Packet flow inside the compute host will be exactly the same as before, with packet being flooded from the VM to the integration and tunnel bridges. Inside the tunnel bridge the packet gets resubmitted to table 22 where head-end replication of ARP request takes place.

The only exception is that this time the frame will get replicated to a new VXLAN port pointing towards the Cumulux VTEP IP. We’ll use the ovs-appctl ofproto/trace command to see the full path a packet takes inside OVS, which is similar to packet-tracer command of Cisco ASA. To simulate an ARP packet we need to specify the incoming port(in_port), EtherType(arp), internal VLAN number for our tenant(dl_vlan) and an ARP request target IP address(arp_tpa). You can find the full list of fields that can be matched in this document.

ARP request to the baremetal server

$ ovs-appctl ofproto/trace br-tun in_port=1,arp,dl_vlan=1,arp_tpa=10.0.0.100 | grep -E "Rule|actions="
Rule: table=0 cookie=0xb3c018296c2aa8a3 priority=1,in_port=1
OpenFlow actions=resubmit(,2)
        Rule: table=2 cookie=0xb3c018296c2aa8a3 priority=0,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00
        OpenFlow actions=resubmit(,20)
                Rule: table=20 cookie=0xb3c018296c2aa8a3 priority=0
                OpenFlow actions=resubmit(,22)
                        Rule: table=22 cookie=0xb3c018296c2aa8a3 dl_vlan=1
                        OpenFlow actions=strip_vlan,set_tunnel:0x45,output:9,output:4,output:6

The packet leaving port 9 will get encapsulated into a VXLAN header with destination IP of 10.0.0.5 and forwarded out the fabric-facing interface eth1.100. When VXLAN packet reaches the vxln69 interface (10.0.0.5) of the Cumulus switch, the br-vxlan69 Linux bridge floods the frame out the second connected interface - swp3.

brctl show br-vxln69

bridge name        bridge id          STP enabled     interfaces
br-vxln69          8000.500000070003  no              swp3
                                                      vxln69

The rest of the story is very simple. When ARP packet hits the baremetal server it populates its ARP cache. A unicast response travels all the way back to the Cumulus switch, gets matched by the static MAC (0e:14) entry created based on information provided by the L2GW plugin. This entry points to the VTEP IP of Compute host 2(10.0.2.10) which is where it gets forwarded next.

bridge fdb show

50:00:00:09:00:04 dev swp3 vlan 0 master br-vxln69
50:00:00:07:00:03 dev swp3 vlan 0 master br-vxln69 permanent
50:00:00:6b:2e:70 dev swp3 vlan 0 master br-vxln69
26:21:90:a8:8a:cc dev vxln69 vlan 0 master br-vxln69 permanent
fa:16:3e:57:1c:6c dev vxln69 dst 10.0.3.10 vlan 65535 self permanent
fa:16:3e:a4:12:e6 dev vxln69 dst 10.0.3.10 vlan 65535 self permanent
fa:16:3e:d7:0e:14 dev vxln69 dst 10.0.2.10 vlan 65535 self permanent
fa:16:3e:3c:51:d7 dev vxln69 dst 10.0.1.10 vlan 65535 self permanent

The packet travels through compute host 2, populating the flow entries of all OVS bridges along the way. These entries are then used by subsequent unicast packets travelling from VM-2.

Unicast packet to the baremetal server

$ ovs-appctl ofproto/trace br-tun in_port=1,dl_vlan=1,dl_dst=50:00:00:6b:2e:70 | grep -E "Rule|actions="
Rule: table=0 cookie=0xb5625033061a8ae5 priority=1,in_port=1
OpenFlow actions=resubmit(,2)
        Rule: table=2 cookie=0xb5625033061a8ae5 priority=0,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00
        OpenFlow actions=resubmit(,20)
                Rule: table=20 cookie=0xb5625033061a8ae5 priority=1,vlan_tci=0x0001/0x0fff,dl_dst=50:00:00:6b:2e:70
                OpenFlow actions=load:0->NXM_OF_VLAN_TCI[],load:0x45->NXM_NX_TUN_ID[],output:9

It all looks fine until the ARP cache of the baremetal server expires and you get an ARP request coming from the physical into the virtual world. There is a known issue with BUM forwarding which requires a special service node to perform the head-end replication. The idea is that a switch that needs to flood a multicast packet, would send it to a service node which keeps track of all active VTEPs in the network and performs packet replication on behalf of the sender. OpenStack doesn’t have a dedicated service node, however it is possible to trick the network node into performing a similar functionality, which is what I’m going to demonstrate next.

Programming Network Node as BUM replication service node

First of all, we need to tell our Cumulus switch to send all multicast packets to the network node. To do that we need to modify OVSDB table called “Mcast_Macs_Remote”. You can view the contents of the database using the ovsdb-client dump --pretty tcp:192.168.91.21:6632 command to make sure that this table is empty. Using the VTEP control command we need to force all unknown-dst (BUM) traffic to go to the network node(10.0.3.10). The UUID of the logical switch can be found with sudo vtep-ctl list-ls command.

Forward all BUM traffic to OpenStack network node

sudo vtep-ctl add-mcast-remote 818b4779-645c-49bb-ae4a-aa9340604019 unknown-dst 10.0.3.10

At this stage all BUM traffic hits the network node and gets flooded to the DHCP and the virtual router namespaces. In order to force this traffic to also be replicated to all compute nodes we can use some of the existing tables of the tunnel bridge. Before we do anything let’s have a look at the tables our ARP request has to go through inside the tunnel bridge.

Packet from Cumulus VTEP inside br-tun

table=0, priority=1,in_port=2 actions=resubmit(,4)
table=4, priority=1,tun_id=0x45 actions=mod_vlan_vid:1,resubmit(,10)
table=10,priority=1 actions=learn(table=20,hard_timeout=300,priority=1,cookie=0x9f3e746b7ee48bbf,NXM_OF_VLAN_TCI[0..11],NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:0->NXM_OF_VLAN_TCI[],load:NXM_NX_TUN_ID[]->NXM_NX_TUN_ID[],output:NXM_OF_IN_PORT[]),output:1

We also have a default head-end replication table 22 which floods all BUM traffic received from the integration bridge to all VTEPs:

BUM replication table

table=22, dl_vlan=1 actions=strip_vlan,set_tunnel:0x45,output:2,output:4,output:6

So what we can do is create a new flow entry that would intercept all ARP packets inside Table 4 and resubmit them to tables 10 and 22. Table 10 will take our packet up to the integration bridge of the network node, since we still need to be able to talk the virtual router and the DHCP. Table 22 will receive a copy of the packet and flood it to all known VXLAN endpoints.

Manipulating OVS flows

ovs-ofctl add-flow br-tun "table=4,arp,tun_id=0x45,priority=2,actions=mod_vlan_vid:1,resubmit(,10),resubmit(,22)"

We can once again use the trace command to see the ARP request flow inside the tunnel bridge.

ARP request received from the Cumulus switch

$ ovs-appctl ofproto/trace br-tun in_port=2,arp,tun_id=0x45 | grep -E "Rule|actions="
Rule: table=0 cookie=0x9f3e746b7ee48bbf priority=1,in_port=2
OpenFlow actions=resubmit(,4)
        Rule: table=4 cookie=0 priority=2,arp,tun_id=0x45
        OpenFlow actions=mod_vlan_vid:1,resubmit(,10),resubmit(,22)
                Rule: table=10 cookie=0x9f3e746b7ee48bbf priority=1
                OpenFlow actions=learn(table=20,hard_timeout=300,priority=1,cookie=0x9f3e746b7ee48bbf,NXM_OF_VLAN_TCI[0..11],NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:0->NXM_OF_VLAN_TCI[],load:NXM_NX_TUN_ID[]->NXM_NX_TUN_ID[],output:NXM_OF_IN_PORT[]),output:1
                        Rule: table=0 cookie=0x91b1a9a9b6e8d608 priority=0
                        OpenFlow actions=NORMAL
                                Rule: table=0 cookie=0xb36f6e358a37bea6 priority=2,in_port=2
                                OpenFlow actions=drop
                Rule: table=22 cookie=0x9f3e746b7ee48bbf dl_vlan=1
                OpenFlow actions=strip_vlan,set_tunnel:0x45,output:2,output:4,output:6

Now we should be able to clear the ARP cache on baremetal device and successfully ping both VM-2, VM-1 and the virtual router.

Conclusion

The workaround presented above is just a temporary solution for the problem. In order to fix the problem properly, OVS vtep schema needs to be updated to support source node replication. Luckily, the patch implementing this functionality has been merged into master OVS branch only a few days ago. So hopefully, this update trickles down to Cumulus package repositories soon.

Despite all the issues, Neutron L2 gateway plugin is a cool project that provides a very important piece of functionality without having to rely on 3rd party SDN controllers. Let’s hope it will continue to be supported and developed by the community.

Coming up

In the next post I was planning to examine another “must have” feature of any SDN solution - Distributed Virtual Routing. However due to my current circumstances I may need to take a few weeks' break before going on. Be back soon!

Openstack SDN - Extending a L2 Provider Network Over a L3 Fabric

2016-05-11T00:00:00+01:00

In the this post we’ll tackle yet another Neutron scalability problem identified in my earlier post - a requirement to have a direct L2 adjacency between the external provider network and the network node.

Provider vs Tenant networks

Before we start, let’s recap the difference between the two major Neutron network types:

Tenant networks are:
- provisioned by tenants
- used for inter-VM (east-west) communication
- use Neutron virtual router as their default gateway
Provider networks are:
- provisioned by OpenStack administrator(for use by tenants)
- match existing physical networks
- can be either flat (untagged VLAN) or VLAN-based (multiple VLANs)
- need to be L2 adjacent to network and/or compute nodes

These two network types are not mutually exclusive. In our case the admin tenant network is implemented as a VXLAN-based overlay whose only requirement is to have a layer-3 reachability in the underlay. However tenant network could also have been implemented using a VLAN-based provider network in which case a set of dot1Q tags pre-provisioned in the underlay would have been used for tenant network segregation.

External provider network

External network is used by VMs to communicate with the outside world (north-south). Since default gateway is located outside of OpenStack environment this, by definition, is a provider network. Normally, tenant networks will use the non-routable address space and will rely on a Neutron virtual router to perform some form of NAT translation. As we’ve seen in the earlier post, Neutron virtual router is directly connected to the external bridge which allows it to “borrow” ip address from the external provider network to use for two types of NAT operations:

SNAT - a source-based port address translation performed by the Neutron virtual router
DNAT - a static NAT created for every floating ip address configured for a VM

In default deployments all NATing functionality is performed by a network node, so external provider network only needs to be L2 adjacent with a limited number of physical hosts. In deployments where DVR is used, the virtual router and NAT functionality gets distributed among all compute hosts which means that they, too, now need to be layer-2 adjacent to the external network.

Solutions overview

The direct adjacency requirement presents a big problem for deployments where layer-3 routed underlay is used for the tenant networks. There is a limited number of ways to satisfy this requirements, for example:

Span a L2 segment across the whole DC fabric. This means that the fabric needs to be converted to layer-2, reintroducing spanning-tree and all the unique vendor solutions to overcome STP limitations(e.g. TRILL, Fabripath, SPB).
Build a dedicated physical network. This may not always be feasible, especially considering that it needs to be delivered to all compute hosts.
Extend the provider network over an existing L3 fabric with VXLAN overlay. This can easily be implemented with just a few commands, however it requires a border leaf switch capable of performing VXLAN-VLAN translation.

Detailed design

As I’ve said in my earlier post, I’ve built the leaf-spine fabric out of Cisco IOU virtual switches, however the plan was to start introducing other vendors later in the series. So this time for the border leaf role I’ve chosen Arista vEOS switch, however, technically, it could have been any other vendor capable of doing VXLAN-VLAN bridging (e.g. any hardware switch with Trident 2 or similar ASIC).

Arista vEOS configuration

Configuration of Arista switches is very similar to Cisco IOS. In fact, I was able to complete all interface and OSPF routing configuration only with the help of CLI context help. The only bit that was new to me and that I had to lookup in the official guide was the VXLAN configuration. These similarities makes the transition from Cisco to Arista very easy and I can understand (but not approve!) why Cisco would file a lawsuit against Arista for copying its “industry-standard CLI”.

L4 configuration

interface Ethernet1
   description SPINE-1:Eth0/3
   no switchport
   ip address 169.254.41.4/24
   ip ospf network point-to-point
!
interface Ethernet2
   description SPINE-2:Eth0/3
   no switchport
   ip address 169.254.42.4/24
   ip ospf network point-to-point
!
interface Ethernet3
   description VM-HOST-ONLY:PNET1
   switchport access vlan 100
   spanning-tree portfast
!
interface Loopback0
   ip address 10.0.0.4/32
!
interface Vxlan1
   vxlan source-interface Loopback0
   vxlan udp-port 4789
   vxlan vlan 100 vni 1000
   vxlan vlan 100 flood vtep 10.0.3.10
!
router ospf 1
   router-id 10.0.0.4
   passive-interface default
   no passive-interface Ethernet1
   no passive-interface Ethernet2
   network 0.0.0.0/0 area 0.0.0.0
!

Interface VXLAN1 sets up VXLAN-VLAN bridging between VNI 1000 and VLAN 100. VLAN 100 is used to connect to VMware Workstation’s host-only interface, the one that was previously connected directly to the L3 leaf switch. VXLAN interface does the multicast source replication by flooding unknown packets over the layer 3 fabric to the network node (10.0.3.10).

OpenStack network node configuration

Since we don’t yet have the distributed routing feature enabled, the only OpenStack component that requires any changes is the network node. First, let’s remove the physical interface from the external bridge, since it will no longer be used to connect to the external provider network.

Remove the physical interface from the external bridge

$ ovs-vsctl del-port br-ex eth1.300

Next let’s add the VXLAN interface towards the Loopback IP address of the Arista border leaf switch. The key option sets the VNI which must be equal to the VNI defined on the border leaf.

add the VXLAN interface towards the Arista switch

$ ovs-vsctl add-port br-ex vxlan1 \
-- set interface vxlan1 \
type=vxlan \
options:remote_ip=10.0.0.4 \
options:key=1000

Without any physical interfaces attached to the external bridge, the OVS will use the Linux network stack to find the outgoing interface. When a packet hits the vxlan1 interface of the br-ex, it will get encapsulated in a VXLAN header and passed on to the OS network stack where it will follow the pre-configured static route forwarding all 10/8 traffic towards the leaf-spine fabric. Check out this article if you want to learn more about different types of interfaces and traffic forwarding behaviours in OpenvSwitch.

Cleanup

In order to make changes persistent and prevent the static interface configuration from interfering with OVS, remove all OVS-related configuration and shutdown interface eth1.300.

/etc/sysconfig/network-scripts/ifcfg-eth1.300

ONBOOT=no
VLAN=yes

Change in the packet flow

None of the packet flows have changed as the result of this modification. All VMs will still use NAT to break out of the private environment, the NAT’d packets will reach the external bridge br-ex as described in my earlier post. However this time br-ex will forward the packets out the vxlan1 port which will deliver them to the Arista switch over the same L3 fabric used for east-west communication.

If we did a capture on the fabric-facing interface eth1 of the control node while running a ping from one of the VMs to the external IP address, we would see a VXLAN-encapsulated packet destined for the Loopback IP of L4 leaf switch.

Coming Up

In the next post we’ll examine the L2 gateway feature that allows tenant networks to communicate with physical servers through yet another VXLAN-VLAN hardware gateway.

Openstack SDN - L2 Population and ARP Proxy

2016-05-06T00:00:00+01:00

In the previous post we’ve had a look at how native OpenStack SDN works and what are some of its limitations. In this post we’ll tackle the first one of them - overhead created by multicast source replication.

MAC learning in a controller-less VXLAN overlay

VXLAN standard does not specify any control plane protocol to exchange MAC-IP bindings between VTEPs. Instead it relies on data plane flood-and-learn behaviour, just like a normal switch. To force this behaviour in an underlay, the standard stipulates that each VXLAN network should be mapped to its own multicast address and each VTEP participating in a network should join the corresponding multicast group. That multicast group would be used to flood the BUM traffic in an underlay to all subscribed VTEPs thereby populating dynamic MAC address tables.

Default OpenvSwitch implementation does not support VXLAN multicast flooding and uses unicast source replication instead. This decision comes with a number of tradeoffs:

Duplicate packets consume additional bandwidth. Extra 100 bytes exchanged every 3 minutes in a 100-nodes environment generate around 500 kbit/s of traffic on average. This can be considered negligible inside modern high-speed DC fabrics.
Hardware VTEP gateways rely on multicast for MAC learning and VTEP discovery. As we’ll see later in the series, these gateways can now be controlled by Neutron just like a normal OVS inside a compute host.
Duplicate packets are processed by hosts that do not need them, e.g. ARP request is processed by tunnel and integration bridges of all hosts that have VMs in the same broadcast domain. This presents some serious scaling limitation and is addressed by the L2 population feature described in this post.

Despite all the tradeoffs, OVS with unicast source replication has become a de-facto standard in most recent OpenStack implementations. The biggest advantage of such approach is the lack of requirement for multicast in the underlay network.

VXLAN MAC learning with an SDN controller

Neutron server is aware of all active MAC and IP addresses within the environment. This information can be used to prepopulate forwarding entries on all tunnel bridges. This is accomplished by a L2 population driver. However that in itself isn’t enough. Whenever a VM doesn’t know the destination MAC address, it will send a broadcast ARP request which needs to be intercepted and responded by a local host to stop it from being flooded in the network. The latter is accomplished by a feature called ARP responder which simulates the functionality commonly known as ARP proxy inside the tunnel bridge.

Configuration

Configuration of these two features is fairly straight-forward. First, we need to add L2 population to the list of supported mechanism drivers on our control node and restart the neutron server.

Update on control node

$ sed -ri 's/(mechanism_drivers.*)/\1,l2population/' /etc/neutron/plugin.ini
$ service neutron-server restart

Next we need to enable L2 population and ARP responder features on all 3 compute nodes.

Updates on all 3 compute nodes

$ sed -ri 's/.*(arp_responder).*/\1 = true/' /etc/neutron/plugins/ml2/openvswitch_agent.ini
$ sed -ri 's/.*(l2_population).*/\1 = true/' /etc/neutron/plugins/ml2/openvswitch_agent.ini
$ service neutron-openvswitch-agent restart

Since L2 population is triggered by the port_up messages, we might need to restart both our VMs for the change to take effect.

BUM frame from VM-1 for MAC address of VM-2 (Revisited)

Now let’s once again examine what happens when VM-1 issues an ARP request for VM-2’s MAC address (1a:bf).

First, the frame hits the flood-and-learn rule of the integration bridge and gets flooded down to the tunnel bridge as desribed in the previous post. Once in the br-tun, the frames gets matched by the incoming port and resubmitted to table 2. In addition to a default unicast/multicast bit match, table 2 now also matches all ARP requests and resubmitts them to the new table 21. Note how the ARP entry has a higher priority to always match before the default catch-all multicast rule.

ovs-ofctl dump-flows br-tun

table=0, priority=1,in_port=1 actions=resubmit(,2)
table=2, priority=1,arp,dl_dst=ff:ff:ff:ff:ff:ff actions=resubmit(,21)
table=2, priority=0,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00 actions=resubmit(,20)
table=2, priority=0,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00 actions=resubmit(,22)

Inside table 21 are the entries created by the ARP responder feature. The following is an example entry that matches all ARP requests where target IP address field equals the IP of VM-2(10.0.0.9).

ovs-ofctl dump-flows br-tun

 table=21, priority=1,arp,dl_vlan=1,arp_tpa=10.0.0.9
 actions=move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],
 mod_dl_src:fa:16:3e:ab:1a:bf,
 load:0x2->NXM_OF_ARP_OP[],
 move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],
 move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],
 load:0xfa163eab1abf->NXM_NX_ARP_SHA[],
 load:0xa000009->NXM_OF_ARP_SPA[],
 IN_PORT

The resulting action builds an ARP response by modifying the fields and headers on the original ARP request message, specifically OVS:

Copies the source MAC address (VM-1) to the destination MAC address header
Spoofs the source MAC address to make it look like it comes from VM-2
Modifies the operation code of ARP message to 0x2, meaning reply
Overwrites the target IP and MAC address fields inside the ARP packet with VM-1’s values
Overwrites the source hardware address with VM-2’s MAC
Overwrites the source IP address with the address of VM-2(0xa000009)
Sends the packet out the port from which it was received

Unicast frame from VM-1 to VM-2 (Revisited)

Now that VM-1 has learned the MAC address of VM-2 it can start sending the unicast frames. The first few steps will again be the same. The frame hits the tunnel bridge, gets classified as a unicast and resubmitted to table 20. Table 20 will still have an entry generated by a learn action triggered by a packet coming from VM-2, however now it also has and identical entry with a higher priority(priority=2), which was preconfigured by a L2 population feature.

table=0, priority=1,in_port=1 actions=resubmit(,2)
table=2, priority=0,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00 actions=resubmit(,20)
table=20, priority=2,dl_vlan=1,dl_dst=fa:16:3e:ab:1a:bf actions=strip_vlan,set_tunnel:0x54,output:2
table=20, priority=1,vlan_tci=0x0001/0x0fff,dl_dst=fa:16:3e:ab:1a:bf actions=load:0->NXM_OF_VLAN_TCI[],load:0x54->NXM_NX_TUN_ID[],output:2

Other BUM traffic

The two features described in this post only affect the ARP traffic to VMs known to the Neutron server. All the other BUM traffic will still be flooded as described in the previous post.

Results

As the result of enabling L2 population and ARP responder features we were able to reduce the amount of BUM traffic in the overlay network and reduce the eliminate processing on compute hosts incurred by ARP request flooding.

However one downside of this approach is the increased number of flow entries in tunnel bridges of compute hosts. Specifically, for each known VM there now will be two entries in the tunnel bridge with different priorities. This may have negative impact on performance and is something to keep in mind when designing OpenStack solutions for scale.

Coming Up

In the next post I’ll show how to overcome the requirement of a direct L2 adjacency between the network node and external subnet. Specifically, I’ll use Arista switch to extend a L2 provider network over a L3 leaf-spine Cisco fabric.

Network-oriented programming

Openstack SDN - OpenContrail With BGP VPN

BGP-as-a-Service

Lab setup overview

All-in-one VM setup

OpenStack setup

EVPN integration with MX80

Configuration

Verification

BGP-as-a-Service

Configuration

Verification

Outro

OpenStack SDN - OpenDaylight With BGP VPN

Optimal forwarding of North-South traffic

OpenDaylight integration with OpenStack

OpenDaylight BGP VPN service architecture

Demo

Installation

Configuration

1. Source admin credentials and setup ODL’s REST variables

2. Configure local BGP settings and BGP peering with DC-GW

3. Define L3VPN instance and associate it with OpenStack admin tenant

4. Inject prefixes into L3VPN by associating the previously created L3VPN with a demo-router

5. Configure DC-GW VTEP IP

6. DC-GW configuration

Verification

Conclusion

Openstack SDN - NFV Management and Orchestration

NFV MANO for Telco SDN

ETSI MANO

Relationship between NS, VNF and VNFFG

Using Tacker to orchestrate a Network Service

Step 1 - VIM registration

Step 2 - Onboarding a VNFD

Step 4 - Onboarding a NSD

Step 5 - Instantiating a NSD

Step 6 - Onboarding and Instantiating a VNFFG

Other Tacker features

Conclusion

Openstack SDN - Skydiving Into Service Function Chaining

SFC High-level overview

Insertion modes and implementation models

Configuring Neutron SFC

Using Skydive to troubleshoot SFC

SFC implementation in OVS forwarding pipeline

Upcoming enhancements

Coming Up

Openstack SDN - Building a Containerized OpenStack Lab

Evolution of automated OpenStack installers

Lab overview

Building OpenStack docker containers with Kolla

Deploying OpenStack with Kolla-Ansible

Overview of containerized Openstack

Coming up

Linux SSH Session Management for Network Engineers

1. Install ssh-copy-net

2. Define SSH config for network devices

3. Define zsh aliases

4. Multi-pane sessions with tmux

Demo

Using YANG Models in Ansible to Configure and Verify State of IOS-XE and JUNOS Devices

Environment setup

Ansible playbook configuration

Configuration

Verification

Outro

Configuring Cisco IOS XE With YANG-based YAML Files

Recursion

Introspection

Metaprogramming

YANG mapping to YAML

YANG instantiating function

The YDK module wrapper

Configuration examples

Configuring Cisco IOS XE With YDK and OpenDaylight

OpenDaylight primer

Environment setup

Connecting IOS XE to ODL

YDK primer

3. Define L3VPN instance and associate it with OpenStack `admin` tenant

4. Inject prefixes into L3VPN by associating the previously created L3VPN with a `demo-router`