Continuing the trend started in my previous post about OpenDaylight, I'll move on to the next open-source product that uses BGP VPNs for optimal North-South traffic forwarding. OpenContrail is one of the most popular SDN solutions for OpenStack. It was one of the first hybrid SDN solutions, offering both pure overlay and overlay/underlay integration. It is the default SDN platform of choice for Mirantis Cloud Platform, and it has multiple large-scale deployments in companies like Workday and AT&T. I personally don't have any production experience with OpenContrail; however, my impression, based on what I've heard and seen in the 2-3 years I've been following the Telco SDN space, is that OpenContrail is the most mature SDN platform for Telco NFVs, not least because of its unique feature set.
Over the course of the production deployment at AT&T, Contrail has added a lot of features required by Telco NFVs, like QoS, VLAN trunking and BGP-as-a-service. My first acquaintance with BGPaaS took place when I started working on Telco DCs, and I remember being genuinely shocked when I first saw the requirement for dynamic routing exchange with VNFs. To me this seemed to break one of the main rules of cloud networking - a VM is not supposed to have any knowledge of, or interaction with, the underlay. I gradually went through all the stages of grief, all the way to acceptance, and although it still feels "wrong" now, I can at least understand why it's needed and what the pros and cons of the different BGPaaS solutions are.
There's a certain range of VNFs that may need to advertise a set of IP addresses into existing VPNs inside the Telco network. The most notable example is a PGW inside the EPC. I won't pretend to be an expert in this field, but based on my limited understanding a PGW needs to advertise IP networks into various customer VPNs, for example to connect private APNs to existing customer L3VPNs. Obviously, when this kind of network function gets virtualised, it still retains this requirement, which now needs to be fulfilled by the DC SDN.
This requirement catches a lot of big SDN vendors off guard, and the best they come up with is connecting those VNFs, through VLANs, directly to the underlay TOR switches. Although this solution is easy to implement, it has an incredible number of drawbacks, since a single VNF can now affect the stability of the whole POD or even the whole DC network. Some VNF vendors also require BFD to monitor the liveness of those BGP sessions, which, in case the L3 boundary is higher than the TOR, may create an even bigger number of issues on the POD spine.
There's a small range of SDN platforms that run a full routing stack on each compute node (e.g. Cumulus, Calico). These solutions are the best fit for this kind of scenario since they allow BGP sessions to be established over a single hop (VNF <-> virtual switch). However, they represent a small fraction of the total SDN solution space, with the majority of vendors implementing a much simpler OpenFlow or XMPP-based flow push model.
OpenContrail, as far as I know, is the only SDN controller that doesn’t run a full routing stack on compute nodes but still fulfills this requirement in a very elegant way. When BGPaaS is enabled for a particular VM’s interface, controller programs vRouter to proxy BGP TCP connections coming to virtual network’s default gateway IP and forward them to the controller. This way VNF thinks it peers with a next hop IP, however all BGP state and path computations still happen on the controller.
The diagram below depicts a sample implementation of BGPaaS using OpenContrail. The VNF is connected to a vRouter using a dot1Q trunk interface (to allow multiple VRFs over a single vEth link). Each VRF has its own BGPaaS session set up to advertise network ranges (NET1-3) into customer VPNs. These BGP sessions get proxied to the controller, which injects those prefixes into their respective VPNs. The updates are then sent to the DC gateways using either VPNv4/6 or EVPN, and the traffic is forwarded through the DC underlay with VPN segregation preserved by either an MPLS label (for MPLSoGRE or MPLSoUDP encapsulation) or a VXLAN VNI.
Now let me briefly go over the lab that I’ve built to showcase the BGPaaS and DC-GW integration features.
OpenContrail follows a familiar pattern of DC SDN architecture, with a central controller orchestrating the work of multiple virtual switches. In the case of OpenContrail, these switches are called vRouters and they communicate with the controller using an XMPP-based extension of BGP as described in this RFC draft. A very detailed description of its internal architecture is available on OpenContrail's website, so it would be pointless to repeat all of that information here. That's why I'll concentrate on how to get things done rather than on the architectural aspects. However, to get things started, I always like to have a clear picture of what I'm trying to achieve. The diagram below depicts a high-level architecture of my lab setup. Although OpenContrail supports BGP VPNv4/6 with multiple dataplane encapsulations, in this post I'll use EVPN as the only control plane protocol to communicate with the MX80 and use VXLAN encapsulation in the dataplane.
EVPN as a DC-GW integration protocol is relatively new to OpenContrail and comes with a few limitations. One of them is the absence of EVPN type-5 routes, which means I can't use it in the same way I did in OpenDaylight's case. Instead I'll demonstrate a DC-GW IRB scenario, which extends the existing virtual network to a DC-GW and makes an IRB/SVI interface on that DC-GW act as the default gateway for this network. This is a very common scenario for L2 DCI and active-active DC deployment models. To demonstrate it, I'm going to set up a single OpenStack virtual network with a couple of VMs whose gateway will reside on the MX80. Since I only have a single OpenStack instance and a single MX80, I'll set up one half of an L2 DCI and configure mutual redistribution to make our overlay network reachable from the MX80's global routing table.
Physically, my lab will consist of a single hypervisor running an all-in-one VM with kolla-openstack and kolla-contrail and a physical Juniper MX80 playing the role of a DC-GW.
OpenContrail's kolla github page contains a set of instructions to set up the environment. As usual, I have automated all of these steps, which can be run from the hypervisor with the following commands:
Once the installation is complete and all docker containers are up and running, we can set up the OpenStack side of our test environment. The script below will do the following:
The only thing worth noting in the above script is that the default gateway 10.0.100.161 gets overridden by a default host route pointing to 10.0.100.190. Normally, to demonstrate the DC-GW IRB scenario, I would have set up a gateway-less, L2-only subnet; however, in that case I wouldn't have been able to demonstrate BGPaaS on the same network, since this feature relies on having a gateway IP configured (which later acts as the BGP session termination endpoint). So instead of setting up two separate networks I've decided to implement this hack to minimise the required configuration.
The DC-GW integration procedure requires only a few simple steps: overriding the VXLAN VNI of the virtual network, setting its import/export route targets, and configuring a BGP peering with the MX80.
All of these steps can be done very easily through OpenContrail’s GUI. However as I’ve mentioned before, I always prefer to use API when I have a chance and in this case I even have a python library for OpenContrail’s REST API available on Juniper’s github page, which I’m going to use below to implement the above three steps.
Before we can begin working with OpenContrail’s API, we need to authenticate with the controller and get a REST API connection handler.
The first thing I'm going to do is override the default VNI set by OpenContrail for irb-net with a pre-defined value of 5001. To do that I first need to get a handler for the irb-net object and extract the virtual_network_properties object containing the vxlan_network_identifier property. Once it's overridden, I just need to update the parent irb-net object to apply the change to the running configuration on the controller.
The next thing I need to do is explicitly set the import/export route-target properties of the irb-net object. This requires a new RouteTargetList object, which then gets referenced by the route_target_list property of the irb-net object.
The final step is setting up a peering with the MX80. The main object that needs to be created is BgpRouter, which contains a pointer to a BGP session parameters object with session-specific values like the ASN and remote peer IP. The BGP router is defined in the global context (default domain and default project), which makes it available to all configured virtual networks.
For the sake of brevity, I will not cover the MX80's configuration in detail and will simply include it here for reference with some minor explanatory comments.
The easiest way to verify that BGP peering has been established is to query OpenContrail’s introspection API:
Datapath verification can be done from either side; in this case I'm showing a ping from the MX80's global VRF towards one of the OpenStack VMs:
To keep things simple I will not use multiple dot1Q interfaces and will instead set up a BGP peering with CumulusVX over a normal, non-trunk interface. From CumulusVX I will inject a loopback IP 1.1.1.1/32 into the irb-net network. Since the REST API python library I've used above is two major releases behind the current version of OpenContrail, it cannot be used to set up the BGPaaS feature. Instead I will demonstrate how to use the REST API directly from the command line of the all-in-one VM using cURL.
In order to start working with OpenContrail’s API, I first need to obtain an authentication token from OpenStack’s keystone. With that token I can now query the list of IPs assigned to all OpenStack instances and pick the one assigned to CumulusVX. I need the UUID of that particular IP address in order to extract the ID of the VM interface this IP is assigned to.
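A rough sketch of these two steps with cURL is shown below; the endpoints, ports and credentials are illustrative and will differ in your environment:

```bash
# obtain a keystone v3 token (credentials and endpoint are assumptions)
TOKEN=$(curl -si -H "Content-Type: application/json" -d '{
  "auth": {
    "identity": {"methods": ["password"],
      "password": {"user": {"name": "admin", "domain": {"id": "default"}, "password": "admin"}}},
    "scope": {"project": {"name": "admin", "domain": {"id": "default"}}}}}' \
  http://127.0.0.1:5000/v3/auth/tokens | awk '/X-Subject-Token/ {print $2}' | tr -d '\r')

# list all instance IPs known to the Contrail API server (port 8082 by default)
curl -s -H "X-Auth-Token: $TOKEN" http://127.0.0.1:8082/instance-ips | python -m json.tool
```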
With the VM interface ID saved in a VMI_ID variable, I can create a BGPaaS service and link it to that particular VM interface.
The final step is setting up a BGP peering on the CumulusVX side. CumulusVX configuration is very simple and self-explanatory. The BGP neighbor IP is the IP of virtual network’s default gateway located on local vRouter.
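A minimal Quagga sketch of what this looks like on the Cumulus side; the local ASN is an assumption, the neighbor is the irb-net gateway IP (10.0.100.161), and 1.1.1.1/32 is assumed to be configured on the loopback:

```
! 64512 is Contrail's default global ASN - adjust to your environment
router bgp 65530
 bgp router-id 1.1.1.1
 neighbor 10.0.100.161 remote-as 64512
 network 1.1.1.1/32
```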
Here's where we come across another limitation of EVPN. The loopback prefix 1.1.1.1/32 does not get injected into the EVPN address family; however, it does show up automatically in the VPNv4 address family, which can be verified from the MX80:
The route is hidden since I haven't configured MPLSoUDP dynamic tunnels on the MX80. However, this proves that the prefix does get injected into customer VPNs and becomes available on all devices with the matching import route-target communities.
This post concludes Series 2 of my OpenStack SDN saga. I've covered quite an extensive range of topics in this two-part series; however, the OpenStack networking landscape is so big that it's simply impossible to cover everything I find interesting. I started writing about OpenStack SDN when I first learned I got a job with Nokia. Back then I knew little about VMware NSX and even less about OpenStack. That's why I started researching topics that I found interesting and branching out into adjacent areas as I went along. Almost 2 years later, looking back I can say I've learned a lot about the internals of SDN in general and hopefully so have my readers. Now I'm leaving Nokia to rediscover my networking roots at Arista. I'll dive into DC networking from a different perspective now and it may be a while before I accumulate a critical mass of interesting material to start spilling it out in my blog again. I may still come back to OpenStack some day, but for now I'm going to take a little break, maybe do some housekeeping (e.g. move my blog from Jekyll to Hugo, add TLS support) and enjoy my time being a father.
For the last 5 years OpenStack has been the training ground for a lot of emerging DC SDN solutions. The OpenStack integration use case was one of the most compelling and easiest to implement thanks to the limited and suboptimal implementation of the native networking stack. Today, in 2017, features like L2 population, local ARP responder, L2 gateway integration, distributed routing and service function chaining have all become available in vanilla OpenStack and don't require a proprietary SDN controller anymore. Admittedly, some of the features are still not (and may never be) implemented in the most optimal way (e.g. DVR). This is where new opensource SDN controllers, the likes of OVN and Dragonflow, step in to provide scalable, elegant and efficient implementations of these advanced networking features. However, one major feature still remains outside of the scope of a lot of these new opensource SDN projects, and that is data centre gateway (DC-GW) integration. Let me start by explaining why you would need this feature in the first place.
OpenStack Neutron and VMware NSX, both being pure software solutions, rely on a special type of node to forward traffic between VMs and hosts outside of the data centre. This node acts as a L2/L3 gateway for all North-South traffic and is often implemented as either a VM or a network namespace. This kind of solution gives software developers greater independence from the underlying networking infrastructure which makes it easier for them to innovate and introduce new features.
However, from the traffic forwarding point of view, having a gateway/network node is not a good solution at all. There is no technological reason for a packet to have to go through this node when it ends up on a DC-GW anyway. In fact, this solution introduces additional complexity which needs to be properly managed (designed, configured and troubleshot) and a potential bottleneck for high-throughput traffic flows.
It's clear that the most optimal way to forward traffic is directly from a compute node to a DC-GW. The only question is how this optimal forwarding can be achieved. The SDN controller needs to be able to exchange reachability information with the DC-GW using a common protocol understood by most of the existing routing stacks. One such protocol, becoming very common in DC environments, is BGP, which has two address families we can potentially use: VPNv4/v6 and EVPN.
In OpenStack specifically, the BGPVPN project was created to provide a pluggable driver framework for 3rd-party BGP implementations. Apart from the reference BaGPipe driver (BaGPipe is an ExaBGP fork with a lightweight implementation of BGP VPNs), which relies on the default openvswitch ML2 mechanism driver, only Nuage, OpenDaylight and OpenContrail have contributed their drivers to this project. In this post I will focus on OpenDaylight and show how to install containerised OpenStack with OpenDaylight and integrate it with a Cisco CSR using EVPN.
Historically, OpenDaylight has had multiple projects implementing custom OpenStack networking drivers:
NetVirt provides several common Neutron services including L2 and L3 forwarding, ACL and NAT, as well as advanced services like L2 gateway, QoS and SFC. To do that it assumes full control over the OVS switch inside each compute node and implements the above services inside a single br-int OVS bridge. L2/L3 forwarding tables are built based on the tenant IP/MAC addresses that have been allocated by Neutron and the current network topology. For a high-level overview of NetVirt's forwarding pipeline you can refer to this document.
It helps to think of an ODL-managed OpenStack as a big chassis switch. NetVirt plays the role of a supervisor by managing control plane and compiling RIB based on the information received from Neutron. Each compute node running an OVS is a linecard with VMs connected to its ports. Unlike the distributed architecture of OVN and Dragonflow, compute nodes do not contain any control plane elements and each OVS gets its FIB programmed directly by the supervisor. DC underlay is a backplane, interconnecting all linecards and a supervisor.
In order to provide BGP VPN functionality, NetVirt employs the use of three service components:
In order to exchange BGP updates with an external DC-GW, NetVirt requires a BGP stack with EVPN and VPNv4/6 capabilities. Ideally, the internal ODL BGP stack could have been used for that; however, it didn't meet all the performance requirements (injecting/withdrawing thousands of prefixes at the same time). Instead, an external Quagga fork with EVPN add-ons is connected to the BGP manager via a high-speed Apache Thrift interface. This interface defines the format of the data to be exchanged between Quagga (a.k.a. QBGP) and NetVirt's BGP Manager in order to do two things:
The BGP session is established between QBGP and the external DC-GW; however, the next-hop values installed by NetVirt and advertised by QBGP are the IPs of the respective compute nodes, so that traffic is sent directly via the most optimal path.
Enough of the theory, let’s have a look at how to configure a L3VPN between QBGP (advertising ODL’s distributed router subnets) and IOS-XE DC-GW using EVPN route type 5 or, more specifically, Interface-less IP-VRF-to-IP-VRF model:
My lab environment is still based on a pair of nested VMs running the containerised Kolla OpenStack I've described in my earlier post. A few months ago an OpenDaylight role was added to kolla-ansible, so it is now possible to install OpenDaylight-integrated OpenStack automatically. However, there is no option to install QBGP, so I had to augment the default Kolla and Kolla-ansible repositories to include the QBGP Dockerfile template and the QBGP ansible role. So the first step is to download my latest automated installer and make sure the enable_opendaylight global variable is set to yes:
At the time of writing I was relying on a couple of the latest bug fixes inside OpenDaylight, so I had to modify the default ODL role to install the latest master-branch ODL build. Make sure the link below is pointing to the latest zip file in the 0.8.0-SNAPSHOT directory.
The next few steps are similar to what I've described in my Kolla lab post: they will create a pair of VMs, build all Kolla containers, push them to a local Docker repo and finally deploy OpenStack using Kolla-ansible playbooks:
The final 4-deploy.sh script will also create a simple init.sh script inside the controller VM that can be used to set up a test topology with a single VM connected to a 10.0.0.0/24 subnet:
Of course, another option to build a lab is to follow the official Kolla documentation to create your own custom test environment.
Assuming the test topology was set up with no issues and the test VM can ping its default gateway 10.0.0.1, we can start configuring BGP VPNs. Unfortunately, we won't be able to use the OpenStack BGPVPN API/CLI, since ODL requires an extra parameter (the L3 VNI for symmetric IRB) which is not available in the OpenStack BGPVPN API, but we can still configure everything directly through ODL's API. My interface of choice is always REST, since it's easier to build it into a fully programmatic plugin, so even though all of the below steps can be accomplished through the karaf console CLI, I'll be using cURL to send and retrieve data from ODL's REST API.
Next, we create an L3VPN instance under the admin tenant:
We then associate the newly created L3VPN with the demo-router:
ODL cannot automatically extract VTEP IP from updates received from DC-GW, so we need to explicitly configure it:
That is all that needs to be configured on ODL. Although I would consider this to be outside of the scope of the current post, for the sake of completeness I'm including the relevant configuration from the DC-GW:
For detailed explanation of how EVPN RT5 is configured on Cisco CSR refer to the following guide.
There are several things that can be checked to verify that the DC-GW integration is working. One of the first steps would be to check if BGP session with CSR is up. This can be done from the CSR side, however it’s also possible to check this from the QBGP side. First we need to get into the QBGP’s interactive shell from the controller node:
From here, we can check that the BGP session has been established:
We can also check the contents of EVPN RIB compiled by QBGP
Finally, we can verify that the prefix 8.8.8.0/24 advertised from the DC-GW is being passed by QBGP and accepted by NetVirt's FIB Manager:
The last output confirms that the prefix is being received and accepted by ODL. To do a similar check on CSR side we can run the following command:
This confirms that the control plane information has been successfully exchanged between NetVirt and Cisco CSR.
At the time of writing, there was an open bug in the ODL master branch that prevented the forwarding entries from being installed in the OVS datapath. Once the bug is fixed I will update this post with the dataplane verification, a.k.a. ping.
OpenDaylight is a pretty advanced OpenStack SDN platform. Its functionality includes clustering, site-to-site federation (without EVPN) and L2/L3 EVPN DC-GW integration for both IPv4 and IPv6. It is yet another example of how an open-source platform can match even the most advanced proprietary SDN solutions from incumbent vendors. This is all thanks to the companies involved in OpenDaylight development. I also want to say special thanks to Vyshakh Krishnan, Kiran N Upadhyaya and Dayavanti Gopal Kamath from Ericsson for helping me clear up some of the questions I posted on netvirt-dev mailing list.
In the ongoing hysteria surrounding all things SDN, one important thing often gets overlooked. You don't build SDN for its own sake. SDN is just a little cog in a big machine called "cloud". To take it even further, I would argue that the best SDN solution is the one that you don't even know exists. Despite what the big vendors tell you, operators are not supposed to interact with the SDN interface, be it GUI or CLI. If you dig up some of the earliest presentations about Cisco ACI, when the people talking about it were the actual people who designed the product, you'll notice one common motif being repeated over and over again: ACI was never designed for direct human interaction, but rather was supposed to be configured by a higher-level orchestrating system. In data center environments such an orchestrating system may glue together the services of the virtualization layer and the SDN layer to provide a seamless "cloud" experience to the end users. The focus of this post will be one incarnation of such an orchestration system, specific to the SP/Telco world, commonly known as NFV MANO.
At the early dawn of SDN/NFV era a lot of people got very excited by “the promise” and started applying the disaggregation and virtualization paradigms to all areas of networking. For Telcos that meant virtualizing network functions that built the service core of their networks - EPC, IMS, RAN. Traditionally those network functions were a collection of vertically-integrated baremetal appliances that took a long time to commission and had to be overprovisioned to cope with the peak-hour demand. Virtualizing them would have made it possible to achieve quicker time-to-market, elasticity to cope with a changing network demand and hardware/software disaggregation.
As expected, however, such a fundamental change has to come at a price. Not only do Telcos get a new virtualization platform to manage, but they also need to worry about lifecycle management and end-to-end orchestration (MANO) of VNFs. Since any such change presents an opportunity for new streams of revenue, it didn't take long for vendors to jump on the bandwagon and start working on a new architecture designed to address those issues.
The first problem was the easiest to solve since VMware and OpenStack already existed at that stage and could be used to host VNFs with very little modifications. The management and orchestration problem, however, was only partially solved by existing orchestration solutions. There were a lot of gaps between the current operational model and the new VNF world and although these problems could have been solved by Telcos engaging themselves with the open-source community, this proved to be too big of a change for them and they’ve turned to the only thing they could trust - the standards bodies.
The ETSI NFV MANO working group set out to define a reference architecture for the management and orchestration of virtualized resources in Telco data centers. The goal of the NFV MANO initiative was to research what's required to manage and orchestrate VNFs, survey what's currently available and identify potential gaps for other standards bodies to fill. The initial ETSI NFV Release 1 (2014) defined a base framework through relatively weak requirements and recommendations and was followed by Release 2 (2016), which made them more concrete by locking down the interface and data model specifications. For a very long time Release 1 was the only available NFV MANO standard, which led to a lot of inconsistencies in each vendor's implementation of it. This was very frustrating for Telcos since it required a lot of integration effort to build a multi-vendor MANO stack. Another potential issue with the ETSI MANO standard is its limited scope - a lot of critical components like OSS and EMS are left outside of it, which created a lot of confusion for Telcos and resulted in other standardisation efforts addressing those gaps.
The diagram below shows an abridged version of the original ETSI MANO reference architecture diagram, adapted to the use case I'll be demonstrating in this post.
This architecture consists of the following building blocks:
All these elements are working together towards a single goal - managing and orchestrating a Network Service (NS), which itself is comprised of multiple VNFs, Virtual Links (VLs), VNF Forwarding Graphs (VNFFGs) and Physical Network Functions (PNFs). In this post I create a NS for a simple virtual IDS use case, described in my previous SFC post. The goal is to steer all ICMP traffic coming from VM1 through a vIDS VNF which will forward the traffic to its original destination.
Before I get to the implementation, let me give a quick overview of how a Network Service is built from its constituent parts, in the context of our vIDS use case.
According to ETSI MANO, a Network Service (NS) is a subset of end-to-end service implemented by VNFs and instantiated on the NFVI. As I’ve mentioned before, some examples of a NS would be vEPC, vIMS or vCPE. NS can be described in either a YANG or a Tosca template called NS Descriptor (NSD). The main goal of a NSD is to tie together VNFs, VLs, VNFFGs and PNFs by defining relationship between various templates describing those objects (VNFDs, VLDs, VNFFGDs). Once NSD is onboarded (uploaded), it can be instantiated by NFVO, which communicates with VIM and VNFM to create the constituent components and stitch them together as described in a template. NSD normally does not contain VNFD or VNFFGD templates, but imports them through their names, which means that in order to instantiate a NSD, the corresponding VNFDs and VNFFGDs should already be onboarded.
VNF Descriptor is a template describing the compute and network parameters of a single VNF. Each VNF consists of one or more VNF components (VNFCs), represented in Tosca as Virtual Deployment Units (VDUs). A VDU is the smallest part of a VNF and can be implemented as either a container or, as it is in our case, a VM. Apart from the usual set of parameters like CPU, RAM and disk, VNFD also describes all the virtual networks required for internal communication between VNFCs, called internal VLs. VNFM can ask VIM to create those networks when the VNF is being instantiated. VNFD also contains a reference to external networks, which are supposed to be created by NFVO. Those networks are used to connect different VNFs together or to connect VNFs to PNFs and other elements outside of NFVI platform. If external VLs are defined in a VNFD, VNFM will need to source them externally, either as input parameters to VNFM or from NFVO. In fact, VNF instantiation by VNFM, as described in Tacker documentation, is only used for testing purposes and since a VNF only makes sense as a part of a Network Service, the intended way is to use a NSD to instantiate all VNFs in production environment.
The final component that we’re going to use is VNF Forwarding Graph. VNFFG Descriptor is an optional component that describes how different VNFs are supposed to be chained together to form a Network Service. In the absence of VNFFG, VNFs will fall back to the default destination-based forwarding, when the IPs of VNFs forming a NS are either automatically discovered (e.g. through DNS) or provisioned statically. Tacker’s implementation of VNFFG is not fully integrated with NSD yet and VNFFGD has to be instantiated separately and, as will be shown below, linked to an already running instance of a Network Service through its ID.
Tacker is an OpenStack project implementing a generic VNFM and NFVO. At the input it consumes Tosca-based templates, converts them to Heat templates which are then used to spin up VMs on OpenStack. This diagram from Brocade, the biggest Tacker contributor (at least until its acquisition), is the best overview of internal Tacker architecture.
For this demo environment I’ll keep using my OpenStack Kolla lab environment described in my previous post.
Before we can start using Tacker, it needs to know how to reach the OpenStack environment, so the first step in the workflow is OpenStack or VIM registration. We need to provide the address of the keystone endpoint along with the admin credentials to give Tacker enough rights to create and delete VMs and SFC objects:
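Roughly, the registration looks like this; the file name, credentials and keystone URL below are assumptions:

```bash
cat << EOF > vim_config.yaml
auth_url: http://192.168.133.100:35357/v3
username: admin
password: admin
project_name: admin
project_domain_name: Default
user_domain_name: Default
EOF

tacker vim-register --config-file vim_config.yaml --description "Kolla OpenStack" --is-default VIM0
```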
The successful result can be checked with tacker vim-list, which should report that the registered VIM is now reachable.
The VNFD defines a set of VMs (VNFCs), network ports (CPs) and networks (VLs) and their relationships. In our case we have a single cirros-based VM with a pair of ingress/egress ports. In this template we also define a special node type tosca.nodes.nfv.vIDS, which will be used by the NSD to pass the required parameters for the ingress and egress VLs. These parameters are going to be used by the VNFD to attach network ports (CPs) to virtual networks (VLs) as defined in the substitution_mappings section.
In our use case the NSD template is going to be really small. All we need to define is a single VNF of the tosca.nodes.nfv.vIDS type that was defined previously in the VNFD. We also define a VL node which points to the pre-existing demo-net virtual network and pass this VL to both the INGRESS_VL and EGRESS_VL parameters of the VNFD.
As I’ve mentioned before, VNFFG is not integrated with NSD yet, so we’ll add it later. For now, we have provided enough information to instantiate our NSD.
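Assuming the VNFD and NSD templates above are saved as vnfd-vids.yaml and nsd-vids.yaml (the file and object names here are mine), onboarding and instantiation look roughly like this:

```bash
tacker vnfd-create --vnfd-file vnfd-vids.yaml vnfd-vIDS
tacker nsd-create --nsd-file nsd-vids.yaml nsd-vIDS
tacker ns-create --nsd-name nsd-vIDS ns-vIDS
```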
This last command creates a cirros-based VM with two interfaces and connects them to the demo-net virtual network. All ICMP traffic from VM1 still goes directly to its default gateway, so the last thing we need to do is create a VNFFG.
A VNFFG consists of two types of nodes. The first type defines a Forwarding Path (FP) as a set of virtual ports (CPs) and a flow classifier, used to build an equivalent service function chain inside the VIM. The second type groups multiple forwarding paths to build complex service chain graphs; however, only one FP is supported by Tacker at the time of writing.
The following template demonstrates another important feature - template parametrization. Instead of defining all parameters statically in a template, they can be provided as inputs during instantiation, which allows templates to be kept generic. In this case I've replaced the network port ID parameter with a PORT_ID variable which will be provided during VNFFGD instantiation.
Note that the VNFFGD format has since been updated to support multiple flow classifiers, which means you may need to update the above template as per the sample VNFFGD template.
In order to instantiate a VNFFGD we need to provide two runtime parameters:
All these parameters can be obtained using the CLI commands as shown below:
The following command creates a VNFFG and an equivalent SFC to steer all ICMP traffic from VM1 through vIDS VNF. The result can be verified using Skydive following the procedure described in my previous post.
This post only scratches the surface of what’s available in Tacker with a lot of other salient features left out of scope, including:
Tacker is one of many NFV orchestration platforms in a very competitive environment. Other open-source initiatives have been created in response to the shortcomings of the original ETSI Release 1 reference architecture. The fact that some of the biggest Telcos have finally realised that the only way to achieve the goal of NFV orchestration is to get involved with open source and do it themselves may be a good sign for the industry, and maybe not so good for the ETSI NFV MANO working group. Whether ONAP with its broader scope becomes a new de-facto standard for NFV orchestration still remains to be seen; until then, ETSI MANO remains the only viable standard for NFV lifecycle management and orchestration.
SFC is another SDN feature that for a long time was only available in proprietary SDN solutions and that has recently become available in vanilla OpenStack. It serves as another proof that proprietary SDN solutions are losing their competitive edge, especially for Telco SDN/NFV use cases. Hopefully, by the end of this series of posts I'll manage to demonstrate how to build a complete open-source solution that has feature parity (in terms of major networking features) with all the major proprietary data centre SDN platforms. But for now, let's just focus on SFC.
In most general terms, SFC refers to packet forwarding technique that uses more than just destination IP address to decide how to forward packets. In more specific terms, SFC refers to “steering” of traffic through a specific set of endpoints (a.k.a Service Functions), overriding the default destination-based forwarding. For those coming from a traditional networking background, think of SFC as a set of policy-based routing instances orchestrated from a central element (SDN controller). Typical use cases for SFC would be things like firewalling, IDS/IPS, proxying, NAT'ing, monitoring.
SFC is usually modelled as a directed (acyclic) graph, where the first and the last elements are the source and destination respectively and each vertex inside the graph represents a SF to be chained. IETF RFC7665 defines the reference architecture for SFC implementations and establishes some of the basic terminology. A simplified SFC architecture consists of the following main components:
One important property of a SF is elasticity. More instances of the same type can be added to a pool of SF and SFF will load-balance the traffic between them. This is the reason why, as we’ll see in the next section, SFF treats connections to a SF as a group of ports rather than just a single port.
In legacy, pre-SDN environments SFs had no idea if they were a part of a service chain and network devices (routers and switches) had to “insert” the interesting traffic into the service function using one of the following two modes:
L2 mode is when SF is physically inserted between the source and destination inside a single broadcast domain, so traffic flows through a SF without any intervention from a switch. Example of this mode could be a firewall in transparent mode, physically connected between a switch and a default gateway router. All packets entering a SF have their original source and destination MAC addresses, which requires SF to be in promiscuous mode.
L3 mode is when a router overrides its default destination-based forwarding and redirects the interesting traffic to a SF. In legacy networks this could have been achieved with PBR or WCCP. In this case SF needs to be L2-attached to a router and all redirected packets have their destination MAC updated to that of a SF’s ingress interface.
Modern SDN networks make it really easy to modify forwarding behaviour of network elements, both physical and virtual. There is no need for policy-based routing or bump-in-the-wire designs anymore. When flow needs to be redirected to a SF on a virtual switch, all what’s required is a matching OpenFlow entry with a high enough priority. However redirecting traffic to a SF is just one part of the problem. Another part is how to make SFs smarter, to provide greater visibility of end-to-end service function path.
So far SFs have only been able to extract metadata from the packet itself. This limits the flexibility of SF logic and becomes computationally expensive when many SFs need to access some L7 header information. The ideal way would be to have an additional header which can be used to read and write arbitrary information and pass it along the service function chain. RFC7665 defines the requirements for an "SFC Encapsulation" header, which can be used to uniquely identify an instance of a chain as well as share metadata between all its elements. The Neutron API refers to SFC encapsulation as correlation, since its primary function is to identify a particular service function path. There are two implementations of SFC encapsulation in use today: NSH and MPLS labels.
It should be noted that the new approach with SFC encapsulation still allows for legacy, non-SFC-aware SFs to be chained. In this case SFC encapsulation is stripped off the packet by an “SFC proxy” before the packet is sent to the ingress port of a service function. All logical elements forming an SFC forwarding pipeline, including SFC proxy, Classifier and Forwarder, are implemented inside the same OVS bridges (br-int and br-tun) used by vanilla OVS-agent driver.
We'll pick up where we left off in the previous post. All Neutron and ML2 configuration files have already been updated thanks to the enable_sfc="yes" setting in the global Kolla-Ansible configuration file. If not, you can change it in /etc/kolla/globals.yaml and re-run the kolla-ansible deployment script.
First, let's generate OpenStack credentials using a post-deployment script. We can then use the default bootstrap script to download the cirros image and set up some basic networking and security rules.
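With stock Kolla-Ansible tooling this boils down to something like the following (the init-runonce path depends on how kolla-ansible was installed):

```bash
kolla-ansible post-deploy               # generates /etc/kolla/admin-openrc.sh
source /etc/kolla/admin-openrc.sh
/usr/share/kolla-ansible/init-runonce   # downloads cirros, creates demo networks and security rules
```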
The goal for this post is to create a simple uni-directional SFC to steer the ICMP requests from VM1 to its default gateway through another VM that will be playing the role of a firewall.
The network was already created by the bootstrap script, so all we have to do is create a test VM. I'm creating a port in a separate step simply so that I can refer to it by name instead of by UUID.
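A sketch of these two steps; the network, image and flavor names come from the bootstrap script, and the port/VM names (P0, VM1) match the ones used later in this post:

```bash
openstack port create --network demo-net P0
openstack server create --image cirros --flavor m1.tiny --nic port-id=P0 VM1
```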
I'll go over all the necessary steps to set up SFC, but will only provide a brief explanation. Refer to the official OpenStack Networking Guide for a complete SFC configuration guide.
First, let’s create a FW VM with two ports - P1 and P2.
Next, we need to create an ingress/egress port pair and assign it to a port pair group. The default setting for correlation in a port pair (not shown) is none. That means that the SFC encapsulation header (MPLS) will get stripped before the packet is sent to P1.
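With a recent networking-sfc release the OpenStack client syntax for these two steps looks roughly like this (the resource names are mine); older releases expose the equivalent neutron port-pair-create and port-pair-group-create commands:

```bash
openstack sfc port pair create --ingress P1 --egress P2 PP1
openstack sfc port pair group create --port-pair PP1 PPG1
```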
The port pair group also allows specifying the L2-L4 headers to use for load-balancing in OpenFlow groups, overriding the default behaviour described in the next section.
Another required element is a flow classifier. We will be redirecting ICMP traffic coming from VM1's port P0:
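Roughly, with the classifier name being mine:

```bash
openstack sfc flow classifier create --protocol icmp --logical-source-port P0 FC1
```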
Finally, we can tie together the flow classifier and the previously created port pair group. The default setting for correlation (again not shown) in this case is mpls. That means that each chain will have its own unique MPLS label to be used as the SFC encapsulation.
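Roughly, with the chain name being mine:

```bash
openstack sfc port chain create --port-pair-group PPG1 --flow-classifier FC1 PC1
```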
That's all the configuration needed to set up SFC. However, if you log in to VM1's console and try pinging the default gateway, the ping will fail. Next, I'm going to give a quick demo of how to use a real-time network analyzer tool called Skydive to troubleshoot this issue.
Skydive is a new open-source distributed network probing and traffic analyzing tool. It consists of a set of agents running on compute nodes, collecting topology and flow information and forwarding it to a central element for analysis.
The idea of using Skydive to analyze and track SFC is not new. In fact, for anyone interested in this topic I highly recommend the following blogpost. In my case I’ll show how to use Skydive from a more practical perspective - troubleshooting multiple SFC issues.
The Skydive CLI client is available inside the skydive_analyzer container. We need to start an interactive bash session inside this container and set some environment variables:
The first thing we can do to troubleshoot is see if ICMP traffic is entering the ingress port of the FW VM. Based on the output of the openstack port list command I know that P1 has got an IP of 10.0.0.8. Let's see if we can identify a tap port corresponding to P1:
The output above proves that skydive agent has successfully read the configuration of the port and we can start a capture on that object to see any packets arriving on P1.
If you watch the last command for several seconds you should see that the number in brackets is increasing. That means that packets are hitting the ingress port of the FW VM. Now let's repeat the same test on the egress port P2.
The output above tells us that there are no packets coming out of the FW VM. This is expected, since we haven't made any changes to the blank cirros image to make it forward packets between the two interfaces. If we examine the IP configuration of the FW VM, we will see that it doesn't have an IP address configured on the second interface. We also need to create a source-based routing policy to force all traffic from VM1 (10.0.0.6) to egress via interface eth2 and make sure IP forwarding is turned on. The following commands would need to be executed on the FW VM:
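Something along these lines should do it; the interface name (eth2) and VM1's IP are taken from the text above, while the routing table number is arbitrary:

```bash
sudo ip link set eth2 up
sudo ip rule add from 10.0.0.6 lookup 100
sudo ip route add default dev eth2 table 100
sudo sysctl -w net.ipv4.ip_forward=1
```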
Having done that, we should see some packets coming out of the egress port P2.
However, from VM1's perspective the ping is still failing. The next step would be to see if the packets are hitting the integration bridge that port P2 is attached to:
No packets means they are getting dropped somewhere between P2 and the integration bridge, and the only thing that can be doing that is security groups. In fact, source MAC/IP anti-spoofing is enabled by default, which only allows packets matching the source MAC/IP addresses assigned to P2 and drops any packets coming from VM1's IP address. The easiest fix would be to disable security groups for P2 completely:
After this step the counters should start incrementing and the ping from VM1 to its default gateway is resumed.
The only element being affected in our case (both VM1 and FW are on the same compute node) is the integration bridge. Refer to my older post about vanilla OpenStack networking for a refresher of the vanilla OVS-agent architecture.
Normally, I would start by collecting all port and flow details from the integration bridge with the following commands:
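That is, something like this, assuming the standard br-int integration bridge:

```bash
ovs-ofctl -O OpenFlow13 dump-ports-desc br-int
ovs-ofctl -O OpenFlow13 dump-flows br-int
```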
However, for the sake of brevity, I will omit the actual outputs and only show graphical representation of forwarding tables and packet flows. The tables below have two columns - first showing what is being matched and second showing the resulting action. Let’s start with the OpenFlow rules in an integration bridge before SFC is configured:
As we can see, the table structure is quite simple, since the integration bridge mostly relies on data-plane MAC learning. A couple of MAC and ARP anti-spoofing tables check the validity of a packet and send it to table 60, where the NORMAL action triggers the "flood-and-learn" behaviour. Therefore, an ICMP packet coming from VM1 will take the following path:
After we’ve configured SFC, the forwarding pipeline is changed and now looks like this:
First, we can see that table 0 acts as a classifier by redirecting the "interesting" packets towards group 1. This group is an OpenFlow group of type select, which load-balances traffic between multiple destinations. By default OVS will use a combination of L2-L4 headers as described here to calculate a hash which determines the output bucket, similar to how per-flow load-balancing works in traditional routers and switches. This behaviour can be overridden with a specific set of headers in the lb_fields setting of a port pair group.
In our case we've only got a single SF, so the packet gets its destination MAC updated to that of the SF's ingress port and is forwarded to a new table 5. Table 5 is where all packets destined for a SF are aggregated with a single MPLS label which uniquely identifies the service function path. The packet is then forwarded to table 10, which I've called SFC Ingress. This is where the packets are distributed to SF ingress ports based on the assigned MPLS label.
After being processed by a SF, the packet leaves the egress port and re-enters the integration bridge. This time table 0 knows that the packet has already been processed by a SF and, since the anti-spoofing rules have been disabled, simply floods the packet out of all ports in the same VLAN. The packet gets flooded to the tunnel bridge where it gets replicated and delivered to the qrouter sitting on the controller node as per the default behaviour.
SFC is a pretty vast topic and is still under active development. Some of the upcoming enhancements to the current implementation of SFC will include:
SFC is one of the major features in Telco SDN and, like many things, it's not meant to be configured manually. In fact, Telco SDN has its own framework for management and orchestration of VNFs (a.k.a. VMs) and VNF forwarding graphs (a.k.a. SFCs) called ETSI MANO. As is expected from a Telco standard, it abounds with acronyms and confuses the hell out of anyone whose name is not on the list of authors or contributors. That's why in the next post I will try to provide a brief overview of what Telco SDN is and use Tacker, a software implementation of an NFVO and VNFM, to automatically build a firewall VNF and provision an SFC, similar to what has been done in this post manually.
For quite a long time, installation and deployment have been deemed major barriers to OpenStack adoption. The classic "install everything manually" approach could only work in small production or lab environments, and the ever-increasing number of projects under the "Big Tent" made service-by-service installation infeasible. This led to the rise of automated installers that over time evolved from a simple collection of scripts to container management systems.
The first generation of automated installers were simple utilities that tied together a collection of Puppet/Chef/Ansible scripts. Some of these tools could do baremetal server provisioning through Cobbler or Ironic (Fuel, Compass) and some relied on server operating system to be pre-installed (Devstack, Packstack). In either case the packages were pulled from the Internet or local repository every time the installer ran.
The biggest problem with the above approach is the time it takes to re-deploy, upgrade or scale the existing environment. Even for relatively small environments it could be hours before all packages are downloaded, installed and configured. One of the ways to tackle this is to pre-build an operating system with all the necessary packages and only use Puppet/Chef/Ansible to change configuration files and turn services on and off. Redhat’s TripleO is one example of this approach. It uses a “golden image” with pre-installed OpenStack packages, which is dd-written bit-by-bit onto the baremetal server’s disk. The undercloud then decides which services to turn on based on the overcloud server’s role.
Another big problem with most of the existing deployment methods was that, despite their microservices architecture, all OpenStack services were deployed as static packages on top of a shared operating system. This made ongoing operations, troubleshooting and upgrades really difficult. The obvious thing to do would be to have all OpenStack services (e.g. Neutron, Keystone, Nova) deployed as containers and managed by a container management system. The first company to implement that, as far as I know, was Canonical. The deployment process is quite complicated; however, the end result is a highly flexible OpenStack cloud deployed using LXC containers, managed and orchestrated by a Juju controller.
Today (September 2017) deploying OpenStack services as containers is becoming mainstream and in this post I’ll show how to use Kolla to build container images and Kolla-Ansible to deploy them on a pair of “baremetal” VMs.
My lab consists of a single controller and a single compute VM. The goal was to make them as small as possible so they could run on a laptop with limited resources. Both VMs are connected to three VM bridged networks - provisioning, management and external VM access.
I’ve written some bash and Ansible scripts to automate the deployment of VMs on top of any Fedora derivative (e.g. Centos7). These scripts should be run directly from the hypervisor:
The first bash script downloads the VM OS (Centos7), creates two blank VMs and sets up a local Docker registry. The second script installs all the dependencies, including Docker and Ansible.
The first step in Kolla deployment workflow is deciding where to get the Docker images. Kolla maintains a Docker Hub registry with container images built for every major OpenStack release. The easiest way to get them would be to pull the images from Docker hub either directly or via a pull-through caching registry.
In my case I needed to build the latest version of OpenStack packages, not just the latest major release. I also wanted to build a few additional, non-Openstack images (Opendaylight and Quagga). Because of that I had to build all Docker images locally and push them into a local docker registry. The procedure to build container images is very well documented in the official Kolla image building guide. I’ve modified it slightly to include the Quagga Dockerfile and automated it so that the whole process can be run with a single command:
This step can take quite a long time (anything from 1 to 4 hours depending on the network and disk I/O speed), however, once it’s been done these container images can be used to deploy as many OpenStack instances as necessary.
The next step in the OpenStack deployment workflow is to deploy the Docker images on the target hosts. Kolla-Ansible is a highly customizable OpenStack deployment tool that is also extremely easy to use, at least for people familiar with Ansible. There are two main sources of information for Kolla-Ansible:
To get started with Kolla-Ansible, all it takes is a few modifications to the global configuration file to make sure that the network settings match the underlying OS interface configuration, and an update to the inventory file to point it to the correct deployment hosts. In my case I'm making additional changes to enable SFC, Skydive and Tacker and adding the files for the Quagga container, all of which can be done with the following command:
The best thing about this method of deployment is that it takes (in my case) under 5 minutes to get a full OpenStack cloud from scratch. That means if I break something or want to redeploy with some major changes (add/remove Opendaylight), all I have to do is destroy the existing deployment (approx. 1 minute), modify the global configuration file and re-deploy OpenStack. This makes Kolla-Ansible an ideal choice for my lab environment.
Once the deployment has been completed, we should be able to see a number of running Docker containers - one for each OpenStack process.
All the standard docker tools are available to interact with those containers. For example, this is how we can see what processes are running inside a container:
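For example, picking one of the Kolla-generated container names (neutron_server here):

```bash
docker top neutron_server
```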
Some of you may have noticed that none of the containers expose any ports. So how do they communicate? The answer is very simple - all containers run in host networking mode, effectively disabling any network isolation and giving all containers access to the TCP/IP stack of their Docker host. This is a simple way to avoid having to deal with Docker networking complexities, while at the same time preserving the immutability and portability of Docker containers.
All containers are configured to restart in case of a failure, however there's no CMS to provide full lifecycle management and advanced scheduling. If an upgrade or scale-in/out is needed, Kolla-Ansible will have to be re-run with updated configuration options. There is a sibling project called Kolla-Kubernetes (still under development) that's designed to address some of the mentioned shortcomings.
Now that the lab is up we can start exploring the new OpenStack SDN features. In the next post I’ll have a close look at Neutron’s SFC feature, how to configure it and how it’s been implemented in OVS forwarding pipeline.
A few weeks ago I bought myself a new Dell XPS-13 and decided for the n-th time to go all-in on Linux, that is, to have Linux as the main and only laptop OS. Since most of my Linux experience is with Fedora-family distros, I quickly installed Fedora 25 and embarked on a long and painful journey of getting out of my Windows comfort zone and re-establishing it in Linux. One of the most important aspects for me, as a network engineer, is to have a streamlined process for accessing network devices. In Windows I was using MTPutty and it helped define my expectations of an ideal SSH session manager:
no reliance on expect hacks.
Although GNOME terminal looked like a very good option, it didn't meet all of my requirements. I briefly looked at PAC Manager and GNOME Connection Manager but quickly dismissed them due to their ugliness and clunkiness. Ideally I wanted to keep using GNOME terminal as the main terminal emulator, without having to configure and rely on other 3rd party apps. I also didn't want to wrap my SSH session in expect, as I didn't want my password to be pasted onto my screen every time I cat a file containing the trigger keyword Password:. I've finally managed to make everything work inside the native GNOME terminal and this post is a documentation of my approach.
I've written a little tool that uses Netmiko to install (and remove) public SSH keys on network devices. Assuming python-pip is already installed, here's what's required to download and install ssh-copy-net:
Its functionality mimics that of ssh-copy-id, so the next step is always to upload the public key to the device:
The OpenSSH client config file provides a nice way of managing a user's SSH sessions. The configuration file allows you to define per-host SSH settings including username, port forwarding options, key checking flags etc. In my case all I had to do was define the IP addresses of my network devices:
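A sketch of what these ~/.ssh/config entries look like; the host names and IPs are purely illustrative:

```
Host srx
    HostName 192.168.1.1
    User admin
Host arista
    HostName 192.168.1.2
    User admin
```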
Now I am able to log in to a device by simply typing its name:
The final step is session organisation. For that I've decided to use zsh aliases and have device groups encoded in the alias name, separated by dashes. For example, if my SRX device was in the lab and the Arista was in Site-51 of Customer-A, this is how I would write my aliases:
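Something like this, with the group hierarchy encoded in the alias names (device names are illustrative):

```bash
alias lab-srx="ssh srx"
alias customer-a-site-51-arista="ssh arista"
```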
As a network engineer, I often find myself troubleshooting issues spanning multiple devices, which is why I need multiple tabs inside a single terminal window. Simply pressing Ctrl+T in GNOME terminal opens a new tab and I can switch between tabs using Alt+[1-9]. However what would be really nice is to have a couple of tabs opened side by side so that I can see the logs and compare output on a number of devices at the same time. This is where tmux comes in. It can do much more than this, but I simply use it to have multiple panes inside the same terminal tab:
Here’s an example of my tmux configuration file:
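The original file isn't reproduced here, but a minimal illustration of the kind of settings involved could look like this:

```
# ~/.tmux.conf (illustrative, not the original 28-line file)
# Split panes with mnemonic keys (prefix is the default Ctrl+B)
bind v split-window -h
bind s split-window -v
# Move between panes with Vim-style keys
bind h select-pane -L
bind j select-pane -D
bind k select-pane -U
bind l select-pane -R
```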
Now, having all the above defined and with the help of zsh command autocompletion, I can log into the device with just a few keypresses (shown in square brackets below).
Press Ctrl+B v to split the terminal window vertically:
And so on and so forth…
]]>The idea of using Ansible for configuration changes and state verification is not new. However the approach I’m going to demonstrate in this post, using YANG and NETCONF, will have a few notable differences:
I hope this promise is exciting enough so without further ado, let’s get cracking.
The test environment will consist of a single instance of CSR1000v running IOS-XE version 16.4.1 and a single instance of vMX running JUNOS version 17.1R1.8. The VMs containing the two devices are deployed within a single hypervisor and connected with one interface to the management network and back-to-back with the second pair of interfaces for BGP peering.
Each device contains some basic initial configuration to allow it to be reachable from the Ansible server.
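A sketch of the kind of bootstrap configuration involved (addresses and credentials are illustrative; the last line is what enables the NETCONF/YANG interface on IOS-XE):

```
hostname CSR1K
username admin privilege 15 secret admin
!
interface GigabitEthernet1
 ip address 192.168.145.51 255.255.255.0
 no shutdown
!
netconf-yang
```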
vMX configuration is quite similar. Static MAC address is required in order for ge interfaces to work.
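Again a sketch with made-up addresses and MAC, showing the static MAC assignment on the ge interface:

```
set system services netconf ssh
set interfaces ge-0/0/0 mac 00:50:56:00:00:01
set interfaces ge-0/0/0 unit 0 family inet address 10.12.0.2/24
```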
My Ansible-101 repository contains two plays - one for configuration and one for state verification. The local inventory file contains details about the two devices along with the login credentials. All the work will be performed by a custom Ansible module stored in the ./library directory. This module is a wrapper for a ydk_yaml module described in my previous post. I had to heavily modify the original ydk_yaml module to work around some Ansible limitations, like the lack of support for set data structures.
This custom Ansible module also relies on a number of YDK Python bindings to be pre-installed. Refer to my YAML, Operational and JUNOS repositories for the instructions on how to install those modules.
The desired configuration and expected operational state are documented inside a couple of device-specific host variable files. For each device there is a configuration file config.yaml, describing the desired configuration state. For IOS-XE there is an additional file verify.yaml, describing the expected operational state using the IETF interface YANG model (I couldn't find how to get the IETF or OpenConfig state models to work on Juniper).
All of these files follow the same structure: the top-level element is named either config or verify and defines how the enclosed data is supposed to be used.
Here's how IOS-XE will be configured, using the IETF interface YANG model (to unshut the interface) and Cisco's native YANG model for interface IP and BGP settings:
For JUNOS configuration, instead of the default humongous native model, I’ll use a set of much more light-weight OpenConfig YANG models to configure interfaces, BGP and redistribution policies:
Both devices now can be configured with just a single command:
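Presumably a single ansible-playbook run against the local inventory, along these lines (file names are illustrative):

```bash
ansible-playbook -i hosts configure.yml
```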
Behind the scenes, Ansible calls my custom ydk_module and passes to it the full configuration state and device credentials. This module then constructs an empty YDK binding based on the name of a YANG model and populates it recursively with the data from the config container. Finally, it pushes the data to the device with the help of the YDK NETCONF service provider.
There’s one side to YANG which I have carefully avoided until now and it’s operational state models. These YANG models are built similarly to configuration models, but with a different goal - to extract the running state from a device. The reason why I’ve avoided them is that, unlike the configuration models, the current support for state models is limited and somewhat brittle.
For example, JUNOS natively only supports state models as RPCs, where each RPC represents a certain show command which, I assume, when passed to the device gets evaluated, its output parsed and the result returned back to the client. With IOS-XE things are a little better, with a few of the operational models available in the current 16.4 release. You can check out my Github repo for some examples of how to check the interface and BGP neighbor state between the two IOS-XE devices. However, most of the models are still missing (I'm not counting the MIB-mapped YANG models) in the current release. The next few releases, though, are promised to come with an improved state model support, including some OpenConfig models, which is going to be super cool.
So in this post, since I couldn't get the JUNOS OpenConfig models to report any state and my IOS-XE BGP state model wouldn't return any output unless the BGP peering was with another Cisco device or in the Idle state, I'm going to have to resort to simply checking the state of physical interfaces. This is how a sample operational state file would look (question marks are YAML's special notation for sets, which is how I decided to encode the Enum data type):
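An illustrative fragment (not the author's exact file) showing the idea, with the YAML set notation used for the enum-typed oper-status leaf:

```yaml
ietf-interfaces:
  verify:
    interfaces-state:
      interface:
        - name: GigabitEthernet3
          oper-status:
            ? up
```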
Once again, all expected state can be verified with a single command:
If the state defined in that YAML file matches the data returned by the IOS-XE device, the playbook completes successfully. You can check that it works by shutting down one of the GigabitEthernet3 or Loopback0 interfaces and observing how the Ansible module returns an error.
Now that I’ve come to the end of my YANG series of posts I feel like I need to provide some concise and critical summary of everything I’ve been through. However, if there’s one thing I’ve learned in the last couple of months about YANG, it’s that things are changing very rapidly. Both Cisco and Juniper are working hard introducing new models and improving support for the existing ones. So one thing to keep in mind, if you’re reading this post a few months after it was published (April 2017), is that some or most of the above limitations may not exist and it’s always worth checking what the latest software release has to offer.
Finally, I wanted to say that I’m a strong believer that YANG models are the way forward for network device configuration and state verification, despite the timid scepticism of the networking industry. I think that there are two things that may improve the industry’s perception of YANG and help increase its adoption:
Support from networking vendors - we’ve already seen Cisco changing by introducing YANG support on IOS-XE instead of producing another dubious One-PK clone. So big thanks to them and I hope that other vendors will follow suit.
Tools - this part, IMHO, is the most crucial. In order for people to start using YANG models we have to have the right tools that would be versatile enough to allow network engineers to be limited only by their imagination and at the same time be as robust as the CLI. So I wanted to give a big shout out to all the people contributing to open-source projects like pyang, YDK and many others that I have missed or don't know about. You're doing a great job guys, don't stop.
XML, just like many other structured data formats, was not designed to be human-friendly. That's why many network engineers lose interest in YANG as soon as the conversation gets to the XML part. JSON is a much more human-readable alternative, however very few devices support RESTCONF, and the ones that do may have buggy implementations. At the same time, a lot of network engineers have happily embraced Ansible, which extensively uses YAML. That's why I've decided to write a Python module that would program network devices using YANG and NETCONF according to configuration data described in a YAML format.
In the previous post I have introduced a new open-source tool called YDK, designed to create API bindings for YANG models and interact with network devices using NETCONF or RESTCONF protocols. I have also mentioned that I would still prefer to use pyangbind along with other open-source tools to achieve the same functionality. Now, two weeks later, I must admit I have been converted. Initially, I was planning to write a simple REST API client to interact with the RESTCONF interface of IOS XE, create an API binding with pyangbind, use it to produce the JSON output, convert it to XML and send it to the device, similar to what I've described in my netconf and restconf posts. However, I've realised that YDK can already do all that I need with just a few function calls. All I've got left to do is create a wrapper module to consume the YAML data and use it to automatically populate YDK bindings.
This post will be mostly about the internal structure of this wrapper module I call ydk_yaml.py, which will serve as a base library for a YANG Ansible module, which I will describe in my next post. This post will be very programming-oriented, I'll start with a quick overview of some of the programming concepts being used by the module and then move on to the details of module implementation. Those who are not interested in technical details can jump straight to the examples sections at the end of this post for a quick demonstration of how it works.
One of the main tasks of the ydk_yaml.py module is to be able to parse a YAML data structure. This data structure, when loaded into Python, is stored as a collection of Python objects like dictionaries, lists and primitive data types like strings, integers and booleans. One key property of YAML data structures is that they can be represented as trees and parsing trees is a very well-known programming problem.
After having completed this programming course I fell in love with functional programming and recursions. Every problem I see, I try to solve with a recursive function. Recursions are interesting in that they are very difficult to understand but relatively easy to write. Any recursive function will consist of a number of if/then/else conditional statements. The first one (or few) if statements are called the base of a recursion - this is where recursion stops and the value is returned to the outer function. The remaining few if statements will implement the recursion by calling the same function with a reduced input. You can find a much better explanation of recursive functions here. For now, let's consider the problem of parsing the following tree-like data structure:
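For the sake of argument, assume a nested structure along these lines (the original example isn't reproduced exactly):

```python
tree = {
    "parent": {
        "child-1": "value-1",
        "child-2": {
            "grandchild-1": "value-2",
        },
    },
}
```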
A recursive function to parse this data structure, written in a pseudo-language, will look something like this:
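A sketch of the idea in Python-flavoured pseudo-code; is_simple_value, process, is_collection and children_of are undefined placeholders standing in for whatever checks and actions the real parser needs:

```python
def parse(node):
    if is_simple_value(node):      # base of the recursion: nothing left to descend into
        return process(node)
    if is_collection(node):        # recursive case: call parse() on every child
        return [parse(child) for child in children_of(node)]
```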
The beauty of recursive functions is that they are capable of parsing data structures of arbitrary complexity. That means if we had 1000 randomly nested child elements in the parent data structure, they could all have been parsed by the same 6-line function.
Introspection refers to the ability of Python to examine objects at runtime. It can be useful when dealing with objects of arbitrary structure, e.g. a YAML document. Introspection is used whenever there is a need for a function to behave differently based on the runtime data. In the above pseudo-language example, the two conditional statements are the examples of introspection. Whenever we need to determine the type of an object in Python we can either use the built-in function type(obj), which returns the type of an object, or isinstance(obj, type), which checks if the object is an instance or a descendant of a particular type. This is how we can re-write the above two conditional statements using real Python:
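For example, the two conditional statements could be expressed with Python's isinstance checks:

```python
if isinstance(node, (str, int, bool)):   # a simple (leaf) value
    ...
elif isinstance(node, dict):             # a nested collection of child elements
    ...
```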
Another programming concept used in my Python module is metaprogramming. Metaprogramming, in general, refers to an ability of programs to write themselves. This is what compilers normally do when they read the program written in a higher-level language and translate it to a lower-level language, like assembler. What I’ve used in my module is the simplest version of metaprogramming - dynamic getting and setting of object attributes. For example, this is how we would configure BGP using YDK Python binding, as described in my previous post:
The same code could be re-written using the getattr and setattr method calls:
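The idea, shown on a generic binding object rather than the actual YDK classes:

```python
# Static attribute access - the model structure has to be known in advance
config.router.bgp.id = 65100

# The same assignment done dynamically - attribute names can now come from data (e.g. YAML keys)
bgp = getattr(getattr(config, "router"), "bgp")
setattr(bgp, "id", 65100)
```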
This is also very useful when working with arbitrary data structures and objects. In my case the goal was to write a module that would be completely independent of the structure of a particular YANG model, which means that I can not know the structure of the Python binding generated by YDK. However, I can “guess” the name of the attributes if I assume that my YAML document is structured exactly like the YANG model. This simple assumption allows me to implement YAML mapping for all possible YANG models with just a single function.
As I’ve mentioned in my previous post, YANG is simply a way to define the structure of an XML document. At the same time, it is known that YANG-based XML can be mapped to JSON as described in this RFC. Since YAML is a superset of JSON, it’s easy to come up with a similar XML-to-YAML mapping convention. The following table contains the mapping between some of the most common YAML and YANG data structures and types:
YANG data | YAML representation |
---|---|
container | dictionary |
container name | dictionary key |
leaf name | dictionary key |
leaf | dictionary value |
list | list |
string, bool, integer | string, bool, integer |
empty | null |
Using this table, it’s easy to map the YANG data model to a YAML document. Let me demonstrate it on IOS XE’s native OSPF data model. First, I’ve generated a tree representation of an OSPF data model using pyang:
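Presumably something like the following (the model file name and tree path are illustrative):

```bash
pyang -f tree --tree-path "native/router/ospf" ned.yang
```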
Next, I’ve trimmed it down to only contain the options that I would like to set and created a YAML document based on the model’s tree structure:
With the right knowledge of YANG model’s structure, it’s fairly easy to generate similar YAML configuration files for other configuration objects, like interface and BGP.
At the heart of the ydk_yaml module is a single recursive function that traverses the input YAML data structure and uses it to instantiate the YDK-generated Python binding. Here is a simple, abridged version of the function that demonstrates the main logic.
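The original listing isn't reproduced here, so the following is a re-creation based purely on the description below; new_list_entry and new_container are placeholders for the introspection that finds the right YDK child class:

```python
def instantiate(binding, data):
    for name, value in data.items():
        if isinstance(value, (str, int, bool)):
            # Base of the recursion: a YANG leaf - set the value directly
            setattr(binding, name, value)
        elif isinstance(value, list):
            # A YANG list: build each element recursively and append it to the binding
            for element in value:
                child = new_list_entry(binding, name)   # placeholder helper
                instantiate(child, element)
                getattr(binding, name).append(child)
        elif isinstance(value, dict):
            # A YANG container: instantiate the child class, recurse, save the result
            child = new_container(binding, name)        # placeholder helper
            setattr(binding, name, instantiate(child, value))
    return binding
```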
Most of it should already make sense based on what I've covered above. The first conditional statement is the base of the recursion and performs the action of setting the value of a YANG Leaf element. The second conditional statement takes care of a YANG List by traversing all its elements, instantiating them recursively, and appending the result to a YDK binding. The last elif statement creates a class instance for a YANG container, recursively populates its values and saves the final result inside a YDK binding.
The full version of this function covers a few extra corner cases and can be found here.
The final step is to write a wrapper class that would consume the YDK model binding along with the YAML data, and both instantiate and push the configuration down to the network device.
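A rough re-creation based on the description below (lookup_binding is a placeholder for the code that maps a model name to its YDK class; the YDK imports themselves are real):

```python
from ydk.services import CRUDService
from ydk.providers import NetconfServiceProvider

class YdkModel(object):
    def __init__(self, model_name, yaml_data, host, username, password):
        self.binding = lookup_binding(model_name)()   # placeholder: find the YDK class by name
        instantiate(self.binding, yaml_data)          # recursive population of the binding
        self.provider = NetconfServiceProvider(address=host, port=830,
                                               username=username, password=password)
        self.crud = CRUDService()

    def action(self, name="create"):
        # Map the requested action (create/read/update/delete) to the matching CRUD call
        return getattr(self.crud, name)(self.provider, self.binding)
```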
The structure of this class is pretty simple. The constructor instantiates a YDK native data model and calls the recursive instantiation function to populate the binding. The action method implements standard CRUD actions using the YDK’s NETCONF provider. The full version of this Python module can be found here.
In my Github repo, I've included a few examples of how to configure Interface, OSPF and BGP settings of an IOS XE device. A helper Python script 1_send_yaml.py accepts the YANG model name and the name of the YAML configuration file as the input. It then instantiates the YdkModel class and calls the create action to push the configuration to the device. Let's assume that we have the following YAML configuration data saved in a bgp.yaml file:
To push this BGP configuration to the device, all I need to do is run the following command:
The resulting configuration on the IOS XE device would look like this:
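The exact output depends on the YAML input, but it would be something along these lines:

```
router bgp 65100
 bgp log-neighbor-changes
 neighbor 12.12.12.2 remote-as 65100
```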
To see more examples, follow this link to my Github repo.
]]>In the previous posts about NETCONF and RESTCONF I’ve demonstrated how to interact with Cisco IOS XE device directly from the Linux shell of my development VM. This approach works fine in some cases, e.g. whenever I setup a new DC fabric, I would make calls directly to the devices I’m configuring. However, it becomes impractical in the Ops world where change is constant and involves a large number of devices. This is where centralised service orchestrators come to the fore. The prime examples of such platforms are Network Services Orchestrator from Tail-f/Cisco and open-source project OpenDaylight. In this post we’ll concentrate on ODL and how to make it work with Cisco IOS XE. Additionally, I’ll show how to use an open-source tool YDK to generate Python bindings for native YANG models and how it compares with pyangbind.
OpenDaylight is a swiss army knife of SDN controllers. At the moment it is comprised of dozens of projects implementing all possible sorts of SDN functionality starting from Openflow controller all the way up to L3VPN orchestrator. ODL speaks most of the modern Southbound protocols like Openflow, SNMP, NETCONF and BGP. The brain of the controller is in the Service Abstraction Layer, a framework to model all network-related characteristics and properties. All logic inside SAL is modelled in YANG which is why I called it the godfather of YANG models. Towards the end users ODL exposes Java function calls for applications running on the same host and REST API for application running remotely.
OpenDaylight has several commercial offerings from companies involved in its development. Most notable ones are from Brocade and Cisco. Here I will allow myself a bit of a rant, feel free to skip it to go straight to the technical stuff.
One thing I find interesting is that Cisco are being so secretive about their Open SDN Controller, perhaps due to the earlier market pressure to come up with a single SDN story, but still have a very large number of contributors to this open-source project. It could be the case of having an egg in each basket, but the number of Cisco’s employees involved in ODL development is substantial. I wonder if, now that the use cases for ACI and ODL have finally formed and ACI still not showing the uptake originally expected, Cisco will change their strategy and start promoting ODL more aggressively, or at least stop hiding it deep in the bowels of cisco.com. Or, perhaps, it will always stay in the shade of Tail-f’s NSC and Insieme’s ACI and will be used only for customer with unique requirements, e.g. to have both OpenStack and network devices managed through the same controller.
We'll use the same environment we've set up in the previous posts, consisting of a CSR1K and a Linux VM connected to the same network inside my hypervisor. The IOS XE device needs to have netconf-yang configured in order to enable the northbound NETCONF interface.
On the same Linux VM, I’ve downloaded and launched the latest version of ODL (Boron-SR2), and enabled NETCONF and RESTCONF plugins.
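Roughly the following, run from the karaf console of the unpacked ODL distribution (feature names can differ between releases):

```bash
./bin/karaf
# then, inside the karaf console:
feature:install odl-netconf-connector-all odl-restconf
```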
We’ll use NETCONF to connect to Cisco IOS XE device and RESTCONF to interact with ODL from a Linux shell.
It might be useful to turn on logging in karaf console to catch any errors we might encounter later:
According to ODL NETCONF user guide, in order to connect a new device to the controller, we need to create an XML document which will include the IP, port and user credentials of the IOS XE device. Here’s the excerpt from the full XML document:
Assuming this XML is saved in a file called new_device.xml.1, we can use curl to send it to ODL's netconf-connector plugin.
When the controller gets this information it will try to connect to the device via NETCONF and do three things: discover the capabilities advertised by the device in its Hello message, download all of the advertised YANG models into the ./cache/schema directory and build the device's YANG schema context.
After ODL downloads all of the 260 available models (which can take up to 20 minutes) we will see the following errors in the karaf console:
Due to inconsistencies between the advertised and the available models, ODL fails to build the full device YANG schema context, which ultimately results in the inability to connect the device to the controller. However, we won't need all of the 260 models advertised by the device. In fact, most of the configuration can be done through a single Cisco native YANG model, ned. With ODL it is possible to override the default capabilities advertised in the Hello message and "pin" only the ones that are going to be used. Assuming that ODL has downloaded most of the models at the previous step, we can simply tell it to use the selected few with the following additions to the XML document:
Assuming the updated XML is saved in new_device.xml.2 file, the following command will update the current configuration of CSR1K device:
We can then verify that the device has been successfully mounted to the controller:
The output should look similar to the following, with the connection-status set to connected and no detected unavailable-capabilities:
At this point we should be able to interact with IOS XE’s native YANG model through ODL’s RESTCONF interface using the following URL
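The URL follows ODL's standard mount-point pattern, roughly (node name is an assumption):

```
http://localhost:8181/restconf/config/network-topology:network-topology/topology/topology-netconf/node/CSR1K/yang-ext:mount/
```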
The only thing that’s missing is the actual configuration data. To generate it, I’ll use a new open-source tool called YDK.
Yang Development Kit is a suite of tools to work with NETCONF/RESTCONF interfaces of a network device. The way I see it, YDK accomplishes two things:
There's a lot of overlap between the tools that we've used before and YDK. Effectively YDK combines in itself the functions of a NETCONF client, a REST client, pyangbind and pyang (the latter is used internally for model verification). Since one of the main functions of YDK is API generation I thought it'd be interesting to know how it compares to Rob Shakir's pyangbind plugin. The following information is what I've managed to find on the Internet and from the comment of Santiago Alvarez below:
Feature | Pyangbind | YDK |
---|---|---|
PL support | Python | Python, C++ with Ruby and Go in the pipeline |
Serialization | JSON, XML | only XML at this stage with JSON coming up in a few weeks |
Southbound interfaces | N/A | NETCONF, RESTCONF with ODL coming up in a few weeks |
Support | Rob Shakir | Cisco's DevNet team |
So it looks like YDK is a very promising alternative to pyangbind, however I, personally, would still prefer to use pyangbind due to familiarity, simplicity and the fact that I don’t need the above extra features offered by YDK right now. However, given that YDK has been able to achieve so much in just under one year of its existence, I don’t discount the possibility that I may switch to YDK as it becomes more mature and feature-rich.
One of the first things we need to do is install YDK-GEN, the tool responsible for API bindings generation, and its core Python packages on the local machine. The following few commands are my version of the official installation procedure:
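A sketch of the procedure (paths and options may differ between ydk-gen versions):

```bash
git clone https://github.com/CiscoDevNet/ydk-gen.git && cd ydk-gen
pip install -r requirements.txt
./generate.py --python --core        # build the core ydk Python package
# then pip-install the package produced under the gen-api/ directory
```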
YDK-GEN generates Python bindings based on the so-called bundle profile. This is a simple JSON document which lists all YANG models to include in the output package. In our case we'd need to include the ned model along with all its imports. The sample below shows only the model specification. Refer to my Github repo for a complete bundle profile for the Cisco IOS XE native YANG model.
Assuming that the IOS XE bundle profile is saved in a file called cisco-ios-xe_0_1_0.json, we can use YDK to generate and install the Python binding package:
Now we can start configuring BGP using our newly generated Python package. First, we need to create an instance of BGP configuration data:
The configuration will follow the pattern defined in the original model, which is why it's important to understand the internal structure of a YANG model. YANG leafs are represented as simple instance attributes. All YANG containers need to be explicitly instantiated, just like the Native and Bgp classes in the example above. Presence containers (router in the above example) will be instantiated at the same time as their parent container, inside the __init__ function of the Native class. Don't worry if this doesn't make sense, use iPython or any IDE with autocompletion and after a few tries, you'll get the hang of it.
Let’s see how we can set the local BGP AS number and add a new BGP peer to the neighbor list.
At this point all data is stored inside the instance of a Bgp class. In order to get an XML representation of it, we need to use YDK's XML provider and encoding service.
All we've got left now is to send the data to ODL:
The controller should have returned the status code 204 No Content, meaning that configuration has been changed successfully.
Back at the IOS XE CLI we can see the new BGP configuration that has been pushed down from the controller.
You can find a shorter version of the above procedure in my ODL 101 repo.
]]>In the previous post I have demonstrated how to make changes to interface configuration of Cisco IOS XE device using the standard IETF model. In this post I’ll show how to use Cisco’s native YANG model to modify static IP routes. To make things even more interesting I’ll use RESTCONF, an HTTP-based sibling of NETCONF.
RESTCONF is a very close functional equivalent of NETCONF. Instead of SSH, RESTCONF relies on HTTP to interact with configuration data and operational state of the network device and encodes all exchanged data in either XML or JSON. RESTCONF borrows the idea of Create-Read-Update-Delete operations on resources from REST and maps them to YANG models and datastores. There is a direct relationship between NETCONF operations and RESTCONF HTTP verbs:
HTTP VERB | NETCONF OPERATION |
---|---|
POST | create |
PUT | replace |
PATCH | merge |
DELETE | delete |
GET | get/get-config |
Both RESTfulness and the ability to encode data as JSON make RESTCONF a very attractive choice for application developers. In this post, for the sake of simplicity, we'll use the Python CLI and curl to interact with the RESTCONF API. In the upcoming posts I'll show how to implement the same functionality inside a simple Python library.
We’ll pick up from where we left our environment in the previous post right after we’ve configured a network interface. The following IOS CLI command enables RESTCONF’s root URL at http://192.168.145.51/restconf/api/
You can explore the structure of the RESTCONF interface starting at the root URL by specifying resource names separated by "/". For example, the following command will return all configuration from Cisco's native datastore.
In order to get JSON instead of the default XML output the client should specify the JSON media type application/vnd.yang.datastore+json and pass it in the Accept header.
Normally, you would expect to download the YANG model from the device itself. However IOS XE's NETCONF and RESTCONF support is so new that not all of the models are available. Specifically, Cisco's native YANG model for static routing cannot be found in either the YANG Github Repo or the device itself (via the get_schema RPC), which makes it a very good candidate for this post.
Update 13-02-2017: As it turned out, the model was right under my nose the whole time. It's called ned and encapsulates the whole of Cisco's native datastore. So think of everything that's to follow as a simple learning exercise, however the point I raise in the closing paragraph still stands.
The first thing we need to do is get an understanding of the structure and naming convention of the YANG model. The simplest way to do that would be to make a change on the CLI and observe the result via RESTCONF.
Let’s start by adding the following static route to the IOS XE device:
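For example (addresses are illustrative):

```
ip route 2.2.2.2 255.255.255.255 12.12.12.1
```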
Now we can view the configured static route via RESTCONF:
The returned output should look something like this:
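Reconstructed from the description below rather than copied verbatim, the shape of the reply is roughly:

```json
{
  "route": {
    "ip-route-interface-forwarding-list": [
      {
        "prefix": "2.2.2.2",
        "mask": "255.255.255.255",
        "fwd-list": [
          { "fwd": "12.12.12.1" }
        ]
      }
    ]
  }
}
```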
This JSON object gives us a good understanding of how the YANG model should look. The root element route contains a list of IP prefixes, called ip-route-interface-forwarding-list. Each element of this list contains values for the IP network and mask as well as the list of next-hops called fwd-list. Let's see how we can map this to YANG model concepts.
YANG RFC defines a number of data structures to model an XML tree. Let's first concentrate on the three most fundamental data structures that constitute the biggest part of any YANG model:
- Container - holds a set of child elements and has no value of its own. In JSON containers are encoded as name/object pairs: 'name': {...}
- Leaf - holds a value and has no children. In JSON leafs are encoded as name/value pairs: 'name': 'value'
- List - holds a sequence of entries, each uniquely identified by a key. In JSON lists are encoded as name/array pairs containing JSON objects: 'name': [{...}, {...}]
Now let’s see how we can describe the received data in terms of the above data structures:
- The route element is a JSON object, therefore it can only be mapped to a YANG container.
- ip-route-interface-forwarding-list is an array of JSON objects, therefore it must be a list.
- prefix and mask are key/value pairs. Since they don't contain any child elements and their values are strings they can only be mapped to YANG leafs.
- fwd-list is another YANG list and so far contains a single next-hop value inside a YANG leaf called fwd.
- Since fwd is the only leaf in the fwd-list, it must be that list's key. The ip-route-interface-forwarding-list list will have both prefix and mask as its key values since their combination represents a unique IP destination.
With all that in mind, this is how a skeleton of our YANG model will look:
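A reconstruction of that skeleton based on the mapping above (module name and namespace are made up; the author's original may differ in details):

```yang
module static-route {
  namespace "urn:example:static-route";   // illustrative namespace
  prefix "sr";

  container route {
    list ip-route-interface-forwarding-list {
      key "prefix mask";
      leaf prefix { type string; }
      leaf mask   { type string; }
      list fwd-list {
        key "fwd";
        leaf fwd { type string; }
      }
    }
  }
}
```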
YANG's syntax is pretty light-weight and looks very similar to JSON. The topmost module statement defines the model's name and encloses all other elements. The first two statements are used to define the XML namespace and prefix that I've described in my previous post.
At this stage the model can already be instantiated by pyang and pyangbind, however there’s a couple of very important changes and additions that I wanted to make to demonstrate some of the other features of YANG.
The first of them is common IETF data types. So far in our model we've assumed that prefix and mask can take any value in string format. But what if we wanted to check that the values we use are, in fact, correctly-formatted IPv4 addresses and netmasks before sending them to the device? That is where IETF common data types come to the rescue. All we need to do is add an import statement to define which model to use and we can start referencing them in our type definitions:
This solves the problem for the prefix part of a static route but how about its next-hop? Next-hops can be defined as either strings (representing an interface name) or IPv4 addresses. To make sure we can use either of these two types in the fwd leaf node we can define its type as a union. This built-in type is literally a union, a logical OR, of all its member elements. This is how we can change the fwd leaf definition:
So far we’ve been concentrating on the simplest form of a static route, which doesn’t include any of the optional arguments. Let’s add the leaf nodes for name, AD, tag, track and permanent options of the static route:
Since track and permanent options are mutually exclusive they should not appear in the configuration at the same time. To model that we can use the choice YANG statement. Let's remove the track and permanent leafs from the model and replace them with this:
And finally, we need to add an option for VRF. When a VRF is defined the whole ip-route-interface-forwarding-list gets encapsulated inside a list called vrf. This list has just one more leaf element, name, which plays the role of this list's key. In order to model this we can use another oft-used YANG concept called grouping. I like to think of it as a Python function, a reusable piece of code that can be referenced multiple times by its name. Here are the final changes to our model to include the VRF support:
Each element in a YANG model is optional by default, which means that the route container can include any number of VRF and non-VRF routes. The full YANG model can be found here.
Now let me demonstrate how to use our newly built YANG model to change the next-hop of an existing static route. Using pyang we need to generate a Python module based on the YANG model.
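Along these lines (file names are illustrative; $PYBINDPLUGIN points at pyangbind's plugin directory):

```bash
pyang --plugindir $PYBINDPLUGIN -f pybind -o route_binding.py static-route.yang
```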
From a Python shell, download the current static IP route configuration:
Import the downloaded JSON into a YANG model instance:
Delete the old next-hop and replace it with 12.12.12.2:
Save the updated model in a JSON file with the help of a write_file function:
If we tried sending the new_conf.json file now, the device would have responded with an error:
In our JSON file the order of elements inside a JSON object can be different from what was defined in the YANG model. This is expected since one of the fundamental principles of JSON is that an object is an unordered collection of name/value pairs. However it looks like behind the scenes IOS XE converts JSON to XML before processing and expects all elements to come in a strict, predefined order. Fortunately, this bug is already known and we can hope that Cisco will implement the fix for IOS XE soon. In the meantime, we’re gonna have to resort to sending XML.
Following the procedure described in my previous post, we can use json2xml tool to convert our instance into an XML document. Here we hit another issue. Since json2xml was designed to produce a NETCONF-compliant XML, it wraps the payload inside a data or a config element. Thankfully, json2xml is a Python script and can be easily patched to produce a RESTCONF-compliant XML. The following is a diff between the original and the patched files
Instead of patching the original file, I’ve applied the above changes to a local copy of the file. Once patched, the following commands should produce the needed XML.
The final step would be to send the generated XML to the IOS XE device. Since we are replacing the old static IP route configuration we’re gonna have to use HTTP PUT to overwrite the old data.
Back at the IOS XE CLI we can see the new static IP route installed.
As always there are more examples available in my YANG 101 repo
The exercise we’ve done in this post, though useful from a learning perspective, can come in very handy when dealing with vendors who forget or simply don’t want to share their YANG models with their customers (I know of at least one vendor that would only publish tree representations of their YANG models). In the upcoming posts I’ll show how to create a simple Python library to program static routes via RESTCONF and finally how to build an Ansible module to do that.
]]>To kick things off I will show how to use ncclient and pyang to configure interfaces on Cisco IOS XE device. In order to make sure everyone is on the same page and to provide some reference points for the remaining parts of the post, I would first need to cover some basic theory about NETCONF, XML and YANG.
NETCONF is a network management protocol that runs over a secure transport (SSH, TLS etc.). It defines a set of commands (RPCs) to change the state of a network device, however it does not define the structure of the exchanged information. The only requirement is for the payload to be a well-formed XML document. Effectively NETCONF provides a way for a network device to expose its API and in that sense it is very similar to REST. Here are some basic NETCONF operations that will be used later in this post:
All of these standard NETCONF operations are implemented in ncclient Python library which is what we’re going to use to talk to CSR1k.
There are several ways to exchange structured data over the network. HTML, YAML, JSON and XML are all examples of structured data formats. XML encodes data elements in tags and nests them inside one another to create complex tree-like data structures. Thankfully we are not going to spend much time dealing with XML in this post, however there are a few basic concepts that might be useful for the overall understanding:
The first two concepts are similar to paths in a Linux filesystem where all of the files are laid out in a tree-like structure with the root partition at its top. Namespace is somewhat similar to a unique URL identifying a particular server on the network. Using namespaces you can address multiple unique /etc/hosts files by prepending the host address to the path.
As with other structured data formats, XML by itself does not define the structure of the document. We still need something to organise a set of XML tags, specify what is mandatory and what is optional and what are the value constraints for the elements. This is exactly what YANG is used for.
YANG was conceived as a human-readable way to model the structure of an XML document. Similar to a programming language it has some primitive data types (integers, boolean, strings), several basic data structures (containers, lists, leafs) and allows users to define their own data types. The goal is to be able to formally model any network device configuration.
Anyone who has ever used Ansible to generate text network configuration files is familiar with network modelling. Coming up with a naming conventions for variables, deciding how to split them into different files, creating data structures for variables representing different parts of configuration are all a part of network modelling. YANG is similar to that kind of modelling, only this time the models are already created for you. There are three main sources of YANG models today:
The three main sources are the IETF, the OpenConfig working group and the vendors' own native models. Be sure to check out these and many other YANG models in the YangModels Github repo.
My test environment consists of a single instance of Cisco CSR1k running IOS XE 16.04.01. For the sake of simplicity I’m not using any network emulator and simply run it as a stand-alone VM inside VMWare Workstation. CSR1k has the following configuration applied:
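Presumably something similar to this (addresses and credentials are illustrative):

```
username admin privilege 15 secret admin
interface GigabitEthernet1
 ip address 192.168.145.51 255.255.255.0
 no shutdown
netconf-yang
```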
The last command is all that's required to enable NETCONF/YANG support.
On the same hypervisor I have my development CentOS7 VM, which is connected to the same network as the first interface of CSR1k. My VM is able to ping and ssh into the CSR1k. We will need the following additional packages installed:
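Most likely just the Python tooling used throughout the post:

```bash
pip install ncclient pyang pyangbind
```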
The following workflow will be performed in both interactive Python shell (e.g. iPython) and Linux bash shell. The best way to follow along is to have two sessions opened, one with each of the shells. This will save you from having to rerun import statements every time you re-open a python shell.
The first thing you have to do with any NETCONF-capable device is discover its capabilities. We'll use ncclient's manager module to establish a session to CSR1k. The .connect() method of the manager object takes device IP, port and login credentials as input and returns a reference to a NETCONF session established with the device.
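A minimal sketch of that step (IP and credentials are illustrative; host-key checking is disabled for the lab only):

```python
from ncclient import manager

m = manager.connect(host="192.168.145.51", port=830,
                    username="admin", password="admin",
                    hostkey_verify=False)

for capability in m.server_capabilities:
    print(capability)
```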
When the session is established, server capabilities advertised in the hello message get saved in the server_capabilities variable. The last command should print a long list of all capabilities and supported YANG models.
The task we have set for ourselves is to configure an interface. CSR1k supports both native (Cisco-specific) and IETF-standard ways of doing it. In this post I'll show how to use the IETF models to do that. First we need to identify which model to use. Based on the discovered capabilities we can guess that ietf-ip could be used to configure IP addresses, so let's get this model first. One way to get a YANG model is to search for it on the Internet, and since it's an IETF model, it most likely can be found in one of the RFCs.
Another way to get it is to download it from the device itself. All devices supporting RFC6022 must be able to send the requested model in response to the get_schema call. Let's see how we can download the ietf-ip YANG model:
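Roughly:

```python
# Ask the device to send us the ietf-ip model (RFC6022 get-schema)
reply = m.get_schema("ietf-ip")
```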
At this stage the model is embedded in the XML response and we still need to extract it and save it in a file. To do that we'll use the Python lxml library to parse the received XML document, pick the first child from the root of the tree (the data element) and save it into a variable. A helper function write_file simply saves the Python string contained in the yang_text variable in a file.
Back at the Linux shell we can now start using pyang. The most basic function of pyang is to convert the YANG model into one of the many supported formats. For example, tree format can be very helpful for high-level understanding of the structure of a YANG model. It produces a tree-like representation of a YANG model and annotates element types and constraints using syntax described in this RFC.
From the output above we can see that ietf-ip augments or extends the interface model. It adds new configurable (rw) containers with a list of IP prefixes to be assigned to an interface. Another thing we can see is that this model cannot be used on its own, since it doesn't specify the name of the interface it augments. This model can only be used together with the ietf-interfaces YANG model which models the basic interface properties like MTU, state and description. In fact ietf-ip relies on a number of YANG models which are specified as imports at the beginning of the model definition.
Each import statement specifies the model and the prefix by which it will be referred later in the document. These prefixes create a clear separation between namespaces of different models.
We would need to download all of these models and use them together with the ietf-ip throughout the rest of this post. Use the procedure described above to download the ietf-interfaces, ietf-inet-types and ietf-yang-types models.
Now we can use pyangbind, an extension to pyang, to build a Python module based on the downloaded YANG models and start building the interface configuration. Make sure your $PYBINDPLUGIN variable is set as described here.
The resulting ietf_ip_binding.py is now ready for use inside the Python shell. Note that we import ietf_interfaces as this is the parent object for ietf_ip. The details about how to work with the generated Python binding can be found on pyangbind's Github page.
To set up an IP address, we first need to create a model of the interface we're planning to manipulate. We can then use .get() on the model's instance to see the list of all configurable parameters and their defaults.
The simplest thing we can do is modify the interface description.
New objects are added by calling .add() on the parent object and passing a unique key as an argument.
At the time of writing pyangbind only supported serialisation into JSON format which means we have to do a couple of extra steps to get the required XML. For now let’s dump the contents of our interface model instance into a file.
Even though pyangbind does not support XML, it is possible to use other pyang plugins to generate XML from JSON.
The resulting interface.xml file contains the XML document ready to be sent to the device. I'll use the read_file helper function to read its contents and save it into a variable. We should still have a NETCONF session opened from one of the previous steps and we'll use the edit-config RPC call to apply our changes to the running configuration of CSR1k.
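A sketch of that step (read_file is the author's helper function mentioned above):

```python
xml_payload = read_file("interface.xml")
reply = m.edit_config(target="running", config=xml_payload)
print(reply.ok)
```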
If the change was applied successfully reply.ok should return True and we can close the session to the device.
Going back to the CSR1k’s CLI we should see our changes reflected in the running configuration:
Check out this Github page for Python scripts that implement the above workflow in a more organised way.
In this post I have merely scratched the surface of YANG modelling and network device programming. In the following posts I am planning to take a closer look at the RESTCONF interface, internal structure of a YANG model, Ansible integration and other YANG-related topics until I run out of interest. So until that happens… stay tuned.
]]>In the previous post we have installed OpenStack and created a simple virtual topology as shown below. In OpenStack’s data model this topology consists of the following elements:
So far nothing unusual, this is a simple Neutron data model, all that information is stored in Neutron's database and can be queried with neutron CLI commands.
Every call to implement an element of the above data model is forwarded to the OVN ML2 driver as defined by the mechanism driver setting of the ML2 plugin. This driver is responsible for the creation of an appropriate data model inside the OVN Northbound DB. The main elements of this data model are:
This is a visual representation of our network topology inside OVN's Northbound DB, built based on the output of the ovn-nbctl show command:
This topology is pretty similar to Neutron's native data model with the exception of a gateway router. In OVN, a gateway router is a special non-distributed router which performs functions that are very hard or impossible to distribute amongst all nodes, like NAT and Load Balancing. This router only exists on a single compute node which is selected by the scheduler based on the ovn_l3_scheduler setting of the ML2 plugin. It is attached to a distributed router via a point-to-point /30 subnet defined in the ovn_l3_admin_net_cidr setting of the ML2 plugin.
Apart from the logical network topology, Northbound database keeps track of all QoS, NAT and ACL settings and their parent objects. The detailed description of all tables and properties of this database can be found in the official Northbound DB documentation.
OVN northd process running on the controller node translates the above logical topology into a set of tables stored in Southbound DB. Each row in those tables is a logical flow and together they form a forwarding pipeline by stringing together multiple actions to be performed on a packet. These actions range from packet drop through packet header modification to packet output. The stringing is implemented with a special next
action which moves the packet one step down the pipeline starting from table 0. Let’s have a look at the simplified versions of L2 and L3 forwarding pipelines using examples from our virtual topology.
In the first example we’ll explore the L2 datapath between VM1 and VM3. Both VMs are attached to the ports of the same logical switch. The full datapath of a logical switch consists of two parts - ingress and egress datapath (the direction is from the perspective of a logical switch). The ultimate goal of an ingress datapath is to determine the output port or ports (in case of multicast) and pass the packet to the egress datapath. The egress datapath does a few security checks before sending the packet out to its destination. Two things are worth noting at this stage:
Let’s have a closer look at each of the stages of the forwarding pipeline. I’ll include snippets of logical flows demonstrating the most interesting behaviour at each stage. Full logical datapath is quite long and can be viewed with ovn-sbctl lflow-list [DATAPATH]
command. Here is some useful information, collected from the Northbound database, that will be used in the examples below:
VM# | IP | MAC | Port UUID |
---|---|---|---|
VM1 | 10.0.0.2 | fa:16:3e:4f:2f:b8 | 26c23a54-6a91-48fd-a019-3bd8a7e118de |
VM3 | 10.0.0.5 | fa:16:3e:2a:60:32 | 5c62cfbe-0b2f-4c2a-98c3-7ee76c9d2879 |
At this stage matching packets get marked with reg0[1] = 1. The next table catches these marked packets and commits them to the connection tracker. A special ct_label=0/1 action ensures return traffic is allowed, which is a standard behaviour of all stateful firewalls.
Similar to a logical switch pipeline, L3 datapath is split into ingress and egress parts. In this example we’ll concentrate on the Gateway router datapath. This router is connected to a distributed logical router via a transit subnet (SWtr) and to an external network via an external bridge (SWex) and performs NAT translation for all VM traffic.
Here is some useful information about router interfaces and ports that will be used in the examples below.
SW function | IP | MAC | Port UUID |
---|---|---|---|
External | 169.254.0.54/24 | fa:16:3e:39:c8:d8 | lrp-dc1ae9e3-d8fd-4451-aed8-3d6ddc5d095b |
DVR-GW transit | 169.254.128.2/30 | fa:16:3e:7e:96:e7 | lrp-gtsp-186d8754-cc4b-40fd-9e5d-b0d26fc063bd |
Once the outport is decided, IP TTL is decremented and the new next-hop IP is set in register 0.
The resolved next-hop MAC addresses are stored in the MAC_Binding table of Southbound DB.
This was a very high-level, abridged and simplified version of how logical datapaths are built in OVN. Hopefully this lays enough groundwork to move on to the official northd documentation which describes both L2 and L3 datapaths in much greater detail.
Apart from the logical flows, Southbound DB also contains a number of tables that establish the logical-to-physical bindings. For example, the Port_Binding table establishes binding between logical switch, logical port, logical port overlay ID (a.k.a. tunnel key) and the unique hypervisor ID. In the next section we'll see how this information is used to translate logical flows into OpenFlow flows at each compute node. For full description of Southbound DB, its tables and their properties refer to the official SB schema documentation.
OVN Controller process is the distributed part of OVN SDN controller. This process, running on each compute node, connects to Southbound DB via OVSDB and configures local OVS according to information received from it. It also uses Southbound DB to exchange the physical location information with other hypervisors. The two most important bits of information that OVN controller contributes to Southbound DB are physical location of logical ports and overlay tunnel IP address. These are the last two missing pieces to map logical flows to physical nodes and networks.
The whole flat space of OpenFlow tables is split into multiple areas. Tables 16 to 47 implement an ingress logical pipeline and tables 48 to 63 implement an egress logical pipeline. These tables have no notion of physical ports and are functionally equivalent to logical flows in Southbound DB. Tables 0 and 65 are responsible for mapping between the physical and logical realms. In table 0 packets are matched on the physical incoming port and assigned to a correct logical datapath as was defined by the Port_Binding
table. In table 65 the information about the outport, that was determined during the ingress pipeline processing, is mapped to a local physical interface and the packet is sent out.
To demonstrate the details of OpenFlow implementation, I’ll use the traffic flow between VM1 and external destination (8.8.8.8). For the sake of brevity I will only cover the major steps of packet processing inside OVS, omitting security checks and ARP/DHCP processing.
When packets traverse OpenFlow tables they get labelled or annotated with special values to simplify matching in subsequent tables. For example, when table 0 matches the incoming port, it annotates the packet with the datapath ID. Since it would have been impractical to label packets with globally unique UUIDs from Southbound DB, these UUIDs get mapped to smaller values called tunnel keys. To make things even more confusing, each port will have a local kernel ID, unique within each hypervisor. We'll need both tunnel keys and local port IDs to be able to track the packets inside the OVS. The figure below depicts all port and datapath IDs that have been collected from the Southbound DB and local OVSDB on each hypervisor. Local port numbers are attached with a dotted line to their respective tunnel keys.
When VM1 sends the first packet to 8.8.8.8, it reaches OVS on local port 13. OVN Controller knows that this port belongs to VM1 and installs an OpenFlow rule to match all packets from this port and annotate them with datapath ID (OXM_OF_METADATA), incoming port ID (NXM_NX_REG14), conntrack zone (NXM_NX_REG13). It then moves these annotated packets to the first table of the ingress pipeline.
Skipping to the L2 MAC address lookup stage, the output port (0x1) is decided based on the destination MAC address and saved in register 15.
Finally, the packet reaches the last table where it is sent out the physical patch port interface towards R1.
The other end of this patch port is connected to a local instance of distributed router R1. That means our packet, unmodified, re-enters OpenFlow table 0, only this time on a different port. Local port 2 is associated with a logical pipeline of a router, hence metadata for this packet is set to 4.
The packet progresses through logical router datapath and finally gets to table 21 where destination IP lookup take place. It matches the catch-all default route rule and the values for its next-hop IP (0xa9fe8002), MAC address (fa:16:3e:2a:7f:25) and logical output port (0x03) are set.
Table 65 converts the logical output port 3 to physical port 6, which is yet another patch port connected to a transit switch.
The packet once again re-enters OpenFlow pipeline from table 0, this time from port 5. Table 0 maps incoming port 5 to the logical datapath of a transit switch with Tunnel key 7.
Destination lookup determines the output port (2) but this time, instead of entering the egress pipeline locally, the packet gets sent out the physical tunnel port (7) which points to the IP address of a compute node hosting the GW router. The headers of an overlay packet are populated with logical datapath ID (0x7), logical input port (copied from register 14) and logical output port (0x2).
When packet reaches the destination node, it once again enters the OpenFlow table 0, but this time all information is extracted from the tunnel keys.
At the end of the transit switch datapath the packet gets sent out port 12, whose peer is patch port 16.
The packet re-enters OpenFlow table 0 from port 16, where it gets mapped to the logical datapath of a gateway router.
Similar to a distributed router R1, table 21 determines the next-hop MAC address for a packet and saves the output port in register 15.
The first table of an egress pipeline source-NATs packets to external IP address of the GW router.
The modified packet is sent out the physical port 14 towards the external switch.
The external switch determines the output port connected to the br-ex on a local hypervisor and sends the packet out.
As we’ve just seen, OpenFlow repeats the logical topology by interconnecting logical datapaths of switches and routers with virtual point-to-point patch cables. This may seem like an unnecessary modelling element with a potential for a performance impact. However, when flows get installed in kernel datapath, these patch ports do not exist, which means that there isn’t any performance impact on packets in fastpath.
Before we wrap up, let us have a quick look at the new overlay protocol GENEVE. The goal of any overlay protocol is to transport all the necessary tunnel keys. With VXLAN the only tunnel key that could be transported is the Virtual Network Identifier (VNI). In OVN’s case these tunnel keys include not only the logical datapath ID (commonly known as VNI) but also both input and output port IDs. You could have carved up the 24 bits of VXLAN tunnel ID to encode all this information but this would only have given you 256 unique values per key. Some other overlay protocols, like STT have even bigger tunnel ID header size but they, too, have a strict upper limit.
GENEVE was designed to have a variable-length header. The first few bytes are well-defined fixed size fields followed by variable-length Options. This kind of structure allows software developers to innovate at their own pace while still getting the benefits of hardware offload for the fixed-size portion of the header. OVN developers decided to use Options header type 0x80 to store the 15-bit logical ingress port ID and a 16-bit egress port ID (an extra bit is for logical multicast groups).
The figure above shows the ICMP ping coming from VM1 (10.0.0.2) to Google’s DNS. As I showed in the previous section, GENEVE is used between the ingress and egress pipelines of a transit switch (SWtr), whose datapath ID is encoded in the VNI field (0x7). Packets enter the transit switch on port 1 and leave it on port 2. These two values are encoded in the 00010002 value of the Options Data field.
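To make this a bit more concrete, here is roughly what a GENEVE tunnel port looks like from the OVS side. OVN creates and manages these ports itself, so the commands below are purely illustrative; the port name and remote IP are made up:

# Illustrative only - OVN normally provisions its own tunnel ports
ovs-vsctl add-port br-int ovn-hv2-0 -- set interface ovn-hv2-0 \
  type=geneve options:remote_ip=10.0.2.10 options:key=flow
# Inspect the port and its GENEVE options
ovs-vsctl list interface ovn-hv2-0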
So now that GENEVE has taken over as the inter-hypervisor overlay protocol, does that mean that VXLAN is dead? OVN still supports VXLAN but only for interconnects with 3rd party devices like VXLAN-VLAN gateways or VXLAN TOR switches. Rephrasing the official OVN documentation, VXLAN gateways will continue to be supported but they will have a reduced feature set due to lack of extensibility.
OpenStack networking has always been one of the first use cases for any new SDN controller. All the major SDN platforms like ACI, NSX, Contrail, VSP or ODL have some form of OpenStack integration. And it made sense, since native Neutron networking has always been one of the biggest pain points in OpenStack deployments. As I’ve just demonstrated, OVN can now do all of the common networking functionality natively, without having to rely on 3rd party agents. In addition to that it has fantastic documentation, implements all forwarding inside a single OVS bridge and is an open-source project. As an OpenStack networking solution it is still, perhaps, a few months away from being production ready - active/active HA is not supported with OVSDB, GW router scheduling options are limited and there is no native support for DNS or Metadata proxy. However I anticipate that starting from the next OpenStack release (Ocata, Feb 2017) OVN will be ready for mass deployment even by companies without an army of OVS/OpenStack developers. And when that happens there will be even less need for proprietary OpenStack SDN platforms.
]]>Vanilla OpenStack networking has many functional, performance and scaling limitations. Projects like L2 population, local ARP responder, L2 Gateway and DVR were conceived to address those issues. However good a job these projects do, they still remain a collection of separate projects, each with its own limitations, configuration options and sets of dependencies. That led to an effort outside of OpenStack to develop a special-purpose OVS-only SDN controller that would address those issues in a centralised and consistent manner. This post will be about one such SDN controller, coming directly from the people responsible for OpenvSwitch, Open Virtual Network (OVN).
OVN is a distributed SDN controller implementing virtual networks with the help of OVS. Even though it is positioned as a CMS-independent controller, the main use case is still OpenStack. OVN was designed to address the following limitations of vanilla OpenStack networking:
OVN implements security groups, distributed virtual routing, NAT and a distributed DHCP server all inside a single OVS bridge. This dramatically improves performance by reducing the amount of inter-process packet handling and ensures that all flows can benefit from kernel fast-path switching.
At a high level, OVN consists of 3 main components:
If you want to learn more about OVN architecture and use cases, OpenStack OVN page has an excellent collection of resources for further reading.
I’ll use RDO packstack to help me build an OpenStack lab with one controller and two compute nodes on CentOS7. I’ll use the master trunk to deploy the latest OpenStack Ocata packages. This is required since at the time of writing (Nov 2016) some of the OVN features were not available in OpenStack Newton.
On the controller node, generate a sample answer file and modify settings to match the IPs of individual nodes. Optionally, you can disable some of the unused components like Nagios and Ceilometer similar to how I did it in my earlier post.
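For those unfamiliar with packstack, the workflow is roughly as follows (a sketch; the answer file path and the exact CONFIG_* keys you modify are up to you):

# Generate the answer file, edit the node IPs, then run the installer
packstack --gen-answer-file=/root/answers.txt
# e.g. set CONFIG_CONTROLLER_HOST, CONFIG_COMPUTE_HOSTS and CONFIG_NETWORK_HOSTS,
# and optionally CONFIG_NAGIOS_INSTALL=n and CONFIG_CEILOMETER_INSTALL=n
packstack --answer-file=/root/answers.txt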
After the last step we should have a working 3-node OpenStack lab, similar to the one depicted below. If you want to learn about how to automate this process, refer to my older posts about OpenStack and underlay Leaf-Spine fabric build using Chef.
OVN can be built directly from OVS source code. Instead of building and installing OVS on each of the OpenStack nodes individually, I’ll build a set of RPMs on the Controller and will use them to install and upgrade OVS/OVN components on the remaining nodes.
Part of the OVN build process includes building an OVS kernel module. In order to be able to use the kmod RPM on all nodes we need to make sure all nodes use the same version of the Linux kernel. The easiest way is to fetch the latest updates from the CentOS repos and reboot the nodes. This step should result in the same kernel version on all nodes, which can be checked with the uname -r command.
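A minimal sketch of that step, run on every node:

# Pull the latest kernel from the CentOS repos and reboot
yum -y update kernel
reboot
# After the reboot, the output should match on all three nodes
uname -r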
The official OVS installation procedure for CentOS7 is pretty accurate and requires only a few modifications to account for the packages missing in the minimal CentOS image I’ve used as a base OS.
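One possible build sequence, assuming the rpm-fedora make targets available in recent OVS source trees (the dependency list below is an approximation):

yum -y install git rpm-build autoconf automake libtool openssl-devel \
  python-devel kernel-devel-$(uname -r) kernel-headers-$(uname -r)
git clone https://github.com/openvswitch/ovs.git && cd ovs
./boot.sh && ./configure
make rpm-fedora        # userspace RPMs
make rpm-fedora-kmod   # kernel module RPM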
At the end of the process we should have a set of RPMs inside the ovs/rpm/rpmbuild/RPMS/ directory.
Before we can begin installing OVN, we need to prepare the existing OpenStack environment by disabling and removing legacy Neutron OpenvSwitch agents. Since OVN natively implements L2 and L3 forwarding, DHCP and NAT, we won’t need L3 and DHCP agents on any of the Compute nodes. Network node that used to provide North-South connectivity will no longer be needed.
First, we need to make sure all Compute nodes have a bridge that would provide access to external provider networks. In my case, I’ll move the eth1 interface under the OVS br-ex on all Compute nodes.
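The OVS side of this change can be sketched as follows; in practice you would also make it persistent via ifcfg files:

ovs-vsctl --may-exist add-br br-ex
ovs-vsctl --may-exist add-port br-ex eth1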
The IP address needs to be moved to the br-ex interface. The example below is for Compute node #2:
At the same time, the OVS configuration on the Network/Controller node will need to be completely wiped. Once that’s done, we can remove the Neutron OVS package from all nodes.
Now everything is ready for OVN installation. The first step is to install the kernel module and upgrade the existing OVS package. A reboot may be needed in order for the correct kernel module to be loaded.
Now we can install OVN. Controllers will be running the ovn-northd process, which can be installed as follows:
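Assuming the RPMs built earlier have been copied to the node, the central components can be installed along these lines (package names follow the openvswitch-ovn-* convention used by the OVS spec file of that era):

yum -y install ./openvswitch-ovn-common-*.rpm ./openvswitch-ovn-central-*.rpm
systemctl enable ovn-northd
systemctl start ovn-northd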
The following packages install the ovn-controller on all Compute nodes:
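On the compute nodes the host package is installed instead, and ovn-controller is pointed at the central Southbound database. The commands below are a sketch; the controller and tunnel IPs are examples, not this lab’s exact values:

yum -y install ./openvswitch-ovn-common-*.rpm ./openvswitch-ovn-host-*.rpm
ovs-vsctl set Open_vSwitch . \
  external_ids:ovn-remote="tcp:192.168.91.20:6642" \
  external_ids:ovn-encap-type=geneve \
  external_ids:ovn-encap-ip=10.0.1.10
systemctl enable ovn-controller
systemctl start ovn-controller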
The last thing to install is the OVN ML2 plugin, a Python library that allows Neutron to talk to the OVN Northbound database.
Now that we have all the required packages in place, it’s time to reconfigure Neutron to start using OVN instead of the default openvswitch plugin. The installation procedure is described in the official Neutron integration guide. At the end, once we’ve restarted ovn-northd on the controller and ovn-controller on the compute nodes, we should see the following output on the controller node:
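The original output is not reproduced here, but the registration can be verified with the Southbound database CLI:

# Each compute node should appear as a Chassis entry with its encap type and tunnel IP
ovn-sbctl show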
This means that all instances of a distributed OVN controller located on each compute node have successfully registered with Southbound OVSDB and provided information about their physical overlay addresses and supported encapsulation types.
At this point in time there’s no way to automate OVN deployment with Packstack (TripleO already has OVN integration templates). For those who want to bypass the manual build process I have created a new Chef cookbook automating all the steps described above. This cookbook assumes that the OpenStack environment has been built as described in my earlier post. Optionally, you can automate the build of the underlay network as well by following my other post. Once you’ve got both OpenStack and the underlay built, you can use the following scripts to build, install and configure OVN:
Now we should be able to create a test topology with two tenant subnets and an external network interconnected by a virtual router.
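A rough equivalent using the Neutron CLI of that release would look like this; the names, CIDRs and provider network details are made up for illustration:

neutron net-create net-blue
neutron subnet-create --name subnet-blue net-blue 10.0.0.0/24
neutron net-create net-red
neutron subnet-create --name subnet-red net-red 10.0.1.0/24
neutron net-create ext-net --router:external \
  --provider:network_type flat --provider:physical_network extnet
neutron subnet-create --name ext-subnet --disable-dhcp ext-net 192.168.91.0/24
neutron router-create R1
neutron router-interface-add R1 subnet-blue
neutron router-interface-add R1 subnet-red
neutron router-gateway-set R1 ext-net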
When we attach a few test VMs to each subnet we should be able to successfully ping between the VMs, assuming the security groups are set up to allow ICMP/ND.
In the next post we will use the above virtual topology to explore the dataplane packet flow inside an OVN-managed OpenvSwitch and how it uses the new encapsulation protocol GENEVE to optimise egress forwarding lookups on remote compute nodes.
]]>Those who read my blog regularly know that I’m a big fan of a network simulator called UnetLab. For the last two years I’ve done all my labbing in UNL and have been constantly surprised by how extensible and stable it has been. I believe that projects like this are very important to our networking community because they help train the new generation of network engineers and enable them to expand their horizons. Recently the UnetLab team decided to take the next step and create a new version of UNL. This new project, called EVE-NG, will help users build labs of any size and run full replicas of their production networks, which is ideal for pre-deployment testing of network changes. If you want to learn more, check out the EVE-NG page on indiegogo.
Back to the business at hand: vQFX is not publicly available yet but is expected to pop up at Juniper.net some time in the future. Similar to the recently released vMX, vQFX will consist of two virtual machines - one running the routing engine (RE) and a second simulating the ASIC forwarding pipelines (PFE). You can find more information about these images on Juniper’s Github page. The images are distributed in multiple formats, but in the context of this post we’ll only deal with two VMDK files:
To be able to use these images in UnetLab, we first need to convert them to qcow2 format and copy them to the directory where UNL stores all its qemu images:
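A sketch of that conversion, with made-up file and directory names (UNL derives the template type from the directory name prefix):

mkdir -p /opt/unetlab/addons/qemu/vqfxre-15.1X53 /opt/unetlab/addons/qemu/vqfxpfe-15.1X53
qemu-img convert -f vmdk -O qcow2 vqfx10k-re.vmdk  /opt/unetlab/addons/qemu/vqfxre-15.1X53/hda.qcow2
qemu-img convert -f vmdk -O qcow2 vqfx10k-pfe.vmdk /opt/unetlab/addons/qemu/vqfxpfe-15.1X53/hda.qcow2
/opt/unetlab/wrappers/unl_wrapper -a fixpermissions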
Next, we need to create new node definitions for RE and PFE VMs. The easiest way would be to clone the linux node type:
Now let’s add the QFX to the list of nodes by modifying the following file:
Optionally, /opt/unetlab/html/includes/__node.php can be modified to change the default naming convention, similar to the vmx node.
Once you’ve done all the above changes, you should have a working vQFX 10k node available in UNL GUI. For the purpose of demonstration of EVPN features I’ve created the following topology:
EVPN standards define multiple route types to distribute NLRI information across the network. The two most “significant” route types are 2 and 5. Type-2 NLRI was designed to carry the MAC (and optionally IP) address to VTEP IP binding information which is used to populate the dynamic MAC address table. This function, previously accomplished by a central SDN controller, is now performed in a scalable, standards-based, controller-independent fashion. Type-5 NLRI contains IP Prefix to VTEP IP mappings and is similar in function to traditional L3 VPNs. In order to explore the capabilities of the EVPN implementation on vQFX I’ve created an artificial scenario with 3 virtual switches, 3 VLANs and 4 hosts.
VLAN10 (green) is present on all 3 switches, VLAN20 (purple) is only configured on switches 1 and 2 and VLAN88 (red) only exists on SW3. I’ve provided configuration snippets below for reference purposes, and only for SW1. The remaining switches are configured similarly.
Once all the nodes have been configured, we can have a closer look at the traffic flows, specifically at how packets are being forwarded and where the L2 and L3 lookups take place.
Traffic from H1 to H2 will never leave its own broadcast domain. As soon as the packet hits the incoming interface of SW1, MAC address lookup occurs pointing to the remote VTEP interface of SW2.
Once SW2 decapsulates the packet, the lookup in the MAC address table returns the locally connected interface, where it gets forwarded next.
The route to 8.8.8.0/24 is advertised by SW3 in type-5 NLRI
This NLRI doesn’t contain any overlay gateway address, however it does have a special “router-mac” community carrying SW3’s globally unique chassis MAC. This MAC is advertised as a normal type-2 MAC route and points to the remote VTEP interface of SW3:
The above two pieces of information are fed into our EVPN-VRF routing table to produce the entry with the following parameters:
This is an example of how “symmetric” IRB routing is performed. Instead of routing the packet at the ingress node and switching at the egress node, as was done in the case of Neutron’s DVR, the routing is performed twice. First the packet is routed into a “transit” VNI 5555, which glues all the switches in the same EVI together from the L3 perspective. Once the packet reaches the destination node, it gets routed into the intended VNI (5088 in our case) and forwarded out the local interface. This way switches may have different sets of VLANs and IRBs and still be able to route packets between VXLANs.
As you may have noticed, the green broadcast domain extends to all three switches, even though hosts are only attached to the first two. Let’s see how this affects the packet flows. The flow from H1 to H4 will be similar to the one from H3 to H4 described above. However, return packets will get routed on SW3 directly into VXLAN5010, since that switch has an IRB.10 interface, and then switched all the way to H1.
This is an example of “asymmetric” routing, similar to the one exhibited by Neutron DVR. You would see similar behaviour if you examined the flow between H3 and H2.
So why all the hassle of configuring EVPN on data centre switches? For one, you can get rid of MLAG on TOR switches and replace it with EVPN multihoming. However, the main benefit is that you can stretch L2 broadcast domains across your whole data centre without the need for an SDN controller. So, for example, we can now easily satisfy the requirement of having the external floating IP network on all compute nodes introduced by Neutron DVR. EVPN-enabled switches can also now perform functions similar to DC gateway routers (the likes of ASR, MX or SR) while giving you the benefits of horizontal scaling of Leaf/Spine networks. As more and more vendors introduce EVPN support, it is poised to become the ultimate DC routing protocol, complementing the functions already performed by host-based virtual switches, and with all the DC switches running BGP already, introducing EVPN may be as easy as enabling a new address family.
]]>To be honest I was a little hesitant to write this post because the topic of Neutron’s DVR has already been exhaustively covered by many, including Assaf Muller, Eran Gampel and in the official OpenStack networking guide. The coverage of the topic was so thorough that I barely had anything to add. However I still decided to write a DVR post of my own for the following two reasons:
The topic of Neutron’s DVR is quite vast so I had to compromise between the length of this post and the level of details. In the end, I edited out most of the repeated content and replaced it with references to my older posts. I think I left everything that should be needed to follow along the narrative so hopefully it won’t seem too patchy.
Let’s see what we’re going to be dealing with in this post. This is a simple virtual topology with two VMs sitting in two different subnets. VM1 has a floating IP assigned that is used for external access.
Before we get to the packet walk details, let me briefly describe how to build the above topology using Neutron CLI. I’ll assume that OpenStack has just been installed and nothing has been configured yet, effectively we’ll pick up from where we left our lab in the previous post.
Using the technique described in my earlier post I’ve collected the dynamically allocated port numbers and created a physical representation of our virtual network.
For the sake of brevity I will omit the verification commands. The traffic flow between VM1 and VM2 will follow the standard path that I’ve explored in my native Neutron SDN post.
It is obvious that in this case traffic flows are suboptimal. Instead of going directly between the peer compute nodes, the packet has to hairpin through a Neutron router. This adds to the end-to-end latency and creates unnecessary load on the Network node. These are some of the main reasons why Distributed Virtual Routing was introduced in OpenStack Juno.
Enabling DVR requires configuration changes to multiple files on all OpenStack nodes. At a high level, all compute nodes will now run Neutron’s L3-agent service, which will be responsible for provisioning of DVR and other auxiliary namespaces. The details of the specific configuration options that need to be enabled can be found in the official OpenStack Networking guide. As usual, I’ve incorporated all the necessary changes into a single Chef cookbook, so in order to enable DVR in our lab all you need to do is run the following commands from the UNetLab VM:
Once all the changes have been made, we need to either create a new router or update the existing one to enable the DVR functionality:
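With the admin credentials loaded, converting the existing router looks roughly like this (R1 is the router from our topology; the admin-state toggle is needed before the distributed flag can be changed):

neutron router-update R1 --admin-state-up False
neutron router-update R1 --distributed True
neutron router-update R1 --admin-state-up True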
Now let’s see how the traffic flows have changed with the introduction of DVR.
We’re going to be examining the following traffic flow:
R1 now has an instance on all compute nodes that have VMs in the BLUE or RED networks. That means that VM1 will send a packet directly to the R1’s BLUE interface via the integration bridge.
This is the dynamically populated MAC address table of the integration bridge. You can see that the MAC address of VM1 and both interfaces of R1 have been learned. That means that when VM1 sends a packet to its default gateway’s MAC address, it will go directly to R1’s BLUE interface on port 4.
In this post I will omit the details of the ARP resolution process, which remains the same as before, however there’s one interesting detail worth mentioning before we move on. During the initial flood-and-learn phase on the br-int, the ARP request will get flooded down to the tunnel bridge. As per the standard behaviour, the packet should get replicated to all nodes. However, in this case we don’t want to hear responses from other nodes, since the router is hosted locally. To prevent that, tunnel bridges explicitly drop all packets coming from integration bridges and destined for MAC addresses of locally hosted routers:
Getting back to our traffic flow, once the IP packet has reached the DVR instance of R1 on compute node #2, the routing lookup occurs and the packet is sent back to the integration bridge with a new source MAC of R1’s RED interface.
Tunnel bridge will do its usual work by locating the target compute node based on the destination MAC address of VM2 (DVR requires L2 population to be enabled) and will send the packet directly to the compute node #3.
Since all instances of R1 have the same set of IP/MAC addresses, the MAC address of a local router can be learned by the remote integration bridge hosting the same instance of DVR. In order to prevent that from happening, the sending br-tun replaces the source MAC address of the frame with the set_field:fa:16:3f:d3:10:60->eth_src action. This way the real R1’s MAC address gets masked as the frame leaves the node. These “mask” MACs are generated by and learned from the Neutron server, which ensures that each node gets a unique address.
The receiving node’s br-tun will swap the VXLAN header with a VLAN ID and forward the frame up to the integration bridge.
The integration bridge of compute node #3 will look up the destination MAC address and send the packet out port 2.
The reverse packet flow is similar - the packet will get routed on the compute node #3 and sent in a BLUE network to the compute node #2.
External connectivity will be very different for VMs with and without a floating IP. We will examine each case individually.
External connectivity for VMs with no floating IP is still performed by the Network node. This time however, NATing is performed by a new element - SNAT namespace. As per the normal behaviour, VM2 will send a packet to its default gateway first. Let’s have a closer look at the routing table of the DVR:
There’s no default route in the main routing table, so how would it get routed out? DVRs extensively use the Linux routing policy database (RPDB), a feature that has a lot in common with OpenFlow tables. The principle of RPDB is that every packet gets matched against a set of routing tables until there’s a hit. The tables are checked in the order of their priority (lowest to highest). One of the main features of RPDB is the ability to perform matches based on something other than the destination IP address, which is why it’s often referred to as policy-based routing. To view the contents of RPDB use the ip rule command under the DVR namespace:
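A quick way to reproduce this on the compute node, without hard-coding the namespace UUID:

ROUTER_NS=$(ip netns | grep -o 'qrouter-[0-9a-f-]*' | head -1)
ip netns exec "$ROUTER_NS" ip rule show
ip netns exec "$ROUTER_NS" ip route show table 167772161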
In our case table 167772161 matches all packets sourced from the BLUE subnet and if we examine the corresponding routing table we’ll find the missing default route there.
The next hop of this default route points to the SNAT’s interface in the BLUE network. MAC address is statically programmed by the local L3-agent.
Integration bridge sends the packet out port 1 to the tunnel bridge.
Tunnel bridge finds the corresponding match and sends the VXLAN-encapsulated packet to the Network node.
Tunnel bridge of the Network node forwards the frame up to the integration bridge.
Integration bridge sends the frame to port 10, which is where SNAT namespace is attached
SNAT is a namespace with an interface in each of the subnets - BLUE, RED and External subnet
SNAT has a single default route pointing to the External network’s gateway.
Before sending the packet out, iptables will NAT the packet to hide it behind SNAT’s qg external interface IP.
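The translation rule itself can be examined on the Network node like this (namespace name resolved dynamically rather than copied by hand):

SNAT_NS=$(ip netns | grep -o 'snat-[0-9a-f-]*' | head -1)
ip netns exec "$SNAT_NS" iptables -t nat -S | grep -i snat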
The first step in this scenario is the same - VM1 sends a packet to the MAC address of its default gateway. As before, the default route is missing in the main routing table.
Looking at the ip rule configuration we can find that table 16 matches all packets from that particular VM (10.0.0.12).
Routing table 16 sends the packet via a point-to-point veth pair link to the FIP namespace.
Before sending the packet out, DVR translates the source IP of the packet to the FIP assigned to that VM.
A FIP namespace is a simple router designed to connect multiple DVRs to the external network. This way all routers can share the same “uplink” namespace and don’t have to consume valuable addresses from the external subnet.
Default route inside the FIP namespace points to the External subnet’s gateway IP.
The MAC address of the gateway is statically configured by the L3 agent.
The packet is sent to the br-int with the destination MAC address of the default gateway, which is learned on port 3.
External bridge strips the VLAN ID of the packet coming from the br-int and does the lookup in the dynamic MAC address table.
The frame is forwarded out the physical interface.
Reverse packet flow will be quite similar, however in this case FIP namespace must be able to respond to ARP requests for the IPs that only exist on DVRs. In order to do that, it uses a proxy-ARP feature. First, L3 agent installs a static route for the FIP pointing back to the correct DVR over the veth pair interface:
Now that the FIP namespace knows the route to the floating IP, it can respond to ARPs on behalf of DVR as long as proxy-ARP is enabled on the external fg interface:
Finally, the DVR NATs the packet back to its internal IP in the BLUE subnet and forwards it straight to VM1.
Without a doubt DVR has introduced a number of much needed improvements to OpenStack networking:
However, there’s a number of issues that either remain unaddressed or result directly from the current DVR architecture:
Some of the above issues are not critical and can be fixed with a little effort:
However, the main issue still remains unresolved. Every North-South packet has to hop several times between the global and DVR/FIP/SNAT namespaces. These kinds of operations are very expensive in terms of consumed CPU and memory resources and can be very detrimental to network performance. Using namespaces may be the most straightforward and non-disruptive way of implementing DVR, however it’s definitely not the most optimal. Ideally we’d like to see both L2 and L3 pipelines implemented in OpenvSwitch tables. This way all packets can benefit from OVS fast-path flow caching. But fear not, the solution to this already exists in the shape of Open Virtual Network. OVN is a project spawned from OVS and aims to address a number of shortcomings existing in current implementations of virtual networks.
]]>In the last post we’ve seen how to use Chef to automate the build of a 3-node OpenStack cloud. The only thing remaining is to build an underlay network supporting communication between the nodes, which is what we’re going to do next. The build process will, again, be relatively simple and will include only a few manual steps, but before we get there let me go over some of the decisions and assumptions I’ve made in my network design.
The need to provide more bandwidth for East-West traffic has made the Clos Leaf-Spine architecture a de facto standard in any data centre network design. The use of virtual overlay networks has obviated the requirement to have a strict VLAN and IP numbering schemes in the underlay. The only requirement for the compute nodes now is to have any-to-any layer 3 connectivity. This is how the underlay network design has converged to a Layer 3 Leaf-Spine architecture.
The choice of a routing protocol is not so straight-forward. My fellow countryman Petr Lapukhov and co-authors of RFC draft claim that having a single routing protocol in your WAN and DC reduces complexity and makes interoperability and operations a lot easier. This draft presents some of the design principles that can be used to build a L3 data centre with BGP as the only routing protocol. In our lab we’re going to implement a single “cluster” of the multi-tier topology proposed in that RFC.
In order to help us build this in an automated and scalable way, we’re going to use a relatively new feature called unnumbered BGP.
As we all know, one of the main advantages of interior gateway protocols is the automatic discovery of adjacent routers, which is accomplished with the help of link-local multicasts. On the other hand, BGP traditionally required you to explicitly define the neighbor’s IP address in order to establish a peering relationship with it. This is where IPv6 comes to the rescue. With the help of the neighbor discovery protocol and router advertisement messages, it becomes possible to accurately determine the address of the peer BGP router on an intra-fabric link. The only question is how we would exchange IPv4 information over an IPv6-only BGP network.
RFC 5549 describes an “extended nexthop encoding capability” which allows BGP to exchange routing updates with nexthops that don’t belong to the address family of the advertised prefix. In plain English it means that BGP is now capable of advertising an IPv4 prefix with an IPv6 nexthop. This makes it possible to configure all transit links inside the Clos fabric with IPv6 link-local addresses and still maintain reachability between the edge IPv4 host networks. Since nexthop IPs will get updated at every hop, there is no need for an underlying IGP to distribute them between all BGP routers. What we see is, effectively, BGP absorbing the functions of an IGP protocol inside the data centre.
In order to implement BGP unnumbered on Cumulus Linux all you need to do is:
An example Quagga configuration snippet will look like this:
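Since the original snippet is not reproduced here, the sketch below shows roughly what such a stanza looks like; the interface names and ASN are invented:

cat <<'EOF' | sudo tee -a /etc/quagga/Quagga.conf
router bgp 65001
 bgp router-id 10.0.0.1
 neighbor swp1 interface
 neighbor swp1 remote-as external
 neighbor swp2 interface
 neighbor swp2 remote-as external
 address-family ipv4 unicast
  redistribute connected
 exit-address-family
EOF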
As you can see, Cumulus simplifies it even more by allowing you to only specify the BGP peering type (external/internal) and learning the value of peer BGP AS dynamically from a neighbor.
With all the above in mind, this is the list of decisions I’ve made while building the fabric configuration:
Picking up where we left off after the OpenStack node provisioning described in the previous post
Get the latest OpenStack lab cookbooks
git clone https://github.com/networkop/chef-unl-os.git
cd chef-unl-os
Download and import Cumulus VX image similar to how it’s described here.
/opt/unetlab/addons/qemu/cumulus-vx/hda.qcow2
Build the topology inside UNL. Make sure that Node IDs inside UNL match the ones in chef-unl-os/environment/lab.rb file and that interfaces are connected as shown in the diagram below
Re-run UNL self-provisioning cookbook to create a zero touch provisioning file and update DHCP server configuration with static entries for the switches.
chef-client -z -E lab -o pxe
Cumulus ZTP allows you to run a predefined script on the first boot of the operating system. In our case we inject a UNL VM’s public key and enable passwordless sudo for cumulus user.
Kickoff Chef provisioning to bootstrap and configure the DC fabric.
chef-client -z -E lab fabric.rb
This command instructs Chef provisioning to connect to each switch, download and install the Chef client and run a simple recipe to create quagga configuration file from a template.
At the end of step 5 we should have a fully functional BGP-only fabric and all 3 compute nodes should be able to reach each other in at most 4 hops.
Now that I’m finally beginning to settle down at my new place of residence I can start spending more time on research and blogging. I have left off right before I was about to start exploring the native OpenStack distributed virtual routing function. However as I’d started rebuilding my OpenStack lab from scratch I realised that I was doing a lot of repetitive tasks which can be easily automated. Couple that with the fact that I needed to learn Chef for my new work and you’ve got this blogpost describing a few Chef cookbooks (similar to Ansible’s playbook) automating all those manual steps described in my earlier blogposts 1 and 2.
In addition to that in this post I’ll show how to build a very simple OpenStack baremetal provisioner and installer. Some examples of production-grade baremetal provisioners are Ironic, Crowbar and MAAS. In our case we’ll turn UNetLab VM into an undercloud, a server used to provision and deploy our OpenStack lab, an overcloud. To do that we’ll first install and configure DHCP, TFTP and Apache servers to PXE-boot our UNL OpenStack nodes. Once all the nodes are bootstrapped, we’ll use Chef to configure the server networking and kickoff the packstack OpenStack installer.
In this post I’ll try to use Chef recipes that I’ve written as much as possible, therefore you won’t see the actual configuration commands, e.g. how to configure Apache or DHCP servers. However I will try to describe everything that happens at each step and hopefully that will provide enough incentive for the curious to look into the Chef code and see how it’s done. To help with the Chef code understanding let me start with a brief overview of what to look for in a cookbook.
A cookbook directory (/cookbooks/[cookbook_name]) contains all its configuration scripts in /recipes. Each file inside a recipe contains a list of steps to be performed on a server. Each step is an operation (add/delete/update) on a resource. Here are some of the common Chef resources:
Just these three basic resources allow you to do 95% of administrative tasks on any server. Most importantly they do it in platform-independent (any flavour of Linux) and idempotent (only make changes if current state is different from a desired state) way. Other directories you might want to explore are:
If you haven’t done it yet, download a copy of the UNetLab VM from the official website. Set it up inside your hypervisor so that you can access Internet through the first interface pnet0 (i.e. connect the first NIC of the VM to hypervisor’s NAT interface). Make sure the VM has got at least 6GB of RAM and VT-x support enabled for nested virtualization.
Follow the official installation instructions to install Chef Development Kit inside UNetLab VM.
wget https://packages.chef.io/stable/ubuntu/12.04/chefdk_0.16.28-1_amd64.deb
dpkg -i chefdk_0.16.28-1_amd64.deb
Install git and clone chef cookbooks.
apt-get -y update
apt-get -y install git
git clone https://github.com/networkop/chef-unl-os.git
cd chef-unl-os
Examine the lab environment settings to see what values are going to be used. You can modify that file to your liking.
Note that the OpenStack node IDs (keys of os_lab hash) MUST have one to one correspondence with the UNL node IDs which will be created at step 5
cat environment/lab.rb
Run Chef against a local server to setup the baremetal provisioner. This step installs and configures DHCP, TFTP and Apache servers. It also creates all the necessary PXE-boot and kickstart files based on our environment settings. Note that a part of the process is the download of a 700MB CentOS image so it might take a while to complete.
chef-client -z -E lab -o pxe
At the start of the PXE-boot process, the DHCP server sends an OFFER which, along with the standard IP information, includes the name of the PXE boot image and the IP address of the TFTP server to get it from. A server loads this image and then searches the TFTP server for the boot configuration file, which tells it what kernel to load and where to get a kickstart file. Both the kickstart and the actual installation files are accessed via HTTP and served by the same Apache server that runs the UNL GUI.
From UNL GUI create a new lab, add 3 OpenStack nodes and connect them all to pnet10 interface as described in this guide. Note that the pnet10 interface has already been created by Chef so you don’t have to re-create it again.
Make sure that the UNL node IDs match the ones defined in the environment setting file
Fire-up the nodes and watch them being bootstrapped by our UNL VM.
Next step is to configure the server networking and kickoff the OpenStack installer. These steps will also be done with a single command:
chef-client -z -E lab lab.rb
At the end of these steps you should have a fully functional 3-node OpenStack environment.
This is a part of a 2-post series. In the next post we’ll look into how to use the same tools to perform the baremetal provisioning of our physical underlay network.
]]>Since I have all my OpenStack environment running inside UNetLab, it makes it really easy for me to extend my L3 fabric with a switch from another vendor. In my previous posts I’ve used Cisco and Arista switches to build a 4-leaf 2-spine CLOS fabric. For this task I’ve decided to use a Cumulus VX switch which I’ve downloaded and imported into my lab.
To simulate the baremetal server (10.0.0.100) I’ve VRF’d an interface on Arista “L4” switch and connected it directly to a “swp3” interface of the Cumulus VX. This is not shown on the diagram.
L2 Gateway is a relatively new service plugin for OpenStack Neutron. It provides the ability to interconnect a given tenant network with a VLAN on a physical switch. There are three main components that compose this solution:
Note that in our case both network and control nodes are running on the same VM.
Cumulus Linux is a Debian-based distribution, therefore most of the basic networking configuration will be similar to how things are done in Ubuntu. First, let’s start by configuring basic IP addressing on the Loopback (VTEP IP), Eth0 (OOB management), swp1 and swp2 (fabric) interfaces.
Next, let’s enable OSPF
Once OSPFd is running, we can use sudo vtysh to connect to the local quagga shell and finalise the configuration.
At this stage our Cumulus VX switch should be fully adjacent to both spines and its loopback IP (10.0.0.5) should be reachable from all OpenStack nodes.
The final step is to enable the hardware VTEP functionality. The process is fairly simple and involves only a few commands.
The last command runs a bootstrap script that does the following things:
By now you’re probably wondering what that hardware VTEP OVSDB schema is and how it’s different from a normal OVS schema. First of all, remember that OVSDB is just a database and the OVSDB protocol is just a set of JSON RPC calls to work with that database. Information that can be stored in the database is defined by a schema - a structure that represents tables and their relations. Therefore, OVSDB can be used to store and manage ANY type of data, which makes it very flexible. Specifically, the OVS project defines two OVSDB schemas:
The information from these databases is later consumed by another process that sets up the actual bridges and ports. The first schema is used by the ovs-vswitchd process running on all compute hosts to configure ports and flows of the integration and tunnel bridges. In the case of a Cumulus switch, the information from the hardware_vtep OVSDB is used by a process called ovs-vtepd that is responsible for setting up VXLAN VTEP interfaces, provisioning of VLANs on physical switchports and interconnecting them with a Linux bridge.
If you want to learn more, check out this awesome post about hardware VTEP and OVS.
Most of the following procedure has been borrowed from another blog. It’s included in this post because I had to make some modifications and also for the sake of completeness.
Clone the L2GW repository
git clone -b stable/mitaka https://github.com/openstack/networking-l2gw.git
Use pip to install the plugin
pip install ./networking-l2gw/
Enable the L2GW service plugin
sudo sed -ri 's/^(service_plugins.*)/\1,networking_l2gw.services.l2gateway.plugin.L2GatewayPlugin/' \
/etc/neutron/neutron.conf
Copy L2GW configuration files into the neutron configuration directory
cp /usr/etc/neutron/l2g* /etc/neutron/
Point the L2GW plugin to our Cumulus VX switch
sudo sed -ri "s/^#\s+(ovsdb_hosts).*/\1 = 'ovsdb1:192.168.91.21:6632'/" /etc/neutron/l2gateway_agent.ini
Update Neutron database with the new schema required by L2GW plugin
systemctl stop neutron-server
neutron-db-manage --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/l2gw_plugin.ini upgrade head
systemctl start neutron-server
Update Neutron startup script to load the L2GW plugin configuration file
sed -ri "s/(ExecStart=.*)/\1 --config-file \/etc\/neutron\/l2gw_plugin.ini /" /usr/lib/systemd/system/neutron-server.service
Create a L2GW systemd unit file
cat >> /usr/lib/systemd/system/neutron-l2gateway-agent.service << EOF
[Unit]
Description=OpenStack Neutron L2 Gateway Agent
After=neutron-server.service
[Service]
Type=simple
User=neutron
ExecStart=/usr/bin/neutron-l2gateway-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/l2gateway_agent.ini
KillMode=process
[Install]
WantedBy=multi-user.target
EOF
Restart both L2GW and neutron server
systemctl daemon-reload
systemctl restart neutron-server.service
systemctl start neutron-l2gateway-agent.service
Enter the “neutron configuration mode”
source ~/keystone_admin
neutron
Create a new L2 gateway device
l2-gateway-create --device name="L5",interface_names="swp3" CUMULUS-L2GW
Create a connection between a “private_network” and a native vlan (dot1q 0) of swp3 interface
l2-gateway-connection-create --default-segmentation-id 0 CUMULUS-L2GW private_network
At this stage everything should be ready for testing. We’ll start by examining the following traffic flow:
The communication starts with VM-2 sending an ARP request for the MAC address of the baremetal server. Packet flow inside the compute host will be exactly the same as before, with packet being flooded from the VM to the integration and tunnel bridges. Inside the tunnel bridge the packet gets resubmitted to table 22 where head-end replication of ARP request takes place.
The only exception is that this time the frame will get replicated to a new VXLAN port pointing towards the Cumulus VTEP IP. We’ll use the ovs-appctl ofproto/trace command to see the full path a packet takes inside OVS, which is similar to the packet-tracer command on a Cisco ASA. To simulate an ARP packet we need to specify the incoming port (in_port), EtherType (arp), the internal VLAN number for our tenant (dl_vlan) and the ARP request target IP address (arp_tpa). You can find the full list of fields that can be matched in this document.
The packet leaving port 9 will get encapsulated into a VXLAN header with destination IP of 10.0.0.5 and forwarded out the fabric-facing interface eth1.100. When VXLAN packet reaches the vxln69 interface (10.0.0.5) of the Cumulus switch, the br-vxlan69 Linux bridge floods the frame out the second connected interface - swp3.
The rest of the story is very simple. When ARP packet hits the baremetal server it populates its ARP cache. A unicast response travels all the way back to the Cumulus switch, gets matched by the static MAC (0e:14) entry created based on information provided by the L2GW plugin. This entry points to the VTEP IP of Compute host 2(10.0.2.10) which is where it gets forwarded next.
The packet travels through compute host 2, populating the flow entries of all OVS bridges along the way. These entries are then used by subsequent unicast packets travelling from VM-2.
It all looks fine until the ARP cache of the baremetal server expires and you get an ARP request coming from the physical into the virtual world. There is a known issue with BUM forwarding which requires a special service node to perform the head-end replication. The idea is that a switch that needs to flood a multicast packet, would send it to a service node which keeps track of all active VTEPs in the network and performs packet replication on behalf of the sender. OpenStack doesn’t have a dedicated service node, however it is possible to trick the network node into performing a similar functionality, which is what I’m going to demonstrate next.
First of all, we need to tell our Cumulus switch to send all multicast packets to the network node. To do that we need to modify the OVSDB table called “Mcast_Macs_Remote”. You can view the contents of the database using the ovsdb-client dump --pretty tcp:192.168.91.21:6632 command to make sure that this table is empty. Using the VTEP control command we need to force all unknown-dst (BUM) traffic to go to the network node (10.0.3.10). The UUID of the logical switch can be found with the sudo vtep-ctl list-ls command.
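The vtep-ctl invocation is along these lines, with the logical switch name captured into a variable instead of pasting its UUID by hand:

LS=$(sudo vtep-ctl list-ls | head -1)
sudo vtep-ctl add-mcast-remote "$LS" unknown-dst 10.0.3.10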
At this stage all BUM traffic hits the network node and gets flooded to the DHCP and the virtual router namespaces. In order to force this traffic to also be replicated to all compute nodes we can use some of the existing tables of the tunnel bridge. Before we do anything let’s have a look at the tables our ARP request has to go through inside the tunnel bridge.
We also have a default head-end replication table 22 which floods all BUM traffic received from the integration bridge to all VTEPs:
So what we can do is create a new flow entry that would intercept all ARP packets inside table 4 and resubmit them to tables 10 and 22. Table 10 will take our packet up to the integration bridge of the network node, since we still need to be able to talk to the virtual router and the DHCP namespace. Table 22 will receive a copy of the packet and flood it to all known VXLAN endpoints.
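A sketch of such a flow; the tunnel ID and internal VLAN are specific to this lab, so the values below are placeholders:

ovs-ofctl add-flow br-tun \
  "table=4,priority=2,arp,tun_id=0x46,actions=mod_vlan_vid:1,resubmit(,10),resubmit(,22)"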
We can once again use the trace command to see the ARP request flow inside the tunnel bridge.
Now we should be able to clear the ARP cache on the baremetal device and successfully ping VM-1, VM-2 and the virtual router.
The workaround presented above is just a temporary solution to the problem. In order to fix it properly, the OVS vtep schema needs to be updated to support source node replication. Luckily, the patch implementing this functionality was merged into the master OVS branch only a few days ago. So hopefully this update trickles down to Cumulus package repositories soon.
Despite all the issues, Neutron L2 gateway plugin is a cool project that provides a very important piece of functionality without having to rely on 3rd party SDN controllers. Let’s hope it will continue to be supported and developed by the community.
In the next post I was planning to examine another “must have” feature of any SDN solution - Distributed Virtual Routing. However due to my current circumstances I may need to take a few weeks' break before going on. Be back soon!
]]>Before we start, let’s recap the difference between the two major Neutron network types:
These two network types are not mutually exclusive. In our case the admin tenant network is implemented as a VXLAN-based overlay whose only requirement is to have a layer-3 reachability in the underlay. However tenant network could also have been implemented using a VLAN-based provider network in which case a set of dot1Q tags pre-provisioned in the underlay would have been used for tenant network segregation.
External network is used by VMs to communicate with the outside world (north-south). Since default gateway is located outside of OpenStack environment this, by definition, is a provider network. Normally, tenant networks will use the non-routable address space and will rely on a Neutron virtual router to perform some form of NAT translation. As we’ve seen in the earlier post, Neutron virtual router is directly connected to the external bridge which allows it to “borrow” ip address from the external provider network to use for two types of NAT operations:
In default deployments all NATing functionality is performed by a network node, so external provider network only needs to be L2 adjacent with a limited number of physical hosts. In deployments where DVR is used, the virtual router and NAT functionality gets distributed among all compute hosts which means that they, too, now need to be layer-2 adjacent to the external network.
The direct adjacency requirement presents a big problem for deployments where a layer-3 routed underlay is used for the tenant networks. There is a limited number of ways to satisfy this requirement, for example:
As I’ve said in my earlier post, I’ve built the leaf-spine fabric out of Cisco IOU virtual switches, however the plan was to start introducing other vendors later in the series. So this time for the border leaf role I’ve chosen Arista vEOS switch, however, technically, it could have been any other vendor capable of doing VXLAN-VLAN bridging (e.g. any hardware switch with Trident 2 or similar ASIC).
Configuration of Arista switches is very similar to Cisco IOS. In fact, I was able to complete all interface and OSPF routing configuration with the help of CLI context help alone. The only bit that was new to me and that I had to look up in the official guide was the VXLAN configuration. These similarities make the transition from Cisco to Arista very easy and I can understand (but not approve!) why Cisco would file a lawsuit against Arista for copying its “industry-standard CLI”.
Interface VXLAN1 sets up VXLAN-VLAN bridging between VNI 1000 and VLAN 100. VLAN 100 is used to connect to VMware Workstation’s host-only interface, the one that was previously connected directly to the L3 leaf switch. VXLAN interface does the multicast source replication by flooding unknown packets over the layer 3 fabric to the network node (10.0.3.10).
Since we don’t yet have the distributed routing feature enabled, the only OpenStack component that requires any changes is the network node. First, let’s remove the physical interface from the external bridge, since it will no longer be used to connect to the external provider network.
Next let’s add the VXLAN interface towards the Loopback IP address of the Arista border leaf switch. The key option sets the VNI which must be equal to the VNI defined on the border leaf.
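The command takes roughly this shape; the remote (Arista loopback) address is assumed here, while the key matches the VNI configured on the border leaf:

ovs-vsctl add-port br-ex vxlan1 -- set interface vxlan1 \
  type=vxlan options:remote_ip=10.0.0.4 options:key=1000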
Without any physical interfaces attached to the external bridge, the OVS will use the Linux network stack to find the outgoing interface. When a packet hits the vxlan1 interface of the br-ex, it will get encapsulated in a VXLAN header and passed on to the OS network stack where it will follow the pre-configured static route forwarding all 10/8 traffic towards the leaf-spine fabric. Check out this article if you want to learn more about different types of interfaces and traffic forwarding behaviours in OpenvSwitch.
In order to make changes persistent and prevent the static interface configuration from interfering with OVS, remove all OVS-related configuration and shutdown interface eth1.300.
None of the packet flows have changed as the result of this modification. All VMs will still use NAT to break out of the private environment, the NAT’d packets will reach the external bridge br-ex as described in my earlier post. However this time br-ex will forward the packets out the vxlan1 port which will deliver them to the Arista switch over the same L3 fabric used for east-west communication.
If we did a capture on the fabric-facing interface eth1 of the control node while running a ping from one of the VMs to the external IP address, we would see a VXLAN-encapsulated packet destined for the Loopback IP of L4 leaf switch.
In the next post we’ll examine the L2 gateway feature that allows tenant networks to communicate with physical servers through yet another VXLAN-VLAN hardware gateway.
]]>VXLAN standard does not specify any control plane protocol to exchange MAC-IP bindings between VTEPs. Instead it relies on data plane flood-and-learn behaviour, just like a normal switch. To force this behaviour in an underlay, the standard stipulates that each VXLAN network should be mapped to its own multicast address and each VTEP participating in a network should join the corresponding multicast group. That multicast group would be used to flood the BUM traffic in an underlay to all subscribed VTEPs thereby populating dynamic MAC address tables.
Default OpenvSwitch implementation does not support VXLAN multicast flooding and uses unicast source replication instead. This decision comes with a number of tradeoffs:
Despite all the tradeoffs, OVS with unicast source replication has become a de-facto standard in most recent OpenStack implementations. The biggest advantage of such approach is the lack of requirement for multicast in the underlay network.
Neutron server is aware of all active MAC and IP addresses within the environment. This information can be used to prepopulate forwarding entries on all tunnel bridges. This is accomplished by a L2 population driver. However that in itself isn’t enough. Whenever a VM doesn’t know the destination MAC address, it will send a broadcast ARP request which needs to be intercepted and responded by a local host to stop it from being flooded in the network. The latter is accomplished by a feature called ARP responder which simulates the functionality commonly known as ARP proxy inside the tunnel bridge.
Configuration of these two features is fairly straight-forward. First, we need to add L2 population to the list of supported mechanism drivers on our control node and restart the neutron server.
Next we need to enable L2 population and ARP responder features on all 3 compute nodes.
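Assuming an RDO-style file layout and crudini available on the nodes, the compute-side change can be sketched as:

crudini --set /etc/neutron/plugins/ml2/openvswitch_agent.ini agent l2_population True
crudini --set /etc/neutron/plugins/ml2/openvswitch_agent.ini agent arp_responder True
systemctl restart neutron-openvswitch-agent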
Since L2 population is triggered by the port_up messages, we might need to restart both our VMs for the change to take effect.
Now let’s once again examine what happens when VM-1 issues an ARP request for VM-2’s MAC address (1a:bf).
First, the frame hits the flood-and-learn rule of the integration bridge and gets flooded down to the tunnel bridge as described in the previous post. Once in the br-tun, the frame gets matched by the incoming port and resubmitted to table 2. In addition to a default unicast/multicast bit match, table 2 now also matches all ARP requests and resubmits them to the new table 21. Note how the ARP entry has a higher priority so it always matches before the default catch-all multicast rule.
Inside table 21 are the entries created by the ARP responder feature. The following is an example entry that matches all ARP requests where target IP address field equals the IP of VM-2(10.0.0.9).
The resulting action builds an ARP response by modifying the fields and headers on the original ARP request message, specifically OVS:
Now that VM-1 has learned the MAC address of VM-2 it can start sending unicast frames. The first few steps will again be the same. The frame hits the tunnel bridge, gets classified as a unicast and gets resubmitted to table 20. Table 20 will still have an entry generated by a learn action triggered by a packet coming from VM-2, however now it also has an identical entry with a higher priority (priority=2), which was preconfigured by the L2 population feature.
The two features described in this post only affect the ARP traffic to VMs known to the Neutron server. All the other BUM traffic will still be flooded as described in the previous post.
As a result of enabling the L2 population and ARP responder features we were able to reduce the amount of BUM traffic in the overlay network and eliminate the processing on compute hosts incurred by ARP request flooding.
However one downside of this approach is the increased number of flow entries in tunnel bridges of compute hosts. Specifically, for each known VM there now will be two entries in the tunnel bridge with different priorities. This may have negative impact on performance and is something to keep in mind when designing OpenStack solutions for scale.
In the next post I’ll show how to overcome the requirement of a direct L2 adjacency between the network node and external subnet. Specifically, I’ll use Arista switch to extend a L2 provider network over a L3 leaf-spine Cisco fabric.
]]>