Network-CI Part 3 - OSPF to BGP Migration in Active/Standby DC

Current network overview

Let’s start by taking a high-level look at our DC network routing topology. The core routing protocol is OSPF; it is responsible for distributing routing information between the Core and WAN layers of the network. The WAN layer consists of two MPLS L3VPN services running BGP as the PE-CE protocol and two DMVPN hubs running EIGRP. All WAN layer devices perform mutual redistribution between their respective WAN protocol and OSPF.
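
To make the redistribution point concrete, here is a minimal sketch of what the mutual redistribution on one of the MPLS WAN edge routers could look like. All AS numbers, process IDs, addresses and route-map names below are made up for illustration and are not taken from the lab configs.

    ! Hypothetical MPLS WAN edge router; AS numbers, process IDs and names are made up
    router ospf 1
     ! inject BGP-learned WAN routes into OSPF
     redistribute bgp 65000 subnets route-map BGP-TO-OSPF
    !
    router bgp 65000
     ! eBGP PE-CE session towards the MPLS provider
     neighbor 192.0.2.1 remote-as 65100
     ! inject OSPF-learned DC routes into BGP
     redistribute ospf 1 route-map OSPF-TO-BGP
    !
    ! tag routes as they enter OSPF and filter them on the way back to avoid loops
    route-map BGP-TO-OSPF permit 10
     set tag 65000
    route-map OSPF-TO-BGP deny 10
     match tag 65000
    route-map OSPF-TO-BGP permit 20

With two MPLS routers and two DMVPN hubs doing the same thing, that is four redistribution points to police, which is one of the reasons for the migration described next.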

Current network topology

Target network overview

The task is to replace OSPF with BGP as the core routing protocol inside the Data Centre. There are many advantages to using BGP inside a DC; in our case they are:

  • Enhanced traffic routing and filtering policies
  • Reduced number of redistribution points
  • Because Ivan Pepelnjak said so

We’re not going to get rid of OSPF completely, but rather reduce its role to simple distribution of internal DC prefixes. BGP will run on top of OSPF and distribute all the DC and WAN summary prefixes.
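
As a rough sketch of what this could look like on one of the core switches (all addresses, AS and process numbers below are invented for illustration): OSPF stays on to provide reachability to the loopbacks that the iBGP sessions are built on, while BGP originates the DC summary.

    ! Hypothetical DC core switch; addresses, AS and process numbers are illustrative
    router ospf 1
     ! OSPF is reduced to carrying internal DC prefixes, including the loopbacks
     network 10.0.0.0 0.255.255.255 area 0
    !
    ! anchor route so the DC summary can be originated into BGP
    ip route 10.1.0.0 255.255.0.0 Null0
    !
    router bgp 65000
     ! originate the DC summary
     network 10.1.0.0 mask 255.255.0.0
     ! iBGP sessions to the WAN routers, sourced from Loopback0
     neighbor 10.0.0.11 remote-as 65000
     neighbor 10.0.0.11 update-source Loopback0
     neighbor 10.0.0.12 remote-as 65000
     neighbor 10.0.0.12 update-source Loopback0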

Target network topology

Physical topology overview

Now let’s take a closer look at the network that we’re going to emulate. All green devices on the left-hand side constitute the Active Data Centre; that is where all the traffic will flow under normal conditions. Each green device has a red Standby counterpart, which will take over traffic forwarding if its green peer becomes unavailable.

Full demo topology

When simulating a real-life network it’s often impossible to fit an exact replica inside a network emulator, which is why using mock devices is a crucial part of every simulation. The function of a mock is to approximate a set of network devices. There are a number of mock devices on our diagram, colour-coded in purple, which simulate the remaining parts of the network. For example, the Cloud devices may represent TOR switches, while the MPLS/DMVPN devices represent remote WAN sites. Normally these devices will have some made-up configuration that best reflects real life, rather than a copy-paste from an existing network device.
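
For instance, a mock Cloud device could be a single router with a loopback standing in for the server subnets that would normally sit behind a set of TOR switches. Everything below (addresses, interface names, process numbers) is invented purely for illustration:

    ! Hypothetical mock "Cloud" device approximating a set of TOR switches
    interface Loopback1
     description Simulated server subnet
     ip address 10.1.1.1 255.255.255.0
     ! advertise the full /24 rather than a host route
     ip ospf network point-to-point
    !
    interface Ethernet0/0
     description Uplink to Primary-CORE-SW
     ip address 10.0.1.2 255.255.255.252
    !
    router ospf 1
     network 10.0.1.0 0.0.0.3 area 0
     network 10.1.1.0 0.0.0.255 area 0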

It’s also important to pick the right number of mock devices to strike a balance between accuracy and complexity. For example, for WAN sites it may suffice to create one site per unique combination of WAN links to make sure WAN failover works as expected.

Traffic flow definition

Let’s define how we expect traffic to flow through our network. Assume that we should always use the MPLS links when possible and only fall back to DMVPN when both MPLS links are down. This translates to the following order of WAN link precedence, with a sketch of one possible BGP policy shown after the list:

  1. Primary MPLS link
  2. Standby MPLS link
  3. Primary DMVPN link
  4. Standby DMVPN link
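
In BGP terms, one common way to enforce such an order (a sketch only, not necessarily how the lab configs do it) is to attach a descending local-preference to routes learned over each WAN link, for example on the primary MPLS router:

    ! Hypothetical inbound policy on the Primary-MPLS WAN router; values are illustrative
    route-map FROM-MPLS-PRIMARY permit 10
     ! highest local-preference = most preferred exit
     set local-preference 400
    !
    router bgp 65000
     neighbor 192.0.2.1 remote-as 65100
     neighbor 192.0.2.1 route-map FROM-MPLS-PRIMARY in
    !
    ! The standby MPLS router would set 300, and the DMVPN hubs 200 and 100
    ! (for example when redistributing EIGRP-learned routes into BGP).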

Based on this order of precedence we can create the following traffic flow definition for the network operating under normal conditions.

Scenario 1 - Failed devices: none
  From FW to MPLS-DMVPN via Primary-WAN, Primary-MPLS
  From FW to DMVPN-ONLY via Primary-CORE-SW, Primary-DMVPN
  From FW to MPLS-ONLY via Primary-WAN, Primary-MPLS
  From Cloud-1 to FW Loopback0 via Primary-CORE-SW
  From Cloud-2 to MPLS-DMVPN via Primary-WAN, Primary-MPLS

We expect all traffic to flow through active devices even when the path may be suboptimal, as is the case with traffic from Cloud-2. Similarly, we can create traffic flow definitions for different failure conditions. The complete traffic flow definition file contains two additional failure scenarios covering the outage of the primary MPLS link and a complete outage of the primary core switch.

Workflow example

This is how you would approach a project like this:

  1. Get a copy of the network-ci VM
  2. Get a local copy of the network-ci tools
  3. Copy configuration from real-life devices into the config directory
  4. Add configuration files for mock devices to the same directory
  5. Review the topology definition file to make sure it reflects our physical diagram
  6. Review the UNL configuration file to make sure it points to the correct IP address assigned to your network-ci VM
  7. Kick off the topology build inside UNL by running the ./0_built_topo.py script
  8. Verify that traffic flows as expected with the 2_test.py script
  9. Start real-time monitoring with the 1_monitor.py script
  10. Implement the required changes on individual devices (all required changes can be found in the ospf-bgp.txt file); a sketch of the kind of change involved is shown after this list
  11. Make sure that the network still behaves as before by running the 2_test.py script again
  12. Destroy the topology in UNL by running 3_destroy_topo.py
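
The snippet below is not the contents of ospf-bgp.txt; it is only a sketch of the kind of per-device change step 10 might involve, for example on a WAN edge router: bring up an iBGP session towards the core, originate the WAN summary, and stop redistributing BGP into OSPF. All numbers and addresses are made up.

    ! Illustration only - not the actual contents of ospf-bgp.txt
    router bgp 65000
     ! new iBGP session towards Primary-CORE-SW, sourced from Loopback0
     neighbor 10.0.0.1 remote-as 65000
     neighbor 10.0.0.1 update-source Loopback0
     ! originate the WAN summary prefix towards the DC
     aggregate-address 172.16.0.0 255.255.0.0
    !
    router ospf 1
     ! OSPF no longer carries the WAN routes
     no redistribute bgp 65000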

Continuous Integration

In the previous post I showed how to use Jenkins to set up the CI environment for a small demo network. The same method can be applied to set up the job for our small Data Centre; it is simply a matter of changing the directory name from acme-small to acme-large in the first build step.

Source code

All code from this and previous posts is available on GitHub.
