Network Simulation is Hard (Part 2/2)
This post is part two of two, where I’ll cover how you can use network simulation and emulation tools to validate changes, and the limits of each approach. Please read the previous post covering network simulation concepts for further context before reading the rest of this post.
Ok, What Can Be Tested?
As a reminder, the goal is pre-change network validation. What are examples of things that a network engineer might want to test? Here’s a few of the most common scenarios:
- Link up / link down
- Device failure
- Change IGP link metric
- Change BGP policy or redistribution
- Change in overlay or tunneling (SR, RSVP-TE, VxLAN, IPsec, SD-WAN, etc.)
- Change security or policy (stateful firewall rules, NAT, PBR, etc.)
- Test application traffic flowing through the network
We’ll work through a few of these scenarios to understand how well each one can be modeled, and where the limits lie.
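Before diving into the approaches, here’s a minimal, purely illustrative sketch (not any vendor’s tool) of what the first two scenarios look like against a toy topology: fail a link or a device, then ask whether reachability survives. Device names are hypothetical; a real model would be built from configs or collected state.

```python
from collections import deque

# Toy topology: adjacency list of device -> set of neighbours.
TOPOLOGY = {
    "core1": {"core2", "dist1"},
    "core2": {"core1", "dist2"},
    "dist1": {"core1", "dist2", "access1"},
    "dist2": {"core2", "dist1", "access1"},
    "access1": {"dist1", "dist2"},
}

def reachable(topo, src, dst):
    """Breadth-first search: can src reach dst over the current topology?"""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nbr in topo.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return False

def fail_link(topo, a, b):
    """Return a copy of the topology with the a<->b link removed."""
    new = {n: set(nbrs) for n, nbrs in topo.items()}
    new[a].discard(b)
    new[b].discard(a)
    return new

def fail_device(topo, dev):
    """Return a copy of the topology with a device (and all its links) removed."""
    return {n: nbrs - {dev} for n, nbrs in topo.items() if n != dev}

# Link down: access1 still reaches core1 via the redundant path.
print(reachable(fail_link(TOPOLOGY, "access1", "dist1"), "access1", "core1"))  # True
# Device failure: losing both distribution switches isolates access1.
both_down = fail_device(fail_device(TOPOLOGY, "dist1"), "dist2")
print(reachable(both_down, "access1", "core1"))  # False
```

Every approach below is, at heart, a more faithful version of this: build a model, perturb it, check the invariants you care about.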
What Are The Approaches?
The first step would be creating a model of the network as it exists right now. Even this is full of nuance. I’ve roughly lumped together the software solutions and open-source tools into three different approaches:
- Modeling configs to simulate current state
- Learning state and creating a simulation
- Creating a one-to-one network emulation
Let’s now cover each of these in turn.
Modeling Configs to Simulate Current State
One starting point could be to parse the device configurations and convert that into device state, such as routing table entries. This is the approach that Batfish takes. Opnet, Cariden and WANDL took a similar approach. LnetD NM takes a simpler approach of just parsing IGP information and converting it into state. You’re left with a network simulation in which you can make changes and predict their outcomes.
Pros:
- If your device and OS version and each config line is supported by the tool, and assuming the tool knows how to translate the config statement to routing state the same way the vendor implements it, you can have a model of your network as if it was freshly rebooted.
- You’re left with a vendor-agnostic routing model, meaning you can simulate link and device failures, routing protocol changes, and possibly overlay changes.
Cons:
- You must have 100% support of devices, OS versions and config statements to create a model of the network. This is time consuming to develop, and hence these tools are either very expensive (major vendors’ software solutions), or have limited vendor support (open-source tools).
- Does not include any routing information learned at the edges of the network. You’d need to spend extra effort injecting routes learned from internet-facing routers, for example.
- Assumes the current config is equal to the current state of the network. This is rarely 100% true, so could lead to issues during the actual change window.
- Most of these solutions do not cover L2 behavior or other L4 behavior such as stateful firewall rules, NAT, policy-based routing (SD-WAN) or load balancing.
Learning State and Creating a Simulation
Another approach would be to connect to each network device, gather its state table information, and create a simulation of the network state at a point in time. This can be done via participating in routing protocols (e.g. Ciena BluePlanet’s approach) or via screen scraping and APIs (e.g. IP Fabric and Forward Network’s approach). State table information could be L2 MAC tables, L3 routing tables, NAT and stateful firewall rules and many other tables.
Pros:
- Detailed model of network state. This is valuable for post-change analysis and `diff`-ing the state of the network at different points in time.
- Scales well. Thousands or tens of thousands of network devices can be modeled, and end-to-end network behavior can be simulated.
- Can model L2 behavior or other L4 behavior such as stateful firewall rules, NAT, policy-based routing (SD-WAN) or load balancing.
Cons:
- Doesn’t build a full network behavior model, so usually severely limited in the types of pre-change scenarios that can be tested.
- You must have 100% support of devices, OS versions and state table formats to create a model of the network.
- Labour-intensive development for the tool vendor - each vendor, NOS and version needs to be supported and kept current. For an example of the complexity, see IP Fabric’s Vendor Support Matrix. Hence each of these solutions is commercial and aimed at enterprise or SP network environments large enough to justify the purchase.
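The core of the post-change-analysis value is a snapshot diff. As a minimal sketch (with hypothetical prefixes and next hops, and routing tables reduced to plain dicts), comparing state collected before and after a change window looks like this:

```python
def diff_routes(before, after):
    """Diff two routing-table snapshots, given as dicts of prefix -> next hop."""
    added   = {p: nh for p, nh in after.items() if p not in before}
    removed = {p: nh for p, nh in before.items() if p not in after}
    changed = {p: (before[p], after[p])
               for p in before.keys() & after.keys()
               if before[p] != after[p]}
    return added, removed, changed

# Hypothetical snapshots taken before and after a change window.
pre  = {"10.0.0.0/24": "192.0.2.1", "10.0.1.0/24": "192.0.2.1"}
post = {"10.0.0.0/24": "192.0.2.2", "10.0.2.0/24": "192.0.2.1"}

added, removed, changed = diff_routes(pre, post)
print(added)    # {'10.0.2.0/24': '192.0.2.1'}
print(removed)  # {'10.0.1.0/24': '192.0.2.1'}
print(changed)  # {'10.0.0.0/24': ('192.0.2.1', '192.0.2.2')}
```

The commercial tools do the same comparison across MAC tables, ARP, NAT and firewall state, at the scale of thousands of devices.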
Creating a One-To-One Network Emulation
Another approach would be to use an emulation tool (EVE-NG, Vagrant, Cisco Modeling Labs, etc.) to spin up VMs or containers running vendor-specific images that seek to behave like their physical hardware counterparts. This allows network engineers to approach a more accurate representation of how the network behaves, but in practice is difficult to set up, scales poorly, and is costly and time-consuming to maintain.
Pros:
- Config changes can be tested. Step-by-step procedures for a change can be validated, including backout steps.
- Test your automation! Run your nornir or Ansible playbooks and ensure the correct final state will be achieved.
- Can potentially cover L2 behavior or other L4 behavior such as stateful firewall rules, NAT, policy-based routing (SD-WAN) or load balancing, however the devil is in the details on vendor VM support.
Cons:
- Getting VMs for all your vendors - Not all vendors provide them. Those that do often provide VMs that work approximately like the physical hardware, but are often lacking key features or behave differently.
- Lack of proper L2 behavior in VMs - Many vendors implement L2 behaviors in ASICs, and the VMs don’t attempt to emulate the same packet processing behavior. Watch out for tunneling and overlay limitations like L2, VxLAN, IPsec and more.
- Cost! - You have to buy VM licenses. You need significant server hardware to run even a small emulated network. By way of example, here are the CPU and memory resources needed for various vendor VMs.
- Scale of vendor VMs - Want to emulate a Juniper PTX or a Cisco running IOS-XR? Each VM takes 8GB of RAM. Getting access to bare metal servers with 128GB or more of RAM is usually costly and/or time consuming. Thus trying to emulate a 1000 device network is well beyond the capability of almost every network engineer and company on the planet.
- Interface limits and naming in VMs - This is one of a few examples where an exact one-to-one replica of a production environment is nearly impossible to create, much less maintain over time. Many vendor VMs have limits, such as supporting only 8 interfaces, while your switch may have 24 ports. Others use different interface names in the VM compared to hardware (e.g. Eth1 on the VM vs. GigE0/0 on physical hardware). The result is a network engineer needs to translate their production network to a virtual environment, creating differences and/or severe limits in what can be emulated.
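That translation layer often ends up as code. Here’s a minimal sketch of the kind of mapping involved, with an entirely hypothetical naming scheme and interface limit (real VMs differ per vendor and image):

```python
import re

# Hypothetical VM constraint; real limits vary per vendor image.
VM_MAX_INTERFACES = 8

def to_vm_interface(hw_name):
    """Translate e.g. 'GigabitEthernet0/3' to a VM-style 'Eth4' name.

    The 'Eth<n>' scheme here is made up for illustration; raises if the
    port index exceeds what the VM supports."""
    match = re.fullmatch(r"GigabitEthernet0/(\d+)", hw_name)
    if not match:
        raise ValueError(f"unrecognised interface name: {hw_name}")
    port = int(match.group(1))
    if port >= VM_MAX_INTERFACES:
        raise ValueError(f"{hw_name}: VM supports only {VM_MAX_INTERFACES} interfaces")
    return f"Eth{port + 1}"

print(to_vm_interface("GigabitEthernet0/3"))  # Eth4
```

Every such mapping is another place where the emulated network silently diverges from production.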
Bonus Challenge 1 - What Network Boundaries?
How do you decide which part of the network to model? Ideally you could have a model of every networking device from a wireless access point through to campus switches, across a WAN, to a data centre, and up to the cloud. The challenge is simulating or emulating large environments. Natural routing boundaries are the most obvious place to carve up the network. OSPF areas or BGP ASNs are great examples. One challenge here includes identifying exact boundaries to carve up, especially programmatically. Another challenge is when the smallest network boundary you can create (e.g. a single OSPF area) is still too large to emulate. For example, an average enterprise DC could have 50 to 100 network devices all in the same networking domain.
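Programmatically, the first cut at carving up the network is just grouping an inventory by routing domain and flagging domains too large to emulate. A minimal sketch, with hypothetical device names and an arbitrary size threshold:

```python
from collections import defaultdict

# Hypothetical inventory: device -> routing domain (BGP ASN or OSPF area).
INVENTORY = {
    "wan-r1": "AS65001",
    "wan-r2": "AS65001",
    "dc-sw1": "ospf-area-0",
    "dc-sw2": "ospf-area-0",
    "campus-sw1": "ospf-area-10",
}

def partition_by_domain(inventory):
    """Group devices by routing domain as candidate simulation boundaries."""
    domains = defaultdict(list)
    for device, domain in inventory.items():
        domains[domain].append(device)
    return dict(domains)

def too_large_to_emulate(domains, max_devices=50):
    """Flag domains whose smallest carve-up still exceeds the emulation budget."""
    return [d for d, devices in domains.items() if len(devices) > max_devices]

domains = partition_by_domain(INVENTORY)
print(domains["AS65001"])               # ['wan-r1', 'wan-r2']
print(too_large_to_emulate(domains))    # []
```

The hard part the sketch glosses over is the inventory itself: knowing, accurately and continuously, which domain every device belongs to.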
Bonus Challenge 2 - How do you know it’s “working”?
The purpose of simulating or emulating a network and doing a dry-run of a change is to:
- Ensure the changes have the intended effect
- De-risk the changes having unintended consequences
- Possibly test step ordering or backout procedures to reduce the chance of mistakes or delays
When it comes to de-risking the change causing unintended side effects, how can you verify this? In increasing complexity, some possible ways are:
- Validate interface status (Is my interface up and MTUs match?)
- Validate protocol state (Is BGP established and receiving routes?)
- Validate end-to-end reachability (Can the user get to the application?)
The first two present challenges in both scale and data collection across many vendors. You can either build your own code and tests, or outsource this time-consuming work to a vendor (shameless plug). The third way is a topic unto itself which I will briefly cover in the next section.
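If you do build your own tests, the first two levels reduce to simple checks over parsed state. A minimal sketch, assuming the state has already been collected and normalised into dicts (the hard multi-vendor part this glosses over):

```python
def check_interface(local, remote):
    """Level 1: interface up on both ends, and MTUs match."""
    problems = []
    if local["status"] != "up" or remote["status"] != "up":
        problems.append("interface not up on both ends")
    if local["mtu"] != remote["mtu"]:
        problems.append(f"MTU mismatch: {local['mtu']} vs {remote['mtu']}")
    return problems

def check_bgp(session):
    """Level 2: session established and actually receiving routes."""
    problems = []
    if session["state"] != "Established":
        problems.append(f"BGP state is {session['state']}")
    elif session["prefixes_received"] == 0:
        problems.append("BGP established but zero prefixes received")
    return problems

# Hypothetical parsed state, as it might come from 'show' command output.
a = {"status": "up", "mtu": 9000}
b = {"status": "up", "mtu": 1500}
print(check_interface(a, b))  # ['MTU mismatch: 9000 vs 1500']
print(check_bgp({"state": "Established", "prefixes_received": 120}))  # []
```

The checks themselves are trivial; collecting and normalising that state across every vendor and OS version in the network is where the real cost lives.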
To see an excellent example of how to design a network with intent as the starting point for the design, watch this YouTube talk by Jeremy Schulman on Design Driven Network Assurance.
Bonus Challenge 3 - What About Traffic?
The business doesn’t care about the network or the change you’re making. The business cares about some higher level reason the change is being made. If a network engineer could validate the end-to-end reachability of every application deployed in their environment from every source, it would significantly de-risk changes. This lofty goal is beyond the scope of this blog post, but an area I’m actively interested in and working on.
When it comes to testing end-to-end applications in pre-change simulations, there’s a few things that can be done:
- If you have a config-based model, you may be able to validate IP to IP routing reachability.
- If you have a state-based simulation, they’re excellent for post-change validation, but effectively useless¹ for testing changes in advance.
- Emulated networks are the best option here. You can test specific apps if you wish, or just port/protocol traffic tests across a network, albeit understanding the vendor VM limits of performance and features.
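In an emulated network, the simplest port/protocol traffic test is a TCP connect probe. A minimal sketch (the host and port are placeholders for whatever service the emulated application exposes):

```python
import socket

def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check an emulated web server answers on its service port.
# tcp_reachable("198.51.100.10", 443)
```

From here you can build up a matrix of (source, destination, port) probes per application, which is where the inventory problem below comes in.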
There remain challenges that network engineers have in discovering and cataloguing current applications in their environment, how to maintain that inventory, how to write automated tests and how to automate config changes using this information. For further reading around this topic, read Ken Celenza’s blog post on nautobot’s Application Dictionary functionality.
Summary
Each of the three approaches offers some value, as long as the limits are understood. Taking each in turn, here’s briefly where I see network engineers deriving value from each approach:
Modeling Configs to Simulate Current State
A good approach for service providers or networks running only major routing vendors and not needing other enterprise vendors or features such as L2, NAT, security, SD-WAN, or wireless.
Learning State and Creating a Simulation
Not useful for pre-change simulation, but often very good for post-change validation.
Creating a One-To-One Network Emulation
Great for detail, poor scaling. Time consuming and costly for smaller networking teams to build and maintain. Fine for small labs when you’re not striving for exact accuracy between production and the emulated network.
Related Reading
If you’re interested in learning more about this topic, below is a list of recent blog posts and talks covering similar topics that I came across whilst writing this post.
- Operating a Hyperscale Network at Scale Reliability by Jeff Tantsura
- Design Driven Network Assurance by Jeremy Schulman
- Thoughts on Digital Twins
- A List of Network Emulators and Simulators
- Nautobot Application Dictionary
Footnotes
1. There’s one scenario that can be tested by some of these tools: single-device config file changes that don’t affect a state table. One example is adding or changing an ACL, as this only affects packet processing on that one device, and state propagation of something more complex like an IGP change doesn’t need to be determined. ↩︎