Spanning-tree protocol was one of the first network control plane protocols that I learned about back in my Intro to Routing and Switching class during college. At the time, it seemed pretty obvious: network loops are bad at layer 2, and should be indiscriminately avoided in an effort to prevent broadcast storms. However, real-life networks really aren’t that simple, as any data center engineer will gladly tell you. Specifically, modern data centers face a few important issues:
- TenGig and faster Ethernet is expensive. While STP blocking a 1Gbps interface isn’t really a big deal, blocking a higher-speed interface can be prohibitively expensive.
- STP is inefficient. Logically blocking a path through the network results in completely wasted bandwidth.
Transparent Interconnection of Lots of Links (TRILL) is a proposed solution to this problem from the IETF. The basic principles of TRILL are described in RFC 6325. Packet Pushers recently aired an episode asserting that TRILL is most likely dead, with support provided by Jon Hudson, the co-chair of the TRILL IETF Working Group. While this may be true, the vendor-proprietary implementations of TRILL-based protocols do still see popular use, and I think it’s a good idea to understand them at a fundamental level.
My goal in this post is to perform a basic investigation into the protocols used in FabricPath at a packet level. We won’t be discussing every detail of FabricPath, but Cisco provides a truly superb white paper that takes an extremely deep dive into the operation of the technology. In fact, most of the discussion in this article is driven by the details in that white paper. Naturally, Cisco seems to have either removed or moved this white paper while I was in the process of writing this article. Thankfully, someone uploaded it to Scribd which, while not ideal, will have to suffice if you want to download and read it.
At any rate, let’s take a look at the topology for this packet dive!
Our topology consists of a full-mesh of four core switches and two edge switches, each with an attached host. The FabricPath ID of each switch is the numerical portion of the hostname (i.e. SW_40 uses ID 40), and will be useful when analyzing the packet captures. IP addresses for Server-1 and Server-2 are also shown. The topology uses the 10.0.0.0/24 subnet, and we’ll only be dealing with the IPs of the two servers.
I’ve used Cisco’s Virtual Internet Routing Lab (VIRL) to build this topology, since it allows for simulation of Nexus switches with FabricPath functionality. The Packet Thrower’s Blog has a great article about setting up FabricPath in VIRL, and it helped me out when I was playing around in my lab. While VIRL was very useful for building this exercise, I did run into a few issues:
- There seemed to be some bugs with the way the NX-OSv switches handled IS-IS messages. Multiple debug messages and “Contact TAC” errors would appear when the topology converged. However, this doesn’t seem to have really broken much for this basic exercise.
- The MAC address tables on the NX-OSv switches were always empty, except for the local system MAC addresses. More on this later.
- MAC addresses change each time the topology is started and torn down. If I forgot to capture a particular packet, I would have to rebuild the topology and recapture all of my packets for the data to be meaningful. Otherwise, there would be no consistency among MAC addresses for the packet-level discussion.
- There is a limit of 4 simultaneous ongoing packet captures, which made it slightly challenging to capture all of the data that I wanted.
The configs for all of the switches are available on my Github. The switches use the default credentials of admin/admin.
We’re going to be exploring the packets seen for both control and data plane operations within the FabricPath topology. FabricPath is a new technology for me, so I don’t promise an extreme deep dive into every minutia of its operation. However, this exploration will hopefully grant us a better understanding of the mechanisms in use.
Our control plane packet analysis will simply involve observing some of the protocol traffic that appears during FabricPath topology convergence. This will provide us with insight into the methods used by FabricPath during the data plane forwarding process.
Our data plane scenario is going to be equally simple: an ARP exchange initiated from Server-1 to Server-2. We’ll be interested in seeing both the broadcast traffic (ARP Request) and unicast traffic at the packet level, supported by data from each relevant switch in the topology. This scenario will essentially follow along with the packet forwarding process from the Cisco whitepaper mentioned in the introduction.
The Control Plane Protocols
First, let’s take a look at the control plane protocol utilized in the FabricPath domain: Intermediate System to Intermediate System (IS-IS). I’ve often thought of IS-IS as a reasonably complex protocol, and I’ve never had the chance to dig into its internal operation. Luckily, one of the main ideas underpinning FabricPath (and TRILL implementations in general) is that IS-IS should simply be “turned on” and work. While the protocol can be tuned, the initial functionality should be sufficient to deploy FabricPath in an environment, and Cisco actually advises against tuning the default behavior.
One of the key benefits of this link-state routing protocol, and the feature that makes it an ideal choice for TRILL and FabricPath, is the use of type-length-value elements. This allows for protocol extensions, such as those used in FabricPath, to be easily added and standardized.
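To make the TLV idea a bit more concrete, here is a minimal Python sketch of a TLV walker. It assumes the one-byte type and one-byte length layout that IS-IS uses. The function name `parse_tlvs` and the sample bytes are made up for illustration and are not taken from the captures in this post.

```python
from typing import Iterator, Tuple


def parse_tlvs(data: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (type, value) pairs from a buffer of back-to-back TLVs.

    Assumes a 1-byte type followed by a 1-byte length, then `length`
    bytes of value.
    """
    offset = 0
    while offset < len(data):
        if offset + 2 > len(data):
            raise ValueError("truncated TLV header")
        tlv_type = data[offset]
        length = data[offset + 1]
        start = offset + 2
        end = start + length
        if end > len(data):
            raise ValueError("truncated TLV value")
        yield tlv_type, data[start:end]
        offset = end


# Hypothetical sample buffer: type 1 with a 3-byte value, then type 250
# with a 2-byte value. These bytes are invented for the example.
sample = bytes([1, 3, 0x49, 0x00, 0x01, 250, 2, 0xAA, 0xBB])

for tlv_type, value in parse_tlvs(sample):
    print(tlv_type, value.hex())
```

The point of the sketch is the last line of the loop: because every element carries its own length, a parser that doesn't recognize a given type can still step over it and keep going, which is what makes adding new extensions relatively painless.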
Let’s take a look at the IS-IS traffic captured on Ethernet2/2 of SW_100 when the interface comes up. Ethernet2/2 is a FabricPath port and is connected to Ethernet2/1 on SW_30.
SW_100(config-if-range)# sh fabricpath isis switch-id
<<< output truncated >>>
System-ID        Primary   Secondary  Reachable  Bcast-Priority  Ftag-Root Capable
MT-0
fa16.3e03.ccee   40 [C]    0[C]       Yes        64 [S]          Y
fa16.3e11.3ee6*  100 [C]   0[C]       Yes        64 [S]          Y
fa16.3e71.6733   10 [C]    0[C]       Yes        64 [S]          Y
fa16.3ea3.ed63   200 [C]   0[C]       Yes        64 [S]          Y
fa16.3eb8.a8ca   20 [C]    0[C]       Yes        64 [S]          Y
fa16.3efe.d4a0   30 [C]    0[C]       Yes        64 [S]          Y
Filtering for IS-IS only traffic above, we can see hello messages from 2 system IDs: fa16.3efe.d4a0 and fa16.3e11.3ee6. By taking a look at the learned switch IDs on SW_100, we can see that the system IDs in these hello messages are from SW_100 and SW_30. This is expected behavior, as we are capturing on the link between SW_100 and SW_30.
SW_30# sh int eth2/1
Ethernet2/1 is up
admin state is up, Dedicated Interface
  Hardware: Ethernet, address: fa16.3efe.d472 (bia fa16.3efe.d472)
<<< output truncated >>>
Additionally, we can see that each hello is sourced at layer 2 with the address of the corresponding interface on the switch. For SW_30, we can see that the frame is sourced by fa16.3efe.d472, which is the address assigned to the Ethernet2/1 interface. These hellos are correspondingly destined for the All-IS-IS-RBridges multicast address of 0180.c200.0041, as defined in Section 1.4 of RFC 6325.
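If you want to pick these frames out of a capture programmatically, the check is just a MAC comparison. The sketch below is a small Python helper that normalizes the Cisco dotted notation used in this post and tests whether a destination is the All-IS-IS-RBridges address. The function names are my own and purely illustrative.

```python
ALL_ISIS_RBRIDGES = "0180.c200.0041"  # multicast address quoted in the text


def normalize_mac(mac: str) -> str:
    """Strip '.', ':' and '-' separators and lowercase the hex digits."""
    digits = "".join(ch for ch in mac if ch not in ".:-").lower()
    if len(digits) != 12 or any(ch not in "0123456789abcdef" for ch in digits):
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits


def is_group_address(mac: str) -> bool:
    """True when the I/G bit (lowest bit of the first octet) is set."""
    return int(normalize_mac(mac)[:2], 16) & 0x01 == 1


def is_all_isis_rbridges(mac: str) -> bool:
    """True when the address is the All-IS-IS-RBridges multicast address."""
    return normalize_mac(mac) == normalize_mac(ALL_ISIS_RBRIDGES)


# The hello's destination is the multicast group; its source is the
# unicast address of Ethernet2/1 on SW_30.
print(is_all_isis_rbridges("01:80:C2:00:00:41"))
print(is_group_address("fa16.3efe.d472"))
```

The second helper is there because the source address of a hello should always be a unicast interface address, while the destination has the group bit set.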
Now that we can see 2 FabricPath switches establishing an adjacency, let’s take a look at a link-state packet (LSP) that is used to build each switch’s knowledge of the FabricPath topology. Just like the Hello message, the LSP is sourced at layer 2 by the Ethernet address of the sender, in this case Ethernet2/1 on SW_30, and destined to the All-IS-IS-RBridges multicast address.
We can see that the LSP-ID corresponds with the system ID of the LSP originator, in this case fa16.3e03.ccee. This is expected behavior, as defined in the IS-IS specification. Additionally, we can see that the configured switch ID is listed in the “Nickname” sub-TLV within the Router Capability TLV. In this case, the nickname of the switch is 40, which corresponds with SW_40 (as seen in the earlier switch-ID output). Note that the originator of the LSP is not the same as the sender of the frame at layer 2. In this case, the originator of the LSP is SW_40. However, the layer 2 source address is SW_30, which is the sender of the LSP on this particular link (between SW_100 and SW_30).
Digging into the Extended IS Reachability TLV, we can see the neighbor information and an associated metric for each switch that SW_40 is connected to. We can see an entry for each neighbor of SW_40: SW_200 (fa16.3ea3.ed63), SW_10 (fa16.3e71.6733), SW_20 (fa16.3eb8.a8ca), and SW_30 (fa16.3efe.d4a0). Since SW_100 will receive an LSP from each FabricPath switch in the FabricPath domain, it will be able to run the SPF algorithm to determine the best route to each switch based on the metrics and nicknames provided in each LSP.
SW_100(config-if)# sh fabricpath isis route
Fabricpath IS-IS domain: default MT-0
Topology 0, Tree 0, Swid routing table
10, L1
 via Ethernet2/1, metric 400
20, L1
 via Ethernet2/1, metric 800
 via Ethernet2/2, metric 800
30, L1
 via Ethernet2/2, metric 400
40, L1
 via Ethernet2/1, metric 800
 via Ethernet2/2, metric 800
200, L1
 via Ethernet2/1, metric 1200
 via Ethernet2/2, metric 1200
By taking a look at the FabricPath routing table of SW_100, we can see that traffic will be routed based on the nicknames of each switch. We can also see that multiple equal-cost paths to the same destination are allowed. For example, SW_20 can be reached by Ethernet2/1 or Ethernet2/2 with an equal cost of 800 for each path. The cost is determined as a sum of the link costs along the path. In this topology, each link is a GigabitEthernet link with a cost of 400, which is the default metric given to GigabitEthernet interfaces in a FabricPath topology. This idea of equal cost multi-pathing (ECMP) is important, as it allows us to have a topology with multiple potential paths for unicast traffic. Legacy STP-based topologies would result in the blocking of a path, which would lead to wasted bandwidth.
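The routing table above can be reproduced with a few lines of Python. This is a rough sketch of an SPF run that tracks equal-cost first hops, not a model of how NX-OS implements it. The link list is partly an assumption: the core full mesh and SW_100's links to SW_10 and SW_30 follow from the outputs above, and SW_40's link to SW_200 comes from its LSP, but the SW_200 to SW_20 link is my guess at the diagram. Every link uses the cost of 400 mentioned in the text.

```python
import heapq

LINK_COST = 400  # GigabitEthernet metric quoted in the text

# Switch IDs as nodes. The (200, 20) link is an assumption.
LINKS = [
    (10, 20), (10, 30), (10, 40), (20, 30), (20, 40), (30, 40),  # core mesh
    (100, 10), (100, 30),                                        # SW_100 uplinks
    (200, 20), (200, 40),                                        # SW_200 uplinks
]


def build_graph(links, cost=LINK_COST):
    """Build an undirected adjacency map: node -> {neighbor: cost}."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph


def spf(graph, root):
    """Dijkstra from root, returning (cost, equal-cost first hops) per node."""
    dist = {root: 0}
    first_hops = {root: set()}
    heap = [(0, root)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        for nbr, cost in graph[node].items():
            nd = d + cost
            # A direct neighbor of the root is its own first hop; anything
            # further away inherits the first hops of the node we came from.
            hops = {nbr} if node == root else first_hops[node]
            if nbr not in dist or nd < dist[nbr]:
                dist[nbr] = nd
                first_hops[nbr] = set(hops)
                heapq.heappush(heap, (nd, nbr))
            elif nd == dist[nbr]:
                first_hops[nbr] |= hops  # another equal-cost path
    return dist, first_hops


dist, first_hops = spf(build_graph(LINKS), root=100)
for switch_id in sorted(dist):
    if switch_id != 100:
        print(switch_id, dist[switch_id], sorted(first_hops[switch_id]))
```

Under those assumptions, the sketch reports switch 20 and switch 40 at a cost of 800 via both neighbors (10 and 30, which correspond to Ethernet2/1 and Ethernet2/2 on SW_100), and switch 200 at a cost of 1200 via both, which lines up with the output from the switch.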
The Data Plane
Now that we have an idea of the control plane for FabricPath, let’s take a look at data plane operation. For this scenario, we’re going to send an ARP request from Server-1 at 10.0.0.1 to Server-2 at 10.0.0.2. First, let’s quickly review the traffic that we would expect to see in this simple example:
- Broadcast ARP request from Server-1 for 10.0.0.2
- Unicast ARP response from Server-2 to Server-1 with the corresponding MAC address of Server-2
I would also like to note that I forced traffic to use the SW_100 – SW_30 – SW_40 – SW_200 path by adjusting the FabricPath IS-IS metric of each interface in the path, as illustrated above. This was done to ensure a consistent path for all traffic, since FabricPath is capable of utilizing ECMP. It was also done to get around the “4 simultaneous packet capture” limitation within VIRL. By forcing traffic to the “lower” path in the topology, I could perform simultaneous capture on most of the necessary interfaces. Alright, enough administrivia: let’s look at some packets!
First, we’ll take a look at the ARP Request sent by Server-1, as seen by a packet capture performed on Ethernet2/3 on SW_100. It’s not particularly interesting. We can see a standard ARP request sourced by Server-1 and destined to the broadcast address, with a target IP address of 10.0.0.2, which is the host that Server-1 is trying to resolve a layer 2 address for. This is pretty standard network stuff.
Now, let’s look at the same ARP request as it is forwarded between SW_100 and SW_30. We now have an entirely new header on the frame: the Cisco FabricPath header. Notice that the source field contains the source address of the switch that Server-1 is connected to, which is SW_100. We can also see that the frame is even destined for the broadcast address at the FabricPath layer, which means that it will be flooded throughout all FabricPath switches in the topology. Eventually, the frame will reach SW_200 and be flooded to Server-2, just like in a normal switched topology. Also notice that the FabricPath header contains a TTL. This allows for equal-cost multipathing, without the blocking that spanning tree introduced. The TTL is decremented at each hop, and the FabricPath frame will be prevented from infinitely looping through the topology.
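The TTL behavior is easy to model. The Python sketch below is a deliberately simplified stand-in for a FabricPath frame: it only keeps the source switch ID, destination switch ID, and TTL, and it is not the real wire format. The starting TTL of 4 and the exact point at which the frame is discarded are assumptions chosen to keep the example short.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class FabricPathFrame:
    """Simplified model of a FabricPath-encapsulated frame (not wire format)."""
    src_switch_id: int
    dst_switch_id: Optional[int]  # None models the flooded/broadcast case
    ttl: int
    payload: str


def forward(frame: FabricPathFrame) -> Optional[FabricPathFrame]:
    """Return the frame as it leaves a hop, or None once the TTL runs out."""
    if frame.ttl <= 1:
        return None  # discard instead of forwarding again
    return replace(frame, ttl=frame.ttl - 1)


def hops_before_drop(frame: FabricPathFrame) -> int:
    """Count how many times a frame stuck in a loop would be forwarded."""
    hops = 0
    current = forward(frame)
    while current is not None:
        hops += 1
        current = forward(current)
    return hops


# ARP request from Server-1, encapsulated by SW_100 and flooded.
# The initial TTL of 4 is a hypothetical value for the example.
arp_request = FabricPathFrame(src_switch_id=100, dst_switch_id=None,
                              ttl=4, payload="ARP who-has 10.0.0.2")
print(hops_before_drop(arp_request))
```

Even if this frame were caught in a forwarding loop, it would only be forwarded a bounded number of times before being discarded, which is the property that lets FabricPath keep every link active.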
FabricPath also has the concept of different trees for controlling traffic, but we won’t get into that here. There’s a good explanation in the Cisco whitepaper mentioned earlier.
Taking a look at the packet captured on SW_200 Ethernet2/3, we can see that the ARP request looks exactly like the one captured on SW_100. Once the frame is forwarded out of a Classic Ethernet (CE) port, the FabricPath header is removed. This is somewhat analogous to how an 802.1Q VLAN header is removed before forwarding out an access port.
At this point, Server-2 has received the ARP request for processing. Next, we’ll expect to see an ARP Reply, which will give us some insight into the more interesting aspects of the FabricPath data plane.
Above we can see the ARP reply as captured on Ethernet2/3 on SW_200. It’s exactly what we would expect: a unicast frame sourced by Server-2 at fa16.3e0d.fe19 and destined to Server-1 at fa16.3e7b.92c3. The interesting part about this frame isn’t the payload itself, but rather how SW_200 makes a forwarding decision. When the switch looks up the destination MAC address, it will see that the frame is destined for a host on another FabricPath switch, specifically SW_100. SW_200 will then encapsulate the unicast frame inside of a FabricPath header, which allows the frame to be forwarded throughout the FabricPath topology.
Ordinarily, I would include a screenshot of the MAC address table for SW_200 and the corresponding FabricPath forwarding decision. Unfortunately, due to what I can only assume is a bug in the version of NX-OS running in VIRL, the MAC address tables for all of the switches are completely empty, except for the local system MAC address. The MAC address table doesn’t even contain the MAC address for Server-2. Instead of putting in too much effort troubleshooting this, I’m opting to just include this explanation.
Now let’s look at the same ARP reply on the link between SW_30 and SW_40. Initially this looks similar to the ARP request that we saw earlier. However, notice that the FabricPath destination is for Switch 100 instead of the broadcast address. All of the FabricPath switches are able to use the Switch-ID of 100 to make a forwarding decision, instead of the MAC address contained within the ARP reply payload.
The idea that FabricPath switches forward based on the Switch-ID field is particularly important to understand, especially when trying to realize the scalability benefits of FabricPath. The switches that are only running FabricPath in the “core” of our topology have no reason to learn the MAC addresses of hosts throughout the topology. They only need to be aware of FabricPath Switch-IDs to make forwarding decisions. A normal switched topology would require the MAC addresses to be known throughout the topology for unicast forwarding to occur. Otherwise, flooding behavior would take place. This results in large MAC address tables for large layer 2 domains.
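Since I couldn't get real MAC tables out of VIRL, here is a small Python sketch of what the two kinds of lookup look like conceptually. The table contents are illustrative: the server MAC addresses come from the captures above, but the core switch's interface names are hypothetical, and this is not output from any switch.

```python
from typing import Dict, Tuple

# Edge switch (SW_200): host MACs map to a local CE port or a remote switch ID.
EDGE_MAC_TABLE: Dict[str, Tuple[str, object]] = {
    "fa16.3e0d.fe19": ("local", "Ethernet2/3"),  # Server-2
    "fa16.3e7b.92c3": ("remote", 100),           # Server-1, behind SW_100
}

# Core switch (SW_40): only switch IDs, no host MACs at all.
# Interface names here are hypothetical.
CORE_SWITCH_ID_TABLE: Dict[int, str] = {
    100: "Ethernet2/1",
    200: "Ethernet2/2",
}


def edge_encapsulate(dst_mac: str, local_switch_id: int = 200) -> dict:
    """Look up the destination MAC and wrap the frame for the fabric."""
    kind, value = EDGE_MAC_TABLE[dst_mac]
    if kind != "remote":
        raise ValueError("destination is local; no encapsulation needed")
    return {"src_switch_id": local_switch_id,
            "dst_switch_id": value,
            "inner_dst_mac": dst_mac}


def core_forward(frame: dict) -> str:
    """Pick an egress interface using only the destination switch ID."""
    return CORE_SWITCH_ID_TABLE[frame["dst_switch_id"]]


# ARP reply from Server-2 to Server-1.
frame = edge_encapsulate("fa16.3e7b.92c3")
print(frame["dst_switch_id"], core_forward(frame))
```

Notice that `core_forward` never touches `inner_dst_mac`. The core's table grows with the number of switches in the fabric rather than the number of hosts.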
Additionally, even the switches that are running both Classic Ethernet and FabricPath (SW_100 and SW_200) don’t need to know every single MAC address in the topology. FabricPath has a concept called “Conversational MAC Learning.” This allows a switch to only learn source MAC addresses if the destination MAC address is known via a Classic Ethernet port. For example, let’s say that we introduced another switch into this topology: SW_300 connected to SW_20. If hosts on SW_100 and SW_300 were having a conversation, SW_200 would never have to learn the MAC addresses for either host, even though a broadcast ARP request might traverse SW_200 at some point during the communication. This allows FabricPath to scale by limiting the number of MAC addresses that must be learned in the topology.
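The learning rule described above can be sketched in a few lines of Python. This is a toy model of the rule as I've described it, not of the actual NX-OS implementation. The `EdgeSwitch` class and its method names are invented for the example, and the MAC address for the host behind SW_300 is hypothetical.

```python
class EdgeSwitch:
    """Toy model of conversational MAC learning on a FabricPath edge switch."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.local_macs = {}   # MAC -> Classic Ethernet port
        self.remote_macs = {}  # MAC -> remote switch ID

    def receive_from_ce(self, src_mac: str, port: str) -> None:
        """Frames arriving on a CE port are learned as usual."""
        self.local_macs[src_mac] = port

    def receive_from_fabric(self, src_mac: str, dst_mac: str,
                            src_switch_id: int) -> bool:
        """Learn the remote source only if the destination is a local host."""
        if dst_mac in self.local_macs:
            self.remote_macs[src_mac] = src_switch_id
            return True
        return False


sw200 = EdgeSwitch("SW_200")
sw200.receive_from_ce("fa16.3e0d.fe19", "Ethernet2/3")  # Server-2

# A flooded broadcast from a host behind SW_100 is not learned.
sw200.receive_from_fabric("fa16.3e7b.92c3", "ffff.ffff.ffff", 100)

# Unicast between SW_100 and a hypothetical host behind SW_300 is not
# learned either, because neither endpoint lives on SW_200.
sw200.receive_from_fabric("fa16.3e7b.92c3", "fa16.3e00.0300", 100)

# Unicast addressed to Server-2 is part of a local conversation: learn it.
sw200.receive_from_fabric("fa16.3e7b.92c3", "fa16.3e0d.fe19", 100)

print(sw200.remote_macs)
```

Only the last frame results in an entry, so SW_200's table ends up holding the remote hosts its own servers are actually talking to.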
Again, I would include screenshots of MAC address tables to prove this, but I think I’m bumping into a VIRL bug.
Finally, let’s take a look at our packet at the end of its journey. Above, we can see the ARP reply as captured on Ethernet2/3 of SW_100. The FabricPath header has been stripped off, and the standard ARP reply is the only remaining payload on this Classic Ethernet port.
By taking a look at a packet capture, we were able to dive into the data plane forwarding decisions made by switches in a FabricPath network. The packet headers and subsequent decisions made by FabricPath switches help to reveal some of the benefits of running FabricPath in a network with a large layer 2 domain. While these concepts may be new, an analysis of the packets reveals that the forwarding decisions are reasonably straightforward, and this simplicity can be a strong motivator for adopting FabricPath.
Revisiting the introduction to this article, I found FabricPath to be particularly interesting because spanning tree was one of the earliest protocols that I learned in networking. As always, my understanding of a networking topic is greatly enhanced by opening the hood and viewing the involved protocols at a packet level. Digging into FabricPath reminded me of the switching labs from college, all of which involved providing a detailed explanation of STP operation through the use of packet captures and analysis.
While I think FabricPath is fairly interesting, I’m curious to see how long TRILL-based implementations will still be relevant. Not to jump on the buzzword bandwagon too much, but SDN solutions are growing in maturity. As robust APIs and programmable control (and even data) planes gain traction and evolve into mature products, I can’t help but wonder if we’ll still be digging into IS-IS headers when we troubleshoot network problems. At any rate, FabricPath is a step in the right direction: it’s generally simple to configure, is reasonably scalable, and it helps to alleviate some of the problems that are inherent in traditional Ethernet topologies.