I’ve done a lot of posts featuring ZeroTier on this blog. There are a lot of different overlay solutions I could feature: ZeroTier, Tailscale, NetBird, Nebula, TwinGate…just to name a few. The reason I feature ZeroTier so heavily is that it is uniquely suited for site-to-site networking compared to the other offerings. This is because it creates a virtual LAN rather than a collection of point-to-point links like other solutions. Next hops on a multi-access Ethernet network just need to point to an IP, and ARP resolves that IP so the MAC address of the next hop can be imposed on the packet.


The simplicity of that solution allows network engineers to treat ZeroTier not as some additional thing to integrate with an already complex environment, but as any LAN segment they are accustomed to working with. You can run anything over ZeroTier that you could run over a traditional Ethernet network.

Performance in ZeroTier

If you scour the internet, a lot of people praise ZeroTier, but a frequent complaint is its performance against WireGuard, and sometimes IPsec. While I certainly wouldn’t categorize the typical performance of ZeroTier as bad, it usually can’t beat WireGuard, which can max out 1Gbps circuits even on cheap, low-powered hardware.


In this post, we’re going to aim to bring the performance up to that of WireGuard; maybe even surpass it.


NOTE: In this post, we’re going to be comparing speeds between ZeroTier, WireGuard, OpenVPN, and IPsec VTI tunnels. It’s important to note that this entire post is aimed at improving the overall throughput of ZeroTier between 2 VyOS hosts. What it is not is a statement on the superiority of the raw speed between protocols. Most of what will be done in this article to improve speeds using ZeroTier can also be applied to WireGuard and OpenVPN.

What causes the lower performance of ZeroTier

ZeroTier has a lot of great functionality, but as a product, it’s still relatively young compared to WireGuard. The ZeroTier developers chose to focus on functionality over performance: provide solid functionality first, then improve it.


ZeroTier is generally single-threaded, whereas WireGuard is optimized for multi-threading. This means that with more cores, WireGuard should generally beat ZeroTier as it can leverage all of the cores in the system.


Look at the graphics below showing CPU utilization while running iPerf between the 2 devices. You can see WireGuard’s much better utilization of all the cores.


WireGuard:

ZeroTier:


Developing for good multi-threaded scaling can be difficult, and I’m sure it’s on ZeroTier’s roadmap. But what if we had another way to use more cores with ZeroTier? Could that help us get on an even footing with WireGuard? Let’s explore that idea.

Building a Proof of Concept – The Hardware

I wanted to have 2 systems with identical specs to remove some variability. I didn’t want to spend a lot of money on a proof of concept, so I looked for the cheapest mini PCs available with at least 4 cores and 2 network interfaces. After some searching, I found the Minisforum GK41: a quad-core Celeron (J4125) with 8GB of RAM and 2x 1Gbps interfaces. I was able to get 2 of them on sale for around $120 USD each.


The GK41 can be purchased on Amazon, but I can’t find a product page on the Minisforum website. They do have an old presentation for it which can be found here: https://www.minisforum.com/ueditor/file/20200820/1597895064757677.pdf


I’ve received mine; let’s start setting up a Proof of Concept.


Building a Proof of Concept – The Software

I’m going to use VyOS as the operating system for each system. That’s been another staple of my blog posts, as I love the direction VyOS takes their product. They have developed a feature-rich software router/firewall while allowing it to feel familiar to network engineers coming from the enterprise.


For our VPN solutions, I’m going to build WireGuard, VTI IPsec, and OpenVPN directly within VyOS, as those are already fully integrated within VyOS. For ZeroTier, I’m going to install it in containers. If you need to see how to do this, check out this previous blog post:
https://lev-0.com/2024/01/16/dynamic-multipoint-vpn-with-zerotier-and-vyos-part-4-more-zerotier/
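
As a rough sketch (the linked post covers the full setup), one containerized instance in VyOS looks something like the following. The container name, image tag, and volume paths here are my own placeholders, and the capability keyword varies slightly between VyOS versions:

```shell
# One ZeroTier instance as a VyOS container (names/paths are example values)
set container name zt1 image 'zerotier/zerotier:latest'
set container name zt1 allow-host-networks
set container name zt1 capability 'net-admin'
# Persist the instance's identity and local.conf on the host
set container name zt1 volume zt1conf source '/config/zt/zt1'
set container name zt1 volume zt1conf destination '/var/lib/zerotier-one'
```

Repeat per instance (zt2, zt3, and so on), each with its own host volume path so the identities stay separate.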

Initial Testing

Let’s get a baseline of these mini PCs to see what their max unencrypted throughput is.


[  5]   0.00-10.00  sec  1.09 GBytes   941 Mbits/sec                  receiver


That’s a good sign: we can max out the 1Gbps connection (941Mbits/sec is the payload throughput after overhead).
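
That 941 figure falls straight out of the framing overhead. Assuming a 1500-byte MTU with TCP timestamps enabled (the Linux default), each segment carries 1448 bytes of payload inside 1538 bytes on the wire:

```shell
# payload = 1500 - 20 (IP) - 20 (TCP) - 12 (timestamp option) = 1448 bytes
# on-wire = 1500 + 14 (Ethernet) + 4 (FCS) + 8 (preamble) + 12 (inter-frame gap) = 1538 bytes
echo $((1448 * 1000 / 1538))   # Mbits/sec of goodput on a 1Gbps link -> 941
```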


Let’s see what we get with our different VPN solutions:


WireGuard:

[  5]   0.00-10.00  sec  1021 MBytes   856 Mbits/sec                  receiver

IPsec:

[  5]   0.00-10.00  sec   958 MBytes   803 Mbits/sec                  receiver

OpenVPN:

[  5]   0.00-10.00  sec   327 MBytes   274 Mbits/sec                  receiver

ZeroTier:

[  5]   0.00-10.00  sec   645 MBytes   541 Mbits/sec                  receiver


Honestly, I’m quite surprised by the performance of these mini PCs. I wasn’t expecting results this close to line rate, not just from WireGuard but from plain IPsec as well. OpenVPN and ZeroTier were along the lines of what I expected, and it’s actually good that ZeroTier didn’t max out the circuit, since it leaves us room to test our Proof of Concept.

Using more cores

We talked previously about how WireGuard sees better performance due to its ability to scale across additional cores in the system. Rewriting ZeroTier to be multi-threaded would take a large effort, but one quick way to use more cores with ZeroTier is simply to install it multiple times.


Normally in Linux, this would be difficult, but with Containers, it becomes quite easy. We have 4 cores, so let’s go ahead and configure 4 instances of ZeroTier.


Each of the containers is going to try to listen on UDP 9993, which obviously won’t work. We also want to make sure that we don’t build ZeroTier over any of the other VPN solutions. We need to modify the local.conf file for each ZeroTier instance so it blacklists the necessary interfaces and listens on a unique port.


{
  "physical": {},
  "virtual": {},
  "settings": {
    "primaryPort": 9994,
    "interfacePrefixBlacklist": [
      "eth10",
      "eth12",
      "eth13",
      "eth14",
      "dum0",
      "vti0",
      "vtun1",
      "wg0"
    ]
  }
}


I have set up the 4 ZeroTier interfaces as eth10-14; each instance’s local.conf blacklists the interfaces belonging to the other instances, along with the other VPN interfaces. The local.conf file should be placed in whatever folder you mapped to ‘zerotier-one’, and then the container restarted.


I’m going to assign each pair of nodes an IP in its own ‘/30’ subnet. This ensures that each node talks only to its single counterpart on the opposite end.
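
For reference, here’s the resulting point-to-point addressing, one /30 per ZeroTier instance (these pairs line up with the next-hop addresses in the static routes later in the post):

```
Instance 1: 10.14.0.0/30   Router1 10.14.0.1  <-> Router2 10.14.0.2
Instance 2: 10.14.0.4/30   Router1 10.14.0.5  <-> Router2 10.14.0.6
Instance 3: 10.14.0.8/30   Router1 10.14.0.9  <-> Router2 10.14.0.10
Instance 4: 10.14.0.12/30  Router1 10.14.0.13 <-> Router2 10.14.0.14
```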


Creating a floating IP

We’ll need to create an IP that we can route to over ECMP so we can use all of the ZeroTier containers at the same time.


Router1:
set interfaces dummy dum0 address '10.0.55.1/32'

Router2:
set interfaces dummy dum0 address '10.0.55.2/32'


We then need to create routing between the 2 routers for that dummy interface. Notice that we disable 3 of the routes so we can test performance with 1, 2, 3, and 4 cores separately.


Router1:
set protocols static route 10.0.55.2/32 next-hop 10.14.0.2
set protocols static route 10.0.55.2/32 next-hop 10.14.0.6 disable
set protocols static route 10.0.55.2/32 next-hop 10.14.0.10 disable
set protocols static route 10.0.55.2/32 next-hop 10.14.0.14 disable

Router2:
set protocols static route 10.0.55.1/32 next-hop 10.14.0.1
set protocols static route 10.0.55.1/32 next-hop 10.14.0.5 disable
set protocols static route 10.0.55.1/32 next-hop 10.14.0.9 disable
set protocols static route 10.0.55.1/32 next-hop 10.14.0.13 disable


We also need to enable layer-4 hashing for ECMP in VyOS. This will allow our traffic to be load-balanced based on the source/destination port in the packet. Sending multiple threads with iPerf will create unique source/destination port pairings to allow ECMP.


set system ip multipath layer4-hashing


Using the floating IP should cost us a small performance penalty with ZeroTier. Let’s try it with just one core to see what that penalty is.


We need to make sure our iPerf tests now use that floating IP. We can do that by binding the source address with -B:


iperf3 -c 10.0.55.1 -B 10.0.55.2


Here are the results of iPerf:

[  5]   0.00-10.00  sec   526 MBytes   441 Mbits/sec                  receiver


You can see we lost about 100Mbps of throughput by doing this. Let’s now try 2 cores and see if the change was even worth it. We’re going to re-enable one of the static routes we disabled earlier.


Router1:
delete protocols static route 10.0.55.2/32 next-hop 10.14.0.6 disable

Router2:
delete protocols static route 10.0.55.1/32 next-hop 10.14.0.5 disable


We’ll need to run iPerf with multiple threads to use multiple ZeroTier instances.


iperf3 -c 10.0.55.1 -B 10.0.55.2 -P 16


iPerf Results (ZeroTier ECMP with 2 cores):

[SUM]   0.00-10.01  sec  1.01 GBytes   864 Mbits/sec                  receiver


Perfect, that worked. But now we have a problem: with only 2 cores we’ve already run into the limit of the 1Gbps interfaces on this box. I guess that’s a good problem to have if you intend to use this mini PC in production. I don’t want to spend more money on mini PCs just for testing, so hopefully some USB 2.5Gbps adapters will come to the rescue.


Alright, I received 2 of them, now back to testing.



I guess we need new baselines now.


Unencrypted:

[SUM]   0.00-10.00  sec  2.60 GBytes  2.34 Gbits/sec                  receiver

WireGuard:

  [  5]   0.00-10.00  sec  1.83 GBytes  1.57 Gbits/sec                  receiver

IPsec:

[  5]   0.00-10.00  sec  1.58 GBytes  1.36 Gbits/sec                  receiver

OpenVPN:

[  5]   0.00-10.00  sec   270 MBytes   226 Mbits/sec                  receiver

ZeroTier:

[  5]   0.00-10.00  sec   469 MBytes   393 Mbits/sec                  receiver


We can see we were actually able to get quite a bit more throughput using both WireGuard and IPsec. As a single core couldn’t max out ZeroTier or OpenVPN before, those didn’t really change much. Now back to the testing: let’s try ZeroTier with 2 cores again.


iPerf Results (ZeroTier ECMP with 2 cores):

[SUM]   0.00-10.00  sec  1.19 GBytes  1.02 Gbits/sec                  receiver


That’s looking great, we’re seeing near linear scaling when using ECMP with ZeroTier. Let’s see what we get with 3 cores.


Router1:
delete protocols static route 10.0.55.2/32 next-hop 10.14.0.10 disable

Router2:
delete protocols static route 10.0.55.1/32 next-hop 10.14.0.9 disable


iPerf Results (ZeroTier ECMP with 3 cores):

[SUM]   0.00-10.01  sec  1.69 GBytes  1.45 Gbits/sec                  receiver


And finally, with all 4 cores.

Router1:
delete protocols static route 10.0.55.2/32 next-hop 10.14.0.14 disable

Router2:
delete protocols static route 10.0.55.1/32 next-hop 10.14.0.13 disable


iPerf Results (ZeroTier ECMP with 4 cores):

[SUM]   0.00-10.00  sec  2.25 GBytes  1.93 Gbits/sec                  receiver


It took all 4 cores, but we were finally able to beat both WireGuard and IPsec on this cheap mini PC. Getting almost 2Gbps of encrypted throughput with ZeroTier for a little over $100 USD per PC is kind of amazing.


Here was the CPU utilization during that test:


Part of me thought about stopping this post here, but seeing the near linear scaling made me very curious. I have some other PCs that can max out a 2.5Gbps interface with a single ZeroTier instance; what could a more capable box get as a max aggregate throughput?

Building a (bigger) Proof of Concept – The Hardware

As I was starting to plan all of this, Minisforum announced a new mini PC with 10G networking and a PCIe slot, which could allow for even greater network speeds. I went back and forth on whether or not to get a couple of them, but I ultimately figured I could find a use for them later as network-attached storage boxes or something.


The PC is called the MS-01, and it comes with a 14-core Intel processor (6 P-cores, 8 E-cores). The max clock speed is 5.4GHz for the P-cores and 4GHz for the E-cores. I spec’ed them out with 96GB of RAM since I plan on turning them into servers later.


You can check out the MS-01 here:
https://store.minisforum.com/products/minisforum-ms-01?_pos=1&_sid=83a4251da&_ss=r


I’ve received them, let’s do some baseline testing for these. I’m using the 10G interfaces between the boxes.


Further Testing

We’re going to port our config from the GK41s directly to the MS-01s, including the config files for our containers, so everything stays the same. Additionally, since we have 6 P-cores, I’m going to go ahead and configure 6 instances of ZeroTier on each system.


Let’s get some baselines for this system.


Unencrypted:

[SUM]   0.00-10.00  sec  10.9 GBytes  9.35 Gbits/sec                  receiver

WireGuard:

[  5]   0.00-10.00  sec  7.63 GBytes  6.55 Gbits/sec                  receiver

IPsec:

[  5]   0.00-10.00  sec  6.66 GBytes  5.72 Gbits/sec                  receiver

OpenVPN:

[  5]   0.00-10.00  sec  2.54 GBytes  2.18 Gbits/sec                  receiver

ZeroTier:

[SUM]   0.00-10.00  sec  5.23 GBytes  4.49 Gbits/sec                  receiver


Even without touching anything, that’s pretty impressive: we see nearly triple the performance of the GK41 across the board. That would probably be plenty for most people, but I want to see if we can get 10Gbps out of ZeroTier. Let’s try 2 cores.


iPerf Results (ZeroTier ECMP with 2 cores):

[SUM]   0.00-10.00  sec  9.76 GBytes  8.38 Gbits/sec                  receiver


Well, I can already see a problem…we’re again seeing near-linear scaling, which means we only need 3 cores to max out 10G.


Remember how I said this mini PC has a PCIe slot in it…let’s order some 25Gb cards.



Now that they’re installed, let’s start ramping up our core count. Remember we have 6 P-Cores.


iPerf Results (ZeroTier ECMP with 3 cores):

[SUM]   0.00-10.00  sec  13.5 GBytes  11.6 Gbits/sec                  receiver


We took down that 10G target pretty easily.


iPerf Results (ZeroTier ECMP with 4 cores):

[SUM]   0.00-10.00  sec  16.4 GBytes  14.1 Gbits/sec                  receiver

iPerf Results (ZeroTier ECMP with 5 cores):

[SUM]   0.00-10.01  sec  19.2 GBytes  16.5 Gbits/sec                  receiver


We’re getting close, but we’re seeing that linear scaling decrease (possible reason for that in a bit). Let’s see if 6 P-Cores is enough to max out the 25Gb connection.


iPerf Results (ZeroTier ECMP with 6 cores):

[SUM]   0.00-10.00  sec  22.0 GBytes  18.9 Gbits/sec                  receiver


Sadly, 6 cores fell a little short. It’s very possible I could have maxed it out with just 6 cores: I’m not pinning any of the containers to the P-cores, so some of those smaller ~2Gbps jumps when adding cores may have been the CPU scheduler placing containers on the E-cores.
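
If you wanted to rule that out, the instances could be pinned to the P-cores. VyOS runs its containers with podman under the hood; I haven’t tested pinning here, but with plain podman it would look something like this (container name, paths, and core IDs are placeholders; on hybrid Intel parts, check which CPU IDs are actually P-cores, e.g. via /sys/devices/cpu_core/cpus):

```shell
# Pin one ZeroTier instance to (assumed) P-cores 0-5
podman run -d --name zt1 \
  --cpuset-cpus 0-5 \
  --cap-add NET_ADMIN \
  --network host \
  -v /config/zt/zt1:/var/lib/zerotier-one \
  zerotier/zerotier:latest
```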


We’ve come this far, so we might as well push it over the edge….we do still have 8 E-Cores after all.


I added 2 more containers to finish the test. Let’s see what we get with 8 ECMP paths for ZeroTier:


iPerf Results (ZeroTier ECMP with 8 cores):

[SUM]   0.00-10.00  sec  24.8 GBytes  21.3 Gbits/sec                  receiver


WireGuard who?



That’s pretty much the max we should expect given the per-packet overhead, and we still have 6 E-Cores remaining in theory. I know what you’re thinking…but I’m not going to buy some 40Gb cards.

Conclusion

You may wonder how practical a solution like this is. Anyone who has configured VPNs to an AWS VPG already knows that overcoming its encrypted-throughput limit (1.25Gbps per tunnel to a VPG) requires simply adding more tunnels and running ECMP over them. While no single flow can exceed that limit, the aggregate throughput of all traffic can reach pretty impressive speeds. With our MS-01s, we can have single flows of 4.5Gbps. To put that in perspective, that’s enough to saturate a SATA3 SSD; in aggregate, we can almost saturate a PCIe 3.0 NVMe SSD.


For ZeroTier, this speaks to its promise as a solution going forward. As the developers continue to improve the product and add native multi-threading, you may start to see ZeroTier become the best-performing solution without needing to design ECMP.

7 responses to “Chasing Performance in ZeroTier – VyOS on Minisforum”

  1. Great article. Curious as to which USB NICs you used (assuming RTL8156/B/BG or similar?).

    1. I used this adapter: https://a.co/d/5ASOq2f

      It uses a Realtek RTL8156B chip

  2. Amazing read, thanks!

  3. So zerotier silently released the ability to use multiple cores. Check it out when you get a chance.

    1. Yeah, I talked to them when they were first implementing it. The issue then was that when doing high throughput, a large number of packets would be thrown, and they were doing per packet multithreading. That led to a lot of packet reordering, which actually slowed things down in high throughput applications. I just saw this update on the PR for the change:

      Some updates:

      Packets are sorted by flow to prevent re-ordering (though this doesn’t seem to be a full solution)
      Configuration is now done via local.conf, not environment variables

      That’s pretty promising and should provide a consistent experience. I don’t have the test bed I used anymore unfortunately (I have the cheap PCs, but not the 25Gbps setup), so I can’t fully compare, but if you can do 4.5Gbps per flow, then getting a good aggregate throughput should be easy. I might test with the cheap PCs with the 2.5Gbps USB NICs and see what I get. That would tell me the scaling between my solution and their official solution. With them changing to flow based multithreading, the throughput should be comparable, with the difference being you’d just need to adjust a few settings in the local.conf file instead of setting up a bunch of different containers (which also eats into your node count).

      Thanks for the message.

    2. Quick update on this. I did test it using one of the Mini PCs (4 core). It did improve speeds; I saw 1Gbps of throughput, but it didn’t seem to want to scale beyond 2 cores, even with a concurrency of 4. So the solution in the article can achieve greater throughput, but I feel like 1Gbps will max most people’s connections. You might even be able to do a hybrid and halve the number of containers in my article for the same results.
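
      For anyone wanting to try the official multithreading instead of the container approach: as of the builds I tested, the knobs live under settings in local.conf and look roughly like this (names per the PR discussed above; double-check the current ZeroTier documentation, since this has changed between versions):

```json
{
  "settings": {
    "multicoreEnabled": true,
    "concurrency": 4,
    "cpuPinningEnabled": true
  }
}
```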

  4. I just discovered your blog.
    I knew VyOS was feature rich, but I didn’t imagine it performed so well, or had NFV support via containers, application filtering/shaping, or firewall acceleration. It puts VyOS on the same level as some famous network products.
    It blew my mind and I’ll look into it ASAP.
    Thank you a million times for this information.
    I’ll be reading your blog from now on.
