I’m looking for performance numbers on the Espressobin.
Mostly looking for WAN-LAN router throughput and OpenVPN performance. Does anyone who has an Espressobin want to share some results?
I tested iperf between espressobin wan port (should be same with the lan ports) and a Linux pc. I got more than 900Mbit/s on both receiving and sending traffic.
Simultaneous 900Mbit (1800Mbit total)? Was it NAT/firewalled? Sounds promising.
I ran the iperf client and server tests separately on the Espressobin. It’s a simple test that does not involve NAT or a firewall.
That performance is to be expected from the dedicated switch built in, it is likely not having to do much if any processing on the marvell’s end. The performance with NAT/Firewall will likely be substantially less.
Once I get mine running I will post some benchmark numbers here around NAT/firewall, and tunneling such as openvpn. I don’t expect much, as my 32 thread xeon is barely able to do 700Mbps without heavy optimizations, but they advertised some hardware accelerated crypto on the SOC, so it would be interesting to see if it is able to do openvpn traffic at 100Mbps, as that is the speed of my internet uplink.
I’m about to do some throughput tests as well. Testing with Espresso Bin and buildroot-2015.11-16.08 and I am going to use a smartbits device for throughput testing in router mode.
Seeing that the switch is apparently connected via RGMII with a 2.5 Gbps link:
U-Boot 2015.01-armada-17.02.0-gc80c919 (Mar 04 2017 – 15:51:07)
DRAM: 1 GiB
CPU @ 800 [MHz]
L2 @ 800 [MHz]
TClock @ 200 [MHz]
DDR @ 800 [MHz]
Comphy-0: PEX0 2.5 Gbps
Comphy-1: USB3 5 Gbps
Comphy-2: SATA0 5 Gbps
and the link to the switch itself is reported as 1Gbps:
# ip link set dev eth0 up
# [ 404.704060] mvneta d0030000.ethernet eth0: Link is Up – 1Gbps/Full – flow control off
I’m not sure whether the switch is capable of going faster than the reported 1Gbps via RGMII, but 2.5Gbps would be the upper limit, right?
Then I would suspect that we will see no more than 1.25Gbps bidirectional throughput going from lan1 to wan interface while using the CPU for routing. That would be ~750Mbps unidirectional per port.
The upper limit for a gbit network is ~2Gbps throughput bidirectional, which means the connection between switch and CPU would have to manage ~4Gbps when routing/NATing between lan1 and wan for example. I’ll try to test also with an additional USB3.0 network adapter connected to Comphy-1, to get a little bit less of a bottleneck on the RGMII bus.
The SoC is connected with 2.5Gbit (full duplex) to the switch. There shouldn’t be any bottlenecks on the network side for a near 2Gbit total throughput per port.
Ok, I did my first smartbits test. Here is the setup I have used:
yocto with latest changes, built following the software howto , and also with enabled CONFIG_NETFILTER options in the kernel.
root@cb-88f3720-ddr3-expbin:~# uname -a
Linux cb-88f3720-ddr3-expbin 4.4.8-armada-17.02.2-armada-17.02.2+g8148be9 #3 SMP PREEMPT Thu Apr 6 14:48:55 CEST 2017 aarch64 GNU/Linux
I ran this script before testing:
root@cb-88f3720-ddr3-expbin:~# cat setup.sh
ip link set lan0 up
ifconfig lan0 172.18.1.1 netmask 255.255.0.0 up
ip link set wan up
ifconfig wan 172.19.1.1 netmask 255.255.0.0 up
echo 1 > /proc/sys/net/ipv4/ip_forward
The smartbits puts UDP load bidirectionally through the lan0 and wan ports and tries to find the maximum throughput at which less than 0.5 % of the sent frames get lost:
smb (max 1Gbps) -> lan0 -> CPU -> wan -> smb
smb <- lan0 <- CPU <- wan <- smb (max 1 Gbps)
The iptables netfilter rules are empty (default rule ACCEPT) , only the conntrack entries for the UDP connection are used for fast forwarding of the frames:
root@cb-88f3720-ddr3-expbin:~# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
root@cb-88f3720-ddr3-expbin:~# cat /proc/net/ip_conntrack
udp 17 174 src=172.18.13.1 dst=172.19.13.1 sport=5000 dport=5000 src=172.19.13.1 dst=172.18.13.1 sport=5000 dport=5000 [ASSURED] use=2
udp 17 175 src=172.18.11.1 dst=172.19.11.1 sport=5000 dport=5000 src=172.19.11.1 dst=172.18.11.1 sport=5000 dport=5000 [ASSURED] use=2
udp 17 175 src=172.18.10.1 dst=172.19.10.1 sport=5000 dport=5000 src=172.19.10.1 dst=172.18.10.1 sport=5000 dport=5000 [ASSURED] use=2
udp 17 174 src=172.18.12.1 dst=172.19.12.1 sport=5000 dport=5000 src=172.19.12.1 dst=172.18.12.1 sport=5000 dport=5000 [ASSURED] use=2
Here are my first results. The constant frame rate for different frame sizes is usually a good indicator for a functioning network offload engine of some kind (DMA etc.)
FrameSize | FrameRate | BitRate
[bytes] | [fps] | [bps] L3
64 | 87516 | 44814528
128 | 85916 | 87995547
256 | 85730 | 175595067
512 | 85601 | 350749484
1024 | 82315 | 674329501
1280 | 83947 | 859692307
1400 | 85002 | 952332746
1514 | 76186 | 922809647
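For anyone who wants to sanity-check the table, here is a rough Python sketch. It recomputes the bit rate as frame size × 8 × frame rate, which comes out close to the BitRate column, and compares the measured frame rate with the theoretical one-way gigabit line rate. The 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter, inter-frame gap) are standard Ethernet. That the smartbits frame sizes include the FCS is my assumption, not something stated in the post.

```python
# Per-frame overhead on the wire that is not part of the frame itself:
# 7 bytes preamble + 1 byte start-of-frame delimiter + 12 bytes inter-frame gap.
WIRE_OVERHEAD_BYTES = 20
LINK_BPS = 1_000_000_000  # one gigabit Ethernet port, one direction


def frame_bitrate(frame_size, frame_rate):
    """Bits per second carried by frames of frame_size bytes at frame_rate fps."""
    return frame_size * 8 * frame_rate


def line_rate_fps(frame_size, link_bps=LINK_BPS):
    """Maximum frames per second that one direction of the link can carry.

    Assumption: frame_size already includes the 4-byte FCS.
    """
    return link_bps / ((frame_size + WIRE_OVERHEAD_BYTES) * 8)


# (frame size in bytes, measured frame rate in fps), copied from the table above
measured = [(64, 87516), (512, 85601), (1514, 76186)]

for size, fps in measured:
    print(
        f"{size:5d} bytes: {frame_bitrate(size, fps):>11d} bps, "
        f"one-way line rate {line_rate_fps(size):>9.0f} fps, measured {fps} fps"
    )
```

At small frame sizes the measured rate is far below line rate, which fits a packets-per-second limit in the CPU path rather than a bandwidth limit.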
Since this is my first shot I’m sure there are possibilities to optimize the throughput. I have already checked the /proc/irq/9/smp_affinity setting for the interrupt allocated for eth0 (i.e. RGMII) and it is set to “3”, meaning both cores can be utilized for eth0 activity. But I only got 50% CPU load for the ksoftirqd/0 thread during the high network load phases, so it looks like only one core gets busy currently.
Any hints for improvement of the setup ?
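Side note on the smp_affinity value: it is a hexadecimal CPU bitmask, where bit 0 stands for cpu0, bit 1 for cpu1 and so on. A small Python sketch to decode or build such a mask (the function names are mine, purely for illustration):

```python
def cpus_in_affinity_mask(mask_hex):
    """Return the CPU indices selected by an smp_affinity hex bitmask."""
    # On machines with many CPUs the kernel prints comma-separated 32-bit groups.
    mask = int(mask_hex.strip().replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


def affinity_mask_for(cpus):
    """Build the hex string to write to smp_affinity for the given CPU indices."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


print(cpus_in_affinity_mask("3"))  # [0, 1] -> both cores allowed
print(cpus_in_affinity_mask("2"))  # [1]    -> cpu1 only
print(affinity_mask_for([0]))      # 1      -> cpu0 only
```

Note that the mask only says which cores are allowed to handle the interrupt. It does not guarantee that the load is actually spread across them.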
I did another throughput test, this time with a mini PCIe Gbit network card (RTL8111 chipset using the linux rtl8169 driver). After recompiling the kernel with the required driver option, the card is recognized and works ok. I also added the card’s firmware blob manually under /lib/firmware/rtl_nic/rtl8168e-3.fw (took the firmware from the debian package firmware-nonfree_0.43.tar.gz )
The setup is the same, but this time I tuned the SMP affinity of the interrupts 9 and 32 to share the workload of LAN and WAN interfaces to both cores:
echo "3" > /proc/irq/9/smp_affinity
echo "2" > /proc/irq/32/smp_affinity
This worked as intended, as the following output after the throughput test run shows:
root@cb-88f3720-ddr3-expbin:~# cat /proc/interrupts
9: 88062744 0 GICv3 74 Level eth0
32: 1 12487562 GICv3 61 Level advk-pcie
100: 1 12487562 d0070000.pcie-msi 0 Edge eth1
The output of top during the test also showed that both cores get really busy this time (during my last test CPU load only went up to 50 %)
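To check the interrupt distribution without reading the raw counters by eye, something like the following Python sketch could be used. It parses text in the /proc/interrupts format and prints the share of each IRQ handled per core. The sample text is the output above with whitespace added, and passing the number of CPUs in by hand is a simplification on my part, since the pasted output lacks the CPU header line.

```python
def per_cpu_irq_counts(text, ncpus):
    """Map IRQ label -> (device name, [count on cpu0, count on cpu1, ...])."""
    counts = {}
    for line in text.splitlines():
        irq, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip the header line and summary lines that lack per-CPU columns.
        if not sep or len(fields) <= ncpus:
            continue
        if not all(field.isdigit() for field in fields[:ncpus]):
            continue
        counts[irq.strip()] = (fields[-1], [int(f) for f in fields[:ncpus]])
    return counts


sample = """\
  9:   88062744          0     GICv3  74 Level     eth0
 32:          1   12487562     GICv3  61 Level     advk-pcie
100:          1   12487562  d0070000.pcie-msi   0 Edge      eth1
"""

for irq, (name, per_cpu) in per_cpu_irq_counts(sample, ncpus=2).items():
    total = sum(per_cpu) or 1  # avoid division by zero for idle IRQs
    shares = ", ".join(f"cpu{i}: {100 * n / total:.1f}%" for i, n in enumerate(per_cpu))
    print(f"IRQ {irq:>3} ({name}): {shares}")
```

On a live system the text would come from reading /proc/interrupts before and after a test run and subtracting the counters.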
These are my results, which are notably better than the ones using lan0 and wan via the switch:
FrameSize | FrameRate | BitRate
64 | 136064 | 69670588
128 | 133782 | 137000531
256 | 127721 | 261595952
512 | 122464 | 501616536
1024 | 116115 | 951226069
1280 | 111183 | 1139557695
1400 | 110145 | 1234474020
1514 | 106913 | 1294944515
1518 | 107531 | 1306439707
So the throughput goes up to ~1.3 Gbps L3 this time. The RTL8111 PCIe Ethernet controller I used is not the fastest one, so maybe there is still room for further gains. IMHO the results so far show that either the network driver’s interrupt thread using only one core (maybe my smp_affinity settings were wrong?) or the single RGMII interface shared by the lan0, lan1 and wan interfaces on the switch poses a bottleneck on the maximum throughput possible with the processor.
In case you don’t use htop already, better install it since it displays individual core utilization. By default, regardless of IRQ affinity, all IRQs are processed by cpu0, so without manually tweaking IRQ affinity you get bottlenecked by the CPU pretty fast: https://github.com/armbian/build/blob/master/scripts/armhwinfo#L217
On a related note: How fast do CPU cores clock in your installation: /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
Cpufreq scaling should scale between 800 and 1200 MHz, but at least the Armbian dev currently working on Espressobin reports the CPU cores being limited to 800 MHz for whatever reason (examining the .dtsi, I noticed that the OPPs for dvfs/cpufreq scaling seem to be missing at the moment, but I didn’t have a closer look).
The older Armada 38x SoCs (Cortex-A9 and clocking up to 1.6GHz) are known to saturate a 2.5GbE link so there’s still some hope 😉
Thanks for the hints. My results above are all while running at 800MHz, and without touching the fdt or the cpufreq driver I think that’s all I can get:
root@cb-88f3720-ddr3-expbin:~# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
133333 200000 400000 800000
root@cb-88f3720-ddr3-expbin:~# cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
800000
I’m somewhat reluctant to overclock the board, as it is the only one I have currently ;).
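For reference, the same cpufreq values can be collected in one go with a few lines of Python. This is only a sketch: the file names are the ones mentioned in this thread, the sysfs values are in kHz, and a file that is missing or not readable (cpuinfo_cur_freq usually needs root) is reported as None.

```python
from pathlib import Path

CPUFREQ_DIR = Path("/sys/devices/system/cpu/cpu0/cpufreq")
CPUFREQ_FILES = (
    "scaling_available_frequencies",
    "cpuinfo_max_freq",
    "cpuinfo_cur_freq",
)


def read_cpufreq(cpufreq_dir=CPUFREQ_DIR):
    """Return {file name: list of frequencies in MHz, or None if unreadable}."""
    result = {}
    for name in CPUFREQ_FILES:
        try:
            text = (Path(cpufreq_dir) / name).read_text()
        except OSError:
            # File missing (driver does not expose it) or permission denied.
            result[name] = None
            continue
        # sysfs reports kHz; convert to whole MHz for readability.
        result[name] = [int(khz) // 1000 for khz in text.split()]
    return result


for name, mhz in read_cpufreq().items():
    print(f"{name}: {mhz}")
```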
Regarding the affinity settings, I used mainly the interrupt distribution shown in /proc/interrupts for checking if I get any effects. Since I cannot split the interrupt handling for the interfaces or for rx/tx , I can only shift all of the network handling to cpu1, which the github commit you mentioned above does. I tried it during my test, but setting the smp_affinity to 1, 2 or 3 did not make a real difference.
The u-boot snip:
Comphy-0: PEX0 2.5 Gbps ^^^^^^^^
means that SERDES lane 0 is used for PCI Express. RGMII is a dedicated bus (not SERDES) with 1Gbps RX and 1Gbps TX. Hence, I think, the EspressoBin can’t handle more than 1Gbps throughput by hardware design (without the help of an additional NIC connected to the PCI-e slot).
If you consider routing between two 1Gb ethernet networks connected to two 1Gb ethernet ports (for example wan and lan0 in the EspressoBin case), maximum throughput is 2Gbps — just because each port can receive 1Gbps at maximum (2Gbps in sum) and there is no way how to get more data into the router. How did you calculate the value 4Gbps?
The maximum throughput benchmark result of bidirectional routing with additional 1Gbps PCI-e NIC is better (L3/L2? ~0.95Gbps vs L3/L2? ~1.3Gbps) mainly because the theoretical hardware throughput increased from L1 1Gbps to L1 2Gbps.
The FrameRate improvement is interesting; I can’t explain it.
> If you consider routing between two 1Gb ethernet networks connected to two 1Gb ethernet ports (for example wan and lan0 in the EspressoBin case), maximum throughput is 2Gbps — just because each port can receive 1Gbps at maximum (2Gbps in sum) and there is no way how to get more data into the router. How did you calculate the value 4Gbps?
By 4Gbps I meant “bidirectional”, which equals 2 Gbps full duplex. If you look at the setup again, it should become clear:
Smartbits Sender (port 1, max 1Gbps tx) -> lan0 ---RGMII--> CPU ---RGMII--> wan -> Smartbits Receiver (port 2, max 1 Gbps rx)
Smartbits Receiver(port 1, max 1Gbps rx) <- lan0 <--RGMII--- CPU <---RGMII-- wan <- Smartbits Sender (port 2, max 1 Gbps tx)
Each gbit port can manage 1Gbps full duplex, i.e. send 1Gbps and simultaneously receive 1Gbps. If the CPU handles each packet for routing, the packet has to pass the RGMII interface twice: once from the switch to the CPU and once back. This adds up to a total of 4Gbps of bidirectional traffic, or 2Gbps full duplex, passing the RGMII bus.
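Spelled out as arithmetic, in a small Python sketch. The two link capacities at the end are the two claims made in this thread (1 Gbps per direction for RGMII, 2.5 Gbps full duplex for the SoC-to-switch link). I am not asserting which one is correct.

```python
def cpu_link_load(port_rate_gbps, routed_ports):
    """Return (Gbps towards the CPU, Gbps away from the CPU) at full load.

    Every frame received on a routed port crosses the switch-to-CPU link
    once on the way in and once more on the way out after routing.
    """
    to_cpu = routed_ports * port_rate_gbps    # all ingress traffic of all ports
    from_cpu = routed_ports * port_rate_gbps  # the same frames, forwarded
    return to_cpu, from_cpu


to_cpu, from_cpu = cpu_link_load(port_rate_gbps=1.0, routed_ports=2)
print(f"towards CPU: {to_cpu} Gbps, from CPU: {from_cpu} Gbps, "
      f"sum: {to_cpu + from_cpu} Gbps")

# Per-direction link capacities claimed in this thread (unverified).
for capacity in (1.0, 2.5):
    print(f"link with {capacity} Gbps per direction: "
          f"{min(1.0, capacity / to_cpu):.0%} of full bidirectional load fits")
```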
The frame rate improvement is because using the dedicated interrupt of the PCIe interface I can now split the workload between the two cores using the smp_affinity settings: IRQ 32 (PCIe) is handled by core 1, irq 9 (RGMII) is handled by core0. Both cores together can manage more frames.
In my first test there was only the RGMII interrupt (irq 9) signalling all network traffic. This was exclusively handled by one core, while the other one was idling. This bottlenecked the frame rate.
> I’m somewhat reluctant to overclock the board, as it is the only one I have currently 😉
Understandable. At the moment both cpufreq scaling and THS (thermal readouts) seem not to work. Since the SoC seems to be approx. 11x12mm in size I already ordered some 1.5mm thick copper shims to connect SoC and Topaz Switch to the enclosure bottom of an aluminium box (so the case will be used for heat dissipation).
As far as I understood SoC and Topaz switch are interconnected with 2.5GbE, what about using two GbE clients, starting
taskset -c 0 iperf3 -s -p 5021 and
taskset -c 1 iperf3 -s -p 5022 on the Espressobin and then testing with each client against port 5021 and 5022 respectively?
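If someone wants to script this, here is a minimal Python sketch that only builds the command lines: one iperf3 server per core on the Espressobin plus the matching client command. The address 192.0.2.1 is a placeholder for the board, not a real setup, and the helper names are made up for illustration.

```python
import shlex


def server_commands(ports, first_core=0):
    """One pinned iperf3 server per port: core 0 for the first port, and so on."""
    return [
        ["taskset", "-c", str(first_core + i), "iperf3", "-s", "-p", str(port)]
        for i, port in enumerate(ports)
    ]


def client_command(host, port):
    """iperf3 client invocation for one of the GbE clients."""
    return ["iperf3", "-c", host, "-p", str(port)]


PORTS = [5021, 5022]
BOARD = "192.0.2.1"  # placeholder address of the Espressobin

for cmd in server_commands(PORTS):
    print("on the board: ", shlex.join(cmd))
for port in PORTS:
    print("on one client:", shlex.join(client_command(BOARD, port)))
```

The lists can be passed to subprocess.Popen as they are if the servers should be started from the script.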