Performance (Router)

This topic contains 18 replies, has 9 voices, and was last updated by  happypackets 5 months, 1 week ago.

Viewing 15 posts - 1 through 15 (of 19 total)
    #308

    Majkey
    Participant

    I’m looking for performance numbers on the Espressobin.
    Mostly looking for WAN-LAN router throughput and OpenVPN performance. Does anyone who has an Espressobin want to share some results?

    #313

    Benjamin Huang
    Keymaster

    I tested iperf between the Espressobin WAN port (the LAN ports should behave the same) and a Linux PC. I got more than 900Mbit/s for both receiving and sending traffic.

    #314

    Majkey
    Participant

    Simultaneous 900Mbit (1800Mbit total)? Was it NAT/firewalled? Sounds promising.

    #321

    Benjamin Huang
    Keymaster

    I ran the iperf client and server tests separately on the Espressobin. It’s a simple test not involving NAT/firewall.

    #323

    biothundernxt
    Participant

    That performance is to be expected from the dedicated built-in switch; the Marvell SoC likely has to do little, if any, processing. Performance with NAT/firewall will likely be substantially lower.
    Once I get mine running I will post some benchmark numbers here for NAT/firewall and for tunneling such as OpenVPN. I don’t expect much, as my 32-thread Xeon is barely able to do 700Mbps without heavy optimizations. But they advertised some hardware-accelerated crypto on the SoC, so it would be interesting to see whether it can handle OpenVPN traffic at 100Mbps, which is the speed of my internet uplink.

    #330

    peter
    Participant

    I’m about to do some throughput tests as well. Testing with Espresso Bin and buildroot-2015.11-16.08 and I am going to use a smartbits device for throughput testing in router mode.
    Seeing that the switch is apparently connected via RGMII with a 2.5 Gbps link:

    <snip>
    U-Boot 2015.01-armada-17.02.0-gc80c919 (Mar 04 2017 - 15:51:07)

    I2C: ready
    DRAM: 1 GiB
    Board: DB-88F3720-ESPRESSOBin
    CPU @ 800 [MHz]
    L2 @ 800 [MHz]
    TClock @ 200 [MHz]
    DDR @ 800 [MHz]
    Comphy-0: PEX0 2.5 Gbps
    ^^^^^^^^
    Comphy-1: USB3 5 Gbps
    Comphy-2: SATA0 5 Gbps

    </snip>

    and the link to the switch itself is reported as 1Gbps:

    <snip>
    # ip link set dev eth0 up
    # [ 404.704060] mvneta d0030000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off
    </snip>

    I’m not sure whether the switch can go faster than the reported 1Gbps via RGMII, but 2.5Gbps would be the upper limit, right?
    Then I would suspect that we will see no more than 1.25Gbps of bidirectional throughput going from the lan1 to the wan interface while using the CPU for routing. That would be ~625Mbps unidirectional per port.

    The upper limit for a Gbit network is ~2Gbps of bidirectional throughput, which means the connection between switch and CPU would have to manage ~4Gbps when routing/NATing between lan1 and wan, for example. I’ll also try to test with an additional USB 3.0 network adapter connected to Comphy-1, to ease the bottleneck on the RGMII bus a little.

    #331

    Majkey
    Participant

    The SoC is connected with 2.5Gbit (full duplex) to the switch. There shouldn’t be any bottlenecks on the network side for a near 2Gbit total throughput per port.

    #333

    peter
    Participant

    Ok, I did my first smartbits test. Here is the setup I have used:

    Yocto with the latest changes, built following the software howto, with the CONFIG_NETFILTER options also enabled in the kernel.
    root@cb-88f3720-ddr3-expbin:~# uname -a
    Linux cb-88f3720-ddr3-expbin 4.4.8-armada-17.02.2-armada-17.02.2+g8148be9 #3 SMP PREEMPT Thu Apr 6 14:48:55 CEST 2017 aarch64 GNU/Linux

    I ran this script before testing:

    root@cb-88f3720-ddr3-expbin:~# cat setup.sh
    #!/bin/sh -ex
    ip link set lan0 up
    ifconfig lan0 172.18.1.1 netmask 255.255.0.0 up
    ip link set wan up
    ifconfig wan 172.19.1.1 netmask 255.255.0.0 up
    echo 1 > /proc/sys/net/ipv4/ip_forward

    The SmartBits puts UDP load bidirectionally through the lan0 and wan ports and tries to find the maximum throughput at which less than 0.5% of the sent frames are lost:

    smb (max 1Gbps) -> lan0 -> CPU -> wan -> smb
    smb <- lan0 <- CPU <- wan <- smb (max 1 Gbps)

    The iptables netfilter rules are empty (default policy ACCEPT); only the conntrack entries for the UDP connections are used for fast forwarding of the frames:

    root@cb-88f3720-ddr3-expbin:~# iptables -L
    Chain INPUT (policy ACCEPT)
    target prot opt source destination

    Chain FORWARD (policy ACCEPT)
    target prot opt source destination

    Chain OUTPUT (policy ACCEPT)
    target prot opt source destination

    root@cb-88f3720-ddr3-expbin:~# cat /proc/net/ip_conntrack
    udp 17 174 src=172.18.13.1 dst=172.19.13.1 sport=5000 dport=5000 src=172.19.13.1 dst=172.18.13.1 sport=5000 dport=5000 [ASSURED] use=2
    udp 17 175 src=172.18.11.1 dst=172.19.11.1 sport=5000 dport=5000 src=172.19.11.1 dst=172.18.11.1 sport=5000 dport=5000 [ASSURED] use=2
    udp 17 175 src=172.18.10.1 dst=172.19.10.1 sport=5000 dport=5000 src=172.19.10.1 dst=172.18.10.1 sport=5000 dport=5000 [ASSURED] use=2
    udp 17 174 src=172.18.12.1 dst=172.19.12.1 sport=5000 dport=5000 src=172.19.12.1 dst=172.18.12.1 sport=5000 dport=5000 [ASSURED] use=2

    Here are my first results. A frame rate that stays roughly constant across frame sizes is usually a good indicator that some kind of network offload engine (DMA etc.) is working.

    FrameSize | FrameRate | BitRate
    [bytes] | [fps] | [bps] L3
    64 | 87516 | 44814528
    128 | 85916 | 87995547
    256 | 85730 | 175595067
    512 | 85601 | 350749484
    1024 | 82315 | 674329501
    1280 | 83947 | 859692307
    1400 | 85002 | 952332746
    1514 | 76186 | 922809647
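As a rough cross-check of the table above, the L3 bit rate should be close to frame size × 8 × frame rate. A minimal sketch in POSIX shell follows. The helper names `l3_bitrate` and `gbe_line_rate_fps` are made up for illustration and are not part of any SmartBits tooling. The 20 bytes of per-frame overhead (preamble, start delimiter, inter-frame gap) assume standard Ethernet framing with the frame size including the FCS, which the post does not confirm.

```shell
#!/bin/sh
# Hypothetical helpers for sanity-checking the results table.

# L3 bit rate in bps for a frame size ($1, bytes) and frame rate ($2, fps).
l3_bitrate() {
    echo $(( $1 * 8 * $2 ))
}

# Theoretical frames per second in one direction of a 1 Gbps link,
# assuming 20 bytes of preamble + inter-frame gap on top of each frame.
gbe_line_rate_fps() {
    echo $(( 1000000000 / (($1 + 20) * 8) ))
}

l3_bitrate 64 87516      # first row of the table
gbe_line_rate_fps 64     # line rate for 64-byte frames, for comparison
```

For the 64-byte row this gives 44808192 bps, very close to the reported 44814528, so the BitRate column appears to be derived from an unrounded frame rate.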

    Since this is my first shot I’m sure there is room to optimize the throughput. I have already checked the /proc/irq/9/smp_affinity setting for the interrupt allocated to eth0 (i.e. RGMII) and it is set to “3”, meaning both cores can be used for eth0 activity. But I only got 50% CPU load for the ksoftirqd/0 thread during the high network load phases, so it looks like only one core is busy currently.

    Any hints for improving the setup?

    #354

    peter
    Participant

    I did another throughput test, this time with a mini PCIe Gbit network card (RTL8111 chipset using the Linux rtl8169 driver). After recompiling the kernel with the required driver option, the card is recognized and works OK. I also added the card’s firmware blob manually under /lib/firmware/rtl_nic/rtl8168e-3.fw (I took the firmware from the Debian package firmware-nonfree_0.43.tar.gz).

    The setup is the same, but this time I tuned the SMP affinity of interrupts 9 and 32 to spread the workload of the LAN and WAN interfaces across both cores:

    echo 3 > /proc/irq/9/smp_affinity
    echo 2 > /proc/irq/32/smp_affinity
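The values written to smp_affinity are hexadecimal CPU bitmasks: bit 0 stands for CPU0, bit 1 for CPU1, so 3 means both cores and 2 means CPU1 only. A small sketch of how such a mask can be built from a list of CPU indices; `cpu_mask` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# cpu_mask: hypothetical helper that turns a list of CPU indices into the
# hexadecimal bitmask format used by /proc/irq/<n>/smp_affinity.
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))   # set the bit for this CPU
    done
    printf '%x\n' "$mask"
}

cpu_mask 0 1   # both cores
cpu_mask 1     # second core only
```

As root, the result could then be written with something like `cpu_mask 1 > /proc/irq/32/smp_affinity`, using the IRQ numbers from this post.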

    This worked as intended, as the following output after the throughput test run shows:
    root@cb-88f3720-ddr3-expbin:~# cat /proc/interrupts
    CPU0 CPU1

    9: 88062744 0 GICv3 74 Level eth0

    32: 1 12487562 GICv3 61 Level advk-pcie

    100: 1 12487562 d0070000.pcie-msi 0 Edge eth1

    The output of top during the test also showed that both cores get really busy this time (during my last test CPU load only went up to 50 %)

    These are my results, which are notably better than the ones using lan0 and wan via the switch:

    
    FrameSize | FrameRate | BitRate
    [bytes] | [fps] | [bps] L3
    64 | 136064 | 69670588
    128 | 133782 | 137000531
    256 | 127721 | 261595952
    512 | 122464 | 501616536
    1024 | 116115 | 951226069
    1280 | 111183 | 1139557695
    1400 | 110145 | 1234474020
    1514 | 106913 | 1294944515
    1518 | 107531 | 1306439707
    

    So the throughput goes up to ~1.3 Gbps L3 this time. The RTL8111 PCIe Ethernet controller I used is not the fastest one, so there may still be room for further gains. IMHO the results so far show that either the network driver’s interrupt thread using only one core (maybe my smp_affinity settings were wrong?) or the single RGMII interface shared by the lan0, lan1 and wan interfaces on the switch is the bottleneck for the maximum throughput possible with the processor.
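To put the two test runs side by side, the frame-rate gain per frame size can be computed directly from the numbers in the two result tables. This is only a sketch; `pct_gain` is a hypothetical helper and uses integer arithmetic, so results are truncated to whole percent.

```shell
#!/bin/sh
# pct_gain: hypothetical helper, integer percent increase from $1 to $2.
pct_gain() {
    echo $(( ($2 - $1) * 100 / $1 ))
}

# Frame rates (fps) from the switch-only test vs. the PCIe NIC test.
echo "64 bytes:   +$(pct_gain 87516 136064)%"
echo "1024 bytes: +$(pct_gain 82315 116115)%"
echo "1514 bytes: +$(pct_gain 76186 106913)%"
```

With the posted figures this comes out at roughly 40 to 55 percent more frames per second, depending on frame size.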

    #355

    tkaiser
    Participant

    If you don’t use htop already, you should install it, since it displays individual core utilization. By default, regardless of IRQ affinity, all IRQs are processed by cpu0, so without manually tweaking IRQ affinity you quickly become bottlenecked by the CPU: https://github.com/armbian/build/blob/master/scripts/armhwinfo#L217

    On a related note: How fast do CPU cores clock in your installation: /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq

    Cpufreq scaling should scale between 800 and 1200 MHz, but at least the Armbian dev currently working on Espressobin reports the CPU cores being limited to 800 MHz for whatever reason (while examining the .dtsi I noticed that the OPPs for dvfs/cpufreq scaling appear to be missing, but I didn’t take a closer look).

    The older Armada 38x SoCs (Cortex-A9 and clocking up to 1.6GHz) are known to saturate a 2.5GbE link so there’s still some hope 😉

    #356

    peter
    Participant

    Thanks for the hints. My results above are all from running at 800MHz, and without touching the fdt or the cpufreq driver I think that’s all I can get:

    
    root@cb-88f3720-ddr3-expbin:~# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies 
    133333 200000 400000 800000 
    root@cb-88f3720-ddr3-expbin:~# cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq 
    800000
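The cpufreq sysfs files report values in kHz, so 800000 corresponds to 800 MHz. A minimal sketch for printing the current clock of every core in MHz; `khz_to_mhz` is a hypothetical helper, and the loop assumes the standard cpufreq sysfs layout (cpuinfo_cur_freq is often readable by root only, so unreadable files are skipped).

```shell
#!/bin/sh
# khz_to_mhz: hypothetical helper; cpufreq sysfs values are in kHz.
khz_to_mhz() {
    echo $(( $1 / 1000 ))
}

for f in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/cpuinfo_cur_freq; do
    [ -r "$f" ] || continue           # no cpufreq support or not root
    cpu=${f#/sys/devices/system/cpu/} # strip the sysfs prefix
    cpu=${cpu%%/*}                    # keep only "cpuN"
    echo "$cpu: $(khz_to_mhz "$(cat "$f")") MHz"
done
```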
    

    I’m somewhat reluctant to overclock the board, as it is the only one I have currently ;).

    Regarding the affinity settings, I mainly used the interrupt distribution shown in /proc/interrupts to check whether my changes had any effect. Since I cannot split the interrupt handling per interface or per rx/tx, I can only shift all of the network handling to cpu1, which is what the GitHub commit you mentioned above does. I tried it during my test, but setting smp_affinity to 1, 2 or 3 did not make a real difference.

    #357

    kyril77
    Participant

    The u-boot snip:

    Comphy-0: PEX0 2.5 Gbps
    ^^^^^^^^

    means that SERDES lane 0 is used for PCI Express. RGMII is a dedicated bus (not SERDES) with 1Gbps RX and 1Gbps TX. Hence, I think, the EspressoBin can’t handle more than 1Gbps of throughput by hardware design (without the help of an additional NIC connected to the PCIe slot).

    If you consider routing between two 1Gb Ethernet networks connected to two 1Gb Ethernet ports (for example wan and lan0 in the EspressoBin case), the maximum throughput is 2Gbps, simply because each port can receive 1Gbps at most (2Gbps in sum) and there is no way to get more data into the router. How did you calculate the value of 4Gbps?

    #358

    kyril77
    Participant

    The maximum throughput benchmark result of bidirectional routing with additional 1Gbps PCI-e NIC is better (L3/L2? ~0.95Gbps vs L3/L2? ~1.3Gbps) mainly because the theoretical hardware throughput increased from L1 1Gbps to L1 2Gbps.

    The FrameRate improvement is interesting; I can’t explain it.

    #359

    peter
    Participant

    > If you consider routing between two 1Gb Ethernet networks connected to two 1Gb Ethernet ports (for example wan and lan0 in the EspressoBin case), the maximum throughput is 2Gbps, simply because each port can receive 1Gbps at most (2Gbps in sum) and there is no way to get more data into the router. How did you calculate the value of 4Gbps?

    By 4Gbps I meant “bidirectional”, which equals 2 Gbps full duplex. If you look at the setup again, it should become clear:

    
    Smartbits Sender  (port 1, max 1Gbps tx)  ->     lan0    ---RGMII-->   CPU     ---RGMII-->    wan -> Smartbits Receiver (port 2, max 1 Gbps rx)
    Smartbits Receiver(port 1, max 1Gbps rx)  <-     lan0    <--RGMII---   CPU     <---RGMII--    wan <- Smartbits Sender  (port 2, max 1 Gbps tx)
    

    Each Gbit port can manage 1Gbps full duplex, i.e. send 1Gbps and simultaneously receive 1Gbps. If the CPU handles each packet for routing, the packet has to pass the RGMII interface twice. This adds up to a total of 4Gbps of bidirectional traffic, or 2Gbps full duplex, passing over the RGMII bus.
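The counting above can be written out as a short calculation. This sketch only restates the arithmetic of this post and says nothing about what the hardware can actually deliver; `cpu_link_load_mbps` is a hypothetical helper name.

```shell
#!/bin/sh
# cpu_link_load_mbps: hypothetical helper.
#   $1 = number of ports receiving traffic at line rate
#   $2 = line rate per port in Mbps
# Every routed frame crosses the switch<->CPU link twice (in and out).
cpu_link_load_mbps() {
    echo $(( $1 * $2 * 2 ))
}

total=$(cpu_link_load_mbps 2 1000)
echo "total over both directions: $total Mbps"
echo "per direction (full duplex): $(( total / 2 )) Mbps"
```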

    The frame rate improves because, with the dedicated interrupt of the PCIe interface, I can now split the workload between the two cores using the smp_affinity settings: IRQ 32 (PCIe) is handled by core 1, IRQ 9 (RGMII) is handled by core 0. Both cores together can manage more frames.

    In my first test there was only the RGMII interrupt (irq 9) signalling all network traffic. This was exclusively handled by one core, while the other one was idling. This bottlenecked the frame rate.

    #360

    tkaiser
    Participant

    > I’m somewhat reluctant to overclock the board, as it is the only one I have currently ;)

    Understandable. At the moment both cpufreq scaling and THS (thermal readouts) seem not to work. Since the SoC seems to be approx. 11x12mm in size I already ordered some 1.5mm thick copper shims to connect SoC and Topaz Switch to the enclosure bottom of an aluminium box (so the case will be used for heat dissipation).

    As far as I understood SoC and Topaz switch are interconnected with 2.5GbE, what about using two GbE clients, starting
    taskset -c 0 iperf3 -s -p 5021 and taskset -c 1 iperf3 -s -p 5022 on the Espressobin and then testing with each client against port 5021 and 5022 respectively?
