ifconfig -a
will list all detected interfaces; then you can use
ip link set dev XXX up
ifconfig XXX 192.168.1.1 netmask 255.255.255.0 up
eth0 is the RGMII device interfacing to the switch; the switch ports have names like "wan" and "lan0". These names are defined in the flat device tree file (*.dtb) loaded by U-Boot.
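On the EspressoBin that would look something like this (the address is just an example; the eth0 CPU port usually has to be up before the switch ports will pass traffic):
ip link set dev eth0 up
ip link set dev lan0 up
ifconfig lan0 192.168.1.1 netmask 255.255.255.0 up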
Hi,
I finally had the chance to do the same throughput measurements with a db_88f3720_ddr3_modular board, which has the same 88F3720 CPU but two PHYs connected to two dedicated SGMII buses (via SerDes), and runs at 1GHz. Otherwise the setup is the same as above; both cores can be fully utilized using CPU affinity settings for the two SGMII-related interrupts. Here are the results:
FrameSize | FrameRate | BitRate
[bytes] | [fps] | [bps]
64 | 349726 | 179166674
128 | 340851 | 349324332
256 | 313973 | 643659427
512 | 328397 | 1346240596
1024 | 237464 | 1946168575
1280 | 192286 | 1969230769
1400 | 176001 | 1971830985
1514 | 162956 | 1973924380
1518 | 162506 | 1973992197
So this time the Gbit Ethernet gets saturated for frames >= 1024 bytes, and all this while staying below 55°C without a heat sink, which was really impressive 😉
> If you consider routing between two 1Gb ethernet networks connected to two 1Gb ethernet ports (for example wan and lan0 in the EspressoBin case), maximum throughput is 2Gbps, just because each port can receive 1Gbps at maximum (2Gbps in sum) and there is no way to get more data into the router. How did you calculate the value 4Gbps?
By 4Gbps I meant "bidirectional", which equals 2 Gbps full duplex. If you look at the setup again, it should become clear:
Smartbits Sender (port 1, max 1Gbps tx) -> lan0 ---RGMII--> CPU ---RGMII--> wan -> Smartbits Receiver (port 2, max 1 Gbps rx)
Smartbits Receiver (port 1, max 1Gbps rx) <- lan0 <--RGMII--- CPU <---RGMII-- wan <- Smartbits Sender (port 2, max 1 Gbps tx)
Each Gbit port can manage 1Gbps full duplex, i.e. send 1Gbps and simultaneously receive 1Gbps. If the CPU handles the routing, every packet has to pass the RGMII interface twice. This adds up to a total of 4Gbps of bidirectional traffic, or 2Gbps full duplex, passing the RGMII bus.
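Spelled out: 1Gbps received on lan0 + 1Gbps transmitted on wan for one direction, plus 1Gbps received on wan + 1Gbps transmitted on lan0 for the reverse direction, makes 4Gbps crossing the single RGMII link between switch and CPU.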
The frame rate improvement comes from the fact that, using the dedicated interrupt of the PCIe interface, I can now split the workload between the two cores via the smp_affinity settings: IRQ 32 (PCIe) is handled by core 1, IRQ 9 (RGMII) by core 0. Both cores together can manage more frames.
In my first test there was only the RGMII interrupt (IRQ 9) signalling all network traffic. It was handled exclusively by one core while the other one was idling, which bottlenecked the frame rate.
Thanks for the hints. My results above are all while running at 800MHz, and without touching the fdt or the cpufreq driver I think that's all I can get:
root@cb-88f3720-ddr3-expbin:~# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
133333 200000 400000 800000
root@cb-88f3720-ddr3-expbin:~# cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
800000
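To make sure the 800MHz are actually used during the load phases, the standard cpufreq sysfs nodes can be checked and, assuming the performance governor is compiled in, pinned to the maximum:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor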
I’m somewhat reluctant to overclock the board, as it is the only one I have currently ;).
Regarding the affinity settings, I mainly used the interrupt distribution shown in /proc/interrupts to check whether I get any effect. Since I cannot split the interrupt handling per interface or for rx/tx, I can only shift all of the network handling to CPU1, which is what the GitHub commit you mentioned above does. I tried it during my test, but setting the smp_affinity to 1, 2 or 3 did not make a real difference.
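For reference, smp_affinity is a CPU bitmask, so on this dual-core SoC the only meaningful values are:
echo 1 > /proc/irq/9/smp_affinity   # CPU0 only
echo 2 > /proc/irq/9/smp_affinity   # CPU1 only
echo 3 > /proc/irq/9/smp_affinity   # both CPUs allowed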
I did another throughput test, this time with a mini PCIe Gbit network card (RTL8111 chipset, using the Linux rtl8169 driver). After recompiling the kernel with the required driver option, the card is recognized and works fine. I also added the card's firmware blob manually under /lib/firmware/rtl_nic/rtl8168e-3.fw (I took the firmware from the Debian package firmware-nonfree_0.43.tar.gz).
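For anyone repeating this: the option in question is the in-tree r8169 driver, i.e. roughly this in the kernel config (exact symbols may differ between kernel versions):
CONFIG_NET_VENDOR_REALTEK=y
CONFIG_R8169=y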
The setup is the same, but this time I tuned the SMP affinity of interrupts 9 and 32 to spread the workload of the LAN and WAN interfaces across both cores:
echo "3" > /proc/irq/9/smp_affinity
echo "2" > /proc/irq/32/smp_affinity
This worked as intended, as the following output after the throughput test run shows:
root@cb-88f3720-ddr3-expbin:~# cat /proc/interrupts
CPU0 CPU1
…
9: 88062744 0 GICv3 74 Level eth0
…
32: 1 12487562 GICv3 61 Level advk-pcie
…
100: 1 12487562 d0070000.pcie-msi 0 Edge eth1
The output of top during the test also showed that both cores get really busy this time (during my last test the CPU load only went up to 50%).
These are my results, which are notably better than the ones using lan0 and wan via the switch:
FrameSize | FrameRate | BitRate
[bytes] | [fps] | [bps]
64 | 136064 | 69670588
128 | 133782 | 137000531
256 | 127721 | 261595952
512 | 122464 | 501616536
1024 | 116115 | 951226069
1280 | 111183 | 1139557695
1400 | 110145 | 1234474020
1514 | 106913 | 1294944515
1518 | 107531 | 1306439707
So the throughput goes up to ~1.3 Gbps L3 this time. The RTL8111 PCIe Ethernet controller I used is not the fastest one, so maybe there is still room for further gains. IMHO the results so far show that either the network driver's interrupt handling running on only one core (maybe my smp_affinity settings were wrong?) or the single RGMII interface shared by the lan0, lan1 and wan ports of the switch poses a bottleneck for the maximum throughput achievable with this processor.
Ok, I did my first Smartbits test. Here is the setup I used:
Yocto with the latest changes, built following the software howto, and with the CONFIG_NETFILTER options enabled in the kernel.
root@cb-88f3720-ddr3-expbin:~# uname -a
Linux cb-88f3720-ddr3-expbin 4.4.8-armada-17.02.2-armada-17.02.2+g8148be9 #3 SMP PREEMPT Thu Apr 6 14:48:55 CEST 2017 aarch64 GNU/Linux
I ran this script before testing:
root@cb-88f3720-ddr3-expbin:~# cat setup.sh
#!/bin/sh -ex
ip link set lan0 up
ifconfig lan0 172.18.1.1 netmask 255.255.0.0 up
ip link set wan up
ifconfig wan 172.19.1.1 netmask 255.255.0.0 up
echo 1 > /proc/sys/net/ipv4/ip_forward
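Before starting the load I double-check that forwarding and the connected routes are really in place:
cat /proc/sys/net/ipv4/ip_forward   # should print 1
ip route                            # should list 172.18.0.0/16 dev lan0 and 172.19.0.0/16 dev wan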
The Smartbits puts UDP load bidirectionally through the lan0 and wan ports and tries to find the maximum throughput at which less than 0.5% of the sent frames get lost:
smb (max 1Gbps) -> lan0 -> CPU -> wan -> smb
smb <- lan0 <- CPU <- wan <- smb (max 1 Gbps)
The iptables netfilter rules are empty (default policy ACCEPT); only the conntrack entries for the UDP connections are used for fast forwarding of the frames:
root@cb-88f3720-ddr3-expbin:~# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
root@cb-88f3720-ddr3-expbin:~# cat /proc/net/ip_conntrack
udp 17 174 src=172.18.13.1 dst=172.19.13.1 sport=5000 dport=5000 src=172.19.13.1 dst=172.18.13.1 sport=5000 dport=5000 [ASSURED] use=2
udp 17 175 src=172.18.11.1 dst=172.19.11.1 sport=5000 dport=5000 src=172.19.11.1 dst=172.18.11.1 sport=5000 dport=5000 [ASSURED] use=2
udp 17 175 src=172.18.10.1 dst=172.19.10.1 sport=5000 dport=5000 src=172.19.10.1 dst=172.18.10.1 sport=5000 dport=5000 [ASSURED] use=2
udp 17 174 src=172.18.12.1 dst=172.19.12.1 sport=5000 dport=5000 src=172.19.12.1 dst=172.18.12.1 sport=5000 dport=5000 [ASSURED] use=2
Here are my first results. A constant frame rate across different frame sizes is usually a good indicator of a functioning network offload engine of some kind (DMA etc.).
FrameSize | FrameRate | BitRate
[bytes] | [fps] | [bps] L3
64 | 87516 | 44814528
128 | 85916 | 87995547
256 | 85730 | 175595067
512 | 85601 | 350749484
1024 | 82315 | 674329501
1280 | 83947 | 859692307
1400 | 85002 | 952332746
1514 | 76186 | 922809647
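(For reading the table: the BitRate values work out to roughly FrameSize × 8 × FrameRate, e.g. 1400 × 8 × 85002 ≈ 952 Mbps for the 1400 byte frames.)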
Since this is my first shot, I'm sure there are possibilities to optimize the throughput. I have already checked the /proc/irq/9/smp_affinity setting for the interrupt allocated to eth0 (i.e. RGMII) and it is set to "3", meaning both cores may be used for eth0 activity. But I only got 50% CPU load for the ksoftirqd/0 thread during the high network load phases; it looks like only one core gets busy currently.
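To see where the RX work actually ends up, the per-CPU softirq counters can be compared before and after a run:
grep NET /proc/softirqs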
Any hints for improving the setup?
I'm about to do some throughput tests as well, using an EspressoBin with buildroot-2015.11-16.08 and a Smartbits device for throughput testing in router mode.
Seeing that the switch is apparently connected via RGMII with a 2.5 Gbps link:
<snip>
U-Boot 2015.01-armada-17.02.0-gc80c919 (Mar 04 2017 - 15:51:07)
I2C: ready
DRAM: 1 GiB
Board: DB-88F3720-ESPRESSOBin
CPU @ 800 [MHz]
L2 @ 800 [MHz]
TClock @ 200 [MHz]
DDR @ 800 [MHz]
Comphy-0: PEX0 2.5 Gbps
^^^^^^^^
Comphy-1: USB3 5 Gbps
Comphy-2: SATA0 5 Gbps
</snip>
and the link to the switch itself is reported as 1Gbps:
<snip>
# ip link set dev eth0 up
# [ 404.704060] mvneta d0030000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off
</snip>
I'm not sure about the switch's capability to go faster than the reported 1Gbps via RGMII, but 2.5Gbps would be the upper limit, right?
Then I would suspect that we will see no more than 1.25Gbps bidirectional throughput when routing between the lan1 and wan interfaces with the CPU, i.e. ~625Mbps unidirectional per port.
The upper limit for a Gbit network is ~2Gbps bidirectional throughput, which means the connection between switch and CPU would have to manage ~4Gbps when routing/NATing between lan1 and wan, for example. I'll also try to test with an additional USB 3.0 network adapter connected to Comphy-1, to take some load off the RGMII bus.