Establish what XDP_DROP test tool we use?
This xdp1 tool is the most low-overhead tool avail, but it is annoying to use. Both the form of output format and parameters. The parameter is the ifindex (and not the interface name).
A cmdline hack can be used for looking up the ifindex by name:
sudo ./xdp1 $(</sys/class/net/mlx5p2/ifindex)
The xdp_rxq_info tool is avail in kernel.
It have the advantage of showing stats per CPU and RX-queue, which is very practical for the multi-flow multi-CPU tests. As it allow us to quickly identify if the flow RSS hash distribution is off/wrong.
The name ‘xdp_rxq_info’ is misleading, for a drop test, but it have a parameter –action that allows us to turn this into a XDP_DROP test.
The xdp_bench01_mem_access_cost tool is part of my github repo prototype-kernel.
It is specifically written for benchmarking the different XDP-modes via –action parameter. Plus, it have the ability to make sure data is “read”.
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/samples/bpf
It’s implementation of XDP_TX action is also more correct than xdp_rxq_info, as it can do a –swapmac.
Choosing a kernel version or git-tree. Timing is that kernel v4.18 have not been released yet. Today <2018-06-07 Thu> we are at the beginning of the merge window for v4.18. The git tree net-next, have just been merged by Linus torvalds.
Thus, Linus’es tree is in merge flux at the moment, but net-next and bpf-next are “closed”, and thus in are more stable state (which is unusual).
Base kernel compile on bpf-next at commit 75d4e704fa8d (“netdev-FAQ: clarify DaveM’s position for stable backports”).
Kernel version: 4.17.0-rc7-bpf-next-xdp-paper01-02308-g75d4e704fa8d
Issue: the mlx5 redirect patches does not apply to bpf-next. Update <2018-06-07 Thu>: Fixed up mlx5 redirect patches.
Created a kernel.org git branch named: xdp_paper01 https://git.kernel.org/pub/scm/linux/kernel/git/hawk/net-next-xdp.git/?h=xdp_paper01
Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP XDP stats CPU pps issue-pps XDP-RX CPU 0 25,583,300 0 XDP-RX CPU total 25,583,300 RXQ stats RXQ:CPU pps issue-pps rx_queue_index 4:0 25,583,303 0 rx_queue_index 4:sum 25,583,303
$uname -a Linux broadwell 4.17.0-rc7-bpf-next-xdp-paper01-02308-g75d4e704fa8d #26 SMP PREEMPT
$ ethtool –show-priv-flags mlx5p1 Private flags for mlx5p1: rx_cqe_moder : on tx_cqe_moder : off rx_cqe_compress: off rx_striding_rq : off
t-rex script:
~/git/xdp-paper/benchmarks/udp_for_benchmarks.py
t-rex cmdline:
start -f /home/jbrouer/git/xdp-paper/benchmarks/udp_for_benchmarks.py -t packet_len=64,stream_count=XX --port 0 -m 100mpps
stream_count=12 XDP_DROP total: 86,338,684
Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP XDP stats CPU pps issue-pps XDP-RX CPU 0 14,390,053 0 XDP-RX CPU 1 14,390,081 0 XDP-RX CPU 2 14,389,045 0 XDP-RX CPU 3 14,390,277 0 XDP-RX CPU 4 14,390,219 0 XDP-RX CPU 5 14,389,006 0 XDP-RX CPU total 86,338,684 RXQ stats RXQ:CPU pps issue-pps rx_queue_index 0:0 14,390,090 0 rx_queue_index 0:sum 14,390,090 rx_queue_index 1:1 14,390,095 0 rx_queue_index 1:sum 14,390,095 rx_queue_index 2:2 14,389,054 0 rx_queue_index 2:sum 14,389,054 rx_queue_index 3:3 14,390,270 0 rx_queue_index 3:sum 14,390,270 rx_queue_index 4:4 14,390,166 0 rx_queue_index 4:sum 14,390,166 rx_queue_index 5:5 14,389,022 0 rx_queue_index 5:sum 14,389,022
Changing number cores receiving traffic by adjusting stream_count.
stream_count=1 XDP_DROP total: 25,572,977
Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP XDP stats CPU pps issue-pps XDP-RX CPU 0 25,572,977 0 XDP-RX CPU total 25,572,977 RXQ stats RXQ:CPU pps issue-pps rx_queue_index 0:0 25,572,974 0 rx_queue_index 0:sum 25,572,974
stream_count=2 XDP_DROP total: 51,907,348
Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP XDP stats CPU pps issue-pps XDP-RX CPU 0 25,359,525 0 XDP-RX CPU 1 26,547,822 0 XDP-RX CPU total 51,907,348 RXQ stats RXQ:CPU pps issue-pps rx_queue_index 0:0 25,359,524 0 rx_queue_index 0:sum 25,359,524 rx_queue_index 1:1 26,547,829 0 rx_queue_index 1:sum 26,547,829
stream_count=3 XDP_DROP total: 75,530,250
Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP XDP stats CPU pps issue-pps XDP-RX CPU 0 25,041,439 0 XDP-RX CPU 1 25,243,786 0 XDP-RX CPU 2 25,245,025 0 XDP-RX CPU total 75,530,250 RXQ stats RXQ:CPU pps issue-pps rx_queue_index 0:0 25,041,446 0 rx_queue_index 0:sum 25,041,446 rx_queue_index 1:1 25,243,788 0 rx_queue_index 1:sum 25,243,788 rx_queue_index 2:2 25,245,037 0 rx_queue_index 2:sum 25,245,037
stream_count=4 XDP_DROP total: 86,521,177
Notice at stream_count=4, CPUs start to have idle cycles.
Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP XDP stats CPU pps issue-pps XDP-RX CPU 0 21,627,817 0 XDP-RX CPU 1 21,630,688 0 XDP-RX CPU 2 21,631,349 0 XDP-RX CPU 3 21,631,321 0 XDP-RX CPU total 86,521,177 RXQ stats RXQ:CPU pps issue-pps rx_queue_index 0:0 21,627,817 0 rx_queue_index 0:sum 21,627,817 rx_queue_index 1:1 21,630,690 0 rx_queue_index 1:sum 21,630,690 rx_queue_index 2:2 21,631,359 0 rx_queue_index 2:sum 21,631,359 rx_queue_index 3:3 21,631,227 0 rx_queue_index 3:sum 21,631,227
stream_count=5 XDP_DROP total: 86,837,876
With more idle cycles.
Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP XDP stats CPU pps issue-pps XDP-RX CPU 0 17,364,174 0 XDP-RX CPU 1 17,368,545 0 XDP-RX CPU 2 17,368,884 0 XDP-RX CPU 3 17,368,908 0 XDP-RX CPU 4 17,367,363 0 XDP-RX CPU total 86,837,876 RXQ stats RXQ:CPU pps issue-pps rx_queue_index 0:0 17,364,143 0 rx_queue_index 0:sum 17,364,143 rx_queue_index 1:1 17,368,530 0 rx_queue_index 1:sum 17,368,530 rx_queue_index 2:2 17,368,816 0 rx_queue_index 2:sum 17,368,816 rx_queue_index 3:3 17,368,884 0 rx_queue_index 3:sum 17,368,884 rx_queue_index 4:4 17,367,366 0 rx_queue_index 4:sum 17,367,366
stream_count=6 XDP_DROP total: 86,809,556
With more idle cycles.
Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP XDP stats CPU pps issue-pps XDP-RX CPU 0 14,468,490 0 XDP-RX CPU 1 14,468,507 0 XDP-RX CPU 2 14,468,888 0 XDP-RX CPU 3 14,468,750 0 XDP-RX CPU 4 14,467,744 0 XDP-RX CPU 5 14,467,175 0 XDP-RX CPU total 86,809,556 RXQ stats RXQ:CPU pps issue-pps rx_queue_index 0:0 14,468,463 0 rx_queue_index 0:sum 14,468,463 rx_queue_index 1:1 14,468,470 0 rx_queue_index 1:sum 14,468,470 rx_queue_index 2:2 14,468,916 0 rx_queue_index 2:sum 14,468,916 rx_queue_index 3:3 14,468,746 0 rx_queue_index 3:sum 14,468,746 rx_queue_index 4:4 14,467,752 0 rx_queue_index 4:sum 14,467,752 rx_queue_index 5:5 14,467,191 0 rx_queue_index 5:sum 14,467,191
stream_count=7 XDP_DROP total: 85,095,736
Now we are running out of CPUs (6), as we have disabled HT. In this example, CPU2 gets extra traffic and actually don’t have any idle cycles, and handle/drop 24,313,750 pps.
Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP XDP stats CPU pps issue-pps XDP-RX CPU 0 12,156,595 0 XDP-RX CPU 1 12,154,906 0 XDP-RX CPU 2 24,313,750 0 XDP-RX CPU 3 12,155,349 0 XDP-RX CPU 4 12,158,029 0 XDP-RX CPU 5 12,157,106 0 XDP-RX CPU total 85,095,736 RXQ stats RXQ:CPU pps issue-pps rx_queue_index 0:0 12,156,625 0 rx_queue_index 0:sum 12,156,625 rx_queue_index 1:1 12,154,888 0 rx_queue_index 1:sum 12,154,888 rx_queue_index 2:2 24,313,738 0 rx_queue_index 2:sum 24,313,738 rx_queue_index 3:3 12,155,287 0 rx_queue_index 3:sum 12,155,287 rx_queue_index 4:4 12,158,076 0 rx_queue_index 4:sum 12,158,076 rx_queue_index 5:5 12,157,174 0 rx_queue_index 5:sum 12,157,174
stream_count=8 XDP_DROP total: 86,484,755
All CPUs have idle cycles, but some less than others, e.g CPU-2 have 6.8% idle, and CPU-3 have 10.2% idle.
Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP XDP stats CPU pps issue-pps XDP-RX CPU 0 10,811,300 0 XDP-RX CPU 1 10,811,547 0 XDP-RX CPU 2 21,623,304 0 XDP-RX CPU 3 21,622,057 0 XDP-RX CPU 4 10,805,394 0 XDP-RX CPU 5 10,811,152 0 XDP-RX CPU total 86,484,755 RXQ stats RXQ:CPU pps issue-pps rx_queue_index 0:0 10,811,291 0 rx_queue_index 0:sum 10,811,291 rx_queue_index 1:1 10,811,570 0 rx_queue_index 1:sum 10,811,570 rx_queue_index 2:2 21,623,306 0 rx_queue_index 2:sum 21,623,306 rx_queue_index 3:3 21,622,064 0 rx_queue_index 3:sum 21,622,064 rx_queue_index 4:4 10,805,406 0 rx_queue_index 4:sum 10,805,406 rx_queue_index 5:5 10,811,027 0 rx_queue_index 5:sum 10,811,027
stream_count=X XDP_DROP total:
Tariq expect seeing rx_discards_phy when PCI causes backpressure.
Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP XDP stats CPU pps issue-pps XDP-RX CPU 0 14,417,582 0 XDP-RX CPU 1 14,431,145 0 XDP-RX CPU 2 14,434,436 0 XDP-RX CPU 3 14,417,894 0 XDP-RX CPU 4 14,417,705 0 XDP-RX CPU 5 14,419,834 0 XDP-RX CPU total 86,538,599 RXQ stats RXQ:CPU pps issue-pps rx_queue_index 0:0 14,417,582 0 rx_queue_index 0:sum 14,417,582 rx_queue_index 1:1 14,431,116 0 rx_queue_index 1:sum 14,431,116 rx_queue_index 2:2 14,434,409 0 rx_queue_index 2:sum 14,434,409 rx_queue_index 3:3 14,417,894 0 rx_queue_index 3:sum 14,417,894 rx_queue_index 4:4 14,417,702 0 rx_queue_index 4:sum 14,417,702 rx_queue_index 5:5 14,419,839 0 rx_queue_index 5:sum 14,419,839
Notice: Ethtool(mlx5p1 ) stat: 98233252 ( 98,233,252) <= rx_packets_phy /sec Ethtool(mlx5p1 ) stat: 12048409 ( 12,048,409) <= rx_discards_phy /sec Ethtool(mlx5p1 ) stat: 86188489 ( 86,188,489) <= rx_prio0_packets /sec 98233252 - 12048409 = 86184843 ( 86,184,843)
Show adapter(s) (mlx5p1) statistics (ONLY that changed!) Ethtool(mlx5p1 ) stat: 34585 ( 34,585) <= ch0_arm /sec Ethtool(mlx5p1 ) stat: 34585 ( 34,585) <= ch0_events /sec Ethtool(mlx5p1 ) stat: 238494 ( 238,494) <= ch0_poll /sec Ethtool(mlx5p1 ) stat: 33807 ( 33,807) <= ch1_arm /sec Ethtool(mlx5p1 ) stat: 33807 ( 33,807) <= ch1_events /sec Ethtool(mlx5p1 ) stat: 237339 ( 237,339) <= ch1_poll /sec Ethtool(mlx5p1 ) stat: 34512 ( 34,512) <= ch2_arm /sec Ethtool(mlx5p1 ) stat: 34512 ( 34,512) <= ch2_events /sec Ethtool(mlx5p1 ) stat: 238103 ( 238,103) <= ch2_poll /sec Ethtool(mlx5p1 ) stat: 34668 ( 34,668) <= ch3_arm /sec Ethtool(mlx5p1 ) stat: 34668 ( 34,668) <= ch3_events /sec Ethtool(mlx5p1 ) stat: 238448 ( 238,448) <= ch3_poll /sec Ethtool(mlx5p1 ) stat: 34585 ( 34,585) <= ch4_arm /sec Ethtool(mlx5p1 ) stat: 34585 ( 34,585) <= ch4_events /sec Ethtool(mlx5p1 ) stat: 238680 ( 238,680) <= ch4_poll /sec Ethtool(mlx5p1 ) stat: 33943 ( 33,943) <= ch5_arm /sec Ethtool(mlx5p1 ) stat: 33944 ( 33,944) <= ch5_events /sec Ethtool(mlx5p1 ) stat: 237098 ( 237,098) <= ch5_poll /sec Ethtool(mlx5p1 ) stat: 206103 ( 206,103) <= ch_arm /sec Ethtool(mlx5p1 ) stat: 206103 ( 206,103) <= ch_events /sec Ethtool(mlx5p1 ) stat: 1428189 ( 1,428,189) <= ch_poll /sec Ethtool(mlx5p1 ) stat: 1 ( 1) <= outbound_pci_stalled_wr_events /sec Ethtool(mlx5p1 ) stat: 14334364 ( 14,334,364) <= rx0_cache_reuse /sec Ethtool(mlx5p1 ) stat: 14334335 ( 14,334,335) <= rx0_xdp_drop /sec Ethtool(mlx5p1 ) stat: 14362141 ( 14,362,141) <= rx1_cache_reuse /sec Ethtool(mlx5p1 ) stat: 14362141 ( 14,362,141) <= rx1_xdp_drop /sec Ethtool(mlx5p1 ) stat: 14362118 ( 14,362,118) <= rx2_cache_reuse /sec Ethtool(mlx5p1 ) stat: 14362118 ( 14,362,118) <= rx2_xdp_drop /sec Ethtool(mlx5p1 ) stat: 14338843 ( 14,338,843) <= rx3_cache_reuse /sec Ethtool(mlx5p1 ) stat: 14338841 ( 14,338,841) <= rx3_xdp_drop /sec Ethtool(mlx5p1 ) stat: 14356201 ( 14,356,201) <= rx4_cache_reuse /sec Ethtool(mlx5p1 ) stat: 14356183 ( 14,356,183) <= rx4_xdp_drop /sec Ethtool(mlx5p1 ) stat: 14359375 ( 14,359,375) <= rx5_cache_reuse /sec Ethtool(mlx5p1 ) stat: 14359405 ( 14,359,405) <= rx5_xdp_drop /sec Ethtool(mlx5p1 ) stat: 98233801 ( 98,233,801) <= rx_64_bytes_phy /sec Ethtool(mlx5p1 ) stat: 6286927774 ( 6,286,927,774) <= rx_bytes_phy /sec Ethtool(mlx5p1 ) stat: 86114457 ( 86,114,457) <= rx_cache_reuse /sec Ethtool(mlx5p1 ) stat: 12048409 ( 12,048,409) <= rx_discards_phy /sec Ethtool(mlx5p1 ) stat: 69718 ( 69,718) <= rx_out_of_buffer /sec Ethtool(mlx5p1 ) stat: 98233252 ( 98,233,252) <= rx_packets_phy /sec Ethtool(mlx5p1 ) stat: 6287027905 ( 6,287,027,905) <= rx_prio0_bytes /sec Ethtool(mlx5p1 ) stat: 86188489 ( 86,188,489) <= rx_prio0_packets /sec Ethtool(mlx5p1 ) stat: 5171087787 ( 5,171,087,787) <= rx_vport_unicast_bytes /sec Ethtool(mlx5p1 ) stat: 86184793 ( 86,184,793) <= rx_vport_unicast_packets /sec Ethtool(mlx5p1 ) stat: 86114444 ( 86,114,444) <= rx_xdp_drop /sec
Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP options:read XDP stats CPU pps issue-pps XDP-RX CPU 0 18879925 0 XDP-RX CPU 2 18848656 0 XDP-RX CPU 4 14047461 0 XDP-RX CPU 5 14048655 0 XDP-RX CPU total 65824699 RXQ stats RXQ:CPU pps issue-pps rx_queue_index 0:0 18879897 0 rx_queue_index 0:sum 18879897 rx_queue_index 2:2 18848642 0 rx_queue_index 2:sum 18848642 rx_queue_index 4:4 14047746 0 rx_queue_index 4:sum 14047746 rx_queue_index 5:5 14048712 0 rx_queue_index 5:sum 14048712
Looking closer at CPU-0, clearly shows that mlx5 driver have many function calls. These each only contribute approx 1.3ns each, but at these speeds they add up!
The function call overhead is measured with: https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
time_bench: Type:funcion_call_cost Per elem: 3 cycles(tsc) 1.009 ns (step:0) - (measurement period time:0.100951200 sec time_interval:100951200) - (invoke count:100000000 tsc_interval:363427758) time_bench: Type:func_ptr_call_cost Per elem: 4 cycles(tsc) 1.251 ns (step:0) - (measurement period time:0.125172385 sec time_interval:125172385) - (invoke count:100000000 tsc_interval:450625065)
The perf report shows the individual function calls, but it does not show the functions that got inlined. For the mlx5 driver most of these function that didn’t get inlined are called as indirect function pointer calls.
$ perf report --sort cpu,symbol --kallsyms=/proc/kallsyms --no-children -C0 Samples: 20K of event 'cycles:ppp', Event count (approx.): 18342640436 Overhead CPU Symbol + 16.29% 000 [k] bpf_prog_1a32f9805bcb7bb7_xdp_prognum0 + 15.27% 000 [k] mlx5e_post_rx_wqes + 14.12% 000 [k] mlx5e_poll_rx_cq + 14.11% 000 [k] mlx5e_skb_from_cqe_linear + 11.77% 000 [k] mlx5e_handle_rx_cqe + 8.48% 000 [k] mlx5e_xdp_handle + 7.54% 000 [k] mlx5e_page_release + 2.37% 000 [k] swiotlb_sync_single + 1.87% 000 [k] percpu_array_map_lookup_elem + 1.02% 000 [k] net_rx_action + 0.92% 000 [k] swiotlb_sync_single_for_device + 0.87% 000 [k] swiotlb_sync_single_for_cpu + 0.79% 000 [k] intel_idle 0.41% 000 [k] mlx5e_poll_xdpsq_cq 0.40% 000 [k] mlx5e_napi_poll 0.31% 000 [k] __softirqentry_text_start 0.20% 000 [k] smpboot_thread_fn 0.19% 000 [k] mlx5e_poll_tx_cq 0.16% 000 [k] __sched_text_start 0.15% 000 [k] cpuidle_enter_state
function | called per packet? | called by |
---|---|---|
mlx5e_post_rx_wqes | bulk | |
mlx5e_poll_rx_cq | bulk NAPI budget | |
mlx5e_handle_rx_cqe | per packet | mlx5e_poll_rx_cq |
mlx5e_skb_from_cqe_linear | per packet | mlx5e_handle_rx_cqe |
mlx5e_xdp_handle | per packet | mlx5e_skb_from_cqe_linear |
mlx5e_page_release | per packet | mlx5e_handle_rx_cqe |
swiotlb_sync_single | 2x per packet | mlx5e_skb_from_cqe_linear |
swiotlb_sync_single | (above) | mlx5e_post_rx_wqes |
swiotlb_sync_single_for_cpu | per packet | mlx5e_skb_from_cqe_linear |
swiotlb_sync_single_for_device | per packet | mlx5e_post_rx_wqes |
percpu_array_map_lookup_elem | per packet | bpf_prog_xx_xdp_prognum0 |
bpf_prog_xx_xdp_prognum0 | per packet | (cannot be avoided) |
Thus, 10 function calls that have a per packet invocation.
The DMA sync calls result in 4 calls, as they are not inlined in kernel/dma/swiotlb.c. Guess the compiler choose not to inline swiotlb_sync_single().
Trying to figure out a way to measure, the effect of avoidong some of the DMA calls.
In principle, the DMA sync operation is a no-op with this Intel CPU, as we are not-using-IOMMU and not using the bounce-buffer feature. Thus, I could in principle just compile it out…
For some reason needed to disabled rx_striding_rq for good performance:
- ethtool –set-priv-flags mlx5p1 rx_striding_rq off
Baseline XDP_DROP mlx5 performance on 4.19.0-rc5-bpf-next:
Running XDP on dev:mlx5p1 (ifindex:18) action:XDP_DROP options:no_touch XDP stats CPU pps issue-pps XDP-RX CPU 3 24,971,062 0 XDP-RX CPU total 24,971,062 RXQ stats RXQ:CPU pps issue-pps rx_queue_index 3:3 24,971,055 0 rx_queue_index 3:sum 24,971,055
(1/24971055)*10^9 = 40.04636568218000000000 nanosec
Removed all DMA sync calls from mlx5 driver code:
Running XDP on dev:mlx5p1 (ifindex:16) action:XDP_DROP options:no_touch XDP stats CPU pps issue-pps XDP-RX CPU 3 29,028,209 0 XDP-RX CPU total 29,028,209 RXQ stats RXQ:CPU pps issue-pps rx_queue_index 3:3 29,028,207 0 rx_queue_index 3:sum 29,028,207
(1/29028209)*10^9 = 34.44924900464000000000 ns
Difference, improvement in pps:
- 29028209-24971055 = 4057154 -> 4,057,154 pps
Difference in nanosec: ((1/24971055)-(1/29028209))*10^9 = 5.59711667754000000000 ns
This removed 4 function calls ( 5.5971/4 ) = 1.399 ns per call. This measured improvement is correlates well with the expected overhead per function call of approx 1.3ns.
Extra-polating, that we could inline the remaining 6 calls.
29Mpps = 34.4 ns
34.4 - (6 * 1.3) = 26.6 ns
26.6 ns converted to pps:
1/(34.4 - (6 * 1.3)) * 1000 = 37.59 Mpps
Looking at driver we/compiler should be able to inline 3 calls, into mlx5e_poll_rx_cq:
function | called per packet? | called by |
---|---|---|
mlx5e_handle_rx_cqe | per packet | mlx5e_poll_rx_cq |
mlx5e_skb_from_cqe_linear | per packet | mlx5e_handle_rx_cqe |
mlx5e_xdp_handle | per packet | mlx5e_skb_from_cqe_linear |
Thus, a realistic driver optimization can remove/inline 3 calls:
1/(34.4 - (3 * 1.3)) * 1000 = 32.78 Mpps
The DPDK implementation uses SSE instructions, how much does this matter?
Compiled DPDK PMD (Poll Mode driver) mlx5 with this patch:
diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c index 90488af33b81..53c2ff086799 100644 --- a/drivers/net/mlx5/mlx5_ethdev.c +++ b/drivers/net/mlx5/mlx5_ethdev.c @@ -1165,7 +1165,8 @@ mlx5_select_rx_function(struct rte_eth_dev *dev) assert(dev != NULL); if (mlx5_check_vec_rx_support(dev) > 0) { - rx_pkt_burst = mlx5_rx_burst_vec; + rx_pkt_burst = mlx5_rx_burst; ; DRV_LOG(DEBUG, "port %u selected Rx vectorized function", dev->data->port_id); } else if (mlx5_mprq_enabled(dev)) {
RXQs/cores | DPDK SSE | DPDK no-sse |
---|---|---|
1 | 42314910 | 43301465 |
2 | 74510689 | 75804436 |
3 | 97570907 | 86492148 |
4 | 109448595 | 105838967 |
5 | 115487272 | 109945465 |
Issue, above results are NOT stable for the DPDK no-sse case. DPDK is not processing packets on all cores. This can be seen when ctrl-C the testpmd program, and it shows counters for each queue, make special notice if a queue number is not present.
I believe there a setup race condition in DPDK no-sse case, because went I stop all traffic before starting testpmd results are stable.
Also got this assert failure:
testpmd: /home/jbrouer/git/dpdk/dpdk-hacks/drivers/net/mlx5/mlx5_rxtx.c:1980: mlx5_rx_burst: Assertion `len >= (rxq->crc_present << 2)’ failed.
Make sure our machine have memory bandwidth to handle: 100Gbit/s = 12.5 GBytes/sec
Below show a single CPU can handle this. I needed two generator machines, each sending a single flow, but I changed RXq to 1 in-order to only use 1 CPU.
Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP options:read XDP stats CPU pps issue-pps XDP-RX CPU 0 8210806 0 XDP-RX CPU total 8210806 RXQ stats RXQ:CPU pps issue-pps rx_queue_index 0:0 8210762 0 rx_queue_index 0:sum 8210762
Show adapter(s) (mlx5p1 mlx5p2) statistics (ONLY that changed!) Ethtool(mlx5p1 ) stat: 59146 ( 59,146) <= ch0_arm /sec Ethtool(mlx5p1 ) stat: 75717 ( 75,717) <= ch0_events /sec Ethtool(mlx5p1 ) stat: 164685 ( 164,685) <= ch0_poll /sec Ethtool(mlx5p1 ) stat: 59144 ( 59,144) <= ch_arm /sec Ethtool(mlx5p1 ) stat: 75716 ( 75,716) <= ch_events /sec Ethtool(mlx5p1 ) stat: 164684 ( 164,684) <= ch_poll /sec Ethtool(mlx5p1 ) stat: 7676 ( 7,676) <= rx0_cache_empty /sec Ethtool(mlx5p1 ) stat: 7676 ( 7,676) <= rx0_cache_full /sec Ethtool(mlx5p1 ) stat: 8203123 ( 8,203,123) <= rx0_cache_reuse /sec Ethtool(mlx5p1 ) stat: 959 ( 959) <= rx0_congst_umr /sec Ethtool(mlx5p1 ) stat: 8210784 ( 8,210,784) <= rx0_xdp_drop /sec Ethtool(mlx5p1 ) stat: 8210812 ( 8,210,812) <= rx_1024_to_1518_bytes_phy /sec Ethtool(mlx5p1 ) stat: 12336063433 ( 12,336,063,433) <= rx_bytes_phy /sec Ethtool(mlx5p1 ) stat: 7676 ( 7,676) <= rx_cache_empty /sec Ethtool(mlx5p1 ) stat: 7676 ( 7,676) <= rx_cache_full /sec Ethtool(mlx5p1 ) stat: 8203039 ( 8,203,039) <= rx_cache_reuse /sec Ethtool(mlx5p1 ) stat: 959 ( 959) <= rx_congst_umr /sec Ethtool(mlx5p1 ) stat: 120 ( 120) <= rx_out_of_buffer /sec Ethtool(mlx5p1 ) stat: 8210812 ( 8,210,812) <= rx_packets_phy /sec Ethtool(mlx5p1 ) stat: 12336123874 ( 12,336,123,874) <= rx_prio0_bytes /sec Ethtool(mlx5p1 ) stat: 8210852 ( 8,210,852) <= rx_prio0_packets /sec Ethtool(mlx5p1 ) stat: 12303230973 ( 12,303,230,973) <= rx_vport_unicast_bytes /sec Ethtool(mlx5p1 ) stat: 8210819 ( 8,210,819) <= rx_vport_unicast_packets /sec Ethtool(mlx5p1 ) stat: 8210705 ( 8,210,705) <= rx_xdp_drop /sec
mpstat -P ALL -u -I SCPU -I SUM 2 04:33:34 PM CPU %usr %nice %sys %iowait %irq %soft %idle 04:33:36 PM all 0.08 0.00 1.01 0.00 0.93 7.66 90.32 04:33:36 PM 0 0.00 0.00 3.70 0.00 2.65 47.62 46.03 04:33:36 PM 1 0.00 0.00 0.50 0.00 0.00 0.00 99.50 04:33:36 PM 2 0.00 0.00 0.50 0.00 0.00 0.00 99.50 04:33:36 PM 3 0.00 0.00 0.50 0.00 2.51 0.00 96.98 04:33:36 PM 4 0.00 0.00 1.01 0.00 0.00 0.00 98.99 04:33:36 PM 5 0.00 0.00 0.50 0.00 0.00 0.50 99.00 04:33:34 PM CPU intr/s 04:33:36 PM all 99682.00 04:33:36 PM 0 241079.00 04:33:36 PM 1 438.00 04:33:36 PM 2 203.00 04:33:36 PM 3 758.50 04:33:36 PM 4 534.00 04:33:36 PM 5 133.00 04:33:34 PM CPU HI/s TIMER/s NET_TX/s NET_RX/s BLOCK/s IRQ_POLL/s TASKLET/s SCHED/s HRTIMER/s RCU/s 04:33:36 PM 0 0.00 1000.00 0.50 164745.00 0.00 0.00 75221.50 102.50 0.00 9.50 04:33:36 PM 1 0.00 306.50 0.50 2.00 39.00 0.00 0.00 76.50 0.00 13.50 04:33:36 PM 2 0.00 175.00 0.00 2.50 0.00 0.00 0.00 15.00 0.00 10.50 04:33:36 PM 3 0.00 734.00 0.00 3.50 0.00 0.00 0.00 11.50 0.00 9.50 04:33:36 PM 4 0.00 451.00 0.50 23.00 0.00 0.00 13.50 40.50 0.00 5.50 04:33:36 PM 5 0.00 116.50 0.00 8.00 0.00 0.00 0.00 0.00 0.00 8.50
Why the i40e driver / XL710 NIC was not choosen for benchmarking.
It would basically have been booring to show that, XDP can handle what the HW can offer basically on a single CPU. And comparing against DPDK, the results would have shown the same, as the HW is the real limit. Thus, it would sort-of be unfair for DPDK.
The i40e HW datasheet states that it can “only” handle 40G wirespeed with 128-byte packets.
Intel Ethernet Controller X710/XXV710/XL710: Datasheet: https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xl710-10-40-controller-datasheet.pdf
Quote from Datasheet: Table 1-4. Performance on the network vs. packet size: “Maximize link capacity when operating at 40 Gb/s link or 4x10 Gb/s links with packets larger than 128 bytes at full duplex traffic”
40G wirespeed is:
- 40000/(84*8) = 59.52 Mpps at 64B due to overhead 84 Bytes
- 40000/(128*8) = 39.06 Mpps at 128B without inter-frame gap
- 40000/((128+12)*8) = 35.71 Mpps at 128B with 12B inter-frame gap
XDP can XDP_DROP 35.3 Mpps on a single RX-queue (100% CPU usage), and scaling this up to more CPUs, the sum still remains 36 Mpps, just with a lot of idle CPU cycles.
Single queue XDP_DROP:
Running XDP on dev:i40e1 (ifindex:3) action:XDP_DROP options:no_touch XDP stats CPU pps issue-pps XDP-RX CPU 2 35,382,617 0 XDP-RX CPU total 35,382,617 RXQ stats RXQ:CPU pps issue-pps rx_queue_index 2:2 35,382,616 0 rx_queue_index 2:sum 35,382,616
Multi-queue XDP_DROP with xdp_rxq_info:
Running XDP on dev:i40e1 (ifindex:3) action:XDP_DROP options:no_touch XDP stats CPU pps issue-pps XDP-RX CPU 0 6,023,281 0 XDP-RX CPU 1 3,011,011 0 XDP-RX CPU 3 16,555,566 0 XDP-RX CPU 4 4,501,346 0 XDP-RX CPU 5 6,005,919 0 XDP-RX CPU total 36,097,126 RXQ stats RXQ:CPU pps issue-pps rx_queue_index 0:0 6,023,283 0 rx_queue_index 0:sum 6,023,283 rx_queue_index 1:1 3,011,011 0 rx_queue_index 1:sum 3,011,011 rx_queue_index 2:3 3,012,275 0 rx_queue_index 2:sum 3,012,275 rx_queue_index 3:3 13,543,264 0 rx_queue_index 3:sum 13,543,264 rx_queue_index 4:4 4,501,393 0 rx_queue_index 4:sum 4,501,393 rx_queue_index 5:5 6,005,977 0 rx_queue_index 5:sum 6,005,977