This patch uses profile info to balance the binary search trees used for switch lowering by weight instead of node count, causing hot nodes to appear closer to the root.
I've been using this benchmark:
. It's a 512-case switch. A loop exercises the cases in a random sequence, hitting the first case once, the second twice, and so on, making the 512th case the hottest. I ran it like this:$ bin/clang -O3 switch_with_profile.c $ perf stat -r5 ./a.out Performance counter stats for './a.out' (5 runs): 2456.382229 task-clock (msec) # 0.999 CPUs utilized ( +- 0.63% ) 403 context-switches # 0.164 K/sec ( +- 3.93% ) 12 cpu-migrations # 0.005 K/sec ( +- 32.02% ) 131 page-faults # 0.053 K/sec 7,961,322,884 cycles # 3.241 GHz ( +- 0.07% ) [66.68%] 2,511,064,458 stalled-cycles-frontend # 31.54% frontend cycles idle ( +- 0.06% ) [50.05%] 1,874,963,880 stalled-cycles-backend # 23.55% backend cycles idle ( +- 0.05% ) [50.14%] 10,579,662,099 instructions # 1.33 insns per cycle # 0.24 stalled cycles per insn ( +- 0.08% ) [66.80%] 4,910,929,143 branches # 1999.253 M/sec ( +- 0.05% ) [66.63%] 14,888,827 branch-misses # 0.30% of all branches ( +- 0.07% ) [66.58%] 2.460022963 seconds time elapsed ( +- 0.64% ) $ bin/clang -O3 switch_with_profile.c -fprofile-instr-generate $ LLVM_PROFILE_FILE=profile.raw ./a.out $ bin/llvm-profdata merge -output=profile.profdata profile.raw $ bin/clang -O3 switch_with_profile.c -fprofile-instr-use=profile.profdata $ perf stat -r5 ./a.out Performance counter stats for './a.out' (5 runs): 2156.926282 task-clock (msec) # 0.998 CPUs utilized ( +- 0.84% ) 348 context-switches # 0.161 K/sec ( +- 3.07% ) 7 cpu-migrations # 0.003 K/sec ( +- 39.10% ) 131 page-faults # 0.061 K/sec ( +- 0.15% ) 7,269,065,426 cycles # 3.370 GHz ( +- 0.05% ) [66.56%] 2,183,070,540 stalled-cycles-frontend # 30.03% frontend cycles idle ( +- 0.27% ) [50.00%] 1,601,430,175 stalled-cycles-backend # 22.03% backend cycles idle ( +- 0.37% ) [50.21%] 10,328,446,596 instructions # 1.42 insns per cycle # 0.21 stalled cycles per insn ( +- 0.08% ) [66.91%] 4,777,439,719 branches # 2214.930 M/sec ( +- 0.05% ) [66.75%] 14,804,172 branch-misses # 0.31% of all branches ( +- 0.06% ) [66.59%] 2.160195042 seconds time elapsed ( +- 0.84% )
That's a 12 % speed-up, which I think is good given that the switch's profile is tricky and not completely dominated by one or a few cases.
Note that this also affects non-profile guided builds. When there is no profile info, each case has the same weight. When clustered together in e.g. a jump table, that cluster will be heavier than a single-case cluster, which affects the tree layout.