Re-order exceptional branches and slightly adjust the evaluation.
Performance tested with the CORE-MATH project on AMD EPYC 7B12 (clocks/op)
Reciprocal throughputs:
--- BEFORE --- $ CORE_MATH_PERF_MODE=rdtsc ./perf.sh tanhf [####################] 100 % (with -mavx2 -mfma) Ntrial = 20 ; Min = 7.794 + 0.102 clc/call; Median-Min = 0.066 clc/call; Max = 8.267 clc/call; [####################] 100 %. (with -msse4.2) Ntrial = 20 ; Min = 10.783 + 0.172 clc/call; Median-Min = 0.144 clc/call; Max = 11.446 clc/call; [####################] 100 %. (SSE2) Ntrial = 20 ; Min = 18.926 + 0.381 clc/call; Median-Min = 0.342 clc/call; Max = 19.623 clc/call; --- AFTER --- $ CORE_MATH_PERF_MODE=rdtsc ./perf.sh tanhf [####################] 100 % (with -mavx2 -mfma) Ntrial = 20 ; Min = 6.598 + 0.085 clc/call; Median-Min = 0.052 clc/call; Max = 6.868 clc/call; [####################] 100 % (with -msse4.2) Ntrial = 20 ; Min = 9.245 + 0.304 clc/call; Median-Min = 0.248 clc/call; Max = 10.675 clc/call; [####################] 100 %. (SSE2) Ntrial = 20 ; Min = 11.724 + 0.440 clc/call; Median-Min = 0.444 clc/call; Max = 12.262 clc/call;
Latency:
--- BEFORE --- $ PERF_ARGS="--latency" CORE_MATH_PERF_MODE=rdtsc ./perf.sh tanhf [####################] 100 % (with -mavx2 -mfma) Ntrial = 20 ; Min = 38.821 + 0.157 clc/call; Median-Min = 0.122 clc/call; Max = 39.539 clc/call; [####################] 100 %. (with -msse4.2) Ntrial = 20 ; Min = 44.767 + 0.766 clc/call; Median-Min = 0.681 clc/call; Max = 45.951 clc/call; [####################] 100 %. (SSE2) Ntrial = 20 ; Min = 55.055 + 1.512 clc/call; Median-Min = 1.571 clc/call; Max = 57.039 clc/call; --- AFTER --- $ PERF_ARGS="--latency" CORE_MATH_PERF_MODE=rdtsc ./perf.sh tanhf [####################] 100 % (with -mavx2 -mfma) Ntrial = 20 ; Min = 36.147 + 0.194 clc/call; Median-Min = 0.181 clc/call; Max = 36.536 clc/call; [####################] 100 % (with -msse4.2) Ntrial = 20 ; Min = 40.904 + 0.728 clc/call; Median-Min = 0.557 clc/call; Max = 42.231 clc/call; [####################] 100 %. (SSE2) Ntrial = 20 ; Min = 55.776 + 0.557 clc/call; Median-Min = 0.542 clc/call; Max = 56.551 clc/call;