I'm assuming the standard size integer instructions for this end up as something like:
mulq %rsi
seto %al
And the 'mul' generally has reciprocal throughput of 1 on typical implementations (higher latency, but that's not handled here).
The default costs may end up much higher than that, and that's what we see in the test diffs.