We can use an i64 clmul to emulate i32 clmul.
For clmulh and clmulr we need to zero extend the 32 bit input
to 64 bits then extract either bits [63:32] or [62:31].
Unfortunately, without Zba we need to use 2 shifts for the
zero extends. These can be optimized out later if the producing
instruction already zeroed the upper bits or if we can use lwu.
There are alternative sequences we can use for clmulh/clmulr
when the zero extend isn't free, but those are best handled by
a DAG combine to give the best opportunity for removing the extend.
This allows us to implement i32 clmul C intrinsics proposed in
https://github.com/riscv-non-isa/riscv-c-api-doc/pull/44.