This patch uses PSHUFB to lower vector CTLZ and avoid (slower) scalarizations.
The leading zero count of each 4-bit nibble of the vector is determined with a PSHUFB lookup. Pairs of adjacent results are then repeatedly combined, doubling the width at each step, until the original element width is reached.
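For readers who want to see the combine step spelled out, here is a rough scalar model of the approach, one lane at a time. It is only a sketch under assumptions: the names `kNibbleLZ`, `ctlz8`, `ctlz16` and `ctlz32` are hypothetical and do not come from the patch, the table lookup stands in for what PSHUFB would do on 16 bytes at once, and the actual lowering may order or express the combine differently.

```cpp
#include <cstdint>

// Hypothetical lookup table: leading-zero count of each 4-bit value.
// A zero nibble counts as 4. In the vector version this table would be
// the PSHUFB operand that is indexed by the nibbles.
constexpr uint8_t kNibbleLZ[16] = {4, 3, 2, 2, 1, 1, 1, 1,
                                   0, 0, 0, 0, 0, 0, 0, 0};

// Byte lane: look up both nibbles, then add the low-nibble count only
// when the high nibble is entirely zero.
constexpr unsigned ctlz8(uint8_t b) {
  unsigned hi = kNibbleLZ[b >> 4];
  unsigned lo = kNibbleLZ[b & 0xF];
  return (b >> 4) == 0 ? hi + lo : hi;
}

// 16-bit lane: combine a pair of byte results with the same rule.
constexpr unsigned ctlz16(uint16_t v) {
  uint8_t hiByte = static_cast<uint8_t>(v >> 8);
  uint8_t loByte = static_cast<uint8_t>(v & 0xFF);
  return hiByte == 0 ? ctlz8(hiByte) + ctlz8(loByte) : ctlz8(hiByte);
}

// 32-bit lane: combine a pair of 16-bit results, doubling the width again.
constexpr unsigned ctlz32(uint32_t v) {
  uint16_t hiHalf = static_cast<uint16_t>(v >> 16);
  uint16_t loHalf = static_cast<uint16_t>(v & 0xFFFF);
  return hiHalf == 0 ? ctlz16(hiHalf) + ctlz16(loHalf) : ctlz16(hiHalf);
}
```

Each level uses the same rule: the low half only contributes when the high half is all zeros, so a zero input naturally yields the full element width.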
Please skip braces on clear single-line ifs and loops.