As detailed here: https://github.com/InstLatx64/InstLatX64_Demo/blob/master/GFNI_Demo.h
We can use the gf2p8affine instruction to lower byte shifts/rotates as well as the existing bitreverse case.
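For reference, a minimal sketch of how the per-byte matrix constants can be derived. This is a scalar software model of one byte lane of GF2P8AFFINEQB with imm8 = 0, not the code in this patch; `ByteOp`, `buildMatrix` and `affineByte` are hypothetical names used only for illustration, and only SHL/SRL/ROTL are covered.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper names; a scalar model, not the patch's implementation.
enum class ByteOp { Shl, Srl, Rotl };

// Model of one byte lane of GF2P8AFFINEQB with imm8 = 0:
// result bit I = parity(Matrix.byte[7 - I] & X).
inline uint8_t affineByte(uint64_t Matrix, uint8_t X) {
  uint8_t Result = 0;
  for (unsigned I = 0; I != 8; ++I) {
    unsigned P = static_cast<uint8_t>(Matrix >> (8 * (7 - I))) & X;
    // Fold the 8 bits down to a single parity bit.
    P ^= P >> 4;
    P ^= P >> 2;
    P ^= P >> 1;
    Result |= static_cast<uint8_t>((P & 1u) << I);
  }
  return Result;
}

// Build the 8x8 bit matrix that performs Op by Amt on every byte.
// Each matrix row selects the single input bit feeding one result bit.
inline uint64_t buildMatrix(ByteOp Op, unsigned Amt) {
  assert(Amt < 8 && "byte shift/rotate amount out of range");
  uint64_t Matrix = 0;
  for (unsigned I = 0; I != 8; ++I) {
    int Src = -1; // Input bit feeding result bit I, or -1 for a zero bit.
    switch (Op) {
    case ByteOp::Shl:
      Src = I >= Amt ? static_cast<int>(I - Amt) : -1;
      break;
    case ByteOp::Srl:
      Src = I + Amt < 8 ? static_cast<int>(I + Amt) : -1;
      break;
    case ByteOp::Rotl:
      Src = static_cast<int>((I + 8 - Amt) & 7);
      break;
    }
    if (Src >= 0)
      Matrix |= (uint64_t(1) << Src) << (8 * (7 - I));
  }
  return Matrix;
}
```

With Amt = 0 this degenerates to the identity matrix 0x0102040810204080, so a real lowering would presumably only need to handle amounts strictly between 0 and 8.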
I've added a concat(gf2p8affine, gf2p8affine) fold to remerge the halves after AVX1 splitting. TBH, I'm not certain there's ever going to be an AVX1+GFNI target, but that might be just one of the things we handle, like the weird combinations of AVX512 modes.
There are a few other GFNI patterns we can probably handle, e.g. TZCNT/LZCNT, which were detailed on PR47394.
Amt > 0 && Amt < 8?