Adds the algorithm to match and select v_dot4 instructions in combining, and removes the patterns from selection. The patterns are fragile, and fail to match when byte extraction code is slightly different, or any optimizations alters the add / mul structure of the tree. The DAG combining approach is more flexible, and should not result in much overhead given all the early exits.
For kernels that should select into these instructions, doing so is vitally important. Not only is performance much improved, but failing to select into them can result in severe code bloat which drastically degrades compile time.
The extended perm matching is a happy consequence of whitelisting EXTRACT_VECT_ELT i32s as ultimate srcs of bytes.