This patch teaches DAG combine to recognize an extract_subvector of a horizontal reduction step and to reduce the size of the operation. New extract_subvectors will be inserted to propagate the reduction up the tree.
If the starting binop is 512 bits wide, this reduction can allow the later steps to be narrowed to 128/256 bits, where we can use a shorter VEX encoding.
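To illustrate why the narrowing is safe, here is a rough standalone model, not the actual combine code. It treats a vector as a plain list of integer lanes and compares a reduction where every step stays at full width against one where each step operates on extracted halves. The names `Vec`, `reduceFullWidth`, and `reduceNarrowed` are hypothetical and exist only for this sketch; undef upper lanes are modeled as zero for simplicity.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: a "vector register" is just a list of 32-bit lanes.
using Vec = std::vector<uint32_t>;

// Every step is a full-width add of the vector with a shuffle that moves
// the upper half of the still-live lanes down. Lanes outside the live
// range are don't-care in the real DAG; they are zero-filled here.
uint32_t reduceFullWidth(Vec v) {
  for (std::size_t half = v.size() / 2; half >= 1; half /= 2) {
    Vec shuf(v.size(), 0);
    for (std::size_t i = 0; i < half; ++i)
      shuf[i] = v[i + half];
    for (std::size_t i = 0; i < v.size(); ++i)
      v[i] += shuf[i];
  }
  return v[0]; // final extract of lane 0
}

// Each step first extracts the low and high halves (standing in for the
// extract_subvector nodes), then adds at half the previous width.
uint32_t reduceNarrowed(Vec v) {
  while (v.size() > 1) {
    std::size_t half = v.size() / 2;
    Vec lo(v.begin(), v.begin() + half); // extract low subvector
    Vec hi(v.begin() + half, v.end());   // extract high subvector
    for (std::size_t i = 0; i < half; ++i)
      lo[i] += hi[i];
    v = lo; // the next step runs on the narrower vector
  }
  return v[0];
}
```

With 16 lanes (a 512-bit vector of i32), the narrowed version does its adds at 8, 4, 2, and 1 lanes instead of 16 lanes each time, and both functions produce the same lane 0 because the upper lanes never feed the final extract.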
I've put in the ADD, MIN, and MAX instructions so far, but there may be other operations we should support.
I have noticed an oddity due to the order in which DAG combine visits nodes. We visit the last layer before all the FMAX/FMIN nodes have been created, which prevents the combine from recognizing the pattern. A later DAG combine triggered by type legalization or vector legalization can catch it, but those DAG combines aren't guaranteed to run if nothing was legalized. We could mitigate this by detecting the reduction step at the binop itself and padding the upper bits with undef, hoping it's used by an extract_subvector. I think that would get properly triggered as we create FMAX in the upper nodes, since the combine will add users back to the worklist. Thoughts?