This is an archive of the discontinued LLVM Phabricator instance.

[X86] Add X86FixupVectorConstantsPass to fold vectors constant loads as broadcasts (WIP)
AbandonedPublic

Authored by RKSimon on May 8 2023, 2:05 PM.

Details

Summary

This is WIP patch to remove the broadcasting of constants from the DAG and to instead perform this in a later pass, I'd like to hear people's thoughts on the approach while its still in the early stages.

The principal aim is to prevent the premature creation of broadcasts that prevent us folding the loads with another instruction, helping to reduce register pressure.

There's still a lot to be addressed in this early patch including:

  • Subvector Broadcast handling (VBROADCASTF128 etc.).
  • Folding of AVX512 constant loads (including masked loads) to AVX512 broadcasts.
  • Folding of AVX512 instruction with folded constant loads to folded broadcasts.
  • Better use of AVX (fp broadcasts) and SSE3 (movddup) broadcast instructions - the comment printout are a mess of float / integer which we might want to address first?
  • Use of VPMOVZ/VPMOVSX extension load for non-uniform constants that are representable with smaller integers
  • Remove the constant support entirely from lowerBuildVectorAsBroadcast in DAG

Diff Detail

Event Timeline

RKSimon created this revision.May 8 2023, 2:05 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 8 2023, 2:05 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
RKSimon requested review of this revision.May 8 2023, 2:05 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 8 2023, 2:05 PM
RKSimon updated this revision to Diff 520480.May 8 2023, 2:07 PM

Add as much context as possible without exceeding phab's limit

goldstein.w.n added a comment.EditedMay 8 2023, 4:55 PM

The principal aim is to prevent the premature creation of broadcasts that prevent us folding the loads with another instruction, helping to reduce register pressure.

There are two other problems we have with constant generation that it might also be nice to address with a new pass but that I don't think this approach will be
able to handle.

  1. There are a variety of cases where we could get "better" constants by re-ordering instructions which is very difficult to do at any stage in the current DAG lowering process without creating an infinite loop, but is also something that would be very difficult to do after lowering at the machineinstruction level. I tried to do this for shl; and in D141653 but it ran into an infinite loop when I did more robust testing, and we could imagine also doing it for vector-compares, add/sub, etc...
  1. There are constants like splat(1), splat(mask), splat(-mask), etc... which can often be preferable to build without a load at all (abs(ALL_ONES), shr(ALL_ONES), shl(ALL_ONES), etc...), especially if the fairly common ALL_ONES node dominates.

I think both of these goal lend themselves to a pass before DAG lowering has been completed, but after the common transforms/optimizations.
I was thinking a pass in CodeGenAndEmitDAG after the fourth post-legalization Combine but before DoInstructionSelection would make the
most sense for fixing up constant generation (including the aims of this patch).

pengfei added inline comments.May 8 2023, 9:12 PM
llvm/lib/Target/X86/X86FixupVectorConstants.cpp
209

What's the reason to check hasInt256 rather than hasAVX2?

212

Should avoid FP instruction for integer?

228

Why don't use VBROADCASTSD here?

llvm/test/CodeGen/X86/combine-concatvectors.ll
67–68

Regression?

llvm/test/CodeGen/X86/horizontal-reduce-umin.ll
1170–1171

This looks like regression too.

RKSimon planned changes to this revision.May 14 2023, 7:07 AM

The principal aim is to prevent the premature creation of broadcasts that prevent us folding the loads with another instruction, helping to reduce register pressure.

There are two other problems we have with constant generation that it might also be nice to address with a new pass but that I don't think this approach will be
able to handle.

  1. There are a variety of cases where we could get "better" constants by re-ordering instructions which is very difficult to do at any stage in the current DAG lowering process without creating an infinite loop, but is also something that would be very difficult to do after lowering at the machineinstruction level. I tried to do this for shl; and in D141653 but it ran into an infinite loop when I did more robust testing, and we could imagine also doing it for vector-compares, add/sub, etc...
  1. There are constants like splat(1), splat(mask), splat(-mask), etc... which can often be preferable to build without a load at all (abs(ALL_ONES), shr(ALL_ONES), shl(ALL_ONES), etc...), especially if the fairly common ALL_ONES node dominates.

I think both of these goal lend themselves to a pass before DAG lowering has been completed, but after the common transforms/optimizations.
I was thinking a pass in CodeGenAndEmitDAG after the fourth post-legalization Combine but before DoInstructionSelection would make the
most sense for fixing up constant generation (including the aims of this patch).

I think we agree that broadcasting the constants in build vector lowering is premature and working on full width constant data is going to be a lot easier.

I have been investigating constant rematerialization as part of this patch and its looking like it should be handled separately (maybe as part of better general instruction rematerialization handling) - your proposal to investigate fitting rematerialization into CodeGenAndEmitDAG makes sense to me.

But I'm seeing most problems with premature constant broadcasting regarding folding of the constants to stack and hoisting, so a great deal of these cases are going to need to be addressed much later than DAG, I think I have this proposed pass in the correct location for this, but I will drop the proposal for it to handle constant rematerialization, this should be a true fixup for poorly lowered vector constant loads, missed AVX512 broadcast folds, etc.

I'll split off another patch.

llvm/lib/Target/X86/X86FixupVectorConstants.cpp
209

There's actually very little consistency between the 2 checks (which are identical), I try to use hasInt256 when I'm identifying cases where we're selecting 256-bit ops between AVX1 and AVX2 for 256-bit integer ops - I can change it to hasAVX2().

212

Interestingly, DAG already uses MOVDDUP in some cases to broadcast integer, but I agree we should avoid it in most cases.

228

xmm VBROADCASTSD doesn't exist - it would have exactly the same behaviour as VMOVDDUP

llvm/test/CodeGen/X86/combine-concatvectors.ll
67–68

Yes, we previously reused the ymm0 splat to avoid a second load

llvm/test/CodeGen/X86/horizontal-reduce-umin.ll
1170–1171

Yes, we have some peepholes in DAG to reuse broadcasts of different widths - I'll address this in a later iteration.

RKSimon edited the summary of this revision. (Show Details)May 14 2023, 7:09 AM
Matt added a subscriber: Matt.May 22 2023, 2:31 PM
RKSimon abandoned this revision.Nov 2 2023, 9:34 AM