Split off of D59703. If we have a gather with a single element, that's equivalent to a single element masked.load from the respective address.
Details
Diff Detail
Event Timeline
| test/Transforms/InstCombine/masked_intrinsics.ll | ||
|---|---|---|
| 211 | Note: This result shows a missed scalarization opportunity. | |
In the generic case, we're creating a masked load of a scalar (although as noted inline, there's apparently no support or test for that pattern). But I don't know what the codegen for a <1 x X> masked load would be (do we cmp/br around the load since it's not speculatable?). Someone else with masked load lowering knowledge (@craig.topper @efriedma ?) should also review this.
If the motivating case really is a constant mask, then I think it's best to limit this patch to that pattern alone and create the scalar load directly.
| lib/Transforms/InstCombine/InstCombineCalls.cpp | ||
|---|---|---|
| 1234 | 'countPopulation() == 1' - more direct translation of the code comment? | |
| 1251–1252 | Change 'auto' to 'Value' for all of these Builder calls to avoid ambiguity. | |
| test/Transforms/InstCombine/masked_intrinsics.ll | ||
| 204–206 | It's fine to show the expected follow-on transform, but we should have a minimal test for this transform alone: define <4 x double> @gather_lane0(<4 x double*> %ptrs, <4 x i1> %mask, <4 x double> %passthru) {
  %mask_with_at_most_1_bit_set = and <4 x i1> %mask, <i1 true, i1 false, i1 false, i1 false>
  %res = call <4 x double> @llvm.masked.gather.v4f64.v4p0f64(<4 x double*> %ptrs, i32 4, <4 x i1> %mask_with_at_most_1_bit_set, <4 x double> %passthru)
  ret <4 x double> %res
}But there's not enough analysis power in possiblyDemandedEltsInMask() to get this? Could at least reduce the test to this? define <4 x double> @gather_lane1(<4 x double*> %ptrs, <4 x double> %passthru)  {
  %res = call <4 x double> @llvm.masked.gather.v4f64.v4p0f64(<4 x double*> %ptrs, i32 4, <4 x i1> <i1 false, i1 true, i1 false, i1 false>, <4 x double> %passthru)
 ret <4 x double> %res
} | |
A single <1 x X> masked load should get scalarized to a compare and branch sequence by the ScalarizeMaskedMemIntrinsic pass. This was done because the type legalizer wants to scalarize 1x vectors. So I have X86's target transform info reporting illegal for 1x vectors. Not sure if it would make sense to custom widen that instead.
| lib/Transforms/InstCombine/InstCombineCalls.cpp | ||
|---|---|---|
| 25 | include order? | |
include order?