This is an archive of the discontinued LLVM Phabricator instance.

[WIP] Remove switch statements before vectorization
AbandonedPublic

Authored by kmclaughlin on Aug 16 2021, 8:35 AM.

Details

Summary

This patch changes the LowerSwitch pass so that when a flag is
passed (LoopUnswitch) the pass will only attempt to unswitch simple
switch statements (i.e. there are no ranges, each destination block is
unique) which are part of a loop. The purpose of this is to allow
vectorization of loops which is not possible at the moment due to
the presence of switch statements.

The LowerSwitch pass is now run just before the vectorizer, with
LoopUnswitch set to true. For simple switches, we create a series of
compares and branches which have a simpler structure which SimplifyCFG
can later replace with a switch again if the vectorizer made no changes.

The following tests have been added:

  • LowerSwitch/simple-switches.ll: Tests the changes to LowerSwitch to replace switch statments in loops.
  • LoopVectorize/AArch64/sve-remove-switches.ll: Tests that we can vectorize loops with switch statements with scalable vectors. Also tests that where vectorization is not possible, that the switch statement is created again.
  • LoopVectorize/remove-switches.ll: Ensures that we do not vectorize the loop if the target doesn't support masked loads & stores, where the cost would be too high.

Diff Detail

Event Timeline

kmclaughlin created this revision.Aug 16 2021, 8:35 AM
kmclaughlin requested review of this revision.Aug 16 2021, 8:35 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptAug 16 2021, 8:35 AM

I'm not sure i'm sold on this, even though i'm aware that selects hurt vectorization.
How does this Simplify the CFG? I think it would be best to teach LV selects,
or at worst do this in LV itself.

I'm not sure i'm sold on this, even though i'm aware that selects hurt vectorization.
How does this Simplify the CFG? I think it would be best to teach LV selects,
or at worst do this in LV itself.

Hi @lebedev.ri, I'm under the impression that the vectoriser has a policy of never making scalar transformations so I doubt it would be acceptable to do this in the vectoriser pass. I think the only realistic alternative is to teach LV how to vectorise switch statements and create the vector compares and selects directly in the code, or scalarise them in the vector loop with creation of new blocks. @fhahn and @craig.topper do you have any thoughts on this or preference?

The alternative to teaching looprotate/LV about switches is to make swiches non-canonical in the first half of the pipeline, before LV.
That is, don't form them, and aggressively expand any and all existing switches.

I'm under the impression that the vectoriser has a policy of never making scalar transformations

I'm not sure what you mean. I've not looked into the details, but it could presumably be done as some sort of VPlan transformation, possibly in the constructions of vplans to treat switches like multiple icmp's/branches?

I'm under the impression that the vectoriser has a policy of never making scalar transformations

I'm not sure what you mean. I've not looked into the details, but it could presumably be done as some sort of VPlan transformation, possibly in the constructions of vplans to treat switches like multiple icmp's/branches?

Hi @dmgreen, I just meant that if LV makes a scalar transformation prior to legality/cost-model checks, then for some reason we don't vectorise, we then end up with a changed scalar body without any vectorisation.

I'm under the impression that the vectoriser has a policy of never making scalar transformations

I'm not sure what you mean. I've not looked into the details, but it could presumably be done as some sort of VPlan transformation, possibly in the constructions of vplans to treat switches like multiple icmp's/branches?

Hi @dmgreen, I just meant that if LV makes a scalar transformation prior to legality/cost-model checks, then for some reason we don't vectorise, we then end up with a changed scalar body without any vectorisation.

Oh yeah, that makes sense. I was wondering if we could teach VPlan to treat them as ICmp/Br without having to actually transform the IR, just doing it as part of constructing the VPlan.

Matt added a subscriber: Matt.Aug 17 2021, 9:23 AM

Since we already have LowerSwitchPass to transform switchinst, can we add a cost modle and run it before vectorization?

Thanks all for the suggestions on this patch :)

I had a look at the LowerSwitch pass as suggested by @junparser, and I did find that running it before vectorisation transforms the switch and allows the same loops to be vectorised. However, I did find that if the loop is not vectorised then the switch is not created again later by SimplifyCFG (possibly because the pass is also arbitrarily splitting cases into ranges and creating multiple branches to the default block?). Tests such as Transforms/PhaseOrdering/X86/simplifycfg-late.ll then fail, which attempts to convert a switch statement into a lookup table.

For example, running the @switch_no_vectorize test (from remove-switches.ll) with -lowerswitch results in:

for.body:                                         ; preds = %L3, %entry
  %i = phi i64 [ %inc, %L3 ], [ 0, %entry ]
  %sum.033 = phi float [ %conv20, %L3 ], [ 2.000000e+00, %entry ]
  %arrayidx = getelementptr inbounds i32, i32* %a, i64 %i
  %0 = load i32, i32* %arrayidx, align 4
  br label %NodeBlock

NodeBlock:                                        ; preds = %for.body
  %Pivot = icmp slt i32 %0, 3
  br i1 %Pivot, label %LeafBlock, label %LeafBlock1

LeafBlock1:                                       ; preds = %NodeBlock
  %SwitchLeaf2 = icmp eq i32 %0, 3
  br i1 %SwitchLeaf2, label %L3, label %NewDefault

LeafBlock:                                        ; preds = %NodeBlock
  %SwitchLeaf = icmp eq i32 %0, 2
  br i1 %SwitchLeaf, label %L2, label %NewDefault

NewDefault:                                       ; preds = %LeafBlock1, %LeafBlock
  br label %L1

I also found that any weights assigned to the switch statement are ignored when creating the new branches in LowerSwitch.

I'm not sure what the best approach to this is - I could try to change LowerSwitch to create branches which SimplifyCFG will be able to recognise and replace with a switch, or try to change SimplifyCFG to recognise this pattern of compares & branches. Alternatively, the changes in this patch could be used as the basis for a new pass which runs before the vectoriser. I wondered if anyone has any thoughts or preferences on which would be the best option here?

IMO anything other than enhancing LV is wrong.

IMO anything other than enhancing LV is wrong.

Hi @lebedev.ri I personally disagree here. Adding support to LV for this is significantly more work (and IMO unnecessary) because there are cases when LV has to handle a lot more than just the obvious flattened vectorisation case using vector comparisons and select instructions. We will also need to add support for vectorisation factors of 1 (with interleaving) and cases where VF>1,but we have to scalarise the switch statement. These latter two cases require basically doing exactly the same thing as @kmclaughlin's patch does here, i.e. unswitching the switch statement into compares/branches and new blocks. It seems far simpler to have a small pass that runs prior to the vectoriser (when enabled) that unswitches.

Not sure what others think here?

How is it conceptually different to break apart IR in LV itself, or do the same in a special pass just before that?
If we want to go this road, we need to completely make switches illegal/non-canonical before LV.

How is it conceptually different to break apart IR in LV itself, or do the same in a special pass just before that?
If we want to go this road, we need to completely make switches illegal/non-canonical before LV.

If I understand correctly you're suggesting that LV makes a scalar transformation prior to legalisation checks/cost model analysis? If that's the case then I don't think we can do that as this is beyond LV's remit and I don't see how that's any different to making a scalar transformation in a separate pass prior to LV.

How is it conceptually different to break apart IR in LV itself, or do the same in a special pass just before that?
If we want to go this road, we need to completely make switches illegal/non-canonical before LV.

If I understand correctly you're suggesting that LV makes a scalar transformation prior to legalisation checks/cost model analysis? If that's the case then I don't think we can do that as this is beyond LV's remit and I don't see how that's any different to making a scalar transformation in a separate pass prior to LV.

Actually no, i'm saying that since LV is not allowed to do such scalar transformations,
doing the same scalar transfomation, but just outside of LV, doesn't change the fact
that we've just made a preparatory transformation in hope that it will allow LV,
without actually knowing that. If it doesn't, we now need to undo it.

I could try to change LowerSwitch to create branches which SimplifyCFG will be able to recognise and replace with a switch, or try to change SimplifyCFG to recognise this pattern of compares & branches.

  1. option is better.
xbolva00 added a comment.EditedAug 26 2021, 7:37 AM

I had a look at the LowerSwitch pass as suggested by @junparser, and I did find that running it before vectorisation transforms the switch and allows the same loops to be vectorised. However, I did find that if the loop is not vectorised then the switch is not created again later by SimplifyCFG

Maybe always lower switch in loops before LV? And some very late (simplifycfg) pass to form switches from branches? icmps are more friendly for futher optimizations than switches anyway, or?

kmclaughlin retitled this revision from [SimplifyCFG] Remove switch statements before vectorization to [WIP] Remove switch statements before vectorization.
kmclaughlin edited the summary of this revision. (Show Details)
  • Removed changes to SimplifyCFG and instead run LowerSwitch before vectorisation.
  • Added SimpleSwitchConvert to LowerSwitch which is used if the pass is run before vectorisation - this only considers simple switches (where each destination block is unique) which are also part of a loop.

Hi all, I've updated this to take a different approach - the new patch runs LowerSwitch just before the vectoriser, where it will only consider simple switches which are part of a loop. For these switches, the pass will create a series of branches and compares which SimplifyCFG is able replace with a switch again later if the vectoriser did not make any changes.

I'm happy to split this patch up to make it easier to review, but I thought I would first post the changes I have so far to gather some thoughts on whether this is a better direction than before? Thanks!

Hi. I'm personally still not very okay with the approach as it currently is.

Do you need to run LoopRotate after lowering switches? Anything else?
But then you don't actually know that after spending all this compile time,
the vectorization will actually happen, and you won't just now need to undo all this,
correct? This seems conceptually wrong to me.

Will LV never have to learn to deal with switches properly?
I would assume it will, in which case what is the urgency of this temporary approach?

If you really don't want to fix this properly, i'm looking forward to an RFC on llvm-dev.

kmclaughlin abandoned this revision.Oct 8 2021, 8:57 AM

I just wanted to give an update on this patch, which I'm abandoning for the time being:

@lebedev.ri raised some good questions about the approach taken and whether the additional compile time spent would be worth the additional opportunities for vectorisation. After posting the last update, I collected some benchmark results using Spec2017 to get a better understanding of the impact of these changes and found that several benchmarks showed performance regressions for fixed-width.

The biggest outliers (in terms of percentage runtime change) were:
520.omnetpp_r: -3.00%
500.perlbench_r: -2.00%
502.gcc_r: -1.52%

I also collected the results after adding in a threshold number of cases to be unswitched (set to 4), as was included in the first draft of this patch. This also showed some regressions in the benchmarks run and no significant improvements. Both sets of results showed increased compile times for many benchmarks.

The same benchmarks as above, with the threshold of 4 set:
520.omnetpp_r: -3.46%
500.perlbench_r: -1.20%
502.gcc_r: -1.22%

Results were collected on a Neoverse-N1 machine. Given that these results indicate this isn't the best approach to take, I'm abandoning the patch for now. When this is picked up in future, it will likely be better to follow either the suggestion to prevent canonicalisation of branches & compares into switch statements (under a given number of cases) in the first place, or to teach the loop vectoriser to recognise switches.

:(
I'm sorry for derailing this.
I still think proper switch handling for loops would be nice.