This is an archive of the discontinued LLVM Phabricator instance.

Recommit "[ARM][MVE] findVCMPToFoldIntoVPS"
ClosedPublic

Authored by SjoerdMeijer on Dec 13 2019, 8:19 AM.

Details

Summary

This is a recommit of D71330, but with a few things fixed/changed:

  • ReachingDefAnalysis: this was not running with optnone as it was checking skipFunction(), which other analysis passes don't do. I guess this is a copy-paste from a codegen pass.
  • VPTBlockPass: here I've added skipFunction(), because like most/all optimisations, we don't want to run this with optnone.

This fixes the issues with the initial/previous commit of this: the VPTBlockPass was running with optnone, but ReachingDefAnalysis wasn't, and so VPTBlockPass was crashing querying ReachingDefAnalysis.

I've added test case mve-vpt-block-optnone.mir to check that we don't run VPTBlock with optnone.

Diff Detail

Event Timeline

SjoerdMeijer created this revision.Dec 13 2019, 8:19 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 13 2019, 8:19 AM
SjoerdMeijer added inline comments.Dec 13 2019, 9:30 AM
llvm/lib/Target/ARM/MVEVPTBlockPass.cpp
153–157

Actually, I was planning on making one more change: save all the instructions that we want to remove to a worklist, so that they can all be removed in one go at the end. This should avoid that RDA possibly has an inconsistent view on the block...

Delayed deleting the VCMP, in order not to invalidate the RDA analysis.

I think this is okay... But do we have a test where we have two VPT blocks, predicated on the same value, where that value is generated by a foldable vcmp?

Thanks for taking a look, I have added test case mve-vpt-block-fold-vcmp.mir.

samparker accepted this revision.Jan 6 2020, 9:13 AM

Cheers, my mind is at ease, LGTM.

This revision is now accepted and ready to land.Jan 6 2020, 9:13 AM
This revision was automatically updated to reflect the committed changes.

Hello. I had to revert this in https://reviews.llvm.org/rGd50e188a072deca9d48149e05a05756c474bf569 because it was causing some problems to do with checking for uses between the VCMP and the VPT block it is folded into. When looking I didn't think that some of the other details here looked right either.

I've tried to keep your tests as they still seemed useful, and will add a new one for the case I was looking at today.

Sorry for not really looking at it earlier. I wasn't paying enough attention after the first revision. I have put some extra comments below.

llvm/lib/Target/ARM/MVEVPTBlockPass.cpp
75

What happened to this modifies register check?

This was the original error I was looking at.

176

I don't think it's valid to skip this at -O0. It is needed for valid assembly.

You could argue that some of the optimisations this pass is doing shouldn't be done at O0, but at least VPST's need to be added.

llvm/test/CodeGen/ARM/O3-pipeline.ll
147

RDA looks like quite an expensive pass to me. It would scan through and store reaching defs info for every instruction in every block of a function?

Do you think it's worth it for this case, where we are only doing some minor code cleanup?

llvm/test/CodeGen/Thumb2/mve-vpt-block-optnone.mir
73

If this instruction is predicated, then we need to generate a VPST, otherwise the final assembly will be invalid (the "Then" predicate will be silently dropped as this is not part of a valid VPT block).

Can you provide the test case, or point me at it if it is there already?

Sorry for the delay. The original reproducer I had was with a predicated vdup intrinsic. But when they were added recently it turns out it needs the downstream version (or an odd combination of downstream and upstream intrinsics), and so the same problem didn't show up.

It turns out it was simpler to just write it by hand. Now in as rGf64b3466b6bb.

SjoerdMeijer marked 2 inline comments as done.Feb 6 2020, 1:45 AM

Sorry for the delay. The original reproducer I had was with a predicated vdup intrinsic. But when they were added recently it turns out it needs the downstream version (or an odd combination of downstream and upstream intrinsics), and so the same problem didn't show up.

It turns out it was simpler to just write it by hand. Now in as rGf64b3466b6bb.

Thanks, I will look into that.

llvm/lib/Target/ARM/MVEVPTBlockPass.cpp
75

I think thhat is covered by RDA->getReachingMIDef(), which gets the first def of VPR, and looks to me to be equivalent to the modifiesRegister() here if I'm not wrong.
But will take a look at the test case and how that behaves.

llvm/test/CodeGen/ARM/O3-pipeline.ll
147

There's clearly a trade-off here. The 3 calls to RDA are concise, easy to read, and reusing exiting code. This is a big benefit, but indeed comes at the cost of running RDA. I guess the only way to answer your question is by measuring compile times.

dmgreen added inline comments.Feb 7 2020, 1:21 AM
llvm/lib/Target/ARM/MVEVPTBlockPass.cpp
75

Ah, yes. Sorry. The "readsRegister" uses check is what this should have pointed at.