This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Introduce t2DoLoopStartTP
ClosedPublic

Authored by dmgreen on Nov 2 2020, 1:22 AM.

Details

Summary

This introduces a new pseudo instruction, almost identical to a t2DoLoopStart but taking 2 parameters - the original loop iteration count needed for a low overhead loop, plus the VCTP element count needed for a DLSTP instruction setting up a tail predicated loop. The idea is that the instruction holds both values and the backend ARMLowOverheadLoops pass can pick between the two, depending on whether it creates a tail predicated loop or falls back to a low overhead loop.

To do that there needs to be something that converts a t2DoLoopStart to a t2DoLoopStartTP, for which this patch repurposes the MVEVPTOptimisationsPass as a "tail predication and vpt optimisation" pass. The extra operand for the t2DoLoopStartTP is chosen based on the operands of VCTP's in the loop, and the instruction is moved as late in the block as possible to attempt to increase the likelihood of making tail predicated loops.

Diff Detail

Event Timeline

dmgreen created this revision.Nov 2 2020, 1:22 AM
dmgreen requested review of this revision.Nov 2 2020, 1:22 AM

The TP at the end of the name somewhat implies that this is only for tail predication, would it make sense to change t2DoLoopStart to take the extra register?

Yeah I was wondering which way to go with that. The t2DoLoopStartTP is meant to mean "a t2DoLoopStart that is almost certainly going to become a DLSTP". A t2DoLoopStart are for all the low overhead loops that are not expected to change to tail predicated loop. That way we could treat them differently elsewhere in the pipeline if we need to.

Yeah I was wondering which way to go with that. The t2DoLoopStartTP is meant to mean "a t2DoLoopStart that is almost certainly going to become a DLSTP". A t2DoLoopStart are for all the low overhead loops that are not expected to change to tail predicated loop. That way we could treat them differently elsewhere in the pipeline if we need to.

That makes sense 👍 Do you envisage us moving all of the TP validation to the MVEVPTOptimisationsPass and then only doing the actual codegen in ARMLowOverheadLoops?

That makes sense 👍 Do you envisage us moving all of the TP validation to the MVEVPTOptimisationsPass and then only doing the actual codegen in ARMLowOverheadLoops?

I was thinking of putting some bits here, there is some stuff about terminators that produce values that I was hoping to do here too. Also something about ensuring lr is used as a predicate. We can't do everything at this early stage as we have to be very careful about being able to handle loops that end up being reverted. Some things are better to do pre-RA though, like the changes here, and let the backend pass do what it can from there like removing the unused arg.

samparker added a comment.EditedNov 3 2020, 1:27 AM

This sounds like a good idea to me, but if we're going to do stuff earlier, to make our lives easier, maybe we should do this at the IR level and use a target-specific intrinsic and free ourselves completely from searching at the MI level?

Searching at the pre-ra level seems simple enough so long as we are in SSA form, and there are extra things I would like to add at the same point if I can makes sure they work. Doing those post-isel has some benefits in making sure we know the instructions we are looking at. Things like reverting calls is easier to do at this point (as it's easier to make sure we catch all calls).

Things like reverting calls is easier to do at this point.

What calls do you mean..?
What specifically do you need from the pre-RA MachineInstr form that means this shouldn't be done at IR level? The search here is certainly not more simple and manageable than doing this MVETailPredication, where the components are also found, and handling phis at this level is just an added pain.

Sos I missed your comment.

This code seemed simple enough to me, perhaps because I wrote essentially the same thing a few times. We are just looking through vregs for the PHI's. There may be COPY's in the way but they are simple enough to deal with. It's all just SSA form still.

The other things I would like to put here start with combining t2LoopDec and t2LoopEnd into a single instructions. A t2LoopEndDec is what I've called it. Thats a terminator that produces a value, but it seems to work OK in the testing I've been trying, given a few adjustments. I was going the "they can never spill" approach, which I've not found any problems with so long as there are no calls in the loop (or inline asm). It's obviously more accurate to determine what was made into a call as opposed to trying to guess at what will become a call from pre-isel. And if we get it wrong, the compiler would crash, so we do have to be a bit careful. It is using a lot of the same code as this, and also removing COPYs from the loop which seemed generally useful.

Unfortunately I found out the AMDGPU backend also has terminators the produce values, but they work differently. I will have to figure something out there.

I was also going to add an LR predicate to MVE instructions to make sure they would never go wrong, done in the same place hopefully. That's another one that involves updating 100's of tests unfortunately.

I would like to put here start with combining t2LoopDec and t2LoopEnd

Ok, fair enough.

t2LoopEndDec is what I've called it.

It's a good name.

I was going the "they can never spill" approach

And how is this done? Is it a generic codegen change and that's why AMDGPU is problematic?

I was also going to add an LR predicate to MVE instructions to make sure they would never go wrong

I'm not sure what you mean, are you talking about adding another optional register operand?

A bit late, but have a high-level comment about this:

repurposes the MVEVPTOptimisationsPass as a "tail predication and vpt optimisation" pass

This feels messy because we are doing tail-predication now in 2 different places, and that doesn't sound ideal. Happily unaware of some of the details here, but I am wondering if this is necessary?

And how is this done? Is it a generic codegen change and that's why AMDGPU is problematic?

There are just some changes in phi elimination and and the register allocator to make a terminator that produces a value never spill. I was thinking that this was not something that already existed, so we could invent some semantics for it - only use it very carefully in cases we know are going to be OK. Unfortunately the AMDGPU backend is already using it in a different way, where they have pseudo copy terminators for spills. I have to take a look what we can do there. It also might just be a bad idea in general to say that these can never spill, but I'd like to try it if we can and I've not seen any problems with all the testing I've ran.

I'm not sure what you mean, are you talking about adding another optional register operand?

Yep. As you might imagine a lot of tests have changed. But it will make sure that the condition from a VPT block is never wrong, that we don't spill lr around an MVE instruction that would be using it as the predicate.

With the MVE instructions taking LR, maybe you wouldn't need to make the terminator change, have you tried it yet? I guess I'm naively assuming that the register allocator will be much less likely to spill something that is used by most / all of the MVE instructions within the loop.

With the MVE instructions taking LR, maybe you wouldn't need to make the terminator change, have you tried it yet? I guess I'm naively assuming that the register allocator will be much less likely to spill something that is used by most / all of the MVE instructions within the loop.

"Much less likely" isn't a fantastic answer for something that we should really never be reverting, unfortunately. You can always have a number of MVE instructions at the start of the loop followed by a load of scalar instructions. I think when I had it that way in experimental patches there were still some reverts happening.

That's all about different patches though! Any comments on this one? If we can get this in, I can commit a few others without breaking performance on some important machine learning matrix multiply kernels, which would be good.

That's all about different patches though! Any comments on this one?

Sorry, it's just difficult to gauge this without a view on the wider implementation idea. For the other stuff that you're talking about, of course we need that in MI but, like I said at the start, we could currently achieve this patch with probably 10x less code at the IR level.

we could currently achieve this patch with probably 10x less code at the IR level.

10x sounds like a lot. I think that the code would still need a t2DoLoopStartTP instruction, so those change would all have to stay. The findLoopComponents is used in other patches - that is what they are built upon and seems generally useful so that would have to stay, along with the "loop" boilerplate changes for that pass. That just leaves ConvertTailPredLoop. If we did it pre-isel then we would need an extra intrinsic for the space of time between where we convert it and ISel, plus the lowering in ISel. It would still need to recognize the instruction and do some checks for things like VCTP's being the same, find the incoming phi value for the loop, then create the new instruction like is done here. This also moves the instruction to the end of the block which we would need to either find a new place for or it would lead to worse performance.

I don't think I see a lot of difference in it, and would prefer not to add a new intrinsic if we can avoid it.

dmgreen updated this revision to Diff 303379.Nov 6 2020, 2:13 AM

Rebase and fix up some comments I noticed reading this again.

dmgreen updated this revision to Diff 303384.Nov 6 2020, 2:40 AM

Remove some debug asserts that shouldn't be necessary.

I think that the code would still need a t2DoLoopStartTP instruction, so those change would all have to stay

Agreed, and I think this pseudo is very good idea.

The findLoopComponents is used in other patches

But I think it could be then added with those patches when needed.

AFAICT the search you're describing would be trivial in MVETailPredication now that you've, almost, added start_loop_iterations - and that's my problem. I don't understand why adding a target-specific intrinsic and a tablegen pattern to match it would not be the favoured approach?

samparker added inline comments.Nov 9 2020, 4:03 AM
llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
442

This change highlights how many times we query the LoopStart opcode and it looks worth while to have this IsDo as a little helper!

627–628

Call getLoopStartOperand instead?

1444–1445

Can use getLoopStartOperand again.

llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
189

What about only checking that it's not predicated in the case where there's more than one VCTP? At least then we can handle the VPT -> VCTP case.

230

I don't follow what's happening here, what uses could there be which we need to schedule for?

dmgreen updated this revision to Diff 303867.Nov 9 2020, 7:22 AM

Added IsDo and altered the code about checking predicates.

llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
189

Would the backend pass handle multiple VCTP's in different blocks? If so we could just remove the check.

I've done that here, but can add it back in if you think it might cause problems.

230

There can be COPY's between the LoopStart and the PHI.

samparker accepted this revision.Nov 9 2020, 7:46 AM
samparker added inline comments.
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
189

Yep, this should be fine. The backend can handle any number of VCTPs and VPT blocks.

llvm/test/CodeGen/Thumb2/mve-vldshuffle.ll
169 ↗(On Diff #303867)

How was the register allocator getting this so wrong?! I wonder if there's some logic missing, or some priorities need to be re-ordered, to handle the case of register classes with only one register.

This revision is now accepted and ready to land.Nov 9 2020, 7:46 AM
This revision was automatically updated to reflect the committed changes.
pirama added a subscriber: pirama.Nov 10 2020, 11:27 AM
pirama added inline comments.
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
234

This variable seems unused outside of debugging purposes:

llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp:234:23: warning: unused variable 'MI' [-Wunused-variable]
  MachineInstrBuilder MI = BuildMI(*MBB, InsertPt, LoopStart->getDebugLoc(),
                      ^
1 warning generated.

Ah, Thanks! I'll try and fix that up now