This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] Generic Hardware Loop Support
ClosedPublic

Authored by samparker on May 29 2019, 8:52 AM.

Details

Summary

Patch which introduces a target-independent framework for generating hardware loops at the IR level. Most of the code has been taken from PowerPC CTRLoops and PowerPC has been ported over to use this generic pass. The target dependent parts have been moved into TargetTransformInfo, via isHardwareLoopProfitable, with HardwareLoopInfo introduced to transfer information from the backend.

Three generic intrinsics have been introduced:

  • void @llvm.set_loop_iterations. Takes as a single operand, the number of iterations to be executed.
  • i1 @llvm.loop_decrement(anyint). Takes the maximum number of elements processed in an iteration of the loop body. Returns false if the loop should exit.
  • anyint @llvm.loop_decrement_reg(anyint, anyint). Takes the number of elements remaining to be processed as well as the maximum number of elements processed in an iteration of the loop body. Returns the updated number of elements remaining.

Diff Detail

Event Timeline

samparker created this revision.May 29 2019, 8:52 AM

Where in the pipeline is this supposed to run?
How will this interact with all the usual loop transforms?

lib/CodeGen/HardwareLoops.cpp
2

Missing header blurb

As late as possible, it's expected that the intrinsics map almost 1-1 to an instruction and I expect them to prevent most other loop transforms. Currently it's added PreISel by PowerPC and we'll be doing similar for Arm.

samparker updated this revision to Diff 202150.May 30 2019, 3:58 AM

Added some more tests and changed the pass to enable loop nesting. Also now enabling checking of the preheader for validity.

samparker updated this revision to Diff 202157.May 30 2019, 5:01 AM

Added file description.

SjoerdMeijer added inline comments.Jun 3 2019, 2:13 AM
include/llvm/IR/Intrinsics.td
1185

Do we need to document these intrinsics? Do they need to be 'experimental' first, like the ones above?

Went through this for the first time, just some initial nits inline, will continue looking a bit.

lib/CodeGen/HardwareLoops.cpp
55

Nit: perhaps add something along the lines of "force hwloops even when the cost model does not deem it profitable"?

181

Do we want to check the return value?

lib/Target/PowerPC/PPCISelLowering.cpp
9840

Should this be commented out? Remove this?

markus added inline comments.Jun 3 2019, 4:06 AM
include/llvm/Analysis/TargetTransformInfo.h
462

Is NumElements really required by the initial commit that only adds the loop intrinsics?

468

I find this confusing. I understand that a preheader is required for the transformation (because that is where the llvm.set.loop.iterations is inserted) but why would an additional/new preheader ever be needed?

samparker marked 5 inline comments as done.Jun 3 2019, 5:11 AM
samparker added inline comments.
include/llvm/Analysis/TargetTransformInfo.h
462

Not strictly, but it makes the insertion of the loop.decrement intrinsic cleaner. In this patch, this value is always '1' and used several times, so it's nice to have this initialised and stored for re-use.

468

The current use case is when the existing preheader could interfere with whatever method the architecture uses for the loops. For PowerPC they don't want the mctr register being incorrectly written.

include/llvm/IR/Intrinsics.td
1185

I'm honestly unsure when *experimental* is needed. I've assumed they're normally used when the operation maybe slightly ambiguous or needs to be interpreted by a lot of the rest of the infrastructure.

lib/CodeGen/HardwareLoops.cpp
181

It's not neeeded, TryConvertLoop returns a value to control its recursive search. MadeChange is a member and is written by it.

lib/Target/PowerPC/PPCISelLowering.cpp
9840

Ah, cheers. Will need to take a look at that..

samparker updated this revision to Diff 202702.Jun 3 2019, 6:03 AM
  • Added comments describing the intrinsics.
  • loop_decrement now just returns an i1, instead of being declared anyint. This allowed removing some changes from PPC backend.
markus added inline comments.Jun 3 2019, 6:07 AM
include/llvm/Analysis/TargetTransformInfo.h
468

Hmm, I am not sure what exactly interfere means here so to me it sounds a bit questionable. An existing preheader should be as good as a newly created one. Maybe this is something PowerPC does in its current implementation because it makes things easier but in that case I don't think it should go into the generic framework.

samparker marked an inline comment as done.Jun 3 2019, 6:57 AM
samparker added inline comments.
include/llvm/Analysis/TargetTransformInfo.h
468

I agree. I actually removed the checks originally and will do so again. I think it is just to make it easy, but I'm not actually convinced it's required. I'm assuming that as long as set_loop_iteration is immediately followed by the terminator, there's nothing else to write CTR.

samparker updated this revision to Diff 202710.Jun 3 2019, 7:07 AM

Removed RequiresNewPreheader from HardwareLoopInfo.

markus added a comment.Jun 4 2019, 2:33 AM

What happens if a target decides late (like addPreEmitPass() late) that a hardware loop is no longer possible due to reasons that are not detectable from the IR level hook. For PowerPC this does not seem to be an issue, or at least I cannot find it in the code, but for other targets there could be limitations that you need to analyse the actual machine instructions to find out about. Since the original IV chain has been replaced by intrinsic llvm.loop.decrement.reg (or whatever it iselected to) it seems we now would need to manually insert instructions to update the loop counter which could possibly be a bit hairy to do this late.

On the other hand (but not saying that it is better) if one treats the intrinsic as a pure hint and keep the original loop counter intact then the problem here becomes easier as it would just be a matter of deleting instructions if hardware loop insertion is successful and doing nothing if it fails.

luke added a subscriber: luke.Jun 4 2019, 2:49 AM
samparker added a comment.EditedJun 4 2019, 3:51 AM

I would expect many targets to have some kind of validity check late on in the pipeline. loop.decrement.reg is designed so that it could just be selected to a machine sub, as the IV chain still exists along with the icmp and br. I have assumed that because the intrinsic behaves like a sub, any target should be able to, hopefully trivially, fall back to a machine sub late on. Is this something that would be difficult for you..? The loop.decrement, which produces an i1, would cause more problems but this framework allows the backend to make the best decision for itself.

samparker edited the summary of this revision. (Show Details)Jun 4 2019, 3:52 AM
markus added a comment.Jun 4 2019, 4:17 AM

I would expect many targets to have some kind of validity check late on in the pipeline. loop.decrement.reg is designed so that it could just be selected to a machine sub, as the IV chain still exists along with the icmp and br. I have assumed that because the intrinsic behaves like a sub, any target should be able to, hopefully trivially, fall back to a machine sub late on. Is this something that would be difficult for you..? The loop.decrement, which produces an i1, would cause more problems but this framework allows the backend to make the best decision for itself.

Makes sense. Always setting CounterInReg and iselecting the resulting loop.decrement.reg as a sub (possibly with some additional annotation so that it can easily be removed later) would probably work well. Perhaps you could add something to the intrinsic documentation stating that it is designed to look and act as a sub for that particular purpose.

samparker updated this revision to Diff 202914.Jun 4 2019, 5:10 AM

Updated llvm.loop.decrement.reg comment description.

What happens if a target decides late (like addPreEmitPass() late) that a hardware loop is no longer possible due to reasons that are not detectable from the IR level hook. For PowerPC this does not seem to be an issue,

For PowerPC, it is an issue. We deal with this by checking, at the IR level, for all constructs that could cause the later steps to fail (or be invalid). We even have a dedicated MI-level verification pass which will assert if we missed something in this regard. This works, but is fragile. I like the idea discussed here of having a fallback lowering (although this patch seems to preserve the current scheme).

or at least I cannot find it in the code, but for other targets there could be limitations that you need to analyse the actual machine instructions to find out about. Since the original IV chain has been replaced by intrinsic llvm.loop.decrement.reg (or whatever it iselected to) it seems we now would need to manually insert instructions to update the loop counter which could possibly be a bit hairy to do this late.

On the other hand (but not saying that it is better) if one treats the intrinsic as a pure hint and keep the original loop counter intact then the problem here becomes easier as it would just be a matter of deleting instructions if hardware loop insertion is successful and doing nothing if it fails.

lib/CodeGen/HardwareLoops.cpp
327

Are you sure that isSafeToExpand[At] will always be true here? I recommend checking explicitly and then dealing in the callers of this function with the fact that it might fail.

lib/Target/PowerPC/PPCTargetTransformInfo.h
57

This can't have this name here, as CTR refers to a PowerPC-specific register. Shouldn't this just be folded into the isHardwareLoopProfitable implementation?

samparker marked 2 inline comments as done.Jun 4 2019, 11:25 PM
samparker added inline comments.
lib/CodeGen/HardwareLoops.cpp
327

Thanks, I'll look into it.

lib/Target/PowerPC/PPCTargetTransformInfo.h
57

I don't understand why that is, since this in a PowerPC-specific file. But I could probably just make it static.

samparker updated this revision to Diff 203098.Jun 5 2019, 1:23 AM
  • Added expand safety check..
  • Made mightUseCTR into a private method. I tried static but since it uses many PPCTTIImpl members it just makes sense to have it as part of the class.

Made mightUseCTR into a private method. I tried static but since it uses many PPCTTIImpl members it just makes sense to have it as part of the class.

Thanks, sounds good.

lib/Analysis/TargetTransformInfo.cpp
135

I'm unclear on the rationale for this default initialization. Is there really value here, or should we just set these to nullptr or similar?

samparker marked an inline comment as done.Jun 6 2019, 2:13 AM
samparker added inline comments.
lib/Analysis/TargetTransformInfo.cpp
135

I mainly did it so that we don't need a supported target to run some basic tests. But really I guess I can just make them cli options, as I've done for other parts of testing.

samparker updated this revision to Diff 203321.Jun 6 2019, 3:23 AM

Removed initialisation in HardwareLoopInfo constructor and, instead, added two more options to do the same.

SjoerdMeijer added inline comments.Jun 6 2019, 3:46 AM
include/llvm/Analysis/TargetTransformInfo.h
451

I am not sure we need speak about elements in vectorised or non-vectorised loops here, see comment below.

462

Can we now just call this the loop decrement?

466

Nit: "Should be loop". Spell error?

samparker updated this revision to Diff 203333.Jun 6 2019, 5:26 AM

Cheers, NumElements has been renamed to LoopDecrement.

hfinkel accepted this revision.Jun 6 2019, 8:16 AM

LGTM

This revision is now accepted and ready to land.Jun 6 2019, 8:16 AM
samparker closed this revision.Jun 7 2019, 12:33 AM

Committed in rL362774.

Herald added a project: Restricted Project. · View Herald TranscriptApr 18 2023, 7:09 PM