This is an archive of the discontinued LLVM Phabricator instance.

[globalisel] Add combiner helpers for extending loads and use them in a pre-legalize combiner for AArch64
ClosedPublic

Authored by dsanders on Apr 11 2018, 4:12 PM.

Diff Detail

Repository
rL LLVM

Event Timeline

dsanders created this revision.Apr 11 2018, 4:12 PM

Hey Daniel - this looks mostly good to me.

Maybe we could have all the tests in one file called prelegalize-combine-extloads.mir

I'm assuming this is in Target/AArch64 because other targets haven't been updated to use the new opcodes yet? We do eventually want to use these representations for every target though right?

Maybe we could have all the tests in one file called prelegalize-combine-extloads.mir

I went with three files to match how we've been organizing the tests for the other passes but you raise a good point here. There's a good argument for the combiner being tested with one file per CombinerHelper::try*() function. I think that's probably a better organization.

I'm assuming this is in Target/AArch64 because other targets haven't been updated to use the new opcodes yet? We do eventually want to use these representations for every target though right?

That's right. As you say, we'll want to support it in every target that supports the extending loads (which is most if not all of the in-tree targets). However, until the new opcodes are legal for those targets, there's not much point in combining them only to revert them back to load+extend in the legalizer.

One other thing to mention is that we don't have a target-independent combiner in GlobalISel at the moment. Each target implements its own combiner(s) and makes use of code in CombinerHelper (where appropriate) to share code. I expect these combines to be used by multiple targets so I've put the bulk of the code in CombinerHelper but each target will need to add a pass and call to it.
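To make that structure concrete, the per-target glue boils down to something like the sketch below. The names follow the AArch64 pass added in this patch, but treat the exact CombinerInfo/Observer plumbing and the set of opcodes dispatched on as illustrative rather than the committed code.

  // Sketch: a target-specific combine() that dispatches into the shared helper.
  bool AArch64PreLegalizerCombinerInfo::combine(CombinerChangeObserver &Observer,
                                                MachineInstr &MI,
                                                MachineIRBuilder &B) const {
    CombinerHelper Helper(Observer, B);
    switch (MI.getOpcode()) {
    default:
      return false;
    case TargetOpcode::COPY:
      return Helper.tryCombineCopy(MI);
    case TargetOpcode::G_LOAD:
    case TargetOpcode::G_SEXTLOAD:
    case TargetOpcode::G_ZEXTLOAD:
      return Helper.tryCombineExtendingLoads(MI);
    }
  }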

aemerson accepted this revision.Apr 23 2018, 8:34 AM


Ok makes sense. We still have some work to do on the combiner design front on how to allow targets to select subsets of combines they're interested in so we can have a shared pipeline. I think this is fine for now but it raises the priority for that discussion later.

This revision is now accepted and ready to land.Apr 23 2018, 8:34 AM
rtereshin added inline comments.Apr 30 2018, 12:40 PM
lib/CodeGen/GlobalISel/CombinerHelper.cpp
43 ↗(On Diff #142091)

It looks like we have a contract-inconsistency here. CombinerHelper::tryCombineCopy assumes it could be called on any opcode, and if it's not a COPY, it's expected to just gracefully return and report it didn't change anything.

The newly added CombinerHelper::tryCombineExtendingLoads, on the other hand, requires the opcode to belong to a specific subset.

I think it makes sense to be consistent about it, and probably not just within CombinerHelper, but among all the derived combiners and maybe even all GlobalISel combiners in general.

+ @aditya_nandakumar

54 ↗(On Diff #142091)

Is it possible to have a memory operand missing here?
Also, if not, does MachineVerifier enforce it?

63 ↗(On Diff #142091)

getOpcodeDef is able to look through copies; therefore, I'd expect this combiner to match the following sequence:

%v1:_(s16) = G_LOAD %ptr(p0), (load 2)
%v2:_(s16) = COPY %v1(s16)
%v3:_(s32) = G_ZEXT %v2(s16)

and produce the following output:

%v1:_(s16) = G_LOAD %ptr(p0), (load 2)
%v2:_(s16) = COPY %v1(s16)
%v3:_(s32) = G_ZEXTLOAD %ptr(p0), (load 2)

Do you think it's a good idea to add tests like this to check that this, in fact, actually happens and happens correctly?
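(For reference, the look-through behaviour comes from querying the defining instruction through copies, roughly as in the sketch below; SrcReg and MRI stand in for whatever the combine has in hand, and the exact signature is assumed.)

  // getOpcodeDef walks back through COPYs to the real defining instruction,
  // so the G_LOAD is still found even when a COPY sits between it and the extend.
  if (MachineInstr *LoadMI = getOpcodeDef(TargetOpcode::G_LOAD, SrcReg, MRI)) {
    // SrcReg is (possibly a copy of) the result of a G_LOAD.
    (void)LoadMI;
  }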

70 ↗(On Diff #142091)

This is clearly supposed to try all of the combines implemented in the helper:

  /// If \p MI is extend that consumes the result of a load, try to combine it.
  /// Returns true if MI changed.
  bool tryCombineExtendingLoads(MachineInstr &MI);

  /// Try to transform \p MI by using all of the above
  /// combine functions. Returns true if changed.
  bool tryCombine(MachineInstr &MI);
};
lib/Target/AArch64/AArch64PreLegalizerCombiner.cpp
49 ↗(On Diff #142091)

CombinerHelper::tryCombineCopy contains a bug (it doesn't check that the register classes and register banks of its source and destination are compatible). Once it's fixed* and CombinerHelper::tryCombine properly tries all of the combines implemented in CombinerHelper as it's supposed to, this could be replaced with just a single call to CombinerHelper::tryCombine, but not before.

*I'm planning on adding a patch with that fix soon, but it will be dependent on https://reviews.llvm.org/D45732 as the latter makes the former simpler and shorter.

+ @aditya_nandakumar

lib/CodeGen/GlobalISel/CombinerHelper.cpp
43 ↗(On Diff #142091)

Good catch @rtereshin - I would think being consistent here is nice, i.e. return gracefully if it's not the opcode we want (unless there's a strong reason to change that).

70 ↗(On Diff #142091)

I suspect he's put the calls to tryCombineExtendingLoads in the AArch64 pass because other targets may not handle the extending load opcodes correctly in the legalizer yet. If/when they're ready, then it makes sense to move them into the generic helper.

dsanders added inline comments.May 1 2018, 9:58 AM
lib/CodeGen/GlobalISel/CombinerHelper.cpp
43 ↗(On Diff #142091)

I can switch it over to that. I just thought it was a shame to check it on both sides of the call.
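For reference, the "graceful return" contract under discussion looks roughly like this (modelled on tryCombineCopy; a sketch, not the exact patch code):

  bool CombinerHelper::tryCombineCopy(MachineInstr &MI) {
    // Callers may hand us any instruction; report "no change" rather than
    // asserting when it isn't the opcode this combine handles.
    if (MI.getOpcode() != TargetOpcode::COPY)
      return false;
    unsigned DstReg = MI.getOperand(0).getReg();
    unsigned SrcReg = MI.getOperand(1).getReg();
    LLT DstTy = MRI.getType(DstReg);
    LLT SrcTy = MRI.getType(SrcReg);
    // Simple copy propagation when both sides carry the same generic type.
    if (DstTy.isValid() && SrcTy.isValid() && DstTy == SrcTy) {
      MI.eraseFromParent();
      MRI.replaceRegWith(DstReg, SrcReg);
      return true;
    }
    return false;
  }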

54 ↗(On Diff #142091)

Is it possible to have a memory operand missing here?

No, several parts of GlobalISel require it

Also, if not, does MachineVerifier enforce it?

Yes, it reports 'Generic instruction accessing memory must have one mem operand' if it's not the case
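Since the verifier guarantees exactly one memory operand on generic memory-accessing instructions, the combine can simply rely on it, along the lines of (illustrative):

  // The verifier enforces exactly one MachineMemOperand on G_LOAD-family
  // instructions, so dereferencing the first one is safe here.
  MachineMemOperand &MMO = **MI.memoperands_begin();
  if (MMO.isVolatile())
    return false; // Stay conservative for volatile accesses.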

63 ↗(On Diff #142091)

That makes sense to me. We won't want to go overboard with that kind of thing, though; it's enough to check it in a couple of combines.

70 ↗(On Diff #142091)

They'll combine correctly, but the legalizer will immediately decompose them again on most targets at the moment, so it's just wasted effort that slows compile times. This brings up something I've been wondering how to deal with over the last few days: if we continue down the path we're on, I currently don't see how we're going to manage large collections of combines across several targets.

Suppose the Foo and Bar targets support the following combines:

tryCombine
  tryCombineGroup1
    tryA
    tryB
    tryC
  tryCombineGroup2
    tryD
    tryE
    tryF

and each calls tryCombine. Now suppose Bar realizes that tryE is doing more harm than good.

We could make tryE check that the target isn't Bar. The consequence of that is that we have a monolithic implementation covering all targets. It means a target-specific change can introduce all-target bugs, all targets must carry everyone's complexity, every change needs testing (including performance) for every target, etc.

We could make Bar call tryCombineGroup1, tryD, and tryF instead. The consequence of that is that useful changes to tryCombine and tryCombineGroup2 won't apply to Bar. Everyone will have to review combines and choose to enable them individually, and as such won't automatically benefit from improvements. They also won't suffer from regressions, which is a good thing, but I would hope those are less common.

We could split tryCombineGroup2 to get (this is just one example):

tryCombine
  tryCombineGroup1
    tryA
    tryB
    tryC
  tryCombineGroup2
    tryD
    tryF
  tryCombineGroup3
    tryE

This of course assumes that the ordering between tryE and tryF doesn't matter. If we do that enough, we eventually reach:

tryA
tryB
tryC
tryD
tryF
tryE

bringing with it the same problems as the previous option.

The answer I keep returning to at the moment is that even if the majority of our implementation is C++, we need something declarative to organize and control the combines for each target. We don't necessarily need to go to my original intent of re-using the ISel matcher for combines and thereby doing the majority of the matching in tablegen-generated code, but we do need a declarative way to control whether a combine is enabled/disabled (which is trivial) and the order they're applied in (not so trivial).
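To make the grouping idea concrete, one hypothetical C++ shape for it (names invented purely for illustration) would be:

  // Hypothetical grouping: targets call whole groups or individual combines,
  // and tryCombine remains a convenience wrapper over everything.
  bool CombinerHelper::tryCombineGroup1(MachineInstr &MI) {
    return tryA(MI) || tryB(MI) || tryC(MI);
  }
  bool CombinerHelper::tryCombineGroup2(MachineInstr &MI) {
    return tryD(MI) || tryE(MI) || tryF(MI);
  }
  bool CombinerHelper::tryCombine(MachineInstr &MI) {
    return tryCombineGroup1(MI) || tryCombineGroup2(MI);
  }
  // A target like Bar that wants everything except tryE would instead call
  // tryCombineGroup1(MI) || tryD(MI) || tryF(MI) from its own combine().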

lib/Target/AArch64/AArch64PreLegalizerCombiner.cpp
49 ↗(On Diff #142091)

and CombinerHelper::tryCombine properly tries all of the combines implemented in CombinerHelper as it's supposed to

At the moment, that will look like a good idea but having a single monolithic tryCombine() isn't going to last long as more combines get implemented and more targets use combines. (see above)

rtereshin added inline comments.May 1 2018, 10:55 AM
lib/CodeGen/GlobalISel/CombinerHelper.cpp
54 ↗(On Diff #142091)

Cool, thanks!

63 ↗(On Diff #142091)

Agreed.

70 ↗(On Diff #142091)

I like the idea of groups. I think it will make it relatively easy for target-writers (and not only human target-writers ;-) ) to compose a reasonable pipeline for their needs. Also, the grouping doesn't have to be one level deep; it could be more, if well-structured and useful.

It may be worth contemplating, though, what principle should be used to group combines together. I think it's going to be tempting to group them by root-level opcode: all G_ADD combines, all arithmetic combines (a second-level group), etc., but this kind of grouping may prove less useful than architecture / micro-architecture targeted grouping, like "combines for super-scalar architectures with a lot of instruction-level parallelism", "combines for architectures with highly efficient SIMD units", etc.; something driven by what actual targets use and ignore.

As for the declarative approach, honestly, putting as much as possible into Tablegen doesn't strike me as a wise approach. Tablegen'erated implementations are hard to read and search, interfaces between them and the rest of the compiler tend to be obscure and fragile, and it's very hard to make changes to any of it. If it's possible to derive the implementation from the information already provided by target writers, extending a Tablegen backend at least goes in the "let's auto-generate the entire compiler" direction, which is valuable. For instance, if we could derive the optimal pipeline of combines from the scheduling and instruction info already put together by the targets. If, however, we need to come up with a new set of Tablegen classes to explicitly define that pipeline manually in a *.td file, and to write an entirely new Tablegen backend to process it, that doesn't look like a valuable thing to do.

We could be as declarative as we want to be in C++ and that may be much easier to work with.
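As a small example of what "declarative in C++" could look like (entirely hypothetical names, just to illustrate the point), a target could describe its pipeline as data:

  // Hypothetical: describe the combine pipeline as a table rather than as
  // hard-coded call sequences, so it can be filtered, reordered, or dumped.
  struct CombineRule {
    const char *Name;
    bool (CombinerHelper::*Apply)(MachineInstr &MI);
  };

  static const CombineRule PreLegalizerRules[] = {
      {"copy-prop", &CombinerHelper::tryCombineCopy},
      {"extending-loads", &CombinerHelper::tryCombineExtendingLoads},
  };

  static bool runRules(CombinerHelper &Helper, MachineInstr &MI) {
    for (const CombineRule &R : PreLegalizerRules)
      if ((Helper.*R.Apply)(MI))
        return true;
    return false;
  }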

Also, I think it may be valuable to make whatever design we eventually come up with easily compatible with a hypothetical (at the moment) tool that would be able to generate an optimal combine-pipeline for a target automatically, provided a performance feedback mechanism. Let it start with something reasonable, compile a corpus of programs, evaluate the code quality via that feedback mechanism, mutate the pipeline a bit, and try again.

That sort of thing calls for a separate binary tool with wide access to LLVM infrastructure, including all the combines, not for a Tablegen backend. I gave up on making Testgen a part of the Tablegen process very quickly, for instance.

lib/Target/AArch64/AArch64PreLegalizerCombiner.cpp
49 ↗(On Diff #142091)

Something still needs to be done about tryCombine. If it doesn't represent any practically useful group used by any target (not necessarily in the final implementation; during experimentation counts too), let's remove it. If it does, let's maintain it so that it does precisely what it promises to do.

aemerson added inline comments.May 8 2018, 8:33 AM
lib/CodeGen/GlobalISel/CombinerHelper.cpp
70 ↗(On Diff #142091)

All good points from you and Daniel. I don't know if tablegen is the best tool for the combiners, as I worry we'll paint ourselves into a corner once we go down that path.

I think it would be worth exploring the design of combiner groups, and higher-level schemes expressed in a declarative way as Daniel wants. The schemes could encapsulate the choices a target makes about its combiner configuration, with a default one that most targets would use, at least at the beginning. Where customisability comes into it is with something like predicates/masks that enable/disable the activation of specific combines in the groups, so a scheme is a set of groups, or an existing scheme with overlaid modifications. For altering the order of combines, an overlay could express the reversal of two given combines within a specific group.

dsanders added inline comments.May 8 2018, 12:05 PM
lib/CodeGen/GlobalISel/CombinerHelper.cpp
70 ↗(On Diff #142091)

It may be worth contemplating, though, what principle should be used to group combines together. I think it's going to be tempting to group them by root-level opcode: all G_ADD combines, all arithmetic combines (a second-level group), etc., but this kind of grouping may prove less useful than architecture / micro-architecture targeted grouping, like "combines for super-scalar architectures with a lot of instruction-level parallelism", "combines for architectures with highly efficient SIMD units", etc.; something driven by what actual targets use and ignore.

That's what I was thinking, and it's the reason for naming this tryCombineExtendingLoads(). We might need to break it down as more targets get added, though, as some only have sign-extending loads and others only have zero-extending loads.

As for the declarative approach, honestly, putting as much as possible into Tablegen doesn't strike me as a wise approach...

As you say, it doesn't necessarily have to be tablegen. The main thing is being able to throw the common rules and target-dependent rules together and get something that executes them in a sensible order and excludes the things the target doesn't want.

Also, I think it may be valuable to make whatever design we eventually come up with easily compatible with a hypothetical (at the moment) tool that would be able to generate an optimal combine-pipeline for a target automatically, provided a performance feedback mechanism. Let it start with something reasonable, compile a corpus of programs, evaluate the code quality via that feedback mechanism, mutate the pipeline a bit, and try again.

I agree. This line of thinking is a key reason I included a code coverage mechanism in the InstructionSelector. By extending the 1-bit counters to N-bit and feeding that into the rule sorting, we would end up with a match table that prioritized common rules as far as correctness allowed.

All good points from you and Daniel. I don't know if tablegen is the best tool for the combiners, as I worry we'll paint ourselves into a corner once we go down that path.

That's something we'll need to investigate while we look at how to maintain this long-term.

Just for reference, in my original plan back when I started on the DAGISel importer, my main arguments in favour of tablegen for combiners were:

  • ISel and Combiners are essentially the same thing. They match MIR and replace it with equivalent MIR.
  • By using the same infrastructure for both, optimization investments we make for one can benefit the other.
  • By using the same infrastructure for both, solving the ordering problem for one solves it for the other.
  • The feature bits mechanism (e.g. Requires<[HasX]>) is ideal for compiling out Combines as you can simply tell tablegen to discard rules with particular feature bits and check the rest at run time. This can also be done for ISel, potentially reducing the match table size when you only need particular subtarget(s).

I didn't go through the implementation details for Combiners at the time, but I did leave a few doors open in case we decided to go down this route later. For example, this is the reason the implementation permits multiple match roots.

dsanders updated this revision to Diff 145802.May 8 2018, 3:30 PM
dsanders marked 12 inline comments as done.

Update before commit

This revision was automatically updated to reflect the committed changes.
dsanders reopened this revision.May 29 2018, 10:17 AM

Re-opening as I'm about to update it with a revised version. The previous one broke several bots because it sank loads down to the extends

This revision is now accepted and ready to land.May 29 2018, 10:17 AM
dsanders updated this revision to Diff 148945.May 29 2018, 10:42 AM

Rewrite the patch such that the extends hoist up to the loads. The previous
patch had a couple problems that became clear on the bots:

  • The loads could be duplicated. This is a correctness problem for volatile loads but more generally is likely to harm performance.
  • The loads would sink to the extends without any hazard checking

Matching the load and folding in the uses is significantly more complex than
the previous code, but by anchoring the load where it is we avoid both the
correctness and the performance issues.

The majority of the complexity comes from the need to resolve multiple
(potentially conflicting) uses. The approach this patch takes is fairly simple
in principle:

  1. Pick a preferred extend (see below)
  2. Fold that extend into the load
  3. Fix up the other uses with truncates/extends

Steps 2 and 3 are fairly straightforward, but step 1 relies on heuristics to make a good
choice. The current heuristics (sketched in code after this list) are:

  • Prefer a sext/zext over an anyext. This is on the basis that anyext is essentially free (and we therefore don't save instructions by folding it), is compatible with sext/zext, and extending with defined bits opens further optimization opportunities once we have known-bits infrastructure.
  • Prefer a sext over a zext. This is on the basis that a zext typically lowers to a single 'and' with an immediate whereas a sext typically lowers to a shift-left/shift-right sequence (except in special cases). It's therefore cheaper to re-emit a zext for the remaining uses than it would be to re-emit a sext.
  • Prefer larger types. This is on the basis that G_TRUNC is usually free (Mips is a notable exception and will probably want to tweak this when its port gets this far). There is a catch with this, though, since it can also increase the pressure on the larger registers, which some targets have a smaller supply of.
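A minimal sketch of that preference order, assuming it reduces to comparing candidate uses pairwise (helper names invented here, not the patch's actual code):

  // Rank extend opcodes by how much folding them into the load should save:
  // sext > zext > anyext > anything else.
  static unsigned extendFoldingRank(unsigned Opc) {
    switch (Opc) {
    case TargetOpcode::G_SEXT:
      return 3; // Most expensive to materialize separately.
    case TargetOpcode::G_ZEXT:
      return 2; // Usually a single 'and' with an immediate.
    case TargetOpcode::G_ANYEXT:
      return 1; // Essentially free; folding it saves little.
    default:
      return 0;
    }
  }

  // True if use A is a better extend to fold into the load than use B.
  static bool isPreferredUse(const MachineInstr &A, const MachineInstr &B,
                             const MachineRegisterInfo &MRI) {
    unsigned RankA = extendFoldingRank(A.getOpcode());
    unsigned RankB = extendFoldingRank(B.getOpcode());
    if (RankA != RankB)
      return RankA > RankB;
    // Tie-break on destination size: prefer larger types, since G_TRUNC for
    // the remaining uses is usually free.
    return MRI.getType(A.getOperand(0).getReg()).getSizeInBits() >
           MRI.getType(B.getOperand(0).getReg()).getSizeInBits();
  }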
dsanders updated this revision to Diff 148961.May 29 2018, 12:21 PM

The full patch. The previous update was taken from the wrong machine and hadn't
been finished. Aside from not compiling, there was also a bug where additional
combine attempts would see G_SEXTLOAD mutate into G_LOAD (the anyext form).

aemerson added inline comments.May 31 2018, 5:30 AM
test/CodeGen/AArch64/GlobalISel/arm64-fallback.ll
42 ↗(On Diff #148961)

This was removed during r332449?

dsanders added inline comments.Jun 7 2018, 7:59 AM
test/CodeGen/AArch64/GlobalISel/arm64-fallback.ll
42 ↗(On Diff #148961)

This is probably a result of the rebase. I'll remove it

Hi Daniel, sorry for the delay.

I have some questions inline, but overall it seems that we don't keep track of how many of the uses are each particular kind of extend. Could we have situations where overall we increase code size due to, for example, preferring a sign extend even if we have multiple zero-extending users? This might not be worth tackling in a first pass at this if it's a rare case, though.

include/llvm/CodeGen/GlobalISel/CombinerHelper.h
34 ↗(On Diff #148961)

There doesn't seem to be a user of this.

lib/CodeGen/GlobalISel/Combiner.cpp
29 ↗(On Diff #148961)

What's the purpose of this? Currently it just wraps around the underlying worklist. Debugging messages only?

33 ↗(On Diff #148961)

Should we have a specialization for GISelWorklist<512>, and then not repeat the 512 in multiple places?

37 ↗(On Diff #148961)

Delete or DEBUG()?

lib/CodeGen/GlobalISel/CombinerHelper.cpp
55 ↗(On Diff #148961)

Can we promote this lambda into a helper function? From the name and lack of header doc it's unclear what it's supposed to do with the params on first reading.

64 ↗(On Diff #148961)

What else could ExtendOpcodeB be at this point? If nothing else, should this be an assert?

Thanks Amara

I have some questions inline, but overall it seems that we don't keep track of how many of the uses are each particular kind of extend. Could we have situations where overall we increase code size due to, for example, preferring a sign extend even if we have multiple zero-extending users? This might not be worth tackling in a first pass at this if it's a rare case, though.

That's a good point. Yes, that can happen at the moment.

I can see two cases (and mixtures of the two). The first is multiple zexts to the same type. In this case, we would generate one trunc/zext pair for each use. CSE ought to fix it later (once we enable that) but it would be better to CSE it up front and avoid paying the cost of allocating+processing the redundant instructions. A simple map of opcode and LLT to vreg while emitting the trunc/zexts ought to do the trick there. The other is the case where there are zexts to multiple types. In this case, picking the sext and trunc/zext is still a win as it eliminates 1-2 instructions whereas whichever zext we pick can eliminate at most 1, since all but one zext still needs to emit an instruction. There are target-specific special cases, though. For example, Mips sext from 32-bit to >32-bit is free (because it was actually done by the trunc or the gpr32 instruction that def'd it) so picking that particular sext is worse than picking any zext.
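A sketch of that "simple map" idea for the first case (illustrative only; Builder and MRI stand in for the helper's MachineIRBuilder and MachineRegisterInfo, and the real fix-up code may be structured differently):

  // Re-use a previously emitted fix-up of the same kind and type instead of
  // emitting a duplicate for every use. Key: (opcode, destination size).
  std::map<std::pair<unsigned, unsigned>, unsigned> EmittedFixups;
  auto getOrEmitFixup = [&](unsigned Opcode, LLT DstTy, unsigned SrcReg) {
    auto Key = std::make_pair(Opcode, DstTy.getSizeInBits());
    auto It = EmittedFixups.find(Key);
    if (It != EmittedFixups.end())
      return It->second; // CSE the trunc/zext up front.
    unsigned DstReg = MRI.createGenericVirtualRegister(DstTy);
    Builder.buildInstr(Opcode).addDef(DstReg).addUse(SrcReg);
    EmittedFixups[Key] = DstReg;
    return DstReg;
  };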

include/llvm/CodeGen/GlobalISel/CombinerHelper.h
34 ↗(On Diff #148961)

That's right. It's not needed for this particular combine but there should be some that need it in future. I could drop it from this patch

lib/CodeGen/GlobalISel/Combiner.cpp
29 ↗(On Diff #148961)

The intent behind CombinerChangeObserver is to inform the combiner pass that certain events happened in CombinerHelper. The main ones I'm expecting to end up with are instruction creation and deletion. I can also see instruction modification events here in future. I don't think scheduling for revisit should be in CombinerChangeObserver though. I think that should be determined by the combiner pass in response to these events. It should also be derived from the implemented combines in some way to avoid the issue in DAGCombine where combines are sometimes missed because one rule failed to schedule the right node (e.g. because the root wasn't directly connected to the modified nodes in the case of combines that cover several nodes).

The reason for having CombinerChangeObserver subclasses rather than just implementing it directly is so that CombinerHelper is usable by lots of different kinds of Combiner passes. We might have different implementations for O0 - O3, or a specific strategy might work better for one particular target, or maybe the pass isn't a combiner at all and just wants to borrow a couple combines. Each implementation would provide a CombinerChangeObserver subclass that glues CombinerHelper into its implementation.

This particular implementation ought to grow a mechanism to limit the run-time of the pass. Exactly how that should work will need some more thought, but as an example we could track how deeply recursed the combiner is (i.e. how many combine rules triggered to create a given instruction) and decline to schedule instructions beyond a certain point. For O1, combines might be limited to the first generation of new/modified instructions; for O2, maybe the first 3 generations; and O3 might be unlimited.
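For reference, the interface being described amounts to roughly the following (a simplified sketch of the observer, not the verbatim patch code):

  // Notifies the owning combiner pass about events produced by CombinerHelper,
  // leaving re-scheduling and cut-off decisions to the pass itself.
  class CombinerChangeObserver {
  public:
    virtual ~CombinerChangeObserver() {}
    // A combine created (or substantially modified) an instruction.
    virtual void createdInstr(MachineInstr &MI) = 0;
    // A combine erased an instruction.
    virtual void erasedInstr(MachineInstr &MI) = 0;
  };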

33 ↗(On Diff #148961)

Yes, let's add an alias for it as Combiner::WorkListTy

37 ↗(On Diff #148961)

Oops. That was some temporary debugging code. It's useful for tracing the combines, though, so let's make it DEBUG()/dbgs().

lib/CodeGen/GlobalISel/CombinerHelper.cpp
55 ↗(On Diff #148961)

Sure.

64 ↗(On Diff #148961)

Anything. There's no guarantee that the use of a load is an extend. We ignore that when selecting a preferred use and deal with the non-extends by inserting a trunc (which is free for most targets) later

aemerson requested changes to this revision.Jun 24 2018, 7:17 PM
aemerson added inline comments.
include/llvm/CodeGen/GlobalISel/CombinerHelper.h
34 ↗(On Diff #148961)

Ok, as long as we're aware let's keep it in.

lib/CodeGen/GlobalISel/Combiner.cpp
29 ↗(On Diff #148961)

Ok, thanks. I think it would be beneficial to have a comment at the definition (it can be your reply here, summarised).

lib/CodeGen/GlobalISel/CombinerHelper.cpp
64 ↗(On Diff #148961)

Then perhaps change the name to something more accurate, like "UseOpcode"?

This revision now requires changes to proceed.Jun 24 2018, 7:17 PM
dsanders updated this revision to Diff 160708.Aug 14 2018, 3:29 PM
dsanders marked 14 inline comments as done.

Rebased

Fixed the various nits

Fixed a problem where the pass manager couldn't schedule 'Function Alias
Analysis Results' which was apparently required by 'AArch64 Instruction
Selection' according to the pass manager... except it wasn't. The actual problem
was that we weren't preserving the StackProtector pass.

I've also added preservation of the CFG analyses while I was fixing the pass
manager issue above.

dsanders marked 3 inline comments as done.Aug 14 2018, 3:31 PM

I believe the only remaining issue on this was the compile-time impact. I've run CTMark* before and after and found that no tests were significantly different. The geomean was improved by 1.24%. I'm going to be working on the combiner infrastructure for the next few months so if there are any further issues that come up we can address them post-commit.

  • Except bullet, kc, and tramp3d-v4 as these don't seem to build when I target AArch64. They all fail due to missing headers for me.
include/llvm/CodeGen/GlobalISel/Combiner.h
35 ↗(On Diff #160708)

This should not be necessary any more. You can attach a delegate to the machine function to observe insertions and deletions. The combiner can install a delegate at the beginning of the pass.

Do you see any other reason to have the Observer mechanism? Is there a need to rely on it maintaining the scheduling order of visits, or is it just to keep track of insertions and deletions?
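For context, the delegate hook being referred to is MachineFunction::Delegate; a combiner could install one for the duration of the pass, roughly like this (sketch; install/reset shown only as comments):

  // Observe instruction insertions/removals via MachineFunction's delegate
  // instead of a combiner-specific observer class.
  struct CombinerDelegate : public MachineFunction::Delegate {
    void MF_HandleInsertion(MachineInstr &MI) override {
      // e.g. add MI to the combiner's worklist.
    }
    void MF_HandleRemoval(MachineInstr &MI) override {
      // e.g. remove MI from the combiner's worklist.
    }
  };
  // In the pass:
  //   CombinerDelegate D;
  //   MF.setDelegate(&D);
  //   ... run combines ...
  //   MF.resetDelegate(&D);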

This revision was not accepted when it landed; it landed in state Needs Review.Oct 1 2018, 11:58 AM
This revision was automatically updated to reflect the committed changes.
dsanders reopened this revision.Oct 2 2018, 2:55 PM
This revision was not accepted when it landed; it landed in state Needs Review.Oct 2 2018, 7:14 PM
This revision was automatically updated to reflect the committed changes.