Depends on D45541
Details
- Reviewers
ab aditya_nandakumar bogner rtereshin volkan rovka javed.absar aemerson - Commits
- rGc973ad1878f3: Re-commit: [globalisel] Add a combiner helpers for extending loads and use them…
rG9659bfda5a64: [globalisel] Add a combiner helpers for extending loads and use them in a pre…
rGd24dcdd1f74b: [globalisel] Add a combiner helpers for extending loads and use them in a pre…
rL343654: Re-commit: [globalisel] Add a combiner helpers for extending loads and use them…
rL343521: [globalisel] Add a combiner helpers for extending loads and use them in a pre…
rL331816: [globalisel] Add a combiner helpers for extending loads and use them in a pre…
Diff Detail
- Repository
- rL LLVM
Event Timeline
Maybe we could have all the tests in one file called prelegalize-combine-extloads.mir
I'm assuming this is in Target/AArch64 because other targets haven't been updated to use the new opcodes yet? We do eventually want to use these representations for every target though right?
I went with three files to match how we've been organizing the tests for the other passes but you raise a good point here. There's a good argument for the combiner being tested with one file per CombinerHelper::try*() function. I think that's probably a better organization.
That's right. As you say, we'll want to support it in every target that supports the extending loads (which is most if not all of the in-tree targets). However, until the new opcodes are legal for those targets, there's not much point in combining them only to revert them back to load+extend in the legalizer.
One other thing to mention is that we don't have a target-independent combiner in GlobalISel at the moment. Each target implements its own combiner(s) and makes use of code in CombinerHelper (where appropriate) to share code. I expect these combines to be used by multiple targets so I've put the bulk of the code in CombinerHelper but each target will need to add a pass and call to it.
Ok makes sense. We still have some work to do on the combiner design front on how to allow targets to select subsets of combines they're interested in so we can have a shared pipeline. I think this is fine for now but it raises the priority for that discussion later.
lib/CodeGen/GlobalISel/CombinerHelper.cpp | ||
---|---|---|
43 ↗ | (On Diff #142091) | It looks like we have a contract-inconsistency here. CombinerHelper::tryCombineCopy assumes it could be called on any opcode, and if it's not a COPY, it's expected to just gracefully return and report it didn't change anything. While the newly added CombinerHelper::tryCombineExtendingLoads requires the opcode belonging to a specific subset. I think it makes sense to be consistent about it and probably not just within the CombinerHelper, but among all the derived combiners and maybe even all global isel combiners in general. |
54 ↗ | (On Diff #142091) | Is it possible to have a memory operand missing here? |
63 ↗ | (On Diff #142091) | getOpcodeDef is able to look through copies, therefore I'd expect this combiner to match the following sequence: %v1:_(s16) = G_LOAD %ptr(p0), (load 2) %v2:_(s16) = COPY %v1(s16) %v3:_(s32) = G_ZEXT %v2(s16) and produce the following output: %v1:_(s16) = G_LOAD %ptr(p0), (load 2) %v2:_(s16) = COPY %v1(s16) %v3:_(s32) = G_ZEXTLOAD %ptr(p0), (load 2) Do you think it's a good idea to add tests like this and control that this, in fact, actually happens and happens correctly? |
70 ↗ | (On Diff #142091) | This is clearly supposed to try all of the combines implemented in the helper: /// If \p MI is extend that consumes the result of a load, try to combine it. /// Returns true if MI changed. bool tryCombineExtendingLoads(MachineInstr &MI); /// Try to transform \p MI by using all of the above /// combine functions. Returns true if changed. bool tryCombine(MachineInstr &MI); }; |
lib/Target/AArch64/AArch64PreLegalizerCombiner.cpp | ||
49 ↗ | (On Diff #142091) | CombinerHelper::tryCombineCopy contains a bug (it doesn't check that register classes and register banks of its source and destination are compatible ), as soon as it's fixed* and CombinerHelper::tryCombine properly tries all of the combines implemented in CombinerHelper as it's supposed to, this could be replaced with just a single call to CombinerHelper::tryCombine, not before, though. *I'm planning on adding a patch with that fix soon, but it will be dependent on https://reviews.llvm.org/D45732 as the latter makes the former simpler and shorter. |
lib/CodeGen/GlobalISel/CombinerHelper.cpp | ||
---|---|---|
43 ↗ | (On Diff #142091) | Good catch @rtereshin - I would think being consistent here is nice - ie return gracefully if it's not the opcode we want (unless there's a strong reason to change that). |
70 ↗ | (On Diff #142091) | I suspect he's put the calls to tryCombineExtendingLoads in AArch pass as other passes may not handle the extending load opcodes correctly in the legalizer. If/when they're ready, then it makes sense to move them into the generic helper. |
lib/CodeGen/GlobalISel/CombinerHelper.cpp | ||
---|---|---|
43 ↗ | (On Diff #142091) | I can switch it over to that. I just thought it was a shame to check it on both sides of the call. |
54 ↗ | (On Diff #142091) |
No, several parts of GlobalISel require it
Yes, it reports 'Generic instruction accessing memory must have one mem operand' if it's not the case |
63 ↗ | (On Diff #142091) | That makes sense to me. We won't want to go overboard with that kind of thing though, it's enough to check it in a couple combines |
70 ↗ | (On Diff #142091) | They'll combine correctly but the legalizer will immediately decompose them again on most targets at the moment so it's just wasted effort slowing the compile-times. This brings up something I've been wondering how to deal with over the last few days. If we continue down the path we're going, I currently don't see how we're going to manage large collections of combines and several targets. Suppose the Foo and Bar targets supports the following combines: tryCombine tryCombineGroup1 tryA tryB tryC tryCombineGroup2 tryD tryE tryF and each calls tryCombine. Now suppose Bar realizes that tryE is doing more harm than good. We could make tryE check that the target isn't Bar. The consequence of that is that we have a monolithic implementation covering all targets. It means a target specific change can introduce all-target bugs, all targets must carry everyones complexity, every change needs testing (including performance) for every target, etc. We could make Bar call tryCombineGroup1, tryD, and tryF instead. The consequence of that is that useful changes to tryCombine, and tryCombineGroup2 won't apply to Bar. Everyone will have to review combines and choose to enable them and as such won't benefit from improvements. They also won't suffer from losses either which is a good thing but I would hope that those are less common. We could split tryCombineGroup2 to get (this is just one example): tryCombine tryCombineGroup1 tryA tryB tryC tryCombineGroup2 tryD tryF tryCombineGroup3 tryE this of course assumes that ordering between E and F doesn't matter. Of course, if we do that enough then we eventually reach: tryA tryB tryC tryD tryF tryE bringing with it the same problems as the previous option. The answer I keep returning to at the moment is that even if the majority of our implementation is C++, we need something declarative to organize and control the combines for each target. We don't necessarily need to go to my original intent of re-using the ISel matcher for combines and thereby doing the majority of the matching in tablegenerated code but we do need a declarative way to control whether a combine is enabled/disabled (which is trivial) and the order they're applied (not so trivial). |
lib/Target/AArch64/AArch64PreLegalizerCombiner.cpp | ||
49 ↗ | (On Diff #142091) |
At the moment, that will look like a good idea but having a single monolithic tryCombine() isn't going to last long as more combines get implemented and more targets use combines. (see above) |
lib/CodeGen/GlobalISel/CombinerHelper.cpp | ||
---|---|---|
54 ↗ | (On Diff #142091) | Cool, thanks! |
63 ↗ | (On Diff #142091) | Agreed. |
70 ↗ | (On Diff #142091) | I like the idea with groups. I think it will make it easier for target-writers (and not only human target-writers ;-) ) to compose a reasonable pipeline for their needs relatively easily. Also, the grouping doesn't have to be 1-level, it could be more, if well-structured and useful. It may worth contemplating though what principle should be used to group combines together. I think it's going to be tempting to group them by root-level opcode: all G_ADD combines, all arithmetic combines (2-nd level group), etc, but this kind of grouping has a chance to prove less useful than architecture / micro-architecture targeted grouping. Like "combines for super-scalar architectures with a lot of instruction-level parallelism", and "architectures with highly efficient SIMD units" etc, something driven by what actual targets use and ignore. As for the declarative approach, honestly, putting as much as possible into Tablegen doesn't strike me like a wise approach. Tablegen'erated implementations are hard to read and search, interfaces between them and the rest of the compiler seem to be obscure and fragile, and it's very hard to make changes to any of it. If it's possible to derive the implementation from the information already provided by target writers, extending a Tablegen-backend at least goes into "let's auto-generate the entire compiler" direction, which is valuable. For instance, if we could derive the optimal pipeline of combines from scheduling and instruction info already put together by the targets. If it's needed, however, to come up with a new set of Tablegen-classes to explicitly define that pipeline manually in a *.td-file, and write an entirely new Tablegen-backend to process it, that doesn't look like a valuable thing to do. We could be as declarative as we want to be in C++ and that may be much easier to work with. Also, I think it may be valuable to make whatever design we eventually come up with easily compatible with a hypothetical (at the moment) tool that would be able to generate an optimal combine-pipeline for a target automatically, provided a performance feedback mechanism. Let it start with something reasonable, compile a corpus of programs, evaluate the code quality via that feedback mechanism, mutate the pipeline a bit, and try again. That sort of thing calls for a separate binary tool having a wide access to LLVM infrastructure, including all the combines, not for a Table-gen backend. I gave up on making Testgen a part of the Tablegen process very quickly, for instance. |
lib/Target/AArch64/AArch64PreLegalizerCombiner.cpp | ||
49 ↗ | (On Diff #142091) | Something still needs to be done about tryCombine. If it doesn't represent any practically useful group used by any target (not necessarily in final implementation, during experimentation too), let's remove it. if it does, let's maintain it so it does precisely what it promises to do. |
lib/CodeGen/GlobalISel/CombinerHelper.cpp | ||
---|---|---|
70 ↗ | (On Diff #142091) | All good points from you and Daniel. I don't know if tablegen is the best tool for the combiners, as I worry we'll paint ourselves into a corner once we go down that path. I think it would be worth exploring the design of combiner groups, and higher level schemes expressed in a declarative way as Daniel wants. The schemes could encapsulate the choices a target makes about their combiner configuration, with a default one that most targets would use at least at the beginning. Where customisability comes into it is with something like predicates/masks that enable/disable the activation of specific combines in the groups, so a scheme is a set of groups or existing scheme with overlaid modifications. For altering the orders of combines, an overlay could express the reversal of two given combines within a specific group. |
lib/CodeGen/GlobalISel/CombinerHelper.cpp | ||
---|---|---|
70 ↗ | (On Diff #142091) |
That's what I was thinking and is the reason for naming this tryCombineExtendingLoads(). We might need to break it down as more targets get added though as some only have sign-extending loads and others only have zero-extending loads
As you say, it doesn't necessarily have to be tablegen. The main thing is being able to throw the common rules and target-dependent rules together and get something that executes them in a sensible order and excludes the things the target doesn't want.
I agree. This line of thinking is a key reason I included a code coverage mechanism in the InstructionSelector. By extending the 1-bit counters to N-bit and feeding that into the rule sorting, we would end up with a match table that prioritized common rules as far as correctness allowed.
That's something we'll need to investigate while we look at how to maintain this long-term. Just for reference, in my original plan back when I started on the DAGISel importer, my main arguments in favour of tablegen for combiners were:
I didn't go through the implementation detail for Combiners at the time but I did leave a few doors open in case we decided to go down this route later. For example, this is the reason the implementation permits multiple match roots. |
Re-opening as I'm about to update it with a revised version. The previous one broke several bots because it sank loads down to the extends
Rewrite the patch such that the extends hoist up to the loads. The previous
patch had a couple problems that became clear on the bots:
- The loads could be duplicated. This is a correctness problem for volatile loads but more generally is likely to harm performance.
- The loads would sink to the extends without any hazard checking
Matching the load and folding in the uses is significantly more complex than
the previous code, but by anchoring the load where it is we both avoid both
correctness and performance issues.
The majority of the complexity comes from the need to resolve multiple
(potentially conflicting) uses. The approach this patch takes is fairly simple
in principle:
- Pick a preferred extend (see below)
- Fold that extend into the load
- Fix up the other uses with truncates/extends
2 and 3 are fairly straightforward but 1 relies on heuristics to make a good
choice. The current heuristics are:
- Prefer a sext/zext over an anyext. This is on the basis that anyext is essentially free (and we therefore don't save instructions by folding it), is compatible with sext/zext, and extending with defined bits opens further optimization opportunities once we have known-bits infrastructure.
- Prefer a sext over a zext. This is on the basis that a zext typically lowers to a single immediate 'and' instruction whereas a signext typically lowers to a shift-left/shift-right sequence (except in special cases). It's therefore cheaper to zext.
- Prefer larger types. This is on the basis that G_TRUNC is usually free (Mips is a notable exception and will probably want to tweak this when its port gets this far). There is a catch with this though since it can also increase the pressure on larger registers which some targets have a smaller supply of.
The full patch. The previous update was taken from the wrong machine and hadn't
been finished. Aside from not compiling, there was also a bug where additional
combine attempts would see G_SEXTLOAD mutate into G_LOAD (the anyext form).
test/CodeGen/AArch64/GlobalISel/arm64-fallback.ll | ||
---|---|---|
42 ↗ | (On Diff #148961) | This was removed during r332449? |
test/CodeGen/AArch64/GlobalISel/arm64-fallback.ll | ||
---|---|---|
42 ↗ | (On Diff #148961) | This is probably a result of the rebase. I'll remove it |
Hi Daniel, sorry for the delay.
I have some questions inline, but overall it seems that we don't keep track of how many of each use is a particular kind of extend. Could we have situations where overall we increase code size due to, for example, preferring a sign extend even if we have multiple zero-extending users? This might not be worth tackling in a first pass at this if it's a rare case though.
include/llvm/CodeGen/GlobalISel/CombinerHelper.h | ||
---|---|---|
34 ↗ | (On Diff #148961) | There doesn't seem to be a user of this. |
lib/CodeGen/GlobalISel/Combiner.cpp | ||
29 ↗ | (On Diff #148961) | What's the purpose of this? Currently it just wraps around the underlying worklist. Debugging messages only? |
33 ↗ | (On Diff #148961) | Should we have a specialization for GISelWorklist<512>, and then not repeat the 512 in multiple places? |
37 ↗ | (On Diff #148961) | Delete or DEBUG()? |
lib/CodeGen/GlobalISel/CombinerHelper.cpp | ||
55 ↗ | (On Diff #148961) | Can we promote this lambda into a helper function? From the name and lack of header doc it's unclear what it's supposed to do with the params on first reading. |
64 ↗ | (On Diff #148961) | What else could ExtendOpcodeB be at this point? If nothing else, should be an assert? |
Thanks Amara
I have some questions inline, but overall it seems that we don't keep track of how many of each use
is a particular kind of extend. Could we have situations where overall we increase code size due to,
for example, preferring a sign extend even if we have multiple zero-extending users? This might not
be worth tackling in a first pass at this if it's a rare case though.
That's a good point. Yes, that can happen at the moment.
I can see two cases (and mixtures of the two). The first is multiple zexts to the same type. In this case, we would generate one trunc/zext pair for each use. CSE ought to fix it later (once we enable that) but it would be better to CSE it up front and avoid paying the cost of allocating+processing the redundant instructions. A simple map of opcode and LLT to vreg while emitting the trunc/zext's ought to do the trick there. The other is the case where there's zexts to multiple types. In this case, picking the sext and trunc/zext is still a win as it eliminates 1-2 instructions whereas whichever zext we pick can eliminate at most 1 since all but one zext still needs to emit an instruction. There's target-specific special cases though. For example, Mips sext from 32-bit to >32-bit is free (because it was actually done by the trunc or the gpr32 instruction that def'd it) so picking that particular sext is worse than picking any zext.
include/llvm/CodeGen/GlobalISel/CombinerHelper.h | ||
---|---|---|
34 ↗ | (On Diff #148961) | That's right. It's not needed for this particular combine but there should be some that need it in future. I could drop it from this patch |
lib/CodeGen/GlobalISel/Combiner.cpp | ||
29 ↗ | (On Diff #148961) | The intent behind CombinerChangeObserver is to inform the combiner pass that certain events happened in CombinerHelper. The main one's I'm expecting to end up with are instruction creation and deletion. I can also see instruction modification events here in future. I don't think scheduling for revisit should be in CombinerChangeObserver though. I think that should be determined by the combiner pass in response to these events. It should also be derived from the implemented combines in some way to avoid the issue in DAGCombine where combines are sometimes missed because one rule failed to schedule the right node (e.g. because the root wasn't directly connected to the modified nodes in the case combines that cover several nodes). The reason for having CombinerChangeObserver subclasses rather than just implementing it directly is so that CombinerHelper is usable by lots of different kinds of Combiner passes. We might have different implementations for O0 - O3, or a specific strategy might work better for one particular target, or maybe the pass isn't a combiner at all and just wants to borrow a couple combines. Each implementation would provide a CombinerChangeObserver subclass that glues CombinerHelper into its implementation. This particular implementation ought to grow a mechanism to limit the run-time of the pass. Exactly how that should work will need some more thinking about but as an example we could track how deeply recursed the combiner is (i.e. how many combine rules triggered to create it) and decline to schedule instructions beyond a certain point. For O1 combines might be limited to the first generation of new/modified instructions, for O2 maybe the first 3 generations, and O3 might be unlimited. |
33 ↗ | (On Diff #148961) | Yes, let's add an alias for it as Combiner::WorkListTy |
37 ↗ | (On Diff #148961) | Oops. That was some temporary debugging code. It's useful for tracing the combines though so let's make it DEBUG()/dbgs() |
lib/CodeGen/GlobalISel/CombinerHelper.cpp | ||
55 ↗ | (On Diff #148961) | Sure. |
64 ↗ | (On Diff #148961) | Anything. There's no guarantee that the use of a load is an extend. We ignore that when selecting a preferred use and deal with the non-extends by inserting a trunc (which is free for most targets) later |
include/llvm/CodeGen/GlobalISel/CombinerHelper.h | ||
---|---|---|
34 ↗ | (On Diff #148961) | Ok, as long as we're aware let's keep it in. |
lib/CodeGen/GlobalISel/Combiner.cpp | ||
29 ↗ | (On Diff #148961) | Ok thanks. I think it would be beneficial to have a comment at the definition (can be your reply here summarised). |
lib/CodeGen/GlobalISel/CombinerHelper.cpp | ||
64 ↗ | (On Diff #148961) | Then perhaps change the name to something more accurate, like "UseOpcode"? |
Rebased
Fixed the various nits
Fixed a problem where the pass manager couldn't schedule 'Function Alias
Analysis Results' which was apparently required by 'AArch64 Instruction
Selection' according to the pass manager... except it wasn't. The actual problem
was that we weren't preserving the StackProtector pass.
I've also added preservation of the CFG analyses while I was fixing the pass
manager issue above.
I believe the only remaining issue on this was the compile-time impact. I've run CTMark* before and after and found that no tests were significantly different. The geomean was improved by 1.24%. I'm going to be working on the combiner infrastructure for the next few months so if there are any further issues that come up we can address them post-commit.
- Except bullet, kc, and tramp3d-v4 as these don't seem to build when I target AArch64. They all fail due to missing headers for me.
include/llvm/CodeGen/GlobalISel/Combiner.h | ||
---|---|---|
35 ↗ | (On Diff #160708) | This should not be necessary any more. You can attach delegate to the machine function to observe insertions and deletions. The combiner can install a delegate at the beginning of the pass. Do you see any other reason to have the Observer mechanism? Is there a need to rely on the maintainer scheduling order of visit or is it just to make sure of insertions and deletions? |