Page MenuHomePhabricator

[MachineInstr] Add support for instructions with multiple memory operands.
ClosedPublic

Authored by hliao on Oct 14 2020, 10:10 PM.

Details

Summary
  • Basically iterate each pair of memory operands from both instructions and return true if any of them may alias.
  • The exception are memory instructions without any memory operand. They may touch everything and could alias to any memory instruction.

Diff Detail

Event Timeline

hliao created this revision.Oct 14 2020, 10:10 PM
hliao requested review of this revision.Oct 14 2020, 10:10 PM

That change triggers a few regression test failures. All of them are due to different code schedule due to memory instructions with multiple memory operands. Please help me double-check that's the case and whether that sounds a better code sequence. Thanks.

This looks familiar. It is the same as https://reviews.llvm.org/D80161? That patch was apparently reverted in 65cd2c7a8015577fea15c861f41d2e4b5768961f, because it was hitting timeouts on certain targets.

I'm not sure if @Kayjukh remembers more? But it may be worth putting a limit on the total number of alias checks it can perform.

The test changes seem fine to me.

This looks like the same as my old patch (https://reviews.llvm.org/D80161). There are additional details available on the commit's Phab page. To put it in a nutshell, some codes can trigger a very large amount of calls to the aliasing check (see the repro provided by @nemanjai: https://pastebin.com/tRtTQdSa), which results in a very large increase in compilation time.

Bounding the number of checks may be a good solution, even though it would be nicer to have something more clever that could allow all the operands to be checked. Not sure how feasible this would be though.

hliao added a comment.Oct 15 2020, 5:53 AM

This looks familiar. It is the same as https://reviews.llvm.org/D80161? That patch was apparently reverted in 65cd2c7a8015577fea15c861f41d2e4b5768961f, because it was hitting timeouts on certain targets.

I'm not sure if @Kayjukh remembers more? But it may be worth putting a limit on the total number of alias checks it can perform.

The test changes seem fine to me.

Sorry, don't search for the relevant change. Yes, it's exactly the same purpose with minor differences on check conditions.

hliao added a comment.Oct 15 2020, 5:54 AM

This looks like the same as my old patch (https://reviews.llvm.org/D80161). There are additional details available on the commit's Phab page. To put it in a nutshell, some codes can trigger a very large amount of calls to the aliasing check (see the repro provided by @nemanjai: https://pastebin.com/tRtTQdSa), which results in a very large increase in compilation time.

Bounding the number of checks may be a good solution, even though it would be nicer to have something more clever that could allow all the operands to be checked. Not sure how feasible this would be though.

Thanks. I will check that.

hliao added a comment.Oct 15 2020, 9:45 AM

This looks like the same as my old patch (https://reviews.llvm.org/D80161). There are additional details available on the commit's Phab page. To put it in a nutshell, some codes can trigger a very large amount of calls to the aliasing check (see the repro provided by @nemanjai: https://pastebin.com/tRtTQdSa), which results in a very large increase in compilation time.

Bounding the number of checks may be a good solution, even though it would be nicer to have something more clever that could allow all the operands to be checked. Not sure how feasible this would be though.

Yeah, I could reproduce that compile-time issue. That issue happens in the post-RA scheduler. I didn't find that as the generic x86 backend disables the post-RA scheduler.
Back to the compilation issue itself, as a short-term solution, putting a limit seems reasonable to me as most native machine instructions probably only take 2~3 memory operands. We may also add a target hook to get that more reasonable for the underlying target. If any instruction has more memory operands that that limit, we just take them conservatively. In long run, adding alias result caching in basic-aa would reduce time but increase space required.

A relevant question is that do we need to run full AA in the post-RA scheduler if the pre-RA scheduler is already enabled. Except for the newly generated memory operations (spill, reload, rematerialized loads, and etc.), we only got registers allocated and assigned in RA. The memory dependency after the pre-RA scheduler still holds. Running AA check on these memory instructions only reproduce the same result. For register spills/reloads, they won't alias to these existing memory instructions but only alias to their counterparts. They also don't have too much space to be scheduled around due to the register dependency. By pruning these unnecessary memory dependency setup, we could save lots of compilation time contributed from the post-RA scheduler.

uweigand added a subscriber: jonpa.Oct 15 2020, 9:58 AM

@jonpa can you check whether the SystemZ test case you added still checks what it was intended to check here?

hliao updated this revision to Diff 298475.Oct 15 2020, 2:47 PM

Add limit on memory operand AA check.

hliao added a comment.Oct 15 2020, 2:48 PM

Add limit on memory operand AA check.

With that change, there's almost no measurable difference in compilation time on that reproducer.

lkail added a subscriber: lkail.Oct 15 2020, 3:06 PM
jonpa added a comment.Oct 16 2020, 4:24 AM

@jonpa can you check whether the SystemZ test case you added still checks what it was intended to check here?

I actually see a problem with this patch: The SystemZ test case passes the original test by not producing any scalar load + vector element insertions instructions. However now I see that all PFD (prefetch) instructions end up last in the block, whereas before they were not. This may or not be good if the block is huge and the prefetching is therefore not done too many iterations ahead. Maybe this should be checked with benchmarks to make sure the current prefetching tuning does not loose by this.

The reason seems to be that the SystemZ::PFD instruction is marked both with mayLoad and mayStore, but now this patch looks at the *memory operand* and figures out that it is only loading and therefore does not alias with another load. This check was previously only done with the MI flags. The post-RA scheduler now puts all of them at the end.

hliao updated this revision to Diff 298792.Oct 16 2020, 6:27 PM

Remove the MMO non-store check.

hliao added a comment.Oct 16 2020, 6:30 PM

@jonpa can you check whether the SystemZ test case you added still checks what it was intended to check here?

I actually see a problem with this patch: The SystemZ test case passes the original test by not producing any scalar load + vector element insertions instructions. However now I see that all PFD (prefetch) instructions end up last in the block, whereas before they were not. This may or not be good if the block is huge and the prefetching is therefore not done too many iterations ahead. Maybe this should be checked with benchmarks to make sure the current prefetching tuning does not loose by this.

The reason seems to be that the SystemZ::PFD instruction is marked both with mayLoad and mayStore, but now this patch looks at the *memory operand* and figures out that it is only loading and therefore does not alias with another load. This check was previously only done with the MI flags. The post-RA scheduler now puts all of them at the end.

Even if I removed the following check, the result is still the same

The real reason is that prefetch is iterations ahead and won't alias to any memory access in that tloop body. Previously, we

Remove the MMO non-store check.

@jonpa can you check whether the SystemZ test case you added still checks what it was intended to check here?

I actually see a problem with this patch: The SystemZ test case passes the original test by not producing any scalar load + vector element insertions instructions. However now I see that all PFD (prefetch) instructions end up last in the block, whereas before they were not. This may or not be good if the block is huge and the prefetching is therefore not done too many iterations ahead. Maybe this should be checked with benchmarks to make sure the current prefetching tuning does not loose by this.

The reason seems to be that the SystemZ::PFD instruction is marked both with mayLoad and mayStore, but now this patch looks at the *memory operand* and figures out that it is only loading and therefore does not alias with another load. This check was previously only done with the MI flags. The post-RA scheduler now puts all of them at the end.

I removed that check that one of those two MMOs needs to be store. No change to test now. But, that sounds a little weird that one instruction claims mayLoad and mayStore but only has isLoad MMO. Maybe we need to change that MMO in prefetch to be both isLoad and isStore.

jonpa added a comment.Oct 17 2020, 4:49 AM

I removed that check that one of those two MMOs needs to be store. No change to test now. But, that sounds a little weird that one instruction claims mayLoad and mayStore but only has isLoad MMO. Maybe we need to change that MMO in prefetch to be both isLoad and isStore.

The SystemZ Prefetch Data (PFD) instruction has a flag bit to control if the prefetch is read or write, so in a way those two flags make sense. However, since the PFD does not in fact clobber any memory I think the idea behind the mayLoad and mayStore flags is to keep the instruction in place in the block as much as possible. It shouldn't necessarily matter, but generally it is probably better spread out the memory accesses / prefetches rather than having them all happen at once, I would think. If the block is big with many prefetches they shouldn't all end up in the end of the block...

If you think that check is valuable, then maybe we could try to find another way to keep the PFD instructions in their places. I tried adding hasSideEffects = 1 on PFD/PFDRL instructions instead of removing the check on the MMOs, but that did not seem to be NFC on benchmarks...

hliao added a comment.Oct 17 2020, 9:02 AM

I removed that check that one of those two MMOs needs to be store. No change to test now. But, that sounds a little weird that one instruction claims mayLoad and mayStore but only has isLoad MMO. Maybe we need to change that MMO in prefetch to be both isLoad and isStore.

The SystemZ Prefetch Data (PFD) instruction has a flag bit to control if the prefetch is read or write, so in a way those two flags make sense. However, since the PFD does not in fact clobber any memory I think the idea behind the mayLoad and mayStore flags is to keep the instruction in place in the block as much as possible. It shouldn't necessarily matter, but generally it is probably better spread out the memory accesses / prefetches rather than having them all happen at once, I would think. If the block is big with many prefetches they shouldn't all end up in the end of the block...

If you think that check is valuable, then maybe we could try to find another way to keep the PFD instructions in their places. I tried adding hasSideEffects = 1 on PFD/PFDRL instructions instead of removing the check on the MMOs, but that did not seem to be NFC on benchmarks...

Not only target-specific PREFETCH has that, the target-independent PREFETCH has both mayLoad and mayStore. I thought that maybe added in the early day to hour the program order. Even if a prefetch may prefetch for write or precisely for the ownership, it really didn't change that memory at all. The order of PREFETCH needs to be honored by the scheduler is due to it don't understand the target timing yet.

hliao added a comment.Oct 20 2020, 7:07 AM

PING for review.

hliao added a comment.Oct 22 2020, 8:08 AM

Just kingly PING for review.

PING for review. As a similar patch was approved, shall I just commit it again with the compilation time issue is addressed?

Has the SystremZ Prefetch issues been resolved?

llvm/include/llvm/CodeGen/TargetInstrInfo.h
1743

So the limit on the number of alias checks is effectively 16?

Would it be worth making it so limit is on the total number of checks, not the MemOperands per instruction? That way an instruction with a single operand could be compared to an instruction with many (which I imagine would be a common case).

llvm/lib/CodeGen/MachineInstr.cpp
1341

Why is this new ordered check needed? I didn't think that atomics had multiple memory operands.

hliao added a comment.Oct 24 2020, 8:51 AM

Has the SystremZ Prefetch issues been resolved?

Yes, no sytemz tests are affected by this change now.

hliao updated this revision to Diff 300663.Oct 26 2020, 7:09 AM

Remove unordered check.

hliao marked an inline comment as done.Oct 26 2020, 7:14 AM
hliao added inline comments.
llvm/include/llvm/CodeGen/TargetInstrInfo.h
1743

Just IMHO, the limit here is more straightforward to figure out. The backend needs to consider the corner case where two instructions with more than one mem operands are checked.

llvm/lib/CodeGen/MachineInstr.cpp
1341

I added that just as future proof just like the load/store check. Just remove them to keep this change as minimal as possible.

Kayjukh added inline comments.Oct 26 2020, 11:20 AM
llvm/include/llvm/CodeGen/TargetInstrInfo.h
1743

I think it would make sense to still limit the number of checks instead of the number of memory operands. If, say, you put a limit of 16 checks---which would be equivalent to the 4 memory operands you set there----then you can compare both instructions with 4 memory operands each but also an instruction with a single memory operand with another one that happens to have 8.

The bound on the cost would be the same but it would make the check more flexible.

hliao marked an inline comment as done.Oct 26 2020, 3:01 PM
hliao added inline comments.
llvm/include/llvm/CodeGen/TargetInstrInfo.h
1743

if a target processor has a native instruction with up to 8 mem operands, won't the developer set this limit to 8 instead of 4 to support all combinations of possible native instructions? My point is that should such a limit be mainly decided by the target processor or possible memory operands due to post-RA optimizations, which one is more reasonable?

PING for review

Kindly PING again

dmgreen added inline comments.Mon, Nov 2, 12:24 AM
llvm/include/llvm/CodeGen/TargetInstrInfo.h
1743

I would make it a total limit. Consider the arm backend where we can have 16 MMO's in a single ldm instruction (sort of). You do not want to put the limit here to 16 per instruction, as that would put the total limit to 256! Far too high! Even a limit of 8 per instruction would sound very high.

But if the limit was a total of 16 then at least the ldm with 16 operand could be compared against an instruction with 1. And a ldm with 8 operands could be compared against a ldrd with 2.

hliao updated this revision to Diff 302576.Tue, Nov 3, 7:07 AM

Change how that limit is set.

hliao marked an inline comment as done.Tue, Nov 3, 7:08 AM
hliao added inline comments.
llvm/include/llvm/CodeGen/TargetInstrInfo.h
1743

I see your point. That's reasonable.

dmgreen accepted this revision.Tue, Nov 3, 9:34 AM

Thanks. (We get some codesize increases from this in my quick AArch64 testing. The idea should in general be an improvement though and they are not very large).

Please keep on eye on compile time, but other than that LGTM.

This revision is now accepted and ready to land.Tue, Nov 3, 9:34 AM
This revision was landed with ongoing or failed builds.Tue, Nov 3, 5:49 PM
This revision was automatically updated to reflect the committed changes.
hliao marked an inline comment as done.