This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Add support for 128-bit non temporal loads.
ClosedPublic

Authored by zjaffal on Aug 24 2022, 7:17 AM.

Details

Summary

Adding to the work done in D131773 here we add support to 128-bit loads.

Diff Detail

Event Timeline

zjaffal created this revision.Aug 24 2022, 7:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 24 2022, 7:17 AM
zjaffal requested review of this revision.Aug 24 2022, 7:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 24 2022, 7:17 AM

My understanding is that the !nontemporal metadata is a hint to the backend that the load data will not be reused so caching it is unlikely to be useful for performance. It isn't a mandate. Similarly the ldnp pairwise loads are a hint to the microarchitecture that the loaded data is likely not going to be reused so doesn't need caching. The micro-architecture is free to ignore the hint again.
With that, is it beneficial to force uses of 128bit non-temperoral loads if it leads to more instructions overall? For i256 it was almost certainly a good thing to do, but for i128 it sounds like it might end up slowing things down more than it helps. I'm not sure where the balance point lies.

My understanding is that the !nontemporal metadata is a hint to the backend that the load data will not be reused so caching it is unlikely to be useful for performance. It isn't a mandate. Similarly the ldnp pairwise loads are a hint to the microarchitecture that the loaded data is likely not going to be reused so doesn't need caching. The micro-architecture is free to ignore the hint again.
With that, is it beneficial to force uses of 128bit non-temperoral loads if it leads to more instructions overall? For i256 it was almost certainly a good thing to do, but for i128 it sounds like it might end up slowing things down more than it helps. I'm not sure where the balance point lies.

I agree that this seems more uarch specific. If the selected uarch just ignores the hints, we shouldn't try too hard to generate LDNP. But on some uarch avoiding cache pollution can outweigh the drawbacks of having to issue more load instructions (and the overhead of a few extra movs should be negligible on most beefier uarchs). I think codegen could also be improved for the cases where we have input types that need further legalization.

In most cases the hint comes directly from the user who hopefully know how to use it.

llvm/test/CodeGen/AArch64/nontemporal-load.ll
327–328

I guess we would also use the LDNQ variant here. I assume the reason we don't is because <17 x float> will get broken down to <4 x float> pieces during legalization.

@dmgreen do you by any chance have any ideas on where to best improve this?

I don't think they ever get ignored by the hardware, but it is only a hints. I would be surprised in the extra instructions are better than a normal load, but it will depend on how much pressure this is on the cache at the time. I don't think there is a lot in it either way though, and if non-temporal loads are being used the cpu is more likely to have high memory usage with lower computation, meaning the extra instructions are less of an issue.

llvm/test/CodeGen/AArch64/nontemporal-load.ll
327–328

I'm not sure I'm afraid. I believe that loads get split to legal parts (not in half like other operations).

If we wouldn't expect 17x non temporal loads very often, it may not be too important to fix. The loop vectorizer will always pick powers of 2 after all.

fhahn requested changes to this revision.Aug 30 2022, 3:27 AM
fhahn added a subscriber: zainja.

Running this locally causes many test failures for me. @zainja could you double check all AArch64 codegen tests pass with the patch applied to current main? Also, could you rebase the patch to current main to make the precommit tests run latest main?

This revision now requires changes to proceed.Aug 30 2022, 3:27 AM

Running this locally causes many test failures for me. @zainja could you double check all AArch64 codegen tests pass with the patch applied to current main? Also, could you rebase the patch to current main to make the precommit tests run latest main?

The issue is related to

setOperationAction(ISD::LOAD, MVT::vxixx, Custom);

Looking at the following test arm64-neon-vector-shuffle-extract.ll the stack trace failure indicates that the last function called was LowerLoad. Before the patch, the function wasn't called for that specific case which means that the setOperationAction calls are the issue.

zjaffal updated this revision to Diff 457201.EditedSep 1 2022, 1:42 AM

Address some of the failing tests. The problem we have here is that all 128-bit vector loads will go to LowerLOAD function.
Most of the failing test cases where triggered by the following assert

assert((VT == MVT::v4i16 || VT == MVT::v4i32) && "Expected v4i16 or v4i32");

To address this I added a block before it to return an empty SDValue node. Some tests
still fail after this fix.

zjaffal updated this revision to Diff 457553.Sep 2 2022, 3:25 AM

Fix patch for failing tests. The problem originated from 128-bit loads being handled incorrectly if they weren't nontemporal loads. I added two checks to address the issue
First, if the load is of floating point type we preserve the behaviour provided from

setOperationPromotedToType(ISD::LOAD, VT, PromoteTo);

Otherwise we return an empty SDNode() object.

zjaffal updated this revision to Diff 457576.Sep 2 2022, 6:07 AM

Remove support for floating point non temporal loads

Matt added a subscriber: Matt.Sep 2 2022, 8:47 PM
zjaffal updated this revision to Diff 460077.Sep 14 2022, 7:16 AM

check for little endian target

fhahn added a comment.Sep 26 2022, 5:51 AM

This should be fine now, could you rebase this patch on top of D133421? Codegen of test_ldnp_v33f64 should be improved by this, right?

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
802

Would be good to use the more descriptive comment from above, just for 128 bits.

5445

nit: would be good to have a newline before this line..

20893–20896

unrelated change?

zjaffal updated this revision to Diff 462909.Sep 26 2022, 8:05 AM

rebase on top of D133421 and address stylistic comments

fhahn added a comment.Sep 29 2022, 1:50 AM

Thanks for getting rid of the excessive codegen regressions from the earlier version! IIUC now we should get at most an extra mov instruction to combine the 2 loaded values into a single vector. IMO this is a reasonable trade-off between user request & codegen.

@dmgreen WDYT? If there still are concerns about the extra mov instruction, we could make this opt-in with a target feature.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
5466–5467

nit: this needs formatting.

dmgreen added a comment.EditedSep 29 2022, 5:18 AM

Yeah sure. This sounds fine - no objections from me.

fhahn accepted this revision.Sep 29 2022, 5:59 AM

Yeah sure. This sounds fine - no objections from me.

Cheers, LGTM then!

This revision is now accepted and ready to land.Sep 29 2022, 5:59 AM
This revision was landed with ongoing or failed builds.Sep 30 2022, 3:05 AM
This revision was automatically updated to reflect the committed changes.

Hello. I've had reports that this is changing the lowering of unaligned vector loads when strict-align is present. https://godbolt.org/z/bT4cnvn4a.

Custom lowering is often more pain than it is worth. It doesn't seem like something that is intended, and can lead to access violations if unaligned accesses are not enabled.

Hello. I've had reports that this is changing the lowering of unaligned vector loads when strict-align is present. https://godbolt.org/z/bT4cnvn4a.

Custom lowering is often more pain than it is worth. It doesn't seem like something that is intended, and can lead to access violations if unaligned accesses are not enabled.

That is very interesting, I will look into what is triggering this issue.

That is very interesting, I will look into what is triggering this issue.

OK thanks. Reverting the patch in the meantime is probably the best idea, I should probably have done that yesterday. Things are pretty broken at the moment and the fix might take a little time.

That is very interesting, I will look into what is triggering this issue.

OK thanks. Reverting the patch in the meantime is probably the best idea, I should probably have done that yesterday. Things are pretty broken at the moment and the fix might take a little time.

I will revert it now.

The cause of the issue is that we specify custom lowering operation for v2i64

setOperationAction(ISD::LOAD, MVT::v2i64, Custom);

Which prevents LegalizeLoadOps from creating the necessary code for unaligned loads

Yeah thanks - I was just running the tests to check the revert. It took a while, but looks like they passed in the end.