Adding to the work done in D131773, here we add support for 128-bit non-temporal loads.
My understanding is that the !nontemporal metadata is a hint to the backend that the loaded data will not be reused, so caching it is unlikely to be useful for performance. It isn't a mandate. Similarly, the ldnp pairwise loads are a hint to the microarchitecture that the loaded data is likely not going to be reused and so does not need to be cached. Again, the microarchitecture is free to ignore the hint.
With that, is it beneficial to force the use of 128-bit non-temporal loads if it leads to more instructions overall? For i256 it was almost certainly a good thing to do, but for i128 it sounds like it might end up slowing things down more than it helps. I'm not sure where the balance point lies.
I agree that this seems more uarch-specific. If the selected uarch just ignores the hints, we shouldn't try too hard to generate LDNP. But on some uarchs, avoiding cache pollution can outweigh the drawbacks of having to issue more load instructions (and the overhead of a few extra movs should be negligible on most beefier uarchs). I think codegen could also be improved for the cases where we have input types that need further legalization.
In most cases the hint comes directly from the user, who hopefully knows how to use it.
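For illustration, this is roughly how a user requests it from C/C++ with Clang's non-temporal builtins; the function and type names below are just an example, but the builtins are the standard Clang ones and attach !nontemporal metadata to the generated 128-bit loads and stores:

// Example only: a streaming copy where the source is read once and never
// reused, so caching it is pointless. Whether the tagged loads/stores become
// LDNP/STNP is up to the backend support discussed in this patch.
typedef double f64x2 __attribute__((vector_size(16))); // 128-bit vector type

void stream_copy(f64x2 *dst, f64x2 *src, long n) {
  for (long i = 0; i < n; ++i) {
    f64x2 v = __builtin_nontemporal_load(&src[i]);
    __builtin_nontemporal_store(v, &dst[i]);
  }
}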
llvm/test/CodeGen/AArch64/nontemporal-load.ll
Lines 327–328: I guess we would also use the Q-register LDNP variant here. I assume the reason we don't is that <17 x float> will get broken down into <4 x float> pieces during legalization. @dmgreen do you by any chance have any ideas on where best to improve this?
I don't think they ever get ignored by the hardware, but it is only a hint. I would be surprised if the extra instructions are better than a normal load, but it will depend on how much pressure there is on the cache at the time. I don't think there is a lot in it either way though, and if non-temporal loads are being used the CPU is more likely to have high memory usage with lower computation, meaning the extra instructions are less of an issue.
llvm/test/CodeGen/AArch64/nontemporal-load.ll
Lines 327–328: I'm not sure, I'm afraid. I believe that loads get split into legal parts (not in half like other operations). If we don't expect <17 x float> non-temporal loads very often, it may not be too important to fix. The loop vectorizer will always pick powers of 2, after all.
Running this locally causes many test failures for me. @zainja could you double-check that all AArch64 codegen tests pass with the patch applied to current main? Also, could you rebase the patch onto current main so the precommit tests run against the latest main?
The issue is related to:
setOperationAction(ISD::LOAD, MVT::vxixx, Custom);
Looking at the failing test arm64-neon-vector-shuffle-extract.ll, the stack trace indicates that the last function called was LowerLOAD. Before the patch, that function wasn't called for this case, which means the setOperationAction calls are the issue.
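For context, a sketch of what that kind of registration looks like if the 128-bit NEON vector types are listed out (the concrete type list here is an assumption, not copied from the patch):

// Illustrative only: marking these types Custom routes *every* load of them
// through AArch64TargetLowering::LowerLOAD, not just the non-temporal ones.
for (MVT VT : {MVT::v16i8, MVT::v8i16, MVT::v4i32, MVT::v2i64, MVT::v4f32,
               MVT::v2f64})
  setOperationAction(ISD::LOAD, VT, Custom);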
Address some of the failing tests. The problem we have here is that all 128-bit vector loads will go to the LowerLOAD function.
Most of the failing test cases were triggered by the following assert:
assert((VT == MVT::v4i16 || VT == MVT::v4i32) && "Expected v4i16 or v4i32");
To address this I added a block before it to return an empty SDValue node. Some tests still fail after this fix.
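The added block is presumably of roughly this shape (a sketch of what is described above, not the literal diff):

// Hypothetical early-out before the assert: loads we don't handle here are
// returned as an empty SDValue, leaving the original (already legal) load
// untouched.
if (VT != MVT::v4i16 && VT != MVT::v4i32)
  return SDValue();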
Fix patch for failing tests. The problem originated from 128-bit loads being handled incorrectly when they weren't non-temporal loads. I added two checks to address the issue:
First, if the load is of a floating-point type we preserve the behaviour provided by
setOperationPromotedToType(ISD::LOAD, VT, PromoteTo);
Otherwise we return an empty SDValue().
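A rough sketch of the shape of those two checks at the top of LowerLOAD (LowerAsPromotedLoad is a hypothetical helper standing in for the old promoted lowering; the exact conditions in the patch may differ):

LoadSDNode *LoadNode = cast<LoadSDNode>(Op);
if (!LoadNode->isNonTemporal()) {
  // 128-bit floating-point vectors keep the promoted (integer) lowering that
  // setOperationPromotedToType used to provide.
  if (Op.getValueType().isFloatingPoint())
    return LowerAsPromotedLoad(Op, DAG); // hypothetical helper for the old path
  // Everything else is already legal; an empty SDValue leaves the load as-is.
  return SDValue();
}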
This should be fine now; could you rebase this patch on top of D133421? Codegen of test_ldnp_v33f64 should be improved by this, right?
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
Line 802: Would be good to use the more descriptive comment from above, just for 128 bits.
Line 5445: nit: would be good to have a newline before this line.
Lines 20893–20896: unrelated change?
Thanks for getting rid of the excessive codegen regressions from the earlier version! IIUC we should now get at most one extra mov instruction to combine the two loaded values into a single vector. IMO this is a reasonable trade-off between the user's request and codegen quality.
@dmgreen WDYT? If there still are concerns about the extra mov instruction, we could make this opt-in with a target feature.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
Lines 5466–5467: nit: this needs formatting.
Hello. I've had reports that this is changing the lowering of unaligned vector loads when strict-align is present. https://godbolt.org/z/bT4cnvn4a.
Custom lowering is often more pain than it is worth. This doesn't seem like something that was intended, and it can lead to access violations if unaligned accesses are not enabled.
That is very interesting; I will look into what is triggering this issue.
OK, thanks. Reverting the patch in the meantime is probably the best idea; I should probably have done that yesterday. Things are pretty broken at the moment and the fix might take a little time.
I will revert it now.
The cause of the issue is that we specify a Custom lowering action for v2i64:
setOperationAction(ISD::LOAD, MVT::v2i64, Custom);
This prevents LegalizeLoadOps from creating the necessary expansion for unaligned loads.
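For what it's worth, a guard along these lines in the custom lowering would be one possible way to keep strict-align working (a sketch using the generic TargetLowering helpers, with LoadNode being the LoadSDNode under lowering; this is not the fix that was applied here — the patch was reverted instead):

// Sketch: if the target does not allow this access as-is (e.g. under
// strict-align with an under-aligned pointer), expand the load here instead
// of emitting LDNP.
if (!allowsMemoryAccess(*DAG.getContext(), DAG.getDataLayout(),
                        LoadNode->getMemoryVT(), *LoadNode->getMemOperand())) {
  auto [Result, Chain] = expandUnalignedLoad(LoadNode, DAG);
  return DAG.getMergeValues({Result, Chain}, SDLoc(Op));
}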
Yeah thanks - I was just running the tests to check the revert. It took a while, but looks like they passed in the end.