This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] Combine 64-bit bswap(load) without LDBRX
ClosedPublic

Authored by nemanjai on Jun 23 2021, 8:43 PM.

Details

Summary

When targeting CPUs that don't have LDBRX, we end up producing code that is very inefficient and large for this common idiom. This patch optimizes it to two 32-bit LWBRX instructions along with a merge of the two halves.

This fixes https://bugs.llvm.org/show_bug.cgi?id=49610
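
As an illustration of why this works (a sketch only, not code from the patch): a 64-bit byte swap is equivalent to byte-reversing each 32-bit half and swapping the halves' positions, which is what two word-sized byte-reversed loads of adjacent words plus a merge compute. The function name below is hypothetical.

declare i32 @llvm.bswap.i32(i32)

; Hypothetical illustration: computes the same value as @llvm.bswap.i64(%v)
; by byte-reversing each 32-bit half and swapping the halves when merging.
define i64 @bswap64_via_word_swaps(i64 %v) {
  %hi.shift = lshr i64 %v, 32
  %hi = trunc i64 %hi.shift to i32
  %lo = trunc i64 %v to i32
  %hi.rev = call i32 @llvm.bswap.i32(i32 %hi)
  %lo.rev = call i32 @llvm.bswap.i32(i32 %lo)
  %hi.rev.64 = zext i32 %hi.rev to i64
  %lo.rev.64 = zext i32 %lo.rev to i64
  %top = shl i64 %lo.rev.64, 32
  %merged = or i64 %top, %hi.rev.64
  ret i64 %merged
}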

Diff Detail

Event Timeline

nemanjai created this revision.Jun 23 2021, 8:43 PM
nemanjai requested review of this revision.Jun 23 2021, 8:43 PM
Herald added a project: Restricted Project.Jun 23 2021, 8:43 PM
spatel added inline comments.Jun 24 2021, 5:31 AM
llvm/lib/Target/PowerPC/PPCISelLowering.cpp
15245

If we honor the clang-tidy warning (no else after return), it becomes more obvious that we are repeating conditions in both clauses. It would be cleaner to hoist those into a common 'if' clause, and probably better still to make it an early exit for the inverted condition (possibly with a helper function too):

if (!Is64BitBswapOn64BitTarget || !IsSingleUseNonExtLd)
  return SDValue();
15246–15247

Do we need to check for 'volatile' or other modifiers?
Are we ok with misaligned accesses?
Should add regression tests either way.
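
For illustration, roughly what a volatile-load test could look like (hypothetical function name; the tests actually added later in this review may differ):

declare i64 @llvm.bswap.i64(i64)

; Hypothetical test: a volatile load that the combine must not split.
define i64 @no_split_volatile_load(i64* %p) {
  %x = load volatile i64, i64* %p, align 8
  %b = call i64 @llvm.bswap.i64(i64 %x)
  ret i64 %b
}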

llvm/test/CodeGen/PowerPC/ld-bswap64-no-ldbrx.ll
3

It would be better to include tests for the minimal patterns - just a load or just a store. We are not handling the store case yet, so that might be a "TODO" item (or doesn't matter if we don't need to worry about the pre-stdbrx targets):

define void @split_store(i64 %x, i64* %p) {
  %b = call i64 @llvm.bswap.i64(i64 %x)
  store i64 %b, i64* %p, align 8
  ret void
}

define i64 @split_load(i64* %p) {
  %x = load i64, i64* %p, align 8
  %b = call i64 @llvm.bswap.i64(i64 %x)
  ret i64 %b
}
nemanjai added inline comments.Jun 24 2021, 6:07 AM
llvm/lib/Target/PowerPC/PPCISelLowering.cpp
15245

Oh, I probably should have noticed that the if clause has a return. I'll update it. Thanks!

15246–15247

There are no alignment requirements for the byte-reversed loads/stores. I'll add a test.

llvm/test/CodeGen/PowerPC/ld-bswap64-no-ldbrx.ll
3

Yes, the store combine is definitely a TODO :)
I'll update the tests, thank you.

nemanjai updated this revision to Diff 354237.Jun 24 2021, 6:34 AM

Fix up clang-tidy warnings and add some more regression tests.

LangRef says "the backend should never split or merge target-legal volatile load/store instructions":
https://llvm.org/docs/LangRef.html#volatile-memory-accesses

I haven't looked at the use cases in detail, but the target does support 64-bit loads via a plain ld, so presumably we shouldn't do the transform for volatile accesses?

Hmm... clearly this requirement cannot possibly be met for volatile operations that are wider than the widest load available on the target. Of course, that is not the case here. So I suppose that if we split the load, a volatile load could observe the two halves of the value before and after a store to the same memory from another thread.
I'll add the volatile check and bail on the combine. Thanks for bringing this up.

nemanjai updated this revision to Diff 354304.Jun 24 2021, 11:01 AM

Do not combine volatile loads.

spatel accepted this revision.Jun 24 2021, 11:20 AM

See inline for a couple of small adjustments, otherwise LGTM.
I'm not up-to-speed on current PPC targets though, so might want to wait for a 2nd opinion.

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
15275

I forgot to mention atomics (add one more test...).
I think we want to use "LD->isSimple()" to be safe.
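
A sketch of the kind of test this implies (hypothetical function name): LD->isSimple() is false for both volatile and atomic loads, so an atomic case like this should be left alone.

declare i64 @llvm.bswap.i64(i64)

; Hypothetical test: an atomic load is not "simple", so the combine should bail.
define i64 @no_split_atomic_load(i64* %p) {
  %x = load atomic i64, i64* %p monotonic, align 8
  %b = call i64 @llvm.bswap.i64(i64 %x)
  ret i64 %b
}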

llvm/test/CodeGen/PowerPC/ld-bswap64-no-ldbrx.ll
20

IIUC, we are ok doing the transform with any alignment, so this test isn't capturing that.

Please add a misaligned-only test or adjust the first test to have "align 1" so it can verify that we perform the fold independent of alignment.
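
For example, a misaligned variant of the earlier @split_load pattern (hypothetical function name) would exercise this:

declare i64 @llvm.bswap.i64(i64)

; Hypothetical test: the fold should still fire with minimal alignment.
define i64 @split_load_align1(i64* %p) {
  %x = load i64, i64* %p, align 1
  %b = call i64 @llvm.bswap.i64(i64 %x)
  ret i64 %b
}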

This revision is now accepted and ready to land.Jun 24 2021, 11:20 AM

I can confirm that this reduces the stack size of the function in the Linux kernel that prompted the report:

Before:

arch/powerpc/kvm/book3s_hv_nested.c:289:6: warning: stack frame size (2048) exceeds limit (768) in function 'kvmhv_enter_nested_guest' [-Wframe-larger-than]
long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
     ^
1 warning generated.

After:

arch/powerpc/kvm/book3s_hv_nested.c:289:6: warning: stack frame size (1856) exceeds limit (768) in function 'kvmhv_enter_nested_guest' [-Wframe-larger-than]
long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
     ^
1 warning generated.
This revision was automatically updated to reflect the committed changes.