This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Make preferred alignment of PointerArgs for MemIntrinsic
Needs RevisionPublic

Authored by JojoR on Sep 19 2022, 1:11 AM.

Details

Summary

Set default preferred alignment for MemIntrinsic like memcpy according to arch32 or arch64,
it will improve performance.

e.g. dhrystone with "-O2" boosts performance by 50% on arch RV32.

Diff Detail

Event Timeline

JojoR created this revision.Sep 19 2022, 1:11 AM
JojoR requested review of this revision.Sep 19 2022, 1:11 AM

Please use like git diff -U999999 to make context available.(see https://llvm.org/docs/DeveloperPolicy.html#making-and-submitting-a-patch)

Please upload patches using -U99999 as indicated here https://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
12987

Align(Subtarget.getXLen() / 8)

JojoR updated this revision to Diff 461443.Sep 19 2022, 6:47 PM
JojoR marked an inline comment as done.

Set default preferred alignment for MemIntrinsic like memcpy according to arch32 or arch64,
it will improve performance.

What are arch32 and arch64?

e.g. dhrystone with "-O2" boosts performance by 50% on arch RV32.

On what implementation? Does this affect actually-useful benchmarks, not just dhrystone? I would assume so, but it'd be more useful to get numbers for meaningful benchmarks rather than ones people should've long since abandoned.

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
12986

This seems arbitrary. I'd expect at least XLEN / 8 here?

JojoR added a comment.Sep 19 2022, 7:42 PM

Set default preferred alignment for MemIntrinsic like memcpy according to arch32 or arch64,
it will improve performance.

What are arch32 and arch64?

e.g. dhrystone with "-O2" boosts performance by 50% on arch RV32.

On what implementation? Does this affect actually-useful benchmarks, not just dhrystone? I would assume so, but it'd be more useful to get numbers for meaningful benchmarks rather than ones people should've long since abandoned.

I think WORD_SIZE as preferred alignment is basic behavior to improve performance (use lw/sw instead of lb/sb), and dhrystone gets good performance benefit from this patch,
some other useful benchmarks are up slightly (< 5%), I do not list here so.

Set default preferred alignment for MemIntrinsic like memcpy according to arch32 or arch64,
it will improve performance.

What are arch32 and arch64?

e.g. dhrystone with "-O2" boosts performance by 50% on arch RV32.

On what implementation? Does this affect actually-useful benchmarks, not just dhrystone? I would assume so, but it'd be more useful to get numbers for meaningful benchmarks rather than ones people should've long since abandoned.

I think WORD_SIZE as preferred alignment is basic behavior to improve performance (use lw/sw instead of lb/sb), and dhrystone gets good performance benefit from this patch,
some other useful benchmarks are up slightly (< 5%), I do not list here so.

I'm not saying it's a bad idea, I'm just saying that the 50% isn't a very useful figure to quote as it's unlikely to be representative of real-world performance, so some less awful benchmarks would be more meaningful and helpful to quote.

craig.topper added a comment.EditedSep 19 2022, 8:03 PM

The description doesn't clearly describe the effect of the patch. My understanding from reading the user of this function is that the alignment of allocas and global variables used by memcpy are increased in CodeGenPrepare. This results in less memory operations. In the case of 32-bit dhrystone, it looks like we have an explicit call to the memcpy library function. I guess by aligning the pointers we allow the source and dest to both be word_size aligned so we can use the full word copy loop?

In my local testing for riscv64 this didn't seem to affect dhrystone performance.

JojoR added inline comments.Sep 19 2022, 8:11 PM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
12986

The caller has check this condition, set PrefAlign only the original is more smaller :)

jrtc27 added inline comments.Sep 19 2022, 8:15 PM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
12986

I don't understand this comment at all. This is about MinSize not PrefAlign, and CGP just uses this value as a threshold. Why 8? Why not XLEN / 8, or XLEN / 4 if the 8 was chosen to optimise for RV32. This needs justification.

JojoR updated this revision to Diff 461458.Sep 19 2022, 8:23 PM
JojoR marked an inline comment as done.
JojoR marked an inline comment as done.
JojoR added a comment.Sep 19 2022, 8:30 PM

The description doesn't clearly describe the effect of the patch. My understanding from reading the user of this function is that the alignment of allocas and global variables used by memcpy are increased in CodeGenPrepare. This results in less memory operations. In the case of 32-bit dhrystone, it looks like we have an explicit call to the memcpy library function. I guess by aligning the pointers we allow the source and dest to both be word_size aligned so we can use the full word copy loop?

In my local testing for riscv64 this didn't seem to affect dhrystone performance.

Yes, my test is on the RV32.

craig.topper added inline comments.Sep 19 2022, 8:40 PM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
12986

having the same value for rv32 and rv64 kind of makes sense from a certain perspective. If the object is smaller than minsize then you're saying you're ok with "object size in bytes" number of stores if the object is align 1. What you want to allow doesn't necessarily vary with xlen.

Any other suggestions ?

@craig.topper @jrtc27

arichardson requested changes to this revision.Sep 20 2022, 3:32 AM

It looks like this hook is only used by codegenprepare, so the test should be in llvm/test/Transforms/CodeGenPrepare and check that the alignment of the IR variable was changed rather than indirectly testing it via assembly code. I'd also add a version with an alloca.

Additionally, I don't think this should be specific to RISC-V, I would assume that all architectures that don't have unaligned accesses (and even those with slower accesses) would benefit from aligning global string constants that are used in memcpy operations. I see that right now this hook is only used by ARM targets (D7908).

Can this be changed to call allowsMisalignedMemoryAccesses() and adjusting the alignment to the size of pointers?

This revision now requires changes to proceed.Sep 20 2022, 3:32 AM

It looks like this hook is only used by codegenprepare, so the test should be in llvm/test/Transforms/CodeGenPrepare and check that the alignment of the IR variable was changed rather than indirectly testing it via assembly code. I'd also add a version with an alloca.

Additionally, I don't think this should be specific to RISC-V, I would assume that all architectures that don't have unaligned accesses (and even those with slower accesses) would benefit from aligning global string constants that are used in memcpy operations. I see that right now this hook is only used by ARM targets (D7908).

Can this be changed to call allowsMisalignedMemoryAccesses() and adjusting the alignment to the size of pointers?

I've made this generic in D134282 (depends on D134281).