This is an archive of the discontinued LLVM Phabricator instance.

[AArch64]Add support for converting halfword loads into a 32-bit word load
ClosedPublic

Authored by junbuml on Oct 15 2015, 7:43 AM.

Details

Summary

Convert two halfword loads into a single 32-bit word load with bitfield extract
instructions. For example :

ldrh w0, [x2]
ldrh w1, [x2, #2]

becomes

ldr w0, [x2]
ubfx w1, w0, #16, #16
and w0, w0, #ffff

Diff Detail

Event Timeline

junbuml updated this revision to Diff 37481.Oct 15 2015, 7:43 AM
junbuml retitled this revision from to [AArch64]Add support for converting halfword loads into a 32-bit word load.
junbuml updated this object.
junbuml added reviewers: mcrosier, mzolotukhin.
junbuml added a subscriber: llvm-commits.
junbuml added a reviewer: ab.Oct 15 2015, 7:46 AM

FYI,
With this change 1.5% performance improvement was found in spec2006/h264ref.

mcrosier edited edge metadata.Oct 15 2015, 9:24 AM

Hi Jun,
Just a few general comments:

  1. You might want to consider pulling this optimization into a separate loop before the loop the does the load and store pairing. I could see a case where you promote the ldrh+ldrh into a ldr and then that ldr gets paired with another ldr into a ldp.

ldrh w1, [x0]
ldrh w2, [x0, #2]
ldrh w3, [x0, #4]
ldrh w4, [x0, #8]

becomes something like:

ldp w1, w3 [x0]
ubfx w2, w1, #16, #16
ubfx w1, w1, #0, #16
ubfx w4, w3, #16, #16
ubfx w3, w3, #0, #16

  1. For the suggestion in #1 to work you would need to append the MMOs to the new ldr instruction. Otherwise, the hasOrderedMemoryRef() check will fail.
  1. Would is make sense to use an 'and w0, w0, 0xff' to zero the upper bits of the load? In other words, I'm wondering if a 'ubfx' is more expensive than an 'and' operation?

    Chad
lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
172

Just delete and fall through.

mcrosier added a subscriber: gberry.
  • I agree that implementing this in the separate loop will open up more pairing.
  • Okay, I will append MMO.
  • Yes, for low 16, AND should be preferred.

Thanks Chad for the review.

lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
172

Thanks. I will fix it.

mcrosier added inline comments.Oct 15 2015, 10:28 AM
lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
1181

I would also make this the first listed optimization. That way the documentation is consistent with the order in which the optimization occur.

  1. combining of halfword/small types opt
  2. load/store paring opt
  3. pre- and post-index opt
junbuml updated this revision to Diff 37590.Oct 16 2015, 8:26 AM
junbuml updated this object.
junbuml marked 3 inline comments as done.
mcrosier accepted this revision.Oct 19 2015, 7:27 AM
mcrosier edited edge metadata.

LGTM, with one minor nit.

lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
1142

If we merge two ldrh instructions into a ldr we haven't actually created a paired instruction. How about we add another statistic to count when we merge narrow loads/stores into wider loads/stores?

This revision is now accepted and ready to land.Oct 19 2015, 7:27 AM
junbuml added inline comments.Oct 19 2015, 7:32 AM
lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
1142

Adding new statistic sounds good to me.

junbuml closed this revision.Oct 19 2015, 11:39 AM

Committed in r250719

I tested it on A57 for spec2000 and spec2006. This patch was applied pretty widely, but clear performance was only observed in spce2006/h264ref (about 2%), and no performance regression was found. I can add a flag to turn it on only for a specific architecture. Please let me know any suggestion.

James,
Unfortunately, we're not setup to conduct performance testing using the llvm test-suite. Currently, we can only perform correctness testing, but we hope to have this resolved in the coming weeks.

Chad