AArch64: Fold immediate into the immediate field of logical instructions
ClosedPublic

Authored by ahatanak on Oct 2 2014, 4:08 PM.

Details

Summary

llvm currently turns the following code

void foo1(int a, char *p) {

int t = a & 0xfd;
*p = t;

}

into these instructions:

movz w8, #253
and w8, w0, w8
strb w8, [x1]

This can be done using just two instructions, since we don't care what the upper 24-bits of the "and" instruction are.

and w8, w0, 0xfffffffd
strb w8, [x1]

This patch adds a target hook to TargetLowering::TargetLoweringOpt and overrides it in the AArch64 backend to sign-extend an immediate operand if the upper bits are not demanded and sign-extending enables folding the immediate into the instruction.

This optimization speeds up 253.perlbmk by 5%.

Diff Detail

Repository
rL LLVM
ahatanak retitled this revision from to AArch64: Fold immediate into the immediate field of logical instructions.Oct 2 2014, 4:08 PM
ahatanak updated this object.
ahatanak edited the test plan for this revision. (Show Details)
ahatanak added a subscriber: Unknown Object (MLST).

Nice gain!! Just a few nits.

include/llvm/Target/TargetLowering.h
2056 ↗(On Diff #14357)

s/LTO/TLO ?

lib/Target/AArch64/AArch64ISelLowering.cpp
674 ↗(On Diff #14357)

Remove vertical whitespace.

681 ↗(On Diff #14357)

Please add an assert message:

assert(cond && "msg");

aadg added a subscriber: aadg.Oct 3 2014, 2:20 AM
ahatanak updated this revision to Diff 14392.Oct 3 2014, 12:06 PM

Address Chad's comments.

ping

lib/Target/AArch64/AArch64ISelLowering.cpp
685 ↗(On Diff #14392)

The code I previously had here didn't make sense, so I cleaned it up. It was checking Size > 0 after Size = std::max(VT.getSizeInBits(), 32u). Also, I changed it to break if VT is a vector, just in case this function is called on a vector node.

Hi Akira,

I've got a couple of comments:

include/llvm/Target/TargetLowering.h
2058 ↗(On Diff #14392)

Functions should usually start with a lower-case letter.

lib/Target/AArch64/AArch64ISelLowering.cpp
693–694 ↗(On Diff #14392)

OK, so we've got an immediate "ab...iJK..Z" where lower-case digits aren't used. Sign extending converts this to "JJ...JJK...Z". Is there any particular reason to expect that's representable? It seems like a bit of a shot in the dark.

700 ↗(On Diff #14392)

Shifting a negative number left is undefined behaviour.

ahatanak updated this revision to Diff 15345.Oct 23 2014, 5:30 PM

Tim, I fixed the undefined behavior and renamed the function to start with lower-case letter.

The reason I only do sign-extension is that it seemed to be the cheapest way to get the most gain without hurting compile time or making the code overly complicated. I looked at the instructions llvm emits, and I saw many places where logical instructions were followed by truncating stores.

I can try changing it to do a more exhaustive search of the bit patterns to see how much further performance can be improved, if that's necessary.

Hi Akira,

The reason I only do sign-extension is that it seemed to be the cheapest way to get the most gain without hurting compile time or making the code overly complicated. I looked at the instructions llvm emits, and I saw many places where logical instructions were followed by truncating stores.

I think the kind of store being done is largely orthogonal to the
calculation most likely to get you a valid immediate. It tells you
what bits can be ignored, but nothing about the best way to fill them.

So I think it's reasonable to start with just assuming that
DemandedBits has some number of low bits set and the rest ignored
based on your observations. I think it's harder to justify coming up
with a single NewImm value by sign extending the existing immediate
and giving up if that fails.

Cheers.

Tim.

Hi again,

I think it's harder to justify coming up
with a single NewImm value by sign extending the existing immediate
and giving up if that fails.

I've been doing some more thinking here, and if we're willing to
assume the truncation is to a power of two type (I am, anyone using
i14 deserves whatever they get) then I think we can cover *all* valid
cases by instead replicating the demanded bits across the 32-bit
register. E.g. try 0xfdfdfdfd instead of 0xfffffffd for the 8-bit
0xfd.

The argument goes that if the input is morally contiguous, then there
are multiple representations involving sign extension to 4, 8, 16 or
32 followed by replication. Otherwise the replication width is less
than the demanded width so we're completely forced and have to
continue the replication that's already started.

Cheers.

Tim.

The argument goes that if the input is morally contiguous, then there
are multiple representations involving sign extension to 4, 8, 16 or
32 followed by replication. Otherwise the replication width is less
than the demanded width so we're completely forced and have to
continue the replication that's already started.

Sign-extension followed by replication enables converting constants like 0xfdfd (demanded = 16-bits) or 0x3dfd (demanded = 14bits) to bimm, where just sign-extending fails, but I think we should also try to handle cases like 0x19 (demanded = 5bits). This isn't a bimm as it stands and sign-extending doesn't make it a bimm either. In this case, we have to copy bits 1-3 to bits 5-7.

I think I can come up with a patch that does this.

ahatanak updated this revision to Diff 15600.Oct 30 2014, 6:51 PM

Rewrote the algorithm for searching for a bitmask immediate based on Tim's feedback.

I had to make changes to DAGCombiner::visitAND because it was transforming the DAG in way that made it impossible to do any optimization in optimizeConstant (the code here was originally committed in r97616). This change doesn't seem to have any noticeable impact on performance, but I'm still investigating.

ahatanak updated this revision to Diff 26086.May 19 2015, 1:51 PM

Sorry for not updating the patch for a long time. Here is my new patch.

For the most part, the new patch takes the same approach as the previous patch to find an immediate operand, but there were a couple of changes made.

  • Function optimizeLogicalImm emits AArch64 machine nodes to prevent the target-independent dagcombine from undoing the optimization. With this change, there is no need to make changes in DAGCombiner::visitOR as I did in my previous patch.
  • In optimizeLogicalImm, rotation is used to avoid using branches and simplify the logic.
  • A target-specific function object for optimizing nodes with immediates is passed to the constructor of TargetLoweringOpt, which gets called later in TargetLoweringOpt::ShrinkDemandedConstant.
ahatanak updated this revision to Diff 71888.Sep 19 2016, 4:51 PM

Rebase and make a couple of changes:

  • Add comments.
  • In optimizeLogicalImm, return false instead of true when the immediate is already a bimm32 or bimm64 so that it doesn't inhibit the optimization done later in AArch64ISelDAGToDAG.cpp that emits BFXIL.
  • Remove redundant instructions in bitreverse.ll.
ab added a subscriber: ab.Oct 4 2016, 9:40 AM

Some drive-by comments; I haven't looked into the optimizeLogicalImm logic yet.

include/llvm/Target/TargetLowering.h
2252 ↗(On Diff #71888)

Add more detail on when this is called and what the parameters are?

2278–2281 ↗(On Diff #71888)

I don't think this is the best location for this: I'd rather have TargetLoweringOpt be "the result of an optimization", and TargetLowering be "how to do optimizations".

What do you think of going back to the TLI virtual hook, and fixing the other TLO methods to do the same: https://reviews.llvm.org/differential/diff/73495/

lib/Target/AArch64/AArch64ISelLowering.cpp
745–746 ↗(On Diff #71888)

Check this only once when initializing the std::function? (or leave it here if you remove the std::function, I suppose)

test/CodeGen/AArch64/optimize-imm.ll
1 ↗(On Diff #71888)

Triple can be simplified to: -mtriple=aarch64--

ahatanak added inline comments.Oct 4 2016, 10:32 PM
include/llvm/Target/TargetLowering.h
2278–2281 ↗(On Diff #71888)

I'm not sure what TargetLoweringOpt is supposed to do, but using TLI virtual hooks looks like a better approach.

ahatanak updated this revision to Diff 73702.Oct 5 2016, 3:21 PM

Address review comments.

ahatanak marked 2 inline comments as done.Oct 5 2016, 3:22 PM
ahatanak updated this revision to Diff 73721.Oct 5 2016, 7:21 PM

Fix comment.

ahatanak updated this revision to Diff 78297.Nov 16 2016, 5:28 PM

Rebase. Move TargetLoweringOpt::SimplifyDemandedBits into TargetLowering.

mcrosier added inline comments.Wed, Apr 12, 11:19 AM
lib/Target/AArch64/AArch64ISelLowering.cpp
806 ↗(On Diff #85648)

You should use AArch64_AM::isLogicalImmediate() here.

901 ↗(On Diff #85648)

If you put this after the below switch, you'll guarantee Op has operand 1.

907 ↗(On Diff #85648)

Can't we early exit if we demand all of the bits? E.g.,

if (Demanded.countPopulation() == Size)
  return false;
922 ↗(On Diff #85648)

Default case goes at top of switch, per coding guidelines.

FWIW, I think this is in pretty good shape. Also, I ran correctness tests on everything I've got (e.g., llvm-ts, SPEC200X, internal tests) and saw no correctness failures.

ahatanak updated this revision to Diff 95070.Wed, Apr 12, 6:11 PM
ahatanak marked 4 inline comments as done.

Address Chad's comments.

mcrosier accepted this revision.Mon, Apr 17, 10:18 AM

Thanks for working on this, Akira. I believe you've addressed all of my concerns as well as Tim's and Arnaud's. LGTM, assuming you've done the due diligence to ensure this doesn't dramatically increase compile-time.

This revision is now accepted and ready to land.Mon, Apr 17, 10:18 AM

I compiled the files in MultiSource/Applications of the test suite and didn't see any measurable increase in compile time. I'll commit this patch today.

Thank you for the review.

Thanks for working on this patch and for being *extremely* patient with the 2.5 year review, Akira!

This revision was automatically updated to reflect the committed changes.