Page MenuHomePhabricator

Allow merging of immediates within a basic block for code size savings and reduced footprint.
ClosedPublic

Authored by zansari on Jul 20 2015, 12:29 PM.

Details

Summary

First attempt / phase of changes to prevent common immediates that occur more than once within a single basic block to be pulled into their users, in order to prevent unnecessary large instruction encoding.

A quick run on cpu2k shows just over 1% .text size savings on the sum of all the objects (around 8% on just 252.eon alone). Performance (O2 and Os) is flat, though this is currently only enabled under Os, with the hopes to later also include O2.

Quick example of how this help..

int a, b, c, d, e, f, g, h, i, j;
int x;
int foo(void)
{
    a = 0x0fffffff;
    b = 0x0fffffff;
    if (e == 0x0fffffff) {
        x = 1;
    }
    f = 456;
    g = 789;
    h = 555;
    i += 555;

    return 0;
}

BEFORE:

 0:	c7 05 00 00 00 00 ff 	movl   $0xfffffff,0x0
 7:	ff ff 0f 
 a:	c7 05 00 00 00 00 ff 	movl   $0xfffffff,0x0
11:	ff ff 0f 
14:	81 3d 00 00 00 00 ff 	cmpl   $0xfffffff,0x0
1b:	ff ff 0f 
1e:	75 0a                	jne    2a <foo+0x2a>
20:	c7 05 00 00 00 00 01 	movl   $0x1,0x0
27:	00 00 00 
2a:	c7 05 00 00 00 00 c8 	movl   $0x1c8,0x0
31:	01 00 00 
34:	c7 05 00 00 00 00 15 	movl   $0x315,0x0
3b:	03 00 00 
3e:	c7 05 00 00 00 00 2b 	movl   $0x22b,0x0
45:	02 00 00 
48:	81 05 00 00 00 00 2b 	addl   $0x22b,0x0
4f:	02 00 00 
52:	31 c0                	xor    %eax,%eax
54:	c3                   	ret

AFTER:

 0:	b8 ff ff ff 0f       	mov    $0xfffffff,%eax
 5:	a3 00 00 00 00       	mov    %eax,0x0
 a:	a3 00 00 00 00       	mov    %eax,0x0
 f:	39 05 00 00 00 00    	cmp    %eax,0x0
15:	75 0a                	jne    21 <foo+0x21>
17:	c7 05 00 00 00 00 01 	movl   $0x1,0x0
1e:	00 00 00 
21:	c7 05 00 00 00 00 c8 	movl   $0x1c8,0x0
28:	01 00 00 
2b:	c7 05 00 00 00 00 15 	movl   $0x315,0x0
32:	03 00 00 
35:	b8 2b 02 00 00       	mov    $0x22b,%eax
3a:	a3 00 00 00 00       	mov    %eax,0x0
3f:	01 05 00 00 00 00    	add    %eax,0x0
45:	31 c0                	xor    %eax,%eax
47:	c3                   	ret

This is a first attempt and phase of this, with later followup including:

  1. Increase the instruction types of users beyond stores and binary ops.
  2. Enable at O2.
  3. Move beyond single BB to operate globally in a function.

These other phases will require a little additional work to be done to make them safe.

Included is a request to delete CodeGen/X86/remat-invalid-liveness.ll

Justification : This is a large test that failed at some point due to a register being incorrectly clobbered. It was trimmed down a little to allow for a regression test. The CHECK conditions loosely resembled the original failing condition, and now look like this:

; CHECK-LABEL: __XXX1:
; CHECK: movl $3, %ecx
; CHECK-NOT: subb %{{[a-z]+}}, %ch

Note that (1) There are a few instances of "movl $3, %ecx", and (2) there's a lot of code between the different movs and the subb.
Based on those to conditions, it's extremely hard (without some form of data flow analysis) to determine that the "subb" is, indeed, bad. For example, with my changes, I get:

movl              $3, %ecx
movb            $64, %ch
…
testb             %cl, %al
…
subb              %al, %ch

..which is reasonable, yet fails.

I can't think of any reasonable way to fix this test to catch the intended issue.

Thanks,
Zia.

Diff Detail

Repository
rL LLVM

Event Timeline

zansari updated this revision to Diff 30173.Jul 20 2015, 12:29 PM
zansari retitled this revision from to Allow merging of immediates within a basic block for code size savings and reduced footprint..
zansari updated this object.
zansari added reviewers: mkuper, qcolombet, spatel, nadav.
zansari added a subscriber: llvm-commits.
zansari updated this object.Jul 20 2015, 12:33 PM
zansari updated this object.
zansari updated this object.Jul 20 2015, 12:44 PM
mkuper edited edge metadata.Jul 21 2015, 1:53 AM

Hi Zia,

A couple of small comments inline.

lib/Target/X86/X86ISelDAGToDAG.cpp
300 ↗(On Diff #30173)

You can add a UseCount > 1 condition to bail out of the loop immediately once you're past the threshold.

337 ↗(On Diff #30173)

otherOp should start with a capital.

zansari updated this revision to Diff 30448.Jul 22 2015, 10:39 PM
zansari edited edge metadata.

Applied Michael's suggestions.

zansari updated this revision to Diff 30449.Jul 22 2015, 10:42 PM

Removed a tab that snuck into last update.

spatel edited edge metadata.Jul 24 2015, 10:34 AM

Hi Zia -

Nitpicks and a question.

Seems like a good optimization to me, but someone else should take a look and issue a verdict.

lib/Target/X86/X86ISelDAGToDAG.cpp
287 ↗(On Diff #30449)

No comma after "size".

289 ↗(On Diff #30449)

Indentation (tab width) is off for this whole function.

292 ↗(On Diff #30449)

Don't want to --> Do not

293 ↗(On Diff #30449)

Eventually, --> TODO:

295 ↗(On Diff #30449)

I realize this is intended to be a temporary check, but I don't know if this is the correct predicate (and I'm hoping someone else will answer this for us):
What about MinSize? There are lots of similar size checks sprinkled around, but AFAICT, if the function only has a MinSize attribute, we're missing all those cases. Should OptForSize imply MinSize?

298 ↗(On Diff #30449)

Period.

304 ↗(On Diff #30449)

legit --> legitimate

327 ↗(On Diff #30449)

Immedates --> Immediates

329 ↗(On Diff #30449)

No comma after "passing".

lib/Target/X86/X86InstrArithmetic.td
618–626 ↗(On Diff #30449)

Weird spacing; over the 80-col edge.

lib/Target/X86/X86InstrInfo.td
883 ↗(On Diff #30449)

issues --> issue

887 ↗(On Diff #30449)

No comma after "top".

test/CodeGen/X86/immediate_merging.ll
1 ↗(On Diff #30449)

This doesn't work for on my system. The "RUN" must be followed by a ":".

Also, it looks like most tests do a redirect rather than making the input file a param:
; RUN: llc -o - < %s

Don't know what the benefit is, but might as well stick to that pattern. :)

14 ↗(On Diff #30449)

test --> Test

55 ↗(On Diff #30449)

test --> Test
Period at end.

zansari updated this revision to Diff 30606.Jul 24 2015, 2:19 PM
zansari edited edge metadata.

Thanks, Sanjay.

I've addressed all of your comment (I hope).

Regarding your question.. At first, I wasn't entirely sure myself, but some testing seemed to indicate that MinSize implies OptForSize. I'm not sure if this is true under all conditions, or not, but I added an extra check to explicitly check for MinSize, just to be on the safe side.

Thanks,
Zia.

hfinkel added inline comments.
lib/Target/X86/X86ISelDAGToDAG.cpp
289 ↗(On Diff #30606)

This function does not really hoist anything, so maybe name this ShouldHoistImmediateForSize. Even that's confusing, however, because the constants are not actually immediates if they're hoisted, and you intend to enable this even when not optimizing for size. I think I prefer naming it:

bool shouldEncodeConstantAsImmediate

or something like that.

295 ↗(On Diff #30606)

I think that you can remove the MinSize check. I don't have a strong opinion, but Clang always emits both for -Oz and emits OptimizeForSize for -Os. Arguably, we should be checking for both everywhere, but we should make that change, if we make it, globally, not just in this one place.

mkuper added inline comments.Jul 26 2015, 1:20 AM
lib/Target/X86/X86ISelDAGToDAG.cpp
295 ↗(On Diff #30606)

As a side note, there are already a couple of places that do the "or" check.

Admittedly, I wrote one of them myself (in X86CallFrameOpt), but there are at least two other significant places - both the initialization of AArch64DAGToDAGISel and the DAGCombiner constructors have:

ForCodeSize = F->hasFnAttribute(Attribute::OptimizeForSize) ||
              F->hasFnAttribute(Attribute::MinSize);

It would be great to make this consistent across the board, but I agree, that's a separate commit.

qcolombet edited edge metadata.Jul 28 2015, 1:05 PM

Hi Zia,

Have you looked into the ConstantHoisting pass that performs this kind of constant sharing at a function level?

I do not know if that would be the right solution for the given problem, but it looks to me like this is redundant with that pass and less powerful, though the cost model is probably better.

Could you double check?

Thanks,
-Quentin

Hi Quentin,

I believe Zia is on vacation for a few more days, but this is something we've discussed.
The problem is that ConstantHoisting is useless for multiple usages in the same basic block.
That is, consider this test-case from the ConstantHoisting pass:

define i128 @test1(i128 %a) {
  %1 = add i128 %a, 12297829382473034410122878
  %2 = add i128 %1, 12297829382473034410122878
  ret i128 %2
}

After hoisting, we get this:

define i128 @test1(i128 %a) {
  %const = bitcast i128 12297829382473034410122878 to i128
  %1 = add i128 %a, %const
  %2 = add i128 %1, %const
  ret i128 %2
}

From the DAG's point of view, this is basically a no-op. That is, given this:

define i32 @test1(i32 %a) {
  %const = bitcast i32 42 to i32 
  %1 = add i32 %a, %const
  %2 = add i32 %1, %const
  ret i32 %2
}

We sill still pull the constant into the instruction during selection.

It's true that we also want to make the ConstantHoisting heuristic smarter to handle the multi-basic-block cases better, but the instruction selection changes are needed regardless of that.

Hi Michael,

I am not sure I understand this part:

We sill still pull the constant into the instruction during selection.

I thought ConstantHoisting was marking the constants as opaque so that this cannot happen. What am I missing?

[...] but the instruction selection changes are needed regardless of that.

Which ones are needed?

Thanks,
-Quentin

Hi Michael,

I am not sure I understand this part:

We sill still pull the constant into the instruction during selection.

I thought ConstantHoisting was marking the constants as opaque so that this cannot happen. What am I missing?

Not opaque enough.
It's opaque in terms of cross-basic-block propagation, but not within a block.
That is, for this:

define i32 @test1(i32 %a) {
  %const = bitcast i32 42 to i32 
  br label %next
next:
  %1 = add i32 %a, %const
  %2 = add i32 %1, %const
  ret i32 %2
}

define i32 @test2(i32 %a) {
  %const = bitcast i32 42 to i32 
  %1 = add i32 %a, %const
  %2 = add i32 %1, %const
  ret i32 %2
}

We get:

test1:                                  # @test1
        movl    4(%esp), %eax
        movl    $42, %ecx
        addl    %ecx, %eax
        addl    %ecx, %eax
        retl

test2:                                  # @test2
        movl    4(%esp), %eax
        addl    $42, %eax
        addl    $42, %eax
        retl

[...] but the instruction selection changes are needed regardless of that.

Which ones are needed?

The ones in this review - handling basic-block-local constants in a smarter way.
One possible alternative is getting ConstantHoisting to always hoist the constants into a new BB, but that seems fairly ugly to me.

Hi Michael,

Thanks for the clarification.

I would go differently and make the constants produced by constants hoisting "more opaque”, so that it also works within a basic block.

Cheers,
-Quentin

I just noticed D5544 which tries to do a similar optimization using a separate pass.

I'm not advocating a particular approach because I don't fully understand the trade-offs; I just wanted to add that link for my own reference. :)

zansari updated this revision to Diff 31245.Aug 3 2015, 11:17 AM
zansari edited edge metadata.

OK, back from vacation.. Thanks for the additional reviews and comments. Some updates:

  • Hal: I agree with your comment regarding the naming for the querying function that doesn't actually hoist anything. I wasn't 100% comfortable with the name you suggested, since it was a bit confusing (to me, at least). I went with "shouldAvoidImmediateInstFormsForSize", if that works for you.
  • Hal: I also removed the minsize check, per your recommendation.
  • Quentin / Sanjay:

Here are my thoughts regarding ConstantHoisting.cpp and D5544...

D5544 is overkill, in my opinion. It's a lot of code that costs extra computation and does something that can be easily done in existing places.

Regarding ConstantHoisting.cpp, I did look into this at first and my earliest prototype was done through there. I'd still like to explore global (cross-BB) constant hoisting in the future, but I chose to go with my current proposed changes for a couple of main reasons:

  1. ConstantHoisting.cpp can make the constants more opaque, preventing isel from pulling them, back in, but this is phase dependent. How about constants introduced after hoisting? DAG creation will always pull out constants, and it seems more natural to properly select the right instructions after this transformation, given our heuristics.
  1. Perhaps more importantly : Pulling out all common constants in ConstantHoisting.cpp might affect other downstream target-independent optimizations on the IL that look for constants as part of the instructions. It's easier to optimize with subsumed immediates at the IL level. My changes may have the same effect in the CG, but they occur later in isel, and are a little more natural for something that's more target-specific (i.e. dealing with instruction encoding).

Those two reasons, along with wanting to tackle this the safest/easiest way (without rocking the boat too much) were my primary motivations for the route that I took.

Hope this helps clear up some of the confusion.

Thanks,
Zia.

Hi Zia,

ConstantHoisting.cpp can make the constants more opaque, preventing isel from pulling them, back in, but this is phase dependent. How about constants introduced after hoisting? DAG creation will always pull out constants, and it seems more natural to properly select the right instructions after this transformation, given our heuristics.

Agreed, however I do not believe any of the test cases in that patch fall into that case. I.e., I’d rather have an actual motivating example not to use ConstantHoisting than adding extra logic.

Perhaps more importantly : Pulling out all common constants in ConstantHoisting.cpp might affect other downstream target-independent optimizations on the IL that look for constants as part of the instructions. It's easier to optimize with subsumed immediates at the IL level. My changes may have the same effect in the CG, but they occur later in isel, and are a little more natural for something that's more target-specific (i.e. dealing with instruction encoding).

This is a phase ordering problem and if I remember correctly ConstantHoisting runs very late and already produces “target-dependent” code, since the constants to hoist are chosen given some target dependent cost model. In other words, this shouldn’t be a problem with the current setting.

I am aware that pushing to reuse ConstantHoist may not be the ultimate solution, but I miss compelling test cases/arguments for not using/improving it right now. Put differently, I’d like to see whether or not we can improve the generic framework for all targets to benefit from it, before exploring target specific solutions.

Cheers,
-Quentin

Hi Quentin,

Thanks for the additional thoughts and suggestions.

However, I’m still not comfortable with having ConstantHoisting doing the transformation. I’ll try to address your concerns, and break things down a little more. Hopefully I’ll address most of your questions and concerns.

  • I’ll start by giving my opinion on what’s “natural” with respect to subsuming constants.
    • ISEL will always pull out all constants within BBs.
    • ISEL already has extensive heuristics for when and how to pull those constants back in to instructions. Why create new heuristics elsewhere that would require synchronization with this?
    • It seems extremely natural to simply modify those existing heuristics just slightly to catch EXACTLY the cases that we want. These can be easily tuned to catch more/fewer case as we desire.
      • Doing it in ConstantHoisting is clumsy in that it will blindly pull out constants.
      • Adding heuristics to ConstantHoisting to be smarter about what it pulls out needs careful and fragile synchronization with ISEL for various targets, which could get ugly.
  • I DO agree that it would be nice to do this at a level where more targets could benefit, but my concern is with doing something that will, instead, hurt all targets.
    • First of all, I don’t really know how to make constants “more opaque”, as suggested above. How much more opaque can you get beyond using a bitcast for immdiates? I am new to LLVM, so forgive me if I’m just being naïve 
      • ConstantHoisting uses bitcasts AND pulls said bitcasts out into dominating blocks, and calls them “opaque”.
        • These opaque constants, when appearing in their own BB will be lowered as constant instantiations by ISEL.
        • These opaque constants, when referenced from outside BBs, will references the instantiated constants.. however..
        • These opaque constants, when referenced in a block in which they’re defined, will get pulled into all references by ISEL (per Michael’s argument that they are not “opaque enough”).
      • Given that opaque constants get pulled into local BB users, we would require similar (to my patch) changes in ISEL to prevent this from happening.
      • I’d really prefer not to hoist the constants out of their BBs, as I’d like to avoid cross-BB live ranges. I’d like to tackle global constant hoisting in a separate effort, which will/should be handled in ConstantHoisting.cpp.
    • Let’s assume we can make these local BB constant “more opaque” to the point that ISEL can’t pull them into instructions anymore. This will have a few negative effects:
      • Without lots of changes, ISEL can’t pick and choose which constants might be OK for its target for specific instructions. It’s left to the mercy of everything ConstantHoisting has pulled out. That is, ALL targets will simply punt on pulling any of those constants back in. I think it needs to be a target-specific decision to pick instructions/constants to subsume. Moving all of the target-specific decision making into ConstantHoisting doesn’t seem practical.
      • Without lots of changes, DAG and ISEL will stop seeing constants and fail to do any optimizations that involve immediates. I’ll admit I haven’t spent tons of time trying to come up with concrete examples, but DAG and ISEL have hundreds of references and queries for Constants in .td and .cpp files. I’m certain a lot of these are optimizations. With “more opaque” constants, these will all fail if they have been pull out by ConstantHoisting. Again, I think it’s unreasonable to implement changes in ConstantHoisting to work together with ISEL.
  • On to the phase ordering..
    • Yes, I agree that ConstantHoisting runs late, but that seems to be a big restriction. Anything that runs on the IR after ConstantHoisting today, or tomorrow, will suffer from not being able to process immediates, since they are now all pulled out and “more opaque”. Given the frequency of 8/16/32bit immediates, this will have a large impact on the IR and everyone walking it.
    • Even if we were guaranteed that ConstantHoisting was always the very last phase to run before DAG/ISEL, we’ll still be missing opportunities. DAG itself introduces lots of constants in call lowering, address lowering, and perhaps a few other places. I actually messed with my patch a little to make it more aggressive to demonstrate this. Take this simple example:
int x, y, p;
double d;
int foo()
{
    p = 16;
    x = 4;
    y = 8;
    bar(4, 8, d, 4);
}

By doing the merging in ISEL, I get this (note the stack offsets):

subl      $28, %esp
movl      $16, %eax
movl      %eax, p
movl      $4, %ecx
movl      %ecx, x
movl      $8, %edx
movl      %edx, y
movsd     d, %xmm0
movl      %ecx, (%esp,%eax)
movsd     %xmm0, (%esp,%edx)
movl      %edx, (%esp,%ecx)
movl      %ecx, (%esp)
calll     bar
  • Which I thought was kind of neat, since I’ve never seen this sort of optimization in a compiler :) Those stores to the stack are all smaller than their immediate offset counterparts. You can do similar optimizations with address expansion, also.
  • You simply can’t do this with ConstantHoisting.

In summary, I’d like to emphasize what I think is the most important point : “ISEL already has extensive heuristics for when and how to pull constants back in to instructions. Why create new ones elsewhere that would require synchronization with this?”

However, since we have no other place that naturally pulls out and subsumes global (cross BB) constants, I’m in 100% agreement that ConstantHoisting should tackle those.

Thanks,
Zia.

Hi Zia,

Thanks for sharing your point of view on that.

My position can be summarized like this:
I do not want to prevent you to go the ISel path, but if you do, please make sure to expose at lest a test case that couldn’t be tackled by ConstantHoisting, i.e., a case where constants are exposed at ISel time.
Whatever we do to make ISel clever will not scale to function level scope, so why not invest directly into ConstantHoisting? I reckon this may be a bigger commitment as ConstantHoisting may, like you said, be clumsy.

That being said, I do believe ConstantHoisting is a bit too high level for this kind of optimization and we do not have a good framework yet to make that better.

Anyhow, a couple of answers regarding your questions.

ISEL already has extensive heuristics for when and how to pull those constants back in to instructions. Why create new heuristics elsewhere that would require synchronization with this?

Because ConstantHoisting has a cost model that is supposed to model this kind of thing and this pass has a broader scope than ISel.

First of all, I don’t really know how to make constants “more opaque”, as suggested above. How much more opaque can you get beyond using a bitcast for immdiates? I am new to LLVM, so forgive me if I’m just being naïve

Honestly, I do not know the details of ConstantHoisting. I thought there was a mechanism to mark bitcasted constant as opaque and that the basic block scope wouldn’t change any of that. Given your comment, this is not the case and this is unfortunate. Anyhow, please proceed with your ISel changes.

Thanks for your patience,
-Quentin

zansari updated this revision to Diff 31559.Aug 7 2015, 4:47 PM

Thanks, Quentin.

Totally understand and agree with your position, and I also appreciate your patience with me.

I've started with a restricted set of target instructions but I think that as we add to the selection heuristics for the cases we catch, it should get easier to expose more places that ConstantHoisting can't catch, as we continue to improve code-size reductions.

However, as you suggested, I did add to my lit-test to check for a case where ISEL exposes constants not available in ConstantHoisting, even with my current restricted heuristics.

For example, the simple case of :

memset(a, 33, 24)

Before:

movl    $555819297, a+20        # imm = 0x21212121
movl    $555819297, a+16        # imm = 0x21212121
movl    $555819297, a+12        # imm = 0x21212121
movl    $555819297, a+8         # imm = 0x21212121
movl    $555819297, a+4         # imm = 0x21212121
movl    $555819297, a           # imm = 0x21212121

with my changes:

movl    $555819297, %eax        # imm = 0x21212121
movl    %eax, a+20
movl    %eax, a+16
movl    %eax, a+12
movl    %eax, a+8
movl    %eax, a+4
movl    %eax, a

That's about 40% size reduction.

If the changes to the test look OK to you, I'll go ahead with the commit.

Thanks, again.
Zia.

qcolombet accepted this revision.Aug 7 2015, 5:17 PM
qcolombet edited edge metadata.

Hi Zia,

LGTM.

Thanks,
-Quentin

This revision is now accepted and ready to land.Aug 7 2015, 5:17 PM
hfinkel added inline comments.Aug 10 2015, 2:47 AM
lib/Target/X86/X86ISelDAGToDAG.cpp
349 ↗(On Diff #31559)

Do you need to check for X86::RSP here too?

Your test case seems to be 32-bit only. Assuming this affects 64-bit code too, can you please add some coverage for that as well?

This revision was automatically updated to reflect the committed changes.