This is an archive of the discontinued LLVM Phabricator instance.

Allow merging of immediates within a basic block for code size savings and reduced footprint.
ClosedPublic

Authored by zansari on Jul 20 2015, 12:29 PM.

Download Raw Diff

Details

Reviewers

spatel
qcolombet
nadav
mkuper

Commits

rG243c073a2ea3: [X86] Allow merging of immediates within a basic block for code size savings
rL244601: [X86] Allow merging of immediates within a basic block for code size savings

Summary

First attempt / phase of changes to prevent common immediates that occur more than once within a single basic block to be pulled into their users, in order to prevent unnecessary large instruction encoding.

A quick run on cpu2k shows just over 1% .text size savings on the sum of all the objects (around 8% on just 252.eon alone). Performance (O2 and Os) is flat, though this is currently only enabled under Os, with the hopes to later also include O2.

Quick example of how this help..

int a, b, c, d, e, f, g, h, i, j;
int x;
int foo(void)
{
    a = 0x0fffffff;
    b = 0x0fffffff;
    if (e == 0x0fffffff) {
        x = 1;
    }
    f = 456;
    g = 789;
    h = 555;
    i += 555;

    return 0;
}

BEFORE:

 0:	c7 05 00 00 00 00 ff 	movl   $0xfffffff,0x0
 7:	ff ff 0f 
 a:	c7 05 00 00 00 00 ff 	movl   $0xfffffff,0x0
11:	ff ff 0f 
14:	81 3d 00 00 00 00 ff 	cmpl   $0xfffffff,0x0
1b:	ff ff 0f 
1e:	75 0a                	jne    2a <foo+0x2a>
20:	c7 05 00 00 00 00 01 	movl   $0x1,0x0
27:	00 00 00 
2a:	c7 05 00 00 00 00 c8 	movl   $0x1c8,0x0
31:	01 00 00 
34:	c7 05 00 00 00 00 15 	movl   $0x315,0x0
3b:	03 00 00 
3e:	c7 05 00 00 00 00 2b 	movl   $0x22b,0x0
45:	02 00 00 
48:	81 05 00 00 00 00 2b 	addl   $0x22b,0x0
4f:	02 00 00 
52:	31 c0                	xor    %eax,%eax
54:	c3                   	ret

AFTER:

 0:	b8 ff ff ff 0f       	mov    $0xfffffff,%eax
 5:	a3 00 00 00 00       	mov    %eax,0x0
 a:	a3 00 00 00 00       	mov    %eax,0x0
 f:	39 05 00 00 00 00    	cmp    %eax,0x0
15:	75 0a                	jne    21 <foo+0x21>
17:	c7 05 00 00 00 00 01 	movl   $0x1,0x0
1e:	00 00 00 
21:	c7 05 00 00 00 00 c8 	movl   $0x1c8,0x0
28:	01 00 00 
2b:	c7 05 00 00 00 00 15 	movl   $0x315,0x0
32:	03 00 00 
35:	b8 2b 02 00 00       	mov    $0x22b,%eax
3a:	a3 00 00 00 00       	mov    %eax,0x0
3f:	01 05 00 00 00 00    	add    %eax,0x0
45:	31 c0                	xor    %eax,%eax
47:	c3                   	ret

This is a first attempt and phase of this, with later followup including:

Increase the instruction types of users beyond stores and binary ops.
Enable at O2.
Move beyond single BB to operate globally in a function.

These other phases will require a little additional work to be done to make them safe.

Included is a request to delete CodeGen/X86/remat-invalid-liveness.ll

Justification : This is a large test that failed at some point due to a register being incorrectly clobbered. It was trimmed down a little to allow for a regression test. The CHECK conditions loosely resembled the original failing condition, and now look like this:

; CHECK-LABEL: __XXX1:
; CHECK: movl $3, %ecx
; CHECK-NOT: subb %{{[a-z]+}}, %ch

Note that (1) There are a few instances of "movl $3, %ecx", and (2) there's a lot of code between the different movs and the subb.
Based on those to conditions, it's extremely hard (without some form of data flow analysis) to determine that the "subb" is, indeed, bad. For example, with my changes, I get:

movl              $3, %ecx
movb            $64, %ch
…
testb             %cl, %al
…
subb              %al, %ch

..which is reasonable, yet fails.

I can't think of any reasonable way to fix this test to catch the intended issue.

Thanks,
Zia.

Diff Detail

Repository: rL LLVM

Event Timeline

zansari updated this revision to Diff 30173.Jul 20 2015, 12:29 PM

zansari retitled this revision from to Allow merging of immediates within a basic block for code size savings and reduced footprint..

zansari updated this object.

zansari added reviewers: mkuper, qcolombet, spatel, nadav.

zansari added a subscriber: llvm-commits.

zansari updated this object.Jul 20 2015, 12:33 PM

zansari updated this object.

zansari updated this object.Jul 20 2015, 12:44 PM

Hi Zia,

A couple of small comments inline.

lib/Target/X86/X86ISelDAGToDAG.cpp
300 ↗	(On Diff #30173)	You can add a UseCount > 1 condition to bail out of the loop immediately once you're past the threshold.
337 ↗	(On Diff #30173)	otherOp should start with a capital.

Applied Michael's suggestions.

Removed a tab that snuck into last update.

Hi Zia -

Nitpicks and a question.

Seems like a good optimization to me, but someone else should take a look and issue a verdict.

lib/Target/X86/X86ISelDAGToDAG.cpp
287 ↗	(On Diff #30449)	No comma after "size".
289 ↗	(On Diff #30449)	Indentation (tab width) is off for this whole function.
292 ↗	(On Diff #30449)	Don't want to --> Do not
293 ↗	(On Diff #30449)	Eventually, --> TODO:
295 ↗	(On Diff #30449)	I realize this is intended to be a temporary check, but I don't know if this is the correct predicate (and I'm hoping someone else will answer this for us): What about MinSize? There are lots of similar size checks sprinkled around, but AFAICT, if the function only has a MinSize attribute, we're missing all those cases. Should OptForSize imply MinSize?
298 ↗	(On Diff #30449)	Period.
304 ↗	(On Diff #30449)	legit --> legitimate
327 ↗	(On Diff #30449)	Immedates --> Immediates
329 ↗	(On Diff #30449)	No comma after "passing".
lib/Target/X86/X86InstrArithmetic.td
618–626 ↗	(On Diff #30449)	Weird spacing; over the 80-col edge.
lib/Target/X86/X86InstrInfo.td
883 ↗	(On Diff #30449)	issues --> issue
887 ↗	(On Diff #30449)	No comma after "top".
test/CodeGen/X86/immediate_merging.ll
1 ↗	(On Diff #30449)	This doesn't work for on my system. The "RUN" must be followed by a ":". Also, it looks like most tests do a redirect rather than making the input file a param: ; RUN: llc -o - < %s Don't know what the benefit is, but might as well stick to that pattern. :)
14 ↗	(On Diff #30449)	test --> Test
55 ↗	(On Diff #30449)	test --> Test Period at end.

Thanks, Sanjay.

I've addressed all of your comment (I hope).

Regarding your question.. At first, I wasn't entirely sure myself, but some testing seemed to indicate that MinSize implies OptForSize. I'm not sure if this is true under all conditions, or not, but I added an extra check to explicitly check for MinSize, just to be on the safe side.

Thanks,
Zia.

hfinkel added a subscriber: hfinkel.Jul 25 2015, 4:21 PM

hfinkel added inline comments.

lib/Target/X86/X86ISelDAGToDAG.cpp
289 ↗	(On Diff #30606)	This function does not really hoist anything, so maybe name this ShouldHoistImmediateForSize. Even that's confusing, however, because the constants are not actually immediates if they're hoisted, and you intend to enable this even when not optimizing for size. I think I prefer naming it: bool shouldEncodeConstantAsImmediate or something like that.
295 ↗	(On Diff #30606)	I think that you can remove the MinSize check. I don't have a strong opinion, but Clang always emits both for -Oz and emits OptimizeForSize for -Os. Arguably, we should be checking for both everywhere, but we should make that change, if we make it, globally, not just in this one place.

mkuper added inline comments.Jul 26 2015, 1:20 AM

lib/Target/X86/X86ISelDAGToDAG.cpp
295 ↗	(On Diff #30606)	As a side note, there are already a couple of places that do the "or" check. Admittedly, I wrote one of them myself (in X86CallFrameOpt), but there are at least two other significant places - both the initialization of AArch64DAGToDAGISel and the DAGCombiner constructors have: ForCodeSize = F->hasFnAttribute(Attribute::OptimizeForSize) \|\| F->hasFnAttribute(Attribute::MinSize); It would be great to make this consistent across the board, but I agree, that's a separate commit.

spatel mentioned this in D11568: fix memcpy/memset/memmove lowering when optimizing for size.Jul 28 2015, 11:07 AM

Hi Zia,

Have you looked into the ConstantHoisting pass that performs this kind of constant sharing at a function level?

I do not know if that would be the right solution for the given problem, but it looks to me like this is redundant with that pass and less powerful, though the cost model is probably better.

Could you double check?

Thanks,
-Quentin

Hi Quentin,

I believe Zia is on vacation for a few more days, but this is something we've discussed.
The problem is that ConstantHoisting is useless for multiple usages in the same basic block.
That is, consider this test-case from the ConstantHoisting pass:

define i128 @test1(i128 %a) {
  %1 = add i128 %a, 12297829382473034410122878
  %2 = add i128 %1, 12297829382473034410122878
  ret i128 %2
}

After hoisting, we get this:

define i128 @test1(i128 %a) {
  %const = bitcast i128 12297829382473034410122878 to i128
  %1 = add i128 %a, %const
  %2 = add i128 %1, %const
  ret i128 %2
}

From the DAG's point of view, this is basically a no-op. That is, given this:

define i32 @test1(i32 %a) {
  %const = bitcast i32 42 to i32 
  %1 = add i32 %a, %const
  %2 = add i32 %1, %const
  ret i32 %2
}

We sill still pull the constant into the instruction during selection.

It's true that we also want to make the ConstantHoisting heuristic smarter to handle the multi-basic-block cases better, but the instruction selection changes are needed regardless of that.

Hi Michael,

I am not sure I understand this part:

We sill still pull the constant into the instruction during selection.

I thought ConstantHoisting was marking the constants as opaque so that this cannot happen. What am I missing?

[...] but the instruction selection changes are needed regardless of that.

Which ones are needed?

Thanks,
-Quentin

In D11363#214332, @qcolombet wrote:

Hi Michael,

I am not sure I understand this part:

We sill still pull the constant into the instruction during selection.

I thought ConstantHoisting was marking the constants as opaque so that this cannot happen. What am I missing?

Not opaque enough.
It's opaque in terms of cross-basic-block propagation, but not within a block.
That is, for this:

define i32 @test1(i32 %a) {
  %const = bitcast i32 42 to i32 
  br label %next
next:
  %1 = add i32 %a, %const
  %2 = add i32 %1, %const
  ret i32 %2
}

define i32 @test2(i32 %a) {
  %const = bitcast i32 42 to i32 
  %1 = add i32 %a, %const
  %2 = add i32 %1, %const
  ret i32 %2
}

We get:

test1:                                  # @test1
        movl    4(%esp), %eax
        movl    $42, %ecx
        addl    %ecx, %eax
        addl    %ecx, %eax
        retl

test2:                                  # @test2
        movl    4(%esp), %eax
        addl    $42, %eax
        addl    $42, %eax
        retl

[...] but the instruction selection changes are needed regardless of that.

Which ones are needed?

The ones in this review - handling basic-block-local constants in a smarter way.
One possible alternative is getting ConstantHoisting to always hoist the constants into a new BB, but that seems fairly ugly to me.

Hi Michael,

Thanks for the clarification.

I would go differently and make the constants produced by constants hoisting "more opaque”, so that it also works within a basic block.

Cheers,
-Quentin

spatel mentioned this in rL243693: fix memcpy/memset/memmove lowering when optimizing for size.Jul 30 2015, 2:42 PM

I just noticed D5544 which tries to do a similar optimization using a separate pass.

I'm not advocating a particular approach because I don't fully understand the trade-offs; I just wanted to add that link for my own reference. :)

spatel mentioned this in D11685: cache OptSize and MinSize attributes in the DAG.Jul 31 2015, 11:42 AM

OK, back from vacation.. Thanks for the additional reviews and comments. Some updates:

Hal: I agree with your comment regarding the naming for the querying function that doesn't actually hoist anything. I wasn't 100% comfortable with the name you suggested, since it was a bit confusing (to me, at least). I went with "shouldAvoidImmediateInstFormsForSize", if that works for you.

Hal: I also removed the minsize check, per your recommendation.

Quentin / Sanjay:

Here are my thoughts regarding ConstantHoisting.cpp and D5544...

D5544 is overkill, in my opinion. It's a lot of code that costs extra computation and does something that can be easily done in existing places.

Regarding ConstantHoisting.cpp, I did look into this at first and my earliest prototype was done through there. I'd still like to explore global (cross-BB) constant hoisting in the future, but I chose to go with my current proposed changes for a couple of main reasons:

ConstantHoisting.cpp can make the constants more opaque, preventing isel from pulling them, back in, but this is phase dependent. How about constants introduced after hoisting? DAG creation will always pull out constants, and it seems more natural to properly select the right instructions after this transformation, given our heuristics.

Perhaps more importantly : Pulling out all common constants in ConstantHoisting.cpp might affect other downstream target-independent optimizations on the IL that look for constants as part of the instructions. It's easier to optimize with subsumed immediates at the IL level. My changes may have the same effect in the CG, but they occur later in isel, and are a little more natural for something that's more target-specific (i.e. dealing with instruction encoding).

Those two reasons, along with wanting to tackle this the safest/easiest way (without rocking the boat too much) were my primary motivations for the route that I took.

Hope this helps clear up some of the confusion.

Thanks,
Zia.

Hi Zia,

ConstantHoisting.cpp can make the constants more opaque, preventing isel from pulling them, back in, but this is phase dependent. How about constants introduced after hoisting? DAG creation will always pull out constants, and it seems more natural to properly select the right instructions after this transformation, given our heuristics.

Agreed, however I do not believe any of the test cases in that patch fall into that case. I.e., I’d rather have an actual motivating example not to use ConstantHoisting than adding extra logic.

Perhaps more importantly : Pulling out all common constants in ConstantHoisting.cpp might affect other downstream target-independent optimizations on the IL that look for constants as part of the instructions. It's easier to optimize with subsumed immediates at the IL level. My changes may have the same effect in the CG, but they occur later in isel, and are a little more natural for something that's more target-specific (i.e. dealing with instruction encoding).

This is a phase ordering problem and if I remember correctly ConstantHoisting runs very late and already produces “target-dependent” code, since the constants to hoist are chosen given some target dependent cost model. In other words, this shouldn’t be a problem with the current setting.

I am aware that pushing to reuse ConstantHoist may not be the ultimate solution, but I miss compelling test cases/arguments for not using/improving it right now. Put differently, I’d like to see whether or not we can improve the generic framework for all targets to benefit from it, before exploring target specific solutions.

Cheers,
-Quentin

Hi Quentin,

Thanks for the additional thoughts and suggestions.

However, I’m still not comfortable with having ConstantHoisting doing the transformation. I’ll try to address your concerns, and break things down a little more. Hopefully I’ll address most of your questions and concerns.

I’ll start by giving my opinion on what’s “natural” with respect to subsuming constants.
- ISEL will always pull out all constants within BBs.
- ISEL already has extensive heuristics for when and how to pull those constants back in to instructions. Why create new heuristics elsewhere that would require synchronization with this?
- It seems extremely natural to simply modify those existing heuristics just slightly to catch EXACTLY the cases that we want. These can be easily tuned to catch more/fewer case as we desire.
  - Doing it in ConstantHoisting is clumsy in that it will blindly pull out constants.
  - Adding heuristics to ConstantHoisting to be smarter about what it pulls out needs careful and fragile synchronization with ISEL for various targets, which could get ugly.

I DO agree that it would be nice to do this at a level where more targets could benefit, but my concern is with doing something that will, instead, hurt all targets.
- First of all, I don’t really know how to make constants “more opaque”, as suggested above. How much more opaque can you get beyond using a bitcast for immdiates? I am new to LLVM, so forgive me if I’m just being naïve 
  - ConstantHoisting uses bitcasts AND pulls said bitcasts out into dominating blocks, and calls them “opaque”.
    - These opaque constants, when appearing in their own BB will be lowered as constant instantiations by ISEL.
    - These opaque constants, when referenced from outside BBs, will references the instantiated constants.. however..
    - These opaque constants, when referenced in a block in which they’re defined, will get pulled into all references by ISEL (per Michael’s argument that they are not “opaque enough”).
  - Given that opaque constants get pulled into local BB users, we would require similar (to my patch) changes in ISEL to prevent this from happening.
  - I’d really prefer not to hoist the constants out of their BBs, as I’d like to avoid cross-BB live ranges. I’d like to tackle global constant hoisting in a separate effort, which will/should be handled in ConstantHoisting.cpp.
- Let’s assume we can make these local BB constant “more opaque” to the point that ISEL can’t pull them into instructions anymore. This will have a few negative effects:
  - Without lots of changes, ISEL can’t pick and choose which constants might be OK for its target for specific instructions. It’s left to the mercy of everything ConstantHoisting has pulled out. That is, ALL targets will simply punt on pulling any of those constants back in. I think it needs to be a target-specific decision to pick instructions/constants to subsume. Moving all of the target-specific decision making into ConstantHoisting doesn’t seem practical.
  - Without lots of changes, DAG and ISEL will stop seeing constants and fail to do any optimizations that involve immediates. I’ll admit I haven’t spent tons of time trying to come up with concrete examples, but DAG and ISEL have hundreds of references and queries for Constants in .td and .cpp files. I’m certain a lot of these are optimizations. With “more opaque” constants, these will all fail if they have been pull out by ConstantHoisting. Again, I think it’s unreasonable to implement changes in ConstantHoisting to work together with ISEL.

On to the phase ordering..
- Yes, I agree that ConstantHoisting runs late, but that seems to be a big restriction. Anything that runs on the IR after ConstantHoisting today, or tomorrow, will suffer from not being able to process immediates, since they are now all pulled out and “more opaque”. Given the frequency of 8/16/32bit immediates, this will have a large impact on the IR and everyone walking it.
- Even if we were guaranteed that ConstantHoisting was always the very last phase to run before DAG/ISEL, we’ll still be missing opportunities. DAG itself introduces lots of constants in call lowering, address lowering, and perhaps a few other places. I actually messed with my patch a little to make it more aggressive to demonstrate this. Take this simple example:

int x, y, p;
double d;
int foo()
{
    p = 16;
    x = 4;
    y = 8;
    bar(4, 8, d, 4);
}

By doing the merging in ISEL, I get this (note the stack offsets):

subl      $28, %esp
movl      $16, %eax
movl      %eax, p
movl      $4, %ecx
movl      %ecx, x
movl      $8, %edx
movl      %edx, y
movsd     d, %xmm0
movl      %ecx, (%esp,%eax)
movsd     %xmm0, (%esp,%edx)
movl      %edx, (%esp,%ecx)
movl      %ecx, (%esp)
calll     bar

Which I thought was kind of neat, since I’ve never seen this sort of optimization in a compiler :) Those stores to the stack are all smaller than their immediate offset counterparts. You can do similar optimizations with address expansion, also.
You simply can’t do this with ConstantHoisting.

In summary, I’d like to emphasize what I think is the most important point : “ISEL already has extensive heuristics for when and how to pull constants back in to instructions. Why create new ones elsewhere that would require synchronization with this?”

However, since we have no other place that naturally pulls out and subsumes global (cross BB) constants, I’m in 100% agreement that ConstantHoisting should tackle those.

Thanks,
Zia.

Hi Zia,

Thanks for sharing your point of view on that.

My position can be summarized like this:
I do not want to prevent you to go the ISel path, but if you do, please make sure to expose at lest a test case that couldn’t be tackled by ConstantHoisting, i.e., a case where constants are exposed at ISel time.
Whatever we do to make ISel clever will not scale to function level scope, so why not invest directly into ConstantHoisting? I reckon this may be a bigger commitment as ConstantHoisting may, like you said, be clumsy.

That being said, I do believe ConstantHoisting is a bit too high level for this kind of optimization and we do not have a good framework yet to make that better.

Anyhow, a couple of answers regarding your questions.

ISEL already has extensive heuristics for when and how to pull those constants back in to instructions. Why create new heuristics elsewhere that would require synchronization with this?

Because ConstantHoisting has a cost model that is supposed to model this kind of thing and this pass has a broader scope than ISel.

First of all, I don’t really know how to make constants “more opaque”, as suggested above. How much more opaque can you get beyond using a bitcast for immdiates? I am new to LLVM, so forgive me if I’m just being naïve

Honestly, I do not know the details of ConstantHoisting. I thought there was a mechanism to mark bitcasted constant as opaque and that the basic block scope wouldn’t change any of that. Given your comment, this is not the case and this is unfortunate. Anyhow, please proceed with your ISel changes.

Thanks for your patience,
-Quentin

Thanks, Quentin.

Totally understand and agree with your position, and I also appreciate your patience with me.

I've started with a restricted set of target instructions but I think that as we add to the selection heuristics for the cases we catch, it should get easier to expose more places that ConstantHoisting can't catch, as we continue to improve code-size reductions.

However, as you suggested, I did add to my lit-test to check for a case where ISEL exposes constants not available in ConstantHoisting, even with my current restricted heuristics.

For example, the simple case of :

memset(a, 33, 24)

Before:

movl    $555819297, a+20        # imm = 0x21212121
movl    $555819297, a+16        # imm = 0x21212121
movl    $555819297, a+12        # imm = 0x21212121
movl    $555819297, a+8         # imm = 0x21212121
movl    $555819297, a+4         # imm = 0x21212121
movl    $555819297, a           # imm = 0x21212121

with my changes:

movl    $555819297, %eax        # imm = 0x21212121
movl    %eax, a+20
movl    %eax, a+16
movl    %eax, a+12
movl    %eax, a+8
movl    %eax, a+4
movl    %eax, a

That's about 40% size reduction.

If the changes to the test look OK to you, I'll go ahead with the commit.

Thanks, again.
Zia.

Hi Zia,

LGTM.

Thanks,
-Quentin

This revision is now accepted and ready to land.Aug 7 2015, 5:17 PM

hfinkel added inline comments.Aug 10 2015, 2:47 AM

lib/Target/X86/X86ISelDAGToDAG.cpp
349 ↗	(On Diff #31559)	Do you need to check for X86::RSP here too? Your test case seems to be 32-bit only. Assuming this affects 64-bit code too, can you please add some coverage for that as well?

Closed by commit rL244601: [X86] Allow merging of immediates within a basic block for code size savings (authored by mkuper). · Explain WhyAug 11 2015, 7:11 AM

This revision was automatically updated to reflect the committed changes.

jevinskie added a subscriber: jevinskie.Aug 11 2015, 2:38 PM

Revision Contents

Path

Size

llvm/

trunk/

lib/

Target/

X86/

X86ISelDAGToDAG.cpp

76 lines

X86InstrArithmetic.td

8 lines

X86InstrInfo.td

40 lines

test/

CodeGen/

X86/

immediate_merging.ll

82 lines

remat-invalid-liveness.ll

85 lines

Diff 31811

llvm/trunk/lib/Target/X86/X86ISelDAGToDAG.cpp

Show First 20 Lines • Show All 277 Lines • ▼ Show 20 Lines	inline void getAddressOperands(X86ISelAddressMode &AM, SDLoc DL,
Disp = CurDAG->getTargetConstant(AM.Disp, DL, MVT::i32);		Disp = CurDAG->getTargetConstant(AM.Disp, DL, MVT::i32);

if (AM.Segment.getNode())		if (AM.Segment.getNode())
Segment = AM.Segment;		Segment = AM.Segment;
else		else
Segment = CurDAG->getRegister(0, MVT::i32);		Segment = CurDAG->getRegister(0, MVT::i32);
}		}

		// Utility function to determine whether we should avoid selecting
		// immediate forms of instructions for better code size or not.
		// At a high level, we'd like to avoid such instructions when
		// we have similar constants used within the same basic block
		// that can be kept in a register.
		//
		bool shouldAvoidImmediateInstFormsForSize(SDNode *N) const {
		uint32_t UseCount = 0;

		// Do not want to hoist if we're not optimizing for size.
		// TODO: We'd like to remove this restriction.
		// See the comment in X86InstrInfo.td for more info.
		if (!OptForSize)
		return false;

		// Walk all the users of the immediate.
		for (SDNode::use_iterator UI = N->use_begin(),
		UE = N->use_end(); (UI != UE) && (UseCount < 2); ++UI) {

		SDNode User = UI;

		// This user is already selected. Count it as a legitimate use and
		// move on.
		if (User->isMachineOpcode()) {
		UseCount++;
		continue;
		}

		// We want to count stores of immediates as real uses.
		if (User->getOpcode() == ISD::STORE &&
		User->getOperand(1).getNode() == N) {
		UseCount++;
		continue;
		}

		// We don't currently match users that have > 2 operands (except
		// for stores, which are handled above)
		// Those instruction won't match in ISEL, for now, and would
		// be counted incorrectly.
		// This may change in the future as we add additional instruction
		// types.
		if (User->getNumOperands() != 2)
		continue;

		// Immediates that are used for offsets as part of stack
		// manipulation should be left alone. These are typically
		// used to indicate SP offsets for argument passing and
		// will get pulled into stores/pushes (implicitly).
		if (User->getOpcode() == X86ISD::ADD \|\|
		User->getOpcode() == ISD::ADD \|\|
		User->getOpcode() == X86ISD::SUB \|\|
		User->getOpcode() == ISD::SUB) {

		// Find the other operand of the add/sub.
		SDValue OtherOp = User->getOperand(0);
		if (OtherOp.getNode() == N)
		OtherOp = User->getOperand(1);

		// Don't count if the other operand is SP.
		RegisterSDNode *RegNode;
		if (OtherOp->getOpcode() == ISD::CopyFromReg &&
		(RegNode = dyn_cast_or_null<RegisterSDNode>(
		OtherOp->getOperand(1).getNode())))
		if ((RegNode->getReg() == X86::ESP) \|\|
		(RegNode->getReg() == X86::RSP))
		continue;
		}

		// ... otherwise, count this and move on.
		UseCount++;
		}

		// If we have more than 1 use, then recommend for hoisting.
		return (UseCount > 1);
		}

/// getI8Imm - Return a target constant with the specified value, of type		/// getI8Imm - Return a target constant with the specified value, of type
/// i8.		/// i8.
inline SDValue getI8Imm(unsigned Imm, SDLoc DL) {		inline SDValue getI8Imm(unsigned Imm, SDLoc DL) {
return CurDAG->getTargetConstant(Imm, DL, MVT::i8);		return CurDAG->getTargetConstant(Imm, DL, MVT::i8);
}		}

/// getI32Imm - Return a target constant with the specified value, of type		/// getI32Imm - Return a target constant with the specified value, of type
/// i32.		/// i32.
▲ Show 20 Lines • Show All 2,632 Lines • Show Last 20 Lines

llvm/trunk/lib/Target/X86/X86InstrArithmetic.td

Show First 20 Lines • Show All 609 Lines • ▼ Show 20 Lines	class X86TypeInfo<ValueType vt, string instrsuffix, RegisterClass regclass,
/// HasREX_WPrefix - This bit is set to true if the instruction should have		/// HasREX_WPrefix - This bit is set to true if the instruction should have
/// the 0x40 REX prefix. This is set for i64 types.		/// the 0x40 REX prefix. This is set for i64 types.
bit HasREX_WPrefix = hasREX_WPrefix;		bit HasREX_WPrefix = hasREX_WPrefix;
}		}

def invalid_node : SDNode<"<<invalid_node>>", SDTIntLeaf,[],"<<invalid_node>>">;		def invalid_node : SDNode<"<<invalid_node>>", SDTIntLeaf,[],"<<invalid_node>>">;


def Xi8 : X86TypeInfo<i8 , "b", GR8 , loadi8 , i8mem ,		def Xi8 : X86TypeInfo<i8, "b", GR8, loadi8, i8mem,
Imm8 , i8imm , imm, i8imm , invalid_node,		Imm8, i8imm, imm8_su, i8imm, invalid_node,
0, OpSizeFixed, 0>;		0, OpSizeFixed, 0>;
def Xi16 : X86TypeInfo<i16, "w", GR16, loadi16, i16mem,		def Xi16 : X86TypeInfo<i16, "w", GR16, loadi16, i16mem,
Imm16, i16imm, imm, i16i8imm, i16immSExt8,		Imm16, i16imm, imm16_su, i16i8imm, i16immSExt8_su,
1, OpSize16, 0>;		1, OpSize16, 0>;
def Xi32 : X86TypeInfo<i32, "l", GR32, loadi32, i32mem,		def Xi32 : X86TypeInfo<i32, "l", GR32, loadi32, i32mem,
Imm32, i32imm, imm, i32i8imm, i32immSExt8,		Imm32, i32imm, imm32_su, i32i8imm, i32immSExt8_su,
1, OpSize32, 0>;		1, OpSize32, 0>;
def Xi64 : X86TypeInfo<i64, "q", GR64, loadi64, i64mem,		def Xi64 : X86TypeInfo<i64, "q", GR64, loadi64, i64mem,
Imm32S, i64i32imm, i64immSExt32, i64i8imm, i64immSExt8,		Imm32S, i64i32imm, i64immSExt32, i64i8imm, i64immSExt8,
1, OpSizeFixed, 1>;		1, OpSizeFixed, 1>;

/// ITy - This instruction base class takes the type info for the instruction.		/// ITy - This instruction base class takes the type info for the instruction.
/// Using this, it:		/// Using this, it:
/// 1. Concatenates together the instruction mnemonic with the appropriate		/// 1. Concatenates together the instruction mnemonic with the appropriate
▲ Show 20 Lines • Show All 735 Lines • Show Last 20 Lines

llvm/trunk/lib/Target/X86/X86InstrInfo.td

Show First 20 Lines • Show All 867 Lines • ▼ Show 20 Lines	def X86_COND_E_OR_NE : ImmLeaf<i8, [{
return (Imm == X86::COND_E) \|\| (Imm == X86::COND_NE);		return (Imm == X86::COND_E) \|\| (Imm == X86::COND_NE);
}]>;		}]>;


def i16immSExt8 : ImmLeaf<i16, [{ return Imm == (int8_t)Imm; }]>;		def i16immSExt8 : ImmLeaf<i16, [{ return Imm == (int8_t)Imm; }]>;
def i32immSExt8 : ImmLeaf<i32, [{ return Imm == (int8_t)Imm; }]>;		def i32immSExt8 : ImmLeaf<i32, [{ return Imm == (int8_t)Imm; }]>;
def i64immSExt8 : ImmLeaf<i64, [{ return Imm == (int8_t)Imm; }]>;		def i64immSExt8 : ImmLeaf<i64, [{ return Imm == (int8_t)Imm; }]>;

		// If we have multiple users of an immediate, it's much smaller to reuse
		// the register, rather than encode the immediate in every instruction.
		// This has the risk of increasing register pressure from stretched live
		// ranges, however, the immediates should be trivial to rematerialize by
		// the RA in the event of high register pressure.
		// TODO : This is currently enabled for stores and binary ops. There are more
		// cases for which this can be enabled, though this catches the bulk of the
		// issues.
		// TODO2 : This should really also be enabled under O2, but there's currently
		// an issue with RA where we don't pull the constants into their users
		// when we rematerialize them. I'll follow-up on enabling O2 after we fix that
		// issue.
		// TODO3 : This is currently limited to single basic blocks (DAG creation
		// pulls block immediates to the top and merges them if necessary).
		// Eventually, it would be nice to allow ConstantHoisting to merge constants
		// globally for potentially added savings.
		//
		def imm8_su : PatLeaf<(i8 imm), [{
		return !shouldAvoidImmediateInstFormsForSize(N);
		}]>;
		def imm16_su : PatLeaf<(i16 imm), [{
		return !shouldAvoidImmediateInstFormsForSize(N);
		}]>;
		def imm32_su : PatLeaf<(i32 imm), [{
		return !shouldAvoidImmediateInstFormsForSize(N);
		}]>;

		def i16immSExt8_su : PatLeaf<(i16immSExt8), [{
		return !shouldAvoidImmediateInstFormsForSize(N);
		}]>;
		def i32immSExt8_su : PatLeaf<(i32immSExt8), [{
		return !shouldAvoidImmediateInstFormsForSize(N);
		}]>;


def i64immSExt32 : ImmLeaf<i64, [{ return Imm == (int32_t)Imm; }]>;		def i64immSExt32 : ImmLeaf<i64, [{ return Imm == (int32_t)Imm; }]>;


// i64immZExt32 predicate - True if the 64-bit immediate fits in a 32-bit		// i64immZExt32 predicate - True if the 64-bit immediate fits in a 32-bit
// unsigned field.		// unsigned field.
def i64immZExt32 : ImmLeaf<i64, [{ return (uint64_t)Imm == (uint32_t)Imm; }]>;		def i64immZExt32 : ImmLeaf<i64, [{ return (uint64_t)Imm == (uint32_t)Imm; }]>;

▲ Show 20 Lines • Show All 394 Lines • ▼ Show 20 Lines
def MOV32ri_alt : Ii32<0xC7, MRM0r, (outs GR32:$dst), (ins i32imm:$src),		def MOV32ri_alt : Ii32<0xC7, MRM0r, (outs GR32:$dst), (ins i32imm:$src),
"mov{l}\t{$src, $dst\|$dst, $src}", [], IIC_MOV>, OpSize32;		"mov{l}\t{$src, $dst\|$dst, $src}", [], IIC_MOV>, OpSize32;
}		}
} // SchedRW		} // SchedRW

let SchedRW = [WriteStore] in {		let SchedRW = [WriteStore] in {
def MOV8mi : Ii8 <0xC6, MRM0m, (outs), (ins i8mem :$dst, i8imm :$src),		def MOV8mi : Ii8 <0xC6, MRM0m, (outs), (ins i8mem :$dst, i8imm :$src),
"mov{b}\t{$src, $dst\|$dst, $src}",		"mov{b}\t{$src, $dst\|$dst, $src}",
[(store (i8 imm:$src), addr:$dst)], IIC_MOV_MEM>;		[(store (i8 imm8_su:$src), addr:$dst)], IIC_MOV_MEM>;
def MOV16mi : Ii16<0xC7, MRM0m, (outs), (ins i16mem:$dst, i16imm:$src),		def MOV16mi : Ii16<0xC7, MRM0m, (outs), (ins i16mem:$dst, i16imm:$src),
"mov{w}\t{$src, $dst\|$dst, $src}",		"mov{w}\t{$src, $dst\|$dst, $src}",
[(store (i16 imm:$src), addr:$dst)], IIC_MOV_MEM>, OpSize16;		[(store (i16 imm16_su:$src), addr:$dst)], IIC_MOV_MEM>, OpSize16;
def MOV32mi : Ii32<0xC7, MRM0m, (outs), (ins i32mem:$dst, i32imm:$src),		def MOV32mi : Ii32<0xC7, MRM0m, (outs), (ins i32mem:$dst, i32imm:$src),
"mov{l}\t{$src, $dst\|$dst, $src}",		"mov{l}\t{$src, $dst\|$dst, $src}",
[(store (i32 imm:$src), addr:$dst)], IIC_MOV_MEM>, OpSize32;		[(store (i32 imm32_su:$src), addr:$dst)], IIC_MOV_MEM>, OpSize32;
def MOV64mi32 : RIi32S<0xC7, MRM0m, (outs), (ins i64mem:$dst, i64i32imm:$src),		def MOV64mi32 : RIi32S<0xC7, MRM0m, (outs), (ins i64mem:$dst, i64i32imm:$src),
"mov{q}\t{$src, $dst\|$dst, $src}",		"mov{q}\t{$src, $dst\|$dst, $src}",
[(store i64immSExt32:$src, addr:$dst)], IIC_MOV_MEM>;		[(store i64immSExt32:$src, addr:$dst)], IIC_MOV_MEM>;
} // SchedRW		} // SchedRW

let hasSideEffects = 0 in {		let hasSideEffects = 0 in {

/// Memory offset versions of moves. The immediate is an address mode sized		/// Memory offset versions of moves. The immediate is an address mode sized
▲ Show 20 Lines • Show All 1,651 Lines • Show Last 20 Lines

llvm/trunk/test/CodeGen/X86/immediate_merging.ll

				; RUN: llc -o - -mtriple=i386-unknown-linux-gnu < %s \| FileCheck %s
				; RUN: llc -o - -mtriple=x86_64-unknown-linux-gnu < %s \| FileCheck %s

				@a = common global i32 0, align 4
				@b = common global i32 0, align 4
				@c = common global i32 0, align 4
				@e = common global i32 0, align 4
				@x = common global i32 0, align 4
				@f = common global i32 0, align 4
				@h = common global i32 0, align 4
				@i = common global i32 0, align 4

				; Test -Os to make sure immediates with multiple users don't get pulled in to
				; instructions.
				define i32 @foo() optsize {
				; CHECK-LABEL: foo:
				; CHECK: movl $1234, [[R1:%[a-z]+]]
				; CHECK-NOT: movl $1234, a
				; CHECK-NOT: movl $1234, b
				; CHECK-NOT: movl $12, c
				; CHECK-NOT: cmpl $12, e
				; CHECK: movl [[R1]], a
				; CHECK: movl [[R1]], b

				entry:
				store i32 1234, i32* @a
				store i32 1234, i32* @b
				store i32 12, i32* @c
				%0 = load i32, i32* @e
				%cmp = icmp eq i32 %0, 12
				br i1 %cmp, label %if.then, label %if.end

				if.then: ; preds = %entry
				store i32 1, i32* @x
				br label %if.end

				; New block.. Make sure 1234 isn't live across basic blocks from before.
				; CHECK: movl $1234, f
				; CHECK: movl $555, [[R3:%[a-z]+]]
				; CHECK-NOT: movl $555, h
				; CHECK-NOT: addl $555, i
				; CHECK: movl [[R3]], h
				; CHECK: addl [[R3]], i

				if.end: ; preds = %if.then, %entry
				store i32 1234, i32* @f
				store i32 555, i32* @h
				%1 = load i32, i32* @i
				%add1 = add nsw i32 %1, 555
				store i32 %add1, i32* @i
				ret i32 0
				}

				; Test -O2 to make sure that all immediates get pulled in to their users.
				define i32 @foo2() {
				; CHECK-LABEL: foo2:
				; CHECK: movl $1234, a
				; CHECK: movl $1234, b

				entry:
				store i32 1234, i32* @a
				store i32 1234, i32* @b

				ret i32 0
				}

				declare void @llvm.memset.p0i8.i32(i8* nocapture, i8, i32, i32, i1) #1

				@AA = common global [100 x i8] zeroinitializer, align 1

				; memset gets lowered in DAG. Constant merging should hoist all the
				; immediates used to store to the individual memory locations. Make
				; sure we don't directly store the immediates.
				define void @foomemset() optsize {
				; CHECK-LABEL: foomemset:
				; CHECK-NOT: movl ${{.*}}, AA
				; CHECK: mov{{l\|q}} %{{e\|r}}ax, AA

				entry:
				call void @llvm.memset.p0i8.i32(i8* getelementptr inbounds ([100 x i8], [100 x i8]* @AA, i32 0, i32 0), i8 33, i32 24, i32 1, i1 false)
				ret void
				}

llvm/trunk/test/CodeGen/X86/remat-invalid-liveness.ll

	; RUN: llc %s -mcpu=core2 -o - \| FileCheck %s
	; This test was failing while tracking the liveness in the register scavenger
	; during the branching folding pass. The allocation of the subregisters was
	; incorrect.
	; I.e., the faulty pattern looked like:
	; CH = movb 64
	; ECX = movl 3 <- CH was killed here.
	; CH = subb CH, ...
	;
	; This reduced test case triggers the crash before the fix, but does not
	; strictly speaking check that the resulting code is correct.
	; To check that the code is actually correct we would need to check the
	; liveness of the produced code.
	;
	; Currently, we check that after ECX = movl 3, we do not have subb CH,
	; whereas CH could have been redefine in between and that would have been
	; totally fine.
	; <rdar://problem/16582185>
	target datalayout = "e-m:o-p:32:32-f64:32:64-f80:128-n8:16:32-S128"
	target triple = "i386-apple-macosx10.9"

	%struct.A = type { %struct.B, %struct.C, %struct.D, [1 x i8] }
	%struct.B = type { i32, [4 x i8] }
	%struct.C = type { i128 }
	%struct.D = type { {}*, [0 x i32] }
	%union.E = type { i32 }

	; CHECK-LABEL: __XXX1:
	; CHECK: movl $3, %ecx
	; CHECK-NOT: subb %{{[a-z]+}}, %ch
	; Function Attrs: nounwind optsize ssp
	define fastcc void @__XXX1(%struct.A* %ht) #0 {
	entry:
	%const72 = bitcast i128 72 to i128
	%const3 = bitcast i128 3 to i128
	switch i32 undef, label %if.end196 [
	i32 1, label %sw.bb.i
	i32 3, label %sw.bb2.i
	]

	sw.bb.i: ; preds = %entry
	%call.i.i.i = tail call i32 undef(%struct.A* %ht, i8 zeroext 22, i32 undef, i32 0, %struct.D* undef)
	%bf.load.i.i = load i128, i128* undef, align 4
	%bf.lshr.i.i = lshr i128 %bf.load.i.i, %const72
	%shl1.i.i = shl nuw nsw i128 %bf.lshr.i.i, 8
	%shl.i.i = trunc i128 %shl1.i.i to i32
	br i1 undef, label %cond.false10.i.i, label %__XXX2.exit.i.i

	__XXX2.exit.i.i: ; preds = %sw.bb.i
	%extract11.i.i.i = lshr i128 %bf.load.i.i, %const3
	%extract.t12.i.i.i = trunc i128 %extract11.i.i.i to i32
	%bf.cast7.i.i.i = and i32 %extract.t12.i.i.i, 3
	%arrayidx.i.i.i = getelementptr inbounds %struct.A, %struct.A* %ht, i32 0, i32 3, i32 %bf.cast7.i.i.i
	br label %cond.end12.i.i

	cond.false10.i.i: ; preds = %sw.bb.i
	%arrayidx.i6.i.i = getelementptr inbounds %struct.A, %struct.A* %ht, i32 0, i32 3, i32 0
	br label %cond.end12.i.i

	cond.end12.i.i: ; preds = %cond.false10.i.i, %__XXX2.exit.i.i
	%.sink.in.i.i = phi i8** [ %arrayidx.i.i.i, %__XXX2.exit.i.i ], [ %arrayidx.i6.i.i, %cond.false10.i.i ]
	%.sink.i.i = load i8, i8* %.sink.in.i.i, align 4
	%tmp = bitcast i8* %.sink.i.i to %union.E*
	br i1 undef, label %for.body.i.i, label %if.end196

	for.body.i.i: ; preds = %for.body.i.i, %cond.end12.i.i
	%weak.i.i = getelementptr inbounds %union.E, %union.E* %tmp, i32 undef, i32 0
	%tmp1 = load i32, i32* %weak.i.i, align 4
	%cmp36.i.i = icmp ne i32 %tmp1, %shl.i.i
	%or.cond = and i1 %cmp36.i.i, false
	br i1 %or.cond, label %for.body.i.i, label %if.end196

	sw.bb2.i: ; preds = %entry
	%bf.lshr.i85.i = lshr i128 undef, %const72
	br i1 undef, label %if.end196, label %__XXX2.exit.i95.i

	__XXX2.exit.i95.i: ; preds = %sw.bb2.i
	%extract11.i.i91.i = lshr i128 undef, %const3
	br label %if.end196

	if.end196: ; preds = %__XXX2.exit.i95.i, %sw.bb2.i, %for.body.i.i, %cond.end12.i.i, %entry
	ret void
	}

	attributes #0 = { nounwind optsize ssp "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" }

This is an archive of the discontinued LLVM Phabricator instance.

Allow merging of immediates within a basic block for code size savings and reduced footprint.ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 31811

llvm/trunk/lib/Target/X86/X86ISelDAGToDAG.cpp

llvm/trunk/lib/Target/X86/X86InstrArithmetic.td

llvm/trunk/lib/Target/X86/X86InstrInfo.td

llvm/trunk/test/CodeGen/X86/immediate_merging.ll

llvm/trunk/test/CodeGen/X86/remat-invalid-liveness.ll

Allow merging of immediates within a basic block for code size savings and reduced footprint.
ClosedPublic