
[DAGCombiner] Match load by bytes idiom and fold it into a single load. Attempt #2.

Authored by apilipenko on Dec 16 2016, 1:48 PM.



The previous patch got reverted because of a bug. Chandler also requested some changes to the algorithm.

This is an updated patch. The key difference is that collectBitProviders now collects the origin of a single byte, not the whole value. This simplifies the implementation and allows the traversal to stop earlier when we know the result won't be used.

From the original review:

Match a pattern where a wide-type scalar value is loaded by several narrow loads and combined by shifts and ORs. Fold it into a single load, or a load and a bswap if the target supports it.

Assuming little endian target:

i8 *a = ...
i32 val = a[0] | (a[1] << 8) | (a[2] << 16) | (a[3] << 24)
=>
i32 val = *((i32*)a)

i8 *a = ...
i32 val = (a[0] << 24) | (a[1] << 16) | (a[2] << 8) | a[3]
=>
i32 val = BSWAP(*((i32*)a))

Diff Detail


Event Timeline

apilipenko updated this revision to Diff 81793.Dec 16 2016, 1:48 PM
apilipenko retitled this revision from to [DAGCombiner] Match load by bytes idiom and fold it into a single load. Attempt #2..
apilipenko updated this object.
apilipenko updated this object.
apilipenko added a subscriber: llvm-commits.
apilipenko updated this object.Dec 16 2016, 1:51 PM
RKSimon added inline comments.Dec 17 2016, 3:28 PM
4481 ↗(On Diff #81793)

What is the effect of changing this to:

if (LegalOperations && !TLI.isOperationLegal(ISD::LOAD, VT))

Would the legalizer do such a bad job of splitting poorly combined loads/bswaps?

4540 ↗(On Diff #81793)

Please make the assert messages more explanatory.

4559 ↗(On Diff #81793)

Would this work?

if (NeedsBswap && LegalOperations && !TLI.isOperationLegal(ISD::BSWAP, VT))
1 ↗(On Diff #81793)

Additionally, test with an armv6/7 CPU with REV? Same for big-endian tests.

filcab edited edge metadata.Jan 3 2017, 8:34 AM

Looks good in general. Please check out Simon's comments, as I'm curious about them too.

4388 ↗(On Diff #81793)

Nit: We're not "collecting" any more. Maybe {get,calculate,deduce}ByteProvider? (Or something else)
Also fix the comment.

4559 ↗(On Diff #81793)

I wonder if it's useful to generate a bswap only to change it back later. Do you have an example of something llvm already does? Or would this be a future optimization possibility?

RKSimon edited edge metadata. Jan 3 2017, 11:21 AM

The script has been updated to support ARM/AARCH64, so you should be able to use it to regenerate now.

apilipenko updated this revision to Diff 83613.Jan 9 2017, 5:03 AM
apilipenko edited edge metadata.
apilipenko marked 2 inline comments as done.

Address the comments.

Currently supports arm-eabi target only. I left ARM test cases with manually written checks for now.

4481 ↗(On Diff #81793)

This looks like a good idea; it enables combining an i64 pattern into two i32 loads on 32-bit targets (the loads are first combined into a single i64 load, which is then split into two i32 loads).

4559 ↗(On Diff #81793)

As a result we have a single load followed by an instruction sequence doing the swap. E.g. for load_i32_by_i8_bswap from test/CodeGen/ARM/load-combine.ll we'll have:

	ldr	r0, [r0]
	mov	r1, #65280
	mov	r2, #16711680
	and	r1, r1, r0, lsr #8
	and	r2, r2, r0, lsl #8
	orr	r1, r1, r0, lsr #24
	orr	r0, r2, r0, lsl #24
	orr	r0, r0, r1

instead of

	ldrb	r2, [r0, #1]
	ldrb	r1, [r0]
	ldrb	r3, [r0, #2]
	ldrb	r0, [r0, #3]
	lsl	r2, r2, #16
	orr	r1, r2, r1, lsl #24
	orr	r1, r1, r3, lsl #8
	orr	r0, r1, r0

Assuming that shuffling bytes in a register is cheaper than loading from memory, this looks like a generally good transformation.

A few minors - anyone else have any comments?

4503 ↗(On Diff #83613)

Are there any situations where we can use ZEXTLOAD? If so add a TODO comment?

4511 ↗(On Diff #83613)

Avoid a comparison we know the result of:

else if (Chain != LChain)
4518 ↗(On Diff #83613)

Avoid a comparison we know the result of:

else if (!Base->equalBaseIndex(Ptr))
4527 ↗(On Diff #83613)

Loop invariant - pull this out to the top of the function:

bool IsBigEndianTarget = DAG.getDataLayout().isBigEndian();
654 ↗(On Diff #83613)

Add a TODO to the comment

filcab added a comment.Jan 9 2017, 7:20 AM

One minor comment.

4527 ↗(On Diff #83613)

Might as well pull the {Little,Big}EndianByteAt from the loop too.

apilipenko marked 5 inline comments as done.Jan 9 2017, 10:29 AM
apilipenko added inline comments.
4503 ↗(On Diff #83613)

I'm going to add ext loads support in a follow up patch. Left a TODO for now.

apilipenko updated this revision to Diff 83648.Jan 9 2017, 10:31 AM

Address the comments, add a comment that calculateByteProvider recurses over trees, not DAGs.

RKSimon accepted this revision.Jan 9 2017, 11:09 AM
RKSimon edited edge metadata.


This revision is now accepted and ready to land.Jan 9 2017, 11:09 AM
chandlerc edited edge metadata.Jan 13 2017, 7:10 PM

Sorry I didn't see this (much) sooner!

4393–4395 ↗(On Diff #83648)

I agree this doesn't need a set because it is tree structured, but other similar routines at the SDAG layer bound recursions. See for example computeKnownBits. I suspect this code should do something similar.

4446 ↗(On Diff #83648)

So, one path of what this is doing is essentially computing known-zero bits. I wonder if we should really be hand rolling that or whether we should instead use computeKnownBits. We could likely use computeKnownBits in the OR logic and then only handle recursing to find loads in the rest of these paths... Thoughts? maybe not worth it, hard to tell.

4485–4492 ↗(On Diff #83648)

Why only do the legality check when in the legalize phase? When would we want to combine loads to a non-legal integer type?

Is the goal here to combine loads into a too-wide illegal integer type to let the legalizer then split them into legal sized chunks? If so, that at least needs a comment. And in that case, why only up to 64-bit integers?

Alternatively, you could do the legality check in all phases and just never combine loads when the merged value is wider than the legal integer size.

4573–4577 ↗(On Diff #83648)

Shouldn't this check come first, up with the legalization stuff?

apilipenko marked an inline comment as done.Jan 16 2017, 4:45 AM
apilipenko added inline comments.
4446 ↗(On Diff #83648)

It's definitely an option, but I can't justify the change. In follow up changes I'm going to exploit the information about zero bytes in DAGCombiner::MatchLoadCombine to handle partially available patterns. That will introduce another point where I'll need to use computeKnownBits.

4485–4492 ↗(On Diff #83648)

Yes, the goal is to combine to a possibly too-wide load and let the legalizer split it later. This enables us to combine load i64 by i8 to a couple of i32 loads on 32 bit targets. Will add a comment.

The i64 limitation is somewhat arbitrary, just to limit the scope of the transformation; it can be lifted easily. (On the other hand, with the newly introduced depth limit in calculateByteProvider we won't be able to fold patterns much wider than i64.)

4573–4577 ↗(On Diff #83648)

allowsMemoryAccess specifically needs to know the address space and alignment. We don't know them before we do all the computations above.

apilipenko updated this revision to Diff 84547.Jan 16 2017, 4:54 AM
apilipenko edited edge metadata.

Address review comments:

  • Add recursion depth limit
  • Add comments about combine before legalize

Add a comment about the known worklist problem.

@chandlerc, since none of your comments express explicit objections or concerns and the patch has already been accepted, I'm going to go ahead and land it tomorrow. Let me know if I misinterpreted your comments and you do want some changes.

This revision was automatically updated to reflect the committed changes.