This is an archive of the discontinued LLVM Phabricator instance.

[BitfieldShrinking] Shrink Bitfields load/store when the bitfields are legal to access independently
Needs Revision · Public

Authored by wmi on Feb 27 2017, 10:39 AM.

Details

Summary

reduceLoadOpStoreWidth is a useful optimization that already exists in DAGCombiner. It can shrink the bitfield store in the following testcase:

class A {
public:
  unsigned long f1:16;
  unsigned long f2:3;
};
A a;

void foo() {
  // if (a.f1 == 2) return;
  a.f1 = a.f1 + 3;
}

For a.f1 = a.f1 + 3, without reduceLoadOpStoreWidth in DAGCombiner, the code will be:
movl a(%rip), %eax
leal 3(%rax), %ecx
movzwl %cx, %ecx
andl $-65536, %eax # imm = 0xFFFF0000
orl %ecx, %eax
movl %eax, a(%rip)

with reduceLoadOpStoreWidth, the code will be:
movl a(%rip), %eax
addl $3, %eax
movw %ax, a(%rip)

However, if we uncomment the early return above, the load of a.f1 and the store of a.f1 end up in two different BasicBlocks, and reduceLoadOpStoreWidth in DAGCombiner cannot handle that.

This patch redoes the same optimization in InstCombine, so the optimization is not limited by BasicBlock boundaries.
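For reference, here is the cross-block variant being described (just the testcase above with the early return enabled). Once earlier passes combine the two reads of the underlying word, the wide load typically ends up in the entry block feeding the comparison, while the masked update and store sit in the block after the branch, which is why the single-block DAGCombiner transform no longer fires:

void foo() {
  if (a.f1 == 2)      // the wide load feeding this compare lives in the entry block
    return;
  a.f1 = a.f1 + 3;    // the masked update and the store live in this block
}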

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden.
wmi added inline comments.Feb 27 2017, 5:14 PM
lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
1331 ↗(On Diff #89902)

Ah, I had a mental block here. I was thinking MaskLeadOnes had to be an APInt if I used APInt::countLeadingOnes, and APInt doesn't support the % operator.

It is better that way. I will fix it.

1388 ↗(On Diff #89902)

I am not expecting the alignment to increase. I am worried that the original alignment would be an overestimate if applied directly to the new store, causing undefined behavior.
Suppose the original i32 store to address @a has 32-bit alignment. Now we will store an i16 to a.f2, which is at address "@a + 2B"; "@a + 2B" should only get 16-bit alignment.

test/Transforms/InstCombine/bitfield-store.ll
89 ↗(On Diff #89902)

Sorry I don't get the point. Are you suggesting the following?

%bf.set = or i16 %bf.clear3, %bf.value
%bf.set.trunc = trunc i16 %bf.set to i13
store i13 %bf.set.trunc, i13* bitcast (%class.A4* @a4 to i13*), align 8

llvm will still generate the same code:

andl    $8191, %edi             # imm = 0x1FFF
movw    %di, a4(%rip)
efriedma added inline comments.Feb 27 2017, 6:05 PM
lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
1388 ↗(On Diff #89902)

Suppose the original i32 store to address @a has 32-bit alignment. Now we will store an i16 to a.f2, which is at address "@a + 2B"; "@a + 2B" should only get 16-bit alignment.

Suppose the original i32 store to address @a has 8-bit alignment. What is the alignment of "@a + 2B"? (You need to compute the GCD of the offset and the original alignment.)
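A minimal sketch of that computation, assuming the llvm::MinAlign helper from llvm/Support/MathExtras.h (it returns the largest power of two dividing both arguments, which is exactly the GCD needed for power-of-two alignments); the wrapper function name here is illustrative:

#include "llvm/Support/MathExtras.h"

// Alignment we can legally claim for an access at OrigPtr + ByteOffset.
uint64_t computeNarrowAlign(uint64_t OrigAlign, uint64_t ByteOffset) {
  // e.g. MinAlign(4, 2) == 2 (32-bit aligned base, +2B => 16-bit aligned)
  //      MinAlign(1, 2) == 1 (8-bit aligned base, +2B => still only 8-bit)
  return llvm::MinAlign(OrigAlign, ByteOffset);
}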

test/Transforms/InstCombine/bitfield-store.ll
89 ↗(On Diff #89902)

Oh, sorry, this isn't a good example; I mixed up the fields. But consider:

; class ATest {
;   unsigned long f1:13;
;   unsigned long f2:3;
; } atest;
; atest.f2 = n;

You could shrink the store here (trunc to i8).

wmi added inline comments.Feb 27 2017, 10:02 PM
lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
1388 ↗(On Diff #89902)

You are right.

class A {
public:
  unsigned long f1:8;
  unsigned long f2:16;
  unsigned long f3:8;
};
A a;

void foo() {
  a.f2 = 3;
}

An i16 has 16-bit natural alignment, but a.f2 only has 8-bit alignment here.

test/Transforms/InstCombine/bitfield-store.ll
89 ↗(On Diff #89902)

Ah, I see what you mean. In your case we can shrink the store, but we cannot remove the original load and the bit operations doing the masking. I can add that shrinking, but I am not sure whether it is better than not shrinking.

efriedma added inline comments.Mar 1 2017, 12:40 PM
test/Transforms/InstCombine/bitfield-store.ll
89 ↗(On Diff #89902)

It's a substantial improvement if you're transforming from an illegal type to a legal type. (I've been dealing with trying to optimize an i24 bitfield recently; see, for example, test/CodeGen/ARM/illegal-bitfield-loadstore.ll.) In other cases, you're right, it's not obviously profitable.

wmi updated this revision to Diff 90589.Mar 4 2017, 12:46 PM
wmi added a reviewer: eli.friedman.

Update patch according to Eli's comments.

  • Add a safety check to ensure there is no memory-modifying instruction between the load and the store.
  • Extend the shrinking functionality to cover the cases Eli gave me.
  • Refactor the code so that different shrinking requirements can share as much code as possible.
wmi added a comment.Mar 4 2017, 12:49 PM

Although I didn't find regressions in internal benchmark testing, I still moved the transformation to CodeGenPrepare because we want to use TargetLowering information to decide how to shrink in some cases.

mkuper added inline comments.Mar 6 2017, 7:42 PM
lib/CodeGen/CodeGenPrepare.cpp
5719 ↗(On Diff #90589)

We don't have alias analysis in CGP at all, do we?
Maybe it would be better to pull this out somewhere else (late in the pipeline).

5739 ↗(On Diff #90589)

Oh, ok, now I see why Eli suggested MemorySSA.
There has to be a better way to do this. (Although I can't think of one at the moment.)

5823 ↗(On Diff #90589)

Are you sure this does the right thing for xor?

5895 ↗(On Diff #90589)

I don't think this is what you meant to check here. :-)

5912 ↗(On Diff #90589)

Do we want to look through bitcasts? It probably doesn't matter in practice, though.

lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
1388 ↗(On Diff #89902)

Could you add a test-case like this?

mkuper added inline comments.Mar 6 2017, 8:02 PM
lib/CodeGen/CodeGenPrepare.cpp
5823 ↗(On Diff #90589)

Err, never mind, of course it does.

wmi added a comment.Mar 6 2017, 9:03 PM

Could you add a test-case like this?

Sure. I will add such a testcase after the other major issues are solved.

lib/CodeGen/CodeGenPrepare.cpp
5739 ↗(On Diff #90589)

Yes, if the optimization happens before load PRE, then this simple check is enough. If the optimization happens late in the pipeline, we need MemorySSA plus alias queries to do the safety check.

5895 ↗(On Diff #90589)

Oh, thanks for catching the stupid mistake.

5912 ↗(On Diff #90589)

Yes, I do. I have seen cases that need it.

wmi updated this revision to Diff 93018.Mar 24 2017, 3:53 PM
wmi added a reviewer: chandlerc.

Revamp the patch.

  • Extend bitfield store shrinking to handle the and(or(and( ... or(load, C_1), MaskedVal_1), ..., C_N), MaskedVal_N) pattern.
  • Add bitfield load shrinking.
  • Use MemorySSA to do the safety check and maintain it on the fly.

With all these changes, LLVM can now catch most of the shrinking opportunities in the testcase http://lists.llvm.org/pipermail/llvm-commits/attachments/20170307/23ad5702/attachment-0001.cc while still keeping its bitfield coalescing ability, by putting the shrinking pass late in the pipeline.

I need to add testcases for the bitfield load shrinking. I will send out a patch update soon.

wmi retitled this revision from [InstCombine] Redo reduceLoadOpStoreWidth in instcombine for bitfield store optimization. to [BitfieldShrinking] Shrink Bitfields load/store when the bitfields are legal to access independently.Mar 24 2017, 3:57 PM
chandlerc edited edge metadata.Mar 25 2017, 7:25 AM

Thanks a lot for working on this, first round of comments!

include/llvm/CodeGen/Passes.h
67–68

Since this is a CodeGen pass, the code should live in lib/CodeGen rather than in lib/Transforms.

lib/CodeGen/TargetPassConfig.cpp
488–490

This should probably be predicated on getOptLevel() much like below?

lib/Transforms/Scalar/BitfieldShrinking.cpp
1 ↗(On Diff #93018)

I understand that the motivation here are bitfield accesses, but that isn't how we should describe the pass IMO.

This is a generic pass to narrow memory accesses, and I think you should name it and document it accordingly.

Naturally, you can still mention bitfields as one of the motivations and to help explain the specific patterns that are handled here. But if we have some other memory access shrinking we want to do, I would imagine that we would want to add it to this pass.

This will probably need to be propagated through many of the comments here.

10–19 ↗(On Diff #93018)

See above about re-focusing the documentation here on the generic memory access narrowing, and making the details about bitfields part of the motivation.

I would also make sure to include here a high level overview of the approach / algorithm used. Things like the fact that this uses MemorySSA and is specifically designed to handle shrinking across control flow seems important.

I'd also suggest making this a \file doxygen comment.

54 ↗(On Diff #93018)

Use C++ struct naming rather than a C-style typedef of an anonymous struct.
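For illustration (the struct name and fields are borrowed from the bit-range discussion later in this review, not the patch's exact layout):

// C-style typedef of an anonymous struct:
typedef struct {
  unsigned Shift;
  unsigned Width;
} BitRange;

// Preferred C++ style:
struct BitRange {
  unsigned Shift;
  unsigned Width;
};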

232–235 ↗(On Diff #93018)

While I generally like the use of lambdas to help factor this code, I find it really confusing that the parameters which change with each loop iteration are captured by reference and so change implicitly.

I would prefer to pass parameters that are fundamentally the input to the lambda as actual parameters, and use capture more for common context that isn't really specific to a particular call to the lambda. Does that make sense?
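A generic sketch of the difference (splitAt, Shift, and Width are made-up names for illustration):

// Harder to follow: Shift and Width are loop state mutated on every iteration,
// and the lambda picks up their new values implicitly through the capture.
auto TrySplitCaptured = [&]() { return splitAt(DL, Shift, Width); };

// Clearer: the per-call inputs are explicit parameters; only stable context
// such as the DataLayout is captured.
auto TrySplitExplicit = [&DL](unsigned Shift, unsigned Width) {
  return splitAt(DL, Shift, Width);
};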

266–269 ↗(On Diff #93018)

We try to avoid doing isa<Foo>(...) and then cast<Foo>(...) in LLVM (it adds overhead to asserts builds that can really add up and is a bit redundant). Instead, use dyn_cast here?
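As a generic illustration of the idiom (LoadInst and use() are just placeholders):

// Redundant: isa<> and cast<> each re-check the type, and cast<> asserts in
// debug builds.
if (isa<LoadInst>(V)) {
  LoadInst *LI = cast<LoadInst>(V);
  use(LI);
}

// Preferred: one query, and the typed pointer is scoped to the success path.
if (auto *LI = dyn_cast<LoadInst>(V))
  use(LI);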

488–489 ↗(On Diff #93018)

Is this valid at this point? It seems like it shouldn't be able to happen. I'd either use llvm_unreachable to mark that or add a comment explaining what is happening here.

568 ↗(On Diff #93018)

Worklist is a more common name for this kind of vector in LLVM.

887–889 ↗(On Diff #93018)

The pass manager already provides facilities for printing before and after passes -- is this needed?

wmi added a comment.Apr 3 2017, 4:07 PM

Chandler, thanks for the review, and sorry about the delay in replying. It took me a while to fix some issues in the patch that I found while adding tests for the load shrinking part and running the unit tests.

include/llvm/CodeGen/Passes.h
67–68

Fixed.

lib/CodeGen/TargetPassConfig.cpp
488–490

Fixed.

lib/Transforms/Scalar/BitfieldShrinking.cpp
1 ↗(On Diff #93018)

You are right. The impact of the pass is not limited to bitfield access. I renamed the pass to MemAccessShrink and changed the comments accordingly.

10–19 ↗(On Diff #93018)

Added a high-level overview of the approach used, as suggested, and made it a \file doxygen comment.

54 ↗(On Diff #93018)

Fixed.

232–235 ↗(On Diff #93018)

It makes sense. Fixed.

266–269 ↗(On Diff #93018)

Fixed.

488–489 ↗(On Diff #93018)

Fixed.

568 ↗(On Diff #93018)

Fixed.

887–889 ↗(On Diff #93018)

The printing was removed.

wmi updated this revision to Diff 93973.Apr 3 2017, 4:44 PM
  • Address Chandler's comments.
  • Fix unittest errors.
  • Add unittests for the load shrinking part. Add the original motivating case as a unittest.
  • Add cost evaluation for the case where there is a multiple-use node inside the shrinking pattern.

I'll find some time to look at the core algorithm later.

include/llvm/Target/TargetLowering.h
1973

I'm not sure I see the point of this hook. Every in-tree target has cheap i8 load/store and aligned i16 load/store operations. And we have existing hooks to check support for misaligned operations.

If there's some case I'm not thinking of, please add an example to the comment.

lib/CodeGen/MemAccessShrinking.cpp
53

"try", not "tries".

1015

If an instruction has no uses and isn't trivially dead, we're never going to erase it; no point in adding it to CandidatesToErase.

1028

Should we clear CandidatesToErase here, as opposed to modifying it inside the loop?

Digging into the code next, but wanted to send some comments just on terminology and the documentation while I'm doing that.

lib/CodeGen/MemAccessShrinking.cpp
16

nit: s/now//

27

This sentence doesn't really parse for me, it reads as a command rather a description and comments typically are descriptive.

47

"no more change has happened" -> "no more changes have happened"

51

"mergd" -> "merged"

"load/store" needs to be "loads and stores" or "load/store instructions".

53

I would reword this sentence to be a bit easier to read:

It provides scalable and precise safety checks even when we try to insert
a smaller access into a block which is many blocks away from the original
access.
54

A comment on the terminology throughout this patch: the adjective describing something which has been reduced in size in the past is "shrunken". That said, if this is awkward to use, I might use the adjective "smaller". But "shrinked" isn't a word in English.

wmi added inline comments.Apr 21 2017, 4:31 PM
include/llvm/Target/TargetLowering.h
1973

It is because of some testcases for amdgpu, like the one below:

define void @s_sext_in_reg_i1_i16(i16 addrspace(1)* %out, i32 addrspace(2)* %ptr) #0 {
  %ld = load i32, i32 addrspace(2)* %ptr
  %in = trunc i32 %ld to i16
  %shl = shl i16 %in, 15
  %sext = ashr i16 %shl, 15
  store i16 %sext, i16 addrspace(1)* %out
  ret void
}

code with the patch:
s_load_dwordx2 s[4:5], s[0:1], 0x9
s_load_dwordx2 s[0:1], s[0:1], 0xb
s_mov_b32 s7, 0xf000
s_mov_b32 s6, -1
s_mov_b32 s2, s6
s_mov_b32 s3, s7
s_waitcnt lgkmcnt(0)
buffer_load_ushort v0, off, s[0:3], 0
s_waitcnt vmcnt(0)
v_bfe_i32 v0, v0, 0, 1
buffer_store_short v0, off, s[4:7], 0
s_endpgm

code without the patch:
s_load_dwordx2 s[4:5], s[0:1], 0x9
s_load_dwordx2 s[0:1], s[0:1], 0xb
s_mov_b32 s7, 0xf000
s_mov_b32 s6, -1
s_waitcnt lgkmcnt(0)
s_load_dword s0, s[0:1], 0x0
s_waitcnt lgkmcnt(0)
s_bfe_i32 s0, s0, 0x10000
v_mov_b32_e32 v0, s0
buffer_store_short v0, off, s[4:7], 0
s_endpgm

amdgpu codegen chooses to use buffer_load_ushort instead of s_load_dword and generates a longer code sequence. I know almost nothing about amdgpu, so I simply added the hook and only focused on the architectures I am more familiar with until the patch is in better shape and stable.

lib/CodeGen/MemAccessShrinking.cpp
1015

An instruction which is not trivially dead for now may become dead after other instructions in CandidatesToErase are removed. That is why I want to add it to CandidatesToErase.

1028

Ah, right. Actually, I shouldn't use a range-based loop, since the iterator will be invalidated by insertions and deletions.

efriedma added inline comments.Apr 21 2017, 5:14 PM
include/llvm/Target/TargetLowering.h
1973

Huh, GPU targets are weird like that. I would still rather turn it off for amdgpu, as opposed to leaving it off by default.

lib/CodeGen/MemAccessShrinking.cpp
1015

OpI has no uses here. The only way an instruction can have no uses and still not be trivially dead is if it has side-effects. Deleting other instructions won't change the fact that it has side-effects.

arsenm added inline comments.Apr 21 2017, 5:45 PM
include/llvm/Target/TargetLowering.h
1973

32-bit loads should not be reduced to a shorter width. Using a buffer_load_ushort is definitely worse than using s_load_dword. There is a target hook that is supposed to avoid reducing load widths like this

wmi added inline comments.Apr 21 2017, 6:07 PM
include/llvm/Target/TargetLowering.h
1973

Matt, thanks for the explanation.

I guess the hook is isNarrowingProfitable. However, the hook I need is a little different: I need to know whether narrowing is expensive enough that we should avoid it. isNarrowingProfitable on x86 says i32 --> i16 is not profitable and maybe slightly harmful, but it is not very harmful, and the benefit of narrowing may outweigh the cost.

lib/CodeGen/MemAccessShrinking.cpp
1015

You are right. The entire logic about CandidatesToErase is problematic. I will fix it.

I'm still working on this, but since Wei mentioned he is looking at fixing the CandidatesToErase stuff, I wanted to go ahead and send these comments -- there is a significant one w.r.t. the deletion stuff as well.

lib/CodeGen/MemAccessShrinking.cpp
85–86

Are both of these really useful for debugging? We already have a flag that controls whether the pass is enabled or not.

89–90

"mod" range seems like an odd name?

93

Maybe a comment on what this is used for?

100–101

We generally use initializer sequences:

MemAccessShrinkingPass(const TargetMachine *TM = nullptr)
    : FunctionPass(ID), TM(TM) {

Also, since this is an internal type I'd skip the default argument for TM as it seems to not really give you much.

132

I would try to expand unusual acronyms: 'du-chain' -> 'Def-Use chain'.

984–997

Rather than reproduce the body of RecursivelyDeleteTriviallyDeadInstructions here, how about actually refactoring that routine to have an overload that accepts a SmallVectorImpl<Instruction *> list of dead instructions, then you can hand your list to this routine rather than writing your own.

Specifically, I think (as Eli is alluding to below) you should only put things into your CandidatesToErase vector (which I would rename DeadInsts or something) when they satisfy isInstructionTriviallyDead. Even if deleting one of the instructions is necessary to make one of the candidates dead, we'll still visit it because these routines *recursively* delete dead instructions already.
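A sketch of what the caller side could look like after such a refactoring (the worklist-taking overload is the proposal here, not an existing signature; OpI and TLI are assumed to be in scope):

SmallVector<Instruction *, 16> DeadInsts;

// Only queue instructions that are already trivially dead; anything that only
// becomes dead once these are erased is picked up by the recursive deletion.
if (isInstructionTriviallyDead(OpI, TLI))
  DeadInsts.push_back(OpI);

// Proposed overload: hand the whole worklist to the shared utility instead of
// re-implementing its body locally.
RecursivelyDeleteTriviallyDeadInstructions(DeadInsts, TLI);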

1052–1054

This pattern seems confusing. How about using a lambda (or even an actual separate function) to model a single pass over the function, so that it can just return a single Changed variable?

1057–1059

I would still use a for loop here, and importantly capture rend early:

for (auto InstI = BB->rbegin(), InstE = BB->rend(); InstI != InstE;)
  ...(*InstI++);
1062

Why not just one Changed variable?

wmi added a comment.Apr 21 2017, 10:11 PM

Thanks for bearing with my poor English. I will fix the terminology and the comments according to your suggestions.

lib/CodeGen/MemAccessShrinking.cpp
85–86

They are actually there for my own convenience when debugging. I will remove them.

89–90

Yes, I will change them.

93

Sure.

100–101

Ok.

132

Will fix it.

984–997

Ok, I will do some refactoring based on RecursivelyDeleteTriviallyDeadInstructions. Another motivation is that I need to update MSSA while deleting instructions; we don't have a callback to remove the MemoryAccess when we delete a memory instruction.

1052–1054

It uses the same iterative pattern as CodeGenPrepare, but maybe the iterative pattern in InstCombine is clearer -- only one Changed variable is used there. I will wrap a single pass in a function. I will probably rename tryShrinkOnInst to tryShrinkOnFunc and use it to wrap a single pass; the existing tryShrinkOnInst is simple enough that I can inline its content into tryShrinkOnFunc.

1057–1059

ok, will change it.

First off, really sorry to keep sending *partial* code reviews. =[ I again didn't quite have enough time to do a full review of the patch (it is a bit large) but wanted to at least send out everything I have so that you aren't blocked waiting on me to produce some comments. =] I'll try again tomorrow to make more progress here although it may start to make sense for me to wait for an iteration as one of the refactorings I'm suggesting here will I think change the structure quite a bit.

In D30416#734516, @wmi wrote:

Thanks for bearing with my poor English.

Please don't stress at all. =D I think reviewing comments, phrasing, etc., needs to happen in any code review. The whole point is to figure out how to write comments and such in a way that make sense to others, and speaking for myself at least, no level of knowledge about the English language is enough there -- it really requires someone else reading it to figure this out.

lib/CodeGen/MemAccessShrinking.cpp
375–397
This comment has been deleted.
418–431

Even after reading your comment I'm not sure I understand what is driving the complexity of this match.

Can you explain (maybe just here in the review) what patterns you're trying to handle that are motivating this?

I'm wondering whether there is any canonicalization that can be leveraged (or added if not already there) to reduce the complexity of the pattern here. Or if we really have to handle this complexity, what the best way to write it and/or comment it so that readers understand the result.

454–457

What happens when both are true? It looks like we just overwrite the 'MR' code?

I feel like both of these analyze...() methods should return the ModRange struct rather than having an output parameter.

chandlerc added inline comments.Apr 24 2017, 6:41 PM
lib/CodeGen/MemAccessShrinking.cpp
375–397

After reading more of this routine, I think you should split it into two routines, one that tries to handle the first pattern, and one that only handles the second pattern.

You can factor the rewriting code that is currently shared by both patterns into utility functions that are called for both. But the logic of this routine is harder to follow because you always have this state to hold between doing two different kinds of transforms.

408

TBits doesn't really give me enough information as a variable name... Maybe StoreBitSize?

468–471

Should this be testing against the DataLayout rather than hard coded 8, 16, and 32? What if 64 bits is legal and that's the width of the MR?

605–608

This comment and function name don't really add up for me...

There is no Cst parameter here. I assume you mean AI?

Also, having a flag like AInB seems to make this much more confusing to read. Why not just have two routines, one for each case?

My guess at what this is actually trying to do is areConstantBitsWithinModRange and areConstantBitsOutsideModRange?

620–624

Maybe a method and use the term 'disjoint'? MR1.isDisjoint(MR2) reads a bit better to me.

627–629

This makes this a very confusing API -- now it isn't really just a predicate, it also computes the insertion point...

Also, why do you need to compute a particular insertion point within a basic block? Can't you just always insert into the beginning of the basic block and let the scheduler do any other adjustments?

arsenm added inline comments.Apr 24 2017, 7:36 PM
include/llvm/Target/TargetLowering.h
1973

The hook I was thinking of was shouldReduceLoadWidth. s_load_dword uses a different cache with much faster access than the buffer instruction if it can be used

wmi added a comment.Apr 25 2017, 10:29 AM

Thanks for drafting the comments. They are clearly more descriptive and easier to follow, and I like the variable names (LargeVal and SmallVal), which are much better than what I used (OrigVal and MaskedVal). I will rewrite the comments based on your draft.

lib/CodeGen/MemAccessShrinking.cpp
375–397

I tried that splitting before but I had to move many temporaries to MemAccessShrinkingPass. However, some temporaries are only used by store shrinking but not load shrinking, so that looked a little weird. I agree the logic will be clearer if it is split into two routines. I will try again, and see if I can separate some common temporaries into MemAccessShrinkingPass and leave the special temporaries as parameters, or create a store shrinking class to keep the temporaries.

408

Ok, will change it.

418–431

I borrow the template in your comments to explain: store(or(and(LargeVal, MaskConstant), SmallVal), address)
The case is:

store(or_1(and_1(or_2(and_2(load, -65281), Val1), -256), and_3(Val2, 7)))

The two operands of "or_1" are "and_1" and "and_3", but the matcher doesn't know which subtree, and_1 or and_3, contains the LargeVal. I want or_2 to be matched as the LargeVal. This is a common pattern after bitfield load/store coalescing.

But I realize, while explaining this to you, that I can split the complex pattern matching above into two matches, which may be simpler.

bool OrAndPattern = match(Val, m_c_Or(m_And(m_Value(LargeVal), m_ConstantInt(Cst)), m_Value(SmallVal)));
if (match(SmallVal, m_c_And(m_c_Or(m_And(m_Value(), m_Value()), m_Value()), m_Value())))
  std::swap(SmallVal, LargeVal);

454–457

We would just overwrite "MR", but that is still not good for "OrAndPattern". I will change the second "if" to an "else if".

468–471

That is better. Will fix it.

605–608

Sorry, Cst means AI here.
The function is doing both areConstantBitsWithinModRange and areModRangeWithinConstantBits.

620–624

Ok.

627–629

Good point. If there is a clobber instruction but it is in the same block as the "To" instruction, I can simply insert at the beginning of the "To" instruction's block, and NewInsertPt is not needed.

But I still prefer to use an insertion point closer to the "To" instruction when there is no clobber instruction, because the IR looks better. That means at least a flag indicating whether I need to insert at the beginning of the "To" instruction's block has to be returned.

I.e., I can simplify "Instruction *&NewInsertPt" to a flag. Is that API acceptable?

wmi added inline comments.Apr 28 2017, 4:11 PM
include/llvm/Target/TargetLowering.h
1973

shouldReduceLoadWidth is a hook on TargetLowering, but I need a hook on TargetLoweringBase, which can be used from an LLVM IR pass.

I cannot change shouldReduceLoadWidth into a TargetLoweringBase hook because of the way x86 uses it, so I copied the logic from AMDGPUTargetLowering::shouldReduceLoadWidth into AMDGPUTargetLowering::isNarrowingExpensive. I can make shouldReduceLoadWidth call isNarrowingExpensive as an NFC change. Is that ok?

lib/CodeGen/MemAccessShrinking.cpp
418–431

I find I still have to keep the original complex pattern. Now I remember where the real difficulty is:

For a case like store(or_1(and_1(or_2(and_2(load, -65281), Val1), -256), and_3(Val2, 7))), I want to match LargeVal to or_2(and_2(load, ...)).

But I cannot use match(Val, m_c_Or(m_And(m_c_Or(m_And(...), ...), ...), ...)) because I have no way to get the intermediate results of the match; for example, I cannot bind LargeVal to the inner m_c_Or. So I have to split the match into multiple matches. That is where the complexity comes from.

468–471

For x86-64, DataLayout works fine. However, for other architectures, like ARM, the datalayout is

target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"

DataLayout::LegalIntWidths will only contain the native integer widths, given by "n32", so only 32 is a legal integer width for ARM.

For x86-64, the datalayout is:
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

because of "n8:16:32:64", 8, 16, 32, 64 are all legal integer width.

wmi updated this revision to Diff 97170.Apr 28 2017, 4:20 PM

Address Eli, Matt and Chandler's comments.

Some major changes:

  • Changed a lot of comments.
  • Split reduceLoadOpsStoreWidth into two routines plus several other helper functions.
  • Refactored RecursivelyDeleteTriviallyDeadInstructions.
chandlerc requested changes to this revision.May 9 2017, 5:33 PM

Somewhat focused on the store side. Near the bottom is a high-level comment about the load shrinking approach.

lib/CodeGen/MemAccessShrinking.cpp
104

Still should omit the = nullptr here since this is an internal type.

154–157

These seem like helper functions that don't actually need to be part of the class at all. Maybe make them static free functions?

(This may be true for some of the other routines in this file.)

205–207

It seems much more idiomatic to return the bool indicating whether a valid BitRange was computable, and if true, set up the values in an output parameter.

Or even better, you could return an Optional<BitRange>, and return None when the requirements aren't satisfied.
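A sketch of the suggested shape (BitRange's fields, the analysis body, and shrinkStore are illustrative):

Optional<BitRange> analyzeStoreMask(const APInt &Mask) {
  BitRange BR;
  // ... compute BR.Shift and BR.Width from Mask ...
  if (/* requirements on Mask are not satisfied */ false)
    return None;
  return BR;
}

// Caller:
if (Optional<BitRange> BR = analyzeStoreMask(Mask))
  shrinkStore(*BR);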

209

I know other passes use the variable name Cst but I'd suggest using just C for generic constants or some more descriptive term like you use elsewhere like Mask.

212–215

Do you really need to extend or truncate here? Surely the type system has already caused the constant to be of the size you want? If so, I'd just assert it here. Maybe could directly pass a 'const &' APInt in as the parameter, letting you call the parameter Mask above?

216–217

I would call these MaskLeadingOnes and MaskTrailingOnes.

218–220

I'm having trouble understanding the logic here in the case where there are leading ones. Here is my reasoning, but maybe I've gotten something wrong here:

Shifting right will remove leading ones, but you're shifting right the number of *trailing* ones... Shouldn't that be *leading ones*? And won't the result of a shift *right* be to place the middle zero sequence at the least significant bit, meaning you would want to count the *leading* zeros?

Put differently, arithmetic shift is required to not change the most significant bit, so doing an arithmetic shift right based on how many ones are trailing, seems like it will never change the count of trailing zeros.

If this is correct, then this is a bug and you should add some test cases that will hit this bug.

But regardless of whether my understanding is correct or there is a bug here, I think this can be written in a more obvious way:

unsigned MaskMidZeros = BitSize - (MaskLeadingOnes + MaskTrailingOnes);

And then directly testing whether they are all zero:

if (Mask == APInt::getBitsSet(BitSize, MaskLeadingOnes,
                              MaskLeadingOnes + MaskMidZeros)) {
223–225

Why would we see an all-ones mask? Shouldn't that have been eliminated earlier? It seems like we could just bail in this case.

246

The idiomatic way to test this with APInt is BitMask.isSubsetOf(KnownZero).

Also, it would be good to use early-exit here. It *sounds* like you are testing whether it is valid to do anything, but that isn't clear when you have set up members of BR here before returning.

260

When you have alternates, the pattern notation is a bit confusing. I'd just say something like Analyze <bop>(load P, \p Cst) where <bop> is either 'or', 'xor', or 'and'..

261

This isn't really about whether the original value is loaded or not, right? It is just bounding the changed bits?

I'd explain it that way. You'll mention the load when you use it.

263–264

Maybe a better name of this function would be: computeBopChangedBitRange?

265

Same comment above about just asserting the correct bitsize and passing the APInt Mask in directly.

266

Why not pass the argument as a BinaryOperator?

267–268

Might be nice to add a comment explaining the logic here.

Something like:

Both 'or' and 'xor' operations only mutate when the operand has a one bit.
But 'and' only mutates when the operand has a zero bit, so invert the
constant when the instruction is an and so that all the (potentially)
changed bits are ones in the operand.
284

Why the PowerOf2Ceil here? Will the actual store used have that applied? If the actual store has that applied, why don't we want to consider that as BitSize so we're free to use that larger size for the narrow new type?

286–289

As the comment explains, this lambda is actually computing the Shift. But the name seems to indicate it is just a predicate testing whether the old range is covered by the new one.

Also, why does the old BR need to be passed in as an argument, isn't that something that can be captured? I actually like passing NewBR in here to show that it is what is *changing* between calls to this routine. But it seems awkward to set up NewBR before this lambda (which would allow it to be implicitly captured) and then call it with a parameter name that shadows it, thereby avoiding the capture. I'd consider whether you want to sink NewBR down or otherwise more cleanly handle it in the loop.

Nit pick: we typically name lambdas like variables with FooBar rather than like functions with fooBar.

290

What about platforms with support for unaligned loads? Probably best to just leave a FIXME rather than adding more to this patch, but it seems nice to mention that technique.

As an example, on x86, if you have a bitfield that looks like:

struct S {
  unsigned a : 48;
  unsigned b : 48;
  unsigned c : 32;
};

It seems likely to be substantially better to do a single 8-byte load and mask off the high 2 bytes when accessing b than to do two nicely aligned 8-byte loads and all the bit math to recombine things.

329

Pass BR by value? (or make it a const reference, but it seems small)

331–335

Generally we prefer early returns to state. That would make this:

if (...)
  return ...;

return ...;
345–346

You can replace uses of this with: V1->stripPointerCasts() == V2->stripPointerCasts(). This will be more powerful as well.

370–371

Maybe this is just a strange API on MemorySSA, but typically I wouldn't expect a lack of dominance to indicate that no access between two points exists.

How does MemorySSA model a pattern that looks like:

From  x 
 \   /
  \ /
   A
   |
   |
   To

Where A is a defining access between From and To, but I wouldn't expect From to dominate A because there is another predecessor x.

393

StOffset seems an odd name if this is used to create new pointers for loads as well as stores.

402

No need to call it uglygep. If you want a clue as to the types, maybe rawgep or bytegep.

407–408

This comment mentions LargeVal but that isn't an argument?

418–431

It would be really nice if LLVM would canonicalize in one way or the other so you didn't have to handle so many variations. Asking folks about whether we can/should do anything like that.

But I think the bigger question is, why would only two layers be enough? I feel like there is something more general here that will make explaining everything else much simpler.

Are you looking for a load specifically? Or are you just looking for one side of an or which has a "narrow" (after masking) and?

If the former, maybe just search for the load?

If the latter, maybe you should be just capturing the two sides of the or, and rather than looking *explicitly* for an 'and', instead compute whether the non-zero bits of one side or the other are "narrow"?

422

This reads like a fragment, were there supposed to be more comments before this line?

557–558

Keeping this close to extendBitRange would make it a lot easier to read. Also, why have two functions at all? It appears this is the only thing calling extendBitRange. (I'm OK if there is a reason, just curious what it is.)

562–565

I'm surprised this doesn't just fall out from the logic in extendBitRange.

575

Is StoreShrunkInfo buying much here? It seems to mostly be used in arguments, why not just pass the argument directly? The first bit of the code seems to just unpack everything into local variables.

587–588

Doesn't MinAlign handle 0 correctly so that you can just do this unconditionally?

593–595

Isn't this only called when we need to insert two stores?

596–606

It feels like all of this could be factored into an 'insertStore' method? In particular, the clone doesn't seem to buy you much as you rewrite most parts of the store anyways.

This could handle all of the MemorySSA updating, logging, etc.

662–663

Rather than cloning and mutating, why not just build a new load? The IRBuilder has a helpful API here.
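For illustration, building the narrow load directly (the variable names and the narrow type/pointer computation are assumptions, not the patch's code):

IRBuilder<> Builder(&OrigLoad);                 // insert right before the original load
Type *NarrowTy = Builder.getIntNTy(NarrowBitSize);
Value *NarrowPtr =
    Builder.CreateBitCast(NewBasePtr, NarrowTy->getPointerTo(AddrSpace));
LoadInst *NewLoad =
    Builder.CreateLoad(NarrowTy, NarrowPtr, OrigLoad.getName() + ".shrunk");
// Alignment (and any metadata worth preserving) still has to be set explicitly.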

679–687

There is no need to handle constants specially. The constant folder will do all the work for you.

690–705

I think the amount of code that is special cased here for one caller of this routine or the other is an indication that there is a better factoring of the code.

If you had load insertion and store insertion factored out, then each caller could cleanly insert the narrow load, compute the narrow store (differently), and then insert it.

Does that make sense? Maybe there is a reason why that doesn't work well?

729–730

Might read more easily as: "Assuming that \p AI contains a single sequence of bits set to 1, check whether the range \p BR is covered by that sequence."

732–733

It seems more obvious to me to test this the other way:

BR.Shift >= AI.countLeadingZeros() &&
BR.Shift + BR.Width < (AI.getBitWidth() - AI.countTrailingZeros())

Is this not equivalent for some reason? (Maybe my brain is off...)

The reason I find this easier to read is because it seems to more directly test: "is the start of the BitRange after the start of the 1s, and is the end of the BitRange before the end of the 1s.".

742–745

There is no comment about the cost of this routine.

It looks *really* expensive. It appears to walk all transitive predecessors of the block containing To. So worst case, every basic block in the function. I see this called in several places from inside of for-loops. Is this really a reasonable approach?

Why aren't we just walking the def-use chain from MemorySSA to figure this kind of thing out in a much lower time complexity bound? Like, shouldn't we just be able to walk up defs until we either see a clobber or From?
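For the simple case where both endpoints are themselves memory accesses, the walk described here could look roughly like the sketch below (an illustration only; the actual code also has to handle queries anchored at non-memory instructions, as discussed later in the thread):

#include "llvm/Analysis/MemorySSA.h"
using namespace llvm;

// Returns true if some definition other than From may clobber the location
// read by To; a MemoryPhi result is treated conservatively as a clobber.
static bool hasClobberBetween(Instruction &From, Instruction &To, MemorySSA &MSSA) {
  MemoryAccess *Clobber =
      MSSA.getWalker()->getClobberingMemoryAccess(MSSA.getMemoryAccess(&To));
  return Clobber != MSSA.getMemoryAccess(&From) && !MSSA.isLiveOnEntryDef(Clobber);
}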

762–763

You can just insert -- that will return whether it succeeded in inserting the block.

770

Naming convention.

780–781

So, each of these dominates queries are *incredibly* slow. They require linearly walking every instruction in the basic block (worst case).

Why doesn't MemorySSA handle this for you? (Maybe my comment above about using MemorySSA will obviate this comment though.)

946

It would be much more clear for this to be a parameter rather than an implicit parameter via class member. For example, multiple uses *of what*?

967–997

Rather than re-implementing all of this logic, can you re-use the existing demanded bits facilities in LLVM?

For example, I think you can use the DemandedBits analysis, walk all loads in the function, and then narrow them based on the demanded bits it has computed. Because of how DemandedBits works, it is both efficient and very powerful. It can handle many more patterns.

Thinking about this, I suspect you'll want to do two passes essentially. First, narrow all the *stores* that you can. This will likely be iterative. Once that finishes, it seems like you'll be able to then do a single walk over the loads with a fresh DemandedBits analysis and narrow all of those left. You'll probably want to narrow the stores first because that may make bits stop being demanded. But I don't see any way for the reverse to be true, so there should be a good sequencing.

To make the analysis invalidation stuff easier, you may actually need this to actually be two passes so that the store pass can invalidate the DemandedBits analysis, and the load pass can recompute it fresh.

Does that make sense?

If so, I would suggest getting just the store shrinking in this patch, and adding the load shrinking in a follow-up patch. I'm happy for them to be implemented in a single file as they are very similar and it's good for people to realize they likely want *both* passes.
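A rough sketch of the load-side pass described above (the pass context, DB, DL, and the legality check are all illustrative, not the patch):

// Second phase: with stores already narrowed and a fresh DemandedBits result,
// find loads whose live bits fit in a narrower legal integer type.
for (Instruction &I : instructions(F)) {
  auto *LI = dyn_cast<LoadInst>(&I);
  if (!LI || !LI->isSimple())
    continue;
  APInt Demanded = DB.getDemandedBits(LI);
  if (Demanded == 0 || Demanded.isAllOnesValue())
    continue;                              // dead load, or every bit is used
  unsigned Lo = Demanded.countTrailingZeros();
  unsigned Hi = Demanded.getBitWidth() - Demanded.countLeadingZeros();
  unsigned NarrowBits = PowerOf2Ceil(Hi - Lo);
  if (!DL.isLegalInteger(NarrowBits))
    continue;
  // ... emit a narrow load at byte offset Lo / 8 and rewrite the users ...
}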

998

Inst would seem like a more common variable name here.

1144–1145

Comments would be good explaining the very particular iteration order.

1183–1184

Do you want to run tryShrink again just because you removed dead instructions?

If so, do you want to remove dead instructions on each iteration instead of just once tryShrink doesn't make a change?

This revision now requires changes to proceed.May 9 2017, 5:33 PM
wmi added a comment.May 10 2017, 10:43 AM

Chandler, thanks for the comments. They are very helpful. I will address them in the next revision. I have only replied to the comments where I had questions or concerns.

lib/CodeGen/MemAccessShrinking.cpp
104

I cannot omit it, because in INITIALIZE_TM_PASS_END, callDefaultCtor<passName> requires the parameter to have a default value.

218–220

Shifting right will remove leading ones, but you're shifting right the number of *trailing* ones... Shouldn't that be *leading ones*? And won't the result of a shift *right* be to place the middle zero sequence at the least significant bit, meaning you would want to count the *leading* zeros?

I think shifting right will remove trailing ones? And after the shift (Mask.ashr(MaskTrailOnes)), the middle zeros are at the least significant bits, so they are trailing zeros, right?

But like you said, I should rule out the all-zero/all-one cases separately so the logic becomes clearer.

370–371

The case will not happen because we ensure From dominates To before calling the function. You are right, it is better to add an assertion at the entry of the function to prevent misuse of the API.

596–606

I use clone here just to duplicate the subclass data like volatile and ordered.

742–745

That is because the instruction To here may not be a memory access instruction (it is probably an And or Trunc instruction, which indicates only some bits of the input are demanded), so we cannot get a MemoryAccess for it. Note that hasClobberBetween is overloaded and there are two versions. The other version, which walks the MSSA def-use chain, is the one used in the for-loops you saw; this higher-cost version is not used in a loop. Besides, we only check the MSSA DefList in each BB, so the worst-case complexity is the number of memory access instructions in the function, which is usually much less than the total number of instructions.

946

MultiUsesSeen is not changed for every instruction. It records whether a previous instruction on the chain was found to have multiple uses while we walk the chain bottom-up.

r1 = ...;
r2 = r1 + r3;
r4 = r2 + r5;

If r2 has multiple uses, neither r2 = r1 + r3 nor r1 = ... can be removed after the shrinking.

967–997

I considered the demanded bits facilities before, but I found they only simplify the code a little. Finding the demanded bits of the load is only a small part of the work. Most of the complexity comes from figuring out which ops in the sequence on the Def-Use chain change the demanded bits. For example, if we see shifts, we may clear some demanded bits in less significant positions to zero because we shift right and then shift left; because that changes the demanded bits, we must include the shifts in the shrunken code sequence. Similarly, if we see an Or(And(Or(And(...)))) pattern, we want to know that the bits changed by the Or(And()) are disjoint from the demanded bits; only when that is true can we omit the Or(And(...)) pattern from the final shrunken code sequence. Another reason is that demanded bits analysis may not be very cheap. For memory shrinking, only a few patterns like and/trunc are common enough to be useful for the shrinking, so we don't actually need a fully general demanded bits analysis for every instruction.

1183–1184

If a dead instruction is removed, another iteration will be taken and tryShrink will run again.

I think there is no difference between running removeDeadInsts only when tryShrink makes no change and running removeDeadInsts every time tryShrink makes a change.

Trying to provide answers to the open questions here...

lib/CodeGen/MemAccessShrinking.cpp
104

Ah, right, I forgot that about the pass initialization. Sorry for the noise!

218–220

Ah, ok, this makes sense to me now. I had confused myself thinking about it. Anyways, the simpler formulation will avoid any future reader confusion.

370–371

Ok, while that makes sense, it still seems counter-intuitive in terms of how to use MemorySSA based on my limited understanding.

I would have expected essentially walking up the defs from the use until either a clobber is found or the 'from' is found. One has to come first, and whichever is first dictates whether there is a clobber. Essentially, I would expect to use the *SSA* properties to answer these questions rather than the *dominance* or control flow properties. But I'm happy if folks more deeply familiar with MemorySSA can explain better why this is the right way to use it, as I'm still new to this infrastructure in LLVM.

596–606

I still think it will be cleaner to directly construct the load.

Also, I wouldn't expect this pass to be valid for either volatile or ordered loads...

742–745

Given how different the two routines are, I would suggest giving them separate names. It seemingly wasn't obvious that they were different already.

I'm happy to look at the use cases, but this still feels much too expensive to me. In terms of big-O, the fact that it is only memory accesses doesn't really seem to help much. Quadratic in the number of memory accesses is still probably not something we can realistically do.

I want to think more about the algorithm once I see exactly where this is being called.

946

Ok, that explanation makes sense, but you'll need to find a way to make this clear from the code itself. =] At the very least, not using a member, but probably with some more helpful variable names, function names, structure or comments.

967–997

I don't really understand the argument here...

I would expect the demanded bits facilities to already handle things like shifts changing which bits are demanded, does it not? If not, it seems like we should extend that general facility rather than building an isolated system over here.

Regarding the cost, I'm much more worried about the algorithmic cost of this and the fact that it seems relatively sensitive to things that we don't reliably canonicalize (using trunc or and instructions to remove demanded bits).

Generally speaking, working *backwards* from the and or trunc is going to be much more expensive than working *forwards*.

But even if the existing demanded bits analysis is too expensive, we still shouldn't develop a new cheap one locally here. We should either add a parameter to decrease its cost or add a new general-purpose "fast" demanded bits analysis and then use that here.

1183–1184

I guess I'm trying to ask: why will the removal of dead instructions cause shrinking to become more effective? Most of the algorithms here don't seem likely to remove entire classes of uses, and so I'm not sure why this iteration is valuable at all.

But if it is valuable, that is, if removing dead instructions exposes more shrinking opportunities, I would expect that removing dead instructions *earlier* (IE, on each iteration) to cause this to converge faster.

wmi added inline comments.May 10 2017, 10:37 PM
lib/CodeGen/MemAccessShrinking.cpp
596–606

Ok, I will change it to use IRBuilder.

967–997

I would expect the demanded bits facilities to already handle things like shifts changing which bits are demanded, does it not?

Yes, the demanded bits facilities can adjust which bits are demanded if there is a shift. However, which bits are demanded is not the only thing I need to know. I also need to know whether an operation on the Def-Use chain effectively changes the value of the demanded bits. If the operation changes the value of the demanded bits, it should be shrunk together with the load. If it only changes the value of bits other than the demanded bits, the operation can be omitted. That is actually what most of the pattern matching in load shrinking is doing, and I think that part of the work cannot be fulfilled by the demanded bits analysis.

1183–1184

why will the removal of dead instructions cause shrinking to become more effective?

After removing some dead instructions, some multi-use defs become single-use defs, which increases the benefit of doing more shrinking.

if removing dead instructions exposes more shrinking opportunities, I would expect that removing dead instructions *earlier* (IE, on each iteration) to cause this to converge faster.

Ok, then I will do it after tryShrink in every iteration.

wmi added inline comments.May 11 2017, 7:26 AM
lib/CodeGen/MemAccessShrinking.cpp
967–997

Even just for demanded bits, working forwards is not that straightforward, because a wide load for a field group is shared by multiple field accesses, and demanded bits analysis will sometimes show that all the bits of the load are used. It is also possible that on the upper side of the Def-Use chain the demanded bits are wider, because a node may have multiple uses, while on the lower side of the Def-Use chain the demanded bits are narrower. Working forwards, we have to do some search on the expression tree rooted at the load. Working backwards lets us know from the beginning which demanded bits we want to use.

wmi added a comment.May 16 2017, 5:20 PM

I discussed with Chandler offline, and we decided to split the patch and try to commit the store shrinking first.

Then I tried the idea of walking forward for load shrinking by using DemandedBits, but I ran into a problem with the motivational testcase (test/CodeGen/X86/mem-access-shrink.ll). Look at the %bf.load which we want to shrink in mem-access-shrink.ll: it has multiple uses, so we want to look at all of its uses and get the demanded bits for each use. However, on the Def-Use chain from %bf.load to its narrower uses, %bf.load is not the only value with multiple uses; for example, %bf.set also has multiple uses, so we also need to look at all the uses of %bf.set. In theory, every node on the Def-Use chain can have multiple uses, so at the initial portion of the Def-Use chain starting from %bf.load we cannot tell from the demanded bits whether %bf.load can be shrunk; only when we walk fairly close to the end of the Def-Use chain do we know whether %bf.load can be shrunk at a specific place. In other words, walking forward, in order not to miss any shrinking opportunity, we have to walk across almost all the nodes of the Def-Use tree before knowing where %bf.load can be shrunk.

For walking backward, in most cases we only have to walk from a narrower use up to "%bf.load =", which is the root of the Def-Use tree. It is like walking from some leaves to the root, which should be more efficient in most cases. I agree we may see special testcases where those leaf-to-root paths have a long common portion (in that case walking forward is better). If that happens, we can add a cap on the maximum walking distance to keep the compile-time cost from getting too high. Chandler, do you think that is ok?

In D30416#756899, @wmi wrote:

[...] If that happens, we can add a cap on the maximum walking distance to keep the compile-time cost from getting too high. Chandler, do you think that is ok?

I'm somewhat worried about this cap -- it has hurt us in the past. But maybe there is a way to make walking backwards have reasonable complexity. It still seems like something we can do in a separate phase rather than having it interleave with the store-based shrinking, and so I'd still split it into a separate patch.

Setting aside forwards vs. backwards-with-a-cap, I still think it is a mistake to add yet another implementation of tracking which bits are demanded. So I would look at how you might share the logic in DemandedBits (or one of the other places in LLVM where we reason about this, I think there are already some others) for reasoning about the semantics of the IR instructions. Maybe there is no way to share that, but it seems worth trying. Either way, I'd suggest a fresh thread (or IRC) to discuss issues until there is a patch so that we can move the store side of this forward independently.

That make sense?

fhahn added a subscriber: fhahn.May 19 2017, 5:30 AM

Looks like this thread has gone stale for a while.

I have not read the patch in detail, so what I say might be nonsense :) From reading the discussion, it seems that the DemandedBits analysis is neither ready today to handle forward walking for load shrinking, nor ideal for backward walking without a cap, so what is a good path forward? Is it possible to keep the core of this patch but with more reuse of the DemandedBits analysis (with refactoring)? If not, we may want to consider moving forward with a stop-gap solution for now and committing to longer-term unification as a follow-up.