
[BitfieldShrinking] Shrink Bitfields load/store when the bitfields are legal to access independently
Needs Revision · Public

Authored by wmi on Feb 27 2017, 10:39 AM.

Details

Summary

reduceLoadOpStoreWidth is a useful optimization already existing in DAGCombiner. It can shrink the bitfield store in the following testcase:

class A {
public:
  unsigned long f1 : 16;
  unsigned long f2 : 3;
};
A a;

void foo() {
  // if (a.f1 == 2) return;
  a.f1 = a.f1 + 3;
}

For a.f1 = a.f1 + 3, without reduceLoadOpStoreWidth in DAGCombiner, the code will be:
movl a(%rip), %eax
leal 3(%rax), %ecx
movzwl %cx, %ecx
andl $-65536, %eax # imm = 0xFFFF0000
orl %ecx, %eax
movl %eax, a(%rip)

with reduceLoadOpStoreWidth, the code will be:
movl a(%rip), %eax
addl $3, %eax
movw %ax, a(%rip)

However, if we uncomment the if statement above, the load of a.f1 and the store of a.f1 end up in two different BasicBlocks, and reduceLoadOpStoreWidth in DAGCombiner cannot handle that.

The patch redoes the same optimization in a dedicated IR pass, so the optimization is not limited by BasicBlock boundaries.

Diff Detail

Repository
rL LLVM

Event Timeline

chandlerc added inline comments.Apr 24 2017, 6:41 PM
lib/CodeGen/MemAccessShrinking.cpp
375–397

After reading more of this routine, I think you should split it into two routines, one that tries to handle the first pattern, and one that only handles the second pattern.

You can factor the rewriting code that is currently shared by both patterns into utility functions that are called for both. But the logic of this routine is harder to follow because you always have this state to hold between doing two different kinds of transforms.

408

TBits doesn't really give me enough information as a variable name... Maybe StoreBitSize?

454–457

What happens when both are true? It looks like we just overwrite the 'MR' code?

I feel like both of these analyze...() methods should return the ModRange struct rather than having an output parameter.

468–471

Should this be testing against the DataLayout rather than hard coded 8, 16, and 32? What if 64 bits is legal and that's the width of the MR?

605–608

This comment and function name don't really add up for me...

There is no Cst parameter here. I assume you mean AI?

Also having a flag like AInB seems to make this much more confusing to read. Why not just have two routines for each case?

My guess at what this is actually trying to do is areConstantBitsWithinModRange and areConstantBitsOutsideModRange?

620–624

Maybe a method and use the term 'disjoint'? MR1.isDisjoint(MR2) reads a bit better to me.
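
For illustration, a minimal sketch of such a helper, assuming ModRange carries a Shift/Width pair (the member names are guesses, not necessarily the patch's actual layout):

struct ModRange {
  unsigned Shift = 0; // lowest modified bit
  unsigned Width = 0; // number of modified bits

  // Two ranges are disjoint when one ends at or before the other begins.
  bool isDisjoint(const ModRange &Other) const {
    return Shift + Width <= Other.Shift || Other.Shift + Other.Width <= Shift;
  }
};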

627–629

This makes this a very confusing API -- now it isn't really just a predicate, it also computes the insertion point...

Also, why do you need to compute a particular insertion point within a basic block? Can't you just always insert into the beginning of the basic block and let the scheduler do any other adjustments?

arsenm added inline comments.Apr 24 2017, 7:36 PM
include/llvm/Target/TargetLowering.h
1973

The hook I was thinking of was shouldReduceLoadWidth. s_load_dword uses a different cache with much faster access than the buffer instructions, so it is preferable when it can be used.

wmi added a comment.Apr 25 2017, 10:29 AM

Thanks for drafting the comments. They are clearly more descriptive, and I like the variable names (LargeVal and SmallVal), which are much better than what I used (OrigVal, MaskedVal). I will rewrite the comments based on your draft.

lib/CodeGen/MemAccessShrinking.cpp
375–397

I tried that splitting before but I had to move many temporaries to MemAccessShrinkingPass. However, some temporaries are only used by store shrinking but not load shrinking, so that looked a little weird. I agree the logic will be clearer if it is split into two routines. I will try again, and see if I can separate some common temporaries into MemAccessShrinkingPass and leave the special temporaries as parameters, or create a store shrinking class to keep the temporaries.

408

Ok, will change it.

418–431

I'll borrow the template from your comment to explain: store(or(and(LargeVal, MaskConstant), SmallVal), address)
The case is:

store(or_1(and_1(or_2(and_2(load, -65281), Val1), -256), and_3(Val2, 7)))

The two operands of or_1 are and_1 and and_3, but the matcher doesn't know which subtree, and_1 or and_3, contains the LargeVal. I want or_2 to be matched as the LargeVal. This is a common pattern after bitfield load/store coalescing.

But while explaining this to you, I realize I can split the complex pattern match above into two steps, which may be simpler:

bool OrAndPattern = match(Val, m_c_Or(m_And(m_Value(LargeVal), m_ConstantInt(Cst)),
                                      m_Value(SmallVal)));
if (match(SmallVal, m_c_And(m_c_Or(m_And(m_Value(), m_Value()), m_Value()),
                            m_Value())))
  std::swap(SmallVal, LargeVal);

454–457

We would just overwrite "MR", but that is still not right for "OrAndPattern". I will change the second "if" to "else if".

468–471

That is better. Will fix it.

605–608

Sorry, Cst means AI here.
The function is doing both areConstantBitsWithinModRange and areModRangeWithinConstantBits.

620–624

Ok.

627–629

Good point. If there is a clobber instruction but it is in the same block as the "To" instruction, I can simply insert at the beginning of that block, and NewInsertPt is not needed.

But when there is no clobber instruction, I still prefer an insertion point closer to the "To" instruction, because the resulting IR looks better. That means the function has to return at least a flag saying whether the insertion needs to happen at the beginning of the "To" instruction's block.

I.e., I can simplify "Instruction *&NewInsertPt" to a flag. Is that API acceptable?

wmi added inline comments.Apr 28 2017, 4:11 PM
include/llvm/Target/TargetLowering.h
1973

shouldReduceLoadWidth is a hook on TargetLowering, but I need a hook on TargetLoweringBase so it can be used from an LLVM IR pass.

I cannot turn shouldReduceLoadWidth into a TargetLoweringBase hook because of the way x86 uses it, so I copied the logic from AMDGPUTargetLowering::shouldReduceLoadWidth into AMDGPUTargetLowering::isNarrowingExpensive. I can make shouldReduceLoadWidth call isNarrowingExpensive as an NFC change. Is that OK?

lib/CodeGen/MemAccessShrinking.cpp
418–431

I find I still have to keep the original complex pattern. Now I remember where the real difficulty is:

For a case like store(or_1(and_1(or_2(and_2(load, -65281), Val1), -256), and_3(Val2, 7))), I want to match LargeVal to or_2(and_2(load, ...)).

But I cannot use match(Val, m_c_Or(m_And(m_c_Or(m_And(...))))) because I have no way to bind the intermediate results of the match; in particular, I cannot bind LargeVal to the second m_c_Or. So I have to split the match into multiple steps. That is where the complexity comes from.
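
For illustration, one possible way to split it. This is only a sketch using the PatternMatch helpers, not the patch's actual code, and the m_Load-based inner check is an assumption about what counts as the coalesced-bitfield shape:

#include "llvm/IR/PatternMatch.h"
using namespace llvm;
using namespace llvm::PatternMatch;

static bool matchStoreShrinkCandidate(Value *Val, Value *&LargeVal,
                                      Value *&SmallVal, ConstantInt *&Mask) {
  // Step 1: outer pattern or(and(LargeVal, Mask), SmallVal).
  if (!match(Val, m_c_Or(m_And(m_Value(LargeVal), m_ConstantInt(Mask)),
                         m_Value(SmallVal))))
    return false;
  // Step 2: check that LargeVal itself has the or_2(and_2(load, C), ...)
  // shape, so the intermediate 'or' is the value bound as LargeVal.
  return match(LargeVal,
               m_c_Or(m_And(m_Load(m_Value()), m_ConstantInt()), m_Value()));
}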

468–471

For x86-64, DataLayout works fine. However, for other architectures, like ARM, the datalayout is

target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"

DataLayout::LegalIntWidths only contains the native integer widths, given by "n32", so 32 is the only legal integer width for ARM.

For x86-64, the datalayout is:
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

Because of "n8:16:32:64", 8, 16, 32, and 64 are all legal integer widths.
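
For illustration, a sketch of what the DataLayout-based check could look like. The helper name is made up; DL.isLegalInteger() consults the "n..." component of the datalayout string, so it would accept 8/16/32/64 on x86-64 but only 32 on the ARM datalayout above:

#include "llvm/IR/DataLayout.h"
using namespace llvm;

static bool isLegalShrinkWidth(const DataLayout &DL, unsigned BitWidth) {
  // Only byte-sized widths that the target declares as native integers.
  return BitWidth % 8 == 0 && DL.isLegalInteger(BitWidth);
}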

wmi updated this revision to Diff 97170.Apr 28 2017, 4:20 PM

Address Eli, Matt and Chandler's comments.

Some major changes:

  • A lot of comments changed.
  • Split reduceLoadOpsStoreWidth into two routines and several other helper functions.
  • Refactored RecursivelyDeleteTriviallyDeadInstructions.
chandlerc requested changes to this revision.May 9 2017, 5:33 PM

Somewhat focused on the store side. Near the bottom is a high-level comment about the load shrinking approach.

lib/CodeGen/MemAccessShrinking.cpp
104

Still should omit the = nullptr here since this is an internal type.

154–157

These seem like helper functions that don't actually need to be part of the class at all. Maybe make them static free functions?

(This may be true for some of the other routines in this file.)

205–207

It seems much more idiomatic to return the bool indicating whether a valid BitRange was computable, and if true, set up the values in an output parameter.

Or even better, you could return an Optional<BitRange>, and return None when the requirements aren't satisfied.

209

I know other passes use the variable name Cst but I'd suggest using just C for generic constants or some more descriptive term like you use elsewhere like Mask.

212–215

Do you really need to extend or truncate here? Surely the type system has already caused the constant to be of the size you want? If so, I'd just assert it here. Maybe could directly pass a 'const &' APInt in as the parameter, letting you call the parameter Mask above?

216–217

I would call these MaskLeadingOnes and MaskTrailingOnes.

218–220

I'm having trouble understanding the logic here in the case where there are leading ones. Here is my reasoning, but maybe I've gotten something wrong here:

Shifting right will remove leading ones, but you're shifting right the number of *trailing* ones... Shouldn't that be *leading ones*? And won't the result of a shift *right* be to place the middle zero sequence at the least significant bit, meaning you would want to count the *leading* zeros?

Put differently, arithmetic shift is required to not change the most significant bit, so doing an arithmetic shift right based on how many ones are trailing, seems like it will never change the count of trailing zeros.

If this is correct, then this is a bug and you should add some test cases that will hit this bug.

But regardless of whether my understanding is correct or there is a bug here, I think this can be written in a more obvious way:

unsigned MaskMidZeros = BitSize - (MaskLeadingOnes + MaskTrailingOnes);

And then directly testing whether they are all zero:

if (Mask == APInt::getBitsSet(BitSize, MaskLeadingOnes,
                              MaskLeadingOnes + MaskMidZeros)) {
223–225

Why would we see an all-ones mask? Shouldn't that have been eliminated earlier? It seems like we could just bail in this case.

246

The idiomatic way to test this with APInt is BitMask.isSubsetOf(KnownZero).

Also, it would be good to use early-exit here. It *sounds* like you are testing whether it is valid to do anything, but that isn't clear when you have set up members of BR here before returning.

260

When you have alternates, the pattern notation is a bit confusing. I'd just say something like Analyze <bop>(load P, \p Cst) where <bop> is either 'or', 'xor', or 'and'.

261

This isn't really about whether the original value is loaded or not, right? It is just bounding the changed bits?

I'd explain it that way. You'll mention the load when you use it.

263–264

Maybe a better name of this function would be: computeBopChangedBitRange?

265

Same comment above about just asserting the correct bitsize and passing the APInt Mask in directly.

266

Why not pass the argument as a BinaryOperator?

267–268

Might be nice to add a comment explaining the logic here.

Something like:

Both 'or' and 'xor' operations only mutate when the operand has a one bit.
But 'and' only mutates when the operand has a zero bit, so invert the
constant when the instruction is an and so that all the (potentially)
changed bits are ones in the operand.
284

Why the PowerOf2Ceil here? Will the actual store used have that applied? If the actual store has that applied, why don't we want to consider that as BitSize so we're free to use that larger size for the narrow new type?

286–289

As the comment explains, this lambda is actually computing the Shift. But the name seems to indicate it is just a predicate testing whether the old range is covered by the new one.

Also, why does the old BR need to be passed in as an argument, isn't that something that can be captured? I actually like passing NewBR in here to show that it is what is *changing* between calls to this routine. But it seems awkward to set up NewBR before this lambda (which would allow the lambda to implicitly capture it) and then call it with a parameter name that shadows it, thereby avoiding the capture. I'd consider whether you want to sink NewBR down or otherwise handle it more cleanly in the loop.

Nit pick: we typically name lambdas like variables with FooBar rather than like functions with fooBar.

290

What about platforms with support for unaligned loads? Probably best to just leave a FIXME rather than adding more to this patch, but it seems nice to mention that technique.

As an example, on x86, if you have a bitfield that looks like:

struct S {
  unsigned a : 48;
  unsigned b : 48;
  unsigned c : 32;
};

It seems likely to be substantially better to do a single 8-byte load and mask off the high 2 bytes when accessing b than to do two nicely aligned 8-byte loads and all the bit math to recombine things.

329

Pass BR by value? (or make it a const reference, but it seems small)

331–335

Generally we prefer early returns to state. That would make this:

if (...)
  return ...;

return ...;
345–346

You can replace uses of this with: V1->stripPointerCasts() == V2->stripPointerCasts(). This will be more powerful as well.

370–371

Maybe this is just a strange API on MemorySSA, but typically I wouldn't expect a lack of dominance to indicate that no access between two points exists.

How does MemorySSA model a pattern that looks like:

From  x 
 \   /
  \ /
   A
   |
   |
   To

Where A is a defining access, is between From and To, but I wouldn't expect From to dominate A because there is another predecessor x.

393

StOffset seems an odd name if this is used to create new pointers for loads as well as stores.

402

No need to call it uglygep. If you want a clue as to the types, maybe rawgep or bytegep.

407–408

This comment mentions LargeVal but that isn't an argument?

418–431

It would be really nice if LLVM would canonicalize in one way or the other so you didn't have to handle so many variations. Asking folks about whether we can/should do anything like that.

But I think the bigger question is, why would only two layers be enough? I feel like there is something more general here that will make explaining everything else much simpler.

Are you looking for a load specifically? Or are you just looking for one side of an or which has a "narrow" (after masking) and?

If the former, maybe just search for the load?

If the latter, maybe you should be just capturing the two sides of the or, and rather than looking *explicitly* for an 'and', instead compute whether the non-zero bits of one side or the other are "narrow"?
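
For illustration, a sketch of the "compute whether one side is narrow" idea using known bits rather than an explicit 'and' match. The function name and the width test are assumptions, not the patch's code:

#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/Support/KnownBits.h"
using namespace llvm;

static bool hasNarrowNonZeroBits(const Value *V, const DataLayout &DL,
                                 unsigned MaxWidth) {
  KnownBits Known = computeKnownBits(V, DL);
  // Bits not known to be zero are the bits this operand may actually set.
  APInt MaybeSet = ~Known.Zero;
  if (MaybeSet.isZero())
    return true;
  unsigned Lo = MaybeSet.countTrailingZeros();
  unsigned Hi = MaybeSet.getBitWidth() - MaybeSet.countLeadingZeros();
  return Hi - Lo <= MaxWidth;
}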

422

This reads like a fragment, were there supposed to be more comments before this line?

557–558

Keeping this close to extendBitRange would make it a lot easier to read. Also, why have two functions at all? It appears this is the only thing calling extendBitRange. (I'm OK if there is a reason, just curious what it is.)

562–565

I'm surprised this doesn't just fall out from the logic in extendBitRange.

575

Is StoreShrunkInfo buying much here? It seems to mostly be used in arguments, why not just pass the argument directly? The first bit of the code seems to just unpack everything into local variables.

587–588

Doesn't MinAlign handle 0 correctly so that you can just do this unconditionally?

593–595

Isn't this only called when we need to insert two stores?

596–606

It feels like all of this could be factored into an 'insertStore' method? In particular, the clone doesn't seem to buy you much as you rewrite most parts of the store anyways.

This could handle all of the MemorySSA updating, logging, etc.

662–663

Rather than cloning and mutating, just build a new load? The IRBuilder has a helpful API here.

679–687

There is no need to handle constants specially. The constant folder will do all the work for you.

690–705

I think the amount of code that is special cased here for one caller of this routine or the other is an indication that there is a better factoring of the code.

If you had load insertion and store insertion factored out, then each caller could cleanly insert the narrow load, compute the narrow store (differently), and then insert it.

Does that make sense? Maybe there is a reason why that doesn't work well?

729–730

Might read more easily as: "Assuming that \p AI contains a single sequence of bits set to 1, check whether the range \p BR is covered by that sequence."

732–733

It seems more obvious to me to test this the other way:

BR.Shift >= AI.countLeadingZeros() &&
BR.Shift + BR.Width < (AI.getBitWidth() - AI.countTrailingZeros())

Is this not equivalent for some reason? (Maybe my brain is off...)

The reason I find this easier to read is because it seems to more directly test: "is the start of the BitRange after the start of the 1s, and is the end of the BitRange before the end of the 1s.".

742–745

There is no comment about the cost of this routine.

It looks *really* expensive. It appears to walk all transitive predecessors of the block containing To. So worst case, every basic block in the function. I see this called in several places from inside of for-loops. Is this really a reasonable approach?

Why aren't we just walking the def-use chain from MemorySSA to figure this kind of thing out in a much lower time complexity bound? Like, shouldn't we just be able to walk up defs until we either see a clobber or From?
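
For illustration, a sketch of that def-chain walk. The names and simplifications are assumptions, MemoryPhis are handled conservatively, and this is not the patch's hasClobberBetween:

#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/MemorySSA.h"
using namespace llvm;

static bool defChainHasClobber(MemorySSA &MSSA, AAResults &AA,
                               MemoryAccess *From, MemoryUseOrDef *To,
                               const MemoryLocation &Loc) {
  MemoryAccess *MA = To->getDefiningAccess();
  while (MA != From && !MSSA.isLiveOnEntryDef(MA)) {
    auto *MD = dyn_cast<MemoryDef>(MA);
    if (!MD)
      return true; // MemoryPhi: give up conservatively in this sketch.
    if (isModSet(AA.getModRefInfo(MD->getMemoryInst(), Loc)))
      return true; // Found a clobbering def before reaching From.
    MA = MD->getDefiningAccess();
  }
  return false;
}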

762–763

You can just insert -- that will return whether it succeeded in inserting the block.

770

Naming convention.

780–781

So, each of these dominates queries are *incredibly* slow. They require linearly walking every instruction in the basic block (worst case).

Why doesn't MemorySSA handle this for you? (Maybe my comment above about using MemorySSA will obviate this comment though.)

946

It would be much more clear for this to be a parameter rather than an implicit parameter via class member. For example, multiple uses *of what*?

967–997

Rather than re-implementing all of this logic, can you re-use the existing demanded bits facilities in LLVM?

For example, I think you can use the DemandedBits analysis, walk all loads in the function, and then narrow them based on the demanded bits it has computed. Because of how DemandedBits works, it is both efficient and very powerful. It can handle many more patterns.

Thinking about this, I suspect you'll want to do two passes essentially. First, narrow all the *stores* that you can. This will likely be iterative. Once that finishes, it seems like you'll be able to then do a single walk over the loads with a fresh DemandedBits analysis and narrow all of those left. You'll probably want to narrow the stores first because that may make bits stop being demanded. But I don't see any way for the reverse to be true, so there should be a good sequencing.

To make the analysis invalidation stuff easier, you may actually need this to actually be two passes so that the store pass can invalidate the DemandedBits analysis, and the load pass can recompute it fresh.

Does that make sense?

If so, I would suggest getting just the store shrinking into this patch, and adding the load shrinking in a follow-up patch. I'm happy for them to be implemented in a single file as they are very similar, and it's good for people to realize they likely want *both* passes.
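
For illustration, a sketch of how the load side of that two-phase scheme might look. The helper name is made up; DemandedBits already folds shifts, truncs, and masks on the use side into the per-instruction demanded mask:

#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/DemandedBits.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

static SmallVector<LoadInst *, 8> collectNarrowableLoads(Function &F,
                                                         DemandedBits &DB) {
  SmallVector<LoadInst *, 8> Candidates;
  for (Instruction &I : instructions(F)) {
    auto *LI = dyn_cast<LoadInst>(&I);
    if (!LI || LI->isVolatile())
      continue;
    // If not every bit of the loaded value is demanded, this load is a
    // candidate for replacement with a narrower load plus a shift.
    if (!DB.getDemandedBits(LI).isAllOnes())
      Candidates.push_back(LI);
  }
  return Candidates;
}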

998

Inst would seem like a more common variable name here.

1144–1145

Comments would be good explaining the very particular iteration order.

1183–1184

Do you want to run tryShrink again just because you removed dead instructions?

If so, do you want to remove dead instructions on each iteration instead of just once tryShrink doesn't make a change?

This revision now requires changes to proceed.May 9 2017, 5:33 PM
wmi added a comment.May 10 2017, 10:43 AM

Chandler, thanks for the comments. They are very helpful. I will address them in the next revision. I have only replied to the comments where I had questions or concerns.

lib/CodeGen/MemAccessShrinking.cpp
104

I cannot omit it, because in INITIALIZE_TM_PASS_END, callDefaultCtor<passName> requires the parameter to have a default value.

218–220

Shifting right will remove leading ones, but you're shifting right the number of *trailing* ones... Shouldn't that be *leading ones*? And won't the result of a shift *right* be to place the middle zero sequence at the least significant bit, meaning you would want to count the *leading* zeros?

I think shifting right will remove trailing ones? And after the shift (Mask.ashr(MaskTrailOnes)), the middle zeros are at the least significant bits, so they are trailing zeros, right?

But like you said, I should rule out the all-zero/all-one cases separately so the logic becomes clearer.

370–371

That case cannot happen, because we ensure From dominates To before calling the function. But you are right, it is better to add an assertion at the entry of the function to prevent misuse of the API.

596–606

I use clone here just to duplicate the subclass data like volatile and ordered.

742–745

That is because the instruction To here may not be a memory access instruction (it is probably an And or Trunc instruction, which indicates that only some bits of the input are demanded), so we cannot get a MemoryAccess for it. Note that hasClobberBetween is overloaded and there are two versions. The other version, which walks the MemorySSA def-use chain, is the one used in several for-loops as you saw; this higher-cost version is not used in a loop. Besides, we only check the MemorySSA DefList in each BB, so the worst-case complexity is the number of memory access instructions in the function, which is usually much smaller than the total number of instructions.

946

MultiUsesSeen is not changed for every instruction. It records whether a previous instruction on the chain was found to have multiple uses while we walk the chain bottom-up.

r1 = ...;
r2 = r1 + r3;
r4 = r2 + r5;

If r2 has multiple uses, then neither r2 = r1 + r3 nor r1 = ... can be removed after the shrinking.

967–997

I considered the demanded bits facilities before, but I found they would only simplify the code a little. Finding the demanded bits of the load is only a small part of the work. Most of the complexity comes from figuring out which ops in the sequence on the def-use chain change the demanded bits. For example, if we see shifts, we may clear some demanded bits in less significant positions to zero because we shift right and then shift left; because we change the demanded bits, we must include the shifts in the shrunk code sequence. Similarly, if we see an Or(And(Or(And(...)))) pattern, we want to know that the bits changed by the outer Or(And(...)) are disjoint from the demanded bits; only when that is true can we omit the Or(And(...)) from the final shrunk code sequence. Another reason is that demanded bits analysis may not be very cheap. For memory shrinking, only a few patterns like and/trunc are commonly useful, so we don't actually need a fully general demanded bits analysis for every instruction.

1183–1184

If dead instructions are removed, another iteration is taken and tryShrink runs again.

I think it makes no difference whether we run removeDeadInsts only when tryShrink makes no change or run it every time after tryShrink makes a change.

Trying to provide answers to the open questions here...

lib/CodeGen/MemAccessShrinking.cpp
104

Ah, right, I forgot that about the pass initialization. Sorry for the noise!

218–220

Ah, ok, this makes sense to me now. I had confused myself thinking about it. Anyways, the simpler formulation will avoid any future reader confusion.

370–371

Ok, while that makes sense, it still seems counter-intuitive in terms of how to use MemorySSA based on my limited understanding.

I would have expected essentially walking up the defs from the use until either a clobber is found or the 'from' is found. One has to come first, and whichever is first dictates whether there is a clobber. Essentially, I would expect to use the *SSA* properties to answer these questions rather than the *dominance* or control-flow properties. But I'm happy if folks more deeply familiar with MemorySSA can explain better why this is the right way to use it, as I'm still new to this infrastructure in LLVM.

596–606

I still think it will be cleaner to directly construct the load.

Also, I wouldn't expect this pass to be valid for either volatile or ordered loads...

742–745

Given how different the two routines are, I would suggest giving them separate names. It seemingly wasn't obvious that they were different already.

I'm happy to look at the use cases, but this still feels much too expensive to me. In terms of big-O, the fact that it is only memory accesses doesn't really seem to help much. Quadratic in the number of memory accesses is still probably not something we can realistically do.

I want to think more about the algorithm once I see exactly where this is being called.

946

Ok, that explanation makes sense, but you'll need to find a way to make this clear from the code itself. =] At the very least, not using a member, but probably with some more helpful variable names, function names, structure or comments.
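
For illustration, one way to make that explicit without a pass member. This is a sketch; the function name, the recursion shape, and the depth cap are all assumptions:

#include "llvm/IR/Instructions.h"
using namespace llvm;

// Walk the operand chain bottom-up and report whether any def on the chain
// has more than one use, i.e. whether it would stay alive after shrinking.
static bool chainHasMultiUseDef(const Instruction *I, unsigned Depth = 0) {
  constexpr unsigned MaxDepth = 6; // arbitrary cap for this sketch
  if (Depth > MaxDepth)
    return true; // be conservative once the walk gets deep
  for (const Value *Op : I->operands())
    if (auto *OpI = dyn_cast<Instruction>(Op))
      if (!OpI->hasOneUse() || chainHasMultiUseDef(OpI, Depth + 1))
        return true;
  return false;
}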

967–997

I don't really understand the argument here...

I would expect the demanded bits facilities to already handle things like shifts changing which bits are demanded, does it not? If not, it seems like we should extend that general facility rather than building an isolated system over here.

Regarding the cost, I'm much more worried about the algorithmic cost of this and the fact that it seems relatively sensitive to things that we don't reliably canonicalize (using trunc or and instructions to remove demanded bits).

Generally speaking, working *backwards* from the and or trunc is going to be much more expensive than working *forwards*.

But even if the existing demanded bits analysis is too expensive, we still shouldn't develop a new cheap one locally here. We should either add a parameter to decrease its cost or add a new general-purpose "fast" demanded bits analysis and then use that here.

1183–1184

I guess I'm trying to ask: why will the removal of dead instructions cause shrinking to become more effective? Most of the algorithms here don't seem likely to remove entire classes of uses, and so I'm not sure why this iteration is valuable at all.

But if it is valuable, that is, if removing dead instructions exposes more shrinking opportunities, I would expect that removing dead instructions *earlier* (IE, on each iteration) to cause this to converge faster.

wmi added inline comments.May 10 2017, 10:37 PM
lib/CodeGen/MemAccessShrinking.cpp
596–606

Ok, I will change it to use IRBuilder.

967–997

I would expect the demanded bits facilities to already handle things like shifts changing which bits are demanded, does it not?

Yes, the demanded bits facilities can adjust which bits are demanded when there is a shift. However, which bits are demanded is not the only thing I need to know. I also need to know whether an operation on the def-use chain effectively changes the value of the demanded bits. If the operation changes the value of the demanded bits, it should be shrunk together with the load; if it only changes bits other than the demanded bits, it can be omitted. That is what most of the pattern matching in load shrinking is doing, and that part of the work, I think, cannot be covered by demanded bits analysis.

1183–1184

why will the removal of dead instructions cause shrinking to become more effective?

After removing some dead instructions, some multi-use defs become single-use defs, which increases the benefit of doing more shrinking.

if removing dead instructions exposes more shrinking opportunities, I would expect that removing dead instructions *earlier* (IE, on each iteration) to cause this to converge faster.

Ok, then I will do it after tryShrink in every iteration.

wmi added inline comments.May 11 2017, 7:26 AM
lib/CodeGen/MemAccessShrinking.cpp
967–997

Even for just the demanded bits, working forwards is not straightforward, because a wide load for a field group is shared by multiple field accesses, and demanded bits analysis will sometimes show that all the bits of the load are used. It is also possible that on the upper side of the def-use chain the width of the demanded bits is larger because a node may have multiple uses, while on the lower side of the def-use chain the width of the demanded bits is smaller. Working forwards, we have to search the expression tree rooted at a load. Working backwards lets us know from the beginning which demanded bits we want to use.

wmi added a comment.May 16 2017, 5:20 PM

Discussed with Chandler offline, and we decided to split the patch and try to commit the store shrinking first.

Then I tried the idea of walking forward for load shrinking using DemandedBits, but I ran into a problem with the motivational testcase (test/CodeGen/X86/mem-access-shrink.ll). Look at the %bf.load we want to shrink in mem-access-shrink.ll: it has multiple uses, so we want to look at all of its uses and get the demanded bits for each use. However, on the def-use chains from %bf.load to its narrower uses, %bf.load is not the only value with multiple uses; for example, %bf.set also has multiple uses, so we also need to look at all the uses of %bf.set. In theory, every node on the def-use chain can have multiple uses, so at the initial portion of the chain starting from %bf.load we don't know from the demanded bits whether %bf.load can be shrunk; only when we walk fairly close to the end of the chain do we know whether %bf.load can be shrunk at a specific place. In other words, by walking forward, in order not to miss any shrinking opportunity, we have to walk across almost all the nodes of the def-use tree before knowing where %bf.load can be shrunk.

For walking backwards, in most cases we only have to walk from a narrower use to "%bf.load =", which is the root of the def-use tree. It is like walking from a few leaves to the root, which should be more efficient in most cases. I agree we may see special testcases where there is a long common portion shared by those leaf-to-root paths (and for those, walking forwards is better). If that happens, we can add a cap on the maximum walking distance to keep the compile-time cost from being too high. Chandler, do you think that is ok?

In D30416#756899, @wmi wrote:

If that happens, we can add a cap on the maximum walking distance to keep the compile-time cost from being too high. Chandler, do you think that is ok?

I'm somewhat worried about this cap -- it has hurt us in the past. But maybe there is a way to make walking backwards have reasonable complexity. It still seems like something we can do in a separate phase rather than having it interleave with the store-based shrinking, and so I'd still split it into a separate patch.

Setting aside forwards vs. backwards-with-a-cap, I still think it is a mistake to add yet another implementation of tracking which bits are demanded. So I would look at how you might share the logic in DemandedBits (or one of the other places in LLVM where we reason about this, I think there are already some others) for reasoning about the semantics of the IR instructions. Maybe there is no way to share that, but it seems worth trying. Either way, I'd suggest a fresh thread (or IRC) to discuss issues until there is a patch so that we can move the store side of this forward independently.

That make sense?

fhahn added a subscriber: fhahn.May 19 2017, 5:30 AM

Looks like this thread has gone stale for a while.

I have not read the patch in detail, so what I say might be nonsense :) From reading the discussion, it seems that DemandedBits analysis is not ready today to handle forward walking for load shrinking, and backward walking isn't ideal without a cap, so what is a good path forward? Is it possible to keep the core of this patch but with more reuse of the DemandedBits analysis (with refactoring)? If not, we may want to consider moving forward with a stop-gap solution for now and committing to longer-term unification as a follow-up.