This is an archive of the discontinued LLVM Phabricator instance.

Reducing the costs of cast instructions to enable more vectorization of smaller types in LoopVectorize
AbandonedPublic

Authored by samparker on May 18 2015, 3:49 AM.

Details

Reviewers
jmolloy
sbaranga
Summary

Hi,

So I have completely rewritten the patch so that it doesn't use getWidestType and that it searches for clamped values and instructions. I have reversed the search through the loop body so that trunc instructions can be searched up from to the zext/sext nodes, noting smaller type requirements.

Cheers,
Sam

Diff Detail

Event Timeline

samparker updated this revision to Diff 25951.May 18 2015, 3:49 AM
samparker retitled this revision from to Reducing the costs of cast instructions to enable more vectorization of smaller types in LoopVectorize.
samparker updated this object.
samparker edited the test plan for this revision. (Show Details)
samparker added subscribers: mzolotukhin, jmolloy, delena.
samparker added a subscriber: Unknown Object (MLST).May 20 2015, 1:09 AM

Hi Sam,

If I understand correctly, this implementation allows for only s/zext -> op -> trunc chains (with only one operation). Is this correct (or perhaps I'm missing something)? I think in theory it should be possible to have arbitrary long chains, depending on what operations are in them.

-Silviu

lib/Transforms/Vectorize/LoopVectorize.cpp
4519

APInt::getMaxValue(ScalarWidth) should be used as (1 << ScalarWidth) - 1 might return different values depending on the host machine.

test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
141

I think this would be cleaner without the metadata.

Hi Silviu,

Yes, that is how is it implemented at the moment and I agree longer chains should be considered. I just wasn't sure what types of nodes it wouldn't be legal for and would welcome suggestions.

Cheers,
sam

samparker updated this revision to Diff 27034.Jun 3 2015, 5:57 AM

Now use SmallPtrSet and ValueMap in place of the standard containers for the instructions and types found by the new function.
Also compare against APInt::getMaxValue as Silviu suggested.
Removed metadata from test file.

jmolloy requested changes to this revision.Jun 3 2015, 7:59 AM
jmolloy added a reviewer: jmolloy.

Hi Sam,

I've given an initial review, but I'm not sure this will catch all the interesting cases.

You don't recurse anywhere, so you only check single operations. For example, you'll miss this:

%1 = zext i32 %a to i64
%2 = add %1, %1
%3 = add %2, %1
%4 = trunc i64 %3 to i32

There are more complex cases it won't catch either, but I'm less worried about that as primarily this comes up with C promotion rules, so everything gets promoted to the same size (i32) which simplifies things somewhat.

Cheers,

James

lib/Transforms/Vectorize/LoopVectorize.cpp
891

This needs doxygen comment style (///) and also a description of what a clamped type is.

895

This needs documentation. Does it assume I is a CastInst?

4379

Whitespace change.

4387

Please add a comment as to why we're reversing through the loop.

4484

There's no need for this to be split over two lines - the "bool" will fit inside the 80 char limit.

4490

"so any extends would have been found earlier." - SPAG and clarification

4491

The canonical way to do this is "if (!isa<TruncInst>(I))"

4492

This can simply be "return FreeCasts.count(I);"

4498

I know it's pedantic, but this comment implies that it was the user's fault that we were called on a non-integer operation. Actually, we just don't deal with them, so I'd suggest rewriting to something like:

"// We only care about integer operations."

4514

Canonical form has local variables as UpperCamelCase.

4516

You can merge this and the next line with:

if (ConstantInst *CI = dyn_cast<ConstantInst>(ChainInst->getOperand(i)))
This revision now requires changes to proceed.Jun 3 2015, 7:59 AM
samparker updated this revision to Diff 27366.Jun 9 2015, 3:29 AM
samparker edited edge metadata.

Chains of instructions are now searched through, with the limitations that they are either an Add, And, Or, Xor, Mul or Shl.
The instructions are saved in a map with their clamped value type and so during vectorization, these types can be used instead.
This is performed by enclosing the binary operations within truncs and zext operations in the hope that these get optimised out.

jmolloy requested changes to this revision.Jun 11 2015, 2:56 AM
jmolloy edited edge metadata.

Hi Sam,

Thanks for making this update. I have lots of coding style nits but primarily we need two algorithms changed from being purely recursive to being iterative. They'll still do the same thing just without overflowing the stack on large programs.

Cheers,

James

lib/Transforms/Vectorize/LoopVectorize.cpp
2639

This is recursive, and unboundedly so. This means that with a large function, we may blow the stack.

Instead of recursing, you could change this to be a simple iterative algorithm. Just iterate as you are currently doing, but call visitClampedInstr() for ALL instructions that are clamped, not just TruncInsts.

Store the newly created clamped instructions in a map<Value*,Value*>, and when you're visiting a clamped instruction check the map for each of its operands. If a value exists, we've already visited this - use it. Otherwise insert a cast.

Then, when we hit a trunc instruction, do the replacement (using OldTrunc->replaceAllUsesWith(NewTrunc)) and that'll make the entire tree live.

This will only really work if you visit all instructions in dominator order, so we visit defs before uses. Luckily the caller of this function already finds such an ordering (LoopBlocksDFS), so you can just use that.

4581

This is also unboundedly recursive. But it can be changed into an iterative algorithm.

Instead of searching bottom-up, search top-down. For every instruction (visiting defs before uses, so in LoopBlocksDFS order!), check if it is narrow. That's easy because you just need to check its immediate operands. If so, store it.

Because we visit all instructions, and we visit all defs before uses, any single isNarrowInstruction() call only needs to check its immediate operands - it doesn't need to go crawling through a tree. This removes the recursion and makes for a more elegant algorithm.

This revision now requires changes to proceed.Jun 11 2015, 2:56 AM

Hi Sam,

Some initial comments from me inline:

lib/Transforms/Vectorize/LoopVectorize.cpp
952–953

Please add a comment that the key in this map is VectorizationFactor.

2601–2602

Should it be if(Clamped != *II) ? Since Clamped originally is just an operand of I, this condition is almost always true, or am I missing something?

2612

Is it used anywhere?

4149

Please remove this unneeded change.

4478–4480

Please reword the comment so that it doesn't refer to the previous version: e.g. instead of "This has been reversed.." use something like "iterate in reverse order because..".
In the current form the comment makes sense in the context of the patch, but it would look a bit strange when you don't see the context.

4481

Maybe use auto here?

4484

This change looks strange. Is it correct at all?

4850–4851

else if should be on the same line as }.
Actually, you could just run clang-format on the patch to avoid such issues.

samparker updated this revision to Diff 28007.Jun 19 2015, 12:52 AM
samparker added a reviewer: sbaranga.

I have changed the algorithms so that they are not recursive. The first pass checks all the instructions top-down and assigns a potential (TBC) narrow type. The second pass then searches bottom up from truncs to confirm, or adjust, the narrow type. I've also added support for sub, ashr and lshr instructions, but would really appreciate my logic being confirmed.

sbaranga edited edge metadata.Jun 19 2015, 5:35 AM

Hi Sam,

Some comments from me inline.

Cheers,
Silviu

lib/Transforms/Vectorize/LoopVectorize.cpp
1212

I think the name "TBCNarrowInstrs" is potentially confusing. Maybe TBC -> Candidate?

5236

Why use inline here? I would normally trust the compiler to do the right thing.

5298

The check for sub could potentially be removed, but it depends on exactly what we're checking for.
It would be interesting to have an example where sub is an issue that would prevent the narrowing.

5309

The comment says "logical shifts" but the code also allows ashr? Is this correct?

5386

It would be a good idea to have a comment that explains how these were chosen, and why some of them need extra checking.

5420

Why consider here only narrowing to i8 or i16? Would narrowing from i64 to i32 also make sense?

samparker updated this revision to Diff 29333.Jul 9 2015, 8:05 AM
samparker edited edge metadata.

Hi Silviu,

So i've added some comments for clarification and also removed the sub operands from the chains for simplicity as I didn't notice any performance difference in the tests I was looking at. I've also added support for 64-bit > 32-bit operations and updated the test file to reflect this.

cheers,

Hi Silviu,

So i've added some comments for clarification and also removed the sub operands from the chains for simplicity as I didn't notice any performance difference in the tests I was looking at. I've also added support for 64-bit > 32-bit operations and updated the test file to reflect this.

cheers,

Thanks Sam!

Some further comments:

Subtraction is a very common operation, it would be much nicer to get it now. wrt performance numbers this would only show up in a benchmark if it would affect the hot loop _and_ the loop would end up being vectorized.

Looking at what bits we don't care about seems like the right approach to me. I think an easier way to reason about these is to have a bitmask for each value in the chain that we're looking at, indicating what bits we care about. Different operations would do different things on this bitmask, and we would compute the value of the bitmasks starting from the trunc operations.

For example:

b = add a, c means that a and c have the same bitmask as b

The same would be true for mul, left shift, all bitwise operators, and sub (since sub can be written with and add and a not).

b = shr a, 2,  and b has a bitmask of 0b00011 ( where 1 means we care, 0 don't care) means a has a bitmask of 0b01111 (technically it would be 0b01100, but this is an approximation).

This can probably be done for most operators.

Also, if you have an operation that takes a constant and you know what bits you care about it should be perfectly legal to truncate the constant and get rid of some of the bits that you don't care about.

Technically, we could set the mask values on the truncate operations and iterate using a work list until all bitmasks have converged, but that is best left for future work.

What would be your opinion on this?

Cheers,
Silviu

Hi Silviu,

So i've added some comments for clarification and also removed the sub operands from the chains for simplicity as I didn't notice any performance difference in the tests I was looking at. I've also added support for 64-bit > 32-bit operations and updated the test file to reflect this.

cheers,

Thanks Sam!

Some further comments:

Subtraction is a very common operation, it would be much nicer to get it now. wrt performance numbers this would only show up in a benchmark if it would affect the hot loop _and_ the loop would end up being vectorized.

Looking at what bits we don't care about seems like the right approach to me. I think an easier way to reason about these is to have a bitmask for each value in the chain that we're looking at, indicating what bits we care about. Different operations would do different things on this bitmask, and we would compute the value of the bitmasks starting from the trunc operations.

For example:

b = add a, c means that a and c have the same bitmask as b

The same would be true for mul, left shift, all bitwise operators, and sub (since sub can be written with and add and a not).

b = shr a, 2,  and b has a bitmask of 0b00011 ( where 1 means we care, 0 don't care) means a has a bitmask of 0b01111 (technically it would be 0b01100, but this is an approximation).

This can probably be done for most operators.

Also, if you have an operation that takes a constant and you know what bits you care about it should be perfectly legal to truncate the constant and get rid of some of the bits that you don't care about.

Technically, we could set the mask values on the truncate operations and iterate using a work list until all bitmasks have converged, but that is best left for future work.

What would be your opinion on this?

Cheers,
Silviu

Hey Silviu,

I believe this is the technique that James had told me about and mentioned that it had already been implemented somewhere in the code base, but without an interface. I didn't go for it originally because I didn't what the ramp up time of trying to understand, refactor and modify two separate passes and I'm still concerned about that time factor! The bitmask idea does sound like a much more useful and reusable approach, however I really can't commit the time to implement it, unless you believe it can slot into the current structure of my analysis.

Cheers,

samparker updated this revision to Diff 29873.Jul 16 2015, 1:01 AM

Rebased and corrected some typos. Also removed sext and zext from being nodes that are searched as they will get searched from their users anyway.

jmolloy requested changes to this revision.Jul 17 2015, 4:21 AM
jmolloy edited edge metadata.

Hi Sam,

I've got a bunch of typographical changes, but the biggest problem is still the unbounded recursion.

Cheers,

James

lib/Transforms/Vectorize/LoopVectorize.cpp
1170

All these functions should start with a lowercase letter.

3559

I'm confused as to why we need ->getScalarType() here - surely this can't be a vector type by definition, as it hasn't yet been vectorized?

5110

In what case can you get here and one of the types not be an integer?

5122

I think it'd be easier if you just did:

CandidateNarrowInstrs[VF][I] = Largest;

To remove a bunch of useless code.

5125

Braces around else.

5137

Again, use std::map::operator[] rather than ::erase and ::insert.

5198

You're not actually grabbing any value here.

5199

Here and elsewhere: if statements should either be one-liners or multi-liners. Please don't mix one-liners and braces.

if (a)
  b;
else
  c;  // GOOD

if (a) {
  b; c;
} else
  d; // BAD
5203

This is unboundedly recursive.

5245

Please use a switch here.

5264

Capital letter.

This revision now requires changes to proceed.Jul 17 2015, 4:21 AM
samparker added inline comments.Jul 20 2015, 6:14 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
5203

But isn't the recursion bound by the size of CandiateNarrowInstrs, which is defined during the first pass?

hfinkel added inline comments.
lib/Transforms/Vectorize/LoopVectorize.cpp
1182

Line is too long.

5203

What bounds that size?

samparker added inline comments.Jul 27 2015, 2:12 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
5203

The number of instructions in the loop body, so do I need to add a further more explicit limit?

hfinkel added inline comments.Aug 3 2015, 10:21 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
5203

Yes, you'd need a more-explicit limit. We can't have a recursion bounded only by the number of instructions in a block, loop, etc.

samparker abandoned this revision.Aug 26 2016, 1:03 AM