This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Fix a bug generating incorrect instruction when building small vector.
ClosedPublic

Authored by kevin.qin on Jul 4 2014, 2:08 AM.

Details

Summary

This bug is introduced by r211144. When building a v4i8 vector, v4i8 will be widened to v4i16 after legalization. Then the element of operand is smaller than the element of result, but previous commit can only handle the contrary condition. This patch is going to handle this scenario and to generate optimized codes like ZIP1. Please review.

Regard,
Kevin

Diff Detail

Event Timeline

kevin.qin updated this revision to Diff 11080.Jul 4 2014, 2:08 AM
kevin.qin retitled this revision from to [AArch64] Fix a bug generating incorrect instruction when building small vector..
kevin.qin updated this object.
kevin.qin edited the test plan for this revision. (Show Details)
kevin.qin added reviewers: t.p.northover, mcrosier.
kevin.qin set the repository for this revision to rL LLVM.
kevin.qin added a project: deleted.
kevin.qin added a subscriber: Unknown Object (MLST).
mcrosier edited edge metadata.Jul 4 2014, 8:12 AM

Tim should probably give the final LGTM, but I don't see any glaring issues. I'll kick of regression tests and let you know how it goes.

lib/Target/AArch64/AArch64ISelLowering.cpp
4160

Perhaps, just ResMultiplier.

The regressions have been resolved by this patch.

kevin.qin updated this revision to Diff 11137.Jul 7 2014, 7:19 PM
kevin.qin updated this object.
kevin.qin edited edge metadata.

Small changes according to Chad's comments.

Re-ping...

2014-07-12 0:13 GMT+08:00 Chad Rosier <mcrosier@codeaurora.org>:

Ping.

http://reviews.llvm.org/D4385

t.p.northover edited edge metadata.Jul 21 2014, 3:34 AM

Hi Kevin,

Sorry this took so long to review. I kept putting it off because the code looked nasty for what it was trying to achieve. It actually turned out much better than I was expecting though (I have comments, but they're superficial rather than horrified).

Cheers.

Tim.

lib/Target/AArch64/AArch64ISelLowering.cpp
4167–4172

I think this is putting too much emphasis on the v4i8 origin of the DAG rather than what's actually in front of us: a v4i16 BUILD_VECTOR made up of a set of v8i8 extractions (and even that's too much of a special case).

This loop also looked more intimidating than it needed to because of this big block comment inside. If the purpose is explained completely outside, the entire thing becomes a fairly trivial 3-4 line loop.

4173

I don't think this is correct. Imagine an output vector of v4i32, produced from SourceVecs with type v8i16 and v16i8. This situation should give ResMultiplier = 4, but it would end up as 2.

4189–4192

This comment seems stale. While you're in the area could you rephrase it to something more accurate?

4269–4270

I don't think this is generic enough. Again in the situation with two mismatched sources (so take a v4i32 VT being produced from a v4i16 and a v8i8 for example).

Anything coming from the v4i16 needs to set *two* elements of Mask properly (and then 2 undefs).

I'm not sure, but a more thorough restructuring of this loop might make the result clearer too. Don't be afraid to move things about if you see anything.

4277–4279

LLVM will omit the bitcast if it's not needed. It's simpler to just "return DAG.getNode(ISD::BITCAST, ...);".

Hi Tim,

Thanks for your review. I also hate this implementation, which is so complex and may introduce some potential bug just like what this patch is going to fix. I hope this patch could bring peace in this area some time, otherwise we should consider to refactor the whole function. Some inline comments below.

Regards,
Kevin

lib/Target/AArch64/AArch64ISelLowering.cpp
4173

Yes, you are right. ResMultiplier should multiply itself with other part.

4269–4270

For your example, v4i32 will bitcast to v16i8, which only 0, 8, 16, 24 are valid lane number, and all other lanes should be undef. The valid lane number is decided by ResMultiplier only, and independent to sources. So I think this code can cover all scenarios including two sources are mismatched(The mismatching of two sources will reflect on their OffsetMultipliers, and finally change the Mask number). But I can make a change for code style.

kevin.qin updated this revision to Diff 11739.Jul 21 2014, 10:15 PM
kevin.qin edited edge metadata.

This is the updated patch. Please review.

Regards,
Kevin

Hi Kevin,

I'm not quite convinced yet:

lib/Target/AArch64/AArch64ISelLowering.cpp
4203

It's actually far easier to just set ResMultiplier once outside the loop: you know SmallestEltTy and VT.getVectorElementType(). Those give the correct ratio.

4269–4270

OK, I don't think I was clear enough. Let's look at a concrete example. Take LHS = v4i16, RHS = v8i8, result = v2i32 and imagine something like

(BUILD_VECTOR (i32 extract_element LHS, 2),
              (i32 extract_element RHS, 0))

I think this should lead to v8i8 as the basic shuffle type (we agree there), with indices <4, 5, u, u, 8, u, u, u>. I think the current loop discards the 5, though.

See comments below.

Regards,
Kevin

lib/Target/AArch64/AArch64ISelLowering.cpp
4203

I got myself limited to the previous solution.... Get it out of the loop is a better idea.

4269–4270

If LHS = v4i16, RHS = v8i8, result = v2i32, it must be LHS = v4i8, RHS = v8i8, result = v2i8 before legalization. Because the actual lane size is decided by the smallest element size among result and 2 operands, and all other larger element sizes must come from legalization. Please correct me if I'm wrong. Based on this logic, the mask <4, u, u, u, 8, u, u, u> is correct.

If LHS = v4i16, RHS = v8i8, result = v2i32, it must be LHS = v4i8, RHS = v8i8, result = v2i8 before legalization. Because the actual lane size is decided by the smallest element size among result and 2 operands, and all other larger element sizes must come from legalization. Please correct me if I'm wrong. Based on this logic, the mask <4, u, u, u, 8, u, u, u> is correct.

You shouldn't be assuming that these nodes come from some specific
path through legalisation that you understand. The node we're given
has clear semantics, no matter how it got to us, and it says that 16
bits from LHS need to be inserted into the final vector.

This isn't even theoretical, see the attached file for something that
reaches ReconstructShuffle very much as I described (I had to use
v4i32 as the result since only results with >= 4 vectors call
ReconstructShuffle).

Cheers.

Tim.

Actually, the more I think about that .ll file the more I worry that it's really showing a bug in DAGCombiner. It just happened to be the first AND-elision code I spotted that I thought I could exploit.

I think the point still stands though: these nodes can come from anywhere and we have to deal with them properly.

Cheers.

Tim.

kevin.qin updated this revision to Diff 11806.Jul 23 2014, 12:47 AM

Hi Tim,

Here is the new revision adding codes to compute how many parts of split lane should be inserted. Please have a look.

Thanks,
Kevin

t.p.northover accepted this revision.Jul 23 2014, 2:48 AM
t.p.northover edited edge metadata.

Hi Kevin,

I think the function is correct now, but that second test is pretty much useless if it really is relying on a bug in the DAG-combiner. As soon as someone fixes that it'll no longer even exercise the code it's supposed to.

I've got one or two ideas to improve this function with refactoring, though, so I think you should probably just commit and then I can take a look afterwards.

So LGTM! Thanks for working on this for so long (and being patient at the beginning).

Cheers.

Tim.

This revision is now accepted and ready to land.Jul 23 2014, 2:48 AM

Hi Tim,

I removed the new test and committed at r213830.

Cheers,
Kevin

mcrosier closed this revision.Jul 29 2014, 12:28 PM

Committed in r213830.