This is an archive of the discontinued LLVM Phabricator instance.

Fix codegen of stores of vectors with non byte-sized elements.
AbandonedPublic

Authored by jonpa on Jan 16 2018, 4:18 AM.

Details

Summary

https://bugs.llvm.org/show_bug.cgi?id=35520

This is completely broken currently.

Diff Detail

Event Timeline

jonpa created this revision.Jan 16 2018, 4:18 AM
jonpa added a comment.Jan 16 2018, 4:20 AM

First attempt, which seems to work, sort of:

Most failing tests regenerated (as they had already been generated previously), except for two tests:

Failing Tests (2):

LLVM :: CodeGen/SystemZ/vec-move-17.ll
LLVM :: CodeGen/X86/pr32108.ll

During these tests, an assert triggers in SelectionDAG::computeKnownBits(). It seems somewhere there is a mixup of integer width where identical widths are assumed:

(gdb) p Known
$92 = (llvm::KnownBits &) @0x3ffffffb930: {Zero = {static WORD_MAX = 18446744073709551615, U = {
      VAL = 4294967294, pVal = 0xfffffffe}, BitWidth = 32}, One = {
    static WORD_MAX = 18446744073709551615, U = {VAL = 1, pVal = 0x1}, BitWidth = 32}}
(gdb) p Known2
$93 = {Zero = {static WORD_MAX = 18446744073709551615, U = {VAL = 0, pVal = 0x0}, BitWidth = 64},
  One = {static WORD_MAX = 18446744073709551615, U = {VAL = 0, pVal = 0x0}, BitWidth = 64}}

    // Output known-1 bits are only known if set in both the LHS & RHS.                                                                                                                                             
    Known.One &= Known2.One;
->
  APInt &operator&=(const APInt &RHS) {
    assert(BitWidth == RHS.BitWidth && "Bit widths must be the same");

I am guessing that my new code is probably missing some kind of flag, but I am not sure exactly what the fix is...

(SystemZ/vec-move-17.ll was previously just supposed to compile, but now there should probably be some test as well once it actually compiles with patch).

The tests I have gathered so far are right now in CodeGen/Generic/store_nonbytesized_vecs.ll. They seem to be correctly handled by CodeGen, but I am hoping that you will help to verify this, as well as give a suggestion as to where this test should be placed. In SystemZ run just for that target, or in Generic run by multiple targets, perhaps? Currently, I just added the generated output by update_llc_test_checks.py for SystemZ.

spatel added a subscriber: spatel.Jan 16 2018, 8:52 AM

I don't see a change to DAGTypeLegalizer::SplitVecOp_STORE? (I think you're currently miscompiling the <8 x i31> testcase.)

We're generating pretty terrible code for some of the x86 testcases, but you obviously don't need to fix that here.

For the assertion failure, it looks like you're uncovering some existing bug in computeKnownBits; I can't think of any way your changes could cause that issue. Maybe some bug in the bitcast handling code?

jonpa updated this revision to Diff 130133.Jan 17 2018, 4:46 AM

I found that if I extract the element into its native width, and then truncate, everything works. I then had to narrow this handling to integer types, as I saw X86 failing with v4f80 vectors since TRUNCATE is only for integers.

About the problem in computeKnownBits:

computeKnownBits() gets called with Known of 32 bits, to analyze an EXTRACT_VECTOR_ELT, which is an v2i64 vector. Therefore Known becomes 64 bits wide, and the calling computeKnownBits() crashes on

Known.One &= Known2.One;

, as Known is 32 bits.

I fixed this by

diff --git a/lib/CodeGen/SelectionDAG/SelectionDAG.cpp b/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
index 837173e..9f98bdd 100644
--- a/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
+++ b/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
@@ -2805,6 +2805,10 @@ void SelectionDAG::computeKnownBits(SDValue Op, KnownBits &Known,
     }
     if (BitWidth > EltBitWidth)
       Known = Known.zext(BitWidth);
+    // If the extracted element is wider than BitWidth, truncate to match
+    // callers expectations.
+    if (Known.getBitWidth() > BitWidth)
+      Known = Known.trunc(BitWidth);
     break;
   }
   case ISD::INSERT_VECTOR_ELT: {

, which made the DAGCombiner work for the test case (before trying the explit truncate, which worked without this)

Not sure if this patch should still go in, or if there should be an assert that a EXTRACT_VECTOR_ELT should not implicitly truncate?

I don't see a change to DAGTypeLegalizer::SplitVecOp_STORE? (I think you're currently miscompiling the <8 x i31> testcase.)

You're right, I added the same handling there.

Tests updated.

if there should be an assert that a EXTRACT_VECTOR_ELT should not implicitly truncate?

There should be an assert, yes.

I then had to narrow this handling to integer types, as I saw X86 failing with v4f80 vectors since TRUNCATE is only for integers.

I think all you need for non-integer types is a bitcast to an integer type after the EXTRACT_VECTOR_ELT.

I think all you need for non-integer types is a bitcast to an integer type after the EXTRACT_VECTOR_ELT.

Err, actually, wait a sec... why are we triggering this path for v4f80? As long as the element size in bits is a multiple of 8, we can use separate stores for each element. (Whether the element size is a power of two isn't really important.)

jonpa updated this revision to Diff 130402.Jan 18 2018, 6:35 AM

if there should be an assert that a EXTRACT_VECTOR_ELT should not implicitly truncate?

There should be an assert, yes.

Where should this assert be located? Should it be asserted at the point of node creation, or just at the end during isel? I don't see any asserts in the getNode() method. (We could make this a separate patch, I guess)

I then had to narrow this handling to integer types, as I saw X86 failing with v4f80 vectors since TRUNCATE is only for integers.

why are we triggering this path for v4f80? As long as the element size in bits is a multiple of 8, we can use separate stores for each element.

OK, I asked this previously since I was worried that if e.g. storing a v3i24, the i24 element stores might be promoted / split / ... somehow, so that it would be safest to just build the integer.

(Whether the element size is a power of two isn't really important.)

So then we trust that the original loop in scalarizeVectorStore() will do the job, which seems to be the case.

I changed the check then to simply call isByteSized() on the vector element.

In case we get a floating point element which is not byte-sized, the assert for ISD::TRUNCATE will trigger. Since there are no such types in the language reference, I skipped that check.

Tests still pass unmodified from before.

I don't see any asserts in the getNode() method.

Really? I see a bunch. (getNode() is overloaded; make sure you're looking at the version for nodes with two operands.) But anyway, that can be a separate patch.

test/CodeGen/X86/avx512-mask-op.ll
2786

This new code is substantially worse than the old code, without solving any correctness problem.

Specifically, the key here is that you can split without scalarizing if the size of each half in bits is a multiple of 8. So on AVX-512, we can split a <64 x i1> store into four legal <16 x i1> stores.

jonpa updated this revision to Diff 130631.Jan 19 2018, 8:49 AM

I don't see any asserts in the getNode() method.

Really? I see a bunch. (getNode() is overloaded; make sure you're looking at the version for nodes with two operands.) But anyway, that can be a separate patch.

Aah, sorry, now I see it. But actually, it says

if (VT != Elt.getValueType())
   // If the vector element type is not legal, the BUILD_VECTOR operands
   // are promoted and implicitly truncated, and the result implicitly
   // extended. Make that explicit here.
   Elt = getAnyExtOrTrunc(Elt, DL, VT);

So it is not really true that there could be a general assert since this is actually even expected, or? I don't think I have a patch for this, then.

Specifically, the key here is that you can split without scalarizing if the size of each half in bits is a multiple of 8. So on AVX-512, we can split a <64 x i1> store into four legal <16 x i1> stores.

I updated the test in SplitVecOp_STORE to

+ // Scalarize if elements and the split halves are not byte-sized.
+ if (!MemoryVT.getScalarType().isByteSized() &&
+ (!LoMemVT.isByteSized() || !HiMemVT.isByteSized()))
+ return TLI.scalarizeVectorStore(N, DAG);

, which seems to work. Two stores will be produced, and the second store will have the byte-offset of the first store, which works as long as that first store is byte-sized.

Two X86 tests updated (test/CodeGen/X86/avx512-mask-op.ll, test/CodeGen/X86/vector-compare-results.ll)

efriedma accepted this revision.Jan 19 2018, 11:17 AM

LGTM

So it is not really true that there could be a general assert since this is actually even expected, or? I don't think I have a patch for this, then.

That code is specifically dealing with an EXTRACT_VECTOR_ELT from a BUILD_VECTOR; in that case, before type legalization is finished, you could have a mix of types: either the BUILD_VECTOR operand type could be wider than the EXTRACT_VECTOR_ELT result, or the EXTRACT_VECTOR_ELT result type could be wider than the BUILD_VECTOR operand. But they should both be at least as wide as the vector element type.

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
2021

I think the "MemoryVT.getScalarType().isByteSized()" check is redundant. Not a big deal either way.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3471

Extra space after assert.

This revision is now accepted and ready to land.Jan 19 2018, 11:17 AM
craig.topper added inline comments.Jan 19 2018, 11:29 AM
test/CodeGen/SystemZ/vec-move-17.ll
65

Why did this comment get deleted?

jonpa closed this revision.Jan 20 2018, 8:13 AM
jonpa marked 3 inline comments as done.

Thanks for review!

Committed as trunk@323042.

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
2021

Yes, I think you are right :-) removed.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3471

removed

test/CodeGen/SystemZ/vec-move-17.ll
65

On Jan 16, I wrote: "(SystemZ/vec-move-17.ll was previously just supposed to compile, but now there should probably be some test as well once it actually compiles with patch)."

I have now added some CHECKs and left the comment deleted.

efriedma added inline comments.Jan 20 2018, 10:00 AM
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3466

I just realized there's a problem here. I guess I've spent too much time on little-endian targets.

Specifically, the problem is that this code will store the elements in the wrong order on big-endian targets. This will lead to inconsistent, and generally messy, results.

I think you can solve this by reversing the loop on big-endian targets, so high element of the vector is in the low bits of the integer.

uabelho added inline comments.Jan 23 2018, 5:46 AM
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3466

So we have a bug for big-endian targets on trunk now?

Will you deal with this Jonas?

jonpa marked 3 inline comments as done.Jan 30 2018, 3:58 AM
jonpa added inline comments.
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3466

sorry for delay (vacation). Yes, I will propose a patch soon.

I've been talking to Jonas about the big-endian question, and I now think that there may actually be open questions about what the layout should be.

In general, these are the same questions that need to be answered when laying out a struct type including bitfields. However, for bitfields you usually have a platform ABI that specifies the precise layout. But these vector types are LLVM IR types that do not match any high-level language type directly, so really should be defined by LLVM. I was not able to find any such definition in the IR documentation -- am I missing something here?

Starting with the simple case of <8 x i1>, the question that needs to be answered is, if you convert this to an i8 by either storing to memory and loading the byte, or else doing a bitcast, where in the resulting i8 can you find the elements 0 ... 7 of the original <8 x i1>. And specifically, should the answer to that question depend on whether the system is big- or little-endian?

Note that comparing this to a bit field that implements a similar type: struct { char b0 : 1; char b1 : 1; ... char b7 : 1 } ... here, the platform ABI defines where to find the bits b0 .. b7, and this does actually differ. On (most) little-endian machines, b0 is placed in the least-significant bit of the byte in memory, while on (most) big-endian machines, b0 is placed in the most-significant bit. [ GCC in fact also supports a few rare platforms where this doesn't hold. I think those are not supported by LLVM today. ]

Should LLVM define the layout of <8 x i1> the same way? Or should it impose its own layout, which may be defined independently of byte order? (Again my question, maybe this has been defined and I just don't know where to find this information?)

Once we've decided the simple case, we can probably extend the definition to more complex cases like <7 x i1> or <2 x i3> in a straightforward manner ...

We never really formally defined the encoding of vectors, I think. The DataLayout has computed the size in bits of vector types as element_size_in_bits*num_elements for a long time, though (so there's no padding between elements), and we've generally had the rule that the lowest element has the lowest address, since that's generally how vector units work on both big-endian and little-endian targets.

Given those two constraints, there's only one possible rule for non-power-of-two byte-sized elements: <4 x i24> should have the same layout as <{ i24, i24, i24, i24 }> (a packed struct). And there's only one way to extrapolate that rule to non-byte-size element types: essentially, if you bitcast the vector to an integer, the first element should be in the high bit(s).

Yes, that makes sense to me. It also agrees with the layout of bitfields on (all currently supported) big-endian platforms, so that's another argument for it.

jonpa updated this revision to Diff 132162.Jan 31 2018, 6:20 AM

This patch follows your suggestion by reversing the order of the vector elements in the built integer for big endian targets.

SystemZ tests updated.

jonpa reopened this revision.Jan 31 2018, 6:21 AM
This revision is now accepted and ready to land.Jan 31 2018, 6:21 AM
jonpa requested review of this revision.Jan 31 2018, 6:21 AM

Could you open a new review request instead of reusing the existing one with a new patch? This is going to get confusing.

jonpa abandoned this revision.Feb 1 2018, 1:27 AM

Could you open a new review request instead of reusing the existing one with a new patch? This is going to get confusing.

sure...

test/CodeGen/X86/pr20011.ll