Differential D42100
Fix codegen of stores of vectors with non byte-sized elements
Authored by jonpa on Jan 16 2018, 4:18 AM.

Details
https://bugs.llvm.org/show_bug.cgi?id=35520
This is completely broken currently.
Comment Actions

First attempt, which seems to work, sort of:

Most failing tests regenerated (as they had already been generated previously), except for two tests:

Failing Tests (2):
    LLVM :: CodeGen/SystemZ/vec-move-17.ll
    LLVM :: CodeGen/X86/pr32108.ll

During these tests, an assert triggers in SelectionDAG::computeKnownBits(). It seems that somewhere there is a mixup of integer widths where identical widths are assumed:

(gdb) p Known
$92 = (llvm::KnownBits &) @0x3ffffffb930: {Zero = {static WORD_MAX = 18446744073709551615, U = {VAL = 4294967294, pVal = 0xfffffffe}, BitWidth = 32}, One = {static WORD_MAX = 18446744073709551615, U = {VAL = 1, pVal = 0x1}, BitWidth = 32}}
(gdb) p Known2
$93 = {Zero = {static WORD_MAX = 18446744073709551615, U = {VAL = 0, pVal = 0x0}, BitWidth = 64}, One = {static WORD_MAX = 18446744073709551615, U = {VAL = 0, pVal = 0x0}, BitWidth = 64}}

// Output known-1 bits are only known if set in both the LHS & RHS.
Known.One &= Known2.One;

->

APInt &operator&=(const APInt &RHS) {
  assert(BitWidth == RHS.BitWidth && "Bit widths must be the same");

I am guessing that my new code is probably missing some kind of flag, but I am not sure exactly what the fix is... (SystemZ/vec-move-17.ll was previously just supposed to compile, but now there should probably be some checks as well once it actually compiles with the patch.)

The tests I have gathered so far are currently in CodeGen/Generic/store_nonbytesized_vecs.ll. They seem to be correctly handled by CodeGen, but I am hoping that you will help verify this, and also suggest where this test should be placed: in SystemZ, run just for that target, or in Generic, run by multiple targets, perhaps? Currently, I just added the output generated by update_llc_test_checks.py for SystemZ.

Comment Actions

I don't see a change to DAGTypeLegalizer::SplitVecOp_STORE? (I think you're currently miscompiling the <8 x i31> testcase.)
We're generating pretty terrible code for some of the x86 testcases, but you obviously don't need to fix that here. For the assertion failure, it looks like you're uncovering some existing bug in computeKnownBits; I can't think of any way your changes could cause that issue. Maybe some bug in the bitcast handling code?

Comment Actions

I found that if I extract the element into its native width and then truncate, everything works. I then had to narrow this handling to integer types, as I saw X86 failing with v4f80 vectors, since TRUNCATE is only for integers.

About the problem in computeKnownBits: computeKnownBits() gets called with Known of 32 bits to analyze an EXTRACT_VECTOR_ELT of a v2i64 vector. Therefore Known becomes 64 bits wide, and the calling computeKnownBits() crashes on Known.One &= Known2.One;, as Known is 32 bits. I fixed this by

diff --git a/lib/CodeGen/SelectionDAG/SelectionDAG.cpp b/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
index 837173e..9f98bdd 100644
--- a/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
+++ b/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
@@ -2805,6 +2805,10 @@ void SelectionDAG::computeKnownBits(SDValue Op, KnownBits &Known,
     }
     if (BitWidth > EltBitWidth)
       Known = Known.zext(BitWidth);
+    // If the extracted element is wider than BitWidth, truncate to match
+    // the caller's expectations.
+    if (Known.getBitWidth() > BitWidth)
+      Known = Known.trunc(BitWidth);
     break;
   }
   case ISD::INSERT_VECTOR_ELT: {

which made the DAGCombiner work for the test case (before I tried the explicit truncate, which worked without this). Not sure if this patch should still go in, or if there should be an assert that an EXTRACT_VECTOR_ELT must not implicitly truncate?
You're right, I added the same handling there. Tests updated. Comment Actions
There should be an assert, yes.
I think all you need for non-integer types is a bitcast to an integer type after the EXTRACT_VECTOR_ELT. Comment Actions
Err, actually, wait a sec... why are we triggering this path for v4f80? As long as the element size in bits is a multiple of 8, we can use separate stores for each element. (Whether the element size is a power of two isn't really important.)

Comment Actions

Where should this assert be located? Should it be asserted at the point of node creation, or just at the end during isel? I don't see any asserts in the getNode() method. (We could make this a separate patch, I guess.)
OK. I asked about this previously because I was worried that when storing e.g. a v3i24, the i24 element stores might be promoted / split / ... somehow, so it seemed safest to just build the integer.
So then we trust that the original loop in scalarizeVectorStore() will do the job, which seems to be the case. I changed the check to simply call isByteSized() on the vector element type. If we get a floating-point element that is not byte-sized, the assert for ISD::TRUNCATE will trigger; since there are no such types in the language reference, I skipped that check. Tests still pass, unmodified from before.

Comment Actions
Really? I see a bunch. (getNode() is overloaded; make sure you're looking at the version for nodes with two operands.) But anyway, that can be a separate patch.
Comment Actions

Aah, sorry, now I see it. But actually, it says:

if (VT != Elt.getValueType())
  // If the vector element type is not legal, the BUILD_VECTOR operands
  // are promoted and implicitly truncated, and the result implicitly
  // extended. Make that explicit here.
  Elt = getAnyExtOrTrunc(Elt, DL, VT);

So it is not really true that there could be a general assert, since this is actually expected, no? I don't think I have a patch for this, then.
I updated the check in SplitVecOp_STORE to

+  // Scalarize if elements and the split halves are not byte-sized.

which seems to work. Two stores will be produced, and the second store will be at the byte offset given by the size of the first store, which works as long as that first store is byte-sized. Two X86 tests updated (test/CodeGen/X86/avx512-mask-op.ll, test/CodeGen/X86/vector-compare-results.ll).

Comment Actions

LGTM
That code is specifically dealing with an EXTRACT_VECTOR_ELT from a BUILD_VECTOR; in that case, before type legalization is finished, you could have a mix of types: either the BUILD_VECTOR operand type could be wider than the EXTRACT_VECTOR_ELT result, or the EXTRACT_VECTOR_ELT result type could be wider than the BUILD_VECTOR operand. But they should both be at least as wide as the vector element type.
Comment Actions Thanks for review! Committed as trunk@323042.
Comment Actions

I've been talking to Jonas about the big-endian question, and I now think that there may actually be open questions about what the layout should be. In general, these are the same questions that need to be answered when laying out a struct type including bitfields. However, for bitfields you usually have a platform ABI that specifies the precise layout. But these vector types are LLVM IR types that do not match any high-level language type directly, so they really should be defined by LLVM. I was not able to find any such definition in the IR documentation -- am I missing something here?

Starting with the simple case of <8 x i1>, the question that needs to be answered is: if you convert this to an i8, either by storing it to memory and loading the byte, or else by doing a bitcast, where in the resulting i8 can you find the elements 0 ... 7 of the original <8 x i1>? And specifically, should the answer to that question depend on whether the system is big- or little-endian?

Note that a bitfield implementing a similar type:

struct { char b0 : 1; char b1 : 1; ... char b7 : 1 };

... here, the platform ABI defines where to find the bits b0 .. b7, and this does actually differ. On (most) little-endian machines, b0 is placed in the least-significant bit of the byte in memory, while on (most) big-endian machines, b0 is placed in the most-significant bit. [GCC in fact also supports a few rare platforms where this doesn't hold. I think those are not supported by LLVM today.]

Should LLVM define the layout of <8 x i1> the same way? Or should it impose its own layout, which may be defined independently of byte order? (Again, maybe this has been defined and I just don't know where to find this information?)

Once we've decided the simple case, we can probably extend the definition to more complex cases like <7 x i1> or <2 x i3> in a straightforward manner...

Comment Actions

We never really formally defined the encoding of vectors, I think.
The DataLayout has computed the size in bits of vector types as element_size_in_bits * num_elements for a long time, though (so there's no padding between elements), and we've generally had the rule that the lowest element has the lowest address, since that's generally how vector units work on both big-endian and little-endian targets. Given those two constraints, there's only one possible rule for non-power-of-two byte-sized elements: <4 x i24> should have the same layout as <{ i24, i24, i24, i24 }> (a packed struct). And there's only one way to extrapolate that rule to non-byte-sized element types: essentially, if you bitcast the vector to an integer, the first element should be in the high bit(s).

Comment Actions

Yes, that makes sense to me. It also agrees with the layout of bitfields on (all currently supported) big-endian platforms, so that's another argument for it.

Comment Actions

This patch follows your suggestion by reversing the order of the vector elements in the built integer for big-endian targets. SystemZ tests updated.

Comment Actions

Could you open a new review request instead of reusing the existing one with a new patch? This is going to get confusing.
I just realized there's a problem here. I guess I've spent too much time on little-endian targets.
Specifically, the problem is that this code will store the elements in the wrong order on big-endian targets. This will lead to inconsistent, and generally messy, results.
I think you can solve this by reversing the loop on big-endian targets, so that the high element of the vector is in the low bits of the integer.