This is an archive of the discontinued LLVM Phabricator instance.

Differential D70942

[LegalizeTypes] Bugfixes for big-endian targets when handling BITCASTs
Closed, Public

Authored by uabelho on Dec 3 2019, 12:13 AM.

Details

Reviewers

bogner
spatel
craig.topper
t.p.northover
dmgreen
efriedma
SjoerdMeijer
samparker

Commits

rG4763267eeee7: [LegalizeTypes] Bugfixes for big-endian targets when handling BITCASTs

Summary

This fixes PR44135.

The special case when we promote a bitcast from a vector to an int
needs special handling when we are on a big-endian target.

Prior to this fix, for the added vec_to_int we see the following in the
SelectionDAG printouts

Type-legalized selection DAG: %bb.1 'foo:bb.1'
SelectionDAG has 9 nodes:

t0: ch = EntryToken
      t2: v8i16,ch = CopyFromReg t0, Register:v8i16 %0
    t17: v4i32 = bitcast t2
  t23: i32 = extract_vector_elt t17, Constant:i32<3>
t8: ch,glue = CopyToReg t0, Register:i32 $r0, t23
t9: ch = ARMISD::RET_FLAG t8, Register:i32 $r0, t8:1

and I think here the extract_vector_elt is wrong and extracts the value
from the wrong index.

The program should return the 32 bits made up of the elements at
index 4 and 5 in the vec6 array, but with

t23: i32 = extract_vector_elt t17, Constant:i32<3>

as far as I can tell, we will extract values that originally didn't even
exist in the vec6 vector.

If we would instead extract the element at index 2 we would get the wanted
values.

With this fix we insert a right shift after the bitcast in
DAGTypeLegalizer::PromoteIntRes_BITCAST which then gives us

Type-legalized selection DAG: %bb.1 'vec_to_int:bb.1'
SelectionDAG has 9 nodes:

t0: ch = EntryToken
      t2: v8i16,ch = CopyFromReg t0, Register:v8i16 %0
    t23: v4i32 = bitcast t2
  t27: i32 = extract_vector_elt t23, Constant:i32<2>
t8: ch,glue = CopyToReg t0, Register:i32 $r0, t27
t9: ch = ARMISD::RET_FLAG t8, Register:i32 $r0, t8:1

So now we get

t27: i32 = extract_vector_elt t23, Constant:i32<2>

which is what we want.
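To illustrate why the shift is needed, here is a minimal, scaled-down model rather than the actual legalizer code. It assumes a hypothetical <3 x i16> widened to <4 x i16> and bitcast to a 64-bit integer, instead of the <6 x i16> to <8 x i16> case from the test. The function names and the padding marker are made up for the sketch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical scaled-down sizes: <3 x i16> (48 bits) widened to <4 x i16>.
constexpr unsigned kOrigVecBits = 48;
constexpr unsigned kWideVecBits = 64;

// Big-endian bitcast of four 16-bit lanes to one 64-bit integer: lane 0 sits
// at the lowest address, so it ends up in the most significant bits.
uint64_t bitcastLanesToIntBE(const std::array<uint16_t, 4> &lanes) {
  uint64_t value = 0;
  for (uint16_t lane : lanes)
    value = (value << 16) | lane;
  return value;
}

// Widen the vector with one padding lane, bitcast it, and optionally apply
// the right shift that the fix adds on big-endian targets.
uint64_t widenedVecToInt(const std::array<uint16_t, 3> &vec, uint16_t padding,
                         bool applyShift) {
  std::array<uint16_t, 4> widened = {vec[0], vec[1], vec[2], padding};
  uint64_t res = bitcastLanesToIntBE(widened);
  if (applyShift) {
    unsigned shiftAmt = kWideVecBits - kOrigVecBits; // widened minus original
    assert(shiftAmt < kWideVecBits && "Too large shift amount!");
    res >>= shiftAmt;
  }
  return res;
}
```

In this model, truncating the unshifted result to 32 bits picks up the padding lane, which mirrors the "values that originally didn't even exist" observation above. With the shift, the low 32 bits are the last two real elements.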

Similarly, the new int_to_vec testcase exposes a bug where we cast in the other
direction. Then we instead need to add a left shift before the bitcast on
big-endian targets for the bits in the input integer to end up at the expected
place in the vector.
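The int-to-vector direction can be sketched the same way. This is again a scaled-down model under stated assumptions, not the legalizer itself: a hypothetical 48-bit integer promoted to 64 bits and bitcast to <4 x i16>, with made-up function names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical scaled-down sizes: i48 promoted to i64, bitcast to <4 x i16>.
constexpr unsigned kOrigIntBits = 48;
constexpr unsigned kPromotedIntBits = 64;

// Big-endian bitcast of a 64-bit integer to four 16-bit lanes: lane 0 takes
// the most significant 16 bits.
std::array<uint16_t, 4> bitcastIntToLanesBE(uint64_t value) {
  std::array<uint16_t, 4> lanes{};
  for (unsigned i = 0; i < 4; ++i)
    lanes[i] = static_cast<uint16_t>(value >> (48 - 16 * i));
  return lanes;
}

// The promoted integer keeps its value in the low 48 bits. Optionally apply
// the left shift that the fix adds before the bitcast on big-endian targets.
std::array<uint16_t, 4> promotedIntToWidenedVec(uint64_t promoted,
                                                bool applyShift) {
  if (applyShift) {
    unsigned shiftAmt = kPromotedIntBits - kOrigIntBits; // promoted minus original
    assert(shiftAmt < kPromotedIntBits && "Too large shift amount!");
    promoted <<= shiftAmt;
  }
  return bitcastIntToLanesBE(promoted);
}
```

Without the shift, lane 0 of the model holds the promoted integer's upper padding bits instead of the most significant 16 bits of the original value. With the shift, the original bits start at lane 0.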

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

uabelho created this revision. Dec 3 2019, 12:13 AM

Herald added a project: Restricted Project. Dec 3 2019, 12:13 AM

Herald added subscribers: hiraditya, kristof.beyls.

Originally triggered this problem for our out-of-tree target but I think it's exposed on ARM as well.

The problem (though an instcombine version of it) was discussed briefly here:
http://lists.llvm.org/pipermail/llvm-dev/2019-November/137297.html

I'm not very familiar with ARM though and I'm not very happy with the CHECKS, especially not for the int_to_vec
case, so any suggestions there from someone who knows ARM would be good.

uabelho added a subscriber: bjope. Dec 3 2019, 12:20 AM

spatel added inline comments. Dec 5 2019, 11:16 AM

llvm/test/CodeGen/ARM/legalize-bitcast.ll
Line 1: The bug is still visible if you let this RUN all the way to asm output, right? If so, I think that would be easier to read, especially if we can auto-generate the CHECK lines using "utils/update_llc_test_checks.py". Either way, I think it's better to pre-commit this test to trunk showing the wrong output. That way, this patch will show the diff that creates the correct code. (And if for some reason this patch needs to be reverted, it will be clear that we have reverted to code that creates a miscompile.)

Added pre-commit of the testcase, changed testcase to run all the way to ASM output so we can use utils/update_llc_test_checks.py.

Thanks for the suggestion @spatel. Updated the testcase.

llvm/test/CodeGen/ARM/legalize-bitcast.ll
Line 1: I think so. At least we get diffs in the ASM output without/with the fixes applied. I'm really not very good at reading ARM assembler though, so it's not at all obvious to me what the resulting code really does.

Adding ARM asm experts to verify that the code is correct...

LGTM. (Left some notes on the testcase if anyone else wants to follow.)

llvm/test/CodeGen/ARM/legalize-bitcast.ll
Line 21: We split the load into two parts, so we don't read past the end. For the first 64 bits, we load it as `<8 x i8>`, then bitcast it to `<4 x i16>` (using vrev). For the last 32 bits, we load it as i32, then swap the high and low halves. Then we concatenate the two, and spill to a stack slot as a native `<8 x i16>`.
Line 25: The vrev32.16 essentially bitcasts the `<8 x i16>` to `<4 x i32>`. Then we extract the third element of the vector. This corresponds to the scalar load in the first block, i.e. the last 4 bytes of the load. Looks correct.
Line 47: So the value we want to return here is the first two bytes, if the i80 were stored in memory. On a big-endian target, that's the two most significant bytes. With LLVM's default calling convention, those bytes end up in the low bits of r0. (This took me a little while to figure out; it would probably be easier to understand if the i80 value was loaded from memory, instead of passed in registers.) The sequence here is that we shift r0 left 16 bits, move it to d16, move that to d18, "shift" the value right 16 bits using vrev, then move it back to r0. That seems correct.
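For readers following the notes above, the lane permutation performed by vrev32.16 can be modeled as below. This is only a sketch of the element reordering (the two 16-bit halves of every 32-bit word are swapped); the function name is made up, and why this acts as a bitcast on big-endian ARM is taken from the comments above rather than shown here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Model of the vrev32.16 lane permutation on a 128-bit register viewed as
// eight 16-bit lanes: within each 32-bit word the two halfwords trade places.
std::array<uint16_t, 8> vrev32_16(const std::array<uint16_t, 8> &in) {
  std::array<uint16_t, 8> out{};
  for (std::size_t i = 0; i < in.size(); i += 2) {
    out[i] = in[i + 1];
    out[i + 1] = in[i];
  }
  return out;
}
```

Applying the permutation twice gives back the original lane order, which is consistent with the same instruction appearing in both the vector-to-int and int-to-vector sequences.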

This revision is now accepted and ready to land. Dec 6 2019, 12:07 PM

In D70942#1773282, @efriedma wrote:

LGTM. (Left some notes on the testcase if anyone else wants to follow.)

Thanks for the analysis @eli.friedman!

If I change the int_to_vec test so it loads from memory instead of passing the i80 as a parameter I get

define i16 @int_to_vec() {
; CHECK-LABEL: int_to_vec:
; CHECK:       @ %bb.0:
; CHECK-NEXT:    movw r0, :lower16:i80_p
; CHECK-NEXT:    movt r0, :upper16:i80_p
; CHECK-NEXT:    ldr r1, [r0]
; CHECK-NEXT:    ldr r2, [r0, #4]
; CHECK-NEXT:    ldrh r0, [r0, #8]
; CHECK-NEXT:    orr r0, r0, r2, lsl #16
; CHECK-NEXT:    lsl r3, r1, #16
; CHECK-NEXT:    orr r2, r3, r2, lsr #16
; CHECK-NEXT:    lsr r1, r1, #16
; CHECK-NEXT:    b .LBB1_1
; CHECK-NEXT:  .LBB1_1: @ %bb.1
; CHECK-NEXT:    vmov.i32 q8, #0x0
; CHECK-NEXT:    vrev32.16 q8, q8
; CHECK-NEXT:    @ kill: def $d16 killed $d16 killed $q8
; CHECK-NEXT:    vmov.u16 r0, d16[0]
; CHECK-NEXT:    bx lr
  %i80 = load i80, i80* @i80_p, align 1
  br label %bb.1

bb.1:
  %vec = bitcast i80 %i80 to <5 x i16>
  %e0 = extractelement <5 x i16> %vec, i32 0
  ret i16 %e0
}

without the fix and

define i16 @int_to_vec() {
; CHECK-LABEL: int_to_vec:
; CHECK:       @ %bb.0:
; CHECK-NEXT:    sub sp, sp, #8
; CHECK-NEXT:    movw r0, :lower16:i80_p
; CHECK-NEXT:    movt r0, :upper16:i80_p
; CHECK-NEXT:    ldr r1, [r0]
; CHECK-NEXT:    ldr r2, [r0, #4]
; CHECK-NEXT:    ldrh r0, [r0, #8]
; CHECK-NEXT:    orr r0, r0, r2, lsl #16
; CHECK-NEXT:    lsl r3, r1, #16
; CHECK-NEXT:    orr r2, r3, r2, lsr #16
; CHECK-NEXT:    lsr r1, r1, #16
; CHECK-NEXT:    str r2, [sp, #4] @ 4-byte Spill
; CHECK-NEXT:    str r1, [sp] @ 4-byte Spill
; CHECK-NEXT:    b .LBB1_1
; CHECK-NEXT:  .LBB1_1: @ %bb.1
; CHECK-NEXT:    ldr r0, [sp] @ 4-byte Reload
; CHECK-NEXT:    lsl r1, r0, #16
; CHECK-NEXT:    ldr r2, [sp, #4] @ 4-byte Reload
; CHECK-NEXT:    orr r1, r1, r2, lsr #16
; CHECK-NEXT:    @ implicit-def: $d16
; CHECK-NEXT:    vmov.32 d16[0], r1
; CHECK-NEXT:    @ implicit-def: $q9
; CHECK-NEXT:    vmov.f64 d18, d16
; CHECK-NEXT:    vrev32.16 q9, q9
; CHECK-NEXT:    @ kill: def $d18 killed $d18 killed $q9
; CHECK-NEXT:    vmov.u16 r0, d18[0]
; CHECK-NEXT:    add sp, sp, #8
; CHECK-NEXT:    bx lr
  %i80 = load i80, i80* @i80_p, align 1
  br label %bb.1

bb.1:
  %vec = bitcast i80 %i80 to <5 x i16>
  %e0 = extractelement <5 x i16> %vec, i32 0
  ret i16 %e0
}

with, so at least the fix makes a difference in that case too.
If you prefer that I can update the testcase.

Otherwise I'll just commit as is now.

Thanks!

The load version is a lot more instructions, so it's probably not worth it... the current patch should be fine as-is.

In D70942#1776042, @efriedma wrote:

The load version is a lot more instructions, so it's probably not worth it... the current patch should be fine as-is.

Ok, Thanks!

Closed by commit rG4763267eeee7: [LegalizeTypes] Bugfixes for big-endian targets when handling BITCASTs (authored by uabelho). Dec 10 2019, 2:32 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                      Size
llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp    17 lines
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp     23 lines
llvm/test/CodeGen/ARM/legalize-bitcast.ll                 21 lines

Diff 233027

llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp

@@ if (!NOutVT.isVector()) {
         return DAG.getNode(ISD::BITCAST, dl, NOutVT, InOp);
       }
       break;
     }
   case TargetLowering::TypeWidenVector:
     // The input is widened to the same size. Convert to the widened value.
     // Make sure that the outgoing value is not a vector, because this would
     // make us bitcast between two vectors which are legalized in different ways.
-    if (NOutVT.bitsEq(NInVT) && !NOutVT.isVector())
-      return DAG.getNode(ISD::BITCAST, dl, NOutVT, GetWidenedVector(InOp));
+    if (NOutVT.bitsEq(NInVT) && !NOutVT.isVector()) {
+      SDValue Res =
+        DAG.getNode(ISD::BITCAST, dl, NOutVT, GetWidenedVector(InOp));
+
+      // For big endian targets we need to shift the casted value or the
+      // interesting bits will end up at the wrong place.
+      if (DAG.getDataLayout().isBigEndian()) {
+        unsigned ShiftAmt = NInVT.getSizeInBits() - InVT.getSizeInBits();
+        EVT ShiftAmtTy = TLI.getShiftAmountTy(NOutVT, DAG.getDataLayout());
+        assert(ShiftAmt < NOutVT.getSizeInBits() && "Too large shift amount!");
+        Res = DAG.getNode(ISD::SRL, dl, NOutVT, Res,
+                          DAG.getConstant(ShiftAmt, dl, ShiftAmtTy));
+      }
+      return Res;
+    }
     // If the output type is also a vector and widening it to the same size
     // as the widened input type would be a legal type, we can widen the bitcast
     // and handle the promotion after.
     if (NOutVT.isVector()) {
       unsigned WidenInSize = NInVT.getSizeInBits();
       unsigned OutSize = OutVT.getSizeInBits();
       if (WidenInSize % OutSize == 0) {
         unsigned Scale = WidenInSize / OutSize;

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp

@@ SDValue DAGTypeLegalizer::WidenVecRes_BITCAST(SDNode *N) {
   EVT InVT = InOp.getValueType();
   EVT VT = N->getValueType(0);
   EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), VT);
   SDLoc dl(N);

   switch (getTypeAction(InVT)) {
   case TargetLowering::TypeLegal:
     break;
-  case TargetLowering::TypePromoteInteger:
+  case TargetLowering::TypePromoteInteger: {
     // If the incoming type is a vector that is being promoted, then
     // we know that the elements are arranged differently and that we
     // must perform the conversion using a stack slot.
     if (InVT.isVector())
       break;

     // If the InOp is promoted to the same size, convert it. Otherwise,
     // fall out of the switch and widen the promoted input.
-    InOp = GetPromotedInteger(InOp);
-    InVT = InOp.getValueType();
-    if (WidenVT.bitsEq(InVT))
-      return DAG.getNode(ISD::BITCAST, dl, WidenVT, InOp);
+    SDValue NInOp = GetPromotedInteger(InOp);
+    EVT NInVT = NInOp.getValueType();
+    if (WidenVT.bitsEq(NInVT)) {
+      // For big endian targets we need to shift the input integer or the
+      // interesting bits will end up at the wrong place.
+      if (DAG.getDataLayout().isBigEndian()) {
+        unsigned ShiftAmt = NInVT.getSizeInBits() - InVT.getSizeInBits();
+        EVT ShiftAmtTy = TLI.getShiftAmountTy(NInVT, DAG.getDataLayout());
+        assert(ShiftAmt < WidenVT.getSizeInBits() && "Too large shift amount!");
+        NInOp = DAG.getNode(ISD::SHL, dl, NInVT, NInOp,
+                            DAG.getConstant(ShiftAmt, dl, ShiftAmtTy));
+      }
+      return DAG.getNode(ISD::BITCAST, dl, WidenVT, NInOp);
+    }
+    InOp = NInOp;
+    InVT = NInVT;
     break;
+  }
   case TargetLowering::TypeSoftenFloat:
   case TargetLowering::TypePromoteFloat:
   case TargetLowering::TypeExpandInteger:
   case TargetLowering::TypeExpandFloat:
   case TargetLowering::TypeScalarizeVector:
   case TargetLowering::TypeSplitVector:
     break;
   case TargetLowering::TypeWidenVector:
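Both hunks compute the shift amount as the widened (or promoted) size minus the original size. The arithmetic for the vec_to_int test can be worked through as below. The helper names are made up for this sketch, and the 128-bit promoted size used for i80 is an assumption that this page does not state explicitly.

```cpp
// Shift amount as computed in both hunks: wider size minus original size.
constexpr unsigned bigEndianShiftAmt(unsigned wideBits, unsigned origBits) {
  return wideBits - origBits;
}

// Index of the big-endian vector element that supplies the low eltBits bits
// of the integer once it has been shifted right by shiftAmt. Element 0 holds
// the most significant bits on a big-endian target.
constexpr unsigned elementHoldingLowBitsBE(unsigned totalBits, unsigned eltBits,
                                           unsigned shiftAmt) {
  return (totalBits - shiftAmt) / eltBits - 1;
}

// vec_to_int: <6 x i16> (96 bits) is widened to <8 x i16> (128 bits).
static_assert(bigEndianShiftAmt(128, 96) == 32, "shift for vec_to_int");
// No shift: the truncated i32 comes from element 3 of the v4i32.
static_assert(elementHoldingLowBitsBE(128, 32, 0) == 3, "index before the fix");
// Shift by 32: the truncated i32 comes from element 2.
static_assert(elementHoldingLowBitsBE(128, 32, 32) == 2, "index after the fix");

// int_to_vec: i80, ASSUMING it is promoted to a 128-bit integer.
static_assert(bigEndianShiftAmt(128, 80) == 48, "shift for int_to_vec");
```

The two element indices match the extract_vector_elt constants in the SelectionDAG dumps from the summary (3 before the fix, 2 after it).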

llvm/test/CodeGen/ARM/legalize-bitcast.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -O0 -mtriple=armebv7 -target-abi apcs -o - %s | FileCheck %s

 @vec6_p = external global <6 x i16>

 define i32 @vec_to_int() {
 ; CHECK-LABEL: vec_to_int:
 ; CHECK: @ %bb.0: @ %bb.0
 ; CHECK-NEXT: push {r4}
 ; CHECK-NEXT: sub sp, sp, #28
 ; CHECK-NEXT: movw r0, :lower16:vec6_p
 ; CHECK-NEXT: movt r0, :upper16:vec6_p
 ; CHECK-NEXT: vld1.8 {d16}, [r0]!
 ; CHECK-NEXT: ldr r0, [r0]
 ; CHECK-NEXT: @ implicit-def: $d17
 ; CHECK-NEXT: vmov.32 d17[0], r0
 ; CHECK-NEXT: vrev32.16 d17, d17
 ; CHECK-NEXT: vrev16.8 d16, d16
 ; CHECK-NEXT: vmov.f64 d18, d16
 ; CHECK-NEXT: vmov.f64 d19, d17
 ; CHECK-NEXT: vstmia sp, {d18, d19} @ 16-byte Spill
 ; CHECK-NEXT: b .LBB0_1
 ; CHECK-NEXT: .LBB0_1: @ %bb.1
 ; CHECK-NEXT: vldmia sp, {d16, d17} @ 16-byte Reload
 ; CHECK-NEXT: vrev32.16 q9, q8
 ; CHECK-NEXT: @ kill: def $d19 killed $d19 killed $q9
-; CHECK-NEXT: vmov.32 r0, d19[1]
+; CHECK-NEXT: vmov.32 r0, d19[0]
 ; CHECK-NEXT: add sp, sp, #28
 ; CHECK-NEXT: pop {r4}
 ; CHECK-NEXT: bx lr
 bb.0:
   %vec6 = load <6 x i16>, <6 x i16>* @vec6_p, align 1
   br label %bb.1

 bb.1:
   %0 = bitcast <6 x i16> %vec6 to i96
   %1 = trunc i96 %0 to i32
   ret i32 %1
 }

 define i16 @int_to_vec(i80 %in) {
 ; CHECK-LABEL: int_to_vec:
 ; CHECK: @ %bb.0:
-; CHECK-NEXT: sub sp, sp, #4
-; CHECK-NEXT: vmov.i32 q8, #0x0
-; CHECK-NEXT: vrev32.16 q8, q8
-; CHECK-NEXT: @ kill: def $d16 killed $d16 killed $q8
-; CHECK-NEXT: vmov.u16 r3, d16[0]
-; CHECK-NEXT: str r0, [sp] @ 4-byte Spill
-; CHECK-NEXT: mov r0, r3
-; CHECK-NEXT: add sp, sp, #4
+; CHECK-NEXT: mov r3, r1
+; CHECK-NEXT: mov r12, r0
+; CHECK-NEXT: lsl r0, r0, #16
+; CHECK-NEXT: orr r0, r0, r1, lsr #16
+; CHECK-NEXT: @ implicit-def: $d16
+; CHECK-NEXT: vmov.32 d16[0], r0
+; CHECK-NEXT: @ implicit-def: $q9
+; CHECK-NEXT: vmov.f64 d18, d16
+; CHECK-NEXT: vrev32.16 q9, q9
+; CHECK-NEXT: @ kill: def $d18 killed $d18 killed $q9
+; CHECK-NEXT: vmov.u16 r0, d18[0]
 ; CHECK-NEXT: bx lr
   %vec = bitcast i80 %in to <5 x i16>
   %e0 = extractelement <5 x i16> %vec, i32 0
   ret i16 %e0
 }