This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Fix incorrect LD1 of 16-bit FP vectors in big endian
Closed, Public

Authored by pbarrio on Jan 9 2018, 7:25 AM.

Details

Reviewers

jmolloy
olista01
craig.topper

Commits

rGf2c29571da71: [AArch64] Fix incorrect LD1 of 16-bit FP vectors in big endian
rL322663: [AArch64] Fix incorrect LD1 of 16-bit FP vectors in big endian

Summary

Loading a vector of 4 half-precision FP sometimes results in an LD1
of 2 single-precision FP + a reversal. This results in an incorrect
byte swap due to the conversion from little endian to big endian.

In order to generate the correct byte swap, it is easier to
generate the correct LD1 of 4 half-precision FP, thus avoiding the
subsequent reversal.
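
To make the failure concrete, here is a minimal Python model (not LLVM code) of the two load sequences. It assumes the usual big-endian NEON behaviour, in which LD1 byte-swaps each element according to the element size in its arrangement specifier and places element i in lane i. The helper names ld1_big_endian and rev64_h are invented for this sketch.

```python
def ld1_big_endian(mem, elem_bytes):
    """Model of a big-endian LD1: element i is read MSB-first into lane i.

    The register is returned as a list of 16-bit lanes, lane 0 being the
    least significant 16 bits of the register.
    """
    lanes = []
    for off in range(0, len(mem), elem_bytes):
        elem = int.from_bytes(mem[off:off + elem_bytes], "big")
        for k in range(elem_bytes // 2):
            lanes.append((elem >> (16 * k)) & 0xFFFF)
    return lanes


def rev64_h(lanes):
    """Model of REV64 on 16-bit lanes: reverse the lanes in each 64-bit chunk."""
    out = []
    for i in range(0, len(lanes), 4):
        out.extend(reversed(lanes[i:i + 4]))
    return out


# Four half-precision values, represented by their raw 16-bit patterns.
halves = [0x1111, 0x2222, 0x3333, 0x4444]
memory = b"".join(h.to_bytes(2, "big") for h in halves)

direct = ld1_big_endian(memory, 2)           # ld1 { v0.4h }, [x0]
buggy = rev64_h(ld1_big_endian(memory, 4))   # ld1 { v0.2s }, [x0] ; rev64 v0.4h, v0.4h

print([hex(h) for h in direct])
print([hex(h) for h in buggy])
```

Under these assumptions the direct .4h load returns the four values in order, whereas the .2s load followed by rev64 leaves the two 32-bit halves of the vector swapped.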

Diff Detail

Repository

rL LLVM

Build Status

Buildable 13869
Build 13869: arc lint + arc unit

Event Timeline

pbarrio created this revision. Jan 9 2018, 7:25 AM

Herald added subscribers: kristof.beyls, javed.absar, rengolin, aemerson. Jan 9 2018, 7:25 AM

rogfer01 added a subscriber: rogfer01. Jan 9 2018, 8:37 AM

craig.topper resigned from this revision. Jan 9 2018, 9:54 AM

SjoerdMeijer added a subscriber: SjoerdMeijer. Jan 10 2018, 1:57 AM

SjoerdMeijer added inline comments.

lib/Target/AArch64/AArch64ISelLowering.cpp
736 ↗	(On Diff #129083)	How about MVT::v8f16? Does this one need similar treatment?
lib/Target/AArch64/AArch64InstrInfo.td
5852	Don't think I understand why this is now a special case. Why is this one different from the other patterns here?

samparker added a subscriber: samparker. Jan 11 2018, 12:56 AM

Hi Sjoerd, thanks for the review. I have attached some thoughts on your comments and I will upload a new patch soon.

lib/Target/AArch64/AArch64ISelLowering.cpp
736 ↗	(On Diff #129083)	I could do the same for MVT::v8f16, but this would be an optimization rather than a bugfix, since LLVM generates correct code in this case:
    ld1 { v0.2d }, [x0]
    rev64 v0.8h, v0.8h
which does the correct byte reversal after the load. For the MVT::v4f16 case, LLVM currently generates the following incorrect code:
    ld1 { v0.2s }, [x0]
    rev64 v0.4h, v0.4h
I will give the optimization a try and upload an updated patch soon.
lib/Target/AArch64/AArch64InstrInfo.td
5852	Earlier in the file there is an explanation about why we need to insert REVs when we do bitcasts in big endian (~line 5600). At the end, the following comment suggests that identity conversions (e.g. same-size float-to-int conversions) do not need it:
    // Most bitconverts require some sort of conversion. The only exceptions are:
    // a) Identity conversions - vNfX <-> vNiX
It makes sense to me that a type conversion in big endian between vectors with elements of the same size does not need a byte reversal. This reversal should have been done right before, otherwise the original vector (in this case, with type v4f16) would have also been incorrect.
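
The difference between the v8f16 and v4f16 sequences quoted in the comment on AArch64ISelLowering.cpp can be checked with a small Python model. This is a sketch only: it assumes that a big-endian LD1 reads each element MSB-first into its own lane, and the function name ld1_then_rev64 is hypothetical.

```python
def ld1_then_rev64(halves, load_elem_bytes):
    """Model `ld1` with the given element size followed by `rev64` on .h lanes.

    `halves` are raw 16-bit patterns laid out in memory in big-endian order.
    The result is the register viewed as 16-bit lanes (lane 0 first).
    """
    memory = b"".join(h.to_bytes(2, "big") for h in halves)
    lanes = []
    for off in range(0, len(memory), load_elem_bytes):
        elem = int.from_bytes(memory[off:off + load_elem_bytes], "big")
        # Split each loaded element into 16-bit lanes, least significant first.
        lanes += [(elem >> (16 * k)) & 0xFFFF for k in range(load_elem_bytes // 2)]
    out = []
    for i in range(0, len(lanes), 4):
        out += lanes[i:i + 4][::-1]   # rev64: reverse lanes within each 64 bits
    return out


v8 = list(range(0x100, 0x108))   # stand-in for a v8f16 value
v4 = v8[:4]                      # stand-in for a v4f16 value

print(ld1_then_rev64(v8, 8) == v8)   # ld1 { v0.2d } + rev64 v0.8h
print(ld1_then_rev64(v4, 4) == v4)   # ld1 { v0.2s } + rev64 v0.4h
```

In this model the .2d load is undone exactly by rev64, because the load reverses the lanes within 64 bits, while the .2s load only reverses them within 32 bits, so rev64 is the wrong inverse.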

efriedma added a subscriber: efriedma. Jan 15 2018, 12:52 PM

efriedma added inline comments.

lib/Target/AArch64/AArch64InstrInfo.td
5852	There should be a testcase specifically for this change, then. Maybe something like this, to force the conversion:
    %x = add <4 x i16> %i, 1
    %y = bitcast <4 x i16> %x to <4 x half>
    %z = fpext <4 x half> %y to <4 x float>
Please put this in a separate patch from the AArch64TargetLowering::addTypeForNEON changes. Also, please move the pattern out of the IsBE predicate.

olista01 added inline comments. Jan 16 2018, 2:03 AM

lib/Target/AArch64/AArch64ISelLowering.cpp
736 ↗	(On Diff #129083)	Why is this change needed for correctness of v4f16, but only an optimisation for v8f16? If there's a difference between the treatment of these two types elsewhere in the compiler, would it not be better to handle them consistently?
lib/Target/AArch64/AArch64InstrInfo.td
5849	I suspect that the problem is actually in these patterns - the pattern on line 5823 (v2i32 -> v4i16) used REV32, but this one (v2i32 -> v4f16) uses REV64. I would assume that these patterns should be the same regardless of whether the lanes are integers or floating-point.
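
The REV32 versus REV64 question can be sketched in Python by treating a bitcast as a store followed by a load of the new type, which is how the LLVM language reference defines it. The helpers below (store_be, load_be, as_h_lanes, rev_h) are hypothetical names for this illustration, and lane types are modelled as raw integers, so integer and floating-point lanes behave identically.

```python
def store_be(lanes, elem_bytes):
    """Big-endian vector store: lane i goes to offset i * elem_bytes, MSB first."""
    return b"".join(v.to_bytes(elem_bytes, "big") for v in lanes)


def load_be(mem, elem_bytes):
    """Big-endian vector load: the element at offset i * elem_bytes becomes lane i."""
    return [int.from_bytes(mem[i:i + elem_bytes], "big")
            for i in range(0, len(mem), elem_bytes)]


def as_h_lanes(lanes32):
    """View a register holding 32-bit lanes as 16-bit lanes, without moving data."""
    out = []
    for v in lanes32:
        out += [v & 0xFFFF, v >> 16]
    return out


def rev_h(lanes, group):
    """REV on 16-bit lanes: group=2 models REV32, group=4 models REV64."""
    out = []
    for i in range(0, len(lanes), group):
        out += lanes[i:i + group][::-1]
    return out


src = [0xAAAABBBB, 0xCCCCDDDD]               # a v2i32 value
expected = load_be(store_be(src, 4), 2)      # bitcast to four 16-bit lanes
with_rev32 = rev_h(as_h_lanes(src), 2)
with_rev64 = rev_h(as_h_lanes(src), 4)

print(with_rev32 == expected, with_rev64 == expected)
```

Only the REV32 variant reproduces the store-then-load result in this model, which is consistent with the existing v2i32 -> v4i16 pattern.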

I have fixed the bug as olista01 suggested, which is more straightforward than
my previous fix.

There is still the question about why we require reversals in identity
conversions, but I believe that is affecting other conversions apart from the
v4i16->v4f16 ones. This is an optimization and can be handled in another patch,
as efriedma suggested.

Harbormaster completed remote builds in B13869: Diff 129953. Jan 16 2018, 7:07 AM

olista01 added inline comments. Jan 16 2018, 7:26 AM

lib/Target/AArch64/AArch64InstrInfo.td
5852	(discussed in person with Pablo, noting here for the record) We think some of the other patterns here are also wrong, especially the vNf16<->vNi16 ones, which obviously don't need REV instructions. We should fix all of them at once.
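
A byte-level Python sketch of which reversal each big-endian bitconvert needs, under the same store-then-load model of a bitcast. It is an illustration rather than the rule LLVM implements, and the names reverse_chunks, bitcast_big_endian, rev and required_rev are made up for it. Register contents are written as bytes, least significant byte first.

```python
def reverse_chunks(data, chunk):
    """Reverse the bytes inside every `chunk`-byte group."""
    return b"".join(data[i:i + chunk][::-1] for i in range(0, len(data), chunk))


def bitcast_big_endian(reg, src_elem, dst_elem):
    """Register bytes after a big-endian store with src_elem-byte lanes
    followed by a big-endian reload with dst_elem-byte lanes."""
    memory = reverse_chunks(reg, src_elem)
    return reverse_chunks(memory, dst_elem)


def rev(reg, container, elem):
    """REV<container*8> on elem-byte lanes: reverse lane order per container."""
    out = b""
    for base in range(0, len(reg), container):
        lanes = [reg[j:j + elem] for j in range(base, base + container, elem)]
        out += b"".join(reversed(lanes))
    return out


def required_rev(src_elem, dst_elem, reg=bytes(range(8))):
    """Return None if no reversal is needed, else (container_bits, lane_bits)."""
    want = bitcast_big_endian(reg, src_elem, dst_elem)
    if want == reg:
        return None
    container, elem = max(src_elem, dst_elem), min(src_elem, dst_elem)
    assert rev(reg, container, elem) == want
    return (container * 8, elem * 8)


# Conversions into 16-bit lanes from 8-, 16-, 32- and 64-bit lanes.
for src in (1, 2, 4, 8):
    print(f"{src * 8}-bit lanes -> 16-bit lanes:", required_rev(src, 2))
```

In this model a conversion between lanes of equal size needs no REV at all, and otherwise the container size is the larger of the two lane sizes, regardless of whether the lanes hold integers or floating-point values.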

Fixed big-endian bitconvert patterns and added extensive testing for half-float vectors.

LGTM, thanks.

I've spotted a missed optimisation in the tests, but that should be done as a separate patch.

test/CodeGen/AArch64/arm64-big-endian-bitconverts.ll
56	It looks like we generate more efficient code for the v4i16 case than the v4f16 case above. Is there a way we could get better code here? I think this is just an optimisation, so makes sense to do it as a separate patch (or raise a ticket if you don't have time to do it now).

This revision is now accepted and ready to land. Jan 17 2018, 5:02 AM

pbarrio added inline comments. Jan 17 2018, 5:14 AM

test/CodeGen/AArch64/arm64-big-endian-bitconverts.ll
56	Thanks, I will post a patch soon. I can fix it by doing something similar to the first iteration of this patch.

Closed by commit rL322663: [AArch64] Fix incorrect LD1 of 16-bit FP vectors in big endian (authored by pabbar01). Jan 17 2018, 6:41 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
lib/Target/AArch64/AArch64InstrInfo.td	2 lines
test/CodeGen/AArch64/arm64-big-endian-bitconverts.ll	14 lines

Diff 129953

lib/Target/AArch64/AArch64InstrInfo.td

[5,840 lines not shown]
 def : Pat<(v4f16 (bitconvert (v8i8 FPR64:$src))), (v4f16 FPR64:$src)>;
 def : Pat<(v4f16 (bitconvert (f64 FPR64:$src))), (v4f16 FPR64:$src)>;
 def : Pat<(v4f16 (bitconvert (v2f32 FPR64:$src))), (v4f16 FPR64:$src)>;
 def : Pat<(v4f16 (bitconvert (v1f64 FPR64:$src))), (v4f16 FPR64:$src)>;
 }
 let Predicates = [IsBE] in {
 def : Pat<(v4f16 (bitconvert (v1i64 FPR64:$src))),
           (v4f16 (REV64v4i16 FPR64:$src))>;
 def : Pat<(v4f16 (bitconvert (v2i32 FPR64:$src))),
-          (v4f16 (REV64v4i16 FPR64:$src))>;
+          (v4f16 (REV32v4i16 FPR64:$src))>;
 def : Pat<(v4f16 (bitconvert (v4i16 FPR64:$src))),
           (v4f16 (REV64v4i16 FPR64:$src))>;
 def : Pat<(v4f16 (bitconvert (v8i8 FPR64:$src))),
           (v4f16 (REV16v8i8 FPR64:$src))>;
 def : Pat<(v4f16 (bitconvert (f64 FPR64:$src))),
           (v4f16 (REV64v4i16 FPR64:$src))>;
 def : Pat<(v4f16 (bitconvert (v2f32 FPR64:$src))),
           (v4f16 (REV64v4i16 FPR64:$src))>;
 def : Pat<(v4f16 (bitconvert (v1f64 FPR64:$src))),
           (v4f16 (REV64v4i16 FPR64:$src))>;
[437 lines not shown]

test/CodeGen/AArch64/arm64-big-endian-bitconverts.ll

[47 lines not shown]
 ; CHECK: str
 %3 = bitcast <2 x i32> %2 to i64
 %4 = add i64 %3, %3
 store i64 %4, i64* %q
 ret void
 }

 ; CHECK-LABEL: test_i64_v4i16:
 define void @test_i64_v4i16(<4 x i16>* %p, i64* %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.4h }
 ; CHECK: rev64 v{{[0-9]+}}.4h
 ; CHECK: str
 %1 = load <4 x i16>, <4 x i16>* %p
 %2 = add <4 x i16> %1, %1
 %3 = bitcast <4 x i16> %2 to i64
 %4 = add i64 %3, %3
 store i64 %4, i64* %q
 ret void
[1,029 lines not shown]
 ; CHECK: st1 { v{{[0-9]+}}.16b }
 %1 = load <8 x i16>, <8 x i16>* %p
 %2 = add <8 x i16> %1, %1
 %3 = bitcast <8 x i16> %2 to <16 x i8>
 %4 = add <16 x i8> %3, %3
 store <16 x i8> %4, <16 x i8>* %q
 ret void
 }
+
+; CHECK-LABEL: test_v4f16_struct:
+%struct.struct1 = type { half, half, half, half }
+define %struct.struct1 @test_v4f16_struct(%struct.struct1* %ret) {
+entry:
+; CHECK: ld1 { {{v[0-9]+}}.2s }
+; CHECK: rev32
+; CHECK-NOT: rev64
+%0 = bitcast %struct.struct1* %ret to <4 x half>*
+%1 = load <4 x half>, <4 x half>* %0, align 2
+%2 = extractelement <4 x half> %1, i32 0
+%.fca.0.insert = insertvalue %struct.struct1 undef, half %2, 0
+ret %struct.struct1 %.fca.0.insert
+}