This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] break non-temporal loads over 256 into 256-loads and a smaller load
ClosedPublic

Authored by zjaffal on Sep 7 2022, 6:12 AM.

Details

Summary

Currently over 256 non-temporal loads are broken inefficently. For example, v17i32 gets broken into 2 128-bit loads. It is better if we can use
256-bit loads instead.

Diff Detail

Event Timeline

zjaffal created this revision.Sep 7 2022, 6:12 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 7 2022, 6:12 AM
zjaffal requested review of this revision.Sep 7 2022, 6:12 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 7 2022, 6:12 AM
fhahn requested changes to this revision.Sep 7 2022, 7:08 AM

Thanks for sharing the patch. As is, it looks like there are multiple test failures that are caused by this patch and need addressing (some comments inline about correctness issues)

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
53

Is this needed?

73

Is this needed?

17820

move all variables that are used after the early exit down after the early exit.

17825

We should also only do this for non-temporal loads.

17829

Could you add a comment here illustrating what kind of DAG nodes we create to replace the original load?

It would also be good to document the motivation for special handling for non-temporal loads here.

17836

I don't think that's correct, you are passing the offset in bits, but I think it should be in bytes.

Looking at some of the test changes, the offsets of the loads are wrong, e.g. ldnp q0, q2, [x0, #256].

17867

Do you need to create NullVector explicitly here? Could you just use DAG.degUNDEF(NewVT) here?

19860

This should be kept I think.

This revision now requires changes to proceed.Sep 7 2022, 7:08 AM
zjaffal added inline comments.Sep 8 2022, 7:17 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
17829

Would the motivation for the special handling for non-temporal loads be something like:
We have 256-bit non-temporal load so it might be good to utilise them when having large load instructions?

17867

Yes I will do that

zjaffal updated this revision to Diff 458753.Sep 8 2022, 8:14 AM

Fix the issues related to the scalable loads.

zjaffal marked 7 inline comments as done.Sep 8 2022, 8:15 AM
zjaffal added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
17825

Checks are added now for non-temporal loads

17836

This should be fixed now

zjaffal marked 3 inline comments as done.Sep 8 2022, 8:17 AM
fhahn added inline comments.Sep 8 2022, 8:17 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
17829

We have 256-bit non-temporal load so it might be good to utilise them when having large load instructions?

Yes something like that, maybe: Try to beak up non-temporal stores into blocks of 256 bits early, so the LDNPQ can be selected.

Might be good to add this as comment for the whole function.

zjaffal added inline comments.Sep 8 2022, 8:22 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
17829

You can check the comments I added. I will add a comment for the whole function as well

zjaffal updated this revision to Diff 458760.Sep 8 2022, 8:39 AM

Add comments for function

fhahn added a comment.Sep 9 2022, 2:07 AM

Thanks for the update, this looks like a nice improvement and should avoid regressions for those cases with D132559. A few more mostly stylistic comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
17813

It would be good to be more concrete what large means here (>256 bits) will be split into blocks of 256bits.

nit: comments should be full sentences ending with a period (.)

17819

Thinking about it a bit more, I think we should also not handle volatile and atomic loads here to be safe. Could you add tests and the extra conditions for early exit here? I think it would be fine to just add the tests directly in the patch.

17825

LD->isNonTemporal() is already checked above

17830

nit: BasePtr would be more accurate.

17841

nit: used as unsigned, so maybe make this unsigned too?

17851

nit: comments should be full sentences ending with a period (.)

bits in the load operation -> bits of the load operation?

It might also help with readability if you add a newline before the comment.

17854

nit: comments should be full sentences ending with a period (.)

17867

MaskVector not used?

17868

UndefVector would be more accurate?

zjaffal updated this revision to Diff 459056.Sep 9 2022, 7:17 AM
zjaffal marked an inline comment as done.
  • Add early exit for volatile loads and a test cases to check for it.
  • Address some style issues
zjaffal updated this revision to Diff 459073.Sep 9 2022, 8:12 AM

Removing checking if both operands are negative at the same time since it is handled by default.

fhahn added a comment.Sep 9 2022, 8:22 AM

It looks like you uploaded the wrong patch here?

zjaffal updated this revision to Diff 459079.Sep 9 2022, 8:29 AM

Fix non-temporal clause

zjaffal updated this revision to Diff 459083.Sep 9 2022, 8:47 AM

readd the volatile test

t.p.northover added inline comments.Sep 9 2022, 8:55 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
17863

Shouldn't the second operand of this call be PtrOffset again?

17870–17872

Is this (and the implementation generally) big-endian correct? I don't know the answer here, I can never remember what's supposed to happen. But someone should definitely try it on an aarch64_be target and at least eyeball the assembly to check the offsets and so on.

17884

The Chain here is the input chain, I think you need to TokenFactor all the loads' output chains together to make sure nothing gets reordered with them.

19928–19929

This looks like it takes precedence over performLOADCombine and disables ldnp formation if the TBI feature is enabled.

zjaffal added inline comments.Sep 12 2022, 1:40 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19928–19929

Removing the checks and calling performLOADCombine caused many tests to fail. Maybe we can check if the load is non-temporal here ?

zjaffal added inline comments.Sep 12 2022, 2:05 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
17863

Yes it should be, it is worth noting that the offsets on the generated assembly didn't change when I changed from using MemVT.getSizeInBits()) to PtrOffset

zjaffal added inline comments.Sep 12 2022, 2:19 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
17870–17872

This is the generated assembly for big-endian
using the following test case

define <17 x float> @test_ldnp_v17f32(<17 x float>* %A) {
  %lv = load<17 x float>, <17 x float>* %A, align 8, !nontemporal !0
  ret <17 x float> %lv
}

!0 = !{i32 1}
test_ldnp_v17f32:                       // @test_ldnp_v17f32
	.cfi_startproc
// %bb.0:
	ldnp	q0, q1, [x0, #32]
	add	x9, x8, #48
	add	x10, x8, #32
	ldnp	q2, q3, [x0]
	add	x11, x8, #16
	ldr	s4, [x0, #64]
	st1	{ v2.4s }, [x8]
	st1	{ v1.4s }, [x9]
	st1	{ v0.4s }, [x10]
	st1	{ v3.4s }, [x11]
	str	s4, [x8, #64]
	ret

Looking at https://godbolt.org I think there are more load instructions before breaking them

t.p.northover added inline comments.Sep 12 2022, 3:08 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
17870–17872

It's coming back to me now, and I think that output is incorrect.

Basically, the ld1 and st1 instructions would load each element big-endian but preserve the order (i.e. v0[0] is still loaded from &ptr[0] not &ptr[3]. On the other hand ldr and ldnp load the whole vector-register in big-endian order which loads v0[0] from &ptr[3].

This output mixes the two so it'll rearrange elements.

But I think the real bug is in https://reviews.llvm.org/rG7155ed4289 that you committed earlier, or at least that needs fixing before any bug here is obvious.

A proper fix would be to put an AArch64ISD::REV<n> (where <n> is the element size) after each ldnp to restore what an ld1 would have done, though I'd be OK with disabling the optimization for big-endian instead. I care if we break it, I'm less bothered if it isn't as fast as little-endian.

19928–19929

The check is needed because otherwise we assume TBI even when we shouldn't, but it probably shouldn't cause an early return. Both optimizations should get the opportunity to run if they might be applicable.

zjaffal updated this revision to Diff 459418.Sep 12 2022, 3:58 AM
  1. Change Align to use PtrOffset
  2. Make sure that performTBISimplification optimisation can run
  3. Add TokenFactor node
zjaffal marked 15 inline comments as done.Sep 12 2022, 8:47 AM
zjaffal updated this revision to Diff 462202.Sep 22 2022, 9:25 AM

rebase on top of main

fhahn accepted this revision.Sep 26 2022, 5:45 AM

LGTM, thanks for the latest changes. Some small stylistic comments inline.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
17813

nit: I think this is easier to read if you keep nontemporal loads together. Also, period at end of sentence.

17839

nit: utilise -> utilize (American spelling)

17856

nit: add newline after here, to separate blocks.

17858

nit: UNDEF vector instead of null value vector?

17880

add newline before the new function/

17880

nit: use ConcatVT to match other names here.

This revision is now accepted and ready to land.Sep 26 2022, 5:45 AM
zjaffal updated this revision to Diff 462908.Sep 26 2022, 8:05 AM

Address stylistic comments

Matt added a subscriber: Matt.Sep 26 2022, 9:22 AM
fhahn added a comment.Sep 28 2022, 5:47 AM

Thanks for the update. 2 more small comments, but I'll fix them before committing the change on your behalf.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
17813

Looks like there's another typo: nontermporal -> nontemporal

17856

The newline was meant to go after the } to separate it from the next block, not before it.

This revision was landed with ongoing or failed builds.Sep 28 2022, 7:21 AM
This revision was automatically updated to reflect the committed changes.