This is an archive of the discontinued LLVM Phabricator instance.

[X86] `X86TargetLowering`: override `allowsMemoryAccess()`
ClosedPublic

Authored by lebedev.ri on Jan 14 2023, 1:18 PM.

Details

Summary

The baseline allowsMemoryAccess() is wrong for X86.
It assumes that aligned memory operations are always allowed,
but that is not true.

For example, We can not perform a 32-byte aligned non-temporal load
of a 32-byte vector, without AVX2 that is, yet allowsMemoryAccess()
will say it is allowed, so we may end up merging non-temporal loads,
only to split them up to legalize them, and here we go again.

NOTE: the test changes here are superfluous. The main effect is that without this change, in D141777, we'd get stuck endlessly merging and splitting non-temporal stores.

Diff Detail

Event Timeline

lebedev.ri created this revision.Jan 14 2023, 1:18 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 14 2023, 1:18 PM
lebedev.ri requested review of this revision.Jan 14 2023, 1:18 PM

can you improve the summary please? you about nt memops, but all the test changes don't have anything to do with them

can you improve the summary please? you about nt memops, but all the test changes don't have anything to do with them

I'm open to suggestions. As you can check yourself, with the follow-up patch (and without this patch),
in llvm-project/llvm/test/CodeGen/X86/merge-consecutive-stores-nt.ll
we endlessly try to combine non-temporal stores, only to split them back again:

Combining: t4164: v8f32,ch = load<(non-temporal load (s256) from %ir.a0)> t0, t2, undef:i64
Creating constant: t4166: i64 = Constant<16>
Creating new node: t4167: i64 = add t2, Constant:i64<16>
Creating new node: t4168: v4f32,ch = load<(non-temporal load (s128) from %ir.a0, align 32)> t0, t2, undef:i64
Creating new node: t4169: v4f32,ch = load<(non-temporal load (s128) from %ir.a0 + 16, basealign 32)> t0, t4167, undef:i64
Creating new node: t4170: ch = TokenFactor t4168:1, t4169:1
Creating new node: t4171: v8f32 = concat_vectors t4168, t4169

Replacing.1 t4164: v8f32,ch = load<(non-temporal load (s256) from %ir.a0)> t0, t2, undef:i64

With: t4171: v8f32 = concat_vectors t4168, t4169
 and 1 other values

Legalizing: t4170: ch = TokenFactor t4168:1, t4169:1
Legal node: nothing to do

Combining: t4170: ch = TokenFactor t4168:1, t4169:1

Legalizing: t4169: v4f32,ch = load<(non-temporal load (s128) from %ir.a0 + 16, basealign 32)> t0, t4167, undef:i64
Legalizing non-extending load operation

Combining: t4169: v4f32,ch = load<(non-temporal load (s128) from %ir.a0 + 16, basealign 32)> t0, t4167, undef:i64

Legalizing: t4167: i64 = add t2, Constant:i64<16>
Legal node: nothing to do

Combining: t4167: i64 = add t2, Constant:i64<16>

Legalizing: t4166: i64 = Constant<16>
Legal node: nothing to do

Combining: t4166: i64 = Constant<16>

Legalizing: t4168: v4f32,ch = load<(non-temporal load (s128) from %ir.a0, align 32)> t0, t2, undef:i64
Legalizing non-extending load operation

Combining: t4168: v4f32,ch = load<(non-temporal load (s128) from %ir.a0, align 32)> t0, t2, undef:i64

Legalizing: t4171: v8f32 = concat_vectors t4168, t4169
Trying custom legalization
Creating new node: t4172: v8f32 = undef
Creating constant: t4173: i64 = Constant<0>
Creating new node: t4174: v8f32 = insert_subvector undef:v8f32, t4168, Constant:i64<0>
Creating constant: t4175: i64 = Constant<4>
Creating new node: t4176: v8f32 = insert_subvector t4174, t4169, Constant:i64<4>
Successfully custom legalized node
 ... replacing: t4171: v8f32 = concat_vectors t4168, t4169
     with:      t4176: v8f32 = insert_subvector t4174, t4169, Constant:i64<4>

Legalizing: t4168: v4f32,ch = load<(non-temporal load (s128) from %ir.a0, align 32)> t0, t2, undef:i64
Legalizing non-extending load operation

Combining: t4168: v4f32,ch = load<(non-temporal load (s128) from %ir.a0, align 32)> t0, t2, undef:i64

Legalizing: t4169: v4f32,ch = load<(non-temporal load (s128) from %ir.a0 + 16, basealign 32)> t0, t4167, undef:i64
Legalizing non-extending load operation

Combining: t4169: v4f32,ch = load<(non-temporal load (s128) from %ir.a0 + 16, basealign 32)> t0, t4167, undef:i64

Legalizing: t4176: v8f32 = insert_subvector t4174, t4169, Constant:i64<4>
Legal node: nothing to do

Combining: t4176: v8f32 = insert_subvector t4174, t4169, Constant:i64<4>

Legalizing: t4175: i64 = Constant<4>
Legal node: nothing to do

Combining: t4175: i64 = Constant<4>

Legalizing: t4174: v8f32 = insert_subvector undef:v8f32, t4168, Constant:i64<0>
Legal node: nothing to do

Combining: t4174: v8f32 = insert_subvector undef:v8f32, t4168, Constant:i64<0>
Creating new node: t4177: v4f32 = undef

Legalizing: t4173: i64 = Constant<0>
Legal node: nothing to do

Combining: t4173: i64 = Constant<0>

Legalizing: t4172: v8f32 = undef
Legal node: nothing to do

Combining: t4172: v8f32 = undef

Legalizing: t4165: ch = store<(non-temporal store (s256) into %ir.a1)> t4163, t4176, t4, undef:i64
Legalizing store operation
Optimizing float store operations
Trying custom lowering
Creating new node: t4178: v4f32 = extract_subvector t4176, Constant:i64<0>
Creating new node: t4179: i64 = add t4, Constant:i64<16>
Creating new node: t4180: ch = store<(non-temporal store (s128) into %ir.a1, align 32)> t4163, t4178, t4, undef:i64
Creating new node: t4181: ch = store<(non-temporal store (s128) into %ir.a1 + 16, basealign 32)> t4163, t4169, t4179, undef:i64
Creating new node: t4182: ch = TokenFactor t4180, t4181
 ... replacing: t4165: ch = store<(non-temporal store (s256) into %ir.a1)> t4163, t4176, t4, undef:i64
     with:      t4182: ch = TokenFactor t4180, t4181

Legalizing: t4176: v8f32 = insert_subvector t4174, t4169, Constant:i64<4>
Legal node: nothing to do

Combining: t4176: v8f32 = insert_subvector t4174, t4169, Constant:i64<4>

Legalizing: t4182: ch = TokenFactor t4180, t4181
Legal node: nothing to do

Combining: t4182: ch = TokenFactor t4180, t4181

Legalizing: t4181: ch = store<(non-temporal store (s128) into %ir.a1 + 16, basealign 32)> t4163, t4169, t4179, undef:i64
Legalizing store operation
Optimizing float store operations
Legal store

Combining: t4181: ch = store<(non-temporal store (s128) into %ir.a1 + 16, basealign 32)> t4163, t4169, t4179, undef:i64
Creating new node: t4183: ch = store<(non-temporal store (s128) into %ir.a1 + 16, basealign 32)> t4169:1, t4169, t4179, undef:i64
Creating new node: t4184: ch = TokenFactor t4163, t4183

Replacing.1 t4181: ch = store<(non-temporal store (s128) into %ir.a1 + 16, basealign 32)> t4163, t4169, t4179, undef:i64

With: t4184: ch = TokenFactor t4163, t4183
 and 0 other values

Legalizing: t4169: v4f32,ch = load<(non-temporal load (s128) from %ir.a0 + 16, basealign 32)> t0, t4167, undef:i64
Legalizing non-extending load operation

Combining: t4169: v4f32,ch = load<(non-temporal load (s128) from %ir.a0 + 16, basealign 32)> t0, t4167, undef:i64

Legalizing: t4184: ch = TokenFactor t4163, t4183
Legal node: nothing to do

Combining: t4184: ch = TokenFactor t4163, t4183

Legalizing: t4182: ch = TokenFactor t4180, t4184
Legal node: nothing to do

Combining: t4182: ch = TokenFactor t4180, t4184
Creating new node: t4185: ch = TokenFactor t4180, t4183
 ... into: t4185: ch = TokenFactor t4180, t4183

Legalizing: t4185: ch = TokenFactor t4180, t4183
Legal node: nothing to do

Combining: t4185: ch = TokenFactor t4180, t4183

Legalizing: t4183: ch = store<(non-temporal store (s128) into %ir.a1 + 16, basealign 32)> t4169:1, t4169, t4179, undef:i64
Legalizing store operation
Optimizing float store operations
Legal store

Combining: t4183: ch = store<(non-temporal store (s128) into %ir.a1 + 16, basealign 32)> t4169:1, t4169, t4179, undef:i64

Legalizing: t4179: i64 = add t4, Constant:i64<16>
Legal node: nothing to do

Combining: t4179: i64 = add t4, Constant:i64<16>

Legalizing: t4180: ch = store<(non-temporal store (s128) into %ir.a1, align 32)> t4163, t4178, t4, undef:i64
Legalizing store operation
Optimizing float store operations
Legal store

Combining: t4180: ch = store<(non-temporal store (s128) into %ir.a1, align 32)> t4163, t4178, t4, undef:i64
Creating new node: t4186: ch = store<(non-temporal store (s128) into %ir.a1, align 32)> t4168:1, t4178, t4, undef:i64
Creating new node: t4187: ch = TokenFactor t4163, t4186

Replacing.1 t4180: ch = store<(non-temporal store (s128) into %ir.a1, align 32)> t4163, t4178, t4, undef:i64

With: t4187: ch = TokenFactor t4163, t4186
 and 0 other values

Legalizing: t4187: ch = TokenFactor t4163, t4186
Legal node: nothing to do

Combining: t4187: ch = TokenFactor t4163, t4186
Creating new node: t4188: ch = TokenFactor t4186, t4170
 ... into: t4188: ch = TokenFactor t4186, t4170

Legalizing: t4170: ch = TokenFactor t4168:1, t4169:1
Legal node: nothing to do

Combining: t4170: ch = TokenFactor t4168:1, t4169:1

Legalizing: t4188: ch = TokenFactor t4186, t4170
Legal node: nothing to do

Combining: t4188: ch = TokenFactor t4186, t4170
Creating new node: t4189: ch = TokenFactor t4186, t4169:1
 ... into: t4189: ch = TokenFactor t4186, t4169:1

Legalizing: t4168: v4f32,ch = load<(non-temporal load (s128) from %ir.a0, align 32)> t0, t2, undef:i64
Legalizing non-extending load operation

Combining: t4168: v4f32,ch = load<(non-temporal load (s128) from %ir.a0, align 32)> t0, t2, undef:i64

Legalizing: t4169: v4f32,ch = load<(non-temporal load (s128) from %ir.a0 + 16, basealign 32)> t0, t4167, undef:i64
Legalizing non-extending load operation

Combining: t4169: v4f32,ch = load<(non-temporal load (s128) from %ir.a0 + 16, basealign 32)> t0, t4167, undef:i64

Legalizing: t4189: ch = TokenFactor t4186, t4169:1
Legal node: nothing to do

Combining: t4189: ch = TokenFactor t4186, t4169:1

Legalizing: t4185: ch = TokenFactor t4189, t4183
Legal node: nothing to do

Combining: t4185: ch = TokenFactor t4189, t4183
Creating new node: t4190: ch = TokenFactor t4183, t4186
 ... into: t4190: ch = TokenFactor t4183, t4186

Legalizing: t4169: v4f32,ch = load<(non-temporal load (s128) from %ir.a0 + 16, basealign 32)> t0, t4167, undef:i64
Legalizing non-extending load operation

Combining: t4169: v4f32,ch = load<(non-temporal load (s128) from %ir.a0 + 16, basealign 32)> t0, t4167, undef:i64

Legalizing: t4183: ch = store<(non-temporal store (s128) into %ir.a1 + 16, basealign 32)> t4169:1, t4169, t4179, undef:i64
Legalizing store operation
Optimizing float store operations
Legal store

Combining: t4183: ch = store<(non-temporal store (s128) into %ir.a1 + 16, basealign 32)> t4169:1, t4169, t4179, undef:i64

Legalizing: t4190: ch = TokenFactor t4183, t4186
Legal node: nothing to do

Combining: t4190: ch = TokenFactor t4183, t4186

Legalizing: t4186: ch = store<(non-temporal store (s128) into %ir.a1, align 32)> t4168:1, t4178, t4, undef:i64
Legalizing store operation
Optimizing float store operations
Legal store

Combining: t4186: ch = store<(non-temporal store (s128) into %ir.a1, align 32)> t4168:1, t4178, t4, undef:i64

Legalizing: t4178: v4f32 = extract_subvector t4176, Constant:i64<0>
Legal node: nothing to do

Combining: t4178: v4f32 = extract_subvector t4176, Constant:i64<0>
 ... into: t4168: v4f32,ch = load<(non-temporal load (s128) from %ir.a0, align 32)> t0, t2, undef:i64

Legalizing: t4169: v4f32,ch = load<(non-temporal load (s128) from %ir.a0 + 16, basealign 32)> t0, t4167, undef:i64
Legalizing non-extending load operation

Combining: t4169: v4f32,ch = load<(non-temporal load (s128) from %ir.a0 + 16, basealign 32)> t0, t4167, undef:i64

Legalizing: t4186: ch = store<(non-temporal store (s128) into %ir.a1, align 32)> t4168:1, t4168, t4, undef:i64
Legalizing store operation
Optimizing float store operations
Legal store

Combining: t4186: ch = store<(non-temporal store (s128) into %ir.a1, align 32)> t4168:1, t4168, t4, undef:i64
Creating new node: t4191: ch = TokenFactor t4168:1, t4169:1
Creating new node: t4192: v8f32,ch = load<(non-temporal load (s256) from %ir.a0)> t0, t2, undef:i64
Creating new node: t4193: ch = store<(non-temporal store (s256) into %ir.a1)> t4191, t4192, t4, undef:i64

Replacing.1 t4186: ch = store<(non-temporal store (s128) into %ir.a1, align 32)> t4192:1, t4168, t4, undef:i64

With: t4193: ch = store<(non-temporal store (s256) into %ir.a1)> t4191, t4192, t4, undef:i64
 and 0 other values

Replacing.1 t4183: ch = store<(non-temporal store (s128) into %ir.a1 + 16, basealign 32)> t4192:1, t4169, t4179, undef:i64

With: t4193: ch = store<(non-temporal store (s256) into %ir.a1)> t4191, t4192, t4, undef:i64
 and 0 other values

Legalizing: t4192: v8f32,ch = load<(non-temporal load (s256) from %ir.a0)> t0, t2, undef:i64
Legalizing non-extending load operation

Combining: t4192: v8f32,ch = load<(non-temporal load (s256) from %ir.a0)> t0, t2, undef:i64
Creating constant: t4194: i64 = Constant<16>
Creating new node: t4195: i64 = add t2, Constant:i64<16>
Creating new node: t4196: v4f32,ch = load<(non-temporal load (s128) from %ir.a0, align 32)> t0, t2, undef:i64
Creating new node: t4197: v4f32,ch = load<(non-temporal load (s128) from %ir.a0 + 16, basealign 32)> t0, t4195, undef:i64
Creating new node: t4198: ch = TokenFactor t4196:1, t4197:1
Creating new node: t4199: v8f32 = concat_vectors t4196, t4197

Replacing.1 t4192: v8f32,ch = load<(non-temporal load (s256) from %ir.a0)> t0, t2, undef:i64
lebedev.ri edited the summary of this revision. (Show Details)Jan 14 2023, 6:02 PM

@RKSimon updated description:

The baseline `allowsMemoryAccess()` is wrong for X86.
It assumes that aligned memory operations are always allowed,
but that is not true.

For example, We can not perform a 32-byte aligned non-temporal load
of a 32-byte vector, without AVX2 that is, yet `allowsMemoryAccess()`
will say it is allowed, so we may end up merging non-temporal loads,
only to split them up to legalize them, and here we go again.

NOTE: the test changes here are superfluous. The main effect is that without this change,
in D141777, we'd get stuck endlessly merging and splitting non-temporal stores.

i believe, this accurately describes what is going on.

pengfei added inline comments.Jan 15 2023, 3:34 AM
llvm/lib/Target/X86/X86ISelLowering.h
1024–1029

Why do we need this and doing the same thing as TargetLoweringBase?

lebedev.ri marked an inline comment as done.Jan 15 2023, 4:56 AM
lebedev.ri added inline comments.
llvm/lib/Target/X86/X86ISelLowering.h
1024–1029

Because otherwise we get a compilation failure.

RKSimon added inline comments.Jan 16 2023, 6:32 AM
llvm/lib/Target/X86/X86ISelLowering.h
1024–1029

it looks like not all the TargetLoweringBase::allowsMemoryAccess variants are marked as virtual - that probably needs cleaning up?

lebedev.ri marked 2 inline comments as done.Jan 16 2023, 6:56 AM
lebedev.ri added inline comments.
llvm/lib/Target/X86/X86ISelLowering.h
1024–1029

that probably needs cleaning up?

ABSOLUTELY NOT!
There is only a single true allowsMemoryAccess(), all others are just helpers
to call it with different types of arguments. If they are all virtual,
and not just one of them, then you will essentially need to override them all.
It would not be an improvement.

lebedev.ri marked an inline comment as done.Jan 18 2023, 12:32 PM

ping

pengfei added inline comments.Jan 18 2023, 8:13 PM
llvm/lib/Target/X86/X86ISelLowering.h
1024–1029

I don't get the point. Why can't we use the helper directly but define a new one here? What are these helper used for if we need to define a same one each time?

lebedev.ri marked an inline comment as done.Jan 19 2023, 5:19 AM
lebedev.ri added inline comments.
llvm/lib/Target/X86/X86ISelLowering.h
1024–1029

Does anyone have comments on the rest of the change, ignoring this particular function? :)

If i drop this function, we get compilation error.
We can't make it virtual because then every target
will need to override every single allowsMemoryAccess(),
which is error-prone, as comparing a single allowsMemoryAccess().

I do not understand the feedback here.

RKSimon accepted this revision.Jan 21 2023, 10:13 AM

LGTM with a few minors

llvm/lib/Target/X86/X86ISelLowering.cpp
2734

maybe add a isBitAligned(Align, uint64_t SizeInBits) helper?

llvm/lib/Target/X86/X86ISelLowering.h
1006

(style) isMemoryAccessFast sounds a little better

llvm/test/CodeGen/X86/setcc-wide-types.ll
1241

SSE2/SSE41 might be mergable here if we add a SSE check-prefix case?

This revision is now accepted and ready to land.Jan 21 2023, 10:13 AM
lebedev.ri marked 4 inline comments as done.

@RKSimon thank you for the review!

llvm/lib/Target/X86/X86ISelLowering.h
1006

Right. This is a typo.

This revision was landed with ongoing or failed builds.Jan 21 2023, 1:18 PM
This revision was automatically updated to reflect the committed changes.