This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Implement getMemcpyLoopLoweringType
AbandonedPublic

Authored by arsenm on Mar 22 2020, 8:23 AM.

Details

Diff Detail

Event Timeline

arsenm created this revision.Mar 22 2020, 8:23 AM
foad added inline comments.Mar 23 2020, 2:26 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
325

I don't follow the reasoning here. When and why do we need to respect the alignment? Could you add comments?

The converse of MinAlign <= 1 || MinAlign >= 4 is MinAlign == 2 so maybe swap the if around:

if (MinAlign == 2)
  ...
else
  ...

?

357

Are you intentionally ignoring the alignment arguments?

363

Did you mean just while (RemainingBytes >= 8) here? If not the whole thing would be clearer structured as:

if (RemainingBytes % 8 == 0) {
  ... i64 loop ...
} else if (RemainingBytes % 4 == 0) {
  ... i32 loop ...
} else {
  ... i8 loop ...
}
arsenm updated this revision to Diff 252163.Mar 23 2020, 3:36 PM

Address comments, fix residual part

foad added a comment.Mar 24 2020, 1:52 AM

I still don't understand the logic for when to use 2-byte accesses. Is it something like: use 1, 4, 8 and 16-byte accesses unconditionally, but 2-byte accesses only when we know source and destination are at least 2-byte aligned? Why is the implementation of this different depending on whether the length is a known constant or not?

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
344

"16-byte"?

359

Should those == be >=?

384

Should those > be >=?

arsenm updated this revision to Diff 252348.Mar 24 2020, 9:17 AM
arsenm marked an inline comment as done.

Try again for 2 byte cases. I'm still somewhat unsure what we should be doing with 2-byte accesses, but try to use them for now

arsenm updated this revision to Diff 252349.Mar 24 2020, 9:17 AM

Wrong diff

foad added inline comments.Mar 25 2020, 1:28 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
332

<=? You can't do unaligned dword (or multi-dword) accesses, can you?

349–351

Don't all these (multi-)dword cases need to be guarded by MinAlign >= 4?

arsenm marked 2 inline comments as done.Mar 25 2020, 12:31 PM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
332

Yes, you can on anything remotely new. It's also not critical to get this exactly right, since the loads will still be legalized later.

349–351

No, unaligned access is supposed to be enabled by default on targets that support it. I didn't bother trying to worry about what's best without the support. We also really need some microbenchmarks to make precise decisions here (although that will probably never happen)

foad added inline comments.Mar 26 2020, 2:37 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
332

Then I don't understand why you have a special case for MinAlign == 2 at all. Why not just use unaligned (multi-)dword accesses, like you would for MinAlign == 1?

arsenm marked an inline comment as done.Mar 26 2020, 6:46 AM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
332

See D74345, we recently discovered 2 byte aligned accesses end up getting executed as multiple 1 byte accesses

foad added inline comments.Mar 26 2020, 9:10 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
332

So accesses are slow if (run time address) % 4 == 2. If MinAlign == 2 then there's a 50% chance of the access being slow. If MinAlign == 1 then it's only a 25% chance. Is that right?

arsenm marked an inline comment as done.Mar 26 2020, 9:14 AM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
332

Yes, that is my understanding

foad added inline comments.Mar 27 2020, 1:38 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
332

Then please add a comment that checking MinAlign is just a way to estimate the likelihood that the run time address will be 2-but-not-4 aligned (or "will end in 0b10"?).

But if "2 byte aligned accesses end up getting executed as multiple 1 byte accesses" I still don't understand why it's better for us to generate multiple 1-byte instructions, rather than just letting the hardware do that thing.

353

Can't you remove these "degenerate" cases and return v4i32 unconditionally? Surely the generic logic knows that for a short copy it should execute the v4i32 loop zero times, and then fall into the "residual" logic below?

foad added inline comments.Mar 30 2020, 7:01 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
338

Shouldn't this && be ||, so we avoid forming either a 128-bit load or store?

foad added a comment.Mar 30 2020, 7:09 AM

Just to give a concrete counter-proposal, I was thinking this should look more like D77057.

arsenm abandoned this revision.Mar 30 2020, 3:16 PM