Diff Detail
Event Timeline
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
325 | I don't follow the reasoning here. When and why do we need to respect the alignment? Could you add comments? The converse of MinAlign <= 1 || MinAlign >= 4 is MinAlign == 2 so maybe swap the if around: if (MinAlign == 2) ... else ... ? | |
357 | Are you intentionally ignoring the alignment arguments? | |
363 | Did you mean just while (RemainingBytes >= 8) here? If not the whole thing would be clearer structured as: if (RemainingBytes % 8 == 0) { ... i64 loop ... } else if (RemainingBytes % 4 == 0) { ... i32 loop ... } else { ... i8 loop ... } |
I still don't understand the logic for when to use 2-byte accesses. Is it something like: use 1, 4, 8 and 16-byte accesses unconditionally, but 2-byte accesses only when we know source and destination are at least 2-byte aligned? Why is the implementation of this different depending on whether the length is a known constant or not?
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
344 | "16-byte"? | |
359 | Should those == be >=? | |
384 | Should those > be >=? |
Try again for 2 byte cases. I'm still somewhat unsure what we should be doing with 2-byte accesses, but try to use them for now
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
332 | Yes, you can on anything remotely new. It's also not critical to get this exactly right, since the loads will still be legalized later. | |
349–351 | No, unaligned access is supposed to be enabled by default on targets that support it. I didn't bother trying to worry about what's best without the support. We also really need some microbenchmarks to make precise decisions here (although that will probably never happen) |
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
332 | Then I don't understand why you have a special case for MinAlign == 2 at all. Why not just use unaligned (multi-)dword accesses, like you would for MinAlign == 1? |
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
332 | So accesses are slow if (run time address) % 4 == 2. If MinAlign == 2 then there's a 50% chance of the access being slow. If MinAlign == 1 then it's only a 25% chance. Is that right? |
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
332 | Yes, that is my understanding |
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
332 | Then please add a comment that checking MinAlign is just a way to estimate the likelihood that the run time address will be 2-but-not-4 aligned (or "will end in 0b10"?). But if "2 byte aligned accesses end up getting executed as multiple 1 byte accesses" I still don't understand why it's better for us to generate multiple 1-byte instructions, rather than just letting the hardware do that thing. | |
353 | Can't you remove these "degenerate" cases and return v4i32 unconditionally? Surely the generic logic knows that for a short copy it should execute the v4i32 loop zero times, and then fall into the "residual" logic below? |
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
338 | Shouldn't this && be ||, so we avoid forming either a 128-bit load or store? |
Just to give a concrete counter-proposal, I was thinking this should look more like D77057.
I don't follow the reasoning here. When and why do we need to respect the alignment? Could you add comments?
The converse of MinAlign <= 1 || MinAlign >= 4 is MinAlign == 2 so maybe swap the if around:
?