Diff Detail
Event Timeline
| llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
|---|---|---|
| 325 | I don't follow the reasoning here. When and why do we need to respect the alignment? Could you add comments? The converse of MinAlign <= 1 || MinAlign >= 4 is MinAlign == 2 so maybe swap the if around: if (MinAlign == 2) ... else ... ? | |
| 357 | Are you intentionally ignoring the alignment arguments? | |
| 363 | Did you mean just while (RemainingBytes >= 8) here? If not the whole thing would be clearer structured as: if (RemainingBytes % 8 == 0) {
... i64 loop ...
} else if (RemainingBytes % 4 == 0) {
... i32 loop ...
} else {
... i8 loop ...
} | |
I still don't understand the logic for when to use 2-byte accesses. Is it something like: use 1, 4, 8 and 16-byte accesses unconditionally, but 2-byte accesses only when we know source and destination are at least 2-byte aligned? Why is the implementation of this different depending on whether the length is a known constant or not?
| llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
|---|---|---|
| 344 | "16-byte"? | |
| 359 | Should those == be >=? | |
| 384 | Should those > be >=? | |
Try again for 2 byte cases. I'm still somewhat unsure what we should be doing with 2-byte accesses, but try to use them for now
| llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
|---|---|---|
| 332 | Yes, you can on anything remotely new. It's also not critical to get this exactly right, since the loads will still be legalized later. | |
| 349–351 | No, unaligned access is supposed to be enabled by default on targets that support it. I didn't bother trying to worry about what's best without the support. We also really need some microbenchmarks to make precise decisions here (although that will probably never happen) | |
| llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
|---|---|---|
| 332 | Then I don't understand why you have a special case for MinAlign == 2 at all. Why not just use unaligned (multi-)dword accesses, like you would for MinAlign == 1? | |
| llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
|---|---|---|
| 332 | So accesses are slow if (run time address) % 4 == 2. If MinAlign == 2 then there's a 50% chance of the access being slow. If MinAlign == 1 then it's only a 25% chance. Is that right? | |
| llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
|---|---|---|
| 332 | Yes, that is my understanding | |
| llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
|---|---|---|
| 332 | Then please add a comment that checking MinAlign is just a way to estimate the likelihood that the run time address will be 2-but-not-4 aligned (or "will end in 0b10"?). But if "2 byte aligned accesses end up getting executed as multiple 1 byte accesses" I still don't understand why it's better for us to generate multiple 1-byte instructions, rather than just letting the hardware do that thing. | |
| 353 | Can't you remove these "degenerate" cases and return v4i32 unconditionally? Surely the generic logic knows that for a short copy it should execute the v4i32 loop zero times, and then fall into the "residual" logic below? | |
| llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
|---|---|---|
| 338 | Shouldn't this && be ||, so we avoid forming either a 128-bit load or store? | |
Just to give a concrete counter-proposal, I was thinking this should look more like D77057.
I don't follow the reasoning here. When and why do we need to respect the alignment? Could you add comments?
The converse of MinAlign <= 1 || MinAlign >= 4 is MinAlign == 2 so maybe swap the if around:
?