This allow it to recognize more loads as being consecutive when the load's address are complex at the start.
Details
Diff Detail
Event Timeline
llvm/test/CodeGen/NVPTX/vector-loads-complex.ll | ||
---|---|---|
12 | Couple of general notes: I'd attempt to reduce IR to the bare minimum necessary to demonstrate/test intended behavior. This IR is not bad, but I think it may be reduced further to the point where we end up with just one vectorized 2-element load. It's a good practice to 'anchor the checks with fixed start/end points CHECK-LABEL: <function_name> ... # other checks on whatever happens inside the function. CHECK: ret It helps to delineate the search boundaries for FileCheck. Not a big deal now, as you only have one function, but I would not be surprised if more tests would be added over time. | |
14–18 | Negative matching is tricky. As written, FileCheck will only ensure that there's no ld.global between the fourth instance of ld.global.v2.u8 and the end of the file. I'd start by attempting to place the checks close to the load statements they coalesce into. It's not always possible, but it probably works in this case. As long as the number of loads is fixed/predictable, positively matching all of them makes negative matching unnecessary, so I'd just drop it. |
Minor test nits. LGTM otherwise.
I'm curious, what does generated ptx for the function look before/after the patch.
llvm/test/CodeGen/NVPTX/vector-loads.ll | ||
---|---|---|
73 ↗ | (On Diff #244243) | CHECK-LABEL |
75 ↗ | (On Diff #244243) | I'm not sure we need it. The check above verifies that the load did get vectorized, which was the purpose of this patch and the test. Any condition when CHECK-NOT may fail (e.g. we're loading more than one, or it's an unrelated load/store) is unlikely to be relevant. It will also only look in-between ld.v2.u8 and ret, and such issues will not be detected before ld.v2.u8. If you want to ensure that ld.v2.u8 is the only relevant load, then you may need to have a separate test case which would fail if it's found more than one ld of any kind. Or you can add another CHECK-NOT: ld between the start of the function and CHECK: ld.v2.u8. I think that may work, but I'm not completely sure. I'd just skip CHECK-NOT altogether. |
92 ↗ | (On Diff #244243) | CHECK: ld.v2.u8 |
98 ↗ | (On Diff #244243) | CHECK: ret |
Before:
.visible .func foo_complex(
.param .b32 foo_complex_param_0
) // @foo_complex
{
.reg .b16 %rs<4>; .reg .b32 %r<11>;
// %bb.0:
ld.param.u32 %r1, [foo_complex_param_0]; mov.u32 %r2, %tid.x; mov.u32 %r3, %ctaid.x; shl.b32 %r4, %r3, 9; and.b32 %r5, %r4, -131072; and.b32 %r6, %r4, 130560; shl.b32 %r7, %r2, 1; or.b32 %r8, %r6, %r7; add.s32 %r9, %r1, %r5; add.s32 %r10, %r9, %r8;
- ld.u8 %rs1, [%r10+128]; ld.u8 %rs2, [%r10+129];** max.u16 %rs3, %rs1, %rs2; st.u8 [%r10+129], %rs3; ret; // -- End function
}
After:
.visible .func foo_complex(
.param .b32 foo_complex_param_0
) // @foo_complex
{
.reg .b16 %rs<4>; .reg .b32 %r<11>;
// %bb.0:
ld.param.u32 %r1, [foo_complex_param_0]; mov.u32 %r2, %tid.x; mov.u32 %r3, %ctaid.x; shl.b32 %r4, %r3, 9; and.b32 %r5, %r4, -131072; and.b32 %r6, %r4, 130560; shl.b32 %r7, %r2, 1; or.b32 %r8, %r6, %r7; add.s32 %r9, %r1, %r5; add.s32 %r10, %r9, %r8;
- ld.v2.u8 {%rs1, %rs2}, [%r10+128];** max.u16 %rs3, %rs1, %rs2; st.u8 [%r10+129], %rs3; ret; // -- End function
}
llvm/test/CodeGen/NVPTX/vector-loads.ll | ||
---|---|---|
73 ↗ | (On Diff #244243) | Here I did as the other tests in this file. I'll fix all of them. |
Couple of general notes:
I'd attempt to reduce IR to the bare minimum necessary to demonstrate/test intended behavior. This IR is not bad, but I think it may be reduced further to the point where we end up with just one vectorized 2-element load.
It's a good practice to 'anchor the checks with fixed start/end points
It helps to delineate the search boundaries for FileCheck. Not a big deal now, as you only have one function, but I would not be surprised if more tests would be added over time.