Page MenuHomePhabricator

Add NVPTXPeephole pass to reduce unnecessary address cast
ClosedPublic

Authored by wengxt on Jun 18 2015, 1:56 PM.

Details

Summary

This patch first change the register that holds local address for stack
frame to %SPL. Then the new NVPTXPeephole pass will try to scan the
following pattern

%vreg0<def> = LEA_ADDRi64 <fi#0>, 4
%vreg1<def> = cvta_to_local %vreg0

and transform it into

%vreg1<def> = LEA_ADDRi64 %VRFrameLocal, 4

Patched by Xuetian Weng

Diff Detail

Event Timeline

wengxt updated this revision to Diff 27961.Jun 18 2015, 1:56 PM
wengxt retitled this revision from to Add NVPTXPeephole pass to reduce unnecessary address cast.
wengxt updated this object.
wengxt edited the test plan for this revision. (Show Details)
wengxt added reviewers: jholewinski, jingyue.
wengxt added a subscriber: Unknown Object (MLST).
jingyue added inline comments.Jun 18 2015, 4:07 PM
lib/Target/NVPTX/NVPTXFrameLowering.cpp
62–68

Is NVPTX::VRFrameLocal 32-bit or 64-bit? You use it for both 32-bit and 64-bit. Does that matter?

lib/Target/NVPTX/NVPTXPeephole.cpp
65

style nit: we prefer putting { at the same line as the function header.

82

The indentation looks strange. Did you run clang-format?

90

Could there be multiple frame indices? fi#0, fi#1, ...

98

again, move { up one line.

110

The type casts seem unnecessary, aren't they?

test/CodeGen/NVPTX/call-with-alloca-buffer.ll
24

I don't see cvta.local.u64 removed in any tests, probably because %SP is still used. Is it worthwhile adding a test where the cvta.local.u64 can indeed be removed, because that's the whole point of this peephole optimization?

eliben added a subscriber: eliben.Jun 18 2015, 4:14 PM
eliben added inline comments.
lib/Target/NVPTX/NVPTXPeephole.cpp
9

Slight rewording/grammar:

In NVPTX, we always use a special frame register which holds local address
of frame. NVPTXLowerAlloca may introduce a lot of cvta.to.local instructions.
// This peephole pass optimizes these cases, for example

also, I'd clarify a bit more what "holds local address" means

82

FWIW - since this is a new file, I'd suggest running it through clang-format with the LLVM style setting.

wengxt added inline comments.Jun 18 2015, 5:37 PM
lib/Target/NVPTX/NVPTXFrameLowering.cpp
62–68

See NVPTXAsmPrinter::setAndEmitFunctionVirtualRegisters, it will change depending on arch.

Though NVPTXRegisterInfo.td defines them as i32, but seems this info is not used.

lib/Target/NVPTX/NVPTXPeephole.cpp
110

Latter one is unnecessary, first one is necessary to eliminate ambiguity.

wengxt updated this revision to Diff 27982.Jun 18 2015, 5:38 PM

Fix issues raised in comments.

wengxt updated this revision to Diff 27983.Jun 18 2015, 5:43 PM

add a new test case to test a case of not emitting cvta.local %SP, %SPL

wengxt added inline comments.Jun 18 2015, 5:45 PM
lib/Target/NVPTX/NVPTXPeephole.cpp
9

Expanded with more details.

65

Fixed

90

Fixed

test/CodeGen/NVPTX/call-with-alloca-buffer.ll
24

A test case for this is added in local-stack-frame.ll

jingyue added inline comments.Jun 18 2015, 9:02 PM
lib/Target/NVPTX/NVPTXFrameLowering.cpp
51

I'd run clang-format on this file again. Indentation changed since your last modification.

lib/Target/NVPTX/NVPTXPeephole.cpp
19

"avoid casting"

90

Did you miss uploading something? I didn't see anything's changed.

wengxt added inline comments.Jun 19 2015, 1:31 PM
lib/Target/NVPTX/NVPTXPeephole.cpp
90

Related change is in CombineCVTAToLocal

wengxt updated this revision to Diff 28048.Jun 19 2015, 2:27 PM

update based on previous comments

jingyue accepted this revision.Jun 22 2015, 5:48 PM
jingyue edited edge metadata.

LGTM

lib/Target/NVPTX/NVPTXPeephole.cpp
90

ACK'ed

This revision is now accepted and ready to land.Jun 22 2015, 5:48 PM
jholewinski edited edge metadata.Jun 24 2015, 4:48 AM

What is the future expectation for NVPTXPeephole? Are you planning on adding additional transforms? If not, perhaps a more specific name is warranted. Otherwise, LGTM! Thanks!

lib/Target/NVPTX/NVPTXPeephole.cpp
8

Missing separator, e.g.

===----------------------------------------------------------------------===

lib/Target/NVPTX/NVPTXRegisterInfo.td
68

Good catch!

What is the future expectation for NVPTXPeephole? Are you planning on adding additional transforms? If not, perhaps a more specific name is warranted. Otherwise, LGTM! Thanks!

One thing currently in my mind is to combine some other unnecessary address cast just like this pass. I've noticed some other pointless address cast in generated code (e.g. the "FIXME" in call-with-alloca-buffer.ll).

wengxt updated this revision to Diff 28379.Jun 24 2015, 11:39 AM
wengxt edited edge metadata.

update based on comments.

jingyue updated this object.Jun 24 2015, 1:02 PM
jingyue closed this revision.Jun 24 2015, 1:24 PM

Hi jingyue,

I got a regression caused by this patch. What I see is that %SP is still used if references to automatic vars (alloca) are passed to functions but %SP is no longer initialized to anything.

I have this (broken) code being now generated:

.visible .entry __omptgt__0_db262_31_(
        .param .u64 __omptgt__0_db262_31__param_0,
        .param .u64 __omptgt__0_db262_31__param_1,
        .param .u64 __omptgt__0_db262_31__param_2,
        .param .u64 __omptgt__0_db262_31__param_3,
        .param .u64 __omptgt__0_db262_31__param_4
)
{
        .local .align 8 .b8     __local_depot0[56];
        .reg .b64       %SP;
        .reg .b64       %SPL;
        .reg .pred      %p<18>;
        .reg .s32       %r<31>;
        .reg .s64       %rd<30>;

        mov.u64         %SPL, __local_depot0;
...
          add.u64         %rd14, %SP, 36;
        // Callseq Start 1
        {
        .reg .b32 temp_param_reg;
        // <end>}
        .param .b64 param0;
        st.param.b64    [param0+0], %rd12;
        .param .b32 param1;
        st.param.b32    [param1+0], %r6;
        .param .b32 param2;
        st.param.b32    [param2+0], %r18;
        .param .b64 param3;
        st.param.b64    [param3+0], %rd13;
        .param .b64 param4;
        st.param.b64    [param4+0], %rd10;
        .param .b64 param5;
        st.param.b64    [param5+0], %rd8;
        .param .b64 param6;
        st.param.b64    [param6+0], %rd14;
        .param .b32 param7;
        st.param.b32    [param7+0], %r10;
        .param .b32 param8;
        st.param.b32    [param8+0], %r11;
        call.uni 
        __kmpc_for_static_init_4, 
        (
        param0, 
        param1, 
        param2, 
        param3, 
        param4, 
        param5, 
        param6, 
        param7, 
        param8
        );

%rd14 is computed from %SP but %SP never gets initialized. Before I used to have:

.visible .entry __omptgt__0_db262_31_(
        .param .u64 __omptgt__0_db262_31__param_0,
        .param .u64 __omptgt__0_db262_31__param_1,
        .param .u64 __omptgt__0_db262_31__param_2,
        .param .u64 __omptgt__0_db262_31__param_3,
        .param .u64 __omptgt__0_db262_31__param_4
)
{
        .local .align 8 .b8     __local_depot0[56];
        .reg .b64       %SP;
        .reg .b64       %SPL;
        .reg .pred      %p<18>;
        .reg .s32       %r<31>;
        .reg .s64       %rd<31>;

        mov.u64         %rd30, __local_depot0;
        cvta.local.u64  %SP, %rd30;

%SP is initialized properly (and %SPL is not used at all) so the references are properly generated. I suspect that the changes in this patch have to be reflected in other pieces of the backend. What do you think is the best way to tackle the problem?

Thanks,
Samuel

Looks like a bug to me. We'll figure it out. Thank you!

Samuel,

I hope http://reviews.llvm.org/D10844 fixes the issue on your end. LMK if
it doesn't.

Hi jingyue,

Thanks for looking into this.

The issue is solved partially by http://reviews.llvm.org/D10844. Now the address can be accessed inside the function. However, what I read from that address in the callee is different from what I write in the caller.

I've been investigating the issue and what I see is that with the patch I have for, e.g. param5 (that should point to the 63 constant) (I'm filtering out the code for the other parameters):

        mov.u64         %SPL, __local_depot0;
        cvta.local.u64  %SP, %SPL;
...
        add.u64         %rd3, %SPL, 0;
...
        mov.u32         %r21, 63;
        st.local.u32    [%rd3], %r21;
...
        add.u64         %rd18, %SP, 32;
...
        .param .b64 param5;
        st.param.b64    [param5+0], %rd18;
...
        call.uni 
        __kmpc_for_static_init_4, 
        (
...
        param5, 
...
        );

The code I had before (that was working), for the same parameter, was:

        mov.u64         %rd29, __local_depot0;
        cvta.local.u64  %SP, %rd29;
...
        add.u64         %rd9, %SP, 32;
        cvta.to.local.u64       %rd3, %rd9;
...
        mov.u32         %r21, 63;
        st.local.u32    [%rd3], %r21;
...
        add.u64         %rd18, %SP, 32;
...
        .param .b64 param5;
        st.param.b64    [param5+0], %rd18;
...
        call.uni 
        __kmpc_for_static_init_4, 
        (
...
        param5, 
...
        );

So apparently, the offsets of the frame are not being produced properly after the patch. Let me know if you need more information.

Thanks!
Samuel

Hi Samuel,

I hope http://reviews.llvm.org/D10853 can resolve the other issue you found.

FYI, D10853 submitted in r241185.

Seems to be fixed now.

Thanks!
Samuel