This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add VGPR copies post regalloc fix pass
ClosedPublic

Authored by rampitec on Jan 18 2017, 3:02 PM.

Details

Summary

Regalloc creates COPY instructions which do not formally use VALU.
That results in v_mov instructions displaced after exec mask modification.
One pass which do it is SIOptimizeExecMasking, but potentially it can be
done by other passes too.

This patch adds a pass immediately after regalloc to add implicit exec
use operand to all VGPR copy instructions.

Diff Detail

Repository
rL LLVM

Event Timeline

rampitec created this revision.Jan 18 2017, 3:02 PM
vpykhtin edited edge metadata.Jan 19 2017, 4:18 AM

Looks good, but is there a way to customize COPY insertions without addtitional pass?

Looks good, but is there a way to customize COPY insertions without addtitional pass?

Not without patching regalloc and other places.

Why is this necessary? I've been trying to avoid this. This isn't a question for only copies, any of the generic instructions need to be usually considered VALU. With exec writes as scheduling barriers, I don't think this should be needed

arsenm added inline comments.Jan 19 2017, 10:33 AM
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
619

I would also expect if we need this that it would more important for pre-regalloc. We also can't generally expect to add these to every copy that any generic pass is allowed to produce

The problem is that RA inserts these copy instructions, so doing it before RA does not help. Here is the result:

s_and_saveexec_b64 s[4:5], vcc                             // 0000000170EC: BE84206A
v_mov_b32_e32 v149, v15                                    // 0000000170F0: 7F2A030F
v_mov_b32_e32 v148, v14                                    // 0000000170F4: 7F28030E
v_mov_b32_e32 v14, v16                                     // 0000000170F8: 7E1C0310
v_mov_b32_e32 v192, v21                                    // 0000000170FC: 7F800315
v_mov_b32_e32 v15, v17                                     // 000000017100: 7E1E0311
v_mov_b32_e32 v191, v20                                    // 000000017104: 7F7E0314
v_mov_b32_e32 v62, v11                                     // 000000017108: 7E7C030B
v_mov_b32_e32 v156, v13                                    // 00000001710C: 7F38030D
v_mov_b32_e32 v16, v86                                     // 000000017110: 7E200356
v_mov_b32_e32 v143, v230                                   // 000000017114: 7F1E03E6
v_mov_b32_e32 v20, v236                                    // 000000017118: 7E2803EC
s_xor_b64 s[4:5], exec, s[4:5]                             // 00000001711C: 8884047E
v_mov_b32_e32 v61, v10                                     // 000000017120: 7E7A030A
v_mov_b32_e32 v155, v12                                    // 000000017124: 7F36030C
v_mov_b32_e32 v17, v87                                     // 000000017128: 7E220357
v_mov_b32_e32 v144, v231                                   // 00000001712C: 7F2003E7
v_mov_b32_e32 v21, v237                                    // 000000017130: 7E2A03ED
s_cbranch_execz BB1_28                                     // 000000017134: BF880138

This is not a question of scheduling, where scheduling barrier may help. The problem is that generic instructions do not have dependency on exec.
To fix the above code I can make a special check inside SIOptimizeExecMasking, but this is less general and more a hack than adding required dependency.

vpykhtin accepted this revision.Jan 24 2017, 3:31 AM

I think we can try to go with it as this is affect correctness until we find better solution.

This revision is now accepted and ready to land.Jan 24 2017, 3:31 AM
This revision was automatically updated to reflect the committed changes.