This is an archive of the discontinued LLVM Phabricator instance.

Transform illegal intrinsics to V_ILLEGAL
ClosedPublic

Authored by Leonc on Apr 13 2022, 10:39 AM.

Details

Summary

Related tasks:

  • SWDEV-240194
  • SWDEV-309417
  • SWDEV-334876

Diff Detail

Event Timeline

Leonc created this revision.Apr 13 2022, 10:39 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 13 2022, 10:39 AM
Leonc requested review of this revision.Apr 13 2022, 10:39 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 13 2022, 10:39 AM
Leonc edited the summary of this revision. (Show Details)Apr 13 2022, 10:41 AM
Leonc added reviewers: bcahoon, rampitec, cfang.

I think this is too focused. There are other image_sample_lz intrinsics and all of them potentially need to be replaced if following this approach.

There is also global isel.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7688

!ST->hasExtendedImageInsts() instead of target check.

Leonc updated this revision to Diff 436106.Jun 10 2022, 10:08 PM
  • Updates toward catching and handling all illegal intrinsics.
Leonc updated this revision to Diff 436108.Jun 10 2022, 10:13 PM
  • Revert rebase error.
Leonc retitled this revision from Transform tex2D to legal intrinsic on gfx90a. to Transform illegal intrinsics to V_ILLEGAL.Jun 10 2022, 10:24 PM
Leonc edited the summary of this revision. (Show Details)
Leonc added a comment.EditedJun 10 2022, 10:34 PM

Next steps:

  • Get feedback from team.
  • Add tests.
  • Extend coverage to INTRINSIC_WO_CHAIN and INTRINSIC_VOID.
  • Investigate test failure on gfx940.
    • CodeGenOpenCL/builtins-fp-atomics-gfx940.cl.

Issues:

  • Not all intrinsics marked as custom are handled by the target lowering functions. This is the reason INTRINSIC_WO_CHAIN and INTRINSIC_VOID are not covered yet.
    • It might be possible to determine if an intrinsic will be expanded if it is allowed to fall through.

Missing tests.

arsenm added inline comments.Jun 13 2022, 5:30 PM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
327

Just replace with undef regardless of the type. You also should have just done this when the node was initially created, no need to use the DAG preprocess

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
539

I don't think it's worth introducing a wrapper node for this

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7701

You can select directly to the instruction with getMachineNode, no need for the intermediate AMDGPUISD::ILLEGAL

llvm/lib/Target/AMDGPU/SIInstructions.td
3266–3270

This is more complicated. It's actually defined on gfx10 (and I believe can also be encoded as both 32 and 64-bit). For gfx6-9, all 0s does have an interpretation as an almost valid instruction. It decodes fine but violates the constant bus restriction. I would prefer to define a separate V_ILLEGAL that uses all 1s pre gfx10

arsenm requested changes to this revision.Jun 13 2022, 5:30 PM
This revision now requires changes to proceed.Jun 13 2022, 5:30 PM
Leonc added inline comments.Jun 13 2022, 9:23 PM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
327

I did try doing this in TargetLowering originally, but any SDValue returned from a lowering function must have the same type(s) as the node it's replacing. I think this was the same for undef but I'll try it again just to be sure.

foad added a comment.Jun 14 2022, 1:08 AM

Where's the rationale? Surely unsupported intrinsics should fail to select, and perhaps be diagnosed even earlier?

Leonc added a comment.Jun 14 2022, 1:15 AM

Where's the rationale? Surely unsupported intrinsics should fail to select, and perhaps be diagnosed even earlier?

Normally yes. The issue is our libraries contain code for all targets. We rely on dead code elimination to remove illegal intrinsics, but that doesn't happen on -O0.

foad added a comment.Jun 14 2022, 1:28 AM

The issue is our libraries contain code for all targets. We rely on dead code elimination to remove illegal intrinsics, but that doesn't happen on -O0.

Why can't you put the code for different subtargets in different functions, each with an appropriate "target-cpu"= attribute?

Leonc added a comment.Jun 14 2022, 1:31 AM

The issue is our libraries contain code for all targets. We rely on dead code elimination to remove illegal intrinsics, but that doesn't happen on -O0.

Why can't you put the code for different subtargets in different functions, each with an appropriate "target-cpu"= attribute?

That's what I would do. I think there's some resistance to it though, and it would probably require a lot more work.

foad added a comment.Jun 14 2022, 2:00 AM

The issue is our libraries contain code for all targets. We rely on dead code elimination to remove illegal intrinsics, but that doesn't happen on -O0.

Why can't you put the code for different subtargets in different functions, each with an appropriate "target-cpu"= attribute?

That's what I would do. I think there's some resistance to it though, and it would probably require a lot more work.

Well I am resistant to selecting intrinsics on subtargets that don't support them :)

How far are you planning to go with this approach? Do you expect to extend it to all AMDGPU intrinsics?

Leonc added a comment.Jun 14 2022, 2:15 AM

The issue is our libraries contain code for all targets. We rely on dead code elimination to remove illegal intrinsics, but that doesn't happen on -O0.

Why can't you put the code for different subtargets in different functions, each with an appropriate "target-cpu"= attribute?

That's what I would do. I think there's some resistance to it though, and it would probably require a lot more work.

Well I am resistant to selecting intrinsics on subtargets that don't support them :)

How far are you planning to go with this approach? Do you expect to extend it to all AMDGPU intrinsics?

Agreed.

My initial approach was rejected for being too specific. Discussions since then have been about finding a solution for all illegal intrinsic calls. That's the plan unless I hear otherwise.

The rationale is for library code that looks like:
void example(float *p, float v) {

if (_ISA_verison == 9008 || __ISA_verison == 9010)
   global_atomic_fadd(p, v);
else
   generic_atomic_fadd(p, v);

}

At optimization levels > 0, the dead code is eliminated, but at optimization level 0, the dead code remain and the compiler attempts to generate an instruction. That instruction would never be executed due to the condition, but it still needs to be generated. In these cases, we don't have multilibs (target specific libraries).

Thanks @bcahoon. If there were another available approach that always worked, we would use it.

dmikushin added a comment.EditedJun 29 2022, 11:05 AM

Thank you for working on this. Debug build of PyTorch (BUILD_DEBUG_INFO=1) is also affected. Here is a creduce'd code which crashes rocm5.1.1 clang that you could possibly inciude into the test cases (compile with -O0):

extern "C" __attribute__((device)) void __ockl_atomic_add_noret_f32(float *,
                                                                    float);
__attribute__((device)) void a(float *address, float b) {
  __ockl_atomic_add_noret_f32(address, b);
}
Leonc updated this revision to Diff 447180.Jul 24 2022, 7:21 PM
  • Fix existing test failures.
bcahoon added inline comments.Jul 25 2022, 7:03 PM
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
4411

At this location, ILLEGAL is a memory opcode, but SelectionDAGDumper will crash because when trying to print its memops.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
6509

I think it may be easiest to create the ILLEGAL instruction here rather than creating it after the call to lowerImage. That way, there doesn't have to be copies of the code that creates ILLEGAL.

For example,
MachineSDNode *NewNode = DAG.getMachineNode(AMDGPU::V_ILLEGAL, DL, MVT::Other, Op.getOperand(0));
return DAG.getMergeValues({ DAG.getUNDEF(Op.getValueType()), SDValue(NewNode, 0) }, DL);

The UNDEF value replaces return result for the intrinsic, and the V_ILLEGAL replaces the chain result. It appears, though, that V_ILLEGAL is removed if the intrinsic chain result is not used anywhere. I'm not sure if that is a bad thing since it means it's not really needed?

6988

Can this go in lowerImage? It seems like this will mark some cases as illegal that should be handled either in DAGToDAG or with patterns.

Leonc updated this revision to Diff 448074.Jul 27 2022, 9:31 AM
Leonc marked an inline comment as done.Jul 27 2022, 9:31 AM
  • Revert changes to return values.
  • Refactor code.
  • Address comments.
bcahoon added inline comments.Jul 27 2022, 3:33 PM
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
540

Will ILLEGAL be removed?

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
6510

No need for braces

8238

The v_illegal should have an operand for the chain operand in the original instruction, if one exists.

t0 EntryToken
res,ch = intrinsic t0, <other opernads>
->
res = undef
ch = v_illegal t0

Also, if there is a use of the chain definition in the original instruction, then there is no need to add the code below.

Missing testscases. Also should include some assembler and disassembler tests for V_ILLEGAL

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
8238

I'd expect to use the original chain, but it also doesn't really matter given that it's filler anyway

Leonc updated this revision to Diff 448301.Jul 28 2022, 4:33 AM
Leonc marked 8 inline comments as done.
  • Address comments.
Leonc added inline comments.Jul 28 2022, 4:34 AM
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
540

Thanks I missed that one.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7688

I think this is unnecessary. We still need to call lowerImage for the return value when it succeeds.

8238

Thanks @bcahoon & @arsenm.

I added a check for MemSDNode and pass the chain to V_ILLEGAL if it exists.

Leonc updated this revision to Diff 448660.Jul 29 2022, 9:20 AM
  • Handling for atomics plus tests.

A couple more tests are needed. A test for for the image intrinsic. Also an assembler test for the v_illegal encoding.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7666

no braces needed

llvm/lib/Target/AMDGPU/SIInstructions.td
3270

Do we need different version for gfx10?

Leonc updated this revision to Diff 448997.Aug 1 2022, 5:28 AM
  • Add different versions of V_ILLEGAL for different targets.
Leonc marked 2 inline comments as done.Aug 1 2022, 6:07 AM
foad added a comment.Aug 1 2022, 6:32 AM

The rationale is for library code that looks like:

void example(float *p, float v) {
  if (_ISA_verison == 9008 || __ISA_verison == 9010)
     global_atomic_fadd(p, v);
  else
     generic_atomic_fadd(p, v);
}

At optimization levels > 0, the dead code is eliminated, but at optimization level 0, the dead code remain and the compiler attempts to generate an instruction. That instruction would never be executed due to the condition, but it still needs to be generated. In these cases, we don't have multilibs (target specific libraries).

Right, but why can't global_atomic_fadd and generic_atomic_fadd be functions that just contain a call to the corresponding builtin and have an appropriate "target-cpu"= attribute? I guess you would need to do inlining in the backend once you know what subtarget you are compiling for. Is that viable?

llvm/test/CodeGen/AMDGPU/v_illegal-atomics.ll
1 ↗(On Diff #448997)

File should not be executable.

bcahoon added inline comments.Aug 1 2022, 7:28 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
3266

Do we want different mnemonics, or should the same mnemonic be used for the cases when the encoding is all 0's or all 1's?

When you add the llvm-mc tests, it will also be a good idea to add a test that pipes the output of llvm-mc through llvm-objdump to the that the different encodings show up properly for the two cases.

Leonc updated this revision to Diff 449018.Aug 1 2022, 7:42 AM
Leonc marked an inline comment as done.
  • Fix indentation, braces, and file mode.
Leonc marked an inline comment as done.Aug 1 2022, 7:43 AM
Leonc added inline comments.
llvm/lib/Target/AMDGPU/SIInstructions.td
3266

Do we want different mnemonics, or should the same mnemonic be used for the cases when the encoding is all 0's or all 1's?

@arsenm asked for separate definitions:

I would prefer to define a separate V_ILLEGAL that uses all 1s pre gfx10

Leonc added inline comments.Aug 1 2022, 7:56 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
8236

Ternary args are the wrong way round. Running local tests before I update the diff to fix this.

Leonc updated this revision to Diff 449021.Aug 1 2022, 8:01 AM
  • Fix ternary args.
Leonc updated this revision to Diff 449602.Aug 3 2022, 2:15 AM
  • Add v_illegal assembly tests.
  • Update instruction definitions.
Leonc marked an inline comment as done.Aug 3 2022, 2:19 AM
Leonc added inline comments.
llvm/test/MC/AMDGPU/v_illegal-atomics.s
1–2 ↗(On Diff #449602)

Adding modifications to pipe the compiler output through objdump as these tests aren't very helpful.

arsenm added a comment.Aug 3 2022, 6:46 AM

Right, but why can't global_atomic_fadd and generic_atomic_fadd be functions that just contain a call to the corresponding builtin and have an appropriate "target-cpu"= attribute? I guess you would need to do inlining in the backend once you know what subtarget you are compiling for. Is that viable?

The problem is in the called function with the appropriate attributes. The function isn't deleted and we still need to produce something. The current situation is it works in cases where the wrong subtarget also happens to have the same encoding, but breaks on targets that never had an equivalent instruction encoded

arsenm added inline comments.Aug 3 2022, 7:04 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
3266

The instruction is actually named v_illegal, so you shouldn't invent new names here. I would name these V_ILLEGAL for the gfx10 version and V_ILLEGAL_gfx6_gfx7_gf8_gfx9?

I also think V_ILLEGAL is available in a vop3 encoding (for gfx10) but probably should define that in a separate patch for the benefit of the disassembler

arsenm added inline comments.Aug 3 2022, 7:06 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
8238

Checking for MemSDNode doesn't make sense, other nodes can have chains. You would need to pass it in from the source if it applies. However I don't think it's worth the complexity, since this really is filler content anyway

Leonc added inline comments.Aug 3 2022, 7:12 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
8238

Thanks, I agree.

foad added a comment.Aug 3 2022, 7:12 AM

Right, but why can't global_atomic_fadd and generic_atomic_fadd be functions that just contain a call to the corresponding builtin and have an appropriate "target-cpu"= attribute? I guess you would need to do inlining in the backend once you know what subtarget you are compiling for. Is that viable?

The problem is in the called function with the appropriate attributes. The function isn't deleted and we still need to produce something. The current situation is it works in cases where the wrong subtarget also happens to have the same encoding, but breaks on targets that never had an equivalent instruction encoded

If target-cpu doesn't let you target different CPUs for different functions then I don't understand what it does.

arsenm added a comment.Aug 3 2022, 7:24 AM

If target-cpu doesn't let you target different CPUs for different functions then I don't understand what it does.

For the relevant cases in the library, we key off the subtarget feature in target-features. The problem is in the final codegen we've propagated the real cpu to the leaf function, which then artificially has an incompatible target-features. We would have to have some code fixup the target-cpu to some random other device with compatible features

Leonc updated this revision to Diff 449723.Aug 3 2022, 11:08 AM
  • Address comments.
Leonc marked 2 inline comments as done.Aug 3 2022, 11:12 AM
Leonc added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
8238

It turns out the chain is necessary. An assertion in RemoveNodeFromCSEMaps fails when Op is an atomic intrinsic.

arsenm added inline comments.Aug 3 2022, 2:25 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
8238

Yes, a chain is needed but you should be able to use the entry node. If you want to forward it in from the call sites, that would also be OK

Leonc updated this revision to Diff 450001.Aug 4 2022, 8:54 AM
  • Address comments and fix tests.
Leonc marked an inline comment as done.Aug 4 2022, 8:55 AM
arsenm added a comment.Aug 4 2022, 1:39 PM

Could also use a disassembler test

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
8234

SDLoc DL(Op)

Leonc updated this revision to Diff 450279.Aug 5 2022, 6:26 AM
  • Address comments.
Leonc marked an inline comment as done.Aug 5 2022, 6:26 AM
arsenm added inline comments.Aug 5 2022, 6:48 AM
llvm/test/CodeGen/AMDGPU/v_illegal-atomics.ll
40 ↗(On Diff #450279)

This looks backwards. gfx10 is getting the -1 encoding and gfx9 is getting 0

53 ↗(On Diff #450279)

gfx10 has v_illegal, so this should be 0x0 and printed as v_illegal

Leonc added inline comments.Aug 5 2022, 6:58 AM
llvm/test/CodeGen/AMDGPU/v_illegal-atomics.ll
40 ↗(On Diff #450279)

Are you sure gfx9 is getting the zero encoding? I thought this line was "address: encoding".
If this was wrong wouldn't the tests in v_illegal-atomics.s also be wrong?

arsenm added inline comments.Aug 5 2022, 7:02 AM
llvm/test/CodeGen/AMDGPU/v_illegal-atomics.ll
40 ↗(On Diff #450279)

You're right, gfx9 looks right. The gfx1030 line with .long 0xffffffff looks wrong.

Leonc added inline comments.Aug 5 2022, 7:07 AM
llvm/test/CodeGen/AMDGPU/v_illegal-atomics.ll
40 ↗(On Diff #450279)

I'm guessing gfx1030 is due to the disassebler. We know the encoding is correct from the tests in v_illegal-atomics.s.
I'll look into it.

Leonc updated this revision to Diff 450320.Aug 5 2022, 10:25 AM
  • Address comments.
Leonc marked 3 inline comments as done.Aug 5 2022, 10:26 AM
arsenm accepted this revision.Aug 5 2022, 1:07 PM

LGTM except for the stray debug print. We also probably should clean up the the encoding to rely on subtarget features instead of the generation check

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
8233

DRop this debug printing

This revision is now accepted and ready to land.Aug 5 2022, 1:07 PM
Leonc updated this revision to Diff 450368.Aug 5 2022, 1:11 PM
  • Remove debug code.
arsenm accepted this revision.Aug 5 2022, 1:11 PM
Leonc marked an inline comment as done.Aug 5 2022, 1:12 PM
Leonc added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
8233

Thanks, my mistake.

Leonc marked an inline comment as done.Aug 5 2022, 1:13 PM
Leonc added a comment.Aug 5 2022, 1:27 PM

We also probably should clean up the the encoding to rely on subtarget features instead of the generation check

Is there a generic way to do that without adding maintenance overhead, or is it better that we don't implicitly support new subtargets/generations in case they have different encodings of v_illegal?

This revision was landed with ongoing or failed builds.Aug 6 2022, 12:59 AM
This revision was automatically updated to reflect the committed changes.