Related tasks:
- SWDEV-240194
- SWDEV-309417
- SWDEV-334876
Differential D123693: Transform illegal intrinsics to V_ILLEGAL
Authored by Leonc on Apr 13 2022, 10:39 AM.
Event Timeline

Comment Actions

I think this is too focused. There are other image_sample_lz intrinsics, and all of them potentially need to be replaced if following this approach.

Comment Actions

There is also GlobalISel.
Comment Actions

Next steps:
Issues:
Comment Actions

Where's the rationale? Surely unsupported intrinsics should fail to select, and perhaps be diagnosed even earlier?

Comment Actions

Normally, yes. The issue is that our libraries contain code for all targets. We rely on dead-code elimination to remove calls to illegal intrinsics, but that doesn't happen at -O0.

Comment Actions
Why can't you put the code for different subtargets in different functions, each with an appropriate "target-cpu"= attribute?

Comment Actions

That's what I would do. I think there's some resistance to it, though, and it would probably require a lot more work.

Comment Actions

Well, I am resistant to selecting intrinsics on subtargets that don't support them :) How far are you planning to go with this approach? Do you expect to extend it to all AMDGPU intrinsics?

Comment Actions

Agreed. My initial approach was rejected for being too specific. Discussions since then have been about finding a solution for all illegal intrinsic calls. That's the plan unless I hear otherwise.

Comment Actions

The rationale is library code that looks like:

```
if (__ISA_version == 9008 || __ISA_version == 9010)
  global_atomic_fadd(p, v);
else
  generic_atomic_fadd(p, v);
```

At optimization levels above 0, the dead branch is eliminated, but at -O0 the dead code remains and the compiler attempts to generate an instruction for it. That instruction would never be executed, due to the condition, but it still needs to be generated. In these cases we don't have multilibs (target-specific libraries).

Comment Actions

Thanks @bcahoon. If there were another available approach that always worked, we would use it.

Comment Actions

Thank you for working on this. A debug build of PyTorch (BUILD_DEBUG_INFO=1) is also affected. Here is a creduce'd test case that crashes the ROCm 5.1.1 clang, which you could possibly include in the test cases (compile with -O0):

```
extern "C" __attribute__((device)) void __ockl_atomic_add_noret_f32(float *, float);

__attribute__((device)) void a(float *address, float b) {
  __ockl_atomic_add_noret_f32(address, b);
}
```
Comment Actions

Missing test cases. This should also include some assembler and disassembler tests for V_ILLEGAL.
Comment Actions

Right, but why can't global_atomic_fadd and generic_atomic_fadd be functions that just contain a call to the corresponding builtin, each with an appropriate "target-cpu"= attribute? I guess you would then need to do inlining in the backend, once you know which subtarget you are compiling for. Is that viable?
Comment Actions

The problem remains in the called function, even with the appropriate attributes. The function isn't deleted, and we still need to produce something for it. The current situation is that this works when the wrong subtarget happens to use the same encoding, but it breaks on targets that never had an equivalent instruction encoded.
Comment Actions

If "target-cpu" doesn't let you target different CPUs for different functions, then I don't understand what it does.

Comment Actions

For the relevant cases in the library, we key off the subtarget feature in "target-features". The problem is that by final codegen we've propagated the real CPU to the leaf function, which then artificially has incompatible "target-features". We would have to have some code fix up "target-cpu" to some random other device with compatible features.
Comment Actions

This could also use a disassembler test.
Comment Actions

LGTM except for the stray debug print. We should also probably clean up the encoding to rely on subtarget features instead of the generation check.
Comment Actions

Is there a generic way to do that without adding maintenance overhead? Or is it better that we don't implicitly support new subtargets/generations, in case they have different encodings of V_ILLEGAL?
Comment Actions

I think it may be easiest to create the ILLEGAL instruction here, rather than creating it after the call to lowerImage. That way, there don't have to be multiple copies of the code that creates ILLEGAL. For example:

```
MachineSDNode *NewNode =
    DAG.getMachineNode(AMDGPU::V_ILLEGAL, DL, MVT::Other, Op.getOperand(0));
return DAG.getMergeValues(
    {DAG.getUNDEF(Op.getValueType()), SDValue(NewNode, 0)}, DL);
```

The UNDEF value replaces the return result of the intrinsic, and the V_ILLEGAL replaces the chain result. It appears, though, that the V_ILLEGAL is removed if the intrinsic's chain result is not used anywhere. I'm not sure whether that is a bad thing, since it means it's not really needed?