This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Move placement of RemoveIncompatibleFunctions
ClosedPublic

Authored by arsenm on Jul 21 2023, 12:38 PM.

Details

Reviewers
Pierre-vh
Group Reviewers
Restricted Project
Summary

This should be approximately first and run with other module passes.

Diff Detail

Event Timeline

arsenm created this revision.Jul 21 2023, 12:38 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 21 2023, 12:38 PM
arsenm requested review of this revision.Jul 21 2023, 12:38 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 21 2023, 12:38 PM
Herald added a subscriber: wdng. · View Herald Transcript
Pierre-vh accepted this revision.Jul 27 2023, 12:09 AM
This revision is now accepted and ready to land.Jul 27 2023, 12:09 AM

@arsenm would you be so kind and explain the reasoning behind reshuffling the order of the passes.

In oneAPI math functions are handled through libclc which implements spir-v interface.

If, as an example, we focus on a call to tanh taking float4, oneAPI will generate a call to __spirv_ocl_tanh(float vector[4]), which under the hood for AMD backend is implemented as a sequence of scalar calls to __ocml_tanh_f32 (provided by AMD's device lib implementation).

Because AMDGPURemoveIncompatibleFunctions is now run early on, and as __spirv_ocl_tanh(float vector[4]) requires (among others) target support of FeatureDot3Inst (which is not provided on the GPU in question gfx1030), it is being earmarked for deletion and we end up with modules containing calls to deleted funcs:

%call3.i.i.i = tail call fastcc noundef <4 x float> null(<4 x float> noundef %agg.tmp.sroa.0.0.copyload.i)

As all the libclc functions are marked with always inline, it used to work well, as we would always replace the problematic vector function calls with exploded scalar versions of __ocml_..., this was handled for us by always inliner pass.

Would it be possible to have the incompatible functions pass run after the inliners?

@arsenm would you be so kind and explain the reasoning behind reshuffling the order of the passes.

If the function is unsupportable and going to be deleted, there's no point in running any of the other passes on the function. I was trying to keep all the module passes together

In oneAPI math functions are handled through libclc which implements spir-v interface.

If, as an example, we focus on a call to tanh taking float4, oneAPI will generate a call to __spirv_ocl_tanh(float vector[4]), which under the hood for AMD backend is implemented as a sequence of scalar calls to __ocml_tanh_f32 (provided by AMD's device lib implementation).

Because AMDGPURemoveIncompatibleFunctions is now run early on, and as __spirv_ocl_tanh(float vector[4]) requires (among others) target support of FeatureDot3Inst (which is not provided on the GPU in question gfx1030), it is being earmarked for deletion and we end up with modules containing calls to deleted funcs:

But __ocml_tanh_f32 doesn't use a dot intrinsic? Not sure how or where you would be seeing that. ocml is currently free of subtarget feature dependence

%call3.i.i.i = tail call fastcc noundef <4 x float> null(<4 x float> noundef %agg.tmp.sroa.0.0.copyload.i)

As all the libclc functions are marked with always inline, it used to work well, as we would always replace the problematic vector function calls with exploded scalar versions of __ocml_..., this was handled for us by always inliner pass.

libclc should stop doing using always_inline (and we should stop running the always inliner in the backend (I'm moving towards deleting it in D152414). Function calls should work well now and we should act like a normal target, we don't need these old hacks to support old gaps in codegen support.

Would it be possible to have the incompatible functions pass run after the inliners?

The inliner should not be able to fix your code. If this is deleting the function, then it shouldn't have been an inlining candidate in the first place. Something is failing to consider the incompatible feature.

This pass is truly a disgusting and unmaintainable hack I want to eventually remove. The way every other target expects this to be handled is to not allow unsupportable code to taint the module in the first place. This situation would never work in a world where machine linking happens, and you're relying on a compiler implementation detail to produce a working program. A better implementation would be to conditionally load specialized versions of functions when the target is known, not pretend we have one generic implementation that the compiler fixes up. I've been working towards this recently, and have purged all of the subtarget dependence from ocml so I'm not sure why you're running into this. There are still many problematic subtarget dependencies in ockl functions though

Hi @arsen
Apologies for a late reply, this should teach me not to post a day before my holidays.

But __ocml_tanh_f32 doesn't use a dot intrinsic? Not sure how or where you would be seeing that. ocml is currently free of subtarget feature dependence

The inliner should not be able to fix your code. If this is deleting the function, then it shouldn't have been an inlining candidate in the first place. Something is failing to consider the incompatible feature.

This was exactly the problem with our implementation, we had assumptions with regards to the target embedded in the libclc bitcode that were not portable across architectures. Thank you very much for your help, much appreciated!