This should be approximately first and run with other module passes.
Details
Diff Detail
Event Timeline
@arsenm would you be so kind and explain the reasoning behind reshuffling the order of the passes.
In oneAPI math functions are handled through libclc which implements spir-v interface.
If, as an example, we focus on a call to tanh taking float4, oneAPI will generate a call to __spirv_ocl_tanh(float vector[4]), which under the hood for AMD backend is implemented as a sequence of scalar calls to __ocml_tanh_f32 (provided by AMD's device lib implementation).
Because AMDGPURemoveIncompatibleFunctions is now run early on, and as __spirv_ocl_tanh(float vector[4]) requires (among others) target support of FeatureDot3Inst (which is not provided on the GPU in question gfx1030), it is being earmarked for deletion and we end up with modules containing calls to deleted funcs:
%call3.i.i.i = tail call fastcc noundef <4 x float> null(<4 x float> noundef %agg.tmp.sroa.0.0.copyload.i)
As all the libclc functions are marked with always inline, it used to work well, as we would always replace the problematic vector function calls with exploded scalar versions of __ocml_..., this was handled for us by always inliner pass.
Would it be possible to have the incompatible functions pass run after the inliners?
If the function is unsupportable and going to be deleted, there's no point in running any of the other passes on the function. I was trying to keep all the module passes together
In oneAPI math functions are handled through libclc which implements spir-v interface.
If, as an example, we focus on a call to tanh taking float4, oneAPI will generate a call to __spirv_ocl_tanh(float vector[4]), which under the hood for AMD backend is implemented as a sequence of scalar calls to __ocml_tanh_f32 (provided by AMD's device lib implementation).
Because AMDGPURemoveIncompatibleFunctions is now run early on, and as __spirv_ocl_tanh(float vector[4]) requires (among others) target support of FeatureDot3Inst (which is not provided on the GPU in question gfx1030), it is being earmarked for deletion and we end up with modules containing calls to deleted funcs:
But __ocml_tanh_f32 doesn't use a dot intrinsic? Not sure how or where you would be seeing that. ocml is currently free of subtarget feature dependence
%call3.i.i.i = tail call fastcc noundef <4 x float> null(<4 x float> noundef %agg.tmp.sroa.0.0.copyload.i)As all the libclc functions are marked with always inline, it used to work well, as we would always replace the problematic vector function calls with exploded scalar versions of __ocml_..., this was handled for us by always inliner pass.
libclc should stop doing using always_inline (and we should stop running the always inliner in the backend (I'm moving towards deleting it in D152414). Function calls should work well now and we should act like a normal target, we don't need these old hacks to support old gaps in codegen support.
Would it be possible to have the incompatible functions pass run after the inliners?
The inliner should not be able to fix your code. If this is deleting the function, then it shouldn't have been an inlining candidate in the first place. Something is failing to consider the incompatible feature.
This pass is truly a disgusting and unmaintainable hack I want to eventually remove. The way every other target expects this to be handled is to not allow unsupportable code to taint the module in the first place. This situation would never work in a world where machine linking happens, and you're relying on a compiler implementation detail to produce a working program. A better implementation would be to conditionally load specialized versions of functions when the target is known, not pretend we have one generic implementation that the compiler fixes up. I've been working towards this recently, and have purged all of the subtarget dependence from ocml so I'm not sure why you're running into this. There are still many problematic subtarget dependencies in ockl functions though
Hi @arsen
Apologies for a late reply, this should teach me not to post a day before my holidays.
But __ocml_tanh_f32 doesn't use a dot intrinsic? Not sure how or where you would be seeing that. ocml is currently free of subtarget feature dependence
The inliner should not be able to fix your code. If this is deleting the function, then it shouldn't have been an inlining candidate in the first place. Something is failing to consider the incompatible feature.
This was exactly the problem with our implementation, we had assumptions with regards to the target embedded in the libclc bitcode that were not portable across architectures. Thank you very much for your help, much appreciated!