fma:
The previous fma implementation passed the CTS on its own, but cos (which uses fma) failed because of it.
This new version is closer to the default libclc implementation. It still avoids ulong so as not to bring in a dependency on u64.
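For readers unfamiliar with the constraint, one common way to do without 64-bit integers is to carry wide intermediates as pairs of 32-bit words. The sketch below only illustrates that idea in plain C and is not the code from this change; `u32x2` and `mul_wide_u32` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical helper type: a 64-bit value held as two 32-bit halves. */
typedef struct {
    uint32_t hi;
    uint32_t lo;
} u32x2;

/* Full 32x32 -> 64-bit product using only 32-bit arithmetic.
 * Each operand is split into 16-bit halves so that no partial
 * product or partial sum can overflow 32 bits. */
static u32x2 mul_wide_u32(uint32_t a, uint32_t b) {
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint32_t ll = a_lo * b_lo;
    uint32_t lh = a_lo * b_hi;
    uint32_t hl = a_hi * b_lo;
    uint32_t hh = a_hi * b_hi;

    /* Accumulate the two middle partial products with their carries. */
    uint32_t mid1 = hl + (ll >> 16);
    uint32_t mid2 = lh + (mid1 & 0xFFFFu);

    u32x2 r;
    r.hi = hh + (mid1 >> 16) + (mid2 >> 16);
    r.lo = (mid2 << 16) | (ll & 0xFFFFu);
    return r;
}
```

A software fma needs this kind of wide product when multiplying the two mantissas; keeping it in 32-bit words costs a few extra operations but does not require 64-bit integer support from the target.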
vstore:
Based on the default libclc implementation. I have added several conditions required for the vstore tests to pass on different platforms (tested on NVIDIA, Intel and AMD).
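As a reminder of the semantics being implemented, `vstoreN(data, offset, p)` writes the N components of `data` starting at `p + offset * N`. The sketch below shows the two-component case in plain C. It is an illustration only: `float2_t` and `vstore2_sketch` are hypothetical stand-ins, and the platform-specific conditions added in this change are not reproduced here.

```c
#include <stddef.h>

/* Hypothetical stand-in for the OpenCL float2 vector type. */
typedef struct {
    float x;
    float y;
} float2_t;

/* Store both components of v at element index offset * 2 of p,
 * mirroring the addressing rule of OpenCL's vstore2. */
static void vstore2_sketch(float2_t v, size_t offset, float *p) {
    p[offset * 2 + 0] = v.x;
    p[offset * 2 + 1] = v.y;
}
```

Note that the offset counts whole vectors, not scalar elements, so an offset of 1 targets elements 2 and 3 of the buffer.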
inlining issues:
At the moment, fma is inlined into functions like cos. But on architectures that have a compliant native implementation of fma, it is better not to inline it, so that the call can be replaced by the native implementation.
Force libclc builtins not to be inlined and add an assume, so that the noinline can then be removed in clspv once the NativeMathPass has run.
Also add the missing builtins needed on some platforms (at least SwiftShader and NVIDIA).