- User Since
- Nov 15 2018, 2:10 PM (227 w, 5 d)
Mar 10 2022
@kmclaughlin Thanks for the reply and adding the subtarget check. I figured this might have something to do with the scalable vectorization and this is a reasonable stopgap.
Mar 7 2022
I find the performance claims interesting. I've asked SVE hardware engineers before which is faster (gather, load and shuffle, or load2), and the answer was essentially "depends on your loop". If you're using up ports to shuffle that could otherwise be used for computation, it seems like this would be a loser. If that analysis is correct, then IMO this decision should be made in LV and the backend should honor the gather. However, if it stays in the backend, I still have some comments:
You can also join this LLVM-VP Discord channel to follow or contribute to its progress.
May 29 2020
Given the constraints in SDAG, we should choose the (fma(fma)) variant by default (assuming, as we do here, that the target has fma instructions). For example, on x86, our best perf heuristic at this stage of compilation on any recent Intel or AMD core is the number of uops. The option with separate fmul and fadd always has more uops, so it would be backwards to choose that sequence here and then try to undo it later.
Not sure if I'm understanding the question. Is there a target or a code pattern with a known disadvantage for the 2 fma variant?
Wouldn't it be better to choose between what you have here, fmadd(a,b,fma(c,d,n)), and a*b + fmadd(c,d,n), for targets that perform worse with FMA chains?
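For readers following the thread, the two sequences under discussion can be sketched in plain C++ with std::fma. This is only an illustration of the expression shapes, not the SDAG combine itself; the function names are hypothetical, and a compiler may contract the second form back into an FMA depending on its floating-point contraction settings.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper names, used for illustration only.

// Variant 1: fma(a, b, fma(c, d, n)).
// Two operations, but the outer FMA depends on the inner one's result.
double chained_fma(double a, double b, double c, double d, double n) {
    return std::fma(a, b, std::fma(c, d, n));
}

// Variant 2: a*b + fma(c, d, n), written as fmul, fma, fadd.
// Three operations; the fmul and the fma do not depend on each other.
// Assumption: the compiler does not contract prod + acc into an FMA.
double split_mul_add(double a, double b, double c, double d, double n) {
    double prod = a * b;             // fmul
    double acc = std::fma(c, d, n);  // fma
    return prod + acc;               // fadd
}

// With small integer inputs every intermediate value is exact,
// so both variants must agree: 2*3 + 4*5 + 1 == 27.
void check_exact_inputs() {
    assert(chained_fma(2.0, 3.0, 4.0, 5.0, 1.0) == 27.0);
    assert(split_mul_add(2.0, 3.0, 4.0, 5.0, 1.0) == 27.0);
}
```

For general inputs the two variants can differ in the last bit, since the first rounds twice and the second rounds three times. Which one is faster depends on the target, which is the question raised above.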