This patch lowers the SAD intrinsics to native LLVM IR. It comes with a clang patch (D45722).
How much value are we getting out of this change? Does this expose a lot of optimization potential to the middle end? This is a pretty complex sequence. How easy or likely is it for the middle end to mess this up and make it hard for the backend to recognize?
This is not a scalar reduction. The pattern calls for a sum of specifically formed vectors (hence all the checks below), so that the PSADBW instruction is formed where it fits the semantics exactly rather than where it merely serves as a reduction tool. This is also why a third recognition path is being added: the other paths use PSADBW for reductions, so they don't need the input pattern to match it in terms of which qword each specific byte corresponds to.
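To illustrate the distinction, here is a rough scalar model of the 128-bit PSADBW next to a plain SAD reduction. It is only a sketch: the names `psadbw128` and `sadReduce` are hypothetical and are not taken from the patch. The first function keeps one sum per qword, which is the lane correspondence the new path has to verify. The second collapses everything into one scalar, so byte placement does not matter.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstdlib>

using Bytes16 = std::array<std::uint8_t, 16>;

// Hypothetical reference model of 128-bit PSADBW: result lane 0 holds the sum
// of absolute differences of bytes 0..7, lane 1 holds that of bytes 8..15.
std::array<std::uint64_t, 2> psadbw128(const Bytes16 &a, const Bytes16 &b) {
  std::array<std::uint64_t, 2> result{0, 0};
  for (int i = 0; i < 16; ++i) {
    int diff = std::abs(static_cast<int>(a[i]) - static_cast<int>(b[i]));
    result[i / 8] += static_cast<std::uint64_t>(diff); // byte i feeds qword i / 8
  }
  return result;
}

// Hypothetical scalar reduction: a single total over all 16 bytes, so it does
// not matter which qword a given byte lands in.
std::uint64_t sadReduce(const Bytes16 &a, const Bytes16 &b) {
  std::uint64_t total = 0;
  for (int i = 0; i < 16; ++i)
    total += static_cast<std::uint64_t>(
        std::abs(static_cast<int>(a[i]) - static_cast<int>(b[i])));
  return total;
}
```

Moving a differing byte from the low qword to the high one changes the output of `psadbw128` but leaves `sadReduce` unchanged, which is why the reduction paths can be looser about the input pattern.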