This fixes PR28474.
Admittedly, this is probably not the "right way" to fix this. We already handle one case by rebuilding the entire tree with the roots reversed (line 3816), and this patch adds another. Rebuilding does a lot of redundant work, and trying a fixed set of alternative orders obviously can't work for arbitrary orders of the loads.
What we'd really want is to sort the loads "on-the-fly" while building the tree, at least in the cases where the order of the scalars doesn't matter (i.e. reductions). In theory, we could also have a load + shuffle when the order does matter, or even prefer sorting loads to sorting stores when we have a store-rooted reduction, but that'd require additional cost modeling. Unfortunately, I still don't grok the SLP vectorizer enough to understand how to do that correctly. It's not just a question of making the tree mutable - it seems like there are plenty of places that assume the order doesn't change (treating VL[0] as special, isSame() that cares about the order, external users, etc.).
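To make the idea concrete, here is a rough standalone sketch of what "sorting the loads" could mean. It is not SLP vectorizer code: `LoadRef`, `sortedLoadOrder` and `restoreMask` are hypothetical names invented for illustration, and a load is modeled as just a base id plus a constant element offset. For a reduction only the sorted order would be needed; when the order matters, the inverse permutation would serve as the shuffle mask applied after the wide load.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <optional>
#include <vector>

// Hypothetical stand-in for a scalar load: a base pointer id plus a constant
// element offset from that base. This is not an LLVM type.
struct LoadRef {
  int Base;
  std::int64_t Offset;
};

// If the loads in VL become consecutive once sorted by offset, returns the
// sorted order: Order[Lane] is the index in VL of the load that lands in
// lane Lane of the wide load. Returns std::nullopt otherwise.
std::optional<std::vector<unsigned>>
sortedLoadOrder(const std::vector<LoadRef> &VL) {
  if (VL.empty())
    return std::nullopt;

  std::vector<unsigned> Order(VL.size());
  std::iota(Order.begin(), Order.end(), 0u);
  std::stable_sort(Order.begin(), Order.end(), [&](unsigned A, unsigned B) {
    return VL[A].Offset < VL[B].Offset;
  });

  // All loads must share one base and cover a gap-free range of offsets.
  const LoadRef &First = VL[Order[0]];
  for (std::size_t Lane = 0; Lane < Order.size(); ++Lane) {
    const LoadRef &L = VL[Order[Lane]];
    if (L.Base != First.Base ||
        L.Offset != First.Offset + static_cast<std::int64_t>(Lane))
      return std::nullopt;
  }
  return Order;
}

// For the order-sensitive case: Mask[I] is the lane of the wide load that
// holds the value originally requested at position I, i.e. the inverse of
// Order. A reduction can simply ignore this mask.
std::vector<unsigned> restoreMask(const std::vector<unsigned> &Order) {
  std::vector<unsigned> Mask(Order.size());
  for (unsigned Lane = 0; Lane < Order.size(); ++Lane)
    Mask[Order[Lane]] = Lane;
  return Mask;
}
```

For loads at offsets 2, 0, 3, 1 from the same base, `sortedLoadOrder` yields the order {1, 3, 0, 2} and `restoreMask` yields {2, 0, 3, 1}. The sketch deliberately ignores everything that makes this hard in the real tree (external users, `isSame()`, the special treatment of `VL[0]`) and says nothing about cost modeling.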
Advice will be appreciated.