This is an archive of the discontinued LLVM Phabricator instance.

[LV] Combine vector reductions parts in tree instead of serially.
AbandonedPublic

Authored by fhahn on Jan 17 2022, 9:50 AM.

Details

Summary

At the moment, LV chains together the reduction values for all parts
serially. This results in larger than necessary dependency chains.

This patch updates LV to repeatedly combine adjacent pairs of parts to
combine them, for arithmetic opcodes.

Diff Detail

Event Timeline

fhahn created this revision.Jan 17 2022, 9:50 AM
fhahn requested review of this revision.Jan 17 2022, 9:50 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 17 2022, 9:50 AM

Does this alter much? Or do we end up redistributing them anyway? https://godbolt.org/z/z4nf5hPna

Does this alter much? Or do we end up redistributing them anyway? https://godbolt.org/z/z4nf5hPna

It won't have a massive impact in general, but it shaves off a few cycles, depending on the interleave count.

AFAICT the redistributions done in the https://godbolt.org/z/z4nf5hPna are done by ReassoicatePass, which likes to turn parallel reduction trees into serial ones (? but that's a separate issue I think), like for @float2, which looks like it got serialized. I don't think any passes that run after the vectorizer try to improve the length of reduction chains: https://godbolt.org/z/v4K4aK3a1

It won't have a massive impact in general, but it shaves off a few cycles, depending on the interleave count.

AFAICT the redistributions done in the https://godbolt.org/z/z4nf5hPna are done by ReassoicatePass, which likes to turn parallel reduction trees into serial ones (? but that's a separate issue I think), like for @float2, which looks like it got serialized. I don't think any passes that run after the vectorizer try to improve the length of reduction chains: https://godbolt.org/z/v4K4aK3a1

Do we think this is something that should be done in general? This looks like it will allow the reordering of fp instructions under -hints-allow-reordering=true without fast flags, which would not otherwise be reassociatable. But the other cases could always be done by the backend if it considered it profitable.

fhahn abandoned this revision.Jan 9 2023, 9:43 AM

It won't have a massive impact in general, but it shaves off a few cycles, depending on the interleave count.

AFAICT the redistributions done in the https://godbolt.org/z/z4nf5hPna are done by ReassoicatePass, which likes to turn parallel reduction trees into serial ones (? but that's a separate issue I think), like for @float2, which looks like it got serialized. I don't think any passes that run after the vectorizer try to improve the length of reduction chains: https://godbolt.org/z/v4K4aK3a1

Do we think this is something that should be done in general? This looks like it will allow the reordering of fp instructions under -hints-allow-reordering=true without fast flags, which would not otherwise be reassociatable. But the other cases could always be done by the backend if it considered it profitable.

It looks like there was a restriction in the MachineCombiner's reassociate logic that was prevent reassociation here. I think the restriction can be removed, then those cases should be handled properly in the backend: D141302

Herald added a project: Restricted Project. · View Herald TranscriptJan 9 2023, 9:43 AM