This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Use vfirst insead of vcpop for i1 reduce.and/or.
AbandonedPublic

Authored by craig.topper on Jan 17 2023, 12:38 PM.

Details

Summary

vfirst has the chance of an early out in a microarchitecture, but
vcpop does not. Though I don't know of such a microarchitecture.

For and/or we only need to know if any 1 exists in the mask (after
invering for AND). So we can use vfirst.

Unfortunately, there is no sgez instruction so we end up needing
to invert an sltz for OR. That might make this not worthwhile if there
isn't a microarchitecture that optimizes vfirst. If the start value
is known to be 0 and the result is used by a branch we will hopefully
end up with a bgez instead.

Posting to collect other opinions.

Diff Detail

Event Timeline

craig.topper created this revision.Jan 17 2023, 12:38 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 17 2023, 12:38 PM
craig.topper requested review of this revision.Jan 17 2023, 12:38 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 17 2023, 12:38 PM

Is this early out really a feasible optimization for a microarchitecture?
All I can think about it is: for mask instruction vfirst.m, supposed that there are n = VLen / DataPath microinstructions issued (ideally, they are executed parallelly), then if the first k microinstructions find the 1, we may write the result of vfirst.m back before we get results of the last n-k microinstructions. But I don't see too much gains compared to the extra complexity.

Is this early out really a feasible optimization for a microarchitecture?
All I can think about it is: for mask instruction vfirst.m, supposed that there are n = VLen / DataPath microinstructions issued (ideally, they are executed parallelly), then if the first k microinstructions find the 1, we may write the result of vfirst.m back before we get results of the last n-k microinstructions. But I don't see too much gains compared to the extra complexity.

The question of gains has to do with how many rounds could be saved using this early out approach. It could be as much as VLen - 1 if DataPath = 1, it could be none if DataPath = VLen, or it could be somewhere in between.

It looks like using vfirst in the test cases above to implement the reduction lead to the same or one additional number of instructions. I wonder if the early out using vfirst makes up for the (possibly) additional instruction.

craig.topper abandoned this revision.Jan 30 2023, 2:30 PM

It feels not likely that a microarchitecture would want to implement a data dependent optimization that would make this worthwhile.