This is an archive of the discontinued LLVM Phabricator instance.

[GVN] Support load of pointer-select to value-select conversion.
ClosedPublic

Authored by fhahn on Jan 25 2022, 6:53 AM.

Details

Summary

This patch extends the available-value logic to detect loads
of pointer-selects that can be replaced by a value select.

For example, consider the code below:

loop:
  %sel.phi = phi i32* [ %start, %ph ], [ %sel, %ph ]
  %l = load %ptr
  %l.sel = load %sel.phi
  %sel = select cond, %ptr, %sel.phi
  ...

exit:
  %res = load %sel
  use(%res)

The load of the pointer phi can be replaced by a load of the start value
outside the loop and a new phi/select chain based on the loaded values,
as illustrated below

  %l.start = load %start
loop:
  sel.phi.prom = phi i32 [ %l.start, %ph ], [ %sel.prom, %ph ]
  %l = load %ptr
  %sel.prom = select cond, %l, %sel.phi.prom
  ...
exit:
  use(%sel.prom)

This is a first step towards alllowing vectorizing loops using common libc++
library functions, like std::min_element (https://clang.godbolt.org/z/6czGzzqbs)

#include <vector>
#include <algorithm>

int foo(const std::vector<int> &V) {
    return *std::min_element(V.begin(), V.end());
}

Diff Detail

Event Timeline

fhahn created this revision.Jan 25 2022, 6:53 AM
fhahn requested review of this revision.Jan 25 2022, 6:53 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 25 2022, 6:53 AM

@fhahn This really looks like an extension of full redundancy elimination. If we had the select expressed as a phi/branch pattern, wouldn't we be able to determine the value is fully available and do SSA construction to propagate the loaded value? If so, then this feels like a case where we should extend the fully available logic and SSA construction in FRE rather than add a parallel transform.

(p.s. I might be missing something obvious. Writing this in a hurry to get you any response at all due to personal life craziness atm.)

fhahn updated this revision to Diff 404028.Jan 28 2022, 8:03 AM

@fhahn This really looks like an extension of full redundancy elimination. If we had the select expressed as a phi/branch pattern, wouldn't we be able to determine the value is fully available and do SSA construction to propagate the loaded value? If so, then this feels like a case where we should extend the fully available logic and SSA construction in FRE rather than add a parallel transform.

(p.s. I might be missing something obvious. Writing this in a hurry to get you any response at all due to personal life craziness atm.)

Thanks! It seems like this indeed fits rather nicely into the available-value logic. I updated the patch to introduce a new SelectVal to AvailableValueInBlock and create those nodes as needed, together with the required materialization code.

This version does not replace the load outside the loop of the pointer-select for now. But that might be also be do-able separately by teaching the available value analysis to replace the load %res with the select %0 in the code below.

define i32 @src(i32* %ptr, i32* %end) {
entry:
  %start.ptr = getelementptr inbounds i32, i32* %ptr, i64 1
  %l.2.pre = load i32, i32* %ptr, align 4
  br label %loop

loop:                                             ; preds = %loop, %entry
  %l.2 = phi i32 [ %l.2.pre, %entry ], [ %0, %loop ]
  %ptr.iv = phi i32* [ %start.ptr, %entry ], [ %ptr.iv.next, %loop ]
  %min.ptr = phi i32* [ %ptr, %entry ], [ %min.select, %loop ]
  %l.1 = load i32, i32* %ptr.iv, align 4
  %cmp.i.i.i = icmp ult i32 %l.1, %l.2
  %0 = select i1 %cmp.i.i.i, i32 %l.1, i32 %l.2
  %min.select = select i1 %cmp.i.i.i, i32* %ptr.iv, i32* %min.ptr
  %ptr.iv.next = getelementptr inbounds i32, i32* %ptr.iv, i64 1
  %ec = icmp eq i32* %ptr.iv.next, %end
  br i1 %ec, label %exit, label %loop

exit:                                             ; preds = %loop
  %res = load i32, i32* %min.select, align 4
  ret i32 %res
}
fhahn retitled this revision from [GVN] Support loop load PRE through pointer select. to [GVN] Support load of pointer-select to value-select conversion..Jan 28 2022, 8:05 AM
fhahn edited the summary of this revision. (Show Details)
reames accepted this revision.Jan 31 2022, 12:05 PM

This wasn't quite what I'd expected, but this approach appears to be workable, if not fully generic.

The fully generic result I was expecting was for you to generalize the available block mechanism into a mapping per address of availability per block plus a side structure to remember the address translation. (We implicitly already have the later for phi nodes, but we'd have to track it explicitly for selects.)

The general mechanism is a much deeper change, so I understand if you don't want to do it now.

llvm/lib/Transforms/Scalar/GVN.cpp
977

I wasn't sure GVN already relied on comesBefore caching and thus was initial concerned this might be a performance bottleneck, but it looks like there's a bunch of cases we already assume block order is up to date, so this should be a non-issue.

988

Remove the extra def?

1261

Repeated code, pull out a static function.

1273

You appear to only be handling the case where the select and it's loads are in a different block from the load being removed. Please add the corresponding case for block local code as well.

1275

Unless I'm really missing something obvious, you've got a serious omission here.

The memory dependence walk already done tells us there's no clobber between the select and the using loads. I do not see anything in the added code which checks for a clobber between select and the loads above it.

L1 = load A1
L2 = load A2
clobber
S = select A1, A2
L3 = load S

This revision is now accepted and ready to land.Jan 31 2022, 12:05 PM
fhahn updated this revision to Diff 405089.Feb 1 2022, 1:45 PM
fhahn marked 5 inline comments as done.

This wasn't quite what I'd expected, but this approach appears to be workable, if not fully generic.

Thanks Philip, comments should be addressed. I intend to push the updated version with your comments addressed tomorrow, unless there are any concers raised.

The fully generic result I was expecting was for you to generalize the available block mechanism into a mapping per address of availability per block plus a side structure to remember the address translation. (We implicitly already have the later for phi nodes, but we'd have to track it explicitly for selects.)

The general mechanism is a much deeper change, so I understand if you don't want to do it now.

I'd prefer to keep the scope limited to start with. When we move away from MemDepAnalysis here might be a good time to see if MemorySSA could help with generalizing this code.

This revision was landed with ongoing or failed builds.Feb 2 2022, 1:23 AM
This revision was automatically updated to reflect the committed changes.
chill added a subscriber: chill.Feb 3 2022, 9:25 AM
chill added inline comments.Feb 3 2022, 9:29 AM
llvm/lib/Transforms/Scalar/GVN.cpp
1729

Shouldn't we allow the function to continue if we came here with a SelectInst
in the dep address, like on line 2022 ?