This is an archive of the discontinued LLVM Phabricator instance.

[LICM] Hoist loads with invariant.group metadata
AbandonedPublic

Authored by aeubanks on Apr 1 2021, 8:35 PM.

Details

Summary

We want to hoist loads when possible.

However, we can only safely retain metadata if the instruction is
guaranteed to run in the loop. For the purposes of the invariant.group
metadata, keeping it is more important than hoisting it, so do not hoist
a load if it is not guaranteed to run in the loop.

Diff Detail

Event Timeline

aeubanks created this revision.Apr 1 2021, 8:35 PM
aeubanks requested review of this revision.Apr 1 2021, 8:35 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 1 2021, 8:35 PM
rnk accepted this revision.Apr 5 2021, 2:42 PM
rnk added a reviewer: rsmith.

+@rsmith who also helped with the design

I think this is correct, I think this is how this is supposed to work.

This revision is now accepted and ready to land.Apr 5 2021, 2:43 PM
This revision was landed with ongoing or failed builds.Apr 8 2021, 9:57 PM
This revision was automatically updated to reflect the committed changes.
lebedev.ri reopened this revision.EditedMay 8 2021, 5:46 AM
lebedev.ri added a subscriber: lebedev.ri.

Hello. I have just reverted this in rG1acd9a1a29ac30044ecefb6613485d5d168f66ca,
as it bisected to be the first bad commit for a crash.

This appears to miscompile google benchmark's GetCacheSizesFromKVFS()
when compiling with -fstrict-vtable-pointers.
Runnable reproducer: https://godbolt.org/z/f9ovKqTzb

$ ./bin/clang++ -O3 -fstrict-vtable-pointers /tmp/test.cpp -g0
$ ./a.out 
Segmentation fault

The "f.fail()" crashes with BUS error, it is compiled into testb,
and the address it is testing is non-sensical.

Please let me know if you are having issues reproducing.
I don't believe this is UB in source code, at least i don't see it.

This revision is now accepted and ready to land.May 8 2021, 5:46 AM
lebedev.ri requested changes to this revision.May 8 2021, 5:47 AM
This revision now requires changes to proceed.May 8 2021, 5:47 AM
rnk added a comment.May 10 2021, 5:57 PM

Thanks for the repro!

while (true) {
  std::string FPath = StrCat(dir, "index", Idx++, "/");
  std::ifstream f(StrCat(FPath, "size"));
  if (!f.is_open()) break;
  std::string suffix;
  int size;
  f >> size;
  if (f.fail()) // BUS error
    return;
  if (f.good()) {
    std::cout << "read " << size << "\n";
  }
}

Clang is able to devirtualize the fail call on its own. However, it still has IR to perform the virtual base class adjustment:

%13 = bitcast %"class.std::basic_ifstream"* %f to %"class.std::basic_istream"*
%call3 = call nonnull align 8 dereferenceable(16) %"class.std::basic_istream"* @_ZNSirsERi(%"class.std::basic_istream"* nonnull dereferenceable(16) %13, i32* nonnull align 4 dereferenceable(4) %size)
%14 = bitcast %"class.std::basic_ifstream"* %f to i8**
%vtable = load i8*, i8** %14, align 8, !tbaa !10, !invariant.group !12
%vbase.offset.ptr = getelementptr i8, i8* %vtable, i64 -24
%15 = bitcast i8* %vbase.offset.ptr to i64*
%vbase.offset = load i64, i64* %15, align 8
%16 = bitcast %"class.std::basic_ifstream"* %f to i8*
%add.ptr = getelementptr inbounds i8, i8* %16, i64 %vbase.offset
%17 = bitcast i8* %add.ptr to %"class.std::basic_ios"*
%call4 = call zeroext i1 @_ZNKSt9basic_iosIcSt11char_traitsIcEE4failEv(%"class.std::basic_ios"* nonnull dereferenceable(264) %17)

This vtable load is based directly on the %f alloca, and not the result of a launder operation. Because there is no launder operation, LICM hoists this vptr load to the entry block, so it loads uninit stack memory. The virtual base class offset load happens before calling fail, which is where the crash happens.

So: this is a frontend bug, and perhaps a frontend missed optimization. In general, clang does not launder local variable allocas, and that seems like it could be a problem. However, always laundering and stripping every dynamic local variable is probably way overkill, and would block SROA. Maybe we could only use laundered object pointers for the vptr loads.

Prazek requested changes to this revision.May 11 2021, 6:14 AM

I had some time to look into it now, sorry that I missed it on first revision.
If I understand the LICM code correctly, it needs to drop all the instruction metadata when hoisting instruction that will not be executed unconditionally. This means that code like:
https://godbolt.org/z/bjP64PbhY

struct A {
  virtual void foo();
};

bool p(int, int);

void loop(A& a, int n) {

  int i = 0;
  for (int i = 0 ; i < n; i++) {
      if (p(i, n))
          a.foo();
  }
}

When vtable load will be hoisted, it will be stripped from invariant.group metadata. This means that we won't gonna be able to devirtualize it further after e.g. inlining this function somewhere (exposing other virtual call or construction of a.
Assuming my understanding is right (please run this example with your patch), I would oppose to always hoist loads with !invariant.group loads. It still might be beneficial for default optimization pipeline, but I worry that it will limit the optimizer across modules (LTO, ThinLTO etc).
I think it would be better to be on the safer side and hoist loads with !invariant.load only if they are executed unconditionally. I tried to prototpe it long time ago here: https://reviews.llvm.org/D45151

aeubanks updated this revision to Diff 345052.May 12 2021, 10:42 PM

only hoist when load is guaranteed to run

aeubanks edited the summary of this revision. (Show Details)May 12 2021, 10:45 PM

I'm wondering if we might be dropping invariant.group metadata on loads we were already hoisting

In D99784#2749459, @rnk wrote:

Thanks for the repro!

while (true) {
  std::string FPath = StrCat(dir, "index", Idx++, "/");
  std::ifstream f(StrCat(FPath, "size"));
  if (!f.is_open()) break;
  std::string suffix;
  int size;
  f >> size;
  if (f.fail()) // BUS error
    return;
  if (f.good()) {
    std::cout << "read " << size << "\n";
  }
}

Clang is able to devirtualize the fail call on its own. However, it still has IR to perform the virtual base class adjustment:

%13 = bitcast %"class.std::basic_ifstream"* %f to %"class.std::basic_istream"*
%call3 = call nonnull align 8 dereferenceable(16) %"class.std::basic_istream"* @_ZNSirsERi(%"class.std::basic_istream"* nonnull dereferenceable(16) %13, i32* nonnull align 4 dereferenceable(4) %size)
%14 = bitcast %"class.std::basic_ifstream"* %f to i8**
%vtable = load i8*, i8** %14, align 8, !tbaa !10, !invariant.group !12
%vbase.offset.ptr = getelementptr i8, i8* %vtable, i64 -24
%15 = bitcast i8* %vbase.offset.ptr to i64*
%vbase.offset = load i64, i64* %15, align 8
%16 = bitcast %"class.std::basic_ifstream"* %f to i8*
%add.ptr = getelementptr inbounds i8, i8* %16, i64 %vbase.offset
%17 = bitcast i8* %add.ptr to %"class.std::basic_ios"*
%call4 = call zeroext i1 @_ZNKSt9basic_iosIcSt11char_traitsIcEE4failEv(%"class.std::basic_ios"* nonnull dereferenceable(264) %17)

This vtable load is based directly on the %f alloca, and not the result of a launder operation. Because there is no launder operation, LICM hoists this vptr load to the entry block, so it loads uninit stack memory. The virtual base class offset load happens before calling fail, which is where the crash happens.

So: this is a frontend bug, and perhaps a frontend missed optimization. In general, clang does not launder local variable allocas, and that seems like it could be a problem. However, always laundering and stripping every dynamic local variable is probably way overkill, and would block SROA. Maybe we could only use laundered object pointers for the vptr loads.

I think the current frontend implementation is correct at least in this instance, the load from the alloca should always result in the same value. Although I'm wondering if it's possible that we end up using one alloca for different dynamic types with different lifetimes. If two different types have the invariant.group metadata on vtable loads, that would be bad.

The problem is that we're moving the load before the store (or call to constructor).
Currently, LICM will hoist loads, but only if it can prove the memory they point to is not modified in the loop. This patch bypasses that check for loads with invariant.group metadata. So the current patch is still wrong. We can only hoist if the corresponding store is outside the loop. With invariant.group metadata we're not worried about the loop changing the value, just that the value was stored to in the first place.

In D99784#2749459, @rnk wrote:

Thanks for the repro!

while (true) {
  std::string FPath = StrCat(dir, "index", Idx++, "/");
  std::ifstream f(StrCat(FPath, "size"));
  if (!f.is_open()) break;
  std::string suffix;
  int size;
  f >> size;
  if (f.fail()) // BUS error
    return;
  if (f.good()) {
    std::cout << "read " << size << "\n";
  }
}

Clang is able to devirtualize the fail call on its own. However, it still has IR to perform the virtual base class adjustment:

%13 = bitcast %"class.std::basic_ifstream"* %f to %"class.std::basic_istream"*
%call3 = call nonnull align 8 dereferenceable(16) %"class.std::basic_istream"* @_ZNSirsERi(%"class.std::basic_istream"* nonnull dereferenceable(16) %13, i32* nonnull align 4 dereferenceable(4) %size)
%14 = bitcast %"class.std::basic_ifstream"* %f to i8**
%vtable = load i8*, i8** %14, align 8, !tbaa !10, !invariant.group !12
%vbase.offset.ptr = getelementptr i8, i8* %vtable, i64 -24
%15 = bitcast i8* %vbase.offset.ptr to i64*
%vbase.offset = load i64, i64* %15, align 8
%16 = bitcast %"class.std::basic_ifstream"* %f to i8*
%add.ptr = getelementptr inbounds i8, i8* %16, i64 %vbase.offset
%17 = bitcast i8* %add.ptr to %"class.std::basic_ios"*
%call4 = call zeroext i1 @_ZNKSt9basic_iosIcSt11char_traitsIcEE4failEv(%"class.std::basic_ios"* nonnull dereferenceable(264) %17)

This vtable load is based directly on the %f alloca, and not the result of a launder operation. Because there is no launder operation, LICM hoists this vptr load to the entry block, so it loads uninit stack memory. The virtual base class offset load happens before calling fail, which is where the crash happens.

So: this is a frontend bug, and perhaps a frontend missed optimization. In general, clang does not launder local variable allocas, and that seems like it could be a problem. However, always laundering and stripping every dynamic local variable is probably way overkill, and would block SROA. Maybe we could only use laundered object pointers for the vptr loads.

I think the current frontend implementation is correct at least in this instance, the load from the alloca should always result in the same value. Although I'm wondering if it's possible that we end up using one alloca for different dynamic types with different lifetimes. If two different types have the invariant.group metadata on vtable loads, that would be bad.

The problem is that we're moving the load before the store (or call to constructor).
Currently, LICM will hoist loads, but only if it can prove the memory they point to is not modified in the loop. This patch bypasses that check for loads with invariant.group metadata. So the current patch is still wrong. We can only hoist if the corresponding store is outside the loop. With invariant.group metadata we're not worried about the loop changing the value, just that the value was stored to in the first place.

I'm confused. Was that an outdated comment you posted?
It seems like this current diff (which you updated before commenting!)
gets it right?

No, the current patch is still wrong. We can end up hoisting a load before the corresponding store. We currently don't check for that condition when the invariant.group metadata is present.

lebedev.ri requested changes to this revision.May 31 2021, 9:20 AM
This revision now requires changes to proceed.May 31 2021, 9:20 AM
Matt added a subscriber: Matt.May 31 2021, 9:31 AM

// Loads/stores with an invariant.group metadata are ok to hoist/sink.

I don't see how this follows from the metadata. Is there documentation I missed?
Even if, we might want to not conflate "invariant" and "speculatable". The latter
would be useful on it's own, see https://reviews.llvm.org/D103907#2811455 .

aeubanks abandoned this revision.Jun 10 2021, 3:51 PM

yeah, this is not feasible with the current implementation of fstrict-vtable-pointers, and not sure how beneficial this is anyway