This is an archive of the discontinued LLVM Phabricator instance.

Fix bug with duplicate struct types in Linker
Needs ReviewPublic

Authored by timarmstrong on Mar 31 2016, 11:18 PM.
This revision needs review, but all specified reviewers are disabled or inactive.

Details

Reviewers
espindola
Summary

This fixes a corner case where two structurally identical struct types
were erroneously merged into one when present in both the source and
destination modules. This could cause assertion failures and verification
failures when constructing IR modules through a combination of IRBuilder
and linking.

The fix is to track all struct types by pointer identity, which allows us to accurately check whether the struct type is present in the destination module. As a corollary, the code is slightly simpler.

Diff Detail

Event Timeline

timarmstrong retitled this revision from to Fix bug with duplicate struct types in Linker.
timarmstrong updated this object.
timarmstrong added a reviewer: rafael.
timarmstrong added a subscriber: llvm-commits.

Does anyone have time to look at this (or can suggest a reviewer).

It's a pretty straightforward fix and comes with a test.

We're running into this when jitting some code and would like to get the fix validated by someone more familiar with LLVM before we start shipping it.

rafael edited edge metadata.Apr 19 2016, 8:06 AM
rafael added a subscriber: rafael.

Taking a look. Sorry for the delay.

rafael added inline comments.Apr 19 2016, 8:32 AM
unittests/Linker/LinkModulesTest.cpp
346

Looks like this got out of date.There are no calls in the IR in this test.

Please run git-clang-format on the patch.

So, modifying the test a bit I found that if the dst is initially

%struct.T1 = type { i8 }
%struct.T2 = type { i8 }
declare void @callee1(%struct.T1* %x)
declare void @callee2(%struct.T2* %x)
define void @foo1(%struct.T1* %x) {
  call void @callee1(%struct.T1* %x)
  ret void
}
define void @foo2(%struct.T2* %x) {
  call void @callee2(%struct.T2* %x)
  ret void
}

and the source is

%struct.T1 = type { i8 }
%struct.T2 = type { i8 }
define void @callee1(%struct.T1* %x) {
  ret void
}
define void @callee2(%struct.T2* %x) {
  ret void
}

The linker produces

%struct.T1 = type { i8 }
%struct.T2 = type { i8 }
define void @foo1(%struct.T1* %x) {
  call void @callee1(%struct.T1* %x)
  ret void
}
define void @foo2(%struct.T2* %x) {
  call void @callee2(%struct.T2* %x)
  ret void
}
define void @callee1(%struct.T1* %x) {
  ret void
}
define void @callee2(%struct.T2* %x) {
  ret void
}

I.E.: it correctly avoids merging the types.

If I remove the call to callee2 and its declaration in the destination, we merge the types, but the module is still valid:

%struct.T1 = type { i8 }
%struct.T2 = type { i8 }
define void @foo1(%struct.T1* %x) {
  call void @callee1(%struct.T1* %x)
  ret void
}
define void @foo2(%struct.T2* %x) {
  ret void
}
define void @callee1(%struct.T1* %x) {
  ret void
}
define void @callee2(%struct.T1* %x) {
  ret void
}

or if I delete the call to callee1 instead:

%struct.T1 = type { i8 }
%struct.T2 = type { i8 }
define void @foo1(%struct.T1* %x) {
  ret void
}
define void @foo2(%struct.T2* %x) {
  call void @callee2(%struct.T2* %x)
  ret void
}
define void @callee1(%struct.T2* %x) {
  ret void
}
define void @callee2(%struct.T2* %x) {
  ret void
}

What I don't think is valid is the assumption that you can lookup types by name in the destination and assume their use in newly linked code.

unittests/Linker/LinkModulesTest.cpp
346

Sorry, I now noticed that you create the calls after linking.

Can the calls be there before?

Thanks for taking a look. I agree this a corner case that doesn't matter for ahead-of-time compilation but it's very useful for JIT compilation and it did historically work on older LLVM versions.

lib/Linker/IRMover.cpp
1480

Another argument for making this change is that 'hasType()' would lie about whether the type was present. I.e. I added an assert like the one below and it would fire if there were two structurally identical types in the module.

addNonOpaque(Ty);
assert(hasType(Ty));
unittests/Linker/LinkModulesTest.cpp
346

For context, we have a database that lets users compile user-defined functions to LLVM IR then use them in their queries. We want to be able to link in the user-provided module then generate calls to functions in that module. This worked fine on LLVM 3.3 and this seems like the natural way to do things in a JIT (link in the code, then generate more IR and add it to the module as needed).

It's not all that feasible to restructure the code so that all calls are generated before linking happens. We could do something like generate declarations for all the functions in the module before linking to force it to preserve the types, but it seems a bit unnecessary. When I read through the linker code it wasn't totally clear why that would work but the pattern in the test shouldn't - I was concerned it was working by coincidence rather than by design.

If you're interested in the context, we link in the module from a bitcode file here:

https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/exprs/scalar-fn-call.cc#L142

Then emit calls to functions as needed.
https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/exprs/scalar-fn-call.cc#L466

Comment at: unittests/Linker/LinkModulesTest.cpp:346
@@ +345,3 @@
+
+ // Check that we can call both callee1 and callee2 still.

+ // Since T1 and T2 were defined in Src, it should be valid to call functions

rafael wrote:

rafael wrote:

Looks like this got out of date.There are no calls in the IR in this test.

Sorry, I now noticed that you create the calls after linking.

Can the calls be there before?

For context, we have a database that lets users compile user-defined functions to LLVM IR then use them in their queries. We want to be able to link in the user-provided module then generate calls to functions in that module. This worked fine on LLVM 3.3 and this seems like the natural way to do things in a JIT (link in the code, then generate more IR and add it to the module as needed).

The thing that is odd with the change is that type names should not be
significant. The existing code that uses names is a hack that will
hopefully go away when we get typeless pointers.

In some sense this patch would add more dependency on names. Note that
if the src module types are named T3/T4 they still get mapped to T1.
So what this patch does is prevent renaming of a type that has the
same name in src and dst.

It's not all that feasible to restructure the code so that all calls are generated before linking happens. We could do something like generate declarations for all the functions in the module before linking to force it to preserve the types, but it seems a bit unnecessary. When I read through the linker code it wasn't totally clear why that would work but the pattern in the test shouldn't - I was concerned it was working by coincidence rather than by design.

If you're interested in the context, we link in the module from a bitcode file here:

https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/exprs/scalar-fn-call.cc#L142

Then emit calls to functions as needed.
https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/exprs/scalar-fn-call.cc#L466

Where is the getTypeByName? That is the real issue. Since you are
mapping visible (non-local) functions the name of the functions is
significant. Can't you instead get the function in the linked module
and from there get the type? That way you don't depend no the name of
the type.

Cheers,
Rafael

Typeless pointers definitely solve the problem and I'm sure this code will all go away once that happens, but in the meantime a lot of LLVM does care a lot about the exact pointee type.

We could keep track of where each function came from, and if we generate a call to a linked function, grovel around in the linked function signatures, work out what type the linker decided to assign it and then generate a bunch of bitcasts (since the pointer might be the original type if it came from the source module), but I'd rather not resort to doing that. It seems easier to fix the problem at the source (and also solve it for anyone else who's building jits using LLVM in a similar way).

I have no idea what the intended behaviour of the Linker is since there's not much documentation about what the linker is/isn't allow to do, but this is something that has worked historically. It wouldn't expect it to work if the struct types had different names originally, but this is just reversing some renaming that LLVM did.

timarmstrong edited edge metadata.

Ran git-clang-format on it. Not sure who looks after the developer policy page, but it might save some review iterations if it was one of the steps on there. http://llvm.org/docs/DeveloperPolicy.html#making-and-submitting-a-patch

We could keep track of where each function came from, and if we generate a call to a linked function, grovel around in the linked function signatures, work out what type the linker decided to assign it and then generate a bunch of bitcasts (since the pointer might be the original type if it came from the source module), but I'd rather not resort to doing that. It seems easier to fix the problem at the source (and also solve it for anyone else who's building jits using LLVM in a similar way).

It is not that much. It is just that instead of doing
"Dst.getTypeByName(Name)" you would do "Dst.getNamedValue(Name)" and
use the type of that. Where is the getTypeByName in the codebase you
pointed to?

I have no idea what the intended behaviour of the Linker is since there's not much documentation about what the linker is/isn't allow to do, but this is something that has worked historically. It wouldn't expect it to work if the struct types had different names originally, but this is just reversing some renaming that LLVM did.

Merge as many types as possible. The use of names is just an heuristic
to try to merge more types in the presence of cycles and opaque types.
With typeless pointers we will be able to merge *all* isomorphic types
without ever looking at the name.

Cheers,
Rafael

Right, but as someone who's trying to build a non-trivial project on top of LLVM, how am I meant to know whether something I'm building on top of LLVM will work if the expected behaviour is completely undocumented? The only thing I've had to go off is whether it works (it did in the past) and whether the behaviour is sane (it is) and whether it looks like what the code is meant to do (it does).

This has worked fine in the past and as far as I can tell is what the code is *meant* to do. Before my fix the code was internally inconsistent: hasType() returns false even when the type *will* be in the composite module. I.e. the internal invariants of IRMover were being violated and that's a recipe for bugs.

I get that it's an open-source project and use-cases only get supported if someone contributes time, so I'm trying to help here. The change makes the code simpler, internally consistent and fixes a use case that previously worked.

Are you just saying that IRMover is only meant to work for the particular use-case of the LLVM linker and other people who are trying to use it as a utiltiy to move IR between modules just need to deal with it?

I get that it's an open-source project and use-cases only get supported if someone contributes time, so I'm trying to help here. The change makes the code simpler, internally consistent and fixes a use case that previously worked.

Are you just saying that IRMover is only meant to work for the particular use-case of the LLVM linker and other people who are trying to use it as a utiltiy to move IR between modules just need to deal with it?

Not necessarily, but the invariant you are trying to support is not
the direction we are trying to move. In fact, it is the opposite. You
are trying to say that if there is a type with a given name in the
destination it will be safe to assume it will be use when linking in
code.

The direction we are trying to move to is that types are non recursive
and always structurally merged.

Cheers,
Rafael

espindola edited reviewers, added: espindola; removed: rafael.Mar 15 2018, 10:52 AM
sarmadka added a subscriber: sarmadka.EditedJun 16 2020, 3:20 PM

The direction we are trying to move to is that types are non recursive
and always structurally merged.

Cheers,
Rafael

Why so? It's clearly not working correctly, so why not include the name of the type when determining whether the types are actually the same? I'm currently facing this issue and the conclusion I have now is that Linker is simply not usable for me at all and I have to find another way. Here is the error I'm getting when trying to execute (using LLJIT) a module produced by linking a set of modules:

Stored value type does not match pointer operand type!

Here is the original code before linking using Linker::linkModules:

%11 = call %Gtk_AppWindow* @gtk_application_window_new(%Gtk_App* %10)
store %Gtk_AppWindow* %11, %Gtk_AppWindow** @window

and here is how it is after the link:

%11 = call %Gtk_ButtonBox* @gtk_application_window_new(%2* %10)
store %Gtk_ButtonBox* %11, %4** @window

That is because the types have the same structure so the linker is incorrectly merging them despite having different names. The same exact set of modules passed directly to LLJIT executes correctly.

FWIW I ended up working around this for our specific use cases by automatically inserting a bunch of bitcasts - https://issues.apache.org/jira/browse/IMPALA-3877.

I spent a while trying to make some other stuff work and ended up coming to the conclusion that the Linker just isn't really designed or test for merging code into an arbitrary. It seemed to assume, at least when I was looking, that you started off with a pristine module then linked everything into that. That was a bit of a non-starter for our JIT use case, since we didn't want to add an extra linking step for our manually-generated IR.

Yeah, it looks like the Linker is not well maintained so I'll just have to refactor my code so that I generate a single module in the first place rather than generate multiple modules and link them together with the linker.

Is the linker even used for anything or is it just obsolete code? It's failing to do the basic operation of linking multiple simple and small modules so I can't imagine it being used anywhere in prod.

I think it's used as part of the Link Time Optimization support, which is pretty battle tested, but is also just one use case.