This is an archive of the discontinued LLVM Phabricator instance.

[WIP] Support MustControl conditional control attribute
AbandonedPublic

Authored by melver on Jun 9 2021, 5:39 AM.

Details

Summary

[ WIP, only high-level comments for now ]

Introduce a new attribute, 'MustControl'/'mustcontrol', which denotes that a
conditional control statement must result in true control-flow and not
be optimized away. The attribute otherwise has no semantic relevance.

However, the existence of a true branch is of relevance when branch
execution has side-effects on machine state that the programmer is
interested in, for example in OS kernels.

The Linux kernel, for one, relies on the existence of true conditional
branches for the enforcement of memory orders, per Linux-kernel memory
consistency model (LKMM) [1]. With the 'mustcontrol' attribute, Clang
would provide a primitive required for the Linux kernel to ensure a true
branch is emitted without resorting to inline assembly (which often
results in poor codegen). The primitive is simple and low-level enough,
that the compiler can remain blissfully unaware of the LKMM and leave
the semantics of Linux's memory model to the kernel community.

[1] https://lkml.kernel.org/r/YLn8dzbNwvqrqqp5@hirez.programming.kicks-ass.net

Diff Detail

Event Timeline

melver created this revision.Jun 9 2021, 5:39 AM
melver requested review of this revision.Jun 9 2021, 5:39 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJun 9 2021, 5:39 AM

This is missing langref changes, and a RFC to llvm-dev.
I'm rather skeptical of this.

melver added a comment.Jun 9 2021, 6:48 AM

This is missing langref changes, and a RFC to llvm-dev.

We're not there yet, but will send something. Having real code helped me understand what out the myriad of options that were discussed were actually reasonable to implement. (Perhaps I should have uploaded WIP code elsewhere, sorry about that.)

I'm rather skeptical of this.

We're trying to solve a serious problem, and the Linux kernel is an important usecase. We'd be very glad to hear constructive criticism, the LKML thread is still ongoing: https://lore.kernel.org/linux-arch/YLn8dzbNwvqrqqp5@hirez.programming.kicks-ass.net/

Thank you.

melver planned changes to this revision.Jun 9 2021, 6:51 AM

WIP here is fine, would help to include a test that shows/explains what problem is actually solved though. I don't understand form this patch alone.

nickdesaulniers edited reviewers, added: efriedma; removed: eli.friedman.

The first talk from https://www.youtube.com/watch?v=FFjV9f_Ub9o (https://github.com/ClangBuiltLinux/plumbers-2020-slides/blob/master/LPC_2020_--_Dependency_ordering.pdf) might be helpful to link to at some point from the commit message, for a little additional context.

The first talk from https://www.youtube.com/watch?v=FFjV9f_Ub9o (https://github.com/ClangBuiltLinux/plumbers-2020-slides/blob/master/LPC_2020_--_Dependency_ordering.pdf) might be helpful to link to at some point from the commit message, for a little additional context.

I read the slides and I'm not sure how this connects. I'll wait for the LangRef and/or IR example :)

I don't like using metadata like this. Dropping metadata should generally preserve the semantics of the code.

without resorting to inline assembly (which often results in poor codegen).

Could you give an example of the resulting assembly with mustcontrol vs. this patch?

without resorting to inline assembly (which often results in poor codegen).

Could you give an example of the resulting assembly with mustcontrol vs. this patch?

Err, I mean, the resulting assembly using the inline asm version, vs. an equivalent construct using mustcontrol.

melver added a comment.Jun 9 2021, 1:38 PM

I don't like using metadata like this. Dropping metadata should generally preserve the semantics of the code.

Anything better for this without introducing new instructions? Would an argument be reasonable?

without resorting to inline assembly (which often results in poor codegen).

Could you give an example of the resulting assembly with mustcontrol vs. this patch?

For one of the pathological cases:

int x, y, z;                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                               
int main(int argc, char *argv[]) {                                                                                                                                                                                                                           
    z = 42;                                                                                                                                                                                                                                                    
                                                                                                                                                                                                                                                               
  volatile_if (READ_ONCE(x)) {                                                                                                                                                                                                                               
      WRITE_ONCE(y, z);                                                                                                                                                                                                                                        
    } else {                                                                                                                                                                                                                                                   
      WRITE_ONCE(y, z);                                                                                                                                                                                                                                        
    }                                                                                                                                                                                                                                                          
                                                                                                                                                                                                                                                               
    return 0;                                                                                                                                                                                                                                                  
}

Doing nothing:

define dso_local i32 @main(i32 %argc, i8** nocapture readnone %argv) local_unnamed_addr #0 {
entry:
  store i32 42, i32* @z, align 4, !tbaa !3
  %0 = load volatile i32, i32* @x, align 4, !tbaa !3
  store volatile i32 42, i32* @y, align 4, !tbaa !3
  ret i32 0
}

No branch here.

Their latest proposal using compiler barriers (and not asmgoto):

define dso_local i32 @main(i32 %0, i8** nocapture readnone %1) local_unnamed_addr #0 {
  store i32 42, i32* @z, align 4, !tbaa !2
  %3 = load volatile i32, i32* @x, align 4, !tbaa !2
  %4 = icmp eq i32 %3, 0
  br i1 %4, label %7, label %5

5:                                                ; preds = %2
  tail call void asm sideeffect "", "i,~{memory},~{dirflag},~{fpsr},~{flags}"(i32 0) #1, !srcloc !6                            
  %6 = load i32, i32* @z, align 4, !tbaa !2
  br label %7

7:                                                ; preds = %2, %5
  %8 = phi i32 [ %6, %5 ], [ 42, %2 ]
  store volatile i32 %8, i32* @y, align 4, !tbaa !2
  ret i32 0
}

You can see the unnecessary load to z.

With mustcontrol:

define dso_local i32 @main(i32 %argc, i8** nocapture readnone %argv) local_unnamed_addr #0 {
entry:
  store i32 42, i32* @z, align 4, !tbaa !3
  %0 = load volatile i32, i32* @x, align 4, !tbaa !3
  %tobool.not = icmp eq i32 %0, 0
  br i1 %tobool.not, label %if.end, label %if.then

if.then:                                          ; preds = %entry
  tail call void (...) @llvm.sideeffect(i64 42)
  br label %if.end

if.end:                                           ; preds = %entry, %if.then
  store volatile i32 42, i32* @y, align 4, !tbaa !3
  ret i32 0
}

Of course, the more common case is just

if (READ_ONCE(..)) { WRITE_ONCE(...); }

but as soon as inline asm is involved, the full compiler barrier would also cause any data after the branch to be reloaded.

The bigger worry is that the full compiler barrier does not guarantee emission of a branch, and the asmgoto variant is pretty fragile. arm64 maintainers worry about LTO, and better compiler optimizations. It really begs for compiler support for the architectures where it is relevant. The main one being arm64, which ld->st can be ordered by a control dependency.

While the issue at hand is related to memory models, I've tried to steer clear of the C/C++ memory models because the Linux kernel defines its own memory model. Therefore, defining the new primitive at a lower-level that relates to generated code (closer to 'volatile' or e.g. 'musttail' which inspired the name) is a goal here. This will satisfy the Linux kernel's requirements and can use 'mustcontrol' as a building block for the Linux-kernel memory model (LKMM) [ http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0124r6.html ].

I'm intending to define it as follows:

| Marking a conditional control statement as ``mustcontrol`` indicates that the
| compiler must generate a conditional branch in machine code, and such
| conditional branch is placed *before* conditionally executed instructions. The
| attribute may be ignored if the condition of the control statement is a
| constant expression.
|
| This typically affects optimizations that would cause removal of a conditional
| branch. However, it also ensures that a conditional branch and subsequent
| instructions are not replaced with non-branching conditional instructions.
|
| Requesting the generation of a branch may be required in execution environments
| where execution of a specific conditional branch inhibits speculation or has
| observable side-effects of interest otherwise.

Please bear with me, I'm updating examples and documentation. I didn't think anybody would look at this while [WIP]. :-)
Thanks.

Please bear with me, I'm updating examples and documentation. I didn't think anybody would look at this while [WIP]. :-)

People try to help so you have early design feedback ;)

melver added a comment.Jun 9 2021, 1:48 PM

Please bear with me, I'm updating examples and documentation. I didn't think anybody would look at this while [WIP]. :-)

People try to help so you have early design feedback ;)

Thank you for that. The LKML discussion got a little heated, so I worry slightly that underpresenting the issue will cause a bad first impression.

But while we're here:

There is the consideration to make this a __builtin and not an attribute.

AFAIK a __builtin suffers from the usual problem that the information cannot be propagated between TUs:

file1.c:
  bool foo(void) { return __builtin_mustcontrol(READ_ONCE(...)); }

file2.c:
  void bar(void) { if (foo()) { WRITE_ONCE(...); } }

Or is a language builtin that gives you an error when used in the wrong context acceptable? It seems a little odd because I'm unaware of other builtins that do that.

GCC devs expressed that GNU attribute syntax may be abused: https://lkml.kernel.org/r/20210609171419.GI18427@gate.crashing.org

The attribute is simpler, and hypothetically, if it were to become part of the language std we'd have to use [[...]] syntax anyway, so the GNU attribute problem seems somewhat artificial to me.

[[mustcontrol]] if (READ_ONCE(...)) { ... }
[[mustcontrol]] while (...) { }

Preferences?

This starts to sound an awful lot like convergent to me, basically: Do not change the control conditions of this call.
Still unsure, maybe you can add a LangRef draft so we know what you try to do, or a nice example what you don't want to happen.

I don't like using metadata like this. Dropping metadata should generally preserve the semantics of the code.

Anything better for this without introducing new instructions? Would an argument be reasonable?

If we really want to make it part of the branch, maybe add an intrinsic that can be used with callbr. Not something we've done before, but the infrastructure should be mostly there.

That said, I'm not sure this is the best approach. Alternative proposal:

We could add a regular intrinsic. Just ignore the control flow at the IR level, and come up with a straight-line blob that just does the right thing. I think we'd want to actually perform the load as part of the intrinsic, to avoid worrying about the consume dependency. So we'd have an intrinsic "__builtin_load_with_control_dependency()". It would lower to something along the lines of asm("ldr %0, [%1]; cbnz %0, .+4":"=r"(dest):"r"(x):"memory"); on AArch64. The differences between the intrinsic and just using the asm I wrote:

  1. We weaken the "memory" clobber to something that more accurately matches what we need.
  2. We add a compiler transform to check if the branch is redundant, late in the optimization pipeline, and remove it if it is.

I think this produces the code you want, and it should be easier to understand and maintain.

melver added a comment.EditedJun 9 2021, 4:45 PM

I don't like using metadata like this. Dropping metadata should generally preserve the semantics of the code.

Anything better for this without introducing new instructions? Would an argument be reasonable?

If we really want to make it part of the branch, maybe add an intrinsic that can be used with callbr. Not something we've done before, but the infrastructure should be mostly there.

That said, I'm not sure this is the best approach. Alternative proposal:

We could add a regular intrinsic. Just ignore the control flow at the IR level, and come up with a straight-line blob that just does the right thing. I think we'd want to actually perform the load as part of the intrinsic, to avoid worrying about the consume dependency. So we'd have an intrinsic "__builtin_load_with_control_dependency()". It would lower to something along the lines of asm("ldr %0, [%1]; cbnz %0, .+4":"=r"(dest):"r"(x):"memory"); on AArch64. The differences between the intrinsic and just using the asm I wrote:

  1. We weaken the "memory" clobber to something that more accurately matches what we need.
  2. We add a compiler transform to check if the branch is redundant, late in the optimization pipeline, and remove it if it is.

I think this produces the code you want, and it should be easier to understand and maintain.

Interesting, but probably doesn't quite work if I understood right -- however, perhaps it could solve something related (not part of this work, see below [footnote]).

Not every READ_ONCE() the kernel has needs to be a load_with_control_dependency(), which if I read it right, would happen if we just do, e.g.:

#define READ_ONCE __builtin_load_with_control_dependency
int x;
int foo(void) { return READ_ONCE(x); }

And they really want to avoid introducing another set of primitives, like READ_ONCE_CTRL(), because if we did that, I think it'd be reasonable to upgrade all READ_ONCE_CTRL() to acquires and we're done (suggested by Will Deacon in [1]). Yet upgrading all READ_ONCE() to acquire is not acceptable in general (do note, it's not just AArch64, also POWER and Armv7). For now, it'd be good to avoid this -- in particular, existing code like the following would become less clear or less optimal:

x = READ_ONCE(..);
y  = READ_ONCE(..);
... lots of other code ...
if (y) { ... do other stuff ... } // <--- no control dependency here
if (x && y) { WRITE_ONCE(..) } // <-- only want control  dependency here

Which is why the kernel folks probably wouldn't be too happy with more primitives as it likely penalizes more than with just marking the branch itself. Per [1] new load-primitives are probably a last resort assuming the compiler can deliver a nice mechanism to ensure control-dependencies remain (this work here).

Thanks.


[1] https://lore.kernel.org/linux-arch/20210607160252.GA7580@willie-the-truck/

[footnote] Re the "memory" clobber, Linus asks for more fine-grained asm clobber: https://lore.kernel.org/linux-arch/CAHk-=wjwXs5+SOZGTaZ0bP9nsoA+PymAcGE4CBDVX3edGUcVRg@mail.gmail.com/
If you see a way to support this, I think it'd help in other places (e.g. kernel's one-directional memory barriers).

Tangentially, per this presentation:
https://github.com/ClangBuiltLinux/plumbers-2020-slides/blob/master/LPC_2020_--_Dependency_ordering.pdf
there is another problem, which are address dependencies aka memory_order_consume. In reality the kernel wants every READ_ONCE() be something very close to memory_order_consume, with the compiler figuring out the optimal thing to do. Unfortunately, this is not the reality today. Paul McKenney et al. has been exploring this in: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0750r1.html -- but since addr-deps are much less likely to be optimized away, I think the kernel will do nothing about it in the near term. ctrl-deps on the other hand are more of a worry to the Linux kernel community right now.

I can't summarize this well enough here, so I would strongly recommend: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0124r6.html#Control%20Dependencies

You could break __builtin_load_with_control_dependency(x) into something like __builtin_control_dependency(READ_ONCE(x)). I don't think any transforms will touch that in practice, even if it isn't theoretically sound. The rest of my suggestion still applies to that form, I think. They key point is that the compiler just needs to ensure some branch consumes the loaded value; it doesn't matter which branch it is.

The theoretical problem with separating the load from the branch is that it imposes an implicit contract: the branch has to use the value of the load as input, not an equivalent value produced some other way. This is the general problem with C++11 consume ordering, which nobody has tried to tackle.

re: the memory clobber, LLVM understands acquire/release semantics for atomics; for example, you can write __atomic_signal_fence(3) in clang to get a "release" barrier. (I think that's what the email you linked is asking for?) Adding asm clobbers that are equivalent to the existing fences is probably feasible.

You could break __builtin_load_with_control_dependency(x) into something like __builtin_control_dependency(READ_ONCE(x)). I don't think any transforms will touch that in practice, even if it isn't theoretically sound. The rest of my suggestion still applies to that form, I think. They key point is that the compiler just needs to ensure some branch consumes the loaded value; it doesn't matter which branch it is.

Because the original inline asm version was pretty similar (they just called it volatile_cond()), I think __builtin_control_dependency() is equivalent. Actually a later suggestion just called it __builtin_ctrl_depends(): https://lkml.kernel.org/r/YL9TEqealhxBBhoS@hirez.programming.kicks-ass.net

It was nacked by GCC devs, who expressed concern that it seems impossible to guarantee a branch is emitted but also way too difficult to specify. The emitting-branch part seems straightforward, as you suggested.

I think implementation-wise, we can probably use your idea either way. I just worry about defined semantics, see below.

The theoretical problem with separating the load from the branch is that it imposes an implicit contract: the branch has to use the value of the load as input, not an equivalent value produced some other way. This is the general problem with C++11 consume ordering, which nobody has tried to tackle.

Indeed. Which is why I wanted to steer clear of a __builtin that talks about control-dependencies directly. There are 2 challenges:

  1. Defining the value used to establish a control dependency, e.g. the load later writes depend on (kernel only defines writes to be ctrl-dependently ordered).
  2. Defining later ops that are control-dependent. With an expression like the __builtin, this could be any operation after, or it becomes too hard to define:
x = __builtin_control_dependency(expr); // __builtin_control_dependency establishes an ordering edge between loads in expr and later ops
y = 1; // control dependently ordered, although there is no explicit control statement yet...
if (x) { z = 1; } // ... this code is only interested in z=1 to be control-dependently ordered.

Both are hard to define, as you suggested it's similar to consume which was also my worry.

Therefore, to get something simple that works and isn't doomed to a definition that is unimplementable, I tried to just talk about the control statement and the fact a branch must be emitted. In theory, (1) might still be a problem, but in practice the compiler has no other way than to use the loaded value if that value was loaded through an __atomic or volatile or similar.

Limiting ourselves to an attribute on control statements solves (2), because we can say that "the conditional branch is placed *before* conditionally executed instructions". Either that, or we make the __builtin give an error if used outside the condition of a control statement.


Given we've already gotten this far, I will summarize the options:

A. __builtin_load_with_control_dependency() -- this appears to solve the problem (1) above, but not (2). This approach seems unappealing if we want to solve this for the Linux kernel, because the whole point of compiler support was to avoid more read-primitives (with new primitives they say they'd just upgrade these to acquire and be done with it).

B. __builtin_control_dependency() -- would be nice if this would work, but I think it suffers from problems (1), and (2) above and is very hard to define properly.
B1. Do this but constrain it to only be usable as conditions in control statements, which would solve (2) at least.

C. [[mustcontrol]]:

Marking a conditional control statement as ``mustcontrol`` indicates that the
compiler must generate a conditional branch in machine code, and such
conditional branch is placed *before* conditionally executed instructions. The
attribute may be ignored if the condition of the control statement is a
constant expression.

D. But we can also just rename it to [[control_dependency]] if that's clearer. It looks like it's the same as B1 minus the artificial constraint; implementation should be similar. It'd allow for the same arch-dependent omission if an arch does not care about control dependencies. But I feel that, at least for the Linux kernel, they prefer having as much control over codegen as possible, regardless of arch. Because if there's an arch-agnostic way of doing this, we get it for free for POWER and Armv7.

All we'd like is a robust primitive, without overcomplicating things.

What do you recommend?

Thanks.

melver updated this revision to Diff 351229.Jun 10 2021, 11:59 AM

As promised, some cleanups, docs, and updated test for the current version (no
other major changes yet).

While the identical-writes test is quite contrived, the currently failing
switch test is a more realistic example. The example uses AArch64, where the
optimizer does not emit a branch but instead uses cinc, which would break the
requirement of emitting a real branch.

Defining the value used to establish a control dependency, e.g. the load later writes depend on (kernel only defines writes to be ctrl-dependently ordered).

[[mustcontrol]] also has this problem.

At the LLVM IR level, if just want to split the load from the control dependency intrinsic, we could define a special kind of load that produces a LLVM IR "token". The control dependency intrinsic then takes the token as an operand, and optimizations understand that they aren't allowed to touch the token.

The problem at that point is, how does clang emit LLVM IR? It would have to do some sort of dataflow analysis to connect the load to the control dependency. And we'd need to define rules for what sort of data/control flow are allowed. That's not impossible, but it's complicated.

  1. Defining later ops that are control-dependent. With an expression like the __builtin, this could be any operation after, or it becomes too hard to define:

I don't think this is a problem we need to solve. If the user sticks the builtin in some weird location that doesn't have a branch immediately following it, that's fine. Any branch that depends on a value can enforce a control dependency, so in general, we just insert a no-op branch at the point of the call to the builtin. Like I mentioned before, we can think of removing that no-op branch as an optimization.


Whatever we end up doing, I really don't want to mark up LLVM IR branches. I don't want to add more constraints to CFG transforms at the LLVM IR level. The rules are already hard to understand; I don't want to add more weird edge cases. And I don't think it's necessary here.

Defining the value used to establish a control dependency, e.g. the load later writes depend on (kernel only defines writes to be ctrl-dependently ordered).

[[mustcontrol]] also has this problem.

At the LLVM IR level, if just want to split the load from the control dependency intrinsic, we could define a special kind of load that produces a LLVM IR "token". The control dependency intrinsic then takes the token as an operand, and optimizations understand that they aren't allowed to touch the token.

The problem at that point is, how does clang emit LLVM IR? It would have to do some sort of dataflow analysis to connect the load to the control dependency. And we'd need to define rules for what sort of data/control flow are allowed. That's not impossible, but it's complicated.

  1. Defining later ops that are control-dependent. With an expression like the __builtin, this could be any operation after, or it becomes too hard to define:

I don't think this is a problem we need to solve. If the user sticks the builtin in some weird location that doesn't have a branch immediately following it, that's fine. Any branch that depends on a value can enforce a control dependency, so in general, we just insert a no-op branch at the point of the call to the builtin. Like I mentioned before, we can think of removing that no-op branch as an optimization.


Whatever we end up doing, I really don't want to mark up LLVM IR branches. I don't want to add more constraints to CFG transforms at the LLVM IR level. The rules are already hard to understand; I don't want to add more weird edge cases. And I don't think it's necessary here.

Thanks, all good points. The main thing was that we though it'd be much harder to get a __builtin_control_dependency() right (GCC devs didn't like it). If you think that __builtin_control_dependency() is the cleaner design, then let's try that! From a user's point-of-view, it certainly is more flexible if we can get it!

I'll abandon this change.

Thanks.

melver abandoned this revision.Jun 11 2021, 10:35 AM

FWIW, LWN recently published summary of some of the recent discussions on LKML: https://lwn.net/SubscriberLink/860037/aca06acfafce7937/.