This is an archive of the discontinued LLVM Phabricator instance.

[TTI] New PPC target hook enableUncondDivisionSpeculation
Abandoned · Public

Authored by alexgatea on Oct 7 2022, 8:39 AM.

Details

Summary

Created a PPC-specific TTI hook for enabling unconditional speculative execution of division operations in llvm::isSafeToSpeculativelyExecute(). Integer division by 0 does not cause an exception on PPC, so it should be safe to speculate divide operations there. This allows the compiler to optimize more aggressively, which benefits other passes.
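The diff itself is not preserved in this archive, so the following is only a rough, self-contained C model of the pattern being proposed (everything except the hook name enableUncondDivisionSpeculation is hypothetical): a per-target flag, in the spirit of TTI, that a generic "is it safe to speculate?" query consults before allowing division to be hoisted.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-target info, standing in for TTI. A PPC-like target would
   answer true because its integer divide does not trap on a zero divisor. */
struct TargetInfo {
  bool enableUncondDivisionSpeculation;
};

/* Generic query, modeled loosely after isSafeToSpeculativelyExecute():
   division is only considered speculatable when the target opts in. */
static bool isSafeToSpeculateDiv(const struct TargetInfo *TI) {
  return TI->enableUncondDivisionSpeculation;
}

int main(void) {
  struct TargetInfo PPC = { .enableUncondDivisionSpeculation = true };
  struct TargetInfo Other = { .enableUncondDivisionSpeculation = false };
  printf("speculate div: PPC=%d other=%d\n",
         isSafeToSpeculateDiv(&PPC), isSafeToSpeculateDiv(&Other));
  return 0;
}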

For example, consider the following code optimized with clang -O3

void foo (int m);
int fcn (int x, int y){
  int n = 0;
  for(int i = 0; i < 1000; i++){
    foo(0);
    n += x/y;
  }
  return n;
}

With speculative execution off, the final assembly effectively does the following:

int n = 0;
int tmp = x/y;
for(int i = 0; i < 1000; i++){
    foo(0);
    n += tmp;
}
return n;

However, with speculative execution on, LICM hoists the division early in the pipeline, which allows IndVarSimplifyPass to turn n += tmp in the loop into a single multiplication at the end:

for(int i = 0; i < 1000; i++)
    foo(0);
return (x/y) * 1000;

Diff Detail

Event Timeline

alexgatea created this revision. Oct 7 2022, 8:39 AM
Herald added a project: Restricted Project. Oct 7 2022, 8:39 AM
alexgatea requested review of this revision. Oct 7 2022, 8:39 AM
fhahn added a subscriber: fhahn.

IIUC this proposal would effectively re-define udiv and urem's semantics on the IR level to not have undefined behavior for PPC?

If that's the case, this sounds problematic, as the semantics of LLVM IR instructions are target-independent. Adding some other reviewers who might have additional thoughts.

alexgatea added a comment (edited). Oct 7 2022, 9:20 AM

IIUC this proposal would effectively re-define udiv and urem's semantics on the IR level to not have undefined behavior for PPC?

I don't think that's quite correct. We still view them as undefined; it's just that we allow further optimizations to happen that we previously bailed out of. The example I gave shows exactly this: without the speculative execution the div is still hoisted to the preheader, but this is done much later in the pipeline by MachineLICM, so we do not optimize it fully (because IndVarSimplifyPass occurs earlier).

Adding some other reviewers who might have additional thoughts.

Of course, I'd like to get other reviewers' thoughts as well.

fhahn added a comment. Oct 7 2022, 9:25 AM

IIUC this proposal would effectively re-define udiv and urem's semantics on the IR level to not have undefined behavior for PPC?

I don't think that's quite correct. We still view them as undefined; it's just that we allow further optimizations to happen that we previously bailed out of. The example I gave shows exactly this: without the speculative execution the div is still hoisted to the preheader, but this is done much later in the pipeline by MachineLICM, so we do not optimize it fully (because IndVarSimplifyPass occurs earlier).

Right, but the reason MachineLICM can do this today is that at this point we are dealing with machine instructions.

In terms of LLVM IR semantics, this change would allow hoisting an instruction that may produce UB into a path that didn't have UB before AFAICT.

IIUC this proposal would effectively re-define udiv and urem's semantics on the IR level to not have undefined behavior for PPC?

I don't think that's quite correct. We still view them as undefined,

Division by zero is currently undefined behavior at the IR level; if your program would execute it, it has no meaning at all. So hoisting a divide will interact badly with other optimizations; for example, instcombine will currently turn a divide by zero into "unreachable". This is different from instructions that return poison.

If you want a version of division that returns a poison value, you need to modify the semantics in LangRef.


In the original example, instead of trying to make the divide hoistable, you could teach LLVM to peel the first iteration of the loop, then CSE the divide.

IIUC this proposal would effectively re-define udiv and urem's semantics on the IR level to not have undefined behavior for PPC?

I don't think that's quite correct. We still view them as undefined; it's just that we allow further optimizations to happen that we previously bailed out of. The example I gave shows exactly this: without the speculative execution the div is still hoisted to the preheader, but this is done much later in the pipeline by MachineLICM, so we do not optimize it fully (because IndVarSimplifyPass occurs earlier).

Right, but the reason MachineLICM can do this today is that at this point we are dealing with machine instructions.

In terms of LLVM IR semantics, this change would allow hoisting an instruction that may produce UB into a path that didn't have UB before AFAICT.

I see what you mean. It's definitely a valid point; let's see what others think as well.

nikic requested changes to this revision. Oct 7 2022, 2:31 PM

Marking this as changes requested per above comments. I agree that this is not legal under current LLVM IR semantics.

This revision now requires changes to proceed. Oct 7 2022, 2:31 PM

IIUC this proposal would effectively re-define udiv and urem's semantics on the IR level to not have undefined behavior for PPC?

I don't think that's quite correct. We still view them as undefined,

Division by zero is currently undefined behavior at the IR level; if your program would execute it, it has no meaning at all. So hoisting a divide will interact badly with other optimizations; for example, instcombine will currently turn a divide by zero into "unreachable". This is different from instructions that return poison.

If you want a version of division that returns a poison value, you need to modify the semantics in LangRef.

As I understand it integer divide by zero is considered undefined behaviour in C while the same may not be true in other languages. Furthermore presence of side effects may be dependent on other configurations including target hardware (eg default treatment on Power vs x86). LLVM seems to distinguish the concept of "undefined behaviour" from "undefined value", treating the former as more consequential than the latter. It currently treats div-by-zero as undefined behaviour, but that may be an overly pessimistic treatment, as demonstrated in this review. Baking assumptions about the source language or target hardware into the LLVM IR gets us into the situation where we have to sacrifice performance for some combinations to ensure functional correctness for others. To allow more flexibility we could either leave the IR neutral and let the optimizer decide based on config info (eg. TTI) or separate the undefined behaviour/value semantics from the udiv/urem instructions (eg. using an instruction flag). I think this revision takes the first approach and what you are suggesting is the second. I agree the second approach is cleaner and might be necessary given the historic assumptions made in this regard, although it would be a larger effort.

In the original example, instead of trying to make the divide hoistable, you could teach LLVM to peel the first iteration of the loop, then CSE the divide.

I don't think that would work in general, for example if the loop had unknown bounds, because peeling in such cases would still require the peeled iteration to be conditionally executed.

As I understand it integer divide by zero is considered undefined behaviour in C while the same may not be true in other languages. Furthermore presence of side effects may be dependent on other configurations including target hardware (eg default treatment on Power vs x86).

The LLVM IR rule is more driven by the behavior of the instructions on various targets. If a target only has a trapping divide, we'd need to wrap it in control flow to implement a non-trapping divide. And particularly for signed divide, that check isn't cheap. We tend to prefer poison where it makes sense (for example, out-of-bounds shifts).
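For a sense of what that wrapping costs, here is an illustrative C sketch (not taken from LLVM) of a non-trapping signed divide on a target whose hardware divide traps: both the zero divisor and the INT_MIN / -1 overflow case need a branch.

#include <limits.h>

/* Illustrative only: the guard a trapping-divide target would need to
   implement a non-trapping sdiv. The chosen fallback results are arbitrary. */
static int nontrapping_sdiv(int x, int y) {
  if (y == 0)
    return 0;                    /* divide by zero: pick some defined value */
  if (x == INT_MIN && y == -1)
    return INT_MIN;              /* overflow case that also traps on e.g. x86 */
  return x / y;                  /* safe: the hardware divide cannot fault here */
}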

Frontends can always use control flow to get whatever user-visible behavior they want. (For example, the Rust divide operator panics if you divide by zero.)

To allow more flexibility we could either leave the IR neutral and let the optimizer decide based on config info (eg. TTI)

We try to avoid making core IR semantics depend on TTI. Not that we can completely ignore target differences when writing IR optimizations, but we want to keep IR understandable without reference to target-specific semantics.

I mean, it would be self-consistent to write in LangRef something like "whether division by zero is undefined behavior, or defined to produce a poison value, depends on the current target/current subtarget/bitwidth of the operation/current moon cycle". But I don't want to go there. If the rules are the same across all targets, it's easier to understand, and easier to implement tools like Alive2 to validate transforms.

In the original example, instead of trying to make the divide hoistable, you could teach LLVM to peel the first iteration of the loop, then CSE the divide.

I don't think that would work in general, for example if the loop had unknown bounds, because peeling in such cases would still require the peeled iteration to be conditionally executed.

Any loop can be peeled (as long as the body doesn't contain some exotic construct that inhibits cloning); it's basically just cloning the loop body. And if the divide dominates the latch before peeling, it will dominate the peeled loop after peeling.

The "general" problem is really the case where peeling is too expensive.

nlopes added a comment. Oct 8 2022, 1:37 AM

I would also prefer not to go down this route.
Alternatives include using predicated builtins, sticking to pre-headers, or introducing a new intrinsic that yields poison on division by zero and/or INT_MIN/-1.

As I understand it integer divide by zero is considered undefined behaviour in C while the same may not be true in other languages. Furthermore presence of side effects may be dependent on other configurations including target hardware (eg default treatment on Power vs x86).

The LLVM IR rule is more driven by the behavior of the instructions on various targets. If a target only has a trapping divide, we'd need to wrap it in control flow to implement a non-trapping divide. And particularly for signed divide, that check isn't cheap. We tend to prefer poison where it makes sense (for example, out-of-bounds shifts).

Frontends can always use control flow to get whatever user-visible behavior they want. (For example, the Rust divide operator panics if you divide by zero.)

To allow more flexibility we could either leave the IR neutral and let the optimizer decide based on config info (eg. TTI)

We try to avoid making core IR semantics depend on TTI. Not that we can completely ignore target differences when writing IR optimizations, but we want to keep IR understandable without reference to target-specific semantics.

I mean, it would be self-consistent to write in LangRef something like "whether division by zero is undefined behavior, or defined to produce a poison value, depends on the current target/current subtarget/bitwidth of the operation/current moon cycle". But I don't want to go there. If the rules are the same across all targets, it's easier to understand, and easier to implement tools like Alive2 to validate transforms.

I'm sorry, but I don't quite follow the logic that tries to justify the current design. On the one hand the IR rules are driven by target requirements, yet at the same time you avoid making IR semantics dependent on TTI (which is the proper way to query target requirements). That seems like a recipe for the exact problem we are dealing with here, i.e. there are target-specific assumptions baked into the IR that are not configurable.

In the original example, instead of trying to make the divide hoistable, you could teach LLVM to peel the first iteration of the loop, then CSE the divide.

I don't think that would work in general, for example if the loop had unknown bounds, because peeling in such cases would still require the peeled iteration to be conditionally executed.

Any loop can be peeled (as long as the body doesn't contain some exotic construct that inhibits cloning); it's basically just cloning the loop body. And if the divide dominates the latch before peeling, it will dominate the peeled loop after peeling.

The "general" problem is really the case where peeling is too expensive.

consider this loop:

for (i = 0; i < n; i++) {
  v += x/y;
}

after peeling:

if (n > 0) {
  v += x/y;
}
for (i = 1; i < n; i++) {
  v += x/y;
}

The peeled divide that is guarded by the if (n > 0) conditional does not dominate the divide that's in the loop body. Even if we try to consider control flow equivalence between that guard and the loop guard, there could still be cases where the dominance cannot be safely determined (eg. non-affine loops).

fhahn added a comment. Oct 11 2022, 8:21 AM

I mean, it would be self-consistent to write in LangRef something like "whether division by zero is undefined behavior, or defined to produce a poison value, depends on the current target/current subtarget/bitwidth of the operation/current moon cycle". But I don't want to go there. If the rules are the same across all targets, it's easier to understand, and easier to implement tools like Alive2 to validate transforms.

I'm sorry, but I don't quite follow the logic that tries to justify the current design. On the one hand the IR rules are driven by target requirements, yet at the same time you avoid making IR semantics dependent on TTI (which is the proper way to query target requirements). That seems like a recipe for the exact problem we are dealing with here, i.e. there are target-specific assumptions baked into the IR that are not configurable.

A practical consequence of changing the semantics to be target-dependent is that you would also have to audit each piece of code that reasons about udiv to make sure it properly considers both semantics.

TTI is at the moment only used for queries that inform cost decisions AFAICT. Changing it to impact semantics of LLVM IR would be a big change IMO.

I don't think that would work in general, for example if the loop had unknown bounds, because peeling in such cases would still require the peeled iteration to be conditionally executed.

Any loop can be peeled (as long as the body doesn't contain some exotic construct that inhibits cloning); it's basically just cloning the loop body. And if the divide dominates the latch before peeling, it will dominate the peeled loop after peeling.

The "general" problem is really the case where peeling is too expensive.

consider this loop:

for (i = 0; i < n; i++) {
  v += x/y;
}

after peeling:

if (n > 0) {
  v += x/y;
}
for (i = 1; i < n; i++) {
  v += x/y;
}

The peeled divide that is guarded by the if (n > 0) conditional does not dominate the divide that's in the loop body. Even if we try to consider control flow equivalence between that guard and the loop guard, there could still be cases where the dominance cannot be safely determined (eg. non-affine loops).

I'm not sure I follow here. The loop peeling implementation in LLVM should make sure that the remainder loop is dominated by the peeled iterations. For the motivating example, peeling by 1 iteration seems to have the desired effect: https://clang.godbolt.org/z/rhvsdvEjG
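For reference, the shape produced by peeling one iteration is roughly the following (C-level sketch, not copied from the Godbolt output): the remainder loop sits under the same guard as the peeled iteration, so the peeled divide dominates the one inside the loop and the two can be CSE'd.

if (n > 0) {
  v += x / y;                    /* peeled iteration 0 */
  for (int i = 1; i < n; i++)    /* remainder loop: only reached after the */
    v += x / y;                  /* peeled divide, so that divide dominates it */
}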

Peeling is used in a similar fashion to turn loads in a loop into dereferenceable loads; see D108114.

As I understand it integer divide by zero is considered undefined behaviour in C while the same may not be true in other languages. Furthermore presence of side effects may be dependent on other configurations including target hardware (eg default treatment on Power vs x86).

The LLVM IR rule is more driven by the behavior of the instructions on various targets. If a target only has a trapping divide, we'd need to wrap it in control flow to implement a non-trapping divide. And particularly for signed divide, that check isn't cheap. We tend to prefer poison where it makes sense (for example, out-of-bounds shifts).

Frontends can always use control flow to get whatever user-visible behavior they want. (For example, the Rust divide operator panics if you divide by zero.)

To allow more flexibility we could either leave the IR neutral and let the optimizer decide based on config info (eg. TTI)

We try to avoid making core IR semantics depend on TTI. Not that we can completely ignore target differences when writing IR optimizations, but we want to keep IR understandable without reference to target-specific semantics.

I mean, it would be self-consistent to write in LangRef something like "whether division by zero is undefined behavior, or defined to produce a poison value, depends on the current target/current subtarget/bitwidth of the operation/current moon cycle". But I don't want to go there. If the rules are the same across all targets, it's easier to understand, and easier to implement tools like Alive2 to validate transforms.

I'm sorry, but I don't quite follow the logic that tries to justify the current design. On the one hand the IR rules are driven by target requirements, yet at the same time you avoid making IR semantics dependent on TTI (which is the proper way to query target requirements). That seems like a recipe for the exact problem we are dealing with here, i.e. there are target-specific assumptions baked into the IR that are not configurable.

The design is influenced by all the ISAs we want to support, which means it won't be optimal for some (or any), but that's life. That doesn't imply the semantics of the IR must be parameterizable on TTI information.

Changing something fundamental like the semantics of the division instruction has far-reaching effects. Even if we wanted to do it, your patch is very incomplete. We would need to audit every single optimization/analysis that touches divisions and make sure they would still be correct with your proposed semantics. Removing UB from division (as you propose) makes some optimizations wrong, as they can no longer assume that the RHS is non-zero, for example.
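To make that concrete, here is an illustrative C-level sketch (not from the review) of the kind of deduction that relies on division by zero being UB: once x / y has executed, the optimizer may assume y is non-zero and fold away later checks.

static int f(int x, int y) {
  int q = x / y;   /* under current IR semantics, reaching this implies y != 0 */
  if (y == 0)      /* ...so this branch is provably dead and can be deleted,  */
    return -1;     /* a transform that breaks if div-by-zero stops being UB   */
  return q;
}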

In summary, changing semantics of div is not something we are willing to do. The risk is too high and the benefits seem low. The best way forward for you is to investigate existing intrinsics or create a new one.

arsenm added a subscriber: arsenm. Oct 11 2022, 8:34 AM

Could you instead insert a clamp of the divisor and then pattern match that out during selection?

Could you instead insert a clamp of the divisor and then pattern match that out during selection?

Hmm not sure what you mean by selection. Could you please elaborate (perhaps with an example)?

alexgatea abandoned this revision. Oct 20 2022, 10:13 AM

Thank you for all the comments and suggestions! Closing this PR since the consensus is that using a TTI hook to create target-specific IR semantics is undesirable.

Could you instead insert a clamp of the divisor and then pattern match that out during selection?

Hmm not sure what you mean by selection. Could you please elaborate (perhaps with an example)?

If you speculate sdiv x, y, replace it with sdiv x, (y == 0 ? 1 : y), or whatever behavior you get for this case. Then, in your backend, pattern match the divide-by-0 check when selecting to the instruction.
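For illustration only (not part of the patch), the suggested clamp looks roughly like this at the source level; the select keeps the division free of UB so it can be speculated, and a backend on which divide-by-zero is harmless could pattern-match the clamp back out during instruction selection.

static int speculatedDiv(int x, int y) {
  int safeY = (y == 0) ? 1 : y;  /* clamp the divisor: the division can never fault */
  return x / safeY;              /* hoistable without introducing UB */
}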

Could you instead insert a clamp of the divisor and then pattern match that out during selection?

Hmm not sure what you mean by selection. Could you please elaborate (perhaps with an example)?

If you speculate sdiv x, y, replace it with sdiv x, (y == 0 ? 1 : y), or whatever behavior you get for this case. Then, in your backend, pattern match the divide-by-0 check when selecting to the instruction.

I see what you mean. But what if the optimizer in the meantime changes/removes the select instruction? E.g. if y = foo() where foo() always returns 0, then the optimizer will at some point replace (y == 0 ? 1 : y) with 1. Also, how can we know in the backend that the original instruction was sdiv x, y and not actually sdiv x, (y == 0 ? 1 : y)?