This is an archive of the discontinued LLVM Phabricator instance.

[Strict FP] Allow more relaxed scheduling
ClosedPublic

Authored by uweigand on Jul 9 2019, 6:15 AM.

Details

Summary

Support for strict floating-point instructions at the DAG/MI level, as recently introduced in https://reviews.llvm.org/D55506, constrains instruction scheduling for such instruction to enforce their original source order. While this mirrors the current requirements on strict FP intrinsics at the LLVM IR level, I believe this is really more strict than would be required to implement the semantics of strict FP.

Specifically, I believe it should be allowed to move one strict FP instructions across another, as long as it is not moved across any global barrier. If both instructions were to raise a trapping FP exception, this means that you may now see another of those exceptions first, but that should still be OK.

This patch provides an alternative implementation in ScheduleDAGInstrs::buildSchedGraph that implements this relaxed constraint. This means that instruction scheduling for strict FP instructions is now nearly as flexible as for standard FP instructions, removing a bit of the extra performance overhead.

Diff Detail

Repository
rL LLVM

Event Timeline

uweigand created this revision.Jul 9 2019, 6:15 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 9 2019, 6:15 AM
jsji added a subscriber: wuzish.Jul 9 2019, 6:40 AM
kpn added a comment.Jul 9 2019, 8:14 AM

Stupid question: what's a "global barrier"?

This isn't going to be moving instructions outside of tests, right? For example:

int foo(double d) {

return (isnan(d) ? 0 : (int)d);

}

A year ago this caused traps because of speculative execution causing both legs of the ternary operator to be executed. I feel a little silly asking, but ... I'm asking anyway.

Also, how does this patch interact with volatile accesses? We use volatile to finesse compilers into doing what we need.

This scheduler generally only operates on single basic blocks, so it would not move anything outside of a test.

It also will not move anything across a volatile memory access, since that it one of the global barriers. (Those are: calls, instructions with "unmodeled side effects", and volatile/atomic memory accesses.)

Ping? It would be good to get this in LLVM 9 ...

I'm reviewing the trap-safety issues now and have some open questions (language lawyers needed??). Let's say we have something like this:

z = strict_fmul x, y
c = strict_fmul a, b
store z
store c

And we schedule it as:

z = strict_fmul x, y
store z
c = strict_fmul a, b
store c

Now let's say the 2nd fmul traps on overflow and we have a signal handler set up to gracefully recover. The differences in scheduling could mean memory differences when the signal handler executes. That might not be okay for the very strictest conformance mode (I don't know).

Are we treating this case as undefined behavior, like the C/C++ Standards dictate?

Bah, my last comment was flawed! I read the test cases incorrectly and missed the 'fpexcept.ignore' on some of them.

But I think the question is still partially valid. What defines sequencing traps and stores? Are we (LLVM) defining something more strict than IEEE-754 and the C/C++ Standards?

My understanding is that IEEE-754 requires us to produce an exception if and only if the exception would be produced by a literal interpretation of the source. However, it does not require that the exceptions be raised in the same order as implied by the source. Also, that's what the LLVM language reference says we'll do with "fpexcept.strict" -- "The number and order of floating-point exceptions is NOT guaranteed." So, I think the changes you've got here are correct.

I think I'll have to challenge you a little here. ;)

My understanding is that IEEE-754 requires us to produce an exception if and only if the exception would be produced by a literal interpretation of the source.

The literal interpretation language refers to value-changing optimizations. I don't think it specifies memory ordering though. I could be wrong...

However, it does not require that the exceptions be raised in the same order as implied by the source. Also, that's what the LLVM language reference says we'll do with "fpexcept.strict" -- "The number and order of floating-point exceptions is NOT guaranteed." So, I think the changes you've got here are correct.

That's actually slightly different than what I'm asking -- that's about ordering two trapping operations (I'm not sure where that's specified as ok either), not ordering one trapping operation and a store.

In other words, what happens if we move stores around operations that can trap? I could easily write a small program to give different results based on whether this scheduling change is active or not. Is there somewhere that says different results are ok with same source?

I'm assuming we're treating it as undefined behavior, like the C/C++ Standards state, so that all this doesn't matter. Just want to confirm that we're not mistakenly throwing away strictness.

Just to clarify one thing: even the current implementation, before this patch, does not guarantee the relative order of FP instructions and memory instructions is unchanged. So even the current implementation may perform the reschedule your comment mentions. This patch would add the additional option of also changing the relative order of the two strict_fmul operations.

I do not think there is much point in attempting to guarantee the relative order of FP vs. memory instructions, since those memory instructions are themselves not guaranteed (the C/C++ standard allows memory accesses to be rather freely rescheduled, or even fully omitted).

If relative order of FP vs. memory instructions is an issue to your application, you'll have to use volatile (or atomic) memory accesses; in that case, both the current implementation and my patch will respect the ordering.

cameron.mcinally accepted this revision.Jul 16 2019, 7:31 AM

This patch would add the additional option of also changing the relative order of the two strict_fmul operations.

I now see that the IEEE-754 Standard allows for expression transformations that change the order of setting flags, so that should be fine for statements too.

The following value-changing transformations, among others, preserve the literal meaning of the source code:
<...snip...>
― Changing the order in which different flags are raised.

I do not think there is much point in attempting to guarantee the relative order of FP vs. memory instructions, since those memory instructions are themselves not guaranteed (the C/C++ standard allows memory accesses to be rather freely rescheduled, or even fully omitted).

If relative order of FP vs. memory instructions is an issue to your application, you'll have to use volatile (or atomic) memory accesses; in that case, both the current implementation and my patch will respect the ordering.

That's fair. I don't see anything explicitly disallowing it, so can't argue.

This revision is now accepted and ready to land.Jul 16 2019, 7:31 AM
This revision was automatically updated to reflect the committed changes.