This is an archive of the discontinued LLVM Phabricator instance.

Revert the revert of vectorization commits
Needs Review · Public

Authored by george.karpenkov on Feb 6 2020, 5:21 PM.

Details

Summary

Revert of https://github.com/llvm/llvm-project/commit/bfa32573bf2d0ab587f9a5d933ea2144a382cf3c

The vectorization introduced by the commit causes a miscompile.

This is the input LLVM IR: https://gist.github.com/cheshire/9dff84d8fbb83278736278854c746d8d
This is IR with the mentioned commit reverted: https://gist.github.com/cheshire/fb06443834735c2fa04fa0eacb288606 (produces correct results)
This is IR with ToT LLVM: https://gist.github.com/cheshire/d0b4195b1be344f52d77b98c70b5241d

If just a diff in IR is not enough, the IR dump is actually runnable.

To run it, compile the standalone (no dependencies, single file) driver tool at https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/tools/driver.cc and link it with the good/bad IR versions above. Then run it with a single argument: the name of a file containing the buffer assignment description (contents at https://gist.github.com/cheshire/f929aa118978f4bcad03a3da11209d27).
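
For reference, the steps might look roughly like this (file names are placeholders for the gists above, and driver.cc may need extra flags depending on your toolchain; clang accepts .ll files directly as inputs):

  clang++ -O2 driver.cc good.ll -o driver    # or bad.ll, the miscompiled version
  ./driver buffer_assignment.txt             # prints the "Output: (...)" block shown below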

With the vectorization reverted, the driver produces this output:

Output:
(
1.15675, 1.04532, 1.1549, 1.04836, 0.947364, 1.04667, 1.04491, 0.944249, 1.04323, 0.94699, 0.855764, 0.945469, 1.1536, 1.04247, 1.15175, 1.0455, 0.944784, 1.04382
,
1.15675, 1.04532, 1.1549, 1.04836, 0.947364, 1.04667, 1.04491, 0.944249, 1.04323, 0.94699, 0.855764, 0.945469, 1.1536, 1.04247, 1.15175, 1.0455, 0.944784, 1.04382
)

Without the revert, the output is different:

Output:
(
1.15675, 1.04532, 1.1549, 1.04836, 0.947364, 1.04667, 1.04491, 0.944249, 1.04323, 0.94699, 0.855764, 0.945469, 1.1536, 1.04247, 1.15175, 1.0455, 0.944784, 1.04382
,
1.15675, 1.10195, 1.1549, 1.04836, 0.947364, 1.04667, 1.04491, 0.899154, 1.04323, 0.94699, 0.855764, 0.945469, 1.1536, 1.09595, 1.15175, 1.0455, 0.944784, 1.04382
)

(note, for example, the 2nd, 8th, and 14th numbers in the second row)

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · Feb 6 2020, 5:21 PM
ABataev requested changes to this revision. · Feb 7 2020, 7:53 AM

I tried to compare the LLVM IR and did not find incorrect optimizations. Honestly, it is very hard to compare such big files; it would be good if you could try to generate a smaller reproducer. Also, reverting is not productive; it would be better to try to find a proper fix.

This revision now requires changes to proceed. · Feb 7 2020, 7:53 AM

@ABataev Please check out the runnable reproducer above. It clearly demonstrates a miscompile: if the pass is correct, the same IR input should give identical output (barring undefined behavior, which we check for with sanitizers).

Also, reverting is not productive; it would be better to try to find a proper fix.

Miscompiles should be reverted based on the incorrect behavior alone.

@ABataev Please check out the runnable reproducer above. It clearly demonstrates a miscompile: if the pass is correct, the same IR input should give identical output (barring undefined behavior, which we check for with sanitizers).

The runnable reproducer requires some prerequisites. It would be good if you could provide a simpler one.

Also, reverting is not productive; it would be better to try to find a proper fix.

Miscompiles should be reverted based on the incorrect behavior alone.

I'm not sure that this patch is the real cause of the problem you have. Maybe it triggers some other issues in the backend, but it is very hard to say at the moment whether that is the case.

The runnable reproducer requires some prerequisites. It would be good if you could provide a simpler one.

I understand, but we have actually already spent a very large amount of time reducing this.

To make sure we are on the same page: what do you mean by prerequisites? To run the reproducer, you only need to wget a single C++ file with no dependencies and link it against the IR dump.

The runnable reproducer requires some prerequisites. It would be good if you could provide a simpler one.

I understand, but we have actually already spent a very large amount of time reducing this.

To make sure we are on the same page: what do you mean by prerequisites? To run the reproducer, you only need to wget a single C++ file with no dependencies and link it against the IR dump.

Still, in its current form it won't help. As I already said, I don't see a real problem in the LLVM IR files. Maybe this optimization just triggers some other dangerous transformations. We need to investigate this thoroughly to understand what the real cause of the problem is.

george.karpenkov requested review of this revision. · Feb 7 2020, 3:08 PM

I did manage to reduce the test case further.

For input IR: https://gist.github.com/cheshire/17067d5ba4781817861c8b21d15c928d

Bad optimized version: https://gist.github.com/cheshire/bf1047b4385bcf82c22a70f5cf1fb5df

Good optimized version: https://gist.github.com/cheshire/8bea1f36ab849f8945bc190b519272a6

The compilation comes from an XLA test case that looks like this:

HloModule EntryModule

ENTRY EntryModule {
  %input0 = f64[] parameter(0)
  %sign_227 = f64[] sign(f64[] %input0)
  %multiply_235 = f64[] multiply(f64[] %sign_227, f64[] %sign_227)

  %p4 = f64[2,2] broadcast(%input0), dimensions={}
  %dot_81 = f64[2,2] dot(f64[2,2] %p4, f64[2,2] %p4), lhs_contracting_dims={1}, rhs_contracting_dims={1}
  %br2 = f64[2,2] broadcast(f64[] %multiply_235), dimensions={}
  %reshape_294 = f64[2,2] multiply(f64[2,2] %dot_81, f64[2,2] br2)

  %broadcast_298 = f64[2,3,2,3] broadcast(f64[2,2] %reshape_294), dimensions={0,2}

  %arg7_8 = f64[3,3] parameter(1)
  %broadcast_300 = f64[2,3,2,3] broadcast(f64[3,3] %arg7_8), dimensions={1,3}
  %multiply_301 = f64[2,3,2,3] multiply(f64[2,3,2,3] %broadcast_298, f64[2,3,2,3] %broadcast_300)

  %reshape_302 = f64[6,6] reshape(f64[2,3,2,3] %multiply_301)
  %zero = f64[] constant(0)
  %zeros = f64[6,6] broadcast(f64[] %zero), dimensions={}

  %diag = pred[6,6] constant({{1,0,0,0,0,0}, {0,1,0,0,0,0}, {0,0,1,0,0,0}, {0,0,0,1,0,0}, {0,0,0,0,1,0}, {0,0,0,0,0,1}})
  ROOT %select_316 = f64[6,6] select(pred[6,6] %diag, f64[6,6] %reshape_302, f64[6,6] %zeros)
}

Essentially, it performs some element-wise multiplications on random input floats, and then replaces all non-diagonal entries with zeros.
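
For readers less comfortable with HLO, here is a rough NumPy sketch of the same computation (my own rendering for illustration, not XLA-generated code; input0 is a scalar, arg7_8 is a 3x3 array):

  import numpy as np

  def entry_module(input0, arg7_8):
      m = np.sign(input0) * np.sign(input0)               # sign_227 squared
      p4 = np.full((2, 2), input0)                        # broadcast of input0
      dot = p4 @ p4.T                                     # dot_81: contract dim 1 of both operands
      r294 = dot * m                                      # reshape_294
      b298 = np.broadcast_to(r294[:, None, :, None], (2, 3, 2, 3))     # dimensions={0,2}
      b300 = np.broadcast_to(arg7_8[None, :, None, :], (2, 3, 2, 3))   # dimensions={1,3}
      r302 = (b298 * b300).reshape(6, 6)                  # reshape_302
      return np.where(np.eye(6, dtype=bool), r302, 0.0)   # select: keep only the diagonal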

Difference in output:

Expected literal:
f64[6,6] {
  { 0.000275566374291495, 0, 0, 0, 0, 0 },
  { 0, 0.00040918878254846619, 0, 0, 0, 0 },
  { 0, 0, -0.00058600272184509581, 0, 0, 0 },
  { 0, 0, 0, 0.000275566374291495, 0, 0 },
  { 0, 0, 0, 0, 0.00040918878254846619, 0 },
  { 0, 0, 0, 0, 0, -0.00058600272184509581 }
}

Actual literal:
f64[6,6] {
  { 0.000275566374291495, 0, 0, 0, 0, 0 },
  { 0, 1, 0, 0, 0, 0 },
  { 0, 0, -0.00058600272184509581, 0, 0, 0 },
  { 0, 0, 0, 0.000275566374291495, 0, 0 },
  { 0, 0, 0, 0, 0.00040918878254846619, 0 },
  { 0, 0, 0, 0, 0, -0.00058600272184509581 }
}
1/1 runs miscompared.

This is not something that could be caused by fast math: in the bad version, a number produced by element-wise multiplication of random input floats is exactly "1".

I find this instruction suspicious: https://gist.github.com/cheshire/bf1047b4385bcf82c22a70f5cf1fb5df#file-bad_input_ir_opt-L29

Why is the result of the comparison cast to double and then inserted into the vector?

Thanks, will check this.

Thanks, will check this.

Should we revert the commit while you investigate?

Thanks, will check this.

Should we revert the commit while you investigate?

The problem here is that this patch does not introduce new vectorization; instead, it just triggers the existing vectorization in more cases. If there is a bug in the vectorizer, this patch merely reveals it, it does not introduce it.

Thanks, will check this.

Should we revert the commit while you investigate?

The problem here is that this patch does not introduce new vectorization; instead, it just triggers the existing vectorization in more cases. If there is a bug in the vectorizer, this patch merely reveals it, it does not introduce it.

Even if we believe that this commit only reveals a bug (and is not itself buggy), I think the correct action is to revert. I have done this in the past with my own changes.

Reverting is not an indictment of the quality of the change being reverted; it just keeps trunk green and helps us all sleep better.

The problem here is that this patch does not introduce new vectorization; instead, it just triggers the existing vectorization in more cases. If there is a bug in the vectorizer, this patch merely reveals it, it does not introduce it.

As Sanjoy mentioned above, we would still have to revert in this case, as it is the only sustainable option to deal with miscompiles.

The problem here is that this patch does not introduce new vectorization; instead, it just triggers the existing vectorization in more cases. If there is a bug in the vectorizer, this patch merely reveals it, it does not introduce it.

As Sanjoy mentioned above, we would still have to revert in this case, as it is the only sustainable option to deal with miscompiles.

Even if it is not the real cause of the problem? We need to investigate first to be sure that the vectorizer is the real cause.

Even if it is not the real cause of the problem?

Yes, of course. Many revisions are rolled back like this. It would be completely unsustainable to try to find the "real" culprit for every breaking change, and it is often not even clear where the "real" blame lies. Taking this to an extreme: a commit to the compiler that breaks all executables by triggering a CPU bug should still be rolled back, even if it perfectly fits the compiler contract.

Even if it is not the real cause of the problem?

Yes, of course. Many revisions are rolled back like this. It would be completely unsustainable to try to find the "real" culprit for every breaking change, and it is often not even clear where the "real" blame lies. Taking this to an extreme: a commit to the compiler that breaks all executables by triggering a CPU bug should still be rolled back, even if it perfectly fits the compiler contract.

I can investigate it thoroughly on Monday. Give me until Monday to investigate, and after that I will revert it myself if no proper fix is found.

Hmm, I tried this reproducer with trunk and it generates the same LLVM IR as your correct variant. Could you try to check it using trunk? Maybe the bug has already been fixed. The conversion of the fcmp result to double also looks suspicious to me, but that looks like an instruction combiner bug, not a vectorizer one; the vectorizer does not do this kind of transformation on the IR. Meanwhile, I'll try to investigate your original reproducer; maybe it is also fixed already.

@ABataev Thanks for looking into this! You are right: when I try the opt tool, I also get the same result. Yet somehow, inside TensorFlow, rolling back this revision does fix the miscompile.
Do you think you could give any advice on how to narrow it down? Maybe places to print the IR before and after the vectorization passes?

@ABataev Thanks for looking into this! You are right: when I try the opt tool, I also get the same result. Yet somehow, inside TensorFlow, rolling back this revision does fix the miscompile.
Do you think you could give any advice on how to narrow it down? Maybe places to print the IR before and after the vectorization passes?

There is an option in LLVM that lets you limit the number of passes that run, to find the pass that causes the miscompilation. I can't recall the option right now; I will tell you tomorrow.

@ABataev Thanks for looking into this! You are right: when I try the opt tool, I also get the same result. Yet somehow, inside TensorFlow, rolling back this revision does fix the miscompile.
Do you think you could give any advice on how to narrow it down? Maybe places to print the IR before and after the vectorization passes?

There is an option in LLVM that lets you limit the number of passes that run, to find the pass that causes the miscompilation. I can't recall the option right now; I will tell you tomorrow.

Yep, found it. Here it is: https://llvm.org/docs/OptBisect.html. You can try to bisect the optimization passes and find the wrong one.
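
For example, assuming the reduced input IR is saved as input.ll (the exact pipeline flags depend on how XLA invokes the optimizer, so treat this as a sketch):

  opt -O2 -opt-bisect-limit=100 -S input.ll -o out.ll

Passes numbered above the limit are skipped, so bisecting on the limit value and re-running the reproducer pinpoints the first pass whose inclusion produces the miscompile.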

Has a bug tracking this issue already been filed? (Mark it as blocking release-10.0.0.)

@lebedev.ri Thanks! No, I haven't filed the issue yet. The current blocker is that I see the regression when compiling with the JIT, but not with opt. I think I need to figure out the exact set of flags to give opt to reproduce the behavior.

llvm/test/Transforms/SLPVectorizer/X86/different-vec-widths.ll