
[InstCombine] Negator - sink sinkable negations
ClosedPublic

Authored by lebedev.ri on Oct 3 2019, 10:27 AM.

Details

Summary

As we have discussed previously (e.g. in D63992 / D64090 / PR42457), the sub instruction
can almost be considered non-canonical. While we do convert sub %x, C -> add %x, -C,
we only sparsely do that for non-constants. But we should.

Here, I propose to interpret sub %x, %y as add (sub 0, %y), %x iff the negation can be sunk into %y.
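For intuition, the proposed interpretation is a plain two's-complement identity. A minimal non-LLVM sketch (using C++ unsigned wrapping arithmetic as a stand-in for the IR semantics):

```cpp
#include <cassert>
#include <cstdint>

// `sub %x, %y` and `add (sub 0, %y), %x` agree for every bit pattern,
// since unsigned (two's-complement) arithmetic wraps modulo 2^32.
uint32_t sub_direct(uint32_t x, uint32_t y) { return x - y; }

uint32_t add_of_negation(uint32_t x, uint32_t y) {
  uint32_t neg_y = 0u - y;  // the `sub 0, %y` that Negator tries to sink
  return neg_y + x;
}
```

The rewrite is only profitable when that `sub 0, %y` subsequently disappears, i.e. when the negation can be folded into `%y`'s defining instructions.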

This has some potential to cause endless combine loops (either around PHIs, or if there are some opposite transforms).
For the former, there's the -instcombine-negator-max-depth option to mitigate it, should this expose any such issues.
For the latter, if there are still any such opposing folds, we'd need to remove the colliding fold.
In any case, reproducers are welcome!
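The depth cutoff can be pictured as a bounded recursive walk; the sketch below is purely illustrative (the `Expr` type and `canSinkNegation` name are hypothetical, not the actual Negator API):

```cpp
#include <cassert>

// Hypothetical miniature of the idea behind -instcombine-negator-max-depth:
// recurse into an expression tree asking "can negation be sunk here?",
// and give up once the depth budget is exhausted, which bounds both the
// work done and the opportunity for pathological traversals.
struct Expr {
  enum Kind { Const, Var, Add, Sub } kind;
  const Expr *lhs = nullptr;
  const Expr *rhs = nullptr;
};

bool canSinkNegation(const Expr *e, unsigned depth, unsigned maxDepth) {
  if (depth > maxDepth)
    return false;  // budget exhausted: conservatively bail
  switch (e->kind) {
  case Expr::Const:
    return true;   // -C folds to another constant
  case Expr::Sub:
    return true;   // -(a - b) is just (b - a), no new instruction
  case Expr::Add:  // -(a + b) == (-a) + (-b): both operands must negate
    return canSinkNegation(e->lhs, depth + 1, maxDepth) &&
           canSinkNegation(e->rhs, depth + 1, maxDepth);
  case Expr::Var:
    return false;  // would require an explicit negation instruction
  }
  return false;
}
```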

Diff Detail

Event Timeline

lebedev.ri marked an inline comment as done.
lebedev.ri edited the summary of this revision. (Show Details)

Thank you for taking a look!
Also handle trunc, fix comments.

If that doesn't happen, then they are 'trivially' dead and deleted.

Deleted where, exactly? If you're expecting the instcombine main loop to delete them, you'll force instcombine into an infinite loop, I think.

Hmm, they are not added to instcombine's worklist unless we succeed in negating
the entire tree, and this passes the test-suite with no infinite looping.

You are saying that we should instead DCE instructions in *our* worklist if we fail, correct?

Erase newly-created/inserted instruction if negation failed.

@efriedma - do you want to continue reviewing?

I've just pointed out a few nits for now.

llvm/lib/Transforms/InstCombine/InstCombineInternal.h
1020

typo: attempt

llvm/lib/Transforms/InstCombine/InstCombineNegator.cpp
169

it's -> its

183

it's -> its

193

negatible one -> negatible if one

193

it's -> its

244

typo: temporarily

llvm/test/Transforms/InstCombine/sub-of-negatible.ll
159–161

I didn't follow the diffs here - was one of these tests redundant? The code comment didn't match before, but it still doesn't?

272–273

Add these tests with baseline results as pre-commit?

352

it's -> its

lebedev.ri updated this revision to Diff 232090. Dec 4 2019, 5:00 AM
lebedev.ri marked 10 inline comments as done.

@efriedma - do you want to continue reviewing?

I've just pointed out a few nits for now.

Nits addressed.

There are some more patterns that aren't handled here.
The idea is to ideally move them all here.

llvm/test/Transforms/InstCombine/sub-of-negatible.ll
159–161

The test was too complicated; it was checking more than the minimal pattern - subtraction can be freely negated by swapping its operands.
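That "freely negated" claim is the wrap-around identity -(a - b) == b - a; a quick non-LLVM check in C++:

```cpp
#include <cassert>
#include <cstdint>

// A subtraction is "freely negatible": swapping the operands negates it
// without adding any instruction (holds for all inputs under wrapping).
uint32_t negated_sub(uint32_t a, uint32_t b) { return 0u - (a - b); }
uint32_t swapped_sub(uint32_t a, uint32_t b) { return b - a; }
```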

spatel added a comment. Dec 4 2019, 9:11 AM

Do you have stats on how often this fires? Impact on compile-time?

Remove more specific pattern-matching from InstCombiner::visitSub() simultaneously with adding this more general functionality, so we don't have redundancy (and limit compile-time impact)?

lebedev.ri updated this revision to Diff 232222. Dec 4 2019, 3:09 PM

NOT READY FOR REVIEW

Remove more specific pattern-matching from InstCombiner::visitSub() simultaneously with adding this more general functionality, so we don't have redundancy (and limit compile-time impact)?

I was hoping to do that in steps, but I guess that's one way to boost those stats :))
I think I have moved everything relevant from InstCombiner::visitSub() now.
Observations:

  • There's an artificial one-use restriction that needs to go away (when Depth=0 and we are looking at sub 0, %x)
  • We lose/do not propagate no-wrap flags
  • The InstCombiner::visitAdd() change is needed because of how good we get at sinking negations - else there are two opposite folds. It should be done beforehand, separately.
  • Same with the InstCombiner::foldAddWithConstant() change; it's not entirely related to this diff.
  • There are some other regressions.

Rebased, slightly better, but not there yet still.

lebedev.ri updated this revision to Diff 232532. Dec 6 2019, 4:25 AM

Still not for review

Some more unreachable code removed from InstCombiner::visitSub()

Maybe you can change the title to [WIP] or [NOT READY FOR REVIEW]?

xbolva00 added inline comments. Dec 6 2019, 4:41 AM
llvm/test/Transforms/InstCombine/sub.ll
1354–1355

Remove FIXME?

Rebased.

Finally understood the interfering transforms - this fundamentally interferes with D48754.
Yes, I see the PhaseOrdering regressions if we revert that.
Not yet sure how to deal with this; I see 4 options:

  1. don't do such canonicalization (what I've done here)
  2. don't implement Negator
  3. add an Abs IR instruction (I think not)
  4. don't try to sink negation when it's used by an abs/nabs pattern

This doesn't hang check-llvm/test-suite.

Handle PHIs, too.
Strangely, I distinctly recall seeing some preexisting test
causing an endless cycle, but I'm no longer observing that problem.

xbolva00 added inline comments. Apr 11 2020, 4:11 PM
llvm/test/Transforms/PhaseOrdering/min-max-abs-cse.ll
39 ↗(On Diff #256801)

Regression

lebedev.ri marked 2 inline comments as done.
lebedev.ri edited the summary of this revision. (Show Details)

And solve the potential endless combine cycle by not trying to sink negations
if the instruction has a user that looks like an abs/nabs pattern.

I think this is it for now, so review is welcome.

llvm/test/Transforms/PhaseOrdering/min-max-abs-cse.ll
39 ↗(On Diff #256801)
  1. add Abs IR instruction (i think not)

Is the answer the same for an intrinsic? The more we see these kinds of problems, the more I wish we had created intrinsics for abs/smin/smax/umin/umax long ago. We keep trying to work around this limitation in IR, but it doesn't seem worth it. We have abs() functions in source and an abs node in codegen, so we're creating intermediate ops that don't exist on either side of IR.

lebedev.ri added a comment (edited). Apr 12 2020, 8:36 AM
  1. add Abs IR instruction (i think not)

Is the answer the same for an intrinsic?

Ack, I meant instruction/intrinsic interchangeably here.

The more we see these kinds of problems, the more I wish we had created intrinsics for abs/smin/smax/umin/umax long ago. We keep trying to work around this limitation in IR, but it doesn't seem worth it. We have abs() functions in source and an abs node in codegen, so we're creating intermediate ops that don't exist on either side of IR.

Note that we similarly don't have saturating multiplication or overflowing/saturating left-shift.

  1. add Abs IR instruction (i think not)

Is the answer the same for an intrinsic?

Ack, I meant instruction/intrinsic interchangeably here.

We have grown more accepting of intrinsics (overflow, saturating, funnel, etc.) recently, so I'm not sure if the old arguments against an abs intrinsic still hold. Is there anything in particular about abs() that makes it different?

The more we see these kinds of problems, the more I wish we had created intrinsics for abs/smin/smax/umin/umax long ago. We keep trying to work around this limitation in IR, but it doesn't seem worth it. We have abs() functions in source and an abs node in codegen, so we're creating intermediate ops that don't exist on either side of IR.

Note that we similarly don't have saturating multiplication or overflowing/saturating left-shift.

Yes, there's no clear line that I know of. It's like IR canonicalization - we make the rules up as we go along and adapt the surrounding logic to work with the current format. I think we've overcome the limitation for this patch, so we don't need to gate this patch on a decision, but I think we still have the problem. The clearest sign is that -- within instcombine -- we have bailouts for otherwise accepted canonicalizations only if we "matchSelectPattern()".

nikic added a comment. Apr 13 2020, 1:50 PM

Yes, there's no clear line that I know of. It's like IR canonicalization - we make the rules up as we go along and adapt the surrounding logic to work with the current format. I think we've overcome the limitation for this patch, so we don't need to gate this patch on a decision, but I think we still have the problem. The clearest sign is that -- within instcombine -- we have bailouts for otherwise accepted canonicalizations only if we "matchSelectPattern()".

FWIW I agree that we need to reevaluate this decision and at least introduce min/max intrinsics. These are special-cased in too many places for the current treatment as a simple compare and select to still make sense. Especially when we take into account that there seems to be a new min/max related infinite loop every other month, and that the icmp/select representation has ill-defined behavior when it comes to undef. I expect the effort for introducing these intrinsics will be pretty similar to saturating add/sub, and would be happy to chip in...

+1 for new abs/min/max intrinsics

It'd be great if we could separate the new-intrinsics discussion from this patch.

  1. add Abs IR instruction (i think not)

Is the answer the same for an intrinsic?

Ack, I meant instruction/intrinsic interchangeably here.

We have grown more accepting of intrinsics (overflow, saturating, funnel, etc.) recently, so I'm not sure if the old arguments against an abs intrinsic still hold. Is there anything in particular about abs() that makes it different?

No, I just don't want to condition this patch on the introduction of a whole new set of intrinsics :]

Yes, there's no clear line that I know of. It's like IR canonicalization - we make the rules up as we go along and adapt the surrounding logic to work with the current format. I think we've overcome the limitation for this patch, so we don't need to gate this patch on a decision, but I think we still have the problem. The clearest sign is that -- within instcombine -- we have bailouts for otherwise accepted canonicalizations only if we "matchSelectPattern()".

FWIW I agree that we need to reevaluate this decision and at least introduce min/max intrinsics. These are special-cased in too many places for the current treatment as a simple compare and select to still make sense. Especially when we take into account that there seems to be a new min/max related infinite loop every other month, and that the icmp/select representation has ill-defined behavior when it comes to undef. I expect the effort for introducing these intrinsics will be pretty similar to saturating add/sub, and would be happy to chip in...

Sounds like general agreement for the people on this review (and sorry for going off-topic for this specific patch). I'll post something to llvm-dev after trying to dig up the earlier discussions for those intrinsics.

And looks like we have this month's infinite loop here :) -
https://bugs.llvm.org/show_bug.cgi?id=45539

Yes, there's no clear line that I know of. It's like IR canonicalization - we make the rules up as we go along and adapt the surrounding logic to work with the current format. I think we've overcome the limitation for this patch, so we don't need to gate this patch on a decision, but I think we still have the problem. The clearest sign is that -- within instcombine -- we have bailouts for otherwise accepted canonicalizations only if we "matchSelectPattern()".

FWIW I agree that we need to reevaluate this decision and at least introduce min/max intrinsics. These are special-cased in too many places for the current treatment as a simple compare and select to still make sense. Especially when we take into account that there seems to be a new min/max related infinite loop every other month, and that the icmp/select representation has ill-defined behavior when it comes to undef. I expect the effort for introducing these intrinsics will be pretty similar to saturating add/sub, and would be happy to chip in...

Sounds like general agreement for the people on this review (and sorry for going off-topic for this specific patch). I'll post something to llvm-dev after trying to dig up the earlier discussions for those intrinsics.

Note that I've posted https://github.com/AliveToolkit/alive2/pull/353 with my vision of their modelling.
Notably, I don't think we should have nabs, and I believe abs should be similar to cttz/ctlz
in the sense that it should take a second param - an i1 NSW flag.
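A sketch of what such a flag-carrying abs could mean semantically (illustrative C++ only; `abs_with_flag` and the nullopt-as-poison modeling are my assumptions, mirroring how cttz/ctlz take an is_zero_poison i1):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Model: abs(x, nsw). With the nsw flag set, negating INT32_MIN would
// overflow, so the result is treated as poison (modeled as std::nullopt).
// With nsw clear, it wraps: abs(INT32_MIN) stays INT32_MIN.
std::optional<int32_t> abs_with_flag(int32_t x, bool nsw) {
  if (x == INT32_MIN) {
    if (nsw)
      return std::nullopt;  // poison: -INT32_MIN is not representable
    return INT32_MIN;       // wrapping negation of INT32_MIN is itself
  }
  return x < 0 ? -x : x;
}
```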

Yes, there's no clear line that I know of. It's like IR canonicalization - we make the rules up as we go along and adapt the surrounding logic to work with the current format. I think we've overcome the limitation for this patch, so we don't need to gate this patch on a decision, but I think we still have the problem. The clearest sign is that -- within instcombine -- we have bailouts for otherwise accepted canonicalizations only if we "matchSelectPattern()".

FWIW I agree that we need to reevaluate this decision and at least introduce min/max intrinsics. These are special-cased in too many places for the current treatment as a simple compare and select to still make sense. Especially when we take into account that there seems to be a new min/max related infinite loop every other month, and that the icmp/select representation has ill-defined behavior when it comes to undef. I expect the effort for introducing these intrinsics will be pretty similar to saturating add/sub, and would be happy to chip in...

I think I'd like to try to handle that, since out of the people in this discussion
only I (well, and @xbolva00) haven't dealt with that before; it may be good to spread the knowledge.

Ping. What would it take to get this moving? :)

nikic added a comment. Apr 20 2020, 2:51 AM

Compile-time numbers look good: http://llvm-compile-time-tracker.com/compare.php?from=f52e0507574b4fd84dc4674536f5dfbab396c0f6&to=0a009b654793dee8e335c053eb043e297071e0d1&stat=instructions

The change does not seem to have a cost above noise, apart from a -0.6% improvement on tramp3d-v4. There's a corresponding -0.74% reduction in code size, so this transform is clearly doing (or enabling) something big there. It might be interesting (but not necessary) to take a quick look at what happens there.

I was actually trying to get that info, and I'm not sure what step I'm missing other than pushing the [[ https://github.com/LebedevRI/llvm-project/tree/perf/instcombine-negator | perf/* ]] branch?

The change does not seem to have a cost above noise, apart from a -0.6% improvement on tramp3d-v4. There's a corresponding -0.74% reduction in code size, so this transform is clearly doing (or enabling) something big there. It might be interesting (but not necessary) to take a quick look at what happens there.

Ah, interesting.
I actually expected this to have a measurable negative (bad) cost,
so it's a pleasant surprise to see beneficial numbers here :)

nikic added a comment. Apr 20 2020, 5:41 AM

I was actually trying to get that info, and I'm not sure what step I'm missing other than pushing the [[ https://github.com/LebedevRI/llvm-project/tree/perf/instcombine-negator | perf/* ]] branch?

I need to manually add your fork as a remote first, so branches get picked up (too many forks to listen to all of them). I've done that now.

Compile-time numbers look good: http://llvm-compile-time-tracker.com/compare.php?from=f52e0507574b4fd84dc4674536f5dfbab396c0f6&to=0a009b654793dee8e335c053eb043e297071e0d1&stat=instructions

The change does not seem to have a cost above noise, apart from a -0.6% improvement on tramp3d-v4. There's a corresponding -0.74% reduction in code size, so this transform is clearly doing (or enabling) something big there. It might be interesting (but not necessary) to take a quick look at what happens there.

Yes, it would be good to derive a regression test from that benchmark and/or invent some larger tests that show the greater optimization power of the new code. Unless I missed it, all of the current test diffs show that we do no harm, but if we can show that the added code/complexity buys us something immediately, that makes the benefit clear.

Compile-time numbers look good: http://llvm-compile-time-tracker.com/compare.php?from=f52e0507574b4fd84dc4674536f5dfbab396c0f6&to=0a009b654793dee8e335c053eb043e297071e0d1&stat=instructions

The change does not seem to have a cost above noise, apart from a -0.6% improvement on tramp3d-v4. There's a corresponding -0.74% reduction in code size, so this transform is clearly doing (or enabling) something big there.
It might be interesting (but not necessary) to take a quick look at what happens there.

Yes, it would be good to derive a regression test from that benchmark


The impact there is quite noticeable as per llvm-diff, which almost immediately crashes.
It appears a lot more inlining happens (functions now missing in new.ll),
and some more function 'specialization' (functions now appearing in new.ll).
I'm not sure I can distill/filter that into any reasonable test case.

and/or invent some larger tests that show the greater optimization power of the new code.
Unless I missed it, all of the current test diffs show that we do no harm, but if we can show that the added code/complexity buys us something immediately, that makes the benefit clear.

The claim I'm making in the patch's description is that we almost consider the sub instruction
non-canonical, and we should be trying to fold it away as much as possible.
These test cases show the common patterns, which we currently miss, where we can get rid of it.

This approach is what was requested in

This *is* unusual. Is this too ugly to live?

I'd prefer to actually transform the operation tree at the point we decide it's profitable, to make it clear we can't end up in an infinite loop or something like that. As it is, you're depending on some other transforms happening in a particular order, and it's not clear that will happen consistently. (Yes, it's a little more code, but I think that's okay.)

spatel accepted this revision. Apr 20 2020, 1:02 PM
spatel added a reviewer: xbolva00.

Given that we have compile-time data now, LGTM.
The implementation goes beyond my normal casual C++ usage (e.g., I'd never seen 'zip' before), so if someone else could take a 2nd/final look too, that would be great.

llvm/lib/Transforms/InstCombine/InstCombineNegator.cpp
134

Remove/update comment.

This revision is now accepted and ready to land. Apr 20 2020, 1:02 PM
nikic added inline comments. Apr 20 2020, 1:51 PM
llvm/lib/Transforms/InstCombine/InstCombineNegator.cpp
181

Noticed while looking through the tramp3d-v4 diff: This should be behind a one-use check, to avoid duplicating expensive division instructions.

xbolva00 added inline comments. Apr 20 2020, 1:52 PM
llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp
2005

Just wondering if we could use a better name for this lambda than “cleanup”.

lebedev.ri added inline comments. Apr 20 2020, 2:19 PM
llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp
2005

I couldn't come up with a better one; any suggestions?
TryToNarrowDeduceFlags()?
This is where goto might make sense, but somehow I don't want to use it.

xbolva00 added inline comments. Apr 20 2020, 2:48 PM
llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp
2005

Your idea is fine I think.

lebedev.ri added inline comments. Apr 20 2020, 2:52 PM
llvm/lib/Transforms/InstCombine/InstCombineNegator.cpp
181

I was hesitant about this one indeed; it isn't a typo that the one-use check is gone here,
because we generally consider only instruction count.

@spatel thoughts?

But the main, bigger question this touches is:
"but what if all the uses would get negated by us?
In the future, can we somehow sanely model the whole negatible tree,
not giving up at non-single-use instructions,
but deferring that until after we've finished building the new tree?"

spatel added inline comments. Apr 21 2020, 5:52 AM
llvm/lib/Transforms/InstCombine/InstCombineNegator.cpp
181

I missed that logic difference, and I'm not getting a reviewable diff of the attached files with llvm-diff or other apps. Can you create an IR example/regression test for that?

lebedev.ri added inline comments. Apr 21 2020, 6:06 AM
llvm/lib/Transforms/InstCombine/InstCombineNegator.cpp
181

@spatel
To be clear, the "bigger question" is pretty rhetorical, or at least not for this review.

The actual question here is whether we should consider sdiv special,
and even if we can negate it without increasing the instruction count,
whether we should only do so when there are no other uses of the old sdiv.
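The sdiv case being debated: negating a division by a constant is instruction-count-free because you negate the constant instead; but if the original sdiv has other users, the old division stays alive and the rewrite duplicates an expensive op. A sketch of the identity, using C++ truncating division as a stand-in for sdiv (the INT64_MIN overflow corner is excluded):

```cpp
#include <cassert>
#include <cstdint>

// `0 - (x / C)` == `x / (-C)` under round-toward-zero division, for any
// nonzero C, as long as neither side overflows (x == INT64_MIN together
// with C == 1 or C == -1 are the problematic corners).
int64_t div_then_negate(int64_t x, int64_t c) { return 0 - (x / c); }
int64_t negate_divisor(int64_t x, int64_t c) { return x / (0 - c); }
```

So with a one-use check in place, the transform stays neutral on instruction count; without it, a second division can appear.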

spatel added inline comments. Apr 21 2020, 6:45 AM
llvm/lib/Transforms/InstCombine/InstCombineNegator.cpp
181

There is precedent for this kind of special treatment. In other words, not all opcodes are equal in terms of analysis (and the secondary concern of codegen), and we will even increase instruction count to avoid some, like div/rem (although those transforms are probably currently not safe with respect to poison).

If it would be NFC-ish to keep the one-use check, then we should do that. Then remove the limitation as a follow-up if it can be shown to be useful?

lebedev.ri marked 11 inline comments as done.
lebedev.ri edited the summary of this revision. (Show Details)

Updated: adjust comments, lambda name, guard sdiv with an artificial one-use-check.

lebedev.ri added inline comments. Apr 21 2020, 10:50 AM
llvm/lib/Transforms/InstCombine/InstCombineNegator.cpp
181

Alright, I'll add the one-use check just to get this moving :)

This revision was automatically updated to reflect the committed changes.
kcc added a subscriber: kcc.

Hi,

This change causes a performance regression in tsan, as detected on our LLVM buildbot:
http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-autoconf/builds/49850/steps/tsan%20analyze/logs/stdio

The script that comes with tsan checks the number of PUSHes, etc., in some key tsan functions,
where each extra PUSH causes tsan to be slower.

With this change, the number of PUSHes went from 3 to 4.

Please take a look, this might be a performance regression for a wider set of targets.

Before your change:

read1 tot 484; size 1830; rsp 1; push 3; pop 15; call 2; load 24; store  9; sh  46; mov 106; lea   2; cmp  76

After your change:

read1 tot 515; size 1980; rsp 1; push 4; pop 4; call 2; load 24; store  9; sh  46; mov 113; lea   2; cmp  90

Script to reproduce (in the llvm-project root dir, with "build" subdir)

#!/bin/bash

compile() {

clang -c -O2  compiler-rt/lib/tsan/rtl/tsan_rtl.cpp -I compiler-rt/lib -Wall -std=c++14 -Wno-unused-parameter -O2 -g -DNDEBUG    -m64 -fno-lto -fPIC -fno-builtin -fno-exceptions -fomit-frame-pointer -funwind-tables -fno-stack-protector -fno-sanitize=safe-stack -fvisibility=hidden -fno-lto -O3 -gline-tables-only -Wno-gnu -Wno-variadic-macros -Wno-c99-extensions -Wno-non-virtual-dtor -fPIE -fno-rtti -msse3 -Wframe-larger-than=530 -Wglobal-constructors

}

git checkout a13dce1d90cba6c55252dee0a2600eab37ffbc44
(cd build; ninja clang 2> /dev/null)
compile
compiler-rt/lib/tsan/analyze_libtsan.sh tsan_rtl.o | grep read1

git checkout 352fef3f11f5ccb2ddc8e16cecb7302a54721e9f
(cd build; ninja clang 2> /dev/null)
compile
compiler-rt/lib/tsan/analyze_libtsan.sh tsan_rtl.o | grep read1

In D68408#1998112, @kcc wrote:

Hi,

Hi.

This change causes a performance regression in tsan, as detected on our LLVM buildbot:
http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-autoconf/builds/49850/steps/tsan%20analyze/logs/stdio

Looks like the build was red already: http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-autoconf/builds/49808
That explains why I didn't see the new failure: http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-autoconf/builds/49809

The script that comes with tsan checks the number of PUSHes, etc., in some key tsan functions,
where each extra PUSH causes tsan to be slower.

With this change, the number of PUSHes went from 3 to 4.

Please take a look, this might be a performance regression for a wider set of targets.

Before your change:

read1 tot 484; size 1830; rsp 1; push 3; pop 15; call 2; load 24; store  9; sh  46; mov 106; lea   2; cmp  76

After your change:

read1 tot 515; size 1980; rsp 1; push 4; pop 4; call 2; load 24; store  9; sh  46; mov 113; lea   2; cmp  90

Interesting. Not very unexpected - there's always the possibility of an avalanche effect with IR changes.

Unhelpful answer: wow, how things have regressed since rL342092 / D51985.

Script to reproduce (in the llvm-project root dir, with "build" subdir)

#!/bin/bash

compile() {

clang -c -O2  compiler-rt/lib/tsan/rtl/tsan_rtl.cpp -I compiler-rt/lib -Wall -std=c++14 -Wno-unused-parameter -O2 -g -DNDEBUG    -m64 -fno-lto -fPIC -fno-builtin -fno-exceptions -fomit-frame-pointer -funwind-tables -fno-stack-protector -fno-sanitize=safe-stack -fvisibility=hidden -fno-lto -O3 -gline-tables-only -Wno-gnu -Wno-variadic-macros -Wno-c99-extensions -Wno-non-virtual-dtor -fPIE -fno-rtti -msse3 -Wframe-larger-than=530 -Wglobal-constructors

}

git checkout a13dce1d90cba6c55252dee0a2600eab37ffbc44
(cd build; ninja clang 2> /dev/null)
compile
compiler-rt/lib/tsan/analyze_libtsan.sh tsan_rtl.o | grep read1

git checkout 352fef3f11f5ccb2ddc8e16cecb7302a54721e9f
(cd build; ninja clang 2> /dev/null)
compile
compiler-rt/lib/tsan/analyze_libtsan.sh tsan_rtl.o | grep read1


The llvm-diff report is pretty large; IR instruction-wise, this appears to be a win overall (-639/+609 instructions).
Visually, I think I can spot only two IR patterns that we now fail to fold:

in function _ZN6__tsan10InitializeEPNS_11ThreadStateE:
  in block %if.then88.i:
    >   %22 = xor i64 %xor.i.i.i, -17592186044417
    >   %mul.i.i194.neg.i = add i64 %22, 1
    >   %sub.i = add i64 %mul.i.i194.neg.i, %mul.i.i203.i
    <   %mul.i.i194.i = xor i64 %xor.i.i.i, 17592186044416
    <   %sub.i = sub i64 %mul.i.i203.i, %mul.i.i194.i
  in block %if.then88.1.i:
    >   %35 = xor i64 %xor.i.i.1.i, -17592186044417
    >   %mul.i.i194.neg.1.i = or i64 %mul.i.i203.1.i, 1
    >   %sub.1.i = add i64 %mul.i.i194.neg.1.i, %35
    <   %mul.i.i194.1.i = xor i64 %xor.i.i.1.i, 17592186044416
    <   %sub.1.i = sub i64 %mul.i.i203.1.i, %mul.i.i194.1.i

Filed https://bugs.llvm.org/show_bug.cgi?id=45647

But I believe I'm supposed to look at the @__tsan_read1 function, right?
Then the relevant diff is:

in function __tsan_read1:
  in block %if.then.i1390.i.i.i:
    >   %sub.neg.i1385.i.i.i = sub nsw i64 %and.i19.i1382.i.i.i, %and.i.i1380.i.i.i
    >   %sub.neg.highbits.i1388.i.i.i = lshr i64 %sub.neg.i1385.i.i.i, %and.i.i.i1387.i.i.i
    >   %cmp7.i1389.i.i.i = icmp ne i64 %sub.neg.highbits.i1388.i.i.i, 0
    <   %sub6.i1387.i.i.i = sub nsw i64 0, %sub.i1383.i.i.i
    <   %sub6.highbits.i1388.i.i.i = lshr i64 %sub6.i1387.i.i.i, %and.i.i.i1386.i.i.i
    <   %cmp7.i1389.i.i.i = icmp ne i64 %sub6.highbits.i1388.i.i.i, 0
        %cmp.i1378.i.i.i = icmp ult i64 %xor.i1422.i.i.i, 1125899906842624
    >   %or.cond.i.i.i = or i1 %cmp.i1378.i.i.i, %cmp7.i1389.i.i.i
    >   br i1 %or.cond.i.i.i, label %do.body226.i.i.i, label %if.end86.i.i.i
    <   %or.cond.i.i.i = or i1 %cmp.i1378.i.i.i, %cmp7.i1389.i.i.i
    <   br i1 %or.cond.i.i.i, label %do.body226.i.i.i, label %if.end86.i.i.i

So we've traded 0 - %sub.i1383.i.i.i for %and.i19.i1382.i.i.i - %and.i.i1380.i.i.i.
That's it; [un]fortunately, there is nothing else going on.
But thankfully, that explains the problem well.
Pushed rG5a159ed2a8e5a9a6ced73f78e4c64b01d76d3493.
Thanks.