Page MenuHomePhabricator

Transforms: refactor pow(x, n) expansion where n is a constant integer value
ClosedPublic

Authored by pawosm01 on Jun 25 2022, 1:48 PM.

Details

Summary

Since the backend's codegen is capable to expand powi into fmul's, it
is not needed anymore to do so in the ::optimizePow() function of
SimplifyLibCalls.cpp. What is sufficient is to always turn pow(x, n)
into powi(x, n) for the cases where n is a constant integer value.

Dropping the current expansion code allowed relaxation of the folding
conditions and now this can also happen at optimization levels below
Ofast.

The added CodeGen/AArch64/powi.ll test case ensures that powi is
actually expanded into fmul's, confirming that this refactor did not
cause any performance degradation.

Following an idea proposed by David Sherwood <david.sherwood@arm.com>.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Herald added a project: Restricted Project. · View Herald TranscriptJun 25 2022, 1:48 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
pawosm01 requested review of this revision.Jun 25 2022, 1:48 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 25 2022, 1:48 PM
pawosm01 updated this revision to Diff 440154.Jun 27 2022, 3:39 AM

fixed formatting.

mgabka added a subscriber: mgabka.Jun 27 2022, 5:09 AM
david-arm added inline comments.Jun 28 2022, 6:55 AM
llvm/include/llvm/IR/Operator.h
263 ↗(On Diff #440154)

nit: Hi, I think this probably needs a simple comment here, something like

/// Test if this operation's arguments and results are assumed to be finite.
llvm/test/Transforms/InstCombine/pow-4.ll
35

It looks like the nsz flag is not needed here because the transformation only depends upon reassoc nnan ninf I think?

pawosm01 updated this revision to Diff 440610.Jun 28 2022, 7:11 AM

Addressed review comment 1.

pawosm01 updated this revision to Diff 440629.Jun 28 2022, 7:52 AM
pawosm01 marked an inline comment as done.

Comment 2 addressed.

pawosm01 marked an inline comment as done.Jun 28 2022, 7:53 AM
pawosm01 added inline comments.
llvm/include/llvm/IR/Operator.h
263 ↗(On Diff #440154)

done

llvm/test/Transforms/InstCombine/pow-4.ll
35

yes

david-arm added a subscriber: spatel.

The patch looks good to me. However, @spatel is more familiar than me with the semantics of fast-math flags on operations, so adding as a reviewer in case I've missed something!

llvm/test/Transforms/InstCombine/pow-4.ll
63

nit: Can you remove nsz here too?

110

nit: Can you remove nsz here too?

pawosm01 updated this revision to Diff 440657.Jun 28 2022, 9:25 AM
pawosm01 marked an inline comment as done.

Another review comment addressed.

pawosm01 marked 2 inline comments as done.Jun 28 2022, 9:27 AM
pawosm01 added inline comments.
llvm/test/Transforms/InstCombine/pow-4.ll
63

Si.

110

Si.

RKSimon added a subscriber: RKSimon.

Not to confuse things, but we also have @llvm.powi.* intrinsics and expansion which are all very similar to this - can we not merge these more?

pawosm01 marked 2 inline comments as done.Jun 29 2022, 11:23 AM

Sadly, powi is handled differently (than powf, pow and powl which all result in calling optimizePow() modified by this patch) in the SimplifyLibCalls.cpp code, so extending this patch in that direction would go beyond the initial scope and may overstretch my approved upstreaming activity. I would rather create separate commit covering powi later.

The patch description and tests don't match what the code does currently. We don't require full 'fast'; we just check for 'afn' and propagate that flag to the new instructions.

It might help to know what the motivating source case looks like - are we translating the source-level parameters to FMF as expected?

I'm not opposed to easing the restriction, but it's probably better to add/modify the tests first. That way, we know exactly what the current behavior is and how it will change with the proposed patch.

It seems problematic that LibCallSimplifier::optimizePow also folds pow ->powi yet we also have pow expansion code that is completely separate from powi (where its handled in CGP).

We now have TTI:isBeneficialToExpandPowI to decide when to expand powi intrinsics properly instead of hard coded limits like there are here.

Hi @RKSimon, thanks for your comments, although I'd like to obtain some of your guidance on what can I do to address them in the context of this commit.

It seems problematic that LibCallSimplifier::optimizePow also folds pow ->powi yet we also have pow expansion code that is completely separate from powi (where its handled in CGP).

Does it mean that in order to make things less problematic, all applications of createPowWithIntegerExponent() should be moved somewhere else? Or maybe pow/powf/powl expansion code should not occur in optimizePow()? Should the scope of this commit be extended towards such refactor?

We now have TTI:isBeneficialToExpandPowI to decide when to expand powi intrinsics properly instead of hard coded limits like there are here.

Should it happen the same way for pow/powf/powl too?

The patch description and tests don't match what the code does currently. We don't require full 'fast'; we just check for 'afn' and propagate that flag to the new instructions.

It might help to know what the motivating source case looks like - are we translating the source-level parameters to FMF as expected?

I'm not opposed to easing the restriction, but it's probably better to add/modify the tests first. That way, we know exactly what the current behavior is and how it will change with the proposed patch.

@spatel we have discussed your comment here, and my understanding is that you're not opposing to the proposed change as such, I guess what you wanted to know is how we ended up with proposed set of flags. At the time of our experiments, we were focused on trying to restrict certain optimizations to the lowest possible set of fast math flags. We were comparing with what Intel compiler can do by default at -O3 as it is more relaxed than clang, so we wanted to find a route to getting similar effect.

As such I'm finding the test case presented here fitting to the nature of our experiment and I don't see anything missing. Maybe it is quite academic and many may not find a huge benefit of that; in the end, it is sufficient to use -funsafe-math-optimizations flag in order to relax restrictions.

Hi @RKSimon, thanks for your comments, although I'd like to obtain some of your guidance on what can I do to address them in the context of this commit.

It seems problematic that LibCallSimplifier::optimizePow also folds pow ->powi yet we also have pow expansion code that is completely separate from powi (where its handled in CGP).

Does it mean that in order to make things less problematic, all applications of createPowWithIntegerExponent() should be moved somewhere else? Or maybe pow/powf/powl expansion code should not occur in optimizePow()? Should the scope of this commit be extended towards such refactor?

I'd expect a powi intrinsic instead of expanding the pow multiplies directly - the sqrt variant should be able to fed into the powi intrinsic as well afaict.

We now have TTI:isBeneficialToExpandPowI to decide when to expand powi intrinsics properly instead of hard coded limits like there are here.

Should it happen the same way for pow/powf/powl too?

Yes - create the powi intrinsic and leave it CGP and the backend to decide when to expand it.

@spatel we have discussed your comment here, and my understanding is that you're not opposing to the proposed change as such, I guess what you wanted to know is how we ended up with proposed set of flags. At the time of our experiments, we were focused on trying to restrict certain optimizations to the lowest possible set of fast math flags. We were comparing with what Intel compiler can do by default at -O3 as it is more relaxed than clang, so we wanted to find a route to getting similar effect.

As such I'm finding the test case presented here fitting to the nature of our experiment and I don't see anything missing. Maybe it is quite academic and many may not find a huge benefit of that; in the end, it is sufficient to use -funsafe-math-optimizations flag in order to relax restrictions.

Thanks for the background info. FP optimization flags are not precisely specified, so there will be mismatches between clang vs. icc vs. gcc. If you want to allow this with -fassociative-math + -ffinite-math-only, then we'd need to adjust something in the front-end because reassoc is not applied to the pow call:
https://godbolt.org/z/obeenobfq

I didn't realize we had the TLI hook to expand powi already, so I agree with @RKSimon (and there's a TODO comment in this code about removing the expansion here in IR) - we should convert to powi here and leave the expansion under target control in the backend. Maybe this already works if you just delete the expansion in this file?

pawosm01 updated this revision to Diff 442616.Jul 6 2022, 9:41 AM
pawosm01 retitled this revision from Transforms: Relax restrictions on pow(x, y) expansion to Transforms: refactor pow(x, n) expansion where n is a constant integer value.
pawosm01 edited the summary of this revision. (Show Details)

@RKSimon and @spatel you were right, this code can be happily removed, the fmul expansion is still happening when it's needed. I've uploaded a completely new commit.

Thanks @pawosm01

llvm/test/Transforms/InstCombine/pow-4.ll
139

I guess we need to decide whether we want to retain this variety of cases somehow? I assume we can perform this as powi(x, 16) * sqrt(x) ?

pawosm01 added inline comments.Jul 6 2022, 12:12 PM
llvm/test/Transforms/InstCombine/pow-4.ll
139

Something like this?:

   const APFloat *ExpoF;
   if (match(Expo, m_APFloat(ExpoF)) && !ExpoF->isExactlyValue(0.5) &&
       !ExpoF->isExactlyValue(-0.5)) {
+    // This transformation applies to integer or integer+0.5 exponents only.
+    // For integer+0.5, we create a sqrt(Base) call.
+    APFloat ExpoA(abs(*ExpoF));
+    Value *Sqrt = nullptr;
+    if (AllowApprox && !ExpoA.isInteger()) {
+      APFloat Expo2 = ExpoA;
+      // To check if ExpoA is an integer + 0.5, we add it to itself. If there
+      // is no floating point exception and the result is an integer, then
+      // ExpoA == integer + 0.5
+      if (Expo2.add(ExpoA, APFloat::rmNearestTiesToEven) != APFloat::opOK)
+        return nullptr;
+
+      if (!Expo2.isInteger())
+        return nullptr;
+
+      Sqrt = getSqrtCall(Base, Pow->getCalledFunction()->getAttributes(),
+                         Pow->doesNotAccessMemory(), M, B, TLI);
+      if (!Sqrt)
+        return nullptr;
+    }
+
     APSInt IntExpo(TLI->getIntSize(), /*isUnsigned=*/false);
+    // pow(x, n) -> powi(x, n) if n is a constant signed integer value
     if (ExpoF->isInteger() &&
         ExpoF->convertToInteger(IntExpo, APFloat::rmTowardZero, &Ignored) ==
             APFloat::opOK) {
-      return copyFlags(
+      Value *PowI = copyFlags(
           *Pow,
           createPowWithIntegerExponent(
               Base, ConstantInt::get(B.getIntNTy(TLI->getIntSize()), IntExpo),
               M, B));
+
+      if (PowI && Sqrt)
+        return B.CreateFMul(PowI, Sqrt);
+
+      return PowI;
     }
   }

Sadly, this leads to infinite loop of self-contradicting optimizations:

FAIL: LLVM :: Transforms/InstCombine/pow_fp_int.ll (10150 of 45161)
******************** TEST 'LLVM :: Transforms/InstCombine/pow_fp_int.ll' FAILED ********************
Script:
--
: 'RUN: at line 1';   /dsg_space/projectscratch_dsg_space/pawosm01/upstream/llvm-project.git/build-shared-debug/bin/opt -mtriple unknown -passes=instcombine -S < /dsg_space/projectscratch_dsg_space/pawosm01/upstream/llvm-project.git/llvm/test/Transforms/InstCombine/pow_fp_int.ll | /dsg_space/projectscratch_dsg_space/pawosm01/upstream/llvm-project.git/build-shared-debug/bin/FileCheck /dsg_space/projectscratch_dsg_space/pawosm01/upstream/llvm-project.git/llvm/test/Transforms/InstCombine/pow_fp_int.ll
--
Exit Code: 2

Command Output (stderr):
--
LLVM ERROR: Instruction Combining seems stuck in an infinite loop after 100 iterations.

@RKSimon could you give some suggestions where to go next with this?

Investigating the infinite loops further makes sense to me - do we have a reverse fold that combines powi / sqrt intrinsics?

pawosm01 updated this revision to Diff 443231.Jul 8 2022, 6:28 AM

@RKSimon turned out doing a simple copy-paste wasn't the best idea. I've corrected the code proposed within one of my comments and put it into a new patch. It passes all test now, nothing runs into infinite loop.

Does this approach fits your idea of addressing this specific case?

Yeah, that's the kind of thing I had in mind, but I'd feel happier if somebody else gave this a check through as well.

spatel added inline comments.Jul 8 2022, 8:45 AM
llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp
1941

Delete stale code comment.

1972

Add another line for the sqrt variant like:
// pow(x, n) -> powi(x, n) * sqrt(x) if n has exactly a 0.5 fraction

llvm/test/CodeGen/AArch64/powi.ll
8

These tests results are independent of this patch, so this could be committed as a preliminary "NFC" patch.

We intentionally don't have regression tests that check end-to-end results of IR optimization + codegen because we want focused/unit testing at this level. The end-to-end tests live in test-suite instead (and could begin from C source too).

pawosm01 updated this revision to Diff 443288.Jul 8 2022, 10:48 AM

Following review comments.

pawosm01 marked 2 inline comments as done.Jul 8 2022, 10:51 AM
pawosm01 added inline comments.
llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp
1941

OK.
Also followed the comment below.

1972

OK

llvm/test/CodeGen/AArch64/powi.ll
8

I'd keep it if it's not very disturbing. This is the last thing that links us to the idea that brought us here.

spatel accepted this revision.Jul 8 2022, 12:06 PM

I'm not sure if we answered the original question about fast-math-flags, but that can be an independent patch if needed. This part LGTM with the dead code removed.

llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp
1640–1641

This function is dead code now and should be removed.

This revision is now accepted and ready to land.Jul 8 2022, 12:06 PM
pawosm01 updated this revision to Diff 443330.Jul 8 2022, 12:52 PM
pawosm01 marked 2 inline comments as done.

Removed dead function. Thanks @spatel for spotting this!

pawosm01 marked an inline comment as done.Jul 8 2022, 12:53 PM
pawosm01 added inline comments.
llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp
1640–1641

Removed dead code.

pawosm01 marked an inline comment as done.Jul 8 2022, 1:01 PM

Seems my old write access is not valid anymore. Can I ask someone to push this commit?

This revision was landed with ongoing or failed builds.Jul 9 2022, 9:02 AM
This revision was automatically updated to reflect the committed changes.

Hi,

This patch introduces the difference between new inlined and library versions of std::pow, at least on x86_64. For large exponents, the difference is large. Also, when the difference accumulates, it makes a lot of existing tests that compare against golden values to go far beyond the meaningful margin.

Similarly, it introduces the difference between pow of constant and pow of non-constant with the same value.

I think such differences make related debugging a lot more complex. What do you think?

Here are the tests:

[[clang::optnone]] double test_pow(double x, double y) {
  return std::pow(x, y);
}

// PASS - compiler evaluates std::pow(const, const) similar to the library?
TEST(Test, Pow_const) {
  double x = 8.804009981602594;
  double y = 10.0;
  double t = test_pow(x, y);
  double z = std::pow(x, y);
  EXPECT_TRUE(t - z < 1e-6 && z -t < 1e-6);
}

[[clang::optnone]] double init(const char* buf) {
  double x;
  memcpy(&x, buf, sizeof(double));
  return x;
}

// FAIL - https://reviews.llvm.org/D128591 introduces difference.
TEST(Test, Pow_nonconst) {
  double x = init("\x0a\x48\x41\x32\xa7\x9b\x21\x40");  // 8.804009981602594
  double y = 10.0;
  double t = test_pow(x, y);
  double z = std::pow(x, y);
  EXPECT_TRUE(t - z < 1e-6 && z -t < 1e-6);
}

Hi @eaeltsin isn't this the same problem with every other FP optimisation where the user has explicitly requested optimisations to take place? The same thing is true for stuff like the following:

float __attribute__((optnone)) reduc1(float *src, long n) {
  float r = 0;
  for (long i = 0; i < n; i++)
    r += src[i];
  return r;
}

float reduc2(float *src, long n) {
  float r = 0;
  for (long i = 0; i < n; i++)
    r += src[i];
  return r;
}

and the user compiles with -Ofast. With certain pathological cases it's also possible to produce quite different results. It seems to me that if the user has explicitly requested FP optimisations (because they care more about performance than accuracy) and the optimisations are legal within the context of the flags requested, then the compiler is doing the right thing?

Hi David,

One more question - my concern is that we get different results in logically equal contexts:

// FAIL
TEST(Test, Pow7) {
  double x1 = 8.804009981602594;
  double x2 = init("\x0a\x48\x41\x32\xa7\x9b\x21\x40");  // 8.804009981602594
  double y = 10.0;
  double t = std::pow(x1, y);
  double z = std::pow(x2, y);
  EXPECT_TRUE(t - z < 1e-6 && z - t < 1e-6);
}

It looks like only the newly inlined version differs, while compile-time and library versions agree.

It seems to me that if the user has explicitly requested FP optimisations (because they care more about performance than accuracy) and the optimisations are legal within the context of the flags requested, then the compiler is doing the right thing?

I agree. Though it would be good to see the IR for the test to confirm that we didn't accidentally create a path where non-fast pow got transformed into powi. The description of powi tries to make it clear that it is unreliable in testing like the example - "The order of evaluation of multiplications is not defined." ( https://llvm.org/docs/LangRef.html#llvm-powi-intrinsic )

The question about constant folding a powi call is interesting independently of that - should we evaluate it at compile-time with a multiply loop? There's no way it's going to match the target-specific expansion or library call, so we're just trading one set of different answers for another?

Hi @eaeltsin, it's also worth bearing in mind that we were already doing this optimisation (replacing pow with multiplies) before this patch landed if the user was building with -Ofast. This patch simply extends that behaviour for a different set of fast math flags.

BTW, this is not about -Ofast - the problem reproduces with -O1 - https://gcc.godbolt.org/z/6barovn81

BTW, this is not about -Ofast - the problem reproduces with -O1 - https://gcc.godbolt.org/z/6barovn81

Ah, that's definitely a bug. We can't convert pow to powi without some kind of FMF. I thought the transform was still guarded by 'afn', but I see now that there are non-fast tests that changed too.

BTW, this is not about -Ofast - the problem reproduces with -O1 - https://gcc.godbolt.org/z/6barovn81

Ah, that's definitely a bug. We can't convert pow to powi without some kind of FMF. I thought the transform was still guarded by 'afn', but I see now that there are non-fast tests that changed too.

Fixed to require 'afn':
3d6c10dcf3b5

I'm not sure if that resolves all of the outstanding questions here (and might reopen some of the original motivation for the patch), but it should restore correctness with an -O1 compile.

Thanks Sanjay, this fixes the issues we are seeing so far.