
venkataramanan.kumar.llvm (Venkataramanan Kumar)
User

Projects

User does not belong to any projects.

User Details

User Since
Jul 4 2017, 7:47 AM (177 w, 5 d)

Recent Activity

Wed, Nov 25

venkataramanan.kumar.llvm added a comment to D88712: [CGBuiltin] Respect asm labels and redefine_extname for builtins with specialized emitting.

For your example:

extern double log (double) asm ("" "log_finite") __attribute__ ((nothrow));

double mylog (double d) { return log(d); }

The intention is to emit log_finite regardless of -ffast-math. Both GCC and Clang (after this patch) emit jmp log_finite.

The previous behavior (according to your comment, with an older glibc) was undesired. So I think the right suggestion is to upgrade glibc.
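As a side note, the asm-label redirection being discussed can be sketched in a self-contained way. The names my_log and log_finite_impl below are illustrative stand-ins, not from the patch or glibc:

```c
/* Illustrative sketch (hypothetical names): the asm label makes every call
 * to my_log be emitted against the symbol "log_finite_impl", just as glibc's
 * math-finite header redirects log to log_finite. */
extern double my_log(double) __asm__("log_finite_impl");

/* Provide the target symbol, also via an asm label so the exact symbol name
 * matches regardless of the platform's name-mangling scheme. */
double impl(double d) __asm__("log_finite_impl");
double impl(double d) { return d + 1.0; }  /* stand-in body, not a real log */

double call_it(double d) { return my_log(d); }  /* emits a call to log_finite_impl */
```

Because the redirection happens at the symbol level, the compiler front end never sees the name the call ultimately binds to, which is exactly why intrinsic-based optimizations can miss these calls.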

Then there are optimizations like vectorization and instcombine that work on the LLVM intrinsics, but they will not happen now with the log_finite calls.

I'm not sure about the expected semantics/lowering for the finite calls, but can you add something under LibCallSimplifier::optimizeFloatingPointLibCall() that would turn it into the LLVM log intrinsic with appropriate FMF? Codegen would need to be responsible for converting it back to finite call(s) if those are available?

Hi Sanjay, I thought codegen no longer lowers them back to finite calls:
https://reviews.llvm.org/rGcd0926d087a85c5ee1222ca80980b4440214a822

There was a comment in D74712 that we might be moving too fast. If you would like to revert/adjust that patch, raise a bug or post a patch to discuss? If the goal is to be able to vectorize the finite calls, then we will need some way to represent those. Alternatively, we could change something in the cost models to enable more unrolling, etc?

Wed, Nov 25, 5:16 AM · Restricted Project

Tue, Nov 24

venkataramanan.kumar.llvm added a comment to D88712: [CGBuiltin] Respect asm labels and redefine_extname for builtins with specialized emitting.

For your example:

extern double log (double) asm ("" "log_finite") __attribute__ ((nothrow));

double mylog (double d) { return log(d); }

The intention is to emit log_finite regardless of -ffast-math. Both GCC and Clang (after this patch) emit jmp log_finite.

The previous behavior (according to your comment, with an older glibc) was undesired. So I think the right suggestion is to upgrade glibc.

Then there are optimizations like vectorization and instcombine that work on the LLVM intrinsics, but they will not happen now with the log_finite calls.

I'm not sure about the expected semantics/lowering for the finite calls, but can you add something under LibCallSimplifier::optimizeFloatingPointLibCall() that would turn it into the LLVM log intrinsic with appropriate FMF? Codegen would need to be responsible for converting it back to finite call(s) if those are available?

Tue, Nov 24, 7:29 AM · Restricted Project

Mon, Nov 23

venkataramanan.kumar.llvm added a comment to D88712: [CGBuiltin] Respect asm labels and redefine_extname for builtins with specialized emitting.

For your example:

extern double log (double) asm ("" "log_finite") __attribute__ ((nothrow));

double mylog (double d) { return log(d); }

The intention is to emit log_finite regardless of -ffast-math. Both GCC and Clang (after this patch) emit jmp log_finite.

The previous behavior (according to your comment, with an older glibc) was undesired. So I think the right suggestion is to upgrade glibc.

Mon, Nov 23, 9:34 PM · Restricted Project
venkataramanan.kumar.llvm updated subscribers of D88712: [CGBuiltin] Respect asm labels and redefine_extname for builtins with specialized emitting.
Mon, Nov 23, 10:13 AM · Restricted Project
venkataramanan.kumar.llvm added a comment to D88712: [CGBuiltin] Respect asm labels and redefine_extname for builtins with specialized emitting.

I am on an Ubuntu 18 machine and it has the finite-math header <bits/math-finite.h>.
This header is included by glibc 2.27 and contains the following definition:
extern double log (double) asm ("" "log_finite") __attribute__ ((nothrow));

Mon, Nov 23, 10:12 AM · Restricted Project

Oct 21 2020

venkataramanan.kumar.llvm added a comment to D88154: Initial support for vectorization using Libmvec (GLIBC vector math library)..

Thanks @spatel, @Florian and @fpetrogalli for the review comments and approval. Can someone please commit it on my behalf?

Oct 21 2020, 12:37 AM · Restricted Project, Restricted Project

Oct 20 2020

venkataramanan.kumar.llvm updated the diff for D88154: Initial support for vectorization using Libmvec (GLIBC vector math library)..

Removed an incorrect file that got attached to my earlier patch.

Oct 20 2020, 12:21 AM · Restricted Project, Restricted Project
venkataramanan.kumar.llvm updated the diff for D88154: Initial support for vectorization using Libmvec (GLIBC vector math library)..

Added a test case for testing vector library calls for VF=2 and VF=8.

Oct 20 2020, 12:18 AM · Restricted Project, Restricted Project

Oct 13 2020

venkataramanan.kumar.llvm updated the diff for D88154: Initial support for vectorization using Libmvec (GLIBC vector math library)..

Updated the patch as per review comments received.

Oct 13 2020, 11:42 AM · Restricted Project, Restricted Project

Oct 12 2020

venkataramanan.kumar.llvm updated the diff for D88154: Initial support for vectorization using Libmvec (GLIBC vector math library)..

As per review comments from Sanjay, updated the test case to use metadata. Also autogenerated the checks in the test cases using llvm/utils/update_test_checks.py.

Oct 12 2020, 11:54 AM · Restricted Project, Restricted Project

Oct 11 2020

venkataramanan.kumar.llvm updated the diff for D88154: Initial support for vectorization using Libmvec (GLIBC vector math library)..

Changed the library naming to LIBMVEC-X86 as per the comments, and also made the selection based on the target triple in Clang.
I am still working on auto-generating the FileCheck lines for the test cases.

Oct 11 2020, 11:25 AM · Restricted Project, Restricted Project
venkataramanan.kumar.llvm added a comment to D88154: Initial support for vectorization using Libmvec (GLIBC vector math library)..

Looks good to me.
Regarding the tests, it seems that you check whether auto-vectorization takes advantage of libmvec?
Would it be interesting to have a test which declares a vector and calls the builtin sin on it?

Thank you very much for the changes! :)

Do we have built-in support for sin that takes vector types?

I tried:

__m128d compute_sin(__m128d x)
{
    return __builtin_sin(x);
}

error: passing '__m128d' (vector of 2 'double' values) to parameter of incompatible type 'double'

We have LLVM intrinsics for sin/cos that may use vector types:
http://llvm.org/docs/LangRef.html#llvm-sin-intrinsic
...but I don't know of a way to produce those directly from C source.

Oct 11 2020, 11:16 AM · Restricted Project, Restricted Project

Oct 6 2020

venkataramanan.kumar.llvm added a comment to D88154: Initial support for vectorization using Libmvec (GLIBC vector math library)..

Pinging for review comments.

Oct 6 2020, 2:33 AM · Restricted Project, Restricted Project

Oct 4 2020

venkataramanan.kumar.llvm added a comment to D88154: Initial support for vectorization using Libmvec (GLIBC vector math library)..

Looks good to me.
Regarding the tests, it seems that you check whether auto-vectorization takes advantage of libmvec?
Would it be interesting to have a test which declares a vector and calls the builtin sin on it?

Thank you very much for the changes! :)

Oct 4 2020, 3:06 AM · Restricted Project, Restricted Project
venkataramanan.kumar.llvm updated the diff for D88154: Initial support for vectorization using Libmvec (GLIBC vector math library)..

Selection of the glibc vector math library is enabled via the option -fvec-lib=libmvec.

Oct 4 2020, 3:01 AM · Restricted Project, Restricted Project

Sep 29 2020

venkataramanan.kumar.llvm added a reviewer for D88154: Initial support for vectorization using Libmvec (GLIBC vector math library).: craig.topper.
Sep 29 2020, 6:21 AM · Restricted Project, Restricted Project

Sep 24 2020

venkataramanan.kumar.llvm added a reviewer for D88154: Initial support for vectorization using Libmvec (GLIBC vector math library).: rengolin.
Sep 24 2020, 11:17 PM · Restricted Project, Restricted Project
venkataramanan.kumar.llvm added reviewers for D88154: Initial support for vectorization using Libmvec (GLIBC vector math library).: nemanjai, hfinkel, mmasten, mzolotukhin.
Sep 24 2020, 5:51 AM · Restricted Project, Restricted Project

Sep 23 2020

venkataramanan.kumar.llvm added a comment to D88154: Initial support for vectorization using Libmvec (GLIBC vector math library)..

In the initial version, I supported the following vector functions (VF 2 and 4).

Sep 23 2020, 8:07 AM · Restricted Project, Restricted Project
venkataramanan.kumar.llvm requested review of D88154: Initial support for vectorization using Libmvec (GLIBC vector math library)..
Sep 23 2020, 8:02 AM · Restricted Project, Restricted Project

Sep 1 2020

venkataramanan.kumar.llvm added inline comments to D86726: [InstCombine]: Transform 1.0/sqrt(X) * X to X/sqrt(X) and X * 1.0/sqrt(X) to X/sqrt(X).
Sep 1 2020, 8:11 AM · Restricted Project
venkataramanan.kumar.llvm updated the diff for D86726: [InstCombine]: Transform 1.0/sqrt(X) * X to X/sqrt(X) and X * 1.0/sqrt(X) to X/sqrt(X).

Updated the patch as per comments received from Sanjay.

Sep 1 2020, 8:10 AM · Restricted Project

Aug 31 2020

venkataramanan.kumar.llvm added a comment to D86726: [InstCombine]: Transform 1.0/sqrt(X) * X to X/sqrt(X) and X * 1.0/sqrt(X) to X/sqrt(X).

Updated the patch with the "nsz" check.

Aug 31 2020, 11:12 PM · Restricted Project
venkataramanan.kumar.llvm updated the diff for D86726: [InstCombine]: Transform 1.0/sqrt(X) * X to X/sqrt(X) and X * 1.0/sqrt(X) to X/sqrt(X).
Aug 31 2020, 11:11 PM · Restricted Project
venkataramanan.kumar.llvm added inline comments to D86726: [InstCombine]: Transform 1.0/sqrt(X) * X to X/sqrt(X) and X * 1.0/sqrt(X) to X/sqrt(X).
Aug 31 2020, 11:13 AM · Restricted Project
venkataramanan.kumar.llvm updated the diff for D86726: [InstCombine]: Transform 1.0/sqrt(X) * X to X/sqrt(X) and X * 1.0/sqrt(X) to X/sqrt(X).

Updated the patch as per the review comments given by Sanjay.

Aug 31 2020, 1:34 AM · Restricted Project
venkataramanan.kumar.llvm updated the summary of D86726: [InstCombine]: Transform 1.0/sqrt(X) * X to X/sqrt(X) and X * 1.0/sqrt(X) to X/sqrt(X).
Aug 31 2020, 1:29 AM · Restricted Project

Aug 29 2020

venkataramanan.kumar.llvm added a comment to D86801: [InstCombine] add extra-use tests for fmul+sqrt; NFC.

Added comment in the test case.

Aug 29 2020, 11:24 AM · Restricted Project
venkataramanan.kumar.llvm updated the diff for D86801: [InstCombine] add extra-use tests for fmul+sqrt; NFC.
Aug 29 2020, 11:21 AM · Restricted Project
venkataramanan.kumar.llvm added inline comments to D86801: [InstCombine] add extra-use tests for fmul+sqrt; NFC.
Aug 29 2020, 11:10 AM · Restricted Project
venkataramanan.kumar.llvm updated the diff for D86801: [InstCombine] add extra-use tests for fmul+sqrt; NFC.

As per the comment given by Sanjay, updated the coverage test for x*1/sqrt(x) pattern.

Aug 29 2020, 9:22 AM · Restricted Project
venkataramanan.kumar.llvm added inline comments to D86801: [InstCombine] add extra-use tests for fmul+sqrt; NFC.
Aug 29 2020, 8:07 AM · Restricted Project
venkataramanan.kumar.llvm added inline comments to D86801: [InstCombine] add extra-use tests for fmul+sqrt; NFC.
Aug 29 2020, 12:52 AM · Restricted Project

Aug 28 2020

venkataramanan.kumar.llvm requested review of D86801: [InstCombine] add extra-use tests for fmul+sqrt; NFC.
Aug 28 2020, 12:04 PM · Restricted Project
venkataramanan.kumar.llvm added a comment to D86726: [InstCombine]: Transform 1.0/sqrt(X) * X to X/sqrt(X) and X * 1.0/sqrt(X) to X/sqrt(X).

Folding does not happen when the divide operand (say 1.0/something) has more than one use.

Aug 28 2020, 2:25 AM · Restricted Project
venkataramanan.kumar.llvm added inline comments to D86726: [InstCombine]: Transform 1.0/sqrt(X) * X to X/sqrt(X) and X * 1.0/sqrt(X) to X/sqrt(X).
Aug 28 2020, 1:13 AM · Restricted Project

Aug 27 2020

venkataramanan.kumar.llvm requested review of D86726: [InstCombine]: Transform 1.0/sqrt(X) * X to X/sqrt(X) and X * 1.0/sqrt(X) to X/sqrt(X).
Aug 27 2020, 10:34 AM · Restricted Project

Aug 23 2020

venkataramanan.kumar.llvm updated the diff for D86403: [DAGCombine]: Fold X/Sqrt(X) to Sqrt(X)..

Updated patch as per the comments from Sanjay.

Aug 23 2020, 7:03 AM · Restricted Project
venkataramanan.kumar.llvm added inline comments to D86403: [DAGCombine]: Fold X/Sqrt(X) to Sqrt(X)..
Aug 23 2020, 6:51 AM · Restricted Project
venkataramanan.kumar.llvm updated the diff for D86403: [DAGCombine]: Fold X/Sqrt(X) to Sqrt(X)..

Updated the patch as per the comments given by Sanjay. Adjusted the test cases.

Aug 23 2020, 5:53 AM · Restricted Project

Aug 22 2020

venkataramanan.kumar.llvm added a comment to D86403: [DAGCombine]: Fold X/Sqrt(X) to Sqrt(X)..

Hi Sanjay,

Aug 22 2020, 10:52 AM · Restricted Project
venkataramanan.kumar.llvm requested review of D86403: [DAGCombine]: Fold X/Sqrt(X) to Sqrt(X)..
Aug 22 2020, 10:49 AM · Restricted Project

Aug 13 2020

venkataramanan.kumar.llvm added a comment to D85709: [InstSimplify] Implement Instruction simplification for X/sqrt(X) to sqrt(X)..

After looking at the codegen, I'm not sure if we can do this transform in IR with the expected performance in codegen because the transform loses information:
https://godbolt.org/z/7b84rG

The codegen for the case of "sqrt(x)" has to account for a 0.0 input. I.e., we filter out a 0.0 (or potentially denormal) input to avoid the NaN answer that we would get from "0.0 / 0.0". But the codegen for the case of "x/sqrt(x)" does not have to do that - NaN is the correct answer for a 0.0 input, so the code has implicitly signaled to us that 0.0 is not a valid input when compiled with -ffast-math (we can ignore possible NaNs).

It might help to see the motivating code that produces the x/sqrt(x) pattern to see if there's something else we should be doing there.

Current AMD x86_64 targets don't have a reciprocal square-root instruction for double-precision types,
so x/sqrt(x) ends up as "vsqrtsd" followed by "vdivsd". This transform is basically to improve efficiency.

Ah, I see. I think we should handle that in generic DAGCombiner then. There, we can make the target- and CPU-specific trade-offs necessary to get the (presumably) ideal asm code. I don't know how we would recover the missing div-by-0 info that I mentioned here.
Let me know if you want to try that patch. If not, I can take a shot at it.

Aug 13 2020, 10:55 AM · Restricted Project
venkataramanan.kumar.llvm added a comment to D85709: [InstSimplify] Implement Instruction simplification for X/sqrt(X) to sqrt(X)..

After looking at the codegen, I'm not sure if we can do this transform in IR with the expected performance in codegen because the transform loses information:
https://godbolt.org/z/7b84rG

The codegen for the case of "sqrt(x)" has to account for a 0.0 input. I.e., we filter out a 0.0 (or potentially denormal) input to avoid the NaN answer that we would get from "0.0 / 0.0". But the codegen for the case of "x/sqrt(x)" does not have to do that - NaN is the correct answer for a 0.0 input, so the code has implicitly signaled to us that 0.0 is not a valid input when compiled with -ffast-math (we can ignore possible NaNs).

It might help to see the motivating code that produces the x/sqrt(x) pattern to see if there's something else we should be doing there.

Current AMD x86_64 targets don't have a reciprocal square-root instruction for double-precision types,
so x/sqrt(x) ends up as "vsqrtsd" followed by "vdivsd". This transform is basically to improve efficiency.

Ah, I see. I think we should handle that in generic DAGCombiner then. There, we can make the target- and CPU-specific trade-offs necessary to get the (presumably) ideal asm code. I don't know how we would recover the missing div-by-0 info that I mentioned here.
Let me know if you want to try that patch. If not, I can take a shot at it.

Aug 13 2020, 7:59 AM · Restricted Project
venkataramanan.kumar.llvm added a comment to D85709: [InstSimplify] Implement Instruction simplification for X/sqrt(X) to sqrt(X)..

After looking at the codegen, I'm not sure if we can do this transform in IR with the expected performance in codegen because the transform loses information:
https://godbolt.org/z/7b84rG

The codegen for the case of "sqrt(x)" has to account for a 0.0 input. I.e., we filter out a 0.0 (or potentially denormal) input to avoid the NaN answer that we would get from "0.0 / 0.0". But the codegen for the case of "x/sqrt(x)" does not have to do that - NaN is the correct answer for a 0.0 input, so the code has implicitly signaled to us that 0.0 is not a valid input when compiled with -ffast-math (we can ignore possible NaNs).

It might help to see the motivating code that produces the x/sqrt(x) pattern to see if there's something else we should be doing there.

Aug 13 2020, 12:34 AM · Restricted Project

Aug 11 2020

venkataramanan.kumar.llvm added a comment to D85709: [InstSimplify] Implement Instruction simplification for X/sqrt(X) to sqrt(X)..

I'm fairly sure this transform is a performance loss. For a target like Skylake Server, SQRT(x) can take up to 20 cycles, but RSQRT(x) is about 6 cycles and MUL(y) is 4 cycles. We'd be better off with X*RSQRT(X).

That is up to the backends to decide. InstSimplify/InstCombine (and a few others) are canonicalization, target-independent passes.
A single sqrt(x) is more canonical IR than x/sqrt(x), because it's fewer instructions and x has fewer uses.

I agree with that. It should be canonicalized. It would also be good to make sure that the backends have lowering code in place before introducing a 2x performance hit.

I agree with both.
Knowing that this is often a perf-critical pattern, we've put a lot of SDAG effort into optimization of it for x86 already, but it's a tricky problem because we get into a situation similar to the one we have with FMA. I.e., should we favor latency, throughput, or some combo?
Here's an example to show that we are trying to make the right decision per-CPU in codegen:
https://godbolt.org/z/s5GPqc
Note that the codegen choice can differ between scalar/vector - example in the X86.td file:

// FeatureFastScalarFSQRT should be enabled if scalar FSQRT has shorter latency
// than the corresponding NR code. FeatureFastVectorFSQRT should be enabled if
// vector FSQRT has higher throughput than the corresponding NR code.
// The idea is that throughput bound code is likely to be vectorized, so for
// vectorized code we should care about the throughput of SQRT operations.
// But if the code is scalar that probably means that the code has some kind of
// dependency and we should care more about reducing the latency.

Agreed, "FeatureFastScalarFSQRT" can be removed if the target thinks scalar FSQRT is costly. I see it is currently set for "SKXTuning" (Skylake).
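As a sketch of the X*RSQRT(X) strategy discussed above, the classic float bit-trick estimate stands in for a hardware RSQRT instruction, refined by one Newton-Raphson step. This is purely illustrative, not the actual SDAG expansion:

```c
#include <stdint.h>
#include <string.h>

/* Reciprocal-sqrt estimate plus one Newton-Raphson refinement; the magic
 * constant plays the role a hardware RSQRT estimate would. */
static float rsqrt_est(float x) {
    uint32_t i;
    memcpy(&i, &x, sizeof i);
    i = 0x5f3759df - (i >> 1);            /* crude initial estimate */
    float y;
    memcpy(&y, &i, sizeof y);
    return y * (1.5f - 0.5f * x * y * y); /* one NR step: ~0.2% max rel. error */
}

/* sqrt(x) approximated as x * rsqrt(x) -- the latency/throughput trade-off
 * discussed in this thread. */
static float approx_sqrt(float x) { return x * rsqrt_est(x); }
```

Whether this beats a direct SQRT instruction depends on the CPU's latencies, which is exactly why the choice lives in per-target tuning features rather than in IR canonicalization.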

Aug 11 2020, 9:42 AM · Restricted Project
venkataramanan.kumar.llvm added a comment to D85709: [InstSimplify] Implement Instruction simplification for X/sqrt(X) to sqrt(X)..

I'm fairly sure this transform is a performance loss. For a target like Skylake Server, SQRT(x) can take up to 20 cycles, but RSQRT(x) is about 6 cycles and MUL(y) is 4 cycles. We'd be better off with X*RSQRT(X).

That is up to the backends to decide. InstSimplify/InstCombine (and a few others) are canonicalization, target-independent passes.
A single sqrt(x) is more canonical IR than x/sqrt(x), because it's fewer instructions and x has fewer uses.

I agree with that. It should be canonicalized. It would also be good to make sure that the backends have lowering code in place before introducing a 2x performance hit.

Aug 11 2020, 7:46 AM · Restricted Project
venkataramanan.kumar.llvm updated the diff for D85709: [InstSimplify] Implement Instruction simplification for X/sqrt(X) to sqrt(X)..

Fixed test case to check for proper return value.

Aug 11 2020, 12:17 AM · Restricted Project
venkataramanan.kumar.llvm requested review of D85709: [InstSimplify] Implement Instruction simplification for X/sqrt(X) to sqrt(X)..
Aug 11 2020, 12:09 AM · Restricted Project

Apr 9 2019

venkataramanan.kumar.llvm added a comment to rL357943: [InstCombine] peek through fdiv to find a squared sqrt.

Looks OK to me.

Apr 9 2019, 10:55 PM

Jan 30 2018

venkataramanan.kumar.llvm added a comment to D42607: [LoopStrengthReduce, x86] don't add cost for a cmp that will be macro-fused (PR35681).

I ran SPEC2017 (C/C++ benchmarks) on Ryzen with -O2 -fno-unroll-loops. No significant change in performance with the patch.

Jan 30 2018, 10:08 PM
venkataramanan.kumar.llvm added a comment to D42607: [LoopStrengthReduce, x86] don't add cost for a cmp that will be macro-fused (PR35681).

I agree that removing the lengthy (9-byte) instructions and reducing the size of the loop is good. But on the performance side, I need to do some tests.

Jan 30 2018, 3:35 AM

Jan 17 2018

venkataramanan.kumar.llvm added a comment to D41723: Introduce the "retpoline" x86 mitigation technique for variant #2 of the speculative execution vulnerabilities disclosed today, specifically identified by CVE-2017-5715, "Branch Target Injection", and is one of the two halves to Spectre...

Per kernel https://marc.info/?l=linux-kernel&m=151580566622935&w=2 and GCC https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01059.html, it seems AMD needs there to be an lfence in the speculation trap (and the pause is not useful for them, but does no harm). There seems to be some speculation (but no confirmation yet?) that pause *is* necessary vs lfence on Intel. So in order to work generically, they seem to be suggesting using both instructions:

loop:
  pause
  lfence
  jmp loop

Some more links
https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01209.html
and final patch:
https://github.com/gcc-mirror/gcc/commit/a31e654fa107be968b802786d747e962c2fcdb2b

Yes, for AMD we require the "lfence" instruction after the "pause" in the "retpoline" loop filler. This solution has already been accepted in GCC and the Linux kernel.
Can you please do the same in LLVM as well?

Ahh, I see my email crossed yours, sorry.

Have you tested adding 'lfence' and this patch on any AMD platforms? Do you have any results? Can you confirm that these patches are actually working?

Jan 17 2018, 8:27 AM

Jan 15 2018

venkataramanan.kumar.llvm added a comment to D41723: Introduce the "retpoline" x86 mitigation technique for variant #2 of the speculative execution vulnerabilities disclosed today, specifically identified by CVE-2017-5715, "Branch Target Injection", and is one of the two halves to Spectre...

Per kernel https://marc.info/?l=linux-kernel&m=151580566622935&w=2 and GCC https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01059.html, it seems AMD needs there to be an lfence in the speculation trap (and the pause is not useful for them, but does no harm). There seems to be some speculation (but no confirmation yet?) that pause *is* necessary vs lfence on Intel. So in order to work generically, they seem to be suggesting using both instructions:

loop:
  pause
  lfence
  jmp loop

Some more links
https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01209.html
and final patch:
https://github.com/gcc-mirror/gcc/commit/a31e654fa107be968b802786d747e962c2fcdb2b

Jan 15 2018, 8:49 PM