Page MenuHomePhabricator

sroland (Roland Scheidegger)
User

Projects

User does not belong to any projects.

User Details

User Since
Dec 13 2016, 10:49 AM (196 w, 2 d)

Recent Activity

Wed, Sep 2

sroland accepted D87041: [Bindings] Move LLVMAddInstructionSimplifyPass to Scalar.cpp.

Looks good to me.

Wed, Sep 2, 10:30 AM · Restricted Project

Tue, Sep 1

sroland added a comment to D86764: [Bindings] Add LLVMAddInstructionSimplifyPass.

Actually it doesn't seem to work, I get undefined references to LLVMAddInstructionSimplifyPass at link time in mesa.
Not entirely sure why (would InstSimplifyPass.cpp have to include Scalar.h?), but moving the LLVMAddInstructionSimplifyPass() function to Scalar.cpp (all the other scalar passes have the wrapper function there already) fixes it.

Tue, Sep 1, 9:55 PM · Restricted Project
sroland accepted D86764: [Bindings] Add LLVMAddInstructionSimplifyPass.
Tue, Sep 1, 10:44 AM · Restricted Project

Fri, Aug 28

sroland added a comment to D86764: [Bindings] Add LLVMAddInstructionSimplifyPass.

Looks great to me, that'll certainly fit the bill for mesa (not sure though on the name should it have the Legacy in the name too for consistency - either way looks fine to me however).

Fri, Aug 28, 1:07 AM · Restricted Project

Thu, Aug 27

sroland added a comment to D85159: [ConstProp] Remove ConstantPropagation.

mesa/llvmpipe was using this pass, so it doesn't build anymore. If the proposed replacement is the InstSimplifyLegacyPass that's fine, however the old one was nicely available via the c interface (LLVMAddConstantPropagationPass) whereas the InstSimplifyLegacyPass is not.
(I'd note that we only use very few passes, and they are perhaps not optimal, but the goal is to avoid any expensive passes while still getting some optimization. So, not really insisting on InstSimplifyLegacyPass neither if there are other suitable replacements doing roughly the same thing - it would be nice though if we'd not have to do our own c++ wrapper as we try to avoid that when possible.)

Thu, Aug 27, 11:10 PM · Restricted Project

Sep 13 2018

sroland added a comment to D52043: [X86][SSE] Lower shuffles to permute(unpack(x,y)) (PR31151).

Looks alright to me, and cleaner than what I suggested in bug 31151. And the test results look great too.

Sep 13 2018, 2:27 PM

Jul 20 2018

sroland added inline comments to D46179: [X86] Lowering addus/subus intrinsics to native IR (LLVM part).
Jul 20 2018, 4:01 PM

May 17 2018

sroland added inline comments to D46179: [X86] Lowering addus/subus intrinsics to native IR (LLVM part).
May 17 2018, 12:04 PM
sroland added a comment to D47019: [X86] Lowering rotation intrinsics to native IR.

I don't really feel qualified to review this. But fwiw in mesa we never used rotate instructions, and for that matter we were not relying on shift intrinsics neither (we've got some cases where we actually need modulo width behavior so we're masking off the bits).
But omg do the larger-or-equal to bit width cases make things more complex... Shift is such a nice instruction - if just everybody could agree what the behavior should be when the shift count is larger than the bit width...

May 17 2018, 11:58 AM

May 16 2018

sroland added inline comments to D46179: [X86] Lowering addus/subus intrinsics to native IR (LLVM part).
May 16 2018, 9:23 AM
sroland added a comment to D46179: [X86] Lowering addus/subus intrinsics to native IR (LLVM part).

The addus looks right now, thanks.

May 16 2018, 8:51 AM

May 15 2018

sroland added a comment to D46179: [X86] Lowering addus/subus intrinsics to native IR (LLVM part).

FWIW as said I think could live with autoupgrade of at least the unsigned sat ones. These have similar complexity to pabs (pabs is cmp/sel/neg whereas these are add(or sub)/cmp/sel, although the add one also requires a constant), so if there were no concerns about the possibility of optimizations breaking pabs recognition, there might not be any here neither, if it actually works (which doesn't always seem to be the case according to the test cases?). (Although personally I'd have liked to keep pabs intrinsics too...)
Though disappearing intrinsics causing a null call in the generated code rather than some error when compiling remains a concern - but certainly I know about this one now so easily debugged :-).
Other than that, there's a bug in the logic of the addus, otherwise looks reasonable to me, though I'm not an expert on llvm internals.

May 15 2018, 10:16 PM

May 10 2018

sroland added a comment to D46179: [X86] Lowering addus/subus intrinsics to native IR (LLVM part).

I'm also concerned that the more complex patterns are easier for other optimizations to simplify a little and break. The simpler things like pmin/pmax or pabs aren't so bad to not match when they get optimized a little.

May 10 2018, 1:28 PM

May 9 2018

sroland added a comment to D46179: [X86] Lowering addus/subus intrinsics to native IR (LLVM part).

We're currently discussing the possible solutions for the JIT pipeline issue with @DavidKreitzer.

May 9 2018, 5:23 PM

May 7 2018

sroland added a comment to D45721: [X86] Lowering PACK*S (pack with saturation) intrinsics to native IR (LLVM side).

Hmm I suppose that's "more stuff which breaks our JIT compilation".
Removing these intrinsics actually is looking a bit ugly from my point of view (mesa llvmpipe developer).
One issue is that we take care to only emit instructions as necessary - that is if the packxxx instructions aren't available, we only do whatever is necessary to accomplish the pack - so if we already know we cannot have values outside the target range (or maybe we can have but we don't care for some reason) we make sure to only emit shuffle. But if packxxx is available we still prefer that of course.
And if we can't use intrinsics, we have to emit loads of other IR instructions (and they need to match what the autoupgrade would do), but still need a separate path for when we know the backend won't be able to use packxxx intrinsic, which is getting real ugly.

May 7 2018, 6:19 PM

Apr 25 2018

sroland added a comment to D44785: Lowering x86 adds/addus/subs/subus intrinsics (llvm part).

Is the IR being generated on the fly and then fed to the JIT? Which means
it doesn't go through the autoupgrade code since that is only done by the
bitcode reader and the ll parser? If that's the case, you'll need to
generate the replacement sequence directly instead of using the intrinsic.

Apr 25 2018, 12:17 PM

Dec 20 2016

sroland added a comment to D27756: [X86][SSE] Improve lowering of vXi64 multiplies .

This looks correct to me, always nice to shave off some instructions...

Dec 20 2016, 12:46 PM

Dec 15 2016

sroland added a comment to rL289846: [x86] use a single shufps for 256-bit vectors when it can save instructions.

Looks great to me.

Dec 15 2016, 6:30 PM

Dec 13 2016

sroland added a comment to D27692: [x86] use a single shufps when it can save instructions.

I'd like to propose the following:

1 - we get this patch and D27684 approved and committed, providing v4i32 lowering to shufps and avoiding some of the more unnecessary domain switches.
2 - get shufps lowering added to target shuffle combining, I added shufpd recently and it's just been the domain issues that I wanted to tidyup up before adding shufps as well
3 - add support for v8i32 (and v16i32?) lowering to shufps
4 - other missing domain switch patterns (scalar stores and vpermilps/vpshufd come to mind)
5 - add support for domain switching to target shuffle combine when the shuffle depth is 3 or more - this will allow pshufd use on pre-AVX targets and seems to introduce some good uses of insertps as well.

That seems within scope for 4.0 and doesn't involve anything too exotic. After 4.0 we should be in a better position to begin work on moving some of this work to MC combines to better make use of specific scheduler models

Dec 13 2016, 6:48 PM
sroland added a comment to D27692: [x86] use a single shufps when it can save instructions.

I'm not sure I completely agree on the details (In particular, I'm not certain the older non-AVX arches where the mov is needed have zero-latency movs), but I mostly agree. Especially since it looks like the same platforms that need the mov are the ones that *do* have transition penalty. So using a shufps may be worth it only with something like -march=nehalem -mtune=ivb, where we generate SSE code, but expect to run it on a newer platform.

Dec 13 2016, 1:44 PM
sroland added a comment to D27692: [x86] use a single shufps when it can save instructions.

Just wanted to point out the other direction for this also exists.

Dec 13 2016, 12:42 PM
sroland added a comment to D27692: [x86] use a single shufps when it can save instructions.

I think this should go in, just forget the domain penalties, as it shouldn't be an issue with most cpus. When I looked at this in Agner Fog's guides, my conclusion was that it is probably only really an issue with Nehalem and Via Nanos. If a cpu has just one clock additional latency back and forth it's still worth it replacing 3 shuffle instructions with one from the wrong domain (albeit the latency chain will be the same then) - and if it manages to only replace 2 shuffle instructions from the right domain it might be worse or better in such a case. (If it is actually worse with Nehalem with its 2 clock penalty back and forth would of course depend if some instruction mix is latency bound or throughput bound.)
Also, plenty of the more odd cpus either place all shuffles in int domain anyway or even do something more odd (like the original core2 merom). I suppose ideally the shuffle lowering code would take into account such hw cost differences, but the truth is right now it doesn't really model any of this (e.g. on some cpus unpacks might not have the same cost as pshufd neither and so on).
So, I'm all for it (and suggested fixing it the same in https://llvm.org/bugs/show_bug.cgi?id=27885).

Dec 13 2016, 11:02 AM