hfinkel (Hal Finkel)
User

Projects

User does not belong to any projects.

User Details

User Since
Nov 16 2012, 6:02 PM (264 w, 4 d)

Recent Activity

Today

hfinkel added a comment to D41096: [X86] Initial support for prefer-vector-width function attribute.

The dependency isn't really an llvm dependency. You're absolutely right that if it were just that, it would be fixable by changing our instruction patterns. The problem here is in how the features and dependencies are defined in our instruction manuals. avx512f means "avx512 foundation", but only includes 512-bit instructions. The later features imply that the foundation is present.

A user may expect that if they use "-march=skylake-avx512" they can use 512-bit intrinsics in x86intrin.h. But at the same time, we should generate well-performing code for other users who specify -march=skylake-avx512 but didn't write any vector code in their source. Based on the data we've seen so far, the best way to do that is to disable 512-bit instructions.

Understood. However, we can separate this within the implementation. Specifically, Clang does not need to tag all generated functions with the same target attribute. On the LLVM side, we can separate these, and then use that finer-grained control.

So what I'm looking for is a solution that optimizes for the likely case that the user code doesn't contain 512-bit vector intrinsics and tells the legalizer to use 256-bit registers only. But if the user code does contain explicit 512-bit intrinsics, we still allow that code to compile and ideally generate what the user expected. I don't want existing 512-bit intrinsic code to suddenly stop compiling with -march=skylake-avx512.

I think that we're on the same page. In addition to intrinsics, I'm also worried about OpenMP SIMD loops/functions (or other explicit vectorization). We may want those to also give 512-bit vectors by default (presumably, such specifically-tagged loops are likely to run long enough to amortize the clock-rate effects).

getHostCPUName/getHostCPUFeatures is called by the clang driver long before any code is parsed. How would it know if the code contained any 512-bit instructions?

I recommend that we do this in Clang during CodeGen. We should add a special callback that will allow TargetInfo to examine the AST and adjust the target (or target features) on a per-function basis. Any function using AVX-512 intrinsics on 512-bit vectors, explicit vector types, or OpenMP SIMD (unless a small simdlen clause is present) will stay as-is, and everything else will be modified to turn off 512-bit vectors.

I definitely don't want to change the target-cpu string added by clang. Maybe a target feature. This problem may continue in follow on CPUs after skx and I don't want to start coding a specific list of CPUs into clang. We already have more information about CPU feature mapping in clang than I think we would really like.

The real question is whether to block inlining on mismatch here. I don't think that we should (and we'll need to enable 512-bit vectors in the caller). The problem is that people write C++ wrappers around vector intrinsics, and we need the compiler to remove the abstraction layer. Generating that code poorly will be a significant problem. This will have the unfortunate "action at a distance" effects we discussed earlier (because having some 512-bit vectors in some function, even after inlining, will suddenly enable it elsewhere in the function), but I don't see any good way to prevent that in undesirable cases without causing significant problems elsewhere.

I agree we don't want to block inlining on a mismatch. Do we have a way to allow targets to control the merging behavior? If we do this as part of the "target-feature" or "target-cpu" attribute we would need custom merging.

We don't currently; we only have areInlineCompatible in TTI, which is called like this:

return TTI.areInlineCompatible(Caller, Callee) &&
       AttributeFuncs::areInlineCompatible(*Caller, *Callee);

We also have:

AttributeFuncs::mergeAttributesForInlining(*Caller, *Callee);

Adding a corresponding TTI function, and calling it in the two places where AttributeFuncs::mergeAttributesForInlining is called, would be straightforward.

Alternatively, I was thinking about a separate "required-vector-width" attribute. Either clang codegen, or an early IR pass

I prefer that we do this in Clang's CodeGen. We just don't have enough information at the IR level to differentiate between things the user explicitly requested and things that have been added by some earlier stage automatically (plus, the pass method would need to rely on pass injection, or similar, and that won't work with frontends with custom pipelines anyway).

Does "pass injection" here mean having a hook to put in a target specific pass or something else?

Wed, Dec 13, 12:42 PM
hfinkel added inline comments to D40196: Add hooks to MachineLICM to hoist invariant stores..
Wed, Dec 13, 12:27 PM
hfinkel added a comment to D41096: [X86] Initial support for prefer-vector-width function attribute.

Wed, Dec 13, 11:42 AM
hfinkel added a comment to D41096: [X86] Initial support for prefer-vector-width function attribute.

Wed, Dec 13, 10:34 AM
hfinkel added a comment to D41096: [X86] Initial support for prefer-vector-width function attribute.

Yes, there are two variants of skylake-avx512, but there doesn't seem to be a good way of autodetecting this for -march=native.

Your optimization manual suggests a relative timing test, so I'm guessing there's not (I wouldn't want to use that for -march=native because it wouldn't be deterministic). As a result, I think we'll just need to make a choice based on some combination of which is likely to be most common among our users and which is likely best on future hardware. Users will need to explicitly specify the architecture to get the other one.

There is avx512_2ndFMA (PIROM offset 70h bit 0), see: https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-scalable-datasheet-vol-1.pdf

Wed, Dec 13, 9:54 AM
hfinkel added a comment to D41096: [X86] Initial support for prefer-vector-width function attribute.

How/where do you propose detecting the presence or absence of 512-bit instructions to change the CPU name?

I think the path through clang's CodeGenFunction::checkTargetFeatures() is where we tell the user if they've used an intrinsic without the required CPU attribute. So somewhere around there might work?

Wed, Dec 13, 9:36 AM
hfinkel added a comment to D37076: [LICM] Allow sinking when foldable in loop.

Thanks for revisiting this.
Made some changes, such as the way we iterate users in sink(), while rebasing it.

Wed, Dec 13, 8:41 AM
hfinkel added inline comments to D41029: [JumpTables][PowerPC] Let targets decide which switch instructions are suitable for jump tables.
Wed, Dec 13, 8:02 AM

Yesterday

hfinkel added inline comments to D40782: [tablegen] Add !codeconcat operator which works like !strconcat.
Tue, Dec 12, 10:18 PM
hfinkel added inline comments to D41029: [JumpTables][PowerPC] Let targets decide which switch instructions are suitable for jump tables.
Tue, Dec 12, 10:05 PM
hfinkel added a comment to D40893: [PowerPC] fix a bug in TCO eligibility check.

Can you also update the patch description? One inline request as well. You can also handle the inline request as a separate commit.

I updated the description.

About the URL in the comment: it is an example of a case where GCC can do sibling call optimization even with a byval parameter, rather than an additional explanation of why this is a workaround; the example is too large to embed in a comment.
How about changing "See: https://..." to "GCC can do such sibling call optimization, such as https://...", or just removing "See: https://..."? Any preference?
Anyway, I'd like to touch up this comment later in another patch.

Tue, Dec 12, 9:38 PM
hfinkel added inline comments to D40855: [PowerPC] LSR tunings for PowerPC.
Tue, Dec 12, 9:14 PM
hfinkel added a comment to D40049: [PATCH] Global reassociation for improved CSE.

A couple of minor comments (feel free to address as you commit).

Tue, Dec 12, 8:37 PM
hfinkel added inline comments to D38566: [SimplifyCFG] don't sink common insts too soon (PR34603).
Tue, Dec 12, 8:20 PM
hfinkel added a comment to D41136: [EarlyCSE] recognize commuted and swapped variants of min/max as equivalent (PR35642).

Could you use matchSelectPattern (from ValueTracking) here instead of matching the individual min/max patterns? That will also give us absolute value and the floating-point min/max as well. I think that you can hash_combine the fields of SelectPatternResult, and the logic might be simpler overall too (hopefully).

Tue, Dec 12, 7:46 PM
hfinkel accepted D41139: [SLPVectorizer] Don't ignore scalar extraction instructions of aggregate value.

The current code generation is also odd on x86 (for the same reason). Returning the aggregate forces the values to be placed in specific GPRs. That makes the vectorization not worthwhile. Can you please also add this test into the X86 directory (running for x86, obviously) as well?

Tue, Dec 12, 7:22 PM
hfinkel added a comment to D41096: [X86] Initial support for prefer-vector-width function attribute.

I really do want to make sure that I understand the problem (before I continue suggesting solutions). To summarize:

  • AVX-512 is twice the length of AVX2, and so using AVX-512 over AVX2 should give a 2x speedup, but...

I want to avoid using the term AVX-512 and AVX2 here and use ZMM and YMM or vector width. There are new instructions introduced after AVX512F as part of the AVX512VL instruction set that use only XMM and YMM registers and are not subject to this frequency issue. Our documentation really doesn't make that clear as it uses "AVX2".

Tue, Dec 12, 4:54 PM
hfinkel added inline comments to D32872: [PowerPC] Leverage PGO data to version/expand small/large memcpy calls.
Tue, Dec 12, 4:20 PM
hfinkel added inline comments to D31772: [PowerPC] Add pass to expand extra memcpy calls.
Tue, Dec 12, 4:08 PM
hfinkel added a comment to D41096: [X86] Initial support for prefer-vector-width function attribute.

How do you propose to control the cap? I don't think we want to default it to 256 for skx, as that would make our codegen worse than (or at the very least very different from) our avx2 codegen.

Tue, Dec 12, 3:05 PM
hfinkel added a comment to D41096: [X86] Initial support for prefer-vector-width function attribute.

@hfinkel, sorry our updates passed each other.

I see your point, we shouldn't release the preference just because the user did something explicit. So in that case we would need to unconstrain the legalizer, but still keep the TTI interface reporting 256 so the vectorizer won't go out of its way to generate large vectors in the same function. And potentially add more cost-modeling enhancements and spot fixes into the codegen lowering.

So we probably do need another function attribute to indicate the safe width for the legalizer.

I think that we should have the preference separate from legalization, but also have a hard cap that the vectorizer uses. The vectorizer will generate larger vectors than the current max for loops with a mixture of types, under the assumption that it's better to fill the vectors of the smaller types, even if that means legalization splitting the vectors of larger types. We could cap this behavior, or have the vectorizer do manual legalization in that case. I think that the latter is probably the best that we can do (but a cap will do in the meantime).

Tue, Dec 12, 1:25 PM
hfinkel updated subscribers of D39536: [PowerPC] Eliminate redundant register copys after register allocation.

Is this something that RDF copy propagation would do for us as well? @kparzysz might know. FWIW, if this is something that RDF would help with, I'd rather see our efforts put there rather than recreating a less-powerful version of that infrastructure just for this specific case.

Tue, Dec 12, 12:30 PM
hfinkel added a comment to D41096: [X86] Initial support for prefer-vector-width function attribute.

Tue, Dec 12, 12:23 PM
hfinkel added a comment to D41096: [X86] Initial support for prefer-vector-width function attribute.

If it finds any vector code, the pass would override the prefer-vector-width attribute to the length of the widest vector in that code.

That makes sense... sort of. If this target-specific pass runs as part of the optimization pass pipeline (as opposed to the codegen pass pipeline), we need to make sure we generate correct code even if the optimization pass doesn't run (if someone is using -O0, opt-bisect, etc.). So we probably need two attributes: one clang attribute "try to avoid zmm registers if possible" and one attribute "disable all avx-512 instructions which use zmm registers", and we only set the second attribute if we've analyzed the function and proved we don't need to use zmm registers.

Hrmm. I thought that this "preference" attribute was only going to affect TTI and other cost-model-based optimized-driven vectorization, not the lowering of explicit vectors. Does it do the latter?

Tue, Dec 12, 11:53 AM
hfinkel added a comment to D41096: [X86] Initial support for prefer-vector-width function attribute.

Tue, Dec 12, 11:50 AM
hfinkel accepted D31852: [PowerPC] Convert reg/reg instructions fed by constants to reg/imm instructions (pre and post RA).

@hfinkel What do you think about this? Do we want to proceed with this? We actually convert a lot of instructions with this and it has a proven effect on performance. The updated patch actually converts and deletes an extra ~1,000 instructions in bootstrap.

Tue, Dec 12, 8:07 AM

Mon, Dec 11

hfinkel added a comment to D41104: Set the NoRecurse attribute for the dbg intrinsics..

This looks fine, and I imagine we'll need to do this for almost all intrinsics.

Mon, Dec 11, 9:51 PM
hfinkel accepted D40742: [LV] Fix PR34965 - Extend InstWidening with CM_Widen_Recursive.

LGTM

Mon, Dec 11, 9:39 PM
hfinkel added a comment to D39352: [SimplifyCFG] Don't do if-conversion if there is a long dependence chain.

You have a bunch of variable names here with underscores, which is not our usual convention. BB1_chain -> BB1Chain, etc.

Mon, Dec 11, 9:33 PM
hfinkel added a comment to D38566: [SimplifyCFG] don't sink common insts too soon (PR34603).

Ping * 2.

Mon, Dec 11, 8:37 PM
hfinkel added a comment to D38886: Try to avoid prefetches from disrupting folding of loads..

Are you looking for SDNode's hasPredecessor (or hasPredecessorHelper if you can do more than one at a time)? Such dependency checks are not uncommon.

Sorry, forgot to answer this when I updated the patch earlier today. That's good to know, but I still think that it is so much simpler to do this in a reasonable way to begin with. Especially since this should be a cross-platform win during isel at least. Besides, that is an expensive check.

Mon, Dec 11, 8:26 PM
hfinkel accepted D37076: [LICM] Allow sinking when foldable in loop.

As r314923 was reverted, we cannot rely on getUserCost in the foldability check for GEPs. For now, I checked GEP's users in LICM directly. Please let me know if you'd prefer another way.

Mon, Dec 11, 8:22 PM
hfinkel added inline comments to D38978: [OpenMP] Enable the lowering of implicitly shared variables in OpenMP GPU-offloaded target regions to the GPU shared memory.
Mon, Dec 11, 7:53 PM
hfinkel added inline comments to D38778: Implement rudimentary support for the PowerPC SPE APU.
Mon, Dec 11, 7:16 PM
hfinkel added a comment to D41096: [X86] Initial support for prefer-vector-width function attribute.

The loop vectorizer definitely creates wider vectors even though it's told not to.

For one, it only considers the scalar types of loads, stores, and phis when determining the VF. So if all your loads/stores use i32, but some operations, like compares or address calculations, use i64 types due to zext/sext, the vectorizer doesn't see them when determining the VF. I don't know enough about the vectorizer to say whether that should be fixed or not.

Mon, Dec 11, 5:57 PM
hfinkel added a comment to D40975: [LangRef] Reflect the changes in the TBAA info format.

I think that we're on the same page. I'll also start reviewing the other patches.

Mon, Dec 11, 5:47 AM

Thu, Dec 7

hfinkel added a comment to D38886: Try to avoid prefetches from disrupting folding of loads..

I then thought about implementing a custom isel for SystemZ / INSERT_VECTOR_ELT, which seemed to be possible but tedious. It would still have to be proved that the INSERT_VECTOR_ELT was not dependent somehow on the prefetch node, which seems more doable here in a limited context.

Thu, Dec 7, 1:24 AM

Tue, Dec 5

hfinkel added inline comments to D40855: [PowerPC] LSR tunings for PowerPC.
Tue, Dec 5, 2:36 PM
hfinkel added inline comments to D39694: [VerifyDiagnosticConsumer] support -verify=<prefixes>.
Tue, Dec 5, 11:38 AM
hfinkel added inline comments to D38413: [PowerPC] Add handling for ColdCC calling convention and a pass to mark candidates with coldcc attribute.
Tue, Dec 5, 11:19 AM

Mon, Dec 4

hfinkel added a comment to D39805: [Power9] Set MicroOpBufferSize for Power 9.

I think that this is fine, as a work around, but I don't quite understand the problem based on this description. I'm thinking that either:

  1. Like on the P8 (and previous), there are some in-order resources in the decoder/dispatcher (which should be modeled as in-order resources).
  2. There are resources for which we're significantly underestimating the buffer size: the current P9 scheduling model only sets BufferSize for BR, which is fine, but as we don't currently schedule branches, it's essentially documentation. Maybe we should be setting BufferSize for other things.

(or some combination of the two). Based on the patch and its description, it sounds like the latter, but I'm not sure.

Mon, Dec 4, 8:50 PM
hfinkel added a comment to D39694: [VerifyDiagnosticConsumer] support -verify=<prefixes>.

I think this is a good idea.

Mon, Dec 4, 7:28 PM

Sat, Dec 2

hfinkel committed rL319629: Revert "[CodeGen] Add initial support for union members in TBAA".
Revert "[CodeGen] Add initial support for union members in TBAA"
Sat, Dec 2, 7:10 PM
hfinkel committed rC319629: Revert "[CodeGen] Add initial support for union members in TBAA".
Revert "[CodeGen] Add initial support for union members in TBAA"
Sat, Dec 2, 7:10 PM
hfinkel added a comment to D39455: [CodeGen] Add initial support for union members in TBAA.

Once you confirm the bug, could you possibly revert the patch?

Sat, Dec 2, 1:13 PM · Restricted Project

Tue, Nov 28

hfinkel accepted D39455: [CodeGen] Add initial support for union members in TBAA.

I think this looks fine (and I agree that the underlying premise makes sense). Before you commit, could you also please add some tests with nested union members?

Tue, Nov 28, 9:36 PM · Restricted Project

Tue, Nov 21

hfinkel added a comment to D40275: [CUDA] Report "unsupported VLA" errors only on device side..

When Sema sees this code during compilation, it cannot tell whether there is an error. Calling foo from host code is perfectly valid. Calling it from device code is not. CUDADiagIfDeviceCode creates 'postponed' diagnostics which only get emitted if we ever need to generate code for the function on the device.

Tue, Nov 21, 10:13 AM

Thu, Nov 16

hfinkel added a comment to D40150: [LibCallSimplifier] fix pow(x, 0.5) -> sqrt() transforms.

This would be a lot simpler if we could disregard errno when we have relaxed FP math, but I'm assuming it's possible to do something like '-ffast-math -fmath-errno' and still expect errno to be set for domain errors?

nnan implies we don't care about the result in cases where it would be nan, so it's probably okay if we lower an nnan "pow(x, 0.5)" to something which never sets errno. But there's still the underlying problem that the backend can't lower the intrinsics correctly for targets where libm sets errno, so we shouldn't generate libm intrinsics for code which isn't already using them.

Thu, Nov 16, 4:33 PM
hfinkel added a comment to D40144: Implement `std::launder`.

At least for GCC, it should use __builtin_launder.

Thu, Nov 16, 12:41 PM

Nov 10 2017

hfinkel added a comment to D39611: [CodeGen] change const-ness of complex calls.

Thanks for the clarification!

If I'm reading this properly, we should make the same kind of change as in D39481 ('c' -> 'e') for most of complex.h. Ie, the standard allows errno-setting, and it's (unfortunately for optimization) even more clearly stated in the newer additions to the standards.

We can leave these functions as always constant ('c') because they don't actually do any math and therefore won't set errno:
cimag ( http://en.cppreference.com/w/c/numeric/complex/cimag )
creal ( http://en.cppreference.com/w/c/numeric/complex/creal )
cproj ( http://en.cppreference.com/w/c/numeric/complex/cproj )
conj (http://en.cppreference.com/w/c/numeric/complex/conj )

Nov 10 2017, 10:22 AM

Nov 9 2017

hfinkel added a comment to D39611: [CodeGen] change const-ness of complex calls.

Patch updated:
I don't know if we have agreement on the behavior that we want yet,

I just sent a note to the Austin group mailing list to see if the POSIX folks agree with my reading. I'll follow up.

Nov 9 2017, 4:28 PM

Nov 8 2017

hfinkel added inline comments to D38417: [test-suite] Adding HACCKernels app.
Nov 8 2017, 9:55 AM

Nov 7 2017

hfinkel accepted D39767: [OpenMP] Removing unused testsuite/ directory.

LGTM

Nov 7 2017, 3:42 PM · Restricted Project
hfinkel added a comment to D39611: [CodeGen] change const-ness of complex calls.

Patch updated:
I don't know if we have agreement on the behavior that we want yet,

Nov 7 2017, 3:24 PM
hfinkel accepted D39732: [Analysis] Fix merging TBAA tags with different final access types.
Nov 7 2017, 9:21 AM · Restricted Project
hfinkel added inline comments to D39732: [Analysis] Fix merging TBAA tags with different final access types.
Nov 7 2017, 8:36 AM · Restricted Project

Nov 5 2017

hfinkel added a comment to D39575: [X86] Add subtarget features prefer-avx256 and prefer-avx128 and use them to limit vector width presented by TTI.

Should we be looking at this from a cost model POV? Possibly introducing the concept of a cost for "first use" of a vector type (in this case for 512-bit vectors), increasing the cost of 512-bit ops and making the cost models more 'state aware' (use of a 512-bit vector causes a cost multiplier effect on ALL vector ops).

Nov 5 2017, 7:01 AM

Nov 3 2017

hfinkel added a comment to D39611: [CodeGen] change const-ness of complex calls.

In general, because all of these functions are well defined over the entire complex domain, they don't have errors. The "no errors defined" is likely correct.

For one, it is not true that all of these functions are well-defined over the entire complex domain. For example, according to the C standard, catanh(1) raises a divide-by-zero error.

Nov 3 2017, 7:28 PM
hfinkel added a comment to D38886: Try to avoid prefetches from disrupting folding of loads..

So, with this change, what happens? Do the prefetches just all get chained together along with the last store (or other memory operation)?

The initial SelectionDAG of the added test case looks like this (in part):

  ...
t53: ch = TokenFactor t37:1, t49:1, t50:1, t51:1, t52
    t48: i64 = add t2, Constant:i64<2220>
  t54: i32,ch = load<LD4[%scevgep37]> t53, t48, undef:i64
  t66: i32,ch = load<LD4[%20]> t53, GlobalAddress:i64<[150 x %type0]* @Mem> + 408, undef:i64
      t14: i64 = add t2, Constant:i64<237540>
    t67: ch = Prefetch<LD1[%scevgep26]> t53, t14, Constant:i32<0>, Constant:i32<3>, Constant:i32<1>
  t68: ch = TokenFactor t54:1, t66:1, t67
    t61: i64 = add t2, Constant:i64<3108>
  t69: i32,ch = load<LD4[%scevgep36]> t68, t61, undef:i64
      t12: i64 = add t2, Constant:i64<237984>
    t70: ch = Prefetch<LD1[%scevgep25]> t68, t12, Constant:i32<0>, Constant:i32<3>, Constant:i32<1>
  t71: ch = TokenFactor t69:1, t70
    t63: i64 = add t2, Constant:i64<3552>
  t72: i32,ch = load<LD4[%scevgep35]> t71, t63, undef:i64
      t10: i64 = add t2, Constant:i64<238428>
    t73: ch = Prefetch<LD1[%scevgep24]> t71, t10, Constant:i32<0>, Constant:i32<3>, Constant:i32<1>
  t74: ch = TokenFactor t72:1, t73
...

The result seems to be that each (load) prefetch gets chained in parallel with one load, and these in turn are the inputs to a TokenFactor (the new DAG root).

Nov 3 2017, 6:14 PM
hfinkel added a comment to D39611: [CodeGen] change const-ness of complex calls.

The POSIX docs all have language like this for complex calls:
"No errors are defined."

I would not trust the POSIX documentation.

carg() is an alias for atan2(), and cabs() is an alias for hypot(), so they should be marked the same way. (On glibc, cabs/hypot will set errno; not sure about carg/atan2.)

Nov 3 2017, 5:32 PM
hfinkel accepted D38417: [test-suite] Adding HACCKernels app.

LGTM

Nov 3 2017, 10:21 AM

Nov 2 2017

hfinkel added a comment to D38886: Try to avoid prefetches from disrupting folding of loads..

So, with this change, what happens? Do the prefetches just all get chained together along with the last store (or other memory operation)?

Nov 2 2017, 6:55 PM
hfinkel added inline comments to D38417: [test-suite] Adding HACCKernels app.
Nov 2 2017, 6:48 PM
hfinkel accepted D39345: SCEV: preserve debug information attached to PHI nodes..

LGTM

Nov 2 2017, 4:13 PM
hfinkel added a comment to D37163: [LICM] sink through non-trivially replicable PHI.

Rebased. I will commit this if the change in PowerPC test (subreg-postra-2.ll) is okay.

Nov 2 2017, 1:32 PM
hfinkel accepted D39463: [Analysis] Refine matching and merging of TBAA tags.

LGTM

Nov 2 2017, 11:20 AM
hfinkel added a comment to D39345: SCEV: preserve debug information attached to PHI nodes..

Thanks for confirming! Also, unfortunate :-(
Do you have any feedback on the patch for the zero-delta-case as is that I should be addressing before landing this?

Nov 2 2017, 11:14 AM
hfinkel added a comment to D39481: [CodeGen] fix const-ness of builtin equivalents of <math.h> and <complex.h> functions that might set errno.

There's an oddity with fma. The version without __builtin has 'e' already

Nov 2 2017, 11:10 AM

Nov 1 2017

hfinkel added a comment to D39304: [IR] redefine 'reassoc' fast-math-flag and add 'trans' fast-math-flag.

I'm happy; please go ahead when the other reviewers are too.

Nov 1 2017, 8:25 PM
hfinkel added a comment to D39463: [Analysis] Refine matching and merging of TBAA tags.

This looks pretty good.

Nov 1 2017, 8:17 PM
hfinkel added a comment to D39345: SCEV: preserve debug information attached to PHI nodes..

I moved the logic over to indvars now, and that seems to fit naturally into WidenIV::createWideIV() with about the same functionality as above.

But I'm having trouble generating more interesting testcases. For example: The top-level comment in IndVarSimplify claims that the pass would

//   1. The exit condition for the loop is canonicalized to compare the
//      induction value against the exit value.  This turns loops like:
//        'for (i = 7; i*i < 1000; ++i)' into 'for (i = 0; i != 25; ++i)'

But I can't reproduce this:

char *a;
void f(int j) {
  for (int i = 7; i*i < 1000; i+=1)
      a[i] = i;
}

clang -O3 -c -S -emit-llvm -o - /tmp/loop.c -fno-unroll-loops -mllvm -print-after-all -g 2>&1

*** IR Dump After Induction Variable Simplification ***
for.body:                                         ; preds = %entry, %for.body
  %indvars.iv = phi i64 [ 7, %entry ], [ %indvars.iv.next, %for.body ]

Has the functionality bitrotted, or is the comment out-of-date, or is there now a different transformation that would rewrite a loop like the one in that comment? Or can I change my testcase somehow?

Nov 1 2017, 8:06 PM
hfinkel added a comment to D31852: [PowerPC] Convert reg/reg instructions fed by constants to reg/imm instructions (pre and post RA).

This is really interesting. So we're seeing a number of cases where, after instruction selection, we're doing something that makes operations have constant operands. Do you know what that is? Maybe one thing we could do with this is have it assert, instead of transforming the code, and then reduce the test cases in order to understand why this happens.

I don't think that we can catch all of these early enough not to need something like this. I think that fundamentally, we're limited in our ability to select the right instruction when values come from other blocks. Most of the early cases are likely due to SDAG being bb-local.

Nov 1 2017, 6:45 PM

Oct 31 2017

hfinkel added inline comments to D37163: [LICM] sink through non-trivially replicable PHI.
Oct 31 2017, 4:15 PM

Oct 30 2017

hfinkel added a comment to D38566: [SimplifyCFG] don't sink common insts too soon (PR34603).

Patch updated:
Now that we have SimplifyCFGOptions exposed to the pass managers, we can precisely disable sinking of common instructions in only the first instantiation of -simplifycfg (the one that runs before early-cse).

Oct 30 2017, 7:56 PM
hfinkel added inline comments to D31772: [PowerPC] Add pass to expand extra memcpy calls.
Oct 30 2017, 7:47 PM
hfinkel added inline comments to D31772: [PowerPC] Add pass to expand extra memcpy calls.
Oct 30 2017, 7:45 PM
hfinkel added a comment to D39404: DAG: Add nuw when splitting loads and stores.

Looks right to me. Regarding a helper, the comment is longer than the code, so you should name the helper so that you can move the comment too. Maybe getObjectOffsetWrapFlags or similar.

Oct 30 2017, 7:17 PM
hfinkel added inline comments to D38313: [InstCombine] Introducing Aggressive Instruction Combine pass.
Oct 30 2017, 7:02 PM
hfinkel added a comment to D31852: [PowerPC] Convert reg/reg instructions fed by constants to reg/imm instructions (pre and post RA).

This is really interesting. So we're seeing a number of cases where, after instruction selection, we're doing something that makes operations have constant operands. Do you know what that is? Maybe one thing we could do with this is have it assert, instead of transforming the code, and then reduce the test cases in order to understand why this happens.

Oct 30 2017, 6:32 PM
hfinkel added inline comments to D39304: [IR] redefine 'reassoc' fast-math-flag and add 'trans' fast-math-flag.
Oct 30 2017, 5:48 PM
hfinkel added inline comments to D39304: [IR] redefine 'reassoc' fast-math-flag and add 'trans' fast-math-flag.
Oct 30 2017, 3:05 PM
hfinkel accepted D39434: Allow inaccessiblememonly and inaccessiblemem_or_argmemonly to be overwriten on call site with operand bundle.

LGTM

Oct 30 2017, 2:55 PM
hfinkel accepted D38783: [CGP] Fix the detection of trivial case for addressing mode.

This logic certainly looks cleaner.

Oct 30 2017, 2:47 PM
hfinkel added a comment to D39344: [PowerPC] Add implementation for -msave-toc-indirect option - llvm portion.

I'd like to think that we already have something that would clean this up, but apparently we don't. We should.

I was under the impression that this patch was supposed to be that something that cleans it up.

Oct 30 2017, 1:24 PM
hfinkel added a comment to D39386: [Power9] Allow gpr callee saved spills in prologue to vector registers rather than stack.

This is pretty neat.

Is the CFI limitation temporary? I imagine that you can use MCCFIInstruction::createRegister to indicate that the value that used to be in the GPR is now in the vector register?

Will we also be able to use this same infrastructure in order to save callee-saved condition registers to otherwise-available GPRs (instead of to the stack)?

The CFI limitation is temporary. I did notice the createRegister, but I don't think there is support in the debuggers/unwinders for moving from a vector register to a GPR.

Oct 30 2017, 1:18 PM

Oct 29 2017

hfinkel accepted D39407: [(new) Pass Manager] instantiate SimplifyCFG with the same options as the old PM.

LGTM

Oct 29 2017, 11:03 AM

Oct 27 2017

hfinkel added a comment to D39386: [Power9] Allow gpr callee saved spills in prologue to vector registers rather than stack.

This is pretty neat.

Oct 27 2017, 7:46 PM
hfinkel added inline comments to D39304: [IR] redefine 'reassoc' fast-math-flag and add 'trans' fast-math-flag.
Oct 27 2017, 7:08 PM
hfinkel added a comment to D39345: SCEV: preserve debug information attached to PHI nodes..

First, should this logic be here or should it be in IndVarSimplify? I ask because there are transformations that can introduce new AddRecs into an existing loop without replacing the induction variable. Only the transformation using the SCEVExpander knows if it is planning to actually replace the induction variable.

Oct 27 2017, 6:02 PM
hfinkel accepted D38631: [SimplifyCFG] use pass options and remove the latesimplifycfg pass.

Patch updated:
Use the "late" options in the default SimplifyCFGPass constructor. This makes the next (functional) patch smaller and eliminates all of the diffs from lib/Passes/PassBuilder.cpp in this one.

Oct 27 2017, 4:57 PM
hfinkel added a comment to D38631: [SimplifyCFG] use pass options and remove the latesimplifycfg pass.

Patch updated:

  1. Keep default SimplifyCFGPass constructor and use it where appropriate (reminder: this patch is NFCI, so I preserved the non-default instantiations in the new PM with one FIXME comment).
Oct 27 2017, 12:27 PM
hfinkel accepted D39108: [DAGCombine] Don't combine sext with extload if sextload is not supported and extload has multi users.

LGTM

Oct 27 2017, 12:10 PM
hfinkel added a comment to D39376: [PowerPC] Add implementation for -msave-toc-indirect option - clang portion.

Please upload diffs with full context.

Oct 27 2017, 11:39 AM
hfinkel requested changes to D39344: [PowerPC] Add implementation for -msave-toc-indirect option - llvm portion.

I want to take a step back and discuss what problem this is trying to solve and how to best solve it. I understand that GCC has a flag for this (added in 4.7), but it is not clear to me that this is the best approach. Two thoughts:

Oct 27 2017, 10:43 AM
hfinkel added inline comments to D39108: [DAGCombine] Don't combine sext with extload if sextload is not supported and extload has multi users.
Oct 27 2017, 10:21 AM
hfinkel added inline comments to D39304: [IR] redefine 'reassoc' fast-math-flag and add 'trans' fast-math-flag.
Oct 27 2017, 8:11 AM

Oct 26 2017

hfinkel added inline comments to D38413: [PowerPC] Add handling for ColdCC calling convention and a pass to mark candidates with coldcc attribute.
Oct 26 2017, 9:09 PM
hfinkel added a comment to D39173: [MachineCSE] Add new callback for is caller preserved or constant physregs..

Oh, and please retitle this before you commit. You should talk about caller-preserved registers, not TLS.

Oct 26 2017, 8:28 PM
hfinkel accepted D39173: [MachineCSE] Add new callback for is caller preserved or constant physregs..

LGTM

Oct 26 2017, 8:24 PM
hfinkel accepted D38898: [SelectionDAG] Add MachineMemOperand flags to the NodeID for MemSDNodes.

Also, do we have a problem with prefetches in this regard too? For ISD::PREFETCH I see just this:

Oct 26 2017, 7:59 PM