kristof.beyls (Kristof Beyls)
User

Projects

User does not belong to any projects.

User Details

User Since
Jun 18 2014, 12:59 AM (209 w, 4 d)

Recent Activity

Fri, Jun 22

kristof.beyls accepted D48077: [LNT] Allow --use-perf=profile and --run-under to work together.

It's clear from your description that the run_under _mutateCommandLine isn't a "private" function in the module as is, and therefore removing the under score makes sense to me.
Maybe it'd be better for the mutatePlan in perf.py to be implemented in a different way so that _mutateCommandLine really can be private in the run_under module.
I'm afraid I don't have enough experience here to know whether that would be a reasonable goal, implementable with reasonable effort.
I hope @MatzeB could share his opinion on this.

Fri, Jun 22, 12:05 AM

Fri, Jun 8

kristof.beyls created D47930: Make email options of find_interesting_reviews more flexible..
Fri, Jun 8, 12:07 AM

Mon, Jun 4

kristof.beyls accepted D47581: [LNT] Fix use of field index in daily report.

LGTM.

Mon, Jun 4, 6:06 AM

Fri, Jun 1

kristof.beyls added a comment to D47581: [LNT] Fix use of field index in daily report.

Maybe the best way to test this is to add a helper function in tests/server/ui/V4Pages.py to extract the full content from a specific field from the table (i.e. including all HTML tags) and check against exact HTML content?

What do you think - would that be feasible and useful to implement?

After trying some things I think it will be easier to extract just the destination of the links and check that those are correct, especially as it's just the destination of the links that was wrong.

Fri, Jun 1, 7:48 AM

Thu, May 31

kristof.beyls added a comment to D47581: [LNT] Fix use of field index in daily report.

Hi John,

Thu, May 31, 5:24 AM

May 17 2018

kristof.beyls added a comment to D46192: Script to match open Phabricator reviews with potential reviewers.

Oh - and I indeed also ran the source code through YAPF - thanks for that pointer, I didn't know that existed.

May 17 2018, 3:10 AM
kristof.beyls updated the diff for D46192: Script to match open Phabricator reviews with potential reviewers.

Many thanks for the detailed review, Jonas!
I think I've addressed all your comments - very useful feedback!

May 17 2018, 3:08 AM

May 14 2018

kristof.beyls added a comment to D46714: [test-suite] Add list of programs we might add..

It does seem like a wiki would be nice to maintain this kind of information. In the absence of that, I think that a file in the test-suite repository, or a page in www are about equally easy/hard to maintain: it requires commit access to make any changes.
A file in www in theory could be more visible as it becomes part of the llvm.org web pages. That being said, source code is also viewable online, so it's easy to browse this text too.

May 14 2018, 11:51 PM
kristof.beyls added a comment to D45000: [LNT] fix some typos/dead code.

@MatzeB Is there something else I should do for this to be commited? First time using the Phabricator so I'm a bit unfamiliar with the process.

May 14 2018, 1:30 AM

May 7 2018

kristof.beyls added inline comments to D46192: Script to match open Phabricator reviews with potential reviewers.
May 7 2018, 7:50 AM
kristof.beyls updated the diff for D46192: Script to match open Phabricator reviews with potential reviewers.
May 7 2018, 7:48 AM
kristof.beyls updated the diff for D46192: Script to match open Phabricator reviews with potential reviewers.

Updating diff to make it add the script to llvm/utils/Reviewing.
I chose to put it in a new folder "Reviewing", since ideas for further scripts related to reviewing have been made on this thread.

May 7 2018, 6:37 AM
kristof.beyls accepted D46433: [LNT] lnt profile upgrade command for large Spec2017 perf.data aborts.

I hope that in the long run, we can get rid of this cPerf.cpp C++ implementation, and instead make use of "perf data convert --to-ctf" to convert perf.data files into a CTF format which should be easier to parse in python.
At the time this C++ implementation was added, the "perf data convert --to-ctf" functionality tended not to be compiled in to perf as distributed by the main linux distros.
Looking at a reason stack overflow issue about this suggests that the situation may not have changed much (https://stackoverflow.com/questions/43576997/building-perf-with-babeltrace-for-perf-to-ctf-conversion), hence we'll probably need this C++ implementation for some time to come.

May 7 2018, 3:12 AM

May 1 2018

kristof.beyls added a comment to D46192: Script to match open Phabricator reviews with potential reviewers.

This should probably go into llvm/utils, not the root.

It would also be good to have the opposite script - "given git diff / commits, whom should i add as reviewers?"

I once wrote the following one-liner to do this. It works pretty well but it can be *very* slow. It will suggest a top 10 of reviewers for your staged changes.

git diff --name-only --cached | xargs -n 1 git blame --porcelain | grep \"^author \" | sort | uniq -c | sort -nr | head -10

It could evaluate CODE_OWNERS.txt, git log / git shortlog -sne of the changed files, git blame.

May 1 2018, 6:46 AM
kristof.beyls added a comment to D46192: Script to match open Phabricator reviews with potential reviewers.

Hi Kristof. Thanks for this.

One more heuristic that might help getting quick reviews:

  1. If there is overlap between the files touched by the patch-under-review and and a 'reviewer' of recent changes to same file, that person may be a good reviewer.

    This heuristic is helpful where original implementor may have moved on, but new active reviewers in same area are happy to review. Some commit messages have 'reviewers' listed and that could be used.
May 1 2018, 6:45 AM
kristof.beyls added a comment to D46192: Script to match open Phabricator reviews with potential reviewers.

Do you plan on actually commiting this to the repository, or would you like any feedback nontheless?

May 1 2018, 6:44 AM
kristof.beyls added a comment to D46215: [AArch64] Support reserving x16 and x17 register.

I have some doubts this is the right long term approach.

As https://android.googlesource.com/kernel/common/+/c0385b24af15020a1e505f2c984db0d7c0d017e1%5E%21/#F4 indicates, there are others who have use cases for reserving other registers than x16,x17,x18. (The linked patch reserves x1,x2,x3,x4,x5,x6,x7).
With that indication, I think it may be better to bite the bullet now and implement full support for gcc command line option "-ffixed-reg", where reg is, according to the gcc documentation: "reg must be the name of a register. The register names accepted are machine-specific and are defined in the REGISTER_NAMES macro in the machine description macro file.".
For the full documentation of this option, see https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html.

What do you think?

Thanks,

Kristof

That was original proposal bug @chandlerc suggested in our offline discussion that it's not easily doable. It seems like we might be able to duplicate the same code for all remaining registers which will require a lot of duplicated logic but it should work? However, I might be missing some details of the register allocator which will break.

May 1 2018, 5:56 AM

Apr 28 2018

kristof.beyls added a reviewer for D46215: [AArch64] Support reserving x16 and x17 register: srhines.

I have some doubts this is the right long term approach.

Apr 28 2018, 12:16 AM

Apr 27 2018

kristof.beyls created D46192: Script to match open Phabricator reviews with potential reviewers.
Apr 27 2018, 8:01 AM

Mar 26 2018

kristof.beyls added inline comments to D44374: [GlobalISel] Add support for lowering vector operations to scalar..
Mar 26 2018, 7:42 AM

Mar 16 2018

kristof.beyls added a comment to D44519: Add llvm-exegesis tool..

Thanks for sharing this - this looks like a really promising tool.
I wonder if you have thought about adding unit/regression tests? Having them would make it quite a bit easier for others to contribute to its development.

Sure, we do have unit tests (they are gmock unit tests, not lit tests). I did not include them in this diff to make review more manageable, but I'll add them now if you want.

Mar 16 2018, 3:36 AM
kristof.beyls added a comment to D44519: Add llvm-exegesis tool..

Thanks for sharing this - this looks like a really promising tool.
I wonder if you have thought about adding unit/regression tests? Having them would make it quite a bit easier for others to contribute to its development.
Off the top of my head - to be able to unit/regression test this, you'd need a mock/fake "libpfm-like" implementation to make the test not rely on the results coming from whatever machine you're running on?
If that's a direction you want to take, designing that in earlier may be easier than retrofitting it in later?

Mar 16 2018, 3:09 AM

Mar 12 2018

kristof.beyls added inline comments to D44374: [GlobalISel] Add support for lowering vector operations to scalar..
Mar 12 2018, 8:23 AM
kristof.beyls added a comment to D44374: [GlobalISel] Add support for lowering vector operations to scalar..

As far as I'm aware it is impossible to write unit tests for this unless a backend uses it?
I've attempted adding this to AArch64's GlobalISel implementation and hit other deficiencies in the backend that are unrelated but this code uncovered.

Mar 12 2018, 7:59 AM

Feb 22 2018

kristof.beyls accepted D43612: [zorg] Update testsuite parallelism settings for Linaro AArch64 builders..
Feb 22 2018, 6:20 AM
kristof.beyls added inline comments to D43612: [zorg] Update testsuite parallelism settings for Linaro AArch64 builders..
Feb 22 2018, 12:33 AM

Feb 3 2018

kristof.beyls added a comment to D41761: Introduce llvm.nospeculateload intrinsic.
  1. Is the API available in C too?

    IIUC, the SecureBoundOffset intrinsic is usable from both C and C++, but the IsPointerInRange intrinsic can only be used from C++? Do you have ideas around bringing similar functionality to C?

Yeah, this is part of why I suggest the much more generic ProtectFromSpeculation API which I think is easily applicable in C. The C version might use pointers or whatever, but this kind of API doesn't fundamentally require any interesting lang

  1. For the variant with a general predicate (bool ProtectFromSpeculation(bool predicate, ZeroArgs&... args);); do you have ideas about how to make sure that the optimizers in the compiler donโ€™t simplify the predicate too much? For example:

    ` if (a>7) { x = a; if (ProtectFromSpeculation(a>5, x) { ... = v[x]; ... } }

    ` how to prevent this from being optimized to: ` if (a>7) { x = a; if (ProtectFromSpeculation(true, x) { ... = v[x]; ... } } ` which leads to no longer giving protection.

No matter what, this will require deep compiler support to implement. Even without the example you give, these construct fundamentally violate the rules the optimizer uses: they are by definition no-ops for execution of the program!

This means we will have to work to build up specific an dedicated infrastructure in the compiler to model these as having special semantics. That exact infrastructure can provide whatever optimization barriers are necessary to get the desired behavior. For example, the code generation I suggest above for x86 cannot be implemented in LLVM using its IR alone (I've actually tried). We'll have to model this both in the IR and even in the code generator specially in order to produce the kind of specific pattern that is necessary.

But there is also the question of what burden do we want to place on the user of these intrinsics vs. what performance hit we're willing to accept due to optimization barriers. I could imagine two approaches here:

  1. It is the programmers responsibility to correctly protect any predicates that their application is sensitive to. As a consequence, if the a>5 predicate is sensitive for the application, so must the a>7 predicate be, and it is the programmers responsibility to protect both of them. This allows the implementation to have the minimal set of optimization barriers, but may make it difficult for programmers to use correctly.
  2. The predicate provided to these APIs is truly special and is forced to be a *dynamic* predicate. That is, we require the compiler to emit the predicate as if no preconditions existed. There are ways to model this in LLVM and I assume any compiler. As a trivial (but obviously bad) example: all references to variables within the predicate could be lowered by rinsing that SSA value through an opaque construct like inline asm.

    There is clearly a "programmer ease / security" vs. "better optimization" tradeoff between the two. If one isn't *clearly* the correct choice in all cases, we could even expose both behind separate APIs that try to make it clear the extent of protections provided.

    Does that make sense?
Feb 3 2018, 12:00 PM

Jan 30 2018

kristof.beyls updated subscribers of D41761: Introduce llvm.nospeculateload intrinsic.

How can two variant-1 attacks be "different" enough that a speculationsafeload would protect against one but not the other, when both exploit the same load operation

Sorry, wasn't quite clear. There are two speculated loads for a variant-1 attack: the load that reads the secret, and the load that leaks the secret to the user. The first load has to be speculation-safe to stop the attack; whether the second is speculation-safe is irrelevant, at least in the proposals so far. That isn't really a fundamental problem, just an illustration that reasoning about these attacks is tricky.

Wait, no, this isn't right. It is actually possible to attack a speculationsafeload, at least in theory.

In general, the logic behind the "soft" speculation barriers being proposed is that they block execution if a speculation check fails. But given that the speculation check is itself computed speculatively, you have to ensure that not only is the result correct in the normal sense, but also that it's correct when computed speculatively. And as I showed earlier, IR transforms can turn an arbitrary value into a speculatively-uninitialized value. Therefore, the speculation check can spuriously succeed, and the speculation barrier doesn't reliably prevent speculation.

Jan 30 2018, 8:35 AM
kristof.beyls added a comment to D41761: Introduce llvm.nospeculateload intrinsic.

I have no idea what form that assurance would take, since I don't know how LangRef handles such matters.

Well, I don't really know either; LangRef only describes the abstract virtual machine, mostly. That's part of the problem. :)

How can two variant-1 attacks be "different" enough that a speculationsafeload would protect against one but not the other, when both exploit the same load operation

Sorry, wasn't quite clear. There are two speculated loads for a variant-1 attack: the load that reads the secret, and the load that leaks the secret to the user. The first load has to be speculation-safe to stop the attack; whether the second is speculation-safe is irrelevant, at least in the proposals so far. That isn't really a fundamental problem, just an illustration that reasoning about these attacks is tricky.

I don't see how the code being spread over multiple functions matters- all that matters are the load, and the branch (or nested branches) that actually guard that load

Well, the CPU doesn't care (assuming it can perfectly predict calls), but it's problematic from an auditing perspective because it's harder to spot. Particularly since the vulnerable code might not explicitly reference any user-controlled data at all.

If the application logic doesn't explicitly prevent any of the loads the attacker is exploiting

A C programmer cannot reasonably come up with a complete list of all the potentially exploitable loads without the compiler being aware that the user needs Spectre-resistant code. There are two key problems:

  1. An exploitable load might not exist in the original program. One example is the switch-to-lookup-table transform. Given:

    ` int a(unsigned x) { switch (x) { case 0: return 2; case 1: return 44; case 2: return 23; default: return 8; } } `

    We transform to: ` int a(int x) { if (x > 2) return 8; const static int table[] = {2, 44, 23}; return table[x]; } `

    Now you have a speculated out-of-bounds load from code which didn't contain any loads.
  2. The exploitable code might come out of some non-obvious lowering. Even if a pointer points to something "obviously" safe, it might actually be uninitialized along some impossible path through the function. Probably the easiest example to explain is polly's invariant load hoisting. Basically, if you have a loop like this:

    ` int sum = 0; for (int i = 0; i < n; ++i) { if (b) sum += (*p)[i]; } `

    It gets transformed to something like this:

    ` int *pp; if (b) pp = *p; int sum = 0; for (int i = 0; i < n; ++i) { if (b) sum += pp[i]; } `

    So now you have the if() if() pattern which leads to a speculated load from an uninitialized pointer.
Jan 30 2018, 8:25 AM
kristof.beyls added a comment to D41761: Introduce llvm.nospeculateload intrinsic.

(I hope you can bear with me while I come up to speed; I understand there's some time pressure here, so I'm erring on the side of speaking sooner)

The proposed API for nospeculateload seems like it could be problematic for programmers, because it gives them no way to tell whether the bounds check passed, unless they are able to identify a failval sentinel that can never be the result of a successful load (and yet is safe to use as the speculative result of the load). Thus, it seems likely that in a lot of cases, the bounds check will be duplicated: once inside nospeculateload, and once in user code. That will make code using this API "feel" inefficient, which will tend to discourage programmers from using it. Furthermore, if it actually is inefficient, that will create pressure for optimization passes to "peek inside" nospeculateload in order to eliminate the duplication, and that seems like a can of worms we really don't want to open.

Another way of putting it is that we probably want this API to be as primitive as possible, because the more logic we put inside the intrinsic, the greater the risk that some parts of it will be unsuited for some users. Consequently, we've been experimenting with APIs that are concerned solely with the bounds check, rather than with the subsequent load:

int64_t SecureBoundedOffset(int64_t offset, int64_t bound);

At the abstract machine level, this function just returns offset, and has the precondition that 0 <= offset < bound, but it also ensures that for speculative executions, offset will always appear to be within those bounds. There are also variants for uint64_t, and variants that take the base-2 log of the bound, for greater efficiency when the bound is a power of two.

template <typename T, typename... ZeroArgs>
bool IsPointerInRange(T*& pointer, T* begin, T* end, ZeroArgs... args);

This function returns whether pointer is between begin and end, and also guarantees that if the function returns false, then any speculative execution which assumes it to be true will treat pointer and args... as zero (all ZeroArgs must be integers or pointers). Notice that this API allows the optimizer to hoist loads past the branch, so long as the loads don't depend on pointer or args...; I'm not sure if that's true of nospeculateload or SecureBoundedOffset.

Notice that by not handling the load itself, these APIs avoid the ptr/cmpptr awkwardness, as well as the need for the user to designate a sentinel value. One tradeoff is that whereas nospeculateload can be thought of as a conditional load, plus some "security magic", these APIs are harder to understand without explicitly thinking about speculative execution. However, I'm not sure that's much of a problem in practice-- if you don't want to think about speculative execution, you probably shouldn't be using this API in the first place.

Is there any way we could implement an interface like those efficiently on ARM?

I don't think it's likely the compiler would intentionally introduce a load using the same pointer as the operand to a speculationsafeload; given most transforms don't understand what speculationsafeload does, the compiler has no reason to introduce code like that (even if it isn't technically prohibited by LangRef).

Do we need to explicitly prohibit it in LangRef so that future transformations don't start understanding too much about what speculationsafeload does?

More practically, I'm worried about the possibility that code which doesn't appear to be vulnerable at the source-code level will get lowered to assembly which is vulnerable. For example, the compiler could produce code where the CPU speculates a load from an uninitialized pointer value. Without an IR/MIR model for speculation, we have no way to prevent this from happening.

When you say "code which doesn't appear to be vulnerable at the source-code level", do you mean "code that is explicitly protected by speculationsafeload", or "code that doesn't appear to need speculationsafeload in the first place"? If the former, could you give a more concrete example?

It seems to me that we ought to be able to specify this intrinsic without having an explicit model of CPU speculation in the IR, because at the IR level we can just constrain the types of transformations that can be performed on this intrinsic. That way we only need to talk about speculation when we're specifying how this intrinsic is used to generate code for a specific platform that happens to feature CPU-level branch speculation. Very tentatively, I think the specific constraints on transformations that are needed here are that the intrinsic has unmodeled side-effects (so it can't be eliminated or executed speculatively), and that it should be treated as writing to all memory (or only to pointer and args.. in the case of an API like IsPointerInRange).

Jan 30 2018, 8:20 AM

Jan 26 2018

kristof.beyls added a comment to D42568: [GlobalISel] Bail out on calls to dllimported functions.

Hi Martin,

Jan 26 2018, 1:43 AM

Jan 23 2018

kristof.beyls added a comment to D40696: Enable aggressive FMA on T99 and provide AArch64 option for other micro-arch's.

I am not sure I follow. I think @MatzeB 's issue was using getProcFamily() and it should be fine to just check the subtarget feature in enableAggressiveFMAFusion. @MatzeB summarized some benefits of making it a subtarget feature here: https://reviews.llvm.org/D40177#936974

If there's only one EnableAggressiveFMA boolean in AArch64.td:

What happens if a certain micro-arch wants to enable AggressiveFMA for scalar doubles and vector doubles, but not for scalar floats and vector floats? (as an example).

Won't they have to call getProcFamily() in enableAggressiveFMAFusion()?

As there is only one boolean in AArch64.td, AggressiveFMA will be either enabled for all the floating-point types, or disabled for all the floating-point types, scalars or vectors. The only way of knowing which particular floating-point type is being tested for AggressiveFMA is by testing the type of the EVT, and then making a decision based on getProcFamily() and EVT type.

Is it safe to speculate that each and every AArch64 micro-arch wants AggressiveFMA enabled or disabled for all possible floating-point types?

Jan 23 2018, 12:22 AM

Jan 18 2018

kristof.beyls added a comment to D41761: Introduce llvm.nospeculateload intrinsic.

It would be awesome to have static analysis rules to help identify *where* to put these intrinsics. Is somebody working on that? Or did I miss it?

Jan 18 2018, 7:14 AM
kristof.beyls added a comment to D41761: Introduce llvm.nospeculateload intrinsic.

"When the intrinsic is being executed speculatively"

What does this mean?

LLVM IR defines the semantics of a program only in terms of visible side-effects. It does not define any semantics for code which does not execute, and it does not provide any guarantees in terms of what data can be leaked via side-channels. If you're going to attach meaningful semantics to speculationsafeload, you also need to generally define what code the compiler is allowed to emit in blocks which could be speculatively executed. As an extreme example, there needs to be some rule which prevents the compiler from inserting an attackable load immediately before a call to speculationsafeload.

Jan 18 2018, 7:13 AM

Jan 17 2018

kristof.beyls added a comment to D41760: Introduce __builtin_load_no_speculate.

The API design has been discussed over the past weeks in detail on the gcc mailing list. As a result of that, we propose to adapt the API, to enable efficient code generation also on architectures that need to generate a barrier instruction to achieve the desired semantics.

Jan 17 2018, 7:30 AM
kristof.beyls added a comment to D41761: Introduce llvm.nospeculateload intrinsic.

The API design has been discussed over the past weeks in detail on the gcc mailing list. As a result of that, we propose to adapt the API, to enable efficient code generation also on architectures that need to generate a barrier instruction to achieve the desired semantics.

Jan 17 2018, 7:24 AM

Jan 11 2018

kristof.beyls added a comment to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.
In D38196#951927, @az wrote:

Thank you for providing several very useful feedback to improve this patch. In the future (not very soon), I am planning to make this pass more complete by 1) implementing the existing vector element sub-pass in the same way as optimize interleaved store instructions sub-pass for consistency reason (put all rewrite rules and not just one, caching, instruction replacement table, etc.) 2) Addressing the ST3 instruction inefficiency assuming I can find an efficient combination of instructions to replace ST3. I hope I can put you as the reviewer.

I would like also to get your feedback on a couple of things:

  1. Is it worth replacing how code generation (i.e. building new instructions) is implemented in this pass? It is currently hard-coded but we can replace it by a table driven approach similar to the one used in the analysis phase (see IRT Table). In particular, I need to find a simple way to express through a table or other means how the new instructions are built out of the old one and more specifically how the operands of new instructions related to old ones. This adds complexity to the implementation and understanding of the code generation but makes adding more rewrite rules quite simple. It tried this before but the patch was not accepted but may be I should bring it again now that we added several new rewrite pattern in this patch and expect to add more for ST3.
Jan 11 2018, 2:54 AM

Jan 10 2018

kristof.beyls added a comment to D41761: Introduce llvm.nospeculateload intrinsic.

Thanks for the feedback, Chandler and Philip!

Jan 10 2018, 7:38 AM

Jan 5 2018

kristof.beyls updated subscribers of D41761: Introduce llvm.nospeculateload intrinsic.
Jan 5 2018, 5:49 AM
kristof.beyls updated subscribers of D41760: Introduce __builtin_load_no_speculate.
Jan 5 2018, 5:47 AM
kristof.beyls added a dependent revision for D41761: Introduce llvm.nospeculateload intrinsic: D41760: Introduce __builtin_load_no_speculate.
Jan 5 2018, 3:08 AM
kristof.beyls added a dependency for D41760: Introduce __builtin_load_no_speculate: D41761: Introduce llvm.nospeculateload intrinsic.
Jan 5 2018, 3:08 AM
kristof.beyls created D41761: Introduce llvm.nospeculateload intrinsic.
Jan 5 2018, 3:07 AM
kristof.beyls created D41760: Introduce __builtin_load_no_speculate.
Jan 5 2018, 3:02 AM

Dec 8 2017

kristof.beyls accepted D40985: [AArch64] Add Exynos to host detection.

LGTM modulo the one nitpick for which I'm not sure if it is needed.

Dec 8 2017, 12:03 AM

Dec 7 2017

kristof.beyls updated subscribers of D40789: [GlobalISel] Disable GISel for big endian.

I'll let @rovka review the code; I just thought I'd add @ivanbaev as a subscriber, as I believe he's a user of big endian code generation on Arm/AArch64 to be aware that this means that when globalisel gets enabled by default on AArch64 at -O0, it'll in practice only be for little endian, not big endian.
I'm wondering if a test case could be added to test/CodeGen/AArch64/GlobalISel/arm64-fallback.ll, so that test keeps on being a reasonable representation of fallbacks that need to be fixed eventually.

Dec 7 2017, 12:14 AM

Dec 5 2017

kristof.beyls added a comment to D40876: AArch64: Fix emergency spillslot being out of reach for large callframes.

I'm still in the process of coming up with a reliable and good testcase. As testcases with hundred of arguments are unwieldy I will probably end up placing a correctness test into the llvm test-suite for this.

Dec 5 2017, 11:49 PM
kristof.beyls accepted D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

Thanks for your patience with my long string of review comments.
With just still addressing the few more nitpicks, I think this will have gotten to an acceptable state.

Dec 5 2017, 5:51 AM

Nov 27 2017

kristof.beyls added inline comments to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.
Nov 27 2017, 5:31 AM

Nov 22 2017

kristof.beyls added inline comments to D40177: performance improvements for ThunderX2 T99.
Nov 22 2017, 12:31 AM

Nov 20 2017

kristof.beyls added inline comments to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.
Nov 20 2017, 12:56 AM

Nov 17 2017

kristof.beyls created D40173: [GlobalISel] Update documentation after landing non-power-of-2 support.
Nov 17 2017, 5:45 AM

Nov 15 2017

kristof.beyls added a comment to D40074: [GISel] Canonicalize constants to RHS for commutative operations.

We have canonicalization in other parts of LLVM too - even though it's not an area I understand very well.
If this is the first canonicalization (of potentially many to come) in the IRTranslator, would it make sense to try and document the canonical representation(s) somewhere and/or somehow?

Nov 15 2017, 7:11 AM

Nov 14 2017

kristof.beyls accepted D31025: [Docs] Add tablegen backend for target opcode documentatio.

LGTM, modulo the one comment about the file header.

Nov 14 2017, 6:57 AM

Nov 13 2017

kristof.beyls added inline comments to D39823: GlobalISel: Enable the legalization of G_MERGE_VALUES and G_UNMERGE_VALUES.
Nov 13 2017, 8:51 AM
kristof.beyls added a comment to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

I haven't gone through the whole patch line by line again, but I believe this is almost in a reasonable shape.
Could you do the compilation time benchmarking to ensure this doesn't increase it for target not needing the rewriting?

Nov 13 2017, 8:44 AM

Nov 10 2017

kristof.beyls added a comment to D39712: [ARM] Add an alias for psr and psr_nzcvq.

Hi Leslie,

As far as I can tell, this is a gcc extension (it's not in any architecture reference manual I can find), which I can't find any documentation for. Is this correct, or is there some documentation I've missed? If I'm right, I'm not sure that there's any compelling reason to support this in LLVM. I've added Renato, who might know more about our policy for adding these sort of non-standard extensions.

If we do implement it, this isn't the right way to do it, you have given the "psr" operand a different encoding than "xpsr". In gcc's implementation, "psr" and "xpsr" result in exactly the same instructions. I've added Javed, who implemented this table-driver operand parsing, he might have some ideas about the best way to implement aliases.

As for the test, this is not the right file to put it in, I'd suggest adding it to test/MC/ARM/thumb2-mclass.s instead.

Nov 10 2017, 12:50 AM

Nov 9 2017

kristof.beyls added inline comments to D39823: GlobalISel: Enable the legalization of G_MERGE_VALUES and G_UNMERGE_VALUES.
Nov 9 2017, 1:25 AM

Nov 8 2017

kristof.beyls added a comment to D31025: [Docs] Add tablegen backend for target opcode documentatio.

I wanted to look at the documentation produced for a specific instruction today, as part of investigation PR35157.
I wanted to understand what the operands of VST1d64TPseudoWB_fixed are in the Arm backend.
After downloading this patch and building/running it; I saw it produced the following documentation for that instruction:

Nov 8 2017, 8:27 AM

Nov 7 2017

kristof.beyls added a comment to D39747: [globalisel][tablegen] Generate rule coverage and use it to identify untested rules.

Is the long-term intention to try and drive this coverage to 100% via in-tree tests, or rather via the test-suite?

Nov 7 2017, 11:50 PM
kristof.beyls added inline comments to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.
Nov 7 2017, 6:57 AM

Nov 3 2017

kristof.beyls added inline comments to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.
Nov 3 2017, 9:25 AM
kristof.beyls added inline comments to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.
Nov 3 2017, 8:20 AM

Nov 1 2017

kristof.beyls added inline comments to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.
Nov 1 2017, 12:49 AM

Oct 26 2017

kristof.beyls added a comment to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.
In D38196#906795, @az wrote:

I thought the "shouldExitEarly" method aims to check if any of the transformations the pass can do can be profitable for the target at all.
My understanding is that this can be checked by just examining the instruction latencies, not having to look at the actual code being compiled at all. I assume that the saving of compile time comes from the fact that "shouldExitEarly" needs to do an amount of work proportional to the number of rewriting rules implemented, not proportional to the amount of code being compiled.
So, for the pass to work as expected, i.e. always run when for the target at least one of the transformations is profitable, I think shouldExitEarly should check all rewriting rules implemented.

Does that make sense, or am I misunderstanding something here?

Your understanding of ShouldExitEarly is accurate and I agree that ideally we should put down all rewriting rules. However, ShouldExitEarly is not very cheap as it calls shouldReplaceInstruction. The question is should we put some heuristics into ShouldExitEarly so that we do not check for all rules even though the number of rules is static and does not depend on code? I currently check for one rule and make it a representative of all (st2 in this case) and this may be over-simplification. Maybe, I can check for one representative of st2 and one for st4 (which both have many variants).

Oct 26 2017, 2:30 PM

Oct 25 2017

kristof.beyls added a comment to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

Just a quick remark/question before I go off-line for the rest of today:

Oct 25 2017, 8:15 AM

Oct 24 2017

kristof.beyls added a comment to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.
In D38196#896905, @az wrote:

But my main question at this point is: did you benchmark this? What are the results of that? On which cores/benchmarks?

As llvm is now getting better at generating interleaved memory access instructions, we are seeing a major degradation in a proprietary benchmark for Exynos because latest llvm generates st2 instructions in a hot loop. I can refer you to the llvm file lib/Target/AArch64/AArch64SchedM1.td for Exynos latencies (example: st2 takes 8 cycles, while zip1, zip2, and stp each take 1 cycle).

Oct 24 2017, 7:49 AM

Oct 12 2017

kristof.beyls added a comment to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

Hi Abderrazek,

Oct 12 2017, 8:01 AM

Oct 11 2017

kristof.beyls added a comment to D35260: [AArch64] Move AES instruction fusion support.

Ping ๐Ÿ””

Oct 11 2017, 12:27 AM

Oct 10 2017

kristof.beyls added a comment to D38200: [GISel]: Process new insts to legalize in the order they were created.

While this shouldn't be an issue (legalization should only look at one instruction), there's some out of tree code that looks at other instructions as well, and this is breaking the legalization for that instruction.

Oct 10 2017, 7:50 AM
kristof.beyls added a comment to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

Hi Abderrazek,

Oct 10 2017, 7:39 AM

Oct 9 2017

kristof.beyls accepted D38160: [AArch64] Improve codegen for inverted overflow checking intrinsics.

Thanks Amara, LGTM modulo one tiny nitpick which you should feel free to disagree with.

Oct 9 2017, 1:44 AM

Oct 5 2017

kristof.beyls added inline comments to D38160: [AArch64] Improve codegen for inverted overflow checking intrinsics.
Oct 5 2017, 12:07 PM
kristof.beyls added a comment to D38588: Clear LastMappingSymbols and LastEMS(Info) when resetting the ARM(AArch64)ELFStreamer.

Not really sure how this can be tested since I can only trigger this when reusing the pass in (julia) JIT.

Oct 5 2017, 11:33 AM
kristof.beyls added inline comments to D38160: [AArch64] Improve codegen for inverted overflow checking intrinsics.
Oct 5 2017, 5:52 AM
kristof.beyls added inline comments to D38160: [AArch64] Improve codegen for inverted overflow checking intrinsics.
Oct 5 2017, 1:51 AM

Oct 3 2017

kristof.beyls added a comment to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

Hi Abderrazek,

Oct 3 2017, 8:32 AM
kristof.beyls updated the diff for D30529: [RFC][GlobalISel] Enable legalizing non-power-of-2 sized types..
  • updated to ToT; adapting the G_OR narrowing support added recently by Quentin.
  • slightly improve test arm64-fallback.ll by working around not being able to legalize non-power-of-2-sized G_IMPLICIT_DEFs yet.
Oct 3 2017, 6:07 AM

Oct 2 2017

kristof.beyls added a comment to D38417: [test-suite] Adding HACCKernels app.

Hi Brian,

Thanks for working on this!

On the execution time of about 13 seconds on a fast processor: would it be possible to adapt the input to reduce the running time to about 1 second (or less), and not loose the characteristic behaviour of this benchmark?
I'd expect adapting the input to have an order of magnitude shorter running time will not make the measured execution time more noisy; and it would help in trying to make sure the test-suite keeps on running as quickly as possible.

Kristof, just FYI, reducing the runtime is easy to do, however, this will get significantly faster once we start actually vectorizing the hot loops (at least 4x if you have an architecture with <4 x float>). I looked at this a few weeks ago, and as I recall, we currently can't if-convert reductions (which prevents vectorization). We plan on working on improving this in the near future. Does this affect your opinion at all?

I mainly keep on focussing on the run-time of newly added benchmarks to the test-suite as I do think it is a problem already that the test-suite takes too long to run.

I'm not sure that it runs too long in the abstract, but we certainly waste CPU time by having programs that run for longer than necessary.

For the benchmarks in the test-suite, this problem is far worse than for the programs that are only run to check correctness,

Agreed.

as we typically:
a) have to run the test-suite in benchmark-mode multiple times to figure out if a performance change was noise or significant.
b) have to run programs sequentially in benchmark-mode even on multi-core systems, to reduce noise, whereas for correctness testing we can use all the cores on a system.

At the risk of going too far afield, this has not been universally my experience. When checking for performance, on build servers with ~50 hardware threads, I often run with the test suite with a level of parallelism matching the number of hardware threads. I'd run the test suite ~15 times and then use ministat (https://github.com/codahale/ministat) to compare the ~15 timings from each test to a previous run. I've found these numbers to be better than quiet-server serial runs for two reasons: First, even a quiet server is noisy and we need to run the test multiple times (unless they really run for a long time), and second, the cores are in a more production-like state (where, for example, multiple hardware threads are being used and there's contention for the cache). I/O-related times are obviously more variable this way, but I've generally found that tests that run for a second (and as low of 0.2s on some systems) are fine for this kind of configuration. Also, so long as you have more than 30 hardware threads (or something like that, depending on the architecture), it's actually faster this way than a single serial run. Moreover, ministat gives error bars :-)

In case you're curious, there's also a Python version of ministat (https://github.com/lebinh/ministat).

This is annoying when evaluating patches, and also makes the response time of performance-tracking bots annoyingly long.
We have a few hundred benchmarks in the test-suite currently, but probably need a lot more to get more coverage (which is why I think it's awesome that these DOE benchmarks are being added!).

We definitely need more coverage for performance. We also need *a lot* more coverage for correctness (i.e. the fact that I catch far more miscompiles from self hosting than from the test suite is a problem).

Therefore, I think it's important to not lose focus on trying to keep benchmarks short-running as they are being added.

There's probably a lot of bikeshedding that could be done on what an acceptable run-time is for a newly-added benchmark and what is too long.
My experience on a few X86 and Arm platforms is that if you use linux perf to measure execution time, as soon as the program runs for 0.01 seconds, just running the program for longer doesn't reduce noise further.
Therefore, my limited experiments suggest to me that an ideal execution time for the benchmark programs in the test-suite would be just over 0.01 seconds - for the platforms I've tested on.
As I said, there's probably lots of arguing that could be done on what the execution time is that we should aim for when adding a new benchmark. So far, I've followed a personal rule-of-thumb that up to 1 second is acceptable, but when its more, there should be a reason for why a longer execution time is needed.

This is also close to my experience; aiming for about a second, maybe two, makes sense.

Which is why I reacted above.
As I don't think my personal 1 second rule-of-thumb is defendable any more or less than rules that set the threshold a bit higher or lower, I don't feel too strongly against this benchmark going in as is.
I just felt I had to ask the question if there was a good reason to make this benchmark run for this long.
Ultimately, vectorizing the hot loops in this benchmark won't make a change to my reasoning above.

In summary, I hope my reasoning above makes sense, and I won't oppose if you think there's a good reason to not shorten the running time of this benchmark as is.

Okay. I propose that we shorten the current running time to around 1.5 seconds. That should leave sufficient running time once we start vectorizing the loops.

Oct 2 2017, 2:30 AM
kristof.beyls added inline comments to D38439: AMDGPU/GlobalISel: Mark 32-bit G_FADD as legal.
Oct 2 2017, 2:13 AM

Sep 30 2017

kristof.beyls added a comment to D38417: [test-suite] Adding HACCKernels app.

Hi Brian,

Thanks for working on this!

On the execution time of about 13 seconds on a fast processor: would it be possible to adapt the input to reduce the running time to about 1 second (or less), and not loose the characteristic behaviour of this benchmark?
I'd expect adapting the input to have an order of magnitude shorter running time will not make the measured execution time more noisy; and it would help in trying to make sure the test-suite keeps on running as quickly as possible.

Kristof, just FYI, reducing the runtime is easy to do, however, this will get significantly faster once we start actually vectorizing the hot loops (at least 4x if you have an architecture with <4 x float>). I looked at this a few weeks ago, and as I recall, we currently can't if-convert reductions (which prevents vectorization). We plan on working on improving this in the near future. Does this affect your opinion at all?

Sep 30 2017, 11:41 AM

Sep 29 2017

kristof.beyls added a comment to D38417: [test-suite] Adding HACCKernels app.

Hi Brian,

Sep 29 2017, 11:52 PM
kristof.beyls added inline comments to D30529: [RFC][GlobalISel] Enable legalizing non-power-of-2 sized types..
Sep 29 2017, 8:47 AM
kristof.beyls updated the diff for D30529: [RFC][GlobalISel] Enable legalizing non-power-of-2 sized types..

Use unordered_map instead of (ordered) map.

Sep 29 2017, 8:44 AM
kristof.beyls added inline comments to D30529: [RFC][GlobalISel] Enable legalizing non-power-of-2 sized types..
Sep 29 2017, 7:24 AM
kristof.beyls updated the diff for D30529: [RFC][GlobalISel] Enable legalizing non-power-of-2 sized types..
  • Rebased to ToT.
  • Addressed all outstanding review comments.
  • Used the test-suite on AArch64 to make sure there are no correctness regressions, both in fallback mode and in assert-when-not-legalizable mode.
  • Measured compile time impact of this change: it's below the noise level I see on gathering CTMark compile time numbers on my system.
Sep 29 2017, 7:18 AM
kristof.beyls added a comment to D38378: [ARM] Optimize {s,u}{add,sub}.with.overflow..

Hi Joel,

Sep 29 2017, 12:43 AM

Sep 25 2017

kristof.beyls added a comment to D30529: [RFC][GlobalISel] Enable legalizing non-power-of-2 sized types..

Hi Kristof,

I was under the impression that the patch is good to land, at least as a first step.
What are we missing to push this change?

Cheers,
-Quentin

Sep 25 2017, 12:46 PM

Sep 12 2017

kristof.beyls accepted D37724: [ARM] Add more CPUs to host detection.
Sep 12 2017, 12:12 PM

Sep 11 2017

kristof.beyls added a comment to D34878: [ARM] Option for reading thread pointer from coprocessor register.

Still LGTM; please commit.

Sep 11 2017, 11:53 PM
kristof.beyls added a comment to D37724: [ARM] Add more CPUs to host detection.

Hi Eli,

Sep 11 2017, 11:50 PM
kristof.beyls accepted D34878: [ARM] Option for reading thread pointer from coprocessor register.

Thanks Strahinja!
I thought that some indentations looked a bit strange, so I'd just still check that clang-format formats your changes the same way.
Otherwise LGTM!

Sep 11 2017, 6:26 AM
kristof.beyls added inline comments to D34878: [ARM] Option for reading thread pointer from coprocessor register.
Sep 11 2017, 5:28 AM
kristof.beyls added inline comments to D34878: [ARM] Option for reading thread pointer from coprocessor register.
Sep 11 2017, 12:59 AM

Sep 7 2017

kristof.beyls added inline comments to D34878: [ARM] Option for reading thread pointer from coprocessor register.
Sep 7 2017, 2:19 AM

Aug 31 2017

kristof.beyls added a comment to D37360: Keep sqlalchemy session separate from database objects.

I have felt before that LNT's connection and interaction with the database is too obscure in the code.
Therefore, I like this proposal, even if it means passing around session objects more.

Aug 31 2017, 11:44 PM

Aug 30 2017

kristof.beyls added a comment to D34581: Fix missing/mismatched html tags.

For the record: I just added an optional integration to pytidylib/tidy-html5 that checks lnt pages for html problems (r312061). It can be used with lnt -Dtidylib=1.

Aug 30 2017, 12:40 AM

Aug 17 2017

kristof.beyls added a comment to D36717: [test-suite] Add SPEC CPU 2017.

One more remark: The 'ref' dataset of 638.imagick_s on all the computers I tried on took between 2 and 3 hours. Submitted results to SPEC are in a similar range. Unfortunately timeit.py has a 7200 seconds (2 hours) limit hardcoded (for --limit-cpu and --timeout). I had to remove that limit to be able to have a successful run.

Can we make the limit configurable or set higher?

Aug 17 2017, 10:56 PM

Aug 16 2017

kristof.beyls added inline comments to D36717: [test-suite] Add SPEC CPU 2017.
Aug 16 2017, 12:16 AM

Aug 11 2017

kristof.beyls accepted D36582: [test-suite] Adding HPCCG benchmark.

Thanks again, The -DVERIFICATION should not have been in the Makefile (not used in this app). I added a short note at the top of the README to make things clearer.

Brian

Aug 11 2017, 10:52 PM