kristof.beyls (Kristof Beyls)
User

Projects

User does not belong to any projects.

User Details

User Since
Jun 18 2014, 12:59 AM (173 w, 6 d)

Recent Activity

Thu, Oct 12

kristof.beyls added a comment to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

Hi Abderrazek,

Thu, Oct 12, 8:01 AM

Wed, Oct 11

kristof.beyls added a comment to D35260: [AArch64] Move AES instruction fusion support.

Ping 🔔

Wed, Oct 11, 12:27 AM

Tue, Oct 10

kristof.beyls added a comment to D38200: [GISel]: Process new insts to legalize in the order they were created.

While this shouldn't be an issue (legalization should only look at one instruction), there's some out of tree code that looks at other instructions as well, and this is breaking the legalization for that instruction.

Tue, Oct 10, 7:50 AM
kristof.beyls added a comment to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

Hi Abderrazek,

Tue, Oct 10, 7:39 AM

Mon, Oct 9

kristof.beyls accepted D38160: [AArch64] Improve codegen for inverted overflow checking intrinsics.

Thanks Amara, LGTM modulo one tiny nitpick which you should feel free to disagree with.

Mon, Oct 9, 1:44 AM

Thu, Oct 5

kristof.beyls added inline comments to D38160: [AArch64] Improve codegen for inverted overflow checking intrinsics.
Thu, Oct 5, 12:07 PM
kristof.beyls added a comment to D38588: Clear LastMappingSymbols and LastEMS(Info) when resetting the ARM(AArch64)ELFStreamer.

Not really sure how this can be tested since I can only trigger this when reusing the pass in (julia) JIT.

Thu, Oct 5, 11:33 AM
kristof.beyls added inline comments to D38160: [AArch64] Improve codegen for inverted overflow checking intrinsics.
Thu, Oct 5, 5:52 AM
kristof.beyls added inline comments to D38160: [AArch64] Improve codegen for inverted overflow checking intrinsics.
Thu, Oct 5, 1:51 AM

Tue, Oct 3

kristof.beyls added a comment to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

Hi Abderrazek,

Tue, Oct 3, 8:32 AM
kristof.beyls updated the diff for D30529: [RFC][GlobalISel] Enable legalizing non-power-of-2 sized types..
  • updated to ToT; adapting the G_OR narrowing support added recently by Quentin.
  • slightly improve test arm64-fallback.ll by working around not being able to legalize non-power-of-2-sized G_IMPLICIT_DEFs yet.
Tue, Oct 3, 6:07 AM

Mon, Oct 2

kristof.beyls added a comment to D38417: [test-suite] Adding HACCKernels app.

Hi Brian,

Thanks for working on this!

On the execution time of about 13 seconds on a fast processor: would it be possible to adapt the input to reduce the running time to about 1 second (or less), and not loose the characteristic behaviour of this benchmark?
I'd expect adapting the input to have an order of magnitude shorter running time will not make the measured execution time more noisy; and it would help in trying to make sure the test-suite keeps on running as quickly as possible.

Kristof, just FYI, reducing the runtime is easy to do, however, this will get significantly faster once we start actually vectorizing the hot loops (at least 4x if you have an architecture with <4 x float>). I looked at this a few weeks ago, and as I recall, we currently can't if-convert reductions (which prevents vectorization). We plan on working on improving this in the near future. Does this affect your opinion at all?

I mainly keep on focussing on the run-time of newly added benchmarks to the test-suite as I do think it is a problem already that the test-suite takes too long to run.

I'm not sure that it runs too long in the abstract, but we certainly waste CPU time by having programs that run for longer than necessary.

For the benchmarks in the test-suite, this problem is far worse than for the programs that are only run to check correctness,

Agreed.

as we typically:
a) have to run the test-suite in benchmark-mode multiple times to figure out if a performance change was noise or significant.
b) have to run programs sequentially in benchmark-mode even on multi-core systems, to reduce noise, whereas for correctness testing we can use all the cores on a system.

At the risk of going too far afield, this has not been universally my experience. When checking for performance, on build servers with ~50 hardware threads, I often run with the test suite with a level of parallelism matching the number of hardware threads. I'd run the test suite ~15 times and then use ministat (https://github.com/codahale/ministat) to compare the ~15 timings from each test to a previous run. I've found these numbers to be better than quiet-server serial runs for two reasons: First, even a quiet server is noisy and we need to run the test multiple times (unless they really run for a long time), and second, the cores are in a more production-like state (where, for example, multiple hardware threads are being used and there's contention for the cache). I/O-related times are obviously more variable this way, but I've generally found that tests that run for a second (and as low of 0.2s on some systems) are fine for this kind of configuration. Also, so long as you have more than 30 hardware threads (or something like that, depending on the architecture), it's actually faster this way than a single serial run. Moreover, ministat gives error bars :-)

In case you're curious, there's also a Python version of ministat (https://github.com/lebinh/ministat).

This is annoying when evaluating patches, and also makes the response time of performance-tracking bots annoyingly long.
We have a few hundred benchmarks in the test-suite currently, but probably need a lot more to get more coverage (which is why I think it's awesome that these DOE benchmarks are being added!).

We definitely need more coverage for performance. We also need *a lot* more coverage for correctness (i.e. the fact that I catch far more miscompiles from self hosting than from the test suite is a problem).

Therefore, I think it's important to not lose focus on trying to keep benchmarks short-running as they are being added.

There's probably a lot of bikeshedding that could be done on what an acceptable run-time is for a newly-added benchmark and what is too long.
My experience on a few X86 and Arm platforms is that if you use linux perf to measure execution time, as soon as the program runs for 0.01 seconds, just running the program for longer doesn't reduce noise further.
Therefore, my limited experiments suggest to me that an ideal execution time for the benchmark programs in the test-suite would be just over 0.01 seconds - for the platforms I've tested on.
As I said, there's probably lots of arguing that could be done on what the execution time is that we should aim for when adding a new benchmark. So far, I've followed a personal rule-of-thumb that up to 1 second is acceptable, but when its more, there should be a reason for why a longer execution time is needed.

This is also close to my experience; aiming for about a second, maybe two, makes sense.

Which is why I reacted above.
As I don't think my personal 1 second rule-of-thumb is defendable any more or less than rules that set the threshold a bit higher or lower, I don't feel too strongly against this benchmark going in as is.
I just felt I had to ask the question if there was a good reason to make this benchmark run for this long.
Ultimately, vectorizing the hot loops in this benchmark won't make a change to my reasoning above.

In summary, I hope my reasoning above makes sense, and I won't oppose if you think there's a good reason to not shorten the running time of this benchmark as is.

Okay. I propose that we shorten the current running time to around 1.5 seconds. That should leave sufficient running time once we start vectorizing the loops.

Mon, Oct 2, 2:30 AM
kristof.beyls added inline comments to D38439: AMDGPU/GlobalISel: Mark 32-bit G_FADD as legal.
Mon, Oct 2, 2:13 AM

Sat, Sep 30

kristof.beyls added a comment to D38417: [test-suite] Adding HACCKernels app.

Hi Brian,

Thanks for working on this!

On the execution time of about 13 seconds on a fast processor: would it be possible to adapt the input to reduce the running time to about 1 second (or less), and not loose the characteristic behaviour of this benchmark?
I'd expect adapting the input to have an order of magnitude shorter running time will not make the measured execution time more noisy; and it would help in trying to make sure the test-suite keeps on running as quickly as possible.

Kristof, just FYI, reducing the runtime is easy to do, however, this will get significantly faster once we start actually vectorizing the hot loops (at least 4x if you have an architecture with <4 x float>). I looked at this a few weeks ago, and as I recall, we currently can't if-convert reductions (which prevents vectorization). We plan on working on improving this in the near future. Does this affect your opinion at all?

Sat, Sep 30, 11:41 AM

Fri, Sep 29

kristof.beyls added a comment to D38417: [test-suite] Adding HACCKernels app.

Hi Brian,

Fri, Sep 29, 11:52 PM
kristof.beyls added inline comments to D30529: [RFC][GlobalISel] Enable legalizing non-power-of-2 sized types..
Fri, Sep 29, 8:47 AM
kristof.beyls updated the diff for D30529: [RFC][GlobalISel] Enable legalizing non-power-of-2 sized types..

Use unordered_map instead of (ordered) map.

Fri, Sep 29, 8:44 AM
kristof.beyls added inline comments to D30529: [RFC][GlobalISel] Enable legalizing non-power-of-2 sized types..
Fri, Sep 29, 7:24 AM
kristof.beyls updated the diff for D30529: [RFC][GlobalISel] Enable legalizing non-power-of-2 sized types..
  • Rebased to ToT.
  • Addressed all outstanding review comments.
  • Used the test-suite on AArch64 to make sure there are no correctness regressions, both in fallback mode and in assert-when-not-legalizable mode.
  • Measured compile time impact of this change: it's below the noise level I see on gathering CTMark compile time numbers on my system.
Fri, Sep 29, 7:18 AM
kristof.beyls added a comment to D38378: Optimize {s,u}{add,sub}.with.overflow on ARM..

Hi Joel,

Fri, Sep 29, 12:43 AM

Mon, Sep 25

kristof.beyls added a comment to D30529: [RFC][GlobalISel] Enable legalizing non-power-of-2 sized types..

Hi Kristof,

I was under the impression that the patch is good to land, at least as a first step.
What are we missing to push this change?

Cheers,
-Quentin

Mon, Sep 25, 12:46 PM

Sep 12 2017

kristof.beyls accepted D37724: [ARM] Add more CPUs to host detection.
Sep 12 2017, 12:12 PM

Sep 11 2017

kristof.beyls added a comment to D34878: [ARM] Option for reading thread pointer from coprocessor register.

Still LGTM; please commit.

Sep 11 2017, 11:53 PM
kristof.beyls added a comment to D37724: [ARM] Add more CPUs to host detection.

Hi Eli,

Sep 11 2017, 11:50 PM
kristof.beyls accepted D34878: [ARM] Option for reading thread pointer from coprocessor register.

Thanks Strahinja!
I thought that some indentations looked a bit strange, so I'd just still check that clang-format formats your changes the same way.
Otherwise LGTM!

Sep 11 2017, 6:26 AM
kristof.beyls added inline comments to D34878: [ARM] Option for reading thread pointer from coprocessor register.
Sep 11 2017, 5:28 AM
kristof.beyls added inline comments to D34878: [ARM] Option for reading thread pointer from coprocessor register.
Sep 11 2017, 12:59 AM

Sep 7 2017

kristof.beyls added inline comments to D34878: [ARM] Option for reading thread pointer from coprocessor register.
Sep 7 2017, 2:19 AM

Aug 31 2017

kristof.beyls added a comment to D37360: Keep sqlalchemy session separate from database objects.

I have felt before that LNT's connection and interaction with the database is too obscure in the code.
Therefore, I like this proposal, even if it means passing around session objects more.

Aug 31 2017, 11:44 PM

Aug 30 2017

kristof.beyls added a comment to D34581: Fix missing/mismatched html tags.

For the record: I just added an optional integration to pytidylib/tidy-html5 that checks lnt pages for html problems (r312061). It can be used with lnt -Dtidylib=1.

Aug 30 2017, 12:40 AM

Aug 17 2017

kristof.beyls added a comment to D36717: [test-suite] Add SPEC CPU 2017.

One more remark: The 'ref' dataset of 638.imagick_s on all the computers I tried on took between 2 and 3 hours. Submitted results to SPEC are in a similar range. Unfortunately timeit.py has a 7200 seconds (2 hours) limit hardcoded (for --limit-cpu and --timeout). I had to remove that limit to be able to have a successful run.

Can we make the limit configurable or set higher?

Aug 17 2017, 10:56 PM

Aug 16 2017

kristof.beyls added inline comments to D36717: [test-suite] Add SPEC CPU 2017.
Aug 16 2017, 12:16 AM

Aug 11 2017

kristof.beyls accepted D36582: [test-suite] Adding HPCCG benchmark.

Thanks again, The -DVERIFICATION should not have been in the Makefile (not used in this app). I added a short note at the top of the README to make things clearer.

Brian

Aug 11 2017, 10:52 PM
kristof.beyls added a comment to D36582: [test-suite] Adding HPCCG benchmark.

Hi Brian, Hal,

Aug 11 2017, 12:55 PM

Aug 10 2017

kristof.beyls added a comment to D36582: [test-suite] Adding HPCCG benchmark.

Great to see you taking the effort to make the test-suite more relevant!
I think it'd be helpful if you could also add a little explanation of how adding this code makes the test-suite more relevant.
I'm assuming that this kind of code is common in some domain (HPC?) and that there is no other benchmark already in the test-suite with similar enough properties?

Aug 10 2017, 8:55 AM
kristof.beyls added a comment to D30529: [RFC][GlobalISel] Enable legalizing non-power-of-2 sized types..

Hi Kristof,

Thanks for working on this and I'm really sorry it took so long to reply.

Aug 10 2017, 7:21 AM

Aug 9 2017

kristof.beyls updated subscribers of D35402: [zorg] Enable running 'lnt runtest test-suite' instead of 'lnt runtest nt'..
Aug 9 2017, 8:40 AM

Aug 8 2017

kristof.beyls added a comment to D35588: Remove --commit/--commit=1 flags/settings..

Interesting. I wonder if one use case for this option was to submit (but not commit) a run and then just obtain a set of performance comparisons against the last submitted run. I think I have seen this pattern once, but I am not using it.

I personally won't be affected by the removal of this option, so I am fine with dropping it immediately. Now, a slightly less agressive approach would be to just switch it's default value, deprecate it and wait for a release or two to actually remove it.

Aug 8 2017, 6:43 AM

Jul 17 2017

kristof.beyls added a comment to D35501: Add lnt commandline REST client.

This is a wonderful idea!

Jul 17 2017, 11:49 PM

Jul 14 2017

kristof.beyls created D35402: [zorg] Enable running 'lnt runtest test-suite' instead of 'lnt runtest nt'..
Jul 14 2017, 1:17 AM

Jul 12 2017

kristof.beyls accepted D35001: [LNT] Missing "Produced by" field on Run causes UI to crash.

LGTM!

Jul 12 2017, 11:44 PM
kristof.beyls added a comment to D35001: [LNT] Missing "Produced by" field on Run causes UI to crash.

Mostly looks good, just 2 minor remarks.

Jul 12 2017, 7:22 AM

Jul 7 2017

kristof.beyls accepted D34954: [AArch64] Use 16 bytes as preferred function alignment on Cortex-A57..

LGTM from a Cortex-A57 point-of-view: this patch is generating code in line with the optimization guide recommendations and the benchmark numbers quoted look good.

Jul 7 2017, 12:29 AM
kristof.beyls accepted D34961: [AArch64] Use 16 bytes as preferred function alignment on Cortex-A72..
Jul 7 2017, 12:29 AM
kristof.beyls added a comment to D34961: [AArch64] Use 16 bytes as preferred function alignment on Cortex-A72..

From a coding standpoint this all LGTM. However, I'm going to defer to a A72 code owner for final approval.

Jul 7 2017, 12:28 AM

Jul 5 2017

kristof.beyls added inline comments to D35001: [LNT] Missing "Produced by" field on Run causes UI to crash.
Jul 5 2017, 8:49 AM

Jul 4 2017

kristof.beyls added a comment to D34978: [GlobalIsel] fix undefined behavior if Action not set..

Hi Igor,

Would it be possible to add a regression test for this?
I'm wondering when this fails today. I guess it must be when the target has specified an action for e.g. TypeIdx 0, but not TypeIdx 1?
I'm just trying to understand if this should be an assert (if a target hasn't specified something fully enough) or an if statement.

Thanks,

Kristof

Hi Kristof,
I am not sure how to add regression test for this.
Today it fails for example on the follow mir

# RUN: llc -mtriple=x86_64-linux-gnu -O0 -run-pass=legalizer -global-isel %s -o - | FileCheck %s
  ---
  name:            test_implicit_def
  registers:
  body: |
  bb.0.entry:
    liveins:

    %0:_(s64) = G_IMPLICIT_DEF
...

In this case target doesn't set any Action for G_IMPLICIT_DEF.

I find it a bit suspicious that this triggers on G_IMPLICIT_DEF which was added in the past few days, but I don't have time right now to dig in and try and understand the full scope of what's going on.
Given the default action for G_IMPLICIT_DEF is defined as NarrowScalar, my impression is that a target will have to specify at least one size for which "needsLegalizingToDifferentSize" is false, e.g. "Legal".
Otherwise, which size should "NarrowScalar" narrow towards on a G_IMPLICIT_DEF?
That being said, the condition "Aspect.Idx >= Actions[Aspect.Opcode - FirstOp].size()" doesn't seem to be related to the infinite loop of ever more narrowing the type size I explained in the previous sentence?
I'm clearly missing something.
I hope Tim will be able to give you more useful feedback.

Jul 4 2017, 8:34 AM
kristof.beyls added a comment to D34978: [GlobalIsel] fix undefined behavior if Action not set..

Hi Igor,

Jul 4 2017, 4:55 AM

Jun 30 2017

kristof.beyls added inline comments to D32529: [GlobalISel] Make multi-step legalization work..
Jun 30 2017, 1:25 AM

Jun 29 2017

kristof.beyls added a comment to D34584: Introduce new JSON import format.
  • "project_revision" seems to generic to me (I wouldn't know whether it's the compiler or the benchmark suites revision). I could probably live with "compiler_revision", but we should have a separate patch for that.

Agreed, it's not the best possible name. It should be the revision of the software for which you're tracking the performance in my mind. That software doesn't have to be a "compiler". Not sure what a better name would be though.

I'd be fine changing it to "compiler_revision" right away, if we need more discussion about how to call the field, then I'd suggest doing that in another patch to not delay this patch.

Let's change it to "compiler_revision" then, as all the other projects I know of so far track performance of a piece of software that could be claimed to be a compiler of sorts.

Hmpf, I should have tried this before making promises: Turns out sqlite does not support renaming a column. This means I cannot easily provide migration code (adding a new column and copying doesn't work nicely either as removing the old column isn't supported either). So I fear this is best addressed by creating a new LNT schema.

Jun 29 2017, 12:53 PM
kristof.beyls updated the diff for D30529: [RFC][GlobalISel] Enable legalizing non-power-of-2 sized types..
Jun 29 2017, 9:09 AM
kristof.beyls accepted D34584: Introduce new JSON import format.

Before I dive in the meta discussion, are there any issues with the patch itself?

Jun 29 2017, 12:32 AM

Jun 28 2017

kristof.beyls added a comment to D34584: Introduce new JSON import format.

Drop Run/Info/tag: json itself is a schemaless format so it feels more natural without one. The LNT testsuite type is now determined by the submit URL used, or the '-s' commandline flag on the commandline. Both default to 'nts' if omitted.

I don't fully understand how LNT reacts to this, but on the surface makes sense to me. If someone ever had time to do a major redesign of the database schema, there is a chance we might also be able to drop the need for differences in the database schema per test suite, like "nts"? Anyway, that is definitely far out of scope for this patch.

The patch works fine. People using non-nts testsuites (which I have a strong suspicion only exist within apple), may need to supply a -s MySchema flag for the commandline tools or use db_default/v4/MySchema/submitRun instead of db_default/submitRun to express they don't use the 'nts' default.

This is probably not the right patch to discuss LNT database design. But some points anyway:

  • IMO making SQL tables too generic (like having one row with MetricType, Test, Run and Value; so we could create new MetricTypes on demand) is a sure way to kill performance. As long as we use SQL we should have 1 table column for each metric.
  • A schema is nice when you need to decide bits like the unit of a metric, whether bigger is better or getting a human readable metric name for the website. However we could make good arguments that we only need that on the presentation/analysis side and not as part of the data storage model.

Agreed that not losing performance will be the key challenge for any redesign of the SQL schema to avoid having test-suite-specific tables. Anyway, it's indeed a discussion for another time...

Jun 28 2017, 7:43 AM
kristof.beyls added a comment to D34362: [LNT] Support for different DataSet usage in Polybench for "lnt runtest nt".

It makes more sense to me to be using the --make-param flag to pass a test specific configuration options. If you want to add all these size classes, all the tests should support them, or have them mapped back to nearest size the tests can handle.

I feel having a common flag for test data sizes is better than having to pass them as test specific flag. As you suggested, we can map them back to nearest data sizes for the other tests. I was working with Polybench, so did not look into other tests. Will proceed with it if everyone agrees on adding a flag like "--testdatasizes=mini|small|medium|large|extra_large" and aliasing --small and --large for backward compatibility.

Look at SPEC for example: It has 3 different sizes: test, train and ref. You can describe them roughly as:

  • test: "As fast as posisble, not useful as a benchmark but they touch enough code paths so that they can give you a quick way to test for correctness"
  • train: "Somewhat realistic but smaller inputs, useful to produce data for PGO optimizations. The important aspect here is that the data is different enough from ref, so we don't overfit the code because training and reference data were the same"
  • ref: "Larger inputs running for a longer time producing stable numbers".

    So there is some semantics and intended uses here that is captured fine with "test", "train" and "ref". Mapping this to some generic terms like "small", "medium", "large" that just map to sizes would be a loss IMO.
Jun 28 2017, 7:23 AM
kristof.beyls updated the diff for D32529: [GlobalISel] Make multi-step legalization work..

Rebased to ToT & ping.

Jun 28 2017, 6:28 AM

Jun 27 2017

kristof.beyls updated subscribers of D34584: Introduce new JSON import format.

Thank you so much for this, Matthias!

Jun 27 2017, 12:02 AM

Jun 26 2017

kristof.beyls added a comment to D34581: Fix missing/mismatched html tags.

It would be wonderful if we could programmatically check these during testing. Thanks Matthias!

Jun 26 2017, 8:10 AM

Jun 23 2017

kristof.beyls added a comment to D32272: [XRay] Add Google Benchmark library + initial XRay benchmarks.

LGTM for the most part, thanks for making the changes, it's much simpler now.

Thoughts @MatzeB? I think we can work on running/reporting the microbenchmarks data later, in changes to LNT.

Can you please summarize the state of this and the eventual goal? It looks like:

  1. Eventually, we need a new LNT plugin (i.e. something like what we have in LNTBased/*) that understands how to interpret the output of the Google benchmark library.

I hope this could all be done within the test-suite, probably extending test-suite/litsupport. I hope/think that no changes to LNT will be needed, but I haven't investigated myself.
For example, I'm thinking that test-suite/litsupport could end up reporting the execution time of a single iteration of the code benchmarked with the Google Benchmark library, and that will map neatly to concepts already understood by LNT.

Jun 23 2017, 7:53 AM
kristof.beyls added a comment to D34368: [AArch64] PointerRegClass should be GPR64spRegClass.

Testcase? (I think this affects inline asm.)

This change is actually laying ground for fixing an inline asm bug (PR33134). Test cases will be added along with rest of the fix in a follow up change.

Jun 23 2017, 7:43 AM
kristof.beyls added reviewers for D34362: [LNT] Support for different DataSet usage in Polybench for "lnt runtest nt": MatzeB, cmatthews.

I think nt.py is deprecated, and meant to be replaced by test_suite.py.

Maybe someone who already contributed to lnt should be necessary for acceptance.

Jun 23 2017, 2:23 AM

Jun 20 2017

kristof.beyls added a comment to D34132: Add scripts to perform LNT submissions.

Thanks for sharing your thoughts - that makes things quite a bit clearer to me.

Jun 20 2017, 7:03 AM

Jun 19 2017

kristof.beyls added a comment to D34132: Add scripts to perform LNT submissions.

There is quite a bit of code that is duplicated from LNT to test-suite as part of this patch.
I'm not entirely sure if having this much code duplication between LNT and test-suite is desirable, unless this is part of a bigger redesign, like we discussed in January in the thread starting roughly at http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20170123/423073.html, with a nicer design in the end?

Jun 19 2017, 7:13 AM

Jun 1 2017

kristof.beyls added a comment to D33758: [globalisel][tablegen] Partially fix compile-time regressions by converting matcher to state-machine(s).

Replace the matcher if-statements for each rule with a state-machine. This
significantly reduces compile time, memory allocations, and cumulative memory
allocation when compiling AArch64InstructionSelector.cpp.o after r303259 is
recommitted.

My guess is that the compiler compile time goes down simply because tablegen produces far less code?

That's right. Instead of generating C++ and having the compiler optimize a single massive function, it's compiling a single copy of each predicate and chaining them together with a table.

Do you also have an insight of how using a state machine "interpreter" changes the run-time of the compiler itself compared to using the "directly-compiled" generated C++ code?
I can't predict really how the different tradeoffs in both approaches will affect the compiler's run-time performance in the end, so it would be nice to share insights you may have for this.

I haven't measured it yet since the priority so far has been to get the compile times back down to a reasonable level. Once I've finished posting the patches for review I'll take a look at the run-time effect.

My intuition is that the compiler was unlikely to have optimized the previous implementation very well (particularly since it didn't inline the single-use lambdas) and assuming that's the case, the overhead of the interpreter is likely to be limited to the cost of the table lookups. It's possible that it will turn out to be faster since the previous implementation had no chance of fitting into an instruction cache whereas the new code may fit on some machines. On machines where it doesn't fit, some paths are more common than others so it may still be able to use the instruction cache better than the old implementation did.

Also, if we end up using a state machine for this, I think it'd be a good idea to, after the design settles, write a blog post or something similar to explain how the state machine works, as I've heard from many people that reverse engineering the DAGISel-produced state machine was hard due to limited or no documentation on it.

Ok.

Jun 1 2017, 1:22 AM
kristof.beyls added a comment to D33758: [globalisel][tablegen] Partially fix compile-time regressions by converting matcher to state-machine(s).

Replace the matcher if-statements for each rule with a state-machine. This
significantly reduces compile time, memory allocations, and cumulative memory
allocation when compiling AArch64InstructionSelector.cpp.o after r303259 is
recommitted.

Jun 1 2017, 12:26 AM
kristof.beyls added a reviewer for D33734: [AArch64] Add fallback in FastISel fp16 conversions: SjoerdMeijer.
Jun 1 2017, 12:18 AM

May 30 2017

kristof.beyls abandoned D29494: Mostly split the StackProtect Pass into an Analysis Pass and a Transformation Pass..

Abandoning this as my motivation for this (fixing PR30324) has gone away, as that PR got fixed in another way, in r303360.
Splitting the StackProtector pass into a separate analysis and transformation is probably still nice, but probably also better pushed forward by someone with a stronger interest in the stack protector pass and/or someone who is more of an expert in this area.

May 30 2017, 12:26 AM
kristof.beyls closed D33443: Fix PR33031: Cannot scavenge register without an emergency spill slot..

committed in r304196.

May 30 2017, 12:20 AM

May 29 2017

kristof.beyls added inline comments to D33443: Fix PR33031: Cannot scavenge register without an emergency spill slot..
May 29 2017, 8:06 AM
kristof.beyls updated the diff for D33443: Fix PR33031: Cannot scavenge register without an emergency spill slot..

Thanks all for the review comments!

May 29 2017, 7:52 AM

May 25 2017

kristof.beyls accepted D33544: [lnt][profile] Add support for X86-64 CFG view.

LGTM, modulo considering a call instruction to be ending a basic block.
I don't know enough details about the X86_64 instruction set to judge whether the regular expressions are fully correct, but they look plausible to me.

May 25 2017, 5:45 AM

May 24 2017

kristof.beyls accepted D33343: Add some tips on how to benchhmark.

Looks like a reasonable start to me (with one more nit-pick).
Let's get this in as it's valuable as it is.
It can be improved further incrementally.

May 24 2017, 2:39 AM

May 23 2017

kristof.beyls created D33443: Fix PR33031: Cannot scavenge register without an emergency spill slot..
May 23 2017, 7:50 AM

May 19 2017

kristof.beyls updated subscribers of D33343: Add some tips on how to benchhmark.
May 19 2017, 11:53 AM
kristof.beyls added inline comments to D33343: Add some tips on how to benchhmark.
May 19 2017, 11:52 AM

May 16 2017

kristof.beyls added a comment to D33222: [LegacyPassManager] Remove TargetMachine constructors.

Just wanted to point to https://reviews.llvm.org/D29494, which is a patch that isn't fully polished to split the StackProtector pass into an Analysis pass and a Transformation pass.
My motivation for that was also fixing PR30324.

May 16 2017, 1:54 AM

May 11 2017

kristof.beyls added a comment to D32529: [GlobalISel] Make multi-step legalization work..

Thanks for the review on the ARM-specific part, Diana!
@t.p.northover : would you have time for a quick glance at the target-independent parts in LegalizerInfo.h, LegalizerInfo.cpp and LegalizerInfoTest.cpp?

May 11 2017, 7:54 AM

May 10 2017

kristof.beyls accepted D32963: [AARCH64] Fix a comment to match the code. NFC..

LGTM, even if you don't implement the nitpicky suggestions I made.

May 10 2017, 1:57 AM

May 9 2017

kristof.beyls added inline comments to D31965: [SLP] Enable 64-bit wide vectorization for Cyclone.
May 9 2017, 5:24 AM

May 3 2017

kristof.beyls added a comment to D32706: [AArch64] Consider widening instructions in cast cost calculation.

Addressed Kristof's comments.

  • Changed wording to refer to "widening" instructions rather than "lengthening" instructions.
  • Changed function name to is isWideningFree.
  • Updated the logic to consider the W instruction variants in addition to the L variants.
  • Added code generation tests along side the cost test cases.
  • Added additional tests for the W variants.
May 3 2017, 12:42 AM
kristof.beyls added a comment to D31965: [SLP] Enable 64-bit wide vectorization for Cyclone.

Hey Matt,

Hi Adam,

I'm not sure where the conversation about this patch landed, but I'm fine with it being a Cyclone only change for now if that's what you prefer. I haven't had a chance to evaluate it on our cores yet. But when I do, I can easily set MinVectorRegisterBitWith if there's any benefit. How does compile-time look?

There is no measurable compile-time change for AArch64 (testsuite, ctmark, spec).

I believe the idea was to try to enable this for all subtargets, assuming you and @evandro can test this. On the other hand, it's been almost a month and I'd like to wrap this up. So perhaps we should enable this for Cyclone and A57 for now and then perhaps file bugs for the remaining subtargets to evaluate this.

How does that sound, @rengolin and others?

May 3 2017, 12:19 AM

May 2 2017

kristof.beyls added inline comments to D32706: [AArch64] Consider widening instructions in cast cost calculation.
May 2 2017, 8:01 AM
kristof.beyls added inline comments to D32706: [AArch64] Consider widening instructions in cast cost calculation.
May 2 2017, 12:31 AM

Apr 27 2017

kristof.beyls updated subscribers of D32272: [XRay] Add Google Benchmark library + initial XRay benchmarks.

Could you explain the rationale for adding libs/benchmarks-1.1.0?
Also, since that seems to have a different license, you probably need to add that directory to the top-level LICENSE.TXT file as one of the sub-directories with "additional or alternate copyrights,licenses, and/or restrictions".

Apr 27 2017, 4:03 AM

Apr 26 2017

kristof.beyls added a comment to D32529: [GlobalISel] Make multi-step legalization work..

Thanks for splitting this out! The ARM part looks excellent :)
I think you could add some target-independent tests for this in unittests/CodeGen/GlobalISel/LegalizerInfoTest.cpp, to make sure it works to WidenScalar/MoreElements etc to a type that has a Libcall or Custom action. That way we won't have to depend on ARM for testing this functionality.

Apr 26 2017, 9:53 AM
kristof.beyls updated the diff for D32529: [GlobalISel] Make multi-step legalization work..

Adding a unit test.

Apr 26 2017, 9:53 AM
kristof.beyls added a comment to D28152: Cortex-A57 scheduling model for ARM backend (AArch32).

I've also run the latest version of this patch with command line options '-O3 -DNDEBUG -mcpu=cortex-a57 -mthumb -fomit-frame-pointer' on the test-suite and a range of other benchmarks.
I do see the same few performance regressions as reported by Andrew, for the same reasons; and many performance improvements.
Overall, I see a geomean performance improvement of 1.70% for the benchmarks reporting execution time and a 1.63% performance improvement for the benchmarks reporting scores.

Apr 26 2017, 7:30 AM
kristof.beyls created D32529: [GlobalISel] Make multi-step legalization work..
Apr 26 2017, 4:00 AM

Apr 24 2017

kristof.beyls retitled D32438: [LNT] Document how to capture linux perf profiles. from Document how to capture linux perf profiles. to [LNT] Document how to capture linux perf profiles..
Apr 24 2017, 9:37 AM
kristof.beyls created D32438: [LNT] Document how to capture linux perf profiles..
Apr 24 2017, 9:26 AM

Apr 21 2017

kristof.beyls added a comment to D32076: [ARM] Add big.LITTLE mcpu tuning options.

Adding big.LITTLE mcpu options for clang to be more command-line compatible with gcc is a good idea.
Mapping the code generation for it to what is done for the little core also is a defendable position.
I'm just wondering if in some of the implementation of this there could be less copy-paste? I've made a few specific comments in some places of the code where ideally there should be less copy-paste.

Apr 21 2017, 9:43 AM
kristof.beyls added inline comments to D31750: [globalisel] Enable tracing the legalizer with --debug-only=legalize-mir.
Apr 21 2017, 12:11 AM

Apr 19 2017

kristof.beyls closed D31934: [GlobalISel] Support vector-of-pointers in LLT.

Committed in r300664

Apr 19 2017, 8:26 AM
kristof.beyls accepted D31750: [globalisel] Enable tracing the legalizer with --debug-only=legalize-mir.
Apr 19 2017, 12:27 AM

Apr 18 2017

kristof.beyls added inline comments to D31750: [globalisel] Enable tracing the legalizer with --debug-only=legalize-mir.
Apr 18 2017, 5:44 AM
kristof.beyls added a comment to D32048: ClamAV: Copy zlib into ClamAV benchmark.

If a non-trivial amount of time is spent in zlib for clamav and we would want to retain benchmarking of the zlib library, maybe it'd be better to add zlib (with appropriate test input) as a separate benchmark to the test-suite?

There is already spec2000/164.gzip which has pretty much the same inflation/deflation code as zlib. Also this patch is not about adding a new benchmark but simply getting rid of an external dependency!

Apr 18 2017, 2:58 AM
kristof.beyls updated the diff for D31934: [GlobalISel] Support vector-of-pointers in LLT.

Move public methods to all be in the same location, as suggested by Ahmed, now that the implementation details are hidden more in private methods.

Apr 18 2017, 1:10 AM

Apr 14 2017

kristof.beyls added a comment to D32015: Make globalaa-retained.ll test catching more cases..

I have to note that after http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20170410/444886.html
[llvm] r300200 - Re-apply "[GVNHoist] Move GVNHoist to function simplification part of pipeline."
that test starts failing.

Apr 14 2017, 2:53 AM
kristof.beyls accepted D32015: Make globalaa-retained.ll test catching more cases..

LGTM, Thanks!

Apr 14 2017, 2:32 AM
kristof.beyls added inline comments to D32015: Make globalaa-retained.ll test catching more cases..
Apr 14 2017, 12:45 AM
kristof.beyls added a comment to D30521: Introduce llc/ExecuteTestCommands pass.

Currently, I really don't understand why this is the right approach...

We sometimes want to test a specific codegen API rather than a whole pass.

This adds a special pass to llc that features a minimalistic scripting language to
call some predefined API functions so we can test the API with the usual lit+FileCheck tools.

If you're actually trying to test a specific API, why aren't unittests the correct approach?

If the problem is that the unittests are hard to write, why not write libraries to make authoring the unittest easier?

I think there is something fundamental about how we write tests in llvm, and why we do not write pass tests in a unit testing framework because hey, who stops us implementing llc/opt as some functions in a unit testing library, they mostly have trivial APIs:

I think there is a beauty and we are hitting a sweet spot in the way most of our testing works in llvm: Forcing tests to be a simple human readable text file keeps things approachable and understandable. Understanding the basics of lit, llc, opt and FileCheck is easy. Having the full power of C++ in a unit test library on the other hand makes it way too easy do "fancy" stuff that is harder to understand because it happens to abstracted away in generic code and multiple layers of APIs that you have to learn and dig through when you actually have to understand a bug.

I don't see how things being abstracted away in a scripting language and complex fake "pass" infrastructure is going to be substantially better here.

Essentially, I totally understand prefering FileCheck and friends when testing something that is reasonably cohesive as a *pass*. But I feel like the point at which a scripting language or other things are being used to handle the fact that we're actually testing a specific API, I would *much rather* just write a unittest against that API directly.

I don't see why wanting FileCheck for things that aren't a pass isn't valid.

I don't think that pass vs. non-pass is the right distinction.

I think FileCheck makes sense when there is already a fundamental reason to have textual output that is sufficiently stable to use in tests. This can be everything from having an IR to having a clear and precise printing structure for an analysis. Those have utility *outside* of tests and also are great tools for testing.

But when the thing we are testing, fundamentally, is *an API*, I think we should test it *by calling that API*.

Admittedly I am not happy to introduce a scripting language here, but it seemed like the pragmatic choice compared with the alternative of building up a pile of small passes for every test or putting .mir code into C++ string literals in a unittest and still lacking all the llc mechanics for bringing up a target or having the nice matching capabilities of FileCheck.

Again, I am very much in favor of having the facilities from 'llc' or 'FileCheck' in useful APIs for unittests to leverage. But I don't think we should be adding a scripting language in order to write tests in a language other than C++ to exercise C++ API calls. I think this is significantly more burdensome than just writing unittests.

Sorry for the slow response, I think we may just fundamentally disagree about the right approach here. Currently, the approach you are proposing would make the resulting tests nearly incomprehensible to me... On the other hand, I think unittests (which are "just" C++ code even if sometimes non-obvious C++ code) *when used well* are much more friendly to people outside of the particular area than a MIR-specific scripting framework.

Apr 14 2017, 12:27 AM