This is an archive of the discontinued LLVM Phabricator instance.

[llvm-exegesis] Exploring X86::OperandType::OPERAND_COND_CODE
ClosedPublic

Authored by lebedev.ri on Feb 6 2020, 12:25 PM.

Details

Summary

Currently, we only have nice exploration for LEA instruction,
while for the rest, we rely on randomizeUnsetVariables()
to sometimes generate something interesting.
While that works, it isn't very reliable in coverage :)

Here, i'm making an assumption that while we may want to explore
multi-instruction configs, we are most interested in the
characteristics of the main instruction we were asked about.

Which we can do, by taking the existing randomizeMCOperand(),
and turning it on it's head - instead of relying on it to randomly fill
one of the interesting values, let's pregenerate all the possible interesting
values for the variable, and then generate as much InstructionTemplate
combinations of these possible values for variables as needed/possible.

Of course, that requires invasive changes to no longer pass just the
naked Instruction, but sometimes partially filled InstructionTemplate.

As it can be seen from the test, this allows us to explore
X86::OperandType::OPERAND_COND_CODE for instructions
that take such an operand.
I'm hoping this will greatly simplify exploration.

Thoughts?

Diff Detail

Event Timeline

lebedev.ri created this revision.Feb 6 2020, 12:25 PM

First, thank you for the patch, this is going in the right direction.

Now stepping back a bit there are many dimensions that we'd like to explore:

  • argument values (which is what you started here),
  • register selection (registers of a class are not strictly equivalent [1])
  • snippet generation (we always select the same pattern but exploring them would help [2])
  • repetition mode (to see the impact of the decoder)

Code wise this means it would be much better to have a global sampler object responsible for how to explore these dimensions rather than the greedy approach we're heading to.
Now I understand this is a substantial redesign and I'm not asking you to do it but I just wanted to share what I believe is the right direction to lower code complexity in the long run.


[1] LEA is known to produce different latencies when using EBP, RBP, or R13 as base registers
[2] for instance

XOR EAX, EAX, EAX

is self dependent but it's also a zero idiom, if we'd also executed the back to back pattern we would have learned something new

XOR EBX, EAX, EAX
XOR EAX, EBX, EBX
llvm/tools/llvm-exegesis/lib/SnippetGenerator.cpp
91

I don't think it's worth mentioning error here.
Maybe replace the whole comment with

We reached the number of  allowed configs and return early.
llvm/tools/llvm-exegesis/lib/X86/Target.cpp
749

This function deserves some documentation

775

This comment should be at the function level (possibly at the function declaration)

First, thank you for the patch, this is going in the right direction.

Thank you for taking a look!

Now stepping back a bit there are many dimensions that we'd like to explore:

Yeah, i suspect as much :)

  • argument values (which is what you started here),

Ack.
I have started with this because this is the current itch for me;
while i'm aware of the others, this seemed most straight-forward.

  • register selection

Right. Currently unset registers are mostly picked randomly, within constraints.

(registers of a class are not strictly equivalent [1])

I think that currently can't be expressed in sched models - is that planned to change,
or we just want to know when we fail to model things?

  • snippet generation (we always select the same pattern but exploring them would help [2])

which is roughly what SerialSnippetGenerator::generateCodeTemplates()/appendCodeTemplates() does,
but somewhat more general, correct?
This is not very useful until analyze learns to deal with serial chained instructions (D60000, stuck)

  • repetition mode (to see the impact of the decoder)

This appears to be currently handled via -repetition-mode switch (D68125).
I'm unfamiliar with that.
Does this really have to be accounted for (a dimension in) in the greedy approach?

Code wise this means it would be much better to have a global sampler object
responsible for how to explore these dimensions rather than the greedy approach we're heading to.
Now I understand this is a substantial redesign and I'm not asking you to do it
but I just wanted to share what I believe is the right direction to lower code complexity in the long run.

For context, we kind-of already explore condcodes/registers, by producing them randomly;
so if we run a lot of benchmarks, we're bound to explore them,
but without good coverage/reproducibility/repeatability though.
Which is why i'm starting with this patch - even greedy is better than what currently is.

So it's not so much that I don't want to redesign as though
I'm not sure i currently fully grasp the idea behind "global sampler object".
How would that work?


[1] LEA is known to produce different latencies when using EBP, RBP, or R13 as base registers

Yeah, i saw that on bdver2 too in rG76fcf900d58826d9f21c0dd7f02b61b4d59c9193.

[2] for instance

XOR EAX, EAX, EAX

is self dependent but it's also a zero idiom, if we'd also executed the back to back pattern we would have learned something new

XOR EBX, EAX, EAX
XOR EAX, EBX, EBX

Right. For latency we see

llvm/tools/llvm-exegesis/lib/X86/Target.cpp
749

Yeah, and unit tests. this function ended up being *too* smart,
although this is the best version i was able to come up with so far.

lebedev.ri updated this revision to Diff 243278.Feb 7 2020, 1:42 PM
lebedev.ri marked 4 inline comments as done.

Thank you for taking a look!
Addressed review notes, other than the redesign into "global sampler object".

lebedev.ri updated this revision to Diff 243356.Feb 8 2020, 2:39 AM

Fallback to baseline generateInstructionVariants() instead of using the recursive generator if we aren't performing any exploration after all.

Rewrite combination generator - it is more straight-forward to model it as a
weird number where each digit may have different base.
This seems more understandable to me, and involves no recursion.

CombinationGenerator::performGeneration(): simplify loops, NFC.

(registers of a class are not strictly equivalent [1])

I think that currently can't be expressed in sched models - is that planned to change,
or we just want to know when we fail to model things?

Not in the sched models but in the target: X86FixupLEAs.

  • snippet generation (we always select the same pattern but exploring them would help [2])

which is roughly what SerialSnippetGenerator::generateCodeTemplates()/appendCodeTemplates() does,
but somewhat more general, correct?

yes

This is not very useful until analyze learns to deal with serial chained instructions (D60000, stuck)

yes, @orodley is likely to work on this.

  • repetition mode (to see the impact of the decoder)

This appears to be currently handled via -repetition-mode switch (D68125).

yes

Does this really have to be accounted for (a dimension in) in the greedy approach?

@courbet and I believe that having two measures are better than one and can demonstrate the impact of decoding the instruction.

Code wise this means it would be much better to have a global sampler object
responsible for how to explore these dimensions rather than the greedy approach we're heading to.
Now I understand this is a substantial redesign and I'm not asking you to do it
but I just wanted to share what I believe is the right direction to lower code complexity in the long run.

For context, we kind-of already explore condcodes/registers, by producing them randomly;
so if we run a lot of benchmarks, we're bound to explore them,
but without good coverage/reproducibility/repeatability though.
Which is why i'm starting with this patch - even greedy is better than what currently is.

So it's not so much that I don't want to redesign as though
I'm not sure i currently fully grasp the idea behind "global sampler object".
How would that work?

So the GlobalSampler™ would be instantiated early (in the main). It would generate a single number (random or manually selected) which would be mapped to N dimensions (something like hilbert curve or z-curve). The value would be passed down to each function that needs to randomize a choice so they can deterministically select the values.
Does this make sense?

gchatelet added inline comments.Feb 11 2020, 6:49 AM
llvm/tools/llvm-exegesis/lib/SnippetGenerator.h
119

This is the idea but it should generate the values on the go because the space to explore might be too big to enumerate all the possibilities.

Sorry if it's a bit hand-wavy for now, I haven't sorted out the details but we should be able to use space filling curves to that end (article).

lebedev.ri added a comment.EditedFeb 11 2020, 7:03 AM

Let me know how to proceed.

I know this doesn't scale, because i have designed it this way,
because i don't really expect it to be used for exhaustive sweeping,
but only exploring a few operands, because this is the real problem i have :]

Is the traditional iterative approach not acceptable for this tool,
and the intention is to only get the the One True Solution,
even if it potentially takes ages before it's there?

Let me know how to proceed.

I know this doesn't scale, because i have designed it this way,
because i don't really expect it to be used for exhaustive sweeping,
but only exploring a few operands, because this is the real problem i have :]

Is the traditional iterative approach not acceptable for this tool,
and the intention is to only get the the One True Solution,
even if it potentially takes ages before it's there?

I'm confused by your answer.

I am not against this patch, I just want to explain where I'd like the tool to go in the long run.
And since you responded to my comment with a code change and the creation of the CombinationGenerator I thought you were trying to implement what I suggested. Hence my answers.

If you don't intend to implement it, that's fine.

I've added a few comments.

llvm/tools/llvm-exegesis/lib/X86/Target.cpp
788

formatting is weird here, maybe extract the lambda out of the constructor.

llvm/unittests/tools/llvm-exegesis/SnippetGeneratorTest.cpp
37 ↗(On Diff #243386)

https://github.com/google/googletest/blob/master/googlemock/docs/cheat_sheet.md#container-matchers

You should be able to write

ASSERT_THAT(Variants, ElementsAreArray(ExpectedVariants));

or

ASSERT_THAT(Variants, ElementsAreArray({
      {0, 2},
      {0, 3},
      {1, 2},
      {1, 3},
  }));
58 ↗(On Diff #243386)

ditto here and below

186 ↗(On Diff #243386)
ASSERT_THAT(Variants, IsEmpty());

Let me know how to proceed.

I know this doesn't scale, because i have designed it this way,
because i don't really expect it to be used for exhaustive sweeping,
but only exploring a few operands, because this is the real problem i have :]

Is the traditional iterative approach not acceptable for this tool,
and the intention is to only get the the One True Solution,
even if it potentially takes ages before it's there?

I'm confused by your answer.

That is because i'm confused by the feedback so far :)
That might be because of the differences in the dialogue structure from the usual in llvm.
(sometimes previously elsewhere this cautious non-committing-to-outcome feedback
meant polite 'no thank you' without actually stating as much)

I am not against this patch, I just want to explain where I'd like the tool to go in the long run.
And since you responded to my comment with a code change and the creation of the CombinationGenerator I thought you were trying to implement what I suggested. Hence my answers.

If you don't intend to implement it, that's fine.

I've added a few comments.

lebedev.ri marked 3 inline comments as done.

Addressed review notes.

gchatelet added inline comments.Feb 12 2020, 12:59 AM
llvm/unittests/tools/llvm-exegesis/SnippetGeneratorTest.cpp
178 ↗(On Diff #244083)

It's not really empty then. maybe rename with Singleton

lebedev.ri marked 3 inline comments as done.

Addressed review notes - renamed one test to Singleton.

lebedev.ri added inline comments.Feb 12 2020, 1:09 AM
llvm/tools/llvm-exegesis/lib/X86/Target.cpp
788

Hm, rerun clang-format on this, and this time it produced different result. Is this better?

llvm/unittests/tools/llvm-exegesis/SnippetGeneratorTest.cpp
186 ↗(On Diff #243386)

But it's not empty, it contains a single element 0?

178 ↗(On Diff #244083)

By empty i meant that the "exploration" space is empty - there is only a single possibility.

This revision is now accepted and ready to land.Feb 12 2020, 1:27 AM

LGTM

Okay, thank you for the review.
It shouldn't be impossible to rip this out once more generic handling is in place.

This revision was automatically updated to reflect the committed changes.
  • repetition mode (to see the impact of the decoder)

This appears to be currently handled via -repetition-mode switch (D68125).

yes

Does this really have to be accounted for (a dimension in) in the greedy approach?

@courbet and I believe that having two measures are better than one and can demonstrate the impact of decoding the instruction.

Oh, now i see what you mean. At least for CMOV, i'm seeing wildly different results

LatencyRThroughput
duplicate10.8
loop20.6

where latency=1 seems correct, and i'd expect the througput to be close to 1/2 (since there are two execution units).

So i would personally guess that --repetition-mode= shouldn't even be an another measurement,
but instead much like what is already done by running the whole snippet a few times
and denoising the result (not averaging!), the same should be done with this - run both, take minimum
https://github.com/llvm/llvm-project/blob/c55cf4afa9161bb4413b7ca9933d553327f5f069/llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp#L38-L39

What are you thoughts on this?

...

Or, should the default be loop for RThrougput measurements?
I didn't check how much difference is there for other instructions.

  • repetition mode (to see the impact of the decoder)

This appears to be currently handled via -repetition-mode switch (D68125).

yes

Does this really have to be accounted for (a dimension in) in the greedy approach?

@courbet and I believe that having two measures are better than one and can demonstrate the impact of decoding the instruction.

Oh, now i see what you mean. At least for CMOV, i'm seeing wildly different results

LatencyRThroughput
duplicate10.8
loop20.6

where latency=1 seems correct, and i'd expect the througput to be close to 1/2 (since there are two execution units).

So i would personally guess that --repetition-mode= shouldn't even be an another measurement,
but instead much like what is already done by running the whole snippet a few times
and denoising the result (not averaging!), the same should be done with this - run both, take minimum
https://github.com/llvm/llvm-project/blob/c55cf4afa9161bb4413b7ca9933d553327f5f069/llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp#L38-L39

What are you thoughts on this?

up

  • repetition mode (to see the impact of the decoder)

This appears to be currently handled via -repetition-mode switch (D68125).

yes

Does this really have to be accounted for (a dimension in) in the greedy approach?

@courbet and I believe that having two measures are better than one and can demonstrate the impact of decoding the instruction.

Oh, now i see what you mean. At least for CMOV, i'm seeing wildly different results

LatencyRThroughput
duplicate10.8
loop20.6

where latency=1 seems correct, and i'd expect the througput to be close to 1/2 (since there are two execution units).

So i would personally guess that --repetition-mode= shouldn't even be an another measurement,
but instead much like what is already done by running the whole snippet a few times
and denoising the result (not averaging!), the same should be done with this - run both, take minimum
https://github.com/llvm/llvm-project/blob/c55cf4afa9161bb4413b7ca9933d553327f5f069/llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp#L38-L39

What are you thoughts on this?

Well that depends on how you see llvm-exegesis. It was originally designed to understand the behavior of the CPU.
To me --repetition-mode is interesting to explore sensitivity to instruction decoding.
If you mix everything and take the minimum then you lose information on the root cause and then you deduce that the minimum latency is X but it only happens if the instruction is already decoded and run in a tight loop.
I think it's important to keep this information separate. You can always take the minimum between the two during analysis though.

lebedev.ri added a comment.EditedFeb 26 2020, 4:35 AM

Thank you for replying!

  • repetition mode (to see the impact of the decoder)

This appears to be currently handled via -repetition-mode switch (D68125).

yes

Does this really have to be accounted for (a dimension in) in the greedy approach?

@courbet and I believe that having two measures are better than one and can demonstrate the impact of decoding the instruction.

Oh, now i see what you mean. At least for CMOV, i'm seeing wildly different results

LatencyRThroughput
duplicate10.8
loop20.6

where latency=1 seems correct, and i'd expect the througput to be close to 1/2 (since there are two execution units).

So i would personally guess that --repetition-mode= shouldn't even be an another measurement,
but instead much like what is already done by running the whole snippet a few times
and denoising the result (not averaging!), the same should be done with this - run both, take minimum
https://github.com/llvm/llvm-project/blob/c55cf4afa9161bb4413b7ca9933d553327f5f069/llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp#L38-L39

What are you thoughts on this?

Well that depends on how you see llvm-exegesis. It was originally designed to understand the behavior of the CPU.
To me --repetition-mode is interesting to explore sensitivity to instruction decoding.
If you mix everything and take the minimum then you lose information on the root cause and then you deduce that the minimum latency is X but it only happens if the instruction is already decoded and run in a tight loop.
I think it's important to keep this information separate.

I do agree in principle.

You can always take the minimum between the two during analysis though.

I think this has the same-ish basic issue as merging results from different --mode=s - what if there is more than one result?
Currently, we form PerInstructionStats, which accumulates min/max/averages, which is nice for when results end up being noisy.
How would we aggregate (via taking min) results from different --repetition-mode=s, without sacrificing that?
Pick all --repetition-mode=duplicate and all --repetition-mode=loop and zipper-merge each pair?

Thank you for replying!

Sure, my apologies for the lag. I'm extra busy right now.

  • repetition mode (to see the impact of the decoder)

This appears to be currently handled via -repetition-mode switch (D68125).

yes

Does this really have to be accounted for (a dimension in) in the greedy approach?

@courbet and I believe that having two measures are better than one and can demonstrate the impact of decoding the instruction.

Oh, now i see what you mean. At least for CMOV, i'm seeing wildly different results

LatencyRThroughput
duplicate10.8
loop20.6

where latency=1 seems correct, and i'd expect the througput to be close to 1/2 (since there are two execution units).

So i would personally guess that --repetition-mode= shouldn't even be an another measurement,
but instead much like what is already done by running the whole snippet a few times
and denoising the result (not averaging!), the same should be done with this - run both, take minimum
https://github.com/llvm/llvm-project/blob/c55cf4afa9161bb4413b7ca9933d553327f5f069/llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp#L38-L39

What are you thoughts on this?

Well that depends on how you see llvm-exegesis. It was originally designed to understand the behavior of the CPU.
To me --repetition-mode is interesting to explore sensitivity to instruction decoding.
If you mix everything and take the minimum then you lose information on the root cause and then you deduce that the minimum latency is X but it only happens if the instruction is already decoded and run in a tight loop.
I think it's important to keep this information separate.

I do agree in principle.

You can always take the minimum between the two during analysis though.

I think this has the same-ish basic issue as merging results from different --mode=s - what if there is more than one result?
Currently, we form PerInstructionStats, which accumulates min/max/averages, which is nice for when results end up being noisy.
How would we aggregate (via taking min) results from different --repetition-mode=s, without sacrificing that?
Pick all --repetition-mode=duplicate and all --repetition-mode=loop and zipper-merge each pair?

Right now you can simply run the benchmark twice (with loop or duplicate repetition mode), concatenate the two outputs and run them through the analysis.
We currently don't consider the flag as part of the instruction so it would be reflected into the PerInstructionStats, increasing the confidence interval if there's a discrepancy between the two.

Thank you for replying!

Sure, my apologies for the lag. I'm extra busy right now.

  • repetition mode (to see the impact of the decoder)

This appears to be currently handled via -repetition-mode switch (D68125).

yes

Does this really have to be accounted for (a dimension in) in the greedy approach?

@courbet and I believe that having two measures are better than one and can demonstrate the impact of decoding the instruction.

Oh, now i see what you mean. At least for CMOV, i'm seeing wildly different results

LatencyRThroughput
duplicate10.8
loop20.6

where latency=1 seems correct, and i'd expect the througput to be close to 1/2 (since there are two execution units).

So i would personally guess that --repetition-mode= shouldn't even be an another measurement,
but instead much like what is already done by running the whole snippet a few times
and denoising the result (not averaging!), the same should be done with this - run both, take minimum
https://github.com/llvm/llvm-project/blob/c55cf4afa9161bb4413b7ca9933d553327f5f069/llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp#L38-L39

What are you thoughts on this?

Well that depends on how you see llvm-exegesis. It was originally designed to understand the behavior of the CPU.
To me --repetition-mode is interesting to explore sensitivity to instruction decoding.
If you mix everything and take the minimum then you lose information on the root cause and then you deduce that the minimum latency is X but it only happens if the instruction is already decoded and run in a tight loop.
I think it's important to keep this information separate.

I do agree in principle.

You can always take the minimum between the two during analysis though.

I think this has the same-ish basic issue as merging results from different --mode=s - what if there is more than one result?
Currently, we form PerInstructionStats, which accumulates min/max/averages, which is nice for when results end up being noisy.
How would we aggregate (via taking min) results from different --repetition-mode=s, without sacrificing that?
Pick all --repetition-mode=duplicate and all --repetition-mode=loop and zipper-merge each pair?

Right now you can simply run the benchmark twice (with loop or duplicate repetition mode), concatenate the two outputs and run them through the analysis.
We currently don't consider the flag as part of the instruction so it would be reflected into the PerInstructionStats, increasing the confidence interval if there's a discrepancy between the two.

Right, indeed, which is why i'm asking this question in the first place :)
Because at least for me that is the opposite from the behavior i'd expect/want to have in
analysis mode - in that case one of repetition modes results in obviously-wrong(C) results,
so i'm asking if it would be reasonable to fixup that during analysis via such a zipper merge.

Thank you for replying!

Sure, my apologies for the lag. I'm extra busy right now.

  • repetition mode (to see the impact of the decoder)

This appears to be currently handled via -repetition-mode switch (D68125).

yes

Does this really have to be accounted for (a dimension in) in the greedy approach?

@courbet and I believe that having two measures are better than one and can demonstrate the impact of decoding the instruction.

Oh, now i see what you mean. At least for CMOV, i'm seeing wildly different results

LatencyRThroughput
duplicate10.8
loop20.6

where latency=1 seems correct, and i'd expect the througput to be close to 1/2 (since there are two execution units).

So i would personally guess that --repetition-mode= shouldn't even be an another measurement,
but instead much like what is already done by running the whole snippet a few times
and denoising the result (not averaging!), the same should be done with this - run both, take minimum
https://github.com/llvm/llvm-project/blob/c55cf4afa9161bb4413b7ca9933d553327f5f069/llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp#L38-L39

What are you thoughts on this?

Well that depends on how you see llvm-exegesis. It was originally designed to understand the behavior of the CPU.
To me --repetition-mode is interesting to explore sensitivity to instruction decoding.
If you mix everything and take the minimum then you lose information on the root cause and then you deduce that the minimum latency is X but it only happens if the instruction is already decoded and run in a tight loop.
I think it's important to keep this information separate.

I do agree in principle.

You can always take the minimum between the two during analysis though.

I think this has the same-ish basic issue as merging results from different --mode=s - what if there is more than one result?
Currently, we form PerInstructionStats, which accumulates min/max/averages, which is nice for when results end up being noisy.
How would we aggregate (via taking min) results from different --repetition-mode=s, without sacrificing that?
Pick all --repetition-mode=duplicate and all --repetition-mode=loop and zipper-merge each pair?

Right now you can simply run the benchmark twice (with loop or duplicate repetition mode), concatenate the two outputs and run them through the analysis.
We currently don't consider the flag as part of the instruction so it would be reflected into the PerInstructionStats, increasing the confidence interval if there's a discrepancy between the two.

Right, indeed, which is why i'm asking this question in the first place :)
Because at least for me that is the opposite from the behavior i'd expect/want to have in
analysis mode - in that case one of repetition modes results in obviously-wrong(C) results,
so i'm asking if it would be reasonable to fixup that during analysis via such a zipper merge.

I fail to see what is the expected behavior from your point of view. What should the tool do according to you?

Thank you for replying!

Sure, my apologies for the lag. I'm extra busy right now.

  • repetition mode (to see the impact of the decoder)

This appears to be currently handled via -repetition-mode switch (D68125).

yes

Does this really have to be accounted for (a dimension in) in the greedy approach?

@courbet and I believe that having two measures are better than one and can demonstrate the impact of decoding the instruction.

Oh, now i see what you mean. At least for CMOV, i'm seeing wildly different results

LatencyRThroughput
duplicate10.8
loop20.6

where latency=1 seems correct, and i'd expect the througput to be close to 1/2 (since there are two execution units).

So i would personally guess that --repetition-mode= shouldn't even be an another measurement,
but instead much like what is already done by running the whole snippet a few times
and denoising the result (not averaging!), the same should be done with this - run both, take minimum
https://github.com/llvm/llvm-project/blob/c55cf4afa9161bb4413b7ca9933d553327f5f069/llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp#L38-L39

What are you thoughts on this?

Well that depends on how you see llvm-exegesis. It was originally designed to understand the behavior of the CPU.
To me --repetition-mode is interesting to explore sensitivity to instruction decoding.
If you mix everything and take the minimum then you lose information on the root cause and then you deduce that the minimum latency is X but it only happens if the instruction is already decoded and run in a tight loop.
I think it's important to keep this information separate.

I do agree in principle.

You can always take the minimum between the two during analysis though.

I think this has the same-ish basic issue as merging results from different --mode=s - what if there is more than one result?
Currently, we form PerInstructionStats, which accumulates min/max/averages, which is nice for when results end up being noisy.
How would we aggregate (via taking min) results from different --repetition-mode=s, without sacrificing that?
Pick all --repetition-mode=duplicate and all --repetition-mode=loop and zipper-merge each pair?

Right now you can simply run the benchmark twice (with loop or duplicate repetition mode), concatenate the two outputs and run them through the analysis.
We currently don't consider the flag as part of the instruction so it would be reflected into the PerInstructionStats, increasing the confidence interval if there's a discrepancy between the two.

Right, indeed, which is why i'm asking this question in the first place :)
Because at least for me that is the opposite from the behavior i'd expect/want to have in
analysis mode - in that case one of repetition modes results in obviously-wrong(C) results,
so i'm asking if it would be reasonable to fixup that during analysis via such a zipper merge.

I fail to see what is the expected behavior from your point of view. What should the tool do according to you?

I think we may be going in circles here.
Please refer to my original comment https://reviews.llvm.org/D74156#1873657, where i say that

much like what is already done by running the whole snippet a few times
and denoising the result (not averaging!), the same should be done with this - run both (edit: repetition modes), take minimum

To that you reply in https://reviews.llvm.org/D74156#1893104

I think it's important to keep this information separate. You can always take the minimum between the two during analysis though.

To which i agree and reply in https://reviews.llvm.org/D74156#1893156 with

How would we aggregate (via taking min) results from different --repetition-mode=s, without sacrificing that?
Pick all --repetition-mode=duplicate and all --repetition-mode=loop and zipper-merge each pair?

and then https://reviews.llvm.org/D74156#1893192 with

Because at least for me that (edit: the current behavior of always averaging) is the opposite from the behavior i'd expect/want to have in
analysis mode - in that case one of repetition modes results in obviously-wrong(C) results,
so i'm asking if it would be reasonable to fixup that during analysis via such a zipper merge.

And, unless i'm missing the point, you do 180° turn on your original point from https://reviews.llvm.org/D74156#1893104, and say that no, analysis should do no such minimum

I fail to see what is the expected behavior from your point of view. What should the tool do according to you?

... did that explanation of the question i'm having made any sense?

... did that explanation of the question i'm having made any sense?

Thx for digging in the conversation !
Ok it makes more sense now.

I discussed it a bit with @courbet:

  • We want the analysis tool to stay simple so we'd rather not make it knowledgeable of the repetition mode.
  • We'd like to still be able to select either repetition mode to dig into special cases

So we could add a third min repetition mode that would run both and take the minimum. It could be the default option.
Would you have some time to look what it would take to add this third mode?

... did that explanation of the question i'm having made any sense?

Thx for digging in the conversation !
Ok it makes more sense now.

Awesome! :)

I discussed it a bit with @courbet:

  • We want the analysis tool to stay simple so we'd rather not make it knowledgeable of the repetition mode.
  • We'd like to still be able to select either repetition mode to dig into special cases

So we could add a third min repetition mode that would run both and take the minimum. It could be the default option.
Would you have some time to look what it would take to add this third mode?

If that is the preferred direction then i will try to take a look :)
Thank you for your patience.

... did that explanation of the question i'm having made any sense?

Thx for digging in the conversation !
Ok it makes more sense now.

Awesome! :)

I discussed it a bit with @courbet:

  • We want the analysis tool to stay simple so we'd rather not make it knowledgeable of the repetition mode.
  • We'd like to still be able to select either repetition mode to dig into special cases

So we could add a third min repetition mode that would run both and take the minimum. It could be the default option.
Would you have some time to look what it would take to add this third mode?

If that is the preferred direction then i will try to take a look :)
Thank you for your patience.

Posted D76921, thanks.