This is an archive of the discontinued LLVM Phabricator instance.

[llvm-profgen] Filter out invalid LBR ranges.
ClosedPublic

Authored by hoy on Apr 6 2022, 6:20 PM.

Details

Summary

The profiler can sometimes give us a LBR trace that implicates bogus code ranges. For example,

0xc5acb56/0xc66c6c0 0xc628195/0xf31fbb0 0xc611261/0xc628130 0xc5c1a21/0xc6111c0 0x1f7edfd3/0xc5c3a50 0xc5c154f/0x1f7edec0 0xe8eed07/0xc5c11e0

, note that the first two pairs are supposed to form a linear execution range, in this case, it is [0xf31fbb0, 0xc5acb56] , which doesn't make sense.

Such bogus ranges should be ruled out to avoid generating a bad profile. I'm fixing this for both CS and non-CS cases.

Diff Detail

Event Timeline

hoy created this revision.Apr 6 2022, 6:20 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 6 2022, 6:20 PM
Herald added subscribers: modimo, wenlei. · View Herald Transcript
hoy requested review of this revision.Apr 6 2022, 6:20 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 6 2022, 6:20 PM
hoy retitled this revision from [llvm-profgen] Ignore invalid LBR ranges. Summary: to [llvm-profgen] Filter out invalid LBR ranges..Apr 6 2022, 6:24 PM
hoy edited the summary of this revision. (Show Details)
hoy added reviewers: wenlei, wlei.
wenlei added inline comments.Apr 6 2022, 8:39 PM
llvm/tools/llvm-profgen/PerfReader.cpp
288–289

Is it critical to skip unwinding instead of ignore the entire record for invalid range? If not, perhaps it's better to keep it simple and just ignore such record. I imagine such invalid ranges are rare.

298–299

Theoretically any kind of unwind can discover bad state, so this is not unique to unwindLinear. I suggest we have an invalid state flag for UnwindState, which gets checked for all isXXState (may also added isValidState). This way we also keep the API for unwindXX consistent, and also avoid special checking after unwindLinear.

1182

nit: above->after

1197

I think the choice to use unique ranges was intentional. What motivated this change?

1213–1216

Such invalid ranges doesn't seem to be recorded anyways, how do we get those ranges?

1220

It is unclear to user what "bogus" means. And one could argue that ranges crossing function boundaries (one above) is also bogus. Suggest to be specific about what this is so user knows what this means, e.g. ... of profiled ranges have range start after range end.

llvm/tools/llvm-profgen/PerfReader.h
301

What motivated this change? Conceptually parent frame of dummy root should indeed be nullptr.

hoy added inline comments.Apr 7 2022, 11:09 AM
llvm/tools/llvm-profgen/PerfReader.cpp
288–289

Sort of. With one of internal large services we have about 5% of such invalid ranges. Ignoring the whole trace where such ranges are in may not be trivial.

warning: 5.13%(77654546/1512342539) of profiled ranges is bogus.
298–299

Sounds good.

1197

Can you remind me of why it was intentional?

I think it's useful to give a number of how many samples are invalid, though some of them could be the same. This reflects the portion of samples will be dropped.

1213–1216

Right, they'll be abandoned. The warning here is to tell how many such samples would be dropped so we kinda getting a sense how important the issue is.

1220

Done.

llvm/tools/llvm-profgen/PerfReader.h
301

It is to handle the diversion of probe samples and dwarf samples. With probe, LBR ranges are reported towards its parent frame (calling context), while w/o probe, ranges are reported towards the leaf frame. See VirtualUnwinder::unwindLinear. When unwinding is at a bad state, the call stack is reset to DummyTrieRoot, thus leading to a null parent technically for the probe case which will cause AV here and there.

hoy updated this revision to Diff 421289.Apr 7 2022, 11:09 AM
hoy edited the summary of this revision. (Show Details)

Addressing comments.

wenlei added inline comments.Apr 7 2022, 4:47 PM
llvm/tools/llvm-profgen/PerfReader.cpp
288–289

Is this happening for just one workload or more often across different workloads?

304

nit: put this towards the end of the if-else sequence as it's supposed to be a corner case.

} else if (isValidState(State)) {
    // Unwind branches
    ...
   unwindBranch(State);
}
else {
    // Invalid state
}
305–308

all ranges will be counted towards the root context.

When we skip unwinding, we still have the context carried with probe/dwarf, so it's not always going to be base context, right?

1197

How many unique instance of one issue is an indicator of how wide spread such issue is. The way we report these stats are also aligned with how it was implemented - one "profiled range" is one unique range no matter that count it has.

In this case, there's 5% by your counting, what's the % with original counting?

1213–1216

My question was basically this - if such ranges are dropped earlier, how do we even see ranges whose start > end here?

Looked into this, I guess the answer is we warn on those before they are dropped?

llvm/tools/llvm-profgen/PerfReader.h
301

How many call sites need to guard against this? As an API, I think getParentFrame better strictly do what it says if possible. But if that means sprinkling a check in a dozen places, maybe it's ok to leave it in this state.

hoy added inline comments.Apr 7 2022, 6:03 PM
llvm/tools/llvm-profgen/PerfReader.cpp
288–289

So far I've only worked on a single workload. The other workloads all suffer from the issue, though not clear what's the percentage.

304

That'd require to change isCallState ... to. check for the invalid flag. It only checks if the current ip corresponds to a call inst.

305–308

Yes, ranges themselves carry inline contexts. The wording may be confusing. I'm changing it to "the rest of the trace can still be processed as if they do not have a calling context."

1197

How many unique instance of one issue is an indicator of how wide spread such issue is. The way we report these stats are also aligned with how it was implemented - one "profiled range" is one unique range no matter that count it has.

Not sure but feel like knowing the real number of samples to be dropped is more helpful, at least it tells how many samples we are losing in reality. For example, we aggregate samples by their look and we have ninety nine A and one B. If B is invalid, previous implementation will tell 50% of samples are invalid, while the current one tells 1% is invalid. Feel like knowing 1% is more helpful than 50%. WDYT?

In this case, there's 5% by your counting, what's the % with original counting?

Original counting is about 4%. Not a big difference.

1213–1216

Yes, this code runs before we actually process the ranges.

llvm/tools/llvm-profgen/PerfReader.h
301

On the second thought, here shouldn't be changed. An assumption of the virtual unwinding is that there always be a leaf frame in the stack. checkStateConsistency basically verifies that. With invalid state we can still honor that assumption by pushing the broken range's IP onto the stack so that nothing else will be changed.

hoy updated this revision to Diff 421377.Apr 7 2022, 6:04 PM

Updating D123271: [llvm-profgen] Filter out invalid LBR ranges.

wenlei added inline comments.Apr 7 2022, 6:14 PM
llvm/tools/llvm-profgen/PerfReader.cpp
304

Yes, I think we should change all isXXState to check validness first anyways. An invalid state isn't any of the valid states.

305–308

they do not have a calling context. -> they do not have stack sample?

1197

Ok, makes sense. Then we need to change the message, something like " .. samples are from ranges ..", instead of " ... profiled ranges.. "

hoy added inline comments.Apr 7 2022, 7:38 PM
llvm/tools/llvm-profgen/PerfReader.cpp
304

Sounds good.

305–308

done.

hoy updated this revision to Diff 421389.Apr 7 2022, 7:38 PM

Updating D123271: [llvm-profgen] Filter out invalid LBR ranges.

wenlei accepted this revision.Apr 7 2022, 8:28 PM

lgtm, thanks.

This revision is now accepted and ready to land.Apr 7 2022, 8:28 PM
This revision was landed with ongoing or failed builds.Apr 7 2022, 9:42 PM
This revision was automatically updated to reflect the committed changes.