This is an archive of the discontinued LLVM Phabricator instance.

[scudo] Split the code path of memory tagging out from allocate()
Needs RevisionPublic

Authored by Chia-hungDuan on Sep 1 2023, 5:19 PM.

Details

Reviewers
cferris
pcc
Summary

Memory tagging is a flag which requires system reboot to enable. Which
means, the code paths with memory tagging enabled will never be executed
if it's set to off. However, we only mark those paths with UNLIKELY()
which doesn't annotate the expectation properly here. As a result, the
assembly code always interleaves instructions between w/ and w/o memory
tagging enabled. The direct impact is the I-cache may always cache many
unused instructions.

This change explictily splits the paths into different code blocks. This
slightly introduces very few duplicated codes but it creates two
independent execution paths and will improve the cache locality.

Diff Detail

Event Timeline

Chia-hungDuan created this revision.Sep 1 2023, 5:19 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 1 2023, 5:19 PM
Herald added subscribers: yaneury, Enna1. · View Herald Transcript
Chia-hungDuan requested review of this revision.Sep 1 2023, 5:19 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 1 2023, 5:19 PM
Herald added a subscriber: Restricted Project. · View Herald Transcript
cferris requested changes to this revision.Sep 20 2023, 6:20 PM
cferris added inline comments.
compiler-rt/lib/scudo/standalone/combined.h
998

In the original code path, this is only done if ClassId is non-zero. I'm pretty sure if ClassId is zero, that means you did a secondary allocation so this memset is going to be a repeat of the one that happens in the secondary.

1004

I think this function is missing some code. In the original path, there are calls to addHeaderTag. I think you would need to add:

*Block = addHeaderTag(*Block);
void *Ptr = addHeaderTag(Ptr);

Then down below, the storeHeader should use the Ptr value, not the UserPtr value. Since Ptr is not used anywhere else, you could probably just do the call directly in the storeHeader call.

Since the tags are ignored unless MTE is enabled, it's really only adding a fixed tag. So missing this code doesn't cause any correctness issues.

This revision now requires changes to proceed.Sep 20 2023, 6:20 PM
Chia-hungDuan marked an inline comment as done.

Address review comment

compiler-rt/lib/scudo/standalone/combined.h
998

Nice catch!

1004

I misunderstood what allocatorSupportsMemoryTagging means. I thought the addHeaderTag will return the same pointer but in fact, it adds the fixed tag here. Ideally, I would like to get rid of the addHeaderTag as well but it will be better to keep it now. I'll do it in a separate CL

cferris accepted this revision.Sep 26 2023, 6:21 PM

LGTM.

This revision is now accepted and ready to land.Sep 26 2023, 6:21 PM
pcc requested changes to this revision.Sep 26 2023, 6:24 PM
pcc added a subscriber: pcc.

Have you actually measured the performance impact of this change?

This revision now requires changes to proceed.Sep 26 2023, 6:24 PM

Have you actually measured the performance impact of this change?

If you are asking the numbers from benchmarks, yes, there's 0.5% improvement on Geekbench 5 (on pixel 7 pro)and 1.5% improvement on some internal performance load tests (on cloud machine with x86_64). If you would like to know if we observe the reduction of i-cache miss rate, then I don't have the data. Do you have any suggestions on this?

pcc added a comment.Sep 26 2023, 6:48 PM

Did you also collect standard deviation? It could have just been measurement noise.

That x86 number sounds very suspicious, we don't even support memory tagging on x86 so all of the memory tagging code should have been compiled out either way. At best you're measuring the arbitrary effect of splitting this code into two functions.

Did you also collect standard deviation? It could have just been measurement noise.

That x86 number sounds very suspicious, we don't even support memory tagging on x86 so all of the memory tagging code should have been compiled out either way. At best you're measuring the arbitrary effect of splitting this code into two functions.

Sorry, I made a mistake here on x86, I was under the impression that the MaySupportMemoryTagging=true kept all the memory tagging code paths when I did the measurement. I dug out the raw data and it seems that I included the other optimizations in that measurement by accident. Please ignore the data there. Thanks for pointing it out!

The data from Geekbench 5 (sorry, there's another typo, it's measured on a MTE device, not pixel 7 pro) shows constantly slightly improvements in 6 runs. Given that the mixing code of memory tagging path may not be ideal when it can only be turned on/off at system start, thus I didn't keep more detailed statistics.

If you do have concern on this, I could hold it and redo the measurement later.

pcc added a comment.Sep 27 2023, 12:51 PM

Please do the measurements again and use a tool such as ministat to prove that the improvement is statistically significant.

If we go ahead with this change, there are better ways of doing this than duplicating the code. For example you could make initChunk a template function that takes a bool template parameter indicating whether to enable memory tagging.

Please do the measurements again and use a tool such as ministat to prove that the improvement is statistically significant.

If we go ahead with this change, there are better ways of doing this than duplicating the code. For example you could make initChunk a template function that takes a bool template parameter indicating whether to enable memory tagging.

Sure, I'll do that. But I may argue that even without statistically difference (no regression of course), this may be something still worth to do. Given that the interleaving of *never-executed* code path seems not a good idea.

The reason I use the different name instead of template I would like to separate the logic of this two paths. Not only for the result of compilation, it'll be easier when we want to read the code of some code paths, we don't need to carefully walk through the conditional branches.