This is an archive of the discontinued LLVM Phabricator instance.

[lldb/DWARF] Trust CU DW_AT_low/high_pc information when building address tables
Closed · Public

Authored by labath on Apr 20 2020, 6:29 AM.

Details

Summary

The code in DWARFCompileUnit::BuildAddressRangeTable tries hard to avoid
relying on DW_AT_low/high_pc for compile unit range information, and
this logic is a big cause of llvm/lldb divergence in the lowest layers
of DWARF parsing code.

The implicit assumption in that code is that this information (as opposed to
DW_AT_ranges) is unreliable. However, I have not been able to verify
that assumption. It is definitely not true for all present-day
compilers (gcc, clang, icc), and it was also not the case for the
historic compilers that I have been able to get a hold of (thanks Matt
Godbolt).

All compilers included in my research either produced correct
DW_AT_ranges or .debug_aranges entries, or they produced no DW_AT_hi/lo
pc at all. The detailed findings are:

  • gcc >= 4.4: produces DW_AT_ranges and .debug_aranges
  • 4.1 <= gcc < 4.4: no DW_AT_ranges, no DW_AT_high_pc, .debug_aranges present. The upper version range here is uncertain as godbolt.org does not have intermediate versions.
  • gcc < 4.1: no versions on godbolt.org
  • clang >= 3.5: produces DW_AT_ranges, and (optionally) .debug_aranges
  • 3.4 <= clang < 3.5: no DW_AT_ranges, no DW_AT_high_pc, .debug_aranges present.
  • clang <= 3.3: no DW_AT_ranges, no DW_AT_high_pc, no .debug_aranges
  • icc >= 16.0.1: produces DW_AT_ranges
  • icc < 16.0.1: no functional versions on godbolt.org (some are present but fail to compile)

Based on this analysis, I believe it is safe to start trusting
DW_AT_low/high_pc information in DWARF, to remove the code for manually
reconstructing range information by traversing the DIE structure, and to
just keep the line table fallback. The only compilers where this will
change behavior are pre-3.4 clangs, which are almost 7 years old now.
However, the functionality should remain unchanged, because we will be
able to reconstruct this information from the line table, which seems to
be needed for some line-tables-only scenarios anyway (I haven't
researched this too much, but at least some compilers seem to emit
DW_AT_ranges even in these situations).
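
To illustrate the intended fallback order, here is a minimal sketch with
placeholder types -- not the literal patch code:

  #include <cstdint>
  #include <optional>
  #include <vector>

  struct Range { uint64_t base, end; }; // [base, end) file-address range

  // `die_ranges` holds whatever the unit DIE advertises via DW_AT_ranges
  // or DW_AT_low_pc/DW_AT_high_pc, if anything.
  std::vector<Range>
  BuildUnitAddressRanges(const std::optional<std::vector<Range>> &die_ranges,
                         const std::vector<Range> &line_table_ranges) {
    // 1. Trust the unit DIE's own range information when present.
    if (die_ranges)
      return *die_ranges;
    // 2. Otherwise fall back to ranges reconstructed from the line table.
    //    (Walking the whole DIE tree for DW_TAG_subprogram low/high pcs --
    //    the old intermediate step -- is what this patch removes.)
    return line_table_ranges;
  }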

In return, we simplify the code base, remove some untested code (the
only test changes are in recent tests with overly reduced synthetic
DWARF), and increase llvm convergence.


Event Timeline

labath created this revision. Apr 20 2020, 6:29 AM

The one "compiler" (DWARF generator) missing from this list is dsymutil, but I would expect it to produce reliable information, too. Given your thorough analysis, it sounds like we should do this. @clayborg probably remembers best why LLDB didn't trust this info. Assuming the original reason no longer exists, this LGTM!

If I had to guess, this might relate to the fact that LLVM puts a DW_AT_low_pc on the CU even if the CU uses discontiguous ranges - and in that case the low_pc has the constant value 0, so that all address values are resolved "relative" to that zero, making them absolute. There's some support in the DWARF spec for this being a right/good thing.

It's /possible/ that at some point LLVM didn't emit CU-level address range info (it's redundant with aranges, after all - though these days we err in the other direction, skipping aranges and just emitting CU ranges) and just emitted the zero low_pc, which might've been confusing?
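
For reference, range list entries are offsets that get rebased on the CU's base address, so a zero base makes them read as absolute addresses. Roughly, in a minimal sketch that ignores base-address-selection entries:

  #include <cstdint>
  #include <vector>

  struct RangeEntry { uint64_t begin, end; }; // offsets as stored in .debug_ranges

  // DWARF range list entries are relative to the CU base address (typically
  // DW_AT_low_pc). With a base of 0, the "relative" values are already the
  // final file addresses.
  std::vector<RangeEntry> ResolveRanges(uint64_t cu_base,
                                        const std::vector<RangeEntry> &raw) {
    std::vector<RangeEntry> resolved;
    resolved.reserve(raw.size());
    for (const RangeEntry &e : raw)
      resolved.push_back({cu_base + e.begin, cu_base + e.end});
    return resolved;
  }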

clayborg requested changes to this revision. Apr 20 2020, 9:56 PM

I see plenty of examples daily, from clang version 7, where this information is unreliable. In the first binary I tried to verify after seeing this patch, I found this error; the dump is below. This compile unit also shows what happens when the linker sets every function that was dead-stripped to address zero: we end up with many functions at address zero. And what happens if address zero is a real address? Then the first dead-stripped function that was large enough to contain the address we are trying to look up ends up claiming to own it. So I would vote to NOT change LLDB.

error: DIE has overlapping address ranges: [0x0000000000000000, 0x0000000000000008) and [0x0000000000000000, 0x000000000000001e)
error: DIE address ranges are not contained in its parent's ranges:
0x0002022e: DW_TAG_compile_unit
              DW_AT_producer	("Facebook clang version 7.0.0 (llvm: fa3433ee8def2a61195b46a32b44800c59bd21a1, cfe: e8d08cb2bc1f92443c1e64977d78ec6454370e01, compiler-rt: 9439182a2f29f42bd22126a1e0d967f9eba34616, lld: d83a49d30bca97c359a7d7995da6cc1468041cb1 e8d08cb2bc1f92443c1e64977d78ec6454370e01) (ssh://git-ro.vip.facebook.com/data/gitrepos/osmeta/external/llvm fa3433ee8def2a61195b46a32b44800c59bd21a1) (based on LLVM 7.0.0)")
              DW_AT_language	(DW_LANG_C_plus_plus)
              DW_AT_name	("xplat/IGL/src/igl/opengl/Device.cpp")
              DW_AT_stmt_list	(0x00004f60)
              DW_AT_comp_dir	(".")
              DW_AT_GNU_pubnames	(true)
              DW_AT_low_pc	(0x0000000000000000)
              DW_AT_ranges	(0x000010b0
                 [0x000045c4, 0x000045e8)
                 [0x000045e8, 0x00004610)
                 [0x00004610, 0x00004612)
                 [0x00000000, 0x0000001e)
                 [0x00000000, 0x00000008)
                 [0x00000000, 0x00000004)
                 [0x00004612, 0x00004660)
                 [0x00004660, 0x0000466a)
                 [0x0000466a, 0x0000469e)
                 [0x0000469e, 0x000046a6)
                 [0x000046a6, 0x000046da)
                 [0x000046da, 0x00004772)
                 [0x00004774, 0x000047a0)
                 [0x000047a0, 0x000047c8)
                 [0x000047c8, 0x000047d8)
                 [0x000047d8, 0x00004826)
                 [0x00004826, 0x00004830)
                 [0x00004830, 0x0000487e)
                 [0x0000487e, 0x00004888)
                 [0x00004888, 0x00004890)
                 [0x00004890, 0x000048f0)
                 [0x000048f0, 0x000048fc)
                 [0x000048fc, 0x0000490e)
                 [0x0000490e, 0x00004932)
                 [0x00003388, 0x00003398)
                 [0x00003398, 0x00003410)
                 [0x00003340, 0x00003350)
                 [0x00003350, 0x00003388)
                 [0x00004932, 0x0000495c)
                 [0x0000495c, 0x00004990)
                 [0x00003ea0, 0x00003eec)
                 [0x00004990, 0x00004996)
                 [0x00004996, 0x000049a8)
                 [0x000049a8, 0x000049ac)
                 [0x000049ac, 0x000049b8)
                 [0x000049b8, 0x000049bc)
                 [0x000049bc, 0x000049c0)
                 [0x000049c0, 0x00004a08)
                 [0x00004a08, 0x00004a0e)
                 [0x00004a10, 0x00004a40)
                 [0x00004a40, 0x00004a5c)
                 [0x00004a5c, 0x00004a62)
                 [0x00004a64, 0x00004a80)
                 [0x00004a80, 0x00004a82)
                 [0x00004a82, 0x00004a86)
                 [0x00004a86, 0x00004a8e)
                 [0x00004a8e, 0x00004a92)
                 [0x00004a92, 0x00004a96)
                 [0x00004a96, 0x00004aa6)
                 [0x00004288, 0x000042b0)
                 [0x000042be, 0x000042d4)
                 [0x000042d4, 0x000042e4)
                 [0x000042e4, 0x000042fa)
                 [0x00004aa6, 0x00004ace)
                 [0x00004ace, 0x00004ade)
                 [0x00004ade, 0x00004b08)
                 [0x00004b08, 0x00004b3c)
                 [0x00004b3c, 0x00004b42)
                 [0x00004b42, 0x00004b54)
                 [0x00004b54, 0x00004b56)
                 [0x00004b56, 0x00004b5a)
                 [0x00004b5a, 0x00004b66)
                 [0x00004b66, 0x00004b6a)
                 [0x00004b6a, 0x00004b6e)
                 [0x00004b6e, 0x00004ba4)
                 [0x00004ba4, 0x00004baa)
                 [0x00004bac, 0x00004bdc)
                 [0x00004bdc, 0x00004bf8)
                 [0x00004bf8, 0x00004bfe)
                 [0x00004c00, 0x00004c1c)
                 [0x00004c1c, 0x00004c1e)
                 [0x00004c1e, 0x00004c22)
                 [0x00004c22, 0x00004c2a)
                 [0x00004c2a, 0x00004c2e)
                 [0x00004c2e, 0x00004c32)
                 [0x00004c32, 0x00004c48)
                 [0x00004c48, 0x00004c8a)
                 [0x00004c8a, 0x00004c90)
                 [0x00004c90, 0x00004cc0)
                 [0x00004cc0, 0x00004cdc)
                 [0x00004cdc, 0x00004d30)
                 [0x00004d30, 0x00004d52)
                 [0x00004d52, 0x00004d6c)
                 [0x00004d6c, 0x00004d72)
                 [0x00004d74, 0x00004d90)
                 [0x00004d90, 0x00004d92)
                 [0x00004d92, 0x00004d96)
                 [0x00004d96, 0x00004d9e)
                 [0x00004d9e, 0x00004da2)
                 [0x00004da2, 0x00004da6)
                 [0x00004da6, 0x00004dbc)
                 [0x00004dbc, 0x00004dfc)
                 [0x00004dfc, 0x00004e02)
                 [0x00004e04, 0x00004e34)
                 [0x00004e34, 0x00004e50)
                 [0x00004e50, 0x00004e72)
                 [0x00004e72, 0x00004e78)
                 [0x00004e78, 0x00004e94)
                 [0x00004e94, 0x00004e96)
                 [0x00004e96, 0x00004e9a)
                 [0x00004e9a, 0x00004ea2)
                 [0x00004ea2, 0x00004ea6)
                 [0x00004ea6, 0x00004eaa)
                 [0x00004eaa, 0x00004ec0))

0x0002f1f1:   DW_TAG_subprogram
                DW_AT_low_pc	(0x00000000000049c0)
                DW_AT_high_pc	(0x0000000000004a08)
                DW_AT_frame_base	(DW_OP_reg13)
                DW_AT_object_pointer	(0x0002f214)
                DW_AT_linkage_name	("_ZNSt12__shared_ptrIN3igl6opengl12CommandQueueELN9__gnu_cxx12_Lock_policyE2EEC2ISaIS2_EJEEESt19_Sp_make_shared_tagRKT_DpOT0_")
                DW_AT_specification	(0x00022eef)
This revision now requires changes to proceed. Apr 20 2020, 9:56 PM

It is also worth noting that just because new compilers might generate this information correctly, it doesn't mean LLDB won't be used to load older debug info from compilers that don't. LTO can also greatly affect the reliability of this information. And many other post-production tools often ruin this information when they move things around: the tools know to modify the information for the DW_TAG_subprogram DIE, but they very often don't even try to update the compile unit's DW_AT_ranges.

So if llvm is relying on this, we should change llvm. Symbolication will fail using LLVM's DWARF parser if it is using this information for anything real.

And this one binary had 115 examples of this:

$ llvm-dwarfdump --verify libIGL.so | grep "DIE address ranges are not contained in its parent" | wc -l
    115

Sorry, I might not have understood this patch's actual implementation. I thought we were switching to trusting DW_AT_ranges, which it seems we are already doing. Let me read the patch code a bit more carefully, not just quickly read the description...

So it looks like there are bugs in "llvm-dwarfdump --verify". The blurb I posted above clearly has the function's address range contained in it; sorry for the false alarm. I have seen quite a few problems with DWARF from LTO and other post-production linkers. llvm-dwarfdump might be assuming that the address ranges in DW_AT_ranges are sorted. I will work on a fix for this if my llvm-dwarfdump was built from top of tree.

I am worried about performance with this patch. Prior to this we would:
1 - check for DW_AT_ranges, and use that if present
2 - go through all DWARF and look for DW_TAG_subprogram DIEs with valid DW_AT_low_pc/DW_AT_high_pc attributes
3 - _only_ parse the line tables as a last resort

This seems like the right way to go to get max performance, right?

With this patch we:
1 - check for DW_AT_ranges, and use that if present
2 - _always_ parse the line tables and try to glean address range information from this

This seems like it will be slower, though I have not benchmarked it. It should be easy to test with a large binary by just commenting out the code that checks for DW_AT_ranges.

One thing my flawed example above did show is worth thinking about: dead-stripped functions cause problems for debug info, both in DW_AT_ranges on the compile unit and in all of the DWARF below it. We might want to check any and all address ranges to ensure that they are in sections with read+execute permissions, and throw them out if they aren't. It is easy to come up with a minimal address range from the main object file prior to parsing any address information and use that to weed out the bad entries. If functions have had their address set to zero, then we will usually weed these out correctly, since zero is often part of an image base load address, but not in a section (not a program header or top-level section in LLDB's ObjectFileELF, but a section-header section) with r+x permissions. It would be nice to be able to weed these addresses out so they don't cause problems, possibly as a follow-up patch.
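
Something along these lines is what I have in mind -- a rough self-contained sketch with placeholder types, not LLDB's real Section/Module interfaces:

  #include <cstdint>
  #include <vector>

  struct Range { uint64_t base, end; };              // [base, end) file addresses
  struct Section { uint64_t base, end; bool r, x; }; // simplified permissions

  // Keep only candidate ranges that start inside a read+execute section. A
  // dead-stripped function whose address was set to zero typically falls
  // outside every r+x section and gets discarded here.
  std::vector<Range> WeedOutStrippedRanges(const std::vector<Range> &candidates,
                                           const std::vector<Section> &sections) {
    std::vector<Range> good;
    for (const Range &range : candidates)
      for (const Section &sect : sections)
        if (sect.r && sect.x && range.base >= sect.base && range.base < sect.end) {
          good.push_back(range);
          break;
        }
    return good;
  }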

So I will be OK with this patch if we can verify that line table parsing is faster than checking DIEs for DW_AT_high/low_pc attributes. If there is a regression, we should keep the code as is.

Sorry for the false alarms. That'll teach me to check patches at the end of a long day...

lldb/source/Plugins/SymbolFile/DWARF/DWARFCompileUnit.cpp
58–67

So this is clearly much more efficient than parsing all the line tables for all functions, right? Why would we stop doing this? With the old code, if we didn't have DW_AT_ranges but we did have DW_TAG_subprogram DIEs with valid address ranges, then we would not go through all of the line tables.

Thanks for the tip about commenting out the DW_AT_ranges code. I wanted to do some measurements, but I couldn't find a non-contrived, non-trivial code base to try this on (with C++, most compile units get DW_AT_ranges due to all of the ODR linkage brought in by templates). The results are very interesting. The test setup is: a release (no asserts) lldb debugging a debug clang with accelerator tables (to avoid debug info indexing overshadowing everything). As a benchmark I'm running rel/lldb dbg/clang -o "image lookup -a 0x7234567" -b, running it 10 times and averaging. The address is a random address inside clang's .text section.

With the current code, where we get this information via DW_AT_ranges from most compile units (only 30 CUs don't have it), the command takes 5.15 (+/- 0.04) seconds. With the DW_AT_ranges code commented out, where we get the information by traversing the DIE tree, it takes 10.38 (+/- 0.05) seconds. With both the DW_AT_ranges *and* the DIE code removed (information retrieved via line tables), the time is 6.85 (+/- 0.03) seconds. These results make sense to me, as line tables are much simpler and easier to parse than DIE trees, and this makes it possible to avoid parsing a lot of compile units.

So, not only does this remove code and improve llvm consistency, it also makes things faster, if only for a limited number of users (those building with clang<=3.3, or those building huge amounts of C code -- the patch is NFC for the rest). Sounds like a win-win-win to me (?)

avl added a subscriber: avl. Apr 21 2020, 2:58 AM

Sounds like a win then; as long as we don't get slower, I am fine with this. I was guessing that line tables might be faster, since they are generally very compressed and compact compared to debug info, but thanks for verifying.

Might be worth quickly checking the memory consumption of LLDB too, with the DW_AT_ranges code compiled out, just to make sure we don't take up too much extra memory.

We aren't doing anything to unload these line tables like we do with DIEs, are we? It might make sense to parse the line tables and then throw them away if they were not already parsed. With the DIEs, we were throwing them away if they weren't already parsed, to keep memory consumption down, so it might be worth throwing the line tables away after running this if we are now going to rely on them.

One other thing we could do to verify we want to go this route is to re-enable the old DIE parsing for high/low PCs, but put the results in a separate address ranges class. After we parse the line tables, verify that all ranges we found in the DIEs are in the line tables. It would not be great if we started missing some functions because a function doesn't have a line table. Not sure if/how this would happen, but you never know.
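
The cross-check itself could be as simple as this rough sketch (simplified types, not LLDB's actual range classes):

  #include <cstdint>
  #include <cstdio>
  #include <vector>

  struct Range { uint64_t base, end; }; // [base, end) file addresses

  // Report every DIE-derived range that is not fully covered by a single
  // line-table-derived range.
  void VerifyDIERangesAgainstLineTable(const std::vector<Range> &die_ranges,
                                       const std::vector<Range> &lt_ranges) {
    for (const Range &d : die_ranges) {
      bool covered = false;
      for (const Range &l : lt_ranges)
        if (d.base >= l.base && d.end <= l.end) {
          covered = true;
          break;
        }
      if (!covered)
        std::fprintf(stderr, "[0x%llx, 0x%llx) has no line-table cover\n",
                     (unsigned long long)d.base, (unsigned long long)d.end);
    }
  }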

> Sounds like a win then; as long as we don't get slower, I am fine with this. I was guessing that line tables might be faster, since they are generally very compressed and compact compared to debug info, but thanks for verifying.

> Might be worth quickly checking the memory consumption of LLDB too, with the DW_AT_ranges code compiled out, just to make sure we don't take up too much extra memory.

I've done a similar benchmark to the last one, but measured memory usage ("RssAnon" as reported by Linux). One notable difference is that I

Currently (with the information retrieved from DW_AT_ranges) we use about 330 MB of memory. If I switch to DWARF DIEs as the source, the memory goes all the way to 2890 MB. This number is suspiciously large -- it either means that our DIE freeing is not working properly, or that glibc is very bad at releasing memory back to the OS. Given the magnitude of the increase, I think it's a little bit of both. With line tables as the source, the memory usage is 706 MB. That's an increase from 330, but definitely smaller than 2.8 GB. (The number 330 is kind of misleading here, since we're not considering removing that option -- it will always be used if it is available.)

> We aren't doing anything to unload these line tables like we do with DIEs, are we? It might make sense to parse the line tables and then throw them away if they were not already parsed. With the DIEs, we were throwing them away if they weren't already parsed, to keep memory consumption down, so it might be worth throwing the line tables away after running this if we are now going to rely on them.

That would be possible, but I don't think it's worth the trouble. I think the phrase "relying on it" overemphasizes the importance of that code. In practice, the only time this path will be taken is when debugging code built with clang<=3.3, which is seven years old and did not even fully implement C++11. It also seems that the switch to line tables will save memory, at least until the DIE freeing bug is fixed. And lastly, the difference I reported is pretty much the worst possible case, as the only thing the debugger does there is parse the line tables and exit. Once the debugger starts doing other stuff too, the difference will start to fade (e.g. running to a breakpoint in main increases the memory usage to 600 MB even with DW_AT_ranges).

> One other thing we could do to verify we want to go this route is to re-enable the old DIE parsing for high/low PCs, but put the results in a separate address ranges class. After we parse the line tables, verify that all ranges we found in the DIEs are in the line tables. It would not be great if we started missing some functions because a function doesn't have a line table. Not sure if/how this would happen, but you never know.

I implemented a check like that. While doing it, I learned that the DIE-based parsing has not worked since May 2018 (D47275), and nobody noticed. What that means is that in practice we were always going through the line tables (if DW_AT_ranges was missing). It also means that the times I reported earlier for DIE-based searching were incorrect, as they also included the time to build the line tables. (However, that does not change the ordering: even if we subtract the time it took to parse the line tables, the DIE method is still much slower.)

After fixing the DIE-based search, I was able to verify that the DIE-based ranges are an exact match to the line table ranges (for the particular compiler used -- top-of-tree clang).

Given all of the above (DIE-based searching being slow, taking more memory, and not actually working), I think it's pretty clear that we should remove it.

clayborg accepted this revision. Apr 22 2020, 3:58 PM

>> Sounds like a win then; as long as we don't get slower, I am fine with this. I was guessing that line tables might be faster, since they are generally very compressed and compact compared to debug info, but thanks for verifying.

>> Might be worth quickly checking the memory consumption of LLDB too, with the DW_AT_ranges code compiled out, just to make sure we don't take up too much extra memory.

> I've done a similar benchmark to the last one, but measured memory usage ("RssAnon" as reported by Linux). One notable difference is that I

> Currently (with the information retrieved from DW_AT_ranges) we use about 330 MB of memory. If I switch to DWARF DIEs as the source, the memory goes all the way to 2890 MB. This number is suspiciously large -- it either means that our DIE freeing is not working properly, or that glibc is very bad at releasing memory back to the OS. Given the magnitude of the increase, I think it's a little bit of both. With line tables as the source, the memory usage is 706 MB. That's an increase from 330, but definitely smaller than 2.8 GB. (The number 330 is kind of misleading here, since we're not considering removing that option -- it will always be used if it is available.)

Since we mmap the entire DWARF, I am not surprised that we take up new memory: we touch those pages and won't get them back. If you remove the DIE freeing code, I will bet you see much more memory used. We definitely free the memory for the DIEs and give it back, so I would be willing to bet the increase you are seeing is from mmap loading in the pages that we touch.

>> We aren't doing anything to unload these line tables like we do with DIEs, are we? It might make sense to parse the line tables and then throw them away if they were not already parsed. With the DIEs, we were throwing them away if they weren't already parsed, to keep memory consumption down, so it might be worth throwing the line tables away after running this if we are now going to rely on them.

> That would be possible, but I don't think it's worth the trouble. I think the phrase "relying on it" overemphasizes the importance of that code. In practice, the only time this path will be taken is when debugging code built with clang<=3.3, which is seven years old and did not even fully implement C++11. It also seems that the switch to line tables will save memory, at least until the DIE freeing bug is fixed. And lastly, the difference I reported is pretty much the worst possible case, as the only thing the debugger does there is parse the line tables and exit. Once the debugger starts doing other stuff too, the difference will start to fade (e.g. running to a breakpoint in main increases the memory usage to 600 MB even with DW_AT_ranges).

Ok, thanks for looking into this.

>> One other thing we could do to verify we want to go this route is to re-enable the old DIE parsing for high/low PCs, but put the results in a separate address ranges class. After we parse the line tables, verify that all ranges we found in the DIEs are in the line tables. It would not be great if we started missing some functions because a function doesn't have a line table. Not sure if/how this would happen, but you never know.

> I implemented a check like that. While doing it, I learned that the DIE-based parsing has not worked since May 2018 (D47275), and nobody noticed. What that means is that in practice we were always going through the line tables (if DW_AT_ranges was missing). It also means that the times I reported earlier for DIE-based searching were incorrect, as they also included the time to build the line tables. (However, that does not change the ordering: even if we subtract the time it took to parse the line tables, the DIE method is still much slower.)

> After fixing the DIE-based search, I was able to verify that the DIE-based ranges are an exact match to the line table ranges (for the particular compiler used -- top-of-tree clang).

> Given all of the above (DIE-based searching being slow, taking more memory, and not actually working), I think it's pretty clear that we should remove it.

Fine with me. Thanks for taking the time to ensure we don't regress.

This revision is now accepted and ready to land. Apr 22 2020, 3:58 PM

Thanks for the review.

>> I've done a similar benchmark to the last one, but measured memory usage ("RssAnon" as reported by Linux). One notable difference is that I

>> Currently (with the information retrieved from DW_AT_ranges) we use about 330 MB of memory. If I switch to DWARF DIEs as the source, the memory goes all the way to 2890 MB. This number is suspiciously large -- it either means that our DIE freeing is not working properly, or that glibc is very bad at releasing memory back to the OS. Given the magnitude of the increase, I think it's a little bit of both. With line tables as the source, the memory usage is 706 MB. That's an increase from 330, but definitely smaller than 2.8 GB. (The number 330 is kind of misleading here, since we're not considering removing that option -- it will always be used if it is available.)

> Since we mmap the entire DWARF, I am not surprised that we take up new memory: we touch those pages and won't get them back. If you remove the DIE freeing code, I will bet you see much more memory used. We definitely free the memory for the DIEs and give it back, so I would be willing to bet the increase you are seeing is from mmap loading in the pages that we touch.

I don't think it's as simple as that. "RssAnon" should not include file-backed memory (which is why I chose to measure it). According to man proc:

* RssAnon: Size of resident anonymous memory.  (since Linux 4.5).
* RssFile: Size of resident file mappings.  (since Linux 4.5).
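
For reference, the measurement boils down to scanning /proc/self/status for that field -- a minimal Linux-only sketch:

  #include <cstdio>
  #include <cstring>

  // Return the RssAnon value in kB for the current process, or -1 if the
  // field is unavailable (e.g. on kernels older than 4.5).
  long GetRssAnonKB() {
    std::FILE *f = std::fopen("/proc/self/status", "r");
    if (!f)
      return -1;
    char line[256];
    long kb = -1;
    while (std::fgets(line, sizeof line, f))
      if (std::strncmp(line, "RssAnon:", 8) == 0) {
        std::sscanf(line + 8, "%ld", &kb);
        break;
      }
    std::fclose(f);
    return kb;
  }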

I also don't think it's as simple as us not freeing the DIE memory at all. There has to be some more complex interaction going on. If I had more time, I would be very interested in learning what it is, but for now I'll content myself with "it's not a problem of this patch".

This revision was automatically updated to reflect the committed changes.


> Thanks for the review.

>>> I've done a similar benchmark to the last one, but measured memory usage ("RssAnon" as reported by Linux). One notable difference is that I

>>> Currently (with the information retrieved from DW_AT_ranges) we use about 330 MB of memory. If I switch to DWARF DIEs as the source, the memory goes all the way to 2890 MB. This number is suspiciously large -- it either means that our DIE freeing is not working properly, or that glibc is very bad at releasing memory back to the OS. Given the magnitude of the increase, I think it's a little bit of both. With line tables as the source, the memory usage is 706 MB. That's an increase from 330, but definitely smaller than 2.8 GB. (The number 330 is kind of misleading here, since we're not considering removing that option -- it will always be used if it is available.)

>> Since we mmap the entire DWARF, I am not surprised that we take up new memory: we touch those pages and won't get them back. If you remove the DIE freeing code, I will bet you see much more memory used. We definitely free the memory for the DIEs and give it back, so I would be willing to bet the increase you are seeing is from mmap loading in the pages that we touch.

> I don't think it's as simple as that. "RssAnon" should not include file-backed memory (which is why I chose to measure it). According to man proc:

> * RssAnon: Size of resident anonymous memory.  (since Linux 4.5).
> * RssFile: Size of resident file mappings.  (since Linux 4.5).

> I also don't think it's as simple as us not freeing the DIE memory at all. There has to be some more complex interaction going on. If I had more time, I would be very interested in learning what it is, but for now I'll content myself with "it's not a problem of this patch".

FWIW, I can recommend Valgrind's massif for memory usage analysis. I think it works by intercepting memory allocations (malloc/free/etc), so in some ways that makes it more actionable (it doesn't get sidetracked by the actual libc memory management implementation), but also less accurate (in that, yeah, your process is still going to hold that memory if your libc isn't releasing it back to the OS). But perhaps if you tune your application code for what massif tells you, and there's still a big delta between that and reality, see if there are tuning options for your memory allocator.