This is an archive of the discontinued LLVM Phabricator instance.

[WebAssembly] Define __heap_base global
ClosedPublic

Authored by sbc100 on Jan 13 2018, 9:57 AM.

Details

Summary

This is an immutable exported global representing the start of the heap area. It is page-aligned.
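As context for readers of this archive, a minimal sketch (not part of this patch) of how an allocator or libc might consume the symbol: the linker defines __heap_base, and C code takes the address of an extern declaration to find the heap start. The bump-pointer helper below is hypothetical.

    #include <stddef.h>
    #include <stdint.h>

    extern unsigned char __heap_base; /* defined by the linker; its address is the heap start */

    static uintptr_t heap_top;

    /* Hypothetical bump allocator seeded from the linker-provided base.
       A real implementation must also grow the wasm memory and check bounds. */
    void *bump_alloc(size_t n) {
        if (heap_top == 0)
            heap_top = (uintptr_t)&__heap_base;
        void *p = (void *)heap_top;
        heap_top += (n + 15) & ~(size_t)15; /* keep 16-byte alignment */
        return p;
    }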

Diff Detail

Repository
rLLD LLVM Linker

Event Timeline

sbc100 created this revision.Jan 13 2018, 9:57 AM
sbc100 retitled this revision from [WebAssembly] Define __heap_base global This is an immutable exported global representing the start of the heap area. It is a page aligned. to [WebAssembly] Define __heap_base global.Jan 13 2018, 9:57 AM
sbc100 edited the summary of this revision. (Show Details)
sbc100 added reviewers: dschuff, ruiu.
ncw added a subscriber: ncw.Jan 15 2018, 1:54 PM

Nice, thank you! My Musl port makes use of this; I'll test it sometime and confirm it's working. I do need to get onto that and try to work out how to make it useful to the Emscripten folks or the compat waterfall.

4096-byte alignment is good; when I read the comment I thought it was a 64KiB Wasm page :)

dschuff added inline comments.Jan 16 2018, 9:03 AM
wasm/Writer.cpp
38

Why 4096? Since wasm pages are 16k, is there some particular reason for it to be this as opposed to something else?

sbc100 added inline comments.Jan 16 2018, 2:57 PM
wasm/Writer.cpp
38

Not really any good reason... should we just make it 16-byte aligned like the stack? In that case there is no need to do any more alignment, since the heap will follow the stack region, which is already 16-byte aligned.

Incidentally, our musl limits.h currently defines PAGE_SIZE as 4k... should we change it to 16k?

sbc100 updated this revision to Diff 130059.Jan 16 2018, 4:14 PM

don't align __heap_base

sbc100 updated this revision to Diff 130077.Jan 16 2018, 5:33 PM
  • update test
sbc100 edited the summary of this revision. (Show Details)Jan 16 2018, 5:33 PM
This revision was not accepted when it landed; it landed in state Needs Review.Jan 16 2018, 5:38 PM
This revision was automatically updated to reflect the committed changes.
ncw added inline comments.Jan 17 2018, 12:42 AM
wasm/Writer.cpp
38

(I think Wasm pages are actually 64KiB: https://webassembly.github.io/spec/core/exec/runtime.html#page-size)

I wouldn't change the Musl PAGE_SIZE to match. (I looked into this when doing my Musl port, and Rich Felker added this as a review comment.) PAGE_SIZE doesn't have to match anything the CPU does, what it really has to match is what mmap does. So whatever alignment your mmap (and hence malloc) perform should be exposed via PAGE_SIZE. In my Musl, I chose 32KiB for the page size, and mmap works in those units, so that malloc rounds up "large" allocations (>100KiB) to the nearest libc page. It doesn't matter that grow_memory/current_memory Wasm instructions work in 64KiB units, the mmap implementation does the appropriate translation.
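A sketch of the decoupling ncw describes (the 32KiB figure comes from the comment above; the macro and function names are illustrative, not a fixed ABI): PAGE_SIZE only has to match the granularity at which mmap hands out address space, and the translation to 64KiB Wasm pages can happen entirely inside the growth layer.

    #define WASM_PAGE_SIZE 65536u  /* fixed by the Wasm spec */
    #define LIBC_PAGE_SIZE 32768u  /* the port's choice, per the comment above */

    /* Round a requested mapping length up to libc-page granularity.
       Only when the reserved region is exhausted does the implementation
       need to think in 64KiB Wasm pages (via grow_memory). */
    static size_t round_to_libc_pages(size_t len) {
        return (len + LIBC_PAGE_SIZE - 1) & ~(size_t)(LIBC_PAGE_SIZE - 1);
    }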

sbc100 added inline comments.Jan 17 2018, 8:59 AM
wasm/Writer.cpp
38

I discussed this with @dschuff yesterday. We decided it makes sense for the page size to be consistent with the granularity of the underlying memory system. There is no reason for mmap or brk to operate on anything smaller than 64k chunks, since we can't allocate in smaller chunks; i.e. it makes sense to have the page sizes match. Perhaps there is a reason I'm missing, but I can't see any benefit in having two different concepts of page size.

ncw added inline comments.Jan 17 2018, 9:13 AM
wasm/Writer.cpp
38

I think it does make sense - the Wasm page size isn't exposed by any libc API, and there's no reason I can see why they _should_ be linked (other than, as you say, to avoid any confusion for the libc maintainers themselves).

mmap can allocate memory in smaller chunks than 64KiB - it can allocate chunks as small as you want, according to whatever allocation algorithm you've chosen. In Wasm, mmap will be built on top of __builtin_wasm_grow_memory, i.e. libc will be parcelling up the brk space into mmap-ed regions. It can parcel up the process space however it chooses. mmap just needs some allocation mechanism of its own in order for the interface to be implemented on top of sbrk.

In my Musl port, mmap has an 8K bitmap of all the possible pages (8K = 65536 bits) which it uses to "map" or "unmap" pages, expanding the brk as needed to create space for handing out more pages.

I went with 32KiB pages, rather than 64KiB, to reduce the amount of space wastage. With 64KiB pages, you'd only need 4K for the bitmap, but you'd waste up to 63KiB of space, which is a decent fraction if you're doing ~150KiB allocations (for example). At 32KiB, the mmap space wastage is reduced from 33% to 20% for a single 129KiB allocation, i.e. the space you're spending on mmap's bitmap is reclaimed immediately the first time you do a "medium-sized" allocation.
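A rough sketch of the bitmap bookkeeping described here (an illustration, not ncw's actual code; the constants assume 32KiB pages tracked one bit per page, so 8KiB of bitmap covers 65536 pages):

    #include <stdint.h>

    #define PAGE_SIZE 32768u
    #define NUM_PAGES 65536u                  /* 8KiB of bitmap = 65536 bits */
    static uint8_t page_bitmap[NUM_PAGES / 8];

    static int page_mapped(uint32_t i) { return (page_bitmap[i / 8] >> (i % 8)) & 1; }
    static void page_set(uint32_t i, int on) {
        if (on) page_bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
        else    page_bitmap[i / 8] &= (uint8_t)~(1u << (i % 8));
    }

    /* Find a run of n consecutive free pages for an mmap request;
       a real implementation would expand the brk when this fails. */
    static int32_t find_free_run(uint32_t n) {
        uint32_t run = 0;
        for (uint32_t i = 0; i < NUM_PAGES; i++) {
            run = page_mapped(i) ? 0 : run + 1;
            if (run == n)
                return (int32_t)(i + 1 - n);
        }
        return -1;
    }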

ncw added a comment.Jan 17 2018, 9:18 AM

In fact, I wonder now, as I re-do the calculations, whether I shouldn't use a 16KiB page size (reducing wastage to 10% for 129KiB allocations, at the expense of having a 16KiB mmap bitmap).

(Some extra detail: to avoid fragmentation I guess, and an unbounded number of "pools", malloc does direct mmap allocation for all malloced regions above ~120KiB. So for sizes above that, you can't rely on malloc to be trimming the regions and re-using them for smaller allocations in its free-bins. Any wastage from the mmap call is just a sunk cost.)

sbc100 added a comment.

The actual low-level memory allocator (__builtin_wasm_grow_memory, or the JS version as I've used here: https://github.com/jfbastien/musl/blob/wasm-prototype-1/arch/wasm32/wasm.js#L1358) can only allocate at a 64k granularity. Setting the page size to something smaller to reduce the memory overhead of a given mmap implementation doesn't make a lot of sense to me. @dschuff, what do you think?

Also, I don't really understand your logic regarding the size of your bitmap. Surely the larger the pages, the fewer pages exist, the smaller the bitmap required to track them? Perhaps I'm missing something?

The musl malloc implementation seems to choose to use brk() directly over mmap, which means it will always be working with a contiguous region, so there shouldn't be any fragmentation in this case.

Even in the mmap case, fragmentation shouldn't be a problem, according to the comment in the code: "Expand the heap in-place if brk can be used, or otherwise via mmap, using an exponential lower bound on growth by mmap to make fragmentation asymptotically irrelevant."
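For concreteness, a sketch of the brk-style layer being discussed, built directly on the growth primitive. Hedged: the builtin was spelled __builtin_wasm_grow_memory in 2018-era clang; current clang spells the pair __builtin_wasm_memory_size(0) / __builtin_wasm_memory_grow(0, n), which is what this sketch uses, and wasm_sbrk is a hypothetical name.

    #include <stddef.h>
    #include <stdint.h>

    #define WASM_PAGE_SIZE 65536u

    extern unsigned char __heap_base;
    static uintptr_t brk_end; /* current break; starts at &__heap_base */

    void *wasm_sbrk(intptr_t increment) {
        if (brk_end == 0)
            brk_end = (uintptr_t)&__heap_base;
        uintptr_t old = brk_end;
        uintptr_t want = brk_end + (uintptr_t)increment;
        uintptr_t have = (uintptr_t)__builtin_wasm_memory_size(0) * WASM_PAGE_SIZE;
        if (want > have) { /* grow in whole 64KiB pages, the only available granularity */
            size_t pages = (want - have + WASM_PAGE_SIZE - 1) / WASM_PAGE_SIZE;
            if (__builtin_wasm_memory_grow(0, pages) == (size_t)-1)
                return (void *)-1; /* growth refused, as sbrk reports */
        }
        brk_end = want;
        return (void *)old;
    }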

dschuff added a comment (edited).Jan 17 2018, 2:19 PM

It seems like the real issue is that because it's meant for Linux, musl assumes the existence of Linux's mmap. On traditional architectures, there's a machine page size, which is also the granularity of mmap on Linux (technically you can have some kinds of mappings that don't end on page boundaries, but those pages will just be fragmented). There's really no reason to do anything not on page boundaries, and 4k is a pretty small page. The problem with wasm is that there's really no such thing as mmap for wasm (or I should say, for any wasm embedding, since mmap is an OS feature rather than an architecture feature). mmap is a strange swiss army knife, and currently the only functionality is to grow the memory at the end; if and when additional functionality is added (file mapping, protection, unmapped space, etc.) it would come as separate calls. So mmap must always be emulated, and if it is emulated then it's really just another layer of allocation. To me it makes much more sense to have one allocator (malloc) which knows directly about how wasm memory is managed (instead of assuming mmap's behavior, which will always be a fiction, at least in JS-wasm). Since the real behavior is more or less brk(), it makes sense to use that logic in existing allocators.
Likewise, because memory can never be allocated from the underlying system in increments of less than 64k, it doesn't really make much sense to pretend otherwise by implementing an mmap for malloc to use.

More practically, what advantage would your 2-level (malloc+mmap) allocation have vs just using malloc's usual non-mmap mechanism for all allocations (or at least, up to a larger allocation size compared to systems with 4k pages)?

ncw added a comment.Jan 17 2018, 2:55 PM

Also, I don't really understand your logic regarding the size of your bitmap. Surely the larger the pages, the fewer pages exist, the smaller the bitmap required to track them? Perhaps I'm missing something?

That's spot on. Smaller pages (more granular) mean less wastage during allocations, but make the bitmap a bit bigger.

The musl malloc implementation seems to choose to use brk() directly over mmap, which means it will always be working with a contiguous region, so there shouldn't be any fragmentation in this case.

Even in the mmap case, fragmentation shouldn't be a problem, according to the comment in the code: "Expand the heap in-place if brk can be used, or otherwise via mmap, using an exponential lower bound on growth by mmap to make fragmentation asymptotically irrelevant."

Musl uses brk for "small" allocations (<120KiB), and uses mmap for all large allocations and as a fallback when brk fails. Hence, Musl's malloc requires the "kernel" to provide mmap support, and brk is totally optional. My Musl port's brk simply returns ENOSYS.

The comment is regarding fragmentation as the allocation size tends to infinity. But my arithmetic is about wastage, for "medium" allocations of 100-200KiB, where I would be concerned by 30% space wastage.

It seems like the real issue is that because it's meant for Linux, musl assumes the existence of Linux's mmap. On traditional architectures, there's a machine page size, which is also the granularity of mmap on Linux (technically you can have some kinds of mappings that don't end on page boundaries, but those pages will just be fragmented). There's really no reason to do anything not on page boundaries, and 4k is a pretty small page. The problem with wasm is that there's really no such thing as mmap for wasm (or I should say, for any wasm embedding, since mmap is an OS feature rather than an architecture feature). mmap is a strange swiss army knife, and currently the only functionality is to grow the memory at the end; if and when additional functionality is added (file mapping, protection, unmapped space, etc.) it would come as separate calls. So mmap must always be emulated, and if it is emulated then it's really just another layer of allocation. To me it makes much more sense to have one allocator (malloc) which knows directly about how wasm memory is managed (instead of assuming mmap's behavior, which will always be a fiction, at least in JS-wasm). Since the real behavior is more or less brk(), it makes sense to use that logic in existing allocators.

If you're willing to write your own malloc, that's great! But my Musl port has zero changes to the Musl core code (after a few patches were accepted upstream). The entirety of Wasm support is in architecture-specific directories like arch/wasm and src/internal/wasm. So for better or worse, using Musl's built-in allocator does force the Wasm port to provide mmap support. As you say, it would mean cobbling together a new malloc to work only on brk.

(I did have a look into doing that actually: the problem is that malloc uses "bins" to hold free lists. In Musl's malloc, the bins stop at 120KiB, at which point you transition to direct mmap. If there's no mmap, then the bins need to be unbounded, and that's actually quite a lot more bins to support allocations going up to 1GiB, since the bin sizes are not quite logarithmic. Basically what scales well for small allocations won't scale for big ones, and you'd end up with a two-tier allocator no matter what, I think.)

Likewise, because memory can never be allocated from the underlying system in increments of less than 64k, it doesn't really make much sense to pretend otherwise by implementing an mmap for malloc to use.

I don't see that. How the underlying system allocates memory, and how it's split up for application use, don't have to relate at all. Exactly the same reasoning would say, "well x86 uses 4K pages so it doesn't make sense for malloc to subdivide that for a 10-byte allocation". I do agree though that it's odd/unusual and also awkward for mmap not to reflect the kernel's page size - but odd doesn't make it wrong, or inefficient.

More practically, what advantage would your 2-level (malloc+mmap) allocation have vs just using malloc's usual non-mmap mechanism for all allocations (or at least, up to a larger allocation size compared to systems with 4k pages)?

The advantage is that malloc doesn't have a non-mmap mechanism for all allocations.

I really wouldn't mind putting in 64KiB pages for Musl's mmap implementation - but it wouldn't get rid of the second allocation layer (and its bitvector), nor would it make the code simpler, since you'd need just the same machinery for mapping blocks on top of Wasm's "brk" mechanism, no matter what the page size is chosen to be.

Hope that helps. What my Musl does shouldn't really matter at this stage, no-one else is using it yet!

sbc100 added a comment.

Oh, I didn't see that musl's malloc had that threshold. So every allocation over MMAP_THRESHOLD will waste whatever rounding up to PAGE_SIZE leaves over; the higher the page size, the more wastage (see the worked numbers below).

For emscripten they chose to use dlmalloc instead. I guess we'll have to see what kind of real-world overheads and users we see. If musl's malloc does prove very wasteful when PAGE_SIZE is large, there are a couple of options (use your own malloc, or fix musl's malloc), either of which I think I would prefer over hacking the PAGE_SIZE. As you say, it's fairly hypothetical at this point.

In terms of modifications to musl, I have also been trying to absolutely minimize them. The port we use essentially only adds arch/wasm32 too. I think it would probably be good to merge our efforts here. Emscripten already has a musl port, and if we can avoid making two more that would be great. One feature of your port and the emscripten port that I think has some advantages is the fact that each syscall ends up as a unique wasm import (in emscripten they did this so they could GC the unused JS syscall functions).
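To make the wastage arithmetic above concrete, a quick check of the percentages quoted earlier in the thread (a 129KiB request rounded up to each candidate PAGE_SIZE):

    #include <stdio.h>

    int main(void) {
        unsigned n = 129 * 1024; /* the "medium" allocation used as the example */
        unsigned page_sizes[] = { 16384, 32768, 65536 };
        for (int i = 0; i < 3; i++) {
            unsigned p = page_sizes[i];
            unsigned rounded = (n + p - 1) / p * p; /* what mmap hands back */
            printf("PAGE_SIZE=%5u: %u -> %u, waste %u bytes (%.0f%%)\n",
                   p, n, rounded, rounded - n, 100.0 * (rounded - n) / rounded);
        }
        return 0; /* prints 10%, 19%, 33%: in line with the ~10/20/33% figures above */
    }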

dschuff added a comment.

I don't see that. How the underlying system allocates memory, and how it's split up for application use, don't have to relate at all. Exactly the same reasoning would say, "well x86 uses 4K pages so it doesn't make sense for malloc to subdivide that for a 10-byte allocation". I do agree though that it's odd/unusual and also awkward for mmap not to reflect the kernel's page size - but odd doesn't make it wrong, or inefficient.

I guess I am sort of just being pedantic here. The purpose of the mmap syscall is to change the mappings in the address space, and it exposes the kernel's page size based on how it rounds the addresses (and in some cases requires the user to round their requests to page size). It doesn't do subdivision or allocation per se, that's the caller's job (whether the caller is malloc or not). So having mismatched page sizes makes no sense because it's the difference between having mmap be a suballocator or not. And clearly doing nothing is better/more efficient than doing something :)

Now having said all that, I do buy the utility of having an emulation for compatibility. In a custom wasm-only allocator you wouldn't assume an mmap, but I understand why musl's does. I hadn't realized musl's was quite so reliant on mmap but really it's not a problem on any other recent architecture I know of. So in principle I don't see any problem with having a system that uses a partial mmap emulation suitable for musl's unmodified allocator. And in that case you'd want to take the specifics of musl's allocator into account when choosing your page size.

Which I guess brings us back to this code review; all of these tradeoffs can and should be made in the toolchain/OS emulation, and not in the linker; you or I or whoever should be able to make different choices and use the same linker. So while some ABI choices will necessarily be baked in, I guess we should probably try not to bake too many in. I wouldn't have expected page size to be something that people would want to have differ, but this compatibility argument is pretty compelling. Now that I think about it, I'll bet there's a lot of code out there that assumes 4k pages, and in at least some of those cases I'd guess it would make more sense to just fake having 4k pages than to fix that. So it probably does make sense to keep those assumptions out of the tools where possible.

I think for the compiler and linker generally, and this CL specifically, 16-byte stack alignment (rather than page-alignment) makes sense. This question of page size may come up in the future though, so I guess we should just be aware of it.


sbc100 added a comment.

Indeed, IIUC we have gone off topic here, since this CL has already landed, essentially with no specific heap base alignment, and musl handles that just fine.

This discussion is now really just about musl's limits.h, and we can discuss that independently of lld (at least I think it's unrelated, since lld isn't assuming a page size here). Presumably over in the musl repo would be the best place to continue this.