AFAIK the driver's allocation will actually have to round this
up anyway. It is useful to track the rounded-up size: the end of
the kernel segment is then known to be dereferenceable, so a wider
s_load_dword can be used for a short argument at the end of the
segment.
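To make the point concrete, here is a rough sketch of the bounds reasoning. The helper names (alignUp, loadIsInBounds) and the 4-byte rounding granularity are my assumptions for illustration only; they are not taken from the patch or from the driver.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: round Size up to the next multiple of Align.
// Align must be a nonzero power of two.
inline uint64_t alignUp(uint64_t Size, uint64_t Align) {
  assert(Align != 0 && (Align & (Align - 1)) == 0 && "power of two expected");
  return (Size + Align - 1) & ~(Align - 1);
}

// Returns true if a load of LoadBytes starting at Offset stays inside a
// segment whose dereferenceable size is SegmentSize bytes.
inline bool loadIsInBounds(uint64_t Offset, uint64_t LoadBytes,
                           uint64_t SegmentSize) {
  return Offset + LoadBytes <= SegmentSize;
}
```

For example, with an i32 at offset 0 and an i16 at offset 4, the explicit size is 6 bytes. A 4-byte load at offset 4 overruns the 6-byte size, but is in bounds once the size is tracked as alignUp(6, 4), assuming the allocation really is rounded that way.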
One thing I'm unclear on is whether we really need to allocate the
implicit arg bytes if they aren't actually used.
I believe you can align this to 16. See the HSA spec at www.hsafoundation.com/html_spec111/HSA_Library.htm#PRM/Topics/04_SyntaxSemantics/kernarg_segment.htm, which says:
"The alignment of the base address of the kernel's kernarg segment variables is the larger of 16 bytes and the maximum alignment of the kernel's kernarg segment variables."
I suspect that the OpenCL runtime simply aligns all kernarg allocations to 256, but I'm not sure about other languages.
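As I read the quoted rule, the base alignment could be computed roughly as below. This is only a sketch: kernargBaseAlign and its vector-of-alignments input are hypothetical names of mine, not an existing API, and it says nothing about what a particular runtime actually allocates.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: base alignment of the kernarg segment per the quoted
// HSA wording, i.e. the larger of 16 bytes and the maximum alignment of any
// kernarg segment variable. ArgAligns holds each variable's alignment in bytes.
inline uint64_t kernargBaseAlign(const std::vector<uint64_t> &ArgAligns) {
  uint64_t MaxAlign = 16; // minimum required by the quoted rule
  for (uint64_t A : ArgAligns) {
    assert(A != 0 && (A & (A - 1)) == 0 && "alignment must be a power of two");
    MaxAlign = std::max(MaxAlign, A);
  }
  return MaxAlign;
}
```

So a kernel whose arguments need at most 8-byte alignment gets 16, and one with a 64-byte aligned argument gets 64. A runtime that always aligns to 256 would satisfy both cases.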