Edit: Updated the summary based on comments.
Even though the granularity is 8, the rounded-up value must be an even number of 8-register granules on GFX9.
Probably this also needs to be mentioned in https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-compute-pgm-rsrc1-gfx6-gfx10-table for GRANULATED_WAVEFRONT_SGPR_COUNT.
The difference is seen when the rounded value aligns to 8 but not to 16 (for example, 40 or 56).
This patch corrects the roundup for GFX9, hence correcting the number of SGPRBlocks.
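To make the difference concrete, here is a rough sketch of the two roundups. The helper names (`roundUpTo`, `roundToGranule`, `roundToEvenGranules`) are made up for illustration and are not the functions touched by the patch; the sketch only assumes a granule of 8 and treats "an even number of granules" as rounding up to a multiple of 16.

```cpp
// Illustrative only: hypothetical helpers, not LLVM's actual API.
constexpr unsigned Granule = 8; // SGPR granule size described above

// Round Value up to the next multiple of Multiple.
constexpr unsigned roundUpTo(unsigned Value, unsigned Multiple) {
  return (Value + Multiple - 1) / Multiple * Multiple;
}

// Round up to a whole number of 8-register granules.
constexpr unsigned roundToGranule(unsigned NumSGPRs) {
  return roundUpTo(NumSGPRs, Granule);
}

// Round up so that the number of 8-register granules is even,
// i.e. to a multiple of 2 * Granule.
constexpr unsigned roundToEvenGranules(unsigned NumSGPRs) {
  return roundUpTo(NumSGPRs, 2 * Granule);
}

// 33 SGPRs: 40 (5 granules) versus 48 (6 granules).
static_assert(roundToGranule(33) == 40, "");
static_assert(roundToEvenGranules(33) == 48, "");
// 48 is already a multiple of 16, so both roundups agree.
static_assert(roundToGranule(48) == roundToEvenGranules(48), "");
```

The two only disagree when the 8-aligned value is an odd number of granules, such as 40 (which becomes 48) or 56 (which becomes 64).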
I don't know if this is actually accurate. I think the reason for the "2 *" in the equation for GFX9 is not that the allocation granule is 16. It is still 8 for GFX9, but there is an additional constraint that you must allocate an even number of granules.
It is a bit confusing, and I would like @kzhuravl to weigh in, as IIRC he was the one who originally helped me understand this when we were updating the assembler.