Reland the reverted commits below, with some changes:
As an optimization, split the entry block of kernel K just after the topmost static
From a code transformation/optimization perspective, all static allocas are
expected to appear as a single contiguous cluster at the start of the entry
block. If this canonical form is *not* maintained, some static allocas may
become dynamic after the entry block split.