This patch avoids scalarization of CTLZ by instead expanding it in terms of CTPOP (the technique from "Hacker's Delight") when the necessary operations are available.
It also adds the necessary cost models for X86 SSE2 targets (the main beneficiary) to ensure vectorization only happens when it's useful.
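For reference, the CTPOP-based expansion mentioned above can be sketched on a single 32-bit scalar. This is only an illustration of the "Hacker's Delight" identity, not the code in the patch: the names `popcount32` and `ctlz32_via_ctpop` are hypothetical, and the fixed 32-bit width is an assumption made to keep the example short. The idea is to smear the highest set bit into every lower position with a sequence of shift/OR steps, then count the zero bits that remain above it.

```cpp
#include <cstdint>

// Hypothetical helper: portable bit-parallel population count for 32 bits.
static inline uint32_t popcount32(uint32_t x) {
  x = x - ((x >> 1) & 0x55555555u);                 // 2-bit partial sums
  x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); // 4-bit partial sums
  x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 // 8-bit partial sums
  return (x * 0x01010101u) >> 24;                   // add the four bytes
}

// Hypothetical helper: count leading zeros using only shift, OR, NOT, CTPOP.
static inline uint32_t ctlz32_via_ctpop(uint32_t x) {
  // Propagate the highest set bit into all lower bit positions.
  x |= x >> 1;
  x |= x >> 2;
  x |= x >> 4;
  x |= x >> 8;
  x |= x >> 16;
  // The bits still clear are exactly the leading zeros of the input.
  return popcount32(~x);
}
```

Note that the shift amounts 1, 2, 4, 8, 16 cover the full width only because 32 is a power of two; a zero input yields 32.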
Do we need to guard against non-power-of-2 bit widths here? I think the scalar equivalent does guard against them.