The original code considered only v2i64 as slow for this feature. This patch considers all 128-bit vector types as slow candidates.
It's not clear why Cyclone, the first processor to use this feature, restricted it to v2i64 and excluded the other 128-bit vector types. In internal tests, however, extending the feature to all 128-bit vector types resulted in an overall improvement of 1% on Exynos M1.