X86Interleave should not try to handle cases where the wide-load size is less than the factor * num_elements.
lib/Target/X86/X86InterleavedAccess.cpp:106

It is not just a question of profitability. If `lowerIntoOptimizedSequence` were called for a load instruction that is too small, it would make an incorrect transformation, because the `decompose` function would generate `Factor` loads, each of size `ShuffleVecSize`.

I would also recommend a slightly different fix. Rather than checking the expected shuffle size for both the load and the store, I would check the "expected wide vector size": for loads, that means checking the type of the load; for stores, it means checking the type of the shuffle.
lib/Target/X86/X86InterleavedAccess.cpp:106

You could make the ShuffleVecTy check more generic by testing this, which is similar to what you had before adding interleaved store support:

  if (WideInstSize != DL.getTypeSizeInBits(ShuffleVecType) * Factor)
    return false;

I get your point about the profitability. I think your comment in the test adequately captures that.
lib/Target/X86/X86InterleavedAccess.cpp:106

I cannot write it the way you suggested; it would be incorrect. We are checking the ShuffleVecType, so it can't be multiplied by Factor. For a store, it would check against 1024 * 4.
lib/Target/X86/X86InterleavedAccess.cpp:106

But we can make it semi-generic by hardcoding the number of elements.
lib/Target/X86/X86InterleavedAccess.cpp:106

But the code in question is guarded by `isa<LoadInst>(Inst)`. At any rate, what you have is fine if you change 256 to `ShuffleElemSize * SupportedNumElem`.
lib/Target/X86/X86InterleavedAccess.cpp:106

> if (WideInstSize != DL.getTypeSizeInBits(ShuffleVecType) * Factor)
>   return false;
>
> But the code in question is guarded by isa<LoadInst>(Inst).

When it's a store, it will compare against 1024 * 4.

> At any rate, what you have is fine if you change 256 to ShuffleElemSize * SupportedNumElem.

I like the way it is now, since all the checks are confined in one place, which keeps it clean. But I don't mind changing it if you prefer the other way.
lib/Target/X86/X86InterleavedAccess.cpp:106

Regarding 256, I take it back. You are right; we don't need the element size check.