This is an archive of the discontinued LLVM Phabricator instance.

[LegalizeTypes] Teach DAGTypeLegalizer::GenWidenVectorLoads to pad with undef if needed when concatenating small or loads to match a larger load
ClosedPublic

Authored by craig.topper on Jul 23 2020, 1:40 PM.

Details

Summary

In the included test case the align 16 allowed the v23f32 load to handled as load v16f32, load v4f32, and load v4f32(one element not used). These loads all need to be concatenated together into a final vector. In this case we tried to concatenate the two v4f32 loads to match the type of the v16f32 load so we could do a second concat_vectors, but those loads alone only add up to v8f32. So we need to two v4f32 undefs to pad it.

It appears we've tried to hack around a similar issue in this code before by adding undef padding to loads in one of the earlier loops in this function. Originally in r147964 by padding all loads narrower than previous loads to the same size. Later modifed to only the last load in r293088. This patch removes that earlier code and just handles it on demand where we know we need it.

Fixes PR46820

Diff Detail

Event Timeline

craig.topper created this revision.Jul 23 2020, 1:40 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 23 2020, 1:40 PM
Herald added a subscriber: hiraditya. · View Herald Transcript

-Add CHECK lines to the test

craig.topper edited the summary of this revision. (Show Details)Jul 23 2020, 1:42 PM

I was stepping through that function myself after seeing the bug, and it wasn't clear to me why we are ok with reading more than the actual load size. Ie, we are only allowed to read <23 x float>, but we created a <16 x float> load + two <4 x float> loads right? Something later is guaranteed to reduce that to make sure we don't actually load the full <24 x float>?

I was stepping through that function myself after seeing the bug, and it wasn't clear to me why we are ok with reading more than the actual load size. Ie, we are only allowed to read <23 x float>, but we created a <16 x float> load + two <4 x float> loads right? Something later is guaranteed to reduce that to make sure we don't actually load the full <24 x float>?

No we load the whole 24xfloat. Because the load is 16 byte aligned we know the last load won’t cross a page boundary so won’t fault.

On almost all targets, if N is small, reading N bytes from a dereferenceable pointer with align N is safe, even if the known dereferenceable bytes is less than N, due to the way memory allocation works. We do this in a few places, I think?

On almost all targets, if N is small, reading N bytes from a dereferenceable pointer with align N is safe, even if the known dereferenceable bytes is less than N, due to the way memory allocation works. We do this in a few places, I think?

Err, I guess you know that. Yes, it's depending on the alignment here.

It might be a good idea to explicitly call this out on the line where we compute LdAlign.

Improve comment about widening loads based on alignment

spatel accepted this revision.Jul 23 2020, 4:46 PM

I was stepping through that function myself after seeing the bug, and it wasn't clear to me why we are ok with reading more than the actual load size. Ie, we are only allowed to read <23 x float>, but we created a <16 x float> load + two <4 x float> loads right? Something later is guaranteed to reduce that to make sure we don't actually load the full <24 x float>?

No we load the whole 24xfloat. Because the load is 16 byte aligned we know the last load won’t cross a page boundary so won’t fault.

Ah, ok. Might be good to include the same test with minimal alignment, so we see that difference in asm?
The way this code is replacing at the end of the ConcatOps vector is confusing (not sure if we can assert anything from the power-of-2 size assumption?), but that's existing code, so I won't hold up the fix. LGTM.

This revision is now accepted and ready to land.Jul 23 2020, 4:46 PM
This revision was automatically updated to reflect the committed changes.