Such cases commonly arise for contractions coming from convolutions with an input channel count of 3. Although we aren't utilizing all 4 lanes of the dot product, it should still be better than performing the multiply and the reduction separately.
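For reference, a minimal scalar sketch of the idea (dot4 is a hypothetical stand-in for a 4-lane hardware dot-product instruction, not the patch's actual lowering): zero-padding the size-3 operands to 4 lanes makes a single dot product compute the same value as the separate multiply-and-reduce.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a 4-lane integer dot-product instruction:
// the sum of the lane-wise products, computed by one hardware op.
int32_t dot4(const std::array<int32_t, 4> &a, const std::array<int32_t, 4> &b) {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

int main() {
  // A size-3 contraction, as produced by a convolution with 3 input channels.
  std::array<int32_t, 3> lhs = {1, 2, 3};
  std::array<int32_t, 3> rhs = {4, 5, 6};

  // Separate elementwise multiply + reduction (the lowering being avoided).
  int32_t expected = 0;
  for (int i = 0; i < 3; ++i)
    expected += lhs[i] * rhs[i];

  // Zero-pad both operands to 4 lanes; the unused lane contributes 0,
  // so one dot-product instruction yields the same result.
  std::array<int32_t, 4> lhs4 = {lhs[0], lhs[1], lhs[2], 0};
  std::array<int32_t, 4> rhs4 = {rhs[0], rhs[1], rhs[2], 0};
  assert(dot4(lhs4, rhs4) == expected);
  return 0;
}
```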
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Comment Actions
LGTM, but I wonder if we should generalize this to support any vector width > 2. For widths > 4, we could unroll it into a chain of 4-element dot products, as sketched below.
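A rough scalar sketch of that chaining idea, under the assumption of an accumulating 4-lane dot-product instruction (dot4_acc and dotN are illustrative names, not the patch's API): reduce in chunks of 4 and zero-pad the tail.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical accumulating 4-lane dot product, mirroring hardware
// dot instructions that add into an accumulator operand.
int32_t dot4_acc(int32_t acc, const int32_t *a, const int32_t *b) {
  return acc + a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// Reduce an n-element product by chaining 4-element dot products;
// the tail chunk is zero-padded so the extra lanes contribute nothing.
int32_t dotN(const std::vector<int32_t> &a, const std::vector<int32_t> &b) {
  assert(a.size() == b.size());
  int32_t acc = 0;
  std::size_t i = 0;
  for (; i + 4 <= a.size(); i += 4)
    acc = dot4_acc(acc, &a[i], &b[i]);
  if (i < a.size()) {
    int32_t pa[4] = {0, 0, 0, 0}, pb[4] = {0, 0, 0, 0};
    for (std::size_t j = 0; i + j < a.size(); ++j) {
      pa[j] = a[i + j];
      pb[j] = b[i + j];
    }
    acc = dot4_acc(acc, pa, pb);
  }
  return acc;
}

int main() {
  std::vector<int32_t> a = {1, 2, 3, 4, 5, 6, 7};
  std::vector<int32_t> b = {7, 6, 5, 4, 3, 2, 1};
  int32_t expected = 0;
  for (std::size_t i = 0; i < a.size(); ++i)
    expected += a[i] * b[i];
  assert(dotN(a, b) == expected);
  return 0;
}
```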
Comment Actions
For vectors with a size larger than 4, we can already rely on unrolling at the vector level, right?
Comment Actions
That's independent of the unrolling, IMO -- whatever unrolling scheme is used, we can efficiently lower any reduction over vectors of size >= 3.