This is an archive of the discontinued LLVM Phabricator instance.

[ARM]{WIP] SADD16 support in ParallelDSP
Abandoned (Public)

Authored by samparker on Jul 12 2018, 8:06 AM.

Details

Summary

Changes to allow the ParallelDSP pass to perform some vectorisation on add instructions in a (typically unrolled) loop, converting them to sadd16.

  • ParallelChains has been introduced to collate multiple parallel OpChains and Reduction now inherits from this class.
  • The SuperWord class is introduced, also inheriting from ParallelChains, to represent parallel chains rooted at different store instructions. These are created while searching for sequential stores, which form the roots from which the chains can then be compared in the usual way.
  • AreAliased is now given a ParallelChains object instead of the OpChainList 'Candidates', which allows us to query other writes in the region.
  • Finally, to help memory management, OpChainList also now holds a unique_ptr to the OpChain.
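
As a rough illustration of the structure described in the list above, the relationships could be sketched as follows. This is not the code from the diff: only the names ParallelChains, Reduction, SuperWord, OpChain and OpChainList come from the summary, while the members (Root, addChain, size, chains) are hypothetical and chosen purely for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the pass's OpChain: a chain of operations,
// modelled here only by the name of its root instruction.
struct OpChain {
  std::string Root;
  explicit OpChain(std::string R) : Root(std::move(R)) {}
};

// As in the summary: the list owns its chains through unique_ptr.
using OpChainList = std::vector<std::unique_ptr<OpChain>>;

// Collates several OpChains that are candidates for parallel execution.
class ParallelChains {
public:
  virtual ~ParallelChains() = default;

  // Hypothetical helper: create a chain and hand ownership to the list.
  OpChain &addChain(std::string Root) {
    assert(!Root.empty() && "a chain needs a root");
    Chains.push_back(std::make_unique<OpChain>(std::move(Root)));
    return *Chains.back();
  }

  std::size_t size() const { return Chains.size(); }
  const OpChainList &chains() const { return Chains; }

private:
  OpChainList Chains;
};

// Chains that feed a single accumulated value.
class Reduction : public ParallelChains {};

// Chains rooted at different (sequential) store instructions.
class SuperWord : public ParallelChains {};
```

Because the list holds unique_ptr, destroying a Reduction or SuperWord releases its chains without any explicit cleanup, which is presumably the memory-management benefit the last bullet refers to.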

I've also made some miscellaneous changes, such as initialising pointers and moving a couple of things into lambda helpers...

Event Timeline

samparker created this revision. Jul 12 2018, 8:06 AM

I appreciate the tests here are a little lacking; I will add something for:

  • volatile stores,
  • non-consecutive stores,
  • adding a constant,
  • decrementing indvar,
  • a manually unrolled piece of source.

sadd16 writes to the GE flags, which we allow the user to read using llvm.arm.sel. We need to ensure this transform doesn't interfere with the user's code.
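
To make the concern concrete, here is a portable C++ model of what the instruction does, written from the architectural description rather than taken from the patch. The names PackedResult, pack, sadd16Model and addPacked are hypothetical. Each 16-bit lane is added independently with wraparound, and the GE bits of a lane are set when that lane's signed sum is non-negative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Packed sum plus a model of the four APSR.GE bits (bits [3:0]).
struct PackedResult {
  uint32_t Value;
  uint8_t GE;
};

// Pack two signed 16-bit lanes into one 32-bit word (low lane first).
inline uint32_t pack(int16_t Lo, int16_t Hi) {
  return (static_cast<uint32_t>(static_cast<uint16_t>(Hi)) << 16) |
         static_cast<uint16_t>(Lo);
}

// Model of SADD16: two independent signed 16-bit additions.
// GE[1:0] describes the low lane, GE[3:2] the high lane.
inline PackedResult sadd16Model(uint32_t A, uint32_t B) {
  int32_t Lo = static_cast<int16_t>(static_cast<uint16_t>(A)) +
               static_cast<int16_t>(static_cast<uint16_t>(B));
  int32_t Hi = static_cast<int16_t>(static_cast<uint16_t>(A >> 16)) +
               static_cast<int16_t>(static_cast<uint16_t>(B >> 16));
  uint32_t Value = (static_cast<uint32_t>(Lo) & 0xFFFFu) |
                   ((static_cast<uint32_t>(Hi) & 0xFFFFu) << 16);
  uint8_t GE = static_cast<uint8_t>((Lo >= 0 ? 0x3 : 0x0) |
                                    (Hi >= 0 ? 0xC : 0x0));
  return {Value, GE};
}

// What the transform aims for: each pair of 16-bit adds in the unrolled
// loop becomes one packed add on a 32-bit word.
inline std::vector<uint32_t> addPacked(const std::vector<uint32_t> &A,
                                       const std::vector<uint32_t> &B) {
  assert(A.size() == B.size() && "inputs must have the same length");
  std::vector<uint32_t> Out(A.size());
  for (std::size_t I = 0; I < A.size(); ++I)
    Out[I] = sadd16Model(A[I], B[I]).Value;
  return Out;
}
```

The GE field in the result is the side effect at issue: the original scalar adds leave GE untouched, so replacing them with sadd16 is only safe when nothing later reads those bits.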

samparker updated this revision to Diff 156798. Jul 23 2018, 8:35 AM

Performed a rebase and added a test from a manually unrolled example. I've also added an option to control the use of instructions that write the GE flags. Really, I think this should go as a subtarget feature so this can be used across this pass and ARMCodeGenPrepare.

samparker updated this revision to Diff 156803. Jul 23 2018, 9:10 AM

Added tests for:

  • non load operand to the add,
  • immediate operand to the add,
  • volatile store,
  • non-consecutive loads.

I think this should go as a subtarget feature so this can be used across this pass and ARMCodeGenPrepare

A subtarget feature for what exactly? "+my-code-does-not-read-ge-flags"? If we're going to provide the builtins in clang, they should automatically work correctly without the user doing something weird.

Is it such a bad idea? Sure, I would like to check whether the sel intrinsic has been used or not, but what happens in the case of inline assembly? The AAPCS is also vague; I'm not sure what a 'public interface' is in terms of an LLVM module. I'd like to have an option with which the user can say explicitly that it's fine to use these instructions.

'public interface' means 'function' when it comes to C/C++. The ACLE document says a bit more (https://developer.arm.com/products/software-development-tools/compilers/arm-compiler-5/docs/101028/latest/8-data-processing-intrinsics), but essentially it's perfectly possible to write

#include <arm_acle.h>

uint8x4_t fn(uint8x4_t x, uint8x4_t y) {
  __usub8(x,y); // sets ge flags

  // code here that ARMParallelDSP could possibly optimise

  return __sel(x, y);
}

and, so long as there are no function calls between the usub8 and the sel, the compiler must preserve the ge bits produced by the usub8 so that the sel can use them.
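
The hazard can be reproduced with a small portable model. The sketch below is an assumption-laden illustration, not compiler code: GEState, usub8Model, selModel, sadd16Model and selectAfterCompare are hypothetical names, and the instruction behaviour is modelled from the architectural description (usub8 sets GE[i] when byte lane i does not borrow; sel takes byte i from its first operand when GE[i] is set).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the four APSR.GE bits; not an LLVM or ACLE type.
struct GEState {
  uint8_t GE = 0;
};

// Model of USUB8: four unsigned byte subtractions, each setting its GE bit
// when the lane does not borrow.
inline uint32_t usub8Model(GEState &S, uint32_t X, uint32_t Y) {
  uint32_t Result = 0;
  S.GE = 0;
  for (int I = 0; I < 4; ++I) {
    int32_t Diff = static_cast<int32_t>((X >> (8 * I)) & 0xFF) -
                   static_cast<int32_t>((Y >> (8 * I)) & 0xFF);
    Result |= (static_cast<uint32_t>(Diff) & 0xFF) << (8 * I);
    if (Diff >= 0)
      S.GE = static_cast<uint8_t>(S.GE | (1u << I));
  }
  return Result;
}

// Model of SEL: byte i comes from X when GE[i] is set, otherwise from Y.
inline uint32_t selModel(const GEState &S, uint32_t X, uint32_t Y) {
  assert(S.GE < 16 && "only four GE bits exist");
  uint32_t Result = 0;
  for (int I = 0; I < 4; ++I) {
    uint32_t Mask = 0xFFu << (8 * I);
    Result |= ((S.GE >> I) & 1) ? (X & Mask) : (Y & Mask);
  }
  return Result;
}

// Model of SADD16 that overwrites GE as a side effect.
inline uint32_t sadd16Model(GEState &S, uint32_t A, uint32_t B) {
  int32_t Lo = static_cast<int16_t>(static_cast<uint16_t>(A)) +
               static_cast<int16_t>(static_cast<uint16_t>(B));
  int32_t Hi = static_cast<int16_t>(static_cast<uint16_t>(A >> 16)) +
               static_cast<int16_t>(static_cast<uint16_t>(B >> 16));
  S.GE = static_cast<uint8_t>((Lo >= 0 ? 0x3 : 0x0) | (Hi >= 0 ? 0xC : 0x0));
  return (static_cast<uint32_t>(Lo) & 0xFFFFu) |
         ((static_cast<uint32_t>(Hi) & 0xFFFFu) << 16);
}

// usub8 followed by sel, optionally with a GE-writing add in between,
// standing in for an add that the pass rewrote to sadd16.
inline uint32_t selectAfterCompare(uint32_t X, uint32_t Y, bool InsertSadd16) {
  GEState S;
  usub8Model(S, X, Y);
  if (InsertSadd16)
    sadd16Model(S, 0x0001FFFFu, 0); // lanes: low = -1, high = +1
  return selModel(S, X, Y);
}
```

With X = 0x10203040 and Y = 0x40302010, the undisturbed sequence selects the byte-wise maximum, 0x40303040. With the extra sadd16 in between, the same sel returns 0x10202010, because GE now reflects the signs of the unrelated 16-bit sums.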

samparker retitled this revision from [ARM] SADD16 support in ParallelDSP to [ARM]{WIP] SADD16 support in ParallelDSP. Jul 25 2018, 1:10 AM
samparker abandoned this revision. Jul 30 2019, 5:10 AM