This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Add pattern for SQDML*Lv1i32_indexed
ClosedPublic

Authored by overmighty on Aug 11 2022, 10:46 AM.

Details

Summary

There was no pattern to fold into these instructions. This patch adds
the pattern obtained from the following ACLE intrinsics so that they
generate sqdmlal/sqdmlsl instructions instead of separate sqdmull and
sqadd/sqsub instructions:

  • vqdmlalh_s16, vqdmlslh_s16
  • vqdmlalh_lane_s16, vqdmlalh_laneq_s16, vqdmlslh_lane_s16, vqdmlslh_laneq_s16 (when the lane index is 0)

It also modifies the result of the existing pattern for the latter, when
the lane index is not 0, to use the v1i32_indexed instructions instead
of the v4i16_indexed ones.

Fixes #49997.

Diff Detail

Event Timeline

overmighty created this revision.Aug 11 2022, 10:46 AM
overmighty requested review of this revision.Aug 11 2022, 10:46 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 11 2022, 10:46 AM
overmighty added inline comments.Aug 11 2022, 11:32 AM
llvm/lib/Target/AArch64/AArch64InstrFormats.td
8849–8850

This matches the vqdmlalh_lane_s16, vqdmlalh_laneq_s16, vqdmlslh_lane_s16, and vqdmlslh_laneq_s16 ACLE functions when the lane is not 0.

I can simply remove this definition and the SQDML*Lv1i32_indexed instructions will be used, however it will result in longer AArch64 code (extra dup instruction).

Is this FIXME really an issue?

overmighty added inline comments.Aug 11 2022, 2:02 PM
llvm/lib/Target/AArch64/AArch64InstrFormats.td
8896

It looks like the second FPR32Op here is the cause of the failing tests, but it is required for the ISel DAG pattern below, which is the pattern of the ACLE functions mentioned in the diff summary. I doubt that we want to change the tests to fit this diff either, as instructions such as sqdmlal.h s0, h0, v0[7] are valid and should not result in llvm-mc errors.

Sounds OK to me if you can do it without changing the register types.

(I was originally unsure, for some of these it is not correct for sqadd+sqmul to be sqmla. But in this case it seems valid. The sqdmlsl does perform two saturation steps)

It may need a separate Pat so that is can generate an EXTRACT_SUBREG, I'm not sure. What happens if the lane index is not 0?

A separate Pat is now used instead of setting the pattern in the v1i32_indexed instructions' definition, so that changing the instructions' register types is no longer required.

The tests have been updated.

@dmgreen When the lane index is not 0, the Pat from lines 8852 to 8862 in AArch64InstrFormats.td is used. See my comment on line 8862 where I also ask about the FIXME comment before the said Pat.

overmighty updated this revision to Diff 452644.EditedAug 15 2022, 6:20 AM
overmighty edited the summary of this revision. (Show Details)

The (v4i16 (scalar_to_vector (i32 FPR32Op:$Rn))) part of the pattern was replaced with what it already folds into: (v4i16 V64:$Rn), as in the existing pattern for the vqdml*lh_lane*_s16 functions when given a lane index other than 0. The (i32 FPR32Op:$Rd) part of the result has been simplified into FPR32Op:$Rd too.

The existing Pat for vqdml*lh_lane*_s16 with lane indexes other than 0 has been modified to use the v1i32_indexed instructions instead of the v4i16_indexed ones. I figured this was a better thing to do than ask if the FIXME comment preceding it really is an issue or not. I am not sure what that comment meant by "but an intermediate EXTRACT_SUBREG would be untyped." Relevant tests have been updated.

I tried to follow the indentation style that the other DAGs in the SIMDIndexedLongSQDMLXSDTied multiclass seemed to have.

Can you add a testcase where the extractelement is not the 0 lane? It would be good to have tests to make sure that the new pattern with non-zero offset works OK.

The moved pattern looks OK from what I can see. The input hasn't changed, just been reformatted? Just the output of it has changed? I agree that I'm not sure what the FIXME meant. Tablegen can be finicky at times.

llvm/lib/Target/AArch64/AArch64InstrFormats.td
8908

Does this need to be 0, if we use the lowest lane in the output pattern (EXTRACT_SUBREG V64:$Rn, hsub)?

The VectorIndexH:$idx part of the new pattern has been replaced with (i64 0) to avoid unwanted matches that would result in incorrect code generation.

overmighty updated this revision to Diff 452780.EditedAug 15 2022, 12:49 PM

Oops, I was about to reply that such test cases already exist, but I think that I misunderstood what you meant. I added the test cases that you wanted if I understand you correctly now.

The input of the moved pattern has indeed just been reformatted, to follow the indentation style that the other DAGs in the multiclass seemed to follow, although different indentation styles are used throughout the entire file. Only the output has changed.

dmgreen accepted this revision.Aug 16 2022, 2:53 AM

Sounds good thanks. Those new tests look good.

LGTM

This revision is now accepted and ready to land.Aug 16 2022, 2:53 AM

Do you have commit access, or do you want us to commit this for you? If so can you provide a "name <email@etc.com>" to attribute the patch to.

overmighty marked an inline comment as done.Aug 16 2022, 3:01 AM

I do not have commit access. Please commit this as "OverMighty <its.overmighty@gmail.com>". Thank you. :)

overmighty edited the summary of this revision. (Show Details)Aug 16 2022, 6:04 AM
This revision was automatically updated to reflect the committed changes.