This is an archive of the discontinued LLVM Phabricator instance.

[Mips] Add intrinsics for 4-byte and 8-byte MSA loads/stores.
ClosedPublic

Authored by mbrkusanin on Jan 29 2020, 10:04 AM.

Details

Summary

New intrinisics are implemented for when we need to port SIMD code from other
arhitectures and only load or store portions of MSA registers.

Following intriniscs are added which only load/store element 0 of a vector:
v4i32 builtin_msa_ldr_w (const void *, imm_n2048_2044);
v2i64
builtin_msa_ldr_d (const void *, imm_n4096_4088);
void builtin_msa_str_w (v4i32, void *, imm_n2048_2044);
void
builtin_msa_str_d (v2i64, void *, imm_n4096_4088);

Diff Detail

Event Timeline

mbrkusanin created this revision.Jan 29 2020, 10:04 AM
mbrkusanin added a comment.EditedJan 29 2020, 10:13 AM

A few notes/questions:

  1. Generated code was tested with Qemu:
    • For mips32r5 Qemu provides p5600
    • For mips64r6 Qemu provides i6400
    • For mips64r5 there is no cpu on Qemu with MSA and it appears that there won't be any hardware with Mips64r5 and MSA.
    • For mips32r6 Qemu only provides a cpu called mips32r6-generic which does not support MSA. I tested the code for this on mips64r6.
  1. Names of the new intrinsics can be explained in the following way:

__builtin_msa_ldr_d (load right half)
__builtin_msa_ldrq_w (load right quarter)
__builtin_msa_str_d (store right half)
__builtin_msa_strq_w (store right quarter)
Other proposed names are: ld1_d/ld1_w/st1_d/st1_w and ldc1/lwc1/sdc1/swc1. I have no strong preference and would not mind changing them if someone thinks they would fit better.

  1. I did not make any tests for Clang (c/c++ test) since there are no tests for other intrinsics. Also should these new intrinsics be documented somewhere? Most other are corresponding to some instruction that already exists but these are replaced with pseudos.
  1. emitLDRQ_W() and emitLDR_D() could be combined into one function but it decreases readability. Same with emitting stores: emitSTRQ_W() and emitSTR_D(). I already tried this and have the code ready if this would be more preferable.

Is it possible to emulate these new intrinsics using existing ones and some additional code? Is code generated in this case much larger/slower then the code generated by the new intrinsics?

We could do that for loads. For example on Mips32r5 (where we need most instructions) for intrinsic ldr_d instead of:

	lwr	$1, 16($5)
	lwl	$1, 19($5)
	lwr	$2, 20($5)
	lwl	$2, 23($5)
	fill.w	$w0, $1
	insert.w	$w0[1], $2

We could use already available ld.d and then fix up $w0[2] and $w0[3] manually (when working with MSA128WRegClass / v4i32). ld.d has no alignment restrictions.

	ld.d	$w0, 16($5)
	copy_s.w	$1, $w0[0]
	insert.w	$w0[2], $1
	insert.w	$w0[3], $1

Optionally if we don't care what values are loaded in elements other then first we could just use ld.d and ld.w for ldr_d and ldrq_w respectively.

For stores however we cannot use st.d or st.w because we would write to memory we are not supposed to (we write to void* not necessarily v2i64 or v4i32).

I see, thanks. Is there the same or similar functionality in GCC?

mbrkusanin updated this revision to Diff 243153.EditedFeb 7 2020, 6:01 AM

Rebase.

mbrkusanin added a comment.EditedFeb 7 2020, 6:03 AM

Not yet, a proposal was made to both GCC and LLVM and as far as I can tell no work was done on GCC yet. If we accept these names I'll let them know so we end up with matching names.

As for 4/8 byte loads, in case of having them implemented as ld plus some extra instructions, I don't really see the point about making sure those other vector elements have same value as first. So if we ignore those we remain with only ld. In that case we can just not implement these loads and just have the user use __builtin_msa_ld_w and __builtin_msa_ld_d instead. But if we do decide to implement them it would make more sense to have them only read 4/8 bytes instead of all 16. That way you can use both since ld is already available.

atanasyan accepted this revision.Feb 10 2020, 4:22 AM

Looking good to me as-is.

  • Current naming is okay. But what do you think about reducing name of quarter intrinsics: __builtin_msa_ldr_w instead of __builtin_msa_ldrq_w? Will it clash with any future intrinsics' names?
  • There is almost no documentation on target specific intrinsics. Some articles like Using ARM NEON instructions in big endian mode cover specific use cases. It's up to you to write an article for these new intrinsics.
This revision is now accepted and ready to land.Feb 10 2020, 4:22 AM
mbrkusanin edited the summary of this revision. (Show Details)
  • Rebase
  • Rename ldrq_w to ldr_w; Rename strq_w to str_w.
This revision was automatically updated to reflect the committed changes.