Thanks!
mlir/lib/Conversion/ArmSMEToSCF/ArmSMEToSCF.cpp
- Line 34: nit: SmallVector -> SmallVectorImpl
Thanks Cullen! A few minor comments/suggestions.
I think that it would be nice to update one of the tests (or more) as follows:
```mlir
// BEFORE
func.func @arm_sme_tile_load(%src : memref<?x?xi32>) {
  %c0 = arith.constant 0 : index
  %tile = arm_sme.tile_load %src[%c0, %c0] : memref<?x?xi32>, vector<[4]x[4]xi32>
  return
}

// AFTER
func.func @arm_sme_tile_load(%src : memref<?x?xi32>) {
  %c0 = arith.constant 0 : index
  %c123 = arith.constant 123 : index
  %tile = arm_sme.tile_load %src[%c123, %c0] : memref<?x?xi32>, vector<[4]x[4]xi32>
  return
}
```
Otherwise it's a bit tricky to see what has changed; with %c0 in all tests the effective address does not change, does it?
mlir/lib/Conversion/ArmSMEToSCF/ArmSMEToSCF.cpp
- Line 112: [nit] Suggesting a more descriptive name.
mlir/lib/Dialect/ArmSME/Transforms/LegalizeForLLVMExport.cpp
- Line 135: I think that this comment should be moved further down.
mlir/test/Conversion/ArmSMEToSCF/arm-sme-to-scf.mlir
- Line 13: [nit] "TILE_SLICE_OFFSET" suggests an offset into ZA (that's where tile slices are defined), but this is for a plain memref. Perhaps OFFSET? I am also thinking that the key point of this patch is to clearly differentiate between memory offsets (a low-level LLVM concept) and the "tile slice index" (an SME concept).
mlir/test/Integration/Dialect/Vector/CPU/ArmSME/vector-load-store.mlir
- Lines 12–21: Rather than duplicating this, could you use DEFINE and REDEFINE? Alternatively, you could move this to a separate file. There's quite a lot going on here.
- Line 33: Could you add a comment that would highlight the difference between za0_d_f64 and load_store_two_za_s_tiles?
- Lines 246–247: [nit] Small suggestion for a more descriptive name.
Thanks for the comments.
I think that it would be nice to update one of the tests (or more) as follows:
```mlir
// BEFORE
func.func @arm_sme_tile_load(%src : memref<?x?xi32>) {
  %c0 = arith.constant 0 : index
  %tile = arm_sme.tile_load %src[%c0, %c0] : memref<?x?xi32>, vector<[4]x[4]xi32>
  return
}

// AFTER
func.func @arm_sme_tile_load(%src : memref<?x?xi32>) {
  %c0 = arith.constant 0 : index
  %c123 = arith.constant 123 : index
  %tile = arm_sme.tile_load %src[%c123, %c0] : memref<?x?xi32>, vector<[4]x[4]xi32>
  return
}
```

Otherwise it's a bit tricky to see what has changed; with %c0 in all tests the effective address does not change, does it?
This isn't changing anything in the example you provided; the offset is adjusted when materializing the tile slice loop, and that is tested in the -convert-arm-sme-to-scf tests.
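For readers following along, a minimal hypothetical sketch of the tile slice loop being referred to (invented names; this is not the actual output of -convert-arm-sme-to-scf):

```mlir
// Illustrative only: %offset and %num_tile_slices are made-up names.
// The tile is loaded one slice at a time; the user-supplied row index
// is added to the loop induction variable, so a non-zero offset changes
// the effective address of every slice load.
scf.for %tile_slice_idx = %c0 to %num_tile_slices step %c1 {
  %row = arith.addi %tile_slice_idx, %offset : index
  // ... load tile slice %tile_slice_idx from %src[%row, %c0] ...
}
```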
mlir/lib/Dialect/ArmSME/Transforms/LegalizeForLLVMExport.cpp
- Line 135: That's not part of this patch, but done anyway.
mlir/test/Integration/Dialect/Vector/CPU/ArmSME/vector-load-store.mlir
- Lines 12–21: I was thinking this should have a single entry point that calls both tests, but ZA isn't currently re-enabled after calls, so that's not possible until that's resolved. It should be once it is.
- Lines 246–247: I understand you intend that to mean the element type of the tile, but appending a type usually indicates the value is of that type, which this isn't; it's an index. I've kept it as is.
LGTM, thanks! I've left a few more comments, but feel free to ignore.
This isn't changing anything in the example you provided; the offset is adjusted when materializing the tile slice loop, and that is tested in the -convert-arm-sme-to-scf tests.
Clarified what I had in mind inline in a test. Not a blocker.
mlir/test/Dialect/ArmSME/vector-ops-to-llvm.mlir
- Line 62: I was suggesting to change %c0 = 0 to e.g. %c123 = 123 in places like this one. Right now (with %c0) it looks like this:

```
// BEFORE
// CHECK-NEXT: %[[OFF0:.*]] = llvm.mul %[[TILE_SLICE_I64]], %[[STRIDE0]] : i64
// CHECK-NEXT: %[[GEP:.*]] = llvm.getelementptr %[[ALIGNED_BASE]]{{\[}}%[[OFF0]]] : (!llvm.ptr, i64) -> !llvm.ptr, i8
```

becomes:

```
// AFTER
// CHECK-NEXT: %[[OFF0:.*]] = llvm.mul %[[TILE_SLICE_I64]], %[[STRIDE0]] : i64
// CHECK-NEXT: %[[OFF1:.*]] = llvm.add %[[OFF0]], %[[C0_I64]] : i64
// CHECK-NEXT: %[[GEP:.*]] = llvm.getelementptr %[[ALIGNED_BASE]]{{\[}}%[[OFF1]]] : (!llvm.ptr, i64) -> !llvm.ptr, i8
```

Note that OFF0 and OFF1 are identical, so the benefits of this patch are obfuscated. The resulting code is correct _with_ and _without_ your change (because the offset is 0). Replacing %c0 with %c123 makes things more interesting:

```
// BEFORE
// CHECK-NEXT: %[[OFF0:.*]] = llvm.mul %[[TILE_SLICE_I64]], %[[STRIDE0]] : i64
// CHECK-NEXT: %[[GEP:.*]] = llvm.getelementptr %[[ALIGNED_BASE]]{{\[}}%[[OFF0]]] : (!llvm.ptr, i64) -> !llvm.ptr, i8
```

becomes:

```
// AFTER
// CHECK-NEXT: %[[OFF0:.*]] = llvm.mul %[[TILE_SLICE_I64]], %[[STRIDE0]] : i64
// CHECK-NEXT: %[[OFF1:.*]] = llvm.add %[[OFF0]], %[[C123_I64]] : i64
// CHECK-NEXT: %[[GEP:.*]] = llvm.getelementptr %[[ALIGNED_BASE]]{{\[}}%[[OFF1]]] : (!llvm.ptr, i64) -> !llvm.ptr, i8
```

In this case the "BEFORE" is obviously broken.
mlir/test/Integration/Dialect/Vector/CPU/ArmSME/vector-load-store.mlir
- Lines 12–21: I don't follow; with DEFINE/REDEFINE you can re-use everything while keeping the entry point custom:

```
// DEFINE: %{entry_point} = za0_d_f64
// DEFINE: %{compile} = mlir-opt %s -enable-arm-streaming="mode=locally enable-za" \
// DEFINE:   -convert-vector-to-arm-sme -convert-arm-sme-to-scf \
// DEFINE:   -convert-vector-to-llvm="enable-arm-sme" -cse -canonicalize \
// DEFINE:   -allocate-arm-sme-tiles -test-lower-to-llvm
// DEFINE: %{translate} = mlir-translate -mlir-to-llvmir | \
// DEFINE: %{run} = %lli_aarch64_cmd --march=aarch64 --mattr="+sve,+sme" \
// DEFINE:   --entry-function=%{entry_point} \
// DEFINE:   --dlopen=%mlir_native_utils_lib_dir/libmlir_c_runner_utils%shlibext \
// DEFINE:   --dlopen=%mlir_native_utils_lib_dir/libmlir_runner_utils%shlibext | \

// RUN: %{compile} | %{translate} | %{run} | FileCheck %s --check-prefix=CHECK-ZA0_D

// REDEFINE: %{entry_point} = load_store_two_za_s_tiles
// RUN: %{compile} | %{translate} | %{run} | FileCheck %s
```
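For readers unfamiliar with these lit directives: DEFINE introduces a named substitution, later RUN lines expand it, and REDEFINE replaces its value for the RUN lines that follow. A minimal, generic sketch (the runner name and check prefixes are invented for illustration, not taken from this test):

```
// DEFINE: %{entry_point} = main_a
// DEFINE: %{run} = some-runner --entry-function=%{entry_point} %s
// RUN: %{run} | FileCheck %s --check-prefix=CHECK-A

// REDEFINE: %{entry_point} = main_b
// RUN: %{run} | FileCheck %s --check-prefix=CHECK-B
```

Only the entry point is redefined; the shared pipeline is written once.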
- Lines 246–247: That's not the main thing that I had in mind :) Sorry, I should've been clearer. My main suggestion here is to avoid variable names like e.g. size (or name, etc.). With names like this my first question would be "_size_ of what?" (or "_name_ of what?"). Just to clarify, this is only a nit.
Addressed your final comments before landing. Cheers Andrzej!
mlir/test/Dialect/ArmSME/vector-ops-to-llvm.mlir
- Line 62: Ah ok, sure. I've updated vector_load_i8.
mlir/test/Integration/Dialect/Vector/CPU/ArmSME/vector-load-store.mlir
- Lines 12–21: Updated, this is nice. Also updated to use mlir-cpu-runner.
- Lines 246–247: Sorry, when I first read this I didn't spot "size"; I've updated it.