This is an archive of the discontinued LLVM Phabricator instance.

[mlir] Narrow bitwidth emulation for MemRef load
ClosedPublic

Authored by yzhang93 on May 25 2023, 4:41 PM.

Details

Summary

This patch adds support for narrow bitwidth storage emulation. The goal is to support sub-byte type
codegen for the LLVM CPU backend. Specifically, a type converter is added to convert memrefs of a narrow
bitwidth (e.g., i4) into a supported wider bitwidth (e.g., i8). The other focus of this patch is to populate the
pattern for int4 memref.load; the memref.store pattern should be added in a separate patch.
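
For illustration, here is a minimal sketch of what such a type converter could look like; it is not the patch's code, and the helper name addNarrowTypeConversions and the parameter loadStoreBitwidth are made up for the example. It maps, e.g., memref<8xi4> onto memref<4xi8>:

#include <optional>

#include "mlir/IR/BuiltinTypes.h"
#include "mlir/Transforms/DialectConversion.h"
#include "llvm/ADT/SmallVector.h"

using namespace mlir;

// Register conversions that emulate narrow integer memrefs with a wider
// storage type. Unsupported shapes simply fail to convert.
static void addNarrowTypeConversions(TypeConverter &converter,
                                     unsigned loadStoreBitwidth) {
  // Fallback: leave all other types untouched.
  converter.addConversion([](Type type) { return type; });
  converter.addConversion(
      [loadStoreBitwidth](MemRefType type) -> std::optional<Type> {
        auto intTy = dyn_cast<IntegerType>(type.getElementType());
        if (!intTy || intTy.getWidth() >= loadStoreBitwidth)
          return type; // Nothing to emulate.
        if (loadStoreBitwidth % intTy.getWidth() != 0 ||
            !type.hasStaticShape() || type.getRank() == 0)
          return std::nullopt; // Unsupported here; bail out.
        int64_t scale = loadStoreBitwidth / intTy.getWidth();
        SmallVector<int64_t> newShape(type.getShape().begin(),
                                      type.getShape().end());
        if (newShape.back() % scale != 0)
          return std::nullopt;
        // e.g. memref<8xi4> -> memref<4xi8>.
        newShape.back() /= scale;
        return MemRefType::get(
            newShape, IntegerType::get(type.getContext(), loadStoreBitwidth));
      });
}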

Diff Detail

Event Timeline

yzhang93 created this revision.May 25 2023, 4:41 PM
Herald added a project: Restricted Project.May 25 2023, 4:41 PM
yzhang93 requested review of this revision.May 25 2023, 4:41 PM
hanchung retitled this revision from [mlir] Narrow bitwidth emulation for MemRef load r=hanchung to [mlir] Narrow bitwidth emulation for MemRef load.May 26 2023, 11:02 AM
mlir/include/mlir/Dialect/Arith/Transforms/NarrowIntEmulationConverter.h
26 ↗(On Diff #525876)

I have a personal issue with the missing new line between the class and the namespace.

mlir/test/Dialect/MemRef/emulate-narrow-int.mlir
84 ↗(On Diff #525876)

Newline please.

hanchung requested changes to this revision.EditedMay 26 2023, 11:45 AM

Nice work! I dropped a few comments inline. :)

Inspired by an IREE project patch, I found that some people would like to emulate f16 computation in the f32 domain: https://github.com/openxla/iree/pull/13808

I think we can rename EmulateNarrowInt to EmulateNarrowType or EmulateNarrowNumerics, and they can add such patterns to it later.

mlir/lib/Dialect/Arith/Transforms/EmulateNarrowInt.cpp
39–67 ↗(On Diff #525876)

Can we move the conversions out of the constructor? Users want to control the conversions themselves. I think we can move them to TestEmulateNarrowInt.cpp.

45–48 ↗(On Diff #525876)

We don't want to tie the type conversion to targetWideInt. Instead, we leave the decision to users. targetWideInt controls how we load an int4, not the computation domain. Consider the case where we want to use a byte load for int4 but keep the computation on int4. We can force the test to convert int4 to int8, or control it with a flag (like int4-arith-bitwidth=8).
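
For illustration, a sketch of what that could look like in the test pass; this is not code from the patch, and the option name arithComputeBitwidth is hypothetical:

// Widen narrow integer *computation* types only when the (hypothetical)
// arithComputeBitwidth test option asks for it; the load/store width is
// handled separately by the converter itself.
converter.addConversion([=](IntegerType ty) -> std::optional<Type> {
  if (arithComputeBitwidth > 0 && ty.getWidth() < arithComputeBitwidth)
    return IntegerType::get(ty.getContext(), arithComputeBitwidth);
  return ty;
});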

mlir/lib/Dialect/MemRef/Transforms/EmulateNarrowInt.cpp
89–93 ↗(On Diff #525876)

I think we should check if the converted MemRef type is the same as the original type. The emulation is needed only if the memref types mismatch.

The newResTy can stay the same in the scenario of using a byte load but operating in the int4 domain. IMO, the workflow is the following (a rough sketch follows the list):

  1. Check if the original memref type matches the converted memref type. If they mismatch, we need the emulation and apply the pattern.
  2. Load the value using the converted memref type (which is already implemented below).
  3. Cast the loaded value to newResTy if the types mismatch. In that scenario, an int8 value is loaded and we need to cast it to int4 (i.e., newResTy).
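
A rough sketch of that flow as a conversion pattern (assumed names; computeLinearizedIndex is a placeholder for the index arithmetic, and the bit extraction via shift/mask is elided):

LogicalResult
matchAndRewrite(memref::LoadOp op, OpAdaptor adaptor,
                ConversionPatternRewriter &rewriter) const override {
  Location loc = op.getLoc();
  // 1. Emulate only when the converted memref type differs from the original.
  MemRefType oldTy = op.getMemRefType();
  auto newTy =
      dyn_cast_or_null<MemRefType>(getTypeConverter()->convertType(oldTy));
  if (!newTy)
    return rewriter.notifyMatchFailure(op, "cannot convert memref type");
  if (newTy == oldTy)
    return rewriter.notifyMatchFailure(op, "no emulation needed");

  // 2. Load through the converted (wider) memref.
  Value index = computeLinearizedIndex(rewriter, loc, adaptor); // placeholder
  Value wideLoad =
      rewriter.create<memref::LoadOp>(loc, adaptor.getMemref(), index);

  // 3. Cast back to the expected narrow result type when the types mismatch
  //    (after extracting the relevant bits, omitted here).
  Type resultTy = getTypeConverter()->convertType(op.getType());
  Value result = wideLoad;
  if (resultTy != wideLoad.getType())
    result = rewriter.create<arith::TruncIOp>(loc, resultTy, result);
  rewriter.replaceOp(op, result);
  return success();
}
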
99 ↗(On Diff #525876)

Can we add a comment to elaborate the core idea? And some comments about why/how we compute linearized_size, linearized_offset, and %strides#1.

%1 = memref.load %0[%v0, %v1] : memref<?x?xi4, strided<[?, ?], offset: ?>>

can be replaced with

%b, %offset, %sizes:2, %strides:2 = memref.extract_strided_metadata %0
%linearized_offset = ... // %v0 * %strides#0 + %v1 * %strides#1
%linearized_size = ...   // %sizes#0 * %sizes#1
%linearized = memref.reinterpret_cast %b to offset: [%offset], sizes: [%linearized_size], strides: [%strides#1]
%load = memref.load %linearized[%linearized_offset] : memref<?xi4, strided<[?], offset: ?>>
130–131 ↗(On Diff #525876)

style nit: names should be in camelCase, i.e., we should rename them to linearizedOffset and linearizedSize.

https://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly

140–141 ↗(On Diff #525876)

style nit: use auto because XXX::get already spells the type.

https://llvm.org/docs/CodingStandards.html#use-auto-type-deduction-to-make-code-more-readable

199–201 ↗(On Diff #525876)

IMO, this should be derived from targetBitwidth. If the bitwidth is less than targetBitwidth, we use targetBitwidth.

mlir/test/lib/Dialect/MemRef/TestEmulateNarrowInt.cpp
66 ↗(On Diff #525876)

I can't map the comment to the populate functions; maybe we can just remove the comment. It's pretty clear from the function name that we populate the patterns to do narrow type emulation.

This revision now requires changes to proceed.May 26 2023, 11:45 AM
yzhang93 requested review of this revision.May 31 2023, 12:45 PM
yzhang93 updated this revision to Diff 527169.
yzhang93 marked 9 inline comments as done.May 31 2023, 1:05 PM

mlir/lib/Dialect/MemRef/Transforms/EmulateNarrowInt.cpp
89–93 ↗(On Diff #525876)

Thanks for your review and suggestions! I have modified the code accordingly.

kuhar added inline comments.May 31 2023, 1:08 PM
mlir/lib/Dialect/MemRef/Transforms/EmulateNarrowInt.cpp
89–94 ↗(On Diff #527169)
yzhang93 updated this revision to Diff 527268.May 31 2023, 8:32 PM
yzhang93 marked an inline comment as not done.
kuhar added inline comments.Jun 1 2023, 7:44 AM
mlir/lib/Dialect/MemRef/Transforms/EmulateNarrowInt.cpp
211–213 ↗(On Diff #527268)

nit: when the else body is wrapped in braces, the then body should be too

89–94 ↗(On Diff #527169)

Also, we should return failure() when there was no rewrite

yzhang93 updated this revision to Diff 527488.Jun 1 2023, 10:26 AM
yzhang93 marked 3 inline comments as done.
hanchung requested changes to this revision.Jun 1 2023, 3:05 PM

Overall looks good, just some nits. Please also consider renaming the pass name and file name, thanks!

mlir/include/mlir/Dialect/Arith/Transforms/NarrowIntEmulationConverter.h
16–18 ↗(On Diff #527488)

please update the comment as well.

22 ↗(On Diff #527488)

There are two target bitwidths: one is for load/store emulation, and the other is for the arith computation domain. This one relates to load/store emulation; please make the variable name more concrete.

mlir/include/mlir/Dialect/Arith/Transforms/Passes.h
39–40

Please add a comment noting that users need to add conversions for the computation domain of narrow types.

mlir/lib/Dialect/MemRef/Transforms/EmulateNarrowInt.cpp
89–93 ↗(On Diff #525876)

I think we should check if op.getMemRefType() is the same as typeConverter->convertType(op.getMemRefType()). If they are not the same, we need the emulation.

109–119 ↗(On Diff #527488)

Can you format it in a better way? Adding spaces and new lines could help, IMO. Maybe something like: https://github.com/llvm/llvm-project/blob/7f374b6902fad9caed41284a57d573abe9ada9d1/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h#L450-L476

180–182 ↗(On Diff #527488)

It should be typeConverter->convertType(oldElementType).

mlir/test/lib/Dialect/MemRef/TestEmulateNarrowInt.cpp
103–105 ↗(On Diff #527488)

The naming is ambiguous (and there is a mismatch between the name and the flag); can we rename it to something like loadEmulationBitwidth?

This revision now requires changes to proceed.Jun 1 2023, 3:05 PM
yzhang93 requested review of this revision.Jun 1 2023, 10:59 PM
yzhang93 updated this revision to Diff 527745.
yzhang93 marked 8 inline comments as done.
hanchung accepted this revision.Jun 2 2023, 12:24 PM

LGTM if the comments are addressed.

mlir/lib/Dialect/Arith/Transforms/EmulateNarrowType.cpp
11–25

I think we can remove VectorOps.h from the includes.

mlir/lib/Dialect/MemRef/Transforms/EmulateNarrowType.cpp
99–100

we can remove the declaration of source and write it like

MemRefType sourceType = adaptor.getMemRefType();
104–106

This can be simplified to op.getMemRefType().getElementType().getIntOrFloatBitWidth()

132

I would just use adaptor.getMemref() here

190–192

The trunci op is not needed if they have the same number of bits.

225–227

can this be auto?

mlir/test/lib/Dialect/MemRef/TestEmulateNarrowType.cpp
58–63

The if-else already covers all the cases; this can be simplified.

72–80

Ditto, this can be simplified.

This revision is now accepted and ready to land.Jun 2 2023, 12:24 PM
yzhang93 updated this revision to Diff 527998.Jun 2 2023, 2:50 PM
yzhang93 marked 8 inline comments as done.
yzhang93 added inline comments.Jun 2 2023, 2:52 PM
mlir/lib/Dialect/MemRef/Transforms/EmulateNarrowType.cpp
99–100

Looks like there's no member named 'getMemRefType' in 'mlir::memref::LoadOpAdaptor', so I'll keep what I have.

just two nits

mlir/lib/Dialect/MemRef/Transforms/EmulateNarrowType.cpp
231–240

I think we don't need the else; this can save us one level of indentation.

mlir/test/lib/Dialect/MemRef/TestEmulateNarrowType.cpp
72–80

Can we remove the else keyword? That would save us a level of indentation. Same for the one above.

yzhang93 updated this revision to Diff 528020.Jun 2 2023, 3:25 PM
mravishankar requested changes to this revision.Jun 5 2023, 10:56 AM

Nice! Mostly looks good. Just a few comments.

mlir/include/mlir/Dialect/Arith/Transforms/NarrowTypeEmulationConverter.h
24

Make this private and add a getLoadStoreBitwidth method.

mlir/lib/Dialect/MemRef/Transforms/EmulateNarrowType.cpp
30–40

Language nit:

/// When data is loaded/stored in `targetBits` granularity, but is used in `sourceBits` granularity
/// (`sourceBits` < `targetBits`), the `targetBits` word is treated as an array of elements of width `sourceBits`.
/// Return the bit offset of the value at position `srcIdx`. For example, if
/// `sourceBits` equals 4 and `targetBits` equals 8, the x-th element is
/// located at (x % 2) * 4, because there are two elements in one i8 and each
/// element has 4 bits.
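
For illustration, a small sketch of the helper this comment documents; the name getOffsetForBitwidth is assumed here and the actual signature in the patch may differ:

// Returns the bit offset of the `srcIdx`-th `sourceBits`-wide element within
// a `targetBits`-wide word, e.g. sourceBits = 4, targetBits = 8 gives
// (srcIdx % 2) * 4.
static unsigned getOffsetForBitwidth(unsigned srcIdx, unsigned sourceBits,
                                     unsigned targetBits) {
  assert(targetBits % sourceBits == 0 && "sourceBits must divide targetBits");
  unsigned elementsPerWord = targetBits / sourceBits;
  return (srcIdx % elementsPerWord) * sourceBits;
}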
91

Nit: For statements spanning multiple lines, it is still recommended to use braces.

106

Instead of an assert, just return a failure:

return rewriter.notifyMatchFailure(op, "only dstBits % srcBits == 0 supported");
140

I think this needs to happen on the linearizedOffset. Basically:

  1. Find the linearizedOffset.
  2. Divide by the scaling factor (which is dstBits / srcBits).
  3. Load the value.
  4. Get the offset in bits.
185

Note: this is only relevant for big-endian... Maybe add a comment somewhere that this is the only mode supported for now. A more robust option is to allow setting this in the TypeConverter and asserting that the endianness is the one expected; without that it can lead to subtle bugs.

193

I am trying to understand when this case happens. The resultType

This revision now requires changes to proceed.Jun 5 2023, 10:56 AM
yzhang93 requested review of this revision.Jun 5 2023, 4:05 PM
yzhang93 updated this revision to Diff 528616.
yzhang93 marked 8 inline comments as done.
yzhang93 added inline comments.Jun 5 2023, 4:06 PM
mlir/lib/Dialect/MemRef/Transforms/EmulateNarrowType.cpp
140

@hanchung and I discussed this before, and we thought only the last index needs to be modified. However, I have rethought this and I agree with you that the scaling needs to happen after the offset is linearized. @hanchung, let me know if this makes sense to you.

193

This happens when the load bitwidth and computation bitwidth are the same, e.g., when we specify --test-emulate-narrow-int="arith-compute-bitwidth=8 memref-load-bitwidth=8"

mravishankar requested changes to this revision.Jun 7 2023, 9:40 PM
mravishankar added inline comments.
mlir/lib/Dialect/MemRef/Transforms/EmulateNarrowType.cpp
107

Nit: This statement spans two lines. Please use braces.

154

Two things here:

  1. First, I am not sure why you need to special-case sourceRank == 1?
  2. I think this computation is very different from what was there before, which was

linearizedOffset = (adjustedOffset[0] + adjustedOffset[1] + ...) * srcBits / dstBits

Here you seem to be dividing by the scalar (= dstBits / srcBits) as many times as the sourceRank, which seems off.

193

This looks like a premature optimization to me. If the result width and compute width are the same, then there should not be a need to do this... If the compute width is higher, then the trunc and ext should be folded away as a canonicalization. In any case, I actually don't see a test with --test-emulate-narrow-int="arith-compute-bitwidth=8 memref-load-bitwidth=8". Maybe we just do

if (resultTy != srcElementType) {
  result = rewriter.create<arith::TruncIOp>(loc, resultTy, bitsLoad);
}
This revision now requires changes to proceed.Jun 7 2023, 9:40 PM
yzhang93 requested review of this revision.Jun 8 2023, 12:20 PM
yzhang93 updated this revision to Diff 529694.
yzhang93 marked an inline comment as done.
yzhang93 added inline comments.Jun 8 2023, 12:23 PM
mlir/lib/Dialect/MemRef/Transforms/EmulateNarrowType.cpp
154

My bad. Thanks for pointing this out.

193

My idea behind this is that if the emulated memref load bits and the computation bits are the same, e.g., 8 bits, while the actual load is 4 bits as in the example, we need to return 8 bits of data but only the last 4 bits hold the data we need. That's why I added a mask to zero out the first 4 bits so that only the last 4 bits are valid. I also added a test for the "arith-compute-bitwidth=8 memref-load-bitwidth=8" case. We can chat in detail if this doesn't make sense to you.
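
Roughly what the masking could look like (a sketch with assumed names; shiftedLoad stands for the already-shifted i8 load and srcBits is 4 in the example):

// Clear everything but the low `srcBits` bits instead of truncating, since
// the compute type is also i8 in this configuration.
Value mask = rewriter.create<arith::ConstantOp>(
    loc, rewriter.getIntegerAttr(shiftedLoad.getType(), (1 << srcBits) - 1));
Value masked = rewriter.create<arith::AndIOp>(loc, shiftedLoad, mask);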

mravishankar requested changes to this revision.Jun 8 2023, 2:54 PM

Thanks Vivian, there are a couple more bugs in this patch... I also left a suggestion to use makeComposedFoldedAffineApply, which will make the code and the IR more readable.

mlir/lib/Dialect/MemRef/Transforms/EmulateNarrowType.cpp
153

Thanks for the changes. Now I understand better. I think I found an issue (sorry if it was triggered by a suggestion from me): this will segfault for zero-rank memrefs.

So this has to be

Value linearizedOffset = rewriter.create<arith::ConstantIndexOp>(loc, 0);
Value linearizedSize = rewriter.create<arith::ConstantIndexOp>(loc, 1);
for (int i = 0; i < sourceRank; ++i) {
  linearizedOffset = rewriter.create<arith::AddIOp>(loc, linearizedOffset, adjustedOffsets[i]);
  linearizedSize = rewriter.create<arith::MulIOp>(loc, linearizedSize, baseSizes[i]);
}

Better yet... instead of creating all these ops, we can use makeComposedFoldedAffineApply:

OpFoldResult linearizedOffset = rewriter.getIndexAttr(0);
OpFoldResult linearizedSize = rewriter.getIndexAttr(1);
AffineExpr s0, s1, s2;
bindSymbols(rewriter.getContext(), s0, s1, s2);
for (auto i : llvm::seq<int>(0, sourceRank)) {
  linearizedOffset = makeComposedFoldedAffineApply(rewriter, loc, s0 + s1 * s2, {linearizedOffset, indices[i], baseStrides[i]});
  linearizedSize = makeComposedFoldedAffineApply(rewriter, loc, s0 * s1, {linearizedSize, baseSizes[i]});
}
OpFoldResult scaler = rewriter.getIndexAttr(dstBits / srcBits);
linearizedOffset = makeComposedFoldedAffineApply(rewriter, loc, s0.floorDiv(s1), {linearizedOffset, scaler});

Then you can get the Value for linearizedOffset/linearizedSize using getValueOrCreateConstantIndexOp.

This will fold away any statically known values, make both the code and the IR easier to read, and reduce index arithmetic overhead.

173

If I am not mistaken, baseOffset also needs to be scaled.

193

Maybe... But I can't think of a valid program where the load is in 4 bits but its use is directly in 8 bits...

This revision now requires changes to proceed.Jun 8 2023, 2:54 PM
yzhang93 requested review of this revision.Jun 13 2023, 10:30 AM
yzhang93 updated this revision to Diff 530979.
yzhang93 marked an inline comment as done.Jun 13 2023, 10:36 AM
yzhang93 added inline comments.
mlir/lib/Dialect/MemRef/Transforms/EmulateNarrowType.cpp
153

Thanks for pointing out the zero-rank problem, and I appreciate your suggestions.

I tried what you suggested with AffineApplyOp, but even with the simplest test I kept getting "error: failed to legalize operation 'memref.load' that was explicitly marked illegal %1 = memref.load %0[%arg0] : memref<4xi4>". I'm not sure what caused the error, but if you know of any potential issue and a way to fix it, please let me know.

For now I refactored the code and added a case for sourceRank == 0. I think we probably want to treat these cases separately, because when sourceRank == 0 we don't need to do linearization with the memref.reinterpret_cast op.
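
Roughly, the rank-0 special case could look like the following sketch (assumed names, not the exact code):

// A zero-rank memref holds a single element, so there is nothing to
// linearize: load the (wider) element directly and then extract the narrow
// bits as in the ranked path.
if (sourceRank == 0) {
  Value wideLoad = rewriter.create<memref::LoadOp>(
      loc, adaptor.getMemref(), ValueRange{});
  // ... shift/mask/truncate wideLoad as usual ...
}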

yzhang93 updated this revision to Diff 533814.Jun 22 2023, 4:38 PM
yzhang93 marked 2 inline comments as done.Jun 22 2023, 4:58 PM
yzhang93 added inline comments.
mlir/lib/Dialect/MemRef/Transforms/EmulateNarrowType.cpp
153

In the latest revision, I refactored the linearization code with AffineApplyOp as suggested. I also added a conversion pattern for memref::AssumeAlignmentOp, as it is required for the e2e test. The tests were updated accordingly.

mravishankar accepted this revision.Jun 23 2023, 6:35 PM

Thanks! This revision looks good!

mlir/lib/Dialect/MemRef/Transforms/EmulateNarrowType.cpp
62

Nit: Avoid using auto here. It is only used in LLVM when the type is obvious from the context, and here it is not.

86

Nit: avoid using the same variable in two contexts.

264

I am still not sure about this one... Not really sure this actually happens in practice, but it is harmless enough. (Could you just leave a comment explaining that it isn't clear this is needed, or something to record this discussion?)

This revision is now accepted and ready to land.Jun 23 2023, 6:35 PM
yzhang93 updated this revision to Diff 534651.Jun 26 2023, 10:56 AM
yzhang93 marked 6 inline comments as done.
hanchung accepted this revision.Jun 26 2023, 2:15 PM
This revision was automatically updated to reflect the committed changes.