Details

Reviewers

dblaikie
chandlerc

Commits

rGa13b61f7f0a2: [ADT] Add edit_distance_insensitive to StringRef

Summary

In some instances it's advantageous to calculate edit distances without worrying about casing.
Currently, to achieve this, both strings must first be converted to the same case before the edit distance can be calculated.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

njames93 created this revision. · May 22 2022, 1:23 AM

Herald added a project: Restricted Project. · May 22 2022, 1:23 AM

Herald added a subscriber: hiraditya.

njames93 requested review of this revision. · May 22 2022, 1:23 AM

Herald added a project: Restricted Project. · May 22 2022, 1:23 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B165722: Diff 431219. · May 22 2022, 2:29 AM

xgupta added a subscriber: xgupta. · May 22 2022, 3:28 AM

Any chance of a template to avoid duplicating this code between case sensitive and case insensitive - looks like a long enough function with room for bugs/fixes/changes that it'd be worth avoiding duplication?

Remove code duplication by adding an extra Map parameter to llvm::ComputeEditDistance.

Harbormaster completed remote builds in B165985: Diff 431570. · May 23 2022, 9:59 PM

dblaikie added inline comments. · May 24 2022, 1:14 PM

llvm/include/llvm/ADT/edit_distance.h
45–48	Do you need the default type argument here? The default (or explicit, in the other call) value below would allow template argument deduction, right?
49	I'm not sure this `+` is worthwhile - I'd say either make it a non-template entirely, and hardcode this parameter as a function pointer type (that'd work for the two callers here) or make the functor a template parameter and drop the `+` here (so that there's no call indirection overhead). (I guess a third option would be llvm::function_ref for functor-level generality without the template, but somewhat more runtime overhead, probably) I don't have /super/ strong feelings either way.

njames93 added inline comments. · May 25 2022, 3:32 AM

llvm/include/llvm/ADT/edit_distance.h
45–48	The default value doesn't seem to deduce the template argument, however explicitly passing an argument would deduce the type correctly.
49	You're right, the + isn't necessary
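
The `+` under discussion is presumably the unary plus that makes a capture-less lambda decay to a plain function pointer; the diff revision that contained it is not shown here, so that reading is an assumption. A minimal sketch of the two options dblaikie describes, where `asciiToLower`, `applyPointer`, `applyTemplate`, and `unaryPlusDemo` are hypothetical names used only for illustration:

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical ASCII-only stand-in for llvm::toLower.
inline char asciiToLower(char C) {
  return (C >= 'A' && C <= 'Z') ? static_cast<char>(C - 'A' + 'a') : C;
}

// Option 1: non-template, parameter hard-coded as a function pointer.
inline char applyPointer(char C, char (*Map)(char)) { return Map(C); }

// Option 2: the functor is a template parameter, so a lambda is passed as its
// own closure type and no unary plus is needed.
template <typename Functor> char applyTemplate(char C, Functor Map) {
  return Map(C);
}

inline void unaryPlusDemo() {
  auto Lambda = [](char C) { return asciiToLower(C); };
  auto Pointer = +Lambda; // unary plus: closure type -> char (*)(char)
  static_assert(std::is_same<decltype(Pointer), char (*)(char)>::value,
                "unary plus yields a function pointer");
  static_assert(!std::is_same<decltype(Lambda), decltype(Pointer)>::value,
                "the lambda itself keeps its unique closure type");
  assert(applyPointer('Q', Pointer) == 'q');
  assert(applyTemplate('Q', Lambda) == 'q');
}
```

With the templated form the compiler sees the concrete closure type and can inline the call, which is the "no call indirection overhead" point made in the comment above.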

Remove unnecessary '+'

Harbormaster completed remote builds in B166232: Diff 431931. · May 25 2022, 4:24 AM

dblaikie added inline comments. · May 25 2022, 10:03 AM

llvm/include/llvm/ADT/edit_distance.h
45–48	I think this default template argument is unused? (argument deduction kicks in for both uses, doesn't it?) & wrong anyway - the functor type won't be a function pointer type, it'll be the specific type of each lambda, I think? Could you remove this default template argument?
85–91	is there any concern about the number of times the map operation is used? Looks like the algorithm visits elements more than once, so might be worth some caching? Like an easy one would probably be to compute `Map(FromArray[y-1])` outside the `for x` loop at least? (but maybe other caching should be done too, I don't know - I guess toLower is cheap enough that it's not worth much more involved caching? - I guess `Map(ToArray[x])` could be computed and cached for the next round's references to `Map(ToArray[x-1])` for instance?)

Create a new function ComputeMappedEditDistance to avoid template argument deduction issues.
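
A default function argument does not take part in template argument deduction, which is why a defaulted `Map` parameter cannot deduce `Functor` by itself. A rough sketch of the forwarding-overload approach, simplified to `std::string` (the patch operates on `ArrayRef<T>`) and with `AllowReplacements` and `MaxEditDistance` omitted; `mappedEditDistance`, `plainEditDistance`, and `asciiToLower` are hypothetical names:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical ASCII-only stand-in for llvm::toLower.
inline char asciiToLower(char C) {
  return (C >= 'A' && C <= 'Z') ? static_cast<char>(C - 'A' + 'a') : C;
}

// Levenshtein distance where Map is applied to each element before comparing.
// Functor is always deduced from an explicitly passed argument.
template <typename Functor>
unsigned mappedEditDistance(const std::string &From, const std::string &To,
                            Functor Map) {
  const std::size_t M = From.size(), N = To.size();
  std::vector<unsigned> Row(N + 1);
  for (std::size_t I = 0; I <= N; ++I)
    Row[I] = static_cast<unsigned>(I);
  for (std::size_t Y = 1; Y <= M; ++Y) {
    unsigned Previous = Row[0]; // diagonal cell from the previous row
    Row[0] = static_cast<unsigned>(Y);
    const auto &Cur = Map(From[Y - 1]); // mapped once per outer iteration
    for (std::size_t X = 1; X <= N; ++X) {
      unsigned OldRow = Row[X];
      Row[X] = std::min(Previous + (Cur == Map(To[X - 1]) ? 0u : 1u),
                        std::min(Row[X - 1], Row[X]) + 1);
      Previous = OldRow;
    }
  }
  return Row[N];
}

// Separate overload instead of a default argument: it forwards an identity
// lambda, so no deduction from a default value is ever required.
inline unsigned plainEditDistance(const std::string &From,
                                  const std::string &To) {
  return mappedEditDistance(From, To,
                            [](const char &C) -> const char & { return C; });
}
```

Calling `mappedEditDistance("HELLO", "hill", asciiToLower)` compares the strings case-insensitively, while `plainEditDistance` keeps the original case-sensitive behaviour.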

njames93 marked 4 inline comments as done. · Jun 2 2022, 1:40 AM

njames93 added inline comments.

llvm/include/llvm/ADT/edit_distance.h
85–91	Caching the outer loop value makes sense, but for the inner loop, probably not so much. If it's expensive, it would be better to create new arrays with the mapped values and then call ComputeEditDistance with no map functor.

Cache outer loop map item.

Harbormaster completed remote builds in B167468: Diff 433684. · Jun 2 2022, 3:16 AM

dblaikie accepted this revision. · Jun 2 2022, 1:46 PM

dblaikie added inline comments.

llvm/include/llvm/ADT/edit_distance.h
81	I'm not sure this amounts to anything different than using `auto`? But if the intent was to allow reference types here - maybe this could rely on reference lifetime extension? const auto &CurItem = ... If the map function returns by value, this'll do reference lifetime extension, and if it returns by reference it'll be a reference.

This revision is now accepted and ready to land. · Jun 2 2022, 1:46 PM

njames93 added inline comments. · Jun 3 2022, 4:40 PM

llvm/include/llvm/ADT/edit_distance.h
81	Reference lifetime extension is exactly what I needed, just one of those cases I always forget about
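
The `const auto &` idiom suggested above can be illustrated in isolation. This is a small sketch, not code from the patch; `byValue`, `byReference`, and `lifetimeExtensionDemo` are hypothetical names:

```cpp
#include <string>

// Returns a fresh object (a prvalue).
inline std::string byValue(const std::string &S) { return S + "!"; }

// Returns a reference to its argument; no copy is made.
inline const std::string &byReference(const std::string &S) { return S; }

inline bool lifetimeExtensionDemo() {
  std::string Source = "abc";

  // Binding a temporary to a const reference extends the temporary's
  // lifetime to that of the reference, so Temp stays valid in this scope.
  const auto &Temp = byValue(Source);

  // When the callee returns a reference, the same declaration simply
  // aliases the existing object.
  const auto &Alias = byReference(Source);

  return Temp == "abc!" && &Alias == &Source;
}
```

The same declaration therefore works for a by-value functor such as a lower-casing function and for a by-reference identity functor, without copying in the latter case. Lifetime extension only applies when the temporary is bound directly; a function that returns a reference to one of its own temporaries would still dangle.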

Use reference lifetime extension.

Harbormaster completed remote builds in B167932: Diff 434311. · Jun 5 2022, 3:16 AM

alexander-shaposhnikov added a subscriber: alexander-shaposhnikov. · Jun 5 2022, 3:23 AM

alexander-shaposhnikov added inline comments.

llvm/include/llvm/ADT/edit_distance.h
46	as a side note, wouldn't it be a bit clearer / more flexible to pass a comparator instead ?

This revision was landed with ongoing or failed builds. · Jun 5 2022, 4:03 AM

Closed by commit rGa13b61f7f0a2: [ADT] Add edit_distance_insensitive to StringRef (authored by njames93).

This revision was automatically updated to reflect the committed changes.

njames93 added a commit: rGa13b61f7f0a2: [ADT] Add edit_distance_insensitive to StringRef.

dblaikie added inline comments. · Jun 5 2022, 12:21 PM

llvm/include/llvm/ADT/edit_distance.h
46	oh, fair - I'd be open to that, but wouldn't insist
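
The comparator alternative raised in this thread was not adopted in the landed commit. A minimal sketch of what it could look like, simplified to `std::string` and without `AllowReplacements` or `MaxEditDistance`; `editDistanceWith` and `equalsIgnoreAsciiCase` are hypothetical names:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical comparator: equality ignoring ASCII case.
inline bool equalsIgnoreAsciiCase(char A, char B) {
  auto Lower = [](char C) {
    return (C >= 'A' && C <= 'Z') ? static_cast<char>(C - 'A' + 'a') : C;
  };
  return Lower(A) == Lower(B);
}

// Levenshtein distance parameterised on a binary equality predicate instead
// of a unary mapping function.
template <typename Compare>
unsigned editDistanceWith(const std::string &From, const std::string &To,
                          Compare Equal) {
  std::vector<unsigned> Row(To.size() + 1);
  for (std::size_t I = 0; I < Row.size(); ++I)
    Row[I] = static_cast<unsigned>(I);
  for (std::size_t Y = 1; Y <= From.size(); ++Y) {
    unsigned Previous = Row[0]; // diagonal cell from the previous row
    Row[0] = static_cast<unsigned>(Y);
    for (std::size_t X = 1; X <= To.size(); ++X) {
      unsigned OldRow = Row[X];
      Row[X] = std::min(Previous + (Equal(From[Y - 1], To[X - 1]) ? 0u : 1u),
                        std::min(Row[X - 1], Row[X]) + 1);
      Previous = OldRow;
    }
  }
  return Row[To.size()];
}
```

A comparator can express relations that are not "map, then compare", which is the flexibility argument. The trade-off is that it sees both elements on every call, so the mapped outer-loop element can no longer be cached the way `CurItem` is in the landed version.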

Diff 434318

llvm/include/llvm/ADT/StringRef.h

Show First 20 Lines • Show All 234 Lines • ▼ Show 20 Lines	#endif
/// \returns the minimum number of character insertions, removals,		/// \returns the minimum number of character insertions, removals,
/// or (if \p AllowReplacements is \c true) replacements needed to		/// or (if \p AllowReplacements is \c true) replacements needed to
/// transform one of the given strings into the other. If zero,		/// transform one of the given strings into the other. If zero,
/// the strings are identical.		/// the strings are identical.
LLVM_NODISCARD		LLVM_NODISCARD
unsigned edit_distance(StringRef Other, bool AllowReplacements = true,		unsigned edit_distance(StringRef Other, bool AllowReplacements = true,
unsigned MaxEditDistance = 0) const;		unsigned MaxEditDistance = 0) const;

		LLVM_NODISCARD unsigned
		edit_distance_insensitive(StringRef Other, bool AllowReplacements = true,
		unsigned MaxEditDistance = 0) const;

/// str - Get the contents as an std::string.		/// str - Get the contents as an std::string.
LLVM_NODISCARD		LLVM_NODISCARD
std::string str() const {		std::string str() const {
if (!Data) return std::string();		if (!Data) return std::string();
return std::string(Data, Length);		return std::string(Data, Length);
}		}

/// @}		/// @}
▲ Show 20 Lines • Show All 745 Lines • Show Last 20 Lines

llvm/include/llvm/ADT/edit_distance.h

Show All 22 Lines
namespace llvm {		namespace llvm {

/// Determine the edit distance between two sequences.		/// Determine the edit distance between two sequences.
///		///
/// \param FromArray the first sequence to compare.		/// \param FromArray the first sequence to compare.
///		///
/// \param ToArray the second sequence to compare.		/// \param ToArray the second sequence to compare.
///		///
		/// \param Map A Functor to apply to each item of the sequences before
		/// comparison.
		///
/// \param AllowReplacements whether to allow element replacements (change one		/// \param AllowReplacements whether to allow element replacements (change one
/// element into another) as a single operation, rather than as two operations		/// element into another) as a single operation, rather than as two operations
/// (an insertion and a removal).		/// (an insertion and a removal).
///		///
/// \param MaxEditDistance If non-zero, the maximum edit distance that this		/// \param MaxEditDistance If non-zero, the maximum edit distance that this
/// routine is allowed to compute. If the edit distance will exceed that		/// routine is allowed to compute. If the edit distance will exceed that
/// maximum, returns \c MaxEditDistance+1.		/// maximum, returns \c MaxEditDistance+1.
///		///
/// \returns the minimum number of element insertions, removals, or (if		/// \returns the minimum number of element insertions, removals, or (if
/// \p AllowReplacements is \c true) replacements needed to transform one of		/// \p AllowReplacements is \c true) replacements needed to transform one of
/// the given sequences into the other. If zero, the sequences are identical.		/// the given sequences into the other. If zero, the sequences are identical.
template<typename T>		template <typename T, typename Functor>
unsigned ComputeEditDistance(ArrayRef<T> FromArray, ArrayRef<T> ToArray,		unsigned ComputeMappedEditDistance(ArrayRef<T> FromArray, ArrayRef<T> ToArray,
bool AllowReplacements = true,		Functor Map, bool AllowReplacements = true,
unsigned MaxEditDistance = 0) {		unsigned MaxEditDistance = 0) {
// The algorithm implemented below is the "classic"		// The algorithm implemented below is the "classic"
// dynamic-programming algorithm for computing the Levenshtein		// dynamic-programming algorithm for computing the Levenshtein
// distance, which is described here:		// distance, which is described here:
//		//
// http://en.wikipedia.org/wiki/Levenshtein_distance		// http://en.wikipedia.org/wiki/Levenshtein_distance
//		//
// Although the algorithm is typically described using an m x n		// Although the algorithm is typically described using an m x n
// array, only one row plus one element are used at a time, so this		// array, only one row plus one element are used at a time, so this
// implementation just keeps one vector for the row. To update one entry,		// implementation just keeps one vector for the row. To update one entry,
Show All 15 Lines	unsigned ComputeMappedEditDistance(ArrayRef<T> FromArray, ArrayRef<T> ToArray,
for (unsigned i = 1; i <= n; ++i)		for (unsigned i = 1; i <= n; ++i)
Row[i] = i;		Row[i] = i;

for (typename ArrayRef<T>::size_type y = 1; y <= m; ++y) {		for (typename ArrayRef<T>::size_type y = 1; y <= m; ++y) {
Row[0] = y;		Row[0] = y;
unsigned BestThisRow = Row[0];		unsigned BestThisRow = Row[0];

unsigned Previous = y - 1;		unsigned Previous = y - 1;
		const auto &CurItem = Map(FromArray[y - 1]);
for (typename ArrayRef<T>::size_type x = 1; x <= n; ++x) {		for (typename ArrayRef<T>::size_type x = 1; x <= n; ++x) {
int OldRow = Row[x];		int OldRow = Row[x];
if (AllowReplacements) {		if (AllowReplacements) {
Row[x] = std::min(		Row[x] = std::min(Previous + (CurItem == Map(ToArray[x - 1]) ? 0u : 1u),
Previous + (FromArray[y-1] == ToArray[x-1] ? 0u : 1u),
std::min(Row[x-1], Row[x])+1);		std::min(Row[x - 1], Row[x]) + 1);
}		}
else {		else {
if (FromArray[y-1] == ToArray[x-1]) Row[x] = Previous;		if (CurItem == Map(ToArray[x - 1]))
		Row[x] = Previous;
else Row[x] = std::min(Row[x-1], Row[x]) + 1;		else Row[x] = std::min(Row[x-1], Row[x]) + 1;
}		}
Previous = OldRow;		Previous = OldRow;
BestThisRow = std::min(BestThisRow, Row[x]);		BestThisRow = std::min(BestThisRow, Row[x]);
}		}

if (MaxEditDistance && BestThisRow > MaxEditDistance)		if (MaxEditDistance && BestThisRow > MaxEditDistance)
return MaxEditDistance + 1;		return MaxEditDistance + 1;
}		}

unsigned Result = Row[n];		unsigned Result = Row[n];
return Result;		return Result;
}		}

		template <typename T>
		unsigned ComputeEditDistance(ArrayRef<T> FromArray, ArrayRef<T> ToArray,
		bool AllowReplacements = true,
		unsigned MaxEditDistance = 0) {
		return ComputeMappedEditDistance(
		FromArray, ToArray, [](const T &X) -> const T & { return X; },
		AllowReplacements, MaxEditDistance);
		}

} // End llvm namespace		} // End llvm namespace

#endif		#endif

llvm/lib/Support/StringRef.cpp

Show First 20 Lines • Show All 92 Lines • ▼ Show 20 Lines	unsigned StringRef::edit_distance(llvm::StringRef Other,
bool AllowReplacements,		bool AllowReplacements,
unsigned MaxEditDistance) const {		unsigned MaxEditDistance) const {
return llvm::ComputeEditDistance(		return llvm::ComputeEditDistance(
makeArrayRef(data(), size()),		makeArrayRef(data(), size()),
makeArrayRef(Other.data(), Other.size()),		makeArrayRef(Other.data(), Other.size()),
AllowReplacements, MaxEditDistance);		AllowReplacements, MaxEditDistance);
}		}

		unsigned llvm::StringRef::edit_distance_insensitive(
		StringRef Other, bool AllowReplacements, unsigned MaxEditDistance) const {
		return llvm::ComputeMappedEditDistance(
		makeArrayRef(data(), size()), makeArrayRef(Other.data(), Other.size()),
		llvm::toLower, AllowReplacements, MaxEditDistance);
		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// String Operations		// String Operations
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

std::string StringRef::lower() const {		std::string StringRef::lower() const {
return std::string(map_iterator(begin(), toLower),		return std::string(map_iterator(begin(), toLower),
map_iterator(end(), toLower));		map_iterator(end(), toLower));
}		}
▲ Show 20 Lines • Show All 499 Lines • Show Last 20 Lines

llvm/unittests/ADT/StringRefTest.cpp

Show First 20 Lines • Show All 578 Lines • ▼ Show 20 Lines	TEST(StringRefTest, EditDistance) {
EXPECT_EQ(9U, Soylent.edit_distance("people soiled our green",		EXPECT_EQ(9U, Soylent.edit_distance("people soiled our green",
/* allow replacements = */ true,		/* allow replacements = */ true,
/* max edit distance = */ 8));		/* max edit distance = */ 8));
EXPECT_EQ(53U, Soylent.edit_distance("people soiled our green "		EXPECT_EQ(53U, Soylent.edit_distance("people soiled our green "
"people soiled our green "		"people soiled our green "
"people soiled our green "));		"people soiled our green "));
}		}

		TEST(StringRefTest, EditDistanceInsensitive) {
		StringRef Hello("HELLO");
		EXPECT_EQ(2U, Hello.edit_distance_insensitive("hill"));
		EXPECT_EQ(0U, Hello.edit_distance_insensitive("hello"));

		StringRef Industry("InDuStRy");
		EXPECT_EQ(6U, Industry.edit_distance_insensitive("iNtErEsT"));
		}

TEST(StringRefTest, Misc) {		TEST(StringRefTest, Misc) {
std::string Storage;		std::string Storage;
raw_string_ostream OS(Storage);		raw_string_ostream OS(Storage);
OS << StringRef("hello");		OS << StringRef("hello");
EXPECT_EQ("hello", OS.str());		EXPECT_EQ("hello", OS.str());
}		}

TEST(StringRefTest, Hashing) {		TEST(StringRefTest, Hashing) {
▲ Show 20 Lines • Show All 551 Lines • Show Last 20 Lines

This is an archive of the discontinued LLVM Phabricator instance.

[ADT] Add edit_distance_insensitive to StringRef
ClosedPublic
