I have a data frame of 600 books, mostly on law firm management. My code removes stop words from the Title variable. The code runs, but the top matches are titles that have little to do with each other. The method is Jaro-Winkler ("jw"), and I have not tried the other methods, Jaccard and Levenshtein. Since they are all character/edit based, I don't know whether the latter two will be any better. Is there another library for fuzzy matching text?
library(stringdist)
library(stringr)  # str_remove_all, str_to_lower, str_split
library(dplyr)    # mutate, arrange, slice_head, select

find_best_match <- function(query, data = df, method = "jw", n = 1) {
  # Clean the query the same way as the corpus
  query_clean <- query |>
    str_remove_all("\\*") |>        # strip asterisks if present
    str_to_lower() |>
    str_split("\\s+") |>
    unlist() |>
    setdiff(all_stops$word) |>      # remove stop words (all_stops built elsewhere)
    paste(collapse = " ")

  # Compute distance between the query and every cleaned title
  distances <- stringdist(query_clean, data$Title_clean, method = method)

  # Return the n closest matches
  data |>
    mutate(distance = distances) |>
    arrange(distance) |>
    slice_head(n = n) |>
    select(Book, Title, Title_clean, distance)
}
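For what it's worth, stringdist itself offers more than edit-style distances: with a q-gram based method (e.g. "jaccard" or "cosine" with the q argument) two titles that share words but differ in word order score as much closer than they do under Jaro-Winkler. A minimal sketch, using two made-up titles purely for illustration:

```r
library(stringdist)

a <- "law firm marketing"
b <- "marketing the law firm"

# Character/position based: Jaro-Winkler is sensitive to word order
d_jw <- stringdist(a, b, method = "jw")

# q-gram based: compares the sets/profiles of 2-character chunks,
# so reordered words still overlap heavily
d_jac <- stringdist(a, b, method = "jaccard", q = 2)
d_cos <- stringdist(a, b, method = "cosine",  q = 2)

c(jw = d_jw, jaccard = d_jac, cosine = d_cos)
```

All of these return a distance in [0, 1] for these methods (0 = identical), so you can pass method = "jaccard", for example, straight into your find_best_match() without other changes.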