u/Salt-Permit-8763

I have a data frame of 600 books, mostly on law firm management. My code removes stop words from the Title variable. The code runs, but the results are titles that have little to do with each other. The method is Jaro-Winkler ("jw"), and I have not tried the other methods, Jaccard and Levenshtein; since they are all math-based, I don't know whether either would do any better. Is there another library for fuzzy matching text?

library(stringdist)
library(stringr)   # str_remove_all, str_to_lower, str_split
library(dplyr)     # mutate, arrange, slice_head, select

find_best_match <- function(query, data = df, method = "jw", n = 1) {
  # Clean the query the same way as the corpus
  query_clean <- query |>
    str_remove_all("\\*") |>        # strip asterisks if present
    str_to_lower() |>
    str_split("\\s+") |>
    unlist() |>
    setdiff(all_stops$word) |>      # remove stop words
    paste(collapse = " ")

  # Compute distance between query and every cleaned title
  distances <- stringdist(query_clean, data$Title_clean, method = method)

  # Return top n matches
  data |>
    mutate(distance = distances) |>
    arrange(distance) |>
    slice_head(n = n) |>
    select(Book, Title, Title_clean, distance)
}
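For what it's worth, since `stringdist()` takes the method as an argument, the three distances can be compared side by side on the same cleaned strings before committing to one. A sketch with made-up titles (the `p` prefix weight for "jw" and the `q`-gram size for "jaccard" are illustrative choices, not anything from my code):

```
library(stringdist)

titles <- c("managing the modern law firm",
            "law firm marketing strategies",
            "partner compensation systems")
query <- "law firm management"

# Jaro-Winkler: character-based, rewards shared prefixes
stringdist(query, titles, method = "jw", p = 0.1)

# Levenshtein edit distance, normalized by the longer string
stringdist(query, titles, method = "lv") / pmax(nchar(query), nchar(titles))

# Jaccard on character 2-grams: overlap of letter pairs, order-insensitive
stringdist(query, titles, method = "jaccard", q = 2)
```

Each call returns one distance per title (lower = closer), so `which.min()` on any of these picks that method's best match.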
