r/rprogramming

I have a final group college project where I have to wrangle and clean a bunch of data using dplyr, while I have ZERO idea what the R app even does, because my groupmates pushed the hardest and most technical parts onto me while giving themselves such amazing jobs like PowerPoint editor (it's just copying Canvas templates) and script writing (I'm pretty sure they're using AI), and I have no clue what I should do.

what the actual FUCK am i supposed to do in data wrangling and cleanup?
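Not your groupmates, but for anyone landing here: "wrangling and cleaning with dplyr" usually boils down to a few verbs chained with pipes — filter bad rows, fix types, normalize text. A minimal sketch with made-up data (the column names here are just for illustration):

```r
library(dplyr)

raw <- data.frame(
  name  = c("  Alice", "BOB", NA, "Carol"),
  score = c("90", "85", "70", "not recorded")
)

clean <- raw |>
  filter(!is.na(name)) |>                        # drop rows missing a name
  mutate(
    name  = trimws(tolower(name)),               # normalize whitespace and case
    score = suppressWarnings(as.numeric(score))  # non-numeric values become NA
  )
```

Whatever your actual dataset is, the workflow is the same shape: look at the raw data, decide what "clean" means for each column, and express each decision as one `filter()`/`mutate()` step.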

reddit.com
u/ilikeitchyballzdude1 — 10 days ago

I have a data frame of 600 books, mostly on law firm management. My code removes stop words from the Title variable. The code runs, but the results are titles that have little to do with each other. The method is Jaro-Winkler ("jw"), and I have not tried the other methods, Jaccard and Levenshtein. If they are all math-based, I don't know whether the latter two will be any better. Is there another library for fuzzy matching text?

library(stringdist)
library(stringr)  # str_remove_all, str_to_lower, str_split
library(dplyr)    # mutate, arrange, slice_head, select

find_best_match <- function(query, data = df, method = "jw", n = 1) {
  # Clean the query the same way as the corpus
  query_clean <- query |>
    str_remove_all("\\*") |>     # strip asterisks if present
    str_to_lower() |>
    str_split("\\s+") |>
    unlist() |>
    setdiff(all_stops$word) |>   # remove stop words
    paste(collapse = " ")

  # Compute distance between query and every cleaned title
  distances <- stringdist(query_clean, data$Title_clean, method = method)

  # Return top n matches
  data |>
    mutate(distance = distances) |>
    arrange(distance) |>
    slice_head(n = n) |>
    select(Book, Title, Title_clean, distance)
}
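Since `method` is already a parameter, one cheap experiment is to run the same query under several stringdist methods and eyeball which ranking looks sane. A sketch, assuming `df` (with a `Title_clean` column) and `all_stops` are set up as in the function above, and using a hypothetical query string:

```r
# "lv" is Levenshtein in stringdist; "jaccard" and "cosine" are token/q-gram based
# and tend to behave differently from "jw" on reordered or multi-word titles.
for (m in c("jw", "lv", "jaccard", "cosine")) {
  cat("method:", m, "\n")
  print(find_best_match("managing the law firm", method = m, n = 3))
}
```

Note that "jw" and "lv" compare character sequences, so they reward titles that start with the same letters; "jaccard" and "cosine" over q-grams care more about shared word fragments regardless of order, which may be closer to what you mean by titles "having to do with each other".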

u/Salt-Permit-8763 — 11 days ago