r/rstats

Improved markdown quality, code intelligence for 248 languages, and more in Kreuzberg v4.7.0
🔥 Hot ▲ 60 r/LocalLLaMA+10 crossposts

Improved markdown quality, code intelligence for 248 languages, and more in Kreuzberg v4.7.0

Kreuzberg v4.7.0 is here. Kreuzberg is a Rust-core document intelligence library that works with Python, TypeScript/Node.js, Go, Ruby, Java, C#, PHP, Elixir, R, C, and WASM. 

We’ve added several features, integrated with OpenWebUI, and made a big improvement in output quality across all formats. There is also a new markdown rendering layer and newly supported HTML output. And much more, which you can find in the release notes.

The main highlight is code intelligence and extraction. Kreuzberg now supports 248 languages through our tree-sitter-language-pack library. This is a step toward making Kreuzberg an engine for agents too: agents work with code repositories, review pull requests, index codebases, and analyze source files, and Kreuzberg can now parse code efficiently for them, either directly as a library or via MCP. It extracts functions, classes, imports, exports, symbols, and docstrings at the AST level, with code chunking that respects scope boundaries. 
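For intuition, here is what AST-level symbol extraction looks like. This is a minimal sketch using Python's standard-library `ast` module for illustration only; Kreuzberg's own extractor is tree-sitter based and covers 248 languages, so this shows the concept, not its API:

```python
# Sketch of AST-level symbol extraction (concept only, not Kreuzberg's API).
import ast

src = '''
import os

def greet(name):
    """Say hello."""
    return f"hello, {name}"

class Greeter:
    """A simple greeter."""
'''

symbols = []
for node in ast.parse(src).body:
    if isinstance(node, ast.Import):
        symbols.append(("import", node.names[0].name, None))
    elif isinstance(node, ast.FunctionDef):
        # functions come back with their name and docstring attached
        symbols.append(("function", node.name, ast.get_docstring(node)))
    elif isinstance(node, ast.ClassDef):
        symbols.append(("class", node.name, ast.get_docstring(node)))

print(symbols)
```

Working at this level (rather than on raw text) is what makes scope-aware chunking possible: a chunk boundary can be placed between complete definitions instead of mid-function.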

Regarding markdown quality, poor document extraction can lead to further issues down the pipeline. We created a benchmark harness using Structural F1 and Text F1 scoring across over 350 documents and 23 formats, then optimized based on that. LaTeX improved from 0% to 100% SF1. XLSX increased from 30% to 100%. PDF table SF1 went from 15.5% to 53.7%. All 23 formats are now at over 80% SF1. The output that pipelines receive is now structurally correct by default. 
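The post doesn't define Structural F1 exactly, but one plausible reading is ordinary F1 computed over the structural elements (headings, tables, list items) the extractor produces versus a ground-truth set. A hypothetical sketch of that reading:

```python
# Hypothetical "Structural F1": F1 over extracted structural elements vs.
# ground truth. The exact benchmark definition is Kreuzberg's, not shown here.
from collections import Counter

def structural_f1(predicted, reference):
    pred, ref = Counter(predicted), Counter(reference)
    true_pos = sum((pred & ref).values())  # multiset intersection
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(pred.values())
    recall = true_pos / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

ref = [("h1", "Intro"), ("table", 3), ("li", "a"), ("li", "b")]
pred = [("h1", "Intro"), ("table", 3), ("li", "a")]  # one list item missed
print(round(structural_f1(pred, ref), 3))  # → 0.857
```

Under this reading, "LaTeX improved from 0% to 100% SF1" would mean the extractor went from recovering essentially none of the document's structure to recovering all of it.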

Kreuzberg is now available as a document extraction backend for OpenWebUI (by popular request!), with options for docling-serve compatibility or direct connection.

In this release, we’ve added a unified architecture where every extractor produces a standard typed document representation. We also added the TOON wire format (a compact document encoding that reduces LLM prompt token usage by 30 to 50%), semantic chunk labeling, JSON output, strict configuration validation, and security improvements. GitHub: https://github.com/kreuzberg-dev/kreuzberg

Kreuzberg Cloud is coming soon. This will be the hosted version, for teams that want the same extraction quality without managing infrastructure. More here: https://kreuzberg.dev

Contributions are always very welcome

u/Eastern-Surround7763 — 13 hours ago
[OC] Life expectancy increased across all countries of the world between 1960 and 2020 -- an interactive d3 version of the slope plot
▲ 36 r/dataisbeautiful+1 crossposts

[OC] Life expectancy increased across all countries of the world between 1960 and 2020 -- an interactive d3 version of the slope plot

u/ikashnitsky — 12 hours ago
uvr update: R companion package, RStudio/Positron integration, and more based on your feedback
▲ 46 r/dataanalysis+1 crossposts

uvr update: R companion package, RStudio/Positron integration, and more based on your feedback

A few weeks ago I shared uvr (https://github.com/nbafrank/uvr), a fast R package and project manager written in Rust. The response was great, with a lot of specific, actionable feedback, and I've been focused on implementing features based on what you asked for. Here's what's new.

R companion package — use uvr without touching the terminal

This was the #1 request (thanks u/BothSinger886). Many R users — especially scientists — live in the console, not the terminal. Now you can manage your entire project from R:

# install.packages("pak")         # if pak isn't installed yet
pak::pak("nbafrank/uvr-r")

library(uvr)
uvr::init()                       # initialize the project
uvr::add("tidyverse")             # add a CRAN package
uvr::add("DESeq2", bioc = TRUE)   # add a Bioconductor package
uvr::sync()                       # install into the project library

Every uvr command has an R equivalent: `init()`, `add()`, `remove_pkgs()`, `sync()`, `lock()`, `run()`. If the CLI binary isn't installed, it prompts you to install it automatically. No terminal required.

Positron just works (RStudio is WIP)

`uvr init` and `uvr sync` now generate a `.Rprofile` that sets up the project library path automatically. Open your project in Positron and it picks up the right library — no configuration needed.

For Positron, uvr also writes `.vscode/settings.json` with the project's R interpreter path, so the correct R version appears in the IDE without manual setup.

Smarter error handling

- Typo protection: `uvr add tidyvese` (typo) used to write the bad name to your manifest before resolution failed. Now the manifest rolls back automatically on failure, so your `uvr.toml` stays clean.

- 4-component versions: packages like `data.table` (version `1.18.2.1`) now resolve correctly against version constraints. This was a subtle semver edge case that broke real workflows.
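To illustrate the edge case (uvr's actual resolver is Rust and not shown here): a strict three-part semver parser rejects `1.18.2.1` outright, but R package versions routinely have four or more components. Comparing zero-padded integer tuples handles any number of components:

```python
# Illustration of the 4-component version edge case, not uvr's resolver.
def parse_version(v):
    return tuple(int(part) for part in v.split("."))

def version_ge(a, b):
    pa, pb = parse_version(a), parse_version(b)
    width = max(len(pa), len(pb))
    pad = lambda t: t + (0,) * (width - len(t))  # "1.18.2" -> (1, 18, 2, 0)
    return pad(pa) >= pad(pb)

print(version_ge("1.18.2.1", "1.18.2"))  # True: 4th component beats padding
print(version_ge("1.18.2", "1.18.10"))   # False: numeric, not lexicographic
```

The second case is the other classic trap: comparing version strings lexically would put `1.18.2` above `1.18.10`.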

`uvr run --with` for one-off dependencies

Like `uv run --with` in Python. Need a package for a quick script without adding it to your project?

uvr run --with gt script.R

The package is installed to a temporary cache and available only for that run.

What's next

- Windows support: uvr now compiles and runs on Windows; full testing is in progress

- DESCRIPTION file support — use `DESCRIPTION` as an alternative manifest alongside `uvr.toml`

- Continued benchmarking and hardening

The full feature set: R version management, CRAN + Bioconductor + GitHub packages, P3M pre-built binaries, lockfile, dependency tree, `uvr doctor`, `uvr export` (to renv.lock), `uvr import` (from renv.lock), shell completions, self-update, and more.

Install in one line:

curl -fsSL https://raw.githubusercontent.com/nbafrank/uvr/main/install.sh | sh

Or from R:

pak::pak("nbafrank/uvr-r")
uvr::install_uvr()

GitHub: https://github.com/nbafrank/uvr

R package: https://github.com/nbafrank/uvr-r

Happy to answer questions. Your feedback last time shaped all of this — keep it coming.

Please try this out and test it yourself. I've been using Positron for all of this and it's been going well. RStudio integration seems more complex to me, so if anyone wants to help, please do!

u/nbafrank — 1 day ago
Improved markdown quality, code intelligence for 248 formats, and more in Kreuzberg v4.7.0
▲ 14 r/rstats


u/Eastern-Surround7763 — 12 hours ago
▲ 0 r/rstats

[Question]: LGCP/Point process forecasting methodology?

Has anyone worked on forecasting point processes before? I'm a bit stuck on whether this is the best way to do it with the tools I'm using.

Currently, since my estimation procedure for an LGCP (the stopp package in R) is not likelihood-based, there is no readily available posterior I can draw parameters from. The package does have functions to fit the model with covariates and to simulate from a log-Gaussian Cox process (LGCP) using covariates, though.

My current idea is parametric bootstrapping:

1. Fit my model to the original data.

2. Use the fitted parameters to simulate new data and refit the model.

3. Repeat this and store the parameter estimates.

4. Simulate from the assumed log-Gaussian Cox process (LGCP) using the list of parameter estimates and store the points.

5. Grid/voxelize my domain over the temporal and spatial forecast window and, for each cell, record whether a given simulation has a point in that cell, basically a presence indicator per simulation per cell.

6. Grab the "HPD" region: sort the cells by the mean presence indicator across simulations (since many simulations may have 0 events this will be below 1 and can be interpreted as a predicted probability), then collect cells until they add up to or above the chosen probability threshold.
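The grid-counting and region-collection steps are language-agnostic, so here is a minimal sketch in Python with made-up 1-D "simulations" (the real analysis would run in R on stopp output, and the normalization in the last step is one possible reading of the threshold rule):

```python
# Sketch of the per-cell presence probability and "HPD"-style region steps.
# Simulated point patterns are faked; each is just the set of occupied cells.
import random

random.seed(1)
n_sims, n_cells = 200, 10  # 1-D grid of 10 cells for simplicity

sims = [{random.randrange(n_cells) for _ in range(random.randrange(4))}
        for _ in range(n_sims)]

# mean presence indicator per cell = estimated P(at least one event in cell)
prob = [sum(cell in s for s in sims) / n_sims for cell in range(n_cells)]

# collect highest-probability cells until cumulative (normalized) mass
# reaches the threshold
threshold = 0.8
total = sum(prob)
region, mass = [], 0.0
for cell in sorted(range(n_cells), key=lambda c: -prob[c]):
    region.append(cell)
    mass += prob[cell] / total
    if mass >= threshold:
        break

print(region)
```

One thing to decide explicitly is whether the threshold applies to the normalized mass (as above) or to the raw per-cell probabilities, since the two give different region sizes.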

Maybe I am overlooking something, so any guidance would be helpful.

For those still reading: the goal is to use lightning strike event data to predict the most likely region in space and time for activity within a chosen forecast window (a country, within 6 to 24 hours). An LGCP was chosen because it can capture the clustering behavior of lightning. I have also found self-exciting models such as Hawkes processes to be a good contender for capturing the same clustering behavior, and I will explore them further.

u/Sinatio — 22 hours ago