Can the regex crate beat Python's re?
Hello, I've been learning PyO3 to rewrite some Python functions in Rust.
I have one project that relies heavily on regex parsing, and I thought that switching to Rust could bring improvements.
Yet I haven't been able to beat Python's re module (yes, I know it's written in C). Parsing a single line takes 2 microseconds with re and 3 microseconds with PyO3 plus Rust's regex crate.
What I have tried:
* Using OnceLock to emulate Python's re.compile. (As expected, this was important.)
* Returning a tuple instead of a struct. (No difference.)
* Adding Rayon and testing in batches. (No luck.)
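To make the first item concrete: the point of OnceLock is that the expensive value is built on first use only and then shared by every later call. Here is a minimal std-only sketch of that behaviour. `build_expensive`, `cached` and `init_calls` are hypothetical names used for illustration, and the Vec merely stands in for a compiled Regex.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how many times the expensive initializer actually runs.
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for `Regex::new(...)`: any value that is costly to build.
fn build_expensive() -> Vec<u32> {
    INIT_CALLS.fetch_add(1, Ordering::SeqCst);
    (0..1_000).collect()
}

/// Returns a shared reference to the value, building it on the first call only.
fn cached() -> &'static Vec<u32> {
    static CELL: OnceLock<Vec<u32>> = OnceLock::new();
    CELL.get_or_init(build_expensive)
}

/// Number of times the initializer has run so far.
fn init_calls() -> usize {
    INIT_CALLS.load(Ordering::SeqCst)
}
```

Calling `cached()` repeatedly should hand back the same reference each time, while `init_calls()` stays at 1.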
If you have any ideas, they would be appreciated.
Below is my Rust code. The Python side is just a pytest source file that imports the Rust library. I use pytest-benchmark for the benchmarking:
use pyo3::prelude::*;

/// A Python module implemented in Rust.
#[pymodule]
mod sysadmindb_rs {
    use pyo3::exceptions::PyValueError;
    use pyo3::prelude::*;
    use rayon::prelude::*;
    use regex::Regex;
    use std::sync::OnceLock;

    /// Compiles the syslog pattern on first use and reuses it on every later call.
    fn log_pattern() -> &'static Regex {
        static RE: OnceLock<Regex> = OnceLock::new();
        RE.get_or_init(|| Regex::new(r"<(?<prival>[0-9]+)>(?<version>[0-9])?\s?(?<date>([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]+(Z|[+-][0-9]{2}:[0-9]{2})|\w{3}\s[0-9]{2}\s[0-9]{2}:[0-9]{2}:[0-9]{2}))\s(?<hostname>[\w.]+)\s(?<appname>[\w.]+)\s?\[?(?<procid>[0-9-]+)?\]?\:?\s?(?<msgid>(-|\w{2}[0-9]{2}))?\s?(?<structureddata>(\[.+\]|-))?\s?(BOM)?(?<msg>.+)?").unwrap())
    }

    /// One parsed syslog line, exposed to Python with read-only attributes.
    #[pyclass]
    struct Log {
        #[pyo3(get)]
        version: Option<u32>,
        #[pyo3(get)]
        prival: u32,
        #[pyo3(get)]
        date: String,
        #[pyo3(get)]
        hostname: String,
        #[pyo3(get)]
        appname: String,
        #[pyo3(get)]
        procid: String,
        #[pyo3(get)]
        msgid: String,
        #[pyo3(get)]
        structureddata: String,
        #[pyo3(get)]
        msg: String,
    }

    #[pymethods]
    impl Log {
        /// Python constructor: raises ValueError when the line does not parse.
        #[new]
        fn new(line: &str) -> PyResult<Self> {
            parse_log(line).map_err(|_| PyValueError::new_err("Cannot parse"))
        }
    }

    /// Parses one syslog line into a `Log`, or returns an error message.
    fn parse_log(line: &str) -> Result<Log, String> {
        let Some(caps) = log_pattern().captures(line) else {
            return Err("sorry".to_string());
        };
        // Optional groups may not take part in a match. Indexing `caps["name"]`
        // panics in that case, so fall back to an empty string instead.
        let text = |name: &str| {
            caps.name(name)
                .map(|m| m.as_str().to_owned())
                .unwrap_or_default()
        };
        Ok(Log {
            // `[0-9]+` can overflow u32, so report the failure instead of panicking.
            prival: caps["prival"]
                .parse::<u32>()
                .map_err(|e| format!("invalid prival: {e}"))?,
            // `version` is a single ASCII digit, so this parse cannot fail.
            version: caps.name("version").map(|m| m.as_str().parse().unwrap()),
            // These three groups are mandatory, so indexing is safe here.
            date: caps["date"].to_owned(),
            hostname: caps["hostname"].to_owned(),
            appname: caps["appname"].to_owned(),
            procid: text("procid"),
            msgid: text("msgid"),
            structureddata: text("structureddata"),
            msg: text("msg"),
        })
    }
}
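One way to tell how much of the 3 microseconds is spent in the regex itself, rather than in crossing the Python/Rust boundary and building the return object, might be to time the parsing function on the Rust side alone. Below is a minimal std-only sketch. `time_per_call` is a hypothetical helper, and `fake_parse` is only a placeholder that would be replaced by `parse_log` in a real measurement.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper: runs `f` on `input` `iters` times and
/// returns the mean wall-clock time per call.
fn time_per_call<F, T>(iters: u32, input: &str, mut f: F) -> Duration
where
    F: FnMut(&str) -> T,
{
    assert!(iters > 0, "iters must be non-zero");
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from discarding the call or its result.
        black_box(f(black_box(input)));
    }
    start.elapsed() / iters
}

/// Placeholder for `parse_log`; swap in the real function when measuring.
fn fake_parse(line: &str) -> usize {
    line.split_whitespace().count()
}
```

If the figure reported for the real `parse_log` is well under the 3 microseconds seen through pytest-benchmark, the remaining time is likely spent in the call overhead and in allocating the nine fields, not in matching.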