u/colombiangary

▲ 3 r/rust

Can the regex crate beat Python's re?

Hello, I've been learning PyO3 to rewrite some Python functions in Rust.

I have one project that relies heavily on regex parsing, and I thought that switching to Rust could improve its performance.

Yet I haven't been able to beat Python's re (yes, I know it's written in C). Parsing a single line takes 2 microseconds with re and 3 microseconds with PyO3 plus Rust's regex.

What I have tried:

* Using OnceLock to emulate Python's re.compile. (As expected, this was important.)
* Returning a tuple instead of a struct. (No difference.)
* Adding Rayon and testing in batches. (No luck.)
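
For readers unfamiliar with the OnceLock approach from the first bullet, here is a minimal std-only sketch of the compile-once pattern. `CompiledPattern`, `pattern`, and `init_calls` are hypothetical names used for illustration; `CompiledPattern` stands in for a compiled `Regex`, and the counter exists only to show that the initializer runs a single time.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how many times the expensive initializer actually runs.
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a compiled `Regex`; building it is the costly step.
pub struct CompiledPattern {
    pub source: String,
}

/// Returns a shared, lazily built pattern, similar in spirit to a
/// module-level `re.compile(...)` in Python.
pub fn pattern() -> &'static CompiledPattern {
    static PATTERN: OnceLock<CompiledPattern> = OnceLock::new();
    PATTERN.get_or_init(|| {
        // This closure runs only on the first call, even across threads.
        INIT_CALLS.fetch_add(1, Ordering::SeqCst);
        CompiledPattern {
            source: String::from("<[0-9]+>"),
        }
    })
}

/// Number of times the initializer has run so far.
pub fn init_calls() -> usize {
    INIT_CALLS.load(Ordering::SeqCst)
}
```

Every call to `pattern()` after the first returns the same reference without redoing the setup work, which is why this step matters when the function is called once per log line.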

If you have any ideas, they would be appreciated.

Below is my Rust code. The Python side is just a pytest source file that imports the Rust library. I use pytest-benchmark for the benchmarking:

use pyo3::prelude::*;

/// A Python module implemented in Rust.
#[pymodule]
mod sysadmindb_rs {
    use pyo3::exceptions::PyValueError;
    use pyo3::prelude::*;
    use rayon::prelude::*;
    use regex::Regex;
    use std::sync::OnceLock;

    /// Compiles the syslog pattern once and reuses it on every call.
    fn log_pattern() -> &'static Regex {
        static RE: OnceLock<Regex> = OnceLock::new();
        RE.get_or_init(|| Regex::new(r"<(?<prival>[0-9]+)>(?<version>[0-9])?\s?(?<date>([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]+(Z|[+-][0-9]{2}:[0-9]{2})|\w{3}\s[0-9]{2}\s[0-9]{2}:[0-9]{2}:[0-9]{2}))\s(?<hostname>[\w.]+)\s(?<appname>[\w.]+)\s?\[?(?<procid>[0-9-]+)?\]?\:?\s?(?<msgid>(-|\w{2}[0-9]{2}))?\s?(?<structureddata>(\[.+\]|-))?\s?(BOM)?(?<msg>.+)?").unwrap())
    }

    #[pyclass]
    struct Log {
        #[pyo3(get)]
        version: Option<u32>,
        #[pyo3(get)]
        prival: u32,
        #[pyo3(get)]
        date: String,
        #[pyo3(get)]
        hostname: String,
        #[pyo3(get)]
        appname: String,
        #[pyo3(get)]
        procid: String,
        #[pyo3(get)]
        msgid: String,
        #[pyo3(get)]
        structureddata: String,
        #[pyo3(get)]
        msg: String,
    }

    #[pymethods]
    impl Log {
        #[new]
        fn new(line: &str) -> PyResult<Self> {
            parse_log(line).map_err(|_| PyValueError::new_err("Cannot parse"))
        }
    }

    fn parse_log(line: &str) -> Result<Log, String> {
        let Some(caps) = log_pattern().captures(line) else {
            return Err("sorry".to_string());
        };

        // Optional groups may not take part in a match, and indexing
        // `caps["name"]` panics in that case, so fall back to an empty string.
        let opt = |name: &str| {
            caps.name(name)
                .map(|m| m.as_str().to_owned())
                .unwrap_or_default()
        };

        Ok(Log {
            // `[0-9]+` can overflow a u32, so report an error instead of panicking.
            prival: caps["prival"]
                .parse()
                .map_err(|_| "invalid prival".to_string())?,
            // A single digit always fits in a u32, so this unwrap cannot fail.
            version: caps.name("version").map(|m| m.as_str().parse().unwrap()),
            date: caps["date"].to_owned(),
            hostname: caps["hostname"].to_owned(),
            appname: caps["appname"].to_owned(),
            procid: opt("procid"),
            msgid: opt("msgid"),
            structureddata: opt("structureddata"),
            msg: opt("msg"),
        })
    }
}
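
Since the gap is only about a microsecond per line, it may help to time the Rust parsing alone, without the Python call and object conversion, to see where the time goes. Below is a minimal std-only sketch of such a timing loop. `mean_time_per_call` is a hypothetical helper, not part of PyO3 or the regex crate, and a proper benchmark harness would give more reliable numbers.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `f` `iters` times and returns the mean wall-clock time per call.
/// `black_box` discourages the optimizer from discarding the result.
pub fn mean_time_per_call<T>(iters: u32, mut f: impl FnMut() -> T) -> Duration {
    assert!(iters > 0, "need at least one iteration");
    let start = Instant::now();
    for _ in 0..iters {
        black_box(f());
    }
    // `Duration` supports division by `u32`, giving the average per call.
    start.elapsed() / iters
}
```

Calling something like `mean_time_per_call(100_000, || parse_log(line))` from a plain Rust test would give a figure to compare against the 3 microseconds measured through pytest-benchmark.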

u/colombiangary — 22 hours ago