@ievev Have you ever seen an implementation like @bablr/regex? https://github.com/bablr-lang/regex-vm It's an NFA system so it isn't going to be winning any awards for throughput, but in this particular case it does seem to completely avoid the complexity blowup. It will run your heap out of memory though on really big inputs.
The strategy this engine uses is just to evolve the state as a function of time. A match can be successfully completed, yet not be emitted, because some other match could still supersede it by being longer or more leftmost.
I tried the pattern /d+s+/g on 10,000,000 digits followed by no space. It took 4 seconds to return no results. I tried it on 20,000,000 digits followed by no space. It took 8 seconds to return no results. I tried it on 100,000,000 and I ran out of heap space.
Returning no results is going to be linear in any DFA or NFA based implementation, though. You go character by character, and confirm that there are no matches.
It's only when you return multiple matches that the engines have a problem and become superlinear.
Do any RE implementations do anything like the query planning that databases do, or the rewrites that compilers do, that can replace the RE with a different sequence of REs or string searches that might execute faster?
For example, in the expression you gave (I'm assuming, based on your description of the test data, that /d+s+/ means the same as /\d+\s+/ in the RE engines I've used), any match must contain a digit followed by a space.
A scan for all places where a digit is immediately followed by whitespace, with each such place then checked for the length of the whitespace run starting there and the length of the digit run ending there, should take linear time and constant heap space.
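A minimal sketch of that scan (assuming /\d+\s+/ semantics and the usual leftmost, non-overlapping match reporting; the function name is mine):

```python
def find_digit_space_runs(s):
    """Find all maximal digit-run + whitespace-run spans (what /\\d+\\s+/
    would report) in one linear pass with constant extra space."""
    matches = []
    i, n = 0, len(s)
    while i < n - 1:
        # Look for a boundary: a digit immediately followed by whitespace.
        if s[i].isdigit() and s[i + 1].isspace():
            # Walk left to the start of the digit run.
            start = i
            while start > 0 and s[start - 1].isdigit():
                start -= 1
            # Walk right to the end of the whitespace run.
            end = i + 1
            while end < n and s[end].isspace():
                end += 1
            matches.append((start, end))
            i = end  # runs can't overlap, so skip past this match
        else:
            i += 1
    return matches
```

On a haystack of digits followed by no space it never finds a boundary, so it stays linear and allocates nothing.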
The reason I was looking was to do query planning for declarative pattern-finding in atomic structures (hierarchical labelled graphs, I suppose), although I'm slowly realising just how insanely hard it might be to get it to work efficiently!
(Note that there is a small mistake in the paper due to ambiguity, found by Vladimir Gapeyev. So the result does not hold in the generality stated but only for a special case when there is no ambiguity at "the end". There went my first PhD student publication...)
The two-pass technique used to be implemented in the Scala compiler at the time (building DFAs upfront), which could do regexps over lists and other sequences, but the approach would not work for top-down tree regexps, so I did not pursue that and it got ripped out later.
It is good to see regular expression derivatives (Brzozowski) and "position automata" used and discussed.
Restricting regex features to guarantee time complexity works, but it requires sacrificing potentially useful features like backtracking (or in the article's case, constraining oneself to fixed-upper-bound-length needles).
In a real-world deployment where you want to run any arbitrary regex in an idiot/malice-proof manner, the best solution is the same solution you'd use for running any other kind of untrusted code - sandbox it! A good regex API should limit its execution time and memory consumption and return a timeout error in case those limits are exceeded. Ideally, those parameters would be configurable at the API level. Unfortunately, the only regex libraries I know of that get this right are .NET's standard library Regex API and the third-party regex package in Python.
I find it weird to have the Perl innovation (?:...) be called "traditional regex". Perl was rather innovative back then, even if that was more than 30 years ago now. Traditional regex is what came before it (grep -E being the most advanced form). I wonder what counts as nontraditional in the author's eyes.
> nearly everything that matters in practice: where the matches are, how long they are, and how many there are
I would say that regexes that matter in practice, e.g. when digging through logs, have clear boundaries that curb the pathological backtracking behavior. In particular, I find it difficult to imagine a practical need to find all matches of an expression like /.*a|b/, as shown in the article. Realistically you'd have to handle /\b.*a|b\b/, or similar, because realistically when you need all matches, you don't want intersecting matches. This means you want to proceed past the end of the n-th match to look for the (n+1)-th match, and never want to use indeterminate prefixes like /.*a/.
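For what it's worth, that proceed-past-the-end behavior is exactly what leftmost-first engines already give you; a quick illustration with the article's pattern in Python's re:

```python
import re

# re.finditer reports non-overlapping matches: after reporting "bbbbba",
# the search resumes at the end of that match, so the overlapping suffix
# matches (bbbba, bbba, bba, ...) are never reported.
matches = [m.group() for m in re.finditer(r".*a|b", "bbbbbabbbbb")]
```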
This OTOH gives a reasonably useful heuristic if your regexp comes from an untrusted source and could be adversarial. Check that it does not start with a prefix with a Kleene star, like /a*/. Require at least one positive match (in each alternate branch). Of course, /a+b|c/ would still be quadratic if your text is long sequences of "a" interspersed with characters other than "b". But this, again, is more of a theoretical case, to my mind.
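A rough illustration of that heuristic (deliberately naive: it ignores groups, escapes, and character classes, and `looks_adversarial` is a made-up name):

```python
def looks_adversarial(pattern: str) -> bool:
    """Flag patterns where some alternate branch starts with a starred
    (possibly-empty) prefix, e.g. /.*a|b/ or /a*b/."""
    for branch in pattern.split("|"):
        # "x*..." at the start of a branch can match the empty string at
        # every position of the haystack, which is what blows up the
        # all-matches case.
        if len(branch) >= 2 and branch[1] == "*":
            return True
    return False
```

A real implementation would have to walk the parsed pattern rather than the raw string, but the check itself stays cheap.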
I want to say threats don't only come from inputs gathered over the internet.
There are many reasons to exploit things. One example is local privilege escalation. If your service has high privileges and somehow someone can edit an input source for it (like some file it reads that's accessible to the user, or even by tricking the service into looking at the wrong file), it will still be a useful vector.
Now this might seem far-fetched, but a lot of exploits I've seen actually do this type of stuff.
For example, you find a program which gathers some debug or support info package, and touches a directory which is user accessible. The user puts some kind of link or tricky file in there and boom, service compromised.
I would only not use hardened mode if the regex is actually embedded directly into the program, because that would at least require the program itself to be touched before it breaks (which would already require the same level of privileges as the program runs with).
So, long story short: be aware that if your program touches local resources that don't match its own privilege level, like some log locations, tmp, etc., be sure that stuff does not get turned into regex, or use the hardened mode to prevent problems.
It's not always about users providing some input via a webpage or some online service that causes something to break.
- The Austin Group has recently accepted lazy quantifiers for inclusion into the next release of POSIX[1]. They think they have worked out a reasonable declarative definition for what they should mean. I am less than sure of that, but either way dismissing the whole thing as irredeemably tied to backtracking seems inappropriate.
- Once again the generalization in the title is AFAIK largely correct for industrial engines, but incorrect—arguably to the point of being misleading—for academic work. Just looking into the “parsing” subfolder of my papers stash reveals a 1998 paper[2] on linear-time maximal-munch tokenization, so at the very least the problem was recognized, and IIRC there’s a bunch of related work around that paper too.
- It is true that you can’t stream the haystack in the general case, but to what precise extent you can is an interesting question with a known algorithmic answer[3].
I wonder how gracefully redgrep handles this. This tool hasn't been talked about since the year of its release. If I recall correctly, though, it doesn't handle some abstruse regexes the way conventional tools do.
I would argue that hardened mode should be the default though, similar to how SipHash is the default hashing function in Rust hash maps. Faster mode should be opt-in if the user is confident that the supplied data is nonmalicious and they need the speedup.
Hmm. I kinda wonder how something like a regexp engine based on Marpa would fare. It would probably be slower in the common well-behaved cases, as well as in preprocessing. But I wonder if there are any cases that couldn't be done in O(n) if you limit yourself to only one match (instead of finding all possible matches, which it actually can do). Marpa definitely handles most degenerate regexp cases linearly (once you make the match generation not emit overlapping entries), but there may be some regexp feature it can't handle?
Bringing a fully general CFG parser to parse regexps would be like hunting mosquitos with a nuke though..
I always thought that finite automata can go through the string in O(n), with some perks like multiple simultaneous states (i.e. when finding [ab](b)+a in abba you get a first state at character 1 and a second first state at the 2nd character). Although I understand that regex syntax can be complicated and you can really hit O(n^2) with it, maybe the easier regexes could be compiled the easier way...
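That multi-state idea can be sketched directly. This is a hand-rolled one-pass simulation of /[ab]b+a/ (the state encoding and names are mine) that keeps a set of live (start, state) pairs, so it can report all matches, including overlapping ones:

```python
def match_spans(text):
    """All (start, end) spans matching [ab]b+a, in one left-to-right pass.

    State 0: consumed the initial [ab], now need at least one b.
    State 1: inside b+, may close the match on an 'a'.
    """
    active = set()   # live (start, state) pairs
    found = []
    for i, c in enumerate(text):
        next_active = set()
        if c in "ab":
            next_active.add((i, 0))          # a new attempt starts here
        for start, state in active:
            if state == 0 and c == "b":
                next_active.add((start, 1))  # entered b+
            elif state == 1:
                if c == "b":
                    next_active.add((start, 1))   # extend b+
                if c == "a":
                    found.append((start, i + 1))  # complete match
            # anything else dies silently
        active = next_active
    return sorted(found)
```

The per-character work is bounded by the number of live pairs, which is where the blowup the thread discusses can creep back in once you track every match start.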
> i think i'll rest for a bit after this. i can only do 80-hour weeks for so long
Jesus Christ, 80 hours?! I really hope the author seriously takes a proper break! I mean, they seem to be riding that incredible high that comes from having a breakthrough in deeply understanding a really tough problem after thinking about it for too long, so I kind of get it, but that is also all the more reason to take good care of the precious brain that now stores all that knowledge, before it burns out!
The original Kleene star regex was invented to model neural networks. Have you tried throwing a transformer at the problem /s? Also O(n²), but at least you get hardware acceleration ¯\_(ツ)_/¯
Test setup: https://gist.github.com/conartist6/051838025af1e04d966e03aa9...
https://memgraph.com/docs/querying/query-plan
"Compiling regular expressions to sequential machines" (2005), ACM Symposium on Applied Computing: https://dl.acm.org/doi/10.1145/1066677.1066992
> constraining oneself to fixed-upper-bound-length needles
wait! you haven't reached the important part of the post yet
> even if it's more than 30 years ago now
Here's your answer.
> Traditional regex is what came before it
No, that's "ancient regex".
Ah, there is a post with more detail about RE# and discussion here recently that I must have missed: https://news.ycombinator.com/item?id=47206647
[1] https://www.austingroupbugs.net/view.php?id=793, https://www.austingroupbugs.net/view.php?id=1329, https://www.austingroupbugs.net/view.php?id=1857, see also the mailing list.
[2] Reps (1998), ACM TOPLAS 20(2), 259–273, https://dl.acm.org/doi/10.1145/276393.276394, https://research.cs.wisc.edu/wpis/papers/toplas98b.pdf.
[3] Grathwohl, Henglein, Rasmussen (2014), ICTAC ’14, LNCS 8687, 224–240, https://link.springer.com/chapter/10.1007/978-3-319-10882-7_..., https://utr.dk/pubs/files/grathwohl2014-0-paper.pdf.
> no capture groups
> [...]
> no lazy quantifiers - .*?

Here are the caveats.
And so running a regex engine on the matches seems like it would get you back to O(regexlen * haystacklen * matchcount) or roughly O(mn²) again.
https://github.com/google/redgrep
> the problem we're talking about in this post (finding all longest matches without quadratic blowup)

Wait, what? I thought this was about finding all matches. With a minor tweak to the opening example: we want to match (.*a | b) against "bbbbbabbbbb". I want to detect each "b" individually, and I also want to detect "bbbbba", "bbbba", "bbba", "bba", "ba", and "a". That's what it means to find all matches.
Here's Kleene's Representation of Events in Nerve Nets and Finite Automata:
https://www.rand.org/content/dam/rand/pubs/research_memorand...