Perl regex help

| | Comments (18) | TrackBacks (0)
Dear Lazyweb, please put me out of my misery.

I have a string that looks something like:

a b c foo {0 1 2} fred {3 4 5} {{5 4 3} {2 1 0}} quux
I want to output:
a
b
c
foo
{0 1 2}
fred
{3 4 5}
{{5 4 3} {2 1 0}}
quux

(actually I want each line to be an array element, but that's obviously easy)

Can I do this with a Perl regex? Originally I thought I was going to have to walk the string like I would in C, but some playing around (and local help) got me to:

my $match = qr/(?:\s*[^\{\}\s]+\s*|\{(?R)\})/;

while ($cur =~ m/($match)\s+(.*)/o) {
        print $1, "\n";
        $cur = $2;
}

which sort of gets me there, except I get
{0 1 2} fred {3 4 5} {{5 4 3} {2 1 0}}
as one line instead of 4.
What obvious thing am I missing here?

0 TrackBacks

Listed below are links to blogs that reference this entry: Perl regex help.

TrackBack URL for this entry: http://the.earth.li/~noodles/cgi-bin/mt/mt-tb.cgi/213

18 Comments

Perl regexp is being greedy. Look into the non-greedy switch.

See Text::Balanced in the CPAN.

You can't use a regular expression to match well parenthesized documents, you need a grammar for that.

You must like pain a lot.

My brain exploded when I tried to read the documentation of (?R) in perlre(1).

Checking multiple nested parentheses is not possible with regular expressions (== finite state automata).

For example, see:
http://mail.python.org/pipermail/python-list/2002-July/156662.html

The obvious thing you're missing is the contents of the perlre(1) manpage. Search for the (??{ }) operator, and it has an example prefixed by "The following pattern matches a parenthesized group..."

Adapted to your case:

my $inside = qr/(?:[^\{\}]|\{(??{$inside})\})+/;
my $match = qr/\w+|\{$inside\}/;

while ($cur =~ m/($match)\s*(.*)/o) {
print $1, "\n";
$cur = $2;
}

what you actually need is LARL parser. regular expressions, even when as insanely extended as perls, are just not designed for nested parenthesis matching. I don't know if you have CS studies background, if you don't, having a good look at regular languages and context-free languages might clear your mind.

You will get slightly better results using qr/(?:\s*[^\{\}\s]+?\s*|\{(?R)\})/ (that makes the [^\{\}\s]+ non-greedy, so that it doesn't try to find the longest, but the shortest match).
It still won't be correct, simply because real regular expressions cannot do what you want (which is a result from theoretical CS). Perl 5.10 regexes can actually do what you want (by using recursion), but it's neither easy to understand, nor something that is very portable. Implementing a normal char-by-char parser should be less trouble, especially if you are used to doing it.

Marc

Noodles,

At least with common regexes it's demonstrable that you cannot parse nested structures. Anyway, it's easily doable with a recursive parser which consumes the tokens as are produced by a tokenizer. This seems to work well with your example

my $buffer = "";
sub get_token {
	$buffer = > unless ($buffer);
	return undef unless(defined $buffer);
	$buffer =~ s/^\s*([{}]|[^\s{}]+)\s*//;
	return $1;
}

sub parser {
my @res;
my $tok;
while(defined($tok = get_token())) {
if($tok eq "{") {
push @res, parser();
} elsif($tok eq "}") {
return \@res;
} else {
push @res, $tok;
}
}
return \@res;
}

use Data::Dumper; print Dumper parser();

Your $match does not match multiple blocks inside of a block. Simplification really helps spotting such mistakes.

This should do what you want:

print "$_\n" for grep {defined} m{
  (   \{ (?0)* \}  # block
    | [^{}\s]+     # value
  ) | \s+          # seperator
}xgo;

Though a regex really isn't the best solution for this kind of thing, it is fast and easy (to write).

Languages with nested brackets like yours are generated by context-free grammars, not regular grammars (Type 2 as opposed to Type 3 in the Chomsky Hierarchy). As a result a regular expression cannot parse that. Perl 5.10 introduces "recursive patterns". Such expressions are not regular expressions as defined in formal language theory but can solve your problem :).

Nope. Regexps don't handle nested parentheses. See http://www.perlmonks.org/?node_id=308039 for some other ideas.

What Marc Brockschmidt already said. To everyone repeatedly pointing out that regular expressions cannot parse that - whilst you are certainly correct, Perl "regular expressions" are not regular (in a Computer Science sense) and certainly can parse nested brackets.

$ echo $string | perl -pe 's/\s*({{.*?}}|{.*?}|\S+)\s*/$1\n/g'

The interpolation is not doing the right thing. It makes (?R) to refer to the whole regular expression you match, not just the one you compile to $match. However, when I tried to manually interpolate replacing the (?R) with (?1), it returned different incorrect result.

Not a Perl, nor a regex solution, but it seemed simpler here using Tcl lists:

$ cat print-list.tcl
#!/usr/bin/tclsh
foreach x [gets stdin] { puts [list $x] }

$ echo 'a b c foo {0 1 2} fred {3 4 5} {{5 4 3} {2 1 0}} quux' | ./print-list.tcl
a
b
c
foo
{0 1 2}
fred
{3 4 5}
{{5 4 3} {2 1 0}}
quux

$ echo "a  b c foo {0  1 2} fred  {3 4 5} {{5 4 3} {2 1 0}} quux" | perl -lne 'my $match = qr/(?:([^\{\}\s]+)|\{(?1)+\})\s*/;
while (m/($match)(.*)/o) {
        print $1;
        $_ = $3;
}
'
a  
b 
c 
foo 
{0  1 2} 
fred  
{3 4 5} 
{{5 4 3} {2 1 0}} 
quux
$

Leave a comment