Parsing bash/shell

I have been avoiding #629247 for quite a while. Not because I think we couldn’t use a better shell parser, but because I dreaded having to write the parser. Of course, #629247 blocks about 16 bugs and that number will only increase, so “someone” has to solve it eventually… Unfortunately, that “someone” is likely to be “me”.  So…

I managed to scribble down the following Perl snippet. It does a decent job of splitting lines into “words” (which may or may not contain spaces, newlines, quotes etc.). It currently tokenizes the contents of “<<EOF” constructs (heredocs) as if they were ordinary shell code. It also does not allow one to distinguish between “EOF” and “ EOF” (the former ends the heredoc, the latter doesn’t).
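To show why the “EOF” vs “ EOF” distinction matters, here is a minimal sketch of how the shell itself behaves (the variable name `body` is only for illustration). The line with a leading space is kept as heredoc content, and only the bare delimiter line ends the heredoc:

```shell
#!/bin/sh
# " EOF" (with a leading space) is part of the heredoc body;
# only a line consisting of exactly "EOF" terminates it.
body=$(cat <<EOF
first line
 EOF
last line
EOF
)

# Prints three lines: "first line", " EOF" and "last line".
printf '%s\n' "$body"
```

A tokenizer that strips leading whitespace before looking for the delimiter would therefore end the heredoc one line too early.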

Other defects include that it does not tokenize all operators (like “>&”).  Probably all I need is a list of them and of all the “special cases” (for example, “>&” can optionally take numbers on both sides, like “>&2” or “2>&1”).
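As a reminder of what these two forms do in the shell, here is a small sketch (the function `emit` and the variable names are hypothetical, made up for this example). It uses “>&” once with a number only on the right and once with numbers on both sides:

```shell
#!/bin/sh
# Hypothetical helper: writes one line to stdout and one to stderr.
emit() {
    echo "to stdout"
    echo "to stderr" >&2    # ">&" with a number on the right only
}

# "2>&1" has a number on both sides: fd 2 becomes a copy of fd 1,
# so the command substitution captures both lines.
both=$(emit 2>&1)

# With stderr sent elsewhere, only the stdout line is captured.
only_out=$(emit 2>/dev/null)

printf '%s\n' "$both"
```

In both cases the digits belong to the redirection operator rather than being words of their own, which is the part the tokenizer would have to learn.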

It does not always appear to terminate (I think EOF + unclosed quote triggers this).  If you try it out and notice something funny, please let me know.

You can also find an older version of it in bug #629247, along with the output it produced at that time (that version used " instead of - as the token marker).

#!/usr/bin/perl

use strict;
use warnings;

use Text::ParseWords qw(quotewords);
my $opregex;

{
    my $tmp = join( "|", map { quotemeta $_ } qw (&& || | ; ));
    # Match & but not >& or <&
    # - Actually, it should eventually match those, but not right now.
    $tmp .= '|(?<![\>\<])\&';
    $opregex = qr/$tmp/ox;
}
my @tokens = ();
my $lno;
while (my $line = <>) {
    chomp $line;
    next if $line =~ m/^\s*(?:\#|$)/;
    $lno = $. unless defined $lno;
    while ($line =~ s,\\$,,) {
        $line .= "\n" . <>;
        chomp $line;
    }
    $line =~ s/^\s++//;
    $line =~ s/\s++$//;
    # Ignore empty lines (again, via "$empty \ $empty"-constructs)
    next if $line =~ m/^\s*(?:\#|$)/;

    my @it = quotewords ($opregex, 'delimiters', $line);
    if (!@it) {
        # This happens if the line has unbalanced quotes, so pop another
        # line and redo the loop.
        $line .= "\n" . <>;
        redo;
    }

    foreach my $orig (@it) {
        my @l;
        $orig =~ s,",\\\\",g;
        @l = quotewords (qr/\s++/, 1, $orig);
        pop @l unless defined $l[-1] && $l[-1] ne '';
        shift @l if $l[0] eq '';
        push @tokens, map { s,\\\\",",g; $_ } @l;
    }
    print "Line $lno: -" . join ("- -", map { s/\n/\\n/g; $_ } @tokens ) . "-\n";
    @tokens = ();
    $lno = undef;
}

Here is a little example script and the “tokenization” of that script (no, the example script is not supposed to be useful).

$ cat test
#!/bin/sh

for p in *; do
    if [ -d "$p" ];then continue;elif
    [ -f "$p" ]
    then echo "$p is a file";fi
done
$ ./test.pl test
Line 3: -for- -p- -in- -*- -;- -do-
Line 4: -if- -[- --d- -"$p"- -]- -;- -then- -continue- -;- -elif-
Line 5: -[- --f- -"$p"- -]-
Line 6: -then- -echo- -"$p is a file"- -;- -fi-
Line 7: -done-
This entry was posted in Debian, Lintian.

6 Responses to Parsing bash/shell

  1. Pingback: Niels Thykier: Parsing bash/shell | Linux-Support.com

  2. Gentoo User says:

    Did you ever have a look at Gentoo’s libbash?

  3. Iñigo says:

    Hello,

    After some initial testing of the code, I see something funny about space usage (in heredocs and with command substitutions); see this example (I hope WordPress doesn’t escape this example too much, I will try to use pre and code tags just in case):

    $ cat test.sh 
    #!/bin/sh
    
    var1=value
    var2=$(value)
    var3=$( value )
    
    cat << EOF1
    hola $( echo mundo )
    EOF1
    
    cat <<EOF2
    foo $(command1 $(command2)) bar
    EOF2
    
    $ tokenparser.pl test.sh 
    Line 3: -var1=value-
    Line 4: -var2=$(value)-
    Line 5: -var3=$(- -value- -)-
    Line 7: -cat- -<<- -EOF1-
    Line 8: -hola- -$(- -echo- -mundo- -)-
    Line 9: -EOF1-
    Line 11: -cat- -<<EOF2-
    Line 12: -foo- -$(command1- -$(command2))- -bar-
    Line 13: -EOF2-
    $
    

    As I don’t know how it will render here, try with “$(cmd)” Vs “$( cmd )”, with and without spaces around the parenthesis, and try EOF near the redirection or with a space between the redirection and the delimiter.

    Good luck 🙂

    • Yeah, “$(cmd)” vs “$( cmd )” is one of the cases where I am not really sure what makes sense. Optimally, we would find the $(…)-part and then recurse into it (without splitting it first), but I am not entirely sure how to do that. As for “<<EOF” vs “<< EOF”, that is simply because I haven’t added the “<<” (and “>>”) operators.

      But maybe the Gentoo libbash thing will handle this more gracefully…
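For reference, the shell itself treats both spellings alike and handles nesting, so a tokenizer would ideally give the same result for all of them. A small sketch (the variable names are only for illustration):

```shell
#!/bin/sh
# Whitespace just inside the parentheses does not change the result.
a=$(echo mundo)
b=$( echo mundo )

# A nested substitution is expanded from the inside out.
c=$(echo foo $(echo bar) baz)

printf '%s\n' "$a" "$b" "$c"
```

Here `a` and `b` end up with the same value, which suggests treating “$(” … “)” as one unit first and only splitting its contents afterwards.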
