PWC 259 › Bank Holidays and Line Parser

This post is part of a series on Mohammad Anwar’s excellent Weekly Challenge, where hackers submit solutions in Perl, Raku, or any other language, to two different challenges every week. (It’s a lot of fun, if you’re into that sort of thing.)

This week’s tasks have a little bit more meat on their bones, which I quite enjoyed. They are user-submitted, as well, and it’s always fun to see what people come up with when submitting tasks.

Contents

Task 1 – Banking Day Offset

This task comes from the mind of Lee Johnson. Here, we’re given the following inputs:

Number of days [$offset]
Start date [$start_date]
List of dates which are holidays (optional) [@holidays]

From that, we’re supposed to return the date that is $offset working days from the $start_date, ignoring weekends and @holidays. This is straightforward. I opted to use the core Perl module Time::Piece to get the day of the week. There are about n + 1 different ways to do that, though.

The function starts with some initialization:

sub bank_holiday_ofs {
    my ($start_date, $offset, @holidays) = @_;

    my $t = Time::Piece->strptime($start_date => $date_fmt) - 86400;

    my %holiday = map { $_ => 1 } @holidays;
    $offset++; # Account for today

The %holiday hash simply maps the dates to a true value so $holiday{$date} will be true iff $date is in @holidays. The $offset variable gets an extra kick to account for the current day.

From here, we loop until $offset is zero:

    while ($offset) {
        $t += 86400; # Advance day
        $offset-- unless $t->wday == 1 or $t->wday == 7 
                      or $holiday{ $t->strftime($date_fmt) };
    }

From years of frequent use, 86,400 has sufficient semantic meaning as “1 day” that I’m not bothered about the naughty magic number. If that’s not your style, Time::Seconds has a ONE_DAY constant.

The loop is simple. Add a day, and then subtract a day from $offset unless it’s a weekend or holiday.

ch-1.pl source code

Task 2 – Line Parser

The second task, from Gabor Szabo, gives us a particular text record format that we must parse into a Perl hash. For example:

{%  id   field1="value1"    field2="value2"  field3=42 %}

Should become:

{
       name => id,
       fields => {
           field1 => value1,
           field2 => value2,
           field3 => value3,
       }
}

We are also required to handle escaped quotes (\") and (optionally) multiline tags. I wrote a pure-Perl parser that handles all of this. It’s easiest to think of in two parts:

Top level line parser

This is the part that takes in a line of text and decides what to do with it. First, let’s define our $Open and $Close tags:

    my ($O, $C) = (qr/^\s*\{\%\s*/, qr/\s*\%\}\s*$/); # Tokens gobble whitespace

I could have just done a simple {% and %} set, but I wanted to allow optional whitespace. A recurring theme you’ll see with my solution (and my parsers in general) is that I tend to be permissive with inputs, but precise with outputs.

Now I loop through each line of input (in this case, the __DATA__ block):

  for (<DATA>) {
      chomp;
  
      if ($id) {
          if (/${O}end$id${C}/) {
              $id{$id}{text} = @text > 1 ? [ @text ] : $text[0];
              @text = ();
              $id = undef;
          } else {
              push @text, $_
          }
      } 
      elsif (/${O}(?<id>\w+)\s+(?<fields>.+?)${C}/) {
          die "No end token found for <$id>" if $id and @text;
          $id = $+{id};
          die "Duplicate id <$id>" if exists $id{$id};
          $id{$id} = { name => $id, fields => parse_fields($+{fields}) };
      }
      else {
          die "Invalid line: <$_>";
      }
  }

The way this loop works is, if $id is defined, we’ve already seen an open {% tag and we’re expecting either a line of text, or the {% endid %} closing tag, so we look for those and handle them accordingly.

Otherwise, we expect to see a single line {% id key=value, ... %} record or the start of a multi-line record, so we look for that. We pass the key/value portion of the record to the parse_fields() sub, which we’ll look at next.

parse_fields(): The key/value (kv) parser

It might seem like parsing keys and values would be the easy part, and if not for Gabor’s requirement to handle escaped quotes, it might have been. There are several ways I could have tackled this, from tricky eval()s to full-blown grammars, but for my purposes here (and because I felt like it), I decided to implement a simple state machine.

Finite state machines can be described by a directed graph whose vertices contain the possible states the system can be in, and edges are the possible state transitions. To parse the kv pairs as described by this task, the following state machine will do the trick:

State diagram for parsing key/value pairs.

Note: To avoid a cluttered diagram, I’ve omitted arrows on most states that point to themselves, except for field_name and out. It so happens that every state in this particular system can have itself as the next state.

We effectively start from the out state, meaning we are outside of a key/value pair and are waiting for the start of the next key name. Once we see a word character (\w), we go to the field_name state and stay there until we’ve gobbled up all of the \w characters, and then we look for an equal sign. We similarly trundle through the states to find the start of the value (value_start), the value itself, then a comma, and back to out for the next field. We can stop at any time.

One way to implement simple state machines is by simply having a $state variable that you set to the name of the state you’re in, and an if ... elsif ... chain to handle the state transitions. For more complicated, or dynamic state machines, you might reach for one of the many CPAN modules for finite state machines (FSM). But I wanted to show you how it can be done without any help.

My parse_fields() function iterates over the input string character by character. First, we have some top level variables to keep track of:

      my %fields;
      my $state = 'out'; # Outside of KV pair
      my $backslash = 0; # Substate for whether we're backslashed
      my $name = undef;  # Field name
      my $value = undef; # Field value
      my $expected_closing_quote; # If defined, value must end with this

Our output will be %fields. The $backslash variable actually captures a parallel state of whether the last character was a backslash (1), whether the current character was escaped (2), or whether neither of those things is true (0). I might have used named values instead, but this made sufficient sense to me.

I decided values could be optionally quoted, by single ('), double (") or nothing, but that the closing quote had to match the opening quote. So that’s what $expected_closing_quote keeps track of.

Now, here’s how we handle backslashes:

          $backslash = 0 if $backslash == 2;
          if ($backslash) {
              $_ = eval "\$_"; # safe
              $backslash = 2;
          } elsif (/\\/) {
              $backslash = 1;
              next;
          }

I do use eval here, but you’ll note it’s done in a safe manner, as I only ever pass two characters to eval, and the first character is a backslash. I could have built up my own hash of slash characters, but that would be error prone and require updating with the language.

If the current character is a backslash, we just set $backslash and go to the next character. If the previous character was a backslash, we do the eval to get the unescaped character. We then set $backslash to 2, which is used when we’re looking for quotes, to avoid ending the value on an escaped quote.

Here’s what a typical state handler looks like:

          # Handle the value, with optional quotes and escape sequences
          elsif ($state eq 'value_start') {
              next if /\s/;
              if (/['"]/ and not $backslash) {
                  $expected_closing_quote = $_;
                  $state = 'value';
                  $value = '';
                  next;
              }
              $value = $_;
              $state = 'value';
          }

That is the value state. You’ll see that has the logic I just talked about; if we see our $expected_closing_quote and it wasn’t $backslashed, or we see whitespace and the value is not quoted at all, we have reached the end of the value. So we set $fields{$name} = $value (that’s now part of our return value), set our next state to comma, and go to the next character.

On the other hand, if none of those conditions were true, that means we’re still inside the value, so we append the character to $value and continue. (The value state points to itself.)

I won’t show all of the states here, as the above should give you the flavor of it, but of course you can see my full solution here.

All in all, it’s decently robust for a PWC task solution, but a production version could certainly use some better error checking and reporting. Invalid inputs are mostly handled fairly gracefully, but the resulting output might be confusing.

ch-2.pl source code

Task 1 – Banking Day Offset

Task 2 – Line Parser

Top level line parser

parse_fields(): The key/value (kv) parser

Leave a Reply Cancel reply