Regex lookahead

This arose from Tinderbox Training: Doctoral Dissertation Variable Hide in Text with RegEx - #3 by mwra but I’ll post the answer here to avoid thread-drift.

The need was to find embedded bracket-enclosed research codes, e.g. [RDA1], but not bracket-enclosed citation anchors, e.g. [@marshall1992]. The original regex pattern used in an .extractAll() call was this:

\[[^\[@]{1,50}\]

So:

  • match a literal opening square bracket [
  • match a character that is neither a literal opening square bracket [ nor an @-sign @ (the latter is what rules out citation anchors starting [@)
  • repeat that test for at least 1 and at most 50 characters
  • match a literal closing square bracket ]

As said, the code works, with no edge-case errors, so it is fine. Regex is precise, and tweaking that precision can be a time-sink, so if a pattern works, that’s often good enough. So what follows isn’t critique…
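To make the breakdown above concrete, here is a minimal sketch using Python's re module as a stand-in for Tinderbox's .extractAll(). That is an assumption: the regex engine Tinderbox uses may differ in details. The sample text is invented for illustration.

```python
import re

# The original pattern: "[" then 1 to 50 characters that are
# neither "[" nor "@", then "]".
ORIGINAL = re.compile(r"\[[^\[@]{1,50}\]")

# Hypothetical sample text (not from the original thread).
sample = "Coded as [RDA1] and [RDB2], as argued in [@marshall1992]."

original_codes = ORIGINAL.findall(sample)
print(original_codes)  # ['[RDA1]', '[RDB2]']
```

The citation anchor is skipped because its first character after the bracket is an @-sign, which the negated character class refuses.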

The {1,50} part caught my eye, as the literal character count smelled of a hard-coded fix. Indeed, chatting with @satikusala since, that was just so. In fact he’d already tried the method I’ll come to but had no joy, then got this fix; it worked, job done.

First, an assumption, since untested assumptions are a common way to derail regex. The codes are number/letter combos inside square brackets, so I assume the codes themselves don’t/can’t contain square brackets (this assumption was confirmed).

If so, we only have to extend a successful match until the next closing square bracket ]. One way to do this is with a ‘lookahead’ test, where the regex tests what lies beyond the current match without extending the matched selection. So our lookahead becomes “is the next character a closing square bracket?”, which is encoded (?=\]). This gives us a regex of:

\[[^\[@]+(?=\])\]

avoiding the need to worry how long the detected ‘code’ string is.
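As a rough illustration of the lookahead version, the sketch below again uses Python's re module in place of .extractAll() (an assumption, as before). The sample text and the 60-character code are made up to show the difference from the {1,50} pattern.

```python
import re

ORIGINAL = re.compile(r"\[[^\[@]{1,50}\]")
LOOKAHEAD = re.compile(r"\[[^\[@]+(?=\])\]")

# Hypothetical sample text (not from the original thread).
sample = "Coded as [RDA1] and [RDB2], as argued in [@marshall1992]."
lookahead_codes = LOOKAHEAD.findall(sample)
print(lookahead_codes)  # ['[RDA1]', '[RDB2]']

# A made-up code longer than 50 characters: the hard-coded limit
# misses it, the open-ended "+" version does not.
long_code = "[" + "X" * 60 + "]"
missed = ORIGINAL.findall(long_code)
found = LOOKAHEAD.findall(long_code)
print(len(missed), len(found))  # 0 1

# The lookahead only tests for "]" and does not consume it, so
# without the trailing \] the bracket stays out of the match.
peek = re.search(r"\[[^\[@]+(?=\])", "[RDA1]").group()
print(peek)  # [RDA1
```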

As ever, there is more than one way and a simpler method presents:

\[[^@][^\]]+\]

Now we check:

  • match a literal opening square bracket [
  • match one character, provided it is not an @-sign @
  • continue the match 1 or more times for as long as the tested character is not a literal closing square bracket ]
  • match a literal closing square bracket ]

Simpler, but the main point is that there is more than one way to do things. But note that with this pattern codes need to be two or more characters long: [^@] consumes one character and [^\]]+ requires at least one more.
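The two-character minimum is easy to see in a quick test. This is a sketch in Python's re module rather than Tinderbox itself, with invented sample text. The ONE_OR_MORE pattern is a hypothetical variant of my own, not from the original thread, for the case where single-character codes are needed.

```python
import re

SIMPLER = re.compile(r"\[[^@][^\]]+\]")

# Hypothetical sample text with a one-character code, [A].
sample = "Codes [RDA1] and [A], citation [@marshall1992]."

simpler_codes = SIMPLER.findall(sample)
print(simpler_codes)  # ['[RDA1]']

# Hypothetical variant: the first character may be neither "@" nor "]",
# and the remainder may be empty ("*" instead of "+").
ONE_OR_MORE = re.compile(r"\[[^@\]][^\]]*\]")
variant_codes = ONE_OR_MORE.findall(sample)
print(variant_codes)  # ['[RDA1]', '[A]']
```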
