Regex woes - stripping out numbers in a complex title


(Gavin Rees) #1

Hi,

I am trying to use regex to strip some data out of a note name and can’t get it to work, although the same pattern does work in BBEdit. So I am probably just being plain dumb.

I have this as a note name:

I want to parse it and fill attributes with the page number and the URL, which links back to the article referenced by the Highlights app. This is the code I am trying to use in the rule field of the relevant prototype. It is supposed to pull out those references, but it does not work.

if($Name.contains("\[Page\s(\d+)\]\((.+)\)"){$Page=$1;$HighlightsURL=$2}

Is it because I am failing to escape characters for Tinderbox’s grep implementation, or have I just loused up the syntax in some way?

Any help on this would be really appreciated. It is the kind of thing that is probably simple to solve but could well take me days to work out otherwise!

all the best,

Gavin


(Mark Anderson) #2

Try this:

if($Name.contains("\[Page\s(\d+)\]\((.+)\)")){$Page=$1;$HighlightsURL=$2};

Note that you need to backslash-escape both the literal square brackets (which otherwise mark a regex character class) and the literal parentheses (which otherwise mark a capture group, the source of the $1 and $2 back-references). You also forgot the closing parenthesis for the if() command: you closed .contains() but not the enclosing if().

FWIW, you don’t have to use the \s for the space, you can use a literal space character if it helps figure out the regex. Thus this also works:

if($Name.contains("\[Page (\d+)\]\((.+)\)")){$Page=$1;$HighlightsURL=$2};
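If you want to check the pattern outside Tinderbox, here is a minimal sketch in Python. Python's `re` module is only a stand-in here, not the Boost engine Tinderbox uses, though this pattern sticks to constructs both should treat alike. The note name and the function name `parse_note_name` are invented for illustration.

```python
import re

# Same pattern as in the action code above. The escaped \[ \] and \( \)
# match literal brackets; the unescaped ( ) are the two capture groups.
PATTERN = re.compile(r"\[Page\s(\d+)\]\((.+)\)")


def parse_note_name(name):
    """Return (page, url) if name holds a '[Page N](url)' reference, else None."""
    match = PATTERN.search(name)
    if match is None:
        return None
    # group(1) and group(2) play the role of $1 and $2 in the action code.
    return match.group(1), match.group(2)


# Hypothetical note name, made up for this example.
sample = "A highlight [Page 12](highlights://example/doc?page=12)"
print(parse_note_name(sample))
```

Because `(.+)` is greedy, the URL capture runs up to the last closing parenthesis in the name, which matters if the title itself contains parentheses after the link.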

(Mark Anderson) #3

It’s probably worth noting that if you have note names with parentheses, semi-colons or forward slashes in them then using action-code based linking (as opposed to manually dragged linking) won’t work. This is because these characters aren’t internally escaped when processing the note’s title/path. It is a logged limitation, so doesn’t need reporting.

This limitation hits me hard in my research, where my note titles are often the names of academic papers, which use all sorts of characters. So, if I need to action-link, I store the ‘real’ name in a user string field, which I can use when exporting data (or even as a display expression in TB), and then bowdlerise the note name to remove unsafe characters. Note this issue doesn’t affect manual linking, only linking via action code (i.e. what you need once your TBX scales in size)…
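If the titles are cleaned before they reach Tinderbox, the bowdlerising step could look something like this rough Python sketch. It assumes the only unsafe characters are the ones named above (parentheses, semi-colons and forward slashes); the function name `safe_note_name` and the sample title are made up.

```python
import re

# Characters named above as breaking action-code linking:
# parentheses, semi-colons and forward slashes.
UNSAFE = re.compile(r"[();/]")


def safe_note_name(real_name, replacement=" "):
    """Return a copy of real_name with the unsafe characters replaced."""
    cleaned = UNSAFE.sub(replacement, real_name)
    # Collapse any runs of whitespace left behind and trim the ends.
    return re.sub(r"\s+", " ", cleaned).strip()


# Hypothetical paper title; the untouched original would be kept in a
# user string attribute and only the cleaned copy used as the note name.
real = "Hypertext (revisited); notes/links"
print(safe_note_name(real))
```

Keeping the original string alongside the cleaned one means nothing is lost for export or display.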


(Gavin Rees) #4

Thanks Mark,

That is helpful and kind. I may be missing a gene when it comes to spotting correct closures. As a matter of interest, which text editor are you using to display colours? When I paste your code into BBEdit, I don’t get the same kind of nesting. I am not a coder, so this could be because I have not turned on some option. That would make the checking easier.

And thanks too for your following comment on link behaviour. I am bringing in a file that is already parsed by BBEdit and so I can clean up in that application using a text factory.

all the best,
Gavin


(Mark Anderson) #5

The colouring comes from the forum’s Markdown formatting. Enclose a string in back-ticks to render it in basic ‘code’ form, i.e. monospaced font, background colour, no auto-substitution of ‘smart features’.

To get colouring of code, put three backticks on a separate line before and after the code string. The forum supports per-language colouring but none has been written for Tinderbox, though the default with no language works well, not least in picking out quoted strings.

BBEdit? Me too, great app! But I also don’t get any special mark-up: pasted-in action code defaults to ‘plain text’ (txt). I guess if you burrow into the Discourse forum software docs (are there any?) they might explain the logic for its default colouring. I think the general concept is copied across from the Stack Overflow model from which Discourse is derived.

FWIW, I think a Tinderbox-specific colouring ‘library’ would be hard. The app is c.18 years old and started with a simple action code where meaning was derived from parsing, i.e. no need for quoted strings, expression endings, etc. Whilst current complexity means newer additions to action code are more formalised, most legacy (original) forms of code are supported. The most detailed reference is likely my aTbRef but even in there you’ll see grey areas and inconsistencies - from a programmer’s perspective (…I think, as I’m not one).

FWIW, in terms of regex, if you use a regex tester or reference that follows Perl rules you’re pretty much in the right place. Internally, Tinderbox uses the Boost Regex library whose documentation is here.

For a skilled programmer, the TBX file is ‘just’ XML (see here for the element structure, or my take on it from black-box deconstruction). So, you could do substitutions directly in the XML, with care and on a backup. I’d only caution against being too creative with note $Text. Links (in the link table in the XML) work off offsets in the plain-text version of $Text, whilst the RTFD copy of the text holds a richer version (images, RTF links, etc.).


(Gavin Rees) #6

Thanks Mark,
That saves me from wading through the BBEdit manual and getting even more puzzled. Interesting too to get a sense of the obstacles to making a Tinderbox-specific colouring library. Life is complicated under the bonnet.

G.