Tinderbox Forum

Extract Anchor and URL from pasted HTML

Is there a way to automatically parse anchors and URLs from pasted HTML in $Text? When I look at the HTML I see the text, but not the URLs. The URLs are clearly there because you can click on them. I would like to be able to extract anchor and URL pairs and then use them as I see fit.

No, these don’t exist as Tinderbox web links and are only in the RTF layer as part of the ‘Smart URL’ detection process.

When Smart URLs first became possible, via the Apple frameworks used in v6+, the intention was that such links would be adopted as true Tinderbox (web) links, but there were issues that are, as yet, still to be resolved.

I suspect AppleScript (not tried, and my expertise is rather rusty) probably ought to be able to do this. From a quick Google, I found this: https://apple.stackexchange.com/questions/41277/in-a-service-how-to-get-a-url-from-rich-text. One solution also references TextSoap which does this (albeit via an internal AppleScript!).


I was thinking textutil in runCommand should easily convert $Text to html from which the anchor and link could be extracted, something like this:

$MyString = runCommand("textutil -convert html -stdin -stdout",$Text)

Alas, that gives html, but drops the url and leaves just the anchor.

If you select the text in the note, copy to the clipboard and run the following AppleScript in Script Editor then you do get the full html, revealing the url.

set the clipboard to (the clipboard as «class RTF »)
set theHTML to do shell script "pbpaste -Prefer rtf | textutil -convert html -stdin -stdout"
-- the result is HTML from which the url and anchor can be extracted

Thanks! I tried this, but I am getting an error: error “Can’t make some data into the expected type.” number -1700 to item. Any idea what might be wrong?

I’m not sure. With the AppleScript, one necessary (unwanted) step, of course, is making sure to select the rich text with the link and then typing command-c to copy to clipboard before running the script. In my simple test (a short note with a link formed by copy-pasting from the Eastgate site) the results were as expected: html like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Content-Style-Type" content="text/css">
<title></title>
<meta name="Generator" content="Cocoa HTML Writer">
<meta name="CocoaVersion" content="2022.3">
<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 16.0px Helvetica}
span.s1 {font-kerning: none}
span.s2 {font-kerning: none; color: #fb5a08}
</style>
</head>
<body>
<p class="p1"><span class="s1">Whether you’re plotting your next thriller or writing your dissertation, designing a course, managing a legal practice, coordinating a campaign or planning a season of orchestral concerts, <a href="http://www.eastgate.com/Tinderbox/updates/Tinderbox88.html"><span class="s2">Tinderbox 8.9</span></a> will be your personal information assistant</span></p>
</body>
</html>

You can see the anchor and url in that.
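Once HTML like that is in hand, pulling out the pairs is mechanical. Here is a minimal sketch in Python (rather than AppleScript, so it can be tried anywhere) using only the standard library's html.parser. The class and function names are made up for illustration, and the ANCHOR::URL separator is just one possible output format.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (anchor text, href) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []      # finished (anchor, url) tuples
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears while an <a href> is open.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_pairs(html):
    """Return a list of 'ANCHOR::URL' strings found in the HTML."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return ["{}::{}".format(anchor, url) for anchor, url in parser.pairs]


# Shortened version of the paragraph from the HTML above.
sample = ('<p class="p1"><span class="s1">planning a season, '
          '<a href="http://www.eastgate.com/Tinderbox/updates/Tinderbox88.html">'
          '<span class="s2">Tinderbox 8.9</span></a> will be your assistant</span></p>')
print("\n".join(extract_pairs(sample)))
```

Nested spans inside the link are handled because the text is gathered until the closing </a>, whatever tags sit in between.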

The action code above yields this html (unfortunately there is no url to grab; not sure why):

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <meta http-equiv="Content-Style-Type" content="text/css">
  <title></title>
  <meta name="Generator" content="Cocoa HTML Writer">
  <meta name="CocoaVersion" content="2022.3">
  <style type="text/css">
    p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px 'Helvetica Light'}
    p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px 'Helvetica Light'; min-height: 14.0px}
  </style>
</head>
<body>
<p class="p1">Whether you’re plotting your next thriller or writing your dissertation, designing a course, managing a legal practice, coordinating a campaign or planning a season of orchestral concerts, Tinderbox 8.9 will be your personal information assistant..</p>
<p class="p2"><br></p>
</body>
</html>

FWIW, experimenting with Tinderbox’s AppleScript methods shows they do not expose the RTF layer of $Text, only the plain text. But the desired link info exists only in the RTF version of $Text. I imagine other AppleScript approaches might be able to drive the UI to essentially scrape the RTF from the $Text area of the text pane and use that.

It’s tough working with RTF in plain AppleScript. Generally the only practical way (for all but the most expert) is to go through the clipboard.

FWIW, here’s a quick and dirty way I found to extract urls from a note using Automator. There are two separate images here. I selected the text in the note (top) and copied to the clipboard with command-c before running the workflow (bottom).

The shell script action has this:

osascript -e 'the clipboard as «class RTF »' | perl -ne 'print chr foreach unpack("C*", pack("H*",substr($_,11,-3)))' | textutil -stdin -stdout -convert html -format rtf

It seems that it should be possible to put the perl and textutil parts of this into the first argument of runCommand and STDIN the RTF with $Text in the second argument. But I couldn’t figure out how to escape the perl so that it wouldn’t throw an error.
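For anyone puzzling over the perl step: judging by the offsets it uses, osascript prints the clipboard's RTF as text shaped like «data RTF 7B5C…», and the perl strips that wrapper and turns the hex digits back into bytes. A rough Python equivalent of just that decoding step is sketched below. The function name is made up, and the exact wrapper format is an assumption inferred from the one-liner, not something checked against every macOS version.

```python
import re


def decode_rtf_data(osascript_output):
    """Return the raw RTF bytes from text shaped like «data RTF 7B5C...»."""
    match = re.search(r"«data RTF ([0-9A-Fa-f]+)»", osascript_output)
    if match is None:
        raise ValueError("no «data RTF ...» payload found")
    # Each pair of hex digits is one byte of the original RTF.
    return bytes.fromhex(match.group(1))


sample = "«data RTF 7B5C727466317D»\n"   # hex for the six bytes of {\rtf1}
print(decode_rtf_data(sample))
```

The decoded bytes are what the one-liner then pipes into textutil for conversion to html.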


Thanks @sumnerg. One more question. Do you know if it would be possible to parse the result and create key pairs of the anchor and the URL, e.g. ANCHOR::URL?

The Automator action is quick to implement. But, alas, there seems to be no easy way to get the anchor text.

However, to my surprise, I managed to wrangle AppleScript to do the job by adapting scripts shared online.

Script here
-- adapted from https://www.macscripter.net/viewtopic.php?pid=182034, https://macscripter.net/viewtopic.php?id=46657

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- get any rich texts off the clipboard
set pb to current application's NSPasteboard's generalPasteboard()
set theRichTexts to (pb's readObjectsForClasses:{current application's NSAttributedString} options:(missing value)) as list

if (count of theRichTexts) = 0 then
	display dialog "No rich text found on the clipboard" buttons {"OK"} default button 1
	error number -128
end if

set theRichText to (item 1 of theRichTexts)

-- get length so we can start from the end
set start to (theRichText's |length|()) - 1

-- make plain string copy to work on
set theString to theRichText's |string|()'s mutableCopy()

set output to return

repeat while start ≥ 0
	set {aURL, theRange} to theRichText's attribute:(current application's NSLinkAttributeName) atIndex:start effectiveRange:(reference)
	if aURL is not missing value then
		-- get linked text
		set anchorText to theString's substringWithRange:theRange
		if aURL's |scheme|()'s isEqualToString:"mailto" then -- email address
			set newLink to aURL's resourceSpecifier()
		else if anchorText's containsString:"This Site" then -- resource specifier, remove //
			set newLink to aURL's resourceSpecifier()'s substringFromIndex:2
		else -- full URL
			set newLink to aURL's absoluteString()
		end if
		set output to ((output & anchorText as text) & "::" & newLink as text) & return
	end if
	set start to (location of theRange) - 1 -- step to the last character before this range, so one-character runs are not skipped
end repeat

return output -- to view in Script Editor Result pane
And shorter, cleaned up script here
-- adapted from https://www.macscripter.net/viewtopic.php?pid=182034, https://macscripter.net/viewtopic.php?id=46657
-- copy rich text to clipboard and run

use framework "Foundation"

-- get any rich texts off the clipboard
set pb to current application's NSPasteboard's generalPasteboard()
set theRichTexts to (pb's readObjectsForClasses:{current application's NSAttributedString} options:(missing value)) as list

set theRichText to (item 1 of theRichTexts)
set start to (theRichText's |length|()) - 1 -- will work from end backwards
set theString to theRichText's |string|()'s mutableCopy() -- plain string copy to work on
set output to ""

repeat while start ≥ 0
	set {aLink, theRange} to theRichText's attribute:(current application's NSLinkAttributeName) atIndex:start effectiveRange:(reference)
	
	if aLink is not missing value then
		set anchorText to theString's substringWithRange:theRange
		set urlText to aLink's absoluteString()
		set output to (anchorText as text) & "::" & (urlText as text) & return & output
	end if
	
	set start to (location of theRange) - 1 -- step to the last character before this range, so one-character runs are not skipped
	
end repeat

return output -- to view in Script Editor Result pane

Here the output simply goes to the Result pane in the format suggested for copy-pasting. It could, of course, be delimited in other ways and automated to set values of attribute(s) in Tinderbox.


Ok, this works perfectly. Thank you for the education. :pray:

Now, to finish this off: does anyone know how to trigger an AppleScript from TBX? I can use runCommand to copy $Text to the clipboard. I then want the AppleScript to run and paste the results back into $Text, or possibly into a new note (have the AppleScript create a new note). Then I can run an explode to parse the results.

Ah, so much fun to have later.

If extracting links from multiple notes, then selecting them and using File > Export > As Text > RTF > Selected Notes seems to be the way to go. Then run this script and choose the file that Tinderbox creates from the export.

Script here
-- adapted from https://www.macscripter.net/viewtopic.php?pid=182034, https://macscripter.net/viewtopic.php?id=46657

use framework "Foundation"
use scripting additions

set thePath to POSIX path of (choose file)
set urlPath to current application's NSURL's fileURLWithPath:thePath

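set {attString, theError} to current application's NSAttributedString's alloc()'s initWithURL:urlPath options:(missing value) documentAttributes:(missing value) |error|:(reference)
-- stop with the reported error if the file could not be read as rich text
if attString is missing value then error (theError's localizedDescription() as text)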

set start to (attString's |length|()) - 1 -- will work from end backwards
set theString to attString's |string|()'s mutableCopy() -- plain string copy to work on

set output to ""

repeat while start ≥ 0
	set {aLink, theRange} to attString's attribute:(current application's NSLinkAttributeName) atIndex:start effectiveRange:(reference)
	if aLink is not missing value then
		set anchorText to theString's substringWithRange:theRange
		set urlText to aLink's absoluteString()
		set output to (anchorText as text) & "::" & (urlText as text) & return & output
	end if
	set start to (location of theRange) - 1 -- step to the last character before this range, so one-character runs are not skipped
end repeat

return output -- to view in Script Editor Result pane

I’ve noticed that the RTF “layer” (or whatever it is called) is somehow “disturbed” in any note where a “wikilink/ziplink/text link” is added. The colors (if any) of the text pasted from the web all shift when one is added, the external links in the text are no longer clickable, and of course these scripts can no longer find any hyperlinks. Not sure if that is expected behavior.

Why not enable the Script menu in the OS menu bar and use scripts that target the current TBX document’s selection? Essentially it is like using a stamp, albeit called from outside the app. That seems less hassle than tinkering with runCommand().


It is possible to trigger an AppleScript from within Tinderbox, and even pass an argument, by calling osascript -e via runCommand from, say, a stamp. From my (very) old notes:

However, escaping AppleScript for the command line is a daunting task. No single quotes, for example. There are similar problems with the perl one-liner above (which I thought might make it possible to pass rich text to textutil for conversion to html, which could then be parsed within Tinderbox).
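To show what the shell-level quoting involves, here is a small Python sketch that builds an osascript command line with one -e option per script line and lets the standard library's shlex.quote do the escaping. The function name is made up for illustration. This only covers quoting for the shell; whether the resulting string also survives the quoting rules of Tinderbox action code inside runCommand() is a separate question that it does not answer.

```python
import shlex


def osascript_command(script_lines):
    """Build an `osascript` command line with one -e option per script line."""
    parts = ["osascript"]
    for line in script_lines:
        parts += ["-e", line]
    # shlex.quote wraps each argument so the shell passes it through unchanged,
    # including any single or double quotes it contains.
    return " ".join(shlex.quote(part) for part in parts)


lines = ['display dialog "It\'s done"']
command = osascript_command(lines)
print(command)
```

The single quote inside the AppleScript line is the awkward case: it cannot appear inside a single-quoted shell argument, so the quoting has to close the quote, add an escaped quote, and reopen it.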

Now that Tinderbox has external scripting support, making it efficient to get values in and out (with the notable exception of rich text) I suggest just launching a script outside of Tinderbox in the usual ways (the run button in Script Editor, a menu pick after placing the script in the Script menu, or with a keyboard shortcut after placing the script in an Automator Service, a.k.a. Quick Action).


This is perfect!


Would be great if we could set $Text to rich text via AppleScript.

I’d like to e.g. get parts of a PDF as attributed string and create a note from that.

@eastgate Would this be possible?


Set $Text to rich text and get rich text from $Text into a script.

And not have $Text “disturbed” by addition of a text link (per above) so that links can’t be extracted.

I suspect it is complicated, per @eastgate :

Not actually an RTF writing space. But the internal format happens to be stored as RTFD

But it would be nice. Especially the ability to add text links in a note without the existing links “disappearing” from the built-in export to RTF.


We’ll take a look!


@Pete This is going in the opposite direction from what you describe, but I have figured out how to get an attributed string from a Tinderbox note via AppleScript (thus retaining formatting and any embedded links). The results are easy to see in Script Debugger.

The script below retrieves the rtfd for the selected note from the tbx xml, decodes it, and converts the rtf part into an attributed string.

Perhaps going the other way (attributed string to Tinderbox) could be done by base64 encoding it in AppleScript and writing that to the rtfd. I suspect one would be living dangerously if one tried to write from a script directly to the xml of a document open in Tinderbox, though, as the script demonstrates, it is not too hard to read the rtfd from the xml file using XQuery. Perhaps @eastgate could consider exposing rtfd to AppleScript as an “attribute” whose value can be read and set.

-- select a Tinderbox note and run

use framework "Foundation"
use scripting additions

tell front document of application "Tinderbox 8"
	set strTBX to read (its file as alias) as «class utf8»
	set theIDs to my doXQuery("for $i in //item return string($i/@ID)", strTBX)
	set encodedRtfds to my doXQuery("for $i in //item return string($i/rtfd)", strTBX)
	tell selection 1 -- the selected note
		set noteID to value of attribute "ID"
		set encodedRtfd to my getValWithKey(noteID as text, theIDs, encodedRtfds) as text
	end tell
end tell

set strRtf to my getStringRtfFromEncodedRtfd(encodedRtfd)
set attrStr to makeAttributedStringFromStringRtf(strRtf)

--~~~ handlers/subroutines ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

to doXQuery(strXQuery, strXML)
	-- XQuery handler adapted from post by Rob Trew
	set {xmlError, xqueryError} to {reference, reference} -- holders to report errors
	set {docXML, xmlError} to (current application's NSXMLDocument's alloc()'s ¬
		initWithXMLString:strXML options:0 |error|:xmlError) -- parse XML
	if xmlError is not missing value then return (localizedDescription of xmlError) as string
	set {xs, xqueryError} to (docXML's objectsForXQuery:strXQuery |error|:xqueryError) -- apply XQuery
	if xqueryError is not missing value then return (localizedDescription of xqueryError) as string
	return xs as list -- values retrieved by the XQuery over the XML
end doXQuery

to getStringRtfFromEncodedRtfd(encodedRtfd)
	set decodedStr to do shell script " echo '" & encodedRtfd & "' | base64 -d"
	-- Remove visible and invisible characters outside the surrounding {}
	set startPos to offset of "{" in decodedStr
	set endPos to offset of "}" in (reverse of characters of decodedStr as string)
	set strRtf to text startPos thru -endPos of decodedStr
end getStringRtfFromEncodedRtfd

to makeAttributedStringFromStringRtf(strRtf)
	set ca to current application
	set s to ca's NSString's stringWithString:strRtf -- the string
	set d to (s)'s dataUsingEncoding:(ca's NSUTF8StringEncoding) -- the data
	set attStr to ca's NSAttributedString's alloc()'s initWithRTF:d documentAttributes:(missing value)
	if attStr is missing value then error "String not recognized as RTF"
	return attStr
end makeAttributedStringFromStringRtf

on getValWithKey(aKey, aKeysList, aValuesList)
	set ca to current application
	set theDict to ca's NSDictionary's dictionaryWithObjects:aValuesList forKeys:aKeysList
	set theResult to theDict's objectForKey:aKey
	set tempArray to ca's NSArray's arrayWithArray:{theResult}
	return item 1 of (tempArray as list)
end getValWithKey
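
For comparison, the read-only part of this (find the item by ID, base64-decode its rtfd, trim to the outer braces) can be sketched in Python with the standard library alone. It assumes, as the XQuery above does, that each note is an <item> element with an ID attribute and an <rtfd> child holding base64 text. The function name and the <tinderbox> root element in the sample are stand-ins for illustration, and the sketch stops at raw RTF bytes rather than building an attributed string.

```python
import base64
import xml.etree.ElementTree as ET


def rtf_for_item(tbx_xml, note_id):
    """Return the RTF bytes stored for the item with the given ID, or None."""
    root = ET.fromstring(tbx_xml)
    for item in root.iter("item"):
        if item.get("ID") != note_id:
            continue
        encoded = item.findtext("rtfd")
        if not encoded:
            return None
        decoded = base64.b64decode(encoded)
        # Keep only the span from the first "{" to the last "}", as the
        # AppleScript handler does, to drop bytes around the RTF itself.
        start, end = decoded.find(b"{"), decoded.rfind(b"}")
        if start == -1 or end < start:
            return None
        return decoded[start:end + 1]
    return None


# Synthetic stand-in for a document; a real file has far more structure.
payload = base64.b64encode(b"\x00junk{\\rtf1 hello}tail").decode("ascii")
doc = '<tinderbox><item ID="42"><rtfd>{}</rtfd></item></tinderbox>'.format(payload)
print(rtf_for_item(doc, "42"))
```

Like the AppleScript, this only reads a saved file, so it sees whatever was last written to disk rather than unsaved edits.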

Apologies if I’ve misread the aim here, but this seems to be going against the flow. Tinderbox (IIRC for v9) is now ‘adopting’ links created or pasted in the RTFD layer of text. I mention the RTFD aspect because Tinderbox links are defined against the plain text† layer, so using AppleScript to define a link in the RTFD layer makes little sense when it could more effectively be defined in the plain text layer. Rich text style, necessarily, must be defined in the RTFD layer. But I see only downside in defining the link in the RTFD layer and then expecting Tinderbox to pick up the pieces.

†. Tinderbox stores links in a <links> linkbase discrete from the text, but link anchors are defined by character offsets in the plain text <text> of the note rather than the styled <rtfd> version. For those not aware, Tinderbox stores both plain and styled text.

[edits for clarity]

The aim is definitely misunderstood. :grinning: This isn’t about “defining” a link in rtfd or asking Tinderbox to “pick up the pieces.” It’s a demonstration that making the rtfd accessible to external scripts can be useful (at least as read-only, if setting the value of the rtfd turns out to be too problematic).

The aim is to get styled text in and out of Tinderbox via AppleScript without going through the clipboard, either manually (as in the first examples above) or through easily broken “gui” scripting.

This script demonstrates one way to get styled text out (including any “smart” links that happen to be in the rtfd). It’s not too hard. But it would be much easier if the rtfd could be read directly (like reading the value of an attribute) rather than having to resort to XQuery.

Going the other way, getting styled text into Tinderbox, as @Pete has requested above, may be more difficult. My thought was that perhaps a scripter could base64 encode and set the rtfd to that, but @eastgate would have to comment on that.

BTW, I do not believe it is entirely correct to say that native Tinderbox links are “defined in the plain text layer.” In the xml of my current version (8.9.2) I see them defined in <links>, with the position of the anchors specified in terms of offsets (sstart and slen) in the plain text. That may seem like quibbling. But the distinction is hugely significant from a scripting point of view. Unlike smart links embedded in the rtfd, it’s not that easy via a script to get text links out of Tinderbox anchored in their proper place in the text, though that has been demonstrated in the forum using R reading the xml.
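
To make the offset idea concrete, here is a tiny Python sketch of recovering an anchor from a start offset and a length, the way sstart and slen locate it in the plain text. The function name is made up, and it assumes the offsets count characters the same way Python indexes a string, which may not hold for text containing characters outside the Basic Multilingual Plane.

```python
def anchor_text(plain_text, sstart, slen):
    """Return the substring a link anchor covers, or None if out of range."""
    if sstart < 0 or slen < 0 or sstart + slen > len(plain_text):
        return None
    return plain_text[sstart:sstart + slen]


text = "See Tinderbox 8.9 for details."
print(anchor_text(text, 4, 13))
```

The extra work for a script is that the offsets live in <links>, separate from the text they point into, so the two have to be read and matched up.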

Anyway, would not making the rtfd more easily accessible to external scripts be a good idea, at least read-only?