Tinderbox Forum

Extract Anchor and URL from pasted HTML

Is there a way to automatically parse anchors and URLs from pasted HTML in $Text? When I look at the HTML I see the text, but not the URLs. The URLs are clearly there because you can click on them. I would like to extract anchor and URL pairs and then use them as I see fit.

No, these don’t exist as Tinderbox web links and are only in the RTF layer as part of the ‘Smart URL’ detection process.

When Smart URLs first became possible, via the Apple frameworks used in v6+, the intention was that such links would be adopted as true Tinderbox (web) links, but there were issues that are as yet unresolved.

I suspect AppleScript (not tried, and my expertise is rather rusty) probably ought to be able to do this. From a quick Google, I found this: https://apple.stackexchange.com/questions/41277/in-a-service-how-to-get-a-url-from-rich-text. One solution also references TextSoap which does this (albeit via an internal AppleScript!).


I was thinking textutil in runCommand should easily convert $Text to html from which the anchor and link could be extracted, something like this:

$MyString = runCommand("textutil -convert html -stdin -stdout",$Text)

Alas, that gives html, but drops the url and leaves just the anchor.

If you select the text in the note, copy to the clipboard and run the following AppleScript in Script Editor then you do get the full html, revealing the url.

set the clipboard to (the clipboard as «class RTF »)
set theHTML to do shell script "pbpaste -Prefer rtf | textutil -convert html -stdin -stdout"
-- the result is HTML from which the url and anchor can be extracted

Thanks! I tried this, but I am getting this error: error “Can’t make some data into the expected type.” number -1700 to item. Any idea what might be wrong?

I’m not sure. With the AppleScript, one necessary (unwanted) step, of course, is making sure to select the rich text with the link and then typing command-c to copy to clipboard before running the script. In my simple test (a short note with a link formed by copy-pasting from the Eastgate site) the results were as expected: html like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <meta http-equiv="Content-Style-Type" content="text/css">
  <title></title>
  <meta name="Generator" content="Cocoa HTML Writer">
  <meta name="CocoaVersion" content="2022.3">
  <style type="text/css">
    p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 16.0px Helvetica}
    span.s1 {font-kerning: none}
    span.s2 {font-kerning: none; color: #fb5a08}
  </style>
</head>
<body>
<p class="p1"><span class="s1">Whether you’re plotting your next thriller or writing your dissertation, designing a course, managing a legal practice, coordinating a campaign or planning a season of orchestral concerts, <a href="http://www.eastgate.com/Tinderbox/updates/Tinderbox88.html"><span class="s2">Tinderbox 8.9</span></a> will be your personal information assistant</span></p>
</body>
</html>

You can see the anchor and url in that.
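
Once HTML like that is in hand, pulling out the pairs is mechanical. Here is a minimal sketch in Python using only the standard library; it is not something the thread's scripts do, just one way to post-process their HTML. The names `AnchorCollector` and `extract_links` are made up for illustration, and the sample string is an abbreviated copy of the paragraph above.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (anchor text, href) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pairs = []      # finished (anchor, url) tuples
        self._href = None    # href of the <a> currently open, if any
        self._chunks = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace so the anchor stays on one line.
            anchor = " ".join("".join(self._chunks).split())
            self.pairs.append((anchor, self._href))
            self._href = None


def extract_links(html):
    """Return a list of (anchor, url) tuples found in an HTML string."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.pairs


# Abbreviated copy of the paragraph from the HTML above.
sample = (
    '<p class="p1"><span class="s1">coordinating a campaign, '
    '<a href="http://www.eastgate.com/Tinderbox/updates/Tinderbox88.html">'
    '<span class="s2">Tinderbox 8.9</span></a> will be your assistant</span></p>'
)
for anchor, url in extract_links(sample):
    print(f"{anchor}::{url}")
```

The nested span inside the link is why the text is gathered between the opening and closing `a` tags rather than read from a single node.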

The action code above (the runCommand call to textutil) yields this HTML. Unfortunately there is no URL to grab; I am not sure why:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <meta http-equiv="Content-Style-Type" content="text/css">
  <title></title>
  <meta name="Generator" content="Cocoa HTML Writer">
  <meta name="CocoaVersion" content="2022.3">
  <style type="text/css">
    p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px 'Helvetica Light'}
    p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px 'Helvetica Light'; min-height: 14.0px}
  </style>
</head>
<body>
<p class="p1">Whether you’re plotting your next thriller or writing your dissertation, designing a course, managing a legal practice, coordinating a campaign or planning a season of orchestral concerts, Tinderbox 8.9 will be your personal information assistant..</p>
<p class="p2"><br></p>
</body>
</html>

FWIW, experimenting with Tinderbox’s AppleScript methods shows they do not expose the RTF layer of $Text, only the plain text. But the desired link info exists only in the RTF version of $Text. I imagine other AppleScript approaches might be able to drive the UI to essentially scrape the RTF from the text area of the $Text pane and use that.

It’s tough working with RTF in plain AppleScript. Generally the only practical way (for all but the most expert) is to go through the clipboard.

FWIW, here’s a quick and dirty way I found to extract urls from a note using Automator. There are two separate images here. I selected the text in the note (top) and copied to the clipboard with command-c before running the workflow (bottom).

The shell script action has this:

osascript -e 'the clipboard as «class RTF »' | perl -ne 'print chr foreach unpack("C*", pack("H*",substr($_,11,-3)))' | textutil -stdin -stdout -convert html -format rtf

It seems that it should be possible to put the perl and textutil parts of this into the first argument of runCommand and STDIN the RTF with $Text in the second argument. But I couldn’t figure out how to escape the perl so that it wouldn’t throw an error.
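
For anyone puzzling over what the perl stage does: osascript prints the clipboard's RTF as a hex dump wrapped in `«data RTF …»`, and the one-liner strips the wrapper and turns the hex back into bytes. The sketch below shows the same decoding step in Python. It is only an illustration of that one stage, the function name `decode_rtf_clipboard_dump` is made up, and the example payload is a tiny hand-made stand-in rather than real clipboard data.

```python
import re


def decode_rtf_clipboard_dump(dump):
    """Return the raw RTF bytes hidden in osascript's «data RTF …» output."""
    # Match the hex payload instead of relying on fixed byte offsets.
    match = re.search(r"«data RTF\s*([0-9A-Fa-f]+)»", dump)
    if match is None:
        raise ValueError("no «data RTF …» payload found")
    return bytes.fromhex(match.group(1))


# Hand-made stand-in: the hex digits spell out the bytes of {\rtf1}
example = "«data RTF 7B5C727466317D»\n"
print(decode_rtf_clipboard_dump(example))
```

The decoded bytes are what the textutil stage then receives on standard input.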


Thanks @sumnerg. One more question. Do you know if it would be able to parse and create a keypair of the anchor and the url, e.g. ANCHOR::URL?

The Automator action is quick to implement. But, alas, there seems to be no easy way to get the anchor text.

However, to my surprise, I managed to wrangle AppleScript to do the job by adapting scripts shared online.

Script here
-- adapted from https://www.macscripter.net/viewtopic.php?pid=182034, https://macscripter.net/viewtopic.php?id=46657

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- get any rich texts off the clipboard
set pb to current application's NSPasteboard's generalPasteboard()
set theRichTexts to (pb's readObjectsForClasses:{current application's NSAttributedString} options:(missing value)) as list

if (count of theRichTexts) = 0 then
	display dialog "No rich text found on the clipboard" buttons {"OK"} default button 1
	error number -128
end if

set theRichText to (item 1 of theRichTexts)

-- get length so we can start from the end
set start to (theRichText's |length|()) - 1

-- make plain string copy to work on
set theString to theRichText's |string|()'s mutableCopy()

set output to return

repeat while start ≥ 0
	set {aURL, theRange} to theRichText's attribute:(current application's NSLinkAttributeName) atIndex:start effectiveRange:(reference)
	if aURL is not missing value then
		-- get linked text
		set anchorText to theString's substringWithRange:theRange
		if aURL's |scheme|()'s isEqualToString:"mailto" then -- email address
			set newLink to aURL's resourceSpecifier()
		else if anchorText's containsString:"This Site" then -- resource specifier, remove //
			set newLink to aURL's resourceSpecifier()'s substringFromIndex:2
		else -- full URL
			set newLink to aURL's absoluteString()
		end if
		set output to ((output & anchorText as text) & "::" & newLink as text) & return
	end if
	set start to (location of theRange) - 2
end repeat

return output -- to view in Script Editor Result pane
And shorter, cleaned up script here
-- adapted from https://www.macscripter.net/viewtopic.php?pid=182034, https://macscripter.net/viewtopic.php?id=46657
-- copy rich text to clipboard and run

use framework "Foundation"

-- get any rich texts off the clipboard
set pb to current application's NSPasteboard's generalPasteboard()
set theRichTexts to (pb's readObjectsForClasses:{current application's NSAttributedString} options:(missing value)) as list

set theRichText to (item 1 of theRichTexts)
set start to (theRichText's |length|()) - 1 -- will work from end backwards
set theString to theRichText's |string|()'s mutableCopy() -- plain string copy to work on
set output to ""

repeat while start ≥ 0
	set {aLink, theRange} to theRichText's attribute:(current application's NSLinkAttributeName) atIndex:start effectiveRange:(reference)
	
	if aLink is not missing value then
		set anchorText to theString's substringWithRange:theRange
		set urlText to aLink's absoluteString()
		set output to (anchorText as text) & "::" & (urlText as text) & return & output
	end if
	
	set start to (location of theRange) - 2
	
end repeat

return output -- to view in Script Editor Result pane

Here the output simply goes to the Result pane in the format suggested for copy-pasting. It could, of course, be delimited in other ways and automated to set values of attribute(s) in Tinderbox.
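
If the result is to be processed further outside Tinderbox, splitting it back into pairs is straightforward. Below is a minimal Python sketch; `parse_pairs` is a made-up name, and it assumes that the anchor text itself never contains `::`, which will not hold for every page. Note that AppleScript's `return` is a carriage return, so the lines are separated by `\r` rather than `\n`. The second sample line is a hypothetical entry added for illustration.

```python
def parse_pairs(output):
    """Split ANCHOR::URL lines into (anchor, url) tuples."""
    pairs = []
    for line in output.splitlines():  # copes with "\r", "\n" and "\r\n"
        # Assumption: the first "::" on the line is the delimiter.
        anchor, sep, url = line.partition("::")
        if sep and url:
            pairs.append((anchor.strip(), url.strip()))
    return pairs


script_result = (
    "Tinderbox 8.9::http://www.eastgate.com/Tinderbox/updates/Tinderbox88.html\r"
    "Example::http://example.com/\r"  # hypothetical second link
)
for anchor, url in parse_pairs(script_result):
    print(anchor, "->", url)
```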


Ok, this works perfectly. Thank you for the education. :pray:

Now, to finish this off: does anyone know how to trigger an AppleScript from TBX? I can use runCommand to copy $Text to the clipboard. I then want the AppleScript to run and paste the results back into $Text, or possibly into a new note (have the AppleScript create a new note). Then I can run an explode to parse the results.

Ah, so much fun to have later.

If you are extracting links from multiple notes, then selecting the notes and using File > Export > As Text > RTF > Selected Notes seems to be the way to go. Then run this script and choose the file that Tinderbox creates from the export.

Script here
-- adapted from https://www.macscripter.net/viewtopic.php?pid=182034, https://macscripter.net/viewtopic.php?id=46657

use framework "Foundation"
use scripting additions

set thePath to POSIX path of (choose file)
set urlPath to current application's NSURL's fileURLWithPath:thePath

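set {attString, theError} to current application's NSAttributedString's alloc()'s initWithURL:urlPath options:(missing value) documentAttributes:(missing value) |error|:(reference)
-- stop with the system's message if the file could not be read as rich text
if attString is missing value then error (theError's localizedDescription() as text)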

set start to (attString's |length|()) - 1 -- will work from end backwards
set theString to attString's |string|()'s mutableCopy() -- plain string copy to work on

set output to ""

repeat while start ≥ 0
	set {aLink, theRange} to attString's attribute:(current application's NSLinkAttributeName) atIndex:start effectiveRange:(reference)
	if aLink is not missing value then
		set anchorText to theString's substringWithRange:theRange
		set urlText to aLink's absoluteString()
		set output to (anchorText as text) & "::" & (urlText as text) & return & output
	end if
	set start to (location of theRange) - 2
end repeat

return output -- to view in Script Editor Result pane
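
As a rough cross-check that needs no AppleScript, the URLs (though not clean anchor text) can usually be spotted in the RTF source itself, because the RTF format stores a link as a `HYPERLINK "…"` field instruction. The Python sketch below assumes the export is a plain .rtf file whose link targets contain no escaped quotes; `urls_in_rtf` is a made-up name and the sample is a hand-written fragment, not actual Tinderbox output.

```python
import re

# An RTF link is a field whose instruction reads: HYPERLINK "target"
HYPERLINK_FIELD = re.compile(r'HYPERLINK\s+"([^"]+)"')


def urls_in_rtf(rtf_source):
    """Return every HYPERLINK target found in RTF source, in document order."""
    return HYPERLINK_FIELD.findall(rtf_source)


# Hand-written fragment for illustration only.
sample_rtf = (
    r'{\rtf1\ansi {\field{\*\fldinst{HYPERLINK '
    r'"http://www.eastgate.com/Tinderbox/updates/Tinderbox88.html"}}'
    r'{\fldrslt Tinderbox 8.9}} will be your assistant}'
)
print(urls_in_rtf(sample_rtf))
```

The anchor text sits in the `\fldrslt` group mixed with formatting control words, which is why the AppleScript route through NSAttributedString is the more dependable way to get both halves of the pair.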

I’ve noticed that the RTF “layer” (or whatever it is called) is somehow “disturbed” in any note where a “wikilink/ziplink/text link” is added. The colors (if any) of the text pasted from the web all shift when one is added, the external links in the text are no longer clickable, and of course these scripts can no longer find any hyperlinks. Not sure if that is expected behavior.

Why not enable the scripts menu on the OS menu bar and use scripts that target the current TBX selection? Essentially it is like using a stamp, albeit called from outside the app. That seems less hassle than tinkering with runCommand().


It is possible to trigger an AppleScript from within Tinderbox and even pass an argument, via runCommand, using osascript -e, using, say, a stamp. From my (very) old notes:

However, escaping AppleScript for the command line is a daunting task. No single quotes, for example. There are similar problems with the perl one-liner above (which I thought might make it possible to pass rich text to textutil for conversion to HTML, which could then be parsed within Tinderbox).
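
For the shell layer at least, the quoting can be generated rather than written by hand. The Python sketch below uses the standard library's `shlex.quote` to wrap an AppleScript source string for `osascript -e`. It covers only POSIX shell quoting; how the resulting string would then need to be quoted inside a Tinderbox action-code string is a separate matter that this does not address. The function name `osascript_command` is made up for illustration.

```python
import shlex


def osascript_command(applescript_source):
    """Build a shell command line that runs the given AppleScript via osascript -e."""
    return "osascript -e " + shlex.quote(applescript_source)


# A source string containing both double quotes and a single quote.
source = "display dialog \"It's done\""
command = osascript_command(source)
print(command)

# Round-trip check: a POSIX shell would hand osascript the original source.
assert shlex.split(command) == ["osascript", "-e", source]
```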

Now that Tinderbox has external scripting support, making it efficient to get values in and out (with the notable exception of rich text) I suggest just launching a script outside of Tinderbox in the usual ways (the run button in Script Editor, a menu pick after placing the script in the Script menu, or with a keyboard shortcut after placing the script in an Automator Service, a.k.a. Quick Action).


This is perfect!


Would be great if we could set $Text to rich text via AppleScript.

I’d like, for example, to get parts of a PDF as an attributed string and create a note from that.

@eastgate Would this be possible?


Set $Text to rich text and get rich text from $Text into a script.

And not have $Text “disturbed” by addition of a text link (per above) so that links can’t be extracted.

I suspect it is complicated, per @eastgate :

Not actually an RTF writing space. But the internal format happens to be stored as RTFD

But it would be nice. Especially the ability to add text links in a note without the existing links “disappearing” from the built-in export to RTF.


We’ll take a look!
