Export notes with titles intact for analysing them with GPT-4

Per · December 15, 2023, 11:47am

Dear TB Forum,
I am experimenting with using GPT-4 (AI) to ask questions of specific sets of written material that I know very well, and have so far been rather blown away by how well it has performed. I now want to apply it to the notes in my TB knowledge management system (my Zettelkasten on steroids), but cannot manage to export the notes in a format that keeps the note titles intact. This is important, since the titles are descriptive of the contents of the notes and provide input to the AI on the other side. According to the brilliant aTbRef, it should be possible to get longer titles with spaces when exporting as HTML, but I cannot get it to work. They come out truncated and with “_” for each space. If it is better to export to another format, e.g. Markdown, I am open to suggestions as long as the names, text and links between notes are intact. Please help!

Best wishes,
Per

mwra · December 15, 2023, 1:27pm

^value($Name)^ will insert the verbatim UTF-* text of the title into the template. Insertion of underscores sounds more like what might happen in the creation of exported filenames. See more on ^value()^.

Also, things evolve, so if you are using code from years back there may well be a newer, easier way. Support for a wider range of characters has also evolved, so there are now fewer things we have to remove or replace.

Can you post the template that is creating these mangled titles? It shouldn’t be a problem. Likely just a case of the right code in the right place, which is sometimes only obvious—after you know.

Per · December 16, 2023, 1:10pm

Dear @mwra
Thanks for your quick response and laudable patience with us happy users yet terrible technicians. I am certain there is nothing wrong with the code nor the guidance in aTbRef. It is simply my ignorance.

I have never worked with HTML, so I do not know where to enter ^value($Name)^. I have not changed anything in the HTML template, so it looks like this:

<!DOCTYPE html>
<html>
<head>
	<meta http-equiv="content-type" content="text/html; charset=utf-8">
	<title>^title^</title>
</head>
<body>
<!--   ** Standard Tinderbox Template [section page] **  -->

<h1>^title^</h1>
^text^
^children(/Templates/HTML page/HTML item)^
</body>
</html>

mwra · December 16, 2023, 1:46pm

No worries, and no judgement here! Some more quick questions though.

Are you saying the title text as seen as a heading on the page (the <h1> section in the template) is coming out mangled?

What sort of characters are getting substituted with underscores?

If easier, just post a screenshot of a ‘bad’ preview, or better, post a small file that shows these errors when the notes are previewed.

While we get to the real case (I suspect some app defaults set to very old values), try making a copy of the above template (i.e. a new template note in your TBX) and changing this line:

<h1>^title^</h1>

to

<h1>^value($Name)^</h1>

But, as hinted above, even if the latter seems to fix the issue, I think we’re simply working around some deeper settings that need updating.

HTH

Per · December 16, 2023, 1:54pm

The contents inside the notes (text and links) come out just as they should. It is the names of the notes that come out as truncated (24 characters) filenames with all remaining spaces turned into underscores. These changed filenames become the names of the Markdown notes I use to feed the AI…

mwra · December 16, 2023, 2:28pm

But where? The template you showed me doesn’t see such info. I think you’re talking about exported filenames.

When you export a note to an HTML file, several settings control the resulting OS filename of the file generated.

$HTMLExportExtension sets the extension (‘.htm’, ‘.html’, ‘.txt’, etc.)
$HTMLExportFileName can set a pre-defined, non-changing filename.
$HTMLExportFileNameSpacer allows you to substitute spaces in source $Name strings with another character. 20 years ago, spaces in filenames (and thus URLs) didn’t work.
$HTMLFileNameLowerCase allows you to force the filename to all-lowercase.
$HTMLFileNameMaxLength. This sets the number of characters allowed in the filename.

I’d wager you’re using a file made in an older version of Tinderbox, when the app’s defaults for the above were different. When a TBX is made, the app’s attribute default values become the TBX’s system attribute defaults. In v9.5.0, the default for $HTMLFileNameMaxLength rose from 24 to 100. Any document created before v9.5.0 would have a default of 24. Similarly, $HTMLExportFileNameSpacer used to be _ but is now a space (again as from v9.5.0). So check your TBX’s defaults for those.
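To make the effect of those two defaults concrete, here is a minimal Python sketch of the kind of transformation involved. It is only an illustration, not Tinderbox's actual export code: the function name `export_filename`, the order of the steps, and the assumption that the length limit excludes the extension are all hypothetical.

```python
def export_filename(name, spacer="_", max_length=24, extension=".html", lowercase=False):
    """Rough model of an export filename (NOT Tinderbox's real algorithm).

    Assumed steps: replace spaces with the spacer, truncate to max_length
    characters, optionally lowercase, then append the extension.
    """
    base = name.replace(" ", spacer)
    base = base[:max_length]
    if lowercase:
        base = base.lower()
    return base + extension


title = "Cultural evolution and the origin of cooperation"

# Older document defaults: "_" spacer, 24-character limit.
old_style = export_filename(title, spacer="_", max_length=24)

# Defaults as from v9.5.0: space as spacer, 100-character limit.
new_style = export_filename(title, spacer=" ", max_length=100)

print(old_style)  # Cultural_evolution_and_t.html
print(new_style)  # Cultural evolution and the origin of cooperation.html
```

With the older defaults the title is cut off mid-word and gains underscores, which matches the symptom described at the start of the thread.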

Tip. Before someone asks, “Why doesn’t the app just automatically update defaults in older TBXs?” I would note it is a less simple task than assumed. A TBX has over 400 (!) system attributes and it’s likely some of your defaults are as you want them and not necessarily as per current app defaults. A one-size-fits-all update could wreak havoc on a TBX (which we likely forgot to back up before ‘updating’). aTbRef is kept updated for current system attribute defaults (listing); the best approach is to check these against your own document (the latter using the system attribute Inspector).

In summary, I think this isn’t about $Name values in templates so much as export filenames.

HTH

Per · December 16, 2023, 2:43pm

You are right, as always. $HTMLExportFileNameSpacer is set to “_” and $HTMLFileNameMaxLength is set to 24. I updated these two attributes for a selected note and it solved the issue. The final question: How do I do that best in bulk? Prototype?

mwra · December 16, 2023, 3:04pm

You can’t do it in bulk for multiple TBXs without some serious coding to read/edit the XML in the files. But, for one TBX (at a time), open the system attribute Inspector and use the search option with the name of the attribute you wish to update. It will select that attribute, and you can view the default. Editing the default (re-)sets it for the doc.

Of course, if you have set a value for that attribute locally, i.e. for that note only, then you’ll want to purge those local settings. Let’s assume we’ve just reset the document defaults of $HTMLExportFileNameSpacer and $HTMLFileNameMaxLength. Now we’ll make an agent to find all notes with a local (i.e. non-default) setting for either attribute. The query is this:

$HTMLExportFileNameSpacer | $HTMLFileNameMaxLength

Tip: $HTMLExportFileNameSpacer is a short way of writing $HTMLExportFileNameSpacer !="". See: https://www.acrobatfaq.com/atbref95/index/Automating_Tinderbox/Coding/Action_Code/Operators/Full_Operator_List/AttributeName_i_e_a_short_form_test_for_no_value.html.

The query asks for notes with a value in either of those two attributes. Now the agent action:

$HTMLExportFileNameSpacer=;
$HTMLFileNameMaxLength=;

At the end the agent will be empty as the action resets the condition which would cause a note to match the query!

Tip: the =; usage resets the attribute to the default (inherited) value and can be used with any data type of attribute. We could use $HTMLExportFileNameSpacer=""; as a slightly more explicit but less terse alternative. See: Setting or resetting an attribute's default.

If you do want to edit a TBX’s XML, first make a copy, then open it in BBEdit. Look for these lines:

<attrib Name="HTMLFileNameMaxLength" parent="HTML" editable="1" visibleInEditor="1" lines="1" type="2" default="100" >
</attrib>

and

<attrib Name="HTMLExportFileNameSpacer" parent="HTML" editable="1" visibleInEditor="1" lines="1" default=" " >
</attrib>

In each case you will check/edit the value for the 'default' XML element attribute. So, for the first case you might change default="24" to default="100".

Then save, open the file and test (if you mess up, you can start over with the back-up!).
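For anyone who would rather script that edit than do it by hand, the following is a minimal Python sketch of the same find-and-change. It assumes the `<attrib>` lines look like the two samples quoted above; the helper name `set_attribute_default` is hypothetical, and it works on plain text with a regular expression so the rest of the file is left untouched. Treat it as a starting point and only ever run it against a copy of the TBX.

```python
import html
import re


def set_attribute_default(tbx_xml, attribute_name, new_default):
    """Return tbx_xml with the default="..." of one <attrib> element changed.

    Assumes the element has the shape shown in the thread, e.g.
    <attrib Name="HTMLFileNameMaxLength" ... default="24" >
    """
    pattern = re.compile(
        r'(<attrib\s+Name="' + re.escape(attribute_name) + r'"[^>]*?\sdefault=")[^"]*(")'
    )
    escaped = html.escape(new_default, quote=True)  # keep the XML attribute valid
    updated, count = pattern.subn(
        lambda m: m.group(1) + escaped + m.group(2), tbx_xml, count=1
    )
    if count == 0:
        raise ValueError(f"No default found for attribute {attribute_name!r}")
    return updated


# A two-element sample standing in for the relevant part of an older TBX file.
sample = (
    '<attrib Name="HTMLFileNameMaxLength" parent="HTML" editable="1" '
    'visibleInEditor="1" lines="1" type="2" default="24" >\n</attrib>\n'
    '<attrib Name="HTMLExportFileNameSpacer" parent="HTML" editable="1" '
    'visibleInEditor="1" lines="1" default="_" >\n</attrib>\n'
)

updated = set_attribute_default(sample, "HTMLFileNameMaxLength", "100")
updated = set_attribute_default(updated, "HTMLExportFileNameSpacer", " ")
print(updated)
```

To apply it to a real file, one would read the copied TBX as UTF-8 text, pass it through the function, and write the result back out. That read/write step is left out here on purpose so the sketch cannot touch a document by accident.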

HTH - note I’m now out until tomorrow, so hope you’re all fixed.

Per · December 16, 2023, 4:50pm

It worked like a charm! Now I get filenames of up to 100 characters that read as they should, without underscores. Thanks @mwra!

The only remaining issue is that my source notes (representing and linking to scientific publications in Bookends) are named in the form “author (year)” (e.g. “Boyd and Richerson (2005)”) and the parentheses disappear in the exported filenames. Is there any nifty solution for that?

mwra · December 16, 2023, 11:12pm

What confuses me is why you need the parentheses in the filename? This suggests you are using the filenames in some other app … which likely has a more efficient way of ingesting the data in question.

Parentheses are allowed in the query part of a URL (i.e. after the first ?), but not in the filename part, so correctly a ( or ) in the filename must be urlencoded or elided.
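As an illustration of what “urlencoded” means here, this short Python sketch percent-encodes a filename using the standard library's `urllib.parse.quote`. It is not what Tinderbox does on export (which, per the thread, drops the parentheses); the example filename is simply built from the note title mentioned above.

```python
from urllib.parse import quote, unquote

filename = "Boyd and Richerson (2005).html"

# quote() percent-encodes spaces and parentheses, among other characters.
encoded = quote(filename)
print(encoded)  # Boyd%20and%20Richerson%20%282005%29.html

# The encoding is reversible, so the original name is not lost.
print(unquote(encoded) == filename)  # True
```

The encoded form is safe to use in a link but is much harder to read, which is one reason an export might elide such characters instead.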

So, why do you need the verbatim $Name in the exported filename? It might help us bypass these issues altogether.

Per · December 17, 2023, 9:34am

That is a good question. I do not know how important it is for the functionality, if at all important. The only thing I can say is, when experimenting, GPT-4 picks up and relates contents of notes also based on the titles, but I guess it must be unlikely that the parentheses in the titles of the source notes would matter. It is, then, just a matter of more appealing looks in the database. If there is no simple solution that someone like me could implement, it is not worth pursuing.

mwra · December 17, 2023, 1:01pm

Per wrote: “experimenting, GPT-4 picks up and relates contents of notes also based on the titles”

The filenames you are talking about only exist on export. IOW, if GPT-4 reads a TBX, the exported filename, created on the fly, is nowhere to be seen inside the TBX.

By comparison, in export the export filename is the OS name of the file. For example, take the source for aTbRef page https://www.acrobatfaq.com/atbref95/index/Automating_Tinderbox/Coding/Action_Code/Operators/Full_Operator_List/createLink_sourceItem_destinationItem_linkTypeStr.html:

Exported OS filename: Full_Operator_List/createLink_sourceItem_destinationItem_linkTypeStr.html
$Name: createLink(sourceItem, destinationItem[, linkTypeStr])

The $Name is also in the HTML header data:

If GPT-4 is ‘reading’ HTML files, I don’t really see why the OS filename makes a difference, as the default Tinderbox built-in templates include this in the header:

<title>^title^</title>

And normally they insert the title with something like:

<h1>^title^</h1>
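Because the full title is written into the exported page itself, any later processing step could read it from there instead of relying on the OS filename. A minimal Python sketch of that idea follows, using only the standard library's `html.parser`. The class and function names are hypothetical, and the sample page is a cut-down stand-in for a real exported file.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html_text):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.title_parts).strip()


# Stand-in for an exported page; a real one would be read from disk.
page = (
    "<!DOCTYPE html><html><head><title>Boyd and Richerson (2005)</title></head>"
    "<body><h1>Boyd and Richerson (2005)</h1><p>Note text.</p></body></html>"
)

title = extract_title(page)
print(title)  # Boyd and Richerson (2005)
```

Read this way, the title keeps its spaces, parentheses and full length regardless of how the export filename was shortened or cleaned.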

This is a weakness of ‘magic’ processes like GPT-4: you don’t really know what it’s reading … or why. It is simply assumed (wrongly IMO) that the “computer knows best”.

Per · December 17, 2023, 4:25pm

I understand your point. However, the workflow involves steps in between, where the exported HTML files are turned into Markdown notes in which the filenames of the HTML files become the note titles again, before the AI is fed them. Still, the parentheses are likely not informative for the AI regardless. Your comment about the “black box” nature of AI is also pertinent, and brings a recurring “Little Britain” sketch to mind, always ending in a “Computer says no”…

Thanks for all your help, @mwra! You are as important to my TB-work as the software itself.

webline · December 19, 2023, 7:45am

Hi Per,

why don’t you use ChatGPT directly in TBX instead of exporting everything?

Detlef

Per · December 19, 2023, 8:03am

Interesting! I will have a look.