Recovering metadata from imported text


(Mark Anderson) #1

Off forum, I’ve been assisting @jprint714 with the task of getting tag data (and other metadata) from text generated elsewhere. In this case it was MarginNote, but it could be any utility with little or no metadata (e.g. just ‘tags’) and very coarse export controls. This isn’t to disparage those apps. I use a lot of such utilities, but it is common that the design flair that goes into their core task isn’t extended to how the user might use the generated data outside that utility.

In this case the source app could give us ‘tags’ (general keywords) as a list of #-prefixed values with underscores for spaces: #Project_X, #Thing_Y, etc. An aim was to get these into Tinderbox $Tags as cleaned-up Set values: Project X;Thing Y.

There was also a desire to capture some streams of data in the source notes: research questions, questions to ask, specific contacts, etc. These weren’t in any separate storage at source (unlike tags). A further aim was to recover some dates in the notes for use in Tinderbox timelines. All this info would end up as a text file per source note (that’s how the source app created it) and the Tinderbox task was to extract that data to relevant attributes once imported.

The first issue is that Tinderbox (writing as at v7.3.1) has no simple means to return the paragraphs of $Text. If $Text contains no semi-colons, $Text.split("\n") will return a list of paragraph-length strings (a semi-colon is the Tinderbox list delimiter, so any in the text would break a paragraph into extra list items). One approach might be to iterate through the resulting list with List.each() and test each paragraph.
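As an illustration of that iterate-and-test idea, here is a minimal Python sketch (not Tinderbox action code). The function name classify_paragraphs is made up for this example, and the labels are the ones used in the action code further down this post; the tags paragraph is left out to keep it short.

```python
# Hypothetical helper: models "split into paragraphs, then test each one".
# The labels mirror those used in the post's action code.
LABELS = (
    "RESEARCH QUESTIONS: ",
    "ASK QUESTIONS: ",
    "CRITICAL POINT: ",
    "CONTACT: ",
    "TIMELINE: ",
)

def classify_paragraphs(text):
    """Return (body_paragraphs, found), where found maps label -> value."""
    body, found = [], {}
    for para in text.split("\n"):
        for label in LABELS:
            if para.startswith(label):
                # Keep only the value that follows the label.
                found[label] = para[len(label):]
                break
        else:
            # No label matched, so this is ordinary note text.
            body.append(para)
    return body, found
```

This works, but every paragraph has to be tested against every label, which is part of why the simpler approach below was preferred.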

However, a potentially simpler method emerged, one easier for a non-expert Tinderbox user to follow. The exported text files already had the tags at the end of the text, i.e. as the last paragraph. So the plan was to put each discrete type of metadata (tags, contacts, etc.) in its own paragraph, with a distinct and consistent starting label, entered in a consistent manner. That way a stamp can look at the last (of many) paragraphs and, if its label matches, parse out the data and set $Text to the rest of the original $Text, i.e. removing the metadata, which is now stored in $Tags or wherever.

So the exported text notes were formatted so that the metadata paragraphs started with the following labels, to be treated as described. The order is first to last (i.e. the tags paragraph is the last) and the labels include a closing space:

  • "RESEARCH QUESTIONS: ". zero or one entries, one value. Mapped to String $ResearchQs.
  • "ASK QUESTIONS: ". zero or one entries, one value. Mapped to String $AskQs.
  • "CRITICAL POINT: ". zero or one entries, one value. Mapped to String $CriticalPoint.
  • "CONTACT: ". zero or one entries, one value. Mapped to String $Contact.
  • "TIMELINE: ". zero or one entries, one or two values. Mapped to $StartDate (first or only value), $EndDate (second value). Values with comma+space delimiter.
  • "#". zero or one entries, one or more values. Mapped to $Tags (all values). Values with comma+space delimiter.
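Because the method depends on the paragraphs arriving in that order, it can help to check a file before import. The following is a rough Python sketch of such a check, run outside Tinderbox. The names ORDER, label_of and metadata_in_order are invented for this example, and it assumes no paragraph of the ordinary note text starts with one of the labels.

```python
# Labels in the expected first-to-last order (tags paragraph last).
ORDER = [
    "RESEARCH QUESTIONS: ",
    "ASK QUESTIONS: ",
    "CRITICAL POINT: ",
    "CONTACT: ",
    "TIMELINE: ",
    "#",
]

def label_of(paragraph):
    """Return the label a paragraph starts with, or None for body text."""
    for label in ORDER:
        if paragraph.startswith(label):
            return label
    return None

def metadata_in_order(text):
    """True if labelled paragraphs form one trailing block, in ORDER,
    with each label used at most once."""
    labels = [label_of(p) for p in text.strip().split("\n")]
    # Index of the first labelled paragraph (or the end if there is none).
    first = next((i for i, lab in enumerate(labels) if lab is not None),
                 len(labels))
    tail = labels[first:]
    if None in tail:
        return False  # body text appears after the metadata starts
    positions = [ORDER.index(lab) for lab in tail]
    # Strictly increasing positions means correct order and no repeats.
    return all(a < b for a, b in zip(positions, positions[1:]))
```

A note with no metadata paragraphs at all passes the check, matching the 'zero or one entries' rule for every label.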

Here’s the action code:

$Text=$Text.replace(";",":");

$Tags=$Tags+$Text.split("\n#").at(1).replace("#","").replace("_"," ").replace(", ",";");
$Text=$Text.split("\n#").at(0);

$MyString=;
$MyString=$Text.split("\nTIMELINE: ").at(1);
if($MyString){
	if($MyString.contains(", ")){
		$StartDate=date($MyString.split(", ").at(0));
		$EndDate=date($MyString.split(", ").at(1));
	} else {
		$StartDate=date($MyString);
	};
	$Text=$Text.split("\nTIMELINE: ").at(0);
};

$MyString=;
$MyString=$Text.split("\nCONTACT: ").at(1);
if($MyString){
	$Contact=$MyString;
	$Text=$Text.split("\nCONTACT: ").at(0);
};

$MyString=;
$MyString=$Text.split("\nCRITICAL POINT: ").at(1);
if($MyString){
	$CriticalPoint=$MyString;
	$Text=$Text.split("\nCRITICAL POINT: ").at(0);
};

$MyString=;
$MyString=$Text.split("\nASK QUESTIONS: ").at(1);
if($MyString){
	$AskQs=$MyString;
	$Text=$Text.split("\nASK QUESTIONS: ").at(0);
};

$MyString=;
$MyString=$Text.split("\nRESEARCH QUESTIONS: ").at(1);
if($MyString){
	$ResearchQs=$MyString;
	$Text=$Text.split("\nRESEARCH QUESTIONS: ").at(0);
};

$MyString=;

First we swap out any semi-colons in $Text for colons (recall the issue with .split and semi-colons). Then we handle the tags, if present. We split on the label and its preceding line break so, if a tags paragraph is present, the splitting string is thrown away and we’re left with the $Text and a string of comma+space delimited tag values. We remove the hash signs, turn the underscores into spaces, change the value delimiter to the Tinderbox list delimiter, a semi-colon, and add the one or more values to $Tags.
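For anyone who finds the chained .replace() calls hard to follow, here is the same tag-handling step as a small Python sketch. It is an illustration only, not Tinderbox code: split_off_tags is a made-up name, a Python list stands in for the $Tags Set, and it assumes "\n#" occurs only once, at the start of the tags paragraph.

```python
def split_off_tags(text, existing_tags=()):
    """Return (remaining_text, tags) after removing a trailing tags paragraph."""
    # Swap semi-colons for colons first, as the action code does.
    text = text.replace(";", ":")
    # Split on the line break plus the first hash; the separator is discarded.
    body, sep, tag_line = text.partition("\n#")
    tags = list(existing_tags)
    if sep:
        # Drop remaining hash signs and turn underscores into spaces.
        cleaned = tag_line.replace("#", "").replace("_", " ")
        for value in cleaned.split(", "):
            # A Set holds each value once, so skip duplicates.
            if value and value not in tags:
                tags.append(value)
    return body, tags
```

For example, split_off_tags("Note; body.\n#Project_X, #Thing_Y") gives the text "Note: body." and the tags Project X and Thing Y.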

Next, we check if the last paragraph (noting any tags are now removed and processed) has a 'TIMELINE: ' label. If so, we process the date strings into Tinderbox dates and store them in $StartDate/$EndDate. I chose these for zero-configuration use of Timeline view, but any Date attribute would do. And so on through the other possible metadata paragraphs. If a given type of metadata doesn’t exist, the code just moves on, until only the real source text is left in $Text.
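The 'peel off the last labelled paragraph, then move on' pattern can be sketched in Python as below. Again this is only an illustration of the logic: peel and extract_metadata are hypothetical names, a dict stands in for the note's attributes, dates are assumed to be in yyyy/mm/dd form, and the tags paragraph is assumed to have been removed already.

```python
from datetime import datetime

def peel(text, label):
    """Split off the paragraph starting with label; value is None if absent."""
    body, sep, value = text.partition("\n" + label)
    return (body, value) if sep else (text, None)

def extract_metadata(text):
    """Return (remaining_text, attrs), working from the last label backwards."""
    attrs = {}
    text, timeline = peel(text, "TIMELINE: ")
    if timeline:
        # One or two comma+space delimited dates, assumed yyyy/mm/dd.
        dates = [datetime.strptime(d, "%Y/%m/%d").date()
                 for d in timeline.split(", ")]
        attrs["StartDate"] = dates[0]
        if len(dates) > 1:
            attrs["EndDate"] = dates[1]
    # Remaining labels in last-to-first order, mirroring the action code.
    for label, name in (
        ("CONTACT: ", "Contact"),
        ("CRITICAL POINT: ", "CriticalPoint"),
        ("ASK QUESTIONS: ", "AskQs"),
        ("RESEARCH QUESTIONS: ", "ResearchQs"),
    ):
        text, value = peel(text, label)
        if value:
            attrs[name] = value
    return text, attrs
```

Working backwards matters: each step only has to find its own label, because everything after it has already been removed.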

Notes:

  • I suggested yyyy/mm/dd input dates to avoid mess-ups over dd/mm vs mm/dd order in more normal date formats.
  • Are there lots of formatting ‘rules’ here? Yes, deliberately so. Over the years I’ve learned that the very flexibility experts crave is confusing for those new to Tinderbox and/or ‘code’. So, by using simple (apparently) fixed rules, it’s much easier to set up the data and check for glitches.
  • The code only needs to be run once per note. If used in a stamp, you can process extra notes added at a later time.
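To illustrate the first note above about date order, here is a short Python sketch (not Tinderbox code) showing how one day/month string can be read two ways, whereas the year-first form has a single reading. The sample date strings are invented for the example.

```python
from datetime import datetime

ambiguous = "03/04/2018"

# The same string read under the two common conventions.
as_day_first = datetime.strptime(ambiguous, "%d/%m/%Y").date()    # 3 April
as_month_first = datetime.strptime(ambiguous, "%m/%d/%Y").date()  # 4 March

# Year-first input leaves no doubt about which part is the month.
unambiguous = datetime.strptime("2018/04/03", "%Y/%m/%d").date()  # 3 April
```

Which reading a given system picks for the ambiguous form usually depends on locale settings, which is exactly the mess-up the yyyy/mm/dd suggestion avoids.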

Anyway, I hope that worked example helps a few folk starting out who want to bring in data from apps with less capable data handling features.