Pulling attributes from text

mattw · June 16, 2018, 3:51pm

I want to bring highlights from iBooks into Tinderbox and automatically extract text into some attributes. Here is some original text from the iBooks output:

May 20, 2018

Chapter One: A COMPUTER WANTED, p. 12

punched cards, which separated pattern from process for the first time in history, would eventually find their way into the earliest computers. Patterns encoded on paper, which computer scientists later called “programs,” could meaningfully entangle numbers as easily as thread. The Jacquard loom

May 20, 2018

Chapter One: A COMPUTER WANTED, p. 10

Indeed, computing was the grunt labor of organized science; before they were made obsolete, human computers prepared ballistics trajectories for the United States Army, cracked Nazi codes at Bletchley Park, crunched astronomical data at Harvard, and assisted numerical studies of nuclear fission on the Manhattan Project. Despite the diversity of their work, human computers had one thing in common. They were women.

So the text in the note should just be the highlighted text. $Chapter should be the chapter text up to but not including the page, and $page should be the page number.

I am generally comfy with regex, but I am not seeing any examples of its use in an action in Tinderbox.

eastgate · June 16, 2018, 4:31pm

I believe there’s an example of this in the Agents and Actions section of Getting Started.

The general pattern is to use an agent to find a pattern, and extract the substring:

Query: $Text.contains("Chapter.+:(.+),")
Action: $Chapter=$1;
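
For anyone who wants to check the pattern outside Tinderbox first, here is a minimal Python sketch of the same capture idea applied to the header line from the first post. The pattern and the `parse_header` helper are illustrative assumptions, not part of Tinderbox; the second group for the page number goes beyond the action above, and whether Tinderbox exposes it in the same way should be verified in the app.

```python
import re

# Header line copied from the sample highlights in the first post.
header = "Chapter One: A COMPUTER WANTED, p. 12"

# Hypothetical pattern: group 1 is the chapter title, group 2 is the page number.
CHAPTER_RE = re.compile(r"Chapter.+?:\s*(.+),\s*p\.\s*(\d+)")

def parse_header(line):
    """Return (chapter, page) for a chapter header line, or None if it does not match."""
    m = CHAPTER_RE.search(line)
    if m is None:
        return None
    return m.group(1), int(m.group(2))

print(parse_header(header))
```

Group 1 here plays the role that $1 plays in the agent action: it is whatever the first pair of parentheses matched.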

mattw · June 16, 2018, 7:07pm

Ahhh, I think I was missing it because I wasn’t sure how to search for what I wanted, and I didn’t realize that $1 would hold the result captured by the query.

Bernard-0 · February 26, 2021, 1:38pm

I have a similar question, but not as simple from what I could gather.

What if this pattern occurs more than once in the text and I want to add every occurrence to the attribute field delimited by ;?

eastgate · February 26, 2021, 4:07pm

I don’t think you can do that, presently — not for an indefinite number of back-references. I think I see a straightforward way to make this easier.

Bernard-0 · March 31, 2021, 8:46pm

I found this little piece of code by @pat here in the forum and I thought that it would work for every occasion, but now I see that it fails in many cases to find all the occurrences.

var possibleTags;
var foundTags;

possibleTags = $Text.split("#");
possibleTags.each(pt) {
  if($Text.contains("#" + pt)) {
    foundTags = foundTags + firstWord(pt) + ";";
  };
};

$Keywords = foundTags;

Would you suggest any changes or some other alternative to make this possible @eastgate?

eastgate · March 31, 2021, 8:59pm

This looks sensible. You’d be better off using $Keywords directly and not foundTags, since $Keywords is a set (I presume) and does list assembly itself.

Note that you don’t allow any space between the # and the tags.

Bernard-0 · March 31, 2021, 9:04pm

Like this, you mean?

var possibleTags;

possibleTags = $Text.split("#");
possibleTags.each(pt) {
  if($Text.contains("#" + pt)) {
    $Keywords = $Keywords + firstWord(pt);
  };
};

Not sure if I am getting what you meant here.

eastgate · March 31, 2021, 9:24pm

This pattern would match

  I am looking for #tags in #text

but not

 I am looking for # tags in # text

Bernard-0 · March 31, 2021, 9:39pm

Thanks for clarifying that. Indeed, that is the intended behavior, since I am using markdown.

I couldn’t get the code above to retrieve every tag in the text, but I found that this works:

var possibleTags;
var firstTerm;
var theTag;

$Keywords = "";
possibleTags = $Text.split("#");
possibleTags = possibleTags.replace("\(|\)","");
possibleTags.each(pt) {
  theTag = firstWord(pt);
  if($Text.contains("#"+theTag)){
    theTag = theTag.replace(",|\.","");
    theTag = theTag.replace("^.{1}$","");
    $Keywords = $Keywords + theTag;
  };
};

sumnerg · April 2, 2021, 3:21am

I’ve been using this one-liner for a while that I think does the same thing.

$Keywords=runCommand("grep -o '#[a-zA-Z0-9_]\+'",$Text).replace('\r',';').replace('#','')

Bernard-0 · April 2, 2021, 2:08pm

Thanks!

After slightly adapting the line to my use case, it works perfectly.

$Keywords=runCommand("LANG=pt_BR.UTF-8 grep -o '#\w\+'",$Text).replace('\n',';').replace('#','');

sumnerg · April 3, 2021, 12:12pm

Yes, that’s better regex (I think) and it’s great to see an example of how to change locale. I’m going to see if I can figure out how to do it for Chinese language tags… Doesn’t seem possible on my US system, though.
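
As a point of comparison only, the "every occurrence" idea behind the grep one-liners can be sketched in Python, where `\w` is Unicode-aware for ordinary strings and no locale setting is involved. This is an illustration under assumptions, not something Tinderbox runs: the `extract_tags` helper and the sample sentence are hypothetical, and the de-duplication only imitates what a set-type $Keywords would do on its own.

```python
import re

def extract_tags(text):
    """Return the distinct hashtags in text, without the leading '#', in first-seen order."""
    seen = []
    # '#' must be followed directly by a word character, so "# spaced" is not a tag.
    for tag in re.findall(r"#(\w+)", text):
        if tag not in seen:
            seen.append(tag)
    return seen

sample = "I am looking for #tags in #text, not # spaced ones. Also #ação and #tags again."

# Join with ';' to mirror the delimiter used for the attribute in this thread.
print(";".join(extract_tags(sample)))
```

As in the Tinderbox versions above, a space between the # and the word means no match, which is the intended behavior for markdown text.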

fidel · January 26, 2024, 4:39pm

Hi guys. I’ve followed this thread, as well as:

and

I’ve also gone on various sites on the internet that can help me build Regex. However, I’m still totally stuck. I’m requesting some assistance. I’ve narrowed the problem down to my ignorance of Regex. I’m hoping someone could help me with this. What I’d like to do is extract the blocks from $Text and insert them under the relevant attributes. If I could get the Query and the Action, I’d be much obliged. Thanks.

eastgate · January 26, 2024, 5:01pm

Regular expressions aren’t the only way to attack this. The stream interface might be easier! For example:

$Text.skipTo("Heading Two: ").captureLine("Heading2")

will put “And these are the items under heading two.” in $Heading2.

BBEdit’s Pattern Playground is a terrific way to test out regular expressions. Highly recommended!
If you want an agent to find all the notes that contain the phrase Heading Two:, your query might be $Text.contains("Heading Two:")
You also want to capture the rest of the line after this phrase. So $Text.contains("Heading Two:(.+)"). Here the period means “any character”, and the “+” means “1 or more occurrences.” So, (.+) means “remember what comes next, assuming that it’s at least one character”
The action would be $Heading2=$1. $1 means “the first thing you remembered”, which in this case is the only thing we remembered.
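
To make the difference between the two approaches in this post concrete, here is a rough Python analogy. It is not how Tinderbox implements either feature. Both helper names and the sample note text are hypothetical, since the original sample from the question is not reproduced in this thread.

```python
import re

def capture_line_after(text, marker):
    """Stream-style: skip to marker, then return the rest of that line (None if absent)."""
    start = text.find(marker)
    if start == -1:
        return None
    start += len(marker)
    end = text.find("\n", start)
    if end == -1:
        end = len(text)
    return text[start:end]

def capture_with_regex(text, pattern):
    """Regex-style: return what the first parenthesised group matched (None if no match)."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

# Hypothetical note text, made up for illustration.
note = (
    "Heading One: These are the items under heading one.\n"
    "Heading Two: And these are the items under heading two.\n"
)

print(repr(capture_line_after(note, "Heading Two: ")))
print(repr(capture_with_regex(note, r"Heading Two:(.+)")))
```

One visible difference in this sketch: the regex version keeps the space after the colon because the pattern stops at "Heading Two:", while the stream-style marker "Heading Two: " includes it.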

mwra · January 26, 2024, 5:06pm

It might be worth noting that BBEdit’s Pattern Playgrounds are only available in the paid licence version, not the free tier, in case some might be wondering how to access the feature.

fidel · January 26, 2024, 5:45pm

Great! This is working, using the Stream Interface. And it works for multi-line entries. Awesome. Thank you so much.