Tinderbox Forum

Pulling attributes from text

I want to bring highlights from iBooks into Tinderbox and automatically extract parts of the text into attributes. Here is some original text from the iBooks output:

May 20, 2018

Chapter One: A COMPUTER WANTED, p. 12

punched cards, which separated pattern from process for the first time in history, would eventually find their way into the earliest computers. Patterns encoded on paper, which computer scientists later called “programs,” could meaningfully entangle numbers as easily as thread. The Jacquard loom

May 20, 2018

Chapter One: A COMPUTER WANTED, p. 10

Indeed, computing was the grunt labor of organized science; before they were made obsolete, human computers prepared ballistics trajectories for the United States Army, cracked Nazi codes at Bletchley Park, crunched astronomical data at Harvard, and assisted numerical studies of nuclear fission on the Manhattan Project. Despite the diversity of their work, human computers had one thing in common. They were women.

So the text in the note should just be the highlighted text. $Chapter should be the chapter text up to but not including the page, and $page should be the page number.

I am generally comfy with regex, but I am not seeing any examples of its use in an action in Tinderbox.

I believe there’s an example of this in the Agents and Actions section of Getting Started.

The general pattern is to use an agent to find a pattern, and extract the substring:

Query: $Text.contains("Chapter.+:(.+),")
Action: $Chapter=$1;
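
For anyone who wants to try the pattern outside Tinderbox first, here is a rough Python sketch of what the capture group picks up on the sample line from the first post. It only illustrates the regex and is not Tinderbox action code. The second pattern is a hypothetical variation (not from this thread) that keeps the whole chapter heading and captures the page number separately.

```python
import re

# Sample line copied from the iBooks export quoted in the first post.
sample = "Chapter One: A COMPUTER WANTED, p. 12"

# The pattern from the query above. Group 1 corresponds to what the
# action refers to as $1.
thread_match = re.search(r"Chapter.+:(.+),", sample)
thread_capture = thread_match.group(1)

# Hypothetical variation (an assumption, not from the thread): capture
# the full chapter heading up to the comma, and the page number after "p.".
variant = re.search(r"^(Chapter .+), p\. (\d+)$", sample)
chapter = variant.group(1)
page = int(variant.group(2))

print(repr(thread_capture))
print(repr(chapter), page)
```

Note that the first pattern captures only the text between the colon and the comma, including the leading space, so it may need trimming or a wider group depending on what $Chapter should hold.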

Ahhh, I think I was missing it because I wasn’t sure how to search for what I wanted, and I didn’t realize that $1 would be the result of the query.

I have a similar question, but not as simple from what I could gather.

What if this pattern occurs more than once in the text and I want to add every occurrence to the attribute field delimited by ;?

I don’t think you can do that, presently — not for an indefinite number of back-references. I think I see a straightforward way to make this easier.


I found this little piece of code by @pat here in the forum and I thought that it would work for every occasion, but now I see that it fails in many cases to find all the occurrences.

var possibleTags;
var foundTags;

possibleTags = $Text.split("#");
possibleTags.each(pt) {
  if($Text.contains("#" + pt)) {
    foundTags = foundTags + firstWord(pt) + ";";
  };
};

$Keywords = foundTags;

Would you suggest any changes or some other alternative to make this possible @eastgate?

This looks sensible. You’d be better off using $Keywords directly and not foundTags, since $Keywords is a set (I presume) and does list assembly itself.

Note that you don’t allow any space between the # and the tags.

Like this, you mean?

var possibleTags;

possibleTags = $Text.split("#");
possibleTags.each(pt) {
  if($Text.contains("#" + pt)) {
    $Keywords = $Keywords + firstWord(pt);
  };
};

Not sure if I am getting what you meant here.

This pattern would match

  I am looking for #tags in #text

but not

 I am looking for # tags in # text
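
The same distinction can be checked quickly with a regex. This is a small Python sketch, assuming a tag is a "#" immediately followed by word characters. That definition is an assumption made for illustration, not how the action code above is written.

```python
import re

# Assumed tag pattern: '#' immediately followed by one or more word characters.
TAG = re.compile(r"#(\w+)")

with_tags = "I am looking for #tags in #text"
with_spaces = "I am looking for # tags in # text"

found = TAG.findall(with_tags)     # tags directly after '#'
missed = TAG.findall(with_spaces)  # a space after '#' breaks the match

print(found)
print(missed)
```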

Thanks for clarifying that. Indeed, that is the intended behavior, since I am using markdown.

I couldn’t get the code above to retrieve every tag in the text, but I found that this works:

var possibleTags;
var firstTerm;
var theTag;

$Keywords = "";
possibleTags = $Text.split("#");
possibleTags = possibleTags.replace("\(|\)","");
possibleTags.each(pt) {
  theTag = firstWord(pt);
  if($Text.contains("#" + theTag)) {
    theTag = theTag.replace(",|\.","");
    theTag = theTag.replace("^.{1}$","");
    $Keywords = $Keywords + theTag;
  };
};

I’ve been using this one-liner for a while that I think does the same thing.

$Keywords=runCommand("grep -o '#[a-zA-Z0-9_]\+'",$Text).replace('\r',';').replace('#','')

Thanks!

After slightly adapting the line to my use case, it works perfectly.

$Keywords=runCommand("LANG=pt_BR.UTF-8 grep -o '#\w\+'",$Text).replace('\n',';').replace('#','');

Yes, that’s better regex (I think) and it’s great to see an example of how to change locale. I’m going to see if I can figure out how to do it for Chinese language tags… Doesn’t seem possible on my US system, though.
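
If the locale handling in grep turns out to be a dead end, one possible workaround is to do the matching in a language whose regex engine is Unicode-aware. The sketch below is Python, where `\w` matches accented letters and CJK characters by default for text strings. The function name `extract_keywords` and the sample note are made up for illustration, and how such a script would be wired into runCommand is left open here.

```python
import re

def extract_keywords(text: str) -> str:
    """Return the tags found in text as a ';'-delimited string, without '#'."""
    # For str patterns in Python 3, \w is Unicode-aware by default, so it
    # matches accented Latin letters and CJK characters as well as ASCII.
    tags = re.findall(r"#(\w+)", text)
    return ";".join(tags)

# Hypothetical note text mixing Portuguese, Chinese, and ASCII tags.
note = "Notas sobre #educação e #política, além de #中文 e #tag_2."
print(extract_keywords(note))
```

Like the grep versions above, this keeps duplicates and preserves the order in which tags appear; punctuation after a tag is excluded because it is not a word character.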