Extract text from the following standard format

satikusala · August 14, 2021, 2:47am

Hi all, I’m stuck.

I have the following text:

[Page 14](highlights://Predicting%20the%20Adoption%20of%20Self-Protections%20of%20Online%20Privacy-%20A%20Test%20of%20an%20Expanded%20Theory%20of%20Planned%20Behavior%20Model#page=14)

> An empirical study was conducted to examine the social
> psychological processes that may influence an individual’s
> adoption of online privacy protection strategies.

I want to create a stamp that extracts the page number between the [ ] and puts it in $PageNo, extracts the URL between the ( ) and puts it in $URL, and then removes the “>”, [ ], ( ), and \n and leaves the rest in $Text.

I have the RegEx to parse the page number, `(?<=\[)(.*?)(?=\])`, and the URL, `(?<=\()(.*?)(?=\))`. I can’t use .replace, because that does the opposite of what I want. I just want to grab this text and put it in the attribute. I’m guessing I could use a combination of .following and .replace to clean up the rest.
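As an illustration outside Tinderbox, the two lookaround patterns can be tried with Python's `re` module, assuming they were meant to read `(?<=\[)(.*?)(?=\])` and `(?<=\()(.*?)(?=\))`. The sample header is shortened and the variable names are hypothetical; Tinderbox's own regex engine may differ in detail.

```python
import re

# Shortened sample of the highlight header (hypothetical, for illustration only).
header = "[Page 14](highlights://Some%20Title#page=14)"

# Lookbehind/lookahead match the text between the delimiters
# without including the delimiters themselves.
label = re.search(r"(?<=\[)(.*?)(?=\])", header)
url = re.search(r"(?<=\()(.*?)(?=\))", header)

print(label.group(1))  # Page 14
print(url.group(1))    # highlights://Some%20Title#page=14
```

Note that the first pattern yields the whole label, "Page 14", rather than just the number; capturing only the digits would need a narrower group such as `(\d+)`.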

Can anyone help? I have a couple hundred of these blocks, so I need to automate it with explode and a stamp.

Thanks.

mwra · August 14, 2021, 9:27am

satikusala:

[Page 14](highlights://Predicting%20the%20Adoption%20of%20Self-Protections%20of%20Online%20Privacy-%20A%20Test%20of%20an%20Expanded%20Theory%20of%20Planned%20Behavior%20Model#page=14)

I’d be tempted to take line/paragraph 1 (or whichever section, as per your example), put it in an attribute or variable, and remove the original from $Text. Then work on the small extracted sample, which also lets you fix things like the ‘%20’ encodings of spaces to give more readable text. Having cleaned the extract, concatenate the $Text back on the end and save it back as text. Working on the extract cuts the chance of regex edge-case issues.

Indeed, in your example you could extract that, then split it on `\n` and process each part separately, again to make regex application easier. When you’ve got all the clean bits at the end, simply re-assemble them as $Text. Using regex at larger scope invariably means one part works but something else breaks. Smaller context removes the opportunity for that. HTH.
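The split-then-clean approach described above can be sketched in Python as a rough analogue. This is not Tinderbox action code, and the sample text and variable names are assumptions made up for the illustration.

```python
from urllib.parse import unquote

# Shortened, hypothetical sample of one highlight block.
note_text = (
    "[Page 14](highlights://Some%20Title#page=14)\n"
    "\n"
    "> An empirical study was conducted to examine the social\n"
    "> psychological processes.\n"
)

# Separate the first paragraph (the header line) from the rest of the text.
header, _, body = note_text.partition("\n\n")

# Decode percent-encoding such as %20 on the small extract only.
readable_header = unquote(header)

# Process the remaining lines one at a time, dropping the "> " quote marker.
lines = [line[2:] if line.startswith("> ") else line for line in body.splitlines()]
clean_body = " ".join(lines)

print(readable_header)
print(clean_body)
```

Because each pattern or string operation only ever sees one small piece, a change to the body cleanup cannot accidentally alter the header, and vice versa.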

satikusala · August 14, 2021, 12:00pm

Yup, I agree, it will take a multi-step process. However, for the life of me I can’t figure out the opposite of .replace: basically, a .copy, which does not seem to exist. My proposed regex selects exactly what I want, the page number and URL, but I can’t seem to get these selected strings into their respective attributes. I can then park the “>” sections in $MyString and run another operation to clean them up for $Text. Does anyone have a clue how to get to a .copy type of function?

eastgate · August 14, 2021, 12:23pm

I agree with Mark Anderson.

First step: extract the paragraph to an attribute $Data. Now we have $Data containing some text:

[Page nn] (URL). ....maybe some extra text.....

Second step: in an agent, parse the block name in square brackets: `(?<=\[)(.*?)(?=\])`, and store the second back reference $2 in the attribute $ExtractedName.

Third step: again in an agent, parse the URL in parens, and store that.

satikusala · August 14, 2021, 12:37pm

Yes, I agree, break it up into steps. Not a problem.

What I’m not clear on is how to “parse the block name.” Furthermore, I’m not clear why the agent is necessary. If you’re running an agent to find the note and then have the agent apply action code, could this action code not be applied just as easily with a stamp?

Next question: what action code would be used to “parse the block name”? I’ve tried various versions of
`$Date=$Text.replace("(?<=\[)(.*?)(?=\])","$2");` to no avail. Clearly, there is something fundamental that I’m missing.
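One way to see why a substitution alone does not behave like a "copy" is to compare it with a search in Python. This is only an analogy: Tinderbox's .replace may differ in detail, and the sample text and names are hypothetical.

```python
import re

# Shortened, hypothetical sample of one highlight block.
text = "[Page 14](highlights://Some%20Title#page=14)\n\n> Quoted passage."

# Substituting only the matched part leaves everything else in place.
partial = re.sub(r"\[Page (\d+)\]", r"\1", text)

# To extract with a substitution, the pattern has to consume the whole string.
whole = re.sub(r"(?s)^\[Page (\d+)\].*$", r"\1", text)

# A search reads the capture group directly and leaves the text untouched.
found = re.search(r"\[Page (\d+)\]", text).group(1)

print(partial)  # 14(highlights://Some%20Title#page=14) followed by the rest
print(whole)    # 14
print(found)    # 14
```

In other words, a replace-based extraction works only when the pattern also matches everything that should be discarded, which is why a match-and-capture operation is the more direct tool here.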

Any exemplary assistance you could provide would be most appreciated.

eastgate · August 14, 2021, 12:41pm

That’s why I’m using the agent.

Query: $ExtractedData.contains("(?<=\[)(.*?)(?=\])")
Action: $ExtractedName=$2;

I think executing .contains in a stamp also binds $n to the back references, but in 9.0.0 I seem to recall that this binding only persists to the end of the statement, not the end of the block. The issue was fairly recent, and was either your report or Mark Anderson’s.

satikusala · August 14, 2021, 1:05pm

Thanks. Here is what I’ve tried:

Here is my result:

The back reference does not appear to be passed.

And, yes, I recall this “.contains in a stamp also binds $n to the back-references, but in 9.0.0 I seem to recall that this binding only persists to the end of the statement” too. Thanks.

satikusala · August 14, 2021, 6:30pm

Just spent some time with @mwra and we devised a way to get the backreference with a stamp.

```
$MyString=$Text.paragraph(0);
$Text=$Text.replace("^.+\)\n\n","");
$Text=$Text.replace("> ","");
$Text=$Text.replace("\n"," ");
if($MyString.contains("^\[Page (\d+)")){$PageNo=$1};
if($MyString.contains("(highlights:[^\)]+)")){$URL=$1};
$URL=$URL.replace("%20","");
$MyString="";
```

The above works.
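For readers who want to test the same logic outside Tinderbox, here is a rough Python analogue of the stamp, written as a function so it can be run over many blocks. The function name, the sample data, and the final `strip()` are my own assumptions and are not part of the stamp; the URL is left percent-encoded here.

```python
import re

def parse_highlight(note_text):
    """Return (page_no, url, text) for one highlight block (hypothetical helper)."""
    # The first paragraph is the "[Page nn](URL)" header; the rest is the quote.
    header, _, body = note_text.partition("\n\n")

    # Test for a match first, then read the capture group, as the stamp does with if(...).
    page = re.search(r"^\[Page (\d+)", header)
    link = re.search(r"\((highlights:[^)]+)\)", header)

    # Drop the quote markers and join the lines with spaces.
    body = body.replace("> ", "").replace("\n", " ").strip()

    return (
        page.group(1) if page else "",
        link.group(1) if link else "",
        body,
    )

sample = (
    "[Page 14](highlights://Some%20Title#page=14)\n\n"
    "> An empirical study was conducted\n"
    "> to examine adoption.\n"
)
page_no, url, text = parse_highlight(sample)
print(page_no, url, text, sep="\n")
```

Returning empty strings when a pattern fails keeps a malformed block from stopping a batch run over a couple hundred notes.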

For context: I have a string with a pre-defined, fixed structure. What I want is to be able to extract a specific, identifiable piece of text with RegEx, e.g. “14” or the URL.

I find the .replace actions a bit confusing, as they are counter-intuitive. I just want to extract what I want, i.e. what I match, whereas .replace requires me to match what I don’t want in order to get what I want. Moreover, until talking to Mark, it would never have occurred to me to use an if statement to get the backreference, which I can then use to populate an attribute. NOTE: I don’t want to use an agent for this because I don’t want the background overhead; these are one-time, on-demand actions.

I think it would be easier if we could have a String.extract(pattern) operator to enable a user to pull a specific piece of the content.
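To make the request concrete, here is a minimal Python sketch of what such an operator might do. The `extract` function, its `default` parameter, and its choice to return the first capture group are all hypothetical; no such operator is described in this thread as existing in Tinderbox.

```python
import re

def extract(text, pattern, default=""):
    """Hypothetical helper: return the first capture group of the first match,
    or the whole match if the pattern has no groups, or default if nothing matches."""
    match = re.search(pattern, text)
    if match is None:
        return default
    value = match.group(1) if match.re.groups else match.group(0)
    return default if value is None else value

# Shortened, hypothetical sample header.
header = "[Page 14](highlights://Some%20Title#page=14)"

page_no = extract(header, r"\[Page (\d+)\]")
url = extract(header, r"highlights:[^)]+")
missing = extract(header, r"\[Chapter (\d+)\]")

print(page_no)  # 14
print(url)      # highlights://Some%20Title#page=14
```

The design choice worth noting is the explicit default: an extraction that finds nothing returns an empty string rather than raising an error, so it can be assigned straight to an attribute.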

I understand that, from an engineering perspective, this might seem like syntactic sugar, but given Bruce’s comments today, this more straightforward approach may help users accomplish their goals.