Counting citations

eastgate · June 24, 2025, 5:23pm

Here’s a problem that the community might enjoy. I have a container /draft in which I’m gradually writing a paper. Citations in the paper look like this:

{Bernstein, 2025, WWW ’25 Companion Proceedings of the ACM Web Conference, 829-832}

I want to define an agent that tells me how many citations are currently in the paper. Details:

It’s OK to count repeated citations to the same work twice
Remember that one note might have multiple citations
It’s OK to assume that there are few or no other uses of “{”

In other words, “quick and dirty” is fine.

satikusala · June 24, 2025, 6:02pm

I’m tied up today, so can’t play with your pattern until the weekend, but here is a quick reply.

I use a different format for citations, as I use markdown and Pandoc to help me insert, track, and format citations. Here is the inline pattern [@CitationYEAR,@CITATIONYEAR] or @CitationYEAR [-@CitationYear]. I’ve set up several regex methods to parse these patterns out of the text into $IndexCitationKey:

vText.eachLine(aLine){
  if(aLine.icontains('(@[\w]+[0-9]{4})')){
      $IndexCitationKey+=$Text.extractAll("(@[\w]+[0-9]{4})");
  };
};

This gives me a count. I also use it pull citations into an agent, and then with my indexing libraries and templates, I can create an index for which sections of the document the Citation is used (I intend to publish this library by the start of SummerFest).

For your pattern, I would do something similar but use this regex {(.*)}.

There are lots of assumptions and questions. Are citations the only use of the curly brackets pattern? Do you want or need to parse the caution when captured? How do you want to report on the use of the citations? Where are you conducting the count? In a template? Attribute browser? Table?

mwra · June 24, 2025, 6:43pm

Likewise, just heading out for a fe days. Given @satikusala’s point about false matches, e.g. a mentions of coordinates at {2,9} or such, I’d suggest a regex like:

"\{[^\d]+, *\d{4},[^\}]+}"

So:

literal open brace
one or more non-digits (assumes no literal digits in author names, post-nominal III does not use digits)
a literal comma
zero or more spaces
4 digits (assumes cited years are post 1000 (CE))
one or more non-literal-closing-braces characters
literal closing brace

eastgate · June 24, 2025, 7:55pm

I believe each of these will detect the number of paragraphs that contain a citation. Ideally, I’d prefer to count the total number of citations; {Bernstein}{Anderson} should count as two.

Yes: you can assume this is the only use of {…}.

satikusala · June 24, 2025, 10:11pm

My method pulls the citation into a set, which you can then count. It also indexes them so that you can see if you’ve used them multiple times throughout your document.

brookter · June 25, 2025, 9:32am

Forgive me if I’ve misunderstood what’s going on in your function, but AIUI, the regex \{(.*)\}, will normally match everything between the first { and the last one, so it returns Bernstein}{Anderson in this situation.

Wouldn’t \{(\w*)\} match better for individual citations – as long as the citations are only single words?

Or does your function cope with this already (I’m not really experienced with using extract…)

If not, could you post-process Bernstein}{Anderson with something like:

$IndexCitationKey=$IndexCitationKey.replace("\}\{",";"})

Also, AIUI Mark wants to count duplicate citations individually, so wouldn’t a list be more appropriate than a set here?

(I’m just trying to understand the problem and am not in a position to test, so I offer these comments with due diffidence…)

eastgate · June 25, 2025, 2:41pm

List:set— this is an excellent point! It gives me flexibility to measure either (a) the total number of citations, or (b) the length of the References section. Both would be worth knowing.

Alternatives: If I wanted to avoid regular expressions and was willing to assume that every { starts a citation, what would be the easiest way?

satikusala · June 25, 2025, 4:35pm

Per my quick post, this pattern would catch everything. @mwra provided more granularity.

brookter · June 25, 2025, 5:54pm

Yes, that was my point – your version will pick up too much in the given scenario, if there is more than one citation per paragraph. I wondered if your function dealt with this problem in some way – e.g. does the use of extract get over the limitation of your regex?

eastgate · June 25, 2025, 6:33pm

I missed this too! The if() merely tests whether there might by any citations, since if there are none we need do no work at all. The extract makes a list of all matches,

In some situations, you might not need of if() at all. If there’s nothing to extract, then extract() returns an empty list.

satikusala · June 25, 2025, 8:03pm

I want all the citations in a paragraph captured, not just one. They automatically get dedupped in the $CitationKey set.

brookter · June 25, 2025, 8:16pm

Indeed – that was the point of my original post!

Unless I have misunderstood something about TBX’s version of regex, \{(.*)\} does not work with multiple citations in a paragraph, because the regex is too greedy – it slurps all the citations into one. Please see the screenshots to see what I mean: {Bernstein}{Anderson} returns one citation (Bernstein}{Anderson) not two.

So, you either need to change the regex to be less greedy, or do some post-processing to separate the ‘too-big’ capture (with the .replace command I suggested, perhaps).

Which is why I asked you if the extract command somehow manages to overcome the limitation of the too-greedy regex so it extracts all the citations in the paragraph, not just subsumes them all into one.

mwra · June 29, 2025, 11:05am

Yes, this is why \{[^\}]+\} is a clearer expression of intent. Inside the matched {} we look for one or more (+) characters that are not a closing brace. One or more not zero or more (*) to avoid mis-matching a legitimate empty set of braces ({}) as a citation marker. This approach is flexible at is doesn’t care where in the string the citations occur or if there is more than one per line (or tested string)

It may the edge case that (depending on the subject matter) that the subject matter has braces containing content. In that case, the braces alone are not a clear enough marker. If you have (and apply!) a very rigid format for the string describing the quote you could write a more complex regex to capture that structure. But, if {author, yyyy} is the pattern and Prof. A. B. Clump has 3 papers published in 2001, you’ll be tweaking your regex. and so on for each edge case that breaks the existing one.

Simpler, if braces-with-content sequences are an insufficient marker, would be to place another marker either immediately before or after the citation. For instance, with an @ symbol using either of these:

{@ ... }
@{ ... }

Users of Markdown might need to take care as to the character chosen and its placement, so avoid sequences that signal (or no longer do) intent to the Markdown parser.

Even if using a complex export chain e.g. Tinderbox+Markdown → pandoc → Word, or such, do stop to consider is inserted extra markers can be deleted before or at export.

IOW, the constraint isn’t what it appears, but you can choose when (necessary) extra markers are needed. Also, there isn’t a button /menu option that ‘just’ does this, in no small part due to all the (unstated) assumptions in this process.