Find co-occurring set values across notes

mwra · April 21, 2017, 5:15pm

Q1. So you’ve reviewed 500 papers and added multiple values (tags, keywords, whatever) to each item, indeed possibly in several different attributes covering different aspects of the work. Now you want to get a feel for which terms cluster together. Whilst most individual note attributes may only have a few values, the number of unique values across all notes may be large, so the agent-per-value approach doesn’t work. It’s thus useful to see how the values cluster - or don’t. Do ‘winken’ and ‘blinken’ turn up together (co-occur) a lot? Odd how ‘blinken’ and ‘nod’ never co-occur.
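To make "which values co-occur" concrete, here is a minimal sketch in Python (not action code) that counts how often each pair of values appears on the same note. The note names and tag sets are made-up sample data, and `cooccurrence_counts` is a hypothetical helper name, not anything built into the app.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample data: note name -> set of values from one attribute.
notes = {
    "Paper A": {"winken", "blinken"},
    "Paper B": {"winken", "blinken"},
    "Paper C": {"winken", "nod"},
    "Paper D": {"nod"},
}

def cooccurrence_counts(notes):
    """Count, for every unordered pair of values, the notes holding both."""
    counts = Counter()
    for values in notes.values():
        # Sorting gives each pair one canonical order, e.g. ("blinken", "winken").
        for pair in combinations(sorted(values), 2):
            counts[pair] += 1
    return counts

counts = cooccurrence_counts(notes)
for (a, b), n in counts.most_common():
    print(f"{a} + {b}: {n}")
```

Pairs that never share a note simply don't appear in the result, which is how a 'blinken'/'nod' gap would show up.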

In your example (b) I’d imagine title and counties to be separate attributes’ values, though you could merge them for this purpose. TL;DR … “which tags co-occur the most?”. I think this is a feature you’ll go and find because you have the need, rather than one you’ll create data to use.

On the viz front, it did occur to me that if (aliases of) notes were placed on a map as small shapes ($Height/$Width are intrinsic) with the title as the $HE, then you could plot clusters. Not tested, but I’ve a sense treemap could also use co-occurrence data to show things.

Q2. I don’t see why not, though I think binning would be needed. An opportunity to feature request something that AB view also needs - the ability to set bin size and/or open max/min (i.e. an above X and below Y bin at each end of the range). In the case of this analysis, you might want to exclude values outside max/min rather than create a bin for them.
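As a rough illustration of the binning idea (fixed bin size, plus either open-ended bins at each end or dropping out-of-range values), a sketch in Python might look like this. The function name `bin_label` and its parameters are my own invention for illustration; nothing like it exists in the app as far as this post goes.

```python
def bin_label(value, low, high, size, open_ends=True):
    """Return a label for the bin holding `value`.

    Values below `low` or at/above `high` go into open-ended bins when
    open_ends is True; otherwise None is returned so they can be dropped.
    """
    if value < low:
        return f"< {low}" if open_ends else None
    if value >= high:
        return f">= {high}" if open_ends else None
    # Snap down to the start of the fixed-size bin containing the value.
    start = low + ((value - low) // size) * size
    end = min(start + size, high)
    return f"{start}-{end}"

# Hypothetical co-occurrence counts binned in fives between 0 and 20.
for n in (-3, 0, 7, 19, 20, 42):
    print(n, "->", bin_label(n, low=0, high=20, size=5))
```

Passing `open_ends=False` gives the "leave out anything outside max/min" variant instead of creating end bins.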

Q3. Ouch, if it is to be ‘cooccur’, please let’s be consistent in name style: .coOccur() - I see no upside for the user if we arbitrarily move away from the action code style of interCapitalisation for operators. (I don’t mean this in a snarky fashion - lest it read that way).

Q4. At this point I think we should be punting to R (or SPSS for the $$$ folk) via export. Here, as in a number of other cases (none come to mind as I type), I think the best approach is export or a round-trip via command line. IOW, the action, if any, might be to look at any changes to run-command that might be needed, including parsing of the returned values.
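For the round-trip idea, the external side could be as small as the Python sketch below. It assumes (purely for illustration) an export with one line per note, the note name and a semicolon-delimited value list separated by a tab, and it hands back a single semicolon-delimited string that would be easy to parse on return. The line format, the `a+b=n` result format and the function names are all assumptions, not an existing export template or run-command convention.

```python
from collections import Counter
from itertools import combinations

def parse_export(text):
    """Parse 'name<TAB>val1;val2;...' lines into {name: set of values}."""
    notes = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines in the export
        name, _, raw = line.partition("\t")
        notes[name] = {v.strip() for v in raw.split(";") if v.strip()}
    return notes

def summarise(text):
    """Return pair counts as one 'a+b=n;c+d=m' string, most frequent first."""
    counts = Counter()
    for values in parse_export(text).values():
        counts.update(combinations(sorted(values), 2))
    return ";".join(f"{a}+{b}={n}" for (a, b), n in counts.most_common())

# Hypothetical exported text for three notes.
sample = "A\twinken;blinken\nB\tblinken; winken\nC\twinken;nod\n"
print(summarise(sample))
```

Anything heavier (clustering, significance tests) is where R or similar would take over from a script like this.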

Not sure if that helps any…