Find co-occuring set values across notes

mwra · April 18, 2017, 5:52pm

EDIT: the link here now points to a revised version of the file that uses a stamp rather than a rule for reasons described in the thread “Annotated co-occurrence example code”.

Building off @PaulWalters’ demo “Open Code Example with Export.tbx”, I offer “OpenCode Example with co-occurence.tbx” (zipped TBX) with two extra agents. One uses the unique sorted string** of all code values for a given note and finds occurrence. this means shorter strings are detected within longer ones as both use the same sort order in construction.

The second does what I think Derek’s after. It finds all co-occurrence of value pairs (where AB is considered the same as BA) regardless of the number of $OpenCode values per note; notes with zero or a single value are ignored as no pair exists.

I’m sure there are edge cases and I’ve not even tried to extend to triples or quartets, etc. I hope this puts us all back on the same page again. I’ve only tested o the example in the TBX I used, I suppsect with 100 discrete $Opencode values performance might be less snappy!

** I fixed one error in the original TBX as when casting $OpenCode to $CodeCollector, .isort wants to be chained before .format() so it is the sorted list that gets formatted rather than a single formatted string (that can’t be sorted!).

JFallows · April 18, 2017, 6:31pm

Hi Mark A – Was curious to give this a look, but I get a 404 error at the link. Here is the way the link is registering for me (when I do the right-click “Copy Link Address” command):

“http://www.acrobatfaq.com/tbdemos/OpenCode_Example_with_co-occurrence.zip”

For whatever reason that isn’t working for me.

mwra · April 18, 2017, 6:47pm

Fixed the URL. Sorry about that.

JFallows · April 18, 2017, 8:29pm

Thanks! This is a level of analytical complexity that is beyond my own real-world needs, but it’s interesting to see how everyone here has approached it.

mwra · April 18, 2017, 8:51pm

The nice aspect of this general approach (aside from actual method of implementation) is that it nurtures incremental formalisation, avoiding the need to over categorise to single topics when the ‘best’ (most appropriate) topics may not be self-evident. The data from the above pair experiment came out as:

blue + green: 10
blue + orange: 6
blue + purple: 2
blue + red: 5
blue + white: 6
green + orange: 6
green + purple: 2
green + red: 4
green + white: 5
orange + purple: 4
orange + red: 7
orange + white: 3
purple + red: 4
purple + white: 2
red + white: 3

The following was mocked up manually, using the co-occurrence number for both treemap cell area and colour-shading (but I’m not visual designer!). Still, this feels as though it points to one way of looking at this data within Tinderbox - it’s scaled by 50% here just so as not to be to big in a post:

Edit: this additional stuff is not in the file posted earlier.

PaulWalters · April 18, 2017, 9:06pm

Hmmm. I’m not sure what that means. (I don’t know what “topic” refers to.)

Are you suggesting there’s no need for the .cooccur() action that’s been discussed above? I think there’s still a very good case for adding that new action.

mwra · April 19, 2017, 6:32am

By ‘topic’ I was referring to the values (or tags, keywords, metadata - pick one of many common terms) used in the co-occurrence evaluation. In this particular, case the attribute values set for the user attribute $OpenCode.

Err, No. As the treemap is of the co-occurrence output, the exercise above would otherwise be nugatory.

Rather, I re-iterate, that having a means to investigate (with ease) co-occurrence lessens the pressure to over-reduce the number of topics (attribute values) during early investigation of data. Currently, it’s easier to analyse single-value data, some form of co-occurrence reporting would broaden the scope of choice.

derekvan · April 21, 2017, 1:12pm

I’m just now having the chance to check this out. I plugged in my real notes into the sample file Mark provided and it did present the pair co-occurences as requested. So, thanks Mark!

However, I’m not sure what the path forward is. I don’t think I’m going to be able to understand all the working parts to the agents in this file, so it makes a bit nervous to depend on it through successive TBX versions. Also, it’s not clear to me how to manipulate the results (the co-occurence all pairs example isn’t sorted by number, which might be nice, but I wouldn’t know how to do it). Also, I’m getting pretty frequent crashes with the file. So, it seems to me that asking @eastgate to implement the features discussed above is still worthwhile.

(also: I’m not sure I understand the differences between the “co-occurences all codes” and “co-occurences all pairs” agents. The “pairs” agent has lots and lots of hits in my real data. Yet the “all codes” agent has very few. I would expect that the “all codes” agent would return all pairs, plus triplets and others. Yet it doesn’t seem to contain any pairs.)

PaulWalters · April 21, 2017, 1:21pm

I agree.

I suppose the question at this time is for @eastgate – whether the dialog in this thread, above, is persuasive and provides enough sense of what the .cooccur function might do. Or if @eastgate believes the concept needs further explication and Mark B has questions for us to dialog on.

I believe all users participating here recognize that this is not a simple feature, the engineering needed to implement it is not clear to us (but is likely significant), the time-to-market is unknown (to us), and the base of potential users is not large. However, for researchers, especially qualitative researchers, the feature could be well received as an interesting addition to the Tinderbox feature set.

mwra · April 21, 2017, 2:35pm

In case it helps, I’ve broken out the code used in my demo up-thread, clarified it slightly and posted it in a separate thread to aid discussion without causing thread drift here.

I concur with @PaulWalters’ last. I hope @eastgate has enough to consider whether something is feasible in an action operator wrapper. As my demo code, doing un-ordered pairs is do-able (but it’s not beginner’s code). going to triple and beyond is likely not feasible, as action code, but might be if done directly in the app. I’ve assume an output as text but perhaps there might be a way to use existing (new?) views to also help visualise the result.

eastgate · April 21, 2017, 4:47pm

THINGS I’D STILL LIKE TO KNOW

I understand why one might like to know co-occurence distributions, but I’m not confident I know how to explain this to the general audience. Examples I’ve thought about include things like (a) how well do student grades on the midterm align with the results of their final project? (b) which titles in our backlist are unusually popular in specific countries? © if we look at a mayor’s constituent email logs, are some topics unusually prominent from specific wards? But these aren’t as convincing as I’d like; better (and real-life) examples would help.
We’ve been talking about list and set variables; do we also need to do this with numeric variables? If so, is it sufficient to adopt a “bin” strategy like Attribute Browser?
Is this ideally an operator (.coocur() ) or a view – like Attribute Browser?
The next question will be, “is the cooccurrence distribution random?” I think that’s Student’s t-test, but it’s been ages since I’ve used that in anger myself. I could use a hand on that.

mwra · April 21, 2017, 5:15pm

Q1. So you’ve reviewed 500 papers and added added multiple values (tags, keywords, whatever) to each item, indeed possibly several different attributes on different aspects of the work. now, you want to get a feel for which terms cluster together. Whilst most individual note attributes may only have a few values the unique values across all notes maybe large so the agent-per-value approach doesn’t work. It’s thus useful to see how the values cluster - or don’t. Do ‘winken’ and ‘blinked’ turn up together (co-occur) a lot? Odd how ‘blinken’ and ‘nod’ never co-occur.

In you example (b) I’d imagine title and counties to be separate attributes’ values, though you could merge them for this purpose. TL;DR … “which tags co-occur the most?”. I think this is a feature you’ll go find because you have the need rather than one you’ll create data to use.

On the viz front, it did occur to me that if (aliases of) notes were placed on on a map as small shapes ($Height/$Width are intrinsic) with title as the $HE then you could plot clusters. Not tested, but I’ve a sense treemap could also use a co-occurence data to show things.

Q2. I don’t see why not, though I think binning would be needed. An opportunity to feature request something that AB view also needs - ability to set bin size and/or open max/min (i.e. an above X and below Y bin at each end of the range). In the case of this analysis, you might want to outside max/min rather than create a bin.

Q3. Ouch, if it is to be ‘cooccur’ please lets be consistent in name style .coOccur() - I see no upside for the user if we arbitrarily move away from the action code style of interCapitalisation for operators. (I don’t mean this in a snarky fashion - lest it read that way).

Q4. At this point I think we should be punting to R (or SPSS for the $$$ folk) via export. Here as in a number (not to mind as I type) of cases I think the best approach is export or a round-trip via command line. IOW, action if any might be to look as any changes to run-command that might be needed, included parsing of the returned values.

Not sure if that helps any…

PaulWalters · April 21, 2017, 5:30pm

I believe it is an operator? Not a view.

The name in this thread has just been colloquial and inconsistent; not a firm proposal. I imagine @eastgate would figure out the best term that fits into the existing taxonomy.

derekvan · April 21, 2017, 5:46pm

More examples:

Lots of folks using TBX for task management. Let’s say you want to see a report of all your tasks tags to see where you’ve been spending all your time. Or maybe you’ve fed Tinderbox your time-tracking data and want similar analysis.

Let’s say you’ve got notes with some kind of geographic data in them (cities where X events are happening, or states with X health care features). You could run this to see the various connections.

More broadly, I’d say this is the kind of feature that maybe brings in new users. People who want to run co-occurences know who they are and may find Tinderbox’s implementation more straightforward and manageable than learning some stat package (that’s my case!). And, then they see TBX does so much other cool stuff, they’re sold. For me, adding this tool makes Tinderbox nearly a complete qualitative data analysis package–it would have everything I’ve typically sought. Lots of folks have been interested in that video whose author escapes me now, where he interviewed some folks, broke up the transcripts in Tinderbox, then “tagged” them by dropping them onto adornments in various ways. One logical next step is to find co-occurences in tags …

As for view vs. operator, I could see it working either way. The operator is straightforward, but with the right implementation, I could see AB being the best way forward, as long as it was flexible enough to provide the right results binned in the right ways. If the bins, for example, were pairs of tags with numerical counts, this would be great as it would give me immediate access to the notes themselves within the bins.

mwra · April 21, 2017, 6:37pm

Interesting. If AB view could use an action code filter (that’s not an agent query or rule) that would be cool. A container (or agent) could scope the notes to consider but we’d need a way to configure the co-occurrence criteria but given that existing AB view feature could cope with visualising things.

I’d also the echo the point about qualitative tools. Please spare me NVIVO! For all the some-assembly-required element of aspect of Tinderbox it beats the you-are-feeding-a-predefined-database approach of so many tools.

Premature formality considered harmful…

PaulWalters · April 21, 2017, 7:12pm

Can we say “sets of tags” and not limit the analysis to pairs only? (Depending on input parameters.) I’m interested, for example, in examining co-occurrence of 3+ tags or identifiers in project plans.

derekvan · April 21, 2017, 8:36pm

Right. Good catch. .