Find co-occurring set values across notes

I’d love to hear how this might work in Tinderbox – mocked up with any extensions that would help.

Though not at the scale of @derekvan’s data, it did occur to me one might use the agent method above in a map using a smart adornment. However, you would end up with a rim of query-rejected items if it were used successively with a number of different queries.

A while back, in the old forum, in a thread on a similar situation mentioned by @derekvan, Mark B. had some useful suggestions for Agent queries. Now that we have Attribute Browser, the suggested queries can be tried on the fly in an AB tab. This doesn’t solve the request, but inspection of the AB results is helpful. At least, this is what I see in a test dataset that I created and that I think emulates the original, based on the descriptions above.

I get a little closer to a solution if my document has a string attribute $CodeCollector and a rule

$CodeCollector=$OpenCode.format(" ").isort()

and my attribute browser is focused on $CodeCollector. So AB aggregates notes by strings such as blue;green;red or green;orange;red, and so forth. This makes inspecting the results visually easier. Still not a perfect solution, but more tractable.
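
As a rough illustration of what the $CodeCollector grouping does, here is a minimal Python sketch (not Tinderbox action code). The note names and code values are made-up sample data, and `code_key` is a hypothetical helper. The one design point is that the values are sorted before being joined, so notes carrying the same codes in a different order land under the same string.

```python
from collections import defaultdict

# Hypothetical sample notes: name -> code values (a stand-in for $OpenCode).
notes = {
    "Note1": {"red", "blue", "green"},
    "Note2": {"green", "red", "blue"},
    "Note3": {"orange", "green", "red"},
}


def code_key(codes):
    """Sort case-insensitively, then join with ';' into a single string."""
    return ";".join(sorted(codes, key=str.lower))


# Group note names under their collected-code string, as the browser would.
groups = defaultdict(list)
for name, codes in notes.items():
    groups[code_key(codes)].append(name)

for key, names in sorted(groups.items()):
    print(key, "->", names)
```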

(FWIW, the whole analysis process using AB becomes simpler if instead of $OpenCode I use a boolean attribute for each code value, e.g., $Red, $Blue, $Green, and so forth. More on that, later.)

Here is an example document that uses two approaches:

  1. The first approach uses the $CodeCollector attribute discussed above in an Attribute Browser.
  2. The second approach requires
    ~ a boolean attribute for each value of $OpenCode (I assume that the sample values given in the original post are the correct universe of values)
    ~ a stamp, SetColors, to load up these boolean attributes with the correct true/false setting (plus a secondary stamp, NullColors, that’s useful for recovering from errors)
    ~ a couple of sample stamps that set $AgentQuery to look for relevant duals among the booleans. What I did was create a single agent and then apply the agent stamps to that agent to toggle the report. This is to avoid having a large number of agents. For some reason, even though this is a very small document, the execution of the agent seems to take a few seconds. I have no explanation for that.

The rest of the document should be self-explanatory. Notice the $DisplayExpression in the agent is used for counting. Sorry for the crudeness of the document. I did this to sort out a rough approach to the original problem. I’ll leave it for others to make it sophisticated, if the suggested approach merits further attention.

Thanks for this example. Although I don’t do anything exactly like this, I found the workings of the $CodeCollector attribute to be useful as a suggestion for other tasks I was trying to figure out. (For aggregating values in a set attribute.)

Specifics like this are always more useful than the originator might imagine, because of possibilities they suggest to others. Thanks!

Mark, I’m really not sure how this might work. I was mostly hoping Tinderbox had the magic already to read my mind (it has been so useful in my other efforts at qualitative analysis, and typically there has already been a feature that I can appropriate). :slight_smile:

@PaulWalters’ example is pretty cool – I especially like the outline tab where you can see via check marks which codes are applied to any note. This is similar to a view I sometimes try to use in Excel.

Ultimately, I guess I was imagining there could be some kind of dot operator for “collect” or “list” that would produce a record of co-occurrences across set/list values. So, the user would write a collect expression like so:

collect.cooccur(children(/QualData),$OpenCodes)

which would return this text:

blue;green, 4
red;blue, 2
blue;orange, 2

Now that I write this, I see this is pretty different from the way most Tinderbox operators work in terms of returning values, but it doesn’t seem completely out of the realm of possibility (from my user perspective at least–I have no idea what technical issues might prevent such an example).

Given this list, I could then choose to make agents to see the exact notes where they occur, or turn the values into some kind of set to view in Attribute browser or something.
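
To make the requested output concrete, here is a minimal Python sketch of the pair counting (Python, not Tinderbox action code, and not a proposal for how Tinderbox would implement it). The five notes are hypothetical sample data picked so that the counts match the example output above, and `pair_counts` is a made-up name. Each pair is sorted before counting, so red;blue and blue;red are treated as the same pair.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample data: note name -> set of $OpenCodes-style values.
notes = {
    "Note1": {"red", "blue", "green"},
    "Note2": {"blue", "green", "purple"},
    "Note3": {"blue", "green", "orange"},
    "Note4": {"blue", "green", "white"},
    "Note5": {"red", "blue", "orange"},
}


def pair_counts(notes):
    """Count how many notes contain each unordered pair of values."""
    counts = Counter()
    for codes in notes.values():
        # Sorting first gives every pair one canonical order.
        for pair in combinations(sorted(codes), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(notes)
# Print the three most frequent pairs in a "value;value, count" layout.
for (a, b), n in counts.most_common(3):
    print(f"{a};{b}, {n}")
```

Notes with zero or one value contribute nothing, since no pair can be formed from them.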

I’m sorry Mark B, I know this is pretty half-baked! This is the first time I’ve really been trying to accomplish this kind of analysis and Tinderbox is the first place I looked. After I have some MySQL experience trying to do the same thing, I might have better ideas of what I’m looking for and how it could be implemented in TBX.

Hmmm. Good idea. Here’s a modified version of the test file that now includes semicolon-delimited “csv” exporting. Bring the export into Excel; convert text to columns; use data filtering to narrow down the cooccurrences. More work, but just another approach.

OK. How does Tinderbox know what cooccurrence means? Always just contiguous duals? So in {a,b,c} only {a,b} and {b,c} are valid but {a,c} is not? I don’t think the definition of “cooccurrence” in general is axiomatic – there would be some need (I think) to give Tinderbox a parameter in the collect.cooccur operator to define which pairs in a set count and which do not.

Perhaps

collect.cooccur(group,attribute,[contiguous],[constituents])

where group and attribute have the normal meaning in collect, [contiguous] is an optional true or false, and [constituents] is an optional integer of 2 or greater that indicates whether we are looking for pairs (constituents == 2) or for sets comprising more than 2 elements. When not provided, contiguous defaults to true and constituents to 2.

I think we also need a .count() operator to get the count of the cooccurrences.

collect.cooccur(group,attribute,[contiguous],[constituents]) + ", " + collect.cooccur(group,attribute,[contiguous],[constituents]).count()

would provide the text in @derekvan’s latest example ( blue;green, 4).
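
Here is one way the two optional parameters might behave, sketched in Python rather than Tinderbox action code. Everything in it is an assumption for discussion: the function name `cooccur`, the reading of “contiguous” as adjacent items in an ordered list, and the choice to count each grouping at most once per note. With contiguous on, order is kept as found (so a;b and b;a stay distinct); with it off, each grouping is sorted so order is ignored.

```python
from collections import Counter
from itertools import combinations


def cooccur(value_lists, contiguous=True, constituents=2):
    """Count groupings of `constituents` values across ordered value lists."""
    if constituents < 2:
        raise ValueError("constituents must be 2 or greater")
    counts = Counter()
    for values in value_lists:
        if contiguous:
            # Sliding window over adjacent items; order is preserved.
            last_start = len(values) - constituents
            groups = {
                tuple(values[i:i + constituents])
                for i in range(last_start + 1)
            }
        else:
            # Any combination of distinct values; sorted so order is ignored.
            groups = {
                tuple(sorted(group))
                for group in combinations(set(values), constituents)
            }
        # Each grouping is counted once per list (i.e. once per note).
        counts.update(groups)
    return counts


example = [["a", "b", "c"]]
print(cooccur(example))                                     # adjacent pairs
print(cooccur(example, contiguous=False))                   # all pairs
print(cooccur(example, contiguous=False, constituents=3))   # triplets
```

For {a,b,c} the default call yields only (a,b) and (b,c); (a,c) appears only when contiguous is turned off.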

Problem. I think collect does not normally operate on sets? Not sure.

[ FWIW I’d certainly note myself as having an interest here as investigating the inter-relationship of notes’ metadata is something I find myself attempting quite often.]

A loose end here is whether we are looking only at pairs, or at all groupings of 2 or more values. The size of the contingency table can be an issue: with 100 values that’s thousands of possible pair values. However, for much smaller value sets, another idea that occurred to me is to use treemap view, as this gives the opportunity to use both size and $Color to help visualise relationships without having to leave Tinderbox.
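
To put numbers on the size concern, the upper bound on distinct groupings is just a binomial coefficient. A small Python sketch, with hypothetical function names (real data will usually contain far fewer groupings than the bound, since most combinations never occur):

```python
from math import comb


def max_groups(n_values, size=2):
    """Upper bound on distinct unordered groupings of `size` values."""
    return comb(n_values, size)


def max_groups_up_to(n_values, max_size):
    """Upper bound when every grouping size from 2 to max_size is wanted."""
    return sum(comb(n_values, k) for k in range(2, max_size + 1))


print(max_groups(100))      # 4950 possible pairs
print(max_groups(100, 3))   # 161700 possible triplets
```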

edit: sorry, looks like I’m on the same point as the end of the preceding post, with which I concur.

We’re quite willing to add support for this, if we arrive at a consensus for some reasonably straightforward solution. It’s a busy time, though, so I don’t think I can follow this in detail.

Let me know when the committee’s recommendations are ready!

@PaulWalters collect() and collect_if() can work on any list. Originally the operators returned a Set-Type and since (v5?) they have returned a List-type. Tinderbox offers several methods to de-dupe (and sort) a List so the difference is not onerous.

I’m wondering what the use case is for ‘contiguous’. Is this for tracking sequence? Whilst exploring the envelope I did wonder if the order (direction) of pairs matters. It generally doesn’t in TB, but some people’s use of a List (which doesn’t re-sort on its own) can retain pairs A;B and B;A, which might – to them – imply some difference. I raise this even if only so that we expressly strike it out as something for which to cater.

I’m not sure a coOccur() needs to chain on collect(), as if it takes a group input it is scoping in its own right. Thus (whatever the eventual name) I’d see this operator chaining off a list (i.e. List or Set). A coOccur_if() probably also makes sense to allow some scoping tweaks. Thus:

list.coOccur(group,attribute[,constituents][,contiguous])
list.coOccur_if(group,attribute,condition[,constituents][,contiguous])

I’ve reversed the last two inputs, as the last seems least likely to be used and optional inputs require all preceding (optional) inputs to be supplied.

Another issue is guessing the upper bound (constituents) value if one wants all N>1 configurations. I assume max($SomeMultiValueAttr.size) might suffice.

The bit I’m stuck on is how the app degrades gracefully when an unknowing user gives it 5k notes with 100 values for $SomeMultiValueAttr and about 20 different size configurations, even on a fast Mac.

The case is laid out here

If the “standard” use of .cooccur is to only look for pairs, and only look for pairs that are next to one another in the set, then the optional parameters ,[contiguous],[constituents] would not be needed.

If no one would ever think of using the feature except to look for contiguous pairs in a set, then my proposal can be scrapped. I’m suggesting it might be desirable to anticipate that other users might have a different (and valid) case to search for cooccurrences in sets where contiguous pairs are too limiting, and thus need the non-default settings provided by the optional parameters ,[contiguous],[constituents]

I like this proposal a good deal. I don’t think I would ever use the “contiguous” function, so my preference would be for it to be “off” by default (the sequence of values doesn’t matter in my usage, only that certain values appear together in whatever order). I really like the “constituents” flag, as I could see using that to narrow analysis in possibly interesting ways.

As for @mwra’s concern about degrading gracefully if too many notes are provided, I suppose there might be some error message (similar to the messages that pop up when Agent queries are running amiss) that says the coOccur is going to cause problems and needs to be scoped differently.

I also agree with MarkA about chaining off “list” or “set” instead of “collect.” This makes sense to me.

@PaulWalters thanks. I wasn’t disparaging the idea. I’m with you in being open to the ideas/needs of others, but (a) I failed to think of an actual use case (which is simply a failure of my imagination and not of the use case) and (b) it would require working only off List-type data**, as Set-type data cannot be assumed to return a consistent sort, which underpins the sense of being contiguous.

If implementing this (and if supportable at likely scale of use) I’d agree matched value pairs might be a sensible default but for analysis being able to extend to matched N-value sets might be of interest. This is aptly informed by a remembrance of a number of recent research tasks where I ended up painfully grinding multiple values down to a (pre-eminent) single one in order to do further analysis.

** noted that collect() returns a list so your chained idea makes more sense on reflection.

So, I think we’re actually on the same page (bar fine detail on input sort order!).

The other part of this is what .coOccurrence() would emit. Although not a formal data type, TB now has look-up lists and perhaps that might be an outcome: a list of “value set:count” items, e.g. “red blue:2;blue green:4”. However, I’m not sure what the delimiter used in the value set part would be, as both space and underscore might form part of actual attribute values.
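
The delimiter worry can be made concrete with a small Python sketch (again, not Tinderbox code). The helper names `to_lookup` and `from_lookup` are hypothetical, as is the idea of passing the value separator in as a parameter and refusing any value that contains a reserved character, rather than guessing one separator that is always safe.

```python
def to_lookup(counts, value_sep=" "):
    """Format {(value, ...): count} as 'v1 v2:count;v3 v4:count'."""
    items = []
    for values, n in counts.items():
        for value in values:
            # Refuse values that would make the string ambiguous to parse.
            if value_sep in value or ":" in value or ";" in value:
                raise ValueError(f"value {value!r} contains a delimiter")
        items.append(f"{value_sep.join(values)}:{n}")
    return ";".join(items)


def from_lookup(text, value_sep=" "):
    """Parse the string produced by to_lookup back into a dict."""
    result = {}
    if not text:
        return result
    for item in text.split(";"):
        key, _, n = item.rpartition(":")
        result[tuple(key.split(value_sep))] = int(n)
    return result


counts = {("red", "blue"): 2, ("blue", "green"): 4}
text = to_lookup(counts)
print(text)
print(from_lookup(text) == counts)
```

A value such as “light blue” is rejected with the default space separator but passes if the caller picks a separator that appears in no value, for example `value_sep="|"`.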

Are we thinking of .cooccurrence() or .cooccur()? I’ve been using the original formulation that @derekvan proposed, but either name for the function works for me.

Output. If it output a .plist it would be a key/value pair. Can a structure like that exist in Tinderbox? I understand the issue about the delimiter. Perhaps \n?

As to the operator name, for my 2¢ I’d suggest the shortest compound variant that makes sense (less typing!) and using internal camel-case so as to follow the style of other compound action names. Thus list.coOccur() would seem a choice but in truth the inputs/outputs are of more interest.

As to output, unless Tinderbox were to add some new form of view (or in-map visualisation) it might be best to pump out something that can easily be consumed by an open system such as R (as well as proprietary ones like Office Excel). I may be wrong, but I think both those apps (and similar) would most easily ingest a table with notes ($Name or some UID) on one axis and all the discrete values of the analysed attribute on the other. Working this notion further with example data from up-thread, I’d expect a table like this:

Name    red  blue   green  purple  orange  white
Note1    1     1      1      0       0       0
Note2    0     1      1      1       0       0
Note3    0     1      1      0       1       0
Note4    0     1      1      0       0       1
Note5    1     1      0      0       1       0

One might add a per-row value count but I suspect that, if exporting to another app for process, the value count might more easily be created after ingest into the other app.

The above data table might appear easily created by exporting an agent with the header line as the $Text and the data rows via ^children^. However, unlike the above example, for more than a few values of $OpenCodes (or whatever multi-value source) iterating through the values might be a complex task for many users so an action or export code to do this would help.
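
For discussion, here is what such an export step would need to do, sketched in Python rather than as Tinderbox export code. The note data mirrors the table above; `indicator_table` is a hypothetical name, and the tab separator and first-seen column order are my assumptions (first-seen order happens to reproduce the column order shown).

```python
# Sample data matching the table above: note name -> list of code values.
notes = {
    "Note1": ["red", "blue", "green"],
    "Note2": ["blue", "green", "purple"],
    "Note3": ["blue", "green", "orange"],
    "Note4": ["blue", "green", "white"],
    "Note5": ["red", "blue", "orange"],
}


def indicator_table(notes, sep="\t"):
    """Build a 0/1 table: one row per note, one column per distinct value."""
    # Collect the distinct values in the order they are first seen.
    columns = []
    for codes in notes.values():
        for code in codes:
            if code not in columns:
                columns.append(code)
    rows = [["Name", *columns]]
    for name, codes in notes.items():
        rows.append([name, *("1" if c in codes else "0" for c in columns)])
    return "\n".join(sep.join(row) for row in rows)


table = indicator_table(notes)
print(table)
```

The point of the sketch is that the column list is derived from the data, so nobody has to enumerate the values of the source attribute by hand.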

Once the data is exported it might be possible to run the likes of R in the context of the export folder, so output could be seen from within Tinderbox by viewing the exported page’s preview after the contingency analysis had been run. It does, however, depend on what exactly the user wants to see: a plot of all contingencies (perhaps a form of heat map)? A listing of the most common co-occurrences? A problem for this sort of exploration is that you may often need to try several approaches if you don’t yet know the relationships hiding in the data.

In summary, unless Tinderbox is actually going to visualise the contingency table in some fashion within the app, this task might be better handled as an export code (or an action primarily intended to be called during export). This is offered for discussion and is not a firmly-held viewpoint.

I think a tabular export code output such as in the example would be a possibility. But the original request seems to be a different output form:

So, output like that needs to be the $Text of a note. Let’s say we have an agent whose query finds all notes where $OpenCodes (for example) contains a value, or meets some other condition. Say the agent is named “Coded Notes”.

Some other note (call it “Co-occurrence Report”) would have a rule like this:

$Text=collect.cooccur("/Coded Notes",$OpenCodes).format("\n")

to get the output @derekvan specified here

or

$Text=collect.cooccur("/Coded Notes",$OpenCodes,FALSE,3).format("\n")

to get the same kind of report where code contiguity is not required, and we are looking for triplets instead of duals.

To export a table we would need a different syntax.

EDIT: the link here now points to a revised version of the file that uses a stamp rather than a rule for reasons described in the thread “Annotated co-occurrence example code”.

Building off @PaulWalters’ demo “Open Code Example with Export.tbx”, I offer “OpenCode Example with co-occurence.tbx” (zipped TBX) with two extra agents. One uses the unique sorted string** of all code values for a given note and finds occurrences. This means shorter strings are detected within longer ones, as both use the same sort order in construction.

The second does what I think Derek’s after. It finds all co-occurrence of value pairs (where AB is considered the same as BA) regardless of the number of $OpenCode values per note; notes with zero or a single value are ignored as no pair exists.

I’m sure there are edge cases and I’ve not even tried to extend to triples or quartets, etc. I hope this puts us all back on the same page again. I’ve only tested on the example in the TBX I used; I suspect with 100 discrete $OpenCode values performance might be less snappy!

** I fixed one error in the original TBX as when casting $OpenCode to $CodeCollector, .isort wants to be chained before .format() so it is the sorted list that gets formatted rather than a single formatted string (that can’t be sorted!).

Hi Mark A – Was curious to give this a look, but I get a 404 error at the link. Here is the way the link is registering for me (when I do the right-click “Copy Link Address” command):

http://www.acrobatfaq.com/tbdemos/OpenCode_Example_with_co-occurrence.zip

For whatever reason that isn’t working for me.

Fixed the URL. Sorry about that.

Thanks! This is a level of analytical complexity that is beyond my own real-world needs, but it’s interesting to see how everyone here has approached it.