Find co-occurring set values across notes

derekvan · April 14, 2017, 2:21pm

I’ve got an analysis task and am wondering if there are ways for Tinderbox to do more of the work than I had originally imagined. I am looking for co-occurring values in a set attribute (there may be a more specific name for this kind of task, but I’m not sure what it might be).

More context: I have some notes with a set attribute called “OpenCodes.” Each note has some values for this attribute. For example,

Note1: red; blue; green
Note2: blue;green;purple
Note3: blue;green;orange
Note4: blue;green;white
Note5: red; blue;orange

What I want to do is find some way for Tinderbox to tell me that out of those notes, blue & green co-occur 4 times, red & blue co-occur twice, and blue & orange co-occur twice. Extra points if it is in some kind of agent or list that allows me to see in which notes such co-occurrences happen.

Now, I know I can try some of this manually, by creating agents that show notes that have the same two values (e.g., blue & green), but that requires that I know which values co-occur. I don’t, or at least I don’t before I start looking through them and trying to note patterns.
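
For anyone who wants to check the expected counts outside Tinderbox, here is a minimal Python sketch of the task as described above. It is not Tinderbox action code; the `notes` dictionary and the function name `pair_cooccurrences` are illustrative assumptions, with the data copied from the five example notes.

```python
from collections import defaultdict
from itertools import combinations

# Sample data from the post above: note name -> set of OpenCodes values.
notes = {
    "Note1": {"red", "blue", "green"},
    "Note2": {"blue", "green", "purple"},
    "Note3": {"blue", "green", "orange"},
    "Note4": {"blue", "green", "white"},
    "Note5": {"red", "blue", "orange"},
}


def pair_cooccurrences(notes):
    """Map each unordered pair of codes to the list of notes containing both."""
    pairs = defaultdict(list)
    for name in sorted(notes):
        # Sorting the codes makes ("blue", "red") and ("red", "blue") one key.
        for pair in combinations(sorted(notes[name]), 2):
            pairs[pair].append(name)
    return dict(pairs)


pairs = pair_cooccurrences(notes)

# Report pairs found in more than one note, most frequent first.
for pair, names in sorted(pairs.items(), key=lambda kv: (-len(kv[1]), kv[0])):
    if len(names) > 1:
        print(f"{';'.join(pair)}, {len(names)}  ({', '.join(names)})")
```

No pair has to be known in advance: every pair present in any note is enumerated, and the list of note names covers the "which notes" part of the request.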

mwra · April 14, 2017, 3:09pm

Not stated but sort of implied - do you only want pairs of co-occurrence? Also, how big is the set of potential values (i.e. number of pairs to check)? I suspect this scales badly.

But let’s assume you only want pairs and the attribute being tested is $OpenCodes.

1. Collect the currently used range of values to be tested via values("OpenCodes").
2. Make two String-type attributes, $TestValue1 and $TestValue2. Set as their suggested values the value list collected at step #1.
3. Make an agent and:
   - Add $TestValue1 and $TestValue2 to the agent’s KA.
   - Set the query to: $OpenCodes & $OpenCodes.contains($TestValue1(agent)) & $OpenCodes.contains($TestValue2(agent)) . The first query term is just a scoping query so we don’t waste effort doing regex-based tests on notes not using the target attribute. As each test value is tested independently we don’t need to worry about list value order.
   - Set the Display Expression to: "OpenCodes count for: "+$TestValue1+" and "+$TestValue2+" ("+$ChildCount+")"
   - Set the two test attributes in the agent’s KA to the desired test colours.
4. The query will find all notes with that combo of $OpenCodes values and the agent’s title will show the test values and the overall count.

You could make copies and have one for each pair but I doubt that scales well. Or, if deep in the analysis, make a few clones but when done delete all copies except one, then turn that agent off and delete any child aliases. Your analysis tool is now ‘stored’ until you next need it.
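
As a rough illustration of what such an agent computes, here is a small Python analogue. It is only a sketch: `agent_matches` is a hypothetical helper, not a Tinderbox operator, and the sample data is taken from the original post.

```python
# Sample data from the original post: note name -> set of OpenCodes values.
notes = {
    "Note1": {"red", "blue", "green"},
    "Note2": {"blue", "green", "purple"},
    "Note3": {"blue", "green", "orange"},
    "Note4": {"blue", "green", "white"},
    "Note5": {"red", "blue", "orange"},
}


def agent_matches(notes, test_value1, test_value2):
    """Names of notes whose codes include both test values, in any order."""
    return [
        name
        for name, codes in sorted(notes.items())
        # 'codes and ...' mirrors the scoping term: skip notes with no codes.
        if codes and test_value1 in codes and test_value2 in codes
    ]


matches = agent_matches(notes, "blue", "green")
# Equivalent of the Display Expression: both test values plus the match count.
title = f"OpenCodes count for: blue and green ({len(matches)})"
print(title)
```

Each call answers one pair, which is why the approach needs the pair to be chosen by hand each time.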

eastgate · April 14, 2017, 3:36pm

(a) If it’s not super-sensitive, I think readers would be interested to know what you’re actually studying here. I know I would.

(b) Much depends on the number of distinct values found in OpenCodes, and whether this is likely to change often. The simplest approach, for example, would be to build an agent for each co-occurrence; if we have N different values, we’d need N(N-1)/2 agents. That’s fine for 4 or 5, and obviously horrible for 50. Or, you could have an attribute browser tab that selects for one value and that browses on the others; that requires N tabs – again, fine for 4 or 5, horrible for 50.

(d) For the most general case and for large N, you’re going to want a dedicated cross tabs package. HTML export is your friend here, either for XML, JSON, or just for generating a CSV file.
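
To put numbers on the N(N-1)/2 figure in (b), a quick Python check follows. The helper name `agents_needed` is an illustrative assumption; `group_size` is included only to show how the count grows if triples were wanted instead of pairs.

```python
from math import comb


def agents_needed(n_values, group_size=2):
    """Number of distinct unordered groups of codes, one agent per group."""
    return comb(n_values, group_size)


for n in (5, 50, 100):
    print(f"{n} values -> {agents_needed(n)} pair agents")
```

For pairs, `comb(n, 2)` is exactly N(N-1)/2.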

derekvan · April 14, 2017, 3:46pm

The data itself is unfortunately protected under IRB regulations, so I can’t share it. But basically, it’s qualitative research. I’ve an interview which I’ve “coded” inductively. This basically means that I’ve looked at each bit of the transcript (which I exploded in Tinderbox) and assigned some labels based on intuition. I’m now looking to find any patterns or things that relate. (This is mostly like “open coding” moving to “axial coding” in grounded theory).

So, at the moment I’m interested in a single interview. The coding is not going to be changing during the analysis (I might recode at some later time, but I’d want to completely re-run the analysis then). I might then want to run the analysis over several interviews together as well, but again, the codes would be static at that point.

Obviously, this is built in to qual data programs like NVIVO, but I don’t want to use something like that since it’s mostly overkill for the scale I’m working at. A colleague has had good luck using mySQL for this, but I’m not familiar with that and I am familiar with Tinderbox, so I thought I’d give it a go here.

Unfortunately, I think it’s a bit too tedious to try and manually create the agents to find each co-occurrence. I’m looking at over 100 codes right now. And I’d definitely be interested in more than just pairs (e.g., triples, quadruples), but pairs are the most likely. So, looks like mySQL experiments may be better off. Or maybe “cross tabs” as @eastgate mentions (but I’m not familiar with that …)

eastgate · April 14, 2017, 5:50pm

I’d love to hear how this might work in Tinderbox – mocked up with any extensions that would help.

mwra · April 14, 2017, 10:59pm

Though not at the scale of @derekvan’s data, it did occur to me one might use the agent method above in a map using a smart adornment. However you would end up with a rime of query-rejected items if used successively with a number of different queries.

PaulWalters · April 15, 2017, 2:30am

A while back, in the old forum, in a thread on a situation similar to the one mentioned by @derekvan, Mark B. had some useful suggestions for Agent queries. Now that we have Attribute Browser, the suggested queries can be tried on the fly in an AB tab. This doesn’t solve the request, but inspection of the AB results is helpful. At least, this is what I see in a test dataset that I created that I think emulates the original based on descriptions above.

I get a little closer to a solution if my document has a string attribute $CodeCollector and a rule

$CodeCollector=$OpenCode.format(" ").isort()

and my attribute browser is focused on $CodeCollector. So AB aggregates notes by strings such as blue;green;red or green;orange;red, and so forth. This makes inspecting the results visually easier. Still not a perfect solution, but more tractable.

(FWIW, the whole analysis process using AB becomes simpler if instead of $OpenCode I use a boolean attribute for each code value, e.g., $Red, $Blue, $Green, and so forth. More on that, later.)
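
To show what the grouping by $CodeCollector amounts to, here is a Python sketch using the sample values from the original post. The function `code_collector` is a hypothetical stand-in for the rule, not a Tinderbox operator.

```python
from collections import defaultdict

# Sample data from the original post: note name -> set of code values.
notes = {
    "Note1": {"red", "blue", "green"},
    "Note2": {"blue", "green", "purple"},
    "Note3": {"blue", "green", "orange"},
    "Note4": {"blue", "green", "white"},
    "Note5": {"red", "blue", "orange"},
}


def code_collector(codes):
    """Canonical string for a set of codes: sorted, then joined with ';'."""
    return ";".join(sorted(codes))


# Group notes by their full combination of codes, as the AB tab would.
groups = defaultdict(list)
for name, codes in sorted(notes.items()):
    groups[code_collector(codes)].append(name)

for key, names in sorted(groups.items()):
    print(f"{key}: {', '.join(names)}")
```

With the sample data every note lands in its own group, because grouping is by the whole combination rather than by pairs. That is why this eases visual inspection without directly answering the pair-count question.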

PaulWalters · April 16, 2017, 11:54am

Here is an example document that uses two approaches

The first approach uses the $CodeCollector attribute discussed above in an Attribute Browser
The second approach requires
~ a boolean attribute for each value of $OpenCode (I assume that the sample values given in the original post are the correct universe of values)
~ a stamp SetColors to load up these boolean attributes with the correct true/false setting (plus a secondary stamp, NullColors, that’s useful for recovering from errors)
~ a couple of sample stamps that set $AgentQuery to look for relevant duals among the booleans – what I did was create a single agent and then apply the agent stamps to that agent to toggle the report. This is to avoid having a large number of agents. For some reason, even though this is a very small document, the execution of the agent seems to take a few seconds. No explanation for that.

The rest of the document should be self-explanatory. Notice the $DisplayExpression in the agent is used for counting. Sorry for the crudeness of the document. I did this to sort out a rough approach to the original problem. I’ll leave it for others to make it sophisticated, if the suggested approach merits further attention.

JFallows · April 16, 2017, 5:40pm

Thanks for this example. Although I don’t do anything exactly like this, I found the workings of the $CodeCollector attribute to be useful as a suggestion for other tasks I was trying to figure out. (For aggregating values in a set attribute.)

Specifics like this are always more useful than the originator might imagine, because of possibilities they suggest to others. Thanks!

derekvan · April 16, 2017, 6:48pm

Mark, I’m really not sure how this might work, I was mostly hoping Tinderbox had the magic already to read my mind (it has been so useful in my other efforts at qualitative analysis and typically there has already been a feature that I can appropriate).

@PaulWalters example is pretty cool–I especially like the outline tab where you can see via check marks which codes are applied to any note. This is similar to a view I sometimes try to use in Excel.

Ultimately, I guess I was imagining there could be some kind of dot operator for “collect” or “list” that would produce a record of co-occurrences across set/list values. So, the user would write a collect expression like so:

collect.cooccur(children(/QualData),$OpenCodes)

which would return this text:

blue;green, 4
red; blue, 2
blue;orange, 2

Now that I write this, I see this is pretty different from the way most Tinderbox operators work in terms of returning values, but it doesn’t seem completely out of the realm of possibility (from my user perspective at least–I have no idea what technical issues might prevent such an example).

Given this list, I could then choose to make agents to see the exact notes where they occur, or turn the values into some kind of set to view in Attribute browser or something.

I’m sorry Mark B, I know this is pretty half-baked! This is the first time I’ve really been trying to accomplish this kind of analysis and Tinderbox is the first place I looked. After I have some mySQL experience trying to do the same thing, I might have better ideas of what I’m looking for and how it could be implemented in TBX.

PaulWalters · April 16, 2017, 9:35pm

Hmmm. Good idea. Here’s a modified version of the test file that now includes semicolon-delimited “csv” exporting. Bring the export into Excel; convert text to columns; use data filtering to narrow down the cooccurrences. More work, but just another approach.

OK. How does Tinderbox know what cooccurrence means? Always just contiguous duals? So in {a,b,c} only {a,b} and {b,c} are valid but {a,c} is not? I don’t think the definition of “cooccurrence” in general is axiomatic – there would be some need (I think) to give Tinderbox a parameter in the collect.cooccur operator to define which pairs in a set count and which do not.

Perhaps

collect.cooccur(group,attribute,[contiguous],[constituents])

where group and attribute have the normal meaning in collect, [contiguous] is an optional boolean, and [constituents] is an optional integer of 2 or greater that indicates whether we are looking for pairs (constituents == 2) or sets comprising more than 2 elements. When not provided, contiguous defaults to true and constituents to 2.

I think we also need a .count() operator to get the count of the cooccurrences.

collect.cooccur(group,attribute,[contiguous],[constituents]) + ", " + collect.cooccur(group,attribute,[contiguous],[constituents]).count()

would provide the text in @derekvan’s latest example ( blue;green, 4).

Problem. I think collect does not normally operate on sets? Not sure.
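
To make the two optional parameters concrete, here is a Python sketch of the semantics proposed above. It is an illustration only: `cooccur` is a plain Python function, not the proposed Tinderbox operator, and two details are assumptions of mine: each group is counted once per note, and order within a group is ignored (so a;b and b;a count as the same pair).

```python
from collections import Counter
from itertools import combinations


def cooccur(value_lists, contiguous=True, constituents=2):
    """Count groups of `constituents` values that appear together in a list.

    contiguous=True  -> only runs of adjacent values count ({a,b}, {b,c}).
    contiguous=False -> any combination counts ({a,c} as well).
    """
    counts = Counter()
    for values in value_lists:
        if contiguous:
            groups = [values[i:i + constituents]
                      for i in range(len(values) - constituents + 1)]
        else:
            groups = combinations(values, constituents)
        # Assumption: sort within each group and count it once per note.
        counts.update({tuple(sorted(group)) for group in groups})
    return counts


# Values in the order given in the original post.
open_codes = [
    ["red", "blue", "green"],
    ["blue", "green", "purple"],
    ["blue", "green", "orange"],
    ["blue", "green", "white"],
    ["red", "blue", "orange"],
]

adjacent = cooccur(open_codes)                    # defaults: contiguous pairs
anywhere = cooccur(open_codes, contiguous=False)  # any pair in the same note
triples = cooccur(open_codes, contiguous=False, constituents=3)
```

With this data the contiguous setting finds blue and orange together only once (Note5), while the non-contiguous setting finds them twice, since in Note3 they are separated by green. The choice of default therefore changes the answer.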

mwra · April 16, 2017, 9:44pm

[ FWIW I’d certainly note myself as having an interest here as investigating the inter-relationship of notes’ metadata is something I find myself attempting quite often.]

A loose end here: are we looking only at pairs, or at all groupings of 2 or more values? The size of the contingency table can be an issue; with 100 values that’s 4,950 possible pairs (N(N-1)/2). However, for much smaller value sets another idea that occurred is to use treemap view, as this gives the opportunity to use both size and $Color to help visualise relationships without having to leave Tinderbox.

edit: sorry, looks like I’m on the same point as the end of the preceding post, with which I concur.

eastgate · April 17, 2017, 2:58pm

We’re quite willing to add support for this, if we arrive at a consensus for some reasonably straightforward solution. It’s a busy time, though, so I don’t think I can follow this in detail.

Let me know when the committee’s recommendations are ready!

mwra · April 17, 2017, 7:25pm

@PaulWalters collect() and collect_if() can work on any list. Originally the operators returned a Set-Type and since (v5?) they have returned a List-type. Tinderbox offers several methods to de-dupe (and sort) a List so the difference is not onerous.

I’m wondering what the use case is for ‘contiguous’. Is this for tracking sequence? Whilst exploring the envelope I did wonder if the order (direction) of pairs matters. It generally doesn’t in TB, but some people’s use of a List (which doesn’t re-sort on its own) can retain pairs A;B and B;A, which might - to them - imply some difference. I raise this even if only so that we expressly strike it out as something for which to cater.

I’m not sure a coOccur() needs to chain on collect() as if it takes a group input it is scoping in its own right. Thus (what ever the eventual name) I’d see this operator chaining off a list (i.e. List or Set). A coOccur_if() probably also makes sense to allow some scoping tweaks. Thus:

list.coOccur(group,attribute[,constituents][,contiguous])
list.coOccur_if(group,attribute,condition[,constituents][,contiguous])

I’ve reversed the last two inputs as the last seems least likely to be used and an optional input requires all preceding (optional) inputs to be supplied.

Another issue is guessing the upper bound (constituents) value if one wants all N>1 configurations. I assume max($SomeMultiValueAttr.size) might suffice.

The bit I’m stuck on is how the app degrades gracefully when an unknowing user gives it 5k notes with 100 values for $SomeMultiValueAttr and about 20 different size configurations, even on a fast Mac.

PaulWalters · April 17, 2017, 7:35pm

The case is laid out here

If the “standard” use of .cooccur is to only look for pairs, and only look for pairs that are next to one another in the set, then the optional parameters ,[contiguous],[constituents] would not be needed.

If no one would ever think of using the feature except to look for contiguous pairs in a set, then my proposal can be scrapped. I’m suggesting it might be desirable to anticipate that other users might have a different (and valid) case to search for cooccurrences in sets where contiguous pairs are too limiting, and thus need the non-default settings provided by the optional parameters ,[contiguous],[constituents]

derekvan · April 17, 2017, 8:36pm

I like this proposal a good deal. I don’t think I would ever use the “contiguous” function, so my preference would be for it to be “off” by default (the sequence of values doesn’t matter in my usage, only that certain values appear together in whatever order). I really like the “constituents” flag, as I could see using that to narrow analysis in possibly interesting ways.

As for @mwra’s concern about degrading gracefully if too many notes are provided, I suppose there might be some error message (similar to the messages that pop up when Agent queries are running amiss) that says the coOccur is going to cause problems and needs to be scoped differently.

I also agree with MarkA about chaining off “list” or “set” instead of “collect.” This makes sense to me.

mwra · April 17, 2017, 9:38pm

@PaulWalters thanks. I wasn’t disparaging the idea. I’m with you in being open to the ideas/needs of others but (a) I failed to think of an actual use case (which is simply a failure of my imagination and not of the use case) and (b) it would require working only off List-type data** as Set-type cannot be assumed to return a consistent sort, which underpins the sense of being contiguous.

If implementing this (and if supportable at likely scale of use) I’d agree matched value pairs might be a sensible default but for analysis being able to extend to matched N-value sets might be of interest. This is aptly informed by a remembrance of a number of recent research tasks where I ended up painfully grinding multiple values down to a (pre-eminent) single one in order to do further analysis.

** noted that collect() returns a list so your chained idea makes more sense on reflection.

So, I think we’re actually on the same page (bar fine detail on input sort order!).

The other part of this is what .coOccurence() would emit. Although not a formal data type, TB now has look-up lists and perhaps that might be an outcome, a list of “value set:count” items, e.g. “red blue:2;blue green:4”. However I’m not sure what the delimiter used in the value set part would be as both space and underscore might form part of actual attribute values.

PaulWalters · April 17, 2017, 10:10pm

Are we thinking of .cooccurrence() or .cooccur()? I’ve been using the original formulation that @derekvan proposed, but either name for the function works for me.

Output. If it output a .plist it would be a key/value pair. Can a structure like that exist in Tinderbox? I understand the issue about delimiter. Perhaps \n ?

mwra · April 18, 2017, 8:02am

As to the operator name, for my 2¢ I’d suggest the shortest compound variant that makes sense (less typing!) and using internal camel-case so as to follow the style of other compound action names. Thus list.coOccur() would seem a choice but in truth the inputs/outputs are of more interest.

As to output, unless Tinderbox were to add some new form of view (or in-map visualisation) it might be best to pump out something that can easily be consumed by an open system such as R (as well as proprietary ones like Office Excel). I may be wrong, but I think both those apps (and similar) would most easily ingest a table with notes ($Name or some UID) on one axis and all the discrete values of the analysed attribute on the other. Working this notion further with example data from up-thread, I’d expect tabular output like this:

Name    red  blue   green  purple  orange  white
Note1    1     1      1      0       0       0
Note2    0     1      1      1       0       0
Note3    0     1      1      0       1       0
Note4    0     1      1      0       0       1
Note5    1     1      0      0       1       0

One might add a per-row value count but I suspect that, if exporting to another app for process, the value count might more easily be created after ingest into the other app.
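
As a sketch of how such a table could be generated from the multi-value data, here is a short Python example that rebuilds the rows above and writes them as CSV. The names `indicator_rows` and `notes` are illustrative assumptions; the column order simply follows the first appearance of each value.

```python
import csv
import io

# Sample data from up-thread, values in the order originally given.
notes = {
    "Note1": ["red", "blue", "green"],
    "Note2": ["blue", "green", "purple"],
    "Note3": ["blue", "green", "orange"],
    "Note4": ["blue", "green", "white"],
    "Note5": ["red", "blue", "orange"],
}


def indicator_rows(notes):
    """Return (header, rows): one 0/1 column per distinct value."""
    columns = []
    for codes in notes.values():
        for code in codes:
            if code not in columns:  # keep first-seen order
                columns.append(code)
    header = ["Name"] + columns
    rows = [
        [name] + [1 if column in codes else 0 for column in columns]
        for name, codes in notes.items()
    ]
    return header, rows


header, rows = indicator_rows(notes)

# Write to an in-memory buffer; swap in open("codes.csv", "w", newline="")
# to produce a file that R or Excel can ingest.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
print(buffer.getvalue())
```

Once the data is in this shape, per-row counts and pair counts can be derived in the receiving application.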

The above data table might appear easily created by exporting an agent with the header line as the $Text and the data rows via ^children^. However, unlike the above example, for more than a few values of $OpenCodes (or whatever multi-value source) iterating through the values might be a complex task for many users so an action or export code to do this would help.

Once the data is exported it might be possible to run the likes of R in the context of the export folder so output could be seen from within Tinderbox by viewing the exported page’s preview after the contingency analysis had been run. It does, however, depend on what exactly the user wants to see: a plot of all contingencies (perhaps a form of heat map)?, a listing of the most common co-occurrences?, etc. A problem for this sort of exploration is that you may often need to try several approaches if you don’t yet know the relationships hiding in the data.

In summary, unless Tinderbox is actually going to visualise the contingency table in some fashion within the app, this task might be better handled as an export code (or an action primarily intended to be called during export). This is offered for discussion and is not a firmly-held viewpoint.

PaulWalters · April 18, 2017, 9:46am

I think a tabular export code output such as in the example would be a possibility. But the original request seems to be a different output form:

So, output like that needs to be the $Text of a note. Let’s say we have an agent whose query finds all notes where $OpenCodes (for example) contains a value, or meets some other condition. Say the agent is named “Coded Notes”.

Some other note (call it “Co-occurrence Report”) would have a rule like this:

$Text=collect.cooccur("/Coded Notes",$OpenCodes).format("\n")

to get the output @derekvan specified here

or

$Text=collect.cooccur("/Coded Notes",$OpenCodes,FALSE,3).format("\n")

to get the same kind of report where code contiguity is not required, and we are looking for triplets instead of duals.

To export a table we would need a different syntax.