Analyzing a massive list (i.e. rendering a big Excel spreadsheet to make sense of it)

satikusala · December 22, 2020, 4:41am

I have a file with 412 notes. One of the attributes, the $Topics set, has 584 unique items. Here are some example topics:

A-B Testing
ABM commitment requirements
ABM digital advertising-retargeting
ABM functional design
ABM market trends
ABM measurement best practice examples
ABM measurement trends
ABM pipeline acceleration
ABM roles and responsibilities
ABM service agencies
Account and contact data management
Account and territory plans
Account planning process for grouped ABM
Account planning process for large one-to-one ABM
Account prioritization-selection and tools
Account profiling for ABM
Account profiling for grouped account ABM: named-industry
…and the list goes on.

Also, I got to try my first legit Crosstab today. WAY COOL. I ran an agent that looked for all items with ABM and had the agent apply an action to set a user-generated boolean attribute, ABM, to true. I then did this for another term and set another boolean to true. I now had two user-generated attributes to run a cross-tab against. Worked perfectly.

I would like to analyze the titles of these topics. Another attribute has 95 different discrete items. Clearly, rolling them up into smaller topical cohorts could be useful, or simply running a quick array to understand how many titles relate to each of these topics or priorities would be a great start.

Manually creating agents seems unwieldy. I did about 16 and got some good data, but I wonder if there is a better approach.

I’d like to go farther into the data analysis.

Has anyone done this type of analysis with Tinderbox before? If so, do you have any ideas on how to start to tackle a project like this?

Thanks.

mwra · December 22, 2020, 10:47am

Refining lists is something that I have done quite often in research. Capturing grouping into new attributes (i.e. making metadata) is really useful in terms of tabulation and quantitative work.

I think there are two take-aways here. One is groupings, e.g. your ‘ABM’-based values. The other aspect is consolidation: if you’ve 100s of items (i.e. discrete values) with only one or a few notes using each value, you’ll want to consolidate.

For the first, I’d use a new user attribute to capture that relationship. Simplest is a boolean, if it is an in-/out-of-grouping record. Don’t overlook the power of having a simple boolean to find with ease a group of notes that may otherwise need a complex query to locate. If it’s not a yes/no membership, or there are additional standards, you can use a string or list-based attribute instead/as well.

In the second case, I normally add a new Set or String attribute and start populating that whilst reviewing the source. Attribute Browser view is very useful for this. An annoying limitation is there is no way to access in-app a list of attribute values and the count of their use (in view scope), i.e. the count numbers you see at the right margin on their category headers in AB view. If you’ve 100s of notes/items, scrolling up/down gets tiresome. A means to copy such a list to the clipboard has been requested and even that would make AB view so much more useful for this sort of work.
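As a stopgap outside Tinderbox, the value/count listing described above can be sketched in a few lines of Python, assuming the notes have been exported to some structured form. The `notes` data, the `Topics` key, and the `value_counts` helper are all hypothetical names made up for illustration; this is not a Tinderbox feature.

```python
from collections import Counter

# Hypothetical export: each note is a dict, and "Topics" holds that
# note's set of values (assumed structure, not a Tinderbox format).
notes = [
    {"Name": "Note A", "Topics": {"ABM market trends", "A-B Testing"}},
    {"Name": "Note B", "Topics": {"ABM market trends"}},
    {"Name": "Note C", "Topics": {"ABM service agencies"}},
]

def value_counts(notes, attribute):
    """Count how many notes use each discrete value of a set-type attribute."""
    counts = Counter()
    for note in notes:
        counts.update(note.get(attribute, set()))
    return counts

topic_counts = value_counts(notes, "Topics")

# Most-used values first; ties are listed alphabetically for a stable order.
for value, count in sorted(topic_counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{count}\t{value}")
```

Values used by only one or two notes end up at the bottom of the listing, which makes them easy to spot as candidates for consolidation.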

Note that you don’t need to make agents just to then use Crosstabs or AB view. Both allow you to set an agent within the view itself. Both views default in scope to ‘whole document’ but can be set to descendants of any container without using an agent. Or, you can use default scope and define an agent, or mix both.

My work doesn’t call on Crosstabs but I use AB view a lot in my research work.

satikusala · December 22, 2020, 1:59pm

Thanks, so it looks like I’m on track with the boolean approach, as noted yesterday, that is good to know. Must finally be learning.

In terms of process and methods, however, I’m curious and have some questions and would like to share a few of the methods I’ve already started to use so that others may benefit.

Also, some background. I know nothing about this data. A friend gave me a huge spreadsheet and asked if I could help make sense of it.

My first step has been all around the effort of tenderizing the data to get it ready for consumption, just like a good steak.

Methods used so far,
1. Import Spreadsheet, I started with a spreadsheet. 417 rows and 11 columns. I simply copy all the rows and columns (be sure to select the headers) and paste them into a new Tinderbox document. This creates a new container note called “imported spreadsheet.” Each row of the spreadsheet is added as a new note in the container, and the columns and their respective data are added as attributes. If a column name in the spreadsheet matches an existing system or user attribute, Tinderbox is smart enough to use it and populate the new note with the data from the row’s corresponding cell. For example, Tinderbox will use the data in a column labeled Title or Name to populate the $Name attribute. If it does not find a column named Title or Name, it will use column 1 as the name of the note. If Tinderbox does not recognize the name of a column, it will automatically generate a new user attribute of type String. I then open the Document Inspector, ⌘1, and modify the types of the new user attributes as needed, e.g. make URLs into Links, Sets into Sets, etc.

For example, the spreadsheet becomes a container of notes. (Screenshots not preserved.)

2. See what I’ve got, next I want to see what I’m working with,

2a. What values are in the sets, I manually create notes based on the names of the user-generated attribute sets, e.g. Topic, Priority and Services (In my case I had previously created a Folder prototype and applied a $ChildCount to the $DisplayExpression).

Next, I create a stamp to pull the values out of each set and then apply it to each namesake note. The goal here is to create a list of the attribute values in $Text (Remember: you need to change the user-generated attribute from string to a set, otherwise you end up with tons of duplicates. In my case, in one example, 1,800 items vs. 95).

Here is the stamp:

$Text=values("$AttributeName").format("\n");

For example, this stamp $Text=values("$Entity").format("\n"); produced a note with this text:

Next, I explode the note to get each entry to be its own note.

I go through this process for every set or attribute whose context I want to analyze.

3. Run some queries
Once I get a sense of what is in the sets, I can then run some queries on key terms I want to tease out, e.g. “ABM” or “Blog Posts.” The key here is to start to quantify the file and get an idea of how many notes relate to each item. To do this I create an agent and run a query. Here is an important trick. You can’t run partial-text searches on sets; set searches want exact matches. To get around this you can use a dot operator that formats the set to a string on the fly, like this: $Entity.format(";").icontains($MyString(agent));. The .format(";") changes the set Entity to a string, which enables partial-text searches.
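For anyone who wants to see the logic of that trick outside Tinderbox, here is a minimal Python sketch of the same idea: exact membership on a set fails for a partial term, while joining the set into one delimited string allows a case-insensitive substring test. The `set_icontains` helper and the sample values are hypothetical, and the sketch uses plain substring matching only, so it is an analogy rather than a reproduction of what `.icontains()` does.

```python
def set_icontains(values, needle, sep=";"):
    """Join a set into one delimited string, then test for a
    case-insensitive substring (plain text, no pattern matching)."""
    joined = sep.join(sorted(values))
    return needle.lower() in joined.lower()

# Hypothetical set value for one note.
entity = {"ABM market trends", "Account planning"}

exact_match = "abm" in entity                   # exact membership on the set
partial_match = set_icontains(entity, "abm")    # substring test on the joined string

print(exact_match, partial_match)
```

The exact test only succeeds when the search term equals a whole set member, which is why the partial term needs the joined string.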

So this results in this. (Screenshots not preserved.)

Like above, I do this for each of the queries of interest.

Also, as noted above, I created the ABM attribute and others, applied an agent action to set the attribute to true. Querying on true-false is a lot easier than running a bunch of .icontains searches.

4. To make visual sense with links,
Next, I tried to make visual sense with links. As you can see above, in one of the queries I’ve tried using a linkTo() action to link entities to their respective notes. This lets me view the relationships in a Hyperbolic, “link”, view.

5. Cross-tabs,
Now that I have my boolean attributes I can easily run cross-tabs,

As for cross-tabs, this will prove very useful, and I see that I can run a query, but they don’t let me apply actions like you can in AB view, so there is a limitation here.

Word repetition and keyword visuals,
I’ve also run some keyword visuals,

and word repetition to help me get a sense of the data,

That’s it so far, this is as far as I got with about an hour of work (it has taken me longer to write this post than to do the work). What I started with was a spreadsheet with a bunch of data and what I have now is an interactive data set that I can apply views to and manipulate to tease out insight. Isn’t Tinderbox cool!

As noted above, I have one exploded set with nearly 600 items. My next step will be to figure out a way to consolidate some of the items into more manageable cohorts. I’d like to figure out a way to automate this a bit. For example, I think I can create some arrays of items and aggregate them into lookup tables to make the associations faster. I also need to go back to my friend to get a sense of what he really wants to know. In reality, all I really have at this point is a bunch of data without clear questions.
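To think through the lookup-table idea, here is a rough Python sketch of one way it could work: a table of keywords mapped to cohorts, applied to each item, with the first matching keyword winning. The keywords, cohort names, and helper functions are all invented for illustration; the real cohorts would have to come from knowing the data.

```python
from collections import defaultdict

# Hypothetical lookup table: (keyword, cohort). Order matters, because
# the first keyword found in an item decides its cohort.
COHORT_KEYWORDS = [
    ("abm", "ABM"),
    ("account", "Account management"),
    ("testing", "Testing"),
]

def assign_cohort(topic, table=COHORT_KEYWORDS, default="Unassigned"):
    """Return the cohort of the first keyword contained in the topic."""
    lowered = topic.lower()
    for keyword, cohort in table:
        if keyword in lowered:
            return cohort
    return default

def group_by_cohort(topics):
    """Map each cohort to the list of topics assigned to it."""
    groups = defaultdict(list)
    for topic in topics:
        groups[assign_cohort(topic)].append(topic)
    return dict(groups)

topics = [
    "A-B Testing",
    "ABM market trends",
    "Account profiling for ABM",
    "Account and territory plans",
    "Webinars",
]
groups = group_by_cohort(topics)
for cohort, members in groups.items():
    print(cohort, len(members), members)
```

Anything landing in "Unassigned" shows where the table still has gaps, so the table can be grown a few keywords at a time rather than designed up front.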

Anyway, if anyone can think of any other creative ways of parsing out and making sense of a big dataset like this, I’m all ears.

Thanks, and I hope the above explanation helps others.

mdavidson · December 22, 2020, 2:22pm

Just quickly - for cross-tabs you can apply an action to each cell by right clicking on the cell and selecting Perform Action on This Cell from the right-click menu. Maybe this helps ?

mdavidson · December 22, 2020, 2:25pm

Here is an example TB notebook, with a right-click on a specific cell

Not only can you perform an action on all notes within the cell e.g. set boolean values, colours etc… but you can also select all the notes or create an agent for the cell to further process the note aliases as you see fit. It’s powerful stuff…

eastgate · December 22, 2020, 3:20pm

Not to deflect or distract from this excellent discussion, but I also wonder whether the profusion of values for this attribute indicates that the attribute combines information that could more cleanly be represented in multiple attributes.

For example, you might conceivably have

ABM market trends

ABM roles and responsibilities

Content marketing market trends

Content marketing roles and responsibilities

XXM market trends

XXM roles and responsibilities

You might recast these to have a $Topic and a $Role

Topic: ABM. Role: market trends

My usual sermon on incremental formalization urges people not to get involved with this prematurely; it’s not mandatory to break everything down optimally. But when you’ve got 584 topics and growing, it might be useful to break things down.
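To make the recasting concrete, here is a small Python sketch of splitting a combined value into a topic and a role. It assumes the topic is always a known prefix of the combined value, which will not hold for every item in a real list; the `KNOWN_TOPICS` list and the `split_topic` helper are hypothetical.

```python
# Hypothetical list of topic prefixes; in practice this would be built
# up while reviewing the data.
KNOWN_TOPICS = ["ABM", "Content marketing", "XXM"]

def split_topic(value, known_topics=KNOWN_TOPICS):
    """Split 'ABM market trends' into ('ABM', 'market trends').

    Returns (None, value) when no known topic prefix matches, so
    unmatched items can be reviewed by hand.
    """
    # Try longer prefixes first so a short topic cannot shadow a longer one.
    for topic in sorted(known_topics, key=len, reverse=True):
        if value.lower().startswith(topic.lower() + " "):
            return topic, value[len(topic):].strip()
    return None, value

for value in [
    "ABM market trends",
    "Content marketing roles and responsibilities",
    "A-B Testing",
]:
    print(split_topic(value))
```

With two short lists in place of one long one, a handful of topics and a handful of roles can describe many combinations.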

See also: index, two-level; classification, faceted.

satikusala · December 22, 2020, 6:43pm

Without a doubt, you are correct! I said as much to my friend today. The nice thing is Tinderbox really helps point this out.

What I am curious about though, is if there is a method/process for creating an array (possibly using lookup() or something) to feed into different agents based on key terms or do each of these agents need to be set up manually. Not really sure yet…more thinking needs to be had (thanks for letting me think out loud and publicly).

For example, if I have a note with children, as illustrated below, is there a way to have this array take the name of the child notes or a value from the child note’s attribute, feed this into an agent or series of agents or rules and for each of the individual children notes get back results that include matched notes, counts and link associations to the other topical criteria like topic, role, priority, etc. Again, I’m thinking out loud and need to tease apart the problem into steps to accomplish what I want to do effectively. I’ll get there; but, ideas to stimulate the imagination are always welcome.

Finally, I’m not familiar with “index, two-level; classification, faceted.” I assume, as this is not linked to a specific reference that you’re generally referring to the idea of faceted classification (which is a new term for me; Ya! Another good day of learning), and a general Google search is warranted on the subject. Or, if there is something specific you have in mind? I’m all eyes and ears, and will start Googling now.

At any rate, I remain your ever grateful community Tinderboxer. Thanks for all the support everyone.

eastgate · December 22, 2020, 8:39pm

… or do each of these agents need to be set up manually.

My first impression is that these agents really only need to run once. You’re changing your representation; after this, you’ll enter new notes not as “ABM Market Trends” but as

Topic: ABM
Role: market trends

So, while it’s elegant (and possible) to automate the reclassification, it might not save you a lot of time compared to doing it with an agent, checking that what you’ve done is what you intended, and then deleting the agent or revising it to address a different topic.

Of course, if you get a new big spreadsheet every month from some publisher, and you can’t convince the publisher to use your new classification system, then sure, automation is a good idea!

mwra · December 22, 2020, 8:43pm

I use the method of: attribute values() → $Text → Explode a lot. Note that if linking via an agent, consider if you want the alias or its original to link to/from the exploded note. This is why actions like linkTo() have a linkToOriginal() sibling action to deal with just such a case.

mdavidson · December 23, 2020, 4:20pm

I may be mistaken but it seems to me that you are looking to cluster your notes into groups of similar content (links, attributes, text) but are not yet sure what these clusters and their descriptive elements should be ? The best approach from my perspective is still that of @eastgate which is to outline some basic criteria for grouping the notes based on your knowledge of their content. He suggests $Topic and $Role - you might have other insights.

My own approach recently was to use the Map view and start reading through the notes and push these to various areas of the Map depending on their content and my first impression of similarity. I then added some attribute values which described the spatial groups of notes I had identified. In parallel I created an Attribute Browser view which listed the notes under each attribute value and those not yet assigned, and continued within the AB view to see if it made sense. Over time you get a feeling for whether the grouping is good, can be changed etc… often referred to as the legendary incremental formalism.

Perhaps useful in your context is the rudimentary automatic grouping in TB through a misuse of the similarTo action code which can help identify TB notes similar to a reference note. I’ve briefly tried the following in an Agent on a small test TB file

Here Note 1 is the reference note and the query will match the 2 notes closest in content to the reference note. If you have a good idea of a note that best matches your perception of a particular category, this approach can help return similar notes in an automatic way. NB: the process is very, very slow and time-consuming, at least when I tried it. I’m hoping to see more clustering support in the future from TB, as it would be extremely useful for large numbers of notes.
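To illustrate the general "closest notes to a reference note" idea outside Tinderbox, here is a rough Python sketch that ranks candidates by word overlap (Jaccard similarity). This is a swapped-in technique chosen for simplicity; it is not a description of how Tinderbox computes similarity, and the helper names and sample titles are assumptions.

```python
import re

def tokens(text):
    """Lowercase word tokens of a piece of text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Shared words divided by all distinct words across both texts."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def closest(reference, candidates, n=2):
    """Return the n candidates with the highest overlap with the reference."""
    ranked = sorted(candidates, key=lambda c: jaccard(reference, c), reverse=True)
    return ranked[:n]

reference = "ABM measurement trends"
candidates = [
    "ABM measurement best practice examples",
    "ABM market trends",
    "Account and territory plans",
    "A-B Testing",
]
best_two = closest(reference, candidates, n=2)
print(best_two)
```

Word overlap on short titles is crude, so this is only useful as a first pass before reviewing the groups by hand.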

A last but similar idea. You could export your notes to Devonthink, create some reference groups there which you populate with prototype (e.g. reference) notes and use the DT AI to assign your notes to one or the other group.

I’m out of ideas beyond this. Good luck.

satikusala · December 24, 2020, 2:06pm

These are all great ideas! I’ve learned a bunch in this thread, including the ideas of,

incremental formalism
index, two-level; classification, faceted

And ways of using cross-tabs and clustering. Very cool!

Also, I’ve figured out a fun way to do quick searches on the data.

Here are the steps.

Create a note, e.g. “Priority”
Pull the values from the attributes of the main notes using the values() method, e.g. $Text=values("Priority").format("\n");
Explode the note
Create a new attribute, a set, that I will now use as a “filter” for queries, and then apply a stamp to populate the attribute with the values from the $Text, e.g. $QPriority=$Text.replace("\n",";");.
Create an agent that pulls the search query from the selected filter

Once you do this you have a quick and dirty query/filter system. It works really well.

Also, I’m now using the system that has evolved to do some automatic linking to create hyperbolic views from the data with query actions. I’ll explain this later, as I have one trick I’m trying to figure out with Mark’s help.

It truly is amazing how powerful TBX can be.

mwra · December 24, 2020, 4:32pm

I like it! I think the hyperbolic view is a powerful and, as yet, under-used view.