Attributes or tags - an example


(Mark Anderson) #1

As a side note to the thread “User Attributes or Tags?”, I thought I’d show an example from some of my current PhD research. In this case I’m looking at behaviour and outcomes in discussion surrounding bot use in Wikipedia: there are some 4.5k topics spread over 75 archive pages. Here you see some items from archive page #10. I’d copied the topic headings from the page indexes and then used Explode to create per-topic notes.

There’s nothing tidied here ‘for show’; this is data as worked on earlier today. Notice the large number of key attributes. Apart from $URL, $StartDate and $EndDate, the rest are user attributes whose rationale I’ll describe below. Many are empty as I’m using a 2-pass process: first getting date/person info whilst looking for incomplete source data (unsigned edits, wiki mark-up errors), and second reading for the purpose and outcome of the topic, in each case (supposedly) a request for a bot to do a task.

$URL. To save file bloat, and because Wikipedia is reliably online and I’m in a setting with 24/7 access, I’m reading the actual source web pages online. For research provenance purposes I have copies saved, but as these are archives it’s easier to use the online pages so as to access other parts of the wiki to trace broken/missing data. To make the URLs I took the topic name, URL-encoded it, then used a stamp to create the full URL (a stamp as I wanted more control with 4.5k items in scope).
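For those unfamiliar with the encode-then-concatenate step described above, here is a minimal Python sketch of the same idea. The base address and topic heading are invented placeholders, not the actual archive URLs used in the study:

```python
from urllib.parse import quote

# Hypothetical base address: the real archive URLs differ; this only
# illustrates building a full URL from a copied topic heading.
BASE = "https://en.wikipedia.org/wiki/Example_archive#"

def topic_url(topic_heading):
    """URL-encode a topic heading and append it to the base address."""
    return BASE + quote(topic_heading)

print(topic_url("Bot request: fix double redirects"))
```

In Tinderbox itself this would be done with a stamp writing to $URL, as the post describes; the sketch just makes the encoding step concrete.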

$TaskSet. (Set). The range of task(s) being requested/discussed.

$TaskSize. (Number). The number of pages to be edited (if mentioned).

$RefdBots. (Set). The name(s) of any bot account(s) actually named in the discussion.

$ReqOutcome. (Set). The outcome(s) of the request (if any!).

$StartDate. First edit in the topic.

$EndDate. Last edit in topic. Gives thread duration.

$WasMoved. (Boolean). Was this topic moved from elsewhere, or re-directed to another location?

$FirstUser. (String). Name of the original requester (normally the first edit unless moved).

$UserSet. (Set). List of all contributors to the topic.

$UserCount. (Number). Number of discrete editors in the thread.

$EditCount. (Number). Count of discrete edits (some people post more than one comment), i.e. thread length.

$HasAnon. (Boolean). Are any edits by anonymous (IP-based) users?

$MoverNameSet. (Set). For moved topics, the editors commenting re the move.

$MoverEditCount. (Number). The number of edits in the topic relating to the move as opposed to the request.

Note how the attribute names mainly include a word indicating the data type (I find this useful to separate list-type from single values and Sets from Lists, as Sets de-dupe and Lists don’t).
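The Set-vs-List distinction mentioned above can be shown in a few lines of Python (Tinderbox’s own Set and List types differ in detail, but the de-duplication point is the same; the names are invented):

```python
# Invented contributor names, just to show the difference:
edits = ["Alice", "Bob", "Alice", "Carol", "Bob"]

# List-like attribute: keeps every occurrence, duplicates included.
user_list = list(edits)

# Set-like attribute: de-dupes automatically (sorted here for a stable order).
user_set = sorted(set(edits))

print(user_list)  # ['Alice', 'Bob', 'Alice', 'Carol', 'Bob']
print(user_set)   # ['Alice', 'Bob', 'Carol']
```

So a List suits $EditCount-style tallies where repeats matter, while a Set suits $UserSet, where each person should appear once.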

This set of Key Attributes arose from several restarts on the early part of the corpus and sampling some later content. For now I’ve probably got what I need, with the granularity I need for later analysis. However, only a couple were created before I started working on the data. Before I’d read some of the data I simply couldn’t guess what might be there (indeed some of my assumptions were duly wrong and would have been wasted effort if pre-defined). Were all/most of the above just a set of keywords (e.g. in $Tags), it would be far harder to get at the threads within the metadata.

Sorry for all the detail, but for those starting with Tinderbox and trying to see why one might want more than just ‘tags’, I hope this helps. Obviously, not everyone will be doing a task like the above, but hopefully the deconstructive process is clear and can be applied to your own projects.

Edit: I meant to add, note the use of columns here. I don’t generally edit in them but use them to check I’m completing important attributes and to look for emergent trends. Avoid the temptation to display loads of columns and turn the display into a spreadsheet: I find it doesn’t aid clarity of thought and adds visual noise. 2-3 columns is fine, maybe about 5 if they’re Booleans, as those render as tick-boxes in columns and so take little screen width.

I hope this helps; if not, please ask. Sometimes picking apart an example can help anchor what is otherwise a quite abstract discussion.


User Attributes or Tags?
(Mark Anderson) #2

As it arose on the referenced thread, here’s another way such threaded metadata (i.e. in discrete attributes) can help. Let’s say we want a list of (unique) names of Wikipedia editors who participated in any of the requests being studied, i.e. the unique values in $UserSet.

So, we use the values() operator, which is designed for just this. See the linked article for detail on the scope/syntax. For instance, to get a Set (i.e. no dupes) of all $Tags values:

$MySet = values("Tags");

Let’s say I actually want to make a new note for each discrete user in $UserSet. I make a new note and give it this rule:

$Text=values("UserSet").isort.format("\n");

That gives me a sorted list of user names, one per line in the note’s text. I can then explode the note, giving me a note per user, named for that user. I can, if I want, now use action code to link each author to the topics to which they contributed. Or I can use an edict** to find all notes where the person is in $UserSet and get a count of their contributions.
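For readers more at home outside action code, that values().isort.format("\n") pipeline is, in spirit: de-dupe across notes, sort case-insensitively, join one name per line. A rough Python equivalent, using invented sample data:

```python
# Invented sample notes, each with a $UserSet-like field.
notes = [
    {"UserSet": ["Zoe", "Bob"]},
    {"UserSet": ["alice", "Bob"]},
]

# De-dupe across notes (like values()), then sort case-insensitively
# (like .isort).
unique_users = sorted({u for n in notes for u in n["UserSet"]}, key=str.lower)

# Join one per line (like .format("\n")): ready to explode into
# one note per user.
text = "\n".join(unique_users)
print(text)
```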

** Tip: as a document scales, edicts are a good choice over rules: if you’ve lots of actions doing finds across the whole document (e.g. does my $Name occur in $SomeAttribute?) this can hurt performance. That isn’t an issue in small docs (i.e. a hundred rather than thousands of notes).

In this case I’ll definitely be extracting a list of editors in a thread so I can compare it with a list (from another TBX) and identify which participants are also bot operators. IOW, I can see how the ‘experts’ are interacting with those requesting help. The start/end dates allow for timeline considerations: does activity vary over time (I’m looking at a 10+ year period)?

Do please ask questions of technique. The above isn’t here to present my work but rather to have a real-world TBX task within which to explore issues of how we can interpret the content of our own TBXs.


(Martin Boycott-Brown) #3

@mwra This is very interesting, and shows the value of real examples. The “data” I tend to work with are about as different as they could be from yours. Looking at your work suggests to me that User Attributes will be less useful to me than $Tags (keywords!!), though I may live to revise that idea!

My own PhD (which I started at a very advanced age) was based on a close textual analysis of the war diaries of two British First World War generals: Douglas Haig and Henry Rawlinson. If memory serves me correctly I had something like 80,000 words of text for each year of the war. Conventional coding techniques for this sort of work involve writing a word in the left margin of the text indicating what is being talked about in that area of the text, and in the right margin you write something that says HOW the thing is being talked about. So, I might write “artillery” in the left margin and “ineffective because of dud shells” in the right margin. (There are computer programs to do this sort of thing, but at the time I was working I found them horribly unwieldy and slow. I actually did most of my work in Scrivener, which worked very well.) I have just dug out my paper copy of Haig’s diary, and although I finished the work six years ago and have forgotten a lot, I can still scan down the page and immediately see the main themes that are present in the diary entries. (See picture – no right margin comments here because of lack of space.)

So, for example, it became apparent that at a particular time of the war, one of the things that concerned Haig was what we decided to call “deportment” – how people carried themselves and behaved in all sorts of situations. So I extracted some passages that exemplified this, and quoted them in the thesis with a brief analysis of what Haig wrote.

I have moved on since I did the thesis, but I am still mainly interested in identifying themes and important concepts in texts. Having had these very useful discussions on the forum, it seems to me that this is more a job for $Tags (keywords) than it is for User Attributes. True, there might be a lot of sense in having a User Attribute for $Themes, but I haven’t yet worked out what might be gained or lost by separating $Themes from other keywords. In the first example I gave, it is not just the fact that the main theme is “artillery” that interests us: it is the fact that it is “artillery as ineffective because of dud shells” – which we might later compare with “artillery as ineffective because personnel poorly trained”. It occurs to me now that if we had a tag/keyword “artillery” and another “poor_training”, we could find notes where both keywords appear. But we could also find all instances of “poor_training”, whether of infantry, cavalry or whatever, and separately all instances of “artillery”, whether good, bad, indifferent, or whatever. On second thoughts, however, it might emerge that having a separate User Attribute for “training” could be useful, because you could give it a value of “good;bad;indifferent” and colour your notes accordingly(!)
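The combine-or-separate point about keywords can be sketched in plain Python (the diary entries and tags below are invented examples; in Tinderbox an agent query would do this natively):

```python
# Invented notes tagged with theme keywords.
notes = [
    {"name": "Entry 1916-07-01", "tags": {"artillery", "dud_shells"}},
    {"name": "Entry 1916-09-15", "tags": {"artillery", "poor_training"}},
    {"name": "Entry 1917-04-09", "tags": {"cavalry", "poor_training"}},
]

def with_tags(notes, *wanted):
    """Names of notes carrying every one of the wanted tags."""
    return [n["name"] for n in notes if set(wanted) <= n["tags"]]

# Combined query: artillery AND poor_training.
print(with_tags(notes, "artillery", "poor_training"))  # ['Entry 1916-09-15']

# Separate query: all poor_training, whatever the arm of service.
print(with_tags(notes, "poor_training"))  # ['Entry 1916-09-15', 'Entry 1917-04-09']
```

The sketch shows why independent keywords stay flexible: each tag can be queried on its own or in any combination, without pre-committing to a category structure.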

In summary I think I’m inclining to the view that you have to listen to the data, and do what it tells you to do in response to whatever you are trying to find out from it. And the advantage of Tinderbox, from what I can tell, is that it will allow you to adapt as your understanding of your data develops. At one point in your research it might make sense to have $Tags, at another point it might make sense to have User Attributes because you can see categories emerging from a study of the data, then you might need to go back to $Tags again because you find the categories are too rigid for the material you are uncovering. Human behaviour and thought are particularly hard to pin down, and any “hard boundary” between one category and another is usually somewhat arbitrary, though some degree of categorisation is usually unavoidable. At the moment, therefore, I think I will have to use $Tags / keywords until such time as useful categories suggest themselves from what I am studying. At which point it should be relatively easy to add the relevant User Attributes.


Tinderbox for grading student projects and papers
(Paul Walters) #4

@MartinBoycott-Brown a mundane use of attributes in conjunction with $Tags in the example you gave could be a reader who tags selections of text, as you do, and adds attributes to record information about the context of the citation – e.g., $Page (a numerical attribute), $Relevance (perhaps a set whose default values are “high”, “medium”, “low”), and so on. I would use attributes in this sense when I want to use the attribute browser or an agent to evaluate notes.

E.g., an attribute browser query $Tags.icontains("hardness") and the chosen attribute $Relevance:
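What that query-plus-grouping produces can be mimicked in plain Python, to show what the view is doing (note names and values below are invented):

```python
from collections import defaultdict

# Invented notes with $Tags- and $Relevance-like fields.
notes = [
    {"name": "n1", "tags": {"hardness"}, "relevance": "high"},
    {"name": "n2", "tags": {"hardness", "colour"}, "relevance": "low"},
    {"name": "n3", "tags": {"colour"}, "relevance": "high"},
]

# Query: $Tags contains "hardness"; then group the matches by $Relevance,
# much as the Attribute Browser groups notes under a chosen attribute.
groups = defaultdict(list)
for n in notes:
    if "hardness" in n["tags"]:
        groups[n["relevance"]].append(n["name"])

print(dict(groups))  # {'high': ['n1'], 'low': ['n2']}
```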


(David Eddy) #5

I’m sort of working on a personal “PhD/ColdWar documentary/book/memoir” to be culled from a variety of resources.

One resource is a 28 page professional finding aid describing the contents of 58 archival boxes. Earlier this year—before discovering TB—I spent 20 hours going through a single box. Ouch. Best to step back & regroup.

This “tags” thread is good thinking material on how to restart that stalled research/data collection effort.