Raw XML import and manual processing

With 9.1 TBX got a new XML parser. Currently I struggle with a very basic task: I would like to AutoFetch a URL returning some XML data. This is no problem - but TBX parses the XML and returns the text without the structuring elements. Like this:

306150dcxml Dies ist ein großer VD17-Testtitel EST beigefügter oder kommentierter Werke Angabe von Paralleltiteln (nicht auf Haupttitelseite) Angabe von Nebentiteln test4200 Unterfeld Dollar b Zusätzliche Indexeinträge Abweichender Titel Originaltitel <engl.> Diedrich, Andrea , 1974- (BuchkünstlerIn) Keutmann, Markus Hachmann, Karen (IllustratorIn) (MitwirkendeR) Wiegandt, Birgit (WidmungsempfängerIn) Wiegandt, Lisa (Widmungsempfänger) Lange, Matthias (Hrsg.) Berger, Renate (GefeierteR) Langer, Sylvia Jaehde, Maik

instead of:

<?xml version="1.0" encoding="UTF-8"?>
<zs:searchRetrieveResponse xmlns:zs="http://docs.oasis-open.org/ns/search-ws/sruResponse"><zs:numberOfRecords>306147</zs:numberOfRecords><zs:records><zs:record><zs:recordSchema>dc</zs:recordSchema><zs:recordXMLEscaping>xml</zs:recordXMLEscaping><zs:recordData><oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:title xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:srw_dc="info:srw/schema/1/dc-schema">Notae &amp; Animadversiones in Dn. Guil. Ignatii Schützii, ICti, Manuale Pacificum: Quibus Ea, quae cum Principiis revelatis, nobiscum natis, Iuris item publici atque privati, in Imperio Romano-Germanico recepti, conveniunt, latius confirmantur, ea vero, quae his adversa, in praeiudicium &amp; odium Religionis Evangelicae, ab ipso asserta sunt, succincte examinantur &amp; refutantur</dc:title>
  <dc:title xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:srw_dc="info:srw/schema/1/dc-schema">Kupfert.: Manuale Pacificum. - Vorgebundenes Titelblatt u.d.T.: Manuale Pacificum Guil. Ignatii Schützii, Cum Notis Heydeni Borromei Riccrunti. Editio Novissima. Anno MDCCLII. - Weiteres Titelblatt u.d.T.: Manuale Pacificum, Seu Quaestiones Viginti, Ex Instrumento Pacis, Religionem eiusque Exercitium concernentes, compilatae &amp; in lucem editae, a Guil. Ignat. Schütz, ICto, eiusdem Pacis, olim ad partes Rheni ac Sueviae Executore subdelegato. Francofurti Anno M.DC.LIV. Postea Ab ipso Autore revisae, iam recusae, Iuxta Exemplar Spirense Anni 1683. ab eodem recognitum &amp; propria manu correctum. Anno M.DC.LXXXIX</dc:title>
  <dc:title xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:srw_dc="info:srw/schema/1/dc-schema">1654</dc:title>

How can I prevent the processing of the XML? $RawData is empty.
After getting the XML I would like to process the data and put it into corresponding custom attributes. According to the XSD:

<complexType name="oai_dcType">
  <choice minOccurs="0" maxOccurs="unbounded">
    <element ref="dc:title"/>
    <element ref="dc:creator"/>
    <element ref="dc:subject"/>
    <element ref="dc:description"/>
    <element ref="dc:publisher"/>
    <element ref="dc:contributor"/>
    <element ref="dc:date"/>
    <element ref="dc:type"/>
    <element ref="dc:format"/>
    <element ref="dc:identifier"/>
    <element ref="dc:source"/>
    <element ref="dc:language"/>
    <element ref="dc:relation"/>
    <element ref="dc:coverage"/>
    <element ref="dc:rights"/>
  </choice>
</complexType>

vd17.zip (34.1 KB)

Thanks!

I think you looked in the right place: $RawSource. But, I see it’s listed as experimental. I see $RawData is still created for new documents. I have a vague recollection that $RawData is not used if AutoFetch is able to parse the fetched content (no point in storing the data twice). You could check that directly with Eastgate.

As the XML is arriving, albeit over-parsed for your needs, you could try AutoFetch commands, which are action code. In turn this gives you access to the new stream processing and parsing tools, including those for XML processing.

Looking at your example XML data (online here) I notice it has 10 discrete records (possibly part of an overall recordset of 306,150 items; I guess the API chunks returns into blocks of 10 records). Record #1 has (in XML occurrence order):

  • 8 x dc:title elements
  • 16 x dc:contributor elements
  • 3 x dc:type elements
  • 2 x dc:date elements
  • 1 x dc:publisher element
  • 1 x dc:language element
  • 5 x dc:format elements
  • 17 x dc:description elements
  • 2 x dc:identifier elements
  • 3 x dc:relation elements
  • 2 x dc:identifier elements
  • 1 x dc:relation element
  • 1 x dc:identifier element

None of the other Dublin Core elements are used. Note that the dc:identifier and dc:relation elements are interleaved, not in contiguous per-type blocks.

Furthermore, subsequent records have different combinations of DC elements.
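Outside Tinderbox, a quick way to make such a tally is a few lines of Python with the standard-library ElementTree parser. This is only a sketch: the inline sample XML below is a made-up stand-in for the SRU response, not the real feed.

```python
# Sketch (outside Tinderbox): tally Dublin Core elements per record.
# The sample XML is invented; point ET.parse() at the saved feed instead.
from collections import Counter
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"  # the dc: namespace URI

sample = """<records>
  <record xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>A title</dc:title>
    <dc:title>Another title</dc:title>
    <dc:identifier>id-1</dc:identifier>
    <dc:relation>rel-1</dc:relation>
    <dc:identifier>id-2</dc:identifier>
  </record>
</records>"""

root = ET.fromstring(sample)
for i, record in enumerate(root.findall("record"), start=1):
    # Counter keys are the local element names, e.g. 'title', 'identifier'
    counts = Counter(el.tag.removeprefix(DC) for el in record)
    print(i, dict(counts))
```

Running this over the real response would print one line per record, making the uneven element spread visible at a glance.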

As to the XML parsing, I’m afraid that’s a bit beyond me, and I can’t find any fully worked-through example of stream-parsing XML in action code. I think you use Stream.xml(path) and/or Stream.xml.each(path){action} to locate the desired element, and then Stream.captureXML to ‘capture’ that element’s value and assign it to a Tinderbox attribute or variable.

I’m not sure if that has helped any. If nothing else, it did help me spot some typos and formatting errors in some of the linked aTbref articles (which are now fixed!). :slight_smile:

@mwra yes - you could pass a parameter to limit the number of records (500 is max). It is a scientific database of historic papers from the 17th century. And the interface is not very - let’s say - sophisticated.
I played with the AutoFetchCommand - but found no way to stop TBX parsing the XML.
For example, $Text=$Text; should stop the parser, if it is true that the AutoFetchCommand runs before any processing of the data is done. But no change… :frowning:
By the way: seems you have a tool to inspect the XML. May I ask what you are using outside of TBX?


Sadly, in terms of AutoFetch behaviour, I’ve exhausted my limited knowledge. The irony here is that Tinderbox is parsing the data and extracting it. But, just not in the way we want! So I don’t think it is a bug but more an issue of how to signal our intent to Tinderbox and it’s probably at a depth needing @eastgate’s ability to see inside the app.

How am I viewing the XML? I just copy-pasted your demo TBX data source URL into my browser (Safari, for me), hoping it would render the XML outline, and it does:

You can even expand/collapse container elements. I’m pretty sure both current Firefox and Chrome do the same (yes - I just checked!). The nice thing is we also get a bit of syntax colouring (Firefox uses a different font but is otherwise the same). Only Chrome lets you see the ‘raw’ XML source (under Sources), IOW not re-rendered as a DOM-style outline; the others show the rendered version where one normally finds the raw input (HTML, CSS, whatever), i.e. in its source code.

No special knowledge in the above, just experience: 30-odd years working as a data ‘emergency plumber’, so I’m much used to looking in raw form at things that ought to work … but don’t. I tend to look first at patterns in the data rather than at the code, as often the code is the problem.

Note, the last is not the problem here as the XML looks well formed. But we do see that each record:

  • does not necessarily contain a full spread of the defined dc (Dublin core) namespace elements
  • elements may often occur multiple times
  • elements of the same type do not always occur in succession
  • some empty elements occur

The last two may trip up parsing if you assume that same-type elements occur in contiguous blocks, that types list in a consistent order, or that elements of interest will always have a value. As this shows, those assumptions are unsafe for this XML feed, even if the XML code is valid. This reinforces that ‘clean’ code can still carry untidy/unordered data. From your test XML:

An immediate question, pertinent to parsing/extraction, is whether the dc:relation element(s) ‘belong’ to, i.e. are closely associated with, the dc:identifier they sequentially follow in the code, or whether all dc: elements stand separately, regardless of their encoded order in the source data.
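For what it’s worth, if the elements do stand separately, one tolerant extraction strategy is to group values by element type while skipping empties, so neither interleaving nor blank elements can trip the parse. A hedged Python sketch (the sample record is invented, not from the real feed):

```python
# Hypothetical sketch: group element values by type, skipping empty elements,
# so interleaved and empty same-type elements are handled safely.
from collections import defaultdict
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

record = ET.fromstring(
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:identifier>id-1</dc:identifier>'
    '<dc:relation>rel-1</dc:relation>'
    '<dc:identifier></dc:identifier>'      # an empty element
    '<dc:identifier>id-2</dc:identifier>'  # same type, not contiguous
    '</record>')

values = defaultdict(list)
for el in record:
    text = (el.text or "").strip()
    if text:  # ignore empty elements
        values[el.tag.removeprefix(DC)].append(text)
```

After this, `values['identifier']` holds both identifiers despite the interleaving, and the empty element leaves no trace. Note, though, that grouping this way deliberately throws away the sequential identifier/relation association, which is exactly the open question above.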

I’ve more, but this is a long post, so I’ll start another… (to follow)


A point the preceding flags up: it simply isn’t safe to assume the API/export-template creator coded the export with any knowledge of the data contained in the XML, or of the manner in which the recipient would parse it.

If the order of typed elements in a record is not significant, you might consider re-ordering the XML (how?) so all same-dc-type elements list sequentially. That way you can stream-parse needing only to test whether the next such element is of the same type or not, and adjust the target of the extracted data accordingly. Again, I’m not able to advise on the how of such an XML transform.
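As a sketch of what such a transform could look like outside Tinderbox (a hypothetical helper, not part of the actual feed or any Tinderbox feature), a stable sort of each record’s children by the XSD’s declaration order makes same-type elements contiguous while preserving their relative order. The caveat flagged above applies: it destroys any identifier/relation pairing.

```python
# Hypothetical sketch: re-order a record's children into the XSD's
# declaration order so same-dc-type elements become contiguous.
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"
DC_ORDER = ["title", "creator", "subject", "description", "publisher",
            "contributor", "date", "type", "format", "identifier",
            "source", "language", "relation", "coverage", "rights"]

def reorder_record(record):
    def key(el):
        local = el.tag.removeprefix(DC)
        return DC_ORDER.index(local) if local in DC_ORDER else len(DC_ORDER)
    # sorted() is stable, so within a type the original order is preserved
    children = sorted(record, key=key)
    for child in children:
        record.remove(child)
    record.extend(children)

record = ET.fromstring(
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:identifier>id-1</dc:identifier>'
    '<dc:relation>rel-1</dc:relation>'
    '<dc:identifier>id-2</dc:identifier>'
    '</record>')
reorder_record(record)
print([el.tag.removeprefix(DC) for el in record])
```

After the call the two identifiers sit next to each other, followed by the relation, so a stream parser only needs to notice type changes.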

I downloaded the XML data for your test and drag-dropped the (Safari-saved) ‘vd17.xml’ into your TBX. I get an empty note! But, having first taken care to add the built-in ‘Code’ prototype to the TBX and apply it to a new note, the XML can be copy/pasted safely into the note’s $Text without harm:

Depending on how much of this data you have, that might be a way to get it into a more easily parsed chunk of XML. Further, you could use a stamp to pass the $Text to BBEdit, where you could re-order the data elements and, if necessary, strip unwanted XML elements before putting the DC type-sorted data elements back into $Text.

Also, if putting auto-fetched data from this source into a note, disable $SmartLinks first, or information codes will get misdetected as USA phone numbers :roll_eyes: as here:

I also tried some explode actions on your recovered data and on the pasted-in XML note. Notes on the settings used are in the ‘exploded notes’ container for each. My edited version of your file is here: vd17-ed.tbx.zip (99.0 KB)

Hopefully some of that is of use. :slight_smile:


I think this is simply bad luck. At present, auto fetch sees whether the input can usefully be converted to a styled string. Typically, XML won’t do this; in your case, there’s enough text to fool the system.

We’ll add an explicit test for the .xml extension in the next update.

In the interim, instead of auto fetch, you could use curl to update the data from an edict or a stamp.


Thanks a lot for all your input and @mwra for the demo file.
I moved to FileMaker for this task. Maybe with 300,000 nodes TBX is not the way to go, but I liked the idea of working with the data in TBX :slight_smile:
Thanks!


Yes 300k source records is a bit of a stretch!

Then again, some projects I’m working on could, if fully expanded across the source data, head in that size direction. But by using a subset in Tinderbox I can explore the data so as to inform how it might be implemented in a formal database, albeit one which (importantly!) would lack Tinderbox’s flexibility with altering structure. :slight_smile: