Coding challenge! How to get an antiquated structured file into TB shape


(James Fallows) #1

OK, for the coding gurus here, a challenge that I bet is child’s play for most of you, but that seems very challenging for me.

Background: from back in the days when I used the lovely, doomed Lotus Agenda, in the 1990s–background here, my 1992 Atlantic article on Agenda available in PDF here–I have some research files whose data I’d like to be able to get at again.

Remember when I mentioned, in a different thread, about the virtue of Tinderbox, is storing its data in XML? Which means that it never truly can be orphaned? That was based partly on rough experience with Agenda, whose own native .AG files (which run only on DOS), really are orphaned.

But on one of my still DOS-capable machines, I’ve used Agenda’s “export” facility to send its data out to a “structured file.” My goal is to convert this structured info into a spreadsheet, so that I can then selectively bring it into Tinderbox.

Your mission, should any of you accept it: guide me on the macro or search-and-replace routine that will let me convert a structured file to spreadsheet-available form.
***
The file I want to convert contains about 2,000 items, all of which have the following regular structure:

  • {I} indicates the beginning of a new item
  • {T} begins the title, or $Name in TB terms, of the item
  • {N} marks the beginning of what would be the TB $Text field
  • Each of a series of category assignments – the equivalent of columns in a spreadsheet – begins with {C} and ends with {.}
  • The end of an item is marked with {!}

{I}{T}This is the name field
{N}This is the equivalent of the $Text field{S}
{C}First value field
{.}
{C}Second field
{.}
{C}A date field 03%/01%/1998 21:18
{.}
{C}Another date field 02%/16%/1998
{.}
{C}A numeric field #1.2
{.}
[A bunch of other fields…]
{C}Final numeric field #0
{.}
{!} [End of this item]

So we have regular markers: for the beginning and end of an item, for its name and its text field, and for the various values that I would like to convert to spreadsheet columns.

I imagine that for some of you, it would be five minutes’ work to suggest the macro that would let me put this into a form that would allow me to populate a spreadsheet with the values – and then import many of them to a TB file. If I’m right, and one of you has this on the top of your head, great! If it’s hard even for experts, that is useful to know too.

Thanks!


Love hate relationship and Tinderbox 7
(Paul Walters) #2

UPDATE what I originally posted in this reply is OBE – working with the data I see differences that I have accounted for in the BBEdit TextFactory posted here and discussed below.

http://d.pr/f/cMviqF+


I assume this is a plain text document, or could be.

I would use BBEdit or some text editor. I would search for the tokens and replace the first one ({I}) with a space. Replace the {N}, and {.}{C} tokens with a semicolon. The {S} with blank. The {.} with semicolon. The %/ with blank.

The tricky one is the final {!}. You want to replace this with a carriage return or new line. How you do that depends on the editor.

The point is to make a semi-colon delimited file. After you’ve done your replacements, add a new line at the beginning with the names of the attributes you want Tinderbox to create, and separate them with semicolons.

You should be able to save that as a .csv file and then import it into Tinderbox. (Yes, I know it’s not comma separated – Tinderbox doesn’t care if you use semicolons.)

Obviously, experiment with a few dozen items. If there are weird characters in the text you could run into problems – but if you use something like BBEdit you can use “zap gremlins” to get rid of the dreck.

You could of course write a macro with Keyboard Maestro, or write a text factory in BBEdit. Since you’re mainly just replacing one token with another, find and replace goes faster than writing and debugging macros.


(James Fallows) #3

This is exactly the kind of specific how-to I was hoping for. Thanks! And, yes, it is indeed a plain text file. Will give this a try and report back. Thanks!


(Mark Anderson) #4

I concur the suggestions above. As you’re going to a spreadsheet en route to Tinderbox, I think I’d go with tabs rather than semi-colons as Excel may be more picky than Tinderbox. Also, for Tinderbox at least semi-colon is the multi-value delimiter and may fool the app into an incorrect parse. My advice is to test these aspects on a piece of clean test data (i.e. no ambiguities in format) to help indicate where in the processes glitches, if any, occur when using your real data. Other tips:

Add a header row to your data with the desired target Tinderbox attribute names. For any new attributes, define those and set the data type before any import to avoid the import process having to guess. the headings are also useful when working on the data as a spreadsheet.

Look for gremlins such as line bears in the ‘text’ field(s). You may need to encode these as some other character then replace them once imported. This is because actual line break characters normally indicate the end of a a data record (row) in CSV/tab-delimited formats. There are other characters such as ‘vertical tab’ used for marking line breaks but some text editors render those as line breaks, obfuscating the issue.

If using a formal text/code editor like BBEdit, it is useful to turn on ‘invisibles’ (settings vary by app) so that characters like spaces, tabs, etc., are visible. Regex are precise so it helps to see a more accurate representation of the data.

For things like date or date/time source fields, ensure there is no other non-date time data in the field. Note that in Excel date/time and date cell values use different formats. As Tinderbox always assumes a time, I’d normalise all date-only data to date time by adding , say, ’ 08:00’ to any date-only dates. This gives you consistency and stops Tinderbox adding a time (which might be the system time as at import). By using a single common time for all ‘fake’ times in makes it easier to filter those from dates where the time is real.

For boolean source data, sometimes only the on/true/ticked value is stored. I’d convert any Boolean source to an explicit 0/1 or false/true.

For multi-value data, convert the existing value delimiter to a semi-colon.

I’ve done similar exercises, albeit from different start data formats, so what you’re trying is definitely achievable.

If you need help on specific regex, it would help to see actual specimen data as regex are very precise and generic description hide nuances in the raw data. Happy to correspond directly to keep data private. HTH.


(Paul Walters) #5

Just for clarity, @mwra is suggesting using regex replacement. I am not. I am suggesting using the vanilla find and replace function in whatever text editor you use.

I’d suggest avoiding Word for this kind of project. If you are using c1992+ data files from DOS-land, there’s a good chance they contain unusual non-printing characters. Word can be confused by this. If you don’t want to use BBEdit, use TextEdit.


(Mark Anderson) #6

Ah, apologies. For me the two blur into one (mainly as I do most of this sort of work in BBEdit). BTW, the Find dialog in BBEdit (and Text Wrangler) has a useful button that stored the last 10 uses search/replace string and the addition setting (case sensitive, etc.):

The button is useful for reloading find/replace data if you have <10 operations to do on several files (and don’t want to use the a Text Factory feature).


(James Fallows) #7

Again thanks to @PaulWalters and @mwra for these tips. All of it makes sense – I will start small with a controlled and cleaned-up sample of the data and move up from there.

I have long used Text Wrangler as my plain-text editor. I hadn’t realized that it’s part of the BBEdit family. I’ll see whether Text Wrangler can handle this on its own or whether it makes sense to move up to BBE too. Appreciate it. (Also, here is the link, previously omitted by mistake, to the PDF version of my 25-year-old article on Agenda.)


(Mark Anderson) #8

FWIW, I think Text Wrangler has ceased ongoing development and has been merged with BBEdit. Instead of two apps, BBEdit has a demo mode after which access to of some features (i.e. the old features missing in TW). So, to use the ‘current’ TW you should install BBEdit and just run in permanent demo mode.


(Paul Walters) #9

Here is a BBEdit TextFactory (a sort of macro) that does all of the replacing mentioned above. It works for me on the test data posted here, but in creating the posted test data I think there may be errors in what’s shown in the OP.

No worries – TextFactories are very easy to modify and require no obscure syntax (unlike Keyboard Maestro).

http://d.pr/f/cMviqF+


(eastgate) #10

I agree that this is likely to be pretty easy to convert to comma-separated values, or tab-separated values, which spreadsheets import readily. )OPML import might also be a possibility.)

This sort of file, with a decently-documented and (for its time) designed format, could also be the target of a straightforward custom importer. That’s not going to be necessary here, I think, but if you had thousands of Agenda documents to import – or if lots of Agenda users emerge from behind that curtain over there-- it’s something we could do here without great trouble.

That’s unlikely to be necessary in this specific case, but it’s worth mentioning for future readers.


(James Fallows) #11

Thanks for this TextFactory! I’ve just downloaded BBEdit and will give it a try. Sincerely grateful for all advice.

Update Mirabile dictu, this works! The TextFactory script puts the original Agenda-export structured file into shape to be pasted into a spreadsheet, where I can look it over and delete fields I don’t want any more. And then Tinderbox properly handles material copied-and-pasted from the spreadsheet itself. This has saved me a gigantic ton of effort. Many thanks.


(Paul Walters) #12

Cool. What once was lost is now found.