Fixing a semicolon-delimited file for CSV import (worked demo with files)

mwra · August 19, 2021, 3:17pm

Our starting point is a UTF-8, plain text file, with a header row and a data row:

Name;Quantity;Price;Sum;Note
Floggle-toggle;4,0;2,23;10,52;Used in "Freebies" Campaign 
Widget wedge;6,5;3,40;22,10;Out of stock, more due.

The content is chosen to test two issues with which arise when cleaning the format for import:

Some table fields contain a comma, both in numerical and textual fields.
Some fields contain straight-double-quotes.

This data might look like this in a spreadsheet^†:

The process needs to use straight-double-quotes to enclose data (cell) values. I could just enclose cells with a comma in them but for consistency it makes sense to apply the same treatment to all cells. However, this creates a problem for cells that already create straight-double-quotes. The technique there is to double each one which acts as a form of escape mark-up.

Task 1 - double up existing straight-double-quotes

So

Used in "Freebies" Campaign

needs to become

Used in ""Freebies"" Campaign

and in a bigger test to similar to all such occurrences of ".

The result, seen in BBEdit:

Using these settings:

Task 2 - add a straight-double-quote at the start and end of the source data line

Thus,

Using:

The \1 is a BBEdit back-reference inserting everything in parentheses in the query. So, being line-based, the query matches and references each line replacing it with the existing contents with straight-double-quotes before and after.

Task 3 - replace existing delimiters with commas between straight-double-quotes

The first and last cells per-line have an open or closing set of quotes. We just need to quote all the other cell boundaries whilst also setting the correct CSV delimiter.

using

We are done!

I then made a new TBX adding String-type user attributes ‘Quantity’, ‘Price’ and ‘Sum’ and ‘Note’. I used String type so my UK locale system didn’t (as in my first test ) coerce the German-style decimal-comma into a UK-style decimal-point. I then added a prototype setting those attributes as Displayed Attributes.

Next, I dragged in the semi-colon delimited data:

Notice how the lack of normal comma or tab delimiters mean the data is put directly into $Text. Now the ‘fixed’ file is added to the TBX:

The result is the header row is used to correctly make two new per-source row notes. Note: I set the prototype manually just to help illustrate the process. Notice how the comma in the note is not detected as a delimiter? This is because the contents of the cell was protected by the enclosing quotes. Likewise the in-cell double quotes are retained as they were escaped:

This zip contains the starting data, the ‘fixed’ version (both UTF-8 plain text) and the TBX illustrated above: Delim-demo.zip (29.5 KB)

†. I actually made this by importing my ‘fixed’ file back into Excel (as my settings are for UK), though it took some doing. Drag/drop or paste failed. I had to use data import and even then use non-default settings. As so often with MS Office, they have a different view of norms to everyone else.

archurhh · August 19, 2021, 5:05pm

This is brilliant, I will use it on a future import and revert. Thanks @mwra!!

abusch · August 19, 2021, 6:22pm

Thanks, @mwra, for your effort to demonstrate a way out of the problem. I look forward to trying this out on my Mac once I‘m back from my holiday!

mwra · August 19, 2021, 8:46pm

Took me a while to remember how (poor documentation in BBEdit manual) but if you have registered BBEdit you can use a Text Factory to save all the steps. Bear with me - it’s not exactly intuitive…

From the Find dialog, click the ‘g’ button (‘g’ for ‘grep’):

From the pop-up menu, choose ‘Save…’:

In the example above, I’d already saved the previous pattern as ‘semicolon to comma’ and you can see it is listed. Repeat so all three patterns are saved.

Now in BBEdit use menu File → New → Text Factory to get a Text Factory window:

From the pop-up menu to the right of the tickbox, select ‘Replace all’ and then the ‘Options…’ button:

Tick the ‘Grep’ box and then click the ‘g’ button. From the pop-up list (as seen above) and click the desired saved pattern which will populate the ‘search’ and ‘replace’ boxes. Click the ‘OK’ button:

If desired, click the ‘Comments…’ button and add a description of what the step does. For example here, you might note “Doubles every instance of a straight double-quote”.

From the bottom-left of the Factory window click the ‘+’ sign to add and configure the two extra steps (the steps are executed in order, top to bottom):

Leave the top box (for selecting files) empty.

Now use menu ‘File’ → ‘Save’. The default location is the BBEdit folder within Application Support. Choose the ‘Text Filters’, sub-folder and give the file a name, e.g. ‘Clean-CSV-format.textfactory’.

Now the set of actions is available via the BBEdit menu ‘Text’ → ‘Apply Text Filter’ → choose ‘Clean-CSV-format’ from the listed item(s). The process now runs on the currently selected open file.

abusch · August 20, 2021, 4:28am

What a great idea! Looks extremely useful.

mwra · August 20, 2021, 8:19am

Good. Sadly I can’t make this week’s meet-up but I’d be happy to walk through the process live, if it helps.

Sorry, one assumption I didn’t note: I’m assuming no (table cell) value has line breaks in it. These are harder to detect—as discrete from end of row line breaks—and may require manual fixing before running the above. Those regex can’t tell what is a record (row) end and what is an unexpected break inside a value.

If you are the creator of such data it is a good idea to replace in-value line breaks with another character (or string of characters) so you can replace them and re-insert the breaks once imported^†. For instance, if one value is some text with 3 paragraphs, thus two in-value line-breaks, I’d exchange those for somethigg like #### so all the text exports as a value without breaks. Then, once imported, just run a replace on that attribute to replace every discrete instance #### with \n.

†. This speaks to the fallacy to which we are all prone. That is: if I create data in App A and import it into App B and it fails, then App B is the cause of the problem. No! It may be, but you won’t know withuot checking both ends. But, in the above scenario the correct action is to fix the data export in App A or correct the errors in a text editor before trying to import into App B. This is something I wish people had told me earlier, so had to learn the hard (time-consuming way).