Requesting Idea on How To Use Command Line to edit the word/document.xml in a docx file

I’m getting close to automating the solution to a problem that has plagued most Pandoc users. Tinderbox is key.

When you convert an HTML file to docx with Pandoc, the table formatting is not applied. The only fix I have found is to go into the XML of the docx file and apply the change manually.

Here is an image of a docx file I’ve exported and converted from Tinderbox with Pandoc:
image

If you open this file with the BBEdit Disk Browser you’ll see this:

What I want is a runCommand that I can trigger from Tinderbox that will go into the newly created docx file, e.g. Test22.docx, go to the word/document.xml file, and then do a global replace of every instance of “Table” with “MyTable” in the following string: <w:tblStyle w:val="Table"/>

I’ve played with a script in the terminal that will unzip the docx file. We can then edit the document.xml file through the terminal (I’m not exactly sure what the commands would be) and then try to rezip the file, but rezipping fails.

With BBEdit, as shown above, I can manually go into the docx file with the Disk Browser and make the changes I want with the replace feature. I’m wondering if there is a way to automate this. Maybe with a text factory, not sure.

Again, my goal is to have TBX automate it all. In my export script I know the path and name of the newly created file (this file will be dynamic/unique every time I run an export out of Tinderbox). What I need, as noted above, is something that will open this file and edit it with an attribute value, e.g. “MyTable”, that I can pass via runCommand. Does anyone have any idea how to do this?

I don’t think you want to use Tinderbox to act on this file. It would be better to do this on the command line using sed and to call that script from Tinderbox.

Indeed, a quick Google reveals this: “How to replace a word inside a .DOCX file using Linux command line?” (Unix & Linux Stack Exchange), which looks promising.

It’s a bit beyond my CL confidence zone though to propose a full answer. I do wonder if you should be using XPath to parse/change the XML rather than a brute force find/replace on XML code?

Separately, if you are only using one table in the output DOCX, could you not set the CSS for the table element to the same as your class .MyTable? That might be a way of working around the find/replace need, at least until you need other tables in the doc with a different style.

I also found this: “How do I add custom formatting to docx files generated in Pandoc?” (Stack Overflow)

Perhaps I was not clear. I want Tinderbox runCommand to trigger an external script, which I can’t figure out how to write. I saw the article you shared but could not figure out how to use it for my purpose.


and “Markdown to docx, including complex template” (Stack Overflow).

This suggests a different approach, using a form of reference file that lets pandoc seed your extra styles into the default DOCX ones.

Yup, this is the process I shared in yesterday’s meetup. It DOES NOT work for formatting tables. The only way to adjust table formatting is to produce the docx, go into it, and edit the w:val value. If you unzip it, rezipping creates a corrupt file. I can do it manually with the BBEdit Disk Browser, but would LOVE to automate the flow.

Thx. Yup, saw this too. I’ve got it dialed in for nearly everything but table formatting. As noted above, as best I can tell you can’t programmatically adjust the table style. You need to manually set it in the Word template and then go into the XML to universally change it.

@mwra, you are a genius. Your nudging triggered something for me that helped me read the code I’d looked at before. I’m grateful. Getting close.

What is not clear to me is how to properly write the gsed command (I’m using gnu-sed), e.g.:

gsed -i 's/\<w:tblStyle w:val=\"Table\"\/\>/\<w:tblStyle w:val=\"Table\"\/\>' document.xml

I’m getting this error: gsed: -e expression #1, char 67: unterminated `s' command

I tried to escape the special characters, but clearly, I’m doing something wrong. Does anyone have any idea?
@rtalexander perhaps?

Ok, I have the gsed piece nearly working. @TomD suggested adding a `/` at the end, and that worked.

I then had the issue where it was only replacing the first occurrence of the pattern and not all occurrences (sed replaces only the first match on each line by default). Adding a `g` at the end of the gsed expression fixed that.

New gsed: gsed -i 's/w:val=\"Table\"/w:val=\"MyTable\"/g' document.xml
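
The effect of the closing delimiter and the `g` flag can be checked on a throwaway string before touching a real document.xml. This is a minimal sketch, assuming plain `sed` behaves like gsed for this expression; the sample XML is made up for illustration and mimics a document.xml with no line break between two tables:

```shell
# Made-up sample: two table style elements on ONE line.
sample='<w:tblStyle w:val="Table"/><w:p/><w:tblStyle w:val="Table"/>'

# Without the g flag, only the first match on each line is replaced.
first_only=$(printf '%s\n' "$sample" | sed 's/w:val="Table"/w:val="MyTable"/')

# With the g flag, every match on the line is replaced.
all=$(printf '%s\n' "$sample" | sed 's/w:val="Table"/w:val="MyTable"/g')

printf '%s\n%s\n' "$first_only" "$all"
```

Inside single quotes the shell passes double quotes through untouched, so the backslashes in front of them in the gsed command above should be optional.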

Next step: develop a series of TBX runCommand calls to automate the steps.

For example, I’d like to run the command, but I’m not sure how to set it up:

runCommand(/opt/homebrew/bin/gsed '-i \'s/w:val=\"Table\"/w:val=\"MyTable\"/g\' pathtofile/tmp/word/document.xml/document.xml')

Yes, because the find & replace for sed are enclosed—by default—in forward slashes:

sed 's/FIND/REPLACE/'

But as long as you change all 3 instances of the delimiter you can use other characters. For instance, if your FIND string contains actual forward slashes, such as in a file path, you can use a different delimiter:

sed 's#FIND#REPLACE#'

I think you can use any non-regex-special character, though clearly using letters or numbers makes little sense.
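
To illustrate the delimiter point, here is a small sketch that rewrites part of a file path both ways. The path and the `Desktop` target are made-up values used only for the comparison:

```shell
# Hypothetical path, used only to compare the two spellings.
path='/Users/mbecker/Documents/Test22.docx'

# With / as the delimiter, every slash in FIND and REPLACE needs a backslash.
with_slash=$(printf '%s\n' "$path" | sed 's/\/Documents\//\/Desktop\//')

# With # as the delimiter, the slashes can be written as-is.
with_hash=$(printf '%s\n' "$path" | sed 's#/Documents/#/Desktop/#')

printf '%s\n%s\n' "$with_slash" "$with_hash"
```

Both commands should produce the same result; the second is simply easier to read and to get right.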

The syntax for runCommand() is documented, i.e. runCommand(command,input). Therefore, you need to figure out which is the actual command and what are the input(s). Looking at this:

[quote="satikusala, post:9, topic:5915"]
`runCommand(/opt/homebrew/bin/gsed '-i \'s/w:val=\"Table\"/w:val=\"MyTable\"/g\' pathtofile/tmp/word/document.xml/document.xml')`
[/quote]

I see you will have a problem in that the command part uses both single and double quotes:

runCommand(/opt/homebrew/bin/gsed '-i \'s/w:val=\"Table\"/w:val=\"MyTable\"/g\' 

I don’t have gsed, just the macOS sed but I’d try:

var:string vSedStr = 's/w:val=\"Table\"/w:val=\"MyTable\"/g';
runCommand("/opt/homebrew/bin/gsed -i '"+vSedStr+"\'", 'pathtofile/tmp/word/document.xml/document.xml')

Thus the CL command part is:

"/opt/homebrew/bin/gsed -i '"+vSedStr+"\'"

using a variable to hold the find/replace strings and avoid the quote nesting problem, and the input part is:

'pathtofile/tmp/word/document.xml/document.xml'

…which is the full path (or a `~/` path relative to the home folder) to the file.

So if the target file is ‘Test22.docx’ in your ~/Documents/ folder the path for the above would be:

'~/Documents/Test22.docx/word/document.xml/document.xml'

or if your home folder short name were ‘mbecker’, the full path version would be:

'/Users/mbecker/Documents/Test22.docx/word/document.xml/document.xml'

Note that paths used are case-sensitive throughout, e.g. it matters whether you use ‘DOCX’ vs. ‘docx’ in the path.

I’m unsure (and have no time to build a test rig ATM), but as a DOCX is essentially a zip file (i.e. a compressed container) I don’t know if such a file can be addressed on the fly via a CL. I used BBEdit to check the paths, but BBEdit is also unpacking/parsing the zipped content and a CL call may not do that.
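
On the question of reaching inside the container: tools such as sed treat the path as an ordinary file path and will not look inside a zip archive by themselves, but `unzip -p` can stream a single member to stdout without unpacking anything to disk. The sketch below uses that to report which table styles a docx currently uses. The function name `list_tbl_styles` is hypothetical, and it assumes the Info-ZIP `unzip` that ships with macOS:

```shell
# list_tbl_styles (hypothetical name): print each distinct w:tblStyle
# value found in a .docx, with a count, without extracting the archive.
list_tbl_styles() {
    # unzip -p writes the named archive member to stdout.
    unzip -p "$1" word/document.xml |
        grep -o '<w:tblStyle w:val="[^"]*"' |
        sort | uniq -c
}

# Example call (path taken from the thread):
#   list_tbl_styles ~/Documents/Test22.docx
```

This only reads the file, so it is a safe way to confirm what needs replacing before, and what was replaced after, any edit.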

HTH

Your command line script could always unzip the file into a temp directory, do whatever needs doing, and then zip it back up.


… and the script should also delete the temporary unzipped data to avoid leaving process litter.

Yup, that is the plan.
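
The unzip, edit, rezip, and clean-up plan can be sketched as a single shell function. This is an untested-against-Word sketch under several assumptions: `fix_docx_table_style` is a hypothetical name, the Info-ZIP `zip` and `unzip` tools are available, and the new style name contains no characters that are special to sed (such as `#` or `&`). Instead of rezipping the whole folder, it extracts only word/document.xml and then updates that one entry in the existing archive, which sidesteps the problem of rebuilding the archive layout by hand:

```shell
# fix_docx_table_style (hypothetical name): replace the "Table" table
# style in a .docx with another style name, then clean up.
#   $1 = path to the .docx    $2 = new style name (default: MyTable)
fix_docx_table_style() {
    docx=$1
    new_style=${2:-MyTable}

    # Resolve an absolute path, because the work happens in another directory.
    docx_dir=$(cd "$(dirname "$docx")" && pwd) || return 1
    docx_abs="$docx_dir/$(basename "$docx")"

    work=$(mktemp -d) || return 1
    status=0
    (
        cd "$work" || exit 1
        # Extract only the member that needs editing.
        unzip -q -o "$docx_abs" word/document.xml || exit 1
        # Write to a new file and move it back; this avoids the differing
        # -i syntax of GNU sed and macOS sed. Matching the element prefix
        # leaves other w:val attributes alone.
        sed "s#<w:tblStyle w:val=\"Table\"#<w:tblStyle w:val=\"$new_style\"#g" \
            word/document.xml > word/document.xml.new || exit 1
        mv word/document.xml.new word/document.xml || exit 1
        # Update that single entry; running from inside $work keeps the
        # stored name exactly word/document.xml.
        zip -q "$docx_abs" word/document.xml || exit 1
    ) || status=1

    rm -rf "$work"   # remove the temporary files whether or not it worked
    return "$status"
}

# Example call (path taken from the thread):
#   fix_docx_table_style ~/Documents/Test22.docx MyTable
```

Saved as a script that takes the docx path and style name as arguments, something like this could presumably be triggered from runCommand after the Pandoc export; the quoting needed on the Tinderbox side is not covered here.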