Occasional Notes

We’re using Apache POI to manipulate the content of some Word documents. There are other ways to do it, but, on the whole, Apache POI works reasonably well for a nominally free solution. We’ve hit a use case that can be summarised by a simple question: does this Word document contain a (Word-generated) table of contents (TOC)? You would think that that is a reasonably uncontroversial question, perhaps even one commonly asked. Apparently it is not.

Background

The background here is that I know nothing about TOC generation in Word beyond what I’ve been able to deduce from examining Word’s behaviour and trawling the content of word/document.xml. I gather that Word inserts a field instruction of some kind, but also renders static content into the file. That is, there’s a marker saying “there is a TOC in this document”, but the TOC content itself is also rendered. It seems that instead of dynamically generating the TOC content (say, every time the document is changed), Word generates it once, and it is then only updated on a manual re-generation. So the problem we’re facing is:

  • A document has a TOC.
  • We make changes to the body content: say, removing an entire section.
  • The TOC is now stale, and instead of automatically refreshing it, Word inserts error messages at print time.

A basically satisfactory workaround in our case is to call enforceUpdateFields() on the document prior to save, which signals to Word to show a dialog on next load asking whether to update the fields in the document.

Again, this isn’t ideal, but it is satisfactory.
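The workaround above can be sketched roughly as follows. This is a minimal illustration, assuming Apache POI (poi-ooxml) is on the classpath; the class and method names are hypothetical, and only enforceUpdateFields() and write() are POI calls.

```java
import java.io.IOException;
import java.io.OutputStream;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

// Hypothetical helper class, named here only for illustration.
class StaleTocWorkaround {

    /**
     * Flags the document so that Word offers to refresh its fields
     * (including any TOC) the next time it is opened, then writes it out.
     */
    static void saveWithFieldUpdatePrompt(XWPFDocument document, OutputStream out)
            throws IOException {
        // Sets the "update fields" flag in the document settings.
        // POI does not regenerate the TOC itself; Word does that on load.
        document.enforceUpdateFields();
        document.write(out);
    }
}
```

The TOC in the saved file is still stale at this point; the flag only defers the refresh to Word, which is why the user sees a dialog.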

Solution

Apache POI doesn’t expose anything useful in its high-level API for detecting an existing TOC. After an exhaustive Google search, and quite a bit of digging around in the lower-level class hierarchies, it wasn’t obvious that we could solve this at any level using Java alone.

Inspecting word/document.xml suggested that a field instruction that looked something like this was present in all documents containing TOCs:

<w:instrText xml:space="preserve"> TOC \o "1-3" \h \z \u </w:instrText>

How about if we get the XML for the document and search for such an element? If we call getDocument() on the XWPFDocument, we get a CTDocument1 which implements XmlObject and provides a selectPath() method to select nodes via an XPath expression. (If you’re curious, it took a couple of hours of trial and error to be able to come up with the facts in the preceding sentence!) Firstly, add XMLBeans and Saxon to your POM:

(Again, that excerpt represents an hour of fun trying to assemble mutually compatible versions of POI, XMLBeans and Saxon, as well as answering the question “Do we also need xmlbeans-xpath?” Spoiler: we don’t.) Then, with an XWPFDocument called document, find any w:instrText elements, where w is a namespace which we’ll also define, and see if any of them contain a magic string:
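As a rough, dependency-free sketch of that check (not the XMLBeans selectPath() call itself), the following uses the JDK's built-in javax.xml.xpath against the raw XML of word/document.xml. The class name TocFieldCheck and the method hasTocField are hypothetical, and the "magic string" test assumed here is simply that the instruction text begins with TOC.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of POI or XMLBeans.
class TocFieldCheck {
    // The WordprocessingML namespace that the w: prefix stands for.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns true if any w:instrText element holds a TOC field instruction. */
    static boolean hasTocField(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the w: prefix to resolve
            Document dom = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(documentXml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return "w".equals(prefix) ? W_NS : XMLConstants.NULL_NS_URI;
                }
                // Reverse lookups are not needed for evaluating this expression.
                @Override public String getPrefix(String uri) { return null; }
                @Override public Iterator<String> getPrefixes(String uri) { return null; }
            });

            NodeList nodes = (NodeList) xpath.evaluate(
                "//w:instrText", dom, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                String instruction = nodes.item(i).getTextContent().trim();
                // The "magic string": an instruction that begins with TOC.
                if (instruction.equals("TOC") || instruction.startsWith("TOC ")) {
                    return true;
                }
            }
            return false;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<w:document xmlns:w='" + W_NS + "'><w:body><w:p><w:r>"
            + "<w:instrText xml:space='preserve'> TOC \\o \"1-3\" \\h \\z \\u </w:instrText>"
            + "</w:r></w:p></w:body></w:document>";
        System.out.println(hasTocField(xml));
    }
}
```

With POI, the same XPath expression would be passed to selectPath() together with a namespace declaration for w. One caveat either way: this only looks at individual w:instrText elements, so it would likely miss an instruction that Word has split across several runs.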

So, it’s brute force and depends on a magic string, but it seems to work. Better solutions gladly accepted!
