Document conversion to W2ML

Character encoding

W2ML documents are XML documents, so they should be encoded in Unicode (UTF-8 or UTF-16) for best compatibility, because XML does not guarantee support for any other encoding. Note that US-ASCII is a subset of UTF-8, i.e. a text file encoded in US-ASCII is already encoded in UTF-8.
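
Because US-ASCII is a subset of UTF-8, a file that contains only 7-bit bytes needs no recoding at all. The following is a minimal sketch of such a check; is_ascii is a hypothetical helper name introduced here for illustration, and it relies only on the standard tr and wc utilities:

```shell
# is_ascii FILE: succeed when FILE contains only 7-bit (US-ASCII) bytes.
# is_ascii is a hypothetical helper name, not part of recode or Tidy.
is_ascii() {
    # Delete every byte in the range 0-127; any byte left over is non-ASCII.
    [ $(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c) -eq 0 ]
}
```

For example, is_ascii infile && echo "already UTF-8" reports files that can be used as they are.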

Conversion example

Here is how you can convert a Windows-1252 text file (infile) to a UTF-8 text file (outfile) using GNU recode:
recode -d Windows-1252..UTF-8 <infile >outfile

The same command also works for the ISO-8859-1 (aka ISO Latin 1) encoding, because its printable characters are a subset of Windows-1252.

XHTML™ well-formedness and validity

W2ML imposes no validity constraint on documents, but only well-formed XML documents can be parsed. This means that HTML files, or tag-soup files, must be converted to XHTML™ before you add W2ML markup to them.

Entity references

An XML document that has no external DTD and contains undefined entity references is not well-formed. All HTML character entity references can be replaced by the Unicode characters they stand for, except the five predefined XML entities: &lt;, &gt;, &amp;, &apos;, and &quot;. Those five need no replacement, because they can be used in any XML document.
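
As an illustration of the replacement, here is a minimal sketch that rewrites a handful of named HTML entities as numeric character references with sed. The function name named_to_numeric is hypothetical and the list covers only four entities; a complete conversion needs the full HTML entity table, which a dedicated tool handles better:

```shell
# named_to_numeric: filter stdin, replacing a few HTML named entities
# with numeric character references. Hypothetical helper; the list is
# deliberately incomplete and only illustrates the principle.
named_to_numeric() {
    # In a sed replacement, "\&" is a literal ampersand.
    sed -e 's/&nbsp;/\&#160;/g' \
        -e 's/&eacute;/\&#233;/g' \
        -e 's/&egrave;/\&#232;/g' \
        -e 's/&copy;/\&#169;/g'
}
```

The predefined XML entities such as &amp; are left untouched, since they are valid in any XML document.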

System identifier in document type declaration

In XML, unlike SGML, a public identifier may not appear without a system identifier in the document type declaration (see the doctypedecl and ExternalID productions in the XML 1.0 recommendation). This means that the following declarations, often used for HTML 2.0 and HTML 3.2, are not well-formed XML:
<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN">
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
But this HTML 4.01 declaration is well-formed XML:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
("-//W3C//DTD HTML 4.01//EN" is the public identifier, and "http://www.w3.org/TR/html4/strict.dtd" is the system identifier).
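
To find files that still carry such a declaration, a rough check with grep can help. This is only a sketch under stated assumptions: the function name doctype_lacks_system_id is hypothetical, the declaration must sit on a single line, and the public identifier must be double-quoted. Matching is case-insensitive because HTML files often write the keywords in lower case:

```shell
# doctype_lacks_system_id FILE: succeed when FILE has a DOCTYPE
# declaration with a PUBLIC identifier but no system identifier.
# Hypothetical helper; assumes a one-line, double-quoted declaration.
doctype_lacks_system_id() {
    grep -Eiq '<!DOCTYPE[[:space:]]+[^[:space:]>]+[[:space:]]+PUBLIC[[:space:]]+"[^"]*"[[:space:]]*>' "$1"
}
```

A file for which the function succeeds needs its declaration completed or regenerated before an XML parser will accept it.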

Conversion example

Using HTML Tidy, here is how you can convert a UTF-8 encoded HTML file into a well-formed XHTML™ document. To avoid undefined entity references, the -n argument of Tidy converts HTML character entity references into numeric character references (for example, it converts &eacute; to &#233;):
tidy -n -utf8 -asxhtml infile >outfile

It may also be useful to check the following tidy.conf options:

# Whether Tidy adds a generator meta element to the output.
tidy-mark: no
# How Tidy generates the doctype declaration.
doctype: auto
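
The command-line arguments used on this page can also be kept in the configuration file. The sketch below writes a tidy.conf whose options are meant to mirror -i, -wrap 120, -m, -q, -n, -utf8 and -asxhtml; the file name and its location in the working directory are assumptions, so adjust them to your setup:

```shell
# Write a Tidy configuration file mirroring the command-line flags
# used on this page (file name and location are assumptions).
cat > tidy.conf <<'EOF'
# -asxhtml: write well-formed XHTML.
output-xhtml: yes
# -n: write numeric rather than named entity references.
numeric-entities: yes
# -utf8: read and write UTF-8.
char-encoding: utf8
# -i and -wrap 120: indent block content, wrap at column 120.
indent: auto
wrap: 120
# -m and -q: modify the file in place, suppress nonessential output.
write-back: yes
quiet: yes
tidy-mark: no
doctype: auto
EOF

# Usage (not executed here): tidy -config tidy.conf infile.html
```

Keeping the options in one file makes it easier to apply the same settings to a whole site.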

Site conversion example on Unix

Here is a shell script that uses GNU recode and HTML Tidy to convert a Windows-1252 encoded HTML file (possibly tag soup) into a UTF-8 encoded XHTML file. It also works for ISO-8859-1 and US-ASCII input:

#!/bin/sh
recode Windows-1252..UTF-8 "$@"
tidy -i -wrap 120 -m -q -n -utf8 -asxhtml "$@"
Or, with less output:
#!/bin/sh
recode Windows-1252..UTF-8 "$@"
tidy -i -wrap 120 -m -q -n -utf8 -asxhtml "$@" 2>&1 | grep -Ev 'lacks|proprietary'
Or, with no diagnostics at all:
#!/bin/sh
recode Windows-1252..UTF-8 "$@"
tidy -i -wrap 120 -m -q -n -utf8 -asxhtml -f /dev/null "$@"
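
Running such a script twice on the same file would recode text that is already UTF-8 and garble its non-ASCII characters. One possible guard, sketched below, is to recode only files that do not decode cleanly as UTF-8. The names is_utf8 and convert_file are hypothetical, the check uses iconv (a tool not otherwise used on this page), and it is a heuristic: a Windows-1252 file whose bytes happen to form valid UTF-8 sequences would be skipped.

```shell
# is_utf8 FILE: succeed when FILE decodes cleanly as UTF-8.
# Hypothetical helper built on iconv, which fails on invalid input.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# convert_file FILE: recode only when needed, then run Tidy with the
# same arguments as the scripts above. Hypothetical helper.
convert_file() {
    if ! is_utf8 "$1"; then
        recode Windows-1252..UTF-8 "$1" || return 1
    fi
    tidy -i -wrap 120 -m -q -n -utf8 -asxhtml "$1"
}

# Usage (not executed here): for f in "$@"; do convert_file "$f"; done
```

Pure US-ASCII files pass the check as well, which is harmless because recoding them would change nothing.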

To recursively convert all *.htm and *.html files of the working directory, you can use find:
find . -type f \( -iname \*.htm -or -iname \*.html \) -exec myscript '{}' \;
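
Since the conversion modifies files in place, it can be worth previewing which files the find command selects before running it. A minimal sketch, using the same tests but printing the names instead of executing the script (list_html_files is a hypothetical helper name):

```shell
# list_html_files [DIR]: print the .htm and .html files under DIR
# (default: the working directory) without converting anything.
list_html_files() {
    find "${1:-.}" -type f \( -iname '*.htm' -o -iname '*.html' \) -print
}
```

The -iname test matches case-insensitively, so files such as INDEX.HTM are listed too.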

Note that we keep the same file name extensions, so links are not broken. Of course, the HTTP server must be configured to handle .html files with the W2ML handler.

Other tools

Useful conversion tools are GNU Wget, GNU recode, HTML Tidy, and xsltproc from the XSLT C library for GNOME. An XSLT processor can be used to systematically add W2ML markup with a command like xsltproc --novalid --nonet sheet.xslt example.html (the content of sheet.xslt will depend on the structure of your XHTML documents).