W2ML documents are XML documents, so they should be encoded in Unicode (UTF-8 or UTF-16) for best compatibility: these are the only encodings that every XML processor is required to support, and support for any other encoding is not guaranteed. Note that US-ASCII is a subset of UTF-8, i.e. a text file encoded in US-ASCII is already encoded in UTF-8.
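Since US-ASCII only uses the byte values 0x00 to 0x7F, a quick way to check whether a file is pure US-ASCII (and therefore already valid UTF-8) is to count the bytes outside that range. The following is a minimal sketch using only standard POSIX tools; the function name count_non_ascii is made up for this illustration.

```shell
#!/bin/sh
# count_non_ascii (hypothetical helper name): read standard input and print
# the number of bytes that are outside the US-ASCII range 0x00-0x7F.
count_non_ascii() {
    # Delete every US-ASCII byte, count what is left, strip wc's padding.
    LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '
}

# Demo on inline data: the second input contains the byte 0xE9
# ("é" in Windows-1252), which is not US-ASCII.
printf 'plain text\n' | count_non_ascii
printf 'caf\351\n' | count_non_ascii
```

A result of 0 means that the input needs no conversion. With a real file, the helper would be called as count_non_ascii <infile.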
Here is how you can convert a Windows-1252 text file (infile)
to a UTF-8 text file (outfile) using
GNU recode:
recode -d Windows-1252..UTF-8 <infile >outfile
The same command also works for the ISO-8859-1 (aka ISO Latin 1) encoding because, apart from the rarely used C1 control codes (0x80 to 0x9F), it is a subset of Windows-1252.
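To see what such a conversion does at the byte level, here is a small sketch. It uses iconv rather than GNU recode, swapped in only because iconv is commonly preinstalled; it assumes an iconv that knows the charset name WINDOWS-1252. The helper name to_hex is made up for this illustration.

```shell
#!/bin/sh
# to_hex (hypothetical helper name): print standard input as one string of
# lowercase hexadecimal digits.
to_hex() {
    od -An -tx1 | tr -d ' \n'
}

# Byte 0xE9 is "é" in both ISO-8859-1 and Windows-1252;
# in UTF-8 it becomes the two bytes 0xC3 0xA9.
printf '\351' | iconv -f WINDOWS-1252 -t UTF-8 | to_hex
echo

# Byte 0x80 is the euro sign in Windows-1252 only;
# in UTF-8 it becomes the three bytes 0xE2 0x82 0xAC.
printf '\200' | iconv -f WINDOWS-1252 -t UTF-8 | to_hex
echo
```

Bytes below 0x80 are left unchanged by the conversion, which is why a US-ASCII file passes through unmodified.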
W2ML imposes no validity constraint on documents, but of course only well-formed XML documents can be parsed. This means that HTML files, or tag-soup files, must be converted to XHTML™ before you add W2ML markup to them.
XML documents containing undefined entity references with no
external DTD are not well-formed.
Note that all HTML character entity references can be replaced by Unicode characters,
except the predefined XML entities: &lt;, &gt;,
&amp;, &apos;, and &quot;.
These predefined entities, however, can be used in any XML document.
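Before conversion, it can help to see which named entity references a file actually uses. The following is a rough sketch, not a real XML check: it lists every named reference other than the five predefined ones (lt, gt, amp, apos, quot). It assumes a grep that supports the -o option (GNU and BSD grep do), and the function name list_named_entities is made up for this illustration.

```shell
#!/bin/sh
# list_named_entities (hypothetical helper name): read standard input and
# print, one per line, each distinct named entity reference that is NOT one
# of the five predefined XML entities.
# Rough heuristic: it does not skip comments or CDATA sections.
list_named_entities() {
    grep -o -E '&[A-Za-z][A-Za-z0-9]*;' |
        grep -v -E '^&(lt|gt|amp|apos|quot);$' |
        LC_ALL=C sort -u
}

# Demo on an inline fragment.
printf '%s\n' '<p>caf&eacute; &amp; th&eacute;&nbsp;&lt;3</p>' | list_named_entities
```

An empty result suggests that the file contains no entity reference that would be undefined without an external DTD.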
In XML, unlike SGML, a public identifier may not appear without
a system identifier in the document type declaration (see the
doctypedecl
and ExternalID
productions in the XML 1.0 recommendation). This means that the following declarations,
often used for HTML 2.0 and HTML 3.2, are not well-formed XML:
<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN">
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
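Declarations of this kind can be spotted mechanically. The following is a minimal sketch, assuming that the declaration fits on one line, uses double quotes, and is written in upper case as XML requires; the function name has_public_only_doctype is made up for this illustration.

```shell
#!/bin/sh
# has_public_only_doctype (hypothetical helper name): succeed (exit status 0)
# when standard input contains a DOCTYPE declaration with a public identifier
# but no system identifier.
# Assumes the declaration fits on one line and uses double quotes.
# Matching is case-sensitive.
has_public_only_doctype() {
    grep -q -E '<!DOCTYPE[[:space:]]+[^[:space:]>]+[[:space:]]+PUBLIC[[:space:]]+"[^"]*"[[:space:]]*>'
}

# Demo on one of the declarations above.
if printf '%s\n' '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">' | has_public_only_doctype
then
    echo 'public identifier without system identifier'
fi
```

A declaration in which a second quoted string (the system identifier) follows the public identifier does not match, so it is not reported.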
But this HTML 4.01 declaration is well-formed XML:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
("-//W3C//DTD HTML 4.01//EN" is the public identifier, and
"http://www.w3.org/TR/html4/strict.dtd" is the system identifier).
Using HTML Tidy,
here is how you can convert a UTF-8 encoded HTML file
into a well-formed XHTML™ document. To avoid undefined
entity references, the -n argument of Tidy converts
HTML character entity references into numeric character references
(it converts &eacute; to &#233;):
tidy -n -utf8 -asxhtml infile >outfile
It may also be useful to check the following tidy.conf options:
# Does Tidy add a meta element.
tidy-mark: no
# Does Tidy generate a doctype declaration.
doctype: auto
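Here is a shell script that uses GNU recode and HTML Tidy to convert, in place, a Windows-1252 encoded HTML file (possibly tag soup; this also works for ISO-8859-1 and US-ASCII) into a UTF-8 encoded XHTML file:
#!/bin/sh
# Convert the file given as first argument to UTF-8, in place.
recode Windows-1252..UTF-8 "$1"
# Rewrite it in place (-m) as indented (-i) XHTML, quietly (-q).
tidy -i -wrap 120 -m -q -n -utf8 -asxhtml "$1"
The same script, with the Tidy messages that contain "lacks" or "proprietary" filtered out:
#!/bin/sh
recode Windows-1252..UTF-8 "$1"
tidy -i -wrap 120 -m -q -n -utf8 -asxhtml "$1" 2>&1 | grep -v -E 'lacks|proprietary'
The same script, with all Tidy messages discarded:
#!/bin/sh
recode Windows-1252..UTF-8 "$1"
tidy -i -wrap 120 -m -q -n -utf8 -asxhtml -f /dev/null "$1"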
To recursively convert all *.htm and *.html files
of the working directory, you can use find:
find . -type f \( -iname \*.htm -or -iname \*.html \) -exec myscript '{}' \;
Note that we keep the same file name extensions,
so links are not broken. Of course, the HTTP server must be configured
to handle .html files with the W2ML handler.
Useful conversion tools are
GNU Wget,
GNU recode,
HTML Tidy, and
xsltproc
from the XSLT C library for GNOME.
An XSLT processor can be used to systematically add W2ML markup
with a command like
xsltproc --novalid --nonet sheet.xslt example.html
(the sheet.xslt will depend on the structure of your XHTML documents).
Last update: 2008-10-15 20:55 UTC |
✉ info@w2ml.org