(x)html, css, latex, perl

web + print frustrations

Sometimes one wants to write a document that can be viewed both in print form and on the web. In my experience so far, there is not yet a good way to do this from a single source document. I have tried several different methods, each with its own set of complications. I have not given up hope yet, and perhaps someone else can suggest a better method.

First, why would someone want to do this? I think there are a few cases where a document should be available in both formats:

  1. Program documentation
  2. Short tutorials
  3. Curriculum Vitae

The last one is what I have been focusing on. Almost all academics have their CV online nowadays. Some have it in html format, many have it in pdf format, and of course some only have it in a terrible format like .doc. Personally, if I am going to view a CV online, I prefer it to be in html. For very complex documents with lots of figures and graphics, pdf is usually better, but a CV is usually not that complicated, and it is much nicer to read in html. However, if one is applying for jobs (as I am), it is crucial to send a copy of the CV, and this should be either in hard copy or in pdf format. Below I describe the two approaches I have taken to this conundrum.

starting with latex

As you may have gleaned, I like LaTeX quite a bit. It is quite simple in many ways, yet it can also handle some really complicated documents, and it produces really nice postscript and pdf files. As I write this, I am printing off a 3′ x 5′ poster that I made with LaTeX (I’ll make a separate post about making posters with LaTeX). LaTeX was designed well before the world wide web existed, so it was designed with print in mind, not the web. That being said, there are some packages and utilities that do a decent job of converting LaTeX to html. The two best that I have found so far are latex2html and tex4ht. Both have their advantages and disadvantages.

latex2html takes a direct LaTeX-to-html approach. Essentially it tries to do all the same things standard LaTeX can do, but instead of producing a dvi, it produces html. The drawback of this approach is that it does not handle all the packages that standard LaTeX does. It also produces some pretty ugly, out-of-date html: the latest version you can specify is 4.01, which is not very satisfactory if you try to adhere to web standards. It also uses some strange heading code. With the article class, sections are coded with <h1> tags. Normally h1 is used for the title of a page, and only once, though see this article from A List Apart about why using h1 for the title text may be a bad idea.

tex4ht uses an entirely different approach. It acts more like other standard drivers, by converting dvi to html. The advantage of this is that it should work with any LaTeX package imaginable, as long as parseable dvi is still produced. This is nice, but the output is still too focused on print. It defines all sorts of extra classes and specifies all sorts of font sizes, when this should really be done with CSS.

I recently read a very poorly written article about why to use LaTeX, which spurred me on to write my own. As a proof of concept, I decided to write it in LaTeX and convert it to html. I ended up using the tex4ht approach, and then wrote a perl script to clean up some of the code it produced. You can view the article on why to use LaTeX on my University of Michigan homepage.

starting with html + css

It is possible to get pretty nice printed output using CSS today, but there are still some things missing. My wife has also been working on her résumé lately, and one thing I like about hers is that she has a header on each page with her name and info. This seems quite handy. Imagine someone printing off your CV from your webpage, and the pages get mixed up. All of a sudden they have lost a page of your CV! That is no good. Using CSS 2.1, the current standard, one cannot specify such headers. The same goes for page numbers. Most browsers will let the user choose whether to print page numbers or not, but we don’t want to make more work for the user.
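
To make this concrete, here is a minimal sketch of the kind of print rules that CSS 2.1 does allow. The selectors (#nav, .entry) are hypothetical and stand in for whatever the page actually uses, and browser support for the page-break properties varies, so treat it as an illustration rather than a tested stylesheet.

```css
/* Print rules that stay within CSS 2.1.
   #nav and .entry are hypothetical names used only for this sketch. */
@media print {
  /* Hide screen-only navigation on paper. */
  #nav { display: none; }

  /* Serif face, point sizes, black on white: all suited to print. */
  body {
    font-family: Georgia, serif;
    font-size: 11pt;
    color: #000;
    background: #fff;
  }

  /* Keep a section heading on the same page as the text after it. */
  h2 { page-break-after: avoid; }

  /* Ask the browser not to split one CV entry across two pages. */
  .entry { page-break-inside: avoid; }

  /* Avoid leaving one or two stranded lines at a page break. */
  p { orphans: 3; widows: 3; }
}

/* Page margins can be set, but CSS 2.1 gives no way to place a
   running header or a page number inside them. */
@page { margin: 2cm; }
```

This covers layout and page breaks reasonably well. What it cannot express is exactly the per-page header and page number described above.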

CSS 3 offers a glimpse of hope. It adds many more possibilities, especially for alternative media such as print or aural media. Unfortunately, it seems that CSS 3 support is still quite a ways off for most browsers.

Enter Prince.

Prince is a program that implements many features from CSS 3 and is designed to make quality pdf output from xml (or xhtml) documents. It can do some pretty neat stuff. I first learned about it from A List Apart, in an article about printing an entire book with xhtml and css. It has one major drawback though: it is not open-source. In fact, a normal full license is quite expensive. It does offer a restricted license, which has full functionality but also inserts a default page about Prince as the first page of your document, so this is not ideal. I have tried it out, and one can see the results of my experiment by looking at my CV.
What you get in your browser is the normal stuff. If you click on the button at the bottom, you get the nicely formatted version from Prince. If you click on the print this page button, you get a very similar version using CSS 2 rules. It is nice too, but it lacks the headers and page numbers.
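
The headers and page numbers come from the paged media features of CSS 3, which let a stylesheet put content into the page margins. The following is a minimal sketch of that technique, assuming a formatter that supports page-margin boxes; the name, the email address and the sizes are placeholders, and this is not the actual stylesheet behind the CV.

```css
/* CSS 3 paged media sketch: a running header and page numbers.
   The name and address below are placeholders, not real data. */
@page {
  size: A4;
  margin: 2.5cm 2cm;

  /* Header, repeated on every page. */
  @top-left {
    content: "Jane Doe, Curriculum Vitae";
    font-size: 9pt;
  }
  @top-right {
    content: "jane.doe@example.edu";
    font-size: 9pt;
  }

  /* Footer with the current page number and the total page count. */
  @bottom-center {
    content: "Page " counter(page) " of " counter(pages);
    font-size: 9pt;
  }
}

/* The first page usually carries the full heading already,
   so the running header is switched off there. */
@page :first {
  @top-left { content: none; }
  @top-right { content: none; }
}
```

With rules like these, a stray printed page still identifies its owner and its position in the document, which is the whole point of the header.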

The future

For my dissertation, I am definitely sticking to LaTeX, and will not attempt to convert it to html. I don’t think it is well suited for that. For most of my websites, I will stick to html. For my CV, and other documents for which I want the best of both worlds, I would like to go the LaTeX route, but I think that the tools available need some updating. Maybe I will work on that.