xhtmail documentation

Notes about the project

Patrice Levesque

ptaff.ca

Copyright © 2005 Patrice Levesque

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
_________________________________________________________

Table of Contents
1. Quick startup recipe
   1.1. Requirements
   1.2. Installation
   1.3. Usage examples
2. Info about the project
   2.1. Incentive
   2.2. Methodology
   2.3. Performance
   2.4. Caveats
   2.5. Recognition
3. Helping out
   3.1. API Documentation
   3.2. Contributions
      3.2.1. Easy for everyone - Translations
      3.2.2. Easy for programmers - Detection of more MUA and OS
A. Complete command-line reference

List of Examples
1-1. Simple commandline invocation
_________________________________________________________

Chapter 1. Quick startup recipe

Let's get this running as fast as possible.
_________________________________________________________

1.1. Requirements

* PHP4 (may already work with PHP5; one day it certainly will)
* mailparse PECL extension (which in turn needs mbstring)
* PHP iconv extension (optional if the list to archive is English only)
* PEAR's Mail_Mime (Mail/mimeDecode.php from this package is needed, because PHP4 doesn't ship iconv_mime_decode yet)

The mailing list archives should be "1 message, 1 file" and contain full headers.
_________________________________________________________

1.2. Installation

1. The shared_www directory should be moved to a web-accessible location (and renamed if needed). This directory contains all images, scripts and stylesheets needed for normal operation. Its web-accessible location will have to be specified in the next step.
2. xhtmail.php should be edited - the different paths to set are at the top of the file.
3. Ready!
   + Sample files (found in the samples/ directory) are provided as examples for wrapping the output.
   + Tweaking the stylesheets is strongly encouraged. When still young, the author caught a disease called "Bad Taste" and it has not faded after all these years.
_________________________________________________________

1.3. Usage examples

Example 1-1. Simple commandline invocation

    php xhtmail.php -a samples/after.simple.html -b samples/before.simple.html -n mylist -t "Archives of mylist" -o outdir myarchives/*

The above command writes all archives to the outdir directory by processing the messages found in myarchives, using samples/before.simple.html and samples/after.simple.html as template files. The list is called mylist.

Coders who know even just a bit of PHP should look at samples/before.php and samples/after.php to see what's possible. A more advanced template engine could have been used, but let's keep things that should be simple, eh, simple.

The complete command-line reference is available in the appendix and is also shown when php xhtmail.php --help is entered at a terminal.
_________________________________________________________

Chapter 2. Info about the project

Here is some discussion about the project.
_________________________________________________________

2.1. Incentive

10 good reasons to switch to xhtmail

The author uses Sympa as a mailing list manager. Though its pure mail engine works fine, its web archives (through MHonArc) seem deficient. Other engines were studied (LISTSERV, Hypermail, dbmarc) but it felt like the world needed a new mailing list archiver. Why? They all exhibit several of these annoyances:

1. Dumb table-based markup. Tables are for tabular data; everybody should know better by now. It must be possible in this world to get more semantically meaningful archives than just table blobs.
   xhtmail generates semantic output, with no tables; signatures, titles, dates and quotes are recognized most of the time and wrapped in proper tags.
2. Cumbersome navigation. The unwritten rule until now seems to have been "One message, one page".
   Ever searched Google and landed in the middle of a thread in a mail archive? Life is too short to play "go back, pick the other sub-thread, then the first reply, then come back, then the second reply...". This must cost the worldwide economy a couple of billion per year.
   xhtmail puts all messages of a thread in a single page.
3. No markup language metadata. Each page should hold metadata (not just anonymous links) about the page context and creation.
   xhtmail exports lots of metadata: top, up, previous and next links; author and contributors; dates of creation and update.
4. Plain look. Everybody knows what the web looked like in 1994. Monochrome monitors are on the way out. Most of the engines don't allow control of the look.
   xhtmail is fully customizable with CSS so it may look like what its users want; it adds icons and even X-Faces to the messages, bringing life to otherwise plain messages.
5. Whack-A-Mole URIs. Archives can change, messages can be deleted. Since pages are most often indexed by an internal pointer, deleting one page changes the URIs of other pages.
   xhtmail always generates the same URI for the same message. Cool URIs Don't Change(TM).
6. Undescriptive URIs. Each web resource should be described when possible; it's not an RFC or a God Commandment but people (and search engines) like it.
   xhtmail uses a combination of timestamp and message thread subject to build its URIs, making them meaningful; as a bonus, webstats suddenly become more readable...
7. Invalid markup. Though W3C recommendations for HTML have existed since 1995, only a fraction of web markup is well-formed, and mailing list archivers create tag soup too.
   xhtmail is XHTML Strict-compliant and its output can be safely sent as application/xhtml+xml.
8. Useless file extensions for messages. As said before, Cool URIs Don't Change(TM); if the mail archiver generates .html files and the pages suddenly need to be wrapped in a scripting language, trouble arises.
   xhtmail puts all threads in separate directories so no extension is necessary.
9. No/flaky support for feeds. Many people like to stay informed of what's happening on a list but, for various reasons, don't want to subscribe.
   xhtmail exports RSS 2.0 and Atom 1.0 feeds.
10. Difficult integration with existing site templates. It should be easy to get a uniform look across all pages of a website, even mail archives.
   xhtmail is template-driven so existing bits of markup can be reused.
_________________________________________________________

2.2. Methodology

Before anything else: this script is really tied to the author's needs. Some limitations may seem stupid; contact the author, they are probably just oversights.

As a complement to the above remark: xhtmail was first built for a French-speaking mailing list; it will certainly work better with French than with Thai or Arabic.

* As everything is sorted by timestamp before processing, there should in theory be no problem with re-indexing the same files over and over when there's new content. That is in fact the way to do it for now. Filenames and relations should stay the same.
* When possible, URIs should point to the English Wikipedia. Two reasons: most of these URIs won't change (which may not hold for a software publisher's URI when sales go down), and all pages will eventually be translated (and users of xhtmail can work on that if necessary). Wikimedia was asked for a generic way to handle link translations, so for now links will just point to the English version; when the author gets tired of this, a wrapper will be written.
* GNU gettext was discarded for internationalization. Using gettext would mean scattering files all over the place; xhtmail should remain as self-contained as possible. It also requires locales to be installed for each of the translations, meaning trouble for the many people who don't have root access. And as environments differ (LC_ALL? LANG? LANGUAGE?),
  this would have meant spending too much time debugging setups. There are not so many strings anyway.
_________________________________________________________

2.3. Performance

* On the author's Athlon 1600, about a thousand messages are processed in 30 seconds.
* Memory usage should remain minimal because only metadata about mail messages is kept all the way through; individual messages are processed directly, file by file.
_________________________________________________________

2.4. Caveats

* All output is UTF-8. After 40 years of trouble with non-English languages, there is no time to mess with the weird charset problems that UTF-8 solves. For pure English data, that makes absolutely no difference from US-ASCII or ISO-8859-X. At this moment, supported input charsets are UTF-8, ASCII and ISO-8859-1. Contact the author or write a patch if you need other charsets - the author doesn't need them, so no coding will happen without a need (especially since every problem is resolved when the mailing list users configure their mailers to use UTF-8).
* Mail messages must have a text/plain part or else they are simply skipped. text/html attachments are discarded as they are first mail-unfriendly, and second hard to manage (xhtmail aims for pure XHTML on output and tag soup parsing is not fun).
* Message URIs are built from the subject line and timestamp of the message. A clash is possible if two threads are started at the very same second with precisely the same subject line. This software is indeed not designed for overactive user-support mailing lists where clueless lusers title all their mails "problem" and expect revelations without reading any doc.
_________________________________________________________

2.5. Recognition

This software would probably not exist if it weren't for the GPL and other collaborative licences. Bits and pieces were picked from different sources; I thank all those involved.
* PEAR/PECL, which make code reuse a snap
* Mimetype images: Horde project
* OS images: AWStats
* MUA images: dispMUA
* Commandline parsing: bgetop
* XHTML wordwrapping: Brian Huisman AKA GreyWyvern
_________________________________________________________

Chapter 3. Helping out

These are notes on advanced usage; those with no knowledge of or interest in free software development may skip them.
_________________________________________________________

3.1. API Documentation

API documentation can be generated from the code with phpDocumentor, using something like:

    phpdoc -dn xhtmail -f xhtmail.php -o HTML:frames:earthli -t api -ti "xhtmail documentation" -s on

That documentation is also available at http://ptaff.ca/xhtmail/api/.
_________________________________________________________

3.2. Contributions

They're welcome. Simple tasks can help the project get better in no time.

Any patch to xhtmail should be made using diff -ru distributed_xhtmail_directory patched_xhtmail_directory and mailed to the author for revision. Only patches against the most recent CVS version or the latest official release will be accepted.
_________________________________________________________

3.2.1. Easy for everyone - Translations

Currently, only a handful of languages are supported. Adding a language should not take more than half an hour; it's a simple modification to one file.

* A t_LC function, where LC is the language code, needs to be added to xhtmail.php. The language code should be picked from ISO 639. t_fr is probably the most up-to-date language function so it can be used as a reference.
* XHTML entities must be used in the file. That means numerical entities, not named HTML entities.
_________________________________________________________

3.2.2. Easy for programmers - Detection of more MUA and OS

The set of detected MUA and OS is still small. Adding detection for more of them is simple, given mail messages that carry their footprint in the headers.
MUA
   + All the logic for MUA detection is found in the extract_mua_from_email_header function.
   + After a new MUA is added, a corresponding entry must be inserted in the get_mua_uri function.
   + A 14x14 PNG icon should be added for the MUA, in the shared_www/mua/ directory.

OS
   + All the logic for OS detection is found in the extract_os_from_email_header function.
   + After a new OS is added, a corresponding entry must be inserted in the get_os_uri function.
   + A 14x14 PNG icon should be added for the OS, in the shared_www/os/ directory.
_________________________________________________________

Appendix A. Complete command-line reference

Usage: php xhtmail.php options file [file...]

Each file is a separate mail message, with full headers.

Options:

-a FILENAME, --after=FILENAME
    What content (typically XHTML) should be put after the xhtmail output?

-b FILENAME, --before=FILENAME
    What content (typically XHTML) should be put before the xhtmail output? Special tags in the file will be replaced:
    + <^TITLE^> will be replaced by the mailthread title
    + <^CHARSET^> will be replaced by the document's charset
    + ...
    No tags should be found in this file because they are already used by xhtmail for message titles.

-c FILENAME, --contents-after=FILENAME
    What content (typically XHTML) should be put after the thread contents in thread pages?

-e EXTENSION, --extension=EXTENSION
    What extension should be given to output files? Default is html.

-f NUMBER, --feed=NUMBER
    Number of Atom/RSS entries in the main archive output. Defaults to 0 (no feed). See the -u option.

-h, --help
    Displays this help text and exits.

-i FILENAME, --index_after=FILENAME
    What content (typically XHTML) should be put after the index in index pages?

-l LANGUAGE, --lang=LANGUAGE
    What language should the generated pages use? Default is en. Available: en, fr.

-n STRING, --name_of_list=STRING
    Name of the mailing list.

-o DIRECTORY, --output_directory=DIRECTORY
    Where should the output files be placed?

-p URI[,URI], --picture_uri=URI[,URI]
    Image URI and icon URI, respectively, for the Atom/RSS feeds. Useless if not used with -f; the icon, if provided, should be small, like 16x16.

-t STRING, --title_of_list=STRING
    Long name of the archive (like "Archives of the foo mailing list").

-u URI,URI, --uri_base=URI,URI
    Base URIs for web pages and RSS, respectively, separated by a comma, like "-u http://ex.com/path1,http://ex.com/path2"; there should be no trailing slash. This parameter is silently discarded when the -f flag (feeds) is not used, but is mandatory when it is. These base URIs are used only for cross-linking between web pages and feeds, not for internal links. The feeds base should stop before the feeds subdirectory (if your feeds are at http://example.com/feeds/example.xml, use http://example.com).

-v, --version
    Outputs version information and exits.

Caution
Danger Will Robinson

Option parsing is primitive at best. Should results be weird, inspecting the command line is strongly suggested.
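
To tie the reference together, here is a minimal sketch of a feed-enabled run wrapped in a small shell script. It only uses options listed above. The list name, the directories and the example.com URIs are made-up placeholders, and the run_xhtmail helper with its dry-run fallback is an illustration added here, not something shipped with xhtmail.

```shell
#!/bin/sh
# Sketch of a feed-enabled xhtmail run. All names, paths and URIs below are
# hypothetical placeholders; adapt them to the actual setup.

LIST_NAME="mylist"
OUT_DIR="outdir"
WEB_BASE="http://example.com/archives"  # base URI for web pages, no trailing slash
FEED_BASE="http://example.com"          # base URI for feeds, no trailing slash

# Hypothetical helper: runs xhtmail when php and xhtmail.php are available,
# otherwise prints the arguments (one per line) so they can be inspected.
run_xhtmail() {
    if [ -f xhtmail.php ] && command -v php >/dev/null 2>&1; then
        php xhtmail.php "$@"
    else
        printf 'php xhtmail.php\n'
        printf '  %s\n' "$@"
    fi
}

# -u is mandatory as soon as -f is used; the message files come last.
run_xhtmail \
    -b samples/before.simple.html \
    -a samples/after.simple.html \
    -n "$LIST_NAME" \
    -t "Archives of $LIST_NAME" \
    -f 20 \
    -u "$WEB_BASE,$FEED_BASE" \
    -o "$OUT_DIR" \
    myarchives/*
```

Since option parsing is primitive, printing the arguments one per line before a real run is a cheap way to check that quoting (the -t title in particular) survived the shell.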