PHP and Unicode’s Byte Order Mark

I’ve started messing around with custom websites based on WordPress lately, since it offers a very flexible platform with lots of functionality while still allowing you to customize the final result to your heart’s content. Really, the only real issues with this approach are that 1) we are talking about PHP, which happens not to be my platform of choice, and 2) WordPress has performance issues, though these can easily be solved with a caching plugin.

So, I created a nice custom layout from scratch and tested it on all major browsers, on which it worked perfectly (except for IE 6 and 7, of course). Then I went on to wire specific WordPress functionality into the layout. The drill is quite simple. You separate the HTML into template files. In this case I had created 3 of them: header.php with the doctype, head tag and header declaration, footer.php with the site’s footer and all closing tags, and index.php with the main content of the page. You render the header by calling WordPress’s get_header() function, and the footer by calling get_footer().

After a while, I noticed that when viewing the page in Firefox there was a blank white border at the very top of the page, which had no reason to be there. I ruled out the beta nature of the browser as a potential cause, since the standalone layout worked perfectly. Then I tried viewing the result in IE and… there are no words to describe the colossal mess the browser rendered. Oh yes, the white border was there too.

As if that weren’t enough, double-clicking on that white space in Firefox and then viewing the “selection source” would show quite a different HTML source, with all the declarations that belong in the <head> tag moved into the <body> tag. Chaos!

Then I moved the contents of header.php back into index.php and removed the call to get_header(). Whadayaknow, it worked perfectly again. Feeling adventurous, I added the call to get_header() back into index.php, but this time with the header file empty. Oh boy, chaos once again.

I was stumped. There was no logical explanation for that (as far as I could see), since the source produced was the same in each case. But when I checked the file sizes and found them different, I suddenly knew… It was the Byte Order Mark (BOM): a single Unicode character, U+FEFF, which takes up 3 bytes (EF BB BF) in UTF-8 and which may appear as the very first thing in a Unicode text file or code file. It’s a sort of metadata, only it’s contained within the document contents themselves. It’s quite useful actually, since it tells you not only the endianness of the document’s encoding but also which Unicode encoding variant is used. Most Unicode-aware applications remove the BOM when they present or process such data, but a few do not and… well… the least painful result is the literal appearance of these characters in the document itself. In other cases, such as PHP, it means misrendered documents: PHP treats everything outside its <?php tags as output, so the BOM bytes are sent to the browser ahead of the rest of the file.
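To make the file-size clue concrete, here is a minimal sketch of how one might check whether a file's bytes begin with a BOM. It is written in Python rather than PHP purely as a standalone illustration; the function name detect_bom and the sample byte strings are made up for this example and are not part of WordPress or PHP.

```python
import codecs

# Known BOM byte sequences. The UTF-32 marks come first because the
# UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a BOM, else (None, 0)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# Hypothetical stand-ins for the first bytes of header.php, saved with
# and without a BOM.
header_with_bom = codecs.BOM_UTF8 + b"<!DOCTYPE html>"
header_without_bom = b"<!DOCTYPE html>"

print(detect_bom(header_with_bom))     # ('utf-8', 3)
print(detect_bom(header_without_bom))  # (None, 0)
# The two versions differ by exactly the 3 BOM bytes.
print(len(header_with_bom) - len(header_without_bom))  # 3
```

In practice you would pass in the first few bytes read from each template file in binary mode.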

The easiest solution is to save the files without a BOM, especially the one containing the DOCTYPE declaration! There’s also supposed to be a Zend compile-time option (--enable-zend-multibyte) that makes PHP check for the existence of the BOM, but from what I can gather it’s quite buggy. Also, the folks at PHP expect such issues to go away with PHP 6.0.
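If your editor makes it awkward to save without a BOM, the same result can be had after the fact. The following is a rough sketch, again in Python and under my own assumptions: it only handles the UTF-8 BOM, the helper names strip_utf8_bom and strip_theme are invented for this example, and it rewrites files in place, so try it on a copy of the theme first.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path: Path) -> bool:
    """Rewrite the file without its leading UTF-8 BOM. Return True if it changed."""
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False  # no BOM, leave the file untouched
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True


def strip_theme(theme_dir: str) -> list:
    """Strip the BOM from every .php file under theme_dir; return the changed paths."""
    return [p for p in sorted(Path(theme_dir).rglob("*.php")) if strip_utf8_bom(p)]
```

Calling strip_theme() with the path to your theme folder returns the list of files it had to fix, which also tells you which templates were the culprits.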
