27 Jul

pre-parsing HTML for incorrectly-sized images

Every now and then, I get a call from a client who is puzzled about why their site is running slowly. I look at their page and see an innocuous image inserted into a paragraph. When I examine the image, though, I find that the client has resized it using nothing but HTML.

One recent example appeared on-screen as a 300px-wide image. When I examined it, the file was actually about 3000px wide. As I explained to the client, that is ten times the width and ten times the height, so the browser needs about 100 times more RAM to hold the image (not counting the overhead of scaling it down to 300px wide), and the download is slower as well.
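To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the browser holds the decoded image at 4 bytes per pixel (RGBA); the heights (2000px and 200px) and the function name are made up for illustration, and real browsers will differ in the details.

```php
<?php
// Rough estimate of the memory needed to hold a decoded image.
// Assumption: 4 bytes per pixel (RGBA); real browsers may differ.
function estimateDecodedBytes(int $width, int $height, int $bytesPerPixel = 4): int {
	return $width * $height * $bytesPerPixel;
}

// Hypothetical dimensions: only the widths come from the example above.
$full  = estimateDecodedBytes(3000, 2000); // the file as uploaded
$shown = estimateDecodedBytes(300, 200);   // the size it is displayed at

printf("full: %.1f MB, shown: %.2f MB, ratio: %dx\n",
	$full / 1048576, $shown / 1048576, $full / $shown);
```

The ratio depends only on the scaling factor: ten times in each dimension is a hundred times the pixels, whatever the bytes-per-pixel figure turns out to be.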

One solution to all this is to teach all clients how to resize images before they upload them. I did that in this case. But it’s not the easiest solution, and people forget how to do things.

Another solution was proposed by Ken: parse any submitted HTML for images and check that the size they claim to be is actually correct. He said that he’d had the idea ages ago but never implemented it. I think its time has come, so let’s do it.

There are four ways that images can get resized: through HTML attributes, inline CSS, selector-based CSS and JavaScript. We will address the first two, as the others would be too complex to solve in a small application.

The way it works is that any resized image that is detected has its ‘src’ attribute rewritten in the HTML to point to a pre-created, correctly sized version of the image. The whole script runs when the HTML is submitted to the CMS, before the HTML is placed in the database or published to a file.
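The mapping from an original image to its resized copy can be sketched in isolation. This is only an illustration: the helper name and the '/resized/' base URL are assumptions, and the idea is simply one directory per source image with one file per requested size.

```php
<?php
// Sketch: map an original image URL plus a requested size to a cache URL.
// resizedImageUrl() and the '/resized/' base are illustrative assumptions.
function resizedImageUrl(string $imgsrc, int $width, int $height, string $baseUrl = '/resized/'): string {
	// md5() of the source URL gives one directory per original image
	return $baseUrl . md5($imgsrc) . '/' . $width . 'x' . $height . '.png';
}

echo resizedImageUrl('http://example.com/photo.jpg', 300, 200), "\n";
```

Because the name is derived only from the source URL and the target size, the same image requested at the same size always maps to the same file, so the resize needs to happen only once.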

First, we need to detect image sources and their assigned sizes.

Here is some sample HTML with images from this site.

<p><img src="http://verens.com/wp-content/themes/mandigo-14/images/green/head.jpg" width="76" height="24" /></p>
<p><img src="/wp-content/themes/mandigo-14/images/green/head.jpg" style="width:76px;height:24px" /></p>

What we want is a function which, when fed that HTML, returns a modified copy in which every image whose stated width and height differ from its real size has its src changed to point to a pre-resized version, which is created using ImageMagick.

Here it is:

define('WORKDIR_IMAGERESIZES',$_SERVER['DOCUMENT_ROOT'].'/demos/html_imageresizer/f/');
define('WORKURL_IMAGERESIZES','/demos/html_imageresizer/f/');
function html_fixImageResizes($src){
	// checks for image resizes done with HTML parameters or inline CSS
	//   and redirects those images to pre-resized versions held elsewhere

	preg_match_all('/<img\s[^>]*>/i',$src,$matches);
	if(!count($matches[0]))return $src; // $matches[0] holds the matched tags
	foreach($matches[0] as $match){
		$width=0;
		$height=0;
		if(preg_match('/\swidth="([0-9]+)"/i',$match,$w) && preg_match('/\sheight="([0-9]+)"/i',$match,$h)){
			// size set through HTML attributes
			$width=(int)$w[1];
			$height=(int)$h[1];
		}
		else if(preg_match('/style="[^"]*(?<![-\w])width: *([0-9]+)px/i',$match,$w) && preg_match('/style="[^"]*(?<![-\w])height: *([0-9]+)px/i',$match,$h)){
			// size set through inline CSS (the lookbehind skips max-width, line-height, etc.)
			$width=(int)$w[1];
			$height=(int)$h[1];
		}
		if(!$width || !$height)continue;
		if(!preg_match('/\ssrc="([^"]*)"/i',$match,$s))continue; // no src, nothing to fix
		$imgsrc=$s[1];

		// get absolute address of img (naive, but will work for most cases)
		if(!preg_match('/^http/i',$imgsrc))$imgsrc=preg_replace('#^/*#','http://'.$_SERVER['HTTP_HOST'].'/',$imgsrc);

		// getimagesize() returns false if the image cannot be read
		$size=@getimagesize($imgsrc);
		if(!$size)continue;
		list($x,$y)=$size;
		if(!$x || !$y || ($x==$width && $y==$height))continue;

		// create address of resized image and update HTML
		$dir=md5($imgsrc);
		$newURL=WORKURL_IMAGERESIZES.$dir.'/'.$width.'x'.$height.'.png';
		$newImgHTML=preg_replace('/(\ssrc=")[^"]*"/i','${1}'.$newURL.'"',$match,1);
		$src=str_replace($match,$newImgHTML,$src);

		// create cached image
		$imgdir=WORKDIR_IMAGERESIZES.$dir;
		@mkdir($imgdir);
		$imgfile=$imgdir.'/'.$width.'x'.$height.'.png';
		if(file_exists($imgfile))continue;
		// escapeshellarg() quotes each argument so the shell cannot interpret it
		$str='convert '.escapeshellarg($imgsrc).' -geometry '.(int)$width.'x'.(int)$height.' '.escapeshellarg($imgfile);
		exec($str);
	}

	return $src;
}

The return string from calling that function with the above HTML is this:

<p><img src="/demos/html_imageresizer/f/6bf7dd2b8232448e85d7fa9cd1009b44/76x24.png" width="76" height="24" /></p>

<p><img src="/demos/html_imageresizer/f/6bf7dd2b8232448e85d7fa9cd1009b44/76x24.png" style="width:76px;height:24px" /></p>

Here is an example of it running, and here is the source of that demo.
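As an alternative to regular expressions, the detection step could also be done with PHP's DOM extension, which copes better with attribute order, line breaks and quoting. The following is only a sketch of that idea, not the code used in the demo: findSizedImages() is a hypothetical name, and it assumes the dom extension is enabled.

```php
<?php
// Sketch: collect src, width and height for every <img> using DOMDocument
// instead of regular expressions. findSizedImages() is a hypothetical helper.
function findSizedImages(string $html): array {
	$doc = new DOMDocument();
	// wrap the fragment so it parses as a body; @ hides warnings about imperfect markup
	@$doc->loadHTML('<html><body>' . $html . '</body></html>');
	$found = [];
	foreach ($doc->getElementsByTagName('img') as $img) {
		$width  = (int)$img->getAttribute('width');
		$height = (int)$img->getAttribute('height');
		$style  = $img->getAttribute('style');
		// fall back to inline CSS when the HTML attributes are missing
		if (!$width && preg_match('/(?<![-\w])width:\s*(\d+)px/i', $style, $m)) $width = (int)$m[1];
		if (!$height && preg_match('/(?<![-\w])height:\s*(\d+)px/i', $style, $m)) $height = (int)$m[1];
		if ($width && $height) {
			$found[] = ['src' => $img->getAttribute('src'), 'width' => $width, 'height' => $height];
		}
	}
	return $found;
}

$html = '<p><img src="/a.jpg" width="76" height="24" /></p>'
	. '<p><img src="/b.jpg" style="width:76px;height:24px" /></p>';
print_r(findSizedImages($html));
```

Each entry could then be compared against getimagesize() exactly as in the function above. The trade-off is that writing the modified HTML back out through DOMDocument may normalise the markup, which a CMS might not want.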

5 thoughts on “pre-parsing HTML for incorrectly-sized images”

  1. Pingback: Kae Verens’ Blog: pre-parsing HTML for incorrectly-sized images | Development Blog With Code Updates : Developercast.com

  2. Hi,

    This is a good solution to the problem, however the exec makes me cringe somewhat. Personally, I would use the GD library to do the resize, but that’s just me.

    Well done on a good routine 🙂 I think I might ‘borrow’ it for my projects.

    Thanks,

    Steve

  3. I should probably have explained my reasoning there. The use of an exec there was a calculated choice.

    GD (and Imagick, the PHP extension that wraps ImageMagick) both “suffer” from the memory limit set in /etc/php.ini. While that limit is a great and essential safeguard in most cases, in this case we know that the image manipulation will probably go over it, but also that it’s a once-off problem. Programs run through exec(), however, are not counted when PHP’s memory usage is calculated.

    Also, it’s much quicker to write a simple one-liner like I did above, than to go to the effort needed for GD or Imagick.

    As far as I can see, there are no security flaws in the code I wrote – all values passed to the exec() function are escaped or filtered beforehand.

    The only real problem here is portability – the function requires a Linux server, and that the external ImageMagick program be installed. However, as a proof of concept, I think I can be allowed that flaw 😉

    Stevan, feel free to use it as you wish. Ryan, I know what you mean, and I really did think about the exec() function before I used it.
