Web scraping with PHP and XSL

by ao2, July 26, 2009

To always have the latest version of the XHTML 1.1 Quick Reference by Examples from the git repository included on this site, I use the following snippet as the content of a Drupal node, together with the PHP input format:

<?php 
$file = "http://git.ao2.it/xhtml11_quickref.git/?a=blob_plain;f=xhtml11_quick_reference_by_examples.html;hb=HEAD";

$xmlDoc = new DOMDocument();
if (FALSE === $xmlDoc->load($file))
    trigger_error('Loading the remote document failed.', E_USER_ERROR);

$stylesheet = '<?xml version="1.0"?>
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
  xmlns="http://www.w3.org/1999/xhtml"
  exclude-result-prefixes="xhtml xsl"
  version="1.0">

  <xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>

  <!-- Remember to specify the namespace when dealing with XHTML -->
  <xsl:template match="/">
    <xsl:copy-of select="//xhtml:body/*[position() > 1]"/>
  </xsl:template>

</xsl:stylesheet>
';

$xsl = new DOMDocument();
$xsl->loadXML($stylesheet);

$xp = new XSLTProcessor();
$xp->importStylesheet($xsl);

$output = $xp->transformToXML($xmlDoc);
if (FALSE === $output)
    trigger_error('XSL transformation failed.', E_USER_ERROR);

echo $output;

The XPath expression in the stylesheet selects the child elements of the body, skipping the very first one. This avoids showing two titles, since the first element in the QuickRef body is always an h1 heading.
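To see what the position() predicate does in isolation, here is a minimal sketch that runs the same XPath expression with DOMXPath instead of a full XSL transformation. The inline document is a made-up miniature of the QuickRef page, not the real file, and the variable names are only illustrative.

```php
<?php
// Hypothetical miniature of the QuickRef page: an h1 followed by other elements.
$xhtml = '<html xmlns="http://www.w3.org/1999/xhtml">'
       . '<head><title>Sample</title></head>'
       . '<body><h1>Title</h1><p>First</p><p>Second</p></body>'
       . '</html>';

$doc = new DOMDocument();
$doc->loadXML($xhtml);

// Bind the "xhtml" prefix to the XHTML namespace, as the stylesheet does.
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');

// Same expression as in the stylesheet: every child of body except the first.
$kept = array();
foreach ($xpath->query('//xhtml:body/*[position() > 1]') as $node) {
    $kept[] = $node->nodeName . ': ' . $node->textContent;
}

echo implode("\n", $kept), "\n";
```

With this sample input the h1 is skipped and only the two paragraphs are listed. Note that the predicate counts element children of body, so whitespace between the tags would not change the result.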

When you process XML documents that use namespaces, remember to use the namespace in the XPath expressions too; see this interesting article about transforming XHTML to XHTML with XSLT.
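The effect of forgetting the namespace is easy to reproduce. The following is a small sketch, again using DOMXPath on a made-up inline document rather than the real QuickRef file: the unprefixed query finds nothing, while the prefixed one finds the body. The prefix name "xhtml" is an arbitrary choice; only the namespace URI has to match.

```php
<?php
// Hypothetical sample document in the XHTML namespace.
$doc = new DOMDocument();
$doc->loadXML(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    . '<head><title>Sample</title></head>'
    . '<body><p>Hello</p></body></html>'
);

$xpath = new DOMXPath($doc);

// In XPath 1.0 an unprefixed name means "in no namespace", so this matches nothing.
$unprefixed = $xpath->query('//body')->length;

// Bind a prefix to the XHTML namespace URI and use it in the expression.
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');
$prefixed = $xpath->query('//xhtml:body')->length;

printf("unprefixed: %d, prefixed: %d\n", $unprefixed, $prefixed);
```

This prints "unprefixed: 0, prefixed: 1". The same rule explains why the stylesheet declares xmlns:xhtml and writes xhtml:body instead of plain body.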

Short link: https://ao2.it/11
