Web scraping with PHP and XSL

To always include the latest version of the XHTML 1.1 Quick Reference by Examples from the git repository on this site, I use the following snippet as the content of a Drupal node, together with the PHP input format:

<?php 
$file = "http://git.ao2.it/xhtml11_quickref.git/?a=blob_plain;f=xhtml11_quick_reference_by_examples.html;hb=HEAD";

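$xmlDoc = new DOMDocument();
// Stop early if the remote document cannot be fetched or parsed.
if (FALSE === $xmlDoc->load($file))
    trigger_error('Could not load the source document.', E_USER_ERROR);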

$stylesheet ='<?xml version="1.0"?>
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
  xmlns="http://www.w3.org/1999/xhtml"
  exclude-result-prefixes="xhtml xsl"
  version="1.0">

  <xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>

  <!-- Remember to specify the namespace when dealing with XHTML -->
  <xsl:template match="/">
    <xsl:copy-of select="//xhtml:body/*[position() > 1]"/>
  </xsl:template>

</xsl:stylesheet>
';

$xsl = new DOMDocument();
$xsl->loadXML($stylesheet);

$xp = new XSLTProcessor();
$xp->importStylesheet($xsl);

$output = $xp->transformToXML($xmlDoc);
if (FALSE === $output)
    trigger_error('XSL transformation failed.', E_USER_ERROR);

echo $output;

The XPath expression used in the stylesheet selects the child elements of the body, skipping the very first one. This avoids showing two titles, since the first element in the QuickRef body is always an h1 heading.
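
To see what that expression selects, here is a minimal sketch that evaluates it directly with DOMXPath instead of going through XSLT. The inline sample document is hypothetical and only stands in for the QuickRef page; the expression itself is the one from the stylesheet.

```php
<?php
// Hypothetical sample document standing in for the QuickRef page.
$xhtml = <<<XML
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Sample</title></head>
  <body>
    <h1>Title</h1>
    <p>First paragraph</p>
    <ul><li>Item</li></ul>
  </body>
</html>
XML;

$doc = new DOMDocument();
$doc->loadXML($xhtml);

// Bind the "xhtml" prefix to the XHTML namespace, as the stylesheet does.
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');

// Collect the names of all element children of body except the first one.
$names = [];
foreach ($xpath->query('//xhtml:body/*[position() > 1]') as $node) {
    $names[] = $node->localName;
}

echo implode(', ', $names), "\n"; // prints "p, ul"
```

The h1 is dropped because `*` only counts element children, so the heading is at position 1 even though whitespace text nodes surround it.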

When you process XML documents with namespaces, remember to use the namespace in the XPath expressions as well. Check out this interesting article about transforming XHTML to XHTML with XSLT.
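
The effect of forgetting the namespace can be checked with a small sketch like the following. The one-line sample document and the variable names are made up for illustration; the point is only that, in XPath 1.0, an unprefixed name such as `body` matches elements in no namespace, so it finds nothing in an XHTML document.

```php
<?php
// Hypothetical minimal XHTML document; all elements are in the XHTML namespace.
$doc = new DOMDocument();
$doc->loadXML(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    . '<body><h1>Title</h1><p>Text</p></body></html>'
);

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');

// Unprefixed "body" looks for an element in no namespace: no match.
$withoutPrefix = $xpath->query('//body/*')->length;

// Prefixed "xhtml:body" looks in the XHTML namespace: matches h1 and p.
$withPrefix = $xpath->query('//xhtml:body/*')->length;

printf("//body/* matched %d node(s)\n", $withoutPrefix);
printf("//xhtml:body/* matched %d node(s)\n", $withPrefix);
```

The same rule applies inside the stylesheet, which is why it declares the `xhtml` prefix and uses it in the `select` attribute.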

