Web scraping with PHP and XSL

To always have the latest version of the XHTML 1.1 Quick Reference by Examples from the git repository included on this site, I am using this snippet as the content of a Drupal node with the PHP input format:

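$file = "http://git.ao2.it/xhtml11_quickref.git/?a=blob_plain;f=xhtml11_quick_reference_by_examples.html;hb=HEAD";

// Load the remote XHTML document
$xmlDoc = new DOMDocument();
$xmlDoc->load($file);

$stylesheet = '<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
  exclude-result-prefixes="xhtml xsl">

  <xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>

  <!-- Remember to specify the namespace when dealing with XHTML -->
  <xsl:template match="/">
    <xsl:copy-of select="//xhtml:body/*[position() > 1]"/>
  </xsl:template>

</xsl:stylesheet>';

// Load the stylesheet and hand it to the XSLT processor
$xsl = new DOMDocument();
$xsl->loadXML($stylesheet);

$xp = new XSLTProcessor();
$xp->importStylesheet($xsl);

$output = $xp->transformToXML($xmlDoc);
if (FALSE === $output)
    trigger_error('XSL transformation failed.', E_USER_ERROR);

echo $output;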

The XPath expression used in the stylesheet selects the elements in the body, skipping the very first one. This avoids showing two titles, since the first element in the QuickRef body is always an h1 heading.
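To see what the expression selects without fetching the real page, here is a minimal sketch that runs it through DOMXPath on a tiny inline document. The sample markup and the variable names are made up for illustration; only the XPath expression itself comes from the stylesheet above.

```php
<?php
// Hypothetical stand-in for the QuickRef page: an h1 followed by two more elements.
$xhtml = <<<'XML'
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Demo</title></head>
  <body>
    <h1>Title</h1>
    <p>First paragraph</p>
    <ul><li>Item</li></ul>
  </body>
</html>
XML;

$doc = new DOMDocument();
$doc->loadXML($xhtml);

$xpath = new DOMXPath($doc);
// Bind the "xhtml" prefix to the XHTML namespace, as the stylesheet does.
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');

// Same expression as in the stylesheet: every element child of body except the first.
$names = [];
foreach ($xpath->query('//xhtml:body/*[position() > 1]') as $node) {
    $names[] = $node->localName;
}

echo implode(', ', $names), "\n"; // prints "p, ul"
```

The h1 is left out because position() counts only the element children matched by the * step, so the whitespace between the tags does not shift the numbering.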

When you process XML documents with namespaces, remember to use the namespace in the XPath expressions as well. Check out this interesting article about transforming XHTML to XHTML with XSLT.
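The effect of forgetting the namespace is easy to reproduce. The following is a small sketch, again on a made-up inline document, comparing an unprefixed expression with a prefixed one. The prefix name xhtml is an arbitrary choice; what matters is the namespace URI it is bound to.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Hi</p></body></html>');

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');

// In XPath 1.0 an unprefixed name matches only elements in no namespace,
// so this finds nothing in an XHTML document.
$withoutPrefix = $xpath->query('//body')->length;

// With the prefix bound to the XHTML namespace the element is found.
$withPrefix = $xpath->query('//xhtml:body')->length;

printf("without prefix: %d, with prefix: %d\n", $withoutPrefix, $withPrefix);
// prints "without prefix: 0, with prefix: 1"
```

The same rule applies inside the stylesheet, which is why it declares xmlns:xhtml and uses xhtml:body in the select attribute.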
