Monday, 3 February 2014

Get Text Between HTML Tags

DO NOT USE REGEX TO PARSE HTML

With regular expressions and the preg_match() or preg_match_all() functions, the parser is made to work extremely hard, as PHP scans the text over and over to find matches. With the DOM functions the document is parsed once into a tree, which can be considerably faster and makes parsing much cleaner. For comparison, this example shows how it might be done with preg_match().
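<?php

/**
 * Get the text between the first matching pair of tags.
 *
 * @param string $string  The string with tags
 * @param string $tagname The name of the tag
 * @return string Text between tags, or an empty string if nothing matches
 */
function getTextBetweenTags($string, $tagname)
{
    $pattern = "/<$tagname>(.*?)<\/$tagname>/";
    preg_match($pattern, $string, $matches);

    /*** guard against an undefined index when there is no match ***/
    return isset($matches[1]) ? $matches[1] : '';
}
?>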
The above function is very basic: it does not handle nested tags or tags with attributes, and it does not check for broken tags. The PHP DOM extension addresses these issues.
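To make those limitations concrete, here is a minimal sketch of where the regex approach goes wrong. The helper name regexTextBetweenTags() and the two sample strings are made up for this illustration; the helper mirrors the pattern used above, with preg_quote() added and null returned when nothing matches.

```php
<?php
// Hypothetical helper for this sketch: same pattern as the regex version above.
function regexTextBetweenTags(string $string, string $tagname): ?string
{
    $tag = preg_quote($tagname, '/');
    $pattern = '/<' . $tag . '>(.*?)<\/' . $tag . '>/s';

    return preg_match($pattern, $string, $matches) === 1 ? $matches[1] : null;
}

// Nested tags of the same name: the lazy match stops at the first closing tag.
$nested = '<div>outer <div>inner</div> tail</div>';
var_export(regexTextBetweenTags($nested, 'div')); // 'outer <div>inner'
echo PHP_EOL;

// A tag with attributes: the pattern only knows the bare <a> form, so nothing matches.
$link = '<a href="http://phpro.org">PHPRO.ORG</a>';
var_export(regexTextBetweenTags($link, 'a')); // NULL
echo PHP_EOL;
```

The pattern could be extended to cope with attributes, but nesting is where regular expressions stop being a reasonable tool for this job.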
The DOM-based function takes three arguments.
$tag
The tag to find the text between
$html
The HTML or XML to be searched
$strict
Tells the function to load in HTML or XML mode, default is HTML mode
The third parameter, if set to 1, tells the function to load the input as XML, which allows it to parse custom tags as found in XML and some XHTML documents.
<?php
/**
 * Get the text between tags.
 *
 * @param string $tag    The tag name
 * @param string $html   The XML or XHTML string
 * @param int    $strict Whether to use strict (XML) mode
 * @return array
 */
function getTextBetweenTags($tag, $html, $strict = 0)
{
    /*** a new dom object ***/
    $dom = new DOMDocument;

    /*** discard white space (set before loading; it has no effect afterwards) ***/
    $dom->preserveWhiteSpace = false;

    /*** load the html into the object ***/
    if ($strict == 1)
    {
        $dom->loadXML($html);
    }
    else
    {
        $dom->loadHTML($html);
    }

    /*** the tag by its tag name ***/
    $content = $dom->getElementsByTagName($tag);

    /*** the array to return ***/
    $out = array();
    foreach ($content as $item)
    {
        /*** add node value to the out array ***/
        $out[] = $item->nodeValue;
    }

    /*** return the results ***/
    return $out;
}
?>
In this example plain HTML is used and no third argument is supplied to the function. This allows for invalid or broken HTML. The third paragraph is missing its closing </p> tag; however, loadHTML() tolerates this deviation. The example still parses the HTML and retrieves an array of the text between all <a> anchor tags.
<?php

$html = '<body>
<h1>Heading</h1>
<a href="http://phpro.org">PHPRO.ORG</a>
<p>paragraph here</p>
<p>Paragraph with a <a href="http://phpro.org">LINK TO PHPRO.ORG</a></p>
<p>This is a broken paragraph
</body>';

$content = getTextBetweenTags('a', $html);

foreach ($content as $item)
{
    echo $item.'<br />';
}
?>
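loadHTML() recovers from markup problems, but by default it reports them as PHP warnings. The sketch below shows one way to keep those messages out of the page output by buffering them with libxml_use_internal_errors(). The variable names and the shortened sample markup are assumptions made for this illustration, and whether any messages are collected at all depends on the input and the installed libxml version.

```php
<?php
// Shortened version of the broken sample: the last <p> is never closed.
$html = '<body><p>Paragraph with a <a href="http://phpro.org">LINK TO PHPRO.ORG</a></p>'
      . '<p>This is a broken paragraph</body>';

// Buffer libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Collect whatever the parser reported, then clear the buffer.
$messages = array();
foreach (libxml_get_errors() as $error) {
    $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
}
libxml_clear_errors();

// Restore the previous error handling mode.
libxml_use_internal_errors($previous);

// The anchor text is still retrievable despite the broken paragraph.
$links = array();
foreach ($dom->getElementsByTagName('a') as $item) {
    $links[] = $item->nodeValue;
}

print_r($links);
print_r($messages); // may be empty
```

Collecting the messages rather than discarding them makes it possible to log them while still treating the document as usable.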
In this final example, custom tags such as may be found in XML or XHTML documents are used. The third parameter is set to 1, which tells the function to use XML mode and parse the custom tags.
<?php

$xhtml = '<html>
<body>
<para>This is a paragraph</para>
<para>This is another paragraph</para>
</body>
</html>';

$content2 = getTextBetweenTags('para', $xhtml, 1);

foreach ($content2 as $item)
{
    echo $item.'<br />';
}
?>
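XML mode is less forgiving than HTML mode: loadXML() expects well-formed input and returns false when it does not get it. The following sketch, using a made-up string with an unclosed <para> tag, compares the return values of the two loaders. It is only an illustration of why checking the return value is worthwhile before reading from the document in strict mode.

```php
<?php
// Hypothetical input for this sketch: the second <para> is never closed.
$broken = '<html><body><para>This is a paragraph</para><para>Unclosed</body></html>';

// Keep parser messages out of the output while both loaders run.
$previous = libxml_use_internal_errors(true);

$xmlDom = new DOMDocument();
$xmlOk = $xmlDom->loadXML($broken);    // not well-formed XML, so loading fails

$htmlDom = new DOMDocument();
$htmlOk = $htmlDom->loadHTML($broken); // the HTML parser recovers and loads it

libxml_clear_errors();
libxml_use_internal_errors($previous);

var_dump($xmlOk);  // bool(false)
var_dump($htmlOk); // bool(true)
```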
