blueshoes php application framework and cms            core_html

Class: Bs_HtmlInfo

Source Location: /core/html/Bs_HtmlInfo.class.php

Class Overview

Bs_Object
   |
   --Bs_HtmlInfo

Bs_HtmlInfo class. can fetch information about an html page.


Author(s):

  • andrej arn <at blueshoes dot org>

Version:

  • 4.3.$Revision: 1.5 $ $Date: 2003/11/19 08:18:13 $

Copyright:

  • blueshoes.org

Variables

Methods


Inherited Variables

Inherited Methods

Class: Bs_Object

Bs_Object::Bs_Object()
Bs_Object::getErrors()
Basic error handling: Get *all* errors as string array from the global Bs_Error-error stack.
Bs_Object::getLastError()
Basic error handling: Get last error string from the global Bs_Error-error stack.
Bs_Object::getLastErrors()
Basic error handling: Get the last errors as string array from the global Bs_Error-error stack since the last call of getLastErrors().
Bs_Object::persist()
Persists this object by serializing it and saving it to a file with unique name.
Bs_Object::setError()
Basic error handling: Push an error string on the global Bs_Error-error stack.
Bs_Object::toHtml()
Dumps the content of this object to a string using PHP's var_dump().
Bs_Object::toString()
Dumps the content of this object to a string using PHP's var_dump().
Bs_Object::unpersist()
Fetches an object that was persisted with persist()

Class Details

[line 24]
Bs_HtmlInfo class. can fetch information about an html page.

it's not a parser; it relies on plain search and find, sometimes with regular expressions.

dependencies: Bs_HtmlUtil




Tags:

status:  experimental (things are still changing; the class is very new, and there are lots of things that can go wrong with html.)
since:  bs4.3
version:  4.3.$Revision: 1.5 $ $Date: 2003/11/19 08:18:13 $
copyright:  blueshoes.org
author:  andrej arn <at blueshoes dot org>




Class Variables

$Bs_HtmlUtil =

[line 32]

reference to global pseudostatic instance.

gets set in the constructor.




Tags:

access:  public

Type:   object





Class Methods


constructor Bs_HtmlInfo [line 48]

Bs_HtmlInfo Bs_HtmlInfo( )

constructor




method fetchBaseTag [line 406]

array fetchBaseTag( )

find the base tag, if there is one.

returns a hash with the keys 'href' and 'target'. if either (or both) is not defined, its value is set to NULL.




Tags:

return:  (hash, with the keys 'href' and 'target')
access:  public



method fetchBody [line 97]

string fetchBody( [bool $html = FALSE])

returns the content of the body.



Tags:

return:  (html code or plain text see param $html, may be empty.)
access:  public


Parameters:

bool   $html   (default is FALSE which means convert to plain text.)


method fetchDescription [line 137]

string fetchDescription( [bool $html = FALSE])

returns the meta description.

currently returns it in lowercase!




Tags:

return:  (may be empty)
access:  public


Parameters:

bool   $html   (default is false which means convert to plain text.)


method fetchIframeUrls [line 371]

array fetchIframeUrls( [bool $dublicates = TRUE], [bool $useBaseTag = FALSE], [mixed $removeAnker = TRUE], [bool $ignoreInvalid = TRUE], bool $ignoreAnker)

fetches the urls of iframes.

note that the url may be relative. it's up to you to make something useful of it, based on where you found it.




Tags:

return:  (vector, may be empty.)
access:  public


Parameters:

bool   $dublicates   (default is TRUE. FALSE means that the same url won't be returned twice.)
bool   $useBaseTag   (currently not supported. default is FALSE.)
bool   $ignoreAnker   (default is TRUE. removes the #something part of the urls. appears to correspond to $removeAnker in the signature above.)
bool   $ignoreInvalid   (default is TRUE. ignores invalid urls like "about:blank".)
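
The three filter flags can be pictured with a small stand-alone sketch. It is not the Bs_HtmlInfo code: the function name sketchCleanUrls() is hypothetical, and the list of "invalid" schemes (about:, javascript:, mailto:) is an assumption based on the examples given in this documentation.

```php
<?php
// Hypothetical helper, not part of Bs_HtmlInfo: filters a vector of urls
// the way the $dublicates / $ignoreAnker / $ignoreInvalid flags are described.
function sketchCleanUrls(array $urls, bool $duplicates = true, bool $removeAnchor = true, bool $ignoreInvalid = true): array
{
    $out = [];
    foreach ($urls as $url) {
        // Assumption: these schemes count as "invalid" urls.
        if ($ignoreInvalid && preg_match('/^(about|javascript|mailto):/i', $url)) {
            continue;
        }
        if ($removeAnchor) {
            // Drop the "#something" part.
            $url = preg_replace('/#.*$/s', '', $url);
        }
        if (!$duplicates && in_array($url, $out, true)) {
            continue; // the same url is not returned twice
        }
        $out[] = $url;
    }
    return $out;
}
```

Note that the anchor is removed before the duplicate check in this sketch, so "a.html#top" and "a.html" count as the same url. Whether the class does it in that order is not stated here.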


method fetchImageTexts [line 491]

mixed fetchImageTexts( [string $returnType = 'array'])

returns the alt and title attributes of all images.

param $returnType:

  • 'array': vector of hashes with the keys 'alt' and 'title'. check these with empty() because a value can be an empty string or NULL.
  • 'string': a string where all alt and title strings are separated by space dot space (" . ").




Tags:

return:  (see param returnType and read above)
access:  public


Parameters:

string   $returnType   (default is 'array', can be 'string', read above)
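
To illustrate the relation between the two return types, here is a sketch that builds the 'string' form from the 'array' form. The function name is hypothetical, and skipping empty values is an assumption; the documentation only says the strings are separated by space dot space.

```php
<?php
// Hypothetical helper, not part of Bs_HtmlInfo: joins the alt and title
// values of a vector of hashes with " . ".
function sketchImageTextsToString(array $imageTexts): string
{
    $parts = [];
    foreach ($imageTexts as $texts) {
        foreach (['alt', 'title'] as $key) {
            // empty() covers both NULL and the empty string
            // (and, as usual in PHP, the string "0").
            if (!empty($texts[$key])) {
                $parts[] = $texts[$key];
            }
        }
    }
    return implode(' . ', $parts);
}
```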


method fetchKeywords [line 156]

string fetchKeywords( [bool $html = FALSE])

returns the meta keywords.

currently returns them in lowercase!




Tags:

return:  (may be empty)
access:  public


Parameters:

bool   $html   (default is false which means convert to plain text.)


method fetchLinks [line 272]

mixed fetchLinks( [int $dublicates = 2], [bool $urlOnly = FALSE], [bool $useBaseTag = FALSE], [mixed $removeAnker = TRUE], [bool $ignoreInvalid = TRUE], bool $ignoreAnker)

return the links of the document with additional information.

param $urlOnly: if the 2nd param $urlOnly is turned on then the returned value will be a vector filled with the urls. otherwise the returned vector holds hashes with the keys:

  • 'href' => string, the url the link goes to. note that this may be a relative url. it's up to you to make something useful of it, based on where you found the url.
  • 'caption' => string, may be empty. the text between the <a></a> tags with all inner tags removed. if there is only an image and that image has an alt attribute (or a title attribute as 2nd fallback) then that value will be returned. all text is converted from html to plain text.
  • 'target' => string, NULL if not available. like "_blank" or so.

if the tag has no href part then it won't be returned at all.

param $useBaseTag: base tag support has been removed again; use fetchBaseTag() instead and do it yourself. there are too many special cases to take into account. --andrej

the original description, for reference: by default we watch out for the base tag in the head of the document. it looks like: <base href="http://www.blueshoes.org/" target="_parent"> it can have a href attribute, a target attribute or both. if the href attribute is present, all urls in the document (images, links in this case) that don't link to something absolutely (eg http://, ftp://, https://) will be prefixed with that string. if the target attribute is present then all links that don't use their own target will use this one.

param $ignoreInvalid: ignores links like javascript:, about:blank and mailto:. they almost always rely on javascript, onclick handlers etc. there is not much we can do to find out the real urls, they are dynamic anyway. default is TRUE.




Tags:

return:  (vector holding hashes, or vector only, depending on $urlOnly, see above.)
access:  public


Parameters:

int   $dublicates   (default is 2. 0=no, 1=only if different caption, 2=yes)
bool   $urlOnly   (default is FALSE, affects the returned array, see above.)
bool   $useBaseTag   (currently not supported. default is FALSE. see above.)
bool   $ignoreAnker   (default is TRUE. removes the #something part of the urls. appears to correspond to $removeAnker in the signature above.)
bool   $ignoreInvalid   (ignore invalid url's, see above.)
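
Since base tag handling is left to the caller, the following sketch shows one way to "do it yourself". It is an assumption-laden illustration, not BlueShoes code: sketchApplyBase() is a hypothetical name, $links is taken to be the vector of hashes described above and $base the hash described for fetchBaseTag(). It applies the plain prefixing from the original description and deliberately ignores special cases such as "/root-relative" or "../" paths.

```php
<?php
// Hypothetical helper, not part of Bs_HtmlInfo: applies a base tag hash
// ('href', 'target') to a vector of link hashes ('href', 'caption', 'target').
function sketchApplyBase(array $links, array $base): array
{
    foreach ($links as &$link) {
        // Treat anything that starts with a scheme (http:, ftp:, ...) as absolute.
        $isAbsolute = preg_match('/^[a-z][a-z0-9+.\-]*:/i', $link['href']) === 1;
        if (!$isAbsolute && $base['href'] !== null) {
            // Plain prefixing, as in the description above.
            $link['href'] = $base['href'] . $link['href'];
        }
        // Links without their own target inherit the base target.
        if ($link['target'] === null && $base['target'] !== null) {
            $link['target'] = $base['target'];
        }
    }
    unset($link); // break the reference left over from the loop
    return $links;
}
```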


method fetchMetaData [line 435]

array fetchMetaData( [bool $html = FALSE], [bool $withHttpEquiv = TRUE])

returns all available meta tags.

note:

  • the same name may be used more than once. if there are 2 <meta name="foo"> then you will get 2 such hashes back. do with them what you want.
  • all names are converted to lowercase, so it's "description" not "DESCRIPTION".
  • a meta tag may not be multiline by definition.
  • the meta tag is returned even if there is no valid or an empty content part. check for an empty string yourself.




Tags:

return:  (vector holding hashes where 'name' is the meta name, 'content' the meta content.)
access:  public


Parameters:

bool   $html   (default is FALSE which means convert all values to plain text.)
bool   $withHttpEquiv   (default is TRUE. if "HTTP-EQUIV=" style tags should be found/returned also. their key will be 'name' not 'http-equiv' in the returned array.)
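
Because the same name can occur more than once and empty content is passed through, callers usually need a small post-processing step. A minimal sketch, assuming the return format described above (a vector of hashes with 'name' and 'content'); the function name sketchGroupMeta() is hypothetical.

```php
<?php
// Hypothetical helper, not part of Bs_HtmlInfo: groups the vector returned
// by fetchMetaData() by meta name and drops entries without content.
function sketchGroupMeta(array $meta): array
{
    $grouped = [];
    foreach ($meta as $entry) {
        // The tag is returned even if its content is empty; skip those here.
        if ((string) $entry['content'] === '') {
            continue;
        }
        // Names are already lowercase, so they can be used as keys directly.
        $grouped[$entry['name']][] = $entry['content'];
    }
    return $grouped;
}
```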


method fetchStringsByTagNamesStupid [line 220]

array fetchStringsByTagNamesStupid( array $tagNames, [string $string = NULL], [bool $html = FALSE])

just like fetchStringsByTagNameStupid() but you can pass an array of $tagNames. read there!



Tags:

return:  (vector filled with strings.)
see:  $this->fetchStringsByTagNameStupid()


Parameters:

array   $tagNames  
string   $string   (the haystack, html code. if not specified then the current html code will be used.)
bool   $html   (default is FALSE which means convert values to plain text.)


method fetchStringsByTagNameStupid [line 191]

array fetchStringsByTagNameStupid( string $tagName, [string $string = NULL], [bool $html = FALSE])

finds all specified tags and returns the content.

this method name has the suffix "Stupid" because it behaves like that. this means: if the tag is found then there may not be any other html tags between the start and end tags. example:

  • <b>bold</b> => valid. 'bold' will be found.
  • <b>bold and <i>italic</i></b> => invalid, will be ignored.

using this specification you will only get words or phrases back that are directly enclosed in such tags. this simplifies a lot of things, and prevents the function from returning nonsense because the html document is broken (unclosed tags etc).

example:

  $string  = "some <b>bold</b> and <h1 style='foo'>header</h1> text.";
  $tagName = 'h1';
  fetchStringsByTagNameStupid($tagName, $string);

will return: array('header')




Tags:

return:  (vector filled with strings.)


Parameters:

string   $tagName  
string   $string   (the haystack, html code. if not specified then the current html code will be used.)
bool   $html   (default is FALSE which means convert values to plain text.)
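
The "no inner tags" rule maps naturally onto a regular expression. The sketch below is a stand-alone approximation, not the class's implementation: the function name is hypothetical, it always returns the raw match (no plain text conversion), and matching the tag name case-insensitively is an assumption.

```php
<?php
// Hypothetical helper, not part of Bs_HtmlInfo: returns the content of every
// $tagName element that contains no other tags.
function sketchStringsByTagName(string $tagName, string $html): array
{
    $tag = preg_quote($tagName, '/');
    // (?:\s[^>]*)? allows attributes but keeps 'b' from matching '<body>'.
    // [^<]* forbids any tag between the start and end tags.
    $pattern = '/<' . $tag . '(?:\s[^>]*)?>([^<]*)<\/' . $tag . '\s*>/i';
    preg_match_all($pattern, $html, $matches);
    return $matches[1];
}
```

Elements with nested or unclosed tags simply fail to match, which is what keeps malformed html from leaking into the result.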


method fetchTitle [line 122]

string fetchTitle( [bool $html = FALSE])

returns the content of the title tag.



Tags:

return:  (may be empty)
access:  public


Parameters:

bool   $html   (default is false which means convert to plain text.)


method getTagParam [line 701]

string getTagParam( string $param, string $tag)

returns parameter $param of the given $tag.

the parameter value may be in double quotes, single quotes or no quotes.




Tags:

return:  (the parameter value, or NULL if not found.)


Parameters:

string   $param   (like 'href' for an <a></a> tag.)
string   $tag   (the tag itself)
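
A rough sketch of how the three quoting styles can be handled with one pattern. This is an illustration under assumptions, not the method's source: sketchGetTagParam() is a hypothetical name, and the lookup is case-insensitive by choice.

```php
<?php
// Hypothetical helper, not part of Bs_HtmlInfo: extracts one parameter
// (attribute) value from a single tag string.
function sketchGetTagParam(string $param, string $tag): ?string
{
    $name = preg_quote($param, '/');
    // Three alternatives: "double quoted", 'single quoted', unquoted.
    $pattern = '/\s' . $name . '\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))/i';
    if (!preg_match($pattern, $tag, $m)) {
        return null; // parameter not found
    }
    // Unmatched trailing groups are omitted, so the last element is the value.
    return end($m);
}
```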


method get_strings_headed [line 679]

array get_strings_headed( int $from_headnumber, int $till_headnumber)

Returns all strings which are headed (<h1> ... </h1> etc)



Tags:

return:  the strings which have been found, pushed into an array


Parameters:

int   $from_headnumber  
int   $till_headnumber  


method get_strings_in_tag [line 663]

array get_strings_in_tag( string $start_tag, string $end_tag, string $string)

Returns all strings in $string which are between the start and end tag



Tags:

return:  the strings which have been found, pushed into an array


Parameters:

string   $start_tag   the starting tag
string   $end_tag   the end tag
string   $string   the string to search in


method get_strings_in_tags [line 622]

array get_strings_in_tags( array $tags, string $string)

Returns all strings in $string that are enclosed in the tags given in the parameter $tags

example: $string = "some <b>bold</b> and <h1 style='foo'>header</h1> text."; $tags = array('b', 'h1'); get_strings_in_tags($tags, $string);




Tags:

return:  the strings which have been found in an array


Parameters:

array   $tags   the tags in an array ($tags[$i]['open'] and $tags[$i]['close'])
string   $string   the HTML string


method htmlToText [line 728]

string htmlToText( string $string)

takes an html string and converts it to text:

  1) removes script tags and their content
  2) strips all tags (thus loses alt attributes)
  3) replaces \n \r \t with a space
  4) converts html special chars to plaintext, see Bs_HtmlUtil->htmlEntitiesUndo()

this method may not really belong to this class, but i used it internally, so why not make it publicly available.




Tags:

return:  (plain text)
access:  public


Parameters:

string   $string   (html code)
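
The four conversion steps can be sketched with standard PHP functions. This is only an approximation of the described behaviour: the function name is hypothetical, and PHP's html_entity_decode() is used as a stand-in for Bs_HtmlUtil->htmlEntitiesUndo(), whose exact behaviour is not documented here.

```php
<?php
// Hypothetical helper, not part of Bs_HtmlInfo: follows the four steps
// listed for htmlToText().
function sketchHtmlToText(string $html): string
{
    // 1) Remove script tags together with their content.
    $text = preg_replace('/<script\b[^>]*>.*?<\/script\s*>/is', '', $html);
    // 2) Strip all remaining tags (attribute text such as alt is lost).
    $text = strip_tags((string) $text);
    // 3) Replace newlines, carriage returns and tabs with a space.
    $text = str_replace(["\n", "\r", "\t"], ' ', $text);
    // 4) Decode entities; stand-in for Bs_HtmlUtil->htmlEntitiesUndo().
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML401, 'UTF-8');
}
```

The order matters: entities are decoded last, so an escaped "&lt;b&gt;" in the text is not mistaken for a tag and stripped.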


method initByPath [line 70]

bool initByPath( mixed $fullPath)

inits this class using a path to an html file on the file system.



Tags:

see:  Bs_HtmlInfo::initByString()
access:  public



method initByString [line 59]

bool initByString( mixed $str)

inits this class by passing an html string.



Tags:

see:  $this->initByPath()
access:  public



method initByUrl [line 85]

bool initByUrl( mixed $url)

inits this class using a url to an html file.



Tags:

todo:  add fallback code
see:  Bs_HtmlInfo::initByString()
access:  public




Documentation generated on Mon, 29 Dec 2003 21:10:51 +0100 by phpDocumentor 1.2.3