Class: Bs_HtmlInfo
Source Location: /core/html/Bs_HtmlInfo.class.php
Bs_Object
|
--Bs_HtmlInfo
Bs_HtmlInfo class. Fetches information about an HTML page.
Version:
- 4.3.$Revision: 1.5 $ $Date: 2003/11/19 08:18:13 $
Inherited Variables
Inherited Methods
Class Details
Class Variables
Class Methods
constructor Bs_HtmlInfo [line 48]
Bs_HtmlInfo Bs_HtmlInfo()
Constructor.
method fetchBaseTag [line 406]
Finds the base tag, if there is one. Returns a hash with the keys 'href' and 'target'. If one of them (or both) is not defined, its value is set to NULL.
method fetchBody [line 97]
string fetchBody([bool $html = FALSE])
Returns the content of the body.
method fetchDescription [line 137]
string fetchDescription([bool $html = FALSE])
Returns the meta description. Currently it is returned in lowercase!
method fetchIframeUrls [line 371]
array fetchIframeUrls([bool $dublicates = TRUE], [bool $useBaseTag = FALSE], [mixed $removeAnker = TRUE], [bool $ignoreInvalid = TRUE], bool $ignoreAnker)
Fetches the URLs of iframes. Note that a URL may be relative; it is up to you to resolve it based on where you found it.
method fetchImageTexts [line 491]
mixed fetchImageTexts([string $returnType = 'array'])
Returns the alt and title attributes of all images.
param $returnType:
- 'array': a vector of hashes with the keys 'alt' and 'title'. Check these with empty(), because a value can be an empty string or NULL.
- 'string': a single string in which all alt and title strings are separated by space, dot, space (' . ').
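The relationship between the two return types can be sketched as follows. This is a minimal illustration, not the library's code: sketch_image_texts_to_string() is a hypothetical helper, and skipping empty values before joining is an assumption.

```php
<?php
// Hypothetical helper, not part of Bs_HtmlInfo.
// Takes the 'array' form (hashes with 'alt' and 'title') and builds the
// 'string' form, joined by space dot space.
function sketch_image_texts_to_string($images)
{
    $parts = array();
    foreach ($images as $image) {
        foreach (array('alt', 'title') as $key) {
            // empty() covers both NULL and the empty string, as the
            // description above recommends. Assumption: empties are skipped.
            if (!empty($image[$key])) {
                $parts[] = $image[$key];
            }
        }
    }
    return implode(' . ', $parts);
}
```

Note that empty() also treats the string '0' as empty, so an alt text of "0" would be dropped by this sketch.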
method fetchKeywords [line 156]
string fetchKeywords([bool $html = FALSE])
Returns the meta keywords. Currently they are returned in lowercase!
method fetchLinks [line 272]
mixed fetchLinks([int $dublicates = 2], [bool $urlOnly = FALSE], [bool $useBaseTag = FALSE], [mixed $removeAnker = TRUE], [bool $ignoreInvalid = TRUE], bool $ignoreAnker)
Returns the links of the document with additional information. A tag that has no href attribute is not returned at all.
param $urlOnly: if turned on, the returned value is a vector filled with the URLs. Otherwise the returned vector holds hashes with the keys:
- 'href' => string, the URL the link goes to. Note that this may be a relative URL; it is up to you to resolve it based on where you found it.
- 'caption' => string, may be empty. The text between the <a></a> tags with all inner tags removed. If there is only an image, its alt attribute (or its title attribute as a second fallback) is returned. All text is converted from HTML to plain text.
- 'target' => string such as "_blank", or NULL if not available.
param $useBaseTag: base tag support has been removed again; use fetchBaseTag() instead and apply it yourself. There are too many special cases to take into account. --andrej
The description from before the removal: by default we watch out for the base tag in the head of the document. It looks like: <base href="http://www.blueshoes.org/" target="_parent">. It can have an href attribute, a target attribute, or both. If href is present, all URLs in the document (images, and links in this case) that are not absolute (e.g. http://, ftp://, https://) are prefixed with that string. If target is present, all links that do not set their own target use this one.
param $ignoreInvalid: ignores links such as javascript:, about:blank and mailto:. These almost always rely on JavaScript, onclick handlers and the like. Their real URLs are dynamic anyway, so there is not much we can do to find them. Default is TRUE.
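Since base tag handling is left to the caller, the following sketch shows one naive way to apply a base hash (in the shape described for fetchBaseTag()) to the link hashes. Both function names are hypothetical and not part of the library. The plain prefixing mirrors the description above and deliberately ignores the special cases the author mentions, such as hrefs starting with "/" or "../".

```php
<?php
// Hypothetical helpers, not part of Bs_HtmlInfo.

// Prefix a non-absolute URL with the base href. Anything that already
// starts with a scheme (http:, ftp:, mailto:, javascript:, ...) is kept.
function sketch_apply_base($href, $baseHref)
{
    if ($baseHref === null || preg_match('/^[a-z][a-z0-9+.\-]*:/i', $href)) {
        return $href;
    }
    return $baseHref . $href;
}

// Apply a base hash ('href', 'target') to a vector of link hashes
// ('href', 'caption', 'target').
function sketch_apply_base_to_links($links, $base)
{
    foreach ($links as $i => $link) {
        $links[$i]['href'] = sketch_apply_base($link['href'], $base['href']);
        // A link without its own target inherits the base target.
        if (!isset($link['target'])) {
            $links[$i]['target'] = $base['target'];
        }
    }
    return $links;
}
```

A real resolver would also have to handle root-relative paths, query-only references and fragments, which is probably why this was dropped from the class.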
method fetchMetaData [line 435]
array fetchMetaData([bool $html = FALSE], [bool $withHttpEquiv = TRUE])
Returns all available meta tags. Notes:
- The same name may be used more than once. If there are 2 <meta name="foo"> tags then you will get 2 such hashes back. Do with them what you want.
- All names are converted to lowercase, so it's "description", not "DESCRIPTION".
- A meta tag may not be multiline by definition.
- The meta tag is returned even if its content part is missing, invalid or empty. Check for an empty string yourself.
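One way to deal with repeated names and empty content is to group the returned hashes by name. The sketch below assumes each hash carries the keys 'name' and 'content'; the page does not state the key names, so treat them as hypothetical. sketch_group_meta() is not part of the library.

```php
<?php
// Hypothetical helper, not part of Bs_HtmlInfo.
// Assumption: each meta hash has the keys 'name' and 'content'.
function sketch_group_meta($metaTags)
{
    $grouped = array();
    foreach ($metaTags as $meta) {
        $content = isset($meta['content']) ? trim($meta['content']) : '';
        // Tags are returned even with empty content, so filter them here.
        if ($content === '') {
            continue;
        }
        // The same name may occur more than once: collect every value.
        $grouped[$meta['name']][] = $content;
    }
    return $grouped;
}
```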
method fetchStringsByTagNamesStupid [line 220]
array fetchStringsByTagNamesStupid(array $tagNames, [string $string = NULL], [bool $html = FALSE])
Just like fetchStringsByTagNameStupid(), but you can pass an array of $tagNames. See that method for details.
method fetchStringsByTagNameStupid [line 191]
array fetchStringsByTagNameStupid(string $tagName, [string $string = NULL], [bool $html = FALSE])
Finds all specified tags and returns their content. The method name has the suffix "Stupid" because that is how it behaves: a tag is only matched if there are no other HTML tags between its start and end tags.
Example:
- <b>bold</b> => valid. 'bold' will be found.
- <b>bold and <i>italic</i></b> => invalid, will be ignored.
With this rule you only get back words or phrases that are directly enclosed in such tags. This simplifies a lot of things and prevents the function from returning nonsense when the HTML document is malformed (unclosed tags etc.).
Example:
$string = "some <b>bold</b> and <h1 style='foo'>header</h1> text.";
$tagName = 'h1';
fetchStringsByTagNameStupid($tagName, $string);
will return: array('header')
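The "no inner tags" rule maps naturally onto a regular expression that forbids "<" between the start and end tag. The following is a sketch of that idea only; sketch_strings_by_tag_name() is a hypothetical function and the library's actual implementation may differ.

```php
<?php
// Hypothetical helper, not part of Bs_HtmlInfo.
function sketch_strings_by_tag_name($tagName, $string)
{
    $name = preg_quote($tagName, '/');
    // <name ...> followed by text without any "<", followed by </name>.
    // [^<]* is what rejects content that contains further tags.
    $pattern = '/<' . $name . '(?:\s[^>]*)?>([^<]*)<\/' . $name . '\s*>/i';
    preg_match_all($pattern, $string, $matches);
    return $matches[1];
}
```

Because the content group cannot contain "<", an unclosed or nested tag simply produces no match instead of swallowing the rest of the document.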
method fetchTitle [line 122]
string fetchTitle([bool $html = FALSE])
Returns the content of the title tag.
method getTagParam [line 701]
string getTagParam(string $param, string $tag)
Returns the parameter $param of the given $tag. The parameter value may be in double quotes, single quotes or no quotes.
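Handling all three quoting styles can be sketched with a single pattern. This is an illustration under assumptions: sketch_get_tag_param() is a hypothetical function, and returning NULL for a missing parameter is a choice made here, not documented behavior.

```php
<?php
// Hypothetical helper, not part of Bs_HtmlInfo.
function sketch_get_tag_param($param, $tag)
{
    // (?| ... ) is a PCRE branch-reset group: whichever alternative matches
    // (double-quoted, single-quoted, unquoted), the value lands in group 1.
    $pattern = '/\s' . preg_quote($param, '/')
             . '\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s>]+))/i';
    if (preg_match($pattern, $tag, $m)) {
        return $m[1];
    }
    return null; // assumption: the real method may signal this differently
}
```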
method get_strings_headed [line 679]
array get_strings_headed(int $from_headnumber, int $till_headnumber)
Returns all strings which are headed (<h1> ... </h1> etc.).
method get_strings_in_tag [line 663]
array get_strings_in_tag(string $start_tag, string $end_tag, string $string)
Returns all strings in $string which are between the start and end tag.
method get_strings_in_tags [line 622]
array get_strings_in_tags(array $tags, string $string)
Returns all strings in $string that are enclosed in the tags given in the parameter $tags.
Example:
$string = "some <b>bold</b> and <h1 style='foo'>header</h1> text.";
$tags = array('b', 'h1');
get_strings_in_tags($tags, $string);
method htmlToText [line 728]
string htmlToText(string $string)
Takes an HTML string and converts it to text:
1) removes script tags and their content
2) strips all tags (thus loses alt attributes)
3) replaces \n, \r and \t with a space
4) converts HTML special chars to plain text, see Bs_HtmlUtil->htmlEntitiesUndo()
This method may not really belong to this class, but it is used internally, so it is made publicly available as well.
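The four steps can be approximated with PHP built-ins. This is a sketch only: sketch_html_to_text() is a hypothetical function, and html_entity_decode() stands in for Bs_HtmlUtil->htmlEntitiesUndo(), whose exact behavior is not described on this page.

```php
<?php
// Hypothetical helper, not part of Bs_HtmlInfo.
function sketch_html_to_text($string)
{
    // 1) remove script tags together with their content
    $string = preg_replace('/<script\b[^>]*>.*?<\/script\s*>/is', '', $string);
    // 2) strip all remaining tags (attribute values such as alt are lost)
    $string = strip_tags($string);
    // 3) replace \n, \r and \t with a space
    $string = str_replace(array("\n", "\r", "\t"), ' ', $string);
    // 4) convert entities to plain text (stand-in for htmlEntitiesUndo())
    return html_entity_decode($string, ENT_QUOTES, 'UTF-8');
}
```

The order matters: decoding entities before stripping tags would turn an escaped "&lt;b&gt;" into a real tag and remove it.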
method initByPath [line 70]
bool initByPath(mixed $fullPath)
Inits this class using a path to an HTML file on the file system.
method initByString [line 59]
bool initByString(mixed $str)
Inits this class by passing an HTML string.
method initByUrl [line 85]
bool initByUrl(mixed $url)
Inits this class using a URL to an HTML file.
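A typical flow is to initialize the object once and then call the fetch methods. The sketch below shows that flow with a hypothetical helper, sketch_page_summary(), that accepts any object exposing the documented fetchTitle(), fetchDescription() and fetchKeywords() methods. The commented usage assumes the BlueShoes class file is already loaded and that the init methods return TRUE on success, which this page does not spell out.

```php
<?php
// Hypothetical helper, not part of Bs_HtmlInfo.
// $info is expected to be an initialized Bs_HtmlInfo instance (or any
// object with the same three methods).
function sketch_page_summary($info)
{
    return array(
        'title'       => $info->fetchTitle(),
        'description' => $info->fetchDescription(), // currently lowercase
        'keywords'    => $info->fetchKeywords(),    // currently lowercase
    );
}

// Intended use (assumes the class is loaded and init returns TRUE on success):
// $info = new Bs_HtmlInfo();
// if ($info->initByString($html)) {
//     print_r(sketch_page_summary($info));
// }
```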