blueshoes php application framework and cms            applications_websearchengine

Class: Bs_Wse_Walker

Source Location: /applications/websearchengine/Bs_Wse_Walker.class.php

Class Overview

Bs_Object
   |
   --Bs_Wse_Walker

todo: on windows machines, compare urls case-insensitively. this affects many pieces of the code.


Author(s):

Version:

  • 4.3.$Revision: 1.5 $ $Date: 2003/11/29 22:22:14 $

Copyright:

  • blueshoes.org

Inherited Methods

Class: Bs_Object

Bs_Object::Bs_Object()
Bs_Object::getErrors()
Basic error handling: Get *all* errors as string array from the global Bs_Error-error stack.
Bs_Object::getLastError()
Basic error handling: Get last error string from the global Bs_Error-error stack.
Bs_Object::getLastErrors()
Basic error handling: Get the last errors as string array from the global Bs_Error-error stack since the last call of getLastErrors().
Bs_Object::persist()
Persists this object by serializing it and saving it to a file with unique name.
Bs_Object::setError()
Basic error handling: Push an error string on the global Bs_Error-error stack.
Bs_Object::toHtml()
Dumps the content of this object to a string using PHP's var_dump().
Bs_Object::toString()
Dumps the content of this object to a string using PHP's var_dump().
Bs_Object::unpersist()
Fetches an object that was persisted with persist()

Class Details

[line 45]
todo: on windows machines, compare urls case-insensitively. this affects many pieces of the code.

  • do something with redirects (403 and meta).
  • weight pages differently based on back-links and distance from the front page.
  • add include/exclude mask.
  • add different entry points.
  • group (split) the index based on website part/lang.
  • add a search frontend with record preview.
  • fetch pages first, persist, then index with link-back data. keep archive page.
  • link page to db (include db when indexing, for example a phonebook page).
  • property for ignoring frame pages.
  • allow parts of the page to be tagged as noindex.
  • respect robots.txt and similar meta tags.

dependencies: Bs_Url, Bs_HttpClient, Bs_StopWatch




Tags:

pattern:  singleton: (pseudostatic)
since:  bs4.3
access:  public
version:  4.3.$Revision: 1.5 $ $Date: 2003/11/29 22:22:14 $
copyright:  blueshoes.org
author:  Andrej Arn <at blueshoes dot org>


[ Top ]


Class Variables

$Bs_Url =

[line 61]

reference to global pseudostatic instance.

gets set in the constructor.




Tags:

access:  public

Type:   object


[ Top ]

$Bs_Wse_WebSearchEngine =

[line 53]

reference to Bs_Wse_WebSearchEngine singleton.

gets set in the constructor.




Tags:

access:  public

Type:   object


[ Top ]

$registeredIndexCallback =

[line 171]

a function to call on every index() call.

can be used to count the number of times index() has been called, to measure elapsed time, to stop the indexing process at some point, and so on.

the function will receive one parameter: a reference to this object (take it by ref!).




Tags:

access:  public

Type:   string
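
a minimal sketch of how such a callback might be wired up. DemoWalker is a hypothetical stand-in for Bs_Wse_Walker that models only the callback hook, and countIndexCalls is an illustrative name. whether the real index() invokes the callback before or after indexing is an assumption here. the text above asks for the parameter to be taken by reference, which matters in php 4; in php 5 and later objects are passed by handle, so the sketch uses a plain parameter.

```php
<?php
// Hypothetical stand-in for Bs_Wse_Walker; only the callback hook is modelled.
class DemoWalker {
    public $registeredIndexCallback = null; // name of a global function, as a string
    public $indexedUrls = array();

    function index($url) {
        $callback = $this->registeredIndexCallback;
        if (is_string($callback) && function_exists($callback)) {
            $callback($this); // hand the walker itself to the callback
        }
        $this->indexedUrls[] = $url; // placeholder for the real indexing work
        return true;
    }
}

$GLOBALS['indexCallCount'] = 0;

// Illustrative callback: counts how often index() has been called.
function countIndexCalls($walker) {
    $GLOBALS['indexCallCount']++;
}

$walker = new DemoWalker();
$walker->registeredIndexCallback = 'countIndexCalls';
$walker->index('http://www.example.com/');
$walker->index('http://www.example.com/about.html');
```

the same hook could just as well read a timer or set a flag that stops a long crawl.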


[ Top ]

$stopWatch =

[line 92]

instance of Bs_StopWatch.

gets created in the constructor for benchmarking, seeing where bottlenecks are, debugging etc.




Tags:

access:  public

Type:   object


[ Top ]

$_httpClient =

[line 76]


Type:   mixed


[ Top ]

$_indexer =

[line 74]

instance of Bs_Is_Indexer. gets set in the constructor.


Type:   mixed


[ Top ]

$_keywordsFrontpage =

[line 158]

same as $_descriptionFrontpage but for keywords.


Type:   mixed


[ Top ]

$_keywordsMd5 = array()

[line 153]

same as $_descriptionMd5 but for keywords.


Type:   mixed


[ Top ]



Class Methods


constructor Bs_Wse_Walker [line 180]

Bs_Wse_Walker Bs_Wse_Walker( object &$Bs_Wse_WebSearchEngine, object &$profile, object &$bsDb)

constructor.



Parameters:

object   &$Bs_Wse_WebSearchEngine  
object   &$profile   (instance of Bs_Wse_Profile.)
object   &$bsDb   (instance of Bs_Db.)

[ Top ]

method dropTodoStack [line 687]

bool dropTodoStack( )

drops all entries in the todo stack.



Tags:

return:  TRUE
throws:  bs_exception
access:  public


[ Top ]

method fetchLinksFromPageByUrl [line 1173]

array fetchLinksFromPageByUrl( string $url)

returns the outgoing links found on the page at the given url.

note that the same url may appear more than once, often with a different caption.

urls are compared case-insensitively here. this may be a problem on unix if two urls differ only in case, but that rarely happens.




Tags:

return:  (vector holding hashes with the keys 'href' and 'caption'. may be empty.)
throws:  bs_exception
access:  public


Parameters:

string   $url  
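
a sketch of working with the documented return shape (a vector of hashes with the keys 'href' and 'caption'). the sample data and the helper name groupCaptionsByHref are made up for illustration. the helper folds urls to lower case to mirror the case-insensitive comparison described above, and collects every caption seen for the same url.

```php
<?php
// Hypothetical helper: group captions by url, ignoring case differences in the url.
function groupCaptionsByHref(array $links) {
    $grouped = array();
    foreach ($links as $link) {
        $key = strtolower($link['href']); // case-insensitive key
        if (!isset($grouped[$key])) {
            $grouped[$key] = array();
        }
        $grouped[$key][] = $link['caption'];
    }
    return $grouped;
}

// Made-up sample in the documented shape.
$links = array(
    array('href' => 'http://www.example.com/About.html',   'caption' => 'About us'),
    array('href' => 'http://www.example.com/about.html',   'caption' => 'Company'),
    array('href' => 'http://www.example.com/contact.html', 'caption' => 'Contact'),
);

$grouped = groupCaptionsByHref($links);
```

here the first two entries end up under one key, because they differ only in case.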

[ Top ]

method fetchLinksToPageByUrl [line 1148]

array fetchLinksToPageByUrl( string $url)

returns the links that are pointing to the given url.

note that the same url may appear more than once, often with a different caption.

urls are compared case-insensitively here. this may be a problem on unix if two urls differ only in case, but that rarely happens.




Tags:

return:  (vector holding hashes with the keys 'href' and 'caption'. may be empty.)
throws:  bs_exception
access:  public


Parameters:

string   $url  

[ Top ]

method fetchPageInfoById [line 789]

mixed fetchPageInfoById( int $pageID)

returns an array with the indexed information about a page.



Tags:

return:  (hash if found, NULL if not.)
see:  Bs_Wse_Walker::fetchPageInfoByUrl()
throws:  bs_exception
access:  public


Parameters:

int   $pageID  

[ Top ]

method fetchPageInfoByUrl [line 770]

mixed fetchPageInfoByUrl( string $url)

returns an array with the indexed information about a page.

the url is compared case-insensitively.




Tags:

return:  (hash if found, NULL if not.)
see:  Bs_Wse_Walker::fetchPageInfoById()
throws:  bs_exception
access:  public


Parameters:

string   $url  

[ Top ]

method fetchPageList [line 807]

array fetchPageList( )

returns an array of all indexed pages.

key is the internal ID, value is the url.




Tags:

return:  (hash)
throws:  bs_exception
access:  public


[ Top ]

method fetchWordsForPageByID [line 1305]

array fetchWordsForPageByID( int $pageID, [string $order = 'caption'])

returns word information about the given webpage.

the returned vector holds hashes with the keys: caption, wordID, ranking




Tags:

return:  (vector holding hashes, see above)
access:  public


Parameters:

int   $pageID  
string   $order   (default is 'caption', can also be 'ranking'.)
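
a sketch of what the two $order values could mean for the returned vector. the sample rows and the helper sortWords are hypothetical. the sort directions (captions ascending, rankings with the highest first) are assumptions, since the documentation above does not state them.

```php
<?php
// Hypothetical helper: order a word vector by 'caption' or by 'ranking'.
function sortWords(array $words, $order = 'caption') {
    usort($words, function ($a, $b) use ($order) {
        if ($order === 'ranking') {
            // Assumption: higher ranking first.
            if ($a['ranking'] == $b['ranking']) {
                return 0;
            }
            return ($a['ranking'] > $b['ranking']) ? -1 : 1;
        }
        return strcmp($a['caption'], $b['caption']); // alphabetical by caption
    });
    return $words;
}

// Made-up sample with the documented keys.
$words = array(
    array('caption' => 'search',  'wordID' => 7, 'ranking' => 12),
    array('caption' => 'engine',  'wordID' => 3, 'ranking' => 40),
    array('caption' => 'walker',  'wordID' => 9, 'ranking' => 25),
);

$byCaption = sortWords($words);
$byRanking = sortWords($words, 'ranking');
```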

[ Top ]

method getExternalLinks [line 1191]

array getExternalLinks( )

returns a list with the external links.



Tags:

return:  (hash where key is the ID, val is a hash with the keys urlFrom, urlTo and caption.)
throws:  bs_exception
access:  public


[ Top ]

method getPageID [line 733]

mixed getPageID( string $url)

returns the pageID for the given url.



Tags:

return:  (int page id > 0 if found, NULL if not found.)
throws:  bs_exception


Parameters:

string   $url  

[ Top ]

method getPageUrl [line 750]

mixed getPageUrl( int $pageID)

returns the url for the given pageID.



Tags:

return:  (string page url if found, NULL if not found.)
throws:  bs_exception


Parameters:

int   $pageID  

[ Top ]

method index [line 247]

bool index( string $url, [string $follow = FALSE])

indexes the url/page specified.



Tags:

return:  TRUE
throws:  bs_exception
access:  public


Parameters:

string   $url  
string   $follow   (if we should follow links and index them as well, default is FALSE.)

[ Top ]

method isIgnoredUrl [line 938]

bool isIgnoredUrl( string $url, [bool $forIndexing = TRUE])

tells if the given url is ignored or not (depending on the current settings).

param $forIndexing: deciding whether a url is ignored works slightly differently when the url is only being put into the todo queue. some things can only be determined definitively when the page is actually indexed, because a lot can happen, and a lot of time can pass, between seeing the url and indexing it.

the checks:

  1) ignore file extensions (hard-coded ones like gif, jpg ...).
  2) ignore user-defined file extensions, see var Bs_Wse_Profile->ignoreFileExtensions.
  3) ignore based on robots.txt and robots meta tag (todo).
  4) use var Bs_Wse_Profile->allowIgnore
  5) use var Bs_Wse_Profile->allowUrls (deprecated)
  6) use var Bs_Wse_Profile->queryStringUrlLimit
  7) use var Bs_Wse_Profile->ignoreUrls (deprecated)




Tags:

access:  public


Parameters:

string   $url  
bool   $forIndexing   (default is TRUE, read above)
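
a sketch of the extension-based checks only (the first two items in the description above). hasIgnoredExtension and the extension list are hypothetical; the real method's internals are not shown in this documentation. it uses only the standard php functions parse_url() and pathinfo().

```php
<?php
// Hypothetical helper: TRUE if the url path ends in one of the given extensions.
function hasIgnoredExtension($url, array $ignoredExtensions) {
    $path = parse_url($url, PHP_URL_PATH); // query string is not part of the path
    if (!is_string($path) || $path === '') {
        return false; // no path, so no extension to judge
    }
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if ($extension === '') {
        return false;
    }
    return in_array($extension, $ignoredExtensions, true);
}

// Assumed list; a profile could merge in its own extensions.
$ignored = array('gif', 'jpg', 'png', 'zip');

$isImageIgnored = hasIgnoredExtension('http://www.example.com/img/logo.GIF?x=1', $ignored);
$isPageIgnored  = hasIgnoredExtension('http://www.example.com/index.html', $ignored);
$isRootIgnored  = hasIgnoredExtension('http://www.example.com/', $ignored);
```

lower-casing the extension keeps the check consistent with the case-insensitive url handling used elsewhere in this class.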

[ Top ]

method loadTodoStack [line 648]

bool loadTodoStack( )

loads the persisted stack into the current stack.



Tags:

return:  TRUE
throws:  bs_exception
access:  public


[ Top ]

method persistTodoStack [line 669]

bool persistTodoStack( )

persists the todo-stack so we can work on it later (continue/resume).



Tags:

return:  TRUE
throws:  bs_exception
access:  public
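
a sketch of the persist/resume idea behind persistTodoStack() and loadTodoStack(). persistStack and loadStack are hypothetical helpers that write to a temporary file with serialize(); where the real class keeps its stack (it may well be the database) is not stated here, so the storage is an assumption.

```php
<?php
// Hypothetical helper: write the todo stack to a file.
function persistStack(array $stack, $file) {
    return file_put_contents($file, serialize($stack)) !== false;
}

// Hypothetical helper: read the stack back; an empty stack if nothing usable is stored.
function loadStack($file) {
    if (!is_readable($file)) {
        return array();
    }
    $data = unserialize(file_get_contents($file));
    return is_array($data) ? $data : array();
}

$file = tempnam(sys_get_temp_dir(), 'todo');
$todo = array('http://www.example.com/a.html', 'http://www.example.com/b.html');

persistStack($todo, $file);   // end of the first run
$resumed = loadStack($file);  // start of a later run
unlink($file);                // clean up the temporary file
```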


[ Top ]

method setDbByDsn [line 226]

bool setDbByDsn( array $dsn)



Tags:

return:  TRUE
throws:  bs_exception
access:  public


Parameters:

array   $dsn  

[ Top ]

method setDbByObj [line 214]

void setDbByObj( object &$bsDb)

gives this class a db connection.



Tags:

access:  public


Parameters:

object   &$bsDb  

[ Top ]

method validateExternalLinks [line 1258]

string validateExternalLinks( )

validates all external links and returns html output describing the results.



Tags:

return:  (html code)
throws:  bs_exception
access:  public


[ Top ]


Documentation generated on Mon, 29 Dec 2003 21:13:25 +0100 by phpDocumentor 1.2.3