blueshoes php application framework and cms            plugins_indexserver

Class: Bs_Is_WebSearchEngine

Source Location: /plugins/indexserver/Bs_Is_WebSearchEngine.class.php

Class Overview

Bs_Object
   |
   --Bs_Is_WebSearchEngine

todo: on windows machines, compare urls case-insensitively, in many pieces of the code.


Author(s):

Version:

  • 4.3.$Revision: 1.2 $ $Date: 2003/08/09 15:23:14 $

Copyright:

  • blueshoes.org

Variables

Methods


Inherited Variables

Inherited Methods

Class: Bs_Object

Bs_Object::Bs_Object()
Bs_Object::getErrors()
Basic error handling: Get *all* errors as string array from the global Bs_Error-error stack.
Bs_Object::getLastError()
Basic error handling: Get last error string from the global Bs_Error-error stack.
Bs_Object::getLastErrors()
Basic error handling: Get last errors string array from the global Bs_Error-error stack since last call of getLastErrors().
Bs_Object::persist()
Persists this object by serializing it and saving it to a file with unique name.
Bs_Object::setError()
Basic error handling: Push an error string on the global Bs_Error-error stack.
Bs_Object::toHtml()
Dumps the content of this object to a string using PHP's var_dump().
Bs_Object::toString()
Dumps the content of this object to a string using PHP's var_dump().
Bs_Object::unpersist()
Fetches an object that was persisted with persist()

Class Details

[line 45]
todo: on windows machines, compare urls case-insensitively, in many pieces of the code.

further todos:
  • do something with redirects (403 and meta).
  • weight pages differently based on back-links and distance from the front page.
  • add an include/exclude mask.
  • add different entry points.
  • group (split) the index based on website part/language.
  • add a search frontend with record preview.
  • fetch pages first, persist, then index with link-back data. keep an archive page.
  • link pages to a db (include the db when indexing, for example a phonebook page).
  • property for ignoring frame pages.
  • allow parts of the page to be tagged as noindex.
  • respect robots.txt and similar meta tags.

dependencies: Bs_Url, Bs_HttpClient, Bs_StopWatch




Tags:

pattern:  singleton: (pseudostatic)
since:  bs4.3
access:  public
version:  4.3.$Revision: 1.2 $ $Date: 2003/08/09 15:23:14 $
copyright:  blueshoes.org
author:  Andrej Arn <at blueshoes dot org>


[ Top ]


Class Variables

$allowUrls = array()

[line 227]

same format as $ignoreUrls, but urls that match here will always go through.

they are not subject to the $queryStringUrlLimit limit.



Type:   mixed


[ Top ]

$Bs_Url =

[line 54]

reference to global pseudostatic instance.

gets set in the constructor.




Tags:

access:  public

Type:   object


[ Top ]

$detectDublicatePages =  TRUE

[line 115]

detects duplicate pages even if they have different urls.

uses an md5 hash to do so. this TAKES TIME: 5-60 seconds depending on cpu and page size.
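a minimal sketch of the md5 idea, not the class's own code. the helper name isDuplicatePage() and the $seenHashes array are hypothetical, and the real implementation may normalize or store the content differently.

```php
<?php
// Hypothetical helper: keeps one md5 fingerprint per page body and reports
// whether the same content was already seen under a different url.
function isDuplicatePage(string $body, string $url, array &$seenHashes): bool {
    $hash = md5($body);
    if (isset($seenHashes[$hash]) && $seenHashes[$hash] !== $url) {
        return true; // same content, different url
    }
    $seenHashes[$hash] = $url; // remember the first url for this content
    return false;
}

$seen   = array();
$first  = isDuplicatePage('<html>hello</html>', 'http://www.example.com/a.php', $seen);
$second = isDuplicatePage('<html>hello</html>', 'http://www.example.com/b.php', $seen);
```

here $first is FALSE and $second is TRUE, because both urls deliver identical content.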




Tags:

access:  public

Type:   bool


[ Top ]

$ignoreFileExtensions = array('zip', 'tar', 'tgz', 'gz', 'bz2', 'pdb', 'chm')

[line 235]

links to urls with such file endings will be ignored, no matter what.

even $allowUrls won't make a change.
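the extension check could look roughly like this. hasIgnoredExtension() is a hypothetical name, and lower-casing the extension before comparing is an assumption; the source does not say whether the comparison is case-sensitive.

```php
<?php
// Hypothetical helper: looks only at the path of the url, so a query
// string such as "?x=1" does not hide the file extension.
function hasIgnoredExtension(string $url, array $ignoreFileExtensions): bool {
    $path = (string) parse_url($url, PHP_URL_PATH);
    // assumption: extensions are compared case-insensitively
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return $ext !== '' && in_array($ext, $ignoreFileExtensions, true);
}

$ignoreFileExtensions = array('zip', 'tar', 'tgz', 'gz', 'bz2', 'pdb', 'chm');
$archive = hasIgnoredExtension('http://www.example.com/files/archive.tar.gz', $ignoreFileExtensions);
$page    = hasIgnoredExtension('http://www.example.com/index.html', $ignoreFileExtensions);
```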




Tags:

access:  public

Type:   array


[ Top ]

$ignoreUrls = array()

[line 221]

(parts of) urls defined here will be excluded from indexing.

vector holding hashes with the keys 'value' and 'type'. possible options for 'type' are:
  • 'part' => this string occurs in the url. this is the default.
  • 'file' => the filename of the requested page equals this.
  • 'preg' => perl style regular expression.
  • 'ereg' => posix style regular expression. better use preg.

example:
  $ignoreUrls = array(
    array('value'=>'/foo/', 'type'=>'part'),
    array('value'=>'forum.php', 'type'=>'file'),
    array('value'=>'/\.*foo\.*', 'type'=>'preg'),
  );

if you want to exclude directories then use 'value'=>'', 'type'=>'file', so that urls with an empty file part (meaning: no file part) will be ignored. untested feature.

note that instead of ignoring urls that contain a "?" question mark here you can fine-tune $this->queryStringUrlLimit.

for 'part' and 'file' the comparisons will be made case-insensitively when working on windows hosts. (todo)
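how the three main rule types could be evaluated is sketched here. matchesIgnoreRule() and isIgnoredBySketch() are hypothetical names, the class's own isIgnoredUrl() may work differently, and the sketch assumes that 'preg' values are complete patterns including delimiters.

```php
<?php
// Hypothetical matcher for a single $ignoreUrls entry.
function matchesIgnoreRule(string $url, array $rule): bool {
    $type  = isset($rule['type']) ? $rule['type'] : 'part'; // 'part' is the default
    $value = $rule['value'];
    switch ($type) {
        case 'part':
            // an empty value would match every url, so treat it as no match
            return $value !== '' && strpos($url, $value) !== false;
        case 'file':
            // the file part is whatever follows the last slash of the path
            $path  = (string) parse_url($url, PHP_URL_PATH);
            $slash = strrpos($path, '/');
            $file  = ($slash === false) ? $path : (string) substr($path, $slash + 1);
            return $file === $value;
        case 'preg':
            // assumption: the value is a full pattern with delimiters
            return preg_match($value, $url) === 1;
        default:
            return false; // 'ereg' is left out: the ereg functions were removed in PHP 7
    }
}

// Hypothetical wrapper: a url is ignored as soon as one rule matches.
function isIgnoredBySketch(string $url, array $rules): bool {
    foreach ($rules as $rule) {
        if (matchesIgnoreRule($url, $rule)) {
            return true;
        }
    }
    return false;
}

$rules = array(
    array('value' => '/foo/',     'type' => 'part'),
    array('value' => 'forum.php', 'type' => 'file'),
    array('value' => '/\.pdf$/i', 'type' => 'preg'),
);
$ignored = isIgnoredBySketch('http://www.example.com/board/forum.php?id=3', $rules);
```

with a rule of array('value'=>'', 'type'=>'file') the same matcher treats a url such as http://www.example.com/dir/ as a directory, because nothing follows the last slash of its path.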




Tags:

access:  public

Type:   array


[ Top ]

$ignoreUrlsWithPass =  TRUE

[line 143]

urls with a pass part will be ignored.

scheme://user:pass@host:port/path?query#fragment
              ^^^^




Tags:

var:  (default is TRUE)
todo:  implement this
access:  public

Type:   bool


[ Top ]

$ignoreUrlsWithUser =  TRUE

[line 133]

urls with a user part will be ignored.

scheme://user:pass@host:port/path?query#fragment
         ^^^^
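since both settings are still marked as todo, the following is only a sketch of how the user and pass parts could be detected with PHP's parse_url(). urlHasCredentials() is a hypothetical name.

```php
<?php
// Hypothetical check: reports which credential parts a url carries.
function urlHasCredentials(string $url): array {
    $parts = parse_url($url);
    if ($parts === false) {
        $parts = array(); // seriously malformed url, nothing to inspect
    }
    return array(
        'user' => isset($parts['user']),
        'pass' => isset($parts['pass']),
    );
}

$withBoth = urlHasCredentials('http://joe:secret@www.example.com:8080/path?query#fragment');
$plain    = urlHasCredentials('http://www.example.com/path');
```

a crawler honoring $ignoreUrlsWithUser and $ignoreUrlsWithPass would skip the url when the matching flag is TRUE and the corresponding part is present.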




Tags:

var:  (default is TRUE)
todo:  implement this
access:  public

Type:   bool


[ Top ]

$indexIframes =  1

[line 106]

should iframes be indexed?
  0 => no
  1 => yes, as part of the surrounding page. use the body of the iframe page and add it to the body of the "parent" page. this is the default, and makes sense.
  2 => yes, as standalone page. this feature is not implemented yet, and i won't do it unless i see a use for it.




Tags:

access:  public

Type:   int


[ Top ]

$limitDomains =

[line 190]

the allowed domains. urls found on a domain that is not listed here will be ignored.

example: $limitDomains = array('www.blueshoes.org') means that "developer.blueshoes.org" and "blueshoes.org" urls will be ignored.

if this is not set then only the domain of the first url passed to index() is allowed.
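the exact-match behaviour from the example can be sketched as follows. isAllowedDomain() is a hypothetical name, and comparing host names case-insensitively is an assumption.

```php
<?php
// Hypothetical check: the host of the url must equal one of the allowed
// domains exactly, so subdomains and the bare domain are not implied.
function isAllowedDomain(string $url, array $limitDomains): bool {
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // relative or malformed url without a host
    }
    return in_array(strtolower($host), array_map('strtolower', $limitDomains), true);
}

$limitDomains = array('www.blueshoes.org');
$allowed = isAllowedDomain('http://www.blueshoes.org/en/', $limitDomains);
$ignored = isAllowedDomain('http://developer.blueshoes.org/', $limitDomains);
```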




Tags:

todo:  support "*.domain.com" (or is this supported using "domain.com"?)
access:  public

Type:   array


[ Top ]

$queryStringUrlLimit =  10

[line 155]

how many pages with the same url but different query string do we fetch?

  -1 = no limit
  10 = only the first 10 urls, then stop ...
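one way to enforce such a limit is a counter per url without its query string. this is a sketch under that assumption; allowByQueryStringLimit() and $seenCounts are hypothetical names.

```php
<?php
// Hypothetical limiter: counts urls that share the same part before "?".
function allowByQueryStringLimit(string $url, int $limit, array &$seenCounts): bool {
    $pieces = explode('?', $url, 2);
    if (count($pieces) < 2 || $limit === -1) {
        return true; // no query string, or no limit configured
    }
    $base  = $pieces[0];
    $count = isset($seenCounts[$base]) ? $seenCounts[$base] : 0;
    if ($count >= $limit) {
        return false; // limit reached for this base url
    }
    $seenCounts[$base] = $count + 1;
    return true;
}

$seen    = array();
$results = array();
for ($i = 1; $i <= 4; $i++) {
    // with a limit of 3 the fourth variation is rejected
    $results[] = allowByQueryStringLimit("http://www.example.com/list.php?page=$i", 3, $seen);
}
```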




    Tags:

    var:  (default is 10)
    access:  public

    Type:   int


    [ Top ]

    $refetchAfter =  10

    [line 176]

    don't refetch a page if it has been indexed lately. save traffic and cpu.

    10 means don't refetch for at least 10 days (default).




    Tags:

    access:  public

    Type:   int


    [ Top ]

    $registeredIndexCallback =

    [line 351]

    a function to call on every index() call.

    can be used to count the number of times index() has been called, to take the time, to stop the indexing process at some time, or whatever.

    the function will receive one parameter: a reference to this object (take it by ref!).
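a callback that counts index() calls might look like this. SketchEngine is a stand-in written only to make the example runnable, it is not Bs_Is_WebSearchEngine, and countIndexCalls() is a hypothetical function name.

```php
<?php
// Stand-in for the engine; only the callback mechanism is modelled here.
class SketchEngine {
    public $registeredIndexCallback = null; // name of a function, as a string
    public $indexCalls = 0;

    public function index(string $url): bool {
        if (is_string($this->registeredIndexCallback)
            && function_exists($this->registeredIndexCallback)) {
            $callback = $this->registeredIndexCallback;
            $callback($this); // the callback receives the engine object
        }
        return true;
    }
}

// Hypothetical callback: counts how often index() has been called.
function countIndexCalls($engine) {
    $engine->indexCalls++;
}

$engine = new SketchEngine();
$engine->registeredIndexCallback = 'countIndexCalls';
$engine->index('http://www.example.com/');
$engine->index('http://www.example.com/about.html');
```

the advice to take the parameter by reference dates from PHP 4, where objects were copied on assignment. from PHP 5 on, objects are passed by handle, so the callback above changes the original object without an explicit reference.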




    Tags:

    access:  public

    Type:   string


    [ Top ]

    $reindexIfUnchanged =  30

    [line 168]

    reindex even if the content of a page has not changed since the last time? default is 30, can be TRUE, or an int that is a number of days.

    for example 15 means that if the last spider date is at least 15 days ago then we'll do it, otherwise not. it makes sense to reindex from time to time because there could be new links to that page, which makes the page more or less important.
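the functionality is marked as todo, so the following only illustrates the rule described above. shouldReindexUnchanged() is a hypothetical name, timestamps are assumed to be unix timestamps, and any value other than TRUE or an int is assumed to mean "do not reindex".

```php
<?php
// Hypothetical decision helper for pages whose content has not changed.
function shouldReindexUnchanged($reindexIfUnchanged, int $lastSpidered, int $now): bool {
    if ($reindexIfUnchanged === true) {
        return true; // always reindex
    }
    if (!is_int($reindexIfUnchanged)) {
        return false; // assumption: other values disable reindexing
    }
    // reindex once the last spider date is at least that many days ago
    return ($now - $lastSpidered) >= $reindexIfUnchanged * 86400;
}

$now    = 1060000000;          // arbitrary fixed timestamp for the example
$recent = $now - 5 * 86400;    // spidered 5 days ago
$old    = $now - 20 * 86400;   // spidered 20 days ago
$skip   = shouldReindexUnchanged(15, $recent, $now);
$redo   = shouldReindexUnchanged(15, $old, $now);
```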




    Tags:

    todo:  code that functionality
    access:  public

    Type:   mixed


    [ Top ]

    $searchStyleBody =  '<li>__LINK_TITLE__<br>__DESCRIPTION__<br>__LINK_URL__<hr size=1 noshade></li>'

    [line 239]


    Type:   mixed


    [ Top ]

    $searchStyleFoot =  '</ol>'

    [line 240]


    Type:   mixed


    [ Top ]

    $searchStyleHead =  '__NUM_RESULTS_TOTAL__ Seiten gefunden.<br><br>__HINTS_STRING__<br><ol>'

    [line 238]


    Type:   mixed


    [ Top ]

    $stopWatch =

    [line 93]

    instance of Bs_StopWatch.

    gets created in init() for benchmarking, seeing where bottlenecks are, debugging etc.




    Tags:

    access:  public

    Type:   object


    [ Top ]

    $waitAfterIndex =  0

    [line 303]

    how many seconds to wait after indexing a page.

    useful for busy sites or servers, to avoid overloading them.




    Tags:

    var:  (default is 0)

    Type:   int


    [ Top ]

    $weightProperties = array(
          'domain'      => array('weight' => 50),
          'path'        => array('weight' => 100),
          'file'        => array('weight' => 100),
          'queryString' => array('weight' => 80),
          'title'       => array('weight' => 100),
          'description' => array('weight' => 60),
          'keywords'    => array('weight' => 40),
          'links'       => array('weight' => 100),
          'h1'          => array('weight' => 80),
          'h2'          => array('weight' => 70),
          'h3'          => array('weight' => 60),
          'h4'          => array('weight' => 50),
          'h5'          => array('weight' => 40),
          'h6'          => array('weight' => 30),
          'h7'          => array('weight' => 20),
          'h8'          => array('weight' => 10),
          'b'           => array('weight' => 10),
          'i'           => array('weight' => 8),
          'u'           => array('weight' => 8),
          'body'        => array('weight' => 5),
    )

    [line 274]

    weight properties for the different parts of the pages.

    'b' (bold) includes <strong>. 'links' are the links from other pages back to this one.

    the defaults are: 'domain' => array('weight' => 50), 'path' => array('weight' => 100), 'file' => array('weight' => 100), 'queryString' => array('weight' => 80), 'title' => array('weight' => 100), 'description' => array('weight' => 60), 'keywords' => array('weight' => 40), 'links' => array('weight' => 100), 'h1' => array('weight' => 80), 'h2' => array('weight' => 70), 'h3' => array('weight' => 60), 'h4' => array('weight' => 50), 'h5' => array('weight' => 40), 'h6' => array('weight' => 30), 'h7' => array('weight' => 20), 'h8' => array('weight' => 10), 'b' => array('weight' => 10), 'i' => array('weight' => 8), 'u' => array('weight' => 8), 'body' => array('weight' => 5),
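the source does not state how the weights are combined into a ranking. the sketch below assumes the simplest possible rule, a sum of occurrences times weight per page part, and rankWord() is a hypothetical name.

```php
<?php
// Assumption: ranking = sum over page parts of (occurrences * weight).
// This is an illustration only, not the formula used by the class.
function rankWord(array $occurrences, array $weightProperties): int {
    $score = 0;
    foreach ($occurrences as $part => $count) {
        if (isset($weightProperties[$part]['weight'])) {
            $score += $count * $weightProperties[$part]['weight'];
        }
    }
    return $score;
}

// a subset of the documented defaults
$weights = array(
    'title' => array('weight' => 100),
    'h1'    => array('weight' => 80),
    'b'     => array('weight' => 10),
    'body'  => array('weight' => 5),
);

// the word appears once in the title, once in an h1 and four times in the body
$score = rankWord(array('title' => 1, 'h1' => 1, 'body' => 4), $weights);
// 1*100 + 1*80 + 4*5 = 200
```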




    Tags:

    access:  public

    Type:   array


    [ Top ]

    $_httpClient =

    [line 70]


    Type:   mixed


    [ Top ]

    $_indexer =

    [line 66]


    Type:   mixed


    [ Top ]

    $_indexServer =

    [line 64]


    Type:   mixed


    [ Top ]

    $_searcher =

    [line 68]


    Type:   mixed


    [ Top ]



    Class Methods


    constructor Bs_Is_WebSearchEngine [line 357]

    Bs_Is_WebSearchEngine Bs_Is_WebSearchEngine( )

    constructor.



    [ Top ]

    method dropTodoStack [line 776]

    bool dropTodoStack( )

    drops all entries in the todo stack.



    Tags:

    return:  TRUE
    throws:  bs_exception
    access:  public


    [ Top ]

    method fetchLinksFromPageByUrl [line 1143]

    array fetchLinksFromPageByUrl( string $url)

    returns the links that are going away from the given url.

    note that the same url may appear more than once, often with a different caption.

    urls are compared case-insensitively here. this may be a problem on unix if two urls differ only in case, but that rarely happens.




    Tags:

    return:  (vector holding hashes with the keys 'href' and 'caption'. may be empty.)
    throws:  bs_exception
    access:  public


    Parameters:

    string   $url  

    [ Top ]

    method fetchLinksToPageByUrl [line 1118]

    array fetchLinksToPageByUrl( string $url)

    returns the links that are pointing to the given url.

    note that the same url may appear more than once, often with a different caption.

    urls are compared case-insensitively here. this may be a problem on unix if two urls differ only in case, but that rarely happens.




    Tags:

    return:  (vector holding hashes with the keys 'href' and 'caption'. may be empty.)
    throws:  bs_exception
    access:  public


    Parameters:

    string   $url  

    [ Top ]

    method fetchPageInfoById [line 878]

    mixed fetchPageInfoById( int $pageID)

    returns an array with the indexed information about a page.



    Tags:

    return:  (hash if found, NULL if not.)
    see:  Bs_Is_WebSearchEngine::fetchPageInfoByUrl()
    throws:  bs_exception
    access:  public


    Parameters:

    int   $pageID  

    [ Top ]

    method fetchPageInfoByUrl [line 859]

    mixed fetchPageInfoByUrl( string $url)

    returns an array with the indexed information about a page.

    the url is compared case-insensitively.




    Tags:

    return:  (hash if found, NULL if not.)
    see:  Bs_Is_WebSearchEngine::fetchPageInfoById()
    throws:  bs_exception
    access:  public


    Parameters:

    string   $url  

    [ Top ]

    method fetchPageList [line 896]

    array fetchPageList( )

    returns an array of all indexed pages.

    key is the internal ID, value is the url.




    Tags:

    return:  (hash)
    throws:  bs_exception
    access:  public


    [ Top ]

    method fetchWordsForPageByID [line 1329]

    array fetchWordsForPageByID( int $pageID, [string $order = 'caption'])

    returns word information about the given webpage.

    the returned vector holds hashes with the keys: caption, wordID, ranking
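the shape of the returned vector and the effect of $order can be illustrated as follows. orderWords() is a hypothetical helper, the sample data is made up, and sorting 'ranking' from highest to lowest is an assumption.

```php
<?php
// Hypothetical helper: orders a vector of word hashes by caption or ranking.
function orderWords(array $words, string $order = 'caption'): array {
    usort($words, function ($a, $b) use ($order) {
        if ($order === 'ranking') {
            return $b['ranking'] <=> $a['ranking']; // assumption: highest ranking first
        }
        return strcasecmp($a['caption'], $b['caption']);
    });
    return $words;
}

// made-up sample in the documented format (caption, wordID, ranking)
$words = array(
    array('caption' => 'search', 'wordID' => 7, 'ranking' => 120),
    array('caption' => 'engine', 'wordID' => 3, 'ranking' => 300),
    array('caption' => 'index',  'wordID' => 5, 'ranking' => 45),
);
$byCaption = array_column(orderWords($words), 'caption');
$byRanking = array_column(orderWords($words, 'ranking'), 'caption');
```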




    Tags:

    return:  (vector holding hashes, see above)
    access:  public


    Parameters:

    int   $pageID  
    string   $order   (default is 'caption', can also be 'ranking'.)

    [ Top ]

    method getExternalLinks [line 1161]

    array getExternalLinks( )

    returns a list with the external links.



    Tags:

    return:  (hash where key is the ID, val is a hash with the keys urlFrom, urlTo and caption.)
    throws:  bs_exception
    access:  public


    [ Top ]

    method getPageID [line 822]

    mixed getPageID( string $url)

    returns the pageID for the given url.



    Tags:

    return:  (int page id > 0 if found, NULL if not found.)
    throws:  bs_exception


    Parameters:

    string   $url  

    [ Top ]

    method getPageUrl [line 839]

    mixed getPageUrl( int $pageID)

    returns the url for the given pageID.



    Tags:

    return:  (string page url if found, NULL if not found.)
    throws:  bs_exception


    Parameters:

    int   $pageID  

    [ Top ]

    method index [line 497]

    bool index( string $url, [string $follow = FALSE])

    indexes the url/page specified.



    Tags:

    return:  TRUE
    throws:  bs_exception
    access:  public


    Parameters:

    string   $url  
    string   $follow   (if we should follow links and index them as well, default is FALSE.)

    [ Top ]

    method init [line 366]

    void init( mixed $profileName)



    Tags:

    throws:  bs_exception


    [ Top ]

    method isIgnoredUrl [line 1016]

    bool isIgnoredUrl( string $url, [bool $forIndexing = TRUE])

    tells if the given url is ignored or not (depending on the current settings).

    param $forIndexing: when we want to know whether the url is ignored for putting it into the todo queue, the check works slightly differently. some things can only be decided definitely when the page is actually indexed, because a lot can happen between seeing the url and indexing it.




    Tags:

    access:  public


    Parameters:

    string   $url  
    bool   $forIndexing   (default is TRUE, read above)

    [ Top ]

    method loadTodoStack [line 737]

    bool loadTodoStack( )

    loads the persisted stack into the current stack.



    Tags:

    return:  TRUE
    throws:  bs_exception
    access:  public


    [ Top ]

    method persistTodoStack [line 758]

    bool persistTodoStack( )

    persists the todo-stack so we can work on it later (continue/resume).
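where and how the class stores the stack is not documented here. the sketch below assumes a serialized array in a file, similar in spirit to Bs_Object::persist(). persistStackSketch() and loadStackSketch() are hypothetical names.

```php
<?php
// Hypothetical file-based persistence of a todo stack (array of urls).
function persistStackSketch(array $todoStack, string $file): bool {
    return file_put_contents($file, serialize($todoStack)) !== false;
}

function loadStackSketch(string $file): array {
    $raw = @file_get_contents($file);
    if ($raw === false) {
        return array(); // nothing persisted yet
    }
    $stack = unserialize($raw);
    return is_array($stack) ? $stack : array();
}

$file  = tempnam(sys_get_temp_dir(), 'todo');
$stack = array('http://www.example.com/a.html', 'http://www.example.com/b.html');
persistStackSketch($stack, $file);   // pause the crawl here ...
$restored = loadStackSketch($file);  // ... and resume later
unlink($file);
```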



    Tags:

    return:  TRUE
    throws:  bs_exception
    access:  public


    [ Top ]

    method prune [line 1355]

    bool prune( )

    similar to drop() but only removes the content, not the profile itself.



    Tags:

    return:  TRUE
    throws:  bs_exception
    access:  public


    [ Top ]

    method search [line 449]

    string search( string $searchString, [int $limit = 10], [int $offset = 0])

    performs a search for the given $searchString.

    the returned html is built from the style templates, which may contain these placeholders:

    in $this->searchStyleHead: __TIME_TAKEN__, __NUM_RESULTS_TOTAL__

    in $this->searchStyleBody: __DESCRIPTION__, __TITLE__, __URL__, __LINK_TITLE__, __LINK_URL__
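how such templates could be filled is sketched below with strtr(). renderResults() is a hypothetical function, only the placeholder names come from this page, and the meaning given to __LINK_TITLE__ and __LINK_URL__ (title or url wrapped in a hyperlink) is an assumption.

```php
<?php
// Hypothetical renderer: substitutes placeholders in head and body templates.
function renderResults(array $results, string $head, string $body, string $foot): string {
    $html = strtr($head, array('__NUM_RESULTS_TOTAL__' => (string) count($results)));
    foreach ($results as $row) {
        $url   = htmlspecialchars($row['url']);
        $title = htmlspecialchars($row['title']);
        // strtr() with an array replaces all placeholders in one pass
        $html .= strtr($body, array(
            '__TITLE__'       => $title,
            '__URL__'         => $url,
            '__DESCRIPTION__' => htmlspecialchars($row['description']),
            '__LINK_TITLE__'  => '<a href="' . $url . '">' . $title . '</a>',
            '__LINK_URL__'    => '<a href="' . $url . '">' . $url . '</a>',
        ));
    }
    return $html . $foot;
}

$html = renderResults(
    array(array('url' => 'http://www.example.com/', 'title' => 'Home', 'description' => 'Start page')),
    '__NUM_RESULTS_TOTAL__ pages found.<ol>',
    '<li>__LINK_TITLE__<br>__DESCRIPTION__</li>',
    '</ol>'
);
```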




    Tags:

    return:  (html code)
    access:  public


    Parameters:

    string   $searchString   (the user-submitted query)
    int   $limit   (default is 10)
    int   $offset   (default is 0)

    [ Top ]

    method setDbByDsn [line 416]

    bool setDbByDsn( array $dsn)



    Tags:

    return:  TRUE
    throws:  bs_exception
    access:  public


    Parameters:

    array   $dsn  

    [ Top ]

    method setDbByObj [line 404]

    void setDbByObj( object &$bsDb)

    gives this class a db connection.



    Tags:

    access:  public


    Parameters:

    object   &$bsDb  

    [ Top ]

    method validateExternalLinks [line 1282]

    string validateExternalLinks( )

    validates all foreign links, and shows information about doing so.



    Tags:

    return:  (html code)
    throws:  bs_exception
    access:  public


    [ Top ]


    Documentation generated on Mon, 29 Dec 2003 21:11:35 +0100 by phpDocumentor 1.2.3