blueshoes php application framework and cms            applications_websearchengine

Class: Bs_Wse_Profile

Source Location: /applications/websearchengine/Bs_Wse_Profile.class.php

Class Overview

Bs_Object
   |
   --Bs_Wse_Profile

WebSearchEngine Profile class.


Author(s):

Version:

  • 4.0.$id$

Copyright:

  • blueshoes.org

Variables

Methods


Inherited Variables

Inherited Methods

Class: Bs_Object

Bs_Object::Bs_Object()
Bs_Object::getErrors()
Basic error handling: Get *all* errors as string array from the global Bs_Error-error stack.
Bs_Object::getLastError()
Basic error handling: Get last error string from the global Bs_Error-error stack.
Bs_Object::getLastErrors()
Basic error handling: Get last errors string array from the global Bs_Error-error stack since the last call of getLastErrors().
Bs_Object::persist()
Persists this object by serializing it and saving it to a file with unique name.
Bs_Object::setError()
Basic error handling: Push an error string on the global Bs_Error-error stack.
Bs_Object::toHtml()
Dumps the content of this object to a string using PHP's var_dump().
Bs_Object::toString()
Dumps the content of this object to a string using PHP's var_dump().
Bs_Object::unpersist()
Fetches an object that was persisted with persist()

Class Details

[line 21]
WebSearchEngine Profile class.

used to load/persist and work with wse profiles.

dependencies: XPath, Bs_Is_Profile




Tags:

copyright:  blueshoes.org
pattern:  singleton: (pseudostatic)
access:  public
version:  4.0.$id$
author:  andrej arn <at blueshoes dot org>


[ Top ]


Class Variables

$allowIgnore =

[line 215]

urls can be set to be used or ignored.

whatever matches first in this array decides whether it's a go or a no-go. if nothing matches then $queryStringUrlLimit will still be looked at.

before, we had the vars $allowUrls and $ignoreUrls. $allowUrls came first, then $ignoreUrls was looked at. this was a problem because ignore - allow - ignore combinations are needed. this var solves that problem and deprecates the other vars. the evaluation order is now: $allowIgnore, $allowUrls, $ignoreUrls

structure: vector holding hashes with the keys 'value', 'type' and 'ignore'. possible options for 'type' are:
  'part'  => this string occurs in the url. this is the default.
  'parti' => same as 'part' but the comparison is case-insensitive.
  'file'  => the filename of the requested page equals this.
  'preg'  => perl style regular expression.
  'ereg'  => better use 'preg'.

for 'file' the comparison is made case-insensitive when working on windows hosts. (todo)

example definition:
  $allowIgnore = array(
    array('value'=>'?goto=',     'type'=>'part', 'ignore'=>TRUE),
    array('value'=>'/foo/',      'type'=>'part', 'ignore'=>FALSE),
    array('value'=>'/\.*foo\.*', 'type'=>'preg', 'ignore'=>FALSE),
    array('value'=>'forum.php',  'type'=>'file', 'ignore'=>TRUE),
  );

real-world example (using the definitions from above):
  http://www.blueshoes.org/forum.php              => ignore
  http://www.blueshoes.org/foo/forum.php          => use
  http://www.blueshoes.org/foo/forum.php?goto=foo => ignore

example for ignoring everything but urls that start with "/tomjones":
  cheap version:
    array('value'=>'/tomjones', 'type'=>'part', 'ignore'=>FALSE),
    array('value'=>'/',         'type'=>'part', 'ignore'=>TRUE),
  strong version: TODO: CODE SUPPORT FOR EREG AND PREG

if you want to exclude all directories then use 'value'=>'', 'type'=>'file', so that urls with an empty file part (meaning: no file part) will be ignored. untested feature.

note that instead of ignoring urls that have a "?" (question mark) here you can fine-tune $this->queryStringUrlLimit.
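
the first-match rule described above can be sketched as follows. this is only an illustration, not the code of Bs_Wse_Profile: the function name wseFirstMatchIgnore() is hypothetical, 'ereg' is left out, and the file part is assumed to be everything after the last slash of the url path.

```php
<?php
// Hypothetical helper (not part of Bs_Wse_Profile).
// Returns TRUE if the url should be ignored, FALSE if it should be used,
// and NULL if no rule matched (the caller would then fall through to
// the other checks such as $queryStringUrlLimit).
function wseFirstMatchIgnore($url, $rules) {
    $path  = (string) parse_url($url, PHP_URL_PATH);
    // Assumption: the "file" part is whatever follows the last slash.
    $slash = strrpos($path, '/');
    $file  = ($slash === false) ? $path : substr($path, $slash + 1);

    foreach ($rules as $rule) {
        $type = isset($rule['type']) ? $rule['type'] : 'part'; // 'part' is the default
        switch ($type) {
            case 'part':
                $hit = (strpos($url, $rule['value']) !== false);
                break;
            case 'parti':
                $hit = (stripos($url, $rule['value']) !== false);
                break;
            case 'file':
                $hit = ($file === $rule['value']);
                break;
            case 'preg':
                $hit = (preg_match($rule['value'], $url) === 1);
                break;
            default:
                $hit = false; // unknown or unsupported type: never matches
        }
        if ($hit) {
            return (bool) $rule['ignore']; // the first match decides
        }
    }
    return null;
}

$rules = array(
    array('value' => '?goto=',    'type' => 'part', 'ignore' => true),
    array('value' => '/foo/',     'type' => 'part', 'ignore' => false),
    array('value' => 'forum.php', 'type' => 'file', 'ignore' => true),
);
var_dump(wseFirstMatchIgnore('http://www.blueshoes.org/foo/forum.php', $rules));
```

because the order of the rules decides the outcome, the most specific rules should come first.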




Tags:

access:  public

Type:   array


[ Top ]

$allowUrls = array()

[line 224]

!!!deprecated use $allowIgnore !!!

same as $ignoreUrls but urls that match here will always go through. they don't respect the $queryStringUrlLimit limit.




Tags:

deprecated:  use $allowIgnore

Type:   mixed


[ Top ]

$categories =

[line 277]

categories make it possible to search only a part of the website.

for example a user might want to search only the 'catalog' or 'shop' of your website.

the data structure is similar to the one of $ignoreUrls but has the additional key 'category'.

example:
  $categories = array(
    array('value'=>'/shop/',     'type'=>'part', 'category'=>'shop'),
    array('value'=>'forum.php',  'type'=>'file', 'category'=>'forum'),
    array('value'=>'/\.*foo\.*', 'type'=>'preg', 'category'=>'foo'),
  );

the first match will define the category. if there is no match (or that feature is not used at all) then the category value will be an empty string.
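
a minimal sketch of the category lookup, under the assumption that only 'part' and 'file' are evaluated. wseCategoryFor() is a hypothetical name and is not the code behind getCategoryForUrl().

```php
<?php
// Hypothetical helper (not part of Bs_Wse_Profile).
// Returns the category of the first matching rule, or '' if nothing matches.
function wseCategoryFor($url, $categories) {
    $path  = (string) parse_url($url, PHP_URL_PATH);
    // Assumption: the "file" part is whatever follows the last slash.
    $slash = strrpos($path, '/');
    $file  = ($slash === false) ? $path : substr($path, $slash + 1);

    foreach ($categories as $rule) {
        $type = isset($rule['type']) ? $rule['type'] : 'part';
        if ($type === 'part' && strpos($url, $rule['value']) !== false) {
            return $rule['category'];
        }
        if ($type === 'file' && $file === $rule['value']) {
            return $rule['category'];
        }
        // other types ('preg', 'ereg') are skipped in this sketch
    }
    return '';
}

$categories = array(
    array('value' => '/shop/',    'type' => 'part', 'category' => 'shop'),
    array('value' => 'forum.php', 'type' => 'file', 'category' => 'forum'),
);
echo wseCategoryFor('http://www.example.com/shop/item.php', $categories), "\n";
```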




Tags:

access:  public

Type:   array


[ Top ]

$detectDublicatePages =  TRUE

[line 94]

detects duplicate pages even if they have different urls.

uses an md5 hash to do so. this TAKES TIME, 5-60 seconds depending on cpu and page size.
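
the idea can be sketched like this. it is an assumption that the hash is taken over the whole page content and kept in an in-memory map; the real class may store it elsewhere. wseFindDuplicate() is a hypothetical name.

```php
<?php
// Hypothetical helper (not part of Bs_Wse_Profile).
// $seen maps md5(content) => url of the first page that had this content.
// Returns the url of the earlier duplicate, or NULL if the content is new.
function wseFindDuplicate($url, $content, &$seen) {
    $hash = md5($content);
    if (isset($seen[$hash])) {
        return $seen[$hash];   // same content was already seen under another url
    }
    $seen[$hash] = $url;       // remember the first url for this content
    return null;
}

$seen = array();
wseFindDuplicate('http://www.example.com/a.php', 'same body', $seen);
var_dump(wseFindDuplicate('http://www.example.com/b.php', 'same body', $seen));
```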




Tags:

access:  public

Type:   bool


[ Top ]

$ignoreFileExtensions = array('zip', 'tar', 'tgz', 'gz', 'bz2', 'pdb', 'chm')

[line 286]

links to urls with such file endings will be ignored, no matter what.

even $allowUrls won't make a difference. use lowercase or it won't work.
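
a sketch of such an extension check, assuming the extension of the url is lowercased before it is compared (which would explain why the list itself has to be lowercase). wseHasIgnoredExtension() is a hypothetical name.

```php
<?php
// Hypothetical helper (not part of Bs_Wse_Profile).
// Returns TRUE if the url path ends in one of the ignored file extensions.
function wseHasIgnoredExtension($url, $ignoreFileExtensions) {
    $path = (string) parse_url($url, PHP_URL_PATH);
    // Lowercase the extension found in the url; the list is expected to be lowercase.
    $ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return $ext !== '' && in_array($ext, $ignoreFileExtensions, true);
}

$ignore = array('zip', 'tar', 'tgz', 'gz', 'bz2', 'pdb', 'chm');
var_dump(wseHasIgnoredExtension('http://www.example.com/files/Archive.ZIP', $ignore));
```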




Tags:

access:  public

Type:   array


[ Top ]

$ignoreUrls = array()

[line 254]

!!!deprecated use $allowIgnore !!!

(parts of) urls defined here will be excluded from indexing. vector holding hashes with the keys 'value' and 'type'. possible options for 'type' are:
  'part' => this string occurs in the url. this is the default.
  'file' => the filename of the requested page equals this.

example:
  $ignoreUrls = array(
    array('value'=>'/foo/',     'type'=>'part'),
    array('value'=>'forum.php', 'type'=>'file'),
  );

if you want to exclude all directories then use 'value'=>'', 'type'=>'file', so that urls with an empty file part (meaning: no file part) will be ignored. untested feature.

note that instead of ignoring urls that have a "?" (question mark) here you can fine-tune $this->queryStringUrlLimit.

for 'part' and 'file' the comparisons are made case-insensitive when working on windows hosts. (todo)




Tags:

deprecated:  use $allowIgnore
access:  public

Type:   array


[ Top ]

$ignoreUrlsWithPass =  TRUE

[line 114]

urls with a pass part will be ignored.

scheme://user:pass@host:port/path?query#fragment
              ^




Tags:

var:  (default is TRUE)
todo:  implement this
access:  public

Type:   bool


[ Top ]

$ignoreUrlsWithUser =  TRUE

[line 104]

urls with a user part will be ignored.

scheme://user:pass@host:port/path?query#fragment
         ^
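
since this check is still marked as todo, here is one way it could look, using PHP's parse_url(). the function name wseHasCredentials() is hypothetical; its two flags mirror $ignoreUrlsWithUser and $ignoreUrlsWithPass.

```php
<?php
// Hypothetical helper (not part of Bs_Wse_Profile).
// Returns TRUE if the url should be ignored because of its user or pass part.
function wseHasCredentials($url, $ignoreUrlsWithUser = true, $ignoreUrlsWithPass = true) {
    $parts = parse_url($url);
    if ($parts === false) {
        return false; // seriously malformed url: nothing to decide here
    }
    if ($ignoreUrlsWithUser && isset($parts['user'])) {
        return true;
    }
    if ($ignoreUrlsWithPass && isset($parts['pass'])) {
        return true;
    }
    return false;
}

var_dump(wseHasCredentials('http://user:pass@www.example.com:8080/path?query#fragment'));
```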




Tags:

var:  (default is TRUE)
todo:  implement this
access:  public

Type:   bool


[ Top ]

$indexIframes =  1

[line 85]

should iframes be indexed?
  0 => no
  1 => yes, as part of the surrounding page. use the body of the iframe page and add it to the body of the "parent" page. this is the default, and makes sense.
  2 => yes, as standalone page. this feature is not implemented yet, and i won't do it unless i see a use for it.




Tags:

access:  public

Type:   int


[ Top ]

$limitDomains =

[line 161]

the domains we use. found urls that don't use one of these will be ignored.

example: $limitDomains = array('www.blueshoes.org') means that "developer.blueshoes.org" and "blueshoes.org" urls will be ignored.

if this is not set then we will only allow the domain of the url we get first in index().
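
a sketch of the exact-match behaviour from the example above. wseDomainAllowed() is a hypothetical name, and lowercasing the host before the comparison is an assumption.

```php
<?php
// Hypothetical helper (not part of Bs_Wse_Profile).
// Returns TRUE only if the host of the url is listed literally in $limitDomains.
function wseDomainAllowed($url, $limitDomains) {
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // no host part, e.g. a relative url
    }
    return in_array(strtolower($host), $limitDomains, true);
}

$limitDomains = array('www.blueshoes.org');
var_dump(wseDomainAllowed('http://developer.blueshoes.org/index.php', $limitDomains));
```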




Tags:

todo:  support "*.domain.com" (or is this supported using "domain.com"?)
access:  public

Type:   array


[ Top ]

$profileName =

[line 64]

the unique name of this profile.



Tags:

access:  public

Type:   string


[ Top ]

$queryStringUrlLimit =  10

[line 126]

how many pages with the same url but different query string do we fetch?

-1 = no limit

10 = only first 10 urls, then stop ...
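
a sketch of how such a limit could be counted. it assumes that urls without a query string are never limited and that the counter is kept per url-without-query-string; wseQueryStringAllowed() is a hypothetical name.

```php
<?php
// Hypothetical helper (not part of Bs_Wse_Profile).
// $counts maps the url without query string => number of variants accepted so far.
function wseQueryStringAllowed($url, $limit, &$counts) {
    $qpos = strpos($url, '?');
    if ($qpos === false) {
        return true;              // assumption: no query string, no limit
    }
    if ($limit < 0) {
        return true;              // -1 = no limit
    }
    $base = substr($url, 0, $qpos);
    $n    = isset($counts[$base]) ? $counts[$base] : 0;
    if ($n >= $limit) {
        return false;             // limit reached for this base url
    }
    $counts[$base] = $n + 1;
    return true;
}

$counts = array();
var_dump(wseQueryStringAllowed('http://www.example.com/list.php?page=1', 10, $counts));
```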




    Tags:

    var:  (default is 10)
    access:  public

    Type:   int


    [ Top ]

    $refetchAfter =  10

    [line 147]

    don't refetch a page if it has been indexed lately. save traffic and cpu.

    10 means don't refetch for at least 10 days (default).
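
as a sketch, the check comes down to comparing two unix timestamps. wseShouldRefetch() is a hypothetical name, and a day is taken as 86400 seconds.

```php
<?php
// Hypothetical helper (not part of Bs_Wse_Profile).
// Returns TRUE if the page was last indexed at least $refetchAfterDays days ago.
function wseShouldRefetch($lastIndexedTs, $nowTs, $refetchAfterDays = 10) {
    return ($nowTs - $lastIndexedTs) >= $refetchAfterDays * 86400;
}

$now = time();
var_dump(wseShouldRefetch($now - 3 * 86400, $now)); // indexed 3 days ago
```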




    Tags:

    access:  public

    Type:   int


    [ Top ]

    $reindexIfUnchanged =  30

    [line 139]

    reindex even if the content of a page has not changed since the last time? default is 30, can be TRUE, or an int that is a number of days.

    for example 15 means that if the last spider date is at least 15 days ago then we'll do it, otherwise not. it makes sense to reindex from time to time because there could be new links to that page, which makes the page more or less important.
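
this functionality is still marked as todo. a possible sketch, assuming that TRUE means "always reindex" and that any value that is neither TRUE nor an int means "never" (the latter is not documented). wseShouldReindexUnchanged() is a hypothetical name.

```php
<?php
// Hypothetical helper (not part of Bs_Wse_Profile).
// $reindexIfUnchanged is TRUE or a number of days; timestamps are unix timestamps.
function wseShouldReindexUnchanged($reindexIfUnchanged, $lastSpiderTs, $nowTs) {
    if ($reindexIfUnchanged === true) {
        return true;              // always reindex, even if unchanged
    }
    if (!is_int($reindexIfUnchanged)) {
        return false;             // assumption: anything else means "never"
    }
    return ($nowTs - $lastSpiderTs) >= $reindexIfUnchanged * 86400;
}

$now = time();
var_dump(wseShouldReindexUnchanged(15, $now - 20 * 86400, $now));
```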




    Tags:

    todo:  code that functionality
    access:  public

    Type:   mixed


    [ Top ]

    $useDescription =  1

    [line 387]

    similar to var $useKeywords so read there.

    if not used then the first 250 chars of the content will be used.
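
the fallback can be sketched like this. wseDescriptionFor() is a hypothetical name; trimming the text first and counting bytes rather than multibyte characters are assumptions of this sketch.

```php
<?php
// Hypothetical helper (not part of Bs_Wse_Profile).
// Uses the meta description if there is one, else the first 250 chars of the content.
function wseDescriptionFor($metaDescription, $bodyText) {
    $meta = trim((string) $metaDescription);
    if ($meta !== '') {
        return $meta;
    }
    return substr(trim($bodyText), 0, 250); // byte-based; not multibyte-aware
}

echo wseDescriptionFor('', 'welcome to the shop. here you find ...'), "\n";
```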




    Tags:

    see:  var $useKeywords
    access:  public

    Type:   int


    [ Top ]

    $useKeywords =  1

    [line 377]

    should the meta keywords be used or not.

    0 = not at all
    1 = only if not used yet in any other page
    2 = only if different than on frontpage
    3 = yes

    you must have a good reason to set this to 3. altavista is the only search engine nowadays (2002/2003) that still makes use of the meta keywords. google and all the others don't. and i see why. why should a word be important if it's not even in the content/body? there are rare cases. using them only blows up the index and delivers bad results because some webmaster thought he'd put some keywords into every page.

    if the keywords are well-picked and different for each page then it may make sense. so that's the default setting for now.




    Tags:

    see:  var $useDescription
    access:  public

    Type:   int


    [ Top ]

    $waitAfterIndex =  0

    [line 354]

    how many seconds to wait after indexing a page.

    useful for busy sites or servers, to avoid overloading them.




    Tags:

    var:  (default is 0)
    todo:  this seems to not work. check this.
    access:  public

    Type:   int


    [ Top ]

    $weightProperties = array(
          'domain'      => array('weight' => 50),
          'path'        => array('weight' => 100),
          'file'        => array('weight' => 100),
          'queryString' => array('weight' => 80),
          'title'       => array('weight' => 100),
          'description' => array('weight' => 40),
          'keywords'    => array('weight' => 5),
          'links'       => array('weight' => 100),
          'h1'          => array('weight' => 80),
          'h2'          => array('weight' => 70),
          'h3'          => array('weight' => 60),
          'h4'          => array('weight' => 50),
          'h5'          => array('weight' => 40),
          'h6'          => array('weight' => 30),
          'h7'          => array('weight' => 20),
          'h8'          => array('weight' => 10),
          'b'           => array('weight' => 10),
          'i'           => array('weight' => 8),
          'u'           => array('weight' => 8),
          'body'        => array('weight' => 5),
          'image'       => array('weight' => 5),
    )

    [line 322]

    weight properties for the different parts of the pages.

    'b' (bold) includes <strong>. 'links' are the links from other pages back to this one. 'image' means the text of the alt and title attributes.

    the defaults are:
      'domain' => array('weight' => 50), 'path' => array('weight' => 100), 'file' => array('weight' => 100), 'queryString' => array('weight' => 80),
      'title' => array('weight' => 100), 'description' => array('weight' => 40), 'keywords' => array('weight' => 5), 'links' => array('weight' => 100),
      'h1' => array('weight' => 80), 'h2' => array('weight' => 70), 'h3' => array('weight' => 60), 'h4' => array('weight' => 50),
      'h5' => array('weight' => 40), 'h6' => array('weight' => 30), 'h7' => array('weight' => 20), 'h8' => array('weight' => 10),
      'b' => array('weight' => 10), 'i' => array('weight' => 8), 'u' => array('weight' => 8), 'body' => array('weight' => 5), 'image' => array('weight' => 5),
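
this page does not say how the weights are combined into a ranking. purely as an illustration, the sketch below assumes a simple sum of "hits in a part times the weight of that part". wseScore() and the $hitsPerPart structure are hypothetical.

```php
<?php
// Hypothetical helper (not part of Bs_Wse_Profile).
// $hitsPerPart maps a page part ('title', 'body', ...) => number of hits of a word.
// Assumption: the score is the sum of hits * weight; parts without a weight count 0.
function wseScore($hitsPerPart, $weightProperties) {
    $score = 0;
    foreach ($hitsPerPart as $part => $hits) {
        if (isset($weightProperties[$part]['weight'])) {
            $score += $hits * $weightProperties[$part]['weight'];
        }
    }
    return $score;
}

$weights = array(
    'title' => array('weight' => 100),
    'h1'    => array('weight' => 80),
    'body'  => array('weight' => 5),
);
echo wseScore(array('title' => 1, 'body' => 3), $weights), "\n";
```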




    Tags:

    access:  public

    Type:   array


    [ Top ]

    $_APP =

    [line 26]

    obvious.


    Type:   mixed


    [ Top ]



    Class Methods


    constructor Bs_Wse_Profile [line 393]

    Bs_Wse_Profile Bs_Wse_Profile( )

    constructor.



    [ Top ]

    method checkDbTables [line 648]

    bool checkDbTables( )

    checks that the needed db tables exist and are up-to-date.

    the needed changes will be made automatically. note that your user needs the appropriate rights (alter, create, index...)

    hint: first try your query, if it fails check the table using this method. if this method returns FALSE then try your query again.
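
the hint above can be sketched as a small retry wrapper. the two callbacks are placeholders: $runQuery stands for your own db query (assumed here to return FALSE on failure) and $checkTables for a call to checkDbTables(). wseQueryWithTableCheck() is a hypothetical name.

```php
<?php
// Hypothetical helper (not part of Bs_Wse_Profile).
// Runs the query; if it fails, checks the tables and retries once if they were changed.
function wseQueryWithTableCheck($runQuery, $checkTables) {
    $result = call_user_func($runQuery);
    if ($result !== false) {
        return $result;                    // query worked on the first try
    }
    // checkDbTables() returns FALSE if changes have/had to be made.
    if (call_user_func($checkTables) === false) {
        return call_user_func($runQuery);  // tables were fixed, so try again
    }
    return false;                          // tables were ok, the query is the problem
}
```

with a loaded profile in $profile, the second argument could be array($profile, 'checkDbTables').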




    Tags:

    return:  (TRUE if table was ok, FALSE if changes have/had to be made.)
    todo:  all
    throws:  bs_exception
    access:  public


    [ Top ]

    method create [line 492]

    bool create( string $profileName, string $wseXml, string $isXml)

    creates a new index.



    Tags:

    return:  TRUE
    throws:  bs_exception
    access:  public


    Parameters:

    string   $profileName  
    string   $wseXml   (xml string for WebSearchEngine profile.)
    string   $isXml   (xml string for IndexServer profile.)

    [ Top ]

    method drop [line 519]

    bool drop( string $profileName)

    drops an existing index.



    Tags:

    return:  (TRUE if the index existed and was dropped, FALSE if it did not exist.)
    throws:  bs_exception
    access:  public


    Parameters:

    string   $profileName  

    [ Top ]

    method getCategoryForUrl [line 611]

    string getCategoryForUrl( string $url, array $urlParsed)

    tells the defined category for the given url.

    if no category matches or the feature is not used then an empty string is returned.




    Tags:

    return:  (may be empty)
    todo:  finish code with 'preg' and 'ereg'
    see:  var $this->categories
    access:  public


    Parameters:

    string   $url  
    array   $urlParsed   (use Bs_Url->parseUrlExtended($url) to get it.)

    [ Top ]

    method getIndexDbObj [line 438]

    Bs_Db &getIndexDbObj( )

    returns a ref to the db obj used for indexing/searching.



    Tags:

    return:  instance of Bs_Db
    access:  public


    [ Top ]

    method getProfileName [line 595]

    string getProfileName( )

    returns the profile name.



    Tags:

    access:  public


    [ Top ]

    method load [line 452]

    bool load( string $profileName)

    loads the profile specified.



    Tags:

    return:  (TRUE on success, FALSE if profile does not exist.)
    todo:  finish code.
    throws:  bs_exception
    access:  public


    Parameters:

    string   $profileName  

    [ Top ]

    method prune [line 560]

    bool prune( string $profileName)

    similar to drop() but only removes the content, not the profile itself.



    Tags:

    return:  (TRUE if the index existed and was pruned, FALSE if it did not exist.)
    todo:  check if profile exists, and return FALSE if it does not instead of throwing an exception.
    since:  bs4.3
    throws:  bs_exception
    access:  public


    Parameters:

    string   $profileName  

    [ Top ]

    method reset [line 582]

    void reset( )

    resets the object vars to use this object for a new index.



    Tags:

    access:  public


    [ Top ]

    method setDbByDsn [line 421]

    bool setDbByDsn( array $dsn)

    gives this class a db connection to load/store profiles.



    Tags:

    return:  TRUE
    see:  $this->setDbByObj()
    throws:  bs_exception
    access:  public


    Parameters:

    array   $dsn  

    [ Top ]

    method setDbByObj [line 407]

    void setDbByObj( object &$bsDb)

    gives this class a db connection to load/store profiles.



    Tags:

    see:  $this->setDbByDsn()
    access:  public


    Parameters:

    object   &$bsDb  

    [ Top ]


    Documentation generated on Mon, 29 Dec 2003 21:13:22 +0100 by phpDocumentor 1.2.3