blueshoes php application framework and cms            plugins_indexserver

Class: Bs_Is_IndexServer

Source Location: /plugins/indexserver/Bs_Is_IndexServer.class.php

Class Overview

Bs_Object
   |
   --Bs_Is_IndexServer

Index Server Class.


Author(s):

Version:

  • 4.3.$Revision: 1.5 $ $Date: 2003/09/08 05:17:37 $

Copyright:

  • blueshoes.org

Variables

Methods


Inherited Variables

Inherited Methods

Class: Bs_Object

Bs_Object::Bs_Object()
Bs_Object::getErrors()
Basic error handling: Get *all* errors as string array from the global Bs_Error-error stack.
Bs_Object::getLastError()
Basic error handling: Get last error string from the global Bs_Error-error stack.
Bs_Object::getLastErrors()
Basic error handling: Get the last errors as a string array from the global Bs_Error-error stack since the last call of getLastErrors().
Bs_Object::persist()
Persists this object by serializing it and saving it to a file with unique name.
Bs_Object::setError()
Basic error handling: Push an error string on the global Bs_Error-error stack.
Bs_Object::toHtml()
Dumps the content of this object to a string using PHP's var_dump().
Bs_Object::toString()
Dumps the content of this object to a string using PHP's var_dump().
Bs_Object::unpersist()
Fetches an object that was persisted with persist()

Class Details

[line 104]
Index Server Class.

Base class for all indexing and searching. Acts as a clearing house for profiles, indexers and searchers. Offers utility methods. Is a singleton.

What do you need this indexserver for?

The strength of this package is that it can index all sorts of things (websites, file systems, files, database tables, ...). You feed the indexer with information, and later you can query it.

Features:

  • Boolean search operators like + - (and, not).
  • Search for "fixed names" (e.g. "Bill Gates") using right neighbors.
  • Stemming, metaphone, soundex, fast part-word searches like foo*, foo*bar and *foo.
  • Weighting (kinda good).
  • Foreign keys (for db-related indexes).
  • Stopword lists (multilingual).
  • Settings via xml (only the basics are implemented so far).
  • For db's: auto-calculated settings from naming conventions and table structures (table scanning).
  • Returns hints ("Did you mean xy") after a search.
  • Can index: strings, arrays, db tables, text files; built-in mime-type handlers for html, pdf, doc and xls.
  • Automatic creation of the internal (MySQL) database tables.
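The phonetic and part-word features listed above can be illustrated with PHP's built-in soundex() and metaphone() functions. This is a minimal sketch, not the index server's code: phoneticMatch() and wildcardToRegex() are hypothetical helper names, and how the real searcher resolves foo* queries is not documented here.

```php
<?php
// Hypothetical helpers, not part of Bs_Is_IndexServer.

// Two words match phonetically if their soundex or metaphone keys agree.
function phoneticMatch(string $a, string $b): bool
{
    return soundex($a) === soundex($b) || metaphone($a) === metaphone($b);
}

// Translate a part-word query such as "foo*", "foo*bar" or "*foo"
// into an anchored regular expression.
function wildcardToRegex(string $query): string
{
    $parts = array_map(
        fn(string $p): string => preg_quote($p, '/'),
        explode('*', $query)
    );
    return '/^' . implode('.*', $parts) . '$/';
}

var_dump(phoneticMatch('Robert', 'Rupert'));                   // bool(true)
var_dump(preg_match(wildcardToRegex('foo*bar'), 'foobazbar')); // int(1)
var_dump(preg_match(wildcardToRegex('*foo'), 'foobar'));       // int(0)
```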

How does the weighting work?

After finding results for keywords it is very important to order them by relevance. To achieve this, the different parts of the content are weighted differently.

1) Weight points can be given for different parts of the content that gets indexed. For different data types (db's, html) there are default weight properties. Examples:
  • The words in the title of a website are more important than the words in the body.
  • A CHAR(20) db field is more important than a BLOB. Foreign key fields are even less important.
2) A count is maintained for each word, so we know whether a word is special or common for your application. 'madonna' may be a special word if you're indexing the world, but if you're indexing a db about madonna songs then it's different.
3) If a word is used 30 times in a text with 1'000 words, then it's more important than a word that's used once in 10'000 words.
4) Long words are considered more special, and thus are more important when searching.
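The four factors can be combined into a single relevance score. The formula below is only an illustrative assumption (the documentation does not state the actual one): relevanceScore() and all of its parameters are hypothetical, and the factors are simply multiplied.

```php
<?php
// Hypothetical scoring sketch; the real index server formula is not documented.
function relevanceScore(
    float $partWeight,  // 1) weight points of the content part (title, body, ...)
    int $occurrences,   // 3) how often the word occurs in this text
    int $wordsInText,   // 3) total number of words in this text
    int $globalCount,   // 2) how often the word occurs in the whole index
    int $globalWords,   // 2) total number of words in the whole index
    string $word        // 4) longer words count as more special
): float {
    $density = $occurrences / max(1, $wordsInText);
    $rarity  = log(1 + $globalWords / max(1, $globalCount));
    $length  = 1 + strlen($word) / 10;

    return $partWeight * $density * $rarity * $length;
}

// 30 hits in 1'000 words versus 1 hit in 10'000 words, all else equal.
$dense  = relevanceScore(1.0, 30, 1000, 50, 100000, 'madonna');
$sparse = relevanceScore(1.0, 1, 10000, 50, 100000, 'madonna');
var_dump($dense > $sparse); // bool(true)
```

Multiplying the factors keeps each rule independent: raising any single factor raises the score, which matches the ordering described in points 1 to 4.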

todo:

  • replace hardcoded german stemmer, use language detection
  • timed indexing using cron/at (when cpu is low)
  • extend/replace default stopword lists
  • need some multi-level normalizing of characters, especially for the german ä/ö/ü, because now they become a/o/u, not ae/oe/ue. words like "kindergarten-lehrer" (kindergarten teacher) are treated as one word, as if the dash were not there. not sure if that's a good thing. another idea would be to split them, or to index both forms (split and together), but then we'd have a problem with the right neighbors (and weighting).

stopwords: (aka noise words) if you change the stopword lists, you'd theoretically need to reindex everything. of course you don't want that, but if you change a lot, you might consider it. wrong search results may be delivered otherwise.

naming:

  • rnbs stands for "right neighbors".

note: a profile name needs to be globally unique.

rtfm:

  • porter stemming - http://www.tartarus.org/~martin/PorterStemmer/ http://snowball.sourceforge.net/
  • lancaster stemming - http://www.comp.lancs.ac.uk/computing/research/stemming/
  • stemming - http://www.scit.wlv.ac.uk/seed/docs/mypapers/stemalg.html
  • soundex - soundex.doc
  • metaphone - http://www.lanw.com/java/phonetic/
  • double metaphone - http://swoodbridge.com/DoubleMetaPhone/




Tags:

copyright:  blueshoes.org
pattern:  singleton
version:  4.3.$Revision: 1.5 $ $Date: 2003/09/08 05:17:37 $
author:  andrej arn <at blueshoes dot org>




Class Variables

$Bs_HtmlUtil =

[line 118]

reference to global pseudostatic htmlutil class.



Tags:

access:  public

Type:   object



$Bs_String =

[line 111]

reference to global pseudostatic string class.



Tags:

access:  public

Type:   object





Class Methods


constructor Bs_Is_IndexServer [line 179]

Bs_Is_IndexServer Bs_Is_IndexServer( )

constructor




method cacheStopWords [line 348]

void cacheStopWords( [mixed $lang = null])

preloads stop words.



Tags:

see:  var $_stopWords, Bs_Is_IndexServer::isStopWord()
access:  public


Parameters:

mixed   $lang   (string or array, if not specified then all available languages will be preloaded.)


method cleanString [line 458]

string cleanString( string $string)

convert original string to internal usable string.

1) convert to lowercase
2) convert "job-<br>sharing" to "jobsharing" (special case)
3) keep words apart that are separated only by a tag, so that "hello<br>world" does not become "helloworld" once the tags are stripped (the br tag is just an example, done for all tags)
4) strip_tags()
5) Bs_HtmlUtil->htmlEntitiesUndo() (&auml; becomes ä)
6) Bs_String->normalize() (ä becomes ae)
7) "car-market" -> "carmarket" etc.
8) replace everything that's not a-z 0-9 with a space
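The steps above can be sketched with standard PHP functions. This is an approximation under stated assumptions, not the BlueShoes source: cleanStringSketch() is a hypothetical name, html_entity_decode() stands in for Bs_HtmlUtil->htmlEntitiesUndo(), and a small strtr() table stands in for Bs_String->normalize().

```php
<?php
// Hypothetical re-implementation of the eight documented steps.
function cleanStringSketch(string $s): string
{
    $s = strtolower($s);                                        // 1) lowercase
    $s = preg_replace('/-\s*<br\s*\/?>\s*/', '', $s);           // 2) "job-<br>sharing" -> "jobsharing"
    $s = str_replace('<', ' <', $s);                            // 3) space before every tag
    $s = strip_tags($s);                                        // 4) remove the tags
    $s = html_entity_decode($s, ENT_QUOTES | ENT_HTML401, 'UTF-8'); // 5) &auml; -> ä
    // 6) stand-in for Bs_String->normalize(); the mapping is an assumption
    $s = strtr($s, ['ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss']);
    $s = preg_replace('/(?<=[a-z0-9])-(?=[a-z0-9])/', '', $s);  // 7) "car-market" -> "carmarket"
    return preg_replace('/[^a-z0-9]/', ' ', $s);                // 8) everything else -> space
}

$clean = cleanStringSketch('Hello<br>World, J&auml;ger car-market job-<br>sharing');
print_r(preg_split('/ +/', $clean, -1, PREG_SPLIT_NO_EMPTY));
```

Step 8 replaces each character individually, so the result can contain runs of spaces; the usage line splits on them to show the individual words.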




Tags:



Parameters:

string   $string  


method cleanStringChunkSentence [line 505]

array cleanStringChunkSentence( string $string)

convert original string to internal usable string, and chunk sentences into array.

same as cleanString() but returns an array (sentence by sentence) instead of a string. the returned vector may have empty elements.
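A rough sketch of the sentence-by-sentence variant follows. How the real method detects sentence boundaries is not documented, so splitting on . ! ? is an assumption; chunkSentences() and simplifiedClean() are hypothetical names, and simplifiedClean() is a reduced stand-in for cleanString().

```php
<?php
// Reduced stand-in for cleanString(): lowercase, then replace everything
// that is not a-z 0-9 with a space, and trim the result.
function simplifiedClean(string $s): string
{
    return trim(preg_replace('/[^a-z0-9]/', ' ', strtolower($s)));
}

// Split on sentence-ending punctuation first, then clean each chunk.
// Splitting has to happen before cleaning, because cleaning removes
// the punctuation that marks the sentence boundaries.
function chunkSentences(string $text): array
{
    $sentences = preg_split('/[.!?]+/', $text);
    return array_map('simplifiedClean', $sentences);
}

print_r(chunkSentences('Hello world. How are you?'));
```

The trailing question mark produces an empty last element, which is one way the returned vector can end up with empty elements.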




Tags:

return:  (vector)
since:  bs4.5
see:  Bs_Is_IndexServer::cleanString()
access:  public


Parameters:

string   $string  


method cleanWord [line 569]

mixed cleanWord( string $word, [int $minLength = 3], [int $maxLength = 30], [bool $returnError = FALSE])

clean word

1) trim
2) check minlength
3) check maxlength, maybe cut
4) at least one letter, not only numbers (i don't think i want to keep that ... 2003-02-13 --andrej) yep: deactivated in bs4.5 2003-08-27

param $returnError: if set to FALSE, the method returns the cleaned word as a string, or bool FALSE if the word was "cleaned" to nothing. if set to TRUE, the failure case changes: instead of FALSE the method returns a vector with 1 element (string) that tells why the word was cleaned to nothing. the 2 possible reasons are:

  • "length" (shorter than the minLength)
  • "numeric" ("word" was numbers only, no alpha char)
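The return convention can be sketched as follows. cleanWordSketch() is a hypothetical re-implementation of steps 1 to 3 only; it assumes that "maybe cut" means truncating to $maxLength, and it leaves out the "numeric" check because that step is described as deactivated in bs4.5.

```php
<?php
// Hypothetical sketch of the documented steps, not the BlueShoes source.
function cleanWordSketch(string $word, int $minLength = 3, int $maxLength = 30, bool $returnError = false)
{
    $word = trim($word);                          // 1) trim

    if (strlen($word) < $minLength) {             // 2) check minlength
        // Cleaned to nothing: give the reason as a one-element vector,
        // or plain FALSE when no reason was asked for.
        return $returnError ? ['length'] : false;
    }

    if (strlen($word) > $maxLength) {             // 3) check maxlength
        $word = substr($word, 0, $maxLength);     // assumed meaning of "maybe cut"
    }

    return $word;
}

var_dump(cleanWordSketch('  indexing '));     // string(8) "indexing"
var_dump(cleanWordSketch('ab'));              // bool(false)
var_dump(cleanWordSketch('ab', 3, 30, true)); // array with the single element "length"
```

Callers should compare the result with === FALSE (or is_array() when $returnError is TRUE) rather than relying on truthiness, since a word such as "0" is a valid non-empty string that PHP treats as falsy.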




Tags:

return:  (string or bool FALSE if $returnError is FALSE; string or array if $returnError is TRUE.)
access:  public


Parameters:

string   $word  
int   $minLength   (word-min-length, default is 3)
int   $maxLength   (word-max-length, default is 30)
bool   $returnError   (default is FALSE, see above.)


method getIndexer [line 288]

&object &getIndexer( string $profileName)

returns a reference to the indexer for the given profile name.



Tags:

throws:  bool FALSE if profile not loaded.
access:  public


Parameters:

string   $profileName  


method getProfile [line 324]

&object &getProfile( string $profileName)

returns a reference to the profile for the given profile name.



Tags:

throws:  bool FALSE if profile not loaded.
access:  public


Parameters:

string   $profileName  


method getSearcher [line 306]

&object &getSearcher( string $profileName)

returns a reference to the searcher for the given profile name.



Tags:

throws:  bool FALSE if profile not loaded.
access:  public


Parameters:

string   $profileName  


method getStem [line 411]

string getStem( string $word, [string $lang = ''])

returns the stem of the given word.

needs the stem extension. if not available then an empty string will be returned.




Tags:

return:  (may be empty)
throws:  string (empty string if not known/not capable.)
access:  public


Parameters:

string   $word  
string   $lang   (language of the given word.)


method isStopWord [line 385]

bool isStopWord( string $word, [string $lang = null])

tells if the given word is a stopword.

if $lang is not specified then the word will be checked with all loaded languages.
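Since the method is marked "todo: implement code", the following is only one possible shape for a per-language stop word cache and lookup. StopWordCache and its load() method are hypothetical, and the word lists are tiny samples rather than the shipped stopword lists.

```php
<?php
// Hypothetical stand-in for the cacheStopWords()/isStopWord() pair.
final class StopWordCache
{
    /** @var array<string, array<string, bool>> language => set of words */
    private array $stopWords = [];

    // Preload the list for one language; words become array keys
    // so that a lookup is a single isset() call.
    public function load(string $lang, array $words): void
    {
        $this->stopWords[$lang] = array_fill_keys(array_map('strtolower', $words), true);
    }

    // With no $lang, check the word against every loaded language.
    public function isStopWord(string $word, ?string $lang = null): bool
    {
        $word  = strtolower($word);
        $langs = $lang === null ? array_keys($this->stopWords) : [$lang];

        foreach ($langs as $l) {
            if (isset($this->stopWords[$l][$word])) {
                return true;
            }
        }
        return false; // not a stop word, or we can't tell (language not loaded)
    }
}

$cache = new StopWordCache();
$cache->load('en', ['the', 'and']);
$cache->load('de', ['und', 'der']);
var_dump($cache->isStopWord('und'));       // bool(true)
var_dump($cache->isStopWord('und', 'en')); // bool(false)
```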




Tags:

return:  (TRUE if it is, FALSE if it's not or we can't tell.)
todo:  implement code.
see:  var $_stopWords, Bs_Is_IndexServer::cacheStopWords()
access:  public


Parameters:

string   $word  
string   $lang  


method setProfile [line 270]

void setProfile( object &$profile)

adds the given profile into the clearing house.



Tags:

access:  public


Parameters:

object   &$profile   (instance of Bs_Is_Profile.)
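The "clearing house" idea behind setProfile() and the get*() methods can be illustrated with a small registry. DemoProfile and DemoIndexServer are hypothetical stand-ins written in current PHP; how the real singleton is obtained and how Bs_Is_Profile is constructed are not covered by this page, so nothing here should be read as the BlueShoes API.

```php
<?php
// Hypothetical stand-ins that only illustrate the registry pattern.
final class DemoProfile
{
    public string $name;

    public function __construct(string $name)
    {
        $this->name = $name;
    }
}

final class DemoIndexServer
{
    private static ?DemoIndexServer $instance = null;

    /** @var array<string, DemoProfile> profiles keyed by their unique name */
    private array $profiles = [];

    private function __construct() {}

    // Singleton access: always the same instance.
    public static function getInstance(): DemoIndexServer
    {
        return self::$instance ??= new DemoIndexServer();
    }

    public function setProfile(DemoProfile $profile): void
    {
        $this->profiles[$profile->name] = $profile;
    }

    /** @return DemoProfile|false FALSE if the profile is not loaded */
    public function getProfile(string $profileName)
    {
        return $this->profiles[$profileName] ?? false;
    }
}

$server = DemoIndexServer::getInstance();
$server->setProfile(new DemoProfile('articles'));

$profile = $server->getProfile('missing');
if ($profile === false) {
    echo "profile not loaded\n";
}
```

Keying the registry by profile name is why a profile name has to be globally unique: a second profile with the same name would replace the first.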



Documentation generated on Mon, 29 Dec 2003 21:11:30 +0100 by phpDocumentor 1.2.3