Class: Bs_Is_IndexServer
Source Location: /plugins/indexserver/Bs_Is_IndexServer.class.php
Bs_Object
|
--Bs_Is_IndexServer
Index Server Class.
Author(s):
Version:
- 4.3.$Revision: 1.5 $ $Date: 2003/09/08 05:17:37 $
Inherited Variables
Inherited Methods
Class Details
[line 104]
Index Server Class.
Base class for all indexing and searching. Acts as a clearing house for profiles, indexers and searchers. Offers utility methods. Is a singleton.

What do you need this indexserver for?
The strength of this package is that it can index all sorts of things (websites, file systems, files, database tables, ...). You feed the indexer with information, and later you can query it.

Features:
- Boolean search operators like + and - (and, not).
- Search for "fixed names" (e.g. "Bill Gates") using right neighbors.
- Stemming, metaphone, soundex, and fast part-word searches like foo*, foo*bar and *foo.
- Weighting of results (works fairly well).
- Foreign keys (for db-related indexes).
- Stopword lists (multilingual).
- Settings via XML (only the basics are implemented so far).
- For databases: settings are auto-calculated from naming conventions and table structures (table scanning).
- Returns hints "Did you mean xy" after a search.
- Indexes different kinds of input:
  - strings,
  - arrays,
  - db tables,
  - text files,
  - built-in mime-type handlers for html, pdf, doc and xls
- Automatic creation of the internal (MySQL) database tables.
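
To illustrate the "fixed names" feature, here is a minimal sketch of how a right-neighbor table could answer a query like "Bill Gates". It is written in Python for brevity and is not the package's PHP code; the function names and the in-memory layout are assumptions made for this example, and the real package keeps its data in MySQL tables.

```python
from collections import defaultdict

def build_right_neighbors(docs):
    """Map word -> right neighbor -> set of doc ids (hypothetical layout)."""
    rnbs = defaultdict(lambda: defaultdict(set))
    for doc_id, text in docs.items():
        words = text.lower().split()
        # Record every pair of directly adjacent words.
        for left, right in zip(words, words[1:]):
            rnbs[left][right].add(doc_id)
    return rnbs

def search_fixed_name(rnbs, phrase):
    """Return ids of docs that contain every adjacent word pair of the phrase."""
    words = phrase.lower().split()
    if len(words) < 2:
        raise ValueError("a fixed name needs at least two words")
    result = None
    for left, right in zip(words, words[1:]):
        hits = rnbs.get(left, {}).get(right, set())
        result = set(hits) if result is None else result & hits
    return result

docs = {
    1: "Bill Gates founded Microsoft",
    2: "The gates of Bill",
}
rnbs = build_right_neighbors(docs)
print(search_fixed_name(rnbs, "Bill Gates"))  # {1}
```

Document 2 contains both words but not next to each other in that order, so it is not returned. For phrases of three or more words this pairwise check is only an approximation, because the pairs could occur at different positions in the same document.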

How does the weighting work?
After finding results for keywords, it is very important to order them by relevance. To achieve this, different parts of the content are weighted differently.
1) Weight points can be given for different parts of the content that gets indexed. For different data types (db's, html) there are default weight properties. Examples:
   - The words in the title of a website are more important than the words in the body.
   - A CHAR(20) db field is more important than a BLOB. Foreign key fields are even less important.
2) A count is maintained for each word, so we know whether a word is special or common in your application. 'madonna' may be a special word if you're indexing the world, but if you're indexing a db about madonna songs, it is a common one.
3) If a word is used 30 times in a text of 1'000 words, it is more important than a word that is used once in 10'000 words.
4) Long words are considered more special, and thus are more important when searching.

todo:
- replace hardcoded German stemmer, use language detection
- timed indexing using cron/at (when cpu is low)
- extend/replace default stopword lists
- need some multi-level normalizing of characters, especially with the German ä/ö/ü, because now they become a/o/u, not ae/oe/ue.
- words like "kindergarten-lehrer" (kindergarten teacher) are treated as one word, as if the dash were not there. Not sure if that's a good thing. Another idea would be to split, or to index both (split and together), but then we'd have a problem with the right neighbors (and weighting).
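
The four weighting factors described earlier (content part, word rarity, frequency within the text, word length) can be combined into a single score. The following Python sketch shows one possible way to do that. The formula, the weight values and all names are assumptions chosen for illustration; the page does not state how the package actually combines these factors.

```python
import math

# Hypothetical weight points per content part; not the package's real defaults.
FIELD_WEIGHTS = {"title": 10.0, "body": 1.0}

def word_score(word, field, count_in_doc, doc_length, docs_with_word, total_docs):
    """Score one word occurrence group; higher means more relevant."""
    # 1) Words in important parts (e.g. a title) earn more weight points.
    field_weight = FIELD_WEIGHTS[field]
    # 2) Words that appear in few documents are "special" for this index.
    rarity = math.log(1 + total_docs / docs_with_word)
    # 3) 30 hits in 1'000 words count more than 1 hit in 10'000 words.
    density = count_in_doc / doc_length
    # 4) Long words are considered more special than short ones.
    length_bonus = 1 + len(word) / 10
    return field_weight * rarity * density * length_bonus

in_title = word_score("madonna", "title", 1, 100, 5, 1000)
in_body = word_score("madonna", "body", 1, 100, 5, 1000)
print(in_title > in_body)  # True
```

Multiplying the factors keeps each rule independent: changing one input moves the score in the direction the corresponding rule describes, while the others stay fixed.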

stopwords (aka noise words):
If you change the stopword lists, you would theoretically need to reindex everything. Of course you don't want that, but if you change a lot, you should consider it; otherwise wrong search results may be delivered.

naming:
- rnbs stands for "right neighbors".
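
The stopword note above can be made concrete with a small Python sketch that shows why an index built with an old stopword list can return wrong results under a new one. All function names and the in-memory index are assumptions for this example and do not reflect the package's PHP API.

```python
def strip_stopwords(text, stopwords):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in stopwords]

def build_index(docs, stopwords):
    """Build word -> set of doc ids, skipping stopwords."""
    index = {}
    for doc_id, text in docs.items():
        for word in strip_stopwords(text, stopwords):
            index.setdefault(word, set()).add(doc_id)
    return index

def search(index, query, stopwords):
    """AND search: every non-stopword query term must match."""
    terms = strip_stopwords(query, stopwords)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))

docs = {1: "the who live"}
old_stopwords = {"the", "who"}
new_stopwords = {"the"}  # "who" is no longer treated as noise

stale_index = build_index(docs, old_stopwords)
print(search(stale_index, "the who", new_stopwords))  # set()

fresh_index = build_index(docs, new_stopwords)
print(search(fresh_index, "the who", new_stopwords))  # {1}
```

With the stale index the word "who" was never stored, so the query misses document 1 until everything is reindexed with the new list.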

note:
A profile name needs to be globally unique.

rtfm:
- porter stemming - http://www.tartarus.org/~martin/PorterStemmer/ http://snowball.sourceforge.net/
- lancaster stemming - http://www.comp.lancs.ac.uk/computing/research/stemming/
- stemming - http://www.scit.wlv.ac.uk/seed/docs/mypapers/stemalg.html
- soundex - soundex.doc
- metaphone - http://www.lanw.com/java/phonetic/
- double metaphone - http://swoodbridge.com/DoubleMetaPhone/
Tags:
Class Variables
Class Methods