$allowUrls = array()
[line 227]
same format as $ignoreUrls, but urls that match here will always go through.
they are not subject to the $queryStringUrlLimit limit.
$Bs_Url =
[line 54]
reference to global pseudostatic instance.
gets set in the constructor.
$detectDublicatePages = TRUE
[line 115]
detects duplicate pages even if they have different urls.
uses an md5 hash of the page to do so. this TAKES TIME: 5-60 seconds, depending on cpu and page size.
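A minimal sketch of md5-based duplicate detection, assuming the page body is hashed as a whole. The helper name findDuplicateUrl() and the $seenHashes store are hypothetical and not part of the class.

```php
<?php
// Hypothetical helper: remembers one md5 fingerprint per page body.
// Returns the url that first delivered the same body, or null if the body is new.
function findDuplicateUrl(array &$seenHashes, $url, $body) {
    $hash = md5($body);
    if (isset($seenHashes[$hash])) {
        return $seenHashes[$hash];
    }
    $seenHashes[$hash] = $url;
    return null;
}

$seen   = array();
$first  = findDuplicateUrl($seen, 'http://www.example.com/a.html', '<html>same</html>');
$second = findDuplicateUrl($seen, 'http://www.example.com/b.html', '<html>same</html>');
```

Here $first is null (new content) and $second names the url that was seen first with the same body.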
$ignoreFileExtensions = array('zip', 'tar', 'tgz', 'gz', 'bz2', 'pdb', 'chm')
[line 235]
links to urls with these file extensions will be ignored, no matter what.
even $allowUrls won't override this.
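The extension check could look roughly like the sketch below. hasIgnoredExtension() is a hypothetical helper, and comparing the extension case-insensitively is an assumption, not something the class documents.

```php
<?php
// Hypothetical helper: true if the url's path ends in one of the ignored extensions.
function hasIgnoredExtension($url, array $ignoreFileExtensions) {
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false; // no path part, or the url could not be parsed
    }
    // assumption: extensions are compared in lower case
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return $ext !== '' && in_array($ext, $ignoreFileExtensions, true);
}

$ignoreFileExtensions = array('zip', 'tar', 'tgz', 'gz', 'bz2', 'pdb', 'chm');
$skipArchive = hasIgnoredExtension('http://www.example.com/files/manual.tar.gz?x=1', $ignoreFileExtensions);
$skipPage    = hasIgnoredExtension('http://www.example.com/index.php', $ignoreFileExtensions);
```

Because this check wins over everything else, it would run before any $allowUrls or $ignoreUrls test.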
$ignoreUrls = array()
[line 221]
(parts of) urls defined here will be excluded from indexing.
vector holding hashes with the keys 'value' and 'type'. possible options for 'type' are:
'part' => this string occurs in the url. this is the default.
'file' => the filename of the requested page equals this.
'preg' => perl style regular expression.
'ereg' => better use preg.
example:
$ignoreUrls = array(
    array('value'=>'/foo/', 'type'=>'part'),
    array('value'=>'forum.php', 'type'=>'file'),
    array('value'=>'/\.*foo\.*/', 'type'=>'preg'),
);
if you want to exclude directories, use 'value'=>'', 'type'=>'file'. urls with an empty file part (meaning: no file part) will then be ignored. untested feature.
note that instead of ignoring urls that contain a "?" (question mark) here, you can fine-tune $this->queryStringUrlLimit.
for 'part' and 'file' the comparisons will be made case-insensitive when working on windows hosts. (todo)
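The rule types above can be sketched as a small matcher. urlMatchesRule() and isIgnoredUrl() are hypothetical names. Treating everything after the last "/" of the path as the file part is an assumption, and 'ereg' is left out because the ereg functions were removed in PHP 7.

```php
<?php
// Hypothetical matcher for one rule hash with the keys 'value' and 'type'.
function urlMatchesRule($url, array $rule) {
    $type = isset($rule['type']) ? $rule['type'] : 'part'; // 'part' is the documented default
    switch ($type) {
        case 'part':
            return strpos($url, $rule['value']) !== false;
        case 'file':
            // assumption: the file part is whatever follows the last "/" of the path
            $path  = (string) parse_url($url, PHP_URL_PATH);
            $slash = strrpos($path, '/');
            $file  = ($slash === false) ? $path : (string) substr($path, $slash + 1);
            return $file === $rule['value'];
        case 'preg':
            return preg_match($rule['value'], $url) === 1;
        default:
            return false; // unknown or unsupported type (including 'ereg')
    }
}

// True as soon as one rule matches.
function isIgnoredUrl($url, array $ignoreUrls) {
    foreach ($ignoreUrls as $rule) {
        if (urlMatchesRule($url, $rule)) {
            return true;
        }
    }
    return false;
}

$rules = array(
    array('value' => '/private/',  'type' => 'part'),
    array('value' => 'forum.php',  'type' => 'file'),
    array('value' => '/\.bak$/',   'type' => 'preg'),
);
```

The same matcher would serve for $allowUrls, since it uses the same rule format.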
$ignoreUrlsWithPass = TRUE
[line 143]
urls with a pass part will be ignored.
scheme://user:pass@host:port/path?query#fragment (the "pass" part, after the colon)
$ignoreUrlsWithUser = TRUE
[line 133]
urls with a user part will be ignored.
scheme://user:pass@host:port/path?query#fragment (the "user" part, before the colon)
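Both credential checks can be sketched with PHP's parse_url(), which returns the 'user' and 'pass' components when present. hasIgnoredCredentials() is a hypothetical helper; its two flags mirror $ignoreUrlsWithUser and $ignoreUrlsWithPass.

```php
<?php
// Hypothetical helper: true if the url should be skipped because of its user or pass part.
function hasIgnoredCredentials($url, $ignoreUrlsWithUser = true, $ignoreUrlsWithPass = true) {
    $parts = parse_url($url);
    if ($parts === false) {
        return false; // seriously malformed url: nothing to check here
    }
    if ($ignoreUrlsWithUser && isset($parts['user'])) {
        return true;
    }
    if ($ignoreUrlsWithPass && isset($parts['pass'])) {
        return true;
    }
    return false;
}

$withBoth = hasIgnoredCredentials('ftp://tom:secret@www.example.com/file.html');
$userOnly = hasIgnoredCredentials('http://tom@www.example.com/', false, true);
$plain    = hasIgnoredCredentials('http://www.example.com/');
```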
$indexIframes = 1
[line 106]
should iframes be indexed?
0 => no
1 => yes, as part of the surrounding page. the body of the iframe page is added to the body of the "parent" page. this is the default, and makes sense.
2 => yes, as standalone page. this feature is not implemented yet, and i won't do it unless i see a use for it.
$limitDomains =
[line 190]
the allowed domains. found urls that don't use one of these will be ignored.
example: $limitDomains = array('www.blueshoes.org') means that "developer.blueshoes.org" and "blueshoes.org" urls will be ignored.
if this is not set then only the domain of the first url given to index() is allowed.
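As the example shows, hosts are compared as a whole, so subdomains and the bare domain do not match. A minimal sketch, with isAllowedDomain() as a hypothetical helper and lower-casing of host names as an assumption:

```php
<?php
// Hypothetical helper: exact host comparison against the allowed domains.
function isAllowedDomain($url, array $limitDomains) {
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // no host part
    }
    // assumption: host names are compared in lower case
    return in_array(strtolower($host), array_map('strtolower', $limitDomains), true);
}

$limitDomains = array('www.blueshoes.org');
$okWww       = isAllowedDomain('http://www.blueshoes.org/index.html', $limitDomains);
$okDeveloper = isAllowedDomain('http://developer.blueshoes.org/', $limitDomains);
$okBare      = isAllowedDomain('http://blueshoes.org/', $limitDomains);
```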
$queryStringUrlLimit = 10
[line 155]
how many pages with the same url but different query string do we fetch?
-1 = no limit
10 = fetch only the first 10 such urls, then stop.
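One way to picture the limit is a counter per url without its query string. mayFetchQueryUrl() and the $counts store are hypothetical, and letting urls without a "?" pass unconditionally is an assumption.

```php
<?php
// Hypothetical helper: decides whether another query-string variant of a url may be fetched.
function mayFetchQueryUrl(array &$counts, $url, $queryStringUrlLimit = 10) {
    $pos = strpos($url, '?');
    if ($pos === false) {
        return true; // assumption: the limit only applies to urls with a query string
    }
    if ($queryStringUrlLimit === -1) {
        return true; // -1 = no limit
    }
    $base = substr($url, 0, $pos);
    $seen = isset($counts[$base]) ? $counts[$base] : 0;
    if ($seen >= $queryStringUrlLimit) {
        return false;
    }
    $counts[$base] = $seen + 1;
    return true;
}

$counts = array();
$a = mayFetchQueryUrl($counts, 'http://www.example.com/page.php?id=1', 2);
$b = mayFetchQueryUrl($counts, 'http://www.example.com/page.php?id=2', 2);
$c = mayFetchQueryUrl($counts, 'http://www.example.com/page.php?id=3', 2);
```

With a limit of 2, the third variant of page.php is refused.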
$refetchAfter = 10
[line 176]
don't refetch a page if it has been indexed lately. save traffic and cpu.
10 means don't refetch for at least 10 days (default).
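The day-based check amounts to a timestamp comparison. isRefetchDue() is a hypothetical helper working on unix timestamps; treating exactly $refetchAfter days as "due" is an assumption.

```php
<?php
// Hypothetical helper: true once at least $refetchAfter days have passed since the last indexing.
function isRefetchDue($lastIndexedTimestamp, $nowTimestamp, $refetchAfter = 10) {
    $secondsPerDay = 86400;
    return ($nowTimestamp - $lastIndexedTimestamp) >= $refetchAfter * $secondsPerDay;
}

$now      = 1000000000;
$tooEarly = isRefetchDue($now - 9 * 86400, $now);
$due      = isRefetchDue($now - 10 * 86400, $now);
```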
$registeredIndexCallback =
[line 351]
a function to call on every index() call.
can be used to count the number of times index() has been called, to take the time, to stop the indexing process at some time, or whatever.
the function will receive one parameter: a reference to this object (take it by ref!).
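A sketch of how such a callback might be used to count calls and stop the run. SpiderSketch is a stand-in class, not the real one, and assigning the function name directly to $registeredIndexCallback is an assumption about how registration works. The by-reference parameter matters on PHP 4; from PHP 5 on, objects are passed as handles, so a plain parameter sees the same object.

```php
<?php
// Stand-in for the spider class; only the callback handling is modelled here.
class SpiderSketch {
    public $registeredIndexCallback = null;
    public $indexCalls = 0;          // hypothetical counter maintained by the callback
    public $stopRequested = false;   // hypothetical flag the callback may set

    public function index($url) {
        if ($this->registeredIndexCallback !== null) {
            call_user_func($this->registeredIndexCallback, $this);
        }
        if ($this->stopRequested) {
            return false;
        }
        // ... fetching and indexing of $url would happen here ...
        return true;
    }
}

// Callback: counts the index() calls and asks the spider to stop after two pages.
function countAndLimit($spider) {
    $spider->indexCalls++;
    if ($spider->indexCalls > 2) {
        $spider->stopRequested = true;
    }
}

$spider = new SpiderSketch();
$spider->registeredIndexCallback = 'countAndLimit';
$results = array(
    $spider->index('http://www.example.com/1.html'),
    $spider->index('http://www.example.com/2.html'),
    $spider->index('http://www.example.com/3.html'),
);
```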
$reindexIfUnchanged = 30
[line 168]
reindex even if the content of a page has not changed since the last time? default is 30. can be TRUE, or an int that is a number of days.
for example, 15 means that the page is reindexed if the last spider date is at least 15 days ago, otherwise not. it makes sense to reindex from time to time because there could be new links to that page, which makes the page more or less important.
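Since the setting is either TRUE or a number of days, the decision could be sketched as below. shouldReindexUnchanged() is hypothetical, and treating any other value as "never" is an assumption the documentation does not confirm.

```php
<?php
// Hypothetical helper for a page whose content has not changed since the last spidering.
function shouldReindexUnchanged($reindexIfUnchanged, $lastSpiderTimestamp, $nowTimestamp) {
    if ($reindexIfUnchanged === true) {
        return true; // TRUE: always reindex
    }
    if (is_int($reindexIfUnchanged)) {
        $ageInDays = ($nowTimestamp - $lastSpiderTimestamp) / 86400;
        return $ageInDays >= $reindexIfUnchanged;
    }
    return false; // assumption: any other value means "do not reindex"
}

$now     = 1000000000;
$oldPage = shouldReindexUnchanged(15, $now - 20 * 86400, $now);
$newPage = shouldReindexUnchanged(15, $now - 3 * 86400, $now);
$always  = shouldReindexUnchanged(true, $now, $now);
```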
$searchStyleBody = '<li>__LINK_TITLE__<br>__DESCRIPTION__<br>__LINK_URL__<hr size=1 noshade></li>'
[line 239]
$searchStyleFoot = '</ol>'
[line 240]
$searchStyleHead = '__NUM_RESULTS_TOTAL__ Seiten gefunden.<br><br>__HINTS_STRING__<br><ol>'
[line 238]
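The three templates presumably get their __PLACEHOLDER__ tokens replaced when results are printed. The sketch below shows one way to do that with strtr(); renderResults() and the result keys 'title', 'description' and 'url' are hypothetical, and the real class may fill the tokens differently (for example with complete link tags).

```php
<?php
// Hypothetical renderer: head once, body once per result, then foot.
function renderResults(array $results, $hints, $head, $body, $foot) {
    $out = strtr($head, array(
        '__NUM_RESULTS_TOTAL__' => (string) count($results),
        '__HINTS_STRING__'      => $hints,
    ));
    foreach ($results as $r) {
        $out .= strtr($body, array(
            '__LINK_TITLE__'  => $r['title'],
            '__DESCRIPTION__' => $r['description'],
            '__LINK_URL__'    => $r['url'],
        ));
    }
    return $out . $foot;
}

$html = renderResults(
    array(array('title' => 'T', 'description' => 'D', 'url' => 'U')),
    'H',
    'N=__NUM_RESULTS_TOTAL__ __HINTS_STRING__<ol>',
    '<li>__LINK_TITLE__|__DESCRIPTION__|__LINK_URL__</li>',
    '</ol>'
);
```

strtr() with an array replaces all tokens in one pass, so a title that happens to contain another token is not substituted a second time.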
$stopWatch =
[line 93]
instance of Bs_StopWatch.
gets created in init() for benchmarking, seeing where bottlenecks are, debugging etc.
$waitAfterIndex = 0
[line 303]
how many seconds to wait after indexing a page.
useful for busy sites or servers, to avoid overloading them.
$weightProperties = array(
    'domain'      => array('weight' => 50),
    'path'        => array('weight' => 100),
    'file'        => array('weight' => 100),
    'queryString' => array('weight' => 80),
    'title'       => array('weight' => 100),
    'description' => array('weight' => 60),
    'keywords'    => array('weight' => 40),
    'links'       => array('weight' => 100),
    'h1'          => array('weight' => 80),
    'h2'          => array('weight' => 70),
    'h3'          => array('weight' => 60),
    'h4'          => array('weight' => 50),
    'h5'          => array('weight' => 40),
    'h6'          => array('weight' => 30),
    'h7'          => array('weight' => 20),
    'h8'          => array('weight' => 10),
    'b'           => array('weight' => 10),
    'i'           => array('weight' => 8),
    'u'           => array('weight' => 8),
    'body'        => array('weight' => 5),
)
[line 274]
weight properties for the different parts of the pages.
'b' (bold) includes <strong>. 'links' are the links from other pages back to this one.
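How the weights are combined into a relevance value is not documented here. As an illustration only, the sketch below assumes a plain weighted sum of hit counts per page part; scorePage() and the $hitsPerProperty layout are hypothetical.

```php
<?php
// Hypothetical scoring: sum of (hits in a page part) x (weight of that part).
function scorePage(array $hitsPerProperty, array $weightProperties) {
    $score = 0;
    foreach ($hitsPerProperty as $property => $hits) {
        if (isset($weightProperties[$property]['weight'])) {
            $score += $hits * $weightProperties[$property]['weight'];
        }
    }
    return $score;
}

$weights = array(
    'title' => array('weight' => 100),
    'h1'    => array('weight' => 80),
    'body'  => array('weight' => 5),
);
// one hit in the title and four hits in the body: 1*100 + 4*5 = 120
$score = scorePage(array('title' => 1, 'body' => 4), $weights);
```

Under this assumption, raising a part's weight makes matches in that part count more, which is the effect the property is meant to control.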
$_httpClient =
[line 70]
$_indexer =
[line 66]
$_indexServer =
[line 64]
$_searcher =
[line 68]