java.lang.Object
javax.management.Attribute
org.archive.crawler.settings.Type
org.archive.crawler.settings.ComplexType
org.archive.crawler.settings.ModuleType
org.archive.crawler.framework.Processor
org.archive.crawler.extractor.Extractor
org.archive.crawler.extractor.ExtractorHTML
public class ExtractorHTML
Basic link-extraction, from an HTML content-body, using regular expressions.
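As a rough illustration of what regular-expression link extraction looks like, the following self-contained sketch pulls `href` and `src` attribute values out of an HTML character sequence. The class name `RegexLinkSketch`, the method `extractLinks`, and the pattern itself are assumptions made for this example; they are not the `RELEVANT_TAG_EXTRACTOR` or `EACH_ATTRIBUTE_EXTRACTOR` expressions used by `ExtractorHTML`, whose values are not shown on this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLinkSketch {
    // Hypothetical pattern (an assumption, not ExtractorHTML's own):
    // matches href/src attributes whose value is double-quoted,
    // single-quoted, or unquoted.
    private static final Pattern LINK_ATTR = Pattern.compile(
        "(?i)\\b(?:href|src)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))");

    /** Returns the attribute values found, in document order. */
    static List<String> extractLinks(CharSequence html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK_ATTR.matcher(html);
        while (m.find()) {
            // Exactly one of the three capture groups holds the value.
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    links.add(m.group(g));
                    break;
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about.html\">About</a> "
            + "<img src='logo.gif'> <a HREF=next>n</a>";
        System.out.println(extractLinks(html)); // [/about.html, logo.gif, next]
    }
}
```

A regex scan of this kind tolerates malformed markup that a strict parser might reject, at the cost of occasionally matching text that is not a real link.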
Nested Class Summary |
---|
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType |
---|
ComplexType.MBeanAttributeInfoIterator |
Field Summary | |
---|---|
(package private) static java.lang.String | APPLET |
static java.lang.String | ATTR_EXTRACT_JAVASCRIPT: whether to try finding links in JavaScript; default true |
static java.lang.String | ATTR_EXTRACT_ONLY_FORM_GETS |
static java.lang.String | ATTR_IGNORE_FORM_ACTION_URLS |
static java.lang.String | ATTR_IGNORE_UNEXPECTED_HTML |
static java.lang.String | ATTR_TREAT_FRAMES_AS_EMBED_LINKS |
(package private) static java.lang.String | BASE |
(package private) static java.lang.String | CLASSEXT |
(package private) static java.lang.String | EACH_ATTRIBUTE_EXTRACTOR |
static java.lang.String | EXTRACT_VALUE_ATTRIBUTES |
(package private) static java.lang.String | FRAME |
(package private) static java.lang.String | IFRAME |
(package private) static java.lang.String | JAVASCRIPT |
(package private) static java.lang.String | LIKELY_URI_PATH |
(package private) static java.lang.String | LINK |
(package private) static int | MAX_ATTR_VAL_LENGTH |
(package private) static java.lang.String | NON_HTML_PATH_EXTENSION |
protected long | numberOfCURIsHandled |
protected long | numberOfLinksExtracted |
(package private) static java.lang.String | RELEVANT_TAG_EXTRACTOR |
(package private) static java.lang.String | WHITESPACE |
Fields inherited from class org.archive.crawler.framework.Processor |
---|
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules |
Fields inherited from class org.archive.crawler.settings.ComplexType |
---|
definition, definitionMap |
Constructor Summary |
---|
ExtractorHTML(java.lang.String name) |
ExtractorHTML(java.lang.String name, java.lang.String description) |
Method Summary | |
---|---|
void | extract(CrawlURI curi) |
(package private) void | extract(CrawlURI curi, java.lang.CharSequence cs): Run extractor. |
protected boolean | isHtmlExpectedHere(CrawlURI curi): Test whether this HTML is so unexpected (e.g. in place of a GIF URI) that it shouldn't be scanned for links. |
protected void | processEmbed(CrawlURI curi, java.lang.CharSequence value, java.lang.CharSequence context) |
protected void | processEmbed(CrawlURI curi, java.lang.CharSequence value, java.lang.CharSequence context, char hopType) |
protected void | processGeneralTag(CrawlURI curi, java.lang.CharSequence element, java.lang.CharSequence cs) |
protected void | processLink(CrawlURI curi, java.lang.CharSequence value, java.lang.CharSequence context): Handle generic HREF cases. |
protected boolean | processMeta(CrawlURI curi, java.lang.CharSequence cs): Process metadata tags. |
protected void | processScript(CrawlURI curi, java.lang.CharSequence sequence, int endOfOpenTag) |
protected void | processScriptCode(CrawlURI curi, java.lang.CharSequence cs): Extract the (java)script source in the given CharSequence. |
protected void | processStyle(CrawlURI curi, java.lang.CharSequence sequence, int endOfOpenTag): Process style text. |
java.lang.String | report(): Compiles and returns a report (in human-readable form) about the status of the processor. |
Methods inherited from class org.archive.crawler.extractor.Extractor |
---|
innerProcess |
Methods inherited from class org.archive.crawler.framework.Processor |
---|
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn |
Methods inherited from class org.archive.crawler.settings.ModuleType |
---|
addElement, listUsedFiles |
Methods inherited from class org.archive.crawler.settings.Type |
---|
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient |
Methods inherited from class javax.management.Attribute |
---|
getName |
Methods inherited from class java.lang.Object |
---|
clone, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
static final java.lang.String RELEVANT_TAG_EXTRACTOR
static final int MAX_ATTR_VAL_LENGTH
static final java.lang.String EACH_ATTRIBUTE_EXTRACTOR
static final java.lang.String LIKELY_URI_PATH
static final java.lang.String WHITESPACE
static final java.lang.String CLASSEXT
static final java.lang.String APPLET
static final java.lang.String BASE
static final java.lang.String LINK
static final java.lang.String FRAME
static final java.lang.String IFRAME
public static final java.lang.String ATTR_TREAT_FRAMES_AS_EMBED_LINKS
public static final java.lang.String ATTR_IGNORE_FORM_ACTION_URLS
public static final java.lang.String ATTR_EXTRACT_ONLY_FORM_GETS
public static final java.lang.String ATTR_EXTRACT_JAVASCRIPT
public static final java.lang.String EXTRACT_VALUE_ATTRIBUTES
public static final java.lang.String ATTR_IGNORE_UNEXPECTED_HTML
protected long numberOfCURIsHandled
protected long numberOfLinksExtracted
static final java.lang.String JAVASCRIPT
static final java.lang.String NON_HTML_PATH_EXTENSION
Constructor Detail |
---|
public ExtractorHTML(java.lang.String name)
public ExtractorHTML(java.lang.String name, java.lang.String description)
Method Detail |
---|
protected void processGeneralTag(CrawlURI curi, java.lang.CharSequence element, java.lang.CharSequence cs)
protected void processScriptCode(CrawlURI curi, java.lang.CharSequence cs)
curi
- source CrawlURI
cs
- CharSequence of JavaScript code
protected void processLink(CrawlURI curi, java.lang.CharSequence value, java.lang.CharSequence context)
curi
-
value
-
context
-
protected final void processEmbed(CrawlURI curi, java.lang.CharSequence value, java.lang.CharSequence context)
protected void processEmbed(CrawlURI curi, java.lang.CharSequence value, java.lang.CharSequence context, char hopType)
public void extract(CrawlURI curi)
extract in class Extractor
void extract(CrawlURI curi, java.lang.CharSequence cs)
curi
- CrawlURI we're processing.
cs
- Sequence from underlying ReplayCharSequence. This
is TRANSIENT data. Make a copy if you want the data to live outside
of this extractor's lifetime.
protected boolean isHtmlExpectedHere(CrawlURI curi) throws org.apache.commons.httpclient.URIException
curi
- CrawlURI to examine.
Throws: org.apache.commons.httpclient.URIException
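One plausible way to decide that HTML is "unexpected" is to look at the file extension of the URI path, in the spirit of the `NON_HTML_PATH_EXTENSION` field. The sketch below is only an illustration under that assumption: the class `UnexpectedHtmlSketch`, the method `isHtmlExpected`, and the extension list are hypothetical, and it takes a plain path string rather than a `CrawlURI`.

```java
import java.util.regex.Pattern;

class UnexpectedHtmlSketch {
    // Hypothetical extension list (an assumption); the actual value of
    // NON_HTML_PATH_EXTENSION is not shown on this page.
    private static final Pattern NON_HTML_EXT = Pattern.compile(
        "(?i).*\\.(?:gif|jpe?g|png|css|js|pdf|zip)$");

    /** Returns false when the path ends in an extension that suggests non-HTML content. */
    static boolean isHtmlExpected(String path) {
        if (path == null) {
            return true; // no path, so nothing suggests non-HTML content
        }
        return !NON_HTML_EXT.matcher(path).matches();
    }

    public static void main(String[] args) {
        System.out.println(isHtmlExpected("/img/logo.gif")); // false
        System.out.println(isHtmlExpected("/index.html"));   // true
    }
}
```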
protected void processScript(CrawlURI curi, java.lang.CharSequence sequence, int endOfOpenTag)
protected boolean processMeta(CrawlURI curi, java.lang.CharSequence cs)
curi
- CrawlURI we're processing.
cs
- Sequence from underlying ReplayCharSequence. This
is TRANSIENT data. Make a copy if you want the data to live outside
of this extractor's lifetime.
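The "make a copy" advice for transient sequences can be sketched with plain JDK types. Here a `StringBuilder` stands in for the `ReplayCharSequence` (an assumption made so the example is self-contained), and the class `TransientCopySketch` and method `snapshot` are hypothetical names.

```java
class TransientCopySketch {
    /** Returns an immutable copy that stays valid after the source buffer changes. */
    static String snapshot(CharSequence cs) {
        // toString() yields a String with the same characters, independent
        // of whatever buffer backs the original sequence.
        return cs.toString();
    }

    public static void main(String[] args) {
        // StringBuilder is a stand-in for the transient sequence (assumption).
        StringBuilder buffer = new StringBuilder("<a href=\"x.html\">");
        CharSequence view = buffer;

        String kept = snapshot(view);
        buffer.setLength(0); // simulate the underlying data going away

        System.out.println(kept);          // still the original text
        System.out.println(view.length()); // 0, the live view is now empty
    }
}
```

Holding on to the `CharSequence` reference itself would not be enough, because the reference only offers a view of data that may be released or reused.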
protected void processStyle(CrawlURI curi, java.lang.CharSequence sequence, int endOfOpenTag)
curi
- CrawlURI we're processing.
sequence
- Sequence from underlying ReplayCharSequence. This
is TRANSIENT data. Make a copy if you want the data to live outside
of this extractor's lifetime.
endOfOpenTag
-
public java.lang.String report()
Description copied from class: Processor
Examples of stats declared would include:
* Number of CrawlURIs handled.
* Number of links extracted (for link extractors)
etc.
report in class Processor
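A report of the kind described above might be assembled from the two counters listed under Field Detail. The following is a minimal sketch, not the actual `ExtractorHTML.report()` implementation: the class `ReportSketch`, the helper `record`, and the exact report layout are assumptions; only the counter names come from this page.

```java
class ReportSketch {
    // Counter names mirror the fields documented above.
    protected long numberOfCURIsHandled = 0;
    protected long numberOfLinksExtracted = 0;

    /** Hypothetical helper: records one processed URI and the links found in it. */
    void record(int linksFound) {
        numberOfCURIsHandled++;
        numberOfLinksExtracted += linksFound;
    }

    /** Builds a human-readable status report (layout is an assumption). */
    String report() {
        StringBuilder sb = new StringBuilder();
        sb.append("Processor: ").append(getClass().getName()).append('\n');
        sb.append("  CrawlURIs handled: ").append(numberOfCURIsHandled).append('\n');
        sb.append("  Links extracted: ").append(numberOfLinksExtracted).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        ReportSketch r = new ReportSketch();
        r.record(3);
        r.record(2);
        System.out.print(r.report());
    }
}
```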