Current revision as of 07:46, 29 April 2010
This short guide describes how to write crawlers to make additional data sources available for searching with TeamWeaver Integrated Search.
A list of existing crawlers can be found under supported data sources and in repo_config.xml.
Note that for most web-based sources, the generic web crawler can be sufficient as a first-cut solution. However, the web crawler indexes complete HTML pages and may thus yield lower search quality (e.g. because navigation menus are indexed as well).
Pull vs. Push Crawlers
There are two general strategies for crawling data sources with TeamWeaverIS:
- "Pull"-Crawlers are implemented and deployed within TeamWeaverIS and crawl/extract data by accessing external sources/systems. Pull-crawlers are fairly easy to realize, but have the disadvantage that the index might not be accurate, since TeamWeaverIS only learns about data changes when a crawl is executed. Typically, a script updates the TeamWeaverIS index regularly (e.g. every night); index accuracy thus depends on the frequency of crawls.
- "Push"-Crawlers are implemented on the side of the client system that holds the data to be indexed. They proactively update the index and thus need a direct (network) connection to the TeamWeaverIS backend. Accordingly, push-crawlers are more difficult to implement, but reflect data changes in the index more rapidly.
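The regular pull-crawl mentioned above is typically driven by a cron job. The script path and schedule below are illustrative assumptions, not part of the TeamWeaverIS distribution:

```shell
# Hypothetical crontab entry: re-run the pull-crawl every night at 02:00.
# The script name and path are examples only; use your actual update script.
0 2 * * * /opt/teamweaver/bin/update-index.sh >> /var/log/teamweaver-crawl.log 2>&1
```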
Writing a "Pull"-Crawler
At the basic level, creating a pull-crawler requires implementing two Java classes and changing two XML files: one of each for a Crawler and one of each for a Processor. Crawlers are classes that access a data source and extract/create single data items. Processors act upon these items to prepare them for feeding into the index.
Crawler
- You need to create a subclass of CrawlerBase, which basically means implementing the method crawlObject. See our JIRACrawler for an example.
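A pull-crawler subclass might look roughly as follows. Note that CrawlerBase and the crawlObject signature shown here are simplified stand-ins invented for illustration, not the actual TeamWeaver API; consult the JIRACrawler source for the real class and method signatures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for TeamWeaver's CrawlerBase.
// The real base class lives in the TeamWeaverIS code base.
abstract class CrawlerBase {
    // Crawl the source: fetch raw objects and turn each into a data item.
    public List<String> crawl() {
        List<String> items = new ArrayList<>();
        for (String raw : fetchRawObjects()) {
            items.add(crawlObject(raw));
        }
        return items;
    }

    // Access the external source; stubbed in subclasses for illustration.
    protected abstract List<String> fetchRawObjects();

    // The one method a concrete crawler has to implement:
    // convert a raw source object into an indexable data item.
    protected abstract String crawlObject(String raw);
}

// Example crawler for a fictitious issue tracker.
class ExampleIssueCrawler extends CrawlerBase {
    @Override
    protected List<String> fetchRawObjects() {
        // In a real crawler this would query the external system.
        return List.of("ISSUE-1|Login fails", "ISSUE-2|Crash on save");
    }

    @Override
    protected String crawlObject(String raw) {
        // Extract the searchable text from the raw record.
        String[] parts = raw.split("\\|");
        return parts[0] + ": " + parts[1];
    }
}
```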
- Afterwards, register your new crawler in defaultCrawlers.xml. You need to define a unique <repoType> key, which corresponds to the <srcType> in repo_config.xml.
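The registration might look like the following sketch. Only <repoType> and <srcType> are named in this guide; all surrounding element names and the crawler class are assumptions, so check the actual defaultCrawlers.xml and repo_config.xml for the real schema.

```xml
<!-- defaultCrawlers.xml (sketch): map a unique repoType key to your crawler class. -->
<crawler>
  <repoType>myIssueTracker</repoType>
  <class>org.example.crawler.MyIssueTrackerCrawler</class>
</crawler>

<!-- repo_config.xml (sketch): a source entry whose srcType matches that key. -->
<source>
  <srcType>myIssueTracker</srcType>
  <url>https://issues.example.org/</url>
</source>
```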
Processor
Advanced topics
- tbd. (generic JDBC crawler, Plain index vs. metadata)
Writing a "Push"-Crawler
- tbd.
- Push-Crawlers have to call the TeamWeaver backend's IndexService
- You might want to look at our Woogle4MediaWiki code for an example implementation of a push-crawler. The particular place to look at is the "WoogleRemote" addon [1].
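The push pattern can be sketched as below. The IndexService interface shown here is a hypothetical stand-in with an invented method signature; the real interface is the linked org.teamweaver.is.api.IndexService, so adapt the call accordingly.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the backend's IndexService; the actual
// interface is org.teamweaver.is.api.IndexService with its own methods.
interface IndexService {
    void index(String documentId, String content);
}

// Minimal in-memory implementation, used here only for illustration.
class InMemoryIndexService implements IndexService {
    final Map<String, String> index = new HashMap<>();

    @Override
    public void index(String documentId, String content) {
        index.put(documentId, content);
    }
}

// Push-crawler pattern: the client system notifies the backend
// immediately whenever its data changes, instead of waiting for a crawl.
class WikiPagePusher {
    private final IndexService backend;

    WikiPagePusher(IndexService backend) {
        this.backend = backend;
    }

    // Hook this into the client system's save/update event.
    void onPageSaved(String pageTitle, String pageText) {
        backend.index(pageTitle, pageText);
    }
}
```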