Integrated Search/How to write a custom crawler

From TeamWeaverWiki

Revision as of 16:01, 29 June 2009 by Happel (Talk | contribs)


This short guide describes how to write crawlers to make additional data sources available for searching with TeamWeaver Integrated Search.

A list of existing crawlers can be found under supported data sources or in repo_config.xml.

There are two general strategies for crawling data sources with TeamWeaverIS:

"Pull"-Crawlers are included in TeamWeaverIS and crawl/extract data by accessing external sources or systems. Pull-crawlers are fairly easy to implement, but the index may be out of date, since TeamWeaverIS only learns about data changes when a crawl is executed.
"Push"-Crawlers are implemented on the side of the client system that holds the data. They proactively update the index and therefore require a direct connection to the TeamWeaverIS backend. Accordingly, push-crawlers are more difficult to implement, but changes in the data are reflected in the index more quickly.

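The difference between the two strategies can be illustrated with a small, self-contained sketch. None of the classes below are part of TeamWeaverIS: DemoIndex, DemoSource and PullVsPushDemo are invented names that only model when the index learns about a change.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Invented stand-in for a search index; not a TeamWeaverIS class.
class DemoIndex {
    private final Set<String> docs = new HashSet<>();

    void add(String doc) { docs.add(doc); }

    boolean contains(String doc) { return docs.contains(doc); }
}

// Invented client system that holds the data.
class DemoSource {
    private final List<String> records = new ArrayList<>();
    private final List<Consumer<String>> listeners = new ArrayList<>();

    // Registers a callback that is invoked for every new record.
    void onChange(Consumer<String> listener) { listeners.add(listener); }

    void addRecord(String record) {
        records.add(record);
        listeners.forEach(l -> l.accept(record));
    }

    // Returns a copy of all records currently held by the source.
    List<String> snapshot() { return new ArrayList<>(records); }
}

class PullVsPushDemo {
    // Pull: the indexer reads the source; changes are only seen when this runs.
    static void pullCrawl(DemoSource source, DemoIndex index) {
        source.snapshot().forEach(index::add);
    }

    // Push: the source notifies the index on every change.
    static void enablePush(DemoSource source, DemoIndex index) {
        source.onChange(index::add);
    }
}
```

In this model, a record added after the last call to pullCrawl is missing from the pull index until the next crawl, while an index connected via enablePush sees it immediately.
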
Writing a "Pull"-Crawler

At the basic level, creating a pull-crawler requires implementing two Java classes, a Crawler and a Processor, and changing one XML file for each. Crawlers are classes that access a data source and extract or create individual data items. Processors act on these items to prepare them for feeding into the index.

Crawler

You need to create a subclass of CrawlerBase, which basically means implementing the method crawlObject. See our JIRACrawler for an example.
Afterwards, register your new crawler in defaultCrawlers.xml. You need to define a unique <repoType> key, which corresponds to the <srcType> in repo_config.xml.
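
The following is a minimal sketch of what such a subclass might look like. It does not use the real TeamWeaverIS classes: the CrawlerBase shown here is a simplified stand-in, and the signature of crawlObject, the emit helper and the TicketCrawler class are assumptions made for illustration. Check the actual CrawlerBase and JIRACrawler sources for the real API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for TeamWeaverIS's CrawlerBase; the real class and
// the real signature of crawlObject may differ.
abstract class CrawlerBase {
    private final List<Map<String, String>> items = new ArrayList<>();

    // Subclasses access the data source here and emit one item per record.
    protected abstract void crawlObject(String sourceUrl);

    // Assumed helper that collects an extracted data item.
    protected void emit(Map<String, String> item) { items.add(item); }

    // Runs one crawl and returns the items that were extracted.
    List<Map<String, String>> crawl(String sourceUrl) {
        items.clear();
        crawlObject(sourceUrl);
        return new ArrayList<>(items);
    }
}

// Example crawler for an invented "ticket" source that is held in memory.
class TicketCrawler extends CrawlerBase {
    private final Map<String, String> tickets; // ticket id -> summary

    TicketCrawler(Map<String, String> tickets) { this.tickets = tickets; }

    @Override
    protected void crawlObject(String sourceUrl) {
        for (Map.Entry<String, String> e : tickets.entrySet()) {
            // Each ticket becomes one data item with a URI and its text.
            emit(Map.of("uri", sourceUrl + "/" + e.getKey(),
                        "text", e.getValue()));
        }
    }
}
```

A real crawler would replace the in-memory map with calls to the external system and would still need its entry in defaultCrawlers.xml as described above.
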

Processor

You need to create a subclass of ProcessorBase. See our JIRAProcessor for an example.
Afterwards, register your new processor in defaults.xml by wiring it to the corresponding crawler.
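
As a rough illustration of the processor's role, the sketch below cleans up a crawled item before it is indexed. The ProcessorBase shown here is a simplified stand-in, and the process method, the field names and the TicketProcessor class are assumptions; the real base class and JIRAProcessor may look quite different.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for TeamWeaverIS's ProcessorBase; the real class
// and its method names may differ.
abstract class ProcessorBase {
    // Turns one crawled item into the fields that are fed into the index.
    abstract Map<String, String> process(Map<String, String> item);
}

// Example processor for items from an invented "ticket" source.
class TicketProcessor extends ProcessorBase {
    @Override
    Map<String, String> process(Map<String, String> item) {
        Map<String, String> doc = new HashMap<>(item);
        String text = item.getOrDefault("text", "");
        // Replace simple markup tags with spaces, then collapse whitespace.
        String plain = text.replaceAll("<[^>]*>", " ")
                           .replaceAll("\\s+", " ")
                           .trim();
        doc.put("text", plain);
        // Assumed field that marks which source type the item came from.
        doc.put("srcType", "ticket");
        return doc;
    }
}
```
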

Advanced topics

To be done (generic JDBC crawler, plain index vs. metadata).

Writing a "Push"-Crawler

tbd.
Push-crawlers have to call the TeamWeaver backend's IndexService.
You might want to look at our Woogle code for an example implementation of a push-crawler.
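
Until this section is written, the general shape of a push-crawler can be sketched as follows. The IndexService interface below is a hypothetical stand-in with assumed methods addOrUpdate and remove; the real TeamWeaver IndexService, its operations and the way it is reached from the client are not shown here. WikiPushCrawler and InMemoryIndexService are likewise invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the backend's IndexService; the real interface
// and its method names may differ.
interface IndexService {
    void addOrUpdate(String uri, String text);
    void remove(String uri);
}

// In-memory implementation used only to make the sketch runnable.
class InMemoryIndexService implements IndexService {
    final Map<String, String> docs = new HashMap<>();

    @Override
    public void addOrUpdate(String uri, String text) { docs.put(uri, text); }

    @Override
    public void remove(String uri) { docs.remove(uri); }
}

// Invented push-crawler living inside a wiki-like client system. The client
// calls these methods from its own save and delete handlers.
class WikiPushCrawler {
    private final IndexService index;
    private final String baseUrl;

    WikiPushCrawler(IndexService index, String baseUrl) {
        this.index = index;
        this.baseUrl = baseUrl;
    }

    // Pushes the new page content to the index as soon as it is saved.
    void pageSaved(String title, String content) {
        index.addOrUpdate(baseUrl + "/" + title, content);
    }

    // Removes the page from the index as soon as it is deleted.
    void pageDeleted(String title) {
        index.remove(baseUrl + "/" + title);
    }
}
```

The key design point is that the client system itself triggers the index update, so no periodic crawl is needed.
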
