Integrated Search/How to write a custom crawler

From TeamWeaverWiki

ChristianRoehr
Revision as of 09:24, 17 June 2009

This short guide describes how to write crawlers to make additional data sources available for searching with TeamWeaver Integrated Search.

A list of existing crawlers can be found under supported data sources and in repo_config.xml.

There are two general strategies for crawling data sources with TeamWeaverIS:

"Pull"-Crawlers are included in TeamWeaverIS and crawl/extract data by accessing external sources/systems. Pull-crawlers are fairly easy to implement, but have the disadvantage that the index may be out of date, since TeamWeaverIS only learns about data changes when a crawl is executed.
"Push"-Crawlers are implemented on the side of the client system that holds the data. They proactively update the index and thus have a direct connection to the TeamWeaverIS backend. Accordingly, push-crawlers are more difficult to implement, but changes in the data are reflected in the index more quickly.

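The difference between the two strategies can be illustrated with a small, self-contained sketch. Nothing in it is TeamWeaverIS code: the class CrawlStrategyDemo, the two in-memory maps and the method names are hypothetical stand-ins chosen only to show when the index learns about a change.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-ins; none of these names are part of TeamWeaverIS.
class CrawlStrategyDemo {
    // The "index": item id -> indexed text.
    static final Map<String, String> index = new HashMap<>();
    // The external data source: item id -> current text.
    static final Map<String, String> source = new HashMap<>();

    // Pull: the search side reads the whole source whenever a crawl is executed.
    static void pullCrawl() {
        index.clear();
        index.putAll(source);
    }

    // Push: the client system updates the index as part of every change.
    static void changeWithPush(String id, String text) {
        source.put(id, text);
        index.put(id, text);
    }

    public static void main(String[] args) {
        source.put("doc-1", "draft");
        pullCrawl();
        source.put("doc-1", "final");           // change between two crawls
        System.out.println(index.get("doc-1")); // prints "draft": the index is stale
        pullCrawl();
        System.out.println(index.get("doc-1")); // prints "final" after the next crawl
        changeWithPush("doc-1", "published");
        System.out.println(index.get("doc-1")); // prints "published" immediately
    }
}
```

With pull, the staleness window is as long as the interval between two crawls; with push, the cost moves into the client system, which has to report every change itself.
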
Writing a "Pull"-Crawler

At the basic level, creating a pull-crawler requires implementing two Java classes, a Crawler and a Processor, and changing two XML files, one to register each class. Crawlers are classes which access a data source and extract/create single data items. Processors act upon these items to prepare them for feeding into the index.

Crawler

You need to create a subclass of CrawlerBase, which basically means implementing the method crawlObject. See our JIRACrawler for an example.
Afterwards, register your new crawler in defaultCrawlers.xml. You need to define a unique <repoType> key, which corresponds to the <srcType> in repo_config.xml.
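
As a rough sketch of the shape such a subclass might take: the real CrawlerBase API is not shown in this guide, so the example uses a stand-in base class (StandInCrawlerBase) with an assumed signature for crawlObject and an assumed driver method crawl. The CsvLineCrawler class and its "id;title" record format are likewise hypothetical. Use JIRACrawler as the authoritative reference.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for TeamWeaverIS's CrawlerBase. The real class and the real
// signature of crawlObject are not shown in this guide; this shape is assumed.
abstract class StandInCrawlerBase {
    // Assumed contract: turn one raw object from the source into a data item,
    // or return null if the object should be skipped.
    protected abstract Map<String, String> crawlObject(Object raw);

    // Assumed driver loop: visit every raw object and collect the items.
    public List<Map<String, String>> crawl(Iterable<?> rawObjects) {
        List<Map<String, String>> items = new ArrayList<>();
        for (Object raw : rawObjects) {
            Map<String, String> item = crawlObject(raw);
            if (item != null) {
                items.add(item);
            }
        }
        return items;
    }
}

// Hypothetical crawler for a source whose records look like "id;title".
class CsvLineCrawler extends StandInCrawlerBase {
    @Override
    protected Map<String, String> crawlObject(Object raw) {
        String[] parts = raw.toString().split(";", 2);
        if (parts.length < 2) {
            return null; // skip malformed records
        }
        Map<String, String> item = new LinkedHashMap<>();
        item.put("id", parts[0].trim());
        item.put("title", parts[1].trim());
        return item;
    }
}

class CrawlerSketch {
    public static void main(String[] args) {
        List<Map<String, String>> items = new CsvLineCrawler()
                .crawl(Arrays.asList("1;First issue", "broken", "2;Second issue"));
        System.out.println(items); // two items; the malformed record is skipped
    }
}
```

The point of the sketch is the division of labour: the base class owns the iteration, and the subclass only decides how one object from the source becomes one data item.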

Processor

You need to create a subclass of ProcessorBase. See our JIRAProcessor for an example.
Afterwards, register your new processor in defaults.xml by wiring it with the corresponding crawler.
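
A processor might look roughly like the following. This is a sketch under assumptions, not the actual API: StandInProcessorBase replaces the real ProcessorBase, the method name process and the map-based item format are invented for illustration, and TitleProcessor with its "demo:" id prefix is hypothetical. See JIRAProcessor for the real contract.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Stand-in for TeamWeaverIS's ProcessorBase. The real API is not shown in
// this guide; the method name "process" and the item format are assumed.
abstract class StandInProcessorBase {
    // Assumed contract: turn one crawled item into an index-ready document.
    public abstract Map<String, String> process(Map<String, String> item);
}

// Hypothetical processor that normalises a crawled item before indexing.
class TitleProcessor extends StandInProcessorBase {
    @Override
    public Map<String, String> process(Map<String, String> item) {
        String title = item.getOrDefault("title", "").trim();
        Map<String, String> doc = new LinkedHashMap<>();
        // Prefix the id so items from different sources cannot collide.
        doc.put("id", "demo:" + item.get("id"));
        doc.put("title", title);
        // Lower-cased copy of the title as the searchable content field.
        doc.put("content", title.toLowerCase(Locale.ROOT));
        return doc;
    }
}

class ProcessorSketch {
    public static void main(String[] args) {
        Map<String, String> item = new LinkedHashMap<>();
        item.put("id", "42");
        item.put("title", "  Login fails  ");
        System.out.println(new TitleProcessor().process(item));
    }
}
```

In TeamWeaverIS the crawler and processor are wired together in defaults.xml rather than in code; the main method above only exercises the processor in isolation.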

Advanced topics

tbd. (generic JDBC crawler)

Writing a "Push"-Crawler

tbd.
Push-crawlers have to call the TeamWeaver backend's IndexService.
You might want to look at our Woogle code for an example implementation of a push-crawler.
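
Until this section is written, the general idea can be sketched as follows. The real IndexService interface is not documented here, so StandInIndexService and its methods addOrUpdate and remove are assumptions, as are the WikiPagePushCrawler hooks and the "wiki:" id prefix. An in-memory implementation stands in for what would be a remote call to the backend.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the backend's IndexService. The real interface is not shown
// in this guide; both method names and their parameters are assumed.
interface StandInIndexService {
    void addOrUpdate(String id, String content);
    void remove(String id);
}

// In-memory implementation, used only to make the sketch runnable.
// A real push-crawler would talk to the TeamWeaverIS backend instead.
class InMemoryIndexService implements StandInIndexService {
    final Map<String, String> docs = new HashMap<>();

    @Override
    public void addOrUpdate(String id, String content) {
        docs.put(id, content);
    }

    @Override
    public void remove(String id) {
        docs.remove(id);
    }
}

// Hypothetical client-side hook: the client system calls these methods
// whenever its own data changes, so the index is updated right away.
class WikiPagePushCrawler {
    private final StandInIndexService index;

    WikiPagePushCrawler(StandInIndexService index) {
        this.index = index;
    }

    void onPageSaved(String pageName, String text) {
        index.addOrUpdate("wiki:" + pageName, text);
    }

    void onPageDeleted(String pageName) {
        index.remove("wiki:" + pageName);
    }
}

class PushSketch {
    public static void main(String[] args) {
        InMemoryIndexService index = new InMemoryIndexService();
        WikiPagePushCrawler crawler = new WikiPagePushCrawler(index);
        crawler.onPageSaved("Main_Page", "Welcome");
        crawler.onPageSaved("Old_Page", "Obsolete");
        crawler.onPageDeleted("Old_Page");
        System.out.println(index.docs.keySet()); // only wiki:Main_Page remains
    }
}
```

The extra implementation effort mentioned above comes from the hooks: every place in the client system that creates, changes or deletes data has to notify the index.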
