Current revision as of 07:46, 29 April 2010
This short guide describes how to write crawlers to make additional data sources available for searching with TeamWeaver Integrated Search.
A list of existing crawlers can be found under supported data sources and in repo_config.xml.
Note that for most web-based sources, the generic web crawler can be sufficient as a first-cut solution. However, the web crawler indexes complete HTML pages and may thus yield lower search quality (e.g. because navigation menus are indexed as well).
Pull vs. Push Crawlers
There are two general strategies for crawling data sources with TeamWeaverIS:
- "Pull"-Crawlers are implemented and deployed within TeamWeaverIS and crawl/extract data by accessing external sources/systems. Pull-crawlers are fairly easy to realize, but have the disadvantage that the index might not be accurate, since TeamWeaverIS only learns about data changes when a crawl is executed. Typically, a script updates the TeamWeaverIS index regularly (e.g. every night); index accuracy thus depends on the frequency of crawls.
- "Push"-Crawlers are implemented on the side of the client system that holds the data to be indexed. They proactively update the index and thus need a direct (network) connection to the TeamWeaverIS backend. Accordingly, push-crawlers are more difficult to implement, but reflect data changes in the index more rapidly.
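The regular pull-crawl mentioned above is typically driven by a cron job. The script path and schedule below are illustrative assumptions, not part of the TeamWeaverIS distribution:

```shell
# Hypothetical crontab entry: re-run the pull-crawl every night at 02:00.
# The script name and path are examples only; use your actual update script.
0 2 * * * /opt/teamweaver/bin/update-index.sh >> /var/log/teamweaver-crawl.log 2>&1
```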
Writing a "Pull"-Crawler
At the basic level, creating a pull-crawler requires implementing two Java classes and changing two XML files: one of each for a Crawler and one of each for a Processor. Crawlers are classes that access a data source and extract/create single data items. Processors act upon these items to prepare them for feeding into the index.
Crawler
- You need to create a subclass of CrawlerBase, which basically means implementing the method crawlObject. See our JIRACrawler for an example.
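A pull-crawler subclass might look roughly as follows. Note that CrawlerBase and the crawlObject signature shown here are simplified stand-ins invented for illustration, not the actual TeamWeaver API; consult the JIRACrawler source for the real class and method signatures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for TeamWeaver's CrawlerBase.
// The real base class lives in the TeamWeaverIS code base.
abstract class CrawlerBase {
    // Crawl the source: fetch raw objects and turn each into a data item.
    public List<String> crawl() {
        List<String> items = new ArrayList<>();
        for (String raw : fetchRawObjects()) {
            items.add(crawlObject(raw));
        }
        return items;
    }

    // Access the external source; stubbed in subclasses for illustration.
    protected abstract List<String> fetchRawObjects();

    // The one method a concrete crawler has to implement:
    // convert a raw source object into an indexable data item.
    protected abstract String crawlObject(String raw);
}

// Example crawler for a fictitious issue tracker.
class ExampleIssueCrawler extends CrawlerBase {
    @Override
    protected List<String> fetchRawObjects() {
        // In a real crawler this would query the external system.
        return List.of("ISSUE-1|Login fails", "ISSUE-2|Crash on save");
    }

    @Override
    protected String crawlObject(String raw) {
        // Extract the searchable text from the raw record.
        String[] parts = raw.split("\\|");
        return parts[0] + ": " + parts[1];
    }
}
```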
- Afterwards, register your new crawler in defaultCrawlers.xml. You need to define a unique <repoType> key, which corresponds to the <srcType> in repo_config.xml.
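The registration might look like the following sketch. Only <repoType> and <srcType> are named in this guide; all surrounding element names and the crawler class are assumptions, so check the actual defaultCrawlers.xml and repo_config.xml for the real schema.

```xml
<!-- defaultCrawlers.xml (sketch): map a unique repoType key to your crawler class. -->
<crawler>
  <repoType>myIssueTracker</repoType>
  <class>org.example.crawler.MyIssueTrackerCrawler</class>
</crawler>

<!-- repo_config.xml (sketch): a source entry whose srcType matches that key. -->
<source>
  <srcType>myIssueTracker</srcType>
  <url>https://issues.example.org/</url>
</source>
```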
Processor
Advanced topics
- tbd. (generic JDBC crawler, Plain index vs. metadata)
Writing a "Push"-Crawler
- tbd.
- Push-Crawlers have to call the TeamWeaver backend's IndexService
- You might want to look at our Woogle4MediaWiki code for an example implementation of a push-crawler. The particular place to look at is the "WoogleRemote" addon [1].
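The push pattern can be sketched as below. The IndexService interface shown here is a hypothetical stand-in with an invented method signature; the real interface is the linked org.teamweaver.is.api.IndexService, so adapt the call accordingly.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the backend's IndexService; the actual
// interface is org.teamweaver.is.api.IndexService with its own methods.
interface IndexService {
    void index(String documentId, String content);
}

// Minimal in-memory implementation, used here only for illustration.
class InMemoryIndexService implements IndexService {
    final Map<String, String> index = new HashMap<>();

    @Override
    public void index(String documentId, String content) {
        index.put(documentId, content);
    }
}

// Push-crawler pattern: the client system notifies the backend
// immediately whenever its data changes, instead of waiting for a crawl.
class WikiPagePusher {
    private final IndexService backend;

    WikiPagePusher(IndexService backend) {
        this.backend = backend;
    }

    // Hook this into the client system's save/update event.
    void onPageSaved(String pageTitle, String pageText) {
        backend.index(pageTitle, pageText);
    }
}
```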