Repo config.xml


Revision as of 09:29, 17 June 2009 by ChristianRoehr

The file repo_config.xml in the \teamweaverIS-backend\WEB-INF\conf directory of your TeamWeaverIS backend defines the data sources to be crawled.

Users can easily develop and plug in crawlers for other data sources. See our guide on how to write a custom crawler.

Repositories are crawled by assigning them to a crawl.

Example repo_config.xml

<?xml version="1.0" encoding="UTF-8"?>
<repo_config>
	<repositoryInfo>
		<repoId>1</repoId>
		<repoName>PushIndexingTestRepo</repoName>
		<srcType>web</srcType>
		<srcVersion>1.0</srcVersion>
		<connectURL>http://www.teamweaver.org/</connectURL>
		<connectType>http</connectType>
		<user></user>
		<pass></pass>
		<group>all</group>
		<linkPath></linkPath>
		<pushIndexAuthKey>test42</pushIndexAuthKey>
		<pushIndexEnabled>true</pushIndexEnabled>
		<cacheFulltext>true</cacheFulltext>
	</repositoryInfo>
	<repositoryInfo>
		<repoId>2</repoId>
		<repoName>files</repoName>
		<srcType>filesystem</srcType>
		<srcVersion>1.0</srcVersion>
		<connectURL>//linde/c$/docs_small</connectURL>
		<connectType>filesystem</connectType>
		<linkPath>http://webdav.internal.de/linde/c/docs_small/</linkPath>
		<user></user>
		<pass></pass>
		<group>all</group>
	</repositoryInfo>
</repo_config>

Documentation of parameters

  • repo_config.xml contains one <repositoryInfo> entry for each repository to be crawled (see the example file above)
  • For the semantics of the parameters in the context of a particular source type, consult the "List of Source Types" below
  • The parameters inside the <repositoryInfo> element are as follows:
    • <repoId>1</repoId> - a numerical id for the repository. It must be unique among the <repositoryInfo> elements of repo_config.xml
    • <repoName>My Test repository</repoName> - a human-readable label for the repository
    • <srcType>web</srcType> - the type of repository - see below for a list of allowed keys
    • <srcVersion>1.0</srcVersion> - the version of the repository type - irrelevant for most types (OPTIONAL)
    • <connectURL>http://www.teamweaver.org/</connectURL> - a descriptor of the physical location. The exact form depends on the srcType - e.g. for a "web" resource this is a URL, while for a file system it is a path
    • <connectType>http</connectType> - the connection mode, for srcTypes that offer a choice - irrelevant for most types (OPTIONAL)
    • <user>myUser</user> - user name, if the srcType requires authentication (OPTIONAL)
    • <pass>myPass</pass> - password (OPTIONAL)
    • <group>all</group> - the user group that should have access to the data crawled for this entry (OPTIONAL)
    • <linkPath></linkPath> - specifies a separate "link path" for repositories that do not provide "clickable" URLs for the browser. For example, a network file share might be indexed as \\computer\path\, which is typically not clickable in a web result list. In that case you can provide an alternative link path to the repository (e.g. a WebDAV wrapper such as http://internal.mycompany.de/computer/path/), which is then used to refer to results. (OPTIONAL)
    • <cacheFulltext>false</cacheFulltext> - determines whether the backend caches indexed files in order to make them accessible via the result interface. This is an alternative if it is not possible to expose the source system via the <linkPath> option.
    • <pushIndexEnabled>false</pushIndexEnabled> - if set to true, this repository can no longer be actively crawled using crawl_config.xml, but will instead push changes to the backend (OPTIONAL)
    • <pushIndexAuthKey>a_password</pushIndexAuthKey> - an arbitrary string that serves to authenticate the push indexing client (OPTIONAL)
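
As a rough illustration of the parameter list, an entry that leaves out every parameter marked OPTIONAL might look like the sketch below. It has not been tested against a backend. The repoId, repoName and connectURL values are hypothetical, and leaving out <cacheFulltext> (which is not marked OPTIONAL) is an assumption based on the second entry of the example file, which also omits it.

```xml
<!-- Sketch only: repoId, repoName and connectURL are hypothetical values. -->
<repositoryInfo>
	<!-- Must not collide with the repoId of any other repositoryInfo entry. -->
	<repoId>3</repoId>
	<repoName>Project documents</repoName>
	<srcType>filesystem</srcType>
	<!-- For a file system source the connectURL is a path, not a URL. -->
	<connectURL>//fileserver/projects/docs</connectURL>
</repositoryInfo>
```

If the share is not reachable from the browser, adding a <linkPath> or setting <cacheFulltext> to true, as described in the list, would be the next step.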

List of Source Types

This is the list of allowed <srcType> values for a <repositoryInfo> entry. The complete, authoritative list can be obtained from defaultCrawlers.xml in the SVN repository.

Web resources ("web")

  • <repositoryInfo> options
  • Behaviour: in its current state of implementation, the web crawler starts from the initial page defined by the connectURL and follows all links whose URLs start with the connectURL (e.g. http://www.fzi.de/ipe/some_subdirectory/page.html but not http://www.fzi.de/se/page.html) up to a depth of 9 hops. There is currently no way to configure a different behaviour (although this could be implemented fairly easily).
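
Because the crawl scope is derived from the connectURL, the way to restrict a web crawl to one part of a site is to point the connectURL at that part. The entry below is a sketch that assumes a connectURL of http://www.fzi.de/ipe/, which the example URLs in the behaviour description imply but do not state. The repoId and repoName are hypothetical.

```xml
<!-- Sketch only: repoId and repoName are hypothetical; the connectURL is assumed. -->
<repositoryInfo>
	<repoId>4</repoId>
	<repoName>FZI IPE pages</repoName>
	<srcType>web</srcType>
	<!-- Only links starting with this prefix are followed, up to 9 hops deep. -->
	<connectURL>http://www.fzi.de/ipe/</connectURL>
	<connectType>http</connectType>
	<group>all</group>
</repositoryInfo>
```

With this entry, http://www.fzi.de/ipe/some_subdirectory/page.html would be in scope, while http://www.fzi.de/se/page.html would not.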

File systems/network shares

SVN repositories

CVS repositories

  • <repositoryInfo> options
    • <srcType>svn</srcType>
    • <connectURL>:pserver:anonymous@aperture.cvs.sourceforge.net:/cvsroot/aperture:aperture</connectURL> (Example)
    • <connectType>pserver</connectType>

Atlassian Confluence Wiki

Atlassian Jira Issue Tracker

  • <repositoryInfo> options
    • <srcType>jira</srcType>
    • <connectURL>localhost:8080/</connectURL> (Example)
    • <connectType>rpc</connectType>
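
Put together, a complete Jira entry could look roughly like the following sketch. The srcType, connectURL and connectType come from the options listed for this source type. The repoId, repoName and the credentials are hypothetical, and whether a given Jira instance needs <user> and <pass> at all is an assumption.

```xml
<!-- Sketch only: repoId, repoName, user and pass are hypothetical values. -->
<repositoryInfo>
	<repoId>5</repoId>
	<repoName>Issue tracker</repoName>
	<srcType>jira</srcType>
	<connectURL>localhost:8080/</connectURL>
	<connectType>rpc</connectType>
	<!-- Assumed: credentials of an account allowed to read the issues. -->
	<user>myUser</user>
	<pass>myPass</pass>
	<group>all</group>
</repositoryInfo>
```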

JSPWiki

  • <repositoryInfo> options
    • <srcType>jspwiki</srcType>
    • <srcVersion>2.0</srcVersion>
    • <connectURL>http://www.jspwiki.org</connectURL>
    • <connectType>rpc</connectType>
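
For orientation, the JSPWiki options can be combined with the general parameters into a full entry along these lines. This is a sketch: the repoId and repoName are hypothetical, and the <group> element is included only as an illustration of an optional parameter.

```xml
<!-- Sketch only: repoId and repoName are hypothetical values. -->
<repositoryInfo>
	<repoId>6</repoId>
	<repoName>JSPWiki pages</repoName>
	<srcType>jspwiki</srcType>
	<srcVersion>2.0</srcVersion>
	<connectURL>http://www.jspwiki.org</connectURL>
	<connectType>rpc</connectType>
	<group>all</group>
</repositoryInfo>
```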

Bugzilla Issue Tracker

To be continued....
