Document Repository
Document Management:
A systematic method for storing, locating, and keeping track of information that is valuable to a business. The key characteristics of a document management system are the ability to manage information, to collaborate when creating information, to distribute the information, and to allow secure access to the greatest number of people.
From: www.data-core.com/glossary-of-terms.htm
Lemur Toolkit
The Lemur Toolkit is an open-source toolkit designed to facilitate research in language modeling and information retrieval. Lemur supports a wide range of industrial and research language applications such as ad-hoc retrieval, site-search, and text mining.
The toolkit supports indexing of large-scale text databases, the construction of simple language models for documents, queries, or subcollections, and the implementation of retrieval systems based on language models as well as a variety of other retrieval models. The system is written in the C and C++ languages, and is designed as a research system to run under Unix operating systems, although it can also run under Windows.
- Sophisticated structured query languages (using InQuery and Indri)
- Support for XML and structured document retrieval
- Commonly used with a wide range of research test collections (e.g., TREC CDs 1-5, wt10g, RCV1, gov, gov2)
- Index your web pages with an "out-of-the-box" site search capability
- Interactive interfaces for Windows, Linux, and Web
- Distributed information retrieval and document clustering applications
- Cross-platform, fast and modular code written in C++
- C++, Java and C# APIs
- Free and open-source software
- In use for over 6 years by a large and growing user community
Indexing:
- Multiple indexing methods for small, medium and large-scale (terabyte) collections
- Built-in support for English, Chinese and Arabic text
- Porter and Krovetz word stemming
- Incremental indexing
- Out-of-the-box indexing support for TREC Text, TREC Web, plain text, HTML, XML, PDF, MBox, Microsoft Word, and Microsoft PowerPoint
- Indexes inline and offset text annotations (e.g., part-of-speech and named entities)
- Indexes document attributes
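To illustrate what incremental indexing means in general, here is a minimal Python sketch of an in-memory inverted index. It is not Lemur's implementation or API: the class name InvertedIndex, its methods, and the whitespace tokenizer are assumptions made for illustration, and stemming, annotations, and on-disk storage are left out.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy in-memory inverted index (hypothetical; not Lemur's API)."""

    def __init__(self):
        # term -> {doc_id -> term frequency in that document}
        self.postings = defaultdict(dict)
        # doc_id -> document length in tokens
        self.doc_lengths = {}

    def add_document(self, doc_id, text):
        """Add one new document; can be called at any time,
        which is the essence of incremental indexing."""
        tokens = text.lower().split()  # naive tokenizer, no stemming
        self.doc_lengths[doc_id] = len(tokens)
        for token in tokens:
            counts = self.postings[token]
            counts[doc_id] = counts.get(doc_id, 0) + 1

    def lookup(self, term):
        """Return a copy of the postings for a term: {doc_id: frequency}."""
        return dict(self.postings.get(term.lower(), {}))


index = InvertedIndex()
index.add_document("d1", "language models for retrieval")
index.add_document("d2", "retrieval of structured documents")
# A document added later extends the same index without a rebuild.
index.add_document("d3", "language language modeling")

print(index.lookup("retrieval"))
print(index.lookup("language"))
```

A production index would also handle document updates and deletions and would store postings in a compressed form on disk; the sketch only shows the term-to-documents mapping that retrieval models rely on.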
Retrieval:
- Supports major language modeling approaches such as Indri and KL-divergence, as well as vector space, tf.idf, Okapi and InQuery
- Relevance- and pseudo-relevance feedback
- Wildcard term expansion (using Indri)
- Passage and XML element retrieval
- Cross-lingual retrieval
- Smoothing via Dirichlet priors and Markov chains
- Supports arbitrary document priors (e.g., PageRank, URL depth)
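As a rough illustration of smoothing with a Dirichlet prior, the Python sketch below scores documents by query likelihood, estimating each term probability as (count in document + mu * collection probability) / (document length + mu). It is a sketch under stated assumptions, not Lemur's code: the function name dirichlet_score, the sample documents, and the default mu of 2000 are all chosen here for illustration.

```python
import math
from collections import Counter


def dirichlet_score(query_terms, doc_tokens, collection_counts,
                    collection_length, mu=2000.0):
    """Log query likelihood of a document with Dirichlet-prior smoothing.

    Hypothetical helper for illustration; mu is the smoothing parameter.
    """
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in query_terms:
        p_coll = collection_counts.get(term, 0) / collection_length
        if p_coll == 0.0:
            # Assumption: ignore query terms that never occur in the collection.
            continue
        p_term = (doc_counts[term] + mu * p_coll) / (doc_len + mu)
        score += math.log(p_term)
    return score


# Tiny made-up collection, used only to exercise the function.
docs = {
    "d1": "language models for information retrieval".split(),
    "d2": "document clustering and text mining".split(),
}
collection_counts = Counter(t for tokens in docs.values() for t in tokens)
collection_length = sum(collection_counts.values())

query = ["language", "retrieval"]
scores = {
    doc_id: dirichlet_score(query, tokens, collection_counts, collection_length)
    for doc_id, tokens in docs.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because the collection probability backs off every estimate, a document that lacks a query term still receives a small nonzero probability for it instead of dropping out of the ranking. A document prior such as PageRank could be folded in by adding its logarithm to the score.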