Carrot2

Remove dependency of carrot2-core on Lucene APIs

Details

  • Type: Refactoring
  • Status: Resolved
  • Priority: Critical
  • Resolution: Fixed
  • Affects Version/s: None
  • Fix Version/s: 3.4.0
  • Component/s: Framework Core
  • Labels:
    None

Description

The primary purpose of depending on Lucene APIs was to make it possible to reuse existing Lucene components (tokenizers, analyzers) with Carrot2. In reality, such reuse probably never happened.

Making carrot2-core dependent on Lucene causes a number of headaches:

  • If Carrot2 is used with Solr, Nutch or a custom Lucene-based application, the Lucene versions must match. As a result, upgrading Carrot2 may be impossible when Carrot2 has switched to a newer version of Lucene but the embedding application has not.
  • Development of the Carrot2 plugin in Lucene/Solr trunk/dev is blocked: the Lucene API can change freely there, but Carrot2 JARs require a specific version of Lucene.

Given that the cons probably outweigh the pros, it's best to remove the dependency of carrot2-core on Lucene by defining a Carrot2-specific Tokenizer interface and using it in place of the Lucene one.
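
A minimal sketch of what such a Lucene-free tokenizer contract could look like. The names SimpleTokenizer, LetterDigitTokenizer and TokenizerSketch are hypothetical and chosen for illustration only; they are not the interface that was actually committed to carrot2-core.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical Carrot2-side tokenizer contract; no Lucene types appear in it. */
interface SimpleTokenizer {
    /** Starts tokenizing a new input. */
    void reset(Reader input);

    /** Returns the next token, or null when the input is exhausted. */
    String nextToken() throws IOException;
}

/** Trivial implementation: a token is a maximal run of letters or digits. */
final class LetterDigitTokenizer implements SimpleTokenizer {
    private Reader input;

    @Override
    public void reset(Reader input) {
        this.input = input;
    }

    @Override
    public String nextToken() throws IOException {
        StringBuilder token = new StringBuilder();
        int c;
        while ((c = input.read()) != -1) {
            if (Character.isLetterOrDigit(c)) {
                token.append((char) c);
            } else if (token.length() > 0) {
                break; // a separator after at least one character ends the token
            }
        }
        return token.length() > 0 ? token.toString() : null;
    }
}

public class TokenizerSketch {
    /** Collects all tokens; callers depend only on the SimpleTokenizer interface. */
    static List<String> tokenize(SimpleTokenizer tokenizer, String text) throws IOException {
        tokenizer.reset(new StringReader(text));
        List<String> tokens = new ArrayList<>();
        for (String t = tokenizer.nextToken(); t != null; t = tokenizer.nextToken()) {
            tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokenize(new LetterDigitTokenizer(), "Carrot2 clusters search results."));
    }
}
```

With a contract like this, a Lucene-backed tokenizer can still be offered by the embedding application (for example Solr) as one more implementation, so the Lucene version is chosen there rather than in carrot2-core.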

Activity

Dawid Weiss added a comment -

There are two points in the code that use the Lucene API: one is ExtendedLanguageModelFactory, the other is SnowballStemmerFactory. We use Lucene's repackaged Snowball because it has performance improvements over the regular distribution.

Dawid Weiss added a comment -

The nearest Snowball API change is perhaps coming in Lucene 3.1 (current trunk) – SnowballProgram's implementation changed and new methods have been added. This should NOT break the runtime or the builds because the old methods are kept. I'm just saying that from time to time they do fiddle with the Snowball classes.

Stanisław Osiński added a comment -

Fixed in trunk. A Solr patch submitted.

In the end I moved the SnowballStemmerAdapter under the DefaultLanguageModelFactory and LuceneLanguageModelFactory in Solr, so that we can deal with stemmer API changes too.
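
To illustrate the adapter idea mentioned above, here is a rough sketch; it is not the actual SnowballStemmerAdapter code. Stemmer, ThirdPartyStemmer and ThirdPartyStemmerAdapter are hypothetical names, and ThirdPartyStemmer is a toy stand-in for an external stemmer with a stateful set/stem/get style API.

```java
/** Hypothetical core-side stemmer contract, free of third-party types. */
interface Stemmer {
    /** Returns the stem of the word, or null if stemming leaves it unchanged. */
    CharSequence stem(CharSequence word);
}

/** Toy stand-in for an external stemmer; it only strips a trailing "s". */
final class ThirdPartyStemmer {
    private String current = "";

    void setCurrent(String word) {
        current = word;
    }

    boolean stem() {
        if (current.length() > 3 && current.endsWith("s")) {
            current = current.substring(0, current.length() - 1);
        }
        return true;
    }

    String getCurrent() {
        return current;
    }
}

/** The only class that touches the external API; a change there is absorbed here. */
final class ThirdPartyStemmerAdapter implements Stemmer {
    private final ThirdPartyStemmer delegate;

    ThirdPartyStemmerAdapter(ThirdPartyStemmer delegate) {
        this.delegate = delegate;
    }

    @Override
    public CharSequence stem(CharSequence word) {
        String input = word.toString();
        delegate.setCurrent(input);
        if (!delegate.stem()) {
            return null;
        }
        String result = delegate.getCurrent();
        return result.equals(input) ? null : result;
    }
}

public class StemmerSketch {
    public static void main(String[] args) {
        Stemmer stemmer = new ThirdPartyStemmerAdapter(new ThirdPartyStemmer());
        System.out.println(stemmer.stem("clusters"));
    }
}
```

Because the rest of the code sees only the Stemmer interface in this sketch, a change in the external stemmer's methods would require editing the adapter and nothing else.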

