Details
- Type: Refactoring
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: None
- Fix Version/s: 3.4.0
- Component/s: Framework Core
- Labels: None
Description
The primary purpose of depending on Lucene APIs was to make it possible to reuse existing Lucene components (tokenizers, analyzers) with Carrot2. In reality, such reuse probably never happened.
Making carrot2-core dependent on Lucene causes a number of headaches:
- If Carrot2 is to be used with Solr, Nutch or a custom Lucene-based application, the Lucene versions must match. This can mean, for example, that upgrading Carrot2 is not possible because Carrot2 has switched to a newer version of Lucene while the embedding application has not.
- Development of the Carrot2 plugin in Lucene/Solr trunk/dev is blocked: the Lucene API can change freely there, but the Carrot2 JARs require a specific version of Lucene.
Given that the cons probably outweigh the pros, it's best to remove the dependency of carrot2-core on Lucene by defining a Carrot2-specific Tokenizer interface and using it as a replacement for the Lucene one.
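To make the proposal more concrete, here is a minimal sketch of what a Lucene-free tokenizer contract might look like. All names in it (SimpleTokenizer, LetterDigitTokenizer, TokenizerSketch) are hypothetical and chosen for illustration only; this is not the interface that ended up in carrot2-core, and the splitting rule is a toy assumption rather than Carrot2's actual tokenization logic.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Small driver showing how client code would depend only on the interface. */
final class TokenizerSketch {
    static List<String> tokenize(SimpleTokenizer tokenizer, String text) throws IOException {
        tokenizer.reset(new StringReader(text));
        List<String> tokens = new ArrayList<>();
        for (String t = tokenizer.nextToken(); t != null; t = tokenizer.nextToken()) {
            tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokenize(new LetterDigitTokenizer(), "Carrot2 clusters search-results."));
    }
}

/** Hypothetical tokenizer contract with no Lucene types in its signature. */
interface SimpleTokenizer {
    /** Starts tokenizing a new input. */
    void reset(Reader input) throws IOException;

    /** Returns the next token, or null when the input is exhausted. */
    String nextToken() throws IOException;
}

/** Toy implementation (an assumption): splits on anything that is not a letter or digit. */
final class LetterDigitTokenizer implements SimpleTokenizer {
    private Reader input;

    @Override
    public void reset(Reader input) {
        this.input = input;
    }

    @Override
    public String nextToken() throws IOException {
        StringBuilder token = new StringBuilder();
        int c;
        while ((c = input.read()) != -1) {
            if (Character.isLetterOrDigit(c)) {
                token.append((char) c);
            } else if (token.length() > 0) {
                break; // a delimiter ends the current token
            }
            // otherwise: skip delimiters that precede the token
        }
        return token.length() > 0 ? token.toString() : null;
    }
}
```

With a contract of this shape, an application that does want to reuse a Lucene tokenizer could wrap it in an adapter living outside carrot2-core, so the version coupling stays in the embedding application rather than in the core JAR.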
There are two points in the code that use the Lucene API: one is ExtendedLanguageModelFactory, the other is SnowballStemmerFactory. We use Lucene's repackaged Snowball because it has performance improvements over the regular distribution.