Carrot2

Java and C# API examples perform clustering without stemming by default

Details

  • Type: Bug
  • Status: Resolved
  • Priority: Major
  • Resolution: Fixed
  • Affects Version/s: 3.4.0
  • Fix Version/s: 3.4.1
  • Component/s: C# API, Framework Core
  • Labels:
    None

Description

Problem

As of version 3.4.0, BaseLanguageModelFactory is the default language model factory. It produces language models containing identity stemmers (stemmers that return each word unchanged), which may lower the quality of clustering. The Java and C# API examples, such as ClusteringDocumentList, do not override the language model factory and may therefore produce lower-quality clusters.
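To illustrate why an identity stemmer can hurt clustering, here is a minimal, self-contained sketch. It does not use Carrot2 classes: the class name StemmingEffectSketch and the toy suffix stripper are assumptions made only for this illustration, and the stripper is far cruder than any real stemmer. It only shows that, without stemming, inflected forms of one word are counted as unrelated terms.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

public class StemmingEffectSketch {
    /** Identity stemmer: returns every word unchanged. */
    public static final UnaryOperator<String> IDENTITY = word -> word;

    /** Toy suffix stripper, for illustration only (NOT Carrot2's stemmer). */
    public static final UnaryOperator<String> TOY_STEMMER = word -> {
        for (String suffix : new String[] {"ing", "s"}) {
            // Strip the suffix only if at least three characters remain.
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    };

    /** Counts how often each stemmed (lower-cased) form occurs in the input. */
    public static Map<String, Integer> countStems(List<String> words,
                                                  UnaryOperator<String> stemmer) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(stemmer.apply(word.toLowerCase(Locale.ROOT)), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cluster", "clusters", "clustering");
        // Identity stemmer: three separate terms, each seen once.
        System.out.println(countStems(words, IDENTITY));     // {cluster=1, clustering=1, clusters=1}
        // Toy stemmer: all three forms collapse into one term seen three times.
        System.out.println(countStems(words, TOY_STEMMER));  // {cluster=3}
    }
}
```

With the identity stemmer, an algorithm that relies on term frequencies sees three weak terms instead of one strong one, which is the kind of effect that can lower cluster quality.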

The following Java API example classes are affected:

  • ClusteringDataFromDocumentSources
  • ClusteringDataFromLucene
  • ClusteringDataFromLuceneWithCustomFields
  • ClusteringDataFromPubMed
  • ClusteringDocumentList
  • MoreConfigurationsOfOneAlgorithmInCachingController
  • UsingCachingController
  • UsingComponentSuites

The following C# API example classes are affected:

  • ClusteringDocumentListOpenSource
  • ClusteringXmlFilesShared

Similarly, any other code that uses the Carrot2 Java or C# API without the workaround shown below may produce clusters of lower quality.

Other applications, including the Carrot2 Document Clustering Workbench, Carrot2 Document Clustering Server, Carrot2 Web Application, Carrot2 Command Line Interface and the Solr clustering plugin, are not affected by this issue.

Workaround

The workaround is to set the language model factory to DefaultLanguageModelFactory, preferably during the initialization of the controller.

Java API

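// Requires com.google.common.collect.ImmutableMap (Guava) and
// org.carrot2.text.linguistic.DefaultLanguageModelFactory.
Controller controller = ... ;
controller.init(ImmutableMap.of("PreprocessingPipeline.languageModelFactory",
    (Object) new DefaultLanguageModelFactory()));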

C# API

using (var controller = ControllerFactory.CreatePooling())
{
    var initAttributes = new Dictionary<string, object>();
    initAttributes["PreprocessingPipeline.languageModelFactory"] = new org.carrot2.text.linguistic.DefaultLanguageModelFactory();
    controller.Init(initAttributes);

    ...
}

Component suites and XML attribute configurations

Alternatively, if you use component suites and XML attribute configurations, add the following declaration to the relevant value-set tag:

<attribute key="PreprocessingPipeline.languageModelFactory">
  <value type="java.lang.Class" value="org.carrot2.text.linguistic.DefaultLanguageModelFactory"/>
</attribute>

Activity

Dawid Weiss added a comment -

I think the fix should be to make DefaultLanguageModelFactory the default factory again. If Lucene classes are not available, it should simply log warnings and fall back to the identity stemmer and default tokenizer. I'll try to put together a patch for this, please review.
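The fallback proposed in this comment can be sketched roughly as follows. This is not the actual patch: the class StemmerFallbackSketch, the method stemmerOrIdentity and the class name "hypothetical.missing.Stemmer" are assumptions made for illustration, and stemmers are modelled as plain string functions rather than Carrot2 types.

```java
import java.util.function.Supplier;
import java.util.function.UnaryOperator;
import java.util.logging.Logger;

public class StemmerFallbackSketch {
    private static final Logger LOG =
        Logger.getLogger(StemmerFallbackSketch.class.getName());

    /** Identity stemmer used when the preferred implementation is unavailable. */
    public static final UnaryOperator<String> IDENTITY = word -> word;

    /**
     * Returns the preferred stemmer if the class it depends on can be found
     * on the classpath; otherwise logs a warning and returns the identity stemmer.
     */
    public static UnaryOperator<String> stemmerOrIdentity(
            String requiredClassName, Supplier<UnaryOperator<String>> preferred) {
        try {
            // Probe for the dependency without initializing it.
            Class.forName(requiredClassName, false,
                StemmerFallbackSketch.class.getClassLoader());
            return preferred.get();
        } catch (ClassNotFoundException | LinkageError e) {
            LOG.warning("Class " + requiredClassName
                + " not available, falling back to identity stemmer.");
            return IDENTITY;
        }
    }

    public static void main(String[] args) {
        // Dependency present (java.lang.String always is): preferred stemmer is used.
        UnaryOperator<String> a =
            stemmerOrIdentity("java.lang.String", () -> w -> w.toUpperCase());
        System.out.println(a.apply("clusters")); // CLUSTERS

        // Dependency missing (hypothetical class name): identity stemmer is used.
        UnaryOperator<String> b =
            stemmerOrIdentity("hypothetical.missing.Stemmer", () -> w -> w.toUpperCase());
        System.out.println(b.apply("clusters")); // clusters
    }
}
```

The point of the design is that a missing optional dependency degrades clustering quality with a visible warning instead of failing at start-up.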

Stanisław Osiński added a comment -

Everything looks good. There is no need to update the clustering component in Solr: clustering algorithms built against Carrot2 3.4.1 work with the Solr clustering component based on Carrot2 3.4.0 (which also contains DefaultLanguageModelFactory with the same API and behaviour).

