  1. Carrot2
  2. CARROT-723

Java and C# API examples perform clustering without stemming by default

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.4.1
    • Component/s: C# API, Framework Core
    • Labels: None

      Description

      Problem

      As of version 3.4.0, BaseLanguageModelFactory is the default language model factory. It produces language models that contain identity stemmers, that is, stemmers that return each word unchanged, which may lower the quality of clustering. The Java and C# API examples, such as ClusteringDocumentList, do not override the language model factory and therefore may produce lower quality clusters.
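
      To illustrate why an identity stemmer can lower clustering quality, the sketch below groups word variants by their stem. It is a toy and not Carrot2 code: the class IdentityStemmerDemo, the TOY_STEMMER suffix stripper and the groupByStem helper are hypothetical names introduced here, and the suffix rules are far cruder than a real stemming algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

public class IdentityStemmerDemo {
    // Identity stemmer: returns every word unchanged.
    public static final UnaryOperator<String> IDENTITY = word -> word;

    // Toy suffix stripper (an assumption for illustration, not a real stemmer).
    public static final UnaryOperator<String> TOY_STEMMER = word -> {
        if (word.endsWith("ing") && word.length() > 5) {
            return word.substring(0, word.length() - 3);
        }
        if (word.endsWith("s") && word.length() > 3) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    };

    // Groups words by the stem the given stemmer assigns, in first-seen order.
    public static Map<String, List<String>> groupByStem(
            List<String> words, UnaryOperator<String> stemmer) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String word : words) {
            groups.computeIfAbsent(stemmer.apply(word), k -> new ArrayList<>()).add(word);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cluster", "clusters", "clustering");
        // Identity stemming keeps the three variants as three separate terms.
        System.out.println(groupByStem(words, IDENTITY));
        // The toy stemmer maps all three variants to the single stem "cluster".
        System.out.println(groupByStem(words, TOY_STEMMER));
    }
}
```

      With identity stemming, each inflected form counts as a distinct term, so documents that use different forms of the same word share fewer terms than they would after stemming.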

      The following Java API example classes are affected:

      • ClusteringDataFromDocumentSources
      • ClusteringDataFromLucene
      • ClusteringDataFromLuceneWithCustomFields
      • ClusteringDataFromPubMed
      • ClusteringDocumentList
      • MoreConfigurationsOfOneAlgorithmInCachingController
      • UsingCachingController
      • UsingComponentSuites

      The following C# API example classes are affected:

      • ClusteringDocumentListOpenSource
      • ClusteringXmlFilesShared

      Similarly, any other code that uses the Carrot2 Java or C# API without the workaround shown below may produce clusters of lower quality.

      Other applications, including the Carrot2 Document Clustering Workbench, Carrot2 Document Clustering Server, Carrot2 Web Application, Carrot2 Command Line interface and the Solr clustering plugin, are not affected by this issue.

      Workaround

      The workaround is to set the language model factory to DefaultLanguageModelFactory, preferably during the initialization of the controller.

      Java API

      Controller controller = ... ;
      // Override the default language model factory when initializing the controller.
      controller.init(ImmutableMap.of("PreprocessingPipeline.languageModelFactory",
          (Object) new DefaultLanguageModelFactory()));
      

      C# API

      using (var controller = ControllerFactory.CreatePooling())
      {
          var initAttributes = new Dictionary<string, object>();
          initAttributes["PreprocessingPipeline.languageModelFactory"] = new org.carrot2.text.linguistic.DefaultLanguageModelFactory();
          controller.Init(initAttributes);
      
          ...
      }
      

      Component suites and XML attribute configurations

      Alternatively, if you use component suites and XML attribute configurations, add the following declaration to the relevant value-set tag:

      <attribute key="PreprocessingPipeline.languageModelFactory">
        <value type="java.lang.Class" value="org.carrot2.text.linguistic.DefaultLanguageModelFactory"/>
      </attribute>
      

            People

            • Assignee: stachoo Stanisław Osiński
            • Reporter: stachoo Stanisław Osiński