Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 3.4.4, 3.5.0, 3.5.0.1, 3.5.1, 3.5.2, 3.5.3
Fix Version/s: 3.6.0
Component/s: Clustering Algorithms
Labels: None
Description
This loop is on the dark side of the force:
    for (int documentIndex = 0; documentIndex < documentCount; documentIndex++)
    {
        if (tfByDocumentIndex * 2 < tfByDocument.length
            && tfByDocument[tfByDocumentIndex * 2] == documentIndex)
        {
            double weight = termWeighting.calculateTermWeight(
                tfByDocument[tfByDocumentIndex * 2 + 1], df, documentCount);
            weight *= getWeightBoost(titleFieldIndex, fieldIndices);
            tfByDocumentIndex++;
            tdMatrix.set(i, documentIndex, weight);
        }
    }
Originally reported by Taojian Lu.
The issue seems to be relevant only when TermDocumentMatrixBuilder is used separately from the Carrot2 preprocessing pipeline. Within the pipeline, the document indices stored in tfByDocument should be increasing (assuming that IndirectSorter.mergesort() is stable), so the bug should not affect the results.
Aside from that, the loop code didn't make any sense; it's fixed in master.
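
To illustrate why the ordering matters, here is a minimal sketch, not the actual fix committed to master. It assumes tfByDocument is a flat int array of (documentIndex, termFrequency) pairs, as the loop above suggests. The class and method names are hypothetical, and the raw term frequency stands in for the calculateTermWeight() and getWeightBoost() calls to keep the example self-contained. The first method mirrors the reported loop; the second walks the pairs directly, so it does not depend on the document indices being sorted.

```java
final class TfLoopSketch {
    /**
     * Mirrors the reported loop: walks over documents and consumes the next
     * (documentIndex, tf) pair only when its document index matches.
     * Pairs whose document index is lower than an earlier pair's are never reached.
     */
    static double[] scanByDocument(int[] tfByDocument, int documentCount) {
        double[] row = new double[documentCount];
        int tfByDocumentIndex = 0;
        for (int documentIndex = 0; documentIndex < documentCount; documentIndex++) {
            if (tfByDocumentIndex * 2 < tfByDocument.length
                && tfByDocument[tfByDocumentIndex * 2] == documentIndex) {
                // Assumption: raw tf stands in for the weighted value.
                row[documentIndex] = tfByDocument[tfByDocumentIndex * 2 + 1];
                tfByDocumentIndex++;
            }
        }
        return row;
    }

    /**
     * Hypothetical alternative: iterates over the pairs themselves, so every
     * pair is written to its own column regardless of the order of the pairs.
     */
    static double[] scanByPair(int[] tfByDocument, int documentCount) {
        double[] row = new double[documentCount];
        for (int j = 0; j + 1 < tfByDocument.length; j += 2) {
            int documentIndex = tfByDocument[j];
            row[documentIndex] = tfByDocument[j + 1];
        }
        return row;
    }
}
```

With sorted input such as {0, 3, 2, 5} and three documents, both methods fill the row identically. With the same pairs in the order {2, 5, 0, 3}, scanByDocument never reaches the entry for document 0 and leaves that cell at zero, while scanByPair still fills it.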