This loop is on the dark side of the force:
for (int documentIndex = 0; documentIndex < documentCount; documentIndex++)
{
    if (tfByDocumentIndex * 2 < tfByDocument.length
        && tfByDocument[tfByDocumentIndex * 2] == documentIndex)
    {
        double weight = termWeighting.calculateTermWeight(
            tfByDocument[tfByDocumentIndex * 2 + 1], df, documentCount);
        weight *= getWeightBoost(titleFieldIndex, fieldIndices);
        tdMatrix.set(i, documentIndex, weight);
    }
}
Originally reported by Taojian Lu.
The issue seems to be relevant only when TermDocumentMatrixBuilder is used separately from the Carrot2 preprocessing pipeline. Otherwise, the document indices in tfByDocument should be increasing (assuming that IndirectSorter.mergesort() is stable) and the bug should not affect the results.
Aside from that, the loop code didn't make any sense; it has been fixed in master.
In fact, the document indices will be increasing only if every stem corresponds to exactly one original word. If there are multiple original words per stem, the indices will not be increasing and the term-document matrix will therefore be incorrect.
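To make the failure mode concrete, here is a minimal, self-contained sketch. It is not the Carrot2 code and not necessarily the fix that went into master. It assumes tfByDocument holds (documentIndex, termFrequency) pairs, as the indexing in the loop above suggests, uses the raw term frequency in place of the real term weight, and assumes the pair pointer advances after each match (the increment is not visible in the quoted snippet). The class and method names are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration only; names and simplifications are assumptions.
final class TfByDocumentSketch {

    /**
     * Single forward pass in the style of the reported loop. It only works
     * when the document indices stored in tfByDocument are increasing.
     */
    static double[] singlePass(int[] tfByDocument, int documentCount) {
        double[] row = new double[documentCount];
        int pair = 0; // plays the role of tfByDocumentIndex
        for (int documentIndex = 0; documentIndex < documentCount; documentIndex++) {
            if (pair * 2 < tfByDocument.length
                && tfByDocument[pair * 2] == documentIndex) {
                row[documentIndex] = tfByDocument[pair * 2 + 1];
                pair++; // assumed: move on to the next pair after a match
            }
        }
        return row;
    }

    /**
     * Order-independent variant: visit every pair once and accumulate the
     * frequency under the document index stored in the pair.
     */
    static double[] accumulate(int[] tfByDocument, int documentCount) {
        double[] row = new double[documentCount];
        for (int j = 0; j + 1 < tfByDocument.length; j += 2) {
            row[tfByDocument[j]] += tfByDocument[j + 1];
        }
        return row;
    }

    public static void main(String[] args) {
        // Pairs (doc 0, tf 2), (doc 2, tf 1), (doc 1, tf 3): indices not increasing.
        int[] tfByDocument = {0, 2, 2, 1, 1, 3};
        System.out.println(Arrays.toString(singlePass(tfByDocument, 3)));  // [2.0, 0.0, 1.0]
        System.out.println(Arrays.toString(accumulate(tfByDocument, 3))); // [2.0, 3.0, 1.0]
    }
}
```

With out-of-order indices the single pass silently drops the entry for document 1, while iterating over the pairs themselves does not depend on their order.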
The bug fix has been pushed to master on GitHub.
This patch changed the output in some cases, which is why the Solr integration tests now fail.