Class HistogramDiff
This implementation was derived by using the 4 rules that are outlined in Bram Cohen's blog, and then was further extended to support low-occurrence common elements.
The basic idea of the algorithm is to create a histogram of occurrences for each element of sequence A. Each element of sequence B is then considered in turn. If the element also exists in sequence A, and has a lower occurrence count, the positions are considered as a candidate for the longest common subsequence (LCS). After scanning of B is complete the LCS that has the lowest number of occurrences is chosen as a split point. The region is split around the LCS, and the algorithm is recursively applied to the sections before and after the LCS.
By always selecting a LCS position with the lowest occurrence count, this algorithm behaves exactly like Bram Cohen's patience diff whenever there is a unique common element available between the two sequences. When no unique elements exist, the lowest occurrence element is chosen instead. This offers more readable diffs than simply falling back on the standard Myers' O(ND) algorithm would produce.
To prevent the algorithm from having an O(N^2) running time, an upper limit
on the number of unique elements in a histogram bucket is configured by
setMaxChainLength(int)
. If sequence A has more than this many
elements that hash into the same hash bucket, the algorithm passes the region
to setFallbackAlgorithm(DiffAlgorithm)
. If no fallback algorithm is
configured, the region is emitted as a replace edit.
During scanning of sequence B, any element of A that occurs more than
setMaxChainLength(int)
times is never considered for an LCS match
position, even if it is common between the two sequences. This limits the
number of locations in sequence A that must be considered to find the LCS,
and helps maintain a lower running time bound.
So long as setMaxChainLength(int)
is a small constant (such as 64),
the algorithm runs in O(N * D) time, where N is the sum of the input lengths
and D is the number of edits in the resulting EditList. If the supplied
SequenceComparator
has a good hash function,
this implementation typically out-performs
MyersDiff
, even though its theoretical running
time is the same.
This implementation has an internal limitation that prevents it from handling sequences with more than 268,435,456 (2^28) elements.
-
Nested Class Summary
Nested ClassesNested classes/interfaces inherited from class org.eclipse.jgit.diff.DiffAlgorithm
DiffAlgorithm.SupportedAlgorithm
-
Field Summary
FieldsModifier and TypeFieldDescription(package private) DiffAlgorithm
Algorithm to use when there are too many element occurrences.(package private) int
Maximum number of positions to consider for a given element hash. -
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescription<S extends Sequence>
voiddiffNonCommon
(EditList edits, HashedSequenceComparator<S> cmp, HashedSequence<S> a, HashedSequence<S> b, Edit region) Compare two sequences and identify a list of edits between them.void
Set the algorithm used when there are too many element occurrences.void
setMaxChainLength
(int maxLen) Maximum number of positions to consider for a given element hash.Methods inherited from class org.eclipse.jgit.diff.LowLevelDiffAlgorithm
diffNonCommon
Methods inherited from class org.eclipse.jgit.diff.DiffAlgorithm
diff, getAlgorithm
-
Field Details
-
fallback
DiffAlgorithm fallbackAlgorithm to use when there are too many element occurrences. -
maxChainLength
int maxChainLengthMaximum number of positions to consider for a given element hash. All elements with the same hash are stored into a single chain. The chain size is capped to ensure search is linear time at O(len_A + len_B) rather than quadratic at O(len_A * len_B).
-
-
Constructor Details
-
HistogramDiff
public HistogramDiff()
-
-
Method Details
-
setFallbackAlgorithm
Set the algorithm used when there are too many element occurrences.- Parameters:
alg
- the secondary algorithm. If null the region will be denoted as a single REPLACE block.
-
setMaxChainLength
public void setMaxChainLength(int maxLen) Maximum number of positions to consider for a given element hash. All elements with the same hash are stored into a single chain. The chain size is capped to ensure search is linear time at O(len_A + len_B) rather than quadratic at O(len_A * len_B).- Parameters:
maxLen
- new maximum length.
-
diffNonCommon
public <S extends Sequence> void diffNonCommon(EditList edits, HashedSequenceComparator<S> cmp, HashedSequence<S> a, HashedSequence<S> b, Edit region) Compare two sequences and identify a list of edits between them. This method should be invoked only after the two sequences have been proven to have no common starting or ending elements. The expected elimination of common starting and ending elements is automatically performed by theDiffAlgorithm.diff(SequenceComparator, Sequence, Sequence)
method, which invokes this method usingSubsequence
s.- Specified by:
diffNonCommon
in classLowLevelDiffAlgorithm
- Parameters:
edits
- result list to append the region's edits onto.cmp
- the comparator supplying the element equivalence function.a
- the first (also known as old or pre-image) sequence. Edits returned by this algorithm will reference indexes using the 'A' side:Edit.getBeginA()
,Edit.getEndA()
.b
- the second (also known as new or post-image) sequence. Edits returned by this algorithm will reference indexes using the 'B' side:Edit.getBeginB()
,Edit.getEndB()
.region
- the region being compared within the two sequences.
-