Package edu.berkeley.nlp.lm.io
Class KneserNeyLmReaderCallback<W>
java.lang.Object
edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback<W>
- Type Parameters:
W
-
- All Implemented Interfaces:
ArrayEncodedNgramLanguageModel<W>, LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>>, LmReaderCallback<LongRef>, NgramOrderedLmReaderCallback<LongRef>, NgramLanguageModel<W>, Serializable
public class KneserNeyLmReaderCallback<W>
extends Object
implements NgramOrderedLmReaderCallback<LongRef>, LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>>, ArrayEncodedNgramLanguageModel<W>, Serializable
Class for producing a Kneser-Ney language model in ARPA format from raw text.
Confusingly, this class is both an LmReaderCallback (called from TextReader, which reads plain text) and an LmReader, which "reads" counts, produces Kneser-Ney probabilities and backoffs, and passes them on to an ArpaLmReaderCallback.
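The dual role described above can be illustrated with a toy sketch. Everything here is hypothetical: CountSink and ProbSink are stand-ins for the callback and reader interfaces, not the library's types, and the "probabilities" are plain relative frequencies rather than Kneser-Ney estimates. The point is only the shape: one object first receives counts through a callback method, then plays the reader and pushes derived values to a second callback.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a count-receiving callback (the LmReaderCallback role). */
interface CountSink {
    void call(String ngram, long count);
}

/** Hypothetical stand-in for a probability-receiving callback (the ArpaLmReaderCallback role). */
interface ProbSink {
    void call(String ngram, double logProb);
}

/** Toy class that first collects counts, then "reads" them back out as log10 probabilities. */
final class DualRoleSketch implements CountSink {
    private final Map<String, Long> counts = new LinkedHashMap<>();
    private long total = 0;

    /** Callback role: accumulate the count reported for one n-gram. */
    @Override
    public void call(String ngram, long count) {
        counts.merge(ngram, count, Long::sum);
        total += count;
    }

    /** Reader role: emit a relative-frequency estimate (NOT Kneser-Ney) for each stored n-gram. */
    void parse(ProbSink sink) {
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            sink.call(e.getKey(), Math.log10((double) e.getValue() / total));
        }
    }

    public static void main(String[] args) {
        DualRoleSketch sketch = new DualRoleSketch();
        sketch.call("the", 9);
        sketch.call("cat", 1);
        sketch.parse((ngram, logProb) -> System.out.println(logProb + "\t" + ngram));
    }
}
```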
- Author:
- adampauls
-
Nested Class Summary
Nested classes/interfaces inherited from interface edu.berkeley.nlp.lm.ArrayEncodedNgramLanguageModel
ArrayEncodedNgramLanguageModel.DefaultImplementations
Nested classes/interfaces inherited from interface edu.berkeley.nlp.lm.NgramLanguageModel
NgramLanguageModel.StaticMethods
-
Field Summary
Fields
Modifier and Type    Field    Description
protected static final float    DEFAULT_DISCOUNT
protected final int    lmOrder
protected final HashNgramMap<KneserNeyCountValueContainer.KneserNeyCounts>    ngrams
protected final ConfigOptions    opts
protected static final long    serialVersionUID
protected final int    startIndex
protected final WordIndexer<W>    wordIndexer    This array represents the discount used for each ngram order.
-
Constructor Summary
Constructors
Constructor    Description
KneserNeyLmReaderCallback(WordIndexer<W> wordIndexer, int maxOrder)
KneserNeyLmReaderCallback(WordIndexer<W> wordIndexer, int maxOrder, ConfigOptions opts)
-
Method Summary
Modifier and Type    Method    Description
void    addNgram(int[] ngram, int startPos, int endPos, LongRef value, String words, boolean justLastWord, long[][] scratch)
void    call    Called for each n-gram
void    call
void    callJustLast(W[] ngram, LongRef value, long[][] scratch)
void    cleanup()    Called once all reading is done.
static double[]    defaultDiscounts()
static double[]    defaultMinCounts()
protected float    getDiscountForOrder(int ngramOrder)
protected float    getHighestOrderProb(int[] ngram, int startPos, int endPos)
int    getLmOrder()    Maximum size of n-grams stored by the model.
float    getLogProb(int[] ngram)    Equivalent to getLogProb(ngram, 0, ngram.length)
float    getLogProb(int[] ngram, int startPos, int endPos)    Calculate language model score of an n-gram.
float    getLogProb(List<W> ngram)    Scores an n-gram.
protected float    getLowerOrderBackoff(int[] ngram, int startPos, int endPos)
protected float    getLowerOrderProb(int[] ngram, int startPos, int endPos)
long    getTotalSize()
WordIndexer<W>    getWordIndexer()    Each LM must have a WordIndexer which assigns integer IDs to each word W in the language.
void    handleNgramOrderFinished(int order)    Called when all n-grams of a given order are finished
void    handleNgramOrderStarted(int order)    Called when n-grams of a given order are started
protected float    interpolateProb(int[] ngram, int startPos, int endPos)
void    parse(ArpaLmReaderCallback<ProbBackoffPair> callback)
float    scoreSentence(List<W> sentence)    Scores a complete sentence, taking appropriate care with the start- and end-of-sentence symbols.
void    setOovWordLogProb(float logProb)    Sets the (log) probability for an OOV word.
-
Field Details
-
serialVersionUID
protected static final long serialVersionUID
-
DEFAULT_DISCOUNT
protected static final float DEFAULT_DISCOUNT
-
lmOrder
protected final int lmOrder
-
wordIndexer
This array represents the discount used for each ngram order. The original Kneser-Ney discounting (-ukndiscount) uses one discounting constant for each N-gram order. These constants are estimated as D = n1 / (n1 + 2*n2), where n1 and n2 are the total number of N-grams with exactly one and exactly two counts, respectively. For simplicity, our code uses a constant discount of 0.75 for each order. However, other discounts can be specified.
-
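As a worked illustration of the estimate D = n1 / (n1 + 2*n2) quoted above, the following minimal sketch computes the discount for one n-gram order from its count-of-counts. The class and method names are hypothetical and are not part of this package; falling back to 0.75 when no n-grams have count one or two is an assumption made for the sketch, chosen to mirror the default constant mentioned above.

```java
/** Sketch of the per-order discount estimate D = n1 / (n1 + 2*n2). Names are hypothetical. */
final class DiscountEstimateSketch {

    /** Fallback used when the estimate is undefined; mirrors the 0.75 default (an assumption). */
    static final double FALLBACK_DISCOUNT = 0.75;

    /**
     * @param n1 number of n-grams of this order seen exactly once
     * @param n2 number of n-grams of this order seen exactly twice
     */
    static double estimateDiscount(long n1, long n2) {
        if (n1 < 0 || n2 < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        long denominator = n1 + 2 * n2;
        if (denominator == 0) {
            return FALLBACK_DISCOUNT; // no singletons or doubletons: estimate is undefined
        }
        return (double) n1 / denominator;
    }

    public static void main(String[] args) {
        // 600 singletons and 200 doubletons: 600 / (600 + 2*200) = 0.6
        System.out.println(estimateDiscount(600, 200));
    }
}
```

With many singletons relative to doubletons the estimate approaches 1; with few it approaches 0, which is one reason a fixed constant is a simpler choice.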
ngrams
-
opts
-
startIndex
protected final int startIndex
-
-
Constructor Details
-
KneserNeyLmReaderCallback
- Parameters:
wordIndexer -
maxOrder -
inputIsSentences - If true, input n-grams are assumed to be sentences, and all sub-ngrams of up to order maxOrder are added. If false, input n-grams are assumed to be atomic.
-
KneserNeyLmReaderCallback
-
-
Method Details
-
call
-
callJustLast
-
call
Description copied from interface: LmReaderCallback
Called for each n-gram
- Specified by:
call in interface LmReaderCallback<W>
- Parameters:
ngram - The integer representation of the words as given by the provided WordIndexer
value - The value of the n-gram
words - The string representation of the n-gram (space separated)
-
addNgram
public void addNgram(int[] ngram, int startPos, int endPos, LongRef value, String words, boolean justLastWord, long[][] scratch)
- Parameters:
ngram -
startPos -
endPos -
value -
words -
-
interpolateProb
protected float interpolateProb(int[] ngram, int startPos, int endPos)
-
getHighestOrderProb
protected float getHighestOrderProb(int[] ngram, int startPos, int endPos)
-
getLowerOrderProb
protected float getLowerOrderProb(int[] ngram, int startPos, int endPos)
-
getLowerOrderBackoff
protected float getLowerOrderBackoff(int[] ngram, int startPos, int endPos)
-
getDiscountForOrder
protected float getDiscountForOrder(int ngramOrder)
-
cleanup
public void cleanup()
Description copied from interface: LmReaderCallback
Called once all reading is done.
- Specified by:
cleanup in interface LmReaderCallback<W>
-
defaultDiscounts
public static double[] defaultDiscounts()
-
defaultMinCounts
public static double[] defaultMinCounts()
-
parse
- Specified by:
parse in interface LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>>
-
getWordIndexer
Description copied from interface: NgramLanguageModel
Each LM must have a WordIndexer which assigns integer IDs to each word W in the language.
- Specified by:
getWordIndexer in interface NgramLanguageModel<W>
- Returns:
-
handleNgramOrderFinished
public void handleNgramOrderFinished(int order)
Description copied from interface: NgramOrderedLmReaderCallback
Called when all n-grams of a given order are finished
- Specified by:
handleNgramOrderFinished in interface NgramOrderedLmReaderCallback<W>
- Parameters:
order -
-
handleNgramOrderStarted
public void handleNgramOrderStarted(int order)
Description copied from interface: NgramOrderedLmReaderCallback
Called when n-grams of a given order are started
- Specified by:
handleNgramOrderStarted in interface NgramOrderedLmReaderCallback<W>
- Parameters:
order -
-
getLmOrder
public int getLmOrder()
Description copied from interface: NgramLanguageModel
Maximum size of n-grams stored by the model.
- Specified by:
getLmOrder in interface NgramLanguageModel<W>
- Returns:
-
scoreSentence
Description copied from interface: NgramLanguageModel
Scores a complete sentence, taking appropriate care with the start- and end-of-sentence symbols. This is a convenience method and will generally be inefficient.
- Specified by:
scoreSentence in interface NgramLanguageModel<W>
- Returns:
-
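One way to read "taking appropriate care with the start- and end-of-sentence symbols" is sketched below. This is an illustration, not this class's implementation: the boundary strings "<s>" and "</s>", the class name, and the use of a ToDoubleFunction as the n-gram scorer are all assumptions. The sketch pads the sentence with both symbols, never scores the start symbol on its own, and sums the log probability of each subsequent word given up to lmOrder - 1 words of context.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Sketch of sentence scoring with boundary symbols. All names here are hypothetical. */
final class SentenceScoringSketch {
    static final String START = "<s>";  // assumed start-of-sentence symbol
    static final String END = "</s>";   // assumed end-of-sentence symbol

    /**
     * Sums n-gram log probabilities over a padded sentence.
     *
     * @param ngramLogProb scorer for one n-gram whose last word is the word being predicted
     */
    static double scoreSentence(List<String> sentence, int lmOrder,
                                ToDoubleFunction<List<String>> ngramLogProb) {
        if (lmOrder < 1) {
            throw new IllegalArgumentException("lmOrder must be at least 1");
        }
        List<String> padded = new ArrayList<>();
        padded.add(START);
        padded.addAll(sentence);
        padded.add(END);

        double total = 0.0;
        // 'end' is exclusive. Starting at 2 means the start symbol is only ever used
        // as context and is never itself the predicted word.
        for (int end = 2; end <= padded.size(); end++) {
            int start = Math.max(0, end - lmOrder);
            total += ngramLogProb.applyAsDouble(padded.subList(start, end));
        }
        return total;
    }

    public static void main(String[] args) {
        // Dummy scorer: every n-gram gets log probability -1, so the total counts scored n-grams.
        double score = scoreSentence(Arrays.asList("a", "b", "c"), 3, ngram -> -1.0);
        System.out.println(score);
    }
}
```

Building a new list and sublists for every sentence is part of why a convenience method like this is slower than scoring integer arrays directly.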
getLogProb
Description copied from interface: NgramLanguageModel
Scores an n-gram. This is a convenience method and will generally be relatively inefficient. More efficient versions are available in ArrayEncodedNgramLanguageModel.getLogProb(int[], int, int) and ContextEncodedNgramLanguageModel.getLogProb(long, int, int, edu.berkeley.nlp.lm.ContextEncodedNgramLanguageModel.LmContextInfo).
- Specified by:
getLogProb in interface NgramLanguageModel<W>
-
getLogProb
public float getLogProb(int[] ngram, int startPos, int endPos)
Description copied from interface: ArrayEncodedNgramLanguageModel
Calculate language model score of an n-gram. Warning: if you pass in an n-gram of length greater than getLmOrder(), this call will silently ignore the extra words of context. In other words, if you pass in a 5-gram (endPos-startPos == 5) to a 3-gram model, it will only score the words from startPos + 2 to endPos.
- Specified by:
getLogProb in interface ArrayEncodedNgramLanguageModel<W>
- Parameters:
ngram - array of words in integer representation
startPos - start of the portion of the array to be read
endPos - end of the portion of the array to be read.
- Returns:
-
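The truncation warning above comes down to simple index arithmetic, sketched here. The helper name is hypothetical and this is not the library's code; it assumes endPos is exclusive, which is consistent with the statement that a 5-gram has endPos-startPos == 5.

```java
/** Sketch of which part of [startPos, endPos) is scored when the n-gram is too long. */
final class NgramWindowSketch {

    /** Returns the first index that is still used once extra context has been dropped. */
    static int effectiveStart(int startPos, int endPos, int lmOrder) {
        if (lmOrder < 1 || startPos < 0 || endPos < startPos) {
            throw new IllegalArgumentException("invalid n-gram window");
        }
        // Keep at most lmOrder words, counted back from endPos.
        return Math.max(startPos, endPos - lmOrder);
    }

    public static void main(String[] args) {
        // A 5-gram passed to a 3-gram model: only positions startPos + 2 up to endPos are scored.
        System.out.println(effectiveStart(0, 5, 3));
    }
}
```

Callers who need an error rather than silent truncation can compare endPos - startPos against getLmOrder() before calling.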
getLogProb
public float getLogProb(int[] ngram)
Description copied from interface: ArrayEncodedNgramLanguageModel
Equivalent to getLogProb(ngram, 0, ngram.length)
- Specified by:
getLogProb in interface ArrayEncodedNgramLanguageModel<W>
- See Also:
-
getTotalSize
public long getTotalSize()
-
setOovWordLogProb
public void setOovWordLogProb(float logProb)
Description copied from interface: NgramLanguageModel
Sets the (log) probability for an OOV word. Note that this is in general different from the log probability of the unk tag.
- Specified by:
setOovWordLogProb in interface NgramLanguageModel<W>
-