public class TfIdf extends Object
TF(document, word) is term frequency: the number of occurrences of a given word in a given document.
TF is expected to correlate with the relevance of the word to the document.
Let DF(word) be the document frequency of a word: the number of documents a given word occurs in.
IDF(word) is the inverse document frequency of a word: log(D / DF(word)), where
D is the overall number of documents. IDF is expected to correlate with the salience of the word: a high value means the word is highly specific to the documents it occurs in. For example, words like "in" and "the" have an IDF of zero because they occur in every document.
TF-IDF(document, word) is the product
TF * IDF for a given word in a given document.
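The definitions above can be sketched in a few static methods. This is an illustrative sketch only, not the implementation behind this class: the class name TfIdfSketch, its method names, and the representation of a document as a list of already-tokenized words are all hypothetical. It also assumes the common form IDF = log(D / DF) with the natural logarithm, which is consistent with an IDF of zero for a word that occurs in every document.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the documented class.
final class TfIdfSketch {

    // TF(document, word): occurrences of each word in one tokenized document.
    static Map<String, Long> termFrequencies(List<String> words) {
        Map<String, Long> tf = new HashMap<>();
        for (String w : words) {
            tf.merge(w, 1L, Long::sum);
        }
        return tf;
    }

    // DF(word): number of documents each word occurs in.
    // The map key is a document id, the value is that document's words.
    static Map<String, Integer> documentFrequencies(Map<String, List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> words : docs.values()) {
            // De-duplicate so a word counts once per document.
            for (String w : new HashSet<>(words)) {
                df.merge(w, 1, Integer::sum);
            }
        }
        return df;
    }

    // IDF(word) = log(D / DF(word)); natural logarithm is an assumption.
    static double idf(int docCount, int df) {
        return Math.log((double) docCount / df);
    }

    // TF-IDF(document, word) = TF * IDF.
    static double tfIdf(long tf, int docCount, int df) {
        return tf * idf(docCount, df);
    }
}
```

With two documents that both contain "the", idf(2, 2) evaluates to log(1), so every TF-IDF score for "the" is zero no matter how often it occurs.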
When you enter a search phrase, the program first removes the stopwords, then looks up each remaining search term in the inverted index, which yields a set of documents for each search term. It intersects these sets, leaving only the documents that contain all the search terms. Each combination of document and search term has an associated TF-IDF score. The program sums these scores per document to get the total score of each document. Finally, it sorts the documents by score in descending order and presents them to the user as the search result.
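The search steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of this class: SearchSketch and its search method are hypothetical names, the inverted index is assumed to be an in-memory map from word to (document id, TF-IDF score) pairs, and the phrase is assumed to be tokenized by lower-casing and splitting on non-word characters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of the documented class.
final class SearchSketch {

    // index: word -> (document id -> TF-IDF score). The layout is an assumption.
    static List<Map.Entry<String, Double>> search(
            String phrase,
            Set<String> stopwords,
            Map<String, Map<String, Double>> index) {

        // 1. Tokenize the phrase and drop stopwords (duplicates collapse in the set).
        Set<String> terms = new LinkedHashSet<>();
        for (String t : phrase.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty() && !stopwords.contains(t)) {
                terms.add(t);
            }
        }
        if (terms.isEmpty()) {
            return Collections.emptyList();
        }

        // 2. Look up each term and intersect the resulting document sets.
        Set<String> hits = null;
        for (String term : terms) {
            Set<String> docs = index
                    .getOrDefault(term, Collections.<String, Double>emptyMap())
                    .keySet();
            if (hits == null) {
                hits = new HashSet<>(docs);
            } else {
                hits.retainAll(docs);
            }
        }

        // 3. Sum the per-term TF-IDF scores for every surviving document.
        Map<String, Double> totals = new HashMap<>();
        for (String doc : hits) {
            double sum = 0;
            for (String term : terms) {
                sum += index.get(term).get(doc);
            }
            totals.put(doc, sum);
        }

        // 4. Sort by total score, highest first.
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(totals.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return ranked;
    }
}
```

Because the document sets are intersected, a single search term that is missing from the index produces an empty result. A union would rank partial matches instead, but that is not what the description above specifies.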
public static void main(String[] args)
Copyright © 2020 Hazelcast, Inc. All rights reserved.