Top K Words Algorithm

The Top K Words Algorithm is a widely used approach in natural language processing and text mining for identifying the most frequently occurring words or terms in a document or a collection of documents. It is particularly useful in applications such as keyword extraction, document summarization, and text indexing. By selecting the top K words (where K is a user-defined integer), the algorithm lets users focus on the most frequent terms, which often indicate the main topics or themes of the text.

To implement the algorithm, the text is first preprocessed, typically by tokenization, lowercasing, and stopword removal. Tokenization splits the text into individual words or tokens, while stopword removal filters out common words such as "a", "an", "and", and "the", which carry little semantic meaning on their own. The frequency of each remaining unique word is then counted, the words are sorted in descending order of frequency, and the top K words are selected. The resulting list can be used for purposes such as building word clouds, informing search engine optimization strategies, or serving as input features for machine learning models.
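The steps described above (tokenization, lowercasing, stopword removal, counting, sorting, and selecting K entries) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the class name `TopKSketch`, the method `topK`, and the four-word stopword set are hypothetical and chosen only for the example, and `Set.of` assumes Java 9 or later. Ties are broken alphabetically, which is one possible choice among several.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class TopKSketch {
    // Hypothetical helper for illustration: returns up to k (word, count) entries,
    // most frequent first.
    public static List<Map.Entry<String, Integer>> topK(String text, int k, Set<String> stopwords) {
        Map<String, Integer> counts = new HashMap<>();

        // Lowercase the text, then tokenize by splitting on runs of non-letters.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            // Skip the empty token that split() can produce at the start, and stopwords.
            if (token.isEmpty() || stopwords.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);  // frequency count
        }

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Sort by descending frequency; break ties alphabetically (an assumption)
        // so that the result does not depend on HashMap iteration order.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        // Clamp k to the valid range and copy the first k entries.
        int limit = Math.min(Math.max(k, 0), entries.size());
        return new ArrayList<>(entries.subList(0, limit));
    }

    public static void main(String[] args) {
        Set<String> stopwords = Set.of("a", "an", "and", "the");  // tiny illustrative set
        String text = "The cat and the dog. The cat!";
        for (Map.Entry<String, Integer> e : topK(text, 2, stopwords)) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
        // prints:
        // cat: 2
        // dog: 1
    }
}
```

A real stopword list would be much longer than the four words used here; the set is passed in as a parameter so that it can be swapped without changing the counting logic.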
package Others;

import java.io.*;
import java.util.*;

/* Display the K most frequent words in the file and the number of times each
    appears, in descending order of frequency (ignoring case and punctuation) */

public class TopKWords {
    static class CountWords {
        private String fileName;

        public CountWords(String fileName) {
            this.fileName = fileName;
        }

        public Map<String, Integer> getDictionary() {
            Map<String, Integer> dictionary = new HashMap<>();

            // try-with-resources closes the reader even if reading fails
            try (Reader reader = new BufferedReader(new FileReader(fileName))) {
                StringBuilder word = new StringBuilder();  // letters of the current word
                int in;

                while ((in = reader.read()) != -1) {  // read one character at a time
                    if (Character.isLetter(in)) {
                        // lowercase each letter so that "The" and "the" count as one word
                        word.append((char) Character.toLowerCase(in));
                    } else if (word.length() > 0) {
                        // a non-letter ends the current word: add 1 to its count
                        dictionary.merge(word.toString(), 1, Integer::sum);
                        word.setLength(0);  // start a new empty word
                    }
                }
                // count the last word when the file does not end with a separator
                if (word.length() > 0) {
                    dictionary.merge(word.toString(), 1, Integer::sum);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
            // return whatever was counted (possibly nothing) rather than null
            return dictionary;
        }
    }

    public static void main(String[] args) {
        // replace this path with the path to your own file
        CountWords cw = new CountWords("/Users/lisanaaa/Desktop/words.txt");
        Map<String, Integer> dictionary = cw.getDictionary();  // get the words dictionary: {word: frequency}
        if (dictionary == null) {
            System.out.println("Could not read the file");
            return;
        }

        // copy the map entries into a list so that they can be sorted
        List<Map.Entry<String, Integer>> list = new ArrayList<>(dictionary.entrySet());

        // sort by frequency, most frequent first
        list.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        Scanner input = new Scanner(System.in);
        int k = input.nextInt();
        // reuse the same Scanner and reject values outside 0..list.size()
        while (k < 0 || k > list.size()) {
            System.out.println("Retype a number between 0 and " + list.size());
            k = input.nextInt();
        }
        for (int i = 0; i < k; i++) {
            System.out.println(list.get(i));
        }
        input.close();
    }
}
