wordCountEngine.txt

Word Count Engine

Let document consist of of N words where M of them are unique (M = N). The solution consists of two steps: 1) parsing the string according to the criteria described in the problem and counting the number of occurrences of each word. 2) sorting the [word, occurrence] pairs by the number of words’ occurrences in a descending order.

Step 1: we tokenize document into words by using whitespaces as delimiters.

For each word, we clean it from all non-alphabetic characters (digits, punctuation etc) and convert it to lowercase to make counting case-insensitive. In this part, you should be leveraging whatever parsing capabilities your programming language of choice is providing. There is really no point of implementing functions that already exist.

As for counting, we’ll use a Map (Hash Table) to store words and their corresponding occurrences. A map is optimal in this case because it allows us find, store and update operations in O(1) time complexity.

Step 2: as for the sorting part, rather than sorting the entries in the map directly, which takes O(M·log(M)) - where M is number of unique words in document - a better solution will be to place words into an array of string arrays indexed by the occurrence number and then iterate through the array in the reverse order. This is similar to a Bucket Sort. The proposed solution trades off a bit of space for performance, which may be a reasonable trade under certain circumstances.

Pseudocode:

function wordCountEngine(document):
    wordMap = new Map()
    wordList = document.split()
    largestCount = 0;

    for i from 0 to wordList.length-1:
        # convert each token to lowercase
        word = wordList[i].toLowerCase()

        # and remove special/punctuation characters
        charArray = []
        for ch in word:
            if (ch >= 'a' and ch <= 'z'):
                charArray.push(ch)

        # form a string from the characters in charArray.
        # use your programming language's native “join”
        # or equivalent function. If there isn't any,
        # implement yourself. It's quite straightforward.
        cleanWord = join(charArray)

        # if the token consisted of only whitespace
        # characters, then cleanWord is an empty string
        # and we should ignore it and continue to the
        # next word.
        if (cleanWord.length < 1):
            continue

        # add clean word to the wordMap and
        # increase counter if needed
        count = 0
        if (cleanWord in wordMap):
            count = wordMap[cleanWord]
            count++
        else:
            count = 1

        if (count > largestCount):
            largestCount = count

        wordMap[cleanWord] = count

    # init the word counter list of lists.
    # Since, in the worst case scenario, the
    # number of lists is going to be as
    # big as the maximum occurrence count,
    # we need counterList's size to be the
    # same to be able to store these lists.
    # Creating counterList will allow us to
    # “bucket-sort” the list by word occurrences
    counterList = new Array(largestCount+1)
    for j from 0 to largestCount:
        counterList[j] = null

    # add all words to a list indexed by the
    # corresponding occurrence number.
    for word in wordMap.keys():
        counter = wordMap[word]
        wordCounterList = counterList[counter]

        if (wordCounterList == null):
            wordCounterList = []

        wordCounterList.push(word)
        counterList[counter] = wordCounterList

    # iterate through the list in reverse order
    # and add only non-null values to result
    result = []
    for l from counterList.length-1 to 0:
        wordCounterList = counterList[l]
        if (wordCounterList == null):
            continue

        stringifiedOccurrenceVal = toString(l)
        for m from 0 to wordCounterList.length-1:
            result.push([wordCounterList[m], stringifiedOccurrenceVal])

    return result
    

Time Complexity: let N be the number of words in document and M the number of unique words in it (M = N). Iterating over all words, cleaning them and inserting them into a map takes O(N). The sorting step takes O(M) since notice that in the second loop, every word gets visited only once. The total time complexity is therefore O(N + M), which is O(N).

Space Complexity: wordMap takes O(M) space and the array of strings array, counterList, takes another O(M). So, in total, the space complexity is O(M).

Note: the reason we’re analyzing the problem complexity in terms of the number of words, and not number of characters is because the average length of an english word is ~5, so from a practical perspective this could be regarded as a constant and therefore can be ignored (i.e. O(5N) = O(N))