The task was to write, in Scala:
- an indexer that scans a bunch of input files, and
- a searcher that takes a query string and returns a list of lines containing it
It can be solved using the following:
In the indexer, tokenize each document, and create an inverted index of tokens to a sets of (file name, line number) tuples. In the searcher, break up the query string into tokens, and lookup the tokens in the index.
To enable substring search, use 3-character substrings of the tokens as keys in the index, instead of the entire token. Similarly, break up the query string into 3-character substrings, and return the intersection of all sets found in the inverted index for each substring.
Alternatively, one could use a suffix tree to enable substring search.
My various solutions can be found on these github branches.