Binary search or hash table

by

These are two methods that, with large data sets, greatly speed up lookups over a simple linear search. This increases insertion time a little, but this doesn't matter tremendously.

However, you can't easily manage either of these methods on variable-length data without a separate table of indexes.

What you basically do is switch your data file from INPUT/OUTPUT to BINARY, then have another file (probably RANDOM or BINARY) where you store a list of offsets into that binary file.

Let's call the data file WORDS.DAT and the index file WORDS.IDX.


Binary searches are conceptually easier. To start with, your index file (with all the offsets) is sorted so that it references the words in alphabetical order. When you want to retrieve a word, you pick the middle entry from the index, and then see whether the word you want comes alphabetically earlier or later. This immediately discounts half the words in the list, so you repeat the process with the remaining half.

When you add a new word, you just tack it on the end of WORDS.DAT, and then insert its offset into the appropriate position in WORDS.IDX. (I.e. so that WORDS.IDX remains sorted.) How you manage the length of the word is up to you - it can also go in the index, or you can use newlines to delemit the ends of words, or precede the word with an integer specifying its length.


Hash tables are a bit more involved, but result in extremely fast lookups when the number of elements gets very big. Basically you find a good hash algorithm, and use the result as an index into the index file. So (providing you don't have too many hash collisions), the lookup is as fast as calculting the hash, and finding the offset it points to. Check Wikipedia for more.

Posted on Mar 19, 2008, 4:33 PM
from IP address 220.245.178.132

Respond to this message   

Goto Forum Home