GVSU CIS 263
Week 4 / Day 1
Lower bound on Sort
- We have O(n^2) sorts and O(n log n) sorts. Can we do better?
- Imagine an arbitrary sorting problem: four numbered boxes, but you don’t know what’s inside.
  - The job of a sorting algorithm is to choose one of the 4! = 24 permutations.
  - Each time we compare two numbers (look in a pair of boxes), at best we can eliminate 1/2 of the remaining permutations.
- What is the minimum number of comparisons that will guarantee any arbitrary list is sorted? log(n!)
  - log(n!) = log((n)(n-1)(n-2)(n-3)...(3)(2)(1)) = log(n) + log(n-1) + log(n-2) + log(n-3) + ...
  - Each term is at most log(n), so log(n!) < n log(n).
  - However, the first n/2 terms are each at least log(n/2), so log(n!) > (n/2) log(n/2) = (n/2)(log(n) - log(2)).
  - Both bounds grow like n log(n), so log(n!) is Theta(n log n): no comparison sort can beat O(n log n).
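A quick numeric check of these bounds (not from the lecture, just a sanity check): summing log2(k) for k = 2..n gives log2(n!) exactly, and it lands between (n/2) log2(n/2) and n log2(n).

```cpp
#include <cmath>

// Numeric sanity check of the bounds on log2(n!), the comparison lower bound.
double log2_factorial(int n) {
    double s = 0.0;
    for (int k = 2; k <= n; ++k)
        s += std::log2(k);   // log(n!) = log(n) + log(n-1) + ... + log(2)
    return s;
}
```

For n = 10: log2(10!) is about 21.8, sandwiched between 5*log2(5) ≈ 11.6 and 10*log2(10) ≈ 33.2.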
- What is a radix sort?
- What is the running time?
- Why doesn’t that contradict the result above?
- When can you use a radix sort?
- When can’t you use a radix sort?
Hash Table
- Arrays are nice because they are generally fast and simple.
- Suppose you want to store one record (struct, object) for each GVSU student.
- Is an array a good tool for this task? Why or why not?
- What would the index be? G#? There are 100 million of them. That’s a pretty big array for 25,000 records.
- How could you use less space?
- What is a potential problem with G# mod 25,000? Collisions.
- Before we talk more about collisions: when might mod be a bad hash function?
  - Suppose you are hashing home prices? (They all end in 000; every price is a multiple of $1,000.)
- What is the benefit of making table size prime?
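A small demonstration of the two questions above (the $150,000 starting price is an arbitrary assumption): prices that are all multiples of $1,000 land in a single bucket under mod 1,000, while a prime modulus like 10,007 shares no factor with 1,000 and spreads them perfectly.

```cpp
#include <set>
#include <vector>

// Build n home prices that are all multiples of $1,000 (illustrative values).
std::vector<int> make_prices(int n) {
    std::vector<int> p;
    for (int i = 0; i < n; ++i) p.push_back(150000 + 1000 * i);
    return p;
}

// Count how many distinct buckets `key mod m` actually uses.
int distinct_buckets(const std::vector<int>& keys, int m) {
    std::set<int> used;
    for (int k : keys) used.insert(k % m);
    return (int)used.size();
}
```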
- Suppose you don’t have a handy integer for your key. Suppose your key is a string (e.g., name, address,
book title)
- For context, say you have a table with 10,007 slots (10,007 is prime), and strings are typically of length 8 or less.
- What about simply adding the ASCII values in the string?
  - Numbers don’t get big enough: 127*8 is only 1,016.
- How about treating the first three letters as a “base 27” number? char(0) + 27*char(1) + 27*27*char(2)?
  - Not a good distribution. Only 2,851 different three-letter combinations appear in the dictionary.
- char(0) + 37*char(1) + 37^2*char(2) + ... % table_size does better.
  - Not great, but decent and simple.
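A sketch of that polynomial hash, evaluated with Horner’s rule so no explicit powers of 37 are needed. (Horner processes the string left to right, so the highest power of 37 multiplies the first character; same idea as the formula above, and the form used in the Weiss text.)

```cpp
#include <cstddef>
#include <string>

// Base-37 polynomial string hash via Horner's rule.
std::size_t hash_string(const std::string& key, std::size_t table_size) {
    std::size_t h = 0;
    for (char c : key)
        h = 37 * h + static_cast<unsigned char>(c);  // overflow wraps mod 2^64, which is fine
    return h % table_size;
}
```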
- Chaining – each bucket is the head of a list.
- What is the worst-case run time for a lookup on a table filled with random data?
  - O(N): everything hashes to the same bucket.
- What is the average case?
  - The load factor lambda (number of elements relative to table size) is important.
  - What is the average-case run time given random data and load factor lambda?
    - What is the average list length? lambda
    - What is the average number of nodes examined in a successful search? 1 + lambda/2
- What is the downside of chaining?
  - Linked lists have overhead: node creation, extra pointers, etc.
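A minimal separate-chaining table might look like the sketch below (string keys, fixed prime table size, `std::hash` standing in for whatever hash function you choose). New elements go at the head of the list, as in Java 5's HashMap.

```cpp
#include <functional>
#include <list>
#include <string>
#include <vector>

// Minimal separate-chaining hash set of strings (sketch, no resizing).
class ChainedSet {
    std::vector<std::list<std::string>> buckets;

    std::size_t index(const std::string& k) const {
        return std::hash<std::string>{}(k) % buckets.size();
    }

public:
    explicit ChainedSet(std::size_t m = 10007) : buckets(m) {}

    void insert(const std::string& k) {
        auto& b = buckets[index(k)];
        for (const auto& x : b)
            if (x == k) return;      // already present
        b.push_front(k);             // add at the head of the chain
    }

    bool contains(const std::string& k) const {
        const auto& b = buckets[index(k)];
        for (const auto& x : b)
            if (x == k) return true;
        return false;
    }
};
```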
- Linear probing
  - Find the next unused slot in the array itself.
  - Expected probes for a successful search: 1/2(1 + 1/(1-lambda))
    - What is the expected cost when lambda = .5? .8? .9? 1.0?
  - Expected probes for an unsuccessful search: 1/2(1 + 1/(1-lambda)^2)
  - Expected worst case: if you randomly put N balls into N bins, on average the fullest bin will have about log(N) / log(log(N)) balls.
    - How many balls is this when N = 1,000,000? 5.26
    - How many balls is this when N = 1,000,000,000,000 (a trillion)? 8.32
- What is the downside?
- Clusters
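The two expected-probe formulas above are easy to tabulate; plugging in the load factors from the question gives a feel for how fast linear probing degrades (e.g., successful searches cost about 1.5 probes at lambda = .5 but 5.5 at lambda = .9, and the formulas blow up at lambda = 1).

```cpp
#include <cmath>

// Expected probes under linear probing (the formulas quoted above).
double probes_successful(double lambda) {
    return 0.5 * (1.0 + 1.0 / (1.0 - lambda));
}
double probes_unsuccessful(double lambda) {
    return 0.5 * (1.0 + 1.0 / std::pow(1.0 - lambda, 2));
}
```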
- Quadratic probing
  - Use some quadratic function f(i) to determine where to look on attempt i.
    - Can be as simple as f(i) = hash(x) + i^2
    - zyBooks uses the more general f(i) = hash(x) + c_1*i + c_2*i^2
  - What is a fast way of computing (i+1)^2 given i^2?
    - Look at the difference: (i+1)^2 = i^2 + 2i + 1.
  - Once lambda reaches .5, it may not be possible to find a spot, even though the table isn’t full.
    - Below that threshold, a prime table size guarantees a spot; this applies to f(i) = i^2 only. The more general function may have an even lower threshold.
    - Proof sketch: suppose probes i and j collide. Then i^2 - j^2 = (i+j)(i-j) = pk (where p is the prime table size and k is some integer).
    - This can only happen when (i+j) = 0, (i-j) = 0, or the prime factorization of the product contains p, which is impossible since i and j are distinct and less than p/2.
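The sketch below checks that guarantee empirically for one prime table size, and also uses the incremental trick: instead of computing i^2 each time, it adds 2i+1 to the running offset. The first p/2 + 1 quadratic probe locations all come out distinct.

```cpp
#include <cstddef>
#include <set>

// Count distinct slots hit by the first p/2 + 1 quadratic probes
// (home + i^2) mod p, updating i^2 incrementally via (i+1)^2 = i^2 + 2i + 1.
int distinct_quadratic_probes(int home, int p) {
    std::set<int> seen;
    int offset = 0;                       // holds i^2
    for (int i = 0; i <= p / 2; ++i) {
        seen.insert((home + offset) % p);
        offset += 2 * i + 1;              // advance i^2 without a multiply
    }
    return (int)seen.size();
}
```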
- Double hashing
  - f(i) = h1(x) + i*h2(x)
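One common choice for the second hash (suggested in the Weiss text) is h2(x) = R - (x mod R) with R a prime smaller than the table size, which guarantees the step is never 0. A tiny sketch, with R = 7 as an arbitrary example prime:

```cpp
// Second hash: h2(x) = R - (x mod R), always in 1..R, so the step is never 0.
int h2(int x) {
    const int R = 7;                  // example prime; an assumption, not fixed by the notes
    return R - (x % R);
}

// Location of the i-th probe for key x in a table of size m.
int probe(int x, int i, int m) {
    return ((x % m) + i * h2(x)) % m;
}
```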
- Perfect Hashing
  - Guarantees worst-case O(1) lookup, if you rebuild the table as necessary while building it.
  - As the number of rows (M) grows, the probability of multiple items in a row decreases.
  - Key questions are:
    - Is M unreasonably large, and
    - How do we avoid getting unlucky?
  - If the probability that some row has more than one item is < 1/2, then we can just re-hash until we get what we want.
    - Problem: to get this probability, M must be about N^2 — too large. (See the Weiss text for proof.)
  - Idea: a hash table of hash tables. Each row r with more than one item gets a secondary hash table with size(r)^2 rows.
    - On average, the total size of the secondary hash tables will be < 4N, with probability 1/2.
    - As before, we can just re-hash the first table until we get sufficiently small secondary tables.
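A sketch of the two-level idea for a fixed set of distinct integer keys. The salted `mix()` hash is an illustrative assumption (any universal hash family would do), and for simplicity the first level is never rebuilt — each overfull row just retries salts until its size(r)^2 secondary table is collision-free.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Two-level ("hash table of hash tables") perfect hashing sketch.
struct PerfectTable {
    static constexpr int64_t kEmpty = INT64_MIN;   // sentinel: keys must differ from this
    struct Row { uint64_t salt = 1; std::vector<int64_t> slots; };
    std::vector<Row> rows;

    static uint64_t mix(uint64_t x, uint64_t salt) {
        x = (x ^ salt) * 0x9E3779B97F4A7C15ULL;    // multiply by a 64-bit odd constant
        return x ^ (x >> 32);
    }

    explicit PerfectTable(const std::vector<int64_t>& keys) {
        std::size_t m = keys.size();               // first level: M = N rows
        rows.assign(m, Row{});
        std::vector<std::vector<int64_t>> bucket(m);
        for (int64_t k : keys) bucket[mix((uint64_t)k, 1) % m].push_back(k);
        for (std::size_t r = 0; r < m; ++r) {
            std::size_t s = bucket[r].size();
            if (s == 0) continue;
            Row& row = rows[r];
            row.slots.assign(s * s, kEmpty);       // size(r)^2 secondary slots
            for (row.salt = 1;; ++row.salt) {      // re-hash until collision-free
                std::fill(row.slots.begin(), row.slots.end(), kEmpty);
                bool ok = true;
                for (int64_t k : bucket[r]) {
                    std::size_t j = mix((uint64_t)k, row.salt) % row.slots.size();
                    if (row.slots[j] != kEmpty) { ok = false; break; }
                    row.slots[j] = k;
                }
                if (ok) break;
            }
        }
    }

    bool contains(int64_t k) const {               // worst-case O(1): two hashes, one compare
        if (rows.empty()) return false;
        const Row& row = rows[mix((uint64_t)k, 1) % rows.size()];
        if (row.slots.empty()) return false;
        return row.slots[mix((uint64_t)k, row.salt) % row.slots.size()] == k;
    }
};
```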
- Resizing / Rehashing
- Load factor
- When there are too many collisions
- When a chained list gets too long
- Why use a prime number?
- As with rebuilding a vector, it’s an O(N) operation that doubles the size of the table; thus, the amortized cost is constant.
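The amortized argument can be checked numerically: if the table doubles each time it fills (sizes 1, 2, 4, ...; a real table would round up to the next prime), the total number of elements ever copied during rehashes is a geometric series summing to less than 2n — constant work per insert on average.

```cpp
// Total elements copied across all doubling rehashes while inserting n items.
// 1 + 2 + 4 + ... + 2^k < 2n, so the amortized cost per insert is O(1).
long long rehash_copies(long long n) {
    long long total = 0;
    for (long long cap = 1; cap < n; cap *= 2)
        total += cap;      // the rehash at capacity `cap` re-inserts `cap` items
    return total;
}
```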
- Hash function
- Perfect hash function: is it possible?
- What are features of a good hash function?
- Fast
- Uniformly distributed. (Buckets that stay empty aren’t helpful e.g., hashing house prices % 1000)
- Deletions
- Is it sufficient to simply mark a bin as ‘empty’ when removing an element?
- Why not?
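The answer, in sketch form: under probing, marking a removed slot plain EMPTY cuts off any key that was inserted past it, so removals must leave a "tombstone" (DELETED) that searches probe through. A minimal linear-probing table illustrating this (assumes the table never completely fills):

```cpp
#include <vector>

enum State { EMPTY, FULL, DELETED };
struct Slot { State st = EMPTY; int key = 0; };

// Linear-probing table showing why deletions need tombstones.
struct ProbeTable {
    int m;
    std::vector<Slot> t;
    explicit ProbeTable(int size) : m(size), t(size) {}

    void insert(int k) {
        int i = k % m;
        while (t[i].st == FULL) i = (i + 1) % m;   // probe to a free slot
        t[i] = {FULL, k};
    }
    void erase(int k) {
        for (int i = k % m; t[i].st != EMPTY; i = (i + 1) % m)
            if (t[i].st == FULL && t[i].key == k) { t[i].st = DELETED; return; }
    }
    bool contains(int k) const {
        for (int i = k % m; t[i].st != EMPTY; i = (i + 1) % m)  // DELETED doesn't stop the scan
            if (t[i].st == FULL && t[i].key == k) return true;
        return false;
    }
};
```

In the test below, 7 and 14 both hash to slot 0 in a table of size 7; after erasing 7, key 14 stays reachable only because slot 0 is DELETED rather than EMPTY.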
- Real Implementations
- Look at the Java 5 HashMap implementation:
  https://github.com/eagle518/jdk-source-code/blob/master/jdk5.0_src/j2se/src/share/classes/java/util/HashMap.java
  - What kind of hash table is this? Separate chaining
  - Where is a new element added? At the head.
  - Why there?
- Java 8 implementation:
  http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/HashMap.java
  - Notice the use of trees when bins are too full.
  - Notice the min and max thresholds for “tree” and “untree”.
  - Notice a new, simpler secondary hash function.
- C++ implementation (difficult to follow):
  https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/hashtable.h