GVSU CIS 263
Week 4 / Day 1
Lower bound on Sort
- We have O(n^2) sorts and O(n log n) sorts. Can we do better?
- Imagine an arbitrary sorting problem: four numbered boxes, but you don’t know what’s inside.
  - The job of a sorting algorithm is to choose one of the 4! = 24 permutations.
  - Each time we compare two numbers (look in a pair of boxes), at best we can eliminate 1/2 of the remaining permutations.
- What is the minimum number of comparisons that will guarantee any arbitrary list is sorted? log(n!)
  - log(n!) = log((n)(n-1)(n-2)(n-3)...(3)(2)(1)) = log(n) + log(n-1) + log(n-2) + log(n-3) + ...
  - Each term is at most log(n), so log(n!) < n log(n).
  - However, the first n/2 terms are each at least log(n/2), so log(n!) > (n/2) log(n/2) = (n/2)(log(n) - log(2)).
  - Both bounds grow like n log(n), so log(n!) is Theta(n log n): no comparison sort can beat O(n log n).
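A quick numeric check of these bounds (not from the lecture, just a sanity check): summing log2(k) for k = 2..n gives log2(n!) exactly, and it lands between (n/2) log2(n/2) and n log2(n).

```cpp
#include <cmath>

// Numeric sanity check of the bounds on log2(n!), the comparison lower bound.
double log2_factorial(int n) {
    double s = 0.0;
    for (int k = 2; k <= n; ++k)
        s += std::log2(k);   // log(n!) = log(n) + log(n-1) + ... + log(2)
    return s;
}
```

For n = 10: log2(10!) is about 21.8, sandwiched between 5*log2(5) ≈ 11.6 and 10*log2(10) ≈ 33.2.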
- What is a radix sort?
- What is the running time?
- Why doesn’t that contradict the result above?
- When can you use a radix sort?
- When can’t you use a radix sort?
Hash Table
- Arrays are nice because they are generally fast and simple.
- Suppose you want to store one record (struct, object) for each GVSU student.
- Is an array a good tool for this task? Why or why not?
- What would the index be? G#? There are 100 million of them. That’s a pretty big array for 25,000 records.
- How could you use less space?
- What is a potential problem with G# mod 25,000? Collisions.
- Before we talk more about collisions: when might mod be a bad hash function?
  - Suppose you are hashing home prices? (They all end in 000; every price is a multiple of $1,000.)
- What is the benefit of making table size prime?
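A small demonstration of the two questions above (the $150,000 starting price is an arbitrary assumption): prices that are all multiples of $1,000 land in a single bucket under mod 1,000, while a prime modulus like 10,007 shares no factor with 1,000 and spreads them perfectly.

```cpp
#include <set>
#include <vector>

// Build n home prices that are all multiples of $1,000 (illustrative values).
std::vector<int> make_prices(int n) {
    std::vector<int> p;
    for (int i = 0; i < n; ++i) p.push_back(150000 + 1000 * i);
    return p;
}

// Count how many distinct buckets `key mod m` actually uses.
int distinct_buckets(const std::vector<int>& keys, int m) {
    std::set<int> used;
    for (int k : keys) used.insert(k % m);
    return (int)used.size();
}
```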
- Suppose you don’t have a handy integer for your key. Suppose your key is a string (e.g., name, address,
book title)
- For context, say you have a table with 10,007 slots (10,007 is prime), and strings are typically of length 8 or less.
- What about simply adding the ASCII values in the string?
  - Numbers don’t get big enough: 127*8 is only 1,016.
- How about treating the first three letters as a “base 27” number? char(0) + 27*char(1) + 27*27*char(2)?
  - Not a good distribution. Only 2,851 different three-letter combinations appear in the dictionary.
- char(0) + 37*char(1) + 37^2*char(2) + ... % table_size does better.
  - Not great, but decent and simple.
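A sketch of that polynomial hash, evaluated with Horner’s rule so no explicit powers of 37 are needed. (Horner processes the string left to right, so the highest power of 37 multiplies the first character; same idea as the formula above, and the form used in the Weiss text.)

```cpp
#include <cstddef>
#include <string>

// Base-37 polynomial string hash via Horner's rule.
std::size_t hash_string(const std::string& key, std::size_t table_size) {
    std::size_t h = 0;
    for (char c : key)
        h = 37 * h + static_cast<unsigned char>(c);  // overflow wraps mod 2^64, which is fine
    return h % table_size;
}
```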
- Chaining – each bucket is the head of a list.
- What is the worst-case run time for a lookup on a table filled with random data?
  - O(N): everything hashes to the same bucket.
- What is the average case?
  - The load factor lambda (number of elements relative to table size) is important.
  - What is the average-case run time given random data and load factor lambda?
    - What is the average list length? lambda
    - What is the average number of nodes examined in a successful search? 1 + lambda/2
- What is the downside of chaining?
  - Linked lists have overhead: node creation, extra pointers, etc.
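A minimal separate-chaining table might look like the sketch below (string keys, fixed prime table size, `std::hash` standing in for whatever hash function you choose). New elements go at the head of the list, as in Java 5's HashMap.

```cpp
#include <functional>
#include <list>
#include <string>
#include <vector>

// Minimal separate-chaining hash set of strings (sketch, no resizing).
class ChainedSet {
    std::vector<std::list<std::string>> buckets;

    std::size_t index(const std::string& k) const {
        return std::hash<std::string>{}(k) % buckets.size();
    }

public:
    explicit ChainedSet(std::size_t m = 10007) : buckets(m) {}

    void insert(const std::string& k) {
        auto& b = buckets[index(k)];
        for (const auto& x : b)
            if (x == k) return;      // already present
        b.push_front(k);             // add at the head of the chain
    }

    bool contains(const std::string& k) const {
        const auto& b = buckets[index(k)];
        for (const auto& x : b)
            if (x == k) return true;
        return false;
    }
};
```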
- Linear probing
  - Find the next unused slot in the array itself.
  - Expected probes for a successful search: 1/2(1 + 1/(1-lambda))
    - What is the expected cost when lambda = .5? .8? .9? 1.0?
  - Expected probes for an unsuccessful search: 1/2(1 + 1/(1-lambda)^2)
  - Expected worst case: if you randomly put N balls into N bins, on average the fullest bin will have about log(N) / log(log(N)) balls.
    - How many balls is this when N = 1,000,000? 5.26
    - How many balls is this when N = 1,000,000,000,000 (a trillion)? 8.32
- What is the downside?
- Clusters
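The two expected-probe formulas above are easy to tabulate; plugging in the load factors from the question gives a feel for how fast linear probing degrades (e.g., successful searches cost about 1.5 probes at lambda = .5 but 5.5 at lambda = .9, and the formulas blow up at lambda = 1).

```cpp
#include <cmath>

// Expected probes under linear probing (the formulas quoted above).
double probes_successful(double lambda) {
    return 0.5 * (1.0 + 1.0 / (1.0 - lambda));
}
double probes_unsuccessful(double lambda) {
    return 0.5 * (1.0 + 1.0 / std::pow(1.0 - lambda, 2));
}
```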
- Quadratic probing
  - Use some quadratic function f(i) to determine where to look on attempt i.
    - Can be as simple as f(i) = hash(x) + i^2
    - zyBooks uses the more general f(i) = hash(x) + c_1*i + c_2*i^2
  - What is a fast way of computing (i+1)^2 given i^2?
    - Look at the difference: (i+1)^2 = i^2 + 2i + 1.
  - Once lambda reaches .5, it may not be possible to find a spot, even though the table isn’t full.
    - Below that threshold, a prime table size guarantees a spot; this applies to f(i) = i^2 only. The more general function may have an even lower threshold.
    - Proof sketch: suppose probes i and j collide. Then i^2 - j^2 = (i+j)(i-j) = pk (where p is the prime table size and k is some integer).
    - This can only happen when (i+j) = 0, (i-j) = 0, or the prime factorization of the product contains p, which is impossible since i and j are distinct and less than p/2.
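The sketch below checks that guarantee empirically for one prime table size, and also uses the incremental trick: instead of computing i^2 each time, it adds 2i+1 to the running offset. The first p/2 + 1 quadratic probe locations all come out distinct.

```cpp
#include <cstddef>
#include <set>

// Count distinct slots hit by the first p/2 + 1 quadratic probes
// (home + i^2) mod p, updating i^2 incrementally via (i+1)^2 = i^2 + 2i + 1.
int distinct_quadratic_probes(int home, int p) {
    std::set<int> seen;
    int offset = 0;                       // holds i^2
    for (int i = 0; i <= p / 2; ++i) {
        seen.insert((home + offset) % p);
        offset += 2 * i + 1;              // advance i^2 without a multiply
    }
    return (int)seen.size();
}
```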
- Double hashing
  - f(i) = h1(x) + i*h2(x)
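One common choice for the second hash (suggested in the Weiss text) is h2(x) = R - (x mod R) with R a prime smaller than the table size, which guarantees the step is never 0. A tiny sketch, with R = 7 as an arbitrary example prime:

```cpp
// Second hash: h2(x) = R - (x mod R), always in 1..R, so the step is never 0.
int h2(int x) {
    const int R = 7;                  // example prime; an assumption, not fixed by the notes
    return R - (x % R);
}

// Location of the i-th probe for key x in a table of size m.
int probe(int x, int i, int m) {
    return ((x % m) + i * h2(x)) % m;
}
```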
- Perfect Hashing
  - Guarantees worst-case O(1) lookup, if you rebuild the table as necessary while building it.
  - As the number of rows (M) grows, the probability of multiple items in a row decreases.
  - Key questions are:
    - Is M unreasonably large, and
    - How do we avoid getting unlucky?
  - If the probability that some row has more than one item is < 1/2, then we can just re-hash until we get what we want.
    - Problem: to get this probability, M must be about N^2 — too large. (See the Weiss text for proof.)
  - Idea: a hash table of hash tables. Each row r with more than one item gets a secondary hash table with size(r)^2 rows.
    - On average, the total size of the secondary hash tables will be < 4N, with probability 1/2.
    - As before, we can just re-hash the first table until we get sufficiently small secondary tables.
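A sketch of the two-level idea for a fixed set of distinct integer keys. The salted `mix()` hash is an illustrative assumption (any universal hash family would do), and for simplicity the first level is never rebuilt — each overfull row just retries salts until its size(r)^2 secondary table is collision-free.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Two-level ("hash table of hash tables") perfect hashing sketch.
struct PerfectTable {
    static constexpr int64_t kEmpty = INT64_MIN;   // sentinel: keys must differ from this
    struct Row { uint64_t salt = 1; std::vector<int64_t> slots; };
    std::vector<Row> rows;

    static uint64_t mix(uint64_t x, uint64_t salt) {
        x = (x ^ salt) * 0x9E3779B97F4A7C15ULL;    // multiply by a 64-bit odd constant
        return x ^ (x >> 32);
    }

    explicit PerfectTable(const std::vector<int64_t>& keys) {
        std::size_t m = keys.size();               // first level: M = N rows
        rows.assign(m, Row{});
        std::vector<std::vector<int64_t>> bucket(m);
        for (int64_t k : keys) bucket[mix((uint64_t)k, 1) % m].push_back(k);
        for (std::size_t r = 0; r < m; ++r) {
            std::size_t s = bucket[r].size();
            if (s == 0) continue;
            Row& row = rows[r];
            row.slots.assign(s * s, kEmpty);       // size(r)^2 secondary slots
            for (row.salt = 1;; ++row.salt) {      // re-hash until collision-free
                std::fill(row.slots.begin(), row.slots.end(), kEmpty);
                bool ok = true;
                for (int64_t k : bucket[r]) {
                    std::size_t j = mix((uint64_t)k, row.salt) % row.slots.size();
                    if (row.slots[j] != kEmpty) { ok = false; break; }
                    row.slots[j] = k;
                }
                if (ok) break;
            }
        }
    }

    bool contains(int64_t k) const {               // worst-case O(1): two hashes, one compare
        if (rows.empty()) return false;
        const Row& row = rows[mix((uint64_t)k, 1) % rows.size()];
        if (row.slots.empty()) return false;
        return row.slots[mix((uint64_t)k, row.salt) % row.slots.size()] == k;
    }
};
```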
- Resizing / Rehashing
- Load factor
- When there are too many collisions
- When a chained list gets too long
- Why use a prime number?
- As with rebuilding a vector, it’s an O(N) operation that doubles the size of the table; thus, the amortized cost is constant.
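The amortized argument can be checked numerically: if the table doubles each time it fills (sizes 1, 2, 4, ...; a real table would round up to the next prime), the total number of elements ever copied during rehashes is a geometric series summing to less than 2n — constant work per insert on average.

```cpp
// Total elements copied across all doubling rehashes while inserting n items.
// 1 + 2 + 4 + ... + 2^k < 2n, so the amortized cost per insert is O(1).
long long rehash_copies(long long n) {
    long long total = 0;
    for (long long cap = 1; cap < n; cap *= 2)
        total += cap;      // the rehash at capacity `cap` re-inserts `cap` items
    return total;
}
```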
- Hash function
- Perfect hash function: is it possible?
- What are features of a good hash function?
- Fast
- Uniformly distributed. (Buckets that stay empty aren’t helpful e.g., hashing house prices % 1000)
- Deletions
- Is it sufficient to simply mark a bin as ‘empty’ when removing an element?
- Why not?
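The answer, in sketch form: under probing, marking a removed slot plain EMPTY cuts off any key that was inserted past it, so removals must leave a "tombstone" (DELETED) that searches probe through. A minimal linear-probing table illustrating this (assumes the table never completely fills):

```cpp
#include <vector>

enum State { EMPTY, FULL, DELETED };
struct Slot { State st = EMPTY; int key = 0; };

// Linear-probing table showing why deletions need tombstones.
struct ProbeTable {
    int m;
    std::vector<Slot> t;
    explicit ProbeTable(int size) : m(size), t(size) {}

    void insert(int k) {
        int i = k % m;
        while (t[i].st == FULL) i = (i + 1) % m;   // probe to a free slot
        t[i] = {FULL, k};
    }
    void erase(int k) {
        for (int i = k % m; t[i].st != EMPTY; i = (i + 1) % m)
            if (t[i].st == FULL && t[i].key == k) { t[i].st = DELETED; return; }
    }
    bool contains(int k) const {
        for (int i = k % m; t[i].st != EMPTY; i = (i + 1) % m)  // DELETED doesn't stop the scan
            if (t[i].st == FULL && t[i].key == k) return true;
        return false;
    }
};
```

In the test below, 7 and 14 both hash to slot 0 in a table of size 7; after erasing 7, key 14 stays reachable only because slot 0 is DELETED rather than EMPTY.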
- Real Implementations
- Look at the Java 5 HashMap implementation:
  https://github.com/eagle518/jdk-source-code/blob/master/jdk5.0_src/j2se/src/share/classes/java/util/HashMap.java
  - What kind of hash table is this? Separate chaining
  - Where is a new element added? At the head.
  - Why there?
- Java 8 implementation:
  http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/HashMap.java
  - Notice the use of trees when bins are too full.
  - Notice the min and max thresholds for “tree” and “untree”.
  - Notice a new, simpler secondary hash function.
- C++ implementation (difficult to follow):
  https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/hashtable.h