באילו פונקציות האש צריך להשתמש?

יש לנו אחת של פייטון - ובשנייה להשתמש בזו שבני כתב?

You can use Python's `hash` to generate many other hash functions, e.g., by defining

def hash1(x): y = (x,1) return hash(y)

I think there is a problem hashing like this especially if you define:

hash1 = hash((x,1))

hash2 = hash((x,2))

Because, most likely that if hash1 collides for x and y (hash1(x) = hash1(y)) then it will collide for hash2 as well (due to the way the hash function in python works, probably due to its small state or it being linear in some way).

A better option would probably be:

hash1 = hash((1,x))

hash2 = hash((2,x))

But I am not completely sure that this one is any good as well.

Another option is to use a function that was specifically designed to prevent collisions, like the MD5 function

(Message Digest 5). It maps (or "digests") Python objects to 128 bit strings. If you use as h1 Python's hash

and as h2 myhash below, it is unlikely you will run into dependent collisions. Alternatively, defining h1(x) as

myhash(x,1) and h2(x) as myhash(x,2), should also prevent dependent collisions.

import hashlib, pickle def myhash(obj): return int(hashlib.md5(pickle.dumps(obj)).hexdigest(), 16) >>> hex(myhash("566")) '0x43da9628e053c4460d406b0c9101aa1a5c92ac2c'

The problem with that is that md5 is a cryptographic hash function, which means that other than preventing simple collisions (due to its large state and non-linearity) it's preventing second and first preimage and other such fun things.

And these advantages come at a cost of expensive run time.

Isn't there a better option for good hashing which is not cryptographic?

The designer of MD5, Prof. Ron Rivest from MIT, knows something not only about cryptography (he is the R in the RSA public key crypto system) but also about efficiency (among other things, he is the co-author of the most popular algorithms textbook, CLRS).

In our context, MD5 is certainly sufficiently efficient for your needs. As a specific example, I ran

the ${\tt linear\_simulation}$ function from lecture 14, employing MD5 (${\tt func=myhash }$), inserting $n=2\cdot 10^6$ items into a

table with that many slots (resolving collisions with chaining). A complete run took slightly over one minute (incidentally,

the maximum load was 9). You should thus have no problem using MD5 yourselves.

When I've changed my single class implementation from python's hash to MD5 the timing has grown from 6 seconds to 66 seconds (similar to your result) possibly resulting in the question missing its point because now most of the time is spent on calculating the hash as opposed to searching the hash table. Also, searching for values (S1 and S3 both) resulted in single having the best seek time and cuckoo having the worst.

Furthermore, the seeking time for double and cuckoo seems to be twice as much as it is for single because each time two hashes are calculated (if optimized more it would have probably been 1.5-2, when calculating the second hash only when needed).

My conclusion here is that single hashing (with chaining) is the best, but I can't shake the feeling that you were aiming for something else.

Maybe a good solution would be to pre-calculate the hash values per student instance (thus calculating it only once upon creation)?

I suggested MD5 as a response to the complains about collisions in Python's hash.

Python's hash should certainly be good enough for (single) hashing with chaining. But then

use it and something related (Rani's suggestions; hash(1,x) and hash(2,x); hash(x)

and hash(xx); etc.) in the two other approaches.

Calculating the second hash for cuckoo only when the first slot is taken is indeed

a very reasonable optimization. Memorizing the hash values will possibly save time

but will take extra space in memory, so I'm not sure I'd recommend this.

Cuckoo is designed so that its worst case time **per single find** is $O(1)$. The fact

that cuckoo's total running time is worse than total of single hash does not stand in contradiction to this

fact. It does probably imply that most chains in single are short, plus the fact that

accessing elements in a list is cheaper than computing a function like MD5.

Other than that, we got no hidden agenda here - just to make sure you understand

hashing, how to implement some versions of it, and how to measure running times.

@סבא

Indeed, Python's hash for tuples unfortunately satisfies `hash((a,b)) == hash((c,d))` if `hash(a) == hash(c) and hash(b) == hash(d)`.

Another solution, more appropriate for us, would be to use the fact that we hash *strings*, and do something like

def hash1(s): return hash("meow"+s) def hash2(s): return hash("woof"+s)

This is safe, as we can see:

>>> hash('E') 242091972 >>> hash('mfeR') 242091972 >>> hash1('E') == hash1('mfeR') False >>> hash2('E') == hash2('mfeR') False