Hash functions (C++) ==================== _This page is for collecting know-how related to hash functions and hash tables._ The original STL was lacking proper support for hashtables, hash based associative arrays and hash calculation in general. To quite some developers, hash tables feel like some kind of _impure_ data structure -- unfortunately the properties of modern CPUs turned the balance significantly in favour of hash tables due to memory locality. Pointer based datastructures can't be considered especially _performant_ as they were in the good old times. The tr1 extension and the new C++11 standard amended the problem by defining a framework for hash functions and hash tables. When sticking to some rules, custom written hash functions can be picked up automatically by the standard library and -containers. Standard Hash Definitions ------------------------- Hash values:: hash values are unsigned integral numbers of type 'size_t' + Basically this means that the range of hash values roughly matches the memory address space. But it also means that this range is _platform dependant_ (32 or 64bit) and -- given the usual hash calculation based on modulus (wrap around) -- that generated hash values are nonportable. Hash function:: a hash function calculates a hash value for objects of its argument type. Thus, for every supported type, there is a dedicated hash function. Quite some hash functions are generated from function templates though. Hash functor:: a function object able to calculate hash values when invoked. The standard library and the corresponding boost libraries accept functors of type 'hash' to calculate hash values for objects or values of type 'TY' Hash based containers:: While the standard Set and Map types (including the Multiset and Multimap) are based on balanced binary trees, the new C\+\+11 standard includes hash based variants (with name prefix +unordered_+). These hashtable based containers require a +hash+ functor to be able to derive the hash value of any encountered key value. Hash functors may be provided as additional type parameter to the container; if omitted, the compiler tries to find a (maybe custom defined) hash functor by *ADL* (see below) C++11 versus Boost ~~~~~~~~~~~~~~~~~~ The Boost library *functional-hash* provided the foundation for the definition now accepted into the new C++ standard. Yet the boost library provides some additional facilities not part of the standard. Thus we're bound to choose * either including ++ and +using std::hash+ * or including ++ and +using boost::hash+ The boost version additionally provides pre defined hash functors for STL containers holding custom types -- and it provides an easy to use extension mechanism for writing hash functions for custom types. Effectively this means that, assuming usage of the boost-include, the actual implementation and the way it is picked up is _different but compatible_ to the standard way. Boost: hashing custom types ~~~~~~~~~~~~~~~~~~~~~~~~~~~ The extension mechanism used by the boost version is best explained by looking at the code .boost/functional/hash/extensions.hpp [source,C] ---- template struct hash { std::size_t operator()(T const& val) const { return hash_value(val); } } ---- So this templated standard implementation just _invokes an unqualified function_ with the name +hash_value(val)+ -- when instantiating this template for your custom class or type, the compiler will search this function not only in the current scope, but also in the namespace defining your custom type +T+ (this mechanism is known as ``**A**rgument **D**ependant **L**ookup''). Meaning that all we'd need to do is to define a free function or friend function named +hash_value+ alongside with our custom data types (classes).footnote:[Before C++11, such _functor objects_ were typically derived from `std::unary_function`, which provided typedefs `argument_type` and `result_type` -- these were actually picked up to be able to handle such generic function-like objects. Such is no longer needed and deprecated with C++17, since library code today either relies on _traits templates_ to query for some feature like a function operator, and contemporary code even relies on _concepts_ to express such a requirement from the code depending on it. Today it is sufficient just to expose an accessible function call operator.] To further facilitate providing custom hash functions, boost defines a function +boost::hash_combine(size_t seed, size_t hashValue)+, allowing to _chain up_ the calculated hash values of the parts forming a composite data structure. - see Lumiera's link:http://git.lumiera.org/gitweb?p=LUMIERA;a=blob;f=src/proc/asset/category.hpp;h=b7c8df2f2ce69b0ccf89439954de8346fe8d9276;hb=master#l104[asset::Category] for a simple usage example - our link:http://git.lumiera.org/gitweb?p=LUMIERA;a=blob;f=src/lib/symbol-impl.cpp;h=9e09b4254ac57baefeb0a0c06ccd423318e923c1;hb=master#l67[lib::Symbol datatype] uses the standard implementation of a string hash function combining the individual character's hashes. Hash-chaining ~~~~~~~~~~~~~ We use a dedicated function `lib::hash::combine(s,h)` to join several source(component) hashes. This usage pattern was pioneered by Boost and is based on the https://github.com/aappleby/smhasher/blob/master/src/MurmurHash2.cpp[Murmur-2.64A] hash algorithm. WARNING: as of [yellow-background]#11/2024#, portability of hash values is an unresolved issue; this code does not work on 32bit systems https://issues.lumiera.org/ticket/722#comment:10[see #722] LUID values ----------- Lumiera's uniform identifier values shouldn't be confused with regular hash values. The purpose of LUID values is to use just plain random numbers as ID values. But, because of using such a incredibly large number space (128bit), we can just assume any collision between such random LUID to be so unlikely as to reasonably ignore this possibility altogether. Let's say, the collision of random LUID values won't ever happen, same as the meltdown of an atomic power plant, which, as we all know, won't ever happen either. Relation to hash values ~~~~~~~~~~~~~~~~~~~~~~~ When objects incorporate such an unique LUID, this provides for a prime candidate to derive hash values as a side-effect of that design: Since incorporating an LUID typically means that this object has an _distinguishable identity_, all objects with the same LUID should be considered _equivalent_ and thus hash to the same value. Consequently we can just use a +size_t+ prefix of the LUID bitstring as hash value, without any further calculations. This relies on LUID being generated from a reliable _entropy source._