Hash functions (C++)
====================

_This page is for collecting know-how related to hash functions and hash tables._

The original STL lacked proper support for hash tables, hash based associative arrays
and hash calculation in general. To quite a few developers, hash tables feel like some kind
of _impure_ data structure -- yet the properties of modern CPUs have shifted the balance
significantly in favour of hash tables, due to memory locality. Pointer based data structures
can no longer be considered as _performant_ as they were in the good old times.

The TR1 extension and the C++11 standard remedied the problem by defining a framework
for hash functions and hash tables. As long as some rules are observed, custom written hash
functions can be picked up automatically by the standard library and its containers.

Standard Hash Definitions
-------------------------

Hash values::
  hash values are unsigned integral numbers of type 'size_t'
+
Basically this means that the range of hash values roughly matches the memory address space.
But it also means that this range is _platform dependent_ (32 or 64bit) and -- given the usual
hash calculation based on modulus (wrap around) -- that generated hash values are not portable.

Hash function::
  a hash function calculates a hash value for objects of its argument type. Thus, for every
  supported type, there is a dedicated hash function. Many hash functions are generated
  from function templates, though.

Hash functor::
  a function object able to calculate hash values when invoked. The standard library and the
  corresponding boost libraries accept functors of type 'hash<TY>' to calculate hash values
  for objects or values of type 'TY'

Hash based containers::
  While the standard Set and Map types (including the Multiset and Multimap) are based on
  balanced binary trees, the C\+\+11 standard includes hash based variants (with name
  prefix +unordered_+). These hashtable based containers require a +hash<KEY>+ functor
  to be able to derive the hash value of any encountered key value. Hash functors may
  be provided as additional type parameter to the container; if omitted, the container
  falls back to +std::hash<KEY>+, which must be specialised for custom key types. The Boost
  variant +boost::hash<KEY>+ instead locates a custom hash function by *ADL* (see below).
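
Both ways of supplying a hash functor to a hash based container can be sketched as follows.
This is only an illustration: the key type `TrackID` and the functor `TrackHasher` are
hypothetical names and not part of the Lumiera code base.

[source,C]
----
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <unordered_set>

// hypothetical key type, used only for this illustration
struct TrackID
{
    std::string name;
    int index;

    // hash based containers also need an equality comparison for the key
    bool operator== (TrackID const& o) const
    {
        return name == o.name && index == o.index;
    }
};

// Variant 1: an explicit hash functor, passed as additional type parameter
struct TrackHasher
{
    std::size_t operator() (TrackID const& id) const
    {
        // simplistic mixing of the member hashes, sufficient for a demonstration
        return std::hash<std::string>{}(id.name) ^ (std::hash<int>{}(id.index) << 1);
    }
};
using TrackTable = std::unordered_map<TrackID, std::string, TrackHasher>;

// Variant 2: specialise std::hash, which the container then uses by default
namespace std {
    template<>
    struct hash<TrackID>
    {
        std::size_t operator() (TrackID const& id) const
        {
            return TrackHasher{}(id);
        }
    };
}
using TrackSet = std::unordered_set<TrackID>;
----

With these definitions, `TrackTable` names its hash functor explicitly, while `TrackSet`
relies on the specialisation being visible at the point of use.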


C++11 versus Boost
~~~~~~~~~~~~~~~~~~
The Boost library *functional-hash* provided the foundation for the definition now accepted
into the C++ standard. Yet the boost library provides some additional facilities which are
not part of the standard. Thus we have to choose

* either including +<functional>+ and +using std::hash+
* or including +<boost/functional/hash.hpp>+ and +using boost::hash+

The boost version additionally provides predefined hash functors for STL containers holding
custom types -- and it provides an easy to use extension mechanism for writing hash functions
for custom types. Effectively this means that, when using the boost include, the actual
implementation and the way it is picked up is _different but compatible_ with the standard way.

Boost: hashing custom types
~~~~~~~~~~~~~~~~~~~~~~~~~~~
The extension mechanism used by the boost version is best explained by looking
at the code

.boost/functional/hash/extensions.hpp
[source,C]
----
    template <class T> struct hash
    {
        std::size_t operator()(T const& val) const
        {
            return hash_value(val);
        }
    };
----
So this templated standard implementation just _invokes an unqualified function_
with the name +hash_value(val)+ -- when instantiating this template for your custom
class or type, the compiler will search for this function not only in the current scope,
but also in the namespace defining your custom type +T+ (this mechanism is known as
``**A**rgument **D**ependent **L**ookup''). This means that all we need to do is to define a
free function or friend function named +hash_value+ alongside our custom data types
(classes).footnote:[Before C++11, such _functor objects_ were typically derived from
`std::unary_function`, which provided the typedefs `argument_type` and `result_type` --
these were picked up by library code to handle such generic function-like objects.
This base class is no longer needed; it has been deprecated since C++11 and was removed
with C++17, since library code today relies on _traits templates_ to query for some feature
like a function operator, and contemporary code even relies on _concepts_ to express such a
requirement from the code depending on it. Today it is sufficient just to expose an
accessible function call operator.]
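
The lookup mechanism can be tried out without Boost. The following is a minimal sketch,
under the assumption that a stand-in template `sketch::hash` mirrors the Boost template
quoted above; the namespace `media` and the class `Clip` are hypothetical.

[source,C]
----
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

namespace sketch {   // stand-in for the library side (not the actual Boost code)

    template<class T>
    struct hash
    {
        std::size_t operator() (T const& val) const
        {
            return hash_value (val);   // unqualified call: resolved through ADL
        }
    };
}

namespace media {    // hypothetical namespace holding the custom type

    class Clip
    {
        std::string name_;

    public:
        explicit Clip (std::string name) : name_{std::move (name)} { }

        // found by ADL, since it lives in the namespace of the argument type
        friend std::size_t hash_value (Clip const& clip)
        {
            return std::hash<std::string>{} (clip.name_);
        }
    };
}
----

Note that `sketch::hash` knows nothing about `media::Clip`; the association is established
only when the template is instantiated for that type.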

To further facilitate providing custom hash functions, boost defines a function
+boost::hash_combine(size_t& seed, T const& value)+, which hashes the given value and
mixes the result into the seed, allowing to _chain up_ the hash values of the parts
forming a composite data structure.

- see Lumiera's link:http://git.lumiera.org/gitweb?p=LUMIERA;a=blob;f=src/proc/asset/category.hpp;h=b7c8df2f2ce69b0ccf89439954de8346fe8d9276;hb=master#l104[asset::Category]
  for a simple usage example
- our link:http://git.lumiera.org/gitweb?p=LUMIERA;a=blob;f=src/lib/symbol-impl.cpp;h=9e09b4254ac57baefeb0a0c06ccd423318e923c1;hb=master#l67[lib::Symbol datatype]
  uses the standard implementation of a string hash function, combining the hashes
  of the individual characters.
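
To illustrate the chaining of hash values, here is a minimal sketch which does not depend
on Boost. The helper `hashCombine` is a stand-in following the mixing formula classically
used by Boost (newer Boost releases mix differently), and `MediaKey` is a hypothetical
composite type.

[source,C]
----
#include <cstddef>
#include <functional>
#include <string>

// stand-in for boost::hash_combine: hash the value and mix it into the seed
template<class T>
void hashCombine (std::size_t& seed, T const& value)
{
    seed ^= std::hash<T>{}(value) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// hypothetical composite data structure
struct MediaKey
{
    std::string kind;
    std::string path;
};

// chain up the hashes of all parts, starting from a zero seed
inline std::size_t hash_value (MediaKey const& key)
{
    std::size_t seed = 0;
    hashCombine (seed, key.kind);
    hashCombine (seed, key.path);
    return seed;
}
----

Since the seed is carried along from part to part, the result depends on all the parts
which were combined and on the sequence in which this happened.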

Hash-chaining
~~~~~~~~~~~~~
We use a dedicated function `lib::hash::combine(s,h)` to join the hashes of several sources (components).
This usage pattern was pioneered by Boost and is based on the
https://github.com/aappleby/smhasher/blob/master/src/MurmurHash2.cpp[Murmur-2.64A] hash algorithm.

WARNING: as of [yellow-background]#11/2024#, portability of hash values is an unresolved issue;
         this code does not work on 32bit systems https://issues.lumiera.org/ticket/722#comment:10[see #722]
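
The mixing step of that algorithm can be sketched as shown below. This is not the actual
code of `lib::hash::combine`; the function name `combine64` is hypothetical, and the sketch
uses `std::uint64_t` explicitly instead of `size_t`, since the constants are 64bit wide.

[source,C]
----
#include <cstdint>

// hypothetical stand-in, following the inner mixing step of MurmurHash64A
constexpr std::uint64_t combine64 (std::uint64_t seed, std::uint64_t hash)
{
    constexpr std::uint64_t m = 0xc6a4a7935bd1e995ULL;  // multiplier used by MurmurHash64A
    constexpr int r = 47;                               // shift used by MurmurHash64A

    // scramble the incoming component hash
    hash *= m;
    hash ^= hash >> r;
    hash *= m;

    // fold it into the running value
    seed ^= hash;
    seed *= m;
    return seed;
}
----

A chain is formed by feeding each result back in as seed, for example
`combine64 (combine64 (0, h1), h2)`. Both the multiplication by an odd constant and the
xor-shift are reversible operations on 64bit values, so for a fixed seed no two component
hashes are mapped to the same result.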


LUID values
-----------
Lumiera's uniform identifier values shouldn't be confused with regular hash values.
The purpose of LUID values is to use just plain random numbers as ID values. But, because
such an incredibly large number space (128bit) is used, we can assume any collision
between random LUIDs to be so unlikely that this possibility can reasonably be ignored
altogether. Let's say, the collision of random LUID values won't ever happen, same as
the meltdown of an atomic power plant, which, as we all know, won't ever happen either.

Relation to hash values
~~~~~~~~~~~~~~~~~~~~~~~
When objects incorporate such a unique LUID, this makes for a prime candidate to
derive hash values as a side-effect of that design: Since incorporating an LUID typically
means that this object has a _distinguishable identity_, all objects with the same LUID
should be considered _equivalent_ and thus hash to the same value. Consequently we can just
use a +size_t+ prefix of the LUID bitstring as hash value, without any further calculations.
This relies on LUIDs being generated from a reliable _entropy source._
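
Taking such a prefix can be sketched as follows. The struct `Luid` is a hypothetical
stand-in holding 128 bits of random data; the actual LUID data type may be laid out
differently.

[source,C]
----
#include <array>
#include <cstddef>
#include <cstring>

// hypothetical stand-in for a 128bit LUID
struct Luid
{
    std::array<unsigned char, 16> bytes;

    bool operator== (Luid const& o) const { return bytes == o.bytes; }
};

// use the leading sizeof(size_t) bytes as hash value, without any calculation
inline std::size_t hash_value (Luid const& luid)
{
    static_assert (sizeof(std::size_t) <= sizeof(Luid), "prefix must fit into the LUID");
    std::size_t prefix;
    std::memcpy (&prefix, luid.bytes.data(), sizeof(prefix));
    return prefix;
}
----

Equal LUIDs yield equal hash values by construction, while the width of the prefix (and
thus the hash value) still depends on the platform.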