Fixing GNU bash associative array insert speed
Bash uses linear search to insert values in to associative arrays. This is all well and good for small numbers of keys. I was adding millions1. I went poking around the bash source code today (2020-04-18) to confirm my suspicion and gauge the difficulty of adding an option to do something more sensible.
In less than a day after I reported it, there is a patch https://lists.gnu.org/archive/html/bug-bash/2020-04/msg00114.html My timing code and pre and post patch timings are here: https://github.com/eludom/snippits/tree/master/bash/tests
Here the steps I took and where I might go if I get serious about fixing the problem:
Get the source code
Find it
- find the homepage
- A quick bit of googling lead to the homepage http://savannah.gnu.org/projects/bash/
- use git
- For a minute it looked like GNU was still stuck in the bad old days of having to download a tarball and then apply a series of patches, but fortunately, it there is a git repo http://savannah.gnu.org/git/?group=bash
Download it
git clone https://git.savannah.gnu.org/git/bash.git
Build it
Bash follows a time honored build convention
./configure
make
Analyze it
- I read the NEWS file for any indication that associative arrays has been worked on to speed up associative array insert/look-ups. No indication that they had.
- I checked the git commit logs, which appear to be meaningful after Bash-4.4 patch 19. Nothing.
- With judicious use of grep (“grep-find in Emacs”) for
“associative” and “hash_search” it turns out that associative
array inserts (as all inserts) are done with use of the
“hash_search” function in
hashlib.c
- has_insert() begins as follows:
/* Create an entry for STRING, in TABLE. If the entry already
exists, then return it (unless the HASH_NOSRCH flag is set). */
BUCKET_CONTENTS *
hash_insert (string, table, flags)
char *string;
HASH_TABLE *table;
int flags;
{
BUCKET_CONTENTS *item;
int bucket;
unsigned int hv;
if (table == 0)
table = hash_create (0);
item = (flags & HASH_NOSRCH) ? (BUCKET_CONTENTS *)NULL
: hash_search (string, table, 0);
- and there it is, the linear search walking the list in
hash_search()
/* Return a pointer to the hashed item. If the HASH_CREATE flag is passed,
create a new hash table entry for STRING, otherwise return NULL. */
BUCKET_CONTENTS *
hash_search (string, table, flags)
const char *string;
HASH_TABLE *table;
int flags;
{
BUCKET_CONTENTS *list;
int bucket;
unsigned int hv;
if (table == 0 || ((flags & HASH_CREATE) == 0 && HASH_ENTRIES (table) == 0))
return (BUCKET_CONTENTS *)NULL;
bucket = HASH_BUCKET (string, table, hv);
for (list = table->bucket_array ? table->bucket_array[bucket] : 0; list; list = list->next)
{
/* This is the comparison function */
if (hv == list->khash && STREQ (list->key, string))
{
list->times_found++;
return (list);
}
}
Next steps
DONE Reach out to the maintainers
see if they would even entertain the idea of a patch
CANCELED Look for appropriate in-memory hash insert/lookup functions
- btrees ?
CANCELED Code it
CANCELED test it
CANCELED submit patch
See https://lists.gnu.org/archive/html/bug-bash/2020-04/msg00114.html
-
yes, there are many better tools for this job, but not in the constrained environment where this had to run. ↩︎