A container of mixed-up Mash

I’ve been trying to write a k-mer sketching based pipeline to do fast QC of the population structure of read data. The idea is that this will be a Nextflow pipeline that I could run on large collections of reads to check that the data roughly make sense: for example, if I think the reads should come from two populations, that there are differences between these two groups.

As part of this, I want to count k-mers in just the subset of hash functions shared by two sketches, but I can’t see how to do that with the existing tools. Many of the k-mer sketching tools have an interchange format in JSON that lets you see which hash functions are used, but you can’t get the count of times each function is observed (some info on the JSON format is linked in a discussion on github here; it’s what is output by mash using mash info -d). I’ve hacked the source code of Mash a bit to make it output what I think I need. This is slightly painful, as you need Cap’n Proto and lots of other dependencies to compile it, so I had to do this in a virtual machine where I could install the various dependencies.

Anyway, on my github fork of Mash here is a version that has a new mash info -t command, which dumps the hash functions in a sketch as text, alongside the count for each.
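As a rough illustration of the kind of comparison this is meant to enable, here is a minimal Python sketch. It assumes each sketch has already been parsed into a dict mapping a hash to its count; the function name `shared_hash_counts` and the toy values are hypothetical, and nothing here depends on the actual layout of the text that mash info -t writes.

```python
def shared_hash_counts(sketch_a, sketch_b):
    """Return {hash: (count_a, count_b)} for hashes present in both sketches.

    Each argument is assumed to be a dict mapping hash -> observed count.
    """
    # Dict key views support set operations, so this is the set of shared hashes.
    shared = sketch_a.keys() & sketch_b.keys()
    # Sort so the output order is reproducible between runs.
    return {h: (sketch_a[h], sketch_b[h]) for h in sorted(shared)}


if __name__ == "__main__":
    # Toy values, purely illustrative (not real Mash hashes).
    sample_1 = {101: 3, 205: 1, 999: 7}
    sample_2 = {205: 4, 999: 2, 1234: 5}

    for h, (count_1, count_2) in shared_hash_counts(sample_1, sample_2).items():
        print(h, count_1, count_2)
```

Restricting to the shared hashes first means any downstream statistic on the counts is computed over the same set of k-mers in both samples.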

Along the way, I’ve also made a Singularity container, so I don’t ever have to go through the process of building Mash again. As it might be useful to other people who also want to avoid doing that, I’ve put it on github too here, together with the definition file.

Update (15/4/2021) – I’ve just noticed that sourmash has commands to perform intersections of sketches etc. in sourmash signature intersect. I’m not sure if they are new or if they were always there and I missed them. I still don’t think I’ve quite wasted my life, though, as these commands work with flattened sketches, i.e. without abundance information.

James Cotton
Senior Staff Scientist

My research interests are in the genomics, and particularly the population genomics, of parasites, especially those that cause neglected tropical diseases.