Blue Collar Bioinformatics

Note: new posts have moved to http://bcb.io/. Please look there for the latest updates and comments.

Evaluating key-value and document stores for short read data

with 34 comments

Designing responsive web interfaces for analyzing short read data requires techniques to rapidly retrieve and display all details associated with a read. My own work on this has been relying heavily on Berkeley DB key/value databases. For example, an analysis will have key/value stores relating the read to aligned positions in the genome, counts of reads found in a sequencing run, and other associated metadata.
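
As a rough illustration of that kind of local lookup, here is a minimal sketch using Python's standard dbm module as a stand-in for the Berkeley DB bindings; the file name and read sequences are invented for the example.

    import dbm

    # Open (or create) a local key/value store mapping read sequences to counts.
    # dbm stores bytes, so counts are serialized as strings.
    db = dbm.open("read_counts.db", "c")
    db[b"GATTACAGATTACAGATTACA"] = b"42"
    db[b"TTAGGCATTAGGCATTAGGCA"] = b"7"
    db.close()

    # Later, e.g. from a web front end, look up the frequency for a single read.
    db = dbm.open("read_counts.db", "r")
    print(int(db[b"GATTACAGATTACAGATTACA"]))
    db.close()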

A recent post by Pierre on storing SNPs in CouchDB encouraged me to evaluate my choice of Berkeley DB for storage. My goals were to move to a network accessible store, and to potentially incorporate the advanced query features associated with document oriented databases. I had found myself embedding too much logic about database location and structure into my code while developing with Berkeley DB.

The main factors considered in evaluating the key/value and document stores for my needs were:

  • Network accessible
  • Python library support
  • Data loading time
  • Data query time
  • File storage space
  • Implementation of queries beyond key/value retrieval

Another relevant consideration, which is not as important to my work but might be to yours, is replicating and distributing a database across multiple servers. These are major issues when developing websites with millions of concurrent users; in science applications I am less likely to find that kind of popularity for short read SNP analysis.

Leonard had recently posted a summary of his experience evaluating distributed key stores, which served as an excellent starting point for looking at the different options out there. I decided to do an in-depth evaluation of three stores:

  • Tokyo Cabinet, and its network server Tokyo Tyrant, using the pytyrant library
  • CouchDB, using the couchdb-python library
  • MongoDB, using pymongo

Tokyo Cabinet is a key/value store which received a number of excellent reviews for its speed and reliability. CouchDB and MongoDB are document oriented stores which offer additional query capabilities.
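
To make the comparison concrete, here is a minimal sketch of storing and fetching a single read count against each of the three stores using the libraries listed above. The hosts, ports and database names are placeholders, and the calls assume reasonably current releases of pytyrant, couchdb-python and pymongo rather than the exact versions used for these benchmarks.

    import pytyrant
    import couchdb
    import pymongo

    read, count = "GATTACAGATTACAGATTACA", 42

    # Tokyo Cabinet via a remote Tokyo Tyrant server; values are plain strings.
    tyrant = pytyrant.PyTyrant.open("127.0.0.1", 1978)
    tyrant[read] = str(count)
    print("tyrant:", tyrant[read])

    # CouchDB: one document per read, keyed by the read sequence (first insert).
    couch = couchdb.Server("http://127.0.0.1:5984/")
    cdb = couch["read_counts"] if "read_counts" in couch else couch.create("read_counts")
    cdb[read] = {"count": count}
    print("couchdb:", cdb[read]["count"])

    # MongoDB: one document per read in a collection, with the read as _id.
    mongo = pymongo.MongoClient("mongodb://127.0.0.1:27017/")
    col = mongo["reads"]["counts"]
    col.replace_one({"_id": read}, {"_id": read, "count": count}, upsert=True)
    print("mongodb:", col.find_one({"_id": read})["count"])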

To evaluate the performance of these three stores, frequency counts for 2.8 million unique reads were loaded. Real stores would have additional details on each read, but the general idea is the same: a large number of relatively small documents. Each of the stores was accessed across the network on a remote machine. The 2.8 million reads were loaded, and then a half million records were retrieved from the database. The Python scripts are available on GitHub. The table below summarizes the results:

Database               Load time        Retrieval time    File size
Tokyo Cabinet/Tyrant   12 minutes       3 1/2 minutes     24MB
CouchDB                5 1/2 minutes    14 1/2 minutes    236MB
MongoDB                3 minutes        4 minutes         192-960MB
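
The loading and retrieval scripts linked above work against real frequency files; as a rough sketch of the shape of the benchmark, the MongoDB case looks something like the following, with synthetic reads, a local server and a much smaller record count standing in for the real data.

    import time
    import pymongo

    client = pymongo.MongoClient("mongodb://127.0.0.1:27017/")
    col = client["reads"]["counts"]

    # Load: one small document per unique read (synthetic data for the sketch).
    docs = [{"_id": "read_%07d" % i, "count": i % 50 + 1} for i in range(100000)]
    start = time.time()
    col.insert_many(docs)
    print("load: %.1fs" % (time.time() - start))

    # Retrieve: fetch records one at a time, as a responsive front end would.
    start = time.time()
    for i in range(0, 100000, 5):
        col.find_one({"_id": "read_%07d" % i})
    print("retrieve: %.1fs" % (time.time() - start))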

For CouchDB, I initially reported much larger numbers, which were improved dramatically with some small tweaks. With a naive loading strategy, loading took around 22 hours and produced database files of roughly 6GB. Thanks to tips from Chris and Paul in the comments, the loading script was modified to use bulk loading. With this change, loading times and file sizes are in the range of the other stores, and the new times are reflected in the table. There also appear to be some tweaks that can be made to favor speed over reliability; these tests were done with the standard configuration. The message here is to dig deeper if you find performance issues with CouchDB; small differences in usage can provide huge gains.
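
For reference, here is a minimal sketch of the bulk loading approach with couchdb-python, which batches documents through the _bulk_docs API via Database.update(); the database name, batch size and document layout are illustrative rather than taken from the updated loading script.

    import couchdb

    couch = couchdb.Server("http://127.0.0.1:5984/")
    db = couch.create("read_counts_bulk")

    batch, batch_size = [], 1000
    for i in range(100000):  # stand-in for iterating over the real frequency file
        batch.append({"_id": "read_%07d" % i, "count": i % 50 + 1})
        if len(batch) >= batch_size:
            db.update(batch)  # one POST to _bulk_docs instead of 1000 single saves
            batch = []
    if batch:
        db.update(batch)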

Based on the loading tests, I decided to investigate MongoDB and Tokyo Cabinet/Tyrant further. CouchDB loading speeds were improved dramatically by the changes mentioned above; however, retrieval speeds are about 3 times slower. Fetching single records from the database is the most important speed consideration for my work since it happens more frequently than loading, and is reflected in the responsiveness of the web front end accessing the database. It is worth investigating whether client code changes can also speed up CouchDB retrieval. For Tokyo Cabinet/Tyrant and MongoDB the main performance trade-off was disk space usage for the database files. Tokyo Cabinet loads about 4 times slower, but maintains a much more compact file representation. To better understand how MongoDB stores files, I wrote to the community mailing list and received quick and thoughtful responses on the issue. See the full thread if you are interested in the details; in summary, MongoDB pre-allocates space to improve loading time, and this allocation becomes less of an issue as the database size increases.

Looking beyond performance issues, Tokyo Cabinet/Tyrant and MongoDB represent two ends of the storage spectrum. MongoDB is a larger, full featured database providing complex query operations and management of multiple data stores. Tokyo Cabinet and Tyrant provide a lightweight solution for key/value retrieval. Each separate remote Tokyo Cabinet data store requires a Tyrant instance to be running. My work involves generating many different key/value databases for individual short read sequencing projects. To reasonably achieve this with remote Tyrant stores, I would need to develop a server that could start and manage Tyrant instances on demand. Additionally, if my query needs change, the key/value paradigm of Tokyo Cabinet would require generating additional key/value stores. For instance, we could not readily retrieve reads with a frequency greater than a defined threshold.
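
As an example of the kind of query that goes beyond straight key/value retrieval, a document store can answer the frequency threshold question directly. This sketch assumes the MongoDB collection layout from the examples above, with an index added so the range query does not scan every document; it is an illustration rather than code from the analysis scripts.

    import pymongo

    client = pymongo.MongoClient("mongodb://127.0.0.1:27017/")
    col = client["reads"]["counts"]

    # Index the count field, then pull back all reads above a frequency threshold.
    col.create_index("count")
    threshold = 10
    for doc in col.find({"count": {"$gt": threshold}}):
        print(doc["_id"], doc["count"])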

In conclusion, both Tokyo Cabinet/Tyrant and MongoDB proved to be excellent solutions for managing the volume and style of data associated with short read experiments. MongoDB provided additional functionality in the form of remote data store management and advanced queries which will be useful for my work; I’ll be switching my Berkeley DB stores over to MongoDB and continuing to explore its capabilities. I would welcome hearing about the solutions others have employed for similar storage and query issues.

Written by Brad Chapman

May 10, 2009 at 10:28 am

Posted in storage


34 Responses


  1. a good summary. MongoDB looks promising.
    would also be nice to see the timings using pytc:
    http://github.com/rsms/tc/tree/master
    which should be much faster than pytyrant as it is using the c-api, rather than the client protocol. then one could use pytc for fast loading and pytyrant if remote connections are needed.

    any thoughts on redis? (http://code.google.com/p/redis/)

    brentp

    May 10, 2009 at 11:13 am

  2. Great writeup! We’re using the ruby bindings for Mongo and our own journey through each of these document stores supports the numbers shown here. We’re now running http://tweetcongress.org and several other sites off a single MongoDB instance.

    Wynn Netherland

    May 10, 2009 at 11:44 am

  3. CouchDB’s design puts reliability above speed, so for each document saved, it will write out a full index header and fsync to disk. This means that there is no such thing as a fixup phase if your server is rebooted unexpectedly.

    To get high performance writes into CouchDB, it’s best to group them into batches of 1k or more documents, and save them using the bulk_docs API. For instance, CouchRest, the Ruby library, has an option to do this automatically.

    Using this bash/curl script I’m able to insert roughly 3k docs per second into CouchDB.


    #!/bin/bash
    # usage: time benchbulk.sh dbname
    # it takes about 30 seconds to run on my old MacBook

    BULKSIZE=1000
    DOCSIZE=100
    INSERTS=10
    ROUNDS=10
    DBURL="http://localhost:5984/$1"
    POSTURL="$DBURL/_bulk_docs"

    function make_bulk_docs() {
        ROW=0
        SIZE=$(($1-1))
        START=$2
        BODYSIZE=$3
        BODY=$(printf "%0${BODYSIZE}d")
        echo '{"docs":['
        while [ $ROW -lt $SIZE ]; do
            printf '{"_id":"%020d", "body":"'$BODY'"},' $(($ROW + $START))
            let ROW=ROW+1
        done
        printf '{"_id":"%020d", "body":"'$BODY'"}' $(($ROW + $START))
        echo ']}'
    }

    echo "Making $INSERTS bulk inserts of $BULKSIZE docs each"
    echo "Attempt to delete db at $DBURL"
    curl -X DELETE $DBURL -w\\n
    echo "Attempt to create db at $DBURL"
    curl -X PUT $DBURL -w\\n
    echo "Running $ROUNDS rounds of $INSERTS inserts to $POSTURL"

    RUN=0
    while [ $RUN -lt $ROUNDS ]; do
        POSTS=0
        while [ $POSTS -lt $INSERTS ]; do
            STARTKEY=$[ POSTS * BULKSIZE + RUN * BULKSIZE * INSERTS ]
            echo "startkey $STARTKEY bulksize $BULKSIZE"
            echo $(make_bulk_docs $BULKSIZE $STARTKEY $DOCSIZE) | curl -T - -X POST $POSTURL -w%{http_code}\ %{time_total}\ sec\\n -o out.file 2> /dev/null &
            let POSTS=POSTS+1
        done
        wait
        let RUN=RUN+1
    done

    curl $DBURL -w\\n


    The docs my script is inserting are a bit bigger than the key/value pairs you are working with, so I’m guessing with your data we’d see closer to 4k docs/second.

    At 4k docs/second, 2.8 million docs will take a little under 12 minutes, which is roughly equal to Tokyo Cabinet in speed. I’ll see what I can do about updating your performance script to reflect bulk loading.

    J Chris A

    May 10, 2009 at 11:58 am

  4. Brad,

    Testing on a Dual Core 2 GHz, 2 GiB RAM MacBook, I can get 3.2 million short read rows inserted into CouchDB in just under 16 minutes. Using Tokyo Cabinet I can do it in 2 minutes. I’m pretty sure that once you make a TC db you can serve it up with Tyrant, but the port won’t install so I can’t test this.

    Database sizes were 318 MiB for CouchDB and 152 MiB for Tokyo Cabinet. Not sure why the TC db is so much bigger than the tyrant one, though I didn’t bother calling optimize when loading it.

    Code is online here:
    http://github.com/davisp/bcbb.git

    The data file I used and preprocessing it are in the comments on the freq_to_couchdb.py script.

    22+ hours for CouchDB makes me think you’re using version 0.8, which is doing a double fsync for every write to the db. If so, there have been quite a few improvements and options that allow you to choose speed over durability.

    Paul J. Davis

    May 10, 2009 at 1:50 pm

  5. Brent;
    pytc would definitely speed things up. This is a good post comparing timings with Tokyo Cabinet and Tyrant:

    http://anyall.org/blog/2009/04/performance-comparison-keyvalue-stores-for-language-model-counts/

    For my case I was interested in keeping this distributed if possible; hence the focus on pytyrant.

    redis does look good. If I ended up feeling like a key/value store worked better than a document store I likely would have evaluated it as well. This space is definitely blessed with lots of good choices.

    Wynn;
    Thanks much. I am glad to hear my thoughts were in line with others’ experience.

    J Chris and Paul;
    I appreciate the pointers in the right direction. I figured I was missing something critical; the bulk loading is definitely the way to go. Everything here was done with version 0.9, but without any tuning or modifications at all.

    Paul, for the Cabinet/Tyrant size differences, my test here was on a server with these options:

    ttserver test.tcb#opts=ld#bnum=1000000#lcnum=10000

    The ‘d’ gives deflate encoding, which likely explains the difference.

    Brad Chapman

    May 10, 2009 at 7:18 pm

  6. Tokyo Tyrant supports copying and synchronizing remotely. It’s entirely feasible that you could load, copy and clear data sets pretty easily on demand using Tyrant.

    I’m interested in your thoughts on Tokyo table support. The table store allows for document-like schema-less storage. Also, you mentioned the need for querying; the Tokyo table store allows for much more complex queries.

    Also, as J Chris pointed out about CouchDB, Tokyo also supports bulk read and bulk write. Using ruby-tokyotyrant (native c extension) I can bulk write 2.8m table records in 2m 22secs.

    Your closing paragraph seemed light on details about why you chose Mongo; would you care to elaborate on that?

    ActsAsFlinn

    May 10, 2009 at 10:20 pm

  7. Chris and Paul;
    I had an opportunity to re-run the loading and retrieval this morning using your suggestions for bulk loading, and updated the post to include the numbers doing this the right way. I hope these better reflect your experience using CouchDB.

    Flinn;
    Thanks for the tips on bulk loading for Tyrant. The time difference between Tyrant and MongoDB for loading really didn’t bother me since loading is an infrequent operation, but that could be very useful for other cases.

    The table support in Tokyo C/T does look like it can solve the issue of more complicated queries. The pytyrant bindings currently don’t expose any of that functionality, so I didn’t have a chance to play with it.

    In terms of my choice of MongoDB, it comes down to providing additional functionality without a significant performance hit. Using Tokyo Cabinet/Tyrant, I’d prefer to not have to write an on-demand server to start various Tyrant instances when needed. Versus CouchDB, my main concern was performance; of course, with the fixes mentioned above the performance of MongoDB and CouchDB does become more comparable.

    Brad Chapman

    May 11, 2009 at 8:04 am

  8. fwiw MongoDB does have a bulk insert feature — that might make the load times even faster. Not sure if that was used, but it looks like it was fast enough anyway. Nice article.

    dwight

    May 11, 2009 at 8:46 am

  9. Brad,

    Nice write up. I would be interested in your thoughts on a new cloud-based database service we are launching. It is a document-style database service offered via API. Still in stealth mode, but we have an alpha version which is live on AWS. Since you are familiar with these other options your review of this early version of the service would be helpful to us. You can get a test account at http://www.apstrata.com.

    Michael Liss

    May 11, 2009 at 12:01 pm

  10. You wrote

    My work involves generating many different key/value databases for individual short read sequencing projects.
    To reasonably achieve this with remote Tyrant stores, I would need to develop a server that could start and manage Tyrant instances on demand.

    I’m not sure what you have in mind. Would you not add a “project id” to separate the keys by project (keys look like “project-id:sequence-key”) — while using just one Tyrant instance?

    Stephan

    Stephan Wehner

    May 18, 2009 at 10:44 pm

    • Stephan;
      Agreed completely; combining identifiers in the key does work and is a trick I’d used in the past with Berkeley DB databases. My impetus to explore document stores was to try and remove this type of logic from my code. In my week or so of using MongoDB since this writeup, I’ve found the abstraction available with database, collection and document levels very useful in doing this.

      Brad

      Brad Chapman

      May 19, 2009 at 7:04 am

  11. Excellent writeup. I would suggest you check out scalaris too.

    John Laker

    May 19, 2009 at 5:52 am

  12. […] are several good blog posts around that go into more detail for each […]

  13. […] is built for speed. Anything that would slow it down (aka transactions) have been left on the chopping block. Instead […]

  14. […] Evaluating key-value and document stores for short read data […]

  15. […] is built for speed. Anything that would slow it down (aka transactions) have been left on the chopping block. Instead […]

  16. […] flexibility, speed and scalability. Examples include CouchDB, MongoDB and Tokyo Cabinet. Pierre and Brad have described some examples of using CouchDB with bioinformatics data and Rich has started a […]

  17. […] Discussion of the Tokyo suite vs. Mongo: https://bcbio.wordpress.com/2009/05/10/evaluating-key-value-and-document-stores-for-short-read-data/ […]

  18. Been having trouble trying to extract subsets of reads (mapped / unmapped) using Biopython. Think I will adapt your scripts to see how MongoDB works with 70 million short reads.

    Kevin

    March 2, 2010 at 8:37 pm

  19. The scaling question and concurrent request performance should be the main focus of this benchmark, since it’s all about scaling, concurrency and reliability. I think Couch, being built on Erlang, should scale better and handle concurrency better, simply because Erlang is built to solve these problems.

    pyrrhon

    March 12, 2010 at 9:59 am

  20. […] 2, 3, […]

  21. […] are several good blog posts around that go into more detail for each […]

  22. My testing result shows MongoDB is very promising too. Here is the result: http://bit.ly/aH7h3L

    Jun

    November 7, 2010 at 7:34 pm

  23. Jun’s post looks like spam to me; here are the bit.ly stats for his ‘perfect’ marketing: http://bit.ly/aH7h3L+

    kevin

    November 7, 2010 at 8:45 pm

  24. Hi Kevin, this was a true story, not spam. When I was doing evaluation on NoSQL trying to find a solution for our system, I read numerous blogs including this one. I wished at that time that someone could just tell me what I said in my blog, so I wouldn’t have wasted so much time figuring out which one I should choose. I’m just trying to help people out there. If you think my posts are spam, you can remove all of them, I really don’t care.

    Jun

    November 7, 2010 at 9:34 pm

    • Thanks Jun, no worries. A lot of comments in the spam filter are: “I really liked your site. Check out my link: [some short URL]” so your original probably just triggered some warning bells. Best of luck with MongoDB.

      Brad Chapman

      November 8, 2010 at 7:38 am

  25. Apologies Jun if I have offended you. Thanks for replying to clarify despite not caring what I said. In truth I have no right to delete the post and I didn’t click on the link to verify. Brad was right in saying why the alarm bells rang for me. Cheers

    Kevin

    November 8, 2010 at 10:37 am

  26. […] Evaluating key-value and document stores for short read data – Designing responsive web interfaces for analyzing short read data requires techniques to rapidly retrieve and display all details associated with a read. My own work on this has been relying heavily on Berkeley DB key/value databases. For example, an analysis will have key/value stores relating the read to aligned positions in the genome, counts of reads found in a sequencing run, and other associated metadata. A recent post by Pierre on storing SNPs in CouchDB encouraged me to evaluate my choice of Berkeley DB for storage. My goals were to move to a network accessible store, and to potentially incorporate the advanced query features associated with document oriented databases. … I decided to do an in-depth evaluation of three stores: Tokyo Cabinet, and its network server Tokyo Tyrant, using the pytyrant library; CouchDB, using the couchdb-python library; MongoDB, using pymongo. […]

  27. […] are several good blog posts around that go into more detail for each […]

  28. […] a 2009 benchmark comparing MongoDB and CouchDB for a specific bioinformatics application, there is one single set […]

  29. […] are several good blog posts around that go into more detail for each […]

  30. […] article tries to evaluate kv-stores with document stores like couchdb/riak/mongodb. These stores are better […]

  31. […] are several good blog posts around that go into more detail for each […]

