|CouchDB - JSON and B-trees and REST, oh my!
||[Jun. 2nd, 2013|09:15 pm]
Definitive Guide, but also from the Coursera Introduction to Data Science course and through an informative chat with necaris, who has used it extensively at Esplorio. The current draft of the Definitive Guide is rather out-of-date and has several long-open pull requests on GitHub, which doesn't exactly inspire confidence, but CouchDB itself appears to be actively maintained. I have yet to use CouchDB in anger, but here's what I've learned so far:I've been learning about the NoSQL database CouchDB, mainly from the |
- The JSON is not by default required to conform to any particular schema, but you can add validation functions to be called every time data is added to the database. These will reject improperly-formed data.
- CouchDB is at pains to be RESTful, to emit proper cache-invalidation data, and so on, and this is key to scaling it out: put a contiguous subset of (a consistent hash of) the keyspace on each machine, and build a tree of reverse HTTP proxies (possibly caching ones) in front of your database cluster.
- CouchDB's killer feature is probably master-to-master replication: if you want to do DB operations on a machine that's sometimes disconnected from the rest of the cluster (a mobile device, say), then you can do so, and sync changes up and down when you reconnect. Conflicts are flagged but not resolved by default; you can resolve them manually or automatically by recording a new version of the conflicted object. Replication is also used for load-balancing, failover and scaling out: you can maintain one or more machines that constantly replicate the master server for a section of keyspace, and you can replicate only a subset of keyspace onto a new database when you need to expand.
- CouchDB doesn't guarantee to preserve all the history of an object, and in particular replications only seem to send the most recent version; I think this precludes Git-style three-way merge from the conflicting versions' most recent common ancestor (and forget about Darcs-style full-history merging!).
- The cluster-management story isn't as good as for some other systems, but there are a couple of PaaS offerings.
- Queries/views and non-primary indexes are both handled using map/reduce. If you want to index on something other than the primary key - posts by date, say - then you write a map query which emits (date, post) pairs. These are put into another B-tree, which is stored on disk; clever things are done to mark subtrees invalid as new data comes in, and changes to the query result or index are calculated lazily. Since indices are stored as B-trees, it's cheap to get all the objects within a given range of secondary keys: all posts in February, for instance.
- CouchDB's reduce functions are crippled: attempting to calculate anything that isn't a scalar or a fixed-size object is considered Bad Form, and may cause your machine(s) to thrash. AFAICT you can't reduce results from different machines by this mechanism: CouchDB Lounge requires you to write extra merge functions in Twisted Python.
- There's a very limited SQL view engine, but AFAICT nothing like Hive or Pig that can take a complex query and compile it down into a number of chained map/reduce jobs. The aforementioned restrictions on reduce functions mean that the strategy I've been taught for expressing joins as map/reduce jobs won't work; I don't know if this limitation is fundamental. But it's IME pretty rare to require general joins in applications: usually you want to do some filtering or summarisation on at least one side.
- CouchDB can't quite make up its mind whether it wants to be a database or a Web application framework. It comes by default with an administration web app called Futon; you can also use it to store and execute code for rendering objects as HTML, Atom, etc. Such code (along with views, validations etc) is stored in special JSON objects called "design documents": best practice is apparently to have one design document for each application that needs to access the underlying data. Since design documents are ordinary JSON objects, they are propagated between nodes by replications.
- However, various standard webapp-framework bits are missing, notably URL routing. But hey, you can always use mod_rewrite...
- There's a tool called Erica (and an older one called CouchApp) which allows you to sync design documents with more conventional source-code directories in your filesystem.
- CouchDB is written in Erlang, and the functional-programming influence shows up in other places: most types of user-defined function are required to be free of side-effects, for instance. Then there's the aforementioned uses of lazy evaluation and the append-only nature of the system as a whole. You can extend it with your own Erlang code or embed it into an Erlang application, bypassing the need for HTTP requests.
tl;dr if you've ever thought "data modelling and synchronisation are hard, let's just stick a load of JSON files in Git" (as I have, on several occasions), then CouchDB is probably a good fit to your needs. Especially if your analytics needs aren't too complicated.
This entry was originally posted at http://pozorvlak.dreamwidth.org/175882.html. Please comment wherever and however is most convenient for you!