Never mind the length, feel the bandwidth
[Mar. 22nd, 2011|12:24 pm]
A couple of days ago I saw a tweet from @TomCadwallader, claiming that
"1 sperm has 37.5MB of DNA information in it. That means a normal ejaculation represents a data transfer of 1587GB in about 3 seconds."
That's cute, but there's a more interesting question lurking here: it may represent 1.5TB of data, but how much information is transferred? In other words, how many bits would be required to convey the same "message" if we compressed it as cleverly as possible?
[Also, three seconds? Either Tom's ignoring the time required for the initial protocol handshake, or he's doing it very, very wrong.]
First off, some Wikipedia-reading (it may surprise you to learn that I actually research these posts) brings Tom's figures into question. The raw information content of the human haploid genome is actually more like 770MB, of which 38.7MB is the X chromosome and 14.4MB is the Y chromosome. A sperm carries only one of those two, so it contains roughly 755MB if X, or 730MB if Y. Sperm density and semen volume vary wildly, but the adult average is apparently 3.5ml and 300 million sperm per ml. That's around a billion sperm per ejaculation, rather higher than the (1587GB/37.5MB) =~ 42,000 sperm in Tom's figures! This gives us a total raw information transfer of about 750 petabytes, and a bandwidth of 250PB/sec - comfortably exceeding the bandwidth of a fully-laden 747, let alone Andrew Tanenbaum's speeding truck full of videotapes.
[The relationship between karyotype and apparent sex is more complicated than that, of course: Tom might have Klinefelter's syndrome (XXY) or XYY syndrome, which would make the numbers come out differently. But let's assume he's XY for the sake of simplicity.]
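For anyone who wants to check the arithmetic, here's a minimal Python sketch of the raw figures. The function names are made up for illustration, and the inputs are the rounded estimates quoted above (a billion sperm, roughly 750MB each, three seconds), so treat the output as back-of-envelope rather than measured.

```python
# Sanity check of the raw figures, using the post's rounded estimates.
MB, GB, PB = 10**6, 10**9, 10**15


def implied_sperm_count(total_bytes, bytes_per_sperm):
    """How many sperm a given total transfer implies."""
    return total_bytes / bytes_per_sperm


def raw_transfer(sperm_count, bytes_per_sperm, seconds):
    """Return (total bytes, bytes per second) for an uncompressed transfer."""
    total = sperm_count * bytes_per_sperm
    return total, total / seconds


# Tom's tweet: 1587GB in total, at 37.5MB per sperm.
tom_count = implied_sperm_count(1587 * GB, 37.5 * MB)

# Average volume (ml) times average density (sperm per ml).
estimated_count = 3.5 * 300e6

# Rounded inputs: a billion sperm, ~750MB each (between the X and Y sizes),
# delivered in Tom's 3 seconds.
total, rate = raw_transfer(1e9, 750 * MB, 3)

print(f"Tom's implied sperm count: {tom_count:,.0f}")
print(f"Estimated sperm count:     {estimated_count:,.0f}")
print(f"Raw transfer:  {total / PB:.0f} PB")
print(f"Raw bandwidth: {rate / PB:.0f} PB/s")
```

Tom's implied count comes out at a little over 42,000, against an estimate of just over a billion.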
But this data is actually highly redundant, and thus extremely compressible. To see why, we'll need to review some high-school biology. Most human cells are diploid, meaning that they have two copies of each chromosome (one from your mother and one from your father), giving 46 chromosomes in all. Sperm cells, on the other hand, are haploid, meaning that they only have one copy of each chromosome. Sperm cells are formed by meiosis: for each type of chromosome, the new sperm cell is randomly assigned either the mother's copy of that chromosome or the father's copy.
This immediately gives us a huge data reduction. Once we've got a full copy of Tom's genome, all we need to store about each sperm is which chromosomes it's got. That's one bit (mother/father) per chromosome, or 23 bits per sperm. Using our estimate from above, that means we need 23 billion bits, or 2.9 gigabytes. We can also compress our representation of Tom's genome using standard compression algorithms: the entropy of the human genome is apparently 1.7 bits per base pair, as opposed to the 2 bits per base pair we used for our naive encoding, so we can fit his genome into a bit under 1.3GB. Our total information transfer is thus 4.2GB.
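The same estimate as a rough Python sketch. Two things here are assumptions rather than anything stated above: the helper names are invented for illustration, and the raw size of Tom's full (diploid, XY) genome is taken to be the X-bearing and Y-bearing haploid figures added together, 755MB + 730MB.

```python
MB, GB = 10**6, 10**9
CHROMOSOMES = 23  # one mother/father choice per chromosome


def pack_choices(from_mother):
    """Pack 23 mother/father choices (True = mother's copy) into one integer."""
    if len(from_mother) != CHROMOSOMES:
        raise ValueError("expected one choice per chromosome")
    bits = 0
    for i, choice in enumerate(from_mother):
        if choice:
            bits |= 1 << i
    return bits


def choice_bytes(sperm_count):
    """Bytes needed to store 23 bits for every sperm."""
    return sperm_count * CHROMOSOMES / 8


def compressed_bytes(raw_bytes, entropy=1.7, naive=2.0):
    """Scale a 2-bits-per-base-pair encoding down to the quoted entropy."""
    return raw_bytes * entropy / naive


# ASSUMPTION: raw diploid XY genome = X-bearing set + Y-bearing set.
raw_diploid = (755 + 730) * MB

choices = choice_bytes(10**9)
genome = compressed_bytes(raw_diploid)

print(f"Chromosome choices: {choices / GB:.2f} GB")
print(f"Compressed genome:  {genome / GB:.2f} GB")
print(f"Total:              {(choices + genome) / GB:.2f} GB")
```

Without rounding the two parts first, the total lands slightly below the 4.2GB quoted above, which is close enough for this kind of estimate.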
... but that would be too easy, wouldn't it?
It turns out that meiosis is more complicated than that: as part of the process, paired chromosomes exchange data - sorry, sections of DNA - in a process known as homologous recombination. I've been unable to determine how frequent this is - any biologists able to help me out here? - but apparently it's vastly more common at recombination hotspots, of which there are known to be over 25,000 in the human genome. If recombination's relatively rare then we could use a space-efficient sparse representation ("exchanges occurred at positions 12, 94, 509, 1337, 2424,..."), but if it's common then we're going to have to store one bit per hotspot, to indicate whether or not a recombination occurred there. That's 25,000 bits per sperm, or just over 3kB each: scale it up by a billion, and we're storing 3TB of recombination data.
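To put a number on "relatively rare", here's a small Python sketch comparing the two representations. It assumes a deliberately naive sparse encoding (a fixed-width index per exchange, with no cleverer coding and no length prefix), and the function names and the 50-exchange example are hypothetical, not biological figures.

```python
HOTSPOTS = 25_000
SPERM_COUNT = 10**9


def dense_bits(hotspots=HOTSPOTS):
    """One bit per hotspot: did a recombination occur there or not?"""
    return hotspots


def sparse_bits(crossovers, hotspots=HOTSPOTS):
    """A fixed-width hotspot index for each recombination that occurred."""
    index_bits = (hotspots - 1).bit_length()  # bits needed to name one hotspot
    return crossovers * index_bits


def break_even_crossovers(hotspots=HOTSPOTS):
    """Largest crossover count at which the sparse list is no bigger than the bitmap."""
    return dense_bits(hotspots) // (hotspots - 1).bit_length()


per_sperm_bytes = dense_bits() / 8
total_tb = per_sperm_bytes * SPERM_COUNT / 10**12

print(f"Dense bitmap: {per_sperm_bytes:.0f} bytes per sperm, {total_tb:.3f} TB in total")
print(f"Bits per sparse index: {sparse_bits(1)}")
print(f"Sparse wins below about {break_even_crossovers()} exchanges per sperm")

# Hypothetical example: 50 exchanges per sperm, purely for illustration.
print(f"50 exchanges, sparse: {sparse_bits(50)} bits vs dense: {dense_bits()} bits")
```

With 25,000 hotspots each index needs 15 bits, so under this encoding the sparse list is the smaller option whenever a sperm has fewer than about 1,666 exchanges.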
So, as it turns out, the total bandwidth of Tom's, er, "uploads" is around twice what he thought it was. I suggest that he not use that as a chat-up line, though.