Never mind the length, feel the bandwidth - Beware of the Train


Never mind the length, feel the bandwidth [Mar. 22nd, 2011|12:24 pm]

A couple of days ago I saw a tweet from @TomCadwallader, claiming that
1 sperm has 37.5MB of DNA information in it. That means a normal ejaculation represents a data transfer of 1587GB in about 3 seconds.
That's cute, but there's a more interesting question lurking here: it may represent 1.5TB of data, but how much information is transferred? In other words, how many bits would be required to convey the same "message" if we compressed it as cleverly as possible?

[Also, three seconds? Either Tom's ignoring the time required for the initial protocol handshake, or he's doing it very, very wrong.]

First off, some Wikipedia-reading (it may surprise you to learn that I actually research these posts) brings Tom's figures into question. The raw information content of the human haploid genome is actually more like 770MB, of which 38.7MB is the X chromosome and 14.4MB is the Y chromosome. So a single sperm contains roughly 755MB if X, or 730MB if Y. Sperm density and semen volume vary wildly, but the adult average is apparently 3.5ml and 300 million sperm per ml. That's around a billion sperm per ejaculation, rather higher than the (1587GB/37.5MB) =~ 42,000 sperm in Tom's figures! This gives us a total raw information transfer of about 750 petabytes, and a bandwidth of 250PB/sec - comfortably exceeding the bandwidth of a fully-laden 747, let alone Andrew Tanenbaum's speeding truck full of videotapes.
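As a rough check on that arithmetic, here is a minimal Python sketch. The constants are the figures quoted in this post; the variable names, the decimal units (1MB = 10^6 bytes) and the round 750MB-per-sperm figure are assumptions made for illustration.

```python
# Figures quoted in the post; decimal units (1 MB = 10**6 bytes) are assumed.
MB, GB, PB = 10**6, 10**9, 10**15

volume_ml = 3.5               # average ejaculate volume
sperm_per_ml = 300e6          # average sperm density
bytes_per_sperm = 750 * MB    # assumed round figure between 730MB (Y) and 755MB (X)
duration_s = 3                # Tom's figure for the transfer time

sperm_count = volume_ml * sperm_per_ml
raw_bytes = sperm_count * bytes_per_sperm
bandwidth = raw_bytes / duration_s

# Sperm count implied by the tweet's own numbers (1587GB at 37.5MB each).
tweet_sperm_count = (1587 * GB) / (37.5 * MB)

print(f"sperm per ejaculation: {sperm_count:.3g}")
print(f"raw transfer:          {raw_bytes / PB:.0f} PB")
print(f"bandwidth:             {bandwidth / PB:.0f} PB/s")
print(f"sperm count in tweet:  {tweet_sperm_count:.0f}")
```

Using 1.05 billion sperm rather than a round billion nudges the total a little above 750PB, which is well within the slop of the input figures.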

[The relationship between karyotype and apparent sex is more complicated than that, of course: Tom might have Klinefelter's syndrome (XXY) or XYY syndrome, which would make the numbers come out differently. But let's assume he's XY for the sake of simplicity.]

But this data is actually highly redundant, and thus extremely compressible. To see why, we'll need to review some high-school biology. Most human cells are diploid, meaning that they have two copies of each chromosome (one from your mother and one from your father), giving 46 chromosomes in all. Sperm cells, on the other hand, are haploid, meaning that they only have one copy of each chromosome. Sperm cells are formed by meiosis: for each type of chromosome, the new sperm cell is randomly assigned either the mother's copy of that chromosome or the father's copy.

This immediately gives us a huge data reduction. Once we've got a full copy of Tom's genome, all we need to store about each sperm is which chromosomes it's got. That's one bit (mother/father) per chromosome, or 23 bits per sperm. Using our estimate from above, that means we need 23 billion bits, or 2.9 gigabytes. We can also compress our representation of Tom's genome using standard compression algorithms: the entropy of the human genome is apparently 1.7 bits per base pair, as opposed to the 2 bits per base pair we used for our naive encoding, so we can fit his genome into a bit under 1.3GB. Our total information transfer is thus 4.2GB.
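The numbers in that paragraph can be reproduced with a short Python sketch. Treating Tom's full genome as one X-bearing plus one Y-bearing haploid set (755MB + 730MB) is an assumption about how the 1.3GB figure was reached, and the variable names are purely illustrative.

```python
BITS_PER_BYTE = 8
MB, GB = 10**6, 10**9

sperm_count = 1e9             # "around a billion", from the estimate above
chromosomes = 23              # one mother/father bit per chromosome

# Per-sperm chromosome choices, summed over the whole sample.
choice_bytes = sperm_count * chromosomes / BITS_PER_BYTE

# Assumed: the full genome is one X-bearing plus one Y-bearing haploid set,
# stored naively at 2 bits per base pair.
naive_genome_bytes = (755 + 730) * MB
entropy_ratio = 1.7 / 2.0     # quoted entropy vs. the naive 2 bits per base pair
genome_bytes = naive_genome_bytes * entropy_ratio

total_bytes = choice_bytes + genome_bytes

print(f"chromosome choices: {choice_bytes / GB:.2f} GB")
print(f"compressed genome:  {genome_bytes / GB:.2f} GB")
print(f"total:              {total_bytes / GB:.2f} GB")
```

The total lands between 4.1GB and 4.2GB depending on where you round.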

... but that would be too easy, wouldn't it?

It turns out that meiosis is more complicated than that: as part of the process, paired chromosomes exchange data - sorry, sections of DNA - in a process known as homologous recombination. I've been unable to determine how frequent this is - any biologists able to help me out here? - but apparently it's vastly more common at recombination hotspots, of which there are known to be over 25,000 in the human genome. If recombination's relatively rare then we could use a space-efficient sparse representation - "exchanges occurred at positions 12, 94, 509, 1337, 2424,...", but if it's common then we're going to have to store one bit per hotspot, to indicate whether or not a recombination occurred there. That's 25,000 bits per sperm, or just over 3kB each: scale it up by a billion, and we're storing 3TB of recombination data.
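To see where the sparse representation stops paying off, here is a rough Python sketch comparing the two encodings for a single sperm. It assumes exchanges only happen at the 25,000 known hotspots and that the sparse form stores a fixed-width hotspot index per exchange; the function names are hypothetical, and a cleverer scheme (delta coding, say) would do better.

```python
import math

HOTSPOTS = 25_000  # known recombination hotspots, per the figure above


def dense_bits(hotspots: int = HOTSPOTS) -> int:
    """One bit per hotspot: did a recombination occur there?"""
    return hotspots


def sparse_bits(events: int, hotspots: int = HOTSPOTS) -> int:
    """Store the index of each hotspot where an exchange occurred.

    Each index takes ceil(log2(hotspots)) bits. The cost of a length
    prefix or terminator is ignored to keep the sketch simple.
    """
    bits_per_index = math.ceil(math.log2(hotspots))
    return events * bits_per_index


def break_even_events(hotspots: int = HOTSPOTS) -> int:
    """Largest event count for which the sparse list is no bigger than dense."""
    return dense_bits(hotspots) // math.ceil(math.log2(hotspots))


if __name__ == "__main__":
    for events in (5, 50, 500, 5000):
        print(events, sparse_bits(events), dense_bits())
    print("break-even:", break_even_events())
```

With 15-bit indices the sparse list wins as long as there are no more than 1666 exchanges per sperm, so the answer to "how frequent is recombination?" decides which encoding to use.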

So, as it turns out, the total bandwidth of Tom's, er, "uploads" is around twice what he thought it was. I suggest that he not use that as a chat-up line, though.

(Deleted comment)
From: pozorvlak
2011-03-22 01:53 pm (UTC)
The human mutation rate is around 2.5 * 10^-8 per base pair per generation; at roughly 3*10^9 base pairs per sperm, I'd expect about 75 mutations per sperm. You can fit the location of each mutation into 32 bits, plus two bits to store whatever it's mutated to: that gives 75*34 = 2550 bits per sperm, or 0.3kB per sperm. Much more than the "choose your chromosome" data, but still less than the densely-represented homologous recombination data. That brings us up to 3.3TB :-)
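A quick Python sketch of that estimate, using only the figures quoted in this comment plus the billion-sperm estimate from the post; the variable names are illustrative, and the 32-bit position width is checked rather than assumed.

```python
import math

mutation_rate = 2.5e-8        # per base pair per generation
base_pairs = 3e9              # per sperm, roughly
sperm_count = 1e9             # estimate from the post

expected_mutations = mutation_rate * base_pairs

# A position among 3*10^9 base pairs needs ceil(log2(3e9)) bits,
# plus 2 bits for the new base.
position_bits = math.ceil(math.log2(base_pairs))
bits_per_mutation = position_bits + 2

bits_per_sperm = expected_mutations * bits_per_mutation
bytes_per_sperm = bits_per_sperm / 8
total_bytes = bytes_per_sperm * sperm_count

print(f"mutations per sperm: {expected_mutations:.0f}")
print(f"bits per mutation:   {bits_per_mutation}")
print(f"bytes per sperm:     {bytes_per_sperm:.0f}")
print(f"total:               {total_bytes / 1e12:.2f} TB")
```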

Imprinting I know from nothing. To the researchmobile!

From: pozorvlak
2011-03-22 02:36 pm (UTC)
From my reading of Wikipedia's article on genomic imprinting, it works by setting an im_the_daddy bit (in the form of a methyl group?) on somewhere between 80 and 3000 genes. But since the affected loci are the same for every sperm in the sample, we can just set the im_the_daddy bit once in the header, so it only adds one bit to the size of the compressed data :-)

On the other hand, I'm sure that other epigenetic mechanisms are more data-intensive.

(Deleted comment)
From: pozorvlak
2011-03-23 11:03 am (UTC)
Cheers. To be clear, I was talking about stripping off the daddy bits in software, as part of the compression step, rather than directly stripping them off the DNA.

From: atreic
2011-03-22 01:24 pm (UTC)
That's interesting...

755MB if Y, or 730MB if X - did you mean that the other way round?

From: pozorvlak
2011-03-22 01:44 pm (UTC)
I did, you're right. Fixed now. Thanks!

From: wormwood_pearl
2011-03-22 01:39 pm (UTC)
> [Also, three seconds? Either Tom's ignoring the time required for the initial protocol handshake, or he's doing it very, very wrong.]


From: necaris
2011-03-22 08:35 pm (UTC)
I love that euphemism :-)

(Deleted comment)
From: pozorvlak
2011-03-22 05:00 pm (UTC)
If your definitions are in a twist, then so are mine :-) I would indeed expect long repetitive sequences to reduce the entropy, which is why the entropy is 1.7 bits per base pair rather than 2. If you're saying that you're surprised it's that high, then obviously there aren't as many long repetitive sequences as you'd expected. Or Wikipedia's sources aren't reliable.

From: IanEiloart
2011-03-23 09:45 am (UTC)

Gross overestimate

Hmm, never mind the handshake, what about the 9 months (or 70 years?) it takes to decode the information? And that's for a properly formatted and syntactically viable sperm. Most aren't, and never get decoded. So, the data rate may be huge, but the transmission losses are pretty close to 100%.


From: pozorvlak
2011-03-23 11:01 am (UTC)

Re: Gross overestimate

I don't think decoding time should be accounted for in a bandwidth calculation (or do you count the time to play a video file as part of the time to transfer it?), but your point about transmission losses is well taken. At best you get 750MB/3TB = 0.025% of the data transferred :-)