*Originally posted by wormer*

**With age, somatic cells are thought to accumulate genomic scars as a result of the inaccurate repair of double-stranded breaks by NHEJ. Estimates based on the frequency of breaks in primary human fibroblasts suggest that by the age of 70 each human somatic cell carries some 2000 NHEJ-induced mutations due to inaccurate repair. If these mutations were distributed ra ...[text shortened]... you expect to be affected?
**

(assume 2.5% of the genome is crucial information provided by genes)

Suppose the probability that any given mutation lands in coding DNA is x (which you seem to be saying is 2%, so 0.02). There are N mutations in total (N = 2000). Then the average number of coding mutations is:

<n> = 0 * probability that all mutations are non-coding + 1 * probability of exactly 1 coding mutation + 2 * probability of exactly 2 coding mutations + ... + N * probability that all mutations are in coding DNA.

Let's look at the typical term. For that we need the probability of exactly n coding mutations. The probability of getting n coding mutations in a row is x^n (x to the power of n). The probability of then getting (N - n) non-coding mutations is (1 - x)^(N - n). We also have to take into account that the n coding mutations and (N - n) non-coding mutations can come in any order. The number of possible orderings is given by the binomial coefficient (which I'll write C(N, n)). So the typical term in the above sum is:

n * C(N, n) * x^n * (1 - x)^(N - n)
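If you want to play with the numbers, here is a rough Python sketch of the probability part of that term, C(N, n) * x^n * (1 - x)^(N - n). The function name `coding_mutation_pmf` is just something I made up for illustration, and it works with logarithms because C(2000, n) is far too large to fit in a floating point number.

```python
import math

def coding_mutation_pmf(n, N, x):
    """Probability of exactly n coding mutations out of N (requires 0 < x < 1)."""
    # log C(N, n) via the log-gamma function, to avoid astronomically large integers
    log_binom = math.lgamma(N + 1) - math.lgamma(n + 1) - math.lgamma(N - n + 1)
    log_p = log_binom + n * math.log(x) + (N - n) * math.log(1 - x)
    return math.exp(log_p)

N, x = 2000, 0.02
probs = [coding_mutation_pmf(n, N, x) for n in range(N + 1)]

# The probabilities over all possible n should add up to (very nearly) 1.
print(sum(probs))
# The single most likely number of coding mutations.
print(max(range(N + 1), key=probs.__getitem__))
```

Multiplying each probability by n and adding them up gives the average we are after.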

To do the sum it helps to introduce a new variable y = x / (1 - x). Since x^n * (1 - x)^(N - n) = y^n * (1 - x)^N, we can rewrite the typical term as:

n * C(N, n) * y^n * (1 - x)^N

So the average number of coding mutations is now:

<n> = (1 - x)^N * sum(n = 0 ... N) n * C(N, n) * y^n

Since d/dy y^n = n * y^(n - 1), we have n * y^n = y * d/dy y^n, so the factor of n can be traded for a derivative taken outside the sum:

<n> = y*(1 - x)^N * d/dy sum(n = 0 ... N) C(N, n) * y^n

The remaining sum is just the binomial expansion of (1 + y)^N, so it is now straightforward:

<n> = y * (1 - x)^N * d/dy (1 + y)^N = y * (1 - x)^N * [N * (1+y)^(N - 1)]

1 + y = 1/(1 - x) so that:

<n> = [x/(1 - x)] * [(1 - x)^N] * N * [1/(1 - x)]^(N - 1) = Nx = 2000 * 0.02 = 40
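As a sanity check on the algebra, the original sum can also be added up term by term in exact rational arithmetic and compared with N * x. This is only a sketch: `mean_coding_mutations` is a name I invented, and it assumes x can be written as a fraction (here 2/100).

```python
from fractions import Fraction
from math import comb

def mean_coding_mutations(N, x):
    """Sum n * C(N, n) * x^n * (1 - x)^(N - n) over n = 0 ... N, exactly."""
    x = Fraction(x)
    a, b = x.numerator, x.denominator  # x = a / b, so 1 - x = (b - a) / b
    # Every term shares the denominator b^N, so add up the integer numerators first.
    numerator = sum(n * comb(N, n) * a**n * (b - a)**(N - n) for n in range(N + 1))
    return Fraction(numerator, b**N)

N, x = 2000, Fraction(2, 100)
print(mean_coding_mutations(N, x))  # 40
assert mean_coding_mutations(N, x) == N * x
```

Because everything is done with integers and fractions, there is no rounding involved and the brute-force sum matches N * x exactly.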

So twhitehead got the right answer.

The only catch is that we may have to take into account the possibility that a coding mutation falls in a critical gene which produces a highly conserved protein, so that the mutation kills the cell. Some of these mutations might even kill the organism, for example if it is on the PrP gene causing CJD before age 70. Cells or people killed this way are not around to be counted at age 70, so we need to factor those mutations out. If there are m coding bases in total, of which p are critical coding bases, and the genome is of length l, then where x was m/l we'd need to replace it with (m - p)/(l - p). If p is small compared with m, the correction is negligible and you don't need to worry about it.
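To get a feel for how much that correction matters, here is a small sketch. All three numbers below are made up for illustration only: a round genome length of 3 billion bases, 2% of it coding, and a purely hypothetical 1% of the coding bases being critical. The function name `adjusted_coding_fraction` is mine as well.

```python
def adjusted_coding_fraction(m, p, l):
    """Return (m - p) / (l - p): the coding fraction after removing p critical bases."""
    if not (0 <= p <= m <= l) or p >= l:
        raise ValueError("need 0 <= p <= m <= l and p < l")
    return (m - p) / (l - p)

# Hypothetical, illustrative values only (not measured data).
l = 3_000_000_000   # assumed genome length in bases
m = 60_000_000      # assumed coding bases (2% of l)
p = 600_000         # assumed critical coding bases (1% of m)
N = 2000            # mutations per cell, from the question

x_plain = m / l
x_adjusted = adjusted_coding_fraction(m, p, l)
print(N * x_plain, N * x_adjusted)
```

With p much smaller than m the two printed averages come out close to each other, which is the sense in which the correction can be ignored.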