Modelling realistic email traffic – email size distribution


A network simulation needs a traffic model. Constant bit rate (CBR) sources are a common choice, but here I’m looking at email traffic.

For a more realistic email generator I wanted to find out how email sizes are distributed.

The only post I found that goes into detail is at

http://osdir.com/ml/freebsd.devel.net/2002-10/msg00203.html

The resolution of this data is quite coarse, though; the range between 0 and 4 KB in particular would be interesting to see in more detail.
Still, below is a sampler that generates random email sizes conforming to that distribution.
Between 900 bytes (a lower limit, since headers almost always need that much space) and 2048 bytes, a uniform distribution is used.

/* cumulative probabilities */
double probabilities[12] = {
	0.2353, /* < 2048 = pow(2, 0+11) */
	0.4917, /* < 4096 = pow(2, 1+11) */
	0.676,
	0.7958,
	0.8536,
	0.9047,
	0.933,
	0.9574,
	0.9724,
	0.9849,
	0.9996,
	1.0
};
 
double interpolate(double yleft, double yright, double x, double xleft, double xright) {
	double deltax = xright - xleft;
	double deltay = yright - yleft;
	double k = deltay/deltax;
	return (x - xleft) * k + yleft;
}
 
unsigned long getnextsize() {
	double prob = gsl_rng_uniform_pos(rng);
	int i = 0;
	double j;
	unsigned long hardlowerlimit = 900;
	/*unsigned long hardupperlimit = 10*1024*1024;*/
	unsigned long size;
 
	while(probabilities[i] <= prob) {
		i++;
	}
	assert(probabilities[i] > prob);
	if (i > 0) {
		j = interpolate(i-1, i, prob, probabilities[i-1], probabilities[i]);
		size = pow(2, j + 11);
		if(size < hardlowerlimit) {
			return getnextsize();
		}
	}else{
		size = get_random(hardlowerlimit, pow(2, 0+11));
	}
 
	return size;
}

The plot below shows the measured cumulative distribution function of the generated sizes, along with a comparison to my own mailbox (which might be atypical, who knows).

mailsize-cumulative-distribution
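To compare a set of sizes (generated, or taken from a mailbox) against the table, one option is to count how many fall below each power-of-two boundary. This is only a sketch of one way to do that measurement, not necessarily how the plot was produced; the function name empirical_cdf and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fill cdf[k] with the fraction of sizes that are
 * below 2^(k+11) bytes, for k = 0 .. nbuckets-1. */
void empirical_cdf(const unsigned long *sizes, size_t n,
                   double *cdf, int nbuckets) {
	int k;
	size_t s;

	/* Keep the shift below 32 bits so it is safe on any platform. */
	assert(nbuckets > 0 && nbuckets <= 20);

	for (k = 0; k < nbuckets; k++) {
		unsigned long limit = 1UL << (k + 11);
		size_t below = 0;

		for (s = 0; s < n; s++) {
			if (sizes[s] < limit) {
				below++;
			}
		}
		/* An empty input yields an all-zero CDF. */
		cdf[k] = n ? (double)below / (double)n : 0.0;
	}
}
```

Calling it with nbuckets set to 12 gives values that can be compared entry by entry with the cumulative probabilities in the table above.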

It would be interesting to get more detailed and more recent histograms.
I also considered modelling the sizes with a Maxwell–Boltzmann distribution, but generating samples from it seemed more complex and, on closer inspection, its shape does not really fit.

Full sampler code:
emailgen.table.c

  1. #1 by simon on August 9th, 2009

    you might take a look at the zipf distribution. I don’t know much about it but it was mentioned in a lecture about multi-media-servers to model server requests.

    you might also consider a split into “personal” and “corporate” mail, since corporate mail often uses html and stuff, whereas private mail is mostly plaintext.

    greetings to NZ,
    -s

  2. #2 by admin on August 15th, 2009

    Excellent hint, simon!
    I looked at it again, and I think I can come up with some adapted zipf if I have more data. I hope some mail server admins respond to http://marc.info/?l=postfix-users&m=125033153014044
