Network simulations need simulated traffic, and often a simple CBR (constant bit rate) source is used. Here I’m looking at email traffic.
For a more realistic email generator I wanted to find out how the emails are distributed in size.
The only post that goes into detail is found at
http://osdir.com/ml/freebsd.devel.net/2002-10/msg00203.html
Its resolution is quite poor though; more detail between 0 and 4 KB in particular would be interesting.
Still, below is a sampler that generates random email sizes following that distribution.
Between 900 bytes (the lower limit, since headers almost always need that much space) and 2048 bytes, a uniform distribution is used.
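The sampler calls a helper named get_random for this uniform part, but its definition is not included in this post. A minimal stand-in might look like the sketch below. The name uniform_in_range is hypothetical, the half-open range [lo, hi) is an assumption, and the C standard library's rand() is used only for illustration; the real helper presumably builds on GSL and may behave differently.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for get_random(): returns a value in [lo, hi).
   Assumes the span hi - lo is no larger than RAND_MAX. */
unsigned long uniform_in_range(unsigned long lo, unsigned long hi)
{
    unsigned long span, limit, r;

    assert(hi > lo);
    span = hi - lo;
    assert(span <= (unsigned long) RAND_MAX);

    /* Largest multiple of span within rand()'s RAND_MAX + 1 outcomes.
       Rejecting draws at or above it avoids modulo bias. */
    limit = ((unsigned long) RAND_MAX + 1UL) / span * span;
    do {
        r = (unsigned long) rand();
    } while (r >= limit);

    return lo + r % span;
}
```

Called as uniform_in_range(900, 2048), this yields sizes from 900 up to 2047 bytes with equal probability.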
    #include <assert.h>
    #include <math.h>
    #include <gsl/gsl_rng.h>

    /* rng (a GSL random number generator) and get_random() are defined
       elsewhere and not shown in this excerpt. */

    /* cumulative probabilities: probabilities[i] = P(size < pow(2, i+11)) */
    double probabilities[12] = {
        0.2353, /* < 2048 = pow(2, 0+11) */
        0.4917, /* < 4096 = pow(2, 1+11) */
        0.676,
        0.7958,
        0.8536,
        0.9047,
        0.933,
        0.9574,
        0.9724,
        0.9849,
        0.9996,
        1.0
    };

    /* linear interpolation: y value at x on the line through
       (xleft, yleft) and (xright, yright) */
    double interpolate(double yleft, double yright, double x, double xleft, double xright)
    {
        double deltax = xright - xleft;
        double deltay = yright - yleft;
        double k = deltay/deltax;
        return (x - xleft) * k + yleft;
    }

    unsigned long getnextsize()
    {
        double prob = gsl_rng_uniform_pos(rng); /* uniform in (0, 1) */
        int i = 0;
        double j;
        unsigned long hardlowerlimit = 900;
        /*unsigned long hardupperlimit = 10*1024*1024;*/
        unsigned long size;

        /* find the first bin whose cumulative probability exceeds prob */
        while(probabilities[i] <= prob) {
            i++;
        }
        assert(probabilities[i] > prob);

        if (i > 0) {
            /* interpolate the exponent between bins i-1 and i */
            j = interpolate(i-1, i, prob, probabilities[i-1], probabilities[i]);
            size = pow(2, j + 11);
            if(size < hardlowerlimit) {
                return getnextsize();
            }
        }else{
            /* smallest bin: uniform between the lower limit and 2048 */
            size = get_random(hardlowerlimit, pow(2, 0+11));
        }
        return size;
    }
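To make the table easier to read, the sketch below decodes it: entry i is the probability that a size is below 2^(i+11) bytes, so the last entry corresponds to an upper bound of 4 MiB. The helper names (bin_upper_edge, bin_probability, bin_for_quantile) are made up for illustration and are not part of the sampler; only the table values are taken from it.

```c
#include <assert.h>

#define NBINS 12

/* Cumulative probabilities copied from the sampler's table. */
static const double cumulative[NBINS] = {
    0.2353, 0.4917, 0.676, 0.7958, 0.8536, 0.9047,
    0.933, 0.9574, 0.9724, 0.9849, 0.9996, 1.0
};

/* Exclusive upper size limit of bin i in bytes: 2^(i + 11). */
unsigned long bin_upper_edge(int i)
{
    assert(i >= 0 && i < NBINS);
    return 1UL << (i + 11);
}

/* Probability mass of bin i alone (difference of neighbouring entries). */
double bin_probability(int i)
{
    assert(i >= 0 && i < NBINS);
    return i == 0 ? cumulative[0] : cumulative[i] - cumulative[i - 1];
}

/* Index of the bin that a uniform variate u in (0, 1) falls into,
   using the same search as the sampler. */
int bin_for_quantile(double u)
{
    int i = 0;

    assert(u > 0.0 && u < 1.0);
    while (cumulative[i] <= u) {
        i++;
    }
    return i;
}
```

For example, a uniform draw of 0.5 lands in bin 2, so the median generated size lies between 4096 and 8192 bytes.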
[Figure: measured cumulative distribution function of the generated sizes, compared to my own mailbox (which might be atypical, who knows).]
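A cumulative distribution like this can be measured from any list of message sizes, whether generated or taken from a mailbox. The sketch below is one way to do it, assuming the sizes are already available as an array of byte counts and reusing the sampler's power-of-two bin edges. The function name empirical_cdf is hypothetical, and this is not necessarily how the plot was produced.

```c
#include <assert.h>
#include <stddef.h>

#define NBINS 12

/* Fill cdf[i] with the fraction of sizes below 2^(i + 11) bytes,
   i.e. the same bin edges as the sampler's table. */
void empirical_cdf(const unsigned long *sizes, size_t n, double cdf[NBINS])
{
    size_t counts[NBINS] = {0};
    size_t k, running = 0;
    int i;

    assert(n > 0);

    /* Count each size in the first bin whose upper edge exceeds it.
       Sizes of 4 MiB or more fall outside the table and land in no bin. */
    for (k = 0; k < n; k++) {
        for (i = 0; i < NBINS; i++) {
            if (sizes[k] < (1UL << (i + 11))) {
                counts[i]++;
                break;
            }
        }
    }

    /* A running sum turns the histogram into a cumulative distribution. */
    for (i = 0; i < NBINS; i++) {
        running += counts[i];
        cdf[i] = (double) running / (double) n;
    }
}
```

Using finer bin edges below 4 KB here would give the extra resolution that the original histogram lacks.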
It would be interesting to get more detailed and newer histograms.
I also considered modelling it using a Maxwell–Boltzmann distribution, but generating samples from it seemed more complex and, on closer inspection, its shape does not really fit.
Full sampler code:
emailgen.table.c

#1 by simon on August 9th, 2009
you might take a look at the Zipf distribution. I don’t know much about it but it was mentioned in a lecture about multimedia servers as a way to model server requests.
you might also consider a split into “personal” and “corporate” mail, since corporate mail often uses HTML and stuff, whereas private mail is mostly plaintext.
greetings to NZ,
-s
#2 by admin on August 15th, 2009
Excellent hint, simon!
I looked at it again, and I think I can come up with an adapted Zipf distribution once I have more data. I hope some mail server admins respond to http://marc.info/?l=postfix-users&m=125033153014044