Archive for August, 2009
Realistic user-email generator
Posted by JohannesTheLittleScientist in science on August 26th, 2009
This is an updated code for simulating user-generated email traffic, for the purpose of network simulation. (previous post)
It is based on the analysis of hundreds of gigabytes of mails (see this post).
This samples email size. You give this program a traffic rate, and it tells you when a user sends a mail of what size (two-column list output). It is based on the distribution of 402158 sent-emails, made available by Phill Macey (thank you!). Here you can see the distribution, and the sampler:
As you can see, the generator (by-table) follows the empirical distribution very well. The sampler is very fast (1700000 lines per second on my machine).
example usage: Generate five mails with 100 bytes/second throughput
$ gcc -pg -lgsl -lm -O3 -ansi -pedantic -Wall -Werror -Wextra emailgen-phill-macey.table.c -o emailgen-phill-macey.table.exe $ GSL_RNG_SEED=12144 ./emailgen-phill-macey.table.exe 5 100 0 4417 44 1262 56 791 63 829 71 1137
Which means, at second 44 the user wants to write a mail of 1262 bytes.
The full code, emailgen-phill-macey.table.c:
#include<stdio.h> #include<stdlib.h> #include<assert.h> #include<math.h> #include<gsl/gsl_rng.h> gsl_rng * rng; void init_rand() { gsl_rng_env_setup(); rng = gsl_rng_alloc(gsl_rng_default); } int get_random(int min, int max) { unsigned int umax; assert(max > min); umax = max - min - 1; return gsl_rng_uniform_int (rng, umax) + min; } double probabilities[15] = { 0.21380651385774, /* <= 1024 = pow(2,0+10)*/ 0.49594935323927, /* <= 2048 = pow(2,1+11)*/ 0.70676450549287, 0.81264080286852, 0.86918076974721, 0.89862442124737, 0.92154576062145, 0.94019514718096, 0.95773303030152, 0.97112577643613, 0.98311111553171, 0.99845085762312, 1 }; double interpolate(double yleft, double yright, double x, double xleft, double xright) { double deltax = xright - xleft; double deltay = yright - yleft; double k = deltay/deltax; return (x - xleft) * k + yleft; } unsigned long getnextsize() { double prob = gsl_rng_uniform_pos(rng); int i = 0; double j; unsigned long hardlowerlimit = 600; /*unsigned long hardupperlimit = 10*1024*1024;*/ unsigned long size; while(probabilities[i] <= prob) { i++; } if (i>0) { j = interpolate(i-1, i, prob, probabilities[i-1], probabilities[i]); size = pow(2, j + 10); if(size < hardlowerlimit) { return getnextsize(); } }else{ size = get_random(hardlowerlimit, pow(2, 0 + 10)); } return size; } void usage(char * progname) { fprintf(stderr, "%s: SYNAPSIS: number_of_mails throughput \n" "\n" "\tnumber_of_mails\thow many mails should be generated\n" "\tthroughput\tbytes per (virtual) second to send\n" "\n" "This program prints a 2-column list of time (in seconds) and\n" "size (in bytes) to simulate a email traffic. \n" "Configure your network simulator to transmit this number of bytes\n" "at that time. (This program does not send mails.)\n" "\n" "(c) Johannes Buchner\n", progname); exit(0); } int main(int argc, char ** argv) { int n; double throughput; unsigned long bytes_sent = 0; unsigned long nextsize; unsigned long time = 0; if (argc != 3) { usage(argv[0]); } init_rand(); throughput = atof(argv[2]); n = atoi(argv[1]); if (throughput < 0 || n < 0) { usage(argv[0]); } while (n-- > 0) { nextsize = getnextsize(); printf("%lu\t%lu\n", time, nextsize); bytes_sent += nextsize; time += nextsize /* bytes */ / throughput /* bytes per sec */; } return 0; }
Email size distribution: results
Posted by JohannesTheLittleScientist in science on August 26th, 2009
Following up from a previous post and a post to the qmail and postfix mailing lists:
I analysed the distribution of email sizes. I wasn’t interested in spam, automated emails or mailing lists, rather user-written mails.
I analyzed the sizes of mails in user inboxes. I am always looking at buckets (how many mails are smaller than size
x, but bigger than the last buckets). I used the following upper limits: 1024
1536 2048 2560 3072 4096 5120 6144 7168 8192 10240 12288 14336
16384 20480 24576 28672 32768 40960 49152 53248 57344 65536 131072
262144 524288 1048576 10264576 and 1073741824 (bytes).
I laid out some interpretations about my inbox as preliminary insight. Now I want to publish the results from the friendly people that answered my call for data.
Ordered by cardinality (number of mails in dataset):
mymail 5069 thomas.schwinge_priv 8759 linux.kernel 11127 charles_cazabon-nospam 22555 phoemix_harmlesshu 30991 thomas.schwinge_tech_mailinglist 154508 phill_macey_sent 402158 spamsizes_steff 1044713 michaelreck 4086621 (700GB, ~40000 users) phill_macey_inbox 7999737 phill_macey_all 8401951 (~300GB)
Thanks go out to Thomas Schwinge, Charles Cazabon, phoemix, Michael Reck from brauchmer.net, Markus Stumpf and Phill Macey.
The empirical distribution functions follow (with downloadable eps versions). X-axis (abscissa) is always size in Bytes, the y-axis (ordinate) is the portion of mails found with exactly this size. This is calculated by using the number of mails in a buckets above, dividing by the bucket size (difference to lower border) and dividing by the sum of mails.
The upper is single-log, the lower is double-log.
And the cumulative distribution. This is interesting because a shift can not be seen well in the probability functions. These are linear plots: ordinate being cumulative percentage (what percentage of mail is below size x). The thicker the lines, the more data in this dataset.
And logarithmic in size:
Here are the plots in better quality (as eps): overview-density overview-cum overview-cum-log
Warnings:
There are several errors to consider before taking this too literal: (1) It is uncertain how clean the data of large datasets is (spam, mailing-lists, generated data all mixed up). (2) headers are different on receiving than on sending. (3) What users send is not the same traffic as they receive (automated mails).
If you just want to take one number out of this, the median size of a mail is around 30kB (watch the 50% line).
For my purposes of modeling a human generating mails, this is fantastic (dataset phill_macey_sent). This is covered in my next post.
Comments are always welcome!
Merging and sorting sorted sources
Posted by JohannesTheDeveloper in Happy Hacking on August 18th, 2009
This neat little algorithm takes a number of sorted files and merges them together to a sorted file. It uses fixed block sizes (predefined 16).
It basically has a “slot” for each file, you may think a queue of the blocks in this file, and tries to work its way through to the end of all queues. So it takes from the slot with the smallest block, writes it to the output file, and refills the slot with the next queue element from that file.
It is a little more sophisticated, because instead of looking through all slots for the next smallest block, it actually juggles the slots (files) in a sorted manner.
I will definitely reuse this.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 | #include <stdio.h> #include <signal.h> #include <stdlib.h> #include <assert.h> #include <string.h> #define outputfile "mergedblocks" #ifndef DEBUG #define DEBUG 0 #endif #define ENTRYSIZE 16 FILE * f; int comp(const char * a, const char * b) { int i; for (i = 0; i < ENTRYSIZE; i++) { if (a[i] != b[i]) return a[i] - b[i]; } return 0; } void switchentries(char * a, char * b) { int i; char t; for(i=0; i<ENTRYSIZE; i++) { t = a[i]; a[i] = b[i]; b[i] = t; } } int main(int argc, char ** argv) { int i; int j; int nfiles; int v; char ** filenames; FILE ** files; FILE * outfile; char ** data; unsigned long * nentries; int * slot2file; assert(argc > 1); filenames = argv + 1; nfiles = argc - 1; files = (FILE **)calloc(nfiles, sizeof(FILE*)); assert(files != NULL); data = (char **) calloc(nfiles, sizeof(char*)); assert(data != NULL); nentries = (unsigned long *) calloc(nfiles, sizeof(unsigned long)); assert(nentries != NULL); slot2file = (int *) calloc(nfiles, sizeof(int)); assert(slot2file != NULL); for (i = 0; i < nfiles; i++) { files[i] = fopen(filenames[i], "r"); assert(files[i] != NULL); nentries[i] = 0; slot2file[i] = i; data[i] = (char*) malloc(sizeof(char) * ENTRYSIZE); assert(data[i] != NULL); memset((void*)data[i], 0, ENTRYSIZE); /* load first entry */ for (j = 0; j < ENTRYSIZE; j++) { v = getc(files[i]); assert(v != EOF); data[i][j] = v; } nentries[i]++; /* find next fitting */ for(j = i - 1; j >= 0 && comp(data[j], data[j + 1]) > 0; j--) { if(DEBUG) printf("switching %d <-> %d\n", j, j+1); switchentries(data[j+1], data[j]); v = slot2file[j+1]; slot2file[j+1] = slot2file[j]; slot2file[j] = v; } } outfile = fopen(outputfile, "w"); assert(outfile != NULL); while(1) { if(DEBUG) for (i = 0; i < nfiles; i++) { printf("slot %3d: ", i); for(j = 0; j < ENTRYSIZE; j++) { printf("%02x", (unsigned char)data[i][j]); } j = slot2file[i]; if (j < 0) { printf(" (ended)"); j = (j + 1)*-1; } printf("\t%d->%d: '%s'", i, j, filenames[j]); printf("\n"); } if(slot2file[0] < 0) { break; } /* smallest is at 0 */ /* write out smallest */ if (slot2file[1] >= 0 && comp(data[0], data[1]) == 0) { printf("duplicate found\n"); printf("file %s, item %lu\n", filenames[slot2file[0]], nentries[slot2file[0]]); printf("file %s, item %lu\n", filenames[slot2file[1]], nentries[slot2file[1]]); for(j = 0; j < ENTRYSIZE; j++) { printf("%02x", (unsigned char)data[0][j]); } printf("\n"); } else { if(DEBUG) printf("smallest writeout\n"); for (j = 0; j < ENTRYSIZE; j++) { putc(data[0][j], outfile); } } if(DEBUG) printf("refill %s\n", filenames[slot2file[0]]); /* refill slot if possible */ for (j = 0; j < ENTRYSIZE; j++) { v = getc(files[slot2file[0]]); if (v == EOF) { if(DEBUG) printf("end of file\n"); slot2file[0] = -slot2file[0] - 1; break; } data[0][j] = v; } if (slot2file[0] >= 0) { nentries[slot2file[0]]++; }else{ /* push to the end */ if(DEBUG) printf("moving to the end\n"); for(j = 1; j < nfiles && slot2file[j] >= 0; j++) { i = j - 1; if(DEBUG) printf("switching %d <-> %d\n", i, j); switchentries(data[i], data[j]); v = slot2file[i]; slot2file[i] = slot2file[j]; slot2file[j] = v; } } /* put in right place (bubblesort for 1 unsorted item at 0) */ if(DEBUG) printf("sorting\n"); for(j = 1; j < nfiles && slot2file[j] >= 0;j++){ i = j - 1; if(comp(data[i], data[j]) > 0) { if(DEBUG) printf("switching %d <-> %d\n", i, j); switchentries(data[i], data[j]); v = slot2file[i]; slot2file[i] = slot2file[j]; slot2file[j] = v; }else{ break; } } } for (i = 0; i < nfiles; i++) { slot2file[i] = -(slot2file[i] + 1); fclose(files[i]); if (i>0) nentries[0] += nentries[i]; } fclose(outfile); printf("mergesorted %d files (%lu entries), wrote to %s.\n", nfiles, nentries[0], outputfile); /*free(data);*/ return 0; } |
MD5 collisions lottery
Posted by JohannesTheLittleScientist in Happy Hacking on August 18th, 2009
Short summary: I tried to collide md5 hashes from within the 128 bit range itself. I used 12464890370 md5 hashes, but found none.
Long explanation:
md5 is a hashing algorithm that produces 128 bits. It is clear that with arbitrary long input strings, you will have two inputs that yield the same output (this is called a collision). You can also produce a hash of a hash. Seeing md5 as a black box, it might be that a collision is possible within the 128 bit input range.
Starting with 0{rest zero bits} to 255{rest zero bits}, I produced a stream of md5 hashes for each: start -> hash -> hash -> hash -> …, and stored all of them. Then I looked if any of these 12402390369 hash values (=186GB) I produced were the same (I used a form of mergesort). The answer is no, none were the same.
My personal conclusion:
If we assume that this sample of 10^10 values is representative, and there had been 1 collision, the share of collisions would be around 10^-11. This is the share of the 128 bit space I looked at, the whole space is around 10^38 big (2^128). One collision would have meant that there were 10^28 collisions.
Since I didn’t find any collisions, we can assume there are less than 10^28 collisions in the space. My personal guess is that there are around 10. It would be neat to have a md5 identity (where hash(x)=x ), but I don’t think that is possible.
A algorithmic analysis might be more successful than what I did, but it is complicated to produce collisions, and there haven’t been any constructed within space boundaries.
Another thing I noticed is that the disk speed is the very limiting factor for producing and storing hashes.
The generating code is basically a loop around the following code. And then some qsort over the output once its done.
MD5_Init(&md5_state); MD5_Update(&md5_state, message, BUF_SIZE); MD5_Final(message, &md5_state); for(j = 0; j < 16; j++) fputc(message[j], f);
Tomboy – Usability patch
Posted by JohannesTheDeveloper in Happy Hacking on August 10th, 2009
Got a little patch into tomboy for a menu item in the status icon menu.
Modelling realistic email traffic – email size distribution
Posted by JohannesTheLittleScientist in science on August 8th, 2009
In network simulation, traffic has to be simulated. Here I’m looking at email traffic. Often, CBR (constant bit rate) is used.
For a more realistic email generator I wanted to find out how the emails are distributed in size.
The only post that goes into detail is found at
http://osdir.com/ml/freebsd.devel.net/2002-10/msg00203.html
The resolution of this is quite poor though, especially between 4KB and 0 would be interesting.
However, below is a sampler that generates random email sizes, distributed such that it conforms to the distribution.
Between 2048 and 900 (lower limit, headers almost always need that much space) a uniform distribution is used.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 | /* cumulative probabilities */ double probabilities[12] = { 0.2353, /* < 2048 = pow(2, 0+11) */ 0.4917, /* < 3096 = pow(2, 1+11) */ 0.676, 0.7958, 0.8536, 0.9047, 0.933, 0.9574, 0.9724, 0.9849, 0.9996, 1.0 }; double interpolate(double yleft, double yright, double x, double xleft, double xright) { double deltax = xright - xleft; double deltay = yright - yleft; double k = deltay/deltax; return (x - xleft) * k + yleft; } unsigned long getnextsize() { double prob = gsl_rng_uniform_pos(rng); int i = 0; double j; unsigned long hardlowerlimit = 900; /*unsigned long hardupperlimit = 10*1024*1024;*/ unsigned long size; while(probabilities[i] <= prob) { i++; } assert(probabilities[i] > prob); if (i > 0) { j = interpolate(i-1, i, prob, probabilities[i-1], probabilities[i]); size = pow(2, j + 11); if(size < hardlowerlimit) { return getnextsize(); } }else{ size = get_random(hardlowerlimit, pow(2, 0+11)); } return size; } |
The measured resulting cumulative distribution function, as well as a comparison to my own mailbox (which might be atypical, who knows).
It would be interesting to get more detailed and newer histograms.
I also considered modelling it using a Maxwell–Boltzmann distribution, but generating that seemed more complex and on closer look, the form does not really fit.
Full sampler code:
emailgen.table.c
Linear interpolation
Posted by JohannesTheLittleScientist in Happy Hacking on August 8th, 2009
Given the rectangle xleft, xright, yleft and yright, which mark the lower and upper bounds, this interpolates a y for the input value x.
double interpolate(double yleft, double yright, double x, double xleft, double xright) {
double deltax = xright - xleft;
double deltay = yright - yleft;
double k = deltay/deltax;
return (x - xleft) * k + yleft;
}





Recent Comments