Archive for August, 2009

Realistic user-email generator

This is an updated code for simulating user-generated email traffic, for the purpose of network simulation. (previous post)

It is based on the analysis of hundreds of gigabytes of mails (see this post).

This samples email size. You give this program a traffic rate, and it tells you when a user sends a mail of what size (two-column list output). It is based on the distribution of 402158 sent-emails, made available by Phill Macey (thank you!). Here you can see the distribution, and the sampler:

overview-cum-generated

As you can see, the generator (by-table) follows the empirical distribution very well. The sampler is very fast (1700000 lines per second on my machine).

example usage: Generate five mails with 100 bytes/second throughput

$ gcc -pg -lgsl -lm -O3 -ansi -pedantic -Wall -Werror -Wextra emailgen-phill-macey.table.c -o emailgen-phill-macey.table.exe
$ GSL_RNG_SEED=12144 ./emailgen-phill-macey.table.exe 5 100 
0	4417
44	1262
56	791
63	829
71	1137

Which means, at second 44 the user wants to write a mail of 1262 bytes.

The full code, emailgen-phill-macey.table.c:

#include<stdio.h>
#include<stdlib.h>
#include<assert.h>
#include<math.h>
#include<gsl/gsl_rng.h>
 
gsl_rng * rng;
void init_rand() {
	gsl_rng_env_setup();
	rng = gsl_rng_alloc(gsl_rng_default);
}
 
 
int get_random(int min, int max) {
	unsigned int umax;
	assert(max > min);
	umax = max - min - 1;
	return gsl_rng_uniform_int (rng, umax) + min;
}
 
double probabilities[15] = {
	0.21380651385774, /* <= 1024 = pow(2,0+10)*/
	0.49594935323927, /* <= 2048 = pow(2,1+11)*/
	0.70676450549287,
	0.81264080286852,
	0.86918076974721,
	0.89862442124737,
	0.92154576062145,
	0.94019514718096,
	0.95773303030152,
	0.97112577643613,
	0.98311111553171,
	0.99845085762312,
	1
};
 
double interpolate(double yleft, double yright, double x, double xleft, double xright) {
	double deltax = xright - xleft;
	double deltay = yright - yleft;
	double k = deltay/deltax;
	return (x - xleft) * k + yleft;
}
 
unsigned long getnextsize() {
	double prob = gsl_rng_uniform_pos(rng);
	int i = 0;
	double j;
	unsigned long hardlowerlimit = 600;
	/*unsigned long hardupperlimit = 10*1024*1024;*/
	unsigned long size;
 
	while(probabilities[i] <= prob) {
		i++;
	}
	if (i>0) {
		j = interpolate(i-1, i, prob, probabilities[i-1], probabilities[i]);
		size = pow(2, j + 10);
		if(size < hardlowerlimit) {
			return getnextsize();
		}
	}else{
		size = get_random(hardlowerlimit, pow(2, 0 + 10));
	}
 
	return size;
}
void usage(char * progname) {
	fprintf(stderr, "%s: SYNAPSIS: number_of_mails throughput \n"
		"\n"
		"\tnumber_of_mails\thow many mails should be generated\n"
		"\tthroughput\tbytes per (virtual) second to send\n"
		"\n"
		"This program prints a 2-column list of time (in seconds) and\n"
		"size (in bytes) to simulate a email traffic. \n"
		"Configure your network simulator to transmit this number of bytes\n"
		"at that time. (This program does not send mails.)\n"
		"\n"
		"(c) Johannes Buchner\n", progname);
	exit(0);
}
 
int main(int argc, char ** argv) {
	int n;
	double throughput;
	unsigned long bytes_sent = 0;
	unsigned long nextsize;
	unsigned long time = 0;
 
	if (argc != 3) {
		usage(argv[0]);
	}
	init_rand();
	throughput = atof(argv[2]);
	n = atoi(argv[1]);
	if (throughput < 0 || n < 0) {
		usage(argv[0]);
	}
 
	while (n-- > 0) {
		nextsize = getnextsize();
		printf("%lu\t%lu\n", time, nextsize);
		bytes_sent += nextsize;
		time += nextsize /* bytes */ / throughput /* bytes per sec */;
	}
	return 0;
}

1 Comment

Email size distribution: results

Following up from a previous post and a post to the qmail and postfix mailing lists:

I analysed the distribution of email sizes. I wasn’t interested in spam, automated emails or mailing lists, rather user-written mails.

I analyzed the sizes of mails in user inboxes. I am always looking at buckets (how many mails are smaller than size
x, but bigger than the last buckets).  I used the following upper limits: 1024
1536 2048 2560 3072 4096 5120 6144 7168 8192 10240 12288 14336
16384 20480 24576 28672 32768 40960 49152 53248 57344 65536 131072
262144 524288 1048576 10264576 and 1073741824 (bytes).

I laid out some interpretations about my inbox as preliminary insight. Now I want to publish the results  from the friendly people that answered my call for data.

Ordered by cardinality (number of mails in dataset):

mymail                             5069
thomas.schwinge_priv               8759
linux.kernel                      11127
charles_cazabon-nospam            22555
phoemix_harmlesshu                30991
thomas.schwinge_tech_mailinglist 154508
phill_macey_sent                 402158
spamsizes_steff                 1044713
michaelreck                     4086621 (700GB, ~40000 users)
phill_macey_inbox               7999737
phill_macey_all                 8401951 (~300GB)

Thanks go out to Thomas Schwinge, Charles Cazabon, phoemix, Michael Reck from brauchmer.net, Markus Stumpf and Phill Macey.

The empirical distribution functions follow (with downloadable eps versions). X-axis (abscissa) is always size in Bytes, the y-axis (ordinate) is the portion of mails found with exactly this size. This is calculated by using the number of mails in a buckets above, dividing by the bucket size (difference to lower border) and dividing by the sum of mails.

The upper is single-log, the lower is double-log.

overview-density

And the cumulative distribution. This is interesting because a shift can not be seen well in the probability functions. These are linear plots: ordinate being cumulative percentage (what percentage of mail is below size x). The thicker the lines, the more data in this dataset.

overview-cum

And logarithmic in size:

overview-cum-log

Here are the plots in better quality (as eps): overview-density overview-cum overview-cum-log

Warnings:

There are several errors to consider before taking this too literal: (1) It is uncertain how clean the data of large datasets is (spam, mailing-lists, generated data all mixed up).  (2) headers are different on receiving than on sending. (3) What users send is not the same traffic as they receive (automated mails).

If you just want to take one number out of this, the median size of a mail is around 30kB (watch the 50% line).

For my purposes of modeling a human generating mails, this is fantastic (dataset phill_macey_sent). This is covered in my next post.

Comments are always welcome!

1 Comment

Merging and sorting sorted sources

This neat little algorithm takes a number of sorted files and merges them together to a sorted file. It uses fixed block sizes (predefined 16).

It basically has a “slot” for each file, you may think a queue of the blocks in this file, and tries to work its way through to the end of all queues. So it takes from the slot with the smallest block, writes it to the output file, and refills the slot with the next queue element from that file.
It is a little more sophisticated, because instead of looking through all slots for the next smallest block, it actually juggles the slots (files) in a sorted manner.
I will definitely reuse this.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
#include <stdio.h>
#include <signal.h>
#include <stdlib.h>
#include <assert.h>
#include <string.h>
 
#define outputfile "mergedblocks"
#ifndef DEBUG
#define DEBUG 0
#endif
 
#define ENTRYSIZE 16
 
FILE * f;
 
int comp(const char * a, const char * b) {
	int i;
	for (i = 0; i < ENTRYSIZE; i++) {
		if (a[i] != b[i])
			return a[i] - b[i];
	}
	return 0;
}
void switchentries(char * a, char * b) {
	int i; 
	char t;
	for(i=0; i<ENTRYSIZE; i++) {
		t = a[i];
		a[i] = b[i];
		b[i] = t;
	}
}
 
 
int main(int argc, char ** argv)
{
	int i;
	int j;
	int nfiles;
	int v;
 
	char ** filenames;
	FILE ** files;
	FILE * outfile;
	char ** data;
	unsigned long * nentries;
	int * slot2file;
 
	assert(argc > 1);
	filenames = argv + 1;
	nfiles = argc - 1;
 
	files = (FILE **)calloc(nfiles, sizeof(FILE*));
	assert(files != NULL);
	data = (char **) calloc(nfiles, sizeof(char*));
	assert(data != NULL);
	nentries = (unsigned long *) calloc(nfiles, sizeof(unsigned long));
	assert(nentries != NULL);
	slot2file = (int *) calloc(nfiles, sizeof(int));
	assert(slot2file != NULL);
	for (i = 0; i < nfiles; i++) {
		files[i] = fopen(filenames[i], "r");
		assert(files[i] != NULL);
		nentries[i] = 0;
		slot2file[i] = i;
		data[i] = (char*) malloc(sizeof(char) * ENTRYSIZE);
		assert(data[i] != NULL);
		memset((void*)data[i], 0, ENTRYSIZE);
 
		/* load first entry */
		for (j = 0; j < ENTRYSIZE; j++) {
			v = getc(files[i]);
			assert(v != EOF);
			data[i][j] = v;
		}
		nentries[i]++;
		/* find next fitting */
		for(j = i - 1; j >= 0 && comp(data[j], data[j + 1]) > 0; j--) {
			if(DEBUG)
			printf("switching %d <-> %d\n", j, j+1);
			switchentries(data[j+1], data[j]);
			v = slot2file[j+1];
			slot2file[j+1] = slot2file[j];
			slot2file[j] = v;
		}
	}
	outfile = fopen(outputfile, "w");
	assert(outfile != NULL);
 
	while(1) {
		if(DEBUG)
		for (i = 0; i < nfiles; i++) {
			printf("slot %3d: ", i);
			for(j = 0; j < ENTRYSIZE; j++) {
				printf("%02x", (unsigned char)data[i][j]);
			}
			j = slot2file[i];
			if (j < 0) {
				printf(" (ended)");
				j = (j + 1)*-1;
			}
			printf("\t%d->%d: '%s'", i, j, filenames[j]);
			printf("\n");
		}
		if(slot2file[0] < 0) {
			break;
		}
		/* smallest is at 0 */
		/* write out smallest */
		if (slot2file[1] >= 0 && comp(data[0], data[1]) == 0) {
			printf("duplicate found\n");
			printf("file %s, item %lu\n", filenames[slot2file[0]], nentries[slot2file[0]]);
			printf("file %s, item %lu\n", filenames[slot2file[1]], nentries[slot2file[1]]);
			for(j = 0; j < ENTRYSIZE; j++) {
				printf("%02x", (unsigned char)data[0][j]);
			}
			printf("\n");
		} else {
			if(DEBUG)
			printf("smallest writeout\n");
			for (j = 0; j < ENTRYSIZE; j++) {
				putc(data[0][j], outfile);
			}
		}
 
		if(DEBUG)
		printf("refill %s\n", filenames[slot2file[0]]);
		/* refill slot if possible */
		for (j = 0; j < ENTRYSIZE; j++) {
			v = getc(files[slot2file[0]]);
			if (v == EOF) {
				if(DEBUG)
				printf("end of file\n");
				slot2file[0] = -slot2file[0] - 1;
				break;
			}
			data[0][j] = v;
		}
		if (slot2file[0] >= 0) {
			nentries[slot2file[0]]++;
		}else{ /* push to the end */
			if(DEBUG)
			printf("moving to the end\n");
			for(j = 1; j < nfiles && slot2file[j] >= 0; j++) {
				i = j - 1;
				if(DEBUG)
				printf("switching %d <-> %d\n", i, j);
				switchentries(data[i], data[j]);
				v = slot2file[i];
				slot2file[i] = slot2file[j];
				slot2file[j] = v;
			}
 
		}
		/* put in right place (bubblesort for 1 unsorted item at 0) */
		if(DEBUG)
		printf("sorting\n");
		for(j = 1; j < nfiles && slot2file[j] >= 0;j++){
			i = j - 1;
			if(comp(data[i], data[j]) > 0) {
				if(DEBUG)
				printf("switching %d <-> %d\n", i, j);
				switchentries(data[i], data[j]);
				v = slot2file[i];
				slot2file[i] = slot2file[j];
				slot2file[j] = v;
			}else{
				break;
			}
		}
	}
 
 
	for (i = 0; i < nfiles; i++) {
		slot2file[i] = -(slot2file[i] + 1);
		fclose(files[i]);
		if (i>0)
			nentries[0] += nentries[i];
	}
	fclose(outfile);
	printf("mergesorted %d files (%lu entries), wrote to %s.\n", 
		nfiles, nentries[0], outputfile);
	/*free(data);*/
 
	return 0;
}

2 Comments

MD5 collisions lottery

Short summary: I tried to collide md5 hashes from within the 128 bit range itself. I used 12464890370 md5 hashes, but found none.
Long explanation:
md5 is a hashing algorithm that produces 128 bits. It is clear that with arbitrary long input strings, you will have two inputs that yield the same output (this is called a collision). You can also produce a hash of a hash. Seeing md5 as a black box, it might be that a collision is possible within the 128 bit input range.
Starting with 0{rest zero bits} to 255{rest zero bits}, I produced a stream of md5 hashes for each: start -> hash -> hash -> hash -> …, and stored all of them. Then I looked if any of these 12402390369 hash values (=186GB) I produced were the same (I used a form of mergesort). The answer is no, none were the same.
My personal conclusion:
If we assume that this sample of 10^10 values is representative, and there had been 1 collision, the share of collisions would be around 10^-11. This is the share of the 128 bit space I looked at, the whole space is around 10^38 big (2^128). One collision would have meant that there were 10^28 collisions.
Since I didn’t find any collisions, we can assume there are less than 10^28 collisions in the space. My personal guess is that there are around 10. It would be neat to have a md5 identity (where hash(x)=x ), but I don’t think that is possible.
A algorithmic analysis might be more successful than what I did, but it is complicated to produce collisions, and there haven’t been any constructed within space boundaries.

Another thing I noticed is that the disk speed is the very limiting factor for producing and storing hashes.

The generating code is basically a loop around the following code. And then some qsort over the output once its done.

MD5_Init(&md5_state);
MD5_Update(&md5_state, message, BUF_SIZE);
MD5_Final(message, &md5_state);
for(j = 0; j < 16; j++)
    fputc(message[j], f);

3 Comments

Tomboy – Usability patch

Got a little patch into tomboy for a menu item in the status icon menu.

No Comments

Modelling realistic email traffic – email size distribution

In network simulation, traffic has to be simulated. Here I’m looking at email traffic. Often, CBR (constant bit rate) is used.

For a more realistic email generator I wanted to find out how the emails are distributed in size.

The only post that goes into detail is found at

http://osdir.com/ml/freebsd.devel.net/2002-10/msg00203.html

The resolution of this is quite poor though, especially between 4KB and 0 would be interesting.
However, below is a sampler that generates random email sizes, distributed such that it conforms to the distribution.
Between 2048 and 900 (lower limit, headers almost always need that much space) a uniform distribution is used.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
/* cumulative probabilities */
double probabilities[12] = {
	0.2353, /* < 2048 = pow(2, 0+11) */
	0.4917, /* < 3096 = pow(2, 1+11) */
	0.676,
	0.7958,
	0.8536,
	0.9047,
	0.933,
	0.9574,
	0.9724,
	0.9849,
	0.9996,
	1.0
};
 
double interpolate(double yleft, double yright, double x, double xleft, double xright) {
	double deltax = xright - xleft;
	double deltay = yright - yleft;
	double k = deltay/deltax;
	return (x - xleft) * k + yleft;
}
 
unsigned long getnextsize() {
	double prob = gsl_rng_uniform_pos(rng);
	int i = 0;
	double j;
	unsigned long hardlowerlimit = 900;
	/*unsigned long hardupperlimit = 10*1024*1024;*/
	unsigned long size;
 
	while(probabilities[i] <= prob) {
		i++;
	}
	assert(probabilities[i] > prob);
	if (i > 0) {
		j = interpolate(i-1, i, prob, probabilities[i-1], probabilities[i]);
		size = pow(2, j + 11);
		if(size < hardlowerlimit) {
			return getnextsize();
		}
	}else{
		size = get_random(hardlowerlimit, pow(2, 0+11));
	}
 
	return size;
}

The measured resulting cumulative distribution function, as well as a comparison to my own mailbox (which might be atypical, who knows).

mailsize-cumulative-distribution

It would be interesting to get more detailed and newer histograms.
I also considered modelling it using a Maxwell–Boltzmann distribution, but generating that seemed more complex and on closer look, the form does not really fit.

Full sampler code:
emailgen.table.c

4 Comments

Linear interpolation

Given the rectangle xleft, xright, yleft and yright, which mark the lower and upper bounds, this interpolates a y for the input value x.

double interpolate(double yleft, double yright, double x, double xleft, double xright) {
	double deltax = xright - xleft;
	double deltay = yright - yleft;
	double k = deltay/deltax;
	return (x - xleft) * k + yleft;
}

No Comments