Archive for category science

APEMoST

Recently I have been busy updating APEMoST, which is a MCMC sampler for Bayesian inference. This can be used as a statistical procedure to estimate parameters of a model. That sounds pretty generic, and indeed it is. One specific example would be to determine the orbit parameters of exoplanets. You can also specify multiple models (1 planet, 2 planets, no planets) and calculate (with the tool) which model is more likely.

Everything is at http://apemost.sourceforge.net/. Papers pending ;-)

No Comments

Exhausting a finite parameter space by enumerating Q^n

In many cases, the parameter space (e.g. simulation parameters) is a n-dimensional box/cube, scalable to [0:1]^n, where n donates the number of dimensions = number of parameters.

Since simulations for one parameter set take time, and one wants to exhaust the parameter space efficiently (visit every subspace uniformly), one has to come up with some method of generating or suggesting points to visit.

Often, a Monte-Carlo approach is used by just using random points (uniformly, to be exact).

Here I want to present a simple enumeration scheme that eventually visits all points, by ordering the  rational numbers in an suitable way.

First, lets look at one dimension. The strategy is to visit: 0, 1, 1/2, 1/4, 3/4. Then continue even deeper with 1/8, 3/8, 5/8, 7/8 and so on. Using fractions, a pattern is clearly visible. Here is a visualization:

Q1The y-axis shows the one dimensional parameter space, the x-axis just helps you to see the ordering of the enumeration. Each animation step goes one step deeper.

The generating scheme up to a certain deepness is:

{ (j*2+1) / 2^deepness | j in [0..(2^deepness) / 2] }

Higher dimensions

So far, so easy. Lets generalize this to higher dimensions. We want to keep the property that the enumeration sort of “goes deeper”, becomes more accurate by filling the voids between the previously visited points.

Q2Here, a two-dimensional parameter space is used. First, the edges (0,0),(1,0),(0,1),(1,1) are visited. Then the space in between and so on and so on. Each animation step goes one step deeper.

The easiest method for generating a enumeration scheme for arbitrary dimensions is

0. no visited points
1. Q1 <- generate enumerations of one dimension, up to deepness o
2. build all permutations [a_1,a_2,...,a_3] where a_i in Q1
3. remove already visited points
4. increase deepness, go to 1

Use: The more points you generate, the more you can exhaust the parameter space and the better the results get. At some point you have to stop your calculations, and this enumeration makes sure you have visited all subspaces homogeneously.

Appendix

Appended is a python script (numbering.py) I used to generate points in this enumerated way. the parameter space can easily be scaled by piping into awk (e.g. | awk '{print $1*3,$2+1}').

If you don’t want to or can’t run the script, the output for 1, 2 and 3 dimensional parameter spaces can be found in http://johannes.jakeapp.com/files/enumerate_rational/. These contains the first million points (deepness 22, 11 and 8).

import math
import sys 
 
def getSegment(i):
	if i == 0:
		return 0.
	#if i == 1:
	#	return 1
	o = int(math.ceil(math.log(i, 2)))
	j = i - (2**o / 2) - 1
	#print "i=", i, "o=", o, "j=", j
	return float(j*2+1) / 2**o
 
def getSegmentsInDeepnessRange(a, b):
	if a == -1:
		first = 0
	else:
		first = 2**a + 1
	last  = 2**b + 1
	# print "o=", o, "first=", first, "last=", last
	return [getSegment(j) for j in range(first, last)]
 
#crossproduct = lambda ss,row=[],level=0: len(ss)&gt;1 \
#	and reduce(lambda x,y:x+y,[crossproduct(ss[1:],row+[i],level+1) for i in ss[0]]) \
#	or [row+[i] for i in ss[0]]
def crossproduct(ss, row=[], level=0):
	if len(ss)&gt;1:
		subcrossproduct = [crossproduct(ss[1:],row+[i],level+1) for i in ss[0]]
		return reduce(lambda x,y:x+y,subcrossproduct)
	else:
		return [row+[i] for i in ss[0]]
 
if len(sys.argv) != 4:
	print "SYNOPSIS: dim min_deepness max_deepness"
	print
	print "\tdim\tdimensions of the parameter space"
	print "\tdeepness\thow many stages to go deep (set min=0)"
	print
	print "Q: how many numbers will be produced?"
	print "A: when min=0: (2^(max-1)+1)^dim; as a table:"
	print "  max   |    dim=1           dim=2     -&gt;"
	for i in range(20):
		print "%6d" % i,
		for dim in range(1,6):
			if ((2**(i-1)+1)**dim) &gt; 10000000:
				print "\t    ...    ",
			else:
				print "\t%10d" % (2**(i-1)+1)**dim,
		print
	exit()
 
dim = int(sys.argv[1])
 
i = 0
 
mindeepness = int(sys.argv[2])
maxdeepness = int(sys.argv[3])
 
if dim &gt; 1:
	# o = deepness
	for o in range(mindeepness, maxdeepness):
		p    = getSegmentsInDeepnessRange(-1, o)
		pnew = getSegmentsInDeepnessRange(o-1,  o)
		if dim == 1:
			newvals = [[nv] for nv in p]
		else:
			newvals = crossproduct([p]*dim)
		# print only the new ones:
		for nv in newvals:
			for pn in pnew:
				if pn in nv:
					for v in nv:
						print v,
					print
					i = i + 1
					break
else: # dim = 1:
#for i in range(20):
#	print getSegment(i)
 
# one dimension 
 
	i = 0
	print i
	i = 1
	print i
	for o in range(mindeepness, maxdeepness):
		for j in range((2**o) / 2):
			print ("%.30f" % ((j*2+1) * pow(2,-o))).rstrip('0')

No Comments

Distribution of digit sums

How many numbers do have a digit sum of 23? How frequent is the digit sum 32?

Astonishingly, 6% of the numbers below 100000 have a digit sum of 23:

crossfoot

This seems pretty high, doesn’t it? It suggests that the digit sum is not evenly distributed, and it is, in fact, not:

crossfoots

As you see, the numbers 13 and 14 have the highest share as a digit sum of numbers below 1000. For values smaller than 10000, it is 18. And for values smaller than 100000, with 6% of all numbers, 22 and 23 are the most common digit sums.

If I’d had to start a conspiracy theory, I’d choose one of these. Also note how the gaussian distribution appeared out of nowhere.

No Comments

Realistic user-email generator

This is an updated code for simulating user-generated email traffic, for the purpose of network simulation. (previous post)

It is based on the analysis of hundreds of gigabytes of mails (see this post).

This samples email size. You give this program a traffic rate, and it tells you when a user sends a mail of what size (two-column list output). It is based on the distribution of 402158 sent-emails, made available by Phill Macey (thank you!). Here you can see the distribution, and the sampler:

overview-cum-generated

As you can see, the generator (by-table) follows the empirical distribution very well. The sampler is very fast (1700000 lines per second on my machine).

example usage: Generate five mails with 100 bytes/second throughput

$ gcc -pg -lgsl -lm -O3 -ansi -pedantic -Wall -Werror -Wextra emailgen-phill-macey.table.c -o emailgen-phill-macey.table.exe
$ GSL_RNG_SEED=12144 ./emailgen-phill-macey.table.exe 5 100 
0	4417
44	1262
56	791
63	829
71	1137

Which means, at second 44 the user wants to write a mail of 1262 bytes.

The full code, emailgen-phill-macey.table.c:

#include<stdio.h>
#include<stdlib.h>
#include<assert.h>
#include<math.h>
#include<gsl/gsl_rng.h>
 
gsl_rng * rng;
void init_rand() {
	gsl_rng_env_setup();
	rng = gsl_rng_alloc(gsl_rng_default);
}
 
 
int get_random(int min, int max) {
	unsigned int umax;
	assert(max > min);
	umax = max - min - 1;
	return gsl_rng_uniform_int (rng, umax) + min;
}
 
double probabilities[15] = {
	0.21380651385774, /* <= 1024 = pow(2,0+10)*/
	0.49594935323927, /* <= 2048 = pow(2,1+11)*/
	0.70676450549287,
	0.81264080286852,
	0.86918076974721,
	0.89862442124737,
	0.92154576062145,
	0.94019514718096,
	0.95773303030152,
	0.97112577643613,
	0.98311111553171,
	0.99845085762312,
	1
};
 
double interpolate(double yleft, double yright, double x, double xleft, double xright) {
	double deltax = xright - xleft;
	double deltay = yright - yleft;
	double k = deltay/deltax;
	return (x - xleft) * k + yleft;
}
 
unsigned long getnextsize() {
	double prob = gsl_rng_uniform_pos(rng);
	int i = 0;
	double j;
	unsigned long hardlowerlimit = 600;
	/*unsigned long hardupperlimit = 10*1024*1024;*/
	unsigned long size;
 
	while(probabilities[i] <= prob) {
		i++;
	}
	if (i>0) {
		j = interpolate(i-1, i, prob, probabilities[i-1], probabilities[i]);
		size = pow(2, j + 10);
		if(size < hardlowerlimit) {
			return getnextsize();
		}
	}else{
		size = get_random(hardlowerlimit, pow(2, 0 + 10));
	}
 
	return size;
}
void usage(char * progname) {
	fprintf(stderr, "%s: SYNAPSIS: number_of_mails throughput \n"
		"\n"
		"\tnumber_of_mails\thow many mails should be generated\n"
		"\tthroughput\tbytes per (virtual) second to send\n"
		"\n"
		"This program prints a 2-column list of time (in seconds) and\n"
		"size (in bytes) to simulate a email traffic. \n"
		"Configure your network simulator to transmit this number of bytes\n"
		"at that time. (This program does not send mails.)\n"
		"\n"
		"(c) Johannes Buchner\n", progname);
	exit(0);
}
 
int main(int argc, char ** argv) {
	int n;
	double throughput;
	unsigned long bytes_sent = 0;
	unsigned long nextsize;
	unsigned long time = 0;
 
	if (argc != 3) {
		usage(argv[0]);
	}
	init_rand();
	throughput = atof(argv[2]);
	n = atoi(argv[1]);
	if (throughput < 0 || n < 0) {
		usage(argv[0]);
	}
 
	while (n-- > 0) {
		nextsize = getnextsize();
		printf("%lu\t%lu\n", time, nextsize);
		bytes_sent += nextsize;
		time += nextsize /* bytes */ / throughput /* bytes per sec */;
	}
	return 0;
}

1 Comment

Email size distribution: results

Following up from a previous post and a post to the qmail and postfix mailing lists:

I analysed the distribution of email sizes. I wasn’t interested in spam, automated emails or mailing lists, rather user-written mails.

I analyzed the sizes of mails in user inboxes. I am always looking at buckets (how many mails are smaller than size
x, but bigger than the last buckets).  I used the following upper limits: 1024
1536 2048 2560 3072 4096 5120 6144 7168 8192 10240 12288 14336
16384 20480 24576 28672 32768 40960 49152 53248 57344 65536 131072
262144 524288 1048576 10264576 and 1073741824 (bytes).

I laid out some interpretations about my inbox as preliminary insight. Now I want to publish the results  from the friendly people that answered my call for data.

Ordered by cardinality (number of mails in dataset):

mymail                             5069
thomas.schwinge_priv               8759
linux.kernel                      11127
charles_cazabon-nospam            22555
phoemix_harmlesshu                30991
thomas.schwinge_tech_mailinglist 154508
phill_macey_sent                 402158
spamsizes_steff                 1044713
michaelreck                     4086621 (700GB, ~40000 users)
phill_macey_inbox               7999737
phill_macey_all                 8401951 (~300GB)

Thanks go out to Thomas Schwinge, Charles Cazabon, phoemix, Michael Reck from brauchmer.net, Markus Stumpf and Phill Macey.

The empirical distribution functions follow (with downloadable eps versions). X-axis (abscissa) is always size in Bytes, the y-axis (ordinate) is the portion of mails found with exactly this size. This is calculated by using the number of mails in a buckets above, dividing by the bucket size (difference to lower border) and dividing by the sum of mails.

The upper is single-log, the lower is double-log.

overview-density

And the cumulative distribution. This is interesting because a shift can not be seen well in the probability functions. These are linear plots: ordinate being cumulative percentage (what percentage of mail is below size x). The thicker the lines, the more data in this dataset.

overview-cum

And logarithmic in size:

overview-cum-log

Here are the plots in better quality (as eps): overview-density overview-cum overview-cum-log

Warnings:

There are several errors to consider before taking this too literal: (1) It is uncertain how clean the data of large datasets is (spam, mailing-lists, generated data all mixed up).  (2) headers are different on receiving than on sending. (3) What users send is not the same traffic as they receive (automated mails).

If you just want to take one number out of this, the median size of a mail is around 30kB (watch the 50% line).

For my purposes of modeling a human generating mails, this is fantastic (dataset phill_macey_sent). This is covered in my next post.

Comments are always welcome!

1 Comment

Modelling realistic email traffic – email size distribution

In network simulation, traffic has to be simulated. Here I’m looking at email traffic. Often, CBR (constant bit rate) is used.

For a more realistic email generator I wanted to find out how the emails are distributed in size.

The only post that goes into detail is found at

http://osdir.com/ml/freebsd.devel.net/2002-10/msg00203.html

The resolution of this is quite poor though, especially between 4KB and 0 would be interesting.
However, below is a sampler that generates random email sizes, distributed such that it conforms to the distribution.
Between 2048 and 900 (lower limit, headers almost always need that much space) a uniform distribution is used.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
/* cumulative probabilities */
double probabilities[12] = {
	0.2353, /* < 2048 = pow(2, 0+11) */
	0.4917, /* < 3096 = pow(2, 1+11) */
	0.676,
	0.7958,
	0.8536,
	0.9047,
	0.933,
	0.9574,
	0.9724,
	0.9849,
	0.9996,
	1.0
};
 
double interpolate(double yleft, double yright, double x, double xleft, double xright) {
	double deltax = xright - xleft;
	double deltay = yright - yleft;
	double k = deltay/deltax;
	return (x - xleft) * k + yleft;
}
 
unsigned long getnextsize() {
	double prob = gsl_rng_uniform_pos(rng);
	int i = 0;
	double j;
	unsigned long hardlowerlimit = 900;
	/*unsigned long hardupperlimit = 10*1024*1024;*/
	unsigned long size;
 
	while(probabilities[i] <= prob) {
		i++;
	}
	assert(probabilities[i] > prob);
	if (i > 0) {
		j = interpolate(i-1, i, prob, probabilities[i-1], probabilities[i]);
		size = pow(2, j + 11);
		if(size < hardlowerlimit) {
			return getnextsize();
		}
	}else{
		size = get_random(hardlowerlimit, pow(2, 0+11));
	}
 
	return size;
}

The measured resulting cumulative distribution function, as well as a comparison to my own mailbox (which might be atypical, who knows).

mailsize-cumulative-distribution

It would be interesting to get more detailed and newer histograms.
I also considered modelling it using a Maxwell–Boltzmann distribution, but generating that seemed more complex and on closer look, the form does not really fit.

Full sampler code:
emailgen.table.c

4 Comments

Continuous parameter optimization

In the recent days and weeks I got into the topic of “continuous parameter optimization”. That is the case when you have a n-dimensional function that returns a value, and you want find a maximum or minimum.

In particular I was looking for an algorithm that finds one (local) maximum, and uses the least number of evaluations. Because in the scenario I have, evaluating the function takes hours (or days).

So what algorithms can you use?
I developed some variants of the most primitive and naive approach.
You can find the description and source here: http://github.com/JohannesBuchner/nobrain_optimization/

Later I found CONDOR which is much more sophisticated.
I adopted the project (GPL), so I can provide the same useful interface that allows to plug in any program or programming language.
I wrote an introduction here: http://wiki.github.com/JohannesBuchner/condor_optimization

What do I need this for?
I am developing a statistic and numerical application at the University of Vienna (Astronomy) with a number of input parameters. It takes hours to compute, and it can return a value of evaluation. I want to find the optimal values for the input parameters (that are not independent) in the shortest time (least evaluations) possible.

A good measurement for the performance of an optimization algorithm is Rosenbrock’s valley.

2 Comments