Archive for category science
APEMoST
Posted by JohannesTheLittleScientist in science on November 19th, 2009
Recently I have been busy updating APEMoST, which is a MCMC sampler for Bayesian inference. This can be used as a statistical procedure to estimate parameters of a model. That sounds pretty generic, and indeed it is. One specific example would be to determine the orbit parameters of exoplanets. You can also specify multiple models (1 planet, 2 planets, no planets) and calculate (with the tool) which model is more likely.
Everything is at http://apemost.sourceforge.net/. Papers pending ;-)
Exhausting a finite parameter space by enumerating Q^n
Posted by JohannesTheLittleScientist in science on September 12th, 2009
In many cases, the parameter space (e.g. simulation parameters) is a n-dimensional box/cube, scalable to [0:1]^n, where n donates the number of dimensions = number of parameters.
Since simulations for one parameter set take time, and one wants to exhaust the parameter space efficiently (visit every subspace uniformly), one has to come up with some method of generating or suggesting points to visit.
Often, a Monte-Carlo approach is used by just using random points (uniformly, to be exact).
Here I want to present a simple enumeration scheme that eventually visits all points, by ordering the rational numbers in an suitable way.
First, lets look at one dimension. The strategy is to visit: 0, 1, 1/2, 1/4, 3/4. Then continue even deeper with 1/8, 3/8, 5/8, 7/8 and so on. Using fractions, a pattern is clearly visible. Here is a visualization:
The y-axis shows the one dimensional parameter space, the x-axis just helps you to see the ordering of the enumeration. Each animation step goes one step deeper.
The generating scheme up to a certain deepness is:
{ (j*2+1) / 2^deepness | j in [0..(2^deepness) / 2] }
Higher dimensions
So far, so easy. Lets generalize this to higher dimensions. We want to keep the property that the enumeration sort of “goes deeper”, becomes more accurate by filling the voids between the previously visited points.
Here, a two-dimensional parameter space is used. First, the edges (0,0),(1,0),(0,1),(1,1) are visited. Then the space in between and so on and so on. Each animation step goes one step deeper.
The easiest method for generating a enumeration scheme for arbitrary dimensions is
0. no visited points 1. Q1 <- generate enumerations of one dimension, up to deepness o 2. build all permutations [a_1,a_2,...,a_3] where a_i in Q1 3. remove already visited points 4. increase deepness, go to 1
Use: The more points you generate, the more you can exhaust the parameter space and the better the results get. At some point you have to stop your calculations, and this enumeration makes sure you have visited all subspaces homogeneously.
Appendix
Appended is a python script (numbering.py) I used to generate points in this enumerated way. the parameter space can easily be scaled by piping into awk (e.g. | awk '{print $1*3,$2+1}').
If you don’t want to or can’t run the script, the output for 1, 2 and 3 dimensional parameter spaces can be found in http://johannes.jakeapp.com/files/enumerate_rational/. These contains the first million points (deepness 22, 11 and 8).
import math import sys def getSegment(i): if i == 0: return 0. #if i == 1: # return 1 o = int(math.ceil(math.log(i, 2))) j = i - (2**o / 2) - 1 #print "i=", i, "o=", o, "j=", j return float(j*2+1) / 2**o def getSegmentsInDeepnessRange(a, b): if a == -1: first = 0 else: first = 2**a + 1 last = 2**b + 1 # print "o=", o, "first=", first, "last=", last return [getSegment(j) for j in range(first, last)] #crossproduct = lambda ss,row=[],level=0: len(ss)>1 \ # and reduce(lambda x,y:x+y,[crossproduct(ss[1:],row+[i],level+1) for i in ss[0]]) \ # or [row+[i] for i in ss[0]] def crossproduct(ss, row=[], level=0): if len(ss)>1: subcrossproduct = [crossproduct(ss[1:],row+[i],level+1) for i in ss[0]] return reduce(lambda x,y:x+y,subcrossproduct) else: return [row+[i] for i in ss[0]] if len(sys.argv) != 4: print "SYNOPSIS: dim min_deepness max_deepness" print print "\tdim\tdimensions of the parameter space" print "\tdeepness\thow many stages to go deep (set min=0)" print print "Q: how many numbers will be produced?" print "A: when min=0: (2^(max-1)+1)^dim; as a table:" print " max | dim=1 dim=2 ->" for i in range(20): print "%6d" % i, for dim in range(1,6): if ((2**(i-1)+1)**dim) > 10000000: print "\t ... ", else: print "\t%10d" % (2**(i-1)+1)**dim, print exit() dim = int(sys.argv[1]) i = 0 mindeepness = int(sys.argv[2]) maxdeepness = int(sys.argv[3]) if dim > 1: # o = deepness for o in range(mindeepness, maxdeepness): p = getSegmentsInDeepnessRange(-1, o) pnew = getSegmentsInDeepnessRange(o-1, o) if dim == 1: newvals = [[nv] for nv in p] else: newvals = crossproduct([p]*dim) # print only the new ones: for nv in newvals: for pn in pnew: if pn in nv: for v in nv: print v, print i = i + 1 break else: # dim = 1: #for i in range(20): # print getSegment(i) # one dimension i = 0 print i i = 1 print i for o in range(mindeepness, maxdeepness): for j in range((2**o) / 2): print ("%.30f" % ((j*2+1) * pow(2,-o))).rstrip('0')
Distribution of digit sums
Posted by JohannesTheLittleScientist in science on September 2nd, 2009
How many numbers do have a digit sum of 23? How frequent is the digit sum 32?
Astonishingly, 6% of the numbers below 100000 have a digit sum of 23:
This seems pretty high, doesn’t it? It suggests that the digit sum is not evenly distributed, and it is, in fact, not:
As you see, the numbers 13 and 14 have the highest share as a digit sum of numbers below 1000. For values smaller than 10000, it is 18. And for values smaller than 100000, with 6% of all numbers, 22 and 23 are the most common digit sums.
If I’d had to start a conspiracy theory, I’d choose one of these. Also note how the gaussian distribution appeared out of nowhere.
Realistic user-email generator
Posted by JohannesTheLittleScientist in science on August 26th, 2009
This is an updated code for simulating user-generated email traffic, for the purpose of network simulation. (previous post)
It is based on the analysis of hundreds of gigabytes of mails (see this post).
This samples email size. You give this program a traffic rate, and it tells you when a user sends a mail of what size (two-column list output). It is based on the distribution of 402158 sent-emails, made available by Phill Macey (thank you!). Here you can see the distribution, and the sampler:
As you can see, the generator (by-table) follows the empirical distribution very well. The sampler is very fast (1700000 lines per second on my machine).
example usage: Generate five mails with 100 bytes/second throughput
$ gcc -pg -lgsl -lm -O3 -ansi -pedantic -Wall -Werror -Wextra emailgen-phill-macey.table.c -o emailgen-phill-macey.table.exe $ GSL_RNG_SEED=12144 ./emailgen-phill-macey.table.exe 5 100 0 4417 44 1262 56 791 63 829 71 1137
Which means, at second 44 the user wants to write a mail of 1262 bytes.
The full code, emailgen-phill-macey.table.c:
#include<stdio.h> #include<stdlib.h> #include<assert.h> #include<math.h> #include<gsl/gsl_rng.h> gsl_rng * rng; void init_rand() { gsl_rng_env_setup(); rng = gsl_rng_alloc(gsl_rng_default); } int get_random(int min, int max) { unsigned int umax; assert(max > min); umax = max - min - 1; return gsl_rng_uniform_int (rng, umax) + min; } double probabilities[15] = { 0.21380651385774, /* <= 1024 = pow(2,0+10)*/ 0.49594935323927, /* <= 2048 = pow(2,1+11)*/ 0.70676450549287, 0.81264080286852, 0.86918076974721, 0.89862442124737, 0.92154576062145, 0.94019514718096, 0.95773303030152, 0.97112577643613, 0.98311111553171, 0.99845085762312, 1 }; double interpolate(double yleft, double yright, double x, double xleft, double xright) { double deltax = xright - xleft; double deltay = yright - yleft; double k = deltay/deltax; return (x - xleft) * k + yleft; } unsigned long getnextsize() { double prob = gsl_rng_uniform_pos(rng); int i = 0; double j; unsigned long hardlowerlimit = 600; /*unsigned long hardupperlimit = 10*1024*1024;*/ unsigned long size; while(probabilities[i] <= prob) { i++; } if (i>0) { j = interpolate(i-1, i, prob, probabilities[i-1], probabilities[i]); size = pow(2, j + 10); if(size < hardlowerlimit) { return getnextsize(); } }else{ size = get_random(hardlowerlimit, pow(2, 0 + 10)); } return size; } void usage(char * progname) { fprintf(stderr, "%s: SYNAPSIS: number_of_mails throughput \n" "\n" "\tnumber_of_mails\thow many mails should be generated\n" "\tthroughput\tbytes per (virtual) second to send\n" "\n" "This program prints a 2-column list of time (in seconds) and\n" "size (in bytes) to simulate a email traffic. \n" "Configure your network simulator to transmit this number of bytes\n" "at that time. (This program does not send mails.)\n" "\n" "(c) Johannes Buchner\n", progname); exit(0); } int main(int argc, char ** argv) { int n; double throughput; unsigned long bytes_sent = 0; unsigned long nextsize; unsigned long time = 0; if (argc != 3) { usage(argv[0]); } init_rand(); throughput = atof(argv[2]); n = atoi(argv[1]); if (throughput < 0 || n < 0) { usage(argv[0]); } while (n-- > 0) { nextsize = getnextsize(); printf("%lu\t%lu\n", time, nextsize); bytes_sent += nextsize; time += nextsize /* bytes */ / throughput /* bytes per sec */; } return 0; }
Email size distribution: results
Posted by JohannesTheLittleScientist in science on August 26th, 2009
Following up from a previous post and a post to the qmail and postfix mailing lists:
I analysed the distribution of email sizes. I wasn’t interested in spam, automated emails or mailing lists, rather user-written mails.
I analyzed the sizes of mails in user inboxes. I am always looking at buckets (how many mails are smaller than size
x, but bigger than the last buckets). I used the following upper limits: 1024
1536 2048 2560 3072 4096 5120 6144 7168 8192 10240 12288 14336
16384 20480 24576 28672 32768 40960 49152 53248 57344 65536 131072
262144 524288 1048576 10264576 and 1073741824 (bytes).
I laid out some interpretations about my inbox as preliminary insight. Now I want to publish the results from the friendly people that answered my call for data.
Ordered by cardinality (number of mails in dataset):
mymail 5069 thomas.schwinge_priv 8759 linux.kernel 11127 charles_cazabon-nospam 22555 phoemix_harmlesshu 30991 thomas.schwinge_tech_mailinglist 154508 phill_macey_sent 402158 spamsizes_steff 1044713 michaelreck 4086621 (700GB, ~40000 users) phill_macey_inbox 7999737 phill_macey_all 8401951 (~300GB)
Thanks go out to Thomas Schwinge, Charles Cazabon, phoemix, Michael Reck from brauchmer.net, Markus Stumpf and Phill Macey.
The empirical distribution functions follow (with downloadable eps versions). X-axis (abscissa) is always size in Bytes, the y-axis (ordinate) is the portion of mails found with exactly this size. This is calculated by using the number of mails in a buckets above, dividing by the bucket size (difference to lower border) and dividing by the sum of mails.
The upper is single-log, the lower is double-log.
And the cumulative distribution. This is interesting because a shift can not be seen well in the probability functions. These are linear plots: ordinate being cumulative percentage (what percentage of mail is below size x). The thicker the lines, the more data in this dataset.
And logarithmic in size:
Here are the plots in better quality (as eps): overview-density overview-cum overview-cum-log
Warnings:
There are several errors to consider before taking this too literal: (1) It is uncertain how clean the data of large datasets is (spam, mailing-lists, generated data all mixed up). (2) headers are different on receiving than on sending. (3) What users send is not the same traffic as they receive (automated mails).
If you just want to take one number out of this, the median size of a mail is around 30kB (watch the 50% line).
For my purposes of modeling a human generating mails, this is fantastic (dataset phill_macey_sent). This is covered in my next post.
Comments are always welcome!
Modelling realistic email traffic – email size distribution
Posted by JohannesTheLittleScientist in science on August 8th, 2009
In network simulation, traffic has to be simulated. Here I’m looking at email traffic. Often, CBR (constant bit rate) is used.
For a more realistic email generator I wanted to find out how the emails are distributed in size.
The only post that goes into detail is found at
http://osdir.com/ml/freebsd.devel.net/2002-10/msg00203.html
The resolution of this is quite poor though, especially between 4KB and 0 would be interesting.
However, below is a sampler that generates random email sizes, distributed such that it conforms to the distribution.
Between 2048 and 900 (lower limit, headers almost always need that much space) a uniform distribution is used.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 | /* cumulative probabilities */ double probabilities[12] = { 0.2353, /* < 2048 = pow(2, 0+11) */ 0.4917, /* < 3096 = pow(2, 1+11) */ 0.676, 0.7958, 0.8536, 0.9047, 0.933, 0.9574, 0.9724, 0.9849, 0.9996, 1.0 }; double interpolate(double yleft, double yright, double x, double xleft, double xright) { double deltax = xright - xleft; double deltay = yright - yleft; double k = deltay/deltax; return (x - xleft) * k + yleft; } unsigned long getnextsize() { double prob = gsl_rng_uniform_pos(rng); int i = 0; double j; unsigned long hardlowerlimit = 900; /*unsigned long hardupperlimit = 10*1024*1024;*/ unsigned long size; while(probabilities[i] <= prob) { i++; } assert(probabilities[i] > prob); if (i > 0) { j = interpolate(i-1, i, prob, probabilities[i-1], probabilities[i]); size = pow(2, j + 11); if(size < hardlowerlimit) { return getnextsize(); } }else{ size = get_random(hardlowerlimit, pow(2, 0+11)); } return size; } |
The measured resulting cumulative distribution function, as well as a comparison to my own mailbox (which might be atypical, who knows).
It would be interesting to get more detailed and newer histograms.
I also considered modelling it using a Maxwell–Boltzmann distribution, but generating that seemed more complex and on closer look, the form does not really fit.
Full sampler code:
emailgen.table.c
Continuous parameter optimization
Posted by JohannesTheLittleScientist in science on April 26th, 2009
In the recent days and weeks I got into the topic of “continuous parameter optimization”. That is the case when you have a n-dimensional function that returns a value, and you want find a maximum or minimum.
In particular I was looking for an algorithm that finds one (local) maximum, and uses the least number of evaluations. Because in the scenario I have, evaluating the function takes hours (or days).
So what algorithms can you use?
I developed some variants of the most primitive and naive approach.
You can find the description and source here: http://github.com/JohannesBuchner/nobrain_optimization/
Later I found CONDOR which is much more sophisticated.
I adopted the project (GPL), so I can provide the same useful interface that allows to plug in any program or programming language.
I wrote an introduction here: http://wiki.github.com/JohannesBuchner/condor_optimization
What do I need this for?
I am developing a statistic and numerical application at the University of Vienna (Astronomy) with a number of input parameters. It takes hours to compute, and it can return a value of evaluation. I want to find the optimal values for the input parameters (that are not independent) in the shortest time (least evaluations) possible.
A good measurement for the performance of an optimization algorithm is Rosenbrock’s valley.







Recent Comments