<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Johannes Buchner &#187; JohannesTheLittleScientist</title>
	<atom:link href="http://johannes.jakeapp.com/blog/category/author/johannesthescientist/feed" rel="self" type="application/rss+xml" />
	<link>http://johannes.jakeapp.com/blog</link>
	<description>Johannes Buchner&#039;s blog about advanced usage of your operating system</description>
	<lastBuildDate>Sun, 18 Jul 2010 08:30:01 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>APEMoST</title>
		<link>http://johannes.jakeapp.com/blog/category/science/200911/apemost</link>
		<comments>http://johannes.jakeapp.com/blog/category/science/200911/apemost#comments</comments>
		<pubDate>Thu, 19 Nov 2009 12:57:51 +0000</pubDate>
		<dc:creator>JohannesTheLittleScientist</dc:creator>
				<category><![CDATA[science]]></category>
		<category><![CDATA[tool]]></category>

		<guid isPermaLink="false">http://johannes.jakeapp.com/blog/?p=958</guid>
		<description><![CDATA[Recently I have been busy updating APEMoST, which is an MCMC sampler for Bayesian inference. This can be used as a statistical procedure to estimate parameters of a model. That sounds pretty generic, and indeed it is. One specific example would be to determine the orbit parameters of exoplanets. You can also specify multiple models [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I have been busy updating APEMoST, which is an MCMC sampler for Bayesian inference. This can be used as a statistical procedure to estimate parameters of a model. That sounds pretty generic, and indeed it is. One specific example would be to determine the orbit parameters of exoplanets. You can also specify multiple models (1 planet, 2 planets, no planets) and calculate (with the tool) which model is more likely.</p>
<p>Everything is at <a href="http://apemost.sourceforge.net/">http://apemost.sourceforge.net/</a>. Papers pending ;-)</p>
]]></content:encoded>
			<wfw:commentRss>http://johannes.jakeapp.com/blog/category/science/200911/apemost/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Exhausting a finite parameter space by enumerating Q^n</title>
		<link>http://johannes.jakeapp.com/blog/category/science/200909/exhausting-a-finite-parameter-space-by-enumerating-qn</link>
		<comments>http://johannes.jakeapp.com/blog/category/science/200909/exhausting-a-finite-parameter-space-by-enumerating-qn#comments</comments>
		<pubDate>Sat, 12 Sep 2009 03:54:10 +0000</pubDate>
		<dc:creator>JohannesTheLittleScientist</dc:creator>
				<category><![CDATA[science]]></category>

		<guid isPermaLink="false">http://johannes.jakeapp.com/blog/?p=900</guid>
		<description><![CDATA[In many cases, the parameter space (e.g. simulation parameters) is an n-dimensional box/cube, scalable to [0:1]^n, where n denotes the number of dimensions = number of parameters.
Since simulations for one parameter set take time, and one wants to exhaust the parameter space efficiently (visit every subspace uniformly), one has to come up with some method [...]]]></description>
			<content:encoded><![CDATA[<p>In many cases, the parameter space (e.g. simulation parameters) is an n-dimensional box/cube, scalable to [0:1]^n, where n denotes the number of dimensions = number of parameters.</p>
<p>Since simulations for one parameter set take time, and one wants to exhaust the parameter space efficiently (visit every subspace uniformly), one has to come up with some method of generating or suggesting points to visit.</p>
<p>Often, a Monte-Carlo approach is used by just using random points (uniformly, to be exact).</p>
<p>Here I want to present a simple enumeration scheme that eventually visits all points, by ordering the rational numbers in a suitable way.</p>
<p>First, let&#8217;s look at one dimension. The strategy is to visit: 0, 1, 1/2, 1/4, 3/4. Then continue even deeper with 1/8, 3/8, 5/8, 7/8 and so on. Using fractions, a pattern is clearly visible. Here is a visualization:</p>
<p><a href="http://johannes.jakeapp.com/blog/wp-content/uploads/2009/09/Q1.gif"><img class="aligncenter size-full wp-image-901" title="Q1" src="http://johannes.jakeapp.com/blog/wp-content/uploads/2009/09/Q1.gif" alt="Q1" width="640" height="480" /></a>The y-axis shows the one dimensional parameter space, the x-axis just helps you to see the ordering of the enumeration. Each animation step goes one step deeper.</p>
<p>The generating scheme up to a certain deepness is:</p>
<pre>{ (j*2+1) / 2^deepness | j in [0 .. 2^(deepness-1) - 1] }</pre>
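<p>The one-dimensional scheme above can be sketched in a few lines of Python 3. This is a minimal illustration rather than the appended script: the function names <code>level_points</code> and <code>enumerate_unit_interval</code> are made up for this example, and exact fractions are used instead of floats so that the pattern stays visible.</p>

```python
from fractions import Fraction


def level_points(deepness):
    """Return the points that are new at the given deepness (1 or larger)."""
    denominator = 2 ** deepness
    # Only odd numerators: even ones reduce to points of an earlier level.
    return [Fraction(numerator, denominator)
            for numerator in range(1, denominator, 2)]


def enumerate_unit_interval(max_deepness):
    """Yield 0, 1, then every level from 1 up to max_deepness."""
    yield Fraction(0)
    yield Fraction(1)
    for deepness in range(1, max_deepness + 1):
        yield from level_points(deepness)


print([str(p) for p in enumerate_unit_interval(3)])
# ['0', '1', '1/2', '1/4', '3/4', '1/8', '3/8', '5/8', '7/8']
```

<p>Each level contributes 2^(deepness-1) new points, so stopping after any complete level leaves the interval evenly covered.</p>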
<p><strong>Higher dimensions</strong></p>
<p>So far, so easy. Let&#8217;s generalize this to higher dimensions. We want to keep the property that the enumeration sort of &#8220;goes deeper&#8221;, becoming more accurate by filling the voids between the previously visited points.</p>
<p><a href="http://johannes.jakeapp.com/blog/wp-content/uploads/2009/09/Q2.gif"><img class="aligncenter size-full wp-image-902" title="Q2" src="http://johannes.jakeapp.com/blog/wp-content/uploads/2009/09/Q2.gif" alt="Q2" width="640" height="480" /></a>Here, a two-dimensional parameter space is used. First, the corners (0,0),(1,0),(0,1),(1,1) are visited. Then the space in between and so on and so on. Each animation step goes one step deeper.</p>
<p>The easiest method for generating an enumeration scheme for arbitrary dimensions is:</p>
<pre>0. no visited points
1. Q1 &lt;- generate enumerations of one dimension, up to deepness o
2. build all tuples [a_1,a_2,...,a_n] where a_i in Q1
3. remove already visited points
4. increase deepness, go to 1</pre>
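<p>The steps above can be sketched in Python 3 as follows. This is an illustration under stated assumptions, not the appended script: <code>points_up_to</code> and <code>enumerate_box</code> are hypothetical names, deepness 0 is taken to mean just the values 0 and 1, and already visited points are remembered in a set, which costs memory for large runs.</p>

```python
from fractions import Fraction
from itertools import product


def points_up_to(deepness):
    """All 1-D values k / 2**deepness in [0, 1] (0, 1 and every level so far)."""
    denominator = 2 ** deepness
    return [Fraction(k, denominator) for k in range(denominator + 1)]


def enumerate_box(dim, max_deepness):
    """Yield every grid point of [0,1]^dim exactly once, coarse levels first."""
    visited = set()
    for deepness in range(max_deepness + 1):
        # Step 2: all tuples built from the 1-D values of this deepness.
        for point in product(points_up_to(deepness), repeat=dim):
            # Step 3: skip points already produced at a coarser level.
            if point not in visited:
                visited.add(point)
                yield point


for point in enumerate_box(2, 1):
    print(" ".join(str(float(v)) for v in point))
```

<p>With two dimensions and a maximum deepness of 1, this prints the four corners first and then the five points that involve the value 0.5.</p>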
<p><strong>Use: </strong>The more points you generate, the more you can exhaust the parameter space and the better the results get. At some point you have to stop your calculations, and this enumeration makes sure you have visited all subspaces homogeneously.</p>
<p><strong>Appendix</strong></p>
<p>Appended is a Python script (numbering.py) I used to generate points in this enumerated way. The parameter space can easily be scaled by piping the output into awk (e.g. <code>| awk '{print $1*3,$2+1}'</code>).</p>
<p>If you don&#8217;t want to or can&#8217;t run the script, the output for 1, 2 and 3 dimensional parameter spaces can be found in <a href="http://johannes.jakeapp.com/files/enumerate_rational/">http://johannes.jakeapp.com/files/enumerate_rational/</a>. These contain the first million points (deepness 22, 11 and 8).</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">math</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">sys</span> 
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> getSegment<span style="color: black;">&#40;</span>i<span style="color: black;">&#41;</span>:
	<span style="color: #ff7700;font-weight:bold;">if</span> i == <span style="color: #ff4500;">0</span>:
		<span style="color: #ff7700;font-weight:bold;">return</span> 0.
	<span style="color: #808080; font-style: italic;">#if i == 1:</span>
	<span style="color: #808080; font-style: italic;">#	return 1</span>
	o = <span style="color: #008000;">int</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">math</span>.<span style="color: black;">ceil</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">math</span>.<span style="color: black;">log</span><span style="color: black;">&#40;</span>i, <span style="color: #ff4500;">2</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
	j = i - <span style="color: black;">&#40;</span><span style="color: #ff4500;">2</span><span style="color: #66cc66;">**</span>o / <span style="color: #ff4500;">2</span><span style="color: black;">&#41;</span> - <span style="color: #ff4500;">1</span>
	<span style="color: #808080; font-style: italic;">#print &quot;i=&quot;, i, &quot;o=&quot;, o, &quot;j=&quot;, j</span>
	<span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">float</span><span style="color: black;">&#40;</span>j<span style="color: #66cc66;">*</span><span style="color: #ff4500;">2</span>+<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span> / <span style="color: #ff4500;">2</span><span style="color: #66cc66;">**</span>o
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> getSegmentsInDeepnessRange<span style="color: black;">&#40;</span>a, b<span style="color: black;">&#41;</span>:
	<span style="color: #ff7700;font-weight:bold;">if</span> a == -<span style="color: #ff4500;">1</span>:
		first = <span style="color: #ff4500;">0</span>
	<span style="color: #ff7700;font-weight:bold;">else</span>:
		first = <span style="color: #ff4500;">2</span><span style="color: #66cc66;">**</span>a + <span style="color: #ff4500;">1</span>
	last  = <span style="color: #ff4500;">2</span><span style="color: #66cc66;">**</span>b + <span style="color: #ff4500;">1</span>
	<span style="color: #808080; font-style: italic;"># print &quot;o=&quot;, o, &quot;first=&quot;, first, &quot;last=&quot;, last</span>
	<span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: black;">&#91;</span>getSegment<span style="color: black;">&#40;</span>j<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">for</span> j <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span>first, last<span style="color: black;">&#41;</span><span style="color: black;">&#93;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">#crossproduct = lambda ss,row=[],level=0: len(ss)&gt;1 \</span>
<span style="color: #808080; font-style: italic;">#	and reduce(lambda x,y:x+y,[crossproduct(ss[1:],row+[i],level+1) for i in ss[0]]) \</span>
<span style="color: #808080; font-style: italic;">#	or [row+[i] for i in ss[0]]</span>
<span style="color: #ff7700;font-weight:bold;">def</span> crossproduct<span style="color: black;">&#40;</span>ss, row=<span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>, level=<span style="color: #ff4500;">0</span><span style="color: black;">&#41;</span>:
	<span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>ss<span style="color: black;">&#41;</span> <span style="color: #66cc66;">&gt;</span> <span style="color: #ff4500;">1</span>:
		subcrossproduct = <span style="color: black;">&#91;</span>crossproduct<span style="color: black;">&#40;</span>ss<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>:<span style="color: black;">&#93;</span>,row+<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span>,level+<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> ss<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><span style="color: black;">&#93;</span>
		<span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">reduce</span><span style="color: black;">&#40;</span><span style="color: #ff7700;font-weight:bold;">lambda</span> x,y:x+y,subcrossproduct<span style="color: black;">&#41;</span>
	<span style="color: #ff7700;font-weight:bold;">else</span>:
		<span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: black;">&#91;</span>row+<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span> <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> ss<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><span style="color: black;">&#93;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#41;</span> <span style="color: #66cc66;">!</span>= <span style="color: #ff4500;">4</span>:
	<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;SYNOPSIS: dim min_deepness max_deepness&quot;</span>
	<span style="color: #ff7700;font-weight:bold;">print</span>
	<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>dim<span style="color: #000099; font-weight: bold;">\t</span>dimensions of the parameter space&quot;</span>
	<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>deepness<span style="color: #000099; font-weight: bold;">\t</span>how many stages to go deep (set min=0)&quot;</span>
	<span style="color: #ff7700;font-weight:bold;">print</span>
	<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;Q: how many numbers will be produced?&quot;</span>
	<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;A: when min=0: (2^(max-1)+1)^dim; as a table:&quot;</span>
	<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;  max   |    dim=1           dim=2     -&gt;&quot;</span>
	<span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">20</span><span style="color: black;">&#41;</span>:
		<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;%6d&quot;</span> <span style="color: #66cc66;">%</span> i,
		<span style="color: #ff7700;font-weight:bold;">for</span> dim <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span>,<span style="color: #ff4500;">6</span><span style="color: black;">&#41;</span>:
			<span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: black;">&#40;</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">2</span><span style="color: #66cc66;">**</span><span style="color: black;">&#40;</span>i-<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>+<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">**</span>dim<span style="color: black;">&#41;</span> <span style="color: #66cc66;">&gt;</span> <span style="color: #ff4500;">10000000</span>:
				<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>    ...    &quot;</span>,
			<span style="color: #ff7700;font-weight:bold;">else</span>:
				<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>%10d&quot;</span> <span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">2</span><span style="color: #66cc66;">**</span><span style="color: black;">&#40;</span>i-<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>+<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span><span style="color: #66cc66;">**</span>dim,
		<span style="color: #ff7700;font-weight:bold;">print</span>
	exit<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
dim = <span style="color: #008000;">int</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
&nbsp;
i = <span style="color: #ff4500;">0</span>
&nbsp;
mindeepness = <span style="color: #008000;">int</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
maxdeepness = <span style="color: #008000;">int</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">3</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">if</span> dim <span style="color: #66cc66;">&gt;</span> <span style="color: #ff4500;">1</span>:
	<span style="color: #808080; font-style: italic;"># o = deepness</span>
	<span style="color: #ff7700;font-weight:bold;">for</span> o <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span>mindeepness, maxdeepness<span style="color: black;">&#41;</span>:
		p    = getSegmentsInDeepnessRange<span style="color: black;">&#40;</span>-<span style="color: #ff4500;">1</span>, o<span style="color: black;">&#41;</span>
		pnew = getSegmentsInDeepnessRange<span style="color: black;">&#40;</span>o-<span style="color: #ff4500;">1</span>,  o<span style="color: black;">&#41;</span>
		<span style="color: #ff7700;font-weight:bold;">if</span> dim == <span style="color: #ff4500;">1</span>:
			newvals = <span style="color: black;">&#91;</span><span style="color: black;">&#91;</span>nv<span style="color: black;">&#93;</span> <span style="color: #ff7700;font-weight:bold;">for</span> nv <span style="color: #ff7700;font-weight:bold;">in</span> p<span style="color: black;">&#93;</span>
		<span style="color: #ff7700;font-weight:bold;">else</span>:
			newvals = crossproduct<span style="color: black;">&#40;</span><span style="color: black;">&#91;</span>p<span style="color: black;">&#93;</span><span style="color: #66cc66;">*</span>dim<span style="color: black;">&#41;</span>
		<span style="color: #808080; font-style: italic;"># print only the new ones:</span>
		<span style="color: #ff7700;font-weight:bold;">for</span> nv <span style="color: #ff7700;font-weight:bold;">in</span> newvals:
			<span style="color: #ff7700;font-weight:bold;">for</span> pn <span style="color: #ff7700;font-weight:bold;">in</span> pnew:
				<span style="color: #ff7700;font-weight:bold;">if</span> pn <span style="color: #ff7700;font-weight:bold;">in</span> nv:
					<span style="color: #ff7700;font-weight:bold;">for</span> v <span style="color: #ff7700;font-weight:bold;">in</span> nv:
						<span style="color: #ff7700;font-weight:bold;">print</span> v,
					<span style="color: #ff7700;font-weight:bold;">print</span>
					i = i + <span style="color: #ff4500;">1</span>
					<span style="color: #ff7700;font-weight:bold;">break</span>
<span style="color: #ff7700;font-weight:bold;">else</span>: <span style="color: #808080; font-style: italic;"># dim = 1:</span>
<span style="color: #808080; font-style: italic;">#for i in range(20):</span>
<span style="color: #808080; font-style: italic;">#	print getSegment(i)</span>
&nbsp;
<span style="color: #808080; font-style: italic;"># one dimension </span>
&nbsp;
	i = <span style="color: #ff4500;">0</span>
	<span style="color: #ff7700;font-weight:bold;">print</span> i
	i = <span style="color: #ff4500;">1</span>
	<span style="color: #ff7700;font-weight:bold;">print</span> i
	<span style="color: #ff7700;font-weight:bold;">for</span> o <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span>mindeepness, maxdeepness<span style="color: black;">&#41;</span>:
		<span style="color: #ff7700;font-weight:bold;">for</span> j <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">2</span><span style="color: #66cc66;">**</span>o<span style="color: black;">&#41;</span> / <span style="color: #ff4500;">2</span><span style="color: black;">&#41;</span>:
			<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;%.30f&quot;</span> <span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: black;">&#40;</span>j<span style="color: #66cc66;">*</span><span style="color: #ff4500;">2</span>+<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span> <span style="color: #66cc66;">*</span> <span style="color: #008000;">pow</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">2</span>,-o<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>.<span style="color: black;">rstrip</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'0'</span><span style="color: black;">&#41;</span></pre></div></div>

]]></content:encoded>
			<wfw:commentRss>http://johannes.jakeapp.com/blog/category/science/200909/exhausting-a-finite-parameter-space-by-enumerating-qn/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Distribution of digit sums</title>
		<link>http://johannes.jakeapp.com/blog/category/science/200909/distribution-of-digit-sums</link>
		<comments>http://johannes.jakeapp.com/blog/category/science/200909/distribution-of-digit-sums#comments</comments>
		<pubDate>Wed, 02 Sep 2009 14:49:12 +0000</pubDate>
		<dc:creator>JohannesTheLittleScientist</dc:creator>
				<category><![CDATA[science]]></category>

		<guid isPermaLink="false">http://johannes.jakeapp.com/blog/?p=696</guid>
		<description><![CDATA[How many numbers have a digit sum of 23? How frequent is the digit sum 32?
Astonishingly, 6% of the numbers below 100000 have a digit sum of 23:

This seems pretty high, doesn&#8217;t it? It suggests that the digit sum is not evenly distributed, and it is, in fact, not:

As you see, the numbers 13 [...]]]></description>
			<content:encoded><![CDATA[<p>How many numbers have a digit sum of 23? How frequent is the digit sum 32?</p>
<p>Astonishingly, 6% of the numbers below 100000 have a digit sum of 23:</p>
<p style="text-align: center;"><a href="http://johannes.jakeapp.com/blog/wp-content/uploads/2009/09/crossfoot.png"><img class="aligncenter size-full wp-image-697" title="crossfoot" src="http://johannes.jakeapp.com/blog/wp-content/uploads/2009/09/crossfoot.png" alt="crossfoot" width="554" height="428" /></a></p>
<p>This seems pretty high, doesn&#8217;t it? It suggests that the digit sum is not evenly distributed, and it is, in fact, not:</p>
<p style="text-align: center;"><a href="http://johannes.jakeapp.com/blog/wp-content/uploads/2009/09/crossfoots.png"><img class="aligncenter size-full wp-image-698" title="crossfoots" src="http://johannes.jakeapp.com/blog/wp-content/uploads/2009/09/crossfoots.png" alt="crossfoots" width="554" height="428" /></a></p>
<p>As you see, the numbers 13 and 14 have the highest share as a digit sum of numbers below 1000. For values smaller than 10000, it is 18. And for values smaller than 100000, with 6% of all numbers, 22 and 23 are the most common digit sums.</p>
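<p>These shares are easy to recompute. The following Python 3 sketch counts the digit sums by brute force; the helper name <code>digit_sum_counts</code> is made up for this example, and it is not the code used to produce the plots.</p>

```python
from collections import Counter


def digit_sum_counts(limit):
    """Count how often each digit sum occurs among 0, 1, ..., limit - 1."""
    return Counter(sum(int(ch) for ch in str(n)) for n in range(limit))


for limit in (1000, 10000, 100000):
    counts = digit_sum_counts(limit)
    top = max(counts.values())
    # All digit sums that reach the highest count for this limit.
    modes = sorted(s for s, c in counts.items() if c == top)
    print(limit, modes, "share: %.2f%%" % (100.0 * top / limit))
```

<p>The printed modes and shares can be compared directly with the figures quoted above.</p>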
<p>If I had to start a conspiracy theory, I&#8217;d choose one of these. Also note how the Gaussian distribution appears out of nowhere.</p>
]]></content:encoded>
			<wfw:commentRss>http://johannes.jakeapp.com/blog/category/science/200909/distribution-of-digit-sums/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Realistic user-email generator</title>
		<link>http://johannes.jakeapp.com/blog/category/science/200908/email-generator</link>
		<comments>http://johannes.jakeapp.com/blog/category/science/200908/email-generator#comments</comments>
		<pubDate>Wed, 26 Aug 2009 14:13:29 +0000</pubDate>
		<dc:creator>JohannesTheLittleScientist</dc:creator>
				<category><![CDATA[science]]></category>

		<guid isPermaLink="false">http://johannes.jakeapp.com/blog/?p=684</guid>
		<description><![CDATA[This is updated code for simulating user-generated email traffic, for the purpose of network simulation. (previous post)
It is based on the analysis of hundreds of gigabytes of mails (see this post).
This samples email size. You give this program a traffic rate, and it tells you when a user sends a mail of what size [...]]]></description>
			<content:encoded><![CDATA[<p>This is updated code for simulating user-generated email traffic, for the purpose of network simulation. <a href="http://johannes.jakeapp.com/blog/category/science/200908/modelling-realistic-email-traffic-email-size-distribution">(previous post)</a></p>
<p>It is based on the analysis of hundreds of gigabytes of mails (see <a href="http://johannes.jakeapp.com/blog/?p=674">this post</a>).</p>
<p>This samples email sizes. You give the program a traffic rate, and it tells you when a user sends a mail and of what size (two-column list output). It is based on the distribution of 402158 sent emails, made available by Phill Macey (thank you!). Here you can see the distribution and the sampler:</p>
<p><a href="http://johannes.jakeapp.com/blog/wp-content/uploads/2009/08/overview-cum1.png"><img class="aligncenter size-full wp-image-687" title="overview-cum-generated" src="http://johannes.jakeapp.com/blog/wp-content/uploads/2009/08/overview-cum1.png" alt="overview-cum-generated" width="584" height="451" /></a></p>
<p>As you can see, the generator (by-table) follows the empirical distribution very well. The sampler is very fast (1700000 lines per second on my machine).</p>
<p>Example usage: generate five mails at a throughput of 100 bytes/second:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">$ <span style="color: #c20cb9; font-weight: bold;">gcc</span> <span style="color: #660033;">-pg</span> <span style="color: #660033;">-lgsl</span> <span style="color: #660033;">-lm</span> <span style="color: #660033;">-O3</span> <span style="color: #660033;">-ansi</span> <span style="color: #660033;">-pedantic</span> <span style="color: #660033;">-Wall</span> <span style="color: #660033;">-Werror</span> <span style="color: #660033;">-Wextra</span> emailgen-phill-macey.table.c <span style="color: #660033;">-o</span> emailgen-phill-macey.table.exe
$ <span style="color: #007800;">GSL_RNG_SEED</span>=<span style="color: #000000;">12144</span> .<span style="color: #000000; font-weight: bold;">/</span>emailgen-phill-macey.table.exe <span style="color: #000000;">5</span> <span style="color: #000000;">100</span> 
<span style="color: #000000;">0</span>	<span style="color: #000000;">4417</span>
<span style="color: #000000;">44</span>	<span style="color: #000000;">1262</span>
<span style="color: #000000;">56</span>	<span style="color: #000000;">791</span>
<span style="color: #000000;">63</span>	<span style="color: #000000;">829</span>
<span style="color: #000000;">71</span>	<span style="color: #000000;">1137</span></pre></div></div>

<p>This means that at second 44 the user wants to write a mail of 1262 bytes.</p>
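<p>The idea of sampling sizes &#8220;by table&#8221; can be illustrated in Python 3 with inverse-transform sampling on a cumulative table. This is only a sketch and not a port of the C program: the values are copied from its <code>probabilities</code> table, while the bucket limits (1024 bytes, doubling with each entry), the linear interpolation inside a bucket and the name <code>sample_size</code> are assumptions made for this example. Send times are not modelled here.</p>

```python
import bisect
import random

# Cumulative probabilities copied from the probabilities table of the C code.
CUMULATIVE = [
    0.21380651385774, 0.49594935323927, 0.70676450549287,
    0.81264080286852, 0.86918076974721, 0.89862442124737,
    0.92154576062145, 0.94019514718096, 0.95773303030152,
    0.97112577643613, 0.98311111553171, 0.99845085762312, 1.0,
]
# Assumption: entry k covers sizes up to 1024 * 2**k bytes.
BUCKET_LIMITS = [1024 * 2 ** k for k in range(len(CUMULATIVE))]


def sample_size(rng):
    """Draw one mail size in bytes by inverting the tabulated CDF."""
    u = rng.random()
    # First bucket whose cumulative probability reaches u.
    k = bisect.bisect_left(CUMULATIVE, u)
    lower_p = CUMULATIVE[k - 1] if k else 0.0
    lower_size = BUCKET_LIMITS[k - 1] if k else 0
    # Assumption: sizes are spread linearly within a bucket.
    fraction = (u - lower_p) / (CUMULATIVE[k] - lower_p)
    return int(lower_size + fraction * (BUCKET_LIMITS[k] - lower_size))


rng = random.Random(12144)
print([sample_size(rng) for _ in range(5)])
```

<p>Because a draw costs only one random number and a binary search over 13 entries, this kind of sampler is cheap, which fits the speed reported above.</p>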
<p>The full code, emailgen-phill-macey.table.c:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;"><span style="color: #339933;">#include&lt;stdio.h&gt;</span>
<span style="color: #339933;">#include&lt;stdlib.h&gt;</span>
<span style="color: #339933;">#include&lt;assert.h&gt;</span>
<span style="color: #339933;">#include&lt;math.h&gt;</span>
<span style="color: #339933;">#include&lt;gsl/gsl_rng.h&gt;</span>
&nbsp;
gsl_rng <span style="color: #339933;">*</span> rng<span style="color: #339933;">;</span>
<span style="color: #993333;">void</span> init_rand<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
	gsl_rng_env_setup<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	rng <span style="color: #339933;">=</span> gsl_rng_alloc<span style="color: #009900;">&#40;</span>gsl_rng_default<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
&nbsp;
<span style="color: #993333;">int</span> get_random<span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> min<span style="color: #339933;">,</span> <span style="color: #993333;">int</span> max<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
	<span style="color: #993333;">unsigned</span> <span style="color: #993333;">int</span> umax<span style="color: #339933;">;</span>
	assert<span style="color: #009900;">&#40;</span>max <span style="color: #339933;">&gt;</span> min<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	umax <span style="color: #339933;">=</span> max <span style="color: #339933;">-</span> min <span style="color: #339933;">-</span> <span style="color: #0000dd;">1</span><span style="color: #339933;">;</span>
	<span style="color: #b1b100;">return</span> gsl_rng_uniform_int <span style="color: #009900;">&#40;</span>rng<span style="color: #339933;">,</span> umax<span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> min<span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #993333;">double</span> probabilities<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">15</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
	<span style="color:#800080;">0.21380651385774</span><span style="color: #339933;">,</span> <span style="color: #808080; font-style: italic;">/* &lt;= 1024 = pow(2,0+10)*/</span>
	<span style="color:#800080;">0.49594935323927</span><span style="color: #339933;">,</span> <span style="color: #808080; font-style: italic;">/* &lt;= 2048 = pow(2,1+10)*/</span>
	<span style="color:#800080;">0.70676450549287</span><span style="color: #339933;">,</span>
	<span style="color:#800080;">0.81264080286852</span><span style="color: #339933;">,</span>
	<span style="color:#800080;">0.86918076974721</span><span style="color: #339933;">,</span>
	<span style="color:#800080;">0.89862442124737</span><span style="color: #339933;">,</span>
	<span style="color:#800080;">0.92154576062145</span><span style="color: #339933;">,</span>
	<span style="color:#800080;">0.94019514718096</span><span style="color: #339933;">,</span>
	<span style="color:#800080;">0.95773303030152</span><span style="color: #339933;">,</span>
	<span style="color:#800080;">0.97112577643613</span><span style="color: #339933;">,</span>
	<span style="color:#800080;">0.98311111553171</span><span style="color: #339933;">,</span>
	<span style="color:#800080;">0.99845085762312</span><span style="color: #339933;">,</span>
	<span style="color: #0000dd;">1</span>
<span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #993333;">double</span> interpolate<span style="color: #009900;">&#40;</span><span style="color: #993333;">double</span> yleft<span style="color: #339933;">,</span> <span style="color: #993333;">double</span> yright<span style="color: #339933;">,</span> <span style="color: #993333;">double</span> x<span style="color: #339933;">,</span> <span style="color: #993333;">double</span> xleft<span style="color: #339933;">,</span> <span style="color: #993333;">double</span> xright<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
	<span style="color: #993333;">double</span> deltax <span style="color: #339933;">=</span> xright <span style="color: #339933;">-</span> xleft<span style="color: #339933;">;</span>
	<span style="color: #993333;">double</span> deltay <span style="color: #339933;">=</span> yright <span style="color: #339933;">-</span> yleft<span style="color: #339933;">;</span>
	<span style="color: #993333;">double</span> k <span style="color: #339933;">=</span> deltay<span style="color: #339933;">/</span>deltax<span style="color: #339933;">;</span>
	<span style="color: #b1b100;">return</span> <span style="color: #009900;">&#40;</span>x <span style="color: #339933;">-</span> xleft<span style="color: #009900;">&#41;</span> <span style="color: #339933;">*</span> k <span style="color: #339933;">+</span> yleft<span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #993333;">unsigned</span> <span style="color: #993333;">long</span> getnextsize<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
	<span style="color: #993333;">double</span> prob <span style="color: #339933;">=</span> gsl_rng_uniform_pos<span style="color: #009900;">&#40;</span>rng<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	<span style="color: #993333;">int</span> i <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span>
	<span style="color: #993333;">double</span> j<span style="color: #339933;">;</span>
	<span style="color: #993333;">unsigned</span> <span style="color: #993333;">long</span> hardlowerlimit <span style="color: #339933;">=</span> <span style="color: #0000dd;">600</span><span style="color: #339933;">;</span>
	<span style="color: #808080; font-style: italic;">/*unsigned long hardupperlimit = 10*1024*1024;*/</span>
	<span style="color: #993333;">unsigned</span> <span style="color: #993333;">long</span> size<span style="color: #339933;">;</span>
&nbsp;
	<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span>probabilities<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> <span style="color: #339933;">&lt;=</span> prob<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
		i<span style="color: #339933;">++;</span>
	<span style="color: #009900;">&#125;</span>
	<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>i<span style="color: #339933;">&gt;</span><span style="color: #0000dd;">0</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
		j <span style="color: #339933;">=</span> interpolate<span style="color: #009900;">&#40;</span>i<span style="color: #339933;">-</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> i<span style="color: #339933;">,</span> prob<span style="color: #339933;">,</span> probabilities<span style="color: #009900;">&#91;</span>i<span style="color: #339933;">-</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> probabilities<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		size <span style="color: #339933;">=</span> pow<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">2</span><span style="color: #339933;">,</span> j <span style="color: #339933;">+</span> <span style="color: #0000dd;">10</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #b1b100;">if</span><span style="color: #009900;">&#40;</span>size <span style="color: #339933;">&lt;</span> hardlowerlimit<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
			<span style="color: #b1b100;">return</span> getnextsize<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
	<span style="color: #009900;">&#125;</span><span style="color: #b1b100;">else</span><span style="color: #009900;">&#123;</span>
		size <span style="color: #339933;">=</span> get_random<span style="color: #009900;">&#40;</span>hardlowerlimit<span style="color: #339933;">,</span> pow<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">2</span><span style="color: #339933;">,</span> <span style="color: #0000dd;">0</span> <span style="color: #339933;">+</span> <span style="color: #0000dd;">10</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	<span style="color: #009900;">&#125;</span>
&nbsp;
	<span style="color: #b1b100;">return</span> size<span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span>
<span style="color: #993333;">void</span> usage<span style="color: #009900;">&#40;</span><span style="color: #993333;">char</span> <span style="color: #339933;">*</span> progname<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
	fprintf<span style="color: #009900;">&#40;</span>stderr<span style="color: #339933;">,</span> <span style="color: #ff0000;">&quot;%s: SYNAPSIS: number_of_mails throughput <span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>
		<span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>
		<span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>number_of_mails<span style="color: #000099; font-weight: bold;">\t</span>how many mails should be generated<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>
		<span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>throughput<span style="color: #000099; font-weight: bold;">\t</span>bytes per (virtual) second to send<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>
		<span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>
		<span style="color: #ff0000;">&quot;This program prints a 2-column list of time (in seconds) and<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>
		<span style="color: #ff0000;">&quot;size (in bytes) to simulate a email traffic. <span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>
		<span style="color: #ff0000;">&quot;Configure your network simulator to transmit this number of bytes<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>
		<span style="color: #ff0000;">&quot;at that time. (This program does not send mails.)<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>
		<span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>
		<span style="color: #ff0000;">&quot;(c) Johannes Buchner<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">,</span> progname<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	exit<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">0</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #993333;">int</span> main<span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> argc<span style="color: #339933;">,</span> <span style="color: #993333;">char</span> <span style="color: #339933;">**</span> argv<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
	<span style="color: #993333;">int</span> n<span style="color: #339933;">;</span>
	<span style="color: #993333;">double</span> throughput<span style="color: #339933;">;</span>
	<span style="color: #993333;">unsigned</span> <span style="color: #993333;">long</span> bytes_sent <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span>
	<span style="color: #993333;">unsigned</span> <span style="color: #993333;">long</span> nextsize<span style="color: #339933;">;</span>
	<span style="color: #993333;">unsigned</span> <span style="color: #993333;">long</span> time <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span>
&nbsp;
	<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>argc <span style="color: #339933;">!=</span> <span style="color: #0000dd;">3</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
		usage<span style="color: #009900;">&#40;</span>argv<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">0</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	<span style="color: #009900;">&#125;</span>
	init_rand<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	throughput <span style="color: #339933;">=</span> atof<span style="color: #009900;">&#40;</span>argv<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	n <span style="color: #339933;">=</span> atoi<span style="color: #009900;">&#40;</span>argv<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>throughput <span style="color: #339933;">&lt;</span> <span style="color: #0000dd;">0</span> <span style="color: #339933;">||</span> n <span style="color: #339933;">&lt;</span> <span style="color: #0000dd;">0</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
		usage<span style="color: #009900;">&#40;</span>argv<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">0</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	<span style="color: #009900;">&#125;</span>
&nbsp;
	<span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span>n<span style="color: #339933;">--</span> <span style="color: #339933;">&gt;</span> <span style="color: #0000dd;">0</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
		nextsize <span style="color: #339933;">=</span> getnextsize<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #000066;">printf</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;%lu<span style="color: #000099; font-weight: bold;">\t</span>%lu<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">,</span> time<span style="color: #339933;">,</span> nextsize<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		bytes_sent <span style="color: #339933;">+=</span> nextsize<span style="color: #339933;">;</span>
		time <span style="color: #339933;">=</span> bytes_sent <span style="color: #808080; font-style: italic;">/* bytes */</span> <span style="color: #339933;">/</span> throughput <span style="color: #808080; font-style: italic;">/* bytes per sec */</span><span style="color: #339933;">;</span>
	<span style="color: #009900;">&#125;</span>
	<span style="color: #b1b100;">return</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

]]></content:encoded>
			<wfw:commentRss>http://johannes.jakeapp.com/blog/category/science/200908/email-generator/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Email size distribution: results</title>
		<link>http://johannes.jakeapp.com/blog/category/science/200908/email-size-distribution-results</link>
		<comments>http://johannes.jakeapp.com/blog/category/science/200908/email-size-distribution-results#comments</comments>
		<pubDate>Wed, 26 Aug 2009 14:12:33 +0000</pubDate>
		<dc:creator>JohannesTheLittleScientist</dc:creator>
				<category><![CDATA[science]]></category>

		<guid isPermaLink="false">http://johannes.jakeapp.com/blog/?p=674</guid>
		<description><![CDATA[Following up from a previous post and a post to the qmail and postfix mailing lists:
I analysed the distribution of email sizes. I wasn&#8217;t interested in spam, automated emails or mailing lists, rather user-written mails.
I analyzed the sizes of mails in user inboxes. I am always looking at buckets (how many mails are smaller than [...]]]></description>
			<content:encoded><![CDATA[<p><em>Following up from a <a href="http://johannes.jakeapp.com/blog/category/science/200908/modelling-realistic-email-traffic-email-size-distribution">previous post</a> and a post to the <a href="http://www.gossamer-threads.com/lists/qmail/users/136721?do=post_view_threaded#136721">qmail and postfix mailing lists</a>:</em></p>
<p>I analysed the distribution of email sizes. I wasn&#8217;t interested in spam, automated emails or mailing lists, only in user-written mails.</p>
<p>I analysed the sizes of mails in user inboxes. I always work with buckets (how many mails are smaller than size x, but larger than the previous bucket&#8217;s limit). I used the following upper limits (in bytes): 1024, 1536, 2048, 2560, 3072, 4096, 5120, 6144, 7168, 8192, 10240, 12288, 14336, 16384, 20480, 24576, 28672, 32768, 40960, 49152, 53248, 57344, 65536, 131072, 262144, 524288, 1048576, 10264576 and 1073741824.</p>
<p>I laid out some interpretations about <a href="http://www.gossamer-threads.com/lists/qmail/users/136721?do=post_view_threaded#136721">my inbox</a> as preliminary insight. Now I want to publish the results from the friendly people who answered <a href="http://www.gossamer-threads.com/lists/qmail/users/136719?do=post_view_threaded#136719">my call for data</a>.</p>
<p>Ordered by cardinality (number of mails in dataset):</p>
<pre>mymail                             5069
thomas.schwinge_priv               8759
linux.kernel                      11127
charles_cazabon-nospam            22555
phoemix_harmlesshu                30991
thomas.schwinge_tech_mailinglist 154508
phill_macey_sent                 402158
spamsizes_steff                 1044713
michaelreck                     4086621 (700GB, ~40000 users)
phill_macey_inbox               7999737
phill_macey_all                 8401951 (~300GB)</pre>
<p>Thanks go out to Thomas Schwinge, Charles Cazabon, phoemix, Michael Reck from brauchmer.net, Markus Stumpf and Phill Macey.</p>
<p>The empirical distribution functions follow (with downloadable eps versions). The x-axis (abscissa) is always the size in bytes; the y-axis (ordinate) is the share of mails found with exactly this size. It is calculated by taking the number of mails in a bucket, dividing by the bucket size (the difference to the lower border) and dividing by the total number of mails.</p>
<p>The upper plot is single-log, the lower is double-log.</p>
<p><a href="http://johannes.jakeapp.com/blog/wp-content/uploads/2009/08/overview-density.png"><img class="aligncenter size-full wp-image-675" title="overview-density" src="http://johannes.jakeapp.com/blog/wp-content/uploads/2009/08/overview-density.png" alt="overview-density" width="639" height="475" /></a></p>
<p>And the cumulative distribution. This is interesting because a shift cannot be seen well in the density plots. These are linear plots, the ordinate being the cumulative percentage (what percentage of mails is below size x). The thicker the line, the more data the dataset contains.</p>
<p><a href="http://johannes.jakeapp.com/blog/wp-content/uploads/2009/08/overview-cum.png"><img class="aligncenter size-full wp-image-676" title="overview-cum" src="http://johannes.jakeapp.com/blog/wp-content/uploads/2009/08/overview-cum.png" alt="overview-cum" width="627" height="478" /></a></p>
<p>And logarithmic in size:</p>
<p><a href="http://johannes.jakeapp.com/blog/wp-content/uploads/2009/08/overview-cum-log.png"><img class="aligncenter size-full wp-image-677" title="overview-cum-log" src="http://johannes.jakeapp.com/blog/wp-content/uploads/2009/08/overview-cum-log.png" alt="overview-cum-log" width="631" height="480" /></a></p>
<p>Here are the plots in better quality (as eps): <a href="http://johannes.jakeapp.com/blog/wp-content/uploads/2009/08/overview-density.eps">overview-density</a> <a href="http://johannes.jakeapp.com/blog/wp-content/uploads/2009/08/overview-cum.eps">overview-cum</a> <a href="http://johannes.jakeapp.com/blog/wp-content/uploads/2009/08/overview-cum-log.eps">overview-cum-log</a></p>
<p>Warnings:</p>
<p>There are several sources of error to consider before taking this too literally: (1) It is uncertain how clean the large datasets are (spam, mailing lists and generated mails are all mixed in). (2) Headers differ between receiving and sending. (3) What users send is not the same traffic as what they receive (automated mails).</p>
<p>If you just want to take one number out of this, the median size of a mail is around 30kB (watch the 50% line).</p>
<p>For my purpose of modelling a human writing mails, the dataset phill_macey_sent is fantastic. This is covered in my <a href="http://johannes.jakeapp.com/blog/?p=684">next post</a>.</p>
<p>Comments are always welcome!</p>
]]></content:encoded>
			<wfw:commentRss>http://johannes.jakeapp.com/blog/category/science/200908/email-size-distribution-results/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>MD5 collisions lottery</title>
		<link>http://johannes.jakeapp.com/blog/category/happy-hacking/200908/md5-collisions-lottery</link>
		<comments>http://johannes.jakeapp.com/blog/category/happy-hacking/200908/md5-collisions-lottery#comments</comments>
		<pubDate>Tue, 18 Aug 2009 09:28:47 +0000</pubDate>
		<dc:creator>JohannesTheLittleScientist</dc:creator>
				<category><![CDATA[Happy Hacking]]></category>

		<guid isPermaLink="false">http://johannes.jakeapp.com/blog/?p=650</guid>
		<description><![CDATA[Short summary: I tried to collide md5 hashes from within the 128 bit range itself. I used 12464890370 md5 hashes, but found none.
Long explanation:
md5 is a hashing algorithm that produces 128 bits. It is clear that with arbitrary long input strings, you will have two inputs that yield the same output (this is called a [...]]]></description>
			<content:encoded><![CDATA[<p>Short summary: I tried to collide md5 hashes from within the 128 bit range itself. I used 12464890370 md5 hashes, but found none.<br />
Long explanation:<br />
md5 is a hashing algorithm that produces 128 bits. It is clear that with arbitrarily long input strings, there must be two inputs that yield the same output (this is called a collision). You can also produce a hash of a hash. Seeing md5 as a black box, it might be that a collision is possible within the 128 bit input range.<br />
Starting with 0{rest zero bits} to 255{rest zero bits}, I produced a stream of md5 hashes for each: start -&gt; hash -&gt; hash -&gt; hash -&gt; &#8230;, and stored all of them. Then I checked whether any of these 12402390369 hash values (=186GB) I produced were the same (I used a form of mergesort). The answer is no, none were the same.<br />
My personal conclusion:<br />
If we assume that this sample of 10^10 values is representative, and there had been 1 collision, the share of collisions would be around 10^-10. The sample covers only a tiny share of the 128 bit space; the whole space is around 10^38 big (2^128). One collision in the sample would therefore have meant around 10^28 collisions in the whole space.<br />
Since I didn&#8217;t find any collisions, we can assume there are fewer than 10^28 collisions in the space. My personal guess is that there are around 10. It would be neat to have an md5 identity (where hash(x)=x), but I don&#8217;t think that is possible.<br />
An algorithmic analysis might be more successful than what I did, but it is complicated to <a href="http://www.mscs.dal.ca/~selinger/md5collision/">produce</a> <a href="http://www.links.org/?p=6">collisions</a>, and none have been constructed within these space boundaries.</p>
<p>Another thing I noticed is that disk speed is the limiting factor for producing and storing hashes.</p>
<p>The generating code is basically a loop around the following code, followed by a qsort over the output once it&#8217;s done.</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">MD5_Init<span style="color: #009900;">&#40;</span><span style="color: #339933;">&amp;</span>md5_state<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
MD5_Update<span style="color: #009900;">&#40;</span><span style="color: #339933;">&amp;</span>md5_state<span style="color: #339933;">,</span> message<span style="color: #339933;">,</span> BUF_SIZE<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
MD5_Final<span style="color: #009900;">&#40;</span>message<span style="color: #339933;">,</span> <span style="color: #339933;">&amp;</span>md5_state<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #b1b100;">for</span><span style="color: #009900;">&#40;</span>j <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> j <span style="color: #339933;">&lt;</span> <span style="color: #0000dd;">16</span><span style="color: #339933;">;</span> j<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span>
    fputc<span style="color: #009900;">&#40;</span>message<span style="color: #009900;">&#91;</span>j<span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> f<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

]]></content:encoded>
			<wfw:commentRss>http://johannes.jakeapp.com/blog/category/happy-hacking/200908/md5-collisions-lottery/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Modelling realistic email traffic &#8211; email size distribution</title>
		<link>http://johannes.jakeapp.com/blog/category/science/200908/modelling-realistic-email-traffic-email-size-distribution</link>
		<comments>http://johannes.jakeapp.com/blog/category/science/200908/modelling-realistic-email-traffic-email-size-distribution#comments</comments>
		<pubDate>Sat, 08 Aug 2009 20:32:51 +0000</pubDate>
		<dc:creator>JohannesTheLittleScientist</dc:creator>
				<category><![CDATA[science]]></category>

		<guid isPermaLink="false">http://johannes.jakeapp.com/blog/?p=626</guid>
		<description><![CDATA[In network simulation, traffic has to be simulated. Here I&#8217;m looking at email traffic. Often, CBR (constant bit rate) is used.
For a more realistic email generator I wanted to find out how the emails are distributed in size.
The only post that goes into detail is found at
http://osdir.com/ml/freebsd.devel.net/2002-10/msg00203.html
The resolution of this is quite poor though, especially [...]]]></description>
			<content:encoded><![CDATA[<p>In network simulation, traffic has to be simulated. Here I&#8217;m looking at email traffic. Often, CBR (constant bit rate) is used.</p>
<p>For a more realistic email generator I wanted to find out how the emails are distributed in size.</p>
<p>The only post that goes into detail is found at</p>
<p>http://osdir.com/ml/freebsd.devel.net/2002-10/msg00203.html</p>
<p>The resolution of this is quite poor though; more detail between 0 and 4 KB would be especially interesting.<br />
Nevertheless, below is a sampler that generates random email sizes that follow this distribution.<br />
Between 900 bytes (the lower limit, since headers almost always need that much space) and 2048 bytes, a uniform distribution is used.</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="c" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">/* cumulative probabilities */</span>
<span style="color: #993333;">double</span> probabilities<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">12</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
	<span style="color:#800080;">0.2353</span><span style="color: #339933;">,</span> <span style="color: #808080; font-style: italic;">/* &lt; 2048 = pow(2, 0+11) */</span>
	<span style="color:#800080;">0.4917</span><span style="color: #339933;">,</span> <span style="color: #808080; font-style: italic;">/* &lt; 4096 = pow(2, 1+11) */</span>
	<span style="color:#800080;">0.676</span><span style="color: #339933;">,</span>
	<span style="color:#800080;">0.7958</span><span style="color: #339933;">,</span>
	<span style="color:#800080;">0.8536</span><span style="color: #339933;">,</span>
	<span style="color:#800080;">0.9047</span><span style="color: #339933;">,</span>
	<span style="color:#800080;">0.933</span><span style="color: #339933;">,</span>
	<span style="color:#800080;">0.9574</span><span style="color: #339933;">,</span>
	<span style="color:#800080;">0.9724</span><span style="color: #339933;">,</span>
	<span style="color:#800080;">0.9849</span><span style="color: #339933;">,</span>
	<span style="color:#800080;">0.9996</span><span style="color: #339933;">,</span>
	<span style="color:#800080;">1.0</span>
<span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #993333;">double</span> interpolate<span style="color: #009900;">&#40;</span><span style="color: #993333;">double</span> yleft<span style="color: #339933;">,</span> <span style="color: #993333;">double</span> yright<span style="color: #339933;">,</span> <span style="color: #993333;">double</span> x<span style="color: #339933;">,</span> <span style="color: #993333;">double</span> xleft<span style="color: #339933;">,</span> <span style="color: #993333;">double</span> xright<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
	<span style="color: #993333;">double</span> deltax <span style="color: #339933;">=</span> xright <span style="color: #339933;">-</span> xleft<span style="color: #339933;">;</span>
	<span style="color: #993333;">double</span> deltay <span style="color: #339933;">=</span> yright <span style="color: #339933;">-</span> yleft<span style="color: #339933;">;</span>
	<span style="color: #993333;">double</span> k <span style="color: #339933;">=</span> deltay<span style="color: #339933;">/</span>deltax<span style="color: #339933;">;</span>
	<span style="color: #b1b100;">return</span> <span style="color: #009900;">&#40;</span>x <span style="color: #339933;">-</span> xleft<span style="color: #009900;">&#41;</span> <span style="color: #339933;">*</span> k <span style="color: #339933;">+</span> yleft<span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #993333;">unsigned</span> <span style="color: #993333;">long</span> getnextsize<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
	<span style="color: #993333;">double</span> prob <span style="color: #339933;">=</span> gsl_rng_uniform_pos<span style="color: #009900;">&#40;</span>rng<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	<span style="color: #993333;">int</span> i <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span>
	<span style="color: #993333;">double</span> j<span style="color: #339933;">;</span>
	<span style="color: #993333;">unsigned</span> <span style="color: #993333;">long</span> hardlowerlimit <span style="color: #339933;">=</span> <span style="color: #0000dd;">900</span><span style="color: #339933;">;</span>
	<span style="color: #808080; font-style: italic;">/*unsigned long hardupperlimit = 10*1024*1024;*/</span>
	<span style="color: #993333;">unsigned</span> <span style="color: #993333;">long</span> size<span style="color: #339933;">;</span>
&nbsp;
	<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span>probabilities<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> <span style="color: #339933;">&lt;=</span> prob<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
		i<span style="color: #339933;">++;</span>
	<span style="color: #009900;">&#125;</span>
	assert<span style="color: #009900;">&#40;</span>probabilities<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> <span style="color: #339933;">&gt;</span> prob<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>i <span style="color: #339933;">&gt;</span> <span style="color: #0000dd;">0</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
		j <span style="color: #339933;">=</span> interpolate<span style="color: #009900;">&#40;</span>i<span style="color: #339933;">-</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> i<span style="color: #339933;">,</span> prob<span style="color: #339933;">,</span> probabilities<span style="color: #009900;">&#91;</span>i<span style="color: #339933;">-</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> probabilities<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		size <span style="color: #339933;">=</span> pow<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">2</span><span style="color: #339933;">,</span> j <span style="color: #339933;">+</span> <span style="color: #0000dd;">11</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #b1b100;">if</span><span style="color: #009900;">&#40;</span>size <span style="color: #339933;">&lt;</span> hardlowerlimit<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
			<span style="color: #b1b100;">return</span> getnextsize<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
	<span style="color: #009900;">&#125;</span><span style="color: #b1b100;">else</span><span style="color: #009900;">&#123;</span>
		size <span style="color: #339933;">=</span> get_random<span style="color: #009900;">&#40;</span>hardlowerlimit<span style="color: #339933;">,</span> pow<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">2</span><span style="color: #339933;">,</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">+</span><span style="color: #0000dd;">11</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	<span style="color: #009900;">&#125;</span>
&nbsp;
	<span style="color: #b1b100;">return</span> size<span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>The plot below shows the resulting measured cumulative distribution function, together with a comparison to my own mailbox (which might be atypical, who knows).</p>
<p><a href="http://johannes.jakeapp.com/blog/wp-content/uploads/2009/08/Screenshot-distribution-function.gnumeric-Gnumeric1.png"><img class="aligncenter size-full wp-image-633" title="mailsize-cumulative-distribution" src="http://johannes.jakeapp.com/blog/wp-content/uploads/2009/08/Screenshot-distribution-function.gnumeric-Gnumeric1.png" alt="mailsize-cumulative-distribution" width="510" height="291" /></a></p>
<p>It would be interesting to get more detailed and more recent histograms.<br />
I also considered modelling it with a Maxwell–Boltzmann distribution, but generating that seemed more complex and, on closer inspection, its shape does not really fit.</p>
<p>Full sampler code:<br />
<a href="http://johannes.jakeapp.com/blog/wp-content/uploads/2009/08/emailgen.table.c">emailgen.table.c</a></p>
]]></content:encoded>
			<wfw:commentRss>http://johannes.jakeapp.com/blog/category/science/200908/modelling-realistic-email-traffic-email-size-distribution/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Linear interpolation</title>
		<link>http://johannes.jakeapp.com/blog/category/happy-hacking/200908/linear-interpolation</link>
		<comments>http://johannes.jakeapp.com/blog/category/happy-hacking/200908/linear-interpolation#comments</comments>
		<pubDate>Sat, 08 Aug 2009 20:18:43 +0000</pubDate>
		<dc:creator>JohannesTheLittleScientist</dc:creator>
				<category><![CDATA[Happy Hacking]]></category>

		<guid isPermaLink="false">http://johannes.jakeapp.com/blog/?p=627</guid>
		<description><![CDATA[Given two points (xleft, yleft) and (xright, yright), which mark the lower and upper bounds, this linearly interpolates a y for the input value x.

double interpolate(double yleft, double yright, double x, double xleft, double xright) {
	double deltax = xright - xleft;
	double deltay = yright - yleft;
	double k = deltay/deltax;
	return (x - xleft) * k + yleft;
}

]]></description>
			<content:encoded><![CDATA[<p>Given two points (xleft, yleft) and (xright, yright), which mark the lower and upper bounds, this linearly interpolates a y for the input value x.</p>
<pre><code lang="c">/* Linearly interpolate y at x on the line through (xleft, yleft) and (xright, yright). */
double interpolate(double yleft, double yright, double x, double xleft, double xright) {
	double deltax = xright - xleft;
	double deltay = yright - yleft;
	double k = deltay/deltax;	/* slope; requires xleft != xright */
	return (x - xleft) * k + yleft;
}</code></pre>
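<p>As a usage sketch, here is how the function can map a probability onto a fractional position in a tabulated cumulative distribution. The helper name <code>fractional_index</code> and the table values in the note below are made up for illustration and are not taken from any real measurement. The function is repeated so that the example compiles on its own.</p>

```c
/* Same function as above, repeated so this sketch is self-contained. */
double interpolate(double yleft, double yright, double x, double xleft, double xright) {
	double deltax = xright - xleft;
	double deltay = yright - yleft;
	double k = deltay/deltax;
	return (x - xleft) * k + yleft;
}

/* Hypothetical helper: find the fractional index at which a strictly
 * increasing table cdf[0..n-1] reaches prob. Values at or below the first
 * entry map to 0, values at or above the last entry map to n - 1. */
double fractional_index(const double *cdf, int n, double prob) {
	int i = 0;
	/* advance to the first entry that is larger than prob */
	while (i < n && cdf[i] <= prob)
		i++;
	if (i == 0)
		return 0.0;
	if (i == n)
		return n - 1;
	/* the table indices play the role of y, the probabilities the role of x */
	return interpolate(i - 1, i, prob, cdf[i - 1], cdf[i]);
}
```

<p>With the table {0.25, 0.5, 1.0}, a probability of 0.375 lies halfway between the first two entries, so the result is 0.5.</p>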
]]></content:encoded>
			<wfw:commentRss>http://johannes.jakeapp.com/blog/category/happy-hacking/200908/linear-interpolation/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>NZ &#8211; Second day in University &#8211; Contemporary issues</title>
		<link>http://johannes.jakeapp.com/blog/category/serious-business/200907/nz-second-day-in-university-contemporary-issues</link>
		<comments>http://johannes.jakeapp.com/blog/category/serious-business/200907/nz-second-day-in-university-contemporary-issues#comments</comments>
		<pubDate>Tue, 21 Jul 2009 10:00:52 +0000</pubDate>
		<dc:creator>JohannesTheLittleScientist</dc:creator>
				<category><![CDATA[NZ]]></category>
		<category><![CDATA[serious business]]></category>

		<guid isPermaLink="false">http://johannes.jakeapp.com/blog/?p=596</guid>
		<description><![CDATA[I&#8217;m blogging these posts because a lot of people don&#8217;t know what computer scientists do. &#8220;It has something to do with computers; they sit in front of one and write numbers all day.&#8221; I would find that quite boring.
I think I came up with a good definition of what Software Engineering is about, just [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m blogging these posts because a lot of people don&#8217;t know what computer scientists do. &#8220;It has something to do with computers; they sit in front of one and write numbers all day.&#8221; I would find that quite boring.</p>
<p>I think I came up with a good definition of what Software Engineering is about, just before I got my bachelor&#8217;s degree in Software &amp; Information Engineering. It is about getting into a &#8220;domain&#8221;, for example the car industry, and understanding its terms, processes, work, information and problems, then coming up with a model that describes or solves an issue. The implementation is usually left to programmers. I have left out some details, such as maintenance and the fact that it is an iterative process.</p>
<p>So, &#8220;Contemporary issues&#8221;. Every master&#8217;s degree has such a course, I think. It is about finding out what people are currently researching. Computer science is a broad field, so I&#8217;ll give you two examples of areas I&#8217;m interested in writing the paper about:</p>
<p>a) Representation of mathematical theorems and automatic proofs</p>
<p>Probably very theoretical. Some websites in that area are <a href="http://vdash.org/">vdash.org</a> and <a href="http://metamath.org/">metamath.org</a>. It is about &#8220;teaching&#8221; the computer our understanding of math. This works very well for logic, algebra and geometry. But what about analysis and calculus?</p>
<p>b) Swarm algorithms (maybe with a focus on swarms that are inhomogeneous and not connected all the time)</p>
<p>Swarm algorithms are about hundreds of small devices that can move and talk to each other. There is (normally) no leader; instead, a swarm algorithm running on each device gives rise to the behaviour of the group. My favorite example is a shark hunting through a swarm of fish. Each fish is just one fish, trying to avoid the shark, but you can also see the swarm as a whole showing collective behaviour. The devices might be flying and exploring an area with sensors, telling the others which places to avoid.</p>
<p>The computer is just one side effect of computer science. There are theoretical areas that might not go beyond paper and pencil. Then of course there are all the combinations with other fields (bio, geo, chem), graphics, games, and effects on society as well as business related stuff.</p>
]]></content:encoded>
			<wfw:commentRss>http://johannes.jakeapp.com/blog/category/serious-business/200907/nz-second-day-in-university-contemporary-issues/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Continuous parameter optimization</title>
		<link>http://johannes.jakeapp.com/blog/category/science/200904/continuous-parameter-optimization</link>
		<comments>http://johannes.jakeapp.com/blog/category/science/200904/continuous-parameter-optimization#comments</comments>
		<pubDate>Sun, 26 Apr 2009 13:28:54 +0000</pubDate>
		<dc:creator>JohannesTheLittleScientist</dc:creator>
				<category><![CDATA[science]]></category>

		<guid isPermaLink="false">http://johannes.jakeapp.com/blog/?p=8</guid>
		<description><![CDATA[Over the past days and weeks I got into the topic of &#8220;continuous parameter optimization&#8221;. That is the case when you have an n-dimensional function that returns a value, and you want to find a maximum or minimum. 
In particular I was looking for an algorithm that finds one (local) maximum, and uses the least number [...]]]></description>
			<content:encoded><![CDATA[<p>Over the past days and weeks I got into the topic of &#8220;continuous parameter optimization&#8221;. That is the case when you have an n-dimensional function that returns a value, and you want to find a maximum or minimum.</p>
<p>In particular, I was looking for an algorithm that finds one (local) maximum using the fewest evaluations, because in my scenario a single evaluation of the function takes hours (or days).</p>
<p>So what algorithms can you use?<br />
I developed some variants of the most primitive and naive approach. <br />
You can find the description and source here: <a href="http://github.com/JohannesBuchner/nobrain_optimization/">http://github.com/JohannesBuchner/nobrain_optimization/</a></p>
<p>Later I found CONDOR which is much more sophisticated. <br />
I adopted the project (GPL), so that I can provide the same useful interface, which lets you plug in any program or programming language.<br />
I wrote an introduction here: <a href="http://wiki.github.com/JohannesBuchner/condor_optimization">http://wiki.github.com/JohannesBuchner/condor_optimization</a></p>
<p>What do I need this for?<br />
I am developing a statistical and numerical application at the University of Vienna (Astronomy) with a number of input parameters. It takes hours to compute and returns a value that rates the result. I want to find the optimal values for the input parameters (which are not independent) in the shortest time possible, that is, with the fewest evaluations.</p>
<p>A good benchmark for the performance of an optimization algorithm is <a href="http://en.wikipedia.org/wiki/Rosenbrock_function">Rosenbrock&#8217;s valley</a>.</p>
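<p>For reference, the two-dimensional Rosenbrock function is f(x, y) = (1 - x)^2 + 100 (y - x^2)^2. Its global minimum f(1, 1) = 0 lies at the end of a long, curved valley. Below is a minimal C sketch that also counts evaluations, since the number of evaluations is the cost that matters here. The names <code>rosenbrock</code> and <code>evaluations</code> are chosen for this illustration and are not taken from either project linked above.</p>

```c
/* Number of times the test function has been evaluated so far. */
static unsigned long evaluations = 0;

/* Two-dimensional Rosenbrock function with the usual constants 1 and 100.
 * Global minimum: rosenbrock(1, 1) == 0. */
double rosenbrock(double x, double y) {
	double a = 1.0 - x;
	double b = y - x * x;
	evaluations++;
	return a * a + 100.0 * b * b;
}
```

<p>An optimizer that searches for a maximum, as described above, can be run on the negated value. Comparing <code>evaluations</code> after each run shows how many function calls each algorithm needed.</p>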
]]></content:encoded>
			<wfw:commentRss>http://johannes.jakeapp.com/blog/category/science/200904/continuous-parameter-optimization/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
