Modeling stochastic behavior
Many processes in the world vary a little bit each time they are performed. Setup of machines goes a bit faster or slower, patients taking their medicine takes longer this morning, more products are delivered today, or the quality of the manufactured product degrades due to a tired operator. Modeling such variations is often done with stochastic distributions. A distribution has a mean value and a known shape of variation. By matching the means and the variation shape with data from the system being modeled, an accurate model of the system can be obtained. The language has many stochastic distributions available, this chapter explains how to use them to model a system, and lists a few commonly used distributions. The full list is available in the reference manual at Distributions.
The following fragment illustrates the use of the random distribution to model a dice. Each value of the six-sided dice is equally likely to appear. Every value having the same probability of appearing is a property of the integer uniform distribution, in this case using interval [1, 7)
(inclusive on the left side, exclusive on the right side). The model is:
dist int dice = uniform(1,7);
int x, y;
x = sample dice;
y = sample dice;
writeln("x=%d, y=%d", x, y);
The variable dice
is an integer distribution, meaning that values drawn from the distribution are integer numbers. It is assigned an uniform distribution. A throw of a dice is simulated with the operator sample
. Each time sample
is used, a new sample value is obtained from the distribution. In the fragment the dice is thrown twice, and the values are assigned to the variables x
, and y
.
Distributions
The language provides constant, discrete and continuous distributions. A discrete distribution is a distribution where only specific values can be drawn, for example throwing a dice gives an integer number. A continuous distribution is a distribution where a value from a continuous range can be drawn, for example assembling a product takes a positive amount of time. The constant distributions are discrete distributions that always return the same value. They are useful during the development of the model (see below).
Constant distributions
When developing a model with stochastic behavior, it is hard to verify whether the model behaves correctly, since the stochastic results make it difficult to predict the outcome of experiments. As a result, errors in the model may not be noticed, they hide in the noise of the stochastic results. One solution is to first write a model without stochastic behavior, verify that model, and then extend the model with stochastic sampling. Extending the model with stochastic behavior is however an invasive change that may introduce new errors. These errors are again hard to find due to the difficulties to predict the outcome of an experiment. The constant distributions aim to narrow the gap by reducing the amount of changes that need to be done after verification.
With constant distributions, a stochastic model with sampling of distributions is developed, but the stochastic behavior is eliminated by temporarily using constant distributions. The model performs stochastic sampling of values, but with predictable outcome, and thus with predictable experimental results, making verification easier. After verifying the model, the constant distributions are replaced with the distributions that fit the mean value and variation pattern of the modeled system, giving a model with stochastic behavior. Changing the used distributions is however much less invasive, making it less likely to introduce new errors at this stage in the development of the model.
Constant distributions produce the same value v
with every call of sample
. There is one constant distribution for each type of sample value:
constant(bool v)
, abool
distribution.constant(int v)
, anint
distribution.constant(real v)
, areal
distribution.
An example with a constant distribution is:
dist int u = constant(7);
This distribution returns the integer value 7
with each sample u
operation.
Discrete distributions
Discrete distributions return values from a finite fixed set of possible values as answer. In Chi, there is one distribution that returns a boolean when sampled, and there are several discrete distributions that return an integer number.
dist
bool
bernoulli(real p)
Outcome of an experiment with chance
p
(0 <= p <= 1)
.Range {false, true}
Mean p
(wherefalse
is interpreted as0
, andtrue
is interpreted as1
)Variance 1 - p
(wherefalse
is interpreted as0
, andtrue
is interpreted as1
)See also Bernoulli(p), [Law (2007)], page 302
dist
int
uniform(int a, b)
Integer uniform distribution from
a
tob
excluding the upper bound.Range {a, a+1, ..., b-1}
Mean (a + b - 1) / 2
Variance ((b - a)\^2 - 1) / 12
See also DU(a, b - 1), [Law (2007)], page 303
Continuous distributions
Continuous distributions return a value from a continuous range.
dist
real
uniform(real a, b)
Real uniform distribution from
a
tob
, excluding the upper bound.Range [a, b)
Mean (a + b) / 2
Variance (b - a)^2 / 12
See also U(a,b), [Law (2007)], page 282, except that distribution has an inclusive upper bound.
dist
real
gamma(real a, b)
Gamma distribution, with shape parameter
a > 0
and scale parameterb > 0
.Mean a * b
Variance a * b^2
Simulating stochastic behavior
In this chapter, the mathematical notion of stochastic distribution is used to describe how to model stochastic behavior. Simulating a model with stochastic behavior at a computer is however not stochastic at all. Computer systems are deterministic machines, and have no notion of varying results.
A (pseudo-)random number generator is used to create stochastic results instead. It starts with an initial seed, an integer number (you can give one at the start of the simulation). From this seed, a function creates a stream of 'random' values. When looking at the values there does not seem to be any pattern. It is not truly random however. Using the same seed again gives exactly the same stream of numbers. This is the reason to call the function a pseudo-random number generator (a true random number generator would never produce the exact same stream of numbers). A sample of a distribution uses one or more numbers from the stream to compute its value. The value of the initial seed thus decides the value of all samples drawn in the simulation. By default, a different seed is used each time you run a simulation (leading to slightly different results each time). You can also explicitly state what seed you want to use when running a model, see Compile and simulate. At the end of the simulation, the used initial seed of that simulation is printed for reference purposes.
While doing a stochastic simulation study, performing several experiments with the same initial seed invalidates the results, as it is equivalent to copying the outcome of a single experiment a number of times. On the other hand, when looking for the cause of a bug in the model, performing the exact same experiment is useful as outcomes of previous experiments should match exactly.
Exercises
According to the Chi reference manual, for a
gamma
distribution with parameters(a, b)
, the mean equalsa * b
.Use a Chi specification to verify whether this is true for at least 3 different pairs of
a
andb
.How many samples from the distribution are approximately required to determine the mean up to three decimals accurate?
Estimate the mean μ and variance σ2 of a triangular distribution
triangle(1.0, 2.0, 5.0)
by simulating 1000 samples. Recall that the variance σ2 of n samples can be calculated by a function like:func real variance(list real samples, real avg): real v; for x in samples: v = v + (x - avg)^2; end return v / (size(samples) - 1) end
We would like to build a small game, called Higher or Lower. The computer picks a random integer number between 1 and 14. The player then has to predict whether the next number will be higher or lower. The computer picks the next random number and compares the new number with the previous one. If the player guesses right his score is doubled. If the player guesses wrong, he looses all and the game is over. Try the following specification:
model HoL(): dist int u = uniform(1, 15); int sc = 1; bool c = true; int new, oldval; string s; new = sample u; write("Your score is %d\n", sc); write("The computer drew %d\n", new); while c: writeln("(h)igher or (l)ower:\n"); s = read(string); oldval = new; new = sample u; write("The computer drew %d\n", new); if new == oldval: c = false; else: c = (new > oldval) == (s == "h"); end; if c: sc = 2 * sc; else: sc = 0; end; write("Your score is %d\n", sc) end; write("GAME OVER...\n") end
What is the begin score?
What is the maximum end score?
What happens, when the drawn sample is equal to the previous drawn sample?
Extend this game specification with the possibility to stop.