# Binned fit

Imagine we have the following data distribution over $x$:

```python
import numpy as np

sample_size = 1_000
rng = np.random.default_rng(seed=0)
x_distribution = rng.power(3, size=sample_size)
```
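As a quick sanity check on the generated sample (a sketch that is independent of the fit below): the power distribution with exponent $a$ has PDF $a x^{a-1}$ on $[0, 1]$, so its mean is $a / (a + 1)$, which is $0.75$ for $a = 3$. The sample mean should lie close to that value:

```python
import numpy as np

# rng.power(a) draws from the distribution with PDF a * x**(a - 1) on [0, 1],
# whose analytic mean is a / (a + 1)
rng = np.random.default_rng(seed=0)
sample = rng.power(3, size=100_000)
analytic_mean = 3 / (3 + 1)  # 0.75
sample.mean()  # close to 0.75
```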


We sample the distribution into a histogram with matplotlib.pyplot.hist(). The bin centers will serve as $x$-values and the bin heights as the observed $y$-values.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
ax.set_xlabel("$x$")
n_bins = 50
bin_values, bin_edges, _ = ax.hist(x_distribution, bins=n_bins)
plt.show()

x_values = (bin_edges[1:] + bin_edges[:-1]) / 2
y_values = bin_values
```
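If no plot is needed, the same bin contents and edges can be obtained directly with numpy.histogram(), which matplotlib's hist() uses under the hood:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x_distribution = rng.power(3, size=1_000)
# numpy.histogram returns the same bin values and edges as ax.hist()
bin_values, bin_edges = np.histogram(x_distribution, bins=50)
x_values = (bin_edges[1:] + bin_edges[:-1]) / 2  # bin centers
y_values = bin_values
```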


The data distribution has been generated with numpy.random.Generator.power() and can therefore be described by the following expression:

```python
import sympy as sp

n, a, x = sp.symbols("n a x")
expression = n * x ** (a - 1)
expression
```

$\displaystyle n x^{a - 1}$

We would like to find the values of $a$ and $n$ that best describe this distribution. For fast computations, we first formulate this expression as a ParametrizedFunction with numpy as computational backend (numpy is fast enough for the small data samples with which we are working here). We also provide a first guess for the values of $a$ and $n$.

```python
from tensorwaves.function.sympy import create_parametrized_function

function = create_parametrized_function(
    expression,
    parameters={a: 2, n: 15},
    backend="numpy",
)
initial_parameters = function.parameters
```
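Conceptually, such a parametrized function compiles the sympy expression into a numpy-backed callable. This can be mimicked with sympy.lambdify (a simplified sketch for illustration, not the actual tensorwaves implementation):

```python
import numpy as np
import sympy as sp

n, a, x = sp.symbols("n a x")
expression = n * x ** (a - 1)
# lambdify compiles the sympy expression into a plain numpy function
numpy_function = sp.lambdify([x, a, n], expression, modules="numpy")
numpy_function(np.array([0.25, 0.5, 1.0]), 3, 1)  # with n=1, a=3 this is x**2
```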

```python
function.update_parameters(initial_parameters)
fig, ax = plt.subplots(figsize=(8, 5))
ax.set_xlabel("$x$")
ax.hist(x_distribution, bins=n_bins, label="Data distribution")
ax.plot(x_values, function({"x": x_values}), label="Initial fit model", c="red")
ax.legend()
plt.show()
```

This binned distribution lends itself well to optimization with a ChiSquared estimator.

```python
from tensorwaves.estimator import ChiSquared
from tensorwaves.optimizer import Minuit2

estimator = ChiSquared(
    function,
    domain={"x": x_values},
    observed_values=y_values,
    backend="numpy",
)
optimizer = Minuit2()
fit_result = optimizer.optimize(estimator, initial_parameters)
fit_result
```

```
FitResult(
 minimum_valid=True,
 execution_time=0.09833741188049316,
 function_calls=62,
 estimator_value=799.3094411131797,
 parameter_values={
  'a': 3.158993714718582,
  'n': 58.13793430102732,
 },
 parameter_errors={
  'a': 0.020006328098968552,
  'n': 0.31140874922374606,
 },
)
```
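The estimator value reported above is, in essence, a sum of squared residuals between the model output and the observed bin heights. A minimal numpy sketch of such a chi-squared quantity (unweighted here; the actual ChiSquared estimator also accepts weights):

```python
import numpy as np

def chi_squared(predicted, observed, weights=None):
    # Sum of (optionally weighted) squared residuals
    if weights is None:
        weights = np.ones_like(observed)
    return np.sum(weights * (predicted - observed) ** 2)

chi_squared(np.array([1.0, 2.0]), np.array([1.5, 2.5]))  # 0.5**2 + 0.5**2 = 0.5
```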


The optimized value of $a$ indeed lies close to the value of $a=3$ with which we generated the distribution.

```python
function.update_parameters(fit_result.parameter_values)
fig, ax = plt.subplots(figsize=(8, 5))
ax.set_xlabel("$x$")
ax.hist(x_distribution, bins=n_bins, label="Data distribution")
ax.plot(x_values, function({"x": x_values}), label="Optimized model", c="red")
ax.legend()
plt.show()
```

Also note that the integral over the expression with these optimized parameter values is close to the total surface of the bins:

```python
bin_width = bin_edges[1] - bin_edges[0]  # bins are uniform
surface = bin_values.sum() * bin_width
surface
```

```
18.424306915235626
```

```python
from IPython.display import Math

substitutions = {
    a: fit_result.parameter_values["a"],
    n: fit_result.parameter_values["n"],
}
integral = sp.Integral(expression, (x, 0, 1))
evaluated_integral = integral.subs(substitutions).doit()
latex = sp.multiline_latex(integral, evaluated_integral, environment="eqnarray")
Math(latex)
```

$\displaystyle \begin{eqnarray} \int\limits_{0}^{1} n x^{a - 1}\, dx & = & 18.4039411126864 \end{eqnarray}$
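The same number follows by hand: $\int_0^1 n x^{a-1}\,dx = n/a$, so the fitted normalization can be checked without symbolic integration (parameter values copied from the fit result above):

```python
# Values copied from the fit result above
a_fit = 3.158993714718582
n_fit = 58.13793430102732
n_fit / a_fit  # roughly 18.404, matching the sympy integral
```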