Binned fit#
Imagine we have the following data distribution over \(x\):
import numpy as np
sample_size = 1_000
rng = np.random.default_rng(seed=0)
x_distribution = rng.power(3, size=sample_size)
We sample the distribution into a histogram with matplotlib.pyplot.hist(). The bin edges will serve as \(x\)-values and the bin heights will be the observed \(y\)-values.
%config InlineBackend.figure_formats = ['svg']
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 5))
ax.set_xlabel("$x$")
n_bins = 50
bin_values, bin_edges, _ = ax.hist(x_distribution, bins=n_bins)
plt.show()
x_values = (bin_edges[1:] + bin_edges[:-1]) / 2
y_values = bin_values
The data distribution has been generated by numpy.random.power() and can therefore be described by the following expression:
import sympy as sp
n, a, x = sp.symbols("n a x")
expression = n * x ** (a - 1)
expression
We would like to find which values for \(a, n\) describe this distribution best. For fast computations, we first formulate this expression as a ParametrizedFunction with numpy as computational backend (numpy is fast enough for the small data samples with which we are working here). We also provide a first guess for the values of \(a\) and \(n\).
from tensorwaves.function.sympy import create_parametrized_function
function = create_parametrized_function(
expression,
parameters={a: 2, n: 15},
backend="numpy",
)
initial_parameters = function.parameters
This binned distribution lends itself well to be optimized with a ChiSquared estimator.
from tensorwaves.estimator import ChiSquared
from tensorwaves.optimizer import Minuit2
estimator = ChiSquared(
function,
domain={"x": x_values},
observed_values=y_values,
backend="numpy",
)
optimizer = Minuit2()
fit_result = optimizer.optimize(estimator, initial_parameters)
fit_result
FitResult(
minimum_valid=True,
execution_time=0.0011959075927734375,
function_calls=62,
estimator_value=799.3094411131793,
parameter_values={ 'a': 3.1589937147180245, 'n': 58.13793430101528,
},
parameter_errors={ 'a': 0.020006328065594166, 'n': 0.31140874864064955,
},
)
The optimized value of \(a\) indeed lies close to the value of \(a=3\) with which we generated the distribution.
Also note that the integral over the expression with these optimized parameter values is close to the surface of the bins:
bin_width = bin_edges[1] - bin_edges[0]
surface = np.sum(bin_values) * bin_width
surface
np.float64(18.424306915235626)