Extending Olshausen's classical SparseNet
-
In a previous notebook, we tried to reproduce the learning strategy specified in the framework of the SparseNet algorithm from Bruno Olshausen. It allows natural image patches to be coded efficiently by constraining the code to be sparse. In particular, we saw that in order to optimize competition, it is important to control cooperation, and we implemented a heuristic to do just this.
-
In this notebook, we provide an extension to the SparseNet algorithm. We will study how homeostasis (cooperation) may be an essential ingredient for this algorithm to work on a winner-take-all basis (competition). This extension has been published as Perrinet, Neural Computation (2010) (see https://laurentperrinet.github.io/publication/perrinet-10-shl ):
@article{Perrinet10shl,
    Title = {Role of homeostasis in learning sparse representations},
    Author = {Perrinet, Laurent U.},
    Journal = {Neural Computation},
    Year = {2010},
    Doi = {10.1162/neco.2010.05-08-795},
    Keywords = {Neural population coding, Unsupervised learning, Statistics of natural images, Simple cell receptive fields, Sparse Hebbian Learning, Adaptive Matching Pursuit, Cooperative Homeostasis, Competition-Optimized Matching Pursuit},
    Month = {July},
    Number = {7},
    Url = {https://laurentperrinet.github.io/publication/perrinet-10-shl},
    Volume = {22},
}
This is joint work with Victor Boutin.
- Find updates at https://laurentperrinet.github.io/publication/perrinet-19-hulk and https://github.com/bicv/SHL_scripts
dictionary learning on natural images
Here, we reproduce the dictionary learning obtained with the SparseNet algorithm, while using the Matching Pursuit algorithm for the sparse coding step.
import matplotlib.pyplot as plt
%matplotlib inline
#%config InlineBackend.figure_format='svg'
import numpy as np
np.set_printoptions(precision=2, suppress=True)
import pandas as pd
import seaborn as sns
%load_ext autoreload
%autoreload 2
from shl_scripts import SHL
database = 'database/'
# alternative settings used while debugging (only the last assignment is effective)
# DEBUG_DOWNSCALE, verbose = 1, 100
# DEBUG_DOWNSCALE, verbose = 10, 100
DEBUG_DOWNSCALE, verbose = 1, 0
matname = 'homeo'
shl = SHL(database=database, DEBUG_DOWNSCALE=DEBUG_DOWNSCALE, verbose=verbose)
dico = shl.learn_dico(matname=matname)
fig, ax = shl.show_dico(dico)
fig.show()
fig, ax = shl.plot_variance(dico)
fig.show()
fig, ax = shl.plot_variance_histogram(dico)
fig.show()
data = shl.get_data()
np.sum(data**2, axis=1).shape  # check: one energy value per patch
for eta in np.logspace(-4, 0, int(15/(DEBUG_DOWNSCALE)**.3), base=10, endpoint=False):
    shl = SHL(DEBUG_DOWNSCALE=DEBUG_DOWNSCALE, eta=eta, verbose=verbose)
    dico = shl.learn_dico()
    fig, ax = shl.show_dico(dico, title='eta={}'.format(eta))
    fig.show()
    fig, ax = shl.plot_variance(dico)
    fig.show()
    fig, ax = shl.plot_variance_histogram(dico)
    fig.show()
a quick diagnostic of what is wrong
A key assumption in the previous code lies in the heuristic used to control how elements are chosen. Basically, we set the norm of every dictionary element to the inverse of an estimate of its mean variance, such that a high variance yields a lower norm and thus a lower corresponding coefficient: that element then becomes less likely to be selected again.
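As a reminder, a schematic sketch of this heuristic is given below (with hypothetical names, not the actual shl_scripts implementation): a running estimate of the variance of each coefficient is updated at every learning step, and the norm of each dictionary element is set inversely to that estimate.
# schematic sketch of the norm heuristic (hypothetical names, not the shl_scripts code)
def homeostatic_rescale(dictionary, variance_estimate, code, gain_rate=0.01):
    # dictionary: (n_dictionary, n_pixels), code: (n_samples, n_dictionary)
    # update the running estimate of the variance of each element's coefficients
    variance_estimate = (1 - gain_rate) * variance_estimate + gain_rate * np.mean(code**2, axis=0)
    # a higher variance yields a lower norm, hence lower coefficients at the next coding step
    norm = variance_estimate.mean() / variance_estimate
    dictionary = dictionary / np.sqrt(np.sum(dictionary**2, axis=1))[:, np.newaxis]
    return dictionary * norm[:, np.newaxis], variance_estimate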
data = shl.get_data()
code = dico.transform(data, algorithm='mp', l0_sparseness=shl.n_dictionary)
Z = np.mean(code**2)
fig = plt.figure(figsize=(12, 4))
ax = fig.add_subplot(111)
ax.bar(np.arange(shl.n_dictionary), np.mean(code**2/Z, axis=0))#, yerr=np.std(code**2/Z, axis=0))
ax.set_title('Variance of coefficients')
ax.set_ylabel('Variance')
ax.set_xlabel('#')
ax.axis('tight')
fig.show()
fig = plt.figure(figsize=(6, 4))
ax = fig.add_subplot(111)
data = pd.DataFrame(np.mean(code**2/Z, axis=0), columns=['Variance'])
with sns.axes_style("white"):
    ax = sns.distplot(data['Variance'], kde=False) #, kde_kws={'clip':(0., 5.)})
    ax.set_title('distribution of the mean variance of coefficients')
    ax.set_ylabel('pdf')
fig.show()
In particular, elements with a high variance are more likely features that have actually been learned, while those with a lower variance correspond to textures closer to the initial state of the dictionary. This is shown by ordering the filters in the dictionary (reading from the top row, left to right, then down):
sorted_idx = np.argsort(np.mean(code**2/Z, axis=0))
dico.dictionary = dico.dictionary[sorted_idx, :]
fig, ax = shl.show_dico(dico)
fig.show()
print(dico.dictionary.shape, data.shape)
data = shl.get_data()
#dico.set_params(transform_algorithm='mp', l0_sparseness=shl.n_dictionary)
#code = dico.transform(data)
coef = np.dot(dico.dictionary, data.T).T / np.sum(dico.dictionary**2, axis=1)[np.newaxis, :]
#Z = np.mean(code**2)
fig = plt.figure(figsize=(12, 4))
ax = fig.add_subplot(111)
ax.bar(np.arange(shl.n_dictionary), np.mean(coef**2, axis=0))#, yerr=np.std(code**2/Z, axis=0))
ax.set_title('Variance of linear coefficients')
ax.set_ylabel('Variance')
ax.set_xlabel('#')
ax.axis('tight')
fig.show()
By the scaling law of a linear filter, one may normalize the norm of the filters relative to the mean deviation (the square root of the variance) measured on the input. By doing so, the observed variance of the linear coefficients is normalized to the same value for all elements. This is what the heuristic implemented in the previous notebook achieves.
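As a quick numerical check (an illustration only, using the coef array computed above), rescaling each column of linear coefficients by its root-mean-square equalizes the variances exactly; the heuristic of the previous notebook obtains the same effect by adjusting the norms of the filters instead:
coef_normed = coef / np.sqrt(np.mean(coef**2, axis=0))[np.newaxis, :]
print(np.mean(coef_normed**2, axis=0))  # all equal to 1 by construction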
However, the learning rate (gain_rate) of this heuristic may be hard to tune. Most importantly, this heuristic assumes that all components only differ by a scaling factor (for instance, that the first elements to be learned will have a higher variance and that their whole distribution is simply rescaled), and thus that they all share the same shape of distribution. This assumption misses an important aspect of this unsupervised learning scheme.
Indeed, when a feature emerges during learning, its response is not merely rescaled: its distribution also becomes more kurtotic. This may be understood using the central limit theorem: when mixing signals, as when computing the average of random variables, the distribution converges to a Gaussian; conversely, a well-tuned feature responds strongly to a few specific patches and weakly to all others, so its coefficients become sparser, hence more kurtotic.
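A toy illustration of this argument (synthetic data, not part of the learning above): averaging many independent signals yields a nearly Gaussian, low-kurtosis variable, whereas a selective, rarely active response stays strongly kurtotic.
# toy illustration on synthetic data: kurtosis of a mixed vs. a selective response
def excess_kurtosis(x):
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2)**2 - 3.  # zero for a Gaussian
rng = np.random.RandomState(42)
mixed = rng.laplace(size=(10000, 50)).mean(axis=1)  # mixing many signals -> CLT -> Gaussian-like
selective = rng.laplace(size=10000) * (rng.rand(10000) < .05)  # rare, feature-specific activations
print('excess kurtosis: mixed = {:.2f}, selective = {:.2f}'.format(excess_kurtosis(mixed), excess_kurtosis(selective)))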
Let's show the distribution of the coefficients for the two dictionary elements with respectively the lowest (left) and highest (right) variance:
sorted_idx = np.argsort(np.mean(coef**2, axis=0))
# plot distribution of most kurtotic vs least
fig = plt.figure(figsize=(12, 4))
for i, idx in enumerate([0, -1]):
    #data = pd.DataFrame(code[:, sorted_idx[idx]], columns=['coefficient'])
    with sns.axes_style("white"):
        ax = fig.add_subplot(1, 2, i+1)
        n_bins = 30
        n, bins = np.histogram((np.abs(coef[:, sorted_idx[idx]])), n_bins)
        ax.bar(bins[:-1], np.log2(n), width=(np.abs(coef[:, sorted_idx[idx]].max()))/n_bins)
        #ax = sns.distplot(data['coefficient'])#, kde_kws={'clip':(0., 5.)})
        ax.set_title('Variance = {}'.format(np.mean(coef[:, sorted_idx[idx]]**2)))
        ax.set_ylabel('log2-density')
        ax.set_xlabel('absolute coefficient')
fig.suptitle('distribution of the value of coefficients', fontsize=16)
fig.show()
To circumvent this problem, instead of changing the learning scheme, it is possible to change the coding procedure such that every dictionary element is a priori selected with the same probability. To achieve this in a non-parametric fashion, one may use histogram equalization.
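As a toy check of this idea (synthetic data, for illustration only): mapping samples through their own empirical cumulative distribution function yields uniformly distributed values, whatever the original distribution.
# toy check: the empirical cdf of any variable is uniformly distributed
rng = np.random.RandomState(0)
x = np.abs(rng.laplace(size=10000))  # a kurtotic variable
ranks = np.argsort(np.argsort(x))  # rank of each sample
z = ranks / float(x.size - 1)  # empirical cdf values in [0, 1]
print(np.histogram(z, bins=10)[0])  # approximately flat counts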
n_samples, n_dictionary = coef.shape
np.sort(np.abs(coef[:(n_samples//2), sorted_idx[idx]])).shape
#print(-np.sort(-np.abs(coef[:(n_samples//2), sorted_idx[idx]])))
#print(-np.sort(-np.abs(coef[:, sorted_idx[idx]])))
# plot distribution of most kurtotic vs least
fig = plt.figure(figsize=(12, 4))
for i, idx in enumerate([0, -1]):
    with sns.axes_style("white"):
        ax = fig.add_subplot(1, 2, i+1)
        ax.semilogx(-np.sort(-np.abs(coef[:(n_samples//2), sorted_idx[idx]])))
        ax.set_title('Variance = {}'.format(np.mean(coef[:, sorted_idx[idx]]**2)))
        ax.set_ylabel('value')
        ax.set_xlabel('rank')
        ax.axis('tight')
fig.suptitle('distribution of the value of linear coefficients', fontsize=16)
fig.show()
# what we need is some sort of histogram normalization
def histeq(abs_code, mod):
    # use linear interpolation of the cdf to find z-values
    # use the fact that the sorted coefficients give the inverse cdf
    # moreover, negate both arguments as a trick to ensure that np.interp gets an increasing sequence of x-coordinates
    z = np.interp(-abs_code, -mod, np.linspace(0, 1., mod.size, endpoint=True))
    return z
fig = plt.figure(figsize=(12, 4))
for i, idx in enumerate([0, -1]):
    mod = -np.sort(-np.abs(coef[:(n_samples//2), sorted_idx[idx]]))
    # learn distribution of z-values on the first half of the data
    z = histeq(np.abs(coef[(n_samples//2):, sorted_idx[idx]]), mod)
    #z = histeq(np.abs(coef[:(n_samples//2), sorted_idx[idx]]), mod) # to check it is exact
    # plot distribution of z-values on the second half of the data
    with sns.axes_style("white"):
        ax = fig.add_subplot(1, 2, i+1)
        n_bins = 30
        n, bins = np.histogram(z, n_bins)
        ax.bar(bins[:-1], n, width=.5/n_bins)
        #ax.scatter(np.linspace(0, 1., z.size, endpoint=True), z)
        #ax.set_title('Variance = {}'.format(np.mean(code[:, sorted_idx[idx]]**2/Z, axis=0)))
        ax.set_ylabel('density')
        ax.set_xlabel('z-score')
        ax.axis('tight')
fig.suptitle('distribution of the value of z-scores', fontsize=16)
fig.show()
To learn this modulation function, we may simply estimate it online, starting from a flat one:
mod = np.dot(np.linspace(1., 0, n_samples, endpoint=True)[:, np.newaxis], np.ones((1, n_dictionary)))
print(mod.shape)
plt.plot(mod[:, 0])
print(coef.shape)
print(np.sort(np.abs(coef), axis=0).shape)
# update the modulation function as a moving average of the sorted absolute coefficients
mod_ = -np.sort(-np.abs(coef), axis=0)
gain_rate = .1
mod = (1 - gain_rate)*mod + gain_rate * mod_
plt.semilogx(mod[:, 0])
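To make the link with the coding step explicit, here is a schematic sketch (not the shl_scripts implementation) of how such a modulation function can drive the selection in a Matching Pursuit step: instead of picking the element with the largest raw coefficient, we pick the element whose coefficient ranks highest within its own distribution. With the convention of the histeq function above, a large coefficient maps to a small z, so the winner is the argmin.
# schematic sketch of a competition-optimized selection step (illustration only)
def select_with_homeostasis(coef_row, mod):
    # coef_row: linear coefficients of one patch, shape (n_dictionary,)
    # mod: per-element modulation functions (sorted absolute coefficients), shape (n_points, n_dictionary)
    z = np.array([histeq(np.abs(coef_row[k]), mod[:, k]) for k in range(coef_row.size)])
    return np.argmin(z)  # smallest z = coefficient most improbable under its own distribution
# for one patch, this returns the index selected at the first Matching Pursuit iteration:
print(select_with_homeostasis(coef[0], mod))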
shl = SHL(DEBUG_DOWNSCALE=DEBUG_DOWNSCALE)
dico = shl.learn_dico(learning_algorithm='comp', l0_sparseness=shl.n_dictionary, gain_rate=0.001, verbose=verbose)
fig, ax = shl.show_dico(dico)
fig.show()
data = shl.get_data()
#dico.set_params(transform_algorithm='mp', l0_sparseness=shl.n_dictionary)
#code = dico.transform(data)
coef = np.dot(dico.dictionary, data.T).T / np.sum(dico.dictionary**2, axis=1)[np.newaxis, :]
#Z = np.mean(code**2)
fig = plt.figure(figsize=(12, 4))
ax = fig.add_subplot(111)
ax.bar(np.arange(shl.n_dictionary), np.mean(coef**2, axis=0))#, yerr=np.std(code**2/Z, axis=0))
ax.set_title('Variance of linear coefficients')
ax.set_ylabel('Variance')
ax.set_xlabel('#')
ax.axis('tight')
fig.show()
# plot distribution of most kurtotic vs least
sorted_idx = np.argsort(np.mean(coef**2, axis=0))
fig = plt.figure(figsize=(12, 4))
for i, idx in enumerate([0, -1]):
    #data = pd.DataFrame(code[:, sorted_idx[idx]], columns=['coefficient'])
    with sns.axes_style("white"):
        ax = fig.add_subplot(1, 2, i+1)
        n_bins = 30
        n, bins = np.histogram((np.abs(coef[:, sorted_idx[idx]])), n_bins)
        ax.bar(bins[:-1], np.log2(n), width=(np.abs(coef[:, sorted_idx[idx]].max()))/n_bins)
        #ax = sns.distplot(data['coefficient'])#, kde_kws={'clip':(0., 5.)})
        ax.set_title('Variance = {}'.format(np.mean(coef[:, sorted_idx[idx]]**2)))
        ax.set_ylabel('log2-density')
        ax.set_xlabel('absolute coefficient')
fig.suptitle('distribution of the value of coefficients', fontsize=16)
fig.show()
# quick estimation of the z-scores
#z = np.interp(-np.abs(coef[(n_samples//2):, sorted_idx[idx]]), -mod, np.linspace(0, 1., mod.size, endpoint=True))
from shl_scripts import SHL
DEBUG_DOWNSCALE=1
for gain_rate in np.logspace(-4, 0, 15, base=10):
    shl = SHL(DEBUG_DOWNSCALE=DEBUG_DOWNSCALE)
    dico = shl.learn_dico(learning_algorithm='comp', l0_sparseness=10, gain_rate=gain_rate)
    fig, ax = shl.show_dico(dico)
    fig.show()
    data = shl.get_data()
    dico.set_params(transform_algorithm='mp', l0_sparseness=shl.n_dictionary)
    code = dico.transform(data)
    n_samples, n_dictionary = code.shape
    Z = np.mean(code**2)
    fig = plt.figure(figsize=(12, 4))
    ax = fig.add_subplot(111)
    ax.bar(np.arange(shl.n_dictionary), np.mean(code**2, axis=0)/Z)
    ax.set_title('Variance of coefficients - gain {}'.format(gain_rate))
    ax.set_ylabel('Variance')
    ax.set_xlabel('#')
    ax.axis('tight')
    fig.show()
    fig = plt.figure(figsize=(6, 4))
    ax = fig.add_subplot(111)
    data = pd.DataFrame(np.mean(code**2, axis=0)/Z, columns=['Variance'])
    with sns.axes_style("white"):
        ax = sns.distplot(data['Variance'], kde_kws={'clip':(0., 5.)})
        ax.set_title('distribution of the mean variance of coefficients')
        ax.set_ylabel('pdf')
    fig.show()
great, let's try that new version of the classical algorithm
TODO...