Discrete Analysis (Nov 2023)

Low-degree learning and the metric entropy of polynomials

  • Alexandros Eskenazis
  • Paata Ivanisvili
  • Lauritz Streck

Abstract


Low-degree learning and the metric entropy of polynomials, Discrete Analysis 2023:17, 23 pp.

This paper is a follow-up to a paper by the first two authors on the general problem of designing algorithms that can efficiently learn functions defined on the Boolean cube $\{-1,1\}^n$, given suitable conditions on the functions. Given a function $f$, the rough idea is to take a series of sample points, evaluate $f$ at those points, and guess a function $g$ that is close to $f$. Obviously, without strong information about $f$, nothing non-trivial can be done. If, for example, the values of $f$ are chosen uniformly and independently from $[-1,1]$, then one learns nothing from the sample values other than the sample values themselves. However, many functions that one actually wants to learn are far from arbitrary, so one can try to identify suitable properties that make them amenable to learning from a far smaller number of samples.

One set-up considered by the authors is where $f:\{-1,1\}^n\to[-1,1]$ is a low-degree polynomial and the samples are taken randomly. Since $x^2=1$ when $x\in\{-1,1\}$, every polynomial function $f$ of degree $d$ defined on the Boolean cube can be written in the form $f(x)=\sum_{|A|\leq d}\lambda_A\prod_{i\in A}x_i$, where $A$ ranges over subsets of $\{1,2,\dots,n\}$. The functions $x\mapsto\prod_{i\in A}x_i$ are the Walsh functions, and the $\lambda_A$ are therefore the Fourier coefficients of $f$ that correspond to sets $A$ of size at most $d$. The size of $A$ is often referred to as the degree of the Walsh function, and one often refers informally to "the Fourier coefficients of degree at most $d$" or "the degree-$d$ part of the Fourier expansion".

From simple dimension considerations, one sees easily that one cannot hope to determine a degree-$d$ polynomial $f$ exactly from fewer than $\sum_{r=0}^d\binom nr$ samples (even if it is required to take values in $[-1,1]$), so the aim is to determine it _approximately_. In this paper, as in the previous paper of the first two authors, the approximation considered is an $L_2$ one. Thus, the aim is to find an algorithm that does the following: given a function $f:\{-1,1\}^n\to[-1,1]$ of degree at most $d$, it takes $m$ random samples $(x,f(x))$ and outputs a function $g$ such that $\|g-f\|_2\leq\epsilon$ with high probability.
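To make the Fourier expansion concrete, here is a minimal Python sketch (not taken from the paper; the names `walsh` and `fourier_coefficient` are illustrative) that recovers the coefficients $\lambda_A$ of a function on the Boolean cube by averaging against the Walsh functions, using the fact that the Walsh functions form an orthonormal basis with respect to the uniform measure on $\{-1,1\}^n$.

```python
import itertools
import numpy as np

def walsh(A, x):
    """Walsh function chi_A(x) = prod_{i in A} x_i (equal to 1 when A is empty)."""
    return np.prod([x[i] for i in A]) if A else 1.0

def fourier_coefficient(f, A, n):
    """lambda_A = 2^{-n} * sum_{x in {-1,1}^n} f(x) * chi_A(x); this recovers
    the coefficient because the Walsh functions are orthonormal."""
    cube = itertools.product([-1, 1], repeat=n)
    return sum(f(x) * walsh(A, x) for x in cube) / 2 ** n

# A degree-2 polynomial on {-1,1}^4 taking values in [-1,1]:
f = lambda x: 0.5 * x[0] * x[1] + 0.25 * x[2] - 0.25
n = 4
print(fourier_coefficient(f, (0, 1), n))  # 0.5
print(fourier_coefficient(f, (2,), n))    # 0.25
print(fourier_coefficient(f, (), n))      # -0.25
```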
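The learning set-up can likewise be illustrated with the most naive estimator: approximate each Fourier coefficient of degree at most $d$ by an empirical average over the random samples and output the resulting polynomial $g$. The sketch below (continuing the one above and reusing its imports, `f` and `fourier_coefficient`) is only meant to show the set-up; the algorithms analysed in the paper are more refined, and the paper's contribution concerns how small the number of samples $m$ can be taken.

```python
def learn_low_degree(f, n, d, m, seed=0):
    """Naive learner: estimate each Fourier coefficient of degree at most d
    by the empirical mean of f(x) * chi_A(x) over m uniform random samples.
    (Illustrative only; not the algorithm analysed in the paper.)"""
    rng = np.random.default_rng(seed)
    xs = rng.choice([-1, 1], size=(m, n))
    ys = np.array([f(x) for x in xs])
    estimates = {}
    for r in range(d + 1):
        for A in itertools.combinations(range(n), r):
            chi = np.prod(xs[:, list(A)], axis=1)  # chi_A at each sample point
            estimates[A] = float(np.mean(ys * chi))
    return estimates  # Fourier coefficients of the hypothesis g

# Since f has degree 2, Parseval gives ||g - f||_2 as the square root of the
# sum of squared coefficient errors over the sets A of size at most 2:
est = learn_low_degree(f, n=4, d=2, m=2000)
err = sum((est[A] - fourier_coefficient(f, A, 4)) ** 2 for A in est) ** 0.5
print(err)  # the L2 error, which shrinks as m grows
```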