## Alright, alright, back to complexity

I’ve learned my lesson, at least for the next day or two.

And speaking of learning — in computational learning theory, there’s an “obvious” algorithm for learning a function from random samples. Here’s the algorithm: output any hypothesis that minimizes the error on those samples.

I’m being intentionally vague about what the learning model is — since as soon as you specify a model, it seems like some version of that algorithm is what you want to do, if you want the best tradeoff between the number of samples and the error of your hypothesis. For example, if you’re trying to learn a Boolean function from a class C, then you want to pick any hypothesis from C that’s consistent with all your observations. If you’re trying to learn a Boolean function based on noisy observations, then you want to pick any hypothesis that minimizes the total number of disagreements. If you’re trying to learn a degree-d real polynomial based on observations subject to Gaussian noise, then you want to pick any degree-d polynomial that minimizes the least-squared error, and so on.

Here’s my question: is the “obvious” algorithm always the best one, or is there a case where a different algorithm needs asymptotically fewer samples? That is, do you ever want to pick a hypothesis that disagrees with more of your observations over one that disagrees with less?

While I’m on the subject, have you ever wished you could help Scott Aaronson do his actual research, and even be thanked — by name — in the acknowledgments of one of his papers? Well then, don’t miss this chance! All you have to do is read this seminal paper by Alon, Ben-David, Cesa-Bianchi, and Haussler, and then tell me what upper bound on the sample complexity of p-concept learning follows from their results. (Perversely, all they prove in the paper is that some finite number of samples suffices — must be a mathematician thing.)

Comment #1 April 26th, 2006 at 11:42 pm

Addendum: After I’d already drafted this entry, Peter Bartlett wrote to me to say that a paper by Kevin Buescher and PR Kumar in COLT’92

mightgive an example where a non-obvious learning algorithm does better. I’ll take a look.Comment #2 April 27th, 2006 at 12:44 am

Also, keep in mind cross-validation and generalization for noisy training sets.

You could fit a function very closely to a training set (so it acts like a lookup table) but then new points (from a CV set) would give poor results.

Comment #3 April 27th, 2006 at 1:39 am

Derek: To clarify, I’m supposing that that the class C of functions has been fixed in advance — and that we want to be able to learn

anyfunction in C, to high accuracy and with high probability over the choice of samples. In this setting, there’s no clear notion of a more or less “complicated” hypothesis: all functions in C are on equal footing. So there shouldn’t be any issue of overfitting.Comment #4 April 27th, 2006 at 3:14 am

have you ever wished you could help Scott Aaronson do his actual research, and even be thanked — by name — in the acknowledgments of one of his papers?no?

Comment #5 April 27th, 2006 at 7:21 am

OK, I looked at the papers here and here by Buescher and Kumar. They’re indeed directly relevant to my question.

Consider the class of functions H from [0,1) to [0,1], which consists of the constant function h_0(x)=1/4, as well as (for all positive integers k) the function h_k(x) that returns the kth bit in the binary expansion of x.

Given training data of the form

x_1,f(x_1)

…

x_m,f(x_m),

our goal is to pick the function in H that best approximates f in the L1 norm.

Now suppose f is the constant function f(x)=0. Then here’s the point: if x_1,…,x_m are chosen uniformly random from [0,1), then with probability 1 there exists a k>0 such that

h_k(x_i) = f(x_i) = 0

for all i=1,…,m. In other words, there exists a k>0 such that for every i, the kth bit in the binary expansion of x_i is 0.

So if we use the “obvious” learning algorithm — the one that looks for a hypothesis that agrees with all the training data — then we’ll pick such an h_k. But that’s not what we

should havedone, since h_0 is a better approximation to f than h_k for any k>0. In particular, h_1,h_2,… all have L1 distance 1/2 to f, whereas h_0 has L1 distance only 1/4 to f.Buescher and Kumar give an alternative learning algorithm that

doespick h_0.One obvious problem with this counterexample is that the class of hypotheses is infinite (and indeed, has infinite VC-dimension). What happens if we truncate the class of hypotheses to h_0,…,h_n? I think what happens is that the sample complexities for both the obvious algorithm and the non-obvious one go down to ~log(n).

Is there a case where the VC-dimension is finite, but a non-obvious algorithm has

sample complexityasymptotically smaller than the obvious algorithm’s? That seems to be an open problem. All Buescher and Kumar show is that the known upper bounds can differ by a constant factor.So to summarize: there exists a weird concept class with infinite VC-dimension, for which the obvious algorithm fails to learn with

anyfinite number of samples, but a different algorithm succeeds. On the other hand, the problem I was originally thinking of — whether the obvious algorithm can yield a non-optimal bound on sample complexity in terms of VC-dimension — is apparently still open.Comment #6 April 27th, 2006 at 10:19 am

– You probably looked already, but maybe “support vector regression” is relevant here.

– Don’t forget the Curse of Dimensionality (TM).

Comment #7 April 27th, 2006 at 12:42 pm

Also: I think I worked out a bound from the Alon et al. paper myself. Thanks anyway!

(This is the downside of posting research stuff: anything I post might quickly become irrelevant. Also, no one besides me really cared to begin with.)

Comment #8 April 27th, 2006 at 3:16 pm

Scott,

The “obvious” algorithm, which seeks to fit the observations, seems to be much like a maximum-likelihood estimator. That is, it’s frequentist. The observed data are everything.

An alternative is a Bayesian algorithm. It would start with a prior over functions, P0(f). Then it would update based on observations [P0(f) –> P(f)], and report the mean of P(f) [1].

To see why this might be better, consider MLE on a coin. You want to report the coin’s bias, P_heads. You flip the coin once, get heads, and conclude that P_heads = 1 fits the data best. ‘Course, this is a fairly ballsy statement for a guy who’s only done one measurement.

Bayesian might start with a uniform prior P( P_heads ) = 1 for all P_heads in [0..1]. It would then report P_heads = 2/3 as its best guess.

Basically, the problem with MLE is that the future may look very different from the past. The prior in Bayesian methods serves as a sort of reminder (to the algorithm) not to get too cocky! “There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy.”

[1] Note that the alternative method requires {f} to be a convex, measurable set.

Comment #9 April 27th, 2006 at 6:05 pm

Thanks, Robin! Yeah, as I was writing the post, I realized that you might favor a hypothesis that disagrees with more observations, if it were favored by a Bayesian prior over hypotheses. But in my problem, I explicitly need a learning algorithm that works with high probability whenever the data are described by

anyhypothesis in some class — i.e. that doesn’t have (or “doesn’t know”) the right prior over the hypotheses. (It’s very similar to the difference between average-case and worst-case in complexity theory.)Comment #10 April 29th, 2006 at 1:15 pm

There are some examples of non-obvious algorithms doing uniformly better in the context of unsupervised learning.

For instance when inferring the mean of an unknown d-dimensional Gaussian under squared error, minimum empirical risk estimator is not admissible. In other words there are estimators that uniformly dominate the estimator that minimizes squared loss on training data.

Recently similar effect was shown for more general distributions (albeit asymptotically) http://www.ism.ac.jp/~eguchi/pdf/MLE.pdf

As far as the open problem, I wonder if the simpler claim holds — suppose hypothesis space is finite — is there a problem where non-obvious algorithm does uniformly better?

Comment #11 May 1st, 2006 at 9:01 pm

On a related question: In standard statistical learning algorithms people do ‘cross-validation’ of various sorts to help avoid over-fitting and to obtain confidence in the model. Where does that fit into your problem? At one level it seems like a poor algorithm in that it uses part of the data in only a very weak way. Can one do better?

Comment #12 May 1st, 2006 at 9:19 pm

In standard statistical learning algorithms people do ‘cross-validation’ of various sorts to help avoid over-fitting and to obtain confidence in the model. Where does that fit into your problem?Anonymous: That’s actually an excellent question! The “alternative algorithm” in the Buescher-Kumar paper that I alluded to — the one that sometimes outperforms the obvious algorithm — gains its advantage by a sort of cross-validation. As I said, it seems to be open whether that helps for finite concept classes.