If you want to know more about open-endedness, I recommend Kenneth Stanley’s book, Why Greatness Cannot Be Planned.
The study focuses on evaluating GPT-4o, Claude 3.5, and Llama3-8B, but it might benefit from testing across more architectures (like Mixtral, DeepSeek, or Gemini). That would help show how well ACD generalizes.
What do you think that would help show? 4o and Llama are quite different; reportedly, the 4-series is a large MoE, whereas Llama is famously a dense model.
Testing across more architectures would help clarify whether ACD uncovers failures tied to model scale, training data, or architectural differences like MoE versus dense designs. Showing whether the discovered failures are model-specific quirks or generalize across LLMs would support claims about ACD’s robustness and usefulness for broad AI evaluation.
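For what it's worth, here's a minimal sketch of the kind of cross-model comparison I have in mind (not the paper's code; the model list and discover_failure_tasks are hypothetical stand-ins for an ACD-style discovery loop): run discovery per model, then bucket failure tasks by how many models share them, which is what separates model-specific quirks from general failures.

    # Hypothetical sketch, not from the paper: run an ACD-style discovery
    # loop against several model families and see which discovered failure
    # tasks are shared versus model-specific.
    from collections import defaultdict

    MODELS = ["gpt-4o", "claude-3.5-sonnet", "llama3-8b", "mixtral-8x7b", "deepseek-v2"]

    def discover_failure_tasks(model_name):
        # Placeholder: the real version would generate candidate tasks,
        # have the model attempt them, and return labels of failed tasks.
        return set()

    def cross_model_overlap(models):
        # Count how many models fail each discovered task, then bucket
        # tasks into model-specific vs. shared failures.
        counts = defaultdict(int)
        for model in models:
            for task in discover_failure_tasks(model):
                counts[task] += 1
        buckets = defaultdict(list)
        for task, n in counts.items():
            if n == len(models):
                buckets["shared-by-all"].append(task)
            elif n > 1:
                buckets["shared-by-some"].append(task)
            else:
                buckets["model-specific"].append(task)
        return dict(buckets)

    print(cross_model_overlap(MODELS))

A large "model-specific" bucket would suggest ACD is surfacing quirks of individual models; a large "shared" bucket would suggest it finds failures that generalize across LLMs.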
While I appreciate arxiv.org, I think there should be more peer-reviewed work.
Personally, I've come to think of the peer-review process as a big reinforcer of the publish-or-perish culture in academia. Merit review committees are encouraged to rely on the count (and impact-factor scores) of published peer-reviewed papers to measure impact, which lets them treat the peer-review publishing process as a mint for tokens signifying a researcher's value. While this saves the committees time and gives them an excuse not to actually evaluate the content of the researcher's output, there are costs to researchers.
For good and careful scientists, the peer-review process rarely adds much value to the original submission, yet it requires a lot of tedious work and energy spent responding to minor concerns. That time and energy could be spent doing more research. Peer review adds the most value to bad manuscripts describing bad research, where good reviewers coach the authors on how to do science better. This also takes up a lot of time.
If I could do it my way, I'd rather publish to an archive and move on once I feel that the research is to my satisfaction.
For good and careful scientists, unless you communicate a design and results that can be reviewed, other good and careful scientists will reject your work. That is the nature of good and careful science.
Lately I've found that this site over-relies on non-peer-reviewed papers. Also, the arXiv moderation process relies heavily on existing author reputation. I think your characterization of good scientists as people who don't appreciate the peer-review process leaves much to be desired. My personal experience is that peer review is the majority of the scientific process, and if it feels tedious, then your work is part of the very machinery you admonish.
To protect against misinformation, the question of whether or not a paper has been peer reviewed is, at the very least, a necessary one.
Ideally we would see some peer review on arxiv itself. There are some... wrappers? of that kind of functionality on https://www.scienceopen.com/ and others, but it would be amazing to see those reviews closer to the source.
I think the usual name is "overlay". At least, that's what Tim Gowers called the one he started :) https://gowers.wordpress.com/2015/09/10/discrete-analysis-an...
Is it the case that the authors of these ML papers frequently don't even try to get them into a peer-reviewed venue?
They usually post it on arXiv before they submit the paper for peer review. Industry scientists might just post it on arXiv without peer review, because they might not care about putting it on their resume.
Lay people trust the peer review process way too much anyway. For a typical conference, it's usually just a grad student who goes through the paper in an hour or two and makes some comments. I've done peer review on a few papers for a prestigious conference in a field where I'm not even a subject-matter expert; my expertise is just adjacent enough to kind of make sense of a paper in an hour or two.
> Lay people trust the peer review process way too much anyway. For a typical conference, it's usually just a grad student who goes through the paper in an hour or two and makes some comments.
Yep. I'm that grad student right now. From observing people across universities and countries do peer review, it seems that the main thing that gets checked is flagrant inconsistencies in the data presented, since these have recently caused a lot of reputational damage. Otherwise it's exactly as you described.
And reputation plays a huge, huge role. One person our lab works with (who is well known in the field) told a story about a time they sent a paper to a conference or journal as a draft (for review, I suppose) and pointed out that it still had various mistakes that needed to be ironed out. Despite this, the venue simply took the initial draft and published it. It also got the best paper award.
Another interesting thing I noticed was researchers "forgiving" each other's B.S. It might be a surprise to some, but most research papers are published just to increase the count of papers published. Even the most prolific researchers send out a few of these every year to keep up with the metrics. For another researcher in the same field it's usually patently obvious that it's a "filler" paper, and so long as it doesn't contain anything egregious and is otherwise innocuous, it's let through. In AI/ML this is usually restating existing algorithms or theorems in an exotic setting to make them sound novel, or adding a KL divergence term to some loss function in an existing setup. Since h-indexes are a thing, these papers typically cite each other among "friends" to help everyone. All in all, it is very hard today to separate the signal from the noise, especially for an outsider.
Brilliant - well done!