This may be obvious to many people already, but I recently thought about it in a few scenarios and figured it would be valuable to articulate clearly.

One note before the main discussion: we should acknowledge that LLMs do have some sort of independent and creative thinking. Some folks are still skeptical about this, but if you have spent enough time with post-GPT-4 models, you will likely have encountered examples where LLMs demonstrate such capability, be it in coding, business, analytics, or other contexts. The tech report of the original GPT-4 model also provided some examples that make this evident.

However, here I would like to argue that (especially in cutting-edge scenarios) LLMs are not a good tool for truly effective brainstorming. The reason is that LLMs are trained to follow existing patterns in the human-produced corpus, and are not natively taught to “brainstorm”. Their main training goal is to mimic the probabilistic distribution of the input data, and as a by-product, they also learn some of the logic and deduction rules of human language and thought. This gives them a kind of creativity, but more in the sense of “extrapolating” a pattern, i.e. “what would a sophisticated person say when facing such a question?”, and not necessarily true innovation.
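To make this concrete, below is a minimal sketch of the standard next-token objective, assuming a PyTorch-style model (the names `model` and `token_ids` are illustrative, not from any specific codebase). Note that the loss only rewards matching the corpus distribution; nothing in it rewards novelty.

```python
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Average cross-entropy between the model's predicted distribution
    and the actual next token. Minimizing this pulls the model toward the
    empirical distribution of the training corpus; the objective itself
    carries no notion of creativity."""
    logits = model(token_ids[:, :-1])   # predict token t+1 from tokens <= t
    targets = token_ids[:, 1:]          # the "ground truth" is the corpus itself
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch * seq, vocab)
        targets.reshape(-1),                  # (batch * seq,)
    )
```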

As a result, LLMs often converge to the consensus in the existing data (which contains the human discussions and conclusions on a given topic). For example, if you ask LLMs to help brainstorm startup ideas, they often fall back on ideas that are at least somewhat covered in the media (i.e. buzzwords). There are two additional observations:

The idea lists are often very similar across different LLMs, even though there isn’t a single webpage or article on the internet directly providing such a list. This indicates that despite different data collection, processing, training, and evaluation methods, the tendency to converge to consensus is universally present in almost all major LLMs.

The occurrence of ideas (and, roughly speaking based on observation, the preference among them) generally aligns with the frequency and attention given by the media and main information sources. This often overshadows any effort to evaluate these ideas on their actual practicality and creativity.

What’s worse, when we ask about topics that currently have no consensus, LLMs do not behave in a more creative and independent way (as we might hope), but instead become more susceptible to issues like hallucination. Extra caution is due in such cases when weighing each piece of advice an LLM gives.

As a result, as of today, LLMs are only suitable for better-than-average-level brainstorming (thanks to the extrapolation capability). For truly frontier problems, LLMs cannot provide much useful insight beyond clichés, if they do not perform outright worse.

Solution?

I’m leaning toward believing this is a fundamentally hard problem under the current LLM scheme. Since (1) the training process is generally geared toward mimicking a distribution rather than other, more sophisticated goals, and (2) the SFT & RLHF process is largely limited by human raters’ creativity, and it is not scalable to generate genuinely creative samples for truly frontier problems, neither of the main training steps can offer effective learning signals to the model.

While I’m not an expert on LLM training, here are a few thoughts that could help:

Curate a finetuning dataset consisting of good brainstorming examples on topics that are unconventional relative to the training data, potentially authored by human experts and innovators in various fields (a sketch of what a record in such a dataset might look like follows below). It will be much more costly to create such a dataset, for sure.
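As a purely hypothetical illustration, one record in such a dataset might look like the following, using a common chat-style SFT format (the schema and values here are made up, not from any real dataset):

```python
# Hypothetical record in a brainstorming finetune dataset; all field names
# and content are illustrative.
brainstorm_example = {
    "messages": [
        {
            "role": "user",
            "content": "Brainstorm non-obvious approaches to <frontier problem>.",
        },
        {
            "role": "assistant",
            # Written by a domain expert, deliberately avoiding the consensus
            # ideas that already saturate the public corpus.
            "content": "1. <unconventional idea, with reasoning> ...",
        },
    ],
    # Optional metadata so training/evaluation can weight novelty explicitly.
    "annotations": {"novelty_score": 4, "expert_field": "robotics"},
}
```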

Use methods like RLAIF to iteratively critique the LLM’s responses along the axis of creativity, judged by the LLM itself (see the sketch below). The assumption here is that a general standard of creativity should be learnable in regular LLM training, and while it’s not clear how to directly sample a more creative answer, it is possible to reward responses that are more creative by the LLM’s own standard. However, it remains to be seen whether this can work on truly frontier problems (where the judgment of creativity may itself be dynamic and unconventional).
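Here is a rough sketch of how such a self-critique loop could produce training signal, e.g. as preference pairs for a method like DPO. The `generate` and `judge_creativity` helpers are hypothetical stand-ins for sampling from the LLM and prompting it to score its own output’s creativity:

```python
def collect_creativity_preferences(prompts, generate, judge_creativity, n_samples=8):
    """For each prompt, sample several candidate answers, score each one's
    creativity with the LLM-as-judge, and keep the (most, least) creative
    pair as a preference example."""
    pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        # Self-critique: rank candidates by the model's own creativity score.
        ranked = sorted(candidates, key=lambda c: judge_creativity(prompt, c))
        pairs.append({
            "prompt": prompt,
            "chosen": ranked[-1],    # rewarded: most creative by the LLM's standard
            "rejected": ranked[0],   # penalized: least creative
        })
    return pairs
```

Whether the judge’s notion of creativity transfers to genuinely frontier problems is exactly the open question noted above.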

A less clear idea is how we can change the training process so that the model does not simply follow existing data patterns but actively seeks out knowledge, thinking, and deductive reasoning skills. This may overlap with some big (even philosophical) topics in the field: do we need a world model? Is auto-regression the correct paradigm for AGI? Do we need real-world feedback to actually reach AGI?

On the other hand, I think there is still a possibility that this is achievable within the current scheme, e.g. by framing it as a generalized form of compression. (This may hinge on the true nature of human language: is it just a tool for communication, or is it one of the true forms of intelligence itself? Which is also an interesting topic.)