The Bottleneck Period: Some Problems with Taking AI Implementation Further

Clearly, I am still in a bottleneck period in my output, whether in trying new things or in improving existing ones.

Out of habit, I assume that my "imagination has dried up," and I have tried some methods that used to be very effective at "cold starting" myself, but the results seem unsatisfactory. In preparing content for one-on-one communication, I don't believe the problem is so-called "model intelligence degradation"; I also tried increasing the number of iterations and gradually adding more "human intervention" within them, yet the effect remains unsatisfactory.

What I want to express and what the model outputs seem to have taken two different paths, yet I am always more willing to believe that the "data behind the model" is far more powerful and objective than I am. Consequently, I fell into endless self-doubt. This doubt is not simply about whether the method is wrong, but about whether my own mindset is wrong.

I dreamed of a very realistic scene where every face of my family and friends was incredibly clear, except for my own, which was blurred. After waking up, the faces of every family member and friend remain clear in my mind, yet my own face is still blurred.

This seems right, because we never actually see ourselves; we never find a good way to step into a third-party perspective and see ourselves clearly.

I no longer trust my own judgment. Although I can still point out some detailed "factual errors" in AI-generated content, and can roughly trace such errors to the model's confused "sense of time," I cannot judge right or wrong at a more macro level: not style, not diction, not even conclusions. (As I was writing this, I was interrupted; when I returned, I noticed that the pronouns in this paragraph were all "you," which I have since changed to "I." It was an interesting sensation.)

I don't know how much of this is because "my own face is always blurred in my mind." From childhood on, I have considered myself a confident, even conceited person. But what if I am only ever looking at a "me"? What if, just as when writing the previous paragraph, I am using "you" to stand in for "me"? The idea of "observing the self" is not new; it is even a cliché. There are many mainstream methods of introspection, including two diametrically opposed ones: one is to keep yourself from thinking at all; the other is to let the thoughts come and simply remain a quiet observer.

Ten-plus years ago, I used the first method; recently, I have been trying the second. But regardless of which one, the position of that "I" remains blurred.

Perhaps the Diamond Sutra has good descriptions of this, but here I need to wrap up the topic. Still, I am grateful to have written this far, because it connects to what I truly wanted to talk about from the beginning: AI implementation.

The descriptions above are almost "purely subjective": if current AI output still needs to be read by "humans," then my "bottleneck period" may still carry meaning, whether it stems from my "exhausted imagination," from the model's capabilities, or from how the model and I mesh.

"Human" remains the most important obstacle, whether it's human subjective evaluation, human input, or human cooperation and intervention.

So, still judging from a human subjective perspective:

My "honeymoon window" with the model is roughly when I input a document of about two to three thousand words with both text and images: the model's attention is fully utilized; details are basically not missed; because the input content is rich, the chance for the model to hallucinate freely is greatly reduced.

Another way is to feed it dozens of corpus materials of varying lengths along with an outline of a few hundred words. In this mode the model still produces decent output, but obvious omissions of detail start to appear. Interestingly, with the same model the "attention" barely shifts between attempts: the parts it highlights and the parts it omits stay essentially the same from try to try.

When the model is given only limited information, say a description of at most a few hundred words or just a one-sentence question, it can still produce output, but its attention becomes very strange: whether or not it does deep research, it seems to concentrate on a few specific points (which I read as the most easily searchable results). Errors also become highly concentrated, simply because the sparse input gives the model more room to "make mistakes."

Interestingly, when I try slightly more complex operations, such as iterating through the modes above, interacting over multiple rounds, and constantly adjusting my "input outline," the quality improves markedly in the parts that "human attention" (input) has touched. In the parts not covered by "human attention," however, the results seem worse: not just lower individual quality, but a lack of "fit" with the rest. That inconsistency is very uncomfortable, at least for me, and I don't think it has much to do with the model's "temperature" setting.
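For readers who have not touched that knob: "temperature" only rescales the model's output distribution at sampling time, which is why it mainly changes how varied the wording is, not which parts of the input get used. Below is a minimal illustrative sketch (plain numpy, made-up logits, not any specific model's implementation) of what the parameter does.

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Sample a token index after scaling the logits by 1/temperature.

    Lower temperature sharpens the distribution (more deterministic wording);
    higher temperature flattens it (more varied, riskier wording).
    """
    scaled = logits / max(temperature, 1e-6)   # avoid division by zero
    probs = np.exp(scaled - scaled.max())      # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

logits = np.array([2.0, 1.0, 0.5, 0.1])        # hypothetical scores for 4 tokens
print(sample_with_temperature(logits, 0.2))    # almost always index 0
print(sample_with_temperature(logits, 1.5))    # spread across all four
```

Because this rescaling happens after the model has already decided where to "look," it seems plausible that the patchy "fit" described above comes from attention and context handling rather than from this sampling knob.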

To some extent, we have entered a "lottery" mode, stitching together the "unexpected differences" from multiple results to create a terrifying "Frankenstein's monster."

Yes, it is easy to categorize these problems: lack of memory, insufficient context, not enough attention heads.

Yet to this day, we still don't quite know what we would get by increasing the number of attention heads, or by enlarging the training-time batch and sequence sizes that are so closely tied to context. After all, under O(n²) complexity, if doubling the resources brings only a limited quality improvement, it may simply not be feasible. Or, a deeper suspicion: is there an upper limit to data representation based on tokens? Even with multimodality, is it still bounded by the token representation of the corresponding text?
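To make that quadratic cost concrete, here is a rough back-of-the-envelope sketch. The numbers (d_model, layer count) are illustrative assumptions, not any particular model's configuration; the point is only that each doubling of context multiplies the attention cost by roughly four.

```python
def attention_flops(seq_len: int, d_model: int = 4096, n_layers: int = 32) -> int:
    """Very rough FLOP count for the attention matmuls in one forward pass.

    Per layer: the QK^T score matrix costs ~2 * seq_len^2 * d_model FLOPs,
    and weighting V by those scores costs about the same, so the total
    grows with seq_len squared.
    """
    per_layer = 2 * (2 * seq_len**2 * d_model)
    return n_layers * per_layer

for n in (2_000, 4_000, 8_000):   # e.g. a short note vs. a long mixed document
    print(f"{n:>5} tokens -> {attention_flops(n):.2e} attention FLOPs")
# Each doubling of context costs ~4x in attention alone, which is the
# "double the resources for a limited gain" trade-off mentioned above.
```

Under that scaling, "just give it more context" stops being a free answer long before the quality question is settled.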

World models? Physical AI? Where is that paradigm? When I lift my camera, a question repeatedly appears in my mind: what exactly is the difference between what my eyes see, or the lens sees, and what the model "sees"?

If we look back hundreds of thousands of years, there were flowers, grass, mountains, and water, a great many existences, yet seemingly no language or writing (as far as we know). Wittgenstein wrote, "The limits of my language mean the limits of my world." That sentence may be even better suited to describing the limits of the current model world. Yet in a space without language or thought (or before the human brain has had time to process anything), if an object flies toward us, the subconscious reaction is most likely to "dodge," even when there is no time to describe the object in words. Some now classify this as "System 1," while model developers emphasize the importance of "System 2" (non-instinctive, deliberate thinking).

But for humans to function in this world, doesn't the vast majority of what we do rely on so-called "System 1"? Then, for AI to be implemented in the human world, will it rely on "System 2"? Rely on "System 2" to "replace human jobs"?

Perhaps one day AI will be powerful enough to create its own world; I truly believe this will happen sooner or later. But for now, and for a significant time to come, the "human" remains the greatest obstacle, or rather the "mountain," standing before AI.

AI implementation is a human problem, right?

It is just that it still needs to evolve to cooperate better with humans; it still needs to evolve until costs fall by 99%, 99.9%, 99.99%...

In an "extremely deflationary" AI world, we might have more opportunities to explore the vastness under its "compression of all human knowledge." Perhaps then we will have the chance to find output that fits every individual "person," rather than a recurrence of the Law of Large Numbers under large samples: correctly monotonous, and incorrectly uninspiring.

Perhaps only then will we have the chance to see our own blurred faces clearly; only then will we have the chance to see each unique difference; only then will we have the chance to see a broader, even boundless world beyond the one we understand.

And regarding this near future, I do not believe it is "dark," because it can "illuminate the depths of the Five Aggregates."
