Apologies for the slightly clickbaity title.
To answer directly: voice. After search and coding, voice will be the next field to "kill" (that is, consume) a massive number of tokens.
Many people might wonder: shouldn't it be AI video and images?
First, here are the reasons why it's not them:
Even if Sora-2 and Veo-3.1 are excellent (some might argue Sora-2 is the best; let's not debate that), a practical question remains unanswered: suppose current models could generate videos longer than ten minutes that look just like real people. Would you choose to spend those ten-plus minutes watching real people or AI-generated ones? People
Of course, animation and sci-fi might be great application scenarios for AI video. However, even if AI-generated quality becomes exceptionally high, the general public (as opposed to a minority) won't necessarily generate videos purely for their own entertainment. More people are likely to be drawn to video works made by creative individuals using AI, but this brings us back to the inherent nature of video: production may consume tokens, but viewing consumes almost none, no matter how high the view count. What viewing consumes is video stream compression, transmission, and decompression;