AI Watching And Listening: Cross-Sensory Cognition Work

It’s easy to forget how much AI is doing behind the scenes. For a reminder, look no further than the talks at Imagination in Action and what the experts there showed us.

Large language models are taking the world by storm, imitating human cognition in many different ways, and that is feeding a broad trend toward digital disruption.

That idea comes through loud and clear as James Glass takes us through some of the intersections between video, audio and new technology.

For example, take a look at the part of the video where he talks about image captioning and the interplay between visuals and text:

“We were interested in seeing if we could take speech and pair it up with vision, and with no other information, see what the machine could learn from raw audio samples and raw pixels,” he explains. “And so since nothing like this existed, we went out and collected about 400,000 or so people talking about images. People like to do this; it’s pretty easy. Then we (built) a deep learning model, having one branch grovel (sic) over the image and another branch grovel (sic) over the audio, and then at a high level, have them connect and try and learn a joint audiovisual semantic latent representation of the signal.”
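To make the two-branch idea concrete, here is a minimal sketch in plain Python. It is not the model from the talk: the linear "branches", the toy feature vectors, the temperature value and the contrastive loss are all assumptions chosen for illustration. The point is only that each branch maps its own input into one shared space, and that training rewards a high score for an image and the speech that describes it.

```python
import math

def project(weights, features):
    """One 'branch': a linear map from raw features into the shared space."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def normalize(vector):
    """Scale to unit length so dot products behave like cosine similarity."""
    norm = math.sqrt(sum(c * c for c in vector)) or 1.0
    return [c / norm for c in vector]

def similarity(a, b):
    """Dot product of two embeddings in the shared space."""
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(image_embs, audio_embs, temperature=0.1):
    """Average cross-entropy of picking the matching caption for each image.

    Pair i is assumed to be (image i, spoken caption i); every other caption
    in the batch serves as a negative example.
    """
    total = 0.0
    for i, image in enumerate(image_embs):
        logits = [similarity(image, audio) / temperature for audio in audio_embs]
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum - logits[i]
    return total / len(image_embs)

# Hypothetical toy inputs: two images (3 features each), two captions (2 features each).
image_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # image branch: 3 -> 2 dimensions
audio_weights = [[0.0, 1.0], [1.0, 0.0]]             # audio branch: 2 -> 2 dimensions
image_features = [[2.0, 0.1, 0.5], [0.1, 3.0, 0.2]]
audio_features = [[0.2, 1.5], [2.0, 0.1]]

image_embs = [normalize(project(image_weights, f)) for f in image_features]
audio_embs = [normalize(project(audio_weights, f)) for f in audio_features]

loss_matched = contrastive_loss(image_embs, audio_embs)
loss_shuffled = contrastive_loss(image_embs, audio_embs[::-1])
print(f"matched pairs: {loss_matched:.4f}, shuffled pairs: {loss_shuffled:.4f}")
```

In a real system the branches would be deep networks and the weights would be learned by lowering this kind of loss over many paired examples; here the weights are fixed by hand so the script runs on its own.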

Glass talks about “semantic objects” as versatile units of digital cognition, and shows us how the computer ‘thinks’ by offering a display where you can hear people talking about items in a picture, and see pixels lighting up around those objects.

In a way, it’s like stepping through code in a debugger: you see what the machine is doing while it’s doing it.

Lighthouses and sunsets are pretty, but Glass suggests there’s more to it than that:

“It’s sort of like somebody shining a flashlight at a picture while you’re talking. And it’s not perfect, but you get a sense that on some of the concepts that you’re hearing, it sort of knows what you’re talking about. You can quantify this a little bit more by looking through a large data set and finding patches (sic) and images that have high correspondence with segments in the speech captions, and pooling them together and then clustering, and you get hundreds and hundreds of these kinds of clusters…”
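The "flashlight" effect can be sketched as a grid of similarity scores between image patches and speech segments. The code below is an illustration under stated assumptions, not the method from the talk: the embeddings are made-up two-dimensional vectors, and `match_map` and `brightest_patch` are hypothetical helper names.

```python
def dot(a, b):
    """Similarity between one speech-segment embedding and one patch embedding."""
    return sum(x * y for x, y in zip(a, b))

def match_map(patch_embs, segment_embs):
    """Rows are speech segments, columns are image patches."""
    return [[dot(segment, patch) for patch in patch_embs] for segment in segment_embs]

def brightest_patch(match_row):
    """Index of the patch that 'lights up' most for one speech segment."""
    return max(range(len(match_row)), key=match_row.__getitem__)

# Hypothetical embeddings for a 2x2 grid of patches and two spoken words.
patch_embs = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.1], [0.5, 0.5]]
segment_embs = {"lighthouse": [1.0, 0.0], "sunset": [0.0, 1.0]}

scores = match_map(patch_embs, list(segment_embs.values()))
highlights = {word: brightest_patch(row) for word, row in zip(segment_embs, scores)}
print(highlights)
```

Collecting the highest-scoring patch and segment pairs across a large data set, then grouping similar ones, is the pooling and clustering step Glass mentions; that part is left out of this sketch.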

He talks about the “Rosetta Stone” of language intersection, where some of these new technologies will enable better translations – or more to the point, entirely new kinds of translations transcending text and verbal reading in very sci-fi ways.

But that’s really just the tip of the iceberg. Think about what’s going to happen when we allow AI entities to translate between media, between speech and visuals!

Or to put it another way, think back about a decade to early AI work. We had unsupervised machine learning, and supervised machine learning.

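These paradigms that Glass is talking about are inherently different. They’re based on self-supervised learning, as he mentions several times, where the training signal comes from the data itself (here, the pairing of speech with images) rather than from human-written labels. And that’s critically important. Systems that learn from unannotated data can scale in ways that make it hard for humans to keep up with them.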

As an example, Glass talks about scene analysis and perception models. Listen to this part where he discusses a methodology for multimedia analysis:

“You can modify that basic model to have a visual branch that’s processing video, and an audio branch that’s processing speech and the audio sounds, and learn a high-level embedding space. And you can do things like retrieval: play an audio snippet and retrieve the corresponding video snippet, and things like that.”
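Retrieval in a shared embedding space comes down to ranking stored items by how close they sit to a query. The sketch below assumes the audio and video embeddings have already been produced by some trained model; the three-dimensional vectors and clip names are invented for illustration, and `retrieve` is a hypothetical helper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_emb, library, top_k=1):
    """Return the names of the top_k library items closest to the query."""
    ranked = sorted(
        library.items(),
        key=lambda item: cosine(query_emb, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]

# Hypothetical video-snippet embeddings in the shared space.
video_library = {
    "clip_waves": [0.9, 0.1, 0.0],
    "clip_traffic": [0.1, 0.9, 0.2],
    "clip_birds": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of an audio snippet, produced by the audio branch.
audio_query = [0.8, 0.2, 0.1]

best_match = retrieve(audio_query, video_library)
top_two = retrieve(audio_query, video_library, top_k=2)
print(best_match, top_two)
```

Because both modalities live in the same space, the same function works in the other direction: pass a video embedding as the query and a library of audio embeddings.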

He talks about listening and understanding, and how we can move the ball forward:

“Deep Learning has really enabled us to make connections across modalities,” he says. “It’s fascinating: self-supervised learning has let us learn from large quantities of unannotated data. And these newer large language models (are) going to be a really interesting research direction (in which) to connect perception with language: two of the original pillars of artificial intelligence.”

It truly is fascinating. After a while, you might find it almost keeps you up at night. With AI doing all of this – how long until it’s doing it better than us? Anyway, the applications are evident, and the methodology, the cutting-edge research, is starkly impressive.
