You can pick a custom keyboard shortcut, and you can decide to simply press that shortcut instead of pressing and holding it.
Abstract: Text-to-audio grounding (TAG) task aims to predict the onsets and offsets of sound events described by natural language. This task can facilitate applications such as multimodal information ...
Finally, the code for the web UI client used in the Moshi demo is provided in the client/ directory. If you want to fine tune Moshi, head out to kyutai-labs/moshi ...