Audio
This wiki page details the algorithms and processing techniques used to move audio through the system. It describes only the implementation of the algorithms; for usage, see the Voice page.
Discord provides audio through a NodeJS stream. Streams are created per user and not per channel. Audio data is only passed through the stream when the user is speaking, so no silence is included in the stream.
Merging multiple Discord audio streams so that they stay synchronized in time requires careful insertion of silence packets. Once the right amount of silence has been inserted, the streams can be mixed normally.
Downsampling is done using a FIR filter.
Stereo-to-mono conversion is done by averaging the two channels (left and right) into a single channel.
For more details on the specifications of the audio provided by Discord, visit this wiki page.
Discord Audio is retrieved from an unofficially supported API, but DiscordJS attempts to support it.
In order to receive audio from other users in the voice channel, you must send some packets first. This is why a silent audio stream is created and played upon connecting to the channel. Although DiscordJS also has this built in, it seems to be unreliable and sometimes does not properly send the initial packets.
Discord audio is retrieved per user and not per channel. This has benefits and drawbacks. The benefit is that each user is isolated by default, which keeps voice commands uninterrupted and makes it possible to create clips / recites of a single user. The drawback is that, in order to create a clip of the entire channel, every user's audio stream must be merged together.
Merging streams would usually be simple. In the case of Discord, however, silence packets are not sent through the stream, so the audio streams become desynchronized over time. This means that when the user streams are merged together, the merged stream is not synchronized correctly either. The solution to this issue is discussed in the merging section.
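To make the size of the drift concrete, here is a minimal sketch that counts how many 20 ms silence packets are missing from one user's stream, given the arrival times of that user's packets. The function name and the idea of working from arrival timestamps are assumptions made for illustration; the project itself uses a timer, as described in the merging section.

```typescript
const PACKET_MS = 20; // packet interval stated on this page

// Hypothetical helper: given packet arrival times (in ms) for one user,
// returns how many 20 ms silence packets are missing between them.
function missingSilencePackets(arrivalsMs: number[]): number {
  let missing = 0;
  for (let i = 1; i < arrivalsMs.length; i++) {
    const gap = arrivalsMs[i] - arrivalsMs[i - 1];
    // A gap of one interval means back-to-back packets, so nothing is missing.
    missing += Math.max(0, Math.round(gap / PACKET_MS) - 1);
  }
  return missing;
}
```

For example, a user who speaks for three packets, pauses, and resumes 200 ms after the last packet is missing nine silence packets (180 ms), so that user's stream would run 180 ms ahead of the others if merged as-is.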
Streams must be preprocessed to have silence inserted before they can be merged.
Silence is inserted into the stream with a timer that adds silence packets at fixed intervals. Since Discord sends one packet every 20 ms (50 packets per second), silence packets are inserted at the same rate. Experimental analysis showed that each packet is 3840 bytes, which matches 20 ms of 48 kHz, 16-bit stereo PCM (48000 × 0.02 × 2 channels × 2 bytes). The insertion at intervals can thus be done with the following code:
this.silenceInsertionInterval = setInterval(() => {
  // One packet of silence: 3840 zero-filled bytes
  const silenceChunk = Buffer.alloc(3840);
  this.streams.forEach(stream => {
    stream.insertSilentChunk(silenceChunk);
  });
}, 20); // Discord's packet interval in milliseconds
It is important to not add these packets when there is audio coming through, so a debouncer is used to determine when silence has begun. Once the debouncer has determined that no more audio packets are coming through, it will allow silence to be inserted.
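The debouncing idea can be sketched as a small gate that remembers when the last real audio packet arrived. This is only an illustration: the class name, its methods, and the 40 ms default threshold are assumptions, not values taken from the project.

```typescript
// Hypothetical helper: decides whether a silence chunk may be inserted.
class SilenceGate {
  private lastAudioAt = Number.NEGATIVE_INFINITY;
  private readonly quietAfterMs: number;

  // quietAfterMs is an assumed threshold, not a value from the project.
  constructor(quietAfterMs: number = 40) {
    this.quietAfterMs = quietAfterMs;
  }

  // Call whenever a real audio packet arrives.
  noteAudio(nowMs: number): void {
    this.lastAudioAt = nowMs;
  }

  // True once no audio has arrived for at least quietAfterMs.
  isSilent(nowMs: number): boolean {
    return nowMs - this.lastAudioAt >= this.quietAfterMs;
  }
}
```

With a gate like this, each stream would call noteAudio(Date.now()) when a packet arrives, and the timer callback would insert a silence chunk only when isSilent(Date.now()) returns true.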
Mixing streams together is as simple as summing the corresponding samples of each stream. The following is the code that accomplishes this:
for (let i = 0; i < result.length; i += 2) {
  // Sum the 16-bit sample at offset i across every stream
  let value = 0
  buffers.forEach((buffer: Buffer) => {
    value += buffer.readInt16LE(i)
  })
  // Saturate to the signed 16-bit range before writing
  value = Math.max(SIGNED_16_BIT_MIN, value)
  value = Math.min(SIGNED_16_BIT_MAX, value)
  result.writeInt16LE(value, i)
}
This piece of code assumes that each buffer is the same length. It is important to saturate the summed value to the 16-bit maximum and minimum, as a value that is too large or too small will cause the write of the 16-bit signed integer to fail (documentation).
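For experimenting outside the project, the mixing loop can be wrapped in a self-contained function. The sketch below is an assumption-laden variant rather than the project's code: the name mixPcm16 is hypothetical, and unlike the original it treats shorter buffers as silence past their end instead of requiring equal lengths.

```typescript
const SIGNED_16_BIT_MIN = -32768;
const SIGNED_16_BIT_MAX = 32767;

// Hypothetical helper: mixes signed 16-bit little-endian PCM buffers.
// Assumption: a buffer shorter than the longest one contributes silence
// for the samples it lacks.
function mixPcm16(buffers: Buffer[]): Buffer {
  const longest = buffers.reduce((max, b) => Math.max(max, b.length), 0);
  // Drop a trailing odd byte so every sample is a full 16 bits.
  const result = Buffer.alloc(longest - (longest % 2));
  for (let i = 0; i < result.length; i += 2) {
    let value = 0;
    for (const buffer of buffers) {
      if (i + 2 <= buffer.length) value += buffer.readInt16LE(i);
    }
    // Saturate so writeInt16LE never receives an out-of-range value.
    value = Math.min(SIGNED_16_BIT_MAX, Math.max(SIGNED_16_BIT_MIN, value));
    result.writeInt16LE(value, i);
  }
  return result;
}
```

Saturating is the simplest way to keep the write valid, but loud overlapping speakers will clip; scaling the sum down is a possible alternative with its own trade-offs.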
Stereo to mono conversion is done by simply averaging the values of the left and right channels together. The following is the code that accomplishes this:
const newBuffer = Buffer.alloc(buffer.length / 2)
// Each stereo frame is 4 bytes: a left and a right signed 16-bit little-endian sample
for (let i = 0; i < newBuffer.length / 2; ++i) {
  // Read the samples as signed values; assembling them from raw bytes yields
  // unsigned values and a wrong average when left and right differ in sign
  const left = buffer.readInt16LE(i * 4)
  const right = buffer.readInt16LE(i * 4 + 2)
  const avg = Math.round((left + right) / 2)
  newBuffer.writeInt16LE(avg, i * 2)
}
return newBuffer
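The overview states that downsampling is done using a FIR filter but gives no further detail. As a rough illustration of the technique, the sketch below designs a Hamming-windowed sinc low-pass filter and then keeps every Nth filtered sample. Everything specific here is an assumption: the function names, the tap count, the cutoff, the decimation factor, and the use of Int16Array for mono samples are not taken from the project.

```typescript
// Builds a Hamming-windowed sinc low-pass filter with unity gain at DC.
// cutoff is a fraction of the input sample rate (0 < cutoff < 0.5).
function designLowPass(numTaps: number, cutoff: number): Float64Array {
  const taps = new Float64Array(numTaps);
  const mid = (numTaps - 1) / 2;
  let sum = 0;
  for (let n = 0; n < numTaps; n++) {
    const x = n - mid;
    const sinc =
      x === 0 ? 2 * cutoff : Math.sin(2 * Math.PI * cutoff * x) / (Math.PI * x);
    const hamming = 0.54 - 0.46 * Math.cos((2 * Math.PI * n) / (numTaps - 1));
    taps[n] = sinc * hamming;
    sum += taps[n];
  }
  // Normalize so a constant input keeps its level.
  for (let n = 0; n < numTaps; n++) taps[n] /= sum;
  return taps;
}

// Applies the FIR filter and keeps every `factor`-th output sample.
// Samples before the start of the input are treated as zero.
function downsample(input: Int16Array, factor: number, taps: Float64Array): Int16Array {
  const output = new Int16Array(Math.floor(input.length / factor));
  for (let o = 0; o < output.length; o++) {
    const newest = o * factor;
    let acc = 0;
    for (let k = 0; k < taps.length; k++) {
      const idx = newest - k;
      if (idx >= 0) acc += taps[k] * input[idx];
    }
    output[o] = Math.max(-32768, Math.min(32767, Math.round(acc)));
  }
  return output;
}
```

As a usage example under these assumptions, downsample(samples, 3, designLowPass(31, 0.15)) would reduce 48 kHz mono samples to 16 kHz, with the cutoff placed slightly below the new Nyquist frequency to limit aliasing.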