MediaAPI

Pion WebRTC Media API

This document details a completely new media API for Pion WebRTC. The current media API has deficiencies that prevent it from being used in a few production workloads. This document doesn't aim to modify or extend the existing API; we are looking at it with fresh eyes.

Adding Comments

I encourage everyone to comment on this page! When adding comments, add them in italics and include your GitHub username, for example: I believe this API can be improved by doing X -- Sean-Der

API Requirements

API Users

If you can think of more use cases please provide them, this list is not exhaustive!

Sending pre-recorded content to viewer(s)

A user has an audio/video file on disk and wants to send the content to many viewers. There will be no congestion control, but there will be some loss handling (NACK). If the remote viewer doesn't support the codec we offer, handshaking will fail.

Relaying RTP Traffic (with no feedback)

A user has an existing RTP feed (for example an RTSP camera) and wants to send the content to many viewers. There will be no congestion control, but there will be some loss handling (NACK). If the remote viewer doesn't support the codec we offer, handshaking will fail.

Sending live generated content

A user will be encoding content and sending it to many viewers. This could be an MCU, or capturing a webcam or desktop (like github.com/nerdism/neko). There will be congestion control and packet loss handling (NACK/PLI). The user should be informed of the codecs the remote supports, and then be able to generate on the fly what is requested.

Ingesting WebRTC for Later Playback

A user wants to save media from a remote peer to disk. This could be for playback later, or some other async task. We need to ensure the best experience possible by providing loss handling, and congestion control. Latency doesn't matter as much.

Ingesting WebRTC for Live Playback

A user wants to consume media from a remote peer live. This could be used for processing (like GoCV) or playing back live. We need to ensure the best experience possible by providing loss handling and congestion control. We will also need to be careful not to add much latency, since that could hurt the entire experience.

Relaying WebRTC Traffic

Users should be able to build the classical SFU use cases. For each Peer you will have one PeerConnection, and transfer all tracks across that. If possible we should support Simulcast and SVC. However if nothing is supported we should just request the lowest bitrate that works for all peers. Beyond that we should pass everything through and let de-jitter happen on each receiver side. This needs more research.

Code that works in native and web

Users should be able to write idiomatic WebRTC code that works in both their native and Web applications. They should be able to call getUserMedia and have it work across both platforms. This portability is also very important for our ability to test.

API Features

An exact API will be defined below; this is a high-level overview of what the user interaction will look like.

Sending Media

Set supported codecs at PeerConnection Level

A user on startup will declare what codecs they will support.

The user can add/remove from a list of RTCRtpCodecCapability

This allows us to express

All codecs (H264, Opus, VPx)
Attributes of that codec (packetization, profile)
RTCPFeedback (NACK, REMB)
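
As a rough illustration of what such a list could express, here is a minimal sketch. The field names loosely follow the W3C RTCRtpCodecCapability dictionary (mimeType, clockRate, channels, sdpFmtpLine). The RTCPFeedback field and the removeCodec helper are assumptions made up for this example; they are not part of any existing Pion API.

```go
package main

import (
	"fmt"
	"strings"
)

// RTCRtpCodecCapability is a hypothetical Go shape for the proposal.
type RTCRtpCodecCapability struct {
	MimeType     string   // e.g. "video/VP8"
	ClockRate    uint32   // RTP clock rate in Hz
	Channels     uint16   // audio only, 0 for video
	SDPFmtpLine  string   // codec attributes such as profile or packetization
	RTCPFeedback []string // assumed field: feedback types such as "nack"
}

// removeCodec returns a copy of codecs without the entries whose MimeType
// matches mimeType (compared case-insensitively).
func removeCodec(codecs []RTCRtpCodecCapability, mimeType string) []RTCRtpCodecCapability {
	out := make([]RTCRtpCodecCapability, 0, len(codecs))
	for _, c := range codecs {
		if !strings.EqualFold(c.MimeType, mimeType) {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	codecs := []RTCRtpCodecCapability{
		{MimeType: "audio/opus", ClockRate: 48000, Channels: 2},
		{MimeType: "video/VP8", ClockRate: 90000, RTCPFeedback: []string{"nack", "goog-remb"}},
		{
			MimeType:     "video/H264",
			ClockRate:    90000,
			SDPFmtpLine:  "packetization-mode=1;profile-level-id=42e01f",
			RTCPFeedback: []string{"nack"},
		},
	}

	// The user decides at startup that H264 should not be offered.
	codecs = removeCodec(codecs, "video/h264")
	for _, c := range codecs {
		fmt.Println(c.MimeType, c.ClockRate, c.RTCPFeedback)
	}
}
```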

Create a MediaStreamTrack

A user creates a MediaStreamTrack by either calling mediadevices.getUserMedia() or creating a Track via webrtc.NewTrack(kind RTCCodeType, id, label string, func(sender RtpSender, supportedCodecs []RTCRtpCodecCapability) (RTCRtpCodecCapability, error))

Tracks must match MediaStreamTrack, so codec/ssrc will no longer be defined at the Track level.

Add a MediaStreamTrack to the PeerConnection

No change from the current Pion API, peerConnection.AddTrack(track)

On `SetRemoteDescription` a callback is fired on MediaStreamTrack with a RtpSender and supported codecs

Every time a PeerConnection that has added that track has finished signaling a callback is fired. Only then do we know the intersection of codecs. We can't pick H264 (or VPx) until we know the other side supports it.

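func(sender RtpSender, supportedCodecs []RTCRtpCodecCapability) (RTCRtpCodecCapability, error) {
    if len(supportedCodecs) == 0 {
        return RTCRtpCodecCapability{}, fmt.Errorf("no supported codecs")
    }

    // Hand the new sender to the goroutine that writes to every viewer
    fanOutSlice = append(fanOutSlice, sender)

    // One simple choice: accept the first codec in the intersection
    return supportedCodecs[0], nil
}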

The example above shows the typical fan-out case. We get a new RtpSender, and then we add it to a list that another goroutine is looping over and writing to. When one of the RtpSenders returns io.EOF, it is removed from the list. This is possible with the Pion API today, but the new API also solves the problems described in the sections that follow.
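
The writer side of that fan-out could look roughly like the sketch below. The Sender interface, its WriteRTP method, and the fanOut type are hypothetical stand-ins for the proposed RtpSender, chosen only to make the example self-contained. Keeping senders that return errors other than io.EOF is also just one possible policy.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"sync"
)

// Sender is a hypothetical stand-in for the proposed RtpSender.
type Sender interface {
	WriteRTP(payload []byte) error
}

// fanOut holds every sender that the codec callback has registered.
type fanOut struct {
	mu      sync.Mutex
	senders []Sender
}

// Add registers a sender; the callback above would call this.
func (f *fanOut) Add(s Sender) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.senders = append(f.senders, s)
}

// Write sends payload to every sender and drops the ones that report io.EOF.
// It returns the number of senders still registered.
func (f *fanOut) Write(payload []byte) int {
	f.mu.Lock()
	defer f.mu.Unlock()

	kept := f.senders[:0] // filter in place, reusing the backing array
	for _, s := range f.senders {
		if err := s.WriteRTP(payload); errors.Is(err, io.EOF) {
			continue // the PeerConnection behind this sender is gone
		}
		kept = append(kept, s)
	}
	f.senders = kept
	return len(kept)
}

// fakeSender lets the example run without a real PeerConnection.
type fakeSender struct {
	closed bool
	wrote  int
}

func (s *fakeSender) WriteRTP(payload []byte) error {
	if s.closed {
		return io.EOF
	}
	s.wrote++
	return nil
}

func main() {
	f := &fanOut{}
	alive, gone := &fakeSender{}, &fakeSender{closed: true}
	f.Add(alive)
	f.Add(gone)

	remaining := f.Write([]byte{0x80})
	fmt.Println("senders remaining:", remaining)
}
```

Holding the mutex while writing keeps the sketch short. A real implementation would probably copy the slice first so that one slow sender cannot block Add.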

SSRC/PayloadType will be internally managed

Juggling these values makes the API hard to use. Browsers use different PayloadTypes, so this creates a lot of pain for users. It is also hard to debug when an SSRC is wrong.

Codec can be chosen on the fly

You don't know if the remote supports H264/VP9/AV1. You now can pick which codec you prefer out of all the intersections.
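
A minimal sketch of picking a preferred codec out of the intersection follows. The pickCodec helper and the reduced RTCRtpCodecCapability struct are assumptions for illustration only; in the proposal this logic would live inside the callback passed to NewTrack.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// RTCRtpCodecCapability is reduced to the one field this sketch needs.
type RTCRtpCodecCapability struct {
	MimeType string
}

var errNoSupportedCodec = errors.New("no supported codecs")

// pickCodec returns the first codec from preferred (highest priority first)
// that also appears in supported, the intersection found during signaling.
func pickCodec(preferred []string, supported []RTCRtpCodecCapability) (RTCRtpCodecCapability, error) {
	for _, want := range preferred {
		for _, c := range supported {
			if strings.EqualFold(c.MimeType, want) {
				return c, nil
			}
		}
	}
	return RTCRtpCodecCapability{}, errNoSupportedCodec
}

func main() {
	// What the remote turned out to support.
	supported := []RTCRtpCodecCapability{
		{MimeType: "video/VP8"},
		{MimeType: "video/H264"},
	}
	// What we would like to send, best first.
	preferred := []string{"video/AV1", "video/VP9", "video/H264", "video/VP8"}

	codec, err := pickCodec(preferred, supported)
	if err != nil {
		fmt.Println("cannot send:", err)
		return
	}
	fmt.Println("sending", codec.MimeType)
}
```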

RTP and RTCP must be tightly coupled

The current API doesn't allow us to implement congestion control or error correction easily. By instead giving the user direct access to the RTPSender they have the hooks they need.

WriteSample should take time.Duration instead of (samples uint32)

The user shouldn't need to do the math. Internally we should convert the duration to a sample count using the codec's clock rate and pass that to pion/rtp.

Handling Jitter, Loss and Congestion

SettingEngine allows a user to pass their own JitterBuffer and CongestionController

We will provide a sensible default, but both will be interfaces that a user just has to satisfy. The details are out of the scope of this document; the only thing we need to ensure is that this is possible without an API break.

A user can then interact with the JitterBuffer/CongestionController as they wish, for example to mutate it at runtime or modify values. This will allow them to choose how much loss they are willing to tolerate, and so on. This will also be helpful for building an SFU. You can have a CongestionController where the upper bound is set to the lowest estimate of all receivers. The REMB is then constructed and sent back to the receiver.
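
For the SFU case, the "lowest of all receivers" bookkeeping might be sketched as follows. The minBitrateController type and its methods are invented for this example; the actual CongestionController interface is not defined in this document.

```go
package main

import (
	"fmt"
	"sync"
)

// minBitrateController remembers the latest bitrate estimate per receiver
// and exposes the lowest one as the upper bound for the shared stream.
type minBitrateController struct {
	mu        sync.Mutex
	estimates map[string]uint64 // receiver ID -> bits per second
}

func newMinBitrateController() *minBitrateController {
	return &minBitrateController{estimates: map[string]uint64{}}
}

// Update records the newest estimate reported for a receiver.
func (c *minBitrateController) Update(receiverID string, bps uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.estimates[receiverID] = bps
}

// Remove forgets a receiver that has left.
func (c *minBitrateController) Remove(receiverID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.estimates, receiverID)
}

// UpperBound returns the lowest known estimate, or false when no receiver
// has reported yet.
func (c *minBitrateController) UpperBound() (uint64, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	var lowest uint64
	found := false
	for _, bps := range c.estimates {
		if !found || bps < lowest {
			lowest, found = bps, true
		}
	}
	return lowest, found
}

func main() {
	cc := newMinBitrateController()
	cc.Update("viewer-a", 2_000_000)
	cc.Update("viewer-b", 500_000)

	if bound, ok := cc.UpperBound(); ok {
		// This is the value an SFU would put into the REMB it sends upstream.
		fmt.Println("REMB bitrate:", bound)
	}
}
```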

RTPSender will have callbacks for RTCP Feedback results

We will put two callbacks on the RTPSender, and the user can ignore them if they wish. These aren't portable, but I think putting them in the SettingEngine is the wrong thing to do.

RtpSender.OnBitrateSuggestion(func(bitrate float64) {
})

RtpSender.OnKeyframeRequest(func() {
})
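
To show how the two callbacks could be wired up and safely ignored, here is a small sketch. The rtpSender type and the handleREMB/handlePLI hooks are hypothetical; they only model the idea that incoming RTCP feedback ends up invoking whatever the user registered.

```go
package main

import (
	"fmt"
	"sync"
)

// rtpSender is a hypothetical stand-in for the proposed RtpSender.
type rtpSender struct {
	mu         sync.RWMutex
	onBitrate  func(bitrate float64)
	onKeyframe func()
}

// OnBitrateSuggestion registers the handler for bandwidth feedback.
func (s *rtpSender) OnBitrateSuggestion(f func(bitrate float64)) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.onBitrate = f
}

// OnKeyframeRequest registers the handler for PLI/FIR style feedback.
func (s *rtpSender) OnKeyframeRequest(f func()) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.onKeyframe = f
}

// handleREMB is an assumed internal hook, called when a bitrate estimate arrives.
func (s *rtpSender) handleREMB(bitrate float64) {
	s.mu.RLock()
	f := s.onBitrate
	s.mu.RUnlock()
	if f != nil { // users may ignore the callback entirely
		f(bitrate)
	}
}

// handlePLI is an assumed internal hook, called when the remote asks for a keyframe.
func (s *rtpSender) handlePLI() {
	s.mu.RLock()
	f := s.onKeyframe
	s.mu.RUnlock()
	if f != nil {
		f()
	}
}

func main() {
	sender := &rtpSender{}

	sender.OnBitrateSuggestion(func(bitrate float64) {
		fmt.Println("retarget encoder to", bitrate, "bps")
	})
	sender.OnKeyframeRequest(func() {
		fmt.Println("force next frame to be a keyframe")
	})

	// Simulate feedback arriving from the remote peer.
	sender.handleREMB(750000)
	sender.handlePLI()
}
```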

API In Action

Webcam capture that works in WASM and Go mode

This will capture a video device and will work in WASM or Go mode. When running in WASM mode the VP8 selection has no impact, because the browser chooses the codec. If the WebRTC API allows codec selection in the future, we will support it.

func main() {
	// We only want to send VP8
	s := webrtc.SettingEngine{
		Codecs: []webrtc.RTCRtpCodecCapability{
			webrtc.RTCRtpCodecCapabilityDefaultVP8,
		},
	}
	api := webrtc.NewAPI(webrtc.WithSettingEngine(s))

	peerConnection, err := api.NewPeerConnection(webrtc.Configuration{})
	if err != nil {
		panic(err)
	}

	// MediaStreamConstraints is a placeholder name borrowed from the W3C spec
	track, err := mediaDevices.GetUserMedia(MediaStreamConstraints{Video: true})
	if err != nil {
		panic(err)
	}

	peerConnection.AddTrack(track)
}

We should allow users to provide their own encoders (@lherman-cs)

I think we should allow users to encode their own video/audio, because the tracks that we receive from GetUserMedia should still be in raw format (we need to be able to transform the video/audio). The following shows the data flow starting from GetUserMedia and ending at the other peer.

Reference: https://w3c.github.io/mediacapture-main/#the-model-sources-sinks-constraints-and-settings

This diagram shows that the data from the source can be broadcast and transformed. Allowing users to encode their own video/audio also gives them some extra benefits:

Fan-out video to many PeerConnection
Use the source for other outputs, e.g. simply stream mjpeg through HTTP server
Transform the source, the change will be reflected to all of the listeners
Each listener has the option to transform the source without affecting other listeners

So, I propose that we have a functional option that allows users to supply their own encoders.

type LocalTrack interface {
  ReadRTP() (*rtp.Packet, error)
  
  // The following methods allow PeerConnection to use RTCP Feedback to automatically control the input

  // SetBitRate sets current target bitrate, lower bitrate means smaller data will be transmitted
  // but this also means that the quality will also be lower.
  SetBitRate(int) error
  // ForceKeyFrame forces the next frame to be a keyframe, aka intra-frame.
  ForceKeyFrame() error
}

type EncoderBuilder interface {
  Codec() webrtc.RTPCodec
  // Notice that this signature is opaque. This allows pion/webrtc to stay Pure Go.
  // The idea is to not require the main pion/webrtc package to know the input format from the track, 
  // it only needs to care how to handle the encoded version. This way, we let the users decide
  // whatever format they wish, which leads to a flexible design. But, since it is opaque, 
  // it'll be more error-prone and feels more "magical".
  BuildEncoder(Track) (LocalTrack, error)
}

type SettingEngine struct{
  // internal stuff
}

func (engine *SettingEngine) WithEncoders(encoders ...EncoderBuilder) {}


func (pc *PeerConnection) AddTrack(track Track) {
  // step 1: find common supported codec builders from SettingEngine
  // note 1.1: if there are multiple codecs as the result, try to build them in sequential order;
  //           if one fails, use the next one. This is useful if we have 2 or more codec implementations. It allows users
  //           to prioritize some encoders, e.g. hardware accelerated codecs (it's common for these to fail since the device
  //           might not have hardware support).

  // step 2: create a local track using the encoder builder

  // step 3: create a new RTPSender

  // step 4: replace the RTPSender's local track from step 3 with the local track from step 2
}

This design is actually similar to what Chromium does, https://chromium.googlesource.com/external/webrtc/+/refs/heads/master/media/engine/webrtc_media_engine.h. They have a MediaEngine with an API to set encoder builders, so that a PeerConnection can later build encoders on the fly.
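
The fallback order from note 1.1 in the AddTrack outline could be sketched like this. Encoder and EncoderBuilder here are deliberately simplified stand-ins for the LocalTrack and EncoderBuilder interfaces proposed above (MimeType and Build are assumed method names), and fakeBuilder only exists so the example runs.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Encoder and EncoderBuilder are simplified, hypothetical interfaces.
type Encoder interface{ Name() string }

type EncoderBuilder interface {
	MimeType() string
	Build() (Encoder, error)
}

var errNoCommonCodec = errors.New("no encoder matches the remote codecs")

// buildFirstEncoder walks builders in priority order and returns the first
// encoder that both matches a remote codec and builds successfully.
func buildFirstEncoder(builders []EncoderBuilder, remote []string) (Encoder, error) {
	var lastErr error
	for _, b := range builders {
		if !containsFold(remote, b.MimeType()) {
			continue
		}
		enc, err := b.Build()
		if err != nil {
			lastErr = err // e.g. no hardware support, try the next builder
			continue
		}
		return enc, nil
	}
	if lastErr != nil {
		return nil, fmt.Errorf("all matching encoders failed: %w", lastErr)
	}
	return nil, errNoCommonCodec
}

func containsFold(list []string, s string) bool {
	for _, item := range list {
		if strings.EqualFold(item, s) {
			return true
		}
	}
	return false
}

// fakeBuilder and fakeEncoder let the sketch run without real codecs.
type fakeBuilder struct {
	mime, name string
	fail       bool
}

type fakeEncoder string

func (e fakeEncoder) Name() string { return string(e) }

func (b fakeBuilder) MimeType() string { return b.mime }

func (b fakeBuilder) Build() (Encoder, error) {
	if b.fail {
		return nil, errors.New(b.name + ": device unavailable")
	}
	return fakeEncoder(b.name), nil
}

func main() {
	builders := []EncoderBuilder{
		fakeBuilder{mime: "video/H264", name: "h264-hardware", fail: true},
		fakeBuilder{mime: "video/H264", name: "h264-software"},
		fakeBuilder{mime: "video/VP8", name: "vp8-software"},
	}
	enc, err := buildFirstEncoder(builders, []string{"video/H264", "video/VP8"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("using", enc.Name())
}
```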

Note: I've created a couple of POCs in mediadevices:

Non-WebRTC: https://github.com/pion/mediadevices/blob/redesign/examples/simple/main.go
- Broadcast your camera stream through MJPEG server
WebRTC: https://github.com/pion/mediadevices/blob/redesign/examples/webrtc/main.go
- Classic 1:1 WebRTC example using jsfiddle

Maybe consider how this ties into a broader (Go) media pipeline? Over time you could build out building blocks like enabling Picture-in-Picture, etc. -- Backkem

Fan-out video from one PeerConnection to many

Distributing pre-recorded content

TODO/Questions

How do we accomplish SVC?
How do we accomplish Simulcast?

@sgotti Proposal

My view is that pion/webrtc should provide a fully compatible webrtc api with additional enhancements to cover all the use cases. To do this we should introduce MediaTrack as a container of raw data and make the RTPReceiver/RTPSender read/write to MediaTrack and decode/encode rtp streams.

The main issue is that the standard webrtc api won't let us satisfy all the possible kinds of use cases:

What we can do with the standard webrtc api:

Create a full go webrtc client
Create a simple MCU

What we cannot do with the standard webrtc api:

Create an SFU
Basically all the examples in pion/webrtc

That's because in the above use cases a pion/webrtc user needs to read/write the rtp/rtcp streams and manipulate the packets.

An SFU needs to read the rtp packets from the receivers' streams, change their header sequence/ssrc/timestamp and also some vp8/vp9/av1 packet headers (e.g. vp9 pictureID and TL0PICIDX). It also needs to choose which simulcast streams to send or which vp8/vp9/av1 (svc) layers to send.
It also needs to read rtcp packets to implement its own nack handling (requires a custom buffer of last sent packets), jitter buffer, congestion control algorithms etc...
The swap-tracks example has the same needs as the sfu.
The twitch stream example has the same needs, we don't want to re-encode the stream already decoded by the receiver (this will use much more cpu) but use the incoming rtp stream.
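
To make the header manipulation concrete, here is a rough sketch of the sequence/ssrc/timestamp rewriting an SFU performs when it switches the source feeding one outgoing stream. The header struct is a local stand-in (a real implementation would operate on pion/rtp packets), the rewriter type is invented for this example, and the fixed tsGap used on a switch is a simplification. Reordered packets and codec-specific fields such as pictureID are not handled.

```go
package main

import "fmt"

// header holds only the RTP fields this sketch rewrites.
type header struct {
	SequenceNumber uint16
	Timestamp      uint32
	SSRC           uint32
}

// rewriter maps packets from changing sources onto one continuous
// outgoing stream with a fixed SSRC.
type rewriter struct {
	outSSRC uint32
	tsGap   uint32 // assumed timestamp jump inserted when the source changes

	started   bool
	curSSRC   uint32
	seqOffset uint16
	tsOffset  uint32
	lastSeq   uint16
	lastTS    uint32
}

// Rewrite returns the outgoing header for an incoming one.
// Unsigned arithmetic wraps around, which matches RTP sequence and
// timestamp rollover.
func (r *rewriter) Rewrite(in header) header {
	if !r.started || in.SSRC != r.curSSRC {
		if r.started {
			// Source switch: continue right after the last packet we sent.
			r.seqOffset = r.lastSeq + 1 - in.SequenceNumber
			r.tsOffset = r.lastTS + r.tsGap - in.Timestamp
		}
		r.curSSRC = in.SSRC
		r.started = true
	}
	out := header{
		SequenceNumber: in.SequenceNumber + r.seqOffset,
		Timestamp:      in.Timestamp + r.tsOffset,
		SSRC:           r.outSSRC,
	}
	r.lastSeq, r.lastTS = out.SequenceNumber, out.Timestamp
	return out
}

func main() {
	r := &rewriter{outSSRC: 0xCAFE, tsGap: 3000}

	// Two packets from the first source, then a switch to another one.
	fmt.Println(r.Rewrite(header{SequenceNumber: 100, Timestamp: 1000, SSRC: 1}))
	fmt.Println(r.Rewrite(header{SequenceNumber: 101, Timestamp: 4000, SSRC: 1}))
	fmt.Println(r.Rewrite(header{SequenceNumber: 5000, Timestamp: 90000, SSRC: 2}))
}
```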

(as a consideration, the current pion/webrtc v2 api permits the creation of what the standard webrtc api cannot do but doesn't permit what webrtc api can do...)

How can we achieve all the use cases?

Make two kinds of rtpreceiver, rtpsender and track: a webrtc spec mode and a raw mode receiver/sender/track.

"webrtc spec mode" Receiver/Sender:

Receiver

The Receiver will fully terminate the rtp/rtcp streams. It'll contain a decoder and a controller to handle incoming rtcp packets (sdes etc...) and send rtcp packets (nack, pli, fir, remb etc...). A receiver could receive more than one rtp/rtcp stream when using simulcast or repair streams. SVC instead is just one rtp stream, and the decoder should choose which temporal/spatial/quality layers to decode (all or only some of them, also based on cpu constraints).

Sender

The Sender will fully handle (create) the rtp/rtcp streams. It'll contain an encoder and a controller to handle incoming rtcp packets (nack, pli, fir, remb etc...). A sender could send more than one rtp stream when using simulcast or repair streams. SVC instead is just one rtp stream, and the encoder should choose which temporal/spatial/quality layers to encode (all or some of them, also based on cpu constraints).

The rtpreceiver/rtpsender decoders/encoders could also provide feedback to an external controller to handle global decisions (like congestion control, global bandwidth estimation) and can be externally tuned based on these or other decisions (vp9/av1 svc layers setup, changing the current encoding bandwidth, changing the current decoding simulcast stream or svc layers etc...). For example, we have limited upload bandwidth, and we want to split it between N senders.
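
As a toy version of that last example, an external controller could divide the available upload bandwidth between senders in proportion to a per-sender weight. The splitBandwidth function and the idea of weights are assumptions for illustration; the proposal does not define how the split is decided.

```go
package main

import "fmt"

// splitBandwidth divides totalBps between senders in proportion to their
// weights. Senders with weight 0 get nothing. Integer division may leave
// a few bits per second unassigned.
func splitBandwidth(totalBps uint64, weights []uint64) []uint64 {
	shares := make([]uint64, len(weights))
	var sum uint64
	for _, w := range weights {
		sum += w
	}
	if sum == 0 {
		return shares
	}
	for i, w := range weights {
		shares[i] = totalBps * w / sum
	}
	return shares
}

func main() {
	// 3 Mbps of upload: one screen share (weight 2) and two webcams (weight 1 each).
	shares := splitBandwidth(3_000_000, []uint64{2, 1, 1})
	for i, bps := range shares {
		fmt.Printf("sender %d target bitrate: %d bps\n", i, bps)
	}
}
```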

MediaTrack

MediaTrack api will be the one defined in the webrtc api with additional methods to Read/Write raw data (frames, audio etc...).

"raw mode" Receiver/Sender:

RawReceiver and RawSender

The rtpreceiver/rtpsender will work like the current pion/webrtc v2. They will directly provide their rtp/rtcp streams, which can be read/written by the user (needed for sfu etc...)

RawTrack

RawTrack will have properties containing the ID/Label and every track rtp stream will also have properties like ssrc, rid. These will be populated by the receiver when receiving or will be used by the sender in raw mode to choose how to negotiate (standard, simulcast, use repair streams). It'll have a list of RTPStreams (see simulcast PR #1200).

For reading/writing instead of putting methods on RawTrack/RawRTPStream user could just use receiver/sender ReadRTP/ReadRTCP/WriteRTP/WriteRTCP methods.

In raw mode we should choose if the sender will manipulate some rtp packet or not (who will set the right ssrc or mid and streamID header extensions in outgoing packets? The user or the Sender?).

Choosing the Receiver/Sender mode:

The webrtc standard has apis to define the sender behavior but not the receiver behavior: the AddTransceiver init options. E.g. for simulcast/svc the init.sendEncodings options are the standard way to define how the sender should encode.

These are not enough for our needs since we also have to choose the receiver behavior. For example during negotiation a receiver could be automatically created and we would like to hook into it to set its mode.

I'll add an option to the SettingEngine to define a per PeerConnection default behavior of senders/receivers.

If we want to have fine grained behavior selection we should instead manually setup Transceivers adding the preferred Track type:

If the "track" is a MediaTrack then they'll work in webrtc default mode, if it's a RawTrack they'll operate in raw mode.
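
Selecting the mode from the track type could be as simple as a type switch. The Track interface, the reduced MediaTrack/RawTrack structs, and the Mode constants below are all hypothetical names used only to illustrate the idea.

```go
package main

import (
	"errors"
	"fmt"
)

// Track is the minimal common interface assumed for both track kinds.
type Track interface {
	ID() string
}

// MediaTrack carries raw media and selects "webrtc spec mode".
type MediaTrack struct{ id string }

func (t *MediaTrack) ID() string { return t.id }

// RawTrack exposes rtp level properties and selects "raw mode".
type RawTrack struct {
	id   string
	SSRC uint32
	RID  string
}

func (t *RawTrack) ID() string { return t.id }

// Mode describes how a receiver/sender pair behaves.
type Mode int

const (
	ModeSpec Mode = iota // encoder/decoder and rtcp handled internally
	ModeRaw              // user reads/writes rtp and rtcp directly
)

var errUnknownTrack = errors.New("unknown track type")

// modeFor picks the sender/receiver mode from the concrete track type.
func modeFor(t Track) (Mode, error) {
	switch t.(type) {
	case *MediaTrack:
		return ModeSpec, nil
	case *RawTrack:
		return ModeRaw, nil
	default:
		return 0, errUnknownTrack
	}
}

func main() {
	tracks := []Track{
		&MediaTrack{id: "webcam"},
		&RawTrack{id: "relay", SSRC: 1234, RID: "h"},
	}
	for _, t := range tracks {
		mode, err := modeFor(t)
		if err != nil {
			fmt.Println(t.ID(), "error:", err)
			continue
		}
		fmt.Println(t.ID(), "raw mode:", mode == ModeRaw)
	}
}
```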

Examples

SFU

An SFU will use "raw mode". When creating a transceiver, pass a RawTrack to AddTrack/AddTransceiverFromTrack. The receiver/sender will work in "raw mode". The RawTrack properties will be used for negotiation (standard or simulcast, use repair streams). When a receiver has been negotiated, the OnTrack method will provide a RawTrack (this requires type casting...) and the receiver.

WebRTC client

When creating a transceiver, pass a MediaTrack to AddTrack/AddTransceiverFromTrack. The receiver/sender will work in "webrtc spec mode". When a receiver has been negotiated, the OnTrack method will provide a MediaTrack (this requires type casting...).

To set up the sender behavior, use the standard AddTransceiver init options for simulcast/svc, e.g. the init.sendEncodings options (or add a custom sender method?).

@adwpc

I think there are two steps:

1 review and rewrite pion/webrtc dependent library: rtp rtcp etc..
2 rewrite pion/webrtc to v3

Prefer pure Go

We should make cgo pluggable or configurable if we have to use it.
This can make an sfu or other app stable and high performance.
