Hacking VoIP: Decrypting SDES Protected SRTP Phone Calls

June 22, 2014


VoIP security is a fairly complex topic, rife with acronyms, competing solutions, and enough implementation challenges to make any administrator pull their hair out. Session Description Protocol Security Descriptions (SDES) provide one method for exchanging the keys that are used to encrypt RTP media. Essentially, SDES allows for key exchange within the SDP portion of a SIP packet. Remember that SDP provides parameters, such as media encoding, for a connection. Also remember that SIP is usually unencrypted by default. Therefore, SIP needs to run over TLS so that the key is transported securely within SDP.

Indeed, RFC 4568 for SDES specifically mentions this reality:

“It would be self-defeating not to secure cryptographic keys and other parameters at least as well as the data are secured.”

“AKE [Authenticated Key Establishment] is needed because it is pointless to provide a key over a medium where an attacker can snoop the key, alter the definition of the key to render it useless, or change the parameters of the security session to gain unauthorized access to session-related information.”

SDES does not actually provide any methods to utilize key management or agreement protocols within SDP. That is the place of RFC 4567, which provides additional SDP fields for exchanging key management protocol information. SDES just allows for the master key to be exchanged within a new SDP field. If the SIP signaling protocol isn’t transported over a secure medium (such as TLS), then decrypting the “secure” RTP is trivial once the encryption key is obtained from the plaintext SIP exchange. The process is actually fairly simple:

  1. Obtain a complete call, including SIP exchange and RTP data, between two endpoints
  2. Grab the key and filter out a single SRTP stream in Wireshark
  3. Use srtp-decrypt to decrypt the SRTP
  4. Replay the decrypted RTP data in Wireshark

Confused yet? The technology surrounding SDES, SRTP, and key exchange seems like black magic. It’s a complete mess of RFCs that, truth be told, feel very half-baked. The SDES RFC more or less says “You should use a key management protocol. But this RFC won’t actually help you with that. Actually, it will help you do the exact opposite: send a key without using a key management protocol. But you should still be using a key management protocol.” This is extremely frustrating and very confusing. Hopefully, things will start to make sense once we begin to look at the packets involved.

Step 1: Obtain a call for analysis

To complete this proof of concept, we need to capture the initial SIP handshake and some SRTP data. To do this, we need two SIP endpoints that don’t really care about that quoted passage from RFC 4568 and are quite happy to send plaintext keys across insecure SIP. (Un)Luckily, Linphone doesn’t seem to be very concerned about the safety of SRTP keys. When configured to use SRTP, Linphone is quite happy to send the keys in plaintext within SDP. It won’t pay any mind (or throw any sort of warning) that the master keys are transported across an insecure medium.

The easiest way to obtain the necessary packets is to simply set up two PCs with Linphone, configure Linphone to do SRTP (you do NOT want ZRTP for this tutorial), and make a call between the two Linphone clients. You’ll also want to make sure that the two Linphone endpoints are using G.711 (PCMU or PCMA). Otherwise, you won’t be able to decode the audio in Wireshark. Prior to initiating the call, you’ll want to fire up Wireshark to capture the necessary packets.

Configuring Linphone and capturing packets is beyond the scope of this article. However, if you’re having trouble or just don’t feel like building a topology for this exercise, I have a capture that you can download here and use for the remainder of this tutorial. You may need to change the file extension to .pcapng, depending on your OS. I didn’t feel like messing with WordPress settings to upload .pcapng files, so I just changed the extension to .txt.

Step 2: Obtain the keys and an RTP stream for decryption

The keys used for encrypting the RTP stream can be found in the SDP portion of a SIP packet. The keys for the calling party can be found in the SIP INVITE message, and the keys for the called party can be found in the SIP 200 OK message. It’s helpful to first sort by SIP in Wireshark, as seen below.

In this example, the calling party is 192.168.0.3, and the called party is 192.168.0.4. For this tutorial, we are only going to decrypt one side of the conversation, namely that of the called party (192.168.0.4). To obtain the key for the called party, we can simply expand the SDP portion of the 200 OK message, which is frame number 4. Below, we can see that the AES_CM_128_HMAC_SHA1_80 crypto suite is used and the key is in plain view, with a value of “ZuaD5fPsk1w3K4+x1aRzivUoL2MbKurXG20xz2mr”. More information about the crypto suites supported by SDES can be found in Section 6.2 of RFC 4568. You can easily obtain the key value in Wireshark by right-clicking on the field and selecting Copy > Value. Then just paste it into a text editor and remove any extraneous information so that you have isolated the key, as the copy function will also include the field name (crypto) and the crypto suite.

You probably noticed that there are two keys specified within SDP. Notice that one is specified for audio and one is for video. Since I only used audio in this example, we only need the audio key. However, video could theoretically be decrypted using the same process.
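If you would rather not trim the copied value by hand, the same extraction can be scripted. The following is a minimal Python sketch, not part of the original workflow: it pulls the a=crypto attribute out of an SDP body and base64-decodes the inline value. The SDP text is a made-up fragment (the port and the layout are assumptions); only the crypto suite and the key come from the capture above. For the AES_CM_128 suites, the decoded 30 bytes are the 16-byte master key followed by the 14-byte master salt.

```python
import base64
import re

# a=crypto:<tag> <crypto-suite> inline:<base64 key||salt>[|lifetime][|MKI:length]
CRYPTO_RE = re.compile(r"^a=crypto:(\d+)\s+(\S+)\s+inline:([A-Za-z0-9+/=]+)")


def parse_crypto_lines(sdp_text):
    """Return one dict per a=crypto line, tagged with the media section it belongs to."""
    results = []
    media = None
    for line in sdp_text.splitlines():
        line = line.strip()
        if line.startswith("m="):
            # "m=audio 7078 RTP/SAVP 0" -> "audio"
            media = line[2:].split()[0]
            continue
        match = CRYPTO_RE.match(line)
        if match:
            tag, suite, inline_b64 = match.groups()
            raw = base64.b64decode(inline_b64)
            results.append({
                "media": media,
                "tag": int(tag),
                "suite": suite,
                "inline": inline_b64,
                # Split assumes an AES_CM_128_* suite: 128-bit key, 112-bit salt.
                "master_key": raw[:16],
                "master_salt": raw[16:30],
            })
    return results


# Hypothetical SDP fragment; only the suite and key are taken from the article.
EXAMPLE_SDP = """\
v=0
m=audio 7078 RTP/SAVP 0
a=crypto:1 AES_CM_128_HMAC_SHA1_80 inline:ZuaD5fPsk1w3K4+x1aRzivUoL2MbKurXG20xz2mr
"""

if __name__ == "__main__":
    for entry in parse_crypto_lines(EXAMPLE_SDP):
        print(entry["media"], entry["suite"], entry["inline"])
```

The "inline" value is what gets handed to srtp-decrypt later on; the key and salt split is only there to show what the base64 blob actually contains.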

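Now that we have the key necessary for decryption, we need to isolate a single RTP stream. Since this is a simple example, with only one RTP stream being sent by each side (just one audio stream), we can do this with a simple Wireshark filter based on IP and port, as seen below. A filter along the lines of ip.src == 192.168.0.4 && udp.srcport == <audio port>, with the port taken from the audio stream in your own capture, does the job.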

Note that I have used port number, instead of something like “ip.src == 192.168.0.4 && rtp”. This is because Linphone will attempt to create two RTP streams: one for audio and one for video. They are easy to distinguish, as the Payload Type (PT) for the audio will be G.711, assuming that you correctly configured Linphone to use G.711 instead of one of the other default codecs. I included the port number for the G.711 stream so that it would be isolated from the video stream.
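The Payload Type is visible even on the encrypted stream because SRTP only encrypts the RTP payload and leaves the 12-byte RTP header in the clear. As a rough illustration of what Wireshark is reading, here is a small Python sketch that parses the fixed RTP header from the UDP payload of a packet. The function names are my own, and the lookup table only covers the two static G.711 payload types used in this tutorial.

```python
import struct

# Static payload types relevant here (RFC 3551): 0 = PCMU, 8 = PCMA.
G711_PAYLOAD_TYPES = {0: "PCMU (G.711 mu-law)", 8: "PCMA (G.711 A-law)"}


def parse_rtp_header(udp_payload):
    """Parse the 12-byte fixed RTP header found at the start of a UDP payload."""
    if len(udp_payload) < 12:
        raise ValueError("too short to be an RTP packet")
    first, second, sequence, timestamp, ssrc = struct.unpack("!BBHII", udp_payload[:12])
    return {
        "version": first >> 6,            # should be 2 for RTP
        "marker": bool(second & 0x80),
        "payload_type": second & 0x7F,
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


def payload_name(payload_type):
    """Name the codec for the payload types this tutorial cares about."""
    return G711_PAYLOAD_TYPES.get(payload_type, "other/dynamic")


if __name__ == "__main__":
    # Hand-built header: version 2, PT 0, sequence 1, timestamp 160.
    sample = bytes([0x80, 0x00, 0x00, 0x01]) + (160).to_bytes(4, "big") + (0xDEADBEEF).to_bytes(4, "big")
    header = parse_rtp_header(sample)
    print(header["payload_type"], payload_name(header["payload_type"]))
```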

The individual stream can be saved by navigating to File > Export Specified Packets. Ensure that only the Displayed packets are exported, and change the file type to “.pcap”, instead of “.pcapng”. As far as I can tell, srtp-decrypt only supports .pcap files. Your save dialog should look something like the screenshot below.

Once the capture has been saved, you should be ready to begin decrypting the RTP stream.
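Since the file format matters here, it can be worth confirming that the export really is a classic .pcap and not a .pcapng with the wrong extension. A quick way to do that, sketched in Python below, is to look at the first four bytes of the file: pcapng files begin with the Section Header Block type 0x0A0D0D0A, while classic pcap files begin with the magic number 0xA1B2C3D4 (in either byte order, or 0xA1B23C4D for the nanosecond variant). The function name and the example path are just placeholders.

```python
PCAPNG_MAGIC = b"\x0a\x0d\x0d\x0a"
PCAP_MAGICS = (
    b"\xa1\xb2\xc3\xd4",  # microsecond timestamps, big-endian
    b"\xd4\xc3\xb2\xa1",  # microsecond timestamps, little-endian
    b"\xa1\xb2\x3c\x4d",  # nanosecond timestamps, big-endian
    b"\x4d\x3c\xb2\xa1",  # nanosecond timestamps, little-endian
)


def capture_format(first_bytes):
    """Classify a capture file from its first four bytes."""
    magic = first_bytes[:4]
    if magic == PCAPNG_MAGIC:
        return "pcapng"
    if magic in PCAP_MAGICS:
        return "pcap"
    return "unknown"


def capture_format_of_file(path):
    """Read just enough of the file to classify it."""
    with open(path, "rb") as handle:
        return capture_format(handle.read(4))


# Example (path is a placeholder):
#   capture_format_of_file("/home/user/Desktop/singleStream.pcap")
```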

Step 3: Decrypt the RTP stream with srtp-decrypt

To perform decryption with the srtp-decrypt tool, I created an Ubuntu virtual machine and installed srtp-decrypt along with the dependencies (libpcap-dev and libgcrypt-dev) listed on the project’s GitHub page. To install, download or clone the project from GitHub, install the dependencies (sudo apt-get install libpcap-dev libgcrypt-dev), and then run “make” to compile the source. Note that building the program doesn’t add any sort of link into /bin or any other standard path location, so you will have to execute the program by using the full path.

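Decrypting the stream is fairly simple (note: this should all be on one line):

./srtp-decrypt -k ZuaD5fPsk1w3K4+x1aRzivUoL2MbKurXG20xz2mr < /home/user/Desktop/singleStream.pcap > /home/user/Desktop/decryptedCall.txt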

Let’s take a closer look at the command syntax:

  • ./srtp-decrypt – This invokes srtp-decrypt. Clearly, I’m calling the program from within the installation directory.
  • -k ZuaD5fPsk1w3K4+x1aRzivUoL2MbKurXG20xz2mr – This specifies the key used for decryption. This is the key that was obtained from the SIP exchange in Step 2.
  • < /home/user/Desktop/singleStream.pcap – This feeds the single RTP stream from Step 2 into srtp-decrypt. Obviously, the path will vary based on the name and location that you chose for the single stream.
  • > /home/user/Desktop/decryptedCall.txt – The script prints the decrypted RTP to stdout. Therefore, we can just redirect this into a file. It’s important to note that srtp-decrypt simply provides the application layer RTP data, and the lower layer headers are all lost. We’ll be addressing this problem in the next part of the tutorial.

Once the program has finished running, we should have some nice hex RTP data in the decryptedCall.txt file. You may encounter a few errors that say “frame x dropped: decoding failed ‘Permission denied’”. These errors occur when srtp-decrypt is unable to decrypt an RTP packet with the provided key. You shouldn’t worry about a few errors, as they can be caused by mangled or invalid RTP packets. However, if srtp-decrypt fails to decrypt all of the packets in the .pcap, then you have probably supplied the incorrect key. If this happens, go back and ensure that you have selected the correct key from the appropriate SIP packet.
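If you end up decrypting more than one stream, the shell redirection can be wrapped in a few lines of Python that also count the dropped frames for you. This is only a sketch: the wrapper function is my own, and it assumes that the “dropped” messages are written to stderr rather than stdout, which I have not verified against the srtp-decrypt source. The only things taken from the tool itself are the -k flag and the fact that it reads the .pcap on stdin and writes the hex dump to stdout.

```python
import subprocess


def run_srtp_decrypt(binary_argv, key, pcap_in, hex_out):
    """Run `<binary> -k <key> < pcap_in > hex_out` and return the dropped-frame count.

    binary_argv is a list such as ["./srtp-decrypt"]; passing a list keeps the
    key out of any shell parsing.
    """
    argv = list(binary_argv) + ["-k", key]
    with open(pcap_in, "rb") as source, open(hex_out, "wb") as sink:
        proc = subprocess.run(argv, stdin=source, stdout=sink,
                              stderr=subprocess.PIPE, check=False)
    # Assumption: one "frame N dropped: ..." line per failed packet on stderr.
    messages = proc.stderr.decode(errors="replace").splitlines()
    return sum(1 for line in messages if "dropped" in line)


# Example (paths as used in this tutorial):
#   dropped = run_srtp_decrypt(
#       ["./srtp-decrypt"],
#       "ZuaD5fPsk1w3K4+x1aRzivUoL2MbKurXG20xz2mr",
#       "/home/user/Desktop/singleStream.pcap",
#       "/home/user/Desktop/decryptedCall.txt",
#   )
```

A handful of dropped frames is nothing to worry about; a count that matches the number of packets in the capture points to the wrong key.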

Step 4: Audio playback with Wireshark

I mentioned earlier that srtp-decrypt only provides the application data (RTP), and you lose the lower layers (UDP, IP, MAC). For this reason, you can’t just open the output file with Wireshark and expect instant playback. Luckily, Wireshark has a nifty feature that allows you to import data from a hex dump and add dummy headers to recreate a capture. The process is straightforward:

  1. Navigate to File > Import from Hex Dump
  2. Select the output file that was created in Step 3
  3. Select “Dummy Header” and “UDP,” since RTP uses UDP at the transport layer.
  4. Input two random ports (I chose 10000 and 20000). You could use the ports from the original stream, but it doesn’t really matter.

Your final selection should look something like the screenshot below.

 

Once the hex dump has been imported, you’ll probably notice that Wireshark specifies the protocol as “UDP,” and not as “RTP” like we would expect. We need to tell Wireshark to decode the packets as RTP. Simply navigate to Analyze > Decode As…, select “RTP” from the list, and hit “Apply.” You should now see the RTP stream.
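To make the dummy-header import a little less magical, here is a rough Python sketch of what it amounts to: each decrypted RTP packet gets a made-up Ethernet, IPv4, and UDP header stuck in front of it, and the result is written out as a classic pcap file. This is not how Wireshark implements the feature, and it works on raw RTP bytes instead of a hex dump; the MAC addresses, the IP addresses, and the 20 ms packet spacing are all invented for illustration. Only the ports match the ones I chose above. It also shows why Wireshark reports plain UDP: nothing in these headers says that the payload is RTP.

```python
import struct


def ipv4_checksum(header):
    """Standard one's-complement sum over 16-bit words of the IPv4 header."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def wrap_rtp(rtp, src_port=10000, dst_port=20000):
    """Prepend dummy Ethernet/IPv4/UDP headers to one RTP packet."""
    udp_len = 8 + len(rtp)
    udp = struct.pack("!HHHH", src_port, dst_port, udp_len, 0)  # UDP checksum 0 = unused
    ip = struct.pack("!BBHHHBBH4s4s",
                     0x45, 0, 20 + udp_len,   # version/IHL, TOS, total length
                     0, 0,                    # identification, flags/fragment
                     64, 17, 0,               # TTL, protocol (UDP), checksum placeholder
                     bytes([10, 1, 1, 1]),    # dummy source address
                     bytes([10, 2, 2, 2]))    # dummy destination address
    ip = ip[:10] + struct.pack("!H", ipv4_checksum(ip)) + ip[12:]
    ethernet = bytes(6) + bytes(6) + b"\x08\x00"  # zeroed MACs, EtherType IPv4
    return ethernet + ip + udp + rtp


def build_pcap(rtp_packets, spacing_us=20000):
    """Return the bytes of a classic pcap file holding the wrapped packets."""
    # Global header: magic, version 2.4, timezone, sigfigs, snaplen, LINKTYPE_ETHERNET (1)
    out = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
    for index, rtp in enumerate(rtp_packets):
        frame = wrap_rtp(rtp)
        micros = index * spacing_us
        out += struct.pack("<IIII", micros // 1_000_000, micros % 1_000_000,
                           len(frame), len(frame))
        out += frame
    return out
```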

Next, we need to use Wireshark to decode the RTP stream into audio that can be played back. This is fairly easy, although there are several steps involved:

  1. Navigate to Telephony > RTP > Show All Streams
  2. You should only be able to see one stream, since we isolated it earlier. Select the stream and hit “Analyze”
  3. Hit “Player” in the RTP Stream Analysis window that opens up
  4. Hit “Decode” in the RTP Player window that opens up.
  5. You should now be able to select the stream and hit “Play” to listen to the audio. You may also want to adjust the jitter buffer and hit “Decode” again if you find that the audio is a bit choppy. I found that a jitter buffer of 75ms worked well for my example capture.
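For the curious, the “Decode” step is not doing anything exotic for G.711. Each payload byte is one 8 kHz audio sample in companded form, and decoding just expands it back to 16-bit linear PCM. Below is a minimal Python sketch of that expansion for PCMU (mu-law) only, followed by a helper that writes the samples into a WAV container. It is an illustration, not a replacement for the Wireshark player: it assumes the payloads are already in order and does nothing about jitter, loss, or reordering. The function names are my own.

```python
import io
import struct
import wave


def ulaw_to_pcm16(byte):
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    value = ~byte & 0xFF              # mu-law bytes are stored inverted
    sign = value & 0x80
    exponent = (value >> 4) & 0x07
    mantissa = value & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample


def pcmu_payloads_to_wav(payloads, sample_rate=8000):
    """Decode RTP payloads (headers already stripped) into WAV file bytes."""
    samples = [ulaw_to_pcm16(b) for payload in payloads for b in payload]
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)           # G.711 is mono
        wav.setsampwidth(2)           # 16-bit output samples
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))
    return buffer.getvalue()
```

If your endpoints negotiated PCMA instead, the expansion table is different (A-law), so this sketch would produce noise rather than speech.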

You should now have a nicely decrypted and decoded audio stream that is fit for playback, as seen below. If you were using my sample packets, you should be able to hear me say “testing” repeatedly. Remember that this is only one side of the audio stream, and you may wish to decrypt and decode the other side.

Conclusion

Using the SDES method to transport keys can work if SIP is protected by TLS. However, even this seems undesirable when key management extensions for SDP exist and are specified in RFC 4567. In my opinion, good key management should assume that the attacker is able to capture the entire exchange. Using something like ephemeral Diffie-Hellman ensures that the key exchange is secure even when the key exchange channel isn’t. This presents a more layered approach where the security of the key is not reliant on the underlying transport protocols. Better yet, I’m a fan of ZRTP, specified in RFC 6189. ZRTP removes control of key management from the signaling protocol entirely, instead allowing the media stream (RTP) to negotiate the key using ephemeral Diffie-Hellman.
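To illustrate the Diffie-Hellman idea, here is a toy Python sketch. It is emphatically not ZRTP and not production cryptography: the prime and generator are picked for readability, whereas real deployments use much larger standardized groups or elliptic curves. The point is only that both sides arrive at the same secret while the values that cross the wire (the two public numbers) are not enough for a passive eavesdropper to reconstruct it.

```python
import secrets

# Toy parameters, assumed for illustration only: a 127-bit Mersenne prime
# is far too small for real use.
P = 2**127 - 1
G = 3


def dh_keypair():
    """Pick a fresh (ephemeral) private value and derive the public value."""
    private = secrets.randbelow(P - 3) + 2   # in the range 2 .. P-2
    public = pow(G, private, P)
    return private, public


def dh_shared(private, peer_public):
    """Combine our private value with the peer's public value."""
    return pow(peer_public, private, P)


if __name__ == "__main__":
    alice_private, alice_public = dh_keypair()
    bob_private, bob_public = dh_keypair()
    # Only alice_public and bob_public would be visible in a capture.
    print(dh_shared(alice_private, bob_public) == dh_shared(bob_private, alice_public))
```

On its own, this exchange still falls to an active man-in-the-middle who swaps the public values, which is why ZRTP pairs it with a short authentication string that the two callers compare.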

The biggest lesson that can be found here, aside from a few cool tricks for decrypting supposedly secure VoIP, is simply that VoIP security is not a “set it and forget it” feature. A voice administrator must be knowledgeable about the underlying protocols and mechanisms that are used to secure voice and realtime communications. Otherwise, the false sense of security might be more dangerous than simply leaving your voice calls unencrypted.