Last year, while recording some gaming footage, I was annoyed that the recorder captured all audio on my system. This not only forced me to pause any music I like to listen to while playing, but also to mute Skype, TS etc. I found it incredibly inconvenient that I couldn’t record the audio of just one single process.
In this example, we will use Spotify for our first analysis. Note that I love their service. Bypassing their security measures is not the goal of this research, and I’m not going to provide any help with that.
I didn’t have much audio experience at this point (still don’t really…), especially not with low level Windows audio stuff. But I figured at some point an application must send its audio buffer to the OS to have it played. So the idea was to simply intercept that call, read the buffer and store it to disk. I had no idea what format the buffer might use, but I figured it would be the same all the time, since we’re sending it to the OS (hint: I was wrong). Knowing that raw audio takes up a lot of space, I also thought about compressing it, e.g. to MP3, before storing. But that’s another story.
As mentioned above, I chose Spotify as a first victim. I did so for two reasons: The first being that it’s running on my system all the time and is the major cause of audio interference. The second, and more important one, being that I expected the folks at Spotify to have put some extra effort into protecting their music data to prevent it from being ripped. So I thought, if I manage to get the raw audio stream data from Spotify, I can probably do it for most other applications as well.
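To put a number on "a lot of space", here is a small Python sketch of the storing side of that idea. It is only an illustration and not the code I ended up with: it assumes the intercepted buffers are plain 16-bit stereo PCM at 44.1 kHz (an assumption, and as hinted above the real format is not that predictable), and the helper names `pcm_bytes_per_second` and `write_wav` are made up for this example. At those settings raw audio comes to 176,400 bytes per second, or roughly 10.6 MB per minute.

```python
import io
import wave


def pcm_bytes_per_second(sample_rate: int, channels: int, bits_per_sample: int) -> int:
    """Return the size in bytes of one second of uncompressed PCM audio."""
    return sample_rate * channels * (bits_per_sample // 8)


def write_wav(target, chunks, sample_rate=44100, channels=2, bits_per_sample=16):
    """Write raw PCM chunks into a WAV container.

    `target` is a file path or a seekable binary file object.
    `chunks` is any iterable of bytes objects, e.g. buffers that were
    intercepted one after another. The format parameters are assumed,
    not detected.
    """
    with wave.open(target, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(bits_per_sample // 8)
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)


if __name__ == "__main__":
    rate = pcm_bytes_per_second(44100, 2, 16)
    print(rate, "bytes per second")
    print(rate * 60 / 1_000_000, "MB per minute")

    # Stand-in for intercepted buffers: one second of silence in two chunks.
    fake_chunks = [b"\x00" * (rate // 2), b"\x00" * (rate // 2)]
    out = io.BytesIO()
    write_wav(out, fake_chunks)
    print(len(out.getvalue()), "bytes including the WAV header")
```

Wrapping the data in a WAV header keeps the format information next to the samples, which matters once the buffer format turns out not to be the same everywhere.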
The downside of choosing Spotify was that it had some anti-debugging techniques incorporated, so I had to prepare it a little for analysis (for instance, map the imports since it resolves them at runtime using xored strings, extract RTTI etc.).
Once it was ready for analysis I started it up again and checked the imported modules. I immediately spotted DSOUND.DLL, which is part of DirectX and is responsible for (who saw this coming?) sound. From my work on various games I knew that most of them use DirectSound for rendering their audio streams as well.
Now I had a first idea what to look for. I fired up API Monitor (great software btw) and set it to only record audio related APIs. When I attached it to the Spotify process, I quickly had a bunch of different API calls recorded. I stopped monitoring and went for a first analysis. While scrolling down the list I quickly noticed a repeating pattern:
I looked up the APIs on MSDN, along with the sample code provided there. The sample gave me a good idea of how rendering streams works using these APIs, and I felt ready to start some hooking.
Further steps will be explained in the next blog post I hope to finish soon.