AudioCaptureRaw Walkthrough: C++
Capturing the Raw Audio Stream
About This Walkthrough In the Kinect™ for Windows® Software Development Kit (SDK) Beta, the AudioCaptureRaw sample uses the Windows Audio Session API (WASAPI) to capture the raw audio stream from the microphone array of the Kinect for Xbox 360® sensor and write it to a .wav file. This document is a walkthrough of the sample.
Resources For a complete list of documentation for the Kinect for Windows SDK Beta, plus related reference and links to the online forums, see the beta SDK website at:
http://www.kinectforwindows.org/
Contents
Introduction
Program Description
Select a Capture Device
Enumerate the Capture Devices
Retrieve the Device Name
Determine the Device Index
Prepare for Audio Capture
Initialize Audio Engine for Capture
Load the Format
Initialize the Audio Engine
Capture an Audio Stream from the Microphone Array
The Primary Thread
The Worker Thread
License: The Kinect for Windows SDK Beta is licensed for non-commercial use only. By installing, copying, or otherwise using the beta SDK, you agree to be bound by the terms of its license. Read the license.
Disclaimer: This document is provided “as-is”. Information and views expressed in this document, including URL and other Internet Web site references, may change without notice. You bear the risk of using it.
This document does not provide you with any legal rights to any intellectual property in any Microsoft product. You may copy and use this document for your internal, reference purposes.
© 2011 Microsoft Corporation. All rights reserved.
Microsoft, DirectX, Kinect, LifeChat, MSDN, and Windows are trademarks of the Microsoft group of companies. All other trademarks are property of their respective owners.
Introduction
The audio component of the Kinect™ for Xbox 360® sensor is a four-element linear microphone array. An array provides some significant advantages over a single microphone, including more sophisticated acoustic echo cancellation and noise suppression, and the ability to determine the direction of a sound source.
The primary way for C++ applications to access the Kinect sensor’s microphone array is through the KinectAudio Microsoft® DirectX® Media Object (DMO). However, it is useful for some purposes to simply capture the raw audio streams from the array’s microphones.
The Kinect sensor’s microphone array is a standard Windows® multichannel audio-capture device, so you can also capture the audio stream by using the Windows Audio Session API (WASAPI) or by using the microphone array as a standard Windows microphone. The AudioCaptureRaw sample uses the WASAPI to capture the raw audio stream from the Kinect sensor’s microphone array and write it to a .wav file. This document is a walkthrough of the sample. For more information on WASAPI, see “About WASAPI” on the Microsoft Developer Network (MSDN®) website.
Note The WASAPI is COM-based, and this document assumes that you are familiar with the basics of how to use COM objects and interfaces. You do not need to know how to implement COM objects. For the basics of how to use COM objects, see “Programming DirectX with COM” on the MSDN website. This MSDN topic is written for DirectX programmers, but the basic principles apply to all COM-based applications.
Program Description
AudioCaptureRaw is installed with the Kinect for Windows Software Development Kit (SDK) Beta samples in %KINECTSDK_DIR%\Samples\KinectSDKSamples.zip. AudioCaptureRaw is a C++ console application that is implemented in the following files:
AudioCaptureRaw.cpp contains the application’s entry point and manages overall program execution.
WASAPICapture.cpp and its associated header—WASAPICapture.h—implement the CWASAPICapture class, which handles the details of capturing the audio stream.
The basic program flow of AudioCaptureRaw is as follows:
1. Enumerate the system’s capture devices and select the appropriate device.
Because the system might have multiple audio capture devices, the application enumerates all such devices and has the user specify the appropriate one.
2. Record 10 seconds of audio data from the device.
3. Write the recorded data to a WAVE file: out.wav.
The recording process multiplexes the streams from each microphone channel in an interleaved format—ch 1/ ch 2/ ch 3/ ch 4/ ch 1/ ch 2/... and so on—with each channel’s data in a 16-kilohertz (kHz), 32-bit mono pulse code modulation (PCM) format.
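Because the frames are interleaved, the byte position of any one channel’s sample can be computed directly from the frame and channel indices. The following minimal sketch illustrates the layout; the helper and its names are illustrative and are not part of the sample:
// Illustrative helper: byte offset of a channel's sample within interleaved
// PCM data. For the Kinect array, channels = 4 and bytesPerSample = 4
// (32-bit samples), so each frame occupies 16 bytes.
size_t SampleOffset(size_t frame, size_t channel,
                    size_t channels, size_t bytesPerSample)
{
    return (frame * channels + channel) * bytesPerSample;
}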
The following is a lightly edited version of the AudioCaptureRaw output for a system with two capture devices—a Microsoft LifeChat® headset and a Kinect sensor:
WASAPI Capture Shared Timer Driven Sample
Copyright (c) Microsoft. All Rights Reserved
Select an output device:
0: Microphone Array (Kinect USB Audio) ({0.0.1.00000000}
{6ed40fd5-a340-4f8a-b324-edac93fa6702})
1: Headset Microphone (3- Microsoft LifeChat LX-3000 )({0.0.1.00000000}
{97721472-fc66-4d63-95a2-86c1044e0893})
0
Capture audio data for 10 seconds
1
Successfully wrote WAVE data to out.wav
The remainder of this document walks you through the application.
Note This document includes code examples, most of which have been edited for brevity and readability. In particular, most routine error-handling code has been removed. For the complete code, see the sample. Hyperlinks in this walkthrough refer to content on the MSDN website.
Select a Capture Device
The application’s entry point is wmain, in AudioCaptureRaw.cpp. This function manages the overall program execution, with private functions handling most of the details. WASAPI is COM-based, so AudioCaptureRaw first initializes COM, as follows:
int wmain()
{
    ...
    HRESULT hr = CoInitializeEx(NULL, COINIT_MULTITHREADED);
    ...
}
Tip Applications that have a graphical user interface (GUI) should use COINIT_APARTMENTTHREADED instead of COINIT_MULTITHREADED.
AudioCaptureRaw next calls the private PickDevice function to select the capture device, as follows:
bool PickDevice(IMMDevice **DeviceToUse, bool *IsDefaultDevice, ERole *DefaultDeviceRole)
{
    HRESULT hr;
    IMMDeviceEnumerator *deviceEnumerator = NULL;
    IMMDeviceCollection *deviceCollection = NULL;

    *IsDefaultDevice = false;
    hr = CoCreateInstance(__uuidof(MMDeviceEnumerator),
                          NULL,
                          CLSCTX_INPROC_SERVER,
                          IID_PPV_ARGS(&deviceEnumerator));
    ...
}
PickDevice calls the CoCreateInstance function to create a device enumerator object and get a pointer to its IMMDeviceEnumerator interface.
Enumerate the Capture Devices
PickDevice enumerates the system’s capture devices by calling the enumerator object’s IMMDeviceEnumerator::EnumAudioEndpoints method, as follows:
bool PickDevice(...)
{
    ...
    hr = deviceEnumerator->EnumAudioEndpoints(eCapture,
                                              DEVICE_STATE_ACTIVE,
                                              &deviceCollection);
    ...
}
The EnumAudioEndpoints parameter values are as follows:
1. A value from the EDataFlow enumeration that indicates the device type.
eCapture directs EnumAudioEndpoints to enumerate only capture devices.
2. A DEVICE_STATE_XXX constant that specifies which device states to enumerate.
DEVICE_STATE_ACTIVE directs EnumAudioEndpoints to enumerate only active devices.
3. The address of an IMMDeviceCollection interface pointer that receives the collection of enumerated capture devices.
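The PickDevice signature also provides for selecting the system default device. Although this walkthrough follows the enumeration path, an application that simply wants the default capture endpoint could instead call IMMDeviceEnumerator::GetDefaultAudioEndpoint, as in this minimal sketch:
// Sketch: retrieve the default capture endpoint for the console role.
// Assumes deviceEnumerator is the IMMDeviceEnumerator created earlier.
IMMDevice *defaultDevice = NULL;
HRESULT hr = deviceEnumerator->GetDefaultAudioEndpoint(eCapture,
                                                       eConsole,
                                                       &defaultDevice);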
PickDevice then uses the IMMDeviceCollection interface to list the available capture devices and let the user select the appropriate device—presumably the Kinect sensor—as follows:
bool PickDevice(...)
{
    UINT deviceCount;
    ...
    hr = deviceCollection->GetCount(&deviceCount);
    for (UINT i = 0 ; i < deviceCount ; i += 1)
    {
        LPWSTR deviceName;
        deviceName = GetDeviceName(deviceCollection, i);
        printf_s("    %d:  %S\n", i, deviceName);
        free(deviceName);
    }
    ...
}
PickDevice first calls the collection object’s IMMDeviceCollection::GetCount method to determine the number of devices in the collection and then iterates through the collection and lists the device names.
Retrieve the Device Name
For each device in the collection, PickDevice calls the private GetDeviceName function to retrieve the device name, as follows:
LPWSTR GetDeviceName(IMMDeviceCollection *DeviceCollection, UINT DeviceIndex)
{
    IMMDevice *device;
    LPWSTR deviceId;
    HRESULT hr;

    hr = DeviceCollection->Item(DeviceIndex, &device);
    hr = device->GetId(&deviceId);

    IPropertyStore *propertyStore;
    hr = device->OpenPropertyStore(STGM_READ, &propertyStore);
    SafeRelease(&device);

    PROPVARIANT friendlyName;
    PropVariantInit(&friendlyName);
    hr = propertyStore->GetValue(PKEY_Device_FriendlyName, &friendlyName);

    wchar_t deviceName[128];
    hr = StringCbPrintf(deviceName,
                        sizeof(deviceName),
                        L"%s (%s)",
                        friendlyName.vt != VT_LPWSTR ? L"Unknown" : friendlyName.pwszVal,
                        deviceId);
    ... // Clean up and return the device name.
}
Each device in the collection is identified by a zero-based index and is represented by a device object that exposes an IMMDevice interface. The device details—including a readable “friendly name”—are stored in the device object’s property store, which is represented by an IPropertyStore interface.
A property store provides general-purpose storage. Each item is identified by a key—a PROPERTYKEY structure—that is typically named PKEY_XYZ. The key for the device’s friendly name is named PKEY_Device_FriendlyName.
To obtain the device’s friendly name, GetDeviceName:
1. Calls the IMMDeviceCollection::Item method to retrieve the specified device object’s IMMDevice interface.
2. Calls the IMMDevice::GetId method to retrieve the device ID.
3. Calls the IMMDevice::OpenPropertyStore method to get a read-only pointer to the device object’s IPropertyStore interface.
4. Passes the friendly name property key to the IPropertyStore::GetValue method, which returns a PROPVARIANT structure with the device’s friendly name.
5. Calls the StringCbPrintf function to format the friendly name and the device ID into a single display string.
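The cleanup that the ellipsis in the code example elides must release the COM-allocated resources before the function returns. The following is a sketch of what that cleanup can look like, assuming the sample’s SafeRelease helper; see the sample for the exact code:
    // Sketch of the elided cleanup: release COM-allocated resources and
    // return a heap copy of the name, which the caller releases with free().
    PropVariantClear(&friendlyName);   // Releases the VT_LPWSTR payload.
    SafeRelease(&propertyStore);
    CoTaskMemFree(deviceId);           // GetId allocates with CoTaskMemAlloc.
    return _wcsdup(deviceName);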
Determine the Device Index
The user enters an integer value that specifies the device index. PickDevice converts the string to an unsigned long and passes the index to IMMDeviceCollection::Item to retrieve the appropriate IMMDevice interface, which is then returned to wmain, as shown in the following code example:
bool PickDevice(...)
{
    ...
    wchar_t choice[10];
    _getws_s(choice);

    unsigned long deviceIndex;
    wchar_t *endPointer;
    deviceIndex = wcstoul(choice, &endPointer, 0);

    hr = deviceCollection->Item(deviceIndex, &device);
    ...
}
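The sample assumes that the user enters a valid index. A more defensive version, shown in the following sketch (not part of the sample), would verify that the string parsed completely and that the index is within range before calling Item:
    // Sketch: validate the parsed device index before using it.
    unsigned long deviceIndex = wcstoul(choice, &endPointer, 0);
    if (endPointer == choice || deviceIndex >= deviceCount)
    {
        printf_s("Invalid device index: %S\n", choice);
        return false;   // PickDevice reports failure to wmain.
    }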
Prepare for Audio Capture
The audio capture process is handled by a CWASAPICapture object, as follows:
int wmain()
{
    ...
    CWASAPICapture *capturer = new (std::nothrow) CWASAPICapture(device, role);
    if (capturer->Initialize(TargetLatency))
    {
        ...
    }
    ...
}
To create the object, wmain passes the device’s IMMDevice interface and a role value to the constructor. The constructor uses this input to set some private data members. The contents of the if block implement the capture process and are discussed in the next section.
wmain passes a target latency value to CWASAPICapture::Initialize to initialize the object. AudioCaptureRaw polls for data; the target latency defines the wait time between polls and also influences the size of the buffer that is shared between the application and the audio engine.
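WASAPI expresses buffer durations as REFERENCE_TIME values, in 100-nanosecond units, so a latency in milliseconds must be multiplied by 10,000 before it is passed to IAudioClient::Initialize. The sample performs this conversion inline, as shown in the next section; the following is a standalone sketch of the conversion:
// REFERENCE_TIME is measured in 100-nanosecond units; 1 ms = 10,000 units.
// Requires audioclient.h.
REFERENCE_TIME LatencyToReferenceTime(UINT32 latencyInMs)
{
    return static_cast<REFERENCE_TIME>(latencyInMs) * 10000;
}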
Initialize Audio Engine for Capture
CWASAPICapture::Initialize prepares the audio engine for capture, as follows:
bool CWASAPICapture::Initialize(UINT32 EngineLatency)
{
    _ShutdownEvent = CreateEventEx(NULL, NULL, 0,
                                   EVENT_MODIFY_STATE | SYNCHRONIZE);
    HRESULT hr = _Endpoint->Activate(__uuidof(IAudioClient),
                                     CLSCTX_INPROC_SERVER, NULL,
                                     reinterpret_cast<void **>(&_AudioClient));
    hr = CoCreateInstance(__uuidof(MMDeviceEnumerator),
                          NULL,
                          CLSCTX_INPROC_SERVER,
                          IID_PPV_ARGS(&_DeviceEnumerator));
    LoadFormat();
    InitializeAudioEngine();
    return true;
}
Initialize creates a shutdown event that is used later to help manage the capture process. It then calls the device’s IMMDevice::Activate method to create an audio client object for the device, which is represented by an IAudioClient interface. It completes the preparation by calling the private LoadFormat and InitializeAudioEngine methods.
Load the Format
The private LoadFormat method calls the device’s IAudioClient::GetMixFormat method to retrieve the audio stream format. It uses that information to define the frame size and store it for later use, as follows:
bool CWASAPICapture::LoadFormat()
{
    HRESULT hr = _AudioClient->GetMixFormat(&_MixFormat);
    _FrameSize = (_MixFormat->wBitsPerSample / 8) * _MixFormat->nChannels;
    return true;
}
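As a rough illustration of the arithmetic: the Kinect microphone array delivers four channels of 32-bit samples, so the frame size is (32 / 8) * 4 = 16 bytes. The following standalone sketch (not part of the sample) derives the frame size from any audio client’s mix format:
// Sketch: derive the frame size from a client's mix format.
// Requires audioclient.h.
UINT32 GetFrameSize(IAudioClient *audioClient)
{
    WAVEFORMATEX *format = NULL;
    UINT32 frameSize = 0;
    if (SUCCEEDED(audioClient->GetMixFormat(&format)))
    {
        frameSize = (format->wBitsPerSample / 8) * format->nChannels;
        CoTaskMemFree(format);   // GetMixFormat allocates with CoTaskMemAlloc.
    }
    return frameSize;
}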
Initialize the Audio Engine
Initialize calls InitializeAudioEngine to initialize the audio engine in timer-driven mode, as follows:
bool CWASAPICapture::InitializeAudioEngine()
{
    HRESULT hr = _AudioClient->Initialize(AUDCLNT_SHAREMODE_SHARED,
                                          AUDCLNT_STREAMFLAGS_NOPERSIST,
                                          _EngineLatencyInMS * 10000,
                                          0,
                                          _MixFormat,
                                          NULL);
    hr = _AudioClient->GetService(IID_PPV_ARGS(&_CaptureClient));
    return true;
}
InitializeAudioEngine:
1. Calls the IAudioClient::Initialize method to initialize the audio stream in shared mode.
2. Calls the IAudioClient::GetService method to retrieve an IAudioCaptureClient interface, which enables a client to read input data from a capture device.
The IAudioCaptureClient pointer from the final step is stored for later use.
Capture an Audio Stream from the Microphone Array
The capture process works as follows:
1. The primary thread creates a worker thread to capture the data and then starts a countdown timer.
2. While the countdown timer runs, the worker thread captures audio data in the background.
3. After the countdown timer completes, the primary thread notifies the worker thread to stop capturing data and ends the process.
The Primary Thread
The code to manage this process was represented by the ellipsis in the if block that was shown at the beginning of the “Prepare for Audio Capture” section. The following code example shows the complete block:
int wmain()
{
    ...
    if (capturer->Initialize(TargetLatency))
    {
        size_t captureBufferSize = capturer->SamplesPerSecond() *
                                   TargetDurationInSec *
                                   capturer->FrameSize();
        BYTE *captureBuffer = new (std::nothrow) BYTE[captureBufferSize];
        if (capturer->Start(captureBuffer, captureBufferSize))
        {
            do
            {
                printf_s(" \r%d\r", TargetDurationInSec);
                Sleep(1000);
            } while (--TargetDurationInSec);
            printf_s("\n");
            capturer->Stop();

            // Save the data to a WAVE file and clean up.
            ...
        }
    }
    ...
}
Before starting the capture process, wmain first computes the size of the capture buffer, which is the product of the following:
The sample rate, in samples per second, which is extracted from the mix format by the private CWASAPICapture::SamplesPerSecond method.
The target duration, in seconds, which is hard-coded to 10 seconds.
The frame size, which was computed earlier and is retrieved by the private CWASAPICapture::FrameSize method.
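For the Kinect array’s format, the arithmetic works out to 16,000 samples per second × 10 seconds × 16 bytes per frame, or 2,560,000 bytes. Expressed as a standalone sketch with these assumed values:
// Sketch: capture buffer size for the Kinect format (assumed values).
const size_t samplesPerSecond = 16000;   // From the mix format.
const size_t durationInSeconds = 10;     // Hard-coded in the sample.
const size_t frameSize = 16;             // 4 channels x 4 bytes per sample.
size_t captureBufferSize =
    samplesPerSecond * durationInSeconds * frameSize;   // 2,560,000 bytes.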
Start the Capture Process
wmain calls the private CWASAPICapture::Start method to start the capture process, as follows:
bool CWASAPICapture::Start(BYTE *CaptureBuffer, size_t CaptureBufferSize)
{
    HRESULT hr;

    _CaptureBuffer = CaptureBuffer;
    _CaptureBufferSize = CaptureBufferSize;

    _CaptureThread = CreateThread(NULL, 0,
                                  WASAPICaptureThread, this,
                                  0, NULL);
    hr = _AudioClient->Start();
    return true;
}
Start:
1. Calls the CreateThread function to create the worker thread.
CreateThread creates a new thread and calls CWASAPICapture::WASAPICaptureThread on that thread. WASAPICaptureThread is discussed in the following section.
2. Calls the IAudioClient::Start method to direct the audio client to start streaming data between the endpoint buffer and the audio engine.
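Because CreateThread requires a plain function pointer, WASAPICaptureThread is typically implemented as a static member that forwards the this pointer, passed as the thread context, to the instance method that does the work. The following is a sketch of that standard pattern; see the sample for the exact code:
DWORD CALLBACK CWASAPICapture::WASAPICaptureThread(LPVOID Context)
{
    // Dispatch from the static thunk to the instance method.
    CWASAPICapture *capturer = static_cast<CWASAPICapture *>(Context);
    return capturer->DoCaptureThread();
}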
Manage the Capture Process
After CWASAPICapture::Start returns, wmain starts the countdown timer on the primary thread, which also provides the user with a visual indicator of the capture process. When the countdown timer is finished, wmain calls CWASAPICapture::Stop to stop the capture process, as follows:
void CWASAPICapture::Stop()
{
    HRESULT hr;

    if (_ShutdownEvent)
    {
        SetEvent(_ShutdownEvent);
    }
    hr = _AudioClient->Stop();

    if (_CaptureThread)
    {
        WaitForSingleObject(_CaptureThread, INFINITE);
        CloseHandle(_CaptureThread);
        _CaptureThread = NULL;
    }
}
Stop:
1. Raises _ShutdownEvent to notify the worker thread to stop capturing data.
2. Calls IAudioClient::Stop to direct the audio engine to stop streaming data.
3. Waits for the worker thread to signal the thread object, which indicates that the capture process is complete.
4. Closes the thread handle.
wmain then calls the private SaveWaveData function to write the captured data to a .wav file; for details, see the sample. Finally, wmain performs cleanup and terminates the application.
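For reference, writing the captured PCM data to a .wav file amounts to prefixing it with a RIFF header that is built from the capture format. The following is a minimal, illustrative writer, not the sample’s SaveWaveData implementation:
// Illustrative minimal WAVE writer: a simple RIFF header followed by the
// interleaved PCM data. Error handling is omitted for brevity.
bool WriteWaveFile(const char *fileName, const WAVEFORMATEX *format,
                   const BYTE *data, DWORD dataBytes)
{
    FILE *file = NULL;
    if (fopen_s(&file, fileName, "wb") != 0)
    {
        return false;
    }
    DWORD formatSize = sizeof(WAVEFORMATEX) + format->cbSize;
    DWORD riffSize = 4 + (8 + formatSize) + (8 + dataBytes);

    fwrite("RIFF", 1, 4, file);  fwrite(&riffSize, 4, 1, file);
    fwrite("WAVE", 1, 4, file);
    fwrite("fmt ", 1, 4, file);  fwrite(&formatSize, 4, 1, file);
    fwrite(format, formatSize, 1, file);
    fwrite("data", 1, 4, file);  fwrite(&dataBytes, 4, 1, file);
    fwrite(data, 1, dataBytes, file);

    fclose(file);
    return true;
}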
The Worker Thread
On the worker thread, WASAPICaptureThread calls CWASAPICapture::DoCaptureThread to handle the capture process, as follows:
DWORD CWASAPICapture::DoCaptureThread()
{
    bool stillPlaying = true;
    HANDLE mmcssHandle = NULL;
    DWORD mmcssTaskIndex = 0;

    HRESULT hr = CoInitializeEx(NULL, COINIT_MULTITHREADED);
    mmcssHandle = AvSetMmThreadCharacteristics(L"Audio", &mmcssTaskIndex);

    while (stillPlaying)
    {
        // Capture audio stream until stopped by primary thread.
    }

    AvRevertMmThreadCharacteristics(mmcssHandle);
    CoUninitialize();
    return 0;
}
DoCaptureThread:
1. Calls the CoInitializeEx function to initialize COM for the worker thread.
You must initialize COM separately for each thread.
2. Calls the AvSetMmThreadCharacteristics function to associate the worker thread with the Multimedia Class Scheduler Service (MMCSS) Audio task; see the sketch after this list.
3. Starts a while loop to capture the data, which runs until the primary thread calls CWASAPICapture::Stop.
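Registering with MMCSS requires the Avrt.h header and linking with Avrt.lib. Registration can fail (for example, if the MMCSS service is unavailable), and capture still works without it, just without elevated scheduling priority. A defensive sketch follows; the check is an addition for illustration, not part of the sample as shown:
// Sketch: MMCSS registration with a failure check. Requires avrt.h and Avrt.lib.
HANDLE mmcssHandle = AvSetMmThreadCharacteristics(L"Audio", &mmcssTaskIndex);
if (mmcssHandle == NULL)
{
    // Non-fatal: continue capturing at normal thread priority.
}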
The following code sample shows the capture loop:
DWORD CWASAPICapture::DoCaptureThread()
{
    ...
    while (stillPlaying)
    {
        HRESULT hr;
        DWORD waitResult = WaitForSingleObject(_ShutdownEvent,
                                               _EngineLatencyInMS / 2);
        switch (waitResult)
        {
        case WAIT_OBJECT_0 + 0:     // _ShutdownEvent was raised.
            stillPlaying = false;
            break;
        case WAIT_TIMEOUT:          // The wait timed out; data is available.
            BYTE *pData;
            UINT32 framesAvailable;
            DWORD flags;

            hr = _CaptureClient->GetBuffer(&pData,
                                           &framesAvailable,
                                           &flags, NULL, NULL);
            if (SUCCEEDED(hr))
            {
                UINT32 framesToCopy = min(framesAvailable,
                    static_cast<UINT32>((_CaptureBufferSize -
                        _CurrentCaptureIndex) / _FrameSize));
                if (framesToCopy != 0)
                {
                    if (flags & AUDCLNT_BUFFERFLAGS_SILENT)
                    {
                        ZeroMemory(&_CaptureBuffer[_CurrentCaptureIndex],
                                   framesToCopy * _FrameSize);
                    }
                    else
                    {
                        CopyMemory(&_CaptureBuffer[_CurrentCaptureIndex],
                                   pData, framesToCopy * _FrameSize);
                    }
                    _CurrentCaptureIndex += framesToCopy * _FrameSize;
                }
                hr = _CaptureClient->ReleaseBuffer(framesAvailable);
            }
            break;
        }
    }
    ...
}
For each iteration, the capture loop waits for up to half the engine latency for the next batch of data to be streamed:
If the primary thread raises _ShutdownEvent before the time-out expires, the capture loop terminates.
If the wait times out, the capture loop copies the newly available frames into the capture buffer, releases the shared buffer, and starts the next iteration. When the capture buffer is full, framesToCopy is zero, so any further data is released without being copied until the primary thread stops the capture.
For More Information
For more information about implementing audio and related samples, see the documentation and samples contained within the Kinect for Windows SDK.