api.ai + ReKognition
api.ai is a tool that offers voice recognition, natural language processing and text-to-speech, and allows users to integrate speech interfaces into their products. Six languages and four Software Development Kits (including Android and iOS) are currently available. A free account allows up to 3,000 queries per month, and it should be straightforward to implement after reading the API documentation. I tried to create a little interface in Mathematica which uses this API along with the previously discussed ReKognition to recognise faces using voice commands. Initially, the aim was just to try out api.ai, but it soon became too intriguing to leave it there.
I started off by creating an Agent, which I called Identificator. An agent can be thought of as an application and consists of Entities and Intents. Once you create an agent, you will receive the keys required for authentication. Note that the language of an agent cannot be changed after it has been created. I then created three entities which describe the three main ways the voice command can be understood--@id for commands that contain verbs, @wh for pronouns and @beau for adjectives. Intents determine what actions will be taken in response to the user command. I created three different intents based on the following scenarios:
- identify : user command contains a verb and only facial recognition is required.
- who : command is in the form of a question and contains a pronoun.
- compare : command contains adjectives (comparative or superlative) and requires a comparison.
For each intent there's an Action (the action to be taken by the application) and a Fulfillment (the speech to be returned by the application). I used the same action (begin_recognition) and similar, only slightly different fulfillments for all three intents. After creating the application on api.ai, I tested it on the website and it seemed to work well. In the example shown below, the user command is "recognize person" and the action is begin_recognition. "Recognition successful." is returned in the form of speech or text.
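The relationship between entities, intents, actions and fulfillments described above can be sketched in a few lines of Python. This is only an illustration and not the agent's actual configuration: the word lists in ENTITIES, the fulfillment strings for who and compare, and the rule of checking compare first are all assumptions, and the keyword lookup is a toy stand-in for api.ai's own language matching.

```python
from dataclasses import dataclass

# Hypothetical word lists; the post does not give the real entries of each entity.
ENTITIES = {
    "id": {"identify", "recognize", "recognise", "detect"},
    "wh": {"who", "whom", "whose"},
    "beau": {"prettier", "prettiest", "older", "oldest", "happier", "happiest"},
}


@dataclass(frozen=True)
class Intent:
    name: str         # intent name as used in the post
    entity: str       # entity that triggers the intent
    action: str       # action handed back to the application
    fulfillment: str  # text (or speech) returned to the user


# All three intents share the begin_recognition action, as in the post.
# Only "Recognition successful." comes from the post; the other two
# fulfillment strings are placeholders.
INTENTS = [
    Intent("compare", "beau", "begin_recognition", "Comparison successful."),
    Intent("who", "wh", "begin_recognition", "Identification successful."),
    Intent("identify", "id", "begin_recognition", "Recognition successful."),
]


def resolve(command):
    """Return the first intent whose entity shares a word with the command, or None."""
    words = {w.strip("?.,!").lower() for w in command.split()}
    for intent in INTENTS:  # compare is checked first (an assumption of this sketch)
        if words & ENTITIES[intent.entity]:
            return intent
    return None


if __name__ == "__main__":
    hit = resolve("recognize person")
    print(hit.name, hit.action, hit.fulfillment)
```

Checking compare before who means a command such as "who is the oldest?" is treated as a comparison rather than a plain identification; a real agent would make that decision through its trained intent matching instead.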
Orbeus' ReKognition allows one to recognise faces in images, amongst other things. I used it in a way similar to this, but with more complexity. It processes a photo of the user and attempts to find any face in it. It then returns data on facial features, in addition to attributes such as race, age and gender. Interestingly, a list of three possible emotions, a beauty score and a smile score are also returned. I don't know how these scores are calculated, but I thought they would allow for a richer, more fun interaction with the interface; hence the addition of the @beau entity and the compare intent.
The image below shows a simplified representation of how the application works. I will discuss a few important parts of the code I used to create the interface in this post. The entire code can be found here.
api.ai requires that the sound file is 16 kHz, signed PCM, 16-bit, mono and in either raw or WAV format. Mathematica does not allow sound to be recorded at a sample rate lower than 44.1 kHz on my PC, so the sound data is Downsampled to 16 kHz before the API request is sent. This fractionally reduces the duration of the recording, as can be seen in the example below. It possibly reduces the quality too.
The orientation of faces--described by roll, pitch and yaw angles--can be requested as an option when sending a request to ReKognition. This information is very useful, particularly for visualising the exact orientation of any face detected in the photo. To do this, I applied a combination of three rotation transformations about the x, y and z axes to a polygon, which is then placed at the position where the face is detected. Let's try to apply this to a photo in which the face is turned. With a request sent and the values of roll, pitch and yaw extracted beforehand, we'll apply the transformations... ...and visualise the orientation of the face in the photo.
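Two of the steps above, resampling the recording to 16 kHz and rotating a polygon by roll, pitch and yaw, can be sketched in plain Python. This is not the Mathematica code from the post. The function names are made up for illustration, the resampler uses simple linear interpolation with no anti-aliasing filter, and the mapping of angles to axes (pitch about x, yaw about y, roll about z) and the order in which the rotations are combined are assumptions, since the post does not state the convention ReKognition uses.

```python
import math


def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation (no low-pass filter, so aliasing is possible)."""
    # The output length is rounded down to a whole number of samples.
    n_out = len(samples) * dst_rate // src_rate
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate       # fractional position in the source signal
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]
        out.append(samples[j] * (1 - frac) + nxt * frac)
    return out


def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]


def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]


def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def face_polygon(center, size, roll, pitch, yaw):
    """Rotate a square of the given side length and place it at the face centre.

    Angles are in degrees. Assumed convention: pitch about x, yaw about y,
    roll about z, combined as Rz(roll) * Ry(yaw) * Rx(pitch).
    """
    r = matmul(rot_z(math.radians(roll)),
               matmul(rot_y(math.radians(yaw)), rot_x(math.radians(pitch))))
    h = size / 2
    corners = [(-h, -h, 0), (h, -h, 0), (h, h, 0), (-h, h, 0)]
    polygon = []
    for p in corners:
        x, y, _ = (sum(r[i][k] * p[k] for k in range(3)) for i in range(3))
        # Orthographic projection: drop z, then shift to the detected face position.
        polygon.append((center[0] + x, center[1] + y))
    return polygon


if __name__ == "__main__":
    one_second = [math.sin(2 * math.pi * 440 * t / 44100) for t in range(44100)]
    print(len(resample_linear(one_second, 44100, 16000)))
    print(face_polygon((120, 80), 60, roll=10, pitch=0, yaw=30))
```

Because the output length is rounded down to a whole number of samples, the resampled clip can come out very slightly shorter than the original. A face turned sideways (large yaw) produces a visibly narrower polygon, which is what makes the overlay a useful check on the returned angles.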
The interface is quite simple. It allows you to select three options--number (to match a face to its description), points (positions of eyes, nose, etc.) and boxes (rectangles or transformed polygons).
There are certain drawbacks I should point out. The speed at which the command is spoken and the speaker's accent, amongst other things, may have an effect on the result. For a command to be correctly interpreted, it must be spoken with clarity. Even if it is, the right result isn't always returned. I've found that the longer the command is, the longer it takes for a result to be returned. For instance, in the image below the command spoken was the same as above, "identify me", but spoken faster. Also, ReKognition may not successfully recognise a face where there clearly is one. It may even detect one where there isn't any. The age value usually isn't very accurate, but it's close enough most times. Both ReKognition and api.ai work pretty well in general. Mathematica code can be found here. Have fun.