Voice Service Technical Specifications

1. Introduction

Facephi Voice Service is a REST API service, implemented in C++, that processes the audio files you send to it and returns the result of the voice recognition process. The service offers one endpoint to enroll a new voice and another to authenticate a voice.

2. Hardware requirements

Component | Minimum requirement | Recommended requirement
CPU | 2 cores at >= 2 GHz with SSE4.2 instruction set support | 16 cores with AVX2 instruction set support
RAM | 4 GB | 8 GB
Disk | 4 GB | 4 GB SSD
Network | 100 Mbps | 1 Gbps
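
On a Linux host, the instruction set part of the CPU requirement can be checked before deploying the container. The sketch below is not part of the service: the function names are illustrative, and it assumes the kernel reports the extensions under the usual /proc/cpuinfo flag names `sse4_2` and `avx2`. It does not check core count or clock speed.

```python
def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the CPU feature flags listed in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def instruction_set_tier(flags: set[str]) -> str:
    """Classify a flag set against the SSE4.2 (minimum) and AVX2 (recommended) rows."""
    if "sse4_2" not in flags:
        return "unsupported"
    return "recommended" if "avx2" in flags else "minimum"


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not a Linux host, or the file is unreadable
    print(instruction_set_tier(cpu_flags(text)))
```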

3. Software requirements

  • Linux x86_64 (Ubuntu 24.04 or higher) with Docker 24.0 or higher.

or

  • Windows 10 x64 with Docker 24.0 or higher.

4. Enrollment requirements

Enrollment requires three recordings of the same user pronouncing a secret phrase. Each recording must meet the following minimum requirements:

Minimum requirements for enrollment | Values
Audio length | > 700 ms
Speech relative length (*) | > 0.55
Signal-to-noise ratio (SNR) (**) | > 8 dB

(*) Speech relative length = speech duration / audio duration.
(**) Our recommended speaker distance is 30 cm, a natural distance when using a hand-held device.
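
A client can screen recordings against these minimums before sending them. This is a minimal sketch, assuming the speech duration and SNR have already been measured by some voice activity detector and noise estimator of your choice. The `Recording` type and the function names are hypothetical and are not part of the service API.

```python
from dataclasses import dataclass

MIN_AUDIO_MS = 700
MIN_SPEECH_RELATIVE_LENGTH = 0.55
MIN_ENROLLMENT_SNR_DB = 8.0


@dataclass
class Recording:
    audio_ms: float   # total audio duration
    speech_ms: float  # duration detected as speech
    snr_db: float     # estimated signal-to-noise ratio


def speech_relative_length(rec: Recording) -> float:
    """Speech duration divided by audio duration."""
    return rec.speech_ms / rec.audio_ms if rec.audio_ms > 0 else 0.0


def meets_enrollment_minimums(rec: Recording) -> bool:
    """All three minimums are strict ("greater than") comparisons."""
    return (
        rec.audio_ms > MIN_AUDIO_MS
        and speech_relative_length(rec) > MIN_SPEECH_RELATIVE_LENGTH
        and rec.snr_db > MIN_ENROLLMENT_SNR_DB
    )


def enrollment_set_ok(recordings: list[Recording]) -> bool:
    """Enrollment needs exactly three recordings that each pass."""
    return len(recordings) == 3 and all(
        meets_enrollment_minimums(r) for r in recordings
    )
```

For example, a 2000 ms recording with 1400 ms of speech has a speech relative length of 0.7 and passes if its SNR is above 8 dB.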


There should only be one person speaking during the recording. To verify that the enrollment was carried out by a single person, the individual biometric templates created from the three recordings are compared.

Minimum requirements for enrollment | Threshold
If the probability of a match is less than the similarity threshold, the recording is rejected and a new recording is requested. | 0.55
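
The consistency check can be pictured as follows. This is only a sketch: it assumes the three templates are compared pairwise, which the text does not state, and `match_probability` is a hypothetical stand-in for whatever comparison the service performs internally.

```python
from itertools import combinations
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

SIMILARITY_THRESHOLD = 0.55


def single_speaker(
    templates: Sequence[T],
    match_probability: Callable[[T, T], float],
    threshold: float = SIMILARITY_THRESHOLD,
) -> bool:
    """Accept only if no pair of templates scores below the threshold."""
    return all(
        match_probability(a, b) >= threshold
        for a, b in combinations(templates, 2)
    )
```

With three templates this evaluates three pairs. A single pair below 0.55 is enough to reject the enrollment and request a new recording.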

5. Authentication requirements

Minimum requirements for authentication | Values
Audio length | > 700 ms
Speech relative length | > 0.55
Signal-to-noise ratio (SNR) | > 3 dB
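
When an authentication recording is rejected, it helps to tell the user which requirement failed. The table-driven sketch below reports the unmet requirements. The key names and the `unmet_requirements` helper are illustrative assumptions, not fields returned by the service.

```python
AUTHENTICATION_MINIMUMS = {
    "audio_length_ms": 700,
    "speech_relative_length": 0.55,
    "snr_db": 3,
}


def unmet_requirements(
    measured: dict[str, float],
    minimums: dict[str, float] = AUTHENTICATION_MINIMUMS,
) -> list[str]:
    """Return the names of requirements whose measured value is not above the minimum.

    A missing measurement counts as unmet.
    """
    return [
        name
        for name, minimum in minimums.items()
        if not measured.get(name, float("-inf")) > minimum
    ]
```

An empty list means the recording satisfies every minimum in the table.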

6. Metrics

Voice biometric validation is commonly applied over two channels: microphones and telephone lines.

Extracted metrics for microphone use case (new version noctua).

Threshold | FAR (%) | FRR (%)
0.5 | 0.17 | 3.32


Extracted metrics for telephone use case.

Threshold | FAR (%) | FRR (%)
0.5 | 1 | 9.12


FAR (False Acceptance Rate) is the probability that the system will incorrectly accept an impostor as a legitimate user.

FRR (False Rejection Rate) is the probability that the system will incorrectly reject a legitimate user.
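
To make the two definitions concrete, the sketch below estimates FAR and FRR from a labeled set of comparison scores at a given threshold. It assumes a comparison is accepted when its score is greater than or equal to the threshold. The function name is illustrative, and any scores you feed it are your own evaluation data, not the data behind the tables above.

```python
def far_frr(
    genuine_scores: list[float],
    impostor_scores: list[float],
    threshold: float = 0.5,
) -> tuple[float, float]:
    """Return (FAR %, FRR %) for the given scores and threshold."""
    if not genuine_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")
    # An impostor comparison at or above the threshold is a false acceptance.
    false_accepts = sum(s >= threshold for s in impostor_scores)
    # A genuine comparison below the threshold is a false rejection.
    false_rejects = sum(s < threshold for s in genuine_scores)
    far = 100.0 * false_accepts / len(impostor_scores)
    frr = 100.0 * false_rejects / len(genuine_scores)
    return far, frr
```

Raising the threshold lowers FAR and raises FRR, so the threshold sets the trade-off between security and convenience.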

7. Liveness detection

Minimum requirements for liveness detection | Values
Voice speech length for replay attack detection | > 1000 ms
Voice speech length for voice clone attack detection | > 3000 ms
Signal-to-noise ratio (SNR) | > 10 dB


Recommended thresholds for liveness detection | Threshold
Liveness validation is considered successful when the value is higher than the threshold. | 0.5
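
The liveness requirements and the recommended threshold can be combined into a small client-side check. This sketch assumes the SNR minimum applies to both attack detectors, which is one reading of the table. The function names and the check labels are hypothetical.

```python
REPLAY_MIN_SPEECH_MS = 1000
CLONE_MIN_SPEECH_MS = 3000
LIVENESS_MIN_SNR_DB = 10
LIVENESS_THRESHOLD = 0.5


def eligible_liveness_checks(speech_ms: float, snr_db: float) -> list[str]:
    """List the liveness checks whose minimum requirements the audio meets."""
    if not snr_db > LIVENESS_MIN_SNR_DB:
        return []
    checks = []
    if speech_ms > REPLAY_MIN_SPEECH_MS:
        checks.append("replay")
    if speech_ms > CLONE_MIN_SPEECH_MS:
        checks.append("voice_clone")
    return checks


def liveness_passed(score: float, threshold: float = LIVENESS_THRESHOLD) -> bool:
    """Liveness succeeds only when the score is strictly above the threshold."""
    return score > threshold
```

For example, 2000 ms of speech at 12 dB SNR qualifies for replay attack detection only, because voice clone attack detection needs more than 3000 ms of speech.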