Voice Service Technical Specifications

1. Introduction

Facephi Voice Service is a C++ REST API service that accepts audio files and returns the result of the voice recognition process. It exposes two endpoints: one to enroll a new voice and one to authenticate a voice.

2. Hardware requirements

	Minimum requirement	Recommended requirement
CPU	2 cores supporting the SSE4.2 instruction set extension, >= 2 GHz	16 cores, AVX2 ISA support
RAM	4 GB	8 GB
Disk	4 GB	SSD 4 GB
Network	100 Mbps	1 Gbps
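
The CPU rows above can be checked on a Linux host with a short script. The following is a minimal sketch, not part of the service: the function names are hypothetical, it assumes the flag names `sse4_2` and `avx2` as reported in `/proc/cpuinfo`, and it does not verify clock speed.

```python
def parse_cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from /proc/cpuinfo-style text (first CPU entry)."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def check_cpu(cpuinfo_text: str, cores: int) -> dict:
    """Compare core count and ISA flags with the minimum and recommended rows.

    Clock speed (>= 2 GHz) is not checked in this sketch.
    """
    flags = parse_cpu_flags(cpuinfo_text)
    return {
        "minimum": cores >= 2 and "sse4_2" in flags,
        "recommended": cores >= 16 and "avx2" in flags,
    }


# Illustrative input only; on a real host, read the text from /proc/cpuinfo.
sample = "processor\t: 0\nflags\t\t: fpu sse4_2 avx2\n"
print(check_cpu(sample, cores=4))
```

On a real machine the core count could come from `os.cpu_count()`, keeping in mind that it reports logical CPUs rather than physical cores.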

3. Software requirements

Linux x86_64 (Ubuntu 24.04 or higher) with Docker 24.0 or higher.

or

Windows 10 x64 with Docker 24.0 or higher.

4. Enrollment requirements

Enrollment requires three recordings of the same user pronouncing a secret phrase. The recordings must meet the following minimum requirements:

Minimum requirements for enrollment	Values
Audio length	> 700 ms
Speech relative length (*)	> 0.55
Signal to noise ratio (SNR) (**)	> 8 dB

(*) Speech Relative Length = Speech Duration / Audio Duration
(**) Our recommended speaker distance is 30 cm, a natural distance when using a hand-held device.
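
A client could pre-screen a recording against these limits before sending it. The sketch below is an assumption about how such a check might look: the function names are hypothetical, and the durations and SNR are assumed to be measured elsewhere (the service's own estimators are not described here).

```python
MIN_AUDIO_MS = 700.0      # audio length must be greater than this
MIN_SPEECH_RATIO = 0.55   # speech relative length must be greater than this
MIN_SNR_DB = 8.0          # enrollment SNR must be greater than this


def speech_relative_length(audio_ms: float, speech_ms: float) -> float:
    """Speech Relative Length = Speech Duration / Audio Duration."""
    if audio_ms <= 0:
        return 0.0
    return speech_ms / audio_ms


def enrollment_issues(audio_ms: float, speech_ms: float, snr_db: float) -> list:
    """Return the list of unmet enrollment requirements (empty if all are met)."""
    issues = []
    if audio_ms <= MIN_AUDIO_MS:
        issues.append("audio length must exceed 700 ms")
    if speech_relative_length(audio_ms, speech_ms) <= MIN_SPEECH_RATIO:
        issues.append("speech relative length must exceed 0.55")
    if snr_db <= MIN_SNR_DB:
        issues.append("SNR must exceed 8 dB")
    return issues


# Illustrative values: a 2 s recording with 1.4 s of speech at 12 dB SNR.
print(enrollment_issues(audio_ms=2000, speech_ms=1400, snr_db=12))
```

Returning a list of reasons rather than a single boolean lets the caller tell the user what to fix before recording again.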

There should only be one person speaking during the recording. To verify that the enrollment was carried out by a single person, the individual biometric templates created from the three recordings are compared.

Minimum requirements for enrollment	Threshold
If the probability of a match is less than the similarity threshold, the recording is rejected and a new recording is requested.	0.55
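
The single-speaker check can be illustrated as a pairwise comparison. This is a sketch under stated assumptions: the text does not say how the three templates are compared, so comparing every pair is an assumption, and `match_probability` is a hypothetical stand-in for the service's template comparison.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

SIMILARITY_THRESHOLD = 0.55


def failing_pairs(
    templates: Sequence,
    match_probability: Callable[[object, object], float],
    threshold: float = SIMILARITY_THRESHOLD,
) -> List[Tuple[int, int]]:
    """Return index pairs whose match probability is below the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(templates), 2)
        if match_probability(a, b) < threshold
    ]


# Toy stand-in: templates are single numbers, similarity is 1 - distance.
def toy_match(a: float, b: float) -> float:
    return 1.0 - abs(a - b)


print(failing_pairs([0.1, 0.2, 0.9], toy_match))  # third recording differs
```

An empty result means every pair met the threshold; otherwise the returned indices point at the recordings that would need to be captured again.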

5. Authentication requirements

Minimum requirements for authentication	Values
Audio length	> 700 ms
Speech relative length	> 0.55
Signal to noise ratio (SNR)	> 3 dB
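
For reference, SNR in decibels is 10·log10 of the ratio of signal power to noise power. The sketch below applies that standard definition to sample arrays; it assumes the speech and noise segments have already been separated and is not the service's own estimator. The function names are hypothetical.

```python
import math
from typing import Sequence

MIN_AUTH_SNR_DB = 3.0  # authentication requires SNR greater than this


def mean_power(samples: Sequence[float]) -> float:
    """Average of the squared sample values."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """SNR in dB = 10 * log10(P_speech / P_noise)."""
    speech_power = mean_power(speech)
    noise_power = mean_power(noise)
    if speech_power <= 0 or noise_power <= 0:
        raise ValueError("both segments need non-zero power")
    return 10.0 * math.log10(speech_power / noise_power)


def meets_authentication_snr(speech: Sequence[float], noise: Sequence[float]) -> bool:
    return snr_db(speech, noise) > MIN_AUTH_SNR_DB


# Illustrative samples: speech amplitude 1.0 against noise amplitude 0.1.
print(round(snr_db([1.0, -1.0], [0.1, -0.1]), 2))
```

A tenfold amplitude ratio is a hundredfold power ratio, which is why the example lands at about 20 dB.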

6. Metrics

Voice biometric validation is commonly applied over two channels: microphones and telephone lines.

Extracted metrics for the microphone use case (new version, noctua).

Threshold	FAR (%)	FRR (%)
0.5	0.17	3.32

Extracted metrics for the telephone use case.

Threshold	FAR (%)	FRR (%)
0.5	1	9.12

FAR (False Acceptance Rate) is the probability that the system will incorrectly accept an impostor as a legitimate user.

FRR (False Rejection Rate) is the probability that the system will incorrectly reject a legitimate user.
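
These two definitions can be turned into a small calculation over labeled comparison scores. The sketch below is illustrative only: the scores are made up, the function name is hypothetical, and it assumes a comparison is accepted when its score is greater than or equal to the threshold.

```python
from typing import Sequence, Tuple


def far_frr(
    genuine_scores: Sequence[float],
    impostor_scores: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """Return (FAR %, FRR %) at the given threshold.

    FAR: share of impostor comparisons that were accepted.
    FRR: share of genuine comparisons that were rejected.
    """
    if not genuine_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    far = 100.0 * false_accepts / len(impostor_scores)
    frr = 100.0 * false_rejects / len(genuine_scores)
    return far, frr


# Made-up scores, not measurements from the service.
genuine = [0.9, 0.8, 0.4, 0.7]
impostor = [0.1, 0.6, 0.2, 0.3, 0.05]
print(far_frr(genuine, impostor, threshold=0.5))
```

Raising the threshold lowers FAR at the cost of a higher FRR, so the operating point is a trade-off between security and convenience.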

7. Liveness detection

Minimum requirements for liveness detection	Values
Voice Speech Length for Replay Attack Detection	> 1000 ms
Voice Speech Length for Voice Clone Attack Detection	> 3000 ms
Signal to noise ratio (SNR)	> 10 dB

Recommended thresholds for liveness detection	Threshold
Liveness validation is considered successful when the liveness value is higher than the threshold.	0.5
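
The liveness limits and threshold above can be combined into a simple client-side decision helper. This is a sketch under assumptions: the names are hypothetical, the speech length and SNR are assumed to be measured elsewhere, and the way the service combines the two attack detectors is not specified here.

```python
LIVENESS_THRESHOLD = 0.5
MIN_LIVENESS_SNR_DB = 10.0
# Minimum speech length (ms) each attack detector needs, from the table above.
MIN_SPEECH_MS = {"replay": 1000.0, "voice_clone": 3000.0}


def applicable_checks(speech_ms: float, snr_db: float) -> list:
    """Return the detectors whose minimum requirements the audio satisfies."""
    if snr_db <= MIN_LIVENESS_SNR_DB:
        return []
    return [name for name, min_ms in MIN_SPEECH_MS.items() if speech_ms > min_ms]


def is_live(liveness_value: float, threshold: float = LIVENESS_THRESHOLD) -> bool:
    """Successful only when the value is strictly higher than the threshold."""
    return liveness_value > threshold


# Illustrative values: 2 s of speech at 15 dB is enough for replay detection only.
print(applicable_checks(speech_ms=2000, snr_db=15))
print(is_live(0.62))
```

Checking applicability first avoids trusting a liveness value computed on audio that is too short or too noisy for the detector in question.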