Metrics to Evaluate Entity Recognition Performance
At NeuralSpace, we believe that traditional metrics such as Precision, Recall, and macro/micro-averaged F1 scores are not the best way to evaluate and improve a NER system.
A named entity can span multiple tokens, so accuracy should be measured over the full entity rather than individual tokens. Moreover, a simple token-level F1 score ignores partial matches, as well as cases where the NER system gets the named-entity surface string right but the type wrong.
For this reason, NeuralSpace reports two unique metrics for comparing performance:
- Strict F1 Score
- Partial F1 Score
For more information on why these metrics are more comprehensive than macro- and micro-averaged F1 scores, we recommend reading more here.
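To make the distinction concrete, here is a minimal sketch of how strict and partial matching could be scored over predicted and gold entity spans. The definitions used (strict = exact boundaries and exact type; partial = any boundary overlap, type ignored) follow the common SemEval-2013 convention and are an assumption here, not necessarily NeuralSpace's exact implementation.

```python
# Sketch of strict vs. partial F1 for NER. Entities are (start, end, type)
# tuples over half-open token spans [start, end). These definitions are an
# assumption (SemEval-2013 style), not NeuralSpace's exact formulas.

def overlaps(a, b):
    # Two half-open spans overlap if neither ends before the other starts.
    return a[0] < b[1] and b[0] < a[1]

def f1(tp, n_pred, n_gold):
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def strict_f1(pred, gold):
    # Strict: boundaries AND entity type must match exactly.
    tp = len(set(pred) & set(gold))
    return f1(tp, len(pred), len(gold))

def partial_f1(pred, gold):
    # Partial: any boundary overlap with a gold span counts, type ignored.
    tp = sum(any(overlaps(p[:2], g[:2]) for g in gold) for p in pred)
    return f1(tp, len(pred), len(gold))

gold = [(0, 2, "PER"), (5, 7, "ORG")]
pred = [(0, 2, "LOC"), (5, 7, "ORG")]  # first entity: right span, wrong type
print(strict_f1(pred, gold))   # 0.5 — only the ORG entity matches exactly
print(partial_f1(pred, gold))  # 1.0 — both predicted spans overlap gold spans
```

The example shows why the two numbers diverge: a prediction with the correct surface string but the wrong type is a full error under strict matching, yet still counts under partial matching.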