Metrics to Evaluate Entity Recognition Performance
At NeuralSpace, we believe that traditional metrics such as Precision, Recall, and macro/micro-averaged F1 scores are not the best way to evaluate and improve a NER system.
A named entity can span multiple tokens, so accuracy should be measured over the full entity rather than individual tokens. Moreover, a simple token-level F1 score ignores partial matches, as well as cases where the NER system gets the named-entity surface string right but the type wrong.
For this reason, NeuralSpace reports two unique metrics for comparing performance:
- Strict F1 Score
- Partial F1 Score
For more information on why these metrics are more comprehensive than macro- and micro-averaged F1 scores, we recommend reading more here.
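To make the distinction concrete, here is a minimal sketch of how strict and partial matching could be scored over predicted and gold entity spans. The definitions used (strict = exact boundaries and exact type; partial = any boundary overlap, type ignored) follow the common SemEval-2013 convention and are an assumption here, not necessarily NeuralSpace's exact implementation.

```python
# Sketch of strict vs. partial F1 for NER. Entities are (start, end, type)
# tuples over half-open token spans [start, end). These definitions are an
# assumption (SemEval-2013 style), not NeuralSpace's exact formulas.

def overlaps(a, b):
    # Two half-open spans overlap if neither ends before the other starts.
    return a[0] < b[1] and b[0] < a[1]

def f1(tp, n_pred, n_gold):
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def strict_f1(pred, gold):
    # Strict: boundaries AND entity type must match exactly.
    tp = len(set(pred) & set(gold))
    return f1(tp, len(pred), len(gold))

def partial_f1(pred, gold):
    # Partial: any boundary overlap with a gold span counts, type ignored.
    tp = sum(any(overlaps(p[:2], g[:2]) for g in gold) for p in pred)
    return f1(tp, len(pred), len(gold))

gold = [(0, 2, "PER"), (5, 7, "ORG")]
pred = [(0, 2, "LOC"), (5, 7, "ORG")]  # first entity: right span, wrong type
print(strict_f1(pred, gold))   # 0.5 — only the ORG entity matches exactly
print(partial_f1(pred, gold))  # 1.0 — both predicted spans overlap gold spans
```

The example shows why the two numbers diverge: a prediction with the correct surface string but the wrong type is a full error under strict matching, yet still counts under partial matching.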