Dataset Format and Converters
Dataset Format
If you wish to upload your entire dataset to your project at once, you can use the CLI to do so, as shown in Training Data.
However, the dataset must be a JSON file in a particular format. A sample dataset excerpt and its components are discussed below.
[
  {
    "text": "how does this work?",
    "intent": "chitchat",
    "type": "train"
  },
  {
    "text": "Should I refund to your primary account?",
    "intent": "ask_return_refund_account",
    "entities": [
      {
        "start": 24,
        "end": 39,
        "value": "primary account",
        "entity": "return_refund_account"
      }
    ],
    "type": "train"
  },
  {
    "text": "What restaurant would you recommend for dinner?",
    "intent": "chitchat",
    "type": "train"
  }
]
Attribute | required | type | limits | description |
---|---|---|---|---|
text | true | str | 1000 characters | A piece of text that you want to predict the intent of. AutoNLP is trained to learn patterns from this text. |
intent | true | str | 100 characters | The intent behind the text. |
type | false | str | train or test | Defaults to train. AutoNLP takes two kinds of examples: train and test. train examples are used for training AutoNLP and test examples are used for reporting the model's performance. |
entities | false | list | - | Information from the text that you want AutoNLP to learn to extract are called Entities. More details here. |
entities.value | true | str | - | The substring that you want to extract from the text. This is the value that AutoNLP will learn to predict while training. |
entities.entity | true | str | 30 characters | Name of the entity that value corresponds to. |
entities.start | true | int | Non-negative integers only | The character index where this entity value starts in the given text. Note that we follow zero indexing. |
entities.end | true | int | Non-negative integers only | The character index where the entity value ends in the given text (exclusive, i.e. start plus the length of value). |
entities.entityType | false | str | One of trainable, pre-trained, lookup, regex | Defaults to trainable, which means AutoNLP learns to predict this entity. Apart from that, we support three other entity types; for example, pre-trained means AutoNLP uses an off-the-shelf entity extractor (e.g. for datetime). More details here. |
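As a sanity check, the start and end offsets of an entity can be computed with ordinary zero-indexed string arithmetic. A minimal Python sketch for the sample above (illustrative only, not part of the CLI):

```python
import json

# Build one training example in the format described above.
text = "Should I refund to your primary account?"
value = "primary account"

# Offsets are zero-indexed characters; `end` is exclusive,
# i.e. start + len(value).
start = text.index(value)  # 24
end = start + len(value)   # 39

example = {
    "text": text,
    "intent": "ask_return_refund_account",
    "entities": [
        {"start": start, "end": end, "value": value,
         "entity": "return_refund_account"}
    ],
    "type": "train",
}

print(json.dumps([example], indent=2))
```

Slicing the text with these offsets (`text[start:end]`) should give back exactly the annotated value.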
Dataset Converters
If the format of your dataset differs from the format described above, you will have to convert it.
However, if your data is available in one of the formats mentioned below, you can directly use our CLI to convert the dataset to the format the NeuraLingo App expects.
tip
If you want to download and convert a Huggingface NER dataset directly from their model hub, see the Huggingface NER Data Converter section.
Rasa
If you are converting a Rasa dataset to our format, you will need to give the path to your Rasa data folder in input-path.
The folder should have an nlu.yml file in Rasa's YAML format as shown below:
version: "2.0"
nlu:
- intent: order_status
  examples: |
    - check status of my order
    - when are my shoes coming in
    - when will they get here
    - I'd like to check the status of my order pls
    - check status
    - I'd like an update on my order
    - Check status of my order
    - I haven't received my order yet, can we check that?
    - where is my order?
    - I want to check the status of my order
    - actually can I check the status of my order
    - order status for example@rasa.com
- intent: order_cancel
  examples: |
    - i'd like to cancel my order
    - cancel my shoes
    - I changed my mind on my order
    - cancel my order
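To make the mapping concrete, here is a minimal sketch (not the actual converter) of what each example line under a Rasa intent becomes in the JSON dataset format described at the top of this page:

```python
# Illustrative only: every example line under a Rasa intent becomes one
# entry in the JSON dataset format. The real CLI also handles entity
# annotations, lookup tables, regexes, synonyms, etc.
rasa_nlu = {
    "order_status": [
        "check status of my order",
        "when are my shoes coming in",
    ],
    "order_cancel": [
        "i'd like to cancel my order",
        "cancel my order",
    ],
}

dataset = [
    {"text": text, "intent": intent, "type": "train"}
    for intent, examples in rasa_nlu.items()
    for text in examples
]

print(dataset[0])
# {'text': 'check status of my order', 'intent': 'order_status', 'type': 'train'}
```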
Dialogflow
If converting a Dialogflow dataset, directly give the path of the folder in input-path without changing anything.
Luis
If converting a Luis dataset, directly give the path of the folder in input-path while using the CLI command. The folder should contain a file called data.json in Luis format.
CSV
It is also possible to create your dataset in a simple CSV format and then convert it into NeuralSpace format using the CLI command shown below.
You will need to give the path of your CSV folder in input-path while using the CLI convert-dataset command. The folder should have the following files.
Note that the names of the files should be exactly as mentioned below.
nlu.csv (Required)

label | text |
---|---|
ask_location | Do you live in [Delhi](location) |
check_human | Are you a bot? |
check_balance | How much money is left in your account? |

As shown above, this file will have two columns, namely label and text. The label column contains the intent of the text present in the second column. If an entity is present in the text, it can be annotated as shown above; for example, Delhi is an entity of type location.
lookup.csv (Optional)
Refer to the Lookup section to know more. They should be present in the following format.

names,services
prakash,nlu
ayushman,ner
shubham,translation

Here you have to specify the names of the entities in the first row. Cells in every row for a given column represent the lookup values for the entity specified in the first row of that column. E.g., here names and services are two lookup entities: prakash, ayushman, and shubham are values for entity names, and nlu, ner, and translation are values for entity services.
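Because the lookup file is laid out column-wise, each column has to be read top to bottom to recover one entity's value list. A minimal sketch (illustrative only; the CLI does this for you):

```python
import csv

# Sample lookup.csv content from the section above: the first row holds
# the entity names, and each column holds that entity's lookup values.
sample = "names,services\nprakash,nlu\nayushman,ner\nshubham,translation\n"
with open("lookup.csv", "w", newline="") as f:
    f.write(sample)

# Read it back column-wise into {entity_name: [values]}.
with open("lookup.csv", newline="") as f:
    rows = list(csv.reader(f))

lookups = {name: [] for name in rows[0]}
for row in rows[1:]:
    for name, value in zip(rows[0], row):
        if value:  # columns may have different lengths
            lookups[name].append(value)

print(lookups)
# {'names': ['prakash', 'ayushman', 'shubham'], 'services': ['nlu', 'ner', 'translation']}
```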
regex.csv (Optional)
Refer to the Regex section to know more. Keep the file in the CSV folder in the given format.

claim_id | help |
---|---|
[a-z]{1,2}\d{5,7} | \bhelp\b |

Here you have to specify the names of the entities in the first row. Cells in every row for a given column represent a regex pattern for the entity specified in the first row of that column. E.g., here claim_id and help are two regex entities: [a-z]{1,2}\d{5,7} is a pattern for entity claim_id, and \bhelp\b is a pattern for entity help.
synonym.csv (Optional)
Refer to the Synonym section to know more and, if needed, use the following format.

credit,emblem,current balance
credit card,emblm,balance
credit cards,embelm,full

Here you have to specify a set of synonyms in each column. E.g., credit, credit card, and credit cards are all synonyms, and they all resolve to credit, which is the value specified in the first row. The same goes for the second and third columns in the above example.
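The resolution rule above (every value in a column maps to that column's first-row value) can be sketched in a few lines of Python. This is illustrative only; the platform applies the mapping for you:

```python
import csv

# Sample synonym.csv content from the section above: each column is one
# synonym set; every value in a column resolves to its first-row value.
sample = ("credit,emblem,current balance\n"
          "credit card,emblm,balance\n"
          "credit cards,embelm,full\n")
with open("synonym.csv", "w", newline="") as f:
    f.write(sample)

with open("synonym.csv", newline="") as f:
    rows = list(csv.reader(f))

canonical = rows[0]  # first row holds the canonical values
resolve = {}
for row in rows:
    for i, value in enumerate(row):
        if value:
            resolve[value] = canonical[i]

print(resolve["credit cards"])  # credit
print(resolve["embelm"])        # emblem
```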
Convert Dataset Command
- CLI
To convert a dataset using the CLI, you will need to install neuralspace again using:
pip install neuralspace[full]
Then, use the command below to convert and store the dataset. Refer to the table to see all the input parameters.
neuralspace nlu convert-dataset -F "rasa" -L "en" -d "PATH TO DATASET FOLDER" -o "PATH OF OUTPUT FOLDER"
Attribute | required | type | limits | description |
---|---|---|---|---|
--from-platform or -F | true | str | one of dialogflow, rasa, luis or csv | Set to the format from which you wish to convert the dataset. |
--auto-tag-entities or -at | false | str | true or false | If the names of your entities are the same as our pre-trained entities, they are automatically tagged as pre-trained. |
--language or -L | true | str | one of our supported languages | Set to the language code of the dataset. |
--entity-mapping or -em | false | str | - | If your entity names don't match our pre-trained entity names, you can give a mapping like this: 'your_entity_name:ns_entity_name,...' and we will convert them to our format. This argument works best with the --auto-tag-entities argument, since in many cases your entity names will not match our pre-trained, regex, or lookup entity names. This helps you make the best out of the Platform. |
--data-type or -t | false | str | train or test | Defaults to train. You can decide whether you want to tag the examples as train or test. |
--input-path or -d | true | str | - | Path to the folder in which the dataset files are present. |
--output-path or -o | true | str | - | Path to the folder where you wish to store the converted dataset files. |
--ignore-missing-examples or -ime | false | str | true or false (use only if converting from csv format) | Ignores any missing examples in your data. |
--ignore-swapped-columns or -iswp | false | str | true or false (use only if converting from csv format) | Ignores swapped columns in your nlu examples data. |
Huggingface NER Data Converter
If you wish to download a Huggingface dataset, convert it into NeuralSpace format, and save it locally, you can use our CLI to do so. Simply use the following command:
neuralspace nlu convert-huggingface-ner-dataset -hf "NAME OF HUGGINGFACE NER DATASET" -s "SUBSET 1" -s "SUBSET 2" -L multilingual -o "PATH TO OUTPUT FOLDER"
Attribute | required | type | limits | description |
---|---|---|---|---|
-hf or --huggingface-dataset | true | str | NER dataset from the Huggingface collection | Set to the name of a Huggingface NER dataset available in their collection. |
-s or --subset | true | str | subset present in the dataset | Set to the name of the subset you want from the dataset. The subset name can be seen in the dataset preview at the top of the page. Have a look at the image below to see where to find subsets on the Huggingface page. You can also pass multiple subsets; e.g., to select the hi and bn subsets, use -s "hi" -s "bn" |
-L or --language | true | str | one of our supported languages | Set to the language code of the dataset. If more than one language is present in the dataset, set the language to multilingual. |
-tr or --num-train-examples | false | int | - | Set to the number of train examples you want in your converted dataset for each subset. Do not set this parameter if you wish to choose all available examples. |
-te or --num-test-examples | false | int | - | Set to the number of test examples you want in your converted dataset for each subset. Do not set this parameter if you wish to choose all available examples. |
-o or --output-path | true | str | - | Path to the local folder where you wish to store the converted dataset files. |
-me or --max-entities | false | int | - | Set to the maximum number of entities you want in each example. Examples having more entities than the selected value will be discarded. By default, the value is set to 30. |
The image below shows where to find the dataset name and subsets.
Now you can Upload this dataset directly on the NeuralSpace Platform.