Dataset Format and Converters

Dataset Format

If you wish to upload your entire dataset to your project at once, you can use the CLI, as shown in Training Data. The dataset must be a single JSON file in a specific format. A sample dataset excerpt and its components are discussed below.

Ecommerce_excerpt.json
[
  {
    "text": "how does this work?",
    "intent": "chitchat",
    "type": "train"
  },
  {
    "text": "Should I refund to your primary account?",
    "intent": "ask_return_refund_account",
    "entities": [
      {
        "start": 24,
        "end": 39,
        "value": "primary account",
        "entity": "return_refund_account"
      }
    ],
    "type": "train"
  },
  {
    "text": "What restaurant would you recommend for dinner?",
    "intent": "chitchat",
    "type": "train"
  }
]
| Attribute | Required | Type | Limits | Description |
|---|---|---|---|---|
| text | true | str | 1000 characters | A piece of text that you want to predict the intent of. AutoNLP is trained to learn patterns from this text. |
| intent | true | str | 100 characters | The intent behind the text. |
| type | false | str | train or test | Defaults to train. AutoNLP takes two kinds of examples: train examples are used for training AutoNLP, and test examples are used for reporting the model's performance. |
| entities | false | list | - | Information from the text that you want AutoNLP to learn to extract is called an entity. More details here. |
| entities.value | true | str | - | The substring that you want to extract from the text. This is the value that AutoNLP will learn to predict while training. |
| entities.entity | true | str | 30 characters | Name of the entity that value corresponds to. |
| entities.start | true | int | positive integers only | The character index where this entity value starts in the given text. Note that we follow zero indexing. |
| entities.end | true | int | positive integers only | The character index where the entity value ends in the given text (exclusive, i.e. start plus the length of value). |
| entities.entityType | false | str | one of trainable, pre-trained, lookup, regex | Defaults to trainable, which means AutoNLP learns to predict this entity. Apart from that, we support three other entity types; more details here. For example, pre-trained means AutoNLP uses an off-the-shelf entity extractor, such as one for datetime. |
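
If you generate annotations programmatically, it helps to sanity-check the offsets against the text. Below is a minimal sketch in plain Python, using the sample example from above; note that the sample offsets 24 and 39 imply that end is exclusive (start + length of value).

compute_offsets.py
# Minimal sketch: compute zero-indexed entity offsets for the sample above.
# Assumes `end` is exclusive (start + len(value)), which matches the
# offsets (24, 39) in Ecommerce_excerpt.json.
text = "Should I refund to your primary account?"
value = "primary account"

start = text.index(value)   # 24
end = start + len(value)    # 39
assert text[start:end] == value

example = {
    "text": text,
    "intent": "ask_return_refund_account",
    "entities": [
        {"start": start, "end": end, "value": value,
         "entity": "return_refund_account"}
    ],
    "type": "train",
}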

Dataset Converters

If your dataset is in a format different from the one described above, you will have to convert it.

However, if your data is available in one of the formats mentioned below, you can use our CLI directly to convert the dataset to the format the NeuralSpace Platform expects.

tip

If you want to download and convert a Hugging Face NER dataset directly from their hub, see the Huggingface NER Data Converter section.

Rasa

Dataset sample

If you are converting a Rasa dataset to our format, you will need to give the path to your Rasa data folder in input-path. The folder should have an nlu.yml file in Rasa's YAML format as shown below:

nlu.yml
version: "2.0"
nlu:
- intent: order_status
  examples: |
    - check status of my order
    - when are my shoes coming in
    - when will they get here
    - I'd like to check the status of my order pls
    - check status
    - I'd like an update on my order
    - Check status of my order
    - I haven't received my order yet, can we check that?
    - where is my order?
    - I want to check the status of my order
    - actually can I check the status of my order
    - order status for example@rasa.com
- intent: order_cancel
  examples: |
    - i'd like to cancel my order
    - cancel my shoes
    - I changed my mind on my order
    - cancel my order

Dialogflow

Sample dataset

If you are converting a Dialogflow dataset, give the path to the exported folder in input-path without changing anything.

Luis

Sample dataset

If you are converting a Luis dataset, give the path to the folder in input-path while using the CLI command. The folder should contain a file called data.json in Luis format.

CSV

Sample dataset

It is also possible to create your dataset in a simple CSV format and then convert it into NeuralSpace format using the CLI command shown below. You will need to give the path to your CSV folder in input-path while using the CLI convert-dataset command. The folder should contain the following files; note that the file names must be exactly as mentioned below.

  • nlu.csv (Required)

    | label | text |
    |---|---|
    | ask_location | Do you live in [Delhi](location) |
    | check_human | Are you a bot? |
    | check_balance | How much money is left in your account? |

    As shown above, this file has two columns, label and text. The label column contains the intent of the text in the second column. If an entity is present in the text, it can be annotated inline as shown above; for example, Delhi is an entity of type location. A sketch that parses this annotation format follows this list.

  • lookup.csv (Optional)

    Refer to the Lookup section to know more. The file should use the following format.

    names,services
    prakash,nlu
    ayushman,ner
    shubham,translation

    Here you have to specify the names of the entities in the first row. The cells in each column are the lookup values for the entity named in that column's first row. E.g., here names and services are two lookup entities; prakash, ayushman, and shubham are values for the entity names, and nlu, ner, and translation are values for the entity services.

  • regex.csv (Optional)

    Refer to the Regex section to know more. Keep the file in the CSV folder in the following format.

    | claim_id | help |
    |---|---|
    | [a-z]{1,2}\d{5,7} | \bhelp\b |

    Here you have to specify the names of the entities in the first row. The cells in each column are regex patterns for the entity named in that column's first row. E.g., here claim_id and help are two regex entities; [a-z]{1,2}\d{5,7} is a pattern for the entity claim_id, and \bhelp\b is a pattern for the entity help.

  • synonym.csv (Optional)

    Refer to the Synonym section to know more and, if needed, use the following format.

    credit,emblem,current balance
    credit card,emblm,balance
    credit cards,embelm,full

    Here you have to specify a set of synonyms in each column. E.g., credit, credit card, and credit cards are all synonyms, and they all resolve to credit, the value specified in the first row. The same goes for the second and third columns in the above example.
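
To make the inline [value](entity) annotation in nlu.csv concrete, here is a minimal sketch of how such a row maps onto the JSON format at the top of this page. It is not part of the neuralspace CLI; the convert-dataset command below does this conversion for you, and row_to_example and the regex here are illustrative only.

csv_to_json.py
import csv
import json
import re

# Matches inline annotations such as [Delhi](location) in the text column.
ANNOTATION = re.compile(r"\[(?P<value>[^\]]+)\]\((?P<entity>[^)]+)\)")

def row_to_example(label, text):
    entities = []
    plain = ""
    cursor = 0
    for match in ANNOTATION.finditer(text):
        plain += text[cursor:match.start()]
        start = len(plain)
        plain += match.group("value")
        entities.append({
            "start": start,
            "end": len(plain),  # exclusive, as in the sample dataset above
            "value": match.group("value"),
            "entity": match.group("entity"),
        })
        cursor = match.end()
    plain += text[cursor:]
    example = {"text": plain, "intent": label, "type": "train"}
    if entities:
        example["entities"] = entities
    return example

with open("nlu.csv", newline="") as f:
    rows = csv.DictReader(f)  # expects the header row: label,text
    examples = [row_to_example(row["label"], row["text"]) for row in rows]

print(json.dumps(examples, indent=2))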

Convert Dataset Command

To convert a dataset using the CLI, you will need to install neuralspace with the full extras:

pip install neuralspace[full]

Then use the command below to convert and store the dataset. Refer to the table for all input parameters.

neuralspace nlu convert-dataset -F "rasa" -L "en" -d "PATH TO DATASET FOLDER" -o "PATH OF OUTPUT FOLDER"
| Attribute | Required | Type | Limits | Description |
|---|---|---|---|---|
| --from-platform or -F | true | str | one of dialogflow, rasa, luis, or csv | Set to the format from which you wish to convert the dataset. |
| --auto-tag-entities or -at | false | str | true or false | If the names of your entities are the same as our pre-trained entities, tags them automatically as pre-trained. |
| --language or -L | true | str | one of our supported languages | Set to the language code of the dataset. |
| --entity-mapping or -em | false | str | - | If your entity names don't match our pre-trained entity names, you can give a mapping like 'your_entity_name:ns_entity_name,...' and we will convert them to our format. This argument works best with --auto-tag-entities, since in many cases your entity names won't match our pre-trained, regex, or lookup entity names. This helps you make the best of the Platform. |
| --data-type or -t | false | str | train or test | Defaults to train. You can decide whether to tag the examples as train or test. |
| --input-path or -d | true | str | - | Path to the folder in which the dataset files are present. |
| --output-path or -o | true | str | - | Path to the folder where you wish to store the converted dataset files. |
| --ignore-missing-examples or -ime | false | str | true or false (use only if converting from csv format) | Ignores any missing examples in your data. |
| --ignore-swapped-columns or -iswp | false | str | true or false (use only if converting from csv format) | Ignores swapped columns in your nlu examples data. |
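
If you maintain the entity mapping in code, composing the --entity-mapping string is straightforward. A small sketch follows; the entity names city, person_name, location, and person are invented for illustration.

build_entity_mapping.py
# Hypothetical helper for composing the --entity-mapping string. The pair
# format "your_entity_name:ns_entity_name,..." comes from the table above;
# the entity names below are invented for illustration.
mapping = {
    "city": "location",        # your entity name -> platform entity name
    "person_name": "person",
}
em = ",".join(f"{src}:{dst}" for src, dst in mapping.items())
print(em)  # city:location,person_name:person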

Huggingface NER Data Converter

If you wish to download a Hugging Face dataset, convert it into NeuralSpace format, and save it locally, you can use our CLI to do so. Simply use the following command:

neuralspace nlu convert-huggingface-ner-dataset -hf "NAME OF HUGGINGFACE NER DATASET" -s "SUBSET 1" -s "SUBSET 2" -L multilingual -o "PATH TO OUTPUT FOLDER"
| Attribute | Required | Type | Limits | Description |
|---|---|---|---|---|
| -hf or --huggingface-dataset | true | str | NER dataset from the Hugging Face collection | Set to the name of a Hugging Face NER dataset available in their collection. |
| -s or --subset | true | str | subset present in the dataset | Set to the name of the subset you want from the dataset. The subset name can be seen in the dataset preview at the top of the page; have a look at the image below to see where to find subsets on the Hugging Face page. You can also pass multiple subsets, e.g. to select the hi and bn subsets, use -s "hi" -s "bn". |
| -L or --language | true | str | one of our supported languages | Set to the language code of the dataset. If more than one language is present in the dataset, set the language to multilingual. |
| -tr or --num-train-examples | false | int | - | Set to the number of train examples you want in your converted dataset for each subset. Do not set this parameter if you wish to keep all available examples. |
| -te or --num-test-examples | false | int | - | Set to the number of test examples you want in your converted dataset for each subset. Do not set this parameter if you wish to keep all available examples. |
| -o or --output-path | true | str | - | Path to the local folder where you wish to store the converted dataset files. |
| -me or --max-entities | false | int | - | Set to the maximum number of entities you want in each example. Examples with more entities than the selected value are discarded. Defaults to 30. |
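
If you prefer to look up subset names programmatically rather than from the dataset preview page, the Hugging Face datasets library can list them. A small sketch, assuming datasets is installed separately (it is not part of the neuralspace CLI):

list_subsets.py
# Sketch: list the subsets (configs) of a Hugging Face NER dataset
# programmatically instead of reading them off the preview page.
# Requires `pip install datasets`; this is independent of neuralspace.
from datasets import get_dataset_config_names

# wikiann is a multilingual NER dataset; swap in the dataset you need.
configs = get_dataset_config_names("wikiann")
print(configs)  # language-code subsets such as "hi", "bn", ...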

The image below shows where to find the dataset name and subsets.

[Image: hf-ner-converter-demo]

Now you can Upload this dataset directly to the NeuralSpace Platform.