Guide to: Text Annotation Job Design

The cml:text_annotation tag allows users to create a text annotation job with a custom ontology, test questions, and aggregation.

Glossary 

  • Token - the smallest possible unit of annotatable data in a string, predefined for the contributor by the tokenizers, or provided by the user
  • Span - a set of tokens (1 or more) with an assigned class label - the output of a model or a contributor judgment
  • Tokenizer - the rules by which to split text/strings into tokens 
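To make the three terms concrete, here is a minimal sketch in Python. It is illustrative only: the regular expression is a stand-in for a real tokenizer (spaCy, NLTK, or Stanford NLP will split text differently), and the function names and dictionary keys are hypothetical, not the tool's output format.

```python
import re

def tokenize(text):
    """Split text into tokens, recording character offsets for each one.

    The rule used here (runs of word characters, or single punctuation
    marks) is a simplified assumption, not what the tool's tokenizers do.
    """
    return [
        {"text": m.group(), "start": m.start(), "end": m.end()}
        for m in re.finditer(r"\w+|[^\w\s]", text)
    ]

def make_span(tokens, first, last, label):
    """Group tokens[first..last] (inclusive) under a single class label."""
    return {"label": label, "tokens": tokens[first:last + 1]}

tokens = tokenize("Ada Lovelace lived in London.")
span = make_span(tokens, 0, 1, "PERSON")
print([t["text"] for t in span["tokens"]])
```

Here the tokenizer is the splitting rule, each dictionary in `tokens` is a token, and `span` is two tokens sharing one class label.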

Build a Job

The following CML contains the possible parameters for a text annotation job:

<cml:text_annotation data-column="{{data_column}}" name="output_column" tokenizer="spacy" source-type="text" search-url="https://www.google.com/search?q=%s" validates="required"/>

Parameters

Below are the parameters available for the cml:text_annotation tag. Some are always required, some are required only under the conditions noted, and those marked optional can be left out.

  • data-column
    • The name of the column containing the source data to be annotated
      • If source-type="text", this column contains text strings
      • If source-type="json", this column contains publicly viewable hosted JSON links*
  • source-type
    • This attribute tells the tool whether to expect text or JSON
      • If "text", the tool expects text strings, and you must specify the tokenizer and language to use on the text
      • If "json", the tool will attempt to access the files containing tokens, spans, and predictions each time it loads
  • name
    • The results header where the results links will be stored.
  • tokenizer
    • This is required if source-type="text"
    • This tool accepts "Spacy", "NLTK", or "Stanford NLP"
  • language
    • The language of the text that is being tokenized
    • This is required if the data is non-English
    • The available options by tokenizer are as follows (default is English):
      • Spacy: en, fr, de, pt, it, nl, es
      • NLTK: en, de, es, pt, dr, it, nl
      • Stanford NLP: en, fr, de, es, ar, zh
  • validates
    • Accepts "required" (default) or "all-tokens"
      • If validates="required": Contributors must assign a class label to at least one token on each unit. The "none" class does not count as a valid class label in this case
      • If validates="required all-tokens": Contributors must assign a class label to each token before being allowed to submit. The "none" class is considered a valid label in this case
  • Context (optional)
    • The results header where the results links will be stored.
  • search-url (optional)
    • Include a search engine URL to link to the tool's lookup function
    • Replace the query with "%s"
      • Example: search-url="https://www.google.com/search?q=%s"

*Note: For the secure data option, please contact your Customer Success Manager for additional information.
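The conditional rules above (for example, tokenizer being required when source-type="text") can be checked before pasting a tag into the job. The following Python sketch is one possible way to do that. It is not an official tool: build_text_annotation_tag is a hypothetical helper, and the column name "text" and output name "annotation" in the usage line are placeholders.

```python
from xml.sax.saxutils import quoteattr

def build_text_annotation_tag(data_column, name, source_type="text",
                              tokenizer=None, language=None,
                              validates="required", search_url=None):
    """Render a cml:text_annotation tag from the parameters listed above.

    Hypothetical helper: it only enforces the rules stated in this guide.
    """
    if source_type not in ("text", "json"):
        raise ValueError("source-type must be 'text' or 'json'")
    if source_type == "text" and not tokenizer:
        raise ValueError("tokenizer is required when source-type is 'text'")

    attrs = {
        "data-column": "{{%s}}" % data_column,  # Liquid reference to the column
        "name": name,
        "source-type": source_type,
    }
    if tokenizer:
        attrs["tokenizer"] = tokenizer
    if language:
        attrs["language"] = language
    attrs["validates"] = validates
    if search_url:
        attrs["search-url"] = search_url

    # quoteattr adds the surrounding quotes and escapes &, < and >.
    rendered = " ".join(f"{key}={quoteattr(value)}" for key, value in attrs.items())
    return f"<cml:text_annotation {rendered}/>"

tag = build_text_annotation_tag("text", "annotation", tokenizer="spacy")
print(tag)
```

Calling the helper with source_type="text" and no tokenizer raises a ValueError, which mirrors the requirement described for the tokenizer parameter.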

Ontology

The Ontology Manager allows job owners to create and edit the ontology within a Text Annotation job. Text Annotation jobs require an ontology to launch.

When the CML for a text annotation job is saved, the Ontology Manager link will appear at the top of the Design page.

Figure 1. Ontology Manager for Text Annotation

Ontology Manager Best Practices 

  • Add up to 250 classes
  • Choose from 16 pre-selected colors or upload custom colors as hex codes via the CSV ontology upload
  • To ensure contributors can understand and process the different classes, we recommend using no more than 16 classes in a job
  • If you uploaded model predictions as JSONs, the predicted classes should also be added to the ontology

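The limits above can be checked on a class list before it is entered into the Ontology Manager. The Python sketch below is a minimal illustration under stated assumptions: check_ontology is a hypothetical helper, the "#RRGGBB" hex format is assumed (this guide does not specify the exact format), and the sketch does not read or write the CSV ontology file itself.

```python
import re

MAX_CLASSES = 250          # upper limit stated above
RECOMMENDED_CLASSES = 16   # recommended ceiling stated above
HEX_COLOR = re.compile(r"^#[0-9A-Fa-f]{6}$")  # assumed "#RRGGBB" format

def check_ontology(classes):
    """Check a list of (class_name, hex_color) pairs.

    Raises ValueError for hard failures and returns a list of warnings
    for issues that are only recommendations.
    """
    if len(classes) > MAX_CLASSES:
        raise ValueError(f"{len(classes)} classes exceeds the limit of {MAX_CLASSES}")
    for name, color in classes:
        if not HEX_COLOR.match(color):
            raise ValueError(f"class {name!r} has an invalid hex color: {color!r}")

    warnings = []
    if len(classes) > RECOMMENDED_CLASSES:
        warnings.append(
            f"{len(classes)} classes is more than the recommended {RECOMMENDED_CLASSES}"
        )
    return warnings

print(check_ontology([("PERSON", "#FF0000"), ("PLACE", "#00FF00")]))
```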
Upload Data

Upload data into the job as a CSV where each row represents text that will be tokenized and annotated. There are two options for uploading data:

  • Text
    • CML attribute
    • File content: 
      • 1 column of text strings (required)
      • 1 column of context (optional) 
  • Links to JSONs
    • CML attribute 
    • File content: 
      • 1 column of link to hosted JSONs
        • Note: Bucket must be CORS configured and publicly viewable. For more information on secure hosting, check out this article
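For the text option, the upload file can be generated with a few lines of Python. This is a sketch only: the column names "text" and "context" are placeholders chosen for illustration (use whatever names your CML refers to), and build_upload_csv is a hypothetical helper.

```python
import csv
import io

def build_upload_csv(rows, text_column="text", context_column=None):
    """Return CSV content with one row per text string to annotate.

    The context column is included only when context_column is given,
    matching the optional context column described above.
    """
    fieldnames = [text_column] + ([context_column] if context_column else [])
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow({name: row[name] for name in fieldnames})
    return buffer.getvalue()

csv_text = build_upload_csv(
    [{"text": "Ada Lovelace lived in London.", "context": "Biography excerpt"}],
    context_column="context",
)
print(csv_text)
```

Using the csv module rather than joining strings by hand ensures that text containing commas, quotes, or line breaks is quoted correctly.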

*Note: Below are example files on how to structure source data. 

Results

Results are links to JSON files that contain the original text, the spans, and the classes associated with each token.

  •  Here is an example link with an overview of the schema below.
    json.jpg
    1. "text" = original line of text that requires annotation from your source data
    2. "classname" = class assigned
    3. "tokens" = text associated with "classname" 
      • "startIDx" = start of annotation, which starts from index 0 
      • "endIDx" = end of annotation
    4. Text without an associated class is indicated by null as its "classname"
    5. *Note: Attached image displays only part of the results. Please click into the example link for full results.
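Once a results file is downloaded, the labeled spans can be extracted with a short script. The Python sketch below is based on assumptions: the key names "text", "classname", "tokens", "startIDx", and "endIDx" are taken from the overview above, but the nesting (a top-level "spans" list whose entries hold a "classname" and a list of "tokens") and the sample data are hypothetical. Check the example link for the actual layout before relying on it.

```python
import json

def labeled_spans(result):
    """Return the spans that carry a class label, skipping null classnames.

    Assumes a hypothetical layout: result["spans"] is a list of objects,
    each with "classname" and a list of "tokens".
    """
    found = []
    for span in result.get("spans", []):
        tokens = span.get("tokens", [])
        if span.get("classname") is None or not tokens:
            continue  # unlabeled text is reported with a null classname
        found.append({
            "classname": span["classname"],
            "text": " ".join(token["text"] for token in tokens),
            "start": tokens[0]["startIDx"],
            "end": tokens[-1]["endIDx"],
        })
    return found

# Hypothetical sample, made up for illustration only.
sample = json.loads("""
{
  "text": "Ada lived in London",
  "spans": [
    {"classname": "PERSON",
     "tokens": [{"text": "Ada", "startIDx": 0, "endIDx": 3}]},
    {"classname": null,
     "tokens": [{"text": "lived", "startIDx": 4, "endIDx": 9}]}
  ]
}
""")
print(labeled_spans(sample))
```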

 

 

 

