cml:text_annotation - Text annotation tool (Guide to: Text Annotation Job Design)

The cml:text_annotation tag allows users to create a text annotation job with a custom ontology, test questions, and aggregation.

Glossary 

  • Token - the smallest unit of data in a string that can be annotated, either predefined for the contributor by a tokenizer or provided by the user.
  • Span - a set of tokens (1 or more) with an assigned class label - the output of a model or a contributor judgment.
  • Tokenizer - the rules by which to split text/strings into tokens.
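
To make the three terms concrete, here is a minimal Python sketch. It is not how the tool's tokenizers work internally: the whitespace rule, the `Token` and `Span` class names, and the end-exclusive index convention are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    text: str
    start: int  # index of the first character, counting from 0
    end: int    # one past the last character (assumed convention)


@dataclass
class Span:
    tokens: List[Token]  # one or more tokens
    classname: str       # the class label assigned to those tokens


def whitespace_tokenize(text: str) -> List[Token]:
    """Toy tokenizer: each run of non-whitespace characters becomes one token."""
    return [Token(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


text = "Ada Lovelace lived in London"
tokens = whitespace_tokenize(text)

# A span groups tokens under one class label (labels here are hypothetical).
person = Span(tokens=tokens[0:2], classname="PERSON")
place = Span(tokens=tokens[4:5], classname="LOCATION")
```

A real tokenizer such as Spacy or NLTK applies language-specific rules (punctuation, contractions), so its token boundaries can differ from this whitespace split.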

Build a Job

The following CML contains the possible parameters for a text annotation job:

<cml:text_annotation data-column="{{data_column}}" name="output_column" tokenizer="spacy" source-type="text" search-url="https://www.google.com/search?q=%s" validates="required"/>

Parameters

Below are the parameters available for the cml:text_annotation tag. Some are required in the element; others are optional.

  • source-type
    • This attribute tells the tool whether to expect text or JSON.
      • If text, the tool expects a text string, and the tokenizer and language to use on the text must be specified.
      • If JSON, the tool will attempt to access the files with tokens, spans, and predictions whenever it loads.
      • Please note: the parameter that points to the source data depends on the source type (data-column for text, data-url for JSON).
  • data-column
    • The name of the column containing the source data to be annotated if source-type="text"
  • data-url
    • The name of the column containing the source data to be annotated if source-type="json"
  • name
    • The header of the results column in which the result links will be stored.
  • tokenizer
    • This is required if source-type="text"
    • This tool accepts "Spacy", "NLTK" or "Stanford NLP".
  • language
    • The language of the text that is being tokenized.
    • This is required if source-type="text" and the data is non-English.
    • The available options by tokenizer are as follows (default is English):
      • Spacy: en, fr, de, pt, it, nl, es
      • NLTK: en, de, es, pt, fr, it, nl
      • Stanford NLP: en, fr, de, es
  • validates
    • Accepts "required" (default) or "required all-tokens"
      • If validates="required": Contributors must assign a class label to at least one token on each unit. The "none" class does not count as a valid class label in this case.
      • If validates="required all-tokens": Contributors must assign a class label to every token before they can submit. The "none" class is considered a valid label in this case.
  • Context (optional)
    • A larger piece of text containing the text to annotate. 
    • Only applicable for source-type="text"
  • search-url (optional)
    • Include search engine URL to link the tool's lookup function
    • Replace the query with "%s"
      • Example: search-url="https://www.google.com/search?q=%s"
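
The rules above can be sketched as a small pre-flight check in Python. This is only an illustration of the documented constraints, not part of the tool: the function name `check_text_annotation_attrs` is hypothetical, and the attributes are represented as a plain dictionary rather than parsed from CML.

```python
def check_text_annotation_attrs(attrs: dict) -> list:
    """Return a list of problems; an empty list means no documented rule was violated."""
    problems = []

    # The attribute that points at the source data depends on source-type.
    source_type = attrs.get("source-type")
    if source_type == "text":
        for required in ("data-column", "tokenizer"):
            if required not in attrs:
                problems.append(f'source-type="text" needs {required}')
    elif source_type == "json":
        if "data-url" not in attrs:
            problems.append('source-type="json" needs data-url')
    else:
        problems.append('source-type must be "text" or "json"')

    # validates defaults to "required"; "all-tokens" may be added to it.
    flags = set(attrs.get("validates", "required").split())
    if not flags <= {"required", "all-tokens"}:
        problems.append("validates contains an unknown value")

    # search-url is optional, but needs the "%s" placeholder for the query.
    search_url = attrs.get("search-url")
    if search_url is not None and "%s" not in search_url:
        problems.append('search-url must contain "%s" in place of the query')

    return problems


# The attributes of the example tag shown under "Build a Job".
example = {
    "data-column": "{{data_column}}",
    "name": "output_column",
    "tokenizer": "spacy",
    "source-type": "text",
    "search-url": "https://www.google.com/search?q=%s",
    "validates": "required",
}
print(check_text_annotation_attrs(example))  # prints []
```

The check deliberately ignores the language attribute and only looks for the placeholder in search-url; it does not verify that the URL itself is well formed.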

*Note: For the secure data option, please contact your Customer Success Manager for additional information.

Ontology

The ontology manager allows job owners to create and edit the ontology within a Text Annotation job. Text Annotation Jobs require an ontology to launch.

When the CML for a text annotation job is saved, the Ontology Manager link will appear at the top of the Design page.

Figure 1. Ontology Manager for Text Annotation

Ontology Manager Best Practices 

  • An ontology is limited to 250 classes; as a best practice, however, we recommend far fewer.
  • Choose from 16 pre-selected colors, or upload custom colors as hex codes via the CSV ontology upload.
  • To ensure contributors can understand and process the different classes, we do not recommend exceeding 16 classes in a job.
  • If you uploaded model predictions as JSONs, add the predicted classes to the ontology as well.

Upload Data

Upload data into the job as a CSV where each row represents text that will be tokenized and annotated. There are two options for uploading data:

  • Text
    • CML attribute
    • File content: 
      • 1 column of text strings (required)
      • 1 column of context (optional) 
  • Links to JSONs
    • CML attribute 
    • File content: 
      • 1 column of links to hosted JSONs
        • Note: The bucket must be CORS-configured and publicly viewable. For more information on secure hosting, check out this article.
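
As a rough sketch of the two upload layouts, the Python below writes each one with the standard csv module. The column headers ("text", "context", "json_url"), the sample rows, and the example.com links are hypothetical; use whatever headers your CML attributes reference.

```python
import csv
import io


def build_csv(fieldnames, rows) -> str:
    """Serialize rows (a list of dicts) to CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Option 1: text strings, with an optional context column.
text_upload = build_csv(
    ["text", "context"],
    [
        {"text": "Ada Lovelace lived in London", "context": "Excerpt from a short biography."},
        {"text": "Alan Turing worked at Bletchley Park", "context": ""},
    ],
)

# Option 2: one column of links to hosted JSON files (placeholder URLs).
json_upload = build_csv(
    ["json_url"],
    [
        {"json_url": "https://example.com/annotations/unit-1.json"},
        {"json_url": "https://example.com/annotations/unit-2.json"},
    ],
)
```

Writing through csv.DictWriter rather than joining strings by hand keeps commas and quotes inside the text correctly escaped.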

*Note: Below are example files on how to structure source data.

Results

Results are links to JSON files that contain the original text, the spans, and the classes associated with each token.

  • Here is an example link with an overview of the schema below.
    json.jpg
    1. "text" = the original line of text from your source data that requires annotation
    2. "classname" = the class assigned
    3. "tokens" = the text associated with "classname"
      • "startIDx" = start index of the annotation; indexing starts at 0
      • "endIDx" = end index of the annotation
    4. Text without an associated class has null as its "classname"
    5. *Note: Attached image displays only part of the results. Please click into the example link for full results.
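
A minimal Python sketch of reading such a result follows. Only the field names "text", "classname", "tokens", "startIDx", and "endIDx" come from the schema overview above; the "spans" key, the nesting, the per-token "text" field, the end-exclusive indices, and the sample values are assumptions, so check them against a real result file before relying on this.

```python
import json

# Hypothetical result file content; the real layout may differ.
SAMPLE_RESULT = """
{
  "text": "Ada Lovelace lived in London",
  "spans": [
    {"classname": "PERSON",
     "tokens": [{"text": "Ada", "startIDx": 0, "endIDx": 3},
                {"text": "Lovelace", "startIDx": 4, "endIDx": 12}]},
    {"classname": null,
     "tokens": [{"text": "lived", "startIDx": 13, "endIDx": 18},
                {"text": "in", "startIDx": 19, "endIDx": 21}]},
    {"classname": "LOCATION",
     "tokens": [{"text": "London", "startIDx": 22, "endIDx": 28}]}
  ]
}
"""


def labeled_spans(result: dict) -> list:
    """Return (classname, text) pairs, skipping text that has no class (null)."""
    pairs = []
    for span in result["spans"]:
        if span["classname"] is None:
            continue  # JSON null marks text without an associated class
        words = [token["text"] for token in span["tokens"]]
        pairs.append((span["classname"], " ".join(words)))
    return pairs


result = json.loads(SAMPLE_RESULT)
print(labeled_spans(result))
```

With the sample above, this prints the PERSON and LOCATION spans and drops the unlabeled "lived in" text.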

*Note: The 'Convert finalized rows to Test Questions' feature is not currently available for the text annotation tool.

