cml:text_annotation - Text Annotation Tool

This article explains the cml:text_annotation tag in depth, along with all of its parameters. For more information on how to design a text annotation job, please visit this Success Center article.

Glossary 

  • Token - the smallest unit of data that can be annotated in a string. Tokens are either predefined for the contributor by the tokenizer or provided by the user.
  • Span - a set of one or more tokens with an assigned class label. A span is the output of a model or of a contributor judgment.
  • Tokenizer - the rules by which to split text/strings into tokens.

Build a Job

The following CML contains the possible parameters for a text annotation job:

<cml:text_annotation data-column="{{data_column}}" name="output_column" tokenizer="spacy" source-type="text" search-url="https://www.google.com/search?q=%s" validates="required"/>

Figure 1. How to Edit a Text Annotation Job via the Graphical Editor

Parameters

Below are the parameters available for the cml:text_annotation tag. Some are required; others are optional.

  • source-type (required)
    • This attribute tells the tool whether to expect text or JSON.
      • If text, the tool expects a text string, and you must specify the tokenizer (and the language, for non-English text) to use on it.
      • If JSON, the tool attempts to access the files containing tokens, spans, and predictions whenever it loads.
      • Please note: the parameter used for the source data depends on the source type (data-column for text, data-url for JSON).

  • name (required)
    • The header of the results column where the result links will be stored.
  • validates (optional)
    • Accepts "required" (default) or "all-tokens".
      • If validates="required": Contributors must assign a class label to at least one token in each unit. The "none" class does not count as a valid class label in this case.
      • If validates="required all-tokens": Contributors must assign a class label to every token before they can submit. The "none" class is considered a valid label in this case.
  • search-url (optional)
    • Include a search engine URL to link to the tool's lookup function.
    • Replace the query in the URL with "%s".
      • Example: search-url="https://www.google.com/search?q=%s"
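
As a sketch of how the parameters above fit together, the tag below requires a label on every token and links the lookup function to a search engine. It reuses the placeholder column names from the example in Build a Job; swap in the column names from your own data.

```xml
<!-- Text source; every token must receive a label ("none" counts as a label here) -->
<cml:text_annotation
  source-type="text"
  data-column="{{data_column}}"
  name="output_column"
  tokenizer="spacy"
  search-url="https://www.google.com/search?q=%s"
  validates="required all-tokens"/>
```

With validates="required" instead, a single labeled token per unit would be enough to submit.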

If your source data is text, you can use the following parameters:

  • data-column (required)
    • The name of the column containing the source data to be annotated.
  • tokenizer
    • This is required if source-type="text".
    • This tool accepts "Spacy" (spacy), "NLTK" (nltk), "Stanford NLP" (stanford), or "Split on &nbsp;" (nbsp).
      • Note: Use the nbsp tokenizer if you'd like to bring in your own tokenization via a text upload. We will create tokens based on the location of "&nbsp;" in the text. You can use this to create irregular tokens to label, such as whole sentences or partial clauses.
  • language (optional)
    • Set the language of the text being tokenized; this is required if the data is non-English.
    • The available options by tokenizer are as follows (default is English):
      • Spacy: en, fr, de, pt, it, nl, es
      • NLTK: en, de, es, pt, fr, it, nl
      • Stanford NLP: en, fr, de, es
    • Example: language="fr"
  • Context (optional)
    • A larger piece of text in your source data containing the text to annotate. 
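
The two tags below are a minimal sketch of the text-source parameters described above. The column names (french_text, presplit_text) and the output names are hypothetical, and the sample strings in the comments are only illustrations of what a source cell might contain.

```xml
<!-- French text tokenized with Spacy.
     Example cell in the hypothetical "french_text" column:
     Paris est la capitale de la France. -->
<cml:text_annotation
  source-type="text"
  data-column="{{french_text}}"
  name="french_annotation"
  tokenizer="spacy"
  language="fr"
  validates="required"/>

<!-- Pre-split text using the nbsp tokenizer.
     Example cell in the hypothetical "presplit_text" column:
     The quick brown fox&nbsp;jumps over&nbsp;the lazy dog.
     This would be expected to yield three tokens, one per clause. -->
<cml:text_annotation
  source-type="text"
  data-column="{{presplit_text}}"
  name="clause_annotation"
  tokenizer="nbsp"
  validates="required"/>
```

The nbsp variant is useful when the unit to label is larger than a word, since the token boundaries come entirely from where "&nbsp;" appears in the uploaded text.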

If your source data is JSON, you can use the following parameters:

  • data-url
    • The name of the column containing the source data to be annotated.
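
For comparison with the text example in Build a Job, a JSON-source tag might look like the sketch below. The lowercase value "json" for source-type, the column name json_url, and the output name are assumptions made for illustration; check them against your own job before use.

```xml
<!-- JSON source: data-url replaces data-column, and no tokenizer or language is set.
     "json" as the source-type value and "json_url" as the column name are assumed. -->
<cml:text_annotation
  source-type="json"
  data-url="{{json_url}}"
  name="output_column"
  validates="required"/>
```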
