
Guide to: Text Annotation Job Design

The cml:text_annotation tag allows users to create a text annotation job with a custom ontology, test questions, and aggregation.

Glossary 

    • Token - the smallest possible unit of data able to be annotated in a string, predefined for the contributor by the tokenizers, or provided by the user.
    • Span - a set of tokens (1 or more) with an assigned class label - the output of a model or a contributor judgment.
    • Tokenizer - the rules by which to split text/strings into tokens.
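
To make these three terms concrete, here is a minimal Python sketch. The regular expression below is only a toy splitting rule written for illustration; it is not how Spacy, NLTK, or Stanford NLP tokenize text, and the Token and Span classes are hypothetical names, not part of the tool.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    text: str
    start: int  # index of the token's first character in the source string
    end: int    # index one past the token's last character


@dataclass
class Span:
    tokens: List[Token]  # one or more tokens
    classname: str       # the class label assigned to those tokens


def toy_tokenize(text: str) -> List[Token]:
    """Toy tokenizer rule: runs of word characters, or single punctuation marks."""
    return [Token(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


tokens = toy_tokenize("Ada Lovelace wrote programs.")
# A span groups the first two tokens under a single (hypothetical) class label.
span = Span(tokens=tokens[:2], classname="PERSON")
```

A different tokenizer rule would yield different tokens for the same string, which is why the tokenizer choice affects what contributors can select.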

*Note: For the Secure Data option, please contact your Customer Success Manager for additional information.

Build a Job

The following CML contains the possible parameters for a text annotation job:

<cml:text_annotation data-column="{{data_column}}" name="output_column" tokenizer="spacy" source-type="text" search-url="https://www.google.com/search?q=%s" validates="required"/>

Parameters

Below are the parameters available for the cml:text_annotation tag. Some are required in the element, some can be left out.

  • source-type (required)
    • This attribute tells the tool whether to expect text or JSON.
      • If text, the tool expects a text string, and you must specify the tokenizer to use on the text, plus the language if the text is not English.
      • If JSON, the tool will attempt to access the files with tokens, spans, and predictions whenever it loads.
      • Please note: the parameter used for the source data depends on the source type (data-column for text, data-url for JSON).

  • name (required)
    • The column header in the results under which the result links will be stored.
  • validates (optional)
    • Accepts "required" (default) or "all-tokens"
      • If validates="required": Contributors must assign a class label to at least one token on each unit. The "none" class does not count as a valid class label in this case.
      • If validates="required all-tokens": Contributors must assign a class label to every token before being allowed to submit. The "none" class is considered a valid label in this case.
  • search-url (optional)
    • Include a search engine URL to link to the tool's lookup function
    • Replace the query with "%s"
      • Example: search-url="https://www.google.com/search?q=%s"
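
The two validates modes described above can be sketched as a small Python check. This is only an illustration of the rule as worded in this article, not the tool's actual implementation; the function name is hypothetical, and the sketch assumes that an unlabeled token is represented as None and that the attribute value is a space-separated list of validators.

```python
from typing import Optional, Sequence

NONE_CLASS = "none"  # the "none" class mentioned in the validates description


def passes_validation(labels: Sequence[Optional[str]],
                      validates: str = "required") -> bool:
    """Return True if one unit's token labels satisfy the validates rule.

    labels holds one entry per token: a class name, or None if the
    contributor left the token unlabeled (an assumption of this sketch).
    """
    rules = set(validates.split())
    if "all-tokens" in rules:
        # Every token needs a label; the "none" class counts as a label here.
        return all(label is not None for label in labels)
    if "required" in rules:
        # At least one token needs a label other than the "none" class.
        return any(label is not None and label != NONE_CLASS
                   for label in labels)
    return True


print(passes_validation(["PERSON", None]))                      # one real label
print(passes_validation(["none", None]))                        # only "none"
print(passes_validation(["none", "PERSON"], "required all-tokens"))
```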

If your source data is in text, you can use the following parameters:

  • data-column (required)
    • The name of the column containing the source data to be annotated.
  • tokenizer
    • This is required if source-type="text"
    • This tool accepts "Spacy" (spacy), "NLTK" (nltk), or "Stanford NLP" (stanford).
  • language (optional)
    • Sets the language of the text being tokenized; this is required if the data is non-English.
    • The available options by tokenizer are as follows (default is English):
      • Spacy: en, fr, de, pt, it, nl, es
      • NLTK: en, de, es, pt, fr, it, nl
      • Stanford NLP: en, fr, de, es
    • Example: language="fr"
  • Context (optional)
    • A larger piece of text in your source data containing the text to annotate. 

If your source data is JSON, you can use the following parameters:

  • data-url
    • The name of the column containing the links to the hosted JSON source data to be annotated.
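
As a rough illustration of how the required attributes differ by source type, the Python sketch below assembles a cml:text_annotation tag and refuses to build one when a required attribute is missing. The helper name is hypothetical, and the lowercase value "json" for source-type is an assumption (this article only confirms "text" as a literal value); please check the value against your own job before relying on it.

```python
from xml.sax.saxutils import quoteattr

# Attributes this article lists as required for each source type.
# The literal value "json" is an assumption of this sketch.
REQUIRED_BY_SOURCE = {
    "text": ("data-column", "tokenizer"),
    "json": ("data-url",),
}


def build_text_annotation_tag(source_type: str, name: str, **attrs: str) -> str:
    """Render a cml:text_annotation tag, checking the required attributes."""
    if source_type not in REQUIRED_BY_SOURCE:
        raise ValueError(f"unknown source-type: {source_type!r}")
    attributes = {"source-type": source_type, "name": name}
    # Python keyword arguments cannot contain "-", so data_column -> data-column.
    attributes.update({key.replace("_", "-"): value for key, value in attrs.items()})
    missing = [a for a in REQUIRED_BY_SOURCE[source_type] if a not in attributes]
    if missing:
        raise ValueError(f"missing required attribute(s): {', '.join(missing)}")
    rendered = " ".join(f"{key}={quoteattr(value)}" for key, value in attributes.items())
    return f"<cml:text_annotation {rendered}/>"


# "annotation" and the "text" column are placeholder names for this example.
tag = build_text_annotation_tag(
    "text", "annotation",
    data_column="{{text}}", tokenizer="spacy", language="fr",
)
print(tag)
```

The printed tag can then be pasted into the job's CML and adjusted as needed.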

Ontology

The Ontology Manager allows job owners to create and edit the ontology within a Text Annotation job. Text Annotation Jobs require an ontology to launch.

When the CML for a text annotation job is saved, the Ontology Manager link will appear at the top of the Design page.


Figure 1. Ontology Manager for Text Annotation

Ontology Manager Best Practices 

  • An ontology can contain up to 250 classes. As a best practice, however, we recommend not exceeding 16 classes in a job so that contributors can understand and process the different classes.
  • Choose from the 16 pre-selected colors or upload custom colors as hex codes via the CSV ontology upload.
  • If you uploaded model predictions as JSONs, the predicted classes should also be added to the ontology.

Upload Data

Upload data into the job as a CSV where each row represents text that will be tokenized and annotated. There are two options for uploading data:

  • Text
    • CML attribute: source-type="text"
    • File content: 
      • 1 column of text strings (required)
      • 1 column of context (optional) 
  • Links to JSONs
    • CML attribute: source-type="json"
    • File content: 
      • 1 column of links to hosted JSONs
        • Note: The bucket must be CORS-configured and publicly viewable. For more information on secure hosting, check out this article.
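
For the text option, a source file can be produced with Python's standard csv module. The sketch below is a minimal example under assumptions: the column names "text" and "context" are placeholders chosen for illustration (use whatever headers your CML refers to), and the sample row is made up.

```python
import csv
import io
from typing import Iterable, Tuple


def build_source_csv(rows: Iterable[Tuple[str, str]],
                     text_column: str = "text",
                     context_column: str = "context") -> str:
    """Return CSV content with one row per text string to be annotated."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer,
                            fieldnames=[text_column, context_column],
                            lineterminator="\n")
    writer.writeheader()
    for text, context in rows:
        # The csv module quotes any field that contains commas or quotes.
        writer.writerow({text_column: text, context_column: context})
    return buffer.getvalue()


csv_content = build_source_csv([
    ("Ada Lovelace", "Ada Lovelace wrote programs, she said."),
])
print(csv_content)
```

Writing csv_content to a .csv file gives one row per unit, with the optional context in its own column.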

*Note: Below are example files on how to structure source data.

Results

  • Each result is a link to a JSON file that contains the original text, the spans, and the classes associated with each token.
  • The links are found in the Full or Aggregated report under the column header that was specified as the value for the name attribute. 
  • Here is an example link with an overview of the schema below.
    json.jpg
    1. "text" = the original line of text from your source data that requires annotation
    2. "classname" = the class assigned
    3. "tokens" = the text associated with "classname"
      • "startIDx" = start of the annotation; indexing starts from 0
      • "endIDx" = end of the annotation
    4. Text with no class assigned will have null as its "classname"
    5. Please note the attached image displays only part of the results. Please click into the example link for full results.
    6. *Note: Secure Data Access is now available for Text Annotation result links. For more information regarding this feature, please check out this article. If you are interested in this add-on feature, please contact your Customer Success Manager.
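
Once a result JSON has been downloaded, the labeled spans can be pulled out with a few lines of Python. The sketch below is hedged: it assumes the spans sit in a top-level "spans" list and that each token carries a "text" field, neither of which is confirmed by this article, and it uses the index key names exactly as written above. The sample dictionary is made up for illustration; please compare it with a real result file and adjust the key names if they differ.

```python
from typing import Dict, List

# Key names as written in this article; verify them against a real result file.
START_KEY = "startIDx"
END_KEY = "endIDx"


def labeled_spans(result: Dict) -> List[Dict]:
    """Collect spans that have a class, skipping those with a null classname."""
    collected = []
    for span in result.get("spans", []):  # "spans" is an assumed key
        tokens = span.get("tokens", [])
        if span.get("classname") is None or not tokens:
            continue
        collected.append({
            "classname": span["classname"],
            "text": " ".join(token["text"] for token in tokens),
            "start": min(token[START_KEY] for token in tokens),
            "end": max(token[END_KEY] for token in tokens),
        })
    return collected


# Hypothetical sample in the assumed shape, not real tool output.
sample = {
    "text": "Ada Lovelace wrote programs.",
    "spans": [
        {"classname": "PERSON", "tokens": [
            {"text": "Ada", "startIDx": 0, "endIDx": 3},
            {"text": "Lovelace", "startIDx": 4, "endIDx": 12},
        ]},
        {"classname": None, "tokens": [
            {"text": "wrote", "startIDx": 13, "endIDx": 18},
        ]},
    ],
}
spans = labeled_spans(sample)
print(spans)
```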

We now have a new report type for text annotation jobs: Download All Annotations

    • This report is only available in text annotation jobs via the Results Page.

    • This report will directly download the JSON files in the job rather than links to them like the full/aggregated reports. If metadata from the full or aggregated report is needed, these reports should be downloaded separately.

    • The report will download a zip file. Once the zip is opened, you will see a new folder that contains two folders.

      • One folder contains all the aggregated judgments, labeled by unit id.

      • The other folder contains the full report judgments, labeled by judgment id.

    • This report is more secure than the aggregated report as no URLs will be generated.

Note: This report may take a while to generate and download because of the large number of data files it contains. However, the download will still be much faster than running scripts to scrape the results.
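
After downloading the zip, the JSON files can be loaded and grouped by folder with Python's standard zipfile module. This is a minimal sketch: the folder names "aggregated" and "full" and the file names in the demo archive are hypothetical, since this article does not state the exact names used inside the zip.

```python
import io
import json
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict


def load_annotations(zip_bytes: bytes) -> Dict[str, Dict[str, dict]]:
    """Group the JSON files in the archive as {folder name: {file stem: content}}."""
    grouped: Dict[str, Dict[str, dict]] = defaultdict(dict)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            path = PurePosixPath(info.filename)
            if info.is_dir() or path.suffix != ".json":
                continue
            # The parent folder separates aggregated from full judgments;
            # the stem is the unit id or judgment id.
            grouped[path.parent.name][path.stem] = json.loads(archive.read(info))
    return dict(grouped)


# Build a tiny in-memory archive with made-up names to demonstrate the call.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("job/aggregated/101.json", json.dumps({"text": "unit text"}))
    archive.writestr("job/full/9001.json", json.dumps({"text": "judgment text"}))

annotations = load_annotations(demo.getvalue())
print(sorted(annotations))
```

For a real download, read the file with open(path, "rb").read() and pass the bytes to load_annotations.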

