Language Resources

The language resources — including dictionaries, ontologies, knowledge bases, example bases, grammars, and corpora — developed within the UNL framework were managed and produced using the UNLarium, a crowdsourcing development environment created by the UNDL Foundation. Although originally conceived within the UNL programme, the platform was designed for integration with a wide range of natural language processing (NLP) systems, extending beyond strictly UNL-based applications. It also served as a research environment for testing linguistic hypotheses and proposing constants for the description and prediction of natural language phenomena. One of its primary goals was to advance the development of a language-independent metalanguage for multilingual processing.

The UNLarium

The UNLarium was a web-based database management system that enabled accredited users to create, edit, and export entries and rules in accordance with the UNDL Foundation's standards for language engineering. It provided a collaborative environment for linguists and language professionals to contribute to the development of linguistic resources for natural language processing tasks, particularly within the UNL framework. The platform facilitated the creation of multilingual dictionaries, grammar rules, and aligned corpora, supporting both natural language analysis (NL>UNL) and generation (UNL>NL).

What does it contain?

The UNLarium houses a comprehensive collection of linguistic resources developed under the UNL framework. The platform is organized into distinct compartments — lexical databases, rule bases, and document bases — all designed to be bidirectional for natural language analysis and generation. Standardized tags and formalism ensure consistency, enabling aligned multilingual databases and promoting dialogue between different linguistic traditions and models of language description. The main components include:

Lexical Resources

  • UNL Dictionary: A list of Universal Words (UWs) representing discrete concepts. Each UW is defined, exemplified, and categorized. The UNL Dictionary is divided into three subsets:
    • UNL Core Dictionary: Contains only permanent simple UWs that are (presumably) shared by all languages.
    • UNL Abridged Dictionary: Contains all permanent UWs (simple, compound, or complex) that are shared by at least two different language families.
    • UNL Unabridged Dictionary: Contains all permanent UWs (simple, compound, or complex) that are lexicalized in at least one language.
  • UNL Ontology: A hierarchical classification of UWs into semantic categories, structured as a tree with "entity" as the root node and more specific categories as branches and leaves.
  • UNL Knowledge Base (UNL KB): A semantic network of UWs, including ontological ("a kind of", "part of") and thematic ("agent of", "place of", "time of") relations (see the sketch after this list).
  • UNL Memory (formerly UNL Encyclopaedia): A corpus-based semantic network of UWs and their contextual usage, providing insights into semantic patterns and relationships.
  • NL Dictionaries: Contain words of a given natural language, categorized by morphological and syntactic behavior.
  • NL Memories: Corpus-based datasets of a given natural language, containing collocations and usage patterns.
  • UNL>NL Dictionaries: Bilingual lexicons mapping UWs to natural language entries, used for natural language generation.
  • NL>UNL Dictionaries: Bilingual lexicons mapping natural language entries to UWs, used for natural language analysis.
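
To make these notions concrete, the sketch below models a UW and a few KB-style relations as plain Python data. It is an illustration only: the specific UW, relations, and data layout are invented for the example and are not actual UNLarium records.

  # Illustrative only: a Universal Word (headword plus semantic constraints)
  # and a few KB-style relations modelled as (relation, source, target) triples.
  uw = "book(icl>document)"  # 'book' restricted to its 'document' sense

  kb = [
      ("icl", "book(icl>document)", "document"),  # ontological: "a kind of"
      ("pof", "page", "book(icl>document)"),      # ontological: "part of"
      ("agt", "write(icl>do)", "author"),         # thematic: "agent of"
  ]

  # A toy query: everything standing in an 'icl' ("a kind of") relation.
  kinds = [(src, tgt) for rel, src, tgt in kb if rel == "icl"]
  print(kinds)  # [('book(icl>document)', 'document')]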

Grammar Resources

  • NL Grammars: Rules for natural language analysis and generation, structured across phonetics, morphology, syntax, semantics, and pragmatics. Language settings define attributes and values, and the Grammar Wizard aids in creating a base set of generic rules. Inflectional paradigms and subcategorization frames are added on demand. NL grammars are subdivided into three categories:
    • N-Grammar, or Normalization Grammar, is a set of T-rules (transformation rules) used to segment the natural language text into sentences and to prepare the input for processing.
    • T-Grammar, or Transformation Grammar, is a set of T-rules used to transform natural language into UNL or UNL into natural language.
    • D-Grammar, or Disambiguation Grammar, is a set of D-rules (disambiguation rules) used to improve the performance of transformation rules by constraining or forcing their applicability.
  • UNL Grammar: Rules and attributes of UNL, including 45 semantic relations and over 300 attributes, documented in the UNL Specifications and further detailed in the UNLwiki.

Corpora

  • UNL Corpora: Aligned at the sentence level with NL corpora. Primarily hand-produced, often with machine assistance, used for extracting dictionary and memory base entries, setting UNL standards, and testing UNLdev tools.

All resources are provided "as is" under an Attribution Share Alike (CC-BY-SA) Creative Commons license.

How did it work?

The resources in the UNLarium were created by a global community of over 2,000 linguists and language professionals, contributing to the development and validation of the UNL formalism across more than 100 languages.

The platform was designed to be as linguist-friendly as possible, targeting language specialists rather than computer experts. It did not require prior expertise in UNL or computational linguistics, though users were expected to have solid linguistic knowledge and an excellent command of their working language. Participation in the UNLarium also required accreditation through VALERIE, the Virtual Learning Environment for UNL.

Accredited users could contribute dictionary entries and grammar rules. They created assignments to reserve entries and address specific linguistic tasks, such as developing dictionaries or grammars for particular languages or projects, and could work on them at their own pace. Unfinished assignments were automatically returned after 30 days.

Contributors could act as volunteers, freelancers, institutional partners, or employees of the UNDL Foundation. Remuneration, when applicable, was determined by user level and productivity (measured in UNLdots). Paid work was limited to specific projects and languages, depending on available funding.

Editorial oversight was maintained through a double-checking mechanism involving editors and revisers. Each entry or rule created was revised twice to ensure quality, and contributors could be promoted or demoted based on their performance.

Workflow

The UNLarium employed a structured workflow to manage contributions and ensure quality control. The process involved several key steps:

  1. Project Creation: Managers created projects to define the scope and objectives of the resource development effort, as well as the list of entries to be addressed by contributors. These projects could involve several different tasks:
    • Dictionary Development: Creating or expanding bilingual or multilingual dictionaries by adding new entries or refining existing ones. These projects were normally corpus-based: managers defined a corpus (e.g., "Le Petit Prince"), extracted its wordlist, and inserted the entries into the dictionary so that they could be addressed by as many languages as possible.
    • Grammar Rule Creation: Developing grammar rules for natural language analysis and generation, including morphological, syntactic, semantic, and pragmatic aspects. These projects were also corpus-based: users were expected to provide rules either to UNLize a natural language text or to NLize a UNL document.
    • Corpus Annotation: Annotating aligned corpora to facilitate the extraction of dictionary entries and grammar rules, as well as to test and validate the UNL formalism.
  2. Assignment Creation: Accredited contributors created assignments to reserve specific entries or sets of entries within a project. Assignments allowed contributors to focus on particular linguistic tasks and provided a framework for tracking progress. Assignments were time-limited to encourage timely completion (the lifecycle is sketched after this list).
  3. Entry Creation: Contributors added dictionary entries or grammar rules according to the UNDL Foundation's standards. Each entry included definitions, examples, categorizations, and relevant linguistic information.
  4. Editorial Review: Each entry or rule underwent a double-checking process involving editors and revisers. Editors reviewed the initial contributions for accuracy and adherence to standards, while revisers provided a second layer of quality control.
  5. Feedback and Revision: Contributors received feedback from editors and revisers, allowing them to make necessary revisions to their entries or rules. This iterative process helped maintain high-quality standards across the resources.
  6. Approval and Publication: Once an entry or rule passed the editorial review process, it was approved and published within the UNLarium, making it available for use in natural language processing tasks.
  7. Progress Tracking: Contributors could monitor their progress through the platform, tracking completed assignments, pending reviews, and overall productivity measured in UNLdots.
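
As a rough summary of steps 2-6, the assignment lifecycle can be pictured as a small state machine. The Python below is a hypothetical reconstruction for illustration only; the state names and the expiry check are assumptions drawn from the description above, not the platform's actual code.

  from datetime import datetime, timedelta

  # Hypothetical reconstruction of the assignment lifecycle described above.
  STATES = ["reserved", "drafted", "edited", "revised", "published"]

  class Assignment:
      EXPIRY = timedelta(days=30)  # unfinished assignments were returned after 30 days

      def __init__(self, entry_id):
          self.entry_id = entry_id
          self.state = "reserved"   # step 2: entry reserved by a contributor
          self.created = datetime.now()

      def advance(self):
          """Move one step: author draft -> editor pass -> reviser pass -> published."""
          i = STATES.index(self.state)
          if i < len(STATES) - 1:
              self.state = STATES[i + 1]
          return self.state

      def expired(self, now):
          """True if reserved more than 30 days ago and never finished."""
          return self.state != "published" and now - self.created > self.EXPIRY

  a = Assignment("de:Hase")
  for _ in range(4):
      a.advance()
  assert a.state == "published"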

Benchmark

The lingware development process within the UNLarium was guided by the Framework of Reference for UNL (FoR-UNL), a set of guidelines and standards for assessing and classifying the linguistic resources developed within the UNL programme. Inspired by the CEFR model, FoR-UNL established reference levels and measurable descriptors to evaluate the availability and quality of dictionaries, grammars and corpora for each language.

Languages were classified into six progressive levels, according to the number of entries (base forms in the NL dictionary) and the grammatical structures covered:

  Level   Dictionary (base forms)   Grammar
  A1      > 5,000                   Morphology: NP
  A2      > 10,000                  Morphology: others
  B1      > 20,000                  Syntax: NP
  B2      > 40,000                  Syntax: VP
  C1      > 70,000                  Syntax: IP
  C2      > 100,000                 Syntax: CP

This classification allowed for a systematic evaluation of linguistic resources, facilitating the identification of gaps and areas for improvement.
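
For illustration only, the table above can be read as a cumulative checklist. The Python sketch below assumes that each level presupposes all lower ones (an assumption based on the word "progressive"); it is not an official FoR-UNL algorithm.

  # Illustrative sketch of the FoR-UNL table above; assumes requirements
  # are cumulative (each level presupposes all lower ones).
  LEVELS = [
      ("A1",   5_000, "Morphology: NP"),
      ("A2",  10_000, "Morphology: others"),
      ("B1",  20_000, "Syntax: NP"),
      ("B2",  40_000, "Syntax: VP"),
      ("C1",  70_000, "Syntax: IP"),
      ("C2", 100_000, "Syntax: CP"),
  ]

  def resource_level(base_forms, covered):
      """Highest level whose dictionary size and grammar coverage are both met."""
      level = "none"
      for name, min_forms, structure in LEVELS:
          if base_forms > min_forms and structure in covered:
              level = name
          else:
              break
      return level

  print(resource_level(25_000, {"Morphology: NP", "Morphology: others", "Syntax: NP"}))  # B1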

User Levels

Within the system, contributors were assigned levels according to their accumulated UNLdots — a unit designed to measure effort and task complexity. These levels reflected both productivity and expertise, ranging from A0 (beginner) to C2 (expert).

The seven user levels were:

  • A0, up to 5,000 UNLdots
  • A1, 5,001–15,000 UNLdots
  • A2, 15,001–30,000 UNLdots
  • B1, 30,001–50,000 UNLdots
  • B2, 50,001–75,000 UNLdots
  • C1, 75,001–100,000 UNLdots
  • C2, above 100,000 UNLdots
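
These ranges translate directly into a lookup. A minimal Python sketch (the function name and layout are assumed for the example):

  def user_level(unldots):
      """Map accumulated UNLdots to a user level (A0-C2)."""
      # Upper bounds taken from the list above; C2 is open-ended.
      thresholds = [
          (5_000, "A0"),
          (15_000, "A1"),
          (30_000, "A2"),
          (50_000, "B1"),
          (75_000, "B2"),
          (100_000, "C1"),
      ]
      for upper, level in thresholds:
          if unldots <= upper:
              return level
      return "C2"  # above 100,000 UNLdots

  assert user_level(4_999) == "A0"
  assert user_level(60_000) == "B2"
  assert user_level(150_000) == "C2"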

Permissions

Participation was free and open to individuals and institutions worldwide, and user permissions were determined by level, expertise, and accreditation (summarized in the sketch after this list):
  • Observers were allowed to browse dictionaries and grammars and navigate the system, but could not add entries or grammar rules;
  • Trainees were allowed to add entries, but only under supervision;
  • Authors (A1 required) were allowed to add entries, but could edit only their own data;
  • Editors (B1 required) could also edit authors' data, but could not edit other editors' data;
  • Revisers (C1 required) could edit editors' data, but could not edit other revisers' data;
  • Managers could edit any data, create projects, and delete entries; and
  • Supermanagers could edit the source code of the system.
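
A minimal sketch of this permission model, assuming invented capability names; it only restates the list above and is not the actual implementation:

  # Illustrative capability table for the roles listed above
  # (capability names invented for the example).
  PERMISSIONS = {
      "observer":     {"browse"},
      "trainee":      {"browse", "add_supervised"},
      "author":       {"browse", "add", "edit_own"},                                  # level A1
      "editor":       {"browse", "add", "edit_own", "edit_authors"},                  # level B1
      "reviser":      {"browse", "add", "edit_own", "edit_authors", "edit_editors"},  # level C1
      "manager":      {"browse", "add", "edit_any", "create_projects", "delete_entries"},
      "supermanager": {"edit_source_code"},
  }

  def can(role, action):
      """Return True if the given role grants the given capability."""
      return action in PERMISSIONS.get(role, set())

  assert can("editor", "edit_authors")
  assert not can("observer", "add")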

License

The data stored in the UNLarium are available under an Attribution Share Alike (CC-BY-SA) Creative Commons license. These resources are preserved as is, for reference and reuse, provided proper credit is given and derivative works are released under the same or a compatible license.

Documentation

Documentation and community discussions were originally hosted in the UNLwiki and the UNLforum, respectively. These materials are preserved here for archival purposes. No further updates or support are currently provided.

How to download and use the data

The UNLarium is a database management system: all data are stored in tables in a MySQL database. To download and use the data, export them using the built-in export functionality. The data are always exported in plain-text format, using UTF-8 encoding. You may export data from the following modules:

  • Dictionary: Enter the dictionary module, select the desired language, and use the export function to download the entries. For each language there are three types of dictionaries available for export:
    • the NL>UNL dictionary (used for natural language analysis),
    • the UNL>NL dictionary (used for natural language generation), and
    • the monolingual NL dictionary.
    You may choose to export the entire dictionary or specific subsets based on criteria such as part of speech, semantic category or project. The dictionary is exported according to the Dictionary Specs, i.e., as a list of entries in the format:
    [natural language entry] {ID} "UW" (ATTRIBUTE=VALUE, ...) <LANGUAGE CODE, FREQUENCY, PRIORITY>; COMMENTS
    You will find an example of the German>UNL dictionary for the corpus AESOP (The Hare and the Tortoise) here;
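
    As an illustration, lines in this format can be parsed with a short script. The regular expression and the sample entry below are assumptions based on the pattern above (field handling simplified), not an official parser:

    import re

    # Illustrative parser for the dictionary export format shown above
    # (simplified: assumes no unescaped brackets or quotes inside fields).
    ENTRY = re.compile(
        r'\[(?P<nl>[^\]]*)\]\s*'    # [natural language entry]
        r'\{(?P<id>[^}]*)\}\s*'     # {ID}
        r'"(?P<uw>[^"]*)"\s*'       # "UW"
        r'\((?P<attrs>[^)]*)\)\s*'  # (ATTRIBUTE=VALUE, ...)
        r'<(?P<lang>[^,>]*),\s*(?P<freq>[^,>]*),\s*(?P<pri>[^>]*)>\s*;'
        r'\s*(?P<comments>.*)$'
    )

    line = '[Hase] {1234} "hare(icl>mammal)" (POS=NOU) <deu, 0, 0>; invented sample'
    m = ENTRY.match(line)
    if m:
        attrs = dict(a.split("=", 1) for a in m["attrs"].replace(" ", "").split(",") if a)
        print(m["nl"], m["uw"], attrs)  # Hase hare(icl>mammal) {'POS': 'NOU'}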

  • Grammar: Enter the grammar module, select the desired language, and use the export function to download the grammar rules. For each language there are three types of grammars:
    • Normalization grammar (used for natural language analysis),
    • NL>UNL grammar (used for natural language analysis), and
    • UNL>NL grammar (used for natural language generation).
    You may choose to export the entire grammar or specific subsets based on criteria such as rule type or project. The grammar is exported according to the Grammar Specs, i.e., as a list of rules in one of the following formats:
    α:=β; for transformation (rewrite) rules or
    α=P; for disambiguation rules.
    You will find an example of the German>UNL transformation grammar for the corpus AESOP (The Hare and the Tortoise) here;
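
    As an illustration, an exported rule list can be split by type on the operator alone; the sketch below (with invented toy rules) is an assumption, not an official tool:

    # Illustrative classifier: transformation rules use ":=",
    # disambiguation rules use "=" followed by a priority.
    def classify_rules(text):
        t_rules, d_rules = [], []
        for raw in text.split(";"):
            rule = raw.strip()
            if not rule:
                continue
            if ":=" in rule:
                t_rules.append(rule)  # α:=β (rewrite rule)
            elif "=" in rule:
                d_rules.append(rule)  # α=P (priority rule)
        return t_rules, d_rules

    # Invented toy rules, only to exercise the two shapes:
    t, d = classify_rules("(%x)(%y):=(%y)(%x); (ART)(ART)=0;")
    assert t == ["(%x)(%y):=(%y)(%x)"] and d == ["(ART)(ART)=0"]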

  • Corpus: Enter the corpus module, select the corpus and the language, and use the export function to download the text. The corpus is exported according to the UNL Specs.
    You will find an example of the German>UNL corpus AESOP (The Hare and the Tortoise) here.

The exported data can be imported into UNLdev or UNLCore tools (IAN, SEAN, EUGENE, and NORMA) to perform natural language analysis and generation using the UNL framework. Alternatively, you may parse the text files in accordance with the relevant specifications (Dictionary Specs, Grammar Specs, and UNL Specs) and integrate them into your own natural language processing system or application.