SELMA Output

Data, Component and Prototype Releases

As part of SELMA’s efforts to increase the resources and tools available for (extremely large-scale) language technology, we have released many of the corpora, components, and prototypes created or used during the project.

    Data

    # | Data | Volume | Main Purpose | Release Level* | Availability
    1 | Turkish |  | Text data for NER | Project internal |
    2 | Dutch |  | Text data for NER | Project internal |
    3 | Ukrainian | 300 docs | Text data for NER | Project internal |
    4 | Russian | 160 docs | Text data for NER | Project internal |
    5 | Latvian | 740 docs | Text data for NER | Public domain | Link (Clarin)
    6 | Amharic | 10 hrs | Low-resourced scripts for ASR | Project internal |
    7 | Bengali |  | Low-resourced scripts for ASR | Project internal |
    8 | Urdu | 10 hrs | Audio News Training data for Voices | Project internal |
    9 | Brazilian Portuguese | 96 hrs | Audio News Training data for Voices | Project internal |
    10 | SELMA Foundation (19 lang.) | 15000 hrs | AV SELMA Foundation Model for ASR | Project internal |
    11 | Wikipedia / Wikidata | 40 million docs | Text and labels for entity representation | Project internal |
    12 | Monitio News | 300,000 / day | Datastream | Project internal |

    Models, Components & Platforms

    # | Components | Technology | Release Level | Availability
    1 | Speech Recognition French | ASR | Project internal |
    2 | Speech Recognition Urdu | ASR | Project internal |
    3 | Speech Recognition Latvian | ASR | Project internal |
    4 | Speech Recognition German | ASR | Public domain | Link
    5 | M2M-100 Machine Translation | Speech MT | Project internal |
    6 | English Monolingual Abstractive Summarization | News Summarization | Project internal |
    7 | Crosslingual Abstractive Summarization | News Summarization | Project internal |
    8 | Crosslingual Multidocument Extractive Summarization | News Summarization | Project internal |
    9 | Speech Summarization | News Summarization | Project internal |
    10 | PiniTree Ontology Editor | NER & NEL | Project internal |
    11 | Multilingual Hierarchical nested NER | NER & NEL | Project internal |
    12 | Entity representations for 20M Wikidata entities | NER & NEL | Project internal |
    13 | Entity Linking | NER & NEL | Project internal |
    14 | Automatic Post-Editing | (Automatic) Post Editing | Public domain | Link
    15 | Speech2Text PostEditor From User Feedback | (Automatic) Post Editing | Project internal |
    16 | Online Crosslingual Clustering | Clustering | Project internal |
    17 | Multilingual IPTC Topic Classification | Topic Detection | Project internal |
    18 | Wikipedia classification | Topic Detection | Project internal |
    19 | Text To Speech for Latvian | Speech Synthesis | Project internal |
    20 | Story Segmentation | Story Segmentation | Public domain | Link
    21 | Punctuation and Casing Recovery | Punctuation & Truecasing | Public domain | Link
    22 | Speaker Diarization | Speaker Diarization | Project internal |
    23 | Speaker Recognition (Identification) | Speaker Recognition | Public domain | Link
    24 | Graph Orchestrator (Maestro) | Graph Orchestrator platform (Maestro) | Public domain | Link
    25 | Monitio platform | Monitio platform (UC1) | Project internal |
    26 | plain X platform | plain X platform (UC2) | Project internal |
    27 | Use Case 0 – SELMA Open Source Platform | SELMA OSS platform (UC0) | Public domain | Link (GitHub)

    Prototypes

    # | Prototypes | Main objective | Release Level | Availability
    1 | Podcast Creator | Create a news podcast on the fly | Public | Link (GitHub)
    2 | Diarization | Perform diarization and speech recognition | Project internal |
    3 | Diversity Monitoring | Analyze Binary Gender, Age and Regional Origin | Project internal |
    4 | NLP Benchmarking | Compare ASR, MT & VO | Public domain |
    5 | Summarizer | Summarize Text | Public | Link (GitHub)
    6 | Voices | Generate Speech | Public | Link (GitHub)
    7 | Avatar Creator | Create adaptations with an animated avatar | Project internal |