SELMA Output

Models, Components & Platforms, Prototypes and Data

As part of SELMA’s efforts to increase the resources and tools available for (extremely large) language technology, we have released many of the corpora, components and prototypes created or used during the project.

Table of Contents

    Models, Components & Platforms

    | # | Components | Technology | Release Level | Availability |
    |---|---|---|---|---|
    | 1 | Speech Recognition French | ASR | Public domain | Link |
    | 2 | Speech Recognition Urdu | ASR | Public domain | Link |
    | 3 | Speech Recognition Latvian | ASR | Project internal | |
    | 4 | Speech Recognition German | ASR | Public domain | Link |
    | 5 | M2M-100 Machine Translation | Speech MT | Project internal | |
    | 6 | Textless speech-to-speech translation French-to-English | Speech MT | Public domain | Link |
    | 7 | English Monolingual Abstractive Summarization | News Summarization | For research (only) | Link |
    | 8 | Crosslingual Abstractive Summarization | News Summarization | For research (only) | |
    | 9 | Crosslingual Multidocument Extractive Summarization | News Summarization | For research (only) | Link |
    | 10 | Speech Summarization | News Summarization | For research (only) | Link |
    | 11 | PiniTree Ontology Editor | NER & NEL | Project internal | |
    | 12 | Multilingual Hierarchical nested NER | NER & NEL | Project internal | |
    | 13 | Entity representations for 20M Wikidata entities | NER & NEL | Project internal | |
    | 14 | Entity Linking | NER & NEL | Project internal | |
    | 15 | Automatic Post-Editing | (Automatic) Post Editing | Public domain | Link |
    | 16 | Speech2Text PostEditor From User Feedback (M-PHANTOM) | (Automatic) Post Editing | For research (only) | |
    | 17 | Online Crosslingual Clustering | Clustering | For research (only) | |
    | 18 | Multilingual IPTC Topic Classification | Topic Detection | Project internal | |
    | 19 | Wikipedia classification | Topic Detection | Project internal | |
    | 20 | Text To Speech for Latvian | Speech Synthesis | Project internal | |
    | 21 | Text To Speech for Brazilian | Speech Synthesis | Public domain | Link |
    | 22 | Text To Speech for Urdu | Speech Synthesis | Public domain | Link |
    | 23 | Story Segmentation | Story Segmentation | Public domain | Link |
    | 24 | Punctuation and Casing Recovery | Punctuation & Truecasing | Public domain | Link |
    | 25 | Speaker Diarization | Speaker Diarization | Project internal | |
    | 26 | Speaker Recognition (Identification) | Speaker Recognition | Public domain | Link |
    | 27 | Graph Orchestrator (Maestro) | Graph Orchestrator platform (Maestro) | Project internal | |
    | 28 | Monitio platform | Monitio platform (UC1) | Project internal | |
    | 29 | plain X platform | plain X platform (UC2) | Project internal | |
    | 30 | Use Case 0 – SELMA Open Source Platform | SELMA OSS platform (UC0) | Public domain | Link (GitHub) |

    Prototypes

    | # | Prototypes | Main objective | Release Level | Availability |
    |---|---|---|---|---|
    | 1 | Podcast Creator | Create a news podcast on the fly | Public | Link (GitHub) |
    | 2 | Diarization | Perform diarization and speech recognition | Project internal | |
    | 3 | Diversity Monitoring | Analyze Binary Gender, Age and Regional Origin | Project internal | |
    | 4 | NLP Benchmarking | Compare ASR, MT & VO | Public domain | |
    | 5 | Summarizer | Summarize Text | Public | Link (GitHub) |
    | 6 | Voices | Generate Speech | Public | Link (GitHub) |
    | 7 | Avatar Creator | Create adaptations with an animated avatar | Project internal | |

    Data

    | # | Data | Volume | Main Purpose | Release Level | Availability |
    |---|---|---|---|---|---|
    | 1 | Turkish | | Text data for NER | Project internal | |
    | 2 | Dutch | | Text data for NER | Project internal | |
    | 3 | Ukrainian | 300 docs | Text data for NER | Project internal | |
    | 4 | Russian | 160 docs | Text data for NER | Project internal | |
    | 5 | Latvian | 740 docs | Text data for NER | Public domain | Link (Clarin) |
    | 6 | Amharic | 10 hrs | Low resourced scripts for ASR | Project internal | |
    | 7 | Bengali | | Low resourced scripts for ASR | Project internal | |
    | 8 | Urdu | 10 hrs | Audio News Training data for Voices | Project internal | |
    | 9 | Brazilian Portuguese | 96 hrs | Audio News Training data for Voices | Project internal | |
    | 10 | SELMA Foundation (19 lang.) | 15,000 hrs | AV SELMA Foundation Model for ASR | Project internal | |
    | 11 | Wikipedia / Wikidata | 40 million docs | Text and labels for entity representation | Project internal | |
    | 12 | Monitio News | 300,000 / day | Datastream | Project internal | |