TAPE workshop on the curation and preservation of
audiovisual collections
University of Glasgow, Scotland, UK
Monday 12th – Friday 16th May 2008
Metadata and Documentation
Giorgio Dimino
RAI Research Centre
[email protected]
Centro Ricerche e Innovazione Tecnologica
The two main objectives of archive management
Preservation
  Keep assets alive
Access
  Make content available to users and customers
Is digital preservation sufficient to improve access?
Generally NO!
Access performance is driven by two main factors:
1. the time needed to retrieve and select content
2. the time needed to deliver the selected content to the user in the requested format
Factor 1 is often the most critical
The larger the collection, the more difficult selection becomes
Need for descriptive metadata and a proper retrieval system
Identify user needs
• Analyse the “business” and define use cases
• Define access granularity
• Identify the basic entities of the model (objects)
• Define preferred search criteria
• Consider the possible need for other access methods
  • Thematic dossiers
  • Showcases
• Identify interoperability issues with other systems
Who are the users?
• Professionals
  • Often looking for specific things
  • Can handle complex data models
  • Prefer precision to simplicity
• General public
  • Not necessarily technology fans
  • Need a simple interface and search tools (e.g., Google)
  • May need help (proactive systems)
Access granularity
• Access granularity depends on the content genre and foreseen usage
• E.g., RAI strategy
  • Main TV programmes: programme item
  • Fiction/Movies: programme
  • News: news story
  • Sport:
  • Radio: programme item
  • Music: track
• Proper documentation models must be designed and enforced to support the required access granularity
• This may require re-documenting the entire archive content
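The RAI strategy above can be read as a lookup from genre to documentation unit. A minimal sketch, assuming the names `ACCESS_GRANULARITY` and `granularity_for` (both invented for illustration); the slide gives no value for Sport, so it is left undefined here rather than guessed:

```python
from typing import Optional

# Granularity per genre, copied from the RAI example on the slide.
# "Sport" is None because the slide leaves its value blank.
ACCESS_GRANULARITY: dict[str, Optional[str]] = {
    "Main TV programmes": "programme item",
    "Fiction/Movies": "programme",
    "News": "news story",
    "Sport": None,
    "Radio": "programme item",
    "Music": "track",
}


def granularity_for(genre: str) -> str:
    """Return the documentation unit for a genre, or raise if none is defined."""
    unit = ACCESS_GRANULARITY.get(genre)
    if unit is None:
        raise KeyError(f"no access granularity defined for genre {genre!r}")
    return unit


print(granularity_for("News"))  # news story
```

A documentation tool could use such a table to decide at which level (programme, item, story, track) a new record must be created.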
Browsing
• Selection often cannot be accomplished without viewing the results of retrieval (a survey of the RAI archive shows that about 80% of tape handling was due to viewing requests)
• Viewing high-res digitised content or master analogue media is very expensive and time-consuming
• Multimedia documentation, based on key frames and low-res (e.g., MPEG4) copies, provides cheap and fast selection of footage, in the context of retrieval and on the user's desktop
• An order can be issued automatically to the central archive to download the selected footage to the user
Archive duality
• Documentation maps to editorial entities
  • Programs, collections, items, …
  • Editorial entities must be mapped onto essence copies
  • E.g., a copy of program “abcd” is contained on tape 1234 from TC 00:01:00:00 to TC 00:10:00:00
• Essence maps to physical media or files
  • Several essence versions can co-exist at the same time
  • E.g., original Beta SP tape, digital master, low res MPEG4, etc.
  • If they are time-aligned, any one of them can be used as a proxy for the others
• Documentation to Essence is a one-to-many relationship
  • The same documentation applies to several essence versions
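The one-to-many relationship between documentation and essence can be sketched with two small record types. This is an illustration only: the class names, the field names, the low-res file name and its timecodes are assumptions; only programme "abcd" on tape 1234 comes from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class EssenceCopy:
    carrier: str  # physical medium or file, e.g. a tape number
    fmt: str      # e.g. "Beta SP", "MPEG4 low res"
    tc_in: str    # start timecode on the carrier
    tc_out: str   # end timecode on the carrier


@dataclass
class EditorialEntity:
    title: str
    # One editorial entity, many essence copies.
    copies: list[EssenceCopy] = field(default_factory=list)

    def copies_in_format(self, fmt: str) -> list[EssenceCopy]:
        """Return the essence copies stored in the given format."""
        return [c for c in self.copies if c.fmt == fmt]


programme = EditorialEntity("abcd")
# From the slide: tape 1234, TC 00:01:00:00 to TC 00:10:00:00.
programme.copies.append(
    EssenceCopy("tape 1234", "Beta SP", "00:01:00:00", "00:10:00:00")
)
# Hypothetical browsing copy, invented for the example.
programme.copies.append(
    EssenceCopy("abcd_lowres.mp4", "MPEG4 low res", "00:00:00:00", "00:09:00:00")
)

print(len(programme.copies))  # 2
```

The documentation record is written once on `EditorialEntity`; adding a new essence version only appends another `EssenceCopy`.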
Linking elements
• Time references
  • Real-world time representation
    • Time unit count since a reference date
    • Gregorian date and time of day
  • Media stream time
    • Offset and duration in frame/sample units
• Media locators
  • URL
  • Physical positions
• Object references
  • Object unique identifiers
    • UMID
    • UPID
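Media stream time expressed as offset and duration in frame units can be derived from a pair of timecodes. A minimal sketch, assuming non-drop-frame timecode at 25 frames per second (the frame rate is an assumption, not stated on the slide) and hypothetical function names:

```python
FPS = 25  # assumption: PAL frame rate, non-drop-frame timecode


def timecode_to_frames(tc: str, fps: int = FPS) -> int:
    """Convert an HH:MM:SS:FF timecode into a frame count from 00:00:00:00."""
    hours, minutes, seconds, frames = (int(part) for part in tc.split(":"))
    if frames >= fps:
        raise ValueError(f"frame field {frames} is not valid at {fps} fps")
    return ((hours * 60 + minutes) * 60 + seconds) * fps + frames


def offset_and_duration(tc_in: str, tc_out: str, fps: int = FPS) -> tuple[int, int]:
    """Return (offset, duration) in frames for a segment on a carrier."""
    start = timecode_to_frames(tc_in, fps)
    end = timecode_to_frames(tc_out, fps)
    return start, end - start


# Timecodes of the programme "abcd" example from the Archive duality slide.
print(offset_and_duration("00:01:00:00", "00:10:00:00"))  # (1500, 13500)
```

Frame offsets of this kind are what allow time-aligned essence versions to stand in for one another.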
Metadata management criticalities
• Documentation models
  • Driven by internal requirements
  • No single standard
• Documentation costs and quality
  • Manual annotation is expensive and time-consuming
  • Subjectivity must be avoided
  • Automatic content analysis can be helpful but is still experimental
• Data models
  • They are the implementation of a documentation model
  • They must be designed so that the retrieval requirements can be implemented
Documentation strategies
[Diagram: a Collection broken down into Programmes 1 to 3, Items, Shots and segments]
• Hierarchy of documentation entities
  • Time relations between entities must be exploited in retrieval
  • Each entity has a set of attributes attached
  • Rigid structure, extensibility limited to attributes
  • Implementation can be optimised for retrieval
• Stratification of timed documentation attributes
  • A single documentation entity represents the programme
  • Very flexible and extensible
  • Difficult to exploit in retrieval
*From EBU P/FTA Future Television Archives report
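The two strategies can be contrasted with a small sketch. All class names, function names and sample values below are assumptions made for illustration; times are frame offsets.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hierarchical strategy: typed entities nested inside each other."""
    kind: str   # "programme", "item", "shot", ...
    start: int  # first frame covered
    end: int    # first frame no longer covered
    attributes: dict[str, str] = field(default_factory=dict)
    children: list["Entity"] = field(default_factory=list)


def entities_at(root: Entity, frame: int) -> list[Entity]:
    """Walk the tree and return every entity whose span covers the frame."""
    hits = [root] if root.start <= frame < root.end else []
    for child in root.children:
        hits.extend(entities_at(child, frame))
    return hits


@dataclass
class Stratum:
    """Stratified strategy: free-form timed attributes on one programme."""
    name: str
    value: str
    start: int
    end: int


def strata_at(strata: list[Stratum], frame: int) -> list[Stratum]:
    """Return every timed attribute that is active at the given frame."""
    return [s for s in strata if s.start <= frame < s.end]


shot = Entity("shot", 100, 200)
item = Entity("item", 0, 400, children=[shot])
programme = Entity("programme", 0, 1000, children=[item])

strata = [
    Stratum("topic", "politics", 0, 400),
    Stratum("speaker", "anchor", 100, 200),
    Stratum("location", "studio", 500, 900),
]

print([e.kind for e in entities_at(programme, 150)])  # ['programme', 'item', 'shot']
print([s.name for s in strata_at(strata, 150)])       # ['topic', 'speaker']
```

In the hierarchy, new information has to fit an existing entity type; in the stratified form, any new attribute is just another `Stratum`, at the cost of less structure for the retrieval system to rely on.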
Data model requirements
• Interoperability with existing standards
  • EBU P/META, ISO MPEG7, Dublin Core, SMPTE MXF and DMS1
• Clean separation between editorial and material-related information
• Definition of the basic entities and relations
• Incremental definition of specialised entities and attributes according to need
Comparison between users' data model entities
(from PrestoSpace Deliverable D15.1)
[Table: entity levels 1 to 6 compared across RAI-DM, INA-DM, ORF-FARAO, DR-DM, BBC-SMEF, MXF-DMS1, P_META and DC. The cell layout was lost in extraction. Entity names that survive: Collection, Brand, Programme-Group, Programme, Program, Main/Single Programme, Production, Production framework, Item, Programme Item, Program Item (“Contribution”), Scene framework, MOB (Media Object), MOI (Media Object Instance), MIN (Media Object Instance).]
Dublin Core vs. P/META
Dublin Core element: P_META (EBU) attributes
Title: A59, A61, A99, A107, A110, A114, A146, A198, A401
Creator: A81, A82, A83, A87, A88, A89, A90, A125, A254, A255, A256, A413, A414
Subject and Keywords: difficult to define; Coverage could be used here
Description: /
Publisher: A81, A82, A83, A87, A88, A89, A90, A125, A254, A255, A256, A413, A414
Contributor: A11, A81, A82, A83, A87, A88, A89, A90, A125, A254, A255, A256, A413, A414
Date: A152, A217, A218, A219, A22, A367, A368, A405
Resource Type: A12, A13, A226
Format: A72, A73, A222, A361
Resource Identifier: A105
Source: A223, A224
Language: A21, A22, A65, A66, A141, A407, A415, A416, A417, A418
Relation Resources: /
Coverage: A1, A9, A21, A22, A38, A67, A123, A141, A207, A214
Rights Management: A14, A15, A18, A19, A20, A116, A117, A118, A119, A120, A121, A122, A162, A200, A201, A202, A203, A204, A205, A206, A212, A421, A422
Example, dc:title --> Program-Title, Program-Sub-Title, Program-Working-Title, Program-Episode-Title, Item-Title, Item-Sub-Title, PGR-Title, PGR-Sub-Title, PGR-Working-Title
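Part of the table can be written as a mapping to show why converting P/META to Dublin Core and back is ambiguous. A minimal sketch: the attribute codes are copied from the slide, while the names `DC_TO_PMETA` and `dc_elements_for` are assumptions.

```python
# Attribute codes shared by Creator, Publisher and Contributor on the slide.
AGENT_CODES = [
    "A81", "A82", "A83", "A87", "A88", "A89", "A90",
    "A125", "A254", "A255", "A256", "A413", "A414",
]

# Subset of the Dublin Core to P_META table.
DC_TO_PMETA: dict[str, list[str]] = {
    "Creator": list(AGENT_CODES),
    "Publisher": list(AGENT_CODES),
    "Contributor": ["A11"] + AGENT_CODES,
    "Resource Identifier": ["A105"],
    "Source": ["A223", "A224"],
}


def dc_elements_for(pmeta_code: str) -> list[str]:
    """Return every Dublin Core element that a P_META attribute may map to."""
    return sorted(dc for dc, codes in DC_TO_PMETA.items() if pmeta_code in codes)


print(dc_elements_for("A81"))   # ['Contributor', 'Creator', 'Publisher']
print(dc_elements_for("A105"))  # ['Resource Identifier']
```

A81 alone does not say which Dublin Core element to fill, so this part of the mapping cannot be automated without extra context such as the contribution role.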
European Digital Library project
• An eContentplus project addressing the integration of the bibliographic catalogues and digital collections of most of the European national libraries
• The main target is libraries, but museums, archives and AV collections are also included
• Contributors are encouraged to make their metadata available following the OAI (Open Archives Initiative) guidelines
Open Archives Initiative
• The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content
• The Protocol for Metadata Harvesting (OAI-PMH) specifies methods for accessing heterogeneous collections and requires Dublin Core compatibility as the minimum metadata dissemination level
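A harvester asks a repository for records through plain HTTP requests. The sketch below only builds the request URL, using the protocol's `ListRecords` verb with the mandatory `oai_dc` (Dublin Core) metadata prefix; the base URL and the function name are hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    set_spec: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional selective harvesting by set.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# archive.example.org is a placeholder, not a real repository.
print(list_records_url("https://archive.example.org/oai"))
# https://archive.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
```

Because every repository must answer `oai_dc` requests, an archive with a richer internal model still needs a mapping down to Dublin Core to take part.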
EDOB – the PrestoSpace data model
• Programme descriptive metadata
  • Based on the P/META schema
• Timed metadata
  • Based on MPEG7 Temporal Decomposition
• Programme – Material associations
  • Custom structure
EDOB structure
EDOB subclasses
Data model and data formats (XML)
E1:EditorialObject (root element / wrapper)
• Identification information
  • titles
  • identifiers
  • contributions
    • Contribution (role type): P1:Person, P2:Person, P3:Person, O1:Organisation
  • publications
    • Publication Event (datetime, etc.): S1:PublicationService, S2:PublicationService
  • other
Data model and data formats
E1:EditorialObject (root element / wrapper)
• Identification information
• Material realisations
  • M1:Material, S1:storage/file
  • M2:Material, S2:storage/file
Data model and data formats
E1:EditorialObject (root element / wrapper)
• Identification information
• Material realisations
• Editorial partitions and views
  • E2:EditorialPart, E3:EditorialPart, E4:EditorialPart, E5:EditorialPart on a timeline
  • v1:shot entries, each with a k1:keyframe (repeated along the timeline)
  • t1:transcription entries (repeated along the timeline)
• R1:RelatedSource, R2:RelatedSource
• Topic
• Named Entities: T1:Time, L1:Location, O1:Organisation, P1:Person
Data model and data formats
E1:EditorialObject (root element / wrapper)
• Identification information
• Material realisations
• Editorial partitions and views
  • E2:EditorialPart, E3:EditorialPart, E4:EditorialPart, E5:EditorialPart on a timeline
• R1:RelatedSource, R2:RelatedSource
• Topic
• Named Entities: T1:Time, L1:Location, O1:Organisation, P1:Person
Layers labelled on the diagram: Editorial parts; Shots & other video segmentation; Speech transcription & other audio segmentation; Content related information; Enrichment information
Schema of PrestoSpace document format (XML)
Root element / wrapper
Ad hoc structures
Identification and Language information
Material realisations
P_META sets
Editorial partitions and views
Content related information
Enrichment information
Ancillary Data
MPEG7 profile nodes
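The overall shape of such a document, a wrapper with one child per section, can be sketched with the Python standard library. The element and attribute names below are invented for illustration; they are not the element names of the PrestoSpace schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names, one per section listed on the slide.
SECTIONS = [
    "Identification",
    "MaterialRealisations",
    "EditorialPartitions",
    "ContentRelatedInformation",
    "EnrichmentInformation",
    "AncillaryData",
]


def build_skeleton(object_id: str) -> ET.Element:
    """Create an empty wrapper element with one child per document section."""
    root = ET.Element("EditorialObject", {"id": object_id})
    for name in SECTIONS:
        ET.SubElement(root, name)
    return root


doc = build_skeleton("E1")
print(ET.tostring(doc, encoding="unicode"))
```

Each processing stage can then fill in its own section without touching the others.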
XML schema composition
imports
PMETA 2.0 XML Schema
imports
MAD XML Schema
imports
MPEG7 DAVP profile XML Schema
Core Platform Definitions XML Schema
Automatic content analysis
• Which features?
  • Video
    • Colour
    • Shape
    • Texture
    • Motion
  • Audio
    • Sound effects
    • Instrument description
    • Speech recognition
• Why?
  • Segmentation
    • Temporal
    • Spatial
  • Documentation
    • Automatic documentation
    • Aid to the documentalist
  • Query by example
The Preservation Factory
• Migration units for preservation services
  • Fast, efficient, affordable
  • Using automation/process optimisation
  • Centralised, delocalised, and/or mobile units…
• Role of PrestoSpace:
  • Ensure the take-up of these services
    Key technology development, communication, labelling, encouraging/assisting investors & users...
Work Breakdown
Analogue documents
Film, video, audio
Preservation
Playback Devices
Robotics and Automation
Media Condition Assessment
Restoration
Storage
System Tools
Visual & Audio
Algorithms & Subsystems
Integration & Evaluation
Mass storage
cost management
life cycle management
Delivery & Access
Turnkey System
Export System
Integration
Preserved & digitized collections
Archive Management
Preservation and access business case planning
Preservation project management tools
Metadata
Discovery & Structuring
Public Access
Delivery and Exchange
PrestoSpace Factory
[Diagram labels:]
• Archive Transactions
  • essence (master, lower quality)
  • metadata (legacy, tech, enhanced)
• PrestoSpace Orchestrator
• original media
• PrestoSpace Factory
  • Preservation Unit
  • Restoration Unit
  • Documentation Unit
Documentation process
Stages shown in the diagram: Legacy Metadata Import, Audiovisual Content Analysis, Semantic Analysis, Human Validation, Export/Publication
• (Archive inventory)
• Legacy metadata import
• Automatic metadata extraction
  • AV content analysis
  • Semantic analysis on texts
  • Web mining
• Manual annotation and validation
• Export/publication
Getting preservation results
• The PSO (workflow manager) moves preservation results to documentation
  • An EDOB file containing identification and media association information
  • A digital master file
  • A quality/defect analysis report file
  • A preservation report file
• The master is transcoded to the lower-quality formats required by the process
  • Windows Media 9 for viewing in publication
  • DVD-quality MPEG2 for video content analysis
  • PCM soundtrack for ASR
Legacy metadata import
VSEM00285973 DOCUMENT= 218 OF 3056 PAGE = 1 OF 1
PROGRAMMA ** F137725 **
--PAG A 010 *-DATCLASS 19951031 --DG
TITOLI FATTI VOSTRI
PIAZZA ITALIA DI SERA
SUPPORTO RVM 3/4 D2
DATIPROD --RETE TV2 --SEDE RM --GENERE 320900 --UORG 2250 --MATRICOLA 262582
DATITRAS *-DATRAS 19951027 --ORE 2025 --CANALE 2 *-DURTOT 022444 COLORE
AUTORI
GUARDI MICHELE, FLORA GIOVANNA, ZAMPONI RORY, CIORCIOLINI MARCELLO.
PRESENTA: MAGALLI GIANCARLO CON WINDHAM WENDY E I BARAONNA.
A CURA DI MOLINARI LAURA.
REGIA
GUARDI MICHELE
I0607 * End of document.
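Importing a record like the one above starts with splitting its tagged fields. A minimal sketch, assuming that a field is a `--NAME` or `*-NAME` tag followed by a single value token, and that DATRAS holds a date in YYYYMMDD form; both are readings of this one sample, not a documented format, and the function names are hypothetical.

```python
import re
from datetime import date, datetime

# A field is "--NAME value" or "*-NAME value"; only the first token after
# the tag is captured (assumption based on the sample record).
FIELD = re.compile(r"(?:--|\*-)(\w+)\s+(\S+)")


def parse_fields(line: str) -> dict[str, str]:
    """Extract tagged fields from one line of the legacy record."""
    return dict(FIELD.findall(line))


def parse_yyyymmdd(value: str) -> date:
    """Interpret an eight-digit value as a calendar date."""
    return datetime.strptime(value, "%Y%m%d").date()


prod = parse_fields("DATIPROD --RETE TV2 --SEDE RM --GENERE 320900")
tras = parse_fields(
    "DATITRAS *-DATRAS 19951027 --ORE 2025 --CANALE 2 *-DURTOT 022444 COLORE"
)

print(prod)                            # {'RETE': 'TV2', 'SEDE': 'RM', 'GENERE': '320900'}
print(parse_yyyymmdd(tras["DATRAS"]))  # 1995-10-27
```

Free-text blocks such as AUTORI and REGIA do not follow the tag pattern and would need separate handling.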
Documentation platform
[Architecture diagram; labels recovered from the slide:]
• EDOB
• Rich Content
• Documentation Platform
• MPEG7, PMETA, DC, MXF, JPG
• Content Analysis: Shots-key frames
• GAMPs
• Content Analysis: Media Analysis
• Content Analysis: Speech To Text
• Semantic Analysis
• Manual Annotation
• Delivery
• Core Platform web services
• EMS (Essence and Metadata Storage)
Technologies – storyboard
Key frames list
Stripe image
Technologies – feature extraction
Camera Motion
Automatic Speech Recognition
Technology - segmentation
• Several segmentation tools
  • Scene change detection
  • Clustering of similar scenes
  • Audio classifier (music, noise, speech)
  • Voice tracking
  • Lexical segmentation
  • Editorial parts merger
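As an illustration of the first tool in the list, scene change detection is often approximated by comparing grey-level histograms of consecutive frames. The sketch below is a generic textbook technique, not the PrestoSpace detector; frames are modelled as non-empty flat lists of 8-bit grey values, and the bin count and threshold are arbitrary assumptions.

```python
def histogram(frame: list[int], bins: int = 8, levels: int = 256) -> list[float]:
    """Normalised grey-level histogram of one frame (frame must be non-empty)."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // levels] += 1
    total = len(frame)
    return [count / total for count in counts]


def scene_changes(frames: list[list[int]], threshold: float = 0.5) -> list[int]:
    """Return indices of frames whose histogram differs sharply from the previous one."""
    if not frames:
        return []
    cuts = []
    previous = histogram(frames[0])
    for index in range(1, len(frames)):
        current = histogram(frames[index])
        # Half the L1 distance: 0.0 for identical histograms, 1.0 for disjoint ones.
        distance = sum(abs(a - b) for a, b in zip(previous, current)) / 2
        if distance > threshold:
            cuts.append(index)
        previous = current
    return cuts


dark = [10] * 16     # toy "frame" of 16 dark pixels
bright = [240] * 16  # toy "frame" of 16 bright pixels
print(scene_changes([dark, dark, bright, bright, dark]))  # [2, 4]
```

Each detected index can serve as a shot boundary, with one key frame then picked per shot.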
Technology - segmentation
Technologies - semantics
Classification
Named Entities
Source correlation
Extension to other genres
• The effectiveness of the automatic analysis tools varies with the content genre and expected use
  • E.g., ASR is not very useful on fiction
• The analysis process must be tailored accordingly
  • The editorial parts segmenter must be adapted to reflect the editorial semantics (if possible)
  • In some cases the process must rely mainly on manual annotation
Manual Annotation
• Functionality
  • Validation and correction of automatic content analysis results (audiovisual and semantic)
  • Content structuring
  • Annotation on different structural levels of the content (programme, scene, shot, arbitrary temporal range)
• Integrated in the documentation platform
Export
• The documentation results are exported to external systems or to the Publication Platform
• Export package includes:
  • Enriched EDOB
  • Key frames, stripe images
  • Video in browsing quality
  • Everything received from Preservation & Restoration
• Deleted from the Documentation Platform
Publication platform
[Architecture diagram; labels recovered from the slide:]
• Rich Content
• Web interface
• Publication Platform
• Key Frames View
• Semantic Search (KIM Platform)
• Topic Search (Full text)
• http
• Full motion video preview
• MCP (Multimedia Contents Publisher)
• Speech to text display
Publication Platform architecture
[Architecture diagram; labels recovered from the slide:]
• Retrieval of content
• User Interface
  • Structured query
  • Restricted natural language query
• CLIR Processor
  • Context disambiguation
  • Language translation
• SQL Engine
• Semantic Engine
• Data Base
• Domain Knowledge Base
Retrieval of AV content
• Different types of retrieval are provided:
  • Legacy information (structured queries)
  • Full text
  • Ontology-driven browsing
  • Natural language queries
    • No constraint on the nature of the query (non-NL queries are also handled)
    • NERC
    • Cross-lingual
    • Query classification for domain-specific retrieval
KIM semantic engine
• KIM is a platform for semantic annotation, search, and analysis:
  • Framework for automatic semantic annotations
  • Storage of semantic annotations
  • Semantic indexing and search
Cross-language Retrieval (CLIR)
• Archive AV data span different languages
• Retrieval should be concept-oriented rather than text-oriented
• The source language (of queries) can be different from the target language (of the metadata)
CLIR functionalities
• The implemented CLIR analyses the user query to extract
  • NEs, if any
  • Context categorization
  • Useful terms for FTS (removing stopwords)
• It then maps the extracted information into the target language
• The new query replaces the original one
CLIR translation example
Typed Query:
“Blair calls on NATO member to contribute more troops to Afghanistan force”
Translated Query:
Person:Blair, Organization:Nato, Location:Afghanistan, Category:foreign affairs,
Text_en:blair, nato, member, troops, force, afghanistan
Text_it:blair, nato, membro, truppe, arma, afghanistan
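The translation step can be imitated with three small lookup tables. This is a toy sketch, not the implemented CLIR: the stopword list is hand-picked to reproduce the terms in the example, the named-entity table is a stand-in for a real recogniser, the Italian terms are copied from the example, and the category is passed in rather than computed.

```python
# Assumption: hand-picked so that the example query yields the slide's terms.
STOPWORDS = {"calls", "on", "to", "contribute", "more"}

# Toy gazetteer standing in for named-entity recognition.
NAMED_ENTITIES = {"blair": "Person", "nato": "Organization", "afghanistan": "Location"}

# English to Italian term pairs taken from the example on the slide.
EN_TO_IT = {"member": "membro", "troops": "truppe", "force": "arma"}


def translate_query(query: str, category: str) -> dict:
    """Turn a typed query into a structured query with source and target terms."""
    terms = [word for word in query.lower().split() if word not in STOPWORDS]
    entities = {NAMED_ENTITIES[t]: t for t in terms if t in NAMED_ENTITIES}
    # Terms without a dictionary entry (here, the names) are kept unchanged.
    translated = [EN_TO_IT.get(t, t) for t in terms]
    return {
        "entities": entities,
        "category": category,
        "text_en": terms,
        "text_it": translated,
    }


result = translate_query(
    "Blair calls on NATO member to contribute more troops to Afghanistan force",
    category="foreign affairs",
)
print(result["text_en"])  # ['blair', 'nato', 'member', 'troops', 'afghanistan', 'force']
print(result["text_it"])  # ['blair', 'nato', 'membro', 'truppe', 'afghanistan', 'arma']
```

The terms come out in query order, which differs slightly from the order printed on the slide; for a bag-of-words full-text search the order is not significant.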
Considerations
• The relationship between descriptive metadata and essence is generally not one-to-one
• Descriptive metadata are more efficiently exchanged and managed if kept separate from the essence
• Mapping of different metadata schemas is unavoidable
• Lossless mapping is possible only if
  • Basic concepts are shared between models
  • Entities and attributes are well described and understood
• The mapping requires human skills
• The PrestoSpace data model and documentation process are described in Deliverable D15.2