Ingesting Malach Interviews
Overview of Collection
The collection contains technical reports and other papers, plus a set of interviews and their descriptions. There are ~52,000 interviews totalling ~116,000 hours. The interviews have been indexed under two schemes, referred to here as old and new.
There are three levels of description: collection level, interview level, and segment level. The segment level is indexed either by 1-minute segments (new) or by variable-length, content-based segments (old).
The collection contains a thesaurus with ~40,000 terms. The thesaurus is stored in a set of database tables.
Interview Level
We should treat the interview as the base unit for archiving, so packages should focus on creating 'interview' bundles. An interview consists of the following:
- Processed paper questionnaire: original TIFF scan, re-keyed complete form, and a keyed short form of the questionnaire (name, etc.)
- Interview summary (old only), free-form text
Segment Level
Each interview is broken up into a set of segments. A segment contains the following:
- Segment extent: either fixed length (new) or variable length and content based (old)
- Segment summary
- ASR-extracted keywords
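The interview and segment levels described above can be sketched as a small data model. This is only an illustration: the class names, field names, and the `fixed_segments` helper are hypothetical and do not come from the Malach or PAWN code. It assumes times are tracked in whole seconds and that the last fixed-length segment may be shorter than the rest.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    start_s: int                       # offset from the start of the interview, in seconds
    duration_s: int                    # 60 for new-style segments, variable for old-style
    summary: Optional[str] = None      # segment summary, if present
    asr_keywords: List[str] = field(default_factory=list)  # ASR-extracted keywords


@dataclass
class Interview:
    interview_id: str                  # hypothetical identifier
    summary: Optional[str] = None      # free-form text, old-style indexing only
    segments: List[Segment] = field(default_factory=list)


def fixed_segments(total_s: int, length_s: int = 60) -> List[Segment]:
    """Split total_s seconds into fixed-length segments (new-style indexing).

    The final segment is truncated if total_s is not a multiple of length_s.
    """
    return [
        Segment(start_s=start, duration_s=min(length_s, total_s - start))
        for start in range(0, total_s, length_s)
    ]
```

For example, `fixed_segments(150)` yields three segments starting at 0, 60, and 120 seconds, the last one 30 seconds long. Old-style segments would instead be built one at a time from their content-based boundaries.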
PAWN Integration
Integration will consist of the following:
- 1. PAWN will need to be made aware of MPEG-7 metadata
- Initially, this can be a simple viewer/text editor.
- 2. Software to encode Malach metadata as MPEG-7
- Software will have to be written to encode the necessary Malach metadata as MPEG-7. The information encoded in the MPEG-7 files will be extracted from the collection-level thesaurus. Metadata that has time coding will also be included here. This has been done using a Perl script that parses the Segment.xml file and a segment-time map file.
- 3. Packaging of Malach data
- Packages will contain two folders. The first is for media: it will contain the mpg, mp2, etc. files and will have the MPEG-7 files attached at the folder level. The second is for scanned documents and other complete information.
Package
- Root of Interview
- Audio / Video
- mpg, mp2, etc. files
- MPEG-7 metadata
- Supporting Documentation
- interview summary / questionnaire
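The package layout above could be assembled with a short script along the following lines. This is a minimal sketch, not the PAWN packaging code: the function name `build_package` and the directory names `audio_video` and `supporting_documentation` are hypothetical, and because the notes do not say how PAWN attaches metadata at the folder level, the sketch simply copies the MPEG-7 file into the media folder.

```python
import shutil
from pathlib import Path
from typing import Iterable


def build_package(root: Path, interview_id: str, media: Iterable[Path],
                  mpeg7: Path, documents: Iterable[Path]) -> Path:
    """Create one interview bundle under root and return its path."""
    pkg = root / interview_id
    media_dir = pkg / "audio_video"               # hypothetical folder name
    docs_dir = pkg / "supporting_documentation"   # hypothetical folder name
    media_dir.mkdir(parents=True, exist_ok=True)
    docs_dir.mkdir(parents=True, exist_ok=True)

    # Media files (mpg, mp2, ...) go in the first folder.
    for f in media:
        shutil.copy2(f, media_dir / f.name)

    # Assumption: the MPEG-7 description travels alongside the media files.
    shutil.copy2(mpeg7, media_dir / mpeg7.name)

    # Scans, interview summary, questionnaire forms go in the second folder.
    for f in documents:
        shutil.copy2(f, docs_dir / f.name)
    return pkg
```

Keeping one directory per interview matches the idea of the interview as the base archival unit, so a bundle can be ingested or re-ingested independently of the rest of the collection.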
- MPEG-7 Usage
We are following the DAVP MPEG-7 profile. This profile is designed to support audiovisual data, with each MPEG-7 document describing one video or audio item. We will generate one MPEG-7 file per interview, using the Temporal Decomposition tools (11.6.2) to index segment-level information. Specifically, there will be a set of Temporal Decomposition elements, each specifying a time segment using Media Time (6.4.10) and carrying keywords in 'Text Annotation' elements within the decomposition.
- MPEG-7 Layout
- AudioVisual
- Temporal Decomposition
- MediaTime - duration of segment
- TextAnnotation / KeywordAnnotation - up to 3 sections: keywords from automatic generation (2: AUTOKEYWORD2004A1, AUTOKEYWORD2004A2) and manual entry (1: MANUALKEYWORD)
- TextAnnotation / FreeText - up to 3 sections: segment summary (1: SUMMARY) and ASR text (2: ASRTEXT2003A, ASRTEXT2004A)
- Temporal Decomposition
- ...
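A minimal sketch of how the layout above might be generated with Python's standard `xml.etree.ElementTree` follows. Several things here are assumptions rather than facts from these notes: the namespace URI (DAVP may require a later schema version), the use of one `AudioVisualSegment` per segment under a single `TemporalDecomposition`, the `type` attribute used to carry labels such as SUMMARY or MANUALKEYWORD, and the input dictionary shape. The output is not validated against the MPEG-7 schema or the DAVP profile, so element order and nesting would need to be checked before real use.

```python
import xml.etree.ElementTree as ET

MPEG7_NS = "urn:mpeg:mpeg7:schema:2001"  # assumption: schema version used
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def _el(parent, tag, text=None, attrib=None):
    """Append a namespaced MPEG-7 child element and return it."""
    child = ET.SubElement(parent, f"{{{MPEG7_NS}}}{tag}", attrib or {})
    if text is not None:
        child.text = text
    return child


def media_time_point(seconds: int) -> str:
    """Format an offset in seconds as Thh:mm:ss."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"T{h:02d}:{m:02d}:{s:02d}"


def media_duration(seconds: int) -> str:
    """Format a length in seconds as PTnHnMnS."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"PT{h}H{m}M{s}S"


def build_mpeg7(segments) -> str:
    """Serialize one interview's segments as an MPEG-7 document string.

    Each segment is a dict with 'start_s', 'duration_s', and optional
    'keywords' {label: [keyword, ...]} and 'free_text' {label: text}.
    """
    ET.register_namespace("", MPEG7_NS)
    ET.register_namespace("xsi", XSI_NS)
    root = ET.Element(f"{{{MPEG7_NS}}}Mpeg7")
    desc = _el(root, "Description",
               attrib={f"{{{XSI_NS}}}type": "ContentEntityType"})
    content = _el(desc, "MultimediaContent",
                  attrib={f"{{{XSI_NS}}}type": "AudioVisualType"})
    av = _el(content, "AudioVisual")
    decomp = _el(av, "TemporalDecomposition")
    for seg in segments:
        seg_el = _el(decomp, "AudioVisualSegment")
        time_el = _el(seg_el, "MediaTime")
        _el(time_el, "MediaTimePoint", media_time_point(seg["start_s"]))
        _el(time_el, "MediaDuration", media_duration(seg["duration_s"]))
        # Keyword sections, e.g. AUTOKEYWORD2004A1 or MANUALKEYWORD.
        for label, words in seg.get("keywords", {}).items():
            ann = _el(seg_el, "TextAnnotation", attrib={"type": label})
            kw = _el(ann, "KeywordAnnotation")
            for word in words:
                _el(kw, "Keyword", word)
        # Free-text sections, e.g. SUMMARY or ASRTEXT2004A.
        for label, text in seg.get("free_text", {}).items():
            ann = _el(seg_el, "TextAnnotation", attrib={"type": label})
            _el(ann, "FreeTextAnnotation", text)
    return ET.tostring(root, encoding="unicode")
```

A call such as `build_mpeg7([{"start_s": 0, "duration_s": 60, "keywords": {"MANUALKEYWORD": ["family"]}, "free_text": {"SUMMARY": "..."}}])` returns the document as a string, which could then be written to the media folder of the interview package.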