August 17th, 2009 by Stefan Boddie, CEO, DL Consulting Ltd.
As technologists we always get excited by digital library standards like METS/ALTO. What is it and does it matter to library professionals?
Over the past ten years Digital Library Consulting has built lots of digital collections, and those collections have been built from a wide range of digital objects. For example, we’ve built collections from PDF files, Microsoft Word documents, digital images, text files, HTML files, from digital images with associated text/HTML/PDF files, and from objects using various standards like TEI. In addition, many of the collections we’ve built have had existing metadata in Excel spreadsheets, XML and binary MARC formats, Microsoft Word files, and a whole range of different database formats.
It’s obviously much more complex and time-consuming to build a digital library from objects in formats we’ve not used before than it is to do so from some kind of “standard” format that we already have software to support. In many cases this can’t be helped – the digital files already exist, and the digital library software used to present them simply must be adapted to suit.
In the case of “new” digitization projects, where physical documents are being turned into new digital objects, any number of different digital formats can be selected. We regularly work with those planning new digitization projects to help them select the most appropriate format, and for textual documents we most often recommend METS/ALTO.
The Metadata Encoding and Transmission Standard (METS) has been around for some time, and is a standard with which many library professionals will be familiar. It’s a standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, using XML. The METS standard is maintained by the Library of Congress, and is developed as an initiative of the Digital Library Federation.
While METS is great at describing the structure of a digital object, it’s missing the ability to describe the content and layout of each piece of the digital object. For that we need an extension to METS called ALTO (Analyzed Layout and Text Object). This combination of METS and ALTO was originally developed by the METAe project, and was later adopted by the Library of Congress for their large-scale National Digital Newspaper Program (NDNP). Since then METS/ALTO has been used for many large national newspaper digitization projects, as well as a number of projects digitizing books and journals.
METS/ALTO provides extremely rich digital objects, which in turn allow extremely rich digital library interfaces to be built. For example, a typical METS/ALTO object encodes not only the complete logical and physical structure of a document (i.e. chapters, sections, articles, pages, etc., and their associated metadata), but also the full-text content of each section of the document and even the physical coordinates of every word in the document! The impact of this on the user’s search experience can be quite significant: for instance, search terms can be highlighted directly on the page image, and individual articles can be displayed on their own. Additionally, it doesn’t typically cost any more to digitize materials to METS/ALTO than to formats like HTML, which contain much less information.
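To make the idea of word-level coordinates concrete, here is a minimal Python sketch that reads word boxes from a small ALTO-style fragment and finds the boxes to highlight for a search term. The sample XML is made up for illustration and is much simpler than real project data. The function names (`extract_words`, `highlight_boxes`) are hypothetical and are not part of any DL Consulting or Greenstone code. The sketch assumes each word is a `String` element carrying `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes.

```python
import xml.etree.ElementTree as ET

# Illustrative ALTO-style fragment (invented sample data, no namespace).
SAMPLE_ALTO = """<alto>
  <Layout>
    <Page ID="P1" WIDTH="2400" HEIGHT="3600">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Papers" HPOS="100" VPOS="200" WIDTH="180" HEIGHT="40"/>
            <String CONTENT="Past" HPOS="300" VPOS="200" WIDTH="120" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip any '{namespace}' prefix so the sketch works with or without one."""
    return tag.rsplit("}", 1)[-1]


def extract_words(alto_xml):
    """Return one dict per word, holding its text and bounding box."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        if local_name(element.tag) == "String":
            words.append({
                "text": element.get("CONTENT", ""),
                "x": float(element.get("HPOS", 0)),
                "y": float(element.get("VPOS", 0)),
                "w": float(element.get("WIDTH", 0)),
                "h": float(element.get("HEIGHT", 0)),
            })
    return words


def highlight_boxes(words, term):
    """Return (x, y, width, height) boxes for words matching the search term."""
    term = term.lower()
    return [(w["x"], w["y"], w["w"], w["h"])
            for w in words if w["text"].lower() == term]


words = extract_words(SAMPLE_ALTO)
print(highlight_boxes(words, "past"))
```

A delivery system could draw these boxes over the page image to show the user where each hit occurs. Real data would also need handling for namespaces, hyphenated words and measurement units, all of which vary between projects.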
Digital Library Consulting has completed several projects using the METS/ALTO standard. One example is the National Library of New Zealand’s Papers Past project, which contains approximately 1.2 million newspaper pages, or around 7 million individually searchable and viewable newspaper articles. We’ve also completed a project based on the standard for Cornell University Library, and are working on a major project with the National Library of Singapore.
If you would like to discuss the implications of using METS/ALTO in your digital collection projects please contact us at contact@dlconsulting.com.
Posted in Articles, Veridian | No Comments »
August 13th, 2009 by Stefan Boddie, CEO, DL Consulting Ltd.
New Zealand-based Digital Library Consulting joins forces with leading German company CCS.
August 2009. Hamilton, New Zealand. – The world’s biggest libraries and other institutions will find it easier to open their collections to the world thanks to an agreement announced today between New Zealand’s Digital Library Consulting and Content Conversion Specialists (CCS) of Germany.
“CCS is a world leader in the field of digitizing library and other collections so this is a very exciting development for Digital Library Consulting,” said the company’s founder and managing director Stefan Boddie.
“Having worked with us on a number of international projects, with organizations like the National Library Board of Singapore and New Zealand’s National Library, CCS have seen real value in our Veridian software product which provides the online interface for users to search and view items in a digital collection.”
Digital Library Consulting’s Veridian software has been used for collections including those at the National Libraries of Singapore, Luxembourg, and New Zealand, and libraries at Princeton and Cornell Universities.
Veridian provides the interface used to search and view the National Library of New Zealand’s popular online newspaper archive Papers Past http://paperspast.natlib.govt.nz.
Mr Boddie said the agreement would see CCS using Digital Library Consulting’s Veridian product in major digitization projects in Europe, Asia and America.
“Given CCS have worked with other companies in this field, their decision to use Veridian is evidence we have a product that is very competitive on a world stage,” said Mr Boddie.
There is growing worldwide demand for digitization services as libraries and other institutions seek to preserve valuable collections digitally, said Mr Boddie. “Of course these collections are of little value unless people can access, search and view them easily – which is what software does, enabling these collections to be opened to the world.”
Richard Helle, Managing Director at CCS, said his company were looking forward to collaborating with Digital Library Consulting. “Our clients digitize their stock with precision and great commitment. Veridian helps them to make this effort visible and the collections usable.”
“We are delighted that this technology – proven successful in large scale projects – is now available to a much larger market.”
About Digital Library Consulting
Based in Hamilton, New Zealand, Digital Library Consulting are experts in the field of building digital collections. Their Veridian software product enables the collection, creation and distribution of digital libraries. They have worked with a number of leading tertiary institutions and libraries throughout the US, New Zealand, Africa and the Pacific.
Established in 2002, Digital Library Consulting has eight staff and is privately held.
www.dlconsulting.com
About CCS
CCS is a pioneer and business leader in making information available through digitization. Founded in 1976, CCS connects the digitization elements (capture, conversion, presentation and storage) into a smooth, automated, quality-secured and economic production process. CCS (Content Conversion Specialists) is a privately owned company headquartered in Hamburg, Germany.
www.content-conversion.com
For information contact:
Stefan Boddie, Managing Director, Digital Library Consulting
Telephone +64 7 857 0830
E-mail stefan@dlconsulting.com
Posted in News, Veridian | No Comments »
June 28th, 2009 by Stefan Boddie, CEO, DL Consulting Ltd.
DL Consulting recently completed work on digitizing the Hyde Park Herald.
The Hyde Park Herald archive is a searchable history of the Hyde Park neighborhood on Chicago’s South Side. Hyde Park is the home of the University of Chicago and of Barack Obama, the 44th President of the United States.
The archive includes every known copy of The Hyde Park Herald from 1882 until 2008 – approximately 100,000 pages at time of writing. New issues will be digitized regularly.
The Hyde Park Herald newspaper archive is built upon DL Consulting’s Veridian web delivery software for METS/ALTO data. Digitization to METS/ALTO was carried out by Digital Divide Data.
Posted in News, Veridian | No Comments »
March 18th, 2009 by Stefan Boddie, CEO, DL Consulting Ltd.
Posted in Greenstone, Research | No Comments »
September 21st, 2008 by Richard Managh, DL Consulting Ltd.
Recently we were approached by a client seeking to move online information from their legacy Paradox 4.0 database. Greenstone would enable the information to be web-accessible but we first needed to convert the database into a form that Greenstone can handle.
Digging on the internet revealed conversion software called Paradox dBase Reader, which is able to read DBF/DB files in Paradox format (but also supports other formats, such as dBase, FoxBase, Foxpro, Visual Foxpro and Visual DBase). This software allows the conversion of Paradox databases into HTML, Text (CSV), Excel, or XML formats. We converted the database into XML, our preferred machine-readable format. Paradox dBase Reader was also able to extract images, stored internally as binary data, from the client’s database.
A plugin was then created to import the extracted XML and images into a Greenstone collection. Greenstone’s modular nature made this change straightforward. The finishing touches will be to customize and brand the collection to the client’s specifications and then make it available on the client’s intranet.
This is an example of a Paradox database being converted into Greenstone but many other database formats can have new life breathed into them by having Greenstone serve them up on the web.
Posted in Articles, Greenstone | No Comments »
September 1st, 2008 by Stefan Boddie, CEO, DL Consulting Ltd.
On September 19th 2008 Digital Library Consulting will be presenting a paper at the First Workshop on Very Large Digital Libraries in Aarhus, Denmark (in conjunction with the 12th European Conference on Research and Advanced Technology for Digital Libraries).
http://www.delos.info/vldl2008
Posted in Greenstone, News, Research | No Comments »
August 25th, 2008 by John Thompson, COO, DL Consulting Ltd
Greenstone has developed (rather unfairly, we feel) a reputation as a ‘toy’ document system not capable of handling large-scale, enterprise-level collections. While our latest ‘million page’ newspaper collections should help change this preconception, large collections do encounter some scalability issues. Similar problems in large-scale databases have been addressed with distributed computing, where the processing and storage workload is shared and balanced between several computers instead of just one. However, Greenstone didn’t provide this functionality… until now.
Although the work is still at an early stage of development, Greenstone has been integrated with the recently released IBM DB2 Express-C database. This database meets Greenstone’s requirements for metadata storage and, using the Net Search Extender add-on, full-text search, while its license allows users to download and install it for free. Most importantly, Greenstone is then able to leverage the power of ‘Federation’, DB2’s implementation of distributed computing. The ‘front-end’ DB2 server transparently manages interaction with an arbitrary number of ‘back-end’ DB2 data servers. This provides the potential to dramatically increase Greenstone’s large-scale performance just by adding further ‘back-end’ servers, without having to drastically change Greenstone itself.
Posted in Greenstone, News, Research | No Comments »
May 1st, 2008 by Stefan Boddie, CEO, DL Consulting Ltd.
Here at DL Consulting we’re continuing to make improvements to Greenstone’s support for importing and displaying METS/ALTO data. METS/ALTO is an XML schema published by the Library of Congress, and is used by the US National Digital Newspaper Program (NDNP) as well as many other newspaper digitization projects (and some collections of books, journals, and other textual resources). In addition to extracting machine-readable text from the page, a process resulting in METS/ALTO also records information about individual articles within a page. This allows a user interface to be built where newspaper articles can be displayed on their own, as well as within the pages on which they were printed.
The Papers Past site we built last year with the National Library of New Zealand (and which uses METS/ALTO) continues to grow. There are now over 600,000 searchable pages (that’s about 6.5 million newspaper articles!) in the system. We’re happy with how well the system is scaling, but continue to work on further improvements, with the eventual goal being infinite scalability with large collections distributed across multiple computers. We’re making good progress towards that goal thanks to a research grant from the Foundation for Research, Science, and Technology.
In addition to the Papers Past collection we’ve built two further METS/ALTO based newspaper collections over recent months. Unfortunately neither of these sites is accessible to the public yet, but we’ll post links on www.dlconsulting.com once they are.
- Cornell University – The Cornell Daily Sun Digitization project. This project has used a basic Greenstone system for some time (that system is still online), but we implemented a major upgrade so the system can now import METS/ALTO data (which Cornell have switched to for the digitization of all remaining newspaper issues) as well as the older (proprietary) data format used in the earlier digitization work. METS/ALTO is more flexible than the older format, but the system was implemented so that all the data (both old and new formats) is displayed very similarly. The Cornell Daily Sun project also switched to generating web-accessible images on demand with image server software, similar to the way Papers Past does.
- National Library Board of Singapore. We’ve also been working for many months on a large newspaper collection for the National Library of Singapore, building upon the software written for Papers Past. The Singapore collection will be released later this year, initially with around 600,000 pages of digitized content. That will grow to around 2 million pages over time. The Singapore project has some added complexity, including integration with a digital rights management system (because some of the digitized newspapers are still in copyright) and integration with automated concept (i.e. subject heading) extraction software. In addition, the Singapore project uses large grayscale JPEG2000 source images, as opposed to the black-and-white TIFF images used by Papers Past. We had to redevelop our image server software quite significantly to get good performance when processing these JPEG2000 images.
We’ve been asked several times if the code written to import and display METS/ALTO data is open source, and if it has been committed back to Greenstone. The answer is yes, of course it’s open source, but no it hasn’t yet been committed back to the Greenstone code base. The reasons for not committing it back are as follows.
- It’s a lot of highly specialized code, and is only useful to those with METS/ALTO data. My personal belief is that at times we have too much highly specialized functionality added into Greenstone, and that Greenstone2 isn’t currently modular enough to make it easy to add these sorts of major changes.
- We’ve worked with a number of METS/ALTO based projects and the data itself is always subtly different. That is, the code always needs to be modified to suit the METS/ALTO schema used, so is only useful as a starting point.
Having said all of the above, we are of course happy to make the code available to those who want it. Please contact us at contact@dlconsulting.com if you’re planning on building a METS/ALTO based Greenstone collection.
Posted in Greenstone, News, Research | No Comments »
September 18th, 2007 by Stefan Boddie, CEO, DL Consulting Ltd.
DL Consulting are pleased to announce that the redesigned Papers Past has now been officially released. Papers Past is a collection of 19th and early 20th century newspapers belonging to the National Library of New Zealand. The previous version of the site made images of each of the 1,000,000+ newspaper pages in the collection available, but did not allow the contents of the newspapers to be searched. The new site is a complete redesign, and is based on a very heavily customised Greenstone installation. DL Consulting built the Greenstone-based delivery system, and have been working on it with the National Library of New Zealand since mid-2006.
The functionality of the updated Papers Past site includes the following.
- Newspaper pages underwent an OCR process to produce METS/ALTO XML data. A new Greenstone plugin was written for importing this data.
- Papers Past handles a mixture of searchable and unsearchable newspapers. At present about 250,000 of the 1.1 million total pages in the system are searchable, with more pages being converted to searchable format over time.
- The use of METS/ALTO data allowed us to build a system where individual newspaper articles and advertisements can be extracted from pages. That is, the user may choose to view either full newspaper pages or to view larger versions of individual articles.
- The collection features search term highlighting directly within digital images.
- An image-server application was developed by DL Consulting, to allow images to be processed, cropped, and scaled at display time. That is, only the original archival TIFF images of each page are stored. When the Greenstone delivery system requires an article-level image, a web-friendly page-level image, or any other type of image, it requests it from the image server. The image server then generates the required image on-the-fly from the stored archival TIFF images.
- At present the Papers Past collection uses the Lucene search engine. We chose Lucene for its proven ability to scale to very large indexes, and because of its “fuzzy search” capability. Fuzzy search allows the search engine to return hits for documents containing words similar but not identical to the search terms entered by the user. This is a useful feature in a delivery system for a newspaper digitization project, as newspapers are extremely difficult for OCR software to deal with. This invariably means that the searchable text produced by the OCR process is not perfect.
- This collection is already very large, and will grow much larger over time. At the time of writing there are 254,000 searchable pages and around 3.1 million searchable articles. While 254,000 pages doesn’t sound like a lot, these pages each contain a huge amount of text. There’s more than 9 GB of raw searchable text, and 27 GB of metadata (we store coordinates for each article and word on every page, hence the enormous amount of metadata). We went to considerable effort to ensure that Greenstone scales sufficiently to support a collection of over one million pages, and we’re continuing this work with funding from a government R&D Grant. We’re currently working on another newspaper digitization project which will eventually scale to more than two million pages.
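The fuzzy search idea mentioned in the list above can be illustrated with a small Python sketch based on edit distance, meaning the number of single-character insertions, deletions and substitutions needed to turn one word into another. This is a plain illustration of the general technique, not Lucene’s implementation and not code from Papers Past. The function names and the sample word list are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def fuzzy_hits(term, indexed_words, max_distance=1):
    """Return indexed words within max_distance edits of the search term."""
    term = term.lower()
    return [word for word in indexed_words
            if edit_distance(term, word.lower()) <= max_distance]


# Invented examples of the kind of errors OCR can introduce.
ocr_words = ["Wellington", "Welhngton", "Wellingtcn", "Auckland"]
print(fuzzy_hits("wellington", ocr_words, max_distance=2))
```

With a tolerance of two edits, a search for "wellington" also matches the misrecognized variants, which is why this kind of matching helps with imperfect OCR text. A production search engine avoids comparing the term against every indexed word, so this brute-force loop is only suitable for illustration.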
Posted in News | 2 Comments »
September 18th, 2007 by Stefan Boddie, CEO, DL Consulting Ltd.
DL Consulting recently became involved in a pilot project to introduce Greenstone to Southern Africa. Our commitment to this project includes the donation of time and expertise in helping to provide training in Greenstone (and more general digital library and digitization subjects) to institutions in Southern Africa. As part of the first stage of this pilot project I recently visited the National University of Science and Technology in Zimbabwe, the Lesotho College of Education, the University of Lesotho, and Bunda College of Agriculture in Malawi. My time in Africa was extremely positive, with lots of interest in and excitement about potential applications for digital library technologies like Greenstone. Each of these institutions, as well as the University of Namibia (who are the sub-regional coordinating centre), is now working on Greenstone-based digital collections.
The next phase of the pilot project is a training course to be held at the University of Namibia in early October. This course will cover more advanced topics, following on from the basic training work we did when I visited each institution. Professor Ian Witten, head of the Greenstone project at the University of Waikato, will be the trainer.
For more information on the Southern African Greenstone project see the official project website.
Posted in Community, Greenstone | 2 Comments »