NB: This is a report on the final convention in the core DiXiT programme, held before and jointly with the 2016 meeting of the European Society for Textual Scholarship and directly after a two-day workshop on optical character recognition and text digitisation. Hosted and facilitated by the University of Antwerp and the Centre for Manuscript Genetics, these events took place from 27 September to 7 October 2016.
Although I’ve been to Belgium a number of times, I have always seemed to be on my way to somewhere else. As an Early Stage Researcher in the Digital Scholarly Editions Initial Training Network (DiXiT), events sponsored by the network take me all over Europe, often by train, and often through Brussels Midi/Zuid. Happily, though, a series of training events and conferences this autumn finally allowed me to spend over a week in Antwerp, a city in the northern, Dutch-speaking region of Flanders. This report will cover four separate events in varying levels of detail:
- Demystifying Digitisation: A Hands-On Master Class in Text Digitisation
- Internal DiXiT Meetings
- Pre-Conference Workshops
- European Society for Textual Scholarship
Tipped off by Wout Dillen, ESTS 2016 co-organiser and DiXiT Experienced Researcher at the University of Borås, I decided to take advantage of a two-day, pre-conference workshop on optical character recognition and text digitisation. Jointly organised by Digital Humanities Flanders (DHuF) and the Digital Research Infrastructure for the Arts and Humanities – Belgium (DARIAH-BE), “Demystifying Digitization: A Hands-On Master Class in Text Digitization” consisted of two half-day, hands-on workshops using ABBYY FineReader and Transkribus, complemented by four lecture and discussion sessions on closely related topics. Sally Chambers, Wout Dillen, Mike Kestemont, Trudi Noordermeer, and Dirk van Hulle served as the organising committee.
The opening session, presented by Dries Moreels from Ghent University Library, recounted that institution’s experience contributing to the Google Books project. His explanation of how OCR results shape search and discovery, and how that text underlies much of what is valuable about Google Books, underscored how important it is for digital editors to understand how document images and machine-readable texts fit together in a cohesive work pattern. After this quick primer on what OCR is and how it might be useful in digital humanities projects, Jesse de Does and Katrien Depuydt from the Instituut voor Nederlandse Lexicologie (INL, Institute for Dutch Lexicology) launched into a day of ABBYY FineReader. ABBYY, one of the two major players in the OCR world, is a proprietary and quite expensive platform. For this workshop, instead of each individual downloading and paying for the full platform, the organisers had arranged access to the ABBYY Cloud OCR SDK, an enterprise-level cloud-based OCR service that charges by the page.
As requested, many of us brought test materials to use during the workshop. I am currently working on a project to provide access to public domain materials related to early modern English drama, so I was able to bring two sets of images to experiment upon: the first was of a bibliographic catalogue published in the early 20th century and taken from the Internet Archive; the second was much more bizarre, in that it was a facsimile edition of a Tudor play. This second document was also published in the early 20th century, but due to its facsimile nature it more closely resembled its source text, a dramatic work from the mid-16th century. The ABBYY cloud service was, after a few hiccups, quite easy to use; invoked from the command line via Java, it recognised the text of the first set of materials almost perfectly, despite the wide variety of idiosyncratic punctuation used in the catalogue. The second, facsimile text produced results that were roughly 50–65% accurate, much better than expected given the source text. After an afternoon of experimenting with ABBYY, Wout Dillen and Vincent Neyt (University of Antwerp) presented Manuscript Desk, “an online environment in which manuscript pages can be uploaded, manuscript pages can be transcribed, and collations of transcriptions can be performed.” Powered by MediaWiki, bespoke software from the Transcribe Bentham project, and CollateX, Manuscript Desk is a great way to collaboratively produce transcriptions of manuscripts, transcriptions which are easily ported to basic collation and stylistic analyses in-platform. I was particularly struck by how cleverly this project integrates existing components (MediaWiki and the transcription platform developed for Transcribe Bentham) to allow new types of work to be done.
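To give a concrete, if toy, sense of what “collation” means in this workflow — aligning witness transcriptions to surface their variant readings — here is a minimal sketch using Python’s standard-library difflib. The two witness lines are invented for illustration, and this is not how CollateX itself works: real collation tools use far more sophisticated multi-witness alignment algorithms.

```python
import difflib

def collate_witnesses(witness_a, witness_b):
    """Align two witness transcriptions word by word.

    Returns a list of (operation, reading_a, reading_b) tuples, where
    operation is 'equal' for shared readings and 'replace'/'insert'/
    'delete' for variants. A toy two-witness stand-in for what tools
    like CollateX do across many witnesses at once.
    """
    a, b = witness_a.split(), witness_b.split()
    table = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
        table.append((op, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return table

# Two invented witness readings of the same line:
wit_a = "the messenger spake full loude"
wit_b = "the messenger spoke full loud"

for op, reading_a, reading_b in collate_witnesses(wit_a, wit_b):
    print(f"{op:8s}  A: {reading_a!r:14s}  B: {reading_b!r}")
```

Even this crude word-level alignment makes the orthographic variants (“spake”/“spoke”, “loude”/“loud”) jump out, which is the basic affordance a collation view in Manuscript Desk provides.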
Day two of this workshop began by discussing the pragmatics of digitisation projects, especially the role institutional stakeholders and grant cycles can play in them. The University of Antwerp’s Trudi Noordermeer gave a thorough presentation on these topics, ranging from pre-project planning to the bit depth of images to digital asset management. It was a lot to take in, and Trudi helpfully provided her slides to attendees, which will, I suspect, prove valuable when we take part in planning digitisation projects ourselves. Most of day two was, however, devoted to discussing and experimenting with Transkribus, a platform for transcribing handwriting that integrates machine learning to improve handwritten text recognition (HTR). This type of recognition – from handwriting to machine-readable text – is quite difficult, and Transkribus offers one way the problem may be approached. For those working with handwritten documents, Transkribus would be a valuable platform for transcribing, exporting text, and potentially training models to recognise handwriting automatically.
Internal DiXiT Meetings
At every DiXiT event that is part of the core programme as originally proposed to the EU, members of the network have internal meetings. These might be administrative sessions, research presentations, or planning meetings. In Antwerp, the Early Stage Researchers were asked to bring in-progress writing to discuss and receive feedback on. My reading & feedback group consisted of ESRs Aodhan Kelly, Merisa Martinez, and myself, with Mats Dahlström, Elena Pierazzo, Susan Schreibman, and ER Wout Dillen leading the session. Merisa and Aodhan presented portions of their PhD work, ranging from a rough draft/outline of a chapter to a completed version of a large chapter. Having completed my PhD in March and graduated in June, I did not present research in progress, but rather submitted documents I was drafting for a number of postdoctoral fellowship applications in Europe and Canada. Feedback from all the individuals named above helped to refine those documents further, resulting in stronger applications that have since been submitted. It is always a pleasure to see the high-quality work my DiXiT colleagues are moving forward, especially when that work begins to come together in ways that make it clear rapid progress towards the PhD is being made.
Aside from these reading groups, the DiXiT fellows had a self-organised meeting before things really got rolling, an afternoon which ended with a tour of the De Koninck brewery. Merisa’s blog post describes this well, and she has much better pictures than any I snapped.
Like Merisa, I attended one of the pre-conference workshops organised by DiXiT. “Complexities of Project Logistics,” organised by DiXiT supervisor Peter Boot (Huygens ING), discussed large digital editorial projects from a pragmatic perspective, helping attendees to think through what it means to plan, publish, and sustain a digital edition (or set of digital editions, as is the more usual reality). Unsurprisingly, most presentations were by project managers or general editors. Jan Burgers (Huygens ING) reported on editing medieval charters in the digital age, focusing especially on how to effectively digitise long-running print editions as a foundational step in the overall project of producing digital charters. Rik Hoekstra (Huygens ING) discussed the Resolutions of the States General, a key set of documents for the study of Dutch political history; this collection is entirely handwritten but extraordinarily regular, allowing for efficient encoding in XML and semi-automated entity enrichment (people and places). Martine de Bruin presented the Dutch Songs Database, a meta-project that aggregates data from 17 smaller efforts to provide access to over 170,000 Dutch songs from the Middle Ages to the present. The database is an especially intriguing project because it relies on and encourages non-textual search and discovery, leveraging facets like meter, tune, and so on to provide a rich set of affordances. DiXiT ESR Anna-Maria Sichani recounted some of her work at the King’s Digital Lab (KDL), reporting on a secondment undertaken at King’s College London. Drawing a distinction between “the innovators” and “the maintainers,” she laid out a KDL-developed workflow for maintaining legacy projects (of which King’s College has over 150, taking shape over 20 years of technological change). Anna-Maria presented us with an astounding amount of information, and I hope it is published in a consultable format sooner rather than later.
Finally, Thomas Stäcker (Herzog August Bibliothek) provided an overview of a newly created centre for supporting the creation and publication of digital scholarly editions. Resituating the research library as a digital publications hub is increasingly common in digital humanities contexts, and this library is well on its way to doing so successfully.
Merisa has done a phenomenal job providing an overview of Paul Eggert’s opening keynote, as well as of the fascinating presentations given by guests of honour Hans Walter Gabler and Peter Shillingsburg after receiving their ESTS awards, so I will not discuss those portions of the conference. Shillingsburg’s short remarks, “Then and Now and When: Glances at Gutenberg and Google,” though, were particularly interesting. His observation that editorial projects are only as strong as their weakest contributor rings utterly true in my experience with collaborative editing projects. Importantly, though, this is a starting point for increasing expertise in a team-based way. As Merisa writes:
My take away from Peter’s talk was his assertion that recognizing key differences and tensions between Anglo-American and European editing traditions is a good way to bring us together and stimulate fruitful conversation; a point that is important to hold on to as we navigate an increasingly divisive political climate. Peter also urged us to consider with whom we surround ourselves; by choosing people who are, in our estimation, brighter minds and better critical thinkers, we can ensure that we never take for granted that we have nothing left to learn about textual scholarship, collaboration, or ourselves.
Since those three days in Antwerp, I have found myself thinking again about a handful of panels I was lucky enough to be part of. First, three DiXiT supervisors (Franz Fischer, Elena Pierazzo, and Susan Schreibman) gave a collaborative presentation on DiXiT itself: “DiXiT: Research, Training, and Networking in the Field of Digital Editing.” I have long been fascinated by the infrastructural shape of the digital humanities, its training apparatuses, and its institutional vagaries. As Franz Fischer said in his opening, “There are many myths about DiXiT, what it is, what it does.” Openly discussing how communities are formed, the work that individuals can do in order to extend and leverage networks for wider community engagement and training, and the nuts & bolts of how disciplinary formation actually happens on the ground was incredibly valuable. Elena Pierazzo’s discussion of what skills are necessary for specialists in digital scholarly editing dovetailed nicely with many internal discussions we have had within DiXiT about training programs, secondments, and technical skills.
Happily, I was also able to serve as chair for a very interesting panel on Linked Data. Mathias Coeckelbergs (University of Brussels and Leuven) kicked us off by discussing natural language processing techniques for better search and analytical use of digital archives. Although he works primarily with Semitic texts, the methods he outlined are similar to those being used by the JSTOR Labs Shakespeare project. Gábor Palkó, Research Director of the Petőfi Literary Museum in Hungary, presented his work on DigiPhil.hu, an aggregator and publisher of metadata and philological texts in Hungary; the project also serves as the Hungarian contributor to the larger Europeana. Finally, Dániel Balogh (British Museum) discussed his work on the “Asia Beyond Boundaries” project to set up and populate a database of epigraphic texts in Indic languages from the Gupta period. What struck me most about this session was the need for systemic thinking by editors about how, and how effectively, the products we create – editions – are sustained, discovered, accessed, and used. Too often editors can assume that done means done, but in a world of interconnected scholarship and multiple avenues of access, that is insufficient for high-quality scholarly editions to make an impact.
Kathryn Sutherland’s excellent keynote, “Making Copies,” was a fantastic way to round out the conference. Much like Merisa, I hope it becomes available in some published form sooner rather than later; the talk was so rich and complex that this report would not do it justice. In lieu of a summary, I have embedded the tweets from during her talk here. If the talk is made available online, I will update with a link.
More than anything else, DiXiT has provided a superb group of colleagues who have become friends over the last two years. Thinking back to our initial, awkward meeting and discussions in Cambridge in April 2014, it is intensely satisfying to think how each of our projects, and our personal lives, have advanced since then. This was brought home to me in Antwerp. Elena Spadini defended her PhD thesis during the convention and then joined us in Antwerp; I had defended in February and formally graduated in June from the University of Victoria. At the end of ESTS, the fellows presented us both with bottles of the University of Antwerp’s “Huisbier,” “Cum Laude.” I am profoundly lucky to have such good and caring friends in the bizarre and often stressful world of academia. If for no other reason than getting to know my fellow fellows Misha, Richard, Merisa, Frederike, Elena, Elli, Aodhán, Fran, Tuomo, Federico, and Anna-Maria, DiXiT has been well worth it.