Web Harvesting Collaborations
at Library and Archives Canada
Tom Smyth
Manager, Digital Capacity
IIPC GA 2014
Operating Context
LAC’s Legislative Context:
Library and Archives of Canada Act (S.C. 2004, c. 11)
WHEREAS it is necessary that
(a) the documentary heritage of Canada be preserved for the benefit of present
and future generations;
(b) Canada be served by an institution that is a source of enduring knowledge
accessible to all, contributing to the cultural, social and economic advancement
of Canada as a free and democratic society;
(c) that institution facilitate in Canada cooperation among the communities
involved in the acquisition, preservation and diffusion of knowledge; and
(d) that institution serve as the continuing memory of the government of Canada
and its institutions;
Legislative Context:
Authorities for Collection
Multiple authorities exist within the LAC Act that
empower it to collect government information:
• Section 10: Legal Deposit
– Covers all publications that are published in Canada,
including those from the GC.
• Section 12 & 13: Government and Ministerial Records
– Covers the disposition, transfer, and right of access to
government records.
Legislative Context:
Authorities for Collection (2)
Library and Archives of Canada Act
Sampling from Internet
Section 8. (2):
“In exercising the powers referred to in paragraph (1)(a) and
for the purpose of preservation, the Librarian and Archivist
may take, at the times and in the manner that he or she
considers appropriate, a representative sample of the
documentary material of interest to Canada that is
accessible to the public without restriction through the
Internet or any similar medium”.
LAC’s Web Harvesting:
• LAC began harvesting with the Government of Canada web
presence in 2005
– Hybrid library and archival methodological context
– In total LAC has collected it four times (2005, 2006, 2007, 2013)
• LAC makes this harvested data openly accessible via the Government
of Canada Web Archive (GCWA, 2006)
• Thematic material collected since 2005, but not according to a
disciplined (library) collection development methodology until about
Government of Canada Web Archive
WebCan Crawl Management Tool
Developed internally by LAC to allow acquisition staff to manage all
operations: harvest definitions, seed lists, crawl management, quality
Collection Overview
• LAC’s web archival collection consists of:
– Four comprehensive crawls of the Canadian federal
government web presence (*.gc.ca, some *.ca)
• (2005, 2006, 2007, 2013)
– Decommissioned federal websites emergency-harvested
between domain crawl periods
• ~170 major departmental websites since 2009
– Thematic collections from the open Canadian web
• ~15, built with increasingly complex collaborative relationships
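A scope of "*.gc.ca, some *.ca" can be expressed as a simple host test. The following is a minimal sketch, not LAC's actual crawl configuration: the function name `in_scope` and the allow-list host `example-agency.ca` are hypothetical placeholders standing in for the unnamed "some *.ca" sites.

```python
from urllib.parse import urlsplit

# Hypothetical scope rules: any host under gc.ca, plus an explicit allow-list
# standing in for the "some *.ca" sites (the host below is a placeholder).
IN_SCOPE_SUFFIX = "gc.ca"
EXTRA_HOSTS = {"example-agency.ca"}


def in_scope(url: str) -> bool:
    """Return True if the URL's host is gc.ca, a subdomain of it, or allow-listed."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if host == IN_SCOPE_SUFFIX or host.endswith("." + IN_SCOPE_SUFFIX):
        return True
    return host in EXTRA_HOSTS
```

Matching on ".gc.ca" with the leading dot, rather than on the bare suffix, keeps look-alike hosts such as notgc.ca out of scope.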
Collaborative Projects
Models of Collaboration
• Internal collaboration
– Librarians and Archivists at LAC working together to curate seedlists and
web collections
– 2005-2009
• Interdepartmental collaboration
– Librarians, Archivists, PMs, Olympics Specialists
– Olympic and Paralympic Web Archive
– 2009-
• Internal, Interdepartmental, Stakeholders
– Librarians, Archivists, Policy, Government Administration and Technical
Specialists from Central Agencies, and Govt Docs and Data Specialists
– 2013-
Olympic and Paralympic
Web Archive
Canadian Olympics Web Archive
• LAC started curating web collections to document
Canadian participation in the Olympic and Paralympic
games shortly after the programme launched in 2005.
• LAC has curated web archival collections for each of the
following games:
Turin, Winter 2006
Beijing, Summer 2008
Vancouver, Winter 2010
London, Summer 2012
Sochi, Winter 2014
Library and Archives Canada
Vancouver 2010 Olympic and Paralympic Games
Olympics WA Key Questions
• Who will be consulting our web archives?
• Why will they be consulting our web archives?
• How will they use the web archives?
– Data and text mining, looking for specific resources,
social and cultural context
• What sort of resources warrant entry to a web archive?
• How can the web archive become a robust and
multidisciplinary research tool?
Vancouver 2010 Web Archival Project
• LAC entered into a partnership with the Department of
Canadian Heritage to build a web archive documenting
the uniquely Canadian Vancouver 2010 games.
– Selected based on the statistical needs of PCH (largely tourism)
– Curated from the outset to cover broad social, cultural,
economic, infrastructural, academic topics for maximum data
and research applicability.
– Nine iterative crawl jobs took place to capture 350+ websites
(many 2-3 times each), comprising ~2 TB of data.
Vancouver 2010 Olympic Games
• Seedlist selected by Canadian Heritage specialists,
representatives from the Federal Secretariat for the Vancouver
Games, and LAC librarians and archivists
• Olympics Studies methodologies were considered and factored into the
curation of the collection (Olympics impact, infrastructure, sports
medicine, sponsorship, coaching, etc.)
• Methodology created to assess target websites for currency,
authority, perspective, frequency of content generation
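One way to record an assessment against these four criteria is sketched below. This is an illustration only: the slides do not describe how the criteria were scored or combined, so the 0-3 scale, the unweighted sum, and the names `SiteAssessment` and `rank` are all assumptions.

```python
from dataclasses import dataclass

# The four criteria named on the slide; the 0-3 integer scale is hypothetical.
CRITERIA = ("currency", "authority", "perspective", "frequency")


@dataclass
class SiteAssessment:
    url: str
    currency: int
    authority: int
    perspective: int
    frequency: int

    def total(self) -> int:
        """Unweighted sum of the four criterion scores (an assumed combination rule)."""
        return sum(getattr(self, name) for name in CRITERIA)


def rank(assessments):
    """Order candidate seeds from highest to lowest total score."""
    return sorted(assessments, key=lambda a: a.total(), reverse=True)
```

A curator could then review the top of the ranked list first, while still overriding the ordering by judgment.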
Vancouver 2010 Data Set
• Canadian Heritage was interested in the production
of a large data set that could be mined and analyzed
primarily for cultural, social, and tourism information
• Selection methodology based in part on the target
resource’s ability to contribute valuable information to
a robust, minable dataset
• This methodology has informed every project since
Vancouver 2010 Data Set
• Web collection curated to include the following:
Aboriginal and First Nations Perspectives
Environmental impact perspectives
Economic and infrastructural development and impact in Vancouver
Public Policy and Think Tank perspectives
Pro/Con perspectives in the media on hosting an Olympics
A complete record of all the social and cultural events that ran during the games,
including all the official sites reporting the day-to-day events and the results
– Tourism, Sponsorship, Own the Podium campaign, Torch Relay, etc.
– Subject matter of interest to Olympics and Sports Studies specialists
Political, Social, Cultural, Historical
Thematic Web Collections
Thematic Collections Overview
• Federal and Provincial Elections
• Royal Commissions and Commissions of Inquiry
• Canada’s Participation in Olympic and Paralympic Games
• State Funerals
• Transitions in Federal Organizations
• Decommissioned Federal Organization/Websites
• Visits to Canada from British Royals
• Change of Governors General
• 100th Anniversary of the Calgary Stampede
• Commemoration of the War of 1812
Thematic Collections: Context
• Starting in January 2013, LAC began curating major
thematic collections on political, social, cultural,
commemorative, and historical topics
• Olympics Web Archive curation and project methodology
influenced the thematic projects
• In 2013, one project was conducted per FY quarter:
Q1: The “Idle No More” Aboriginal movement in Canada
Q2: Development of the Keystone Oil Pipeline
Q3: Canadian perspectives on the Arctic
Q4: The Lac-Megantic Rail Disaster
Internal Collaboration
• Thematic project topics originated in LAC’s Strategic
Research and Policy area
• Each thematic project drew on the expertise of the relevant
library and archival subject matter specialists to scope
project parameters and develop a collaborative seedlist
Social & Cultural
Specialized Media
Government of Canada
Official Publications
and Websites
• The Treasury Board Secretariat of Canada’s Web
Renewal Action Plan
– Consolidates the GC’s Web presence from ~1,500
websites down to one, Canada.ca
• Stakeholders from the GC and the public
expressed concern that valuable web resources of
enduring value would be lost
• Key stakeholders mobilized to engage LAC and
lobby for collaborative web archiving activity
• Two major stakeholders, the Universities of Alberta
and Toronto, run their own harvesting programmes via
• Collaborative work proceeded immediately to identify
and prioritize some 3,000 government websites for
harvesting by LAC
• Stakeholder expectations, advice, and extant seedlists
directly factored into the methodology and curation of the
LAC 4th domain crawl project
• Began in September 2013, and is currently in QA
LAC’s Web Harvesting:
Current Status
• LAC began a 4th crawl of the Government of
Canada web domain in Sept 2013:
– Official Languages Act; TBS Directive on Web
– Data collection outsourced to Internet Archive’s
“Archive-It” service
– Data will be returned to LAC and made
accessible via an upgraded GCWA
2013-14 GC Domain Harvest:
Preliminary Results
• GC websites successfully captured as of May 17th 2014:
Governor General
Prime Minister’s Office
Privy Council Office (PCO)
Treasury Board Secretariat (TBS)
Canada Revenue Agency
Auditor General
Justice, the Commissioners and
Parliament (PARLinfo and LEGISinfo)
Supreme Court, Federal Court, Federal
Court of Appeal
Public Safety, RCMP
DFAIT (before FATDC, including CIDA)
DND and subsidiaries
Citizenship and Immigration
Industry and its subsidiaries
AANDC and the suite of Northern websites
EC, DFO, Climate, Weather, Canadian
Environmental Assessment Agency
– Canada.gc.ca
– Canada Gazette
– Publications.gc.ca
760+ major departmental websites
QA still ahead of us
Key Issues: Curation
• With capacity for addressing only a handful of thematic topics,
which ones get selected for curation?
– Of the issues that count, which count the most, and according to whose perspective?
– Selective vs. comprehensive, finite versus ongoing
• As more thematic projects are undertaken, sustainability and capacity
issues arise
– For themes that remain pertinent, conducting update crawls in order to
update the archive and maintain its currency
• Securing long-term buy-in and resources for the continuing support and
development of the web archiving technical infrastructure
– E.g., further development for WebCan
• How much time to put into QC?
– How much QA is “good enough”?
Conclusion (Key Answers)
• We’ve adopted a researcher-centric approach to the
construction of thematic collections
• We’ve adopted a govtdocs subject-specialist approach to
the collection of the federal government domain and its
official publications:
– LAC has web archival holdings of the Electronic Depository Services
Program Checklists as of 1995:
• http://epe.lac-bac.gc.ca/100/201/301/liste_hebdomadaire/
• http://epe.lac-bac.gc.ca/100/201/301/weekly_checklist/
• Web analytics demonstrate extensive use of the GCWA by
Canadian universities, private industry, and provincial,
federal, and international governments
Web Archives as Data Set
• LAC’s web archival holdings as:
– Open Data
• Assembling web archives with as wide a perspective as possible,
with an eye to making them Open Data?
• Potential for addition to the GC Open Data Portal
• Several requests already for the CGI-PLN LOCKSS
– Big Data
• High potential for governmental, science, policy, financial data and
textual mining; has IM applications
• Potential impetus for governmental innovation in information and
services to the Canadian public
Next Steps
• LAC is currently defining its long-term business
strategy and technical requirements for a renewed Web
harvesting program for FY 2014-2015
• The GCWA will be updated to provide access to all of LAC’s
web archival holdings (~20 TB)
– Migration of legacy ARCs to ISO standard WARCs
– GCWA will be migrated to a WCAG-compliant GC look and feel
• Construct researcher-centric discovery and search tools
Web Harvesting Team @ LAC
Tom Smyth
Manager, Digital Capacity
Patricia Klambauer
Lead Web Harvesting Technician
Strategic Initiatives and Client Relations Division
Evaluation and Acquisitions Branch
Library and Archives Canada
