Canterbury corpus

The Canterbury corpus is a collection of files intended for use as a benchmark for testing lossless data compression algorithms. It was created in 1997 at the University of Canterbury, New Zealand, and was designed to replace the Calgary corpus. The files were selected for their ability to give representative performance results.[1]

Contents
In its most commonly used form, the corpus consists of 11 files, selected as “average” documents from 11 classes of documents,[2] totaling 2,810,784 bytes, as follows:

Size (bytes)  File name      Description
     152,089  alice29.txt    English text
     125,179  asyoulik.txt   Shakespeare
      24,603  cp.html        HTML source
      11,150  fields.c       C source
       3,721  grammar.lsp    LISP source
   1,029,744  kennedy.xls    Excel spreadsheet
     426,754  lcet10.txt     Technical writing
     481,861  plrabn12.txt   Poetry
     513,216  ptt5           CCITT test set
      38,240  sum            SPARC executable
       4,227  xargs.1        GNU manual page
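
As an illustration of how the corpus can serve as a benchmark, the following minimal Python sketch reports compressed bits per input byte for each file. It is not part of the corpus or its official tooling: the helper names `bits_per_byte` and `benchmark`, the `corpus_dir` argument, and the choice of zlib (DEFLATE) as the compressor under test are all assumptions made for this example. The sizes are copied from the file list above and are used to check that local copies match.

```python
import zlib
from pathlib import Path

# File sizes in bytes, copied from the file list above.
CANTERBURY_SIZES = {
    "alice29.txt": 152_089,
    "asyoulik.txt": 125_179,
    "cp.html": 24_603,
    "fields.c": 11_150,
    "grammar.lsp": 3_721,
    "kennedy.xls": 1_029_744,
    "lcet10.txt": 426_754,
    "plrabn12.txt": 481_861,
    "ptt5": 513_216,
    "sum": 38_240,
    "xargs.1": 4_227,
}


def bits_per_byte(data, compress=zlib.compress):
    """Return compressed size in bits divided by original size in bytes."""
    if not data:
        raise ValueError("cannot measure an empty input")
    return 8 * len(compress(data)) / len(data)


def benchmark(corpus_dir, compress=zlib.compress):
    """Return {file name: bits per byte} for corpus files found in corpus_dir.

    corpus_dir is a hypothetical local directory holding the unpacked corpus.
    Files that are missing are skipped; a size mismatch raises ValueError.
    """
    results = {}
    for name, expected_size in CANTERBURY_SIZES.items():
        path = Path(corpus_dir) / name
        if not path.is_file():
            continue
        data = path.read_bytes()
        if len(data) != expected_size:
            raise ValueError(
                f"{name}: expected {expected_size} bytes, found {len(data)}"
            )
        results[name] = bits_per_byte(data, compress)
    return results
```

Calling `benchmark("path/to/corpus")` with a directory containing the 11 files would return one figure per file; any other callable that maps bytes to bytes can be passed as `compress` to compare compressors on the same inputs.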

See also

Data compression

References

[1] Witten, Ian H.; Moffat, Alistair; Bell, Timothy C. (1999). Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann. p. 92.
[2] Salomon, David (2007). Data Compression: The Complete Reference (Fourth ed.). Springer. p. 12. ISBN 9781846286032.

External links

The Canterbury Corpus
