The Kainoki Treebank – a parsed corpus of contemporary Japanese

The Kainoki Treebank is a corpus of contemporary Japanese with hand-worked tree analyses for nearly one and a half million words. Highlights include:

labelled constituent structure
assignments of grammatical function
zero elements
information to resolve anaphoric dependencies

Further results derived from the analysis, notably dependency graphs, can be seen with the search interface.

A brief history

Construction of The Kainoki Treebank has been ongoing since 2012. Until 2016, the corpus was known under the name The Keyaki Treebank. The name was changed when its development became the main focus for the NPCMJ project at the National Institute for Japanese Language and Linguistics (NINJAL). This marked a shift in the goals of the annotation: initial development emphasised markup for automatic parser training; post-NPCMJ, the primary aim has been to provide a resource that best facilitates pattern search for linguistic research. Snapshot versions of the corpus were released as the NINJAL Parsed Corpus of Modern Japanese (NPCMJ) on a yearly basis between 2016 and 2022.

About the annotation

There is a Parsing Guide that describes the annotation scheme in detail.

Segmentation and word class labelling follow the principle of using terminal nodes that are as large as possible, but not so large that purely lexical elements absorb other elements with functional roles. This policy corresponds closely to the LUW (Long Unit Word) standard of the Corpus of Spontaneous Japanese (CSJ; Maekawa 2003) and the Balanced Corpus of Contemporary Written Japanese (BCCWJ; Maekawa et al. 2014).

Syntactic structure is represented with labelled parentheses in the style of the Penn Treebank (Bies et al. 1995). More particularly, the Penn Historical Corpora scheme (Santorini 2010) has informed the ‘look’ of the annotation. This includes:

adoption of the CorpusSearch format (Randall 2009) as the underlying encoding,
not having any explicit verb phrase structure (although verb phrase structure is implicitly present when there are interpretive consequences),
the use of IP, ADVP, NP, and PP tag labels,
the presentation of phrase conjunction structure with CONJP layers, and
the marking of function for all clausal nodes and all clause level constituents.
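As a rough illustration of what labelled parentheses look like to a program, the following minimal sketch reads one bracketed tree into nested Python tuples. The sample tree is invented for this example and is not taken from the corpus; the labels IP, NP and PP come from the list above, while the extensions -MAT and -SBJ, the word class tags N, P and VB, and all function names are assumptions made only for illustration. The Parsing Guide remains the authority on the actual tag set.

```python
from typing import List, Tuple, Union

# A node is (label, children); a child is either another node or a word string.
Node = Tuple[str, list]

# Invented toy tree; the tags and extensions are illustrative assumptions.
SAMPLE = "(IP-MAT (PP-SBJ (NP (N 犬)) (P が)) (VB 走る))"


def tokenize(text: str) -> List[str]:
    """Split bracketed text into '(', ')' and whitespace-delimited symbols."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str], pos: int = 0) -> Tuple[Node, int]:
    """Parse the bracket that opens at tokens[pos]; return (node, next position)."""
    if tokens[pos] != "(":
        raise ValueError("expected '(' at token %d" % pos)
    pos += 1
    label = ""
    # A bracket may open without a label (an unlabelled wrapper layer).
    if tokens[pos] not in ("(", ")"):
        label = tokens[pos]
        pos += 1
    children: list = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            children.append(child)
        else:
            children.append(tokens[pos])  # a terminal word
            pos += 1
    return (label, children), pos + 1


def read_tree(text: str) -> Node:
    """Read exactly one tree from text."""
    tokens = tokenize(text)
    tree, end = parse(tokens)
    if end != len(tokens):
        raise ValueError("unexpected material after the first tree")
    return tree


def leaves(tree: Union[Node, str]) -> List[str]:
    """Collect the terminal words of a tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    words: List[str] = []
    for child in tree[1]:
        words.extend(leaves(child))
    return words


if __name__ == "__main__":
    tree = read_tree(SAMPLE)
    print(tree[0], leaves(tree))
```

The sketch assumes that parentheses occur only as structural brackets and never inside a word; unbalanced input is not handled gracefully.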

Annotation practice strives for observational adequacy. The aim is to present a consistent linguistic analysis for each attestation of an identifiable linguistic relation or process. The annotation also offers syntactic analysis for the subsequent generation of meaning representations using the methods of Treebank Semantics (Butler 2015). To this end, extra disambiguation information is added to feed the calculation of semantic analyses from the syntactic annotation.

One prominent case of extra disambiguation information being added is seen in the different specifications of clause linkage (i.e., different types of non-final clauses). The annotation has the -SCON tag extension to identify clauses integrated with subordinate conjunction. Subordinate clause status influences the distribution of empty subject positions within such clauses and the relationships these positions have with outside arguments according to an antecedent calculation called ‘control’. These cases are contrasted with coordinate clause linkage, also identified with a tag extension: -CONJ (coordinating conjunction). Status as a coordinate clause influences the distribution of outside arguments according to an antecedent calculation called ‘Across the Board extraction’ (ATB).
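To make the role of these tag extensions concrete, here is a small sketch that walks a tree and picks out the nodes carrying a given extension. Only the extensions -SCON and -CONJ and the CONJP label come from the description above; the base labels (IP-ADV, IP-MAT, VB, NP, N), the placeholder words and the function names are assumptions for illustration, and the tree is invented rather than taken from the corpus.

```python
from typing import Iterator, Tuple, Union

# A node is (label, children); a child is either another node or a word string.
Node = Tuple[str, list]

# Invented tree with placeholder words; base labels are illustrative assumptions.
SAMPLE: Node = (
    "IP-MAT",
    [
        ("IP-ADV-SCON", [("VB", ["w1"])]),
        ("IP-ADV-CONJ", [("VB", ["w2"])]),
        ("CONJP", [("NP", [("N", ["w3"])])]),
        ("VB", ["w4"]),
    ],
)


def has_extension(label: str, ext: str) -> bool:
    """True if ext is one of the hyphen-separated extensions of label.

    The base label itself (the part before the first hyphen) is not
    considered, so 'CONJP' does not count as carrying '-CONJ'.
    """
    return ext in label.split("-")[1:]


def find_by_extension(tree: Union[Node, str], ext: str) -> Iterator[Node]:
    """Yield every node whose label carries the extension, in document order."""
    if isinstance(tree, str):
        return
    label, children = tree
    if has_extension(label, ext):
        yield tree
    for child in children:
        yield from find_by_extension(child, ext)


if __name__ == "__main__":
    for ext in ("SCON", "CONJ"):
        print(ext, [node[0] for node in find_by_extension(SAMPLE, ext)])
```

Matching on hyphen-separated parts rather than on substrings is a deliberate choice in this sketch: a plain substring test for "CONJ" would also match the CONJP layers used for phrase conjunction, which are a different part of the annotation.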

Search Interface

The Kainoki Treebank is associated with a powerful user interface that enables search using virtually any aspect of the annotation. Results of specific searches can be downloaded in the form of annotated data. The source data to which the search interface links is being updated constantly to reflect improvements in analysis.

For examples of search patterns, see: How to find constructions in The Kainoki Treebank.

Using the corpus for advanced research

The available on-line interface is a powerful and flexible tool sufficient for many research purposes, but note that the data on which it operates is updated in real time. For extended research projects it is often advisable to download a release version for use off-line. This not only provides the advanced researcher with a stable set of data, but also allows the data to be modified to reflect analyses that are not included in the present online corpus but are necessary for a given research project. Tools suitable for searching the parsed data off-line without any modification to the data include CorpusSearch (Randall 2009) and Tregex (Levy and Andrew 2006).
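For quick checks on downloaded data before turning to a dedicated search tool, a few lines of scripting can already give an overview. The sketch below splits a text containing several bracketed trees into individual trees and counts how often each node label occurs. It is not a substitute for CorpusSearch or Tregex; the sample text is invented, the function names are hypothetical, and the layout of an actual release file (wrapper layers, ID nodes, file names) is an assumption that should be checked against the release itself.

```python
import re
from collections import Counter
from typing import Iterator

# Invented two-tree sample with placeholder words; not corpus data.
SAMPLE_TEXT = """
(IP-MAT (NP (N w1)) (VB w2))

(IP-MAT (PP (NP (N w3)) (P w4)) (VB w5))
"""

# A label is the first symbol after an opening bracket.
LABEL = re.compile(r"\(\s*([^\s()]+)")


def split_trees(text: str) -> Iterator[str]:
    """Yield each top-level bracketed tree in text, tracking bracket depth."""
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0 and start is not None:
                yield text[start:i + 1]
                start = None


def label_counts(text: str) -> Counter:
    """Count node labels over all trees found in text."""
    counts: Counter = Counter()
    for tree in split_trees(text):
        counts.update(LABEL.findall(tree))
    return counts


if __name__ == "__main__":
    for label, n in label_counts(SAMPLE_TEXT).most_common():
        print(label, n)
```

To apply this to a downloaded release, the text of a file would be read (for example with pathlib's read_text and an explicit encoding) and passed to label_counts. As with any bracket-depth approach, this assumes that parentheses appear only as structural brackets.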

Mistakes

As with any annotated text corpus, there are mistakes in The Kainoki Treebank. The corpus is ongoing work and is under continuous improvement and correction. Mistakes will be corrected as they become apparent and as time allows. We will be grateful to be made aware of mistakes and will endeavour to eliminate them with the help of users (contact).

Attribution

Presentations of research results using The Kainoki Treebank should include a citation taking the general form of the example below (with appropriate modifications depending on the date of access):

Kainoki, Ed (2022) “The Kainoki Treebank – a parsed corpus of contemporary Japanese” https://kainoki.github.io (accessed 9 January 2022).

Terms of use

This work is licensed under a Creative Commons Attribution 4.0 International License.