Types of materials

The repository holds data material, scripts and programs that act on these data, and relevant documentation.

Separate pages detail the categories we used for the classification of resources and the standards used for the storage and description of these resources.

Data material


Data material is typically of two types: lexical databases and tables with lexical statistics. Lexical databases list a large set of words (written or spoken), along with a variety of their lexical and statistical characteristics. Tables provide a limited range of statistics about words or parts of words.

Data material will be stored in two formats: text format for small files and MySQL format for large ones. The text files will all be organized in tab-separated columns. In such a file, each line holds a different lexical item (typically a word, sometimes a part of a word) and each column stores a different piece of information.
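As an illustration of the text format described above, here is a minimal Python sketch that reads such a tab-separated file. The sample rows, the column names (word, freq, length) and the function name read_lexical_table are assumptions made up for this example; they do not describe any actual file in the repository.

```python
import csv
import io

# Hypothetical sample data (invented values): word, frequency, number of letters.
SAMPLE = "cat\t66.3\t3\nhouse\t514.9\t5\nzebra\t1.2\t5\n"

def read_lexical_table(handle, columns):
    """Read a tab-separated file: one lexical item per line, one field per column."""
    reader = csv.reader(handle, delimiter="\t")
    # Pair each field with its (assumed) column name; skip empty lines.
    return [dict(zip(columns, row)) for row in reader if row]

rows = read_lexical_table(io.StringIO(SAMPLE), ["word", "freq", "length"])
for row in rows:
    print(row["word"], float(row["freq"]), int(row["length"]))
```

For a file on disk, the same function can be given an open file handle instead of the in-memory sample.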

View all Data Files listed in our repository...

Scripts and software for the extraction of lexical statistics


An original feature of this project is that we propose to complement our repository of databases, datasets and tables with a repository of scripts or programs that operate on the data material held in our repository.

This initiative has several advantages. First, it greatly reduces the number and size of the databases to store. Only statistics about whole words need to be stored. All other statistics can be computed by programs from the stored tables. Second, programs are essential for computing statistics for nonwords, which are not found in word databases. Though it is possible to construct a database of nonwords, it would rapidly turn into a needlessly huge resource: coding every sequence of 10 letters would require 26^10 entries (about 1.4 x 10^14), most of which would never be looked up.
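To make the argument concrete, the following Python sketch derives a statistic for a nonword from a word-level table alone. The table contents are invented, and the choice of statistic (summed letter-bigram frequency, weighted by word frequency) is only an assumption for the sake of the example, not the method used by the repository's tools.

```python
from collections import Counter

# Hypothetical word-level table: word -> frequency count (invented values).
WORD_FREQ = {"cat": 10, "cart": 4, "tart": 2, "art": 6}

def bigram_counts(word_freq):
    """Frequency-weighted counts of every letter bigram in the word table."""
    counts = Counter()
    for word, freq in word_freq.items():
        for i in range(len(word) - 1):
            counts[word[i:i + 2]] += freq
    return counts

def summed_bigram_frequency(item, counts):
    """Sum of bigram counts for any letter string, word or nonword."""
    # Bigrams never seen in the table count as 0 (Counter default).
    return sum(counts[item[i:i + 2]] for i in range(len(item) - 1))

counts = bigram_counts(WORD_FREQ)
nonword_score = summed_bigram_frequency("cark", counts)
print(nonword_score)  # ca (14) + ar (12) + rk (0) = 26

# Size of a table listing every possible 10-letter string.
print(26 ** 10)  # 141167095653376
```

Only the four-word table is stored; the score for "cark", which appears in no word list, is computed on demand.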

In this repository, priority will be given to small scripts written in easy-to-use languages (such as Perl, Python, Awk, Rebol, Transcript). We will redirect large applications written in C/C++ or Java to sites that host software and code developed for the computational linguistics community (for instance, the organisation for open-source projects related to natural language processing at http://opennlp.sourceforge.net/ or the Natural Language Software Registry at http://registry.dfki.de/).

These scripts will all be open source, under some kind of public licence (the GPL, i.e. the GNU General Public License, or Creative Commons). This is to guarantee the reusability of the programs. When the field definitions a program expects do not fit the format of new materials, it is much faster to adapt the program to the format of the data in a database than to reformat the data to comply with the requirements of the program.
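One way to keep a script easy to adapt is to pass the file layout in as parameters rather than hard-coding it. The Python sketch below is an assumption about how such a script might look; the function name total_frequency, both layouts and all values are invented for illustration.

```python
import csv
import io

def total_frequency(handle, word_col, freq_col, delimiter="\t"):
    """Sum the frequency column; the layout is passed in, not hard-coded."""
    reader = csv.reader(handle, delimiter=delimiter)
    # Skip empty lines and lines whose word field is empty.
    return sum(float(row[freq_col]) for row in reader if row and row[word_col])

# Two hypothetical layouts holding the same invented data.
layout_a = io.StringIO("cat\t10\ncart\t4\n")    # columns: word, frequency
layout_b = io.StringIO("4;n;cart\n10;n;cat\n")  # columns: frequency, class, word

total_a = total_frequency(layout_a, word_col=0, freq_col=1)
total_b = total_frequency(layout_b, word_col=2, freq_col=0, delimiter=";")
print(total_a, total_b)
```

Adapting the script to a new database then means changing two column indices and a delimiter, while the data files stay untouched.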

View all Tools listed in our repository...



Documentation


Documentation comes in two forms:

  1. Official documentation, held in a wiki with a clearly structured format. The wiki format means that content is fast and easy to edit or update, and that editing responsibilities can be shared between members of the community. Currently, viewing is public but editing is password-restricted.
  2. Unofficial documentation, which consists of stand-alone documents roughly organised by keyword (the usual ones: language and category of materials).

In the short term, for security reasons, editing access can only be granted to trusted users (members of the academic community who write from their university email address) who are keen to use this website to share a set of their own material (more than 5 pages) with the community. They can contact the administrator of this website by email (see contacts) to submit their request.

In the longer term, the aim is to integrate the wiki into the content management system, so that any self-registered member can create content for private or public viewing or editing. As a common parsing program is used for the documents in the wiki section and the ones in the official section, it would be quite easy to let members keep a document private while they are working on their draft, open it for commenting or editing to a given group of users (for instance fellow lab members, project collaborators, or lexicall peer-reviewers) when close to submission, then eventually have it integrated into the official section if the document is judged to be of general interest.

View all User-contributed documents in our repository...



The repository also offers a searchable directory of links to databases, personal pages, associations, organizations, publications, conferences, and listservs held elsewhere.

View all Links in our repository...