Using the library from your own application

The library may be used to develop your own NLP application (e.g. a machine translation system, an intelligent indexation module for a search engine, etc.)

To achieve this goal, you have to link your application to the library and access it via the provided API. Since the library is written in C++, a C++ program gets full access to all library functionalities. However, quite complete APIs are also provided for Java, Perl, Python, PHP, and Ruby.

Basic Classes

This section briefly describes the basic C++ classes any application needs to know about. For a detailed API definition, consult the technical documentation in the doc/html and doc/latex directories, or the chapters about Linguistic Data Classes and Language Processing Modules.

Linguistic Data Classes

The different processing modules work on objects containing linguistic data (such as a word, a PoS tag, a sentence...).

Your application must be aware of these classes in order to provide each processing module with the right data, and to correctly interpret the module's results.

The linguistic classes are:

  • analysis: A tuple <lemma, PoS tag, probability, sense list>

  • word: A word form with a list of possible analyses.

  • sentence: A list of words known to be a complete sentence. A sentence may have a parse_tree object and a dep_tree object associated with it.

  • parse_tree: An n-ary tree where each node contains either a non-terminal label, or -if the node is a leaf- a pointer to the appropriate word object in the sentence the tree belongs to.

  • dep_tree: An n-ary tree where each node contains a reference to a node in a parse_tree. The structure of the dep_tree establishes syntactic dependency relationships between sentence constituents.

  • paragraph: A list of sentence objects that form a paragraph.

  • document: A list of paragraph objects that form the document. A document may have associated a set of coreference groups and a semantic graph.

  • mention: A mention of an entity in the document. A group of mentions referring to the same entity form a coreference group.

  • semantic_graph: An entity-relationship graph describing the main actions reported in the text and the principal actors involved in each.

Processing modules

The main processing classes in the library are:

  • tokenizer: Receives plain text and returns a list of word objects.

  • splitter: Receives a list of word objects and returns a list of sentence objects.

  • maco: Receives a list of sentence objects and morphologically annotates each word object in the given sentences. Includes specific submodules (e.g., detection of dates, numbers, multiwords, etc.) which can be activated at will.

  • tagger: Receives a list of sentence objects and disambiguates the PoS of each word object in the given sentences.

  • nec: Receives a list of sentence objects and modifies the tag for detected proper nouns to specify their class (e.g. person, place, organization, others).

  • ukb: Receives a list of sentence objects and enriches the words with a ranked list of WordNet senses.

  • parser: Receives a list of sentence objects and associates to each of them a parse_tree object.

  • dependency: Receives a list of parsed sentence objects and associates to each of them a dep_tree object.

  • coref: Receives a document (containing a list of parsed sentence objects) and labels each noun phrase as belonging to a coreference group, if appropriate.

You may create as many instances of each as you need. Constructors for each of them receive the appropriate options (e.g. the name of a dictionary, HMM, or grammar file), so you can create each instance with the required capabilities (for instance, a tagger for English and another for Spanish).

Sample programs

The directory src/main/simple_examples in the tarball contains some example programs to illustrate how to call the library.

See the README file in that directory for details on what each program does.

The most complete program in that directory is a simple version of the analyzer program described in section Using analyzer Program to Process Corpora, with a fixed set of options.

Note that depending on the application, the input text could be obtained from a speech recognition system, from an XML parser, or from any source suiting the application goals. Similarly, the obtained analysis, instead of being output, could be used in a translation system, or sent to a dialogue control module, etc.

#include <iostream>
#include <list>
#include "freeling.h"
using namespace std;
using namespace freeling;

void ProcessResults(const list<sentence> &ls);

int main (int argc, char **argv) {
  // set locale to an UTF8 compatible locale
  util::init_locale(L"default");
  // path where data files reside
  wstring path=L"/usr/local/share/freeling/es/";
  // create analyzers
  tokenizer tk(path+L"tokenizer.dat");
  splitter sp(path+L"splitter.dat");
  splitter::session_id sid=sp.open_session();
  // morphological analysis has a lot of options, and for simplicity they are
  // packed up in a maco_options object.
  // First, create the maco_options object with default values.
  maco_options opt(L"es");
  // then, provide files for morphological submodules.
  // Note that opt.QuantitiesFile is not set and takes the default empty value.
  // This will cause the quantities module to be deactivated in this example.
  opt.LocutionsFile=path+L"locucions.dat";       opt.AffixFile=path+L"afixos.dat";
  opt.ProbabilityFile=path+L"probabilitats.dat"; opt.DictionaryFile=path+L"dicc.src";
  opt.NPdataFile=path+L"np.dat";                 opt.PunctuationFile=path+L"../common/punct.dat";
  // alternatively, you could set the files in a single call:
  // opt.set_data_files(L"", path+L"locucions.dat", L"", path+L"afixos.dat",
  //                    path+L"probabilitats.dat", path+L"dicc.src",
  //                    path+L"np.dat", path+L"../common/punct.dat");
  // create the analyzer with the just built set of maco_options
  maco morfo(opt);
  // then, set required options on/off
  morfo.set_active_options (false, // UserMap
                            true,  // NumbersDetection
                            true,  // PunctuationDetection
                            true,  // DatesDetection
                            true,  // DictionarySearch
                            true,  // AffixAnalysis
                            false, // CompoundAnalysis
                            true,  // RetokContractions
                            true,  // MultiwordsDetection
                            true,  // NERecognition
                            false, // QuantitiesDetection
                            true); // ProbabilityAssignment
  // create a hmm tagger for Spanish (with retokenization ability, and forced
  // to choose only one tag per word)
  hmm_tagger tagger(path+L"tagger.dat", true, FORCE_TAGGER);
  // create chunker
  chart_parser parser(path+L"chunker/grammar-chunk.dat");
  // create dependency parser
  wstring S=parser.get_start_symbol();
  dep_txala dep(path+L"dep_txala/dependences.dat", S);

  // get plain text input lines while not EOF.
  wstring text;
  list<word> lw;
  list<sentence> ls;
  while (getline(wcin,text)) {
    // tokenize input line into a list of words
    lw=tk.tokenize(text);
    // accumulate list of words in splitter buffer, returning a list of sentences.
    // The resulting list of sentences may be empty if the splitter has still not
    // enough evidence to decide that a complete sentence has been found. The list
    // may contain more than one sentence (since a single input line may consist
    // of several complete sentences).
    ls=sp.split(sid, lw, false);
    // perform morphosyntactic analysis, PoS tagging and parsing
    morfo.analyze(ls);
    tagger.analyze(ls);
    parser.analyze(ls);
    dep.analyze(ls);
    // 'ls' contains a list of analyzed sentences. Do whatever is needed
    ProcessResults(ls);
    // clear temporary lists
    lw.clear(); ls.clear();
  }

  // No more lines to read. Make sure the splitter doesn't retain anything:
  // analyze sentence(s) which might be lingering in the buffer, if any.
  ls=sp.split(sid, lw, true);
  morfo.analyze(ls);
  tagger.analyze(ls);
  parser.analyze(ls);
  dep.analyze(ls);
  // 'ls' contains a list of analyzed sentences. Do whatever is needed
  ProcessResults(ls);

  sp.close_session(sid);
  return 0;
}

The processing performed on the obtained results would obviously depend on the goal of the application (translation, indexation, etc.). In order to illustrate the structure of the linguistic data objects, a simple procedure is presented below, in which the processing consists of merely printing the results to stdout in XML format.

void ProcessResults(const list<sentence> &ls) {
  list<sentence>::const_iterator is;
  word::const_iterator a;        // iterator over all analyses of a word
  sentence::const_iterator w;
  // for each sentence in list
  for (is=ls.begin(); is!=ls.end(); is++) {
    // open sentence XML tag
    wcout<<L"<SENT>"<<endl;
    // for each word in sentence
    for (w=is->begin(); w!=is->end(); w++) {
      // print word form, with PoS and lemma chosen by the tagger
      wcout<<L" <WORD form=\""<<w->get_form();
      wcout<<L"\" lemma=\""<<w->get_lemma();
      wcout<<L"\" pos=\""<<w->get_tag();
      wcout<<L"\">"<<endl;
      // for each possible analysis in word, output lemma, tag and probability
      for (a=w->analysis_begin(); a!=w->analysis_end(); ++a) {
        // print analysis info
        wcout<<L"  <ANALYSIS lemma=\""<<a->get_lemma();
        wcout<<L"\" pos=\""<<a->get_tag();
        wcout<<L"\" prob=\""<<a->get_prob();
        wcout<<L"\"/>"<<endl;
      }
      // close word XML tag after list of analyses
      wcout<<L" </WORD>"<<endl;
    }
    // close sentence XML tag
    wcout<<L"</SENT>"<<endl;
  }
}
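For a sentence with a single word, the procedure sketched above would emit markup shaped roughly like this (the tag names follow the comments in the code; the attribute values here are illustrative, and the actual output depends on the analyzers and data files used):

```xml
<SENT>
 <WORD form="gatos" lemma="gato" pos="NCMP000">
  <ANALYSIS lemma="gato" pos="NCMP000" prob="1"/>
 </WORD>
</SENT>
```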

The above sample program may be found in src/main/simple_examples in the FreeLing tarball. The actual program also outputs the tree structures resulting from parsing, which is omitted here for simplicity.

Once you have compiled and installed FreeLing, you can build this sample program (or any other you may want to write) with a command like: g++ -o sample sample.cc -lfreeling (assuming your source file is named sample.cc).

The -lfreeling option links with the libfreeling library, which is the final result of the FreeLing compilation process. Check the README file in that directory to learn more about compiling and using the sample programs.

You may need to add some -I and/or -L options to the compilation command depending on where the headers and code of required libraries are located. For instance, if you installed some of the libraries in /usr/local/mylib instead of the default place /usr/local, you'll have to add the options -I/usr/local/mylib/include -L/usr/local/mylib/lib to the command above.

Executing make in src/main/simple_examples will compile all sample programs in that directory. Make sure that the paths to the FreeLing installation directory in the Makefile are correct.