The simplest way to use the FreeLing libraries is via the provided
analyzer main program, which allows the user to process an input text and obtain several kinds of linguistic analysis.
Since it is impossible to write a program that fits everyone's needs,
analyzer offers almost all the functionality included in FreeLing, but if you want it to output more information, or do so in a specific format, or combine the modules in a different way, the right path to follow is building your own main program or adapting an existing one, as described in section Using the library from your own application.
The analyzer program is usually called with an option
-f config-file (if omitted, it will search for a file named
analyzer.cfg in the current directory). The given
config-file must be an absolute file name, or a path relative to the current directory.
You can use the default configuration files (located at
/usr/local/share/freeling/config if you installed from tarball, or at
/usr/share/freeling/config if you used a
.deb package), or create a config file that suits your needs. Note that the default configuration files require the environment variable
FREELINGSHARE to be defined and to point to a directory with valid FreeLing data files (e.g. /usr/local/share/freeling).
Environment variables are used for flexibility (e.g. to avoid having to modify configuration files if you relocate your data files), but if you don't need them, you can replace all occurrences of
FREELINGSHARE in your configuration files with a static path.
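As an illustration, the replacement can be scripted rather than done by hand. The sketch below is not part of FreeLing; the file name analyzer.cfg and the path /usr/local/share/freeling are assumptions to adapt to your installation:

```python
# Sketch: replace $FREELINGSHARE with a static path in a FreeLing config file.
# The config file name and target path are assumptions; adjust to your setup.
from pathlib import Path

def freeze_config(cfg_path, share_dir):
    """Rewrite every $FREELINGSHARE occurrence in cfg_path to share_dir."""
    cfg = Path(cfg_path)
    cfg.write_text(cfg.read_text().replace("$FREELINGSHARE", share_dir))

# Example with a toy one-line config file:
Path("analyzer.cfg").write_text("TokenizerFile=$FREELINGSHARE/es/tokenizer.dat\n")
freeze_config("analyzer.cfg", "/usr/local/share/freeling")
print(Path("analyzer.cfg").read_text(), end="")
```

After this, the configuration file no longer depends on the FREELINGSHARE environment variable.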
The analyzer program also provides a server mode (use option
-server) which expects the input from a socket. The program
analyzer_client can be used to read input files and send requests to the server. The advantage is that the server remains loaded after analyzing each client's request, thus reducing the start-up overhead when many small files have to be processed. Client and server communicate via sockets. The client-server approach is also a good strategy to call FreeLing from a language or platform for which no API is provided: just launch a server and use your preferred language to program a client that behaves like analyzer_client.
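For example, such a client could be sketched in Python as below. This is only an illustration: the host, port, and especially the wire protocol (a single send followed by reading the reply until the connection closes) are assumptions; check the analyzer_client sources for the exact message exchange FreeLing expects.

```python
# Sketch of a client for the analyzer server mode.
# ASSUMPTION: the real FreeLing client/server protocol may be more elaborate;
# see the analyzer_client source code for the actual message exchange.
import socket

def request_analysis(text, host="localhost", port=50005):
    """Send text to an analyzer-style server and return the raw reply."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(text.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)   # signal end of input to the server
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:                # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")
```

A client like this would be invoked once per input file, while the server keeps all the analysis modules loaded between requests.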
The analyze (no final "r") script described below handles all these default paths and variables, and makes everything easier if you want to use the defaults.
To ease the invocation of the program, a script named
analyze (no final "r") is provided. This script is able to locate default configuration files, define library search paths, and handle whether you want the client-server mode or the stand-alone version.
The sample main program is called with the command:
analyze [-f config-file] [options]
If -f config-file is not specified, a file named
analyzer.cfg is searched in the current working directory.
If -f config-file is specified but not found in the current directory, it will be searched for in the FreeLing installation directory, which is one of:
/usr/local/share/freeling/config if you installed from source
/usr/share/freeling/config if you used a binary
myfreeling/share/freeling/config if you used
--prefix=myfreeling option with ./configure.
Extra options may be specified on the command line to override any settings in
config-file. See section Valid Options.
The default mode will launch a stand-alone analyzer, which will load the configuration, read input from stdin, write results to stdout, and exit. E.g.:
analyze -f en.cfg <myinput >myoutput
When the input file ends, the analyzer will stop, and it will have to be launched again to process a new file.
If the --server and --port options are specified, a server will be launched which starts listening for incoming requests. E.g.:
analyze -f en.cfg --server --port 50005 &
Once the server is launched, clients can request analyses from the server, with:
analyzer_client 50005 <myinput >myoutput
analyzer_client localhost:50005 <myinput >myoutput
or, from a remote machine:
analyzer_client my.server.com:50005 <myinput >myoutput
analyzer_client 192.168.10.11:50005 <myinput >myoutput
The server will fork a new process to attend each new client, so you can have many clients being served at the same time.
You can control the maximum number of clients attended simultaneously (in order to prevent your server from being flooded) with the option
--workers. You can control the size of the queue of pending clients with option
--queue. Clients trying to connect when the queue is full will receive a connection error. See section Valid Options for details on these options.
If libboost_thread is installed, the installation process will build the program
threaded_analyzer. This program behaves like
analyzer, and accepts almost the same options.
threaded_analyzer launches each processor in a separate thread, so while one sentence is being parsed, the next is being tagged, and the following one is running through the morphological analyzer. In this way, the multi-core capabilities of the host are better exploited and the analyzer runs faster.
Although it is intended mainly as an example for developers wanting to build their own threaded applications, this program can also be used to analyze texts, in the same way as analyzer.
Nevertheless, notice that this example program does not include modules that are not token- or sentence-oriented, namely, language identification and coreference resolution.
Assuming we have the following input file
El gato come pescado. Pero a Don Jaime no le gustan los gatos.
we could issue the command:
analyze -f myconfig.cfg <mytext.txt >mytext.mrf
If myconfig.cfg is the file presented in section Sample Configuration File, the produced output would correspond to the
morfo output level (i.e. morphological analysis but no PoS tagging). The expected results are:
El el DA0MS0 1
gato gato NCMS000 1
come comer VMIP3S0 0.75 comer VMM02S0 0.25
pescado pescado NCMS000 0.833333 pescar VMP00SM 0.166667
. . Fp 1
Pero pero CC 0.99878 pero NCMS000 0.00121951 Pero NP00000 0.00121951
a a NCFS000 0.0054008 a SPS00 0.994599
Don_Jaime Don_Jaime NP00000 1
no no NCMS000 0.00231911 no RN 0.997681
le él PP3CSD00 1
gustan gustar VMIP3P0 1
los el DA0MP0 0.975719 lo NCMP000 0.00019425 él PP3MPA00 0.024087
gatos gato NCMP000 1
. . Fp 1
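Each line of this pseudo-column morfo format is a word followed by one or more (lemma, tag, probability) triples. As an illustration (not part of FreeLing), such lines can be consumed with a few lines of Python to keep the most probable analysis per token:

```python
# Sketch: parse FreeLing morfo pseudo-column lines ("word (lemma tag prob)+")
# and keep the most probable (lemma, tag) pair for each word.
def best_analysis(line):
    fields = line.split()
    word, rest = fields[0], fields[1:]
    # analyses come in (lemma, tag, probability) triples
    triples = [(rest[i], rest[i + 1], float(rest[i + 2]))
               for i in range(0, len(rest), 3)]
    lemma, tag, prob = max(triples, key=lambda t: t[2])
    return word, lemma, tag

sample = "come comer VMIP3S0 0.75 comer VMM02S0 0.25"
print(best_analysis(sample))   # → ('come', 'comer', 'VMIP3S0')
```

Here the indicative reading (probability 0.75) wins over the imperative one (0.25), which is exactly the choice a PoS tagger would make from lexical probabilities alone.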
If we also wanted PoS tagging, we could have issued the command:
analyze -f myconfig.cfg --outlv tagged <mytext.txt >mytext.tag
to obtain the tagged output:
El el DA0MS0
gato gato NCMS000
come comer VMIP3S0
pescado pescado NCMS000
. . Fp
Pero pero CC
a a SPS00
Don_Jaime Don_Jaime NP00000
no no RN
le él PP3CSD00
gustan gustar VMIP3P0
los el DA0MP0
gatos gato NCMP000
. . Fp
We can also ask for the senses of the tagged words:
analyze -f myconfig.cfg --outlv tagged --sense all <mytext.txt >mytext.sen
obtaining the output:
El el DA0MS0
gato gato NCMS000 01630731:07221232:01631653
come comer VMIP3S0 00794578:00793267
pescado pescado NCMS000 05810856:02006311
. . Fp
Pero pero CC
a a SPS00
Don_Jaime Don_Jaime NP00000
no no RN
le él PP3CSD00
gustan gustar VMIP3P0 01244897:01213391:01241953
los el DA0MP0
gatos gato NCMP000 01630731:07221232:01631653
. . Fp
Alternatively, if we don't want to repeat the first steps that we had already performed, we could use the output of the morphological analyzer as input to the tagger:
analyze -f myconfig.cfg --inplv morfo --outlv tagged <mytext.mrf >mytext.tag
See section Valid Options for details on which input and output levels and formats are valid.
Almost all options may be specified either in the configuration file or on the command line, with the latter taking precedence over the former.
Valid options are presented in section Valid options, both in their command-line and configuration file notations. Configuration files follow the usual linux standards. A sample file may be seen in section Sample Configuration File.
The FreeLing package includes default configuration files. They can be found in the directory
share/freeling/config under the FreeLing installation directory (
/usr/local if you installed from source, or
/usr if you used a binary
.deb package). The
analyze script will try to locate the configuration file in that directory if it is not found in the current working directory.
This section presents the options that can be given to the analyzer program (and thus, also to the analyzer_server program and to the analyze script). All options can be written in the configuration file as well as on the command line. The latter always has precedence over the former.
Prints to stdout a help screen with valid options and exits.
--help provides information about command line options.
--help-cf provides information about configuration file options.
Prints the version number of the currently installed FreeLing library.
Activate server mode. Requires that option
--port is also provided.
Default value is no (server mode is off by default).
Server Port Number
Specify port where server will be listening for requests. This option must be specified if server mode is active, and it is ignored if server mode is off.
Maximum Number of Server Workers
Specify maximum number of active workers that the server will launch. Each worker attends a client, so this is the maximum number of clients that are simultaneously attended. This option is ignored if server mode is off.
Default value is 5. Note that a high number of simultaneous workers will result in forking that many processes, which may overload the CPU and memory of your machine, resulting in a system collapse.
When the maximum number of workers is reached, new incoming requests are queued until a worker finishes.
Maximum Size of Server Queue
Specify maximum number of pending clients that the server socket can hold. This option is ignored if server mode is off.
Pending clients are requests waiting for a worker to be available. They are queued in the operating system socket queue.
Default value is 32. Note that the operating system has an internal limit for the socket queue size (e.g. modern linux kernels set it to 128). If the given value is higher than the operating system limit, it will be ignored.
When the pending queue is full, new incoming requests get a connection error.
Set the trace level (0 = no trace, higher values = more trace), for debugging purposes.
This will work only if the library was compiled with tracing information, using ./configure --enable-traces. Note that the code with tracing information is slower than the code compiled without it, even when traces are not active.
Specify modules to trace. Each module is identified with a hexadecimal flag. All flags may be OR-ed to specify the set of modules to be traced.
Valid masks are defined in file
src/include/freeling/morfo/traces.h, and are the following:
Machine Learning modules
Semantic graph extraction
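As a quick illustration of how the flags combine, the masks are simply OR-ed together. The mask values below are hypothetical placeholders, not the real values from traces.h:

```python
# Combining hexadecimal trace flags with bitwise OR.
# NOTE: these mask values are made up for illustration; use the real
# ones defined in src/include/freeling/morfo/traces.h.
TOKENIZER_MASK = 0x00000002   # hypothetical flag for the tokenizer
TAGGER_MASK    = 0x00000040   # hypothetical flag for the tagger

trace_module = TOKENIZER_MASK | TAGGER_MASK
print(hex(trace_module))   # → 0x42
```

The resulting value would then be given as TraceModule=0x42 (with the real masks substituted in).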
Language of input text
Code for language of input text. Though it is not required, the convention is to use two-letter ISO codes (as: Asturian, es: Spanish, ca: Catalan, en: English, cy: Welsh, it: Italian, gl: Galician, pt: Portuguese, ru: Russian, old-es: old Spanish, etc).
Other languages may be added to the library. See chapter Adding Support for New Languages for details.
Locale to be used to interpret both input text and data files. Usually, the value will match the locale of the
Lang option (e.g.
es_ES.utf8 for Spanish,
ca_ES.utf8 for Catalan, etc.). The special values
default (a predefined default locale) and
system (the currently active system locale) may also be used.
Splitter Buffer Flushing
When this option is inactive (most usual choice) sentence splitter buffers lines until a sentence marker is found. Then, it outputs a complete sentence.
When this option is active, the splitter never buffers any token, and considers each newline as a sentence end, thus processing each line as an independent sentence.
Input format in which to expect text to analyze.
Valid values are:
text: Plain text.
freeling: pseudo-column format produced by FreeLing with output level morfo or tagged.
conll: CoNLL-like column format.
Input CoNLL format definition file
Configuration file for input CoNLL format. Defines which columns --and in which order-- must be read. See section Input/Output Handling Modules for details on the file format.
This option is valid only when
InputFormat=conll. Otherwise, it is ignored.
Output format to produce with analysis results.
Valid values are:
freeling: Classical FreeLing format. It may be a pseudo-column format for output levels morfo or tagged, parenthesized trees for parsing output, or other human-readable output for coreference or semantic graph output.
conll: CoNLL-like column format.
xml: FreeLing-specific XML format.
json: JSON format
naf: XML format following NAF conventions (see https://github.com/newsreader/NAF)
train: Produce freeling pseudo-column format suitable to train PoS taggers. This option can be used to annotate a corpus, correct the output manually, and use it to retrain the taggers with the script src/utilities/train-tagger/bin/TRAIN.sh provided in FreeLing package. See src/utilities/train-tagger/README for details about how to use it.
Output CoNLL format definition file
Configuration file for output CoNLL format. Defines which columns --and in which order-- must be written. See section Input/Output Handling Modules for details on the file format.
This option is valid only when
OutputFormat=conll. Otherwise, it is ignored.
Analysis level of input data (plain, token, splitted, morfo, tagged, shallow, dep, coref).
plain: plain text.
token: tokenized text (one token per line).
splitted: tokenized and sentence-splitted text (one token per line, sentences separated with one blank line).
morfo: tokenized, sentence-splitted, and morphologically analyzed text. One token per line, sentences separated with one blank line. Each line has the format: word (lemma tag prob)+
tagged: tokenized, sentence-splitted, morphologically analyzed, and PoS-tagged text. One token per line, sentences separated with one blank line. Each line has the format: word lemma tag.
shallow: the previous plus constituency parsing. Only valid with InputFormat=conll.
dep: the previous plus dependency parsing (may include constituents or not. May include also SRL). Only valid with InputFormat=conll.
coref: the previous plus coreference. Only valid with InputFormat=conll.
Analysis level of output data (ident, token, splitted, morfo, tagged, shallow, dep, coref, semgraph).
ident: perform language identification instead of analysis.
token: tokenized text (one token per line).
splitted: tokenized and sentence-splitted text (one token per line, sentences separated with one blank line).
morfo: tokenized, sentence-splitted, and morphologically analyzed text. One token per line, sentences separated with one blank line.
tagged: tokenized, sentence-splitted, morphologically analyzed, and PoS-tagged text. One token per line, sentences separated with one blank line.
shallow: tokenized, sentence-splitted, morphologically analyzed, PoS-tagged, optionally sense-annotated, and shallow-parsed text, as produced by the chart parser.
parsed: tokenized, sentence-splitted, morphologically analyzed, PoS-tagged, optionally sense-annotated, and full-parsed text, as output by the first stage (tree completion) of the rule-based dependency parser.
dep: tokenized, sentence-splitted, morphologically analyzed, PoS-tagged, optionally sense-annotated, and dependency-parsed text, as output by the second stage (transformation to dependencies and function labelling) of the dependency parser. May include also SRL if the statistical parser is used (and SRL is available for the input language).
coref: the previous plus coreference.
semgraph: the previous plus semantic graph. Only valid with OutputFormat=xml|json|freeling.
Language Identification Configuration File
Configuration file for language identifier.
File of tokenization rules.
File of splitter rules.
Whether to perform affix analysis on unknown words. Affix analysis applies a set of affixation rules to the word to check whether it is a derived form of a known word.
Affixation Rules File
Affix rules file, used by dictionary module.
Whether or not to apply a file of customized word-tag mappings.
User Map File
User Map file to be used.
Whether to perform multiword detection. This option requires that a multiword file is provided.
Multiword definition file.
Whether to perform numerical expression detection. Deactivating this feature will affect the behaviour of the date/time and ratio/currency detection modules.
Specify the decimal-point character for the number detection module (for instance, in English it is a dot, but in Spanish it is a comma).
Specify the thousands-separator character for the number detection module (for instance, in English it is a comma, but in Spanish it is a dot).
Whether to assign PoS tag to punctuation signs.
Punctuation Detection File
Punctuation symbols file.
Whether to perform date and time expression detection.
Whether to perform currency amounts, physical magnitudes, and ratio detection.
Quantity Recognition File
Quantity recognition configuration file.
Whether to search word forms in dictionary. Deactivating this feature also deactivates AffixAnalysis option.
Whether to compute a lexical probability for each tag of each word. Deactivating this feature will affect the behaviour of the PoS tagger.
Lexical Probabilities File
Lexical probabilities file. The probabilities in this file are used to compute the most likely tag for a word, as well as to estimate the likely tags for unknown words.
Unknown Words Probability Threshold.
Threshold that must be reached by the probability of a tag given the suffix of an unknown word in order to be included in the list of possible tags for that word. Default is zero (all tags are included in the list). A non-zero value (e.g. 0.0001, 0.001) is recommended.
Named Entity Recognition
Whether to perform NE recognition.
Named Entity Recognizer File
Configuration data file for NE recognizer.
Named Entity Classification
Whether to perform NE classification.
Named Entity Classifier File
Configuration file for the Named Entity Classifier module.
Whether to add phonetic transcription to each word.
Phonetic Encoder File
Configuration file for the phonetic encoding module.
Kind of sense annotation to perform
no, none: Deactivate sense annotation.
all: annotate with all possible senses in sense dictionary.
mfs: annotate with most frequent sense.
ukb: annotate all senses, ranked by UKB algorithm.
Whether to perform sense annotation.
If active, the PoS tag selected by the tagger for each word is enriched with a list of all its possible WN synsets. The sense repository used depends on the options "Sense Annotation Configuration File" and "UKB Word Sense Disambiguator Configuration File" described below.
Sense Annotation Configuration File
Word sense annotator configuration file.
UKB Word Sense Disambiguator Configuration File
UKB configuration file.
Algorithm to use for PoS tagging
hmm: Hidden Markov Model tagger, based on [Bra00].
relax: Relaxation Labelling tagger, based on [Pad98].
HMM Tagger configuration File
Parameters file for HMM tagger.
Relaxation labelling tagger constraints file
File containing the constraints to apply to solve the PoS tagging.
Relaxation labelling tagger iteration limit
Maximum number of iterations to perform in case relaxation does not converge.
Relaxation labelling tagger scale factor
Scale factor to normalize supports inside the RL algorithm. It is comparable to the step length in a hill-climbing algorithm: the larger the scale factor, the smaller the step.
Relaxation labelling tagger epsilon value
Real value used to determine when a relaxation labelling iteration has produced no significant changes. The algorithm stops when no weight has changed above the specified epsilon.
Retokenize contractions in dictionary
Specifies whether the dictionary must retokenize contractions when found, or leave the decision to the tagger.
Note that if this option is active, contractions will be retokenized even if the
TaggerRetokenize option is not active. If this option is not active, contractions will be retokenized depending on the value of the TaggerRetokenize option.
Retokenize after tagging
Determine whether the tagger must perform retokenization after the appropriate analysis has been selected for each word. This is closely related to affix analysis and PoS taggers.
Force the selection of one unique tag
Determine whether, and when, the tagger must be forced to make a unique (possibly random) choice.
none: Do not force the tagger, allow ambiguous output.
tagger: Force the tagger to choose before retokenization (i.e. if retokenization introduces any ambiguity, it will be present in the final output).
retok: Force the tagger to choose after retokenization (no remaining ambiguity)
Chart Parser Grammar File
This file contains a CFG grammar for the chart parser, and some directives to control which chart edges are selected to build the final tree.
Dependency Parser Rule File
Rules to be used to perform rule-based dependency analysis.
Statistical Dependency Parser File
Configuration file for statistical dependency parser and SRL module
Dependency Parser Selection
Which dependency parser to use. Valid values are:
txala (rule-based).
treeler (statistical; may also perform SRL).
Coreference Resolution File
Configuration file for coreference resolution module.
A sample configuration file follows. You can start using FreeLing with the default configuration files, which are installed at
/usr/local/share/freeling/config (note that the prefix
/usr/local may differ if you specified an alternative location when installing FreeLing). If you installed from a binary
.deb package, it will be at /usr/share/freeling/config.
You can use those files as a starting point to customize one configuration file to suit your needs.
Note that file paths in the sample configuration file contain
$FREELINGSHARE, which is supposed to be an environment variable. If this variable is not defined, the analyzer will abort, complaining about not finding the files.
If you use the
analyze script, it will define the variable for you as
/usr/local/share/freeling (or the right installation path), unless you define it to point somewhere else.
You can also adjust your configuration files to use normal paths for the files (either relative or absolute) instead of using variables.
###### default configuration file for Spanish analyzer
#####

## General options
Lang=es
Locale=default

### Tagset description file, used by different modules
TagsetFile=$FREELINGSHARE/es/tagset.dat

## Traces (deactivated)
TraceLevel=0
TraceModule=0x0000

## Options to control the applied modules. The input may be partially
## processed, or a full analysis may not be wanted. The specific
## formats are a choice of the main program using the library, as well
## as the responsibility of calling only the required modules.
InputLevel=text
OutputLevel=morfo

# Do not consider each newline as a sentence end
AlwaysFlush=no

#### Tokenizer options
TokenizerFile=$FREELINGSHARE/es/tokenizer.dat

#### Splitter options
SplitterFile=$FREELINGSHARE/es/splitter.dat

#### Morfo options
AffixAnalysis=yes
CompoundAnalysis=yes
MultiwordsDetection=yes
NumbersDetection=yes
PunctuationDetection=yes
DatesDetection=yes
QuantitiesDetection=yes
DictionarySearch=yes
ProbabilityAssignment=yes
DecimalPoint=,
ThousandPoint=.
LocutionsFile=$FREELINGSHARE/es/locucions.dat
QuantitiesFile=$FREELINGSHARE/es/quantities.dat
AffixFile=$FREELINGSHARE/es/afixos.dat
CompoundFile=$FREELINGSHARE/es/compounds.dat
ProbabilityFile=$FREELINGSHARE/es/probabilitats.dat
DictionaryFile=$FREELINGSHARE/es/dicc.src
PunctuationFile=$FREELINGSHARE/common/punct.dat
ProbabilityThreshold=0.001

# NER options
NERecognition=yes
NPDataFile=$FREELINGSHARE/es/np.dat
## comment line above and uncomment one of those below, if you want
## a better NE recognizer (higher accuracy, lower speed)
#NPDataFile=$FREELINGSHARE/es/nerc/ner/ner-ab-poor1.dat
#NPDataFile=$FREELINGSHARE/es/nerc/ner/ner-ab-rich.dat
# The "rich" model is trained with a rich gazetteer. It offers higher accuracy,
# but requires adapting gazetteer files to have high coverage on the target corpus.
# The "poor1" model is trained with a poor gazetteer. Its accuracy is slightly
# lower, but it suffers only a small accuracy loss when the gazetteer has low
# coverage on the target corpus. If in doubt, use the "poor1" model.

## Phonetic encoding of words.
Phonetics=no
PhoneticsFile=$FREELINGSHARE/es/phonetics.dat

## NEC options. See README in common/nec
NEClassification=no
NECFile=$FREELINGSHARE/es/nerc/nec/nec-ab-poor1.dat
#NECFile=$FREELINGSHARE/es/nerc/nec/nec-ab-rich.dat

## Sense annotation options (none,all,mfs,ukb)
SenseAnnotation=none
SenseConfigFile=$FREELINGSHARE/es/senses.dat
UKBConfigFile=$FREELINGSHARE/es/ukb.dat

#### Tagger options
Tagger=hmm
TaggerHMMFile=$FREELINGSHARE/es/tagger.dat
TaggerRelaxFile=$FREELINGSHARE/es/constr_gram-B.dat
TaggerRelaxMaxIter=500
TaggerRelaxScaleFactor=670.0
TaggerRelaxEpsilon=0.001
TaggerRetokenize=yes
TaggerForceSelect=tagger

#### Parser options
GrammarFile=$FREELINGSHARE/es/chunker/grammar-chunk.dat

#### Dependency Parser options
DependencyParser=txala
DepTxalaFile=$FREELINGSHARE/es/dep_txala/dependences.dat
DepTreelerFile=$FREELINGSHARE/es/dep_treeler/dependences.dat

#### Coreference Solver options
CorefFile=$FREELINGSHARE/es/coref/relaxcor/relaxcor.dat
SemGraphExtractorFile=$FREELINGSHARE/es/semgraph/semgraph-SRL.dat