GROBID
1. About
This document specifies how to train and use GROBID.
2. Prerequisites
GROBID uses Maven as its build system. To install it on Debian/Ubuntu systems we just have to type:
$ sudo apt-get install maven
Note that this also installs Java, the language GROBID is written in. Similar commands apply to other distributions. On OS X, with Homebrew:
$ brew install maven
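Before moving on, it can help to confirm that the required tools are actually on the PATH. The following is a minimal sketch, not part of GROBID: `require_tools` is a hypothetical helper name, and it only checks that each command can be found, not which version is installed.

```shell
# Hypothetical helper: report every tool that cannot be found on the PATH.
# Returns 0 if all tools are present, 1 otherwise.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the steps in this document, a call such as `require_tools java mvn git` should return without printing anything.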
3. Quick start
To install GROBID we first need to clone its code:
$ git clone https://github.com/inspirehep/grobid
Note that we are fetching it from our fork instead of the main repository
because our HEP training data has not yet been merged into it.
Now we move into its grobid-service folder and start the service:
$ cd grobid/grobid-service
$ mvn jetty:run-war
This will run the tests, load the modules and start a service available at localhost:8080.
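Since the build and tests can take a while, one way to know when the service is ready is to poll it. The sketch below assumes that curl is installed and that the service answers plain HTTP requests at the address above; `wait_for_service` is a hypothetical helper, not something shipped with GROBID.

```shell
# Hypothetical helper: poll a URL until something answers or attempts run out.
# Usage: wait_for_service URL [ATTEMPTS] [DELAY_SECONDS]
# Any HTTP response (even an error page) counts as "up"; only connection
# failures and timeouts are retried.
wait_for_service() {
    url=$1
    attempts=${2:-30}
    delay=${3:-2}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if curl --silent --output /dev/null --max-time 5 "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

From a second terminal, `wait_for_service http://localhost:8080/` returns 0 once the service accepts connections, and 1 if it never does within the allotted attempts.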
4. Training
The models available after cloning were not trained on the newly available training data. To regenerate them we need to go to the root folder and call:
$ cd grobid
$ java -Xmx1024m -jar grobid-trainer/target/grobid-trainer-0.3.4-SNAPSHOT.one-jar.jar 0 $MODEL -gH grobid-home
where $MODEL is the model we want to train. Note that there is new data only for the segmentation and header models.
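Because only two models have new data, both can be retrained in one go. The sketch below reuses the jar path and flags from the command above; `train_models`, `TRAINER_JAR` and the `DRY_RUN` switch are hypothetical names introduced for illustration only.

```shell
# Jar path copied from the training command above.
TRAINER_JAR=grobid-trainer/target/grobid-trainer-0.3.4-SNAPSHOT.one-jar.jar

# Hypothetical helper: train every model given as an argument (mode 0).
# With DRY_RUN=1 the commands are printed instead of executed.
train_models() {
    for model in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo java -Xmx1024m -jar "$TRAINER_JAR" 0 "$model" -gH grobid-home
        else
            # Stop at the first model whose training fails.
            java -Xmx1024m -jar "$TRAINER_JAR" 0 "$model" -gH grobid-home || return 1
        fi
    done
}
```

Run from the root folder, `DRY_RUN=1 train_models segmentation header` shows the two commands that would be executed; dropping `DRY_RUN=1` runs them.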
Moreover, note that the 0 parameter instructs GROBID to only train the models. A value of 1 will only evaluate the trained model on a random subset of the data, while a value of 2 splits the data, trains on one part and evaluates on the rest, which requires an additional parameter:
$ java -Xmx1024m -jar grobid-trainer/target/grobid-trainer-0.3.4-SNAPSHOT.one-jar.jar 2 $MODEL -gH grobid-home -s$SPLIT
where $SPLIT is a float between 0 and 1 that represents the ratio of data to be used for training.
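A mistyped ratio is easy to catch before launching a long training run. The following is a minimal sketch of such a check; `valid_split` is a hypothetical helper, and treating the bounds as exclusive is an assumption made here (a ratio of exactly 0 or 1 would leave one side of the split empty), not something GROBID is known to enforce.

```shell
# Hypothetical helper: succeed only if $1 is a plain decimal number
# strictly between 0 and 1 (exclusive bounds are an assumption).
valid_split() {
    case $1 in
        ''|*[!0-9.]*|*.*.*) return 1 ;;   # empty, non-numeric, or two dots
    esac
    awk -v s="$1" 'BEGIN { exit !(s + 0 > 0 && s + 0 < 1) }'
}
```

For example, `valid_split "$SPLIT" || echo "bad split: $SPLIT" >&2` warns about values such as `80` or `0,8` before the trainer is started with mode 2.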