The problem I was faced with is simple: for work I have collected a lot of scientific articles. For most of these articles I have a BibTeX file that contains the bibliographic information in a structured, well-defined format. For some papers I also have the full text available in PDF format. It would be convenient to have a single tool with which I can search through all this information to quickly locate articles based on a range of criteria. Since I also share these files between several machines using Subversion (I am on the road a lot, and sharing between laptop and fixed box was the main reason for setting this up), the search tool also has to be portable. A big database-driven web application is probably not such a good idea in my case.
The tool I decided to work with is Swish-e, which is probably best described as a fast, flexible, and free open source system for indexing just about anything. My first attempt with this tool was to just index everything as is, with the following config file:
# Index these directories and any subdirectories
IndexDir ./PDF
IndexDir ./BIB
# Convert pdf files to text before indexing
FileFilter .pdf pdftotext "'%p' -"
IndexContents TXT .pdf
IndexContents TXT .bib
In this setup the PDF files are simply converted to text and then indexed. Indexing means: find all words in the text, remove stopwords like `the', `a' and so on, and store the index in a form that can be searched efficiently. This worked rather well (and fast too!). Basically, you can only search for keywords, without being able to specify where these keywords are to occur. On many occasions this is enough. However, I wanted to be able to specify things such as:
Which papers are written by John in year 2001 and have the words foo and bar in the body
... which would be impossible in this scheme. Searching with the keywords 'John 2001 foo bar' would just look for documents containing these keywords, which is not what I want. Luckily, swish-e supports the notion of metanames. Basically this means that if you feed it XML-like data then you can use the names of the XML tags to specify where certain words must occur. In other words, if you offer it data in a structured format such as:
<bibfile>
<bibtex>
<author> ... </author>
<title> ... </title>
</bibtex>
<content>
...
...
</content>
</bibfile>
... then later you can use the tags as metanames to guide your search. For example, the above query would be something like:
author=john and year=2001 and content=(foo and bar)
... which is a pretty elegant way of querying scientific papers. The trick to achieve this with swish-e is to create a script that processes the BibTeX data and PDF files and generates a structure as displayed above. Some headers have to be added to make it all work. More specifically, it needs a header specifying the name of the entry (for example, the BibTeX key) and the exact length in bytes of the XML data.
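To make the structure and headers concrete, here is a minimal Python sketch. It is not the actual script: the function names `bib_to_xml` and `swish_record` are hypothetical, the field parsing only handles simple one-line `name = value` fields, and escaping `<`, `>` and `&` is my own precaution rather than something the original output shows.

```python
import re
from xml.sax.saxutils import escape

# Matches simple one-line BibTeX fields such as:  year = {2000},
# (assumption: multi-line field values are not handled in this sketch)
FIELD_RE = re.compile(r'^\s*(\w+)\s*=\s*(.+?)\s*$')


def bib_to_xml(bib_text, content=""):
    """Wrap each BibTeX field in a tag named after the field."""
    lines = ["<bibfile>", "<bibtex>"]
    for line in bib_text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            name, value = match.group(1).lower(), match.group(2)
            lines.append("<%s> %s</%s>" % (name, escape(value), name))
    lines += ["</bibtex>", "<content>", escape(content), "</content>",
              "</bibfile>"]
    return "\n".join(lines) + "\n"


def swish_record(path_name, xml):
    """Prefix one document with the Path-Name and Content-Length headers.

    Content-Length counts bytes, not characters, so the XML is encoded
    first. A blank line separates the headers from the document.
    """
    body = xml.encode("utf-8")
    header = "Path-Name: %s\nContent-Length: %d\n\n" % (path_name, len(body))
    return header.encode("utf-8") + body
```

Because the result is a byte string, it would be written with something like `sys.stdout.buffer.write(swish_record(path, xml))`, so that the declared length matches what swish-e actually reads.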
To make things easier for me later on I restructured my directory layout a little. I put the BibTeX files in one directory and the PDF files in another. Since I don't have the PDF files for every article, I will create a script that loops over the entries in the BibTeX files, checks if a PDF file is available and then generates the required XML structure. Taking into account that I will have to create a swish-e config file and a script, the total directory structure I chose is:
.../
.../swish.conf   # config file for swish
.../script.py    # the script
.../BIB/         # dir with .bib files for each article
.../PDF/         # pdf files of articles
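The loop over this layout could look roughly like the sketch below. It assumes that a PDF shares its base name with the corresponding .bib file (for example `foo.bib` and `foo.pdf`), which is my assumption and not something stated here. The names `find_articles` and `pdf_to_text` are hypothetical.

```python
import subprocess
from pathlib import Path


def find_articles(bib_dirs, pdf_dirs):
    """Yield (bib_path, pdf_path_or_None) for every .bib file found."""
    for bib_dir in bib_dirs:
        for bib_path in sorted(Path(bib_dir).glob("*.bib")):
            pdf_path = None
            # Assumption: the PDF has the same base name as the .bib file.
            for pdf_dir in pdf_dirs:
                candidate = Path(pdf_dir) / (bib_path.stem + ".pdf")
                if candidate.is_file():
                    pdf_path = candidate
                    break
            yield bib_path, pdf_path


def pdf_to_text(pdf_path):
    """Run pdftotext; the '-' argument sends the text to stdout."""
    result = subprocess.run(["pdftotext", str(pdf_path), "-"],
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Both functions take lists of directories, so a call such as `find_articles(["./BIB"], ["./PDF"])` covers the layout above while leaving room for additional directories.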
Different people may have more than one BibTeX and PDF directory, so I'll have to take this into account when writing a script that generates the XML structure. The second step was to create the script that generates a structure like the above. For PDF-to-text conversion I simply use the pdftotext tool. Sample output is shown below:
Path-Name: ./BIB-other/2000-Cruse-MeaningLanguage.bib
Content-Length: 366
<bibfile>
<bibtex>
<title> {Meaning in Language, an Introduction to Semantics and Pragmatics},</title>
<author> {Cruse, A.},</author>
<isbn> {ISBN 0198700105},</isbn>
<publisher> {Oxford University Press},</publisher>
<address> {Oxford, United Kingdom, EU},</address>
<year> {2000}</year>
</bibtex>
<content>
</content>
</bibfile>
The above example doesn't include the converted PDF contents since I do not have a PDF copy of this book. It does show the required headers for path-name and content-length. With the script in place, the third step is to create a config file in which we tell swish-e which metanames to look for. Without this it will consider anything that has a < and > in it to be a keyword, which is not desirable since pdftotext generates a lot of these when it runs into mathematics and such. Specifying the keywords in the config file turned out to be straightforward:
MetaNames bibtex content author booktitle code editor howpublished \
    institution isbn journal note number organization publisher school series \
    title url year
The only thing that has to be done before we can actually start searching is to tell swish-e to create an index. The command to do this is:
./bib2xml.py | swish-e -c swish.conf -S prog -i stdin
Beware, this may take a while. On my machine it took about 3 minutes to index roughly 400 BibTeX files and 250 PDF files. Luckily, searching is really fast after that. For example:
$ swish-e -w 'author=gils year=2005 content=(aptness and vimes)'
# SWISH format: 2.4.3
# Search words: author=gils year=2005 content=(aptness and vimes)
# Removed stopwords:
# Number of hits: 1
# Search time: 0.002 seconds
# Run time: 0.027 seconds
1000 ./BIB/2005-Schabell-VimesBroker.bib "{Implementing Vimes -- the broker component}," 30675
.
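If you want to use the results from another script, the output above is easy to pick apart. The sketch below assumes the result-line layout shown in the sample (rank, path, quoted title, size) and skips the `#` header lines and the closing `.` line. The name `parse_results` is hypothetical.

```python
import re

# One result line: rank, path, quoted title, size in bytes
RESULT_RE = re.compile(r'^(\d+) (.+?) "(.*)" (\d+)$')


def parse_results(output):
    """Turn swish-e search output into a list of dictionaries."""
    hits = []
    for line in output.splitlines():
        # Header lines start with '#', and a single '.' ends the results.
        if line.startswith("#") or line.strip() == ".":
            continue
        match = RESULT_RE.match(line)
        if match:
            rank, path, title, size = match.groups()
            hits.append({"rank": int(rank), "path": path,
                         "title": title, "size": int(size)})
    return hits
```

The text to parse could come from running the search command with `subprocess.run([...], capture_output=True, text=True)` and passing its `stdout` to `parse_results`.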
Well, that's all there is to it. You can now search using the keywords that are specified (see the config file). I hope this helps someone.