Forums > Non-*NIX Forums > Programming
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.


Old 02-05-2009, 08:22 PM   #1
Senior Member
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
efficient occurrence counting or alternative lexical analysis

I have a database of textual scientific information to analyze. Essentially, I'll be working with a set of text labels (compound scientific nouns), each associated with one or more scientific articles. I need to establish relative conceptual similarity between all terms occurring in the labels. As a start, my plan is to make a list of all terms occurring in labels, then count those terms in every article in the database. Any two terms occurring together in one or more articles will be assigned a correlation value (algorithm to be determined).

I'm trying to decide on an efficient way to count the word occurrences. One idea was to create an AVL tree of words for each article (with an occurrence count for each word), then extract the count for each term in the master list. This would exclude simple words such as "the", etc. Another idea is to create an AVL tree of the terms to search for, then match the words in each article against it to establish the occurrence counts.

At worst, I'll have ~60k unique terms to search for (with absolutely no correlations); at best, maybe a few thousand (with many strong correlations). The number of articles should be less than 30k. The resulting output should be an NxN matrix of correlation data for use with an LSA algorithm. The objective is to be able to take any two labels and assign a numerical correlation value to the pair. Because they're scientific terms originating from numerous countries, however, no reference exists to correlate them with each other. This is actually a pre-processing step for analysis of spatial data. Right now I'm working on an interim solution to get the rest of the system working; the lexical analysis will then be incrementally improved.

I'm trying to write this in C or C++, but I'm open to different solutions. Because this is just pre-processing to develop a reference table, it doesn't have to be compatible with the rest of the analysis code (but it would be nice.) I have to parse the actual data from XML and I plan to store it in an SQL database for processing. Thanks.

Last edited by ta0kira; 02-05-2009 at 08:28 PM.
Old 02-05-2009, 09:31 PM   #2
Registered: Sep 2007
Location: Mariposa
Distribution: FreeBSD,Debian wheezy
Posts: 811

Rep: Reputation: 178
Two thoughts.

First, if it's important to reduce the computation time in chopping up the articles into words, investigate the use of lex and yacc. (That would be flex and bison in the Linux world.)
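A word-splitter in flex is only a few lines. This is a sketch under assumptions (the lowercase normalization rule and the file name are illustrative); flex generates the C scanner from it:

```lex
%{
/* Minimal flex word-splitter: prints each word, lowercased, one per
   line.  Generate and build with:  flex words.l && cc lex.yy.c -o words
   ("words.l" is just an example file name.) */
#include <ctype.h>
#include <stdio.h>
%}
%option noyywrap
%%
[A-Za-z]+   {
                for (char *p = yytext; *p; ++p)
                    putchar(tolower((unsigned char)*p));
                putchar('\n');
            }
.|\n        ;   /* discard everything else */
%%
int main(void) { return yylex(); }
```

The generated scanner uses a DFA, so tokenizing is linear in the input with no per-word allocation, which is where the computation-time savings would come from.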

Second, if you're tempted to use AVL trees, you can avoid a whole bunch of coding by using libavl (overview/source, documentation). I use that library in a new project every few years. It's an old friend. It's flexible enough that putting in an occurrence count would be a snap.
Old 02-05-2009, 09:45 PM   #3
Senior Member
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Original Poster
Rep: Reputation: Disabled
Unfortunately the computation time will probably have little to do with the parsing itself. I'm still not quite sure what format I'll be able to access the articles in. Right now I know the database has links to the articles and (I think) the abstracts in text, so I might just parse the abstracts. In any case, I have to parse the pseudo-XML data to create the database, so the parser will be essentially complete already. I think I'll use my own parsing library for that (not so much a parser as a tree assembler, modifier, and exporter; I should rename it.)

The intensive part will be analysis after the term cross-references are tabulated. I'll be happy with an O(n^3) network analysis algorithm, and with potentially 30k nodes, the optimization will come from localizing smaller networks. We really haven't decided what our exact objective is, but we do know that a lexical correlation table is the foundation for everything else.

I'll take a look at that AVL implementation. I actually wrote one a few months ago (although in Java), but because this is research and my job isn't to reinvent the wheel, I'm looking for every preexisting (open-source) resource I can find. Thanks.
Old 02-06-2009, 12:50 AM   #4
LQ Guru
Registered: Aug 2004
Location: Sydney
Distribution: Centos 7.7 (?), Centos 8.1
Posts: 18,059

Rep: Reputation: 2663
That definitely sounds like a job for Perl. I've seen similar questions asked on PerlMonks, and there are almost certainly modules on CPAN to do some or all of what you want.
Might be worth asking the monks.
Old 02-06-2009, 02:05 PM   #5
Senior Member
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Original Poster
Rep: Reputation: Disabled
Thanks. I'll take a look (and see about learning Perl while I'm at it).

