[flymine-dev] Large file loading bottleneck

Richard Smith flymine-dev at flymine.org
Tue, 08 Jan 2008 11:25:02 +0000


Hi Stephen,
It could be a memory issue.  Are you running Java and postgres on the
same machine?  How much RAM does the machine have?  Are there other
processes running on it?  How much RAM/CPU is postgres using?
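
If it helps to collect those numbers, here is a rough sketch rather than
anything InterMine-specific.  It assumes a Linux machine with the procps
`ps` command, and that the processes show up under the command names
"java" and "postgres" (on some installs the server appears as
"postmaster").  The function names are made up for illustration.

```python
import subprocess
from collections import defaultdict


def summarise_ps(ps_output):
    """Sum resident memory (MB) and %CPU per command name.

    Expects the text printed by `ps -eo rss,pcpu,comm`, header included.
    Note: postgres backends share memory, so summing RSS across them
    overstates the real total; treat the figure as an upper bound.
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # command -> [rss_mb, cpu_percent]
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # ignore blank or malformed rows
        rss_kb, pcpu, comm = parts
        entry = totals[comm.strip()]
        entry[0] += int(rss_kb) / 1024.0  # ps reports RSS in kilobytes
        entry[1] += float(pcpu)
    return dict(totals)


def current_usage(names=("java", "postgres")):
    """Run ps once and return {name: [rss_mb, cpu_percent]} for `names`."""
    out = subprocess.run(
        ["ps", "-eo", "rss,pcpu,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    usage = summarise_ps(out)
    return {name: usage.get(name, [0.0, 0.0]) for name in names}
```

Calling current_usage() a few times during the load (before and after the
job passes the point where it slows down) should show whether the machine
is running short of RAM.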

If you can let us have the intermine.log from the load then we may be
able to spot something.

You could try running the data load in the 10.0 branch and see if it is
better.  I think most performance improvements were completed by 8.2
but we do keep making improvements.

Cheers,
Richard.



SG Edwards wrote:
> Hey guys,
> 
> Hope you had a good Christmas/New Year!
> 
> I'm still having trouble loading data files above a certain size. 
> They seem to hit a bottleneck: the load takes disproportionately 
> longer than you would expect from the load times for smaller 
> files.
> 
> You have suggested previously changing the ANT_OPTS settings to:
> 
> export ANT_OPTS="-server -XX:MaxPermSize=256M -Xmx1800m 
> -XX:+UseParallelGC -Xms1800m -XX:SoftRefLRUPolicyMSPerMB=1 
> -XX:MaxHeapFreeRatio=99"
> 
> However, this actually made the loads approx 2-4 times slower than with 
> the original setting of:
> 
> export ANT_OPTS="-server -XX:MaxPermSize=256M -Xmx1800m"
> 
> Once the running job (e.g. ant -v -Dsource=sentences_con_abs integrate) 
> reaches a Java memory footprint of 1.1 GB or more, load time increases 
> markedly (most of the large data sources are custom ones of the type 
> intermine-items-large-xml-file). I suspect this is due to the postgres 
> settings as much as anything. I can get hold of the postgres settings 
> if you need them (which ones?).
> 
> My question is: is it possible to split a large file into two custom 
> files with distinct information in each (i.e. sentence IDs 1 to 
> 1,000,000 in file one and 1,000,001 to 2,000,000 in file two) and load 
> each file separately? Kim mentioned previously that it wasn't a good 
> idea to reload a source that had already been loaded, but would it be 
> possible if the data is distinct (i.e. it should avoid the error: 
> object already loaded from this source)? Would this affect indexing or 
> some other integration process?
> 
> Cheers,
> 
> Stephen
> 
> p.s. I'm still using version 8.2 at the minute as I want to stay 
> compatible with the older style interface.
>