[flymine-dev] Large file loading bottleneck

SG Edwards flymine-dev at flymine.org
Fri, 04 Jan 2008 20:23:13 +0000


Hey guys,

Hope you had a good Christmas/New Year!

I'm still having trouble loading very large data files. Above a
certain size they seem to hit a bottleneck, such that the load
takes disproportionately longer than you would expect given the load
times for smaller files.

You have suggested previously changing the ANT_OPTS settings to:

export ANT_OPTS="-server -XX:MaxPermSize=256M -Xmx1800m \
-XX:+UseParallelGC -Xms1800m -XX:SoftRefLRUPolicyMSPerMB=1 \
-XX:MaxHeapFreeRatio=99"

However, this actually made the loads approx 2-4 times slower than
with the original setting of:

export ANT_OPTS="-server -XX:MaxPermSize=256M -Xmx1800m"

Once the running job (e.g. ant -v -Dsource=sentences_con_abs
integrate) reaches a Java memory usage of 1.1 GB or above, there is a
distinct increase in load time (most of the large data sources are
custom ones of the type intermine-items-large-xml-file). I suspect this
has as much to do with the postgres settings as anything. I can get
hold of the postgres settings if you need them (which ones?).

My question is: is it possible to split a large file into two
custom files with distinct information in each, i.e. sentence IDs
1 -> 1,000,000 in file one and 1,000,001 -> 2,000,000 in file two,
and load each file separately? Kim had mentioned previously that it
isn't a good idea to reload a source that has already been loaded,
but would it be possible if the data is distinct (i.e. it should
avoid the error: object already loaded from this source)?
Would this affect indexing or some other integration process?
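
As a rough illustration of the kind of split described above, here is a
minimal Python sketch. It is only a sketch under assumptions: it assumes
the items XML file has a single <items> root whose direct children are
<item> elements, and the function name split_items_xml and the output
file naming are made up for this example. It splits by item count rather
than by sentence ID, and it does nothing about items in one file that
refer to items in the other.

```python
import xml.etree.ElementTree as ET


def split_items_xml(src_path, out_prefix, items_per_file):
    """Stream <item> elements out of src_path into numbered chunk files.

    Hypothetical helper: assumes every <item> is a direct child of a
    single <items> root element. Returns the list of files written.
    """
    out_paths = []
    out = None
    written = 0  # items written to the currently open chunk

    # iterparse reads the file incrementally instead of loading it whole.
    context = ET.iterparse(src_path, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element

    for event, elem in context:
        if event != "end" or elem.tag != "item":
            continue
        if out is None or written == items_per_file:
            # Close the previous chunk (if any) and start a new one.
            if out is not None:
                out.write("</items>\n")
                out.close()
            path = f"{out_prefix}_{len(out_paths) + 1}.xml"
            out = open(path, "w", encoding="utf-8")
            out.write('<?xml version="1.0" encoding="UTF-8"?>\n<items>\n')
            out_paths.append(path)
            written = 0
        out.write(ET.tostring(elem, encoding="unicode"))
        written += 1
        root.clear()  # drop already-written items so memory use stays flat

    if out is not None:
        out.write("</items>\n")
        out.close()
    return out_paths
```

Something like split_items_xml("sentences.xml", "sentences_part", 1000000)
would write sentences_part_1.xml, sentences_part_2.xml and so on (the
input file name here is also just an example). Whether the resulting
files can then be loaded as separate sources without hitting the
"object already loaded" error is exactly the open question.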

Cheers,

Stephen

p.s. I'm still using version 8.2 at the minute as I want to stay
compatible with the older style interface.

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.