[flymine-dev] Large file loading bottleneck
SG Edwards
flymine-dev at flymine.org
Tue, 08 Jan 2008 14:24:18 +0000
Hi Richard,
OK, there are two separate issues here:
1. Up until a couple of weeks ago the loads were going fine, except
that they were very slow for large sources.
2. Recently, the loads for medium or large sources have slowed right
down, e.g. UniProt used to take 50 mins to load on our machines but now
takes 105 mins. Small sources load in exactly the same time as previously.
It would seem that something has changed to cause the second
problem. As far as we can tell there have been no recent changes to the
java, ant or postgres configs; the only change is that a small
tomcat application (SyMBA) has been deployed. However, we don't think
this is the cause, as it doesn't appear to be taking any
cpu time at all.
> It could be a memory issue. Are you running the java and postgres on
> the same machine? How much RAM does the machine have? And are there
> other processes on the machine? How much RAM/cpu is postgres using?
I'm pretty sure it is a memory issue (with regard to problem 2), as the
loads were running fine previously. Java, ant and tomcat are running
on a separate server from postgres. Both machines are pretty reasonable
spec (4GB RAM). Occasionally people run other processes on the
machines, but by and large nothing else is doing much at
all. Here's the 'top' output during a load of the UniProt source. This
is a snapshot after about two hours (real time) of running, with no other
significant processes running:
top - 14:10:02 up 47 days, 1:32, 3 users, load average: 0.49, 0.50, 0.39
Tasks: 168 total, 2 running, 166 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.1%us, 0.7%sy, 0.0%ni, 94.6%id, 0.5%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 4151220k total, 3979884k used, 171336k free, 24360k buffers
Swap: 8193140k total, 40100k used, 8153040k free, 1386996k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
15017 s0460205  18   0 2298m 1.9g  46m R   70 48.2  49:03.24 java
...and on the postgres server (postgres processes only)....
top - 14:09:25 up 320 days, 21:14, 1 user, load average: 1.42, 1.31, 1.21
Tasks: 130 total, 1 running, 128 sleeping, 1 stopped, 0 zombie
Cpu(s): 0.4%us, 0.2%sy, 0.0%ni, 99.3%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 4151220k total, 4013228k used, 137992k free, 47696k buffers
Swap: 20482864k total, 1607664k used, 18875200k free, 3128772k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
31670 postgres  17   0 97200 5484 5096 S    0  0.1   0:00.34 postmaster
31671 postgres  15   0 10852  624  244 S    0  0.0   0:00.35 postmaster
31673 postgres  15   0 97432  80m  79m S    0  2.0   0:52.26 postmaster
31674 postgres  15   0 11852 1584  196 S    0  0.0   0:01.84 postmaster
31675 postgres  15   0 12296 1768  276 S    0  0.0   0:21.52 postmaster
 2786 postgres  15   0 98356  82m  81m S    0  2.0   0:10.50 postmaster
 2789 postgres  15   0 98064  82m  81m S    0  2.0   3:05.32 postmaster
 2801 postgres  15   0 98236  81m  79m S    0  2.0   0:15.48 postmaster
 2807 postgres  15   0 98308  82m  80m S    0  2.0   1:51.76 postmaster
 2808 postgres  16   0 98012  74m  73m S    0  1.8   0:09.79 postmaster
 2815 postgres  15   0 98236 6740 5584 S    0  0.2   0:00.02 postmaster
 2816 postgres  15   0 99136  83m  81m S    0  2.1   4:02.89 postmaster
 2817 postgres  15   0 97916 3848 2976 S    0  0.1   0:00.03 postmaster
 2818 postgres  15   0 98052  81m  80m S    0  2.0   2:04.06 postmaster
 2819 postgres  15   0 97656 2212 1496 S    0  0.1   0:00.01 postmaster
 2820 postgres  16   0 98932  83m  81m S    0  2.0  79:42.95 postmaster
 2821 postgres  15   0 98236 4964 3788 S    0  0.1   0:00.01 postmaster
 2822 postgres  15   0 98008 8508 7420 S    0  0.2   0:00.03 postmaster
That seems to be an awful lot of postmaster processes running; is that
normal?! There are five postmaster processes running even when there is no
activity on the postgres database. I think there should only be
one(?). There also appears to be only a tiny amount of free memory
on both machines (~150MB out of 4GB), but I think this is normal
behaviour, where the OS frees up memory as required?
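As a rough sanity check on how much memory those postmaster processes really use: top's RES column includes shared pages (the SHR column), and PostgreSQL server processes all attach to the same shared memory, so adding up the RES values would count it many times over. The sketch below estimates the non-shared part per process as RES minus SHR for a few rows copied from the postgres top output above. The helper names are made up for illustration, and it assumes top is reporting in its default KiB units with `m`/`g` suffixes for larger values.

```python
def top_size_to_kib(field):
    """Convert a top(1) memory field such as '5484', '80m' or '1.9g' to KiB.

    Assumption: plain numbers are KiB (top's default); 'm' and 'g' suffixes
    mean MiB and GiB. Suffixed values are rounded by top, so this is approximate.
    """
    multipliers = {"k": 1, "m": 1024, "g": 1024 * 1024}
    suffix = field[-1].lower()
    if suffix in multipliers:
        return int(float(field[:-1]) * multipliers[suffix])
    return int(field)


def private_kib(res, shr):
    """Estimate the non-shared resident memory of one process (RES - SHR)."""
    return max(top_size_to_kib(res) - top_size_to_kib(shr), 0)


# (PID, RES, SHR) for a few postmaster rows from the top snapshot above.
ROWS = [
    ("31670", "5484", "5096"),
    ("31673", "80m", "79m"),
    ("2820", "83m", "81m"),
    ("2822", "8508", "7420"),
]

total_private = 0
for pid, res, shr in ROWS:
    kib = private_kib(res, shr)
    total_private += kib
    print(f"PID {pid}: ~{kib} KiB not shared")

print(f"Total for these rows: ~{total_private} KiB")
```

If that reading is right, each of these processes holds only a few MB of its own, and most of the ~80MB shown per process is the same shared memory counted repeatedly.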
Our OS memory settings are as follows:
> cat /proc/meminfo
MemTotal: 4151220 kB
MemFree: 182180 kB
Buffers: 26040 kB
Cached: 1377780 kB
SwapCached: 19808 kB
Active: 2579972 kB
Inactive: 1275760 kB
HighTotal: 3276544 kB
HighFree: 42468 kB
LowTotal: 874676 kB
LowFree: 139712 kB
SwapTotal: 8193140 kB
SwapFree: 8153040 kB
Dirty: 720 kB
Writeback: 0 kB
AnonPages: 2450680 kB
Mapped: 86836 kB
Slab: 77724 kB
PageTables: 11396 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 10268748 kB
Committed_AS: 3250984 kB
VmallocTotal: 116728 kB
VmallocUsed: 11316 kB
VmallocChunk: 105140 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
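To put a number on the "OS will free up memory as required" idea, here is a minimal sketch that parses /proc/meminfo-style text and adds MemFree, Buffers and Cached. That sum is only a rough upper estimate of what the kernel could hand back to applications (some of Cached, e.g. shared memory, cannot simply be dropped), and the function names are hypothetical. The sample values are copied from the output above.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Values are in kB where the line carries a 'kB' unit; lines such as
    'HugePages_Total: 0' are plain counts.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a 'Key: value' line
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def reclaimable_estimate_kb(info):
    """Rough estimate: free memory plus buffers plus page cache, in kB."""
    return info["MemFree"] + info["Buffers"] + info["Cached"]


# Values copied from the /proc/meminfo snapshot in this message.
SNAPSHOT = """\
MemTotal: 4151220 kB
MemFree: 182180 kB
Buffers: 26040 kB
Cached: 1377780 kB
"""

info = parse_meminfo(SNAPSHOT)
estimate = reclaimable_estimate_kb(info)
share = 100.0 * estimate / info["MemTotal"]
print(f"Free + buffers + cached: {estimate} kB ({share:.1f}% of MemTotal)")
```

On these figures the sum comes to 1586000 kB, roughly 1.5GB, so the ~180MB shown as free understates what is potentially available on this machine.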
> You could try running the data load in the 10.0 branch and see if it is
> better. I think most performance improvements were completed by 8.2
> but we do keep making improvements.
I have tried an install of version 10 to see if that resets the
problem, but it's still struggling just as with version 8_2. I'm sure
problem 2 is not with the intermine code but with the postgres
database or memory. We've restarted the postgres database as well, but
again that didn't fix it.
Any suggestions very welcome!! A bit bemused up here as to what has happened!
Cheers,
Stephen
> Cheers,
> Richard.
>
>
>
> SG Edwards wrote:
>> Hey guys,
>>
>> Hope you had a good Christmas/New Year!
>>
>> I've still been having trouble with very large data files above a
>> certain size. They seem to reach a bottleneck such that the load
>> takes exponentially longer than you would expect given load times
>> for smaller files.
>>
>> You have suggested previously changing the ANT_OPTS settings to:
>>
>> export ANT_OPTS="-server -XX:MaxPermSize=256M -Xmx1800m
>> -XX:+UseParallelGC -Xms1800m -XX:SoftRefLRUPolicyMSPerMB=1
>> -XX:MaxHeapFreeRatio=99"
>>
>> However, this actually made the loads approx 2-4 times slower than
>> with the original setting of:
>>
>> export ANT_OPTS="-server -XX:MaxPermSize=256M -Xmx1800m"
>>
>> Once the running job (e.g. ant -v -Dsource=sentences_con_abs
>> integrate) reaches a java memory of 1.1GB or above there is a
>> distinct increase in load time (most of the large data sources are
>> custom ones of the type intermine-items-large-xml-file). I suspect
>> this is probably due to the postgres settings as much as
>> anything? I can get hold of the postgres settings if you need
>> them (which ones?).
>>
>> My question would be: is it possible to split a large file into two
>> custom files with distinct information in each, i.e. sentence IDs
>> from 1 -> 1,000,000 in file one and 1,000,001 -> 2,000,000 in file two,
>> and load each file separately? Kim had mentioned previously that
>> it wasn't a good idea to try to reload a source that had
>> already been loaded, but would it be possible to do this if the data is
>> distinct (i.e. it should avoid the error: object already loaded from
>> this source)? Would this affect indexing or some other integration
>> process?
>>
>> Cheers,
>>
>> Stephen
>>
>> p.s. I'm still using version 8.2 at the minute as I want to stay
>> compatible with the older style interface.
>>
>
>