[flymine-dev] Large file loading bottleneck

SG Edwards flymine-dev at flymine.org
Tue, 08 Jan 2008 14:24:18 +0000


Hi Richard,

OK, there are two separate issues here:

1. Up until a couple of weeks ago the loads were going fine, except
that they were very slow for large sources.

2. Recently, the loads for medium or large sources have slowed right
down, e.g. UniProt used to take 50 mins to load on our machines but now
takes 105 mins. Small sources load in exactly the same time as before.

It would seem that something has changed to cause the second
point. As far as we can tell there have been no recent changes to the java,
ant or postgres configs; the only change is that a small
tomcat application (SyMBA) has been deployed. However, we don't think
this should be the cause, as it doesn't appear to be taking any
cpu time at all.

> It could be a memory issue. Are you running the java and postgres on
> the same machine? How much RAM does the machine have?  And are there
> other processes on the machine?  How much RAM/cpu is postgres using?

I'm pretty sure it is a memory issue (with regard to problem 2), as the
loads were running fine previously. Java, ant and tomcat are running
on a separate server from postgres.  Both machines are pretty reasonable
spec (4 GB RAM). Occasionally people run other processes on the
machines, but by and large there are no other processes doing much at
all. Here's the 'top' output during a load of the uniprot source. This
is a snapshot after about two hours (real time) of running, with no other
significant processes running:

top - 14:10:02 up 47 days,  1:32,  3 users,  load average: 0.49, 0.50, 0.39
Tasks: 168 total,   2 running, 166 sleeping,   0 stopped,   0 zombie
Cpu(s):  4.1%us,  0.7%sy,  0.0%ni, 94.6%id,  0.5%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4151220k total,  3979884k used,   171336k free,    24360k buffers
Swap:  8193140k total,    40100k used,  8153040k free,  1386996k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
15017 s0460205  18   0 2298m 1.9g  46m R   70 48.2  49:03.24 java

...and on the postgres server (postgres processes only)....

top - 14:09:25 up 320 days, 21:14,  1 user,  load average: 1.42, 1.31, 1.21
Tasks: 130 total,   1 running, 128 sleeping,   1 stopped,   0 zombie
Cpu(s):  0.4%us,  0.2%sy,  0.0%ni, 99.3%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4151220k total,  4013228k used,   137992k free,    47696k buffers
Swap: 20482864k total,  1607664k used, 18875200k free,  3128772k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
31670 postgres  17   0 97200 5484 5096 S    0  0.1   0:00.34 postmaster
31671 postgres  15   0 10852  624  244 S    0  0.0   0:00.35 postmaster
31673 postgres  15   0 97432  80m  79m S    0  2.0   0:52.26 postmaster
31674 postgres  15   0 11852 1584  196 S    0  0.0   0:01.84 postmaster
31675 postgres  15   0 12296 1768  276 S    0  0.0   0:21.52 postmaster
  2786 postgres  15   0 98356  82m  81m S    0  2.0   0:10.50 postmaster
  2789 postgres  15   0 98064  82m  81m S    0  2.0   3:05.32 postmaster
  2801 postgres  15   0 98236  81m  79m S    0  2.0   0:15.48 postmaster
  2807 postgres  15   0 98308  82m  80m S    0  2.0   1:51.76 postmaster
  2808 postgres  16   0 98012  74m  73m S    0  1.8   0:09.79 postmaster
  2815 postgres  15   0 98236 6740 5584 S    0  0.2   0:00.02 postmaster
  2816 postgres  15   0 99136  83m  81m S    0  2.1   4:02.89 postmaster
  2817 postgres  15   0 97916 3848 2976 S    0  0.1   0:00.03 postmaster
  2818 postgres  15   0 98052  81m  80m S    0  2.0   2:04.06 postmaster
  2819 postgres  15   0 97656 2212 1496 S    0  0.1   0:00.01 postmaster
  2820 postgres  16   0 98932  83m  81m S    0  2.0  79:42.95 postmaster
  2821 postgres  15   0 98236 4964 3788 S    0  0.1   0:00.01 postmaster
  2822 postgres  15   0 98008 8508 7420 S    0  0.2   0:00.03 postmaster
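
To put a rough number on the RAM postgres is using from rows like these, here is a minimal Python sketch. It is an illustration only, not part of our setup: the helper names (to_kb, private_kb) are made up, and it assumes top's column order as shown above, with RES and SHR in kB unless suffixed with m or g. It sums RES minus SHR per process, on the assumption that the resident figures mostly count the same shared memory once per process.

```python
def to_kb(field):
    """Convert a top memory field such as '5484' (kB) or '80m' to kB."""
    units = {"k": 1, "m": 1024, "g": 1024 * 1024}
    suffix = field[-1].lower()
    if suffix in units:
        return int(float(field[:-1]) * units[suffix])
    return int(field)


def private_kb(top_rows):
    """Sum RES minus SHR (columns 6 and 7) over top process rows."""
    total = 0
    for row in top_rows.strip().splitlines():
        cols = row.split()
        res, shr = to_kb(cols[5]), to_kb(cols[6])
        # Guard against rounding in top's display making SHR exceed RES.
        total += max(res - shr, 0)
    return total


# Three rows copied from the postgres server output above.
rows = """
31670 postgres  17   0 97200 5484 5096 S    0  0.1   0:00.34 postmaster
31673 postgres  15   0 97432  80m  79m S    0  2.0   0:52.26 postmaster
 2816 postgres  15   0 99136  83m  81m S    0  2.1   4:02.89 postmaster
"""
print(private_kb(rows), "kB private (approx.)")
```

For these three rows it gives 3460 kB, which would suggest the individual processes hold very little memory of their own beyond the shared segment.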

There seem to be an awful lot of postmaster processes running; is that
normal?! There are five postmaster processes running even when there is no
activity on the postgres database. I thought there should only be
one(?). There also appears to be only a tiny amount of free memory
on both machines (~150 MB out of 4 GB), but I think this is normal
behaviour, where the OS will free up memory as required?

Our OS memory settings are as follows:

> cat /proc/meminfo

MemTotal:      4151220 kB
MemFree:        182180 kB
Buffers:         26040 kB
Cached:        1377780 kB
SwapCached:      19808 kB
Active:        2579972 kB
Inactive:      1275760 kB
HighTotal:     3276544 kB
HighFree:        42468 kB
LowTotal:       874676 kB
LowFree:        139712 kB
SwapTotal:     8193140 kB
SwapFree:      8153040 kB
Dirty:             720 kB
Writeback:           0 kB
AnonPages:     2450680 kB
Mapped:          86836 kB
Slab:            77724 kB
PageTables:      11396 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:  10268748 kB
Committed_AS:  3250984 kB
VmallocTotal:   116728 kB
VmallocUsed:     11316 kB
VmallocChunk:   105140 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
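
As a rough check on the free memory point, here is a minimal Python sketch that adds MemFree, Buffers and Cached from /proc/meminfo-style text. The function names are made up for illustration, and treating all of Cached as reclaimable is an assumption (some of it may be shared memory that cannot be dropped), so the result is best read as an upper estimate.

```python
def parse_meminfo(text):
    """Return {field name: integer value} from /proc/meminfo-style text."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            info[name.strip()] = int(parts[0])  # values are in kB where a unit is given
    return info


def reclaimable_kb(info):
    """Free memory plus buffers and page cache, in kB (an upper estimate)."""
    return info["MemFree"] + info["Buffers"] + info["Cached"]


# Figures copied from the /proc/meminfo output above.
sample = """MemTotal:      4151220 kB
MemFree:        182180 kB
Buffers:         26040 kB
Cached:        1377780 kB"""

info = parse_meminfo(sample)
print(reclaimable_kb(info), "kB free or reclaimable (estimate)")
```

With the figures above this comes to 1586000 kB, i.e. roughly 1.5 GB that the OS could in principle hand back, against 182180 kB reported as MemFree.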

> You could try running the data load in the 10.0 branch and see if it is
> better.  I think most performance improvements were completed by 8.2
> but we do keep making improvements.

I have tried an install of version 10 to see if that resets the
problem, but it's still struggling just as with version 8.2. I'm sure
problem 2 is not with the intermine code but with the postgres
database or memory. We've restarted the postgres database as well, but
again that didn't fix it.

Any suggestions very welcome!! A bit bemused up here as to what has happened!

Cheers,

Stephen

> Cheers,
> Richard.
>
>
>
> SG Edwards wrote:
>> Hey guys,
>>
>> Hope you had a good Christmas/New Year!
>>
>> I've still been having trouble with very large data files above a
>> certain size. They seem to reach a bottleneck such that the load
>> takes exponentially longer than you would expect given load times
>> for smaller files.
>>
>> You have suggested previously changing the ANT_OPTS settings to:
>>
>> export ANT_OPTS="-server -XX:MaxPermSize=256M -Xmx1800m
>> -XX:+UseParallelGC -Xms1800m -XX:SoftRefLRUPolicyMSPerMB=1
>> -XX:MaxHeapFreeRatio=99"
>>
>> However, this actually made the loads approx 2-4 times slower than
>> with the original setting of:
>>
>> export ANT_OPTS="-server -XX:MaxPermSize=256M -Xmx1800m"
>>
>> Once the running job (e.g. ant -v -Dsource=sentences_con_abs
>> integrate) reaches a java memory of 1.1 GB or above there is a
>> distinct increase in load time (most of the large data sources are
>> custom ones of the type intermine-items-large-xml-file). I suspect
>> this is probably due to the postgres settings as much as
>> anything? I can get hold of the postgres settings if you need
>> them (which ones?).
>>
>> My question would be: is it possible to split a large file into two
>> custom files with distinct information in each, i.e. sentence IDs
>> from 1 -> 1,000,000 in file one and 1,000,001 -> 2,000,000 in file two,
>> and load each file separately? Kim had mentioned previously that
>> it wasn't a good idea to try to reload a source that had
>> already been loaded, but would it be possible to do this if the data is
>> distinct (i.e. it should avoid the error: object already loaded from
>> this source)? Would this affect indexing or some other integration
>> process?
>>
>> Cheers,
>>
>> Stephen
>>
>> p.s. I'm still using version 8.2 at the minute as I want to stay
>> compatible with the older style interface.
>>
>
>
> _______________________________________________
> flymine-dev mailing list
> flymine-dev@flymine.org
> http://mailman.flymine.org/listinfo/flymine-dev



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.