Patch-ID# 124520-05
Your use of the firmware, software and any other materials contained
in this update is subject to My Oracle Support Terms of Use, which
may be viewed at My Oracle Support.
Copyright (c) 2012, Oracle and/or its affiliates. All rights reserved.
Keywords: qmaster scheduler qmon qstat qconf memory usage root security
Synopsis: N1 Grid Engine 6.0: maintenance patch
Date: Sep/18/2008
Install Requirements: See Special Install Instructions
Solaris Release: 7 8 9 10
SunOS Release: 5.7 5.8 5.9 5.10
Unbundled Product: N1 Grid Engine
Unbundled Release: 6.0
Xref: See patch matrix below
Topic:
Relevant Architectures: sparc
Bugs fixed with this patch:
Changes incorporated in this version: 6410844 6469605 6506667 6562190 6613438 6619046 6623174 6625420 6650497 6665417 6666161 6686155 6693998 6707055 6720078 6729441 6732056
Patches accumulated and obsoleted by this patch: 118130-07 121957-01 123038-01 123432-02
Patches which conflict with this patch:
Patches required with this patch:
Obsoleted by:
Files included with this patch:
<install_dir>/bin/sol-sparc64/qacct
<install_dir>/bin/sol-sparc64/qalter
<install_dir>/bin/sol-sparc64/qconf
<install_dir>/bin/sol-sparc64/qdel
<install_dir>/bin/sol-sparc64/qhost
<install_dir>/bin/sol-sparc64/qmake
<install_dir>/bin/sol-sparc64/qmod
<install_dir>/bin/sol-sparc64/qmon
<install_dir>/bin/sol-sparc64/qping
<install_dir>/bin/sol-sparc64/qsh
<install_dir>/bin/sol-sparc64/qstat
<install_dir>/bin/sol-sparc64/qsub
<install_dir>/bin/sol-sparc64/qtcsh
<install_dir>/bin/sol-sparc64/sge_coshepherd
<install_dir>/bin/sol-sparc64/sge_execd
<install_dir>/bin/sol-sparc64/sge_qmaster
<install_dir>/bin/sol-sparc64/sge_schedd
<install_dir>/bin/sol-sparc64/sge_shadowd
<install_dir>/bin/sol-sparc64/sge_shepherd
<install_dir>/bin/sol-sparc64/sgepasswd
<install_dir>/examples/jobsbin/sol-sparc64/work
<install_dir>/lib/sol-sparc64/libXltree.so
<install_dir>/lib/sol-sparc64/libcrypto.so.0.9.7
<install_dir>/lib/sol-sparc64/libdrmaa.so
<install_dir>/lib/sol-sparc64/libdrmaa.so.0.95
<install_dir>/lib/sol-sparc64/libdrmaa.so.1.0
<install_dir>/lib/sol-sparc64/libspoolb.so
<install_dir>/lib/sol-sparc64/libspoolc.so
<install_dir>/lib/sol-sparc64/libssl.so.0.9.7
<install_dir>/utilbin/sol-sparc64/adminrun
<install_dir>/utilbin/sol-sparc64/berkeley_db_svc
<install_dir>/utilbin/sol-sparc64/checkprog
<install_dir>/utilbin/sol-sparc64/checkuser
<install_dir>/utilbin/sol-sparc64/db_archive
<install_dir>/utilbin/sol-sparc64/db_checkpoint
<install_dir>/utilbin/sol-sparc64/db_deadlock
<install_dir>/utilbin/sol-sparc64/db_dump
<install_dir>/utilbin/sol-sparc64/db_load
<install_dir>/utilbin/sol-sparc64/db_printlog
<install_dir>/utilbin/sol-sparc64/db_recover
<install_dir>/utilbin/sol-sparc64/db_stat
<install_dir>/utilbin/sol-sparc64/db_upgrade
<install_dir>/utilbin/sol-sparc64/db_verify
<install_dir>/utilbin/sol-sparc64/filestat
<install_dir>/utilbin/sol-sparc64/fstype
<install_dir>/utilbin/sol-sparc64/gethostbyaddr
<install_dir>/utilbin/sol-sparc64/gethostbyname
<install_dir>/utilbin/sol-sparc64/gethostname
<install_dir>/utilbin/sol-sparc64/getservbyname
<install_dir>/utilbin/sol-sparc64/infotext
<install_dir>/utilbin/sol-sparc64/loadcheck
<install_dir>/utilbin/sol-sparc64/now
<install_dir>/utilbin/sol-sparc64/openssl
<install_dir>/utilbin/sol-sparc64/qrsh_starter
<install_dir>/utilbin/sol-sparc64/rlogin
<install_dir>/utilbin/sol-sparc64/rsh
<install_dir>/utilbin/sol-sparc64/rshd
<install_dir>/utilbin/sol-sparc64/sge_share_mon
<install_dir>/utilbin/sol-sparc64/spooldefaults
<install_dir>/utilbin/sol-sparc64/spooledit
<install_dir>/utilbin/sol-sparc64/spoolinit
<install_dir>/utilbin/sol-sparc64/testsuidroot
<install_dir>/utilbin/sol-sparc64/uidgid
Problem Description:
6410844 It is possible to set negative tickets / shares in qmon and from the command line
6469605 accounting records for slave tasks of pe jobs should contain the correct task submission time
6506667 forbid deletion of global config values
6562190 parallel scheduling memory leak in sge_schedd
6613438 'Infinity' must be rejected when specified in 'complex_values' or RQS limits for consumables
6619046 qhost/qstat can't be interrupted with ctrl-c
6623174 Failed to deliver STOP signal for subordinated jobs
6625420 Missing array job task usage in the accounting file
6666161 Incorrectly considering two host group names to be the same
6686155 Commlib might crash if running out of memory
6693998 Communication library thread locking problem results in qmaster crash
6650497 use of -l tmpdir=abc can crash schedd
6665417 Memory leak in drmaa_run_job()/drmaa_run_bulk_job()
6707055 32-bit Linux binaries are having problems with file access in 64-bit NFS environments
6720078 qmaster runs out of memory on AIX
6729441 memory leak in sge_execd with qsub -v SGE_* or qsub -V
6732056 qstat -f -q cqueue@host fails, if host is resolved differently on the client than on qmaster
(from 124520-04)
4743006 problem with floating point job resource limits
6195248 QMON Job Control Window: Incomprehensible Priority Button
6280747 qmon loses sharetree changes
6391244 qstat -ext reports wrong usage as compared to other commands such as qstat -t or qstat -j
6410592 Double clicking in Consumables/Fixed Attributes list does not behave as a GUI should
6433628 qconf -sq all.q@myhost produces no value at all for complex_values (not even NONE)
6482211 complex attributes whose deletion is denied don't reflect back after the denial message in qmon
6513115 in qmon, under calendar configuration, it is possible to modify even if no calendar exists
6513116 Qmon x qconf inconsistent in allowed characters in attribute names
6525375 qacct ignores jobs in output
6542987 drmaa_run_job(3) raises error if drmaa_native_specification has leading spaces
6541085 NFS write error on N1GE trace file
6565951 Qmon panel does not check for valid data in Scheduler Configuration
6566033 Help for Browser panel in qmon incomplete
6568575 SGE does not work if primary group entry is too big in groups map
6576153 Creating a userset with NONE as a type results in a core dump
6576197 Userset type should not accept empty string value
6600619 Userset spooling in classic mode is broken
6604155 qmon binary job submit is broken
6608259 scheduler prints empty line in messages file after every 'sge_mirror' logging
6610788 qdel returns wrong exit code
6617450 add option to reporting_params for switching off writing of consumables
6618328 qmon displays wrong string for queue filtering
6618599 Long running jobs cause incorrect usage summary for ARCo database
6619016 removing parameters from the reporting_params will not fallback to the default
6622842 the start_time field in intermediate accounting records is incorrect
(from 124520-03)
6553066 qmon's Complex Configuration Load and Save buttons did not work
6544869 UNKNOWN group/owner in accounting(5)
6538740 clear usage operation should implicitly trigger refresh in share-tree dialogue
6538293 Hybrid user/project share-tree is broken for user sharing amongst array jobs
6537633 Extraneous space in qsub's "Invalid month specification." message
6525917 qacct -l h=<hostname> dumps core on darwin and linux itanium
6522273 Wrong exit code with qconf -sds
6520761 add background mode to N1 Grid Engine Helper Service
6518684 Qconf usage x man page inconsistency
6518607 invalid memory access in cl_com_get_handle
6516288 Scheduler does not write pid file in daemonize phase
6513597 on Windows, automatically translate job user "root" to job user "local Administrator"
6499217 meaningless error in clients when reporting_param flush_time is incorrectly set
6494390 Broken output of job name with 'qsub -N'
6476263 function job_get_id_string() is not MT safe and used in qmaster
6470048 Discrepancy between load values reported by Gridengine and from the HP-UX 64 bit env.
6426331 remove util/sge_log_tee from distribution
6422335 still used usersets/project/calendar/pe/checkpoint can be removed under certain conditions
6367642 Numbers in error mail too large
6355875 qsub -terse to just output job id
6345522 qdel on a job in deleted state does not output any information
6328064 Queue request -q from sge_request can't be overridden through command line
6327539 Ability to sort queue instances using each column of the queue instances table
6291044 "Modify"-Button is activated but should be grayed
5081743 queue status in reporting file is missing.
4818801 qmon on secondary screen crashes when "Job Control" is pressed
4742097 Qmon has a ticket number limitation
6564503 sge_schedd deadlock upon schedd_job_info job_list being enabled
6555744 qmon crashes when displaying about dialog
6363245 on some Windows execution hosts, execd hangs after the job has finished
6233523 loadcheck reports on a hyperthreaded CPU only one processor
6288953 scalability issue with qdel and very large array jobs
6395075 on Windows, execd doesn't provide useful error messages when SSL keys broken
(from 124520-02)
6517015 execution daemon can crash on Linux where libnss_ldap.so uses BDB 4.2 shared library
6511171 spooledit cannot dump USERSET objects
6509684 qmaster dies when modifying slots value for queue domain when queuename is missing
6507576 load formula does not recognize float as weighting factor
6506677 qmon job control: display wider default columns
6506637 job control: sorting by different fields
6500359 sge_conf(5) setting 'max_u_jobs' broken if BDB spooling is used
6494508 host already exists when modifying cluster settings
6488244 ignore_fqdn is broken for the local configuration
6486125 qmaster logging "scheduler tried to remove a incomplete"
6483941 In certain cases jobs may stay in "t" state for 5 minutes
6482762 qsh does not work if XAUTHORITY is set in root environment
6472859 sgepasswd binary causes segmentation faults with corrupted sgepasswd files
6457900 accounting records for slave tasks of pe jobs contain invalid submission time
6448704 The sge_share_mon utility does not work with the automatic policy enforcement
6438475 potential security issues in cull library
6429305 shared library name DT_SONAME not set with libdrmaa.so
6422610 Unable to modify Advanced Settings in Configuration for Host in my cluster using qmon
6391947 getDrmaaImplementation() should return the same string as getDrmSystem()
6391930 drmaa_control() causes illegal memory access
6365045 DRMAA sessions should be persistent
6359575 drmaa_version() function should return 1.0
6353558 hostname resolving should not be case sensitive
6353526 reprioritize field in qmon cluster config missing
6349814 wrong qlogin_daemon or rlogin_daemon in host conf doesn't set host and job into error state
6347256 missing user entry in sgepasswd file wrongly sets queue instance in error mode
6287963 qdel of just submitted job
The following Change Request (CR) is related only to Grid Engine running
under the Microsoft Windows operating system family
6504566 on Windows, use passwords only according to enable_windomacc setting, not according to User ID
6471881 N1 Grid Engine Helper Service creates registry path for listening port completely temporary
6464927 Failed Windows GUI jobs are reported as finished successfully.
(from 124520-01)
6480580 CSP mode is affected by OpenSSL Security Advisory [28th September 2006]
6475282 account string does not accept the "|" character
(from 123432-02)
6458517 unreasonably long scheduler dispatch times if lots of projects are used in share tree
6458510 unreasonably long scheduler dispatch times if lots of cluster queues are deployed in large clusters
(from 123432-01)
6424565 jobs with negative priority will be rejected by qmaster
(from 123038-01)
6412215 encryption and/or decryption of passwords fails because crypto engine is not seeded
6411230 Job Sequence Number got screwed up when restarting qmaster daemon
6407513 Scheduler hangs after a qmaster crash and restart
6401993 qstat -u <user> crashes
6400729 weak authentication and authorization in CSP mode
6398723 Tickets are not reset for running jobs after disabling the ticket policy
6398008 Off-by-one overrun in communication library
6397987 several buffer overruns
6397383 qmaster deadlock when reporting file cannot be written
6391238 qrsh does not accept -o/-e/-j
6390494 qrsh issue with interactive jobs and directory write permissions
6389526 commlib closes wrong connection on SSL error
6387206 CSP revocation lists are not supported
6384812 qstat produces non-well-formed XML output
6384709 slow scheduler performance for jobs with hard queue requests
6384698 schedulers mem use growing, if pe jobs are running
6384682 "qstat -j" aborts
6383513 resource filtering in qselect broken
6368747 Job tickets are not correctly shown in qstat for none running jobs
6365380 possible buffer overflow in sge_exec_job()
6364440 qconf -mhgrp <hostgroup> results in glibc error message and abort
6363823 qsub -w w changes -sync behavior
6319223 subordinate properties lost on qmaster restart
6291033 Unclear share calculation of running jobs
6287945 Interrupting qrsh while pending does not remove job
4737342 interactive jobs leave behind output/error files if prolog/epilog are run
The following Change Request (CR) is related only to Grid Engine running
under the Microsoft Windows operating system family
6382156 qloadsensor.exe consumes too much time and delivers too high load values
6359054 On hosts with 3 GB RAM, qloadsensor.exe shows only 2GB RAM
(from 121957-01)
6366691 utilbin/<arch>/rsh can be used to gain root access
(from 118130-07)
6355263 reschedule of a parallel job crashes the qmaster
6354164 drmaa does not work on hp11 platform
6354143 mutually subordinating queues suspend each other simultaneously
6351728 installation of qmaster failed when using /etc/services
6349972 DRMAA crashes during some operations on bulk jobs
6349818 an additional started schedd/execd daemons may not stop if started when qmaster is down
6348517 job finish although terminate method is still running
6348516 job finish does not terminate all processes of a job
6348299 qconf -mstree aborts
6346704 qrsh -V doesn't always work.
6346696 connection to Berkeley DB RPC server can timeout
6342005 a scheduler configuration change with a sharetree can result in a usage leak
6339756 Quotes in qtask file can result in memory corruption
6338314 occasional "failed to deliver job" errors due to SIGPIPE in sge_execd
6336519 changing the cwd flag in qmon - qalter has no effect
6333467 sgemaster -migrate may not delete qmaster lock file and may break shadowd functionality
6333407 configuring the halflife_decay_list crashes the qmaster
6332877 qstat -pe filter does not work
6332876 qstat -U does not consider queue access for job and project access for queues
6329832 qconf and qmaster accept invalid settings for queue complex_values
6328703 fstype does not recognize nfs4 share in all cases
6327427 qping core dump with enabled message content dump
6322498 calendar syntax "week mon=0-21" corrupts SGE and may crash qmaster
6320869 sge_qmaster daemon is running on both the master and shadow nodes after a long network failure
6320683 Binary switch reversed in job category and can cause application to hang
6319233 Parsing of context variable options fails for values containing commas in single quotes
6319231 unable to delete a configuration of a non existing host
6319228 Backslash line continuation is broken for host groups
6318660 the system hold on an array task can vanish
6318018 shepherd doesn't handle qrlogin/qrsh jobs correctly
6317048 Memory leaks in drmaa library, japi_wait and drmaa_job2sge_job
6317028 Quotes in job category can result in memory corruption
6316995 qconf -mp prints error messages two times
6315111 doing a qalter -l rsc=val on running jobs breaks consumable debit
6313445 Qrsh tries to free invalid pointer
6307557 qhost returns wrong total_memory value on MacOSX 10.3
6306834 consumables as thresholds are not working correctly with pe jobs
6306229 wrong soft requests decision
6305095 qstat schema files are incomplete
6304490 qconf -as/-ah leads to segmentation fault
6304471 qlogin -R does not work like documented
6304466 qmaster crashes with large number of qconf -aattr calls
6303671 DRMAA can abort in the middle of a session if NIS becomes unavailable
6301047 qstat -s p doesn't show pending array tasks while there are tasks of this job running
6299982 Slow submission rate with drmaa_run_job()
6295791 qacct -h should not resolve hostnames
6294875 CSP: consolidate error output if cert CA on client and server don't match
6294052 suspend threshold is not working for calendar disabled queues
6293411 NFS write error on host <NFS server>: Permission denied.
6292926 qconf -mattr can crash qmaster
6292751 admin mail information is incorrect
6292742 tight integration - qrsh_exit_code file not written
6291023 qstat -j <name> doesn't print delimiter between jobs
6291016 qmon startup and queue add/modify warning messages
6289455 qstat -XML output does not match the schema
6288626 default PATH variable set for job insufficient for non-login shell jobs
6287955 strange reservation
6287946 qconf -[dm]attr gets confused by shortcuts
6287935 qmod -sq can kill a pe job in t state
6287865 qrsh default job names are not consistent with documented job name limitations
6287862 qhost -l for complexes is broken
6287850 Allow SIGTRAP to enable debugging
6287847 qstat -j shows wrong message for parallel jobs which can't be dispatched
6286510 delivery of queue based signals to execd repeated endlessly
6282996 use of IP address as host name disables unique hostname resolving
6275789 soft requirements on load values are ignored
6268799 confusing execd startup messages and delays in case of problems
6256590 qconf -mq disallows 2057 hostspecific profiles in slots configuration
6255111 Binary jobs are problematic for starter and epilog scripts
6253860 First character is lost in quoting
6250692 accounting(5) record can't be made available immediately after job finish
6242169 Multi-threaded, multi-CPU username problems
6207868 wording with qconf -cq should be changed
6287953 repeated logging of the error message: "failed building category string for job N"
The following Change Request (CR) is related only to Grid Engine running
under the Microsoft Windows operating system family
6353638 default process priority on windows freezes the whole system until job is finished
6348478 install script sets rsh_daemon to /usr/sbin/in.rshd on win32-x86
6314019 qloadsensor.exe uses up more and more handles
6279523 qlogin on windows does not work!
(from 118130-06)
6299939 distribution should contain all Berkeley DB utilities
6299351 qrsh fails when execd_param INHERIT_ENV=false and no ARC set in sge_execd environment
6299345 No error messages in case SSL initialization fails
6298233 no user notification or command hanging if an immediate job cannot be scheduled
6298056 INHERIT_ENV and SET_LIB_PATH are not reset by setting execd_params to NONE
(from 118130-05)
6295165 finished array job tasks can be rescheduled if master/scheduler daemons are stopped/started
6294397 wrong drmaa jnilib link on MacOS
6288588 jobs submitted with -v PATH do not retain $TMPDIR prefixed by N1GE as required for tight integration
6288156 sge_shepherd SEGV's when it tries to fopen the usage file
6287958 suspend not working under Mac OS X
6287867 tight integration: temporary files are not deleted at task exit
6286533 job wallclock monitoring and enforcement considers prolog/epilog runtime part of net job runtime
6285898 qconf -Xattr does not resolve fqdn hostnames
6283308 overhead with job execution could lead to overoptimistic backfilling and break resource reservation
6281462 qmaster profiling can only be turned on by restarting qmaster
6281440 resource allocation shown by qstat/qhost not consistent with resource utilization
6280698 Resource filtering with qhost broken
6279409 qconf -tsm command generates too much data (very large schedd_runlog file)
6279402 drmaa_exit() causes qmaster error logging if host is no admin host
6278727 qstat -xml -urg output contains badly formatted numbers
6278147 drmaa_job_ps() returns DRMAA_PS_QUEUED_ACTIVE for finished array job rather than DRMAA_PS_DONE
6277909 qconf -mq coredumps
6274467 qmon kills a system
6273217 race condition with qsub -sync and drmaa_wait() if job exits directly after being submitted
6273006 qstat -j "" results in a segmentation fault
6269411 Close integration cause jobscripts with multiple mprun commands to be killed.
6269305 qrsh/qsh/qlogin reject -js option
6268707 job_load_adjustments is not working correctly when parallel jobs are submitted.
6267932 high CPU load of qmaster even on empty cluster
6267245 Repeated logging of the same message produces giant logging files
6267238 Multithreaded DRMAA may crash due to use of sge_strtok()
6266450 performance bottleneck with subordinate list
6266392 Performance problem with qconf -mattr exechost XX XX global
6265154 Wildcards in PE Name Cause Unusual Behavior
6264592 drmaa_control(DRMAA_JOB_IDS_SESSION_ALL, DRMAA_CONTROL_SUSPEND|RESUME) returns INVALID_JOB error
6260656 incomplete resource reservation with array jobs
6252525 qmon: complex attributes not removeable
6252469 misleading qstat -j messages in case of resource reservation
6250603 qmon crash (segmentation fault) on Solaris64
6218877 qstat -t is broken
4769608 qalter shows wrong priority number when using negative priorities with -p option
The following Change Request (CR) is related only to Grid Engine running
under the Microsoft Windows operating system family
6239470 Avoid that sge_execd has to be started by the Domain Administrator
(from 118130-04)
6260729 Can't select 'slots' in select box when adding consumables for execution host
6260024 qmon cluster queue modify cancel not working correct
6259380 potential qmaster sec. fault.
6256530 cqueues/all.q trashed after qmaster shutdown with 1362 hosts
6256457 pe jobs disappear in t state (execd doesn't know this job)
6255902 qmake in dynamic allocation mode core dump
6255850 the usage in projects is never spooled while the qmaster
6255804 job in error state breaks qstat -f -xml
6255336 execd sends empty job report for a pe slave task
6255329 qmaster does not store sharetree usage on shutdown
6253266 failed array tasks are rescheduled only one by one
6253093 qstat -f -pe make breaks
6252524 Missing success message with qconf -Aprj
6252465 qsub option parameter string only supports 2048 character strings
6251943 japi does not work with host aliasing
6251172 reserved jobs prevent other jobs from starting
6247889 qsub -sync y return code behaviour broken
6247239 sequence nr of execd load reports corrupted
6247238 qsub fails to work correctly with -b n -cwd
6247211 qstat -explain E does not print queue errors correctly
6245487 qhost -h <hostname> does not show selected host
6244865 a series of matching soft queue requests gets not counted separately
6244808 scheduler does not get all objects on a qmaster or scheduler startup
6244229 misleading qstat -j message when the scheduler is not running
6244215 qsub -b y must fail if no command is specified
6242779 qsub -now yes not working on CSP system
6242181 Failed drmaa_control (DRMAA_CONTROL_TERMINATE) causes deadlock
6242172 Multi-threaded args parsing problems
6242165 Profiling library never frees thread slots
6242057 jobs which request consumable resources which are set to infinity are not scheduled
6242055 Consumable request may not be 0 if PE requested
6241544 qstat -F dies in case of an infinite integer setting
6241487 termination script may not be ignored, when job submitted with -notify
6241430 error message "no execd known on host"
6241401 Conflicting requirements should have the same meaning with qstat and qsub
6241378 Reservation of wrong hosts
6241376 qstat -U aborts
6240739 qstat -s hu shows pending jobs only
6239660 qmaster profiling doesn't start at qmaster startup
6239569 qmaster does not accept new connections if number of execd's exceed FD_SETSIZE
6239394 Spooledit fails during database upgrade
6236475 DRMAA segfaults with > 255 threads
6236472 qsub -sync y doesn't remove session directories
6236469 JAPI: Can be made to start two event client threads
6236261 BDB install on NFSv4 share
6234836 Need a means to purge host or hostgroup specific cluster queue
6234371 error message from execd about endpoint is not unique
6233162 global scheduler messages are reported multiple times
6232074 load formula is not working for pe jobs
6231366 deadlock in the qmaster due to qconf -k[s|e]
6230846 execd logs error message, when a tight pe job in "t" state is deleted
6229373 An array pe job can set queues into error state
6229277 qselect uses sge_qstat file
6229253 a parallel array job can kill the qmaster
6228786 Long delay when starting up large pe jobs
6228350 Execd messages file contains incorrectly-formatted lines
6226085 suspend_interval is ignored when enabling jobs due to suspend_thresholds change
6225570 sharetree has a usage leak
6222930 After shadowd takes over there is a long delay before execd connects to new qmaster
6222861 error message "no execd known on host"
6222811 scheduler can get out of sync
6222237 huge CPU and memory overhead when modifying complex attributes
6221244 releasing user hold state through qrls may not require manager privileges
6221231 qsub -sync y return code behaviour broken
6221167 sge_schedd segfaults in case of a restart and a running pe job.
6220060 wrong calendar settings kills the qmaster
6219999 changing of local execd_spool_dir is fault prone
6218430 Problems with load values if execution daemons run in a solaris zone at x86
6215730 qdel failed to delete qrsh (login) job on a Solaris box when Secure Shell is used
6205060 SGE tools segfault when gid can't be looked up
6199256 qconf -[a|A|m|M]stree kills qmaster
6194719 starter_method is ignored with binary jobs that are started without a shell
6186597 qconf error diagnosis broken
6178843 qconf changes to complex doesn't display all the changes made upon exit
5085004 qstat -f -q all.q@HOSTNAME does not resolve hostname
(from 118130-03)
6216020 pending job task deletion may not work
6215580 execd messages file contains errors for tight integrated jobs
6211309 qmaster running out of file descriptors
6211243 The qstat -ext -xml command is broken with N1GE6 Update 2 patch
6205648 error in commlib read/write timeout handling
(from 118130-02)
6201042 qdel "*" produces error logging in qmaster messages file
6201040 Exit 99 jobs are not rescheduled to hosts where they ran before
6201039 qconf -ks gives bad error message if scheduler isn't running
6201038 reduce the impact of qstat on the overall performance
6201033 qmaster might fail if jobs are deleted which have multiple hold states applied
6200013 arch script does not know about /lib64
6199261 a sharetree delete can kill qmon
6196578 backup fails, when...
6195249 QMON Cluster Queue Window: Heading line words does not match into column width
6194729 Subordinate queue thresholds are not spooled with BDB
6194713 Only first subordinate queue will be suspended at qmaster restart
6194625 subordinate queues consume excessive memory
6194002 sgemaster -migrate on qmaster host tries to start second qmaster
6193866 backup/restore does not work under Linux and others..
6193361 Jobs fail in case of NFS execd installation on volumes exported without root write privileges
6193348 qconf -mq does not output the subordinate_list correct
6191366 tightly integrated pe jobs: scheduler doesn't respect usage of pe tasks in sharetree calculation
6190164 too many array tasks are deleted
6189289 a cluster queue can be deleted, even though it is referenced in an other cq
6189286 memory leak in the scheduler with consumables as load thresholds
6185211 Job environments should not include Grid Engine dynamic library path
6185208 qmon and equal job arguments
6185169 qmon returns an error dialog, when editing a calendar
6185136 Job customize shows weird characters for fields, additional fields cannot be added
6184466 scheduler does not look ahead to consider queue calendars state transitions
6184460 qmod -[d|e] cannot handle the following qnames: "[0-9]*"
6183365 qconf -sstree gives a SIGBUS error
6180529 meaningless job error state diagnosis text in qstat -j
6176181 qdel "" kills qmaster
6176177 restoring a backup does not restore the job_scripts dir.
6176115 Show qmaster/execd application status in qping
6174915 qconf has wrong exit status
6174821 segmentation fault when vmemsize limit is reached
6174331 Option "-v VAR" does not fetch from environment
6174326 qconf -sq displays "slots" in the complex_values line
6174301 N1GE6: qsub -js and negative job_share numbers acts strangely/unexpectedly.
5108639 qconf -sstree seg faults with large share trees
5108635 $ARCH required in path for qloadsensor and qidle.
5104789 mail sent by qmaster leaves zombie processes
5104270 Cannot add calendar with \ syntax
5102442 qconf -de <live_exec_host> crashes qmaster
5102320 memory leak in the scheduler, with pe jobs and resource requests
5097732 Need detailed error messages from communication layer
5095907 qacct -l is not working
5094016 o-tickets assigned to departments are ignored
5092487 hard resource requests ignored in parallel jobs
5090162 qmake does not export shell env. vars
5089255 Submit to a queue domain is never scheduled
5089222 scheduling weirdness with wild-card PE's
5086108 wrong message appears when queue instance becomes error state
5085010 qmon customize filter for running jobs does not filter
5075968 Thread enabled commlib coredumps on exit on a 32bit Solaris x86 box
(from 118130-01)
5085392 qstat -j -xml generates no parsable xml output
5084317 Invalid job_id's in reporting file (only l24_amd64)
5083115 Need more verbose diagnosis msg if execd port is already bound
5083102 hostgroup changes do not always take effect.
5082490 qstat -ext -urg omits time info
5081839 qconf -ahgrp fails if no hgrp name is specified
5081822 Deleting a queue instance slots value actually adds it
5081821 qstat XML output typo
5080856 QCONF: qconf -mc segfaults
5080853 DRMAA doesn't reject jobs that never will be dispatchable
5080852 qconf -aq <queue>@<host> crashes qmaster
5080851 qalter/qdel/qmod abort
5080840 problems when qconf -mattr is used in conjunction with host_aliases file
5080839 qconf -mq displays "slots" in the complex_values line
5080836 qhosts outputs NCPU as float
5080833 qconf -mattr dumps core if used incorrectly
5080784 qselect crash
5080779 qconf -de host does not update the host groups
5079572 Resending queue signals broken
5079514 execd shutdown with sgeexecd fails when host aliases are used
5078783 Wallclock time limit in qmon
5077589 schedd and qmaster get out of sync - no scheduling for long time
5077549 qsub -N "@" causes qmaster down
5077165 reprioritize_interval descr in sched_conf(5) needs improvement
5076491 qmaster clients may not reconnect after qmaster outage
5076372 "|" should be able to be used with qsub -N
5076358 It should be possible to use "." and "$" with qsub -N
5075936 qmon's queue filtering doesn't work
5075849 a registering event client can get events before it got its total update
5075451 sched_conf(5) reprioritize_interval should default to 0
5075398 variable syntax : equal sign support
5075346 Sharetree doesn't work correct
5074788 jobs on hold due to -a time cause qmaster/schedd get out of sync
5073218 qconf -aq <queue>@<host> crashes qmaster
5072772 sge_qmaster constantly rewrites spool files of tightly integrated parallel jobs
5072481 Deleted pending job appears in qstat
5072005 drmaa_run_job() may change the current directory
5071987 Qmaster requires a local conf in order to start.
5071918 qmod -e '@<host>' causes segmentation fault in qmaster
5071914 scheduler ignores queue seqno for queue sorting
5071539 qping doesn't support host_aliases file
5071525 qalter abort
5071522 Startup of qmaster changes act_qmaster to `hostname`
5071502 calendars broken
5071498 projects not available after sge_qmaster restart
5063987 qmaster cannot bind port below 1024 on Linux
5063316 PE job submit error, when qmaster is busy
5063311 high memory usage of schedd and qmaster (schedd_job_info)
Patch Installation Instructions:
--------------------------------
For Solaris 7, 8, 9 and 10 releases, refer to the man pages for instructions
on using 'patchadd' and 'patchrm' scripts provided with Solaris. Any other
special or non-generic installation instructions should be described below
as special instructions. The following example installs a patch to a
standalone machine:
example# patchadd /var/spool/patch/104945-02
The following example removes a patch from a standalone system:
example# patchrm 104945-02
For additional examples please see the appropriate man pages.
See the "Special Install Instructions" section below before installing this
patch.
Patch requirements and patch matrix for N1 Grid Engine 6 packages
-----------------------------------------------------------------
The patches below update a N1 Grid Engine 6 distribution to N1 Grid
Engine 6 Update 13 (N1GE 6.0u13). The "-help" output of most commands will
print a version string "N1GE 6.0u13" after applying the patch.
All packages of a N1 Grid Engine 6 distribution must have the same patch
level (except for GEMM). Please refer to the patch matrix below, which
updates the distribution to the most recent patch level.
It is neither supported nor possible to mix different patch levels of
binaries and the "common" package in a single N1 Grid Engine cluster.
1. Patches for packages in Sun pkgadd format
--------------------------------------------
Package name*  OS*                     Architecture*  Patch-Id
-----------------------------------------------------------------
SUNWsgee       Solaris, Sparc, 32bit   sol-sparc      124519-05
SUNWsgeex      Solaris, Sparc, 64bit   sol-sparc64    124520-05
SUNWsgeei      Solaris x86             sol-x86        124521-05
SUNWsgeeax     Solaris, x64 (AMD64)    sol-amd64      124522-05
SUNWsgeec      all                     common         118132-13
SUNWsgeea      all                     arco           118133-10
SUNWsgeed      all                     doc            119846-02
*Package Name = see pkginfo(1)
*OS           = Operating system
*Architecture = N1 Grid Engine binary architecture string or
                "common" = architecture independent packages
                "arco"   = Accounting and Reporting console
                "doc"    = PDF documentation
                "gemm"   = Grid Engine Management Module for Sun
                           Control Station (SCS) (tar.gz only)
2. Patches for packages in tar.gz format
----------------------------------------
OS*                          Architecture   Patch-Id
-----------------------------------------------------
Solaris, Sparc, 32bit        sol-sparc      124523-05
Solaris, Sparc, 64bit        sol-sparc64    124524-05
Solaris, x86                 sol-x86        124525-05
Solaris, x64 (AMD64)         sol-amd64      124526-05
Linux kernel2.4/2.6, x86     lx24-x86       124527-05
Linux kernel2.4/2.6, AMD64   lx24-amd64     124528-05
IBM AIX 4.3                  aix43          124529-05
IBM AIX 5.1                  aix51          124530-05
Apple MAC OS/X               darwin         124531-05
HP HP-UX 11                  hp11           124532-05
SGI Irix 6.5                 irix65         124533-05
Microsoft Windows            win32-x86      124534-05
all                          common         118092-13
all                          arco           118093-10
all                          doc            119861-02
Solaris, Linux               gemm           120435-04
Special Install Instructions:
-----------------------------
Content
-------
Patch Installation
   Stopping the N1 Grid Engine cluster to prevent start of new jobs
   Shutting down the N1 Grid Engine daemons
   Installing the patch and restarting the software
New functionality delivered with N1GE 6.0 Update 12
   New reporting parameter for consumable resources control
New functionality delivered with N1GE 6.0 Update 11
   New "qsub -terse" option
New functionality delivered with N1GE 6.0 Update 8_1
   New service for starting Windows GUI Applications
New functionality delivered with N1GE 6.0 Update 7
   Reworked "qstat -xml" output
   Reworked PE range matching algorithm in the scheduler
   New monitoring feature in qmaster
   New parameter for specialized job deletion
   New reporting parameter to control accounting file flush time
New functionality delivered with N1GE 6.0 Update 6
   Berkeley DB database tools are included in the distribution
New functionality delivered with N1GE 6.0 Update 4
   New "qconf -purge" option
   Berkeley DB spooling on NFSv4 under Solaris 10 supported
   Execd installation in Solaris 10 zones supported
   Faster execution daemon reconnect in CSP mode
New functionality delivered with N1GE 6.0 Update 2
   Avoid setting of LD_LIBRARY_PATH; inherited job environment
   DRMAA Java[TM] language binding delivered with this patch
   New qstat options to optimize memory overhead and speed of qstat
   Tuning parameter for sharetree spooling
Patch Installation
------------------
NOTE: This patch requires that you update your Berkeley DB database files
if you are upgrading from N1GE 6.0u1 or 6.0. Please read the full
notes when applying this patch.
These installation instructions assume that you are running a homogeneous
N1 Grid Engine cluster (called "the software") where all hosts share the
same directory for the binaries. If you are running the software in a
heterogeneous environment (mix of different binary architectures), you
need to apply the patch installation for all binary architectures as well
as the "common" and "arco" packages. See the patch matrix above for
details about the available patches.
If you installed the software on local filesystems, you need to install
all relevant patches on all hosts where you installed the software
locally.
By default, there should be no running jobs when the patch is installed.
There may be pending batch jobs, but no pending interactive jobs (qrsh,
qmake, qsh, qtcsh).
It is possible to install the patch with running batch jobs. To avoid a
failure of the active 'sge_shepherd' binary, it is necessary to move the
old shepherd binary (and copy it back prior to the installation of the
patch).
You cannot install the patch with running interactive jobs, 'qmake' jobs
or with running parallel jobs which use the tight integration support
(control_slaves=true is set in the PE configuration).
A. Stopping the N1 Grid Engine cluster to prevent start of new jobs
-------------------------------------------------------------------
Disable all queues so that no new jobs are started:
# qmod -d '*'
Optional (only needed if there are running jobs which should continue to
run when the patch is installed):
# cd $SGE_ROOT/bin
# mv <arch>/sge_shepherd <arch>/sge_shepherd.sge60
It is important that the binary is moved with the "mv" command. It should
not be copied because this could cause the crash of an active shepherd
process which is currently running a job when the patch is installed.
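When several architecture directories are installed, the optional rename
step can be scripted. The following is only a sketch: the function name
park_shepherds is hypothetical, and it assumes a POSIX shell and that every
directory directly below the given bin directory is an architecture
directory.

```shell
# Rename the shepherd binary in every architecture directory below $1.
# "mv" is used on purpose: the file is renamed, never copied.
park_shepherds() {
    bin_dir=$1
    for shepherd in "$bin_dir"/*/sge_shepherd; do
        # Skip architectures without a shepherd (or an unmatched glob).
        [ -f "$shepherd" ] || continue
        mv "$shepherd" "$shepherd.sge60"
    done
}
```

A call such as park_shepherds "$SGE_ROOT/bin" would then cover all installed
architectures in one pass.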
B. Shutting down the N1 Grid Engine daemons
-------------------------------------------
You need to shut down (and restart) the qmaster and scheduler daemon and
all running execution daemons.
Shut down all your execution hosts. Log in to all your execution hosts and
stop the execution daemons:
# /etc/init.d/sgeexecd softstop
Then log in to your qmaster machine and stop qmaster and scheduler:
# /etc/init.d/sgemaster stop
Now verify with the 'ps' command that all N1 Grid Engine daemons on all
hosts are stopped. If you decided to rename the 'sge_shepherd' binary so
that running jobs can continue to run during the patch installation, you
must not kill the 'sge_shepherd' binary (process).
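The 'ps' check can be reduced to a small filter. This is a sketch under
assumptions: the function name leftover_daemons is hypothetical, and a POSIX
shell is assumed. The daemon names are taken from the file list of this
patch. 'sge_shepherd' is deliberately not matched, because a renamed
shepherd may still be running jobs.

```shell
# Read command names (one per line) on stdin and print those that are
# N1 Grid Engine daemons which should be stopped before patching.
leftover_daemons() {
    while read -r cmd; do
        # Compare only the file name part, so full paths also match.
        case ${cmd##*/} in
            sge_qmaster|sge_schedd|sge_execd|sge_shadowd) echo "$cmd" ;;
        esac
    done
}
```

On a host whose 'ps' supports the POSIX options, "ps -e -o comm= |
leftover_daemons" should print nothing once all daemons are stopped.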
C. Installing the patch and restarting the software
---------------------------------------------------
Now install the patch with "patchadd" or by unpacking the 'tar.gz' files
included in this patch, as outlined above.
Berkeley DB database update needed
----------------------------------
NOTE: This update is *not* needed if you already installed N1GE 6.0u3
or higher. The update is only needed if you are upgrading from
N1GE 6.0u2 or earlier.
After installing this patch, and before restarting your cluster you
need to update your Berkeley DB (BDB) database in the following cases:
- you chose the BDB spooling option (not needed for classic
spooling) either locally or with the BDB RPC option, and you are
upgrading your cluster from N1 Grid Engine 6.0 or 6.0u1 to N1 Grid
Engine 6.0u2 or higher
1. For safety reasons, please make a full backup of your existing
configuration. To perform a backup, use this command:
% inst_sge -bup
2. Upgrade your BDB database. This is done as follows:
% inst_sge -updatedb
Restarting the software
-----------------------
If you have configured ARCo, you must first complete steps 1 and 2
from the section "Stopping the Accounting and Reporting Console" from
the ARCo patch before restarting the qmaster.
Please log in to your qmaster machine and execution hosts and enter:
# /etc/init.d/sgemaster
# /etc/init.d/sgeexecd
After restarting the software, you may again enable your queues:
# qmod -e '*'
If you renamed the shepherd binary, you may safely delete the old
binary when all jobs which were running prior to the patch installation
have finished.
New functionality delivered with N1GE 6.0 Update 12
---------------------------------------------------
1. New reporting parameter for consumable resources control
-----------------------------------------------------------
The new reporting parameter "log_consumables" controls writing of
consumable resources to the reporting file. Default (log_consumables=true)
is to write information about all consumable resources (their current usage
and their capacity) to the reporting file, whenever a consumable resource
changes either in definition, or in capacity, or when the usage of a
consumable resource changes. When log_consumables is set to false, only
those variables that are configured in the report_variables of the exec
host configuration are written to the reporting file; see host_conf(5)
for further information about report_variables.
The default (log_consumables=true) has been chosen to be backward compatible
with 6.1u2, but it is recommended to switch to log_consumables=false and add
the required consumables to the report_variables in the global host
(qconf -me global).
New functionality delivered with N1GE 6.0 Update 11
---------------------------------------------------
1. New "qsub -terse" option
---------------------------
The new qsub option "-terse" outputs only the job id (and task id for array
jobs) while submitting a job using the qsub interface. This is helpful for
those who need just the job id returned by qsub.
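Scripts that only need the numeric job id can post-process the terse output.
The helper below is a sketch: the function name job_id_from_terse is
hypothetical, a POSIX shell is assumed, and it is an assumption that an
array job is reported as the job id followed by "." and the task range.

```shell
# Print the job id part of a "qsub -terse" result.
# Everything from the first "." on (assumed to be the task part of an
# array job) is dropped; a plain job id is printed unchanged.
job_id_from_terse() {
    printf '%s\n' "${1%%.*}"
}

# Intended use (requires a running cluster, so it is left as a comment):
#   jobid=$(job_id_from_terse "$(qsub -terse $SGE_ROOT/examples/jobs/sleeper.sh 60)")
```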
New functionality delivered with N1GE 6.0 Update 8_1
----------------------------------------------------
1. New service for starting Windows GUI Applications
---------------------------------------------------
A new Windows Service "N1 Grid Engine Helper Service" was added to the
distribution. This new service allows Windows jobs to display a GUI on
the visible desktop of the execution host.
The visible desktop is either the desktop of the user currently logged in
on the execution host or the desktop of the next user who will log in. It is
not the login screen.
The Helper Service is an independent component loosely coupled with the
execution daemon. The startup of the Helper Service is integrated into the
"Services" dialog in the Windows control panel. Only one Helper Service can
be installed per host, and only one execution daemon per Helper Service is
supported.
To submit a job that requires a Windows GUI, set the job environment
variable "SGE_GUI_MODE" to "TRUE", "true" or "1", e.g.
# qsub -v SGE_GUI_MODE=TRUE $SGE_ROOT/examples/jobs/sleeper.sh 60
The current release does not distinguish at scheduling time between GUI and
non-GUI jobs. Future releases of N1GE will address this with a built-in
complex variable "display_win_gui". The definition will be:
#name            shortcut  type  relop  requestable  consumable  default  urgency
#--------------------------------------------------------------------------------
display_win_gui  dwg       BOOL  ==     YES          NO          0        0
To be compatible with future releases of N1GE it's recommended to follow this
complex definition.
The execution daemon communicates with the Helper Service over a TCP/IP
connection using a dynamic port. A firewall might block local connections
and needs to be configured to allow local connections. The
Windows builtin firewall allows local connections by default.
The Helper Service always needs the password of the job user, no matter
if the job user is a local or a domain user. See sgepasswd(1) and sgepasswd(5)
for information on how to register the user's password.
Native Windows processes cannot be suspended. Therefore it is not supported
to suspend Windows GUI jobs. Though qmod -s/-us can be executed it has no
effect on the running GUI process.
New functionality delivered with N1GE 6.0 Update 7
--------------------------------------------------
1. Reworked "qstat -xml" output
-------------------------------
The schema for "qstat -xml" and the "qstat -xml" output have been
reworked to ensure consistency between them and easy parsing of them via
JAXB. The most noticeable change is the date output, which now follows the
XML datetime format.
2. Reworked PE range matching algorithm in the scheduler
--------------------------------------------------------
The PE range matching algorithm is now adaptive and learns from past
decisions. This leads to much faster scheduling decisions for jobs that
request PE ranges. The behavior is controlled by a new scheduling
configuration parameter, SELECT_PE_RANGE_ALG, which also allows the old
behavior to be restored.
See sge_conf(5) for more information.
3. New monitoring feature in qmaster
------------------------------------
The monitoring feature provides detailed statistics on what the qmaster
threads are doing and how busy they are. The statistics can be accessed
via "qping -f" or from the qmaster messages file. The feature is controlled
by two qmaster configuration parameters:
   MONITOR_TIME         specifies the time interval for the statistics
   LOG_MONITOR_MESSAGE  enables/disables the logging of the monitoring
                        messages into the qmaster messages file.
See sge_conf(5) for more information.
4. New parameter for specialized job deletion
---------------------------------------------
A new "execd_param" (configured in the global cluster configuration):
ENABLE_ADDGRP_KILL=true
can be configured to enable additional code within the execution host to
delete jobs. If this parameter is set, then the supplementary group ids
are used to identify all processes which are to be terminated when a job
should be deleted. It only has an effect on the following architectures:
sol*
lx*
osf4
tru64
See sge_conf(5) under "gid_range" for more information.
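The architecture list above maps directly onto a shell pattern match. The
sketch below assumes a POSIX shell; the function name
addgrp_kill_has_effect is hypothetical, and the architecture strings used in
the comments are those from the patch matrix.

```shell
# Return 0 if ENABLE_ADDGRP_KILL has an effect on the given N1 Grid
# Engine architecture string (for example sol-sparc64 or lx24-amd64),
# and 1 otherwise.
addgrp_kill_has_effect() {
    case $1 in
        sol*|lx*|osf4|tru64) return 0 ;;
        *)                   return 1 ;;
    esac
}
```

On an installed host the argument could be taken from the output of the
$SGE_ROOT/util/arch script.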
5. New reporting parameter to control accounting file flush time
----------------------------------------------------------------
A new reporting parameter, "accounting_flush_time", controls the flush
period for the accounting file. Previously, both the accounting and
reporting files were flushed at the same interval. Now they can be set
independently. Additionally, buffering of the accounting file can now be
disabled, allowing accounting data to be written to the accounting file
as soon as it becomes available.
See sge_conf(5) for more information.
New functionality delivered with N1GE 6.0 Update 6
--------------------------------------------------
1. Berkeley DB database tools are included in the distribution
--------------------------------------------------------------
All Berkeley DB database tools are now part of the N1 Grid Engine
distribution (not for Microsoft Windows platform)
db_archive
db_checkpoint
db_deadlock
db_dump
db_load
db_printlog
db_recover
db_stat
db_upgrade
db_verify
The HTML documentation for these tools is part of the "common" patch and
can be found in:
<sge_root>/doc/bdbdocs
New functionality delivered with N1GE 6.0 Update 4
--------------------------------------------------
1. New "qconf -purge" option
----------------------------
"qconf -purge" deletes all hosts or hostgroups settings from a cluster
queue. This facilitates the uninstallation of hosts or hostgroups. See
qconf(1) for a description of how to use this option.
2. Berkeley DB spooling on NFSv4 under Solaris 10 supported
-----------------------------------------------------------
The Berkeley DB database can now be installed on an NFSv4 mounted
filesystem on Solaris 10.
For performance reasons it is recommended to use NFSv4 BDB spooling only
when the NFSv4 mount provides an excellent high speed connection to the
file server.
3. Execd installation in Solaris 10 zones supported
---------------------------------------------------
The execution daemon installation in Solaris 10 zones is supported. If
execution daemons are installed in the global zone and in local zones, you
need to ensure that the additional group id ranges (-> "gid_range" in
cluster configuration) of the global zone and the local zones do not
overlap. Local zones may use the same additional group id range on the
same host.
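The overlap rule can be checked before the ranges are configured. This is a
sketch only: the function name ranges_overlap is hypothetical, a POSIX shell
is assumed, and each range is assumed to be a single "low-high" pair of
integers.

```shell
# Return 0 if two "low-high" id ranges share at least one id,
# and 1 if they are disjoint.
ranges_overlap() {
    a_lo=${1%-*}; a_hi=${1#*-}
    b_lo=${2%-*}; b_hi=${2#*-}
    # Two closed intervals overlap when each starts before the other ends.
    [ "$a_lo" -le "$b_hi" ] && [ "$b_lo" -le "$a_hi" ]
}
```

For example, a global zone range of 20000-20100 and a local zone range of
20101-20200 are disjoint, while 20000-20100 and 20050-20200 overlap.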
4. Faster execution daemon reconnect in CSP mode
------------------------------------------------
The Certificate Security Protocol (CSP) has been reworked and now is fully
integrated in the communication library layer. This allows a faster
reconnect of execution daemons after qmaster or execution daemon restart.
New functionality delivered with N1GE 6.0 Update 2
--------------------------------------------------
1. Avoid setting of LD_LIBRARY_PATH; inherited job environment
--------------------------------------------------------------
There are two new "execd_params" (defined in the global or local cluster
configuration) which control the environment inherited by a job:
SET_LIB_PATH
INHERIT_ENV
By default, SET_LIB_PATH is false and INHERIT_ENV is true. If
SET_LIB_PATH is true and INHERIT_ENV is true, each job will inherit the
environment of the shell that started the execd, with the N1GE lib
directory prepended to the lib path. If SET_LIB_PATH is true and
INHERIT_ENV is false, the environment of the shell that started the execd
will not be inherited by jobs, and the lib path will contain only the
N1GE lib directory. If SET_LIB_PATH is false and INHERIT_ENV is true,
each job will inherit the environment of the shell that started the execd
with no additional changes to the lib path. If SET_LIB_PATH is false and
INHERIT_ENV is false, the environment of the shell that started the execd
will not be inherited by jobs, and the lib path will be empty.
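The four combinations can be summarized as a lookup. The sketch below only
restates the text above; the function name describe_job_env is hypothetical,
and a POSIX shell is assumed. Arguments that are left out fall back to the
defaults (SET_LIB_PATH=false, INHERIT_ENV=true).

```shell
# Print the resulting job environment for SET_LIB_PATH ($1) and
# INHERIT_ENV ($2). Returns 1 for values other than true/false.
describe_job_env() {
    set_lib_path=${1:-false}
    inherit_env=${2:-true}
    case "$set_lib_path/$inherit_env" in
        true/true)   echo "execd environment inherited; N1GE lib directory prepended to lib path" ;;
        true/false)  echo "execd environment not inherited; lib path contains only the N1GE lib directory" ;;
        false/true)  echo "execd environment inherited; lib path unchanged" ;;
        false/false) echo "execd environment not inherited; lib path empty" ;;
        *)           return 1 ;;
    esac
}
```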
Environment variables which are normally overwritten by the shepherd,
such as PATH or LOGNAME, are unaffected by these new parameters.
2. DRMAA Java[TM] language binding delivered with this patch
------------------------------------------------------------
The DRMAA Java language binding is now available. The DRMAA Java language
binding library and documentation is contained in the patch for the
"common" package.
3. New qstat options to optimize memory overhead and speed of qstat
-------------------------------------------------------------------
The qstat client command has been enhanced to reduce the overall amount
of memory which is requested from the qmaster. To enable these changes it
is necessary to change the qstat default behavior. This is possible by
defining a cluster-global or user-specific sge_qstat file. More
information can be found in sge_qstat(5) manual page. In addition two new
qstat options ("-u" and "-s") have been introduced to be used with the
sge_qstat default file. Find more information in qstat(1).
4. Tuning parameter for sharetree spooling
------------------------------------------
A new "qmaster_param" (configured in the global cluster configuration):
STREE_SPOOL_INTERVAL=<time>
can be configured to control the interval for how often the sharetree
usage is spooled. The interval can be set to any time in the following
formats:
HH:MM:SS or
<int>
E.g.:
STREE_SPOOL_INTERVAL=0:05:00
STREE_SPOOL_INTERVAL=300
This parameter is a tuning parameter only. It has the biggest effect on
systems that combine classic spooling, large sharetrees and a slow
filesystem.
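Both interval formats describe the same quantity, which the following sketch
makes explicit by converting either form to seconds. The function name
interval_to_seconds is hypothetical; a POSIX shell and awk are assumed, and
it is an assumption that the plain <int> form is a number of seconds.

```shell
# Convert an interval given as HH:MM:SS or as a plain integer
# into seconds. Any other number of ":"-separated fields fails.
interval_to_seconds() {
    printf '%s\n' "$1" | awk -F: '
        NF == 3 { print $1 * 3600 + $2 * 60 + $3; next }
        NF == 1 { print $1 + 0; next }
        { exit 1 }'
}
```

With this helper, the two example settings above (0:05:00 and 300) yield the
same value.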
README -- Last modified date: Tuesday, August 11, 2015