Release notes for the mixed MPI and OpenMP extension of the MOLSCAT v14 package.

Table of contents:

Extensions to the namelist &INPUT

Mesh-based coupling storage and corresponding namelist &MESH

The namelist &MESH is read after basis set and potential namelists if IRWU has been defined in the namelist &INPUT. It defines the distance mesh for the precalculation of the potential and the W coupling matrix. The data is pre-interpolated over the mesh using cubic splines; the W values and the corresponding second order derivative values are kept to permit a fast interpolation.

This mesh-based algorithm is presently available for ITYPE=3,4 only. It is automatically disabled for other ITYPE values.

The algorithm switches automatically between in-core and out-of-core mode.

The in-core mode requires NRDIST*N*(N+1) storage, where NRDIST is the number of distances in the mesh. This is to be compared to the conventional VL storage which requires MXLAM*(N*(N+1)/2, or less if packed VL storage is enabled. Despite its memory requirements, the in-core mode is very efficient, because i) the in-core data is produced by a large DGEMM, and ii) because the amount of data read at each integration step is limited to 2*N*(N+1), reducing significantly the memory bandwidth requirements in OpenMP operation.

The out-of-core mode dumps all the NRDIST*N*(N+1) storage to disk, and reads back 2*N*(N+1) data per integration step (corresponding to the W matrices and corresponding second order derivative values for the current distance interval). Apart some I/O buffer space, the VL storage is completely suppressed. This should permit to run very large problems on machines with limited memory, including GPUs. As the computational cost of each integration step scales as N3 while the I/Os scale as N2, the I/O overhead should not be a significant limitation for such very large problems.

A less I/O demanding intermediate mode could be easily implemented using an additional storage of 2*N*(N+1) real*8 words corresponding to the W matrices and its second order derivative for the current distance interval. This storage would permit to reduce the reading by a factor of two or more when integration steps are smaller or comparable to the RDIST distance intervals because i) no reading would be needed for a subsequent integration step in the same RDIST interval, and ii) the reading of N*(N+1) data only would be required to refresh the W in-core ansatz for a subsequent integration step in an adjacent RDIST interval. However if the MOLSCAT storage is set sufficiently below the available computer memory, the disk caching provided implicitely by modern system kernels is expected to provide most benefits of this intermediate mode.

Note that the W array is generally not sparse, because it is a weighted sum of the coupling matrices corresponding to all the expansion terms of the potential. While these individual coupling matrices are sparse, their zeros are not expected to match. Consequently there is no benefit for combining the mesh-based and the packed approaches, and VLTHRESH is ignored.

Namelist &MESH

Additional variables for checking or debugging purposes

Notes for multi-threading

Support for portability

Run-time memory allocation

On some platforms this trick permits to specify the required workspace at runtime. This is especially useful on mainframes to prevent maintaining different executables adapted to various batch queue. The corresponding value (in real*8 words) is read from the command line, e.g.
      $ molscat.x 10000000 

The code assumes that the offset (in real*8 words) between the allocated vector and X vector in /MEMORY/ fits in an integer.

To enable the run-time workspace allocation, recompile main.f with PARAMETER (MXDIM=10) and make sure that the call to alloc_memory is uncommented in driver.f.

Modified files: set6c.f, chkstr.f, driver.f. Added routines: alloc_memory and alloc_vm. See comments in main.f and driver.f for more details.

Utility routines

A large set of utility routines is provided to suit most architectures.
        gclock_f95.f        provides elapsed cpu time in seconds
        gclock_mpi.f        wallclock MPI variant
        gclock_dummy.f      dummy version

        gelapsed.f          provides elapsed wall time in seconds
        timing_.c           C-coded utility routine called by gelapsed.f
        gelapsed_mpi.f      wallclock MPI portable variant
        gelapsed_dummy.f    dummy version

        gdate_f95.f         provides date or time as a character string
        gdate_dummy.f       corresponding dummy version

        flshfo.f            flush output buffer, calls flush library routine
        flshfo_aix.f        same for machines running AIX
        flshfo_dum.f        corresponding dummy version

        fhost_.c            returns the host name (C routine)
        fhost_dummy.f       corresponding dummy version

        alloc_memory.f      enables dynamic allocation of X array, calls
                            alloc_vm. Call be commented out in driver.f. 
        alloc_vm_f90.f      portable alloc_vm for most f90 compilers
        alloc_vm_f77_32.f   f77 interface to get_vm_.c (32-bit architecture)
        alloc_vm_f77_64.f   f77 interface to get_vm_.c (64-bit architecture)
        get_vm_.c           corresponding allocator routine written in C
        alloc_vm_gfortran.f substitute to alloc_vm_f90.f if intrinsic LOC
                            is not available
        loc8_.c             corresponding substitute to LOC routine

Also, the original utility routine PARITY has been renamed FARITY to avoid a name conflict in some system libraries.

Makefile templates for various platforms

Presently supported platforms are:

The distribution provides a flexible Makefile template with extensive comments. The Makefile includes compilation directives provided by a variety of compil*.'arch' files corresponding to various compilers, BLAS/Lapack libraries and machine architectures. These files also select pmpnone.o for serial or OpenMP runs, or pmpnew.o for MPI runs. They also provide a guess for the fastest matrix inversion and diagonalisation routines for your architecture (see discussion below).

The naming convention is the following:

Makefile usage:

At run time the code prints a summary of the build options (MPI and/or OpenMP, recoupling algebra routines) together with the number of threads and MPI tasks.

The source code has been splitted to individual routines to facilitate the development and reduce compilation time. The script tags.sh generates the appropriate TAGS file to navigate the source with the emacs editor (see comments in tags.sh).

In addition to the Makefile, the script compil.sh can be used to build the following utility programs:

Linear algebra routines

MOLSCAT heavily relies upon linear algebra library routines from the Lapack and BLAS libraries. You are thus strongly encouraged to install the fastest versions of those libraries for your platform.

As a rule of thumb, the MKL library is best for Intel processors, and the ACML library is best for AMD processors. These libraries provide both Lapack and BLAS routines. On Power-based mainframes, you should use the latest Lapack library (currently 3.1.1) and invoke IBM's ESSL library for providing the underlying BLAS. Beware that some libraries (ACML, ESSL, GotoBLAS...) are provided in distinct versions for serial and multi-threaded applications, thus make sure you link to the proper one.

In addition the present distribution provides two choices for matrix inversion and diagonalisation routines. The optimal routines should be selected in the compil* file enabled in your Makefile. The best choice depends on your hardware and on your local Lapack and BLAS libraries, and is also generally different for serial or multi-threaded runs. Beware that the performance impact may be quite important.

The compil* files provided in the distribution provide a selection based upon particular hardware and libraries. Best is to experiment under your local conditions with a problem dimension representative of your future MOLSCAT runs (to be adjusted in the include file nx_inc.f). Sample programs are provided in the sub-directory linear_bench/ to exercise above routines, and also optionally the BLAS3 routines dgemm and dsyr2k. The variable nn sets the number of times the library routine is called, you might adjust the value for a realistic timing (don't go below several seconds in elapsed time). The scripts run_*.sh provide examples to build and run these programs. The file README.txt provides some comments and timings for the platforms in use from Grenoble, which underlies the selection of libraries and routines implemented in the distributed compil* files.

Known bugs and problems

The MOLSCAT code is expected to be very stable. However it relies heavily upon the mathematical Lapack and BLAS libraries, and there are known issues concerning these libraries.

27 May 2008, PV/LAOG.

Cumulative update to distribution of 12 May 2008.

 27 May 2008, PV/LAOG.

    The tar file should be unpacked under directory distrib_20080512/
    Then issue
      $ rm compil*.dsy* compil*.dge*
      $ rm f02abf.* f02aaf.* f02abf_ori.* f02aaf_ori.*
      $ rm cpl3*.o cpl4*.o init_vl_mesh.o spropn.o wavvec.o 
      $ make

    - The sub-directory linear_bench/ provides various files to
      exercise the performance of the local Lapack and BLAS libraries
      and to select the most appropriate inversion and diagonalisation
      routines, as discussed above.

    - The Makefile and the compil* files have been modified to permit
      the selection of dsyevr- or dsyevx-based routines f02abf and
      f02aaf. The compil* names have been also shortened.

    - Minor changes in cpl3*.f cpl4*.f init_vl_mesh.f spropn.f wavvec.f
      to comply with some compilers.

    Affected files: 
      Makefile compil* cpl3*.f cpl4*.f init_vl_mesh.f spropn.f wavvec.f
    New files:      
      f02abf_dsyevr.f f02aaf_dsyevr.f f02abf_dsyevx.f f02aaf_dsyevx.f 
      (renamed from f02abf.f, f02aaf.f, f02abf_ori.f, f02aaf_ori.f).
    New sub-directory: linear_bench/


 19 May 2008, PV/LAOG.

    Fixed files compil_omp.aix.xlf64.dsy and compil_omp_mpi.aix.xlf64.dsy
    to invoke the esslsmp threaded library instead of the serial essl.


14 May 2008, PV/LAOG and George C. McBane.

Update to distribution of 12 May 2008.

 14 May 2008, PV/LAOG and George C. McBane.

    The tar file should be unpacked under directory distrib_20080512/
    Then issue
      $ rm wavvec.o
      $ make

    This update provides OpenMP support in WAVVEC for ITYPE = 2 and 7 
    (contributed by George C. McBane).

    The scaling is reasonable, though limited by memory bandwidth in
    some cases; the P() array is not accessed in a cache-friendly way.
    My test case (N=504,NPOTL = 25, MXLAM = 114114, on 1.3 GHz Power4)
    gave the following:

    threads        WAVVEC time in DAPROP
      1                  16.0 s
      2                   8.1
      4                   4.5
      8                   3.0


12 May 2008, PV/LAOG and George C. McBane.

New distribution, unpacks under directory distrib_20080512/

 12 May 2008, PV/LAOG. Mesh-based algorithm for the couplings:

    - Rewritten and the mesh-based algorithm in-core and out-of-core
      storage of the coupling elements (W matrix).

    - Updated accordingly the utility storage.x to estimate workspace
      required for MOLSCAT calculations.

    Mostly affected files: driver.f, init_vl_mesh.f, init_pv2.f,
    cpl3_pv2.f, cpl4_pv2.f, wirput.f, wavwir.f, wavmat.f, storage.f.


5 May 2008, PV/LAOG and George C. McBane.

New distribution, unpacks under directory distrib_20080505/

 5 May 2008, PV/LAOG. Refurbished source code:

    - Fixed a missing private variable in daprop.f
    - Added DEFAULT(NONE) to all OMP PARALLEL constructs
    - Established locks for warning messages as implemented in couple.f
    - Performed various cosmetic updates of source code and comments.
    - Fixed the proper ordering of the MKL library components in the
      compil*.'arch' files. 
    - Merged files smerge.f and sdata.fi with the update posted
      by George C. McBane on his repository, April 25, 2008.


 2 May 2008, PV/LAOG. 

    Introduced a sparsity-aware algorithm for packing the VL array.
    The CPU penalty is very low. Actually the new code is faster in 
    some situations because the memory is accessed in a very regular 
    manner, and because the memory bandwidth is reduced.
    The logic for packing/unpacking is also OpenMP aware. 
      - VL data below VLTHRESH in absolute value is discarded;
      - VL data above VLTHRESH is packed in a very compact form
        requiring a single integer index per real*8 value.
      ( See comments in sparsity_inc.f and chunk.f for more details. )
    The new algorithm is presently available for ITYPE=3,4 and is
    selected by defining a VLTHRESH value in namelist &INPUT.
    Selecting a low VLTHRESH value (typically 1.0d-12) should not
    affect the numerical results.

    The fraction of VL couplings above 1.0d-12 is now printed out 
    for PRINT.ge.3 and typically lies in the range 0.12 to 0.3. 
    The size of the VL array with the conventional logic is 
      ~ 4*MXLAM*N^2/1024^3 (in GB)
    The size of the packed VL array with the new logic is 
      ~ fraction*1.5 * size (if NIPR=2, default integer*4)
      ~ fraction*2.0 * size (if NIPR=1, default integer*8)
    The memory saving is thus larger for NIPR=2, reducing the
    conventional storage by a factor 2 to 5. 

    Affected routines: driver.f, couple.f, wavmat.f
    New routines: cpl3_pv3.f, cpl4_pv3.f, wavvec_pv3.f, chunk.f
    New include file: sparsity_inc.f

    This development follows the suggestion by Millard Alexander to
    exploit the sparsity of the coupling arrays, as it is already done
    within the HIBRIDON package.


 2 May 2008, PV/LAOG. 

    Provided the utility storage.x to estimate the workspace required 
    for MOLSCAT calculations. 
      - compile and link storage.x with the script compil.sh.
      - storage.x reads the molscat.inp file, asks the MXLAM value for
        your favorite POTENL routine, and estimates the required
        storage for each JTOT, assuming a conventional storage for VL
        array.


 30 April 2008, George C. McBane.

 a) I have completed an OpenMP-parallelized COUPLE for ITYPE=1, 2, and
    7.  It's probably important only for ITYPE 7; for the other two,
    usually the COUPLE times are small anyway.

    Some changes in logic even for serial compilation: 
      1) scratch storage for 3j and 6j symbols allocated in X() at
         beginning rather than separate allocations for each potential
         term
      2) ITYPE=1 zeros out VL at start, then updates only nonzero
         elements
      3) Redundant assignments of zero IV(I) values removed in
         ITYPE=2,7
    These changes should not affect numerical results.

    My test times on an ITYPE=7 problem with N=504, NPOTL = 25, 
    MXLAM = 114114 give the following:
 
    threads/CPUs      time in COUPLE
        1                 81.8 s
        2                 41.0
        4                 20.6
        8                 10.3

    so for this MXLAM the scaling is nearly perfect as would be
    expected.  

 b) In PMP, I had modified IVCHK so that it would skip the duplicate
    check if MXLAM was >1000.  That modification makes just as much
    sense for serial or SMP code, so I suggest removing the test
    against LL_PMP in that subroutine.  A better solution, of course,
    would be to write a better IVCHK that uses O(M log M) operations
    rather than O(M*M) operations, where M=MXLAM.  Such a routine
    could be based on any good sorting algorithm, but I have not taken
    time to write one.


28 April 2008, George C. McBane and PV/LAOG.

Cumulative update to distribution of 07 April 2008.

    The tar file should be unpacked under directory distrib_20080407/
    Then issue
      $ rm asrot.o cpl4.o dmsym.o ipasym.o set4.o set6.o
      $ rm pmp*.o driver.o gelapsed*.o gclock*.o init_build.o
      $ make

    Implements a couple of suggestions from George C. McBane after
    running the code at San Diego Supercomputer center:

    - Added a call to flshfo(6) at the end of the DIE subroutine, to
      ensure that any error messages actually get written to the log
      file before the process terminates. This may be required by some
      MPI implementations.

    - Added a variant for GCLOCK and GELAPSED timing routines for MPI
      calculations. The new routines invoke MPI_Wtime() and provide
      wallclock timing information which may be more relevant then cpu
      timing information on MPI supercomputers. In addition, these
      routines are fully portable under an MPI environment.  This
      alternative required some adjustments in pmpdyn.f, pmpnew.f and
      driver.f to postpone the close out of MPI after all timing
      information has been gathered.

      To use these routines, put gclock_mpi.o in place of gclock_f95.o
      and gelapsed_mpi.o instead of gelapsed.o and timing_.o in the
      UTIL macro in the compil_omp_mpi and compil_mpi makefile
      sections.

    - Reformulated a warning message in init_build.f concerning the
      CPU timing statistics for threads.
 

17 April 2008, PV/LAOG and AS/Meudon.

Cumulative update to distribution of 07 April 2008.

    The tar file should be unpacked under directory distrib_20080407/
    Then issue
      $ rm asrot.o cpl4.o dmsym.o ipasym.o set4.o set6.o
      $ make

    - Provides a new diagonalisation scheme for asymmetric top levels
      (ITYPE=4,6), which fixes an erroneous MOLSCAT output of the
      sequence of TAU values for degenerate states. Result of
      scattering calculations is not affected because the TAU index is
      not used in the build up of rotational couplings.

      This minor bug obscured the correspondance with spectroscopic
      notation and was pointed out by Annie Spielfiedel (Paris/Meudon
      Observatory).

      The correspondance between MOLSCAT output and standard spectroscopic 
      notation is the following:
        km1-k1 = tau
        km1 = ka = int( (j1+tau+1)/2 )
        k1  = kc = int( (j1-tau+1)/2 )

      The update introduces a variant of the ASROT routine which calls
      the NEWROTDIAG routine (both routines in asrot.f).  See more
      details in asrot.f, in particular comments in routine
      NEWROTDIAG. (The original MOLSCAT behaviour can be restored by
      setting the flag NEWALGO to .false. in ASROT.)

    - Provides minor updates of cpl4.f, dmsym.f, ipasym.f with
      comments or debug features.

    - Provides a bug fix limiting the output to IASYMU unit to the
      dispatcher task alone in the case of parallel PMP runs. 
      Affected files: set4.f, set6.f.

07 April 2008, PV/LAOG.

Bugfix distribution, unpacks under directory distrib_20080407/

    - Fixed a serious bug in pmpnew.f concerning the handling of 
      MPI_Isend buffers. Invalid (JTOT, M) values could be passed for 
      calculations on the master node, aborting the MPI run. 
    - Added a check in driver.f to prevent setting ISIGU in PMP runs. 
    - Updated compil*.ia64.* files for Itanium. 
    - Updated files in tst/ subdirectory.

26 March 2008, PV/LAOG.

Initial distribution, unpacks under directory distrib/.


Pierre Valiron and George C. McBane.