Pipeline overview
=================

The pipeline runs `DSPSR <https://dspsr.sourceforge.net/>`_ to dedisperse raw `LOFAR <https://www.lofar.eu/>`_ pulsar beamformed data, either complex-voltage (XXYY) 32-bit or 8-bit `HDF5 <https://www.hdfgroup.org/solutions/hdf5/>`_ data or beamformed Stokes I data, and to fold them modulo the pulsar period. The `DSPSR software package <https://dspsr.sourceforge.net/>`_ was developed by Willem van Straten :cite:p:`vanstraten2011`.

The pipeline performs the following operations in order:

* dedispersing and folding the data modulo pulsar period using ``dspsr`` from `DSPSR software package <https://dspsr.sourceforge.net/>`_;
* combining output archive files ``.ar`` from different frequency parts together using ``psradd`` from `PSRCHIVE software package <https://psrchive.sourceforge.net/>`_;
* running RFI removal using ``paz -r`` from `PSRCHIVE <https://psrchive.sourceforge.net/>`_ (optional);
* running RFI removal using ``clfd`` from `CLFD <https://github.com/v-morello/clfd>`_ (optional);
* creating diagnostic plots for the data cleaned with ``paz`` and with ``clfd``, or for the unflagged data if RFI removal was skipped;
* running ``pdmp`` from `PSRCHIVE <https://psrchive.sourceforge.net/>`_ to optimise pulsar period and dispersion measure (optional);
* running ``pam`` from `PSRCHIVE <https://psrchive.sourceforge.net/>`_ to update the data cube with the new pulsar period and dispersion measure from ``pdmp`` (optional);
* creating new diagnostic plots for the data updated with the new pulsar period and dispersion measure (optional);
* creating a one-page ``quickview`` diagnostic plot;
* creating multi-page diagnostic summary pdf that combines all created diagnostic plots.
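
The ordering above, together with the way the optional steps switch on and off, can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not the workflow's actual implementation: the function name ``plan_steps`` and the step labels are made up for this example, while the flag names and their defaults are the pipeline parameters described under *Input parameters*.

.. code-block:: python

    def plan_steps(use_paz=True, use_clfd=True, nozap=False, use_pdmp=True):
        """Return the ordered list of pipeline stages for the given flags."""
        # nozap overrides both RFI-removal switches.
        run_paz = use_paz and not nozap
        run_clfd = use_clfd and not nozap

        steps = ["dspsr", "psradd"]
        if run_paz:
            steps.append("paz -r")
        if run_clfd:
            steps.append("clfd")
        # Plots are always made, for cleaned or for unflagged data.
        steps.append("diagnostic plots")
        if use_pdmp:
            steps += ["pdmp", "pam", "diagnostic plots (updated P0/DM)"]
        steps += ["quickview plot", "summary pdf"]
        return steps

For example, ``plan_steps(nozap=True, use_pdmp=False)`` returns ``['dspsr', 'psradd', 'diagnostic plots', 'quickview plot', 'summary pdf']``.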

Input parameters
----------------

* ``sasid``: identifier of the SAS process (SASId) that called the pipeline. The SASId is used to name the output ``.h5`` and ``.raw`` HDF5 BF files, replacing the Observation ID (ObsId) in their names. If it is not provided (e.g. when the pipeline is run locally rather than called by TMSS), the original ObsIds are re-used in the output filenames.
* ``h5in``: input HDF5 LOFAR BF ``.h5`` files. Each of these files should have the corresponding ``.raw`` file in the same directory. The workflow will then do the conversion on these ``.raw`` files and update the metadata in the ``.h5`` files.
* ``nthreads``: number of CPU cores (integer) for `DSPSR <https://dspsr.sourceforge.net/>`_ to use. Corresponds to the ``-t`` option in ``dspsr``. Default is 8.
* ``maxram``: upper limit on RAM usage by `DSPSR <https://dspsr.sourceforge.net/>`_. It is either a floating-point value in MB or a multiple of the minimum possible block size, specified as ``minXn``, where ``n`` is the integer multiplier. Corresponds to the ``-U`` option in ``dspsr``. When specified in TMSS, the integer multiplier is restricted to ``0 < n < 10``; a value of ``n >= 10`` is interpreted as a value in MB. This parameter is optional.
* ``use_pdmp``: whether to run ``pdmp`` from the `PSRCHIVE software package <https://psrchive.sourceforge.net/>`_ on the output full-band ``.ar`` file to search for the optimal pulsar period (P0) and dispersion measure (DM). Default is ``true``.
* ``use_paz``: whether to run ``paz -r`` on the output full-band ``.ar`` file from ``psradd`` to zap RFI using the median smoothed difference. Both ``paz`` and ``psradd`` are programs from `PSRCHIVE <https://psrchive.sourceforge.net/>`_. Default is ``true``.
* ``use_clfd``: whether to run ``clfd --no-report`` on the output full-band ``.ar`` file from ``psradd`` to remove RFI. The ``clfd`` tool is from the `CLFD project <https://github.com/v-morello/clfd>`_. Default is ``true``.
* ``nozap``: whether to skip RFI zapping altogether. If ``nozap`` is ``true``, neither ``paz`` nor ``clfd`` runs, even if ``use_paz`` or ``use_clfd`` is ``true``. Default is ``false``.
* ``obsdur``: number of seconds (integer) of the observation for `DSPSR <https://dspsr.sourceforge.net/>`_ to process. Corresponds to the ``-T`` option in ``dspsr``. This parameter is optional; by default the whole observation is processed.
* ``skip_from_start``: number of seconds (integer) to skip from the start of the observation before `DSPSR <https://dspsr.sourceforge.net/>`_ begins processing. Corresponds to the ``-S`` option in ``dspsr``. This parameter is optional.
* ``tsubint``: the length (in seconds) of sub-integrations in the output ``.ar`` file from `DSPSR <https://dspsr.sourceforge.net/>`_. Corresponds to ``-L`` option in ``dspsr``. Default is 10.
* ``sp_subint``: whether to create single-pulse sub-integrations in ``dspsr``. Corresponds to the ``-s`` option in ``dspsr``. Default is ``false``. If ``true``, single-pulse sub-integrations are created regardless of the value of the ``tsubint`` parameter. Using this parameter usually also requires setting the ``remove_dm_delays`` parameter (below) to ``true``.
* ``remove_dm_delays``: whether to remove inter-channel dispersion delays in the data cube(s) in ``dspsr``. Corresponds to the ``-K`` option in ``dspsr``. Default is ``false``. Usually used together with the ``sp_subint`` parameter.
* ``nbins``: number of bins (integer) in the pulsar profile of the output ``.ar`` files. Corresponds to ``-b`` option in ``dspsr``. Default is 1024.
* ``extra_channels_factor``: number of channels per subband in the filterbank created by ``dspsr``. Corresponds to the ``-F`` option in ``dspsr``. Coherent dedispersion is also performed during the filterbank operation (the ``:D`` flag of the ``-F`` option in ``dspsr``). For example, if there are 20 subbands in the input file and ``extra_channels_factor`` is 6, a 120-channel filterbank is created using the ``-F 120:D`` option in ``dspsr``. Without this parameter, the number of channels in the output ``.ar`` file equals the number of subbands in the input ``.h5`` file.
* ``parfile``: pulsar ephemeris file used to fold the data. Corresponds to the ``-E`` option in ``dspsr``. An example of how to specify this parameter is given below. This is one of three ways to specify the ephemeris in the workflow; the other two are the ``parfile_string`` and ``get_parfile_from_psrcat`` parameters (below). If neither ``parfile`` nor ``parfile_string`` is specified and ``get_parfile_from_psrcat`` is ``false``, the pipeline fails. For more detail, see the :ref:`Parfile <parfilesection>` section below.
* ``parfile_string``: similar to ``parfile``, but the content of the pulsar ephemeris file is given as the string value of this parameter. This is how the pulsar ephemeris must be specified in TMSS. For more detail, see the :ref:`Parfile <parfilesection>` section below.
* ``get_parfile_from_psrcat``: boolean parameter to indicate whether parfile should be taken from the `ATNF Pulsar Catalogue <https://www.atnf.csiro.au/research/pulsar/psrcat/>`_ ``PSRCAT``. If ``true`` the parfile will be created with ``psrcat -e <PSRNAME>`` command, where ``<PSRNAME>`` is the pulsar name from the metadata of the input ``.h5`` file. However, both ``parfile`` and ``parfile_string`` parameters take precedence over this flag if they are given and are non-empty or non-null. Default is ``false``. For more detail, see :ref:`Parfile <parfilesection>` section below.
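
As a rough guide to how the parameters above map onto ``dspsr`` options, the following Python sketch assembles an argument list from a parameter dictionary. It is an illustration under stated assumptions rather than the pipeline's own code: ``dspsr_options`` and its ``nsubbands`` argument are hypothetical, the ``maxram`` value is passed to ``-U`` unchanged, and ``-L`` is simply left out when ``sp_subint`` is set (the text above only says that ``tsubint`` is then disregarded). The ``-E`` option is omitted because it depends on how the parfile is supplied.

.. code-block:: python

    def dspsr_options(params, nsubbands):
        """Translate pipeline parameters into a list of dspsr arguments.

        ``nsubbands`` (number of subbands in the input file) is a
        hypothetical argument needed only for the ``-F`` arithmetic.
        """
        opts = ["-t", str(params.get("nthreads", 8))]
        if "maxram" in params:
            # Passed through as given, e.g. "minX2" or a value in MB.
            opts += ["-U", str(params["maxram"])]
        if "obsdur" in params:
            opts += ["-T", str(params["obsdur"])]
        if "skip_from_start" in params:
            opts += ["-S", str(params["skip_from_start"])]
        if params.get("sp_subint", False):
            # Assumption: -L is dropped when single pulses are requested.
            opts.append("-s")
        else:
            opts += ["-L", str(params.get("tsubint", 10))]
        if params.get("remove_dm_delays", False):
            opts.append("-K")
        opts += ["-b", str(params.get("nbins", 1024))]
        if "extra_channels_factor" in params:
            # Channels per subband times subbands, with coherent dedispersion.
            nchan = nsubbands * params["extra_channels_factor"]
            opts += ["-F", f"{nchan}:D"]
        return opts

With 20 subbands and only ``extra_channels_factor`` set to 6, the function returns ``['-t', '8', '-L', '10', '-b', '1024', '-F', '120:D']``, matching the ``-F 120:D`` example above.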

Output
------

The pipeline produces the following output in the directory specified by ``--outdir`` of ``cwltool`` or ``toil``:

* ``h5/``: the directory with original ``.h5`` input files from an observation;
* ``cobalt/``: the directory with the parset(s) of an observation and Cobalt log-file;
* input parfile ``.par``, or parfile created by PSRCAT (depending on the input parameters);
* dedispersed folded data cube ``.ar`` from all frequency parts added together;
* RFI-zapped data cube(s), ``.paz.ar`` and/or ``.clfd.ar`` (optional);
* output files from ``pdmp`` for the optimization of P0 and DM (optional):

  - ``pdmp_snr.dat``
  - ``pdmp.psn``
  - ``pdmp.per``
  - ``.pdmp.ps`` (diagnostic plot from ``pdmp``) - part of the multi-page diagnostic summary pdf;

* data cube ``.pdmp.ar`` with improved P0 and DM (optional);
* A4-page ``quickview`` diagnostic plot, ``.quickview.png`` (see :ref:`examples <quickview>` below);
* multi-page diagnostic summary ``.summary.pdf`` which includes:

  - ``quickview`` diagnostic plot;
  - diagnostic plot and noise/signal statistics output from ``snr.py``;
  - summary diagnostic plots (profile, phase-freq, phase-time, dynspec) for RFI-zapped data cube from ``paz -r`` (optional);
  - summary diagnostic plots (profile, phase-freq, phase-time, dynspec) for RFI-zapped data cube from ``clfd`` (optional);
  - summary diagnostic plots (profile, phase-freq, phase-time, dynspec) for non-RFI zapped data cube if RFI zapping was skipped (optional);
  - diagnostic plot from ``pdmp`` (optional);
  - summary diagnostic plots (profile, phase-freq, phase-time, dynspec) for RFI-zapped data cube from ``paz -r`` with improved P0 and DM after ``pdmp`` (optional);

* ``pipeline.log``: a concatenated log file capturing standard output and standard error for processing of all input ``.h5`` files.
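
After a run, it can be handy to check which of the products listed above are actually present. The Python sketch below does this by file suffix. It assumes that the files sit directly in the output directory, and the names ``EXPECTED_OUTPUTS``, ``find_outputs`` and ``missing_required`` are invented for this example. The unflagged ``.ar`` cube is left out because a plain ``*.ar`` pattern would also match the ``.paz.ar``, ``.clfd.ar`` and ``.pdmp.ar`` files.

.. code-block:: python

    from pathlib import Path

    # Glob patterns taken from the output list above. The value says whether
    # the product is optional, i.e. only present when its step was run.
    EXPECTED_OUTPUTS = {
        "*.par": False,
        "*.paz.ar": True,
        "*.clfd.ar": True,
        "*.pdmp.ar": True,
        "*.quickview.png": False,
        "*.summary.pdf": False,
        "pipeline.log": False,
    }


    def find_outputs(outdir):
        """Map every expected pattern to the sorted names of matching files."""
        outdir = Path(outdir)
        return {
            pattern: sorted(p.name for p in outdir.glob(pattern))
            for pattern in EXPECTED_OUTPUTS
        }


    def missing_required(outdir):
        """List the non-optional patterns that matched nothing in outdir."""
        found = find_outputs(outdir)
        return [
            pattern
            for pattern, optional in EXPECTED_OUTPUTS.items()
            if not optional and not found[pattern]
        ]

An empty list from ``missing_required`` means that every non-optional pattern matched at least one file; it says nothing about the content of those files.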

Configuring the pipeline
------------------------

The parameters of the pipeline are provided as a `JSON <https://www.json.org>`_ file.
Here is an example of an input JSON file:

.. code-block:: JSON

    {
        "sasid": "1234567",
        "use_pdmp": true,
        "use_paz": true,
        "use_clfd": true,
        "nozap": false,
        "obsdur": 10,
        "skip_from_start": 100,
        "tsubint": 10,
        "sp_subint": true,
        "remove_dm_delays": true,
        "nbins": 128,
        "parfile": { "class": "File", "path": "/data/kondratiev/pulp2-data/folding/parfile/0329+54.par"},
        "get_parfile_from_psrcat": false,
        "extra_channels_factor": 4,
        "nthreads": 8,
        "maxram": "minX2",
        "h5in": [
            {
                "class": "File", "path": "/data/kondratiev/pulp2-data/folding/L2050196/cs/L2050196_SAP000_B000_S0_P000_bf.h5"
            },
            {
                "class": "File", "path": "/data/kondratiev/pulp2-data/folding/L2050196/cs/L2050196_SAP000_B000_S1_P000_bf.h5"
            },
            {
                "class": "File", "path": "/data/kondratiev/pulp2-data/folding/L2050196/cs/L2050196_SAP000_B000_S2_P000_bf.h5"
            },
            {
                "class": "File", "path": "/data/kondratiev/pulp2-data/folding/L2050196/cs/L2050196_SAP000_B000_S3_P000_bf.h5"
            },
            {
                "class": "File", "path": "/data/kondratiev/pulp2-data/folding/L2050196/cs/L2050196_SAP000_B000_S0_P001_bf.h5"
            },
            {
                "class": "File", "path": "/data/kondratiev/pulp2-data/folding/L2050196/cs/L2050196_SAP000_B000_S1_P001_bf.h5"
            },
            {
                "class": "File", "path": "/data/kondratiev/pulp2-data/folding/L2050196/cs/L2050196_SAP000_B000_S2_P001_bf.h5"
            },
            {
                "class": "File", "path": "/data/kondratiev/pulp2-data/folding/L2050196/cs/L2050196_SAP000_B000_S3_P001_bf.h5"
            }
        ]                        
    }

Not all parameters need to be provided: some are optional, and if they are not specified, either their default value is used or the corresponding workflow step is skipped. The ``sasid`` parameter can also be omitted in a local run; otherwise it is provided by TMSS.
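
Because every ``.h5`` file listed in ``h5in`` needs its ``.raw`` companion in the same directory, a quick pre-flight check of the input JSON file can save a failed run. The sketch below is one way to do it in Python. It assumes that the ``.raw`` file has the same name as the ``.h5`` file apart from the extension, and ``missing_raw_files`` is a hypothetical helper, not part of the pipeline.

.. code-block:: python

    import json
    from pathlib import Path


    def missing_raw_files(json_path):
        """Return the expected .raw paths that do not exist on disk."""
        with open(json_path) as fh:
            params = json.load(fh)

        missing = []
        for entry in params["h5in"]:
            # Assumption: <name>.h5 is paired with <name>.raw in the same directory.
            raw = Path(entry["path"]).with_suffix(".raw")
            if not raw.is_file():
                missing.append(str(raw))
        return missing

An empty list means that each input ``.h5`` file has a ``.raw`` file next to it.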

A few examples of input `JSON <https://www.json.org/>`_ files can also be found in the ``tests/`` directory of the repository. In addition, the ``scripts/`` directory of the repository contains the script ``create-input-json-pulp2-folding.sh``, which generates a template JSON file for the input data of your choice. The script takes one argument, the directory containing all the ``.h5`` and ``.raw`` files, and an optional second argument that filters the input files. For example,

.. code-block:: console

    $ create-input-json-pulp2-folding.sh    /data/kondratiev/pulp2-data/folding/L2050196/cs
    
generates an input JSON file similar to the one above, based on the same data. To run the workflow only for the 0th frequency part ``_P000_``, pass the filter as the second argument:

.. code-block:: console

    $ create-input-json-pulp2-folding.sh    /data/kondratiev/pulp2-data/folding/L2050196/cs    _P000_

and then the ``h5in`` parameter of the input JSON file will contain only the input ``.h5`` files for frequency part ``_P000_``. Before running the workflow, please double-check that the input JSON file has the parameters and values you want.
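
For readers who prefer to build the file programmatically, the behaviour described above can be approximated in Python. This is not the shell script itself: ``build_input_json`` is a hypothetical function, it treats the filter as a plain substring of the file name (an assumption), and it writes only the ``h5in`` parameter, leaving all other parameters to be added by hand.

.. code-block:: python

    import json
    from pathlib import Path


    def build_input_json(datadir, name_filter=""):
        """Return a JSON string with an ``h5in`` list for the .h5 files in datadir.

        Only files whose name contains ``name_filter`` are kept; the empty
        default keeps every file.
        """
        h5_files = sorted(
            p for p in Path(datadir).glob("*.h5") if name_filter in p.name
        )
        params = {
            "h5in": [{"class": "File", "path": str(p)} for p in h5_files]
        }
        return json.dumps(params, indent=4)

Calling ``build_input_json(datadir, "_P000_")`` on the directory used above would list only the ``.h5`` files of frequency part ``_P000_``.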

.. _parfilesection:

Parfile
^^^^^^^

The parfile (pulsar ephemeris file) can be provided in one of three ways:

1. using the ``parfile`` input parameter and providing the path to the file on your local machine/cluster, for example:

.. code-block:: console

    ...
    "parfile": { "class": "File", "path": "/data/kondratiev/pulp2-data/folding/parfile/0329+54.par"},
    ...

2. using the ``parfile_string`` input parameter with the content of the actual parfile. This is how the parfile must be provided within TMSS. When running the workflow locally, the ``parfile`` parameter is easier to use. If both ``parfile`` and ``parfile_string`` are given, the parfile specified by the ``parfile`` parameter is used. Here is an example of using the ``parfile_string`` parameter:

.. code-block:: console

    ...
    "parfile_string": "PSRJ            J0332+5434                    	\nRAJ             03:32:59.4096                 1.000e-04 	\nDECJ            +54:34:43.329                 1.000e-03 	\nDM              26.76410                      1.000e-04 	\nPEPOCH          46473.0                       	\nF0              1.399541538720                6.000e-12 	\nF1              -4.011970E-15                 1.400e-20 	\nPMRA            16.97                         3.000e-02 	\nPMDEC           -10.37                        5.000e-02 	\nPOSEPOCH        56000.0                       	\nF2              5.3E-28                       1.500e-28 	\nRM              -64.33                        6.000e-02 	\nPX              0.59                          2.000e-02 	\nEPHVER          2                             	\nUNITS           TDB",
    ...

3. using the ``get_parfile_from_psrcat`` input parameter with the value ``true``. This enables extracting the parfile from the `ATNF Pulsar Catalogue <https://www.atnf.csiro.au/research/pulsar/psrcat/>`_. The default value of this parameter is ``false``, so it must be explicitly set to ``true`` to be used. If this parameter is ``false`` and neither ``parfile`` nor ``parfile_string`` is specified, the pipeline fails.
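
The precedence between the three options can be summarised in a short Python sketch. It mirrors the rules stated in this section (``parfile`` first, then ``parfile_string``, then PSRCAT, otherwise failure) but is not the workflow's own code; the function names are hypothetical. The second helper shows one way to turn the text of a parfile into a ``parfile_string`` entry, relying on ``json.dumps`` to escape newlines and tabs.

.. code-block:: python

    import json


    def parfile_source(params):
        """Return which input supplies the ephemeris: a sketch of the rules above."""
        if params.get("parfile"):
            return "parfile"
        if params.get("parfile_string"):
            return "parfile_string"
        if params.get("get_parfile_from_psrcat", False):
            return "psrcat"
        raise ValueError("no parfile, parfile_string or PSRCAT lookup requested")


    def parfile_string_entry(parfile_text):
        """Encode parfile text as a JSON object with a parfile_string key."""
        # Trailing newlines are dropped; inner newlines become \n escapes.
        return json.dumps({"parfile_string": parfile_text.rstrip("\n")})

Empty or null values count as "not given" here, in line with the description of ``get_parfile_from_psrcat`` in the parameter list.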

.. _quickview:

Examples of ``quickview`` diagnostic plot
-----------------------------------------

``quickview`` diagnostic plot for pulsar beamformed complex-voltage data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. image:: _static/quickview-xxyy.png
  :width: 800
  :alt: quickview plot for XXYY data
  
* Top-row (1st row):

   * `Left`: average pulse profile in total intensity (I, black), linear (L, red) and circular (V, blue) polarisation;
   
   * `Right`: information window about observing setup, observation, quality, others;
   
* 2nd row:

   * Pulse-phase vs. frequency plot for all four Stokes parameters, IQUV (from `left` to `right`);
   
* 3rd row:

   * `Left`: Pulse-phase vs. observing time plot;
   
   * `Right`: Average bandpass for both input polarisations;
   
* 4th row:

   * `Left`: Time vs. frequency plot calculated on the ON-pulse window, i.e. pulsar dynamic spectrum;
   
   * `Right`: same for the OFF-pulse window.

  
``quickview`` diagnostic plot for pulsar beamformed Stokes I data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. image:: _static/quickview-stokesI.png
  :width: 800
  :alt: quickview plot for Stokes I data

* Top-row (1st row):

   * `Left`: average pulse profile in total intensity (I, black);
   
   * `Right`: information window about observing setup, observation, quality, others;
   
* 2nd row:

   * `Left`: Pulse-phase vs. frequency plot for total intensity (pulse spectrum);
   
   * `Right`: Data samples' histogram together with the least-squares fit showing the mean and rms values;
   
* 3rd row:

   * `Left`: Pulse-phase vs. observing time plot;
   
   * `Right`: Average bandpass for the total intensity data;
   
* 4th row:

   * `Left`: Time vs. frequency plot calculated on the ON-pulse window, i.e. pulsar dynamic spectrum;
   
   * `Right`: same for the OFF-pulse window.
  
  
Workflow diagram
----------------
.. graphviz:: _static/workflow-folding.dot
   :class: with-border
   :caption: Workflow diagram