NAME

filterlinks - filter the link file based on link parameters


SYNOPSIS

  filterlinks -links linkfile.txt [-nointer] [-nointra] [-debug]


DESCRIPTION


RULES

A filter rule contains two parts: the link parameter to be tested and a list of acceptable conditions.

The two exceptions are the -nointer and -nointra flags. These can be used to filter out inter-chromosomal links (ends of link are on different chromosomes) and intra-chromosomal links (ends of link are on the same chromosome). These two rules are strict, meaning that if a link does not pass them, no other rules are tested and the link is immediately rejected.
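
The strict behaviour of these two flags can be pictured as a pre-check that runs before any other rule. The sketch below is a Python illustration of that description only, not the filterlinks code; the function name and its arguments are hypothetical.

```python
def passes_strict_flags(chr_1, chr_2, nointer=False, nointra=False):
    """Return False if the link is rejected by -nointer or -nointra.

    chr_1 and chr_2 are the chromosome names of the two link ends.
    Hypothetical helper: the real tool may structure this differently.
    """
    same_chromosome = (chr_1 == chr_2)
    if nointer and not same_chromosome:
        return False  # inter-chromosomal link, rejected immediately
    if nointra and same_chromosome:
        return False  # intra-chromosomal link, rejected immediately
    return True       # link goes on to the remaining rules


print(passes_strict_flags("cf12", "hs6", nointer=True))
```

A link that returns False here would never reach the parameter rules described in the following sections.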


Link Parameters

  link_param = condition1;;condition2;;...

Because each link has two ends, each link parameter may give rise to three distinct rules

  link_param   = condition1;;condition2;;...
  link_param_1 = condition1;;condition2;;...
  link_param_2 = condition1;;condition2;;...

which test, respectively, both ends with the condition (both ends must pass), the first end, and the second end. The first end of the link corresponds to the first line of the link line pair. For example, given the link

  ...
  link018136 cf12 9800000 9900000
  link018136 hs6 37914056 37916509
  ...

the first end is cf12:9800000-9900000 and the second end is hs6:37914056-37916509.
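
To make the first-end/second-end convention concrete, here is a minimal Python sketch that reads such a pair of lines. It assumes each line holds at least four whitespace-separated fields (link id, chromosome, start, end); the function name and the dictionary keys are hypothetical, and any further fields on the line are ignored.

```python
def parse_link_pair(line_1, line_2):
    """Split the two lines of a link into (first_end, second_end).

    Assumption: fields are whitespace-separated in the order
    id, chromosome, start, end.
    """
    ends = []
    for line in (line_1, line_2):
        link_id, chrom, start, end = line.split()[:4]
        ends.append({"id": link_id, "chr": chrom,
                     "start": int(start), "end": int(end)})
    if ends[0]["id"] != ends[1]["id"]:
        raise ValueError("the two lines belong to different links")
    return ends[0], ends[1]


first, second = parse_link_pair("link018136 cf12 9800000 9900000",
                                "link018136 hs6 37914056 37916509")
print(first["chr"], second["chr"])
```

Rules with a _1 suffix would then be tested against the first dictionary and rules with a _2 suffix against the second.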


Conditions

A condition has the following format

  { [?TYPE {ID} {!} ] } CONDITION

where elements in { } are optional. Briefly, TYPE is used to indicate how the CONDITION text should be applied (e.g. regular expression, integer range, exact match, etc). The ID is used to combine rules so that their match status is AND'ed together to determine whether the link passes. The trailing ``!'' is used to negate the rule (i.e. for the link to pass, the rule must fail).
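
The condition format can be taken apart with a single regular expression. The Python sketch below is one possible reading of the format, not the parser used by filterlinks: the set of type characters accepted by the pattern is an assumption drawn from the examples in this document, the ``;;'' separator is the one used in the examples below, and the function names are hypothetical.

```python
import re

# Optional [?TYPE{ID}{!}] prefix followed by the condition text.
# Assumption: TYPE is one character (a letter, < or >) and ID is numeric.
CONDITION_RE = re.compile(
    r"^(?:\[\?(?P<type>[a-z<>])(?P<id>\d*)(?P<neg>!?)\])?(?P<text>.*)$"
)


def parse_condition(raw):
    """Parse one condition such as '[?e1!]hsx' or a bare '3'."""
    m = CONDITION_RE.match(raw.strip())
    return {
        "type": m.group("type") or "r",   # regular expression by default
        "id": m.group("id") or None,      # None when no ID was given
        "negate": m.group("neg") == "!",
        "text": m.group("text"),
    }


def parse_rule(line):
    """Parse 'link_param = cond;;cond;;...' into (param, conditions)."""
    param, _, conditions = line.partition("=")
    return param.strip(), [parse_condition(c) for c in conditions.split(";;")]


print(parse_rule("chr_2 = [?e1!]hsx;;3"))
```

Here the first condition comes back with type ``e'', ID 1 and the negate flag set, while the bare ``3'' falls back to the default regular expression type with no ID.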


EXAMPLES

Below are some examples to get you started. Note the interplay between conditions with IDs and conditions without IDs. The former collate conditions into AND'ed sets, which are then in turn OR'ed with other sets and with conditions without IDs.


Filtering by Chromosomes

To select links in which both ends match regular expression ``1''

  chr = 1

So simple. Now, to select links in which either end matches regular expression ``1'',

  chr_1 = 1
  chr_2 = 1

The difference between these two cases is that in the first instance, since the link_parameter does not include a _1 or _2 suffix, the condition is applied to both ends of the link and both ends must pass. In the second case, each end is tested independently and the results are OR'ed together.

If you want links where the first chromosome matches x or the second matches y,

  chr_1 = x
  chr_2 = y

The test is (chr_1 match ``x'') OR (chr_2 match ``y''). Note, however, that a link whose first chromosome matches ``y'' and whose second matches ``x'' satisfies neither condition and will fail. To match both possibilities, you might try

  chr_1 = x;;y
  chr_2 = y;;x

In this case the test is (chr_1 match ``x'') OR (chr_1 match ``y'') OR (chr_2 match ``x'') OR (chr_2 match ``y'').

If you are looking for links between x and y chromosomes, then you require the results of each condition to be AND'ed. For this, use IDs

  chr_1 = [?r1]x
  chr_2 = [?r1]y

Both of these rules have ID=1 and are therefore grouped into a set. Match results within a set are AND'ed. Thus, the test is (chr_1 match ``x'') AND (chr_2 match ``y''). If you want to match the other order too,

  chr_1 = [?r1]x;;[?r2]y
  chr_2 = [?r1]y;;[?r2]x

In this example, there are two IDs. The rules with ID=1 match chr_1 to ``x'' and chr_2 to ``y'' and the rules with ID=2 match the converse (chr_1 to ``y'' and chr_2 to ``x'').
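
The way match results are combined can be sketched in a few lines of Python. This is an illustration of the AND-within-a-set, OR-between-sets behaviour described above, not the filterlinks implementation; both function names are hypothetical.

```python
import re
from collections import defaultdict


def link_passes(results):
    """results is a list of (id_or_None, matched) pairs, one per condition."""
    sets = defaultdict(list)
    for cond_id, matched in results:
        if cond_id is None:
            if matched:
                return True          # a condition without an ID passes alone
        else:
            sets[cond_id].append(matched)
    # A set passes only if every condition sharing its ID matched.
    return any(all(members) for members in sets.values())


def xy_results(chr_1, chr_2):
    """Match results for chr_1 = [?r1]x;;[?r2]y and chr_2 = [?r1]y;;[?r2]x."""
    return [
        ("1", re.search("x", chr_1) is not None),
        ("2", re.search("y", chr_1) is not None),
        ("1", re.search("y", chr_2) is not None),
        ("2", re.search("x", chr_2) is not None),
    ]


print(link_passes(xy_results("hsx", "hsy")))
print(link_passes(xy_results("hsx", "hsx")))
```

The first call passes through the set with ID 1; the second fails because neither set has both of its conditions matched.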

Now let's suppose we want links that are either cf1-hs6, cf14-hs7 or cfx-hsx. Here cf is a dog chromosome and hs is a human chromosome. The rule for this is

  chr_1 = [?e1]cf1;;[?e2]cf14;;[?e3]cfx
  chr_2 = [?e1]hs6;;[?e2]hs7;;[?e3]hsx

You can add additional conditions without IDs to accept more links. For example, if you also wanted to add any links for which chr_1 was cf9 or for which chr_2 matched ``3''

  chr_1 = [?e1]cf1;;[?e2]cf14;;[?e3]cfx;;[?e]cf9
  chr_2 = [?e1]hs6;;[?e2]hs7;;[?e3]hsx;;3

Remember that [?r]3 is the same as 3, since the default condition type is a regular expression.

You can take advantage of the ``!'' flag to negate rules to avoid chromosomes. For example, if you want links between cfx and any chromosome other than hsx

  chr_1 = [?e1]cfx
  chr_2 = [?e1!]hsx

and here the test is (chr_1 is cfx) AND (chr_2 is not hsx).
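
Negation simply inverts the match result of a condition before it is combined with the others in its set. A minimal Python sketch of the cfx/hsx rules above, with hypothetical function names and exact string comparison standing in for the ``e'' condition type:

```python
def condition_result(matched, negate):
    """Apply the ``!'' flag: a negated condition passes when the match fails."""
    return (not matched) if negate else matched


def cfx_not_hsx(chr_1, chr_2):
    """Mirror chr_1 = [?e1]cfx and chr_2 = [?e1!]hsx (both in set 1)."""
    first = condition_result(chr_1 == "cfx", negate=False)
    second = condition_result(chr_2 == "hsx", negate=True)
    return first and second   # same ID, so the results are AND'ed


print(cfx_not_hsx("cfx", "hs6"))
print(cfx_not_hsx("cfx", "hsx"))
```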

You can combine chr with chr_1/chr_2 rules

  chr   = 2
  chr_1 = [?e1]cfx
  chr_2 = [?e1!]hsx

to produce the test ( (chr_1 is cfx) AND (chr_2 is not hsx) ) OR ( chr_1 matches ``2'' AND chr_2 matches ``2'' ). Use ``chr'' as the parameter if you want to apply the same condition to both ends of the link and chr_1 and chr_2 to apply different conditions.


Filtering by Position

To test link position, use the parameters ``start'', ``end'' and ``span''. Both ``start'' and ``end'' are ideal for testing with the condition types ``<'' and ``>''. To select links for which both ends start before 10,000,000

  start = [?<]1e7
  # or
  start = [?<]10000000

To add another OR'ed condition that passes links with start values beyond 100,000,000

  start = [?<]1e7;;[?>]1e8
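
As a rough Python sketch of what this rule tests, assuming that each condition of an unsuffixed parameter must hold at both ends and that the two conditions are then OR'ed (this reading follows the chromosome examples above; the function names are hypothetical):

```python
def both_ends(test, start_1, start_2):
    """An unsuffixed parameter applies the condition to both ends."""
    return test(start_1) and test(start_2)


def passes_start_rule(start_1, start_2):
    """Mirror start = [?<]1e7;;[?>]1e8."""
    below = both_ends(lambda s: s < float("1e7"), start_1, start_2)
    above = both_ends(lambda s: s > float("1e8"), start_1, start_2)
    return below or above


# 1e7 and 10000000 denote the same threshold.
print(float("1e7") == 10000000)
print(passes_start_rule(5_000_000, 9_000_000))
```

Under this reading, a link with one end below 10,000,000 and the other beyond 100,000,000 would not pass, since neither condition holds at both ends.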

A more complex test for the ``start'' and ``end'' values can be made by using the ``i'' condition type, which tests for membership within a range. This rule

  start = [?i]1e6-2e6,3e6-4e6

will pass links for which both ends are within 1-2Mb or 3-4Mb. Note that the ``,'' in this condition is part of the span and does not create a new condition. To have two conditions, use the ;; delimiter.

  start = [?i]1e6-2e6,3e6-4e6;;[?s]1e7-1.1e7,3e6-4e6

When using the ``span'' parameter, you should always use the ``s'' condition type. This will check whether the link span intersects the provided span.

  span = [?s]2e7-5e7

This will select all links whose spans (at both ends) intersect the coordinates 20-50Mb. To be more selective, use the _1 and _2 suffixes.

  span_1 = [?s1]2e7-5e7
  span_2 = [?s1]2e7-2.5e7

will select links joining 20-50Mb regions to 20-25Mb regions. An ID was required here to make the results AND'ed. To avoid certain regions, use the ``!'' flag

  span = [?s!](-1e7

will avoid all links within the first 10Mb.
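
The difference between testing a single position for membership in a range and testing a link span for intersection can be sketched in Python as follows. This is an illustration under stated assumptions, not the filterlinks code: range boundaries are treated as inclusive, only the closed ``lo-hi'' form is handled (not the open-ended form used in the last rule), and all function names are hypothetical.

```python
def parse_spans(text):
    """Turn '1e6-2e6,3e6-4e6' into [(1000000.0, 2000000.0), (3000000.0, 4000000.0)].

    The comma separates ranges inside ONE condition; it does not start
    a new condition.
    """
    spans = []
    for part in text.split(","):
        lo, hi = part.split("-")
        spans.append((float(lo), float(hi)))
    return spans


def in_any_span(value, spans):
    """Membership test for a single position (e.g. a start value)."""
    return any(lo <= value <= hi for lo, hi in spans)


def intersects_any(start, end, spans):
    """Intersection test for a link span start..end."""
    return any(start <= hi and end >= lo for lo, hi in spans)


spans = parse_spans("1e6-2e6,3e6-4e6")
print(in_any_span(1_500_000, spans))
print(intersects_any(19_000_000, 21_000_000, parse_spans("2e7-5e7")))
```

Note that an end spanning 19-21Mb intersects 20-50Mb even though its start lies outside that range, which is why the intersection test suits the ``span'' parameter.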


Filtering by Link Options

Any link option such as ``color'', ``thickness'', or ``z'' can be tested in similar rules.

  # links with z value greater than 10
  z = [?>]10 

  # links with z value between 5 and 15
  z = [?s]5-15


Mixing Conditions and IDs

You can write fairly complex rules by combining different link parameters, rule types and IDs.

For example to apply the following filter

  (
  between (hs1 and cf6) 
  AND
  within 75-80 Mb on hs1
  AND 
  larger than 5kb on hs1
  )

  OR

  (
  larger than 500kb on hs1
  )

use the following rules

  chr_1   = [?e1]cf6
  chr_2   = [?e1]hs1
  span_2  = [?s1]75e6-80e6
  size_2  = [?>1]5e3;;[?>]500e3
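
Written out as a boolean expression, these four rules amount to the following Python sketch. It mirrors the rules as written rather than the filterlinks internals: the size of an end is assumed to be end minus start, the span test is assumed to be an inclusive intersection, and the function name is hypothetical.

```python
def passes_mixed_filter(chr_1, chr_2, start_2, end_2):
    """Mirror the chr_1, chr_2, span_2 and size_2 rules above."""
    size_2 = end_2 - start_2            # assumption: size is end - start
    set_1 = (
        chr_1 == "cf6"                                # chr_1  = [?e1]cf6
        and chr_2 == "hs1"                            # chr_2  = [?e1]hs1
        and start_2 <= 80e6 and end_2 >= 75e6         # span_2 = [?s1]75e6-80e6
        and size_2 > 5e3                              # size_2 = [?>1]5e3
    )
    unconditional = size_2 > 500e3                    # size_2 = [?>]500e3
    return set_1 or unconditional


print(passes_mixed_filter("cf6", "hs1", 76_000_000, 76_010_000))
print(passes_mixed_filter("cf6", "hs1", 76_000_000, 76_001_000))
```

The four conditions carrying ID 1 form one AND'ed set, and the size condition without an ID is OR'ed with that set, so a second end larger than 500kb passes regardless of the chromosome and span conditions.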


HISTORY


BUGS


AUTHOR

Martin Krzywinski


CONTACT

  Martin Krzywinski
  Genome Sciences Centre
  Vancouver BC Canada
  www.bcgsc.ca
  martink@bcgsc.ca