Options

Options
Prev	JOIN	Next

The following joinflags options may be specified to determine exactly how the JOIN executes.

PARTITION LEFT \| RIGHT	Specifies which recordset provides the partition points that determine how the records are sorted and distributed amongst the supercomputer nodes. PARTITION RIGHT specifies the rightrecset while PARTITION LEFT specifies the leftrecset. If omitted, PARTITION LEFT is the default.
[MANY] LOOKUP	Specifies the rightrecset is a relatively small file of lookup records. If the LOCAL option is not present a hash table of the entire rightrecset is fully copied to every node, but if the LOCAL option is present the hash table is created locally on each node. If MANY is not present, the rightrecset records bear a Many to 0/1 relationship with the records in the leftrecset (for each record in the leftrecset there is at most 1 record in the rightrecset, enforced by an implicit dedup if there are duplicate records in the lookup table based on the key fields used in the joincondition). If MANY is present, the rightrecset records bear a Many to 0/Many relationship with the records in the leftrecset. This option allows the optimizer to avoid unnecessary sorting of the leftrecset. Valid only for inner, LEFT OUTER, or LEFT ONLY jointypes. The ATMOST, LIMIT, and KEEP options are supported in conjunction with MANY LOOKUP.
SMART	Specifies to use an in-memory lookup when possible, but use a distributed join if the right dataset is large.
FEW	Specifies the LOOKUP rightrecset has few records, so little memory is used, allowing multiple lookup joins to be included in the same Thor subgraph.
GROUPED	Specifies the same action as MANY LOOKUP but preserves grouping. Primarily used in the rapid Data Delivery Engine. Valid only for inner, LEFT OUTER, or LEFT ONLY jointypes. The ATMOST, LIMIT, and KEEP options are supported in conjunction with GROUPED.
ALL	Specifies the rightrecset is a small file that can be fully copied to every node, which allows the compiler to ignore the lack of any "equality" portion to the condition, eliminating the "join too complex" error that the condition would normally produce. If an "equality" portion is present, the JOIN is internally executed as a MANY LOOKUP. The KEEP option is supported in conjunction with this option.
NOSORT	Performs the JOIN without dynamically sorting the tables. This implies that the leftrecset and/or rightrecset must have been previously sorted and partitioned based on the fields specified in the joincondition so that records can be easily matched.
which	Optional. The keywords LEFT or RIGHT to indicate the leftrecset or rightrecset has been previously sorted. If omitted, NOSORT assumes both the leftrecset and rightrecset have been previously sorted.
KEYED	Specifies using indexed access into the rightrecset (see INDEX).
index	Optional. The name of an INDEX into the rightrecset for a full-keyed JOIN (see below). If omitted, indicates the rightrecset will always be an INDEX (useful when the rightrecset is passed in as a parameter to a function).
UNORDERED	Optional. Specifies the KEYED JOIN operation does not preserve the sort order of the leftrecset.
LOCAL	Specifies the operation is performed on each supercomputer node independently, without requiring interaction with all other nodes to acquire data; the operation maintains the distribution of any previous DISTRIBUTE.
HASH	Specifies an implicit DISTRIBUTE of the leftrecset and rightrecset across the supercomputer nodes based on the joincondition so each node can do its job with local data.
KEEP(n)	Specifies the maximum number of matching records (n) to generate into the result set from each leftrecset record. If omitted, all matches are kept. This is useful where there may be many matching pairs and you need to limit the number in the result set. KEEP is not supported for RIGHT OUTER, RIGHT ONLY, LEFT ONLY, or FULL ONLY jointypes.
ATMOST	Specifies a maximum number of matching records which, if exceeded, eliminates all those matches from the result set. This is useful for situations where you need to eliminate all "too many matches" record pairs from the result set. ATMOST is not supported on RIGHT ONLY or RIGHT OUTER jointypes. There are two forms: ATMOST(condition, n) -- maximum is computed only for the condition. ATMOST(n) -- maximum is computed for the entire joincondition, unless KEYED is used in the joincondition, in which case only the KEYED expressions are used. When ATMOST is specified (and the JOIN is not full or half-keyed), the joincondition and condition may include string field comparisons that use string indexing with an asterisk as the upper bound, as in this example: J1 := JOIN(dsL,dsR, LEFT.name[1..]=RIGHT.name[3..] AND LEFT.val < RIGHT.val, T(LEFT,RIGHT), ATMOST(LEFT.name[1..]=RIGHT.name[3..],3)); The asterisk indicates matching as many characters as necessary to reduce the number of candidate matches to below the ATMOST number (n).
condition	A portion of the joincondition expression.
n	Specifies the maximum number of matches allowed.
LIMIT	Specifies a maximum number of matching records which, if exceeded, either fails the job, or eliminates all those matches from the result set. This is useful for situations where you need to eliminate all "too many matches" record pairs from the result set. Typically used for KEYED and "half-keyed" joins (see below), LIMIT differs from ATMOST primarily by its affect on a LEFT OUTER join, in which a leftrecset record with too many matching records would be treated as a non-match by ATMOST (the leftrecset record would be in the output with no matching rightrecset records), whereas LIMIT would either fail the job entirely, or SKIP the record (eliminating the leftrecset record entirely from the output). The LIMIT is applied to the set of records that meet the the hard match ("equality") portion of the joincondition but before the soft match ("non-equality") portion of the joincondition is evaluated. If omitted, the default is LIMIT(10000). The implicit limit is only added if there is no explicit LIMIT, no ATMOST, and no KEEP, or has a non-keyed filter and isn't a left only JOIN.
value	The maximum number of matches allowed; LIMIT(0) is unlimited.
SKIP	Optional. Specifies eliminating all the matching records whose total number exceeds the maximum value of the LIMIT result instead of failing the job.
transform	Optional. Specifies outputting a single record produced by the transform instead of failing the workunit (similar to the ONFAIL option of the LIMIT function).
FAIL	Optional. Specifies using the FAIL action to configure the error message when the job fails.
SKEW	Indicates that you know the data for this join will not be spread evenly across nodes (will be skewed after both files have been distributed based on the join condition) and you choose to override the default by specifying your own limit value to allow the job to continue despite the skewing. Only valid on non-keyed joins (the KEYED option is not present and the rightrecset is not an INDEX).
limit	A value between zero (0) and one (1.0 = 100%) indicating the maximum percentage of skew to allow before the job fails (the default skew is 1.0 / <number of worker nodes on cluster>).
target	Optional. A value between zero (0) and one (1.0 = 100%) indicating the desired maximum percentage of skew to allow (the default skew is 1.0 / <number of worker nodes on cluster>).
THRESHOLD	Indicates the minimum size for a single part of either the leftrecset or rightrecset before the SKEW limit is enforced. Only valid on non-keyed joins (the KEYED option is not present and the rightrecset is not an INDEX).
size	An integer value indicating the minimum number of bytes for a single part.
UNORDERED	Optional. Specifies the output record order is not significant.
ORDERED	Specifies the significance of the output record order.
bool	When False, specifies the output record order is not significant. When True, specifies the default output record order.
STABLE	Optional. Specifies the input record order is significant.
UNSTABLE	Optional. Specifies the input record order is not significant.
PARALLEL	Optional. Try to evaluate this activity in parallel.
numthreads	Optional. Try to evaluate this activity using numthreads threads.
ALGORITHM	Optional. Override the algorithm used for this activity.
name	The algorithm to use for this activity. Must be from the list of supported algorithms for the SORT function's STABLE and UNSTABLE options.

In addition to this list, the KEYED and LOCAL options are also mutually exclusive with the options listed above, but not to each other. When both KEYED and LOCAL options are specified, only the INDEX part(s) on each node are accessed by that node.

Typically, the leftrecset should be larger than the rightrecset to prevent skewing problems (because PARTITION LEFT is the default behavior). If the LOOKUP or ALL options are specified, the rightrecset must be small enough to be loaded into memory on every node, and the operation is then implicitly LOCAL. The ALL option is impractical if the rightrecset is larger than a few thousand records (due to the number of comparisons required). The size of the rightrecset is irrelevant in the case of "half-keyed" and "full-keyed" JOINs (see the Keyed Join discussion below).

Use SMART when the right side dataset is likely to be small enough to fit in memory, but is not guaranteed to fit.

If you get an error similar to this:

"error: 1301: Pool memory exhausted:..."

this means the rightrecset is too large and a LOOKUP JOIN should not be used. A SMART JOIN may be a good option in this case.

Example:

//this example demos the implicit DEDUP of the lookup dataset
r1 := {INTEGER1 fred};
r2 := {INTEGER1 fred,INTEGER1 sue};
ds1 := DATASET([1,2,3,4],r1);
ds2 := DATASET([{1,2},{2,2},
                {1,1},{2,1}, 
                {3,1},{4,1}],r2);
r2 XF(ds1 L, ds2 R) := TRANSFORM
  SELF.fred := L.fred;
  SELF.sue := R.sue;
END;
JOIN(ds1,ds2,
     LEFT.Fred = RIGHT.fred,
     XF(LEFT,RIGHT),LOOKUP);

Prev	Up	Next
Matching Logic - JOIN	Home	Keyed Joins