<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://unlarchive.org/wiki/index.php?action=history&amp;feed=atom&amp;title=N-gram</id>
	<title>N-gram - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://unlarchive.org/wiki/index.php?action=history&amp;feed=atom&amp;title=N-gram"/>
	<link rel="alternate" type="text/html" href="https://unlarchive.org/wiki/index.php?title=N-gram&amp;action=history"/>
	<updated>2026-05-16T20:33:17Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.44.2</generator>
	<entry>
		<id>https://unlarchive.org/wiki/index.php?title=N-gram&amp;diff=15741&amp;oldid=prev</id>
		<title>imported&gt;Martins at 14:03, 12 May 2015</title>
		<link rel="alternate" type="text/html" href="https://unlarchive.org/wiki/index.php?title=N-gram&amp;diff=15741&amp;oldid=prev"/>
		<updated>2015-05-12T14:03:04Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 14:03, 12 May 2015&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l1&quot;&gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In the scope of the project [[LACE]], an &#039;&#039;&#039;n-gram&#039;&#039;&#039; is a linear structure of n strings composed entirely of any kind of letter from any language, i.e., the regex [\p{L}]+, isolated by blank space, punctuation marks, end of sentence, and any other character &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;(&lt;/del&gt;[.,;:!?()&quot;&amp;lt;&amp;gt;]&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;)&lt;/del&gt;. Strings containing digits or any non-alphabetical characters (such as [@_#$%/]) were ignored&amp;lt;ref&amp;gt;This means that the input string “abc def g-hi jkl m1 234 nop qrs tu_vw” was said to have:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In the scope of the project [[LACE]], an &#039;&#039;&#039;n-gram&#039;&#039;&#039; is a linear structure of n strings composed entirely of any kind of letter from any language, i.e., the regex [\p{L}]+, isolated by blank space, punctuation marks, end of sentence, and any other character &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;not included in \p{L}, such as &lt;/ins&gt;[.,;:!?()&quot;&amp;lt;&amp;gt;]. Strings containing digits or any non-alphabetical characters (such as [@_#$%/]) were ignored&amp;lt;ref&amp;gt;This means that the input string “abc def g-hi jkl m1 234 nop qrs tu_vw” was said to have:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*six 1-grams (“abc”, “def”, “g-hi”, “jkl”, “nop”, “qrs”)&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*six 1-grams (“abc”, “def”, “g-hi”, “jkl”, “nop”, “qrs”)&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*four 2-grams (“abc def”, “def g-hi”, “g-hi jkl”, “nop qrs”)&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*four 2-grams (“abc def”, “def g-hi”, “g-hi jkl”, “nop qrs”)&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>imported&gt;Martins</name></author>
	</entry>
	<entry>
		<id>https://unlarchive.org/wiki/index.php?title=N-gram&amp;diff=15740&amp;oldid=prev</id>
		<title>imported&gt;Martins at 14:02, 12 May 2015</title>
		<link rel="alternate" type="text/html" href="https://unlarchive.org/wiki/index.php?title=N-gram&amp;diff=15740&amp;oldid=prev"/>
		<updated>2015-05-12T14:02:22Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 14:02, 12 May 2015&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l1&quot;&gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In the scope of the project [[LACE]], an &#039;&#039;&#039;n-gram&#039;&#039;&#039; is a linear structure of n strings composed entirely of &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;alphabetical characters or hyphen (&lt;/del&gt;i.e., [&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;a..zA..Z-&lt;/del&gt;]&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;) &lt;/del&gt;isolated by blank space, punctuation marks, end of sentence, and other &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;signs such as &lt;/del&gt;([.,;:!?()&quot;&amp;lt;&amp;gt;]). Strings containing digits or any non-alphabetical characters (such as [@_#$%/]) were ignored&amp;lt;ref&amp;gt;This means that the input string “abc def g-hi jkl m1 234 nop qrs tu_vw” was said to have:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In the scope of the project [[LACE]], an &#039;&#039;&#039;n-gram&#039;&#039;&#039; is a linear structure of n strings composed entirely of &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;any kind of letter from any language, &lt;/ins&gt;i.e., &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;the regex &lt;/ins&gt;[&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;\p{L}&lt;/ins&gt;]&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;+, 
&lt;/ins&gt;isolated by blank space, punctuation marks, end of sentence, and &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;any &lt;/ins&gt;other &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;character &lt;/ins&gt;([.,;:!?()&quot;&amp;lt;&amp;gt;]). Strings containing digits or any non-alphabetical characters (such as [@_#$%/]) were ignored&amp;lt;ref&amp;gt;This means that the input string “abc def g-hi jkl m1 234 nop qrs tu_vw” was said to have:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*six 1-grams (“abc”, “def”, “g-hi”, “jkl”, “nop”, “qrs”)&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*six 1-grams (“abc”, “def”, “g-hi”, “jkl”, “nop”, “qrs”)&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*four 2-grams (“abc def”, “def g-hi”, “g-hi jkl”, “nop qrs”)&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*four 2-grams (“abc def”, “def g-hi”, “g-hi jkl”, “nop qrs”)&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>imported&gt;Martins</name></author>
	</entry>
	<entry>
		<id>https://unlarchive.org/wiki/index.php?title=N-gram&amp;diff=15739&amp;oldid=prev</id>
		<title>imported&gt;Martins at 11:19, 30 January 2015</title>
		<link rel="alternate" type="text/html" href="https://unlarchive.org/wiki/index.php?title=N-gram&amp;diff=15739&amp;oldid=prev"/>
		<updated>2015-01-30T11:19:52Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 11:19, 30 January 2015&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l16&quot;&gt;Line 16:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 16:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*:In the context of LACE, we considered an n-gram to be frequent in the corpus if its frequency of occurrence is equal to or higher than the ratio of tokens to types, where “tokens” is the total number of n-gram occurrences in the corpus, and “types” is the number of distinct n-grams in the corpus. For instance, given a corpus with 5,000 occurrences of 1,000 distinct unigrams, a 1-gram is considered frequent if, and only if, it occurs 5 or more times (5,000 / 1,000 = 5).&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*:In the context of LACE, we considered an n-gram to be frequent in the corpus if its frequency of occurrence is equal to or higher than the ratio of tokens to types, where “tokens” is the total number of n-gram occurrences in the corpus, and “types” is the number of distinct n-grams in the corpus. For instance, given a corpus with 5,000 occurrences of 1,000 distinct unigrams, a 1-gram is considered frequent if, and only if, it occurs 5 or more times (5,000 / 1,000 = 5).&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*Redundancy:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*Redundancy:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*:In the context of LACE, we consider an n-gram to be redundant if it is subsumed by any other x-gram, where x ≥ n. In that sense, the 1-gram “a” is considered unique if, and only if, there is at least one context “x a” and at least one context “a y”, where “x a” and “a y” have not been defined as n-grams according to the criteria above concerning length and frequency. For instance, the items “Sri” and “Lanka” are not considered to be 1-grams because they cannot occur in isolation: they always appear as part of the 2-gram “Sri Lanka” (i.e., there is no context in the corpus in which we have “Sri” but not “Lanka”). The same applies to discontinuous n-grams: the sequence “a . . d” is a 4-gram if it is not subsumed by the 4-gram “a b . d”, i.e., if there is at least one “a x . d” where x ≠ b.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*:In the context of LACE, we consider an n-gram to be redundant if it is subsumed by any other x-gram, where x ≥ n. In that sense, the 1-gram “a” is considered unique if, and only if, there is at least one context “x a” and at least one context “a y”, where “x a” and “a y” have not been defined as n-grams according to the criteria above concerning length and frequency. 
For instance, the items “Sri” and “Lanka” are not considered to be 1-grams because they cannot occur in isolation: they always appear as part of the 2-gram “Sri Lanka” (i.e., there is no context in the corpus in which we have “Sri” but not “Lanka”). The same applies to discontinuous n-grams: the sequence “a . . d” is a 4-gram if it is not subsumed by the 4-gram “a b . d”, i.e., if there is at least one “a x . d” where x ≠ b&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;*Constituency&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;:The constituency score is the probability that a given n-gram functions as a syntactic unit (i.e., a “constituent”) in a sentence. For the time being, it is defined as the weighted average of two independent measures, distribution and substitution, as described at [[constituency score]]&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Notes ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Notes ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;references /&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;references /&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>imported&gt;Martins</name></author>
	</entry>
	<entry>
		<id>https://unlarchive.org/wiki/index.php?title=N-gram&amp;diff=15738&amp;oldid=prev</id>
		<title>imported&gt;Martins at 17:28, 11 July 2013</title>
		<link rel="alternate" type="text/html" href="https://unlarchive.org/wiki/index.php?title=N-gram&amp;diff=15738&amp;oldid=prev"/>
		<updated>2013-07-11T17:28:09Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 17:28, 11 July 2013&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l1&quot;&gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In the scope of the project LACE, an &#039;&#039;&#039;n-gram&#039;&#039;&#039; is a linear structure of n strings composed entirely of alphabetical characters or hyphen (i.e., [a..zA..Z-]) isolated by blank space, punctuation marks, end of sentence, and other signs such as ([.,;:!?()&quot;&amp;lt;&amp;gt;]). Strings containing digits or any non-alphabetical characters (such as [@_#$%/]) were ignored&amp;lt;ref&amp;gt;This means that the input string “abc def g-hi jkl m1 234 nop qrs tu_vw” was said to have:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In the scope of the project &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;[[&lt;/ins&gt;LACE&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;]]&lt;/ins&gt;, an &#039;&#039;&#039;n-gram&#039;&#039;&#039; is a linear structure of n strings composed entirely of alphabetical characters or hyphen (i.e., [a..zA..Z-]) isolated by blank space, punctuation marks, end of sentence, and other signs such as ([.,;:!?()&quot;&amp;lt;&amp;gt;]). Strings containing digits or any non-alphabetical characters (such as [@_#$%/]) were ignored&amp;lt;ref&amp;gt;This means that the input string “abc def g-hi jkl m1 234 nop qrs tu_vw” was said to have:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*six 1-grams (“abc”, “def”, “g-hi”, “jkl”, “nop”, “qrs”)&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*six 1-grams (“abc”, “def”, “g-hi”, “jkl”, “nop”, “qrs”)&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*four 2-grams (“abc def”, “def g-hi”, “g-hi jkl”, “nop qrs”)&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*four 2-grams (“abc def”, “def g-hi”, “g-hi jkl”, “nop qrs”)&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l6&quot;&gt;Line 6:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 6:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The strings &amp;quot;m1&amp;quot;, &amp;quot;234&amp;quot; and &amp;quot;tu_vw&amp;quot; were not considered valid and, therefore, any n-grams including them were excluded from the results.&amp;lt;/ref&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The strings &amp;quot;m1&amp;quot;, &amp;quot;234&amp;quot; and &amp;quot;tu_vw&amp;quot; were not considered valid and, therefore, any n-grams including them were excluded from the results.&amp;lt;/ref&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Continuous and Discontinuous N-grams ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Continuous and Discontinuous N-grams ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In the scope of the project LACE, n-grams are said to be &quot;continuous&quot; or &quot;discontinuous&quot;:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In the scope of the project &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;[[&lt;/ins&gt;LACE&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;]]&lt;/ins&gt;, n-grams are said to be &quot;continuous&quot; or &quot;discontinuous&quot;:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*a &amp;#039;&amp;#039;&amp;#039;continuous n-gram&amp;#039;&amp;#039;&amp;#039; is an invariant sequence of n immediately adjacent items, i.e., without any other items in-between;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*a &amp;#039;&amp;#039;&amp;#039;continuous n-gram&amp;#039;&amp;#039;&amp;#039; is an invariant sequence of n immediately adjacent items, i.e., without any other items in-between;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*a &amp;#039;&amp;#039;&amp;#039;discontinuous n-gram&amp;#039;&amp;#039;&amp;#039; is an open pattern: it is a continuous n-gram where some items are variable, i.e., a sequence of n items (x+y=n) in which the x fixed items always occur in the same positions and are separated by the same number of variable in-between items (y). A discontinuous n-gram is valid if, for the same fixed items x, there are at least two different fillers y; otherwise, we consider it noise. A discontinuous n-gram may have one or more discontinuities, but due to the necessity of defining its external boundaries, we limit the notion of discontinuity to the internal items of an n-gram. In our notation, discontinuities are represented by the placeholder &amp;quot;.&amp;quot;&amp;lt;ref&amp;gt;In the example above, there are two discontinuous 3-grams (“abc . g-hi”, “def . jkl”) and three discontinuous 4-grams (“abc . . jkl”, “abc def . jkl”, “abc . g-hi jkl”).&amp;lt;/ref&amp;gt; Given the precondition of external boundaries, discontinuous n-grams should meet the requirement: n&amp;gt;2.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*a &amp;#039;&amp;#039;&amp;#039;discontinuous n-gram&amp;#039;&amp;#039;&amp;#039; is an open pattern: it is a continuous n-gram where some items are variable, i.e., a sequence of n items (x+y=n) in which the x fixed items always occur in the same positions and are separated by the same number of variable in-between items (y). 
A discontinuous n-gram is valid if, for the same fixed items x, there are at least two different fillers y; otherwise, we consider it noise. A discontinuous n-gram may have one or more discontinuities, but due to the necessity of defining its external boundaries, we limit the notion of discontinuity to the internal items of an n-gram. In our notation, discontinuities are represented by the placeholder &amp;quot;.&amp;quot;&amp;lt;ref&amp;gt;In the example above, there are two discontinuous 3-grams (“abc . g-hi”, “def . jkl”) and three discontinuous 4-grams (“abc . . jkl”, “abc def . jkl”, “abc . g-hi jkl”).&amp;lt;/ref&amp;gt; Given the precondition of external boundaries, discontinuous n-grams should meet the requirement: n&amp;gt;2.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Relevance ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Relevance ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In the scope of the project LACE, N-grams are considered to be &#039;&#039;&#039;linguistically-relevant&#039;&#039;&#039; if they are frequent, non-redundant, of a certain length, and may figure as syntactic and semantic units, according to the following criteria:    &lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In the scope of the project &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;[[&lt;/ins&gt;LACE&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;]]&lt;/ins&gt;, N-grams are considered to be &#039;&#039;&#039;linguistically-relevant&#039;&#039;&#039; if they are frequent, non-redundant, of a certain length, and may figure as syntactic and semantic units, according to the following criteria:    &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*Length:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*Length:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*:In the context of LACE, we treated both continuous and discontinuous n-grams with up to 7 items, i.e., where 1 ≤ n ≤ 7.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*:In the context of LACE, we treated both continuous and discontinuous n-grams with up to 7 items, i.e., where 1 ≤ n ≤ 7.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>imported&gt;Martins</name></author>
	</entry>
	<entry>
		<id>https://unlarchive.org/wiki/index.php?title=N-gram&amp;diff=15737&amp;oldid=prev</id>
		<title>imported&gt;Martins at 17:27, 11 July 2013</title>
		<link rel="alternate" type="text/html" href="https://unlarchive.org/wiki/index.php?title=N-gram&amp;diff=15737&amp;oldid=prev"/>
		<updated>2013-07-11T17:27:17Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 17:27, 11 July 2013&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l4&quot;&gt;Line 4:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 4:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*two 3-grams (“abc def g-hi”, “def g-hi jkl”)&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*two 3-grams (“abc def g-hi”, “def g-hi jkl”)&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*one 4-gram (“abc def g-hi jkl”).  &lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*one 4-gram (“abc def g-hi jkl”).  &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The strings &quot;m1&quot;, &quot;234&quot; and &quot;tu_vw&quot; were not considered valid and, therefore, any n-grams including them were excluded from the results.&amp;lt;/ref&amp;gt;.&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;br /&amp;gt;&amp;lt;br /&amp;gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The strings &quot;m1&quot;, &quot;234&quot; and &quot;tu_vw&quot; were not considered valid and, therefore, any n-grams including them were excluded from the results.&amp;lt;/ref&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Continuous and Discontinuous N-grams ==&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In the scope of the project LACE, n-grams are said to be &amp;quot;continuous&amp;quot; or &amp;quot;discontinuous&amp;quot;:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In the scope of the project LACE, n-grams are said to be &amp;quot;continuous&amp;quot; or &amp;quot;discontinuous&amp;quot;:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*a &amp;#039;&amp;#039;&amp;#039;continuous n-gram&amp;#039;&amp;#039;&amp;#039; is an invariant sequence of n immediately adjacent items, i.e., without any other items in-between;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*a &amp;#039;&amp;#039;&amp;#039;continuous n-gram&amp;#039;&amp;#039;&amp;#039; is an invariant sequence of n immediately adjacent items, i.e., without any other items in-between;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*a &#039;&#039;&#039;discontinuous n-gram&#039;&#039;&#039;  is an open pattern: it is a continuous n-gram where some items are variable, i.e., a sequence of x and y (x+y=n) items where x items come in the same position and are isolated by the same number of y in-between items. A discontinuous n-gram is valid if, for the same x, there are at least two different y, otherwise we consider it noise. A discontinuous n-gram may have one or more discontinuities, but due to the necessity of defining its external boundaries, we limit the notion of discontinuity to the internal items of an n-gram. In our notation, discontinuities are represented by the place holder &quot;.&quot;&amp;lt;ref&amp;gt;In the example above, there are two discontinuous 3-grams (“abc . g-hi”, “def . jkl”) and three discontinuous 4-grams (“abc . . jkl”, “abc def . jkl”, “abc . g-hi jkl”).&amp;lt;/ref&amp;gt; Given the precondition of external boundaries, discontinuous n-grams should meet the requirement: n&amp;gt;2.&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;br /&amp;gt;&amp;lt;br /&amp;gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*a &#039;&#039;&#039;discontinuous n-gram&#039;&#039;&#039;  is an open pattern: it is a continuous n-gram where some items are variable, i.e., a sequence of x and y (x+y=n) items where x items come in the same position and are isolated by the same number of y in-between items. A discontinuous n-gram is valid if, for the same x, there are at least two different y, otherwise we consider it noise. A discontinuous n-gram may have one or more discontinuities, but due to the necessity of defining its external boundaries, we limit the notion of discontinuity to the internal items of an n-gram. In our notation, discontinuities are represented by the place holder &quot;.&quot;&amp;lt;ref&amp;gt;In the example above, there are two discontinuous 3-grams (“abc . g-hi”, “def . jkl”) and three discontinuous 4-grams (“abc . . jkl”, “abc def . jkl”, “abc . g-hi jkl”).&amp;lt;/ref&amp;gt; Given the precondition of external boundaries, discontinuous n-grams should meet the requirement: n&amp;gt;2.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Relevance ==&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In the scope of the project LACE, N-grams are considered to be &amp;#039;&amp;#039;&amp;#039;linguistically-relevant&amp;#039;&amp;#039;&amp;#039; if they are frequent, non-redundant, of a certain length, and may figure as syntactic and semantic units, according to the following criteria:    &lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In the scope of the project LACE, N-grams are considered to be &amp;#039;&amp;#039;&amp;#039;linguistically-relevant&amp;#039;&amp;#039;&amp;#039; if they are frequent, non-redundant, of a certain length, and may figure as syntactic and semantic units, according to the following criteria:    &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*Length:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;*Length:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>imported&gt;Martins</name></author>
	</entry>
	<entry>
		<id>https://unlarchive.org/wiki/index.php?title=N-gram&amp;diff=15736&amp;oldid=prev</id>
		<title>imported&gt;Martins at 17:26, 11 July 2013</title>
		<link rel="alternate" type="text/html" href="https://unlarchive.org/wiki/index.php?title=N-gram&amp;diff=15736&amp;oldid=prev"/>
		<updated>2013-07-11T17:26:16Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;In the scope of the project LACE, an &amp;#039;&amp;#039;&amp;#039;n-gram&amp;#039;&amp;#039;&amp;#039; is a linear structure of n strings composed entirely of alphabetical characters or hyphen (i.e., [a..zA..Z-]) isolated by blank space, punctuation marks, end of sentence, and other signs such as ([.,;:!?()&amp;quot;&amp;lt;&amp;gt;]). Strings containing digits or any non-alphabetical characters (such as [@_#$%/]) were ignored&amp;lt;ref&amp;gt;This means that the input string “abc def g-hi jkl m1 234 nop qrs tu_vw” was said to have:&lt;br /&gt;
*six 1-grams (“abc”, “def”, “g-hi”, “jkl”, “nop”, “qrs”)&lt;br /&gt;
*four 2-grams (“abc def”, “def g-hi”, “g-hi jkl”, “nop qrs”)&lt;br /&gt;
*two 3-grams (“abc def g-hi”, “def g-hi jkl”)&lt;br /&gt;
*one 4-gram (“abc def g-hi jkl”). &lt;br /&gt;
The strings &amp;quot;m1&amp;quot;, &amp;quot;234&amp;quot; and &amp;quot;tu_vw&amp;quot; were not considered valid and, therefore, any n-grams including them were excluded from the results.&amp;lt;/ref&amp;gt;.&amp;lt;br /&amp;gt;&amp;lt;br /&amp;gt;&lt;br /&gt;
In the scope of the project LACE, n-grams are said to be &amp;quot;continuous&amp;quot; or &amp;quot;discontinuous&amp;quot;:&lt;br /&gt;
*a &amp;#039;&amp;#039;&amp;#039;continuous n-gram&amp;#039;&amp;#039;&amp;#039; is an invariant sequence of n immediately adjacent items, i.e., without any other items in-between;&lt;br /&gt;
*a &amp;#039;&amp;#039;&amp;#039;discontinuous n-gram&amp;#039;&amp;#039;&amp;#039;  is an open pattern: it is a continuous n-gram where some items are variable, i.e., a sequence of x and y (x+y=n) items where x items come in the same position and are isolated by the same number of y in-between items. A discontinuous n-gram is valid if, for the same x, there are at least two different y, otherwise we consider it noise. A discontinuous n-gram may have one or more discontinuities, but due to the necessity of defining its external boundaries, we limit the notion of discontinuity to the internal items of an n-gram. In our notation, discontinuities are represented by the place holder &amp;quot;.&amp;quot;&amp;lt;ref&amp;gt;In the example above, there are two discontinuous 3-grams (“abc . g-hi”, “def . jkl”) and three discontinuous 4-grams (“abc . . jkl”, “abc def . jkl”, “abc . g-hi jkl”).&amp;lt;/ref&amp;gt; Given the precondition of external boundaries, discontinuous n-grams should meet the requirement: n&amp;gt;2.&amp;lt;br /&amp;gt;&amp;lt;br /&amp;gt;&lt;br /&gt;
In the scope of the project LACE, N-grams are considered to be &amp;#039;&amp;#039;&amp;#039;linguistically-relevant&amp;#039;&amp;#039;&amp;#039; if they are frequent, non-redundant, of a certain length, and may figure as syntactic and semantic units, according to the following criteria:   &lt;br /&gt;
*Length:&lt;br /&gt;
*:In the context of LACE, we treated both continuous and discontinuous n-grams with up to 7 items, i.e., where 1 ≤ n ≤ 7.&lt;br /&gt;
*Frequency:&lt;br /&gt;
*:In the context of LACE, we considered an n-gram to be frequent in the corpus if its frequency of occurrence is equal to or higher than the ratio between tokens and types, where “tokens” is the total number of n-gram occurrences in the corpus, and “types” is the number of distinct n-grams in the corpus. For instance: given a corpus with 5,000 occurrences of 1,000 distinct unigrams, the ratio is 5,000 / 1,000 = 5, so a 1-gram is considered frequent if, and only if, it occurs 5 or more times.&lt;br /&gt;
*Redundancy:&lt;br /&gt;
*:In the context of LACE, we consider an n-gram to be redundant if it is subsumed by any other x-gram, where x ≥ n. In that sense, the 1-gram “a” is considered unique if, and only if, there is at least one context “x a” and at least one context “a y”, where “x a” and “a y” have not been defined as n-grams according to the criteria above concerning length and frequency. For instance, the items “Sri” and “Lanka” are not considered to be 1-grams because they cannot occur in isolation: they always appear as part of the 2-gram “Sri Lanka” (i.e., there is no context in the corpus in which we have “Sri” but not “Lanka”). The same applies to discontinuous n-grams: the sequence “a . . d” is a 4-gram if it is not subsumed by the 4-gram “a b . d”, i.e., if there is at least one “a x . d” where x ≠ b.&lt;br /&gt;
&lt;br /&gt;
== Notes ==&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;/div&gt;</summary>
		<author><name>imported&gt;Martins</name></author>
	</entry>
</feed>