Search Syntax
Form of a Search Expression
We recommend that you start by putting the regex pattern that you want to use to match the document content first, followed by any search filtering options:
Recommended: Content search pattern followed by options
regex_to_match_doc_content_here [options]
Strictly speaking, options are not required to come after the content pattern. You can put options before and after the content expression, though we don’t recommend it.
Legal but not recommended:
[options] regex_to_match_doc_content_here [more_options]
--------- --------------
^ ^
| |
+--------------------+-----------------------+
|
|
Options can come both before and after
the content search pattern
The content search pattern must be contiguous.
Should be illegal:
[options] content_pattern_part_1 [more_options] content_pattern_part_2
---------------------- ----------------------
^ ^
| |
+------------------+--------------------+
|
|
This should be illegal because the two content sections
are discontiguous. However, I think there are some
bugs in this area and we're not currently catching this.
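The contiguity rule can be sketched as a small parser, which may also be handy when writing test cases for this area. This is only an illustration, not the actual OpenSquiggly parser: the option names come from the List of Search Options, while the function name `parse_search` and the whitespace-based tokenizing are assumptions made for the example.

```python
import re

# Option names are taken from the List of Search Options.
# The tokenizing approach itself is hypothetical.
OPTION_RE = re.compile(
    r"^-?(case|count|maxscale|maxdocsize|includeskipped|timeout"
    r"|doctype|file|repo|title|ref):\S+$"
)

def parse_search(expression):
    """Split a search expression into (content_pattern, options).

    Raises ValueError if the content pattern is not contiguous.
    """
    options = []
    content_runs = []          # each run is a list of adjacent content tokens
    previous_was_content = False
    for token in expression.split():
        if OPTION_RE.match(token):
            options.append(token)
            previous_was_content = False
        else:
            if not previous_was_content:
                content_runs.append([])   # a new run of content starts here
            content_runs[-1].append(token)
            previous_was_content = True
    if len(content_runs) > 1:
        raise ValueError("content pattern must be contiguous")
    content = " ".join(content_runs[0]) if content_runs else ""
    return content, options
```

Splitting on whitespace collapses runs of spaces inside the content pattern, so a real parser would need to preserve the original text. The sketch is only meant to show where the "discontiguous" case would be rejected.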
Regular Expressions Pattern vs. Keyword Patterns
Currently all searches are interpreted as regular expressions, regardless of whether or not the “.*” button is checked (the little box immediately following the search input in the header bar).
Search test cases should be written assuming regex. We will need to revisit our search plans after we add keyword-based content searching.
List of Search Options
The following search options are available. Some options can be specified more than once; in that case they are cumulative and are combined with a boolean “and”, as described in the sections below.
Option Name | Default Value | Description
---|---|---
Settings Filters (Can Be Specified Only Once) | |
case:[yes\|no] | no | Turns on case-sensitive searching. By default, searching is case insensitive.
count:nnn | 30 | Specifies the maximum number of search results to return in the results list.
maxscale:nnn | 5 | Affects the maximum number of candidate documents that are searched.
maxdocsize:nnn | 65535 | Documents bigger than this are skipped.
includeskipped:[yes\|no] | no | Whether or not large skipped documents should be returned in the results list.
timeout:nnn | 5 | Abort the search if it takes longer than nnn seconds.
Include/Exclude Filters (Can Be Specified More Than Once) | |
doctype:[page\|repo] | N/A | Includes documents with a matching sourcetype. Sourcetype values can be one of the following: page, repo.
-doctype:[page\|repo] | N/A | Excludes documents with a matching sourcetype.
file:regex_pattern | N/A | Include documents with filenames (or topic page titles) matching the specified pattern. The pattern is a regular expression.
-file:regex_pattern | N/A | Exclude documents with filenames (or topic page titles) matching the specified pattern. The pattern is a regular expression.
repo:regex_pattern | N/A | Include documents with repository names matching the specified pattern. The pattern is a regular expression. Note, this is the full repository name after the mount point's Git URL Pattern field has been fully evaluated with all macros expanded. Also note that specifying this option implies
-repo:regex_pattern | N/A | Exclude documents with repository names matching the specified pattern. The pattern is a regular expression. Also note that specifying this option implies
title:regex_pattern | N/A | Synonymous with the file:regex_pattern option.
-title:regex_pattern | N/A | Synonymous with the -file:regex_pattern option.
Case Sensitivity
If case:yes is specified, the content is searched with case sensitivity in effect. Otherwise, casing is ignored.
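As a rough illustration of the two modes, here is how the same behavior looks with Python's `re` module. The regex engine actually used by the search is not stated here, and the helper name `compile_content_pattern` is made up for the example.

```python
import re

def compile_content_pattern(pattern, case_sensitive=False):
    """Compile a content pattern; case is ignored unless case:yes was given."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile(pattern, flags)

# Default (case:no): "foreach" also matches "ForEach".
default_hit = compile_content_pattern("foreach").search("items.ForEach(x)")

# case:yes: only an exact-case match counts, so this finds nothing.
strict_hit = compile_content_pattern("foreach", case_sensitive=True).search("items.ForEach(x)")
```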
Maximum Count
Specifies the maximum number of documents to find. The default is 30. Note that if a larger number is specified, the search could take longer to complete, so the user may want to increase the timeout setting at the same time.
Document Type Inclusion or Exclusion
doctype:[page|repo]
-doctype:[page|repo]
With these options, you can narrow the search to include only pages (e.g., internally authored regular topic pages) or repository-based content (e.g., content from mount points).
Filename Inclusion or Exclusion
file:regex_pattern
-file:regex_pattern
With these options, you can include or exclude files based on their file names (or page title names).
Example 1 : Include only files located in the components folder
file:/components/
Example 2 : Include only files with an extension of “.js”
file:\.js$
Example 3 : Include files with an extension of “.js” or “.svelte”
Here, we use the regex “or” capability which is done using the vertical bar (|) character.
file:\.js$|\.svelte$
Example 4 : Include files that are in the components folder AND have an extension of “.js” or “.svelte”
If we specify multiple file filters, they are combined together using the “AND” boolean condition.
By using the regex vertical bar to specify “ors”, and using multiple file filters to specify “ands”, we can combine filters together to create complex conditions.
file:/components/ file:\.js$|\.svelte$
----------------- --------------------
^ ^
| |
File must contain File must end in either
"/components/" in ".js" or ".svelte"
its path name |
| |
+----------+---------+
|
AND (both subconditions must be true)
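The AND/OR combination can be expressed in a few lines. The sketch below is an assumption about the semantics rather than the real implementation: it treats every file: filter as a pattern that must match somewhere in the path, and every -file: filter as a pattern that must not match. The function name `passes_file_filters` is hypothetical.

```python
import re

def passes_file_filters(path, include_patterns, exclude_patterns=()):
    """Return True if path matches every include pattern and no exclude pattern."""
    # Multiple file: filters are ANDed together.
    if not all(re.search(p, path) for p in include_patterns):
        return False
    # Assumed: any matching -file: filter discards the document.
    return not any(re.search(p, path) for p in exclude_patterns)

# Equivalent of: file:/components/ file:\.js$|\.svelte$
includes = [r"/components/", r"\.js$|\.svelte$"]
```

With these filters, `src/components/Button.svelte` passes, while `src/components/Button.css` and `src/lib/util.js` are each rejected because one of the two conditions fails.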
Repository Name Inclusion or Exclusion
repo:regex_pattern
-repo:regex_pattern
With these options, the user can include or exclude results based on which repository they are in. These options are particularly useful if you have attached many repositories to your account and you want to focus the search on only some of them.
Example : Include only opensquiggly repositories, discard content from any other repository
repo:opensquiggly
As with the file:regex_pattern options, if multiple filters are specified, they are combined together using the AND boolean condition (i.e., all conditions must be true, otherwise the document is discarded from the search results).
Title Inclusion or Exclusion
title:regex_pattern
-title:regex_pattern
These options are synonymous with the file:regex_pattern and -file:regex_pattern filters.
Limiting the Number of Candidate Documents Searched
maxscale:nnn
To understand this option, you need to first understand a little bit about how we implement regular expression searching.
Without any indexing strategy, we would have to search every document in the user’s account to determine whether or not it matched the regex expression. This could be very slow, especially if the user creates many mount points within their account.
There is no known indexing strategy for regular expression searches that can quickly return only the matching documents. However, there is an indexing strategy called “trigram indexing” that can be used to reduce the number of documents that need to be searched with the full regex expression.
We call the documents that match the trigram index “candidate documents”. These are documents that “might” match the regex. We can picture the workflow as follows:
+------------------------------+
| All Documents |
+------------------------------+
|
v
+------------------------------+
| Trigram Index Query |
+------------------------------+
|
v
+------------------------------+
| Candidate Documents |
+------------------------------+
|
v
+------------------------------+
| Full Regex Search |
+------------------------------+
|
v
+------------------------------+
| Final list of matching |
| documents |
+------------------------------+
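The workflow above can be sketched in miniature. This is a toy model, not the production index: a real trigram index derives the required trigrams from the regex automatically, whereas here the literal fragments (`required_literals`) are passed in by hand, and all function names are made up for the example.

```python
import re
from collections import defaultdict

def trigrams(text):
    """All 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(docs):
    """Map each trigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tri in trigrams(text):
            index[tri].add(doc_id)
    return index

def candidates(index, docs, required_literals):
    """Documents that contain every trigram of every required literal."""
    result = set(docs)
    for literal in required_literals:
        for tri in trigrams(literal):
            result &= index.get(tri, set())
    return result

def search(docs, index, pattern, required_literals):
    """Run the full regex only against the candidate documents."""
    regex = re.compile(pattern)
    return sorted(
        doc_id for doc_id in candidates(index, docs, required_literals)
        if regex.search(docs[doc_id])
    )

docs = {
    "a.js": "for (x of this.items) {}",
    "b.js": "for" + " " * 10 + "this",
    "c.js": "while (true) {}",
}
index = build_index(docs)
```

For the pattern `for\s{10}this`, both `a.js` and `b.js` are candidates because they contain the trigrams of "for" and "this", but only `b.js` survives the full regex step.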
Depending on the regex expression and the content of the documents in the user’s database, we may get a high or low percentage hit rate of Candidate Documents.
If the percentage hit rate is high, then we can quickly find documents up to the “count” number that the user has asked for. A high percentage hit rate is good.
But sometimes the hit rate could be low. If the hit rate is too low, then we have to search through many candidate documents to find the number of matching documents the user has asked for.
The maxscale setting controls how hard we try to find matches in the set of candidate documents.
The default maxscale value is 5. We use this value as a multiplier against the count setting to determine the maximum number of candidate documents we will search. For example, if the user has asked us to find 30 documents (the default count value), then we would search up to 150 candidate documents to find the 30 matching documents. Note that because the setting is a multiple of the count, the limit automatically increases as the count increases. If the user asks for 100 matching documents and leaves all other values at their default, then we would search up to 500 candidate documents to find 100 final matches.
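The arithmetic can be made concrete with a short sketch. The loop below is an assumed shape for the candidate scan, not the real implementation, and `bounded_search` is a hypothetical name: it stops when either `count` matches have been found or `count * maxscale` candidates have been examined.

```python
import re

def bounded_search(candidate_texts, pattern, count=30, maxscale=5):
    """Scan candidates until enough matches are found or the budget runs out."""
    regex = re.compile(pattern)
    budget = count * maxscale      # e.g. 30 * 5 = 150 candidates at most
    matches = []
    examined = 0
    for text in candidate_texts:
        if examined >= budget or len(matches) >= count:
            break
        examined += 1
        if regex.search(text):
            matches.append(text)
    return matches, examined
```

With a high hit rate the scan ends early once `count` matches are found. With a hit rate of zero it gives up after exactly `count * maxscale` candidates.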
An example of a search with a very low hit rate is:
for\s{10}this
This searches for the word “for”, followed by exactly 10 whitespace characters, followed by the word “this”.
Since coding languages commonly contain the words “for” and “this”, the candidate documents query will return a large subset of documents. However, virtually none of those documents will contain exactly 10 whitespace characters between the words “for” and “this”, therefore we will get a very low hit rate (probably 0).
The user may want to increase this value if they have a low hit rate search and they want us to search more than the usual number of candidate documents to find them. They might also want to increase the timeout at the same time, since the search will take longer to search through more candidate documents.
Example:
for\s{10}this maxscale:10 timeout:20
Maximum Document Size
maxdocsize:nnn
Regular expression searches in general are slow. The bigger the size of the document, the slower the regex search will be.
To speed up the search, by default we skip searching any documents that are bigger than 65535 bytes (just under 64K). This is usually acceptable because developers generally try to keep their source code files to a few thousand lines or fewer. If the user wants us to search large documents, they can increase the maxdocsize setting.
Include Skipped Documents Larger than the Maximum Size
includeskipped:[yes|no]
If a document was skipped as a result of it being larger than the maxdocsize, this setting controls whether or not to include that skipped document in the result list.
The default value is no, meaning that large documents are not included in the final returned results.
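The interaction between maxdocsize and includeskipped can be sketched as follows. Several details are assumptions made for the example: sizes are measured as UTF-8 byte lengths, skipped documents are appended after the real matches, and both function names are hypothetical.

```python
def partition_by_size(docs, maxdocsize=65535):
    """Split docs into (searchable, skipped) by size in bytes."""
    searchable, skipped = {}, {}
    for name, text in docs.items():
        # Only documents strictly bigger than maxdocsize are skipped.
        too_big = len(text.encode("utf-8")) > maxdocsize
        (skipped if too_big else searchable)[name] = text
    return searchable, skipped

def results_with_skipped(matches, skipped, includeskipped=False):
    """Optionally append the skipped documents to the result list."""
    return list(matches) + (sorted(skipped) if includeskipped else [])

docs = {"small.js": "x" * 100, "big.js": "x" * 70000}
searchable, skipped = partition_by_size(docs)
```

Here `big.js` is never run through the regex. It appears in the results only when includeskipped:yes is given, which lets the user see that something was left unsearched.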
Setting the Search Timeout
timeout:nnn
This setting specifies the maximum time, in seconds, that the search is allowed to run. If the search hits the timeout value, it stops searching and returns all the documents that were found up to that point.
The default value is 5 (seconds).
If you are asking for a large count, or you’ve increased your maxscale setting, then you might consider increasing the timeout value to give the search more time to complete.
Example:
foreach count:1000 timeout:20
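The "return what was found so far" behavior can be sketched with a deadline check inside the scan loop. This is an illustrative assumption about how the timeout is enforced. The function name and the injectable `clock` parameter are hypothetical and exist only so the behavior can be tested without waiting.

```python
import re
import time

def search_with_timeout(texts, pattern, timeout=5.0, clock=time.monotonic):
    """Scan texts, returning partial results if the deadline passes."""
    regex = re.compile(pattern)
    deadline = clock() + timeout
    found = []
    for text in texts:
        if clock() >= deadline:
            break              # out of time: keep whatever was found so far
        if regex.search(text):
            found.append(text)
    return found
```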
Finding Unreferenced Pages
ref:none
This setting lets you find unreferenced pages in your document tree graph.
Recall from the page theory document that documents within OpenSquiggly are connected together by a graph of their table of contents references.
If a user has removed all references to a page, then it will not appear anywhere in the user’s table of contents starting from the user’s Home page and drilling down into each inner page.
This setting gives the user a quick way to find their unreferenced pages. It’s common for a user to want to delete their unreferenced pages, because presumably the reason why they’ve removed any reference to the page is because the page is no longer useful.
It’s hard to find unreferenced pages because, well, they are unreferenced. This search allows you to find them.
Note that this search relies on the ReferencedBy array stored in the database. There are some known bugs in this area, so it is important to test the ref:none search to help identify these bugs.
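For test planning, the expected result of ref:none can be modeled as a simple filter over the ReferencedBy arrays. Only the ReferencedBy name comes from this document. The record layout, the exclusion of the Home page (which is the root and so has no referrers), and the function name are assumptions made for the sketch.

```python
def find_unreferenced(pages, home_id):
    """Return ids of pages whose ReferencedBy array is empty or missing.

    pages maps a page id to a dict with a 'ReferencedBy' list (assumed layout).
    The Home page is excluded because it is the root of the graph.
    """
    return sorted(
        page_id
        for page_id, page in pages.items()
        if page_id != home_id and not page.get("ReferencedBy")
    )

pages = {
    "home": {"ReferencedBy": []},
    "guide": {"ReferencedBy": ["home"]},
    "old-notes": {"ReferencedBy": []},
}
```

In this small model, only `old-notes` should be reported. A test case can build a similar tree, remove a reference, and check that the search result changes accordingly.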