405 lines
20 KiB
Plaintext
405 lines
20 KiB
Plaintext
Search API
|
||
----------
|
||
|
||
This module provides a framework for easily creating searches on any entity
|
||
known to Drupal, using any kind of search engine. For site administrators, it is
|
||
a great alternative to other search solutions, since it already incorporates
|
||
facetting support and the ability to use the Views module for displaying search
|
||
results, filters, etc. Also, with the Apache Solr integration [1], a
|
||
high-performance search engine is available for use with the Search API.
|
||
|
||
If you need help with the module, please post to the project's issue queue [2].
|
||
|
||
[1] http://drupal.org/project/search_api_solr
|
||
[2] http://drupal.org/project/issues/search_api
|
||
|
||
|
||
Content:
|
||
- Glossary
|
||
- Information for users
|
||
- Information for developers
|
||
- Included components
|
||
|
||
|
||
Glossary
|
||
--------
|
||
|
||
Terms as used in this module.
|
||
|
||
- Service class:
|
||
A type of search engine, e.g. using the database, Apache Solr,
|
||
Sphinx or any other professional or simple indexing mechanism. Takes care of
|
||
the details of all operations, especially indexing or searching content.
|
||
- Server:
|
||
One specific place for indexing data, using a set service class. Can
|
||
e.g. be some tables in a database, a connection to a Solr server or other
|
||
external services, etc.
|
||
- Index:
|
||
A configuration object for indexing data of a specific type. What and how data
|
||
is indexed is determined by its settings. Also keeps track of which items
|
||
still need to be indexed (or re-indexed, if they were updated). Needs to lie
|
||
on a server in order to be really used (although configuration is independent
|
||
of a server).
|
||
- Item type:
|
||
A type of data which can be indexed (i.e., for which indexes can be created).
|
||
Most entity types (like Content, User, Taxonomy term, etc.) are available, but
|
||
possibly also other types provided by contrib modules.
|
||
- Entity:
|
||
One object of data, usually stored in the database. Might for example
|
||
be a node, a user or a file.
|
||
- Field:
|
||
A defined property of an entity, like a node's title or a user's mail address.
|
||
All fields have defined datatypes. However, for indexing purposes the user
|
||
might choose to index a property under a different data type than defined.
|
||
- Data type:
|
||
Determines how a field is indexed. While "Fulltext" fields can be completely
|
||
searched for keywords, other fields can only be used for filtering. They will
|
||
also be converted to fit their respective value ranges.
|
||
How types other than "Fulltext" are handled depends on the service class used.
|
||
Its documentation should state how the type-selection affect the indexed
|
||
content. However, service classes will always be able to handle all data
|
||
types, it is just possible that the type doesn't affect the indexing at all
|
||
(apart from "Fulltext vs. the rest").
|
||
- Boost:
|
||
Number determining how important a certain field is, when searching for
|
||
fulltext keywords. The higher the value is, the more important is the field.
|
||
E.g., when the node title has a boost of 5.0 and the node body a boost of 1.0,
|
||
keywords found in the title will increase the score as much as five keywords
|
||
found in the body. Of course, this has only an effect when the score is used
|
||
(for sorting or other purposes). It has no effect on other parts of the search
|
||
result.
|
||
- Data alteration:
|
||
A component that is used when indexing data. It can add additional fields to
|
||
the indexed entity or prevent certain entities from being indexed. Fields
|
||
added by callbacks have to be enabled on the "Fields" page to be of any use,
|
||
but this is done by default.
|
||
- Processor:
|
||
An object that is used for preprocessing indexed data as well as search
|
||
queries, and for postprocessing search results. Usually only work on fulltext
|
||
fields to control how content is indexed and searched. E.g., processors can be
|
||
used to make searches case-insensitive, to filter markup out of indexed
|
||
content, etc.
|
||
|
||
|
||
Information for users
|
||
---------------------
|
||
|
||
IMPORTANT: Access checks
|
||
In general, the Search API doesn't contain any access checks for search
|
||
results. It is your responsibility to ensure that only accessible search
|
||
results are displayed – either by only indexing such items, or by filtering
|
||
appropriately at search time.
|
||
For search on general site content (item type "Node"), this is already
|
||
supported by the Search API. To enable this, go to the index's "Workflow" tab
|
||
and activate the "Node access" data alteration. This will add the necessary
|
||
field, "Node access information", to the index (which you have to leave as
|
||
"indexed"). If both this field and "Published" are set to be indexed, access
|
||
checks will automatically be executed at search time, showing only those
|
||
results that a user can view. Some search types (e.g., search views) also
|
||
provide the option to disable these access checks for individual searches.
|
||
Please note, however, that these access checks use the indexed data, while
|
||
usually the current data is displayed to users. Therefore, users might still
|
||
see inappropriate content as long as items aren't indexed in their latest
|
||
state. If you can't allow this for your site, please use the index's "Index
|
||
immediately" feature (explained below) or possibly custom solutions for
|
||
specific search types, if available.
|
||
|
||
As stated above, you will need at least one other module to use the Search API,
|
||
namely one that defines a service class (e.g. search_api_db ("Database search"),
|
||
provided with this module).
|
||
|
||
- Creating a server
|
||
(Configuration > Search API > Add server)
|
||
|
||
The most basic thing you have to create is a search server for indexing content.
|
||
Go to Configuration > Search API in the administration pages and select
|
||
"Add server". Name and description are usually only shown to administrators and
|
||
can be used to differentiate between several servers, or to explain a server's
|
||
use to other administrators (for larger sites). Disabling a server makes it
|
||
unusable for indexing and searching and can e.g. be used if the underlying
|
||
search engine is temporarily unavailable.
|
||
The "service class" is the most important option here, since it lets you select
|
||
which backend the search server will use. This cannot be changed after the
|
||
server is created.
|
||
Depending on the selected service class, further, service-specific settings will
|
||
be available. For details on those settings, consult the respective service's
|
||
documentation.
|
||
|
||
- Creating an index
|
||
(Configuration > Search API > Add index)
|
||
|
||
For adding a search index, choose "Add index" on the Search API administration
|
||
page. Name, description and "enabled" status serve the exact same purpose as
|
||
for servers.
|
||
The most important option in this form is the indexed entity type. Every index
|
||
contains data on only a single type of entities, e.g. nodes, users or taxonomy
|
||
terms. This is therefore the only option that cannot be changed afterwards.
|
||
The server on which the index lies determines where the data will actually be
|
||
indexed. It doesn't affect any other settings of the index and can later be
|
||
changed with the only drawback being that the index' content will have to be
|
||
indexed again. You can also select a server that is at the moment disabled, or
|
||
choose to let the index lie on no server at all, for the time being. Note,
|
||
however, that you can only create enabled indexes on an enabled server. Also,
|
||
disabling a server will disable all indexes that lie on it.
|
||
The "Index items immediately" option specifies that you want items to be
|
||
directly re-indexed after being changed, instead of waiting for the next cron
|
||
run. Use this if it is important that users see no stale data in searches, and
|
||
only when your setup enables relatively fast indexing.
|
||
Lastly, the "Cron batch size" option allows you to set whether items will be
|
||
indexed when cron runs (as long as the index is enabled), and how many items
|
||
will be indexed in a single batch. The best value for this setting depends on
|
||
how time-consuming indexing is for your setup, which in turn depends mostly on
|
||
the server used and the enabled data alterations. You should set it to a number
|
||
of items which can easily be indexed in 10 seconds' time. Items can also be
|
||
indexed manually, or directly when they are changed, so even if this is set to
|
||
0, the index can still be used.
|
||
|
||
- Indexed fields
|
||
(Configuration > Search API > [Index name] > Fields)
|
||
|
||
Here you can select which of the entities' fields will be indexed, and how.
|
||
Fields added by (enabled) data alterations will be available here, too.
|
||
Without selecting fields to index, the index will be useless and also won't be
|
||
available for searches. Select the "Fulltext" data type for fields which you
|
||
want search for keywords, and other data types when you want to use the field
|
||
for filtering (e.g., as facets). The "Item language" field will always be
|
||
indexed as it contains important information for processors and hooks.
|
||
You can also add fields of related entities here, via the "Add related fields"
|
||
form at the bottom of the page. For instance, you might want to index the
|
||
author's username to the indexed data of a node, and you need to add the "Body"
|
||
entity to the node when you want to index the actual text it contains.
|
||
|
||
- Index workflow
|
||
(Configuration > Search API > [Index name] > Workflow)
|
||
|
||
This page lets you customize how the created index works, and what metadata will
|
||
be available, by selecting data alterations and processors (see the glossary for
|
||
further explanations).
|
||
Data alterations usually only add one or more fields to the entity and their
|
||
order is mostly irrelevant.
|
||
The order of processors, however, often is important. Read the processors'
|
||
descriptions or consult their documentation for determining how to use them most
|
||
effectively.
|
||
|
||
- Index status
|
||
(Configuration > Search API > [Index name] > Status)
|
||
|
||
On this page you can view how much of the entities are already indexed and also
|
||
control indexing. With the "Index now" button (displayed only when there are
|
||
still unindexed items) you can directly index a certain number of "dirty" items
|
||
(i.e., items not yet indexed in their current state). Setting "-1" as the number
|
||
will index all of those items, similar to the cron batch size setting.
|
||
When you change settings that could affect indexing, and the index is not
|
||
automatically marked for re-indexing, you can do this manually with the
|
||
"Re-index content" button. All items in the index will be marked as dirty and be
|
||
re-indexed when subsequently indexing items (either manually or via cron runs).
|
||
Until all content is re-indexed, the old data will still show up in searches.
|
||
This is different with the "Clear index" button. All items will be marked as
|
||
dirty and additionally all data will be removed from the index. Therefore,
|
||
searches won't show any results until items are re-indexed, after clearing an
|
||
index. Use this only if completely wrong data has been indexed. It is also done
|
||
automatically when the index scheme or server settings change too drastically to
|
||
keep on using the old data.
|
||
|
||
- Hidden settings
|
||
|
||
search_api_index_worker_callback_runtime:
|
||
By changing this variable, you can determine the time (in seconds) the Search
|
||
API will spend indexing (for all indexes combined) in each cron run. The
|
||
default is 15 seconds.
|
||
|
||
|
||
Information for developers
|
||
--------------------------
|
||
|
||
| NOTE:
|
||
| For modules providing new entities: In order for your entities to become
|
||
| searchable with the Search API, your module will need to implement
|
||
| hook_entity_property_info() in addition to the normal hook_entity_info().
|
||
| hook_entity_property_info() is documented in the entity module.
|
||
| For making certain non-entities searchable, see "Item type" below.
|
||
| For custom field types to be available for indexing, provide a
|
||
| "property_type" key in hook_field_info(), and optionally a callback at the
|
||
| "property_callbacks" key.
|
||
| Both processes are explained in [1].
|
||
|
|
||
| [1] http://drupal.org/node/1021466
|
||
|
||
Apart from improving the module itself, developers can extend search
|
||
capabilities provided by the Search API by providing implementations for one (or
|
||
several) of the following classes. Detailed documentation on the methods that
|
||
need to be implemented are always available as doc comments in the respective
|
||
interface definition (all found in their respective files in the includes/
|
||
directory). The details for hooks can be looked up in the search_api.api.php
|
||
file.
|
||
For all interfaces there are handy base classes which can (but don't need to) be
|
||
used to ease custom implementations, since they provide sensible generic
|
||
implementations for many methods. They, too, should be documented well enough
|
||
with doc comments for a developer to find the right methods to override or
|
||
implement.
|
||
|
||
- Service class
|
||
Interface: SearchApiServiceInterface
|
||
Base class: SearchApiAbstractService
|
||
Hook: hook_search_api_service_info()
|
||
|
||
The service classes are the heart of the API, since they allow data to be
|
||
indexed on different search servers. Since these are quite some work to get
|
||
right, you should probably make sure a service class for a specific search
|
||
engine doesn't exist already before programming it yourself.
|
||
When your module supplies a service class, please make sure to provide
|
||
documentation (at least a README.txt) that clearly states the datatypes it
|
||
supports (and in what manner), how a direct query (a query where the keys are
|
||
a single string, instead of an array) is parsed and possible limitations of the
|
||
service class.
|
||
The central methods here are the indexItems() and the search() methods, which
|
||
always have to be overridden manually. The configurationForm() method allows
|
||
services to provide custom settings for the user.
|
||
See the SearchApiDbService class for an example implementation.
|
||
|
||
- Query class
|
||
Interface: SearchApiQueryInterface
|
||
Base class: SearchApiQuery
|
||
|
||
You can also override the query class' behaviour for your service class. You
|
||
can, for example, change key parsing behaviour, add additional parse modes
|
||
specific to your service, or override methods so the information is stored more
|
||
suitable for your service.
|
||
For the query class to become available (other than through manual creation),
|
||
you need a custom service class where you override the query() method to return
|
||
an instance of your query class.
|
||
|
||
- Item type
|
||
Interface: SearchApiDataSourceControllerInterface
|
||
Base class: SearchApiAbstractDataSourceController
|
||
Hook: hook_search_api_item_type_info()
|
||
|
||
If you want to index some data which is not defined as an entity, you can
|
||
specify it as a new item type here. For defining a new item type, you have to
|
||
create a data source controller for the type and track new, changed and deleted
|
||
items of the type by calling the search_api_track_item_*() functions.
|
||
An instance of the data source controller class will then be used by indexes
|
||
when handling items of your newly-defined type.
|
||
|
||
If you want to make external data that is indexed on some search server
|
||
available to the Search API, there is a handy base class for your data source
|
||
controller (SearchApiExternalDataSourceController in
|
||
includes/datasource_external.inc) which you can extend. For a minimal use case,
|
||
you will then only have to define the available fields that can be retrieved by
|
||
the server.
|
||
|
||
- Data type
|
||
Hook: hook_search_api_data_type_info()
|
||
|
||
You can specify new data types for indexing fields. These new types can then be
|
||
selected on indexes' „Fields“ tabs. You just have to implement the hook,
|
||
returning some information on your data type, and specify in your module's
|
||
documentation the format of your data type and how it should be used.
|
||
|
||
For a custom data type to have an effect, in most cases the server's service
|
||
class has to support that data type. A service class can advertize its support
|
||
of a data type by declaring support for the "search_api_data_type_TYPE" feature
|
||
in its supportsFeature() method. If this support isn't declared, a fallback data
|
||
type is automatically used instead of the custom one.
|
||
|
||
If a field is indexed with a custom data type, its entry in the index's options
|
||
array will have the selected type in "real_type", while "type" contains the
|
||
fallback type (which is always one of the default data types, as returned by
|
||
search_api_default_field_types().
|
||
|
||
- Data-alter callbacks
|
||
Interface: SearchApiAlterCallbackInterface
|
||
Base class: SearchApiAbstractAlterCallback
|
||
Hook: hook_search_api_alter_callback_info()
|
||
|
||
Data alter callbacks can be used to change the field data of indexed items, or
|
||
to prevent certain items from being indexed. They are only used when indexing,
|
||
or when selecting the fields to index. For adding additional information to
|
||
search results, you have to use a processor.
|
||
Data-alter callbacks are called "data alterations" in the UI.
|
||
|
||
- Processors
|
||
Interface: SearchApiProcessorInterface
|
||
Base class: SearchApiAbstractProcessor
|
||
Hook: hook_search_api_processor_info()
|
||
|
||
Processors are used for altering the data when indexing or searching. The exact
|
||
specifications are available in the interface's doc comments. Just note that the
|
||
processor description should clearly state assumptions or restrictions on input
|
||
types (e.g. only tokenized text), item language, etc. and explain concisely what
|
||
effect it will have on searches.
|
||
See the processors in includes/processor.inc for examples.
|
||
|
||
|
||
Included components
|
||
-------------------
|
||
|
||
- Service classes
|
||
|
||
* Database search
|
||
A search server implementation that uses the normal database for indexing
|
||
data. It isn't very fast and the results might also be less accurate than
|
||
with third-party solutions like Solr, but it's very easy to set up and good
|
||
for smaller applications or testing.
|
||
See contrib/search_api_db/README.txt for details.
|
||
|
||
- Data alterations
|
||
|
||
* URL field
|
||
Provides a field with the URL for displaying the entity.
|
||
* Aggregated fields
|
||
Offers the ability to add additional fields to the entity, containing the
|
||
data from one or more other fields. Use this, e.g., to have a single field
|
||
containing all data that should be searchable, or to make the text from a
|
||
string field, like a taxonomy term, also fulltext-searchable.
|
||
The type of aggregation can be selected from a set of values: you can, e.g.,
|
||
collect the text data of all contained fields, or add them up, count their
|
||
values, etc.
|
||
* Bundle filter
|
||
Enables the admin to prevent entities from being indexed based on their
|
||
bundle (content type for nodes, vocabulary for taxonomy terms, etc.).
|
||
* Complete entity view
|
||
Adds a field containing the whole HTML content of the entity as it is viewed
|
||
on the site. The view mode used can be selected.
|
||
Note, however, that this might not work for entities of all types. All core
|
||
entities except files are supported, though.
|
||
* Index hierarchy
|
||
Allows to index a hierarchical field along with all its parents. Most
|
||
importantly, this can be used to index taxonomy term references along with
|
||
all parent terms. This way, when an item, e.g., has the term "New York", it
|
||
will also be matched when filtering for "USA" or "North America".
|
||
|
||
- Processors
|
||
|
||
* Ignore case
|
||
Makes all fulltext searches (and, optionally, also filters on string values)
|
||
case-insensitive. Some servers might do this automatically, for others this
|
||
should probably always be activated.
|
||
* HTML filter
|
||
Strips HTML tags from fulltext fields and decodes HTML entities. If you are
|
||
indexing HTML content (like node bodies) and the search server doesn't
|
||
handle HTML on its own, this should be activated to avoid indexing HTML
|
||
tags, as well as to give e.g. terms appearing in a heading a higher boost.
|
||
* Tokenizer
|
||
This processor allows you to specify how indexed fulltext content is split
|
||
into seperate tokens – which characters are ignored and which treated as
|
||
white-space that seperates words.
|
||
* Stopwords
|
||
Enables the admin to specify a stopwords file, the words contained in which
|
||
will be filtered out of the text data indexed. This can be used to exclude
|
||
too common words from indexing, for servers not supporting this natively.
|
||
|
||
- Additional modules
|
||
|
||
* Search pages
|
||
This module lets you create simple search pages for indexes.
|
||
* Search views
|
||
This integrates the Search API with the Views module [1], enabling the user
|
||
to create views which display search results from any Search API index.
|
||
* Search facets
|
||
For service classes supporting this feature (e.g. Solr search), this module
|
||
automatically provides configurable facet blocks on pages that execute
|
||
a search query.
|
||
|
||
[1] http://drupal.org/project/views
|