Search API
----------

This module provides a framework for easily creating searches on any entity
known to Drupal, using any kind of search engine. For site administrators, it is
a great alternative to other search solutions, since it already incorporates
faceting support and the ability to use the Views module for displaying search
results, filters, etc. Also, with the Apache Solr integration [1], a
high-performance search engine is available for use with the Search API.

If you need help with the module, please post to the project's issue queue [2].

[1] http://drupal.org/project/search_api_solr
[2] http://drupal.org/project/issues/search_api

Content:
 - Glossary
 - Information for users
 - Information for developers
 - Included components

Glossary
--------

Terms as used in this module.

- Service class:
  A type of search engine, e.g. using the database, Apache Solr,
  Sphinx or any other professional or simple indexing mechanism. Takes care of
  the details of all operations, especially indexing or searching content.
- Server:
  One specific place for indexing data, using a set service class. Can
  e.g. be some tables in a database, a connection to a Solr server or other
  external services, etc.
- Index:
  A configuration object for indexing data of a specific type. What data is
  indexed, and how, is determined by its settings. Also keeps track of which
  items still need to be indexed (or re-indexed, if they were updated). Needs
  to lie on a server in order to be really used (although its configuration is
  independent of a server).
- Item type:
  A type of data which can be indexed (i.e., for which indexes can be created).
  Most entity types (like Content, User, Taxonomy term, etc.) are available, but
  possibly also other types provided by contrib modules.
- Entity:
  One object of data, usually stored in the database. Might for example
  be a node, a user or a file.
- Field:
  A defined property of an entity, like a node's title or a user's mail address.
  All fields have defined data types. However, for indexing purposes the user
  might choose to index a property under a different data type than defined.
- Data type:
  Determines how a field is indexed. While "Fulltext" fields can be completely
  searched for keywords, other fields can only be used for filtering. They will
  also be converted to fit their respective value ranges.
  How types other than "Fulltext" are handled depends on the service class used.
  Its documentation should state how the type selection affects the indexed
  content. However, service classes will always be able to handle all data
  types; it is just possible that the type doesn't affect the indexing at all
  (apart from "Fulltext vs. the rest").
- Boost:
  Number determining how important a certain field is when searching for
  fulltext keywords. The higher the value, the more important the field is.
  E.g., when the node title has a boost of 5.0 and the node body a boost of 1.0,
  a keyword found in the title will increase the score as much as five keywords
  found in the body. Of course, this only has an effect when the score is used
  (for sorting or other purposes). It has no effect on other parts of the search
  result.
- Data alteration:
  A component that is used when indexing data. It can add additional fields to
  the indexed entity or prevent certain entities from being indexed. Fields
  added by data alterations have to be enabled on the "Fields" page to be of any
  use, but this is done by default.
- Processor:
  An object that is used for preprocessing indexed data as well as search
  queries, and for postprocessing search results. Processors usually only work
  on fulltext fields, to control how content is indexed and searched. E.g.,
  processors can be used to make searches case-insensitive, to filter markup
  out of indexed content, etc.
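
The boost arithmetic from the "Boost" entry can be sketched in a few lines of
Python. This is only an illustrative model: the actual scoring is up to the
service class in use, and the function and variable names below are made up for
this example.

```python
# Illustrative model only; real scoring depends on the service class in use.
def weighted_score(matches, boosts):
    """Sum keyword matches per field, weighting each field by its boost.

    matches: field name -> number of keyword matches found in that field
    boosts:  field name -> boost (fields without an entry default to 1.0)
    """
    return sum(count * boosts.get(field, 1.0) for field, count in matches.items())


boosts = {"title": 5.0, "body": 1.0}

# One keyword in the title counts as much as five keywords in the body.
one_title_hit = weighted_score({"title": 1}, boosts)
five_body_hits = weighted_score({"body": 5}, boosts)
```

With these boosts, both values come out equal, which is exactly the
"title 5.0 vs. body 1.0" relation described above.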

Information for users
---------------------

IMPORTANT: Access checks
  In general, the Search API doesn't contain any access checks for search
  results. It is your responsibility to ensure that only accessible search
  results are displayed, either by only indexing such items or by filtering
  appropriately at search time.
  For search on general site content (item type "Node"), this is already
  supported by the Search API. To enable this, go to the index's "Filters" tab
  and activate the "Node access" data alteration. This will add the necessary
  field, "Node access information", to the index (which you have to leave as
  "indexed"). If both this field and "Published" are set to be indexed, access
  checks will automatically be executed at search time, showing only those
  results that a user can view. Some search types (e.g., search views) also
  provide the option to disable these access checks for individual searches.
  Please note, however, that these access checks use the indexed data, while
  usually the current data is displayed to users. Therefore, users might still
  see inappropriate content as long as items aren't indexed in their latest
  state. If you can't allow this for your site, please use the index's "Index
  items immediately" feature (explained below) or possibly custom solutions for
  specific search types, if available.

To use the Search API, you will need at least one other module, namely one that
defines a service class (e.g., search_api_db ("Database search"), which can be
found at [3]).

[3] http://drupal.org/project/search_api_db

- Creating a server
  (Configuration > Search API > Add server)

The most basic thing you have to create is a search server for indexing content.
Go to Configuration > Search API in the administration pages and select
"Add server". Name and description are usually only shown to administrators and
can be used to differentiate between several servers, or to explain a server's
use to other administrators (for larger sites). Disabling a server makes it
unusable for indexing and searching and can e.g. be used if the underlying
search engine is temporarily unavailable.

The "service class" is the most important option here, since it lets you select
which backend the search server will use. This cannot be changed after the
server is created.

Depending on the selected service class, further, service-specific settings will
be available. For details on those settings, consult the respective service's
documentation.

- Creating an index
  (Configuration > Search API > Add index)

For adding a search index, choose "Add index" on the Search API administration
page. Name, description and "enabled" status serve the exact same purpose as
for servers.

The most important option in this form is the indexed entity type. Every index
contains data on only a single type of entity, e.g. nodes, users or taxonomy
terms. This is therefore the only option that cannot be changed afterwards.

The server on which the index lies determines where the data will actually be
indexed. It doesn't affect any other settings of the index and can later be
changed, with the only drawback being that the index's content will have to be
indexed again. You can also select a server that is disabled at the moment, or
choose to let the index lie on no server at all for the time being. Note,
however, that you can only create enabled indexes on an enabled server. Also,
disabling a server will disable all indexes that lie on it.

The "Index items immediately" option specifies that you want items to be
directly re-indexed after being changed, instead of waiting for the next cron
run. Use this if it is important that users see no stale data in searches, and
only when your setup enables relatively fast indexing.

Lastly, the "Cron batch size" option allows you to set whether items will be
indexed when cron runs (as long as the index is enabled), and how many items
will be indexed in a single batch. The best value for this setting depends on
how time-consuming indexing is for your setup, which in turn depends mostly on
the server used and the enabled data alterations. You should set it to a number
of items which can easily be indexed in 10 seconds' time. Items can also be
indexed manually, or directly when they are changed, so even if this is set to
0, the index can still be used.
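
As a rough aid for picking a "Cron batch size", the 10-second guideline can be
turned into a small calculation. The sketch below is not part of the module; it
assumes you have timed a sample indexing run yourself, and the 0.5 safety
factor (to read "easily" conservatively) is an arbitrary assumption.

```python
import math


def suggest_batch_size(sample_items, sample_seconds, budget_seconds=10.0, safety=0.5):
    """Suggest a batch size that fits comfortably into the time budget.

    sample_items / sample_seconds come from a timed test run (your own
    measurement); safety < 1 leaves headroom so the batch finishes "easily".
    """
    if sample_items <= 0 or sample_seconds <= 0:
        raise ValueError("sample_items and sample_seconds must be positive")
    items_per_second = sample_items / sample_seconds
    # Never suggest less than one item per batch.
    return max(1, math.floor(items_per_second * budget_seconds * safety))


# Example: a manual run indexed 200 items in 8 seconds.
suggested = suggest_batch_size(sample_items=200, sample_seconds=8.0)
```

Re-measure after changing the server or enabling further data alterations,
since both affect how long indexing takes.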

- Indexed fields
  (Configuration > Search API > [Index name] > Fields)

Here you can select which of the entities' fields will be indexed, and how.
Fields added by (enabled) data alterations will be available here, too.
Without selecting fields to index, the index will be useless and also won't be
available for searches. Select the "Fulltext" data type for fields which you
want to search for keywords, and other data types when you want to use the field
for filtering (e.g., as facets). The "Item language" field will always be
indexed as it contains important information for processors and hooks.

You can also add fields of related entities here, via the "Add related fields"
form at the bottom of the page. For instance, you might want to add the
author's username to the indexed data of a node, and you need to add the "Body"
entity to the node when you want to index the actual text it contains.

- Indexing workflow
  (Configuration > Search API > [Index name] > Filters)

This page lets you customize how the created index works, and what metadata will
be available, by selecting data alterations and processors (see the glossary for
further explanations).

Data alterations usually only add one or more fields to the entity and their
order is mostly irrelevant.

The order of processors, however, often is important. Read the processors'
descriptions or consult their documentation for determining how to use them most
effectively.

- Index status
  (Configuration > Search API > [Index name] > Status)

On this page you can view how many of the entities are already indexed and also
control indexing. With the "Index now" button (displayed only when there are
still unindexed items) you can directly index a certain number of "dirty" items
(i.e., items not yet indexed in their current state). Setting "-1" as the number
will index all of those items, similar to the cron batch size setting.

When you change settings that could affect indexing, and the index is not
automatically marked for re-indexing, you can do this manually with the
"Re-index content" button. All items in the index will be marked as dirty and be
re-indexed when subsequently indexing items (either manually or via cron runs).
Until all content is re-indexed, the old data will still show up in searches.

This is different with the "Clear index" button. All items will be marked as
dirty and additionally all data will be removed from the index. Therefore,
after clearing an index, searches won't show any results until items are
re-indexed. Use this only if completely wrong data has been indexed. It is also
done automatically when the index scheme or server settings change too
drastically to keep on using the old data.
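
The difference between "Index now", "Re-index content" and "Clear index" can be
pictured with a toy model. The class below is purely illustrative (its name and
methods are invented here and are not the module's code): it tracks which items
are dirty and which still have data on the server.

```python
class IndexStatusModel:
    """Toy model of item tracking; not the module's implementation."""

    def __init__(self, item_ids):
        self.items = set(item_ids)      # every item the index knows about
        self.dirty = set(self.items)    # items not indexed in their current state
        self.searchable = set()         # items whose data is present on the server

    def index_now(self, limit=-1):
        """Index up to `limit` dirty items; -1 means all of them."""
        batch = sorted(self.dirty)
        if limit >= 0:
            batch = batch[:limit]
        self.dirty.difference_update(batch)
        self.searchable.update(batch)
        return len(batch)

    def reindex(self):
        """'Re-index content': mark everything dirty, old data stays searchable."""
        self.dirty = set(self.items)

    def clear(self):
        """'Clear index': mark everything dirty and drop all indexed data."""
        self.dirty = set(self.items)
        self.searchable.clear()


index = IndexStatusModel(range(1, 6))
first_batch = index.index_now(2)    # indexes two of the five items
rest = index.index_now(-1)          # indexes the remaining three

index.reindex()
after_reindex = (len(index.dirty), len(index.searchable))

index.clear()
after_clear = (len(index.dirty), len(index.searchable))
```

After reindex() all five items are dirty but still searchable; after clear()
all five are dirty and none are searchable until they are indexed again.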

- Hidden settings

search_api_index_worker_callback_runtime:
  By changing this variable, you can determine the time (in seconds) the Search
  API will spend indexing (for all indexes combined) in each cron run. The
  default is 15 seconds.

Information for developers
--------------------------

 | NOTE:
 | For modules providing new entities: In order for your entities to become
 | searchable with the Search API, your module will need to implement
 | hook_entity_property_info() in addition to the normal hook_entity_info().
 | hook_entity_property_info() is documented in the entity module.
 | For making certain non-entities searchable, see "Item type" below.
 | For custom field types to be available for indexing, provide a
 | "property_type" key in hook_field_info(), and optionally a callback at the
 | "property_callbacks" key.
 | Both processes are explained in [4].
 |
 | [4] http://drupal.org/node/1021466

Apart from improving the module itself, developers can extend the search
capabilities provided by the Search API by providing implementations for one (or
several) of the following classes. Detailed documentation on the methods that
need to be implemented is always available as doc comments in the respective
interface definition (all found in their respective files in the includes/
directory). The details for hooks can be looked up in the search_api.api.php
file. Note that all hooks provided by the Search API use the "search_api" hook
group. Therefore, implementations of these hooks can be moved into a
MODULE.search_api.inc file in your module's directory.

For all interfaces there are handy base classes which can (but don't need to) be
used to ease custom implementations, since they provide sensible generic
implementations for many methods. They, too, should be documented well enough
with doc comments for a developer to find the right methods to override or
implement.

- Service class
  Interface: SearchApiServiceInterface
  Base class: SearchApiAbstractService
  Hook: hook_search_api_service_info()

The service classes are the heart of the API, since they allow data to be
indexed on different search servers. Since they take considerable work to get
right, you should make sure that a service class for a specific search engine
doesn't exist already before programming it yourself.

When your module supplies a service class, please make sure to provide
documentation (at least a README.txt) that clearly states the data types it
supports (and in what manner), how a direct query (a query where the keys are
a single string, instead of an array) is parsed, and any limitations of the
service class.

The central methods here are the indexItems() and the search() methods, which
always have to be overridden manually. The configurationForm() method allows
services to provide custom settings for the user.

See the SearchApiDbService class provided by [5] for an example implementation.

[5] http://drupal.org/project/search_api_db

- Query class
  Interface: SearchApiQueryInterface
  Base class: SearchApiQuery

You can also override the query class' behaviour for your service class. You
can, for example, change key parsing behaviour, add additional parse modes
specific to your service, or override methods so the information is stored in a
form more suitable for your service.

For the query class to become available (other than through manual creation),
you need a custom service class where you override the query() method to return
an instance of your query class.

- Item type
  Interface: SearchApiDataSourceControllerInterface
  Base class: SearchApiAbstractDataSourceController
  Hook: hook_search_api_item_type_info()

If you want to index some data which is not defined as an entity, you can
specify it as a new item type here. For defining a new item type, you have to
create a data source controller for the type and track new, changed and deleted
items of the type by calling the search_api_track_item_*() functions.
An instance of the data source controller class will then be used by indexes
when handling items of your newly-defined type.

If you want to make external data that is indexed on some search server
available to the Search API, there is a handy base class for your data source
controller (SearchApiExternalDataSourceController in
includes/datasource_external.inc) which you can extend. For a minimal use case,
you will then only have to define the available fields that can be retrieved by
the server.

- Data type
  Hook: hook_search_api_data_type_info()

You can specify new data types for indexing fields. These new types can then be
selected on indexes' "Fields" tabs. You just have to implement the hook,
returning some information on your data type, and specify in your module's
documentation the format of your data type and how it should be used.

For a custom data type to have an effect, in most cases the server's service
class has to support that data type. A service class can advertise its support
of a data type by declaring support for the "search_api_data_type_TYPE" feature
in its supportsFeature() method. If this support isn't declared, a fallback data
type is automatically used instead of the custom one.

If a field is indexed with a custom data type, its entry in the index's options
array will have the selected type in "real_type", while "type" contains the
fallback type (which is always one of the default data types, as returned by
search_api_default_field_types()).
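
The fallback rule for custom data types can be summarized in a short sketch.
It is written in Python purely for illustration and is not the module's PHP
code: the effective_type() helper and the "location" type are hypothetical, and
only the "real_type"/"type" keys and the "search_api_data_type_TYPE" feature
name are taken from the description above.

```python
def effective_type(field_options, supports_feature):
    """Return the data type a server would effectively use for a field.

    field_options:    a field's entry, with "type" (always a default type)
                      and, for custom types, "real_type"
    supports_feature: callable standing in for a service's supportsFeature()
    """
    real_type = field_options.get("real_type")
    if real_type and supports_feature("search_api_data_type_" + real_type):
        return real_type
    # No custom type selected, or the service doesn't support it: fall back.
    return field_options["type"]


# Hypothetical custom type "location" whose fallback is "string".
field = {"type": "string", "real_type": "location"}

supporting_service = {"search_api_data_type_location"}.__contains__
plain_service = set().__contains__

with_support = effective_type(field, supporting_service)
without_support = effective_type(field, plain_service)
```

Here with_support is the custom type and without_support is the fallback type,
mirroring what happens when a service does or does not declare the feature.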

- Data-alter callbacks
  Interface: SearchApiAlterCallbackInterface
  Base class: SearchApiAbstractAlterCallback
  Hook: hook_search_api_alter_callback_info()

Data-alter callbacks can be used to change the field data of indexed items, or
to prevent certain items from being indexed. They are only used when indexing,
or when selecting the fields to index. For adding additional information to
search results, you have to use a processor.

Data-alter callbacks are called "data alterations" in the UI.

- Processors
  Interface: SearchApiProcessorInterface
  Base class: SearchApiAbstractProcessor
  Hook: hook_search_api_processor_info()

Processors are used for altering the data when indexing or searching. The exact
specifications are available in the interface's doc comments. Just note that the
processor description should clearly state assumptions or restrictions on input
types (e.g. only tokenized text), item language, etc. and explain concisely what
effect it will have on searches.

See the processors in includes/processor.inc for examples.

Included components
-------------------

- Data alterations

  * URL field
    Provides a field with the URL for displaying the entity.
  * Aggregated fields
    Offers the ability to add additional fields to the entity, containing the
    data from one or more other fields. Use this, e.g., to have a single field
    containing all data that should be searchable, or to make the text from a
    string field, like a taxonomy term, also fulltext-searchable.
    The type of aggregation can be selected from a set of values: you can, e.g.,
    collect the text data of all contained fields, or add them up, count their
    values, etc.
  * Bundle filter
    Enables the admin to prevent entities from being indexed based on their
    bundle (content type for nodes, vocabulary for taxonomy terms, etc.).
  * Complete entity view
    Adds a field containing the whole HTML content of the entity as it is viewed
    on the site. The view mode used can be selected.
    Note, however, that this might not work for entities of all types. All core
    entities except files are supported, though.
  * Index hierarchy
    Allows indexing a hierarchical field along with all its parents. Most
    importantly, this can be used to index taxonomy term references along with
    all parent terms. This way, when an item, e.g., has the term "New York", it
    will also be matched when filtering for "USA" or "North America".
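
The idea behind "Index hierarchy" can be shown with a minimal Python sketch.
It is an illustration, not the module's implementation: the helper name is made
up, and it assumes each term has at most one parent, which is a simplification.

```python
def with_ancestors(term, parent_of):
    """Return the term followed by all of its ancestors, nearest first.

    parent_of maps a term to its parent; terms without an entry are roots.
    The `seen` set guards against accidental cycles in the mapping.
    """
    chain = []
    seen = set()
    while term is not None and term not in seen:
        chain.append(term)
        seen.add(term)
        term = parent_of.get(term)
    return chain


parent_of = {"New York": "USA", "USA": "North America"}

# Values that would end up in the index for an item tagged "New York".
indexed_values = with_ancestors("New York", parent_of)
```

Because the indexed values include "USA" and "North America", a filter on
either of them also matches the item tagged "New York".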

- Processors

  * Ignore case
    Makes all fulltext searches (and, optionally, also filters on string values)
    case-insensitive. Some servers might do this automatically, for others this
    should probably always be activated.
  * HTML filter
    Strips HTML tags from fulltext fields and decodes HTML entities. If you are
    indexing HTML content (like node bodies) and the search server doesn't
    handle HTML on its own, this should be activated to avoid indexing HTML
    tags, as well as to give e.g. terms appearing in a heading a higher boost.
  * Tokenizer
    This processor allows you to specify how indexed fulltext content is split
    into separate tokens: which characters are ignored and which are treated as
    white-space that separates words.
  * Stopwords
    Enables the admin to specify a stopwords file, the words contained in which
    will be filtered out of the text data indexed. This can be used to exclude
    too common words from indexing, for servers not supporting this natively.
  * Stem words
    Uses the PorterStemmer method to reduce words to stems. A search for
    "garden" will return results for "gardening" and "garden," as will a search
    for "gardening."
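
To make the effect of chaining processors more concrete, here is a simplified
Python model of an HTML filter, case folding, tokenizing and stopword removal
applied in that order. It only illustrates the concepts; it is not the module's
code, all names are invented, and the tokenizer only handles ASCII letters and
digits. Stemming is left out on purpose.

```python
import html
import re


def html_filter(text):
    """Strip tags first, then decode HTML entities."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def ignore_case(text):
    return text.lower()


def tokenize(text):
    """Treat everything except ASCII letters and digits as white-space."""
    return [token for token in re.split(r"[^0-9A-Za-z]+", text) if token]


def remove_stopwords(tokens, stopwords):
    return [token for token in tokens if token not in stopwords]


def preprocess(text, stopwords):
    # Case folding runs before stopword removal so that "The" matches "the".
    return remove_stopwords(tokenize(ignore_case(html_filter(text))), stopwords)


STOPWORDS = {"the", "a", "in"}

# The same pipeline is applied to indexed content and to search keys.
indexed = preprocess("<h2>Gardening in the City</h2> Tips &amp; tricks", STOPWORDS)
query = preprocess("GARDENING tips", STOPWORDS)
query_matches = all(token in indexed for token in query)
```

Swapping the order, e.g. removing stopwords before lower-casing, would let
"The" slip through, which is one reason the order of processors matters.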

- Additional modules

  * Search views
    This integrates the Search API with the Views module [6], enabling the user
    to create views which display search results from any Search API index.
  * Search facets
    For service classes supporting this feature (e.g. Solr search), this module
    automatically provides configurable facet blocks on pages that execute
    a search query.

[6] http://drupal.org/project/views