Michael Goldwasser
challenge6@dimacs.rutgers.edu
January 16, 1998
For this reason, we have elected to take a different approach this year. We have designed an interface in which distance oracles for data sets will be designed as completely separate programs from the neighbor search algorithms, and we will have the two programs communicate through a client/server model using Internet/Unix sockets. In this way each search program and distance oracle can be designed and compiled separately, and so the same search program can be tested on a variety of input sets based on very different underlying domains. A program which acts as an oracle for a data set must simply be prepared to answer a variety of queries regarding the underlying space, and a program for a search algorithm will get all of its information about the data by calling similar query functions. We will be providing sample drivers written in C, which make this communication as transparent as possible for each side. The initial implementation uses UDP, but we hope to release a more robust implemenation soon which allows a choice of TCP/UDP interfaces as well as the use of UNIX sockets. For those who wish, the communication routines can be removed, and the algorithm and oracle can be combined together and compiled as a single process (see Section 6).
We see many potential advantages to this clear separation in the design specifications. For starters, this solves the initial problem of providing a clean interface for those wishing to implement one side of the Challenge without concern for the other. Since both ends will run as separate processes, there is no need to recompile any program in order to mix and match algorithms and input sets. Also, the specification for the oracle will allow for a single data set oracle to be left running as a server, answering queries for many different search algorithms at the same time. This may be especially useful for domains in which the loading time is quite large for an oracle. Finally, although commonly the algorithm program and the oracle program will be running on the same machine, because the client/server interface uses messages over the Internet, there really is no requirement that the two ends be running on the same machine, using the same programming language, or even on the same platform. Programs could conceivably be designed in completely different languages, or on different platforms, so long as both ends follow the agreed upon protocol for passing inquiry messages and responses. Furthermore, we envision the possibility of someone providing an oracle running over the Internet, without necessarily publicly releasing the underlying data or the source code used for distance comparisons. This could be done by simply announcing the Internet address and port at which the oracle is left running.
/******************************************************** * AbstractDS* Preprocess(Oracle* ora) * * Parameters: * Oracle* ora --- pointer to the data set oracle * * Returns: * A pointer to a data structure of your choice which is passed on * to the routine `FindNeighbor' whenever a later query is made. * * Description: * The purpose of this routine depends greatly on the nearest * neighbor algorithm which will be used to answer queries. Access * to the data set is given through the use of the oracle. All * preprocessing should take place in this routine, and any * resulting data structures must be saved as part of the returned * object. * * Note: * This routine should NOT ask the oracle about any of the query points. * * Additional Note: * If the algorithm requires additional parameters from the user, * These should be added to the functions parameter list, and the * corresponding values should be sent by the main program when * calling this function. *******************************************************/
/******************************************************** * void FindNeighbor(Oracle* ora, AbstractDS* ads, int query) * * Parameters: * Oracle* ora --- pointer to the data set oracle * AbstractDS* ads --- pointer to your data structure * int query --- which query point * * Description: * This is the neighbor search query routine. The algorithm may use * the preprocessed data structure in conjunction with questions to * the oracle about the query point. Any output or responses to the * query should be made from within this routine. * * Note: * If theory, you could consider query algorithms which alter the * preprocessed data structure to reflect any new and useful * knowledge. * *******************************************************/ /******************************************************** * void DestroyAbstractDS(AbstractDS* ads) * * Parameters: * AbstractDS* ads --- pointer to your data structure * * Description: * This is simply a cleanup routine. * *******************************************************/ /******************************************************** * int CostPreprocess(AbstractDS* ads) * * Parameters: * AbstractDS* ads --- pointer to your data structure * * Returns: * An integer value which represents the 'cost' of the * preprocessing step, in whatever units you choose to report. * * Note: * The accounting should have been done during the preprocessing * step, with the cost value saved as part of the data structure, in * order to report it at the time of this procedure call. * *******************************************************/
/******************************************************** * int CostQueries(AbstractDS* ads) * * Parameters: * AbstractDS* ads --- pointer to your data structure * * Returns: * An integer value which represents the 'cost' of the combined * total of all neighbor queries. * * Note: * The accounting should have been done during each of the neighbor * queries, with the cost value saved as part of the data structure, * in order to report it at the time of this procedure call. * *******************************************************/
The main functionality provided by the oracle is the InqDist function which provides a real-valued distance between any two points of the data set. This function is used, both to compare two points from the original space, as well as to compare a query point to points in the original space. Our convention is that all of the points of the space, including the query points, are given unique integer indices, starting at zero, where the query points are indexed after all of the original points. For example, if the space consists of 1000 original points, and 100 other query points, then the original points will be numbered from to 999, followed by the query points from 1000 to 1099.
Although points of various data sets represent objects from varied domains, we assume that each point has some associated data which may be passed to a specialized algorithm. Each data point may have an arbitrary number of associated fields, where each field is a string. In general, the content of this string is purely up to the discretion of the oracle, as it will only be echoed as a more descriptive name for data points, or will be used by search algorithms which know the structure of the data and have been specialized to work with a particular type of data sets. Specific protocols for these fields will be discussed in Section 5 .
The full set of functions available are specified in oracle.h as follows:
/******************************************************** * int InqNumPoints(Oracle *ora) * * Parameters: * Oracle* ora --- an oracle connection * * Returns: * The number of points in the original data set * * Description: * If the routine returns 'n' this means that there are n points in * the data set (not counting any queries). When querying the * oracle about individual points, they are referenced with IDs * ranging from 0 to n-1. * *******************************************************/ /******************************************************** * int InqNumQuery(Oracle *ora) * * Parameters: * Oracle* ora --- an oracle connection * * Returns: * The number of query points available * * Description: * If the routine returns 'q' this means that there are q query * points available. When querying the oracle about these points, * we assume that they are given IDs consecutively following all of * the original data points. That is, if there are 'n' original * points and 'q' query points, the query points are indexed in the * range [n,n+q-1]. * *******************************************************/ /******************************************************** * double InqDist(Oracle *ora, int p1, int p2) * * Parameters: * Oracle* ora --- an oracle connection * int p1 --- index to first point * int p2 --- index to second point * * Returns: * The non-negative, real-numbered, distance between the points * * Description: * Each point can be either an original point in the data set, * or one of the query points. The convention for indexing is * identical to the above functions. * *******************************************************/ /******************************************************** * int InqNumFields(Oracle *ora, int p) * * Parameters: * Oracle* ora --- an oracle connection * int p --- index to data point * * Returns: * An integer specifying how many fields describe the * underlying raw data for point p. (This could potentially be * a different value for different points in the underlying space. * * Description: * The point can be the index for either an original point in the * underlying space, or one of the query points. The convention for * indexing points is identical to the above functions. The * convention for indexing fields is from [0..InqNumField-1]. * * Note: * In the case of Euclidean vectors, this string should be the * number of dimensions. * *******************************************************/ /******************************************************** * char* InqField(Oracle *ora, int p, int f) * * Parameters: * Oracle* ora --- an oracle connection * int p --- index to data point * int f --- which field of raw data * * Returns: * A string of length at most MAX_FIELD_LENGTH which is the * associated field for the raw data describing a point in the * space. The ownership of this string is passed to the caller, and * it is assumed that the caller will free the memory for it. * * Description: * The point can be the index for either an original point in the * underlying space, or one of the query points. The convention for * indexing points is identical to the above functions. The * convention for indexing fields is from [0..InqNumField-1]. * * Note: * In the case of Euclidean vectors, this string should be the * real-valued entry for the coordinate in question. *******************************************************/
Once this setup is complete, you should call the routine ServerOpen to establish yourself as a server and begin waiting for query messages from clients. When opening a server, you must pass it a pointer to your own data structure. This pointer will later be passed on as a parameter to any event handlers for incoming message queries, and so this data structure must have all information needed by your message handlers. Notice that because you have generated a static data structure in advance, your server will actually be able to handle interleaved queries from many clients at once. The exact specification for ServerOpen is given as:
/******************************************************** * void ServerOpen(int port, Oracle* ora, int quiet) * * Parameters: * int port --- what port should server use * Oracle* ora --- pointer sent to even handlers * int quiet --- should server echo messages/responses to screen * * Returns: * A symbolic pointer to the new server. * * Description: * This routine starts up a server on the local machine, at the * specified port. The Oracle pointer is simply held, and then * sent on to any event handlers when query messages arrive. * *******************************************************/
We suggest the following protocols for the use of the associated data fields, and we hope to have more added by participants as time goes (see the web page for newer protocols).
Our protocol for color-coded sets will be to have the first field equal to the color, specified as an identifying integer. Notice, that this protocol can be placed on top of other protocols for color-coded versions of the protocols by using the first field to be the color, and offsetting all other fields by one. For example, a color-coded data set for d-dimensional Euclidean space would now have (d+1) fields, the first of which is equal to the color, and the remaining fields equal to the vector coordinates.
For sake of completeness, we summarize the underlying message protocol used by both ends, and defined in the message.h file. After establishing a connection, the client side process initiates all queries, sending a message to the oracle query, and waiting for a response from the oracle.
/******************************************************** * Query: How many data points are in the original data set? * Client Message: "P" * Server Response: "<num>", where num is an integer *******************************************************/ /******************************************************** * Query: How many query points are available? * Client Message: "Q" * Server Response: "<num>", where num is an integer *******************************************************/ /******************************************************** * Query: What is the distance between points i and j? * Client Message: "D <i> <j>" * Server Response: "<dist>", where dist is a floating point *******************************************************/ /******************************************************** * Query: How many fields in the associated data for point i? * Client Message: "A <i>" * Server Response: "<string>" *******************************************************/ /******************************************************** * Query: What is the associated data in field f, for point i? * Client Message: "F <i> <f>" * Server Response: "<string>" *******************************************************/ /******************************************************** * Unknown Query: If the server ever receives a query which does not * match one of the above, rather than not respond, it responds that * it received an unknown query as follows. * * Server Response: "U" *******************************************************/
Previous DIMACS Implementation Challenges.