[Exist-open] Problem restoring backup

Discussion:

Prerovsky, Clemens

2005-06-13 10:23:52 UTC

Hi there,

currently I'm encountering severe problems when trying to restore an
eXist database structure from a previous backup. The import seems to
work fine, but my application (PHP accessing eXist via REST API) is
unable to read any data from the db. It has nothing to do with
authorisation or authentication - I'm currently using the admin user for
the PHP application, and I've installed eXist many times before.

When I close the client and open it again the previously imported files
are gone! This seems really odd to me - maybe someone has a clue...

Best regards & many thanks for your help
Clemens Prerovsky

Michael Beddow

2005-06-13 11:02:21 UTC

Post by Prerovsky, Clemens
currently I'm encountering severe problems when trying to restore an
eXist database structure from a previous backup.

I take it you are restoring using the command-line backup utility? When the
restore completes, what happens if you start the Java command-line client
and try some queries against your collections? If that works, you have a
PHP-specific problem (or just possibly, but pretty unlikely, something about
your collections or queries is triggering a new bug in the REST i/f). If,
OTOH, the Java client can't find the collections you just apparently
restored, then I suspect some sort of configuration problem. What is your
build of eXist? Have you checked the values in backup.properties, especially
the URIs there?
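
One way to run that check is to list every entry in backup.properties whose value looks like a URI and compare it with the address the client and the PHP application actually connect to. What follows is only a minimal Python sketch: the key names and the sample content are hypothetical placeholders (not the real file), and the parser handles plain key=value lines only, not the full Java properties syntax.

```python
import re

# Hypothetical sample content; the real backup.properties will differ.
SAMPLE = """# placeholder entries for illustration only
example.uri=xmldb:exist://localhost:8080/exist/xmlrpc
example.user=admin
example.dir=backup
"""

LINE = re.compile(r"^([^=:\s]+)\s*[=:]\s*(.*)$")

def parse_properties(text):
    """Parse simple 'key=value' or 'key: value' lines; '#' and '!' start comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        match = LINE.match(line)
        if match:
            props[match.group(1)] = match.group(2)
    return props

def uri_entries(props):
    """Keep only the entries whose value contains a URI scheme separator."""
    return {key: value for key, value in props.items() if "://" in value}

if __name__ == "__main__":
    for key, value in uri_entries(parse_properties(SAMPLE)).items():
        print(key, "->", value)
```

To inspect a real installation, the text of the actual file would be passed to parse_properties() instead of SAMPLE.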

Post by Prerovsky, Clemens
my application (PHP accessing eXist via REST API) is
unable to read any data from the db.

What are the precise symptoms of that? Is the REST client unable to locate
the database, or does it fail to find the collections concerned in the database,
or does it find the collections, but then queries against those collections
fail? What error message does the REST i/f return? What do eXist's logs show
for the relevant exchanges?

Post by Prerovsky, Clemens
When I close the client and open it again the previously imported files
are gone!

By "client" you mean your PHP application? And what does "are gone" mean? If
you start the Java client, have the collections vanished there, too? Or is
it just that if you restart your PHP application it can't find the
collections?

Post by Prerovsky, Clemens
This seems really odd to me

It certainly is. My best guess on the information so far provided is that
your backup routine is reading a different configuration and so creating a
differently-located collection from your PHP client. But we need to know
more....

Michael Beddow

Michael Beddow

2005-06-13 11:56:49 UTC

Clemens,

Well so far, it looks as though we can leave PHP and REST out of account
when trying to track this down. But I'm still not clear on a couple of
things.

Post by Michael Beddow
I take it you are restoring using the command-line backup utility? ...

Post by Prerovsky, Clemens
I am using the client started by bin/client.sh using X-forwarding from
my database server to my local Desktop (SuSE Linux 9.2).

But before that, you have presumably invoked backup.sh to do the
import/restore? Or are you repopulating the database afresh from the
filestore using Java client commands rather than restoring from an
eXist-generated backup?

Post by Prerovsky, Clemens
I can't query
the newly imported files - just getting empty results. I am able to
browse through the imported collections and open them with the built-in
editor.

So at this stage (first run of the client against the restored database) you
can browse at collection and whole-document level, but queries run against
those documents from the command line all fail? Is that *all* types of
queries? Or just queries that operate against the fulltext index? If
queries using standard XPath operators/functions rather than eXist
extensions worked (assuming you haven't configured any range indexes) then
that would indicate an indexing problem of some kind. But we're not going to
get any further here without some log information from attempting to execute
those queries, so getting that probably needs to be your next step.

Post by Prerovsky, Clemens
When querying via PHP I get an eXist result with hits="0".

OK, that's consistent with getting no results from queries under the Java
client (if the REST server couldn't locate the collections or documents, you
would get a different error message). It confirms that the problems visible
in your PHP application merely reflect something involving the eXist core.
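
To tell "the query ran but matched nothing" apart from "the server returned something else", a client can look at the root element of the REST response and read its hits attribute. The following is a minimal Python sketch under stated assumptions: the sample response, the namespace URI and the exact attribute layout are assumed for illustration (the thread only confirms that hits="0" appears), and the function name is invented.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the result wrapper; check it against a real response.
EXIST_NS = "http://exist.sourceforge.net/NS/exist"

# Assumed shape of a zero-hit response, for illustration only.
SAMPLE = (
    '<exist:result xmlns:exist="http://exist.sourceforge.net/NS/exist" '
    'hits="0" start="1" count="0"/>'
)

def hit_count(response_xml):
    """Return the number of hits, or None if this is not a result wrapper."""
    root = ET.fromstring(response_xml)
    if not root.tag.endswith("result"):
        return None
    # Accept the attribute with or without the namespace prefix.
    value = root.get("hits")
    if value is None:
        value = root.get("{%s}hits" % EXIST_NS)
    return int(value) if value is not None else None

if __name__ == "__main__":
    print(hit_count(SAMPLE))
```

A return value of 0 means the collection was reached and the query simply matched nothing, which is the case described here; None means the response was not a result wrapper at all.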

Post by Prerovsky, Clemens
When talking of "client" I mean the Java client. When I restart the
client after importing with the symptoms described above all of the
files/collections imported previously are gone - vanished - as if I
never had imported them.

Now I'm lost. Do you mean you can see the collections in the client
immediately after the import, but if you close down the client then restart
it, you can no longer see them? This again raises my question about what you
understand by "restoring backup", and whether that refers to actual
restoring via the backup utility or to repopulation from the filestore via
client commands.

Michael

Michael Beddow

2005-06-13 12:54:48 UTC

Clemens,

Post by Prerovsky, Clemens
forgot to add one more thing: I restored the database by copying the
webapp/WEB-INF/data directory to the production server, which worked
fine, but is, ehm ... ugly from an administration point of view - I know I
shall be tarred and feathered for this one :) But it works. Maybe this helps?
I could also supply a "broken" data directory - maybe this would help?

I'm afraid I'm now confused again. What I thought you were doing was
restoring from data previously dumped via the backup routine (whether
invoked from within the GUI client or from a separate command-line startup
through backup.sh). That should always work (provided the backup data isn't
from a very old version of the backup routine, which isn't the case here.)

But copying the contents of the /data directory from one installation to
another as a backup technique is a very different matter. I do this
occasionally, but only when I am 100% certain that my production servers are
running exactly the same build of eXist as the one on which I build the
database concerned (and in fact I generally do it in the context of cloning
the entire eXist tree, source, jars, configs as well as data). It is asking
for trouble if there has been any change whatsoever in the indexing
techniques or format between the builds concerned. So, assuming all your
data is safely in the backup, you should delete the entire contents of the
data/ directory on the target server, then import the backup and test. A
mixture of older indexing structures inherited from the copied *.dbx files
with current indexing structures generated by the restore via the client
could very easily produce the sort of symptoms you are describing.

Michael

Michael Beddow

2005-06-14 07:37:24 UTC

Post by Prerovsky, Clemens
Sorry - it seems I really confused you. The problems only occur
when I'm restoring backup data generated by the Java client
backup functionality. For restoring backup data I also use the
Java client. I just copied the data directory as a quick
and ugly solution to get the PHP application up and running.

OK, after that diversion, the only way forward seems to be to strip away all
the non-essentials. The logs you posted are hard to match up with the basic
problem as you describe it. They show the database being initialised and
collections being set up, but the actual importing of the documents isn't
recorded in them. Nor are your initial simple queries from the client
against the newly-restored collections. Some queries via the REST i/f,
presumably from your PHP application, are logged, and some of them,
according to those logs, succeeded.

So to get back to basics, I suggest the following.

1) Stop the eXist server on the target machine.
2) Delete the entire contents of the /data directory on the target server.
3) Make sure you have an appender set in log4j.xml for exist.xmldb, so the
query processing gets logged into the persistent file.
4) *Don't* start the eXist server on the target machine, but instead run the
client in local mode over your X-forwarded shell, either by starting up the
client with the -l switch, or by entering the pseudo-URL xmldb:exist:// into
the startup dialogue.
5) Use Tools/Backup to do the restore
6) Exit the client.
7) Restart the client, again in local mode as at 4) and try to browse and
query your collections.

If 7) works (i.e. you can now access your collections normally) then try
starting the server and connecting to it via XMLRPC or REST. But if 7)
doesn't work (i.e. your documents have vanished or can't be queried) then
don't try anything else, because you will need to ask Wolfgang to inspect
your data at this stage. He will also need to see the logs of the steps
recorded above, and the contents of your client.properties and
backup.properties files on the server.
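
Step 2 above (emptying the data directory while keeping the directory itself) can be scripted. This is only a minimal Python sketch, not part of eXist: the function names and the file names used in the demonstration are invented, and the demonstration runs against a temporary directory rather than a real installation. It should only ever be pointed at a real data directory after the server has been stopped and once the backup is known to be safe.

```python
import shutil
import tempfile
from pathlib import Path

def clear_directory(data_dir):
    """Delete everything inside data_dir, keeping data_dir itself.

    Returns the names of the removed top-level entries, sorted.
    """
    removed = []
    for entry in sorted(Path(data_dir).iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)   # real sub-directory: remove recursively
        else:
            entry.unlink()         # plain file or symlink
        removed.append(entry.name)
    return removed

def demo():
    """Run clear_directory on a throwaway directory with made-up content."""
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "data"
        (data / "sub").mkdir(parents=True)
        (data / "example.dbx").write_text("stale index")
        (data / "sub" / "nested.txt").write_text("x")
        removed = clear_directory(data)
        remaining = [p.name for p in data.iterdir()]
    return removed, remaining

if __name__ == "__main__":
    print(demo())
```

Keeping the directory itself, rather than deleting and recreating it, preserves its ownership and permissions.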

Michael

JC

2005-06-14 07:47:52 UTC

Hi,
I'm trying to do some fulltext query with Exist
I need to know how to actually retrieve only the 'context' of a word
when I find it, and not a bigger structure....
I mean: if I have this structure:

<TEI.2>
<text id="2">
<bof>something else</bof>
<group>word</group>
</text>
<text id="5">
<rebof>something else</rebof>
<group>
<div>word</div>
</group>
</text>
</TEI.2>

and I'm looking for the word 'word'
I usually do
for $entries in //TEI.2[contains(lower-case(string()),'word')] return
$entries

but then I get one response, which is all the document... because TEI.2
contains this word...
I'd like to get these 2 results: <group>word</group> and <div>word</div>

How can I do that?
And worse: I need to get the id attribute of the text element that
includes this result (2 and 5 here).
Is that possible?

Thanks for any help
JC

Michael Beddow

2005-06-14 08:23:03 UTC

Post by JC
Hi,
I'm trying to do some fulltext query with Exist
I need to know how to actually retrieve only the 'context' of a word
when I find it, and not a bigger structure....

Actually, you mean you want to know how to use XQuery. This is a pretty
general issue, not specifically eXist related, and I suggest you look at
some tutorial materials on XQuery itself.

Post by JC
<TEI.2>
<text id="2">
<bof>something else</bof>
<group>word</group>
</text>
<text id="5">
<rebof>something else</rebof>
<group>
<div>word</div>
</group>
</text>
</TEI.2>
and I'm looking for the word 'word'
I usually do
for $entries in //TEI.2[contains(lower-case(string()),'word')] return
$entries
but then I get one response, which is all the document... because TEI.2
contains this word...

No, you get that response because it's exactly what you asked for. No
matter what you put into that predicate, if the predicate evaluates to true,
then that XPath will always return the entire document. If you want to
return only elements deeper into the document structure, then you have to
shift the predicate along the XPath. To make this a bit more
realistically TEI-like, you need something like
//TEI.2//text/body//div//p[contains(.,'word')]

That will return only the <p>s for which the predicate evaluates to true.
You could get the enclosing divs instead by shifting the appropriate part of
the XPath into the predicate expression.
//TEI.2//text/body//div[.//p[contains(.,'word')]]
You can also embed a predicate to filter the divs, thus
//TEI.2//text/body//div[@type='chapter']//p[contains(.,'word')]
which still returns only <p>s, but now only those that are descendants of
divs of the specified type. You can use the union operator to join into the
returned node sequence instances of 'word' found in other contexts of your
choice.
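
The effect of moving the predicate along the path can be tried outside eXist. The following is a minimal Python sketch, not XQuery: it uses the standard-library ElementTree module, whose path support has no contains(), so the text test is done in Python instead. The sample document, its ids and the function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented TEI-like sample; real TEI documents are far richer.
SAMPLE = """<TEI.2>
  <text id="t1">
    <body>
      <div type="chapter" id="d1">
        <p>first word here</p>
        <p>nothing</p>
      </div>
      <div type="preface" id="d2">
        <p>another word</p>
      </div>
    </body>
  </text>
</TEI.2>"""

def text_of(elem):
    """String value of an element: all descendant text concatenated."""
    return "".join(elem.itertext())

def matching_paragraphs(root, word):
    # Predicate on the last step: only the <p> elements come back.
    return [p for p in root.iterfind(".//text/body//div//p")
            if word in text_of(p)]

def enclosing_divs(root, word):
    # Predicate shifted to the <div> step: the enclosing divs come back.
    return [d for d in root.iterfind(".//text/body//div")
            if any(word in text_of(p) for p in d.iter("p"))]

def chapter_paragraphs(root, word):
    # Extra filter on the divs; the result is still a list of <p> elements.
    path = ".//text/body//div[@type='chapter']//p"
    return [p for p in root.iterfind(path) if word in text_of(p)]

if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print([p.text for p in matching_paragraphs(root, "word")])
    print([d.get("id") for d in enclosing_divs(root, "word")])
    print([p.text for p in chapter_paragraphs(root, "word")])
```

In each function the path decides which elements are returned, and the test on the text only decides which of them survive.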

[In all these examples, I am leaving aside eXist-specific things that it
would be advisable to do in order to get the results returned with maximum
speed: those can come later].

The bottom line here is that you can't do the sort of query you have in mind
without some (possibly quite detailed) knowledge of the structure of the
documents you are querying. Just as you can't compose useful SQL queries
without knowledge of the schema that governs the target data. One
understanding of the somewhat fuzzy-edged concept "fulltext query" is that
no knowledge of internal document structure is necessary. But then that sort
of query is usually in the service of Information Retrieval, which seeks to
supply a ranked list of documents where the sought information is to be
found, rather than returning the targeted information itself in its
immediate context, which is what you are hoping to see.

Post by JC
And worst : I need to get the id attribute of the text element that
include this result (2 and 5 here)
Is that possible ?

Yes, easily. Once you've looked at those tutorials, you'll see how....

Michael Beddow

Jean-Marc Vanel

2005-06-15 08:09:15 UTC

Post by JC
Hi,
I'm trying to do some fulltext query with Exist
I need to know how to actually retrieve only the 'context' of a word
when I find it, and not a bigger structure....
<TEI.2>
<text id="2">
<bof>something else</bof>
<group>word</group>
</text>
<text id="5">
<rebof>something else</rebof>
<group>
<div>word</div>
</group>
</text>
</TEI.2>
and I'm looking for the word 'word'
I usually do
for $entries in //TEI.2[contains(lower-case(string()),'word')] return
$entries
but then I get one response, which is all the document... because
TEI.2 contains this word...
Id like to get those 2 results : <group>word</group> and <div>word</div>

//*/text() [contains(lower-case(.), 'word')] / parent::*

Post by JC
How can I do that ?
And worst : I need to get the id attribute of the text element that
include this result (2 and 5 here)
Is that possible ?

let $elem := //*/text() [contains(lower-case(.), 'word')] / parent::*
let $id := $elem / ancestor::text/ @id
return $id
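
For readers who want to check the logic outside eXist, here is a minimal Python sketch of the same idea: find each element that directly contains a matching text node, then walk up to the enclosing text element and read its id. It is an emulation with the standard-library ElementTree module, not XQuery; the function name is invented, and the sample is JC's structure with the attribute values quoted so that it parses.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<TEI.2>
<text id="2">
<bof>something else</bof>
<group>word</group>
</text>
<text id="5">
<rebof>something else</rebof>
<group>
<div>word</div>
</group>
</text>
</TEI.2>"""

def hits_with_text_id(xml_source, word):
    """Return (tag of matching parent, id of its ancestor <text>) pairs."""
    root = ET.fromstring(xml_source)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    results = []
    for elem in root.iter():
        # Text nodes that are direct children of elem: its leading text
        # plus the tail text following each of its child elements.
        own_text = [elem.text or ""] + [child.tail or "" for child in elem]
        if not any(word in chunk.lower() for chunk in own_text):
            continue
        # ancestor::text excludes the element itself, so start at its parent.
        node = parent_of.get(elem)
        while node is not None and node.tag != "text":
            node = parent_of.get(node)
        results.append((elem.tag, node.get("id") if node is not None else None))
    return results

if __name__ == "__main__":
    print(hits_with_text_id(SAMPLE, "word"))
```

Only direct text children are tested, which is why the outer elements (TEI.2, text, and the second group) are not reported even though their string value contains the word.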
--
Jean-Marc Vanel
Consulting and services / software development & integration
Free software, Web, Java, XML ...
At the cutting edge of technology, in the service of projects
http://jmvanel.free.fr/ ===) CV, software resources

My journals:
- general topics, in French: http://jmvanel.free.fr/Block-note.html
- computing topics, in French: http://jmvanel.free.fr/notes-informatiques.html
- computer science diary : http://jmvanel.free.fr/computer-notes.html

Worldwide Botanical Knowledge Base
http://wwbota.free.fr/
test XML query engine: http://jmvanel.free.fr/protea.html

Michael Beddow

2005-06-15 09:07:53 UTC

Post by Jean-Marc Vanel

Post by JC
Id like to get those 2 results : <group>word</group> and <div>word</div>

//*/text() [contains(lower-case(.), 'word')] / parent::*

That will work (although in some cases rather slowly) if the example file is
taken literally. But I assume it wasn't meant to be so taken, because it
contains nothing like the TEI structures that its document element suggests:
neither of the desired "results" could occur in a valid TEI instance.

So if the "context" referred to is indeed meant to be (in the first case)
the enclosing <group> and (in the second case) the enclosing <div>, this
XPath will never return them against a real-life TEI document, where neither
of those elements could be the parent of a text node. In TEI markup of any
complexity, the parent elements returned by this XPath could come from a
very wide selection of elements, so what was returned as "context" would
vary widely and unpredictably between the items in the returned sequence. If
a user is happy with that (and with the likely poor performance of the
expression) then fine; but I can't imagine this approach being of any real
use with most of the TEI markup with which I am familiar.

Michael Beddow

JC

2005-06-15 11:56:18 UTC

Hi,
first of all, thanks Jean-Marc you gave me the answer I was expecting.
<troll>Michael, u gave me advices for a real beginner :) But anyway I'm
kind of beginner with eXist </troll>

But I really will use my search engine inside a structure (inside
TEI.text elements actually) that can have sub and sub and ... levels.
So I don't know at which level I will find my matching word, and I don't
know whether I have to retrieve the whole text element or one of the first
sub elements or sub-sub... I don't want to retrieve a bigger structure if
it's not needed, and I don't know in advance what the structure will look like.

So it's sad about the poor performance, but I guess I have no choice.

JC


Jean-Marc Vanel

2005-06-15 12:57:21 UTC

Post by JC
Hi,
first of all, thanks Jean-Marc you gave me the answer I was expecting.
<troll>Michael, u gave me advices for a real beginner :) But anyway
I'm kind of beginner with eXist </troll>
But I really will use my search-engine inside a structure (inside
TEI.text elements actually) that can have sub and sub and ... levels.
Then I dont know in which level I find my matching word and dont know
if I have to retrieve the whole text element or one of the first sub
elements or sub-sub... I dont want to retrieve a bigger structure if
it's not needed and I dont know in advance how the structure will look like
Thus, it's sad for poor performance reasons but I guess I have no choice

Thanks for your thanks, JC!

In fact you do have a choice! You can use the new fast function
util:qname-index-lookup()
that I developed with Wolfgang.

See here :
http://wiki.exist-db.org/space/jmvanel/New+index+by+QName

It is a good replacement for tests like
// a [ b = "value" ]

The equivalent for fulltext will be available soon.

Michael Beddow

2005-06-15 13:09:49 UTC

Post by JC
<troll>Michael, u gave me advices for a real beginner :) But anyway I'm
kind of beginner with eXist </troll>

Actually I try to fit replies to the level of knowledge shown by the poster.
So people who ask beginner's questions get beginner's answers.
Commiserations if you actually know a lot more (about XQuery and TEI, not
just about eXist) than your formulations, example, and indeed actual
questions indicated. I can only go on the evidence supplied.

On the broader issue of using an XQuery engine efficiently over docbases
that are as complex and heterogeneous as many TEI collections are, this is
one of two main reasons why schemas figure strongly in the XQuery
world-view. The one most often recognised (though not yet supported by
eXist) is support for data typing via schema. But the other is the
possibility that a well-devised schema offers for client applications to
glean information about the structure of the data in the docbase they are
querying and adapt their queries accordingly.

Like schemas, DTDs were always meant to have a dual role: to constrain
document structure and to explain it. Well, DTDs were excellent at
constraining but lousy at explaining either to humans or machines. Schemas
turned out to be by no means as human-reader-friendly as was first imagined,
but they are by design XML-processor friendly.

In the XForms world, the schema of the target document, beyond indicating
appropriate validation routines within an automatically generated data-input
form, allows a suitably intelligent client application to make inferences
about the appropriate design and content of such a form. There is similar
scope where XQuery is deployed for applications to draw on a schema to make
inferences about the kind of structured query most likely to yield the
desired results in the most efficient way against a given docbase. I think
that schema-driven adaptive structured querying is the way forward for
bringing together native XML databases and highly variegated document
collections such as TEI practitioners often create. TEI P5, not just
(obviously) because of its schema support, but because of the new way the
old-established modularity is being implemented, offers many exciting
possibilities in this direction, though there's a lot of thinking and
experimenting still to be done. But querying based on paths beginning //*
throws away a lot of the effort and achievement that has gone into devising
eXist and similar engines. For that sort of flattened approach, sgrep,
fxgrep or even the venerable TUSTEP suite will always be faster.

Michael Beddow