Frequently Asked Questions
General
-
What is the Curation Dashboard?
The Curation Dashboard is a software bundle that determines and grades the quality of the metadata harvested by CLARIN for language resources, in order to help their authors and curators improve the quality of the metadata they provide. It consists of three parts: a stand-alone application for report generation (curation-core), a web application (curation-web) and another stand-alone application for link checking (linkchecker).
-
What do VLO, record, collection, CMDI and other terms mean?
For a general overview of VLO, you can visit its about page. For more general info, the CLARIN Metadata FAQ page is a good starting point.
Curation dashboard
-
What does the Curation Dashboard do exactly?
The Curation Dashboard processes publicly available CMDI records, collections and profiles. It grades them and assigns scores based on different quality aspects. It also provides information about them and about which properties were graded.
-
What is a 'collection' and where does the data come from?
A collection contains the metadata harvested by the VLO harvester from a single 'endpoint', i.e. all the metadata from one repository and/or CLARIN center. This metadata is the output of the harvesting process, the result of which can be found at https://vlo.clarin.eu/.
-
How often does the Curation Dashboard generate reports?
The core application of the Curation Dashboard generates the reports three times a week, in the early morning hours (CET).
-
Is there a limit on file sizes?
Yes, the Curation Dashboard doesn't process files larger than 100 megabytes. Such files are ignored when collection reports are generated.
-
How does scoring work?
A score is calculated for profiles, instances and collections (a collection's score is the sum of its instance scores). The next two tables show the criteria on which the scoring is based, as well as the corresponding value sets:
Profile

| Context | Criteria | Value set |
| --- | --- | --- |
| Header | Profile is public? | {0, 1} |
| Facet | Percentage of defined facets covered by profile | [0, 1] |
| Cmd-concepts | Percentage of elements (except header and resources) with concept | [0, 1] |
| Sum | | [0, 3] |

Instance

| Context | Criteria | Value set | Sum |
| --- | --- | --- | --- |
| File | Valid file? ¹ | {0, 1} | {0, 1} |
| Header | Valid schema location from attribute “schemaLocation” OR “noNamespaceSchemaLocation” available? | {0, 1} | |
| Header | Schema comes from Component Registry? | {0, 1} | |
| Header | MdProfile available and valid (against regular expression)? | {0, 1} | |
| Header | MdCollectionDisplayName available? | {0, 1} | |
| Header | MdSelfLink available? | {0, 1} | {1, ..., 5} |
| Facet | Percentage of defined facets covered by instance | [0, 1] | [0, 1] |
| URL | Percentage of valid links | [0, 1] ² | [0, 1] ² |
| XML | Is the XML valid? | {0, 1} | |
| XML | Percentage of populated elements | [0, 1] | [0, 2] |
| Profile | | | [0, 3] |
| Resource Proxy | Percentage of RP with mime type | [0, 1] | |
| Resource Proxy | Percentage of RP with references | [0, 1] | [0, 2] |
| Overall | | | [0, 15] ³ |
¹ file size <= maximum file size AND valid schema location AND at least one resource link AND fewer than 3 XML parsing messages with status fatal or error
² The number of valid links is not scored for user uploads. To make the scores comparable, we weight the score with the percentage of checked links. If, for example, we have checked only one link out of a hundred, the maximum score can only be 0.01.
³ Because of the particular handling of the URL score (see ²), the maximum score can vary between 14 and 15.
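As a worked example of the weighting described in footnote ², the sketch below reproduces the arithmetic; the function and parameter names are purely illustrative and not taken from the actual Curation Dashboard code.

```python
# Illustrative sketch of the URL-score weighting from footnote 2.

def url_score(valid_links: int, checked_links: int, total_links: int) -> float:
    """Fraction of valid links among the checked ones, weighted by the
    fraction of links that were checked at all."""
    if checked_links == 0 or total_links == 0:
        return 0.0
    ratio_valid = valid_links / checked_links      # in [0, 1]
    ratio_checked = checked_links / total_links    # in [0, 1]
    return ratio_valid * ratio_checked             # in [0, 1]

# Only 1 of 100 links checked: even if that one link is valid,
# the score is capped at 0.01.
print(url_score(valid_links=1, checked_links=1, total_links=100))     # 0.01

# All 100 links checked, 80 of them valid: score 0.8.
print(url_score(valid_links=80, checked_links=100, total_links=100))  # 0.8
```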
Link checker
-
What is the Link Checker?
The Link Checker is a stand-alone application that checks the availability of the resources at the addresses referenced in the metadata. In practice, these addresses are URLs (commonly called links), which can be checked via HTTP requests. The Link Checker saves the responses to these requests in a database. The links themselves are extracted from the CMD records within the collections. The results of the checking can be viewed directly on the Link Checker Statistics page, and they also affect the overall score of the collections.
-
What technology is the Link Checker based on?
The old implementation of the Link Checker was replaced by a new codebase, which is based on Stormcrawler, which in turn is based on Apache Storm.
-
How does the Link Checker work?
When the Curation Dashboard generates its collection reports, all resource links and self links within the records are extracted and saved into a database. The Link Checker then continuously checks these links and saves the results in the database. At the time of writing, there are approximately 3 million links, which are checked permanently in 50 parallel checking queues, one queue per host. The processing inside each queue is strictly serial, which means the Link Checker sends one request after the other to the same host, respecting a crawl delay of one second. The crawl delay might be longer or shorter, depending on what is specified in the host's robots.txt, but the processing remains strictly serial in any case. Hence, the period of time between two checks of the same link depends on the total number of links on the same host and the crawl delay, with a minimum of 24 hours between two checks of the same link. The order of links in the checking queue is as follows (a small sketch of this ordering can be found after the list):
- prioritized links first
- then the unchecked links
- then the remaining links, in the order of their latest check
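The following sketch illustrates this ordering; it is not the actual Link Checker code, and it assumes that "in the order of the latest check" means that the links checked longest ago come first.

```python
# Illustrative sketch of the checking-queue order; not the actual implementation.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Link:
    url: str
    prioritized: bool = False
    last_checked: Optional[datetime] = None   # None = never checked

def queue_order(link: Link):
    return (
        0 if link.prioritized else 1,             # 1. prioritized links first
        0 if link.last_checked is None else 1,    # 2. then unchecked links
        link.last_checked or datetime.min,        # 3. then oldest check first
    )

links = [
    Link("https://example.org/a", last_checked=datetime(2024, 5, 1)),
    Link("https://example.org/b"),                         # never checked
    Link("https://example.org/c", prioritized=True),
    Link("https://example.org/d", last_checked=datetime(2024, 1, 1)),
]
for link in sorted(links, key=queue_order):
    print(link.url)   # c (prioritized), b (unchecked), d, a
```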
-
What request method does the Link Checker use?
The Link Checker always sends a HEAD request first. If it is unsuccessful for whatever reason, it tries a GET request. However, it doesn't read the response payload in the case of GET; all information is extracted from the status code and the headers.
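For illustration, checking a single link with this strategy could look like the sketch below, written with the Python requests library; this is not the Link Checker's actual StormCrawler-based implementation, and the function name is made up. The 5-second timeout mirrors the maximum response time mentioned further below.

```python
# Illustrative sketch only: HEAD first, GET as fallback, payload never read.
import requests

def check_link(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a link, trying HEAD first, then GET."""
    try:
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        if resp.ok:
            return resp.status_code
    except requests.RequestException:
        pass  # fall through to the GET attempt on any failure

    # Fall back to GET, but don't download the body: with stream=True only the
    # status line and headers are read; the payload is never consumed.
    resp = requests.get(url, allow_redirects=True, timeout=timeout, stream=True)
    resp.close()
    return resp.status_code

print(check_link("https://www.clarin.eu/"))
```
-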
Does Link Checker follow redirects?
Yes, it does. It even records how many redirects a link goes through. However, there is a hard limit of 20 redirects.
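Purely for illustration, the sketch below shows how following redirects with a hard limit and recording the number of hops could look with the Python requests library; it is not the Link Checker's actual implementation.

```python
# Illustrative sketch of redirect handling: follow redirects, count the hops,
# stop at a hard limit of 20.
import requests

session = requests.Session()
session.max_redirects = 20          # hard limit on the redirect chain

resp = session.head("http://clarin.eu/", allow_redirects=True, timeout=5)
print("final URL:      ", resp.url)
print("redirects taken:", len(resp.history))   # number of redirect hops
```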
-
What categories are there?
Currently there are only 6 categories:
- Ok
- Undetermined
- Restricted access
- Blocked by robots.txt
- Broken
- Invalid URL
-
Is there a maximum response time?
There is indeed a maximum response time of 5 seconds. If the Link Checker doesn't receive any response within that period of time, the link is categorized as Broken.
-
The Curation Dashboard reports my links incorrectly. What should I do?
If you suspect that the Link Checker is the reason for the incorrect reports and your links work fine, please create an issue on our GitHub page.
-
The byte size of my link is shown as null but the link has a correct response body. What's wrong?
The HTTP header Content-Length is taken as the single source of truth for the byte size. HEAD requests don't contain a response payload by definition, and the Link Checker doesn't read the payload of GET requests in order to save bandwidth and time. Therefore, the only reliable source is the Content-Length header. Please set it correctly wherever you can on your servers.
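If you want to verify that your server announces a Content-Length header, a quick check could look like the following sketch (the URL is a placeholder; the script is only a helper and not part of the Curation Dashboard):

```python
# Quick check whether a server sends Content-Length on a HEAD request.
import requests

url = "https://example.org/my-resource"        # placeholder: use your own link
resp = requests.head(url, allow_redirects=True, timeout=5)
size = resp.headers.get("Content-Length")      # None if the header is missing

if size is None:
    print("No Content-Length header: the byte size will be reported as null.")
else:
    print(f"Content-Length: {size} bytes")
```
-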
The Link Checker is making more requests than my server can handle, or even causing high loads on my servers. What should I do?
The Link Checker respects any crawl delay specified in the robots.txt file of the target host. If nothing is set there, it uses a minimum crawl delay of one second between two requests to the same host.
-
How can I configure the access for the Link Checker in my robots.txt?
A typical configuration to authorize the Link Checker to access all resources with a specific crawl delay of 1 second between each request (which is currently the default when nothing else is set in robots.txt) would look like this:
User-agent: CLARIN-Linkchecker
Allow: /
Crawl-delay: 1
As a starting point for configuring access to your resources in a more elaborate way, we recommend the official Robots.txt Files site.
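If you want to verify locally how your robots.txt is interpreted for the CLARIN-Linkchecker user agent, a quick check with Python's built-in robots.txt parser could look like this (the host and resource path are placeholders):

```python
# Check a robots.txt against the CLARIN-Linkchecker user agent using the
# Python standard library. Replace the host and path with your own.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://example.org/robots.txt")   # placeholder host
parser.read()

agent = "CLARIN-Linkchecker"
print("allowed:    ", parser.can_fetch(agent, "https://example.org/some/resource"))
print("crawl delay:", parser.crawl_delay(agent))  # None if no Crawl-delay is set
```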
-
How does the Link Checker identify itself?
The Link Checker sends the following User-Agent request header to identify itself:
"User-Agent" : "CLARIN-Linkchecker/<Link Checker version> (build with Apache Storm <Apache Storm version>/Storm Crawler <Storm Crawler version>; https://www.clarin.eu/linkchecker; linkchecker@clarin.eu)"
-
Where does "Expected Content Type" come from?
It is extracted from the CMD records. However, it is not specified for all links.
-
I have more questions. Where can I ask them?
Feel free to mail us.
-
Where can I report issues?
Feel free to mail us.