Got stacks of documents and a crowd to help digitize them?
Moonsheep digitizes massive collections of documents into structured data through crowdsourcing and cutting-edge technology.
See more at http://moonsheep.org/
cd myapp
workon myapp
python manage.py runserver 0.0.0.0:8000
Moonsheep packs some tools and configuration options that speed up project development.
Use them while developing tasks (forms, data saving, creation of subtasks).
Setting MOONSHEEP['DEV_ROTATE_TASKS'] = True has the following effects:
- You don't need to create tasks in the database.
- When opening a transcription view, the defined task types are rotated.
  - Each refresh gives a new task type with mocked data.
  - On every task type you need to define a function that returns mocked params: @classproperty def mocked_params(cls) -> dict:
  - You can pass the ?task_type= query parameter to disable rotation and stay focused on one task type.
- When you submit the transcription form, the entry is not created. The code skips cross-checking and directly invokes saving models (save_verified_data) and creating new tasks (after_save(data)), so these features can be tested.

If you have subtasks, saving will most likely fail because of missing models. It is recommended to write project tests covering saving data and creating dependent tasks.
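To make the rotation idea concrete, here is a minimal, framework-free sketch. The task classes, their mocked values, the `rotate` helper and the small `classproperty` descriptor are all hypothetical stand-ins written for illustration; they are not Moonsheep's implementation.

```python
from itertools import cycle


class classproperty:
    """Minimal read-only stand-in for a classproperty decorator."""

    def __init__(self, getter):
        self.getter = getter

    def __get__(self, instance, owner):
        return self.getter(owner)


class FindTableTask:  # hypothetical task type
    @classproperty
    def mocked_params(cls) -> dict:
        return {"url": "http://example.com/report.pdf", "page": 1}


class TranscribeRowTask:  # hypothetical task type
    @classproperty
    def mocked_params(cls) -> dict:
        return {"url": "http://example.com/report.pdf", "row": 3}


def rotate(task_types, forced=None):
    """Yield (task name, mocked params) pairs, cycling through task types.

    If `forced` names a task type, only that type is served, which mimics
    what the ?task_type= query parameter is described as doing.
    """
    if forced is not None:
        task_types = [t for t in task_types if t.__name__ == forced]
    for task_type in cycle(task_types):
        yield task_type.__name__, task_type.mocked_params


rotation = rotate([FindTableTask, TranscribeRowTask])
first, second, third = (next(rotation) for _ in range(3))
```

Each call to `next(rotation)` plays the role of a page refresh: the first two calls return different task types and the third wraps around to the first one again.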
Moonsheep settings can be overridden in your project's settings as follows:
from moonsheep.settings import *  # NOQA

MOONSHEEP.update({
    'DEV_ROTATE_TASKS': False,
    'MIN_ENTRIES_TO_CROSSCHECK': 1,
})
Apart from MOONSHEEP-specific settings, the import will also bring in defaults for REST_FRAMEWORK.
Settings:
- DEV_ROTATE_TASKS - support for developing tasks, see details above
- MIN_ENTRIES_TO_CROSSCHECK - number of entries for a task needed to run cross-checking (defaults to 3)
- MIN_ENTRIES_TO_MARK_DIRTY - number of entries for a task at which, if cross-checking still fails, the task is marked as dirty. It won't be served to users anymore and will be brought to a moderator's attention. (defaults to 4)
- USER_AUTHENTICATION - methods to handle user logins, see below for details
  - nickname - generate pseudonymous nicknames, so you can show statistics to users but don't have to keep their personal data. You can set 'FAKER_LOCALE': 'it_IT' to change the language of generated names.
  - anonymous - users don't need to log in; cookies are sent anyhow to trace entries by each user
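The interplay of the two thresholds can be sketched as a small decision function. This is an illustration only: the function name, the state labels and, above all, the agreement rule (all entries identical) are assumptions, since the actual cross-checking rule is not described here.

```python
MIN_ENTRIES_TO_CROSSCHECK = 3  # default stated in the settings list
MIN_ENTRIES_TO_MARK_DIRTY = 4  # default stated in the settings list


def task_state(entries,
               min_crosscheck=MIN_ENTRIES_TO_CROSSCHECK,
               min_dirty=MIN_ENTRIES_TO_MARK_DIRTY):
    """Return 'open', 'verified' or 'dirty' for a list of entries (dicts)."""
    if len(entries) < min_crosscheck:
        return "open"  # not enough entries to run cross-checking yet
    # Assumed rule for this sketch: every entry must be identical.
    if all(entry == entries[0] for entry in entries):
        return "verified"
    if len(entries) >= min_dirty:
        return "dirty"  # stop serving the task, hand it to a moderator
    return "open"  # keep serving the task to collect more entries
```

With the defaults, two entries leave a task open, three identical entries verify it, and four entries that still disagree mark it as dirty.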
Configuring backend:
- Create a Model that will have a url field and annotate it with @document, specifying which tasks should be created when importing a document, e.g.:
from django.db import models
from django.utils.translation import gettext_lazy as _

from moonsheep.registry import document


@document(on_import_create=['opora.tasks.FindTableTask'])
class Report(models.Model):
    """
    The whole document to transcribe
    """
    # initial data
    url = models.URLField(verbose_name=_("report URL"), unique=True)

    # all other fields should have null=True set so such an object may be created just based on url.
    # later tasks will fill in other details
If several tasks should be created for one document, just list them:
@document(on_import_create=['kmonitor_ad.tasks.Section1PersonalData', 'kmonitor_ad.tasks.Section2Properties', 'kmonitor_ad.tasks.Section3Movables'])
Then import documents. Right now only the import from HTTP index listings via the command line is supported:
python manage.py moonsheep_import_http --host http://user@host/root dir1 dir2/file1
python manage.py moonsheep_import_http http://user@host/root/dir1
python manage.py moonsheep_import_http http://user@host/root/dir1 -f *.pdf --dry-run
Options:
- --host - specify the host for all files/dirs specified later
- -f - include files matching the pattern
- --dry-run - see which files will be imported without actually importing them
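The effect of --host and -f can be pictured with a short standard-library sketch. The helper name `resolve_targets` and its behaviour are assumptions made for illustration; the real command also crawls index listings, which is left out here.

```python
import posixpath
from fnmatch import fnmatchcase


def resolve_targets(paths, host=None, pattern=None):
    """Build full URLs from command-line paths and filter them by file name."""
    urls = []
    for path in paths:
        # With a host given, the paths are treated as relative to it.
        url = f"{host.rstrip('/')}/{path.lstrip('/')}" if host else path
        urls.append(url)
    if pattern is None:
        return urls
    # Match the pattern against the file name only, not the whole URL.
    return [u for u in urls if fnmatchcase(posixpath.basename(u), pattern)]


selected = resolve_targets(
    ["dir1/a.pdf", "dir1/notes.txt", "dir2/file1.pdf"],
    host="http://user@host/root",
    pattern="*.pdf",
)
```

On most shells it is safer to quote the pattern on the command line (-f '*.pdf'), so that the shell does not expand it against local files before the command sees it.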
Moonsheep user is a custom class substituting auth.User as explained here: https://docs.djangoproject.com/en/2.2/topics/auth/customizing/#substituting-a-custom-user-model
It uses email as a unique key and supports a range of authentication methods that can be configured by setting MOONSHEEP['USER_AUTHENTICATION'] to:
- nickname
  Auto-generated pseudonymous nicknames, so volunteers can follow their statistics without leaving an email or creating an account.
  To set up such authentication, add to your project's urls an entry dedicated to randomly choosing a nickname:
  from django.urls import path
  from moonsheep.views import ChooseNicknameView

  urlpatterns = [
      # name needs to be set to 'choose-nickname'
      path('get-a-nickname', ChooseNicknameView.as_view(template_name='nickname.html'), name='choose-nickname'),
  ]
- anonymous
  Users are created on the fly and automatically logged in, so we know which contributions come from whom, but we don't store any identifying information.
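As a rough picture of what pseudonymous nickname generation involves, the sketch below draws a random name and avoids collisions with names already taken. It is a standard-library stand-in with hypothetical word lists and a hypothetical `generate_nickname` helper; it does not show how Moonsheep or the Faker library produce names.

```python
import random

ADJECTIVES = ["brave", "calm", "swift", "witty"]  # hypothetical word list
ANIMALS = ["otter", "heron", "lynx", "sheep"]     # hypothetical word list


def generate_nickname(taken, rng=random):
    """Return a nickname not present in `taken`.

    On a collision a numeric suffix is appended, so the nickname stays
    unique without asking the volunteer for any personal data.
    """
    base = f"{rng.choice(ADJECTIVES)}-{rng.choice(ANIMALS)}"
    nickname, suffix = base, 1
    while nickname in taken:
        suffix += 1
        nickname = f"{base}-{suffix}"
    return nickname
```

Passing a seeded `random.Random` instance as `rng` makes the output reproducible, which is handy in tests.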
Some documents have parts that are hard to decipher. If multiple volunteers submit very different results, such a task can be passed to a moderator for manual verification.
To do that configure the following:
- In settings, set MIN_ENTRIES_TO_MARK_DIRTY (defaults to 4). If there are that many entries for a task and it still has not been cross-verified, the task is marked as DIRTY and sent to manual verification.
- In your transcription template, include the manual verification handling:
  {% include "support_manual_verification.html" %}
Dirty tasks will now be listed in the main dashboard. Opening one shows the regular task form with fields that received the same answers highlighted in green and fields that didn't highlighted in red. Each field marked red has a hover tooltip containing all the answers sent by users, so you as a moderator can choose the right one or type it in yourself.
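The green/red split described above boils down to a per-field comparison of the submitted entries. The sketch below shows one way to compute it; `compare_fields` and the sample data are hypothetical and use exact equality as the agreement rule, which may differ from Moonsheep's own verification logic.

```python
def compare_fields(entries):
    """Split form fields into agreed ('green') and conflicting ('red').

    `entries` is a list of dicts, one per volunteer submission.
    """
    agreed, conflicting = {}, {}
    field_names = sorted({name for entry in entries for name in entry})
    for name in field_names:
        answers = [entry.get(name) for entry in entries]
        if all(answer == answers[0] for answer in answers):
            agreed[name] = answers[0]
        else:
            # Keep every answer: this is what a tooltip would list.
            conflicting[name] = answers
    return agreed, conflicting


entries = [
    {"name": "Ann", "year": "1990"},
    {"name": "Ann", "year": "1996"},
    {"name": "Ann", "year": "1990"},
]
agreed, conflicting = compare_fields(entries)
```

Here `name` ends up in the agreed group, while `year` is conflicting and carries all three submitted answers for the moderator to choose from.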
There are a number of template tags that provide information on progress.
{% load moonsheep %}
<ul>
{% stats_documents_verified as docs %}
<li>{{ docs.total_progress|floatformat:2 }}% total progress</li>
<li>{{ docs.verified }} / {{ docs.total }} [{{ docs.verified_percent }}%] documents verified</li>
<li>{{ docs.remaining }} remaining to be verified</li>
</ul>
Moonsheep sends some messages to your users. To show them, include the following in your transcription (and possibly other) templates:
<link rel="stylesheet" href="{% static "css/moonsheep_style.css" %}">
</head>
{% if messages %}
<ul class="ms-messages">
{% for message in messages %}
<li{% if message.tags %} class="{{ message.tags }}"{% endif %}>{{ message|safe }}</li>
{% endfor %}
</ul>
{% endif %}
After each transcription it will show "Thank you! Are you ready for a next one? Take a pause?".
"Take a pause?" will be a link to the finish-transcription view that you need to define:
url(r'^$', TemplateView.as_view(template_name='homepage.html'), name='finish-transcription'),
url(r'^$', TemplateView.as_view(template_name='homepage.html'), name='home'),
Moonsheep supports several ways to export structured domain data. These include:
- JSON:API compliant API
- XLSX
- Frictionless Data (packed CSVs)
Export options are available in the Moonsheep admin on the campaign page and also via command line on the server.
Structured data export is available out of the box, but can be further refined.
You can define which objects should be returned for each model
by defining a custom queryset method exported().
The DocumentModel base class uses that mechanism to return by default only those objects
that have been fully transcribed. You can define it on any of your domain models
and it will be picked up by the export mechanism.
class DocumentQuerySet(models.QuerySet):
    def exported(self) -> models.QuerySet:
        return self.filter(progress=100)


class DocumentModel(models.Model):
    objects = DocumentQuerySet.as_manager()
You might have some metadata/control fields that you don't want to be exported
in structured data. To define them, add an inner class Exported
and define there either a full list of fields to export or the fields to be excluded.
class MyModel(models.Model):
    class Exported:
        # fields = ['first', 'second']  # to list all fields which should appear
        exclude = ['progress']  # or exclude a few
    # if Exported is not specified then by default all fields are exported

moonsheep.models.DocumentModel should be used as the default base class
for a model that represents a document being transcribed. It offers the following features:
- defines the url and progress model fields required by Moonsheep
- excludes the progress field from being exported
- limits exported objects to those fully transcribed (progress == 100)
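The fields/exclude rules of the inner Exported class can be sketched without Django as a small resolver. The helper `exported_field_names` and the plain classes below are hypothetical stand-ins; how Moonsheep actually reads the Exported class is not shown here, and giving `fields` precedence over `exclude` is an assumption.

```python
def exported_field_names(model_cls, all_fields):
    """Return the field names to export for a model-like class."""
    exported = getattr(model_cls, "Exported", None)
    if exported is None:
        return list(all_fields)  # no Exported class: export everything
    fields = getattr(exported, "fields", None)
    if fields is not None:
        # Explicit whitelist, kept in the model's own field order.
        return [name for name in all_fields if name in fields]
    exclude = getattr(exported, "exclude", [])
    return [name for name in all_fields if name not in exclude]


class Report:  # stand-in for a model with control fields
    class Exported:
        exclude = ["progress"]


class Person:  # stand-in for a model without an Exported class
    pass


report_fields = exported_field_names(Report, ["url", "title", "progress"])
person_fields = exported_field_names(Person, ["first", "second"])
```

With these stand-ins, `report_fields` drops `progress` and `person_fields` keeps every field.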
Implementing an API for your domain models is as simple as adding one url line:
from django.urls import include, path

from moonsheep.exporters.api import AppApi

urlpatterns = [
    path('api/opora/', include(AppApi('opora').urls)),
]
AppApi scans for all domain models defined in your app (passed in the param)
and generates a URL for each using Django REST Framework.
REST Framework gives you a nice HTML interface to play with requests, with
discovery features (listing of all endpoints), available here at /api/opora.
The XLSX exporter places each model in a separate sheet of an xlsx file.
It can be called from the command line:
python manage.py moonsheep_export [app_label] xlsx -o opora.xlsx
The Frictionless Data exporter places each model in a separate csv file
and packs all of them into a zip file according to the Frictionless Data specification.
It can be called from the command line:
python manage.py moonsheep_export [app_label] frictionless -o opora.zip
Exporters should extend the moonsheep.exporters.Exporter abstract class and implement the
def export(self, output: Union[io.IOBase, str], **options) method.
PandasExporter can be used as a base, as pandas already supports several output types.
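As an illustration of the export(output, **options) contract, here is a self-contained sketch that writes one CSV per table into a zip together with a deliberately minimal datapackage.json. The `Exporter` class below is a local stand-in that only mirrors the quoted method signature, and `CsvZipExporter` with its `tables` argument is hypothetical; a real exporter would subclass moonsheep.exporters.Exporter and read data from the domain models.

```python
import csv
import io
import json
import zipfile
from abc import ABC, abstractmethod
from typing import Union


class Exporter(ABC):
    """Local stand-in mirroring the signature above; not the Moonsheep class."""

    @abstractmethod
    def export(self, output: Union[io.IOBase, str], **options):
        ...


class CsvZipExporter(Exporter):
    """Hypothetical exporter: one CSV per table, packed into a zip."""

    def __init__(self, tables):
        # tables: {"model_name": [{"column": value, ...}, ...]}
        self.tables = tables

    def export(self, output: Union[io.IOBase, str], **options):
        resources = []
        # zipfile accepts both a path and a binary file-like object.
        with zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as archive:
            for name, rows in self.tables.items():
                buffer = io.StringIO()
                columns = list(rows[0]) if rows else []
                writer = csv.DictWriter(buffer, fieldnames=columns)
                writer.writeheader()
                writer.writerows(rows)
                archive.writestr(f"{name}.csv", buffer.getvalue())
                resources.append({"name": name, "path": f"{name}.csv"})
            descriptor = {"resources": resources}
            archive.writestr("datapackage.json", json.dumps(descriptor))


package = io.BytesIO()
CsvZipExporter(
    {"report": [{"url": "http://example.com/a.pdf", "progress": 100}]}
).export(package)
```

Accepting either a path or a file-like object, as the signature suggests, lets the same exporter serve both the command line (writing to a file) and a web view (writing to an in-memory buffer or a response).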
