TICCL-rank: 'filter out' unigram correction variants where a bigram to unigram CC is present. #26

@kosloot

@martinreynaert provided the following examples:

<mre> veroor_zaakt#1#veroorzaakt#100000002#1#0.815385
<mre> veroor_zaakt_door#1#veroorzaakt_door#100000001#1#1
<mre> veroor#1#verloor#100000024#1#0.998869

The last entry is undesirable.

<mre> veroor_zaakt#1#veroorzaakt#100000002#1#0.815385
<mre> veroor_zaakt_door#1#veroorzaakt_door#100000001#1#1
<mre> zaakt_door#1#zaak_voor#100000001#2#1
<mre> zaakt#1#nazakt#100000000#2#0.998757

The last entry is undesirable.

<mre> verlaa_ten#1#verlaaten#100000010#1#0.984416
<mre> verlaa#1#verlaan#100000000#1#0.998726

Likewise, the last entry is undesirable.

<mre> acobs_Nakomelingen#1#j_acobs_Nakomelingen#1#2#1
<mre> acobs#1#Jacobs#100000001#1#0.993398
<mre> j_acobs#1#Jacobs#100000001#1#0.977545

Here the second is undesirable.

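This last one also illustrates why filtering out is not that easy: the undesirable unigram entry (acobs) comes before the bigram entry (j_acobs) that makes it undesirable.
It would be handy if it were a sequential process, but unfortunately it is not.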

At the moment TICCL-rank processes its input and output in chunks, but we would have to change that and store all results so that the above cases can be filtered out afterwards.
That is a major change: it consumes more memory and is less easy to handle multi-threaded.
Some more investigation is needed.
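
To make the "store everything, filter afterwards" idea concrete, here is a minimal Python sketch. It is not the TICCL-rank implementation, only an illustration under these assumptions: each record is a `#`-separated line whose first field is the variant and whose third field is the correction candidate, `_` joins the parts of an n-gram, and the `<mre>` prefix from the examples above has been stripped. The function names `parse` and `filter_unigram_variants` are hypothetical.

```python
def parse(line):
    """Return (variant, candidate) from one '#'-separated record.

    Assumption: field 1 is the variant, field 3 the correction candidate.
    """
    variant, _freq, candidate, *_rest = line.split("#")
    return variant, candidate


def filter_unigram_variants(lines, sep="_"):
    """Drop unigram variants that are part of a bigram variant
    for which a bigram-to-unigram correction is present."""
    lines = list(lines)  # all results must be held in memory

    # Pass 1: collect the parts of every bigram variant whose
    # correction candidate is a unigram.
    covered = set()
    for line in lines:
        variant, candidate = parse(line)
        parts = variant.split(sep)
        if len(parts) == 2 and sep not in candidate:
            covered.update(parts)

    # Pass 2: keep everything except unigram variants found in pass 1.
    kept = []
    for line in lines:
        variant, _candidate = parse(line)
        if sep not in variant and variant in covered:
            continue
        kept.append(line)
    return kept


EXAMPLE = [
    "acobs_Nakomelingen#1#j_acobs_Nakomelingen#1#2#1",
    "acobs#1#Jacobs#100000001#1#0.993398",
    "j_acobs#1#Jacobs#100000001#1#0.977545",
]

if __name__ == "__main__":
    for record in filter_unigram_variants(EXAMPLE):
        print(record)
```

The two passes are the point: in the last example the `acobs` record precedes the `j_acobs` record that rules it out, so a single sequential pass over chunked output cannot decide whether to keep it.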
